This Week In Cheminformatics: Issue #021
Conformal prediction for molecular retrieval from MS2, how expensive your conformer generation workflows should be, and a long list of papers
Highlights
Conformer Generation Workflows for COSMO-RS Calculations: Are They All the Same?
Gomes et al. systematically evaluate whether expensive conformer generation is strictly necessary for COSMO-RS thermodynamic predictions. The authors tested four pipelines, ranging from RDKit ETKDG/MMFF94 to compute-heavy DFT and semiempirical GOAT workflows, across 926 experimental solubility data points. Surprisingly, for general high-throughput screening of typical drug-like molecules, the pipeline the authors call RDKit-Direct (ETKDGv3) performed comparably to the more complex pipelines, with mean absolute errors around 0.90 log units. The authors note, however, that for highly flexible molecules with more than five rotatable bonds, or for systems prone to solvent-dependent intramolecular hydrogen bonding, the choice of method changes the result drastically.
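One way to act on that takeaway is a simple routing rule that sends only the risky molecules to the expensive workflow. The sketch below is illustrative and is not from the paper: the `MoleculeInfo` class, the `choose_conformer_workflow` function, and the "cheap"/"expensive" labels are hypothetical names, and the descriptors are assumed to be computed elsewhere (for example with RDKit). Only the five-rotatable-bond cutoff and the hydrogen-bonding caveat come from the summary above.

```python
from dataclasses import dataclass

# Cutoff from the summary above: "more than five rotatable bonds".
MAX_ROTATABLE_BONDS_FOR_CHEAP = 5


@dataclass(frozen=True)
class MoleculeInfo:
    """Hypothetical per-molecule descriptors, assumed to be computed elsewhere."""
    name: str
    rotatable_bonds: int
    may_form_intramolecular_hbond: bool


def choose_conformer_workflow(mol: MoleculeInfo) -> str:
    """Return a hypothetical workflow label ("cheap" or "expensive")."""
    # Highly flexible molecules: method choice matters, so spend the compute.
    if mol.rotatable_bonds > MAX_ROTATABLE_BONDS_FOR_CHEAP:
        return "expensive"
    # Possible solvent-dependent intramolecular hydrogen bonding: same caution.
    if mol.may_form_intramolecular_hbond:
        return "expensive"
    # Typical drug-like molecule in a screening setting: a cheap pipeline may do.
    return "cheap"


if __name__ == "__main__":
    rigid = MoleculeInfo("rigid example", rotatable_bonds=2,
                         may_form_intramolecular_hbond=False)
    floppy = MoleculeInfo("flexible example", rotatable_bonds=9,
                          may_form_intramolecular_hbond=False)
    for mol in (rigid, floppy):
        print(mol.name, "->", choose_conformer_workflow(mol))
```

How to flag possible intramolecular hydrogen bonding is left open here on purpose; the summary does not say how the authors identify such systems.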
Reliable Molecular Retrieval from Mass Spectra Using Conformal Prediction
This paper by Rakhshaninejad et al. applies conformal prediction to candidate-based molecular retrieval from LC-MS/MS data. Usually, retrieval pipelines are evaluated using metrics like top-k accuracy, which summarize performance at the dataset level but fail to provide a spectrum-specific reliability statement. To address this, the authors construct prediction sets that contain the true molecule with a user-specified probability, effectively quantifying uncertainty for individual spectra. They evaluated marginal and conditional conformal prediction across in-distribution, partially shifted, and fully out-of-distribution scenarios on the MassSpecGym benchmark. As this framework operates directly on the output scores of a retrieval model without requiring any architectural modifications, it is a good read for anyone looking to implement uncertainty quantification in their annotation pipelines.
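To make the idea concrete, here is a minimal sketch of textbook split (marginal) conformal prediction applied to retrieval scores. It is not the authors' implementation and does not cover the conditional variant. The function names, the choice of the negated score as nonconformity measure, and the toy numbers are all assumptions for illustration. The only inputs are the scores a retrieval model already produces, plus a held-out calibration set where the true molecule is known.

```python
import math


def conformal_threshold(true_scores, alpha):
    """Score threshold from a calibration set (split conformal, marginal).

    true_scores: the model score given to the correct candidate for each
                 calibration spectrum (higher means a better match).
    alpha:       tolerated miss rate, e.g. 0.1 for a 90% coverage target.
    """
    n = len(true_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    # Nonconformity (assumed here): negated score, so larger means the
    # true molecule looked less plausible to the model.
    nonconformity = sorted(-s for s in true_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration spectra for this alpha: keep every candidate.
        return -math.inf
    # Convert the nonconformity quantile back to a threshold on the score.
    return -nonconformity[rank - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose score reaches the calibrated threshold."""
    return [name for name, score in candidate_scores.items()
            if score >= threshold]


if __name__ == "__main__":
    # Toy calibration scores of the true molecules (made-up numbers).
    calibration = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
    threshold = conformal_threshold(calibration, alpha=0.25)
    # Toy candidate list for one new spectrum (made-up numbers).
    candidates = {"cand_A": 0.55, "cand_B": 0.2, "cand_C": 0.05}
    print(threshold, prediction_set(candidates, threshold))
```

The size of the returned set is the per-spectrum reliability signal: a single candidate means the model is confident, while a long list means the spectrum is ambiguous. The coverage guarantee of this basic recipe is marginal and relies on calibration and test spectra being exchangeable, which is why the shifted and out-of-distribution scenarios in the paper are the interesting part.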
Long List
Cheminformatics
Algorithm-driven, phenotype-directed bioactive molecular discovery
strainedSMILES2xyz: a workflow for reliable 3D structures of strained molecules from SMILES
Breaking glycolysis: allosteric hotspots for multi-target drug repurposing
QUASAR: A Universal Autonomous System for Atomistic Simulation and a Benchmark of Its Capabilities
Molspectra: a general framework for multi-spectra prediction from molecular structures
Policy-Based Active Learning for Efficient Molecular Identification
A Spectral Matching Algorithm Based on the Wasserstein Metric
A systematic evaluation of protein allosteric site prediction tools with independent datasets
ULCYP: A Multitask Model for Predicting P450 Inducers Based on Positive-Unlabeled Learning
ConfDTI: Structure Confidence-Guided Multimodal Fusion for Drug–Target Interaction Prediction
Comprehensive annotation and analysis of human microproteins by human microprotein atlas platform
Functional Groups Are All You Need for Chemically Interpretable Molecular Property Prediction
Assessment of Alphafold Protein Models for Small-Molecule Ligand Docking versus Co-Folding
Cryo-Electron Microscopy Structural Ensemble Optimization Using Individual Particles
Novel Molecules Generation Using Graph Generative Adversarial Networks
MedChem
Other
Palate Cleanser
The Story of the Woodward–Hoffmann Rules: very cool!!!
Best,
Manas

























