This Week In Cheminformatics: Issue #007
Improving reaction coverage in USPTO, a BIG crystal structure data release from FAIR, and a long list of papers
Highlights
Data augmentation in a triple transformer loop retrosynthesis model
Retrosynthesis models trained on the USPTO dataset are notoriously weak on rare chemistry because of the dataset's power-law distribution, in which a handful of common reaction types dominate rarer but essential transformations. To break this dependence on biased data, Grandjean et al. introduce a “triple transformer loop” architecture to generate and validate a template-equilibrated dataset of 27.5 million fictive reactions. The process begins by extracting over 14,000 reaction templates from the USPTO and applying them to existing molecules to generate hypothetical starting materials. To ensure these “fictive” reactions are chemically plausible, the authors employ a multi-step validation loop: transformer T2 predicts the necessary reagents, while a specialized transformer T3* validates the forward transformation, requiring a confidence score above 95%. The resulting dataset significantly broadens the training distribution, increasing the representation of underused elements like lithium, magnesium, and tin. Models trained on this equilibrated subset outperform those trained on raw USPTO data in template-averaged round-trip accuracy, showing that balanced synthetic diversity can be more valuable than imbalanced historical records. Great read!
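The filtering step above can be sketched in a few lines. This is a minimal, hypothetical illustration of the validate-then-keep loop, not the authors' code: the function names, the stand-in scoring, and the example SMILES are all placeholders; only the >95% confidence threshold and the T2/T3* roles come from the paper summary.

```python
# Hypothetical sketch of the triple-transformer validation loop:
# T2 proposes reagents for a fictive reaction, T3* scores the forward
# transformation, and only high-confidence reactions are kept.
CONFIDENCE_THRESHOLD = 0.95  # threshold stated in the paper summary


def predict_reagents(reactants, product):
    """Stand-in for transformer T2: propose reagents for a fictive reaction."""
    return ["[Pd]"]  # placeholder output, not a real prediction


def forward_confidence(reactants, reagents, product):
    """Stand-in for transformer T3*: confidence that reactants + reagents -> product."""
    # A real model would return a softmax-derived score; this fake always passes.
    return 0.97 if product else 0.0


def validate_fictive_reactions(candidates):
    """Keep only fictive (reactants, product) pairs that clear the threshold."""
    validated = []
    for reactants, product in candidates:
        reagents = predict_reagents(reactants, product)
        if forward_confidence(reactants, reagents, product) > CONFIDENCE_THRESHOLD:
            validated.append((reactants, reagents, product))
    return validated


# Example: one template-generated candidate (illustrative SMILES only)
candidates = [(["CCO", "CC(=O)Cl"], "CC(=O)OCC")]
kept = validate_fictive_reactions(candidates)
```

In the real pipeline this filter runs over template-generated candidates at scale, which is what equilibrates the 27.5 million-reaction dataset toward rare templates.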
Open Molecular Crystals 2025 (OMC25) dataset and models
The Open Molecular Crystals 2025 dataset from FAIR is a release of labeled, high-fidelity data for molecular crystal structures. With over 27 million structures derived from 230,000 dispersion-corrected DFT relaxation trajectories, it is orders of magnitude larger than previous open-source efforts. Covering 12 elements and up to 300 atoms per unit cell, OMC25 captures 167 distinct space groups across all seven crystal systems. The authors employed a sampling strategy that includes both loosely packed initial configurations and densely packed structures optimized via Rigid Press; this broad coverage of the potential energy surface should be very useful for training robust machine learning interatomic potentials (MLIPs). Validation experiments show that MLIPs trained on OMC25, such as EquiformerV2 and UMA, achieve impressive accuracy on established benchmarks like X23b and Schrödinger polymorph ranking, offering a middle ground between classical force fields and ab initio calculations. Dataset and model checkpoints are released under a CC BY 4.0 license. Go check it out here.
Long List
Cheminformatics
ChemBERTa-3: an open source training framework for chemical foundation models
MMRCL: An interpretable multi-modal deep learning framework for predicting hERG blockers
From virtual screening to bench: A dual-validation framework for drug repurposing against PI3K
DFDD: A Cloud-Ready Tool for Distance-Guided Fully Dynamic Docking in Host–Guest Complexation
More Accurate Binding Affinity Prediction Using Protein Homology and Ligand-Based Transfer Learning
Blind Challenges Let Us See the Path Forward for Predictive Models
Prediction of Charged Small Molecule Conformations in Solution Using a Balanced ML/MM Potential
Constant-pH Molecular Dynamics of Cationic Peptide Dendrimers Binding to siRNA
Mapping Still Matters: Coarse-Graining with Machine Learning Potentials
MetaStab-Analyzer: Classification and Regression Models for Metabolic Stability Prediction
RetNeXt: A Pretrained Model for Transfer Learning Across the MOF Adsorption Space
Transforming MOF Modeling with Machine-Learned Potentials: Progress and Perspectives
Toward More Trustworthy QSAR: A Systematic Discussion on Data Set Partitioning
Variational Bayesian Multi-Kernel Adaptive Deep Fusion for Microbe-Related Drug Prediction
Other
Single Molecule Force Spectroscopy to Probe Intermediates and Energetics of Membrane Protein Folding
PM2.5 Is a Toxic Mixture: Not Just a Matter of Concentration
BiCLUM: Bilateral contrastive learning for unpaired single-cell multi-omics integration
Palate Cleanser
Lessons Learned from the OpenADMET - ExpansionRx Blind Challenge
good luck this week,
Manas