This Week In Cheminformatics: Issue #007
Improving reaction coverage in USPTO, a BIG crystal structure data release from FAIR, and a long list of papers
Highlights
Data augmentation in a triple transformer loop retrosynthesis model
Retrosynthesis models trained on the USPTO dataset are notoriously weak on rare chemistry because of the dataset's power-law distribution, in which a handful of common reaction types dominate rarer but essential transformations. To break this dependence on biased data, Grandjean et al. introduce a “triple transformer loop” architecture to generate and validate a template-equilibrated dataset of 27.5 million fictive reactions. The process begins by extracting over 14,000 reaction templates from the USPTO and applying them to existing molecules to generate hypothetical starting materials. To ensure these “fictive” reactions are chemically plausible, the authors employ a multi-step validation loop: transformer T2 predicts the necessary reagents, while a specialized transformer T3* validates the forward transformation, requiring a confidence score above 95%. The resulting dataset significantly broadens the training distribution, increasing the representation of underused elements like lithium, magnesium, and tin. Models trained on this equilibrated subset outperform those trained on raw USPTO data in template-averaged round-trip accuracy, showing that balanced synthetic diversity can be more valuable than imbalanced historical records. Great read!
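The filtering step above can be sketched in a few lines. This is a minimal, hypothetical illustration of the validate-then-keep loop, not the authors' code: the function names, the stand-in scoring, and the example SMILES are all placeholders; only the >95% confidence threshold and the T2/T3* roles come from the paper summary.

```python
# Hypothetical sketch of the triple-transformer validation loop:
# T2 proposes reagents for a fictive reaction, T3* scores the forward
# transformation, and only high-confidence reactions are kept.
CONFIDENCE_THRESHOLD = 0.95  # threshold stated in the paper summary


def predict_reagents(reactants, product):
    """Stand-in for transformer T2: propose reagents for a fictive reaction."""
    return ["[Pd]"]  # placeholder output, not a real prediction


def forward_confidence(reactants, reagents, product):
    """Stand-in for transformer T3*: confidence that reactants + reagents -> product."""
    # A real model would return a softmax-derived score; this fake always passes.
    return 0.97 if product else 0.0


def validate_fictive_reactions(candidates):
    """Keep only fictive (reactants, product) pairs that clear the threshold."""
    validated = []
    for reactants, product in candidates:
        reagents = predict_reagents(reactants, product)
        if forward_confidence(reactants, reagents, product) > CONFIDENCE_THRESHOLD:
            validated.append((reactants, reagents, product))
    return validated


# Example: one template-generated candidate (illustrative SMILES only)
candidates = [(["CCO", "CC(=O)Cl"], "CC(=O)OCC")]
kept = validate_fictive_reactions(candidates)
```

In the real pipeline this filter runs over template-generated candidates at scale, which is what equilibrates the 27.5 million-reaction dataset toward rare templates.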
Open Molecular Crystals 2025 (OMC25) dataset and models
The Open Molecular Crystals 2025 dataset from FAIR is a release of labeled, high-fidelity data for molecular crystal structures. With over 27 million structures derived from 230,000 dispersion-corrected DFT relaxation trajectories, it is orders of magnitude larger than previous open-source efforts. Covering 12 elements and up to 300 atoms per unit cell, OMC25 captures 167 distinct space groups across all seven crystal systems. The authors employed a sampling strategy that includes both loosely packed initial configurations and densely packed structures optimized via Rigid Press; this broad coverage of the potential energy surface should be very useful for training robust machine learning interatomic potentials (MLIPs). Validation experiments show that MLIPs trained on OMC25, such as EquiformerV2 and UMA, achieve impressive accuracy on established benchmarks like X23b and Schrödinger polymorph ranking, offering a middle ground between classical force fields and ab initio calculations. Dataset and model checkpoints are released under a CC BY 4.0 license. Go check it out here.
Long List
Cheminformatics
ChemBERTa-3: an open source training framework for chemical foundation models
MMRCL: An interpretable multi-modal deep learning framework for predicting hERG blockers
From virtual screening to bench: A dual-validation framework for drug repurposing against PI3K
DFDD: A Cloud-Ready Tool for Distance-Guided Fully Dynamic Docking in Host–Guest Complexation
More Accurate Binding Affinity Prediction Using Protein Homology and Ligand-Based Transfer Learning
Blind Challenges Let Us See the Path Forward for Predictive Models
Prediction of Charged Small Molecule Conformations in Solution Using a Balanced ML/MM Potential
Constant-pH Molecular Dynamics of Cationic Peptide Dendrimers Binding to siRNA
Mapping Still Matters: Coarse-Graining with Machine Learning Potentials
MetaStab-Analyzer: Classification and Regression Models for Metabolic Stability Prediction
RetNeXt: A Pretrained Model for Transfer Learning Across the MOF Adsorption Space
Transforming MOF Modeling with Machine-Learned Potentials: Progress and Perspectives
Toward More Trustworthy QSAR: A Systematic Discussion on Data Set Partitioning
Variational Bayesian Multi-Kernel Adaptive Deep Fusion for Microbe-Related Drug Prediction
Other
Single Molecule Force Spectroscopy to Probe Intermediates and Energetics of Membrane Protein Folding
PM2.5 Is a Toxic Mixture: Not Just a Matter of Concentration
BiCLUM: Bilateral contrastive learning for unpaired single-cell multi-omics integration
Palate Cleanser
Lessons Learned from the OpenADMET - ExpansionRx Blind Challenge
good luck this week,
Manas