This Week In Cheminformatics: Issue #020

The last mile problem, visualizing over a billion molecules, how to train a shallow ensemble and a long list of papers

May 10, 2026

Highlights

The Last Mile Problem: A Critical Assessment of Physics-Based and AI Tools for Small Molecule Binding Prediction in Virtual Screening

Wang et al. provide a much-needed, large-scale critical assessment of both physics-based and ML models for the “last mile” rescoring problem in virtual screening. The authors benchmarked alchemical absolute binding free energy calculations, end-state methods like MM/PBSA and MM/GBSA, and seven different ML tools against PDBbind for quantitative accuracy and DUD-E for active/decoy discrimination. What stands out to me is how starkly they expose the domain applicability gap in standard ML predictors: while tools like KDEEP and OnionNet-2 performed well on PDBbind, most likely due to training set overlap, they effectively failed to enrich true actives over decoys in the DUD-E set. In contrast, the more recent Boltz-2 model showed excellent generalization, achieving an 81% true positive rate on DUD-E that rivals physics-based methods at a fraction of the computational cost.
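For readers less familiar with active/decoy benchmarks, here is a minimal sketch of two common ways to score discrimination: a true positive rate at a score cutoff and an enrichment factor in the top-ranked fraction. The function names, the toy scores, and the cutoff are all made up for illustration; the paper's exact metric definitions and thresholds are not reproduced here.

```python
import numpy as np

def true_positive_rate(scores, is_active, threshold):
    """Fraction of known actives scoring at or above the threshold."""
    scores = np.asarray(scores, dtype=float)
    is_active = np.asarray(is_active, dtype=bool)
    return float(np.mean(scores[is_active] >= threshold))

def enrichment_factor(scores, is_active, top_fraction=0.01):
    """Hit rate among the top-scoring fraction divided by the overall hit rate."""
    scores = np.asarray(scores, dtype=float)
    is_active = np.asarray(is_active, dtype=bool)
    n_top = max(1, int(round(top_fraction * len(scores))))
    # Higher score = predicted stronger binder (an assumption of this sketch)
    top_idx = np.argsort(-scores, kind="stable")[:n_top]
    return float(is_active[top_idx].mean() / is_active.mean())

# Toy data, invented for illustration: 4 actives and 16 decoys
active_scores = np.array([9.1, 8.7, 7.9, 2.2])
decoy_scores = np.linspace(0.0, 7.5, 16)
scores = np.concatenate([active_scores, decoy_scores])
labels = np.concatenate([np.ones(4, dtype=bool), np.zeros(16, dtype=bool)])

tpr = true_positive_rate(scores, labels, threshold=7.0)
ef20 = enrichment_factor(scores, labels, top_fraction=0.2)
print(f"TPR at cutoff 7.0: {tpr:.2f}, EF at top 20%: {ef20:.2f}")
```

A model that merely memorized its training set can look strong on a regression benchmark and still land near an enrichment factor of 1 here, which is what random ranking would give.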

Nested TMAPs to Visualize Billions of Molecules

This Reymond lab paper visualizes the 9.6-billion-molecule REAL database without relying on massive compute clusters. They represent molecules as 42-dimensional Molecular Quantum Number (MQN) count vectors, compress them to 6-byte codes using Product Quantization, and partition the space via GPU-accelerated PQk-Means. The project builds a primary TMAP of about 92,000 cluster representatives based on MQN similarity to map macro-level trends, which then branches into secondary TMAPs organized by ECFP4 similarity for fine-grained substructure exploration within each cluster. Pretty cool!
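To make the compression step concrete, here is a rough NumPy sketch of Product Quantization on 42-dimensional count vectors. It assumes the 6-byte code comes from splitting each vector into 6 sub-vectors of 7 dimensions and storing one 8-bit centroid index per sub-vector; that split is my guess, not something confirmed by the paper. The random Poisson vectors stand in for real MQN descriptors, and the tiny k-means is a plain CPU implementation rather than the GPU-accelerated PQk-Means the authors use.

```python
import numpy as np

def kmeans(x, k, n_iter=10, seed=0):
    """Plain Lloyd's k-means; returns a (k, dim) array of centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:  # leave empty clusters where they are
                centroids[j] = members.mean(axis=0)
    return centroids

def train_pq(x, n_sub=6, k=256):
    """Learn one codebook of k centroids per sub-vector."""
    sub_dim = x.shape[1] // n_sub
    return [kmeans(x[:, i * sub_dim:(i + 1) * sub_dim], k, seed=i)
            for i in range(n_sub)]

def encode_pq(x, codebooks):
    """Replace each sub-vector with the uint8 index of its nearest centroid."""
    sub_dim = codebooks[0].shape[1]
    codes = np.empty((len(x), len(codebooks)), dtype=np.uint8)
    for i, cb in enumerate(codebooks):
        sub = x[:, i * sub_dim:(i + 1) * sub_dim]
        dist = ((sub[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2)
        codes[:, i] = dist.argmin(axis=1)
    return codes

def decode_pq(codes, codebooks):
    """Approximate reconstruction by concatenating the selected centroids."""
    return np.hstack([cb[codes[:, i]] for i, cb in enumerate(codebooks)])

# Synthetic stand-in for MQN count vectors (not real molecules)
rng = np.random.default_rng(42)
mqn_like = rng.poisson(lam=5.0, size=(2000, 42)).astype(np.float32)

codebooks = train_pq(mqn_like)
codes = encode_pq(mqn_like, codebooks)
approx = decode_pq(codes, codebooks)
print(codes.shape, codes.dtype, codes.nbytes // len(codes), "bytes per molecule")
```

At 6 bytes per molecule, 9.6 billion codes come to roughly 58 GB, which helps explain how this fits on a single machine.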

How to Train a Shallow Ensemble

Schäfer et al. tackle the computational overhead of uncertainty quantification in machine-learning interatomic potentials by optimizing training strategies for shallow ensembles. They demonstrate that while last-layer approximations and energy-only negative log-likelihood (NLL) objectives are computationally efficient, they often yield severely miscalibrated, element-specific force uncertainties. To achieve robust calibration without the prohibitive cost of computing full force-feature Jacobians during a from-scratch training run, the authors propose a practical protocol: initialize an ensemble via Last Layer Prediction Rigidity or an energy-only probabilistic loss, then fine-tune the full model with a joint energy-force NLL objective. This short fine-tuning step allows the backbone representation to properly separate high-error configurations in the latent space, effectively matching the calibration quality of a full training run while reducing training times by up to 96% for datasets containing large structures.
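As a rough illustration of what a joint energy-force NLL could look like, the sketch below treats the ensemble mean and variance as the parameters of a Gaussian and scores the reference energy and forces against them. The function names, the `force_weight` parameter, and the per-component averaging of the force term are my assumptions, not the authors' formulation. It only evaluates the loss with NumPy; actual fine-tuning would need an autodiff framework to backpropagate through the model.

```python
import numpy as np

def gaussian_nll(target, mean, var, eps=1e-8):
    """Elementwise Gaussian negative log-likelihood."""
    var = np.maximum(var, eps)  # guard against collapsed ensembles
    return 0.5 * (np.log(2.0 * np.pi * var) + (target - mean) ** 2 / var)

def joint_energy_force_nll(e_members, f_members, e_ref, f_ref, force_weight=1.0):
    """Energy NLL plus weighted mean force NLL for one structure.

    e_members: (M,) energies predicted by the M ensemble members
    f_members: (M, n_atoms, 3) forces predicted by the M ensemble members
    """
    e_mean, e_var = e_members.mean(axis=0), e_members.var(axis=0, ddof=1)
    f_mean, f_var = f_members.mean(axis=0), f_members.var(axis=0, ddof=1)
    nll_energy = gaussian_nll(e_ref, e_mean, e_var)
    nll_force = gaussian_nll(f_ref, f_mean, f_var).mean()
    return float(nll_energy + force_weight * nll_force)

# Toy two-member ensemble on a one-atom "structure" (values invented)
e_members = np.array([1.0, 3.0])                   # mean 2, variance 2
f_members = np.array([[[0.0, 0.0, 0.0]],
                      [[2.0, 2.0, 2.0]]])          # mean 1, variance 2
loss_good = joint_energy_force_nll(e_members, f_members, 2.0, np.ones((1, 3)))
loss_bad = joint_energy_force_nll(e_members, f_members, 2.0, 5.0 * np.ones((1, 3)))
print(f"reference at ensemble mean: {loss_good:.3f}, reference far off: {loss_bad:.3f}")
```

The second call shows why the force term matters: with the same predicted spread, a reference force far from the ensemble mean is penalized heavily, which pushes the model to widen its uncertainty on high-error configurations.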
