This Week In Cheminformatics: Issue #020
The last mile problem, visualizing over a billion molecules, how to train a shallow ensemble and a long list of papers
Highlights
The Last Mile Problem: A Critical Assessment of Physics-Based and AI Tools for Small Molecule Binding Prediction in Virtual Screening
Wang et al. provide a much-needed, large-scale critical assessment of both physics-based and ML models for the “last mile” rescoring problem in virtual screening. The authors benchmarked alchemical absolute binding free energy calculations, end-state methods like MM/PBSA and MM/GBSA, and seven different ML tools against PDBbind for quantitative accuracy and DUD-E for active/decoy discrimination. What stands out to me is how starkly they expose the domain applicability gap in standard ML predictors: tools like KDEEP and OnionNet-2 performed well on PDBbind, most likely due to training set overlap, but effectively failed to enrich true actives over decoys in the DUD-E set. In contrast, the more recent Boltz-2 model showed excellent generalization, achieving an 81% true positive rate on DUD-E that rivals physics-based methods at a fraction of the computational cost.
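For readers who have not computed these numbers themselves, here is a minimal sketch of the two quantities that active/decoy benchmarks usually report: the true positive rate at a score cutoff and the enrichment factor in the top-ranked slice. The function names, the toy scores, and the cutoff are my own assumptions for illustration, not the paper's exact protocol.

```python
import numpy as np

def true_positive_rate(scores, is_active, threshold):
    """Fraction of known actives scored at or above the threshold."""
    scores = np.asarray(scores, dtype=float)
    is_active = np.asarray(is_active, dtype=bool)
    predicted_active = scores >= threshold
    return float((predicted_active & is_active).sum() / is_active.sum())

def enrichment_factor(scores, is_active, top_fraction=0.01):
    """Hit rate in the top-ranked fraction divided by the overall hit rate."""
    scores = np.asarray(scores, dtype=float)
    is_active = np.asarray(is_active, dtype=bool)
    n_top = max(1, int(round(top_fraction * len(scores))))
    # Sort descending: a higher score is assumed to mean a stronger predicted binder.
    top_idx = np.argsort(-scores, kind="stable")[:n_top]
    return float(is_active[top_idx].mean() / is_active.mean())

# Toy data (made up): 2 actives hidden among 8 decoys.
scores = [9.1, 3.2, 8.7, 2.5, 4.0, 1.1, 3.9, 2.2, 0.5, 1.8]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

tpr = true_positive_rate(scores, labels, threshold=5.0)
ef20 = enrichment_factor(scores, labels, top_fraction=0.2)
print(f"TPR = {tpr:.2f}, EF(20%) = {ef20:.1f}")
```

A model that memorized PDBbind-like complexes can post a good correlation there and still return an enrichment factor near 1 on a decoy set, which is the failure mode described above.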
Nested TMAPs to Visualize Billions of Molecules
This Reymond lab paper achieves visualization of the 9.6-billion-molecule REAL database without relying on massive compute clusters. They represent molecules using 42-dimensional Molecular Quantum Number (MQN) count vectors, compress them to 6-byte codes using Product Quantization, and partition the space via GPU-accelerated PQk-Means. The project builds a primary TMAP of about 92,000 cluster representatives based on MQN similarity to map macro-level trends, which then branches into secondary TMAPs organized by ECFP4 similarity for fine-grained substructure exploration within each cluster. Pretty cool!
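To make the compression step concrete, here is a rough NumPy sketch of Product Quantization on 42-dimensional count vectors. I am assuming the 6-byte code comes from splitting each vector into 6 sub-vectors of 7 dimensions with a 256-entry codebook per sub-vector (one byte each); the paper may configure this differently. The random Poisson data stands in for real MQN vectors, and the plain k-means loop is only illustrative, not the GPU-accelerated PQk-Means used in the paper.

```python
import numpy as np

def fit_codebooks(X, n_subspaces=6, n_centroids=256, n_iter=5, seed=0):
    """Run a small k-means independently in each block of dimensions."""
    rng = np.random.default_rng(seed)
    sub_dim = X.shape[1] // n_subspaces
    codebooks = []
    for m in range(n_subspaces):
        sub = X[:, m * sub_dim:(m + 1) * sub_dim].astype(np.float64)
        # Initialize centroids from randomly chosen data points.
        centroids = sub[rng.choice(len(sub), n_centroids, replace=False)].copy()
        for _ in range(n_iter):
            dists = ((sub[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
            assign = dists.argmin(axis=1)
            for k in range(n_centroids):
                members = sub[assign == k]
                if len(members):  # keep the old centroid if its cluster is empty
                    centroids[k] = members.mean(axis=0)
        codebooks.append(centroids)
    return np.stack(codebooks)  # shape: (n_subspaces, n_centroids, sub_dim)

def encode(X, codebooks):
    """Replace each sub-vector with the index of its nearest centroid (one byte)."""
    n_subspaces, _, sub_dim = codebooks.shape
    codes = np.empty((len(X), n_subspaces), dtype=np.uint8)
    for m in range(n_subspaces):
        sub = X[:, m * sub_dim:(m + 1) * sub_dim].astype(np.float64)
        dists = ((sub[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(-1)
        codes[:, m] = dists.argmin(axis=1)
    return codes

def decode(codes, codebooks):
    """Approximate the original vectors by concatenating the chosen centroids."""
    parts = [codebooks[m][codes[:, m]] for m in range(codebooks.shape[0])]
    return np.concatenate(parts, axis=1)

# Stand-in data: 1000 fake 42-dimensional count vectors.
rng = np.random.default_rng(0)
X = rng.poisson(5.0, size=(1000, 42))

codebooks = fit_codebooks(X)
codes = encode(X, codebooks)      # 6 bytes per molecule
X_hat = decode(codes, codebooks)  # lossy reconstruction
print(codes.shape, codes.dtype, codes.nbytes // len(X), "bytes per vector")
```

Under these assumptions, 9.6 billion molecules at 6 bytes each come to roughly 58 GB of codes, which is why the clustering can run without a large cluster.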
How to Train a Shallow Ensemble
Schäfer et al. tackle the computational overhead of uncertainty quantification in machine-learning interatomic potentials by optimizing training strategies for shallow ensembles. They demonstrate that while last-layer approximations and energy-only negative log-likelihood (NLL) objectives are computationally efficient, they often yield severely miscalibrated, element-specific force uncertainties. To achieve robust calibration without the prohibitive cost of computing full force-feature Jacobians during a from-scratch training run, the authors propose a practical protocol: initialize an ensemble via Last Layer Prediction Rigidity or an energy-only probabilistic loss, then fine-tune the full model with a joint energy-force NLL objective. This short fine-tuning step allows the backbone representation to properly separate high-error configurations in the latent space, effectively matching the calibration quality of a full training run while reducing training times by up to 96% for datasets containing large structures.
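As a rough illustration of what a joint energy-force NLL looks like, the sketch below treats the spread of the ensemble members as the predictive variance and scores the ensemble mean with a Gaussian negative log-likelihood. The function names, the `force_weight` parameter, and the toy arrays are my assumptions; the paper's actual loss and weighting may differ, and a real implementation would live in an autodiff framework rather than NumPy.

```python
import numpy as np

def gaussian_nll(member_preds, target, eps=1e-8):
    """Mean Gaussian NLL from ensemble mean and variance (members on axis 0)."""
    mu = member_preds.mean(axis=0)
    var = member_preds.var(axis=0, ddof=1) + eps  # eps avoids division by zero
    return float(np.mean(0.5 * (np.log(2 * np.pi * var) + (target - mu) ** 2 / var)))

def joint_energy_force_nll(e_members, e_ref, f_members, f_ref, force_weight=1.0):
    """Energy NLL plus a weighted force NLL (the weight is a hypothetical knob)."""
    return gaussian_nll(e_members, e_ref) + force_weight * gaussian_nll(f_members, f_ref)

# Toy data (made up): 8 ensemble members, 16 structures, 4 atoms each.
rng = np.random.default_rng(0)
n_members, n_structs, n_atoms = 8, 16, 4
e_ref = rng.normal(size=n_structs)
f_ref = rng.normal(size=(n_structs, n_atoms, 3))
e_members = e_ref + 0.1 * rng.normal(size=(n_members, n_structs))
f_members = f_ref + 0.1 * rng.normal(size=(n_members, n_structs, n_atoms, 3))

loss = joint_energy_force_nll(e_members, e_ref, f_members, f_ref)

# An ensemble that is confidently wrong (same spread, shifted mean) is penalized.
nll_ok = gaussian_nll(e_members, e_ref)
nll_biased = gaussian_nll(e_members + 1.0, e_ref)
print(f"joint loss = {loss:.3f}, energy NLL = {nll_ok:.3f}, biased = {nll_biased:.3f}")
```

The NLL rewards members that disagree where the mean is wrong and agree where it is right, which is the calibration behavior the fine-tuning step is meant to recover for forces.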
Long List
Cheminformatics
AiiDA-TrainsPot: towards automated training of neural-network interatomic potentials
CReM-dock: de novo design of synthetically feasible structures guided by molecular docking
Data Management and Analysis of Metal–Organic Framework Synthesis Using Data Models
Enabling Automatic Generation of Protein–Ligand Complex Data Sets with Atomistic Detail
Scaffold-based evaluation metrics for fair comparison of molecular generators
MonicaMD: Molecules and Internal Cluster Analysis of Molecular Dynamics Simulations
Drug-Induced Liver Injury: Mitochondrial Mechanisms, Biomarkers, and Emerging Therapeutic Strategies
BERT-T6: Toward High-Accuracy T6SS Bacterial Toxin Identification Using a Protein Language Model
CRISP: Enhancing ASE Workflows With Advanced Molecular Simulation Post‐Processing
MegaPX: fast and space-efficient peptide assignment method using IBF-based multi-indexing
PSMa: Learning Protein Surface Representations with Physicochemical Masked Autoencoders
A Deep Learning Approach for OER Electrocatalyst Kinetics Prediction
Ensemble Machine Learning for Interpretable Prediction of Acute
A network medicine framework for multi-modal data integration in therapeutic target discovery
QuantumPDB: A Workflow for High-Throughput Quantum Cluster Model Generation from Protein Structures
Enumeration of Autocatalytic Subsystems in Large Chemical Reaction Networks
Framework for evaluating explainable AI in antimicrobial drug discovery
MedChem
Structure-Based Search for Novel Creatine Transporter Inhibitors
Integrating QSAR-Machine Learning, Biochemical Assays, and Molecular Dynamics for the Discovery of JAK2 Inhibitors in Cervical Cancer
Other
Palate Cleanser
Ok I guess !?
Manas









































