This Week In Cheminformatics: Issue #006
Binding Site Vectors to Compare Active Sites, ArtiDock, Benchmarking Boltz-2, and a long list of papers from the last week of January
Highlights
Assessing Boltz-2 Performance for the Binding Classification of Docking Hits
This is yet another paper on the “Clever Hans” effect. While the Boltz-2 model achieves impressive hit enrichment on the difficult ULVSH dataset, outperforming physics-based methods that typically fail there, it appears to do so for the wrong reasons. The authors demonstrate that affinity predictions are completely decoupled from pose quality (surprise, surprise!), with the model successfully classifying binders even when the predicted structure is incorrect. On adversarial tests, Boltz-2 often continues to predict high affinity even when the binding site is destroyed by mutations or, in some cases, when the target sequence is swapped entirely. This implies Boltz-2 is acting less like a “true” cofolding tool (whatever that means) and more like a sophisticated ligand-based model, likely memorizing 3D pharmacophoric shapes from its training data while largely ignoring the physical reality of the protein pocket. Good read!
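To make the adversarial test concrete, here is a minimal, model-agnostic sketch of how one might quantify it. It does not call Boltz-2 or reproduce the authors' protocol; the function name `retained_binder_fraction`, the 0.5 threshold, and all the numbers are hypothetical. It assumes you already have a predicted binder probability for each ligand against the wild-type target and against a pocket-destroying mutant.

```python
def retained_binder_fraction(wild_type_probs, mutant_probs, threshold=0.5):
    """Fraction of wild-type predicted binders still predicted as binders on the mutant.

    A value near 1.0 suggests the predictions barely depend on the pocket.
    """
    if len(wild_type_probs) != len(mutant_probs):
        raise ValueError("need one wild-type and one mutant score per ligand")
    # Indices of ligands called binders against the intact (wild-type) pocket.
    binders = [i for i, p in enumerate(wild_type_probs) if p >= threshold]
    if not binders:
        return 0.0
    # Count how many of those are still called binders once the pocket is destroyed.
    retained = sum(1 for i in binders if mutant_probs[i] >= threshold)
    return retained / len(binders)


# Hypothetical predicted binder probabilities for four ligands.
wild_type = [0.9, 0.8, 0.7, 0.2]
mutant = [0.85, 0.75, 0.3, 0.1]
fraction = retained_binder_fraction(wild_type, mutant)
```

With these made-up scores, two of the three wild-type binders survive the mutation. A structure-aware model should drive this fraction toward zero when the pocket is truly destroyed.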
Binding Site Vectors Enable Mapping of Cytochrome P450 Functional Landscapes
Kuvek et al. introduced a clever solution for comparing active sites that share function but vary in sequence. Instead of traditional grids or surface probes, the authors use “binding site vectors” that radiate from a central anchor (the heme iron) to the pocket surface, encoding both topology and local electrostatics in a multidimensional numeric format. This allows complex 3D binding environments to be compared rapidly using simple RMSD calculations. The method also works on dynamic conformational ensembles: while standard backbone alignments grouped enzymes merely by lineage, the vector analysis of molecular dynamics trajectories successfully clustered diverse enzymes such as CYP3A4 and CYP1A2 based on their actual substrate overlap. The work also proposes “ligand vectors” for binary docking. By inverting the method to map a ligand’s surface, one can quickly check whether a ligand “fits” a specific protein conformation without computationally expensive energy evaluations, offering a novel approach to high-throughput pharmacophore modeling.
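The RMSD comparison is simple enough to sketch in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: it assumes each pocket has already been reduced to a fixed-length list of numbers (here, hypothetical anchor-to-surface distances in Å along the same set of directions), and the function name `vector_rmsd` is made up.

```python
import math


def vector_rmsd(site_a, site_b):
    """Root-mean-square deviation between two equal-length binding site vectors."""
    if len(site_a) != len(site_b):
        raise ValueError("binding site vectors must have the same length")
    if not site_a:
        raise ValueError("binding site vectors must not be empty")
    # Element-wise squared differences, averaged, then square-rooted.
    squared = [(a - b) ** 2 for a, b in zip(site_a, site_b)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical distances (Å) from the anchor to the pocket surface
# along the same four directions in three different pockets.
site_1 = [5.0, 6.0, 7.0, 8.0]
site_2 = [5.0, 6.0, 7.0, 8.0]
site_3 = [7.0, 8.0, 9.0, 10.0]

same_shape = vector_rmsd(site_1, site_2)    # identical pockets
wider_pocket = vector_rmsd(site_1, site_3)  # every ray is 2 Å longer
```

Because the comparison is a plain element-wise distance, it needs no structural alignment beyond sharing the anchor and the set of directions, which is presumably what makes it cheap enough to run across whole trajectories.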
ArtiDock: Accurate Machine Learning Approach to Protein–Ligand Docking Optimized for High-Throughput Virtual Screening
Voitsitsky et al. benchmarked their new docking model on the PLINDER dataset with cluster-based splitting, supposedly to minimize the data leakage that is known to inflate performance in many docking papers. The result is a median RMSD of 2.0 Å compared to Glide’s 2.8 Å, which is pretty impressive! It is also quite efficient for high-throughput workflows: the reported cost of roughly $5 to screen one million molecules is significantly lower than that of both Glide and GPU-accelerated AutoDock. ArtiDock’s advantage over other models is largest in pockets containing ions and explicit water molecules. The authors also tried coupling ArtiDock with a fast UFF minimization, which fixes steric clashes and bond distortions without sacrificing accuracy, offering a hybrid workflow that feels robust enough for actual production use.
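For a sense of scale, here is a back-of-the-envelope sketch that extrapolates the reported figure to other library sizes. It assumes cost scales linearly with the number of molecules, which the paper summary above does not guarantee; the function name and the example library size are hypothetical.

```python
REPORTED_USD_PER_MILLION = 5.0  # roughly $5 per one million molecules, as reported


def screening_cost_usd(n_molecules, usd_per_million=REPORTED_USD_PER_MILLION):
    """Estimated screening cost in USD, assuming purely linear scaling (an assumption)."""
    if n_molecules < 0:
        raise ValueError("n_molecules must be non-negative")
    return n_molecules / 1_000_000 * usd_per_million


# Hypothetical example: a one-billion-molecule virtual library.
billion_library_cost = screening_cost_usd(1_000_000_000)
```

Under that linear assumption, a billion-molecule screen would land in the low thousands of dollars, before any overhead for ligand preparation or post-docking minimization.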
Long List
Cheminformatics
MOFReasoner: think like a scientist—a reasoning large language model via knowledge distillation
ToPolyAgent: AI Agents for Coarse-Grained Bead-Spring Topological Polymer Simulations
Fragment-Guided New Therapeutic Molecule Discovery and Mapping of Clinically Relevant Interactomes
Selector: A General Python Library for Diverse Subset Selection
PharmGEO: A Curated Atlas of Drug-Response Transcriptomes Enabling Cross-Study Comparisons
Cut-SOAP: A Machine Learning Descriptor for Rapid Screening of Molecular Adsorption Energetics
Machine-Learning Framework for Excitation Energies of Chromophores in Polarizable Environments
The discovery of monoamine oxidase inhibitors: virtual screening and in vitro inhibition potencies
NavDB: A Comprehensive Database for Voltage-Gated Sodium Channels Modulators and Targets
Traj2Relax: A Trajectory-Supervised Method for Robust Structure Relaxation
Random Functions as Data Compressors for Machine Learning of Molecular Processes
Uncertainty-Aware Prediction of 195Pt Chemical Shifts from Limited Data
Estimating the Hydrogen Bond Strength by Machine Learning Approaches
MedChem
Idler Compounds: A Simple Protocol for Openly Sharing Fridge Contents for Cross-Screening
General Binding Affinity Guidance for Diffusion Models in Structure-Based Drug Design
Shaping Antimalarials: A Geometry-First Approach to PfCLK3 Covalent Inhibitors
Integrating Medicinal Chemist Expertise with Deep Learning for Automated Molecular Optimization
PhotoChem
Other
Molecular dynamics simulations accelerated on FPGA with high-bandwidth memory
Outpacing Emerging Drug Threats: Validation of ToxBox Kits That Automate LC-MS/MS Analyses
Palate Cleanser
Stay alive,
Manas













Brilliant roundup on the Clever Hans problem in molecular docking. The finding that Boltz-2's affinity predictions decouple from pose quality is a huge red flag for anyone using these tools in real drug discovery pipelines. I ran into similar issues when a docking model kept suggesting binding in completely destroyed pockets; it turns out it was just pattern-matching ligand scaffolds from training data. That $5 per million molecules cost for ArtiDock is pretty wild though, and makes high-throughput screening feasible for smaller teams without GPU clusters.