This Week In Cheminformatics: Issue #023
Tsetlin Machine, Are We Underestimating Overfitting, MCTS MOO in drug-like space, and a long list of papers to end the month of May
Highlights
The Tsetlin Machine: A “Third Way” in QSAR Modeling
The Tsetlin Machine is an architecture that uses reinforcement learning to train finite-state automata, which in turn learn propositional logic (AND/NOT) clauses directly from molecular descriptors. It works particularly well with binary fingerprints such as ECFP4. Here, the authors benchmarked the model and demonstrated that it frequently outperforms Random Forest and XGBoost on critical early-enrichment virtual screening metrics like PRC-AUC and Positive Predictive Value (PPV). The model also offers direct, wrapper-free interpretability: you can extract exact atom-wise contributions and global descriptor weights straight from the learned logical clauses. While the architecture currently underperforms on discretized continuous descriptors like RDKit2D features, its computational efficiency and strong out-of-the-box inter-scaffold generalization with ECFP4 make it a compelling, highly transparent method to explore for hit prioritization. Always a good read from Ivan Čmelo :)
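To make the "clauses vote, and the votes are the explanation" idea concrete, here is a minimal sketch of inference only, not the authors' implementation. The clauses, bit indices, and the per-bit attribution rule are all made up for illustration; a real Tsetlin Machine learns which literals each clause includes through automaton feedback, which is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clause:
    """One conjunctive clause over fingerprint bits (hypothetical example)."""
    include: frozenset  # bit indices that must be set (plain literals)
    exclude: frozenset  # bit indices that must be unset (negated literals)
    polarity: int       # +1 votes for "active", -1 votes against

    def fires(self, on_bits):
        # AND over all literals: every included bit is on, no excluded bit is on.
        return self.include <= on_bits and not (self.exclude & on_bits)


def class_score(clauses, on_bits):
    """Sum of signed votes from the clauses that fire for this molecule."""
    return sum(c.polarity for c in clauses if c.fires(on_bits))


def bit_contributions(clauses, on_bits):
    """Simplified attribution: credit each included bit of every firing clause."""
    contrib = {}
    for c in clauses:
        if c.fires(on_bits):
            for bit in c.include:
                contrib[bit] = contrib.get(bit, 0) + c.polarity
    return contrib


# Hand-written clauses standing in for a trained model (assumption).
CLAUSES = [
    Clause(frozenset({3, 17}), frozenset({42}), +1),
    Clause(frozenset({3}), frozenset(), +1),
    Clause(frozenset({99}), frozenset({3}), -1),
]

# The set of "on" bit indices of a hypothetical ECFP4-style fingerprint.
molecule_bits = frozenset({3, 17, 256})

score = class_score(CLAUSES, molecule_bits)
contributions = bit_contributions(CLAUSES, molecule_bits)
print(score, contributions)
```

Since each fingerprint bit can be traced back to the substructures that set it, per-bit votes like these are what would let you map contributions onto atoms without a separate explainer.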
Are We Underestimating Overfitting?
This JCIM perspective by Winkler makes an obvious case for rethinking our structural aversion to overparameterization. The paper challenges the long-standing dogma that parsimonious models inherently generalize best, instead exploring the double-descent performance curve, where models operating well beyond the interpolation threshold can actually exhibit benign overfitting and robust predictive accuracy on unseen data. Winkler demonstrates that highly parameterized nonlinear architectures, such as deep neural networks and tree ensembles like Random Forest or XGBoost, often maintain test-set performance regardless of traditional bias-variance trade-off constraints. This has been obvious for a long while, and perhaps now we know whom to cite when arguing that my bigger model is better than your slightly smaller model, hahaha.
When Trees Guide Molecules: Multiobjective Search in de Novo Drug Design
This paper by Druchok and Rovenchak proposes multiobjective de novo design by integrating Monte Carlo Tree Search (MCTS) with machine learning property predictors. They empirically compare linear scalarization against a dynamic Pareto front approach within the MCTS generation loop. They then apply this to generate highly selective OX1R inhibitors and more complex polypharmacological candidates targeting both OX1R and H3R, all while enforcing physicochemical constraints on solubility, melting point, and acute toxicity. They use UMAP projections to “ensure” candidate molecules remain within the applicability domain of the underlying surrogate models. The generated molecules were not validated experimentally. Interesting read.
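For anyone who wants the scalarization vs Pareto distinction spelled out, here is a small sketch. It is not the paper's code: the molecule names, the two objectives (think predicted activity and predicted solubility, both rescaled so higher is better), and the weights are all hypothetical, and the MCTS loop itself is left out.

```python
def scalarize(scores, weights):
    """Linear scalarization: collapse all objectives into one weighted sum."""
    return sum(w * s for w, s in zip(weights, scores))


def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Keep every candidate that no other candidate dominates."""
    return {
        name: scores
        for name, scores in candidates.items()
        if not any(dominates(other, scores) for other in candidates.values())
    }


# Hypothetical predictor outputs: (objective 1, objective 2), higher is better.
CANDIDATES = {
    "mol_a": (0.9, 0.3),
    "mol_b": (0.6, 0.55),
    "mol_c": (0.2, 0.9),
    "mol_d": (0.5, 0.5),  # worse than mol_b on both objectives
}

WEIGHTS = (0.5, 0.5)  # assumed equal weighting

# Scalarization returns a single winner for the chosen weights.
best = max(CANDIDATES, key=lambda name: scalarize(CANDIDATES[name], WEIGHTS))

# The Pareto view keeps every non-dominated trade-off instead.
front = pareto_front(CANDIDATES)

print(best, sorted(front))
```

The weighted sum commits to one trade-off up front, while the Pareto front keeps all three non-dominated molecules and drops only the one that is beaten on both objectives.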
Long List
Cheminformatics
A Critical Examination of Active Learning Workflows in Materials Science
Novelty-Aware Evolutionary Bayesian Optimisation for Multi-Objective Discovery Science
Physics-Aware Representation Learning on Electronic Charge Density for Materials Property Prediction
Intercalation Favors DNA Covalent Photobinding in Photoresponsive Dual PDT/PCT Bimetallic Assemblies
Unifying pKa and Protonation Prediction with Sequence-Based Deep Learning
A Nonlinear Multi‐Objective Prediction Strategy for Small‐Sample Datasets in Homogeneous Catalysis
Naphthalene-based hydroxamate HDAC inhibitors with anti-breast tumor activity
Rational design of triazole tyrosinase inhibitors via integrated free energy calculations
M-Tune: imbalanced data handling in machine learning by tuning the decision threshold
Efficient and Precise Force Field Optimization for Biomolecules Using DPA-3
Scalable Ligand Pose Generation via QUBO-Guided Grid Sampling and Geometric Triplet Matching
Membrane Protein Insertion in Cells: Principles, Pathways, and Quality Control
Precision-Guarded Graph–Text Alignment for Universal Chemical Understanding
FLOWR: flow matching for structure-aware de novo, interaction- and fragment-based ligand generation
Efficient fine-tuning of vision-language adapters in chemical VLMs for molecular image-text tasks
A causal inference framework for identifying essential genes to enhance drug synergy prediction
SARS-CoV-2 Spike Protein’s Structural Dynamics Affect the Activity of the Bebtelovimab Antibody
Smarter Data: Rethinking Data Generation for Machine Learning Potentials in Heterogeneous Catalysis
A Structure‐Informed Atlas of Venom‐Derived Peptides Reveals the Organization of Chemical Space
Predicting the Host–Guest Binding Gibbs Free Energy for Anion Guests
Predicting fire consequences with the transformer model based on multimodal feature fusion
ADMET-vault: an interactive framework for real-time ADMET prediction and molecular optimization
A theoretical investigation of inotilone as a potential free radical scavenging agent
MedChem
Other
Palate Cleanser
Slay,
Manas


































