This Week In Cheminformatics: Issue #023

Tsetlin Machine, Are We Underestimating Overfitting, MCTS MOO in drug-like space, and a long list of papers to end the month of May

May 31, 2026

Highlights

The Tsetlin Machine: A “Third Way” in QSAR Modeling

The Tsetlin Machine is an architecture that uses reinforcement learning to train finite-state automata, which in turn learn propositional logic (AND/NOT) clauses directly from molecular descriptors. It works very well with binary fingerprints (ECFP4, etc.). Here, the authors benchmarked the model and showed that it frequently outperforms Random Forest and XGBoost on critical early-enrichment virtual screening metrics like PRC-AUC and Positive Predictive Value (PPV). The model also offers direct, wrapper-free interpretability: you can extract exact atom-wise contributions and global descriptor weights straight from the learned logical clauses. While the architecture currently underperforms on discretized continuous descriptors like RDKit2D features, its computational efficiency and strong out-of-the-box inter-scaffold generalization with ECFP4 make it a compelling, highly transparent method to explore for hit prioritization. Always a good read from Ivan Čmelo :)
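To make the "clauses over fingerprint bits" idea concrete, here is a minimal sketch of how learned AND/NOT clauses could be evaluated on a binary fingerprint and turned into a vote. It covers inference only, not the automata-based training. The `Clause` class, the bit indices, and the simple per-bit attribution in `bit_contributions` are all hypothetical illustrations, not the authors' implementation or any library's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clause:
    include: frozenset  # fingerprint bits that must be ON
    exclude: frozenset  # fingerprint bits that must be OFF (negated literals)
    polarity: int       # +1 votes for "active", -1 votes against

    def fires(self, on_bits):
        # A conjunctive clause fires only if every literal is satisfied.
        return self.include <= on_bits and not (self.exclude & on_bits)


def vote(clauses, on_bits):
    """Sum the polarities of all clauses that fire for this molecule."""
    return sum(c.polarity for c in clauses if c.fires(on_bits))


def bit_contributions(clauses, on_bits):
    """Hypothetical attribution: credit each required ON bit of a firing clause."""
    contrib = {}
    for c in clauses:
        if c.fires(on_bits):
            for bit in c.include:
                contrib[bit] = contrib.get(bit, 0) + c.polarity
    return contrib


# Made-up clauses and a made-up molecule, given as the set of ON bit indices.
clauses = [
    Clause(frozenset({3, 17}), frozenset({42}), +1),
    Clause(frozenset({17}), frozenset(), +1),
    Clause(frozenset({42}), frozenset({3}), -1),
]
molecule = frozenset({3, 17, 99})

score = vote(clauses, molecule)
contributions = bit_contributions(clauses, molecule)
```

Because each fingerprint bit maps back to substructures, a per-bit tally like this is the kind of quantity that can be projected onto atoms without a separate explainer wrapper.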

Are We Underestimating Overfitting?

This JCIM perspective by Winkler makes an obvious case for rethinking our structural aversion to overparameterization. The paper challenges the long-standing dogma that parsimonious models inherently generalize best, instead exploring the double-descent performance curve, where models operating well beyond the interpolation threshold can actually exhibit benign overfitting and robust predictive accuracy on unseen data. Winkler demonstrates that highly parameterized nonlinear architectures, such as deep neural networks and ensemble trees like Random Forests or XGBoost, often maintain test-set performance regardless of traditional bias-variance trade-off constraints. This has been obvious for a long while, and perhaps now we know whom to cite when arguing that my bigger model is better than your slightly smaller model, hahaha.
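For readers who want to poke at the interpolation threshold themselves, here is a toy sketch, assuming synthetic data and random cosine features fitted by minimum-norm least squares. None of it comes from the perspective: the data-generating function, the feature counts, and the names are made up for illustration. Whether a second descent in test error actually shows up depends on the noise level, the features, and the seed.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n, d=5, noise=0.1):
    # Synthetic regression task (hypothetical, not a real QSAR dataset).
    X = rng.normal(size=(n, d))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + noise * rng.normal(size=n)
    return X, y


def fit_and_score(p, X_tr, y_tr, X_te, y_te):
    """Fit p random cosine features with minimum-norm least squares."""
    W = rng.normal(size=(X_tr.shape[1], p))
    b = rng.uniform(0.0, 2.0 * np.pi, size=p)
    phi_tr = np.cos(X_tr @ W + b)
    phi_te = np.cos(X_te @ W + b)
    # For p > n, lstsq returns the minimum-norm solution that interpolates.
    coef, *_ = np.linalg.lstsq(phi_tr, y_tr, rcond=None)
    train_mse = float(np.mean((phi_tr @ coef - y_tr) ** 2))
    test_mse = float(np.mean((phi_te @ coef - y_te) ** 2))
    return train_mse, test_mse


X_tr, y_tr = make_data(40)
X_te, y_te = make_data(500)

# The interpolation threshold sits near p == n == 40 training points.
results = {p: fit_and_score(p, X_tr, y_tr, X_te, y_te) for p in (5, 20, 40, 80, 400)}

for p, (train_mse, test_mse) in results.items():
    print(f"p={p:4d}  train MSE={train_mse:.2e}  test MSE={test_mse:.2e}")
```

Past the threshold the training error collapses to essentially zero, so the interesting column is the test error as `p` keeps growing.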

When Trees Guide Molecules: Multiobjective Search in de Novo Drug Design

This paper by Druchok and Rovenchak proposes a multiobjective de novo design approach that integrates Monte Carlo Tree Search (MCTS) with machine learning property predictors. They empirically compared linear scalarization with a dynamic Pareto front approach within the MCTS generation loop. They successfully applied this to generate highly selective OX1R inhibitors and more complex polypharmacological candidates targeting both OX1R and H3R, all while enforcing physicochemical constraints on solubility, melting point, and acute toxicity. They used UMAP space projections to “ensure” candidate molecules remain within the applicability domain of the underlying surrogate models. The generated molecules were not validated experimentally. Interesting read.
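The difference between the two reward schemes is easy to see on a toy example. The sketch below is not the authors' code and leaves out the tree search entirely; it only contrasts a weighted-sum score with a Pareto-dominance filter. The candidate names and the two objective scores (think of a predicted activity and a predicted solubility, both scaled so higher is better) are made up.

```python
def scalarize(scores, weights):
    """Linear scalarization: collapse all objectives into one weighted sum."""
    return sum(w * s for w, s in zip(weights, scores))


def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Keep every candidate that no other candidate dominates."""
    return {
        name: scores
        for name, scores in candidates.items()
        if not any(dominates(other, scores) for other in candidates.values())
    }


# Hypothetical (objective_1, objective_2) scores per candidate molecule.
candidates = {
    "A": (0.9, 0.2),
    "B": (0.6, 0.6),
    "C": (0.2, 0.9),
    "D": (0.4, 0.5),
}

front = pareto_front(candidates)
best_equal = max(candidates, key=lambda k: scalarize(candidates[k], (0.5, 0.5)))
best_skewed = max(candidates, key=lambda k: scalarize(candidates[k], (0.8, 0.2)))
```

Scalarization returns a single winner that shifts with the chosen weights, while the Pareto filter keeps every non-dominated trade-off and only drops D, which B beats on both objectives.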
