This Week In Cheminformatics: Issue #002
First week's readings, Big W for OpenFE, Andrew Dalke and RNA Structure Prediction
Welcome or welcome back to This Week In Cheminformatics. Cheers on surviving the first week of 2026!
In the Highlights, we examine a massive validation study from OpenFE, Andrew Dalke’s clever method for preserving count magnitudes in binary fingerprints, and the results of the Stanford Kaggle competition for RNA structure prediction. As always, scroll down for a packed Long List, some pleasure reading on Pre-training in the Archive, and a few Palate Cleansers to start your week right.
Highlights
Community-driven infrastructure can now rival expensive commercial platforms in both scale and reliability. The Open Free Energy (OpenFE) consortium presents this massive validation study, performing over 3,000 relative binding free energy calculations across 14 varied datasets to benchmark their new framework. They show that OpenFE achieves accuracy comparable to SOTA proprietary tools (like Schrödinger’s FEP+).
Superimposed Coding of Count Fingerprints to Binary Fingerprints
This might be niche for some, but I quite enjoy Andrew Dalke’s work. The core idea of this preprint is a “thermometer encoding” that lives inside your standard hashing collision space. In a typical binary fingerprint (like ECFP), if a specific substructure appears 5 times, you still set its hashed bit only once, so its magnitude is lost.
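To make the loss concrete, and to preview the fix described next, here is a minimal sketch of the idea as I read it. It is not Dalke's implementation: the SHA-256 hashing, the 2048-bit length, the feature names and the toy counts are all my own assumptions, chosen only to keep the example self-contained.

```python
import hashlib

N_BITS = 2048  # assumed fingerprint length, not taken from the preprint


def bit_for(key, n_bits=N_BITS):
    """Map a string key to a bit position using a stable hash (assumed scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_bits


def plain_binary_fp(counts, n_bits=N_BITS):
    """Presence/absence fingerprint: one bit per feature, counts ignored."""
    return {bit_for(feature, n_bits) for feature, c in counts.items() if c > 0}


def superimposed_fp(counts, n_bits=N_BITS):
    """Set one bit per (feature, level) pair for level = 1..count."""
    bits = set()
    for feature, c in counts.items():
        for level in range(1, c + 1):
            bits.add(bit_for(f"{feature}_count:{level}", n_bits))
    return bits


def tanimoto(a, b):
    """Binary Tanimoto on sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def minmax(x, y):
    """MinMax similarity on the raw count vectors."""
    keys = set(x) | set(y)
    num = sum(min(x.get(k, 0), y.get(k, 0)) for k in keys)
    den = sum(max(x.get(k, 0), y.get(k, 0)) for k in keys)
    return num / den if den else 1.0


# Hypothetical feature counts for two molecules that differ only in multiplicity.
mol_a = {"C-C": 5, "C=O": 1, "c:c": 6}
mol_b = {"C-C": 2, "C=O": 1, "c:c": 6}

print("plain binary Tanimoto:", tanimoto(plain_binary_fp(mol_a), plain_binary_fp(mol_b)))
print("superimposed Tanimoto:", tanimoto(superimposed_fp(mol_a), superimposed_fp(mol_b)))
print("MinMax on counts:     ", minmax(mol_a, mol_b))
```

With these toy counts, MinMax is 9/12 = 0.75. The plain binary Tanimoto is 1.0 because both molecules contain the same three features. The superimposed Tanimoto should land at or near 0.75, unless hash collisions happen to merge some of the level bits.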
Dalke proposes superimposed coding, where you hash the feature combined with every count level up to its occurrence. If a feature appears 3 times, you don’t just hash Feature; you hash Feature_count:1, Feature_count:2, and Feature_count:3 independently. This effectively stacks the counts into the bitstring. A molecule with more copies of a substructure will have more bits set for that feature than a molecule with fewer copies. The beauty is that you can run standard, fast binary Tanimoto on these results, and the value you get approximates the computationally expensive MinMax similarity of the full count vectors.
Template-based RNA structure prediction advanced through a blind code competition
I found this preprint quite interesting because it challenges the prevailing assumption that end-to-end deep learning is the sole frontier for solving structural biology challenges. In fact, “old-school” template-based methods still work quite well. This preprint details the results of the Stanford RNA 3D Folding Kaggle competition, where over 1,700 teams competed to predict RNA structures on blind test sets. Unexpectedly, the top-performing algorithms (from teams ‘john’ and ‘odat’) prioritized template-based modeling, searching the PDB for homologous structures, over GNNs, ultimately outperforming the human expert ‘Vfold’ team on hidden targets. The authors compiled these community insights into a new model, RNAPro, which combines an AlphaFold 3-style pairwise representation with a template-retrieval pipeline and a diffusion module. RNAPro’s performance suggests that as the experimental database of RNA structures grows, the most effective path forward isn’t necessarily larger parameter counts, but rather the integration of retrieved physical priors (templates) with attention. Also see gRNAde, RhoFold+, Rosetta, etc. Watch this space!
Long List
Cheminformatics
PymolFold: A PyMOL Plugin for API-Driven Structure Prediction and Quality Assessment
NeuMTL: A Unified Multimodal Framework for Multi-Task Prediction in CNS Drug Discovery
Struct2GO-Enhanced: Multimodal Graph Attention Improves Protein Function Prediction
Complete Computational Exploration of Eight-Carbon Hydrocarbon Chemical Space
ActivityFinder: Toward the Fully Automatic Integration of Structural and Binding Affinity Data
Optimizing SMILES token sequences via trie-based refinement and transition graph filtering
Confidently uncertain: Probabilistic machine learning to predict soil biotransformation half-lives
Structure-free drug–target affinity prediction using protein and molecule language models
MedChem
Reviews
Trends In Computational Metabolomics In The Past Five Years (2021–2025)
The growing role of open source software in molecular modeling
Ionic Liquids in Pharmaceuticals: A Scoping Review of Formulation Strategies
Others
Salt Wars: A Card Game To Learn Inorganic Chemistry Nomenclature
Riemannian denoising model for molecular structure optimization with chemical accuracy
AI-Driven Robotic Crystal Explorer for Rapid Polymorph Identification
MATTERIX: toward a digital twin for robotics-assisted chemistry laboratory automation
Archive
Some reading on Why pretraining works so well?
Palate Cleanser
I’ll experiment a bit with the format of this newsletter, but I’m quite open to feedback, so please let me know what you think @manasmahale.
Have a great week,
Manas