This Week In Cheminformatics: Issue #002

First week's readings, Big W for OpenFE, Andrew Dalke and RNA Structure Prediction

Jan 04, 2026

Welcome or welcome back to This Week In Cheminformatics. Cheers on surviving the first week of 2026!

In the Highlights, we examine a massive validation study from OpenFE, Andrew Dalke’s clever method for preserving count magnitudes in binary fingerprints, and the results of the Stanford Kaggle competition for RNA structure prediction. As always, scroll down for a packed Long List, some pleasure reading on Pre-training in the Archive, and a few Palate Cleansers to start your week right.

Highlights

Large-scale collaborative assessment of binding free energy calculations for drug discovery using OpenFE
Community-driven infrastructure can now rival expensive commercial platforms in both scale and reliability. The Open Free Energy (OpenFE) consortium presents this massive validation study, performing over 3,000 relative binding free energy calculations across 14 varied datasets to benchmark their new framework. They show that OpenFE achieves accuracy comparable to state-of-the-art proprietary tools (like Schrödinger’s FEP+).
Superimposed Coding of Count Fingerprints to Binary Fingerprints
This might be niche for some, but I quite enjoy Andrew Dalke’s work. The core idea of this preprint is a “thermometer encoding” that lives inside the same hashed bit space as a standard fingerprint. In a typical binary fingerprint (like ECFP), if a specific substructure appears 5 times, its hashed bit is still set only once, so its magnitude is lost.
Dalke proposes superimposed coding, where you hash the feature combined with every count level up to its occurrence. If a feature appears 3 times, you don’t just hash Feature; you hash Feature_count:1, Feature_count:2, and Feature_count:3 independently. This effectively stacks the counts into the bitstring. A molecule with more copies of a substructure will have more bits set for that feature than a molecule with fewer copies. The beauty is that you can run standard, fast binary Tanimoto on these results, and the value you get approximates the computationally expensive MinMax similarity of the full count vectors.
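To make the idea concrete, here is a minimal sketch of the encoding as described above, not the preprint's actual implementation. The function names, the SHA-256 hash, the `Feature_count:k` token format and the toy feature counts are all assumptions chosen for illustration; a real pipeline would start from substructure counts produced by a cheminformatics toolkit.

```python
import hashlib


def _bit(token: str, n_bits: int) -> int:
    """Map a string token to a bit position (SHA-256 is an arbitrary choice here)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_bits


def superimposed_fingerprint(counts: dict, n_bits: int = 2048) -> set:
    """Return the set of 'on' bit positions for a feature -> count mapping.

    A feature seen c times contributes one hashed token per level 1..c,
    so higher counts switch on more bits (thermometer-style).
    """
    bits = set()
    for feature, count in counts.items():
        for level in range(1, count + 1):
            bits.add(_bit(f"{feature}_count:{level}", n_bits))
    return bits


def binary_tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity of two binary fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def minmax_similarity(x: dict, y: dict) -> float:
    """MinMax similarity of two count vectors: sum of mins over sum of maxes."""
    keys = set(x) | set(y)
    num = sum(min(x.get(k, 0), y.get(k, 0)) for k in keys)
    den = sum(max(x.get(k, 0), y.get(k, 0)) for k in keys)
    return num / den if den else 1.0


# Hypothetical feature counts for two molecules (illustrative only).
mol_a = {"benzene": 3, "carbonyl": 1}
mol_b = {"benzene": 1, "carbonyl": 1, "amine": 2}

fp_a = superimposed_fingerprint(mol_a)
fp_b = superimposed_fingerprint(mol_b)

print("binary Tanimoto:", binary_tanimoto(fp_a, fp_b))
print("MinMax on counts:", minmax_similarity(mol_a, mol_b))
```

When no two tokens collide, the shared bits count the per-feature minimums and the union counts the per-feature maximums, so the two numbers agree exactly; hash collisions in a short bitstring are what turn the match into an approximation.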
Template-based RNA structure prediction advanced through a blind code competition
I found this preprint quite interesting because it challenges the prevailing assumption that end-to-end deep learning is the sole frontier for solving structural biology challenges. In fact, those “old-school” methods still work quite well. This preprint details the results of the Stanford RNA 3D Folding Kaggle competition, where over 1,700 teams competed to predict RNA structures on blind test sets. Unexpectedly, the top-performing algorithms (from teams ‘john’ and ‘odat’) prioritized template-based modeling, searching the PDB for homologous structures rather than relying on GNNs, and ultimately outperformed the human expert ‘Vfold’ team on hidden targets. The authors compiled these community insights into a new model, RNAPro, which combines an AlphaFold 3-style pairwise representation with a template-retrieval pipeline and a diffusion module. RNAPro’s performance suggests that as the experimental database of RNA structures grows, the most effective path forward isn’t necessarily larger parameter counts, but rather the integration of retrieved physical priors (templates) with attention. Also see gRNAde, RhoFold+, Rosetta, etc. Watch this space!