This Week In Cheminformatics: Issue #003
Better Ensemble For Federated Learning, A New Chirality, Transformer Model's Memorization Limiting Performance, Closed-Loop Molecular Discovery and Results of Blind Challenge on Pan-Coronavirus Data
Highlights
Empowering federated learning for robust compound-protein interaction prediction across heterogeneous cross-pharma domains
If you have suspected that simply averaging weights from disparate big pharma datasets in Federated Learning might hurt local performance, this study confirms those fears and offers a fix. Koyama et al. demonstrate that while standard Federated Learning improves generalization on out-of-domain targets, it often underperforms simple local models on in-domain data when chemical spaces are highly heterogeneous (e.g., one company focuses on kinases, another on GPCRs). To resolve this trade-off, the authors propose a Similarity-Guided Ensemble, which dynamically weights predictions from the global Federated Learning model and a fine-tuned local model based on the query compound’s Tanimoto similarity to local training data. Validated across 13 pharmaceutical companies, this approach captures the “best of both worlds,” securing robust performance on both proprietary pipelines and novel chemical space without “compromising privacy.” Also read about MELLODDY and the Effiris Hackathon.
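To make the similarity-guided weighting idea concrete, here is a minimal Python sketch. It is not the authors' implementation. The linear rule (weight on the local model = maximum Tanimoto similarity to the local training set), the fingerprints represented as plain sets of "on" bits, and all function names are assumptions made purely for illustration.

```python
from typing import Iterable, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def max_local_similarity(query_fp: Set[int], local_fps: Iterable[Set[int]]) -> float:
    """Highest similarity of the query to any compound in the local training set."""
    return max((tanimoto(query_fp, fp) for fp in local_fps), default=0.0)


def ensemble_predict(
    query_fp: Set[int],
    local_fps: Iterable[Set[int]],
    p_local: float,
    p_global: float,
) -> float:
    """Blend local and global predictions (hypothetical linear weighting).

    Assumption: the closer the query is to local training data, the more
    the fine-tuned local model is trusted; otherwise lean on the global
    federated model.
    """
    w = max_local_similarity(query_fp, local_fps)
    return w * p_local + (1.0 - w) * p_global


if __name__ == "__main__":
    local_training = [{1, 2, 3, 4}, {9}]
    # Query shares half of its union with the first local compound.
    print(ensemble_predict({1, 2}, local_training, p_local=0.9, p_global=0.1))
```

In practice the fingerprints would come from a cheminformatics toolkit and the weighting function could be nonlinear or thresholded; the paper should be consulted for the scheme actually used.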
A New Chirality Phenomenon in Amino Acid and Peptide Derivatives
Just when I thought R/S or L/D nomenclature told the whole story about stereochemistry, Yuan et al. have introduced “turbo chirality”. By analyzing X-ray structures of N-acetyl amino acids and the opioid peptide biphalin, the researchers found that planar amide and carboxylic acid groups naturally arrange themselves into “propeller blades” around the α-carbon. This secondary chirality is deterministically locked to the central chirality: an (S)-center enforces an (M,M)-propeller configuration, while an (R)-center dictates a (P,P)-form. Quite a fun read.
Transformer Learning in Sequence-Based Drug Design Depends on Compound Memorization and Similarity of Sequence-Compound Pairs
Transformer models are often hailed for their generative capabilities in drug design. In this study, Prof. Bajorath suggests they may be “faking it” through memorization rather than learning any meaningful rules. Using sequence-based compound design as a test case, the paper shows that the model’s ability to reproduce active compounds is driven almost entirely by similarity relationships between training and test data and the memorization of multi-target compounds. In fact, the models proved so reliant on statistical correlations that they could still successfully generate valid compounds even when specific binding motifs were masked or up to 50% of the input protein sequence was randomized. This implies that these transformers are not learning anything fundamental about ligand binding, but are instead exploiting data redundancy!
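The sequence-randomization control mentioned above is easy to picture in code. The following Python sketch replaces a chosen fraction of positions in a protein sequence with a different, randomly picked residue. It is an illustration only: the function name, the choice to sample positions uniformly, and the rule that a replaced residue must differ from the original are assumptions, and the paper's actual protocol may differ.

```python
import random

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def randomize_sequence(seq: str, fraction: float, seed: int = 0) -> str:
    """Return a copy of `seq` with `fraction` of its positions randomized.

    Hypothetical helper: positions are sampled uniformly without
    replacement, and each one is swapped for a different standard residue.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    n_changes = round(len(seq) * fraction)
    positions = rng.sample(range(len(seq)), n_changes)
    residues = list(seq)
    for i in positions:
        # Exclude the current residue so every sampled position really changes.
        choices = [aa for aa in AMINO_ACIDS if aa != residues[i]]
        residues[i] = rng.choice(choices)
    return "".join(residues)


if __name__ == "__main__":
    original = "MKTAYIAKQRQISFVKSHFSRQ"
    scrambled = randomize_sequence(original, fraction=0.5, seed=1)
    changed = sum(a != b for a, b in zip(original, scrambled))
    print(scrambled, f"({changed} of {len(original)} positions changed)")
```

If a model conditioned on the scrambled sequence still proposes the same known actives as it does for the original, that suggests the sequence input carries little binding-specific signal for the model.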
Toward fully autonomous closed-loop molecular discovery – A case study on JAK targets
In this preprint, a fully autonomous, “human-out-of-the-loop” discovery tool is used to study JAK inhibitors. While many “closed-loop” systems still rely on human hands for synthesis, purification, etc., this work integrates IBM’s RoboRXN for synthesis and Arctoris’ Ulysses for screening to execute two complete Design-Make-Test-Analyze cycles with minimal intervention. In a feat of “blind” discovery, the model, which was initially unaware of existing kinase inhibitors, managed to “re-discover” known nitrogen-rich scaffolds (like pyrrolopyrimidines) and significantly improve potency and ligand efficiency in just the second cycle. This is further evidence that autonomous systems can effectively navigate chemical space and iterate on structure-activity relationships without explicit prior bias.
A Computational Community Blind Challenge on Pan-Coronavirus Drug Discovery Data
If you ever needed a reality check on the state of lead optimization for critical antiviral targets, the ASAP-Polaris-OpenADMET challenge delivers it. Focusing on the main protease of SARS-CoV-2 and MERS-CoV, this challenge saw 66 unique participants predict biochemical potency, crystallographic poses, and key ADMET endpoints using real-world data from an active drug discovery campaign. The outcome was a strong validation of modern computational methods: deep learning models achieved impressive accuracy, with top entries predicting potency within ~0.5 log units and co-folding methods achieving >80% success in pose prediction. This shows that, in contrast to the hurdles seen in earlier hit-finding challenges, current AI-driven tools can effectively drive lead optimization when high-quality, prospective data is available.
Long List
Cheminformatics
Prodrug-ML: prodrug-likeness prediction via machine learning on sampled negative decoys
pepADMET: A Novel Computational Platform For Systematic ADMET Evaluation of Peptides
The Digital Chiroscope: Unlocking Blueprints from Chiroptical Spectroscopy’s Black Box
FragmentRetro: A Quadratic Retrosynthetic Method Based on Fragmentation Algorithms
Artificial neural network as a strategy to predict rheological properties in emulgel formulations
Intercellular Communication Guides the Prediction of Intracellular Gene Regulatory Relationships
QPred: A Quantum Mechanical Property Predictor for Small Molecules
ConforFormer: representation for molecules through understanding of conformers
PepGraphormer: an ESM-GAT hybrid deep learning framework for antimicrobial peptide prediction
Multimodal Bond Reconstruction toward Generative Molecular Design
Learnable protein representations in computational biology for predicting drug-target affinity
TropMol-Caipora: A Cloud-Based Web Tool to Predict Cruzain Inhibitors by Machine Learning
MedChem
Structure-Based Virtual Screening Discovers Potent PLK1 Inhibitors with Anticancer Activity
Approved Steroidal Drugs (2000–2025): A Medicinal Chemistry Perspective
Mapping the Allosteric Landscape of PPARγ: a Markov State Modeling and Energetic Analysis Approach
Reviews
Other
A New Chirality Phenomenon in Amino Acid and Peptide Derivatives
Calibration Transfer Based on Nonparametric Varying Coefficient Regression
Cross-omics interpretable neural network for discovery of molecular markers in prostate cancer
Knowledge graph integration of clustered medicinal plants, molecules, diseases, and targets
Linking brain and behavior states in Zebrafish Larvae locomotion using hidden Markov models
Tools
ChemIllusion is a pretty cool tool for generating graphical abstracts using GenAI.
Deck Gallery is a wonderful collection of inspiring slide decks.
Archive
Some papers on what KAN can do
Kolmogorov–Arnold Chemical Reaction Neural Networks for learning
MOF-KAN: Kolmogorov–Arnold Networks for Digital Discovery of Metal–Organic Frameworks
Kolmogorov–Arnold graph neural networks for molecular property prediction
Palate Cleanser
Have a week as iconic as you are,
Manas