This Week In Cheminformatics: Issue #008
Weighted fingerprints, curated annotation of ChEMBL, a compound prioritization method and a long list of papers from last week
Highlights
Guiding Similarity Search in Chemical Fragment Spaces with Weighted Fingerprints
For those “navigating” ultra-large chemical fragment spaces like Enamine REAL (70B+) or SAVI, Lübbers et al. present Weighted SpaceLight, a new version of the SpaceLight algorithm that preserves and retrieves specific substructures using high-speed fingerprint similarity searching. Standard Tanimoto-based searches often discard essential pharmacophores or scaffolds in favor of global similarity. To address this, the method lets the user assign higher weights to specific atom environments via SMARTS patterns directly within the fingerprint comparison step. The impact on retrieval precision is notable: in a validation study using a glucosyltransferase inhibitor, the unweighted search preserved the core scaffold in less than 5% of hits, whereas the weighted approach achieved preservation rates of up to 95% while maintaining chemical diversity. Critically, this “focused” search incurs negligible computational overhead, offering a scalable solution for enforcing scaffold hopping constraints without resorting to computationally expensive post-filtering.
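To make the weighting idea concrete, here is a minimal sketch of a weighted Tanimoto comparison on toy feature sets. It is not the SpaceLight implementation: the function `weighted_tanimoto`, the feature names and the weight of 5 are all hypothetical, and real fingerprints would come from atom environments rather than hand-written labels.

```python
def weighted_tanimoto(fp_a, fp_b, weights=None, default_weight=1.0):
    """Tanimoto similarity over feature sets with optional per-feature weights.

    With no weights this reduces to the usual |A & B| / |A | B|.
    """
    weights = weights or {}
    union = fp_a | fp_b
    if not union:
        return 0.0
    shared = fp_a & fp_b

    def weight(feature):
        return weights.get(feature, default_weight)

    return sum(weight(f) for f in shared) / sum(weight(f) for f in union)


# Toy "fingerprints": each string stands in for one atom environment (hypothetical).
query = {"scaffold_ring", "amide", "phenyl", "methoxy"}
hit_keeps_core = {"scaffold_ring", "amide", "pyridyl", "fluoro"}
hit_loses_core = {"amide", "phenyl", "methoxy", "chloro"}

# Assumed weighting: the environment we want to preserve counts five times.
core_weights = {"scaffold_ring": 5.0}

for name, hit in [("keeps core", hit_keeps_core), ("loses core", hit_loses_core)]:
    plain = weighted_tanimoto(query, hit)
    focused = weighted_tanimoto(query, hit, core_weights)
    print(f"{name}: unweighted={plain:.3f} weighted={focused:.3f}")
```

In this toy case the unweighted score prefers the hit that lost the core ring, while the weighted score prefers the hit that kept it, and the comparison itself costs no more than a plain Tanimoto.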
Integrating artificial intelligence and manual curation to enhance bioassay annotations in ChEMBL
Smit et al. present a “FAIRification” of ChEMBL bioactivity data by combining manual curation with AI-driven retrospective enrichment of legacy assay metadata. The authors developed a so-called “perfect assay description” template and trained a spaCy-based Named Entity Recognition (NER) model that extracts experimental methods with an F1-score of 0.94, successfully annotating 57% of Binding and Functional assays. This is complemented by a multi-class classification model that refines the coarse ASSAY_TYPE schema into granular “broad assay categories” (e.g., distinguishing cell phenotype from protein activity), achieving high-confidence predictions for 88% of literature-derived assays. Furthermore, they used regular expressions to extract ADME parameters, such as dose and administration route, from over 17,000 PK/PD assays, and established an automated pipeline for mapping these terms to the BioAssay Ontology. Code is here.
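As a rough illustration of the regex step, the sketch below pulls a dose and an administration route out of a free-text assay description. The patterns, the `ROUTE_MAP` vocabulary, the function `extract_pk_terms` and the example description are all assumptions made for this newsletter, not the authors' actual expressions.

```python
import re

# Hypothetical route vocabulary: surface form -> normalized term.
ROUTE_MAP = {
    "po": "oral", "oral": "oral", "orally": "oral",
    "iv": "intravenous", "intravenous": "intravenous",
    "ip": "intraperitoneal", "intraperitoneal": "intraperitoneal",
    "sc": "subcutaneous", "subcutaneous": "subcutaneous",
}

# A number followed by a dose unit, e.g. "10 mg/kg" or "2.5mg".
DOSE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(mg/kg|ug/kg|mg|ug)\b", re.IGNORECASE)

# Any known route term as a whole word; longest alternatives are tried first.
ROUTE_RE = re.compile(
    r"\b(" + "|".join(sorted(ROUTE_MAP, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)


def extract_pk_terms(description):
    """Return the first dose, its unit and the normalized route, or None for each."""
    dose = DOSE_RE.search(description)
    route = ROUTE_RE.search(description)
    return {
        "dose": float(dose.group(1)) if dose else None,
        "unit": dose.group(2).lower() if dose else None,
        "route": ROUTE_MAP[route.group(1).lower()] if route else None,
    }


example = "Cmax in Sprague-Dawley rat at 10 mg/kg, po by LC-MS/MS analysis"
print(extract_pk_terms(example))
```

Short abbreviations such as "ip" or "sc" can collide with unrelated tokens in real descriptions, so a production pipeline would likely need context checks on top of patterns like these.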
A simple compound prioritization method for drug discovery considering multi-target binding
This study in Digital Discovery presents a compelling framework for multi-objective active learning that decouples the training and acquisition of distinct molecular properties. By training separate Gaussian Process models for individual target affinities rather than fitting a composite objective function, the authors demonstrate that a “separated acquisition” strategy, i.e. one using a modified Expected Improvement function, significantly outperforms conventional models. Validated retrospectively on the DOCKSTRING benchmark, this approach improved the retrieval of the top 0.04–0.4% of binders and increased the Spearman rank correlation of predictions by a factor of 1.5 compared to joint modeling.
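For readers who want a feel for what per-target acquisition looks like, here is a small sketch that scores candidates from separate posteriors, one (mean, standard deviation) pair per target. It uses the textbook Expected Improvement for maximization and combines targets by multiplying their EI values. Both the product rule and the convention that higher scores are better are assumptions for illustration; the paper's modified Expected Improvement may well differ. The GP models themselves are omitted, and the posterior numbers are made up.

```python
import math


def expected_improvement(mean, std, best):
    """Standard EI for maximization, given a Gaussian posterior at one candidate."""
    if std <= 0.0:
        return max(mean - best, 0.0)
    z = (mean - best) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best) * cdf + std * pdf


def separated_score(posteriors, bests):
    """Combine per-target EI values by product (an assumed combination rule)."""
    score = 1.0
    for (mean, std), best in zip(posteriors, bests):
        score *= expected_improvement(mean, std, best)
    return score


# Best affinity observed so far for each of two targets (made-up values).
bests = [8.0, 8.0]

# Hypothetical per-target posteriors, as separate models would provide them.
candidates = {
    "A": [(8.0, 0.5), (8.0, 0.5)],  # decent on both targets
    "B": [(9.5, 0.5), (6.0, 0.5)],  # strong on one target, weak on the other
}

ranking = sorted(
    candidates,
    key=lambda name: separated_score(candidates[name], bests),
    reverse=True,
)
print(ranking)
```

With a product rule, a candidate that is promising on every target outranks one that excels on a single target, which is one simple way to express a multi-target binding goal without fitting a composite objective.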
Long List
Cheminformatics
Enzyformer: a two-stage pretrained model for enzymatic retrosynthesis
Advances and Perspectives in Computer-Assisted Structure Elucidation: A Review
When machine learning models learn chemistry II: applying WISP to real-world examples
Enhancing Molecular Structure Elucidation with Reasoning-Capable LLMs
Symmetry-Sensitive Analysis of Molecular Graph Neural Network Models
Rapid Parallel Virtual Screening Aids the Discovery of Novel P-Glycoprotein Inhibitors
Solvent-Dependent Conformational Diversity of Polysaccharide-Based Chiral Selectors
Machine Learning and Computer Simulation Disentangle the Fuzzy Inhibitor Binding by Hsp90
A Review of Current Computational Tools for Peptide–Protein Docking
Infrared Spectral Descriptors for Reaction Yield Prediction: Toward Redefining Experimental Spaces
Rapid Generation of Transition-State Conformer Ensembles via Constrained Distance Geometry
Pairwise Neural Networks for Ranking Molecular Structures Based on Properties
Boltz-ABFE: Free Energy Perturbation without Crystal Structures
MolQuery: Prediction of Lipid Synthesizability Using Active Learning
ChemGraphX: an open-source web tool for computing topological indices and entropy measures
Elucidating Ligand Charge Effects in MR1 Cell-Surface Translocation Using Molecular Simulations
WeMol: A Cloud-Based and Zero-Code Platform for AI-Driven Molecular Design and Simulation
Large Language Model Agent for Modular Task Execution in Drug Discovery
PROTAC-Mediated Ternary Complex Stability with Ricin Toxin A: A Computational Perspective
MedChem
PhotoChem
Other
Direct and Enantioselective Acylation of Diverse C(sp3)–H Bonds with Aldehydes
Ten quick tips for using the NIH Comparative Genomics Resource
Palate Cleanser
whatever,
Manas
















