This Week In Cheminformatics: Issue #008

Weighted fingerprints, curated annotation of ChEMBL, a compound prioritization method and a long list of papers from last week

Feb 15, 2026

Highlights

Guiding Similarity Search in Chemical Fragment Spaces with Weighted Fingerprints

For those “navigating” ultra-large chemical fragment spaces like Enamine REAL (70B+) or SAVI, Lübbers et al. present Weighted SpaceLight, a new version of the SpaceLight algorithm that preserves and retrieves specific substructures during high-speed fingerprint similarity searching. Standard Tanimoto-based searches often discard essential pharmacophores or scaffolds in favor of global similarity. To address this limitation, the method lets the user assign higher weights to specific atom environments via SMARTS patterns, directly within the fingerprint comparison step. The impact on retrieval precision is notable: in a validation study using a glucosyltransferase inhibitor, the unweighted search preserved the core scaffold in less than 5% of hits, whereas the weighted approach achieved preservation rates of up to 95% while maintaining chemical diversity. Critically, this “focused” search incurs negligible computational overhead, offering a scalable solution for enforcing scaffold hopping constraints without resorting to computationally expensive post-filtering.
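To make the idea of weighting inside the comparison step concrete, here is a minimal sketch of a weighted Tanimoto coefficient over plain feature sets. It is not the SpaceLight implementation, which operates on fragment spaces rather than enumerated molecules. The feature names, the weight of 5.0, and the function `weighted_tanimoto` are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Hashable, Set


def weighted_tanimoto(
    a: Set[Hashable],
    b: Set[Hashable],
    weights: Dict[Hashable, float],
    default_weight: float = 1.0,
) -> float:
    """Summed weight of shared features divided by summed weight of all features."""
    union = a | b
    if not union:
        return 0.0
    shared = a & b

    def w(feature: Hashable) -> float:
        return weights.get(feature, default_weight)

    return sum(w(f) for f in shared) / sum(w(f) for f in union)


# Hypothetical feature sets standing in for fingerprint features.
query = {"core1", "core2", "r1", "r2"}
keeps_core = {"core1", "core2", "x1", "x2"}  # shares the core with the query
loses_core = {"r1", "r2", "y1", "y2"}        # shares only the substituents

# Assumption: these are the features hit by the user's SMARTS pattern.
core_weights = {"core1": 5.0, "core2": 5.0}

for name, hit in [("keeps_core", keeps_core), ("loses_core", loses_core)]:
    plain = weighted_tanimoto(query, hit, {})
    focused = weighted_tanimoto(query, hit, core_weights)
    print(f"{name}: unweighted={plain:.3f} weighted={focused:.3f}")
```

With an empty weight table the two hits tie, because each shares two of six features with the query. Once the core features are up-weighted, the hit that keeps the core scores clearly higher, and no separate filtering pass is needed.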

Integrating artificial intelligence and manual curation to enhance bioassay annotations in ChEMBL

Smit et al. present a “FAIRification” of ChEMBL bioactivity data that combines manual curation with AI-driven retrospective enrichment of legacy assay metadata. The authors developed a so-called “perfect assay description” template and trained a spaCy-based Named Entity Recognition (NER) model that extracts experimental methods with an F1-score of 0.94, successfully annotating 57% of Binding and Functional assays. This is complemented by a multi-class classification model that refines the coarse ASSAY_TYPE schema into granular “broad assay categories” (e.g., distinguishing cell phenotype from protein activity), achieving high-confidence predictions for 88% of literature-derived assays. Furthermore, they used regular expressions to extract ADME parameters, such as dose and administration route, from over 17,000 PK/PD assays, and established an automated pipeline for mapping these terms to the BioAssay Ontology. Code is here.
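As a rough illustration of regex-based extraction from assay descriptions, the sketch below pulls a dose and an administration route out of free text. The patterns, the unit and route vocabularies, the example sentences, and the function name `parse_pk_description` are assumptions made up for this example. They are not the expressions used by the authors.

```python
import re
from typing import Dict, Optional, Union

# Hypothetical patterns: a number followed by a per-kg unit, and a short route code.
DOSE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(mg/kg|ug/kg|g/kg)", re.IGNORECASE)
ROUTE_RE = re.compile(r"\b(po|iv|ip|sc)\b", re.IGNORECASE)

Parsed = Dict[str, Optional[Union[float, str]]]


def parse_pk_description(text: str) -> Parsed:
    """Return the first dose and route found in text, or None for missing fields."""
    dose = DOSE_RE.search(text)
    route = ROUTE_RE.search(text)
    return {
        "dose_value": float(dose.group(1)) if dose else None,
        "dose_unit": dose.group(2).lower() if dose else None,
        "route": route.group(1).lower() if route else None,
    }


# Made-up descriptions, loosely in the style of PK assay text.
examples = [
    "Pharmacokinetics in rat at 10 mg/kg, po",
    "AUC in mouse after 2.5 mg/kg IV dosing",
    "Inhibition of CYP3A4",
]
for description in examples:
    print(description, "->", parse_pk_description(description))
```

Word boundaries around the route codes keep "po" from matching inside longer words. A real pipeline would need a much larger vocabulary and handling for multiple doses per description.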

A simple compound prioritization method for drug discovery considering multi-target binding

This study in Digital Discovery presents a compelling framework for multi-objective active learning that decouples the training and acquisition of distinct molecular properties. By training separate Gaussian Process models for individual target affinities rather than fitting a composite objective function, the authors demonstrate that a “separated acquisition” strategy (i.e., one using a modified Expected Improvement function) significantly outperforms conventional models. Validated retrospectively on the DOCKSTRING benchmark, this approach improved the retrieval of the top 0.04–0.4% of binders and increased the Spearman rank correlation of predictions by a factor of 1.5 compared to joint modeling.
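The sketch below shows one way per-target acquisition could look. It computes the standard closed-form Expected Improvement for each target from that target's own posterior mean and standard deviation, then multiplies the values into one score. The product rule is an assumption for illustration only, and it is not the modified Expected Improvement from the paper. The posterior numbers are hardcoded stand-ins for the output of separately trained Gaussian Process models, and higher scores are assumed to be better.

```python
import math
from typing import Dict, List, Sequence, Tuple


def expected_improvement(mu: float, sigma: float, best: float) -> float:
    """Standard EI for maximization under a normal posterior N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return max((mu - best) * cdf + sigma * pdf, 0.0)


def separated_score(
    posteriors: Sequence[Tuple[float, float]], bests: Sequence[float]
) -> float:
    """Product of per-target EI values (illustrative combination rule)."""
    score = 1.0
    for (mu, sigma), best in zip(posteriors, bests):
        score *= expected_improvement(mu, sigma, best)
    return score


def rank_candidates(
    candidates: Dict[str, Sequence[Tuple[float, float]]], bests: Sequence[float]
) -> List[str]:
    """Sort candidate names by descending combined score."""
    return sorted(
        candidates, key=lambda name: separated_score(candidates[name], bests), reverse=True
    )


# Hypothetical (mean, std) predictions per target, one model per target.
candidates = {
    "A": [(9.0, 0.5), (8.5, 0.5)],   # promising on both targets
    "B": [(10.0, 0.5), (6.0, 0.5)],  # strong on target 1, weak on target 2
}
best_so_far = [8.0, 8.0]  # best observed score per target

print(rank_candidates(candidates, best_so_far))
```

Because each target contributes its own improvement term, a candidate that excels on one target but is unlikely to improve on the other is pushed down the ranking, which is the behavior a multi-target prioritization scheme is after.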