BioMiner: Unlocking Drug Discovery Data

BioMiner, a novel multi-modal framework, automates protein-ligand bioactivity extraction, accelerating drug discovery and enabling identification of novel therapeutic candidates.

Figure: The BioMiner multi-modal extraction framework for protein-ligand bioactivity.

Biomedical literature is growing faster than manual curation can keep up with, which makes vital protein-ligand bioactivity data hard to extract at scale and creates a critical bottleneck for drug discovery. Two requirements compound the problem: interpreting biochemical semantics that are distributed across a publication, and reconstructing precise chemical structures, including Markush structures, in which a shared scaffold carries variable substituents.

Deconstructing Bioactivity: Semantic Interpretation Meets Structure Resolution

BioMiner's core innovation is its explicit separation of bioactivity semantic interpretation from ligand structure construction. The multi-modal extraction framework uses direct reasoning for semantic understanding and a novel chemical-structure-grounded visual reasoning paradigm for inferring inter-structure relationships. Exact molecular construction is then offloaded to specialized domain chemistry tools. This approach, detailed in a recent arXiv preprint, addresses the dual challenge of understanding complex biological interactions and accurately representing the chemical entities involved.
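As a rough illustration of that separation, the sketch below keeps the two steps independent and joins their outputs at the end. It is not the preprint's implementation: `ActivityRecord`, `resolve_structure`, `join_triplets`, the `{R1}` placeholder syntax, and the example data are all hypothetical, and plain string substitution stands in for a real chemistry toolkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActivityRecord:
    """Semantic side: what was measured, with no structure attached yet."""
    protein: str
    compound_id: str
    measure: str
    value: float
    unit: str


def resolve_structure(scaffold: str, r_groups: dict[str, str]) -> str:
    """Structure side: fill a Markush-style scaffold template.

    Hypothetical stand-in for a domain chemistry tool: it only substitutes
    text placeholders such as "{R1}" and does no chemical validation.
    """
    return scaffold.format(**r_groups)


def join_triplets(records, structures):
    """Join the two independently produced outputs on compound_id."""
    return [
        (r.protein, structures[r.compound_id], f"{r.measure}={r.value} {r.unit}")
        for r in records
        if r.compound_id in structures
    ]


# Made-up example data, for illustration only.
records = [
    ActivityRecord("NLRP3", "1a", "IC50", 12.0, "nM"),
    ActivityRecord("NLRP3", "1b", "IC50", 85.0, "nM"),
]
scaffold = "c1ccc({R1})cc1"  # benzene ring with one variable substituent
structures = {
    "1a": resolve_structure(scaffold, {"R1": "Cl"}),
    "1b": resolve_structure(scaffold, {"R1": "OC"}),
}
triplets = join_triplets(records, structures)
```

Because the records and the structures meet only at the final join, either step could be swapped out or corrected without touching the other, which is the practical appeal of separating the two concerns.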

BioVista Benchmark and Quantifiable Performance Gains

To evaluate automated extraction rigorously, the authors introduce BioVista, a benchmark comprising 16,457 bioactivity entries from 500 publications. On this benchmark, BioMiner achieves an F1 score of 0.32 for bioactivity triplets, which establishes a quantitative baseline for automated triplet extraction. The practical impact is further shown by its use in building a pre-training database from 11,683 papers, which improved downstream model performance by 3.9%.
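For readers unfamiliar with the metric, the sketch below computes precision, recall, and F1 over sets of triplets. It assumes exact matching of all three elements; the article does not describe BioVista's actual matching criteria, and the function name `triplet_f1` and the example triplets are made up for illustration.

```python
def triplet_f1(predicted, gold):
    """Precision, recall, and F1 over sets of (protein, ligand, activity) triplets.

    Assumes exact matching: a predicted triplet counts only if it is
    identical to a gold triplet in all three elements.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example data, for illustration only.
gold = {
    ("P1", "L1", "IC50=10 nM"),
    ("P1", "L2", "IC50=50 nM"),
    ("P2", "L3", "Ki=5 nM"),
    ("P2", "L4", "Ki=7 nM"),
}
predicted = {
    ("P1", "L1", "IC50=10 nM"),   # correct
    ("P1", "L2", "IC50=500 nM"),  # wrong value, so the whole triplet misses
    ("P2", "L3", "Ki=5 nM"),      # correct
}
precision, recall, f1 = triplet_f1(predicted, gold)
```

Under exact matching, a single wrong element fails the whole triplet, so triplet-level scores are typically lower than scores for any one element taken alone.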

Accelerating Discovery and Identifying Novel Therapeutics

The authors demonstrate BioMiner's practical utility in three applications. First, it supports large-scale data aggregation for pre-training, which improves downstream AI models. Second, integrated into a human-in-the-loop workflow, it doubled the yield of high-quality NLRP3 bioactivity data, contributing to a 38.6% improvement over 28 QSAR models and to the identification of 16 hit candidates with novel scaffolds. Third, when annotating protein-ligand complex bioactivity for the PoseBusters dataset, BioMiner achieved a 5.59-fold speed increase and a 5.75% accuracy improvement compared with manual workflows, which suggests it could substantially accelerate the drug discovery pipeline.

© 2026 StartupHub.ai. All rights reserved.