Benchmark
Catalog entries using this tag:
Entries
BixBench
FULL NAME
BixBench — Comprehensive Benchmark for LLM-based Agents in Computational Biology
DESCRIPTION
BixBench, by FutureHouse and ScienceMachine, is a benchmark designed to evaluate AI agents on real-world bioinformatics tasks. It features 61 real-world analytical scenarios with 205 associated questions and supports both open-answer and multiple-choice evaluation. It tests agents on data analysis, insight generation, and result interpretation in bioinformatics. Current frontier models achieve only ~21% accuracy, leaving significant room for improvement.
URL
Main citation
Mitchener L, Laurent J, Wellawatte G, et al. (2025) BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology. arXiv:2503.00096.
ABSTRACT
Artificial intelligence (AI) is changing scientific research at a rapid pace and is beginning to enable the automation of complex analytical tasks. One of the most promising fields for AI-driven automation is bioinformatics, where data-focused research lends itself to purely computational analysis. We introduce BixBench, a benchmark designed to evaluate AI agents on real-world bioinformatics tasks. BixBench challenges AI models with open-ended analytical research scenarios, requiring them to analyze data, generate insights, and interpret results autonomously. The benchmark comprises over 50 real-world scenarios with nearly 300 associated open-answer questions.
DOI
10.48550/arXiv.2503.00096
DNA Foundation Benchmark (DNA FM Benchmark)
PUBMED_LINK
FULL NAME
Benchmarking DNA Foundation Models for Genomic and Genetic Tasks
DESCRIPTION
The first comprehensive, unbiased benchmark of five DNA foundation models (DNABERT-2, Nucleotide Transformer V2, HyenaDNA, Caduceus-Ph, GROVER) across 57 datasets spanning sequence classification, gene expression prediction, variant effect quantification, and TAD recognition, all using zero-shot embeddings. Key finding: mean pooling of token embeddings consistently outperforms other pooling strategies. Model choice should align with the task: Caduceus-Ph excels at TFBS prediction, NT-v2 at pathogenic variant identification, and HyenaDNA scales to long sequences. Specialized models (Enformer, Sei) still outperform general-purpose DNA models on QTL prediction.
URL
TITLE
Benchmarking DNA foundation models for genomic and genetic tasks.
Main citation
Feng H, Wu L, Zhao B, Huff C, Zhang J, Wu J, Lin L, Wei P, Wu C. (2025) Benchmarking DNA foundation models for genomic and genetic tasks. Nature Communications, 16:10780. doi:10.1038/s41467-025-65823-8. PMID 41315262
ABSTRACT
The rapid evolution of DNA foundation models promises to revolutionize genomics, yet comprehensive evaluations are lacking. Here, we present a comprehensive, unbiased benchmark of five models (DNABERT-2, Nucleotide Transformer V2, HyenaDNA, Caduceus-Ph, and GROVER) across diverse genomic and genetic tasks including sequence classification, gene expression prediction, variant effect quantification, and TAD region recognition, using zero-shot embeddings. Our analysis reveals that mean token embedding consistently and significantly improves sequence classification performance. Model performance varies among tasks and datasets; while general purpose DNA foundation models showed competitive performance in pathogenic variant identification, they were less effective in predicting gene expression and identifying putative causal QTLs compared to specialized models.
DOI
10.1038/s41467-025-65823-8
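The mean token embedding pooling highlighted in the DNA FM Benchmark entry above can be illustrated with a minimal sketch. This is not the authors' code: the function name `mean_pool`, the toy vectors, and the 0/1 padding mask are assumptions chosen for illustration. The idea is to average the per-token embeddings of a sequence, skipping padding positions, to obtain one fixed-length vector per sequence.

```python
from typing import List, Sequence


def mean_pool(
    token_embeddings: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
) -> List[float]:
    """Average token embeddings over positions where the mask is non-zero.

    Hypothetical helper for illustration only; real models return tensors,
    but the arithmetic is the same.
    """
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("embeddings and mask must have the same length")

    # Keep only real tokens; padding positions have mask value 0.
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("no unmasked tokens to pool")

    dim = len(kept[0])
    # Average each embedding dimension across the kept tokens.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input: three token positions, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The pooled vector can then be passed to a lightweight downstream classifier, which is one common way to use zero-shot embeddings.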
TDC
PUBMED_LINK
FULL NAME
Therapeutics Data Commons — AI Foundation for Therapeutic Science
DESCRIPTION
Therapeutics Data Commons (TDC) is a coordinated initiative providing AI-ready datasets and curated benchmarks across the full spectrum of therapeutic modalities (small molecules, biologics, gene therapy) and stages (target identification, hit discovery, lead optimization, manufacturing). It features 100+ datasets across 50+ learning tasks, with standardized evaluation protocols, data splits, and public leaderboards, and supports systematic evaluation of AI methods for drug discovery and development.
URL
TITLE
Artificial intelligence foundation for therapeutic science.
Main citation
Huang K, Fu T, Gao W, Zhao Y, Roohani Y, Leskovec J, Coley CW, Xiao C, Sun J, Zitnik M. (2022) Artificial intelligence foundation for therapeutic science. Nature Chemical Biology, 18(10):1034-1036. doi:10.1038/s41589-022-01131-2. PMID 35970914
ABSTRACT
Artificial intelligence is poised to enable breakthroughs and discoveries in therapeutic science. Therapeutics Data Commons is a coordinated initiative to access and evaluate AI capability across therapeutic modalities and stages of discovery. The Commons is a resource with AI-solvable tasks, AI-ready datasets, and curated benchmarks, providing an ecosystem of tools, libraries, leaderboards, and community resources.
DOI
10.1038/s41589-022-01131-2