Benchmark

Catalog entries using this tag (links open the entry card on its page):

Entries

BixBench

AI Benchmark Bioinformatics LLM Agent Computational Biology FutureHouse

FULL NAME

BixBench — Comprehensive Benchmark for LLM-based Agents in Computational Biology

DESCRIPTION

BixBench, developed by FutureHouse and ScienceMachine, is a benchmark designed to evaluate AI agents on real-world bioinformatics tasks. It features 61 real-world analytical scenarios with 205 associated questions and supports both open-answer and multiple-choice evaluation. It tests agents on data analysis, insight generation, and result interpretation. Current frontier models achieve only about 21% accuracy, which leaves significant room for improvement.

URL

https://github.com/Future-House/BixBench

Main citation

Mitchener L, Laurent J, Wellawatte G, et al. (2025) BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology. arXiv:2503.00096.

ABSTRACT

Artificial intelligence (AI) is changing scientific research at a rapid pace and is beginning to enable the automation of complex analytical tasks. One of the most promising fields for AI-driven automation is bioinformatics, where data-focused research lends itself to purely computational analysis. We introduce BixBench, a benchmark designed to evaluate AI agents on real-world bioinformatics tasks. BixBench challenges AI models with open-ended analytical research scenarios, requiring them to analyze data, generate insights, and interpret results autonomously. The benchmark comprises over 50 real-world scenarios with nearly 300 associated open-answer questions.

DOI

10.48550/arXiv.2503.00096

DNA Foundation Benchmark (DNA FM Benchmark)

AI Benchmark DNA Foundation Model Genomics Genomic Language Model Nature Communications

PUBMED_LINK

41315262

FULL NAME

Benchmarking DNA Foundation Models for Genomic and Genetic Tasks

DESCRIPTION

The first comprehensive, unbiased benchmark of five DNA foundation models (DNABERT-2, Nucleotide Transformer V2, HyenaDNA, Caduceus-Ph, GROVER) across 57 datasets spanning sequence classification, gene expression prediction, variant effect quantification, and topologically associating domain (TAD) recognition, using zero-shot embeddings. Key finding: mean pooling of token embeddings consistently outperforms other pooling strategies. Model choice should align with the task: Caduceus-Ph excels at transcription factor binding site (TFBS) prediction, NT-v2 at pathogenic variant identification, and HyenaDNA scales to long sequences. Specialized models (Enformer, Sei) still outperform general-purpose DNA foundation models on QTL prediction.
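
To make the pooling finding concrete, here is a minimal sketch of mean pooling over token embeddings, written with NumPy. It does not reproduce the benchmark's pipeline: the function name `mean_pool`, the array shapes, and the use of an attention mask to skip padding tokens are assumptions made for illustration.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over non-padding positions.

    token_embeddings: array of shape (batch, seq_len, dim)
    attention_mask:   array of shape (batch, seq_len), 1 = real token, 0 = padding
    (Both shapes are assumptions for this sketch.)
    """
    emb = np.asarray(token_embeddings, dtype=float)
    mask = np.asarray(attention_mask, dtype=float)[..., None]  # (batch, seq_len, 1)
    summed = (emb * mask).sum(axis=1)               # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1.0, None)   # avoid division by zero
    return summed / counts                          # (batch, dim)

# Toy input: one sequence with two real tokens and one padding token.
emb = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = mean_pool(emb, mask)
print(pooled)  # one fixed-size vector per sequence
```

The pooled vector has a fixed size regardless of sequence length, so it can be passed to a downstream classifier without fine-tuning the underlying model.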

URL

https://github.com/ChongWuLab/dna_foundation_benchmark

TITLE

Benchmarking DNA foundation models for genomic and genetic tasks.

Main citation

Feng H, Wu L, Zhao B, Huff C, Zhang J, Wu J, Lin L, Wei P, Wu C. (2025) Benchmarking DNA foundation models for genomic and genetic tasks. Nature Communications, 16:10780. doi:10.1038/s41467-025-65823-8. PMID 41315262

ABSTRACT

The rapid evolution of DNA foundation models promises to revolutionize genomics, yet comprehensive evaluations are lacking. Here, we present a comprehensive, unbiased benchmark of five models (DNABERT-2, Nucleotide Transformer V2, HyenaDNA, Caduceus-Ph, and GROVER) across diverse genomic and genetic tasks including sequence classification, gene expression prediction, variant effect quantification, and TAD region recognition, using zero-shot embeddings. Our analysis reveals that mean token embedding consistently and significantly improves sequence classification performance. Model performance varies among tasks and datasets; while general purpose DNA foundation models showed competitive performance in pathogenic variant identification, they were less effective in predicting gene expression and identifying putative causal QTLs compared to specialized models.

DOI

10.1038/s41467-025-65823-8

TDC

AI Benchmark Drug Discovery Therapeutics Datasets Harvard

PUBMED_LINK

35970914

FULL NAME

Therapeutics Data Commons — AI Foundation for Therapeutic Science

DESCRIPTION

Therapeutics Data Commons (TDC) is a coordinated initiative providing AI-ready datasets and curated benchmarks across the full spectrum of therapeutic modalities (small molecules, biologics, gene therapy) and stages (target identification, hit discovery, lead optimization, manufacturing). It features 100+ datasets across 50+ learning tasks, with standardized evaluation protocols, data splits, and public leaderboards, and supports systematic evaluation of AI methods for drug discovery and development.

URL

https://tdcommons.ai

TITLE

Artificial intelligence foundation for therapeutic science.

Main citation

Huang K, Fu T, Gao W, Zhao Y, Roohani Y, Leskovec J, Coley CW, Xiao C, Sun J, Zitnik M. (2022) Artificial intelligence foundation for therapeutic science. Nature Chemical Biology, 18(10):1034-1036. doi:10.1038/s41589-022-01131-2. PMID 35970914

ABSTRACT

Artificial intelligence is poised to enable breakthroughs and discoveries in therapeutic science. Therapeutics Data Commons is a coordinated initiative to access and evaluate AI capability across therapeutic modalities and stages of discovery. The Commons is a resource with AI-solvable tasks, AI-ready datasets, and curated benchmarks, providing an ecosystem of tools, libraries, leaderboards, and community resources.

DOI

10.1038/s41589-022-01131-2