AI Genomic language model

Curated listings of genomic language models under the AI tab.

DNA/RNA Foundation Models

Genomic language models that learn representations of DNA sequence for variant effect prediction and genome annotation:

1st generation (2021-2024): k-mer or BPE tokenization + BERT-style pretraining on reference genomes (DNABERT, Ji et al. PMID 33538820, Bioinformatics 2021; DNABERT-2, Zhou et al. ICLR 2024).
2nd generation (2023-2024): Biologically motivated architectures with reverse-complement equivariance and long-range operators (HyenaDNA, Nguyen et al. NeurIPS 2023; Caduceus, Schiff et al. ICML 2024).
3rd generation (2025-2026): Species-aware embeddings with Manifold Instance Mixup (DNABERT-S, Zhou et al. Bioinformatics 2025), whole-genome autoregressive training (Evo 2, Brixi et al. PMID 41781614, Nature 2026; Genos, Lin et al. GigaScience 2025), and multimodal reasoning (BioReason, Fallahpour et al. NeurIPS 2025).

Scaling trajectory: context length from 512 bp → 1M+ bp; training data from single genomes → whole-genome alignments → cross-species.

Summary Table

Name	Main citation	Year
AlphaGenome	Avsec Ž et al., Nature, 2026	2026
BioReason	Fallahpour A et al., NeurIPS, 2025	2025
Caduceus	Schiff Y et al., PMLR, 2024	2024
DNABERT-2	Zhou Z et al., ICLR, 2024	2024
DNABERT-S	Zhou Z et al., Bioinformatics, 2025	2025
DNABERT	Ji Y et al., Bioinformatics, 2021	2021
Evo 2	Brixi G et al., Nature, 2026	2026
GPN-MSA	Benegas G et al., Nat Biotechnol, 2025	2025
Genos	Lin A et al., GigaScience, 2025	2025
HyenaDNA	Nguyen E et al., NeurIPS, 2023	2023
Nucleotide Transformer	Dalla-Torre H et al., Nat Methods, 2025	2025
PromoterAI	Jaganathan K et al., Science, 2025	2025

AlphaGenome

AI

PUBMED_LINK

41606153

DESCRIPTION

Unified deep learning model that predicts molecular phenotypes from DNA sequence—including gene expression, chromatin accessibility, histone marks, TF binding, splicing, and contact maps—at single-nucleotide resolution for variant effect interpretation.

URL

https://github.com/google-deepmind/alphagenome

KEYWORDS

variant effect prediction, regulatory genomics, deep learning, single-nucleotide resolution, DNA, chromatin, gene expression, TF binding

TITLE

Advancing regulatory variant effect prediction with AlphaGenome.

Main citation

Avsec Ž, Latysheva N, Cheng J, Novati G, ...&, Kohli P. (2026) Advancing regulatory variant effect prediction with AlphaGenome. Nature, 649 (8099) 1206-1218. doi:10.1038/s41586-025-10014-0. PMID 41606153

ABSTRACT

Deep learning models that predict functional genomic measurements from DNA sequences are powerful tools for deciphering the genetic regulatory code. Existing methods involve a trade-off between input sequence length and prediction resolution, thereby limiting their modality scope and performance. We present AlphaGenome, a unified DNA sequence model, which takes as input 1 Mb of DNA sequence and predicts thousands of functional genomic tracks up to single-base-pair resolution across diverse modalities. The modalities include gene expression, transcription initiation, chromatin accessibility, histone modifications, transcription factor binding, chromatin contact maps, splice site usage and splice junction coordinates and strength. Trained on human and mouse genomes, AlphaGenome matches or exceeds the strongest available external models in 25 of 26 evaluations of variant effect prediction. The ability of AlphaGenome to simultaneously score variant effects across all modalities accurately recapitulates the mechanisms of clinically relevant variants near the TAL1 oncogene. To facilitate broader use, we provide tools for making genome track and variant effect predictions from sequence.

DOI

10.1038/s41586-025-10014-0

BioReason

AI

DESCRIPTION

BioReason is a DNA-LLM model that incentivizes multimodal biological reasoning by integrating DNA sequence representations with biological knowledge. Published at NeurIPS 2025, it achieves state-of-the-art on several genomic reasoning benchmarks.

URL

https://github.com/bowang-lab/BioReason

KEYWORDS

DNA-LLM, multimodal reasoning, biological reasoning, foundation model, NeurIPS 2025, long-context, interpretability

TITLE

BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model.

Main citation

Fallahpour A, Magnuson A, Gupta P, ...&, Wang B. (2025) BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model. NeurIPS 2025.

ABSTRACT

Unlocking deep, interpretable biological reasoning from complex genomic data is a major AI challenge hindering scientific discovery. Current DNA foundation models excel at encoding sequence information but lack the ability to perform explicit reasoning over biological concepts. We present BioReason, a DNA-LLM model that integrates a genome foundation model with a large language model to enable multimodal biological reasoning. BioReason is trained to reason over DNA sequences and biological text jointly, enabling interpretable predictions and natural language explanations of genomic functions. The model achieves state-of-the-art performance across multiple genomic reasoning tasks.

Caduceus

AI

PUBMED_LINK

40567809

DESCRIPTION

Caduceus is the first family of reverse-complement (RC) equivariant bi-directional long-range DNA language models, built on the Mamba state space model backbone with BiMamba and MambaDNA blocks. Published at ICML 2024, it excels at long-range variant effect prediction.

URL

https://github.com/kuleshov-group/caduceus

KEYWORDS

Mamba, RC equivariance, bi-directional, long-range DNA, variant effect prediction, state space model, ICML 2024

TITLE

Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling.

Main citation

Schiff Y, Kao CH, Gokaslan A, Dao T, Gu A, Kuleshov V. (2024) Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling. PMLR, 235 43632-43648. PMID 40567809

ABSTRACT

Large-scale sequence modeling has sparked rapid advances that now extend into biology and genomics. However, modeling genomic sequences introduces challenges such as the need to model long-range token interactions, the effects of upstream and downstream regions of the genome, and the reverse complementarity (RC) of DNA. Here, we propose an architecture motivated by these challenges that builds off the long-range Mamba block, and extends it to a BiMamba component that supports bi-directionality, and to a MambaDNA block that additionally supports RC equivariance. We use MambaDNA as the basis of Caduceus, the first family of RC equivariant bi-directional long-range DNA language models, and we introduce pre-training and fine-tuning strategies that yield Caduceus DNA foundation models. Caduceus outperforms previous long-range models on downstream benchmarks; on a challenging long-range variant effect prediction task, Caduceus exceeds the performance of prior models.
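
The reverse complementarity mentioned in the abstract can be illustrated with a small sketch. Note that this is not the Caduceus method: Caduceus builds RC equivariance into the architecture through its MambaDNA block, whereas the sketch below uses a simpler, swapped-in technique (averaging a score over both strands) to obtain a strand-invariant output. The names `toy_score` and `rc_invariant_score` are hypothetical and chosen only for illustration.

```python
from typing import Callable

# Translation table for complementing bases; N (unknown) maps to itself.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def toy_score(seq: str) -> float:
    """Hypothetical strand-sensitive score: fraction of A bases in the sequence."""
    return seq.upper().count("A") / len(seq)


def rc_invariant_score(seq: str, score_fn: Callable[[str], float]) -> float:
    """Average a score over both strands so the result ignores strand choice."""
    return 0.5 * (score_fn(seq) + score_fn(reverse_complement(seq)))


forward = "AAGC"
reverse = reverse_complement(forward)  # "GCTT"

# The raw score differs between strands; the averaged score does not.
print(toy_score(forward), toy_score(reverse))
print(rc_invariant_score(forward, toy_score), rc_invariant_score(reverse, toy_score))
```

Averaging works here because taking the reverse complement twice returns the original sequence, so both strands feed the same pair of values into the average.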

DNABERT

AI

PUBMED_LINK

33538820

DESCRIPTION

DNABERT is a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model for the language of genomic DNA. It uses k-mer tokenization and masked language modeling to learn a global, transferable understanding of genomic DNA sequences. After fine-tuning, it achieves state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites.

URL

https://github.com/jerryji1993/DNABERT

KEYWORDS

BERT, Transformer, pre-trained, promoter prediction, splice site, transcription factor binding site, k-mer

TITLE

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome.

Main citation

Ji Y, Zhou Z, Liu H, Davuluri RV. (2021) DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics, 37 (15) 2112-2120. doi:10.1093/bioinformatics/btab083. PMID 33538820

ABSTRACT

Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationship, which previous informatics methods often fail to capture especially in data-scarce scenarios. To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture global and transferrable understanding of genomic DNA sequences based on up and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory elements prediction and demonstrate its ease of use, accuracy and efficiency. We show that the single pre-trained transformers model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide importance for model prediction, contributing to better interpretability. DNABERT is publicly available at https://github.com/jerryji1993/DNABERT.

DOI

10.1093/bioinformatics/btab083
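
As a rough illustration of the k-mer tokenization named in the DNABERT description, the sketch below splits a sequence into overlapping k-mers. The stride of 1, the default `k = 6`, and the function name `kmer_tokenize` are assumptions made for this example; they are not taken from the entry above.

```python
def kmer_tokenize(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers (stride 1, an assumption)."""
    if k <= 0:
        raise ValueError("k must be positive")
    sequence = sequence.upper()
    # A sequence of length n yields n - k + 1 overlapping k-mers;
    # sequences shorter than k yield an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


tokens = kmer_tokenize("ATGCGTAC", k=3)
print(tokens)  # ['ATG', 'TGC', 'GCG', 'CGT', 'GTA', 'TAC']
```

Each k-mer then acts as one vocabulary item, so a single-base change alters up to k neighboring tokens rather than one.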

DNABERT-2

AI

DESCRIPTION

DNABERT-2 is an efficient foundation model for multi-species genome understanding, improving upon DNABERT with Byte-Pair Encoding (BPE) tokenization, attention head pruning, and improved training techniques. Published at ICLR 2024, it achieves SOTA across 28 genome prediction tasks while being significantly more efficient than previous models.

URL

https://github.com/MAGICS-LAB/DNABERT_2

KEYWORDS

BERT, Transformer, BPE tokenization, multi-species genome, foundation model, ICLR 2024

TITLE

DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome.

Main citation

Zhou Z, Ji Y, Li W, Dutta P, Davuluri RV, Liu H. (2024) DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome. ICLR 2024.

ABSTRACT

Deciphering the language of non-coding DNA is a critical challenge in genome research. Existing approaches often rely on k-mer based tokenization and single-species datasets, limiting their effectiveness for multi-species genome understanding. Here, we introduce DNABERT-2, a foundation model that leverages Byte-Pair Encoding (BPE) tokenization to capture meaningful DNA units across species, combined with attention head pruning and other optimizations for efficient training and inference. We also present a comprehensive multi-species genome benchmark (Genome Understanding Evaluation, GUE) covering 28 tasks across 7 species. DNABERT-2 achieves state-of-the-art performance across diverse genome prediction tasks, demonstrating superior efficiency and generalization. The model and benchmark are publicly available.

DNABERT-S

AI

PUBMED_LINK

40662791

DESCRIPTION

DNABERT-S builds upon DNABERT-2 to develop species-aware DNA embeddings via Manifold Instance Mixup (MI-Mix) contrastive learning and Curriculum Contrastive Learning (C2LR), enabling unsupervised species differentiation from DNA sequences. Published in Bioinformatics (ISMB 2025 proceedings).

URL

https://github.com/MAGICS-LAB/DNABERT_S

KEYWORDS

species awareness, contrastive learning, manifold instance mixup, curriculum learning, long-read sequencing, ISMB 2025

TITLE

DNABERT-S: pioneering species differentiation with species-aware DNA embeddings.

Main citation

Zhou Z, Wu W, Ho H, ...&, Liu H. (2025) DNABERT-S: pioneering species differentiation with species-aware DNA embeddings. Bioinformatics, 41 (Supplement_1) i255-i264. doi:10.1093/bioinformatics/btaf188. PMID 40662791

ABSTRACT

We introduce DNABERT-S, a tailored genome model that develops species-aware embeddings to naturally cluster and segregate DNA sequences of different species in the embedding space. Differentiating species from genomic sequences (i.e. DNA and RNA) is vital yet challenging, since many real-world species remain uncharacterized, lacking known genomes for reference. Embedding-based methods are therefore used to differentiate species in an unsupervised manner. DNABERT-S builds upon a pre-trained genome foundation model named DNABERT-2. To encourage effective embeddings to error-prone long-read DNA sequences, we introduce Manifold Instance Mixup (MI-Mix), a contrastive objective that mixes the hidden representations of DNA sequences at randomly selected layers and trains the model to recognize and differentiate these mixed proportions at the output layer. We further enhance it with the proposed Curriculum Contrastive Learning (C2LR) strategy. Empirical results on 28 diverse datasets show DNABERT-S achieves state-of-the-art performance across multiple species classification and clustering tasks.

DOI

10.1093/bioinformatics/btaf188

Evo 2

AI

PUBMED_LINK

41781614

FULL NAME

Evo 2 DNA foundation model

DESCRIPTION

A genomic foundation model using the StripedHyena 2 architecture, trained autoregressively on OpenGenome2 (trillions of nucleotides across prokaryotic, eukaryotic, archaeal, and phage genomes) at single-nucleotide resolution with long context (up to about one megabase). Supports generalist prediction and design tasks spanning DNA, RNA, and proteins; code and weights are open source with Hugging Face checkpoints.

URL

https://github.com/arcinstitute/evo2

KEYWORDS

DNA foundation model, autoregressive, StripedHyena 2, prokaryotic, eukaryotic, genome design, variant effect, long context, open source

TITLE

Genome modelling and design across all domains of life with Evo 2.

Main citation

Brixi G, Durrant MG, Ku J, Naghipourfar M, ...&, Hie BL. (2026) Genome modelling and design across all domains of life with Evo 2. Nature. doi:10.1038/s41586-026-10176-5. PMID 41781614

ABSTRACT

All of life encodes information with DNA. Although tools for genome sequencing, synthesis and editing have transformed biological research, we still lack sufficient understanding of the immense complexity encoded by genomes to predict the effects of many classes of genomic changes or to intelligently compose new biological systems. Artificial intelligence models that learn information from genomic sequences across diverse organisms have increasingly advanced prediction and design capabilities. Here we introduce Evo 2, a biological foundation model trained on 9 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life to have a 1 million token context window with single-nucleotide resolution. Evo 2 learns to accurately predict the functional impacts of genetic variation, from noncoding pathogenic mutations to clinically significant BRCA1 variants, without task-specific fine-tuning. Mechanistic interpretability analyses reveal that Evo 2 learns representations associated with biological features, including exon-intron boundaries, transcription factor binding sites, protein structural elements and prophage genomic regions. The generative abilities of Evo 2 produce mitochondrial, prokaryotic and eukaryotic sequences at genome scale with greater naturalness and coherence than previous methods. Evo 2 also generates experimentally validated chromatin accessibility patterns when guided by predictive models and inference-time search. We have made Evo 2 fully open, including model parameters, training code, inference code and the OpenGenome2 dataset, to accelerate the exploration and design of biological complexity.

DOI

10.1038/s41586-026-10176-5

GPN-MSA

AI

PUBMED_LINK

39747647

FULL NAME

Genomic Pretrained Network with Multiple-Sequence Alignment

DESCRIPTION

GPN-MSA is a DNA language model leveraging whole-genome multiple-sequence alignments across species to predict the effects of genome-wide variants, achieving outstanding performance on deleteriousness prediction for both coding and noncoding variants. Published in Nature Biotechnology.

URL

https://github.com/songlab-cal/gpn

KEYWORDS

multiple sequence alignment, variant effect prediction, noncoding variants, ClinVar, COSMIC, gnomAD, DNA language model

TITLE

A DNA language model based on multispecies alignment predicts the effects of genome-wide variants.

Main citation

Benegas G, Albors C, Aw AJ, Ye C, Song YS. (2025) A DNA language model based on multispecies alignment predicts the effects of genome-wide variants. Nat Biotechnol, 43 (12) 1960-1965. doi:10.1038/s41587-024-02511-w. PMID 39747647

ABSTRACT

Protein language models have demonstrated remarkable performance in predicting the effects of missense variants but DNA language models have not yet shown a competitive edge for complex genomes such as that of humans. This limitation is particularly evident when dealing with the vast complexity of noncoding regions that comprise approximately 98% of the human genome. To tackle this challenge, we introduce GPN-MSA (genomic pretrained network with multiple-sequence alignment), a framework that leverages whole-genome alignments across multiple species while taking only a few hours to train. Across several benchmarks on clinical databases (ClinVar, COSMIC and OMIM), experimental functional assays (deep mutational scanning and DepMap) and population genomic data (gnomAD), our model for the human genome achieves outstanding performance on deleteriousness prediction for both coding and noncoding variants. We provide precomputed scores for all ~9 billion possible single-nucleotide variants in the human genome.

DOI

10.1038/s41587-024-02511-w
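
The figure of roughly 9 billion possible single-nucleotide variants follows from simple counting: each position with a known base has three alternative alleles. The sketch below enumerates them for a short sequence and repeats the arithmetic at genome scale. The genome length of 3 billion bp is a round-number assumption for illustration, and `enumerate_snvs` is a hypothetical helper, not part of the GPN-MSA code.

```python
from collections.abc import Iterator

BASES = "ACGT"


def enumerate_snvs(sequence: str) -> Iterator[tuple[int, str, str]]:
    """Yield (position, ref, alt) for every possible single-nucleotide variant."""
    for pos, ref in enumerate(sequence.upper()):
        if ref not in BASES:
            continue  # skip ambiguous bases such as N
        for alt in BASES:
            if alt != ref:
                yield pos, ref, alt


variants = list(enumerate_snvs("ACGT"))
print(len(variants))  # 12: four positions times three alternative alleles

# Same arithmetic at genome scale, assuming a round 3 billion bp genome.
ASSUMED_GENOME_LENGTH = 3_000_000_000
print(3 * ASSUMED_GENOME_LENGTH)  # 9000000000
```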

Genos

AI

PUBMED_LINK

41122975

DESCRIPTION

Genos is a human-centric genomic foundation model using mixture-of-experts architecture (Genos-1.2B/Genos-10B) for million-basepair sequence modeling, trained on high-quality human de novo assemblies including the Human Pangenome Reference Consortium data. Published in GigaScience.

URL

https://github.com/BGI-HangzhouAI/Genos

KEYWORDS

mixture of experts, million-basepair context, human pangenome, human-centric, variant effect, structural variation

TITLE

Genos: a human-centric genomic foundation model.

Main citation

Lin A, Xie B, Ye C, ...&, Wang Z. (2025) Genos: a human-centric genomic foundation model. GigaScience, 14 giaf132. doi:10.1093/gigascience/giaf132. PMID 41122975

ABSTRACT

The rapid expansion of human genomic data demands foundation models that manage ultra-long sequences and capture population diversity, limitations common in existing models that lack human-specific representation, and clinical inference efficiency. Here, we introduce Genos (Genos-1.2B/Genos-10B), a human-centric genomic foundation model engineered for million-basepair sequence modeling. Genos utilizes a large-scale mixture of experts structure, optimized for a 1-Mb context, trained on high-quality human de novo assemblies from datasets such as the Human Pangenome Reference Consortium and the Human Genome Structural Variation Consortium, representing diverse global populations. A suite of optimization strategies was implemented to ensure training stability and enhance computational efficiency, which collectively reduces costs and facilitates million-basepair context modeling. Functionally, Genos performs single-nucleotide resolution analysis and dynamically simulates the cascade effects of genetic variation on molecular phenotypes, including the influence of both common and rare single-nucleotide variants as well as structural variants.

DOI

10.1093/gigascience/giaf132

HyenaDNA

AI

PUBMED_LINK

37426456

DESCRIPTION

HyenaDNA is a genomic foundation model using implicit convolution operators (Hyena) to achieve up to 1 million token context length at single nucleotide resolution, overcoming the quadratic scaling limitations of Transformer-based models. Published at NeurIPS 2023.

URL

https://github.com/HazyResearch/hyena-dna

KEYWORDS

Hyena, implicit convolution, long-range context, single nucleotide resolution, SNP, regulatory elements, NeurIPS 2023

TITLE

HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution.

Main citation

Nguyen E, Poli M, Faizi M, ...&, Ré C. (2023) HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution. NeurIPS, 36 43177-43201.

ABSTRACT

Genomic (DNA) sequences encode an enormous amount of information for gene regulation and protein synthesis. Similar to natural language models, researchers have proposed foundation models in genomics to learn generalizable features from unlabeled genome data that can then be fine-tuned for downstream tasks such as identifying regulatory elements. Due to the quadratic scaling of attention, previous Transformer-based genomic models have used 512 to 4k tokens as context (<0.001% of the human genome), significantly limiting the modeling of long-range interactions in DNA. In addition, these methods rely on tokenizers or fixed k-mers to aggregate meaningful DNA units, losing single nucleotide resolution where subtle genetic variations can completely alter protein function via single nucleotide polymorphisms (SNPs). Recently, Hyena, a large language model based on implicit convolutions was shown to match attention in quality while allowing longer context lengths and lower time complexity. Leveraging Hyena, we present HyenaDNA, a genomic foundation model that scales context length to 1 million tokens at single nucleotide resolution, an up to 500x increase over previous dense attention-based models. HyenaDNA achieves state-of-the-art on 12 of 18 benchmarks and excels at long-range regulatory element prediction and SNP effect prediction.

DOI

10.48550/arXiv.2306.15794
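
The quadratic scaling of attention mentioned in the abstract can be made concrete with a short calculation of how many pairwise token interactions dense self-attention computes at each context length. The log-linear comparison column is an assumption added for illustration (the entry only says Hyena has lower time complexity), and both function names are hypothetical.

```python
import math


def attention_pairs(context_length: int) -> int:
    """Pairwise token interactions computed by dense self-attention (L * L)."""
    return context_length * context_length


def log_linear_cost(context_length: int) -> float:
    """Illustrative L * log2(L) cost; an assumption, not a figure from the entry."""
    return context_length * math.log2(context_length)


for length in (512, 4_096, 1_000_000):
    print(
        f"{length:>9,} tokens: {attention_pairs(length):,} pairs, "
        f"~{log_linear_cost(length):,.0f} log-linear units"
    )
```

At 1 million tokens the pairwise count reaches 10^12, which is why dense attention models in this listing stayed at contexts of a few thousand tokens.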

Nucleotide Transformer

AI

PUBMED_LINK

39609566

DESCRIPTION

A family of transformer foundation models (from tens of millions to multi-billion parameters) pretrained on thousands of human and other-species genomes to learn DNA sequence representations. Embeddings support fine-tuning for tasks such as splice-site prediction, enhancer activity, histone marks, and transcription-factor binding, with benchmarks and weights released openly.

URL

https://github.com/instadeepai/nucleotide-transformer

KEYWORDS

Transformer, foundation model, human genome, multi-species, DNA embeddings, splice-site, enhancer, histone marks, TF binding

TITLE

Nucleotide Transformer: building and evaluating robust foundation models for human genomics.

Main citation

Dalla-Torre H, Gonzalez L, Mendoza-Revilla J, Lopez Carranza N, ...&, Pierrot T. (2025) Nucleotide Transformer: building and evaluating robust foundation models for human genomics. Nat Methods, 22 (2) 287-297. doi:10.1038/s41592-024-02523-z. PMID 39609566

ABSTRACT

The prediction of molecular phenotypes from DNA sequences remains a longstanding challenge in genomics, often driven by limited annotated data and the inability to transfer learnings between tasks. Here, we present an extensive study of foundation models pre-trained on DNA sequences, named Nucleotide Transformer, ranging from 50 million up to 2.5 billion parameters and integrating information from 3,202 human genomes and 850 genomes from diverse species. These transformer models yield context-specific representations of nucleotide sequences, which allow for accurate predictions even in low-data settings. We show that the developed models can be fine-tuned at low cost to solve a variety of genomics applications. Despite no supervision, the models learned to focus attention on key genomic elements and can be used to improve the prioritization of genetic variants. The training and application of foundational models in genomics provides a widely applicable approach for accurate molecular phenotype prediction from DNA sequence.

DOI

10.1038/s41592-024-02523-z

PromoterAI

AI

PUBMED_LINK

40440429

DESCRIPTION

A deep learning model from Illumina that scores how variants in gene promoter regions alter predicted gene expression, trained on chromatin and expression-related signals at nucleotide resolution. Distributed as a Python package with precomputed genome-wide scores to support rare-disease and research variant interpretation alongside other splice and protein effect tools.

URL

https://github.com/Illumina/PromoterAI

KEYWORDS

promoter, variant effect, deep learning, rare disease, noncoding, gene expression, Illumina

TITLE

Predicting expression-altering promoter mutations with deep learning.

Main citation

Jaganathan K, Ersaro N, Novakovsky G, Wang Y, ...&, Farh KK. (2025) Predicting expression-altering promoter mutations with deep learning. Science, 389 (6760) eads7373. doi:10.1126/science.ads7373. PMID 40440429

ABSTRACT

Only a minority of patients with rare genetic diseases are presently diagnosed by exome sequencing, suggesting that additional unrecognized pathogenic variants may reside in noncoding sequence. In this work, we describe PromoterAI, a deep neural network that accurately identifies noncoding promoter variants that dysregulate gene expression. We show that promoter variants with predicted expression-altering consequences produce outlier expression at both the RNA and protein levels in thousands of individuals and that these variants experience strong negative selection in human populations. We observed that clinically relevant genes in patients with rare diseases are enriched for such variants and validated their functional impact through reporter assays. Our estimates suggest that promoter variation accounts for 6% of the genetic burden associated with rare diseases.

DOI

10.1126/science.ads7373