AI Genomic language model
Curated listings of genomic language models under the AI tab.
DNA/RNA Foundation Models
Genomic language models that learn representations of DNA sequence for variant effect prediction and genome annotation:
- 1st generation (2021-2024): k-mer tokenization (DNABERT) or BPE tokenization (DNABERT-2) with BERT-style pretraining on reference genomes (DNABERT, Ji et al. PMID 33538820, Bioinformatics 2021; DNABERT-2, Zhou et al. ICLR 2024).
- 2nd generation (2023-2024): Biologically motivated architectures with long-range operators and reverse-complement equivariance (HyenaDNA, Nguyen et al. NeurIPS 2023; Caduceus, Schiff et al. ICML 2024).
- 3rd generation (2025-2026): Species-aware embeddings with Manifold Instance Mixup (DNABERT-S, Zhou et al. PMID 40662791, Bioinformatics 2025), whole-genome autoregressive training (Evo 2, Brixi et al. PMID 41781614, Nature 2026; Genos, Lin et al. PMID 41122975, GigaScience 2025), and multimodal reasoning (BioReason, Fallahpour et al. NeurIPS 2025).
Scaling trajectory: context length from 512 tokens → about 1 Mb; training data from single genomes → whole-genome alignments → cross-species corpora.
Summary Table
| NAME | Main citation | YEAR |
|---|---|---|
| AlphaGenome | Avsec Ž et al., Nature, 2026 | 2026 |
| BioReason | Fallahpour A et al., NeurIPS, 2025 | 2025 |
| Caduceus | Schiff Y et al., PMLR, 2024 | 2024 |
| DNABERT-2 | Zhou Z et al., ICLR, 2024 | 2024 |
| DNABERT-S | Zhou Z et al., Bioinformatics, 2025 | 2025 |
| DNABERT | Ji Y et al., Bioinformatics, 2021 | 2021 |
| Evo 2 | Brixi G et al., Nature, 2026 | 2026 |
| GPN-MSA | Benegas G et al., Nat Biotechnol, 2025 | 2025 |
| Genos | Lin A et al., Gigascience, 2025 | 2025 |
| HyenaDNA | Nguyen E et al., NeurIPS, 2023 | 2023 |
| Nucleotide Transformer | Dalla-Torre H et al., Nat Methods, 2025 | 2025 |
| PromoterAI | Jaganathan K et al., Science, 2025 | 2025 |
AlphaGenome
DESCRIPTION
Unified deep learning model that predicts molecular phenotypes from DNA sequence (gene expression, chromatin accessibility, histone modifications, TF binding, splicing, and chromatin contact maps) at up to single-base-pair resolution for variant effect interpretation.
KEYWORDS
variant effect prediction, regulatory genomics, deep learning, single-nucleotide resolution, DNA, chromatin, gene expression, TF binding
TITLE
Advancing regulatory variant effect prediction with AlphaGenome.
Main citation
Avsec Ž, Latysheva N, Cheng J, Novati G, ...&, Kohli P. (2026) Advancing regulatory variant effect prediction with AlphaGenome. Nature, 649 (8099) 1206-1218. doi:10.1038/s41586-025-10014-0. PMID 41606153
ABSTRACT
Deep learning models that predict functional genomic measurements from DNA sequences are powerful tools for deciphering the genetic regulatory code. Existing methods involve a trade-off between input sequence length and prediction resolution, thereby limiting their modality scope and performance. We present AlphaGenome, a unified DNA sequence model, which takes as input 1 Mb of DNA sequence and predicts thousands of functional genomic tracks up to single-base-pair resolution across diverse modalities. The modalities include gene expression, transcription initiation, chromatin accessibility, histone modifications, transcription factor binding, chromatin contact maps, splice site usage and splice junction coordinates and strength. Trained on human and mouse genomes, AlphaGenome matches or exceeds the strongest available external models in 25 of 26 evaluations of variant effect prediction. The ability of AlphaGenome to simultaneously score variant effects across all modalities accurately recapitulates the mechanisms of clinically relevant variants near the TAL1 oncogene. To facilitate broader use, we provide tools for making genome track and variant effect predictions from sequence.
DOI
10.1038/s41586-025-10014-0
BioReason
DESCRIPTION
BioReason is a DNA-LLM model that incentivizes multimodal biological reasoning by integrating a genome foundation model with a large language model. Published at NeurIPS 2025, it reports state-of-the-art performance across multiple genomic reasoning tasks.
KEYWORDS
DNA-LLM, multimodal reasoning, biological reasoning, foundation model, NeurIPS 2025, long-context, interpretability
TITLE
BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model.
Main citation
Fallahpour A, Magnuson A, Gupta P, ...&, Wang B. (2025) BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model. NeurIPS 2025.
ABSTRACT
Unlocking deep, interpretable biological reasoning from complex genomic data is a major AI challenge hindering scientific discovery. Current DNA foundation models excel at encoding sequence information but lack the ability to perform explicit reasoning over biological concepts. We present BioReason, a DNA-LLM model that integrates a genome foundation model with a large language model to enable multimodal biological reasoning. BioReason is trained to reason over DNA sequences and biological text jointly, enabling interpretable predictions and natural language explanations of genomic functions. The model achieves state-of-the-art performance across multiple genomic reasoning tasks.
Caduceus
DESCRIPTION
Caduceus is the first family of reverse-complement (RC) equivariant bi-directional long-range DNA language models, built on the Mamba state space model backbone with BiMamba and MambaDNA blocks. Published at ICML 2024, it excels at long-range variant effect prediction.
KEYWORDS
Mamba, RC equivariance, bi-directional, long-range DNA, variant effect prediction, state space model, ICML 2024
TITLE
Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling.
Main citation
Schiff Y, Kao CH, Gokaslan A, Dao T, Gu A, Kuleshov V. (2024) Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling. PMLR, 235 43632-43648. PMID 40567809
ABSTRACT
Large-scale sequence modeling has sparked rapid advances that now extend into biology and genomics. However, modeling genomic sequences introduces challenges such as the need to model long-range token interactions, the effects of upstream and downstream regions of the genome, and the reverse complementarity (RC) of DNA. Here, we propose an architecture motivated by these challenges that builds off the long-range Mamba block, and extends it to a BiMamba component that supports bi-directionality, and to a MambaDNA block that additionally supports RC equivariance. We use MambaDNA as the basis of Caduceus, the first family of RC equivariant bi-directional long-range DNA language models, and we introduce pre-training and fine-tuning strategies that yield Caduceus DNA foundation models. Caduceus outperforms previous long-range models on downstream benchmarks; on a challenging long-range variant effect prediction task, Caduceus exceeds the performance of prior models.
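The reverse-complement (RC) equivariance described in the abstract above can be illustrated with a toy sketch. This is not the MambaDNA block or any code from the Caduceus authors: `toy_strand_model` is a hypothetical stand-in for a strand-sensitive per-position model, and `rc_equivariant` shows one simple construction (run the model on both strands and pair the aligned outputs) whose output reverses and swaps channels when the input is reverse-complemented.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def toy_strand_model(seq):
    """Hypothetical per-position model with a left-to-right recurrence.

    It is deliberately strand-sensitive: scoring a sequence and its
    reverse complement gives unrelated outputs.
    """
    weights = {"A": 0.1, "C": 0.7, "G": 0.4, "T": 0.2}
    state, outputs = 0.0, []
    for base in seq:
        state = 0.5 * state + weights[base]
        outputs.append(state)
    return outputs


def rc_equivariant(model, seq):
    """Pair forward-strand outputs with reverse-strand outputs per position."""
    forward = model(seq)
    # Flip the reverse-strand outputs back into forward-strand coordinates.
    reverse = model(reverse_complement(seq))[::-1]
    return list(zip(forward, reverse))


seq = "AACGTGT"
out = rc_equivariant(toy_strand_model, seq)
out_rc = rc_equivariant(toy_strand_model, reverse_complement(seq))

# Equivariance: the RC input yields the position-reversed output
# with the two strand channels swapped.
assert out_rc == [(b, a) for a, b in reversed(out)]
```

The check at the end holds for any `model`, because the wrapper only reorders outputs that were computed from the same two strands.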
DNABERT
DESCRIPTION
DNABERT is a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model for the language of genomic DNA. It uses k-mer tokenization and masked language modeling to learn a global, transferable understanding of genomic DNA sequences. After fine-tuning, it achieves state-of-the-art performance on promoter, splice site, and transcription factor binding site prediction.
KEYWORDS
BERT, Transformer, pre-trained, promoter prediction, splice site, transcription factor binding site, k-mer
TITLE
DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome.
Main citation
Ji Y, Zhou Z, Liu H, Davuluri RV. (2021) DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics, 37 (15) 2112-2120. doi:10.1093/bioinformatics/btab083. PMID 33538820
ABSTRACT
Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationship, which previous informatics methods often fail to capture especially in data-scarce scenarios. To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture global and transferrable understanding of genomic DNA sequences based on up and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory elements prediction and demonstrate its ease of use, accuracy and efficiency. We show that the single pre-trained transformers model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide importance for model prediction, contributing to better interpretability. DNABERT is publicly available at https://github.com/jerryji1993/DNABERT.
DOI
10.1093/bioinformatics/btab083
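The k-mer tokenization and masked language modeling mentioned in the DNABERT entry above can be sketched as follows. This is an illustrative sketch rather than the DNABERT code: it assumes overlapping k-mers with stride 1 and masks a contiguous run of tokens, since with overlapping k-mers a single masked token could otherwise be read off its neighbours. The function names and the `[MASK]` string are hypothetical.

```python
def kmer_tokenize(seq, k=6):
    """Split a DNA string into overlapping k-mers (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def mask_span(tokens, start, length, mask_token="[MASK]"):
    """Mask a contiguous run of tokens.

    Returns the masked token list and a label list that holds the
    original token at masked positions and None elsewhere.
    """
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i in range(start, min(start + length, len(tokens))):
        labels[i] = tokens[i]
        masked[i] = mask_token
    return masked, labels


tokens = kmer_tokenize("ATGCGTAC", k=3)
masked, labels = mask_span(tokens, start=2, length=3)
print(tokens)  # ['ATG', 'TGC', 'GCG', 'CGT', 'GTA', 'TAC']
print(masked)  # ['ATG', 'TGC', '[MASK]', '[MASK]', '[MASK]', 'TAC']
```

A masked language model would then be trained to recover the entries of `labels` that are not `None` from the surrounding unmasked tokens.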
DNABERT-2
DESCRIPTION
DNABERT-2 is an efficient foundation model for multi-species genome understanding, improving upon DNABERT with Byte-Pair Encoding (BPE) tokenization, attention head pruning, and improved training techniques. Published at ICLR 2024, it achieves SOTA across 28 genome prediction tasks while being significantly more efficient than previous models.
KEYWORDS
BERT, Transformer, BPE tokenization, multi-species genome, foundation model, ICLR 2024
TITLE
DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome.
Main citation
Zhou Z, Ji Y, Li W, Dutta P, Davuluri RV, Liu H. (2024) DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome. ICLR 2024.
ABSTRACT
Deciphering the language of non-coding DNA is a critical challenge in genome research. Existing approaches often rely on k-mer based tokenization and single-species datasets, limiting their effectiveness for multi-species genome understanding. Here, we introduce DNABERT-2, a foundation model that leverages Byte-Pair Encoding (BPE) tokenization to capture meaningful DNA units across species, combined with attention head pruning and other optimizations for efficient training and inference. We also present a comprehensive multi-species genome benchmark (Genome Understanding Evaluation, GUE) covering 28 tasks across 7 species. DNABERT-2 achieves state-of-the-art performance across diverse genome prediction tasks, demonstrating superior efficiency and generalization. The model and benchmark are publicly available.
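To make the Byte-Pair Encoding (BPE) idea in the DNABERT-2 entry concrete, here is a minimal sketch of how BPE merges could be learned on DNA strings. It is a generic toy version of the algorithm, not the DNABERT-2 tokenizer: the function names, the tiny corpus, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(sequences, num_merges):
    """Greedily merge the most frequent adjacent token pair, num_merges times."""
    corpus = [list(seq) for seq in sequences]  # start from single nucleotides
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Ties are broken lexicographically so the result is deterministic.
        best = max(sorted(pair_counts), key=pair_counts.get)
        merges.append(best)
        corpus = [merge_pair(tokens, best) for tokens in corpus]
    return merges, corpus


merges, corpus = learn_bpe_merges(["TATATA", "GGTATA"], num_merges=2)
print(merges)  # [('T', 'A'), ('TA', 'TA')]
print(corpus)  # [['TATA', 'TA'], ['G', 'G', 'TATA']]
```

Unlike fixed k-mers, the learned vocabulary has variable-length tokens, so frequent motifs end up as single tokens and sequences become shorter in token count.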
DNABERT-S
DESCRIPTION
DNABERT-S builds upon DNABERT-2 to develop species-aware DNA embeddings via Manifold Instance Mixup (MI-Mix) contrastive learning and Curriculum Contrastive Learning (C2LR), enabling unsupervised species differentiation from DNA sequences. Published in Bioinformatics (ISMB 2025 proceedings).
KEYWORDS
species awareness, contrastive learning, manifold instance mixup, curriculum learning, long-read sequencing, ISMB 2025
TITLE
DNABERT-S: pioneering species differentiation with species-aware DNA embeddings.
Main citation
Zhou Z, Wu W, Ho H, ...&, Liu H. (2025) DNABERT-S: pioneering species differentiation with species-aware DNA embeddings. Bioinformatics, 41 (Supplement_1) i255-i264. doi:10.1093/bioinformatics/btaf188. PMID 40662791
ABSTRACT
We introduce DNABERT-S, a tailored genome model that develops species-aware embeddings to naturally cluster and segregate DNA sequences of different species in the embedding space. Differentiating species from genomic sequences (i.e. DNA and RNA) is vital yet challenging, since many real-world species remain uncharacterized, lacking known genomes for reference. Embedding-based methods are therefore used to differentiate species in an unsupervised manner. DNABERT-S builds upon a pre-trained genome foundation model named DNABERT-2. To encourage effective embeddings to error-prone long-read DNA sequences, we introduce Manifold Instance Mixup (MI-Mix), a contrastive objective that mixes the hidden representations of DNA sequences at randomly selected layers and trains the model to recognize and differentiate these mixed proportions at the output layer. We further enhance it with the proposed Curriculum Contrastive Learning (C2LR) strategy. Empirical results on 28 diverse datasets show DNABERT-S achieves state-of-the-art performance across multiple species classification and clustering tasks.
DOI
10.1093/bioinformatics/btaf188
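The mixing step that the DNABERT-S abstract attributes to Manifold Instance Mixup (MI-Mix) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: hidden states are plain Python lists, each sequence is mixed with one partner given by a hypothetical `partner` index list, and the "mixed proportions" the model must recognize are represented as soft target rows. The contrastive loss itself is omitted.

```python
def mix_hidden_states(hidden, lam, partner):
    """Interpolate each hidden vector with its partner's vector.

    hidden  : list of equal-length float vectors, one per sequence
    lam     : mixing proportion kept from the original sequence (0..1)
    partner : partner[i] is the index of the sequence mixed into row i
    """
    return [
        [lam * a + (1.0 - lam) * b for a, b in zip(hidden[i], hidden[partner[i]])]
        for i in range(len(hidden))
    ]


def mixed_targets(n, lam, partner):
    """Soft targets: row i puts weight lam on i and (1 - lam) on its partner."""
    targets = [[0.0] * n for _ in range(n)]
    for i in range(n):
        targets[i][i] += lam
        targets[i][partner[i]] += 1.0 - lam
    return targets


hidden = [[1.0, 0.0], [0.0, 1.0]]
partner = [1, 0]
print(mix_hidden_states(hidden, 0.75, partner))  # [[0.75, 0.25], [0.25, 0.75]]
print(mixed_targets(2, 0.75, partner))           # [[0.75, 0.25], [0.25, 0.75]]
```

In a full training loop the mixing would be applied at a randomly chosen layer and the remaining layers would process the mixed states, as the abstract describes.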
Evo 2
FULL NAME
Evo 2 DNA foundation model
DESCRIPTION
A genomic foundation model using the StripedHyena 2 architecture, trained autoregressively on OpenGenome2 (about 9 trillion base pairs across prokaryotic, eukaryotic, archaeal, and phage genomes) at single-nucleotide resolution with a context window of up to 1 million tokens. Supports generalist prediction and design tasks spanning DNA, RNA, and proteins; code and weights are open source with Hugging Face checkpoints.
KEYWORDS
DNA foundation model, autoregressive, StripedHyena 2, prokaryotic, eukaryotic, genome design, variant effect, long context, open source
TITLE
Genome modelling and design across all domains of life with Evo 2.
Main citation
Brixi G, Durrant MG, Ku J, Naghipourfar M, ...&, Hie BL. (2026) Genome modelling and design across all domains of life with Evo 2. Nature. doi:10.1038/s41586-026-10176-5. PMID 41781614
ABSTRACT
All of life encodes information with DNA. Although tools for genome sequencing, synthesis and editing have transformed biological research, we still lack sufficient understanding of the immense complexity encoded by genomes to predict the effects of many classes of genomic changes or to intelligently compose new biological systems. Artificial intelligence models that learn information from genomic sequences across diverse organisms have increasingly advanced prediction and design capabilities. Here we introduce Evo 2, a biological foundation model trained on 9 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life to have a 1 million token context window with single-nucleotide resolution. Evo 2 learns to accurately predict the functional impacts of genetic variation, from noncoding pathogenic mutations to clinically significant BRCA1 variants, without task-specific fine-tuning. Mechanistic interpretability analyses reveal that Evo 2 learns representations associated with biological features, including exon-intron boundaries, transcription factor binding sites, protein structural elements and prophage genomic regions. The generative abilities of Evo 2 produce mitochondrial, prokaryotic and eukaryotic sequences at genome scale with greater naturalness and coherence than previous methods. Evo 2 also generates experimentally validated chromatin accessibility patterns when guided by predictive models and inference-time search. We have made Evo 2 fully open, including model parameters, training code, inference code and the OpenGenome2 dataset, to accelerate the exploration and design of biological complexity.
DOI
10.1038/s41586-026-10176-5
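The Evo 2 abstract reports variant effect prediction without task-specific fine-tuning. A common zero-shot recipe for autoregressive sequence models, assumed here for illustration and not taken from the Evo 2 code, is to compare the log-likelihood of the alternate sequence with that of the reference. In the sketch below, `toy_next_base_prob` is a hypothetical stand-in for a real model's next-nucleotide probability, and none of the function names correspond to an Evo 2 API.

```python
import math


def log_likelihood(seq, next_base_prob):
    """Sum log P(base | prefix) over positions 1..len(seq)-1.

    The first base is not scored because it has no preceding context.
    """
    return sum(
        math.log(next_base_prob(seq[:i], seq[i])) for i in range(1, len(seq))
    )


def toy_next_base_prob(prefix, base):
    """Hypothetical model: favours repeating the previous base (0.7 vs 0.1)."""
    return 0.7 if base == prefix[-1] else 0.1


def variant_score(ref_seq, pos, alt, next_base_prob):
    """Log-likelihood of the alternate sequence minus that of the reference.

    pos is a 0-based index; negative scores mean the model finds the
    variant less likely than the reference.
    """
    alt_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    return log_likelihood(alt_seq, next_base_prob) - log_likelihood(
        ref_seq, next_base_prob
    )


score = variant_score("AAAA", pos=2, alt="C", next_base_prob=toy_next_base_prob)
print(score < 0)  # True: the toy model prefers the uninterrupted A run
```

With a real model, `next_base_prob` would be replaced by the model's predicted distribution, and the score would be computed over a window around the variant rather than a four-base string.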
GPN-MSA
FULL NAME
Genomic Pretrained Network with Multiple-Sequence Alignment
DESCRIPTION
GPN-MSA is a DNA language model leveraging whole-genome multiple-sequence alignments across species to predict the effects of genome-wide variants, achieving outstanding performance on deleteriousness prediction for both coding and noncoding variants. Published in Nature Biotechnology.
KEYWORDS
multiple sequence alignment, variant effect prediction, noncoding variants, ClinVar, COSMIC, gnomAD, DNA language model
TITLE
A DNA language model based on multispecies alignment predicts the effects of genome-wide variants.
Main citation
Benegas G, Albors C, Aw AJ, Ye C, Song YS. (2025) A DNA language model based on multispecies alignment predicts the effects of genome-wide variants. Nat Biotechnol, 43 (12) 1960-1965. doi:10.1038/s41587-024-02511-w. PMID 39747647
ABSTRACT
Protein language models have demonstrated remarkable performance in predicting the effects of missense variants but DNA language models have not yet shown a competitive edge for complex genomes such as that of humans. This limitation is particularly evident when dealing with the vast complexity of noncoding regions that comprise approximately 98% of the human genome. To tackle this challenge, we introduce GPN-MSA (genomic pretrained network with multiple-sequence alignment), a framework that leverages whole-genome alignments across multiple species while taking only a few hours to train. Across several benchmarks on clinical databases (ClinVar, COSMIC and OMIM), experimental functional assays (deep mutational scanning and DepMap) and population genomic data (gnomAD), our model for the human genome achieves outstanding performance on deleteriousness prediction for both coding and noncoding variants. We provide precomputed scores for all ~9 billion possible single-nucleotide variants in the human genome.
DOI
10.1038/s41587-024-02511-w
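The figure of about 9 billion possible single-nucleotide variants in the GPN-MSA abstract follows from simple counting: each position has three alternate alleles. The sketch below enumerates them for a short sequence and repeats the arithmetic at genome scale; the round genome size of 3 billion positions is an assumption used only to reproduce the order of magnitude.

```python
def enumerate_snvs(seq):
    """Yield (position, ref, alt) for every possible single-nucleotide variant."""
    for pos, ref in enumerate(seq):
        for alt in "ACGT":
            if alt != ref:
                yield pos, ref, alt


# Every base can change to three others, so a 4-base sequence has 12 SNVs.
print(len(list(enumerate_snvs("ACGT"))))  # 12

# Assumption: roughly 3 billion positions in the human reference genome.
HUMAN_GENOME_POSITIONS = 3_000_000_000
print(HUMAN_GENOME_POSITIONS * 3)  # 9000000000
```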
Genos
DESCRIPTION
Genos is a human-centric genomic foundation model that uses a mixture-of-experts architecture (Genos-1.2B/Genos-10B) for million-basepair sequence modeling, trained on high-quality human de novo assemblies from datasets such as the Human Pangenome Reference Consortium and the Human Genome Structural Variation Consortium. Published in GigaScience.
KEYWORDS
mixture of experts, million-basepair context, human pangenome, human-centric, variant effect, structural variation
TITLE
Genos: a human-centric genomic foundation model.
Main citation
Lin A, Xie B, Ye C, ...&, Wang Z. (2025) Genos: a human-centric genomic foundation model. Gigascience, 14 giaf132. doi:10.1093/gigascience/giaf132. PMID 41122975
ABSTRACT
The rapid expansion of human genomic data demands foundation models that manage ultra-long sequences and capture population diversity, limitations common in existing models that lack human-specific representation, and clinical inference efficiency. Here, we introduce Genos (Genos-1.2B/Genos-10B), a human-centric genomic foundation model engineered for million-basepair sequence modeling. Genos utilizes a large-scale mixture of experts structure, optimized for a 1-Mb context, trained on high-quality human de novo assemblies from datasets such as the Human Pangenome Reference Consortium and the Human Genome Structural Variation Consortium, representing diverse global populations. A suite of optimization strategies was implemented to ensure training stability and enhance computational efficiency, which collectively reduces costs and facilitates million-basepair context modeling. Functionally, Genos performs single-nucleotide resolution analysis and dynamically simulates the cascade effects of genetic variation on molecular phenotypes, including the influence of both common and rare single-nucleotide variants as well as structural variants.
DOI
10.1093/gigascience/giaf132
HyenaDNA
DESCRIPTION
HyenaDNA is a genomic foundation model using implicit convolution operators (Hyena) to achieve up to 1 million token context length at single nucleotide resolution, overcoming the quadratic scaling limitations of Transformer-based models. Published at NeurIPS 2023.
KEYWORDS
Hyena, implicit convolution, long-range context, single nucleotide resolution, SNP, regulatory elements, NeurIPS 2023
TITLE
HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution.
Main citation
Nguyen E, Poli M, Faizi M, ...&, Ré C. (2023) HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution. NeurIPS, 36 43177-43201.
ABSTRACT
Genomic (DNA) sequences encode an enormous amount of information for gene regulation and protein synthesis. Similar to natural language models, researchers have proposed foundation models in genomics to learn generalizable features from unlabeled genome data that can then be fine-tuned for downstream tasks such as identifying regulatory elements. Due to the quadratic scaling of attention, previous Transformer-based genomic models have used 512 to 4k tokens as context (<0.001% of the human genome), significantly limiting the modeling of long-range interactions in DNA. In addition, these methods rely on tokenizers or fixed k-mers to aggregate meaningful DNA units, losing single nucleotide resolution where subtle genetic variations can completely alter protein function via single nucleotide polymorphisms (SNPs). Recently, Hyena, a large language model based on implicit convolutions was shown to match attention in quality while allowing longer context lengths and lower time complexity. Leveraging Hyena, we present HyenaDNA, a genomic foundation model that scales context length to 1 million tokens at single nucleotide resolution, an up to 500x increase over previous dense attention-based models. HyenaDNA achieves state-of-the-art on 12 of 18 benchmarks and excels at long-range regulatory element prediction and SNP effect prediction.
DOI
10.48550/arXiv.2306.15794
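Two numbers in the HyenaDNA abstract can be checked with back-of-the-envelope arithmetic: the share of the human genome covered by a 4k-token context, and how quickly dense attention cost grows with context length. The sketch assumes a haploid genome of about 3.1 billion base pairs and one base pair per token; both are simplifying assumptions, and the helper names are hypothetical.

```python
HUMAN_GENOME_BP = 3_100_000_000  # assumption: approximate haploid genome size


def context_percent_of_genome(context_tokens, bp_per_token=1):
    """Percentage of the genome covered by a single context window."""
    return 100 * context_tokens * bp_per_token / HUMAN_GENOME_BP


def attention_pair_ratio(long_context, short_context):
    """Dense attention scores every token pair, so cost scales with length squared."""
    return (long_context / short_context) ** 2


# A 4k-token window covers far less than 0.001% of the genome.
print(context_percent_of_genome(4_000) < 0.001)  # True

# Going from 4k to 1M tokens multiplies the number of attention pairs by 62,500.
print(attention_pair_ratio(1_000_000, 4_000))  # 62500.0
```

The quadratic factor is the reason the abstract gives for moving from dense attention to a Hyena operator with lower time complexity at long context lengths.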
Nucleotide Transformer
DESCRIPTION
A family of transformer foundation models (from tens of millions to multi-billion parameters) pretrained on thousands of human and other-species genomes to learn DNA sequence representations. Embeddings support fine-tuning for tasks such as splice-site prediction, enhancer activity, histone marks, and transcription-factor binding, with benchmarks and weights released openly.
KEYWORDS
Transformer, foundation model, human genome, multi-species, DNA embeddings, splice-site, enhancer, histone marks, TF binding
TITLE
Nucleotide Transformer: building and evaluating robust foundation models for human genomics.
Main citation
Dalla-Torre H, Gonzalez L, Mendoza-Revilla J, Lopez Carranza N, ...&, Pierrot T. (2025) Nucleotide Transformer: building and evaluating robust foundation models for human genomics. Nat Methods, 22 (2) 287-297. doi:10.1038/s41592-024-02523-z. PMID 39609566
ABSTRACT
The prediction of molecular phenotypes from DNA sequences remains a longstanding challenge in genomics, often driven by limited annotated data and the inability to transfer learnings between tasks. Here, we present an extensive study of foundation models pre-trained on DNA sequences, named Nucleotide Transformer, ranging from 50 million up to 2.5 billion parameters and integrating information from 3,202 human genomes and 850 genomes from diverse species. These transformer models yield context-specific representations of nucleotide sequences, which allow for accurate predictions even in low-data settings. We show that the developed models can be fine-tuned at low cost to solve a variety of genomics applications. Despite no supervision, the models learned to focus attention on key genomic elements and can be used to improve the prioritization of genetic variants. The training and application of foundational models in genomics provides a widely applicable approach for accurate molecular phenotype prediction from DNA sequence.
DOI
10.1038/s41592-024-02523-z
PromoterAI
DESCRIPTION
A deep learning model from Illumina that scores how variants in gene promoter regions alter predicted gene expression, trained on chromatin and expression-related signals at nucleotide resolution. Distributed as a Python package with precomputed genome-wide scores to support variant interpretation in rare-disease and research settings, alongside existing splicing and protein effect predictors.
KEYWORDS
promoter, variant effect, deep learning, rare disease, noncoding, gene expression, Illumina
TITLE
Predicting expression-altering promoter mutations with deep learning.
Main citation
Jaganathan K, Ersaro N, Novakovsky G, Wang Y, ...&, Farh KK. (2025) Predicting expression-altering promoter mutations with deep learning. Science, 389 (6760) eads7373. doi:10.1126/science.ads7373. PMID 40440429
ABSTRACT
Only a minority of patients with rare genetic diseases are presently diagnosed by exome sequencing, suggesting that additional unrecognized pathogenic variants may reside in noncoding sequence. In this work, we describe PromoterAI, a deep neural network that accurately identifies noncoding promoter variants that dysregulate gene expression. We show that promoter variants with predicted expression-altering consequences produce outlier expression at both the RNA and protein levels in thousands of individuals and that these variants experience strong negative selection in human populations. We observed that clinically relevant genes in patients with rare diseases are enriched for such variants and validated their functional impact through reporter assays. Our estimates suggest that promoter variation accounts for 6% of the genetic burden associated with rare diseases.
DOI
10.1126/science.ads7373