References Function
Curation of Function — listings under the References tab.
Summary Table
| NAME | CATEGORY | Main citation | YEAR |
|---|---|---|---|
| ESM-2 | Evolutionary and Generative Protein Models | Lin Z et al., Science, 2023 | 2023 |
| EVE | Evolutionary and Generative Protein Models | Frazer J et al., Nature, 2021 | 2021 |
| ProGen2 | Evolutionary and Generative Protein Models | Nijkamp E et al., Cell Syst, 2023 | 2023 |
| ProtBERT | Evolutionary and Generative Protein Models | Elnaggar A et al., IEEE Trans Pattern Anal Mach Intell, 2022 | 2022 |
| ProteinBERT | Evolutionary and Generative Protein Models | Brandes N et al., Bioinformatics, 2022 | 2022 |
| ClinVar | Functional Annotation | Landrum MJ et al., Nucleic Acids Res, 2025 | 2025 |
| dbNSFP v4 | Functional Annotation | Liu X et al., Genome Med, 2020 | 2020 |
| DAVID | Pathway and Gene Ontology Enrichment | Huang DW et al., Genome Biol, 2007 | 2007 |
| Gene Ontology | Pathway and Gene Ontology Enrichment | Ashburner M et al., Nat Genet, 2000 | 2000 |
| AlphaFold 2 | Structure Prediction | Jumper J et al., Nature, 2021 | 2021 |
| AlphaFold 3 | Structure Prediction | Abramson J et al., Nature, 2024 | 2024 |
| AlphaFold | Structure Prediction | Senior AW et al., Nature, 2020 | 2020 |
| AlphaMissense | Variant Effect Prediction | Cheng J et al., Science, 2023 | 2023 |
| CADD v1.4 | Variant Effect Prediction | Rentzsch P et al., Nucleic Acids Res, 2019 | 2019 |
| CADD v1.6 (CADD-Splice) | Variant Effect Prediction | Rentzsch P et al., Genome Med, 2021 | 2021 |
| CADD v1.7 | Variant Effect Prediction | Schubach M et al., Nucleic Acids Res, 2024 | 2024 |
| CADD | Variant Effect Prediction | Kircher M et al., Nat Genet, 2014 | 2014 |
| M-CAP | Variant Effect Prediction | Jagadeesh KA et al., Nat Genet, 2016 | 2016 |
| MVP | Variant Effect Prediction | Qi H et al., Nat Commun, 2021 | 2021 |
| MetaLR / MetaSVM | Variant Effect Prediction | Dong C et al., Hum Mol Genet, 2015 | 2015 |
| MutationAssessor | Variant Effect Prediction | Su Y et al., bioRxiv, 2025 | 2025 |
| PolyPhen-2 | Variant Effect Prediction | Adzhubei I et al., Curr Protoc Hum Genet, 2013 | 2013 |
| PrimateAI-3D | Variant Effect Prediction | Gao H et al., Science, 2023 | 2023 |
| REVEL | Variant Effect Prediction | Ioannidis NM et al., Am J Hum Genet, 2016 | 2016 |
| SIFT | Variant Effect Prediction | Ng PC et al., Nucleic Acids Res, 2003 | 2003 |
| UNEECON | Variant Effect Prediction | Huang YF, PLoS Genet, 2020 | 2020 |
| VEST4 | Variant Effect Prediction | Carter H et al., BMC Genomics, 2013 | 2013 |
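For offline use, the summary table can be re-ordered by any column. The following is a minimal Python sketch, assuming the table has been saved as pipe-delimited text with one row per line; the helper names `parse_rows` and `sort_rows` are illustrative and not part of this resource.

```python
def parse_rows(table_text):
    """Parse a pipe-delimited table into a list of dicts keyed by header name."""
    lines = [ln.strip() for ln in table_text.strip().splitlines() if ln.strip()]
    header = [c.strip() for c in lines[0].strip("|").split("|")]
    rows = []
    for ln in lines[1:]:
        cells = [c.strip() for c in ln.strip("|").split("|")]
        # Skip the |---|---| separator row.
        if all(set(c) <= set("-: ") for c in cells):
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def sort_rows(rows, column, numeric=False, reverse=False):
    """Return the rows sorted by one column, numerically if requested."""
    if numeric:
        key = lambda r: int(r[column])
    else:
        key = lambda r: r[column].lower()
    return sorted(rows, key=key, reverse=reverse)


# Three rows copied from the summary table, used as sample input.
SAMPLE = """
| NAME | CATEGORY | Main citation | YEAR |
|---|---|---|---|
| ESM-2 | Evolutionary and Generative Protein Models | Lin Z et al., Science, 2023 | 2023 |
| Gene Ontology | Pathway and Gene Ontology Enrichment | Ashburner M et al., Nat Genet, 2000 | 2000 |
| CADD | Variant Effect Prediction | Kircher M et al., Nat Genet, 2014 | 2014 |
"""

rows = parse_rows(SAMPLE)
oldest_first = sort_rows(rows, "YEAR", numeric=True)
print([r["NAME"] for r in oldest_first])
```

Sorting on `YEAR` needs `numeric=True` so that years compare as integers rather than strings; text columns are compared case-insensitively.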
Evolutionary and Generative Protein Models
ESM-2
FULL NAME
Evolutionary Scale Modeling v2
DESCRIPTION
Large-scale protein language model enabling structure/function prediction.
KEYWORDS
transformer, LLM, structure, sequence
USE
embeddings, structure prediction
TITLE
Evolutionary-scale prediction of atomic-level protein structure with a language model.
Main citation
Lin Z, Akin H, Rao R, Hie B, ...&, Rives A. (2023) Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379 (6637) 1123-1130. doi:10.1126/science.ade2574. PMID 36927031
ABSTRACT
Recent advances in machine learning have leveraged evolutionary information in multiple sequence alignments to predict protein structure. We demonstrate direct inference of full atomic-level protein structure from primary sequence using a large language model. As language models of protein sequences are scaled up to 15 billion parameters, an atomic-resolution picture of protein structure emerges in the learned representations. This results in an order-of-magnitude acceleration of high-resolution structure prediction, which enables large-scale structural characterization of metagenomic proteins. We apply this capability to construct the ESM Metagenomic Atlas by predicting structures for >617 million metagenomic protein sequences, including >225 million that are predicted with high confidence, which gives a view into the vast breadth and diversity of natural proteins.
DOI
10.1126/science.ade2574
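The "Main citation" lines in this listing share one layout: authors, year in parentheses, title and journal details, then `doi:` and `PMID`. The sketch below splits such a line into fields and builds resolver links. The regular expression and the function name `parse_citation` are assumptions made for illustration, fitted to the layout seen here rather than to any published citation standard.

```python
import re

# Layout assumed: "<authors> (<year>) <title and journal>. doi:<doi>. PMID <pmid>"
CITATION_RE = re.compile(
    r"^(?P<authors>.+?)\s+\((?P<year>\d{4})\)\s+"
    r"(?P<rest>.+?)\s+doi:(?P<doi>\S+?)\.\s+PMID\s+(?P<pmid>\d+)$"
)


def parse_citation(line):
    """Return a dict of citation fields, or None if the line does not match."""
    match = CITATION_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["year"] = int(record["year"])
    # doi.org and pubmed.ncbi.nlm.nih.gov are the standard resolvers.
    record["doi_url"] = "https://doi.org/" + record["doi"]
    record["pubmed_url"] = "https://pubmed.ncbi.nlm.nih.gov/" + record["pmid"] + "/"
    return record


example = (
    "Lin Z, Akin H, Rao R, Hie B, ...&, Rives A. (2023) Evolutionary-scale "
    "prediction of atomic-level protein structure with a language model. "
    "Science, 379 (6637) 1123-1130. doi:10.1126/science.ade2574. PMID 36927031"
)
record = parse_citation(example)
print(record["year"], record["doi_url"], record["pubmed_url"])
```

Lines that deviate from the layout return `None` instead of a partial record, so malformed entries are easy to spot.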
EVE
FULL NAME
Evolutionary model of Variant Effect
DESCRIPTION
Unsupervised variational autoencoder (VAE) model that predicts variant impact from multiple sequence alignments (MSAs).
KEYWORDS
evolutionary, MSA, variant effect
USE
missense scoring
TITLE
Disease variant prediction with deep generative models of evolutionary data.
Main citation
Frazer J, Notin P, Dias M, Gomez A, ...&, Marks DS. (2021) Disease variant prediction with deep generative models of evolutionary data. Nature, 599 (7883) 91-95. doi:10.1038/s41586-021-04043-8. PMID 34707284
ABSTRACT
Quantifying the pathogenicity of protein variants in human disease-related genes would have a marked effect on clinical decisions, yet the overwhelming majority (over 98%) of these variants still have unknown consequences. In principle, computational methods could support the large-scale interpretation of genetic variants. However, state-of-the-art methods have relied on training machine learning models on known disease labels. As these labels are sparse, biased and of variable quality, the resulting models have been considered insufficiently reliable. Here we propose an approach that leverages deep generative models to predict variant pathogenicity without relying on labels. By modelling the distribution of sequence variation across organisms, we implicitly capture constraints on the protein sequences that maintain fitness. Our model EVE (evolutionary model of variant effect) not only outperforms computational approaches that rely on labelled data but also performs on par with, if not better than, predictions from high-throughput experiments, which are increasingly used as evidence for variant classification. We predict the pathogenicity of more than 36 million variants across 3,219 disease genes and provide evidence for the classification of more than 256,000 variants of unknown significance. Our work suggests that models of evolutionary information can provide valuable independent evidence for variant interpretation that will be widely useful in research and clinical settings.
DOI
10.1038/s41586-021-04043-8
ProGen2
DESCRIPTION
Generative protein design using LLMs trained on protein sequences.
KEYWORDS
protein design, LLM
USE
sequence generation
TITLE
ProGen2: Exploring the boundaries of protein language models.
Main citation
Nijkamp E, Ruffolo JA, Weinstein EN, Naik N, ...&, Madani A. (2023) ProGen2: Exploring the boundaries of protein language models. Cell Syst, 14 (11) 968-978.e3. doi:10.1016/j.cels.2023.10.002. PMID 37909046
ABSTRACT
Attention-based models trained on protein sequences have demonstrated incredible success at classification and generation tasks relevant for artificial-intelligence-driven protein design. However, we lack a sufficient understanding of how very large-scale models and data play a role in effective protein model development. We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters and trained on different sequence datasets drawn from over a billion proteins from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional fine-tuning. As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model. Our models and code are open sourced for widespread adoption in protein engineering. A record of this paper's Transparent Peer Review process is included in the supplemental information.
DOI
10.1016/j.cels.2023.10.002
ProtBERT
DESCRIPTION
BERT-based protein language model for downstream functional tasks.
KEYWORDS
protein LM, transformer, embeddings
USE
feature extraction
TITLE
ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning.
Main citation
Elnaggar A, Heinzinger M, Dallago C, Rehawi G, ...&, Rost B. (2022) ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Trans Pattern Anal Mach Intell, 44 (10) 7112-7127. doi:10.1109/TPAMI.2021.3095381. PMID 34232869
ABSTRACT
Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The protein LMs (pLMs) were trained on the Summit supercomputer using 5616 GPUs and TPU Pod up-to 1024 cores. Dimensionality reduction revealed that the raw pLM-embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks: (1) a per-residue (per-token) prediction of protein secondary structure (3-state accuracy Q3=81%-87%); (2) per-protein (pooling) predictions of protein sub-cellular location (ten-state accuracy: Q10=81%) and membrane versus water-soluble (2-state accuracy Q2=91%). For secondary structure, the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without multiple sequence alignments (MSAs) or evolutionary information thereby bypassing expensive database searches. Taken together, the results implied that pLMs learned some of the grammar of the language of life. All our models are available through https://github.com/agemagician/ProtTrans.
DOI
10.1109/TPAMI.2021.3095381
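The ProtTrans abstract reports secondary-structure performance as 3-state accuracy (Q3). As a rough illustration of that metric, the sketch below computes the fraction of residues whose state is predicted correctly. The H/E/C (helix, strand, coil) alphabet and the two toy strings are assumptions for the example and are not taken from the paper's data.

```python
def q3_accuracy(predicted, observed):
    """Fraction of residues whose 3-state label matches the observed label."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("cannot compute Q3 on empty input")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


# Toy per-residue labels (made up): H = helix, E = strand, C = coil.
observed = "CCHHHHHHCCEEEEECC"
predicted = "CCHHHHHCCCEEEECCC"
print(f"Q3 = {q3_accuracy(predicted, observed):.2%}")
```

Published Q3 values are typically pooled over all residues of a benchmark set; this sketch scores a single sequence only.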
ProteinBERT
TITLE
ProteinBERT: a universal deep-learning model of protein sequence and function.
Main citation
Brandes N, Ofer D, Peleg Y, Rappoport N, ...&, Linial M. (2022) ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics, 38 (8) 2102-2110. doi:10.1093/bioinformatics/btac020. PMID 35020807
ABSTRACT
SUMMARY: Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed to biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme combines language modeling with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to long sequences. The architecture of ProteinBERT consists of both local and global representations, allowing end-to-end processing of these types of inputs and outputs. ProteinBERT obtains near state-of-the-art performance, and sometimes exceeds it, on multiple benchmarks covering diverse protein properties (including protein structure, post-translational modifications and biophysical attributes), despite using a far smaller and faster model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data. AVAILABILITY AND IMPLEMENTATION: Code and pretrained model weights are available at https://github.com/nadavbra/protein_bert. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
DOI
10.1093/bioinformatics/btac020
Functional Annotation
ClinVar
DESCRIPTION
Archive of clinically relevant variants with interpretations.
KEYWORDS
pathogenicity, variant, clinical
USE
clinical annotation
TITLE
ClinVar: updates to support classifications of both germline and somatic variants.
Main citation
Landrum MJ, Chitipiralla S, Kaur K, Brown G, ...&, Kattman BL. (2025) ClinVar: updates to support classifications of both germline and somatic variants. Nucleic Acids Res, 53 (D1) D1313-D1321. doi:10.1093/nar/gkae1090. PMID 39578691
ABSTRACT
ClinVar (www.ncbi.nlm.nih.gov/clinvar/) is a free, public database of human genetic variants and their relationships to disease, with >3 million variants submitted by >2800 organizations across the world. The database was recently updated to have three types of classifications: germline, oncogenicity and clinical impact for somatic variants. As for germline variants, classifications for somatic variants can be submitted in batches in a file submission or through the submission API; variants can also be submitted and updated one at a time in online submission forms. The ClinVar XML files were redesigned to allow multiple classification types. Both old and new formats of the XML are supported through the end of 2024. Data for somatic classifications were also added to the ClinVar VCF files and to several tab-delimited files. The ClinVar VCV pages were updated to display the three types of classifications, both as it was submitted and as it was aggregated by ClinVar. Clinical testing laboratories and others in the cancer community are invited to share their classifications of somatic variant classifications through ClinVar to provide transparency in genomic testing and improve patient care.
DOI
10.1093/nar/gkae1090
dbNSFP v4 (dbNSFP)
FULL NAME
Database for Nonsynonymous SNPs’ Functional Predictions
DESCRIPTION
Database aggregating functional predictions and annotations for nonsynonymous variants.
KEYWORDS
annotation, variant, missense
USE
meta-annotation
TITLE
dbNSFP v4: a comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs.
Main citation
Liu X, Li C, Mou C, Dong Y, ...&, Tu Y. (2020) dbNSFP v4: a comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs. Genome Med, 12 (1) 103. doi:10.1186/s13073-020-00803-9. PMID 33261662
ABSTRACT
Whole exome sequencing has been increasingly used in human disease studies. Prioritization based on appropriate functional annotations has been used as an indispensable step to select candidate variants. Here we present the latest updates to dbNSFP (version 4.1), a database designed to facilitate this step by providing deleteriousness prediction and functional annotation for all potential nonsynonymous and splice-site SNVs (a total of 84,013,093) in the human genome. The current version compiled 36 deleteriousness prediction scores, including 12 transcript-specific scores, and other variant and gene-level functional annotations. The database is available at http://database.liulab.science/dbNSFP with a downloadable version and a web-service.
DOI
10.1186/s13073-020-00803-9
Pathway and Gene Ontology Enrichment
DAVID
FULL NAME
Database for Annotation, Visualization and Integrated Discovery
DESCRIPTION
Functional annotation and enrichment analysis platform.
KEYWORDS
functional enrichment, GO, pathway
USE
enrichment analysis
TITLE
The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists.
Main citation
Huang DW, Sherman BT, Tan Q, Collins JR, ...&, Lempicki RA. (2007) The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol, 8 (9) R183. doi:10.1186/gb-2007-8-9-r183. PMID 17784955
ABSTRACT
The DAVID Gene Functional Classification Tool http://david.abcc.ncifcrf.gov uses a novel agglomeration algorithm to condense a list of genes or associated biological terms into organized classes of related genes or biology, called biological modules. This organization is accomplished by mining the complex biological co-occurrences found in multiple sources of functional annotation. It is a powerful method to group functionally related genes and terms into a manageable number of biological modules for efficient interpretation of gene lists in a network context.
DOI
10.1186/gb-2007-8-9-r183
Gene Ontology (GO)
DESCRIPTION
Controlled vocabulary for gene function classification.
KEYWORDS
GO terms, pathways
USE
enrichment analysis
TITLE
Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.
Main citation
Ashburner M, Ball CA, Blake JA, Botstein D, ...&, Sherlock G. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25 (1) 25-9. doi:10.1038/75556. PMID 10802651
ABSTRACT
Genomic sequencing has made it clear that a large fraction of the genes specifying the core biological functions are shared by all eukaryotes. Knowledge of the biological role of such shared proteins in one organism can often be transferred to other organisms. The goal of the Gene Ontology Consortium is to produce a dynamic, controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing. To this end, three independent ontologies accessible on the World-Wide Web (http://www.geneontology.org) are being constructed: biological process, molecular function and cellular component.
DOI
10.1038/75556
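Both resources in this category list enrichment analysis as their use. The sketch below shows a generic over-representation test for a single term, using the upper tail of the hypergeometric distribution. It is not the DAVID module-centric algorithm described in the abstract above, and the function name, argument names, and gene counts are made up for illustration.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated genes in a list of n,
    when K of the N background genes carry the annotation."""
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= n):
        raise ValueError("inconsistent counts")
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Made-up counts: 12 of 200 study genes carry a term found on 300 of 20000 background genes.
p_value = hypergeom_enrichment_p(k=12, n=200, K=300, N=20000)
print(f"enrichment p-value: {p_value:.3g}")
```

When many terms are tested, the resulting p-values still need a multiple-testing correction, which this sketch leaves out.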
Structure Prediction
AlphaFold
TITLE
Improved protein structure prediction using potentials from deep learning.
Main citation
Senior AW, Evans R, Jumper J, Kirkpatrick J, ...&, Hassabis D. (2020) Improved protein structure prediction using potentials from deep learning. Nature, 577 (7792) 706-710. doi:10.1038/s41586-019-1923-7. PMID 31942072
ABSTRACT
Protein structure prediction can be used to determine the three-dimensional shape of a protein from its amino acid sequence. This problem is of fundamental importance as the structure of a protein largely determines its function; however, protein structures can be difficult to determine experimentally. Considerable progress has recently been made by leveraging genetic information. It is possible to infer which amino acid residues are in contact by analysing covariation in homologous sequences, which aids in the prediction of protein structures. Here we show that we can train a neural network to make accurate predictions of the distances between pairs of residues, which convey more information about the structure than contact predictions. Using this information, we construct a potential of mean force that can accurately describe the shape of a protein. We find that the resulting potential can be optimized by a simple gradient descent algorithm to generate structures without complex sampling procedures. The resulting system, named AlphaFold, achieves high accuracy, even for sequences with fewer homologous sequences. In the recent Critical Assessment of Protein Structure Prediction (CASP13), a blind assessment of the state of the field, AlphaFold created high-accuracy structures (with template modelling (TM) scores of 0.7 or higher) for 24 out of 43 free modelling domains, whereas the next best method, which used sampling and contact information, achieved such accuracy for only 14 out of 43 domains. AlphaFold represents a considerable advance in protein-structure prediction. We expect this increased accuracy to enable insights into the function and malfunction of proteins, especially in cases for which no structures for homologous proteins have been experimentally determined.
DOI
10.1038/s41586-019-1923-7
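The abstract above describes optimizing a distance-based potential by gradient descent. The toy sketch below illustrates that general idea only: it recovers 3D coordinates from a matrix of target pair distances using a plain harmonic penalty. It is not the AlphaFold potential, which is built from network-predicted distances; the coordinates, step size, and function names are assumptions chosen for the example.

```python
import math
import random


def potential(coords, target):
    """Sum of squared deviations between current and target pair distances."""
    total = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            total += (math.dist(coords[i], coords[j]) - target[i][j]) ** 2
    return total


def gradient(coords, target):
    """Analytic gradient of the potential with respect to every coordinate."""
    grad = [[0.0, 0.0, 0.0] for _ in coords]
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d = math.dist(coords[i], coords[j])
            if d == 0.0:
                continue  # direction undefined for coincident points
            scale = 2.0 * (d - target[i][j]) / d
            for axis in range(3):
                g = scale * (coords[i][axis] - coords[j][axis])
                grad[i][axis] += g
                grad[j][axis] -= g
    return grad


def minimize(coords, target, step=0.01, iterations=2000):
    """Plain fixed-step gradient descent; returns new coordinates."""
    coords = [list(p) for p in coords]
    for _ in range(iterations):
        for point, g in zip(coords, gradient(coords, target)):
            for axis in range(3):
                point[axis] -= step * g[axis]
    return coords


# Toy "true" structure (arbitrary units) used only to derive target distances.
true_coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.4, 0.0),
               (1.0, 2.2, 0.9), (-0.3, 1.6, 1.5)]
n = len(true_coords)
target = [[math.dist(true_coords[i], true_coords[j]) for j in range(n)]
          for i in range(n)]

random.seed(0)
start = [[random.uniform(-2.0, 2.0) for _ in range(3)] for _ in range(n)]
before = potential(start, target)
after = potential(minimize(start, target), target)
print(f"potential before: {before:.3f}, after: {after:.6f}")
```

Pair distances do not distinguish a structure from its mirror image or from rotated and translated copies, so the optimized coordinates need not coincide with `true_coords` even when the potential is close to zero.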
AlphaFold 2 (AlphaFold)
FULL NAME
AlphaFold Protein Structure Database
DESCRIPTION
High-accuracy protein structure prediction using deep learning.
KEYWORDS
protein structure, deep learning, folding
USE
structure prediction
SERVER
EMBL-EBI
TITLE
Highly accurate protein structure prediction with AlphaFold.
Main citation
Jumper J, Evans R, Pritzel A, Green T, ...&, Hassabis D. (2021) Highly accurate protein structure prediction with AlphaFold. Nature, 596 (7873) 583-589. doi:10.1038/s41586-021-03819-2. PMID 34265844
ABSTRACT
Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort, the structures of around 100,000 unique proteins have been determined, but this represents a small fraction of the billions of known protein sequences. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence (the structure prediction component of the 'protein folding problem') has been an important open research problem for more than 50 years. Despite recent progress, existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14), demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.
DOI
10.1038/s41586-021-03819-2
AlphaFold 3
TITLE
Accurate structure prediction of biomolecular interactions with AlphaFold 3.
Main citation
Abramson J, Adler J, Dunger J, Evans R, ...&, Jumper JM. (2024) Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630 (8016) 493-500. doi:10.1038/s41586-024-07487-w. PMID 38718835
ABSTRACT
The introduction of AlphaFold 2 has spurred a revolution in modelling the structure of proteins and their interactions, enabling a huge range of applications in protein modelling and design. Here we describe our AlphaFold 3 model with a substantially updated diffusion-based architecture that is capable of predicting the joint structure of complexes including proteins, nucleic acids, small molecules, ions and modified residues. The new AlphaFold model demonstrates substantially improved accuracy over many previous specialized tools: far greater accuracy for protein-ligand interactions compared with state-of-the-art docking tools, much higher accuracy for protein-nucleic acid interactions compared with nucleic-acid-specific predictors and substantially higher antibody-antigen prediction accuracy compared with AlphaFold-Multimer v.2.3. Together, these results show that high-accuracy modelling across biomolecular space is possible within a single unified deep-learning framework.
DOI
10.1038/s41586-024-07487-w
Variant Effect Prediction
AlphaMissense
DESCRIPTION
Deep learning model predicting pathogenicity of all possible missense variants in human proteins.
KEYWORDS
missense, pathogenicity, variant effect, deep learning
USE
variant effect scoring
TITLE
Accurate proteome-wide missense variant effect prediction with AlphaMissense.
Main citation
Cheng J, Novati G, Pan J, Bycroft C, ...&, Avsec Ž. (2023) Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science, 381 (6664) eadg7492. doi:10.1126/science.adg7492. PMID 37733863
ABSTRACT
The vast majority of missense variants observed in the human genome are of unknown clinical significance. We present AlphaMissense, an adaptation of AlphaFold fine-tuned on human and primate variant population frequency databases to predict missense variant pathogenicity. By combining structural context and evolutionary conservation, our model achieves state-of-the-art results across a wide range of genetic and experimental benchmarks, all without explicitly training on such data. The average pathogenicity score of genes is also predictive for their cell essentiality, capable of identifying short essential genes that existing statistical approaches are underpowered to detect. As a resource to the community, we provide a database of predictions for all possible human single amino acid substitutions and classify 89% of missense variants as either likely benign or likely pathogenic.
DOI
10.1126/science.adg7492
CADD
FULL NAME
Combined Annotation–Dependent Depletion
DESCRIPTION
Integrates many diverse annotations into a single score of variant deleteriousness.
KEYWORDS
genome-wide, deleteriousness, annotation
USE
prioritization, filtering
TITLE
A general framework for estimating the relative pathogenicity of human genetic variants.
Main citation
Kircher M, Witten DM, Jain P, O'Roak BJ, ...&, Shendure J. (2014) A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet, 46 (3) 310-5. doi:10.1038/ng.2892. PMID 24487276
ABSTRACT
Current methods for annotating and interpreting human genetic variation tend to exploit a single information type (for example, conservation) and/or are restricted in scope (for example, to missense changes). Here we describe Combined Annotation-Dependent Depletion (CADD), a method for objectively integrating many diverse annotations into a single measure (C score) for each variant. We implement CADD as a support vector machine trained to differentiate 14.7 million high-frequency human-derived alleles from 14.7 million simulated variants. We precompute C scores for all 8.6 billion possible human single-nucleotide variants and enable scoring of short insertions-deletions. C scores correlate with allelic diversity, annotations of functionality, pathogenicity, disease severity, experimentally measured regulatory effects and complex trait associations, and they highly rank known pathogenic variants within individual genomes. The ability of CADD to prioritize functional, deleterious and pathogenic variants across many functional categories, effect sizes and genetic architectures is unmatched by any current single-annotation method.
DOI
10.1038/ng.2892
CADD v1.4
TITLE
CADD: predicting the deleteriousness of variants throughout the human genome.
Main citation
Rentzsch P, Witten D, Cooper GM, Shendure J, ...&, Kircher M. (2019) CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res, 47 (D1) D886-D894. doi:10.1093/nar/gky1016. PMID 30371827
ABSTRACT
Combined Annotation-Dependent Depletion (CADD) is a widely used measure of variant deleteriousness that can effectively prioritize causal variants in genetic analyses, particularly highly penetrant contributors to severe Mendelian disorders. CADD is an integrative annotation built from more than 60 genomic features, and can score human single nucleotide variants and short insertion and deletions anywhere in the reference assembly. CADD uses a machine learning model trained on a binary distinction between simulated de novo variants and variants that have arisen and become fixed in human populations since the split between humans and chimpanzees; the former are free of selective pressure and may thus include both neutral and deleterious alleles, while the latter are overwhelmingly neutral (or, at most, weakly deleterious) by virtue of having survived millions of years of purifying selection. Here we review the latest updates to CADD, including the most recent version, 1.4, which supports the human genome build GRCh38. We also present updates to our website that include simplified variant lookup, extended documentation, an Application Program Interface and improved mechanisms for integrating CADD scores into other tools or applications. CADD scores, software and documentation are available at https://cadd.gs.washington.edu.
DOI
10.1093/nar/gky1016
CADD v1.6 (CADD-Splice)
TITLE
CADD-Splice-improving genome-wide variant effect prediction using deep learning-derived splice scores.
Main citation
Rentzsch P, Schubach M, Shendure J, Kircher M. (2021) CADD-Splice-improving genome-wide variant effect prediction using deep learning-derived splice scores. Genome Med, 13 (1) 31. doi:10.1186/s13073-021-00835-9. PMID 33618777
ABSTRACT
BACKGROUND: Splicing of genomic exons into mRNAs is a critical prerequisite for the accurate synthesis of human proteins. Genetic variants impacting splicing underlie a substantial proportion of genetic disease, but are challenging to identify beyond those occurring at donor and acceptor dinucleotides. To address this, various methods aim to predict variant effects on splicing. Recently, deep neural networks (DNNs) have been shown to achieve better results in predicting splice variants than other strategies. METHODS: It has been unclear how best to integrate such process-specific scores into genome-wide variant effect predictors. Here, we use a recently published experimental data set to compare several machine learning methods that score variant effects on splicing. We integrate the best of those approaches into general variant effect prediction models and observe the effect on classification of known pathogenic variants. RESULTS: We integrate two specialized splicing scores into CADD (Combined Annotation Dependent Depletion; cadd.gs.washington.edu ), a widely used tool for genome-wide variant effect prediction that we previously developed to weight and integrate diverse collections of genomic annotations. With this new model, CADD-Splice, we show that inclusion of splicing DNN effect scores substantially improves predictions across multiple variant categories, without compromising overall performance. CONCLUSIONS: While splice effect scores show superior performance on splice variants, specialized predictors cannot compete with other variant scores in general variant interpretation, as the latter account for nonsense and missense effects that do not alter splicing. Although only shown here for splice scores, we believe that the applied approach will generalize to other specific molecular processes, providing a path for the further improvement of genome-wide variant effect prediction.
DOI
10.1186/s13073-021-00835-9
CADD v1.7
TITLE
CADD v1.7: using protein language models, regulatory CNNs and other nucleotide-level scores to improve genome-wide variant predictions.
Main citation
Schubach M, Maass T, Nazaretyan L, Röner S, ...&, Kircher M. (2024) CADD v1.7: using protein language models, regulatory CNNs and other nucleotide-level scores to improve genome-wide variant predictions. Nucleic Acids Res, 52 (D1) D1143-D1154. doi:10.1093/nar/gkad989. PMID 38183205
ABSTRACT
Machine Learning-based scoring and classification of genetic variants aids the assessment of clinical findings and is employed to prioritize variants in diverse genetic studies and analyses. Combined Annotation-Dependent Depletion (CADD) is one of the first methods for the genome-wide prioritization of variants across different molecular functions and has been continuously developed and improved since its original publication. Here, we present our most recent release, CADD v1.7. We explored and integrated new annotation features, among them state-of-the-art protein language model scores (Meta ESM-1v), regulatory variant effect predictions (from sequence-based convolutional neural networks) and sequence conservation scores (Zoonomia). We evaluated the new version on data sets derived from ClinVar, ExAC/gnomAD and 1000 Genomes variants. For coding effects, we tested CADD on 31 Deep Mutational Scanning (DMS) data sets from ProteinGym and, for regulatory effect prediction, we used saturation mutagenesis reporter assay data of promoter and enhancer sequences. The inclusion of new features further improved the overall performance of CADD. As with previous releases, all data sets, genome-wide CADD v1.7 scores, scripts for on-site scoring and an easy-to-use webserver are readily provided via https://cadd.bihealth.org/ or https://cadd.gs.washington.edu/ to the community.
DOI
10.1093/nar/gkad989
M-CAP
PUBMED_LINK
FULL NAME
Mendelian Clinically Applicable Pathogenicity
DESCRIPTION
Rare missense pathogenicity classifier for clinical interpretation.
URL
KEYWORDS
missense, clinical
USE
clinical scoring
TITLE
M-CAP eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity.
Main citation
Jagadeesh KA, Wenger AM, Berger MJ, Guturu H, ...&, Bejerano G. (2016) M-CAP eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity. Nat Genet, 48 (12) 1581-1586. doi:10.1038/ng.3703. PMID 27776117
ABSTRACT
Variant pathogenicity classifiers such as SIFT, PolyPhen-2, CADD, and MetaLR assist in interpretation of the hundreds of rare, missense variants in the typical patient genome by deprioritizing some variants as likely benign. These widely used methods misclassify 26 to 38% of known pathogenic mutations, which could lead to missed diagnoses if the classifiers are trusted as definitive in a clinical setting. We developed M-CAP, a clinical pathogenicity classifier that outperforms existing methods at all thresholds and correctly dismisses 60% of rare, missense variants of uncertain significance in a typical genome at 95% sensitivity.
DOI
10.1038/ng.3703
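Note: the M-CAP abstract reports the share of variants of uncertain significance dismissed "at 95% sensitivity". The sketch below shows one way such a fixed-sensitivity operating point can be computed for any score where higher means more pathogenic. The function names and the toy scores are hypothetical illustrations and are not taken from the M-CAP software or paper.

```python
import math


def threshold_at_sensitivity(pathogenic_scores, sensitivity=0.95):
    """Return the highest cutoff that still calls at least `sensitivity`
    of the known pathogenic variants (score >= cutoff) pathogenic."""
    ranked = sorted(pathogenic_scores, reverse=True)
    # Round before ceil to avoid floating-point overshoot (e.g. 0.7 * 10).
    k = math.ceil(round(sensitivity * len(ranked), 9))
    k = max(k, 1)
    return ranked[k - 1]


def fraction_dismissed(vus_scores, cutoff):
    """Fraction of uncertain variants scoring below the cutoff,
    i.e. those that would be deprioritized as likely benign."""
    return sum(1 for s in vus_scores if s < cutoff) / len(vus_scores)


if __name__ == "__main__":
    # Toy numbers for illustration only.
    known_pathogenic = [0.9, 0.7, 0.4, 0.2]
    uncertain = [0.1, 0.3, 0.8, 0.05]
    cutoff = threshold_at_sensitivity(known_pathogenic, sensitivity=1.0)
    print(cutoff, fraction_dismissed(uncertain, cutoff))
```

Fixing sensitivity first and then measuring how many uncertain variants fall below the cutoff keeps the comparison between classifiers on equal footing with respect to missed pathogenic variants.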
MVP
PUBMED_LINK
FULL NAME
Missense Variant Pathogenicity prediction
DESCRIPTION
A new prediction method that uses deep residual network to leverage large training data sets and many correlated predictors
URL
KEYWORDS
deep residual network, pathogenic missense variant
TITLE
MVP predicts the pathogenicity of missense variants by deep learning.
Main citation
Qi H, Zhang H, Zhao Y, Chen C, ...&, Shen Y. (2021) MVP predicts the pathogenicity of missense variants by deep learning. Nat Commun, 12 (1) 510. doi:10.1038/s41467-020-20847-0. PMID 33479230
ABSTRACT
Accurate pathogenicity prediction of missense variants is critically important in genetic studies and clinical diagnosis. Previously published prediction methods have facilitated the interpretation of missense variants but have limited performance. Here, we describe MVP (Missense Variant Pathogenicity prediction), a new prediction method that uses deep residual network to leverage large training data sets and many correlated predictors. We train the model separately in genes that are intolerant of loss of function variants and the ones that are tolerant in order to take account of potentially different genetic effect size and mode of action. We compile cancer mutation hotspots and de novo variants from developmental disorders for benchmarking. Overall, MVP achieves better performance in prioritizing pathogenic missense variants than previous methods, especially in genes tolerant of loss of function variants. Finally, using MVP, we estimate that de novo coding variants contribute to 7.8% of isolated congenital heart disease, nearly doubling previous estimates.
DOI
10.1038/s41467-020-20847-0
MetaLR / MetaSVM (MetaLR)
PUBMED_LINK
DESCRIPTION
Ensemble pathogenicity scores integrating multiple annotations.
URL
KEYWORDS
ensemble, missense
USE
prioritization
TITLE
Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies.
Main citation
Dong C, Wei P, Jian X, Gibbs R, ...&, Liu X. (2015) Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Hum Mol Genet, 24 (8) 2125-37. doi:10.1093/hmg/ddu733. PMID 25552646
ABSTRACT
Accurate deleteriousness prediction for nonsynonymous variants is crucial for distinguishing pathogenic mutations from background polymorphisms in whole exome sequencing (WES) studies. Although many deleteriousness prediction methods have been developed, their prediction results are sometimes inconsistent with each other and their relative merits are still unclear in practical applications. To address these issues, we comprehensively evaluated the predictive performance of 18 current deleteriousness-scoring methods, including 11 function prediction scores (PolyPhen-2, SIFT, MutationTaster, Mutation Assessor, FATHMM, LRT, PANTHER, PhD-SNP, SNAP, SNPs&GO and MutPred), 3 conservation scores (GERP++, SiPhy and PhyloP) and 4 ensemble scores (CADD, PON-P, KGGSeq and CONDEL). We found that FATHMM and KGGSeq had the highest discriminative power among independent scores and ensemble scores, respectively. Moreover, to ensure unbiased performance evaluation of these prediction scores, we manually collected three distinct testing datasets, on which no current prediction scores were tuned. In addition, we developed two new ensemble scores that integrate nine independent scores and allele frequency. Our scores achieved the highest discriminative power compared with all the deleteriousness prediction scores tested and showed low false-positive prediction rate for benign yet rare nonsynonymous variants, which demonstrated the value of combining information from multiple orthologous approaches. Finally, to facilitate variant prioritization in WES studies, we have pre-computed our ensemble scores for 87 347 044 possible variants in the whole-exome and made them publicly available through the ANNOVAR software and the dbNSFP database.
DOI
10.1093/hmg/ddu733
MutationAssessor
PUBMED_LINK
DESCRIPTION
Predicts functional impact based on evolutionary conservation.
URL
KEYWORDS
conservation, function
USE
variant effect
TITLE
MutationAssessor in cBioPortal.
Main citation
Su Y, Li X, Reva B, Antipin Y, ...&, Sander C. (2025) MutationAssessor in cBioPortal. bioRxiv. doi:10.1101/2025.08.10.669566. PMID 40832239
ABSTRACT
MutationAssessor (MA) helps researchers evaluate the likely functional impact of somatic and germline mutations in cancer. It provides an evolution-based functional impact score (FIS) to classify mutations based on their likely effect on protein function. FIS scores are based on analysis of patterns of conservation in protein families (conserved residues) and subfamilies (specificity residues). In this new version (r4) we have (1) refined the combinatorial entropy analysis of conservation patterns, (2) recalculated full-length protein multiple sequence alignments covering a larger fraction of human proteins and making use of the explosive growth of protein sequence data, (3) compared predicted functional impact with the pathogenic-benign classification of sequence variants in curated knowledge bases, such as ClinVar, (4) observed the inverse relationship between predicted high functional impact and variant frequency in germline genome sequences and (5) explore the evaluation of switch-of-function mutational effects. Functional impact of ~4 million somatic amino-acid changing mutations across more than 320K human tumor samples are now available in the widely used cBioPortal for Cancer Genomics.
DOI
10.1101/2025.08.10.669566
PolyPhen-2
PUBMED_LINK
FULL NAME
Polymorphism Phenotyping v2
DESCRIPTION
Predicts functional impact of amino acid substitutions.
URL
KEYWORDS
missense, conservation
USE
variant scoring
TITLE
Predicting functional effect of human missense mutations using PolyPhen-2.
Main citation
Adzhubei I, Jordan DM, Sunyaev SR. (2013) Predicting functional effect of human missense mutations using PolyPhen-2. Curr Protoc Hum Genet, Chapter 7, Unit7.20. doi:10.1002/0471142905.hg0720s76. PMID 23315928
ABSTRACT
PolyPhen-2 (Polymorphism Phenotyping v2), available as software and via a Web server, predicts the possible impact of amino acid substitutions on the stability and function of human proteins using structural and comparative evolutionary considerations. It performs functional annotation of single-nucleotide polymorphisms (SNPs), maps coding SNPs to gene transcripts, extracts protein sequence annotations and structural attributes, and builds conservation profiles. It then estimates the probability of the missense mutation being damaging based on a combination of all these properties. PolyPhen-2 features include a high-quality multiple protein sequence alignment pipeline and a prediction method employing machine-learning classification. The software also integrates the UCSC Genome Browser's human genome annotations and MultiZ multiple alignments of vertebrate genomes with the human genome. PolyPhen-2 is capable of analyzing large volumes of data produced by next-generation sequencing projects, thanks to built-in support for high-performance computing environments like Grid Engine and Platform LSF.
DOI
10.1002/0471142905.hg0720s76
PrimateAI-3D
PUBMED_LINK
DESCRIPTION
DL model trained on primate variation + 3D structure.
URL
KEYWORDS
deep learning, primate, missense
USE
clinical variant scoring
TITLE
The landscape of tolerated genetic variation in humans and primates.
Main citation
Gao H, Hamp T, Ede J, Schraiber JG, ...&, Farh KK. (2023) The landscape of tolerated genetic variation in humans and primates. Science, 380 (6648) eabn8197. doi:10.1126/science.abn8197. PMID 37262156
ABSTRACT
Personalized genome sequencing has revealed millions of genetic differences between individuals, but our understanding of their clinical relevance remains largely incomplete. To systematically decipher the effects of human genetic variants, we obtained whole-genome sequencing data for 809 individuals from 233 primate species and identified 4.3 million common protein-altering variants with orthologs in humans. We show that these variants can be inferred to have nondeleterious effects in humans based on their presence at high allele frequencies in other primate populations. We use this resource to classify 6% of all possible human protein-altering variants as likely benign and impute the pathogenicity of the remaining 94% of variants with deep learning, achieving state-of-the-art accuracy for diagnosing pathogenic variants in patients with genetic diseases.
DOI
10.1126/science.abn8197
REVEL
PUBMED_LINK
FULL NAME
Rare Exome Variant Ensemble Learner
DESCRIPTION
Ensemble method integrating multiple tools to predict pathogenicity.
URL
KEYWORDS
ensemble, missense
USE
pathogenicity scoring
TITLE
REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants.
Main citation
Ioannidis NM, Rothstein JH, Pejaver V, Middha S, ...&, Sieh W. (2016) REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. Am J Hum Genet, 99 (4) 877-885. doi:10.1016/j.ajhg.2016.08.016. PMID 27666373
ABSTRACT
The vast majority of coding variants are rare, and assessment of the contribution of rare variants to complex traits is hampered by low statistical power and limited functional data. Improved methods for predicting the pathogenicity of rare coding variants are needed to facilitate the discovery of disease variants from exome sequencing studies. We developed REVEL (rare exome variant ensemble learner), an ensemble method for predicting the pathogenicity of missense variants on the basis of individual tools: MutPred, FATHMM, VEST, PolyPhen, SIFT, PROVEAN, MutationAssessor, MutationTaster, LRT, GERP, SiPhy, phyloP, and phastCons. REVEL was trained with recently discovered pathogenic and rare neutral missense variants, excluding those previously used to train its constituent tools. When applied to two independent test sets, REVEL had the best overall performance (p < 10^-12) as compared to any individual tool and seven ensemble methods: MetaSVM, MetaLR, KGGSeq, Condel, CADD, DANN, and Eigen. Importantly, REVEL also had the best performance for distinguishing pathogenic from rare neutral variants with allele frequencies <0.5%. The area under the receiver operating characteristic curve (AUC) for REVEL was 0.046-0.182 higher in an independent test set of 935 recent SwissVar disease variants and 123,935 putatively neutral exome sequencing variants and 0.027-0.143 higher in an independent test set of 1,953 pathogenic and 2,406 benign variants recently reported in ClinVar than the AUCs for other ensemble methods. We provide pre-computed REVEL scores for all possible human missense variants to facilitate the identification of pathogenic variants in the sea of rare variants discovered as sequencing studies expand in scale.
DOI
10.1016/j.ajhg.2016.08.016
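Note: several abstracts in this listing, including REVEL's, compare tools by the area under the ROC curve (AUC). As a reminder of what that number measures, the sketch below computes AUC as the probability that a randomly chosen pathogenic variant scores higher than a randomly chosen benign one, with ties counted as one half. The function name and the example scores are hypothetical; this is a plain pairwise illustration, not the evaluation code used by any of the cited papers.

```python
def roc_auc(pathogenic_scores, benign_scores):
    """Pairwise (Mann-Whitney) estimate of ROC AUC.

    Counts the fraction of pathogenic/benign pairs in which the
    pathogenic variant receives the higher score; ties count 0.5.
    Quadratic in input size, so suitable only for small examples.
    """
    wins = 0.0
    for p in pathogenic_scores:
        for b in benign_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(pathogenic_scores) * len(benign_scores))


if __name__ == "__main__":
    # Toy scores for illustration only.
    print(roc_auc([0.9, 0.8], [0.1, 0.85]))
```

An AUC difference such as the 0.027-0.143 quoted above is therefore a difference in this pairwise ranking probability between two scoring methods on the same test set.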
SIFT
PUBMED_LINK
FULL NAME
Sorting Intolerant From Tolerant
DESCRIPTION
Predicts whether substitutions affect protein function.
URL
KEYWORDS
conservation, missense
USE
variant scoring
TITLE
SIFT: Predicting amino acid changes that affect protein function.
Main citation
Ng PC, Henikoff S. (2003) SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res, 31 (13) 3812-4. doi:10.1093/nar/gkg509. PMID 12824425
ABSTRACT
Single nucleotide polymorphism (SNP) studies and random mutagenesis projects identify amino acid substitutions in protein-coding regions. Each substitution has the potential to affect protein function. SIFT (Sorting Intolerant From Tolerant) is a program that predicts whether an amino acid substitution affects protein function so that users can prioritize substitutions for further study. We have shown that SIFT can distinguish between functionally neutral and deleterious amino acid changes in mutagenesis studies and on human polymorphisms. SIFT is available at http://blocks.fhcrc.org/sift/SIFT.html.
DOI
10.1093/nar/gkg509
UNEECON
PUBMED_LINK
DESCRIPTION
UNEECON is a statistical method for inferring deleterious mutations and constrained genes in human and potentially other species.
URL
TITLE
Unified inference of missense variant effects and gene constraints in the human genome.
Main citation
Huang YF. (2020) Unified inference of missense variant effects and gene constraints in the human genome. PLoS Genet, 16 (7) e1008922. doi:10.1371/journal.pgen.1008922. PMID 32667917
ABSTRACT
A challenge in medical genomics is to identify variants and genes associated with severe genetic disorders. Based on the premise that severe, early-onset disorders often result in a reduction of evolutionary fitness, several statistical methods have been developed to predict pathogenic variants or constrained genes based on the signatures of negative selection in human populations. However, we currently lack a statistical framework to jointly predict deleterious variants and constrained genes from both variant-level features and gene-level selective constraints. Here we present such a unified approach, UNEECON, based on deep learning and population genetics. UNEECON treats the contributions of variant-level features and gene-level constraints as a variant-level fixed effect and a gene-level random effect, respectively. The sum of the fixed and random effects is then combined with an evolutionary model to infer the strength of negative selection at both variant and gene levels. Compared with previously published methods, UNEECON shows improved performance in predicting missense variants and protein-coding genes associated with autosomal dominant disorders, and feature importance analysis suggests that both gene-level selective constraints and variant-level predictors are important for accurate variant prioritization. Furthermore, based on UNEECON, we observe a low correlation between gene-level intolerance to missense mutations and that to loss-of-function mutations, which can be partially explained by the prevalence of disordered protein regions that are highly tolerant to missense mutations. Finally, we show that genes intolerant to both missense and loss-of-function mutations play key roles in the central nervous system and the autism spectrum disorders. Overall, UNEECON is a promising framework for both variant and gene prioritization.
DOI
10.1371/journal.pgen.1008922
VEST4
PUBMED_LINK
FULL NAME
Variant Effect Scoring Tool v4
DESCRIPTION
Machine learning pathogenicity score for SNVs.
URL
KEYWORDS
ML, SNV
USE
variant scoring
TITLE
Identifying Mendelian disease genes with the variant effect scoring tool.
Main citation
Carter H, Douville C, Stenson PD, Cooper DN, ...&, Karchin R. (2013) Identifying Mendelian disease genes with the variant effect scoring tool. BMC Genomics, 14 Suppl 3 (Suppl 3) S3. doi:10.1186/1471-2164-14-S3-S3. PMID 23819870
ABSTRACT
BACKGROUND: Whole exome sequencing studies identify hundreds to thousands of rare protein coding variants of ambiguous significance for human health. Computational tools are needed to accelerate the identification of specific variants and genes that contribute to human disease. RESULTS: We have developed the Variant Effect Scoring Tool (VEST), a supervised machine learning-based classifier, to prioritize rare missense variants with likely involvement in human disease. The VEST classifier training set comprised ~45,000 disease mutations from the latest Human Gene Mutation Database release and another ~45,000 high frequency (allele frequency >1%) putatively neutral missense variants from the Exome Sequencing Project. VEST outperforms some of the most popular methods for prioritizing missense variants in carefully designed holdout benchmarking experiments (VEST ROC AUC = 0.91, PolyPhen2 ROC AUC = 0.86, SIFT4.0 ROC AUC = 0.84). VEST estimates variant score p-values against a null distribution of VEST scores for neutral variants not included in the VEST training set. These p-values can be aggregated at the gene level across multiple disease exomes to rank genes for probable disease involvement. We tested the ability of an aggregate VEST gene score to identify candidate Mendelian disease genes, based on whole-exome sequencing of a small number of disease cases. We used whole-exome data for two Mendelian disorders for which the causal gene is known. Considering only genes that contained variants in all cases, the VEST gene score ranked dihydroorotate dehydrogenase (DHODH) number 2 of 2253 genes in four cases of Miller syndrome, and myosin-3 (MYH3) number 2 of 2313 genes in three cases of Freeman Sheldon syndrome. CONCLUSIONS: Our results demonstrate the potential power gain of aggregating bioinformatics variant scores into gene-level scores and the general utility of bioinformatics in assisting the search for disease genes in large-scale exome sequencing studies. VEST is available as a stand-alone software package at http://wiki.chasmsoftware.org and is hosted by the CRAVAT web server at http://www.cravat.us.
DOI
10.1186/1471-2164-14-S3-S3
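Note: the VEST abstract describes two steps: converting a variant score to a p-value against a null distribution of scores from neutral variants, then aggregating p-values at the gene level. The sketch below illustrates the general idea only. The empirical p-value uses a standard add-one correction, and the aggregation uses Fisher's method, which is an assumption chosen here for illustration; the abstract does not say which combination rule VEST applies. All function names and numbers are hypothetical.

```python
import math


def empirical_p_value(score, null_scores):
    """P-value of `score` against a null sample of neutral-variant scores.

    Uses the add-one correction so the estimate is never exactly zero.
    """
    at_least_as_high = sum(1 for s in null_scores if s >= score)
    return (at_least_as_high + 1) / (len(null_scores) + 1)


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method (an assumption).

    The statistic X = -2 * sum(ln p) is chi-square with 2k degrees of
    freedom; for even degrees of freedom the survival function has the
    closed form exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # equals X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


if __name__ == "__main__":
    # Toy values for illustration only.
    null = [0.1, 0.2, 0.95]
    p = empirical_p_value(0.9, null)
    print(p, fisher_combined_p([p, 0.01]))
```

Genes can then be ranked by the combined p-value, smallest first, which mirrors the gene-ranking use described in the abstract.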