References Function

Curation of Function — listings under the References tab.

Summary Table

NAME	CATEGORY	Main citation	YEAR
ESM-2	Evolutionary and Generative Protein Models	Lin Z et al., Science, 2023	2023
EVE	Evolutionary and Generative Protein Models	Frazer J et al., Nature, 2021	2021
ProGen2	Evolutionary and Generative Protein Models	Nijkamp E et al., Cell Syst, 2023	2023
ProtBERT	Evolutionary and Generative Protein Models	Elnaggar A et al., IEEE Trans Pattern Anal Mach Intell, 2022	2022
ProteinBERT	Evolutionary and Generative Protein Models	Brandes N et al., Bioinformatics, 2022	2022
ClinGen	Functional Annotation	Shah N et al., Cell Genom, 2026	2026
ClinVar	Functional Annotation	Landrum MJ et al., Nucleic Acids Res, 2025	2025
dbNSFP v4	Functional Annotation	Liu X et al., Genome Med, 2020	2020
DAVID	Pathway and Gene Ontology Enrichment	Huang DW et al., Genome Biol, 2007	2007
Gene Ontology	Pathway and Gene Ontology Enrichment	Ashburner M et al., Nat Genet, 2000	2000
AlphaFold 2	Structure Prediction	Jumper J et al., Nature, 2021	2021
AlphaFold 3	Structure Prediction	Abramson J et al., Nature, 2024	2024
AlphaFold	Structure Prediction	Senior AW et al., Nature, 2020	2020
AlphaMissense	Variant Effect Prediction	Cheng J et al., Science, 2023	2023
CADD v1.4	Variant Effect Prediction	Rentzsch P et al., Nucleic Acids Res, 2019	2019
CADD v1.6 (CADD-Splice)	Variant Effect Prediction	Rentzsch P et al., Genome Med, 2021	2021
CADD v1.7	Variant Effect Prediction	Schubach M et al., Nucleic Acids Res, 2024	2024
CADD	Variant Effect Prediction	Kircher M et al., Nat Genet, 2014	2014
M-CAP	Variant Effect Prediction	Jagadeesh KA et al., Nat Genet, 2016	2016
MVP	Variant Effect Prediction	Qi H et al., Nat Commun, 2021	2021
MetaLR / MetaSVM	Variant Effect Prediction	Dong C et al., Hum Mol Genet, 2015	2015
MutationAssessor	Variant Effect Prediction	Su Y et al., bioRxiv, 2025	2025
PolyPhen-2	Variant Effect Prediction	Adzhubei I et al., Curr Protoc Hum Genet, 2013	2013
PrimateAI-3D	Variant Effect Prediction	Gao H et al., Science, 2023	2023
REVEL	Variant Effect Prediction	Ioannidis NM et al., Am J Hum Genet, 2016	2016
SIFT	Variant Effect Prediction	Ng PC et al., Nucleic Acids Res, 2003	2003
UNEECON	Variant Effect Prediction	Huang YF, PLoS Genet, 2020	2020
VEST4	Variant Effect Prediction	Carter H et al., BMC Genomics, 2013	2013
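
The summary table above is laid out as tab-separated rows, so it can also be loaded and sorted outside the page. A minimal sketch in Python, assuming the rows have been copied as tab-delimited text; `SUMMARY_TSV`, `load_rows`, and `sort_rows` are illustrative names and not part of the site:

```python
import csv
import io

# Three rows copied from the summary table; the other rows follow the same layout.
SUMMARY_TSV = (
    "NAME\tCATEGORY\tMain citation\tYEAR\n"
    "ESM-2\tEvolutionary and Generative Protein Models\tLin Z et al., Science, 2023\t2023\n"
    "Gene Ontology\tPathway and Gene Ontology Enrichment\tAshburner M et al., Nat Genet, 2000\t2000\n"
    "CADD v1.7\tVariant Effect Prediction\tSchubach M et al., Nucleic Acids Res, 2024\t2024\n"
)


def load_rows(tsv_text):
    """Parse tab-separated text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def sort_rows(rows, column, descending=False):
    """Sort by one column: YEAR as an integer, any other column as lowercase text."""
    if column == "YEAR":
        key = lambda row: int(row[column])
    else:
        key = lambda row: row[column].lower()
    return sorted(rows, key=key, reverse=descending)


rows = load_rows(SUMMARY_TSV)
by_year = sort_rows(rows, "YEAR")
by_name = sort_rows(rows, "NAME")

for row in by_year:
    print(row["YEAR"], row["NAME"], row["CATEGORY"], sep="\t")
```

Comparing YEAR as an integer rather than as text keeps the ordering correct if a row ever carries a differently formatted year; the text columns are lowercased so that entries such as "dbNSFP v4" sort alongside capitalized names.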

Evolutionary and Generative Protein Models

ESM-2

Reference

PUBMED_LINK

36927031

FULL NAME

Evolutionary Scale Modeling v2

DESCRIPTION

Large-scale protein language model enabling structure/function prediction.

URL

https://esmatlas.com/

KEYWORDS

transformer, LLM, structure, sequence

USE

embeddings, structure prediction

TITLE

Evolutionary-scale prediction of atomic-level protein structure with a language model.

Main citation

Lin Z, Akin H, Rao R, Hie B, ...&, Rives A. (2023) Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379 (6637) 1123-1130. doi:10.1126/science.ade2574. PMID 36927031

ABSTRACT

Recent advances in machine learning have leveraged evolutionary information in multiple sequence alignments to predict protein structure. We demonstrate direct inference of full atomic-level protein structure from primary sequence using a large language model. As language models of protein sequences are scaled up to 15 billion parameters, an atomic-resolution picture of protein structure emerges in the learned representations. This results in an order-of-magnitude acceleration of high-resolution structure prediction, which enables large-scale structural characterization of metagenomic proteins. We apply this capability to construct the ESM Metagenomic Atlas by predicting structures for >617 million metagenomic protein sequences, including >225 million that are predicted with high confidence, which gives a view into the vast breadth and diversity of natural proteins.

DOI

10.1126/science.ade2574
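
Each Main citation string in this listing ends with a `doi:` token followed by a `PMID` number, as in the ESM-2 entry above. A sketch of pulling both identifiers out of such a string and turning them into resolver links; the helper names are hypothetical, and the link patterns are the public doi.org and PubMed URL forms, which are not necessarily the targets this page itself uses:

```python
import re

# Main citation string copied from the ESM-2 entry.
CITATION = (
    "Lin Z, Akin H, Rao R, Hie B, ...&, Rives A. (2023) Evolutionary-scale "
    "prediction of atomic-level protein structure with a language model. "
    "Science, 379 (6637) 1123-1130. doi:10.1126/science.ade2574. PMID 36927031"
)


def extract_ids(citation):
    """Return the DOI and PMID found in a citation string (None when absent)."""
    doi = re.search(r"doi:(\S+)", citation)
    pmid = re.search(r"PMID\s+(\d+)", citation)
    return {
        # The DOI is followed by a sentence-ending period that is not part of it.
        "doi": doi.group(1).rstrip(".") if doi else None,
        "pmid": pmid.group(1) if pmid else None,
    }


def resolver_links(ids):
    """Build doi.org and PubMed links for whichever identifiers are present."""
    links = {}
    if ids["doi"]:
        links["doi"] = "https://doi.org/" + ids["doi"]
    if ids["pmid"]:
        links["pubmed"] = "https://pubmed.ncbi.nlm.nih.gov/" + ids["pmid"] + "/"
    return links


ids = extract_ids(CITATION)
links = resolver_links(ids)
print(ids)
print(links)
```

The extracted values can be cross-checked against the separate PUBMED_LINK and DOI fields of the same record.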

EVE

Reference

PUBMED_LINK

34707284

FULL NAME

Evolutionary model of Variant Effect

DESCRIPTION

VAE-based unsupervised model to predict variant impact using MSAs.

URL

https://evemodel.org/

KEYWORDS

evolutionary, MSA, variant effect

USE

missense scoring

TITLE

Disease variant prediction with deep generative models of evolutionary data.

Main citation

Frazer J, Notin P, Dias M, Gomez A, ...&, Marks DS. (2021) Disease variant prediction with deep generative models of evolutionary data. Nature, 599 (7883) 91-95. doi:10.1038/s41586-021-04043-8. PMID 34707284

ABSTRACT

Quantifying the pathogenicity of protein variants in human disease-related genes would have a marked effect on clinical decisions, yet the overwhelming majority (over 98%) of these variants still have unknown consequences. In principle, computational methods could support the large-scale interpretation of genetic variants. However, state-of-the-art methods have relied on training machine learning models on known disease labels. As these labels are sparse, biased and of variable quality, the resulting models have been considered insufficiently reliable. Here we propose an approach that leverages deep generative models to predict variant pathogenicity without relying on labels. By modelling the distribution of sequence variation across organisms, we implicitly capture constraints on the protein sequences that maintain fitness. Our model EVE (evolutionary model of variant effect) not only outperforms computational approaches that rely on labelled data but also performs on par with, if not better than, predictions from high-throughput experiments, which are increasingly used as evidence for variant classification. We predict the pathogenicity of more than 36 million variants across 3,219 disease genes and provide evidence for the classification of more than 256,000 variants of unknown significance. Our work suggests that models of evolutionary information can provide valuable independent evidence for variant interpretation that will be widely useful in research and clinical settings.

DOI

10.1038/s41586-021-04043-8

ProGen2

Reference

PUBMED_LINK

37909046

DESCRIPTION

Generative protein design using LLMs trained on protein sequences.

URL

https://github.com/salesforce/progen

KEYWORDS

protein design, LLM

USE

sequence generation

TITLE

ProGen2: Exploring the boundaries of protein language models.

Main citation

Nijkamp E, Ruffolo JA, Weinstein EN, Naik N, ...&, Madani A. (2023) ProGen2: Exploring the boundaries of protein language models. Cell Syst, 14 (11) 968-978.e3. doi:10.1016/j.cels.2023.10.002. PMID 37909046

ABSTRACT

Attention-based models trained on protein sequences have demonstrated incredible success at classification and generation tasks relevant for artificial-intelligence-driven protein design. However, we lack a sufficient understanding of how very large-scale models and data play a role in effective protein model development. We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters and trained on different sequence datasets drawn from over a billion proteins from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional fine-tuning. As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model. Our models and code are open sourced for widespread adoption in protein engineering. A record of this paper's Transparent Peer Review process is included in the supplemental information.

DOI

10.1016/j.cels.2023.10.002

ProtBERT

Reference

PUBMED_LINK

34232869

DESCRIPTION

BERT-based protein language model for downstream functional tasks.

URL

https://huggingface.co/Rostlab/prot_bert

KEYWORDS

protein LM, transformer, embeddings

USE

feature extraction

TITLE

ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning.

Main citation

Elnaggar A, Heinzinger M, Dallago C, Rehawi G, ...&, Rost B. (2022) ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Trans Pattern Anal Mach Intell, 44 (10) 7112-7127. doi:10.1109/TPAMI.2021.3095381. PMID 34232869

ABSTRACT

Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The protein LMs (pLMs) were trained on the Summit supercomputer using 5616 GPUs and TPU Pod up-to 1024 cores. Dimensionality reduction revealed that the raw pLM-embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks: (1) a per-residue (per-token) prediction of protein secondary structure (3-state accuracy Q3=81%-87%); (2) per-protein (pooling) predictions of protein sub-cellular location (ten-state accuracy: Q10=81%) and membrane versus water-soluble (2-state accuracy Q2=91%). For secondary structure, the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without multiple sequence alignments (MSAs) or evolutionary information thereby bypassing expensive database searches. Taken together, the results implied that pLMs learned some of the grammar of the language of life. All our models are available through https://github.com/agemagician/ProtTrans.

DOI

10.1109/TPAMI.2021.3095381

ProteinBERT

Reference

PUBMED_LINK

35020807

TITLE

ProteinBERT: a universal deep-learning model of protein sequence and function.

Main citation

Brandes N, Ofer D, Peleg Y, Rappoport N, ...&, Linial M. (2022) ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics, 38 (8) 2102-2110. doi:10.1093/bioinformatics/btac020. PMID 35020807

ABSTRACT

SUMMARY: Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed to biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme combines language modeling with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to long sequences. The architecture of ProteinBERT consists of both local and global representations, allowing end-to-end processing of these types of inputs and outputs. ProteinBERT obtains near state-of-the-art performance, and sometimes exceeds it, on multiple benchmarks covering diverse protein properties (including protein structure, post-translational modifications and biophysical attributes), despite using a far smaller and faster model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data. AVAILABILITY AND IMPLEMENTATION: Code and pretrained model weights are available at https://github.com/nadavbra/protein_bert. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

DOI

10.1093/bioinformatics/btac020

Functional Annotation

ClinGen

Reference

PUBMED_LINK

41956073

FULL NAME

Clinical Genome Resource

DESCRIPTION

The Clinical Genome Resource (ClinGen) is an NIH-funded resource that provides authoritative standards and resources for evaluating the clinical relevance of genes and variants.

URL

https://clinicalgenome.org/

KEYWORDS

clinical genomics, variant interpretation, gene-disease association, curation

USE

clinical annotation

TITLE

ClinGen API platform for classification of human genetic variants.

Main citation

Shah N, Farris T, Zuniga AA, Jackson AR, ...&, Milosavljevic A. (2026) ClinGen API platform for classification of human genetic variants. Cell Genom, 6 (4) 101211. doi:10.1016/j.xgen.2026.101211. PMID 41956073

ABSTRACT

In this commentary, we describe how the Clinical Genome Resource's (ClinGen's) application programming interface-based microservices accelerate growth and dissemination of knowledge about human genetic variation. By exposing findable, accessible, interoperable, reusable, and AI-ready variant data, ClinGen lays a foundation for next-generation software applications, AI systems, and variant classification workflows.

DOI

10.1016/j.xgen.2026.101211

ClinVar

Reference

PUBMED_LINK

39578691

DESCRIPTION

Archive of clinically relevant variants with interpretations.

URL

https://www.ncbi.nlm.nih.gov/clinvar/

KEYWORDS

pathogenicity, variant, clinical

USE

clinical annotation

TITLE

ClinVar: updates to support classifications of both germline and somatic variants.

Main citation

Landrum MJ, Chitipiralla S, Kaur K, Brown G, ...&, Kattman BL. (2025) ClinVar: updates to support classifications of both germline and somatic variants. Nucleic Acids Res, 53 (D1) D1313-D1321. doi:10.1093/nar/gkae1090. PMID 39578691

ABSTRACT

ClinVar (www.ncbi.nlm.nih.gov/clinvar/) is a free, public database of human genetic variants and their relationships to disease, with >3 million variants submitted by >2800 organizations across the world. The database was recently updated to have three types of classifications: germline, oncogenicity and clinical impact for somatic variants. As for germline variants, classifications for somatic variants can be submitted in batches in a file submission or through the submission API; variants can also be submitted and updated one at a time in online submission forms. The ClinVar XML files were redesigned to allow multiple classification types. Both old and new formats of the XML are supported through the end of 2024. Data for somatic classifications were also added to the ClinVar VCF files and to several tab-delimited files. The ClinVar VCV pages were updated to display the three types of classifications, both as it was submitted and as it was aggregated by ClinVar. Clinical testing laboratories and others in the cancer community are invited to share their classifications of somatic variant classifications through ClinVar to provide transparency in genomic testing and improve patient care.

DOI

10.1093/nar/gkae1090

dbNSFP v4 (dbNSFP)

Reference

PUBMED_LINK

33261662

FULL NAME

Database for Nonsynonymous SNPs’ Functional Predictions

DESCRIPTION

Database aggregating functional predictions and annotations for nonsynonymous variants.

URL

https://sites.google.com/site/jpopgen/dbNSFP

KEYWORDS

annotation, variant, missense

USE

meta-annotation

TITLE

dbNSFP v4: a comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs.

Main citation

Liu X, Li C, Mou C, Dong Y, ...&, Tu Y. (2020) dbNSFP v4: a comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs. Genome Med, 12 (1) 103. doi:10.1186/s13073-020-00803-9. PMID 33261662

ABSTRACT

Whole exome sequencing has been increasingly used in human disease studies. Prioritization based on appropriate functional annotations has been used as an indispensable step to select candidate variants. Here we present the latest updates to dbNSFP (version 4.1), a database designed to facilitate this step by providing deleteriousness prediction and functional annotation for all potential nonsynonymous and splice-site SNVs (a total of 84,013,093) in the human genome. The current version compiled 36 deleteriousness prediction scores, including 12 transcript-specific scores, and other variant and gene-level functional annotations. The database is available at http://database.liulab.science/dbNSFP with a downloadable version and a web-service.

DOI

10.1186/s13073-020-00803-9

Pathway and Gene Ontology Enrichment

DAVID

Reference

PUBMED_LINK

17784955

FULL NAME

Database for Annotation, Visualization and Integrated Discovery

DESCRIPTION

Functional annotation and enrichment analysis platform.

URL

https://david.ncifcrf.gov/

KEYWORDS

functional enrichment, GO, pathway

USE

enrichment analysis

TITLE

The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists.

Main citation

Huang DW, Sherman BT, Tan Q, Collins JR, ...&, Lempicki RA. (2007) The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol, 8 (9) R183. doi:10.1186/gb-2007-8-9-r183. PMID 17784955

ABSTRACT

The DAVID Gene Functional Classification Tool (http://david.abcc.ncifcrf.gov) uses a novel agglomeration algorithm to condense a list of genes or associated biological terms into organized classes of related genes or biology, called biological modules. This organization is accomplished by mining the complex biological co-occurrences found in multiple sources of functional annotation. It is a powerful method to group functionally related genes and terms into a manageable number of biological modules for efficient interpretation of gene lists in a network context.

DOI

10.1186/gb-2007-8-9-r183

Gene Ontology (GO)

Reference

PUBMED_LINK

10802651

DESCRIPTION

Controlled vocabulary for gene function classification.

URL

http://geneontology.org/

KEYWORDS

GO terms, pathways

USE

enrichment analysis

TITLE

Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Main citation

Ashburner M, Ball CA, Blake JA, Botstein D, ...&, Sherlock G. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25 (1) 25-9. doi:10.1038/75556. PMID 10802651

ABSTRACT

Genomic sequencing has made it clear that a large fraction of the genes specifying the core biological functions are shared by all eukaryotes. Knowledge of the biological role of such shared proteins in one organism can often be transferred to other organisms. The goal of the Gene Ontology Consortium is to produce a dynamic, controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing. To this end, three independent ontologies accessible on the World-Wide Web (http://www.geneontology.org) are being constructed: biological process, molecular function and cellular component.

DOI

10.1038/75556

Structure Prediction

AlphaFold

Reference

PUBMED_LINK

31942072

URL

https://alphafold.ebi.ac.uk/

TITLE

Improved protein structure prediction using potentials from deep learning.

Main citation

Senior AW, Evans R, Jumper J, Kirkpatrick J, ...&, Hassabis D. (2020) Improved protein structure prediction using potentials from deep learning. Nature, 577 (7792) 706-710. doi:10.1038/s41586-019-1923-7. PMID 31942072

ABSTRACT

Protein structure prediction can be used to determine the three-dimensional shape of a protein from its amino acid sequence. This problem is of fundamental importance as the structure of a protein largely determines its function; however, protein structures can be difficult to determine experimentally. Considerable progress has recently been made by leveraging genetic information. It is possible to infer which amino acid residues are in contact by analysing covariation in homologous sequences, which aids in the prediction of protein structures. Here we show that we can train a neural network to make accurate predictions of the distances between pairs of residues, which convey more information about the structure than contact predictions. Using this information, we construct a potential of mean force that can accurately describe the shape of a protein. We find that the resulting potential can be optimized by a simple gradient descent algorithm to generate structures without complex sampling procedures. The resulting system, named AlphaFold, achieves high accuracy, even for sequences with fewer homologous sequences. In the recent Critical Assessment of Protein Structure Prediction (CASP13), a blind assessment of the state of the field, AlphaFold created high-accuracy structures (with template modelling (TM) scores of 0.7 or higher) for 24 out of 43 free modelling domains, whereas the next best method, which used sampling and contact information, achieved such accuracy for only 14 out of 43 domains. AlphaFold represents a considerable advance in protein-structure prediction. We expect this increased accuracy to enable insights into the function and malfunction of proteins, especially in cases for which no structures for homologous proteins have been experimentally determined.

DOI

10.1038/s41586-019-1923-7

AlphaFold 2 (AlphaFold)

Reference

PUBMED_LINK

34265844

FULL NAME

AlphaFold Protein Structure Database

DESCRIPTION

High-accuracy protein structure prediction using deep learning.

URL

https://alphafold.ebi.ac.uk/

KEYWORDS

protein structure, deep learning, folding

USE

structure prediction

SERVER

EMBL-EBI

TITLE

Highly accurate protein structure prediction with AlphaFold.

Main citation

Jumper J, Evans R, Pritzel A, Green T, ...&, Hassabis D. (2021) Highly accurate protein structure prediction with AlphaFold. Nature, 596 (7873) 583-589. doi:10.1038/s41586-021-03819-2. PMID 34265844

ABSTRACT

Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort, the structures of around 100,000 unique proteins have been determined, but this represents a small fraction of the billions of known protein sequences. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence (the structure prediction component of the 'protein folding problem') has been an important open research problem for more than 50 years. Despite recent progress, existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14), demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.

DOI

10.1038/s41586-021-03819-2

AlphaFold 3

Reference

PUBMED_LINK

38718835

URL

https://alphafold.ebi.ac.uk/

TITLE

Accurate structure prediction of biomolecular interactions with AlphaFold 3.

Main citation

Abramson J, Adler J, Dunger J, Evans R, ...&, Jumper JM. (2024) Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630 (8016) 493-500. doi:10.1038/s41586-024-07487-w. PMID 38718835

ABSTRACT

The introduction of AlphaFold 2 has spurred a revolution in modelling the structure of proteins and their interactions, enabling a huge range of applications in protein modelling and design. Here we describe our AlphaFold 3 model with a substantially updated diffusion-based architecture that is capable of predicting the joint structure of complexes including proteins, nucleic acids, small molecules, ions and modified residues. The new AlphaFold model demonstrates substantially improved accuracy over many previous specialized tools: far greater accuracy for protein-ligand interactions compared with state-of-the-art docking tools, much higher accuracy for protein-nucleic acid interactions compared with nucleic-acid-specific predictors and substantially higher antibody-antigen prediction accuracy compared with AlphaFold-Multimer v.2.3. Together, these results show that high-accuracy modelling across biomolecular space is possible within a single unified deep-learning framework.

DOI

10.1038/s41586-024-07487-w

Variant Effect Prediction

AlphaMissense

Reference

PUBMED_LINK

37733863

DESCRIPTION

Deep learning model predicting pathogenicity of all possible missense variants in human proteins.

URL

https://github.com/google-deepmind/alphamissense

KEYWORDS

missense, pathogenicity, variant effect, deep learning

USE

variant effect scoring

TITLE

Accurate proteome-wide missense variant effect prediction with AlphaMissense.

Main citation

Cheng J, Novati G, Pan J, Bycroft C, ...&, Avsec Ž. (2023) Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science, 381 (6664) eadg7492. doi:10.1126/science.adg7492. PMID 37733863

ABSTRACT

The vast majority of missense variants observed in the human genome are of unknown clinical significance. We present AlphaMissense, an adaptation of AlphaFold fine-tuned on human and primate variant population frequency databases to predict missense variant pathogenicity. By combining structural context and evolutionary conservation, our model achieves state-of-the-art results across a wide range of genetic and experimental benchmarks, all without explicitly training on such data. The average pathogenicity score of genes is also predictive for their cell essentiality, capable of identifying short essential genes that existing statistical approaches are underpowered to detect. As a resource to the community, we provide a database of predictions for all possible human single amino acid substitutions and classify 89% of missense variants as either likely benign or likely pathogenic.

DOI

10.1126/science.adg7492

CADD

Reference

PUBMED_LINK

24487276

FULL NAME

Combined Annotation–Dependent Depletion

DESCRIPTION

Integrates many diverse annotations into a single score of variant deleteriousness.

URL

https://cadd.gs.washington.edu/

KEYWORDS

genome-wide, deleteriousness, annotation

USE

prioritization, filtering

TITLE

A general framework for estimating the relative pathogenicity of human genetic variants.

Main citation

Kircher M, Witten DM, Jain P, O'Roak BJ, ...&, Shendure J. (2014) A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet, 46 (3) 310-5. doi:10.1038/ng.2892. PMID 24487276

ABSTRACT

Current methods for annotating and interpreting human genetic variation tend to exploit a single information type (for example, conservation) and/or are restricted in scope (for example, to missense changes). Here we describe Combined Annotation-Dependent Depletion (CADD), a method for objectively integrating many diverse annotations into a single measure (C score) for each variant. We implement CADD as a support vector machine trained to differentiate 14.7 million high-frequency human-derived alleles from 14.7 million simulated variants. We precompute C scores for all 8.6 billion possible human single-nucleotide variants and enable scoring of short insertions-deletions. C scores correlate with allelic diversity, annotations of functionality, pathogenicity, disease severity, experimentally measured regulatory effects and complex trait associations, and they highly rank known pathogenic variants within individual genomes. The ability of CADD to prioritize functional, deleterious and pathogenic variants across many functional categories, effect sizes and genetic architectures is unmatched by any current single-annotation method.

DOI

10.1038/ng.2892

CADD v1.4

Reference

PUBMED_LINK

30371827

URL

https://cadd.gs.washington.edu/

TITLE

CADD: predicting the deleteriousness of variants throughout the human genome.

Main citation

Rentzsch P, Witten D, Cooper GM, Shendure J, ...&, Kircher M. (2019) CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res, 47 (D1) D886-D894. doi:10.1093/nar/gky1016. PMID 30371827

ABSTRACT

Combined Annotation-Dependent Depletion (CADD) is a widely used measure of variant deleteriousness that can effectively prioritize causal variants in genetic analyses, particularly highly penetrant contributors to severe Mendelian disorders. CADD is an integrative annotation built from more than 60 genomic features, and can score human single nucleotide variants and short insertion and deletions anywhere in the reference assembly. CADD uses a machine learning model trained on a binary distinction between simulated de novo variants and variants that have arisen and become fixed in human populations since the split between humans and chimpanzees; the former are free of selective pressure and may thus include both neutral and deleterious alleles, while the latter are overwhelmingly neutral (or, at most, weakly deleterious) by virtue of having survived millions of years of purifying selection. Here we review the latest updates to CADD, including the most recent version, 1.4, which supports the human genome build GRCh38. We also present updates to our website that include simplified variant lookup, extended documentation, an Application Program Interface and improved mechanisms for integrating CADD scores into other tools or applications. CADD scores, software and documentation are available at https://cadd.gs.washington.edu.

DOI

10.1093/nar/gky1016

CADD v1.6 (CADD-Splice)

Reference

PUBMED_LINK

33618777

URL

https://cadd.gs.washington.edu/

TITLE

CADD-Splice-improving genome-wide variant effect prediction using deep learning-derived splice scores.

Main citation

Rentzsch P, Schubach M, Shendure J, Kircher M. (2021) CADD-Splice-improving genome-wide variant effect prediction using deep learning-derived splice scores. Genome Med, 13 (1) 31. doi:10.1186/s13073-021-00835-9. PMID 33618777

ABSTRACT

BACKGROUND: Splicing of genomic exons into mRNAs is a critical prerequisite for the accurate synthesis of human proteins. Genetic variants impacting splicing underlie a substantial proportion of genetic disease, but are challenging to identify beyond those occurring at donor and acceptor dinucleotides. To address this, various methods aim to predict variant effects on splicing. Recently, deep neural networks (DNNs) have been shown to achieve better results in predicting splice variants than other strategies. METHODS: It has been unclear how best to integrate such process-specific scores into genome-wide variant effect predictors. Here, we use a recently published experimental data set to compare several machine learning methods that score variant effects on splicing. We integrate the best of those approaches into general variant effect prediction models and observe the effect on classification of known pathogenic variants. RESULTS: We integrate two specialized splicing scores into CADD (Combined Annotation Dependent Depletion; cadd.gs.washington.edu), a widely used tool for genome-wide variant effect prediction that we previously developed to weight and integrate diverse collections of genomic annotations. With this new model, CADD-Splice, we show that inclusion of splicing DNN effect scores substantially improves predictions across multiple variant categories, without compromising overall performance. CONCLUSIONS: While splice effect scores show superior performance on splice variants, specialized predictors cannot compete with other variant scores in general variant interpretation, as the latter account for nonsense and missense effects that do not alter splicing. Although only shown here for splice scores, we believe that the applied approach will generalize to other specific molecular processes, providing a path for the further improvement of genome-wide variant effect prediction.

DOI

10.1186/s13073-021-00835-9

CADD v1.7

Reference

PUBMED_LINK

38183205

URL

https://cadd.gs.washington.edu/

TITLE

CADD v1.7: using protein language models, regulatory CNNs and other nucleotide-level scores to improve genome-wide variant predictions.

Main citation

Schubach M, Maass T, Nazaretyan L, Röner S, ...&, Kircher M. (2024) CADD v1.7: using protein language models, regulatory CNNs and other nucleotide-level scores to improve genome-wide variant predictions. Nucleic Acids Res, 52 (D1) D1143-D1154. doi:10.1093/nar/gkad989. PMID 38183205

ABSTRACT

Machine Learning-based scoring and classification of genetic variants aids the assessment of clinical findings and is employed to prioritize variants in diverse genetic studies and analyses. Combined Annotation-Dependent Depletion (CADD) is one of the first methods for the genome-wide prioritization of variants across different molecular functions and has been continuously developed and improved since its original publication. Here, we present our most recent release, CADD v1.7. We explored and integrated new annotation features, among them state-of-the-art protein language model scores (Meta ESM-1v), regulatory variant effect predictions (from sequence-based convolutional neural networks) and sequence conservation scores (Zoonomia). We evaluated the new version on data sets derived from ClinVar, ExAC/gnomAD and 1000 Genomes variants. For coding effects, we tested CADD on 31 Deep Mutational Scanning (DMS) data sets from ProteinGym and, for regulatory effect prediction, we used saturation mutagenesis reporter assay data of promoter and enhancer sequences. The inclusion of new features further improved the overall performance of CADD. As with previous releases, all data sets, genome-wide CADD v1.7 scores, scripts for on-site scoring and an easy-to-use webserver are readily provided via https://cadd.bihealth.org/ or https://cadd.gs.washington.edu/ to the community.

DOI

10.1093/nar/gkad989

M-CAP

Reference

PUBMED_LINK

27776117

FULL NAME

Mendelian Clinically Applicable Pathogenicity

DESCRIPTION

Rare missense pathogenicity classifier for clinical interpretation.

URL

https://bejerano.stanford.edu/mcap/

KEYWORDS

missense, clinical

USE

clinical scoring

TITLE

M-CAP eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity.

Main citation

Jagadeesh KA, Wenger AM, Berger MJ, Guturu H, ...&, Bejerano G. (2016) M-CAP eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity. Nat Genet, 48 (12) 1581-1586. doi:10.1038/ng.3703. PMID 27776117

ABSTRACT

Variant pathogenicity classifiers such as SIFT, PolyPhen-2, CADD, and MetaLR assist in interpretation of the hundreds of rare, missense variants in the typical patient genome by deprioritizing some variants as likely benign. These widely used methods misclassify 26 to 38% of known pathogenic mutations, which could lead to missed diagnoses if the classifiers are trusted as definitive in a clinical setting. We developed M-CAP, a clinical pathogenicity classifier that outperforms existing methods at all thresholds and correctly dismisses 60% of rare, missense variants of uncertain significance in a typical genome at 95% sensitivity.

DOI

10.1038/ng.3703

MVP

Reference

PUBMED_LINK

33479230

FULL NAME

Missense Variant Pathogenicity prediction

DESCRIPTION

A prediction method that uses a deep residual network to leverage large training data sets and many correlated predictors.

URL

https://figshare.com/articles/dataset/Predicting_pathogenicity_of_missense_variants_by_deep_learning/13204118

KEYWORDS

deep residual network, pathogenic missense variant

TITLE

MVP predicts the pathogenicity of missense variants by deep learning.

Main citation

Qi H, Zhang H, Zhao Y, Chen C, ...&, Shen Y. (2021) MVP predicts the pathogenicity of missense variants by deep learning. Nat Commun, 12 (1) 510. doi:10.1038/s41467-020-20847-0. PMID 33479230

ABSTRACT

Accurate pathogenicity prediction of missense variants is critically important in genetic studies and clinical diagnosis. Previously published prediction methods have facilitated the interpretation of missense variants but have limited performance. Here, we describe MVP (Missense Variant Pathogenicity prediction), a new prediction method that uses deep residual network to leverage large training data sets and many correlated predictors. We train the model separately in genes that are intolerant of loss of function variants and the ones that are tolerant in order to take account of potentially different genetic effect size and mode of action. We compile cancer mutation hotspots and de novo variants from developmental disorders for benchmarking. Overall, MVP achieves better performance in prioritizing pathogenic missense variants than previous methods, especially in genes tolerant of loss of function variants. Finally, using MVP, we estimate that de novo coding variants contribute to 7.8% of isolated congenital heart disease, nearly doubling previous estimates.

DOI

10.1038/s41467-020-20847-0

MetaLR / MetaSVM

Reference

PUBMED_LINK

25552646

DESCRIPTION

Ensemble pathogenicity scores integrating multiple annotations.

URL

https://www.ncbi.nlm.nih.gov/clinvar/docs/scoreinfo/

KEYWORDS

ensemble, missense

USE

prioritization

TITLE

Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies.

Main citation

Dong C, Wei P, Jian X, Gibbs R, ...&, Liu X. (2015) Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Hum Mol Genet, 24 (8) 2125-37. doi:10.1093/hmg/ddu733. PMID 25552646

ABSTRACT

Accurate deleteriousness prediction for nonsynonymous variants is crucial for distinguishing pathogenic mutations from background polymorphisms in whole exome sequencing (WES) studies. Although many deleteriousness prediction methods have been developed, their prediction results are sometimes inconsistent with each other and their relative merits are still unclear in practical applications. To address these issues, we comprehensively evaluated the predictive performance of 18 current deleteriousness-scoring methods, including 11 function prediction scores (PolyPhen-2, SIFT, MutationTaster, Mutation Assessor, FATHMM, LRT, PANTHER, PhD-SNP, SNAP, SNPs&GO and MutPred), 3 conservation scores (GERP++, SiPhy and PhyloP) and 4 ensemble scores (CADD, PON-P, KGGSeq and CONDEL). We found that FATHMM and KGGSeq had the highest discriminative power among independent scores and ensemble scores, respectively. Moreover, to ensure unbiased performance evaluation of these prediction scores, we manually collected three distinct testing datasets, on which no current prediction scores were tuned. In addition, we developed two new ensemble scores that integrate nine independent scores and allele frequency. Our scores achieved the highest discriminative power compared with all the deleteriousness prediction scores tested and showed low false-positive prediction rate for benign yet rare nonsynonymous variants, which demonstrated the value of combining information from multiple orthologous approaches. Finally, to facilitate variant prioritization in WES studies, we have pre-computed our ensemble scores for 87 347 044 possible variants in the whole-exome and made them publicly available through the ANNOVAR software and the dbNSFP database.

DOI

10.1093/hmg/ddu733

MutationAssessor

Reference

PUBMED_LINK

40832239

DESCRIPTION

Predicts functional impact based on evolutionary conservation.

URL

http://mutationassessor.org/

KEYWORDS

conservation, function

USE

variant effect

TITLE

MutationAssessor in cBioPortal.

Main citation

Su Y, Li X, Reva B, Antipin Y, ...&, Sander C. (2025) MutationAssessor in cBioPortal. bioRxiv. doi:10.1101/2025.08.10.669566. PMID 40832239

ABSTRACT

MutationAssessor (MA) helps researchers evaluate the likely functional impact of somatic and germline mutations in cancer. It provides an evolution-based functional impact score (FIS) to classify mutations based on their likely effect on protein function. FIS scores are based on analysis of patterns of conservation in protein families (conserved residues) and subfamilies (specificity residues). In this new version (r4) we have (1) refined the combinatorial entropy analysis of conservation patterns, (2) recalculated full-length protein multiple sequence alignments covering a larger fraction of human proteins and making use of the explosive growth of protein sequence data, (3) compared predicted functional impact with the pathogenic-benign classification of sequence variants in curated knowledge bases, such as ClinVar, (4) observed the inverse relationship between predicted high functional impact and variant frequency in germline genome sequences and (5) explored the evaluation of switch-of-function mutational effects. Functional impact scores for ~4 million somatic amino-acid changing mutations across more than 320K human tumor samples are now available in the widely used cBioPortal for Cancer Genomics.

DOI

10.1101/2025.08.10.669566

PolyPhen-2

Reference

PUBMED_LINK

23315928

FULL NAME

Polymorphism Phenotyping v2

DESCRIPTION

Predicts functional impact of amino acid substitutions.

URL

http://genetics.bwh.harvard.edu/pph2/

KEYWORDS

missense, conservation

USE

variant scoring

TITLE

Predicting functional effect of human missense mutations using PolyPhen-2.

Main citation

Adzhubei I, Jordan DM, Sunyaev SR. (2013) Predicting functional effect of human missense mutations using PolyPhen-2. Curr Protoc Hum Genet, Chapter 7, Unit 7.20. doi:10.1002/0471142905.hg0720s76. PMID 23315928

ABSTRACT

PolyPhen-2 (Polymorphism Phenotyping v2), available as software and via a Web server, predicts the possible impact of amino acid substitutions on the stability and function of human proteins using structural and comparative evolutionary considerations. It performs functional annotation of single-nucleotide polymorphisms (SNPs), maps coding SNPs to gene transcripts, extracts protein sequence annotations and structural attributes, and builds conservation profiles. It then estimates the probability of the missense mutation being damaging based on a combination of all these properties. PolyPhen-2 features include a high-quality multiple protein sequence alignment pipeline and a prediction method employing machine-learning classification. The software also integrates the UCSC Genome Browser's human genome annotations and MultiZ multiple alignments of vertebrate genomes with the human genome. PolyPhen-2 is capable of analyzing large volumes of data produced by next-generation sequencing projects, thanks to built-in support for high-performance computing environments like Grid Engine and Platform LSF.

DOI

10.1002/0471142905.hg0720s76

PrimateAI-3D

Reference

PUBMED_LINK

37262156

DESCRIPTION

Deep learning model trained on primate variation and 3D protein structure.

URL

https://www.broadinstitute.org

KEYWORDS

deep learning, primate, missense

USE

clinical variant scoring

TITLE

The landscape of tolerated genetic variation in humans and primates.

Main citation

Gao H, Hamp T, Ede J, Schraiber JG, ...&, Farh KK. (2023) The landscape of tolerated genetic variation in humans and primates. Science, 380 (6648) eabn8153. doi:10.1126/science.abn8197. PMID 37262156

ABSTRACT

Personalized genome sequencing has revealed millions of genetic differences between individuals, but our understanding of their clinical relevance remains largely incomplete. To systematically decipher the effects of human genetic variants, we obtained whole-genome sequencing data for 809 individuals from 233 primate species and identified 4.3 million common protein-altering variants with orthologs in humans. We show that these variants can be inferred to have nondeleterious effects in humans based on their presence at high allele frequencies in other primate populations. We use this resource to classify 6% of all possible human protein-altering variants as likely benign and impute the pathogenicity of the remaining 94% of variants with deep learning, achieving state-of-the-art accuracy for diagnosing pathogenic variants in patients with genetic diseases.

DOI

10.1126/science.abn8197

REVEL

Reference

PUBMED_LINK

27666373

FULL NAME

Rare Exome Variant Ensemble Learner

DESCRIPTION

Ensemble method integrating multiple tools to predict pathogenicity.

URL

https://sites.google.com/site/revelgenomics/

KEYWORDS

ensemble, missense

USE

pathogenicity scoring

TITLE

REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants.

Main citation

Ioannidis NM, Rothstein JH, Pejaver V, Middha S, ...&, Sieh W. (2016) REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. Am J Hum Genet, 99 (4) 877-885. doi:10.1016/j.ajhg.2016.08.016. PMID 27666373

ABSTRACT

The vast majority of coding variants are rare, and assessment of the contribution of rare variants to complex traits is hampered by low statistical power and limited functional data. Improved methods for predicting the pathogenicity of rare coding variants are needed to facilitate the discovery of disease variants from exome sequencing studies. We developed REVEL (rare exome variant ensemble learner), an ensemble method for predicting the pathogenicity of missense variants on the basis of individual tools: MutPred, FATHMM, VEST, PolyPhen, SIFT, PROVEAN, MutationAssessor, MutationTaster, LRT, GERP, SiPhy, phyloP, and phastCons. REVEL was trained with recently discovered pathogenic and rare neutral missense variants, excluding those previously used to train its constituent tools. When applied to two independent test sets, REVEL had the best overall performance (p < 10^-12) as compared to any individual tool and seven ensemble methods: MetaSVM, MetaLR, KGGSeq, Condel, CADD, DANN, and Eigen. Importantly, REVEL also had the best performance for distinguishing pathogenic from rare neutral variants with allele frequencies <0.5%. The area under the receiver operating characteristic curve (AUC) for REVEL was 0.046-0.182 higher in an independent test set of 935 recent SwissVar disease variants and 123,935 putatively neutral exome sequencing variants and 0.027-0.143 higher in an independent test set of 1,953 pathogenic and 2,406 benign variants recently reported in ClinVar than the AUCs for other ensemble methods. We provide pre-computed REVEL scores for all possible human missense variants to facilitate the identification of pathogenic variants in the sea of rare variants discovered as sequencing studies expand in scale.

DOI

10.1016/j.ajhg.2016.08.016

SIFT

Reference

PUBMED_LINK

12824425

FULL NAME

Sorting Intolerant From Tolerant

DESCRIPTION

Predicts whether substitutions affect protein function.

URL

https://sift.bii.a-star.edu.sg/

KEYWORDS

conservation, missense

USE

variant scoring

TITLE

SIFT: Predicting amino acid changes that affect protein function.

Main citation

Ng PC, Henikoff S. (2003) SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res, 31 (13) 3812-4. doi:10.1093/nar/gkg509. PMID 12824425

ABSTRACT

Single nucleotide polymorphism (SNP) studies and random mutagenesis projects identify amino acid substitutions in protein-coding regions. Each substitution has the potential to affect protein function. SIFT (Sorting Intolerant From Tolerant) is a program that predicts whether an amino acid substitution affects protein function so that users can prioritize substitutions for further study. We have shown that SIFT can distinguish between functionally neutral and deleterious amino acid changes in mutagenesis studies and on human polymorphisms. SIFT is available at http://blocks.fhcrc.org/sift/SIFT.html.

DOI

10.1093/nar/gkg509

UNEECON

Reference

PUBMED_LINK

32667917

DESCRIPTION

UNEECON is a statistical method for inferring deleterious mutations and constrained genes in human and potentially other species.

URL

https://github.com/yifei-lab/UNEECON

TITLE

Unified inference of missense variant effects and gene constraints in the human genome.

Main citation

Huang YF. (2020) Unified inference of missense variant effects and gene constraints in the human genome. PLoS Genet, 16 (7) e1008922. doi:10.1371/journal.pgen.1008922. PMID 32667917

ABSTRACT

A challenge in medical genomics is to identify variants and genes associated with severe genetic disorders. Based on the premise that severe, early-onset disorders often result in a reduction of evolutionary fitness, several statistical methods have been developed to predict pathogenic variants or constrained genes based on the signatures of negative selection in human populations. However, we currently lack a statistical framework to jointly predict deleterious variants and constrained genes from both variant-level features and gene-level selective constraints. Here we present such a unified approach, UNEECON, based on deep learning and population genetics. UNEECON treats the contributions of variant-level features and gene-level constraints as a variant-level fixed effect and a gene-level random effect, respectively. The sum of the fixed and random effects is then combined with an evolutionary model to infer the strength of negative selection at both variant and gene levels. Compared with previously published methods, UNEECON shows improved performance in predicting missense variants and protein-coding genes associated with autosomal dominant disorders, and feature importance analysis suggests that both gene-level selective constraints and variant-level predictors are important for accurate variant prioritization. Furthermore, based on UNEECON, we observe a low correlation between gene-level intolerance to missense mutations and that to loss-of-function mutations, which can be partially explained by the prevalence of disordered protein regions that are highly tolerant to missense mutations. Finally, we show that genes intolerant to both missense and loss-of-function mutations play key roles in the central nervous system and the autism spectrum disorders. Overall, UNEECON is a promising framework for both variant and gene prioritization.

DOI

10.1371/journal.pgen.1008922

VEST4

Reference

PUBMED_LINK

23819870

FULL NAME

Variant Effect Scoring Tool v4

DESCRIPTION

Machine learning pathogenicity score for SNVs.

URL

https://www.cravat.us/

KEYWORDS

ML, SNV

USE

variant scoring

TITLE

Identifying Mendelian disease genes with the variant effect scoring tool.

Main citation

Carter H, Douville C, Stenson PD, Cooper DN, ...&, Karchin R. (2013) Identifying Mendelian disease genes with the variant effect scoring tool. BMC Genomics, 14 (Suppl 3) S3. doi:10.1186/1471-2164-14-S3-S3. PMID 23819870

ABSTRACT

BACKGROUND: Whole exome sequencing studies identify hundreds to thousands of rare protein coding variants of ambiguous significance for human health. Computational tools are needed to accelerate the identification of specific variants and genes that contribute to human disease. RESULTS: We have developed the Variant Effect Scoring Tool (VEST), a supervised machine learning-based classifier, to prioritize rare missense variants with likely involvement in human disease. The VEST classifier training set comprised ~45,000 disease mutations from the latest Human Gene Mutation Database release and another ~45,000 high frequency (allele frequency >1%) putatively neutral missense variants from the Exome Sequencing Project. VEST outperforms some of the most popular methods for prioritizing missense variants in carefully designed holdout benchmarking experiments (VEST ROC AUC = 0.91, PolyPhen2 ROC AUC = 0.86, SIFT4.0 ROC AUC = 0.84). VEST estimates variant score p-values against a null distribution of VEST scores for neutral variants not included in the VEST training set. These p-values can be aggregated at the gene level across multiple disease exomes to rank genes for probable disease involvement. We tested the ability of an aggregate VEST gene score to identify candidate Mendelian disease genes, based on whole-exome sequencing of a small number of disease cases. We used whole-exome data for two Mendelian disorders for which the causal gene is known. Considering only genes that contained variants in all cases, the VEST gene score ranked dihydroorotate dehydrogenase (DHODH) number 2 of 2253 genes in four cases of Miller syndrome, and myosin-3 (MYH3) number 2 of 2313 genes in three cases of Freeman Sheldon syndrome. CONCLUSIONS: Our results demonstrate the potential power gain of aggregating bioinformatics variant scores into gene-level scores and the general utility of bioinformatics in assisting the search for disease genes in large-scale exome sequencing studies. VEST is available as a stand-alone software package at http://wiki.chasmsoftware.org and is hosted by the CRAVAT web server at http://www.cravat.us.

DOI

10.1186/1471-2164-14-S3-S3