Reference
Catalog entries using this tag:
- HapMap Phase I — Projects
- HapMap Phase II — Projects
- HapMap Phase III — Projects
- Phase 1 — Projects
- Phase 3 — Projects
- Pilot — Projects
- AlphaFold — References
- AlphaFold 2 — References
- AlphaFold 3 — References
- AlphaMissense — References
- b37 — References
- b38 — References
- BioGRID — References
- CADD — References
- CADD v1.4 — References
- CADD v1.6 (CADD-Splice) — References
- CADD v1.7 — References
- Chinese Millionome Database — References
- CHM13 — References
- ClinVar — References
- CPC — References
- DAVID — References
- dbNSFP v4 — References
- dbSNP — References
- ENCODE 2004 — References
- ENCODE 2007 — References
- ENCODE 2012 — References
- ENCODE 2020 — References
- ENCODE 2026 — References
- ENCODE portal — References
- Ensembl — References
- ESM-2 — References
- EVE — References
- Gene Ontology — References
- GeneCards — References
- gnomAD — References
- GRCh37.p13 — References
- GRCh38.p14 — References
- GRCh39 (indefinitely postponed) — References
- GWAS Catalog — References
- hg19 — References
- hg38 — References
- HPRC first draft pangenome — References
- hs37d5 — References
- HUGO Gene Nomenclature Committee — References
- humanG1Kv37 — References
- jMorp — References
- M-CAP — References
- MetaLR / MetaSVM — References
- MutationAssessor — References
- MVP — References
- NCBI-Gene — References
- OneK1K — References
- Open Target Genetics — References
- PGG.Han 2.0 — References
- PolyPhen-2 — References
- PrimateAI-3D — References
- ProGen2 — References
- ProtBERT — References
- ProteinBERT — References
- ProtVar — References
- Reactome — References
- REVEL — References
- ROADMAP 2010 — References
- ROADMAP 2015 — References
- SAHA — References
- SIFT — References
- STRING — References
- Taiwan View — References
- UKBB-LD — References
- UNEECON — References
- UniProt — References
- VEST4 — References
- Westlake BioBank for Chinese (WBBC) — References
Entries
HapMap Phase I
STAGE_PERIOD
2003–2005
DESCRIPTION
International HapMap Project first data release: ~1 million SNPs in CEU, YRI, and JPT+CHB; produced the first genome-wide LD and recombination maps and drove early GWAS SNP selection and imputation panels.
URL
HapMap Phase II
STAGE_PERIOD
2005–2007
DESCRIPTION
Expanded SNP density (~3.1M SNPs) and haplotype structure across the same core panels; improved tagging coverage and supported finer-scale association and phasing workflows before large-scale resequencing.
URL
HapMap Phase III
STAGE_PERIOD
2007–2009
DESCRIPTION
Extended to 11 populations and ~1.6M SNPs; broader ancestry representation and LD maps that informed the design and early phases of the 1000 Genomes Project.
URL
Phase 1
STAGE_PERIOD
2010–2011
DESCRIPTION
1000 Genomes Project Phase 1: expanded low-coverage WGS of 1,092 individuals, combined with exome capture and dense SNP genotyping; served as the primary SNP and indel reference for early imputation panels.
URL
Phase 3
STAGE_PERIOD
2012–2015
DESCRIPTION
1000 Genomes Project Phase 3: 2,504 individuals across 26 populations; the GRCh37/38 VCF releases became the standard allele-frequency, LD, and imputation backbone for GWAS and SV pipelines.
URL
Pilot
STAGE_PERIOD
2008–2010
DESCRIPTION
1000 Genomes Project pilot: proof-of-concept low-coverage whole-genome sequencing and SNP arrays across multiple populations; established the protocols and data model for the main project.
URL
AlphaFold
PUBMED_LINK
URL
TITLE
Improved protein structure prediction using potentials from deep learning.
Main citation
Senior AW, Evans R, Jumper J, Kirkpatrick J, ...&, Hassabis D. (2020) Improved protein structure prediction using potentials from deep learning. Nature, 577 (7792) 706-710. doi:10.1038/s41586-019-1923-7. PMID 31942072
ABSTRACT
Protein structure prediction can be used to determine the three-dimensional shape of a protein from its amino acid sequence. This problem is of fundamental importance as the structure of a protein largely determines its function; however, protein structures can be difficult to determine experimentally. Considerable progress has recently been made by leveraging genetic information. It is possible to infer which amino acid residues are in contact by analysing covariation in homologous sequences, which aids in the prediction of protein structures. Here we show that we can train a neural network to make accurate predictions of the distances between pairs of residues, which convey more information about the structure than contact predictions. Using this information, we construct a potential of mean force that can accurately describe the shape of a protein. We find that the resulting potential can be optimized by a simple gradient descent algorithm to generate structures without complex sampling procedures. The resulting system, named AlphaFold, achieves high accuracy, even for sequences with fewer homologous sequences. In the recent Critical Assessment of Protein Structure Prediction (CASP13), a blind assessment of the state of the field, AlphaFold created high-accuracy structures (with template modelling (TM) scores of 0.7 or higher) for 24 out of 43 free modelling domains, whereas the next best method, which used sampling and contact information, achieved such accuracy for only 14 out of 43 domains. AlphaFold represents a considerable advance in protein-structure prediction. We expect this increased accuracy to enable insights into the function and malfunction of proteins, especially in cases for which no structures for homologous proteins have been experimentally determined.
DOI
10.1038/s41586-019-1923-7
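The "Main citation" lines in this catalog follow a regular pattern: authors, year in parentheses, title, journal details, then `doi:` and `PMID` fields. A minimal Python sketch of pulling those fields out, assuming that pattern holds for every entry; the function name `parse_citation` and the dictionary keys are hypothetical and not part of the catalog:

```python
import re

# Assumed layout, taken from the citation lines in this catalog:
#   "<authors>. (<year>) <title>. <journal details>. doi:<DOI>. PMID <digits>"
_YEAR = re.compile(r"\. \((\d{4})\) ")
_DOI_PMID = re.compile(r"doi:(\S+?)\. PMID (\d+)\s*$")


def parse_citation(line: str) -> dict:
    """Extract year, DOI, and PMID from one 'Main citation' line.

    Fields that do not match the assumed layout are set to None.
    """
    year = _YEAR.search(line)
    tail = _DOI_PMID.search(line)
    doi = tail.group(1) if tail else None
    return {
        "year": int(year.group(1)) if year else None,
        "doi": doi,
        "pmid": tail.group(2) if tail else None,
        # https://doi.org/ is the standard DOI resolver prefix.
        "doi_url": f"https://doi.org/{doi}" if doi else None,
    }


example = (
    "Senior AW, Evans R, Jumper J, Kirkpatrick J, ...&, Hassabis D. (2020) "
    "Improved protein structure prediction using potentials from deep learning. "
    "Nature, 577 (7792) 706-710. doi:10.1038/s41586-019-1923-7. PMID 31942072"
)
print(parse_citation(example))
```

The DOI pattern is anchored on the trailing `. PMID` text because DOIs themselves contain dots; a citation without a PMID would need a looser pattern.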
AlphaFold 2 (AlphaFold)
PUBMED_LINK
FULL NAME
AlphaFold Protein Structure Database
DESCRIPTION
High-accuracy protein structure prediction using deep learning.
URL
KEYWORDS
protein structure, deep learning, folding
USE
structure prediction
SERVER
EMBL-EBI
TITLE
Highly accurate protein structure prediction with AlphaFold.
Main citation
Jumper J, Evans R, Pritzel A, Green T, ...&, Hassabis D. (2021) Highly accurate protein structure prediction with AlphaFold. Nature, 596 (7873) 583-589. doi:10.1038/s41586-021-03819-2. PMID 34265844
ABSTRACT
Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort, the structures of around 100,000 unique proteins have been determined, but this represents a small fraction of the billions of known protein sequences. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence (the structure prediction component of the 'protein folding problem') has been an important open research problem for more than 50 years. Despite recent progress, existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14), demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.
DOI
10.1038/s41586-021-03819-2
AlphaFold 3
PUBMED_LINK
URL
TITLE
Accurate structure prediction of biomolecular interactions with AlphaFold 3.
Main citation
Abramson J, Adler J, Dunger J, Evans R, ...&, Jumper JM. (2024) Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630 (8016) 493-500. doi:10.1038/s41586-024-07487-w. PMID 38718835
ABSTRACT
The introduction of AlphaFold 2 has spurred a revolution in modelling the structure of proteins and their interactions, enabling a huge range of applications in protein modelling and design. Here we describe our AlphaFold 3 model with a substantially updated diffusion-based architecture that is capable of predicting the joint structure of complexes including proteins, nucleic acids, small molecules, ions and modified residues. The new AlphaFold model demonstrates substantially improved accuracy over many previous specialized tools: far greater accuracy for protein-ligand interactions compared with state-of-the-art docking tools, much higher accuracy for protein-nucleic acid interactions compared with nucleic-acid-specific predictors and substantially higher antibody-antigen prediction accuracy compared with AlphaFold-Multimer v.2.3. Together, these results show that high-accuracy modelling across biomolecular space is possible within a single unified deep-learning framework.
DOI
10.1038/s41586-024-07487-w
AlphaMissense
PUBMED_LINK
DESCRIPTION
Deep learning model predicting pathogenicity of all possible missense variants in human proteins.
URL
KEYWORDS
missense, pathogenicity, variant effect, deep learning
USE
variant effect scoring
TITLE
Accurate proteome-wide missense variant effect prediction with AlphaMissense.
Main citation
Cheng J, Novati G, Pan J, Bycroft C, ...&, Avsec Ž. (2023) Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science, 381 (6664) eadg7492. doi:10.1126/science.adg7492. PMID 37733863
ABSTRACT
The vast majority of missense variants observed in the human genome are of unknown clinical significance. We present AlphaMissense, an adaptation of AlphaFold fine-tuned on human and primate variant population frequency databases to predict missense variant pathogenicity. By combining structural context and evolutionary conservation, our model achieves state-of-the-art results across a wide range of genetic and experimental benchmarks, all without explicitly training on such data. The average pathogenicity score of genes is also predictive for their cell essentiality, capable of identifying short essential genes that existing statistical approaches are underpowered to detect. As a resource to the community, we provide a database of predictions for all possible human single amino acid substitutions and classify 89% of missense variants as either likely benign or likely pathogenic.
DOI
10.1126/science.adg7492
b37
FULL NAME
Broad Institute Homo_sapiens_assembly19 (b37)
DESCRIPTION
GRCh37-compatible reference FASTA used across Broad Institute and 1000 Genomes workflows: chromosomes 1-22, X, Y, MT, plus GL/NC unlocalized and unplaced contigs (as in the distributed assembly19 package). Coordinate system matches the 1KG/b37 ecosystem used by many GWAS imputation and joint-calling pipelines.
URL
KEYWORDS
GRCh37; 1000 Genomes; Broad; b37; reference FASTA
Main citation
Broad Institute / 1000 Genomes Project. Homo_sapiens_assembly19.fasta (b37). https://data.broadinstitute.org/snowman/hg19/
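Because b37 names its primary contigs 1-22, X, Y, and MT, files produced against it often need their contig names translated before they can be compared with UCSC-style files. A minimal sketch of such a mapping, assuming the common UCSC convention of a `chr` prefix with `chrM` for the mitochondrion (that convention is not stated in this entry); the function names are hypothetical:

```python
# Primary contig names in b37, per the entry above: 1-22, X, Y, MT.
B37_PRIMARY = [str(i) for i in range(1, 23)] + ["X", "Y", "MT"]

# Assumed UCSC-style equivalents: "chr" prefix, with MT written as chrM.
B37_TO_UCSC = {n: ("chrM" if n == "MT" else "chr" + n) for n in B37_PRIMARY}
UCSC_TO_B37 = {v: k for k, v in B37_TO_UCSC.items()}


def b37_to_ucsc(name: str) -> str:
    """Rename a b37 primary contig; other contigs are returned unchanged."""
    return B37_TO_UCSC.get(name, name)


def ucsc_to_b37(name: str) -> str:
    """Rename a UCSC-style primary contig; other contigs are returned unchanged."""
    return UCSC_TO_B37.get(name, name)


print(b37_to_ucsc("MT"), ucsc_to_b37("chr7"), b37_to_ucsc("GL000207.1"))
```

This only renames contigs. It does not verify that the underlying sequences match, and GL/NC contigs are deliberately passed through untouched, so it is no substitute for checking the sequence dictionary of each file.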
b38
FULL NAME
Broad Institute Homo_sapiens_assembly38 (b38)
DESCRIPTION
GRCh38-based reference FASTA distributed with GATK and Broad pipelines (Homo_sapiens_assembly38), including primary chromosomes, standard alternate contigs, and, in the standard GATK bundle, decoy and HLA sequences. Default reference for many germline short-variant and joint-genotyping workflows on cloud and HPC.
URL
KEYWORDS
GRCh38; GATK; Broad; b38; reference FASTA
Main citation
Broad Institute. Homo_sapiens_assembly38.fasta (GATK GRCh38 reference bundle). https://storage.googleapis.com/genomics-public-data/references/hg38/v0/
BioGRID
PUBMED_LINK
DESCRIPTION
BioGRID is a biomedical interaction repository with data compiled through comprehensive curation efforts. Our current index is version 4.4.242 and searches 86,339 publications for 2,834,410 protein and genetic interactions, 31,144 chemical interactions and 1,128,339 post translational modifications from major model organism species. All data are freely provided via our search index and available for download in many standardized formats.
URL
KEYWORDS
Interaction
TITLE
The BioGRID database: A comprehensive biomedical resource of curated protein, genetic, and chemical interactions.
Main citation
Oughtred R, Rust J, Chang C, Breitkreutz BJ, ...&, Tyers M. (2021) The BioGRID database: A comprehensive biomedical resource of curated protein, genetic, and chemical interactions. Protein Sci, 30 (1) 187-200. doi:10.1002/pro.3978. PMID 33070389
ABSTRACT
The BioGRID (Biological General Repository for Interaction Datasets, thebiogrid.org) is an open-access database resource that houses manually curated protein and genetic interactions from multiple species including yeast, worm, fly, mouse, and human. The ~1.93 million curated interactions in BioGRID can be used to build complex networks to facilitate biomedical discoveries, particularly as related to human health and disease. All BioGRID content is curated from primary experimental evidence in the biomedical literature, and includes both focused low-throughput studies and large high-throughput datasets. BioGRID also captures protein post-translational modifications and protein or gene interactions with bioactive small molecules including many known drugs. A built-in network visualization tool combines all annotations and allows users to generate network graphs of protein, genetic and chemical interactions. In addition to general curation across species, BioGRID undertakes themed curation projects in specific aspects of cellular regulation, for example the ubiquitin-proteasome system, as well as specific disease areas, such as for the SARS-CoV-2 virus that causes COVID-19 severe acute respiratory syndrome. A recent extension of BioGRID, named the Open Repository of CRISPR Screens (ORCS, orcs.thebiogrid.org), captures single mutant phenotypes and genetic interactions from published high throughput genome-wide CRISPR/Cas9-based genetic screens. BioGRID-ORCS contains datasets for over 1,042 CRISPR screens carried out to date in human, mouse and fly cell lines. The biomedical research community can freely access all BioGRID data through the web interface, standardized file downloads, or via model organism databases and partner meta-databases.
DOI
10.1002/pro.3978
CADD
PUBMED_LINK
FULL NAME
Combined Annotation–Dependent Depletion
DESCRIPTION
Integrates many diverse annotations into a single score (C score) of variant deleteriousness.
URL
KEYWORDS
genome-wide, deleteriousness, annotation
USE
prioritization, filtering
TITLE
A general framework for estimating the relative pathogenicity of human genetic variants.
Main citation
Kircher M, Witten DM, Jain P, O'Roak BJ, ...&, Shendure J. (2014) A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet, 46 (3) 310-5. doi:10.1038/ng.2892. PMID 24487276
ABSTRACT
Current methods for annotating and interpreting human genetic variation tend to exploit a single information type (for example, conservation) and/or are restricted in scope (for example, to missense changes). Here we describe Combined Annotation-Dependent Depletion (CADD), a method for objectively integrating many diverse annotations into a single measure (C score) for each variant. We implement CADD as a support vector machine trained to differentiate 14.7 million high-frequency human-derived alleles from 14.7 million simulated variants. We precompute C scores for all 8.6 billion possible human single-nucleotide variants and enable scoring of short insertions-deletions. C scores correlate with allelic diversity, annotations of functionality, pathogenicity, disease severity, experimentally measured regulatory effects and complex trait associations, and they highly rank known pathogenic variants within individual genomes. The ability of CADD to prioritize functional, deleterious and pathogenic variants across many functional categories, effect sizes and genetic architectures is unmatched by any current single-annotation method.
DOI
10.1038/ng.2892
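The abstract reports one C score per variant. CADD releases also provide a PHRED-like scaled score derived from a variant's rank among all scored substitutions, which the abstract above does not spell out. A minimal sketch of that rank transform, assuming the commonly documented form -10·log10(rank/total); the function name `phred_scaled` is hypothetical and the example ranks are illustrative only:

```python
import math


def phred_scaled(rank: int, total: int) -> float:
    """PHRED-like scaling of a rank (1 = most deleterious) among `total` variants.

    Assumed form: -10 * log10(rank / total). This is a sketch of the
    transform only; it does not compute raw C scores.
    """
    if not 1 <= rank <= total:
        raise ValueError("rank must lie between 1 and total")
    return -10.0 * math.log10(rank / total)


# Illustrative ranks out of 1,000 variants: top 10%, top 1%, top 0.1%.
for rank in (100, 10, 1):
    print(rank, round(phred_scaled(rank, 1000), 2))
```

Under this scaling each additional 10 points corresponds to a tenfold smaller top fraction of variants, which is why thresholds on the scaled score are easier to compare across releases than raw scores.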
CADD v1.4
PUBMED_LINK
URL
TITLE
CADD: predicting the deleteriousness of variants throughout the human genome.
Main citation
Rentzsch P, Witten D, Cooper GM, Shendure J, ...&, Kircher M. (2019) CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res, 47 (D1) D886-D894. doi:10.1093/nar/gky1016. PMID 30371827
ABSTRACT
Combined Annotation-Dependent Depletion (CADD) is a widely used measure of variant deleteriousness that can effectively prioritize causal variants in genetic analyses, particularly highly penetrant contributors to severe Mendelian disorders. CADD is an integrative annotation built from more than 60 genomic features, and can score human single nucleotide variants and short insertion and deletions anywhere in the reference assembly. CADD uses a machine learning model trained on a binary distinction between simulated de novo variants and variants that have arisen and become fixed in human populations since the split between humans and chimpanzees; the former are free of selective pressure and may thus include both neutral and deleterious alleles, while the latter are overwhelmingly neutral (or, at most, weakly deleterious) by virtue of having survived millions of years of purifying selection. Here we review the latest updates to CADD, including the most recent version, 1.4, which supports the human genome build GRCh38. We also present updates to our website that include simplified variant lookup, extended documentation, an Application Program Interface and improved mechanisms for integrating CADD scores into other tools or applications. CADD scores, software and documentation are available at https://cadd.gs.washington.edu.
DOI
10.1093/nar/gky1016
CADD v1.6 (CADD-Splice)
PUBMED_LINK
URL
TITLE
CADD-Splice-improving genome-wide variant effect prediction using deep learning-derived splice scores.
Main citation
Rentzsch P, Schubach M, Shendure J, Kircher M. (2021) CADD-Splice-improving genome-wide variant effect prediction using deep learning-derived splice scores. Genome Med, 13 (1) 31. doi:10.1186/s13073-021-00835-9. PMID 33618777
ABSTRACT
BACKGROUND: Splicing of genomic exons into mRNAs is a critical prerequisite for the accurate synthesis of human proteins. Genetic variants impacting splicing underlie a substantial proportion of genetic disease, but are challenging to identify beyond those occurring at donor and acceptor dinucleotides. To address this, various methods aim to predict variant effects on splicing. Recently, deep neural networks (DNNs) have been shown to achieve better results in predicting splice variants than other strategies. METHODS: It has been unclear how best to integrate such process-specific scores into genome-wide variant effect predictors. Here, we use a recently published experimental data set to compare several machine learning methods that score variant effects on splicing. We integrate the best of those approaches into general variant effect prediction models and observe the effect on classification of known pathogenic variants. RESULTS: We integrate two specialized splicing scores into CADD (Combined Annotation Dependent Depletion; cadd.gs.washington.edu), a widely used tool for genome-wide variant effect prediction that we previously developed to weight and integrate diverse collections of genomic annotations. With this new model, CADD-Splice, we show that inclusion of splicing DNN effect scores substantially improves predictions across multiple variant categories, without compromising overall performance. CONCLUSIONS: While splice effect scores show superior performance on splice variants, specialized predictors cannot compete with other variant scores in general variant interpretation, as the latter account for nonsense and missense effects that do not alter splicing. Although only shown here for splice scores, we believe that the applied approach will generalize to other specific molecular processes, providing a path for the further improvement of genome-wide variant effect prediction.
DOI
10.1186/s13073-021-00835-9
CADD v1.7
PUBMED_LINK
URL
TITLE
CADD v1.7: using protein language models, regulatory CNNs and other nucleotide-level scores to improve genome-wide variant predictions.
Main citation
Schubach M, Maass T, Nazaretyan L, Röner S, ...&, Kircher M. (2024) CADD v1.7: using protein language models, regulatory CNNs and other nucleotide-level scores to improve genome-wide variant predictions. Nucleic Acids Res, 52 (D1) D1143-D1154. doi:10.1093/nar/gkad989. PMID 38183205
ABSTRACT
Machine Learning-based scoring and classification of genetic variants aids the assessment of clinical findings and is employed to prioritize variants in diverse genetic studies and analyses. Combined Annotation-Dependent Depletion (CADD) is one of the first methods for the genome-wide prioritization of variants across different molecular functions and has been continuously developed and improved since its original publication. Here, we present our most recent release, CADD v1.7. We explored and integrated new annotation features, among them state-of-the-art protein language model scores (Meta ESM-1v), regulatory variant effect predictions (from sequence-based convolutional neural networks) and sequence conservation scores (Zoonomia). We evaluated the new version on data sets derived from ClinVar, ExAC/gnomAD and 1000 Genomes variants. For coding effects, we tested CADD on 31 Deep Mutational Scanning (DMS) data sets from ProteinGym and, for regulatory effect prediction, we used saturation mutagenesis reporter assay data of promoter and enhancer sequences. The inclusion of new features further improved the overall performance of CADD. As with previous releases, all data sets, genome-wide CADD v1.7 scores, scripts for on-site scoring and an easy-to-use webserver are readily provided via https://cadd.bihealth.org/ or https://cadd.gs.washington.edu/ to the community.
DOI
10.1093/nar/gkad989
Chinese Millionome Database (CMDB)
CHM13
PUBMED_LINK
FULL NAME
T2T-CHM13 v1.1 complete hydatidiform mole assembly
DESCRIPTION
Telomere-to-telomere (T2T) assembly of the CHM13 hydatidiform mole cell line, providing the first gap-resolved maps of centromeres. CHM13 itself carries no Y chromosome; a complete Y (from the HG002 sample) was added in the later v2.0 release. Use as a complement to GRCh38 for studying repetitive and structurally variable loci. Chromosome naming and coordinates differ from GRC primary assemblies, so use liftover and T2T-specific tooling where appropriate.
URL
KEYWORDS
T2T; telomere-to-telomere; complete genome; CHM13; GRCh38 alternative
TITLE
The complete sequence of a human genome.
Main citation
Nurk S, Koren S, Rhie A, Rautiainen M, ...&, Phillippy AM. (2022) The complete sequence of a human genome. Science, 376 (6588) 44-53. doi:10.1126/science.abj6987. PMID 35357919
ABSTRACT
Since its initial release in 2000, the human reference genome has covered only the euchromatic fraction of the genome, leaving important heterochromatic regions unfinished. Addressing the remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium presents a complete 3.055 billion-base pair sequence of a human genome, T2T-CHM13, that includes gapless assemblies for all chromosomes except Y, corrects errors in the prior references, and introduces nearly 200 million base pairs of sequence containing 1956 gene predictions, 99 of which are predicted to be protein coding. The completed regions include all centromeric satellite arrays, recent segmental duplications, and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies.
DOI
10.1126/science.abj6987
ClinVar
PUBMED_LINK
DESCRIPTION
Archive of clinically relevant variants with interpretations.
URL
KEYWORDS
pathogenicity, variant, clinical
USE
clinical annotation
TITLE
ClinVar: updates to support classifications of both germline and somatic variants.
Main citation
Landrum MJ, Chitipiralla S, Kaur K, Brown G, ...&, Kattman BL. (2025) ClinVar: updates to support classifications of both germline and somatic variants. Nucleic Acids Res, 53 (D1) D1313-D1321. doi:10.1093/nar/gkae1090. PMID 39578691
ABSTRACT
ClinVar (www.ncbi.nlm.nih.gov/clinvar/) is a free, public database of human genetic variants and their relationships to disease, with >3 million variants submitted by >2800 organizations across the world. The database was recently updated to have three types of classifications: germline, oncogenicity and clinical impact for somatic variants. As for germline variants, classifications for somatic variants can be submitted in batches in a file submission or through the submission API; variants can also be submitted and updated one at a time in online submission forms. The ClinVar XML files were redesigned to allow multiple classification types. Both old and new formats of the XML are supported through the end of 2024. Data for somatic classifications were also added to the ClinVar VCF files and to several tab-delimited files. The ClinVar VCV pages were updated to display the three types of classifications, both as it was submitted and as it was aggregated by ClinVar. Clinical testing laboratories and others in the cancer community are invited to share their classifications of somatic variant classifications through ClinVar to provide transparency in genomic testing and improve patient care.
DOI
10.1093/nar/gkae1090
CPC
PUBMED_LINK
FULL NAME
Chinese Pangenome Consortium (phase I core)
DESCRIPTION
Phase I data from the Chinese Pangenome Consortium: 116 high-quality haplotype-phased de novo assemblies from 58 core samples across 36 minority Chinese ethnic groups (high-fidelity long-read coverage). Adds substantial novel sequence and variant discovery relative to GRCh38 and supports population-specific reference panels for Asian-ancestry genomics.
URL
KEYWORDS
pangenome; Chinese populations; long-read; haplotype; GRCh38
TITLE
A pangenome reference of 36 Chinese populations.
Main citation
Gao Y, Yang X, Chen H, Tan X, ...&, Xu S. (2023) A pangenome reference of 36 Chinese populations. Nature, 619 (7968) 112-121. doi:10.1038/s41586-023-06173-7. PMID 37316654
ABSTRACT
Human genomics is witnessing an ongoing paradigm shift from a single reference sequence to a pangenome form, but populations of Asian ancestry are underrepresented. Here we present data from the first phase of the Chinese Pangenome Consortium, including a collection of 116 high-quality and haplotype-phased de novo assemblies based on 58 core samples representing 36 minority Chinese ethnic groups. With an average 30.65× high-fidelity long-read sequence coverage, an average contiguity N50 of more than 35.63 megabases and an average total size of 3.01 gigabases, the CPC core assemblies add 189 million base pairs of euchromatic polymorphic sequences and 1,367 protein-coding gene duplications to GRCh38. We identified 15.9 million small variants and 78,072 structural variants, of which 5.9 million small variants and 34,223 structural variants were not reported in a recently released pangenome reference. The Chinese Pangenome Consortium data demonstrate a remarkable increase in the discovery of novel and missing sequences when individuals are included from underrepresented minority ethnic groups. The missing reference sequences were enriched with archaic-derived alleles and genes that confer essential functions related to keratinization, response to ultraviolet radiation, DNA repair, immunological responses and lifespan, implying great potential for shedding new light on human evolution and recovering missing heritability in complex disease mapping.
DOI
10.1038/s41586-023-06173-7
DAVID
PUBMED_LINK
FULL NAME
Database for Annotation, Visualization and Integrated Discovery
DESCRIPTION
Functional annotation and enrichment analysis platform.
URL
KEYWORDS
functional enrichment, GO, pathway
USE
enrichment analysis
TITLE
The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists.
Main citation
Huang DW, Sherman BT, Tan Q, Collins JR, ...&, Lempicki RA. (2007) The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol, 8 (9) R183. doi:10.1186/gb-2007-8-9-r183. PMID 17784955
ABSTRACT
The DAVID Gene Functional Classification Tool (http://david.abcc.ncifcrf.gov) uses a novel agglomeration algorithm to condense a list of genes or associated biological terms into organized classes of related genes or biology, called biological modules. This organization is accomplished by mining the complex biological co-occurrences found in multiple sources of functional annotation. It is a powerful method to group functionally related genes and terms into a manageable number of biological modules for efficient interpretation of gene lists in a network context.
DOI
10.1186/gb-2007-8-9-r183
dbNSFP v4 (dbNSFP)
PUBMED_LINK
FULL NAME
Database for Nonsynonymous SNPs’ Functional Predictions
DESCRIPTION
Database aggregating functional predictions and annotations for nonsynonymous variants.
URL
KEYWORDS
annotation, variant, missense
USE
meta-annotation
TITLE
dbNSFP v4: a comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs.
Main citation
Liu X, Li C, Mou C, Dong Y, ...&, Tu Y. (2020) dbNSFP v4: a comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs. Genome Med, 12 (1) 103. doi:10.1186/s13073-020-00803-9. PMID 33261662
ABSTRACT
Whole exome sequencing has been increasingly used in human disease studies. Prioritization based on appropriate functional annotations has been used as an indispensable step to select candidate variants. Here we present the latest updates to dbNSFP (version 4.1), a database designed to facilitate this step by providing deleteriousness prediction and functional annotation for all potential nonsynonymous and splice-site SNVs (a total of 84,013,093) in the human genome. The current version compiled 36 deleteriousness prediction scores, including 12 transcript-specific scores, and other variant and gene-level functional annotations. The database is available at http://database.liulab.science/dbNSFP with a downloadable version and a web-service.
DOI
10.1186/s13073-020-00803-9
dbSNP
ENCODE 2004
PUBMED_LINK
TITLE
The ENCODE (ENCyclopedia Of DNA Elements) Project.
Main citation
ENCODE Project Consortium. (2004) The ENCODE (ENCyclopedia Of DNA Elements) Project. Science, 306 (5696) 636-40. doi:10.1126/science.1105136. PMID 15499007
ABSTRACT
The ENCyclopedia Of DNA Elements (ENCODE) Project aims to identify all functional elements in the human genome sequence. The pilot phase of the Project is focused on a specified 30 megabases (approximately 1%) of the human genome sequence and is organized as an international consortium of computational and laboratory-based scientists working to develop and apply high-throughput approaches for detecting all sequence elements that confer biological function. The results of this pilot phase will guide future efforts to analyze the entire human genome.
DOI
10.1126/science.1105136
ENCODE 2007
PUBMED_LINK
TITLE
Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project.
Main citation
ENCODE Project Consortium, Birney E, Stamatoyannopoulos JA, Dutta A, ...&, de Jong PJ. (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447 (7146) 799-816. doi:10.1038/nature05874. PMID 17571346
ABSTRACT
We report the generation and analysis of functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project. These data have been further integrated and augmented by a number of evolutionary and computational analyses. Together, our results advance the collective knowledge about human genome function in several major areas. First, our studies provide convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts, including non-protein-coding transcripts, and those that extensively overlap one another. Second, systematic examination of transcriptional regulation has yielded new understanding about transcription start sites, including their relationship to specific regulatory sequences and features of chromatin accessibility and histone modification. Third, a more sophisticated view of chromatin structure has emerged, including its inter-relationship with DNA replication and transcriptional regulation. Finally, integration of these new sources of information, in particular with respect to mammalian evolution based on inter- and intra-species sequence comparisons, has yielded new mechanistic and evolutionary insights concerning the functional landscape of the human genome. Together, these studies are defining a path for pursuit of a more comprehensive characterization of human genome function.
DOI
10.1038/nature05874
ENCODE 2012
PUBMED_LINK
DESCRIPTION
ENCODE Phase II
TITLE
An integrated encyclopedia of DNA elements in the human genome.
Main citation
ENCODE Project Consortium. (2012) An integrated encyclopedia of DNA elements in the human genome. Nature, 489 (7414) 57-74. doi:10.1038/nature11247. PMID 22955616
ABSTRACT
The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.
DOI
10.1038/nature11247
ENCODE 2020
PUBMED_LINK
DESCRIPTION
ENCODE Phase III
TITLE
Expanded encyclopaedias of DNA elements in the human and mouse genomes.
Main citation
ENCODE Project Consortium, Moore JE, Purcaro MJ, Pratt HE, ...&, Weng Z. (2020) Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature, 583 (7818) 699-710. doi:10.1038/s41586-020-2493-4. PMID 32728249
ABSTRACT
The human and mouse genomes contain instructions that specify RNAs and proteins and govern the timing, magnitude, and cellular context of their production. To better delineate these elements, phase III of the Encyclopedia of DNA Elements (ENCODE) Project has expanded analysis of the cell and tissue repertoires of RNA transcription, chromatin structure and modification, DNA methylation, chromatin looping, and occupancy by transcription factors and RNA-binding proteins. Here we summarize these efforts, which have produced 5,992 new experimental datasets, including systematic determinations across mouse fetal development. All data are available through the ENCODE data portal (https://www.encodeproject.org), including phase II ENCODE and Roadmap Epigenomics data. We have developed a registry of 926,535 human and 339,815 mouse candidate cis-regulatory elements, covering 7.9 and 3.4% of their respective genomes, by integrating selected datatypes associated with gene regulation, and constructed a web-based server (SCREEN; http://screen.encodeproject.org) to provide flexible, user-defined access to this resource. Collectively, the ENCODE data and registry provide an expansive resource for the scientific community to build a better understanding of the organization and function of the human and mouse genomes.
DOI
10.1038/s41586-020-2493-4
ENCODE 2026
PUBMED_LINK
DESCRIPTION
ENCODE Phase IV (ENCODE4)
TITLE
An expanded registry of candidate cis-regulatory elements.
Main citation
Moore JE, Pratt HE, Fan K, Phalke N, ...&, Weng Z. (2026) An expanded registry of candidate cis-regulatory elements. Nature, () . doi:10.1038/s41586-025-09909-9. PMID 41501460
ABSTRACT
Mammalian genomes contain millions of regulatory elements that control the complex patterns of gene expression. Previously, the ENCODE consortium mapped biochemical signals across hundreds of cell types and tissues and integrated these data to develop a registry containing 0.9 million human and 300,000 mouse candidate cis-regulatory elements (cCREs) annotated with potential functions. Here we have expanded the registry to include 2.37 million human and 967,000 mouse cCREs, leveraging new ENCODE datasets and enhanced computational methods. This expanded registry covers hundreds of unique cell and tissue types, providing a comprehensive understanding of gene regulation. Functional characterization data from assays such as STARR-seq, massively parallel reporter assay, CRISPR perturbation and transgenic mouse assays have profiled more than 90% of human cCREs, revealing complex regulatory functions. We identified thousands of novel silencer cCREs and demonstrated their dual enhancer and silencer roles in different cellular contexts. Integrating the registry with other ENCODE annotations facilitates genetic variation interpretation and trait-associated gene identification, exemplified by the identification of KLF1 as a novel causal gene for red blood cell traits. This expanded registry is a valuable resource for studying the regulatory genome and its impact on health and disease.
DOI
10.1038/s41586-025-09909-9
ENCODE portal
PUBMED_LINK
TITLE
Data navigation on the ENCODE portal.
Main citation
Kagda MS, Lam B, Litton C, Small C, ...&, Hitz BC. (2025) Data navigation on the ENCODE portal. Nat Commun, 16 (1) 9592. doi:10.1038/s41467-025-64343-9. PMID 41168159
ABSTRACT
Spanning two decades, the collaborative ENCODE project aims to identify all the functional elements within human and mouse genomes. To best serve the scientific community, the comprehensive ENCODE data including results from 23,000+ functional genomics experiments, 800+ functional elements characterization experiments and 60,000+ results from integrative computational analyses are available on an open-access data-portal (https://www.encodeproject.org/). The final phase of the project includes data from several novel assays aimed at characterization and validation of genomic elements. In addition to developing and maintaining the data portal, the Data Coordination Center (DCC) implemented and utilised uniform processing pipelines to generate uniformly processed data. Here we report recent updates to the data portal including a redesigned home page, an improved search interface, new custom-designed pages highlighting biologically related datasets and an enhanced cart interface for data visualisation plus user-friendly data download options. A summary of data generated using uniform processing pipelines is also provided.
DOI
10.1038/s41467-025-64343-9
Ensembl
ESM-2
PUBMED_LINK
FULL NAME
Evolutionary Scale Modeling v2
DESCRIPTION
Large-scale protein language model enabling structure/function prediction.
URL
KEYWORDS
transformer, LLM, structure, sequence
USE
embeddings, structure prediction
TITLE
Evolutionary-scale prediction of atomic-level protein structure with a language model.
Main citation
Lin Z, Akin H, Rao R, Hie B, ...&, Rives A. (2023) Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379 (6637) 1123-1130. doi:10.1126/science.ade2574. PMID 36927031
ABSTRACT
Recent advances in machine learning have leveraged evolutionary information in multiple sequence alignments to predict protein structure. We demonstrate direct inference of full atomic-level protein structure from primary sequence using a large language model. As language models of protein sequences are scaled up to 15 billion parameters, an atomic-resolution picture of protein structure emerges in the learned representations. This results in an order-of-magnitude acceleration of high-resolution structure prediction, which enables large-scale structural characterization of metagenomic proteins. We apply this capability to construct the ESM Metagenomic Atlas by predicting structures for >617 million metagenomic protein sequences, including >225 million that are predicted with high confidence, which gives a view into the vast breadth and diversity of natural proteins.
DOI
10.1126/science.ade2574
EVE
PUBMED_LINK
FULL NAME
Evolutionary model of Variant Effect
DESCRIPTION
VAE-based unsupervised model to predict variant impact using MSAs.
URL
KEYWORDS
evolutionary, MSA, variant effect
USE
missense scoring
TITLE
Disease variant prediction with deep generative models of evolutionary data.
Main citation
Frazer J, Notin P, Dias M, Gomez A, ...&, Marks DS. (2021) Disease variant prediction with deep generative models of evolutionary data. Nature, 599 (7883) 91-95. doi:10.1038/s41586-021-04043-8. PMID 34707284
ABSTRACT
Quantifying the pathogenicity of protein variants in human disease-related genes would have a marked effect on clinical decisions, yet the overwhelming majority (over 98%) of these variants still have unknown consequences. In principle, computational methods could support the large-scale interpretation of genetic variants. However, state-of-the-art methods have relied on training machine learning models on known disease labels. As these labels are sparse, biased and of variable quality, the resulting models have been considered insufficiently reliable. Here we propose an approach that leverages deep generative models to predict variant pathogenicity without relying on labels. By modelling the distribution of sequence variation across organisms, we implicitly capture constraints on the protein sequences that maintain fitness. Our model EVE (evolutionary model of variant effect) not only outperforms computational approaches that rely on labelled data but also performs on par with, if not better than, predictions from high-throughput experiments, which are increasingly used as evidence for variant classification. We predict the pathogenicity of more than 36 million variants across 3,219 disease genes and provide evidence for the classification of more than 256,000 variants of unknown significance. Our work suggests that models of evolutionary information can provide valuable independent evidence for variant interpretation that will be widely useful in research and clinical settings.
DOI
10.1038/s41586-021-04043-8
Gene Ontology (GO)
PUBMED_LINK
DESCRIPTION
Controlled vocabulary for gene function classification.
URL
KEYWORDS
GO terms, pathways
USE
enrichment analysis
TITLE
Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.
Main citation
Ashburner M, Ball CA, Blake JA, Botstein D, ...&, Sherlock G. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25 (1) 25-9. doi:10.1038/75556. PMID 10802651
ABSTRACT
Genomic sequencing has made it clear that a large fraction of the genes specifying the core biological functions are shared by all eukaryotes. Knowledge of the biological role of such shared proteins in one organism can often be transferred to other organisms. The goal of the Gene Ontology Consortium is to produce a dynamic, controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing. To this end, three independent ontologies accessible on the World-Wide Web (http://www.geneontology.org) are being constructed: biological process, molecular function and cellular component.
DOI
10.1038/75556
GeneCards
PUBMED_LINK
DESCRIPTION
GeneCards is a searchable, integrative database that provides comprehensive, user-friendly information on all annotated and predicted human genes. The knowledgebase automatically integrates gene-centric data from ~200 web sources, including genomic, transcriptomic, proteomic, genetic, clinical and functional information.
URL
TITLE
The GeneCards Suite: From Gene Data Mining to Disease Genome Sequence Analyses.
Main citation
Stelzer G, Rosen N, Plaschkes I, Zimmerman S, ...&, Lancet D. (2016) The GeneCards Suite: From Gene Data Mining to Disease Genome Sequence Analyses. Curr Protoc Bioinformatics, 54 () 1.30.1-1.30.33. doi:10.1002/cpbi.5. PMID 27322403
ABSTRACT
GeneCards, the human gene compendium, enables researchers to effectively navigate and inter-relate the wide universe of human genes, diseases, variants, proteins, cells, and biological pathways. Our recently launched Version 4 has a revamped infrastructure facilitating faster data updates, better-targeted data queries, and friendlier user experience. It also provides a stronger foundation for the GeneCards suite of companion databases and analysis tools. Improved data unification includes gene-disease links via MalaCards and merged biological pathways via PathCards, as well as drug information and proteome expression. VarElect, another suite member, is a phenotype prioritizer for next-generation sequencing, leveraging the GeneCards and MalaCards knowledgebase. It automatically infers direct and indirect scored associations between hundreds or even thousands of variant-containing genes and disease phenotype terms. VarElect's capabilities, either independently or within TGex, our comprehensive variant analysis pipeline, help prepare for the challenge of clinical projects that involve thousands of exome/genome NGS analyses. © 2016 by John Wiley & Sons, Inc.
DOI
10.1002/cpbi.5
gnomAD
GRCh37.p13
FULL NAME
Genome Reference Consortium Human Build 37 patch release 13
DESCRIPTION
NCBI/GRC human assembly build 37, patch 13 (GCF_000001405.25): the authoritative GRCh37 patch-level reference used for stable accessioning and alignment. Distinct from UCSC hg19/Broad b37 contig naming; always verify chromosome naming and inclusion of ALT/patch contigs when mixing resources.
URL
KEYWORDS
GRCh37; GRC; NCBI; reference assembly; patch 13
Main citation
Genome Reference Consortium. Human genome assembly GRCh37.p13 (GCF_000001405.25). National Center for Biotechnology Information.
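The chromosome-naming check recommended in the GRCh37.p13 description can be sketched roughly as follows. This is a minimal illustration, not part of any listed resource: the function names `naming_style` and `to_ucsc` are hypothetical, and only primary chromosomes are handled (ALT, patch, and unplaced contigs are left untouched because their names do not map by a simple prefix).

```python
# Hypothetical helpers for spotting mixed contig-naming conventions.
# UCSC style: chr1..chr22, chrX, chrY, chrM. b37/GRC style: 1..22, X, Y, MT.

AUTOSOMES = [str(i) for i in range(1, 23)]
PRIMARY_B37 = set(AUTOSOMES) | {"X", "Y", "MT"}
PRIMARY_UCSC = {"chr" + c for c in AUTOSOMES} | {"chrX", "chrY", "chrM"}


def naming_style(contigs):
    """Classify a list of contig names as 'ucsc', 'b37', 'mixed', or 'unknown'."""
    names = set(contigs)
    has_ucsc = bool(names & PRIMARY_UCSC)
    has_b37 = bool(names & PRIMARY_B37)
    if has_ucsc and has_b37:
        return "mixed"
    if has_ucsc:
        return "ucsc"
    if has_b37:
        return "b37"
    return "unknown"


def to_ucsc(name):
    """Rename a b37-style primary contig to UCSC style; leave other names as-is."""
    if name == "MT":
        return "chrM"
    if name in PRIMARY_B37:
        return "chr" + name
    return name


if __name__ == "__main__":
    vcf_contigs = ["1", "2", "X", "MT", "GL000207.1"]
    bed_contigs = ["chr1", "chr2", "chrX", "chrM"]
    print(naming_style(vcf_contigs), naming_style(bed_contigs))
    print([to_ucsc(c) for c in vcf_contigs])
```

Renaming alone does not guarantee that two resources carry identical sequence for a contig, so comparing contig lengths (for example from a FASTA index or VCF header) is a sensible second check.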
GRCh38.p14
FULL NAME
Genome Reference Consortium Human Build 38 patch release 14
DESCRIPTION
NCBI/GRC human assembly build 38, patch 14 (GCF_000001405.40): current GRC primary human reference on the GRCh38 line, including cumulative sequence fixes and scaffold updates through p14. Use this accession when you need the exact GRC patch level that matches NCBI/RefSeq alignment products.
URL
KEYWORDS
GRCh38; GRC; NCBI; reference assembly; patch 14
Main citation
Genome Reference Consortium. Human genome assembly GRCh38.p14 (GCF_000001405.40). National Center for Biotechnology Information. https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.40
GRCh39 (indefinitely postponed) (GRCh39)
FULL NAME
Genome Reference Consortium Human Build 39 (not pursued)
DESCRIPTION
The Genome Reference Consortium announced that work toward a distinct GRCh39 assembly line was indefinitely postponed; human reference updates continue on the GRCh38 series (patches) and complementary resources such as T2T-CHM13 and pangenome references. Check the GRC human page for current guidance and patch releases.
URL
KEYWORDS
GRC; GRCh39; reference assembly; postponed
Main citation
Genome Reference Consortium. Human genome reference updates (GRCh39 indefinitely postponed; continued GRCh38 patches). https://www.ncbi.nlm.nih.gov/grc/human
GWAS Catalog
hg19
FULL NAME
UCSC hg19 (GRCh37) reference bundle
DESCRIPTION
UCSC Genome Browser distribution of the GRCh37-era human reference (hg19): chromosomes chr1-22, chrX, chrY, chrM, plus unlocalized and unplaced contigs, alternate loci (e.g. chr6_apd_hap1), and related patches as packaged for the browser. Widely used in legacy pipelines and liftOver chains to/from hg38.
URL
KEYWORDS
GRCh37; UCSC; reference genome; FASTA; legacy assembly
Main citation
UCSC Genome Browser. Human reference assembly hg19 (GRCh37-aligned). https://hgdownload.soe.ucsc.edu/goldenPath/hg19/
hg38
FULL NAME
UCSC hg38 (GRCh38) reference bundle
DESCRIPTION
UCSC Genome Browser distribution of the human reference aligned to GRCh38 (primary assembly plus standard patches and decoys as packaged in the browser bigZips downloads). Chromosome names use the chr1-chrM convention; coordinates match the corresponding GRC assembly for the same patch level when sequences are identical.
URL
KEYWORDS
GRCh38; UCSC; reference genome; FASTA; primary assembly
Main citation
UCSC Genome Browser. Human reference assembly hg38 (GRCh38-aligned). https://hgdownload.soe.ucsc.edu/goldenPath/hg38/
HPRC first draft pangenome (HPRC draft)
PUBMED_LINK
FULL NAME
Human Pangenome Reference Consortium first-draft pangenome
DESCRIPTION
First-draft human pangenome from the HPRC: 47 phased diploid assemblies from diverse samples, aligned and summarized relative to GRCh38. Adds substantial euchromatic polymorphic sequence and duplicated gene content versus a single linear reference; intended for pangenome-aware alignment, variant calling, and downstream graph-based genomics (see HPRC data portal and companion software).
URL
KEYWORDS
HPRC; pangenome; graph genome; haplotypes; GRCh38
TITLE
A draft human pangenome reference.
Main citation
Liao WW, Asri M, Ebler J, Doerr D, ...&, Paten B. (2023) A draft human pangenome reference. Nature, 617 (7960) 312-324. doi:10.1038/s41586-023-05896-x. PMID 37165242
ABSTRACT
Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.
DOI
10.1038/s41586-023-05896-x
hs37d5
FULL NAME
1000 Genomes GRCh37 + decoy (hs37d5)
DESCRIPTION
GRCh37 (b37-style) primary chromosomes and contigs plus the hs37d5 decoy sequence set (HuRef/BAC/Fosmid/NA12878-derived sequences) to reduce spurious alignments in short-read mapping. Standard reference for Phase 3-era 1000 Genomes alignment and many imputation and low-pass WGS workflows that target the 1KG coordinate system.
URL
KEYWORDS
GRCh37; decoy; 1000 Genomes; alignment; hs37d5
Main citation
1000 Genomes Project / Broad Institute. hs37d5 reference (GRCh37 plus decoy sequences). https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/
HUGO Gene Nomenclature Committee (HGNC)
PUBMED_LINK
FULL NAME
Human Genome Organisation Gene Nomenclature Committee
URL
TITLE
Guidelines for human gene nomenclature.
Main citation
Bruford EA, Braschi B, Denny P, Jones TEM, ...&, Tweedie S. (2020) Guidelines for human gene nomenclature. Nat Genet, 52 (8) 754-758. doi:10.1038/s41588-020-0669-3. PMID 32747822
ABSTRACT
Standardized gene naming is crucial for effective communication about genes, and as genomics becomes increasingly important in healthcare, the need for a consistent language for human genes becomes ever more vital. Here we present the current HUGO Gene Nomenclature Committee (HGNC) guidelines for naming not only protein-coding but also RNA genes and pseudogenes, and outline the changes in approach and ethos that have resulted from the discoveries of the last few decades.
DOI
10.1038/s41588-020-0669-3
humanG1Kv37
FULL NAME
1000 Genomes human_g1k_v37 reference
DESCRIPTION
GRCh37-based reference FASTA distributed by the 1000 Genomes Project (human_g1k_v37): chromosomes 1-22, X, Y, MT, plus GL unlocalized/unplaced contigs, without separate haplotype scaffolds or EBV. Commonly used as the Phase 1 and Phase 3 alignment reference when harmonizing with public 1KG VCFs and phase panels.
URL
KEYWORDS
GRCh37; 1000 Genomes; reference FASTA; human_g1k_v37
Main citation
1000 Genomes Project. human_g1k_v37 reference (GRCh37). https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/
jMorp
PUBMED_LINK
FULL NAME
Japanese Multi-Omics Reference Panel
URL
TITLE
jMorp: Japanese Multi-Omics Reference Panel update report 2023.
Main citation
Tadaka S, Kawashima J, Hishinuma E, Saito S, ...&, Kinoshita K. (2024) jMorp: Japanese Multi-Omics Reference Panel update report 2023. Nucleic Acids Res, 52 (D1) D622-D632. doi:10.1093/nar/gkad978. PMID 37930845
ABSTRACT
Modern medicine is increasingly focused on personalized medicine, and multi-omics data is crucial in understanding biological phenomena and disease mechanisms. Each ethnic group has its unique genetic background with specific genomic variations influencing disease risk and drug response. Therefore, multi-omics data from specific ethnic populations are essential for the effective implementation of personalized medicine. Various prospective cohort studies, such as the UK Biobank, All of Us and Lifelines, have been conducted worldwide. The Tohoku Medical Megabank project was initiated after the Great East Japan Earthquake in 2011. It collects biological specimens and conducts genome and omics analyses to build a basis for personalized medicine. Summary statistical data from these analyses are available in the jMorp web database (https://jmorp.megabank.tohoku.ac.jp), which provides a multidimensional approach to the diversity of the Japanese population. jMorp was launched in 2015 as a public database for plasma metabolome and proteome analyses and has been continuously updated. The current update will significantly expand the scale of the data (metabolome, genome, transcriptome, and metagenome). In addition, the user interface and backend server implementations were rewritten to improve the connectivity between the items stored in jMorp. This paper provides an overview of the new version of the jMorp.
DOI
10.1093/nar/gkad978
M-CAP
PUBMED_LINK
FULL NAME
Mendelian Clinically Applicable Pathogenicity
DESCRIPTION
Rare missense pathogenicity classifier for clinical interpretation.
URL
KEYWORDS
missense, clinical
USE
clinical scoring
TITLE
M-CAP eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity.
Main citation
Jagadeesh KA, Wenger AM, Berger MJ, Guturu H, ...&, Bejerano G. (2016) M-CAP eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity. Nat Genet, 48 (12) 1581-1586. doi:10.1038/ng.3703. PMID 27776117
ABSTRACT
Variant pathogenicity classifiers such as SIFT, PolyPhen-2, CADD, and MetaLR assist in interpretation of the hundreds of rare, missense variants in the typical patient genome by deprioritizing some variants as likely benign. These widely used methods misclassify 26 to 38% of known pathogenic mutations, which could lead to missed diagnoses if the classifiers are trusted as definitive in a clinical setting. We developed M-CAP, a clinical pathogenicity classifier that outperforms existing methods at all thresholds and correctly dismisses 60% of rare, missense variants of uncertain significance in a typical genome at 95% sensitivity.
DOI
10.1038/ng.3703
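The M-CAP abstract reports results "at 95% sensitivity", meaning the score cutoff is chosen so that 95% of known pathogenic variants are still called pathogenic. A minimal sketch of that thresholding idea follows; it is not the M-CAP implementation, the function names are hypothetical, the scores are made up for illustration, and it assumes higher scores mean "more likely pathogenic".

```python
import math


def threshold_at_sensitivity(pathogenic_scores, sensitivity=0.95):
    """Largest cutoff such that at least `sensitivity` of the known
    pathogenic variants have score >= cutoff."""
    if not pathogenic_scores:
        raise ValueError("need at least one pathogenic score")
    ranked = sorted(pathogenic_scores, reverse=True)
    # Round before ceil to avoid floating-point overshoot (e.g. 9.000000000000002).
    needed = max(1, math.ceil(round(sensitivity * len(ranked), 9)))
    return ranked[needed - 1]


def fraction_dismissed(vus_scores, cutoff):
    """Fraction of variants of uncertain significance scoring below the cutoff."""
    return sum(score < cutoff for score in vus_scores) / len(vus_scores)


if __name__ == "__main__":
    # Illustrative, made-up scores.
    known_pathogenic = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
    vus = [0.01, 0.05, 0.2, 0.5]
    cutoff = threshold_at_sensitivity(known_pathogenic, sensitivity=0.9)
    print(cutoff, fraction_dismissed(vus, cutoff))
```

Fixing sensitivity first and then measuring how many uncertain variants fall below the cutoff mirrors how the abstract frames the trade-off: missed pathogenic variants are bounded, and the benefit is the share of variants that can be set aside.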
MetaLR / MetaSVM (MetaLR)
PUBMED_LINK
DESCRIPTION
Ensemble pathogenicity scores integrating multiple annotations.
URL
KEYWORDS
ensemble, missense
USE
prioritization
TITLE
Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies.
Main citation
Dong C, Wei P, Jian X, Gibbs R, ...&, Liu X. (2015) Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Hum Mol Genet, 24 (8) 2125-37. doi:10.1093/hmg/ddu733. PMID 25552646
ABSTRACT
Accurate deleteriousness prediction for nonsynonymous variants is crucial for distinguishing pathogenic mutations from background polymorphisms in whole exome sequencing (WES) studies. Although many deleteriousness prediction methods have been developed, their prediction results are sometimes inconsistent with each other and their relative merits are still unclear in practical applications. To address these issues, we comprehensively evaluated the predictive performance of 18 current deleteriousness-scoring methods, including 11 function prediction scores (PolyPhen-2, SIFT, MutationTaster, Mutation Assessor, FATHMM, LRT, PANTHER, PhD-SNP, SNAP, SNPs&GO and MutPred), 3 conservation scores (GERP++, SiPhy and PhyloP) and 4 ensemble scores (CADD, PON-P, KGGSeq and CONDEL). We found that FATHMM and KGGSeq had the highest discriminative power among independent scores and ensemble scores, respectively. Moreover, to ensure unbiased performance evaluation of these prediction scores, we manually collected three distinct testing datasets, on which no current prediction scores were tuned. In addition, we developed two new ensemble scores that integrate nine independent scores and allele frequency. Our scores achieved the highest discriminative power compared with all the deleteriousness prediction scores tested and showed low false-positive prediction rate for benign yet rare nonsynonymous variants, which demonstrated the value of combining information from multiple orthologous approaches. Finally, to facilitate variant prioritization in WES studies, we have pre-computed our ensemble scores for 87 347 044 possible variants in the whole-exome and made them publicly available through the ANNOVAR software and the dbNSFP database.
DOI
10.1093/hmg/ddu733
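As the name suggests, MetaLR combines its component scores with logistic regression. The sketch below shows only the general shape of such an ensemble: the feature names, weights, and bias are hypothetical placeholders, not the published coefficients, and the inputs are assumed to be rescaled so that larger values indicate greater predicted deleteriousness.

```python
import math

# Hypothetical weights for illustration only; NOT the published MetaLR model.
EXAMPLE_WEIGHTS = {"score_a": 1.5, "score_b": 0.8, "allele_frequency": -4.0}
EXAMPLE_BIAS = -1.0


def ensemble_score(features, weights=EXAMPLE_WEIGHTS, bias=EXAMPLE_BIAS):
    """Logistic-regression-style ensemble: sigmoid of a weighted sum of inputs."""
    z = bias + sum(weights[name] * features[name] for name in weights)
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    rare_damaging = {"score_a": 0.9, "score_b": 0.8, "allele_frequency": 0.0001}
    common_benign = {"score_a": 0.2, "score_b": 0.1, "allele_frequency": 0.2}
    print(ensemble_score(rare_damaging), ensemble_score(common_benign))
```

The negative weight on allele frequency in this toy example reflects the abstract's point that combining allele frequency with component scores helps keep the false-positive rate low for benign variants.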
MutationAssessor
PUBMED_LINK
DESCRIPTION
Predicts functional impact based on evolutionary conservation.
URL
KEYWORDS
conservation, function
USE
variant effect
TITLE
MutationAssessor in cBioPortal.
Main citation
Su Y, Li X, Reva B, Antipin Y, ...&, Sander C. (2025) MutationAssessor in cBioPortal. bioRxiv, () . doi:10.1101/2025.08.10.669566. PMID 40832239
ABSTRACT
MutationAssessor (MA) helps researchers evaluate the likely functional impact of somatic and germline mutations in cancer. It provides an evolution-based functional impact score (FIS) to classify mutations based on their likely effect on protein function. FIS scores are based on analysis of patterns of conservation in protein families (conserved residues) and subfamilies (specificity residues). In this new version (r4) we have (1) refined the combinatorial entropy analysis of conservation patterns, (2) recalculated full-length protein multiple sequence alignments covering a larger fraction of human proteins and making use of the explosive growth of protein sequence data, (3) compared predicted functional impact with the pathogenic-benign classification of sequence variants in curated knowledge bases, such as ClinVar, (4) observed the inverse relationship between predicted high functional impact and variant frequency in germline genome sequences and (5) explore the evaluation of switch-of-function mutational effects. Functional impact of ~4 million somatic amino-acid changing mutations across more than 320K human tumor samples are now available in the widely used cBioPortal for Cancer Genomics.
DOI
10.1101/2025.08.10.669566
MVP
PUBMED_LINK
FULL NAME
Missense Variant Pathogenicity prediction
DESCRIPTION
Prediction method that uses a deep residual network to leverage large training data sets and many correlated predictors.
URL
KEYWORDS
deep residual network, pathogenic missense variant
TITLE
MVP predicts the pathogenicity of missense variants by deep learning.
Main citation
Qi H, Zhang H, Zhao Y, Chen C, ...&, Shen Y. (2021) MVP predicts the pathogenicity of missense variants by deep learning. Nat Commun, 12 (1) 510. doi:10.1038/s41467-020-20847-0. PMID 33479230
ABSTRACT
Accurate pathogenicity prediction of missense variants is critically important in genetic studies and clinical diagnosis. Previously published prediction methods have facilitated the interpretation of missense variants but have limited performance. Here, we describe MVP (Missense Variant Pathogenicity prediction), a new prediction method that uses deep residual network to leverage large training data sets and many correlated predictors. We train the model separately in genes that are intolerant of loss of function variants and the ones that are tolerant in order to take account of potentially different genetic effect size and mode of action. We compile cancer mutation hotspots and de novo variants from developmental disorders for benchmarking. Overall, MVP achieves better performance in prioritizing pathogenic missense variants than previous methods, especially in genes tolerant of loss of function variants. Finally, using MVP, we estimate that de novo coding variants contribute to 7.8% of isolated congenital heart disease, nearly doubling previous estimates.
DOI
10.1038/s41467-020-20847-0
NCBI-Gene
PUBMED_LINK
DESCRIPTION
Gene integrates information from a wide range of species. A record may include nomenclature, Reference Sequences (RefSeqs), maps, pathways, variations, phenotypes, and links to genome-, phenotype-, and locus-specific resources worldwide.
URL
TITLE
Database resources of the National Center for Biotechnology Information.
Main citation
Sayers EW, Beck J, Bolton EE, Bourexis D, ...&, Sherry ST. (2021) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res, 49 (D1) D10-D17. doi:10.1093/nar/gkaa892. PMID 33095870
ABSTRACT
The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. The Entrez system provides search and retrieval operations for most of these data from 34 distinct databases. The E-utilities serve as the programming interface for the Entrez system. Custom implementations of the BLAST program provide sequence-based searching of many specialized datasets. New resources released in the past year include a new PubMed interface and NCBI datasets. Additional resources that were updated in the past year include PMC, Bookshelf, Genome Data Viewer, SRA, ClinVar, dbSNP, dbVar, Pathogen Detection, BLAST, Primer-BLAST, IgBLAST, iCn3D and PubChem. All of these resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.
DOI
10.1093/nar/gkaa892
OneK1K
PUBMED_LINK
DESCRIPTION
The OneK1K cohort consists of single-cell RNA sequencing (scRNA-seq) data from 1.27 million peripheral blood mononuclear cells (PBMCs) collected from 982 donors. We developed a framework for the classification of individual cells, and by combining the scRNA-seq data with genotype data, we mapped the genetic effects on gene expression in each of 14 immune cell types and identified 26,597 independent cis-expression quantitative trait loci (eQTLs).
URL
TITLE
Single-cell eQTL mapping identifies cell type-specific genetic control of autoimmune disease.
Main citation
Yazar S, Alquicira-Hernandez J, Wing K, Senabouth A, ...&, Powell JE. (2022) Single-cell eQTL mapping identifies cell type-specific genetic control of autoimmune disease. Science, 376 (6589) eabf3041. doi:10.1126/science.abf3041. PMID 35389779
ABSTRACT
The human immune system displays substantial variation between individuals, leading to differences in susceptibility to autoimmune disease. We present single-cell RNA sequencing (scRNA-seq) data from 1,267,758 peripheral blood mononuclear cells from 982 healthy human subjects. For 14 cell types, we identified 26,597 independent cis-expression quantitative trait loci (eQTLs) and 990 trans-eQTLs, with most showing cell type-specific effects on gene expression. We subsequently show how eQTLs have dynamic allelic effects in B cells that are transitioning from naïve to memory states and demonstrate how commonly segregating alleles lead to interindividual variation in immune function. Finally, using a Mendelian randomization approach, we identify the causal route by which 305 risk loci contribute to autoimmune disease at the cellular level. This work brings together genetic epidemiology with scRNA-seq to uncover drivers of interindividual variation in the immune system.
DOI
10.1126/science.abf3041
Open Target Genetics
PGG.Han 2.0 (PGG.Han)
PolyPhen-2
PUBMED_LINK
FULL NAME
Polymorphism Phenotyping v2
DESCRIPTION
Predicts functional impact of amino acid substitutions.
URL
KEYWORDS
missense, conservation
USE
variant scoring
TITLE
Predicting functional effect of human missense mutations using PolyPhen-2.
Main citation
Adzhubei I, Jordan DM, Sunyaev SR. (2013) Predicting functional effect of human missense mutations using PolyPhen-2. Curr Protoc Hum Genet, Chapter 7, Unit7.20. doi:10.1002/0471142905.hg0720s76. PMID 23315928
ABSTRACT
PolyPhen-2 (Polymorphism Phenotyping v2), available as software and via a Web server, predicts the possible impact of amino acid substitutions on the stability and function of human proteins using structural and comparative evolutionary considerations. It performs functional annotation of single-nucleotide polymorphisms (SNPs), maps coding SNPs to gene transcripts, extracts protein sequence annotations and structural attributes, and builds conservation profiles. It then estimates the probability of the missense mutation being damaging based on a combination of all these properties. PolyPhen-2 features include a high-quality multiple protein sequence alignment pipeline and a prediction method employing machine-learning classification. The software also integrates the UCSC Genome Browser's human genome annotations and MultiZ multiple alignments of vertebrate genomes with the human genome. PolyPhen-2 is capable of analyzing large volumes of data produced by next-generation sequencing projects, thanks to built-in support for high-performance computing environments like Grid Engine and Platform LSF.
DOI
10.1002/0471142905.hg0720s76
PrimateAI-3D
PUBMED_LINK
DESCRIPTION
Deep learning model trained on primate variation and 3D protein structure.
URL
KEYWORDS
deep learning, primate, missense
USE
clinical variant scoring
TITLE
The landscape of tolerated genetic variation in humans and primates.
Main citation
Gao H, Hamp T, Ede J, Schraiber JG, ...&, Farh KK. (2023) The landscape of tolerated genetic variation in humans and primates. Science, 380 (6648) eabn8153. doi:10.1126/science.abn8197. PMID 37262156
ABSTRACT
Personalized genome sequencing has revealed millions of genetic differences between individuals, but our understanding of their clinical relevance remains largely incomplete. To systematically decipher the effects of human genetic variants, we obtained whole-genome sequencing data for 809 individuals from 233 primate species and identified 4.3 million common protein-altering variants with orthologs in humans. We show that these variants can be inferred to have nondeleterious effects in humans based on their presence at high allele frequencies in other primate populations. We use this resource to classify 6% of all possible human protein-altering variants as likely benign and impute the pathogenicity of the remaining 94% of variants with deep learning, achieving state-of-the-art accuracy for diagnosing pathogenic variants in patients with genetic diseases.
DOI
10.1126/science.abn8197
ProGen2
PUBMED_LINK
DESCRIPTION
Generative protein design using LLMs trained on protein sequences.
URL
KEYWORDS
protein design, LLM
USE
sequence generation
TITLE
ProGen2: Exploring the boundaries of protein language models.
Main citation
Nijkamp E, Ruffolo JA, Weinstein EN, Naik N, ...&, Madani A. (2023) ProGen2: Exploring the boundaries of protein language models. Cell Syst, 14 (11) 968-978.e3. doi:10.1016/j.cels.2023.10.002. PMID 37909046
ABSTRACT
Attention-based models trained on protein sequences have demonstrated incredible success at classification and generation tasks relevant for artificial-intelligence-driven protein design. However, we lack a sufficient understanding of how very large-scale models and data play a role in effective protein model development. We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters and trained on different sequence datasets drawn from over a billion proteins from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional fine-tuning. As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model. Our models and code are open sourced for widespread adoption in protein engineering. A record of this paper's Transparent Peer Review process is included in the supplemental information.
DOI
10.1016/j.cels.2023.10.002
ProtBERT
PUBMED_LINK
DESCRIPTION
BERT-based protein language model for downstream functional tasks.
URL
KEYWORDS
protein LM, transformer, embeddings
USE
feature extraction
TITLE
ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning.
Main citation
Elnaggar A, Heinzinger M, Dallago C, Rehawi G, ...&, Rost B. (2022) ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Trans Pattern Anal Mach Intell, 44 (10) 7112-7127. doi:10.1109/TPAMI.2021.3095381. PMID 34232869
ABSTRACT
Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The protein LMs (pLMs) were trained on the Summit supercomputer using 5616 GPUs and TPU Pod up-to 1024 cores. Dimensionality reduction revealed that the raw pLM-embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks: (1) a per-residue (per-token) prediction of protein secondary structure (3-state accuracy Q3=81%-87%); (2) per-protein (pooling) predictions of protein sub-cellular location (ten-state accuracy: Q10=81%) and membrane versus water-soluble (2-state accuracy Q2=91%). For secondary structure, the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without multiple sequence alignments (MSAs) or evolutionary information thereby bypassing expensive database searches. Taken together, the results implied that pLMs learned some of the grammar of the language of life. All our models are available through https://github.com/agemagician/ProtTrans.
DOI
10.1109/TPAMI.2021.3095381
ProteinBERT
PUBMED_LINK
TITLE
ProteinBERT: a universal deep-learning model of protein sequence and function.
Main citation
Brandes N, Ofer D, Peleg Y, Rappoport N, ...&, Linial M. (2022) ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics, 38 (8) 2102-2110. doi:10.1093/bioinformatics/btac020. PMID 35020807
ABSTRACT
SUMMARY: Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed to biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme combines language modeling with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to long sequences. The architecture of ProteinBERT consists of both local and global representations, allowing end-to-end processing of these types of inputs and outputs. ProteinBERT obtains near state-of-the-art performance, and sometimes exceeds it, on multiple benchmarks covering diverse protein properties (including protein structure, post-translational modifications and biophysical attributes), despite using a far smaller and faster model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data. AVAILABILITY AND IMPLEMENTATION: Code and pretrained model weights are available at https://github.com/nadavbra/protein_bert. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
DOI
10.1093/bioinformatics/btac020
ProtVar
PUBMED_LINK
URL
TITLE
ProtVar: mapping and contextualizing human missense variation.
Main citation
Stephenson JD, Totoo P, Burke DF, Jänes J, ...&, Martin MJ. (2024) ProtVar: mapping and contextualizing human missense variation. Nucleic Acids Res, 52 (W1) W140-W147. doi:10.1093/nar/gkae413. PMID 38769064
ABSTRACT
Genomic variation can impact normal biological function in complex ways and so understanding variant effects requires a broad range of data to be coherently assimilated. Whilst the volume of human variant data and relevant annotations has increased, the corresponding increase in the breadth of participating fields, standards and versioning mean that moving between genomic, coding, protein and structure positions is increasingly complex. In turn this makes investigating variants in diverse formats and assimilating annotations from different resources challenging. ProtVar addresses these issues to facilitate the contextualization and interpretation of human missense variation with unparalleled flexibility and ease of accessibility for use by the broadest range of researchers. By precalculating all possible variants in the human proteome it offers near instantaneous mapping between all relevant data types. It also combines data and analyses from a plethora of resources to bring together genomic, protein sequence and function annotations as well as structural insights and predictions to better understand the likely effect of missense variation in humans. It is offered as an intuitive web server https://www.ebi.ac.uk/protvar where data can be explored and downloaded, and can be accessed programmatically via an API.
DOI
10.1093/nar/gkae413
Reactome
PUBMED_LINK
DESCRIPTION
REACTOME is an open-source, open access, manually curated and peer-reviewed pathway database. Our goal is to provide intuitive bioinformatics tools for the visualization, interpretation and analysis of pathway knowledge to support basic and clinical research, genome analysis, modeling, systems biology and education. Founded in 2003, the Reactome project is led by Lincoln Stein of OICR, Peter D’Eustachio of NYU Langone Health, Henning Hermjakob of EMBL-EBI, and Guanming Wu of OHSU.
URL
KEYWORDS
Pathway
TITLE
The Reactome Pathway Knowledgebase 2024.
Main citation
Milacic M, Beavers D, Conley P, Gong C, ...&, D'Eustachio P. (2024) The Reactome Pathway Knowledgebase 2024. Nucleic Acids Res, 52 (D1) D672-D678. doi:10.1093/nar/gkad1025. PMID 37941124
ABSTRACT
The Reactome Knowledgebase (https://reactome.org), an Elixir and GCBR core biological data resource, provides manually curated molecular details of a broad range of normal and disease-related biological processes. Processes are annotated as an ordered network of molecular transformations in a single consistent data model. Reactome thus functions both as a digital archive of manually curated human biological processes and as a tool for discovering functional relationships in data such as gene expression profiles or somatic mutation catalogs from tumor cells. Here we review progress towards annotation of the entire human proteome, targeted annotation of disease-causing genetic variants of proteins and of small-molecule drugs in a pathway context, and towards supporting explicit annotation of cell- and tissue-specific pathways. Finally, we briefly discuss issues involved in making Reactome more fully interoperable with other related resources such as the Gene Ontology and maintaining the resulting community resource network.
DOI
10.1093/nar/gkad1025
REVEL
PUBMED_LINK
FULL NAME
Rare Exome Variant Ensemble Learner
DESCRIPTION
Ensemble method integrating multiple tools to predict pathogenicity.
URL
KEYWORDS
ensemble, missense
USE
pathogenicity scoring
TITLE
REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants.
Main citation
Ioannidis NM, Rothstein JH, Pejaver V, Middha S, ...&, Sieh W. (2016) REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. Am J Hum Genet, 99 (4) 877-885. doi:10.1016/j.ajhg.2016.08.016. PMID 27666373
ABSTRACT
The vast majority of coding variants are rare, and assessment of the contribution of rare variants to complex traits is hampered by low statistical power and limited functional data. Improved methods for predicting the pathogenicity of rare coding variants are needed to facilitate the discovery of disease variants from exome sequencing studies. We developed REVEL (rare exome variant ensemble learner), an ensemble method for predicting the pathogenicity of missense variants on the basis of individual tools: MutPred, FATHMM, VEST, PolyPhen, SIFT, PROVEAN, MutationAssessor, MutationTaster, LRT, GERP, SiPhy, phyloP, and phastCons. REVEL was trained with recently discovered pathogenic and rare neutral missense variants, excluding those previously used to train its constituent tools. When applied to two independent test sets, REVEL had the best overall performance (p < 10^-12) as compared to any individual tool and seven ensemble methods: MetaSVM, MetaLR, KGGSeq, Condel, CADD, DANN, and Eigen. Importantly, REVEL also had the best performance for distinguishing pathogenic from rare neutral variants with allele frequencies <0.5%. The area under the receiver operating characteristic curve (AUC) for REVEL was 0.046-0.182 higher in an independent test set of 935 recent SwissVar disease variants and 123,935 putatively neutral exome sequencing variants and 0.027-0.143 higher in an independent test set of 1,953 pathogenic and 2,406 benign variants recently reported in ClinVar than the AUCs for other ensemble methods. We provide pre-computed REVEL scores for all possible human missense variants to facilitate the identification of pathogenic variants in the sea of rare variants discovered as sequencing studies expand in scale.
DOI
10.1016/j.ajhg.2016.08.016
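The REVEL abstract compares tools by the area under the ROC curve (AUC). As a rough illustration of that metric only, the sketch below computes AUC as the probability that a randomly chosen pathogenic variant scores higher than a randomly chosen benign one, with ties counted as one half. The scores, labels and tool names are made-up toy values, not REVEL data, and the function `roc_auc` is a hypothetical helper rather than anything shipped with REVEL.

```python
def roc_auc(scores, labels):
    """Rank-based AUC: P(score of a positive > score of a negative).

    `labels` uses 1 for pathogenic and 0 for benign. Ties contribute 0.5.
    Quadratic in the number of variants, which is fine for a small sketch.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one variant of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (assumed for illustration): three pathogenic, three benign variants.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_tool_a = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
toy_tool_b = [0.9, 0.3, 0.4, 0.5, 0.2, 0.6]

auc_a = roc_auc(toy_tool_a, toy_labels)
auc_b = roc_auc(toy_tool_b, toy_labels)
print(f"tool A AUC = {auc_a:.3f}, tool B AUC = {auc_b:.3f}, gain = {auc_a - auc_b:.3f}")
```

A statement such as "0.046-0.182 higher" in the abstract is a difference of this kind between two tools evaluated on the same labelled test set.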
ROADMAP 2010
PUBMED_LINK
TITLE
The NIH Roadmap Epigenomics Mapping Consortium.
Main citation
Bernstein BE, Stamatoyannopoulos JA, Costello JF, Ren B, ...&, Thomson JA. (2010) The NIH Roadmap Epigenomics Mapping Consortium. Nat Biotechnol, 28 (10) 1045-8. doi:10.1038/nbt1010-1045. PMID 20944595
ABSTRACT
The NIH Roadmap Epigenomics Mapping Consortium aims to produce a public resource of epigenomic maps for stem cells and primary ex vivo tissues selected to represent the normal counterparts of tissues and organ systems frequently involved in human disease.
DOI
10.1038/nbt1010-1045
ROADMAP 2015
PUBMED_LINK
URL
TITLE
Integrative analysis of 111 reference human epigenomes.
Main citation
Roadmap Epigenomics Consortium, Kundaje A, Meuleman W, Ernst J, ...&, Kellis M. (2015) Integrative analysis of 111 reference human epigenomes. Nature, 518 (7539) 317-30. doi:10.1038/nature14248. PMID 25693563
ABSTRACT
The reference human genome sequence set the stage for studies of genetic variation and its association with human disease, but epigenomic studies lack a similar reference. To address this need, the NIH Roadmap Epigenomics Consortium generated the largest collection so far of human epigenomes for primary cells and tissues. Here we describe the integrative analysis of 111 reference human epigenomes generated as part of the programme, profiled for histone modification patterns, DNA accessibility, DNA methylation and RNA expression. We establish global maps of regulatory elements, define regulatory modules of coordinated activity, and their likely activators and repressors. We show that disease- and trait-associated genetic variants are enriched in tissue-specific epigenomic marks, revealing biologically relevant cell types for diverse human traits, and providing a resource for interpreting the molecular basis of human disease. Our results demonstrate the central role of epigenomic information for understanding gene regulation, cellular differentiation and human disease.
DOI
10.1038/nature14248
SAHA
FULL NAME
The Spatial Atlas of Human Anatomy (SAHA)
Main citation
Park, J. et al. The spatial atlas of Human Anatomy (SAHA): A multimodal subcellular-resolution reference across human organs. bioRxiv 2025.06.16.658716 (2025) doi:10.1101/2025.06.16.658716.
SIFT
PUBMED_LINK
FULL NAME
Sorting Intolerant From Tolerant
DESCRIPTION
Predicts whether substitutions affect protein function.
URL
KEYWORDS
conservation, missense
USE
variant scoring
TITLE
SIFT: Predicting amino acid changes that affect protein function.
Main citation
Ng PC, Henikoff S. (2003) SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res, 31 (13) 3812-4. doi:10.1093/nar/gkg509. PMID 12824425
ABSTRACT
Single nucleotide polymorphism (SNP) studies and random mutagenesis projects identify amino acid substitutions in protein-coding regions. Each substitution has the potential to affect protein function. SIFT (Sorting Intolerant From Tolerant) is a program that predicts whether an amino acid substitution affects protein function so that users can prioritize substitutions for further study. We have shown that SIFT can distinguish between functionally neutral and deleterious amino acid changes in mutagenesis studies and on human polymorphisms. SIFT is available at http://blocks.fhcrc.org/sift/SIFT.html.
DOI
10.1093/nar/gkg509
STRING
PUBMED_LINK
DESCRIPTION
STRING is a database of known and predicted protein-protein interactions. The interactions include direct (physical) and indirect (functional) associations; they stem from computational prediction, from knowledge transfer between organisms, and from interactions aggregated from other (primary) databases.
URL
KEYWORDS
Interaction
TITLE
The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest.
Main citation
Szklarczyk D, Kirsch R, Koutrouli M, Nastou K, ...&, von Mering C. (2023) The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res, 51 (D1) D638-D646. doi:10.1093/nar/gkac1000. PMID 36370105
ABSTRACT
Much of the complexity within cells arises from functional and regulatory interactions among proteins. The core of these interactions is increasingly known, but novel interactions continue to be discovered, and the information remains scattered across different database resources, experimental modalities and levels of mechanistic detail. The STRING database (https://string-db.org/) systematically collects and integrates protein-protein interactions, both physical interactions as well as functional associations. The data originate from a number of sources: automated text mining of the scientific literature, computational interaction predictions from co-expression, conserved genomic context, databases of interaction experiments and known complexes/pathways from curated sources. All of these interactions are critically assessed, scored, and subsequently automatically transferred to less well-studied organisms using hierarchical orthology information. The data can be accessed via the website, but also programmatically and via bulk downloads. The most recent developments in STRING (version 12.0) are: (i) it is now possible to create, browse and analyze a full interaction network for any novel genome of interest, by submitting its complement of encoded proteins, (ii) the co-expression channel now uses variational auto-encoders to predict interactions, and it covers two new sources, single-cell RNA-seq and experimental proteomics data and (iii) the confidence in each experimentally derived interaction is now estimated based on the detection method used, and communicated to the user in the web-interface. Furthermore, STRING continues to enhance its facilities for functional enrichment analysis, which are now fully available also for user-submitted genomes.
DOI
10.1093/nar/gkac1000
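The STRING abstract notes that the data can be accessed programmatically. The following is a hedged sketch that only constructs a request URL for the STRING REST interface and makes no network call. The helper name, the example proteins and the score cut-off are assumptions; the URL pattern (`/api/<format>/network`) and the `identifiers`, `species` and `required_score` parameters follow the STRING API documentation as best understood here, and should be checked against the current documentation before use.

```python
from urllib.parse import urlencode

# Base of the STRING REST API.
STRING_API = "https://string-db.org/api"


def build_string_network_url(identifiers, species=9606, required_score=700,
                             output_format="tsv"):
    """Return a URL requesting the interaction network among `identifiers`.

    Hypothetical helper for illustration. Multiple identifiers are joined
    with a carriage return, which URL-encodes to %0D.
    """
    params = {
        "identifiers": "\r".join(identifiers),
        "species": species,                # NCBI taxonomy ID; 9606 is human
        "required_score": required_score,  # confidence threshold on a 0-1000 scale
    }
    return f"{STRING_API}/{output_format}/network?{urlencode(params)}"


# Example proteins chosen arbitrarily for the sketch.
string_url = build_string_network_url(["TP53", "MDM2"])
print(string_url)
```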
Taiwan View
UKBB-LD
PUBMED_LINK
DESCRIPTION
Linkage disequilibrium (LD) matrices of UK Biobank participants of British ancestry, based on imputed genotypes.
URL
TITLE
Functionally informed fine-mapping and polygenic localization of complex trait heritability.
Main citation
Weissbrod O, Hormozdiari F, Benner C, Cui R, ...&, Price AL. (2020) Functionally informed fine-mapping and polygenic localization of complex trait heritability. Nat Genet, 52 (12) 1355-1363. doi:10.1038/s41588-020-00735-5. PMID 33199916
ABSTRACT
Fine-mapping aims to identify causal variants impacting complex traits. We propose PolyFun, a computationally scalable framework to improve fine-mapping accuracy by leveraging functional annotations across the entire genome (not just genome-wide-significant loci) to specify prior probabilities for fine-mapping methods such as SuSiE or FINEMAP. In simulations, PolyFun + SuSiE and PolyFun + FINEMAP were well calibrated and identified >20% more variants with a posterior causal probability >0.95 than identified in their nonfunctionally informed counterparts. In analyses of 49 UK Biobank traits (average n = 318,000), PolyFun + SuSiE identified 3,025 fine-mapped variant-trait pairs with posterior causal probability >0.95, a >32% improvement versus SuSiE. We used posterior mean per-SNP heritabilities from PolyFun + SuSiE to perform polygenic localization, constructing minimal sets of common SNPs causally explaining 50% of common SNP heritability; these sets ranged in size from 28 (hair color) to 3,400 (height) to 2 million (number of children). In conclusion, PolyFun prioritizes variants for functional follow-up and provides insights into complex trait architectures.
DOI
10.1038/s41588-020-00735-5
MAIN ANCESTRY
EUR
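The abstract for this entry describes using functional annotations to set prior causal probabilities for fine-mapping. To show why priors matter, the sketch below uses a deliberately simplified model that assumes exactly one causal variant per locus, so that each variant's posterior probability is proportional to its prior times its Bayes factor. This is not the PolyFun, SuSiE or FINEMAP implementation (those handle multiple causal variants and use LD matrices such as the ones catalogued here); the Bayes factors and priors are invented toy numbers.

```python
def posterior_inclusion_probs(bayes_factors, priors):
    """Posterior causal probabilities under a single-causal-variant assumption.

    PIP_i = prior_i * BF_i / sum_j(prior_j * BF_j)
    """
    if len(bayes_factors) != len(priors):
        raise ValueError("bayes_factors and priors must have the same length")
    weights = [bf * pr for bf, pr in zip(bayes_factors, priors)]
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [w / total for w in weights]


# Toy locus: two variants in tight LD have identical evidence, a third has little.
toy_bfs = [10.0, 10.0, 1.0]

flat_priors = [1 / 3, 1 / 3, 1 / 3]   # no functional information
functional_priors = [0.6, 0.2, 0.2]   # assumed: variant 0 lies in an enriched annotation

pip_flat = posterior_inclusion_probs(toy_bfs, flat_priors)
pip_informed = posterior_inclusion_probs(toy_bfs, functional_priors)
print([round(p, 3) for p in pip_flat])
print([round(p, 3) for p in pip_informed])
```

With flat priors the two linked variants cannot be separated; the functionally informed prior shifts posterior mass toward the annotated variant, which is the intuition behind the reported gain in confidently fine-mapped variants.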
UNEECON
PUBMED_LINK
DESCRIPTION
UNEECON is a statistical method for inferring deleterious mutations and constrained genes in human and potentially other species.
URL
TITLE
Unified inference of missense variant effects and gene constraints in the human genome.
Main citation
Huang YF. (2020) Unified inference of missense variant effects and gene constraints in the human genome. PLoS Genet, 16 (7) e1008922. doi:10.1371/journal.pgen.1008922. PMID 32667917
ABSTRACT
A challenge in medical genomics is to identify variants and genes associated with severe genetic disorders. Based on the premise that severe, early-onset disorders often result in a reduction of evolutionary fitness, several statistical methods have been developed to predict pathogenic variants or constrained genes based on the signatures of negative selection in human populations. However, we currently lack a statistical framework to jointly predict deleterious variants and constrained genes from both variant-level features and gene-level selective constraints. Here we present such a unified approach, UNEECON, based on deep learning and population genetics. UNEECON treats the contributions of variant-level features and gene-level constraints as a variant-level fixed effect and a gene-level random effect, respectively. The sum of the fixed and random effects is then combined with an evolutionary model to infer the strength of negative selection at both variant and gene levels. Compared with previously published methods, UNEECON shows improved performance in predicting missense variants and protein-coding genes associated with autosomal dominant disorders, and feature importance analysis suggests that both gene-level selective constraints and variant-level predictors are important for accurate variant prioritization. Furthermore, based on UNEECON, we observe a low correlation between gene-level intolerance to missense mutations and that to loss-of-function mutations, which can be partially explained by the prevalence of disordered protein regions that are highly tolerant to missense mutations. Finally, we show that genes intolerant to both missense and loss-of-function mutations play key roles in the central nervous system and the autism spectrum disorders. Overall, UNEECON is a promising framework for both variant and gene prioritization.
DOI
10.1371/journal.pgen.1008922
UniProt
PUBMED_LINK
DESCRIPTION
The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc). The UniProt consortium and host institutions EMBL-EBI, SIB and PIR are committed to the long-term preservation of the UniProt databases.
URL
TITLE
UniProt: the Universal Protein Knowledgebase in 2025.
Main citation
UniProt Consortium. (2025) UniProt: the Universal Protein Knowledgebase in 2025. Nucleic Acids Res, 53 (D1) D609-D617. doi:10.1093/nar/gkae1010. PMID 39552041
ABSTRACT
The aim of the UniProt Knowledgebase (UniProtKB; https://www.uniprot.org/) is to provide users with a comprehensive, high-quality and freely accessible set of protein sequences annotated with functional information. In this publication, we describe ongoing changes to our production pipeline to limit the sequences available in UniProtKB to high-quality, non-redundant reference proteomes. We continue to manually curate the scientific literature to add the latest functional data and use machine learning techniques. We also encourage community curation to ensure key publications are not missed. We provide an update on the automatic annotation methods used by UniProtKB to predict information for unreviewed entries describing unstudied proteins. Finally, updates to the UniProt website are described, including a new tab linking protein to genomic information. In recognition of its value to the scientific community, the UniProt database has been awarded Global Core Biodata Resource status.
DOI
10.1093/nar/gkae1010
VEST4
PUBMED_LINK
FULL NAME
Variant Effect Scoring Tool v4
DESCRIPTION
Machine learning pathogenicity score for SNVs.
URL
KEYWORDS
ML, SNV
USE
variant scoring
TITLE
Identifying Mendelian disease genes with the variant effect scoring tool.
Main citation
Carter H, Douville C, Stenson PD, Cooper DN, ...&, Karchin R. (2013) Identifying Mendelian disease genes with the variant effect scoring tool. BMC Genomics, 14 Suppl 3 (Suppl 3) S3. doi:10.1186/1471-2164-14-S3-S3. PMID 23819870
ABSTRACT
BACKGROUND: Whole exome sequencing studies identify hundreds to thousands of rare protein coding variants of ambiguous significance for human health. Computational tools are needed to accelerate the identification of specific variants and genes that contribute to human disease. RESULTS: We have developed the Variant Effect Scoring Tool (VEST), a supervised machine learning-based classifier, to prioritize rare missense variants with likely involvement in human disease. The VEST classifier training set comprised ~45,000 disease mutations from the latest Human Gene Mutation Database release and another ~45,000 high frequency (allele frequency >1%) putatively neutral missense variants from the Exome Sequencing Project. VEST outperforms some of the most popular methods for prioritizing missense variants in carefully designed holdout benchmarking experiments (VEST ROC AUC = 0.91, PolyPhen2 ROC AUC = 0.86, SIFT4.0 ROC AUC = 0.84). VEST estimates variant score p-values against a null distribution of VEST scores for neutral variants not included in the VEST training set. These p-values can be aggregated at the gene level across multiple disease exomes to rank genes for probable disease involvement. We tested the ability of an aggregate VEST gene score to identify candidate Mendelian disease genes, based on whole-exome sequencing of a small number of disease cases. We used whole-exome data for two Mendelian disorders for which the causal gene is known. Considering only genes that contained variants in all cases, the VEST gene score ranked dihydroorotate dehydrogenase (DHODH) number 2 of 2253 genes in four cases of Miller syndrome, and myosin-3 (MYH3) number 2 of 2313 genes in three cases of Freeman Sheldon syndrome. CONCLUSIONS: Our results demonstrate the potential power gain of aggregating bioinformatics variant scores into gene-level scores and the general utility of bioinformatics in assisting the search for disease genes in large-scale exome sequencing studies. VEST is available as a stand-alone software package at http://wiki.chasmsoftware.org and is hosted by the CRAVAT web server at http://www.cravat.us.
DOI
10.1186/1471-2164-14-S3-S3