Reference

Catalog entries using this tag (links open the entry card on its page):

Entries

HapMap Phase I

STAGE_PERIOD

2003–2005

DESCRIPTION

International HapMap Project first data release: ~1 million SNPs in CEU, YRI, and JPT+CHB; produced the first genome-wide LD and recombination maps and drove early GWAS SNP selection and imputation panels.

URL

https://www.genome.gov/10001688/international-hapmap-project

HapMap Phase II

1000 Genomes HapMap Reference

STAGE_PERIOD

2005–2007

DESCRIPTION

Expanded SNP density (~3.1M SNPs) and haplotype structure across the same core panels; improved tagging coverage and supported finer-scale association and phasing workflows before large-scale resequencing.

URL

https://www.genome.gov/10001688/international-hapmap-project

HapMap Phase III

1000 Genomes HapMap Reference

STAGE_PERIOD

2007–2009

DESCRIPTION

Extended to 11 populations and ~1.6M SNPs; broader ancestry representation and LD maps that informed the design and early phases of the 1000 Genomes Project.

URL

https://www.genome.gov/10001688/international-hapmap-project

Phase 1

1000 Genomes Reference

STAGE_PERIOD

2010–2011

DESCRIPTION

Low-coverage WGS of 1,092 individuals, combined with exome sequencing and dense SNP genotyping; the primary SNP and indel reference for early imputation panels.

URL

https://www.internationalgenome.org/

Phase 3

1000 Genomes Reference

STAGE_PERIOD

2012–2015

DESCRIPTION

2,504 individuals across 26 populations; the GRCh37 and GRCh38 VCF releases became the standard allele-frequency, LD, and imputation backbone for GWAS and SV pipelines.

URL

https://www.internationalgenome.org/

Pilot

1000 Genomes Reference

STAGE_PERIOD

2008–2010

DESCRIPTION

Proof-of-concept low-coverage whole-genome sequencing and SNP arrays across multiple populations; established protocols and data model for the main project.

URL

https://www.internationalgenome.org/

AlphaFold

Reference

PUBMED_LINK

31942072

URL

https://alphafold.ebi.ac.uk/

TITLE

Improved protein structure prediction using potentials from deep learning.

Main citation

Senior AW, Evans R, Jumper J, Kirkpatrick J, ...&, Hassabis D. (2020) Improved protein structure prediction using potentials from deep learning. Nature, 577 (7792) 706-710. doi:10.1038/s41586-019-1923-7. PMID 31942072

ABSTRACT

Protein structure prediction can be used to determine the three-dimensional shape of a protein from its amino acid sequence. This problem is of fundamental importance as the structure of a protein largely determines its function; however, protein structures can be difficult to determine experimentally. Considerable progress has recently been made by leveraging genetic information. It is possible to infer which amino acid residues are in contact by analysing covariation in homologous sequences, which aids in the prediction of protein structures. Here we show that we can train a neural network to make accurate predictions of the distances between pairs of residues, which convey more information about the structure than contact predictions. Using this information, we construct a potential of mean force that can accurately describe the shape of a protein. We find that the resulting potential can be optimized by a simple gradient descent algorithm to generate structures without complex sampling procedures. The resulting system, named AlphaFold, achieves high accuracy, even for sequences with fewer homologous sequences. In the recent Critical Assessment of Protein Structure Prediction (CASP13), a blind assessment of the state of the field, AlphaFold created high-accuracy structures (with template modelling (TM) scores of 0.7 or higher) for 24 out of 43 free modelling domains, whereas the next best method, which used sampling and contact information, achieved such accuracy for only 14 out of 43 domains. AlphaFold represents a considerable advance in protein-structure prediction. We expect this increased accuracy to enable insights into the function and malfunction of proteins, especially in cases for which no structures for homologous proteins have been experimentally determined.

DOI

10.1038/s41586-019-1923-7
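
The abstract above says that a potential built from predicted inter-residue distances "can be optimized by a simple gradient descent algorithm". The following is only a toy sketch of that idea, not the AlphaFold potential: it swaps in a plain quadratic penalty on a fixed target distance matrix, and every name in it (`loss_and_grad`, `fold`, the step size, the toy target) is an assumption made for illustration.

```python
import numpy as np


def loss_and_grad(coords, target):
    """Quadratic distance-restraint loss and its gradient w.r.t. coords.

    coords: (n, 3) array of point positions.
    target: (n, n) symmetric matrix of desired pairwise distances.
    """
    diff = coords[:, None, :] - coords[None, :, :]   # (n, n, 3) pair vectors
    dist = np.linalg.norm(diff, axis=-1)             # (n, n) current distances
    np.fill_diagonal(dist, 1.0)                      # avoid 0/0 on the diagonal
    err = dist - target
    np.fill_diagonal(err, 0.0)                       # self-pairs contribute nothing
    loss = 0.5 * np.sum(err ** 2)                    # each pair counted twice
    grad = 2.0 * np.sum((err / dist)[:, :, None] * diff, axis=1)
    return loss, grad


def fold(target, n_steps=500, lr=0.01, seed=0):
    """Plain gradient descent from a random start; returns coords and loss trace."""
    rng = np.random.default_rng(seed)
    coords = rng.normal(size=(target.shape[0], 3))
    history = []
    for _ in range(n_steps):
        loss, grad = loss_and_grad(coords, target)
        history.append(loss)
        coords = coords - lr * grad
    return coords, history


# Toy target: pairwise distances of a random "true" structure (hypothetical data).
_rng = np.random.default_rng(1)
true_coords = _rng.normal(size=(6, 3)) * 3.0
target = np.linalg.norm(true_coords[:, None, :] - true_coords[None, :, :], axis=-1)

coords, history = fold(target)
print(f"initial loss {history[0]:.2f}, final loss {history[-1]:.4f}")
```

The recovered coordinates are only defined up to rotation, translation, and reflection, since the loss depends on distances alone.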

AlphaFold 2 (AlphaFold)

Reference

PUBMED_LINK

34265844

FULL NAME

AlphaFold Protein Structure Database

DESCRIPTION

High-accuracy protein structure prediction using deep learning.

URL

https://alphafold.ebi.ac.uk/

KEYWORDS

protein structure, deep learning, folding

USE

structure prediction

SERVER

EMBL-EBI

TITLE

Highly accurate protein structure prediction with AlphaFold.

Main citation

Jumper J, Evans R, Pritzel A, Green T, ...&, Hassabis D. (2021) Highly accurate protein structure prediction with AlphaFold. Nature, 596 (7873) 583-589. doi:10.1038/s41586-021-03819-2. PMID 34265844

ABSTRACT

Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort, the structures of around 100,000 unique proteins have been determined, but this represents a small fraction of the billions of known protein sequences. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence (the structure prediction component of the 'protein folding problem') has been an important open research problem for more than 50 years. Despite recent progress, existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14), demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.

DOI

10.1038/s41586-021-03819-2

AlphaFold 3

Reference

PUBMED_LINK

38718835

URL

https://alphafold.ebi.ac.uk/

TITLE

Accurate structure prediction of biomolecular interactions with AlphaFold 3.

Main citation

Abramson J, Adler J, Dunger J, Evans R, ...&, Jumper JM. (2024) Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630 (8016) 493-500. doi:10.1038/s41586-024-07487-w. PMID 38718835

ABSTRACT

The introduction of AlphaFold 2 has spurred a revolution in modelling the structure of proteins and their interactions, enabling a huge range of applications in protein modelling and design. Here we describe our AlphaFold 3 model with a substantially updated diffusion-based architecture that is capable of predicting the joint structure of complexes including proteins, nucleic acids, small molecules, ions and modified residues. The new AlphaFold model demonstrates substantially improved accuracy over many previous specialized tools: far greater accuracy for protein-ligand interactions compared with state-of-the-art docking tools, much higher accuracy for protein-nucleic acid interactions compared with nucleic-acid-specific predictors and substantially higher antibody-antigen prediction accuracy compared with AlphaFold-Multimer v.2.3. Together, these results show that high-accuracy modelling across biomolecular space is possible within a single unified deep-learning framework.

DOI

10.1038/s41586-024-07487-w

AlphaMissense

Reference

PUBMED_LINK

37733863

DESCRIPTION

Deep learning model predicting pathogenicity of all possible missense variants in human proteins.

URL

https://github.com/google-deepmind/alphamissense

KEYWORDS

missense, pathogenicity, variant effect, deep learning

USE

variant effect scoring

TITLE

Accurate proteome-wide missense variant effect prediction with AlphaMissense.

Main citation

Cheng J, Novati G, Pan J, Bycroft C, ...&, Avsec Ž. (2023) Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science, 381 (6664) eadg7492. doi:10.1126/science.adg7492. PMID 37733863

ABSTRACT

The vast majority of missense variants observed in the human genome are of unknown clinical significance. We present AlphaMissense, an adaptation of AlphaFold fine-tuned on human and primate variant population frequency databases to predict missense variant pathogenicity. By combining structural context and evolutionary conservation, our model achieves state-of-the-art results across a wide range of genetic and experimental benchmarks, all without explicitly training on such data. The average pathogenicity score of genes is also predictive for their cell essentiality, capable of identifying short essential genes that existing statistical approaches are underpowered to detect. As a resource to the community, we provide a database of predictions for all possible human single amino acid substitutions and classify 89% of missense variants as either likely benign or likely pathogenic.

DOI

10.1126/science.adg7492
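
The abstract above describes sorting missense variants into likely benign and likely pathogenic classes, with the remainder left unclassified. A minimal sketch of that three-way binning follows. The cutoffs 0.34 and 0.564 are the ones reported with the AlphaMissense release as far as we know; treat them as assumptions and check them against the data files you actually download. The function names and the example scores are hypothetical.

```python
# Assumed cutoffs; verify against the documentation shipped with the release.
BENIGN_MAX = 0.34        # scores below this are called likely benign
PATHOGENIC_MIN = 0.564   # scores above this are called likely pathogenic


def classify(score):
    """Map a pathogenicity score in [0, 1] to one of three classes."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score < BENIGN_MAX:
        return "likely_benign"
    if score > PATHOGENIC_MIN:
        return "likely_pathogenic"
    return "ambiguous"


def classified_fraction(scores):
    """Fraction of variants that land in either confident class."""
    scores = list(scores)
    if not scores:
        raise ValueError("no scores given")
    confident = sum(1 for s in scores if classify(s) != "ambiguous")
    return confident / len(scores)


# Hypothetical scores, for illustration only.
example = [0.10, 0.90, 0.45, 0.20]
print([classify(s) for s in example], classified_fraction(example))
```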

b37

Reference 1000 Genomes WGS

FULL NAME

Broad Institute Homo_sapiens_assembly19 (b37)

DESCRIPTION

GRCh37-compatible reference FASTA used across Broad Institute and 1000 Genomes workflows: chromosomes 1-22, X, Y, MT, plus GL/NC unlocalized and unplaced contigs (as in the distributed assembly19 package). Coordinate system matches the 1KG/b37 ecosystem used by many GWAS imputation and joint-calling pipelines.

URL

https://data.broadinstitute.org/snowman/hg19/

KEYWORDS

GRCh37; 1000 Genomes; Broad; b37; reference FASTA

Main citation

Broad Institute / 1000 Genomes Project. Homo_sapiens_assembly19.fasta (b37). https://data.broadinstitute.org/snowman/hg19/

b38

Reference WGS

FULL NAME

Broad Institute Homo_sapiens_assembly38 (b38)

DESCRIPTION

GRCh38-based reference FASTA distributed with GATK and Broad pipelines (Homo_sapiens_assembly38), including primary chromosomes, alternate contigs, decoy sequences, and HLA contigs. Default reference for many germline short-variant and joint-genotyping workflows on cloud and HPC.

URL

https://storage.googleapis.com/genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta

KEYWORDS

GRCh38; GATK; Broad; b38; reference FASTA

Main citation

Broad Institute. Homo_sapiens_assembly38.fasta (GATK GRCh38 reference bundle). https://storage.googleapis.com/genomics-public-data/references/hg38/v0/
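
A practical difference between the b37 and b38 entries above is contig naming: b37 uses bare names (`1`, `X`, `MT`), while the Broad GRCh38 FASTA uses `chr`-prefixed names (`chr1`, `chrX`, `chrM`). The sketch below maps only the primary b37 names to the prefixed style; the helper name is hypothetical. Renaming does not convert coordinates: positions on GRCh37 and GRCh38 differ, so moving data between builds still needs a liftover or realignment step.

```python
# Primary b37 contigs that have a direct chr-prefixed counterpart.
_PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y"}


def b37_to_chr_prefixed(contig):
    """Return the chr-prefixed name for a primary b37 contig, else None.

    Only names are translated. Unlocalized and unplaced contigs (GL...) are
    named differently between builds and are deliberately left unmapped.
    """
    if contig == "MT":
        return "chrM"
    if contig in _PRIMARY:
        return "chr" + contig
    return None


for name in ["1", "22", "X", "MT", "GL000207.1"]:
    print(name, "->", b37_to_chr_prefixed(name))
```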

BioGRID

Reference

PUBMED_LINK

33070389

DESCRIPTION

BioGRID is a biomedical interaction repository with data compiled through comprehensive curation efforts. Our current index is version 4.4.242 and searches 86,339 publications for 2,834,410 protein and genetic interactions, 31,144 chemical interactions and 1,128,339 post translational modifications from major model organism species. All data are freely provided via our search index and available for download in many standardized formats.

URL

https://thebiogrid.org/

KEYWORDS

Interaction

TITLE

The BioGRID database: A comprehensive biomedical resource of curated protein, genetic, and chemical interactions.

Main citation

Oughtred R, Rust J, Chang C, Breitkreutz BJ, ...&, Tyers M. (2021) The BioGRID database: A comprehensive biomedical resource of curated protein, genetic, and chemical interactions. Protein Sci, 30 (1) 187-200. doi:10.1002/pro.3978. PMID 33070389

ABSTRACT

The BioGRID (Biological General Repository for Interaction Datasets, thebiogrid.org) is an open-access database resource that houses manually curated protein and genetic interactions from multiple species including yeast, worm, fly, mouse, and human. The ~1.93 million curated interactions in BioGRID can be used to build complex networks to facilitate biomedical discoveries, particularly as related to human health and disease. All BioGRID content is curated from primary experimental evidence in the biomedical literature, and includes both focused low-throughput studies and large high-throughput datasets. BioGRID also captures protein post-translational modifications and protein or gene interactions with bioactive small molecules including many known drugs. A built-in network visualization tool combines all annotations and allows users to generate network graphs of protein, genetic and chemical interactions. In addition to general curation across species, BioGRID undertakes themed curation projects in specific aspects of cellular regulation, for example the ubiquitin-proteasome system, as well as specific disease areas, such as for the SARS-CoV-2 virus that causes COVID-19 severe acute respiratory syndrome. A recent extension of BioGRID, named the Open Repository of CRISPR Screens (ORCS, orcs.thebiogrid.org), captures single mutant phenotypes and genetic interactions from published high throughput genome-wide CRISPR/Cas9-based genetic screens. BioGRID-ORCS contains datasets for over 1,042 CRISPR screens carried out to date in human, mouse and fly cell lines. The biomedical research community can freely access all BioGRID data through the web interface, standardized file downloads, or via model organism databases and partner meta-databases.

DOI

10.1002/pro.3978

CADD

Reference

PUBMED_LINK

24487276

FULL NAME

Combined Annotation–Dependent Depletion

DESCRIPTION

Integrates many diverse genomic annotations into a single per-variant deleteriousness score (C score).

URL

https://cadd.gs.washington.edu/

KEYWORDS

genome-wide, deleteriousness, annotation

USE

prioritization, filtering

TITLE

A general framework for estimating the relative pathogenicity of human genetic variants.

Main citation

Kircher M, Witten DM, Jain P, O'Roak BJ, ...&, Shendure J. (2014) A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet, 46 (3) 310-5. doi:10.1038/ng.2892. PMID 24487276

ABSTRACT

Current methods for annotating and interpreting human genetic variation tend to exploit a single information type (for example, conservation) and/or are restricted in scope (for example, to missense changes). Here we describe Combined Annotation-Dependent Depletion (CADD), a method for objectively integrating many diverse annotations into a single measure (C score) for each variant. We implement CADD as a support vector machine trained to differentiate 14.7 million high-frequency human-derived alleles from 14.7 million simulated variants. We precompute C scores for all 8.6 billion possible human single-nucleotide variants and enable scoring of short insertions-deletions. C scores correlate with allelic diversity, annotations of functionality, pathogenicity, disease severity, experimentally measured regulatory effects and complex trait associations, and they highly rank known pathogenic variants within individual genomes. The ability of CADD to prioritize functional, deleterious and pathogenic variants across many functional categories, effect sizes and genetic architectures is unmatched by any current single-annotation method.

DOI

10.1038/ng.2892
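
Alongside the raw C score described above, CADD releases report a PHRED-like scaled score based on a variant's rank among all possible single-nucleotide variants: the top 10% score 10 or more, the top 1% score 20 or more, and so on. The sketch below shows only that rank-to-scale arithmetic; the function name and the tiny example totals are assumptions, and real scaled scores should be taken from the CADD downloads rather than recomputed this way.

```python
import math


def phred_scaled(rank, total):
    """PHRED-like scaling of a rank, where rank 1 is the most deleterious.

    rank:  1-based position of the variant when all variants are sorted from
           most to least deleterious by raw score.
    total: number of variants ranked.
    """
    if not 1 <= rank <= total:
        raise ValueError("rank must satisfy 1 <= rank <= total")
    return -10.0 * math.log10(rank / total)


# Top 10% -> about 10, top 1% -> about 20, last-ranked variant -> 0.
for rank, total in [(1, 10), (1, 100), (100, 100)]:
    print(rank, total, round(phred_scaled(rank, total), 3))
```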

CADD v1.4

Reference

PUBMED_LINK

30371827

URL

https://cadd.gs.washington.edu/

TITLE

CADD: predicting the deleteriousness of variants throughout the human genome.

Main citation

Rentzsch P, Witten D, Cooper GM, Shendure J, ...&, Kircher M. (2019) CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res, 47 (D1) D886-D894. doi:10.1093/nar/gky1016. PMID 30371827

ABSTRACT

Combined Annotation-Dependent Depletion (CADD) is a widely used measure of variant deleteriousness that can effectively prioritize causal variants in genetic analyses, particularly highly penetrant contributors to severe Mendelian disorders. CADD is an integrative annotation built from more than 60 genomic features, and can score human single nucleotide variants and short insertion and deletions anywhere in the reference assembly. CADD uses a machine learning model trained on a binary distinction between simulated de novo variants and variants that have arisen and become fixed in human populations since the split between humans and chimpanzees; the former are free of selective pressure and may thus include both neutral and deleterious alleles, while the latter are overwhelmingly neutral (or, at most, weakly deleterious) by virtue of having survived millions of years of purifying selection. Here we review the latest updates to CADD, including the most recent version, 1.4, which supports the human genome build GRCh38. We also present updates to our website that include simplified variant lookup, extended documentation, an Application Program Interface and improved mechanisms for integrating CADD scores into other tools or applications. CADD scores, software and documentation are available at https://cadd.gs.washington.edu.

DOI

10.1093/nar/gky1016

CADD v1.6 (CADD-Splice)

Reference

PUBMED_LINK

33618777

URL

https://cadd.gs.washington.edu/

TITLE

CADD-Splice-improving genome-wide variant effect prediction using deep learning-derived splice scores.

Main citation

Rentzsch P, Schubach M, Shendure J, Kircher M. (2021) CADD-Splice-improving genome-wide variant effect prediction using deep learning-derived splice scores. Genome Med, 13 (1) 31. doi:10.1186/s13073-021-00835-9. PMID 33618777

ABSTRACT

BACKGROUND: Splicing of genomic exons into mRNAs is a critical prerequisite for the accurate synthesis of human proteins. Genetic variants impacting splicing underlie a substantial proportion of genetic disease, but are challenging to identify beyond those occurring at donor and acceptor dinucleotides. To address this, various methods aim to predict variant effects on splicing. Recently, deep neural networks (DNNs) have been shown to achieve better results in predicting splice variants than other strategies. METHODS: It has been unclear how best to integrate such process-specific scores into genome-wide variant effect predictors. Here, we use a recently published experimental data set to compare several machine learning methods that score variant effects on splicing. We integrate the best of those approaches into general variant effect prediction models and observe the effect on classification of known pathogenic variants. RESULTS: We integrate two specialized splicing scores into CADD (Combined Annotation Dependent Depletion; cadd.gs.washington.edu), a widely used tool for genome-wide variant effect prediction that we previously developed to weight and integrate diverse collections of genomic annotations. With this new model, CADD-Splice, we show that inclusion of splicing DNN effect scores substantially improves predictions across multiple variant categories, without compromising overall performance. CONCLUSIONS: While splice effect scores show superior performance on splice variants, specialized predictors cannot compete with other variant scores in general variant interpretation, as the latter account for nonsense and missense effects that do not alter splicing. Although only shown here for splice scores, we believe that the applied approach will generalize to other specific molecular processes, providing a path for the further improvement of genome-wide variant effect prediction.

DOI

10.1186/s13073-021-00835-9

CADD v1.7

Reference

PUBMED_LINK

38183205

URL

https://cadd.gs.washington.edu/

TITLE

CADD v1.7: using protein language models, regulatory CNNs and other nucleotide-level scores to improve genome-wide variant predictions.

Main citation

Schubach M, Maass T, Nazaretyan L, Röner S, ...&, Kircher M. (2024) CADD v1.7: using protein language models, regulatory CNNs and other nucleotide-level scores to improve genome-wide variant predictions. Nucleic Acids Res, 52 (D1) D1143-D1154. doi:10.1093/nar/gkad989. PMID 38183205

ABSTRACT

Machine Learning-based scoring and classification of genetic variants aids the assessment of clinical findings and is employed to prioritize variants in diverse genetic studies and analyses. Combined Annotation-Dependent Depletion (CADD) is one of the first methods for the genome-wide prioritization of variants across different molecular functions and has been continuously developed and improved since its original publication. Here, we present our most recent release, CADD v1.7. We explored and integrated new annotation features, among them state-of-the-art protein language model scores (Meta ESM-1v), regulatory variant effect predictions (from sequence-based convolutional neural networks) and sequence conservation scores (Zoonomia). We evaluated the new version on data sets derived from ClinVar, ExAC/gnomAD and 1000 Genomes variants. For coding effects, we tested CADD on 31 Deep Mutational Scanning (DMS) data sets from ProteinGym and, for regulatory effect prediction, we used saturation mutagenesis reporter assay data of promoter and enhancer sequences. The inclusion of new features further improved the overall performance of CADD. As with previous releases, all data sets, genome-wide CADD v1.7 scores, scripts for on-site scoring and an easy-to-use webserver are readily provided via https://cadd.bihealth.org/ or https://cadd.gs.washington.edu/ to the community.

DOI

10.1093/nar/gkad989

Chinese Millionome Database (CMDB)

Reference

URL

http://cmdb.bgi.com/

CHM13

Reference Structural variants WGS

PUBMED_LINK

35357919

FULL NAME

T2T-CHM13 v1.1 complete hydatidiform mole assembly

DESCRIPTION

Telomere-to-telomere (T2T) assembly of the CHM13 hydatidiform mole cell line, providing the first gapless assemblies of all centromeric satellite arrays and of the short arms of the acrocentric chromosomes. CHM13 carries no Y chromosome; the later T2T-CHM13v2.0 release adds a Y chromosome from a different sample (HG002). Use as a complement to GRCh38 for studying repetitive and structurally variable loci. Chromosome naming and coordinates differ from the GRC primary assemblies, so use liftover and T2T-specific tooling where appropriate.

URL

https://github.com/marbl/CHM13

KEYWORDS

T2T; telomere-to-telomere; complete genome; CHM13; GRCh38 alternative

TITLE

The complete sequence of a human genome.

Main citation

Nurk S, Koren S, Rhie A, Rautiainen M, ...&, Phillippy AM. (2022) The complete sequence of a human genome. Science, 376 (6588) 44-53. doi:10.1126/science.abj6987. PMID 35357919

ABSTRACT

Since its initial release in 2000, the human reference genome has covered only the euchromatic fraction of the genome, leaving important heterochromatic regions unfinished. Addressing the remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium presents a complete 3.055 billion-base pair sequence of a human genome, T2T-CHM13, that includes gapless assemblies for all chromosomes except Y, corrects errors in the prior references, and introduces nearly 200 million base pairs of sequence containing 1956 gene predictions, 99 of which are predicted to be protein coding. The completed regions include all centromeric satellite arrays, recent segmental duplications, and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies.

DOI

10.1126/science.abj6987

ClinVar

Reference

PUBMED_LINK

39578691

DESCRIPTION

Archive of clinically relevant variants with interpretations.

URL

https://www.ncbi.nlm.nih.gov/clinvar/

KEYWORDS

pathogenicity, variant, clinical

USE

clinical annotation

TITLE

ClinVar: updates to support classifications of both germline and somatic variants.

Main citation

Landrum MJ, Chitipiralla S, Kaur K, Brown G, ...&, Kattman BL. (2025) ClinVar: updates to support classifications of both germline and somatic variants. Nucleic Acids Res, 53 (D1) D1313-D1321. doi:10.1093/nar/gkae1090. PMID 39578691

ABSTRACT

ClinVar (www.ncbi.nlm.nih.gov/clinvar/) is a free, public database of human genetic variants and their relationships to disease, with >3 million variants submitted by >2800 organizations across the world. The database was recently updated to have three types of classifications: germline, oncogenicity and clinical impact for somatic variants. As for germline variants, classifications for somatic variants can be submitted in batches in a file submission or through the submission API; variants can also be submitted and updated one at a time in online submission forms. The ClinVar XML files were redesigned to allow multiple classification types. Both old and new formats of the XML are supported through the end of 2024. Data for somatic classifications were also added to the ClinVar VCF files and to several tab-delimited files. The ClinVar VCV pages were updated to display the three types of classifications, both as it was submitted and as it was aggregated by ClinVar. Clinical testing laboratories and others in the cancer community are invited to share their classifications of somatic variant classifications through ClinVar to provide transparency in genomic testing and improve patient care.

DOI

10.1093/nar/gkae1090

CPC

Pangenome Reference WGS

PUBMED_LINK

37316654

FULL NAME

Chinese Pangenome Consortium (phase I core)

DESCRIPTION

Phase I data from the Chinese Pangenome Consortium: 116 high-quality haplotype-phased de novo assemblies from 58 core samples across 36 minority Chinese ethnic groups (high-fidelity long-read coverage). Adds substantial novel sequence and variant discovery relative to GRCh38 and supports population-specific reference panels for Asian-ancestry genomics.

URL

https://ngdc.cncb.ac.cn/bioproject/browse/PRJCA011422

KEYWORDS

pangenome; Chinese populations; long-read; haplotype; GRCh38

TITLE

A pangenome reference of 36 Chinese populations.

Main citation

Gao Y, Yang X, Chen H, Tan X, ...&, Xu S. (2023) A pangenome reference of 36 Chinese populations. Nature, 619 (7968) 112-121. doi:10.1038/s41586-023-06173-7. PMID 37316654

ABSTRACT

Human genomics is witnessing an ongoing paradigm shift from a single reference sequence to a pangenome form, but populations of Asian ancestry are underrepresented. Here we present data from the first phase of the Chinese Pangenome Consortium, including a collection of 116 high-quality and haplotype-phased de novo assemblies based on 58 core samples representing 36 minority Chinese ethnic groups. With an average 30.65× high-fidelity long-read sequence coverage, an average contiguity N50 of more than 35.63 megabases and an average total size of 3.01 gigabases, the CPC core assemblies add 189 million base pairs of euchromatic polymorphic sequences and 1,367 protein-coding gene duplications to GRCh38. We identified 15.9 million small variants and 78,072 structural variants, of which 5.9 million small variants and 34,223 structural variants were not reported in a recently released pangenome reference. The Chinese Pangenome Consortium data demonstrate a remarkable increase in the discovery of novel and missing sequences when individuals are included from underrepresented minority ethnic groups. The missing reference sequences were enriched with archaic-derived alleles and genes that confer essential functions related to keratinization, response to ultraviolet radiation, DNA repair, immunological responses and lifespan, implying great potential for shedding new light on human evolution and recovering missing heritability in complex disease mapping.

DOI

10.1038/s41586-023-06173-7

DAVID

Reference

PUBMED_LINK

17784955

FULL NAME

Database for Annotation, Visualization and Integrated Discovery

DESCRIPTION

Functional annotation and enrichment analysis platform.

URL

https://david.ncifcrf.gov/

KEYWORDS

functional enrichment, GO, pathway

USE

enrichment analysis

TITLE

The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists.

Main citation

Huang DW, Sherman BT, Tan Q, Collins JR, ...&, Lempicki RA. (2007) The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol, 8 (9) R183. doi:10.1186/gb-2007-8-9-r183. PMID 17784955

ABSTRACT

The DAVID Gene Functional Classification Tool http://david.abcc.ncifcrf.gov uses a novel agglomeration algorithm to condense a list of genes or associated biological terms into organized classes of related genes or biology, called biological modules. This organization is accomplished by mining the complex biological co-occurrences found in multiple sources of functional annotation. It is a powerful method to group functionally related genes and terms into a manageable number of biological modules for efficient interpretation of gene lists in a network context.

DOI

10.1186/gb-2007-8-9-r183

dbNSFP v4 (dbNSFP)

Reference

PUBMED_LINK

33261662

FULL NAME

Database for Nonsynonymous SNPs’ Functional Predictions

DESCRIPTION

Database aggregating functional predictions and annotations for nonsynonymous variants.

URL

https://sites.google.com/site/jpopgen/dbNSFP

KEYWORDS

annotation, variant, missense

USE

meta-annotation

TITLE

dbNSFP v4: a comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs.

Main citation

Liu X, Li C, Mou C, Dong Y, ...&, Tu Y. (2020) dbNSFP v4: a comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs. Genome Med, 12 (1) 103. doi:10.1186/s13073-020-00803-9. PMID 33261662

ABSTRACT

Whole exome sequencing has been increasingly used in human disease studies. Prioritization based on appropriate functional annotations has been used as an indispensable step to select candidate variants. Here we present the latest updates to dbNSFP (version 4.1), a database designed to facilitate this step by providing deleteriousness prediction and functional annotation for all potential nonsynonymous and splice-site SNVs (a total of 84,013,093) in the human genome. The current version compiled 36 deleteriousness prediction scores, including 12 transcript-specific scores, and other variant and gene-level functional annotations. The database is available at http://database.liulab.science/dbNSFP with a downloadable version and a web-service.

DOI

10.1186/s13073-020-00803-9

dbSNP

Reference

URL

https://www.ncbi.nlm.nih.gov/snp/

ENCODE 2004

Reference

PUBMED_LINK

15499007

TITLE

The ENCODE (ENCyclopedia Of DNA Elements) Project.

Main citation

ENCODE Project Consortium. (2004) The ENCODE (ENCyclopedia Of DNA Elements) Project. Science, 306 (5696) 636-40. doi:10.1126/science.1105136. PMID 15499007

ABSTRACT

The ENCyclopedia Of DNA Elements (ENCODE) Project aims to identify all functional elements in the human genome sequence. The pilot phase of the Project is focused on a specified 30 megabases (approximately 1%) of the human genome sequence and is organized as an international consortium of computational and laboratory-based scientists working to develop and apply high-throughput approaches for detecting all sequence elements that confer biological function. The results of this pilot phase will guide future efforts to analyze the entire human genome.

DOI

10.1126/science.1105136

ENCODE 2007

Reference

PUBMED_LINK

17571346

TITLE

Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project.

Main citation

ENCODE Project Consortium, Birney E, Stamatoyannopoulos JA, Dutta A, ... & de Jong PJ. (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447 (7146) 799-816. doi:10.1038/nature05874. PMID 17571346

ABSTRACT

We report the generation and analysis of functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project. These data have been further integrated and augmented by a number of evolutionary and computational analyses. Together, our results advance the collective knowledge about human genome function in several major areas. First, our studies provide convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts, including non-protein-coding transcripts, and those that extensively overlap one another. Second, systematic examination of transcriptional regulation has yielded new understanding about transcription start sites, including their relationship to specific regulatory sequences and features of chromatin accessibility and histone modification. Third, a more sophisticated view of chromatin structure has emerged, including its inter-relationship with DNA replication and transcriptional regulation. Finally, integration of these new sources of information, in particular with respect to mammalian evolution based on inter- and intra-species sequence comparisons, has yielded new mechanistic and evolutionary insights concerning the functional landscape of the human genome. Together, these studies are defining a path for pursuit of a more comprehensive characterization of human genome function.

DOI

10.1038/nature05874

ENCODE 2012

Reference

PUBMED_LINK

22955616

DESCRIPTION

ENCODE Phase II

TITLE

An integrated encyclopedia of DNA elements in the human genome.

Main citation

ENCODE Project Consortium. (2012) An integrated encyclopedia of DNA elements in the human genome. Nature, 489 (7414) 57-74. doi:10.1038/nature11247. PMID 22955616

ABSTRACT

The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.

DOI

10.1038/nature11247

ENCODE 2020

Reference

PUBMED_LINK

32728249

DESCRIPTION

ENCODE Phase III

TITLE

Expanded encyclopaedias of DNA elements in the human and mouse genomes.

Main citation

ENCODE Project Consortium, Moore JE, Purcaro MJ, Pratt HE, ... & Weng Z. (2020) Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature, 583 (7818) 699-710. doi:10.1038/s41586-020-2493-4. PMID 32728249

ABSTRACT

The human and mouse genomes contain instructions that specify RNAs and proteins and govern the timing, magnitude, and cellular context of their production. To better delineate these elements, phase III of the Encyclopedia of DNA Elements (ENCODE) Project has expanded analysis of the cell and tissue repertoires of RNA transcription, chromatin structure and modification, DNA methylation, chromatin looping, and occupancy by transcription factors and RNA-binding proteins. Here we summarize these efforts, which have produced 5,992 new experimental datasets, including systematic determinations across mouse fetal development. All data are available through the ENCODE data portal (https://www.encodeproject.org), including phase II ENCODE and Roadmap Epigenomics data. We have developed a registry of 926,535 human and 339,815 mouse candidate cis-regulatory elements, covering 7.9 and 3.4% of their respective genomes, by integrating selected datatypes associated with gene regulation, and constructed a web-based server (SCREEN; http://screen.encodeproject.org) to provide flexible, user-defined access to this resource. Collectively, the ENCODE data and registry provide an expansive resource for the scientific community to build a better understanding of the organization and function of the human and mouse genomes.

DOI

10.1038/s41586-020-2493-4

ENCODE 2026

Reference

PUBMED_LINK

41501460

DESCRIPTION

ENCODE Phase IV (ENCODE4)

TITLE

An expanded registry of candidate cis-regulatory elements.

Main citation

Moore JE, Pratt HE, Fan K, Phalke N, ... & Weng Z. (2026) An expanded registry of candidate cis-regulatory elements. Nature. doi:10.1038/s41586-025-09909-9. PMID 41501460

ABSTRACT

Mammalian genomes contain millions of regulatory elements that control the complex patterns of gene expression. Previously, the ENCODE consortium mapped biochemical signals across hundreds of cell types and tissues and integrated these data to develop a registry containing 0.9 million human and 300,000 mouse candidate cis-regulatory elements (cCREs) annotated with potential functions. Here we have expanded the registry to include 2.37 million human and 967,000 mouse cCREs, leveraging new ENCODE datasets and enhanced computational methods. This expanded registry covers hundreds of unique cell and tissue types, providing a comprehensive understanding of gene regulation. Functional characterization data from assays such as STARR-seq, massively parallel reporter assay, CRISPR perturbation and transgenic mouse assays have profiled more than 90% of human cCREs, revealing complex regulatory functions. We identified thousands of novel silencer cCREs and demonstrated their dual enhancer and silencer roles in different cellular contexts. Integrating the registry with other ENCODE annotations facilitates genetic variation interpretation and trait-associated gene identification, exemplified by the identification of KLF1 as a novel causal gene for red blood cell traits. This expanded registry is a valuable resource for studying the regulatory genome and its impact on health and disease.

DOI

10.1038/s41586-025-09909-9

ENCODE portal

Reference

PUBMED_LINK

41168159

TITLE

Data navigation on the ENCODE portal.

Main citation

Kagda MS, Lam B, Litton C, Small C, ... & Hitz BC. (2025) Data navigation on the ENCODE portal. Nat Commun, 16 (1) 9592. doi:10.1038/s41467-025-64343-9. PMID 41168159

ABSTRACT

Spanning two decades, the collaborative ENCODE project aims to identify all the functional elements within human and mouse genomes. To best serve the scientific community, the comprehensive ENCODE data including results from 23,000+ functional genomics experiments, 800+ functional elements characterization experiments and 60,000+ results from integrative computational analyses are available on an open-access data-portal ( https://www.encodeproject.org/ ). The final phase of the project includes data from several novel assays aimed at characterization and validation of genomic elements. In addition to developing and maintaining the data portal, the Data Coordination Center (DCC) implemented and utilised uniform processing pipelines to generate uniformly processed data. Here we report recent updates to the data portal including a redesigned home page, an improved search interface, new custom-designed pages highlighting biologically related datasets and an enhanced cart interface for data visualisation plus user-friendly data download options. A summary of data generated using uniform processing pipelines is also provided.

DOI

10.1038/s41467-025-64343-9

Ensembl

Reference

URL

https://asia.ensembl.org/index.html

ESM-2

Reference

PUBMED_LINK

36927031

FULL NAME

Evolutionary Scale Modeling v2

DESCRIPTION

Large-scale protein language model enabling structure/function prediction.

URL

https://esmatlas.com/

KEYWORDS

transformer, LLM, structure, sequence

USE

embeddings, structure prediction

TITLE

Evolutionary-scale prediction of atomic-level protein structure with a language model.

Main citation

Lin Z, Akin H, Rao R, Hie B, ... & Rives A. (2023) Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379 (6637) 1123-1130. doi:10.1126/science.ade2574. PMID 36927031

ABSTRACT

Recent advances in machine learning have leveraged evolutionary information in multiple sequence alignments to predict protein structure. We demonstrate direct inference of full atomic-level protein structure from primary sequence using a large language model. As language models of protein sequences are scaled up to 15 billion parameters, an atomic-resolution picture of protein structure emerges in the learned representations. This results in an order-of-magnitude acceleration of high-resolution structure prediction, which enables large-scale structural characterization of metagenomic proteins. We apply this capability to construct the ESM Metagenomic Atlas by predicting structures for >617 million metagenomic protein sequences, including >225 million that are predicted with high confidence, which gives a view into the vast breadth and diversity of natural proteins.

DOI

10.1126/science.ade2574

EVE

Reference

PUBMED_LINK

34707284

FULL NAME

Evolutionary model of Variant Effect

DESCRIPTION

VAE-based unsupervised model to predict variant impact using MSAs.

URL

https://evemodel.org/

KEYWORDS

evolutionary, MSA, variant effect

USE

missense scoring

TITLE

Disease variant prediction with deep generative models of evolutionary data.

Main citation

Frazer J, Notin P, Dias M, Gomez A, ... & Marks DS. (2021) Disease variant prediction with deep generative models of evolutionary data. Nature, 599 (7883) 91-95. doi:10.1038/s41586-021-04043-8. PMID 34707284

ABSTRACT

Quantifying the pathogenicity of protein variants in human disease-related genes would have a marked effect on clinical decisions, yet the overwhelming majority (over 98%) of these variants still have unknown consequences. In principle, computational methods could support the large-scale interpretation of genetic variants. However, state-of-the-art methods have relied on training machine learning models on known disease labels. As these labels are sparse, biased and of variable quality, the resulting models have been considered insufficiently reliable. Here we propose an approach that leverages deep generative models to predict variant pathogenicity without relying on labels. By modelling the distribution of sequence variation across organisms, we implicitly capture constraints on the protein sequences that maintain fitness. Our model EVE (evolutionary model of variant effect) not only outperforms computational approaches that rely on labelled data but also performs on par with, if not better than, predictions from high-throughput experiments, which are increasingly used as evidence for variant classification. We predict the pathogenicity of more than 36 million variants across 3,219 disease genes and provide evidence for the classification of more than 256,000 variants of unknown significance. Our work suggests that models of evolutionary information can provide valuable independent evidence for variant interpretation that will be widely useful in research and clinical settings.

DOI

10.1038/s41586-021-04043-8

Gene Ontology (GO)

Reference

PUBMED_LINK

10802651

DESCRIPTION

Controlled vocabulary for gene function classification.

URL

http://geneontology.org/

KEYWORDS

GO terms, pathways

USE

enrichment analysis

TITLE

Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Main citation

Ashburner M, Ball CA, Blake JA, Botstein D, ... & Sherlock G. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25 (1) 25-9. doi:10.1038/75556. PMID 10802651

ABSTRACT

Genomic sequencing has made it clear that a large fraction of the genes specifying the core biological functions are shared by all eukaryotes. Knowledge of the biological role of such shared proteins in one organism can often be transferred to other organisms. The goal of the Gene Ontology Consortium is to produce a dynamic, controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing. To this end, three independent ontologies accessible on the World-Wide Web (http://www.geneontology.org) are being constructed: biological process, molecular function and cellular component.

DOI

10.1038/75556

GeneCards

Reference

PUBMED_LINK

27322403

DESCRIPTION

GeneCards is a searchable, integrative database that provides comprehensive, user-friendly information on all annotated and predicted human genes. The knowledgebase automatically integrates gene-centric data from ~200 web sources, including genomic, transcriptomic, proteomic, genetic, clinical and functional information.

URL

https://www.genecards.org/

TITLE

The GeneCards Suite: From Gene Data Mining to Disease Genome Sequence Analyses.

Main citation

Stelzer G, Rosen N, Plaschkes I, Zimmerman S, ... & Lancet D. (2016) The GeneCards Suite: From Gene Data Mining to Disease Genome Sequence Analyses. Curr Protoc Bioinformatics, 54 1.30.1-1.30.33. doi:10.1002/cpbi.5. PMID 27322403

ABSTRACT

GeneCards, the human gene compendium, enables researchers to effectively navigate and inter-relate the wide universe of human genes, diseases, variants, proteins, cells, and biological pathways. Our recently launched Version 4 has a revamped infrastructure facilitating faster data updates, better-targeted data queries, and friendlier user experience. It also provides a stronger foundation for the GeneCards suite of companion databases and analysis tools. Improved data unification includes gene-disease links via MalaCards and merged biological pathways via PathCards, as well as drug information and proteome expression. VarElect, another suite member, is a phenotype prioritizer for next-generation sequencing, leveraging the GeneCards and MalaCards knowledgebase. It automatically infers direct and indirect scored associations between hundreds or even thousands of variant-containing genes and disease phenotype terms. VarElect's capabilities, either independently or within TGex, our comprehensive variant analysis pipeline, help prepare for the challenge of clinical projects that involve thousands of exome/genome NGS analyses. © 2016 by John Wiley & Sons, Inc.

DOI

10.1002/cpbi.5

gnomAD

Reference

URL

https://gnomad.broadinstitute.org/

GRCh37.p13

Reference WGS

FULL NAME

Genome Reference Consortium Human Build 37 patch release 13

DESCRIPTION

NCBI/GRC human assembly build 37, patch 13 (GCF_000001405.25): the authoritative GRCh37 patch-level reference used for stable accessioning and alignment. Distinct from UCSC hg19/Broad b37 contig naming; always verify chromosome naming and inclusion of ALT/patch contigs when mixing resources.
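A minimal sketch of the naming check recommended above, assuming contig names have already been read from a FASTA index, VCF header, or BAM header. The helper names (`naming_style`, `to_b37`) are hypothetical, and the mapping covers only the primary chromosomes; ALT, patch, and unplaced contigs are passed through unchanged and need a resource-specific lookup. A name match does not prove the underlying sequences are identical, so comparing sequence lengths or checksums is still advisable.

```python
# Hypothetical helpers for illustration; not part of any published tool.
# UCSC style uses chr1..chr22, chrX, chrY, chrM; b37/GRC style uses 1..22, X, Y, MT.
UCSC_TO_B37 = {f"chr{i}": str(i) for i in range(1, 23)}
UCSC_TO_B37.update({"chrX": "X", "chrY": "Y", "chrM": "MT"})
B37_TO_UCSC = {b37: ucsc for ucsc, b37 in UCSC_TO_B37.items()}


def naming_style(contigs):
    """Classify contig names as 'ucsc', 'b37', 'mixed', or 'unknown'.

    Only primary-chromosome names are inspected; other contigs are ignored.
    """
    has_ucsc = any(name in UCSC_TO_B37 for name in contigs)
    has_b37 = any(name in B37_TO_UCSC for name in contigs)
    if has_ucsc and has_b37:
        return "mixed"
    if has_ucsc:
        return "ucsc"
    if has_b37:
        return "b37"
    return "unknown"


def to_b37(name):
    """Translate a primary-chromosome name to b37 style; leave others as-is."""
    return UCSC_TO_B37.get(name, name)


if __name__ == "__main__":
    vcf_contigs = ["1", "2", "X", "MT"]
    bam_contigs = ["chr1", "chr2", "chrX", "chrM", "chr6_apd_hap1"]
    print(naming_style(vcf_contigs), naming_style(bam_contigs))
    print([to_b37(name) for name in bam_contigs])
```

Running the check before merging or intersecting files makes a naming mismatch fail early instead of silently producing empty overlaps.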

URL

https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.25/

KEYWORDS

GRCh37; GRC; NCBI; reference assembly; patch 13

Main citation

Genome Reference Consortium. Human genome assembly GRCh37.p13 (GCF_000001405.25). National Center for Biotechnology Information.

GRCh38.p14

Reference WGS

FULL NAME

Genome Reference Consortium Human Build 38 patch release 14

DESCRIPTION

NCBI/GRC human assembly build 38, patch 14 (GCF_000001405.40): current GRC primary human reference on the GRCh38 line, including cumulative sequence fixes and scaffold updates through p14. Use this accession when you need the exact GRC patch level that matches NCBI/RefSeq alignment products.
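The accessions quoted in this catalog, GCF_000001405.25 for GRCh37.p13 and GCF_000001405.40 for GRCh38.p14, share the same base number and differ only in the version suffix, so a pipeline that records the accession should compare the full string, version included. The sketch below assumes the usual `GCA_`/`GCF_` prefix, nine digits, and a numeric version; the function names are hypothetical.

```python
import re

# Assumed accession shape: GCA_ or GCF_, nine digits, a dot, and a version number.
ACCESSION_RE = re.compile(r"^(GC[AF])_(\d{9})\.(\d+)$")


def parse_assembly_accession(accession):
    """Split an assembly accession into (prefix, base number, version)."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError(f"not a versioned assembly accession: {accession!r}")
    prefix, number, version = match.groups()
    return prefix, number, int(version)


def same_base_accession(a, b):
    """True when two accessions differ at most in their version suffix."""
    return parse_assembly_accession(a)[:2] == parse_assembly_accession(b)[:2]


if __name__ == "__main__":
    grch37_p13 = "GCF_000001405.25"
    grch38_p14 = "GCF_000001405.40"
    print(parse_assembly_accession(grch38_p14))
    # Same base accession, different versions: not interchangeable references.
    print(same_base_accession(grch37_p13, grch38_p14), grch37_p13 == grch38_p14)
```

A shared base number says nothing about coordinate compatibility, which is why the exact version is the part worth asserting on.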

URL

https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.40

KEYWORDS

GRCh38; GRC; NCBI; reference assembly; patch 14

Main citation

Genome Reference Consortium. Human genome assembly GRCh38.p14 (GCF_000001405.40). National Center for Biotechnology Information. https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.40

GRCh39 (indefinitely postponed) (GRCh39)

Reference WGS

FULL NAME

Genome Reference Consortium Human Build 39 (not pursued)

DESCRIPTION

The Genome Reference Consortium announced that work toward a distinct GRCh39 assembly line was indefinitely postponed; human reference updates continue on the GRCh38 series (patches) and complementary resources such as T2T-CHM13 and pangenome references. Check the GRC human page for current guidance and patch releases.

URL

https://www.ncbi.nlm.nih.gov/grc/human

KEYWORDS

GRC; GRCh39; reference assembly; postponed

Main citation

Genome Reference Consortium. Human genome reference updates (GRCh39 indefinitely postponed; continued GRCh38 patches). https://www.ncbi.nlm.nih.gov/grc/human

GWAS Catalog

Reference

URL

https://www.ebi.ac.uk/gwas/

hg19

Reference WGS

FULL NAME

UCSC hg19 (GRCh37) reference bundle

DESCRIPTION

UCSC Genome Browser distribution of the GRCh37-era human reference (hg19): chromosomes chr1-22, chrX, chrY, chrM, plus unlocalized and unplaced contigs, alternate loci (e.g. chr6_apd_hap1), and related patches as packaged for the browser. Widely used in legacy pipelines and liftOver chains to/from hg38.

URL

https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/

KEYWORDS

GRCh37; UCSC; reference genome; FASTA; legacy assembly

Main citation

UCSC Genome Browser. Human reference assembly hg19 (GRCh37-aligned). https://hgdownload.soe.ucsc.edu/goldenPath/hg19/

hg38

Reference WGS

FULL NAME

UCSC hg38 (GRCh38) reference bundle

DESCRIPTION

UCSC Genome Browser distribution of the human reference aligned to GRCh38 (primary assembly plus standard patches and decoys as packaged in the browser bigZips downloads). Chromosome names use the chr-prefixed convention (chr1-chr22, chrX, chrY, chrM); coordinates match the corresponding GRC assembly for the same patch level when sequences are identical.

URL

https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/

KEYWORDS

GRCh38; UCSC; reference genome; FASTA; primary assembly

Main citation

UCSC Genome Browser. Human reference assembly hg38 (GRCh38-aligned). https://hgdownload.soe.ucsc.edu/goldenPath/hg38/

HPRC first draft pangenome (HPRC draft)

Pangenome Reference Structural variants WGS

PUBMED_LINK

37165242

FULL NAME

Human Pangenome Reference Consortium first-draft pangenome

DESCRIPTION

First-draft human pangenome from the HPRC: 47 phased diploid assemblies from diverse samples, aligned and summarized relative to GRCh38. Adds substantial euchromatic polymorphic sequence and duplicated gene content versus a single linear reference; intended for pangenome-aware alignment, variant calling, and downstream graph-based genomics (see HPRC data portal and companion software).

URL

https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html

KEYWORDS

HPRC; pangenome; graph genome; haplotypes; GRCh38

TITLE

A draft human pangenome reference.

Main citation

Liao WW, Asri M, Ebler J, Doerr D, ... & Paten B. (2023) A draft human pangenome reference. Nature, 617 (7960) 312-324. doi:10.1038/s41586-023-05896-x. PMID 37165242

ABSTRACT

Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.

DOI

10.1038/s41586-023-05896-x

hs37d5

Reference 1000 Genomes WGS

FULL NAME

1000 Genomes GRCh37 + decoy (hs37d5)

DESCRIPTION

GRCh37 (b37-style) primary chromosomes and contigs plus the hs37d5 decoy sequence set (HuRef/BAC/Fosmid/NA12878-derived sequences) to reduce spurious alignments in short-read mapping. Standard reference for Phase 3-era 1000 Genomes alignment and many imputation and low-pass WGS workflows that target the 1KG coordinate system.

URL

https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/

KEYWORDS

GRCh37; decoy; 1000 Genomes; alignment; hs37d5

Main citation

1000 Genomes Project / Broad Institute. hs37d5 reference (GRCh37 plus decoy sequences). https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/

HUGO Gene Nomenclature Committee (HGNC)

Reference

PUBMED_LINK

32747822

FULL NAME

Human Genome Organisation Gene Nomenclature Committee

URL

https://www.genenames.org/

TITLE

Guidelines for human gene nomenclature.

Main citation

Bruford EA, Braschi B, Denny P, Jones TEM, ... & Tweedie S. (2020) Guidelines for human gene nomenclature. Nat Genet, 52 (8) 754-758. doi:10.1038/s41588-020-0669-3. PMID 32747822

ABSTRACT

Standardized gene naming is crucial for effective communication about genes, and as genomics becomes increasingly important in healthcare, the need for a consistent language for human genes becomes ever more vital. Here we present the current HUGO Gene Nomenclature Committee (HGNC) guidelines for naming not only protein-coding but also RNA genes and pseudogenes, and outline the changes in approach and ethos that have resulted from the discoveries of the last few decades.

DOI

10.1038/s41588-020-0669-3

humanG1Kv37

Reference 1000 Genomes WGS

FULL NAME

1000 Genomes human_g1k_v37 reference

DESCRIPTION

GRCh37-based reference FASTA distributed by the 1000 Genomes Project (human_g1k_v37): chromosomes 1-22, X, Y, MT, plus GL unlocalized/unplaced contigs, without separate haplotype scaffolds or EBV. Commonly used as the Phase 1/Phase 3 alignment reference when harmonizing with public 1KG VCFs and phasing panels.

URL

https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/

KEYWORDS

GRCh37; 1000 Genomes; reference FASTA; human_g1k_v37

Main citation

1000 Genomes Project. human_g1k_v37 reference (GRCh37). https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/

jMorp

Reference

PUBMED_LINK

37930845

FULL NAME

Japanese Multi-Omics Reference Panel

URL

https://jmorp.megabank.tohoku.ac.jp/

TITLE

jMorp: Japanese Multi-Omics Reference Panel update report 2023.

Main citation

Tadaka S, Kawashima J, Hishinuma E, Saito S, ... & Kinoshita K. (2024) jMorp: Japanese Multi-Omics Reference Panel update report 2023. Nucleic Acids Res, 52 (D1) D622-D632. doi:10.1093/nar/gkad978. PMID 37930845

ABSTRACT

Modern medicine is increasingly focused on personalized medicine, and multi-omics data is crucial in understanding biological phenomena and disease mechanisms. Each ethnic group has its unique genetic background with specific genomic variations influencing disease risk and drug response. Therefore, multi-omics data from specific ethnic populations are essential for the effective implementation of personalized medicine. Various prospective cohort studies, such as the UK Biobank, All of Us and Lifelines, have been conducted worldwide. The Tohoku Medical Megabank project was initiated after the Great East Japan Earthquake in 2011. It collects biological specimens and conducts genome and omics analyses to build a basis for personalized medicine. Summary statistical data from these analyses are available in the jMorp web database (https://jmorp.megabank.tohoku.ac.jp), which provides a multidimensional approach to the diversity of the Japanese population. jMorp was launched in 2015 as a public database for plasma metabolome and proteome analyses and has been continuously updated. The current update will significantly expand the scale of the data (metabolome, genome, transcriptome, and metagenome). In addition, the user interface and backend server implementations were rewritten to improve the connectivity between the items stored in jMorp. This paper provides an overview of the new version of the jMorp.

DOI

10.1093/nar/gkad978

M-CAP

Reference

PUBMED_LINK

27776117

FULL NAME

Mendelian Clinically Applicable Pathogenicity

DESCRIPTION

Rare missense pathogenicity classifier for clinical interpretation.

URL

https://bejerano.stanford.edu/mcap/

KEYWORDS

missense, clinical

USE

clinical scoring

TITLE

M-CAP eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity.

Main citation

Jagadeesh KA, Wenger AM, Berger MJ, Guturu H, ... & Bejerano G. (2016) M-CAP eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity. Nat Genet, 48 (12) 1581-1586. doi:10.1038/ng.3703. PMID 27776117

ABSTRACT

Variant pathogenicity classifiers such as SIFT, PolyPhen-2, CADD, and MetaLR assist in interpretation of the hundreds of rare, missense variants in the typical patient genome by deprioritizing some variants as likely benign. These widely used methods misclassify 26 to 38% of known pathogenic mutations, which could lead to missed diagnoses if the classifiers are trusted as definitive in a clinical setting. We developed M-CAP, a clinical pathogenicity classifier that outperforms existing methods at all thresholds and correctly dismisses 60% of rare, missense variants of uncertain significance in a typical genome at 95% sensitivity.
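The phrase "dismisses 60% of variants at 95% sensitivity" can be made concrete with a small sketch: pick the score cutoff that keeps a chosen fraction of known pathogenic variants at or above it, then count how many uncertain variants fall below. This illustrates the general idea only, not M-CAP's actual procedure; the function names and the toy scores are hypothetical, and higher scores are assumed to mean more likely pathogenic.

```python
import math


def threshold_at_sensitivity(pathogenic_scores, sensitivity=0.95):
    """Largest cutoff that keeps at least `sensitivity` of pathogenic scores >= cutoff."""
    ranked = sorted(pathogenic_scores, reverse=True)
    # Number of known pathogenic variants that must remain at or above the cutoff.
    keep = math.ceil(sensitivity * len(ranked))
    return ranked[keep - 1]


def fraction_dismissed(candidate_scores, cutoff):
    """Fraction of candidate variants scoring below the cutoff (treated as likely benign)."""
    below = sum(score < cutoff for score in candidate_scores)
    return below / len(candidate_scores)


if __name__ == "__main__":
    # Toy integer scores, purely illustrative.
    known_pathogenic = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
    uncertain = [1, 2, 7, 8]
    cutoff = threshold_at_sensitivity(known_pathogenic, sensitivity=0.5)
    print(cutoff, fraction_dismissed(uncertain, cutoff))
```

Raising the sensitivity target lowers the cutoff, so fewer uncertain variants are dismissed; the two quantities always have to be reported together.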

DOI

10.1038/ng.3703

MetaLR / MetaSVM (MetaLR)

Reference

PUBMED_LINK

25552646

DESCRIPTION

Ensemble pathogenicity scores integrating multiple annotations.

URL

https://www.ncbi.nlm.nih.gov/clinvar/docs/scoreinfo/

KEYWORDS

ensemble, missense

USE

prioritization

TITLE

Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies.

Main citation

Dong C, Wei P, Jian X, Gibbs R, ... & Liu X. (2015) Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Hum Mol Genet, 24 (8) 2125-37. doi:10.1093/hmg/ddu733. PMID 25552646

ABSTRACT

Accurate deleteriousness prediction for nonsynonymous variants is crucial for distinguishing pathogenic mutations from background polymorphisms in whole exome sequencing (WES) studies. Although many deleteriousness prediction methods have been developed, their prediction results are sometimes inconsistent with each other and their relative merits are still unclear in practical applications. To address these issues, we comprehensively evaluated the predictive performance of 18 current deleteriousness-scoring methods, including 11 function prediction scores (PolyPhen-2, SIFT, MutationTaster, Mutation Assessor, FATHMM, LRT, PANTHER, PhD-SNP, SNAP, SNPs&GO and MutPred), 3 conservation scores (GERP++, SiPhy and PhyloP) and 4 ensemble scores (CADD, PON-P, KGGSeq and CONDEL). We found that FATHMM and KGGSeq had the highest discriminative power among independent scores and ensemble scores, respectively. Moreover, to ensure unbiased performance evaluation of these prediction scores, we manually collected three distinct testing datasets, on which no current prediction scores were tuned. In addition, we developed two new ensemble scores that integrate nine independent scores and allele frequency. Our scores achieved the highest discriminative power compared with all the deleteriousness prediction scores tested and showed low false-positive prediction rate for benign yet rare nonsynonymous variants, which demonstrated the value of combining information from multiple orthologous approaches. Finally, to facilitate variant prioritization in WES studies, we have pre-computed our ensemble scores for 87 347 044 possible variants in the whole-exome and made them publicly available through the ANNOVAR software and the dbNSFP database.

DOI

10.1093/hmg/ddu733

MutationAssessor

Reference

PUBMED_LINK

40832239

DESCRIPTION

Predicts functional impact based on evolutionary conservation.

URL

http://mutationassessor.org/

KEYWORDS

conservation, function

USE

variant effect

TITLE

MutationAssessor in cBioPortal.

Main citation

Su Y, Li X, Reva B, Antipin Y, ... & Sander C. (2025) MutationAssessor in cBioPortal. bioRxiv. doi:10.1101/2025.08.10.669566. PMID 40832239

ABSTRACT

MutationAssessor (MA) helps researchers evaluate the likely functional impact of somatic and germline mutations in cancer. It provides an evolution-based functional impact score (FIS) to classify mutations based on their likely effect on protein function. FIS scores are based on analysis of patterns of conservation in protein families (conserved residues) and subfamilies (specificity residues). In this new version (r4) we have (1) refined the combinatorial entropy analysis of conservation patterns, (2) recalculated full-length protein multiple sequence alignments covering a larger fraction of human proteins and making use of the explosive growth of protein sequence data, (3) compared predicted functional impact with the pathogenic-benign classification of sequence variants in curated knowledge bases, such as ClinVar, (4) observed the inverse relationship between predicted high functional impact and variant frequency in germline genome sequences and (5) explore the evaluation of switch-of-function mutational effects. Functional impact of ~4 million somatic amino-acid changing mutations across more than 320K human tumor samples are now available in the widely used cBioPortal for Cancer Genomics.

DOI

10.1101/2025.08.10.669566

MVP

Reference

PUBMED_LINK

33479230

FULL NAME

Missense Variant Pathogenicity prediction

DESCRIPTION

A prediction method that uses a deep residual network to leverage large training data sets and many correlated predictors.

URL

https://figshare.com/articles/dataset/Predicting_pathogenicity_of_missense_variants_by_deep_learning/13204118

KEYWORDS

deep residual network, pathogenic missense variant

TITLE

MVP predicts the pathogenicity of missense variants by deep learning.

Main citation

Qi H, Zhang H, Zhao Y, Chen C, ...&, Shen Y. (2021) MVP predicts the pathogenicity of missense variants by deep learning. Nat Commun, 12 (1) 510. doi:10.1038/s41467-020-20847-0. PMID 33479230

ABSTRACT

Accurate pathogenicity prediction of missense variants is critically important in genetic studies and clinical diagnosis. Previously published prediction methods have facilitated the interpretation of missense variants but have limited performance. Here, we describe MVP (Missense Variant Pathogenicity prediction), a new prediction method that uses deep residual network to leverage large training data sets and many correlated predictors. We train the model separately in genes that are intolerant of loss of function variants and the ones that are tolerant in order to take account of potentially different genetic effect size and mode of action. We compile cancer mutation hotspots and de novo variants from developmental disorders for benchmarking. Overall, MVP achieves better performance in prioritizing pathogenic missense variants than previous methods, especially in genes tolerant of loss of function variants. Finally, using MVP, we estimate that de novo coding variants contribute to 7.8% of isolated congenital heart disease, nearly doubling previous estimates.

DOI

10.1038/s41467-020-20847-0

NCBI-Gene

Reference

PUBMED_LINK

33095870

DESCRIPTION

Gene integrates information from a wide range of species. A record may include nomenclature, Reference Sequences (RefSeqs), maps, pathways, variations, phenotypes, and links to genome-, phenotype-, and locus-specific resources worldwide.

URL

https://www.ncbi.nlm.nih.gov/gene/

TITLE

Database resources of the National Center for Biotechnology Information.

Main citation

Sayers EW, Beck J, Bolton EE, Bourexis D, ...&, Sherry ST. (2021) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res, 49 (D1) D10-D17. doi:10.1093/nar/gkaa892. PMID 33095870

ABSTRACT

The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. The Entrez system provides search and retrieval operations for most of these data from 34 distinct databases. The E-utilities serve as the programming interface for the Entrez system. Custom implementations of the BLAST program provide sequence-based searching of many specialized datasets. New resources released in the past year include a new PubMed interface and NCBI datasets. Additional resources that were updated in the past year include PMC, Bookshelf, Genome Data Viewer, SRA, ClinVar, dbSNP, dbVar, Pathogen Detection, BLAST, Primer-BLAST, IgBLAST, iCn3D and PubChem. All of these resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.

DOI

10.1093/nar/gkaa892
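
The abstract above notes that the E-utilities serve as the programming interface for the Entrez system. As a minimal sketch of what that looks like for Gene records, the snippet below builds an ESummary request URL. The helper name `gene_summary_url` and the example Gene IDs (7157 for TP53, 672 for BRCA1) are illustrative choices, not part of this catalog entry, and the code only constructs the URL without sending a request.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities; esummary.fcgi returns document summaries.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def gene_summary_url(gene_ids, retmode="json"):
    """Build an ESummary URL for one or more NCBI Gene IDs (hypothetical helper)."""
    if not gene_ids:
        raise ValueError("at least one Gene ID is required")
    params = {
        "db": "gene",                               # Entrez database to query
        "id": ",".join(str(g) for g in gene_ids),   # comma-separated ID list
        "retmode": retmode,                         # response format
    }
    # safe="," keeps the ID list readable instead of percent-encoding commas.
    return f"{EUTILS_BASE}/esummary.fcgi?{urlencode(params, safe=',')}"


url = gene_summary_url([7157, 672])
print(url)
```

The resulting URL can be fetched with any HTTP client; NCBI's usage guidelines on request rates and API keys apply to real queries.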

OneK1K

Reference

PUBMED_LINK

35389779

DESCRIPTION

The OneK1K cohort consists of single-cell RNA sequencing (scRNA-seq) data from 1.27 million peripheral blood mononuclear cells (PBMCs) collected from 982 donors. We developed a framework for the classification of individual cells, and by combining the scRNA-seq data with genotype data, we mapped the genetic effects on gene expression in each of 14 immune cell types and identified 26,597 independent cis–expression quantitative trait loci (eQTLs).

URL

https://onek1k.org/

TITLE

Single-cell eQTL mapping identifies cell type-specific genetic control of autoimmune disease.

Main citation

Yazar S, Alquicira-Hernandez J, Wing K, Senabouth A, ...&, Powell JE. (2022) Single-cell eQTL mapping identifies cell type-specific genetic control of autoimmune disease. Science, 376 (6589) eabf3041. doi:10.1126/science.abf3041. PMID 35389779

ABSTRACT

The human immune system displays substantial variation between individuals, leading to differences in susceptibility to autoimmune disease. We present single-cell RNA sequencing (scRNA-seq) data from 1,267,758 peripheral blood mononuclear cells from 982 healthy human subjects. For 14 cell types, we identified 26,597 independent cis-expression quantitative trait loci (eQTLs) and 990 trans-eQTLs, with most showing cell type-specific effects on gene expression. We subsequently show how eQTLs have dynamic allelic effects in B cells that are transitioning from naïve to memory states and demonstrate how commonly segregating alleles lead to interindividual variation in immune function. Finally, using a Mendelian randomization approach, we identify the causal route by which 305 risk loci contribute to autoimmune disease at the cellular level. This work brings together genetic epidemiology with scRNA-seq to uncover drivers of interindividual variation in the immune system.

DOI

10.1126/science.abf3041

Open Targets Genetics

Reference

URL

https://genetics.opentargets.org/

PGG.Han 2.0 (PGG.Han)

Reference

URL

https://www.biosino.org/pgghan2/index#home1

PolyPhen-2

Reference

PUBMED_LINK

23315928

FULL NAME

Polymorphism Phenotyping v2

DESCRIPTION

Predicts functional impact of amino acid substitutions.

URL

http://genetics.bwh.harvard.edu/pph2/

KEYWORDS

missense, conservation

USE

variant scoring

TITLE

Predicting functional effect of human missense mutations using PolyPhen-2.

Main citation

Adzhubei I, Jordan DM, Sunyaev SR. (2013) Predicting functional effect of human missense mutations using PolyPhen-2. Curr Protoc Hum Genet, Chapter 7, Unit7.20. doi:10.1002/0471142905.hg0720s76. PMID 23315928

ABSTRACT

PolyPhen-2 (Polymorphism Phenotyping v2), available as software and via a Web server, predicts the possible impact of amino acid substitutions on the stability and function of human proteins using structural and comparative evolutionary considerations. It performs functional annotation of single-nucleotide polymorphisms (SNPs), maps coding SNPs to gene transcripts, extracts protein sequence annotations and structural attributes, and builds conservation profiles. It then estimates the probability of the missense mutation being damaging based on a combination of all these properties. PolyPhen-2 features include a high-quality multiple protein sequence alignment pipeline and a prediction method employing machine-learning classification. The software also integrates the UCSC Genome Browser's human genome annotations and MultiZ multiple alignments of vertebrate genomes with the human genome. PolyPhen-2 is capable of analyzing large volumes of data produced by next-generation sequencing projects, thanks to built-in support for high-performance computing environments like Grid Engine and Platform LSF.

DOI

10.1002/0471142905.hg0720s76

PrimateAI-3D

Reference

PUBMED_LINK

37262156

DESCRIPTION

Deep learning model trained on primate variation and 3D protein structure.

URL

https://www.broadinstitute.org

KEYWORDS

deep learning, primate, missense

USE

clinical variant scoring

TITLE

The landscape of tolerated genetic variation in humans and primates.

Main citation

Gao H, Hamp T, Ede J, Schraiber JG, ...&, Farh KK. (2023) The landscape of tolerated genetic variation in humans and primates. Science, 380 (6648) eabn8153. doi:10.1126/science.abn8197. PMID 37262156

ABSTRACT

Personalized genome sequencing has revealed millions of genetic differences between individuals, but our understanding of their clinical relevance remains largely incomplete. To systematically decipher the effects of human genetic variants, we obtained whole-genome sequencing data for 809 individuals from 233 primate species and identified 4.3 million common protein-altering variants with orthologs in humans. We show that these variants can be inferred to have nondeleterious effects in humans based on their presence at high allele frequencies in other primate populations. We use this resource to classify 6% of all possible human protein-altering variants as likely benign and impute the pathogenicity of the remaining 94% of variants with deep learning, achieving state-of-the-art accuracy for diagnosing pathogenic variants in patients with genetic diseases.

DOI

10.1126/science.abn8197

ProGen2

Reference

PUBMED_LINK

37909046

DESCRIPTION

Generative protein design using LLMs trained on protein sequences.

URL

https://github.com/salesforce/progen

KEYWORDS

protein design, LLM

USE

sequence generation

TITLE

ProGen2: Exploring the boundaries of protein language models.

Main citation

Nijkamp E, Ruffolo JA, Weinstein EN, Naik N, ...&, Madani A. (2023) ProGen2: Exploring the boundaries of protein language models. Cell Syst, 14 (11) 968-978.e3. doi:10.1016/j.cels.2023.10.002. PMID 37909046

ABSTRACT

Attention-based models trained on protein sequences have demonstrated incredible success at classification and generation tasks relevant for artificial-intelligence-driven protein design. However, we lack a sufficient understanding of how very large-scale models and data play a role in effective protein model development. We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters and trained on different sequence datasets drawn from over a billion proteins from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional fine-tuning. As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model. Our models and code are open sourced for widespread adoption in protein engineering. A record of this paper's Transparent Peer Review process is included in the supplemental information.

DOI

10.1016/j.cels.2023.10.002

ProtBERT

Reference

PUBMED_LINK

34232869

DESCRIPTION

BERT-based protein language model for downstream functional tasks.

URL

https://huggingface.co/Rostlab/prot_bert

KEYWORDS

protein LM, transformer, embeddings

USE

feature extraction

TITLE

ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning.

Main citation

Elnaggar A, Heinzinger M, Dallago C, Rehawi G, ...&, Rost B. (2022) ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Trans Pattern Anal Mach Intell, 44 (10) 7112-7127. doi:10.1109/TPAMI.2021.3095381. PMID 34232869

ABSTRACT

Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The protein LMs (pLMs) were trained on the Summit supercomputer using 5616 GPUs and TPU Pod up-to 1024 cores. Dimensionality reduction revealed that the raw pLM-embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks: (1) a per-residue (per-token) prediction of protein secondary structure (3-state accuracy Q3=81%-87%); (2) per-protein (pooling) predictions of protein sub-cellular location (ten-state accuracy: Q10=81%) and membrane versus water-soluble (2-state accuracy Q2=91%). For secondary structure, the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without multiple sequence alignments (MSAs) or evolutionary information thereby bypassing expensive database searches. Taken together, the results implied that pLMs learned some of the grammar of the language of life. All our models are available through https://github.com/agemagician/ProtTrans.

DOI

10.1109/TPAMI.2021.3095381
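
This entry lists feature extraction as the intended use, and the abstract describes feeding pLM embeddings into downstream predictors. Before a sequence reaches the tokenizer it usually needs light preprocessing. The sketch below assumes the convention described on the Rostlab/prot_bert model card: uppercase residues separated by spaces, with the rare amino acids U, Z, O and B mapped to X. The function name `prepare_for_protbert` is hypothetical, and the embedding step itself (loading the model through the Hugging Face `transformers` library) is deliberately left out.

```python
import re


def prepare_for_protbert(sequence):
    """Format a raw protein sequence for ProtBERT tokenization (hypothetical helper).

    Assumed convention: uppercase letters, rare residues U/Z/O/B replaced by X,
    and one space between residues.
    """
    # Drop any whitespace or line breaks (e.g. from FASTA wrapping) and uppercase.
    cleaned = "".join(sequence.split()).upper()
    if not cleaned.isalpha():
        raise ValueError("sequence must contain only letters")
    # Map rare or ambiguous amino acids to the unknown-residue symbol X.
    cleaned = re.sub(r"[UZOB]", "X", cleaned)
    # The tokenizer treats each space-separated residue as one token.
    return " ".join(cleaned)


formatted = prepare_for_protbert("aetcZao")
print(formatted)
```

The returned string is what would be passed to the tokenizer; per-residue embeddings can then be pooled (for example, averaged) for per-protein tasks such as those described in the abstract.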

ProteinBERT

Reference

PUBMED_LINK

35020807

TITLE

ProteinBERT: a universal deep-learning model of protein sequence and function.

Main citation

Brandes N, Ofer D, Peleg Y, Rappoport N, ...&, Linial M. (2022) ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics, 38 (8) 2102-2110. doi:10.1093/bioinformatics/btac020. PMID 35020807

ABSTRACT

SUMMARY: Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed to biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme combines language modeling with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to long sequences. The architecture of ProteinBERT consists of both local and global representations, allowing end-to-end processing of these types of inputs and outputs. ProteinBERT obtains near state-of-the-art performance, and sometimes exceeds it, on multiple benchmarks covering diverse protein properties (including protein structure, post-translational modifications and biophysical attributes), despite using a far smaller and faster model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data. AVAILABILITY AND IMPLEMENTATION: Code and pretrained model weights are available at https://github.com/nadavbra/protein_bert. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

DOI

10.1093/bioinformatics/btac020

ProtVar

Reference

PUBMED_LINK

38769064

URL

https://www.ebi.ac.uk/ProtVar/

TITLE

ProtVar: mapping and contextualizing human missense variation.

Main citation

Stephenson JD, Totoo P, Burke DF, Jänes J, ...&, Martin MJ. (2024) ProtVar: mapping and contextualizing human missense variation. Nucleic Acids Res, 52 (W1) W140-W147. doi:10.1093/nar/gkae413. PMID 38769064

ABSTRACT

Genomic variation can impact normal biological function in complex ways and so understanding variant effects requires a broad range of data to be coherently assimilated. Whilst the volume of human variant data and relevant annotations has increased, the corresponding increase in the breadth of participating fields, standards and versioning mean that moving between genomic, coding, protein and structure positions is increasingly complex. In turn this makes investigating variants in diverse formats and assimilating annotations from different resources challenging. ProtVar addresses these issues to facilitate the contextualization and interpretation of human missense variation with unparalleled flexibility and ease of accessibility for use by the broadest range of researchers. By precalculating all possible variants in the human proteome it offers near instantaneous mapping between all relevant data types. It also combines data and analyses from a plethora of resources to bring together genomic, protein sequence and function annotations as well as structural insights and predictions to better understand the likely effect of missense variation in humans. It is offered as an intuitive web server https://www.ebi.ac.uk/protvar where data can be explored and downloaded, and can be accessed programmatically via an API.

DOI

10.1093/nar/gkae413

Reactome

Reference

PUBMED_LINK

37941124

DESCRIPTION

REACTOME is an open-source, open access, manually curated and peer-reviewed pathway database. Our goal is to provide intuitive bioinformatics tools for the visualization, interpretation and analysis of pathway knowledge to support basic and clinical research, genome analysis, modeling, systems biology and education. Founded in 2003, the Reactome project is led by Lincoln Stein of OICR, Peter D’Eustachio of NYU Langone Health, Henning Hermjakob of EMBL-EBI, and Guanming Wu of OHSU.

URL

https://reactome.org/

KEYWORDS

Pathway

TITLE

The Reactome Pathway Knowledgebase 2024.

Main citation

Milacic M, Beavers D, Conley P, Gong C, ...&, D'Eustachio P. (2024) The Reactome Pathway Knowledgebase 2024. Nucleic Acids Res, 52 (D1) D672-D678. doi:10.1093/nar/gkad1025. PMID 37941124

ABSTRACT

The Reactome Knowledgebase (https://reactome.org), an Elixir and GCBR core biological data resource, provides manually curated molecular details of a broad range of normal and disease-related biological processes. Processes are annotated as an ordered network of molecular transformations in a single consistent data model. Reactome thus functions both as a digital archive of manually curated human biological processes and as a tool for discovering functional relationships in data such as gene expression profiles or somatic mutation catalogs from tumor cells. Here we review progress towards annotation of the entire human proteome, targeted annotation of disease-causing genetic variants of proteins and of small-molecule drugs in a pathway context, and towards supporting explicit annotation of cell- and tissue-specific pathways. Finally, we briefly discuss issues involved in making Reactome more fully interoperable with other related resources such as the Gene Ontology and maintaining the resulting community resource network.

DOI

10.1093/nar/gkad1025

REVEL

Reference

PUBMED_LINK

27666373

FULL NAME

Rare Exome Variant Ensemble Learner

DESCRIPTION

Ensemble method integrating multiple tools to predict pathogenicity.

URL

https://sites.google.com/site/revelgenomics/

KEYWORDS

ensemble, missense

USE

pathogenicity scoring

TITLE

REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants.

Main citation

Ioannidis NM, Rothstein JH, Pejaver V, Middha S, ...&, Sieh W. (2016) REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. Am J Hum Genet, 99 (4) 877-885. doi:10.1016/j.ajhg.2016.08.016. PMID 27666373

ABSTRACT

The vast majority of coding variants are rare, and assessment of the contribution of rare variants to complex traits is hampered by low statistical power and limited functional data. Improved methods for predicting the pathogenicity of rare coding variants are needed to facilitate the discovery of disease variants from exome sequencing studies. We developed REVEL (rare exome variant ensemble learner), an ensemble method for predicting the pathogenicity of missense variants on the basis of individual tools: MutPred, FATHMM, VEST, PolyPhen, SIFT, PROVEAN, MutationAssessor, MutationTaster, LRT, GERP, SiPhy, phyloP, and phastCons. REVEL was trained with recently discovered pathogenic and rare neutral missense variants, excluding those previously used to train its constituent tools. When applied to two independent test sets, REVEL had the best overall performance (p < 10^-12) as compared to any individual tool and seven ensemble methods: MetaSVM, MetaLR, KGGSeq, Condel, CADD, DANN, and Eigen. Importantly, REVEL also had the best performance for distinguishing pathogenic from rare neutral variants with allele frequencies <0.5%. The area under the receiver operating characteristic curve (AUC) for REVEL was 0.046-0.182 higher in an independent test set of 935 recent SwissVar disease variants and 123,935 putatively neutral exome sequencing variants and 0.027-0.143 higher in an independent test set of 1,953 pathogenic and 2,406 benign variants recently reported in ClinVar than the AUCs for other ensemble methods. We provide pre-computed REVEL scores for all possible human missense variants to facilitate the identification of pathogenic variants in the sea of rare variants discovered as sequencing studies expand in scale.

DOI

10.1016/j.ajhg.2016.08.016

ROADMAP 2010

Reference

PUBMED_LINK

20944595

TITLE

The NIH Roadmap Epigenomics Mapping Consortium.

Main citation

Bernstein BE, Stamatoyannopoulos JA, Costello JF, Ren B, ...&, Thomson JA. (2010) The NIH Roadmap Epigenomics Mapping Consortium. Nat Biotechnol, 28 (10) 1045-8. doi:10.1038/nbt1010-1045. PMID 20944595

ABSTRACT

The NIH Roadmap Epigenomics Mapping Consortium aims to produce a public resource of epigenomic maps for stem cells and primary ex vivo tissues selected to represent the normal counterparts of tissues and organ systems frequently involved in human disease.

DOI

10.1038/nbt1010-1045

ROADMAP 2015

Reference

PUBMED_LINK

25693563

URL

https://maayanlab.cloud/Harmonizome/resource/Roadmap+Epigenomics
https://egg2.wustl.edu/roadmap/web_portal/chr_state_learning.html#core_15state

TITLE

Integrative analysis of 111 reference human epigenomes.

Main citation

Roadmap Epigenomics Consortium, Kundaje A, Meuleman W, Ernst J, ...&, Kellis M. (2015) Integrative analysis of 111 reference human epigenomes. Nature, 518 (7539) 317-30. doi:10.1038/nature14248. PMID 25693563

ABSTRACT

The reference human genome sequence set the stage for studies of genetic variation and its association with human disease, but epigenomic studies lack a similar reference. To address this need, the NIH Roadmap Epigenomics Consortium generated the largest collection so far of human epigenomes for primary cells and tissues. Here we describe the integrative analysis of 111 reference human epigenomes generated as part of the programme, profiled for histone modification patterns, DNA accessibility, DNA methylation and RNA expression. We establish global maps of regulatory elements, define regulatory modules of coordinated activity, and their likely activators and repressors. We show that disease- and trait-associated genetic variants are enriched in tissue-specific epigenomic marks, revealing biologically relevant cell types for diverse human traits, and providing a resource for interpreting the molecular basis of human disease. Our results demonstrate the central role of epigenomic information for understanding gene regulation, cellular differentiation and human disease.

DOI

10.1038/nature14248

SAHA

Reference

FULL NAME

The Spatial Atlas of Human Anatomy (SAHA)

Main citation

Park, J. et al. The spatial atlas of Human Anatomy (SAHA): A multimodal subcellular-resolution reference across human organs. bioRxiv 2025.06.16.658716 (2025) doi:10.1101/2025.06.16.658716.

SIFT

Reference

PUBMED_LINK

12824425

FULL NAME

Sorting Intolerant From Tolerant

DESCRIPTION

Predicts whether substitutions affect protein function.

URL

https://sift.bii.a-star.edu.sg/

KEYWORDS

conservation, missense

USE

variant scoring

TITLE

SIFT: Predicting amino acid changes that affect protein function.

Main citation

Ng PC, Henikoff S. (2003) SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res, 31 (13) 3812-4. doi:10.1093/nar/gkg509. PMID 12824425

ABSTRACT

Single nucleotide polymorphism (SNP) studies and random mutagenesis projects identify amino acid substitutions in protein-coding regions. Each substitution has the potential to affect protein function. SIFT (Sorting Intolerant From Tolerant) is a program that predicts whether an amino acid substitution affects protein function so that users can prioritize substitutions for further study. We have shown that SIFT can distinguish between functionally neutral and deleterious amino acid changes in mutagenesis studies and on human polymorphisms. SIFT is available at http://blocks.fhcrc.org/sift/SIFT.html.

DOI

10.1093/nar/gkg509

STRING

Reference

PUBMED_LINK

36370105

DESCRIPTION

STRING is a database of known and predicted protein-protein interactions. The interactions include direct (physical) and indirect (functional) associations; they stem from computational prediction, from knowledge transfer between organisms, and from interactions aggregated from other (primary) databases.

URL

https://string-db.org/

KEYWORDS

Interaction

TITLE

The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest.

Main citation

Szklarczyk D, Kirsch R, Koutrouli M, Nastou K, ...&, von Mering C. (2023) The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res, 51 (D1) D638-D646. doi:10.1093/nar/gkac1000. PMID 36370105

ABSTRACT

Much of the complexity within cells arises from functional and regulatory interactions among proteins. The core of these interactions is increasingly known, but novel interactions continue to be discovered, and the information remains scattered across different database resources, experimental modalities and levels of mechanistic detail. The STRING database (https://string-db.org/) systematically collects and integrates protein-protein interactions, both physical interactions and functional associations. The data originate from a number of sources: automated text mining of the scientific literature, computational interaction predictions from co-expression, conserved genomic context, databases of interaction experiments and known complexes/pathways from curated sources. All of these interactions are critically assessed, scored, and subsequently automatically transferred to less well-studied organisms using hierarchical orthology information. The data can be accessed via the website, but also programmatically and via bulk downloads. The most recent developments in STRING (version 12.0) are: (i) it is now possible to create, browse and analyze a full interaction network for any novel genome of interest, by submitting its complement of encoded proteins, (ii) the co-expression channel now uses variational auto-encoders to predict interactions, and it covers two new sources, single-cell RNA-seq and experimental proteomics data and (iii) the confidence in each experimentally derived interaction is now estimated based on the detection method used, and communicated to the user in the web-interface. Furthermore, STRING continues to enhance its facilities for functional enrichment analysis, which are now fully available also for user-submitted genomes.

DOI

10.1093/nar/gkac1000
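
The abstract above mentions that STRING data can be accessed programmatically. As a rough sketch, the snippet below assembles a request URL for the network method of the STRING REST API, assuming the documented URL pattern `https://string-db.org/api/<format>/<method>` with carriage-return-separated identifiers. The helper name `string_network_url` and the example proteins are illustrative, and the code only builds the URL without contacting the server.

```python
from urllib.parse import urlencode

# Root of the STRING REST API (assumed pattern: /api/<format>/<method>).
STRING_API = "https://string-db.org/api"


def string_network_url(identifiers, species=9606, output_format="tsv"):
    """Build a STRING 'network' request URL (hypothetical helper).

    species is an NCBI taxonomy ID; 9606 is Homo sapiens.
    """
    if not identifiers:
        raise ValueError("at least one protein identifier is required")
    params = {
        # Multiple identifiers are joined with a carriage return,
        # which urlencode percent-encodes as %0D.
        "identifiers": "\r".join(identifiers),
        "species": species,
    }
    return f"{STRING_API}/{output_format}/network?{urlencode(params)}"


url = string_network_url(["TP53", "MDM2"])
print(url)
```

For larger analyses the bulk downloads mentioned in the abstract are likely a better fit than repeated API calls.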

Taiwan View

Reference

URL

https://taiwanview.twbiobank.org.tw/variant.php

UKBB-LD

Reference

PUBMED_LINK

33199916

DESCRIPTION

Linkage disequilibrium (LD) matrices of UK Biobank participants of British ancestry, based on imputed genotypes.

URL

https://registry.opendata.aws/ukbb-ld/

TITLE

Functionally informed fine-mapping and polygenic localization of complex trait heritability.

Main citation

Weissbrod O, Hormozdiari F, Benner C, Cui R, ...&, Price AL. (2020) Functionally informed fine-mapping and polygenic localization of complex trait heritability. Nat Genet, 52 (12) 1355-1363. doi:10.1038/s41588-020-00735-5. PMID 33199916

ABSTRACT

Fine-mapping aims to identify causal variants impacting complex traits. We propose PolyFun, a computationally scalable framework to improve fine-mapping accuracy by leveraging functional annotations across the entire genome (not just genome-wide-significant loci) to specify prior probabilities for fine-mapping methods such as SuSiE or FINEMAP. In simulations, PolyFun + SuSiE and PolyFun + FINEMAP were well calibrated and identified >20% more variants with a posterior causal probability >0.95 than identified in their nonfunctionally informed counterparts. In analyses of 49 UK Biobank traits (average n = 318,000), PolyFun + SuSiE identified 3,025 fine-mapped variant-trait pairs with posterior causal probability >0.95, a >32% improvement versus SuSiE. We used posterior mean per-SNP heritabilities from PolyFun + SuSiE to perform polygenic localization, constructing minimal sets of common SNPs causally explaining 50% of common SNP heritability; these sets ranged in size from 28 (hair color) to 3,400 (height) to 2 million (number of children). In conclusion, PolyFun prioritizes variants for functional follow-up and provides insights into complex trait architectures.

DOI

10.1038/s41588-020-00735-5

MAIN ANCESTRY

EUR

UNEECON

Reference

PUBMED_LINK

32667917

DESCRIPTION

UNEECON is a statistical method for inferring deleterious mutations and constrained genes in humans and potentially other species.

URL

https://github.com/yifei-lab/UNEECON

TITLE

Unified inference of missense variant effects and gene constraints in the human genome.

Main citation

Huang YF. (2020) Unified inference of missense variant effects and gene constraints in the human genome. PLoS Genet, 16 (7) e1008922. doi:10.1371/journal.pgen.1008922. PMID 32667917

ABSTRACT

A challenge in medical genomics is to identify variants and genes associated with severe genetic disorders. Based on the premise that severe, early-onset disorders often result in a reduction of evolutionary fitness, several statistical methods have been developed to predict pathogenic variants or constrained genes based on the signatures of negative selection in human populations. However, we currently lack a statistical framework to jointly predict deleterious variants and constrained genes from both variant-level features and gene-level selective constraints. Here we present such a unified approach, UNEECON, based on deep learning and population genetics. UNEECON treats the contributions of variant-level features and gene-level constraints as a variant-level fixed effect and a gene-level random effect, respectively. The sum of the fixed and random effects is then combined with an evolutionary model to infer the strength of negative selection at both variant and gene levels. Compared with previously published methods, UNEECON shows improved performance in predicting missense variants and protein-coding genes associated with autosomal dominant disorders, and feature importance analysis suggests that both gene-level selective constraints and variant-level predictors are important for accurate variant prioritization. Furthermore, based on UNEECON, we observe a low correlation between gene-level intolerance to missense mutations and that to loss-of-function mutations, which can be partially explained by the prevalence of disordered protein regions that are highly tolerant to missense mutations. Finally, we show that genes intolerant to both missense and loss-of-function mutations play key roles in the central nervous system and the autism spectrum disorders. Overall, UNEECON is a promising framework for both variant and gene prioritization.

DOI

10.1371/journal.pgen.1008922

UniProt

Reference

PUBMED_LINK

39552041

DESCRIPTION

The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc). The UniProt consortium and host institutions EMBL-EBI, SIB and PIR are committed to the long-term preservation of the UniProt databases.

URL

https://www.uniprot.org/

TITLE

UniProt: the Universal Protein Knowledgebase in 2025.

Main citation

UniProt Consortium. (2025) UniProt: the Universal Protein Knowledgebase in 2025. Nucleic Acids Res, 53 (D1) D609-D617. doi:10.1093/nar/gkae1010. PMID 39552041

ABSTRACT

The aim of the UniProt Knowledgebase (UniProtKB; https://www.uniprot.org/) is to provide users with a comprehensive, high-quality and freely accessible set of protein sequences annotated with functional information. In this publication, we describe ongoing changes to our production pipeline to limit the sequences available in UniProtKB to high-quality, non-redundant reference proteomes. We continue to manually curate the scientific literature to add the latest functional data and use machine learning techniques. We also encourage community curation to ensure key publications are not missed. We provide an update on the automatic annotation methods used by UniProtKB to predict information for unreviewed entries describing unstudied proteins. Finally, updates to the UniProt website are described, including a new tab linking protein to genomic information. In recognition of its value to the scientific community, the UniProt database has been awarded Global Core Biodata Resource status.

DOI

10.1093/nar/gkae1010

VEST4

Reference

PUBMED_LINK

23819870

FULL NAME

Variant Effect Scoring Tool v4

DESCRIPTION

Machine learning pathogenicity score for SNVs.

URL

https://www.cravat.us/

KEYWORDS

ML, SNV

USE

variant scoring

TITLE

Identifying Mendelian disease genes with the variant effect scoring tool.

Main citation

Carter H, Douville C, Stenson PD, Cooper DN, ...&, Karchin R. (2013) Identifying Mendelian disease genes with the variant effect scoring tool. BMC Genomics, 14 Suppl 3 (Suppl 3) S3. doi:10.1186/1471-2164-14-S3-S3. PMID 23819870

ABSTRACT

BACKGROUND: Whole exome sequencing studies identify hundreds to thousands of rare protein coding variants of ambiguous significance for human health. Computational tools are needed to accelerate the identification of specific variants and genes that contribute to human disease. RESULTS: We have developed the Variant Effect Scoring Tool (VEST), a supervised machine learning-based classifier, to prioritize rare missense variants with likely involvement in human disease. The VEST classifier training set comprised ~45,000 disease mutations from the latest Human Gene Mutation Database release and another ~45,000 high frequency (allele frequency >1%) putatively neutral missense variants from the Exome Sequencing Project. VEST outperforms some of the most popular methods for prioritizing missense variants in carefully designed holdout benchmarking experiments (VEST ROC AUC = 0.91, PolyPhen2 ROC AUC = 0.86, SIFT4.0 ROC AUC = 0.84). VEST estimates variant score p-values against a null distribution of VEST scores for neutral variants not included in the VEST training set. These p-values can be aggregated at the gene level across multiple disease exomes to rank genes for probable disease involvement. We tested the ability of an aggregate VEST gene score to identify candidate Mendelian disease genes, based on whole-exome sequencing of a small number of disease cases. We used whole-exome data for two Mendelian disorders for which the causal gene is known. Considering only genes that contained variants in all cases, the VEST gene score ranked dihydroorotate dehydrogenase (DHODH) number 2 of 2253 genes in four cases of Miller syndrome, and myosin-3 (MYH3) number 2 of 2313 genes in three cases of Freeman Sheldon syndrome. CONCLUSIONS: Our results demonstrate the potential power gain of aggregating bioinformatics variant scores into gene-level scores and the general utility of bioinformatics in assisting the search for disease genes in large-scale exome sequencing studies.
VEST is available as a stand-alone software package at http://wiki.chasmsoftware.org and is hosted by the CRAVAT web server at http://www.cravat.us.
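
ILLUSTRATIVE SKETCH

The abstract describes two steps: scoring a variant against a null distribution of neutral-variant scores to obtain a p-value, then aggregating p-values per gene to rank genes. The Python sketch below shows one way this could look. It is not the VEST implementation. It assumes that higher scores indicate more likely disease involvement, and it uses Fisher's method for the aggregation, which is an assumption because the abstract does not name the combining rule; Fisher's method also assumes the p-values are independent. The function names and the toy numbers are hypothetical.

```python
import math
from bisect import bisect_left


def empirical_p_value(score, sorted_null):
    """Empirical p-value of `score` against ascending-sorted null scores.

    Counts null scores >= score and applies an add-one correction so the
    result is never exactly zero.
    """
    n = len(sorted_null)
    n_ge = n - bisect_left(sorted_null, score)  # null scores >= score
    return (n_ge + 1) / (n + 1)


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (assumed here).

    X = -2 * sum(ln p) follows a chi-square distribution with 2k degrees
    of freedom under the null; for even degrees of freedom the survival
    function has the closed form exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


def rank_genes(scores_by_gene, null_scores):
    """Return (gene, combined p-value) pairs, most significant first."""
    sorted_null = sorted(null_scores)
    combined = {
        gene: fisher_combine([empirical_p_value(s, sorted_null) for s in scores])
        for gene, scores in scores_by_gene.items()
    }
    return sorted(combined.items(), key=lambda item: item[1])


# Toy, made-up inputs purely for illustration.
null = [0.1, 0.2, 0.3, 0.4]
ranking = rank_genes({"GENE_A": [0.9, 0.8], "GENE_B": [0.15, 0.3]}, null)
```

With these toy inputs, GENE_A ranks ahead of GENE_B because both of its variant scores exceed every score in the null set.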

DOI

10.1186/1471-2164-14-S3-S3

Westlake BioBank for Chinese (WBBC)

Reference

URL

https://wbbc.westlake.edu.cn/genotype.html