Curation of Imputation — listings under the GWAS Tools tab.

Summary Table

| NAME | CATEGORY | Main citation | YEAR |
|---|---|---|---|
| 1000 Genomes Phase 3 Version 5 (1KGp3v5) | Imputation panel | 1000 Genomes Project Consortium et al., Nature, 2015 | 2015 |
| 1KG+7K | Imputation panel | NA | NA |
| CKB reference panel | Imputation panel | Yu C et al., Nucleic Acids Res, 2023 | 2023 |
| ChinaMAP panel | Imputation panel | Li L et al., Cell Res, 2021 | 2021 |
| GenomeAsia 100K | Imputation panel | GenomeAsia100K Consortium, Nature, 2019 | 2019 |
| HGDP+1kGP | Imputation panel | Koenig Z et al., Genome Res, 2024 | 2024 |
| HRC | Imputation panel | McCarthy S et al., Nat Genet, 2016 | 2016 |
| NARD2 | Imputation panel | Choi J et al., Sci Adv, 2023 | 2023 |
| NARD | Imputation panel | Yoo SK et al., Genome Med, 2019 | 2019 |
| Nyuwa Genome Resource Phase 1 | Imputation panel | Zhang P et al., Cell Rep, 2021 | 2021 |
| PGG.Han panel | Imputation panel | Gao Y et al., Nucleic Acids Res, 2020 | 2020 |
| South and East Asian Reference Database (SEAD) | Imputation panel | Yang, M. Y., Zhong, J. D., Li, X., Bai, W. Y., Yuan, C. D., Qiu, M. C., ... & Zheng, H. F. (2023). SEAD: an… | NA |
| TOPMED | Imputation panel | Taliun D et al., Nature, 2021 | 2021 |
| WBBC panel | Imputation panel | Cong PK et al., Nat Commun, 2022 | 2022 |
| CNGB Imputation Service | Imputation server | Yu C et al., Nucleic Acids Res, 2023 | 2023 |
| ChinaMAP | Imputation server | Li L et al., Cell Res, 2021 | 2021 |
| Michigan Imputation Server | Imputation server | Das S et al., Nat Genet, 2016 | 2016 |
| NyuWa Imputation Server | Imputation server | Zhang P et al., Cell Rep, 2021 | 2021 |
| PGG.Han | Imputation server | Gao Y et al., Nucleic Acids Res, 2020 | 2020 |
| Sanger | Imputation server | McCarthy S et al., Nat Genet, 2016 | 2016 |
| TOPMED | Imputation server | Taliun D et al., Nature, 2021 | 2021 |
| Westlake Imputation Server | Imputation server | Cong PK et al., Nat Commun, 2022 | 2022 |
| RESHAPE | Other tools | Cavinato T et al., Nat Comput Sci, 2024 | 2024 |
| BEAGLE4 | Phasing & Imputation tool | Browning BL et al., Am J Hum Genet, 2016 | 2016 |
| BEAGLE5.4 (Imputation) | Phasing & Imputation tool | Browning BL et al., Am J Hum Genet, 2018 | 2018 |
| BEAGLE5.4 (Phasing) | Phasing & Imputation tool | Browning BL et al., Am J Hum Genet, 2021 | 2021 |
| BEAGLE | Phasing & Imputation tool | Browning SR et al., Am J Hum Genet, 2007 | 2007 |
| EAGLE2 | Phasing & Imputation tool | Loh PR et al., Nat Genet, 2016 | 2016 |
| EAGLE | Phasing & Imputation tool | Loh PR et al., Nat Genet, 2016 | 2016 |
| GLIMPSE | Phasing & Imputation tool | Rubinacci S et al., Nat Genet, 2021 | 2021 |
| IMPUTE2 | Phasing & Imputation tool | Howie BN et al., PLoS Genet, 2009 | 2009 |
| IMPUTE4 | Phasing & Imputation tool | Bycroft C et al., Nature, 2018 | 2018 |
| IMPUTE5 | Phasing & Imputation tool | Rubinacci S et al., PLoS Genet, 2020 | 2020 |
| IMPUTE | Phasing & Imputation tool | Marchini J et al., Nat Genet, 2007 | 2007 |
| MACH / minimach pre-phasing | Phasing & Imputation tool | Howie B et al., Nat Genet, 2012 | 2012 |
| MACH / minimach2 | Phasing & Imputation tool | Fuchsberger C et al., Bioinformatics, 2015 | 2015 |
| MACH / minimach3 | Phasing & Imputation tool | Das S et al., Nat Genet, 2016 | 2016 |
| MACH / minimach4 | Phasing & Imputation tool | NA | NA |
| MACH / minimach | Phasing & Imputation tool | Li Y et al., Genet Epidemiol, 2010 | 2010 |
| QUILT1 | Phasing & Imputation tool | Davies RW et al., Nat Genet, 2021 | 2021 |
| QUILT2 | Phasing & Imputation tool | Li, Z., Albrechtsen, A. & Davies, R. W. Rapid and accurate genotype imputation from low coverage short read, long… | NA |
| SHAPEIT1 | Phasing & Imputation tool | Delaneau O et al., Nat Methods, 2011 | 2011 |
| SHAPEIT2 | Phasing & Imputation tool | Delaneau O et al., Nat Methods, 2013 | 2013 |
| SHAPEIT3 | Phasing & Imputation tool | O'Connell J et al., Nat Genet, 2016 | 2016 |
| SHAPEIT4 | Phasing & Imputation tool | Delaneau O et al., Nat Commun, 2019 | 2019 |
| SHAPEIT5 | Phasing & Imputation tool | Hofmeister RJ et al., Nat Genet, 2023 | 2023 |
| fastPHASE | Phasing & Imputation tool | Scheet P et al., Am J Hum Genet, 2006 | 2006 |
| Review-Das | Review | Das S et al., Annu Rev Genomics Hum Genet, 2018 | 2018 |
| Review-Li | Review | Li Y et al., Annu Rev Genomics Hum Genet, 2009 | 2009 |
| Review-Marchini | Review | Marchini J et al., Nat Rev Genet, 2010 | 2010 |
| 1KG SV imputation panel | Structural variants imputation panel | Noyvert, B., Erzurumluoglu, A. M., Drichel, D., Omland, S., Andlauer, T. F., Mueller, S., ... & Ding, Z. (2023).… | NA |

Imputation panel

1000 Genomes Phase 3 Version 5 (1KGp3v5) (1KG)

Tool
PUBMED_LINK
26432245
URL
https://www.internationalgenome.org/
TITLE
A global reference for human genetic variation.
Main citation
1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, ...&, Abecasis GR. (2015) A global reference for human genetic variation. Nature, 526 (7571) 68-74. doi:10.1038/nature15393. PMID 26432245
ABSTRACT
The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.
DOI
10.1038/nature15393

1KG+7K

Tool
KEYWORDS
Japanese population-specific reference panel
PREPRINT_DOI
10.21203/rs.3.rs-3194976/v1

CKB reference panel (CKB)

Tool
PUBMED_LINK
37870428
FULL NAME
China Kadoorie Biobank
URL
https://db.cngb.org/imputation/
TITLE
A high-resolution haplotype-resolved Reference panel constructed from the China Kadoorie Biobank Study.
Main citation
Yu C, Lan X, Tao Y, Guo Y, ...&, Li L. (2023) A high-resolution haplotype-resolved Reference panel constructed from the China Kadoorie Biobank Study. Nucleic Acids Res, 51 (21) 11770-11782. doi:10.1093/nar/gkad779. PMID 37870428
ABSTRACT
Precision medicine depends on high-accuracy individual-level genotype data. However, the whole-genome sequencing (WGS) is still not suitable for gigantic studies due to budget constraints. It is particularly important to construct highly accurate haplotype reference panel for genotype imputation. In this study, we used 10 000 samples with medium-depth WGS to construct a reference panel that we named the CKB reference panel. By imputing microarray datasets, it showed that the CKB panel outperformed compared panels in terms of both the number of well-imputed variants and imputation accuracy. In addition, we have completed the imputation of 100 706 microarrays with the CKB panel, and the after-imputed data is the hitherto largest whole genome data of the Chinese population. Furthermore, in the GWAS analysis of real phenotype height, the number of tested SNPs tripled and the number of significant SNPs doubled after imputation. Finally, we developed an online server for offering free genotype imputation service based on the CKB reference panel (https://db.cngb.org/imputation/). We believe that the CKB panel is of great value for imputing microarray or low-coverage genotype data of Chinese population, and potentially mixed populations. The imputation-completed 100 706 microarray data are enormous and precious resources of population genetic studies for complex traits and diseases.
DOI
10.1093/nar/gkad779

ChinaMAP panel (ChinaMAP)

Tool
PUBMED_LINK
34489580
FULL NAME
China Metabolic Analytics Project
URL
http://www.mbiobank.com/
TITLE
The ChinaMAP reference panel for the accurate genotype imputation in Chinese populations.
Main citation
Li L, Huang P, Sun X, Wang S, ...&, Wang W. (2021) The ChinaMAP reference panel for the accurate genotype imputation in Chinese populations. Cell Res, 31 (12) 1308-1310. doi:10.1038/s41422-021-00564-z. PMID 34489580
DOI
10.1038/s41422-021-00564-z

GenomeAsia 100K

Tool
PUBMED_LINK
31802016
URL
https://www.genomeasia100k.org/
TITLE
The GenomeAsia 100K Project enables genetic discoveries across Asia.
Main citation
GenomeAsia100K Consortium. (2019) The GenomeAsia 100K Project enables genetic discoveries across Asia. Nature, 576 (7785) 106-111. doi:10.1038/s41586-019-1793-z. PMID 31802016
ABSTRACT
The underrepresentation of non-Europeans in human genetic studies so far has limited the diversity of individuals in genomic datasets and led to reduced medical relevance for a large proportion of the world's population. Population-specific reference genome datasets as well as genome-wide association studies in diverse populations are needed to address this issue. Here we describe the pilot phase of the GenomeAsia 100K Project. This includes a whole-genome sequencing reference dataset from 1,739 individuals of 219 population groups and 64 countries across Asia. We catalogue genetic variation, population structure, disease associations and founder effects. We also explore the use of this dataset in imputation, to facilitate genetic studies in populations across Asia and worldwide.
DOI
10.1038/s41586-019-1793-z

HGDP+1kGP

Tool
PUBMED_LINK
38749656
FULL NAME
Human Genome Diversity Project + 1000 Genomes project
URL
https://gnomad.broadinstitute.org/news/2020-10-gnomad-v3-1-new-content-methods-annotations-and-data-availability/#the-gnomad-hgdp-and-1000-genomes-callset
TITLE
A harmonized public resource of deeply sequenced diverse human genomes.
Main citation
Koenig Z, Yohannes MT, Nkambule LL, Zhao X, ...&, Martin AR. (2024) A harmonized public resource of deeply sequenced diverse human genomes. Genome Res, 34 (5) 796-809. doi:10.1101/gr.278378.123. PMID 38749656
ABSTRACT
Underrepresented populations are often excluded from genomic studies owing in part to a lack of resources supporting their analyses. The 1000 Genomes Project (1kGP) and Human Genome Diversity Project (HGDP), which have recently been sequenced to high coverage, are valuable genomic resources because of the global diversity they capture and their open data sharing policies. Here, we harmonized a high-quality set of 4094 whole genomes from 80 populations in the HGDP and 1kGP with data from the Genome Aggregation Database (gnomAD) and identified over 153 million high-quality SNVs, indels, and SVs. We performed a detailed ancestry analysis of this cohort, characterizing population structure and patterns of admixture across populations, analyzing site frequency spectra, and measuring variant counts at global and subcontinental levels. We also show substantial added value from this data set compared with the prior versions of the component resources, typically combined via liftOver and variant intersection; for example, we catalog millions of new genetic variants, mostly rare, compared with previous releases. In addition to unrestricted individual-level public release, we provide detailed tutorials for conducting many of the most common quality-control steps and analyses with these data in a scalable cloud-computing environment and publicly release this new phased joint callset for use as a haplotype resource in phasing and imputation pipelines. This jointly called reference panel will serve as a key resource to support research of diverse ancestry populations.
DOI
10.1101/gr.278378.123

HRC

Tool
PUBMED_LINK
27548312
URL
http://www.haplotype-reference-consortium.org/
TITLE
A reference panel of 64,976 haplotypes for genotype imputation.
Main citation
McCarthy S, Das S, Kretzschmar W, Delaneau O, ...&, Haplotype Reference Consortium. (2016) A reference panel of 64,976 haplotypes for genotype imputation. Nat Genet, 48 (10) 1279-83. doi:10.1038/ng.3643. PMID 27548312
ABSTRACT
We describe a reference panel of 64,976 human haplotypes at 39,235,157 SNPs constructed using whole-genome sequence data from 20 studies of predominantly European ancestry. Using this resource leads to accurate genotype imputation at minor allele frequencies as low as 0.1% and a large increase in the number of SNPs tested in association studies, and it can help to discover and refine causal loci. We describe remote server resources that allow researchers to carry out imputation and phasing consistently and efficiently.
DOI
10.1038/ng.3643

NARD

Tool
PUBMED_LINK
31640730
FULL NAME
Northeast Asian Reference Database
URL
https://nard.macrogen.com/
TITLE
NARD: whole-genome reference panel of 1779 Northeast Asians improves imputation accuracy of rare and low-frequency variants.
Main citation
Yoo SK, Kim CU, Kim HL, Kim S, ...&, Seo JS. (2019) NARD: whole-genome reference panel of 1779 Northeast Asians improves imputation accuracy of rare and low-frequency variants. Genome Med, 11 (1) 64. doi:10.1186/s13073-019-0677-z. PMID 31640730
ABSTRACT
Here, we present the Northeast Asian Reference Database (NARD), including whole-genome sequencing data of 1779 individuals from Korea, Mongolia, Japan, China, and Hong Kong. NARD provides the genetic diversity of Korean (n = 850) and Mongolian (n = 384) ancestries that were not present in the 1000 Genomes Project Phase 3 (1KGP3). We combined and re-phased the genotypes from NARD and 1KGP3 to construct a union set of haplotypes. This approach established a robust imputation reference panel for Northeast Asians, which yields the greatest imputation accuracy of rare and low-frequency variants compared with the existing panels. NARD imputation panel is available at https://nard.macrogen.com/ .
DOI
10.1186/s13073-019-0677-z

NARD2

Tool
PUBMED_LINK
37556544
FULL NAME
Northeast Asian Reference Database 2
URL
https://nard.macrogen.com/
TITLE
A whole-genome reference panel of 14,393 individuals for East Asian populations accelerates discovery of rare functional variants.
Main citation
Choi J, Kim S, Kim J, Son HY, ...&, Im SW. (2023) A whole-genome reference panel of 14,393 individuals for East Asian populations accelerates discovery of rare functional variants. Sci Adv, 9 (32) eadg6319. doi:10.1126/sciadv.adg6319. PMID 37556544
ABSTRACT
Underrepresentation of non-European (EUR) populations hinders growth of global precision medicine. Resources such as imputation reference panels that match the study population are necessary to find low-frequency variants with substantial effects. We created a reference panel consisting of 14,393 whole-genome sequences including more than 11,000 Asian individuals. Genome-wide association studies were conducted using the reference panel and a population-specific genotype array of 72,298 subjects for eight phenotypes. This panel yields improved imputation accuracy of rare and low-frequency variants within East Asian populations compared with the largest reference panel. Thirty-nine previously unidentified associations were found, and more than half of the variants were East Asian specific. We discovered genes with rare protein-altering variants, including LTBP1 for height and GPR75 for body mass index, as well as putative regulatory mechanisms for rare noncoding variants with cell type-specific effects. We suggest that this dataset will add to the potential value of Asian precision medicine.
DOI
10.1126/sciadv.adg6319

Nyuwa Genome Resource Phase 1

Tool
PUBMED_LINK
34788621
URL
http://bigdata.ibp.ac.cn/refpanel/getstarted.php
TITLE
NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population.
Main citation
Zhang P, Luo H, Li Y, Wang Y, ...&, He S. (2021) NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population. Cell Rep, 37 (7) 110017. doi:10.1016/j.celrep.2021.110017. PMID 34788621
ABSTRACT
The lack of haplotype reference panels and whole-genome sequencing resources specific to the Chinese population has greatly hindered genetic studies in the world's largest population. Here, we present the NyuWa genome resource, based on deep (26.2×) sequencing of 2,999 Chinese individuals, and construct a NyuWa reference panel of 5,804 haplotypes and 19.3 million variants, which is a high-quality publicly available Chinese population-specific reference panel with thousands of samples. Compared with other panels, the NyuWa reference panel reduces the Han Chinese imputation error rate by a margin ranging from 30% to 51%. Population structure and imputation simulation tests support the applicability of one integrated reference panel for northern and southern Chinese. In addition, a total of 22,504 loss-of-function variants in coding and noncoding genes are identified, including 11,493 novel variants. These results highlight the value of the NyuWa genome resource in facilitating genetic research in Chinese and Asian populations.
DOI
10.1016/j.celrep.2021.110017

PGG.Han panel (PGG.Han)

Tool
PUBMED_LINK
31584086
URL
https://www.biosino.org/pgghan2/index#home1
TITLE
PGG.Han: the Han Chinese genome database and analysis platform.
Main citation
Gao Y, Zhang C, Yuan L, Ling Y, ...&, Xu S. (2020) PGG.Han: the Han Chinese genome database and analysis platform. Nucleic Acids Res, 48 (D1) D971-D976. doi:10.1093/nar/gkz829. PMID 31584086
ABSTRACT
As the largest ethnic group in the world, the Han Chinese population is nonetheless underrepresented in global efforts to catalogue the genomic variability of natural populations. Here, we developed the PGG.Han, a population genome database to serve as the central repository for the genomic data of the Han Chinese Genome Initiative (Phase I). In its current version, the PGG.Han archives whole-genome sequences or high-density genome-wide single-nucleotide variants (SNVs) of 114 783 Han Chinese individuals (a.k.a. the Han100K), representing geographical sub-populations covering 33 of the 34 administrative divisions of China, as well as Singapore. The PGG.Han provides: (i) an interactive interface for visualization of the fine-scale genetic structure of the Han Chinese population; (ii) genome-wide allele frequencies of hierarchical sub-populations; (iii) ancestry inference for individual samples and controlling population stratification based on nested ancestry informative markers (AIMs) panels; (iv) population-structure-aware shared control data for genotype-phenotype association studies (e.g. GWASs) and (v) a Han-Chinese-specific reference panel for genotype imputation. Computational tools are implemented into the PGG.Han, and an online user-friendly interface is provided for data analysis and results visualization. The PGG.Han database is freely accessible via http://www.pgghan.org or https://www.hanchinesegenomes.org.
DOI
10.1093/nar/gkz829

South and East Asian Reference Database (SEAD)

Tool
FULL NAME
South and East Asian Reference Database
URL
https://imputationserver.westlake.edu.cn/
PREPRINT_DOI
10.1101/2023.12.23.23300480
Main citation
Yang, M. Y., Zhong, J. D., Li, X., Bai, W. Y., Yuan, C. D., Qiu, M. C., ... & Zheng, H. F. (2023). SEAD: an augmented reference panel with 22,134 haplotypes boosts the rare variants imputation and GWAS analysis in Asian population. medRxiv, 2023-12.

TOPMED

Tool
PUBMED_LINK
33568819
FULL NAME
Trans-Omics for Precision Medicine
URL
https://imputation.biodatacatalyst.nhlbi.nih.gov/#!
TITLE
Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program.
Main citation
Taliun D, Harris DN, Kessler MD, Carlson J, ...&, Abecasis GR. (2021) Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature, 590 (7845) 290-299. doi:10.1038/s41586-021-03205-y. PMID 33568819
ABSTRACT
The Trans-Omics for Precision Medicine (TOPMed) programme seeks to elucidate the genetic architecture and biology of heart, lung, blood and sleep disorders, with the ultimate goal of improving diagnosis, treatment and prevention of these diseases. The initial phases of the programme focused on whole-genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here we describe the TOPMed goals and design as well as the available resources and early insights obtained from the sequence data. The resources include a variant browser, a genotype imputation server, and genomic and phenotypic data that are available through dbGaP (Database of Genotypes and Phenotypes). In the first 53,831 TOPMed samples, we detected more than 400 million single-nucleotide and insertion or deletion variants after alignment with the reference genome. Additional previously undescribed variants were detected through assembly of unmapped reads and customized analysis in highly variable loci. Among the more than 400 million detected variants, 97% have frequencies of less than 1% and 46% are singletons that are present in only one individual (53% among unrelated individuals). These rare variants provide insights into mutational processes and recent human evolutionary history. The extensive catalogue of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and noncoding sequence variants to phenotypic variation. Furthermore, combining TOPMed haplotypes with modern imputation methods improves the power and reach of genome-wide association studies to include variants down to a frequency of approximately 0.01%.
DOI
10.1038/s41586-021-03205-y

WBBC panel (WBBC)

Tool
PUBMED_LINK
35618720
URL
https://imputationserver.westlake.edu.cn/
TITLE
Genomic analyses of 10,376 individuals in the Westlake BioBank for Chinese (WBBC) pilot project.
Main citation
Cong PK, Bai WY, Li JC, Yang MY, ...&, Zheng HF. (2022) Genomic analyses of 10,376 individuals in the Westlake BioBank for Chinese (WBBC) pilot project. Nat Commun, 13 (1) 2939. doi:10.1038/s41467-022-30526-x. PMID 35618720
ABSTRACT
We initiate the Westlake BioBank for Chinese (WBBC) pilot project with 4,535 whole-genome sequencing (WGS) individuals and 5,841 high-density genotyping individuals, and identify 81.5 million SNPs and INDELs, of which 38.5% are absent in dbSNP Build 151. We provide a population-specific reference panel and an online imputation server ( https://wbbc.westlake.edu.cn/ ) which could yield substantial improvement of imputation performance in Chinese population, especially for low-frequency and rare variants. By analyzing the singleton density of the WGS data, we find selection signatures in SNX29, DNAH1 and WDR1 genes, and the derived alleles of the alcohol metabolism genes (ADH1A and ADH1B) emerge around 7,000 years ago and tend to be more common from 4,000 years ago in East Asia. Genetic evidence supports the corresponding geographical boundaries of the Qinling-Huaihe Line and Nanling Mountains, which separate the Han Chinese into subgroups, and we reveal that North Han was more homogeneous than South Han.
DOI
10.1038/s41467-022-30526-x

Imputation server

CNGB Imputation Service (CNGB)

Tool
PUBMED_LINK
37870428
FULL NAME
China National GeneBank
URL
https://db.cngb.org/imputation/
TITLE
A high-resolution haplotype-resolved Reference panel constructed from the China Kadoorie Biobank Study.
Main citation
Yu C, Lan X, Tao Y, Guo Y, ...&, Li L. (2023) A high-resolution haplotype-resolved Reference panel constructed from the China Kadoorie Biobank Study. Nucleic Acids Res, 51 (21) 11770-11782. doi:10.1093/nar/gkad779. PMID 37870428
ABSTRACT
Precision medicine depends on high-accuracy individual-level genotype data. However, the whole-genome sequencing (WGS) is still not suitable for gigantic studies due to budget constraints. It is particularly important to construct highly accurate haplotype reference panel for genotype imputation. In this study, we used 10 000 samples with medium-depth WGS to construct a reference panel that we named the CKB reference panel. By imputing microarray datasets, it showed that the CKB panel outperformed compared panels in terms of both the number of well-imputed variants and imputation accuracy. In addition, we have completed the imputation of 100 706 microarrays with the CKB panel, and the after-imputed data is the hitherto largest whole genome data of the Chinese population. Furthermore, in the GWAS analysis of real phenotype height, the number of tested SNPs tripled and the number of significant SNPs doubled after imputation. Finally, we developed an online server for offering free genotype imputation service based on the CKB reference panel (https://db.cngb.org/imputation/). We believe that the CKB panel is of great value for imputing microarray or low-coverage genotype data of Chinese population, and potentially mixed populations. The imputation-completed 100 706 microarray data are enormous and precious resources of population genetic studies for complex traits and diseases.
DOI
10.1093/nar/gkad779

ChinaMAP

Tool
PUBMED_LINK
34489580
FULL NAME
China Metabolic Analytics Project
URL
http://www.mbiobank.com/
TITLE
The ChinaMAP reference panel for the accurate genotype imputation in Chinese populations.
Main citation
Li L, Huang P, Sun X, Wang S, ...&, Wang W. (2021) The ChinaMAP reference panel for the accurate genotype imputation in Chinese populations. Cell Res, 31 (12) 1308-1310. doi:10.1038/s41422-021-00564-z. PMID 34489580
DOI
10.1038/s41422-021-00564-z

Michigan Imputation Server (Michigan)

Tool
PUBMED_LINK
27571263
URL
https://imputationserver.sph.umich.edu/index.html#!
TITLE
Next-generation genotype imputation service and methods.
Main citation
Das S, Forer L, Schönherr S, Sidore C, ...&, Fuchsberger C. (2016) Next-generation genotype imputation service and methods. Nat Genet, 48 (10) 1284-1287. doi:10.1038/ng.3656. PMID 27571263
ABSTRACT
Genotype imputation is a key component of genetic association studies, where it increases power, facilitates meta-analysis, and aids interpretation of signals. Genotype imputation is computationally demanding and, with current tools, typically requires access to a high-performance computing cluster and to a reference panel of sequenced genomes. Here we describe improvements to imputation machinery that reduce computational requirements by more than an order of magnitude with no loss of accuracy in comparison to standard imputation tools. We also describe a new web-based service for imputation that facilitates access to new reference panels and greatly improves user experience and productivity.
DOI
10.1038/ng.3656

NyuWa Imputation Server (NyuWa)

Tool
PUBMED_LINK
34788621
URL
http://bigdata.ibp.ac.cn/refpanel/getstarted.php
TITLE
NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population.
Main citation
Zhang P, Luo H, Li Y, Wang Y, ...&, He S. (2021) NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population. Cell Rep, 37 (7) 110017. doi:10.1016/j.celrep.2021.110017. PMID 34788621
ABSTRACT
The lack of haplotype reference panels and whole-genome sequencing resources specific to the Chinese population has greatly hindered genetic studies in the world's largest population. Here, we present the NyuWa genome resource, based on deep (26.2×) sequencing of 2,999 Chinese individuals, and construct a NyuWa reference panel of 5,804 haplotypes and 19.3 million variants, which is a high-quality publicly available Chinese population-specific reference panel with thousands of samples. Compared with other panels, the NyuWa reference panel reduces the Han Chinese imputation error rate by a margin ranging from 30% to 51%. Population structure and imputation simulation tests support the applicability of one integrated reference panel for northern and southern Chinese. In addition, a total of 22,504 loss-of-function variants in coding and noncoding genes are identified, including 11,493 novel variants. These results highlight the value of the NyuWa genome resource in facilitating genetic research in Chinese and Asian populations.
DOI
10.1016/j.celrep.2021.110017

PGG.Han

Tool
PUBMED_LINK
31584086
URL
https://www.biosino.org/pgghan2/login
TITLE
PGG.Han: the Han Chinese genome database and analysis platform.
Main citation
Gao Y, Zhang C, Yuan L, Ling Y, ...&, Xu S. (2020) PGG.Han: the Han Chinese genome database and analysis platform. Nucleic Acids Res, 48 (D1) D971-D976. doi:10.1093/nar/gkz829. PMID 31584086
ABSTRACT
As the largest ethnic group in the world, the Han Chinese population is nonetheless underrepresented in global efforts to catalogue the genomic variability of natural populations. Here, we developed the PGG.Han, a population genome database to serve as the central repository for the genomic data of the Han Chinese Genome Initiative (Phase I). In its current version, the PGG.Han archives whole-genome sequences or high-density genome-wide single-nucleotide variants (SNVs) of 114 783 Han Chinese individuals (a.k.a. the Han100K), representing geographical sub-populations covering 33 of the 34 administrative divisions of China, as well as Singapore. The PGG.Han provides: (i) an interactive interface for visualization of the fine-scale genetic structure of the Han Chinese population; (ii) genome-wide allele frequencies of hierarchical sub-populations; (iii) ancestry inference for individual samples and controlling population stratification based on nested ancestry informative markers (AIMs) panels; (iv) population-structure-aware shared control data for genotype-phenotype association studies (e.g. GWASs) and (v) a Han-Chinese-specific reference panel for genotype imputation. Computational tools are implemented into the PGG.Han, and an online user-friendly interface is provided for data analysis and results visualization. The PGG.Han database is freely accessible via http://www.pgghan.org or https://www.hanchinesegenomes.org.
DOI
10.1093/nar/gkz829

Sanger

Tool
PUBMED_LINK
27548312
URL
https://imputation.sanger.ac.uk/
TITLE
A reference panel of 64,976 haplotypes for genotype imputation.
Main citation
McCarthy S, Das S, Kretzschmar W, Delaneau O, ...&, Haplotype Reference Consortium. (2016) A reference panel of 64,976 haplotypes for genotype imputation. Nat Genet, 48 (10) 1279-83. doi:10.1038/ng.3643. PMID 27548312
ABSTRACT
We describe a reference panel of 64,976 human haplotypes at 39,235,157 SNPs constructed using whole-genome sequence data from 20 studies of predominantly European ancestry. Using this resource leads to accurate genotype imputation at minor allele frequencies as low as 0.1% and a large increase in the number of SNPs tested in association studies, and it can help to discover and refine causal loci. We describe remote server resources that allow researchers to carry out imputation and phasing consistently and efficiently.
DOI
10.1038/ng.3643

TOPMED

Tool
PUBMED_LINK
33568819
FULL NAME
Trans-Omics for Precision Medicine
URL
https://imputation.biodatacatalyst.nhlbi.nih.gov/#!
TITLE
Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program.
Main citation
Taliun D, Harris DN, Kessler MD, Carlson J, ...&, Abecasis GR. (2021) Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature, 590 (7845) 290-299. doi:10.1038/s41586-021-03205-y. PMID 33568819
ABSTRACT
The Trans-Omics for Precision Medicine (TOPMed) programme seeks to elucidate the genetic architecture and biology of heart, lung, blood and sleep disorders, with the ultimate goal of improving diagnosis, treatment and prevention of these diseases. The initial phases of the programme focused on whole-genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here we describe the TOPMed goals and design as well as the available resources and early insights obtained from the sequence data. The resources include a variant browser, a genotype imputation server, and genomic and phenotypic data that are available through dbGaP (Database of Genotypes and Phenotypes). In the first 53,831 TOPMed samples, we detected more than 400 million single-nucleotide and insertion or deletion variants after alignment with the reference genome. Additional previously undescribed variants were detected through assembly of unmapped reads and customized analysis in highly variable loci. Among the more than 400 million detected variants, 97% have frequencies of less than 1% and 46% are singletons that are present in only one individual (53% among unrelated individuals). These rare variants provide insights into mutational processes and recent human evolutionary history. The extensive catalogue of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and noncoding sequence variants to phenotypic variation. Furthermore, combining TOPMed haplotypes with modern imputation methods improves the power and reach of genome-wide association studies to include variants down to a frequency of approximately 0.01%.
DOI
10.1038/s41586-021-03205-y

Westlake Imputation Server

Tool
PUBMED_LINK
35618720
URL
https://imputationserver.westlake.edu.cn/
TITLE
Genomic analyses of 10,376 individuals in the Westlake BioBank for Chinese (WBBC) pilot project.
Main citation
Cong PK, Bai WY, Li JC, Yang MY, ...&, Zheng HF. (2022) Genomic analyses of 10,376 individuals in the Westlake BioBank for Chinese (WBBC) pilot project. Nat Commun, 13 (1) 2939. doi:10.1038/s41467-022-30526-x. PMID 35618720
ABSTRACT
We initiate the Westlake BioBank for Chinese (WBBC) pilot project with 4,535 whole-genome sequencing (WGS) individuals and 5,841 high-density genotyping individuals, and identify 81.5 million SNPs and INDELs, of which 38.5% are absent in dbSNP Build 151. We provide a population-specific reference panel and an online imputation server (https://wbbc.westlake.edu.cn/) which could yield substantial improvement of imputation performance in Chinese population, especially for low-frequency and rare variants. By analyzing the singleton density of the WGS data, we find selection signatures in SNX29, DNAH1 and WDR1 genes, and the derived alleles of the alcohol metabolism genes (ADH1A and ADH1B) emerge around 7,000 years ago and tend to be more common from 4,000 years ago in East Asia. Genetic evidence supports the corresponding geographical boundaries of the Qinling-Huaihe Line and Nanling Mountains, which separate the Han Chinese into subgroups, and we reveal that North Han was more homogeneous than South Han.
DOI
10.1038/s41467-022-30526-x

Other tools

RESHAPE

Tool
PUBMED_LINK
38745108
FULL NAME
REcombine and Share HAPlotypEs
DESCRIPTION
RESHAPE removes sample-level genetic information from a reference panel to create a synthetic reference panel. Given a genetic map and the VCF/BCF of a reference panel, RESHAPE outputs a VCF/BCF of the same size in which each haplotype is a mosaic of the original haplotypes of the reference panel.
URL
https://github.com/TheoCavinato/RESHAPE
TITLE
A resampling-based approach to share reference panels.
Main citation
Cavinato T, Rubinacci S, Malaspinas AS, Delaneau O. (2024) A resampling-based approach to share reference panels. Nat Comput Sci, 4 (5) 360-366. doi:10.1038/s43588-024-00630-7. PMID 38745108
ABSTRACT
For many genome-wide association studies, imputing genotypes from a haplotype reference panel is a necessary step. Over the past 15 years, reference panels have become larger and more diverse, leading to improvements in imputation accuracy. However, the latest generation of reference panels is subject to restrictions on data sharing due to concerns about privacy, limiting their usefulness for genotype imputation. In this context, here we propose RESHAPE, a method that employs a recombination Poisson process on a reference panel to simulate the genomes of hypothetical descendants after multiple generations. This data transformation helps to protect against re-identification threats and preserves data attributes, such as linkage disequilibrium patterns and, to some degree, identity-by-descent sharing, allowing for genotype imputation. Our experiments on gold-standard datasets show that simulated descendants up to eight generations can serve as reference panels without substantially reducing genotype imputation accuracy.
DOI
10.1038/s43588-024-00630-7
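
The mosaic idea described for RESHAPE can be illustrated with a toy sketch. This is not the RESHAPE implementation: the function name `mosaic_panel`, the pairwise swapping of copy sources, and the "at most one crossover per interval" approximation of the Poisson process are all assumptions made for illustration. Swapping is used here only because it keeps the panel size and the per-site allele counts unchanged.

```python
import random

def mosaic_panel(haplotypes, cm_positions, generations=8, seed=0):
    """Toy resampling: return a same-size panel whose rows are mosaics of the input rows.

    haplotypes   : list of equal-length lists of 0/1 alleles (one row per haplotype)
    cm_positions : genetic position (centimorgans) of each marker, non-decreasing
    generations  : hypothetical number of generations; scales the crossover rate
    """
    rng = random.Random(seed)
    n_hap, n_mark = len(haplotypes), len(cm_positions)
    # source[i] = index of the original haplotype currently copied by output row i
    source = list(range(n_hap))
    out = [[0] * n_mark for _ in range(n_hap)]
    for m in range(n_mark):
        if m > 0:
            # Expected crossovers per haplotype in this interval (1 cM = 0.01 Morgan)
            rate = generations * (cm_positions[m] - cm_positions[m - 1]) / 100.0
            for i in range(n_hap):
                # Small-rate approximation: at most one crossover per interval
                if rng.random() < min(rate, 1.0):
                    j = rng.randrange(n_hap)
                    # Swap copy sources so `source` stays a permutation
                    source[i], source[j] = source[j], source[i]
        for i in range(n_hap):
            out[i][m] = haplotypes[source[i]][m]
    return out

panel = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 1]]
synthetic = mosaic_panel(panel, [0.0, 10.0, 20.0, 30.0], generations=8, seed=1)
```

Because the copy sources always form a permutation, every column of `synthetic` contains the same alleles as the corresponding column of `panel`, only assigned to different rows.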

Phasing & Imputation tool

BEAGLE

Tool
PUBMED_LINK
17924348
URL
https://faculty.washington.edu/browning/beagle/beagle.html
TITLE
Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering.
Main citation
Browning SR, Browning BL. (2007) Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am J Hum Genet, 81 (5) 1084-97. doi:10.1086/521987. PMID 17924348
ABSTRACT
Whole-genome association studies present many new statistical and computational challenges due to the large quantity of data obtained. One of these challenges is haplotype inference; methods for haplotype inference designed for small data sets from candidate-gene studies do not scale well to the large number of individuals genotyped in whole-genome association studies. We present a new method and software for inference of haplotype phase and missing data that can accurately phase data from whole-genome association studies, and we present the first comparison of haplotype-inference methods for real and simulated data sets with thousands of genotyped individuals. We find that our method outperforms existing methods in terms of both speed and accuracy for large data sets with thousands of individuals and densely spaced genetic markers, and we use our method to phase a real data set of 3,002 individuals genotyped for 490,032 markers in 3.1 days of computing time, with 99% of masked alleles imputed correctly. Our method is implemented in the Beagle software package, which is freely available.
DOI
10.1086/521987

BEAGLE4

Tool
PUBMED_LINK
26748515
DESCRIPTION
(beagle 4.1)
URL
https://faculty.washington.edu/browning/beagle/beagle.html
TITLE
Genotype Imputation with Millions of Reference Samples.
Main citation
Browning BL, Browning SR. (2016) Genotype Imputation with Millions of Reference Samples. Am J Hum Genet, 98 (1) 116-26. doi:10.1016/j.ajhg.2015.11.020. PMID 26748515
ABSTRACT
We present a genotype imputation method that scales to millions of reference samples. The imputation method, based on the Li and Stephens model and implemented in Beagle v.4.1, is parallelized and memory efficient, making it well suited to multi-core computer processors. It achieves fast, accurate, and memory-efficient genotype imputation by restricting the probability model to markers that are genotyped in the target samples and by performing linear interpolation to impute ungenotyped variants. We compare Beagle v.4.1 with Impute2 and Minimac3 by using 1000 Genomes Project data, UK10K Project data, and simulated data. All three methods have similar accuracy but different memory requirements and different computation times. When imputing 10 Mb of sequence data from 50,000 reference samples, Beagle's throughput was more than 100× greater than Impute2's throughput on our computer servers. When imputing 10 Mb of sequence data from 200,000 reference samples in VCF format, Minimac3 consumed 26× more memory per computational thread and 15× more CPU time than Beagle. We demonstrate that Beagle v.4.1 scales to much larger reference panels by performing imputation from a simulated reference panel having 5 million samples and a mean marker density of one marker per four base pairs.
DOI
10.1016/j.ajhg.2015.11.020
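
The interpolation step mentioned in the abstract can be sketched as follows. This is a minimal illustration, not Beagle's code: it assumes the HMM state probabilities at the two flanking genotyped markers are already known, and the function name `interpolate_dosage` and its parameters are hypothetical.

```python
def interpolate_dosage(pos, left_pos, right_pos, left_probs, right_probs, ref_alleles):
    """Estimate the alternate-allele probability at an ungenotyped marker.

    pos                     : position of the ungenotyped marker
    left_pos, right_pos     : positions of the flanking genotyped markers
    left_probs, right_probs : probability of copying each reference haplotype
                              at the left and right flanking markers
    ref_alleles             : 0/1 allele carried by each reference haplotype at `pos`
    """
    if not left_pos <= pos <= right_pos:
        raise ValueError("pos must lie between the flanking markers")
    # Weight of the right-hand marker grows linearly with distance from the left one
    width = right_pos - left_pos
    w = 0.0 if width == 0 else (pos - left_pos) / width
    probs = [(1.0 - w) * l + w * r for l, r in zip(left_probs, right_probs)]
    # Sum the interpolated probability over haplotypes carrying the alternate allele
    return sum(p * a for p, a in zip(probs, ref_alleles))

dosage = interpolate_dosage(
    pos=150, left_pos=100, right_pos=200,
    left_probs=[0.9, 0.1], right_probs=[0.5, 0.5], ref_alleles=[1, 0],
)
```

Here the marker sits halfway between its neighbours, so the copying probabilities are averaged and `dosage` is approximately 0.7. The HMM itself only has to be evaluated at genotyped markers, which is where the memory and time savings described in the abstract come from.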

BEAGLE5.4 (Imputation)

Tool
PUBMED_LINK
30100085
DESCRIPTION
(beagle 5.4 imputation)
URL
https://faculty.washington.edu/browning/beagle/beagle.html
TITLE
A One-Penny Imputed Genome from Next-Generation Reference Panels.
Main citation
Browning BL, Zhou Y, Browning SR. (2018) A One-Penny Imputed Genome from Next-Generation Reference Panels. Am J Hum Genet, 103 (3) 338-348. doi:10.1016/j.ajhg.2018.07.015. PMID 30100085
ABSTRACT
Genotype imputation is commonly performed in genome-wide association studies because it greatly increases the number of markers that can be tested for association with a trait. In general, one should perform genotype imputation using the largest reference panel that is available because the number of accurately imputed variants increases with reference panel size. However, one impediment to using larger reference panels is the increased computational cost of imputation. We present a new genotype imputation method, Beagle 5.0, which greatly reduces the computational cost of imputation from large reference panels. We compare Beagle 5.0 with Beagle 4.1, Impute4, Minimac3, and Minimac4 using 1000 Genomes Project data, Haplotype Reference Consortium data, and simulated data for 10k, 100k, 1M, and 10M reference samples. All methods produce nearly identical accuracy, but Beagle 5.0 has the lowest computation time and the best scaling of computation time with increasing reference panel size. For 10k, 100k, 1M, and 10M reference samples and 1,000 phased target samples, Beagle 5.0's computation time is 3× (10k), 12× (100k), 43× (1M), and 533× (10M) faster than the fastest alternative method. Cost data from the Amazon Elastic Compute Cloud show that Beagle 5.0 can perform genome-wide imputation from 10M reference samples into 1,000 phased target samples at a cost of less than one US cent per sample.
DOI
10.1016/j.ajhg.2018.07.015

BEAGLE5.4 (Phasing)

Tool
PUBMED_LINK
34478634
DESCRIPTION
(beagle 5.4 phasing)
URL
https://faculty.washington.edu/browning/beagle/beagle.html
TITLE
Fast two-stage phasing of large-scale sequence data.
Main citation
Browning BL, Tian X, Zhou Y, Browning SR. (2021) Fast two-stage phasing of large-scale sequence data. Am J Hum Genet, 108 (10) 1880-1890. doi:10.1016/j.ajhg.2021.08.005. PMID 34478634
ABSTRACT
Haplotype phasing is the estimation of haplotypes from genotype data. We present a fast, accurate, and memory-efficient haplotype phasing method that scales to large-scale SNP array and sequence data. The method uses marker windowing and composite reference haplotypes to reduce memory usage and computation time. It incorporates a progressive phasing algorithm that identifies confidently phased heterozygotes in each iteration and fixes the phase of these heterozygotes in subsequent iterations. For data with many low-frequency variants, such as whole-genome sequence data, the method employs a two-stage phasing algorithm that phases high-frequency markers via progressive phasing in the first stage and phases low-frequency markers via genotype imputation in the second stage. This haplotype phasing method is implemented in the open-source Beagle 5.2 software package. We compare Beagle 5.2 and SHAPEIT 4.2.1 by using expanding subsets of 485,301 UK Biobank samples and 38,387 TOPMed samples. Both methods have very similar accuracy and computation time for UK Biobank SNP array data. However, for TOPMed sequence data, Beagle is more than 20 times faster than SHAPEIT, achieves similar accuracy, and scales to larger sample sizes.
DOI
10.1016/j.ajhg.2021.08.005

EAGLE

Tool
PUBMED_LINK
27270109
DESCRIPTION
(EAGLE1)
URL
https://alkesgroup.broadinstitute.org/Eagle/
TITLE
Fast and accurate long-range phasing in a UK Biobank cohort.
Main citation
Loh PR, Palamara PF, Price AL. (2016) Fast and accurate long-range phasing in a UK Biobank cohort. Nat Genet, 48 (7) 811-6. doi:10.1038/ng.3571. PMID 27270109
ABSTRACT
Recent work has leveraged the extensive genotyping of the Icelandic population to perform long-range phasing (LRP), enabling accurate imputation and association analysis of rare variants in target samples typed on genotyping arrays. Here we develop a fast and accurate LRP method, Eagle, that extends this paradigm to populations with much smaller proportions of genotyped samples by harnessing long (>4-cM) identical-by-descent (IBD) tracts shared among distantly related individuals. We applied Eagle to N ≈ 150,000 samples (0.2% of the British population) from the UK Biobank, and we determined that it is 1-2 orders of magnitude faster than existing methods while achieving similar or better phasing accuracy (switch error rate ≈ 0.3%, corresponding to perfect phase in a majority of 10-Mb segments). We also observed that, when used within an imputation pipeline, Eagle prephasing improved downstream imputation accuracy in comparison to prephasing in batches using existing methods, as necessary to achieve comparable computational cost.
DOI
10.1038/ng.3571
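
The switch error rate quoted in the abstract can be computed as in the sketch below. It assumes a single sample whose true phase is known (for example from trio data); the function name and argument layout are hypothetical and not taken from Eagle.

```python
def switch_error_rate(true_hap, est_hap, het_sites):
    """Fraction of consecutive heterozygous site pairs whose relative phase is wrong.

    true_hap, est_hap : allele (0/1) on the first haplotype at every site,
                        for the true and the estimated phasing
    het_sites         : indices of the heterozygous sites, in genomic order
    """
    # True where the estimated first haplotype agrees with the true first haplotype
    orientation = [true_hap[i] == est_hap[i] for i in het_sites]
    if len(orientation) < 2:
        return 0.0
    # A switch error occurs whenever the orientation flips between adjacent het sites
    switches = sum(1 for a, b in zip(orientation, orientation[1:]) if a != b)
    return switches / (len(orientation) - 1)

rate = switch_error_rate([0, 1, 0, 1, 1], [0, 1, 1, 0, 1], het_sites=[0, 1, 2, 3, 4])
# rate == 0.5: two flips among the four adjacent pairs
```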

EAGLE2

Tool
PUBMED_LINK
27694958
DESCRIPTION
(EAGLE2)
URL
https://alkesgroup.broadinstitute.org/Eagle/
TITLE
Reference-based phasing using the Haplotype Reference Consortium panel.
Main citation
Loh PR, Danecek P, Palamara PF, Fuchsberger C, ...&, Price AL. (2016) Reference-based phasing using the Haplotype Reference Consortium panel. Nat Genet, 48 (11) 1443-1448. doi:10.1038/ng.3679. PMID 27694958
ABSTRACT
Haplotype phasing is a fundamental problem in medical and population genetics. Phasing is generally performed via statistical phasing in a genotyped cohort, an approach that can yield high accuracy in very large cohorts but attains lower accuracy in smaller cohorts. Here we instead explore the paradigm of reference-based phasing. We introduce a new phasing algorithm, Eagle2, that attains high accuracy across a broad range of cohort sizes by efficiently leveraging information from large external reference panels (such as the Haplotype Reference Consortium; HRC) using a new data structure based on the positional Burrows-Wheeler transform. We demonstrate that Eagle2 attains a ∼20× speedup and ∼10% increase in accuracy compared to reference-based phasing using SHAPEIT2. On European-ancestry samples, Eagle2 with the HRC panel achieves >2× the accuracy of 1000 Genomes-based phasing. Eagle2 is open source and freely available for HRC-based phasing via the Sanger Imputation Service and the Michigan Imputation Server.
DOI
10.1038/ng.3679

GLIMPSE

Tool
PUBMED_LINK
33414550
FULL NAME
Genotype Likelihoods IMputation and PhaSing mEthod
DESCRIPTION
GLIMPSE is a phasing and imputation method for large-scale low-coverage sequencing studies.
URL
https://odelaneau.github.io/GLIMPSE/
TITLE
Efficient phasing and imputation of low-coverage sequencing data using large reference panels.
Main citation
Rubinacci S, Ribeiro DM, Hofmeister RJ, Delaneau O. (2021) Efficient phasing and imputation of low-coverage sequencing data using large reference panels. Nat Genet, 53 (1) 120-126. doi:10.1038/s41588-020-00756-0. PMID 33414550
ABSTRACT
Low-coverage whole-genome sequencing followed by imputation has been proposed as a cost-effective genotyping approach for disease and population genetics studies. However, its competitiveness against SNP arrays is undermined because current imputation methods are computationally expensive and unable to leverage large reference panels. Here, we describe a method, GLIMPSE, for phasing and imputation of low-coverage sequencing datasets from modern reference panels. We demonstrate its remarkable performance across different coverages and human populations. GLIMPSE achieves imputation of a genome for less than US$1 in computational cost, considerably outperforming other methods and improving imputation accuracy over the full allele frequency range. As a proof of concept, we show that 1× coverage enables effective gene expression association studies and outperforms dense SNP arrays in rare variant burden tests. Overall, this study illustrates the promising potential of low-coverage imputation and suggests a paradigm shift in the design of future genomic studies.
DOI
10.1038/s41588-020-00756-0

IMPUTE

Tool
PUBMED_LINK
17572673
URL
https://jmarchini.org/software/
TITLE
A new multipoint method for genome-wide association studies by imputation of genotypes.
Main citation
Marchini J, Howie B, Myers S, McVean G, ...&, Donnelly P. (2007) A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet, 39 (7) 906-13. doi:10.1038/ng2088. PMID 17572673
ABSTRACT
Genome-wide association studies are set to become the method of choice for uncovering the genetic basis of human diseases. A central challenge in this area is the development of powerful multipoint methods that can detect causal variants that have not been directly genotyped. We propose a coherent analysis framework that treats the problem as one involving missing or uncertain genotypes. Central to our approach is a model-based imputation method for inferring genotypes at observed or unobserved SNPs, leading to improved power over existing methods for multipoint association mapping. Using real genome-wide association study data, we show that our approach (i) is accurate and well calibrated, (ii) provides detailed views of associated regions that facilitate follow-up studies and (iii) can be used to validate and correct data at genotyped markers. A notable future use of our method will be to boost power by combining data from genome-wide scans that use different SNP sets.
DOI
10.1038/ng2088

IMPUTE2

Tool
PUBMED_LINK
19543373
TITLE
A flexible and accurate genotype imputation method for the next generation of genome-wide association studies.
Main citation
Howie BN, Donnelly P, Marchini J. (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet, 5 (6) e1000529. doi:10.1371/journal.pgen.1000529. PMID 19543373
ABSTRACT
Genotype imputation methods are now being widely used in the analysis of genome-wide association studies. Most imputation analyses to date have used the HapMap as a reference dataset, but new reference panels (such as controls genotyped on multiple SNP chips and densely typed samples from the 1,000 Genomes Project) will soon allow a broader range of SNPs to be imputed with higher accuracy, thereby increasing power. We describe a genotype imputation method (IMPUTE version 2) that is designed to address the challenges presented by these new datasets. The main innovation of our approach is a flexible modelling framework that increases accuracy and combines information across multiple reference panels while remaining computationally feasible. We find that IMPUTE v2 attains higher accuracy than other methods when the HapMap provides the sole reference panel, but that the size of the panel constrains the improvements that can be made. We also find that imputation accuracy can be greatly enhanced by expanding the reference panel to contain thousands of chromosomes and that IMPUTE v2 outperforms other methods in this setting at both rare and common SNPs, with overall error rates that are 15%-20% lower than those of the closest competing method. One particularly challenging aspect of next-generation association studies is to integrate information across multiple reference panels genotyped on different sets of SNPs; we show that our approach to this problem has practical advantages over other suggested solutions.
DOI
10.1371/journal.pgen.1000529

IMPUTE4

Tool
PUBMED_LINK
30305743
TITLE
The UK Biobank resource with deep phenotyping and genomic data.
Main citation
Bycroft C, Freeman C, Petkova D, Band G, ...&, Marchini J. (2018) The UK Biobank resource with deep phenotyping and genomic data. Nature, 562 (7726) 203-209. doi:10.1038/s41586-018-0579-z. PMID 30305743
ABSTRACT
The UK Biobank project is a prospective cohort study with deep genetic and phenotypic data collected on approximately 500,000 individuals from across the United Kingdom, aged between 40 and 69 at recruitment. The open resource is unique in its size and scope. A rich variety of phenotypic and health-related information is available on each participant, including biological measurements, lifestyle indicators, biomarkers in blood and urine, and imaging of the body and brain. Follow-up information is provided by linking health and medical records. Genome-wide genotype data have been collected on all participants, providing many opportunities for the discovery of new genetic associations and the genetic bases of complex traits. Here we describe the centralized analysis of the genetic data, including genotype quality, properties of population structure and relatedness of the genetic data, and efficient phasing and genotype imputation that increases the number of testable variants to around 96 million. Classical allelic variation at 11 human leukocyte antigen genes was imputed, resulting in the recovery of signals with known associations between human leukocyte antigen alleles and many diseases.
DOI
10.1038/s41586-018-0579-z

IMPUTE5

Tool
PUBMED_LINK
33196638
TITLE
Genotype imputation using the Positional Burrows Wheeler Transform.
Main citation
Rubinacci S, Delaneau O, Marchini J. (2020) Genotype imputation using the Positional Burrows Wheeler Transform. PLoS Genet, 16 (11) e1009049. doi:10.1371/journal.pgen.1009049. PMID 33196638
ABSTRACT
Genotype imputation is the process of predicting unobserved genotypes in a sample of individuals using a reference panel of haplotypes. In the last 10 years reference panels have increased in size by more than 100 fold. Increasing reference panel size improves accuracy of markers with low minor allele frequencies but poses ever increasing computational challenges for imputation methods. Here we present IMPUTE5, a genotype imputation method that can scale to reference panels with millions of samples. This method continues to refine the observation made in the IMPUTE2 method, that accuracy is optimized via use of a custom subset of haplotypes when imputing each individual. It achieves fast, accurate, and memory-efficient imputation by selecting haplotypes using the Positional Burrows Wheeler Transform (PBWT). By using the PBWT data structure at genotyped markers, IMPUTE5 identifies locally best matching haplotypes and long identical by state segments. The method then uses the selected haplotypes as conditioning states within the IMPUTE model. Using the HRC reference panel, which has ∼65,000 haplotypes, we show that IMPUTE5 is up to 30x faster than MINIMAC4 and up to 3x faster than BEAGLE5.1, and uses less memory than both these methods. Using simulated reference panels we show that IMPUTE5 scales sub-linearly with reference panel size. For example, keeping the number of imputed markers constant, increasing the reference panel size from 10,000 to 1 million haplotypes requires less than twice the computation time. As the reference panel increases in size IMPUTE5 is able to utilize a smaller number of reference haplotypes, thus reducing computational cost.
DOI
10.1371/journal.pgen.1009049
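
The PBWT ordering that underlies the haplotype selection described in the abstract can be sketched in a few lines. This shows only the standard positional prefix ordering and a naive neighbour lookup; IMPUTE5's actual selection procedure is more elaborate, and the function names here are hypothetical.

```python
def pbwt_prefix_orders(haplotypes):
    """For each marker k, return the haplotype indices sorted by reversed prefix ending at k.

    haplotypes : list of equal-length lists of 0/1 alleles
    """
    order = list(range(len(haplotypes)))
    orders = []
    for k in range(len(haplotypes[0])):
        # Stable partition by the allele at marker k keeps earlier ordering as tie-break
        zeros = [h for h in order if haplotypes[h][k] == 0]
        ones = [h for h in order if haplotypes[h][k] == 1]
        order = zeros + ones
        orders.append(order)
    return orders

def neighbours(order, target, k_neighbours=1):
    """Haplotypes adjacent to `target` in a PBWT ordering (longest matches ending at that marker)."""
    i = order.index(target)
    lo = max(0, i - k_neighbours)
    return [h for h in order[lo:i + k_neighbours + 1] if h != target]

haps = [[0, 1, 0, 1], [1, 1, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]]
orders = pbwt_prefix_orders(haps)
# orders[-1] == [3, 2, 0, 1]; haplotypes 0 and 1 are adjacent because they
# agree on the last three markers
closest = neighbours(orders[-1], target=1)  # [0]
```

Haplotypes that sit next to each other in the ordering share the longest identical stretch ending at that marker, so a small set of good conditioning states can be read off without comparing the target against the whole panel.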

MACH / minimach

Tool
PUBMED_LINK
21058334
DESCRIPTION
(MACH)
URL
http://csg.sph.umich.edu/abecasis/MaCH/index.html
TITLE
MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes.
Main citation
Li Y, Willer CJ, Ding J, Scheet P, ...&, Abecasis GR. (2010) MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol, 34 (8) 816-34. doi:10.1002/gepi.20533. PMID 21058334
ABSTRACT
Genome-wide association studies (GWAS) can identify common alleles that contribute to complex disease susceptibility. Despite the large number of SNPs assessed in each study, the effects of most common SNPs must be evaluated indirectly using either genotyped markers or haplotypes thereof as proxies. We have previously implemented a computationally efficient Markov Chain framework for genotype imputation and haplotyping in the freely available MaCH software package. The approach describes sampled chromosomes as mosaics of each other and uses available genotype and shotgun sequence data to estimate unobserved genotypes and haplotypes, together with useful measures of the quality of these estimates. Our approach is already widely used to facilitate comparison of results across studies as well as meta-analyses of GWAS. Here, we use simulations and experimental genotypes to evaluate its accuracy and utility, considering choices of genotyping panels, reference panel configurations, and designs where genotyping is replaced with shotgun sequencing. Importantly, we show that genotype imputation not only facilitates cross study analyses but also increases power of genetic association studies. We show that genotype imputation of common variants using HapMap haplotypes as a reference is very accurate using either genome-wide SNP data or smaller amounts of data typical in fine-mapping studies. Furthermore, we show the approach is applicable in a variety of populations. Finally, we illustrate how association analyses of unobserved variants will benefit from ongoing advances such as larger HapMap reference panels and whole genome shotgun sequencing technologies.
DOI
10.1002/gepi.20533
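
The "mosaic" description in the abstract corresponds to a copying hidden Markov model. The sketch below is a haploid simplification (MaCH itself works with unphased genotypes), with fixed, hypothetical `switch` and `error` parameters; it only computes the likelihood of one target haplotype given a reference panel using the forward recursion.

```python
def copying_likelihood(target, reference, switch=0.05, error=0.01):
    """Likelihood of `target` as an imperfect mosaic of the `reference` haplotypes.

    target    : list of 0/1 alleles
    reference : list of haplotypes (lists of 0/1 alleles, same length as target)
    switch    : assumed probability of changing the copied haplotype between markers
    error     : assumed probability that the copied allele is observed incorrectly
    """
    n = len(reference)

    def emit(h, m):
        # Probability of the observed allele given that haplotype h is being copied
        return (1.0 - error) if reference[h][m] == target[m] else error

    # Start by copying any reference haplotype with equal probability
    fwd = [emit(h, 0) / n for h in range(n)]
    for m in range(1, len(target)):
        total = sum(fwd)
        # Either keep copying the same haplotype or jump to a uniformly chosen one
        fwd = [((1.0 - switch) * fwd[h] + switch * total / n) * emit(h, m)
               for h in range(n)]
    return sum(fwd)

panel = [[0, 0, 0], [1, 1, 1]]
like_match = copying_likelihood([0, 0, 0], panel)     # target equals a panel haplotype
like_mismatch = copying_likelihood([0, 1, 0], panel)  # needs an error or two switches
```

`like_match` is larger than `like_mismatch`, since the second target can only be explained by a genotyping error or by switching the copied haplotype twice.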

MACH / minimach pre-phasing

Tool
PUBMED_LINK
22820512
DESCRIPTION
(pre-phasing, minimac)
URL
https://genome.sph.umich.edu/wiki/Minimac
TITLE
Fast and accurate genotype imputation in genome-wide association studies through pre-phasing.
Main citation
Howie B, Fuchsberger C, Stephens M, Marchini J, ...&, Abecasis GR. (2012) Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet, 44 (8) 955-9. doi:10.1038/ng.2354. PMID 22820512
ABSTRACT
The 1000 Genomes Project and disease-specific sequencing efforts are producing large collections of haplotypes that can be used as reference panels for genotype imputation in genome-wide association studies (GWAS). However, imputing from large reference panels with existing methods imposes a high computational burden. We introduce a strategy called 'pre-phasing' that maintains the accuracy of leading methods while reducing computational costs. We first statistically estimate the haplotypes for each individual within the GWAS sample (pre-phasing) and then impute missing genotypes into these estimated haplotypes. This reduces the computational cost because (i) the GWAS samples must be phased only once, whereas standard methods would implicitly repeat phasing with each reference panel update, and (ii) it is much faster to match a phased GWAS haplotype to one reference haplotype than to match two unphased GWAS genotypes to a pair of reference haplotypes. We implemented our approach in the MaCH and IMPUTE2 frameworks, and we tested it on data sets from the Wellcome Trust Case Control Consortium 2 (WTCCC2), the Genetic Association Information Network (GAIN), the Women's Health Initiative (WHI) and the 1000 Genomes Project. This strategy will be particularly valuable for repeated imputation as reference panels evolve.
DOI
10.1038/ng.2354

MACH / minimach2

Tool
PUBMED_LINK
25338720
DESCRIPTION
(minimac2)
URL
https://genome.sph.umich.edu/wiki/Minimac2
TITLE
minimac2: faster genotype imputation.
Main citation
Fuchsberger C, Abecasis GR, Hinds DA. (2015) minimac2: faster genotype imputation. Bioinformatics, 31 (5) 782-4. doi:10.1093/bioinformatics/btu704. PMID 25338720
ABSTRACT
Genotype imputation is a key step in the analysis of genome-wide association studies. Upcoming very large reference panels, such as those from The 1000 Genomes Project and the Haplotype Consortium, will improve imputation quality of rare and less common variants, but will also increase the computational burden. Here, we demonstrate how the application of software engineering techniques can help to keep imputation broadly accessible. Overall, these improvements speed up imputation by an order of magnitude compared with our previous implementation. AVAILABILITY AND IMPLEMENTATION: minimac2, including source code, documentation, and examples is available at http://genome.sph.umich.edu/wiki/Minimac2
DOI
10.1093/bioinformatics/btu704

MACH / minimach3

Tool
PUBMED_LINK
27571263
DESCRIPTION
(minimac3)
URL
https://genome.sph.umich.edu/wiki/Minimac3
TITLE
Next-generation genotype imputation service and methods.
Main citation
Das S, Forer L, Schönherr S, Sidore C, ...&, Fuchsberger C. (2016) Next-generation genotype imputation service and methods. Nat Genet, 48 (10) 1284-1287. doi:10.1038/ng.3656. PMID 27571263
ABSTRACT
Genotype imputation is a key component of genetic association studies, where it increases power, facilitates meta-analysis, and aids interpretation of signals. Genotype imputation is computationally demanding and, with current tools, typically requires access to a high-performance computing cluster and to a reference panel of sequenced genomes. Here we describe improvements to imputation machinery that reduce computational requirements by more than an order of magnitude with no loss of accuracy in comparison to standard imputation tools. We also describe a new web-based service for imputation that facilitates access to new reference panels and greatly improves user experience and productivity.
DOI
10.1038/ng.3656

QUILT1

Tool
PUBMED_LINK
34083788
URL
https://github.com/rwdavies/QUILT
TITLE
Rapid genotype imputation from sequence with reference panels.
Main citation
Davies RW, Kucka M, Su D, Shi S, ...&, Myers S. (2021) Rapid genotype imputation from sequence with reference panels. Nat Genet, 53 (7) 1104-1111. doi:10.1038/s41588-021-00877-0. PMID 34083788
ABSTRACT
Inexpensive genotyping methods are essential to modern genomics. Here we present QUILT, which performs diploid genotype imputation using low-coverage whole-genome sequence data. QUILT employs Gibbs sampling to partition reads into maternal and paternal sets, facilitating rapid haploid imputation using large reference panels. We show this partitioning to be accurate over many megabases, enabling highly accurate imputation close to theoretical limits and outperforming existing methods. Moreover, QUILT can impute accurately using diverse technologies, including long reads from Oxford Nanopore Technologies, and a new form of low-cost barcoded Illumina sequencing called haplotagging, with the latter showing improved accuracy at low coverages. Relative to DNA genotyping microarrays, QUILT offers improved accuracy at reduced cost, particularly for diverse populations that are traditionally underserved in modern genomic analyses, with accuracy nearly doubling at rare SNPs. Finally, QUILT can accurately impute (four-digit) human leukocyte antigen types, the first such method from low-coverage sequence data.
DOI
10.1038/s41588-021-00877-0

QUILT2

Tool
DESCRIPTION
QUILT2 is a fast and memory-efficient method for imputation from low coverage sequence. Statistically, QUILT2 operates on a per-read basis, and is base quality aware, meaning it can accurately impute from diverse inputs, including short read (e.g. Illumina), long read sequencing (that might be noisy) (e.g. Oxford Nanopore Technologies), barcoded Illumina sequencing (e.g. Haplotagging) and ancient DNA. In addition, QUILT2 can impute both the mother and fetal genome using cfDNA NIPT data.
URL
https://github.com/rwdavies/QUILT
PREPRINT_DOI
10.1101/2024.07.18.604149
Main citation
Li Z, Albrechtsen A, Davies RW. (2024) Rapid and accurate genotype imputation from low coverage short read, long read, and cell free DNA sequence. bioRxiv, 2024.07.18.604149. doi:10.1101/2024.07.18.604149.

SHAPEIT1

Tool
PUBMED_LINK
22138821
DESCRIPTION
(SHAPEIT1)
URL
https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html
TITLE
A linear complexity phasing method for thousands of genomes.
Main citation
Delaneau O, Marchini J, Zagury JF. (2011) A linear complexity phasing method for thousands of genomes. Nat Methods, 9 (2) 179-81. doi:10.1038/nmeth.1785. PMID 22138821
ABSTRACT
Human-disease etiology can be better understood with phase information about diploid sequences. We present a method for estimating haplotypes, using genotype data from unrelated samples or small nuclear families, that leads to improved accuracy and speed compared to several widely used methods. The method, segmented haplotype estimation and imputation tool (SHAPEIT), scales linearly with the number of haplotypes used in each iteration and can be run efficiently on whole chromosomes.
DOI
10.1038/nmeth.1785

SHAPEIT2

Tool
PUBMED_LINK
23269371
DESCRIPTION
(SHAPEIT2)
TITLE
Improved whole-chromosome phasing for disease and population genetic studies.
Main citation
Delaneau O, Zagury JF, Marchini J. (2013) Improved whole-chromosome phasing for disease and population genetic studies. Nat Methods, 10 (1) 5-6. doi:10.1038/nmeth.2307. PMID 23269371
DOI
10.1038/nmeth.2307

SHAPEIT3

Tool
PUBMED_LINK
27270105
DESCRIPTION
(SHAPEIT3)
URL
https://jmarchini.org/shapeit3/
TITLE
Haplotype estimation for biobank-scale data sets.
Main citation
O'Connell J, Sharp K, Shrine N, Wain L, ...&, Marchini J. (2016) Haplotype estimation for biobank-scale data sets. Nat Genet, 48 (7) 817-20. doi:10.1038/ng.3583. PMID 27270105
ABSTRACT
The UK Biobank (UKB) has recently released genotypes on 152,328 individuals together with extensive phenotypic and lifestyle information. We present a new phasing method, SHAPEIT3, that can handle such biobank-scale data sets and results in switch error rates as low as ∼0.3%. The method exhibits O(NlogN) scaling with sample size N, enabling fast and accurate phasing of even larger cohorts.
DOI
10.1038/ng.3583

SHAPEIT4

Tool
PUBMED_LINK
31780650
DESCRIPTION
(SHAPEIT4)
URL
https://odelaneau.github.io/shapeit4/
TITLE
Accurate, scalable and integrative haplotype estimation.
Main citation
Delaneau O, Zagury JF, Robinson MR, Marchini JL, ... & Dermitzakis ET. (2019) Accurate, scalable and integrative haplotype estimation. Nat Commun, 10 (1) 5436. doi:10.1038/s41467-019-13225-y. PMID 31780650
ABSTRACT
The number of human genomes being genotyped or sequenced increases exponentially and efficient haplotype estimation methods able to handle this amount of data are now required. Here we present a method, SHAPEIT4, which substantially improves upon other methods to process large genotype and high coverage sequencing datasets. It notably exhibits sub-linear running times with sample size, provides highly accurate haplotypes and allows integrating external phasing information such as large reference panels of haplotypes, collections of pre-phased variants and long sequencing reads. We provide SHAPEIT4 in an open source format and demonstrate its performance in terms of accuracy and running times on two gold standard datasets: the UK Biobank data and the Genome In A Bottle.
DOI
10.1038/s41467-019-13225-y

SHAPEIT5

Tool
PUBMED_LINK
37386248
DESCRIPTION
(SHAPEIT5)
TITLE
Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank.
Main citation
Hofmeister RJ, Ribeiro DM, Rubinacci S, Delaneau O. (2023) Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. Nat Genet, 55 (7) 1243-1249. doi:10.1038/s41588-023-01415-w. PMID 37386248
ABSTRACT
Phasing involves distinguishing the two parentally inherited copies of each chromosome into haplotypes. Here, we introduce SHAPEIT5, a new phasing method that quickly and accurately processes large sequencing datasets and applied it to UK Biobank (UKB) whole-genome and whole-exome sequencing data. We demonstrate that SHAPEIT5 phases rare variants with low switch error rates of below 5% for variants present in just 1 sample out of 100,000. Furthermore, we outline a method for phasing singletons, which, although less precise, constitutes an important step towards future developments. We then demonstrate that the use of UKB as a reference panel improves the accuracy of genotype imputation, which is even more pronounced when phased with SHAPEIT5 compared with other methods. Finally, we screen the UKB data for loss-of-function compound heterozygous events and identify 549 genes where both gene copies are knocked out. These genes complement current knowledge of gene essentiality in the human genome.
DOI
10.1038/s41588-023-01415-w

fastPHASE

Tool
PUBMED_LINK
16532393
URL
http://scheet.org/software.html
TITLE
A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase.
Main citation
Scheet P, Stephens M. (2006) A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet, 78 (4) 629-44. doi:10.1086/502802. PMID 16532393
ABSTRACT
We present a statistical model for patterns of genetic variation in samples of unrelated individuals from natural populations. This model is based on the idea that, over short regions, haplotypes in a population tend to cluster into groups of similar haplotypes. To capture the fact that, because of recombination, this clustering tends to be local in nature, our model allows cluster memberships to change continuously along the chromosome according to a hidden Markov model. This approach is flexible, allowing for both "block-like" patterns of linkage disequilibrium (LD) and gradual decline in LD with distance. The resulting model is also fast and, as a result, is practicable for large data sets (e.g., thousands of individuals typed at hundreds of thousands of markers). We illustrate the utility of the model by applying it to dense single-nucleotide-polymorphism genotype data for the tasks of imputing missing genotypes and estimating haplotypic phase. For imputing missing genotypes, methods based on this model are as accurate or more accurate than existing methods. For haplotype estimation, the point estimates are slightly less accurate than those from the best existing methods (e.g., for unrelated Centre d'Etude du Polymorphisme Humain individuals from the HapMap project, switch error was 0.055 for our method vs. 0.051 for PHASE) but require a small fraction of the computational cost. In addition, we demonstrate that the model accurately reflects uncertainty in its estimates, in that probabilities computed using the model are approximately well calibrated. The methods described in this article are implemented in a software package, fastPHASE, which is available from the Stephens Lab Web site.
DOI
10.1086/502802
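
NOTE (illustrative, not part of the fastPHASE package)
The fastPHASE and SHAPEIT abstracts above report phasing accuracy as a switch error rate. The sketch below shows one common way to compute that metric for a single individual, assuming the true haplotypes are known and the estimated genotypes match the truth at every heterozygous site. The function name switch_error_rate and the 0/1 allele coding are assumptions made for this example; they do not come from any of the tools listed here.

```python
from typing import Sequence


def switch_error_rate(
    true_h1: Sequence[int],
    true_h2: Sequence[int],
    est_h1: Sequence[int],
    est_h2: Sequence[int],
) -> float:
    """Fraction of consecutive heterozygous-site pairs whose phase is switched.

    Alleles are coded 0/1 (an assumption of this sketch). Homozygous sites
    carry no phase information and are skipped.
    """
    if not (len(true_h1) == len(true_h2) == len(est_h1) == len(est_h2)):
        raise ValueError("all haplotypes must have the same length")

    # For each heterozygous site, record whether the first estimated
    # haplotype carries the same allele as the first true haplotype.
    orientations = []
    for t1, t2, e1, e2 in zip(true_h1, true_h2, est_h1, est_h2):
        if t1 == t2:
            continue  # homozygous in the truth: not informative for phase
        if {e1, e2} != {t1, t2}:
            raise ValueError("estimated genotype differs from truth at a het site")
        orientations.append(e1 == t1)

    if len(orientations) < 2:
        raise ValueError("need at least two heterozygous sites")

    # A switch occurs whenever the orientation changes between two
    # consecutive heterozygous sites.
    switches = sum(a != b for a, b in zip(orientations, orientations[1:]))
    return switches / (len(orientations) - 1)


# Toy data (invented for illustration): five heterozygous sites, one switch.
truth_1 = [0, 1, 0, 1, 1, 0]
truth_2 = [1, 0, 1, 0, 1, 1]
est_1 = [0, 1, 1, 0, 1, 1]
est_2 = [1, 0, 0, 1, 1, 0]
rate = switch_error_rate(truth_1, truth_2, est_1, est_2)
print(rate)  # 1 switch over 4 consecutive het-site pairs
```

Swapping the labels of the two estimated haplotypes leaves the result unchanged, because only changes in orientation along the chromosome are counted. Published benchmarks typically average this quantity over many individuals and may treat missing or discordant genotypes differently.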

Review

Review-Das

Tool
PUBMED_LINK
29799802
TITLE
Genotype Imputation from Large Reference Panels.
Main citation
Das S, Abecasis GR, Browning BL. (2018) Genotype Imputation from Large Reference Panels. Annu Rev Genomics Hum Genet, 19, 73-96. doi:10.1146/annurev-genom-083117-021602. PMID 29799802
ABSTRACT
Genotype imputation has become a standard tool in genome-wide association studies because it enables researchers to inexpensively approximate whole-genome sequence data from genome-wide single-nucleotide polymorphism array data. Genotype imputation increases statistical power, facilitates fine mapping of causal variants, and plays a key role in meta-analyses of genome-wide association studies. Only variants that were previously observed in a reference panel of sequenced individuals can be imputed. However, the rapid increase in the number of deeply sequenced individuals will soon make it possible to assemble enormous reference panels that greatly increase the number of imputable variants. In this review, we present an overview of genotype imputation and describe the computational techniques that make it possible to impute genotypes from reference panels with millions of individuals.
DOI
10.1146/annurev-genom-083117-021602

Review-Li

Tool
PUBMED_LINK
19715440
TITLE
Genotype imputation.
Main citation
Li Y, Willer C, Sanna S, Abecasis G. (2009) Genotype imputation. Annu Rev Genomics Hum Genet, 10, 387-406. doi:10.1146/annurev.genom.9.081307.164242. PMID 19715440
ABSTRACT
Genotype imputation is now an essential tool in the analysis of genome-wide association scans. This technique allows geneticists to accurately evaluate the evidence for association at genetic markers that are not directly genotyped. Genotype imputation is particularly useful for combining results across studies that rely on different genotyping platforms but also increases the power of individual scans. Here, we review the history and theoretical underpinnings of the technique. To illustrate performance of the approach, we summarize results from several gene mapping studies. Finally, we preview the role of genotype imputation in an era when whole genome resequencing is becoming increasingly common.
DOI
10.1146/annurev.genom.9.081307.164242

Review-Marchini

Tool
PUBMED_LINK
20517342
TITLE
Genotype imputation for genome-wide association studies.
Main citation
Marchini J, Howie B. (2010) Genotype imputation for genome-wide association studies. Nat Rev Genet, 11 (7) 499-511. doi:10.1038/nrg2796. PMID 20517342
ABSTRACT
In the past few years genome-wide association (GWA) studies have uncovered a large number of convincingly replicated associations for many complex human diseases. Genotype imputation has been used widely in the analysis of GWA studies to boost power, fine-map associations and facilitate the combination of results across studies using meta-analysis. This Review describes the details of several different statistical methods for imputing genotypes, illustrates and discusses the factors that influence imputation performance, and reviews methods that can be used to assess imputation performance and test association at imputed SNPs.
DOI
10.1038/nrg2796

Structural variants imputation panel

1KG SV imputation panel (1KG SV)

Tool
KEYWORDS
structural variants, long-read
PREPRINT_DOI
10.1101/2023.12.20.23300308
Main citation
Noyvert, B., Erzurumluoglu, A. M., Drichel, D., Omland, S., Andlauer, T. F., Mueller, S., ... & Ding, Z. (2023). Imputation of structural variants using a multi-ancestry long-read sequencing panel enables identification of disease associations. medRxiv, 2023-12.