Tools Imputation

Curation of Imputation — listings under the GWAS Tools tab.

Summary Table

NAME	CATEGORY	Main citation	YEAR
1000 Genomes Phase 3 Version 5 (1KGp3v5)	Imputation panel	1000 Genomes Project Consortium et al., Nature, 2015	2015
1KG+7K	Imputation panel	NA	NA
CKB reference panel	Imputation panel	Yu C et al., Nucleic Acids Res, 2023	2023
ChinaMAP panel	Imputation panel	Li L et al., Cell Res, 2021	2021
GenomeAsia 100K	Imputation panel	GenomeAsia100K Consortium, Nature, 2019	2019
HGDP+1kGP	Imputation panel	Koenig Z et al., Genome Res, 2024	2024
HRC	Imputation panel	McCarthy S et al., Nat Genet, 2016	2016
NARD2	Imputation panel	Choi J et al., Sci Adv, 2023	2023
NARD	Imputation panel	Yoo SK et al., Genome Med, 2019	2019
Nyuwa Genome Resource Phase 1	Imputation panel	Zhang P et al., Cell Rep, 2021	2021
PGG.Han panel	Imputation panel	Gao Y et al., Nucleic Acids Res, 2020	2020
South and East Asian Reference Database (SEAD)	Imputation panel	Yang, M. Y., Zhong, J. D., Li, X., Bai, W. Y., Yuan, C. D., Qiu, M. C., ... & Zheng, H. F. (2023). SEAD: an…	NA
TOPMED	Imputation panel	Taliun D et al., Nature, 2021	2021
WBBC panel	Imputation panel	Cong PK et al., Nat Commun, 2022	2022
CNGB Imputation Service	Imputation server	Yu C et al., Nucleic Acids Res, 2023	2023
ChinaMAP	Imputation server	Li L et al., Cell Res, 2021	2021
Michigan Imputation Server	Imputation server	Das S et al., Nat Genet, 2016	2016
NyuWa Imputation Server	Imputation server	Zhang P et al., Cell Rep, 2021	2021
PGG.Han	Imputation server	Gao Y et al., Nucleic Acids Res, 2020	2020
Sanger	Imputation server	McCarthy S et al., Nat Genet, 2016	2016
TOPMED	Imputation server	Taliun D et al., Nature, 2021	2021
Westlake Imputation Server	Imputation server	Cong PK et al., Nat Commun, 2022	2022
RESHAPE	Other tools	Cavinato T et al., Nat Comput Sci, 2024	2024
BEAGLE4	Phasing & Imputation tool	Browning BL et al., Am J Hum Genet, 2016	2016
BEAGLE5.4 (Imputation)	Phasing & Imputation tool	Browning BL et al., Am J Hum Genet, 2018	2018
BEAGLE5.4 (Phasing)	Phasing & Imputation tool	Browning BL et al., Am J Hum Genet, 2021	2021
BEAGLE	Phasing & Imputation tool	Browning SR et al., Am J Hum Genet, 2007	2007
EAGLE2	Phasing & Imputation tool	Loh PR et al., Nat Genet, 2016	2016
EAGLE	Phasing & Imputation tool	Loh PR et al., Nat Genet, 2016	2016
GLIMPSE	Phasing & Imputation tool	Rubinacci S et al., Nat Genet, 2021	2021
IMPUTE2	Phasing & Imputation tool	Howie BN et al., PLoS Genet, 2009	2009
IMPUTE4	Phasing & Imputation tool	Bycroft C et al., Nature, 2018	2018
IMPUTE5	Phasing & Imputation tool	Rubinacci S et al., PLoS Genet, 2020	2020
IMPUTE	Phasing & Imputation tool	Marchini J et al., Nat Genet, 2007	2007
MACH / minimac pre-phasing	Phasing & Imputation tool	Howie B et al., Nat Genet, 2012	2012
MACH / minimac2	Phasing & Imputation tool	Fuchsberger C et al., Bioinformatics, 2015	2015
MACH / minimac3	Phasing & Imputation tool	Das S et al., Nat Genet, 2016	2016
MACH / minimac4	Phasing & Imputation tool	NA	NA
MACH / minimac	Phasing & Imputation tool	Li Y et al., Genet Epidemiol, 2010	2010
QUILT1	Phasing & Imputation tool	Davies RW et al., Nat Genet, 2021	2021
QUILT2	Phasing & Imputation tool	Li, Z., Albrechtsen, A. & Davies, R. W. Rapid and accurate genotype imputation from low coverage short read, long…	NA
SHAPEIT1	Phasing & Imputation tool	Delaneau O et al., Nat Methods, 2011	2011
SHAPEIT2	Phasing & Imputation tool	Delaneau O et al., Nat Methods, 2013	2013
SHAPEIT3	Phasing & Imputation tool	O'Connell J et al., Nat Genet, 2016	2016
SHAPEIT4	Phasing & Imputation tool	Delaneau O et al., Nat Commun, 2019	2019
SHAPEIT5	Phasing & Imputation tool	Hofmeister RJ et al., Nat Genet, 2023	2023
fastPHASE	Phasing & Imputation tool	Scheet P et al., Am J Hum Genet, 2006	2006
Review-Das	Review	Das S et al., Annu Rev Genomics Hum Genet, 2018	2018
Review-Li	Review	Li Y et al., Annu Rev Genomics Hum Genet, 2009	2009
Review-Marchini	Review	Marchini J et al., Nat Rev Genet, 2010	2010
1KG SV imputation panel	Structural variants imputation panel	Noyvert, B., Erzurumluoglu, A. M., Drichel, D., Omland, S., Andlauer, T. F., Mueller, S., ... & Ding, Z. (2023).…	NA
ImputeSV	Structural variants imputation tool & panel	Bai WY et al., Nat Genet, 2026	2026

Imputation panel

1000 Genomes Phase 3 Version 5 (1KGp3v5) (1KG)

Tool

PUBMED_LINK

26432245

URL

https://www.internationalgenome.org/

TITLE

A global reference for human genetic variation.

Main citation

1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, ...&, Abecasis GR. (2015) A global reference for human genetic variation. Nature, 526 (7571) 68-74. doi:10.1038/nature15393. PMID 26432245

ABSTRACT

The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

DOI

10.1038/nature15393

1KG+7K

Tool

KEYWORDS

Japanese population-specific reference panel

PREPRINT_DOI

10.21203/rs.3.rs-3194976/v1

CKB reference panel (CKB)

Tool

PUBMED_LINK

37870428

FULL NAME

China Kadoorie Biobank

URL

https://db.cngb.org/imputation/

TITLE

A high-resolution haplotype-resolved Reference panel constructed from the China Kadoorie Biobank Study.

Main citation

Yu C, Lan X, Tao Y, Guo Y, ...&, Li L. (2023) A high-resolution haplotype-resolved Reference panel constructed from the China Kadoorie Biobank Study. Nucleic Acids Res, 51 (21) 11770-11782. doi:10.1093/nar/gkad779. PMID 37870428

ABSTRACT

Precision medicine depends on high-accuracy individual-level genotype data. However, the whole-genome sequencing (WGS) is still not suitable for gigantic studies due to budget constraints. It is particularly important to construct highly accurate haplotype reference panel for genotype imputation. In this study, we used 10 000 samples with medium-depth WGS to construct a reference panel that we named the CKB reference panel. By imputing microarray datasets, it showed that the CKB panel outperformed compared panels in terms of both the number of well-imputed variants and imputation accuracy. In addition, we have completed the imputation of 100 706 microarrays with the CKB panel, and the after-imputed data is the hitherto largest whole genome data of the Chinese population. Furthermore, in the GWAS analysis of real phenotype height, the number of tested SNPs tripled and the number of significant SNPs doubled after imputation. Finally, we developed an online server for offering free genotype imputation service based on the CKB reference panel (https://db.cngb.org/imputation/). We believe that the CKB panel is of great value for imputing microarray or low-coverage genotype data of Chinese population, and potentially mixed populations. The imputation-completed 100 706 microarray data are enormous and precious resources of population genetic studies for complex traits and diseases.

DOI

10.1093/nar/gkad779

ChinaMAP panel (ChinaMAP)

Tool

PUBMED_LINK

34489580

FULL NAME

China Metabolic Analytics Project

URL

http://www.mbiobank.com/

TITLE

The ChinaMAP reference panel for the accurate genotype imputation in Chinese populations.

Main citation

Li L, Huang P, Sun X, Wang S, ...&, Wang W. (2021) The ChinaMAP reference panel for the accurate genotype imputation in Chinese populations. Cell Res, 31 (12) 1308-1310. doi:10.1038/s41422-021-00564-z. PMID 34489580

DOI

10.1038/s41422-021-00564-z

GenomeAsia 100K

Tool

PUBMED_LINK

31802016

URL

https://www.genomeasia100k.org/

TITLE

The GenomeAsia 100K Project enables genetic discoveries across Asia.

Main citation

GenomeAsia100K Consortium. (2019) The GenomeAsia 100K Project enables genetic discoveries across Asia. Nature, 576 (7785) 106-111. doi:10.1038/s41586-019-1793-z. PMID 31802016

ABSTRACT

The underrepresentation of non-Europeans in human genetic studies so far has limited the diversity of individuals in genomic datasets and led to reduced medical relevance for a large proportion of the world's population. Population-specific reference genome datasets as well as genome-wide association studies in diverse populations are needed to address this issue. Here we describe the pilot phase of the GenomeAsia 100K Project. This includes a whole-genome sequencing reference dataset from 1,739 individuals of 219 population groups and 64 countries across Asia. We catalogue genetic variation, population structure, disease associations and founder effects. We also explore the use of this dataset in imputation, to facilitate genetic studies in populations across Asia and worldwide.

DOI

10.1038/s41586-019-1793-z

HGDP+1kGP

Tool

PUBMED_LINK

38749656

FULL NAME

Human Genome Diversity Project + 1000 Genomes project

URL

https://gnomad.broadinstitute.org/news/2020-10-gnomad-v3-1-new-content-methods-annotations-and-data-availability/#the-gnomad-hgdp-and-1000-genomes-callset

TITLE

A harmonized public resource of deeply sequenced diverse human genomes.

Main citation

Koenig Z, Yohannes MT, Nkambule LL, Zhao X, ...&, Martin AR. (2024) A harmonized public resource of deeply sequenced diverse human genomes. Genome Res, 34 (5) 796-809. doi:10.1101/gr.278378.123. PMID 38749656

ABSTRACT

Underrepresented populations are often excluded from genomic studies owing in part to a lack of resources supporting their analyses. The 1000 Genomes Project (1kGP) and Human Genome Diversity Project (HGDP), which have recently been sequenced to high coverage, are valuable genomic resources because of the global diversity they capture and their open data sharing policies. Here, we harmonized a high-quality set of 4094 whole genomes from 80 populations in the HGDP and 1kGP with data from the Genome Aggregation Database (gnomAD) and identified over 153 million high-quality SNVs, indels, and SVs. We performed a detailed ancestry analysis of this cohort, characterizing population structure and patterns of admixture across populations, analyzing site frequency spectra, and measuring variant counts at global and subcontinental levels. We also show substantial added value from this data set compared with the prior versions of the component resources, typically combined via liftOver and variant intersection; for example, we catalog millions of new genetic variants, mostly rare, compared with previous releases. In addition to unrestricted individual-level public release, we provide detailed tutorials for conducting many of the most common quality-control steps and analyses with these data in a scalable cloud-computing environment and publicly release this new phased joint callset for use as a haplotype resource in phasing and imputation pipelines. This jointly called reference panel will serve as a key resource to support research of diverse ancestry populations.

DOI

10.1101/gr.278378.123

HRC

Tool

PUBMED_LINK

27548312

URL

http://www.haplotype-reference-consortium.org/

TITLE

A reference panel of 64,976 haplotypes for genotype imputation.

Main citation

McCarthy S, Das S, Kretzschmar W, Delaneau O, ...&, Haplotype Reference Consortium. (2016) A reference panel of 64,976 haplotypes for genotype imputation. Nat Genet, 48 (10) 1279-83. doi:10.1038/ng.3643. PMID 27548312

ABSTRACT

We describe a reference panel of 64,976 human haplotypes at 39,235,157 SNPs constructed using whole-genome sequence data from 20 studies of predominantly European ancestry. Using this resource leads to accurate genotype imputation at minor allele frequencies as low as 0.1% and a large increase in the number of SNPs tested in association studies, and it can help to discover and refine causal loci. We describe remote server resources that allow researchers to carry out imputation and phasing consistently and efficiently.

DOI

10.1038/ng.3643

NARD

Tool

PUBMED_LINK

31640730

FULL NAME

Northeast Asian Reference Database

URL

https://nard.macrogen.com/

TITLE

NARD: whole-genome reference panel of 1779 Northeast Asians improves imputation accuracy of rare and low-frequency variants.

Main citation

Yoo SK, Kim CU, Kim HL, Kim S, ...&, Seo JS. (2019) NARD: whole-genome reference panel of 1779 Northeast Asians improves imputation accuracy of rare and low-frequency variants. Genome Med, 11 (1) 64. doi:10.1186/s13073-019-0677-z. PMID 31640730

ABSTRACT

Here, we present the Northeast Asian Reference Database (NARD), including whole-genome sequencing data of 1779 individuals from Korea, Mongolia, Japan, China, and Hong Kong. NARD provides the genetic diversity of Korean (n = 850) and Mongolian (n = 384) ancestries that were not present in the 1000 Genomes Project Phase 3 (1KGP3). We combined and re-phased the genotypes from NARD and 1KGP3 to construct a union set of haplotypes. This approach established a robust imputation reference panel for Northeast Asians, which yields the greatest imputation accuracy of rare and low-frequency variants compared with the existing panels. NARD imputation panel is available at https://nard.macrogen.com/ .

DOI

10.1186/s13073-019-0677-z

NARD2

Tool

PUBMED_LINK

37556544

FULL NAME

Northeast Asian Reference Database 2

URL

https://nard.macrogen.com/

TITLE

A whole-genome reference panel of 14,393 individuals for East Asian populations accelerates discovery of rare functional variants.

Main citation

Choi J, Kim S, Kim J, Son HY, ...&, Im SW. (2023) A whole-genome reference panel of 14,393 individuals for East Asian populations accelerates discovery of rare functional variants. Sci Adv, 9 (32) eadg6319. doi:10.1126/sciadv.adg6319. PMID 37556544

ABSTRACT

Underrepresentation of non-European (EUR) populations hinders growth of global precision medicine. Resources such as imputation reference panels that match the study population are necessary to find low-frequency variants with substantial effects. We created a reference panel consisting of 14,393 whole-genome sequences including more than 11,000 Asian individuals. Genome-wide association studies were conducted using the reference panel and a population-specific genotype array of 72,298 subjects for eight phenotypes. This panel yields improved imputation accuracy of rare and low-frequency variants within East Asian populations compared with the largest reference panel. Thirty-nine previously unidentified associations were found, and more than half of the variants were East Asian specific. We discovered genes with rare protein-altering variants, including LTBP1 for height and GPR75 for body mass index, as well as putative regulatory mechanisms for rare noncoding variants with cell type-specific effects. We suggest that this dataset will add to the potential value of Asian precision medicine.

DOI

10.1126/sciadv.adg6319

Nyuwa Genome Resource Phase 1

Tool

PUBMED_LINK

34788621

URL

http://bigdata.ibp.ac.cn/refpanel/getstarted.php

TITLE

NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population.

Main citation

Zhang P, Luo H, Li Y, Wang Y, ...&, He S. (2021) NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population. Cell Rep, 37 (7) 110017. doi:10.1016/j.celrep.2021.110017. PMID 34788621

ABSTRACT

The lack of haplotype reference panels and whole-genome sequencing resources specific to the Chinese population has greatly hindered genetic studies in the world's largest population. Here, we present the NyuWa genome resource, based on deep (26.2×) sequencing of 2,999 Chinese individuals, and construct a NyuWa reference panel of 5,804 haplotypes and 19.3 million variants, which is a high-quality publicly available Chinese population-specific reference panel with thousands of samples. Compared with other panels, the NyuWa reference panel reduces the Han Chinese imputation error rate by a margin ranging from 30% to 51%. Population structure and imputation simulation tests support the applicability of one integrated reference panel for northern and southern Chinese. In addition, a total of 22,504 loss-of-function variants in coding and noncoding genes are identified, including 11,493 novel variants. These results highlight the value of the NyuWa genome resource in facilitating genetic research in Chinese and Asian populations.

DOI

10.1016/j.celrep.2021.110017

PGG.Han panel (PGG.Han)

Tool

PUBMED_LINK

31584086

URL

https://www.biosino.org/pgghan2/index#home1

TITLE

PGG.Han: the Han Chinese genome database and analysis platform.

Main citation

Gao Y, Zhang C, Yuan L, Ling Y, ...&, Xu S. (2020) PGG.Han: the Han Chinese genome database and analysis platform. Nucleic Acids Res, 48 (D1) D971-D976. doi:10.1093/nar/gkz829. PMID 31584086

ABSTRACT

As the largest ethnic group in the world, the Han Chinese population is nonetheless underrepresented in global efforts to catalogue the genomic variability of natural populations. Here, we developed the PGG.Han, a population genome database to serve as the central repository for the genomic data of the Han Chinese Genome Initiative (Phase I). In its current version, the PGG.Han archives whole-genome sequences or high-density genome-wide single-nucleotide variants (SNVs) of 114 783 Han Chinese individuals (a.k.a. the Han100K), representing geographical sub-populations covering 33 of the 34 administrative divisions of China, as well as Singapore. The PGG.Han provides: (i) an interactive interface for visualization of the fine-scale genetic structure of the Han Chinese population; (ii) genome-wide allele frequencies of hierarchical sub-populations; (iii) ancestry inference for individual samples and controlling population stratification based on nested ancestry informative markers (AIMs) panels; (iv) population-structure-aware shared control data for genotype-phenotype association studies (e.g. GWASs) and (v) a Han-Chinese-specific reference panel for genotype imputation. Computational tools are implemented into the PGG.Han, and an online user-friendly interface is provided for data analysis and results visualization. The PGG.Han database is freely accessible via http://www.pgghan.org or https://www.hanchinesegenomes.org.

DOI

10.1093/nar/gkz829

South and East Asian Reference Database (SEAD)

Tool

FULL NAME

South and East Asian Reference Database

URL

https://imputationserver.westlake.edu.cn/

PREPRINT_DOI

10.1101/2023.12.23.23300480

Main citation

Yang, M. Y., Zhong, J. D., Li, X., Bai, W. Y., Yuan, C. D., Qiu, M. C., ... & Zheng, H. F. (2023). SEAD: an augmented reference panel with 22,134 haplotypes boosts the rare variants imputation and GWAS analysis in Asian population. medRxiv, 2023-12.

TOPMED

Tool

PUBMED_LINK

33568819

FULL NAME

Trans-Omics for Precision Medicine

URL

https://imputation.biodatacatalyst.nhlbi.nih.gov/#!

TITLE

Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program.

Main citation

Taliun D, Harris DN, Kessler MD, Carlson J, ...&, Abecasis GR. (2021) Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature, 590 (7845) 290-299. doi:10.1038/s41586-021-03205-y. PMID 33568819

ABSTRACT

The Trans-Omics for Precision Medicine (TOPMed) programme seeks to elucidate the genetic architecture and biology of heart, lung, blood and sleep disorders, with the ultimate goal of improving diagnosis, treatment and prevention of these diseases. The initial phases of the programme focused on whole-genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here we describe the TOPMed goals and design as well as the available resources and early insights obtained from the sequence data. The resources include a variant browser, a genotype imputation server, and genomic and phenotypic data that are available through dbGaP (Database of Genotypes and Phenotypes). In the first 53,831 TOPMed samples, we detected more than 400 million single-nucleotide and insertion or deletion variants after alignment with the reference genome. Additional previously undescribed variants were detected through assembly of unmapped reads and customized analysis in highly variable loci. Among the more than 400 million detected variants, 97% have frequencies of less than 1% and 46% are singletons that are present in only one individual (53% among unrelated individuals). These rare variants provide insights into mutational processes and recent human evolutionary history. The extensive catalogue of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and noncoding sequence variants to phenotypic variation. Furthermore, combining TOPMed haplotypes with modern imputation methods improves the power and reach of genome-wide association studies to include variants down to a frequency of approximately 0.01%.

DOI

10.1038/s41586-021-03205-y

WBBC panel (WBBC)

Tool

PUBMED_LINK

35618720

URL

https://imputationserver.westlake.edu.cn/

TITLE

Genomic analyses of 10,376 individuals in the Westlake BioBank for Chinese (WBBC) pilot project.

Main citation

Cong PK, Bai WY, Li JC, Yang MY, ...&, Zheng HF. (2022) Genomic analyses of 10,376 individuals in the Westlake BioBank for Chinese (WBBC) pilot project. Nat Commun, 13 (1) 2939. doi:10.1038/s41467-022-30526-x. PMID 35618720

ABSTRACT

We initiate the Westlake BioBank for Chinese (WBBC) pilot project with 4,535 whole-genome sequencing (WGS) individuals and 5,841 high-density genotyping individuals, and identify 81.5 million SNPs and INDELs, of which 38.5% are absent in dbSNP Build 151. We provide a population-specific reference panel and an online imputation server ( https://wbbc.westlake.edu.cn/ ) which could yield substantial improvement of imputation performance in Chinese population, especially for low-frequency and rare variants. By analyzing the singleton density of the WGS data, we find selection signatures in SNX29, DNAH1 and WDR1 genes, and the derived alleles of the alcohol metabolism genes (ADH1A and ADH1B) emerge around 7,000 years ago and tend to be more common from 4,000 years ago in East Asia. Genetic evidence supports the corresponding geographical boundaries of the Qinling-Huaihe Line and Nanling Mountains, which separate the Han Chinese into subgroups, and we reveal that North Han was more homogeneous than South Han.

DOI

10.1038/s41467-022-30526-x

Imputation server

CNGB Imputation Service (CNGB)

Tool

PUBMED_LINK

37870428

FULL NAME

China National GeneBank

URL

https://db.cngb.org/imputation/

TITLE

A high-resolution haplotype-resolved Reference panel constructed from the China Kadoorie Biobank Study.

Main citation

Yu C, Lan X, Tao Y, Guo Y, ...&, Li L. (2023) A high-resolution haplotype-resolved Reference panel constructed from the China Kadoorie Biobank Study. Nucleic Acids Res, 51 (21) 11770-11782. doi:10.1093/nar/gkad779. PMID 37870428

ABSTRACT

Precision medicine depends on high-accuracy individual-level genotype data. However, the whole-genome sequencing (WGS) is still not suitable for gigantic studies due to budget constraints. It is particularly important to construct highly accurate haplotype reference panel for genotype imputation. In this study, we used 10 000 samples with medium-depth WGS to construct a reference panel that we named the CKB reference panel. By imputing microarray datasets, it showed that the CKB panel outperformed compared panels in terms of both the number of well-imputed variants and imputation accuracy. In addition, we have completed the imputation of 100 706 microarrays with the CKB panel, and the after-imputed data is the hitherto largest whole genome data of the Chinese population. Furthermore, in the GWAS analysis of real phenotype height, the number of tested SNPs tripled and the number of significant SNPs doubled after imputation. Finally, we developed an online server for offering free genotype imputation service based on the CKB reference panel (https://db.cngb.org/imputation/). We believe that the CKB panel is of great value for imputing microarray or low-coverage genotype data of Chinese population, and potentially mixed populations. The imputation-completed 100 706 microarray data are enormous and precious resources of population genetic studies for complex traits and diseases.

DOI

10.1093/nar/gkad779

ChinaMAP

Tool

PUBMED_LINK

34489580

FULL NAME

China Metabolic Analytics Project

URL

http://www.mbiobank.com/

TITLE

The ChinaMAP reference panel for the accurate genotype imputation in Chinese populations.

Main citation

Li L, Huang P, Sun X, Wang S, ...&, Wang W. (2021) The ChinaMAP reference panel for the accurate genotype imputation in Chinese populations. Cell Res, 31 (12) 1308-1310. doi:10.1038/s41422-021-00564-z. PMID 34489580

DOI

10.1038/s41422-021-00564-z

Michigan Imputation Server (Michigan)

Tool

PUBMED_LINK

27571263

URL

https://imputationserver.sph.umich.edu/index.html#!

TITLE

Next-generation genotype imputation service and methods.

Main citation

Das S, Forer L, Schönherr S, Sidore C, ...&, Fuchsberger C. (2016) Next-generation genotype imputation service and methods. Nat Genet, 48 (10) 1284-1287. doi:10.1038/ng.3656. PMID 27571263

ABSTRACT

Genotype imputation is a key component of genetic association studies, where it increases power, facilitates meta-analysis, and aids interpretation of signals. Genotype imputation is computationally demanding and, with current tools, typically requires access to a high-performance computing cluster and to a reference panel of sequenced genomes. Here we describe improvements to imputation machinery that reduce computational requirements by more than an order of magnitude with no loss of accuracy in comparison to standard imputation tools. We also describe a new web-based service for imputation that facilitates access to new reference panels and greatly improves user experience and productivity.

DOI

10.1038/ng.3656

NyuWa Imputation Server (NyuWa)

Tool

PUBMED_LINK

34788621

URL

http://bigdata.ibp.ac.cn/refpanel/getstarted.php

TITLE

NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population.

Main citation

Zhang P, Luo H, Li Y, Wang Y, ...&, He S. (2021) NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population. Cell Rep, 37 (7) 110017. doi:10.1016/j.celrep.2021.110017. PMID 34788621

ABSTRACT

The lack of haplotype reference panels and whole-genome sequencing resources specific to the Chinese population has greatly hindered genetic studies in the world's largest population. Here, we present the NyuWa genome resource, based on deep (26.2×) sequencing of 2,999 Chinese individuals, and construct a NyuWa reference panel of 5,804 haplotypes and 19.3 million variants, which is a high-quality publicly available Chinese population-specific reference panel with thousands of samples. Compared with other panels, the NyuWa reference panel reduces the Han Chinese imputation error rate by a margin ranging from 30% to 51%. Population structure and imputation simulation tests support the applicability of one integrated reference panel for northern and southern Chinese. In addition, a total of 22,504 loss-of-function variants in coding and noncoding genes are identified, including 11,493 novel variants. These results highlight the value of the NyuWa genome resource in facilitating genetic research in Chinese and Asian populations.

DOI

10.1016/j.celrep.2021.110017

PGG.Han

Tool

PUBMED_LINK

31584086

URL

https://www.biosino.org/pgghan2/login

TITLE

PGG.Han: the Han Chinese genome database and analysis platform.

Main citation

Gao Y, Zhang C, Yuan L, Ling Y, ...&, Xu S. (2020) PGG.Han: the Han Chinese genome database and analysis platform. Nucleic Acids Res, 48 (D1) D971-D976. doi:10.1093/nar/gkz829. PMID 31584086

ABSTRACT

As the largest ethnic group in the world, the Han Chinese population is nonetheless underrepresented in global efforts to catalogue the genomic variability of natural populations. Here, we developed the PGG.Han, a population genome database to serve as the central repository for the genomic data of the Han Chinese Genome Initiative (Phase I). In its current version, the PGG.Han archives whole-genome sequences or high-density genome-wide single-nucleotide variants (SNVs) of 114 783 Han Chinese individuals (a.k.a. the Han100K), representing geographical sub-populations covering 33 of the 34 administrative divisions of China, as well as Singapore. The PGG.Han provides: (i) an interactive interface for visualization of the fine-scale genetic structure of the Han Chinese population; (ii) genome-wide allele frequencies of hierarchical sub-populations; (iii) ancestry inference for individual samples and controlling population stratification based on nested ancestry informative markers (AIMs) panels; (iv) population-structure-aware shared control data for genotype-phenotype association studies (e.g. GWASs) and (v) a Han-Chinese-specific reference panel for genotype imputation. Computational tools are implemented into the PGG.Han, and an online user-friendly interface is provided for data analysis and results visualization. The PGG.Han database is freely accessible via http://www.pgghan.org or https://www.hanchinesegenomes.org.

DOI

10.1093/nar/gkz829

Sanger

Tool

PUBMED_LINK

27548312

URL

https://imputation.sanger.ac.uk/

TITLE

A reference panel of 64,976 haplotypes for genotype imputation.

Main citation

McCarthy S, Das S, Kretzschmar W, Delaneau O, ...&, Haplotype Reference Consortium. (2016) A reference panel of 64,976 haplotypes for genotype imputation. Nat Genet, 48 (10) 1279-83. doi:10.1038/ng.3643. PMID 27548312

ABSTRACT

We describe a reference panel of 64,976 human haplotypes at 39,235,157 SNPs constructed using whole-genome sequence data from 20 studies of predominantly European ancestry. Using this resource leads to accurate genotype imputation at minor allele frequencies as low as 0.1% and a large increase in the number of SNPs tested in association studies, and it can help to discover and refine causal loci. We describe remote server resources that allow researchers to carry out imputation and phasing consistently and efficiently.

DOI

10.1038/ng.3643

TOPMED

Tool

PUBMED_LINK

33568819

FULL NAME

Trans-Omics for Precision Medicine

URL

https://imputation.biodatacatalyst.nhlbi.nih.gov/#!

TITLE

Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program.

Main citation

Taliun D, Harris DN, Kessler MD, Carlson J, ...&, Abecasis GR. (2021) Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature, 590 (7845) 290-299. doi:10.1038/s41586-021-03205-y. PMID 33568819

ABSTRACT

The Trans-Omics for Precision Medicine (TOPMed) programme seeks to elucidate the genetic architecture and biology of heart, lung, blood and sleep disorders, with the ultimate goal of improving diagnosis, treatment and prevention of these diseases. The initial phases of the programme focused on whole-genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here we describe the TOPMed goals and design as well as the available resources and early insights obtained from the sequence data. The resources include a variant browser, a genotype imputation server, and genomic and phenotypic data that are available through dbGaP (Database of Genotypes and Phenotypes). In the first 53,831 TOPMed samples, we detected more than 400 million single-nucleotide and insertion or deletion variants after alignment with the reference genome. Additional previously undescribed variants were detected through assembly of unmapped reads and customized analysis in highly variable loci. Among the more than 400 million detected variants, 97% have frequencies of less than 1% and 46% are singletons that are present in only one individual (53% among unrelated individuals). These rare variants provide insights into mutational processes and recent human evolutionary history. The extensive catalogue of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and noncoding sequence variants to phenotypic variation. Furthermore, combining TOPMed haplotypes with modern imputation methods improves the power and reach of genome-wide association studies to include variants down to a frequency of approximately 0.01%.

DOI

10.1038/s41586-021-03205-y

Westlake Imputation Server

Tool

PUBMED_LINK

35618720

URL

https://imputationserver.westlake.edu.cn/

TITLE

Genomic analyses of 10,376 individuals in the Westlake BioBank for Chinese (WBBC) pilot project.

Main citation

Cong PK, Bai WY, Li JC, Yang MY, ...&, Zheng HF. (2022) Genomic analyses of 10,376 individuals in the Westlake BioBank for Chinese (WBBC) pilot project. Nat Commun, 13 (1) 2939. doi:10.1038/s41467-022-30526-x. PMID 35618720

ABSTRACT

We initiate the Westlake BioBank for Chinese (WBBC) pilot project with 4,535 whole-genome sequencing (WGS) individuals and 5,841 high-density genotyping individuals, and identify 81.5 million SNPs and INDELs, of which 38.5% are absent in dbSNP Build 151. We provide a population-specific reference panel and an online imputation server (https://wbbc.westlake.edu.cn/) which could yield substantial improvement of imputation performance in Chinese population, especially for low-frequency and rare variants. By analyzing the singleton density of the WGS data, we find selection signatures in SNX29, DNAH1 and WDR1 genes, and the derived alleles of the alcohol metabolism genes (ADH1A and ADH1B) emerge around 7,000 years ago and tend to be more common from 4,000 years ago in East Asia. Genetic evidence supports the corresponding geographical boundaries of the Qinling-Huaihe Line and Nanling Mountains, which separate the Han Chinese into subgroups, and we reveal that North Han was more homogeneous than South Han.

DOI

10.1038/s41467-022-30526-x

Other tools

RESHAPE

Tool

PUBMED_LINK

38745108

FULL NAME

REcombine and Share HAPlotypEs

DESCRIPTION

RESHAPE removes sample-level genetic information from a reference panel to create a synthetic reference panel. Given a genetic map and the VCF/BCF of a reference panel, RESHAPE outputs a VCF/BCF of the same size in which each haplotype is a mosaic of the original haplotypes of the reference panel.

URL

https://github.com/TheoCavinato/RESHAPE

TITLE

A resampling-based approach to share reference panels.

Main citation

Cavinato T, Rubinacci S, Malaspinas AS, Delaneau O. (2024) A resampling-based approach to share reference panels. Nat Comput Sci, 4 (5) 360-366. doi:10.1038/s43588-024-00630-7. PMID 38745108

ABSTRACT

For many genome-wide association studies, imputing genotypes from a haplotype reference panel is a necessary step. Over the past 15 years, reference panels have become larger and more diverse, leading to improvements in imputation accuracy. However, the latest generation of reference panels is subject to restrictions on data sharing due to concerns about privacy, limiting their usefulness for genotype imputation. In this context, here we propose RESHAPE, a method that employs a recombination Poisson process on a reference panel to simulate the genomes of hypothetical descendants after multiple generations. This data transformation helps to protect against re-identification threats and preserves data attributes, such as linkage disequilibrium patterns and, to some degree, identity-by-descent sharing, allowing for genotype imputation. Our experiments on gold-standard datasets show that simulated descendants up to eight generations can serve as reference panels without substantially reducing genotype imputation accuracy.

DOI

10.1038/s43588-024-00630-7
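
Illustrative sketch (not the RESHAPE implementation): the description and abstract above say that breakpoints come from a recombination Poisson process and that the output panel has the same size as the input, with each haplotype a mosaic of the originals. One simple way to get both properties is to place breakpoints along the genetic map with exponential gaps and, at each breakpoint, swap the sources of two output haplotypes. The function name `mosaic_panel`, the swap scheme, and the rate formula are assumptions made for this example; the real tool reads VCF/BCF and a genetic map file.

```python
import random


def mosaic_panel(haplotypes, positions_cm, generations, seed=0):
    """Return a synthetic panel whose rows are mosaics of the input rows.

    haplotypes   : list of equal-length lists of 0/1 alleles
    positions_cm : genetic position (cM) of each site, ascending
    generations  : hypothetical number of generations to simulate
    """
    n_hap = len(haplotypes)
    n_sites = len(positions_cm)
    if generations <= 0 or n_hap < 2 or n_sites == 0:
        return [list(h) for h in haplotypes]

    rng = random.Random(seed)
    # Assumed rate: each haplotype sees about generations/100 breakpoints
    # per cM; one swap touches two haplotypes, hence the division by 200.
    rate = n_hap * generations / 200.0

    source = list(range(n_hap))  # source[i] = input row copied by output i
    out = [[0] * n_sites for _ in range(n_hap)]
    next_break = positions_cm[0] + rng.expovariate(rate)

    for j, pos in enumerate(positions_cm):
        # Apply every breakpoint that falls before this site.
        while next_break <= pos:
            a, b = rng.sample(range(n_hap), 2)
            source[a], source[b] = source[b], source[a]
            next_break += rng.expovariate(rate)
        for i in range(n_hap):
            out[i][j] = haplotypes[source[i]][j]
    return out
```

Because the sources are only permuted at each site, per-site allele counts in the synthetic panel equal those in the original, while no output row is guaranteed to match any single input row end to end.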

Phasing & Imputation tool

BEAGLE

Tool

PUBMED_LINK

17924348

URL

https://faculty.washington.edu/browning/beagle/beagle.html

TITLE

Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering.

Main citation

Browning SR, Browning BL. (2007) Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am J Hum Genet, 81 (5) 1084-97. doi:10.1086/521987. PMID 17924348

ABSTRACT

Whole-genome association studies present many new statistical and computational challenges due to the large quantity of data obtained. One of these challenges is haplotype inference; methods for haplotype inference designed for small data sets from candidate-gene studies do not scale well to the large number of individuals genotyped in whole-genome association studies. We present a new method and software for inference of haplotype phase and missing data that can accurately phase data from whole-genome association studies, and we present the first comparison of haplotype-inference methods for real and simulated data sets with thousands of genotyped individuals. We find that our method outperforms existing methods in terms of both speed and accuracy for large data sets with thousands of individuals and densely spaced genetic markers, and we use our method to phase a real data set of 3,002 individuals genotyped for 490,032 markers in 3.1 days of computing time, with 99% of masked alleles imputed correctly. Our method is implemented in the Beagle software package, which is freely available.

DOI

10.1086/521987

BEAGLE4

Tool

PUBMED_LINK

26748515

DESCRIPTION

(beagle 4.1)

URL

https://faculty.washington.edu/browning/beagle/beagle.html

TITLE

Genotype Imputation with Millions of Reference Samples.

Main citation

Browning BL, Browning SR. (2016) Genotype Imputation with Millions of Reference Samples. Am J Hum Genet, 98 (1) 116-26. doi:10.1016/j.ajhg.2015.11.020. PMID 26748515

ABSTRACT

We present a genotype imputation method that scales to millions of reference samples. The imputation method, based on the Li and Stephens model and implemented in Beagle v.4.1, is parallelized and memory efficient, making it well suited to multi-core computer processors. It achieves fast, accurate, and memory-efficient genotype imputation by restricting the probability model to markers that are genotyped in the target samples and by performing linear interpolation to impute ungenotyped variants. We compare Beagle v.4.1 with Impute2 and Minimac3 by using 1000 Genomes Project data, UK10K Project data, and simulated data. All three methods have similar accuracy but different memory requirements and different computation times. When imputing 10 Mb of sequence data from 50,000 reference samples, Beagle's throughput was more than 100× greater than Impute2's throughput on our computer servers. When imputing 10 Mb of sequence data from 200,000 reference samples in VCF format, Minimac3 consumed 26× more memory per computational thread and 15× more CPU time than Beagle. We demonstrate that Beagle v.4.1 scales to much larger reference panels by performing imputation from a simulated reference panel having 5 million samples and a mean marker density of one marker per four base pairs.

DOI

10.1016/j.ajhg.2015.11.020
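
Illustrative sketch (not Beagle code): the abstract above says the probability model is restricted to markers genotyped in the target samples, and ungenotyped variants are then imputed by linear interpolation. The example below shows one reading of that idea: reference-haplotype state probabilities at the two flanking genotyped markers are interpolated by position, and the probability of the alternate allele is the summed weight of reference haplotypes that carry it. The function name, argument layout, and the use of a single position coordinate are assumptions for illustration.

```python
def interpolate_alt_prob(pos, left_pos, right_pos,
                         left_probs, right_probs, ref_alleles):
    """Probability of the alternate allele at an ungenotyped marker.

    pos                     : position of the ungenotyped marker
    left_pos, right_pos     : positions of the flanking genotyped markers
    left_probs, right_probs : state probabilities over reference haplotypes
                              at the flanking markers (each sums to 1)
    ref_alleles             : 0/1 allele of each reference haplotype at pos
    """
    if not left_pos < right_pos:
        raise ValueError("flanking markers must satisfy left_pos < right_pos")
    # Interpolation weight in [0, 1]: 0 at the left marker, 1 at the right.
    w = (pos - left_pos) / (right_pos - left_pos)
    w = min(1.0, max(0.0, w))
    probs = [(1.0 - w) * l + w * r for l, r in zip(left_probs, right_probs)]
    # Sum the weight of reference haplotypes carrying the alternate allele.
    return sum(p for p, a in zip(probs, ref_alleles) if a == 1)
```

The expensive hidden Markov model computation is needed only at genotyped markers; every ungenotyped marker between them costs one weighted sum.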

BEAGLE5.4 (Imputation)

Tool

PUBMED_LINK

30100085

DESCRIPTION

(beagle 5.4 imputation)

URL

https://faculty.washington.edu/browning/beagle/beagle.html

TITLE

A One-Penny Imputed Genome from Next-Generation Reference Panels.

Main citation

Browning BL, Zhou Y, Browning SR. (2018) A One-Penny Imputed Genome from Next-Generation Reference Panels. Am J Hum Genet, 103 (3) 338-348. doi:10.1016/j.ajhg.2018.07.015. PMID 30100085

ABSTRACT

Genotype imputation is commonly performed in genome-wide association studies because it greatly increases the number of markers that can be tested for association with a trait. In general, one should perform genotype imputation using the largest reference panel that is available because the number of accurately imputed variants increases with reference panel size. However, one impediment to using larger reference panels is the increased computational cost of imputation. We present a new genotype imputation method, Beagle 5.0, which greatly reduces the computational cost of imputation from large reference panels. We compare Beagle 5.0 with Beagle 4.1, Impute4, Minimac3, and Minimac4 using 1000 Genomes Project data, Haplotype Reference Consortium data, and simulated data for 10k, 100k, 1M, and 10M reference samples. All methods produce nearly identical accuracy, but Beagle 5.0 has the lowest computation time and the best scaling of computation time with increasing reference panel size. For 10k, 100k, 1M, and 10M reference samples and 1,000 phased target samples, Beagle 5.0's computation time is 3× (10k), 12× (100k), 43× (1M), and 533× (10M) faster than the fastest alternative method. Cost data from the Amazon Elastic Compute Cloud show that Beagle 5.0 can perform genome-wide imputation from 10M reference samples into 1,000 phased target samples at a cost of less than one US cent per sample.

DOI

10.1016/j.ajhg.2018.07.015

BEAGLE5.4 (Phasing)

Tool

PUBMED_LINK

34478634

DESCRIPTION

(beagle 5.4 phasing)

URL

https://faculty.washington.edu/browning/beagle/beagle.html

TITLE

Fast two-stage phasing of large-scale sequence data.

Main citation

Browning BL, Tian X, Zhou Y, Browning SR. (2021) Fast two-stage phasing of large-scale sequence data. Am J Hum Genet, 108 (10) 1880-1890. doi:10.1016/j.ajhg.2021.08.005. PMID 34478634

ABSTRACT

Haplotype phasing is the estimation of haplotypes from genotype data. We present a fast, accurate, and memory-efficient haplotype phasing method that scales to large-scale SNP array and sequence data. The method uses marker windowing and composite reference haplotypes to reduce memory usage and computation time. It incorporates a progressive phasing algorithm that identifies confidently phased heterozygotes in each iteration and fixes the phase of these heterozygotes in subsequent iterations. For data with many low-frequency variants, such as whole-genome sequence data, the method employs a two-stage phasing algorithm that phases high-frequency markers via progressive phasing in the first stage and phases low-frequency markers via genotype imputation in the second stage. This haplotype phasing method is implemented in the open-source Beagle 5.2 software package. We compare Beagle 5.2 and SHAPEIT 4.2.1 by using expanding subsets of 485,301 UK Biobank samples and 38,387 TOPMed samples. Both methods have very similar accuracy and computation time for UK Biobank SNP array data. However, for TOPMed sequence data, Beagle is more than 20 times faster than SHAPEIT, achieves similar accuracy, and scales to larger sample sizes.

DOI

10.1016/j.ajhg.2021.08.005

EAGLE

Tool

PUBMED_LINK

27270109

DESCRIPTION

(EAGLE1)

URL

https://alkesgroup.broadinstitute.org/Eagle/

TITLE

Fast and accurate long-range phasing in a UK Biobank cohort.

Main citation

Loh PR, Palamara PF, Price AL. (2016) Fast and accurate long-range phasing in a UK Biobank cohort. Nat Genet, 48 (7) 811-6. doi:10.1038/ng.3571. PMID 27270109

ABSTRACT

Recent work has leveraged the extensive genotyping of the Icelandic population to perform long-range phasing (LRP), enabling accurate imputation and association analysis of rare variants in target samples typed on genotyping arrays. Here we develop a fast and accurate LRP method, Eagle, that extends this paradigm to populations with much smaller proportions of genotyped samples by harnessing long (>4-cM) identical-by-descent (IBD) tracts shared among distantly related individuals. We applied Eagle to N ≈ 150,000 samples (0.2% of the British population) from the UK Biobank, and we determined that it is 1-2 orders of magnitude faster than existing methods while achieving similar or better phasing accuracy (switch error rate ≈ 0.3%, corresponding to perfect phase in a majority of 10-Mb segments). We also observed that, when used within an imputation pipeline, Eagle prephasing improved downstream imputation accuracy in comparison to prephasing in batches using existing methods, as necessary to achieve comparable computational cost.

DOI

10.1038/ng.3571
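
Illustrative sketch: the abstract above reports phasing accuracy as a switch error rate. A common way to compute it for one individual is to look only at heterozygous sites, record whether the estimated haplotype order agrees with the truth at each, and count how often that agreement flips between consecutive heterozygous sites. The function below is a minimal version under those assumptions; the name `switch_error_rate` and the input layout are chosen for this example and are not taken from Eagle.

```python
def switch_error_rate(truth, estimate):
    """Switch error rate for one individual.

    truth, estimate : lists of (allele_on_hap1, allele_on_hap2) per site.
    Homozygous sites in the truth carry no phase information and are skipped.
    """
    orientation = []  # True where the estimate matches the truth ordering
    for (t0, t1), (e0, e1) in zip(truth, estimate):
        if t0 == t1:
            continue
        if {e0, e1} != {t0, t1}:
            raise ValueError("estimate and truth disagree on the genotype")
        orientation.append((e0, e1) == (t0, t1))

    if len(orientation) < 2:
        return 0.0
    # A switch is a change of orientation between adjacent het sites.
    switches = sum(1 for a, b in zip(orientation, orientation[1:]) if a != b)
    return switches / (len(orientation) - 1)
```

A fully flipped estimate (every heterozygous site reversed) scores zero, since haplotype labels are arbitrary and only changes of orientation count as errors.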

EAGLE2

Tool

PUBMED_LINK

27694958

DESCRIPTION

(EAGLE2)

URL

https://alkesgroup.broadinstitute.org/Eagle/

TITLE

Reference-based phasing using the Haplotype Reference Consortium panel.

Main citation

Loh PR, Danecek P, Palamara PF, Fuchsberger C, ...&, L Price A. (2016) Reference-based phasing using the Haplotype Reference Consortium panel. Nat Genet, 48 (11) 1443-1448. doi:10.1038/ng.3679. PMID 27694958

ABSTRACT

Haplotype phasing is a fundamental problem in medical and population genetics. Phasing is generally performed via statistical phasing in a genotyped cohort, an approach that can yield high accuracy in very large cohorts but attains lower accuracy in smaller cohorts. Here we instead explore the paradigm of reference-based phasing. We introduce a new phasing algorithm, Eagle2, that attains high accuracy across a broad range of cohort sizes by efficiently leveraging information from large external reference panels (such as the Haplotype Reference Consortium; HRC) using a new data structure based on the positional Burrows-Wheeler transform. We demonstrate that Eagle2 attains a ∼20× speedup and ∼10% increase in accuracy compared to reference-based phasing using SHAPEIT2. On European-ancestry samples, Eagle2 with the HRC panel achieves >2× the accuracy of 1000 Genomes-based phasing. Eagle2 is open source and freely available for HRC-based phasing via the Sanger Imputation Service and the Michigan Imputation Server.

DOI

10.1038/ng.3679

GLIMPSE

Tool

PUBMED_LINK

33414550

FULL NAME

Genotype Likelihoods IMputation and PhaSing mEthod

DESCRIPTION

GLIMPSE is a phasing and imputation method for large-scale low-coverage sequencing studies.

URL

https://odelaneau.github.io/GLIMPSE/

TITLE

Efficient phasing and imputation of low-coverage sequencing data using large reference panels.

Main citation

Rubinacci S, Ribeiro DM, Hofmeister RJ, Delaneau O. (2021) Efficient phasing and imputation of low-coverage sequencing data using large reference panels. Nat Genet, 53 (1) 120-126. doi:10.1038/s41588-020-00756-0. PMID 33414550

ABSTRACT

Low-coverage whole-genome sequencing followed by imputation has been proposed as a cost-effective genotyping approach for disease and population genetics studies. However, its competitiveness against SNP arrays is undermined because current imputation methods are computationally expensive and unable to leverage large reference panels. Here, we describe a method, GLIMPSE, for phasing and imputation of low-coverage sequencing datasets from modern reference panels. We demonstrate its remarkable performance across different coverages and human populations. GLIMPSE achieves imputation of a genome for less than US$1 in computational cost, considerably outperforming other methods and improving imputation accuracy over the full allele frequency range. As a proof of concept, we show that 1× coverage enables effective gene expression association studies and outperforms dense SNP arrays in rare variant burden tests. Overall, this study illustrates the promising potential of low-coverage imputation and suggests a paradigm shift in the design of future genomic studies.

DOI

10.1038/s41588-020-00756-0

IMPUTE

Tool

PUBMED_LINK

17572673

URL

https://jmarchini.org/software/

TITLE

A new multipoint method for genome-wide association studies by imputation of genotypes.

Main citation

Marchini J, Howie B, Myers S, McVean G, ...&, Donnelly P. (2007) A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet, 39 (7) 906-13. doi:10.1038/ng2088. PMID 17572673

ABSTRACT

Genome-wide association studies are set to become the method of choice for uncovering the genetic basis of human diseases. A central challenge in this area is the development of powerful multipoint methods that can detect causal variants that have not been directly genotyped. We propose a coherent analysis framework that treats the problem as one involving missing or uncertain genotypes. Central to our approach is a model-based imputation method for inferring genotypes at observed or unobserved SNPs, leading to improved power over existing methods for multipoint association mapping. Using real genome-wide association study data, we show that our approach (i) is accurate and well calibrated, (ii) provides detailed views of associated regions that facilitate follow-up studies and (iii) can be used to validate and correct data at genotyped markers. A notable future use of our method will be to boost power by combining data from genome-wide scans that use different SNP sets.

DOI

10.1038/ng2088

IMPUTE2

Tool

PUBMED_LINK

19543373

TITLE

A flexible and accurate genotype imputation method for the next generation of genome-wide association studies.

Main citation

Howie BN, Donnelly P, Marchini J. (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet, 5 (6) e1000529. doi:10.1371/journal.pgen.1000529. PMID 19543373

ABSTRACT

Genotype imputation methods are now being widely used in the analysis of genome-wide association studies. Most imputation analyses to date have used the HapMap as a reference dataset, but new reference panels (such as controls genotyped on multiple SNP chips and densely typed samples from the 1,000 Genomes Project) will soon allow a broader range of SNPs to be imputed with higher accuracy, thereby increasing power. We describe a genotype imputation method (IMPUTE version 2) that is designed to address the challenges presented by these new datasets. The main innovation of our approach is a flexible modelling framework that increases accuracy and combines information across multiple reference panels while remaining computationally feasible. We find that IMPUTE v2 attains higher accuracy than other methods when the HapMap provides the sole reference panel, but that the size of the panel constrains the improvements that can be made. We also find that imputation accuracy can be greatly enhanced by expanding the reference panel to contain thousands of chromosomes and that IMPUTE v2 outperforms other methods in this setting at both rare and common SNPs, with overall error rates that are 15%-20% lower than those of the closest competing method. One particularly challenging aspect of next-generation association studies is to integrate information across multiple reference panels genotyped on different sets of SNPs; we show that our approach to this problem has practical advantages over other suggested solutions.

DOI

10.1371/journal.pgen.1000529

IMPUTE4

Tool

PUBMED_LINK

30305743

TITLE

The UK Biobank resource with deep phenotyping and genomic data.

Main citation

Bycroft C, Freeman C, Petkova D, Band G, ...&, Marchini J. (2018) The UK Biobank resource with deep phenotyping and genomic data. Nature, 562 (7726) 203-209. doi:10.1038/s41586-018-0579-z. PMID 30305743

ABSTRACT

The UK Biobank project is a prospective cohort study with deep genetic and phenotypic data collected on approximately 500,000 individuals from across the United Kingdom, aged between 40 and 69 at recruitment. The open resource is unique in its size and scope. A rich variety of phenotypic and health-related information is available on each participant, including biological measurements, lifestyle indicators, biomarkers in blood and urine, and imaging of the body and brain. Follow-up information is provided by linking health and medical records. Genome-wide genotype data have been collected on all participants, providing many opportunities for the discovery of new genetic associations and the genetic bases of complex traits. Here we describe the centralized analysis of the genetic data, including genotype quality, properties of population structure and relatedness of the genetic data, and efficient phasing and genotype imputation that increases the number of testable variants to around 96 million. Classical allelic variation at 11 human leukocyte antigen genes was imputed, resulting in the recovery of signals with known associations between human leukocyte antigen alleles and many diseases.

DOI

10.1038/s41586-018-0579-z

IMPUTE5

Tool

PUBMED_LINK

33196638

TITLE

Genotype imputation using the Positional Burrows Wheeler Transform.

Main citation

Rubinacci S, Delaneau O, Marchini J. (2020) Genotype imputation using the Positional Burrows Wheeler Transform. PLoS Genet, 16 (11) e1009049. doi:10.1371/journal.pgen.1009049. PMID 33196638

ABSTRACT

Genotype imputation is the process of predicting unobserved genotypes in a sample of individuals using a reference panel of haplotypes. In the last 10 years reference panels have increased in size by more than 100 fold. Increasing reference panel size improves accuracy of markers with low minor allele frequencies but poses ever increasing computational challenges for imputation methods. Here we present IMPUTE5, a genotype imputation method that can scale to reference panels with millions of samples. This method continues to refine the observation made in the IMPUTE2 method, that accuracy is optimized via use of a custom subset of haplotypes when imputing each individual. It achieves fast, accurate, and memory-efficient imputation by selecting haplotypes using the Positional Burrows Wheeler Transform (PBWT). By using the PBWT data structure at genotyped markers, IMPUTE5 identifies locally best matching haplotypes and long identical by state segments. The method then uses the selected haplotypes as conditioning states within the IMPUTE model. Using the HRC reference panel, which has ∼65,000 haplotypes, we show that IMPUTE5 is up to 30x faster than MINIMAC4 and up to 3x faster than BEAGLE5.1, and uses less memory than both these methods. Using simulated reference panels we show that IMPUTE5 scales sub-linearly with reference panel size. For example, keeping the number of imputed markers constant, increasing the reference panel size from 10,000 to 1 million haplotypes requires less than twice the computation time. As the reference panel increases in size IMPUTE5 is able to utilize a smaller number of reference haplotypes, thus reducing computational cost.

DOI

10.1371/journal.pgen.1009049
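
Illustrative sketch (not IMPUTE5 code): the abstract above says haplotypes are selected using the Positional Burrows Wheeler Transform at genotyped markers. The core of the PBWT is an ordering of haplotypes, updated site by site, in which haplotypes are sorted by their reversed prefixes, so neighbours in the order share long matches ending at the current site. The example below builds only that ordering with a stable partition at each site; the function name is an assumption, and divergence arrays and the haplotype-selection step are omitted.

```python
def pbwt_prefix_orders(haplotypes):
    """Positional prefix orders for a binary haplotype panel.

    haplotypes : list of equal-length lists of 0/1 alleles
    Returns a list `orders` where orders[k] is the haplotype order after
    the first k sites, sorted by reversed prefix (ties keep input order).
    """
    n_hap = len(haplotypes)
    n_sites = len(haplotypes[0]) if n_hap else 0
    order = list(range(n_hap))
    orders = [order]
    for k in range(n_sites):
        # Stable partition: haplotypes with allele 0 at site k come first,
        # each group preserving the previous relative order.
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        order = zeros + ones
        orders.append(order)
    return orders
```

Each update is linear in the number of haplotypes, and candidate matches for a given haplotype can be read from its neighbours in the current order instead of scanning the whole panel.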

MACH / minimac

Tool

PUBMED_LINK

21058334

DESCRIPTION

(MACH)

URL

http://csg.sph.umich.edu/abecasis/MaCH/index.html

TITLE

MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes.

Main citation

Li Y, Willer CJ, Ding J, Scheet P, ...&, Abecasis GR. (2010) MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol, 34 (8) 816-34. doi:10.1002/gepi.20533. PMID 21058334

ABSTRACT

Genome-wide association studies (GWAS) can identify common alleles that contribute to complex disease susceptibility. Despite the large number of SNPs assessed in each study, the effects of most common SNPs must be evaluated indirectly using either genotyped markers or haplotypes thereof as proxies. We have previously implemented a computationally efficient Markov Chain framework for genotype imputation and haplotyping in the freely available MaCH software package. The approach describes sampled chromosomes as mosaics of each other and uses available genotype and shotgun sequence data to estimate unobserved genotypes and haplotypes, together with useful measures of the quality of these estimates. Our approach is already widely used to facilitate comparison of results across studies as well as meta-analyses of GWAS. Here, we use simulations and experimental genotypes to evaluate its accuracy and utility, considering choices of genotyping panels, reference panel configurations, and designs where genotyping is replaced with shotgun sequencing. Importantly, we show that genotype imputation not only facilitates cross study analyses but also increases power of genetic association studies. We show that genotype imputation of common variants using HapMap haplotypes as a reference is very accurate using either genome-wide SNP data or smaller amounts of data typical in fine-mapping studies. Furthermore, we show the approach is applicable in a variety of populations. Finally, we illustrate how association analyses of unobserved variants will benefit from ongoing advances such as larger HapMap reference panels and whole genome shotgun sequencing technologies.

DOI

10.1002/gepi.20533

MACH / minimac pre-phasing

Tool

PUBMED_LINK

22820512

DESCRIPTION

(pre-phasing, minimac)

URL

https://genome.sph.umich.edu/wiki/Minimac

TITLE

Fast and accurate genotype imputation in genome-wide association studies through pre-phasing.

Main citation

Howie B, Fuchsberger C, Stephens M, Marchini J, ...&, Abecasis GR. (2012) Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet, 44 (8) 955-9. doi:10.1038/ng.2354. PMID 22820512

ABSTRACT

The 1000 Genomes Project and disease-specific sequencing efforts are producing large collections of haplotypes that can be used as reference panels for genotype imputation in genome-wide association studies (GWAS). However, imputing from large reference panels with existing methods imposes a high computational burden. We introduce a strategy called 'pre-phasing' that maintains the accuracy of leading methods while reducing computational costs. We first statistically estimate the haplotypes for each individual within the GWAS sample (pre-phasing) and then impute missing genotypes into these estimated haplotypes. This reduces the computational cost because (i) the GWAS samples must be phased only once, whereas standard methods would implicitly repeat phasing with each reference panel update, and (ii) it is much faster to match a phased GWAS haplotype to one reference haplotype than to match two unphased GWAS genotypes to a pair of reference haplotypes. We implemented our approach in the MaCH and IMPUTE2 frameworks, and we tested it on data sets from the Wellcome Trust Case Control Consortium 2 (WTCCC2), the Genetic Association Information Network (GAIN), the Women's Health Initiative (WHI) and the 1000 Genomes Project. This strategy will be particularly valuable for repeated imputation as reference panels evolve.

DOI

10.1038/ng.2354

MACH / minimac2

Tool

PUBMED_LINK

25338720

DESCRIPTION

(minimac2)

URL

https://genome.sph.umich.edu/wiki/Minimac2

TITLE

minimac2: faster genotype imputation.

Main citation

Fuchsberger C, Abecasis GR, Hinds DA. (2015) minimac2: faster genotype imputation. Bioinformatics, 31 (5) 782-4. doi:10.1093/bioinformatics/btu704. PMID 25338720

ABSTRACT

Genotype imputation is a key step in the analysis of genome-wide association studies. Upcoming very large reference panels, such as those from The 1000 Genomes Project and the Haplotype Consortium, will improve imputation quality of rare and less common variants, but will also increase the computational burden. Here, we demonstrate how the application of software engineering techniques can help to keep imputation broadly accessible. Overall, these improvements speed up imputation by an order of magnitude compared with our previous implementation. Availability and implementation: minimac2, including source code, documentation, and examples, is available at http://genome.sph.umich.edu/wiki/Minimac2

DOI

10.1093/bioinformatics/btu704

MACH / minimac3

Tool

PUBMED_LINK

27571263

DESCRIPTION

(minimac3)

URL

https://genome.sph.umich.edu/wiki/Minimac3

TITLE

Next-generation genotype imputation service and methods.

Main citation

Das S, Forer L, Schönherr S, Sidore C, ...&, Fuchsberger C. (2016) Next-generation genotype imputation service and methods. Nat Genet, 48 (10) 1284-1287. doi:10.1038/ng.3656. PMID 27571263

ABSTRACT

Genotype imputation is a key component of genetic association studies, where it increases power, facilitates meta-analysis, and aids interpretation of signals. Genotype imputation is computationally demanding and, with current tools, typically requires access to a high-performance computing cluster and to a reference panel of sequenced genomes. Here we describe improvements to imputation machinery that reduce computational requirements by more than an order of magnitude with no loss of accuracy in comparison to standard imputation tools. We also describe a new web-based service for imputation that facilitates access to new reference panels and greatly improves user experience and productivity.

DOI

10.1038/ng.3656

MACH / minimac4

Tool

DESCRIPTION

(minimac4)

URL

https://genome.sph.umich.edu/wiki/Minimac4

QUILT1

Tool

PUBMED_LINK

34083788

URL

https://github.com/rwdavies/QUILT

TITLE

Rapid genotype imputation from sequence with reference panels.

Main citation

Davies RW, Kucka M, Su D, Shi S, ...&, Myers S. (2021) Rapid genotype imputation from sequence with reference panels. Nat Genet, 53 (7) 1104-1111. doi:10.1038/s41588-021-00877-0. PMID 34083788

ABSTRACT

Inexpensive genotyping methods are essential to modern genomics. Here we present QUILT, which performs diploid genotype imputation using low-coverage whole-genome sequence data. QUILT employs Gibbs sampling to partition reads into maternal and paternal sets, facilitating rapid haploid imputation using large reference panels. We show this partitioning to be accurate over many megabases, enabling highly accurate imputation close to theoretical limits and outperforming existing methods. Moreover, QUILT can impute accurately using diverse technologies, including long reads from Oxford Nanopore Technologies, and a new form of low-cost barcoded Illumina sequencing called haplotagging, with the latter showing improved accuracy at low coverages. Relative to DNA genotyping microarrays, QUILT offers improved accuracy at reduced cost, particularly for diverse populations that are traditionally underserved in modern genomic analyses, with accuracy nearly doubling at rare SNPs. Finally, QUILT can accurately impute (four-digit) human leukocyte antigen types, the first such method from low-coverage sequence data.

DOI

10.1038/s41588-021-00877-0

QUILT2

Tool

DESCRIPTION

QUILT2 is a fast and memory-efficient method for imputation from low-coverage sequence data. Statistically, QUILT2 operates on a per-read basis and is base-quality aware, so it can accurately impute from diverse inputs, including short reads (e.g. Illumina), long reads that may be noisy (e.g. Oxford Nanopore Technologies), barcoded Illumina sequencing (e.g. haplotagging) and ancient DNA. In addition, QUILT2 can impute both the maternal and fetal genomes using cfDNA NIPT data.

URL

https://github.com/rwdavies/QUILT

PREPRINT_DOI

10.1101/2024.07.18.604149

Main citation

Li, Z., Albrechtsen, A. & Davies, R. W. Rapid and accurate genotype imputation from low coverage short read, long read, and cell free DNA sequence. bioRxiv 2024.07.18.604149 (2024) doi:10.1101/2024.07.18.604149.

SHAPEIT1

Tool

PUBMED_LINK

22138821

DESCRIPTION

(SHAPEIT1)

URL

https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html

TITLE

A linear complexity phasing method for thousands of genomes.

Main citation

Delaneau O, Marchini J, Zagury JF. (2011) A linear complexity phasing method for thousands of genomes. Nat Methods, 9 (2) 179-81. doi:10.1038/nmeth.1785. PMID 22138821

ABSTRACT

Human-disease etiology can be better understood with phase information about diploid sequences. We present a method for estimating haplotypes, using genotype data from unrelated samples or small nuclear families, that leads to improved accuracy and speed compared to several widely used methods. The method, segmented haplotype estimation and imputation tool (SHAPEIT), scales linearly with the number of haplotypes used in each iteration and can be run efficiently on whole chromosomes.

DOI

10.1038/nmeth.1785

SHAPEIT2

Tool

PUBMED_LINK

23269371

DESCRIPTION

(SHAPEIT2)

TITLE

Improved whole-chromosome phasing for disease and population genetic studies.

Main citation

Delaneau O, Zagury JF, Marchini J. (2013) Improved whole-chromosome phasing for disease and population genetic studies. Nat Methods, 10 (1) 5-6. doi:10.1038/nmeth.2307. PMID 23269371

DOI

10.1038/nmeth.2307

SHAPEIT3

Tool

PUBMED_LINK

27270105

DESCRIPTION

(SHAPEIT3)

URL

https://jmarchini.org/shapeit3/

TITLE

Haplotype estimation for biobank-scale data sets.

Main citation

O'Connell J, Sharp K, Shrine N, Wain L, ...&, Marchini J. (2016) Haplotype estimation for biobank-scale data sets. Nat Genet, 48 (7) 817-20. doi:10.1038/ng.3583. PMID 27270105

ABSTRACT

The UK Biobank (UKB) has recently released genotypes on 152,328 individuals together with extensive phenotypic and lifestyle information. We present a new phasing method, SHAPEIT3, that can handle such biobank-scale data sets and results in switch error rates as low as ∼0.3%. The method exhibits O(NlogN) scaling with sample size N, enabling fast and accurate phasing of even larger cohorts.

DOI

10.1038/ng.3583

SHAPEIT4

Tool

PUBMED_LINK

31780650

DESCRIPTION

(SHAPEIT4)

URL

https://odelaneau.github.io/shapeit4/

TITLE

Accurate, scalable and integrative haplotype estimation.

Main citation

Delaneau O, Zagury JF, Robinson MR, Marchini JL, ...&, Dermitzakis ET. (2019) Accurate, scalable and integrative haplotype estimation. Nat Commun, 10 (1) 5436. doi:10.1038/s41467-019-13225-y. PMID 31780650

ABSTRACT

The number of human genomes being genotyped or sequenced increases exponentially and efficient haplotype estimation methods able to handle this amount of data are now required. Here we present a method, SHAPEIT4, which substantially improves upon other methods to process large genotype and high coverage sequencing datasets. It notably exhibits sub-linear running times with sample size, provides highly accurate haplotypes and allows integrating external phasing information such as large reference panels of haplotypes, collections of pre-phased variants and long sequencing reads. We provide SHAPEIT4 in an open source format and demonstrate its performance in terms of accuracy and running times on two gold standard datasets: the UK Biobank data and the Genome In A Bottle.

DOI

10.1038/s41467-019-13225-y

SHAPEIT5

Tool

PUBMED_LINK

37386248

DESCRIPTION

(SHAPEIT5)

TITLE

Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank.

Main citation

Hofmeister RJ, Ribeiro DM, Rubinacci S, Delaneau O. (2023) Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. Nat Genet, 55 (7) 1243-1249. doi:10.1038/s41588-023-01415-w. PMID 37386248

ABSTRACT

Phasing involves distinguishing the two parentally inherited copies of each chromosome into haplotypes. Here, we introduce SHAPEIT5, a new phasing method that quickly and accurately processes large sequencing datasets and applied it to UK Biobank (UKB) whole-genome and whole-exome sequencing data. We demonstrate that SHAPEIT5 phases rare variants with low switch error rates of below 5% for variants present in just 1 sample out of 100,000. Furthermore, we outline a method for phasing singletons, which, although less precise, constitutes an important step towards future developments. We then demonstrate that the use of UKB as a reference panel improves the accuracy of genotype imputation, which is even more pronounced when phased with SHAPEIT5 compared with other methods. Finally, we screen the UKB data for loss-of-function compound heterozygous events and identify 549 genes where both gene copies are knocked out. These genes complement current knowledge of gene essentiality in the human genome.

DOI

10.1038/s41588-023-01415-w

fastPHASE

Tool

PUBMED_LINK

16532393

URL

http://scheet.org/software.html

TITLE

A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase.

Main citation

Scheet P, Stephens M. (2006) A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet, 78 (4) 629-44. doi:10.1086/502802. PMID 16532393

ABSTRACT

We present a statistical model for patterns of genetic variation in samples of unrelated individuals from natural populations. This model is based on the idea that, over short regions, haplotypes in a population tend to cluster into groups of similar haplotypes. To capture the fact that, because of recombination, this clustering tends to be local in nature, our model allows cluster memberships to change continuously along the chromosome according to a hidden Markov model. This approach is flexible, allowing for both "block-like" patterns of linkage disequilibrium (LD) and gradual decline in LD with distance. The resulting model is also fast and, as a result, is practicable for large data sets (e.g., thousands of individuals typed at hundreds of thousands of markers). We illustrate the utility of the model by applying it to dense single-nucleotide-polymorphism genotype data for the tasks of imputing missing genotypes and estimating haplotypic phase. For imputing missing genotypes, methods based on this model are as accurate or more accurate than existing methods. For haplotype estimation, the point estimates are slightly less accurate than those from the best existing methods (e.g., for unrelated Centre d'Etude du Polymorphisme Humain individuals from the HapMap project, switch error was 0.055 for our method vs. 0.051 for PHASE) but require a small fraction of the computational cost. In addition, we demonstrate that the model accurately reflects uncertainty in its estimates, in that probabilities computed using the model are approximately well calibrated. The methods described in this article are implemented in a software package, fastPHASE, which is available from the Stephens Lab Web site.

DOI

10.1086/502802
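
The fastPHASE abstract above quotes switch error (0.055 vs. 0.051 for PHASE) as its measure of phasing accuracy. As a minimal sketch, assuming the usual definition (the fraction of consecutive heterozygous-site pairs whose relative phase is inferred incorrectly) and assuming the inferred haplotypes carry the same genotypes as the truth, it can be computed as follows. The function name and input layout are illustrative and are not part of fastPHASE or any listed tool.

```python
from typing import Sequence


def switch_error_rate(
    true_h1: Sequence[int],
    true_h2: Sequence[int],
    inferred_h1: Sequence[int],
) -> float:
    """Switch error rate of one inferred haplotype against a known truth.

    Assumption: alleles are coded 0/1 and the inferred genotypes match the
    true genotypes, so the second inferred haplotype is implied.
    """
    # Only heterozygous sites carry phase information.
    het_sites = [i for i in range(len(true_h1)) if true_h1[i] != true_h2[i]]
    if len(het_sites) < 2:
        return 0.0  # no consecutive het pairs, so no switch is possible

    # Orientation per het site: does inferred haplotype 1 follow true haplotype 1?
    orientation = [inferred_h1[i] == true_h1[i] for i in het_sites]

    # A switch is a change of orientation between consecutive het sites.
    switches = sum(1 for a, b in zip(orientation, orientation[1:]) if a != b)
    return switches / (len(het_sites) - 1)
```

For example, with five heterozygous sites and a single phase flip after the second site, `switch_error_rate([0, 1, 0, 1, 0], [1, 0, 1, 0, 1], [0, 1, 1, 0, 1])` counts one switch among four consecutive pairs.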

Review

Review-Das

Tool

PUBMED_LINK

29799802

TITLE

Genotype Imputation from Large Reference Panels.

Main citation

Das S, Abecasis GR, Browning BL. (2018) Genotype Imputation from Large Reference Panels. Annu Rev Genomics Hum Genet, 19, 73-96. doi:10.1146/annurev-genom-083117-021602. PMID 29799802

ABSTRACT

Genotype imputation has become a standard tool in genome-wide association studies because it enables researchers to inexpensively approximate whole-genome sequence data from genome-wide single-nucleotide polymorphism array data. Genotype imputation increases statistical power, facilitates fine mapping of causal variants, and plays a key role in meta-analyses of genome-wide association studies. Only variants that were previously observed in a reference panel of sequenced individuals can be imputed. However, the rapid increase in the number of deeply sequenced individuals will soon make it possible to assemble enormous reference panels that greatly increase the number of imputable variants. In this review, we present an overview of genotype imputation and describe the computational techniques that make it possible to impute genotypes from reference panels with millions of individuals.

DOI

10.1146/annurev-genom-083117-021602

Review-Li

Tool

PUBMED_LINK

19715440

TITLE

Genotype imputation.

Main citation

Li Y, Willer C, Sanna S, Abecasis G. (2009) Genotype imputation. Annu Rev Genomics Hum Genet, 10, 387-406. doi:10.1146/annurev.genom.9.081307.164242. PMID 19715440

ABSTRACT

Genotype imputation is now an essential tool in the analysis of genome-wide association scans. This technique allows geneticists to accurately evaluate the evidence for association at genetic markers that are not directly genotyped. Genotype imputation is particularly useful for combining results across studies that rely on different genotyping platforms but also increases the power of individual scans. Here, we review the history and theoretical underpinnings of the technique. To illustrate performance of the approach, we summarize results from several gene mapping studies. Finally, we preview the role of genotype imputation in an era when whole genome resequencing is becoming increasingly common.

DOI

10.1146/annurev.genom.9.081307.164242

Review-Marchini

Tool

PUBMED_LINK

20517342

TITLE

Genotype imputation for genome-wide association studies.

Main citation

Marchini J, Howie B. (2010) Genotype imputation for genome-wide association studies. Nat Rev Genet, 11 (7) 499-511. doi:10.1038/nrg2796. PMID 20517342

ABSTRACT

In the past few years genome-wide association (GWA) studies have uncovered a large number of convincingly replicated associations for many complex human diseases. Genotype imputation has been used widely in the analysis of GWA studies to boost power, fine-map associations and facilitate the combination of results across studies using meta-analysis. This Review describes the details of several different statistical methods for imputing genotypes, illustrates and discusses the factors that influence imputation performance, and reviews methods that can be used to assess imputation performance and test association at imputed SNPs.

DOI

10.1038/nrg2796

Structural variants imputation panel

1KG SV imputation panel (1KG SV)

Tool

KEYWORDS

structural variants, long-read

PREPRINT_DOI

10.1101/2023.12.20.23300308

Main citation

Noyvert, B., Erzurumluoglu, A. M., Drichel, D., Omland, S., Andlauer, T. F., Mueller, S., ... & Ding, Z. (2023). Imputation of structural variants using a multi-ancestry long-read sequencing panel enables identification of disease associations. medRxiv, 2023-12.

Structural variants imputation tool & panel

ImputeSV

Tool

PUBMED_LINK

42156564

FULL NAME

Genome-wide imputation of structural variants from long-read assemblies

DESCRIPTION

A reference panel and web application to impute structural variants from SNP data, built from 482 haplotype-resolved long-read genome assemblies (PacBio HiFi) of 241 individuals.

URL

https://www.ImputeSV.com

KEYWORDS

structural variants, long-read, imputation, SNP

TITLE

Genome-wide associations of structural variants with human traits through imputation from long-read assemblies.

Main citation

Bai WY, Liu S, Duan Z, ...&, Yang J. (2026) Genome-wide associations of structural variants with human traits through imputation from long-read assemblies. Nat Genet. doi:10.1038/s41588-026-02612-z. PMID 42156564

ABSTRACT

Structural variants (SVs) are a major type of genetic variation, yet their role in human traits remains largely uncharacterized, primarily due to challenges in genotyping them on a genome-wide scale in large cohorts. Here we identified 171,233 high-quality, genome-wide SVs from 482 haplotype-resolved genome assemblies derived from PacBio HiFi long-read sequencing of 241 individuals. We developed a reference panel and a web application (ImputeSV) to impute these SVs from single-nucleotide polymorphism (SNP) data and demonstrated high imputation accuracy at both the individual and cohort levels. Using this tool, we imputed 54,578 common SVs (minor allele frequencies (MAFs) ≥1%) in 456,643 UK Biobank (UKB) participants of European ancestry. Through analysis of UKB data and simulations, we estimated that SVs contributed to at least 4.7% of the common genetic variation for complex traits. Genome-wide association analyses of SVs for 2,624 UKB traits identified 17,335 SV-trait associations, including 958 unlikely to be driven by small genetic variants. Our study demonstrates the power of using long-read assemblies for imputing SVs from SNPs, unveils the role of SVs in complex trait variation and provides a catalog of SV associations in the UKB.

DOI

10.1038/s41588-026-02612-z

ARROW_SUMMARY

482 haplotype-resolved assemblies (PacBio HiFi, 241 individuals) → 171,233 high-quality SVs → ImputeSV reference panel → Impute SVs from SNP data → 54,578 common SVs imputed in 456,643 UKB participants → GWAS for 2,624 traits → 17,335 SV-trait associations