Imputation
Summary Table
NAME | CATEGORY | CITATION | YEAR |
---|---|---|---|
1000 Genomes Phase 3 Version 5 (1KGp3v5) | Imputation panel | 1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, ...&, Abecasis GR. (2015) A global reference for human genetic variation Nature, 526 (7571) 68-74. doi:10.1038/nature15393. PMID 26432245 | 2015 |
1KG+7K | Imputation panel | Terao, C., Flanagan, J., Tomizuka, K., Liu, X., Ortega-Reyes, D., Matoba, N., ... & Horikoshi, M. (2023). Population-specific reference panel improves imputation quality and enhances locus discovery and fine-mapping. | NA |
CKB reference panel | Imputation panel | Yu C, Lan X, Tao Y, Guo Y, ...&, Li L. (2023) A high-resolution haplotype-resolved Reference panel constructed from the China Kadoorie Biobank Study Nucleic Acids Res., 51 (21) 11770-11782. doi:10.1093/nar/gkad779. PMID 37870428 | 2023 |
ChinaMAP panel | Imputation panel | Li L, Huang P, Sun X, Wang S, ...&, Wang W. (2021) The ChinaMAP reference panel for the accurate genotype imputation in Chinese populations Cell Res., 31 (12) 1308-1310. doi:10.1038/s41422-021-00564-z. PMID 34489580 | 2021 |
GenomeAsia 100K | Imputation panel | GenomeAsia100K Consortium. (2019) The GenomeAsia 100K Project enables genetic discoveries across Asia Nature, 576 (7785) 106-111. doi:10.1038/s41586-019-1793-z. PMID 31802016 | 2019 |
HGDP+1kGP | Imputation panel | Koenig Z, Yohannes MT, Nkambule LL, Zhao X, ...&, Martin AR. (2024) A harmonized public resource of deeply sequenced diverse human genomes Genome Res., 34 (5) 796-809. doi:10.1101/gr.278378.123. PMID 38749656 | 2024 |
HRC | Imputation panel | McCarthy S, Das S, Kretzschmar W, Delaneau O, ...&, Marchini J. (2016) A reference panel of 64,976 haplotypes for genotype imputation Nat. Genet., 48 (10) 1279-1283. doi:10.1038/ng.3643. PMID 27548312 | 2016 |
NARD2 | Imputation panel | Choi J, Kim S, Kim J, Son HY, ...&, Im SW. (2023) A whole-genome reference panel of 14,393 individuals for East Asian populations accelerates discovery of rare functional variants Sci Adv, 9 (32) eadg6319. doi:10.1126/sciadv.adg6319. PMID 37556544 | 2023 |
NARD | Imputation panel | Yoo SK, Kim CU, Kim HL, Kim S, ...&, Seo JS. (2019) NARD: whole-genome reference panel of 1779 Northeast Asians improves imputation accuracy of rare and low-frequency variants Genome Med., 11 (1) 64. doi:10.1186/s13073-019-0677-z. PMID 31640730 | 2019 |
NyuWa Genome Resource Phase 1 | Imputation panel | Zhang P, Luo H, Li Y, Wang Y, ...&, He S. (2021) NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population Cell Rep., 37 (7) 110017. doi:10.1016/j.celrep.2021.110017. PMID 34788621 | 2021 |
PGG.Han panel | Imputation panel | Gao Y, Zhang C, Yuan L, Ling Y, ...&, Xu S. (2020) PGG.Han: the Han Chinese genome database and analysis platform Nucleic Acids Res., 48 (D1) D971-D976. doi:10.1093/nar/gkz829. PMID 31584086 | 2020 |
South and East Asian Reference Database (SEAD) | Imputation panel | Yang, M. Y., Zhong, J. D., Li, X., Bai, W. Y., Yuan, C. D., Qiu, M. C., ... & Zheng, H. F. (2023). SEAD: an augmented reference panel with 22,134 haplotypes boosts the rare variants imputation and GWAS analysis in Asian population. medRxiv, 2023-12. | NA |
TOPMed | Imputation panel | Taliun D, Harris DN, Kessler MD, Carlson J, ...&, Abecasis GR. (2021) Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program Nature, 590 (7845) 290-299. doi:10.1038/s41586-021-03205-y. PMID 33568819 | 2021 |
WBBC panel | Imputation panel | Cong PK, Bai WY, Li JC, Yang MY, ...&, Zheng HF. (2022) Genomic analyses of 10,376 individuals in the Westlake BioBank for Chinese (WBBC) pilot project Nat. Commun., 13 (1) 2939. doi:10.1038/s41467-022-30526-x. PMID 35618720 | 2022 |
CNGB Imputation Service | Imputation server | Yu C, Lan X, Tao Y, Guo Y, ...&, Li L. (2023) A high-resolution haplotype-resolved Reference panel constructed from the China Kadoorie Biobank Study Nucleic Acids Res., 51 (21) 11770-11782. doi:10.1093/nar/gkad779. PMID 37870428 | 2023 |
ChinaMAP | Imputation server | Li L, Huang P, Sun X, Wang S, ...&, Wang W. (2021) The ChinaMAP reference panel for the accurate genotype imputation in Chinese populations Cell Res., 31 (12) 1308-1310. doi:10.1038/s41422-021-00564-z. PMID 34489580 | 2021 |
Michigan Imputation Server | Imputation server | Das S, Forer L, Schönherr S, Sidore C, ...&, Fuchsberger C. (2016) Next-generation genotype imputation service and methods Nat. Genet., 48 (10) 1284-1287. doi:10.1038/ng.3656. PMID 27571263 | 2016 |
NyuWa Imputation Server | Imputation server | Zhang P, Luo H, Li Y, Wang Y, ...&, He S. (2021) NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population Cell Rep., 37 (7) 110017. doi:10.1016/j.celrep.2021.110017. PMID 34788621 | 2021 |
PGG.Han | Imputation server | Gao Y, Zhang C, Yuan L, Ling Y, ...&, Xu S. (2020) PGG.Han: the Han Chinese genome database and analysis platform Nucleic Acids Res., 48 (D1) D971-D976. doi:10.1093/nar/gkz829. PMID 31584086 | 2020 |
Sanger | Imputation server | McCarthy S, Das S, Kretzschmar W, Delaneau O, ...&, Marchini J. (2016) A reference panel of 64,976 haplotypes for genotype imputation Nat. Genet., 48 (10) 1279-1283. doi:10.1038/ng.3643. PMID 27548312 | 2016 |
TOPMed | Imputation server | Taliun D, Harris DN, Kessler MD, Carlson J, ...&, Abecasis GR. (2021) Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program Nature, 590 (7845) 290-299. doi:10.1038/s41586-021-03205-y. PMID 33568819 | 2021 |
Westlake Imputation Server | Imputation server | Cong PK, Bai WY, Li JC, Yang MY, ...&, Zheng HF. (2022) Genomic analyses of 10,376 individuals in the Westlake BioBank for Chinese (WBBC) pilot project Nat. Commun., 13 (1) 2939. doi:10.1038/s41467-022-30526-x. PMID 35618720 | 2022 |
RESHAPE | Other tools | Cavinato T, Rubinacci S, Malaspinas AS, Delaneau O. (2024) A resampling-based approach to share reference panels Nat. Comput. Sci., 4 (5) 360-366. doi:10.1038/s43588-024-00630-7. PMID 38745108 | 2024 |
BEAGLE4 | Phasing & Imputation tool | Browning BL, Browning SR. (2016) Genotype imputation with millions of reference samples Am. J. Hum. Genet., 98 (1) 116-126. doi:10.1016/j.ajhg.2015.11.020. PMID 26748515 | 2016 |
BEAGLE5.4 (Imputation) | Phasing & Imputation tool | Browning BL, Zhou Y, Browning SR. (2018) A one-penny imputed genome from next-generation reference panels Am. J. Hum. Genet., 103 (3) 338-348. doi:10.1016/j.ajhg.2018.07.015. PMID 30100085 | 2018 |
BEAGLE5.4 (Phasing) | Phasing & Imputation tool | Browning BL, Tian X, Zhou Y, Browning SR. (2021) Fast two-stage phasing of large-scale sequence data Am. J. Hum. Genet., 108 (10) 1880-1890. doi:10.1016/j.ajhg.2021.08.005. PMID 34478634 | 2021 |
BEAGLE | Phasing & Imputation tool | Browning SR, Browning BL. (2007) Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering Am. J. Hum. Genet., 81 (5) 1084-1097. doi:10.1086/521987. PMID 17924348 | 2007 |
EAGLE2 | Phasing & Imputation tool | Loh PR, Danecek P, Palamara PF, Fuchsberger C, ...&, Price AL. (2016) Reference-based phasing using the Haplotype Reference Consortium panel Nat. Genet., 48 (11) 1443-1448. doi:10.1038/ng.3679. PMID 27694958 | 2016 |
EAGLE | Phasing & Imputation tool | Loh PR, Palamara PF, Price AL. (2016) Fast and accurate long-range phasing in a UK Biobank cohort Nat. Genet., 48 (7) 811-816. doi:10.1038/ng.3571. PMID 27270109 | 2016 |
GLIMPSE | Phasing & Imputation tool | Rubinacci S, Ribeiro DM, Hofmeister RJ, Delaneau O. (2021) Efficient phasing and imputation of low-coverage sequencing data using large reference panels Nat. Genet., 53 (1) 120-126. doi:10.1038/s41588-020-00756-0. PMID 33414550 | 2021 |
IMPUTE2 | Phasing & Imputation tool | Howie BN, Donnelly P, Marchini J. (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies PLoS Genet., 5 (6) e1000529. doi:10.1371/journal.pgen.1000529. PMID 19543373 | 2009 |
IMPUTE4 | Phasing & Imputation tool | Bycroft C, Freeman C, Petkova D, Band G, ...&, Marchini J. (2018) The UK Biobank resource with deep phenotyping and genomic data Nature, 562 (7726) 203-209. doi:10.1038/s41586-018-0579-z. PMID 30305743 | 2018 |
IMPUTE5 | Phasing & Imputation tool | Rubinacci S, Delaneau O, Marchini J. (2020) Genotype imputation using the Positional Burrows Wheeler Transform PLoS Genet., 16 (11) e1009049. doi:10.1371/journal.pgen.1009049. PMID 33196638 | 2020 |
IMPUTE | Phasing & Imputation tool | Marchini J, Howie B, Myers S, McVean G, ...&, Donnelly P. (2007) A new multipoint method for genome-wide association studies by imputation of genotypes Nat. Genet., 39 (7) 906-913. doi:10.1038/ng2088. PMID 17572673 | 2007 |
MACH / minimac pre-phasing | Phasing & Imputation tool | Howie B, Fuchsberger C, Stephens M, Marchini J, ...&, Abecasis GR. (2012) Fast and accurate genotype imputation in genome-wide association studies through pre-phasing Nat. Genet., 44 (8) 955-959. doi:10.1038/ng.2354. PMID 22820512 | 2012 |
MACH / minimac2 | Phasing & Imputation tool | Fuchsberger C, Abecasis GR, Hinds DA. (2015) Minimac2: Faster genotype imputation Bioinformatics, 31 (5) 782-784. doi:10.1093/bioinformatics/btu704. PMID 25338720 | 2015 |
MACH / minimac3 | Phasing & Imputation tool | Das S, Forer L, Schönherr S, Sidore C, ...&, Fuchsberger C. (2016) Next-generation genotype imputation service and methods Nat. Genet., 48 (10) 1284-1287. doi:10.1038/ng.3656. PMID 27571263 | 2016 |
MACH / minimac4 | Phasing & Imputation tool | NA | NA |
MACH / minimac | Phasing & Imputation tool | Li Y, Willer CJ, Ding J, Scheet P, ...&, Abecasis GR. (2010) MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes Genet. Epidemiol., 34 (8) 816-834. doi:10.1002/gepi.20533. PMID 21058334 | 2010 |
SHAPEIT1 | Phasing & Imputation tool | Delaneau O, Marchini J, Zagury JF. (2011) A linear complexity phasing method for thousands of genomes Nat. Methods, 9 (2) 179-181. doi:10.1038/nmeth.1785. PMID 22138821 | 2011 |
SHAPEIT2 | Phasing & Imputation tool | Delaneau O, Zagury JF, Marchini J. (2013) Improved whole-chromosome phasing for disease and population genetic studies Nat. Methods, 10 (1) 5-6. doi:10.1038/nmeth.2307. PMID 23269371 | 2013 |
SHAPEIT3 | Phasing & Imputation tool | O'Connell J, Sharp K, Shrine N, Wain L, ...&, Marchini J. (2016) Haplotype estimation for biobank-scale data sets Nat. Genet., 48 (7) 817-820. doi:10.1038/ng.3583. PMID 27270105 | 2016 |
SHAPEIT4 | Phasing & Imputation tool | Delaneau O, Zagury JF, Robinson MR, Marchini JL, ...&, Dermitzakis ET. (2019) Accurate, scalable and integrative haplotype estimation Nat. Commun., 10 (1) 5436. doi:10.1038/s41467-019-13225-y. PMID 31780650 | 2019 |
SHAPEIT5 | Phasing & Imputation tool | Hofmeister RJ, Ribeiro DM, Rubinacci S, Delaneau O. (2023) Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank Nat. Genet., 55 (7) 1243-1249. doi:10.1038/s41588-023-01415-w. PMID 37386248 | 2023 |
fastPHASE | Phasing & Imputation tool | Scheet P, Stephens M. (2006) A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase Am. J. Hum. Genet., 78 (4) 629-644. doi:10.1086/502802. PMID 16532393 | 2006 |
Review-Das | Review | Das S, Abecasis GR, Browning BL. (2018) Genotype imputation from large reference panels Annu. Rev. Genomics Hum. Genet., 19 (1) 73-96. doi:10.1146/annurev-genom-083117-021602. PMID 29799802 | 2018 |
Review-Li | Review | Li Y, Willer C, Sanna S, Abecasis G. (2009) Genotype imputation Annu. Rev. Genomics Hum. Genet., 10 (1) 387-406. doi:10.1146/annurev.genom.9.081307.164242. PMID 19715440 | 2009 |
Review-Marchini | Review | Marchini J, Howie B. (2010) Genotype imputation for genome-wide association studies Nat. Rev. Genet., 11 (7) 499-511. doi:10.1038/nrg2796. PMID 20517342 | 2010 |
1KG SV imputation panel | Structural variants imputation panel | Noyvert, B., Erzurumluoglu, A. M., Drichel, D., Omland, S., Andlauer, T. F., Mueller, S., ... & Ding, Z. (2023). Imputation of structural variants using a multi-ancestry long-read sequencing panel enables identification of disease associations. medRxiv, 2023-12. | NA |
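The summary table is plain pipe-delimited text, so it can be loaded into records for filtering or counting. The following is a minimal sketch rather than a tool shipped with this catalog: the function name `parse_summary_table` and the abbreviated sample rows are illustrative assumptions, and it assumes no cell contains a `|` character.

```python
from collections import Counter


def parse_summary_table(text):
    """Turn a pipe-delimited table into a list of dicts keyed by the header row."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    rows = []
    for line in lines[1:]:
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the "---|---|---|---|" separator row.
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        if len(cells) != len(header):
            continue  # ignore malformed rows rather than guess
        row = dict(zip(header, cells))
        # "NA" marks rows with no recorded year.
        row["YEAR"] = None if row["YEAR"] == "NA" else int(row["YEAR"])
        rows.append(row)
    return rows


# Citations are abbreviated here for brevity; the table holds the full text.
SAMPLE = """
NAME | CATEGORY | CITATION | YEAR |
---|---|---|---|
HRC | Imputation panel | McCarthy S, ... (2016) A reference panel of 64,976 haplotypes for genotype imputation | 2016 |
1KG+7K | Imputation panel | Terao, C., ... (2023). Population-specific reference panel improves imputation quality and enhances locus discovery and fine-mapping. | NA |
Michigan Imputation Server | Imputation server | Das S, ... (2016) Next-generation genotype imputation service and methods | 2016 |
"""

rows = parse_summary_table(SAMPLE)
by_category = Counter(row["CATEGORY"] for row in rows)
for category, count in sorted(by_category.items()):
    print(f"{category}: {count}")
```

Rows whose cell count does not match the header are dropped instead of being repaired, which keeps the sketch conservative when a citation is malformed.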
Imputation panel
1000 Genomes Phase 3 Version 5 (1KGp3v5)
- NAME : 1000 Genomes Phase 3 Version 5 (1KGp3v5)
- SHORT NAME : 1KG
- URL : https://www.internationalgenome.org/
- TITLE : A global reference for human genetic variation
- DOI : 10.1038/nature15393
- ABSTRACT : The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.
- CITATION : 1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, ...&, Abecasis GR. (2015) A global reference for human genetic variation Nature, 526 (7571) 68-74. doi:10.1038/nature15393. PMID 26432245
- JOURNAL_INFO : Nature ; Nature ; 2015 ; 526 ; 7571 ; 68-74
- PUBMED_LINK : 26432245
1KG+7K
- NAME : 1KG+7K
- SHORT NAME : 1KG+7K
- FULL NAME : 1KG+7K
- KEYWORDS : Japanese population-specific reference panel
- PREPRINT_DOI : 10.21203/rs.3.rs-3194976/v1
- SERVER : researchsquare
- CITATION : Terao, C., Flanagan, J., Tomizuka, K., Liu, X., Ortega-Reyes, D., Matoba, N., ... & Horikoshi, M. (2023). Population-specific reference panel improves imputation quality and enhances locus discovery and fine-mapping.
CKB reference panel
- NAME : CKB reference panel
- SHORT NAME : CKB
- FULL NAME : China Kadoorie Biobank
- URL : https://db.cngb.org/imputation/
- TITLE : A high-resolution haplotype-resolved Reference panel constructed from the China Kadoorie Biobank Study
- DOI : 10.1093/nar/gkad779
- ABSTRACT : Precision medicine depends on high-accuracy individual-level genotype data. However, the whole-genome sequencing (WGS) is still not suitable for gigantic studies due to budget constraints. It is particularly important to construct highly accurate haplotype reference panel for genotype imputation. In this study, we used 10 000 samples with medium-depth WGS to construct a reference panel that we named the CKB reference panel. By imputing microarray datasets, it showed that the CKB panel outperformed compared panels in terms of both the number of well-imputed variants and imputation accuracy. In addition, we have completed the imputation of 100 706 microarrays with the CKB panel, and the after-imputed data is the hitherto largest whole genome data of the Chinese population. Furthermore, in the GWAS analysis of real phenotype height, the number of tested SNPs tripled and the number of significant SNPs doubled after imputation. Finally, we developed an online server for offering free genotype imputation service based on the CKB reference panel (https://db.cngb.org/imputation/). We believe that the CKB panel is of great value for imputing microarray or low-coverage genotype data of Chinese population, and potentially mixed populations. The imputation-completed 100 706 microarray data are enormous and precious resources of population genetic studies for complex traits and diseases.
- COPYRIGHT : https://creativecommons.org/licenses/by/4.0/
- CITATION : Yu C, Lan X, Tao Y, Guo Y, ...&, Li L. (2023) A high-resolution haplotype-resolved Reference panel constructed from the China Kadoorie Biobank Study Nucleic Acids Res., 51 (21) 11770-11782. doi:10.1093/nar/gkad779. PMID 37870428
- JOURNAL_INFO : Nucleic acids research ; Nucleic Acids Res. ; 2023 ; 51 ; 21 ; 11770-11782
- PUBMED_LINK : 37870428
ChinaMAP panel
- NAME : ChinaMAP panel
- SHORT NAME : ChinaMAP
- FULL NAME : China Metabolic Analytics Project
- URL : http://www.mbiobank.com/
- TITLE : The ChinaMAP reference panel for the accurate genotype imputation in Chinese populations
- DOI : 10.1038/s41422-021-00564-z
- ABSTRACT : The ChinaMAP reference panel for the accurate genotype imputation in Chinese populations
- COPYRIGHT : https://creativecommons.org/licenses/by/4.0
- CITATION : Li L, Huang P, Sun X, Wang S, ...&, Wang W. (2021) The ChinaMAP reference panel for the accurate genotype imputation in Chinese populations Cell Res., 31 (12) 1308-1310. doi:10.1038/s41422-021-00564-z. PMID 34489580
- JOURNAL_INFO : Cell research ; Cell Res. ; 2021 ; 31 ; 12 ; 1308-1310
- PUBMED_LINK : 34489580
GenomeAsia 100K
- NAME : GenomeAsia 100K
- SHORT NAME : GenomeAsia 100K
- URL : https://www.genomeasia100k.org/
- TITLE : The GenomeAsia 100K Project enables genetic discoveries across Asia
- DOI : 10.1038/s41586-019-1793-z
- ABSTRACT : The underrepresentation of non-Europeans in human genetic studies so far has limited the diversity of individuals in genomic datasets and led to reduced medical relevance for a large proportion of the world's population. Population-specific reference genome datasets as well as genome-wide association studies in diverse populations are needed to address this issue. Here we describe the pilot phase of the GenomeAsia 100K Project. This includes a whole-genome sequencing reference dataset from 1,739 individuals of 219 population groups and 64 countries across Asia. We catalogue genetic variation, population structure, disease associations and founder effects. We also explore the use of this dataset in imputation, to facilitate genetic studies in populations across Asia and worldwide.
- COPYRIGHT : https://creativecommons.org/licenses/by/4.0
- CITATION : GenomeAsia100K Consortium. (2019) The GenomeAsia 100K Project enables genetic discoveries across Asia Nature, 576 (7785) 106-111. doi:10.1038/s41586-019-1793-z. PMID 31802016
- JOURNAL_INFO : Nature ; Nature ; 2019 ; 576 ; 7785 ; 106-111
- PUBMED_LINK : 31802016
HGDP+1kGP
- NAME : HGDP+1kGP
- SHORT NAME : HGDP+1kGP
- FULL NAME : Human Genome Diversity Project + 1000 Genomes project
- URL : https://gnomad.broadinstitute.org/news/2020-10-gnomad-v3-1-new-content-methods-annotations-and-data-availability/#the-gnomad-hgdp-and-1000-genomes-callset
- TITLE : A harmonized public resource of deeply sequenced diverse human genomes
- DOI : 10.1101/gr.278378.123
- ABSTRACT : Underrepresented populations are often excluded from genomic studies owing in part to a lack of resources supporting their analyses. The 1000 Genomes Project (1kGP) and Human Genome Diversity Project (HGDP), which have recently been sequenced to high coverage, are valuable genomic resources because of the global diversity they capture and their open data sharing policies. Here, we harmonized a high-quality set of 4094 whole genomes from 80 populations in the HGDP and 1kGP with data from the Genome Aggregation Database (gnomAD) and identified over 153 million high-quality SNVs, indels, and SVs. We performed a detailed ancestry analysis of this cohort, characterizing population structure and patterns of admixture across populations, analyzing site frequency spectra, and measuring variant counts at global and subcontinental levels. We also show substantial added value from this data set compared with the prior versions of the component resources, typically combined via liftOver and variant intersection; for example, we catalog millions of new genetic variants, mostly rare, compared with previous releases. In addition to unrestricted individual-level public release, we provide detailed tutorials for conducting many of the most common quality-control steps and analyses with these data in a scalable cloud-computing environment and publicly release this new phased joint callset for use as a haplotype resource in phasing and imputation pipelines. This jointly called reference panel will serve as a key resource to support research of diverse ancestry populations.
- CITATION : Koenig Z, Yohannes MT, Nkambule LL, Zhao X, ...&, Martin AR. (2024) A harmonized public resource of deeply sequenced diverse human genomes Genome Res., 34 (5) 796-809. doi:10.1101/gr.278378.123. PMID 38749656
- JOURNAL_INFO : Genome research ; Genome Res. ; 2024 ; 34 ; 5 ; 796-809
- PUBMED_LINK : 38749656
HRC
- NAME : HRC
- SHORT NAME : HRC
- URL : http://www.haplotype-reference-consortium.org/
- TITLE : A reference panel of 64,976 haplotypes for genotype imputation
- DOI : 10.1038/ng.3643
- ABSTRACT : We describe a reference panel of 64,976 human haplotypes at 39,235,157 SNPs constructed using whole-genome sequence data from 20 studies of predominantly European ancestry. Using this resource leads to accurate genotype imputation at minor allele frequencies as low as 0.1% and a large increase in the number of SNPs tested in association studies, and it can help to discover and refine causal loci. We describe remote server resources that allow researchers to carry out imputation and phasing consistently and efficiently.
- CITATION : McCarthy S, Das S, Kretzschmar W, Delaneau O, ...&, Marchini J. (2016) A reference panel of 64,976 haplotypes for genotype imputation Nat. Genet., 48 (10) 1279-1283. doi:10.1038/ng.3643. PMID 27548312
- JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2016 ; 48 ; 10 ; 1279-1283
- PUBMED_LINK : 27548312
NARD
- NAME : NARD
- SHORT NAME : NARD
- FULL NAME : Northeast Asian Reference Database
- URL : https://nard.macrogen.com/
- TITLE : NARD: whole-genome reference panel of 1779 Northeast Asians improves imputation accuracy of rare and low-frequency variants
- DOI : 10.1186/s13073-019-0677-z
- ABSTRACT : Here, we present the Northeast Asian Reference Database (NARD), including whole-genome sequencing data of 1779 individuals from Korea, Mongolia, Japan, China, and Hong Kong. NARD provides the genetic diversity of Korean (n = 850) and Mongolian (n = 384) ancestries that were not present in the 1000 Genomes Project Phase 3 (1KGP3). We combined and re-phased the genotypes from NARD and 1KGP3 to construct a union set of haplotypes. This approach established a robust imputation reference panel for Northeast Asians, which yields the greatest imputation accuracy of rare and low-frequency variants compared with the existing panels. NARD imputation panel is available at https://nard.macrogen.com/.
- CITATION : Yoo SK, Kim CU, Kim HL, Kim S, ...&, Seo JS. (2019) NARD: whole-genome reference panel of 1779 Northeast Asians improves imputation accuracy of rare and low-frequency variants Genome Med., 11 (1) 64. doi:10.1186/s13073-019-0677-z. PMID 31640730
- JOURNAL_INFO : Genome medicine ; Genome Med. ; 2019 ; 11 ; 1 ; 64
- PUBMED_LINK : 31640730
NARD2
- NAME : NARD2
- SHORT NAME : NARD2
- FULL NAME : Northeast Asian Reference Database 2
- URL : https://nard.macrogen.com/
- TITLE : A whole-genome reference panel of 14,393 individuals for East Asian populations accelerates discovery of rare functional variants
- DOI : 10.1126/sciadv.adg6319
- ABSTRACT : Underrepresentation of non-European (EUR) populations hinders growth of global precision medicine. Resources such as imputation reference panels that match the study population are necessary to find low-frequency variants with substantial effects. We created a reference panel consisting of 14,393 whole-genome sequences including more than 11,000 Asian individuals. Genome-wide association studies were conducted using the reference panel and a population-specific genotype array of 72,298 subjects for eight phenotypes. This panel yields improved imputation accuracy of rare and low-frequency variants within East Asian populations compared with the largest reference panel. Thirty-nine previously unidentified associations were found, and more than half of the variants were East Asian specific. We discovered genes with rare protein-altering variants, including LTBP1 for height and GPR75 for body mass index, as well as putative regulatory mechanisms for rare noncoding variants with cell type-specific effects. We suggest that this dataset will add to the potential value of Asian precision medicine.
- CITATION : Choi J, Kim S, Kim J, Son HY, ...&, Im SW. (2023) A whole-genome reference panel of 14,393 individuals for East Asian populations accelerates discovery of rare functional variants Sci Adv, 9 (32) eadg6319. doi:10.1126/sciadv.adg6319. PMID 37556544
- JOURNAL_INFO : Science advances ; Sci Adv ; 2023 ; 9 ; 32 ; eadg6319
- PUBMED_LINK : 37556544
NyuWa Genome Resource Phase 1
- NAME : NyuWa Genome Resource Phase 1
- SHORT NAME : NyuWa Genome Resource Phase 1
- FULL NAME : NyuWa Genome Resource Phase 1
- URL : http://bigdata.ibp.ac.cn/refpanel/getstarted.php
- TITLE : NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population
- DOI : 10.1016/j.celrep.2021.110017
- ABSTRACT : The lack of haplotype reference panels and whole-genome sequencing resources specific to the Chinese population has greatly hindered genetic studies in the world's largest population. Here, we present the NyuWa genome resource, based on deep (26.2×) sequencing of 2,999 Chinese individuals, and construct a NyuWa reference panel of 5,804 haplotypes and 19.3 million variants, which is a high-quality publicly available Chinese population-specific reference panel with thousands of samples. Compared with other panels, the NyuWa reference panel reduces the Han Chinese imputation error rate by a margin ranging from 30% to 51%. Population structure and imputation simulation tests support the applicability of one integrated reference panel for northern and southern Chinese. In addition, a total of 22,504 loss-of-function variants in coding and noncoding genes are identified, including 11,493 novel variants. These results highlight the value of the NyuWa genome resource in facilitating genetic research in Chinese and Asian populations.
- COPYRIGHT : http://creativecommons.org/licenses/by-nc-nd/4.0/
- CITATION : Zhang P, Luo H, Li Y, Wang Y, ...&, He S. (2021) NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population Cell Rep., 37 (7) 110017. doi:10.1016/j.celrep.2021.110017. PMID 34788621
- JOURNAL_INFO : Cell reports ; Cell Rep. ; 2021 ; 37 ; 7 ; 110017
- PUBMED_LINK : 34788621
PGG.Han panel
- NAME : PGG.Han panel
- SHORT NAME : PGG.Han
- URL : https://www.biosino.org/pgghan2/index#home1
- TITLE : PGG.Han: the Han Chinese genome database and analysis platform
- DOI : 10.1093/nar/gkz829
- ABSTRACT : As the largest ethnic group in the world, the Han Chinese population is nonetheless underrepresented in global efforts to catalogue the genomic variability of natural populations. Here, we developed the PGG.Han, a population genome database to serve as the central repository for the genomic data of the Han Chinese Genome Initiative (Phase I). In its current version, the PGG.Han archives whole-genome sequences or high-density genome-wide single-nucleotide variants (SNVs) of 114 783 Han Chinese individuals (a.k.a. the Han100K), representing geographical sub-populations covering 33 of the 34 administrative divisions of China, as well as Singapore. The PGG.Han provides: (i) an interactive interface for visualization of the fine-scale genetic structure of the Han Chinese population; (ii) genome-wide allele frequencies of hierarchical sub-populations; (iii) ancestry inference for individual samples and controlling population stratification based on nested ancestry informative markers (AIMs) panels; (iv) population-structure-aware shared control data for genotype-phenotype association studies (e.g. GWASs) and (v) a Han-Chinese-specific reference panel for genotype imputation. Computational tools are implemented into the PGG.Han, and an online user-friendly interface is provided for data analysis and results visualization. The PGG.Han database is freely accessible via http://www.pgghan.org or https://www.hanchinesegenomes.org.
- COPYRIGHT : http://creativecommons.org/licenses/by-nc/4.0/
- CITATION : Gao Y, Zhang C, Yuan L, Ling Y, ...&, Xu S. (2020) PGG.Han: the Han Chinese genome database and analysis platform Nucleic Acids Res., 48 (D1) D971-D976. doi:10.1093/nar/gkz829. PMID 31584086
- JOURNAL_INFO : Nucleic acids research ; Nucleic Acids Res. ; 2020 ; 48 ; D1 ; D971-D976
- PUBMED_LINK : 31584086
South and East Asian Reference Database (SEAD)
- NAME : South and East Asian Reference Database (SEAD)
- SHORT NAME : SEAD
- FULL NAME : South and East Asian Reference Database
- URL : https://imputationserver.westlake.edu.cn/
- PREPRINT_DOI : 10.1101/2023.12.23.23300480
- SERVER : medrxiv
- CITATION : Yang, M. Y., Zhong, J. D., Li, X., Bai, W. Y., Yuan, C. D., Qiu, M. C., ... & Zheng, H. F. (2023). SEAD: an augmented reference panel with 22,134 haplotypes boosts the rare variants imputation and GWAS analysis in Asian population. medRxiv, 2023-12.
TOPMed
- NAME : TOPMed
- SHORT NAME : TOPMed
- FULL NAME : Trans-Omics for Precision Medicine
- URL : https://imputation.biodatacatalyst.nhlbi.nih.gov/#!
- TITLE : Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program
- DOI : 10.1038/s41586-021-03205-y
- ABSTRACT : The Trans-Omics for Precision Medicine (TOPMed) programme seeks to elucidate the genetic architecture and biology of heart, lung, blood and sleep disorders, with the ultimate goal of improving diagnosis, treatment and prevention of these diseases. The initial phases of the programme focused on whole-genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here we describe the TOPMed goals and design as well as the available resources and early insights obtained from the sequence data. The resources include a variant browser, a genotype imputation server, and genomic and phenotypic data that are available through dbGaP (Database of Genotypes and Phenotypes). In the first 53,831 TOPMed samples, we detected more than 400 million single-nucleotide and insertion or deletion variants after alignment with the reference genome. Additional previously undescribed variants were detected through assembly of unmapped reads and customized analysis in highly variable loci. Among the more than 400 million detected variants, 97% have frequencies of less than 1% and 46% are singletons that are present in only one individual (53% among unrelated individuals). These rare variants provide insights into mutational processes and recent human evolutionary history. The extensive catalogue of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and noncoding sequence variants to phenotypic variation. Furthermore, combining TOPMed haplotypes with modern imputation methods improves the power and reach of genome-wide association studies to include variants down to a frequency of approximately 0.01%.
- CITATION : Taliun D, Harris DN, Kessler MD, Carlson J, ...&, Abecasis GR. (2021) Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program Nature, 590 (7845) 290-299. doi:10.1038/s41586-021-03205-y. PMID 33568819
- JOURNAL_INFO : Nature ; Nature ; 2021 ; 590 ; 7845 ; 290-299
- PUBMED_LINK : 33568819
WBBC panel
- NAME : WBBC panel
- SHORT NAME : WBBC
- URL : https://imputationserver.westlake.edu.cn/
- TITLE : Genomic analyses of 10,376 individuals in the Westlake BioBank for Chinese (WBBC) pilot project
- DOI : 10.1038/s41467-022-30526-x
- ABSTRACT : We initiate the Westlake BioBank for Chinese (WBBC) pilot project with 4,535 whole-genome sequencing (WGS) individuals and 5,841 high-density genotyping individuals, and identify 81.5 million SNPs and INDELs, of which 38.5% are absent in dbSNP Build 151. We provide a population-specific reference panel and an online imputation server (https://wbbc.westlake.edu.cn/) which could yield substantial improvement of imputation performance in Chinese population, especially for low-frequency and rare variants. By analyzing the singleton density of the WGS data, we find selection signatures in SNX29, DNAH1 and WDR1 genes, and the derived alleles of the alcohol metabolism genes (ADH1A and ADH1B) emerge around 7,000 years ago and tend to be more common from 4,000 years ago in East Asia. Genetic evidence supports the corresponding geographical boundaries of the Qinling-Huaihe Line and Nanling Mountains, which separate the Han Chinese into subgroups, and we reveal that North Han was more homogeneous than South Han.
- CITATION : Cong PK, Bai WY, Li JC, Yang MY, ...&, Zheng HF. (2022) Genomic analyses of 10,376 individuals in the Westlake BioBank for Chinese (WBBC) pilot project Nat. Commun., 13 (1) 2939. doi:10.1038/s41467-022-30526-x. PMID 35618720
- JOURNAL_INFO : Nature communications ; Nat. Commun. ; 2022 ; 13 ; 1 ; 2939
- PUBMED_LINK : 35618720
Imputation server
CNGB Imputation Service
- NAME : CNGB Imputation Service
- SHORT NAME : CNGB
- FULL NAME : China National GeneBank
- URL : https://db.cngb.org/imputation/
- TITLE : A high-resolution haplotype-resolved Reference panel constructed from the China Kadoorie Biobank Study
- DOI : 10.1093/nar/gkad779
- ABSTRACT : Precision medicine depends on high-accuracy individual-level genotype data. However, the whole-genome sequencing (WGS) is still not suitable for gigantic studies due to budget constraints. It is particularly important to construct highly accurate haplotype reference panel for genotype imputation. In this study, we used 10 000 samples with medium-depth WGS to construct a reference panel that we named the CKB reference panel. By imputing microarray datasets, it showed that the CKB panel outperformed compared panels in terms of both the number of well-imputed variants and imputation accuracy. In addition, we have completed the imputation of 100 706 microarrays with the CKB panel, and the after-imputed data is the hitherto largest whole genome data of the Chinese population. Furthermore, in the GWAS analysis of real phenotype height, the number of tested SNPs tripled and the number of significant SNPs doubled after imputation. Finally, we developed an online server for offering free genotype imputation service based on the CKB reference panel (https://db.cngb.org/imputation/). We believe that the CKB panel is of great value for imputing microarray or low-coverage genotype data of Chinese population, and potentially mixed populations. The imputation-completed 100 706 microarray data are enormous and precious resources of population genetic studies for complex traits and diseases.
- COPYRIGHT : https://creativecommons.org/licenses/by/4.0/
- CITATION : Yu C, Lan X, Tao Y, Guo Y, ...&, Li L. (2023) A high-resolution haplotype-resolved Reference panel constructed from the China Kadoorie Biobank Study Nucleic Acids Res., 51 (21) 11770-11782. doi:10.1093/nar/gkad779. PMID 37870428
- JOURNAL_INFO : Nucleic acids research ; Nucleic Acids Res. ; 2023 ; 51 ; 21 ; 11770-11782
- PUBMED_LINK : 37870428
ChinaMAP
- NAME : ChinaMAP
- SHORT NAME : ChinaMAP
- FULL NAME : China Metabolic Analytics Project
- URL : http://www.mbiobank.com/
- TITLE : The ChinaMAP reference panel for the accurate genotype imputation in Chinese populations
- DOI : 10.1038/s41422-021-00564-z
- ABSTRACT : The ChinaMAP reference panel for the accurate genotype imputation in Chinese populations
- COPYRIGHT : https://creativecommons.org/licenses/by/4.0
- CITATION : Li L, Huang P, Sun X, Wang S, ...&, Wang W. (2021) The ChinaMAP reference panel for the accurate genotype imputation in Chinese populations Cell Res., 31 (12) 1308-1310. doi:10.1038/s41422-021-00564-z. PMID 34489580
- JOURNAL_INFO : Cell research ; Cell Res. ; 2021 ; 31 ; 12 ; 1308-1310
- PUBMED_LINK : 34489580
Michigan Imputation Server
- NAME : Michigan Imputation Server
- SHORT NAME : Michigan
- URL : https://imputationserver.sph.umich.edu/index.html#!
- TITLE : Next-generation genotype imputation service and methods
- DOI : 10.1038/ng.3656
- ABSTRACT : Genotype imputation is a key component of genetic association studies, where it increases power, facilitates meta-analysis, and aids interpretation of signals. Genotype imputation is computationally demanding and, with current tools, typically requires access to a high-performance computing cluster and to a reference panel of sequenced genomes. Here we describe improvements to imputation machinery that reduce computational requirements by more than an order of magnitude with no loss of accuracy in comparison to standard imputation tools. We also describe a new web-based service for imputation that facilitates access to new reference panels and greatly improves user experience and productivity.
- CITATION : Das S, Forer L, Schönherr S, Sidore C, ...&, Fuchsberger C. (2016) Next-generation genotype imputation service and methods Nat. Genet., 48 (10) 1284-1287. doi:10.1038/ng.3656. PMID 27571263
- JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2016 ; 48 ; 10 ; 1284-1287
- PUBMED_LINK : 27571263
NyuWa Imputation Server
- NAME : NyuWa Imputation Server
- SHORT NAME : NyuWa
- URL : http://bigdata.ibp.ac.cn/refpanel/getstarted.php
- TITLE : NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population
- DOI : 10.1016/j.celrep.2021.110017
- ABSTRACT : The lack of haplotype reference panels and whole-genome sequencing resources specific to the Chinese population has greatly hindered genetic studies in the world's largest population. Here, we present the NyuWa genome resource, based on deep (26.2×) sequencing of 2,999 Chinese individuals, and construct a NyuWa reference panel of 5,804 haplotypes and 19.3 million variants, which is a high-quality publicly available Chinese population-specific reference panel with thousands of samples. Compared with other panels, the NyuWa reference panel reduces the Han Chinese imputation error rate by a margin ranging from 30% to 51%. Population structure and imputation simulation tests support the applicability of one integrated reference panel for northern and southern Chinese. In addition, a total of 22,504 loss-of-function variants in coding and noncoding genes are identified, including 11,493 novel variants. These results highlight the value of the NyuWa genome resource in facilitating genetic research in Chinese and Asian populations.
- COPYRIGHT : http://creativecommons.org/licenses/by-nc-nd/4.0/
- CITATION : Zhang P, Luo H, Li Y, Wang Y, ...&, He S. (2021) NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population Cell Rep., 37 (7) 110017. doi:10.1016/j.celrep.2021.110017. PMID 34788621
- JOURNAL_INFO : Cell reports ; Cell Rep. ; 2021 ; 37 ; 7 ; 110017
- PUBMED_LINK : 34788621
PGG.Han
- NAME : PGG.Han
- SHORT NAME : PGG.Han
- URL : https://www.biosino.org/pgghan2/login
- TITLE : PGG.Han: the Han Chinese genome database and analysis platform
- DOI : 10.1093/nar/gkz829
- ABSTRACT : As the largest ethnic group in the world, the Han Chinese population is nonetheless underrepresented in global efforts to catalogue the genomic variability of natural populations. Here, we developed the PGG.Han, a population genome database to serve as the central repository for the genomic data of the Han Chinese Genome Initiative (Phase I). In its current version, the PGG.Han archives whole-genome sequences or high-density genome-wide single-nucleotide variants (SNVs) of 114 783 Han Chinese individuals (a.k.a. the Han100K), representing geographical sub-populations covering 33 of the 34 administrative divisions of China, as well as Singapore. The PGG.Han provides: (i) an interactive interface for visualization of the fine-scale genetic structure of the Han Chinese population; (ii) genome-wide allele frequencies of hierarchical sub-populations; (iii) ancestry inference for individual samples and controlling population stratification based on nested ancestry informative markers (AIMs) panels; (iv) population-structure-aware shared control data for genotype-phenotype association studies (e.g. GWASs) and (v) a Han-Chinese-specific reference panel for genotype imputation. Computational tools are implemented into the PGG.Han, and an online user-friendly interface is provided for data analysis and results visualization. The PGG.Han database is freely accessible via http://www.pgghan.org or https://www.hanchinesegenomes.org.
- COPYRIGHT : http://creativecommons.org/licenses/by-nc/4.0/
- CITATION : Gao Y, Zhang C, Yuan L, Ling Y, ...&, Xu S. (2020) PGG.Han: the Han Chinese genome database and analysis platform Nucleic Acids Res., 48 (D1) D971-D976. doi:10.1093/nar/gkz829. PMID 31584086
- JOURNAL_INFO : Nucleic acids research ; Nucleic Acids Res. ; 2020 ; 48 ; D1 ; D971-D976
- PUBMED_LINK : 31584086
Sanger
- NAME : Sanger
- URL : https://imputation.sanger.ac.uk/
- TITLE : A reference panel of 64,976 haplotypes for genotype imputation
- DOI : 10.1038/ng.3643
- ABSTRACT : We describe a reference panel of 64,976 human haplotypes at 39,235,157 SNPs constructed using whole-genome sequence data from 20 studies of predominantly European ancestry. Using this resource leads to accurate genotype imputation at minor allele frequencies as low as 0.1% and a large increase in the number of SNPs tested in association studies, and it can help to discover and refine causal loci. We describe remote server resources that allow researchers to carry out imputation and phasing consistently and efficiently.
- CITATION : McCarthy S, Das S, Kretzschmar W, Delaneau O, ...&, Marchini J. (2016) A reference panel of 64,976 haplotypes for genotype imputation Nat. Genet., 48 (10) 1279-1283. doi:10.1038/ng.3643. PMID 27548312
- JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2016 ; 48 ; 10 ; 1279-1283
- PUBMED_LINK : 27548312
TOPMED
- NAME : TOPMED
- SHORT NAME : TOPMED
- FULL NAME : Trans-Omics for Precision Medicine
- URL : https://imputation.biodatacatalyst.nhlbi.nih.gov/#!
- TITLE : Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program
- DOI : 10.1038/s41586-021-03205-y
- ABSTRACT : The Trans-Omics for Precision Medicine (TOPMed) programme seeks to elucidate the genetic architecture and biology of heart, lung, blood and sleep disorders, with the ultimate goal of improving diagnosis, treatment and prevention of these diseases. The initial phases of the programme focused on whole-genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here we describe the TOPMed goals and design as well as the available resources and early insights obtained from the sequence data. The resources include a variant browser, a genotype imputation server, and genomic and phenotypic data that are available through dbGaP (Database of Genotypes and Phenotypes). In the first 53,831 TOPMed samples, we detected more than 400 million single-nucleotide and insertion or deletion variants after alignment with the reference genome. Additional previously undescribed variants were detected through assembly of unmapped reads and customized analysis in highly variable loci. Among the more than 400 million detected variants, 97% have frequencies of less than 1% and 46% are singletons that are present in only one individual (53% among unrelated individuals). These rare variants provide insights into mutational processes and recent human evolutionary history. The extensive catalogue of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and noncoding sequence variants to phenotypic variation. Furthermore, combining TOPMed haplotypes with modern imputation methods improves the power and reach of genome-wide association studies to include variants down to a frequency of approximately 0.01%.
- CITATION : Taliun D, Harris DN, Kessler MD, Carlson J, ...&, Abecasis GR. (2021) Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program Nature, 590 (7845) 290-299. doi:10.1038/s41586-021-03205-y. PMID 33568819
- JOURNAL_INFO : Nature ; Nature ; 2021 ; 590 ; 7845 ; 290-299
- PUBMED_LINK : 33568819
Westlake Imputation Server
- NAME : Westlake Imputation Server
- SHORT NAME : Westlake Imputation Server
- URL : https://imputationserver.westlake.edu.cn/
- TITLE : Genomic analyses of 10,376 individuals in the Westlake BioBank for Chinese (WBBC) pilot project
- DOI : 10.1038/s41467-022-30526-x
- ABSTRACT : We initiate the Westlake BioBank for Chinese (WBBC) pilot project with 4,535 whole-genome sequencing (WGS) individuals and 5,841 high-density genotyping individuals, and identify 81.5 million SNPs and INDELs, of which 38.5% are absent in dbSNP Build 151. We provide a population-specific reference panel and an online imputation server (https://wbbc.westlake.edu.cn/) which could yield substantial improvement of imputation performance in Chinese population, especially for low-frequency and rare variants. By analyzing the singleton density of the WGS data, we find selection signatures in SNX29, DNAH1 and WDR1 genes, and the derived alleles of the alcohol metabolism genes (ADH1A and ADH1B) emerge around 7,000 years ago and tend to be more common from 4,000 years ago in East Asia. Genetic evidence supports the corresponding geographical boundaries of the Qinling-Huaihe Line and Nanling Mountains, which separate the Han Chinese into subgroups, and we reveal that North Han was more homogeneous than South Han.
- CITATION : Cong PK, Bai WY, Li JC, Yang MY, ...&, Zheng HF. (2022) Genomic analyses of 10,376 individuals in the Westlake BioBank for Chinese (WBBC) pilot project Nat. Commun., 13 (1) 2939. doi:10.1038/s41467-022-30526-x. PMID 35618720
- JOURNAL_INFO : Nature communications ; Nat. Commun. ; 2022 ; 13 ; 1 ; 2939
- PUBMED_LINK : 35618720
Other tools
RESHAPE
- NAME : RESHAPE
- SHORT NAME : RESHAPE
- FULL NAME : REcombine and Share HAPlotypEs
- DESCRIPTION : RESHAPE removes sample-level genetic information from a reference panel to create a synthetic reference panel. Given a genetic map and the VCF/BCF of a reference panel, RESHAPE outputs a VCF/BCF of the same size in which each haplotype is a mosaic of the original haplotypes of the reference panel.
- URL : https://github.com/TheoCavinato/RESHAPE
- TITLE : A resampling-based approach to share reference panels
- DOI : 10.1038/s43588-024-00630-7
- ABSTRACT : For many genome-wide association studies, imputing genotypes from a haplotype reference panel is a necessary step. Over the past 15 years, reference panels have become larger and more diverse, leading to improvements in imputation accuracy. However, the latest generation of reference panels is subject to restrictions on data sharing due to concerns about privacy, limiting their usefulness for genotype imputation. In this context, here we propose RESHAPE, a method that employs a recombination Poisson process on a reference panel to simulate the genomes of hypothetical descendants after multiple generations. This data transformation helps to protect against re-identification threats and preserves data attributes, such as linkage disequilibrium patterns and, to some degree, identity-by-descent sharing, allowing for genotype imputation. Our experiments on gold-standard datasets show that simulated descendants up to eight generations can serve as reference panels without substantially reducing genotype imputation accuracy.
- COPYRIGHT : https://creativecommons.org/licenses/by/4.0
- CITATION : Cavinato T, Rubinacci S, Malaspinas AS, Delaneau O. (2024) A resampling-based approach to share reference panels Nat. Comput. Sci., 4 (5) 360-366. doi:10.1038/s43588-024-00630-7. PMID 38745108
- JOURNAL_INFO : Nature computational science ; Nat. Comput. Sci. ; 2024 ; 4 ; 5 ; 360-366
- PUBMED_LINK : 38745108
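
The mosaic idea in the RESHAPE description can be illustrated with a small, self-contained Python sketch. This is not RESHAPE's code: the function name `mosaic_panel`, its parameters, and the shuffling rule are assumptions made for illustration. Between consecutive sites, each output haplotype has at least one recombination event with probability 1 - exp(-G * d), where G is the number of simulated generations and d is the genetic distance in Morgans (a Poisson process with rate G * d); the haplotypes that recombine swap the original haplotypes they are copying.

```python
import math
import random


def mosaic_panel(haps, cm_pos, generations=8, seed=0):
    """Return a synthetic panel where each haplotype is a mosaic of the inputs.

    haps:        list of haplotypes, each a list of 0/1 alleles (hypothetical input layout)
    cm_pos:      genetic position of each site in centimorgans, non-decreasing
    generations: number of simulated generations (assumed parameter name)
    """
    rng = random.Random(seed)
    n_hap, n_sites = len(haps), len(cm_pos)

    # source[i] is the original haplotype that output haplotype i currently copies.
    source = list(range(n_hap))
    rng.shuffle(source)

    out = [[0] * n_sites for _ in range(n_hap)]
    for j in range(n_sites):
        if j > 0:
            d_morgan = (cm_pos[j] - cm_pos[j - 1]) / 100.0
            # Probability of at least one event under a Poisson process with rate G * d.
            p_switch = 1.0 - math.exp(-generations * d_morgan)
            switching = [i for i in range(n_hap) if rng.random() < p_switch]
            # Recombining haplotypes exchange their sources, so `source` stays a permutation.
            donors = [source[i] for i in switching]
            rng.shuffle(donors)
            for i, s in zip(switching, donors):
                source[i] = s
        for i in range(n_hap):
            out[i][j] = haps[source[i]][j]
    return out


if __name__ == "__main__":
    panel = [[0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]]
    positions_cm = [0.0, 5.0, 10.0, 50.0]
    for hap in mosaic_panel(panel, positions_cm, generations=8, seed=1):
        print(hap)
```

Because the copying assignment is a permutation at every site, the output has the same dimensions and the same per-site allele counts as the input, while no output haplotype is guaranteed to match a single original one over long stretches.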
Phasing & Imputation tool
BEAGLE
- NAME : BEAGLE
- SHORT NAME : BEAGLE
- URL : https://faculty.washington.edu/browning/beagle/beagle.html
- TITLE : Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering
- DOI : 10.1086/521987
- ABSTRACT : Whole-genome association studies present many new statistical and computational challenges due to the large quantity of data obtained. One of these challenges is haplotype inference; methods for haplotype inference designed for small data sets from candidate-gene studies do not scale well to the large number of individuals genotyped in whole-genome association studies. We present a new method and software for inference of haplotype phase and missing data that can accurately phase data from whole-genome association studies, and we present the first comparison of haplotype-inference methods for real and simulated data sets with thousands of genotyped individuals. We find that our method outperforms existing methods in terms of both speed and accuracy for large data sets with thousands of individuals and densely spaced genetic markers, and we use our method to phase a real data set of 3,002 individuals genotyped for 490,032 markers in 3.1 days of computing time, with 99% of masked alleles imputed correctly. Our method is implemented in the Beagle software package, which is freely available.
- COPYRIGHT : https://www.elsevier.com/open-access/userlicense/1.0/
- CITATION : Browning SR, Browning BL. (2007) Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering Am. J. Hum. Genet., 81 (5) 1084-1097. doi:10.1086/521987. PMID 17924348
- JOURNAL_INFO : The American Journal of Human Genetics ; Am. J. Hum. Genet. ; 2007 ; 81 ; 5 ; 1084-1097
- PUBMED_LINK : 17924348
BEAGLE4
- NAME : BEAGLE4
- SHORT NAME : BEAGLE4
- DESCRIPTION : (beagle 4.1)
- URL : https://faculty.washington.edu/browning/beagle/beagle.html
- TITLE : Genotype imputation with millions of reference samples
- DOI : 10.1016/j.ajhg.2015.11.020
- ABSTRACT : We present a genotype imputation method that scales to millions of reference samples. The imputation method, based on the Li and Stephens model and implemented in Beagle v.4.1, is parallelized and memory efficient, making it well suited to multi-core computer processors. It achieves fast, accurate, and memory-efficient genotype imputation by restricting the probability model to markers that are genotyped in the target samples and by performing linear interpolation to impute ungenotyped variants. We compare Beagle v.4.1 with Impute2 and Minimac3 by using 1000 Genomes Project data, UK10K Project data, and simulated data. All three methods have similar accuracy but different memory requirements and different computation times. When imputing 10 Mb of sequence data from 50,000 reference samples, Beagle's throughput was more than 100× greater than Impute2's throughput on our computer servers. When imputing 10 Mb of sequence data from 200,000 reference samples in VCF format, Minimac3 consumed 26× more memory per computational thread and 15× more CPU time than Beagle. We demonstrate that Beagle v.4.1 scales to much larger reference panels by performing imputation from a simulated reference panel having 5 million samples and a mean marker density of one marker per four base pairs.
- COPYRIGHT : https://www.elsevier.com/open-access/userlicense/1.0/
- CITATION : Browning BL, Browning SR. (2016) Genotype imputation with millions of reference samples Am. J. Hum. Genet., 98 (1) 116-126. doi:10.1016/j.ajhg.2015.11.020. PMID 26748515
- JOURNAL_INFO : The American Journal of Human Genetics ; Am. J. Hum. Genet. ; 2016 ; 98 ; 1 ; 116-126
- PUBMED_LINK : 26748515
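
The abstract above mentions restricting the probability model to genotyped markers and using linear interpolation for ungenotyped variants. The following Python sketch shows one way that interpolation step could look; it is not Beagle's implementation. The function name `interpolate_alt_prob`, its arguments, and the choice of coordinate (any monotone position, such as genetic distance) are assumptions for illustration.

```python
def interpolate_alt_prob(pos, pos_a, pos_b, probs_a, probs_b, ref_alleles):
    """Estimate the ALT-allele probability at an ungenotyped marker.

    pos:         position of the ungenotyped marker
    pos_a/pos_b: positions of the flanking genotyped markers (pos_a < pos_b)
    probs_a/b:   probability that the target haplotype copies each reference
                 haplotype at the left/right flanking marker (each sums to 1)
    ref_alleles: allele (0 = REF, 1 = ALT) carried by each reference haplotype
                 at the ungenotyped marker
    """
    if pos_a >= pos_b or not (pos_a <= pos <= pos_b):
        raise ValueError("pos must lie between two distinct flanking markers")

    # Weight of the right-hand marker grows linearly from 0 at pos_a to 1 at pos_b.
    w = (pos - pos_a) / (pos_b - pos_a)

    # Interpolate each reference haplotype's copying probability, then sum the
    # probability mass of the haplotypes that carry the ALT allele.
    return sum(
        ((1.0 - w) * pa + w * pb) * allele
        for pa, pb, allele in zip(probs_a, probs_b, ref_alleles)
    )


if __name__ == "__main__":
    # Two reference haplotypes; only the second carries ALT at the ungenotyped marker.
    print(interpolate_alt_prob(150, 100, 200, [0.8, 0.2], [0.2, 0.8], [0, 1]))
```

In this sketch the expensive hidden Markov model computation would only be needed at the genotyped markers; every marker in between reuses the two flanking probability vectors.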
BEAGLE5.4 (Imputation)
- NAME : BEAGLE5.4 (Imputation)
- SHORT NAME : BEAGLE5.4 (Imputation)
- DESCRIPTION : Beagle 5.4 imputation. The cited paper describes the imputation method as introduced in Beagle 5.0.
- URL : https://faculty.washington.edu/browning/beagle/beagle.html
- TITLE : A one-penny imputed genome from next-generation reference panels
- DOI : 10.1016/j.ajhg.2018.07.015
- ABSTRACT : Genotype imputation is commonly performed in genome-wide association studies because it greatly increases the number of markers that can be tested for association with a trait. In general, one should perform genotype imputation using the largest reference panel that is available because the number of accurately imputed variants increases with reference panel size. However, one impediment to using larger reference panels is the increased computational cost of imputation. We present a new genotype imputation method, Beagle 5.0, which greatly reduces the computational cost of imputation from large reference panels. We compare Beagle 5.0 with Beagle 4.1, Impute4, Minimac3, and Minimac4 using 1000 Genomes Project data, Haplotype Reference Consortium data, and simulated data for 10k, 100k, 1M, and 10M reference samples. All methods produce nearly identical accuracy, but Beagle 5.0 has the lowest computation time and the best scaling of computation time with increasing reference panel size. For 10k, 100k, 1M, and 10M reference samples and 1,000 phased target samples, Beagle 5.0's computation time is 3× (10k), 12× (100k), 43× (1M), and 533× (10M) faster than the fastest alternative method. Cost data from the Amazon Elastic Compute Cloud show that Beagle 5.0 can perform genome-wide imputation from 10M reference samples into 1,000 phased target samples at a cost of less than one US cent per sample.
- COPYRIGHT : http://www.elsevier.com/open-access/userlicense/1.0/
- CITATION : Browning BL, Zhou Y, Browning SR. (2018) A one-penny imputed genome from next-generation reference panels Am. J. Hum. Genet., 103 (3) 338-348. doi:10.1016/j.ajhg.2018.07.015. PMID 30100085
- JOURNAL_INFO : The American Journal of Human Genetics ; Am. J. Hum. Genet. ; 2018 ; 103 ; 3 ; 338-348
- PUBMED_LINK : 30100085
BEAGLE5.4 (Phasing)
- NAME : BEAGLE5.4 (Phasing)
- SHORT NAME : BEAGLE5.4 (Phasing)
- DESCRIPTION : Beagle 5.4 phasing. The cited paper describes the phasing method as implemented in Beagle 5.2.
- URL : https://faculty.washington.edu/browning/beagle/beagle.html
- TITLE : Fast two-stage phasing of large-scale sequence data
- DOI : 10.1016/j.ajhg.2021.08.005
- ABSTRACT : Haplotype phasing is the estimation of haplotypes from genotype data. We present a fast, accurate, and memory-efficient haplotype phasing method that scales to large-scale SNP array and sequence data. The method uses marker windowing and composite reference haplotypes to reduce memory usage and computation time. It incorporates a progressive phasing algorithm that identifies confidently phased heterozygotes in each iteration and fixes the phase of these heterozygotes in subsequent iterations. For data with many low-frequency variants, such as whole-genome sequence data, the method employs a two-stage phasing algorithm that phases high-frequency markers via progressive phasing in the first stage and phases low-frequency markers via genotype imputation in the second stage. This haplotype phasing method is implemented in the open-source Beagle 5.2 software package. We compare Beagle 5.2 and SHAPEIT 4.2.1 by using expanding subsets of 485,301 UK Biobank samples and 38,387 TOPMed samples. Both methods have very similar accuracy and computation time for UK Biobank SNP array data. However, for TOPMed sequence data, Beagle is more than 20 times faster than SHAPEIT, achieves similar accuracy, and scales to larger sample sizes.
- COPYRIGHT : http://www.elsevier.com/open-access/userlicense/1.0/
- CITATION : Browning BL, Tian X, Zhou Y, Browning SR. (2021) Fast two-stage phasing of large-scale sequence data Am. J. Hum. Genet., 108 (10) 1880-1890. doi:10.1016/j.ajhg.2021.08.005. PMID 34478634
- JOURNAL_INFO : The American Journal of Human Genetics ; Am. J. Hum. Genet. ; 2021 ; 108 ; 10 ; 1880-1890
- PUBMED_LINK : 34478634
EAGLE
- NAME : EAGLE
- SHORT NAME : EAGLE
- DESCRIPTION : (EAGLE1)
- URL : https://alkesgroup.broadinstitute.org/Eagle/
- TITLE : Fast and accurate long-range phasing in a UK Biobank cohort
- DOI : 10.1038/ng.3571
- ABSTRACT : Recent work has leveraged the extensive genotyping of the Icelandic population to perform long-range phasing (LRP), enabling accurate imputation and association analysis of rare variants in target samples typed on genotyping arrays. Here we develop a fast and accurate LRP method, Eagle, that extends this paradigm to populations with much smaller proportions of genotyped samples by harnessing long (>4-cM) identical-by-descent (IBD) tracts shared among distantly related individuals. We applied Eagle to N ≈ 150,000 samples (0.2% of the British population) from the UK Biobank, and we determined that it is 1-2 orders of magnitude faster than existing methods while achieving similar or better phasing accuracy (switch error rate ≈ 0.3%, corresponding to perfect phase in a majority of 10-Mb segments). We also observed that, when used within an imputation pipeline, Eagle prephasing improved downstream imputation accuracy in comparison to prephasing in batches using existing methods, as necessary to achieve comparable computational cost.
- CITATION : Loh PR, Palamara PF, Price AL. (2016) Fast and accurate long-range phasing in a UK Biobank cohort Nat. Genet., 48 (7) 811-816. doi:10.1038/ng.3571. PMID 27270109
- JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2016 ; 48 ; 7 ; 811-816
- PUBMED_LINK : 27270109
EAGLE2
- NAME : EAGLE2
- SHORT NAME : EAGLE2
- DESCRIPTION : (EAGLE2)
- URL : https://alkesgroup.broadinstitute.org/Eagle/
- TITLE : Reference-based phasing using the Haplotype Reference Consortium panel
- DOI : 10.1038/ng.3679
- ABSTRACT : Haplotype phasing is a fundamental problem in medical and population genetics. Phasing is generally performed via statistical phasing in a genotyped cohort, an approach that can yield high accuracy in very large cohorts but attains lower accuracy in smaller cohorts. Here we instead explore the paradigm of reference-based phasing. We introduce a new phasing algorithm, Eagle2, that attains high accuracy across a broad range of cohort sizes by efficiently leveraging information from large external reference panels (such as the Haplotype Reference Consortium; HRC) using a new data structure based on the positional Burrows-Wheeler transform. We demonstrate that Eagle2 attains a ∼20× speedup and ∼10% increase in accuracy compared to reference-based phasing using SHAPEIT2. On European-ancestry samples, Eagle2 with the HRC panel achieves >2× the accuracy of 1000 Genomes-based phasing. Eagle2 is open source and freely available for HRC-based phasing via the Sanger Imputation Service and the Michigan Imputation Server.
- CITATION : Loh PR, Danecek P, Palamara PF, Fuchsberger C, ...&, Price AL. (2016) Reference-based phasing using the Haplotype Reference Consortium panel Nat. Genet., 48 (11) 1443-1448. doi:10.1038/ng.3679. PMID 27694958
- JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2016 ; 48 ; 11 ; 1443-1448
- PUBMED_LINK : 27694958
GLIMPSE
- NAME : GLIMPSE
- SHORT NAME : GLIMPSE
- FULL NAME : Genotype Likelihoods IMputation and PhaSing mEthod
- DESCRIPTION : GLIMPSE is a phasing and imputation method for large-scale low-coverage sequencing studies.
- URL : https://odelaneau.github.io/GLIMPSE/
- TITLE : Efficient phasing and imputation of low-coverage sequencing data using large reference panels
- DOI : 10.1038/s41588-020-00756-0
- ABSTRACT : Low-coverage whole-genome sequencing followed by imputation has been proposed as a cost-effective genotyping approach for disease and population genetics studies. However, its competitiveness against SNP arrays is undermined because current imputation methods are computationally expensive and unable to leverage large reference panels. Here, we describe a method, GLIMPSE, for phasing and imputation of low-coverage sequencing datasets from modern reference panels. We demonstrate its remarkable performance across different coverages and human populations. GLIMPSE achieves imputation of a genome for less than US$1 in computational cost, considerably outperforming other methods and improving imputation accuracy over the full allele frequency range. As a proof of concept, we show that 1× coverage enables effective gene expression association studies and outperforms dense SNP arrays in rare variant burden tests. Overall, this study illustrates the promising potential of low-coverage imputation and suggests a paradigm shift in the design of future genomic studies.
- CITATION : Rubinacci S, Ribeiro DM, Hofmeister RJ, Delaneau O. (2021) Efficient phasing and imputation of low-coverage sequencing data using large reference panels Nat. Genet., 53 (1) 120-126. doi:10.1038/s41588-020-00756-0. PMID 33414550
- JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2021 ; 53 ; 1 ; 120-126
- PUBMED_LINK : 33414550
IMPUTE
- NAME : IMPUTE
- SHORT NAME : IMPUTE
- URL : https://jmarchini.org/software/
- TITLE : A new multipoint method for genome-wide association studies by imputation of genotypes
- DOI : 10.1038/ng2088
- ABSTRACT : Genome-wide association studies are set to become the method of choice for uncovering the genetic basis of human diseases. A central challenge in this area is the development of powerful multipoint methods that can detect causal variants that have not been directly genotyped. We propose a coherent analysis framework that treats the problem as one involving missing or uncertain genotypes. Central to our approach is a model-based imputation method for inferring genotypes at observed or unobserved SNPs, leading to improved power over existing methods for multipoint association mapping. Using real genome-wide association study data, we show that our approach (i) is accurate and well calibrated, (ii) provides detailed views of associated regions that facilitate follow-up studies and (iii) can be used to validate and correct data at genotyped markers. A notable future use of our method will be to boost power by combining data from genome-wide scans that use different SNP sets.
- CITATION : Marchini J, Howie B, Myers S, McVean G, ...&, Donnelly P. (2007) A new multipoint method for genome-wide association studies by imputation of genotypes Nat. Genet., 39 (7) 906-913. doi:10.1038/ng2088. PMID 17572673
- JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2007 ; 39 ; 7 ; 906-913
- PUBMED_LINK : 17572673
IMPUTE2
- NAME : IMPUTE2
- SHORT NAME : IMPUTE2
- TITLE : A flexible and accurate genotype imputation method for the next generation of genome-wide association studies
- DOI : 10.1371/journal.pgen.1000529
- ABSTRACT : Genotype imputation methods are now being widely used in the analysis of genome-wide association studies. Most imputation analyses to date have used the HapMap as a reference dataset, but new reference panels (such as controls genotyped on multiple SNP chips and densely typed samples from the 1,000 Genomes Project) will soon allow a broader range of SNPs to be imputed with higher accuracy, thereby increasing power. We describe a genotype imputation method (IMPUTE version 2) that is designed to address the challenges presented by these new datasets. The main innovation of our approach is a flexible modelling framework that increases accuracy and combines information across multiple reference panels while remaining computationally feasible. We find that IMPUTE v2 attains higher accuracy than other methods when the HapMap provides the sole reference panel, but that the size of the panel constrains the improvements that can be made. We also find that imputation accuracy can be greatly enhanced by expanding the reference panel to contain thousands of chromosomes and that IMPUTE v2 outperforms other methods in this setting at both rare and common SNPs, with overall error rates that are 15%-20% lower than those of the closest competing method. One particularly challenging aspect of next-generation association studies is to integrate information across multiple reference panels genotyped on different sets of SNPs; we show that our approach to this problem has practical advantages over other suggested solutions.
- CITATION : Howie BN, Donnelly P, Marchini J. (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies PLoS Genet., 5 (6) e1000529. doi:10.1371/journal.pgen.1000529. PMID 19543373
- JOURNAL_INFO : PLoS genetics ; PLoS Genet. ; 2009 ; 5 ; 6 ; e1000529
- PUBMED_LINK : 19543373
IMPUTE4
- NAME : IMPUTE4
- SHORT NAME : IMPUTE4
- TITLE : The UK Biobank resource with deep phenotyping and genomic data
- DOI : 10.1038/s41586-018-0579-z
- ABSTRACT : The UK Biobank project is a prospective cohort study with deep genetic and phenotypic data collected on approximately 500,000 individuals from across the United Kingdom, aged between 40 and 69 at recruitment. The open resource is unique in its size and scope. A rich variety of phenotypic and health-related information is available on each participant, including biological measurements, lifestyle indicators, biomarkers in blood and urine, and imaging of the body and brain. Follow-up information is provided by linking health and medical records. Genome-wide genotype data have been collected on all participants, providing many opportunities for the discovery of new genetic associations and the genetic bases of complex traits. Here we describe the centralized analysis of the genetic data, including genotype quality, properties of population structure and relatedness of the genetic data, and efficient phasing and genotype imputation that increases the number of testable variants to around 96 million. Classical allelic variation at 11 human leukocyte antigen genes was imputed, resulting in the recovery of signals with known associations between human leukocyte antigen alleles and many diseases.
- CITATION : Bycroft C, Freeman C, Petkova D, Band G, ...&, Marchini J. (2018) The UK Biobank resource with deep phenotyping and genomic data Nature, 562 (7726) 203-209. doi:10.1038/s41586-018-0579-z. PMID 30305743
- JOURNAL_INFO : Nature ; Nature ; 2018 ; 562 ; 7726 ; 203-209
- PUBMED_LINK : 30305743
IMPUTE5
- NAME : IMPUTE5
- SHORT NAME : IMPUTE5
- TITLE : Genotype imputation using the Positional Burrows Wheeler Transform
- DOI : 10.1371/journal.pgen.1009049
- ABSTRACT : Genotype imputation is the process of predicting unobserved genotypes in a sample of individuals using a reference panel of haplotypes. In the last 10 years reference panels have increased in size by more than 100 fold. Increasing reference panel size improves accuracy of markers with low minor allele frequencies but poses ever increasing computational challenges for imputation methods. Here we present IMPUTE5, a genotype imputation method that can scale to reference panels with millions of samples. This method continues to refine the observation made in the IMPUTE2 method, that accuracy is optimized via use of a custom subset of haplotypes when imputing each individual. It achieves fast, accurate, and memory-efficient imputation by selecting haplotypes using the Positional Burrows Wheeler Transform (PBWT). By using the PBWT data structure at genotyped markers, IMPUTE5 identifies locally best matching haplotypes and long identical by state segments. The method then uses the selected haplotypes as conditioning states within the IMPUTE model. Using the HRC reference panel, which has ∼65,000 haplotypes, we show that IMPUTE5 is up to 30x faster than MINIMAC4 and up to 3x faster than BEAGLE5.1, and uses less memory than both these methods. Using simulated reference panels we show that IMPUTE5 scales sub-linearly with reference panel size. For example, keeping the number of imputed markers constant, increasing the reference panel size from 10,000 to 1 million haplotypes requires less than twice the computation time. As the reference panel increases in size IMPUTE5 is able to utilize a smaller number of reference haplotypes, thus reducing computational cost.
- COPYRIGHT : http://creativecommons.org/licenses/by/4.0/
- CITATION : Rubinacci S, Delaneau O, Marchini J. (2020) Genotype imputation using the Positional Burrows Wheeler Transform PLoS Genet., 16 (11) e1009049. doi:10.1371/journal.pgen.1009049. PMID 33196638
- JOURNAL_INFO : PLoS genetics ; PLoS Genet. ; 2020 ; 16 ; 11 ; e1009049
- PUBMED_LINK : 33196638
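
The haplotype selection idea in the IMPUTE5 abstract can be illustrated with a minimal sketch of the positional prefix ordering at the core of the PBWT. This is not the IMPUTE5 implementation: the function names `pbwt_orders` and `neighbours` and the toy data are hypothetical, and the sketch assumes biallelic sites coded 0/1. After each site, haplotypes are ordered by their reversed prefixes, so haplotypes that match over a long stretch ending at that site sit next to each other.

```python
def pbwt_orders(haps):
    """Return the PBWT ordering of haplotype indices after each site.

    haps: list of equal-length lists of 0/1 alleles, one list per haplotype.
    The ordering after site k sorts haplotypes by their reversed prefix
    ending at site k; a stable partition on the allele at k maintains it.
    """
    order = list(range(len(haps)))
    orders = []
    for k in range(len(haps[0])):
        zeros = [h for h in order if haps[h][k] == 0]
        ones = [h for h in order if haps[h][k] == 1]
        order = zeros + ones  # stable: earlier ordering preserved within each group
        orders.append(order)
    return orders


def neighbours(order, hap, width=1):
    """Haplotypes within `width` positions of `hap` in a PBWT ordering."""
    pos = order.index(hap)
    lo = max(0, pos - width)
    return [h for h in order[lo:pos + width + 1] if h != hap]


# Toy panel: haplotypes 0 and 2 are identical, so they stay adjacent.
haps = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
]
orders = pbwt_orders(haps)
candidates = neighbours(orders[-1], hap=0, width=1)
```

In this toy panel the only neighbour selected for haplotype 0 at the last site is its identical copy, haplotype 2. A real method would restrict the scan to genotyped markers and also track match lengths.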
MACH / minimac
- NAME : MACH / minimac
- SHORT NAME : MACH / minimac
- DESCRIPTION : (MACH)
- URL : http://csg.sph.umich.edu/abecasis/MaCH/index.html
- TITLE : MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes
- DOI : 10.1002/gepi.20533
- ABSTRACT : Genome-wide association studies (GWAS) can identify common alleles that contribute to complex disease susceptibility. Despite the large number of SNPs assessed in each study, the effects of most common SNPs must be evaluated indirectly using either genotyped markers or haplotypes thereof as proxies. We have previously implemented a computationally efficient Markov Chain framework for genotype imputation and haplotyping in the freely available MaCH software package. The approach describes sampled chromosomes as mosaics of each other and uses available genotype and shotgun sequence data to estimate unobserved genotypes and haplotypes, together with useful measures of the quality of these estimates. Our approach is already widely used to facilitate comparison of results across studies as well as meta-analyses of GWAS. Here, we use simulations and experimental genotypes to evaluate its accuracy and utility, considering choices of genotyping panels, reference panel configurations, and designs where genotyping is replaced with shotgun sequencing. Importantly, we show that genotype imputation not only facilitates cross study analyses but also increases power of genetic association studies. We show that genotype imputation of common variants using HapMap haplotypes as a reference is very accurate using either genome-wide SNP data or smaller amounts of data typical in fine-mapping studies. Furthermore, we show the approach is applicable in a variety of populations. Finally, we illustrate how association analyses of unobserved variants will benefit from ongoing advances such as larger HapMap reference panels and whole genome shotgun sequencing technologies.
- COPYRIGHT : http://onlinelibrary.wiley.com/termsAndConditions#vor
- CITATION : Li Y, Willer CJ, Ding J, Scheet P, ...&, Abecasis GR. (2010) MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes Genet. Epidemiol., 34 (8) 816-834. doi:10.1002/gepi.20533. PMID 21058334
- JOURNAL_INFO : Genetic epidemiology ; Genet. Epidemiol. ; 2010 ; 34 ; 8 ; 816-834
- PUBMED_LINK : 21058334
MACH / minimac pre-phasing
- NAME : MACH / minimac pre-phasing
- SHORT NAME : MACH / minimac pre-phasing
- DESCRIPTION : (pre-phasing, minimac)
- URL : https://genome.sph.umich.edu/wiki/Minimac
- TITLE : Fast and accurate genotype imputation in genome-wide association studies through pre-phasing
- DOI : 10.1038/ng.2354
- ABSTRACT : The 1000 Genomes Project and disease-specific sequencing efforts are producing large collections of haplotypes that can be used as reference panels for genotype imputation in genome-wide association studies (GWAS). However, imputing from large reference panels with existing methods imposes a high computational burden. We introduce a strategy called 'pre-phasing' that maintains the accuracy of leading methods while reducing computational costs. We first statistically estimate the haplotypes for each individual within the GWAS sample (pre-phasing) and then impute missing genotypes into these estimated haplotypes. This reduces the computational cost because (i) the GWAS samples must be phased only once, whereas standard methods would implicitly repeat phasing with each reference panel update, and (ii) it is much faster to match a phased GWAS haplotype to one reference haplotype than to match two unphased GWAS genotypes to a pair of reference haplotypes. We implemented our approach in the MaCH and IMPUTE2 frameworks, and we tested it on data sets from the Wellcome Trust Case Control Consortium 2 (WTCCC2), the Genetic Association Information Network (GAIN), the Women's Health Initiative (WHI) and the 1000 Genomes Project. This strategy will be particularly valuable for repeated imputation as reference panels evolve.
- CITATION : Howie B, Fuchsberger C, Stephens M, Marchini J, ...&, Abecasis GR. (2012) Fast and accurate genotype imputation in genome-wide association studies through pre-phasing Nat. Genet., 44 (8) 955-959. doi:10.1038/ng.2354. PMID 22820512
- JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2012 ; 44 ; 8 ; 955-959
- PUBMED_LINK : 22820512
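
Point (ii) of the pre-phasing abstract can be made concrete with a rough count of hidden states. This is a back-of-the-envelope sketch, not a description of how MaCH or IMPUTE2 do their accounting. It assumes a Li-Stephens-style copying model in which an unphased genotype is matched against ordered pairs of reference haplotypes, while each of the two pre-phased haplotypes is matched against single reference haplotypes. The function name `per_site_state_evaluations` is hypothetical.

```python
def per_site_state_evaluations(n_ref_haps, prephased):
    """Rough number of hidden states evaluated per site for one individual.

    Assumption: unphased genotypes are matched to ordered pairs of reference
    haplotypes (n^2 states); pre-phased data are matched one haplotype at a
    time (n states for each of the individual's two haplotypes).
    """
    if prephased:
        return 2 * n_ref_haps
    return n_ref_haps ** 2


n = 1000  # hypothetical reference panel size, in haplotypes
unphased_cost = per_site_state_evaluations(n, prephased=False)
prephased_cost = per_site_state_evaluations(n, prephased=True)
speedup = unphased_cost / prephased_cost
```

Under these assumptions the gap grows linearly with panel size: with 1,000 reference haplotypes the pre-phased search evaluates 500 times fewer states per site.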
MACH / minimac2
- NAME : MACH / minimac2
- SHORT NAME : MACH / minimac2
- DESCRIPTION : (minimac2)
- URL : https://genome.sph.umich.edu/wiki/Minimac2
- TITLE : Minimac2: Faster genotype imputation
- DOI : 10.1093/bioinformatics/btu704
- ABSTRACT : Genotype imputation is a key step in the analysis of genome-wide association studies. Upcoming very large reference panels, such as those from The 1000 Genomes Project and the Haplotype Consortium, will improve imputation quality of rare and less common variants, but will also increase the computational burden. Here, we demonstrate how the application of software engineering techniques can help to keep imputation broadly accessible. Overall, these improvements speed up imputation by an order of magnitude compared with our previous implementation. AVAILABILITY AND IMPLEMENTATION: minimac2, including source code, documentation, and examples is available at http://genome.sph.umich.edu/wiki/Minimac2
- CITATION : Fuchsberger C, Abecasis GR, Hinds DA. (2015) Minimac2: Faster genotype imputation Bioinformatics, 31 (5) 782-784. doi:10.1093/bioinformatics/btu704. PMID 25338720
- JOURNAL_INFO : Bioinformatics (Oxford, England) ; Bioinformatics ; 2015 ; 31 ; 5 ; 782-784
- PUBMED_LINK : 25338720
MACH / minimac3
- NAME : MACH / minimac3
- SHORT NAME : MACH / minimac3
- DESCRIPTION : (minimac3)
- URL : https://genome.sph.umich.edu/wiki/Minimac3
- TITLE : Next-generation genotype imputation service and methods
- DOI : 10.1038/ng.3656
- ABSTRACT : Genotype imputation is a key component of genetic association studies, where it increases power, facilitates meta-analysis, and aids interpretation of signals. Genotype imputation is computationally demanding and, with current tools, typically requires access to a high-performance computing cluster and to a reference panel of sequenced genomes. Here we describe improvements to imputation machinery that reduce computational requirements by more than an order of magnitude with no loss of accuracy in comparison to standard imputation tools. We also describe a new web-based service for imputation that facilitates access to new reference panels and greatly improves user experience and productivity.
- CITATION : Das S, Forer L, Schönherr S, Sidore C, ...&, Fuchsberger C. (2016) Next-generation genotype imputation service and methods Nat. Genet., 48 (10) 1284-1287. doi:10.1038/ng.3656. PMID 27571263
- JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2016 ; 48 ; 10 ; 1284-1287
- PUBMED_LINK : 27571263
MACH / minimac4
- NAME : MACH / minimac4
- SHORT NAME : MACH / minimac4
- DESCRIPTION : (minimac4)
- URL : https://genome.sph.umich.edu/wiki/Minimac4
SHAPEIT1
- NAME : SHAPEIT1
- SHORT NAME : SHAPEIT1
- DESCRIPTION : (SHAPEIT1)
- URL : https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html
- TITLE : A linear complexity phasing method for thousands of genomes
- DOI : 10.1038/nmeth.1785
- ABSTRACT : Human-disease etiology can be better understood with phase information about diploid sequences. We present a method for estimating haplotypes, using genotype data from unrelated samples or small nuclear families, that leads to improved accuracy and speed compared to several widely used methods. The method, segmented haplotype estimation and imputation tool (SHAPEIT), scales linearly with the number of haplotypes used in each iteration and can be run efficiently on whole chromosomes.
- CITATION : Delaneau O, Marchini J, Zagury JF. (2011) A linear complexity phasing method for thousands of genomes Nat. Methods, 9 (2) 179-181. doi:10.1038/nmeth.1785. PMID 22138821
- JOURNAL_INFO : Nature methods ; Nat. Methods ; 2011 ; 9 ; 2 ; 179-181
- PUBMED_LINK : 22138821
SHAPEIT2
- NAME : SHAPEIT2
- SHORT NAME : SHAPEIT2
- DESCRIPTION : (SHAPEIT2)
- TITLE : Improved whole-chromosome phasing for disease and population genetic studies
- DOI : 10.1038/nmeth.2307
- CITATION : Delaneau O, Zagury JF, Marchini J. (2013) Improved whole-chromosome phasing for disease and population genetic studies Nat. Methods, 10 (1) 5-6. doi:10.1038/nmeth.2307. PMID 23269371
- JOURNAL_INFO : Nature methods ; Nat. Methods ; 2013 ; 10 ; 1 ; 5-6
- PUBMED_LINK : 23269371
SHAPEIT3
- NAME : SHAPEIT3
- SHORT NAME : SHAPEIT3
- DESCRIPTION : (SHAPEIT3)
- URL : https://jmarchini.org/shapeit3/
- TITLE : Haplotype estimation for biobank-scale data sets
- DOI : 10.1038/ng.3583
- ABSTRACT : The UK Biobank (UKB) has recently released genotypes on 152,328 individuals together with extensive phenotypic and lifestyle information. We present a new phasing method, SHAPEIT3, that can handle such biobank-scale data sets and results in switch error rates as low as ∼0.3%. The method exhibits O(NlogN) scaling with sample size N, enabling fast and accurate phasing of even larger cohorts.
- CITATION : O'Connell J, Sharp K, Shrine N, Wain L, ...&, Marchini J. (2016) Haplotype estimation for biobank-scale data sets Nat. Genet., 48 (7) 817-820. doi:10.1038/ng.3583. PMID 27270105
- JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2016 ; 48 ; 7 ; 817-820
- PUBMED_LINK : 27270105
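
The switch error rate quoted in the SHAPEIT3 abstract is the usual accuracy metric for phasing. The sketch below shows one common way to compute it for a single individual and is not taken from any SHAPEIT code. It assumes biallelic 0/1 alleles, a known true haplotype pair, and estimated genotypes that agree with the truth at every heterozygous site. The function name `switch_error_rate` is hypothetical.

```python
def switch_error_rate(true_pair, est_pair):
    """Fraction of consecutive heterozygous site pairs whose phase is switched.

    true_pair, est_pair: tuples of two equal-length 0/1 allele lists.
    Only heterozygous sites carry phase information, so homozygous sites
    are skipped. Returns 0.0 when there are fewer than two such sites.
    """
    t1, t2 = true_pair
    e1, e2 = est_pair
    het_sites = [i for i in range(len(t1)) if t1[i] != t2[i]]
    orientation = []
    for i in het_sites:
        if {e1[i], e2[i]} != {t1[i], t2[i]}:
            raise ValueError(f"estimated genotype differs from truth at site {i}")
        # True when the first estimated haplotype follows the first true one
        orientation.append(e1[i] == t1[i])
    opportunities = len(het_sites) - 1
    if opportunities <= 0:
        return 0.0
    switches = sum(1 for a, b in zip(orientation, orientation[1:]) if a != b)
    return switches / opportunities


truth = ([0, 0, 0, 0, 1], [1, 1, 1, 1, 1])      # last site is homozygous
estimate = ([0, 0, 1, 1, 1], [1, 1, 0, 0, 1])   # phase flips after the second site
rate = switch_error_rate(truth, estimate)
```

Here four heterozygous sites give three switch opportunities and one switch occurs, so the rate is 1/3. Swapping the two estimated haplotypes as a whole leaves the rate unchanged, because only changes in orientation between adjacent heterozygous sites are counted.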
SHAPEIT4
- NAME : SHAPEIT4
- SHORT NAME : SHAPEIT4
- DESCRIPTION : (SHAPEIT4)
- URL : https://odelaneau.github.io/shapeit4/
- TITLE : Accurate, scalable and integrative haplotype estimation
- DOI : 10.1038/s41467-019-13225-y
- ABSTRACT : The number of human genomes being genotyped or sequenced increases exponentially and efficient haplotype estimation methods able to handle this amount of data are now required. Here we present a method, SHAPEIT4, which substantially improves upon other methods to process large genotype and high coverage sequencing datasets. It notably exhibits sub-linear running times with sample size, provides highly accurate haplotypes and allows integrating external phasing information such as large reference panels of haplotypes, collections of pre-phased variants and long sequencing reads. We provide SHAPEIT4 in an open source format and demonstrate its performance in terms of accuracy and running times on two gold standard datasets: the UK Biobank data and the Genome In A Bottle.
- COPYRIGHT : https://creativecommons.org/licenses/by/4.0
- CITATION : Delaneau O, Zagury JF, Robinson MR, Marchini JL, ...&, Dermitzakis ET. (2019) Accurate, scalable and integrative haplotype estimation Nat. Commun., 10 (1) 5436. doi:10.1038/s41467-019-13225-y. PMID 31780650
- JOURNAL_INFO : Nature communications ; Nat. Commun. ; 2019 ; 10 ; 1 ; 5436
- PUBMED_LINK : 31780650
SHAPEIT5
- NAME : SHAPEIT5
- SHORT NAME : SHAPEIT5
- DESCRIPTION : (SHAPEIT5)
- TITLE : Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank
- DOI : 10.1038/s41588-023-01415-w
- ABSTRACT : Phasing involves distinguishing the two parentally inherited copies of each chromosome into haplotypes. Here, we introduce SHAPEIT5, a new phasing method that quickly and accurately processes large sequencing datasets and applied it to UK Biobank (UKB) whole-genome and whole-exome sequencing data. We demonstrate that SHAPEIT5 phases rare variants with low switch error rates of below 5% for variants present in just 1 sample out of 100,000. Furthermore, we outline a method for phasing singletons, which, although less precise, constitutes an important step towards future developments. We then demonstrate that the use of UKB as a reference panel improves the accuracy of genotype imputation, which is even more pronounced when phased with SHAPEIT5 compared with other methods. Finally, we screen the UKB data for loss-of-function compound heterozygous events and identify 549 genes where both gene copies are knocked out. These genes complement current knowledge of gene essentiality in the human genome.
- CITATION : Hofmeister RJ, Ribeiro DM, Rubinacci S, Delaneau O. (2023) Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank Nat. Genet., 55 (7) 1243-1249. doi:10.1038/s41588-023-01415-w. PMID 37386248
- JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2023 ; 55 ; 7 ; 1243-1249
- PUBMED_LINK : 37386248
fastPHASE
- NAME : fastPHASE
- SHORT NAME : fastPHASE
- URL : http://scheet.org/software.html
- TITLE : A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase
- DOI : 10.1086/502802
- ABSTRACT : We present a statistical model for patterns of genetic variation in samples of unrelated individuals from natural populations. This model is based on the idea that, over short regions, haplotypes in a population tend to cluster into groups of similar haplotypes. To capture the fact that, because of recombination, this clustering tends to be local in nature, our model allows cluster memberships to change continuously along the chromosome according to a hidden Markov model. This approach is flexible, allowing for both "block-like" patterns of linkage disequilibrium (LD) and gradual decline in LD with distance. The resulting model is also fast and, as a result, is practicable for large data sets (e.g., thousands of individuals typed at hundreds of thousands of markers). We illustrate the utility of the model by applying it to dense single-nucleotide-polymorphism genotype data for the tasks of imputing missing genotypes and estimating haplotypic phase. For imputing missing genotypes, methods based on this model are as accurate or more accurate than existing methods. For haplotype estimation, the point estimates are slightly less accurate than those from the best existing methods (e.g., for unrelated Centre d'Etude du Polymorphisme Humain individuals from the HapMap project, switch error was 0.055 for our method vs. 0.051 for PHASE) but require a small fraction of the computational cost. In addition, we demonstrate that the model accurately reflects uncertainty in its estimates, in that probabilities computed using the model are approximately well calibrated. The methods described in this article are implemented in a software package, fastPHASE, which is available from the Stephens Lab Web site.
- COPYRIGHT : https://www.elsevier.com/open-access/userlicense/1.0/
- CITATION : Scheet P, Stephens M. (2006) A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase Am. J. Hum. Genet., 78 (4) 629-644. doi:10.1086/502802. PMID 16532393
- JOURNAL_INFO : The American Journal of Human Genetics ; Am. J. Hum. Genet. ; 2006 ; 78 ; 4 ; 629-644
- PUBMED_LINK : 16532393
Review
Review-Das
- NAME : Review-Das
- TITLE : Genotype imputation from large reference panels
- DOI : 10.1146/annurev-genom-083117-021602
- ABSTRACT : Genotype imputation has become a standard tool in genome-wide association studies because it enables researchers to inexpensively approximate whole-genome sequence data from genome-wide single-nucleotide polymorphism array data. Genotype imputation increases statistical power, facilitates fine mapping of causal variants, and plays a key role in meta-analyses of genome-wide association studies. Only variants that were previously observed in a reference panel of sequenced individuals can be imputed. However, the rapid increase in the number of deeply sequenced individuals will soon make it possible to assemble enormous reference panels that greatly increase the number of imputable variants. In this review, we present an overview of genotype imputation and describe the computational techniques that make it possible to impute genotypes from reference panels with millions of individuals.
- CITATION : Das S, Abecasis GR, Browning BL. (2018) Genotype imputation from large reference panels Annu. Rev. Genomics Hum. Genet., 19 (1) 73-96. doi:10.1146/annurev-genom-083117-021602. PMID 29799802
- JOURNAL_INFO : Annual review of genomics and human genetics ; Annu. Rev. Genomics Hum. Genet. ; 2018 ; 19 ; 1 ; 73-96
- PUBMED_LINK : 29799802
Review-Li
- NAME : Review-Li
- TITLE : Genotype imputation
- DOI : 10.1146/annurev.genom.9.081307.164242
- ABSTRACT : Genotype imputation is now an essential tool in the analysis of genome-wide association scans. This technique allows geneticists to accurately evaluate the evidence for association at genetic markers that are not directly genotyped. Genotype imputation is particularly useful for combining results across studies that rely on different genotyping platforms but also increases the power of individual scans. Here, we review the history and theoretical underpinnings of the technique. To illustrate performance of the approach, we summarize results from several gene mapping studies. Finally, we preview the role of genotype imputation in an era when whole genome resequencing is becoming increasingly common.
- CITATION : Li Y, Willer C, Sanna S, Abecasis G. (2009) Genotype imputation Annu. Rev. Genomics Hum. Genet., 10 (1) 387-406. doi:10.1146/annurev.genom.9.081307.164242. PMID 19715440
- JOURNAL_INFO : Annual review of genomics and human genetics ; Annu. Rev. Genomics Hum. Genet. ; 2009 ; 10 ; 1 ; 387-406
- PUBMED_LINK : 19715440
Review-Marchini
- NAME : Review-Marchini
- TITLE : Genotype imputation for genome-wide association studies
- DOI : 10.1038/nrg2796
- ABSTRACT : In the past few years genome-wide association (GWA) studies have uncovered a large number of convincingly replicated associations for many complex human diseases. Genotype imputation has been used widely in the analysis of GWA studies to boost power, fine-map associations and facilitate the combination of results across studies using meta-analysis. This Review describes the details of several different statistical methods for imputing genotypes, illustrates and discusses the factors that influence imputation performance, and reviews methods that can be used to assess imputation performance and test association at imputed SNPs.
- CITATION : Marchini J, Howie B. (2010) Genotype imputation for genome-wide association studies Nat. Rev. Genet., 11 (7) 499-511. doi:10.1038/nrg2796. PMID 20517342
- JOURNAL_INFO : Nature reviews. Genetics ; Nat. Rev. Genet. ; 2010 ; 11 ; 7 ; 499-511
- PUBMED_LINK : 20517342
Structural variants imputation panel
1KG SV imputation panel
- NAME : 1KG SV imputation panel
- SHORT NAME : 1KG SV
- KEYWORDS : structural variants, long-read
- PREPRINT_DOI : 10.1101/2023.12.20.23300308
- SERVER : medrxiv
- CITATION : Noyvert, B., Erzurumluoglu, A. M., Drichel, D., Omland, S., Andlauer, T. F., Mueller, S., ... & Ding, Z. (2023). Imputation of structural variants using a multi-ancestry long-read sequencing panel enables identification of disease associations. medRxiv, 2023-12.