Association_tests

Summary Table

| NAME | CATEGORY | CITATION | YEAR |
|---|---|---|---|
| CC-GWAS | Case-case GWAS | Peyrot WJ, Price AL. (2021) Identifying loci with different allele frequencies among cases of eight psychiatric disorders using CC-GWAS Nat. Genet., 53 (4) 445-454. doi:10.1038/s41588-021-00787-1. PMID 33686288 | 2021 |
| TrajGWAS | GWAS of longitudinal trajectories | Ko S, German CA, Jensen A, Shen J, ...&, Zhou JJ. (2022) GWAS of longitudinal trajectories at biobank scale Am. J. Hum. Genet., 109 (3) 433-445. doi:10.1016/j.ajhg.2022.01.018. PMID 35196515 | 2022 |
| GWAX | GWAS using family history | Liu JZ, Erlich Y, Pickrell JK. (2017) Case-control association mapping by proxy using family history of disease Nat. Genet., 49 (3) 325-331. doi:10.1038/ng.3766. PMID 28092683 | 2017 |
| LT-FH | GWAS using family history | Hujoel MLA, Gazal S, Loh PR, Patterson N, ...&, Price AL. (2020) Liability threshold modeling of case-control status and family history of disease increases association power Nat. Genet., 52 (5) 541-547. doi:10.1038/s41588-020-0613-6. PMID 32313248 | 2020 |
| SiblingGWAS | GWAS using family history | Howe LJ, Nivard MG, Morris TT, Hansen AF, ...&, Davies NM. (2022) Within-sibship genome-wide association analyses decrease bias in estimates of direct genetic effects Nat. Genet., 54 (5) 581-592. doi:10.1038/s41588-022-01062-7. PMID 35534559 | 2022 |
| snipar | GWAS using family history | Young AI, Nehzati SM, Benonisdottir S, Okbay A, ...&, Kong A. (2022) Mendelian imputation of parental genotypes improves estimates of direct genetic effects Nat. Genet., 54 (6) 897-905. doi:10.1038/s41588-022-01085-0. PMID 35681053 | 2022 |
| REGENIE | Gene-based analysis (rare variant) | Mbatchou J, Barnard L, Backman J, Marcketta A, ...&, Marchini J. (2021) Computationally efficient whole-genome regression for quantitative and binary traits Nat. Genet., 53 (7) 1097-1103. doi:10.1038/s41588-021-00870-7. PMID 34017140 | 2021 |
| SAIGE-GENE+ | Gene-based analysis (rare variant) | Zhou W, Bi W, Zhao Z, Dey KK, ...&, Lee S. (2022) SAIGE-GENE+ improves the efficiency and accuracy of set-based rare variant association tests Nat. Genet., 54 (10) 1466-1469. doi:10.1038/s41588-022-01178-w. PMID 36138231 | 2022 |
| SAIGE-GENE | Gene-based analysis (rare variant) | Zhou W, Zhao Z, Nielsen JB, Fritsche LG, ...&, Lee S. (2020) Scalable generalized linear mixed model for region-based association tests in large biobanks and cohorts Nat. Genet., 52 (6) 634-639. doi:10.1038/s41588-020-0621-6. PMID 32424355 | 2020 |
| SKAT-O | Gene-based analysis (rare variant) | Lee S, Wu MC, Lin X. (2012) Optimal tests for rare variant effects in sequencing association studies Biostatistics, 13 (4) 762-775. doi:10.1093/biostatistics/kxs014. PMID 22699862 | 2012 |
| SKAT | Gene-based analysis (rare variant) | Wu MC, Lee S, Cai T, Li Y, ...&, Lin X. (2011) Rare-variant association testing for sequencing data with the sequence kernel association test Am. J. Hum. Genet., 89 (1) 82-93. doi:10.1016/j.ajhg.2011.05.029. PMID 21737059 | 2011 |
| STAAR | Gene-based analysis (rare variant) | Li X, Li Z, Zhou H, Gaynor SM, ...&, Lin X. (2020) Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale Nat. Genet., 52 (9) 969-983. doi:10.1038/s41588-020-0676-4. PMID 32839606 | 2020 |
| LDAK-GBAT | Gene-based analysis (sumstats) | Berrandou TE, Balding D, Speed D. (2023) LDAK-GBAT: Fast and powerful gene-based association testing using summary statistics Am. J. Hum. Genet., 110 (1) 23-29. doi:10.1016/j.ajhg.2022.11.010. PMID 36480927 | 2023 |
| PGS-adjusted GWAS | PGS-adjusted GWAS | Campos AI, Namba S, Lin SC, Nam K, ...&, Yengo L. (2023) Boosting the power of genome-wide association studies within and across ancestries by using polygenic scores Nat. Genet., 55 (10) 1769-1776. doi:10.1038/s41588-023-01500-0. PMID 37723263 | 2023 |
| PGS-adjusted RVATs | PGS-adjusted GWAS | Jurgens SJ, Pirruccello JP, Choi SH, Morrill VN, ...&, Ellinor PT. (2023) Adjusting for common variant polygenic scores improves yield in rare variant association analyses Nat. Genet. doi:10.1038/s41588-023-01342-w. PMID 36959364 | 2023 |
| Review-Povysil | Review | Povysil G, Petrovski S, Hostyk J, Aggarwal V, ...&, Goldstein DB. (2019) Rare-variant collapsing analyses for complex traits: guidelines and applications Nat. Rev. Genet., 20 (12) 747-759. doi:10.1038/s41576-019-0177-4. PMID 31605095 | 2019 |
| BOLT-LMM | Single variant association tests | Loh PR, Tucker G, Bulik-Sullivan BK, Vilhjálmsson BJ, ...&, Price AL. (2015) Efficient Bayesian mixed-model analysis increases association power in large cohorts Nat. Genet., 47 (3) 284-290. doi:10.1038/ng.3190. PMID 25642633 | 2015 |
| EMMAX | Single variant association tests | Kang HM, Sul JH, Service SK, ...&, Eskin E. (2010) Variance component model to account for sample structure in genome-wide association studies Nat. Genet., 42 (4) 348-354. doi:10.1038/ng.548. PMID 20208533 | 2010 |
| GEMMA | Single variant association tests | Zhou X, Stephens M. (2012) Genome-wide efficient mixed-model analysis for association studies Nat. Genet., 44 (7) 821-824. doi:10.1038/ng.2310. PMID 22706312 | 2012 |
| LDAK-KVIK | Single variant association tests | Hof JP, Speed D. (2024) LDAK-KVIK performs fast and powerful mixed-model association analysis of quantitative and binary phenotypes bioRxiv 2024.07.25.24311005. doi:10.1101/2024.07.25.24311005. | 2024 |
| PLINK2 | Single variant association tests | Chang CC, Chow CC, Tellier LC, Vattikuti S, ...&, Lee JJ. (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets Gigascience, 4 (1) 7. doi:10.1186/s13742-015-0047-8. PMID 25722852 | 2015 |
| PLINK | Single variant association tests | Purcell S, Neale B, Todd-Brown K, Thomas L, ...&, Sham PC. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses Am. J. Hum. Genet., 81 (3) 559-575. doi:10.1086/519795. PMID 17701901 | 2007 |
| POLMM | Single variant association tests | Bi W, Zhou W, Dey R, Mukherjee B, ...&, Lee S. (2021) Efficient mixed model approach for large-scale genome-wide association studies of ordinal categorical phenotypes Am. J. Hum. Genet., 108 (5) 825-839. doi:10.1016/j.ajhg.2021.03.019. PMID 33836139 | 2021 |
| QRGWAS | Single variant association tests | Wang C, Wang T, Kiryluk K, Wei Y, ...&, Ionita-Laza I. (2024) Genome-wide discovery for biomarkers using quantile regression at biobank scale Nat. Commun., 15 (1) 6460. doi:10.1038/s41467-024-50726-x. PMID 39085219 | 2024 |
| REGENIE | Single variant association tests | Mbatchou J, Barnard L, Backman J, Marcketta A, ...&, Marchini J. (2021) Computationally efficient whole-genome regression for quantitative and binary traits Nat. Genet., 53 (7) 1097-1103. doi:10.1038/s41588-021-00870-7. PMID 34017140 | 2021 |
| SAIGE | Single variant association tests | Zhou W, Nielsen JB, Fritsche LG, Dey R, ...&, Lee S. (2018) Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies Nat. Genet., 50 (9) 1335-1341. doi:10.1038/s41588-018-0184-y. PMID 30104761 | 2018 |
| fastGWA-GLMM | Single variant association tests | Jiang L, Zheng Z, Fang H, Yang J. (2021) A generalized linear mixed model association tool for biobank-scale data Nat. Genet., 53 (11) 1616-1621. doi:10.1038/s41588-021-00954-4. PMID 34737426 | 2021 |
| fastGWA | Single variant association tests | Jiang L, Zheng Z, Qi T, Kemper KE, ...&, Yang J. (2019) A resource-efficient tool for mixed model association analysis of large-scale data Nat. Genet., 51 (12) 1749-1755. doi:10.1038/s41588-019-0530-8. PMID 31768069 | 2019 |

Case-case GWAS

CC-GWAS

  • NAME : CC-GWAS
  • SHORT NAME : CC-GWAS
  • FULL NAME : case–case genome-wide association study
  • DESCRIPTION : The CCGWAS R package tests for case-case association between two different disorders, using only the case-control GWAS results of each disorder as input.
  • URL : https://github.com/wouterpeyrot/CCGWAS
  • TITLE : Identifying loci with different allele frequencies among cases of eight psychiatric disorders using CC-GWAS
  • DOI : 10.1038/s41588-021-00787-1
  • ABSTRACT : Psychiatric disorders are highly genetically correlated, but little research has been conducted on the genetic differences between disorders. We developed a new method (case-case genome-wide association study; CC-GWAS) to test for differences in allele frequency between cases of two disorders using summary statistics from the respective case-control GWAS, transcending current methods that require individual-level data. Simulations and analytical computations confirm that CC-GWAS is well powered with effective control of type I error. We applied CC-GWAS to publicly available summary statistics for schizophrenia, bipolar disorder, major depressive disorder and five other psychiatric disorders. CC-GWAS identified 196 independent case-case loci, including 72 CC-GWAS-specific loci that were not significant at the genome-wide level in the input case-control summary statistics; two of the CC-GWAS-specific loci implicate the genes KLF6 and KLF16 (from the Krüppel-like family of transcription factors), which have been linked to neurite outgrowth and axon regeneration. CC-GWAS loci replicated convincingly in applications to datasets with independent replication data.
  • COPYRIGHT : https://www.springernature.com/gp/researchers/text-and-data-mining
  • CITATION : Peyrot WJ, Price AL. (2021) Identifying loci with different allele frequencies among cases of eight psychiatric disorders using CC-GWAS Nat. Genet., 53 (4) 445-454. doi:10.1038/s41588-021-00787-1. PMID 33686288
  • JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2021 ; 53 ; 4 ; 445-454
  • PUBMED_LINK : 33686288
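
The idea of comparing cases of two disorders from case-control summary statistics alone can be illustrated with a deliberately naive contrast: subtract the two per-allele effects and, assuming the two studies share no samples, add their sampling variances. This is only a sketch of the general idea and is not the CC-GWAS estimator, which weights the two inputs; those weights are not reproduced here. The function name and its arguments are hypothetical.

```python
import math


def naive_case_case_contrast(beta_a, se_a, beta_b, se_b):
    """Naive difference of two case-control effects for one variant.

    Assumes both effects refer to the same allele and that the two
    studies have no overlapping samples (so the errors are independent).
    """
    diff = beta_a - beta_b
    se = math.sqrt(se_a ** 2 + se_b ** 2)
    z = diff / se
    # Two-sided p-value under a standard normal null.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return diff, se, z, p


if __name__ == "__main__":
    print(naive_case_case_contrast(0.10, 0.03, 0.02, 0.04))
```

With overlapping controls the independence assumption fails, which is one reason a dedicated method is needed in practice.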

GWAS of longitudinal trajectories

TrajGWAS

  • NAME : TrajGWAS
  • SHORT NAME : TrajGWAS
  • FULL NAME : GWAS of longitudinal trajectories
  • DESCRIPTION : TrajGWAS.jl is a Julia package for performing genome-wide association studies (GWAS) for continuous longitudinal phenotypes using a modified linear mixed effects model. It builds upon the within-subject variance estimation by robust regression (WiSER) method and can be used to identify variants associated with changes in the mean and within-subject variability of the longitudinal trait.
  • URL : https://github.com/OpenMendel/TrajGWAS.jl
  • KEYWORDS : biomarker trajectories, mean, within-subject (WS) variability, linear mixed effect model, within-subject variance estimation by robust regression (WiSER) method
  • TITLE : GWAS of longitudinal trajectories at biobank scale
  • DOI : 10.1016/j.ajhg.2022.01.018
  • ABSTRACT : Biobanks linked to massive, longitudinal electronic health record (EHR) data make numerous new genetic research questions feasible. One among these is the study of biomarker trajectories. For example, high blood pressure measurements over visits strongly predict stroke onset, and consistently high fasting glucose and HbA1c levels define diabetes. Recent research reveals that not only the mean level of biomarker trajectories but also their fluctuations, or within-subject (WS) variability, are risk factors for many diseases. Glycemic variation, for instance, is recently considered an important clinical metric in diabetes management. It is crucial to identify the genetic factors that shift the mean or alter the WS variability of a biomarker trajectory. Compared to traditional cross-sectional studies, trajectory analysis utilizes more data points and captures a complete picture of the impact of time-varying factors, including medication history and lifestyle. Currently, there are no efficient tools for genome-wide association studies (GWASs) of biomarker trajectories at the biobank scale, even for just mean effects. We propose TrajGWAS, a linear mixed effect model-based method for testing genetic effects that shift the mean or alter the WS variability of a biomarker trajectory. It is scalable to biobank data with 100,000 to 1,000,000 individuals and many longitudinal measurements and robust to distributional assumptions. Simulation studies corroborate that TrajGWAS controls the type I error rate and is powerful. Analysis of eleven biomarkers measured longitudinally and extracted from UK Biobank primary care data for more than 150,000 participants with 1,800,000 observations reveals loci that significantly alter the mean or WS variability.
  • COPYRIGHT : http://www.elsevier.com/open-access/userlicense/1.0/
  • CITATION : Ko S, German CA, Jensen A, Shen J, ...&, Zhou JJ. (2022) GWAS of longitudinal trajectories at biobank scale Am. J. Hum. Genet., 109 (3) 433-445. doi:10.1016/j.ajhg.2022.01.018. PMID 35196515
  • JOURNAL_INFO : The American Journal of Human Genetics ; Am. J. Hum. Genet. ; 2022 ; 109 ; 3 ; 433-445
  • PUBMED_LINK : 35196515
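
To make the two quantities in the description concrete, the following sketch computes a per-subject mean and a per-subject standard deviation from repeated measurements. It is a simple two-stage summary for illustration only; TrajGWAS fits both within a single mixed effects model (via WiSER) rather than using summaries like these. The function name and the input layout are assumptions.

```python
from statistics import mean, stdev


def trajectory_summaries(measurements):
    """Return {subject_id: (mean, within-subject SD)}.

    `measurements` maps a subject id to a list of repeated biomarker
    values. Subjects with fewer than two values are skipped because a
    within-subject SD cannot be computed for them.
    """
    summaries = {}
    for subject_id, values in measurements.items():
        if len(values) < 2:
            continue
        summaries[subject_id] = (mean(values), stdev(values))
    return summaries


if __name__ == "__main__":
    data = {"s1": [1.0, 2.0, 3.0], "s2": [5.0]}
    print(trajectory_summaries(data))
```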

GWAS using family history

GWAX

  • NAME : GWAX
  • SHORT NAME : GWAX
  • FULL NAME : genome-wide association by proxy
  • DESCRIPTION : In randomly ascertained cohorts, replacing cases with their first-degree relatives enables studies of diseases that are absent (or nearly absent) in the cohort.
  • TITLE : Case-control association mapping by proxy using family history of disease
  • DOI : 10.1038/ng.3766
  • ABSTRACT : Collecting cases for case-control genetic association studies can be time-consuming and expensive. In some situations (such as studies of late-onset or rapidly lethal diseases), it may be more practical to identify family members of cases. In randomly ascertained cohorts, replacing cases with their first-degree relatives enables studies of diseases that are absent (or nearly absent) in the cohort. We refer to this approach as genome-wide association study by proxy (GWAX) and apply it to 12 common diseases in 116,196 individuals from the UK Biobank. Meta-analysis with published genome-wide association study summary statistics replicated established risk loci and yielded four newly associated loci for Alzheimer's disease, eight for coronary artery disease and five for type 2 diabetes. In addition to informing disease biology, our results demonstrate the utility of association mapping without directly observing cases. We anticipate that GWAX will prove useful in future genetic studies of complex traits in large population cohorts.
  • CITATION : Liu JZ, Erlich Y, Pickrell JK. (2017) Case-control association mapping by proxy using family history of disease Nat. Genet., 49 (3) 325-331. doi:10.1038/ng.3766. PMID 28092683
  • JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2017 ; 49 ; 3 ; 325-331
  • PUBMED_LINK : 28092683
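
The proxy phenotype described above can be sketched as a simple recoding step. In this hypothetical helper, an individual counts as a (proxy) case if a first-degree relative is reported as affected; grouping directly observed cases with the proxy cases is an assumption of this sketch and may differ from how a given study handles them.

```python
def gwax_phenotype(is_case, has_affected_first_degree_relative):
    """Code one individual as 1 (case or proxy case) or 0 (control).

    Assumption of this sketch: true cases are grouped with proxy cases.
    """
    return 1 if (is_case or has_affected_first_degree_relative) else 0


if __name__ == "__main__":
    cohort = [
        {"id": "a", "case": False, "relative_affected": True},
        {"id": "b", "case": False, "relative_affected": False},
        {"id": "c", "case": True, "relative_affected": False},
    ]
    for person in cohort:
        code = gwax_phenotype(person["case"], person["relative_affected"])
        print(person["id"], code)
```

The recoded phenotype can then be passed to any standard case-control association test.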

LT-FH

  • NAME : LT-FH
  • SHORT NAME : LT-FH
  • FULL NAME : liability threshold model, conditional on case–control status and family history
  • DESCRIPTION : LT-FH is an association method that replaces the binary phenotype with each individual's posterior mean genetic liability, computed under a liability threshold model conditional on case-control status and family history.
  • URL : https://alkesgroup.broadinstitute.org/UKBB/LTFH/
  • TITLE : Liability threshold modeling of case-control status and family history of disease increases association power
  • DOI : 10.1038/s41588-020-0613-6
  • ABSTRACT : Family history of disease can provide valuable information in case-control association studies, but it is currently unclear how to best combine case-control status and family history of disease. We developed an association method based on posterior mean genetic liabilities under a liability threshold model, conditional on case-control status and family history (LT-FH). Analyzing 12 diseases from the UK Biobank (average N = 350,000) we compared LT-FH to genome-wide association without using family history (GWAS) and a previous proxy-based method incorporating family history (GWAX). LT-FH was 63% (standard error (s.e.) 6%) more powerful than GWAS and 36% (s.e. 4%) more powerful than the trait-specific maximum of GWAS and GWAX, based on the number of independent genome-wide-significant loci across all diseases (for example, 690 loci for LT-FH versus 423 for GWAS); relative improvements were similar when applying BOLT-LMM to GWAS, GWAX and LT-FH phenotypes. Thus, LT-FH greatly increases association power when family history of disease is available.
  • CITATION : Hujoel MLA, Gazal S, Loh PR, Patterson N, ...&, Price AL. (2020) Liability threshold modeling of case-control status and family history of disease increases association power Nat. Genet., 52 (5) 541-547. doi:10.1038/s41588-020-0613-6. PMID 32313248
  • JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2020 ; 52 ; 5 ; 541-547
  • PUBMED_LINK : 32313248
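
A posterior mean genetic liability can be approximated by simulation. The sketch below is a simplified illustration, not the LT-FH software: it considers only one parent, assumes no shared environment, and uses hypothetical default values for heritability and prevalence. It simulates parent-offspring liabilities and averages the offspring's genetic liability within each combination of own status and parental status.

```python
import random
from statistics import NormalDist


def posterior_mean_liabilities(h2=0.5, prevalence=0.1, n_draws=100_000, seed=1):
    """Monte Carlo estimate of E[genetic liability | own status, parent status].

    Returns a dict keyed by (child_is_case, parent_is_case).
    """
    rng = random.Random(seed)
    threshold = NormalDist().inv_cdf(1.0 - prevalence)
    sd_g = h2 ** 0.5                 # SD of genetic liability
    sd_e = (1.0 - h2) ** 0.5         # SD of environmental liability
    sd_seg = (0.75 * h2) ** 0.5      # part of child's genetics not explained by one parent
    sums, counts = {}, {}
    for _ in range(n_draws):
        g_parent = rng.gauss(0.0, sd_g)
        # Child shares half of the parent's genetic liability in expectation.
        g_child = 0.5 * g_parent + rng.gauss(0.0, sd_seg)
        parent_case = g_parent + rng.gauss(0.0, sd_e) > threshold
        child_case = g_child + rng.gauss(0.0, sd_e) > threshold
        key = (child_case, parent_case)
        sums[key] = sums.get(key, 0.0) + g_child
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


if __name__ == "__main__":
    for key, value in sorted(posterior_mean_liabilities().items()):
        print(key, round(value, 3))
```

The resulting values form a quantitative phenotype: controls without an affected parent receive the lowest value and cases with an affected parent the highest, which is the extra information a binary case-control coding discards.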

SiblingGWAS

  • NAME : SiblingGWAS
  • SHORT NAME : SiblingGWAS
  • FULL NAME : Within-sibship genome-wide association analyses
  • DESCRIPTION : Scripts for running GWAS using siblings to estimate Within-Family (WF) and Between-Family (BF) effects of genetic variants on continuous traits. Allows the inclusion of more than two siblings from one family.
  • URL : https://github.com/LaurenceHowe/SiblingGWAS
  • TITLE : Within-sibship genome-wide association analyses decrease bias in estimates of direct genetic effects
  • DOI : 10.1038/s41588-022-01062-7
  • ABSTRACT : Estimates from genome-wide association studies (GWAS) of unrelated individuals capture effects of inherited variation (direct effects), demography (population stratification, assortative mating) and relatives (indirect genetic effects). Family-based GWAS designs can control for demographic and indirect genetic effects, but large-scale family datasets have been lacking. We combined data from 178,086 siblings from 19 cohorts to generate population (between-family) and within-sibship (within-family) GWAS estimates for 25 phenotypes. Within-sibship GWAS estimates were smaller than population estimates for height, educational attainment, age at first birth, number of children, cognitive ability, depressive symptoms and smoking. Some differences were observed in downstream SNP heritability, genetic correlations and Mendelian randomization analyses. For example, the within-sibship genetic correlation between educational attainment and body mass index attenuated towards zero. In contrast, analyses of most molecular phenotypes (for example, low-density lipoprotein-cholesterol) were generally consistent. We also found within-sibship evidence of polygenic adaptation on taller height. Here, we illustrate the importance of family-based GWAS data for phenotypes influenced by demographic and indirect genetic effects.
  • COPYRIGHT : https://creativecommons.org/licenses/by/4.0
  • CITATION : Howe LJ, Nivard MG, Morris TT, Hansen AF, ...&, Davies NM. (2022) Within-sibship genome-wide association analyses decrease bias in estimates of direct genetic effects Nat. Genet., 54 (5) 581-592. doi:10.1038/s41588-022-01062-7. PMID 35534559
  • JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2022 ; 54 ; 5 ; 581-592
  • PUBMED_LINK : 35534559
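
The within-family and between-family decomposition can be sketched for a single variant as follows. Each sibling's genotype is split into the family mean and the deviation from that mean; because the two parts are orthogonal, each regression coefficient has a simple closed form. This is a minimal sketch without covariates or standard errors, and the function name and inputs are hypothetical rather than taken from the SiblingGWAS scripts.

```python
from collections import defaultdict


def within_between_effects(family_ids, genotypes, phenotypes):
    """Estimate within-family (WF) and between-family (BF) effects.

    Equivalent to regressing phenotype on an intercept, the family-mean
    genotype, and the deviation from the family mean. Families may have
    more than two siblings. Requires variation in both components.
    """
    by_family = defaultdict(list)
    for fid, g in zip(family_ids, genotypes):
        by_family[fid].append(g)
    family_mean = {fid: sum(gs) / len(gs) for fid, gs in by_family.items()}

    means = [family_mean[fid] for fid in family_ids]
    deviations = [g - m for g, m in zip(genotypes, means)]
    grand_mean = sum(means) / len(means)
    centred_means = [m - grand_mean for m in means]
    y_bar = sum(phenotypes) / len(phenotypes)
    y_centred = [y - y_bar for y in phenotypes]

    wf = sum(d * y for d, y in zip(deviations, y_centred)) / sum(d * d for d in deviations)
    bf = sum(c * y for c, y in zip(centred_means, y_centred)) / sum(c * c for c in centred_means)
    return wf, bf


if __name__ == "__main__":
    fams = ["f1", "f1", "f2", "f2", "f3", "f3", "f4", "f4"]
    geno = [0, 1, 1, 2, 0, 2, 2, 2]
    fam_mean = {"f1": 0.5, "f2": 1.5, "f3": 1.0, "f4": 2.0}
    # Phenotype built with a WF effect of 2 and a BF effect of 5.
    pheno = [1 + 2 * (g - fam_mean[f]) + 5 * fam_mean[f] for f, g in zip(fams, geno)]
    print(within_between_effects(fams, geno, pheno))
```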

snipar

  • NAME : snipar
  • SHORT NAME : snipar
  • FULL NAME : single nucleotide imputation of parents
  • DESCRIPTION : snipar (single nucleotide imputation of parents) is a Python package for inferring identity-by-descent (IBD) segments shared between siblings, imputing missing parental genotypes, and for performing family based genome-wide association and polygenic score analyses using observed and/or imputed parental genotypes.
  • URL : https://github.com/AlexTISYoung/snipar
  • TITLE : Mendelian imputation of parental genotypes improves estimates of direct genetic effects
  • DOI : 10.1038/s41588-022-01085-0
  • ABSTRACT : Effects estimated by genome-wide association studies (GWASs) include effects of alleles in an individual on that individual (direct genetic effects), indirect genetic effects (for example, effects of alleles in parents on offspring through the environment) and bias from confounding. Within-family genetic variation is random, enabling unbiased estimation of direct genetic effects when parents are genotyped. However, parental genotypes are often missing. We introduce a method that imputes missing parental genotypes and estimates direct genetic effects. Our method, implemented in the software package snipar (single-nucleotide imputation of parents), gives more precise estimates of direct genetic effects than existing approaches. Using 39,614 individuals from the UK Biobank with at least one genotyped sibling/parent, we estimate the correlation between direct genetic effects and effects from standard GWASs for nine phenotypes, including educational attainment (r = 0.739, standard error (s.e.) = 0.086) and cognitive ability (r = 0.490, s.e. = 0.086). Our results demonstrate substantial confounding bias in standard GWASs for some phenotypes.
  • CITATION : Young AI, Nehzati SM, Benonisdottir S, Okbay A, ...&, Kong A. (2022) Mendelian imputation of parental genotypes improves estimates of direct genetic effects Nat. Genet., 54 (6) 897-905. doi:10.1038/s41588-022-01085-0. PMID 35681053
  • JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2022 ; 54 ; 6 ; 897-905
  • PUBMED_LINK : 35681053
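
The logic of imputing parental genotypes from a sibling pair can be sketched at a single site. The example below is a simplified illustration under stated assumptions (random mating, a known IBD state, and, for the IBD1 case, phased data so that the shared allele is known); it is not snipar's interface, and the function and argument names are hypothetical. Alleles that the siblings carry are observed parental alleles, and each unobserved parental allele is replaced by its expectation, the allele frequency.

```python
def impute_parental_sum(ibd, g1, g2, freq, shared_allele=None):
    """Expected sum of the two parental genotypes at one site.

    ibd:  number of alleles the siblings share identical by descent (0, 1, 2)
    g1, g2: sibling genotypes coded as allele counts (0, 1, 2)
    freq: population frequency of the counted allele
    shared_allele: for ibd == 1, the value (0 or 1) of the shared allele
    """
    if ibd == 0:
        # All four parental alleles are observed across the two siblings.
        return float(g1 + g2)
    if ibd == 2:
        # Both siblings carry the same two alleles; two parental alleles are unseen.
        return g1 + 2.0 * freq
    if ibd == 1:
        if shared_allele is None:
            raise ValueError("shared_allele is required when ibd == 1")
        # Three distinct parental alleles are observed; one is unseen.
        return g1 + g2 - shared_allele + freq
    raise ValueError("ibd must be 0, 1 or 2")


if __name__ == "__main__":
    print(impute_parental_sum(0, 1, 2, 0.3))
    print(impute_parental_sum(2, 1, 1, 0.25))
    print(impute_parental_sum(1, 1, 2, 0.5, shared_allele=1))
```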

Gene-based analysis (rare variant)

REGENIE

  • NAME : REGENIE
  • SHORT NAME : REGENIE
  • FULL NAME : REGENIE
  • DESCRIPTION : regenie is a C++ program for whole genome regression modelling of large genome-wide association studies. It is developed and supported by a team of scientists at the Regeneron Genetics Center.
  • URL : https://github.com/rgcgithub/regenie
  • KEYWORDS : whole genome regression
  • TITLE : Computationally efficient whole-genome regression for quantitative and binary traits
  • DOI : 10.1038/s41588-021-00870-7
  • ABSTRACT : Genome-wide association analysis of cohorts with thousands of phenotypes is computationally expensive, particularly when accounting for sample relatedness or population structure. Here we present a novel machine-learning method called REGENIE for fitting a whole-genome regression model for quantitative and binary phenotypes that is substantially faster than alternatives in multi-trait analyses while maintaining statistical efficiency. The method naturally accommodates parallel analysis of multiple phenotypes and requires only local segments of the genotype matrix to be loaded in memory, in contrast to existing alternatives, which must load genome-wide matrices into memory. This results in substantial savings in compute time and memory usage. We introduce a fast, approximate Firth logistic regression test for unbalanced case-control phenotypes. The method is ideally suited to take advantage of distributed computing frameworks. We demonstrate the accuracy and computational benefits of this approach using the UK Biobank dataset with up to 407,746 individuals.
  • CITATION : Mbatchou J, Barnard L, Backman J, Marcketta A, ...&, Marchini J. (2021) Computationally efficient whole-genome regression for quantitative and binary traits Nat. Genet., 53 (7) 1097-1103. doi:10.1038/s41588-021-00870-7. PMID 34017140
  • JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2021 ; 53 ; 7 ; 1097-1103
  • PUBMED_LINK : 34017140

SAIGE-GENE

  • NAME : SAIGE-GENE
  • SHORT NAME : SAIGE-GENE
  • FULL NAME : SAIGE-GENE
  • URL : https://github.com/weizhouUMICH/SAIGE
  • TITLE : Scalable generalized linear mixed model for region-based association tests in large biobanks and cohorts
  • DOI : 10.1038/s41588-020-0621-6
  • ABSTRACT : With very large sample sizes, biobanks provide an exciting opportunity to identify genetic components of complex traits. To analyze rare variants, region-based multiple-variant aggregate tests are commonly used to increase power for association tests. However, because of the substantial computational cost, existing region-based tests cannot analyze hundreds of thousands of samples while accounting for confounders such as population stratification and sample relatedness. Here we propose a scalable generalized mixed-model region-based association test, SAIGE-GENE, that is applicable to exome-wide and genome-wide region-based analysis for hundreds of thousands of samples and can account for unbalanced case-control ratios for binary traits. Through extensive simulation studies and analysis of the HUNT study with 69,716 Norwegian samples and the UK Biobank data with 408,910 White British samples, we show that SAIGE-GENE can efficiently analyze large-sample data (N > 400,000) with type I error rates well controlled.
  • CITATION : Zhou W, Zhao Z, Nielsen JB, Fritsche LG, ...&, Lee S. (2020) Scalable generalized linear mixed model for region-based association tests in large biobanks and cohorts Nat. Genet., 52 (6) 634-639. doi:10.1038/s41588-020-0621-6. PMID 32424355
  • JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2020 ; 52 ; 6 ; 634-639
  • PUBMED_LINK : 32424355

SAIGE-GENE+

  • NAME : SAIGE-GENE+
  • SHORT NAME : SAIGE-GENE+
  • FULL NAME : SAIGE-GENE+
  • URL : https://github.com/weizhouUMICH/SAIGE
  • TITLE : SAIGE-GENE+ improves the efficiency and accuracy of set-based rare variant association tests
  • DOI : 10.1038/s41588-022-01178-w
  • ABSTRACT : Several biobanks, including UK Biobank (UKBB), are generating large-scale sequencing data. An existing method, SAIGE-GENE, performs well when testing variants with minor allele frequency (MAF) ≤ 1%, but inflation is observed in variance component set-based tests when restricting to variants with MAF ≤ 0.1% or 0.01%. Here, we propose SAIGE-GENE+ with greatly improved type I error control and computational efficiency to facilitate rare variant tests in large-scale data. We further show that incorporating multiple MAF cutoffs and functional annotations can improve power and thus uncover new gene-phenotype associations. In the analysis of UKBB whole exome sequencing data for 30 quantitative and 141 binary traits, SAIGE-GENE+ identified 551 gene-phenotype associations.
  • COPYRIGHT : https://creativecommons.org/licenses/by/4.0
  • CITATION : Zhou W, Bi W, Zhao Z, Dey KK, ...&, Lee S. (2022) SAIGE-GENE+ improves the efficiency and accuracy of set-based rare variant association tests Nat. Genet., 54 (10) 1466-1469. doi:10.1038/s41588-022-01178-w. PMID 36138231
  • JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2022 ; 54 ; 10 ; 1466-1469
  • PUBMED_LINK : 36138231

SKAT

  • NAME : SKAT
  • SHORT NAME : SKAT
  • FULL NAME : sequence kernel association test
  • DESCRIPTION : SKAT is a SNP-set (e.g., a gene or a region) level test for association between a set of rare (or common) variants and dichotomous or quantitative phenotypes. SKAT aggregates the individual score test statistics of the SNPs in a SNP set and efficiently computes a SNP-set level p-value (e.g., a gene or region level p-value) while adjusting for covariates, such as principal components to account for population stratification. SKAT also supports power and sample size calculations for designing sequence association studies.
  • URL : https://www.hsph.harvard.edu/skat/
  • TITLE : Rare-variant association testing for sequencing data with the sequence kernel association test
  • DOI : 10.1016/j.ajhg.2011.05.029
  • ABSTRACT : Sequencing studies are increasingly being conducted to identify rare variants associated with complex traits. The limited power of classical single-marker association analysis for rare variants poses a central challenge in such studies. We propose the sequence kernel association test (SKAT), a supervised, flexible, computationally efficient regression method to test for association between genetic variants (common and rare) in a region and a continuous or dichotomous trait while easily adjusting for covariates. As a score-based variance-component test, SKAT can quickly calculate p values analytically by fitting the null model containing only the covariates, and so can easily be applied to genome-wide data. Using SKAT to analyze a genome-wide sequencing study of 1000 individuals, by segmenting the whole genome into 30 kb regions, requires only 7 hr on a laptop. Through analysis of simulated data across a wide range of practical scenarios and triglyceride data from the Dallas Heart Study, we show that SKAT can substantially outperform several alternative rare-variant association tests. We also provide analytic power and sample-size calculations to help design candidate-gene, whole-exome, and whole-genome sequence association studies.
  • COPYRIGHT : https://www.elsevier.com/open-access/userlicense/1.0/
  • CITATION : Wu MC, Lee S, Cai T, Li Y, ...&, Lin X. (2011) Rare-variant association testing for sequencing data with the sequence kernel association test Am. J. Hum. Genet., 89 (1) 82-93. doi:10.1016/j.ajhg.2011.05.029. PMID 21737059
  • JOURNAL_INFO : The American Journal of Human Genetics ; Am. J. Hum. Genet. ; 2011 ; 89 ; 1 ; 82-93
  • PUBMED_LINK : 21737059
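
The aggregation of per-variant score statistics can be sketched as below for a quantitative trait. This is a minimal illustration, not the SKAT package: the null model contains only an intercept (no covariates), and the p-value step, which requires the null distribution of the statistic, is omitted. The Beta(1, 25) weight helper reflects a commonly used default for up-weighting rare variants; function names are hypothetical.

```python
def beta_1_25_weight(maf):
    """Beta(1, 25) density evaluated at a minor allele frequency."""
    return 25.0 * (1.0 - maf) ** 24


def skat_q(genotypes, phenotype, weights):
    """Return (Q, scores) for one variant set.

    genotypes: list of variants, each a list of allele counts per individual
    phenotype: list of trait values per individual
    weights:   one weight per variant
    Q is the sum over variants of (weight * score) squared, where each
    score is the sum over individuals of genotype * null-model residual.
    """
    n = len(phenotype)
    mu = sum(phenotype) / n                      # intercept-only null model
    residuals = [y - mu for y in phenotype]
    scores = [sum(g * r for g, r in zip(variant, residuals)) for variant in genotypes]
    q = sum((w * s) ** 2 for w, s in zip(weights, scores))
    return q, scores


if __name__ == "__main__":
    G = [[1, 0, 0, 0], [0, 0, 1, 1]]
    y = [2.0, 0.0, 1.0, 1.0]
    print(skat_q(G, y, weights=[1.0, 1.0]))
```

Because the scores are squared before summing, variants with effects in opposite directions both contribute to the statistic instead of cancelling.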

SKAT-O

  • NAME : SKAT-O
  • SHORT NAME : SKAT-O
  • FULL NAME : sequence kernel association test - optimal test
  • DESCRIPTION : SKAT-O estimates the correlation parameter in the kernel matrix so as to maximize power. This parameter corresponds to the weight in the linear combination of the burden test and SKAT test statistics, so the chosen value is the combination of the two tests that maximizes power.
  • URL : https://www.hsph.harvard.edu/skat/
  • TITLE : Optimal tests for rare variant effects in sequencing association studies
  • DOI : 10.1093/biostatistics/kxs014
  • ABSTRACT : With development of massively parallel sequencing technologies, there is a substantial need for developing powerful rare variant association tests. Common approaches include burden and non-burden tests. Burden tests assume all rare variants in the target region have effects on the phenotype in the same direction and of similar magnitude. The recently proposed sequence kernel association test (SKAT) (Wu, M. C., and others, 2011. Rare-variant association testing for sequencing data with the SKAT. The American Journal of Human Genetics 89, 82-93), an extension of the C-alpha test (Neale, B. M., and others, 2011. Testing for an unusual distribution of rare variants. PLoS Genetics 7, 161-165), provides a robust test that is particularly powerful in the presence of protective and deleterious variants and null variants, but is less powerful than burden tests when a large number of variants in a region are causal and in the same direction. As the underlying biological mechanisms are unknown in practice and vary from one gene to another across the genome, it is of substantial practical interest to develop a test that is optimal for both scenarios. In this paper, we propose a class of tests that include burden tests and SKAT as special cases, and derive an optimal test within this class that maximizes power. We show that this optimal test outperforms burden tests and SKAT in a wide range of scenarios. The results are illustrated using simulation studies and triglyceride data from the Dallas Heart Study. In addition, we have derived sample size/power calculation formula for SKAT with a new family of kernels to facilitate designing new sequence association studies.
  • CITATION : Lee S, Wu MC, Lin X. (2012) Optimal tests for rare variant effects in sequencing association studies Biostatistics, 13 (4) 762-775. doi:10.1093/biostatistics/kxs014. PMID 22699862
  • JOURNAL_INFO : Biostatistics (Oxford, England) ; Biostatistics ; 2012 ; 13 ; 4 ; 762-775
  • PUBMED_LINK : 22699862
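
The linear combination of the burden and SKAT statistics mentioned in the description can be sketched as a family of statistics indexed by a weight rho, with rho = 0 giving SKAT and rho = 1 giving the burden test. This minimal sketch only evaluates the statistics on a grid; selecting the best rho and computing the corresponding p-value, which is what SKAT-O does, is not shown. The function name, the inputs, and the grid values are assumptions.

```python
def combined_statistics(scores, weights, rhos=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return {rho: (1 - rho) * Q_SKAT + rho * Q_burden}.

    scores:  per-variant score statistics
    weights: per-variant weights
    """
    weighted = [w * s for w, s in zip(weights, scores)]
    q_skat = sum(x * x for x in weighted)   # squares first, then sums
    q_burden = sum(weighted) ** 2           # sums first, then squares
    return {rho: (1.0 - rho) * q_skat + rho * q_burden for rho in rhos}


if __name__ == "__main__":
    # Two variants with effects in opposite directions.
    print(combined_statistics([1.0, -1.0], [1.0, 1.0]))
```

In the example the burden component is zero because the opposite-direction scores cancel, while the SKAT component is not; with same-direction scores the burden component would dominate instead.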

STAAR

  • NAME : STAAR
  • SHORT NAME : STAAR
  • FULL NAME : variant-set test for association using annotation information
  • DESCRIPTION : STAAR is an R package for performing variant-Set Test for Association using Annotation infoRmation (STAAR) procedure in whole-genome sequencing (WGS) studies. STAAR is a general framework that incorporates both qualitative functional categories and quantitative complementary functional annotations using an omnibus multi-dimensional weighting scheme. STAAR accounts for population structure and relatedness, and is scalable for analyzing large WGS studies of continuous and dichotomous traits.
  • URL : https://github.com/xihaoli/STAAR
  • KEYWORDS : functional annotations
  • TITLE : Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale
  • DOI : 10.1038/s41588-020-0676-4
  • ABSTRACT : Large-scale whole-genome sequencing studies have enabled the analysis of rare variants (RVs) associated with complex phenotypes. Commonly used RV association tests have limited scope to leverage variant functions. We propose STAAR (variant-set test for association using annotation information), a scalable and powerful RV association test method that effectively incorporates both variant categories and multiple complementary annotations using a dynamic weighting scheme. For the latter, we introduce 'annotation principal components', multidimensional summaries of in silico variant annotations. STAAR accounts for population structure and relatedness and is scalable for analyzing very large cohort and biobank whole-genome sequencing studies of continuous and dichotomous traits. We applied STAAR to identify RVs associated with four lipid traits in 12,316 discovery and 17,822 replication samples from the Trans-Omics for Precision Medicine Program. We discovered and replicated new RV associations, including disruptive missense RVs of NPC1L1 and an intergenic region near APOC1P1 associated with low-density lipoprotein cholesterol.
  • CITATION : Li X, Li Z, Zhou H, Gaynor SM, ...&, Lin X. (2020) Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale Nat. Genet., 52 (9) 969-983. doi:10.1038/s41588-020-0676-4. PMID 32839606
  • JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2020 ; 52 ; 9 ; 969-983
  • PUBMED_LINK : 32839606

Gene-based analysis (sumstats)

LDAK-GBAT

  • NAME : LDAK-GBAT
  • SHORT NAME : LDAK-GBAT
  • FULL NAME : LDAK gene-based association testing
  • URL : http://www.ldak.org/
  • TITLE : LDAK-GBAT: Fast and powerful gene-based association testing using summary statistics
  • DOI : 10.1016/j.ajhg.2022.11.010
  • ABSTRACT : We present LDAK-GBAT, a tool for gene-based association testing using summary statistics from genome-wide association studies that is computationally efficient, produces well-calibrated p values, and is significantly more powerful than existing tools. LDAK-GBAT takes approximately 30 min to analyze imputed data (2.9M common, genic SNPs), requiring less than 10 Gb memory. It shows good control of type 1 error given an appropriate reference panel. Across 109 phenotypes (82 from the UK Biobank, 18 from the Million Veteran Program, and nine from the Psychiatric Genetics Consortium), LDAK-GBAT finds on average 19% (SE: 1%) more significant genes than the existing tool sumFREGAT-ACAT, with even greater gains in comparison with MAGMA, GCTA-fastBAT, sumFREGAT-SKAT-O, and sumFREGAT-PCA.
  • COPYRIGHT : http://www.elsevier.com/open-access/userlicense/1.0/
  • CITATION : Berrandou TE, Balding D, Speed D. (2023) LDAK-GBAT: Fast and powerful gene-based association testing using summary statistics Am. J. Hum. Genet., 110 (1) 23-29. doi:10.1016/j.ajhg.2022.11.010. PMID 36480927
  • JOURNAL_INFO : The American Journal of Human Genetics ; Am. J. Hum. Genet. ; 2023 ; 110 ; 1 ; 23-29
  • PUBMED_LINK : 36480927

PGS-adjusted GWAS

PGS-adjusted GWAS

  • NAME : PGS-adjusted GWAS
  • SHORT NAME : PGS-adjusted GWAS
  • FULL NAME : PGS-adjusted GWAS
  • DESCRIPTION : adjustment of GWAS analyses for polygenic scores (PGSs) increases the statistical power for discovery across all ancestries
  • KEYWORDS : LOCO-PGSs, two-stage meta-analysis strategy
  • TITLE : Boosting the power of genome-wide association studies within and across ancestries by using polygenic scores
  • DOI : 10.1038/s41588-023-01500-0
  • ABSTRACT : Genome-wide association studies (GWASs) have been mostly conducted in populations of European ancestry, which currently limits the transferability of their findings to other populations. Here, we show, through theory, simulations and applications to real data, that adjustment of GWAS analyses for polygenic scores (PGSs) increases the statistical power for discovery across all ancestries. We applied this method to analyze seven traits available in three large biobanks with participants of East Asian ancestry (n = 340,000 in total) and report 139 additional associations across traits. We also present a two-stage meta-analysis strategy whereby, in contributing cohorts, a PGS-adjusted GWAS is rerun using PGSs derived from a first round of a standard meta-analysis. On average, across traits, this approach yields a 1.26-fold increase in the number of detected associations (range 1.07- to 1.76-fold increase). Altogether, our study demonstrates the value of using PGSs to increase the power of GWASs in underrepresented populations and promotes such an analytical strategy for future GWAS meta-analyses.
  • CITATION : Campos AI, Namba S, Lin SC, Nam K, ...&, Yengo L. (2023) Boosting the power of genome-wide association studies within and across ancestries by using polygenic scores Nat. Genet., 55 (10) 1769-1776. doi:10.1038/s41588-023-01500-0. PMID 37723263
  • JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2023 ; 55 ; 10 ; 1769-1776
  • PUBMED_LINK : 37723263
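
To illustrate why adjusting for a PGS can raise power, the sketch below compares the standard error of a SNP effect with and without a PGS covariate. It is a toy simulation, not the authors' pipeline: the data are simulated, the PGS is drawn independently of the tested SNP (standing in for a leave-one-chromosome-out PGS), and all names and effect sizes are assumptions.

```python
import math
import random


def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def residualize(values, covariate):
    """Residuals of `values` after linear regression on `covariate` (with intercept)."""
    v, c = center(values), center(covariate)
    slope = sum(a * b for a, b in zip(c, v)) / sum(a * a for a in c)
    return [a - slope * b for a, b in zip(v, c)]


def snp_test(y, g, n_covariates):
    """Slope and standard error of y on g; both inputs already centered or residualized."""
    sgg = sum(x * x for x in g)
    beta = sum(a * b for a, b in zip(g, y)) / sgg
    rss = sum((a - beta * b) ** 2 for a, b in zip(y, g))
    df = len(y) - 2 - n_covariates  # intercept + SNP + extra covariates
    return beta, math.sqrt(rss / df / sgg)


rng = random.Random(1)
n = 2000
g = [sum(rng.random() < 0.3 for _ in range(2)) for _ in range(n)]  # genotype 0/1/2
pgs = [rng.gauss(0.0, 1.0) for _ in range(n)]                      # simulated PGS
y = [0.1 * gi + 1.0 * pi + rng.gauss(0.0, 0.5) for gi, pi in zip(g, pgs)]

beta_raw, se_raw = snp_test(center(y), center(g), 0)
beta_adj, se_adj = snp_test(residualize(y, pgs), residualize(g, pgs), 1)
print("unadjusted SE:", se_raw, "PGS-adjusted SE:", se_adj)
```

Residualizing both the phenotype and the genotype on the PGS gives the same SNP estimate as fitting the PGS as a covariate in one regression. The PGS absorbs part of the residual variance, so the standard error of the SNP effect shrinks.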

PGS-adjusted RVATs

  • NAME : PGS-adjusted RVATs
  • SHORT NAME : PGS-adjusted RVATs
  • FULL NAME : PGS-adjusted rare variant association tests
  • DESCRIPTION : adjusting for common variant polygenic scores improves yield in gene-based rare variant association tests
  • KEYWORDS : PGS, Rare variants
  • TITLE : Adjusting for common variant polygenic scores improves yield in rare variant association analyses
  • DOI : 10.1038/s41588-023-01342-w
  • ABSTRACT : With the emergence of large-scale sequencing data, methods for improving power in rare variant association tests are needed. Here we show that adjusting for common variant polygenic scores improves yield in gene-based rare variant association tests across 65 quantitative traits in the UK Biobank (up to 20% increase at α = 2.6 × 10⁻⁶), without marked increases in false-positive rates or genomic inflation. Benefits were seen for various models, with the largest improvements seen for efficient sparse mixed-effects models. Our results illustrate how polygenic score adjustment can efficiently improve power in rare variant association discovery.
  • CITATION : Jurgens SJ, Pirruccello JP, Choi SH, Morrill VN, ...&, Ellinor PT. (2023) Adjusting for common variant polygenic scores improves yield in rare variant association analyses Nat. Genet. doi:10.1038/s41588-023-01342-w. PMID 36959364
  • JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2023 ; ; ;
  • PUBMED_LINK : 36959364

Review

Review-Povysil

  • NAME : Review-Povysil
  • TITLE : Rare-variant collapsing analyses for complex traits: guidelines and applications
  • DOI : 10.1038/s41576-019-0177-4
  • ABSTRACT : The first phase of genome-wide association studies (GWAS) assessed the role of common variation in human disease. Advances optimizing and economizing high-throughput sequencing have enabled a second phase of association studies that assess the contribution of rare variation to complex disease in all protein-coding genes. Unlike the early microarray-based studies, sequencing-based studies catalogue the full range of genetic variation, including the evolutionarily youngest forms. Although the experience with common variants helped establish relevant standards for genome-wide studies, the analysis of rare variation introduces several challenges that require novel analysis approaches.
  • CITATION : Povysil G, Petrovski S, Hostyk J, Aggarwal V, ...&, Goldstein DB. (2019) Rare-variant collapsing analyses for complex traits: guidelines and applications Nat. Rev. Genet., 20 (12) 747-759. doi:10.1038/s41576-019-0177-4. PMID 31605095
  • JOURNAL_INFO : Nature reviews. Genetics ; Nat. Rev. Genet. ; 2019 ; 20 ; 12 ; 747-759
  • PUBMED_LINK : 31605095

Single variant association tests

BOLT-LMM

  • NAME : BOLT-LMM
  • SHORT NAME : BOLT-LMM
  • FULL NAME : BOLT-LMM
  • DESCRIPTION : The BOLT-LMM software package currently consists of two main algorithms, the BOLT-LMM algorithm for mixed model association testing, and the BOLT-REML algorithm for variance components analysis (i.e., partitioning of SNP-heritability and estimation of genetic correlations).
  • URL : https://alkesgroup.broadinstitute.org/BOLT-LMM/BOLT-LMM_manual.html
  • KEYWORDS : non-infinitesimal model, mixture of two Gaussian distributions
  • TITLE : Efficient Bayesian mixed-model analysis increases association power in large cohorts
  • DOI : 10.1038/ng.3190
  • ABSTRACT : Linear mixed models are a powerful statistical tool for identifying genetic associations and avoiding confounding. However, existing methods are computationally intractable in large cohorts and may not optimize power. All existing methods require time cost O(MN²) (where N is the number of samples and M is the number of SNPs) and implicitly assume an infinitesimal genetic architecture in which effect sizes are normally distributed, which can limit power. Here we present a far more efficient mixed-model association method, BOLT-LMM, which requires only a small number of O(MN) time iterations and increases power by modeling more realistic, non-infinitesimal genetic architectures via a Bayesian mixture prior on marker effect sizes. We applied BOLT-LMM to 9 quantitative traits in 23,294 samples from the Women's Genome Health Study (WGHS) and observed significant increases in power, consistent with simulations. Theory and simulations show that the boost in power increases with cohort size, making BOLT-LMM appealing for genome-wide association studies in large cohorts.
  • CITATION : Loh PR, Tucker G, Bulik-Sullivan BK, Vilhjálmsson BJ, ...&, Price AL. (2015) Efficient Bayesian mixed-model analysis increases association power in large cohorts Nat. Genet., 47 (3) 284-290. doi:10.1038/ng.3190. PMID 25642633
  • JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2015 ; 47 ; 3 ; 284-290
  • PUBMED_LINK : 25642633

EMMAX

  • NAME : EMMAX
  • SHORT NAME : EMMAX
  • FULL NAME : efficient mixed-model association eXpedited
  • DESCRIPTION : EMMAX is a statistical test for large-scale human or model organism association mapping that accounts for sample structure. In addition to the computational efficiency obtained by the EMMA algorithm, EMMAX takes advantage of the fact that each locus explains only a small fraction of the variance of a complex trait, which makes it possible to avoid repeating the variance component estimation for every marker and substantially reduces the computational time of mixed-model association mapping.
  • URL : https://genome.sph.umich.edu/wiki/EMMAX
  • TITLE : Variance component model to account for sample structure in genome-wide association studies
  • DOI : 10.1038/ng.548
  • ABSTRACT : Although genome-wide association studies (GWASs) have identified numerous loci associated with complex traits, imprecise modeling of the genetic relatedness within study samples may cause substantial inflation of test statistics and possibly spurious associations. Variance component approaches, such as efficient mixed-model association (EMMA), can correct for a wide range of sample structures by explicitly accounting for pairwise relatedness between individuals, using high-density markers to model the phenotype distribution; but such approaches are computationally impractical. We report here a variance component approach implemented in publicly available software, EMMA eXpedited (EMMAX), that reduces the computational time for analyzing large GWAS data sets from years to hours. We apply this method to two human GWAS data sets, performing association analysis for ten quantitative traits from the Northern Finland Birth Cohort and seven common diseases from the Wellcome Trust Case Control Consortium. We find that EMMAX outperforms both principal component analysis and genomic control in correcting for sample structure.
  • CITATION : Kang HM, Sul JH, Service SK, ...&, Eskin E. (2010) Variance component model to account for sample structure in genome-wide association studies Nat. Genet., 42 (4) 348-354. doi:10.1038/ng.548. PMID 20208533
  • JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2010 ; 42 ; 4 ; 348-354
  • PUBMED_LINK : 20208533

GEMMA

  • NAME : GEMMA
  • SHORT NAME : GEMMA
  • FULL NAME : genome-wide efficient mixed-model association
  • DESCRIPTION : GEMMA is the software implementing the Genome-wide Efficient Mixed Model Association algorithm for a standard linear mixed model and some of its close relatives for genome-wide association studies (GWAS). It fits a standard linear mixed model (LMM) to account for population stratification and sample structure for single marker association tests. It fits a Bayesian sparse linear mixed model (BSLMM) using Markov chain Monte Carlo (MCMC) for estimating the proportion of variance in phenotypes explained (PVE) by typed genotypes (i.e. chip heritability), predicting phenotypes, and identifying associated markers by jointly modeling all markers while controlling for population structure. It is computationally efficient for large scale GWAS and uses freely available open-source numerical libraries.
  • URL : http://stephenslab.uchicago.edu/software.html#gemma
  • TITLE : Genome-wide efficient mixed-model analysis for association studies
  • DOI : 10.1038/ng.2310
  • ABSTRACT : Linear mixed models have attracted considerable attention recently as a powerful and effective tool for accounting for population stratification and relatedness in genetic association tests. However, existing methods for exact computation of standard test statistics are computationally impractical for even moderate-sized genome-wide association studies. To address this issue, several approximate methods have been proposed. Here, we present an efficient exact method, which we refer to as genome-wide efficient mixed-model association (GEMMA), that makes approximations unnecessary in many contexts. This method is approximately n times faster than the widely used exact method known as efficient mixed-model association (EMMA), where n is the sample size, making exact genome-wide association analysis computationally practical for large numbers of individuals.
  • CITATION : Zhou X, Stephens M. (2012) Genome-wide efficient mixed-model analysis for association studies Nat. Genet., 44 (7) 821-824. doi:10.1038/ng.2310. PMID 22706312
  • JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2012 ; 44 ; 7 ; 821-824
  • PUBMED_LINK : 22706312

LDAK-KVIK

  • NAME : LDAK-KVIK
  • SHORT NAME : LDAK-KVIK
  • URL : http://www.ldak.org/
  • CITATION : Hof, J. P. & Speed, D. LDAK-KVIK performs fast and powerful mixed-model association analysis of quantitative and binary phenotypes. bioRxiv 2024.07.25.24311005 (2024) doi:10.1101/2024.07.25.24311005.

PLINK

  • NAME : PLINK
  • SHORT NAME : PLINK
  • FULL NAME : PLINK
  • DESCRIPTION : A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner. The focus of PLINK is purely on analysis of genotype/phenotype data, so there is no support for steps prior to this (e.g. study design and planning, generating genotype or CNV calls from raw data). Through integration with gPLINK and Haploview, there is some support for the subsequent visualization, annotation and storage of results.
  • URL : https://www.cog-genomics.org/plink/
  • TITLE : PLINK: a tool set for whole-genome association and population-based linkage analyses
  • DOI : 10.1086/519795
  • ABSTRACT : Whole-genome association studies (WGAS) bring new computational, as well as analytic, challenges to researchers. Many existing genetic-analysis tools are not designed to handle such large data sets in a convenient manner and do not necessarily exploit the new opportunities that whole-genome data bring. To address these issues, we developed PLINK, an open-source C/C++ WGAS tool set. With PLINK, large data sets comprising hundreds of thousands of markers genotyped for thousands of individuals can be rapidly manipulated and analyzed in their entirety. As well as providing tools to make the basic analytic steps computationally efficient, PLINK also supports some novel approaches to whole-genome data that take advantage of whole-genome coverage. We introduce PLINK and describe the five main domains of function: data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation. In particular, we focus on the estimation and use of identity-by-state and identity-by-descent information in the context of population-based whole-genome studies. This information can be used to detect and correct for population stratification and to identify extended chromosomal segments that are shared identical by descent between very distantly related individuals. Analysis of the patterns of segmental sharing has the potential to map disease loci that contain multiple rare variants in a population-based linkage analysis.
  • COPYRIGHT : https://www.elsevier.com/open-access/userlicense/1.0/
  • CITATION : Purcell S, Neale B, Todd-Brown K, Thomas L, ...&, Sham PC. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses Am. J. Hum. Genet., 81 (3) 559-575. doi:10.1086/519795. PMID 17701901
  • JOURNAL_INFO : The American Journal of Human Genetics ; Am. J. Hum. Genet. ; 2007 ; 81 ; 3 ; 559-575
  • PUBMED_LINK : 17701901

PLINK2

  • NAME : PLINK2
  • SHORT NAME : PLINK2
  • FULL NAME : PLINK2
  • URL : https://www.cog-genomics.org/plink/2.0/
  • TITLE : Second-generation PLINK: rising to the challenge of larger and richer datasets
  • DOI : 10.1186/s13742-015-0047-8
  • ABSTRACT : BACKGROUND: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format. FINDINGS: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(√n)-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0). CONCLUSIONS: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
  • CITATION : Chang CC, Chow CC, Tellier LC, Vattikuti S, ...&, Lee JJ. (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets Gigascience, 4 (1) 7. doi:10.1186/s13742-015-0047-8. PMID 25722852
  • JOURNAL_INFO : GigaScience ; Gigascience ; 2015 ; 4 ; 1 ; 7
  • PUBMED_LINK : 25722852
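
The bit-level parallelism mentioned in the abstract can be sketched as follows: genotypes are packed two bits per sample, and allele counts for many samples are obtained with a handful of bitwise operations plus a population count. The 2-bit coding used here (0, 1, 2 = alternate-allele count, 3 = missing) and the function names are assumptions made for illustration, not a description of PLINK's file formats or internals.

```python
def pack_genotypes(codes):
    """Pack 2-bit genotype codes into a single Python int, sample 0 in the lowest bits."""
    word = 0
    for i, code in enumerate(codes):
        word |= (code & 0b11) << (2 * i)
    return word


def count_alt_alleles(word, n_samples):
    """Count alternate alleles across all packed samples, skipping missing calls."""
    low_mask = int("01" * n_samples, 2)   # low bit of every 2-bit field
    low = word & low_mask
    high = (word >> 1) & low_mask
    missing = low & high                  # fields equal to 3
    het = low ^ missing                   # fields equal to 1
    hom_alt = high ^ missing              # fields equal to 2
    return bin(het).count("1") + 2 * bin(hom_alt).count("1")


codes = [0, 1, 2, 3, 2, 1, 0, 0]
packed = pack_genotypes(codes)
alt_count = count_alt_alleles(packed, len(codes))
print(alt_count)
```

In a compiled implementation the same masks would be applied to fixed-width machine words, so one instruction handles dozens of samples at once.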

POLMM

  • NAME : POLMM
  • SHORT NAME : POLMM
  • FULL NAME : proportional odds logistic mixed model (POLMM)
  • DESCRIPTION : Proportional Odds Logistic Mixed Model (POLMM) for ordinal categorical data analysis
  • URL : https://github.com/WenjianBI/POLMM
  • KEYWORDS : ordinal categorical phenotypes
  • TITLE : Efficient mixed model approach for large-scale genome-wide association studies of ordinal categorical phenotypes
  • DOI : 10.1016/j.ajhg.2021.03.019
  • ABSTRACT : In genome-wide association studies, ordinal categorical phenotypes are widely used to measure human behaviors, satisfaction, and preferences. However, because of the lack of analysis tools, methods designed for binary or quantitative traits are commonly used inappropriately to analyze categorical phenotypes. To accurately model the dependence of an ordinal categorical phenotype on covariates, we propose an efficient mixed model association test, proportional odds logistic mixed model (POLMM). POLMM is computationally efficient to analyze large datasets with hundreds of thousands of samples, can control type I error rates at a stringent significance level regardless of the phenotypic distribution, and is more powerful than alternative methods. In contrast, the standard linear mixed model approaches cannot control type I error rates for rare variants when the phenotypic distribution is unbalanced, although they performed well when testing common variants. We applied POLMM to 258 ordinal categorical phenotypes on array genotypes and imputed samples from 408,961 individuals in UK Biobank. In total, we identified 5,885 genome-wide significant variants, of which, 424 variants (7.2%) are rare variants with MAF < 0.01.
  • COPYRIGHT : http://www.elsevier.com/open-access/userlicense/1.0/
  • CITATION : Bi W, Zhou W, Dey R, Mukherjee B, ...&, Lee S. (2021) Efficient mixed model approach for large-scale genome-wide association studies of ordinal categorical phenotypes Am. J. Hum. Genet., 108 (5) 825-839. doi:10.1016/j.ajhg.2021.03.019. PMID 33836139
  • JOURNAL_INFO : The American Journal of Human Genetics ; Am. J. Hum. Genet. ; 2021 ; 108 ; 5 ; 825-839
  • PUBMED_LINK : 33836139
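
The proportional odds model named above links an ordinal phenotype to a linear predictor through cumulative logits. A minimal sketch, assuming the sign convention logit P(Y <= j) = cutpoint_j - eta and ignoring the random effect that POLMM adds for relatedness, is shown here. The cutpoints, the effect size, and the function name are hypothetical.

```python
import math


def category_probabilities(cutpoints, eta):
    """P(Y = j) for j = 0..J-1, given increasing cutpoints and linear predictor eta."""
    cumulative = [1.0 / (1.0 + math.exp(-(c - eta))) for c in cutpoints] + [1.0]
    probs = [cumulative[0]]
    probs += [cumulative[j] - cumulative[j - 1] for j in range(1, len(cumulative))]
    return probs


cutpoints = [-1.0, 0.5, 2.0]   # three cutpoints give four ordered categories
beta = 0.4                     # hypothetical per-allele effect on the logit scale
by_genotype = {g: category_probabilities(cutpoints, beta * g) for g in (0, 1, 2)}
for g, probs in by_genotype.items():
    print(g, [round(p, 3) for p in probs])
```

A single coefficient shifts every cumulative logit by the same amount, which is the proportional odds assumption: with a positive effect, each additional allele moves probability mass toward the higher categories.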

QRGWAS

  • NAME : QRGWAS
  • SHORT NAME : QRGWAS
  • FULL NAME : Quantile regression GWAS
  • URL : https://github.com/Iuliana-Ionita-Laza/QRGWAS
  • TITLE : Genome-wide discovery for biomarkers using quantile regression at biobank scale
  • DOI : 10.1038/s41467-024-50726-x
  • ABSTRACT : Genome-wide association studies (GWAS) for biomarkers important for clinical phenotypes can lead to clinically relevant discoveries. Conventional GWAS for quantitative traits are based on simplified regression models modeling the conditional mean of a phenotype as a linear function of genotype. We draw attention here to an alternative, lesser known approach, namely quantile regression that naturally extends linear regression to the analysis of the entire conditional distribution of a phenotype of interest. Quantile regression can be applied efficiently at biobank scale, while having some unique advantages such as (1) identifying variants with heterogeneous effects across quantiles of the phenotype distribution; (2) accommodating a wide range of phenotype distributions including non-normal distributions, with invariance of results to trait transformations; and (3) providing more detailed information about genotype-phenotype associations even for those associations identified by conventional GWAS. We show in simulations that quantile regression is powerful across both homogeneous and various heterogeneous models. Applications to 39 quantitative traits in the UK Biobank demonstrate that quantile regression can be a helpful complement to linear regression in GWAS and can identify variants with larger effects on high-risk subgroups of individuals but with lower or no contribution overall.
  • COPYRIGHT : https://creativecommons.org/licenses/by-nc-nd/4.0
  • CITATION : Wang C, Wang T, Kiryluk K, Wei Y, ...&, Ionita-Laza I. (2024) Genome-wide discovery for biomarkers using quantile regression at biobank scale Nat. Commun., 15 (1) 6460. doi:10.1038/s41467-024-50726-x. PMID 39085219
  • JOURNAL_INFO : Nature communications ; Nat. Commun. ; 2024 ; 15 ; 1 ; 6460
  • PUBMED_LINK : 39085219
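
Quantile regression replaces the squared-error loss of linear regression with the check (pinball) loss, whose minimizer is a conditional quantile instead of the conditional mean. The sketch below uses a binary carrier indicator as the only predictor, so the fit reduces to a separate quantile per group. The toy phenotype values and function names are assumptions, and the exhaustive search stands in for the linear-programming solvers a real implementation would use.

```python
def pinball_loss(residual, tau):
    """Check loss: weight tau on positive residuals, (1 - tau) on negative ones."""
    return tau * residual if residual >= 0 else (tau - 1.0) * residual


def total_loss(values, q, tau):
    return sum(pinball_loss(v - q, tau) for v in values)


def best_constant(values, tau):
    """Observed value that minimizes the total check loss (a tau-th sample quantile)."""
    return min(sorted(values), key=lambda q: total_loss(values, q, tau))


# Hypothetical phenotype values for non-carriers (0) and carriers (1) of a variant.
phenotype = {
    0: [1.0, 1.2, 1.4, 1.6, 1.8],
    1: [1.0, 1.3, 1.6, 2.4, 3.0],
}

effects = {}
for tau in (0.1, 0.5, 0.9):
    effects[tau] = best_constant(phenotype[1], tau) - best_constant(phenotype[0], tau)
print(effects)
```

In this toy example the carrier effect grows from the lower to the upper quantile, the kind of heterogeneous effect across the phenotype distribution that a mean-based GWAS summarizes as a single number.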

REGENIE

  • NAME : REGENIE
  • SHORT NAME : REGENIE
  • FULL NAME : REGENIE
  • DESCRIPTION : regenie is a C++ program for whole genome regression modelling of large genome-wide association studies. It is developed and supported by a team of scientists at the Regeneron Genetics Center.
  • URL : https://github.com/rgcgithub/regenie
  • KEYWORDS : whole genome regression
  • TITLE : Computationally efficient whole-genome regression for quantitative and binary traits
  • DOI : 10.1038/s41588-021-00870-7
  • ABSTRACT : Genome-wide association analysis of cohorts with thousands of phenotypes is computationally expensive, particularly when accounting for sample relatedness or population structure. Here we present a novel machine-learning method called REGENIE for fitting a whole-genome regression model for quantitative and binary phenotypes that is substantially faster than alternatives in multi-trait analyses while maintaining statistical efficiency. The method naturally accommodates parallel analysis of multiple phenotypes and requires only local segments of the genotype matrix to be loaded in memory, in contrast to existing alternatives, which must load genome-wide matrices into memory. This results in substantial savings in compute time and memory usage. We introduce a fast, approximate Firth logistic regression test for unbalanced case-control phenotypes. The method is ideally suited to take advantage of distributed computing frameworks. We demonstrate the accuracy and computational benefits of this approach using the UK Biobank dataset with up to 407,746 individuals.
  • CITATION : Mbatchou J, Barnard L, Backman J, Marcketta A, ...&, Marchini J. (2021) Computationally efficient whole-genome regression for quantitative and binary traits Nat. Genet., 53 (7) 1097-1103. doi:10.1038/s41588-021-00870-7. PMID 34017140
  • JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2021 ; 53 ; 7 ; 1097-1103
  • PUBMED_LINK : 34017140

SAIGE

  • NAME : SAIGE
  • SHORT NAME : SAIGE
  • FULL NAME : Scalable and Accurate Implementation of GEneralized mixed model
  • DESCRIPTION : SAIGE is an R package with Scalable and Accurate Implementation of Generalized mixed model (Chen, H. et al. 2016). It accounts for sample relatedness and is feasible for genetic association tests in large cohorts and biobanks (N > 400,000). SAIGE performs single-variant association tests for binary traits and quantitative traits. For binary traits, SAIGE uses the saddlepoint approximation (SPA) (Imhof, J. P., 1961; Kuonen, D., 1999; Dey, R. et al., 2017) to account for case-control imbalance.
  • URL : https://github.com/weizhouUMICH/SAIGE
  • KEYWORDS : case-control imbalance, saddlepoint approximation (SPA)
  • TITLE : Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies
  • DOI : 10.1038/s41588-018-0184-y
  • ABSTRACT : In genome-wide association studies (GWAS) for thousands of phenotypes in large biobanks, most binary traits have substantially fewer cases than controls. Both of the widely used approaches, the linear mixed model and the recently proposed logistic mixed model, perform poorly; they produce large type I error rates when used to analyze unbalanced case-control phenotypes. Here we propose a scalable and accurate generalized mixed model association test that uses the saddlepoint approximation to calibrate the distribution of score test statistics. This method, SAIGE (Scalable and Accurate Implementation of GEneralized mixed model), provides accurate P values even when case-control ratios are extremely unbalanced. SAIGE uses state-of-art optimization strategies to reduce computational costs; hence, it is applicable to GWAS for thousands of phenotypes by large biobanks. Through the analysis of UK Biobank data of 408,961 samples from white British participants with European ancestry for > 1,400 binary phenotypes, we show that SAIGE can efficiently analyze large sample data, controlling for unbalanced case-control ratios and sample relatedness.
  • CITATION : Zhou W, Nielsen JB, Fritsche LG, Dey R, ...&, Lee S. (2018) Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies Nat. Genet., 50 (9) 1335-1341. doi:10.1038/s41588-018-0184-y. PMID 30104761
  • JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2018 ; 50 ; 9 ; 1335-1341
  • PUBMED_LINK : 30104761
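
The role of the saddlepoint approximation can be illustrated with a score statistic S = sum_i g_i (y_i - mu_i), where y_i is Bernoulli(mu_i) under the null. The sketch below is a simplified, independent-samples version that ignores relatedness and covariate adjustment, so it is not SAIGE's implementation. It solves the saddlepoint equation by bisection and applies a Barndorff-Nielsen-type tail formula; the function names and toy data are assumptions.

```python
import math


def normal_sf(x):
    """Upper-tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def score_pvalues(genotypes, mu, cases):
    """Upper-tail p-values for the score statistic: (saddlepoint, normal)."""
    pairs = list(zip(genotypes, mu))
    s_obs = sum(g * (y - m) for (g, m), y in zip(pairs, cases))

    def k0(t):  # cumulant generating function of S
        return sum(math.log(1 - m + m * math.exp(g * t)) - t * g * m for g, m in pairs)

    def k1(t):  # first derivative, increasing in t
        return sum(g * m * math.exp(g * t) / (1 - m + m * math.exp(g * t)) - g * m
                   for g, m in pairs)

    def k2(t):  # second derivative (variance of S at t = 0)
        return sum(g * g * (1 - m) * m * math.exp(g * t)
                   / (1 - m + m * math.exp(g * t)) ** 2 for g, m in pairs)

    p_normal = normal_sf(s_obs / math.sqrt(k2(0.0)))

    lo, hi = -30.0, 30.0          # bracket assumed wide enough for the toy data
    for _ in range(200):          # bisection for k1(t) = s_obs
        mid = 0.5 * (lo + hi)
        if k1(mid) < s_obs:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    if abs(t) < 1e-8:             # statistic at its mean: fall back to normal
        return p_normal, p_normal

    w = math.copysign(math.sqrt(2.0 * (t * s_obs - k0(t))), t)
    v = t * math.sqrt(k2(t))
    return normal_sf(w + math.log(v / w) / w), p_normal


# Toy data: 1,000 samples, 1% expected case rate, 50 carriers, 3 of them cases.
genotypes = [1] * 50 + [0] * 950
mu = [0.01] * 1000
cases = [1, 1, 1] + [0] * 997
p_spa, p_normal = score_pvalues(genotypes, mu, cases)
print("SPA:", p_spa, "normal:", p_normal)
```

With few expected cases among carriers the null distribution of the score is strongly right-skewed, so the normal approximation gives a much smaller p-value than the saddlepoint calculation. This is the type I error inflation under case-control imbalance that the abstract describes.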

fastGWA

  • NAME : fastGWA
  • SHORT NAME : fastGWA
  • FULL NAME : fastGWA
  • URL : https://yanglab.westlake.edu.cn/software/gcta/#fastGWA
  • KEYWORDS : grid-search-based REML algorithm
  • TITLE : A resource-efficient tool for mixed model association analysis of large-scale data
  • DOI : 10.1038/s41588-019-0530-8
  • ABSTRACT : The genome-wide association study (GWAS) has been widely used as an experimental design to detect associations between genetic variants and a phenotype. Two major confounding factors, population stratification and relatedness, could potentially lead to inflated GWAS test statistics and hence to spurious associations. Mixed linear model (MLM)-based approaches can be used to account for sample structure. However, genome-wide association (GWA) analyses in biobank samples such as the UK Biobank (UKB) often exceed the capability of most existing MLM-based tools especially if the number of traits is large. Here, we develop an MLM-based tool (fastGWA) that controls for population stratification by principal components and for relatedness by a sparse genetic relationship matrix for GWA analyses of biobank-scale data. We demonstrate by extensive simulations that fastGWA is reliable, robust and highly resource-efficient. We then apply fastGWA to 2,173 traits on array-genotyped and imputed samples from 456,422 individuals and to 2,048 traits on whole-exome-sequenced samples from 46,191 individuals in the UKB.
  • CITATION : Jiang L, Zheng Z, Qi T, Kemper KE, ...&, Yang J. (2019) A resource-efficient tool for mixed model association analysis of large-scale data Nat. Genet., 51 (12) 1749-1755. doi:10.1038/s41588-019-0530-8. PMID 31768069
  • JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2019 ; 51 ; 12 ; 1749-1755
  • PUBMED_LINK : 31768069

fastGWA-GLMM

  • NAME : fastGWA-GLMM
  • SHORT NAME : fastGWA-GLMM
  • FULL NAME : fastGWA-GLMM
  • URL : https://yanglab.westlake.edu.cn/software/gcta/#fastGWA
  • TITLE : A generalized linear mixed model association tool for biobank-scale data
  • DOI : 10.1038/s41588-021-00954-4
  • ABSTRACT : Compared with linear mixed model-based genome-wide association (GWA) methods, generalized linear mixed model (GLMM)-based methods have better statistical properties when applied to binary traits but are computationally much slower. In the present study, leveraging efficient sparse matrix-based algorithms, we developed a GLMM-based GWA tool, fastGWA-GLMM, that is severalfold to orders of magnitude faster than the state-of-the-art tools when applied to the UK Biobank (UKB) data and scalable to cohorts with millions of individuals. We show by simulation that the fastGWA-GLMM test statistics of both common and rare variants are well calibrated under the null, even for traits with extreme case-control ratios. We applied fastGWA-GLMM to the UKB data of 456,348 individuals, 11,842,647 variants and 2,989 binary traits (full summary statistics available at http://fastgwa.info/ukbimpbin ), and identified 259 rare variants associated with 75 traits, demonstrating the use of imputed genotype data in a large cohort to discover rare variants for binary complex traits.
  • COPYRIGHT : https://www.springernature.com/gp/researchers/text-and-data-mining
  • CITATION : Jiang L, Zheng Z, Fang H, Yang J. (2021) A generalized linear mixed model association tool for biobank-scale data Nat. Genet., 53 (11) 1616-1621. doi:10.1038/s41588-021-00954-4. PMID 34737426
  • JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2021 ; 53 ; 11 ; 1616-1621
  • PUBMED_LINK : 34737426