Curation of Association tests — listings under the GWAS Tools tab.

Summary Table

NAME | CATEGORY | Main citation | YEAR
CC-GWAS | Case-case GWAS | Peyrot WJ et al., Nat Genet, 2021 | 2021
SPAGRM | GWAS of longitudinal traits | Xu H et al., Nat Commun, 2025 | 2025
TrajGWAS | GWAS of longitudinal trajectories | Ko S et al., Am J Hum Genet, 2022 | 2022
GWAX | GWAS using family history | Liu JZ et al., Nat Genet, 2017 | 2017
LT-FH | GWAS using family history | Hujoel MLA et al., Nat Genet, 2020 | 2020
SiblingGWAS | GWAS using family history | Howe LJ et al., Nat Genet, 2022 | 2022
snipar-unified estimator | GWAS using family history | Guan J et al., Nat Genet, 2025 | 2025
snipar | GWAS using family history | Young AI et al., Nat Genet, 2022 | 2022
MultiSTAAR | Gene-based analysis (rare variant) | Li X et al., Nat Comput Sci, 2025 | 2025
REGENIE | Gene-based analysis (rare variant) | Mbatchou J et al., Nat Genet, 2021 | 2021
SAIGE-GENE+ | Gene-based analysis (rare variant) | Zhou W et al., Nat Genet, 2022 | 2022
SAIGE-GENE | Gene-based analysis (rare variant) | Zhou W et al., Nat Genet, 2020 | 2020
SKAT-O | Gene-based analysis (rare variant) | Lee S et al., Biostatistics, 2012 | 2012
SKAT | Gene-based analysis (rare variant) | Wu MC et al., Am J Hum Genet, 2011 | 2011
STAAR | Gene-based analysis (rare variant) | Li X et al., Nat Genet, 2020 | 2020
STAARpipeline | Gene-based analysis (rare variant) | Li Z et al., Nat Methods, 2022 | 2022
LDAK-GBAT | Gene-based analysis (sumstats) | Berrandou TE et al., Am J Hum Genet, 2023 | 2023
GATE | Genome-wide survival association analysis | Dey R et al., Nat Commun, 2022 | 2022
COWAS | MISC | Malakhov MM et al., Nat Commun, 2025 | 2025
GWAS-by-Subtraction | Other | Demange PA et al., Nat Genet, 2021 | 2021
PGS-adjusted GWAS | PGS-adjusted GWAS | Campos AI et al., Nat Genet, 2023 | 2023
PGS-adjusted RVATs | PGS-adjusted GWAS | Jurgens SJ et al., Nat Genet, 2023 | 2023
POP-GWAS | Phenotype imputation | Miao J et al., Nat Genet, 2024 | 2024
Review-Povysil | Review | Povysil G et al., Nat Rev Genet, 2019 | 2019
BOLT-LMM | Single variant association tests | Loh PR et al., Nat Genet, 2015 | 2015
EMMAX | Single variant association tests | Kang HM et al., Nat Genet, 2010 | 2010
GEMMA | Single variant association tests | Zhou X et al., Nat Genet, 2012 | 2012
LDAK-KVIK | Single variant association tests | Hof, J. P. & Speed, D. LDAK-KVIK performs fast and powerful mixed-model association analysis of quantitative and… | NA
PLINK2 | Single variant association tests | Chang CC et al., Gigascience, 2015 | 2015
PLINK | Single variant association tests | Purcell S et al., Am J Hum Genet, 2007 | 2007
POLMM | Single variant association tests | Bi W et al., Am J Hum Genet, 2021 | 2021
QRGWAS | Single variant association tests | Wang C et al., Nat Commun, 2024 | 2024
Quickdraws | Single variant association tests | Loya H et al., Nat Genet, 2025 | 2025
REGENIE | Single variant association tests | Mbatchou J et al., Nat Genet, 2021 | 2021
SAIGE | Single variant association tests | Zhou W et al., Nat Genet, 2018 | 2018
fastGWA-GLMM | Single variant association tests | Jiang L et al., Nat Genet, 2021 | 2021
fastGWA | Single variant association tests | Jiang L et al., Nat Genet, 2019 | 2019

Case-case GWAS

CC-GWAS

Tool
PUBMED_LINK
33686288
FULL NAME
case–case genome-wide association study
DESCRIPTION
The CCGWAS R package provides a tool for case-case association testing of two different disorders based on their respective case-control GWAS results.
URL
https://github.com/wouterpeyrot/CCGWAS
TITLE
Identifying loci with different allele frequencies among cases of eight psychiatric disorders using CC-GWAS.
Main citation
Peyrot WJ, Price AL. (2021) Identifying loci with different allele frequencies among cases of eight psychiatric disorders using CC-GWAS. Nat Genet, 53 (4) 445-454. doi:10.1038/s41588-021-00787-1. PMID 33686288
ABSTRACT
Psychiatric disorders are highly genetically correlated, but little research has been conducted on the genetic differences between disorders. We developed a new method (case-case genome-wide association study; CC-GWAS) to test for differences in allele frequency between cases of two disorders using summary statistics from the respective case-control GWAS, transcending current methods that require individual-level data. Simulations and analytical computations confirm that CC-GWAS is well powered with effective control of type I error. We applied CC-GWAS to publicly available summary statistics for schizophrenia, bipolar disorder, major depressive disorder and five other psychiatric disorders. CC-GWAS identified 196 independent case-case loci, including 72 CC-GWAS-specific loci that were not significant at the genome-wide level in the input case-control summary statistics; two of the CC-GWAS-specific loci implicate the genes KLF6 and KLF16 (from the Krüppel-like family of transcription factors), which have been linked to neurite outgrowth and axon regeneration. CC-GWAS loci replicated convincingly in applications to datasets with independent replication data.
DOI
10.1038/s41588-021-00787-1

GWAS of longitudinal traits

SPAGRM

Tool
PUBMED_LINK
39915470
DESCRIPTION
SPAGRM is a scalable and accurate analysis framework to control for sample relatedness in large-scale genome-wide association studies (GWAS).
URL
https://github.com/HeXuPKU/SPAGRM
KEYWORDS
SPA, longitudinal traits
TITLE
SPA
Main citation
Xu H, Ma Y, Xu LL, Li Y, ...&, Bi W. (2025) SPA Nat Commun, 16 (1) 1413. doi:10.1038/s41467-025-56669-1. PMID 39915470
ABSTRACT
Sample relatedness is a major confounder in genome-wide association studies (GWAS), potentially leading to inflated type I error rates if not appropriately controlled. A common strategy is to incorporate a random effect related to genetic relatedness matrix (GRM) into regression models. However, this approach is challenging for large-scale GWAS of complex traits, such as longitudinal traits. Here we propose a scalable and accurate analysis framework, SPAGRM, which controls for sample relatedness via a precise approximation of the joint distribution of genotypes. SPAGRM can utilize GRM-free models and thus is applicable to various trait types and statistical methods, including linear mixed models and generalized estimation equations for longitudinal traits. A hybrid strategy incorporating saddlepoint approximation greatly increases the accuracy to analyze low-frequency and rare genetic variants, especially in unbalanced phenotypic distributions. We also introduce SPAGRM(CCT) to aggregate the results following different models via Cauchy combination test. Extensive simulations and real data analyses demonstrated that SPAGRM maintains well-controlled type I error rates and SPAGRM(CCT) can serve as a broadly effective method. Applying SPAGRM to 79 longitudinal traits extracted from UK Biobank primary care data, we identified 7,463 genetic loci, making a pioneering attempt to conduct GWAS for these traits as longitudinal traits.
DOI
10.1038/s41467-025-56669-1
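
The abstract notes that SPAGRM(CCT) aggregates results from different models via the Cauchy combination test. The combination rule itself can be sketched in a few lines of Python. This is only an illustration of the general test, not code from the SPAGRM package: the function name `cauchy_combination` and the default of equal weights are assumptions.

```python
import math


def cauchy_combination(pvalues, weights=None):
    """Combine p-values with the Cauchy combination test (illustrative sketch)."""
    if not pvalues:
        raise ValueError("need at least one p-value")
    if weights is None:
        # Assumption: equal weights when none are supplied.
        weights = [1.0] * len(pvalues)
    total = sum(weights)
    # Map each p-value to a standard Cauchy variate and take the weighted mean.
    t = sum(w * math.tan((0.5 - p) * math.pi)
            for w, p in zip(weights, pvalues)) / total
    # The combined p-value is the upper tail of a standard Cauchy at t.
    return 0.5 - math.atan(t) / math.pi
```

A single p-value is returned unchanged (up to floating-point error), and identical inputs combine to the same value, which is a quick sanity check on the transform. Extremely small p-values may need a more careful tail approximation than this sketch provides.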

GWAS of longitudinal trajectories

TrajGWAS

Tool
PUBMED_LINK
35196515
FULL NAME
GWAS of longitudinal trajectories
DESCRIPTION
TrajGWAS.jl is a Julia package for performing genome-wide association studies (GWAS) for continuous longitudinal phenotypes using a modified linear mixed effects model. It builds upon the within-subject variance estimation by robust regression (WiSER) method and can be used to identify variants associated with changes in the mean and within-subject variability of the longitudinal trait.
URL
https://github.com/OpenMendel/TrajGWAS.jl
KEYWORDS
biomarker trajectories, mean, within-subject (WS) variability, linear mixed effect model, within-subject variance estimation by robust regression (WiSER) method
TITLE
GWAS of longitudinal trajectories at biobank scale.
Main citation
Ko S, German CA, Jensen A, Shen J, ...&, Zhou JJ. (2022) GWAS of longitudinal trajectories at biobank scale. Am J Hum Genet, 109 (3) 433-445. doi:10.1016/j.ajhg.2022.01.018. PMID 35196515
ABSTRACT
Biobanks linked to massive, longitudinal electronic health record (EHR) data make numerous new genetic research questions feasible. One among these is the study of biomarker trajectories. For example, high blood pressure measurements over visits strongly predict stroke onset, and consistently high fasting glucose and Hb1Ac levels define diabetes. Recent research reveals that not only the mean level of biomarker trajectories but also their fluctuations, or within-subject (WS) variability, are risk factors for many diseases. Glycemic variation, for instance, is recently considered an important clinical metric in diabetes management. It is crucial to identify the genetic factors that shift the mean or alter the WS variability of a biomarker trajectory. Compared to traditional cross-sectional studies, trajectory analysis utilizes more data points and captures a complete picture of the impact of time-varying factors, including medication history and lifestyle. Currently, there are no efficient tools for genome-wide association studies (GWASs) of biomarker trajectories at the biobank scale, even for just mean effects. We propose TrajGWAS, a linear mixed effect model-based method for testing genetic effects that shift the mean or alter the WS variability of a biomarker trajectory. It is scalable to biobank data with 100,000 to 1,000,000 individuals and many longitudinal measurements and robust to distributional assumptions. Simulation studies corroborate that TrajGWAS controls the type I error rate and is powerful. Analysis of eleven biomarkers measured longitudinally and extracted from UK Biobank primary care data for more than 150,000 participants with 1,800,000 observations reveals loci that significantly alter the mean or WS variability.
DOI
10.1016/j.ajhg.2022.01.018

GWAS using family history

GWAX

Tool
PUBMED_LINK
28092683
FULL NAME
genome-wide association by proxy
DESCRIPTION
In randomly ascertained cohorts, replacing cases with their first-degree relatives enables studies of diseases that are absent (or nearly absent) in the cohort.
TITLE
Case-control association mapping by proxy using family history of disease.
Main citation
Liu JZ, Erlich Y, Pickrell JK. (2017) Case-control association mapping by proxy using family history of disease. Nat Genet, 49 (3) 325-331. doi:10.1038/ng.3766. PMID 28092683
ABSTRACT
Collecting cases for case-control genetic association studies can be time-consuming and expensive. In some situations (such as studies of late-onset or rapidly lethal diseases), it may be more practical to identify family members of cases. In randomly ascertained cohorts, replacing cases with their first-degree relatives enables studies of diseases that are absent (or nearly absent) in the cohort. We refer to this approach as genome-wide association study by proxy (GWAX) and apply it to 12 common diseases in 116,196 individuals from the UK Biobank. Meta-analysis with published genome-wide association study summary statistics replicated established risk loci and yielded four newly associated loci for Alzheimer's disease, eight for coronary artery disease and five for type 2 diabetes. In addition to informing disease biology, our results demonstrate the utility of association mapping without directly observing cases. We anticipate that GWAX will prove useful in future genetic studies of complex traits in large population cohorts.
DOI
10.1038/ng.3766
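
The proxy-case idea described above can be illustrated with a small Python sketch. It assumes each record carries the participant's own disease status, a flag for an affected first-degree relative, and an allele count at one variant; the class, field, and function names are hypothetical. Grouping true cases with proxy cases is also an assumption made here for simplicity.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    affected: bool            # participant's own disease status
    relative_affected: bool   # any first-degree relative reported as affected
    alt_alleles: int          # alternate-allele count at one variant (0, 1 or 2)


def proxy_label(p, include_true_cases=True):
    """Return 1 for a proxy case (or true case), 0 for a control."""
    if p.relative_affected:
        return 1
    if p.affected:
        # Assumption: true cases are grouped with proxy cases; with
        # include_true_cases=False they are flagged as None (excluded).
        return 1 if include_true_cases else None
    return 0


def allele_frequency_by_group(participants):
    """Alternate-allele frequency among proxy cases (1) and controls (0)."""
    counts = {0: [0, 0], 1: [0, 0]}  # label -> [alt alleles, total alleles]
    for p in participants:
        label = proxy_label(p)
        counts[label][0] += p.alt_alleles
        counts[label][1] += 2
    return {label: (alt / tot if tot else float("nan"))
            for label, (alt, tot) in counts.items()}
```

Comparing the two frequencies is the simplest form of the association test; a real analysis would fit a regression with covariates rather than compare raw frequencies.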

LT-FH

Tool
PUBMED_LINK
32313248
FULL NAME
liability threshold model, conditional on case–control status and family history
DESCRIPTION
An association method based on posterior mean genetic liabilities under a liability threshold model, conditional on case-control status and family history (LT-FH).
URL
https://alkesgroup.broadinstitute.org/UKBB/LTFH/
TITLE
Liability threshold modeling of case-control status and family history of disease increases association power.
Main citation
Hujoel MLA, Gazal S, Loh PR, Patterson N, ...&, Price AL. (2020) Liability threshold modeling of case-control status and family history of disease increases association power. Nat Genet, 52 (5) 541-547. doi:10.1038/s41588-020-0613-6. PMID 32313248
ABSTRACT
Family history of disease can provide valuable information in case-control association studies, but it is currently unclear how to best combine case-control status and family history of disease. We developed an association method based on posterior mean genetic liabilities under a liability threshold model, conditional on case-control status and family history (LT-FH). Analyzing 12 diseases from the UK Biobank (average N = 350,000) we compared LT-FH to genome-wide association without using family history (GWAS) and a previous proxy-based method incorporating family history (GWAX). LT-FH was 63% (standard error (s.e.) 6%) more powerful than GWAS and 36% (s.e. 4%) more powerful than the trait-specific maximum of GWAS and GWAX, based on the number of independent genome-wide-significant loci across all diseases (for example, 690 loci for LT-FH versus 423 for GWAS); relative improvements were similar when applying BOLT-LMM to GWAS, GWAX and LT-FH phenotypes. Thus, LT-FH greatly increases association power when family history of disease is available.
DOI
10.1038/s41588-020-0613-6

SiblingGWAS

Tool
PUBMED_LINK
35534559
FULL NAME
Within-sibship genome-wide association analyses
DESCRIPTION
Scripts for running GWAS using siblings to estimate Within-Family (WF) and Between-Family (BF) effects of genetic variants on continuous traits. Allows the inclusion of more than two siblings from one family.
URL
https://github.com/LaurenceHowe/SiblingGWAS
TITLE
Within-sibship genome-wide association analyses decrease bias in estimates of direct genetic effects.
Main citation
Howe LJ, Nivard MG, Morris TT, Hansen AF, ...&, Davies NM. (2022) Within-sibship genome-wide association analyses decrease bias in estimates of direct genetic effects. Nat Genet, 54 (5) 581-592. doi:10.1038/s41588-022-01062-7. PMID 35534559
ABSTRACT
Estimates from genome-wide association studies (GWAS) of unrelated individuals capture effects of inherited variation (direct effects), demography (population stratification, assortative mating) and relatives (indirect genetic effects). Family-based GWAS designs can control for demographic and indirect genetic effects, but large-scale family datasets have been lacking. We combined data from 178,086 siblings from 19 cohorts to generate population (between-family) and within-sibship (within-family) GWAS estimates for 25 phenotypes. Within-sibship GWAS estimates were smaller than population estimates for height, educational attainment, age at first birth, number of children, cognitive ability, depressive symptoms and smoking. Some differences were observed in downstream SNP heritability, genetic correlations and Mendelian randomization analyses. For example, the within-sibship genetic correlation between educational attainment and body mass index attenuated towards zero. In contrast, analyses of most molecular phenotypes (for example, low-density lipoprotein-cholesterol) were generally consistent. We also found within-sibship evidence of polygenic adaptation on taller height. Here, we illustrate the importance of family-based GWAS data for phenotypes influenced by demographic and indirect genetic effects.
DOI
10.1038/s41588-022-01062-7
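
One common way to separate within-family (WF) from between-family (BF) effects is to regress the trait on each sibling's deviation from the family-mean genotype and on the family mean itself. The sketch below shows that decomposition in plain Python for a single variant with no covariates. It may differ in detail from the SiblingGWAS scripts, and the function name and argument layout are assumptions.

```python
from collections import defaultdict


def within_between_effects(family_ids, genotypes, phenotypes):
    """Estimate within-family (WF) and between-family (BF) slopes by OLS."""
    by_family = defaultdict(list)
    for fid, g in zip(family_ids, genotypes):
        by_family[fid].append(g)
    family_mean = {fid: sum(gs) / len(gs) for fid, gs in by_family.items()}

    means = [family_mean[fid] for fid in family_ids]
    devs = [g - m for g, m in zip(genotypes, means)]

    n = len(phenotypes)
    y_bar = sum(phenotypes) / n
    m_bar = sum(means) / n

    # Deviations sum to zero within each family, so they are orthogonal to
    # the family means and to the intercept. The joint OLS fit therefore
    # splits into two simple regressions.
    wf = sum(d * y for d, y in zip(devs, phenotypes)) / sum(d * d for d in devs)
    bf = (sum((m - m_bar) * (y - y_bar) for m, y in zip(means, phenotypes))
          / sum((m - m_bar) ** 2 for m in means))
    return wf, bf
```

Families with any number of siblings are handled the same way, since the family mean is taken over however many siblings are present. The sketch fails with a division by zero if no family has siblings with differing genotypes, because the WF effect is then not identifiable.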

snipar

Tool
PUBMED_LINK
35681053
FULL NAME
single nucleotide imputation of parents
DESCRIPTION
snipar (single nucleotide imputation of parents) is a Python package for inferring identity-by-descent (IBD) segments shared between siblings, imputing missing parental genotypes, and for performing family based genome-wide association and polygenic score analyses using observed and/or imputed parental genotypes.
URL
https://github.com/AlexTISYoung/snipar
TITLE
Mendelian imputation of parental genotypes improves estimates of direct genetic effects.
Main citation
Young AI, Nehzati SM, Benonisdottir S, Okbay A, ...&, Kong A. (2022) Mendelian imputation of parental genotypes improves estimates of direct genetic effects. Nat Genet, 54 (6) 897-905. doi:10.1038/s41588-022-01085-0. PMID 35681053
ABSTRACT
Effects estimated by genome-wide association studies (GWASs) include effects of alleles in an individual on that individual (direct genetic effects), indirect genetic effects (for example, effects of alleles in parents on offspring through the environment) and bias from confounding. Within-family genetic variation is random, enabling unbiased estimation of direct genetic effects when parents are genotyped. However, parental genotypes are often missing. We introduce a method that imputes missing parental genotypes and estimates direct genetic effects. Our method, implemented in the software package snipar (single-nucleotide imputation of parents), gives more precise estimates of direct genetic effects than existing approaches. Using 39,614 individuals from the UK Biobank with at least one genotyped sibling/parent, we estimate the correlation between direct genetic effects and effects from standard GWASs for nine phenotypes, including educational attainment (r = 0.739, standard error (s.e.) = 0.086) and cognitive ability (r = 0.490, s.e. = 0.086). Our results demonstrate substantial confounding bias in standard GWASs for some phenotypes.
DOI
10.1038/s41588-022-01085-0

snipar-unified estimator (snipar)

Tool
PUBMED_LINK
40065166
FULL NAME
single nucleotide imputation of parents
URL
https://github.com/AlexTISYoung/snipar
TITLE
Family-based genome-wide association study designs for increased power and robustness.
Main citation
Guan J, Tan T, Nehzati SM, Bennett M, ...&, Young AS. (2025) Family-based genome-wide association study designs for increased power and robustness. Nat Genet, 57 (4) 1044-1052. doi:10.1038/s41588-025-02118-0. PMID 40065166
ABSTRACT
Family-based genome-wide association studies (FGWASs) use random, within-family genetic variation to remove confounding from estimates of direct genetic effects (DGEs). Here we introduce a 'unified estimator' that includes individuals without genotyped relatives, unifying standard and FGWAS while increasing power for DGE estimation. We also introduce a 'robust estimator' that is not biased in structured and/or admixed populations. In an analysis of 19 phenotypes in the UK Biobank, the unified estimator in the White British subsample and the robust estimator (applied without ancestry restrictions) increased the effective sample size for DGEs by 46.9% to 106.5% and 10.3% to 21.0%, respectively, compared to using genetic differences between siblings. Polygenic predictors derived from the unified estimator demonstrated superior out-of-sample prediction ability compared to other family-based methods. We implemented the methods in the software package snipar in an efficient linear mixed model that accounts for sample relatedness and sibling shared environment.
DOI
10.1038/s41588-025-02118-0

Gene-based analysis (rare variant)

MultiSTAAR

Tool
PUBMED_LINK
39920506
FULL NAME
Multi-trait variant-Set Test for Association using Annotation infoRmation
DESCRIPTION
MultiSTAAR is an R package for performing Multi-trait variant-Set Test for Association using Annotation infoRmation (MultiSTAAR) procedure in whole-genome sequencing (WGS) studies. MultiSTAAR is a general framework that (1) leverages the correlation structure between multiple phenotypes to improve power of multi-trait analysis over single-trait analysis, and (2) incorporates both qualitative functional categories and quantitative complementary functional annotations using an omnibus multi-dimensional weighting scheme. MultiSTAAR accounts for population structure and relatedness, and is scalable for jointly analyzing large WGS studies of multiple correlated traits.
URL
https://github.com/xihaoli/MultiSTAAR
TITLE
A statistical framework for multi-trait rare variant analysis in large-scale whole-genome sequencing studies.
Main citation
Li X, Chen H, Selvaraj MS, Van Buren E, ...&, Lin X. (2025) A statistical framework for multi-trait rare variant analysis in large-scale whole-genome sequencing studies. Nat Comput Sci, 5 (2) 125-143. doi:10.1038/s43588-024-00764-8. PMID 39920506
ABSTRACT
Large-scale whole-genome sequencing (WGS) studies have improved our understanding of the contributions of coding and noncoding rare variants to complex human traits. Leveraging association effect sizes across multiple traits in WGS rare variant association analysis can improve statistical power over single-trait analysis, and also detect pleiotropic genes and regions. Existing multi-trait methods have limited ability to perform rare variant analysis of large-scale WGS data. We propose MultiSTAAR, a statistical framework and computationally scalable analytical pipeline for functionally informed multi-trait rare variant analysis in large-scale WGS studies. MultiSTAAR accounts for relatedness, population structure and correlation among phenotypes by jointly analyzing multiple traits, and further empowers rare variant association analysis by incorporating multiple functional annotations. We applied MultiSTAAR to jointly analyze three lipid traits in 61,838 multi-ethnic samples from the Trans-Omics for Precision Medicine (TOPMed) Program. We discovered and replicated new associations with lipid traits missed by single-trait analysis.
DOI
10.1038/s43588-024-00764-8

REGENIE

GWAS
PUBMED_LINK
34017140
DESCRIPTION
regenie is a C++ program for whole genome regression modelling of large genome-wide association studies. It is developed and supported by a team of scientists at the Regeneron Genetics Center.
URL
https://github.com/rgcgithub/regenie
KEYWORDS
whole genome regression
TITLE
Computationally efficient whole-genome regression for quantitative and binary traits.
Main citation
Mbatchou J, Barnard L, Backman J, Marcketta A, ...&, Marchini J. (2021) Computationally efficient whole-genome regression for quantitative and binary traits. Nat Genet, 53 (7) 1097-1103. doi:10.1038/s41588-021-00870-7. PMID 34017140
ABSTRACT
Genome-wide association analysis of cohorts with thousands of phenotypes is computationally expensive, particularly when accounting for sample relatedness or population structure. Here we present a novel machine-learning method called REGENIE for fitting a whole-genome regression model for quantitative and binary phenotypes that is substantially faster than alternatives in multi-trait analyses while maintaining statistical efficiency. The method naturally accommodates parallel analysis of multiple phenotypes and requires only local segments of the genotype matrix to be loaded in memory, in contrast to existing alternatives, which must load genome-wide matrices into memory. This results in substantial savings in compute time and memory usage. We introduce a fast, approximate Firth logistic regression test for unbalanced case-control phenotypes. The method is ideally suited to take advantage of distributed computing frameworks. We demonstrate the accuracy and computational benefits of this approach using the UK Biobank dataset with up to 407,746 individuals.
DOI
10.1038/s41588-021-00870-7

SAIGE-GENE

Tool
PUBMED_LINK
32424355
URL
https://github.com/weizhouUMICH/SAIGE
TITLE
Scalable generalized linear mixed model for region-based association tests in large biobanks and cohorts.
Main citation
Zhou W, Zhao Z, Nielsen JB, Fritsche LG, ...&, Lee S. (2020) Scalable generalized linear mixed model for region-based association tests in large biobanks and cohorts. Nat Genet, 52 (6) 634-639. doi:10.1038/s41588-020-0621-6. PMID 32424355
ABSTRACT
With very large sample sizes, biobanks provide an exciting opportunity to identify genetic components of complex traits. To analyze rare variants, region-based multiple-variant aggregate tests are commonly used to increase power for association tests. However, because of the substantial computational cost, existing region-based tests cannot analyze hundreds of thousands of samples while accounting for confounders such as population stratification and sample relatedness. Here we propose a scalable generalized mixed-model region-based association test, SAIGE-GENE, that is applicable to exome-wide and genome-wide region-based analysis for hundreds of thousands of samples and can account for unbalanced case-control ratios for binary traits. Through extensive simulation studies and analysis of the HUNT study with 69,716 Norwegian samples and the UK Biobank data with 408,910 White British samples, we show that SAIGE-GENE can efficiently analyze large-sample data (N > 400,000) with type I error rates well controlled.
DOI
10.1038/s41588-020-0621-6

SAIGE-GENE+

Tool
PUBMED_LINK
36138231
URL
https://github.com/weizhouUMICH/SAIGE
TITLE
SAIGE-GENE+ improves the efficiency and accuracy of set-based rare variant association tests.
Main citation
Zhou W, Bi W, Zhao Z, Dey KK, ...&, Lee S. (2022) SAIGE-GENE+ improves the efficiency and accuracy of set-based rare variant association tests. Nat Genet, 54 (10) 1466-1469. doi:10.1038/s41588-022-01178-w. PMID 36138231
ABSTRACT
Several biobanks, including UK Biobank (UKBB), are generating large-scale sequencing data. An existing method, SAIGE-GENE, performs well when testing variants with minor allele frequency (MAF) ≤ 1%, but inflation is observed in variance component set-based tests when restricting to variants with MAF ≤ 0.1% or 0.01%. Here, we propose SAIGE-GENE+ with greatly improved type I error control and computational efficiency to facilitate rare variant tests in large-scale data. We further show that incorporating multiple MAF cutoffs and functional annotations can improve power and thus uncover new gene-phenotype associations. In the analysis of UKBB whole exome sequencing data for 30 quantitative and 141 binary traits, SAIGE-GENE+ identified 551 gene-phenotype associations.
DOI
10.1038/s41588-022-01178-w

SKAT

Tool
PUBMED_LINK
21737059
FULL NAME
sequence kernel association test
DESCRIPTION
SKAT is a SNP-set (e.g., a gene or a region) level test for association between a set of rare (or common) variants and dichotomous or quantitative phenotypes. SKAT aggregates individual score test statistics of SNPs in a SNP set and efficiently computes SNP-set level p-values (e.g., a gene- or region-level p-value) while adjusting for covariates, such as principal components to account for population stratification. SKAT also allows power and sample size calculations for designing sequence association studies.
URL
https://www.hsph.harvard.edu/skat/
TITLE
Rare-variant association testing for sequencing data with the sequence kernel association test.
Main citation
Wu MC, Lee S, Cai T, Li Y, ...&, Lin X. (2011) Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet, 89 (1) 82-93. doi:10.1016/j.ajhg.2011.05.029. PMID 21737059
ABSTRACT
Sequencing studies are increasingly being conducted to identify rare variants associated with complex traits. The limited power of classical single-marker association analysis for rare variants poses a central challenge in such studies. We propose the sequence kernel association test (SKAT), a supervised, flexible, computationally efficient regression method to test for association between genetic variants (common and rare) in a region and a continuous or dichotomous trait while easily adjusting for covariates. As a score-based variance-component test, SKAT can quickly calculate p values analytically by fitting the null model containing only the covariates, and so can easily be applied to genome-wide data. Using SKAT to analyze a genome-wide sequencing study of 1000 individuals, by segmenting the whole genome into 30 kb regions, requires only 7 hr on a laptop. Through analysis of simulated data across a wide range of practical scenarios and triglyceride data from the Dallas Heart Study, we show that SKAT can substantially outperform several alternative rare-variant association tests. We also provide analytic power and sample-size calculations to help design candidate-gene, whole-exome, and whole-genome sequence association studies.
DOI
10.1016/j.ajhg.2011.05.029
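
The description above says SKAT aggregates per-SNP score statistics into a set-level statistic. A minimal Python sketch of that aggregation follows, for a quantitative trait under an intercept-only null model. It uses Beta(MAF; 1, 25) density weights, which is the commonly used default weighting. The function names are hypothetical, the counted allele is assumed to be the minor allele, and the p-value step (based on a mixture of chi-square distributions) is omitted.

```python
def beta_1_25_weight(maf):
    """Beta(1, 25) density at the minor allele frequency; upweights rare variants."""
    return 25.0 * (1.0 - maf) ** 24


def skat_statistic(genotypes, phenotypes):
    """Q = sum_j (w_j * S_j)^2, with S_j the score statistic of variant j.

    genotypes: list of per-individual lists of allele counts (0, 1 or 2).
    phenotypes: list of quantitative trait values, one per individual.
    """
    n = len(phenotypes)
    y_bar = sum(phenotypes) / n
    residuals = [y - y_bar for y in phenotypes]  # null model: intercept only
    n_variants = len(genotypes[0])
    q = 0.0
    for j in range(n_variants):
        column = [row[j] for row in genotypes]
        maf = sum(column) / (2.0 * n)  # assumes the counted allele is minor
        score = sum(g * r for g, r in zip(column, residuals))
        q += (beta_1_25_weight(maf) * score) ** 2
    return q
```

Because each variant's weighted score is squared before summing, variants with opposite effect directions both add to the statistic rather than cancelling.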

SKAT-O

Tool
PUBMED_LINK
22699862
FULL NAME
sequence kernel association test - optimal test
DESCRIPTION
SKAT-O estimates the correlation parameter in the kernel matrix to maximize power; this parameter corresponds to the weight in the linear combination of the burden test and SKAT test statistics.
URL
https://www.hsph.harvard.edu/skat/
TITLE
Optimal tests for rare variant effects in sequencing association studies.
Main citation
Lee S, Wu MC, Lin X. (2012) Optimal tests for rare variant effects in sequencing association studies. Biostatistics, 13 (4) 762-75. doi:10.1093/biostatistics/kxs014. PMID 22699862
ABSTRACT
With development of massively parallel sequencing technologies, there is a substantial need for developing powerful rare variant association tests. Common approaches include burden and non-burden tests. Burden tests assume all rare variants in the target region have effects on the phenotype in the same direction and of similar magnitude. The recently proposed sequence kernel association test (SKAT) (Wu, M. C., and others, 2011. Rare-variant association testing for sequencing data with the SKAT. The American Journal of Human Genetics 89, 82-93), an extension of the C-alpha test (Neale, B. M., and others, 2011. Testing for an unusual distribution of rare variants. PLoS Genetics 7, 161-165), provides a robust test that is particularly powerful in the presence of protective and deleterious variants and null variants, but is less powerful than burden tests when a large number of variants in a region are causal and in the same direction. As the underlying biological mechanisms are unknown in practice and vary from one gene to another across the genome, it is of substantial practical interest to develop a test that is optimal for both scenarios. In this paper, we propose a class of tests that include burden tests and SKAT as special cases, and derive an optimal test within this class that maximizes power. We show that this optimal test outperforms burden tests and SKAT in a wide range of scenarios. The results are illustrated using simulation studies and triglyceride data from the Dallas Heart Study. In addition, we have derived sample size/power calculation formula for SKAT with a new family of kernels to facilitate designing new sequence association studies.
DOI
10.1093/biostatistics/kxs014
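
The description refers to a linear combination of the burden and SKAT test statistics indexed by a correlation parameter. That family can be written as Q_rho = (1 - rho) * Q_SKAT + rho * Q_burden. The Python sketch below evaluates it from per-variant weights and score statistics supplied by the caller. The function name is hypothetical, and the search for the optimal rho and the associated p-value calculation are not shown.

```python
def skat_o_family(weights, scores, rho):
    """Return (Q_rho, Q_SKAT, Q_burden) for a given rho in [0, 1]."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    weighted = [w * s for w, s in zip(weights, scores)]
    q_skat = sum(ws ** 2 for ws in weighted)   # sum of squared weighted scores
    q_burden = sum(weighted) ** 2              # squared sum of weighted scores
    return (1.0 - rho) * q_skat + rho * q_burden, q_skat, q_burden
```

With two variants of equal weight and opposite score signs, the burden component is zero while the SKAT component is not, which illustrates why the abstract describes SKAT as powerful when protective and deleterious variants are both present. Setting rho to 0 recovers SKAT and setting it to 1 recovers the burden test.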

STAAR

Tool
PUBMED_LINK
32839606
FULL NAME
variant-set test for association using annotation information
DESCRIPTION
STAAR is an R package for performing variant-Set Test for Association using Annotation infoRmation (STAAR) procedure in whole-genome sequencing (WGS) studies. STAAR is a general framework that incorporates both qualitative functional categories and quantitative complementary functional annotations using an omnibus multi-dimensional weighting scheme. STAAR accounts for population structure and relatedness, and is scalable for analyzing large WGS studies of continuous and dichotomous traits.
URL
https://github.com/xihaoli/STAAR
KEYWORDS
functional annotations
TITLE
Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale.
Main citation
Li X, Li Z, Zhou H, Gaynor SM, ...&, Lin X. (2020) Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale. Nat Genet, 52 (9) 969-983. doi:10.1038/s41588-020-0676-4. PMID 32839606
ABSTRACT
Large-scale whole-genome sequencing studies have enabled the analysis of rare variants (RVs) associated with complex phenotypes. Commonly used RV association tests have limited scope to leverage variant functions. We propose STAAR (variant-set test for association using annotation information), a scalable and powerful RV association test method that effectively incorporates both variant categories and multiple complementary annotations using a dynamic weighting scheme. For the latter, we introduce 'annotation principal components', multidimensional summaries of in silico variant annotations. STAAR accounts for population structure and relatedness and is scalable for analyzing very large cohort and biobank whole-genome sequencing studies of continuous and dichotomous traits. We applied STAAR to identify RVs associated with four lipid traits in 12,316 discovery and 17,822 replication samples from the Trans-Omics for Precision Medicine Program. We discovered and replicated new RV associations, including disruptive missense RVs of NPC1L1 and an intergenic region near APOC1P1 associated with low-density lipoprotein cholesterol.
DOI
10.1038/s41588-020-0676-4

STAARpipeline

Tool
PUBMED_LINK
36303018
FULL NAME
variant-set test for association using annotation information
DESCRIPTION
STAARpipeline is an R package for phenotype-genotype association analyses of biobank-scale WGS/WES data, including single variant analysis and variant set analysis.
URL
https://github.com/xihaoli/STAARpipeline/
TITLE
A framework for detecting noncoding rare-variant associations of large-scale whole-genome sequencing studies.
Main citation
Li Z, Li X, Zhou H, Gaynor SM, ...&, Lin X. (2022) A framework for detecting noncoding rare-variant associations of large-scale whole-genome sequencing studies. Nat Methods, 19 (12) 1599-1611. doi:10.1038/s41592-022-01640-x. PMID 36303018
ABSTRACT
Large-scale whole-genome sequencing studies have enabled analysis of noncoding rare-variant (RV) associations with complex human diseases and traits. Variant-set analysis is a powerful approach to study RV association. However, existing methods have limited ability in analyzing the noncoding genome. We propose a computationally efficient and robust noncoding RV association detection framework, STAARpipeline, to automatically annotate a whole-genome sequencing study and perform flexible noncoding RV association analysis, including gene-centric analysis and fixed window-based and dynamic window-based non-gene-centric analysis by incorporating variant functional annotations. In gene-centric analysis, STAARpipeline uses STAAR to group noncoding variants based on functional categories of genes and incorporate multiple functional annotations. In non-gene-centric analysis, STAARpipeline uses SCANG-STAAR to incorporate dynamic window sizes and multiple functional annotations. We apply STAARpipeline to identify noncoding RV sets associated with four lipid traits in 21,015 discovery samples from the Trans-Omics for Precision Medicine (TOPMed) program and replicate several of them in an additional 9,123 TOPMed samples. We also analyze five non-lipid TOPMed traits.
DOI
10.1038/s41592-022-01640-x

Gene-based analysis (sumstats)

LDAK-GBAT

Tool
PUBMED_LINK
36480927
FULL NAME
LDAK gene-based association testing
URL
http://www.ldak.org/
TITLE
LDAK-GBAT: Fast and powerful gene-based association testing using summary statistics.
Main citation
Berrandou TE, Balding D, Speed D. (2023) LDAK-GBAT: Fast and powerful gene-based association testing using summary statistics. Am J Hum Genet, 110 (1) 23-29. doi:10.1016/j.ajhg.2022.11.010. PMID 36480927
ABSTRACT
We present LDAK-GBAT, a tool for gene-based association testing using summary statistics from genome-wide association studies that is computationally efficient, produces well-calibrated p values, and is significantly more powerful than existing tools. LDAK-GBAT takes approximately 30 min to analyze imputed data (2.9M common, genic SNPs), requiring less than 10 Gb memory. It shows good control of type 1 error given an appropriate reference panel. Across 109 phenotypes (82 from the UK Biobank, 18 from the Million Veteran Program, and nine from the Psychiatric Genetics Consortium), LDAK-GBAT finds on average 19% (SE: 1%) more significant genes than the existing tool sumFREGAT-ACAT, with even greater gains in comparison with MAGMA, GCTA-fastBAT, sumFREGAT-SKAT-O, and sumFREGAT-PCA.
DOI
10.1016/j.ajhg.2022.11.010

Genome-wide survival association analysis

GATE

Tool
PUBMED_LINK
36114182
FULL NAME
Genetic Analysis of Time-to-Event phenotypes
DESCRIPTION
GATE (Genetic Analysis of Time-to-Event phenotypes) is an R package for scalable and accurate genome-wide association analysis of censored survival data in large-scale biobanks using frailty models.

GATE performs single-variant association tests for time-to-event endpoints. GATE uses the saddlepoint approximation (SPA) (Imhof, J. P., 1961; Kuonen, D., 1999; Dey, R. et al., 2017) to account for heavy censoring rates.
URL
https://github.com/weizhou0/GATE
KEYWORDS
censored time-to-event (TTE) phenotypes
TITLE
Efficient and accurate frailty model approach for genome-wide survival association analysis in large-scale biobanks.
Main citation
Dey R, Zhou W, Kiiskinen T, Havulinna A, ...&, Lin X. (2022) Efficient and accurate frailty model approach for genome-wide survival association analysis in large-scale biobanks. Nat Commun, 13 (1) 5437. doi:10.1038/s41467-022-32885-x. PMID 36114182
ABSTRACT
With decades of electronic health records linked to genetic data, large biobanks provide unprecedented opportunities for systematically understanding the genetics of the natural history of complex diseases. Genome-wide survival association analysis can identify genetic variants associated with ages of onset, disease progression and lifespan. We propose an efficient and accurate frailty model approach for genome-wide survival association analysis of censored time-to-event (TTE) phenotypes by accounting for both population structure and relatedness. Our method utilizes state-of-the-art optimization strategies to reduce the computational cost. The saddlepoint approximation is used to allow for analysis of heavily censored phenotypes (>90%) and low frequency variants (down to minor allele count 20). We demonstrate the performance of our method through extensive simulation studies and analysis of five TTE phenotypes, including lifespan, with heavy censoring rates (90.9% to 99.8%) on ~400,000 UK Biobank participants with white British ancestry and ~180,000 individuals in FinnGen. We further analyzed 871 TTE phenotypes in the UK Biobank and presented the genome-wide scale phenome-wide association results with the PheWeb browser.
DOI
10.1038/s41467-022-32885-x

MISC

COWAS

TWAS Functional genomics Gene prioritization Tool Summary statistics
PUBMED_LINK
41381446
FULL NAME
Co-expression-wide association study
DESCRIPTION
Co-expression-wide association study (COWAS) extends TWAS/PWAS by testing pairs of genes or proteins whose genetically regulated co-expression or interaction is associated with a trait; includes implemented R software and trained imputation weights for summary-statistic follow-up.
URL
https://github.com/mykmal/cowas, https://doi.org/10.1038/s41467-025-66039-6
KEYWORDS
TWAS, PWAS, co-expression, gene-gene interaction, GWAS summary statistics
TITLE
Co-expression-wide association studies link genetically regulated interactions with complex traits.
Main citation
Malakhov MM, Pan W. (2025) Co-expression-wide association studies link genetically regulated interactions with complex traits. Nat Commun, 16 (1) 11061. doi:10.1038/s41467-025-66039-6. PMID 41381446
ABSTRACT
Transcriptome- and proteome-wide association studies (TWAS/PWAS) have proven successful in prioritizing genes and proteins whose genetically regulated expression modulates disease risk, but they ignore potential co-expression and interaction effects. To address this limitation, we introduce the co-expression-wide association study (COWAS) method, which can identify pairs of genes or proteins whose genetically regulated co-expression is associated with complex traits. COWAS first trains models to predict expression and co-expression from genetic variation, and then tests for association between imputed co-expression and the trait of interest while also accounting for direct effects from each exposure. We applied our method to plasma proteomic concentrations from the UK Biobank, identifying dozens of interacting protein pairs associated with cholesterol levels, Alzheimer's disease, and Parkinson's disease. Notably, our results demonstrate that co-expression between proteins may affect complex traits even if neither protein is detected to influence the trait when considered on its own. We also show how COWAS can help to disentangle direct and interaction effects, providing a richer picture of the molecular networks that mediate genetic effects on disease outcomes.
DOI
10.1038/s41467-025-66039-6

Other

GWAS-by-Subtraction

Tool
PUBMED_LINK
33414549
URL
https://github.com/GenomicSEM/GenomicSEM
TITLE
Investigating the genetic architecture of noncognitive skills using GWAS-by-subtraction.
Main citation
Demange PA, Malanchini M, Mallard TT, Biroli P, ...&, Nivard MG. (2021) Investigating the genetic architecture of noncognitive skills using GWAS-by-subtraction. Nat Genet, 53 (1) 35-44. doi:10.1038/s41588-020-00754-2. PMID 33414549
ABSTRACT
Little is known about the genetic architecture of traits affecting educational attainment other than cognitive ability. We used genomic structural equation modeling and prior genome-wide association studies (GWASs) of educational attainment (n = 1,131,881) and cognitive test performance (n = 257,841) to estimate SNP associations with educational attainment variation that is independent of cognitive ability. We identified 157 genome-wide-significant loci and a polygenic architecture accounting for 57% of genetic variance in educational attainment. Noncognitive genetics were enriched in the same brain tissues and cell types as cognitive performance, but showed different associations with gray-matter brain volumes. Noncognitive genetics were further distinguished by associations with personality traits, less risky behavior and increased risk for certain psychiatric disorders. For socioeconomic success and longevity, noncognitive and cognitive-performance genetics demonstrated associations of similar magnitude. By conducting a GWAS of a phenotype that was not directly measured, we offer a view of genetic architecture of noncognitive skills influencing educational success.
DOI
10.1038/s41588-020-00754-2

PGS-adjusted GWAS

PGS-adjusted GWAS

Tool
PUBMED_LINK
37723263
DESCRIPTION
Adjusting GWAS analyses for polygenic scores (PGSs) increases the statistical power for discovery across all ancestries.
KEYWORDS
LOCO-PGSs, two-stage meta-analysis strategy
TITLE
Boosting the power of genome-wide association studies within and across ancestries by using polygenic scores.
Main citation
Campos AI, Namba S, Lin SC, Nam K, ...&, Yengo L. (2023) Boosting the power of genome-wide association studies within and across ancestries by using polygenic scores. Nat Genet, 55 (10) 1769-1776. doi:10.1038/s41588-023-01500-0. PMID 37723263
ABSTRACT
Genome-wide association studies (GWASs) have been mostly conducted in populations of European ancestry, which currently limits the transferability of their findings to other populations. Here, we show, through theory, simulations and applications to real data, that adjustment of GWAS analyses for polygenic scores (PGSs) increases the statistical power for discovery across all ancestries. We applied this method to analyze seven traits available in three large biobanks with participants of East Asian ancestry (n = 340,000 in total) and report 139 additional associations across traits. We also present a two-stage meta-analysis strategy whereby, in contributing cohorts, a PGS-adjusted GWAS is rerun using PGSs derived from a first round of a standard meta-analysis. On average, across traits, this approach yields a 1.26-fold increase in the number of detected associations (range 1.07- to 1.76-fold increase). Altogether, our study demonstrates the value of using PGSs to increase the power of GWASs in underrepresented populations and promotes such an analytical strategy for future GWAS meta-analyses.
DOI
10.1038/s41588-023-01500-0
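
The idea behind PGS-adjusted GWAS can be illustrated with a minimal sketch on simulated data. Everything below is an assumption made for illustration, not the authors' software: the sample size, effect sizes, and helper names (center, residualize, slope_and_se) are hypothetical, and the PGS is simulated independently of the tested SNP, which stands in for a leave-one-chromosome-out (LOCO) PGS. The sketch compares the standard error of the SNP effect with and without the PGS as a covariate.

```python
import math
import random

def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def residualize(y, x):
    """Residuals of y after an intercept-plus-x least-squares fit."""
    yc, xc = center(y), center(x)
    slope = sum(a * b for a, b in zip(yc, xc)) / sum(b * b for b in xc)
    return [a - slope * b for a, b in zip(yc, xc)]

def slope_and_se(y, x, extra_covariates=0):
    """OLS slope of y on x (intercept included) and its standard error."""
    yc, xc = center(y), center(x)
    sxx = sum(b * b for b in xc)
    slope = sum(a * b for a, b in zip(yc, xc)) / sxx
    rss = sum((a - slope * b) ** 2 for a, b in zip(yc, xc))
    dof = len(y) - 2 - extra_covariates
    return slope, math.sqrt(rss / dof / sxx)

rng = random.Random(1)
n = 2000
# Hypothetical simulated data: SNP dosage, a PGS that excludes the tested SNP,
# and a trait influenced by both.
g = [sum(rng.random() < 0.3 for _ in range(2)) for _ in range(n)]
pgs = [rng.gauss(0, 1) for _ in range(n)]
y = [0.1 * gi + 0.7 * pi + rng.gauss(0, 0.7) for gi, pi in zip(g, pgs)]

beta_raw, se_raw = slope_and_se(y, g)
# Frisch-Waugh: regressing PGS-residualized y on PGS-residualized g gives the
# same SNP estimate as a joint regression of y on (g, pgs).
beta_adj, se_adj = slope_and_se(
    residualize(y, pgs), residualize(g, pgs), extra_covariates=1
)
print(f"unadjusted:   beta={beta_raw:.3f} se={se_raw:.3f}")
print(f"PGS-adjusted: beta={beta_adj:.3f} se={se_adj:.3f}")
```

In this toy setting the PGS absorbs part of the residual trait variance, so the adjusted standard error is smaller. Real analyses would also need to handle relatedness, population structure, and other covariates.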

PGS-adjusted RVATs

Tool
PUBMED_LINK
36959364
FULL NAME
PGS-adjusted rare variant association tests
DESCRIPTION
Adjusting for common variant polygenic scores improves yield in gene-based rare variant association tests.
KEYWORDS
PGS, Rare variants
TITLE
Adjusting for common variant polygenic scores improves yield in rare variant association analyses.
Main citation
Jurgens SJ, Pirruccello JP, Choi SH, Morrill VN, ...&, Ellinor PT. (2023) Adjusting for common variant polygenic scores improves yield in rare variant association analyses. Nat Genet, 55 (4) 544-548. doi:10.1038/s41588-023-01342-w. PMID 36959364
ABSTRACT
With the emergence of large-scale sequencing data, methods for improving power in rare variant association tests are needed. Here we show that adjusting for common variant polygenic scores improves yield in gene-based rare variant association tests across 65 quantitative traits in the UK Biobank (up to 20% increase at α = 2.6 × 10⁻⁶), without marked increases in false-positive rates or genomic inflation. Benefits were seen for various models, with the largest improvements seen for efficient sparse mixed-effects models. Our results illustrate how polygenic score adjustment can efficiently improve power in rare variant association discovery.
DOI
10.1038/s41588-023-01342-w

Phenotype imputation

POP-GWAS

Tool
PUBMED_LINK
39349818
FULL NAME
Post-Prediction GWAS
DESCRIPTION
POP-TOOLS (POst-Prediction TOOLS) is a Python3-based command line toolkit for conducting valid and powerful machine learning (ML)-assisted genetic association studies.
URL
https://github.com/qlu-lab/POP-TOOLS
KEYWORDS
imputed phenotypes, 3 GWASs
TITLE
Valid inference for machine learning-assisted genome-wide association studies.
Main citation
Miao J, Wu Y, Sun Z, Miao X, ...&, Lu Q. (2024) Valid inference for machine learning-assisted genome-wide association studies. Nat Genet, 56 (11) 2361-2369. doi:10.1038/s41588-024-01934-0. PMID 39349818
ABSTRACT
Machine learning (ML) has become increasingly popular in almost all scientific disciplines, including human genetics. Owing to challenges related to sample collection and precise phenotyping, ML-assisted genome-wide association study (GWAS), which uses sophisticated ML techniques to impute phenotypes and then performs GWAS on the imputed outcomes, have become increasingly common in complex trait genetics research. However, the validity of ML-assisted GWAS associations has not been carefully evaluated. Here, we report pervasive risks for false-positive associations in ML-assisted GWAS and introduce Post-Prediction GWAS (POP-GWAS), a statistical framework that redesigns GWAS on ML-imputed outcomes. POP-GWAS ensures valid and powerful statistical inference irrespective of imputation quality and choice of algorithm, requiring only GWAS summary statistics as input. We employed POP-GWAS to perform a GWAS of bone mineral density derived from dual-energy X-ray absorptiometry imaging at 14 skeletal sites, identifying 89 new loci and revealing skeletal site-specific genetic architecture. Our framework offers a robust analytic solution for future ML-assisted GWAS.
DOI
10.1038/s41588-024-01934-0

Review

Review-Povysil

Tool
PUBMED_LINK
31605095
TITLE
Rare-variant collapsing analyses for complex traits: guidelines and applications.
Main citation
Povysil G, Petrovski S, Hostyk J, Aggarwal V, ...&, Goldstein DB. (2019) Rare-variant collapsing analyses for complex traits: guidelines and applications. Nat Rev Genet, 20 (12) 747-759. doi:10.1038/s41576-019-0177-4. PMID 31605095
ABSTRACT
The first phase of genome-wide association studies (GWAS) assessed the role of common variation in human disease. Advances optimizing and economizing high-throughput sequencing have enabled a second phase of association studies that assess the contribution of rare variation to complex disease in all protein-coding genes. Unlike the early microarray-based studies, sequencing-based studies catalogue the full range of genetic variation, including the evolutionarily youngest forms. Although the experience with common variants helped establish relevant standards for genome-wide studies, the analysis of rare variation introduces several challenges that require novel analysis approaches.
DOI
10.1038/s41576-019-0177-4

Single variant association tests

BOLT-LMM

Tool
PUBMED_LINK
25642633
DESCRIPTION
The BOLT-LMM software package currently consists of two main algorithms, the BOLT-LMM algorithm for mixed model association testing, and the BOLT-REML algorithm for variance components analysis (i.e., partitioning of SNP-heritability and estimation of genetic correlations).
URL
https://alkesgroup.broadinstitute.org/BOLT-LMM/BOLT-LMM_manual.html
KEYWORDS
non-infinitesimal model, mixture of two Gaussian distributions
TITLE
Efficient Bayesian mixed-model analysis increases association power in large cohorts.
Main citation
Loh PR, Tucker G, Bulik-Sullivan BK, Vilhjálmsson BJ, ...&, Price AL. (2015) Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat Genet, 47 (3) 284-90. doi:10.1038/ng.3190. PMID 25642633
ABSTRACT
Linear mixed models are a powerful statistical tool for identifying genetic associations and avoiding confounding. However, existing methods are computationally intractable in large cohorts and may not optimize power. All existing methods require time cost O(MN^2) (where N is the number of samples and M is the number of SNPs) and implicitly assume an infinitesimal genetic architecture in which effect sizes are normally distributed, which can limit power. Here we present a far more efficient mixed-model association method, BOLT-LMM, which requires only a small number of O(MN) time iterations and increases power by modeling more realistic, non-infinitesimal genetic architectures via a Bayesian mixture prior on marker effect sizes. We applied BOLT-LMM to 9 quantitative traits in 23,294 samples from the Women's Genome Health Study (WGHS) and observed significant increases in power, consistent with simulations. Theory and simulations show that the boost in power increases with cohort size, making BOLT-LMM appealing for genome-wide association studies in large cohorts.
DOI
10.1038/ng.3190

EMMAX

Tool
PUBMED_LINK
20208533
FULL NAME
efficient mixed-model association eXpedited
DESCRIPTION
EMMAX is a statistical test for large-scale human or model organism association mapping that accounts for sample structure. In addition to the computational efficiency of the EMMA algorithm, EMMAX takes advantage of the fact that each locus explains only a small fraction of a complex trait's variation, which makes it possible to avoid repeated variance component estimation and substantially reduces the computational time of mixed-model association mapping.
URL
https://genome.sph.umich.edu/wiki/EMMAX
TITLE
Variance component model to account for sample structure in genome-wide association studies.
Main citation
Kang HM, Sul JH, Service SK, Zaitlen NA, ...&, Eskin E. (2010) Variance component model to account for sample structure in genome-wide association studies. Nat Genet, 42 (4) 348-54. doi:10.1038/ng.548. PMID 20208533
ABSTRACT
Although genome-wide association studies (GWASs) have identified numerous loci associated with complex traits, imprecise modeling of the genetic relatedness within study samples may cause substantial inflation of test statistics and possibly spurious associations. Variance component approaches, such as efficient mixed-model association (EMMA), can correct for a wide range of sample structures by explicitly accounting for pairwise relatedness between individuals, using high-density markers to model the phenotype distribution; but such approaches are computationally impractical. We report here a variance component approach implemented in publicly available software, EMMA eXpedited (EMMAX), that reduces the computational time for analyzing large GWAS data sets from years to hours. We apply this method to two human GWAS data sets, performing association analysis for ten quantitative traits from the Northern Finland Birth Cohort and seven common diseases from the Wellcome Trust Case Control Consortium. We find that EMMAX outperforms both principal component analysis and genomic control in correcting for sample structure.
DOI
10.1038/ng.548

GEMMA

Tool
PUBMED_LINK
22706312
FULL NAME
genome-wide efficient mixed-model association
DESCRIPTION
GEMMA is the software implementing the Genome-wide Efficient Mixed Model Association algorithm for a standard linear mixed model and some of its close relatives for genome-wide association studies (GWAS). It fits a standard linear mixed model (LMM) to account for population stratification and sample structure for single marker association tests. It fits a Bayesian sparse linear mixed model (BSLMM) using Markov chain Monte Carlo (MCMC) for estimating the proportion of variance in phenotypes explained (PVE) by typed genotypes (i.e. chip heritability), predicting phenotypes, and identifying associated markers by jointly modeling all markers while controlling for population structure. It is computationally efficient for large scale GWAS and uses freely available open-source numerical libraries.
URL
http://stephenslab.uchicago.edu/software.html#gemma
TITLE
Genome-wide efficient mixed-model analysis for association studies.
Main citation
Zhou X, Stephens M. (2012) Genome-wide efficient mixed-model analysis for association studies. Nat Genet, 44 (7) 821-4. doi:10.1038/ng.2310. PMID 22706312
ABSTRACT
Linear mixed models have attracted considerable attention recently as a powerful and effective tool for accounting for population stratification and relatedness in genetic association tests. However, existing methods for exact computation of standard test statistics are computationally impractical for even moderate-sized genome-wide association studies. To address this issue, several approximate methods have been proposed. Here, we present an efficient exact method, which we refer to as genome-wide efficient mixed-model association (GEMMA), that makes approximations unnecessary in many contexts. This method is approximately n times faster than the widely used exact method known as efficient mixed-model association (EMMA), where n is the sample size, making exact genome-wide association analysis computationally practical for large numbers of individuals.
DOI
10.1038/ng.2310

LDAK-KVIK

Tool
URL
http://www.ldak.org/
Main citation
Hof, J. P. & Speed, D. LDAK-KVIK performs fast and powerful mixed-model association analysis of quantitative and binary phenotypes. bioRxiv 2024.07.25.24311005 (2024) doi:10.1101/2024.07.25.24311005.

PLINK

Tool
PUBMED_LINK
17701901
DESCRIPTION
A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner. The focus of PLINK is purely on analysis of genotype/phenotype data, so there is no support for steps prior to this (e.g. study design and planning, generating genotype or CNV calls from raw data). Through integration with gPLINK and Haploview, there is some support for the subsequent visualization, annotation and storage of results.
URL
https://www.cog-genomics.org/plink/
TITLE
PLINK: a tool set for whole-genome association and population-based linkage analyses.
Main citation
Purcell S, Neale B, Todd-Brown K, Thomas L, ...&, Sham PC. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet, 81 (3) 559-75. doi:10.1086/519795. PMID 17701901
ABSTRACT
Whole-genome association studies (WGAS) bring new computational, as well as analytic, challenges to researchers. Many existing genetic-analysis tools are not designed to handle such large data sets in a convenient manner and do not necessarily exploit the new opportunities that whole-genome data bring. To address these issues, we developed PLINK, an open-source C/C++ WGAS tool set. With PLINK, large data sets comprising hundreds of thousands of markers genotyped for thousands of individuals can be rapidly manipulated and analyzed in their entirety. As well as providing tools to make the basic analytic steps computationally efficient, PLINK also supports some novel approaches to whole-genome data that take advantage of whole-genome coverage. We introduce PLINK and describe the five main domains of function: data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation. In particular, we focus on the estimation and use of identity-by-state and identity-by-descent information in the context of population-based whole-genome studies. This information can be used to detect and correct for population stratification and to identify extended chromosomal segments that are shared identical by descent between very distantly related individuals. Analysis of the patterns of segmental sharing has the potential to map disease loci that contain multiple rare variants in a population-based linkage analysis.
DOI
10.1086/519795

PLINK2

Tool
PUBMED_LINK
25722852
URL
https://www.cog-genomics.org/plink/2.0/
TITLE
Second-generation PLINK: rising to the challenge of larger and richer datasets.
Main citation
Chang CC, Chow CC, Tellier LC, Vattikuti S, ...&, Lee JJ. (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience, 4 () 7. doi:10.1186/s13742-015-0047-8. PMID 25722852
ABSTRACT
BACKGROUND: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format. FINDINGS: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(√n)-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0). CONCLUSIONS: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
DOI
10.1186/s13742-015-0047-8

POLMM

Tool
PUBMED_LINK
33836139
FULL NAME
proportional odds logistic mixed model (POLMM)
DESCRIPTION
Proportional Odds Logistic Mixed Model (POLMM) for ordinal categorical data analysis
URL
https://github.com/WenjianBI/POLMM
KEYWORDS
ordinal categorical phenotypes
TITLE
Efficient mixed model approach for large-scale genome-wide association studies of ordinal categorical phenotypes.
Main citation
Bi W, Zhou W, Dey R, Mukherjee B, ...&, Lee S. (2021) Efficient mixed model approach for large-scale genome-wide association studies of ordinal categorical phenotypes. Am J Hum Genet, 108 (5) 825-839. doi:10.1016/j.ajhg.2021.03.019. PMID 33836139
ABSTRACT
In genome-wide association studies, ordinal categorical phenotypes are widely used to measure human behaviors, satisfaction, and preferences. However, because of the lack of analysis tools, methods designed for binary or quantitative traits are commonly used inappropriately to analyze categorical phenotypes. To accurately model the dependence of an ordinal categorical phenotype on covariates, we propose an efficient mixed model association test, proportional odds logistic mixed model (POLMM). POLMM is computationally efficient to analyze large datasets with hundreds of thousands of samples, can control type I error rates at a stringent significance level regardless of the phenotypic distribution, and is more powerful than alternative methods. In contrast, the standard linear mixed model approaches cannot control type I error rates for rare variants when the phenotypic distribution is unbalanced, although they performed well when testing common variants. We applied POLMM to 258 ordinal categorical phenotypes on array genotypes and imputed samples from 408,961 individuals in UK Biobank. In total, we identified 5,885 genome-wide significant variants, of which, 424 variants (7.2%) are rare variants with MAF < 0.01.
DOI
10.1016/j.ajhg.2021.03.019
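
As a rough illustration of the proportional odds structure that POLMM builds on, the sketch below computes category probabilities from a set of ordered cutpoints and a linear predictor. It is a simplified fixed-effects picture only: the random effect for relatedness, the fitting procedure, and the test statistic of POLMM are omitted, and the cutpoints, effect size, function names, and the sign convention P(Y <= j) = logistic(cutpoint_j - eta) are assumptions chosen for illustration.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def category_probabilities(cutpoints, eta):
    """P(Y = j), j = 0..J-1, assuming P(Y <= j) = logistic(cutpoint_j - eta).

    `cutpoints` must be increasing; the last category closes the total to 1.
    """
    cumulative = [logistic(c - eta) for c in cutpoints] + [1.0]
    probs = [cumulative[0]]
    probs += [cumulative[j] - cumulative[j - 1] for j in range(1, len(cumulative))]
    return probs

cutpoints = [-1.0, 0.5, 2.0]  # hypothetical thresholds for 4 ordered levels
beta = 0.3                    # hypothetical per-allele effect on the latent scale
for dosage in (0, 1, 2):
    probs = category_probabilities(cutpoints, beta * dosage)
    print(dosage, [round(p, 3) for p in probs])
```

Under this convention a positive effect shifts probability mass toward higher categories while the same coefficient applies at every cutpoint, which is what "proportional odds" refers to.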

QRGWAS

Tool
PUBMED_LINK
39085219
FULL NAME
Quantile regression GWAS
URL
https://github.com/Iuliana-Ionita-Laza/QRGWAS
TITLE
Genome-wide discovery for biomarkers using quantile regression at biobank scale.
Main citation
Wang C, Wang T, Kiryluk K, Wei Y, ...&, Ionita-Laza I. (2024) Genome-wide discovery for biomarkers using quantile regression at biobank scale. Nat Commun, 15 (1) 6460. doi:10.1038/s41467-024-50726-x. PMID 39085219
ABSTRACT
Genome-wide association studies (GWAS) for biomarkers important for clinical phenotypes can lead to clinically relevant discoveries. Conventional GWAS for quantitative traits are based on simplified regression models modeling the conditional mean of a phenotype as a linear function of genotype. We draw attention here to an alternative, lesser known approach, namely quantile regression that naturally extends linear regression to the analysis of the entire conditional distribution of a phenotype of interest. Quantile regression can be applied efficiently at biobank scale, while having some unique advantages such as (1) identifying variants with heterogeneous effects across quantiles of the phenotype distribution; (2) accommodating a wide range of phenotype distributions including non-normal distributions, with invariance of results to trait transformations; and (3) providing more detailed information about genotype-phenotype associations even for those associations identified by conventional GWAS. We show in simulations that quantile regression is powerful across both homogeneous and various heterogeneous models. Applications to 39 quantitative traits in the UK Biobank demonstrate that quantile regression can be a helpful complement to linear regression in GWAS and can identify variants with larger effects on high-risk subgroups of individuals but with lower or no contribution overall.
DOI
10.1038/s41467-024-50726-x
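
The contrast between mean-based and quantile-based association can be sketched with a toy example. The code below is not the QRGWAS implementation: it uses a hypothetical binary carrier indicator, for which the quantile regression slope at level tau reduces to the difference between the two groups' tau-quantiles, and simulated data in which carriers have the same median but a wider spread. All names and parameter values are assumptions for illustration.

```python
import math
import random

def pinball_loss(residual, tau):
    """Check loss that quantile regression minimizes at level tau."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))

def total_loss(values, q, tau):
    return sum(pinball_loss(v - q, tau) for v in values)

def sample_quantile(values, tau):
    """An empirical tau-quantile; it minimizes the summed pinball loss."""
    ordered = sorted(values)
    return ordered[max(0, math.ceil(tau * len(ordered)) - 1)]

def carrier_effect(non_carriers, carriers, tau):
    """Quantile regression slope for a 0/1 predictor: difference of group quantiles."""
    return sample_quantile(carriers, tau) - sample_quantile(non_carriers, tau)

rng = random.Random(7)
# Hypothetical trait values: same center, larger spread among carriers.
non_carriers = [rng.gauss(0, 1) for _ in range(5000)]
carriers = [rng.gauss(0, 2) for _ in range(5000)]
for tau in (0.5, 0.9):
    print(tau, round(carrier_effect(non_carriers, carriers, tau), 3))
```

In this simulation the effect at the median is close to zero while the effect at the upper quantile is clearly positive, the kind of heterogeneity across quantiles that a conditional-mean model would not describe.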

Quickdraws

Tool
PUBMED_LINK
39789286
DESCRIPTION
Quickdraws is a scalable method to perform genome-wide association studies (GWAS) for quantitative and binary traits. To run GWAS using Quickdraws, you will need three main input files: bed (and bgen) files with model-building and testing genetic variants, phenotype files, and covariate files. For certain analyses, you may also need a list of model SNPs and a file describing close genetic relatives.
URL
https://palamaralab.github.io/software/quickdraws/manual/
TITLE
A scalable variational inference approach for increased mixed-model association power.
Main citation
Loya H, Kalantzis G, Cooper F, Palamara PF. (2025) A scalable variational inference approach for increased mixed-model association power. Nat Genet, 57 (2) 461-468. doi:10.1038/s41588-024-02044-7. PMID 39789286
ABSTRACT
The rapid growth of modern biobanks is creating new opportunities for large-scale genome-wide association studies (GWASs) and the analysis of complex traits. However, performing GWASs on millions of samples often leads to trade-offs between computational efficiency and statistical power, reducing the benefits of large-scale data collection efforts. We developed Quickdraws, a method that increases association power in quantitative and binary traits without sacrificing computational efficiency, leveraging a spike-and-slab prior on variant effects, stochastic variational inference and graphics processing unit acceleration. We applied Quickdraws to 79 quantitative and 50 binary traits in 405,088 UK Biobank samples, identifying 4.97% and 3.25% more associations than REGENIE and 22.71% and 7.07% more than FastGWA. Quickdraws had costs comparable to REGENIE, FastGWA and SAIGE on the UK Biobank Research Analysis Platform service, while being substantially faster than BOLT-LMM. These results highlight the promise of leveraging machine learning techniques for scalable GWASs without sacrificing power or robustness.
DOI
10.1038/s41588-024-02044-7

REGENIE

GWAS
PUBMED_LINK
34017140
DESCRIPTION
regenie is a C++ program for whole genome regression modelling of large genome-wide association studies. It is developed and supported by a team of scientists at the Regeneron Genetics Center.
URL
https://github.com/rgcgithub/regenie
KEYWORDS
whole genome regression
TITLE
Computationally efficient whole-genome regression for quantitative and binary traits.
Main citation
Mbatchou J, Barnard L, Backman J, Marcketta A, ...&, Marchini J. (2021) Computationally efficient whole-genome regression for quantitative and binary traits. Nat Genet, 53 (7) 1097-1103. doi:10.1038/s41588-021-00870-7. PMID 34017140
ABSTRACT
Genome-wide association analysis of cohorts with thousands of phenotypes is computationally expensive, particularly when accounting for sample relatedness or population structure. Here we present a novel machine-learning method called REGENIE for fitting a whole-genome regression model for quantitative and binary phenotypes that is substantially faster than alternatives in multi-trait analyses while maintaining statistical efficiency. The method naturally accommodates parallel analysis of multiple phenotypes and requires only local segments of the genotype matrix to be loaded in memory, in contrast to existing alternatives, which must load genome-wide matrices into memory. This results in substantial savings in compute time and memory usage. We introduce a fast, approximate Firth logistic regression test for unbalanced case-control phenotypes. The method is ideally suited to take advantage of distributed computing frameworks. We demonstrate the accuracy and computational benefits of this approach using the UK Biobank dataset with up to 407,746 individuals.
DOI
10.1038/s41588-021-00870-7

SAIGE

Tool
PUBMED_LINK
30104761
FULL NAME
Scalable and Accurate Implementation of GEneralized mixed model
DESCRIPTION
SAIGE is an R package with Scalable and Accurate Implementation of Generalized mixed model (Chen, H. et al. 2016). It accounts for sample relatedness and is feasible for genetic association tests in large cohorts and biobanks (N > 400,000). SAIGE performs single-variant association tests for binary traits and quantitative traits. For binary traits, SAIGE uses the saddlepoint approximation (SPA) (Imhof, J. P., 1961; Kuonen, D. 1999; Dey, R. et al. 2017) to account for case-control imbalance.
URL
https://github.com/weizhouUMICH/SAIGE
KEYWORDS
case-control imbalance, saddlepoint approximation (SPA)
TITLE
Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies.
Main citation
Zhou W, Nielsen JB, Fritsche LG, Dey R, ...&, Lee S. (2018) Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat Genet, 50 (9) 1335-1341. doi:10.1038/s41588-018-0184-y. PMID 30104761
ABSTRACT
In genome-wide association studies (GWAS) for thousands of phenotypes in large biobanks, most binary traits have substantially fewer cases than controls. Both of the widely used approaches, the linear mixed model and the recently proposed logistic mixed model, perform poorly; they produce large type I error rates when used to analyze unbalanced case-control phenotypes. Here we propose a scalable and accurate generalized mixed model association test that uses the saddlepoint approximation to calibrate the distribution of score test statistics. This method, SAIGE (Scalable and Accurate Implementation of GEneralized mixed model), provides accurate P values even when case-control ratios are extremely unbalanced. SAIGE uses state-of-art optimization strategies to reduce computational costs; hence, it is applicable to GWAS for thousands of phenotypes by large biobanks. Through the analysis of UK Biobank data of 408,961 samples from white British participants with European ancestry for > 1,400 binary phenotypes, we show that SAIGE can efficiently analyze large sample data, controlling for unbalanced case-control ratios and sample relatedness.
DOI
10.1038/s41588-018-0184-y
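
To see why a normal approximation fails when cases are rare, and what a saddlepoint approximation changes, the following sketch computes a one-sided tail probability for a simple score statistic S = sum g_i (y_i - mu_i). It is not SAIGE's implementation: it omits the mixed model entirely, assumes the null case probabilities `mu` are already known, and covers only the upper tail. The function names and the toy data (20 carriers, 1% prevalence, 3 carrier cases) are hypothetical. It uses the standard Barndorff-Nielsen form of the saddlepoint tail formula and compares the result with the exact binomial tail.

```python
import math


def normal_upper_tail(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def spa_upper_tail(g, mu, y):
    """One-sided saddlepoint p-value for S = sum g_i * (y_i - mu_i), with s > 0."""
    s = sum(gi * (yi - mi) for gi, mi, yi in zip(g, mu, y))
    if s <= 0:
        raise ValueError("this sketch only handles the upper tail (s > 0)")
    shift = sum(gi * mi for gi, mi in zip(g, mu))

    def k0(t):  # cumulant generating function of S under the null
        return sum(math.log(1 - mi + mi * math.exp(gi * t)) for gi, mi in zip(g, mu)) - t * shift

    def tilted(t):  # exponentially tilted case probabilities
        return [mi * math.exp(gi * t) / (1 - mi + mi * math.exp(gi * t)) for gi, mi in zip(g, mu)]

    def k1(t):  # first derivative of the CGF
        return sum(gi * pi for gi, pi in zip(g, tilted(t))) - shift

    def k2(t):  # second derivative of the CGF
        return sum(gi * gi * pi * (1 - pi) for gi, pi in zip(g, tilted(t)))

    # Solve k1(t) = s by bisection; k1 is increasing in t.
    lo, hi = 0.0, 1.0
    for _ in range(60):
        if k1(hi) >= s:
            break
        hi *= 2.0
    else:
        raise ValueError("no saddlepoint: s is at or beyond the support of S")
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if k1(mid) < s else (lo, mid)
    t = 0.5 * (lo + hi)

    w = math.sqrt(2.0 * (t * s - k0(t)))
    v = t * math.sqrt(k2(t))
    return normal_upper_tail(w + math.log(v / w) / w)


# Toy data: non-carriers have g = 0 and drop out of S, so only carriers are listed.
g = [1] * 20
mu = [0.01] * 20
y = [1, 1, 1] + [0] * 17

s_obs = sum(gi * (yi - mi) for gi, mi, yi in zip(g, mu, y))
var_s = sum(gi * gi * mi * (1 - mi) for gi, mi in zip(g, mu))
p_normal = normal_upper_tail(s_obs / math.sqrt(var_s))
p_spa = spa_upper_tail(g, mu, y)
p_exact = sum(math.comb(20, k) * 0.01**k * 0.99**(20 - k) for k in range(3, 21))
print(f"normal: {p_normal:.3g}  saddlepoint: {p_spa:.3g}  exact: {p_exact:.3g}")
```

In this toy case the normal approximation understates the p-value by many orders of magnitude, whereas the saddlepoint value lands within an order of magnitude of the exact tail. The remaining gap is expected because the statistic is discrete and the sketch applies no continuity correction.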

fastGWA

Tool
PUBMED_LINK
31768069
URL
https://yanglab.westlake.edu.cn/software/gcta/#fastGWA
KEYWORDS
grid-search-based REML algorithm
TITLE
A resource-efficient tool for mixed model association analysis of large-scale data.
Main citation
Jiang L, Zheng Z, Qi T, Kemper KE, ...&, Yang J. (2019) A resource-efficient tool for mixed model association analysis of large-scale data. Nat Genet, 51 (12) 1749-1755. doi:10.1038/s41588-019-0530-8. PMID 31768069
ABSTRACT
The genome-wide association study (GWAS) has been widely used as an experimental design to detect associations between genetic variants and a phenotype. Two major confounding factors, population stratification and relatedness, could potentially lead to inflated GWAS test statistics and hence to spurious associations. Mixed linear model (MLM)-based approaches can be used to account for sample structure. However, genome-wide association (GWA) analyses in biobank samples such as the UK Biobank (UKB) often exceed the capability of most existing MLM-based tools especially if the number of traits is large. Here, we develop an MLM-based tool (fastGWA) that controls for population stratification by principal components and for relatedness by a sparse genetic relationship matrix for GWA analyses of biobank-scale data. We demonstrate by extensive simulations that fastGWA is reliable, robust and highly resource-efficient. We then apply fastGWA to 2,173 traits on array-genotyped and imputed samples from 456,422 individuals and to 2,048 traits on whole-exome-sequenced samples from 46,191 individuals in the UKB.
DOI
10.1038/s41588-019-0530-8

fastGWA-GLMM

Tool
PUBMED_LINK
34737426
URL
https://yanglab.westlake.edu.cn/software/gcta/#fastGWA
TITLE
A generalized linear mixed model association tool for biobank-scale data.
Main citation
Jiang L, Zheng Z, Fang H, Yang J. (2021) A generalized linear mixed model association tool for biobank-scale data. Nat Genet, 53 (11) 1616-1621. doi:10.1038/s41588-021-00954-4. PMID 34737426
ABSTRACT
Compared with linear mixed model-based genome-wide association (GWA) methods, generalized linear mixed model (GLMM)-based methods have better statistical properties when applied to binary traits but are computationally much slower. In the present study, leveraging efficient sparse matrix-based algorithms, we developed a GLMM-based GWA tool, fastGWA-GLMM, that is severalfold to orders of magnitude faster than the state-of-the-art tools when applied to the UK Biobank (UKB) data and scalable to cohorts with millions of individuals. We show by simulation that the fastGWA-GLMM test statistics of both common and rare variants are well calibrated under the null, even for traits with extreme case-control ratios. We applied fastGWA-GLMM to the UKB data of 456,348 individuals, 11,842,647 variants and 2,989 binary traits (full summary statistics available at http://fastgwa.info/ukbimpbin), and identified 259 rare variants associated with 75 traits, demonstrating the use of imputed genotype data in a large cohort to discover rare variants for binary complex traits.
DOI
10.1038/s41588-021-00954-4