Tools Association tests

Curation of Association tests — listings under the GWAS Tools tab.

Summary Table

NAME	CATEGORY	Main citation	YEAR
CC-GWAS	Case-case GWAS	Peyrot WJ et al., Nat Genet, 2021	2021
SPAGRM	GWAS of longitudinal traits	Xu H et al., Nat Commun, 2025	2025
TrajGWAS	GWAS of longitudinal trajectories	Ko S et al., Am J Hum Genet, 2022	2022
GWAX	GWAS using family history	Liu JZ et al., Nat Genet, 2017	2017
LT-FH	GWAS using family history	Hujoel MLA et al., Nat Genet, 2020	2020
SiblingGWAS	GWAS using family history	Howe LJ et al., Nat Genet, 2022	2022
snipar-unified estimator	GWAS using family history	Guan J et al., Nat Genet, 2025	2025
snipar	GWAS using family history	Young AI et al., Nat Genet, 2022	2022
MultiSTAAR	Gene-based analysis (rare variant)	Li X et al., Nat Comput Sci, 2025	2025
REGENIE	Gene-based analysis (rare variant)	Mbatchou J et al., Nat Genet, 2021	2021
SAIGE-GENE+	Gene-based analysis (rare variant)	Zhou W et al., Nat Genet, 2022	2022
SAIGE-GENE	Gene-based analysis (rare variant)	Zhou W et al., Nat Genet, 2020	2020
SKAT-O	Gene-based analysis (rare variant)	Lee S et al., Biostatistics, 2012	2012
SKAT	Gene-based analysis (rare variant)	Wu MC et al., Am J Hum Genet, 2011	2011
STAAR	Gene-based analysis (rare variant)	Li X et al., Nat Genet, 2020	2020
STAARpipeline	Gene-based analysis (rare variant)	Li Z et al., Nat Methods, 2022	2022
LDAK-GBAT	Gene-based analysis (sumstats)	Berrandou TE et al., Am J Hum Genet, 2023	2023
GATE	Genome-wide survival association analysis	Dey R et al., Nat Commun, 2022	2022
COWAS	MISC	Malakhov MM et al., Nat Commun, 2025	2025
GWAS-by-Subtraction	Other	Demange PA et al., Nat Genet, 2021	2021
PGS-adjusted GWAS	PGS-adjusted GWAS	Campos AI et al., Nat Genet, 2023	2023
PGS-adjusted RVATs	PGS-adjusted GWAS	Jurgens SJ et al., Nat Genet, 2023	2023
POP-GWAS	Phenotype imputation	Miao J et al., Nat Genet, 2024	2024
Review-Povysil	Review	Povysil G et al., Nat Rev Genet, 2019	2019
BOLT-LMM	Single variant association tests	Loh PR et al., Nat Genet, 2015	2015
EMMAX	Single variant association tests	Kang HM et al., Nat Genet, 2010	2010
GEMMA	Single variant association tests	Zhou X et al., Nat Genet, 2012	2012
LDAK-KVIK	Single variant association tests	Hof, J. P. & Speed, D. LDAK-KVIK performs fast and powerful mixed-model association analysis of quantitative and…	NA
PLINK2	Single variant association tests	Chang CC et al., Gigascience, 2015	2015
PLINK	Single variant association tests	Purcell S et al., Am J Hum Genet, 2007	2007
POLMM	Single variant association tests	Bi W et al., Am J Hum Genet, 2021	2021
QRGWAS	Single variant association tests	Wang C et al., Nat Commun, 2024	2024
Quickdraws	Single variant association tests	Loya H et al., Nat Genet, 2025	2025
REGENIE	Single variant association tests	Mbatchou J et al., Nat Genet, 2021	2021
SAIGE	Single variant association tests	Zhou W et al., Nat Genet, 2018	2018
fastGWA-GLMM	Single variant association tests	Jiang L et al., Nat Genet, 2021	2021
fastGWA	Single variant association tests	Jiang L et al., Nat Genet, 2019	2019

Case-case GWAS

CC-GWAS

Tool

PUBMED_LINK

33686288

FULL NAME

case–case genome-wide association study

DESCRIPTION

The CCGWAS R package provides a tool for case-case association testing of two different disorders based on their respective case-control GWAS results.

URL

https://github.com/wouterpeyrot/CCGWAS

TITLE

Identifying loci with different allele frequencies among cases of eight psychiatric disorders using CC-GWAS.

Main citation

Peyrot WJ, Price AL. (2021) Identifying loci with different allele frequencies among cases of eight psychiatric disorders using CC-GWAS. Nat Genet, 53 (4) 445-454. doi:10.1038/s41588-021-00787-1. PMID 33686288

ABSTRACT

Psychiatric disorders are highly genetically correlated, but little research has been conducted on the genetic differences between disorders. We developed a new method (case-case genome-wide association study; CC-GWAS) to test for differences in allele frequency between cases of two disorders using summary statistics from the respective case-control GWAS, transcending current methods that require individual-level data. Simulations and analytical computations confirm that CC-GWAS is well powered with effective control of type I error. We applied CC-GWAS to publicly available summary statistics for schizophrenia, bipolar disorder, major depressive disorder and five other psychiatric disorders. CC-GWAS identified 196 independent case-case loci, including 72 CC-GWAS-specific loci that were not significant at the genome-wide level in the input case-control summary statistics; two of the CC-GWAS-specific loci implicate the genes KLF6 and KLF16 (from the Krüppel-like family of transcription factors), which have been linked to neurite outgrowth and axon regeneration. CC-GWAS loci replicated convincingly in applications to datasets with independent replication data.

DOI

10.1038/s41588-021-00787-1

GWAS of longitudinal traits

SPAGRM

Tool

PUBMED_LINK

39915470

DESCRIPTION

SPAGRM is a scalable and accurate analysis framework to control for sample relatedness in large-scale genome-wide association studies (GWAS).

URL

https://github.com/HeXuPKU/SPAGRM

KEYWORDS

SPA, longitudinal traits

TITLE

SPA

Main citation

Xu H, Ma Y, Xu LL, Li Y, ...&, Bi W. (2025) SPA Nat Commun, 16 (1) 1413. doi:10.1038/s41467-025-56669-1. PMID 39915470

ABSTRACT

Sample relatedness is a major confounder in genome-wide association studies (GWAS), potentially leading to inflated type I error rates if not appropriately controlled. A common strategy is to incorporate a random effect related to genetic relatedness matrix (GRM) into regression models. However, this approach is challenging for large-scale GWAS of complex traits, such as longitudinal traits. Here we propose a scalable and accurate analysis framework, SPAGRM, which controls for sample relatedness via a precise approximation of the joint distribution of genotypes. SPAGRM can utilize GRM-free models and thus is applicable to various trait types and statistical methods, including linear mixed models and generalized estimation equations for longitudinal traits. A hybrid strategy incorporating saddlepoint approximation greatly increases the accuracy to analyze low-frequency and rare genetic variants, especially in unbalanced phenotypic distributions. We also introduce SPAGRM(CCT) to aggregate the results following different models via Cauchy combination test. Extensive simulations and real data analyses demonstrated that SPAGRM maintains well-controlled type I error rates and SPAGRM(CCT) can serve as a broadly effective method. Applying SPAGRM to 79 longitudinal traits extracted from UK Biobank primary care data, we identified 7,463 genetic loci, making a pioneering attempt to conduct GWAS for these traits as longitudinal traits.

DOI

10.1038/s41467-025-56669-1
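
The SPAGRM abstract mentions aggregating results from different models with a Cauchy combination test. The sketch below shows the general form of that test, not SPAGRM's own code: each p-value is mapped to a standard Cauchy variate, the variates are averaged, and the combined p-value is read from the Cauchy upper tail. The function name and the equal default weights are assumptions made for illustration.

```python
import math


def cauchy_combination(pvalues, weights=None):
    """Combine p-values with a Cauchy combination test (illustrative sketch)."""
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p < 1.0) for p in pvalues):
        raise ValueError("p-values must lie strictly between 0 and 1")
    if weights is None:
        # Assumption: equal weights when none are supplied.
        weights = [1.0] * len(pvalues)
    total = sum(weights)
    weights = [w / total for w in weights]
    # Under the null, tan((0.5 - p) * pi) follows a standard Cauchy distribution.
    statistic = sum(
        w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvalues)
    )
    # Upper-tail probability of the standard Cauchy distribution.
    return 0.5 - math.atan(statistic) / math.pi


if __name__ == "__main__":
    print(cauchy_combination([0.01, 0.5]))
```

A single p-value is returned unchanged (up to rounding), and a small p-value combined with an uninformative one stays small, which is why the test suits aggregating correlated model results.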

GWAS of longitudinal trajectories

TrajGWAS

Tool

PUBMED_LINK

35196515

FULL NAME

GWAS of longitudinal trajectories

DESCRIPTION

TrajGWAS.jl is a Julia package for performing genome-wide association studies (GWAS) for continuous longitudinal phenotypes using a modified linear mixed effects model. It builds upon the within-subject variance estimation by robust regression (WiSER) method and can be used to identify variants associated with changes in the mean and within-subject variability of the longitudinal trait.

URL

https://github.com/OpenMendel/TrajGWAS.jl

KEYWORDS

biomarker trajectories, mean, within-subject (WS) variability, linear mixed effect model, within-subject variance estimation by robust regression (WiSER) method

TITLE

GWAS of longitudinal trajectories at biobank scale.

Main citation

Ko S, German CA, Jensen A, Shen J, ...&, Zhou JJ. (2022) GWAS of longitudinal trajectories at biobank scale. Am J Hum Genet, 109 (3) 433-445. doi:10.1016/j.ajhg.2022.01.018. PMID 35196515

ABSTRACT

Biobanks linked to massive, longitudinal electronic health record (EHR) data make numerous new genetic research questions feasible. One among these is the study of biomarker trajectories. For example, high blood pressure measurements over visits strongly predict stroke onset, and consistently high fasting glucose and Hb1Ac levels define diabetes. Recent research reveals that not only the mean level of biomarker trajectories but also their fluctuations, or within-subject (WS) variability, are risk factors for many diseases. Glycemic variation, for instance, is recently considered an important clinical metric in diabetes management. It is crucial to identify the genetic factors that shift the mean or alter the WS variability of a biomarker trajectory. Compared to traditional cross-sectional studies, trajectory analysis utilizes more data points and captures a complete picture of the impact of time-varying factors, including medication history and lifestyle. Currently, there are no efficient tools for genome-wide association studies (GWASs) of biomarker trajectories at the biobank scale, even for just mean effects. We propose TrajGWAS, a linear mixed effect model-based method for testing genetic effects that shift the mean or alter the WS variability of a biomarker trajectory. It is scalable to biobank data with 100,000 to 1,000,000 individuals and many longitudinal measurements and robust to distributional assumptions. Simulation studies corroborate that TrajGWAS controls the type I error rate and is powerful. Analysis of eleven biomarkers measured longitudinally and extracted from UK Biobank primary care data for more than 150,000 participants with 1,800,000 observations reveals loci that significantly alter the mean or WS variability.

DOI

10.1016/j.ajhg.2022.01.018

GWAS using family history

GWAX

Tool

PUBMED_LINK

28092683

FULL NAME

genome-wide association by proxy

DESCRIPTION

In randomly ascertained cohorts, replacing cases with their first-degree relatives enables studies of diseases that are absent (or nearly absent) in the cohort.

TITLE

Case-control association mapping by proxy using family history of disease.

Main citation

Liu JZ, Erlich Y, Pickrell JK. (2017) Case-control association mapping by proxy using family history of disease. Nat Genet, 49 (3) 325-331. doi:10.1038/ng.3766. PMID 28092683

ABSTRACT

Collecting cases for case-control genetic association studies can be time-consuming and expensive. In some situations (such as studies of late-onset or rapidly lethal diseases), it may be more practical to identify family members of cases. In randomly ascertained cohorts, replacing cases with their first-degree relatives enables studies of diseases that are absent (or nearly absent) in the cohort. We refer to this approach as genome-wide association study by proxy (GWAX) and apply it to 12 common diseases in 116,196 individuals from the UK Biobank. Meta-analysis with published genome-wide association study summary statistics replicated established risk loci and yielded four newly associated loci for Alzheimer's disease, eight for coronary artery disease and five for type 2 diabetes. In addition to informing disease biology, our results demonstrate the utility of association mapping without directly observing cases. We anticipate that GWAX will prove useful in future genetic studies of complex traits in large population cohorts.

DOI

10.1038/ng.3766
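
The proxy idea in the GWAX entry can be illustrated by how the analysis phenotype is coded. This is a minimal sketch under stated assumptions: the function name is hypothetical, and merging observed cases with proxy-cases into one group is one possible convention, not something the entry above specifies.

```python
def gwax_label(is_case, n_affected_first_degree):
    """Return 1 for a case or proxy-case, 0 for a control (hypothetical helper).

    Assumption: individuals with at least one affected first-degree relative
    are treated as proxy-cases and grouped with any observed cases.
    """
    if n_affected_first_degree < 0:
        raise ValueError("number of affected relatives cannot be negative")
    if is_case or n_affected_first_degree > 0:
        return 1
    return 0


if __name__ == "__main__":
    cohort = [(False, 0), (False, 1), (True, 0), (False, 2)]
    print([gwax_label(case, n) for case, n in cohort])
```

The resulting 0/1 labels would then be passed to an ordinary case-control association test.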

LT-FH

Tool

PUBMED_LINK

32313248

FULL NAME

liability threshold model, conditional on case–control status and family history

DESCRIPTION

An association method based on posterior mean genetic liabilities under a liability threshold model, conditional on case-control status and family history (LT-FH).

URL

https://alkesgroup.broadinstitute.org/UKBB/LTFH/

TITLE

Liability threshold modeling of case-control status and family history of disease increases association power.

Main citation

Hujoel MLA, Gazal S, Loh PR, Patterson N, ...&, Price AL. (2020) Liability threshold modeling of case-control status and family history of disease increases association power. Nat Genet, 52 (5) 541-547. doi:10.1038/s41588-020-0613-6. PMID 32313248

ABSTRACT

Family history of disease can provide valuable information in case-control association studies, but it is currently unclear how to best combine case-control status and family history of disease. We developed an association method based on posterior mean genetic liabilities under a liability threshold model, conditional on case-control status and family history (LT-FH). Analyzing 12 diseases from the UK Biobank (average N = 350,000) we compared LT-FH to genome-wide association without using family history (GWAS) and a previous proxy-based method incorporating family history (GWAX). LT-FH was 63% (standard error (s.e.) 6%) more powerful than GWAS and 36% (s.e. 4%) more powerful than the trait-specific maximum of GWAS and GWAX, based on the number of independent genome-wide-significant loci across all diseases (for example, 690 loci for LT-FH versus 423 for GWAS); relative improvements were similar when applying BOLT-LMM to GWAS, GWAX and LT-FH phenotypes. Thus, LT-FH greatly increases association power when family history of disease is available.

DOI

10.1038/s41588-020-0613-6

SiblingGWAS

Tool

PUBMED_LINK

35534559

FULL NAME

Within-sibship genome-wide association analyses

DESCRIPTION

Scripts for running GWAS using siblings to estimate Within-Family (WF) and Between-Family (BF) effects of genetic variants on continuous traits. Allows the inclusion of more than two siblings from one family.

URL

https://github.com/LaurenceHowe/SiblingGWAS

TITLE

Within-sibship genome-wide association analyses decrease bias in estimates of direct genetic effects.

Main citation

Howe LJ, Nivard MG, Morris TT, Hansen AF, ...&, Davies NM. (2022) Within-sibship genome-wide association analyses decrease bias in estimates of direct genetic effects. Nat Genet, 54 (5) 581-592. doi:10.1038/s41588-022-01062-7. PMID 35534559

ABSTRACT

Estimates from genome-wide association studies (GWAS) of unrelated individuals capture effects of inherited variation (direct effects), demography (population stratification, assortative mating) and relatives (indirect genetic effects). Family-based GWAS designs can control for demographic and indirect genetic effects, but large-scale family datasets have been lacking. We combined data from 178,086 siblings from 19 cohorts to generate population (between-family) and within-sibship (within-family) GWAS estimates for 25 phenotypes. Within-sibship GWAS estimates were smaller than population estimates for height, educational attainment, age at first birth, number of children, cognitive ability, depressive symptoms and smoking. Some differences were observed in downstream SNP heritability, genetic correlations and Mendelian randomization analyses. For example, the within-sibship genetic correlation between educational attainment and body mass index attenuated towards zero. In contrast, analyses of most molecular phenotypes (for example, low-density lipoprotein-cholesterol) were generally consistent. We also found within-sibship evidence of polygenic adaptation on taller height. Here, we illustrate the importance of family-based GWAS data for phenotypes influenced by demographic and indirect genetic effects.

DOI

10.1038/s41588-022-01062-7
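
The within-family (WF) and between-family (BF) effects named in the SiblingGWAS description can be illustrated with a family-mean decomposition. The sketch below assumes the common model in which each sibling's genotype is split into the sibship mean and the deviation from that mean; it is not taken from the SiblingGWAS scripts, and all function names are hypothetical. Sibships of any size are handled because the mean is taken over all siblings in a family.

```python
from collections import defaultdict


def split_genotypes(family_ids, genotypes):
    """Split each genotype into a sibship mean (BF) and a deviation from it (WF)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for fam, g in zip(family_ids, genotypes):
        sums[fam] += g
        counts[fam] += 1
    between = [sums[fam] / counts[fam] for fam in family_ids]
    within = [g - b for g, b in zip(genotypes, between)]
    return between, within


def slope(x, y):
    """Least-squares slope of y on x with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def sibling_effects(family_ids, genotypes, phenotype):
    """Return (WF, BF) effect estimates for one variant.

    The WF component sums to zero inside every family, so it is orthogonal to
    both the intercept and the BF component; the two simple regressions below
    therefore give the same point estimates as one joint regression.
    """
    between, within = split_genotypes(family_ids, genotypes)
    return slope(within, phenotype), slope(between, phenotype)


if __name__ == "__main__":
    fams = ["A", "A", "B", "B", "B"]
    genos = [0, 2, 1, 1, 2]
    pheno = [1.5, 5.5, 3.0, 3.0, 5.0]
    print(sibling_effects(fams, genos, pheno))
```

Only point estimates are shown; standard errors would need to account for clustering within families.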

snipar

Tool

PUBMED_LINK

35681053

FULL NAME

single nucleotide imputation of parents

DESCRIPTION

snipar (single nucleotide imputation of parents) is a Python package for inferring identity-by-descent (IBD) segments shared between siblings, imputing missing parental genotypes, and performing family-based genome-wide association and polygenic score analyses using observed and/or imputed parental genotypes.

URL

https://github.com/AlexTISYoung/snipar

TITLE

Mendelian imputation of parental genotypes improves estimates of direct genetic effects.

Main citation

Young AI, Nehzati SM, Benonisdottir S, Okbay A, ...&, Kong A. (2022) Mendelian imputation of parental genotypes improves estimates of direct genetic effects. Nat Genet, 54 (6) 897-905. doi:10.1038/s41588-022-01085-0. PMID 35681053

ABSTRACT

Effects estimated by genome-wide association studies (GWASs) include effects of alleles in an individual on that individual (direct genetic effects), indirect genetic effects (for example, effects of alleles in parents on offspring through the environment) and bias from confounding. Within-family genetic variation is random, enabling unbiased estimation of direct genetic effects when parents are genotyped. However, parental genotypes are often missing. We introduce a method that imputes missing parental genotypes and estimates direct genetic effects. Our method, implemented in the software package snipar (single-nucleotide imputation of parents), gives more precise estimates of direct genetic effects than existing approaches. Using 39,614 individuals from the UK Biobank with at least one genotyped sibling/parent, we estimate the correlation between direct genetic effects and effects from standard GWASs for nine phenotypes, including educational attainment (r = 0.739, standard error (s.e.) = 0.086) and cognitive ability (r = 0.490, s.e. = 0.086). Our results demonstrate substantial confounding bias in standard GWASs for some phenotypes.

DOI

10.1038/s41588-022-01085-0

snipar-unified estimator (snipar)

Tool

PUBMED_LINK

40065166

FULL NAME

single nucleotide imputation of parents

URL

https://github.com/AlexTISYoung/snipar

TITLE

Family-based genome-wide association study designs for increased power and robustness.

Main citation

Guan J, Tan T, Nehzati SM, Bennett M, ...&, Young AS. (2025) Family-based genome-wide association study designs for increased power and robustness. Nat Genet, 57 (4) 1044-1052. doi:10.1038/s41588-025-02118-0. PMID 40065166

ABSTRACT

Family-based genome-wide association studies (FGWASs) use random, within-family genetic variation to remove confounding from estimates of direct genetic effects (DGEs). Here we introduce a 'unified estimator' that includes individuals without genotyped relatives, unifying standard and FGWAS while increasing power for DGE estimation. We also introduce a 'robust estimator' that is not biased in structured and/or admixed populations. In an analysis of 19 phenotypes in the UK Biobank, the unified estimator in the White British subsample and the robust estimator (applied without ancestry restrictions) increased the effective sample size for DGEs by 46.9% to 106.5% and 10.3% to 21.0%, respectively, compared to using genetic differences between siblings. Polygenic predictors derived from the unified estimator demonstrated superior out-of-sample prediction ability compared to other family-based methods. We implemented the methods in the software package snipar in an efficient linear mixed model that accounts for sample relatedness and sibling shared environment.

DOI

10.1038/s41588-025-02118-0

Gene-based analysis (rare variant)

MultiSTAAR

Tool

PUBMED_LINK

39920506

FULL NAME

Multi-trait variant-Set Test for Association using Annotation infoRmation

DESCRIPTION

MultiSTAAR is an R package for performing Multi-trait variant-Set Test for Association using Annotation infoRmation (MultiSTAAR) procedure in whole-genome sequencing (WGS) studies. MultiSTAAR is a general framework that (1) leverages the correlation structure between multiple phenotypes to improve power of multi-trait analysis over single-trait analysis, and (2) incorporates both qualitative functional categories and quantitative complementary functional annotations using an omnibus multi-dimensional weighting scheme. MultiSTAAR accounts for population structure and relatedness, and is scalable for jointly analyzing large WGS studies of multiple correlated traits.

URL

https://github.com/xihaoli/MultiSTAAR

TITLE

A statistical framework for multi-trait rare variant analysis in large-scale whole-genome sequencing studies.

Main citation

Li X, Chen H, Selvaraj MS, Van Buren E, ...&, Lin X. (2025) A statistical framework for multi-trait rare variant analysis in large-scale whole-genome sequencing studies. Nat Comput Sci, 5 (2) 125-143. doi:10.1038/s43588-024-00764-8. PMID 39920506

ABSTRACT

Large-scale whole-genome sequencing (WGS) studies have improved our understanding of the contributions of coding and noncoding rare variants to complex human traits. Leveraging association effect sizes across multiple traits in WGS rare variant association analysis can improve statistical power over single-trait analysis, and also detect pleiotropic genes and regions. Existing multi-trait methods have limited ability to perform rare variant analysis of large-scale WGS data. We propose MultiSTAAR, a statistical framework and computationally scalable analytical pipeline for functionally informed multi-trait rare variant analysis in large-scale WGS studies. MultiSTAAR accounts for relatedness, population structure and correlation among phenotypes by jointly analyzing multiple traits, and further empowers rare variant association analysis by incorporating multiple functional annotations. We applied MultiSTAAR to jointly analyze three lipid traits in 61,838 multi-ethnic samples from the Trans-Omics for Precision Medicine (TOPMed) Program. We discovered and replicated new associations with lipid traits missed by single-trait analysis.

DOI

10.1038/s43588-024-00764-8

REGENIE

GWAS

PUBMED_LINK

34017140

DESCRIPTION

regenie is a C++ program for whole genome regression modelling of large genome-wide association studies. It is developed and supported by a team of scientists at the Regeneron Genetics Center.

URL

https://github.com/rgcgithub/regenie

KEYWORDS

whole genome regression

TITLE

Computationally efficient whole-genome regression for quantitative and binary traits.

Main citation

Mbatchou J, Barnard L, Backman J, Marcketta A, ...&, Marchini J. (2021) Computationally efficient whole-genome regression for quantitative and binary traits. Nat Genet, 53 (7) 1097-1103. doi:10.1038/s41588-021-00870-7. PMID 34017140

ABSTRACT

Genome-wide association analysis of cohorts with thousands of phenotypes is computationally expensive, particularly when accounting for sample relatedness or population structure. Here we present a novel machine-learning method called REGENIE for fitting a whole-genome regression model for quantitative and binary phenotypes that is substantially faster than alternatives in multi-trait analyses while maintaining statistical efficiency. The method naturally accommodates parallel analysis of multiple phenotypes and requires only local segments of the genotype matrix to be loaded in memory, in contrast to existing alternatives, which must load genome-wide matrices into memory. This results in substantial savings in compute time and memory usage. We introduce a fast, approximate Firth logistic regression test for unbalanced case-control phenotypes. The method is ideally suited to take advantage of distributed computing frameworks. We demonstrate the accuracy and computational benefits of this approach using the UK Biobank dataset with up to 407,746 individuals.

DOI

10.1038/s41588-021-00870-7

SAIGE-GENE

Tool

PUBMED_LINK

32424355

URL

https://github.com/weizhouUMICH/SAIGE

TITLE

Scalable generalized linear mixed model for region-based association tests in large biobanks and cohorts.

Main citation

Zhou W, Zhao Z, Nielsen JB, Fritsche LG, ...&, Lee S. (2020) Scalable generalized linear mixed model for region-based association tests in large biobanks and cohorts. Nat Genet, 52 (6) 634-639. doi:10.1038/s41588-020-0621-6. PMID 32424355

ABSTRACT

With very large sample sizes, biobanks provide an exciting opportunity to identify genetic components of complex traits. To analyze rare variants, region-based multiple-variant aggregate tests are commonly used to increase power for association tests. However, because of the substantial computational cost, existing region-based tests cannot analyze hundreds of thousands of samples while accounting for confounders such as population stratification and sample relatedness. Here we propose a scalable generalized mixed-model region-based association test, SAIGE-GENE, that is applicable to exome-wide and genome-wide region-based analysis for hundreds of thousands of samples and can account for unbalanced case-control ratios for binary traits. Through extensive simulation studies and analysis of the HUNT study with 69,716 Norwegian samples and the UK Biobank data with 408,910 White British samples, we show that SAIGE-GENE can efficiently analyze large-sample data (N > 400,000) with type I error rates well controlled.

DOI

10.1038/s41588-020-0621-6

SAIGE-GENE+

Tool

PUBMED_LINK

36138231

URL

https://github.com/weizhouUMICH/SAIGE

TITLE

SAIGE-GENE+ improves the efficiency and accuracy of set-based rare variant association tests.

Main citation

Zhou W, Bi W, Zhao Z, Dey KK, ...&, Lee S. (2022) SAIGE-GENE+ improves the efficiency and accuracy of set-based rare variant association tests. Nat Genet, 54 (10) 1466-1469. doi:10.1038/s41588-022-01178-w. PMID 36138231

ABSTRACT

Several biobanks, including UK Biobank (UKBB), are generating large-scale sequencing data. An existing method, SAIGE-GENE, performs well when testing variants with minor allele frequency (MAF) ≤ 1%, but inflation is observed in variance component set-based tests when restricting to variants with MAF ≤ 0.1% or 0.01%. Here, we propose SAIGE-GENE+ with greatly improved type I error control and computational efficiency to facilitate rare variant tests in large-scale data. We further show that incorporating multiple MAF cutoffs and functional annotations can improve power and thus uncover new gene-phenotype associations. In the analysis of UKBB whole exome sequencing data for 30 quantitative and 141 binary traits, SAIGE-GENE+ identified 551 gene-phenotype associations.

DOI

10.1038/s41588-022-01178-w

SKAT

Tool

PUBMED_LINK

21737059

FULL NAME

sequence kernel association test

DESCRIPTION

SKAT is a SNP-set (e.g., a gene or a region) level test for association between a set of rare (or common) variants and dichotomous or quantitative phenotypes. SKAT aggregates individual score test statistics of SNPs in a SNP set and efficiently computes SNP-set level p-values, e.g. a gene or a region level p-value, while adjusting for covariates, such as principal components to account for population stratification. SKAT also allows for power/sample size calculations for designing sequence association studies.

URL

https://www.hsph.harvard.edu/skat/

TITLE

Rare-variant association testing for sequencing data with the sequence kernel association test.

Main citation

Wu MC, Lee S, Cai T, Li Y, ...&, Lin X. (2011) Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet, 89 (1) 82-93. doi:10.1016/j.ajhg.2011.05.029. PMID 21737059

ABSTRACT

Sequencing studies are increasingly being conducted to identify rare variants associated with complex traits. The limited power of classical single-marker association analysis for rare variants poses a central challenge in such studies. We propose the sequence kernel association test (SKAT), a supervised, flexible, computationally efficient regression method to test for association between genetic variants (common and rare) in a region and a continuous or dichotomous trait while easily adjusting for covariates. As a score-based variance-component test, SKAT can quickly calculate p values analytically by fitting the null model containing only the covariates, and so can easily be applied to genome-wide data. Using SKAT to analyze a genome-wide sequencing study of 1000 individuals, by segmenting the whole genome into 30 kb regions, requires only 7 hr on a laptop. Through analysis of simulated data across a wide range of practical scenarios and triglyceride data from the Dallas Heart Study, we show that SKAT can substantially outperform several alternative rare-variant association tests. We also provide analytic power and sample-size calculations to help design candidate-gene, whole-exome, and whole-genome sequence association studies.

DOI

10.1016/j.ajhg.2011.05.029
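
The SKAT description says the test aggregates per-SNP score statistics within a set. The sketch below computes that aggregate for a quantitative trait and contrasts it with a burden statistic. It is an illustration, not the SKAT package: the null model here contains only an intercept (no covariates), the Beta(1, 25) weighting by minor allele frequency is assumed as the default choice, the statistics use squared weights on squared scores, and the p-value step (a mixture of chi-square distributions) is omitted. All function names are hypothetical.

```python
import math


def beta_weight(maf, a=1.0, b=25.0):
    """Beta(a, b) density at the minor allele frequency (assumed default weights)."""
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return maf ** (a - 1.0) * (1.0 - maf) ** (b - 1.0) / norm


def score_statistics(genotypes, phenotype):
    """Per-variant scores S_j = sum_i G_ij * (y_i - mean(y)), intercept-only null."""
    n = len(phenotype)
    mean_y = sum(phenotype) / n
    residuals = [y - mean_y for y in phenotype]
    n_variants = len(genotypes[0])
    return [
        sum(genotypes[i][j] * residuals[i] for i in range(n))
        for j in range(n_variants)
    ]


def skat_q(scores, weights):
    """Variance-component statistic: weighted sum of squared scores."""
    return sum((w * s) ** 2 for w, s in zip(weights, scores))


def burden_q(scores, weights):
    """Burden statistic: square of the weighted sum of scores."""
    return sum(w * s for w, s in zip(weights, scores)) ** 2


if __name__ == "__main__":
    # Rows are samples, columns are variants (minor allele counts).
    G = [[1, 0], [0, 1], [0, 0], [0, 0]]
    y = [1.0, -1.0, 0.0, 0.0]
    n = len(y)
    mafs = [sum(row[j] for row in G) / (2 * n) for j in range(len(G[0]))]
    w = [beta_weight(m) for m in mafs]
    S = score_statistics(G, y)
    print(skat_q(S, w), burden_q(S, w))
```

In this toy set the two variants have opposite effects, so the burden statistic cancels to zero while the SKAT statistic stays positive, which matches the motivation given in the abstract above.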

SKAT-O

Tool

PUBMED_LINK

22699862

FULL NAME

sequence kernel association test - optimal test

DESCRIPTION

SKAT-O estimates the correlation parameter in the kernel matrix to maximize power. This parameter corresponds to the weight in the linear combination of the burden test and SKAT test statistics that maximizes power.

URL

https://www.hsph.harvard.edu/skat/

TITLE

Optimal tests for rare variant effects in sequencing association studies.

Main citation

Lee S, Wu MC, Lin X. (2012) Optimal tests for rare variant effects in sequencing association studies. Biostatistics, 13 (4) 762-75. doi:10.1093/biostatistics/kxs014. PMID 22699862

ABSTRACT

With development of massively parallel sequencing technologies, there is a substantial need for developing powerful rare variant association tests. Common approaches include burden and non-burden tests. Burden tests assume all rare variants in the target region have effects on the phenotype in the same direction and of similar magnitude. The recently proposed sequence kernel association test (SKAT) (Wu, M. C., and others, 2011. Rare-variant association testing for sequencing data with the SKAT. The American Journal of Human Genetics 89, 82-93), an extension of the C-alpha test (Neale, B. M., and others, 2011. Testing for an unusual distribution of rare variants. PLoS Genetics 7, 161-165), provides a robust test that is particularly powerful in the presence of protective and deleterious variants and null variants, but is less powerful than burden tests when a large number of variants in a region are causal and in the same direction. As the underlying biological mechanisms are unknown in practice and vary from one gene to another across the genome, it is of substantial practical interest to develop a test that is optimal for both scenarios. In this paper, we propose a class of tests that include burden tests and SKAT as special cases, and derive an optimal test within this class that maximizes power. We show that this optimal test outperforms burden tests and SKAT in a wide range of scenarios. The results are illustrated using simulation studies and triglyceride data from the Dallas Heart Study. In addition, we have derived sample size/power calculation formula for SKAT with a new family of kernels to facilitate designing new sequence association studies.

DOI

10.1093/biostatistics/kxs014
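
The SKAT-O description links a correlation parameter in the kernel matrix to a weight in a linear combination of the burden and SKAT statistics. The sketch below checks that link numerically for already weighted per-variant scores: a kernel with 1 on the diagonal and rho elsewhere gives the same statistic as (1 - rho) * Q_SKAT + rho * Q_burden. Function names are hypothetical, and the search for the rho that minimizes the p-value, along with the p-value calculation itself, is not shown.

```python
def q_from_kernel(weighted_scores, rho):
    """Quadratic form z' R z, where R has 1 on the diagonal and rho elsewhere."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    m = len(weighted_scores)
    total = 0.0
    for j in range(m):
        for k in range(m):
            r_jk = 1.0 if j == k else rho
            total += weighted_scores[j] * r_jk * weighted_scores[k]
    return total


def q_from_combination(weighted_scores, rho):
    """The same statistic written as a mix of the SKAT and burden statistics."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    q_skat = sum(z * z for z in weighted_scores)
    q_burden = sum(weighted_scores) ** 2
    return (1.0 - rho) * q_skat + rho * q_burden


if __name__ == "__main__":
    z = [1.0, -2.0, 0.5]
    for rho in (0.0, 0.3, 1.0):
        print(rho, q_from_kernel(z, rho), q_from_combination(z, rho))
```

Setting rho to 0 recovers the SKAT statistic and setting it to 1 recovers the burden statistic, the two special cases named in the abstract above.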

STAAR

Tool

PUBMED_LINK

32839606

FULL NAME

variant-set test for association using annotation information

DESCRIPTION

STAAR is an R package for performing variant-Set Test for Association using Annotation infoRmation (STAAR) procedure in whole-genome sequencing (WGS) studies. STAAR is a general framework that incorporates both qualitative functional categories and quantitative complementary functional annotations using an omnibus multi-dimensional weighting scheme. STAAR accounts for population structure and relatedness, and is scalable for analyzing large WGS studies of continuous and dichotomous traits.

URL

https://github.com/xihaoli/STAAR

KEYWORDS

functional annotations

TITLE

Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale.

Main citation

Li X, Li Z, Zhou H, Gaynor SM, ...&, Lin X. (2020) Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale. Nat Genet, 52 (9) 969-983. doi:10.1038/s41588-020-0676-4. PMID 32839606

ABSTRACT

Large-scale whole-genome sequencing studies have enabled the analysis of rare variants (RVs) associated with complex phenotypes. Commonly used RV association tests have limited scope to leverage variant functions. We propose STAAR (variant-set test for association using annotation information), a scalable and powerful RV association test method that effectively incorporates both variant categories and multiple complementary annotations using a dynamic weighting scheme. For the latter, we introduce 'annotation principal components', multidimensional summaries of in silico variant annotations. STAAR accounts for population structure and relatedness and is scalable for analyzing very large cohort and biobank whole-genome sequencing studies of continuous and dichotomous traits. We applied STAAR to identify RVs associated with four lipid traits in 12,316 discovery and 17,822 replication samples from the Trans-Omics for Precision Medicine Program. We discovered and replicated new RV associations, including disruptive missense RVs of NPC1L1 and an intergenic region near APOC1P1 associated with low-density lipoprotein cholesterol.

DOI

10.1038/s41588-020-0676-4

STAARpipeline

Tool

PUBMED_LINK

36303018

FULL NAME

variant-set test for association using annotation information

DESCRIPTION

STAARpipeline is an R package for phenotype-genotype association analyses of biobank-scale WGS/WES data, including single variant analysis and variant set analysis.

URL

https://github.com/xihaoli/STAARpipeline/

TITLE

A framework for detecting noncoding rare-variant associations of large-scale whole-genome sequencing studies.

Main citation

Li Z, Li X, Zhou H, Gaynor SM, ...&, Lin X. (2022) A framework for detecting noncoding rare-variant associations of large-scale whole-genome sequencing studies. Nat Methods, 19 (12) 1599-1611. doi:10.1038/s41592-022-01640-x. PMID 36303018

ABSTRACT

Large-scale whole-genome sequencing studies have enabled analysis of noncoding rare-variant (RV) associations with complex human diseases and traits. Variant-set analysis is a powerful approach to study RV association. However, existing methods have limited ability in analyzing the noncoding genome. We propose a computationally efficient and robust noncoding RV association detection framework, STAARpipeline, to automatically annotate a whole-genome sequencing study and perform flexible noncoding RV association analysis, including gene-centric analysis and fixed window-based and dynamic window-based non-gene-centric analysis by incorporating variant functional annotations. In gene-centric analysis, STAARpipeline uses STAAR to group noncoding variants based on functional categories of genes and incorporate multiple functional annotations. In non-gene-centric analysis, STAARpipeline uses SCANG-STAAR to incorporate dynamic window sizes and multiple functional annotations. We apply STAARpipeline to identify noncoding RV sets associated with four lipid traits in 21,015 discovery samples from the Trans-Omics for Precision Medicine (TOPMed) program and replicate several of them in an additional 9,123 TOPMed samples. We also analyze five non-lipid TOPMed traits.

DOI

10.1038/s41592-022-01640-x

Gene-based analysis (sumstats)

LDAK-GBAT

Tool

PUBMED_LINK

36480927

FULL NAME

LDAK gene-based association testing

URL

http://www.ldak.org/

TITLE

LDAK-GBAT: Fast and powerful gene-based association testing using summary statistics.

Main citation

Berrandou TE, Balding D, Speed D. (2023) LDAK-GBAT: Fast and powerful gene-based association testing using summary statistics. Am J Hum Genet, 110 (1) 23-29. doi:10.1016/j.ajhg.2022.11.010. PMID 36480927

ABSTRACT

We present LDAK-GBAT, a tool for gene-based association testing using summary statistics from genome-wide association studies that is computationally efficient, produces well-calibrated p values, and is significantly more powerful than existing tools. LDAK-GBAT takes approximately 30 min to analyze imputed data (2.9M common, genic SNPs), requiring less than 10 Gb memory. It shows good control of type 1 error given an appropriate reference panel. Across 109 phenotypes (82 from the UK Biobank, 18 from the Million Veteran Program, and nine from the Psychiatric Genetics Consortium), LDAK-GBAT finds on average 19% (SE: 1%) more significant genes than the existing tool sumFREGAT-ACAT, with even greater gains in comparison with MAGMA, GCTA-fastBAT, sumFREGAT-SKAT-O, and sumFREGAT-PCA.

DOI

10.1016/j.ajhg.2022.11.010

Genome-wide survival association analysis

GATE

Tool

PUBMED_LINK

36114182

FULL NAME

Genetic Analysis of Time-to-Event phenotypes

DESCRIPTION

GATE (Genetic Analysis of Time-to-Event phenotypes) is an R package for scalable and accurate genome-wide association analysis of censored survival data in large-scale biobanks using frailty models.

GATE performs single-variant association tests for time-to-event endpoints. GATE uses the saddlepoint approximation (SPA) (Imhof, J. P., 1961; Kuonen, D., 1999; Dey, R. et al., 2017) to account for heavy censoring rates.

URL

https://github.com/weizhou0/GATE

KEYWORDS

censored time-to-event (TTE) phenotypes

TITLE

Efficient and accurate frailty model approach for genome-wide survival association analysis in large-scale biobanks.

Main citation

Dey R, Zhou W, Kiiskinen T, Havulinna A, ...&, Lin X. (2022) Efficient and accurate frailty model approach for genome-wide survival association analysis in large-scale biobanks. Nat Commun, 13 (1) 5437. doi:10.1038/s41467-022-32885-x. PMID 36114182

ABSTRACT

With decades of electronic health records linked to genetic data, large biobanks provide unprecedented opportunities for systematically understanding the genetics of the natural history of complex diseases. Genome-wide survival association analysis can identify genetic variants associated with ages of onset, disease progression and lifespan. We propose an efficient and accurate frailty model approach for genome-wide survival association analysis of censored time-to-event (TTE) phenotypes by accounting for both population structure and relatedness. Our method utilizes state-of-the-art optimization strategies to reduce the computational cost. The saddlepoint approximation is used to allow for analysis of heavily censored phenotypes (>90%) and low frequency variants (down to minor allele count 20). We demonstrate the performance of our method through extensive simulation studies and analysis of five TTE phenotypes, including lifespan, with heavy censoring rates (90.9% to 99.8%) on ~400,000 UK Biobank participants with white British ancestry and ~180,000 individuals in FinnGen. We further analyzed 871 TTE phenotypes in the UK Biobank and presented the genome-wide scale phenome-wide association results with the PheWeb browser.

DOI

10.1038/s41467-022-32885-x

MISC

COWAS

TWAS, Functional genomics, Gene prioritization, Tool, Summary statistics

PUBMED_LINK

41381446

FULL NAME

Co-expression-wide association study

DESCRIPTION

Co-expression-wide association study (COWAS) extends TWAS/PWAS by testing pairs of genes or proteins whose genetically regulated co-expression or interaction is associated with a trait; includes implemented R software and trained imputation weights for summary-statistic follow-up.

URL

https://github.com/mykmal/cowas, https://doi.org/10.1038/s41467-025-66039-6

KEYWORDS

TWAS, PWAS, co-expression, gene-gene interaction, GWAS summary statistics

TITLE

Co-expression-wide association studies link genetically regulated interactions with complex traits.

Main citation

Malakhov MM, Pan W. (2025) Co-expression-wide association studies link genetically regulated interactions with complex traits. Nat Commun, 16 (1) 11061. doi:10.1038/s41467-025-66039-6. PMID 41381446

ABSTRACT

Transcriptome- and proteome-wide association studies (TWAS/PWAS) have proven successful in prioritizing genes and proteins whose genetically regulated expression modulates disease risk, but they ignore potential co-expression and interaction effects. To address this limitation, we introduce the co-expression-wide association study (COWAS) method, which can identify pairs of genes or proteins whose genetically regulated co-expression is associated with complex traits. COWAS first trains models to predict expression and co-expression from genetic variation, and then tests for association between imputed co-expression and the trait of interest while also accounting for direct effects from each exposure. We applied our method to plasma proteomic concentrations from the UK Biobank, identifying dozens of interacting protein pairs associated with cholesterol levels, Alzheimer's disease, and Parkinson's disease. Notably, our results demonstrate that co-expression between proteins may affect complex traits even if neither protein is detected to influence the trait when considered on its own. We also show how COWAS can help to disentangle direct and interaction effects, providing a richer picture of the molecular networks that mediate genetic effects on disease outcomes.
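
The association step described in the abstract (testing imputed co-expression while accounting for the direct effect of each exposure) can be sketched as a joint linear regression. This is only an illustration on simulated data, not the COWAS R software: the function name is hypothetical, the imputed values are assumed to be available already, and the product of the two expression levels is used as a stand-in for imputed co-expression.

```python
import numpy as np

def joint_coexpression_test(y, expr_a, expr_b, coexpr):
    """OLS of the trait on two imputed expression levels plus imputed co-expression.

    Returns coefficient estimates and z-scores in the order
    (intercept, expr_a, expr_b, coexpr).
    """
    X = np.column_stack([np.ones_like(y), expr_a, expr_b, coexpr])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta, beta / np.sqrt(np.diag(cov))

# Simulated example: the trait depends only on the co-expression term.
rng = np.random.default_rng(0)
n = 2000
a = rng.normal(size=n)
b = rng.normal(size=n)
co = a * b                      # assumption: stand-in for imputed co-expression
y = 0.3 * co + rng.normal(size=n)

beta, z = joint_coexpression_test(y, a, b, co)
print("co-expression effect:", beta[3], "z-score:", z[3])
```

In this toy setting neither exposure has a direct effect, yet the co-expression term is clearly associated with the trait, mirroring the scenario highlighted in the abstract.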

DOI

10.1038/s41467-025-66039-6

Other

GWAS-by-Subtraction

Tool

PUBMED_LINK

33414549

URL

https://github.com/GenomicSEM/GenomicSEM

TITLE

Investigating the genetic architecture of noncognitive skills using GWAS-by-subtraction.

Main citation

Demange PA, Malanchini M, Mallard TT, Biroli P, ...&, Nivard MG. (2021) Investigating the genetic architecture of noncognitive skills using GWAS-by-subtraction. Nat Genet, 53 (1) 35-44. doi:10.1038/s41588-020-00754-2. PMID 33414549

ABSTRACT

Little is known about the genetic architecture of traits affecting educational attainment other than cognitive ability. We used genomic structural equation modeling and prior genome-wide association studies (GWASs) of educational attainment (n = 1,131,881) and cognitive test performance (n = 257,841) to estimate SNP associations with educational attainment variation that is independent of cognitive ability. We identified 157 genome-wide-significant loci and a polygenic architecture accounting for 57% of genetic variance in educational attainment. Noncognitive genetics were enriched in the same brain tissues and cell types as cognitive performance, but showed different associations with gray-matter brain volumes. Noncognitive genetics were further distinguished by associations with personality traits, less risky behavior and increased risk for certain psychiatric disorders. For socioeconomic success and longevity, noncognitive and cognitive-performance genetics demonstrated associations of similar magnitude. By conducting a GWAS of a phenotype that was not directly measured, we offer a view of genetic architecture of noncognitive skills influencing educational success.

DOI

10.1038/s41588-020-00754-2

PGS-adjusted GWAS

PGS-adjusted GWAS

Tool

PUBMED_LINK

37723263

DESCRIPTION

Adjustment of GWAS analyses for polygenic scores (PGSs) increases the statistical power for discovery across all ancestries.

KEYWORDS

LOCO-PGSs, two-stage meta-analysis strategy

TITLE

Boosting the power of genome-wide association studies within and across ancestries by using polygenic scores.

Main citation

Campos AI, Namba S, Lin SC, Nam K, ...&, Yengo L. (2023) Boosting the power of genome-wide association studies within and across ancestries by using polygenic scores. Nat Genet, 55 (10) 1769-1776. doi:10.1038/s41588-023-01500-0. PMID 37723263

ABSTRACT

Genome-wide association studies (GWASs) have been mostly conducted in populations of European ancestry, which currently limits the transferability of their findings to other populations. Here, we show, through theory, simulations and applications to real data, that adjustment of GWAS analyses for polygenic scores (PGSs) increases the statistical power for discovery across all ancestries. We applied this method to analyze seven traits available in three large biobanks with participants of East Asian ancestry (n = 340,000 in total) and report 139 additional associations across traits. We also present a two-stage meta-analysis strategy whereby, in contributing cohorts, a PGS-adjusted GWAS is rerun using PGSs derived from a first round of a standard meta-analysis. On average, across traits, this approach yields a 1.26-fold increase in the number of detected associations (range 1.07- to 1.76-fold increase). Altogether, our study demonstrates the value of using PGSs to increase the power of GWASs in underrepresented populations and promotes such an analytical strategy for future GWAS meta-analyses.
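
The power gain from PGS adjustment comes from a smaller residual variance, which shrinks the standard error of the SNP effect. The sketch below illustrates this on simulated data under stated assumptions: the PGS is taken to be built from other chromosomes (as the LOCO-PGS keyword suggests) and is therefore independent of the tested SNP, and `snp_assoc` is a hypothetical helper rather than the authors' software.

```python
import numpy as np

def snp_assoc(y, snp, covariates=None):
    """OLS of y on a SNP plus optional covariates; returns (beta, se) for the SNP."""
    cols = [np.ones_like(y), snp]
    if covariates is not None:
        cols.append(covariates)
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    return beta[1], se

rng = np.random.default_rng(1)
n = 5000
snp = rng.binomial(2, 0.3, size=n).astype(float)
pgs = rng.normal(size=n)  # hypothetical LOCO PGS, independent of the tested SNP
y = 0.05 * snp + 0.6 * pgs + rng.normal(scale=0.8, size=n)

beta_raw, se_raw = snp_assoc(y, snp)        # standard GWAS model
beta_adj, se_adj = snp_assoc(y, snp, pgs)   # PGS-adjusted model
print("SE without PGS:", se_raw, "SE with PGS:", se_adj)
```

Here the PGS explains about 36% of the phenotypic variance, so the standard error should drop to roughly 80% of its unadjusted value.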

DOI

10.1038/s41588-023-01500-0

PGS-adjusted RVATs

Tool

PUBMED_LINK

36959364

FULL NAME

PGS-adjusted rare variant association tests

DESCRIPTION

Adjusting for common variant polygenic scores improves yield in gene-based rare variant association tests.

KEYWORDS

PGS, Rare variants

TITLE

Adjusting for common variant polygenic scores improves yield in rare variant association analyses.

Main citation

Jurgens SJ, Pirruccello JP, Choi SH, Morrill VN, ...&, Ellinor PT. (2023) Adjusting for common variant polygenic scores improves yield in rare variant association analyses. Nat Genet, 55 (4) 544-548. doi:10.1038/s41588-023-01342-w. PMID 36959364

ABSTRACT

With the emergence of large-scale sequencing data, methods for improving power in rare variant association tests are needed. Here we show that adjusting for common variant polygenic scores improves yield in gene-based rare variant association tests across 65 quantitative traits in the UK Biobank (up to 20% increase at α = 2.6 × 10⁻⁶), without marked increases in false-positive rates or genomic inflation. Benefits were seen for various models, with the largest improvements seen for efficient sparse mixed-effects models. Our results illustrate how polygenic score adjustment can efficiently improve power in rare variant association discovery.

DOI

10.1038/s41588-023-01342-w

Phenotype imputation

POP-GWAS

Tool

PUBMED_LINK

39349818

FULL NAME

Post-Prediction GWAS

DESCRIPTION

POP-TOOLS (POst-Prediction TOOLS) is a Python3-based command line toolkit for conducting valid and powerful machine learning (ML)-assisted genetic association studies.

URL

https://github.com/qlu-lab/POP-TOOLS

KEYWORDS

imputed phenotypes, 3 GWASs

TITLE

Valid inference for machine learning-assisted genome-wide association studies.

Main citation

Miao J, Wu Y, Sun Z, Miao X, ...&, Lu Q. (2024) Valid inference for machine learning-assisted genome-wide association studies. Nat Genet, 56 (11) 2361-2369. doi:10.1038/s41588-024-01934-0. PMID 39349818

ABSTRACT

Machine learning (ML) has become increasingly popular in almost all scientific disciplines, including human genetics. Owing to challenges related to sample collection and precise phenotyping, ML-assisted genome-wide association study (GWAS), which uses sophisticated ML techniques to impute phenotypes and then performs GWAS on the imputed outcomes, have become increasingly common in complex trait genetics research. However, the validity of ML-assisted GWAS associations has not been carefully evaluated. Here, we report pervasive risks for false-positive associations in ML-assisted GWAS and introduce Post-Prediction GWAS (POP-GWAS), a statistical framework that redesigns GWAS on ML-imputed outcomes. POP-GWAS ensures valid and powerful statistical inference irrespective of imputation quality and choice of algorithm, requiring only GWAS summary statistics as input. We employed POP-GWAS to perform a GWAS of bone mineral density derived from dual-energy X-ray absorptiometry imaging at 14 skeletal sites, identifying 89 new loci and revealing skeletal site-specific genetic architecture. Our framework offers a robust analytic solution for future ML-assisted GWAS.

DOI

10.1038/s41588-024-01934-0

Review

Review-Povysil

Tool

PUBMED_LINK

31605095

TITLE

Rare-variant collapsing analyses for complex traits: guidelines and applications.

Main citation

Povysil G, Petrovski S, Hostyk J, Aggarwal V, ...&, Goldstein DB. (2019) Rare-variant collapsing analyses for complex traits: guidelines and applications. Nat Rev Genet, 20 (12) 747-759. doi:10.1038/s41576-019-0177-4. PMID 31605095

ABSTRACT

The first phase of genome-wide association studies (GWAS) assessed the role of common variation in human disease. Advances optimizing and economizing high-throughput sequencing have enabled a second phase of association studies that assess the contribution of rare variation to complex disease in all protein-coding genes. Unlike the early microarray-based studies, sequencing-based studies catalogue the full range of genetic variation, including the evolutionarily youngest forms. Although the experience with common variants helped establish relevant standards for genome-wide studies, the analysis of rare variation introduces several challenges that require novel analysis approaches.

DOI

10.1038/s41576-019-0177-4

Single variant association tests

BOLT-LMM

Tool

PUBMED_LINK

25642633

DESCRIPTION

The BOLT-LMM software package currently consists of two main algorithms, the BOLT-LMM algorithm for mixed model association testing, and the BOLT-REML algorithm for variance components analysis (i.e., partitioning of SNP-heritability and estimation of genetic correlations).

URL

https://alkesgroup.broadinstitute.org/BOLT-LMM/BOLT-LMM_manual.html

KEYWORDS

non-infinitesimal model, mixture of two Gaussian distributions

TITLE

Efficient Bayesian mixed-model analysis increases association power in large cohorts.

Main citation

Loh PR, Tucker G, Bulik-Sullivan BK, Vilhjálmsson BJ, ...&, Price AL. (2015) Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat Genet, 47 (3) 284-90. doi:10.1038/ng.3190. PMID 25642633

ABSTRACT

Linear mixed models are a powerful statistical tool for identifying genetic associations and avoiding confounding. However, existing methods are computationally intractable in large cohorts and may not optimize power. All existing methods require time cost O(MN²) (where N is the number of samples and M is the number of SNPs) and implicitly assume an infinitesimal genetic architecture in which effect sizes are normally distributed, which can limit power. Here we present a far more efficient mixed-model association method, BOLT-LMM, which requires only a small number of O(MN) time iterations and increases power by modeling more realistic, non-infinitesimal genetic architectures via a Bayesian mixture prior on marker effect sizes. We applied BOLT-LMM to 9 quantitative traits in 23,294 samples from the Women's Genome Health Study (WGHS) and observed significant increases in power, consistent with simulations. Theory and simulations show that the boost in power increases with cohort size, making BOLT-LMM appealing for genome-wide association studies in large cohorts.

DOI

10.1038/ng.3190

EMMAX

Tool

PUBMED_LINK

20208533

FULL NAME

efficient mixed-model association eXpedited

DESCRIPTION

EMMAX is a statistical test for large-scale human or model organism association mapping that accounts for sample structure. In addition to the computational efficiency of the EMMA algorithm, EMMAX takes advantage of the fact that each locus explains only a small fraction of the variance of a complex trait, which makes it possible to avoid repeating the variance component estimation for every marker and substantially reduces the computational time of mixed-model association mapping.

URL

https://genome.sph.umich.edu/wiki/EMMAX

TITLE

Variance component model to account for sample structure in genome-wide association studies.

Main citation

Kang HM, Sul JH, Service SK, Zaitlen NA, ...&, Eskin E. (2010) Variance component model to account for sample structure in genome-wide association studies. Nat Genet, 42 (4) 348-54. doi:10.1038/ng.548. PMID 20208533

ABSTRACT

Although genome-wide association studies (GWASs) have identified numerous loci associated with complex traits, imprecise modeling of the genetic relatedness within study samples may cause substantial inflation of test statistics and possibly spurious associations. Variance component approaches, such as efficient mixed-model association (EMMA), can correct for a wide range of sample structures by explicitly accounting for pairwise relatedness between individuals, using high-density markers to model the phenotype distribution; but such approaches are computationally impractical. We report here a variance component approach implemented in publicly available software, EMMA eXpedited (EMMAX), that reduces the computational time for analyzing large GWAS data sets from years to hours. We apply this method to two human GWAS data sets, performing association analysis for ten quantitative traits from the Northern Finland Birth Cohort and seven common diseases from the Wellcome Trust Case Control Consortium. We find that EMMAX outperforms both principal component analysis and genomic control in correcting for sample structure.
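
The speed-up in a variance component approach of this kind comes from estimating the variance components once and reusing the resulting covariance matrix for every marker. The following is a minimal sketch of that idea, assuming the variance components `sigma_g2` and `sigma_e2` have already been estimated under the null model; the function name and the toy kinship matrix are illustrative and this is not the EMMAX program itself.

```python
import numpy as np

def gls_scan(y, snps, K, sigma_g2, sigma_e2):
    """Test each SNP by generalized least squares with a shared covariance matrix.

    V = sigma_g2 * K + sigma_e2 * I is built and inverted once, then reused
    for every marker. Returns one (beta, se) row per SNP.
    """
    n = len(y)
    V_inv = np.linalg.inv(sigma_g2 * K + sigma_e2 * np.eye(n))
    results = []
    for j in range(snps.shape[1]):
        X = np.column_stack([np.ones(n), snps[:, j]])
        XtVi = X.T @ V_inv
        cov = np.linalg.inv(XtVi @ X)      # covariance of the GLS estimates
        beta = cov @ (XtVi @ y)
        results.append((beta[1], np.sqrt(cov[1, 1])))
    return np.array(results)

# Toy usage with a kinship matrix built from standardized genotypes.
rng = np.random.default_rng(3)
n, m = 60, 5
snps = rng.binomial(2, 0.4, size=(n, m)).astype(float)
Z = (snps - snps.mean(axis=0)) / snps.std(axis=0)
K = Z @ Z.T / m
y = rng.normal(size=n)
print(gls_scan(y, snps, K, sigma_g2=0.5, sigma_e2=0.5))
```

When `K` is the identity matrix and `sigma_g2` is zero, the scan reduces to ordinary least squares, which is a convenient sanity check.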

DOI

10.1038/ng.548

GEMMA

Tool

PUBMED_LINK

22706312

FULL NAME

genome-wide efficient mixed-model association

DESCRIPTION

GEMMA is the software implementing the Genome-wide Efficient Mixed Model Association algorithm for a standard linear mixed model and some of its close relatives for genome-wide association studies (GWAS). It fits a standard linear mixed model (LMM) to account for population stratification and sample structure for single marker association tests. It fits a Bayesian sparse linear mixed model (BSLMM) using Markov chain Monte Carlo (MCMC) for estimating the proportion of variance in phenotypes explained (PVE) by typed genotypes (i.e. chip heritability), predicting phenotypes, and identifying associated markers by jointly modeling all markers while controlling for population structure. It is computationally efficient for large scale GWAS and uses freely available open-source numerical libraries.

URL

http://stephenslab.uchicago.edu/software.html#gemma

TITLE

Genome-wide efficient mixed-model analysis for association studies.

Main citation

Zhou X, Stephens M. (2012) Genome-wide efficient mixed-model analysis for association studies. Nat Genet, 44 (7) 821-4. doi:10.1038/ng.2310. PMID 22706312

ABSTRACT

Linear mixed models have attracted considerable attention recently as a powerful and effective tool for accounting for population stratification and relatedness in genetic association tests. However, existing methods for exact computation of standard test statistics are computationally impractical for even moderate-sized genome-wide association studies. To address this issue, several approximate methods have been proposed. Here, we present an efficient exact method, which we refer to as genome-wide efficient mixed-model association (GEMMA), that makes approximations unnecessary in many contexts. This method is approximately n times faster than the widely used exact method known as efficient mixed-model association (EMMA), where n is the sample size, making exact genome-wide association analysis computationally practical for large numbers of individuals.

DOI

10.1038/ng.2310

LDAK-KVIK

Tool

URL

http://www.ldak.org/

Main citation

Hof, J. P. & Speed, D. LDAK-KVIK performs fast and powerful mixed-model association analysis of quantitative and binary phenotypes. bioRxiv 2024.07.25.24311005 (2024) doi:10.1101/2024.07.25.24311005.

PLINK

Tool

PUBMED_LINK

17701901

DESCRIPTION

A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner. The focus of PLINK is purely on analysis of genotype/phenotype data, so there is no support for steps prior to this (e.g. study design and planning, generating genotype or CNV calls from raw data). Through integration with gPLINK and Haploview, there is some support for the subsequent visualization, annotation and storage of results.

URL

https://www.cog-genomics.org/plink/

TITLE

PLINK: a tool set for whole-genome association and population-based linkage analyses.

Main citation

Purcell S, Neale B, Todd-Brown K, Thomas L, ...&, Sham PC. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet, 81 (3) 559-75. doi:10.1086/519795. PMID 17701901

ABSTRACT

Whole-genome association studies (WGAS) bring new computational, as well as analytic, challenges to researchers. Many existing genetic-analysis tools are not designed to handle such large data sets in a convenient manner and do not necessarily exploit the new opportunities that whole-genome data bring. To address these issues, we developed PLINK, an open-source C/C++ WGAS tool set. With PLINK, large data sets comprising hundreds of thousands of markers genotyped for thousands of individuals can be rapidly manipulated and analyzed in their entirety. As well as providing tools to make the basic analytic steps computationally efficient, PLINK also supports some novel approaches to whole-genome data that take advantage of whole-genome coverage. We introduce PLINK and describe the five main domains of function: data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation. In particular, we focus on the estimation and use of identity-by-state and identity-by-descent information in the context of population-based whole-genome studies. This information can be used to detect and correct for population stratification and to identify extended chromosomal segments that are shared identical by descent between very distantly related individuals. Analysis of the patterns of segmental sharing has the potential to map disease loci that contain multiple rare variants in a population-based linkage analysis.

DOI

10.1086/519795

PLINK2

Tool

PUBMED_LINK

25722852

URL

https://www.cog-genomics.org/plink/2.0/

TITLE

Second-generation PLINK: rising to the challenge of larger and richer datasets.

Main citation

Chang CC, Chow CC, Tellier LC, Vattikuti S, ...&, Lee JJ. (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience, 4, 7. doi:10.1186/s13742-015-0047-8. PMID 25722852

ABSTRACT

BACKGROUND: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format. FINDINGS: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(√n)-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0). CONCLUSIONS: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.

DOI

10.1186/s13742-015-0047-8

POLMM

Tool

PUBMED_LINK

33836139

FULL NAME

proportional odds logistic mixed model (POLMM)

DESCRIPTION

Proportional Odds Logistic Mixed Model (POLMM) for ordinal categorical data analysis

URL

https://github.com/WenjianBI/POLMM

KEYWORDS

ordinal categorical phenotypes

TITLE

Efficient mixed model approach for large-scale genome-wide association studies of ordinal categorical phenotypes.

Main citation

Bi W, Zhou W, Dey R, Mukherjee B, ...&, Lee S. (2021) Efficient mixed model approach for large-scale genome-wide association studies of ordinal categorical phenotypes. Am J Hum Genet, 108 (5) 825-839. doi:10.1016/j.ajhg.2021.03.019. PMID 33836139

ABSTRACT

In genome-wide association studies, ordinal categorical phenotypes are widely used to measure human behaviors, satisfaction, and preferences. However, because of the lack of analysis tools, methods designed for binary or quantitative traits are commonly used inappropriately to analyze categorical phenotypes. To accurately model the dependence of an ordinal categorical phenotype on covariates, we propose an efficient mixed model association test, proportional odds logistic mixed model (POLMM). POLMM is computationally efficient to analyze large datasets with hundreds of thousands of samples, can control type I error rates at a stringent significance level regardless of the phenotypic distribution, and is more powerful than alternative methods. In contrast, the standard linear mixed model approaches cannot control type I error rates for rare variants when the phenotypic distribution is unbalanced, although they performed well when testing common variants. We applied POLMM to 258 ordinal categorical phenotypes on array genotypes and imputed samples from 408,961 individuals in UK Biobank. In total, we identified 5,885 genome-wide significant variants, of which, 424 variants (7.2%) are rare variants with MAF < 0.01.
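
For readers unfamiliar with the proportional odds model that POLMM builds on, the sketch below computes category probabilities from a linear predictor and a set of ordered cutpoints. It uses one common parameterization, logit P(Y <= j) = cutpoint_j - eta, which is an assumption here and may differ in sign convention from the package; the random effect for relatedness is omitted and the function name is hypothetical.

```python
import numpy as np

def proportional_odds_probs(eta, cutpoints):
    """Category probabilities under logit P(Y <= j) = cutpoints[j] - eta.

    `cutpoints` must be increasing; J - 1 cutpoints give J ordered categories.
    """
    cutpoints = np.asarray(cutpoints, dtype=float)
    cum = 1.0 / (1.0 + np.exp(-(cutpoints - eta)))   # P(Y <= j) for each cutpoint
    cum = np.concatenate([[0.0], cum, [1.0]])        # add P(Y <= -1) = 0 and P(Y <= J-1) = 1
    return np.diff(cum)

# A larger linear predictor (e.g. more copies of a risk allele) shifts
# probability mass towards the higher categories.
for eta in (0.0, 1.0):
    print(eta, proportional_odds_probs(eta, [-1.0, 1.0]))
```

A single linear predictor shifts all cumulative probabilities at once, which is what distinguishes this model from treating the ordinal phenotype as quantitative or collapsing it to binary.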

DOI

10.1016/j.ajhg.2021.03.019

QRGWAS

Tool

PUBMED_LINK

39085219

FULL NAME

Quantile regression GWAS

URL

https://github.com/Iuliana-Ionita-Laza/QRGWAS

TITLE

Genome-wide discovery for biomarkers using quantile regression at biobank scale.

Main citation

Wang C, Wang T, Kiryluk K, Wei Y, ...&, Ionita-Laza I. (2024) Genome-wide discovery for biomarkers using quantile regression at biobank scale. Nat Commun, 15 (1) 6460. doi:10.1038/s41467-024-50726-x. PMID 39085219

ABSTRACT

Genome-wide association studies (GWAS) for biomarkers important for clinical phenotypes can lead to clinically relevant discoveries. Conventional GWAS for quantitative traits are based on simplified regression models modeling the conditional mean of a phenotype as a linear function of genotype. We draw attention here to an alternative, lesser known approach, namely quantile regression that naturally extends linear regression to the analysis of the entire conditional distribution of a phenotype of interest. Quantile regression can be applied efficiently at biobank scale, while having some unique advantages such as (1) identifying variants with heterogeneous effects across quantiles of the phenotype distribution; (2) accommodating a wide range of phenotype distributions including non-normal distributions, with invariance of results to trait transformations; and (3) providing more detailed information about genotype-phenotype associations even for those associations identified by conventional GWAS. We show in simulations that quantile regression is powerful across both homogeneous and various heterogeneous models. Applications to 39 quantitative traits in the UK Biobank demonstrate that quantile regression can be a helpful complement to linear regression in GWAS and can identify variants with larger effects on high-risk subgroups of individuals but with lower or no contribution overall.
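
The first advantage listed in the abstract, detecting variants whose effect differs across quantiles, can be illustrated with a small simulation. The sketch below fits quantile regression with an iteratively reweighted least squares approximation to the pinball loss; this numerical shortcut is chosen here for self-containment and is an assumption, not the fitting procedure used by QRGWAS, and the function name is hypothetical.

```python
import numpy as np

def quantile_regression(x, y, tau, n_iter=200, eps=1e-6):
    """Approximate quantile regression of y on x (with intercept) for quantile tau.

    Uses iteratively reweighted least squares: the pinball loss of a residual r
    is rewritten as w * r**2 with w = |tau - 1{r < 0}| / |r|.
    Returns (intercept, slope).
    """
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]       # start from the OLS fit
    for _ in range(n_iter):
        r = y - X @ beta
        w = np.where(r >= 0, tau, 1.0 - tau) / np.maximum(np.abs(r), eps)
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, X.T @ (w * y))
    return beta

# Simulated variant that widens the phenotype distribution without moving its center.
rng = np.random.default_rng(2)
n = 4000
g = rng.binomial(2, 0.3, size=n).astype(float)
y = (1.0 + 0.5 * g) * rng.normal(size=n)

b50 = quantile_regression(g, y, 0.5)
b90 = quantile_regression(g, y, 0.9)
print("slope at median:", b50[1], "slope at 90th percentile:", b90[1])
```

In this setting a mean- or median-based analysis sees almost no effect, while the upper-quantile slope is clearly positive, which matches the abstract's point about variants with larger effects in high-risk subgroups.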

DOI

10.1038/s41467-024-50726-x

Quickdraws

Tool

PUBMED_LINK

39789286

DESCRIPTION

Quickdraws is a scalable method for genome-wide association studies (GWAS) of quantitative and binary traits. Running a GWAS with Quickdraws requires three main inputs: bed (and bgen) files containing the genetic variants used for model building and testing, phenotype files, and covariate files. Some analyses also need a list of model SNPs and a file describing close genetic relatives.

URL

https://palamaralab.github.io/software/quickdraws/manual/

TITLE

A scalable variational inference approach for increased mixed-model association power.

Main citation

Loya H, Kalantzis G, Cooper F, Palamara PF. (2025) A scalable variational inference approach for increased mixed-model association power. Nat Genet, 57 (2) 461-468. doi:10.1038/s41588-024-02044-7. PMID 39789286

ABSTRACT

The rapid growth of modern biobanks is creating new opportunities for large-scale genome-wide association studies (GWASs) and the analysis of complex traits. However, performing GWASs on millions of samples often leads to trade-offs between computational efficiency and statistical power, reducing the benefits of large-scale data collection efforts. We developed Quickdraws, a method that increases association power in quantitative and binary traits without sacrificing computational efficiency, leveraging a spike-and-slab prior on variant effects, stochastic variational inference and graphics processing unit acceleration. We applied Quickdraws to 79 quantitative and 50 binary traits in 405,088 UK Biobank samples, identifying 4.97% and 3.25% more associations than REGENIE and 22.71% and 7.07% more than FastGWA. Quickdraws had costs comparable to REGENIE, FastGWA and SAIGE on the UK Biobank Research Analysis Platform service, while being substantially faster than BOLT-LMM. These results highlight the promise of leveraging machine learning techniques for scalable GWASs without sacrificing power or robustness.
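As a rough illustration of what a spike-and-slab prior does to a variant effect, the sketch below works through the simplest closed-form case: one effect estimate with a known standard error. It is not how Quickdraws fits its model (the abstract describes stochastic variational inference over all variants jointly), and the prior settings `pi_slab` and `slab_var` are assumed values chosen only for the example.

```python
import math


def normal_pdf(x, var):
    """Density of a zero-mean normal with variance var, evaluated at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


def spike_slab_posterior(beta_hat, se, pi_slab, slab_var):
    """Posterior inclusion probability and posterior mean for one effect.

    Assumed model: beta_hat ~ N(beta, se^2), with prior
    beta ~ pi_slab * N(0, slab_var) + (1 - pi_slab) * (point mass at 0).
    """
    slab = pi_slab * normal_pdf(beta_hat, slab_var + se * se)
    spike = (1.0 - pi_slab) * normal_pdf(beta_hat, se * se)
    pip = slab / (slab + spike)
    shrinkage = slab_var / (slab_var + se * se)
    return pip, pip * shrinkage * beta_hat


# Hypothetical prior: 1% of variants have a nonzero effect.
for beta_hat in (0.0, 0.02, 0.05):
    pip, post_mean = spike_slab_posterior(beta_hat, se=0.01, pi_slab=0.01, slab_var=0.01)
    print(beta_hat, round(pip, 4), round(post_mean, 5))
```

Estimates close to zero are pulled almost entirely onto the spike, while estimates many standard errors from zero are kept nearly unshrunk. This selective shrinkage is the behavior such a prior is designed to provide.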

DOI

10.1038/s41588-024-02044-7

REGENIE

GWAS

PUBMED_LINK

34017140

DESCRIPTION

regenie is a C++ program for whole genome regression modelling of large genome-wide association studies. It is developed and supported by a team of scientists at the Regeneron Genetics Center.

URL

https://github.com/rgcgithub/regenie

KEYWORDS

whole genome regression

TITLE

Computationally efficient whole-genome regression for quantitative and binary traits.

Main citation

Mbatchou J, Barnard L, Backman J, Marcketta A, ...&, Marchini J. (2021) Computationally efficient whole-genome regression for quantitative and binary traits. Nat Genet, 53 (7) 1097-1103. doi:10.1038/s41588-021-00870-7. PMID 34017140

ABSTRACT

Genome-wide association analysis of cohorts with thousands of phenotypes is computationally expensive, particularly when accounting for sample relatedness or population structure. Here we present a novel machine-learning method called REGENIE for fitting a whole-genome regression model for quantitative and binary phenotypes that is substantially faster than alternatives in multi-trait analyses while maintaining statistical efficiency. The method naturally accommodates parallel analysis of multiple phenotypes and requires only local segments of the genotype matrix to be loaded in memory, in contrast to existing alternatives, which must load genome-wide matrices into memory. This results in substantial savings in compute time and memory usage. We introduce a fast, approximate Firth logistic regression test for unbalanced case-control phenotypes. The method is ideally suited to take advantage of distributed computing frameworks. We demonstrate the accuracy and computational benefits of this approach using the UK Biobank dataset with up to 407,746 individuals.

DOI

10.1038/s41588-021-00870-7

SAIGE

Tool

PUBMED_LINK

30104761

FULL NAME

Scalable and Accurate Implementation of GEneralized mixed model

DESCRIPTION

SAIGE is an R package that provides a Scalable and Accurate Implementation of the GEneralized mixed model (Chen, H. et al., 2016). It accounts for sample relatedness and is feasible for genetic association tests in large cohorts and biobanks (N > 400,000). SAIGE performs single-variant association tests for binary traits and quantitative traits. For binary traits, SAIGE uses the saddlepoint approximation (SPA) (Imhof, J. P., 1961; Kuonen, D., 1999; Dey, R. et al., 2017) to account for case-control imbalance.

URL

https://github.com/weizhouUMICH/SAIGE

KEYWORDS

case-control imbalance, saddlepoint approximation (SPA)

TITLE

Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies.

Main citation

Zhou W, Nielsen JB, Fritsche LG, Dey R, ...&, Lee S. (2018) Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat Genet, 50 (9) 1335-1341. doi:10.1038/s41588-018-0184-y. PMID 30104761

ABSTRACT

In genome-wide association studies (GWAS) for thousands of phenotypes in large biobanks, most binary traits have substantially fewer cases than controls. Both of the widely used approaches, the linear mixed model and the recently proposed logistic mixed model, perform poorly; they produce large type I error rates when used to analyze unbalanced case-control phenotypes. Here we propose a scalable and accurate generalized mixed model association test that uses the saddlepoint approximation to calibrate the distribution of score test statistics. This method, SAIGE (Scalable and Accurate Implementation of GEneralized mixed model), provides accurate P values even when case-control ratios are extremely unbalanced. SAIGE uses state-of-art optimization strategies to reduce computational costs; hence, it is applicable to GWAS for thousands of phenotypes by large biobanks. Through the analysis of UK Biobank data of 408,961 samples from white British participants with European ancestry for > 1,400 binary phenotypes, we show that SAIGE can efficiently analyze large sample data, controlling for unbalanced case-control ratios and sample relatedness.
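The effect of a saddlepoint approximation on a score statistic can be shown with a simplified sketch. It assumes independent samples (no mixed model), a one-sided upper tail, and a score S = sum g_i (Y_i - mu_i) with Y_i ~ Bernoulli(mu_i). The tail is computed with the Barndorff-Nielsen form of the approximation. None of this is taken from the SAIGE code, and the data are hypothetical.

```python
import math


def normal_upper_tail(z):
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def cgf(t, g, mu):
    """K(t), K'(t), K''(t) for S = sum g_i * (Y_i - mu_i)."""
    k0 = k1 = k2 = 0.0
    for gi, mi in zip(g, mu):
        e = mi * math.exp(gi * t)
        denom = 1.0 - mi + e
        p = e / denom  # exponentially tilted success probability
        k0 += math.log(denom) - t * gi * mi
        k1 += gi * p - gi * mi
        k2 += gi * gi * p * (1.0 - p)
    return k0, k1, k2


def spa_upper_tail(s, g, mu, lo=0.0, hi=20.0, iters=200):
    """Saddlepoint approximation to P(S >= s), for s > 0 inside the support."""
    # K'(t) is increasing in t, so bisection finds the root of K'(t) = s.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if cgf(mid, g, mu)[1] < s:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    k0, _, k2 = cgf(t, g, mu)
    w = math.sqrt(2.0 * (t * s - k0))
    v = t * math.sqrt(k2)
    return normal_upper_tail(w + math.log(v / w) / w)


# Hypothetical unbalanced trait: 1% prevalence, 1,000 carriers of one allele,
# 25 observed cases among them where 10 are expected.
g = [1.0] * 1000
mu = [0.01] * 1000
s = 25 - sum(gi * mi for gi, mi in zip(g, mu))

variance = cgf(0.0, g, mu)[2]
normal_p = normal_upper_tail(s / math.sqrt(variance))
spa_p = spa_upper_tail(s, g, mu)
print("normal approximation:", normal_p)
print("saddlepoint approximation:", spa_p)
```

With so few expected cases the score statistic is strongly right-skewed, so the normal approximation gives a much smaller upper-tail p-value than the saddlepoint approximation. The abstract attributes the type I error inflation on unbalanced phenotypes to this kind of miscalibration.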

DOI

10.1038/s41588-018-0184-y

fastGWA

Tool

PUBMED_LINK

31768069

URL

https://yanglab.westlake.edu.cn/software/gcta/#fastGWA

KEYWORDS

grid-search-based REML algorithm

TITLE

A resource-efficient tool for mixed model association analysis of large-scale data.

Main citation

Jiang L, Zheng Z, Qi T, Kemper KE, ...&, Yang J. (2019) A resource-efficient tool for mixed model association analysis of large-scale data. Nat Genet, 51 (12) 1749-1755. doi:10.1038/s41588-019-0530-8. PMID 31768069

ABSTRACT

The genome-wide association study (GWAS) has been widely used as an experimental design to detect associations between genetic variants and a phenotype. Two major confounding factors, population stratification and relatedness, could potentially lead to inflated GWAS test statistics and hence to spurious associations. Mixed linear model (MLM)-based approaches can be used to account for sample structure. However, genome-wide association (GWA) analyses in biobank samples such as the UK Biobank (UKB) often exceed the capability of most existing MLM-based tools especially if the number of traits is large. Here, we develop an MLM-based tool (fastGWA) that controls for population stratification by principal components and for relatedness by a sparse genetic relationship matrix for GWA analyses of biobank-scale data. We demonstrate by extensive simulations that fastGWA is reliable, robust and highly resource-efficient. We then apply fastGWA to 2,173 traits on array-genotyped and imputed samples from 456,422 individuals and to 2,048 traits on whole-exome-sequenced samples from 46,191 individuals in the UKB.
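The sparse genetic relationship matrix mentioned in the abstract can be pictured as a dense GRM with small off-diagonal entries set to zero, so that only close relatives remain linked. The sketch below is a minimal illustration under that assumption. The 0.05 cutoff, the function name, and the example matrix are hypothetical, and no REML fitting or association testing is shown.

```python
def sparsify_grm(grm, cutoff=0.05):
    """Keep the diagonal and any off-diagonal entry above the cutoff.

    Returns a dict {(i, j): value} over the lower triangle (j <= i),
    which stores only the pairs treated as related.
    """
    sparse = {}
    for i, row in enumerate(grm):
        for j in range(i + 1):
            if i == j or row[j] > cutoff:
                sparse[(i, j)] = row[j]
    return sparse


# Hypothetical dense GRM for four samples; samples 0 and 1 resemble
# first-degree relatives, all other pairs are effectively unrelated.
grm = [
    [1.01, 0.49, 0.004, -0.012],
    [0.49, 0.98, 0.011, 0.002],
    [0.004, 0.011, 1.00, 0.021],
    [-0.012, 0.002, 0.021, 0.99],
]

sparse = sparsify_grm(grm)
print(len(sparse), "stored entries out of", len(grm) * (len(grm) + 1) // 2)
for (i, j), value in sorted(sparse.items()):
    print(i, j, value)
```

In a biobank where most pairs are unrelated, almost all off-diagonal entries are dropped, so storage and computation grow roughly with the number of related pairs rather than with the square of the sample size.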

DOI

10.1038/s41588-019-0530-8

fastGWA-GLMM

Tool

PUBMED_LINK

34737426

URL

https://yanglab.westlake.edu.cn/software/gcta/#fastGWA

TITLE

A generalized linear mixed model association tool for biobank-scale data.

Main citation

Jiang L, Zheng Z, Fang H, Yang J. (2021) A generalized linear mixed model association tool for biobank-scale data. Nat Genet, 53 (11) 1616-1621. doi:10.1038/s41588-021-00954-4. PMID 34737426

ABSTRACT

Compared with linear mixed model-based genome-wide association (GWA) methods, generalized linear mixed model (GLMM)-based methods have better statistical properties when applied to binary traits but are computationally much slower. In the present study, leveraging efficient sparse matrix-based algorithms, we developed a GLMM-based GWA tool, fastGWA-GLMM, that is severalfold to orders of magnitude faster than the state-of-the-art tools when applied to the UK Biobank (UKB) data and scalable to cohorts with millions of individuals. We show by simulation that the fastGWA-GLMM test statistics of both common and rare variants are well calibrated under the null, even for traits with extreme case-control ratios. We applied fastGWA-GLMM to the UKB data of 456,348 individuals, 11,842,647 variants and 2,989 binary traits (full summary statistics available at http://fastgwa.info/ukbimpbin), and identified 259 rare variants associated with 75 traits, demonstrating the use of imputed genotype data in a large cohort to discover rare variants for binary complex traits.

DOI

10.1038/s41588-021-00954-4