Tool

https://github.com/medical-genomics-group/gmrm

TITLE

Simultaneous discovery, estimation and prediction analysis of complex traits using a bayesian mixture model.

Main citation

Moser G, Lee SH, Hayes BJ, Goddard ME, ...&, Visscher PM. (2015) Simultaneous discovery, estimation and prediction analysis of complex traits using a bayesian mixture model. PLoS Genet, 11 (4) e1004969. doi:10.1371/journal.pgen.1004969. PMID 25849665

ABSTRACT

Gene discovery, estimation of heritability captured by SNP arrays, inference on genetic architecture and prediction analyses of complex traits are usually performed using different statistical models and methods, leading to inefficiency and loss of power. Here we use a Bayesian mixture model that simultaneously allows variant discovery, estimation of genetic variance explained by all variants and prediction of unobserved phenotypes in new samples. We apply the method to simulated data of quantitative traits and Wellcome Trust Case Control Consortium (WTCCC) data on disease and show that it provides accurate estimates of SNP-based heritability, produces unbiased estimators of risk in new samples, and that it can estimate genetic architecture by partitioning variation across hundreds to thousands of SNPs. We estimated that, depending on the trait, 2,633 to 9,411 SNPs explain all of the SNP-based heritability in the WTCCC diseases. The majority of those SNPs (>96%) had small effects, confirming a substantial polygenic component to common diseases. The proportion of the SNP-based variance explained by large effects (each SNP explaining 1% of the variance) varied markedly between diseases, ranging from almost zero for bipolar disorder to 72% for type 1 diabetes. Prediction analyses demonstrate that for diseases with major loci, such as type 1 diabetes and rheumatoid arthritis, Bayesian methods outperform profile scoring or mixed model approaches.

DOI

10.1371/journal.pgen.1004969

BayesRR-RC

Tool

PUBMED_LINK

34848700

DESCRIPTION

gmrm is hybrid-parallel software for a Bayesian grouped mixture of regressions model for genome-wide association studies (GWAS). It is written in C++ using extensive optimisations and code vectorisation. It relies on PLINK's .bed format. It can handle multiple traits simultaneously.

URL

TITLE

Probabilistic inference of the genetic architecture underlying functional enrichment of complex traits.

Main citation

Patxot M, Banos DT, Kousathanas A, Orliac EJ, ...&, Robinson MR. (2021) Probabilistic inference of the genetic architecture underlying functional enrichment of complex traits. Nat Commun, 12 (1) 6972. doi:10.1038/s41467-021-27258-9. PMID 34848700

ABSTRACT

We develop a Bayesian model (BayesRR-RC) that provides robust SNP-heritability estimation, an alternative to marker discovery, and accurate genomic prediction, taking 22 seconds per iteration to estimate 8.4 million SNP-effects and 78 SNP-heritability parameters in the UK Biobank. We find that only ≤10% of the genetic variation captured for height, body mass index, cardiovascular disease, and type 2 diabetes is attributable to proximal regulatory regions within 10kb upstream of genes, while 12-25% is attributed to coding regions, 32-44% to introns, and 22-28% to distal 10-500kb upstream regions. Up to 24% of all cis and coding regions of each chromosome are associated with each trait, with over 3,100 independent exonic and intronic regions and over 5,400 independent regulatory regions having ≥95% probability of contributing ≥0.001% to the genetic variance of these four traits. Our open-source software (GMRM) provides a scalable alternative to current approaches for biobank data.

DOI

10.1038/s41467-021-27258-9

BayesS

Tool

PUBMED_LINK

29662166

URL

https://faculty.washington.edu/browning/beagle/beagle.html

TITLE

Signatures of negative selection in the genetic architecture of human complex traits.

Main citation

Zeng J, de Vlaming R, Wu Y, Robinson MR, ...&, Yang J. (2018) Signatures of negative selection in the genetic architecture of human complex traits. Nat Genet, 50 (5) 746-753. doi:10.1038/s41588-018-0101-4. PMID 29662166

ABSTRACT

We develop a Bayesian mixed linear model that simultaneously estimates single-nucleotide polymorphism (SNP)-based heritability, polygenicity (proportion of SNPs with nonzero effects), and the relationship between SNP effect size and minor allele frequency for complex traits in conventionally unrelated individuals using genome-wide SNP data. We apply the method to 28 complex traits in the UK Biobank data (N = 126,752) and show that on average, 6% of SNPs have nonzero effects, which in total explain 22% of phenotypic variance. We detect significant (P < 0.05/28) signatures of natural selection in the genetic architecture of 23 traits, including reproductive, cardiovascular, and anthropometric traits, as well as educational attainment. The significant estimates of the relationship between effect size and minor allele frequency in complex traits are consistent with a model of negative (or purifying) selection, as confirmed by forward simulation. We conclude that negative selection acts pervasively on the genetic variants associated with human complex traits.

DOI

10.1038/s41588-018-0101-4

BEAGLE

Tool

PUBMED_LINK

17924348

URL

TITLE

Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering.

Main citation

Browning SR, Browning BL. (2007) Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am J Hum Genet, 81 (5) 1084-97. doi:10.1086/521987. PMID 17924348

ABSTRACT

Whole-genome association studies present many new statistical and computational challenges due to the large quantity of data obtained. One of these challenges is haplotype inference; methods for haplotype inference designed for small data sets from candidate-gene studies do not scale well to the large number of individuals genotyped in whole-genome association studies. We present a new method and software for inference of haplotype phase and missing data that can accurately phase data from whole-genome association studies, and we present the first comparison of haplotype-inference methods for real and simulated data sets with thousands of genotyped individuals. We find that our method outperforms existing methods in terms of both speed and accuracy for large data sets with thousands of individuals and densely spaced genetic markers, and we use our method to phase a real data set of 3,002 individuals genotyped for 490,032 markers in 3.1 days of computing time, with 99% of masked alleles imputed correctly. Our method is implemented in the Beagle software package, which is freely available.

DOI

10.1086/521987

BEAGLE4

Tool

PUBMED_LINK

26748515

DESCRIPTION

(beagle 4.1)

URL

https://faculty.washington.edu/browning/beagle/beagle.html

TITLE

Genotype Imputation with Millions of Reference Samples.

Main citation

Browning BL, Browning SR. (2016) Genotype Imputation with Millions of Reference Samples. Am J Hum Genet, 98 (1) 116-26. doi:10.1016/j.ajhg.2015.11.020. PMID 26748515

ABSTRACT

We present a genotype imputation method that scales to millions of reference samples. The imputation method, based on the Li and Stephens model and implemented in Beagle v.4.1, is parallelized and memory efficient, making it well suited to multi-core computer processors. It achieves fast, accurate, and memory-efficient genotype imputation by restricting the probability model to markers that are genotyped in the target samples and by performing linear interpolation to impute ungenotyped variants. We compare Beagle v.4.1 with Impute2 and Minimac3 by using 1000 Genomes Project data, UK10K Project data, and simulated data. All three methods have similar accuracy but different memory requirements and different computation times. When imputing 10 Mb of sequence data from 50,000 reference samples, Beagle's throughput was more than 100× greater than Impute2's throughput on our computer servers. When imputing 10 Mb of sequence data from 200,000 reference samples in VCF format, Minimac3 consumed 26× more memory per computational thread and 15× more CPU time than Beagle. We demonstrate that Beagle v.4.1 scales to much larger reference panels by performing imputation from a simulated reference panel having 5 million samples and a mean marker density of one marker per four base pairs.
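
The linear-interpolation step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a probability for the alternate allele is already available at the genotyped markers and fills in an ungenotyped position between its two flanking markers. The function name, positions, and probabilities are hypothetical, and this is not Beagle's implementation, which works on the probabilities produced by its haplotype model.

```python
from bisect import bisect_right


def interpolate_allele_prob(genotyped_pos, genotyped_prob, target_pos):
    """Linearly interpolate an allele probability at an ungenotyped position.

    genotyped_pos  -- sorted positions of genotyped markers (hypothetical units)
    genotyped_prob -- alternate-allele probability at each genotyped marker
    target_pos     -- position of the ungenotyped variant
    """
    if not genotyped_pos or len(genotyped_pos) != len(genotyped_prob):
        raise ValueError("positions and probabilities must be non-empty and of equal length")
    # Outside the genotyped range, fall back to the nearest flanking marker.
    if target_pos <= genotyped_pos[0]:
        return genotyped_prob[0]
    if target_pos >= genotyped_pos[-1]:
        return genotyped_prob[-1]
    # Locate the flanking markers: pos[left] <= target_pos < pos[right].
    right = bisect_right(genotyped_pos, target_pos)
    left = right - 1
    span = genotyped_pos[right] - genotyped_pos[left]
    weight = (target_pos - genotyped_pos[left]) / span
    return (1.0 - weight) * genotyped_prob[left] + weight * genotyped_prob[right]
```

A variant halfway between two markers with probabilities 0.2 and 0.6 receives the midpoint of the two values; restricting the model to genotyped markers and interpolating the rest is what the abstract credits for the method's speed and memory efficiency.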

DOI

10.1016/j.ajhg.2015.11.020

BEAGLE5.4 (Imputation)

Tool

PUBMED_LINK

30100085

DESCRIPTION

(beagle 5.4 imputation)

URL

https://faculty.washington.edu/browning/beagle/beagle.html

TITLE

A One-Penny Imputed Genome from Next-Generation Reference Panels.

Main citation

Browning BL, Zhou Y, Browning SR. (2018) A One-Penny Imputed Genome from Next-Generation Reference Panels. Am J Hum Genet, 103 (3) 338-348. doi:10.1016/j.ajhg.2018.07.015. PMID 30100085

ABSTRACT

Genotype imputation is commonly performed in genome-wide association studies because it greatly increases the number of markers that can be tested for association with a trait. In general, one should perform genotype imputation using the largest reference panel that is available because the number of accurately imputed variants increases with reference panel size. However, one impediment to using larger reference panels is the increased computational cost of imputation. We present a new genotype imputation method, Beagle 5.0, which greatly reduces the computational cost of imputation from large reference panels. We compare Beagle 5.0 with Beagle 4.1, Impute4, Minimac3, and Minimac4 using 1000 Genomes Project data, Haplotype Reference Consortium data, and simulated data for 10k, 100k, 1M, and 10M reference samples. All methods produce nearly identical accuracy, but Beagle 5.0 has the lowest computation time and the best scaling of computation time with increasing reference panel size. For 10k, 100k, 1M, and 10M reference samples and 1,000 phased target samples, Beagle 5.0's computation time is 3× (10k), 12× (100k), 43× (1M), and 533× (10M) faster than the fastest alternative method. Cost data from the Amazon Elastic Compute Cloud show that Beagle 5.0 can perform genome-wide imputation from 10M reference samples into 1,000 phased target samples at a cost of less than one US cent per sample.

DOI

10.1016/j.ajhg.2018.07.015

BEAGLE5.4 (Phasing)

Tool

PUBMED_LINK

34478634

DESCRIPTION

(beagle 5.4 phasing)

URL

https://faculty.washington.edu/browning/beagle/beagle.html

TITLE

Fast two-stage phasing of large-scale sequence data.

Main citation

Browning BL, Tian X, Zhou Y, Browning SR. (2021) Fast two-stage phasing of large-scale sequence data. Am J Hum Genet, 108 (10) 1880-1890. doi:10.1016/j.ajhg.2021.08.005. PMID 34478634

ABSTRACT

Haplotype phasing is the estimation of haplotypes from genotype data. We present a fast, accurate, and memory-efficient haplotype phasing method that scales to large-scale SNP array and sequence data. The method uses marker windowing and composite reference haplotypes to reduce memory usage and computation time. It incorporates a progressive phasing algorithm that identifies confidently phased heterozygotes in each iteration and fixes the phase of these heterozygotes in subsequent iterations. For data with many low-frequency variants, such as whole-genome sequence data, the method employs a two-stage phasing algorithm that phases high-frequency markers via progressive phasing in the first stage and phases low-frequency markers via genotype imputation in the second stage. This haplotype phasing method is implemented in the open-source Beagle 5.2 software package. We compare Beagle 5.2 and SHAPEIT 4.2.1 by using expanding subsets of 485,301 UK Biobank samples and 38,387 TOPMed samples. Both methods have very similar accuracy and computation time for UK Biobank SNP array data. However, for TOPMed sequence data, Beagle is more than 20 times faster than SHAPEIT, achieves similar accuracy, and scales to larger sample sizes.

DOI

10.1016/j.ajhg.2021.08.005

BEATRICE

Tool

PUBMED_LINK

39360993

FULL NAME

Bayesian finE-mapping from summAry daTa using deep vaRiational InferenCE

DESCRIPTION

In this repository, we introduce BEATRICE, a finemapping tool to identify putative causal variants from GWAS summary data. BEATRICE combines a hierarchical Bayesian model with a deep learning-based inference procedure. This combination provides greater inferential power to handle noise and spurious interactions due to polygenicity of the trait, trans-interactions of variants, or varying correlation structure of the genomic region.

URL

https://github.com/sayangsep/Beatrice-Finemapping

TITLE

BEATRICE: Bayesian fine-mapping from summary data using deep variational inference.

Main citation

Ghosal S, Schatz MC, Venkataraman A. (2024) BEATRICE: Bayesian fine-mapping from summary data using deep variational inference. Bioinformatics, 40 (10). doi:10.1093/bioinformatics/btae590. PMID 39360993

ABSTRACT

MOTIVATION: We introduce a novel framework BEATRICE to identify putative causal variants from GWAS statistics. Identifying causal variants is challenging due to their sparsity and high correlation in the nearby regions. To account for these challenges, we rely on a hierarchical Bayesian model that imposes a binary concrete prior on the set of causal variants. We derive a variational algorithm for this fine-mapping problem by minimizing the KL divergence between an approximate density and the posterior probability distribution of the causal configurations. Correspondingly, we use a deep neural network as an inference machine to estimate the parameters of our proposal distribution. Our stochastic optimization procedure allows us to sample from the space of causal configurations, which we use to compute the posterior inclusion probabilities and determine credible sets for each causal variant. We conduct a detailed simulation study to quantify the performance of our framework against two state-of-the-art baseline methods across different numbers of causal variants and noise paradigms, as defined by the relative genetic contributions of causal and noncausal variants. RESULTS: We demonstrate that BEATRICE achieves uniformly better coverage with comparable power and set sizes, and that the performance gain increases with the number of causal variants. We also show the efficacy of BEATRICE in finding causal variants from the GWAS study of Alzheimer's disease. In comparison to the baselines, only BEATRICE can successfully find the APOE ϵ2 allele, a commonly associated variant of Alzheimer's. AVAILABILITY AND IMPLEMENTATION: BEATRICE is available for download at https://github.com/sayangsep/Beatrice-Finemapping.

DOI

10.1093/bioinformatics/btae590

Benchmark-Wang

Tool

PUBMED_LINK

36585786

TITLE

A comprehensive investigation of statistical and machine learning approaches for predicting complex human diseases on genomic variants.

Main citation

Wang C, Zhang J, Veldsman WP, Zhou X, ...&, Zhang L. (2023) A comprehensive investigation of statistical and machine learning approaches for predicting complex human diseases on genomic variants. Brief Bioinform, 24 (1). doi:10.1093/bib/bbac552. PMID 36585786

ABSTRACT

Quantifying an individual's risk for common diseases is an important goal of precision health. The polygenic risk score (PRS), which aggregates multiple risk alleles of candidate diseases, has emerged as a standard approach for identifying high-risk individuals. Although several studies have been performed to benchmark the PRS calculation tools and assess their potential to guide future clinical applications, some issues remain to be further investigated, such as lacking (i) various simulated data with different genetic effects; (ii) evaluation of machine learning models and (iii) evaluation on multiple ancestries studies. In this study, we systematically validated and compared 13 statistical methods, 5 machine learning models and 2 ensemble models using simulated data with additive and genetic interaction models, 22 common diseases with internal training sets, 4 common diseases with external summary statistics and 3 common diseases for trans-ancestry studies in UK Biobank. The statistical methods were better in simulated data from additive models and machine learning models have edges for data that include genetic interactions. Ensemble models are generally the best choice by integrating various statistical methods. LDpred2 outperformed the other standalone tools, whereas PRS-CS, lassosum and DBSLMM showed comparable performance. We also identified that disease heritability strongly affected the predictive performance of all methods. Both the number and effect sizes of risk SNPs are important; and sample size strongly influences the performance of all methods. For the trans-ancestry studies, we found that the performance of most methods became worse when training and testing sets were from different populations.

DOI

10.1093/bib/bbac552

BOLT-LMM

Tool

PUBMED_LINK

25642633

DESCRIPTION

The BOLT-LMM software package currently consists of two main algorithms, the BOLT-LMM algorithm for mixed model association testing, and the BOLT-REML algorithm for variance components analysis (i.e., partitioning of SNP-heritability and estimation of genetic correlations).

URL

https://alkesgroup.broadinstitute.org/BOLT-LMM/BOLT-LMM_manual.html

KEYWORDS

non-infinitesimal model, mixture of two Gaussian distributions

TITLE

Efficient Bayesian mixed-model analysis increases association power in large cohorts.

Main citation

Loh PR, Tucker G, Bulik-Sullivan BK, Vilhjálmsson BJ, ...&, Price AL. (2015) Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat Genet, 47 (3) 284-90. doi:10.1038/ng.3190. PMID 25642633

ABSTRACT

Linear mixed models are a powerful statistical tool for identifying genetic associations and avoiding confounding. However, existing methods are computationally intractable in large cohorts and may not optimize power. All existing methods require time cost O(MN²) (where N is the number of samples and M is the number of SNPs) and implicitly assume an infinitesimal genetic architecture in which effect sizes are normally distributed, which can limit power. Here we present a far more efficient mixed-model association method, BOLT-LMM, which requires only a small number of O(MN) time iterations and increases power by modeling more realistic, non-infinitesimal genetic architectures via a Bayesian mixture prior on marker effect sizes. We applied BOLT-LMM to 9 quantitative traits in 23,294 samples from the Women's Genome Health Study (WGHS) and observed significant increases in power, consistent with simulations. Theory and simulations show that the boost in power increases with cohort size, making BOLT-LMM appealing for genome-wide association studies in large cohorts.
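
The mixture prior on marker effect sizes, listed in the keywords as a mixture of two Gaussian distributions, can be written down as a density in a few lines. The sketch below is only an illustration of that prior's shape: the function and parameter names (`p_large`, `var_large`, `var_small`) are hypothetical and are not taken from the BOLT-LMM software.

```python
import math


def normal_pdf(x, variance):
    """Density of a zero-mean normal distribution with the given variance."""
    return math.exp(-x * x / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def mixture_prior_pdf(beta, p_large, var_large, var_small):
    """Density of a two-component Gaussian mixture prior on an effect size.

    beta      -- marker effect size
    p_large   -- prior probability that the marker has a large-variance effect
    var_large -- variance of the large-effect component (assumed name)
    var_small -- variance of the small-effect component (assumed name)
    """
    if not 0.0 <= p_large <= 1.0:
        raise ValueError("p_large must be between 0 and 1")
    if var_large <= 0.0 or var_small <= 0.0:
        raise ValueError("variances must be positive")
    return (p_large * normal_pdf(beta, var_large)
            + (1.0 - p_large) * normal_pdf(beta, var_small))
```

When the two variances are equal the mixture collapses to a single normal distribution, which corresponds to the infinitesimal architecture that the abstract says other methods assume implicitly.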

DOI

10.1038/ng.3190

BridgePRS

Tool

PUBMED_LINK

38123642

DESCRIPTION

BridgePRS is a Bayesian-ridge (Bridge) approach, which "bridges" the PRS between two populations of different ancestry, developed to tackle the "PRS Portability Problem". The PRS Portability Problem causes lower-accuracy PRS in underrepresented populations due to biased sampling in GWAS data collection.

URL

https://www.bridgeprs.net/

TITLE

BridgePRS leverages shared genetic effects across ancestries to increase polygenic risk score portability.

Main citation

Hoggart CJ, Choi SW, García-González J, Souaiaia T, ...&, O'Reilly PF. (2024) BridgePRS leverages shared genetic effects across ancestries to increase polygenic risk score portability. Nat Genet, 56 (1) 180-186. doi:10.1038/s41588-023-01583-9. PMID 38123642

ABSTRACT

Here we present BridgePRS, a novel Bayesian polygenic risk score (PRS) method that leverages shared genetic effects across ancestries to increase PRS portability. We evaluate BridgePRS via simulations and real UK Biobank data across 19 traits in individuals of African, South Asian and East Asian ancestry, using both UK Biobank and Biobank Japan genome-wide association study summary statistics; out-of-cohort validation is performed in the Mount Sinai (New York) BioMe biobank. BridgePRS is compared with the leading alternative, PRS-CSx, and two other PRS methods. Simulations suggest that the performance of BridgePRS relative to PRS-CSx increases as uncertainty increases: with lower trait heritability, higher polygenicity and greater between-population genetic diversity; and when causal variants are not present in the data. In real data, BridgePRS has a 61% larger average R² than PRS-CSx in out-of-cohort prediction of African ancestry samples in BioMe (P = 6 × 10⁻⁵). BridgePRS is a computationally efficient, user-friendly and powerful approach for PRS analyses in non-European ancestries.

DOI

10.1038/s41588-023-01583-9

CAFEH

Tool

PUBMED_LINK

35085493

FULL NAME

colocalization and fine-mapping in the presence of allelic heterogeneity

DESCRIPTION

CAFEH is a method that performs finemapping and colocalization jointly over multiple phenotypes. CAFEH can be run with 10s of phenotypes and 1000s of variants in a few minutes.

URL

https://github.com/karltayeb/cafeh

KEYWORDS

multi-trait, finemapping, colocalization

TITLE

Redefining tissue specificity of genetic regulation of gene expression in the presence of allelic heterogeneity.

Main citation

Arvanitis M, Tayeb K, Strober BJ, Battle A. (2022) Redefining tissue specificity of genetic regulation of gene expression in the presence of allelic heterogeneity. Am J Hum Genet, 109 (2) 223-239. doi:10.1016/j.ajhg.2022.01.002. PMID 35085493

ABSTRACT

Uncovering the functional impact of genetic variation on gene expression is important in understanding tissue biology and the pathogenesis of complex traits. Despite large efforts to map expression quantitative trait loci (eQTLs) across many human tissues, our ability to translate those findings to understanding human disease has been incomplete, and the majority of disease loci are not explained by association with expression of a target gene. Cell-type specificity and the presence of multiple independent causal variants for many eQTLs are potential confounders contributing to the apparent discrepancy with disease loci. In this study, we investigate the tissue specificity of genetic effects on gene expression and the overlap with disease loci while considering the presence of multiple causal variants within and across tissues. We find evidence of pervasive tissue specificity of eQTLs, often masked by linkage disequilibrium that misleads traditional meta-analytic approaches. We propose CAFEH (colocalization and fine-mapping in the presence of allelic heterogeneity), a Bayesian method that integrates genetic association data across multiple traits, incorporating linkage disequilibrium to identify causal variants. CAFEH outperforms previous approaches in colocalization and fine-mapping. Using CAFEH, we show that genes with highly tissue-specific genetic effects are under greater selection, enriched in differentiation and developmental processes, and more likely to be involved in human disease. Last, we demonstrate that CAFEH can efficiently leverage the widespread allelic heterogeneity in genetic regulation of gene expression to prioritize the target tissue in genome-wide association complex trait loci, thereby improving our ability to interpret complex trait genetics.

DOI

10.1016/j.ajhg.2022.01.002

CalPred

Tool

PUBMED_LINK

38886587

FULL NAME

Calibrated prediction intervals

DESCRIPTION

A statistical framework that jointly models the effects of all contexts on PGS accuracy, with parameters learned in a calibration dataset.

URL

https://github.com/KangchengHou/calpred

KEYWORDS

trait prediction intervals

TITLE

Calibrated prediction intervals for polygenic scores across diverse contexts.

Main citation

Hou K, Xu Z, Ding Y, Mandla R, ...&, Pasaniuc B. (2024) Calibrated prediction intervals for polygenic scores across diverse contexts. Nat Genet, 56 (7) 1386-1396. doi:10.1038/s41588-024-01792-w. PMID 38886587

ABSTRACT

Polygenic scores (PGS) have emerged as the tool of choice for genomic prediction in a wide range of fields. We show that PGS performance varies broadly across contexts and biobanks. Contexts such as age, sex and income can impact PGS accuracy with similar magnitudes as genetic ancestry. Here we introduce an approach (CalPred) that models all contexts jointly to produce prediction intervals that vary across contexts to achieve calibration (include the trait with 90% probability), whereas existing methods are miscalibrated. In analyses of 72 traits across large and diverse biobanks (All of Us and UK Biobank), we find that prediction intervals required adjustment by up to 80% for quantitative traits. For disease traits, PGS-based predictions were miscalibrated across socioeconomic contexts such as annual household income levels, further highlighting the need of accounting for context information in PGS-based prediction across diverse populations.

DOI

10.1038/s41588-024-01792-w

Cancer PRSweb

Tool

PUBMED_LINK

32991828

DESCRIPTION

Our framework condenses these summary statistics into PRS using linkage disequilibrium pruning and p-value thresholding (fixed or data-adaptively optimized thresholds) or penalized, genome-wide effect size weighting. We evaluate them in the cancer-enriched cohort of the Michigan Genomics Initiative (MGI), a longitudinal biorepository effort at Michigan Medicine, and in the population-based UK Biobank Study (UKB). For each PRS construct, measures on performance, calibration, and discrimination are provided. Beyond the cancer PRS evaluation in MGI and UKB, the PRSweb platform features construct downloads, risk evaluation in the top percentiles, and phenome-wide PRS association studies (PRS-PheWAS) for a subset of PRS that are predictive for the primary cancer.

URL

https://prsweb.sph.umich.edu:8443/

KEYWORDS

Cancer PRS

TITLE

Cancer PRSweb: An Online Repository with Polygenic Risk Scores for Major Cancer Traits and Their Evaluation in Two Independent Biobanks.

Main citation

Fritsche LG, Patil S, Beesley LJ, VandeHaar P, ...&, Mukherjee B. (2020) Cancer PRSweb: An Online Repository with Polygenic Risk Scores for Major Cancer Traits and Their Evaluation in Two Independent Biobanks. Am J Hum Genet, 107 (5) 815-836. doi:10.1016/j.ajhg.2020.08.025. PMID 32991828

ABSTRACT

To facilitate scientific collaboration on polygenic risk scores (PRSs) research, we created an extensive PRS online repository for 35 common cancer traits integrating freely available genome-wide association studies (GWASs) summary statistics from three sources: published GWASs, the NHGRI-EBI GWAS Catalog, and UK Biobank-based GWASs. Our framework condenses these summary statistics into PRSs using various approaches such as linkage disequilibrium pruning/p value thresholding (fixed or data-adaptively optimized thresholds) and penalized, genome-wide effect size weighting. We evaluated the PRSs in two biobanks: the Michigan Genomics Initiative (MGI), a longitudinal biorepository effort at Michigan Medicine, and the population-based UK Biobank (UKB). For each PRS construct, we provide measures on predictive performance and discrimination. Besides PRS evaluation, the Cancer-PRSweb platform features construct downloads and phenome-wide PRS association study results (PRS-PheWAS) for predictive PRSs. We expect this integrated platform to accelerate PRS-related cancer research.
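
As a rough illustration of the p-value thresholding approach named in the abstract, the sketch below scores one individual as the sum of effect size times effect-allele dosage over the variants that pass a threshold. It assumes linkage disequilibrium pruning has already been applied upstream; the function name, the dictionary layout, and the variant identifiers are hypothetical and do not reflect the PRSweb pipeline.

```python
def prs_pvalue_threshold(summary_stats, dosages, p_threshold):
    """Compute a polygenic risk score using p-value thresholding.

    summary_stats -- dict: variant id -> (effect_size, p_value), assumed LD-pruned
    dosages       -- dict: variant id -> effect-allele dosage (0 to 2) for one person
    p_threshold   -- variants with p_value below this threshold enter the score
    """
    score = 0.0
    for variant_id, (effect_size, p_value) in summary_stats.items():
        # Skip variants that fail the threshold or were not genotyped.
        if p_value < p_threshold and variant_id in dosages:
            score += effect_size * dosages[variant_id]
    return score


# Hypothetical summary statistics and genotype dosages for one individual.
example_stats = {"snp_a": (0.5, 1e-9), "snp_b": (-0.2, 0.03), "snp_c": (0.1, 0.6)}
example_dosages = {"snp_a": 2, "snp_b": 1, "snp_c": 0}
strict_score = prs_pvalue_threshold(example_stats, example_dosages, 5e-8)
lenient_score = prs_pvalue_threshold(example_stats, example_dosages, 0.05)
```

A fixed threshold corresponds to passing one value of `p_threshold`; a data-adaptive threshold would instead be chosen by comparing scores built at several thresholds in a tuning sample.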

DOI

10.1016/j.ajhg.2020.08.025

CaTS power calculator

Tool

PUBMED_LINK

16415888

DESCRIPTION

CaTS is a simple, multi-platform interface for carrying out power calculations for large genetic association studies, including two stage genome wide association studies.

URL

https://csg.sph.umich.edu/abecasis/cats/

TITLE

Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies.

Main citation

Skol AD, Scott LJ, Abecasis GR, Boehnke M. (2006) Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat Genet, 38 (2) 209-13. doi:10.1038/ng1706. PMID 16415888

ABSTRACT

Genome-wide association is a promising approach to identify common genetic variants that predispose to human disease. Because of the high cost of genotyping hundreds of thousands of markers on thousands of subjects, genome-wide association studies often follow a staged design in which a proportion (π_samples) of the available samples are genotyped on a large number of markers in stage 1, and a proportion (π_markers) of these markers are later followed up by genotyping them on the remaining samples in stage 2. The standard strategy for analyzing such two-stage data is to view stage 2 as a replication study and focus on findings that reach statistical significance when stage 2 data are considered alone. We demonstrate that the alternative strategy of jointly analyzing the data from both stages almost always results in increased power to detect genetic association, despite the need to use more stringent significance levels, even when effect sizes differ between the two stages. We recommend joint analysis for all two-stage genome-wide association studies, especially when a relatively large proportion of the samples are genotyped in stage 1 (π_samples ≥ 0.30), and a relatively large proportion of markers are selected for follow-up in stage 2 (π_markers ≥ 0.01).
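
The cost motivation for the staged design can be made concrete with a small calculation. Under the design described above, stage 1 genotypes a fraction of the samples on all markers and stage 2 genotypes the remaining samples on the followed-up fraction of markers, so the genotyping effort relative to a one-stage study follows directly. The function below is a sketch with hypothetical names and is not part of the CaTS software.

```python
def two_stage_genotyping_fraction(pi_samples, pi_markers):
    """Fraction of one-stage genotyping effort required by a two-stage design.

    pi_samples -- proportion of samples genotyped on all markers in stage 1
    pi_markers -- proportion of markers followed up on remaining samples in stage 2
    """
    for name, value in (("pi_samples", pi_samples), ("pi_markers", pi_markers)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    stage1 = pi_samples                       # all markers, stage 1 samples
    stage2 = (1.0 - pi_samples) * pi_markers  # selected markers, remaining samples
    return stage1 + stage2
```

With 30% of samples in stage 1 and 1% of markers followed up, the function returns 0.30 + 0.70 × 0.01 = 0.307, so the design needs roughly 31% of the genotypes of a one-stage study.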

DOI

10.1038/ng1706

CAVIAR

Tool

PUBMED_LINK

25104515

FULL NAME

causal variants identification in associated regions

DESCRIPTION

A statistical framework that quantifies the probability of each variant being causal while allowing an arbitrary number of causal variants.

URL

http://genetics.cs.ucla.edu/caviar/

TITLE

Identifying causal variants at loci with multiple signals of association.

Main citation

Hormozdiari F, Kostem E, Kang EY, Pasaniuc B, ...&, Eskin E. (2014) Identifying causal variants at loci with multiple signals of association. Genetics, 198 (2) 497-508. doi:10.1534/genetics.114.167908. PMID 25104515

ABSTRACT

Although genome-wide association studies have successfully identified thousands of risk loci for complex traits, only a handful of the biologically causal variants, responsible for association at these loci, have been successfully identified. Current statistical methods for identifying causal variants at risk loci either use the strength of the association signal in an iterative conditioning framework or estimate probabilities for variants to be causal. A main drawback of existing methods is that they rely on the simplifying assumption of a single causal variant at each risk locus, which is typically invalid at many risk loci. In this work, we propose a new statistical framework that allows for the possibility of an arbitrary number of causal variants when estimating the posterior probability of a variant being causal. A direct benefit of our approach is that we predict a set of variants for each locus that under reasonable assumptions will contain all of the true causal variants with a high confidence level (e.g., 95%) even when the locus contains multiple causal variants. We use simulations to show that our approach provides 20-50% improvement in our ability to identify the causal variants compared to the existing methods at loci harboring multiple causal variants. We validate our approach using empirical data from an expression QTL study of CHI3L2 to identify new causal variants that affect gene expression at this locus. CAVIAR is publicly available online at http://genetics.cs.ucla.edu/caviar/.

DOI

10.1534/genetics.114.167908

CAVIARBF

Tool

PUBMED_LINK

25948564

FULL NAME

CAVIAR Bayes factor

DESCRIPTION

a fine-mapping method using marginal test statistics in the Bayesian framework

URL

https://bitbucket.org/Wenan/caviarbf/src/master/

KEYWORDS

Bayes factor

TITLE

Fine Mapping Causal Variants with an Approximate Bayesian Method Using Marginal Test Statistics.

Main citation

Chen W, Larrabee BR, Ovsyannikova IG, Kennedy RB, ...&, Schaid DJ. (2015) Fine Mapping Causal Variants with an Approximate Bayesian Method Using Marginal Test Statistics. Genetics, 200 (3) 719-36. doi:10.1534/genetics.115.176107. PMID 25948564

ABSTRACT

Two recently developed fine-mapping methods, CAVIAR and PAINTOR, demonstrate better performance over other fine-mapping methods. They also have the advantage of using only the marginal test statistics and the correlation among SNPs. Both methods leverage the fact that the marginal test statistics asymptotically follow a multivariate normal distribution and are likelihood based. However, their relationship with Bayesian fine mapping, such as BIMBAM, is not clear. In this study, we first show that CAVIAR and BIMBAM are actually approximately equivalent to each other. This leads to a fine-mapping method using marginal test statistics in the Bayesian framework, which we call CAVIAR Bayes factor (CAVIARBF). Another advantage of the Bayesian framework is that it can answer both association and fine-mapping questions. We also used simulations to compare CAVIARBF with other methods under different numbers of causal variants. The results showed that both CAVIARBF and BIMBAM have better performance than PAINTOR and other methods. Compared to BIMBAM, CAVIARBF has the advantage of using only marginal test statistics and takes about one-quarter to one-fifth of the running time. We applied different methods on two independent cohorts of the same phenotype. Results showed that CAVIARBF, BIMBAM, and PAINTOR selected the same top 3 SNPs; however, CAVIARBF and BIMBAM had better consistency in selecting the top 10 ranked SNPs between the two cohorts. Software is available at https://bitbucket.org/Wenan/caviarbf.

DOI

10.1534/genetics.115.176107

CC-GWAS

Tool

PUBMED_LINK

33686288

FULL NAME

case–case genome-wide association study

DESCRIPTION

The CCGWAS R package provides a tool for case-case association testing of two different disorders based on their respective case-control GWAS results

URL

https://github.com/wouterpeyrot/CCGWAS

TITLE

Identifying loci with different allele frequencies among cases of eight psychiatric disorders using CC-GWAS.

Main citation

Peyrot WJ, Price AL. (2021) Identifying loci with different allele frequencies among cases of eight psychiatric disorders using CC-GWAS. Nat Genet, 53 (4) 445-454. doi:10.1038/s41588-021-00787-1. PMID 33686288

ABSTRACT

Psychiatric disorders are highly genetically correlated, but little research has been conducted on the genetic differences between disorders. We developed a new method (case-case genome-wide association study; CC-GWAS) to test for differences in allele frequency between cases of two disorders using summary statistics from the respective case-control GWAS, transcending current methods that require individual-level data. Simulations and analytical computations confirm that CC-GWAS is well powered with effective control of type I error. We applied CC-GWAS to publicly available summary statistics for schizophrenia, bipolar disorder, major depressive disorder and five other psychiatric disorders. CC-GWAS identified 196 independent case-case loci, including 72 CC-GWAS-specific loci that were not significant at the genome-wide level in the input case-control summary statistics; two of the CC-GWAS-specific loci implicate the genes KLF6 and KLF16 (from the Krüppel-like family of transcription factors), which have been linked to neurite outgrowth and axon regeneration. CC-GWAS loci replicated convincingly in applications to datasets with independent replication data.

DOI

10.1038/s41588-021-00787-1

cellAdmix

Tool

PUBMED_LINK

41559218

DESCRIPTION

cellAdmix detects and corrects segmentation errors in imaging-based spatial transcriptomics by factorizing local molecular neighborhoods—analogous to doublet removal in scRNA-seq—to reassign transcripts that spill across cell boundaries.

URL

https://github.com/kharchenkolab/cellAdmix, http://pklab.org/peterk/cellAdmix/

KEYWORDS

spatial transcriptomics, segmentation, matrix factorization, imaging-based ST

TITLE

Impact and correction of segmentation errors in spatial transcriptomics.

Main citation

Mitchel J, Gao T, Petukhov V, Cole E, ...&, Kharchenko PV. (2026) Impact and correction of segmentation errors in spatial transcriptomics. Nat Genet, 58 (2) 434-444. doi:10.1038/s41588-025-02497-4. PMID 41559218

ABSTRACT

Spatial transcriptomics aims to elucidate how cells coordinate within tissues by connecting cellular states to their native microenvironments. Imaging-based assays are especially promising, capturing molecular and cellular features at subcellular resolution in three dimensions. Interpretation of such data, however, hinges on accurate cell segmentation. Assigning individual molecules to the correct cells remains challenging. Here we re-analyze data from multiple tissues and platforms to find that segmentation errors currently confound most downstream analysis of cellular state, including differential expression, neighbor influence and ligand-receptor interactions. The extent to which misassigned molecules impact the results can be striking, frequently dominating the results. Thus, we show that matrix factorization of local molecular neighborhoods can effectively identify and isolate such molecular admixtures, thereby reducing their impact on downstream analyses, in a manner analogous to doublet filtering in single-cell RNA sequencing. As the applications of spatial transcriptomics assays become more widespread, accounting for segmentation errors will be important for resolving molecular mechanisms of tissue biology.

DOI

10.1038/s41588-025-02497-4

ChinaMAP

Tool

PUBMED_LINK

34489580

FULL NAME

China Metabolic Analytics Project

URL

http://www.mbiobank.com/

TITLE

The ChinaMAP reference panel for the accurate genotype imputation in Chinese populations.

Main citation

Li L, Huang P, Sun X, Wang S, ...&, Wang W. (2021) The ChinaMAP reference panel for the accurate genotype imputation in Chinese populations. Cell Res, 31 (12) 1308-1310. doi:10.1038/s41422-021-00564-z. PMID 34489580

DOI

10.1038/s41422-021-00564-z

ChinaMAP panel (ChinaMAP)

Tool

PUBMED_LINK

34489580

FULL NAME

China Metabolic Analytics Project

URL

http://www.mbiobank.com/

TITLE

The ChinaMAP reference panel for the accurate genotype imputation in Chinese populations.

Main citation

Li L, Huang P, Sun X, Wang S, ...&, Wang W. (2021) The ChinaMAP reference panel for the accurate genotype imputation in Chinese populations. Cell Res, 31 (12) 1308-1310. doi:10.1038/s41422-021-00564-z. PMID 34489580

DOI

10.1038/s41422-021-00564-z

ChromoMap

Tool

PUBMED_LINK

35016614

DESCRIPTION

an R package for interactive visualization of multi-omics data and annotation of chromosomes

URL

https://lakshay-anand.github.io/chromoMap/index.html

TITLE

ChromoMap: an R package for interactive visualization of multi-omics data and annotation of chromosomes.

Main citation

Anand L, Rodriguez Lopez CM. (2022) ChromoMap: an R package for interactive visualization of multi-omics data and annotation of chromosomes. BMC Bioinformatics, 23 (1) 33. doi:10.1186/s12859-021-04556-z. PMID 35016614

ABSTRACT

BACKGROUND: The recent advancements in high-throughput sequencing have resulted in the availability of annotated genomes, as well as of multi-omics data for many living organisms. This has increased the need for graphic tools that allow the concurrent visualization of genomes and feature-associated multi-omics data on single publication-ready plots. RESULTS: We present chromoMap, an R package, developed for the construction of interactive visualizations of chromosomes/chromosomal regions, mapping of any chromosomal feature with known coordinates (i.e., protein coding genes, transposable elements, non-coding RNAs, microsatellites, etc.), and chromosomal regional characteristics (i.e. genomic feature density, gene expression, DNA methylation, chromatin modifications, etc.) of organisms with a genome assembly. ChromoMap can also integrate multi-omics data (genomics, transcriptomics and epigenomics) in relation to their occurrence across chromosomes. ChromoMap takes tab-delimited files (BED like) or alternatively R objects to specify the genomic co-ordinates of the chromosomes and elements to annotate. Rendered chromosomes are composed of continuous windows of a given range, which, on hover, display detailed information about the elements annotated within that range. By adjusting parameters of a single function, users can generate a variety of plots that can either be saved as static image or as HTML documents. CONCLUSIONS: ChromoMap's flexibility allows for concurrent visualization of genomic data in each strand of a given chromosome, or of more than one homologous chromosome; allowing the comparison of multi-omic data between genotypes (e.g. species, varieties, etc.) or between homologous chromosomes of phased diploid/polyploid genomes. chromoMap is an extensive tool that can be potentially used in various bioinformatics analysis pipelines for genomic visualization of multi-omics data.

DOI

10.1186/s12859-021-04556-z

CKB reference panel (CKB)

Tool

PUBMED_LINK

37870428

FULL NAME

China Kadoorie Biobank

URL

https://db.cngb.org/imputation/

TITLE

A high-resolution haplotype-resolved Reference panel constructed from the China Kadoorie Biobank Study.

Main citation

Yu C, Lan X, Tao Y, Guo Y, ...&, Li L. (2023) A high-resolution haplotype-resolved Reference panel constructed from the China Kadoorie Biobank Study. Nucleic Acids Res, 51 (21) 11770-11782. doi:10.1093/nar/gkad779. PMID 37870428

ABSTRACT

Precision medicine depends on high-accuracy individual-level genotype data. However, the whole-genome sequencing (WGS) is still not suitable for gigantic studies due to budget constraints. It is particularly important to construct highly accurate haplotype reference panel for genotype imputation. In this study, we used 10 000 samples with medium-depth WGS to construct a reference panel that we named the CKB reference panel. By imputing microarray datasets, it showed that the CKB panel outperformed compared panels in terms of both the number of well-imputed variants and imputation accuracy. In addition, we have completed the imputation of 100 706 microarrays with the CKB panel, and the after-imputed data is the hitherto largest whole genome data of the Chinese population. Furthermore, in the GWAS analysis of real phenotype height, the number of tested SNPs tripled and the number of significant SNPs doubled after imputation. Finally, we developed an online server for offering free genotype imputation service based on the CKB reference panel (https://db.cngb.org/imputation/). We believe that the CKB panel is of great value for imputing microarray or low-coverage genotype data of Chinese population, and potentially mixed populations. The imputation-completed 100 706 microarray data are enormous and precious resources of population genetic studies for complex traits and diseases.

DOI

10.1093/nar/gkad779

Cmplot

Tool

PUBMED_LINK

33662620

DESCRIPTION

an easy-to-use open-source web-based tool for visualizing, navigating and sharing GWAS and PheWAS results

URL

https://github.com/YinLiLin/Cmplot

TITLE

rMVP: A Memory-efficient, Visualization-enhanced, and Parallel-accelerated Tool for Genome-wide Association Study.

Main citation

Yin L, Zhang H, Tang Z, Xu J, ...&, Liu X. (2021) rMVP: A Memory-efficient, Visualization-enhanced, and Parallel-accelerated Tool for Genome-wide Association Study. Genomics Proteomics Bioinformatics, 19 (4) 619-628. doi:10.1016/j.gpb.2020.10.007. PMID 33662620

ABSTRACT

Along with the development of high-throughput sequencing technologies, both sample size and SNP number are increasing rapidly in genome-wide association studies (GWAS), and the associated computation is more challenging than ever. Here, we present a memory-efficient, visualization-enhanced, and parallel-accelerated R package called "rMVP" to address the need for improved GWAS computation. rMVP can 1) effectively process large GWAS data, 2) rapidly evaluate population structure, 3) efficiently estimate variance components by Efficient Mixed-Model Association eXpedited (EMMAX), Factored Spectrally Transformed Linear Mixed Models (FaST-LMM), and Haseman-Elston (HE) regression algorithms, 4) implement parallel-accelerated association tests of markers using general linear model (GLM), mixed linear model (MLM), and fixed and random model circulating probability unification (FarmCPU) methods, 5) compute fast with a globally efficient design in the GWAS processes, and 6) generate various visualizations of GWAS-related information. Accelerated by block matrix multiplication strategy and multiple threads, the association test methods embedded in rMVP are significantly faster than PLINK, GEMMA, and FarmCPU_pkg. rMVP is freely available at https://github.com/xiaolei-lab/rMVP.

DOI

10.1016/j.gpb.2020.10.007

CMS

Tool

PUBMED_LINK

20056855

FULL NAME

Composite of multiple signals

DESCRIPTION

Grossman, S. R., Shlyakhter, I., Karlsson, E. K., Byrne, E. H., Morales, S., Frieden, G., ... & Sabeti, P. C. (2010). A composite of multiple signals distinguishes causal variants in regions of positive selection. Science, 327(5967), 883-886.

TITLE

A composite of multiple signals distinguishes causal variants in regions of positive selection.

Main citation

Grossman SR, Shlyakhter I, Karlsson EK, Byrne EH, ...&, Sabeti PC. (2010) A composite of multiple signals distinguishes causal variants in regions of positive selection. Science, 327 (5967) 883-6. doi:10.1126/science.1183863. PMID 20056855

ABSTRACT

The human genome contains hundreds of regions whose patterns of genetic variation indicate recent positive natural selection, yet for most the underlying gene and the advantageous mutation remain unknown. We developed a method, composite of multiple signals (CMS), that combines tests for multiple signals of selection and increases resolution by up to 100-fold. By applying CMS to candidate regions from the International Haplotype Map, we localized population-specific selective signals to 55 kilobases (median), identifying known and novel causal variants. CMS can not just identify individual loci but implicates precise variants selected by evolution.

DOI

10.1126/science.1183863

CNGB Imputation Service (CNGB)

Tool

PUBMED_LINK

37870428

FULL NAME

China National GeneBank

URL

https://db.cngb.org/imputation/

TITLE

A high-resolution haplotype-resolved Reference panel constructed from the China Kadoorie Biobank Study.

Main citation

Yu C, Lan X, Tao Y, Guo Y, ...&, Li L. (2023) A high-resolution haplotype-resolved Reference panel constructed from the China Kadoorie Biobank Study. Nucleic Acids Res, 51 (21) 11770-11782. doi:10.1093/nar/gkad779. PMID 37870428

ABSTRACT

Precision medicine depends on high-accuracy individual-level genotype data. However, the whole-genome sequencing (WGS) is still not suitable for gigantic studies due to budget constraints. It is particularly important to construct highly accurate haplotype reference panel for genotype imputation. In this study, we used 10 000 samples with medium-depth WGS to construct a reference panel that we named the CKB reference panel. By imputing microarray datasets, it showed that the CKB panel outperformed compared panels in terms of both the number of well-imputed variants and imputation accuracy. In addition, we have completed the imputation of 100 706 microarrays with the CKB panel, and the after-imputed data is the hitherto largest whole genome data of the Chinese population. Furthermore, in the GWAS analysis of real phenotype height, the number of tested SNPs tripled and the number of significant SNPs doubled after imputation. Finally, we developed an online server for offering free genotype imputation service based on the CKB reference panel (https://db.cngb.org/imputation/). We believe that the CKB panel is of great value for imputing microarray or low-coverage genotype data of Chinese population, and potentially mixed populations. The imputation-completed 100 706 microarray data are enormous and precious resources of population genetic studies for complex traits and diseases.

DOI

10.1093/nar/gkad779

CoCoNet

Tool

PUBMED_LINK

32310941

DESCRIPTION

CoCoNet is a composite likelihood-based covariance regression network model for identifying trait-relevant tissues or cell types.

URL

https://xiangzhou.github.io/software/

KEYWORDS

composite likelihood-based inference algorithm

TITLE

Leveraging gene co-expression patterns to infer trait-relevant tissues in genome-wide association studies.

Main citation

Shang L, Smith JA, Zhou X. (2020) Leveraging gene co-expression patterns to infer trait-relevant tissues in genome-wide association studies. PLoS Genet, 16 (4) e1008734. doi:10.1371/journal.pgen.1008734. PMID 32310941

ABSTRACT

Genome-wide association studies (GWASs) have identified many SNPs associated with various common diseases. Understanding the biological functions of these identified SNP associations requires identifying disease/trait relevant tissues or cell types. Here, we develop a network method, CoCoNet, to facilitate the identification of trait-relevant tissues or cell types. Different from existing approaches, CoCoNet incorporates tissue-specific gene co-expression networks constructed from either bulk or single cell RNA sequencing (RNAseq) studies with GWAS data for trait-tissue inference. In particular, CoCoNet relies on a covariance regression network model to express gene-level effect measurements for the given GWAS trait as a function of the tissue-specific co-expression adjacency matrix. With a composite likelihood-based inference algorithm, CoCoNet is scalable to tens of thousands of genes. We validate the performance of CoCoNet through extensive simulations. We apply CoCoNet for an in-depth analysis of four neurological disorders and four autoimmune diseases, where we integrate the corresponding GWASs with bulk RNAseq data from 38 tissues and single cell RNAseq data from 10 cell types. In the real data applications, we show how CoCoNet can help identify specific glial cell types relevant for neurological disorders and identify disease-targeted colon tissues as relevant for autoimmune diseases.

DOI

10.1371/journal.pgen.1008734

Coloc

Tool

PUBMED_LINK

24830394

URL

https://chr1swallace.github.io/coloc/

KEYWORDS

Approximate Bayes Factor (ABF)

TITLE

Bayesian test for colocalisation between pairs of genetic association studies using summary statistics.

Main citation

Giambartolomei C, Vukcevic D, Schadt EE, Franke L, ...&, Plagnol V. (2014) Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS Genet, 10 (5) e1004383. doi:10.1371/journal.pgen.1004383. PMID 24830394

ABSTRACT

Genetic association studies, in particular the genome-wide association study (GWAS) design, have provided a wealth of novel insights into the aetiology of a wide range of human diseases and traits, in particular cardiovascular diseases and lipid biomarkers. The next challenge consists of understanding the molecular basis of these associations. The integration of multiple association datasets, including gene expression datasets, can contribute to this goal. We have developed a novel statistical methodology to assess whether two association signals are consistent with a shared causal variant. An application is the integration of disease scans with expression quantitative trait locus (eQTL) studies, but any pair of GWAS datasets can be integrated in this framework. We demonstrate the value of the approach by re-analysing a gene expression dataset in 966 liver samples with a published meta-analysis of lipid traits including >100,000 individuals of European ancestry. Combining all lipid biomarkers, our re-analysis supported 26 out of 38 reported colocalisation results with eQTLs and identified 14 new colocalisation results, hence highlighting the value of a formal statistical test. In three cases of reported eQTL-lipid pairs (SYPL2, IFT172, TBKBP1) for which our analysis suggests that the eQTL pattern is not consistent with the lipid association, we identify alternative colocalisation results with SORT1, GCKR, and KPNB1, indicating that these genes are more likely to be causal in these genomic intervals. A key feature of the method is the ability to derive the output statistics from single SNP summary statistics, hence making it possible to perform systematic meta-analysis type comparisons across multiple GWAS datasets (implemented online at http://coloc.cs.ucl.ac.uk/coloc/). Our methodology provides information about candidate causal genes in associated intervals and has direct implications for the understanding of complex diseases as well as the design of drugs to target disease pathways.
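
How posterior support for a shared causal variant can be derived from single-SNP summary statistics may be easier to follow with a sketch. The Python below is not the coloc R package; it assumes at most one causal variant per trait in the region, uses Wakefield's approximate Bayes factor per SNP, and combines the per-SNP values into posterior probabilities for five hypotheses (H0: no association, H1/H2: one trait only, H3: two distinct variants, H4: one shared variant). The function names, the prior values `p1`, `p2`, `p12`, and the toy z-scores and variances are assumptions made for illustration.

```python
import math


def log_sum(xs):
    """log(sum(exp(x))) computed stably."""
    top = max(xs)
    return top + math.log(sum(math.exp(x - top) for x in xs))


def log_diff(a, b):
    """log(exp(a) - exp(b)); returns -inf when the difference is not positive."""
    if b >= a:
        return float("-inf")
    return a + math.log1p(-math.exp(b - a))


def wakefield_log_abf(z, var_beta, prior_var):
    """Log approximate Bayes factor for one SNP from its z-score,
    the variance of its effect estimate, and a prior effect variance."""
    r = prior_var / (prior_var + var_beta)
    return 0.5 * (math.log(1.0 - r) + r * z * z)


def coloc_posteriors(labf1, labf2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities [H0, H1, H2, H3, H4] from per-SNP log ABFs.

    Both lists must cover the same SNPs in the same order.
    """
    both = [a + b for a, b in zip(labf1, labf2)]
    s1, s2, s12 = log_sum(labf1), log_sum(labf2), log_sum(both)
    log_h = [
        0.0,                                                   # H0
        math.log(p1) + s1,                                     # H1
        math.log(p2) + s2,                                     # H2
        math.log(p1) + math.log(p2) + log_diff(s1 + s2, s12),  # H3
        math.log(p12) + s12,                                   # H4
    ]
    denom = log_sum(log_h)
    return [math.exp(v - denom) for v in log_h]


# Toy region of three SNPs; effect-estimate variance 0.01, prior variance 0.04.
def labfs(z_scores):
    return [wakefield_log_abf(z, 0.01, 0.04) for z in z_scores]


shared = coloc_posteriors(labfs([1.0, 6.0, 0.5]), labfs([0.8, 5.5, 0.2]))
distinct = coloc_posteriors(labfs([6.0, 1.0, 0.5]), labfs([0.8, 0.2, 5.5]))
print("same top SNP      ->", [round(p, 4) for p in shared])
print("different top SNPs ->", [round(p, 4) for p in distinct])
```

When both traits peak at the same SNP the H4 posterior dominates; when the peaks fall on different SNPs the weight moves to H3.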

DOI

10.1371/journal.pgen.1004383

Coloc-susie

Tool

PUBMED_LINK

34587156

URL

https://chr1swallace.github.io/coloc/articles/a06_SuSiE.html

KEYWORDS

Approximate Bayes Factor (ABF), Sum of Single Effects (SuSiE)

TITLE

A more accurate method for colocalisation analysis allowing for multiple causal variants.

Main citation

Wallace C. (2021) A more accurate method for colocalisation analysis allowing for multiple causal variants. PLoS Genet, 17 (9) e1009440. doi:10.1371/journal.pgen.1009440. PMID 34587156

ABSTRACT

In genome-wide association studies (GWAS) it is now common to search for, and find, multiple causal variants located in close proximity. It has also become standard to ask whether different traits share the same causal variants, but one of the popular methods to answer this question, coloc, makes the simplifying assumption that only a single causal variant exists for any given trait in any genomic region. Here, we examine the potential of the recently proposed Sum of Single Effects (SuSiE) regression framework, which can be used for fine-mapping genetic signals, for use with coloc. SuSiE is a novel approach that allows evidence for association at multiple causal variants to be evaluated simultaneously, whilst separating the statistical support for each variant conditional on the causal signal being considered. We show this results in more accurate coloc inference than other proposals to adapt coloc for multiple causal variants based on conditioning. We therefore recommend that coloc be used in combination with SuSiE to optimise accuracy of colocalisation analyses when multiple causal variants exist.

DOI

10.1371/journal.pgen.1009440

Comparison

Tool

PUBMED_LINK

34819519

TITLE

Synergistic insights into human health from aptamer- and antibody-based proteomic profiling.

Main citation

Pietzner M, Wheeler E, Carrasco-Zanini J, Kerrison ND, ...&, Langenberg C. (2021) Synergistic insights into human health from aptamer- and antibody-based proteomic profiling. Nat Commun, 12 (1) 6822. doi:10.1038/s41467-021-27164-0. PMID 34819519

ABSTRACT

Affinity-based proteomics has enabled scalable quantification of thousands of protein targets in blood enhancing biomarker discovery, understanding of disease mechanisms, and genetic evaluation of drug targets in humans through protein quantitative trait loci (pQTLs). Here, we integrate two partly complementary techniques, the aptamer-based SomaScan® v4 assay and the antibody-based Olink assays, to systematically assess phenotypic consequences of hundreds of pQTLs discovered for 871 protein targets across both platforms. We create a genetically anchored cross-platform proteome-phenome network comprising 547 protein-phenotype connections, 36.3% of which were only seen with one of the two platforms suggesting that both techniques capture distinct aspects of protein biology. We further highlight discordance of genetically predicted effect directions between assays, such as for PILRA and Alzheimer's disease. Our results showcase the synergistic nature of these technologies to better understand and identify disease mechanisms and provide a benchmark for future cross-platform discoveries.

DOI

10.1038/s41467-021-27164-0

Concepts & Principles

Tool

PUBMED_LINK

34570226

TITLE

Interpreting Mendelian-randomization estimates of the effects of categorical exposures such as disease status and educational attainment.

Main citation

Howe LJ, Tudball M, Davey Smith G, Davies NM. (2022) Interpreting Mendelian-randomization estimates of the effects of categorical exposures such as disease status and educational attainment. Int J Epidemiol, 51 (3) 948-957. doi:10.1093/ije/dyab208. PMID 34570226

ABSTRACT

BACKGROUND: Mendelian randomization has been previously used to estimate the effects of binary and ordinal categorical exposures (e.g. Type 2 diabetes or educational attainment defined by qualification) on outcomes. Binary and categorical phenotypes can be modelled in terms of liability, an underlying latent continuous variable with liability thresholds separating individuals into categories. Genetic variants influence an individual's categorical exposure via their effects on liability, thus Mendelian-randomization analyses with categorical exposures will capture effects of liability that act independently of exposure category. METHODS AND RESULTS: We discuss how groups in which the categorical exposure is invariant can be used to detect liability effects acting independently of exposure category. For example, associations between an adult educational-attainment polygenic score (PGS) and body mass index measured before the minimum school leaving age (e.g. age 10 years), cannot indicate the effects of years in full-time education on this outcome. Using UK Biobank data, we show that a higher educational-attainment PGS is strongly associated with lower smoking initiation and higher odds of glasses use at age 15 years. These associations were replicated in sibling models. An orthogonal approach using the raising of the school leaving age (ROSLA) policy change found that individuals who chose to remain in education to age 16 years before the reform likely had higher liability to educational attainment than those who were compelled to remain in education to age 16 years after the reform, and had higher income, lower pack-years of smoking, higher odds of glasses use and lower deprivation in adulthood. These results suggest that liability to educational attainment is associated with health and social outcomes independently of years in full-time education. CONCLUSIONS: Mendelian-randomization studies with non-continuous exposures should be interpreted in terms of liability, which may affect the outcome via changes in exposure category and/or independently.
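
The liability model mentioned in the abstract can be made concrete with a short sketch. It assumes liability is standard normal in the population and places thresholds so that the categories have the requested proportions; the function names and the example proportions are hypothetical and are not taken from the paper.

```python
from bisect import bisect_right
from statistics import NormalDist


def liability_thresholds(category_proportions):
    """Thresholds on a standard-normal liability scale that reproduce the
    given category proportions (ordered from lowest to highest category)."""
    if abs(sum(category_proportions) - 1.0) > 1e-9:
        raise ValueError("category proportions must sum to 1")
    std_normal = NormalDist()
    cumulative, thresholds = 0.0, []
    for p in category_proportions[:-1]:
        cumulative += p
        thresholds.append(std_normal.inv_cdf(cumulative))
    return thresholds


def category_of(liability, thresholds):
    """Index of the category that a given liability value falls into."""
    return bisect_right(thresholds, liability)


# Binary exposure with an assumed 10% prevalence: one threshold.
disease_cut = liability_thresholds([0.9, 0.1])
print("disease threshold:", [round(t, 3) for t in disease_cut])
print("liability 1.5 ->", category_of(1.5, disease_cut))  # above the threshold: affected

# Ordinal exposure with three assumed qualification levels: two thresholds.
education_cuts = liability_thresholds([0.3, 0.5, 0.2])
print("education thresholds:", [round(t, 3) for t in education_cuts])
```

Two people in the same category can still differ in liability, which is why an instrument acting on liability may affect an outcome without any change in the observed category.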

DOI

10.1093/ije/dyab208

CookHLA

Tool

PUBMED_LINK

33627654

URL

https://github.com/WansonChoi/CookHLA

TITLE

Accurate imputation of human leukocyte antigens with CookHLA.

Main citation

Cook S, Choi W, Lim H, Luo Y, ...&, Han B. (2021) Accurate imputation of human leukocyte antigens with CookHLA. Nat Commun, 12 (1) 1264. doi:10.1038/s41467-021-21541-5. PMID 33627654

ABSTRACT

The recent development of imputation methods enabled the prediction of human leukocyte antigen (HLA) alleles from intergenic SNP data, allowing studies to fine-map HLA for immune phenotypes. Here we report an accurate HLA imputation method, CookHLA, which has superior imputation accuracy compared to previous methods. CookHLA differs from other approaches in that it locally embeds prediction markers into highly polymorphic exons to account for exonic variability, and in that it adaptively learns the genetic map within MHC from the data to facilitate imputation. Our benchmarking with real datasets shows that our method achieves high imputation accuracy in a wide range of scenarios, including situations where the reference panel is small or ethnically unmatched.

DOI

10.1038/s41467-021-21541-5

CoPheScan

Tool

PUBMED_LINK

38997278

FULL NAME

Coloc adapted Phenome-wide Scan

URL

https://github.com/ichcha-m/cophescan

TITLE

CoPheScan: phenome-wide association studies accounting for linkage disequilibrium.

Main citation

Manipur I, Reales G, Sul JH, Shin MK, ...&, Wallace C. (2024) CoPheScan: phenome-wide association studies accounting for linkage disequilibrium. Nat Commun, 15 (1) 5862. doi:10.1038/s41467-024-49990-8. PMID 38997278

ABSTRACT

Phenome-wide association studies (PheWAS) facilitate the discovery of associations between a single genetic variant with multiple phenotypes. For variants which impact a specific protein, this can help identify additional therapeutic indications or on-target side effects of intervening on that protein. However, PheWAS is restricted by an inability to distinguish confounding due to linkage disequilibrium (LD) from true pleiotropy. Here we describe CoPheScan (Coloc adapted Phenome-wide Scan), a Bayesian approach that enables an intuitive and systematic exploration of causal associations while simultaneously addressing LD confounding. We demonstrate its performance through simulation, showing considerably better control of false positive rates than a conventional approach not accounting for LD. We used CoPheScan to perform PheWAS of protein-truncating variants and fine-mapped variants from disease and pQTL studies, in 2275 disease phenotypes from the UK Biobank. Our results identify the complexity of known pleiotropic genes such as APOE, and suggest a new causal role for TGM3 in skin cancer.

DOI

10.1038/s41467-024-49990-8

corrplot

Tool

DESCRIPTION

R package corrplot provides a visual exploratory tool on correlation matrix that supports automatic variable reordering to help detect hidden patterns among variables.

URL

https://github.com/taiyun/corrplot

COWAS

TWAS, Functional genomics, Gene prioritization, Tool, Summary statistics

PUBMED_LINK

41381446

FULL NAME

Co-expression-wide association study

DESCRIPTION

Co-expression-wide association study (COWAS) extends TWAS/PWAS by testing pairs of genes or proteins whose genetically regulated co-expression or interaction is associated with a trait; includes implemented R software and trained imputation weights for summary-statistic follow-up.

URL

https://github.com/mykmal/cowas, https://doi.org/10.1038/s41467-025-66039-6

KEYWORDS

TWAS, PWAS, co-expression, gene-gene interaction, GWAS summary statistics

TITLE

Co-expression-wide association studies link genetically regulated interactions with complex traits.

Main citation

Malakhov MM, Pan W. (2025) Co-expression-wide association studies link genetically regulated interactions with complex traits. Nat Commun, 16 (1) 11061. doi:10.1038/s41467-025-66039-6. PMID 41381446

ABSTRACT

Transcriptome- and proteome-wide association studies (TWAS/PWAS) have proven successful in prioritizing genes and proteins whose genetically regulated expression modulates disease risk, but they ignore potential co-expression and interaction effects. To address this limitation, we introduce the co-expression-wide association study (COWAS) method, which can identify pairs of genes or proteins whose genetically regulated co-expression is associated with complex traits. COWAS first trains models to predict expression and co-expression from genetic variation, and then tests for association between imputed co-expression and the trait of interest while also accounting for direct effects from each exposure. We applied our method to plasma proteomic concentrations from the UK Biobank, identifying dozens of interacting protein pairs associated with cholesterol levels, Alzheimer's disease, and Parkinson's disease. Notably, our results demonstrate that co-expression between proteins may affect complex traits even if neither protein is detected to influence the trait when considered on its own. We also show how COWAS can help to disentangle direct and interaction effects, providing a richer picture of the molecular networks that mediate genetic effects on disease outcomes.

DOI

10.1038/s41467-025-66039-6

cross-trait LDSC

Tool

PUBMED_LINK

26414676

FULL NAME

cross-trait LD Score Regression

DESCRIPTION

ldsc is a command line tool for estimating heritability and genetic correlation from GWAS summary statistics. ldsc also computes LD Scores.

URL

https://github.com/bulik/ldsc

KEYWORDS

cross-trait, LD score regression

TITLE

An atlas of genetic correlations across human diseases and traits.

Main citation

Bulik-Sullivan B, Finucane HK, Anttila V, Gusev A, ...&, Neale BM. (2015) An atlas of genetic correlations across human diseases and traits. Nat Genet, 47 (11) 1236-41. doi:10.1038/ng.3406. PMID 26414676

ABSTRACT

Identifying genetic correlations between complex traits and diseases can provide useful etiological insights and help prioritize likely causal relationships. The major challenges preventing estimation of genetic correlation from genome-wide association study (GWAS) data with current methods are the lack of availability of individual-level genotype data and widespread sample overlap among meta-analyses. We circumvent these difficulties by introducing a technique, cross-trait LD Score regression, for estimating genetic correlation that requires only GWAS summary statistics and is not biased by sample overlap. We use this method to estimate 276 genetic correlations among 24 traits. The results include genetic correlations between anorexia nervosa and schizophrenia, anorexia and obesity, and educational attainment and several diseases. These results highlight the power of genome-wide analyses, as there currently are no significantly associated SNPs for anorexia nervosa and only three for educational attainment.
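
The regression idea behind the abstract above can be illustrated with a minimal sketch. It assumes the commonly stated relationship in which the expected product of the two traits' z-scores at a SNP is linear in that SNP's LD score, with slope sqrt(N1 N2) × (genetic covariance) / M and an intercept that absorbs sample overlap. The function name `genetic_covariance` and its arguments are hypothetical, and the fit is plain unweighted least squares; it is not the ldsc implementation, which, to our understanding, uses weighted regression and a block jackknife for standard errors.

```python
import numpy as np

def genetic_covariance(z1, z2, ld_scores, n1, n2, m):
    """Unweighted OLS sketch of cross-trait LD Score regression.

    z1, z2     : per-SNP z-scores for the two traits
    ld_scores  : per-SNP LD scores
    n1, n2     : GWAS sample sizes
    m          : number of SNPs used to compute the LD scores
    Returns (genetic covariance estimate, regression intercept).
    """
    product = np.asarray(z1, dtype=float) * np.asarray(z2, dtype=float)
    # Design matrix: intercept column plus LD score column
    design = np.column_stack([np.ones(len(product)), ld_scores])
    (intercept, slope), *_ = np.linalg.lstsq(design, product, rcond=None)
    # Rescale the slope from z-score units to genetic covariance
    return slope * m / np.sqrt(n1 * n2), intercept
```

Dividing the returned genetic covariance by the square root of the product of the two SNP heritabilities would give a genetic correlation; the intercept is left free so that sample overlap does not bias the slope.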

DOI

10.1038/ng.3406

cS2G

Tool

PUBMED_LINK

35668300

FULL NAME

optimal combined S2G strategy

DESCRIPTION

A heritability-based framework for evaluating and combining different S2G strategies to optimize their informativeness for common disease risk.

URL

https://alkesgroup.broadinstitute.org/cS2G/code

TITLE

Combining SNP-to-gene linking strategies to identify disease genes and assess disease omnigenicity.

Main citation

Gazal S, Weissbrod O, Hormozdiari F, Dey KK, ...&, Price AL. (2022) Combining SNP-to-gene linking strategies to identify disease genes and assess disease omnigenicity. Nat Genet, 54 (6) 827-836. doi:10.1038/s41588-022-01087-y. PMID 35668300

ABSTRACT

Disease-associated single-nucleotide polymorphisms (SNPs) generally do not implicate target genes, as most disease SNPs are regulatory. Many SNP-to-gene (S2G) linking strategies have been developed to link regulatory SNPs to the genes that they regulate in cis. Here, we developed a heritability-based framework for evaluating and combining different S2G strategies to optimize their informativeness for common disease risk. Our optimal combined S2G strategy (cS2G) included seven constituent S2G strategies and achieved a precision of 0.75 and a recall of 0.33, more than doubling the recall of any individual strategy. We applied cS2G to fine-mapping results for 49 UK Biobank diseases/traits to predict 5,095 causal SNP-gene-disease triplets (with S2G-derived functional interpretation) with high confidence. We further applied cS2G to provide an empirical assessment of disease omnigenicity; we determined that the top 1% of genes explained roughly half of the SNP heritability linked to all genes and that gene-level architectures vary with variant allele frequency.

DOI

10.1038/s41588-022-01087-y

CT-SLEB

Tool

PUBMED_LINK

37749244

DESCRIPTION

CT-SLEB is a method designed to generate multi-ancestry PRSs that incorporate existing large GWAS from EUR populations and smaller GWAS from non-EUR populations. The method has three key steps: 1. Clumping and Thresholding for selecting SNPs to be included in a PRS for the target population; 2. Empirical-Bayes method for estimating the coefficients of the SNPs; 3. Super-learning model to combine a series of PRSs generated under different SNP selection thresholds.
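
Step 1 above (clumping and thresholding) can be sketched generically as follows. This is a simplified single-threshold illustration, not the CTSLEB package's procedure: the function name `clump_and_threshold`, the precomputed pairwise r² matrix, and the greedy ordering by p-value are all assumptions made for the example.

```python
import numpy as np

def clump_and_threshold(p_values, ld_r2, p_cutoff, r2_cutoff):
    """Greedy clumping and thresholding (illustrative sketch).

    p_values  : per-SNP association p-values
    ld_r2     : square matrix of pairwise r^2 between SNPs
    p_cutoff  : keep only SNPs with p <= p_cutoff
    r2_cutoff : drop a SNP if r^2 with any already-kept SNP is >= r2_cutoff
    Returns the indices of retained SNPs, most significant first.
    """
    p_values = np.asarray(p_values, dtype=float)
    kept = []
    # Visit SNPs from most to least significant
    for idx in np.argsort(p_values, kind="stable"):
        if p_values[idx] > p_cutoff:
            break  # all remaining SNPs fail the p-value threshold
        # Keep the SNP only if it is not in LD with a more significant kept SNP
        if all(ld_r2[idx, k] < r2_cutoff for k in kept):
            kept.append(int(idx))
    return kept
```

Running the function over a grid of `p_cutoff` and `r2_cutoff` values would yield the series of candidate SNP sets that a later combining step could draw on.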

URL

https://github.com/andrewhaoyu/CTSLEB

TITLE

A new method for multiancestry polygenic prediction improves performance across diverse populations.

Main citation

Zhang H, Zhan J, Jin J, Zhang J, ...&, Chatterjee N. (2023) A new method for multiancestry polygenic prediction improves performance across diverse populations. Nat Genet, 55 (10) 1757-1768. doi:10.1038/s41588-023-01501-z. PMID 37749244

ABSTRACT

Polygenic risk scores (PRSs) increasingly predict complex traits; however, suboptimal performance in non-European populations raises concerns about clinical applications and health inequities. We developed CT-SLEB, a powerful and scalable method to calculate PRSs, using ancestry-specific genome-wide association study summary statistics from multiancestry training samples, integrating clumping and thresholding, empirical Bayes and superlearning. We evaluated CT-SLEB and nine alternative methods with large-scale simulated genome-wide association studies (~19 million common variants) and datasets from 23andMe, Inc., the Global Lipids Genetics Consortium, All of Us and UK Biobank, involving 5.1 million individuals of diverse ancestry, with 1.18 million individuals from four non-European populations across 13 complex traits. Results demonstrated that CT-SLEB significantly improves PRS performance in non-European populations compared with simple alternatives, with comparable or superior performance to a recent, computationally intensive method. Moreover, our simulation studies offered insights into sample size requirements and SNP density effects on multiancestry risk prediction.

DOI

10.1038/s41588-023-01501-z

cTWAS

Tool

PUBMED_LINK

38279041

FULL NAME

causal-TWAS

DESCRIPTION

Expression Quantitative Trait Loci (eQTLs) have often been used to nominate candidate genes from Genome-wide association studies (GWAS). However, commonly used methods are susceptible to false positives largely due to Linkage Disequilibrium of eQTLs with causal variants acting on the phenotype directly. Our method, causal-TWAS (cTWAS), addressed this challenge by borrowing ideas from statistical fine-mapping. It is a generalization of Transcriptome-wide association studies (TWAS), but when analyzing any gene, it adjusts for other nearby genes and all nearby genetic variants.

URL

https://xinhe-lab.github.io/ctwas/

KEYWORDS

TWAS, fine-mapping

TITLE

Adjusting for genetic confounders in transcriptome-wide association studies improves discovery of risk genes of complex traits.

Main citation

Zhao S, Crouse W, Qian S, Luo K, ...&, He X. (2024) Adjusting for genetic confounders in transcriptome-wide association studies improves discovery of risk genes of complex traits. Nat Genet, 56 (2) 336-347. doi:10.1038/s41588-023-01648-9. PMID 38279041

ABSTRACT

Many methods have been developed to leverage expression quantitative trait loci (eQTL) data to nominate candidate genes from genome-wide association studies. However, these methods, including colocalization, transcriptome-wide association studies (TWAS) and Mendelian randomization-based methods, all suffer from a key problem: when assessing the role of a gene in a trait using its eQTLs, nearby variants and genetic components of other genes' expression may be correlated with these eQTLs and have direct effects on the trait, acting as potential confounders. Our extensive simulations showed that existing methods fail to account for these 'genetic confounders', resulting in severe inflation of false positives. Our new method, causal-TWAS (cTWAS), borrows ideas from statistical fine-mapping and allows us to adjust all genetic confounders. cTWAS showed calibrated false discovery rates in simulations, and its application on several common traits discovered new candidate genes. In conclusion, cTWAS provides a robust statistical framework for gene discovery.

DOI

10.1038/s41588-023-01648-9

Ctyper

Tool

PUBMED_LINK

41107550

DESCRIPTION

Ctyper genotypes sequence-resolved copy-number variation and other complex polymorphic genes using a pangenome reference matrix, enabling allele- and copy-aware calls at scale for biobank-style cohorts.

URL

https://github.com/ChaissonLab/Ctyper, https://www.nature.com/articles/s41588-025-02346-4

KEYWORDS

CNV, copy number, pangenome, sequence-resolved, biobank scale

TITLE

Genotyping sequence-resolved copy number variation using pangenomes reveals paralog-specific global diversity and expression divergence of duplicated genes.

Main citation

Ma W, Chaisson MJP. (2025) Genotyping sequence-resolved copy number variation using pangenomes reveals paralog-specific global diversity and expression divergence of duplicated genes. Nat Genet, 57 (11) 2909-2919. doi:10.1038/s41588-025-02346-4. PMID 41107550

ABSTRACT

Copy number variable (CNV) genes are important in evolution and disease, yet their sequence variation remains a blind spot in large-scale studies. We present ctyper, a method that leverages pangenomes to produce allele-specific copy numbers with locally phased variants from next-generation sequencing samples. Benchmarking on 3,351 CNV genes and 212 challenging medically relevant (CMR) genes, ctyper captures 96.5% of phased variants with ≥99.1% correctness of copy number in CNV genes and 94.8% of phased variants in CMR genes. Ctyper takes 1.5 h to genotype a genome on one CPU. The ctyper genotypes give a 4.81-fold improvement in predictions of gene expression compared to known expression quantitative trait locus (eQTL) variants. Allele-specific expression quantified divergent expression in 7.94% of paralogs and tissue-specific biases in 4.68%. We found reduced expression of SMN2 due to SMN1 conversion, potentially affecting spinal muscular atrophy, and increased expression of translocated duplications of AMY2B. Overall, ctyper enables biobank-scale genotyping of CNV and CMR genes.

DOI

10.1038/s41588-025-02346-4

DBSLMM

Tool

PUBMED_LINK

32330416

FULL NAME

Deterministic Bayesian Sparse Linear Mixed Model

DESCRIPTION

There are two versions of DBSLMM: the tuning version and the deterministic version. The tuning version examines three different heritability choices and requires validation data to tune the heritability hyper-parameter. The deterministic version uses one heritability estimate and directly fits the model in the training data without separate validation data. Both versions require reference data to compute the SNP correlation matrix. In our experience, the tuning version may work more accurately than the deterministic version.

URL

https://github.com/biostat0903/DBSLMM

TITLE

Accurate and Scalable Construction of Polygenic Scores in Large Biobank Data Sets.

Main citation

Yang S, Zhou X. (2020) Accurate and Scalable Construction of Polygenic Scores in Large Biobank Data Sets. Am J Hum Genet, 106 (5) 679-693. doi:10.1016/j.ajhg.2020.03.013. PMID 32330416

ABSTRACT

Accurate construction of polygenic scores (PGS) can enable early diagnosis of diseases and facilitate the development of personalized medicine. Accurate PGS construction requires prediction models that are both adaptive to different genetic architectures and scalable to biobank scale datasets with millions of individuals and tens of millions of genetic variants. Here, we develop such a method called Deterministic Bayesian Sparse Linear Mixed Model (DBSLMM). DBSLMM relies on a flexible modeling assumption on the effect size distribution to achieve robust and accurate prediction performance across a range of genetic architectures. DBSLMM also relies on a simple deterministic search algorithm to yield an approximate analytic estimation solution using summary statistics only. The deterministic search algorithm, when paired with further algebraic innovations, results in substantial computational savings. With simulations, we show that DBSLMM achieves scalable and accurate prediction performance across a range of realistic genetic architectures. We then apply DBSLMM to analyze 25 traits in UK Biobank. For these traits, compared to existing approaches, DBSLMM achieves an average of 2.03%-101.09% accuracy gain in internal cross-validations. In external validations on two separate datasets, including one from BioBank Japan, DBSLMM achieves an average of 14.74%-522.74% accuracy gain. In these real data applications, DBSLMM is 1.03-28.11 times faster and uses only 7.4%-24.8% of physical memory as compared to other multiple regression-based PGS methods. Overall, DBSLMM represents an accurate and scalable method for constructing PGS in biobank scale datasets.

DOI

10.1016/j.ajhg.2020.03.013

DDx-PRS

Tool

FULL NAME

Differential Diagnosis-Polygenic Risk Score

DESCRIPTION

The DDxPRS R function provides a tool for distinguishing different disorders based on polygenic prediction.

URL

https://github.com/wouterpeyrot/DDxPRS

Main citation

Peyrot, W. J., Panagiotaropoulou, G., Olde Loohuis, L. M., Adams, M., Awasthi, S., Ge, T., ... & Price, A. L. (2024). Distinguishing different psychiatric disorders using DDx-PRS. medRxiv, 2024-02.

DEEP*HLA

Tool

PUBMED_LINK

33712626

URL

https://github.com/tatsuhikonaito/DEEP-HLA

TITLE

A deep learning method for HLA imputation and trans-ethnic MHC fine-mapping of type 1 diabetes.

Main citation

Naito T, Suzuki K, Hirata J, Kamatani Y, ...&, Okada Y. (2021) A deep learning method for HLA imputation and trans-ethnic MHC fine-mapping of type 1 diabetes. Nat Commun, 12 (1) 1639. doi:10.1038/s41467-021-21975-x. PMID 33712626

ABSTRACT

Conventional human leukocyte antigen (HLA) imputation methods drop their performance for infrequent alleles, which is one of the factors that reduce the reliability of trans-ethnic major histocompatibility complex (MHC) fine-mapping due to inter-ethnic heterogeneity in allele frequency spectra. We develop DEEP*HLA, a deep learning method for imputing HLA genotypes. Through validation using the Japanese and European HLA reference panels (n = 1,118 and 5,122), DEEP*HLA achieves the highest accuracies with significant superiority for low-frequency and rare alleles. DEEP*HLA is less dependent on distance-dependent linkage disequilibrium decay of the target alleles and might capture the complicated region-wide information. We apply DEEP*HLA to type 1 diabetes GWAS data from BioBank Japan (n = 62,387) and UK Biobank (n = 354,459), and successfully disentangle independently associated class I and II HLA variants with shared risk among diverse populations (the top signal at amino acid position 71 of HLA-DRβ1; P = 7.5 × 10⁻¹²⁰). Our study illustrates the value of deep learning in genotype imputation and trans-ethnic MHC fine-mapping.

DOI

10.1038/s41467-021-21975-x

DEPICT

Tool

PUBMED_LINK

25597830

FULL NAME

Data-driven Expression Prioritized Integration for Complex Traits

DESCRIPTION

An integrative tool that employs predicted gene functions to systematically prioritize the most likely causal genes at associated loci, highlight enriched pathways and identify tissues/cell types where genes from associated loci are highly expressed. DEPICT is not limited to genes with established functions and prioritizes relevant gene sets for many phenotypes.

URL

https://github.com/perslab/depict

KEYWORDS

co-regulation of gene expression

TITLE

Biological interpretation of genome-wide association studies using predicted gene functions.

Main citation

Pers TH, Karjalainen JM, Chan Y, Westra HJ, ...&, Franke L. (2015) Biological interpretation of genome-wide association studies using predicted gene functions. Nat Commun, 6, 5890. doi:10.1038/ncomms6890. PMID 25597830

ABSTRACT

The main challenge for gaining biological insights from genetic associations is identifying which genes and pathways explain the associations. Here we present DEPICT, an integrative tool that employs predicted gene functions to systematically prioritize the most likely causal genes at associated loci, highlight enriched pathways and identify tissues/cell types where genes from associated loci are highly expressed. DEPICT is not limited to genes with established functions and prioritizes relevant gene sets for many phenotypes.

DOI

10.1038/ncomms6890

DOG

Tool

PUBMED_LINK

19153597

FULL NAME

Domain Graph

DESCRIPTION

DOG is software for experimentalists to prepare publication-quality figures of protein domain structures.

URL

https://dog.biocuckoo.org/

TITLE

DOG 1.0: illustrator of protein domain structures.

Main citation

Ren J, Wen L, Gao X, Jin C, ...&, Yao X. (2009) DOG 1.0: illustrator of protein domain structures. Cell Res, 19 (2) 271-3. doi:10.1038/cr.2009.6. PMID 19153597

DOI

10.1038/cr.2009.6

DRUG TARGETOR

Tool

PUBMED_LINK

30517594

DESCRIPTION

This website harnesses results from genome-wide association studies (GWAS) and drug bioactivity data to prioritize drugs and targets for a given phenotype. Drug Targetor networks are constructed using genetically scored drugs and genes, connected by the type of drug-target or drug-gene interaction.

URL

https://drugtargetor.com/index_v1.21.html

TITLE

Drug Targetor: a web interface to investigate the human druggome for over 500 phenotypes.

Main citation

Gaspar HA, Hübel C, Breen G. (2019) Drug Targetor: a web interface to investigate the human druggome for over 500 phenotypes. Bioinformatics, 35 (14) 2515-2517. doi:10.1093/bioinformatics/bty982. PMID 30517594

ABSTRACT

SUMMARY: Results from hundreds of genome-wide association studies (GWAS) are now freely available and offer a catalogue of the association between phenotypes across medicine with variants in the genome. With the aim of using this data to better understand therapeutic mechanisms, we have developed Drug Targetor, a web interface that allows the generation and exploration of drug-target networks of hundreds of phenotypes using GWAS data. Drug Targetor networks consist of drug and target nodes ordered by genetic association and connected by drug-target or drug-gene relationship. We show that Drug Targetor can help prioritize drugs, targets and drug-target interactions for a specific phenotype based on genetic evidence. AVAILABILITY AND IMPLEMENTATION: Drug Targetor v1.21 is a web application freely available online at drugtargetor.com and under MIT licence. The source code can be found at https://github.com/hagax8/drugtargetor. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

DOI

10.1093/bioinformatics/bty982

EAGLE

Tool

PUBMED_LINK

27270109

DESCRIPTION

(EAGLE1)

URL

https://alkesgroup.broadinstitute.org/Eagle/

TITLE

Fast and accurate long-range phasing in a UK Biobank cohort.

Main citation

Loh PR, Palamara PF, Price AL. (2016) Fast and accurate long-range phasing in a UK Biobank cohort. Nat Genet, 48 (7) 811-6. doi:10.1038/ng.3571. PMID 27270109

ABSTRACT

Recent work has leveraged the extensive genotyping of the Icelandic population to perform long-range phasing (LRP), enabling accurate imputation and association analysis of rare variants in target samples typed on genotyping arrays. Here we develop a fast and accurate LRP method, Eagle, that extends this paradigm to populations with much smaller proportions of genotyped samples by harnessing long (>4-cM) identical-by-descent (IBD) tracts shared among distantly related individuals. We applied Eagle to N ≈ 150,000 samples (0.2% of the British population) from the UK Biobank, and we determined that it is 1-2 orders of magnitude faster than existing methods while achieving similar or better phasing accuracy (switch error rate ≈ 0.3%, corresponding to perfect phase in a majority of 10-Mb segments). We also observed that, when used within an imputation pipeline, Eagle prephasing improved downstream imputation accuracy in comparison to prephasing in batches using existing methods, as necessary to achieve comparable computational cost.

DOI

10.1038/ng.3571

EAGLE2

Tool

PUBMED_LINK

27694958

DESCRIPTION

(EAGLE2)

URL

https://alkesgroup.broadinstitute.org/Eagle/

TITLE

Reference-based phasing using the Haplotype Reference Consortium panel.

Main citation

Loh PR, Danecek P, Palamara PF, Fuchsberger C, ...&, L Price A. (2016) Reference-based phasing using the Haplotype Reference Consortium panel. Nat Genet, 48 (11) 1443-1448. doi:10.1038/ng.3679. PMID 27694958

ABSTRACT

Haplotype phasing is a fundamental problem in medical and population genetics. Phasing is generally performed via statistical phasing in a genotyped cohort, an approach that can yield high accuracy in very large cohorts but attains lower accuracy in smaller cohorts. Here we instead explore the paradigm of reference-based phasing. We introduce a new phasing algorithm, Eagle2, that attains high accuracy across a broad range of cohort sizes by efficiently leveraging information from large external reference panels (such as the Haplotype Reference Consortium; HRC) using a new data structure based on the positional Burrows-Wheeler transform. We demonstrate that Eagle2 attains a ∼20× speedup and ∼10% increase in accuracy compared to reference-based phasing using SHAPEIT2. On European-ancestry samples, Eagle2 with the HRC panel achieves >2× the accuracy of 1000 Genomes-based phasing. Eagle2 is open source and freely available for HRC-based phasing via the Sanger Imputation Service and the Michigan Imputation Server.

DOI

10.1038/ng.3679

eCAVIAR

Tool

PUBMED_LINK

27866706

FULL NAME

eQTL and GWAS Causal Variant Identification in Associated Regions

URL

https://github.com/fhormoz/caviar

TITLE

Colocalization of GWAS and eQTL Signals Detects Target Genes.

Main citation

Hormozdiari F, van de Bunt M, Segrè AV, Li X, ...&, Eskin E. (2016) Colocalization of GWAS and eQTL Signals Detects Target Genes. Am J Hum Genet, 99 (6) 1245-1260. doi:10.1016/j.ajhg.2016.10.003. PMID 27866706

ABSTRACT

The vast majority of genome-wide association study (GWAS) risk loci fall in non-coding regions of the genome. One possible hypothesis is that these GWAS risk loci alter the individual's disease risk through their effect on gene expression in different tissues. In order to understand the mechanisms driving a GWAS risk locus, it is helpful to determine which gene is affected in specific tissue types. For example, the relevant gene and tissue could play a role in the disease mechanism if the same variant responsible for a GWAS locus also affects gene expression. Identifying whether or not the same variant is causal in both GWASs and expression quantitative trait locus (eQTL) studies is challenging because of the uncertainty induced by linkage disequilibrium and the fact that some loci harbor multiple causal variants. However, current methods that address this problem assume that each locus contains a single causal variant. In this paper, we present eCAVIAR, a probabilistic method that has several key advantages over existing methods. First, our method can account for more than one causal variant in any given locus. Second, it can leverage summary statistics without accessing the individual genotype data. We use both simulated and real datasets to demonstrate the utility of our method. Using publicly available eQTL data on 45 different tissues, we demonstrate that eCAVIAR can prioritize likely relevant tissues and target genes for a set of glucose- and insulin-related trait loci.

DOI

10.1016/j.ajhg.2016.10.003

EHH

Tool

PUBMED_LINK

35041674

FULL NAME

Extended haplotype homozygosity

DESCRIPTION

Sabeti, P. C., Reich, D. E., Higgins, J. M., Levine, H. Z., Richter, D. J., Schaffner, S. F., ... & Lander, E. S. (2002). Detecting recent positive selection in the human genome from haplotype structure. Nature, 419(6909), 832-837.

TITLE

Detecting selection using extended haplotype homozygosity (EHH)-based statistics in unphased or unpolarized data.

Main citation

Klassmann A, Gautier M. (2022) Detecting selection using extended haplotype homozygosity (EHH)-based statistics in unphased or unpolarized data. PLoS One, 17 (1) e0262024. doi:10.1371/journal.pone.0262024. PMID 35041674

ABSTRACT

Analysis of population genetic data often includes a search for genomic regions with signs of recent positive selection. One such approach involves the concept of extended haplotype homozygosity (EHH) and its associated statistics. These statistics typically require phased haplotypes, and some of them necessitate polarized variants. Here, we unify and extend previously proposed modifications to loosen these requirements. We compare the modified versions with the original ones by measuring the false discovery rate in simulated whole-genome scans and by quantifying the overlap of inferred candidate regions in empirical data. We find that phasing information is indispensable for accurate estimation of within-population statistics (for all but very large samples) and of cross-population statistics for small samples. Ancestry information, in contrast, is of lesser importance for both types of statistic. Our publicly available R package rehh incorporates the modified statistics presented here.
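
For orientation, the basic phased-data EHH statistic referred to above can be sketched as follows, assuming the usual definition: among chromosomes carrying a given core allele, EHH at a given distance is the probability that two randomly chosen carriers are identical at every marker from the core out to that distance. The function name `ehh` and the representation of haplotypes as strings or sequences of alleles are assumptions for the example; this is not the rehh implementation and does not cover the unphased or unpolarized variants the abstract discusses.

```python
from collections import Counter
from math import comb

def ehh(haplotypes, core_index, core_allele, distance):
    """EHH for carriers of core_allele, extended `distance` markers to the right.

    haplotypes : list of phased haplotypes (strings or sequences of alleles)
    core_index : position of the core marker within each haplotype
    """
    carriers = [h for h in haplotypes if h[core_index] == core_allele]
    n = len(carriers)
    if n < 2:
        raise ValueError("need at least two carrier chromosomes")
    # Group carriers by their extended haplotype from the core to core + distance
    groups = Counter(
        tuple(h[core_index:core_index + distance + 1]) for h in carriers
    )
    # Fraction of carrier pairs that are identical over the whole interval
    return sum(comb(count, 2) for count in groups.values()) / comb(n, 2)
```

By construction the value is 1 at distance 0 and can only stay level or decay as the interval grows; a slow decay around one allele is the signal that EHH-based scans look for.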

DOI

10.1371/journal.pone.0262024

EIGENSTRAT

Tool

PUBMED_LINK

16862161

URL

https://github.com/DReichLab/EIG

KEYWORDS

PCA, Linear

TITLE

Principal components analysis corrects for stratification in genome-wide association studies.

Main citation

Price AL, Patterson NJ, Plenge RM, Weinblatt ME, ...&, Reich D. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet, 38 (8) 904-9. doi:10.1038/ng1847. PMID 16862161

ABSTRACT

Population stratification (allele frequency differences between cases and controls due to systematic ancestry differences) can cause spurious associations in disease studies. We describe a method that enables explicit detection and correction of population stratification on a genome-wide scale. Our method uses principal components analysis to explicitly model ancestry differences between cases and controls. The resulting correction is specific to a candidate marker's variation in frequency across ancestral populations, minimizing spurious associations while maximizing power to detect true associations. Our simple, efficient approach can easily be applied to disease studies with hundreds of thousands of markers.
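
The correction described above can be illustrated with a minimal sketch: compute principal components from the genotype matrix, regress them out of both the candidate marker and the phenotype, and test the residuals. The function names (`top_pcs`, `residualize`, `adjusted_stat`) are hypothetical, markers are standardized by their sample standard deviation as a simplification, and the statistic is assumed to be (N - K - 1) times the squared correlation of the adjusted values; this is not the EIG software itself.

```python
import numpy as np

def top_pcs(genotypes, k):
    """Top-k principal components of a samples x markers genotype matrix."""
    # Centre and scale each marker; guard against monomorphic markers
    g = genotypes - genotypes.mean(axis=0)
    sd = g.std(axis=0)
    sd[sd == 0] = 1.0
    g = g / sd
    # Left singular vectors give the sample-level axes of variation
    u, _, _ = np.linalg.svd(g, full_matrices=False)
    return u[:, :k]

def residualize(x, pcs):
    """Remove from x the part linearly explained by the PCs (plus intercept)."""
    design = np.column_stack([np.ones(len(x)), pcs])
    coef, *_ = np.linalg.lstsq(design, x, rcond=None)
    return x - design @ coef

def adjusted_stat(genotype, phenotype, pcs):
    """Association statistic computed on ancestry-adjusted marker and phenotype."""
    g = residualize(np.asarray(genotype, dtype=float), pcs)
    y = residualize(np.asarray(phenotype, dtype=float), pcs)
    r = np.corrcoef(g, y)[0, 1]
    n, k = pcs.shape
    return (n - k - 1) * r ** 2
```

Because the marker itself is residualized, the size of the correction depends on how strongly that marker's frequency varies along the inferred ancestry axes, which matches the marker-specific behaviour the abstract describes.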

DOI

10.1038/ng1847

Ellis CA

Tool

PUBMED_LINK

39168121

TITLE

Inflation of polygenic risk scores caused by sample overlap and relatedness: Examples of a major risk of bias.

Main citation

Ellis CA, Oliver KL, Harris RV, Ottman R, ...&, Bahlo M. (2024) Inflation of polygenic risk scores caused by sample overlap and relatedness: Examples of a major risk of bias. Am J Hum Genet, 111 (9) 1805-1809. doi:10.1016/j.ajhg.2024.07.014. PMID 39168121

ABSTRACT

Polygenic risk scores (PRSs) are an important tool for understanding the role of common genetic variants in human disease. Standard best practices recommend that PRSs be analyzed in cohorts that are independent of the genome-wide association study (GWAS) used to derive the scores without sample overlap or relatedness between the two cohorts. However, identifying sample overlap and relatedness can be challenging in an era of GWASs performed by large biobanks and international research consortia. Although most genomics researchers are aware of best practices and theoretical concerns about sample overlap and relatedness between GWAS and PRS cohorts, the prevailing assumption is that the risk of bias is small for very large GWASs. Here, we present two real-world examples demonstrating that sample overlap and relatedness is not a minor or theoretical concern but an important potential source of bias in PRS studies. Using a recently developed statistical adjustment tool, we found that excluding overlapping and related samples was equal to or more powerful than adjusting for overlap bias. Our goal is to make genomics researchers aware of the magnitude of risk of bias from sample overlap and relatedness and to highlight the need for mitigation tools, including independent validation cohorts in PRS studies, continued development of statistical adjustment methods, and tools for researchers to test their cohorts for overlap and relatedness with GWAS cohorts without sharing individual-level data.

DOI

10.1016/j.ajhg.2024.07.014

EMMAX

Tool

PUBMED_LINK

20208533

FULL NAME

efficient mixed-model association eXpedited

DESCRIPTION

EMMAX is a statistical test for large-scale human or model organism association mapping that accounts for sample structure. In addition to the computational efficiency of the EMMA algorithm, EMMAX takes advantage of the fact that each locus explains only a small fraction of the variance of a complex trait, which allows it to avoid repeated variance component estimation and substantially reduces the computational time of mixed-model association mapping.

URL

https://genome.sph.umich.edu/wiki/EMMAX

TITLE

Variance component model to account for sample structure in genome-wide association studies.

Main citation

Kang HM, Sul JH, Service SK, Zaitlen NA, ...&, Eskin E. (2010) Variance component model to account for sample structure in genome-wide association studies. Nat Genet, 42 (4) 348-54. doi:10.1038/ng.548. PMID 20208533

ABSTRACT

Although genome-wide association studies (GWASs) have identified numerous loci associated with complex traits, imprecise modeling of the genetic relatedness within study samples may cause substantial inflation of test statistics and possibly spurious associations. Variance component approaches, such as efficient mixed-model association (EMMA), can correct for a wide range of sample structures by explicitly accounting for pairwise relatedness between individuals, using high-density markers to model the phenotype distribution; but such approaches are computationally impractical. We report here a variance component approach implemented in publicly available software, EMMA eXpedited (EMMAX), that reduces the computational time for analyzing large GWAS data sets from years to hours. We apply this method to two human GWAS data sets, performing association analysis for ten quantitative traits from the Northern Finland Birth Cohort and seven common diseases from the Wellcome Trust Case Control Consortium. We find that EMMAX outperforms both principal component analysis and genomic control in correcting for sample structure.

DOI

10.1038/ng.548

EPIC

Tool

PUBMED_LINK

35709291

FULL NAME

cEll tyPe enrIChment

DESCRIPTION

Inferring relevant tissues and cell types for complex traits in genome-wide association studies

URL

https://github.com/rujinwang/EPIC

KEYWORDS

GWAS, scRNA-seq

TITLE

EPIC: Inferring relevant cell types for complex traits by integrating genome-wide association studies and single-cell RNA sequencing.

Main citation

Wang R, Lin DY, Jiang Y. (2022) EPIC: Inferring relevant cell types for complex traits by integrating genome-wide association studies and single-cell RNA sequencing. PLoS Genet, 18 (6) e1010251. doi:10.1371/journal.pgen.1010251. PMID 35709291

ABSTRACT

More than a decade of genome-wide association studies (GWASs) have identified genetic risk variants that are significantly associated with complex traits. Emerging evidence suggests that the function of trait-associated variants likely acts in a tissue- or cell-type-specific fashion. Yet, it remains challenging to prioritize trait-relevant tissues or cell types to elucidate disease etiology. Here, we present EPIC (cEll tyPe enrIChment), a statistical framework that relates large-scale GWAS summary statistics to cell-type-specific gene expression measurements from single-cell RNA sequencing (scRNA-seq). We derive powerful gene-level test statistics for common and rare variants, separately and jointly, and adopt generalized least squares to prioritize trait-relevant cell types while accounting for the correlation structures both within and between genes. Using enrichment of loci associated with four lipid traits in the liver and enrichment of loci associated with three neurological disorders in the brain as ground truths, we show that EPIC outperforms existing methods. We apply our framework to multiple scRNA-seq datasets from different platforms and identify cell types underlying type 2 diabetes and schizophrenia. The enrichment is replicated using independent GWAS and scRNA-seq datasets and further validated using PubMed search and existing bulk case-control testing results.

DOI

10.1371/journal.pgen.1010251

ExPRSweb

Tool

PUBMED_LINK

36152628

FULL NAME

exposure polygenic risk scores (ExPRSs)

DESCRIPTION

Integrating published and freely available genome-wide association studies (GWAS) summary statistics from multiple sources (published GWAS, the NHGRI-EBI GWAS Catalog, FinnGen- or UKB-based GWAS), we created an online repository for exposure polygenic risk scores (ExPRS) for health-related exposure traits. Our framework condenses these summary statistics into ExPRS using linkage disequilibrium pruning and p-value thresholding (P&T) or penalized, genome-wide effect size weighting. We evaluate them in the cohort of the Michigan Genomics Initiative (MGI), a longitudinal biorepository effort at Michigan Medicine, and in the population-based UK Biobank Study (UKB). For each ExPRS construct, measures on performance, accuracy, and discrimination are provided. Beyond the ExPRS evaluation in MGI and UKB, the ExPRSweb platform features construct downloads, evaluation in the top percentiles, and phenome-wide ExPRS association studies (ExPRS-PheWAS) for a subset of ExPRS that are predictive for the corresponding exposure.
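
The pruning and thresholding (P&T) step named above can be illustrated with a minimal Python sketch. It is not the ExPRSweb pipeline: the function names, the `r2` lookup, the default cut-offs, and the toy variants are all assumptions chosen for illustration.

```python
from typing import Callable, Dict, List


def prune_and_threshold(
    variants: List[dict],
    r2: Callable[[str, str], float],
    p_max: float = 5e-8,   # hypothetical p-value threshold
    r2_max: float = 0.1,   # hypothetical LD pruning threshold
) -> List[dict]:
    """Keep variants with p <= p_max, then greedily drop any variant that is
    in LD (r2 > r2_max) with an already kept, more significant variant."""
    passing = sorted((v for v in variants if v["p"] <= p_max), key=lambda v: v["p"])
    kept: List[dict] = []
    for v in passing:
        if all(r2(v["id"], k["id"]) <= r2_max for k in kept):
            kept.append(v)
    return kept


def polygenic_score(kept: List[dict], dosages: Dict[str, float]) -> float:
    """Weighted sum of effect-allele dosages (0 to 2) for one individual."""
    return sum(v["beta"] * dosages.get(v["id"], 0.0) for v in kept)


# Toy summary statistics (invented values, not real GWAS results).
variants = [
    {"id": "rs1", "beta": 0.20, "p": 1e-12},
    {"id": "rs2", "beta": 0.18, "p": 1e-9},   # in LD with rs1, so pruned
    {"id": "rs3", "beta": -0.10, "p": 1e-8},
    {"id": "rs4", "beta": 0.05, "p": 0.03},   # fails the p-value threshold
]
toy_ld = {frozenset(("rs1", "rs2")): 0.8}


def toy_r2(a: str, b: str) -> float:
    """Look up pairwise r2; unlisted pairs are treated as unlinked."""
    return toy_ld.get(frozenset((a, b)), 0.0)


kept = prune_and_threshold(variants, toy_r2)
score = polygenic_score(kept, {"rs1": 2, "rs2": 1, "rs3": 1})
```

In practice the LD lookup would come from a reference panel, and the two thresholds would be tuned against a validation cohort rather than fixed in advance.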

URL

https://exprsweb.sph.umich.edu:8443/

KEYWORDS

exposure PRS

TITLE

ExPRSweb: An online repository with polygenic risk scores for common health-related exposures.

Main citation

Ma Y, Patil S, Zhou X, Mukherjee B, ...&, Fritsche LG. (2022) ExPRSweb: An online repository with polygenic risk scores for common health-related exposures. Am J Hum Genet, 109 (10) 1742-1760. doi:10.1016/j.ajhg.2022.09.001. PMID 36152628

ABSTRACT

Complex traits are influenced by genetic risk factors, lifestyle, and environmental variables, so-called exposures. Some exposures, e.g., smoking or lipid levels, have common genetic modifiers identified in genome-wide association studies. Because measurements are often unfeasible, exposure polygenic risk scores (ExPRSs) offer an alternative to study the influence of exposures on various phenotypes. Here, we collected publicly available summary statistics for 28 exposures and applied four common PRS methods to generate ExPRSs in two large biobanks: the Michigan Genomics Initiative and the UK Biobank. We established ExPRSs for 27 exposures and demonstrated their applicability in phenome-wide association studies and as predictors for common chronic conditions. Especially the addition of multiple ExPRSs showed, for several chronic conditions, an improvement compared to prediction models that only included traditional, disease-focused PRSs. To facilitate follow-up studies, we share all ExPRS constructs and generated results via an online repository called ExPRSweb.

DOI

10.1016/j.ajhg.2022.09.001

f

Tool

PUBMED_LINK

27197222

FULL NAME

fraction of sites under selection

DESCRIPTION

Moon, S., & Akey, J. M. (2016). A flexible method for estimating the fraction of fitness influencing mutations from large sequencing data sets. Genome Research, 26(6), 834-843.

TITLE

A flexible method for estimating the fraction of fitness influencing mutations from large sequencing data sets.

Main citation

Moon S, Akey JM. (2016) A flexible method for estimating the fraction of fitness influencing mutations from large sequencing data sets. Genome Res, 26 (6) 834-43. doi:10.1101/gr.203059.115. PMID 27197222

ABSTRACT

A continuing challenge in the analysis of massively large sequencing data sets is quantifying and interpreting non-neutrally evolving mutations. Here, we describe a flexible and robust approach based on the site frequency spectrum to estimate the fraction of deleterious and adaptive variants from large-scale sequencing data sets. We applied our method to approximately 1 million single nucleotide variants (SNVs) identified in high-coverage exome sequences of 6515 individuals. We estimate that the fraction of deleterious nonsynonymous SNVs is higher than previously reported; quantify the effects of genomic context, codon bias, chromatin accessibility, and number of protein-protein interactions on deleterious protein-coding SNVs; and identify pathways and networks that have likely been influenced by positive selection. Furthermore, we show that the fraction of deleterious nonsynonymous SNVs is significantly higher for Mendelian versus complex disease loci and in exons harboring dominant versus recessive Mendelian mutations. In summary, as genome-scale sequencing data accumulate in progressively larger sample sizes, our method will enable increasingly high-resolution inferences into the characteristics and determinants of non-neutral variation.

DOI

10.1101/gr.203059.115

FactorGO

Tool

PUBMED_LINK

37879338

FULL NAME

Factor analysis model in Genetic assOciation

DESCRIPTION

FactorGo is a scalable variational factor analysis model that learns pleiotropic factors using GWAS summary statistics.

URL

https://github.com/mancusolab/FactorGo

KEYWORDS

pleiotropy, factor analysis

TITLE

A scalable approach to characterize pleiotropy across thousands of human diseases and complex traits using GWAS summary statistics.

Main citation

Zhang Z, Jung J, Kim A, Suboc N, ...&, Mancuso N. (2023) A scalable approach to characterize pleiotropy across thousands of human diseases and complex traits using GWAS summary statistics. Am J Hum Genet, 110 (11) 1863-1874. doi:10.1016/j.ajhg.2023.09.015. PMID 37879338

ABSTRACT

Genome-wide association studies (GWASs) across thousands of traits have revealed the pervasive pleiotropy of trait-associated genetic variants. While methods have been proposed to characterize pleiotropic components across groups of phenotypes, scaling these approaches to ultra-large-scale biobanks has been challenging. Here, we propose FactorGo, a scalable variational factor analysis model to identify and characterize pleiotropic components using biobank GWAS summary data. In extensive simulations, we observe that FactorGo outperforms the state-of-the-art (model-free) approach tSVD in capturing latent pleiotropic factors across phenotypes while maintaining a similar computational cost. We apply FactorGo to estimate 100 latent pleiotropic factors from GWAS summary data of 2,483 phenotypes measured in European-ancestry Pan-UK BioBank individuals (N = 420,531). Next, we find that factors from FactorGo are more enriched with relevant tissue-specific annotations than those identified by tSVD (p = 2.58E-10) and validate our approach by recapitulating brain-specific enrichment for BMI and the height-related connection between reproductive system and muscular-skeletal growth. Finally, our analyses suggest shared etiologies between rheumatoid arthritis and periodontal condition in addition to alkaline phosphatase as a candidate prognostic biomarker for prostate cancer. Overall, FactorGo improves our biological understanding of shared etiologies across thousands of GWASs.

DOI

10.1016/j.ajhg.2023.09.015

fastASSET

Tool

PUBMED_LINK

39143063

URL

https://github.com/gqi/fastASSET

TITLE

Genome-wide large-scale multi-trait analysis characterizes global patterns of pleiotropy and unique trait-specific variants.

Main citation

Qi G, Chhetri SB, Ray D, Dutta D, ...&, Chatterjee N. (2024) Genome-wide large-scale multi-trait analysis characterizes global patterns of pleiotropy and unique trait-specific variants. Nat Commun, 15 (1) 6985. doi:10.1038/s41467-024-51075-5. PMID 39143063

ABSTRACT

Genome-wide association studies (GWAS) have found widespread evidence of pleiotropy, but characterization of global patterns of pleiotropy remains highly incomplete due to insufficient power of current approaches. We develop fastASSET, a method that allows efficient detection of variant-level pleiotropic association across many traits. We analyze GWAS summary statistics of 116 complex traits of diverse types collected from the GRASP repository and large GWAS Consortia. We identify 2293 independent loci and find that the lead variants in nearly all these loci (~99%) are associated with ≥ 2 traits (median = 6). We observe that degree of pleiotropy estimated from our study predicts that observed in the UK Biobank for a much larger number of traits (K = 4114) (correlation = 0.43, p-value < 2.2 × 10⁻¹⁶). Follow-up analyses of 21 trait-specific variants indicate their link to the expression in trait-related tissues for a small number of genes involved in relevant biological processes. Our findings provide deeper insight into the nature of pleiotropy and lead to identification of highly trait-specific susceptibility variants.

DOI

10.1038/s41467-024-51075-5

fastGWA

Tool

PUBMED_LINK

31768069

URL

https://yanglab.westlake.edu.cn/software/gcta/#fastGWA

KEYWORDS

grid-search-based REML algorithm

TITLE

A resource-efficient tool for mixed model association analysis of large-scale data.

Main citation

Jiang L, Zheng Z, Qi T, Kemper KE, ...&, Yang J. (2019) A resource-efficient tool for mixed model association analysis of large-scale data. Nat Genet, 51 (12) 1749-1755. doi:10.1038/s41588-019-0530-8. PMID 31768069

ABSTRACT

The genome-wide association study (GWAS) has been widely used as an experimental design to detect associations between genetic variants and a phenotype. Two major confounding factors, population stratification and relatedness, could potentially lead to inflated GWAS test statistics and hence to spurious associations. Mixed linear model (MLM)-based approaches can be used to account for sample structure. However, genome-wide association (GWA) analyses in biobank samples such as the UK Biobank (UKB) often exceed the capability of most existing MLM-based tools especially if the number of traits is large. Here, we develop an MLM-based tool (fastGWA) that controls for population stratification by principal components and for relatedness by a sparse genetic relationship matrix for GWA analyses of biobank-scale data. We demonstrate by extensive simulations that fastGWA is reliable, robust and highly resource-efficient. We then apply fastGWA to 2,173 traits on array-genotyped and imputed samples from 456,422 individuals and to 2,048 traits on whole-exome-sequenced samples from 46,191 individuals in the UKB.

DOI

10.1038/s41588-019-0530-8

fastGWA-GLMM

Tool

PUBMED_LINK

34737426

URL

https://yanglab.westlake.edu.cn/software/gcta/#fastGWA

TITLE

A generalized linear mixed model association tool for biobank-scale data.

Main citation

Jiang L, Zheng Z, Fang H, Yang J. (2021) A generalized linear mixed model association tool for biobank-scale data. Nat Genet, 53 (11) 1616-1621. doi:10.1038/s41588-021-00954-4. PMID 34737426

ABSTRACT

Compared with linear mixed model-based genome-wide association (GWA) methods, generalized linear mixed model (GLMM)-based methods have better statistical properties when applied to binary traits but are computationally much slower. In the present study, leveraging efficient sparse matrix-based algorithms, we developed a GLMM-based GWA tool, fastGWA-GLMM, that is severalfold to orders of magnitude faster than the state-of-the-art tools when applied to the UK Biobank (UKB) data and scalable to cohorts with millions of individuals. We show by simulation that the fastGWA-GLMM test statistics of both common and rare variants are well calibrated under the null, even for traits with extreme case-control ratios. We applied fastGWA-GLMM to the UKB data of 456,348 individuals, 11,842,647 variants and 2,989 binary traits (full summary statistics available at http://fastgwa.info/ukbimpbin), and identified 259 rare variants associated with 75 traits, demonstrating the use of imputed genotype data in a large cohort to discover rare variants for binary complex traits.

DOI

10.1038/s41588-021-00954-4

fastPHASE

Tool

PUBMED_LINK

16532393

URL

http://scheet.org/software.html

TITLE

A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase.

Main citation

Scheet P, Stephens M. (2006) A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet, 78 (4) 629-44. doi:10.1086/502802. PMID 16532393

ABSTRACT

We present a statistical model for patterns of genetic variation in samples of unrelated individuals from natural populations. This model is based on the idea that, over short regions, haplotypes in a population tend to cluster into groups of similar haplotypes. To capture the fact that, because of recombination, this clustering tends to be local in nature, our model allows cluster memberships to change continuously along the chromosome according to a hidden Markov model. This approach is flexible, allowing for both "block-like" patterns of linkage disequilibrium (LD) and gradual decline in LD with distance. The resulting model is also fast and, as a result, is practicable for large data sets (e.g., thousands of individuals typed at hundreds of thousands of markers). We illustrate the utility of the model by applying it to dense single-nucleotide-polymorphism genotype data for the tasks of imputing missing genotypes and estimating haplotypic phase. For imputing missing genotypes, methods based on this model are as accurate or more accurate than existing methods. For haplotype estimation, the point estimates are slightly less accurate than those from the best existing methods (e.g., for unrelated Centre d'Etude du Polymorphisme Humain individuals from the HapMap project, switch error was 0.055 for our method vs. 0.051 for PHASE) but require a small fraction of the computational cost. In addition, we demonstrate that the model accurately reflects uncertainty in its estimates, in that probabilities computed using the model are approximately well calibrated. The methods described in this article are implemented in a software package, fastPHASE, which is available from the Stephens Lab Web site.

DOI

10.1086/502802

FastQTL

Tool

PUBMED_LINK

26708335

DESCRIPTION

To discover quantitative trait loci (QTLs), multi-dimensional genomic datasets combining DNA-seq and ChIP-/RNA-seq require methods that rapidly correlate tens of thousands of molecular phenotypes with millions of genetic variants while appropriately controlling for multiple testing. FastQTL implements a popular cis-QTL mapping strategy in a user- and cluster-friendly tool. FastQTL also proposes an efficient permutation procedure to control for multiple testing.

URL

https://github.com/francois-a/fastqtl

TITLE

Fast and efficient QTL mapper for thousands of molecular phenotypes.

Main citation

Ongen H, Buil A, Brown AA, Dermitzakis ET, ...&, Delaneau O. (2016) Fast and efficient QTL mapper for thousands of molecular phenotypes. Bioinformatics, 32 (10) 1479-85. doi:10.1093/bioinformatics/btv722. PMID 26708335

ABSTRACT

MOTIVATION: In order to discover quantitative trait loci, multi-dimensional genomic datasets combining DNA-seq and ChIP-/RNA-seq require methods that rapidly correlate tens of thousands of molecular phenotypes with millions of genetic variants while appropriately controlling for multiple testing. RESULTS: We have developed FastQTL, a method that implements a popular cis-QTL mapping strategy in a user- and cluster-friendly tool. FastQTL also proposes an efficient permutation procedure to control for multiple testing. The outcome of permutations is modeled using beta distributions trained from a few permutations and from which adjusted P-values can be estimated at any level of significance with little computational cost. The Geuvadis & GTEx pilot datasets can now be easily analyzed an order of magnitude faster than previous approaches. AVAILABILITY AND IMPLEMENTATION: Source code, binaries and comprehensive documentation of FastQTL are freely available to download at http://fastqtl.sourceforge.net/ CONTACT: emmanouil.dermitzakis@unige.ch or olivier.delaneau@unige.ch SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
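
The beta approximation of permutation results described in the abstract can be sketched in Python as follows. This is an illustration under stated assumptions, not FastQTL's code: the beta parameters are fitted by the method of moments (a simplification chosen for this sketch), the null minimum P-values are simulated from independent uniform draws rather than from real permutations of linked variants, and all function names are hypothetical.

```python
import math
import random
from statistics import mean, variance


def fit_beta_moments(samples):
    """Method-of-moments estimates (shape1, shape2) of a beta distribution."""
    m, v = mean(samples), variance(samples)
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common


def _beta_continued_fraction(a, b, x, max_iter=300, eps=1e-12):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even and odd steps of the continued fraction.
        for aa in (
            m * (b - m) * x / ((a - 1.0 + m2) * (a + m2)),
            -(a + m) * (a + b + m) * x / ((a + m2) * (a + 1.0 + m2)),
        ):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def beta_cdf(x, a, b):
    """Regularized incomplete beta function I_x(a, b), i.e. the beta CDF."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(
        math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
        + a * math.log(x) + b * math.log1p(-x)
    )
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_continued_fraction(a, b, x) / a
    return 1.0 - front * _beta_continued_fraction(b, a, 1.0 - x) / b


def adjusted_p_value(observed_min_p, null_min_p):
    """Adjusted P-value: beta CDF, fitted to permutation minima, at the observed minimum."""
    shape1, shape2 = fit_beta_moments(null_min_p)
    return beta_cdf(observed_min_p, shape1, shape2)


# Simulated null: smallest of 50 independent uniform P-values per permutation.
random.seed(1)
null_min_p = [min(random.random() for _ in range(50)) for _ in range(1000)]
adjusted = adjusted_p_value(1e-4, null_min_p)
```

Because the fitted distribution is continuous, the adjusted P-value is not limited to multiples of 1/(number of permutations), which is what lets a small number of permutations stand in for a much larger run.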

DOI

10.1093/bioinformatics/btv722

FINEMAP

Tool

PUBMED_LINK

26773131

DESCRIPTION

FINEMAP is a program for (1) identifying causal SNPs, (2) estimating effect sizes of causal SNPs, and (3) estimating the heritability contribution of causal SNPs.

URL

http://www.christianbenner.com/

KEYWORDS

Shotgun Stochastic Search (SSS)

TITLE

FINEMAP: efficient variable selection using summary data from genome-wide association studies.

Main citation

Benner C, Spencer CC, Havulinna AS, Salomaa V, ...&, Pirinen M. (2016) FINEMAP: efficient variable selection using summary data from genome-wide association studies. Bioinformatics, 32 (10) 1493-501. doi:10.1093/bioinformatics/btw018. PMID 26773131

ABSTRACT

MOTIVATION: The goal of fine-mapping in genomic regions associated with complex diseases and traits is to identify causal variants that point to molecular mechanisms behind the associations. Recent fine-mapping methods using summary data from genome-wide association studies rely on exhaustive search through all possible causal configurations, which is computationally expensive. RESULTS: We introduce FINEMAP, a software package to efficiently explore a set of the most important causal configurations of the region via a shotgun stochastic search algorithm. We show that FINEMAP produces accurate results in a fraction of processing time of existing approaches and is therefore a promising tool for analyzing growing amounts of data produced in genome-wide association studies and emerging sequencing projects. AVAILABILITY AND IMPLEMENTATION: FINEMAP v1.0 is freely available for Mac OS X and Linux at http://www.christianbenner.com CONTACT: christian.benner@helsinki.fi or matti.pirinen@helsinki.fi.

DOI

10.1093/bioinformatics/btw018

flashfmZero

Tool

PUBMED_LINK

40220762

DESCRIPTION

flashfmZero performs zero-correlation latent-factor-based multi-trait fine-mapping from GWAS summary statistics for high-dimensional trait panels (e.g., blood cell counts). Latent-factor GWAS can surface signals below univariate thresholds; in INTERVAL blood-cell analyses, 99% credible sets were at least as small as univariate fine-mapping in most comparisons and were nested within univariate latent-factor credible sets.

URL

https://github.com/jennasimit/flashfmZero

KEYWORDS

latent factor, multi-trait, fine-mapping, GWAS summary statistics, high-dimensional traits

TITLE

Improved genetic discovery and fine-mapping resolution through multivariate latent factor analysis of high-dimensional traits.

Main citation

Zhou F, Astle WJ, Butterworth AS, Asimit JL. (2025) Improved genetic discovery and fine-mapping resolution through multivariate latent factor analysis of high-dimensional traits. Cell Genom, 5 (5) 100847. doi:10.1016/j.xgen.2025.100847. PMID 40220762

ABSTRACT

Genome-wide association studies (GWASs) of high-dimensional traits, such as blood cell or metabolic traits, often use univariate approaches, ignoring trait relationships. Biological mechanisms generating variation in high-dimensional traits can be captured parsimoniously through a GWAS of latent factors. Here, we introduce flashfmZero, a zero-correlation latent-factor-based multi-trait fine-mapping approach. In an application to 25 latent factors derived from 99 blood cell traits in the INTERVAL cohort, we show that latent factor GWASs enable the detection of signals generating sub-threshold associations with several blood cell traits. The 99% credible sets (CS99) from flashfmZero were equal to or smaller in size than those from univariate fine-mapping of blood cell traits in 87% of our comparisons. In all cases univariate latent factor CS99 contained those from flashfmZero. Our latent factor approaches can be applied to GWAS summary statistics and will enhance power for the discovery and fine-mapping of associations for many traits.
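
The credible-set comparisons reported above (size and nesting of CS99) can be illustrated with a minimal Python sketch. It uses the simplest definition of a credible set, assuming one causal variant per region and per-variant posterior probabilities that sum to about 1; flashfmZero's own construction may differ. The function name and the toy posteriors are assumptions.

```python
from typing import Dict, List


def credible_set(posteriors: Dict[str, float], level: float = 0.99) -> List[str]:
    """Smallest set of variants, taken in decreasing order of posterior
    probability, whose cumulative probability reaches `level`."""
    total = 0.0
    chosen: List[str] = []
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    for variant_id, probability in ranked:
        chosen.append(variant_id)
        total += probability
        if total >= level:
            break
    return chosen


# Toy posteriors for one region (invented values).
univariate = {"rs1": 0.90, "rs2": 0.06, "rs3": 0.035, "rs4": 0.005}
multi_trait = {"rs1": 0.97, "rs2": 0.025, "rs3": 0.004, "rs4": 0.001}

cs_univariate = credible_set(univariate)
cs_multi_trait = credible_set(multi_trait)

# Nesting check: is the multi-trait set contained in the univariate set?
is_nested = set(cs_multi_trait) <= set(cs_univariate)
```

A sharper posterior concentrates probability on fewer variants, so the set reaches the 99% level sooner; this is the sense in which a smaller credible set indicates better fine-mapping resolution.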

DOI

10.1016/j.xgen.2025.100847

Four-digit Multi-ethnic HLA v1 (2021)

Tool

PUBMED_LINK

34611364

DESCRIPTION

Available on Michigan imputation server

URL

https://github.com/immunogenomics/HLA-TAPAS/

TITLE

A high-resolution HLA reference panel capturing global population diversity enables multi-ancestry fine-mapping in HIV host response.

Main citation

Luo Y, Kanai M, Choi W, Li X, ...&, Raychaudhuri S. (2021) A high-resolution HLA reference panel capturing global population diversity enables multi-ancestry fine-mapping in HIV host response. Nat Genet, 53 (10) 1504-1516. doi:10.1038/s41588-021-00935-7. PMID 34611364

ABSTRACT

Fine-mapping to plausible causal variation may be more effective in multi-ancestry cohorts, particularly in the MHC, which has population-specific structure. To enable such studies, we constructed a large (n = 21,546) HLA reference panel spanning five global populations based on whole-genome sequences. Despite population-specific long-range haplotypes, we demonstrated accurate imputation at G-group resolution (94.2%, 93.7%, 97.8% and 93.7% in admixed African (AA), East Asian (EAS), European (EUR) and Latino (LAT) populations). Applying HLA imputation to genome-wide association study data for HIV-1 viral load in three populations (EUR, AA and LAT), we obviated effects of previously reported associations from population-specific HIV studies and discovered a novel association at position 156 in HLA-B. We pinpointed the MHC association to three amino acid positions (97, 67 and 156) marking three consecutive pockets (C, B and D) within the HLA-B peptide-binding groove, explaining 12.9% of trait variance.

DOI

10.1038/s41588-021-00935-7

Four-digit Multi-ethnic HLA v2 (2022)

Tool

PUBMED_LINK

34611364

DESCRIPTION

Available on Michigan imputation server

URL

https://github.com/immunogenomics/HLA-TAPAS/

TITLE

A high-resolution HLA reference panel capturing global population diversity enables multi-ancestry fine-mapping in HIV host response.

Main citation

Luo Y, Kanai M, Choi W, Li X, ...&, Raychaudhuri S. (2021) A high-resolution HLA reference panel capturing global population diversity enables multi-ancestry fine-mapping in HIV host response. Nat Genet, 53 (10) 1504-1516. doi:10.1038/s41588-021-00935-7. PMID 34611364

ABSTRACT

Fine-mapping to plausible causal variation may be more effective in multi-ancestry cohorts, particularly in the MHC, which has population-specific structure. To enable such studies, we constructed a large (n = 21,546) HLA reference panel spanning five global populations based on whole-genome sequences. Despite population-specific long-range haplotypes, we demonstrated accurate imputation at G-group resolution (94.2%, 93.7%, 97.8% and 93.7% in admixed African (AA), East Asian (EAS), European (EUR) and Latino (LAT) populations). Applying HLA imputation to genome-wide association study data for HIV-1 viral load in three populations (EUR, AA and LAT), we obviated effects of previously reported associations from population-specific HIV studies and discovered a novel association at position 156 in HLA-B. We pinpointed the MHC association to three amino acid positions (97, 67 and 156) marking three consecutive pockets (C, B and D) within the HLA-B peptide-binding groove, explaining 12.9% of trait variance.

DOI

10.1038/s41588-021-00935-7

FUMA

Tool

PUBMED_LINK

29184056

DESCRIPTION

FUMA is a platform that can be used to annotate, prioritize, visualize and interpret GWAS results.

URL

https://fuma.ctglab.nl/

TITLE

Functional mapping and annotation of genetic associations with FUMA.

Main citation

Watanabe K, Taskesen E, van Bochoven A, Posthuma D. (2017) Functional mapping and annotation of genetic associations with FUMA. Nat Commun, 8 (1) 1826. doi:10.1038/s41467-017-01261-5. PMID 29184056

ABSTRACT

A main challenge in genome-wide association studies (GWAS) is to pinpoint possible causal variants. Results from GWAS typically do not directly translate into causal variants because the majority of hits are in non-coding or intergenic regions, and the presence of linkage disequilibrium leads to effects being statistically spread out across multiple variants. Post-GWAS annotation facilitates the selection of most likely causal variant(s). Multiple resources are available for post-GWAS annotation, yet these can be time consuming and do not provide integrated visual aids for data interpretation. We, therefore, develop FUMA: an integrative web-based platform using information from multiple biological resources to facilitate functional annotation of GWAS results, gene prioritization and interactive visualization. FUMA accommodates positional, expression quantitative trait loci (eQTL) and chromatin interaction mappings, and provides gene-based, pathway and tissue enrichment results. FUMA results directly aid in generating hypotheses that are testable in functional experiments aimed at proving causal relations.

DOI

10.1038/s41467-017-01261-5

FUSION

Tool

PUBMED_LINK

26854917

FULL NAME

Functional Summary-based Imputation

DESCRIPTION

FUSION is a suite of tools for performing transcriptome-wide and regulome-wide association studies (TWAS and RWAS). FUSION builds predictive models of the genetic component of a functional/molecular phenotype and predicts and tests that component for association with disease using GWAS summary statistics. The goal is to identify associations between a GWAS phenotype and a functional phenotype that was only measured in reference data. We provide precomputed predictive models from multiple studies to facilitate this analysis.
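
The summary-statistic test described above can be sketched as a weighted sum of GWAS z-scores, standardized by the LD among the SNPs in the expression model: Z = w'z / sqrt(w'Σw). The Python below is a minimal illustration of that statistic, not FUSION's code; the function name, argument names, and toy numbers are assumptions.

```python
import math
from typing import Sequence


def twas_z(
    weights: Sequence[float],
    gwas_z: Sequence[float],
    ld: Sequence[Sequence[float]],
) -> float:
    """Expression-trait association z-score from summary statistics.

    weights: per-SNP weights of the expression prediction model
    gwas_z:  per-SNP GWAS z-scores, in the same SNP order
    ld:      SNP-by-SNP correlation (LD) matrix from a reference panel
    """
    n = len(weights)
    if len(gwas_z) != n or len(ld) != n or any(len(row) != n for row in ld):
        raise ValueError("weights, gwas_z and ld must have matching dimensions")
    numerator = sum(w * z for w, z in zip(weights, gwas_z))
    variance = sum(
        weights[i] * ld[i][j] * weights[j] for i in range(n) for j in range(n)
    )
    if variance <= 0.0:
        raise ValueError("w' LD w must be positive")
    return numerator / math.sqrt(variance)


# Toy inputs (invented values, not real model weights or GWAS results).
toy_weights = [0.5, -0.2, 0.1]
toy_gwas_z = [3.1, -1.0, 0.4]
toy_ld = [
    [1.0, 0.3, 0.0],
    [0.3, 1.0, 0.1],
    [0.0, 0.1, 1.0],
]
toy_statistic = twas_z(toy_weights, toy_gwas_z, toy_ld)
```

Dividing by sqrt(w'Σw) accounts for correlation among the model SNPs, so that under the null the statistic is approximately standard normal even when the weighted SNPs are in LD.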

URL

http://gusevlab.org/projects/fusion/

TITLE

Integrative approaches for large-scale transcriptome-wide association studies.

Main citation

Gusev A, Ko A, Shi H, Bhatia G, ...&, Pasaniuc B. (2016) Integrative approaches for large-scale transcriptome-wide association studies. Nat Genet, 48 (3) 245-52. doi:10.1038/ng.3506. PMID 26854917

ABSTRACT

Many genetic variants influence complex traits by modulating gene expression, thus altering the abundance of one or multiple proteins. Here we introduce a powerful strategy that integrates gene expression measurements with summary association statistics from large-scale genome-wide association studies (GWAS) to identify genes whose cis-regulated expression is associated with complex traits. We leverage expression imputation from genetic data to perform a transcriptome-wide association study (TWAS) to identify significant expression-trait associations. We applied our approaches to expression data from blood and adipose tissue measured in ~3,000 individuals overall. We imputed gene expression into GWAS data from over 900,000 phenotype measurements to identify 69 new genes significantly associated with obesity-related traits (BMI, lipids and height). Many of these genes are associated with relevant phenotypes in the Hybrid Mouse Diversity Panel. Our results showcase the power of integrating genotype, gene expression and phenotype to gain insights into the genetic basis of complex traits.

DOI

10.1038/ng.3506

G2P

Tool

PUBMED_LINK

30848784

FULL NAME

A Genome-Wide-Association-Study Simulation Tool for Genotype Simulation, Phenotype Simulation, and Power Evaluation

DESCRIPTION

A Genome-Wide-Association-Study simulation tool for genotype simulation, phenotype simulation, and power evaluation.

URL

https://github.com/XiaoleiLiuBio/G2P

TITLE

G2P: a Genome-Wide-Association-Study simulation tool for genotype simulation, phenotype simulation and power evaluation.

Main citation

Tang Y, Liu X. (2019) G2P: a Genome-Wide-Association-Study simulation tool for genotype simulation, phenotype simulation and power evaluation. Bioinformatics, 35 (19) 3852-3854. doi:10.1093/bioinformatics/btz126. PMID 30848784

ABSTRACT

MOTIVATION: Plenty of Genome-Wide-Association-Study (GWAS) methods have been developed for mapping genetic markers that are associated with human diseases and agricultural economic traits. Computer simulation is a useful tool to test the performance of various GWAS methods under certain scenarios. Existing tools are either inefficient in terms of computation and memory or inconvenient to use to simulate big, realistic genotype data and phenotype data to evaluate available GWAS methods. RESULTS: Here, we present a GWAS simulation tool named G2P that can be used to simulate genotype data and phenotype data and to perform power evaluation of GWAS methods. G2P is a user-friendly tool with all functions provided in both graphical user interface and pipeline manners, and it is available for Windows, Mac and Linux environments. Furthermore, G2P achieves maximum efficiency in terms of both memory usage and simulation speed; with G2P, the simulation of genotype data that includes 1 000 000 samples and 2 000 000 markers can be accomplished in 5 h. AVAILABILITY AND IMPLEMENTATION: The G2P software, user manual, and example datasets are freely available at GitHub: https://github.com/XiaoleiLiuBio/G2P. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

DOI

10.1093/bioinformatics/btz126

Galesloot

Tool

PUBMED_LINK

24763738

TITLE

A comparison of multivariate genome-wide association methods.

Main citation

Galesloot TE, van Steen K, Kiemeney LA, Janss LL, ...&, Vermeulen SH. (2014) A comparison of multivariate genome-wide association methods. PLoS One, 9 (4) e95923. doi:10.1371/journal.pone.0095923. PMID 24763738

ABSTRACT

Joint association analysis of multiple traits in a genome-wide association study (GWAS), i.e. a multivariate GWAS, offers several advantages over analyzing each trait in a separate GWAS. In this study we directly compared a number of multivariate GWAS methods using simulated data. We focused on six methods that are implemented in the software packages PLINK, SNPTEST, MultiPhen, BIMBAM, PCHAT and TATES, and also compared them to standard univariate GWAS, analysis of the first principal component of the traits, and meta-analysis of univariate results. We simulated data (N = 1000) for three quantitative traits and one bi-allelic quantitative trait locus (QTL), and varied the number of traits associated with the QTL (explained variance 0.1%), minor allele frequency of the QTL, residual correlation between the traits, and the sign of the correlation induced by the QTL relative to the residual correlation. We compared the power of the methods using empirically fixed significance thresholds (α = 0.05). Our results showed that the multivariate methods implemented in PLINK, SNPTEST, MultiPhen and BIMBAM performed best for the majority of the tested scenarios, with a notable increase in power for scenarios with an opposite sign of genetic and residual correlation. All multivariate analyses resulted in a higher power than univariate analyses, even when only one of the traits was associated with the QTL. Hence, use of multivariate GWAS methods can be recommended, even when genetic correlations between traits are weak.

DOI

10.1371/journal.pone.0095923

GAS Power Calculator

Tool

FULL NAME

Genetic Association Study Power Calculator

DESCRIPTION

This Genetic Association Study (GAS) Power Calculator is a simple interface that can be used to compute statistical power for large one-stage genetic association studies. The underlying method is derived from the CaTS power calculator for two-stage association studies (2006).
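
To make the idea of a one-stage power calculation concrete, the sketch below approximates power for a two-sided allelic z-test in a case-control study. It is not the GAS Power Calculator's own code: the function name `case_control_power`, the multiplicative risk model, the Hardy-Weinberg assumption, and the default `alpha` are all assumptions chosen for illustration.

```python
from statistics import NormalDist


def case_control_power(n_cases, n_controls, maf, prevalence, grr, alpha=5e-8):
    """Approximate power of a two-sided allelic z-test (hypothetical helper).

    Assumes Hardy-Weinberg proportions and a multiplicative model in which each
    copy of the risk allele multiplies disease risk by `grr`.
    """
    p, q = maf, 1.0 - maf
    # Genotype frequencies for 0, 1 and 2 copies of the risk allele.
    geno_freq = [q * q, 2 * p * q, p * p]
    rel_risk = [1.0, grr, grr * grr]
    # Baseline penetrance chosen so that the population prevalence is matched.
    baseline = prevalence / sum(f * r for f, r in zip(geno_freq, rel_risk))
    penetrance = [baseline * r for r in rel_risk]
    if max(penetrance) > 1.0:
        raise ValueError("prevalence and relative risk imply a penetrance above 1")
    # Genotype frequencies conditional on case or control status.
    case_freq = [f * w / prevalence for f, w in zip(geno_freq, penetrance)]
    ctrl_freq = [f * (1 - w) / (1 - prevalence) for f, w in zip(geno_freq, penetrance)]
    p_case = case_freq[2] + case_freq[1] / 2
    p_ctrl = ctrl_freq[2] + ctrl_freq[1] / 2
    # Standard error of the allele-frequency difference (2 alleles per person).
    se = (p_case * (1 - p_case) / (2 * n_cases)
          + p_ctrl * (1 - p_ctrl) / (2 * n_controls)) ** 0.5
    mu = (p_case - p_ctrl) / se  # expected value of the z statistic
    norm = NormalDist()
    crit = norm.inv_cdf(1 - alpha / 2)
    return norm.cdf(mu - crit) + norm.cdf(-mu - crit)


if __name__ == "__main__":
    print(case_control_power(2000, 2000, maf=0.3, prevalence=0.1, grr=1.3))
```

With `grr=1.0` the expected z statistic is zero, so the returned power equals `alpha`, which is a quick sanity check on the normal approximation.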

URL

https://csg.sph.umich.edu/abecasis/gas_power_calculator/

PREPRINT_DOI

10.1101/164343

Main citation

Johnson, J. L., & Abecasis, G. R. (2017). GAS Power Calculator: web-based power calculator for genetic association studies. BioRxiv, 164343.

GATE

Tool

PUBMED_LINK

36114182

FULL NAME

Genetic Analysis of Time-to-Event phenotypes

DESCRIPTION

GATE (Genetic Analysis of Time-to-Event phenotypes) is an R package for scalable and accurate genome-wide association analysis of censored survival data in large-scale biobanks using frailty models.

GATE performs single-variant association tests for time-to-event endpoints. GATE uses the saddlepoint approximation (SPA) (Imhof, J. P., 1961; Kuonen, D., 1999; Dey, R. et al., 2017) to account for heavy censoring rates.

URL

https://github.com/weizhou0/GATE

KEYWORDS

censored time-to-event (TTE) phenotypes

TITLE

Efficient and accurate frailty model approach for genome-wide survival association analysis in large-scale biobanks.

Main citation

Dey R, Zhou W, Kiiskinen T, Havulinna A, ...&, Lin X. (2022) Efficient and accurate frailty model approach for genome-wide survival association analysis in large-scale biobanks. Nat Commun, 13 (1) 5437. doi:10.1038/s41467-022-32885-x. PMID 36114182

ABSTRACT

With decades of electronic health records linked to genetic data, large biobanks provide unprecedented opportunities for systematically understanding the genetics of the natural history of complex diseases. Genome-wide survival association analysis can identify genetic variants associated with ages of onset, disease progression and lifespan. We propose an efficient and accurate frailty model approach for genome-wide survival association analysis of censored time-to-event (TTE) phenotypes by accounting for both population structure and relatedness. Our method utilizes state-of-the-art optimization strategies to reduce the computational cost. The saddlepoint approximation is used to allow for analysis of heavily censored phenotypes (>90%) and low frequency variants (down to minor allele count 20). We demonstrate the performance of our method through extensive simulation studies and analysis of five TTE phenotypes, including lifespan, with heavy censoring rates (90.9% to 99.8%) on ~400,000 UK Biobank participants with white British ancestry and ~180,000 individuals in FinnGen. We further analyzed 871 TTE phenotypes in the UK Biobank and presented the genome-wide scale phenome-wide association results with the PheWeb browser.

DOI

10.1038/s41467-022-32885-x

GCTA

Tool

PUBMED_LINK

21167468

FULL NAME

Genome-wide complex trait analysis (GCTA)

DESCRIPTION

GCTA-GREML analysis: GCTA can simulate a GWAS based on real genotype data.

URL

https://yanglab.westlake.edu.cn/software/gcta/

TITLE

GCTA: a tool for genome-wide complex trait analysis.

Main citation

Yang J, Lee SH, Goddard ME, Visscher PM. (2011) GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet, 88 (1) 76-82. doi:10.1016/j.ajhg.2010.11.011. PMID 21167468

ABSTRACT

For most human complex diseases and traits, SNPs identified by genome-wide association studies (GWAS) explain only a small fraction of the heritability. Here we report a user-friendly software tool called genome-wide complex trait analysis (GCTA), which was developed based on a method we recently developed to address the "missing heritability" problem. GCTA estimates the variance explained by all the SNPs on a chromosome or on the whole genome for a complex trait rather than testing the association of any particular SNP to the trait. We introduce GCTA's five main functions: data management, estimation of the genetic relationships from SNPs, mixed linear model analysis of variance explained by the SNPs, estimation of the linkage disequilibrium structure, and GWAS simulation. We focus on the function of estimating the variance explained by all the SNPs on the X chromosome and testing the hypotheses of dosage compensation. The GCTA software is a versatile tool to estimate and partition complex trait variation with large GWAS data sets.
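
The "estimation of the genetic relationships from SNPs" named in the abstract can be illustrated with a small pure-Python sketch of a genetic relationship matrix (GRM). It averages, over SNPs, the product of two individuals' allele counts after centring by 2p and scaling by 2p(1-p). The function name and the in-memory list-of-lists input are assumptions for illustration, the same formula is applied to diagonal and off-diagonal entries for simplicity, and GCTA itself reads PLINK files and may treat the diagonal differently.

```python
def genetic_relationship_matrix(genotypes):
    """Return an n x n GRM from allele counts (hypothetical helper).

    genotypes[j][i] is the number of copies (0, 1 or 2) of the reference
    allele carried by individual j at SNP i.
    """
    n_ind = len(genotypes)
    n_snp = len(genotypes[0])
    # Sample allele frequency of each SNP.
    freqs = [sum(row[i] for row in genotypes) / (2 * n_ind) for i in range(n_snp)]
    # Monomorphic SNPs carry no information and would cause division by zero.
    used = [i for i in range(n_snp) if 0.0 < freqs[i] < 1.0]
    if not used:
        raise ValueError("no polymorphic SNPs")
    grm = [[0.0] * n_ind for _ in range(n_ind)]
    for j in range(n_ind):
        for k in range(j, n_ind):
            total = 0.0
            for i in used:
                p = freqs[i]
                total += ((genotypes[j][i] - 2 * p) * (genotypes[k][i] - 2 * p)
                          / (2 * p * (1 - p)))
            # The matrix is symmetric, so fill both triangles at once.
            grm[j][k] = grm[k][j] = total / len(used)
    return grm


if __name__ == "__main__":
    for row in genetic_relationship_matrix([[0, 2], [2, 0]]):
        print(row)
```

For the two opposite homozygotes in the usage example, both SNPs have allele frequency 0.5, giving diagonal entries of 2.0 and off-diagonal entries of -2.0.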

DOI

10.1016/j.ajhg.2010.11.011

GCTA

Tool

PUBMED_LINK

21167468

FULL NAME

Genome-wide complex trait analysis (GCTA)

DESCRIPTION

GCTA-GREML analysis: GCTA can simulate a GWAS based on real genotype data.
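
As a rough illustration of simulating a phenotype from real genotypes, the sketch below builds a genetic value from standardized genotypes at chosen causal SNPs and adds normal noise scaled to hit a target heritability. It is a sketch under stated assumptions, not GCTA's implementation: the function name, the list-based input, drawing effect sizes from a standard normal, and setting the residual variance to var(g) * (1/h2 - 1) are choices made here for illustration.

```python
import random
from statistics import mean, pvariance


def simulate_phenotype(genotypes, causal_snps, h2, seed=0):
    """Simulate a quantitative trait from allele counts (hypothetical helper).

    Returns (phenotypes, genetic_variance, residual_variance).
    """
    if not 0.0 < h2 <= 1.0:
        raise ValueError("h2 must be in (0, 1]")
    rng = random.Random(seed)
    n_ind = len(genotypes)
    genetic = [0.0] * n_ind
    for i in causal_snps:
        column = [row[i] for row in genotypes]
        p = mean(column) / 2              # sample allele frequency
        sd = (2 * p * (1 - p)) ** 0.5     # expected SD under Hardy-Weinberg
        if sd == 0:
            continue                      # skip monomorphic SNPs
        effect = rng.gauss(0.0, 1.0)      # assumed effect-size distribution
        for j in range(n_ind):
            genetic[j] += effect * (column[j] - 2 * p) / sd
    var_g = pvariance(genetic)
    # Residual variance chosen so that var_g / (var_g + var_e) equals h2.
    var_e = var_g * (1.0 / h2 - 1.0)
    pheno = [g + rng.gauss(0.0, var_e ** 0.5) for g in genetic]
    return pheno, var_g, var_e


if __name__ == "__main__":
    geno = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 2, 2], [1, 0, 0]]
    y, var_g, var_e = simulate_phenotype(geno, causal_snps=[0, 2], h2=0.5)
    print(y, var_g, var_e)
```

The target heritability holds for the variance parameters by construction; the realized heritability in any single simulated sample will vary around it.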

URL

https://yanglab.westlake.edu.cn/software/gcta/#GWASSimulation

TITLE

GCTA: a tool for genome-wide complex trait analysis.

Main citation

Yang J, Lee SH, Goddard ME, Visscher PM. (2011) GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet, 88 (1) 76-82. doi:10.1016/j.ajhg.2010.11.011. PMID 21167468

ABSTRACT

For most human complex diseases and traits, SNPs identified by genome-wide association studies (GWAS) explain only a small fraction of the heritability. Here we report a user-friendly software tool called genome-wide complex trait analysis (GCTA), which was developed based on a method we recently developed to address the "missing heritability" problem. GCTA estimates the variance explained by all the SNPs on a chromosome or on the whole genome for a complex trait rather than testing the association of any particular SNP to the trait. We introduce GCTA's five main functions: data management, estimation of the genetic relationships from SNPs, mixed linear model analysis of variance explained by the SNPs, estimation of the linkage disequilibrium structure, and GWAS simulation. We focus on the function of estimating the variance explained by all the SNPs on the X chromosome and testing the hypotheses of dosage compensation. The GCTA software is a versatile tool to estimate and partition complex trait variation with large GWAS data sets.

DOI

10.1016/j.ajhg.2010.11.011

GCTA-GREML-Binary (GREML)

Tool

PUBMED_LINK

21376301

FULL NAME

Genome-wide complex trait analysis (GCTA) Genome-based restricted maximum likelihood (GREML)

DESCRIPTION

(case-control)

URL

https://yanglab.westlake.edu.cn/software/gcta/#GREML

TITLE

Estimating missing heritability for disease from genome-wide association studies.

Main citation

Lee SH, Wray NR, Goddard ME, Visscher PM. (2011) Estimating missing heritability for disease from genome-wide association studies. Am J Hum Genet, 88 (3) 294-305. doi:10.1016/j.ajhg.2011.02.002. PMID 21376301

ABSTRACT

Genome-wide association studies are designed to discover SNPs that are associated with a complex trait. Employing strict significance thresholds when testing individual SNPs avoids false positives at the expense of increasing false negatives. Recently, we developed a method for quantitative traits that estimates the variation accounted for when fitting all SNPs simultaneously. Here we develop this method further for case-control studies. We use a linear mixed model for analysis of binary traits and transform the estimates to a liability scale by adjusting both for scale and for ascertainment of the case samples. We show by theory and simulation that the method is unbiased. We apply the method to data from the Wellcome Trust Case Control Consortium and show that a substantial proportion of variation in liability for Crohn disease, bipolar disorder, and type I diabetes is tagged by common SNPs.
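
The transformation to the liability scale described in the abstract, adjusting for scale and for ascertainment of cases, can be sketched as a single formula: h2_liability = h2_observed * K^2 (1-K)^2 / (z^2 P (1-P)), where K is the population prevalence, P is the proportion of cases in the sample, and z is the standard normal density at the liability threshold. The function and argument names below are assumptions for illustration, not GCTA options.

```python
from statistics import NormalDist


def liability_scale_h2(h2_observed, prevalence, case_fraction):
    """Convert observed-scale heritability to the liability scale (hypothetical helper).

    prevalence    : population prevalence of the disease (K)
    case_fraction : proportion of cases in the analysed sample (P)
    """
    norm = NormalDist()
    # Liability threshold above which individuals are affected.
    threshold = norm.inv_cdf(1.0 - prevalence)
    # Height of the standard normal density at the threshold.
    z = norm.pdf(threshold)
    k, p = prevalence, case_fraction
    return h2_observed * (k * (1 - k)) ** 2 / (z ** 2 * p * (1 - p))


if __name__ == "__main__":
    print(liability_scale_h2(0.2, prevalence=0.01, case_fraction=0.5))
```

When the sample is not ascertained (P equals K), the expression reduces to the scale correction alone, h2_observed * K (1-K) / z^2.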

DOI

10.1016/j.ajhg.2011.02.002

GCTA-GREML-Bivariate (GREML)

Tool

PUBMED_LINK

22843982

FULL NAME

Genome-wide complex trait analysis (GCTA) Genome-based restricted maximum likelihood (GREML)

DESCRIPTION

(Bivariate GREML)

URL

https://yanglab.westlake.edu.cn/software/gcta/#GREML

KEYWORDS

bivariate

TITLE

Estimation of pleiotropy between complex diseases using single-nucleotide polymorphism-derived genomic relationships and restricted maximum likelihood.

Main citation

Lee SH, Yang J, Goddard ME, Visscher PM, ...&, Wray NR. (2012) Estimation of pleiotropy between complex diseases using single-nucleotide polymorphism-derived genomic relationships and restricted maximum likelihood. Bioinformatics, 28 (19) 2540-2. doi:10.1093/bioinformatics/bts474. PMID 22843982

ABSTRACT

SUMMARY: Genetic correlations are the genome-wide aggregate effects of causal variants affecting multiple traits. Traditionally, genetic correlations between complex traits are estimated from pedigree studies, but such estimates can be confounded by shared environmental factors. Moreover, for diseases, low prevalence rates imply that even if the true genetic correlation between disorders was high, co-aggregation of disorders in families might not occur or could not be distinguished from chance. We have developed and implemented statistical methods based on linear mixed models to obtain unbiased estimates of the genetic correlation between pairs of quantitative traits or pairs of binary traits of complex diseases using population-based case-control studies with genome-wide single-nucleotide polymorphism data. The method is validated in a simulation study and applied to estimate genetic correlation between various diseases from Wellcome Trust Case Control Consortium data in a series of bivariate analyses. We estimate a significant positive genetic correlation between risk of Type 2 diabetes and hypertension of ~0.31 (SE 0.14, P = 0.024). AVAILABILITY: Our methods, appropriate for both quantitative and binary traits, are implemented in the freely available software GCTA (http://www.complextraitgenomics.com/software/gcta/reml_bivar.html). CONTACT: hong.lee@uq.edu.au SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

DOI

10.1093/bioinformatics/bts474

GCTA-GREML-LDMS

Tool

PUBMED_LINK

26323059

DESCRIPTION

(GREML-LDMS)

URL

https://yanglab.westlake.edu.cn/software/gcta/#GREML

TITLE

Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index.

Main citation

Yang J, Bakshi A, Zhu Z, Hemani G, ...&, Visscher PM. (2015) Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nat Genet, 47 (10) 1114-20. doi:10.1038/ng.3390. PMID 26323059

ABSTRACT

We propose a method (GREML-LDMS) to estimate heritability for human complex traits in unrelated individuals using whole-genome sequencing data. We demonstrate using simulations based on whole-genome sequencing data that ∼97% and ∼68% of variation at common and rare variants, respectively, can be captured by imputation. Using the GREML-LDMS method, we estimate from 44,126 unrelated individuals that all ∼17 million imputed variants explain 56% (standard error (s.e.) = 2.3%) of variance for height and 27% (s.e. = 2.5%) of variance for body mass index (BMI), and we find evidence that height- and BMI-associated variants have been under natural selection. Considering the imperfect tagging of imputation and potential overestimation of heritability from previous family-based studies, heritability is likely to be 60-70% for height and 30-40% for BMI. Therefore, the missing heritability is small for both traits. For further discovery of genes associated with complex traits, a study design with SNP arrays followed by imputation is more cost-effective than whole-genome sequencing at current prices.

DOI

10.1038/ng.3390

GCTA-GREML-Partition (GREML)

Tool

PUBMED_LINK

21552263

FULL NAME

Genome-wide complex trait analysis (GCTA) Genome-based restricted maximum likelihood (GREML)

DESCRIPTION

(partition the genetic variance into individual chromosomes and genomic segments)

URL

https://yanglab.westlake.edu.cn/software/gcta/#GREML

TITLE

Genome partitioning of genetic variation for complex traits using common SNPs.

Main citation

Yang J, Manolio TA, Pasquale LR, Boerwinkle E, ...&, Visscher PM. (2011) Genome partitioning of genetic variation for complex traits using common SNPs. Nat Genet, 43 (6) 519-25. doi:10.1038/ng.823. PMID 21552263

ABSTRACT

We estimate and partition genetic variation for height, body mass index (BMI), von Willebrand factor and QT interval (QTi) using 586,898 SNPs genotyped on 11,586 unrelated individuals. We estimate that ∼45%, ∼17%, ∼25% and ∼21% of the variance in height, BMI, von Willebrand factor and QTi, respectively, can be explained by all autosomal SNPs and a further ∼0.5-1% can be explained by X chromosome SNPs. We show that the variance explained by each chromosome is proportional to its length, and that SNPs in or near genes explain more variation than SNPs between genes. We propose a new approach to estimate variation due to cryptic relatedness and population stratification. Our results provide further evidence that a substantial proportion of heritability is captured by common SNPs, that height, BMI and QTi are highly polygenic traits, and that the additive variation explained by a part of the genome is approximately proportional to the total length of DNA contained within genes therein.

DOI

10.1038/ng.823

GCTA-GREML-Quantitative (GREML)

Tool

PUBMED_LINK

20562875

FULL NAME

Genome-wide complex trait analysis (GCTA) Genome-based restricted maximum likelihood (GREML)

DESCRIPTION

GCTA-GREML analysis: estimating the variance explained by the SNPs / GCTA-GREML analysis for a case-control study

URL

https://yanglab.westlake.edu.cn/software/gcta/#GREML

TITLE

Common SNPs explain a large proportion of the heritability for human height.

Main citation

Yang J, Benyamin B, McEvoy BP, Gordon S, ...&, Visscher PM. (2010) Common SNPs explain a large proportion of the heritability for human height. Nat Genet, 42 (7) 565-9. doi:10.1038/ng.608. PMID 20562875

ABSTRACT

SNPs discovered by genome-wide association studies (GWASs) account for only a small fraction of the genetic variation of complex traits in human populations. Where is the remaining heritability? We estimated the proportion of variance for human height explained by 294,831 SNPs genotyped on 3,925 unrelated individuals using a linear model analysis, and validated the estimation method with simulations based on the observed genotype data. We show that 45% of variance can be explained by considering all SNPs simultaneously. Thus, most of the heritability is not missing but has not previously been detected because the individual effects are too small to pass stringent significance tests. We provide evidence that the remaining heritability is due to incomplete linkage disequilibrium between causal variants and genotyped SNPs, exacerbated by causal variants having lower minor allele frequency than the SNPs explored to date.

DOI

10.1038/ng.608

GEMMA

Tool

PUBMED_LINK

22706312

FULL NAME

genome-wide efficient mixed-model association

DESCRIPTION

GEMMA is the software implementing the Genome-wide Efficient Mixed Model Association algorithm for a standard linear mixed model and some of its close relatives for genome-wide association studies (GWAS). It fits a standard linear mixed model (LMM) to account for population stratification and sample structure for single marker association tests. It fits a Bayesian sparse linear mixed model (BSLMM) using Markov chain Monte Carlo (MCMC) for estimating the proportion of variance in phenotypes explained (PVE) by typed genotypes (i.e. chip heritability), predicting phenotypes, and identifying associated markers by jointly modeling all markers while controlling for population structure. It is computationally efficient for large scale GWAS and uses freely available open-source numerical libraries.

URL

http://stephenslab.uchicago.edu/software.html#gemma

TITLE

Genome-wide efficient mixed-model analysis for association studies.

Main citation

Zhou X, Stephens M. (2012) Genome-wide efficient mixed-model analysis for association studies. Nat Genet, 44 (7) 821-4. doi:10.1038/ng.2310. PMID 22706312

ABSTRACT

Linear mixed models have attracted considerable attention recently as a powerful and effective tool for accounting for population stratification and relatedness in genetic association tests. However, existing methods for exact computation of standard test statistics are computationally impractical for even moderate-sized genome-wide association studies. To address this issue, several approximate methods have been proposed. Here, we present an efficient exact method, which we refer to as genome-wide efficient mixed-model association (GEMMA), that makes approximations unnecessary in many contexts. This method is approximately n times faster than the widely used exact method known as efficient mixed-model association (EMMA), where n is the sample size, making exact genome-wide association analysis computationally practical for large numbers of individuals.

DOI

10.1038/ng.2310

GenoBoost

Tool

PUBMED_LINK

38811555

DESCRIPTION

GenoBoost is a polygenic score method to capture additive and non-additive genetic inheritance effects.

URL

https://github.com/rickyota/genoboost

KEYWORDS

additive effects, non-additive effects, statistical boosting

TITLE

A polygenic score method boosted by non-additive models.

Main citation

Ohta R, Tanigawa Y, Suzuki Y, Kellis M, ...&, Morishita S. (2024) A polygenic score method boosted by non-additive models. Nat Commun, 15 (1) 4433. doi:10.1038/s41467-024-48654-x. PMID 38811555

ABSTRACT

Dominance heritability in complex traits has received increasing recognition. However, most polygenic score (PGS) approaches do not incorporate non-additive effects. Here, we present GenoBoost, a flexible PGS modeling framework capable of considering both additive and non-additive effects, specifically focusing on genetic dominance. Building on statistical boosting theory, we derive provably optimal GenoBoost scores and provide its efficient implementation for analyzing large-scale cohorts. We benchmark it against seven commonly used PGS methods and demonstrate its competitive predictive performance. GenoBoost is ranked the best for four traits and second-best for three traits among twelve tested disease outcomes in UK Biobank. We reveal that GenoBoost improves prediction for autoimmune diseases by incorporating non-additive effects localized in the MHC locus and, more broadly, works best in less polygenic traits. We further demonstrate that GenoBoost can infer the mode of genetic inheritance without requiring prior knowledge. For example, GenoBoost finds non-zero genetic dominance effects for 602 of 900 selected genetic variants, resulting in 2.5% improvements in predicting psoriasis cases. Lastly, we show that GenoBoost can prioritize genetic loci with genetic dominance not previously reported in the GWAS catalog. Our results highlight the increased accuracy and biological insights from incorporating non-additive effects in PGS models.

DOI

10.1038/s41467-024-48654-x

GenomeAsia 100K

Tool

PUBMED_LINK

31802016

URL

https://www.genomeasia100k.org/

TITLE

The GenomeAsia 100K Project enables genetic discoveries across Asia.

Main citation

GenomeAsia100K Consortium. (2019) The GenomeAsia 100K Project enables genetic discoveries across Asia. Nature, 576 (7785) 106-111. doi:10.1038/s41586-019-1793-z. PMID 31802016

ABSTRACT

The underrepresentation of non-Europeans in human genetic studies so far has limited the diversity of individuals in genomic datasets and led to reduced medical relevance for a large proportion of the world's population. Population-specific reference genome datasets as well as genome-wide association studies in diverse populations are needed to address this issue. Here we describe the pilot phase of the GenomeAsia 100K Project. This includes a whole-genome sequencing reference dataset from 1,739 individuals of 219 population groups and 64 countries across Asia. We catalogue genetic variation, population structure, disease associations and founder effects. We also explore the use of this dataset in imputation, to facilitate genetic studies in populations across Asia and worldwide.

DOI

10.1038/s41586-019-1793-z

Genomic-SEM

Tool

PUBMED_LINK

30962613

FULL NAME

genomic structural equation modelling

DESCRIPTION

R-package which allows the user to fit structural equation models based on the summary statistics obtained from genome wide association studies (GWAS).

URL

https://github.com/GenomicSEM/GenomicSEM

KEYWORDS

SEM

TITLE

Genomic structural equation modelling provides insights into the multivariate genetic architecture of complex traits.

Main citation

Grotzinger AD, Rhemtulla M, de Vlaming R, Ritchie SJ, ...&, Tucker-Drob EM. (2019) Genomic structural equation modelling provides insights into the multivariate genetic architecture of complex traits. Nat Hum Behav, 3 (5) 513-525. doi:10.1038/s41562-019-0566-x. PMID 30962613

ABSTRACT

Genetic correlations estimated from genome-wide association studies (GWASs) reveal pervasive pleiotropy across a wide variety of phenotypes. We introduce genomic structural equation modelling (genomic SEM): a multivariate method for analysing the joint genetic architecture of complex traits. Genomic SEM synthesizes genetic correlations and single-nucleotide polymorphism heritabilities inferred from GWAS summary statistics of individual traits from samples with varying and unknown degrees of overlap. Genomic SEM can be used to model multivariate genetic associations among phenotypes, identify variants with effects on general dimensions of cross-trait liability, calculate more predictive polygenic scores and identify loci that cause divergence between traits. We demonstrate several applications of genomic SEM, including a joint analysis of summary statistics from five psychiatric traits. We identify 27 independent single-nucleotide polymorphisms not previously identified in the contributing univariate GWASs. Polygenic scores from genomic SEM consistently outperform those from univariate GWASs. Genomic SEM is flexible and open ended, and allows for continuous innovation in multivariate genetic analysis.

DOI

10.1038/s41562-019-0566-x

GLEANR

Tool

PUBMED_LINK

40730164

FULL NAME

GWAS latent embeddings accounting for noise and regularization

DESCRIPTION

GLEANR is a GWAS matrix factorization tool to estimate sparse latent pleiotropic genetic factors. Factors map traits to a distribution of SNP effects that may capture biological pathways or mechanisms shared by these traits.

URL

https://github.com/aomdahl/gleanr

TITLE

Sparse matrix factorization robust to sample sharing across GWASs reveals interpretable genetic components.

Main citation

Omdahl AR, Weinstock JS, Keener R, Chhetri SB, ...&, Battle A. (2025) Sparse matrix factorization robust to sample sharing across GWASs reveals interpretable genetic components. Am J Hum Genet, 112 (9) 2178-2197. doi:10.1016/j.ajhg.2025.07.003. PMID 40730164

ABSTRACT

Complex trait-associated genetic variation is highly pleiotropic. This extensive pleiotropy implies that multi-phenotype analyses are informative for characterizing genetic associations, as they facilitate the discovery of trait-shared and trait-specific variants and pathways ("genetic factors"). Previous efforts have estimated genetic factors using matrix factorization (MF) applied to numerous genome-wide association studies (GWASs). However, existing methods are susceptible to spurious factors arising from residual confounding due to sample sharing in biobank GWASs. Furthermore, MF approaches have historically estimated dense factors, loaded on most traits and variants, that are challenging to map onto interpretable biological pathways. To address these shortcomings, we introduce "GWAS latent embeddings accounting for noise and regularization" (GLEANR), an MF method for detection of sparse genetic factors from summary statistics. GLEANR accounts for sample sharing between studies and uses regularization to estimate a data-driven number of interpretable factors. GLEANR is robust to confounding induced by shared samples and improves the replication of genetic factors derived from distinct biobanks. We used GLEANR to evaluate 137 diverse GWASs from the UK Biobank, identifying 58 factors that decompose the genetic architecture of input traits and have distinct signatures of negative selection and degrees of polygenicity. These sparse factors can be interpreted with respect to disease, cell type, and pathway enrichment. We highlight three such factors that captured platelet-measure phenotypes and were enriched for disease-relevant markers corresponding to distinct stages of platelet differentiation. Overall, GLEANR is a powerful tool for discovering both trait-specific and trait-shared pathways underlying complex traits from GWAS summary statistics.

DOI

10.1016/j.ajhg.2025.07.003

GLIMPSE

Tool

PUBMED_LINK

33414550

FULL NAME

Genotype Likelihoods IMputation and PhaSing mEthod

DESCRIPTION

GLIMPSE is a phasing and imputation method for large-scale low-coverage sequencing studies.

URL

https://odelaneau.github.io/GLIMPSE/

TITLE

Efficient phasing and imputation of low-coverage sequencing data using large reference panels.

Main citation

Rubinacci S, Ribeiro DM, Hofmeister RJ, Delaneau O. (2021) Efficient phasing and imputation of low-coverage sequencing data using large reference panels. Nat Genet, 53 (1) 120-126. doi:10.1038/s41588-020-00756-0. PMID 33414550

ABSTRACT

Low-coverage whole-genome sequencing followed by imputation has been proposed as a cost-effective genotyping approach for disease and population genetics studies. However, its competitiveness against SNP arrays is undermined because current imputation methods are computationally expensive and unable to leverage large reference panels. Here, we describe a method, GLIMPSE, for phasing and imputation of low-coverage sequencing datasets from modern reference panels. We demonstrate its remarkable performance across different coverages and human populations. GLIMPSE achieves imputation of a genome for less than US$1 in computational cost, considerably outperforming other methods and improving imputation accuracy over the full allele frequency range. As a proof of concept, we show that 1× coverage enables effective gene expression association studies and outperforms dense SNP arrays in rare variant burden tests. Overall, this study illustrates the promising potential of low-coverage imputation and suggests a paradigm shift in the design of future genomic studies.

DOI

10.1038/s41588-020-00756-0

GMRM

Tool

PUBMED_LINK

35905320

FULL NAME

Bayesian grouped mixture of regressions model

DESCRIPTION

gmrm is hybrid-parallel software for a Bayesian grouped mixture of regressions model for genome-wide association studies (GWAS). It is written in C++ using extensive optimisations and code vectorisation. It relies on plink's .bed format. It can handle multiple traits simultaneously.

Show full descriptionShow less

URL

https://github.com/medical-genomics-group/gmrm

TITLE

Improving GWAS discovery and genomic prediction accuracy in biobank data.

Main citation

Orliac EJ, Trejo Banos D, Ojavee SE, Läll K, ...&, Robinson MR. (2022) Improving GWAS discovery and genomic prediction accuracy in biobank data. Proc Natl Acad Sci U S A, 119 (31) e2121279119. doi:10.1073/pnas.2121279119. PMID 35905320

ABSTRACT

Genetically informed, deep-phenotyped biobanks are an important research resource and it is imperative that the most powerful, versatile, and efficient analysis approaches are used. Here, we apply our recently developed Bayesian grouped mixture of regressions model (GMRM) in the UK and Estonian Biobanks and obtain the highest genomic prediction accuracy reported to date across 21 heritable traits. When compared to other approaches, GMRM accuracy was greater than annotation prediction models run in the LDAK or LDPred-funct software by 15% (SE 7%) and 14% (SE 2%), respectively, and was 18% (SE 3%) greater than a baseline BayesR model without single-nucleotide polymorphism (SNP) markers grouped into minor allele frequency-linkage disequilibrium (MAF-LD) annotation categories. For height, the prediction accuracy R2 was 47% in a UK Biobank holdout sample, which was 76% of the estimated [Formula: see text]. We then extend our GMRM prediction model to provide mixed-linear model association (MLMA) SNP marker estimates for genome-wide association (GWAS) discovery, which increased the independent loci detected to 16,162 in unrelated UK Biobank individuals, compared to 10,550 from BoltLMM and 10,095 from Regenie, a 62 and 65% increase, respectively. The average [Formula: see text] value of the leading markers increased by 15.24 (SE 0.41) for every 1% increase in prediction accuracy gained over a baseline BayesR model across the traits. Thus, we show that modeling genetic associations accounting for MAF and LD differences among SNP markers, and incorporating prior knowledge of genomic function, is important for both genomic prediction and discovery in large-scale individual-level studies.

DOI

10.1073/pnas.2121279119

GNOVA

Tool

PUBMED_LINK

29220677

FULL NAME

GeNetic cOVariance Analyzer

DESCRIPTION

A principled framework to estimate annotation-stratified genetic covariance using GWAS summary statistics.

URL

https://github.com/xtonyjiang/GNOVA

TITLE

A Powerful Approach to Estimating Annotation-Stratified Genetic Covariance via GWAS Summary Statistics.

Main citation

Lu Q, Li B, Ou D, Erlendsdottir M, ...&, Zhao H. (2017) A Powerful Approach to Estimating Annotation-Stratified Genetic Covariance via GWAS Summary Statistics. Am J Hum Genet, 101 (6) 939-964. doi:10.1016/j.ajhg.2017.11.001. PMID 29220677

ABSTRACT

Despite the success of large-scale genome-wide association studies (GWASs) on complex traits, our understanding of their genetic architecture is far from complete. Jointly modeling multiple traits' genetic profiles has provided insights into the shared genetic basis of many complex traits. However, large-scale inference sets a high bar for both statistical power and biological interpretability. Here we introduce a principled framework to estimate annotation-stratified genetic covariance between traits using GWAS summary statistics. Through theoretical and numerical analyses, we demonstrate that our method provides accurate covariance estimates, thereby enabling researchers to dissect both the shared and distinct genetic architecture across traits to better understand their etiologies. Among 50 complex traits with publicly accessible GWAS summary statistics (N_total ≈ 4.5 million), we identified more than 170 pairs with statistically significant genetic covariance. In particular, we found strong genetic covariance between late-onset Alzheimer disease (LOAD) and amyotrophic lateral sclerosis (ALS), two major neurodegenerative diseases, in single-nucleotide polymorphisms (SNPs) with high minor allele frequencies and in SNPs located in the predicted functional genome. Joint analysis of LOAD, ALS, and other traits highlights LOAD's correlation with cognitive traits and hints at an autoimmune component for ALS.

DOI

10.1016/j.ajhg.2017.11.001

GPLEMMA

Tool

PUBMED_LINK

33367483

FULL NAME

Gaussian Prior Linear Environment Mixed Model Analysis

DESCRIPTION

GPLEMMA (Gaussian Prior Linear Environment Mixed Model Analysis) is a non-linear randomized Haseman-Elston regression method for flexible modeling of gene-environment interactions in large datasets such as the UK Biobank.

URL

https://github.com/mkerin/LEMMA

TITLE

A non-linear regression method for estimation of gene-environment heritability.

Main citation

Kerin M, Marchini J. (2021) A non-linear regression method for estimation of gene-environment heritability. Bioinformatics, 36 (24) 5632-5639. doi:10.1093/bioinformatics/btaa1079. PMID 33367483

ABSTRACT

MOTIVATION: Gene-environment (GxE) interactions are one of the least studied aspects of the genetic architecture of human traits and diseases. The environment of an individual is inherently high dimensional, evolves through time and can be expensive and time consuming to measure. The UK Biobank study, with all 500 000 participants having undergone an extensive baseline questionnaire, represents a unique opportunity to assess GxE heritability for many traits and diseases in a well powered setting. RESULTS: We have developed a randomized Haseman-Elston non-linear regression method applicable when many environmental variables have been measured on each individual. The method (GPLEMMA) simultaneously estimates a linear environmental score (ES) and its GxE heritability. We compare the method via simulation to a whole-genome regression approach (LEMMA) for estimating GxE heritability. We show that GPLEMMA is more computationally efficient than LEMMA on large datasets, and produces results highly correlated with those from LEMMA when applied to simulated data and real data from the UK Biobank. AVAILABILITY AND IMPLEMENTATION: Software implementing the GPLEMMA method is available from https://jmarchini.org/gplemma/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

DOI

10.1093/bioinformatics/btaa1079

GREP

Tool

PUBMED_LINK

30859178

FULL NAME

Genome for REPositioning drugs

DESCRIPTION

GREP can quantify an enrichment of the user-defined set of genes in the target of clinical indication categories and capture potentially repositionable drugs targeting the gene set. Both can be run in a few seconds!

URL

https://github.com/saorisakaue/GREP

TITLE

GREP: genome for REPositioning drugs.

Main citation

Sakaue S, Okada Y. (2019) GREP: genome for REPositioning drugs. Bioinformatics, 35 (19) 3821-3823. doi:10.1093/bioinformatics/btz166. PMID 30859178

ABSTRACT

SUMMARY: Making use of accumulated genetic knowledge for clinical practice is our next goal in human genetics. Here we introduce GREP (Genome for REPositioning drugs), a standalone python software to quantify an enrichment of the user-defined set of genes in the target of clinical indication categories and to capture potentially repositionable drugs targeting the gene set. We show that genes identified by the large-scale genome-wide association studies were robustly enriched in the approved drugs to treat the trait of interest. This enrichment analysis was also highly applicable to other sets of biological genes such as those identified by gene expression studies and genes somatically mutated in cancers. This software should accelerate investigators to reposition drugs to other indications with the guidance of known genomics. AVAILABILITY AND IMPLEMENTATION: GREP is available at https://github.com/saorisakaue/GREP as a python source code. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

DOI

10.1093/bioinformatics/btz166
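
The gene-set enrichment described in the GREP entry above can be illustrated with a one-sided hypergeometric (Fisher-style) test. This is a minimal sketch, not GREP's own code: the choice of test, the function name `enrichment_p_value` and its parameters are assumptions made for illustration only.

```python
from math import comb


def enrichment_p_value(n_overlap, n_gene_set, n_targets, n_background):
    """One-sided hypergeometric tail probability P(X >= n_overlap).

    Hypothetical helper, not part of GREP. Arguments:
      n_overlap    -- genes in the user set that are also drug targets
      n_gene_set   -- size of the user-defined gene set
      n_targets    -- drug-target genes for one indication category
      n_background -- all genes considered
    """
    upper = min(n_gene_set, n_targets)
    total = comb(n_background, n_gene_set)
    # Sum the hypergeometric probabilities for every overlap at least as large
    # as the observed one.
    tail = sum(
        comb(n_targets, k) * comb(n_background - n_targets, n_gene_set - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Example with made-up counts: 8 of 50 user genes hit a category that has
# 200 targets among 20,000 background genes.
p = enrichment_p_value(8, 50, 200, 20000)
```

A small p-value here would suggest that the gene set overlaps the category's drug targets more often than expected by chance, under the stated assumptions.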

GRPa-PRS

Tool

PUBMED_LINK

37425929

FULL NAME

genetically-regulated pathways

URL

https://github.com/davidroad/GRPa-PRS

TITLE

GRPa-PRS: A risk stratification method to identify genetically-regulated pathways in polygenic diseases.

Main citation

Li X, Fernandes BS, Liu A, Chen J, ...&, Dai Y. (2024) GRPa-PRS: A risk stratification method to identify genetically-regulated pathways in polygenic diseases. medRxiv, () . doi:10.1101/2023.06.19.23291621. PMID 37425929

ABSTRACT

BACKGROUND: Polygenic risk scores (PRS) are tools used to evaluate an individual's susceptibility to polygenic diseases based on their genetic profile. A considerable proportion of people carry a high genetic risk but evade the disease. On the other hand, some individuals with a low risk eventually develop the disease. We hypothesized that unknown counterfactors might be involved in reversing the PRS prediction, which might provide new insights into the pathogenesis, prevention, and early intervention of diseases. METHODS: We built a novel computational framework to identify genetically-regulated pathways (GRPas) using PRS-based stratification for each cohort. We curated two AD cohorts with genotyping data; the discovery (disc) and the replication (rep) datasets include 2722 and 2854 individuals, respectively. First, we calculated the optimized PRS model based on the three recent AD GWAS summary statistics for each cohort. Then, we stratified the individuals by their PRS and clinical diagnosis into six biologically meaningful PRS strata, such as AD cases with low/high risk and cognitively normal (CN) with low/high risk. Lastly, we imputed individual genetically-regulated expression (GReX) and identified differential GReX and GRPas between risk strata using gene-set enrichment and variational analyses in two models, with and without APOE effects. An orthogonality test was further conducted to verify those GRPas are independent of PRS risk. To verify the generalizability of other polygenic diseases, we further applied a default model of GRPa-PRS for schizophrenia (SCZ). RESULTS: For each stratum, we conducted the same procedures in both the disc and rep datasets for comparison. In AD, we identified several well-known AD-related pathways, including amyloid-beta clearance, tau protein binding, and astrocyte response to oxidative stress. Additionally, we discovered resilience-related GRPs that are orthogonal to AD PRS, such as the calcium signaling pathway and divalent inorganic cation homeostasis. In SCZ, pathways related to mitochondrial function and muscle development were highlighted. Finally, our GRPa-PRS method identified more consistent differential pathways compared to another variant-based pathway PRS method. CONCLUSIONS: We developed a framework, GRPa-PRS, to systematically explore the differential GReX and GRPas among individuals stratified by their estimated PRS. The GReX-level comparison among those strata unveiled new insights into the pathways associated with disease risk and resilience. Our framework is extendable to other polygenic complex diseases.

DOI

10.1101/2023.06.19.23291621

gsMap

Tool

PUBMED_LINK

40108460

FULL NAME

genetically informed spatial mapping of cells for complex traits

DESCRIPTION

gsMap (genetically informed spatial mapping of cells for complex traits) integrates spatial transcriptomics (ST) data with genome-wide association study (GWAS) summary statistics to map cells to human complex traits, including diseases, in a spatially resolved manner.

URL

https://github.com/JianYang-Lab/gsMap

KEYWORDS

spatial transcriptomics

TITLE

Spatially resolved mapping of cells associated with human complex traits.

Main citation

Song L, Chen W, Hou J, Guo M, ...&, Yang J. (2025) Spatially resolved mapping of cells associated with human complex traits. Nature, 641 (8064) 932-941. doi:10.1038/s41586-025-08757-x. PMID 40108460

ABSTRACT

Depicting spatial distributions of disease-relevant cells is crucial for understanding disease pathology. Here we present genetically informed spatial mapping of cells for complex traits (gsMap), a method that integrates spatial transcriptomics data with summary statistics from genome-wide association studies to map cells to human complex traits, including diseases, in a spatially resolved manner. Using embryonic spatial transcriptomics datasets covering 25 organs, we benchmarked gsMap through simulation and by corroborating known trait-associated cells or regions in various organs. Applying gsMap to brain spatial transcriptomics data, we reveal that the spatial distribution of glutamatergic neurons associated with schizophrenia more closely resembles that for cognitive traits than that for mood traits such as depression. The schizophrenia-associated glutamatergic neurons were distributed near the dorsal hippocampus, with upregulated expression of calcium signalling and regulation genes, whereas depression-associated glutamatergic neurons were distributed near the deep medial prefrontal cortex, with upregulated expression of neuroplasticity and psychiatric drug target genes. Our study provides a method for spatially resolved mapping of trait-associated cells and demonstrates the gain of biological insights (such as the spatial distribution of trait-relevant cells and related signature genes) through these maps.

DOI

10.1038/s41586-025-08757-x

ARROW_SUMMARY

Spatial transcriptomics data + GWAS summary statistics → Graph Neural Network identifies homogeneous spatial domains → Compute Gene Specificity Scores (GSS) for each spot → Map GSS to nearby SNPs → Perform Stratified LD Score Regression (S-LDSC) to assess trait heritability enrichment → Aggregate spot-level p-values using the Cauchy Combination Test to identify trait-associated spatial regions
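
The last step of the summary above, aggregating spot-level p-values with the Cauchy combination test, can be sketched as follows. This is an illustrative implementation of the standard test, not code taken from gsMap; the function name `cauchy_combination` and the default of equal weights are assumptions.

```python
import math


def cauchy_combination(p_values, weights=None):
    """Combine p-values with the Cauchy combination test.

    Hypothetical helper, not gsMap's API. Weights are assumed to be
    non-negative and to sum to 1; equal weights are used when omitted.
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("need at least one p-value")
    if weights is None:
        weights = [1.0 / n] * n
    # Map each p-value to a standard Cauchy variate and take the weighted sum.
    statistic = sum(
        w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values)
    )
    # Upper-tail probability of the standard Cauchy distribution.
    return 0.5 - math.atan(statistic) / math.pi


# Example with made-up spot-level p-values for one spatial region.
region_p = cauchy_combination([1e-6, 0.4, 0.9])
```

With a single input the function returns that p-value unchanged (up to floating-point error), and one very small p-value dominates the combined result, which is the usual motivation for using this test on correlated spot-level statistics.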

Guideline-Namba

Tool

PUBMED_LINK

36778001

DESCRIPTION

A practical guideline for genomics-driven drug discovery for cross-population meta-analysis, drawing on lessons from the Global Biobank Meta-analysis Initiative (GBMI).

TITLE

A practical guideline of genomics-driven drug discovery in the era of global biobank meta-analysis.

Main citation

Namba S, Konuma T, Wu KH, Zhou W, ...&, Okada Y. (2022) A practical guideline of genomics-driven drug discovery in the era of global biobank meta-analysis. Cell Genom, 2 (10) 100190. doi:10.1016/j.xgen.2022.100190. PMID 36778001

ABSTRACT

Genomics-driven drug discovery is indispensable for accelerating the development of novel therapeutic targets. However, the drug discovery framework based on evidence from genome-wide association studies (GWASs) has not been established, especially for cross-population GWAS meta-analysis. Here, we introduce a practical guideline for genomics-driven drug discovery for cross-population meta-analysis, as lessons from the Global Biobank Meta-analysis Initiative (GBMI). Our drug discovery framework encompassed three methodologies and was applied to the 13 common diseases targeted by GBMI (N mean = 1,329,242). Individual methodologies complementarily prioritized drugs and drug targets, which were systematically validated by referring previously known drug-disease relationships. Integration of the three methodologies provided a comprehensive catalog of candidate drugs for repositioning, nominating promising drug candidates targeting the genes involved in the coagulation process for venous thromboembolism and the interleukin-4 and interleukin-13 signaling pathway for gout. Our study highlighted key factors for successful genomics-driven drug discovery using cross-population meta-analyses.

DOI

10.1016/j.xgen.2022.100190

GWAMA

Tool

PUBMED_LINK

20509871

FULL NAME

Genome-Wide Association Meta-Analysis

DESCRIPTION

Software tool for meta analysis of whole genome association data

URL

https://genomics.ut.ee/en/tools

TITLE

GWAMA: software for genome-wide association meta-analysis.

Main citation

Mägi R, Morris AP. (2010) GWAMA: software for genome-wide association meta-analysis. BMC Bioinformatics, 11 () 288. doi:10.1186/1471-2105-11-288. PMID 20509871

ABSTRACT

BACKGROUND: Despite the recent success of genome-wide association studies in identifying novel loci contributing effects to complex human traits, such as type 2 diabetes and obesity, much of the genetic component of variation in these phenotypes remains unexplained. One way to improving power to detect further novel loci is through meta-analysis of studies from the same population, increasing the sample size over any individual study. Although statistical software analysis packages incorporate routines for meta-analysis, they are ill equipped to meet the challenges of the scale and complexity of data generated in genome-wide association studies. RESULTS: We have developed flexible, open-source software for the meta-analysis of genome-wide association studies. The software incorporates a variety of error trapping facilities, and provides a range of meta-analysis summary statistics. The software is distributed with scripts that allow simple formatting of files containing the results of each association study and generate graphical summaries of genome-wide meta-analysis results. CONCLUSIONS: The GWAMA (Genome-Wide Association Meta-Analysis) software has been developed to perform meta-analysis of summary statistics generated from genome-wide association studies of dichotomous phenotypes or quantitative traits. Software with source files, documentation and example data files are freely available online at http://www.well.ox.ac.uk/GWAMA.

DOI

10.1186/1471-2105-11-288
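
To make the kind of computation behind a GWAS meta-analysis concrete, here is a minimal sketch of a textbook inverse-variance-weighted fixed-effects estimate for a single variant. It is not GWAMA's implementation and does not reproduce its options or output format; the function name `fixed_effects_meta` and the example numbers are hypothetical.

```python
import math


def fixed_effects_meta(betas, standard_errors):
    """Inverse-variance-weighted fixed-effects meta-analysis for one variant.

    Hypothetical helper, not part of GWAMA. Returns the pooled effect,
    its standard error, the z-score and a two-sided normal p-value.
    """
    if len(betas) != len(standard_errors) or not betas:
        raise ValueError("need one standard error per effect estimate")
    # Each study is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    # Two-sided p-value under a standard normal null.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p


# Example with made-up per-study estimates for the same variant.
beta, se, z, p = fixed_effects_meta([0.1, 0.3], [0.1, 0.1])
```

Because the two example studies have equal standard errors, the pooled effect is simply their average, and the pooled standard error is smaller than either input, which is the gain in power that motivates meta-analysis.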

gwas diversity monitor

Tool

PUBMED_LINK

32139905

URL

http://www.gwasdiversitymonitor.com/

TITLE

The GWAS Diversity Monitor tracks diversity by disease in real time.

Main citation

Mills MC, Rahal C. (2020) The GWAS Diversity Monitor tracks diversity by disease in real time. Nat Genet, 52 (3) 242-243. doi:10.1038/s41588-020-0580-y. PMID 32139905

DOI

10.1038/s41588-020-0580-y

GWAS SVatalog

Tool

FULL NAME

GWAS SVatalog: a visualization tool to aid fine-mapping of GWAS loci with structural variations

DESCRIPTION

Novel open-source web tool combining GWAS Catalog's SNP-trait associations with LD statistics to identify SVs explaining GWAS loci [1]

URL

https://svatalog.research.sickkids.ca/

KEYWORDS

GWAS, structural variations, visualization, fine-mapping

USE

Computes and visualizes linkage disequilibrium between structural variations and GWAS-associated SNPs [1]

PREPRINT_DOI

10.1101/2025.09.03.674075

Main citation

Chirmade S, Wang Z, et al. (2025). GWAS SVatalog: a visualization tool to aid fine-mapping of GWAS loci with structural variations. bioRxiv

GWAS-by-Subtraction

Tool

PUBMED_LINK

33414549

URL

https://github.com/GenomicSEM/GenomicSEM

TITLE

Investigating the genetic architecture of noncognitive skills using GWAS-by-subtraction.

Main citation

Demange PA, Malanchini M, Mallard TT, Biroli P, ...&, Nivard MG. (2021) Investigating the genetic architecture of noncognitive skills using GWAS-by-subtraction. Nat Genet, 53 (1) 35-44. doi:10.1038/s41588-020-00754-2. PMID 33414549

ABSTRACT

Little is known about the genetic architecture of traits affecting educational attainment other than cognitive ability. We used genomic structural equation modeling and prior genome-wide association studies (GWASs) of educational attainment (n = 1,131,881) and cognitive test performance (n = 257,841) to estimate SNP associations with educational attainment variation that is independent of cognitive ability. We identified 157 genome-wide-significant loci and a polygenic architecture accounting for 57% of genetic variance in educational attainment. Noncognitive genetics were enriched in the same brain tissues and cell types as cognitive performance, but showed different associations with gray-matter brain volumes. Noncognitive genetics were further distinguished by associations with personality traits, less risky behavior and increased risk for certain psychiatric disorders. For socioeconomic success and longevity, noncognitive and cognitive-performance genetics demonstrated associations of similar magnitude. By conducting a GWAS of a phenotype that was not directly measured, we offer a view of genetic architecture of noncognitive skills influencing educational success.

DOI

10.1038/s41588-020-00754-2

GWASLab

Tool

DESCRIPTION

A Python package for handling GWAS summary statistics.

URL

https://github.com/Cloufield/gwaslab

PREPRINT_DOI

10.51094/jxiv.370

Main citation

GWASLab preprint: He, Y., Koido, M., Shimmori, Y., Kamatani, Y. (2023). GWASLab: a Python package for processing and visualizing GWAS summary statistics. Preprint at Jxiv, 2023-5. https://doi.org/10.51094/jxiv.370

gwaslab

Tool

URL

https://github.com/Cloufield/gwaslab

GWAX

Tool

PUBMED_LINK

28092683

FULL NAME

genome-wide association by proxy

DESCRIPTION

In randomly ascertained cohorts, replacing cases with their first-degree relatives enables studies of diseases that are absent (or nearly absent) in the cohort.

TITLE

Case-control association mapping by proxy using family history of disease.

Main citation

Liu JZ, Erlich Y, Pickrell JK. (2017) Case-control association mapping by proxy using family history of disease. Nat Genet, 49 (3) 325-331. doi:10.1038/ng.3766. PMID 28092683

ABSTRACT

Collecting cases for case-control genetic association studies can be time-consuming and expensive. In some situations (such as studies of late-onset or rapidly lethal diseases), it may be more practical to identify family members of cases. In randomly ascertained cohorts, replacing cases with their first-degree relatives enables studies of diseases that are absent (or nearly absent) in the cohort. We refer to this approach as genome-wide association study by proxy (GWAX) and apply it to 12 common diseases in 116,196 individuals from the UK Biobank. Meta-analysis with published genome-wide association study summary statistics replicated established risk loci and yielded four newly associated loci for Alzheimer's disease, eight for coronary artery disease and five for type 2 diabetes. In addition to informing disease biology, our results demonstrate the utility of association mapping without directly observing cases. We anticipate that GWAX will prove useful in future genetic studies of complex traits in large population cohorts.

DOI

10.1038/ng.3766

GWFM

Tool

PUBMED_LINK

41912930

FULL NAME

Genome-wide fine-mapping with functional annotations

DESCRIPTION

Genome-wide fine-mapping (GWFM) with functional annotations models the global genetic architecture rather than isolated loci; compared with region-specific approaches it improves error control, power, resolution, precision, replication, and cross-ancestry phenotype prediction. Distributed as part of the GCTB software suite.

URL

https://github.com/jianzeng/GCTB, https://www.nature.com/articles/s41588-026-02549-3

KEYWORDS

fine-mapping, functional annotation, credible sets, trans-ancestry

TITLE

Genome-wide fine-mapping improves identification of causal variants.

Main citation

Wu Y, Zheng Z, Thibaut L, Lin T, ...&, Zeng J. (2026) Genome-wide fine-mapping improves identification of causal variants. Nat Genet, () . doi:10.1038/s41588-026-02549-3. PMID 41912930

ABSTRACT

Fine-mapping refines genotype-phenotype association signals to identify causal variants underlying complex traits. However, current methods typically focus on individual genomic loci and do not account for the global genetic architecture. Here we demonstrate the advantages of performing genome-wide fine-mapping (GWFM) with functional annotations and develop methods to facilitate GWFM. In simulations and real data analyses, GWFM outperforms current methods across several metrics, including error control, mapping power, resolution, precision, replication rate and trans-ancestry phenotype prediction. Across 48 complex traits, we identify credible sets that collectively explain 18% of the SNP-based heritability ($h^2_{\mathrm{SNP}}$) on average, with 30% credible sets located outside genome-wide significant loci. Leveraging the genetic architecture estimated from GWFM, we predict that fine-mapping over 50% of $h^2_{\mathrm{SNP}}$ would require an average of 2 million samples. Finally, as proof-of-principle, we highlight a known causal variant at FTO influencing body mass index and identify new missense causal variants influencing schizophrenia and Crohn's disease risk.

DOI

10.1038/s41588-026-02549-3

Hail

Tool

DESCRIPTION

Hail is an open-source, general-purpose, Python-based data analysis tool with additional data types and methods for working with genomic data.

URL

https://hail.is/

Han-MHC

Tool

PUBMED_LINK

27213287

URL

http://gigadb.org/dataset/100156

TITLE

Deep sequencing of the MHC region in the Chinese population contributes to studies of complex disease.

Main citation

Zhou F, Cao H, Zuo X, Zhang T, ...&, Zhang X. (2016) Deep sequencing of the MHC region in the Chinese population contributes to studies of complex disease. Nat Genet, 48 (7) 740-6. doi:10.1038/ng.3576. PMID 27213287

ABSTRACT

The human major histocompatibility complex (MHC) region has been shown to be associated with numerous diseases. However, it remains a challenge to pinpoint the causal variants for these associations because of the extreme complexity of the region. We thus sequenced the entire 5-Mb MHC region in 20,635 individuals of Han Chinese ancestry (10,689 controls and 9,946 patients with psoriasis) and constructed a Han-MHC database that includes both variants and HLA gene typing results of high accuracy. We further identified multiple independent new susceptibility loci in HLA-C, HLA-B, HLA-DPB1 and BTNL2 and an intergenic variant, rs118179173, associated with psoriasis and confirmed the well-established risk allele HLA-C*06:02. We anticipate that our Han-MHC reference panel built by deep sequencing of a large number of samples will serve as a useful tool for investigating the role of the MHC region in a variety of diseases and thus advance understanding of the pathogenesis of these disorders.

Show full abstractShow less

DOI

10.1038/ng.3576

HAPGEN2

Tool

PUBMED_LINK

21653516

DESCRIPTION

HAPGEN2 is an updated version of the program HAPGEN, which simulates case-control datasets at SNP markers. The new version can simulate multiple disease SNPs on a single chromosome, on the assumption that the disease SNPs act independently and are in Hardy-Weinberg equilibrium. We also supply an R package that can simulate interaction between the disease SNPs.

URL

https://mathgen.stats.ox.ac.uk/genetics_software/hapgen/hapgen2.html

TITLE

HAPGEN2: simulation of multiple disease SNPs.

Main citation

Su Z, Marchini J, Donnelly P. (2011) HAPGEN2: simulation of multiple disease SNPs. Bioinformatics, 27 (16) 2304-5. doi:10.1093/bioinformatics/btr341. PMID 21653516

ABSTRACT

MOTIVATION: Performing experiments with simulated data is an inexpensive approach to evaluating competing experimental designs and analysis methods in genome-wide association studies. Simulation based on resampling known haplotypes is fast and efficient and can produce samples with patterns of linkage disequilibrium (LD), which mimic those in real data. However, the inability of current methods to simulate multiple nearby disease SNPs on the same chromosome can limit their application. RESULTS: We introduce a new simulation algorithm based on a successful resampling method, HAPGEN, that can simulate multiple nearby disease SNPs on the same chromosome. The new method, HAPGEN2, retains many advantages of resampling methods and expands the range of disease models that current simulators offer. AVAILABILITY: HAPGEN2 is freely available from http://www.stats.ox.ac.uk/~marchini/software/gwas/gwas.html. CONTACT: zhan@well.ox.ac.uk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

DOI

10.1093/bioinformatics/btr341

haploview

Tool

PUBMED_LINK

15297300

DESCRIPTION

Haploview is designed to simplify and expedite the process of haplotype analysis by providing a common interface to several tasks relating to such analyses.

URL

https://www.broadinstitute.org/haploview/haploview

TITLE

Haploview: analysis and visualization of LD and haplotype maps.

Main citation

Barrett JC, Fry B, Maller J, Daly MJ. (2005) Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics, 21 (2) 263-5. doi:10.1093/bioinformatics/bth457. PMID 15297300

ABSTRACT

UNLABELLED: Research over the last few years has revealed significant haplotype structure in the human genome. The characterization of these patterns, particularly in the context of medical genetic association studies, is becoming a routine research activity. Haploview is a software package that provides computation of linkage disequilibrium statistics and population haplotype patterns from primary genotype data in a visually appealing and interactive interface. AVAILABILITY: http://www.broad.mit.edu/mpg/haploview/ CONTACT: jcbarret@broad.mit.edu

DOI

10.1093/bioinformatics/bth457

HDL

Tool

PUBMED_LINK

32601477

FULL NAME

High-Definition Likelihood

DESCRIPTION

High-Definition Likelihood (HDL) is a likelihood-based method for estimating genetic correlation using GWAS summary statistics. Compared to LD Score regression (LDSC), it reduces the variance of a genetic correlation estimate by about 60%.

URL

https://github.com/zhenin/HDL/

TITLE

High-definition likelihood inference of genetic correlations across human complex traits.

Main citation

Ning Z, Pawitan Y, Shen X. (2020) High-definition likelihood inference of genetic correlations across human complex traits. Nat Genet, 52 (8) 859-864. doi:10.1038/s41588-020-0653-y. PMID 32601477

ABSTRACT

Genetic correlation is a central parameter for understanding shared genetic architecture between complex traits. By using summary statistics from genome-wide association studies (GWAS), linkage disequilibrium score regression (LDSC) was developed for unbiased estimation of genetic correlations. Although easy to use, LDSC only partially utilizes LD information. By fully accounting for LD across the genome, we develop a high-definition likelihood (HDL) method to improve precision in genetic correlation estimation. Compared to LDSC, HDL reduces the variance of genetic correlation estimates by about 60%, equivalent to a 2.5-fold increase in sample size. We apply HDL and LDSC to estimate 435 genetic correlations among 30 behavioral and disease-related phenotypes measured in the UK Biobank (UKBB). In addition to 154 significant genetic correlations observed for both methods, HDL identified another 57 significant genetic correlations, compared to only another 2 significant genetic correlations identified by LDSC. HDL brings more power to genomic analyses and better reveals the underlying connections across human complex traits.

DOI

10.1038/s41588-020-0653-y
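
The HDL abstract above equates a variance reduction of about 60% with a 2.5-fold increase in sample size. The arithmetic can be checked with a short sketch, assuming that the sampling variance of the estimate scales as 1/N; that scaling and the function name `equivalent_sample_size_factor` are assumptions made for illustration, not part of the HDL software.

```python
def equivalent_sample_size_factor(variance_reduction):
    """Sample-size multiplier matching a given fractional variance reduction.

    Hypothetical helper. Assumes variance is proportional to 1/N, so cutting
    the variance to (1 - r) of its original value is equivalent to
    multiplying N by 1 / (1 - r).
    """
    if not 0.0 <= variance_reduction < 1.0:
        raise ValueError("variance_reduction must be in [0, 1)")
    return 1.0 / (1.0 - variance_reduction)


# A 60% reduction leaves 40% of the variance: 1 / 0.4 is about 2.5.
factor = equivalent_sample_size_factor(0.6)
```

Under this assumption the 60% figure and the 2.5-fold figure in the abstract are two ways of stating the same gain in precision.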

HDL-L

Tool

PUBMED_LINK

40065165

FULL NAME

high-definition likelihood (local)

DESCRIPTION

High-Definition Likelihood (HDL) is a likelihood-based method for estimating genetic correlation using GWAS summary statistics. Compared to LD Score regression (LDSC), it reduces the variance of a genetic correlation estimate by about 60%. Here, we provide an R-based computational tool HDL to implement our method.

URL

https://github.com/zhenin/HDL/

KEYWORDS

likelihood-based inference

TITLE

An enhanced framework for local genetic correlation analysis.

Main citation

Li Y, Pawitan Y, Shen X. (2025) An enhanced framework for local genetic correlation analysis. Nat Genet, 57 (4) 1053-1058. doi:10.1038/s41588-025-02123-3. PMID 40065165

ABSTRACT

Genetic correlation is a key parameter in the joint genetic model of complex traits, but it is usually estimated on a global genomic scale. Understanding local genetic correlations provides more detailed insight into the shared genetic architecture of complex traits. However, a state-of-the-art tool for local genetic correlation analysis, LAVA, is prone to false inference. Here we extend the high-definition likelihood (HDL) method to a local version, HDL-L, which performs genetic correlation analysis in small, approximately independent linkage disequilibrium blocks. HDL-L allows a more granular estimation of genetic variances and covariances. Simulations show that HDL-L offers more consistent heritability estimates and more efficient genetic correlation estimates compared with LAVA. HDL-L demonstrated robust performance across a wide range of simulations conducted under varying parameter settings. In the analysis of 30 phenotypes from the UK Biobank, HDL-L identified 109 significant local genetic correlations and showed a notable computational advantage. HDL-L proves to be a powerful tool for uncovering the detailed genetic landscape that underlies complex human traits, offering both accuracy and computational efficiency.

DOI

10.1038/s41588-025-02123-3

HEELS

Tool

PUBMED_LINK

38040712

FULL NAME

Heritability Estimation with high Efficiency using LD and association Summary Statistics

DESCRIPTION

HEELS is a Python-based command-line tool that produces accurate and precise local heritability estimates using summary-level statistics (marginal association test statistics along with the empirical, in-sample LD statistics).

URL

https://github.com/huilisabrina/HEELS

TITLE

Accurate and efficient estimation of local heritability using summary statistics and the linkage disequilibrium matrix.

Main citation

Li H, Mazumder R, Lin X. (2023) Accurate and efficient estimation of local heritability using summary statistics and the linkage disequilibrium matrix. Nat Commun, 14 (1) 7954. doi:10.1038/s41467-023-43565-9. PMID 38040712

ABSTRACT

Existing SNP-heritability estimators that leverage summary statistics from genome-wide association studies (GWAS) are much less efficient (i.e., have larger standard errors) than the restricted maximum likelihood (REML) estimators which require access to individual-level data. We introduce a new method for local heritability estimation, Heritability Estimation with high Efficiency using LD and association Summary Statistics (HEELS), that significantly improves the statistical efficiency of summary-statistics-based heritability estimator and attains comparable statistical efficiency as REML (with a relative statistical efficiency >92%). Moreover, we propose representing the empirical LD matrix as the sum of a low-rank matrix and a banded matrix. We show that this way of modeling the LD can not only reduce the storage and memory cost, but also improve the computational efficiency of heritability estimation. We demonstrate the statistical efficiency of HEELS and the advantages of our proposed LD approximation strategies both in simulations and through empirical analyses of the UK Biobank data.

DOI

10.1038/s41467-023-43565-9

HESS

Tool

PUBMED_LINK

27346688

FULL NAME

Heritability Estimation from Summary Statistics

DESCRIPTION

HESS (Heritability Estimation from Summary Statistics) is a software package for estimating and visualizing local SNP-heritability and genetic covariance (correlation) from GWAS summary association data.

URL

https://huwenboshi.github.io/hess/

TITLE

Contrasting the Genetic Architecture of 30 Complex Traits from Summary Association Data.

Main citation

Shi H, Kichaev G, Pasaniuc B. (2016) Contrasting the Genetic Architecture of 30 Complex Traits from Summary Association Data. Am J Hum Genet, 99 (1) 139-53. doi:10.1016/j.ajhg.2016.05.013. PMID 27346688

ABSTRACT

Variance-component methods that estimate the aggregate contribution of large sets of variants to the heritability of complex traits have yielded important insights into the genetic architecture of common diseases. Here, we introduce methods that estimate the total trait variance explained by the typed variants at a single locus in the genome (local SNP heritability) from genome-wide association study (GWAS) summary data while accounting for linkage disequilibrium among variants. We applied our estimator to ultra-large-scale GWAS summary data of 30 common traits and diseases to gain insights into their local genetic architecture. First, we found that common SNPs have a high contribution to the heritability of all studied traits. Second, we identified traits for which the majority of the SNP heritability can be confined to a small percentage of the genome. Third, we identified GWAS risk loci where the entire locus explains significantly more variance in the trait than the GWAS reported variants. Finally, we identified loci that explain a significant amount of heritability across multiple traits.

DOI

10.1016/j.ajhg.2016.05.013

HGDP+1kGP

Tool

PUBMED_LINK

38749656

FULL NAME

Human Genome Diversity Project + 1000 Genomes project

URL

https://gnomad.broadinstitute.org/news/2020-10-gnomad-v3-1-new-content-methods-annotations-and-data-availability/#the-gnomad-hgdp-and-1000-genomes-callset

TITLE

A harmonized public resource of deeply sequenced diverse human genomes.

Main citation

Koenig Z, Yohannes MT, Nkambule LL, Zhao X, ...&, Martin AR. (2024) A harmonized public resource of deeply sequenced diverse human genomes. Genome Res, 34 (5) 796-809. doi:10.1101/gr.278378.123. PMID 38749656

ABSTRACT

Underrepresented populations are often excluded from genomic studies owing in part to a lack of resources supporting their analyses. The 1000 Genomes Project (1kGP) and Human Genome Diversity Project (HGDP), which have recently been sequenced to high coverage, are valuable genomic resources because of the global diversity they capture and their open data sharing policies. Here, we harmonized a high-quality set of 4094 whole genomes from 80 populations in the HGDP and 1kGP with data from the Genome Aggregation Database (gnomAD) and identified over 153 million high-quality SNVs, indels, and SVs. We performed a detailed ancestry analysis of this cohort, characterizing population structure and patterns of admixture across populations, analyzing site frequency spectra, and measuring variant counts at global and subcontinental levels. We also show substantial added value from this data set compared with the prior versions of the component resources, typically combined via liftOver and variant intersection; for example, we catalog millions of new genetic variants, mostly rare, compared with previous releases. In addition to unrestricted individual-level public release, we provide detailed tutorials for conducting many of the most common quality-control steps and analyses with these data in a scalable cloud-computing environment and publicly release this new phased joint callset for use as a haplotype resource in phasing and imputation pipelines. This jointly called reference panel will serve as a key resource to support research of diverse ancestry populations.

DOI

10.1101/gr.278378.123

HIBAG

Tool

PUBMED_LINK

23712092

URL

https://github.com/zhengxwen/HIBAG

TITLE

HIBAG--HLA genotype imputation with attribute bagging.

Main citation

Zheng X, Shen J, Cox C, Wakefield JC, ...&, Weir BS. (2014) HIBAG--HLA genotype imputation with attribute bagging. Pharmacogenomics J, 14 (2) 192-200. doi:10.1038/tpj.2013.18. PMID 23712092

ABSTRACT

Genotyping of classical human leukocyte antigen (HLA) alleles is an essential tool in the analysis of diseases and adverse drug reactions with associations mapping to the major histocompatibility complex (MHC). However, deriving high-resolution HLA types subsequent to whole-genome single-nucleotide polymorphism (SNP) typing or sequencing is often cost prohibitive for large samples. An alternative approach takes advantage of the extended haplotype structure within the MHC to predict HLA alleles using dense SNP genotypes, such as those available from genome-wide SNP panels. Current methods for HLA imputation are difficult to apply or may require the user to have access to large training data sets with SNP and HLA types. We propose HIBAG, HLA Imputation using attribute BAGging, that makes predictions by averaging HLA-type posterior probabilities over an ensemble of classifiers built on bootstrap samples. We assess the performance of HIBAG using our study data (n=2668 subjects of European ancestry) as a training set and HLA data from the British 1958 birth cohort study (n≈1000 subjects) as independent validation samples. Prediction accuracies for HLA-A, B, C, DRB1 and DQB1 range from 92.2% to 98.1% using a set of SNP markers common to the Illumina 1M Duo, OmniQuad, OmniExpress, 660K and 550K platforms. HIBAG performed well compared with the other two leading methods, HLA*IMP and BEAGLE. This method is implemented in a freely available HIBAG R package that includes pre-fit classifiers for European, Asian, Hispanic and African ancestries, providing a readily available imputation approach without the need to have access to large training data sets.

DOI

10.1038/tpj.2013.18

HIPO

Tool

PUBMED_LINK

30289880

FULL NAME

heritability informed power optimization

DESCRIPTION

hipo is an R package that performs heritability informed power optimization (HIPO) for conducting multi-trait association analysis on summary level data.

URL

https://github.com/gqi/hipo

TITLE

Heritability informed power optimization (HIPO) leads to enhanced detection of genetic associations across multiple traits.

Main citation

Qi G, Chatterjee N. (2018) Heritability informed power optimization (HIPO) leads to enhanced detection of genetic associations across multiple traits. PLoS Genet, 14 (10) e1007549. doi:10.1371/journal.pgen.1007549. PMID 30289880

ABSTRACT

Genome-wide association studies have shown that pleiotropy is a common phenomenon that can potentially be exploited for enhanced detection of susceptibility loci. We propose heritability informed power optimization (HIPO) for conducting powerful pleiotropic analysis using summary-level association statistics. We find optimal linear combinations of association coefficients across traits that are expected to maximize non-centrality parameter for the underlying test statistics, taking into account estimates of heritability, sample size variations and overlaps across the traits. Simulation studies show that the proposed method has correct type I error, robust to population stratification and leads to desired genome-wide enrichment of association signals. Application of the proposed method to publicly available data for three groups of genetically related traits, lipids (N = 188,577), psychiatric diseases (Ncase = 33,332, Ncontrol = 27,888) and social science traits (N ranging between 161,460 to 298,420 across individual traits) increased the number of genome-wide significant loci by 12%, 200% and 50%, respectively, compared to those found by analysis of individual traits. Evidence of replication is present for many of these loci in subsequent larger studies for individual traits. HIPO can potentially be extended to high-dimensional phenotypes as a way of dimension reduction to maximize power for subsequent genetic association testing.

DOI

10.1371/journal.pgen.1007549

HLA-TAPAS

Tool

PUBMED_LINK

34611364

FULL NAME

HLA-Typing At Protein for Association Studies

DESCRIPTION

HLA-TAPAS (HLA-Typing At Protein for Association Studies) is an HLA-focused pipeline that can handle HLA reference panel construction (MakeReference), HLA imputation (SNP2HLA), and HLA association (HLAassoc).

URL

https://github.com/immunogenomics/HLA-TAPAS

KEYWORDS

HLA pipeline

TITLE

A high-resolution HLA reference panel capturing global population diversity enables multi-ancestry fine-mapping in HIV host response.

Main citation

Luo Y, Kanai M, Choi W, Li X, ...&, Raychaudhuri S. (2021) A high-resolution HLA reference panel capturing global population diversity enables multi-ancestry fine-mapping in HIV host response. Nat Genet, 53 (10) 1504-1516. doi:10.1038/s41588-021-00935-7. PMID 34611364

ABSTRACT

Fine-mapping to plausible causal variation may be more effective in multi-ancestry cohorts, particularly in the MHC, which has population-specific structure. To enable such studies, we constructed a large (n = 21,546) HLA reference panel spanning five global populations based on whole-genome sequences. Despite population-specific long-range haplotypes, we demonstrated accurate imputation at G-group resolution (94.2%, 93.7%, 97.8% and 93.7% in admixed African (AA), East Asian (EAS), European (EUR) and Latino (LAT) populations). Applying HLA imputation to genome-wide association study data for HIV-1 viral load in three populations (EUR, AA and LAT), we obviated effects of previously reported associations from population-specific HIV studies and discovered a novel association at position 156 in HLA-B. We pinpointed the MHC association to three amino acid positions (97, 67 and 156) marking three consecutive pockets (C, B and D) within the HLA-B peptide-binding groove, explaining 12.9% of trait variance.

DOI

10.1038/s41588-021-00935-7

HLARIMNT

Tool

FULL NAME

HLA Reliable IMputatioN by Transformer

URL

https://github.com/seitalab/HLARIMNT

KEYWORDS

HLA, imputation

Main citation

Tanaka, K., Kato, K., Nonaka, N., & Seita, J. (2022). Efficient HLA imputation from sequential SNPs data by Transformer. arXiv preprint arXiv:2211.06430.

HRC

Tool

PUBMED_LINK

27548312

URL

http://www.haplotype-reference-consortium.org/

TITLE

A reference panel of 64,976 haplotypes for genotype imputation.

Main citation

McCarthy S, Das S, Kretzschmar W, Delaneau O, ...&, Haplotype Reference Consortium. (2016) A reference panel of 64,976 haplotypes for genotype imputation. Nat Genet, 48 (10) 1279-83. doi:10.1038/ng.3643. PMID 27548312

ABSTRACT

We describe a reference panel of 64,976 human haplotypes at 39,235,157 SNPs constructed using whole-genome sequence data from 20 studies of predominantly European ancestry. Using this resource leads to accurate genotype imputation at minor allele frequencies as low as 0.1% and a large increase in the number of SNPs tested in association studies, and it can help to discover and refine causal loci. We describe remote server resources that allow researchers to carry out imputation and phasing consistently and efficiently.

DOI

10.1038/ng.3643

HWE

Tool

PUBMED_LINK

15789306

FULL NAME

Exact Tests of Hardy-Weinberg Equilibrium

DESCRIPTION

Wigginton, J. E., Cutler, D. J., & Abecasis, G. R. (2005). A note on exact tests of Hardy-Weinberg equilibrium. The American Journal of Human Genetics, 76(5), 887-893.

TITLE

A note on exact tests of Hardy-Weinberg equilibrium.

Main citation

Wigginton JE, Cutler DJ, Abecasis GR. (2005) A note on exact tests of Hardy-Weinberg equilibrium. Am J Hum Genet, 76 (5) 887-93. doi:10.1086/429864. PMID 15789306

ABSTRACT

Deviations from Hardy-Weinberg equilibrium (HWE) can indicate inbreeding, population stratification, and even problems in genotyping. In samples of affected individuals, these deviations can also provide evidence for association. Tests of HWE are commonly performed using a simple χ² goodness-of-fit test. We show that this χ² test can have inflated type I error rates, even in relatively large samples (e.g., samples of 1,000 individuals that include approximately 100 copies of the minor allele). On the basis of previous work, we describe exact tests of HWE together with efficient computational methods for their implementation. Our methods adequately control type I error in large and small samples and are computationally efficient. They have been implemented in freely available code that will be useful for quality assessment of genotype data and for the detection of genetic association or population stratification in very large data sets.
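
To make the idea of an exact test concrete, the following is a minimal Python sketch, not the authors' released code. It assumes a biallelic marker with genotype counts for AA, AB and BB, conditions on the observed allele counts, enumerates every possible heterozygote count, and sums the probabilities of all configurations that are no more likely than the observed one. The function name `hwe_exact_pvalue` is hypothetical, and the direct enumeration with log-factorials is a simplification chosen for readability rather than the efficient computation described in the paper.

```python
from math import exp, lgamma, log


def hwe_exact_pvalue(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Two-sided exact HWE p-value from genotype counts (AA, AB, BB)."""
    n = n_aa + n_ab + n_bb      # number of individuals
    n_a = 2 * n_aa + n_ab       # copies of allele A
    n_b = 2 * n_bb + n_ab       # copies of allele B
    if n == 0:
        return 1.0

    def log_fact(k: int) -> float:
        return lgamma(k + 1)

    def log_prob(het: int) -> float:
        # log P(het heterozygotes | n individuals, n_a copies of A) under HWE
        hom_a = (n_a - het) // 2
        hom_b = (n_b - het) // 2
        return (het * log(2.0) + log_fact(n)
                - log_fact(hom_a) - log_fact(het) - log_fact(hom_b)
                + log_fact(n_a) + log_fact(n_b) - log_fact(2 * n))

    # The heterozygote count always has the same parity as n_a.
    het_values = range(n_a % 2, min(n_a, n_b) + 1, 2)
    probs = {het: exp(log_prob(het)) for het in het_values}
    p_obs = probs[n_ab]
    total = sum(probs.values())
    # Sum every outcome that is at most as probable as the observed one;
    # the small relative tolerance guards against floating-point ties.
    tail = sum(p for p in probs.values() if p <= p_obs * (1.0 + 1e-9))
    return min(1.0, tail / total)
```

For example, `hwe_exact_pvalue(1, 0, 1)` considers the two possible outcomes for two individuals carrying two copies of each allele (zero or two heterozygotes) and returns the probability of the less likely one.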

DOI

10.1086/429864

HyPrColoc

Tool

PUBMED_LINK

33536417

FULL NAME

Hypothesis Prioritisation for multi-trait Colocalization

URL

https://github.com/cnfoley/hyprcoloc

KEYWORDS

multiple traits

TITLE

A fast and efficient colocalization algorithm for identifying shared genetic risk factors across multiple traits.

Main citation

Foley CN, Staley JR, Breen PG, Sun BB, ...&, Howson JMM. (2021) A fast and efficient colocalization algorithm for identifying shared genetic risk factors across multiple traits. Nat Commun, 12 (1) 764. doi:10.1038/s41467-020-20885-8. PMID 33536417

ABSTRACT

Genome-wide association studies (GWAS) have identified thousands of genomic regions affecting complex diseases. The next challenge is to elucidate the causal genes and mechanisms involved. One approach is to use statistical colocalization to assess shared genetic aetiology across multiple related traits (e.g. molecular traits, metabolic pathways and complex diseases) to identify causal pathways, prioritize causal variants and evaluate pleiotropy. We propose HyPrColoc (Hypothesis Prioritisation for multi-trait Colocalization), an efficient deterministic Bayesian algorithm using GWAS summary statistics that can detect colocalization across vast numbers of traits simultaneously (e.g. 100 traits can be jointly analysed in around 1 s). We perform a genome-wide multi-trait colocalization analysis of coronary heart disease (CHD) and fourteen related traits, identifying 43 regions in which CHD colocalized with ≥1 trait, including 5 previously unknown CHD loci. Across the 43 loci, we further integrate gene and protein expression quantitative trait loci to identify candidate causal genes.

DOI

10.1038/s41467-020-20885-8

IBS

Tool

PUBMED_LINK

26069263

FULL NAME

illustrator of biological sequences

DESCRIPTION

An illustrator for the presentation and visualization of biological sequences.

URL

http://ibs.biocuckoo.org/

TITLE

IBS: an illustrator for the presentation and visualization of biological sequences.

Main citation

Liu W, Xie Y, Ma J, Luo X, ...&, Ren J. (2015) IBS: an illustrator for the presentation and visualization of biological sequences. Bioinformatics, 31 (20) 3359-61. doi:10.1093/bioinformatics/btv362. PMID 26069263

ABSTRACT

Biological sequence diagrams are fundamental for visualizing various functional elements in protein or nucleotide sequences that enable a summarization and presentation of existing information as well as means of intuitive new discoveries. Here, we present a software package called illustrator of biological sequences (IBS) that can be used for representing the organization of either protein or nucleotide sequences in a convenient, efficient and precise manner. Multiple options are provided in IBS, and biological sequences can be manipulated, recolored or rescaled in a user-defined mode. Also, the final representational artwork can be directly exported into a publication-quality figure. AVAILABILITY AND IMPLEMENTATION: The standalone package of IBS was implemented in JAVA, while the online service was implemented in HTML5 and JavaScript. Both the standalone package and online service are freely available at http://ibs.biocuckoo.org. CONTACT: renjian.sysu@gmail.com or xueyu@hust.edu.cn SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

DOI

10.1093/bioinformatics/btv362

iHS

Tool

PUBMED_LINK

16494531

FULL NAME

Integrated haplotype score

DESCRIPTION

Voight, B. F., Kudaravalli, S., Wen, X., & Pritchard, J. K. (2006). A map of recent positive selection in the human genome. PLoS biology, 4(3), e72.

TITLE

A map of recent positive selection in the human genome.

Main citation

Voight BF, Kudaravalli S, Wen X, Pritchard JK. (2006) A map of recent positive selection in the human genome. PLoS Biol, 4 (3) e72. doi:10.1371/journal.pbio.0040072. PMID 16494531

ABSTRACT

The identification of signals of very recent positive selection provides information about the adaptation of modern humans to local conditions. We report here on a genome-wide scan for signals of very recent positive selection in favor of variants that have not yet reached fixation. We describe a new analytical method for scanning single nucleotide polymorphism (SNP) data for signals of recent selection, and apply this to data from the International HapMap Project. In all three continental groups we find widespread signals of recent positive selection. Most signals are region-specific, though a significant excess are shared across groups. Contrary to some earlier low resolution studies that suggested a paucity of recent selection in sub-Saharan Africans, we find that by some measures our strongest signals of selection are from the Yoruba population. Finally, since these signals indicate the existence of genetic variants that have substantially different fitnesses, they must indicate loci that are the source of significant phenotypic variation. Though the relevant phenotypes are generally not known, such loci should be of particular interest in mapping studies of complex traits. For this purpose we have developed a set of SNPs that can be used to tag the strongest approximately 250 signals of recent selection in each population.

DOI

10.1371/journal.pbio.0040072

IMPUTE

Tool

PUBMED_LINK

17572673

URL

https://jmarchini.org/software/

TITLE

A new multipoint method for genome-wide association studies by imputation of genotypes.

Main citation

Marchini J, Howie B, Myers S, McVean G, ...&, Donnelly P. (2007) A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet, 39 (7) 906-13. doi:10.1038/ng2088. PMID 17572673

ABSTRACT

Genome-wide association studies are set to become the method of choice for uncovering the genetic basis of human diseases. A central challenge in this area is the development of powerful multipoint methods that can detect causal variants that have not been directly genotyped. We propose a coherent analysis framework that treats the problem as one involving missing or uncertain genotypes. Central to our approach is a model-based imputation method for inferring genotypes at observed or unobserved SNPs, leading to improved power over existing methods for multipoint association mapping. Using real genome-wide association study data, we show that our approach (i) is accurate and well calibrated, (ii) provides detailed views of associated regions that facilitate follow-up studies and (iii) can be used to validate and correct data at genotyped markers. A notable future use of our method will be to boost power by combining data from genome-wide scans that use different SNP sets.

DOI

10.1038/ng2088

IMPUTE2

Tool

PUBMED_LINK

19543373

TITLE

A flexible and accurate genotype imputation method for the next generation of genome-wide association studies.

Main citation

Howie BN, Donnelly P, Marchini J. (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet, 5 (6) e1000529. doi:10.1371/journal.pgen.1000529. PMID 19543373

ABSTRACT

Genotype imputation methods are now being widely used in the analysis of genome-wide association studies. Most imputation analyses to date have used the HapMap as a reference dataset, but new reference panels (such as controls genotyped on multiple SNP chips and densely typed samples from the 1,000 Genomes Project) will soon allow a broader range of SNPs to be imputed with higher accuracy, thereby increasing power. We describe a genotype imputation method (IMPUTE version 2) that is designed to address the challenges presented by these new datasets. The main innovation of our approach is a flexible modelling framework that increases accuracy and combines information across multiple reference panels while remaining computationally feasible. We find that IMPUTE v2 attains higher accuracy than other methods when the HapMap provides the sole reference panel, but that the size of the panel constrains the improvements that can be made. We also find that imputation accuracy can be greatly enhanced by expanding the reference panel to contain thousands of chromosomes and that IMPUTE v2 outperforms other methods in this setting at both rare and common SNPs, with overall error rates that are 15%-20% lower than those of the closest competing method. One particularly challenging aspect of next-generation association studies is to integrate information across multiple reference panels genotyped on different sets of SNPs; we show that our approach to this problem has practical advantages over other suggested solutions.

DOI

10.1371/journal.pgen.1000529

IMPUTE4

Tool

PUBMED_LINK

30305743

TITLE

The UK Biobank resource with deep phenotyping and genomic data.

Main citation

Bycroft C, Freeman C, Petkova D, Band G, ...&, Marchini J. (2018) The UK Biobank resource with deep phenotyping and genomic data. Nature, 562 (7726) 203-209. doi:10.1038/s41586-018-0579-z. PMID 30305743

ABSTRACT

The UK Biobank project is a prospective cohort study with deep genetic and phenotypic data collected on approximately 500,000 individuals from across the United Kingdom, aged between 40 and 69 at recruitment. The open resource is unique in its size and scope. A rich variety of phenotypic and health-related information is available on each participant, including biological measurements, lifestyle indicators, biomarkers in blood and urine, and imaging of the body and brain. Follow-up information is provided by linking health and medical records. Genome-wide genotype data have been collected on all participants, providing many opportunities for the discovery of new genetic associations and the genetic bases of complex traits. Here we describe the centralized analysis of the genetic data, including genotype quality, properties of population structure and relatedness of the genetic data, and efficient phasing and genotype imputation that increases the number of testable variants to around 96 million. Classical allelic variation at 11 human leukocyte antigen genes was imputed, resulting in the recovery of signals with known associations between human leukocyte antigen alleles and many diseases.

DOI

10.1038/s41586-018-0579-z

IMPUTE5

Tool

PUBMED_LINK

33196638

TITLE

Genotype imputation using the Positional Burrows Wheeler Transform.

Main citation

Rubinacci S, Delaneau O, Marchini J. (2020) Genotype imputation using the Positional Burrows Wheeler Transform. PLoS Genet, 16 (11) e1009049. doi:10.1371/journal.pgen.1009049. PMID 33196638

ABSTRACT

Genotype imputation is the process of predicting unobserved genotypes in a sample of individuals using a reference panel of haplotypes. In the last 10 years reference panels have increased in size by more than 100 fold. Increasing reference panel size improves accuracy of markers with low minor allele frequencies but poses ever increasing computational challenges for imputation methods. Here we present IMPUTE5, a genotype imputation method that can scale to reference panels with millions of samples. This method continues to refine the observation made in the IMPUTE2 method, that accuracy is optimized via use of a custom subset of haplotypes when imputing each individual. It achieves fast, accurate, and memory-efficient imputation by selecting haplotypes using the Positional Burrows Wheeler Transform (PBWT). By using the PBWT data structure at genotyped markers, IMPUTE5 identifies locally best matching haplotypes and long identical by state segments. The method then uses the selected haplotypes as conditioning states within the IMPUTE model. Using the HRC reference panel, which has ∼65,000 haplotypes, we show that IMPUTE5 is up to 30x faster than MINIMAC4 and up to 3x faster than BEAGLE5.1, and uses less memory than both these methods. Using simulated reference panels we show that IMPUTE5 scales sub-linearly with reference panel size. For example, keeping the number of imputed markers constant, increasing the reference panel size from 10,000 to 1 million haplotypes requires less than twice the computation time. As the reference panel increases in size IMPUTE5 is able to utilize a smaller number of reference haplotypes, thus reducing computational cost.
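
As an illustration of the data structure named in the abstract, here is a minimal Python sketch of the positional prefix ordering at the core of the PBWT. It is not IMPUTE5's implementation: it assumes a small in-memory list of binary haplotypes, and the function name `pbwt_orders` is hypothetical. At each site the current ordering is stably partitioned by allele (0 before 1), which keeps haplotypes sorted by their reversed prefixes, so haplotypes that share a long match ending at that site sit next to each other.

```python
from typing import List, Sequence


def pbwt_orders(haplotypes: Sequence[Sequence[int]]) -> List[List[int]]:
    """Return, for every site k, haplotype indices sorted by reversed prefix up to k."""
    if not haplotypes:
        return []
    n_sites = len(haplotypes[0])
    order = list(range(len(haplotypes)))
    orders = []
    for k in range(n_sites):
        # Stable partition: haplotypes carrying allele 0 at site k keep their
        # relative order and come first, followed by those carrying allele 1.
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        order = zeros + ones
        orders.append(order)
    return orders
```

A method that needs locally best-matching reference haplotypes for a target can then look at the target's neighbours in the ordering for a given site, rather than comparing it with every haplotype in the panel.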

DOI

10.1371/journal.pgen.1009049

JAM

Tool

PUBMED_LINK

27027514

FULL NAME

joint analysis of marginal summary statistics

DESCRIPTION

Bayesian variable selection under a range of likelihoods, including linear regression for continuous outcomes, logistic regression for binary outcomes, Weibull regression for survival outcomes, and the "JAM" model for summary genetic association data.

URL

https://github.com/pjnewcombe/R2BGLiMS

TITLE

JAM: A Scalable Bayesian Framework for Joint Analysis of Marginal SNP Effects.

Main citation

Newcombe PJ, Conti DV, Richardson S. (2016) JAM: A Scalable Bayesian Framework for Joint Analysis of Marginal SNP Effects. Genet Epidemiol, 40 (3) 188-201. doi:10.1002/gepi.21953. PMID 27027514

ABSTRACT

Recently, large scale genome-wide association study (GWAS) meta-analyses have boosted the number of known signals for some traits into the tens and hundreds. Typically, however, variants are only analysed one-at-a-time. This complicates the ability of fine-mapping to identify a small set of SNPs for further functional follow-up. We describe a new and scalable algorithm, joint analysis of marginal summary statistics (JAM), for the re-analysis of published marginal summary statistics under joint multi-SNP models. The correlation is accounted for according to estimates from a reference dataset, and models and SNPs that best explain the complete joint pattern of marginal effects are highlighted via an integrated Bayesian penalized regression framework. We provide both enumerated and Reversible Jump MCMC implementations of JAM and present some comparisons of performance. In a series of realistic simulation studies, JAM demonstrated identical performance to various alternatives designed for single region settings. In multi-region settings, where the only multivariate alternative involves stepwise selection, JAM offered greater power and specificity. We also present an application to real published results from MAGIC (meta-analysis of glucose and insulin related traits consortium) - a GWAS meta-analysis of more than 15,000 people. We re-analysed several genomic regions that produced multiple significant signals with glucose levels 2 hr after oral stimulation. Through joint multivariate modelling, JAM was able to formally rule out many SNPs, and for one gene, ADCY5, suggests that an additional SNP, which transpired to be more biologically plausible, should be followed up with equal priority to the reported index.

DOI

10.1002/gepi.21953

JASS

Tool

PUBMED_LINK

32002517

FULL NAME

Joint Analysis of Summary Statistics

DESCRIPTION

JASS is a Python package that handles the computation of the joint statistics over sets of selected GWAS results, and the interactive exploration of the results through a web interface. The generation of joint statistics over a set of selected studies, and the generation of static plots to display the results, is easily performed using the command line interface. These functionalities can also be accessed through a web application embedded in the Python package, which also enables the exploration of the results through a dynamic JavaScript interface. The JASS analysis module handles the data processing, going from the import of the data up to the computation of the joint statistics and the generation of the various static plots to illustrate the results. Pre-processing of raw GWAS data can be performed with a companion script provided with the JASS package.

URL

https://gitlab.pasteur.fr/statistical-genetics/jass

TITLE

JASS: command line and web interface for the joint analysis of GWAS results.

Main citation

Julienne H, Lechat P, Guillemot V, Lasry C, ...&, Aschard H. (2020) JASS: command line and web interface for the joint analysis of GWAS results. NAR Genom Bioinform, 2 (1) lqaa003. doi:10.1093/nargab/lqaa003. PMID 32002517

ABSTRACT

Genome-wide association study (GWAS) has been the driving force for identifying association between genetic variants and human phenotypes. Thousands of GWAS summary statistics covering a broad range of human traits and diseases are now publicly available. These GWAS have proven their utility for a range of secondary analyses, including in particular the joint analysis of multiple phenotypes to identify new associated genetic variants. However, although several methods have been proposed, there are very few large-scale applications published so far because of challenges in implementing these methods on real data. Here, we present JASS (Joint Analysis of Summary Statistics), a polyvalent Python package that addresses this need. Our package incorporates recently developed joint tests such as the omnibus approach and various weighted sum of Z-score tests while solving all practical and computational barriers for large-scale multivariate analysis of GWAS summary statistics. This includes data cleaning and harmonization tools, an efficient algorithm for fast derivation of joint statistics, an optimized data management process and a web interface for exploration purposes. Both benchmark analyses and real data applications demonstrated the robustness and strong potential of JASS for the detection of new associated genetic variants. Our package is freely available at https://gitlab.pasteur.fr/statistical-genetics/jass.
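
One of the joint tests mentioned above, a weighted sum of Z-scores, can be sketched in a few lines of Python. This is a generic illustration and not the JASS API: the function name `weighted_sum_z`, the weights and the correlation matrix are all assumptions supplied by the caller. The sketch assumes that, under the null, the per-trait Z-scores for a variant are jointly normal with mean zero and a known correlation matrix, so the weighted sum divided by its standard deviation is standard normal.

```python
from math import erfc, sqrt
from typing import Sequence, Tuple


def weighted_sum_z(
    z: Sequence[float],
    weights: Sequence[float],
    corr: Sequence[Sequence[float]],
) -> Tuple[float, float]:
    """Return (statistic, two-sided p-value) for a weighted sum of Z-scores."""
    k = len(z)
    numerator = sum(weights[i] * z[i] for i in range(k))
    # Variance of the weighted sum under the null: w' * corr * w
    variance = sum(
        weights[i] * corr[i][j] * weights[j]
        for i in range(k)
        for j in range(k)
    )
    statistic = numerator / sqrt(variance)
    # Two-sided normal tail probability: 2 * (1 - Phi(|s|)) = erfc(|s| / sqrt(2))
    p_value = erfc(abs(statistic) / sqrt(2.0))
    return statistic, p_value
```

With two uncorrelated traits, equal weights and Z-scores of 2 each, the combined statistic is 4 / sqrt(2), which is larger than either input; with perfectly correlated traits the same inputs give a statistic of 2, so the second trait adds nothing.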

DOI

10.1093/nargab/lqaa003
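
The JASS abstract above mentions weighted sum of Z-score tests. A minimal sketch of that kind of test for a single variant follows, assuming the Z-scores are jointly normal under the null with a known correlation matrix. The function name `sum_z_test` and its arguments are hypothetical and are not part of the JASS interface.

```python
import math


def sum_z_test(z, weights, corr):
    """Weighted sum of Z-scores for one variant across k traits.

    z:       per-trait Z-scores for the variant
    weights: per-trait weights (illustrative choice, not JASS defaults)
    corr:    k x k correlation matrix of the Z-scores under the null
    Returns (combined Z, two-sided p-value).
    """
    k = len(z)
    numerator = sum(w * zi for w, zi in zip(weights, z))
    # Variance of the weighted sum under the null is w^T * corr * w.
    variance = sum(weights[i] * corr[i][j] * weights[j]
                   for i in range(k) for j in range(k))
    z_comb = numerator / math.sqrt(variance)
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z_comb) / math.sqrt(2))
    return z_comb, p_value


# Two traits with a null correlation of 0.3 between their Z-scores.
z_comb, p_value = sum_z_test([2.0, 2.5], [1.0, 1.0],
                             [[1.0, 0.3], [0.3, 1.0]])
print(z_comb, p_value)
```

Dividing by the square root of w^T * corr * w keeps the combined statistic standard normal under the null even when the studies share samples or the traits are correlated.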

JointPRS

PRS, Multi-ancestry, Cross-ancestry, Genetic correlation, Tool, Summary statistics

PUBMED_LINK

40268942

DESCRIPTION

Data-adaptive polygenic score framework that borrows strength across populations via genetic correlations using only GWAS summary statistics and LD references—supporting prediction with or without individual-level tuning data.

URL

https://github.com/LeqiXu/JointPRS, https://doi.org/10.1038/s41467-025-59243-x

KEYWORDS

PRS, multi-population, genetic correlation, summary statistics, cross-ancestry

TITLE

JointPRS: A data-adaptive framework for multi-population genetic risk prediction incorporating genetic correlation.

Main citation

Xu L, Zhou G, Jiang W, Zhang H, ...&, Zhao H. (2025) JointPRS: A data-adaptive framework for multi-population genetic risk prediction incorporating genetic correlation. Nat Commun, 16 (1) 3841. doi:10.1038/s41467-025-59243-x. PMID 40268942

ABSTRACT

Genetic risk prediction for non-European populations is hindered by limited Genome-Wide Association Study (GWAS) sample sizes and small tuning datasets. We propose JointPRS, a data-adaptive framework that leverages genetic correlations across multiple populations using GWAS summary statistics. It achieves accurate predictions without individual-level tuning data and remains effective in the presence of a small tuning set thanks to its data-adaptive approach. Through extensive simulations and real data applications to 22 quantitative and four binary traits in five continental populations evaluated using the UK Biobank (UKBB) and All of Us (AoU), JointPRS consistently outperforms six state-of-the-art methods across three data scenarios: no tuning data, same-cohort tuning and testing, and cross-cohort tuning and testing. Notably, in the Admixed American population, JointPRS improves lipid trait prediction in AoU by 6.46%-172.00% compared to the other existing methods.

DOI

10.1038/s41467-025-59243-x

karyoploteR

Tool

PUBMED_LINK

28575171

DESCRIPTION

karyoploteR is an R package to create karyoplots, that is, representations of whole genomes with arbitrary data plotted on them. It is inspired by the R base graphics system and does not depend on other graphics packages. The aim of karyoploteR is to offer the user an easy way to plot data along the genome to get broad genome-wide view to facilitate the identification of genome wide relations and distributions.

URL

https://bernatgel.github.io/karyoploter_tutorial/

TITLE

karyoploteR: an R/Bioconductor package to plot customizable genomes displaying arbitrary data.

Main citation

Gel B, Serra E. (2017) karyoploteR: an R/Bioconductor package to plot customizable genomes displaying arbitrary data. Bioinformatics, 33 (19) 3088-3090. doi:10.1093/bioinformatics/btx346. PMID 28575171

ABSTRACT

MOTIVATION: Data visualization is a crucial tool for data exploration, analysis and interpretation. For the visualization of genomic data there lacks a tool to create customizable non-circular plots of whole genomes from any species. RESULTS: We have developed karyoploteR, an R/Bioconductor package to create linear chromosomal representations of any genome with genomic annotations and experimental data plotted along them. Plot creation process is inspired in R base graphics, with a main function creating karyoplots with no data and multiple additional functions, including custom functions written by the end-user, adding data and other graphical elements. This approach allows the creation of highly customizable plots from arbitrary data with complete freedom on data positioning and representation. AVAILABILITY AND IMPLEMENTATION: karyoploteR is released under Artistic-2.0 License. Source code and documentation are freely available through Bioconductor (http://www.bioconductor.org/packages/karyoploteR) and at the examples and tutorial page at https://bernatgel.github.io/karyoploter_tutorial. CONTACT: bgel@igtp.cat.

DOI

10.1093/bioinformatics/btx346

KwARG

Tool

PUBMED_LINK

33970217

DESCRIPTION

Ignatieva, A., Lyngsø, R. B., Jenkins, P. A. & Hein, J. KwARG: parsimonious reconstruction of ancestral recombination graphs with recurrent mutation. Bioinformatics 37, 3277–3284 (2021).

TITLE

KwARG: parsimonious reconstruction of ancestral recombination graphs with recurrent mutation.

Main citation

Ignatieva A, Lyngsø RB, Jenkins PA, Hein J. (2021) KwARG: parsimonious reconstruction of ancestral recombination graphs with recurrent mutation. Bioinformatics, 37 (19) 3277-3284. doi:10.1093/bioinformatics/btab351. PMID 33970217

ABSTRACT

MOTIVATION: The reconstruction of possible histories given a sample of genetic data in the presence of recombination and recurrent mutation is a challenging problem, but can provide key insights into the evolution of a population. We present KwARG, which implements a parsimony-based greedy heuristic algorithm for finding plausible genealogical histories (ancestral recombination graphs) that are minimal or near-minimal in the number of posited recombination and mutation events. RESULTS: Given an input dataset of aligned sequences, KwARG outputs a list of possible candidate solutions, each comprising a list of mutation and recombination events that could have generated the dataset; the relative proportion of recombinations and recurrent mutations in a solution can be controlled via specifying a set of 'cost' parameters. We demonstrate that the algorithm performs well when compared against existing methods. AVAILABILITY AND IMPLEMENTATION: The software is available at https://github.com/a-ignatieva/kwarg. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

DOI

10.1093/bioinformatics/btab351

lassosum

Tool

PUBMED_LINK

28480976

DESCRIPTION

lassosum is a method for computing LASSO/Elastic Net estimates of a linear regression problem given summary statistics from GWAS and Genome-wide meta-analyses, accounting for Linkage Disequilibrium (LD), via a reference panel.

URL

https://github.com/tshmak/lassosum

KEYWORDS

penalized regression

TITLE

Polygenic scores via penalized regression on summary statistics.

Main citation

Mak TSH, Porsch RM, Choi SW, Zhou X, ...&, Sham PC. (2017) Polygenic scores via penalized regression on summary statistics. Genet Epidemiol, 41 (6) 469-480. doi:10.1002/gepi.22050. PMID 28480976

ABSTRACT

Polygenic scores (PGS) summarize the genetic contribution of a person's genotype to a disease or phenotype. They can be used to group participants into different risk categories for diseases, and are also used as covariates in epidemiological analyses. A number of possible ways of calculating PGS have been proposed, and recently there is much interest in methods that incorporate information available in published summary statistics. As there is no inherent information on linkage disequilibrium (LD) in summary statistics, a pertinent question is how we can use LD information available elsewhere to supplement such analyses. To answer this question, we propose a method for constructing PGS using summary statistics and a reference panel in a penalized regression framework, which we call lassosum. We also propose a general method for choosing the value of the tuning parameter in the absence of validation data. In our simulations, we showed that pseudovalidation often resulted in prediction accuracy that is comparable to using a dataset with validation phenotype and was clearly superior to the conservative option of setting the tuning parameter of lassosum to its lowest value. We also showed that lassosum achieved better prediction accuracy than simple clumping and P-value thresholding in almost all scenarios. It was also substantially faster and more accurate than the recently proposed LDpred.

DOI

10.1002/gepi.22050
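
The penalized regression described in the lassosum entry above can be sketched with coordinate descent. This is a minimal illustration, assuming an elastic-net style objective of the form b'Ab - 2 b'r + 2*lam*|b|_1 with A = (1 - s)*R + s*I, where r holds SNP-wise correlations from summary statistics and R is an LD correlation matrix from a reference panel. The function names are hypothetical and this is not the lassosum package code.

```python
def soft_threshold(x, lam):
    """Soft-thresholding operator used by LASSO coordinate updates."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def lassosum_sketch(r, R, lam, s, n_iter=100):
    """Coordinate descent on b'Ab - 2 b'r + 2*lam*|b|_1, A = (1-s)*R + s*I.

    r:   SNP-wise correlations with the phenotype (from summary statistics)
    R:   LD correlation matrix from a reference panel (list of lists)
    lam: L1 penalty; s: shrinkage of R towards the identity (assumed form)
    """
    p = len(r)
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Contribution of all other SNPs through the off-diagonal LD.
            off = sum((1 - s) * R[j][k] * beta[k]
                      for k in range(p) if k != j)
            a_jj = (1 - s) * R[j][j] + s
            beta[j] = soft_threshold(r[j] - off, lam) / a_jj
    return beta


# With no LD between SNPs the solution is plain soft-thresholding of r.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(lassosum_sketch([0.5, -0.2, 0.05], identity, lam=0.1, s=0.5))
```

Small correlations are set exactly to zero by the L1 penalty, while the LD matrix lets correlated SNPs share, rather than double-count, the same signal.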

lassosum2

Tool

PUBMED_LINK

36105883

DESCRIPTION

lassosum2 is a re-implementation of the lassosum model that now uses the exact same input parameters as LDpred2 (corr and df_beta). It should be fast to run. It can be run next to LDpred2 and the best model can be chosen using the validation set. Note that parameter ‘s’ from lassosum has been replaced by a new parameter ‘delta’ in lassosum2, in order to better reflect that the lassosum model also uses L2-regularization (therefore, elastic-net regularization).

URL

https://privefl.github.io/bigsnpr/articles/LDpred2.html#lassosum2-grid-of-models

TITLE

Identifying and correcting for misspecifications in GWAS summary statistics and polygenic scores.

Main citation

Privé F, Arbel J, Aschard H, Vilhjálmsson BJ. (2022) Identifying and correcting for misspecifications in GWAS summary statistics and polygenic scores. HGG Adv, 3 (4) 100136. doi:10.1016/j.xhgg.2022.100136. PMID 36105883

ABSTRACT

Publicly available genome-wide association studies (GWAS) summary statistics exhibit uneven quality, which can impact the validity of follow-up analyses. First, we present an overview of possible misspecifications that come with GWAS summary statistics. Then, in both simulations and real-data analyses, we show that additional information such as imputation INFO scores, allele frequencies, and per-variant sample sizes in GWAS summary statistics can be used to detect possible issues and correct for misspecifications in the GWAS summary statistics. One important motivation for us is to improve the predictive performance of polygenic scores built from these summary statistics. Unfortunately, owing to the lack of reporting standards for GWAS summary statistics, this additional information is not systematically reported. We also show that using well-matched linkage disequilibrium (LD) references can improve model fit and translate into more accurate prediction. Finally, we discuss how to make polygenic score methods such as lassosum and LDpred2 more robust to these misspecifications to improve their predictive power.

DOI

10.1016/j.xhgg.2022.100136

LAVA

Tool

PUBMED_LINK

35288712

FULL NAME

Local Analysis of [co]Variant Association

DESCRIPTION

LAVA is a tool to conduct genome-wide, local genetic correlation analysis on multiple traits, using GWAS summary statistics as input.

URL

https://ctg.cncr.nl/software/lava

TITLE

An integrated framework for local genetic correlation analysis.

Main citation

Werme J, van der Sluis S, Posthuma D, de Leeuw CA. (2022) An integrated framework for local genetic correlation analysis. Nat Genet, 54 (3) 274-282. doi:10.1038/s41588-022-01017-y. PMID 35288712

ABSTRACT

Genetic correlation (rg) analysis is used to identify phenotypes that may have a shared genetic basis. Traditionally, rg is studied globally, considering only the average of the shared signal across the genome, although this approach may fail when the rg is confined to particular genomic regions or in opposing directions at different loci. Current tools for local rg analysis are restricted to analysis of two phenotypes. Here we introduce LAVA, an integrated framework for local rg analysis that, in addition to testing the standard bivariate local rgs between two phenotypes, can evaluate local heritabilities and analyze conditional genetic relations between several phenotypes using partial correlation and multiple regression. Applied to 25 behavioral and health phenotypes, we show considerable heterogeneity in the bivariate local rgs across the genome, which is often masked by the global rg patterns, and demonstrate how our conditional approaches can elucidate more complex, multivariate genetic relations.

DOI

10.1038/s41588-022-01017-y

LCP-GWAS

Tool

PUBMED_LINK

33110245

FULL NAME

Linear Combination Phenotype GWAS

KEYWORDS

multivariate GWAS follow-up analyses

TITLE

An expanded analysis framework for multivariate GWAS connects inflammatory biomarkers to functional variants and disease.

Main citation

Ruotsalainen SE, Partanen JJ, Cichonska A, Lin J, ...&, Koskela J. (2021) An expanded analysis framework for multivariate GWAS connects inflammatory biomarkers to functional variants and disease. Eur J Hum Genet, 29 (2) 309-324. doi:10.1038/s41431-020-00730-8. PMID 33110245

ABSTRACT

Multivariate methods are known to increase the statistical power to detect associations in the case of shared genetic basis between phenotypes. They have, however, lacked essential analytic tools to follow-up and understand the biology underlying these associations. We developed a novel computational workflow for multivariate GWAS follow-up analyses, including fine-mapping and identification of the subset of traits driving associations (driver traits). Many follow-up tools require univariate regression coefficients which are lacking from multivariate results. Our method overcomes this problem by using Canonical Correlation Analysis to turn each multivariate association into its optimal univariate Linear Combination Phenotype (LCP). This enables an LCP-GWAS, which in turn generates the statistics required for follow-up analyses. We implemented our method on 12 highly correlated inflammatory biomarkers in a Finnish population-based study. Altogether, we identified 11 associations, four of which (F5, ABO, C1orf140 and PDGFRB) were not detected by biomarker-specific analyses. Fine-mapping identified 19 signals within the 11 loci and driver trait analysis determined the traits contributing to the associations. A phenome-wide association study on the 19 representative variants from the signals in 176,899 individuals from the FinnGen study revealed 53 disease associations (p < 1 × 10^-4). Several reported pQTLs in the 11 loci provided orthogonal evidence for the biologically relevant functions of the representative variants. Our novel multivariate analysis workflow provides a powerful addition to standard univariate GWAS analyses by enabling multivariate GWAS follow-up and thus promoting the advancement of powerful multivariate methods in genomics.

DOI

10.1038/s41431-020-00730-8

LDAK

Tool

PUBMED_LINK

23217325

FULL NAME

LD-adjusted kinships

DESCRIPTION

LDAK is a software package for analysing association study data.

URL

TITLE

Improved heritability estimation from genome-wide SNPs.

Main citation

Speed D, Hemani G, Johnson MR, Balding DJ. (2012) Improved heritability estimation from genome-wide SNPs. Am J Hum Genet, 91 (6) 1011-21. doi:10.1016/j.ajhg.2012.10.010. PMID 23217325

ABSTRACT

Estimation of narrow-sense heritability, h^2, from genome-wide SNPs genotyped in unrelated individuals has recently attracted interest and offers several advantages over traditional pedigree-based methods. With the use of this approach, it has been estimated that over half the heritability of human height can be attributed to the ~300,000 SNPs on a genome-wide genotyping array. In comparison, only 5%-10% can be explained by SNPs reaching genome-wide significance. We investigated via simulation the validity of several key assumptions underpinning the mixed-model analysis used in SNP-based h^2 estimation. Although we found that the method is reasonably robust to violations of four key assumptions, it can be highly sensitive to uneven linkage disequilibrium (LD) between SNPs: contributions to h^2 are overestimated from causal variants in regions of high LD and are underestimated in regions of low LD. The overall direction of the bias can be up or down depending on the genetic architecture of the trait, but it can be substantial in realistic scenarios. We propose a modified kinship matrix in which SNPs are weighted according to local LD. We show that this correction greatly reduces the bias and increases the precision of h^2 estimates. We demonstrate the impact of our method on the first seven diseases studied by the Wellcome Trust Case Control Consortium. Our LD adjustment revises downward the h^2 estimate for immune-related diseases, as expected because of high LD in the major-histocompatibility region, but increases it for some nonimmune diseases. To calculate our revised kinship matrix, we developed LDAK, software for computing LD-adjusted kinships.

DOI

10.1016/j.ajhg.2012.10.010

LDAK-GBAT

Tool

PUBMED_LINK

36480927

FULL NAME

LDAK gene-based association testing

URL

TITLE

LDAK-GBAT: Fast and powerful gene-based association testing using summary statistics.

Main citation

Berrandou TE, Balding D, Speed D. (2023) LDAK-GBAT: Fast and powerful gene-based association testing using summary statistics. Am J Hum Genet, 110 (1) 23-29. doi:10.1016/j.ajhg.2022.11.010. PMID 36480927

ABSTRACT

We present LDAK-GBAT, a tool for gene-based association testing using summary statistics from genome-wide association studies that is computationally efficient, produces well-calibrated p values, and is significantly more powerful than existing tools. LDAK-GBAT takes approximately 30 min to analyze imputed data (2.9M common, genic SNPs), requiring less than 10 Gb memory. It shows good control of type 1 error given an appropriate reference panel. Across 109 phenotypes (82 from the UK Biobank, 18 from the Million Veteran Program, and nine from the Psychiatric Genetics Consortium), LDAK-GBAT finds on average 19% (SE: 1%) more significant genes than the existing tool sumFREGAT-ACAT, with even greater gains in comparison with MAGMA, GCTA-fastBAT, sumFREGAT-SKAT-O, and sumFREGAT-PCA.

DOI

10.1016/j.ajhg.2022.11.010

LDAK-KVIK

Tool

URL

Main citation

Hof, J. P. & Speed, D. LDAK-KVIK performs fast and powerful mixed-model association analysis of quantitative and binary phenotypes. bioRxiv 2024.07.25.24311005 (2024) doi:10.1101/2024.07.25.24311005.

LDlink

Tool

PUBMED_LINK

26139635

DESCRIPTION

LDlink is a suite of web-based applications designed to easily and efficiently interrogate linkage disequilibrium in population groups. Each included application is specialized for querying and displaying unique aspects of linkage disequilibrium.

URL

https://ldlink.nci.nih.gov/?tab=home

TITLE

LDlink: a web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants.

Main citation

Machiela MJ, Chanock SJ. (2015) LDlink: a web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants. Bioinformatics, 31 (21) 3555-7. doi:10.1093/bioinformatics/btv402. PMID 26139635

ABSTRACT

Assessing linkage disequilibrium (LD) across ancestral populations is a powerful approach for investigating population-specific genetic structure as well as functionally mapping regions of disease susceptibility. Here, we present LDlink, a web-based collection of bioinformatic modules that query single nucleotide polymorphisms (SNPs) in population groups of interest to generate haplotype tables and interactive plots. Modules are designed with an emphasis on ease of use, query flexibility, and interactive visualization of results. Phase 3 haplotype data from the 1000 Genomes Project are referenced for calculating pairwise metrics of LD, searching for proxies in high LD, and enumerating all observed haplotypes. LDlink is tailored for investigators interested in mapping common and uncommon disease susceptibility loci by focusing on output linking correlated alleles and highlighting putative functional variants. AVAILABILITY AND IMPLEMENTATION: LDlink is a free and publicly available web tool which can be accessed at http://analysistools.nci.nih.gov/LDlink/. CONTACT: mitchell.machiela@nih.gov.

DOI

10.1093/bioinformatics/btv402

LDlinkR

Tool

PUBMED_LINK

32180801

DESCRIPTION

An R Package for Rapidly Calculating Linkage Disequilibrium Statistics in Diverse Populations

URL

https://cran.r-project.org/web/packages/LDlinkR/index.html

Main citation

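Myers TA, Chanock SJ, Machiela MJ. (2020) LDlinkR: An R Package for Rapidly Calculating Linkage Disequilibrium Statistics in Diverse Populations. Front Genet, 11, 157. doi:10.3389/fgene.2020.00157. PMID 32180801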

ABSTRACT

Genomic research involving human genetics and evolutionary biology relies heavily on linkage disequilibrium (LD) to investigate population-specific genetic structure, functionally map regions of disease susceptibility and uncover evolutionary history. Interactive and powerful tools are needed to calculate population-specific LD estimates for integrative genomics research. LDlink is an interactive suite of web-based tools developed to query germline variants in 1000 Genomes Project population groups of interest and generate interactive tables and plots of LD estimates. As an expansion to this resource, we have developed an R package, LDlinkR, designed to rapidly calculate statistics for large lists of variants and LD attributes that eliminates the time needed to perform repetitive requests from the web-based LDlink tool. LDlinkR accelerates genomic research by providing efficient and user-friendly functions to programmatically interrogate and download pairwise LD estimates from expansive lists of genetic variants. LDlinkR is a free and publicly available R package that can be installed from the Comprehensive R Archive Network (CRAN) or downloaded from https://github.com/CBIIT/LDlinkR.

DOI

10.3389/fgene.2020.00157

LDpred

Tool

PUBMED_LINK

26430803

DESCRIPTION

LDpred is a Python based software package that adjusts GWAS summary statistics for the effects of linkage disequilibrium (LD).

URL

https://github.com/bvilhjal/ldpred

KEYWORDS

Bayesian, Gaussian infinitesimal prior, python

TITLE

Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores.

Main citation

Vilhjálmsson BJ, Yang J, Finucane HK, Gusev A, ...&, Price AL. (2015) Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores. Am J Hum Genet, 97 (4) 576-92. doi:10.1016/j.ajhg.2015.09.001. PMID 26430803

ABSTRACT

Polygenic risk scores have shown great promise in predicting complex disease risk and will become more accurate as training sample sizes increase. The standard approach for calculating risk scores involves linkage disequilibrium (LD)-based marker pruning and applying a p value threshold to association statistics, but this discards information and can reduce predictive accuracy. We introduce LDpred, a method that infers the posterior mean effect size of each marker by using a prior on effect sizes and LD information from an external reference panel. Theory and simulations show that LDpred outperforms the approach of pruning followed by thresholding, particularly at large sample sizes. Accordingly, predicted R^2 increased from 20.1% to 25.3% in a large schizophrenia dataset and from 9.8% to 12.0% in a large multiple sclerosis dataset. A similar relative improvement in accuracy was observed for three additional large disease datasets and for non-European schizophrenia samples. The advantage of LDpred over existing methods will grow as sample sizes increase.

DOI

10.1016/j.ajhg.2015.09.001
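
The posterior mean effect size mentioned in the LDpred abstract above has a closed form in one simple special case: a Gaussian infinitesimal prior with no LD between markers. The sketch below covers only that case, assuming standardized genotypes and phenotype. It is not the LDpred software, which uses an LD matrix from the reference panel and also supports a non-infinitesimal prior. The function name is hypothetical.

```python
def infinitesimal_shrinkage_no_ld(beta_hat, h2, n, m):
    """Posterior mean effects under a Gaussian prior, assuming unlinked markers.

    beta_hat: marginal GWAS effect estimates (standardized scale)
    h2:       assumed SNP heritability
    n:        GWAS sample size
    m:        number of markers
    Each estimate is shrunk by the same factor h2 / (h2 + m / n).
    """
    shrink = h2 / (h2 + m / n)
    return [shrink * b for b in beta_hat]


# With m / n equal to h2 the estimates are shrunk by exactly one half.
print(infinitesimal_shrinkage_no_ld([0.02, -0.04], h2=0.5, n=1000, m=500))
```

The shrinkage factor approaches 1 as the sample size grows relative to the number of markers, which matches the abstract's point that the benefit of modeling the prior depends on training sample size.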

LDpred-funct

Tool

PUBMED_LINK

34663819

DESCRIPTION

LDpred-funct is a method for polygenic prediction that leverages trait-specific functional priors to increase prediction accuracy.

URL

https://github.com/carlaml/LDpred-funct

KEYWORDS

Bayesian, functional priors

TITLE

Incorporating functional priors improves polygenic prediction accuracy in UK Biobank and 23andMe data sets.

Main citation

Márquez-Luna C, Gazal S, Loh PR, Kim SS, ...&, Price AL. (2021) Incorporating functional priors improves polygenic prediction accuracy in UK Biobank and 23andMe data sets. Nat Commun, 12 (1) 6052. doi:10.1038/s41467-021-25171-9. PMID 34663819

ABSTRACT

Polygenic risk prediction is a widely investigated topic because of its promising clinical applications. Genetic variants in functional regions of the genome are enriched for complex trait heritability. Here, we introduce a method for polygenic prediction, LDpred-funct, that leverages trait-specific functional priors to increase prediction accuracy. We fit priors using the recently developed baseline-LD model, including coding, conserved, regulatory, and LD-related annotations. We analytically estimate posterior mean causal effect sizes and then use cross-validation to regularize these estimates, improving prediction accuracy for sparse architectures. We applied LDpred-funct to predict 21 highly heritable traits in the UK Biobank (avg N = 373 K as training data). LDpred-funct attained a +4.6% relative improvement in average prediction accuracy (avg prediction R^2 = 0.144; highest R^2 = 0.413 for height) compared to SBayesR (the best method that does not incorporate functional information). For height, meta-analyzing training data from UK Biobank and 23andMe cohorts (N = 1107 K) increased prediction R^2 to 0.431. Our results show that incorporating functional priors improves polygenic prediction accuracy, consistent with the functional architecture of complex traits.

DOI

10.1038/s41467-021-25171-9

LDpred2

Tool

PUBMED_LINK

33326037

DESCRIPTION

LDpred2 is a dedicated PRS method, implemented in the R package bigsnpr, that uses a Bayesian approach to polygenic risk scoring.

URL

https://privefl.github.io/bigsnpr/articles/LDpred2.html

KEYWORDS

Bayesian, R, LDpred2-grid (LDpred2), LDpred2-auto, LDpred2-sparse

TITLE

LDpred2: better, faster, stronger.

Main citation

Privé F, Arbel J, Vilhjálmsson BJ. (2021) LDpred2: better, faster, stronger. Bioinformatics, 36 (22-23) 5424-5431. doi:10.1093/bioinformatics/btaa1029. PMID 33326037

ABSTRACT

MOTIVATION: Polygenic scores have become a central tool in human genetics research. LDpred is a popular method for deriving polygenic scores based on summary statistics and a matrix of correlation between genetic variants. However, LDpred has limitations that may reduce its predictive performance. RESULTS: Here, we present LDpred2, a new version of LDpred that addresses these issues. We also provide two new options in LDpred2: a 'sparse' option that can learn effects that are exactly 0, and an 'auto' option that directly learns the two LDpred parameters from data. We benchmark predictive performance of LDpred2 against the previous version on simulated and real data, demonstrating substantial improvements in robustness and predictive accuracy compared to LDpred1. We then show that LDpred2 also outperforms other polygenic score methods recently developed, with a mean AUC over the 8 real traits analyzed here of 65.1%, compared to 63.8% for lassosum, 62.9% for PRS-CS and 61.5% for SBayesR. Note that LDpred2 provides more accurate polygenic scores when run genome-wide, instead of per chromosome. AVAILABILITY AND IMPLEMENTATION: LDpred2 is implemented in R package bigsnpr. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

DOI

10.1093/bioinformatics/btaa1029

LDpred2-auto

Tool

PUBMED_LINK

37944514

DESCRIPTION

LDpred2 is a widely used Bayesian method for building polygenic scores (PGS). LDpred2-auto can infer the two parameters from the LDpred model, h^2 and p, so that it does not require an additional validation dataset to choose best-performing parameters. Here, we present a new version of LDpred2-auto, which adds a third parameter alpha to its model for modeling negative selection. Additional changes are also made to provide better sampling of these parameters.

URL

https://privefl.github.io/bigsnpr/articles/LDpred2.html

KEYWORDS

Bayesian, new LDpred2-auto, α (relationship between MAF and beta)

TITLE

Inferring disease architecture and predictive ability with LDpred2-auto.

Main citation

Privé F, Albiñana C, Arbel J, Pasaniuc B, ...&, Vilhjálmsson BJ. (2023) Inferring disease architecture and predictive ability with LDpred2-auto. Am J Hum Genet, 110 (12) 2042-2055. doi:10.1016/j.ajhg.2023.10.010. PMID 37944514

ABSTRACT

LDpred2 is a widely used Bayesian method for building polygenic scores (PGSs). LDpred2-auto can infer the two parameters from the LDpred model, the SNP heritability h^2 and polygenicity p, so that it does not require an additional validation dataset to choose best-performing parameters. The main aim of this paper is to properly validate the use of LDpred2-auto for inferring multiple genetic parameters. Here, we present a new version of LDpred2-auto that adds an optional third parameter α to its model, for modeling negative selection. We then validate the inference of these three parameters (or two, when using the previous model). We also show that LDpred2-auto provides per-variant probabilities of being causal that are well calibrated and can therefore be used for fine-mapping purposes. We also introduce a formula to infer the out-of-sample predictive performance r^2 of the resulting PGS directly from the Gibbs sampler of LDpred2-auto. Finally, we extend the set of HapMap3 variants recommended to use with LDpred2 with 37% more variants to improve the coverage of this set, and we show that this new set of variants captures 12% more heritability and provides 6% more predictive performance, on average, in UK Biobank analyses.

DOI

10.1016/j.ajhg.2023.10.010

LDSC

Tool

PUBMED_LINK

25642630

FULL NAME

LD Score Regression

DESCRIPTION

ldsc is a command line tool for estimating heritability and genetic correlation from GWAS summary statistics. ldsc also computes LD Scores.

URL

TITLE

LD Score regression distinguishes confounding from polygenicity in genome-wide association studies.

Main citation

Bulik-Sullivan BK, Loh PR, Finucane HK, Ripke S, ...&, Neale BM. (2015) LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat Genet, 47 (3) 291-5. doi:10.1038/ng.3211. PMID 25642630

ABSTRACT

Both polygenicity (many small genetic effects) and confounding biases, such as cryptic relatedness and population stratification, can yield an inflated distribution of test statistics in genome-wide association studies (GWAS). However, current methods cannot distinguish between inflation from a true polygenic signal and bias. We have developed an approach, LD Score regression, that quantifies the contribution of each by examining the relationship between test statistics and linkage disequilibrium (LD). The LD Score regression intercept can be used to estimate a more powerful and accurate correction factor than genomic control. We find strong evidence that polygenicity accounts for the majority of the inflation in test statistics in many GWAS of large sample size.

DOI

10.1038/ng.3211
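
The relationship between test statistics and LD described in the LDSC abstract above can be illustrated with a simple regression. This is a minimal sketch, assuming the expected chi-square of a variant is linear in its LD Score, with slope N * h^2 / M and an intercept that absorbs confounding. It uses plain unweighted least squares on made-up numbers, which is a simplification for illustration and not the estimator implemented in the ldsc tool. The function name is hypothetical.

```python
def ld_score_regression_sketch(chi2, ld_scores, n, m):
    """Unweighted least-squares fit of chi2 = intercept + slope * LD Score.

    chi2:      per-variant chi-square association statistics
    ld_scores: per-variant LD Scores
    n:         GWAS sample size; m: number of variants
    Returns (h2 estimate, intercept), using slope = n * h2 / m.
    """
    k = len(chi2)
    mean_l = sum(ld_scores) / k
    mean_c = sum(chi2) / k
    sxy = sum((l - mean_l) * (c - mean_c) for l, c in zip(ld_scores, chi2))
    sxx = sum((l - mean_l) ** 2 for l in ld_scores)
    slope = sxy / sxx
    intercept = mean_c - slope * mean_l
    return slope * m / n, intercept


# Synthetic, noise-free example: h2 = 0.4 and an intercept of 1.05.
n, m = 50_000, 1_000_000
ld = [1.0, 5.0, 10.0, 50.0, 100.0]
chi2 = [1.05 + (n * 0.4 / m) * l for l in ld]
print(ld_score_regression_sketch(chi2, ld, n, m))
```

An intercept close to 1 with a positive slope points to polygenicity, while an intercept well above 1 suggests inflation from confounding such as population stratification.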

LDSC-SEG

Tool

PUBMED_LINK

29632380

FULL NAME

LD score regression applied to specifically expressed genes

URL

KEYWORDS

LDSC, tissue, cell type

TITLE

Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types.

Main citation

Finucane HK, Reshef YA, Anttila V, Slowikowski K, ...&, Price AL. (2018) Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types. Nat Genet, 50 (4) 621-629. doi:10.1038/s41588-018-0081-4. PMID 29632380

ABSTRACT

We introduce an approach to identify disease-relevant tissues and cell types by analyzing gene expression data together with genome-wide association study (GWAS) summary statistics. Our approach uses stratified linkage disequilibrium (LD) score regression to test whether disease heritability is enriched in regions surrounding genes with the highest specific expression in a given tissue. We applied our approach to gene expression data from several sources together with GWAS summary statistics for 48 diseases and traits (average N = 169,331) and found significant tissue-specific enrichments (false discovery rate (FDR) < 5%) for 34 traits. In our analysis of multiple tissues, we detected a broad range of enrichments that recapitulated known biology. In our brain-specific analysis, significant enrichments included an enrichment of inhibitory over excitatory neurons for bipolar disorder, and excitatory over inhibitory neurons for schizophrenia and body mass index. Our results demonstrate that our polygenic approach is a powerful way to leverage gene expression data for interpreting GWAS signals.

DOI

10.1038/s41588-018-0081-4

LDSTORE2

Tool

PUBMED_LINK

28942963

DESCRIPTION

LDstore is a computationally efficient program for estimating and storing Linkage Disequilibrium (SNP correlations). It combines some of the best features from RAREMETALWORKER and PLINK by implementing parallel processing using OPENMP and storing of the SNP correlations with information about the SNPs in the same binary file for fast lookups. LDstore is therefore the ideal tool for sharing SNP correlations in large-scale meta-analyses of genome-wide association studies and for on-the-fly computing/querying within web portals.
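The "SNP correlations" referred to above are Pearson correlations between genotype vectors. As a rough illustration only, the sketch below computes r and r-squared for a few made-up SNPs in pure Python; the function name `snp_correlation` and the toy genotypes are assumptions for this example and do not reflect LDstore's file format or interface.

```python
from math import sqrt


def snp_correlation(g1, g2):
    """Pearson correlation r between two genotype vectors coded 0/1/2."""
    n = len(g1)
    m1, m2 = sum(g1) / n, sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    var1 = sum((a - m1) ** 2 for a in g1)
    var2 = sum((b - m2) ** 2 for b in g2)
    return cov / sqrt(var1 * var2)


# Toy genotypes for six individuals (made-up values).
genotypes = {
    "snpA": [0, 1, 2, 1, 0, 2],
    "snpB": [0, 1, 2, 1, 0, 2],  # identical to snpA, so r = 1
    "snpC": [0, 0, 1, 1, 2, 2],
}

names = list(genotypes)
ld_matrix = {
    (a, b): snp_correlation(genotypes[a], genotypes[b])
    for a in names
    for b in names
}
for (a, b), r in ld_matrix.items():
    print(a, b, round(r, 4), round(r * r, 4))
```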

URL

http://www.christianbenner.com/#

TITLE

Prospects of Fine-Mapping Trait-Associated Genomic Regions by Using Summary Statistics from Genome-wide Association Studies.

Main citation

Benner C, Havulinna AS, Järvelin MR, Salomaa V, ...&, Pirinen M. (2017) Prospects of Fine-Mapping Trait-Associated Genomic Regions by Using Summary Statistics from Genome-wide Association Studies. Am J Hum Genet, 101 (4) 539-551. doi:10.1016/j.ajhg.2017.08.012. PMID 28942963

ABSTRACT

During the past few years, various novel statistical methods have been developed for fine-mapping with the use of summary statistics from genome-wide association studies (GWASs). Although these approaches require information about the linkage disequilibrium (LD) between variants, there has not been a comprehensive evaluation of how estimation of the LD structure from reference genotype panels performs in comparison with that from the original individual-level GWAS data. Using population genotype data from Finland and the UK Biobank, we show here that a reference panel of 1,000 individuals from the target population is adequate for a GWAS cohort of up to 10,000 individuals, whereas smaller panels, such as those from the 1000 Genomes Project, should be avoided. We also show, both theoretically and empirically, that the size of the reference panel needs to scale with the GWAS sample size; this has important consequences for the application of these methods in ongoing GWAS meta-analyses and large biobank studies. We conclude by providing software tools and by recommending practices for sharing LD information to more efficiently exploit summary statistics in genetics research.

DOI

10.1016/j.ajhg.2017.08.012

LeafCutter

Tool

PUBMED_LINK

29229983

DESCRIPTION

LeafCutter quantifies RNA splicing variation using short-read RNA-seq data. The core idea is to leverage spliced reads (reads that span an intron) to quantify (differential) intron usage across samples.
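The core idea can be sketched as follows: introns that share a splice site are grouped into a cluster, and each intron's usage is its spliced-read count divided by the total count in its cluster. This is a simplified illustration with made-up coordinates and counts; the function names are hypothetical and are not LeafCutter's own code or interface.

```python
from collections import defaultdict


def cluster_introns(introns):
    """Group introns (chrom, start, end) that share a start or end coordinate."""
    parent = {i: i for i in introns}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    by_site = defaultdict(list)
    for chrom, start, end in introns:
        by_site[(chrom, "start", start)].append((chrom, start, end))
        by_site[(chrom, "end", end)].append((chrom, start, end))
    for members in by_site.values():
        for other in members[1:]:
            parent[find(other)] = find(members[0])

    clusters = defaultdict(list)
    for intron in introns:
        clusters[find(intron)].append(intron)
    return list(clusters.values())


def intron_usage(counts):
    """Usage ratio per intron: its read count over its cluster's total count."""
    ratios = {}
    for cluster in cluster_introns(list(counts)):
        total = sum(counts[i] for i in cluster)
        for i in cluster:
            ratios[i] = counts[i] / total if total else 0.0
    return ratios


# Made-up spliced-read counts for one sample.
counts = {
    ("chr1", 100, 200): 30,
    ("chr1", 100, 300): 10,
    ("chr1", 250, 300): 10,
    ("chr1", 500, 600): 7,
}
usage = intron_usage(counts)
for intron, ratio in usage.items():
    print(intron, round(ratio, 2))
```

Comparing these ratios between sample groups, rather than raw counts, is what makes the quantification independent of transcript annotations.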

URL

https://davidaknowles.github.io/leafcutter/

TITLE

Annotation-free quantification of RNA splicing using LeafCutter.

Main citation

Li YI, Knowles DA, Humphrey J, Barbeira AN, ...&, Pritchard JK. (2018) Annotation-free quantification of RNA splicing using LeafCutter. Nat Genet, 50 (1) 151-158. doi:10.1038/s41588-017-0004-9. PMID 29229983

ABSTRACT

The excision of introns from pre-mRNA is an essential step in mRNA processing. We developed LeafCutter to study sample and population variation in intron splicing. LeafCutter identifies variable splicing events from short-read RNA-seq data and finds events of high complexity. Our approach obviates the need for transcript annotations and circumvents the challenges in estimating relative isoform or exon usage in complex splicing events. LeafCutter can be used both to detect differential splicing between sample groups and to map splicing quantitative trait loci (sQTLs). Compared with contemporary methods, our approach identified 1.4-2.1 times more sQTLs, many of which helped us ascribe molecular effects to disease-associated variants. Transcriptome-wide associations between LeafCutter intron quantifications and 40 complex traits increased the number of associated disease genes at a 5% false discovery rate by an average of 2.1-fold compared with that detected through the use of gene expression levels alone. LeafCutter is fast, scalable, easy to use, and available online.

DOI

10.1038/s41588-017-0004-9

LEMMA

Tool

PUBMED_LINK

32888427

FULL NAME

Linear Environment Mixed Model Analysis

DESCRIPTION

LEMMA (Linear Environment Mixed Model Analysis) is a whole-genome regression method for flexible modeling of gene-environment interactions in large datasets such as the UK Biobank.

URL

https://github.com/mkerin/LEMMA

TITLE

Inferring Gene-by-Environment Interactions with a Bayesian Whole-Genome Regression Model.

Main citation

Kerin M, Marchini J. (2020) Inferring Gene-by-Environment Interactions with a Bayesian Whole-Genome Regression Model. Am J Hum Genet, 107 (4) 698-713. doi:10.1016/j.ajhg.2020.08.009. PMID 32888427

ABSTRACT

The contribution of gene-by-environment (GxE) interactions for many human traits and diseases is poorly characterized. We propose a Bayesian whole-genome regression model for joint modeling of main genetic effects and GxE interactions in large-scale datasets, such as the UK Biobank, where many environmental variables have been measured. The method is called LEMMA (Linear Environment Mixed Model Analysis) and estimates a linear combination of environmental variables, called an environmental score (ES), that interacts with genetic markers throughout the genome. The ES provides a readily interpretable way to examine the combined effect of many environmental variables. The ES can be used both to estimate the proportion of phenotypic variance attributable to GxE effects and to test for GxE effects at genetic variants across the genome. GxE effects can induce heteroskedasticity in quantitative traits, and LEMMA accounts for this by using robust standard error estimates when testing for GxE effects. When applied to body mass index, systolic blood pressure, diastolic blood pressure, and pulse pressure in the UK Biobank, we estimate that 9.3%, 3.9%, 1.6%, and 12.5%, respectively, of phenotypic variance is explained by GxE interactions and that low-frequency variants explain most of this variance. We also identify three loci that interact with the estimated environmental scores (-log10p>7.3).
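The model described in the abstract can be written compactly as below. This is a sketch in notation chosen here, so the symbols are assumptions, and covariates and prior distributions are omitted: y is the phenotype vector, X the genotype matrix, E the matrix of environmental variables, w the environmental weights, and β and γ the main-effect and interaction-effect vectors.

```latex
\eta = E w, \qquad
y = X\beta + \eta \odot (X\gamma) + \varepsilon
```

Here η is the environmental score (ES) and ⊙ denotes element-wise multiplication, so each individual's interaction term is their ES multiplied by a genome-wide linear combination of their genotypes.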

DOI

10.1016/j.ajhg.2020.08.009

Locityper

Tool

PUBMED_LINK

41107551

DESCRIPTION

Locityper performs targeted genotyping of structurally variable and hyperpolymorphic genes—including HLA, KIR, MUC, and FCGR families—from short- or long-read whole-genome sequencing by aligning reads to locus haplotypes (often from pangenomes) and scoring depth and insert-size consistency.

URL

https://github.com/tprodanov/locityper

KEYWORDS

genotyping, complex loci, HLA, short read, long read, WGS

TITLE

Locityper enables targeted genotyping of complex polymorphic genes.

Main citation

Prodanov T, Plender EG, Seebohm G, Meuth SG, ...&, Marschall T. (2025) Locityper enables targeted genotyping of complex polymorphic genes. Nat Genet, 57 (11) 2901-2908. doi:10.1038/s41588-025-02362-4. PMID 41107551

ABSTRACT

The human genome contains many structurally variable polymorphic loci, including several hundred disease-associated genes, almost inaccessible for accurate variant calling. Here we present Locityper, a tool capable of genotyping such challenging genes using short-read and long-read whole-genome sequencing. For each target, Locityper recruits and aligns reads to locus haplotypes, for instance, extracted from a pangenome, and finds the likeliest haplotype pair by optimizing read alignment, insert size and read depth profiles. Across 256 challenging medically relevant loci, Locityper achieves a median quality value (QV) above 35 from both long-read and short-read data, outperforming state-of-the-art Illumina and PacBio HiFi variant calling pipelines by 10.9 and 1.7 points, respectively. Furthermore, Locityper provides access to hyperpolymorphic HLA genes and other gene families, including KIR, MUC and FCGR. With its low running time of 1 h 35 m per sample at eight threads, Locityper is scalable to biobank-sized cohorts, enabling association studies for previously intractable disease-relevant genes.

DOI

10.1038/s41588-025-02362-4

locuszoom

Tool

PUBMED_LINK

20634204

URL

http://locuszoom.org/

TITLE

LocusZoom: regional visualization of genome-wide association scan results.

Main citation

Pruim RJ, Welch RP, Sanna S, Teslovich TM, ...&, Willer CJ. (2010) LocusZoom: regional visualization of genome-wide association scan results. Bioinformatics, 26 (18) 2336-7. doi:10.1093/bioinformatics/btq419. PMID 20634204

ABSTRACT

Genome-wide association studies (GWAS) have revealed hundreds of loci associated with common human genetic diseases and traits. We have developed a web-based plotting tool that provides fast visual display of GWAS results in a publication-ready format. LocusZoom visually displays regional information such as the strength and extent of the association signal relative to genomic position, local linkage disequilibrium (LD) and recombination patterns and the positions of genes in the region. AVAILABILITY: LocusZoom can be accessed from a web interface at http://csg.sph.umich.edu/locuszoom. Users may generate a single plot using a web form, or many plots using batch mode. The software utilizes LD information from HapMap Phase II (CEU, YRI and JPT+CHB) or 1000 Genomes (CEU) and gene information from the UCSC browser, and will accept SNP identifiers in dbSNP or 1000 Genomes format. Single plots are generated in approximately 20 s. Source code and associated databases are available for download and local installation, and full documentation is available online.

DOI

10.1093/bioinformatics/btq419

loftee

Tool

PUBMED_LINK

32461654

FULL NAME

Loss-Of-Function Transcript Effect Estimator

DESCRIPTION

A VEP plugin to identify LoF (loss-of-function) variation. It currently assesses stop-gained, splice-site-disrupting, and frameshift variants.

URL

https://github.com/konradjk/loftee

TITLE

The mutational constraint spectrum quantified from variation in 141,456 humans.

Main citation

Karczewski KJ, Francioli LC, Tiao G, Cummings BB, ...&, MacArthur DG. (2020) The mutational constraint spectrum quantified from variation in 141,456 humans. Nature, 581 (7809) 434-443. doi:10.1038/s41586-020-2308-7. PMID 32461654

ABSTRACT

Genetic variants that inactivate protein-coding genes are a powerful source of information about the phenotypic consequences of gene disruption: genes that are crucial for the function of an organism will be depleted of such variants in natural populations, whereas non-essential genes will tolerate their accumulation. However, predicted loss-of-function variants are enriched for annotation errors, and tend to be found at extremely low frequencies, so their analysis requires careful variant annotation and very large sample sizes. Here we describe the aggregation of 125,748 exomes and 15,708 genomes from human sequencing studies into the Genome Aggregation Database (gnomAD). We identify 443,769 high-confidence predicted loss-of-function variants in this cohort after filtering for artefacts caused by sequencing and annotation errors. Using an improved model of human mutation rates, we classify human protein-coding genes along a spectrum that represents tolerance to inactivation, validate this classification using data from model organisms and engineered human cells, and show that it can be used to improve the power of gene discovery for both common and rare diseases.

DOI

10.1038/s41586-020-2308-7

Logica

Tool

FULL NAME

LOcal GenetIc Correlation across Ancestries

DESCRIPTION

Logica (LOcal GenetIc Correlation across Ancestries) is a method specifically designed to estimate local genetic correlations across ancestries. It employs a bivariate linear mixed model that explicitly accounts for diverse LD patterns across ancestries, operates on GWAS summary statistics, and uses a maximum likelihood framework for robust inference. Logica is implemented as an open-source R package.

URL

https://github.com/borangao/Logica

LT-FH

Tool

PUBMED_LINK

32313248

FULL NAME

liability threshold model, conditional on case–control status and family history

DESCRIPTION

An association method based on posterior mean genetic liabilities under a liability threshold model, conditional on case-control status and family history (LT-FH).

URL

https://alkesgroup.broadinstitute.org/UKBB/LTFH/

TITLE

Liability threshold modeling of case-control status and family history of disease increases association power.

Main citation

Hujoel MLA, Gazal S, Loh PR, Patterson N, ...&, Price AL. (2020) Liability threshold modeling of case-control status and family history of disease increases association power. Nat Genet, 52 (5) 541-547. doi:10.1038/s41588-020-0613-6. PMID 32313248

ABSTRACT

Family history of disease can provide valuable information in case-control association studies, but it is currently unclear how to best combine case-control status and family history of disease. We developed an association method based on posterior mean genetic liabilities under a liability threshold model, conditional on case-control status and family history (LT-FH). Analyzing 12 diseases from the UK Biobank (average N = 350,000) we compared LT-FH to genome-wide association without using family history (GWAS) and a previous proxy-based method incorporating family history (GWAX). LT-FH was 63% (standard error (s.e.) 6%) more powerful than GWAS and 36% (s.e. 4%) more powerful than the trait-specific maximum of GWAS and GWAX, based on the number of independent genome-wide-significant loci across all diseases (for example, 690 loci for LT-FH versus 423 for GWAS); relative improvements were similar when applying BOLT-LMM to GWAS, GWAX and LT-FH phenotypes. Thus, LT-FH greatly increases association power when family history of disease is available.
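To make "posterior mean genetic liability" concrete, the sketch below covers only the simplest special case: conditioning on an individual's own case-control status, with no family history. It assumes a standard-normal total liability with threshold T set by the prevalence K, so that E[g | case] = h2 * φ(T) / K and E[g | control] = -h2 * φ(T) / (1 - K). The function name and the example prevalence and heritability are hypothetical, and this is not the LT-FH software, which additionally conditions on relatives' disease status.

```python
from statistics import NormalDist


def posterior_mean_liability(is_case, prevalence, h2):
    """Posterior mean genetic liability given case-control status only.

    Simplified special case (no family history), for illustration.
    """
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1.0 - prevalence)  # liability threshold T
    density = std_normal.pdf(threshold)               # φ(T)
    if is_case:
        mean_total_liability = density / prevalence
    else:
        mean_total_liability = -density / (1.0 - prevalence)
    # The genetic component is a fraction h2 of the expected total liability.
    return h2 * mean_total_liability


# Made-up parameters: 5% prevalence, liability-scale heritability 0.5.
K, H2 = 0.05, 0.5
case_value = posterior_mean_liability(True, K, H2)
control_value = posterior_mean_liability(False, K, H2)
print(round(case_value, 4), round(control_value, 4))
```

With status alone, every case shares one value and every control another; adding family history spreads individuals across a wider range of liabilities, which is the extra information the abstract credits for the gain in power.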

DOI

10.1038/s41588-020-0613-6

MaCH / minimac

Tool

PUBMED_LINK

21058334

DESCRIPTION

(MACH)

URL

http://csg.sph.umich.edu/abecasis/MaCH/index.html

TITLE

MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes.

Main citation

Li Y, Willer CJ, Ding J, Scheet P, ...&, Abecasis GR. (2010) MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol, 34 (8) 816-34. doi:10.1002/gepi.20533. PMID 21058334

ABSTRACT

Genome-wide association studies (GWAS) can identify common alleles that contribute to complex disease susceptibility. Despite the large number of SNPs assessed in each study, the effects of most common SNPs must be evaluated indirectly using either genotyped markers or haplotypes thereof as proxies. We have previously implemented a computationally efficient Markov Chain framework for genotype imputation and haplotyping in the freely available MaCH software package. The approach describes sampled chromosomes as mosaics of each other and uses available genotype and shotgun sequence data to estimate unobserved genotypes and haplotypes, together with useful measures of the quality of these estimates. Our approach is already widely used to facilitate comparison of results across studies as well as meta-analyses of GWAS. Here, we use simulations and experimental genotypes to evaluate its accuracy and utility, considering choices of genotyping panels, reference panel configurations, and designs where genotyping is replaced with shotgun sequencing. Importantly, we show that genotype imputation not only facilitates cross study analyses but also increases power of genetic association studies. We show that genotype imputation of common variants using HapMap haplotypes as a reference is very accurate using either genome-wide SNP data or smaller amounts of data typical in fine-mapping studies. Furthermore, we show the approach is applicable in a variety of populations. Finally, we illustrate how association analyses of unobserved variants will benefit from ongoing advances such as larger HapMap reference panels and whole genome shotgun sequencing technologies.

DOI

10.1002/gepi.20533

MaCH / minimac pre-phasing

Tool

PUBMED_LINK

22820512

DESCRIPTION

(pre-phasing, minimac)

URL

https://genome.sph.umich.edu/wiki/Minimac

TITLE

Fast and accurate genotype imputation in genome-wide association studies through pre-phasing.

Main citation

Howie B, Fuchsberger C, Stephens M, Marchini J, ...&, Abecasis GR. (2012) Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet, 44 (8) 955-9. doi:10.1038/ng.2354. PMID 22820512

ABSTRACT

The 1000 Genomes Project and disease-specific sequencing efforts are producing large collections of haplotypes that can be used as reference panels for genotype imputation in genome-wide association studies (GWAS). However, imputing from large reference panels with existing methods imposes a high computational burden. We introduce a strategy called 'pre-phasing' that maintains the accuracy of leading methods while reducing computational costs. We first statistically estimate the haplotypes for each individual within the GWAS sample (pre-phasing) and then impute missing genotypes into these estimated haplotypes. This reduces the computational cost because (i) the GWAS samples must be phased only once, whereas standard methods would implicitly repeat phasing with each reference panel update, and (ii) it is much faster to match a phased GWAS haplotype to one reference haplotype than to match two unphased GWAS genotypes to a pair of reference haplotypes. We implemented our approach in the MaCH and IMPUTE2 frameworks, and we tested it on data sets from the Wellcome Trust Case Control Consortium 2 (WTCCC2), the Genetic Association Information Network (GAIN), the Women's Health Initiative (WHI) and the 1000 Genomes Project. This strategy will be particularly valuable for repeated imputation as reference panels evolve.

DOI

10.1038/ng.2354

MaCH / minimac2

Tool

PUBMED_LINK

25338720

DESCRIPTION

(minimac2)

URL

https://genome.sph.umich.edu/wiki/Minimac2

TITLE

minimac2: faster genotype imputation.

Main citation

Fuchsberger C, Abecasis GR, Hinds DA. (2015) minimac2: faster genotype imputation. Bioinformatics, 31 (5) 782-4. doi:10.1093/bioinformatics/btu704. PMID 25338720

ABSTRACT

Genotype imputation is a key step in the analysis of genome-wide association studies. Upcoming very large reference panels, such as those from The 1000 Genomes Project and the Haplotype Consortium, will improve imputation quality of rare and less common variants, but will also increase the computational burden. Here, we demonstrate how the application of software engineering techniques can help to keep imputation broadly accessible. Overall, these improvements speed up imputation by an order of magnitude compared with our previous implementation. AVAILABILITY AND IMPLEMENTATION: minimac2, including source code, documentation, and examples is available at http://genome.sph.umich.edu/wiki/Minimac2

DOI

10.1093/bioinformatics/btu704

MaCH / minimac3

Tool

PUBMED_LINK

27571263

DESCRIPTION

(minimac3)

URL

https://genome.sph.umich.edu/wiki/Minimac3

TITLE

Next-generation genotype imputation service and methods.

Main citation

Das S, Forer L, Schönherr S, Sidore C, ...&, Fuchsberger C. (2016) Next-generation genotype imputation service and methods. Nat Genet, 48 (10) 1284-1287. doi:10.1038/ng.3656. PMID 27571263

ABSTRACT

Genotype imputation is a key component of genetic association studies, where it increases power, facilitates meta-analysis, and aids interpretation of signals. Genotype imputation is computationally demanding and, with current tools, typically requires access to a high-performance computing cluster and to a reference panel of sequenced genomes. Here we describe improvements to imputation machinery that reduce computational requirements by more than an order of magnitude with no loss of accuracy in comparison to standard imputation tools. We also describe a new web-based service for imputation that facilitates access to new reference panels and greatly improves user experience and productivity.

DOI

10.1038/ng.3656

MaCH / minimac4

Tool

DESCRIPTION

(minimac4)

URL

https://genome.sph.umich.edu/wiki/Minimac4

MAGMA

Tool

PUBMED_LINK

25885710

FULL NAME

Multi-marker Analysis of GenoMic Annotation

DESCRIPTION

MAGMA is a tool for gene analysis and generalized gene-set analysis of GWAS data. It can be used to analyse both raw genotype data as well as summary SNP p-values from a previous GWAS or meta-analysis.

URL

https://ctg.cncr.nl/software/magma

TITLE

MAGMA: generalized gene-set analysis of GWAS data.

Main citation

de Leeuw CA, Mooij JM, Heskes T, Posthuma D. (2015) MAGMA: generalized gene-set analysis of GWAS data. PLoS Comput Biol, 11 (4) e1004219. doi:10.1371/journal.pcbi.1004219. PMID 25885710

ABSTRACT

By aggregating data for complex traits in a biologically meaningful way, gene and gene-set analysis constitute a valuable addition to single-marker analysis. However, although various methods for gene and gene-set analysis currently exist, they generally suffer from a number of issues. Statistical power for most methods is strongly affected by linkage disequilibrium between markers, multi-marker associations are often hard to detect, and the reliance on permutation to compute p-values tends to make the analysis computationally very expensive. To address these issues we have developed MAGMA, a novel tool for gene and gene-set analysis. The gene analysis is based on a multiple regression model, to provide better statistical performance. The gene-set analysis is built as a separate layer around the gene analysis for additional flexibility. This gene-set analysis also uses a regression structure to allow generalization to analysis of continuous properties of genes and simultaneous analysis of multiple gene sets and other gene properties. Simulations and an analysis of Crohn's Disease data are used to evaluate the performance of MAGMA and to compare it to a number of other gene and gene-set analysis tools. The results show that MAGMA has significantly more power than other tools for both the gene and the gene-set analysis, identifying more genes and gene sets associated with Crohn's Disease while maintaining a correct type 1 error rate. Moreover, the MAGMA analysis of the Crohn's Disease data was found to be considerably faster as well.

DOI

10.1371/journal.pcbi.1004219

MANOVA

Tool

FULL NAME

multivariate analysis of variance

MANTRA

Tool

PUBMED_LINK

22125221

FULL NAME

Meta-ANalysis of Transethnic Association studies

KEYWORDS

cross-population

TITLE

Transethnic meta-analysis of genomewide association studies.

Main citation

Morris AP. (2011) Transethnic meta-analysis of genomewide association studies. Genet Epidemiol, 35 (8) 809-22. doi:10.1002/gepi.20630. PMID 22125221

ABSTRACT

The detection of loci contributing effects to complex human traits, and their subsequent fine-mapping for the location of causal variants, remains a considerable challenge for the genetics research community. Meta-analyses of genomewide association studies, primarily ascertained from European-descent populations, have made considerable advances in our understanding of complex trait genetics, although much of their heritability is still unexplained. With the increasing availability of genomewide association data from diverse populations, transethnic meta-analysis may offer an exciting opportunity to increase the power to detect novel complex trait loci and to improve the resolution of fine-mapping of causal variants by leveraging differences in local linkage disequilibrium structure between ethnic groups. However, we might also expect there to be substantial genetic heterogeneity between diverse populations, both in terms of the spectrum of causal variants and their allelic effects, which cannot easily be accommodated through traditional approaches to meta-analysis. In order to address this challenge, I propose novel transethnic meta-analysis methodology that takes account of the expected similarity in allelic effects between the most closely related populations, while allowing for heterogeneity between more diverse ethnic groups. This approach yields substantial improvements in performance, compared to fixed-effects meta-analysis, both in terms of power to detect association, and localization of the causal variant, over a range of models of heterogeneity between ethnic groups. Furthermore, when the similarity in allelic effects between populations is well captured by their relatedness, this approach has increased power and mapping resolution over random-effects meta-analysis.

DOI

10.1002/gepi.20630

MatrixEQTL (Matrix eQTL)

Tool

PUBMED_LINK

22492648

FULL NAME

Matrix eQTL

DESCRIPTION

Matrix eQTL is designed for fast eQTL analysis on large datasets. Matrix eQTL can test for association between genotype and gene expression using linear regression with either additive or ANOVA genotype effects. The models can include covariates to account for factors such as population stratification, gender, and clinical variables. It also supports models with heteroscedastic and/or correlated errors, false discovery rate estimation and separate treatment of local (cis) and distant (trans) eQTLs.

URL

http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/

TITLE

Matrix eQTL: ultra fast eQTL analysis via large matrix operations.

Main citation

Shabalin AA. (2012) Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics, 28 (10) 1353-8. doi:10.1093/bioinformatics/bts163. PMID 22492648

ABSTRACT

MOTIVATION: Expression quantitative trait loci (eQTL) analysis links variations in gene expression levels to genotypes. For modern datasets, eQTL analysis is a computationally intensive task as it involves testing for association of billions of transcript-SNP (single-nucleotide polymorphism) pair. The heavy computational burden makes eQTL analysis less popular and sometimes forces analysts to restrict their attention to just a small subset of transcript-SNP pairs. As more transcripts and SNPs get interrogated over a growing number of samples, the demand for faster tools for eQTL analysis grows stronger. RESULTS: We have developed a new software for computationally efficient eQTL analysis called Matrix eQTL. In tests on large datasets, it was 2-3 orders of magnitude faster than existing popular tools for QTL/eQTL analysis, while finding the same eQTLs. The fast performance is achieved by special preprocessing and expressing the most computationally intensive part of the algorithm in terms of large matrix operations. Matrix eQTL supports additive linear and ANOVA models with covariates, including models with correlated and heteroskedastic errors. The issue of multiple testing is addressed by calculating false discovery rate; this can be done separately for cis- and trans-eQTLs.
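The phrase "expressing the most computationally intensive part of the algorithm in terms of large matrix operations" can be illustrated for the simplest case, an additive linear model without covariates. After each expression row and genotype row is centered and scaled to unit length, the correlations for all transcript-SNP pairs are the entries of a single matrix product, and each correlation r converts to a t-statistic via t = r * sqrt((n - 2) / (1 - r^2)). The sketch below is pure Python with made-up data and hypothetical function names, not the Matrix eQTL R package interface; a real implementation would hand the product to an optimized linear algebra library.

```python
from math import sqrt


def standardize(row):
    """Center a row and scale it to unit length."""
    n = len(row)
    mean = sum(row) / n
    centered = [v - mean for v in row]
    norm = sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]


def all_pairs_correlation(expression, genotypes):
    """Correlation of every gene (rows) with every SNP (rows).

    Equivalent to the matrix product of the standardized expression matrix
    with the transpose of the standardized genotype matrix.
    """
    expr_std = [standardize(r) for r in expression]
    geno_std = [standardize(r) for r in genotypes]
    return [
        [sum(a * b for a, b in zip(e, g)) for g in geno_std]
        for e in expr_std
    ]


def t_statistic(r, n_samples):
    """t-statistic for a simple linear regression with correlation r."""
    return r * sqrt((n_samples - 2) / (1.0 - r * r))


# Made-up data: 2 genes x 5 samples and 2 SNPs x 5 samples.
expression = [[1.0, 2.0, 3.0, 4.0, 5.0], [5.0, 3.0, 4.0, 1.0, 2.0]]
genotypes = [[0, 0, 1, 1, 2], [2, 1, 1, 0, 0]]

corr = all_pairs_correlation(expression, genotypes)
t_stats = [[t_statistic(r, 5) for r in row] for row in corr]
for row in t_stats:
    print([round(t, 3) for t in row])
```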

DOI

10.1093/bioinformatics/bts163

MegaPRS

Tool

PUBMED_LINK

34234142

DESCRIPTION

individual level: big_spLinReg, LDAK-Ridge-Predict, LDAK-Bolt-Predict and LDAK-BayesR-Predict
sumstats: LDAK-Lasso-SS, LDAK-Ridge-SS, LDAK-Bolt-SS and LDAK-BayesR-SS

URL

https://github.com/koido/MENTR

TITLE

Improved genetic prediction of complex traits from individual-level data or summary statistics.

Main citation

Zhang Q, Privé F, Vilhjálmsson B, Speed D. (2021) Improved genetic prediction of complex traits from individual-level data or summary statistics. Nat Commun, 12 (1) 4192. doi:10.1038/s41467-021-24485-y. PMID 34234142

ABSTRACT

Most existing tools for constructing genetic prediction models begin with the assumption that all genetic variants contribute equally towards the phenotype. However, this represents a suboptimal model for how heritability is distributed across the genome. Therefore, we develop prediction tools that allow the user to specify the heritability model. We compare individual-level data prediction tools using 14 UK Biobank phenotypes; our new tool LDAK-Bolt-Predict outperforms the existing tools Lasso, BLUP, Bolt-LMM and BayesR for all 14 phenotypes. We compare summary statistic prediction tools using 225 UK Biobank phenotypes; our new tool LDAK-BayesR-SS outperforms the existing tools lassosum, sBLUP, LDpred and SBayesR for 223 of the 225 phenotypes. When we improve the heritability model, the proportion of phenotypic variance explained increases by on average 14%, which is equivalent to increasing the sample size by a quarter.

DOI

10.1038/s41467-021-24485-y

MENTR

Tool

PUBMED_LINK

36411359

FULL NAME

mutation effect prediction on ncRNA transcription

DESCRIPTION

A machine-learning model (MENTR) that reliably links genome sequence and ncRNA expression at the cell type level

URL

https://github.com/koido/MENTR

TITLE

Prediction of the cell-type-specific transcription of non-coding RNAs from genome sequences via machine learning.

Main citation

Koido M, Hon CC, Koyama S, Kawaji H, ...&, Terao C. (2023) Prediction of the cell-type-specific transcription of non-coding RNAs from genome sequences via machine learning. Nat Biomed Eng, 7 (6) 830-844. doi:10.1038/s41551-022-00961-8. PMID 36411359

ABSTRACT

Gene transcription is regulated through complex mechanisms involving non-coding RNAs (ncRNAs). As the transcription of ncRNAs, especially of enhancer RNAs, is often low and cell type specific, how the levels of RNA transcription depend on genotype remains largely unexplored. Here we report the development and utility of a machine-learning model (MENTR) that reliably links genome sequence and ncRNA expression at the cell type level. Effects on ncRNA transcription predicted by the model were concordant with estimates from published studies in a cell-type-dependent manner, regardless of allele frequency and genetic linkage. Among 41,223 variants from genome-wide association studies, the model identified 7,775 enhancer RNAs and 3,548 long ncRNAs causally associated with complex traits across 348 major human primary cells and tissues, such as rare variants plausibly altering the transcription of enhancer RNAs to influence the risks of Crohn's disease and asthma. The model may aid the discovery of causal variants and the generation of testable hypotheses for biological mechanisms driving complex traits.

DOI

10.1038/s41551-022-00961-8

MESuSiE

Tool

PUBMED_LINK

38168930

FULL NAME

multi-ancestry sum of the single effects model

DESCRIPTION

MESuSiE relies on GWAS summary statistics from multiple ancestries, properly accounts for the LD structure of the local genomic region in multiple ancestries, and explicitly models both shared and ancestry-specific causal signals to accommodate causal effect size similarity as well as heterogeneity across ancestries. MESuSiE outputs the posterior inclusion probability of each variant being a shared or an ancestry-specific causal variant.

URL

https://github.com/borangao/MESuSiE

KEYWORDS

multi-trait, fine-mapping

TITLE

MESuSiE enables scalable and powerful multi-ancestry fine-mapping of causal variants in genome-wide association studies.

Main citation

Gao B, Zhou X. (2024) MESuSiE enables scalable and powerful multi-ancestry fine-mapping of causal variants in genome-wide association studies. Nat Genet, 56 (1) 170-179. doi:10.1038/s41588-023-01604-7. PMID 38168930

ABSTRACT

Fine-mapping in genome-wide association studies attempts to identify causal SNPs from a set of candidate SNPs in a local genomic region of interest and is commonly performed in one genetic ancestry at a time. Here, we present multi-ancestry sum of the single effects model (MESuSiE), a probabilistic multi-ancestry fine-mapping method, to improve the accuracy and resolution of fine-mapping by leveraging association information across ancestries. MESuSiE uses summary statistics as input, accounts for the diverse linkage disequilibrium pattern observed in different ancestries, explicitly models both shared and ancestry-specific causal SNPs, and relies on a variational inference algorithm for scalable computation. We evaluated the performance of MESuSiE through comprehensive simulations and multi-ancestry fine-mapping of four lipid traits with both European and African samples. In the real data, MESuSiE improves fine-mapping resolution by 19.0% to 72.0% compared to existing approaches, is an order of magnitude faster, and captures and categorizes shared and ancestry-specific causal signals with enhanced functional enrichment.

DOI

10.1038/s41588-023-01604-7

meta-PRS

Tool

PUBMED_LINK

33964208

FULL NAME

linear combination of PRSs

URL

https://github.com/ClaraAlbi/paper_MetaPRS

TITLE

Leveraging both individual-level genetic data and GWAS summary statistics increases polygenic prediction.

Main citation

Albiñana C, Grove J, McGrath JJ, Agerbo E, ...&, Vilhjálmsson BJ. (2021) Leveraging both individual-level genetic data and GWAS summary statistics increases polygenic prediction. Am J Hum Genet, 108 (6) 1001-1011. doi:10.1016/j.ajhg.2021.04.014. PMID 33964208

ABSTRACT

The accuracy of polygenic risk scores (PRSs) to predict complex diseases increases with the training sample size. PRSs are generally derived based on summary statistics from large meta-analyses of multiple genome-wide association studies (GWASs). However, it is now common for researchers to have access to large individual-level data as well, such as the UK Biobank data. To the best of our knowledge, it has not yet been explored how best to combine both types of data (summary statistics and individual-level data) to optimize polygenic prediction. The most widely used approach to combine data is the meta-analysis of GWAS summary statistics (meta-GWAS), but we show that it does not always provide the most accurate PRS. Through simulations and using 12 real case-control and quantitative traits from both iPSYCH and UK Biobank along with external GWAS summary statistics, we compare meta-GWAS with two alternative data-combining approaches, stacked clumping and thresholding (SCT) and meta-PRS. We find that, when large individual-level data are available, the linear combination of PRSs (meta-PRS) is both a simple alternative to meta-GWAS and often more accurate.

DOI

10.1016/j.ajhg.2021.04.014

Meta-SAIGE

Tool

PUBMED_LINK

41266648

DESCRIPTION

Meta-SAIGE performs scalable cohort-level rare-variant meta-analysis from study-level outputs, emphasizing accurate null calibration (including low-prevalence binary traits), computational efficiency via reuse of LD structure across phenotypes, and power close to pooled individual-level analysis with SAIGE-GENE+.

URL

https://meta-saige.leelabsg.org/, https://github.com/weizhouUMICH/SAIGE

KEYWORDS

rare variant, meta-analysis, SAIGE, summary statistics, type I error

TITLE

Scalable and accurate rare variant meta-analysis with Meta-SAIGE.

Main citation

Park E, Nam K, Jeong S, Keat K, ...&, Lee S. (2025) Scalable and accurate rare variant meta-analysis with Meta-SAIGE. Nat Genet, 57 (12) 3185-3192. doi:10.1038/s41588-025-02403-y. PMID 41266648

ABSTRACT

Meta-analysis enhances the power of rare variant association tests by combining summary statistics across several cohorts. However, existing methods often fail to control type I error for low-prevalence binary traits and are computationally intensive. Here we introduce Meta-SAIGE, a scalable method for rare variant meta-analysis that accurately estimates the null distribution to control type I error and reuses the linkage disequilibrium matrix across phenotypes to boost computational efficiency in phenome-wide analyses. Simulations using UK Biobank whole-exome sequencing data show that Meta-SAIGE effectively controls type I error and achieves power comparable to pooled individual-level analysis with SAIGE-GENE+. Applying Meta-SAIGE to 83 low-prevalence phenotypes in UK Biobank and All of Us whole-exome sequencing data identified 237 gene-trait associations. Notably, 80 of these associations were not significant in either dataset alone, underscoring the power of our meta-analysis.

DOI

10.1038/s41588-025-02403-y

metabolites PRS atlas

Tool

PUBMED_LINK

36219204

DESCRIPTION

This web application can be used to query findings from a systematic analysis of 129 polygenic risk scores and 249 circulating metabolites using high-throughput nuclear magnetic resonance data from the UK Biobank study. We encourage users of this resource to conduct follow-up analyses of associations to investigate potential causal and non-causal metabolic biomarkers. Age-stratified results can be used to investigate how potential sources of collider bias (e.g. statin therapy) may influence findings in the full sample.

URL

http://mrcieu.mrsoftware.org/metabolites_PRS_atlas/

TITLE

Constructing an atlas of associations between polygenic scores from across the human phenome and circulating metabolic biomarkers.

Main citation

Fang S, Holmes MV, Gaunt TR, Davey Smith G, ...&, Richardson TG. (2022) Constructing an atlas of associations between polygenic scores from across the human phenome and circulating metabolic biomarkers. eLife, 11. doi:10.7554/eLife.73951. PMID 36219204

ABSTRACT

BACKGROUND: Polygenic scores (PGS) are becoming an increasingly popular approach to predict complex disease risk, although they also hold the potential to develop insight into the molecular profiles of patients with an elevated genetic predisposition to disease. METHODS: We sought to construct an atlas of associations between 125 different PGS derived using results from genome-wide association studies and 249 circulating metabolites in up to 83,004 participants from the UK Biobank. RESULTS: As an exemplar to demonstrate the value of this atlas, we conducted a hypothesis-free evaluation of all associations with glycoprotein acetyls (GlycA), an inflammatory biomarker. Using bidirectional Mendelian randomization, we find that the associations highlighted likely reflect the effect of risk factors, such as adiposity or liability towards smoking, on systemic inflammation as opposed to the converse direction. Moreover, we repeated all analyses in our atlas within age strata to investigate potential sources of collider bias, such as medication usage. This was exemplified by comparing associations between lipoprotein lipid profiles and the coronary artery disease PGS in the youngest and oldest age strata, which had differing proportions of individuals undergoing statin therapy. Lastly, we generated all PGS-metabolite associations stratified by sex and separately after excluding 13 established lipid-associated loci to further evaluate the robustness of findings. CONCLUSIONS: We envisage that the atlas of results constructed in our study will motivate future hypothesis generation and help prioritize and deprioritize circulating metabolic traits for in-depth investigations. All results can be visualized and downloaded at http://mrcieu.mrsoftware.org/metabolites_PRS_atlas. FUNDING: This work is supported by funding from the Wellcome Trust, the British Heart Foundation, and the Medical Research Council Integrative Epidemiology Unit.

DOI

10.7554/eLife.73951

metaCCA

Tool

PUBMED_LINK

27153689

FULL NAME

meta canonical correlation analysis

DESCRIPTION

metaCCA performs multivariate analysis of a single or multiple GWAS based on univariate regression coefficients. It allows multivariate representation of both phenotype and genotype. metaCCA extends the statistical technique of canonical correlation analysis to the setting where original individual-level records are not available, and employs a covariance shrinkage algorithm to achieve robustness.

URL

https://github.com/aalto-ics-kepaco/metaCCA-matlab

TITLE

metaCCA: summary statistics-based multivariate meta-analysis of genome-wide association studies using canonical correlation analysis.

Main citation

Cichonska A, Rousu J, Marttinen P, Kangas AJ, ...&, Pirinen M. (2016) metaCCA: summary statistics-based multivariate meta-analysis of genome-wide association studies using canonical correlation analysis. Bioinformatics, 32 (13) 1981-9. doi:10.1093/bioinformatics/btw052. PMID 27153689

ABSTRACT

MOTIVATION: A dominant approach to genetic association studies is to perform univariate tests between genotype-phenotype pairs. However, analyzing related traits together increases statistical power, and certain complex associations become detectable only when several variants are tested jointly. Currently, modest sample sizes of individual cohorts, and restricted availability of individual-level genotype-phenotype data across the cohorts limit conducting multivariate tests. RESULTS: We introduce metaCCA, a computational framework for summary statistics-based analysis of a single or multiple studies that allows multivariate representation of both genotype and phenotype. It extends the statistical technique of canonical correlation analysis to the setting where original individual-level records are not available, and employs a covariance shrinkage algorithm to achieve robustness. Multivariate meta-analysis of two Finnish studies of nuclear magnetic resonance metabolomics by metaCCA, using standard univariate output from the program SNPTEST, shows an excellent agreement with the pooled individual-level analysis of original data. Motivated by strong multivariate signals in the lipid genes tested, we envision that multivariate association testing using metaCCA has a great potential to provide novel insights from already published summary statistics from high-throughput phenotyping technologies. AVAILABILITY AND IMPLEMENTATION: Code is available at https://github.com/aalto-ics-kepaco CONTACTS: anna.cichonska@helsinki.fi or matti.pirinen@helsinki.fi SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

DOI

10.1093/bioinformatics/btw052

metafor

Tool

URL

https://wviechtb.github.io/metafor/

METAL

Tool

PUBMED_LINK

20616382

DESCRIPTION

METAL is a tool for meta-analysis of genomewide association scans. METAL can combine either (a) test statistics and standard errors or (b) p-values across studies (taking sample size and direction of effect into account). METAL analysis is a convenient alternative to a direct analysis of merged data from multiple studies. It is especially appropriate when data from the individual studies cannot be analyzed together because of differences in ethnicity, phenotype distribution or gender, or because of constraints on sharing individual-level data. Meta-analysis results in little or no loss of efficiency compared to analysis of a combined dataset including data from all individual studies.
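The two combination schemes mentioned above can be sketched for a single variant as follows. This is an illustration of the standard formulas (inverse-variance weighting of effect estimates, and sample-size-weighted Z-scores recovered from p-values and effect directions), not METAL's own code; the function names and input values are hypothetical.

```python
import math
from statistics import NormalDist

_STANDARD_NORMAL = NormalDist()


def two_sided_p(z):
    """Two-sided p-value for a standard normal Z-score."""
    return 2.0 * (1.0 - _STANDARD_NORMAL.cdf(abs(z)))


def inverse_variance_meta(betas, standard_errors):
    """Scheme (a): weight each study's effect by 1 / SE^2."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    return beta, se, z, two_sided_p(z)


def sample_size_meta(p_values, directions, sample_sizes):
    """Scheme (b): combine signed Z-scores weighted by sqrt(N).

    directions holds +1 or -1 for the effect direction in each study.
    """
    z_scores = [
        direction * _STANDARD_NORMAL.inv_cdf(1.0 - p / 2.0)
        for p, direction in zip(p_values, directions)
    ]
    weights = [math.sqrt(n) for n in sample_sizes]
    z = sum(w * zi for w, zi in zip(weights, z_scores)) / math.sqrt(
        sum(w * w for w in weights)
    )
    return z, two_sided_p(z)


# Hypothetical summary statistics for one variant in two studies.
beta, se, z_ivw, p_ivw = inverse_variance_meta([0.10, 0.14], [0.05, 0.04])
z_n, p_n = sample_size_meta([0.04, 0.001], [+1, +1], [2000, 5000])
```

Scheme (a) needs effect sizes on a common scale across studies, while scheme (b) only needs p-values, directions and sample sizes, which makes it usable when studies report effects in different units.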

URL

https://genome.sph.umich.edu/wiki/METAL_Documentation

TITLE

METAL: fast and efficient meta-analysis of genomewide association scans.

Main citation

Willer CJ, Li Y, Abecasis GR. (2010) METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics, 26 (17) 2190-1. doi:10.1093/bioinformatics/btq340. PMID 20616382

ABSTRACT

SUMMARY: METAL provides a computationally efficient tool for meta-analysis of genome-wide association scans, which is a commonly used approach for improving power in complex trait gene mapping studies. METAL provides a rich scripting interface and implements efficient memory management to allow analyses of very large data sets and to support a variety of input file formats. AVAILABILITY AND IMPLEMENTATION: METAL, including source code, documentation, examples, and executables, is available at http://www.sph.umich.edu/csg/abecasis/metal/.

DOI

10.1093/bioinformatics/btq340

MetaSKAT

Tool

PUBMED_LINK

23768515

DESCRIPTION

MetaSKAT is an R package for multiple marker meta-analysis. It can carry out meta-analysis of SKAT, SKAT-O and burden tests with individual-level genotype data or gene-level summary statistics.

URL

https://www.hsph.harvard.edu/skat/metaskat/

TITLE

General framework for meta-analysis of rare variants in sequencing association studies.

Main citation

Lee S, Teslovich TM, Boehnke M, Lin X. (2013) General framework for meta-analysis of rare variants in sequencing association studies. Am J Hum Genet, 93 (1) 42-53. doi:10.1016/j.ajhg.2013.05.010. PMID 23768515

ABSTRACT

We propose a general statistical framework for meta-analysis of gene- or region-based multimarker rare variant association tests in sequencing association studies. In genome-wide association studies, single-marker meta-analysis has been widely used to increase statistical power by combining results via regression coefficients and standard errors from different studies. In analysis of rare variants in sequencing studies, region-based multimarker tests are often used to increase power. We propose meta-analysis methods for commonly used gene- or region-based rare variants tests, such as burden tests and variance component tests. Because estimation of regression coefficients of individual rare variants is often unstable or not feasible, the proposed method avoids this difficulty by calculating score statistics instead that only require fitting the null model for each study and then aggregating these score statistics across studies. Our proposed meta-analysis rare variant association tests are conducted based on study-specific summary statistics, specifically score statistics for each variant and between-variant covariance-type (linkage disequilibrium) relationship statistics for each gene or region. The proposed methods are able to incorporate different levels of heterogeneity of genetic effects across studies and are applicable to meta-analysis of multiple ancestry groups. We show that the proposed methods are essentially as powerful as joint analysis by directly pooling individual level genotype data. We conduct extensive simulations to evaluate the performance of our methods by varying levels of heterogeneity across studies, and we apply the proposed methods to meta-analysis of rare variant effects in a multicohort study of the genetics of blood lipid levels.

DOI

10.1016/j.ajhg.2013.05.010

MetaSTAAR

Tool

PUBMED_LINK

36564505

DESCRIPTION

MetaSTAAR is an R package for performing the Meta-analysis of variant-Set Test for Association using Annotation infoRmation (MetaSTAAR) procedure in whole-genome sequencing (WGS) studies. MetaSTAAR enables functionally-informed rare variant meta-analysis of large WGS studies using an efficient, sparse matrix approach for storing summary statistics, while protecting data privacy of study participants and avoiding sharing subject-level data. MetaSTAAR accounts for relatedness and population structure of continuous and dichotomous traits, and boosts the power of rare variant meta-analysis by incorporating multiple variant functional annotations.

URL

https://github.com/xihaoli/MetaSTAAR

TITLE

Powerful, scalable and resource-efficient meta-analysis of rare variant associations in large whole genome sequencing studies.

Main citation

Li X, Quick C, Zhou H, Gaynor SM, ...&, Lin X. (2023) Powerful, scalable and resource-efficient meta-analysis of rare variant associations in large whole genome sequencing studies. Nat Genet, 55 (1) 154-164. doi:10.1038/s41588-022-01225-6. PMID 36564505

ABSTRACT

Meta-analysis of whole genome sequencing/whole exome sequencing (WGS/WES) studies provides an attractive solution to the problem of collecting large sample sizes for discovering rare variants associated with complex phenotypes. Existing rare variant meta-analysis approaches are not scalable to biobank-scale WGS data. Here we present MetaSTAAR, a powerful and resource-efficient rare variant meta-analysis framework for large-scale WGS/WES studies. MetaSTAAR accounts for relatedness and population structure, can analyze both quantitative and dichotomous traits and boosts the power of rare variant tests by incorporating multiple variant functional annotations. Through meta-analysis of four lipid traits in 30,138 ancestrally diverse samples from 14 studies of the Trans Omics for Precision Medicine (TOPMed) Program, we show that MetaSTAAR performs rare variant meta-analysis at scale and produces results comparable to using pooled data. Additionally, we identified several conditionally significant rare variant associations with lipid traits. We further demonstrate that MetaSTAAR is scalable to biobank-scale cohorts through meta-analysis of TOPMed WGS data and UK Biobank WES data of ~200,000 samples.

DOI

10.1038/s41588-022-01225-6

metaUSAT/metaMANOVA

Tool

PUBMED_LINK

29226385

FULL NAME

unified score-based association test

DESCRIPTION

metaUSAT is a data-adaptive statistical approach for testing genetic associations of multiple traits from single/multiple studies using univariate GWAS summary statistics. This multivariate meta-analysis method can appropriately account for overlapping samples (if any) and can potentially test binary and/or continuous traits.

URL

https://github.com/RayDebashree/metaUSAT

TITLE

Methods for meta-analysis of multiple traits using GWAS summary statistics.

Main citation

Ray D, Boehnke M. (2018) Methods for meta-analysis of multiple traits using GWAS summary statistics. Genet Epidemiol, 42 (2) 134-145. doi:10.1002/gepi.22105. PMID 29226385

ABSTRACT

Genome-wide association studies (GWAS) for complex diseases have focused primarily on single-trait analyses for disease status and disease-related quantitative traits. For example, GWAS on risk factors for coronary artery disease analyze genetic associations of plasma lipids such as total cholesterol, LDL-cholesterol, HDL-cholesterol, and triglycerides (TGs) separately. However, traits are often correlated and a joint analysis may yield increased statistical power for association over multiple univariate analyses. Recently several multivariate methods have been proposed that require individual-level data. Here, we develop metaUSAT (where USAT is unified score-based association test), a novel unified association test of a single genetic variant with multiple traits that uses only summary statistics from existing GWAS. Although the existing methods either perform well when most correlated traits are affected by the genetic variant in the same direction or are powerful when only a few of the correlated traits are associated, metaUSAT is designed to be robust to the association structure of correlated traits. metaUSAT does not require individual-level data and can test genetic associations of categorical and/or continuous traits. One can also use metaUSAT to analyze a single trait over multiple studies, appropriately accounting for overlapping samples, if any. metaUSAT provides an approximate asymptotic P-value for association and is computationally efficient for implementation at a genome-wide level. Simulation experiments show that metaUSAT maintains proper type-I error at low error levels. It has similar and sometimes greater power to detect association across a wide array of scenarios compared to existing methods, which are usually powerful for some specific association scenarios only. When applied to plasma lipids summary data from the METSIM and the T2D-GENES studies, metaUSAT detected genome-wide significant loci beyond the ones identified by univariate analyses. Evidence from larger studies suggests that the variants additionally detected by our test are, indeed, associated with lipid levels in humans. In summary, metaUSAT can provide novel insights into the genetic architecture of a common disease or traits.

DOI

10.1002/gepi.22105

MetaXcan

Tool

PUBMED_LINK

29739930

DESCRIPTION

MetaXcan is a set of tools to integrate genomic information of biological mechanisms with complex traits.

URL

https://github.com/hakyimlab/MetaXcan

TITLE

Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics.

Main citation

Barbeira AN, Dickinson SP, Bonazzola R, Zheng J, ...&, Im HK. (2018) Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nat Commun, 9 (1) 1825. doi:10.1038/s41467-018-03621-1. PMID 29739930

ABSTRACT

Scalable, integrative methods to understand mechanisms that link genetic variants with phenotypes are needed. Here we derive a mathematical expression to compute PrediXcan (a gene mapping approach) results using summary data (S-PrediXcan) and show its accuracy and general robustness to misspecified reference sets. We apply this framework to 44 GTEx tissues and 100+ phenotypes from GWAS and meta-analysis studies, creating a growing public catalog of associations that seeks to capture the effects of gene expression variation on human phenotypes. Replication in an independent cohort is shown. Most of the associations are tissue specific, suggesting context specificity of the trait etiology. Colocalized significant associations in unexpected tissues underscore the need for an agnostic scanning of multiple contexts to improve our ability to detect causal regulatory mechanisms. Monogenic disease genes are enriched among significant associations for related traits, suggesting that smaller alterations of these genes may cause a spectrum of milder phenotypes.

DOI

10.1038/s41467-018-03621-1

Michigan Imputation Server (Michigan)

Tool

PUBMED_LINK

27571263

URL

https://imputationserver.sph.umich.edu/index.html#!

TITLE

Next-generation genotype imputation service and methods.

Main citation

Das S, Forer L, Schönherr S, Sidore C, ...&, Fuchsberger C. (2016) Next-generation genotype imputation service and methods. Nat Genet, 48 (10) 1284-1287. doi:10.1038/ng.3656. PMID 27571263

ABSTRACT

Genotype imputation is a key component of genetic association studies, where it increases power, facilitates meta-analysis, and aids interpretation of signals. Genotype imputation is computationally demanding and, with current tools, typically requires access to a high-performance computing cluster and to a reference panel of sequenced genomes. Here we describe improvements to imputation machinery that reduce computational requirements by more than an order of magnitude with no loss of accuracy in comparison to standard imputation tools. We also describe a new web-based service for imputation that facilitates access to new reference panels and greatly improves user experience and productivity.

DOI

10.1038/ng.3656

MiXeR

Tool

PUBMED_LINK

32427991

FULL NAME

MiXeR (univariate)

DESCRIPTION

Causal Mixture Model for GWAS summary statistics

URL

https://github.com/precimed/mixer

TITLE

Beyond SNP heritability: Polygenicity and discoverability of phenotypes estimated with a univariate Gaussian mixture model.

Main citation

Holland D, Frei O, Desikan R, Fan CC, ...&, Dale AM. (2020) Beyond SNP heritability: Polygenicity and discoverability of phenotypes estimated with a univariate Gaussian mixture model. PLoS Genet, 16 (5) e1008612. doi:10.1371/journal.pgen.1008612. PMID 32427991

ABSTRACT

Estimating the polygenicity (proportion of causally associated single nucleotide polymorphisms (SNPs)) and discoverability (effect size variance) of causal SNPs for human traits is currently of considerable interest. SNP-heritability is proportional to the product of these quantities. We present a basic model, using detailed linkage disequilibrium structure from a reference panel of 11 million SNPs, to estimate these quantities from genome-wide association studies (GWAS) summary statistics. We apply the model to diverse phenotypes and validate the implementation with simulations. We find model polygenicities (as a fraction of the reference panel) ranging from ≃ 2 × 10^-5 to ≃ 4 × 10^-3, with discoverabilities similarly ranging over two orders of magnitude. A power analysis allows us to estimate the proportions of phenotypic variance explained additively by causal SNPs reaching genome-wide significance at current sample sizes, and map out sample sizes required to explain larger portions of additive SNP heritability. The model also allows for estimating residual inflation (or deflation from over-correcting of z-scores), and assessing compatibility of replication and discovery GWAS summary statistics.

DOI

10.1371/journal.pgen.1008612

mJAM

Tool

FULL NAME

multi-population JAM

URL

https://github.com/USCbiostats/hJAM/R

KEYWORDS

multi-population

Main citation

Shen, J., Jiang, L., Wang, K., Wang, A., Chen, F., Newcombe, P. J., ... & Conti, D. V. (2022). Fine-mapping and credible set construction using a multi-population joint analysis of marginal summary statistics from genome-wide association studies. bioRxiv, 2022-12.

MOSTest

Tool

PUBMED_LINK

32665545

FULL NAME

Multivariate Omnibus Statistical Test

DESCRIPTION

MOSTest is a tool for joint genetic analysis of multiple traits, using multivariate analysis to boost the power of discovering associated loci.

URL

https://github.com/precimed/mostest

TITLE

Understanding the genetic determinants of the brain with MOSTest.

Main citation

van der Meer D, Frei O, Kaufmann T, Shadrin AA, ...&, Dale AM. (2020) Understanding the genetic determinants of the brain with MOSTest. Nat Commun, 11 (1) 3512. doi:10.1038/s41467-020-17368-1. PMID 32665545

ABSTRACT

Regional brain morphology has a complex genetic architecture, consisting of many common polymorphisms with small individual effects. This has proven challenging for genome-wide association studies (GWAS). Due to the distributed nature of genetic signal across brain regions, multivariate analysis of regional measures may enhance discovery of genetic variants. Current multivariate approaches to GWAS are ill-suited for complex, large-scale data of this kind. Here, we introduce the Multivariate Omnibus Statistical Test (MOSTest), with an efficient computational design enabling rapid and reliable inference, and apply it to 171 regional brain morphology measures from 26,502 UK Biobank participants. At the conventional genome-wide significance threshold of α = 5 × 10^-8, MOSTest identifies 347 genomic loci associated with regional brain morphology, more than any previous study, improving upon the discovery of established GWAS approaches more than threefold. Our findings implicate more than 5% of all protein-coding genes and provide evidence for gene sets involved in neuron development and differentiation.

DOI

10.1038/s41467-020-17368-1

MR-MEGA

Tool

PUBMED_LINK

28911207

FULL NAME

Meta-Regression of Multi-AncEstry Genetic Association

DESCRIPTION

MR-MEGA (Meta-Regression of Multi-AncEstry Genetic Association) is a tool to detect and fine-map complex trait association signals via multi-ancestry meta-regression. This approach uses genome-wide metrics of diversity between populations to derive axes of genetic variation via multi-dimensional scaling [Purcell 2007]. Allelic effects of a variant across GWAS, weighted by their corresponding standard errors, can then be modelled in a linear regression framework, including the axes of genetic variation as covariates. The flexibility of this model enables partitioning of the heterogeneity into components due to ancestry and residual variation, which would be expected to improve fine-mapping resolution.
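The meta-regression idea described above can be illustrated with a minimal sketch. It assumes a single variant, a single axis of genetic variation, and inverse-variance weights; the function and field names are hypothetical, and this is not MR-MEGA's own implementation, which uses several axes and formal test statistics.

```python
from dataclasses import dataclass


@dataclass
class MetaRegressionFit:
    intercept: float    # allelic effect at axis value 0
    slope: float        # change in allelic effect per unit of the axis
    q_total: float      # total heterogeneity around the pooled effect
    q_ancestry: float   # heterogeneity explained by the axis
    q_residual: float   # heterogeneity left after the regression


def meta_regress_one_axis(effects, std_errors, axis):
    """Weighted least squares of per-GWAS effects on one ancestry axis.

    Assumption: weights are 1 / SE^2 (inverse-variance weighting).
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    sw = sum(weights)
    sx = sum(w * x for w, x in zip(weights, axis))
    sy = sum(w * b for w, b in zip(weights, effects))
    sxx = sum(w * x * x for w, x in zip(weights, axis))
    sxy = sum(w * x * b for w, x, b in zip(weights, axis, effects))

    det = sw * sxx - sx * sx
    if det == 0:
        raise ValueError("axis values must differ between studies")

    slope = (sw * sxy - sx * sy) / det
    intercept = (sy - slope * sx) / sw

    # Fixed-effects pooled estimate, used as the no-ancestry baseline.
    pooled = sy / sw
    q_total = sum(w * (b - pooled) ** 2 for w, b in zip(weights, effects))
    q_residual = sum(
        w * (b - intercept - slope * x) ** 2
        for w, b, x in zip(weights, effects, axis)
    )
    return MetaRegressionFit(intercept, slope, q_total,
                             q_total - q_residual, q_residual)
```

In this sketch the difference between the total and residual heterogeneity is the part attributed to ancestry, which mirrors the partition mentioned in the description.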

URL

https://genomics.ut.ee/en/tools

KEYWORDS

Multi-AncEstry

TITLE

Trans-ethnic meta-regression of genome-wide association studies accounting for ancestry increases power for discovery and improves fine-mapping resolution.

Main citation

Mägi R, Horikoshi M, Sofer T, Mahajan A, ...&, Morris AP. (2017) Trans-ethnic meta-regression of genome-wide association studies accounting for ancestry increases power for discovery and improves fine-mapping resolution. Hum Mol Genet, 26 (18) 3639-3650. doi:10.1093/hmg/ddx280. PMID 28911207

ABSTRACT

Trans-ethnic meta-analysis of genome-wide association studies (GWAS) across diverse populations can increase power to detect complex trait loci when the underlying causal variants are shared between ancestry groups. However, heterogeneity in allelic effects between GWAS at these loci can occur that is correlated with ancestry. Here, a novel approach is presented to detect SNP association and quantify the extent of heterogeneity in allelic effects that is correlated with ancestry. We employ trans-ethnic meta-regression to model allelic effects as a function of axes of genetic variation, derived from a matrix of mean pairwise allele frequency differences between GWAS, and implemented in the MR-MEGA software. Through detailed simulations, we demonstrate increased power to detect association for MR-MEGA over fixed- and random-effects meta-analysis across a range of scenarios of heterogeneity in allelic effects between ethnic groups. We also demonstrate improved fine-mapping resolution, in loci containing a single causal variant, compared to these meta-analysis approaches and PAINTOR, and equivalent performance to MANTRA at reduced computational cost. Application of MR-MEGA to trans-ethnic GWAS of kidney function in 71,461 individuals indicates stronger signals of association than fixed-effects meta-analysis when heterogeneity in allelic effects is correlated with ancestry. Application of MR-MEGA to fine-mapping four type 2 diabetes susceptibility loci in 22,086 cases and 42,539 controls highlights: (i) strong evidence for heterogeneity in allelic effects that is correlated with ancestry only at the index SNP for the association signal at the CDKAL1 locus; and (ii) 99% credible sets with six or fewer variants for five distinct association signals.

DOI

10.1093/hmg/ddx280

MR-MEGA

Tool

PUBMED_LINK

28911207

FULL NAME

Meta-Regression of Multi-AncEstry Genetic Association

DESCRIPTION

MR-MEGA (Meta-Regression of Multi-AncEstry Genetic Association) is a tool to detect and fine-map complex trait association signals via multi-ancestry meta-regression. This approach uses genome-wide metrics of diversity between populations to derive axes of genetic variation via multi-dimensional scaling [Purcell 2007]. Allelic effects of a variant across GWAS, weighted by their corresponding standard errors, can then be modelled in a linear regression framework, including the axes of genetic variation as covariates. The flexibility of this model enables partitioning of the heterogeneity into components due to ancestry and residual variation, which would be expected to improve fine-mapping resolution.

URL

https://genomics.ut.ee/en/tools

KEYWORDS

cross-population, Meta-Regression

TITLE

Trans-ethnic meta-regression of genome-wide association studies accounting for ancestry increases power for discovery and improves fine-mapping resolution.

Main citation

Mägi R, Horikoshi M, Sofer T, Mahajan A, ...&, Morris AP. (2017) Trans-ethnic meta-regression of genome-wide association studies accounting for ancestry increases power for discovery and improves fine-mapping resolution. Hum Mol Genet, 26 (18) 3639-3650. doi:10.1093/hmg/ddx280. PMID 28911207

ABSTRACT

Trans-ethnic meta-analysis of genome-wide association studies (GWAS) across diverse populations can increase power to detect complex trait loci when the underlying causal variants are shared between ancestry groups. However, heterogeneity in allelic effects between GWAS at these loci can occur that is correlated with ancestry. Here, a novel approach is presented to detect SNP association and quantify the extent of heterogeneity in allelic effects that is correlated with ancestry. We employ trans-ethnic meta-regression to model allelic effects as a function of axes of genetic variation, derived from a matrix of mean pairwise allele frequency differences between GWAS, and implemented in the MR-MEGA software. Through detailed simulations, we demonstrate increased power to detect association for MR-MEGA over fixed- and random-effects meta-analysis across a range of scenarios of heterogeneity in allelic effects between ethnic groups. We also demonstrate improved fine-mapping resolution, in loci containing a single causal variant, compared to these meta-analysis approaches and PAINTOR, and equivalent performance to MANTRA at reduced computational cost. Application of MR-MEGA to trans-ethnic GWAS of kidney function in 71,461 individuals indicates stronger signals of association than fixed-effects meta-analysis when heterogeneity in allelic effects is correlated with ancestry. Application of MR-MEGA to fine-mapping four type 2 diabetes susceptibility loci in 22,086 cases and 42,539 controls highlights: (i) strong evidence for heterogeneity in allelic effects that is correlated with ancestry only at the index SNP for the association signal at the CDKAL1 locus; and (ii) 99% credible sets with six or fewer variants for five distinct association signals.

DOI

10.1093/hmg/ddx280

mRnd

Tool

PUBMED_LINK

24159078

FULL NAME

Power calculations for Mendelian Randomization

URL

https://shiny.cnsgenomics.com/mRnd/

TITLE

Calculating statistical power in Mendelian randomization studies.

Main citation

Brion MJ, Shakhbazov K, Visscher PM. (2013) Calculating statistical power in Mendelian randomization studies. Int J Epidemiol, 42 (5) 1497-501. doi:10.1093/ije/dyt179. PMID 24159078

ABSTRACT

In Mendelian randomization (MR) studies, where genetic variants are used as proxy measures for an exposure trait of interest, obtaining adequate statistical power is frequently a concern due to the small amount of variation in a phenotypic trait that is typically explained by genetic variants. A range of power estimates based on simulations and specific parameters for two-stage least squares (2SLS) MR analyses based on continuous variables has previously been published. However there are presently no specific equations or software tools one can implement for calculating power of a given MR study. Using asymptotic theory, we show that in the case of continuous variables and a single instrument, for example a single-nucleotide polymorphism (SNP) or multiple SNP predictor, statistical power for a fixed sample size is a function of two parameters: the proportion of variation in the exposure variable explained by the genetic predictor and the true causal association between the exposure and outcome variable. We demonstrate that power for 2SLS MR can be derived using the non-centrality parameter (NCP) of the statistical test that is employed to test whether the 2SLS regression coefficient is zero. We show that the previously published power estimates from simulations can be represented theoretically using this NCP-based approach, with similar estimates observed when the simulation-based estimates are compared with our NCP-based approach. General equations for calculating statistical power for 2SLS MR using the NCP are provided in this note, and we implement the calculations in a web-based application.
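As a rough illustration of the NCP-based calculation described in the abstract, the sketch below uses the standard asymptotic variance of the 2SLS estimator, var(β̂) ≈ σ²_e / (N · R²_xz · σ²_x), so that NCP = β² / var(β̂). The parameter names and defaults are assumptions made for this example, and the exact equations in the paper and the web application may differ in detail.

```python
from math import sqrt
from statistics import NormalDist


def mr_power(n, r2_xz, beta_yx, resid_var=1.0, var_x=1.0, alpha=0.05):
    """Approximate power of a two-sided 2SLS MR test with one instrument.

    n         : sample size
    r2_xz     : proportion of exposure variance explained by the instrument
    beta_yx   : assumed true causal effect of exposure on outcome
    resid_var : variance of the outcome residual (assumed, hypothetical default)
    var_x     : variance of the exposure (assumed, hypothetical default)
    """
    # Non-centrality parameter of the 1-df chi-square test of beta = 0.
    ncp = n * r2_xz * beta_yx ** 2 * var_x / resid_var

    # A 1-df non-central chi-square is the square of N(sqrt(ncp), 1),
    # so power can be computed from the standard normal distribution.
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = sqrt(ncp)
    power = std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)
    return ncp, power
```

For example, `mr_power(10000, 0.02, 0.2)` returns the NCP and the power for an instrument explaining 2% of exposure variance under these simplifying assumptions.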

DOI

10.1093/ije/dyt179

ms

Tool

PUBMED_LINK

11847089

TITLE

Generating samples under a Wright-Fisher neutral model of genetic variation.

Main citation

Hudson RR. (2002) Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics, 18 (2) 337-8. doi:10.1093/bioinformatics/18.2.337. PMID 11847089

ABSTRACT

A Monte Carlo computer program is available to generate samples drawn from a population evolving according to a Wright-Fisher neutral model. The program assumes an infinite-sites model of mutation, and allows recombination, gene conversion, symmetric migration among subpopulations, and a variety of demographic histories. The samples produced can be used to investigate the sampling properties of any sample statistic under these neutral models.
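The kind of sampling described in the abstract can be sketched for the simplest case: a single population of constant size, no recombination, and infinite-sites mutation. The sketch below is not Hudson's ms code; it only draws the number of segregating sites in one sample, and the function names are hypothetical.

```python
import random


def _poisson(mean, rng):
    """Draw a Poisson count by counting unit-rate arrivals before `mean`."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def simulate_segregating_sites(n_samples, theta, rng):
    """Number of segregating sites under the neutral coalescent.

    Time is measured in coalescent units in which each pair of lineages
    coalesces at rate 1, and mutations fall on each lineage at rate theta / 2.
    """
    total = 0
    lineages = n_samples
    while lineages > 1:
        # Waiting time until the next coalescence among `lineages` lineages.
        wait = rng.expovariate(lineages * (lineages - 1) / 2)
        # All current lineages accumulate mutations during this interval.
        branch_length = lineages * wait
        total += _poisson(theta / 2 * branch_length, rng)
        lineages -= 1
    return total
```

Averaged over many replicates, the output should approach the Watterson expectation θ · Σ 1/i for i = 1 to n − 1, which gives a simple check on the simulation.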

DOI

10.1093/bioinformatics/18.2.337

MsCAVIAR

Tool

PUBMED_LINK

34543273

FULL NAME

multiple study causal variants identification in associated regions

DESCRIPTION

MsCAVIAR is a method for fine-mapping (identifying causal variants among GWAS associated variants) by leveraging information from multiple studies. One important application area is trans-ethnic fine mapping.

URL

https://github.com/nlapier2/MsCAVIAR

KEYWORDS

multi-study finemapping

TITLE

Identifying causal variants by fine mapping across multiple studies.

Main citation

LaPierre N, Taraszka K, Huang H, He R, ...&, Eskin E. (2021) Identifying causal variants by fine mapping across multiple studies. PLoS Genet, 17 (9) e1009733. doi:10.1371/journal.pgen.1009733. PMID 34543273

ABSTRACT

Increasingly large Genome-Wide Association Studies (GWAS) have yielded numerous variants associated with many complex traits, motivating the development of "fine mapping" methods to identify which of the associated variants are causal. Additionally, GWAS of the same trait for different populations are increasingly available, raising the possibility of refining fine mapping results further by leveraging different linkage disequilibrium (LD) structures across studies. Here, we introduce multiple study causal variants identification in associated regions (MsCAVIAR), a method that extends the popular CAVIAR fine mapping framework to a multiple study setting using a random effects model. MsCAVIAR only requires summary statistics and LD as input, accounts for uncertainty in association statistics using a multivariate normal model, allows for multiple causal variants at a locus, and explicitly models the possibility of different SNP effect sizes in different populations. We demonstrate the efficacy of MsCAVIAR in both a simulation study and a trans-ethnic, trans-biobank fine mapping analysis of High Density Lipoprotein (HDL).

DOI

10.1371/journal.pgen.1009733

MSMC

Tool

PUBMED_LINK

24952747

DESCRIPTION

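MSMC (multiple sequentially Markovian coalescent) infers population size and separation history from multiple genome sequences. Reference: Schiffels, S. & Durbin, R. Inferring human population size and separation history from multiple genome sequences. Nat. Genet. 46, 919–925 (2014).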

URL

https://github.com/stschiff/msmc

TITLE

Inferring human population size and separation history from multiple genome sequences.

Main citation

Schiffels S, Durbin R. (2014) Inferring human population size and separation history from multiple genome sequences. Nat Genet, 46 (8) 919-25. doi:10.1038/ng.3015. PMID 24952747

ABSTRACT

The availability of complete human genome sequences from populations across the world has given rise to new population genetic inference methods that explicitly model ancestral relationships under recombination and mutation. So far, application of these methods to evolutionary history more recent than 20,000-30,000 years ago and to population separations has been limited. Here we present a new method that overcomes these shortcomings. The multiple sequentially Markovian coalescent (MSMC) analyzes the observed pattern of mutations in multiple individuals, focusing on the first coalescence between any two individuals. Results from applying MSMC to genome sequences from nine populations across the world suggest that the genetic separation of non-African ancestors from African Yoruban ancestors started long before 50,000 years ago and give information about human population history as recent as 2,000 years ago, including the bottleneck in the peopling of the Americas and separations within Africa, East Asia and Europe.

DOI

10.1038/ng.3015

MTAG

Tool

PUBMED_LINK

29292387

FULL NAME

Multi-Trait Analysis of GWAS

DESCRIPTION

mtag is a Python-based command-line tool for jointly analyzing multiple sets of GWAS summary statistics, as described by Turley et al. (2018). It can also be used to meta-analyze GWAS results.

URL

https://github.com/JonJala/mtag

KEYWORDS

Multi-trait

TITLE

Multi-trait analysis of genome-wide association summary statistics using MTAG.

Main citation

Turley P, Walters RK, Maghzian O, Okbay A, ...&, Benjamin DJ. (2018) Multi-trait analysis of genome-wide association summary statistics using MTAG. Nat Genet, 50 (2) 229-237. doi:10.1038/s41588-017-0009-4. PMID 29292387

ABSTRACT

We introduce multi-trait analysis of GWAS (MTAG), a method for joint analysis of summary statistics from genome-wide association studies (GWAS) of different traits, possibly from overlapping samples. We apply MTAG to summary statistics for depressive symptoms (N_eff = 354,862), neuroticism (N = 168,105), and subjective well-being (N = 388,538). As compared to the 32, 9, and 13 genome-wide significant loci identified in the single-trait GWAS (most of which are themselves novel), MTAG increases the number of associated loci to 64, 37, and 49, respectively. Moreover, association statistics from MTAG yield more informative bioinformatics analyses and increase the variance explained by polygenic scores by approximately 25%, matching theoretical expectations.

DOI

10.1038/s41588-017-0009-4

Multi-PGS

Tool

PUBMED_LINK

37543680

DESCRIPTION

A framework to generate enriched PGS from a wealth of publicly available genome-wide association studies, combining thousands of studies focused on many different phenotypes into a multi-PGS.

URL

https://github.com/ClaraAlbi/paper_multiPGS

TITLE

Multi-PGS enhances polygenic prediction by combining 937 polygenic scores.

Main citation

Albiñana C, Zhu Z, Schork AJ, Ingason A, ...&, Vilhjálmsson BJ. (2023) Multi-PGS enhances polygenic prediction by combining 937 polygenic scores. Nat Commun, 14 (1) 4702. doi:10.1038/s41467-023-40330-w. PMID 37543680

ABSTRACT

The predictive performance of polygenic scores (PGS) is largely dependent on the number of samples available to train the PGS. Increasing the sample size for a specific phenotype is expensive and takes time, but this sample size can be effectively increased by using genetically correlated phenotypes. We propose a framework to generate multi-PGS from thousands of publicly available genome-wide association studies (GWAS) with no need to individually select the most relevant ones. In this study, the multi-PGS framework increases prediction accuracy over single PGS for all included psychiatric disorders and other available outcomes, with prediction R² increases of up to 9-fold for attention-deficit/hyperactivity disorder compared to a single PGS. We also generate multi-PGS for phenotypes without an existing GWAS and for case-case predictions. We benchmark the multi-PGS framework against other methods and highlight its potential application to new emerging biobanks.

DOI

10.1038/s41467-023-40330-w

MultiBLUP

Tool

PUBMED_LINK

24963154

URL

https://cran.r-project.org/web/packages/MultiPhen/index.html

TITLE

MultiBLUP: improved SNP-based prediction for complex traits.

Main citation

Speed D, Balding DJ. (2014) MultiBLUP: improved SNP-based prediction for complex traits. Genome Res, 24 (9) 1550-7. doi:10.1101/gr.169375.113. PMID 24963154

ABSTRACT

BLUP (best linear unbiased prediction) is widely used to predict complex traits in plant and animal breeding, and increasingly in human genetics. The BLUP mathematical model, which consists of a single random effect term, was adequate when kinships were measured from pedigrees. However, when genome-wide SNPs are used to measure kinships, the BLUP model implicitly assumes that all SNPs have the same effect-size distribution, which is a severe and unnecessary limitation. We propose MultiBLUP, which extends the BLUP model to include multiple random effects, allowing greatly improved prediction when the random effects correspond to classes of SNPs with distinct effect-size variances. The SNP classes can be specified in advance, for example, based on SNP functional annotations, and we also provide an adaptive procedure for determining a suitable partition of SNPs. We apply MultiBLUP to genome-wide association data from the Wellcome Trust Case Control Consortium (seven diseases), and from much larger studies of celiac disease and inflammatory bowel disease, finding that it consistently provides better prediction than alternative methods. Moreover, MultiBLUP is computationally very efficient; for the largest data set, which includes 12,678 individuals and 1.5 M SNPs, the total analysis can be run on a single desktop PC in less than a day and can be parallelized to run even faster. Tools to perform MultiBLUP are freely available in our software LDAK.

DOI

10.1101/gr.169375.113

MultiPhen

Tool

PUBMED_LINK

22567092

DESCRIPTION

Performs genetic association tests between SNPs (one-at-a-time) and multiple phenotypes (separately or in joint model).

URL

https://cran.r-project.org/web/packages/MultiPhen/index.html

TITLE

MultiPhen: joint model of multiple phenotypes can increase discovery in GWAS.

Main citation

O'Reilly PF, Hoggart CJ, Pomyen Y, Calboli FC, ...&, Coin LJ. (2012) MultiPhen: joint model of multiple phenotypes can increase discovery in GWAS. PLoS One, 7 (5) e34861. doi:10.1371/journal.pone.0034861. PMID 22567092

ABSTRACT

The genome-wide association study (GWAS) approach has discovered hundreds of genetic variants associated with diseases and quantitative traits. However, despite clinical overlap and statistical correlation between many phenotypes, GWAS are generally performed one-phenotype-at-a-time. Here we compare the performance of modelling multiple phenotypes jointly with that of the standard univariate approach. We introduce a new method and software, MultiPhen, that models multiple phenotypes simultaneously in a fast and interpretable way. By performing ordinal regression, MultiPhen tests the linear combination of phenotypes most associated with the genotypes at each SNP, and thus potentially captures effects hidden to single phenotype GWAS. We demonstrate via simulation that this approach provides a dramatic increase in power in many scenarios. There is a boost in power for variants that affect multiple phenotypes and for those that affect only one phenotype. While other multivariate methods have similar power gains, we describe several benefits of MultiPhen over these. In particular, we demonstrate that other multivariate methods that assume the genotypes are normally distributed, such as canonical correlation analysis (CCA) and MANOVA, can have highly inflated type-1 error rates when testing case-control or non-normal continuous phenotypes, while MultiPhen produces no such inflation. To test the performance of MultiPhen on real data we applied it to lipid traits in the Northern Finland Birth Cohort 1966 (NFBC1966). In these data MultiPhen discovers 21% more independent SNPs with known associations than the standard univariate GWAS approach, while applying MultiPhen in addition to the standard approach provides 37% increased discovery. The most associated linear combinations of the lipids estimated by MultiPhen at the leading SNPs accurately reflect the Friedewald Formula, suggesting that MultiPhen could be used to refine the definition of existing phenotypes or uncover novel heritable phenotypes.

DOI

10.1371/journal.pone.0034861

MultiSTAAR

Tool

PUBMED_LINK

39920506

FULL NAME

Multi-trait variant-Set Test for Association using Annotation infoRmation

DESCRIPTION

MultiSTAAR is an R package for performing Multi-trait variant-Set Test for Association using Annotation infoRmation (MultiSTAAR) procedure in whole-genome sequencing (WGS) studies. MultiSTAAR is a general framework that (1) leverages the correlation structure between multiple phenotypes to improve power of multi-trait analysis over single-trait analysis, and (2) incorporates both qualitative functional categories and quantitative complementary functional annotations using an omnibus multi-dimensional weighting scheme. MultiSTAAR accounts for population structure and relatedness, and is scalable for jointly analyzing large WGS studies of multiple correlated traits.

URL

https://github.com/xihaoli/MultiSTAAR

TITLE

A statistical framework for multi-trait rare variant analysis in large-scale whole-genome sequencing studies.

Main citation

Li X, Chen H, Selvaraj MS, Van Buren E, ...&, Lin X. (2025) A statistical framework for multi-trait rare variant analysis in large-scale whole-genome sequencing studies. Nat Comput Sci, 5 (2) 125-143. doi:10.1038/s43588-024-00764-8. PMID 39920506

ABSTRACT

Large-scale whole-genome sequencing (WGS) studies have improved our understanding of the contributions of coding and noncoding rare variants to complex human traits. Leveraging association effect sizes across multiple traits in WGS rare variant association analysis can improve statistical power over single-trait analysis, and also detect pleiotropic genes and regions. Existing multi-trait methods have limited ability to perform rare variant analysis of large-scale WGS data. We propose MultiSTAAR, a statistical framework and computationally scalable analytical pipeline for functionally informed multi-trait rare variant analysis in large-scale WGS studies. MultiSTAAR accounts for relatedness, population structure and correlation among phenotypes by jointly analyzing multiple traits, and further empowers rare variant association analysis by incorporating multiple functional annotations. We applied MultiSTAAR to jointly analyze three lipid traits in 61,838 multi-ethnic samples from the Trans-Omics for Precision Medicine (TOPMed) Program. We discovered and replicated new associations with lipid traits missed by single-trait analysis.

DOI

10.1038/s43588-024-00764-8

MultiSuSiE

Tool

PUBMED_LINK

41491094

DESCRIPTION

MultiSuSiE is a multi-ancestry SuSiE-style fine-mapping framework that allows causal effect sizes to differ across ancestries, improving credible sets in diverse whole-genome sequencing cohorts such as All of Us.

URL

https://github.com/jordanero/MultiSuSiE

KEYWORDS

cross-ancestry, fine-mapping

TITLE

MultiSuSiE improves multi-ancestry fine-mapping in All of Us whole-genome sequencing data.

Main citation

Rossen J, Shi H, Strober BJ, Zhang MJ, ...&, Price AL. (2026) MultiSuSiE improves multi-ancestry fine-mapping in All of Us whole-genome sequencing data. Nat Genet, 58 (1) 67-76. doi:10.1038/s41588-025-02450-5. PMID 41491094

ABSTRACT

Leveraging multi-ancestry data can improve fine-mapping power. We propose MultiSuSiE, an extension of Sum of Single Effects (SuSiE), to multiple ancestries that allows causal effect sizes to vary across ancestries. We evaluated MultiSuSiE using whole-genome sequencing data from 47,000 African-ancestry, 36,000 Latino-ancestry and 116,000 European-ancestry individuals from All of Us. In simulations, MultiSuSiE applied to Afr36k + Lat36k + Eur36k was well-calibrated and attained higher power than SuSiE applied to Eur109k; compared to recent multi-ancestry methods (SuSiEx and MESuSiE), MultiSuSiE attained higher power and lower computational cost. In analyses of 14 quantitative traits, MultiSuSiE applied to Afr47k + Lat36k + Eur116k identified 348 fine-mapped variants with posterior inclusion probability (PIP) > 0.9, and MultiSuSiE applied to Afr36k + Lat36k + Eur36k identified 59% more PIP > 0.9 variants than SuSiE applied to Eur109k; MultiSuSiE identified 29% more PIP > 0.9 variants than SuSiEx, and MESuSiE was not included due to its high computational cost. We validated these findings through functional enrichment of fine-mapped variants and highlighted examples implicating biologically plausible fine-mapped variants.

DOI

10.1038/s41588-025-02450-5

MultiXcan

Tool

PUBMED_LINK

30668570

DESCRIPTION

MultiXcan is an efficient statistical method that leverages the substantial sharing of eQTLs across tissues and contexts to improve the ability to identify potential target genes. It integrates evidence across multiple panels using multivariate regression, which naturally takes into account the correlation structure.

URL

https://github.com/hakyimlab/MetaXcan

TITLE

Integrating predicted transcriptome from multiple tissues improves association detection.

Main citation

Barbeira AN, Pividori M, Zheng J, Wheeler HE, ...&, Im HK. (2019) Integrating predicted transcriptome from multiple tissues improves association detection. PLoS Genet, 15 (1) e1007889. doi:10.1371/journal.pgen.1007889. PMID 30668570

ABSTRACT

Integration of genome-wide association studies (GWAS) and expression quantitative trait loci (eQTL) studies is needed to improve our understanding of the biological mechanisms underlying GWAS hits, and our ability to identify therapeutic targets. Gene-level association methods such as PrediXcan can prioritize candidate targets. However, limited eQTL sample sizes and absence of relevant developmental and disease context restrict our ability to detect associations. Here we propose an efficient statistical method (MultiXcan) that leverages the substantial sharing of eQTLs across tissues and contexts to improve our ability to identify potential target genes. MultiXcan integrates evidence across multiple panels using multivariate regression, which naturally takes into account the correlation structure. We apply our method to simulated and real traits from the UK Biobank and show that, in realistic settings, we can detect a larger set of significantly associated genes than using each panel separately. To improve applicability, we developed a summary result-based extension called S-MultiXcan, which we show yields highly concordant results with the individual level version when LD is well matched. Our multivariate model-based approach allowed us to use the individual level results as a gold standard to calibrate and develop a robust implementation of the summary-based extension. Results from our analysis as well as software and necessary resources to apply our method are publicly available.

DOI

10.1371/journal.pgen.1007889

MungeSumstats

Tool

PUBMED_LINK

34601555

DESCRIPTION

A Bioconductor package for the standardization and quality control of many GWAS summary statistics.

URL

https://github.com/neurogenomics/MungeSumstats

TITLE

MungeSumstats: a Bioconductor package for the standardization and quality control of many GWAS summary statistics.

Main citation

Murphy AE, Schilder BM, Skene NG. (2021) MungeSumstats: a Bioconductor package for the standardization and quality control of many GWAS summary statistics. Bioinformatics, 37 (23) 4593-4596. doi:10.1093/bioinformatics/btab665. PMID 34601555

ABSTRACT

MOTIVATION: Genome-wide association studies (GWAS) summary statistics have popularized and accelerated genetic research. However, a lack of standardization of the file formats used has proven problematic when running secondary analysis tools or performing meta-analysis studies. RESULTS: To address this issue, we have developed MungeSumstats, a Bioconductor R package for the standardization and quality control of GWAS summary statistics. MungeSumstats can handle the most common summary statistic formats, including variant call format (VCF) producing a reformatted, standardized, tabular summary statistic file, VCF or R native data object. AVAILABILITY AND IMPLEMENTATION: MungeSumstats is available on Bioconductor (v 3.13) and can also be found on Github at: https://neurogenomics.github.io/MungeSumstats. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

DOI

10.1093/bioinformatics/btab665

MuPIT

Tool

PUBMED_LINK

23793516

FULL NAME

Mutation position imaging toolbox

DESCRIPTION

Webserver for mapping variant positions to annotated, interactive 3D structures

URL

http://mupit.icm.jhu.edu/MuPIT_Interactive/

TITLE

MuPIT interactive: webserver for mapping variant positions to annotated, interactive 3D structures.

Main citation

Niknafs N, Kim D, Kim R, Diekhans M, ...&, Karchin R. (2013) MuPIT interactive: webserver for mapping variant positions to annotated, interactive 3D structures. Hum Genet, 132 (11) 1235-43. doi:10.1007/s00439-013-1325-0. PMID 23793516

ABSTRACT

Mutation position imaging toolbox (MuPIT) interactive is a browser-based application for single-nucleotide variants (SNVs), which automatically maps the genomic coordinates of SNVs onto the coordinates of available three-dimensional (3D) protein structures. The application is designed for interactive browser-based visualization of the putative functional relevance of SNVs by biologists who are not necessarily experts either in bioinformatics or protein structure. Users may submit batches of several thousand SNVs and review all protein structures that cover the SNVs, including available functional annotations such as binding sites, mutagenesis experiments, and common polymorphisms. Multiple SNVs may be mapped onto each structure, enabling 3D visualization of SNV clusters and their relationship to functionally annotated positions. We illustrate the utility of MuPIT interactive in rationalizing the impact of selected polymorphisms in the PharmGKB database, somatic mutations identified in the Cancer Genome Atlas study of invasive breast carcinomas, and rare variants identified in the exome sequencing project. MuPIT interactive is freely available for non-profit use at http://mupit.icm.jhu.edu.

DOI

10.1007/s00439-013-1325-0

MV-PLINK (MQFAM)

Tool

PUBMED_LINK

19019849

TITLE

A multivariate test of association.

Main citation

Ferreira MA, Purcell SM. (2009) A multivariate test of association. Bioinformatics, 25 (1) 132-3. doi:10.1093/bioinformatics/btn563. PMID 19019849

ABSTRACT

Although genetic association studies often test multiple, related phenotypes, few formal multivariate tests of association are available. We describe a test of association that can be efficiently applied to large population-based designs. AVAILABILITY: A C++ implementation can be obtained from the authors.

DOI

10.1093/bioinformatics/btn563

mvGWAMA

Tool

PUBMED_LINK

30617256

FULL NAME

Multivariate Genome-Wide Association Meta-Analysis

DESCRIPTION

mvGWAMA is a Python script that performs GWAS meta-analysis when samples overlap between studies.
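
To illustrate why sample overlap matters, the sketch below combines per-study z-scores with a weighted Stouffer statistic whose variance includes the pairwise correlation between studies. This is a generic correction, not mvGWAMA's own code: the function name `overlap_corrected_z`, the equal weights, and the correlation values are assumptions chosen for illustration.

```python
import math

def overlap_corrected_z(z_scores, weights, corr):
    """Weighted z-score meta-analysis allowing for correlated studies.

    corr[i][j] is the assumed null correlation between the z-scores of
    studies i and j (for example, induced by shared samples); corr[i][i] = 1.
    """
    k = len(z_scores)
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    # Variance of the weighted sum, including the off-diagonal overlap terms.
    variance = sum(
        weights[i] * weights[j] * corr[i][j]
        for i in range(k)
        for j in range(k)
    )
    return numerator / math.sqrt(variance)

# Hypothetical inputs: two studies, each with z = 2 and equal weight.
independent = [[1.0, 0.0], [0.0, 1.0]]
overlapping = [[1.0, 0.5], [0.5, 1.0]]
z_indep = overlap_corrected_z([2.0, 2.0], [1.0, 1.0], independent)
z_overlap = overlap_corrected_z([2.0, 2.0], [1.0, 1.0], overlapping)
```

With these made-up inputs the combined statistic is smaller once the overlap term is included, which is the inflation that an uncorrected analysis would miss.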

URL

https://github.com/Kyoko-wtnb/mvGWAMA

TITLE

Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer's disease risk.

Main citation

Jansen IE, Savage JE, Watanabe K, Bryois J, ...&, Posthuma D. (2019) Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer's disease risk. Nat Genet, 51 (3) 404-413. doi:10.1038/s41588-018-0311-9. PMID 30617256

ABSTRACT

Alzheimer's disease (AD) is highly heritable and recent studies have identified over 20 disease-associated genomic loci. Yet these only explain a small proportion of the genetic variance, indicating that undiscovered loci remain. Here, we performed a large genome-wide association study of clinically diagnosed AD and AD-by-proxy (71,880 cases, 383,378 controls). AD-by-proxy, based on parental diagnoses, showed strong genetic correlation with AD (rg = 0.81). Meta-analysis identified 29 risk loci, implicating 215 potential causative genes. Associated genes are strongly expressed in immune-related tissues and cell types (spleen, liver, and microglia). Gene-set analyses indicate biological mechanisms involved in lipid-related processes and degradation of amyloid precursor proteins. We show strong genetic correlations with multiple health-related outcomes, and Mendelian randomization results suggest a protective effect of cognitive ability on AD risk. These results are a step forward in identifying the genetic factors that contribute to AD risk and add novel insights into the neurobiology of AD.

DOI

10.1038/s41588-018-0311-9

mvSuSiE

Tool

PUBMED_LINK

41634413

DESCRIPTION

mvSuSiE extends the Sum of Single Effects (SuSiE) model to joint fine-mapping of multiple traits, improving power and resolution relative to separate single-trait analyses while remaining computationally practical.

URL

https://github.com/stephenslab/mvsusieR

KEYWORDS

multi-trait, fine-mapping

TITLE

Fast and flexible joint fine-mapping of multiple traits via the Sum of Single Effects model.

Main citation

Zou Y, Carbonetto P, Xie D, Wang G, ...&, Stephens M. (2026) Fast and flexible joint fine-mapping of multiple traits via the Sum of Single Effects model. Nat Genet, 58 (2) 454-462. doi:10.1038/s41588-025-02486-7. PMID 41634413

ABSTRACT

We introduce mvSuSiE, a multitrait fine-mapping method, to identify putative causal variants from genetic association data (individual-level or summary). mvSuSiE learns patterns of shared genetic effects from data, and exploits these patterns to improve power to identify causal single nucleotide polymorphisms (SNPs). Comparisons on simulated data show that mvSuSiE is competitive in speed, power and precision with existing multitrait methods, and uniformly improves over single-trait fine-mapping (Sum of Single Effects) performed separately for each trait. We applied mvSuSiE to jointly fine-map 16 blood cell traits using data from the UK Biobank. By jointly analyzing traits and modeling heterogeneous effect-sharing patterns, we identified a substantially larger number of causal SNPs (>3,000) than single-trait fine-mapping and achieved narrower credible sets. mvSuSiE also more comprehensively characterized how genetic variants affect blood cell traits; 68% of causal SNPs showed significant effects across more than one blood cell type.

DOI

10.1038/s41588-025-02486-7

NARD

Tool

PUBMED_LINK

31640730

FULL NAME

Northeast Asian Reference Database

URL

https://nard.macrogen.com/

TITLE

NARD: whole-genome reference panel of 1779 Northeast Asians improves imputation accuracy of rare and low-frequency variants.

Main citation

Yoo SK, Kim CU, Kim HL, Kim S, ...&, Seo JS. (2019) NARD: whole-genome reference panel of 1779 Northeast Asians improves imputation accuracy of rare and low-frequency variants. Genome Med, 11 (1) 64. doi:10.1186/s13073-019-0677-z. PMID 31640730

ABSTRACT

Here, we present the Northeast Asian Reference Database (NARD), including whole-genome sequencing data of 1779 individuals from Korea, Mongolia, Japan, China, and Hong Kong. NARD provides the genetic diversity of Korean (n = 850) and Mongolian (n = 384) ancestries that were not present in the 1000 Genomes Project Phase 3 (1KGP3). We combined and re-phased the genotypes from NARD and 1KGP3 to construct a union set of haplotypes. This approach established a robust imputation reference panel for Northeast Asians, which yields the greatest imputation accuracy of rare and low-frequency variants compared with the existing panels. NARD imputation panel is available at https://nard.macrogen.com/ .

DOI

10.1186/s13073-019-0677-z

NARD2

Tool

PUBMED_LINK

37556544

FULL NAME

Northeast Asian Reference Database 2

URL

https://nard.macrogen.com/

TITLE

A whole-genome reference panel of 14,393 individuals for East Asian populations accelerates discovery of rare functional variants.

Main citation

Choi J, Kim S, Kim J, Son HY, ...&, Im SW. (2023) A whole-genome reference panel of 14,393 individuals for East Asian populations accelerates discovery of rare functional variants. Sci Adv, 9 (32) eadg6319. doi:10.1126/sciadv.adg6319. PMID 37556544

ABSTRACT

Underrepresentation of non-European (EUR) populations hinders growth of global precision medicine. Resources such as imputation reference panels that match the study population are necessary to find low-frequency variants with substantial effects. We created a reference panel consisting of 14,393 whole-genome sequences including more than 11,000 Asian individuals. Genome-wide association studies were conducted using the reference panel and a population-specific genotype array of 72,298 subjects for eight phenotypes. This panel yields improved imputation accuracy of rare and low-frequency variants within East Asian populations compared with the largest reference panel. Thirty-nine previously unidentified associations were found, and more than half of the variants were East Asian specific. We discovered genes with rare protein-altering variants, including LTBP1 for height and GPR75 for body mass index, as well as putative regulatory mechanisms for rare noncoding variants with cell type-specific effects. We suggest that this dataset will add to the potential value of Asian precision medicine.

DOI

10.1126/sciadv.adg6319

NyuWa Genome Resource Phase 1

Tool

PUBMED_LINK

34788621

URL

http://bigdata.ibp.ac.cn/refpanel/getstarted.php

TITLE

NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population.

Main citation

Zhang P, Luo H, Li Y, Wang Y, ...&, He S. (2021) NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population. Cell Rep, 37 (7) 110017. doi:10.1016/j.celrep.2021.110017. PMID 34788621

ABSTRACT

The lack of haplotype reference panels and whole-genome sequencing resources specific to the Chinese population has greatly hindered genetic studies in the world's largest population. Here, we present the NyuWa genome resource, based on deep (26.2×) sequencing of 2,999 Chinese individuals, and construct a NyuWa reference panel of 5,804 haplotypes and 19.3 million variants, which is a high-quality publicly available Chinese population-specific reference panel with thousands of samples. Compared with other panels, the NyuWa reference panel reduces the Han Chinese imputation error rate by a margin ranging from 30% to 51%. Population structure and imputation simulation tests support the applicability of one integrated reference panel for northern and southern Chinese. In addition, a total of 22,504 loss-of-function variants in coding and noncoding genes are identified, including 11,493 novel variants. These results highlight the value of the NyuWa genome resource in facilitating genetic research in Chinese and Asian populations.

DOI

10.1016/j.celrep.2021.110017

NyuWa Imputation Server (NyuWa)

Tool

PUBMED_LINK

34788621

URL

http://bigdata.ibp.ac.cn/refpanel/getstarted.php

TITLE

NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population.

Main citation

Zhang P, Luo H, Li Y, Wang Y, ...&, He S. (2021) NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population. Cell Rep, 37 (7) 110017. doi:10.1016/j.celrep.2021.110017. PMID 34788621

ABSTRACT

The lack of haplotype reference panels and whole-genome sequencing resources specific to the Chinese population has greatly hindered genetic studies in the world's largest population. Here, we present the NyuWa genome resource, based on deep (26.2×) sequencing of 2,999 Chinese individuals, and construct a NyuWa reference panel of 5,804 haplotypes and 19.3 million variants, which is a high-quality publicly available Chinese population-specific reference panel with thousands of samples. Compared with other panels, the NyuWa reference panel reduces the Han Chinese imputation error rate by a margin ranging from 30% to 51%. Population structure and imputation simulation tests support the applicability of one integrated reference panel for northern and southern Chinese. In addition, a total of 22,504 loss-of-function variants in coding and noncoding genes are identified, including 11,493 novel variants. These results highlight the value of the NyuWa genome resource in facilitating genetic research in Chinese and Asian populations.

DOI

10.1016/j.celrep.2021.110017

Olink

Tool

PUBMED_LINK

34715355

TITLE

Proximity Extension Assay in Combination with Next-Generation Sequencing for High-throughput Proteome-wide Analysis.

Main citation

Wik L, Nordberg N, Broberg J, Björkesten J, ...&, Lundberg M. (2021) Proximity Extension Assay in Combination with Next-Generation Sequencing for High-throughput Proteome-wide Analysis. Mol Cell Proteomics, 20 100168. doi:10.1016/j.mcpro.2021.100168. PMID 34715355

ABSTRACT

Understanding the dynamics of the human proteome is crucial for developing biomarkers to be used as measurable indicators for disease severity and progression, patient stratification, and drug development. The Proximity Extension Assay (PEA) is a technology that translates protein information into actionable knowledge by linking protein-specific antibodies to DNA-encoded tags. In this report we demonstrate how we have combined the unique PEA technology with an innovative and automated sample preparation and high-throughput sequencing readout enabling parallel measurement of nearly 1500 proteins in 96 samples generating close to 150,000 data points per run. This advancement will have a major impact on the discovery of new biomarkers for disease prediction and prognosis and contribute to the development of the rapidly evolving fields of wellness monitoring and precision medicine.
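
The throughput figure in the abstract can be checked with one multiplication. The snippet below treats "nearly 1500 proteins" as an upper bound of 1500, which is an assumption, so the result is an approximate ceiling rather than an exact count.

```python
# Assumed upper bound taken from "nearly 1500 proteins" in the abstract.
proteins_per_run = 1500
samples_per_run = 96

# One measurement per protein per sample.
data_points = proteins_per_run * samples_per_run
```

The product is 144,000, consistent with the abstract's "close to 150,000 data points per run".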

DOI

10.1016/j.mcpro.2021.100168

OmiGA

eQTL Colocalization Fine mapping Multi-omics Tool

PUBMED_LINK

41680153

DESCRIPTION

Toolkit for molecular QTL (molQTL) mapping using linear mixed models that handle complex relatedness, aimed at high-throughput omics phenotypes with strong performance for discovery, fine mapping, and trait–molQTL colocalization versus common linear-mapper pipelines.

URL

https://omiga.bio/, https://doi.org/10.1038/s41467-026-68978-0

KEYWORDS

molQTL, xQTL, LMM, relatedness, colocalization, fine mapping

TITLE

OmiGA for ultra-efficient molecular quantitative trait loci mapping.

Main citation

Teng J, Zhang W, Gong W, Chen J, ...&, Zhang Z. (2026) OmiGA for ultra-efficient molecular quantitative trait loci mapping. Nat Commun, 17 (1) . doi:10.1038/s41467-026-68978-0. PMID 41680153

ABSTRACT

Molecular quantitative trait loci (molQTL) mapping is one of the most popular approaches to systematically characterize functional impacts of genomic variants, leading to advanced understanding of the regulatory mechanisms underpinning complex traits and diseases. However, when applied to high-throughput molecular phenotypes, the existing molQTL mapping tools often implement simple linear models, overlooking complex inter-individual relatedness, leading to false positives and insufficient statistical power. Here, we introduce OmiGA, an ultra-efficient omics genetic analysis toolkit, for molQTL mapping based on linear mixed model in populations with complex relatedness. Both computational simulations and real data analyses demonstrate that OmiGA outperforms the existing popular tools regarding molQTL discovery power, fine mapping of causal variants, colocalization of molQTL and trait associations, and computational efficiency. In summary, we recommend OmiGA for molQTL mapping in populations with complex relatedness, for example, those in the Farm animal Genotype-Tissue Expression project and family-based molQTL studies in humans.

DOI

10.1038/s41467-026-68978-0

Open Targets

Tool

PUBMED_LINK

27899665

DESCRIPTION

Open Targets is an innovative, large-scale, multi-year, public-private partnership that uses human genetics and genomics data for systematic drug target identification and prioritisation.

URL

https://www.opentargets.org/

TITLE

Open Targets: a platform for therapeutic target identification and validation.

Main citation

Koscielny G, An P, Carvalho-Silva D, Cham JA, ...&, Dunham I. (2017) Open Targets: a platform for therapeutic target identification and validation. Nucleic Acids Res, 45 (D1) D985-D994. doi:10.1093/nar/gkw1055. PMID 27899665

ABSTRACT

We have designed and developed a data integration and visualization platform that provides evidence about the association of known and potential drug targets with diseases. The platform is designed to support identification and prioritization of biological targets for follow-up. Each drug target is linked to a disease using integrated genome-wide data from a broad range of data sources. The platform provides either a target-centric workflow to identify diseases that may be associated with a specific target, or a disease-centric workflow to identify targets that may be associated with a specific disease. Users can easily transition between these target- and disease-centric workflows. The Open Targets Validation Platform is accessible at https://www.targetvalidation.org.

DOI

10.1093/nar/gkw1055

Open Targets Genetics

Tool

PUBMED_LINK

34711957

DESCRIPTION

Open Targets Genetics is a comprehensive tool highlighting variant-centric statistical evidence to allow both prioritisation of candidate causal variants at trait-associated loci and identification of potential drug targets.

URL

https://genetics.opentargets.org/

TITLE

An open approach to systematically prioritize causal variants and genes at all published human GWAS trait-associated loci.

Main citation

Mountjoy E, Schmidt EM, Carmona M, Schwartzentruber J, ...&, Ghoussaini M. (2021) An open approach to systematically prioritize causal variants and genes at all published human GWAS trait-associated loci. Nat Genet, 53 (11) 1527-1533. doi:10.1038/s41588-021-00945-5. PMID 34711957

ABSTRACT

Genome-wide association studies (GWASs) have identified many variants associated with complex traits, but identifying the causal gene(s) is a major challenge. In the present study, we present an open resource that provides systematic fine mapping and gene prioritization across 133,441 published human GWAS loci. We integrate genetics (GWAS Catalog and UK Biobank) with transcriptomic, proteomic and epigenomic data, including systematic disease-disease and disease-molecular trait colocalization results across 92 cell types and tissues. We identify 729 loci fine mapped to a single-coding causal variant and colocalized with a single gene. We trained a machine-learning model using the fine-mapped genetics and functional genomics data and 445 gold-standard curated GWAS loci to distinguish causal genes from neighboring genes, outperforming a naive distance-based model. Our prioritized genes were enriched for known approved drug targets (odds ratio = 8.1, 95% confidence interval = 5.7, 11.5). These results are publicly available through a web portal ( http://genetics.opentargets.org ), enabling users to easily prioritize genes at disease-associated loci and assess their potential as drug targets.
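
The reported enrichment (odds ratio = 8.1, 95% confidence interval = 5.7, 11.5) can be sanity-checked on the log scale. The sketch below assumes a Wald-type interval that is symmetric in log odds with a normal quantile of 1.96. The abstract does not state how the interval was computed, so this is a consistency check under that assumption, not a reproduction of the authors' analysis.

```python
import math

# Values quoted in the abstract.
odds_ratio = 8.1
ci_low, ci_high = 5.7, 11.5

# Assumption: the interval is exp(log(OR) +/- 1.96 * SE).
z_95 = 1.96
log_low, log_high = math.log(ci_low), math.log(ci_high)

# Standard error of log(OR) implied by the interval width.
se_log_or = (log_high - log_low) / (2 * z_95)

# The geometric midpoint of the interval should land near the reported OR.
midpoint_or = math.exp((log_low + log_high) / 2)
```

Under this assumption the geometric midpoint of the interval is close to the reported 8.1, and the implied standard error of the log odds ratio is about 0.18.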

DOI

10.1038/s41588-021-00945-5

OpenADMIXTURE

Tool

PUBMED_LINK

36610401

DESCRIPTION

Ko, S., Chu, B. B., Peterson, D., Okenwa, C., Papp, J. C., Alexander, D. H., ... & Lange, K. L. (2023). Unsupervised discovery of ancestry-informative markers and genetic admixture proportions in biobank-scale datasets. The American Journal of Human Genetics.

URL

https://github.com/OpenMendel/OpenADMIXTURE.jl

USE

This software package is an open-source Julia reimplementation of the ADMIXTURE package. It estimates ancestry by maximum likelihood from large SNP genotype datasets in which individuals are assumed to be unrelated.

TITLE

Unsupervised discovery of ancestry-informative markers and genetic admixture proportions in biobank-scale datasets.

Main citation

Ko S, Chu BB, Peterson D, Okenwa C, ...&, Lange KL. (2023) Unsupervised discovery of ancestry-informative markers and genetic admixture proportions in biobank-scale datasets. Am J Hum Genet, 110 (2) 314-325. doi:10.1016/j.ajhg.2022.12.008. PMID 36610401

ABSTRACT

Admixture estimation plays a crucial role in ancestry inference and genome-wide association studies (GWASs). Computer programs such as ADMIXTURE and STRUCTURE are commonly employed to estimate the admixture proportions of sample individuals. However, these programs can be overwhelmed by the computational burdens imposed by the 10^5 to 10^6 samples and millions of markers commonly found in modern biobanks. An attractive strategy is to run these programs on a set of ancestry-informative SNP markers (AIMs) that exhibit substantially different frequencies across populations. Unfortunately, existing methods for identifying AIMs require knowing ancestry labels for a subset of the sample. This supervised learning approach creates a chicken and the egg scenario. In this paper, we present an unsupervised, scalable framework that seamlessly carries out AIM selection and likelihood-based estimation of admixture proportions. Our simulated and real data examples show that this approach is scalable to modern biobank datasets. OpenADMIXTURE, our Julia implementation of the method, is open source and available for free.

DOI

10.1016/j.ajhg.2022.12.008

OTTERS

Tool

PUBMED_LINK

36882394

FULL NAME

Omnibus Transcriptome Test using Expression Reference Summary data

DESCRIPTION

Dai, Q. et al. OTTERS: a powerful TWAS framework leveraging summary-level reference data. Nat. Commun. 14, 1271 (2023).

URL

https://github.com/daiqile96/OTTERS

TITLE

OTTERS: a powerful TWAS framework leveraging summary-level reference data.

Main citation

Dai Q, Zhou G, Zhao H, Võsa U, ...&, Yang J. (2023) OTTERS: a powerful TWAS framework leveraging summary-level reference data. Nat Commun, 14 (1) 1271. doi:10.1038/s41467-023-36862-w. PMID 36882394

ABSTRACT

Most existing TWAS tools require individual-level eQTL reference data and thus are not applicable to summary-level reference eQTL datasets. The development of TWAS methods that can harness summary-level reference data is valuable to enable TWAS in broader settings and enhance power due to increased reference sample size. Thus, we develop a TWAS framework called OTTERS (Omnibus Transcriptome Test using Expression Reference Summary data) that adapts multiple polygenic risk score (PRS) methods to estimate eQTL weights from summary-level eQTL reference data and conducts an omnibus TWAS. We show that OTTERS is a practical and powerful TWAS tool by both simulations and application studies.

DOI

10.1038/s41467-023-36862-w

PAINTOR

Tool

PUBMED_LINK

25357204

FULL NAME

Probabilistic Annotation INtegraTOR

DESCRIPTION

Finding causal variants that underlie known risk loci is one of the main post-GWAS challenges. Here we present PAINTOR (Probabilistic Annotation INtegraTOR), a probabilistic framework that integrates association strength with genomic functional annotation data to improve accuracy in selecting plausible causal variants for functional validation. The main output of PAINTOR are probabilities for every variant to be causal that can be used for prioritization in functional assays to establish biological causality.

URL

https://bogdan.dgsom.ucla.edu/pages/paintor/

KEYWORDS

Empirical Bayes prior

TITLE

Integrating functional data to prioritize causal variants in statistical fine-mapping studies.

Main citation

Kichaev G, Yang WY, Lindstrom S, Hormozdiari F, ...&, Pasaniuc B. (2014) Integrating functional data to prioritize causal variants in statistical fine-mapping studies. PLoS Genet, 10 (10) e1004722. doi:10.1371/journal.pgen.1004722. PMID 25357204

ABSTRACT

Standard statistical approaches for prioritization of variants for functional testing in fine-mapping studies either use marginal association statistics or estimate posterior probabilities for variants to be causal under simplifying assumptions. Here, we present a probabilistic framework that integrates association strength with functional genomic annotation data to improve accuracy in selecting plausible causal variants for functional validation. A key feature of our approach is that it empirically estimates the contribution of each functional annotation to the trait of interest directly from summary association statistics while allowing for multiple causal variants at any risk locus. We devise efficient algorithms that estimate the parameters of our model across all risk loci to further increase performance. Using simulations starting from the 1000 Genomes data, we find that our framework consistently outperforms the current state-of-the-art fine-mapping methods, reducing the number of variants that need to be selected to capture 90% of the causal variants from an average of 13.3 to 10.4 SNPs per locus (as compared to the next-best performing strategy). Furthermore, we introduce a cost-to-benefit optimization framework for determining the number of variants to be followed up in functional assays and assess its performance using real and simulation data. We validate our findings using a large scale meta-analysis of four blood lipids traits and find that the relative probability for causality is increased for variants in exons and transcription start sites and decreased in repressed genomic regions at the risk loci of these traits. Using these highly predictive, trait-specific functional annotations, we estimate causality probabilities across all traits and variants, reducing the size of the 90% confidence set from an average of 17.5 to 13.5 variants per locus in this data.
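
The idea of a "90% set" of variants can be sketched generically: rank variants by their posterior probability of being causal and keep adding them until the cumulative probability reaches the target. The code below is not PAINTOR's algorithm. It assumes a single causal variant per locus so that the probabilities sum to one, and the function name `credible_set` and the example probabilities are hypothetical.

```python
def credible_set(posteriors, coverage=0.9):
    """Return indices of the smallest top-ranked set reaching the coverage.

    posteriors: per-variant posterior probabilities of causality,
    assumed here to sum to 1 (single-causal-variant simplification).
    """
    ranked = sorted(
        range(len(posteriors)), key=lambda i: posteriors[i], reverse=True
    )
    chosen, total = [], 0.0
    for i in ranked:
        chosen.append(i)
        total += posteriors[i]
        if total >= coverage:
            break
    return chosen

# Hypothetical locus with four variants.
example = credible_set([0.05, 0.5, 0.15, 0.3])
```

For this made-up locus, three of the four variants are needed to reach 90%. Sharper posteriors, for instance from informative functional annotations, would shrink the set.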

DOI

10.1371/journal.pgen.1004722

PanMAN

Tool

PUBMED_LINK

41526696

FULL NAME

Pangenome Mutation-Annotated Network

DESCRIPTION

PanMAN is a compact pangenome representation built from mutation-annotated trees (PanMATs) linked into a network, designed to compress and query shared evolutionary history across large microbial pathogen collections.

URL

https://github.com/TurakhiaLab/panman, https://turakhia.ucsd.edu/panman/

KEYWORDS

pangenome, microbial genomics, compression, mutation-annotated tree, phylogeny

TITLE

Compressive pangenomics using mutation-annotated networks.

Main citation

Walia S, Motwani H, Tseng YH, Smith K, ...&, Turakhia Y. (2026) Compressive pangenomics using mutation-annotated networks. Nat Genet, 58 (2) 445-453. doi:10.1038/s41588-025-02478-7. PMID 41526696

ABSTRACT

Pangenomics is an emerging field that uses collections of genomes, rather than a single reference, to reduce bias and capture intra-species diversity. However, existing pangenomic data formats face challenges in scaling to millions of genomes and primarily emphasize variation, often neglecting the underlying mutational events and evolutionary relationships. This work introduces Pangenome Mutation-Annotated Network (PanMAN), a lossless pangenome representation that achieves compression ratios ranging from 3.5-1,391× in file sizes compared to existing variation-preserving formats, with performance generally improving on larger datasets. In addition to compression, PanMAN increases representational capacity by encoding detailed mutational and evolutionary histories inferred across genomes, thereby enabling new biological insights. Using PanMAN, a comprehensive SARS-CoV-2 pangenome was constructed from 8 million publicly available sequences, requiring only 366 MB of disk space. We also present 'panmanUtils', a toolkit that supports common analyses and ensures interoperability with existing software. PanMAN is poised to greatly improve the scale, speed, resolution and scope of pangenomic analysis and data sharing.

DOI

10.1038/s41588-025-02478-7

PASCAL

Tool

PUBMED_LINK

26808494

FULL NAME

Pathway scoring algorithm

DESCRIPTION

Pascal (Pathway scoring algorithm) is an easy-to-use tool for gene scoring and pathway analysis from GWAS results.

URL

https://www2.unil.ch/cbg/index.php?title=Pascal

TITLE

Fast and Rigorous Computation of Gene and Pathway Scores from SNP-Based Summary Statistics.

Main citation

Lamparter D, Marbach D, Rueedi R, Kutalik Z, ...&, Bergmann S. (2016) Fast and Rigorous Computation of Gene and Pathway Scores from SNP-Based Summary Statistics. PLoS Comput Biol, 12 (1) e1004714. doi:10.1371/journal.pcbi.1004714. PMID 26808494

ABSTRACT

Integrating single nucleotide polymorphism (SNP) p-values from genome-wide association studies (GWAS) across genes and pathways is a strategy to improve statistical power and gain biological insight. Here, we present Pascal (Pathway scoring algorithm), a powerful tool for computing gene and pathway scores from SNP-phenotype association summary statistics. For gene score computation, we implemented analytic and efficient numerical solutions to calculate test statistics. We examined in particular the sum and the maximum of chi-squared statistics, which measure the strongest and the average association signals per gene, respectively. For pathway scoring, we use a modified Fisher method, which offers not only significant power improvement over more traditional enrichment strategies, but also eliminates the problem of arbitrary threshold selection inherent in any binary membership based pathway enrichment approach. We demonstrate the marked increase in power by analyzing summary statistics from dozens of large meta-studies for various traits. Our extensive testing indicates that our method not only excels in rigorous type I error control, but also results in more biologically meaningful discoveries.
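
A simplified picture of the two scoring steps is sketched below. It is not Pascal's implementation. The gene-level function treats SNPs as independent, whereas Pascal accounts for correlation between SNPs, and the pathway-level function is the classical Fisher combination rather than the modified version the abstract describes. All function names are hypothetical.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-squared variable with even df (closed form)."""
    if df % 2 != 0:
        raise ValueError("closed form used here requires even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

def max_stat_pvalue(chi2_stats):
    """Gene p-value from the largest SNP statistic, assuming independent SNPs."""
    p_min = chi2_sf_1df(max(chi2_stats))
    return 1.0 - (1.0 - p_min) ** len(chi2_stats)

def fisher_combine(pvalues):
    """Classical Fisher combination of independent gene p-values."""
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    return chi2_sf_even_df(stat, 2 * len(pvalues))

# Hypothetical pathway of two genes, each with p = 0.05.
pathway_p = fisher_combine([0.05, 0.05])
```

Because the combination uses every gene's p-value, no significance threshold has to be chosen to decide which genes count as pathway members, which is the advantage the abstract highlights over binary enrichment tests.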

DOI

10.1371/journal.pcbi.1004714

PCHAT

Tool

PUBMED_LINK

17922480

FULL NAME

principal component of heritability association test

TITLE

Pleiotropy and principal components of heritability combine to increase power for association analysis.

Main citation

Klei L, Luca D, Devlin B, Roeder K. (2008) Pleiotropy and principal components of heritability combine to increase power for association analysis. Genet Epidemiol, 32 (1) 9-19. doi:10.1002/gepi.20257. PMID 17922480

ABSTRACT

When many correlated traits are measured the potential exists to discover the coordinated control of these traits via genotyped polymorphisms. A common statistical approach to this problem involves assessing the relationship between each phenotype and each single nucleotide polymorphism (SNP) individually (PHN); and taking a Bonferroni correction for the effective number of independent tests conducted. Alternatively, one can apply a dimension reduction technique, such as estimation of principal components, and test for an association with the principal components of the phenotypes (PCP) rather than the individual phenotypes. Building on the work of Lange and colleagues we develop an alternative method based on the principal component of heritability (PCH). For each SNP the PCH approach reduces the phenotypes to a single trait that has a higher heritability than any other linear combination of the phenotypes. As a result, the association between a SNP and derived trait is often easier to detect than an association with any of the individual phenotypes or the PCP. When applied to unrelated subjects, PCH has a drawback. For each SNP it is necessary to estimate the vector of loadings that maximize the heritability over all phenotypes. We develop a method of iterated sample splitting that uses one portion of the data for training and the remainder for testing. This cross-validation approach maintains the type I error control and yet utilizes the data efficiently, resulting in a powerful test for association.

DOI

10.1002/gepi.20257

PennPRS

Tool

DESCRIPTION

PennPRS is a centralized cloud computing platform for efficient polygenic risk score (PRS) model training in precision medicine. Users can either upload their own GWAS summary data or directly query data from public data sources we provide. PennPRS supports both single-ancestry and multi-ancestry PRS training.

URL

https://pennprs.org/

KEYWORDS

PennPRS

PREPRINT_DOI

10.1101/2025.02.07.25321875

Main citation

Jin, J., Li, B., Wang, X., Yang, X., Li, Y., Wang, R., ... & Zhao, B. (2025). PennPRS: a centralized cloud computing platform for efficient polygenic risk score training in precision medicine. medRxiv.

PES

Tool

PUBMED_LINK

31964963

FULL NAME

Pharmagenic enrichment score

DESCRIPTION

PES is a framework to quantify an individual's common variant enrichment in clinically actionable systems responsive to existing drugs.

TITLE

Pharmacological enrichment of polygenic risk for precision medicine in complex disorders.

Main citation

Reay WR, Atkins JR, Carr VJ, Green MJ, ...&, Cairns MJ. (2020) Pharmacological enrichment of polygenic risk for precision medicine in complex disorders. Sci Rep, 10 (1) 879. doi:10.1038/s41598-020-57795-0. PMID 31964963

ABSTRACT

Individuals with complex disorders typically have a heritable burden of common variation that can be expressed as a polygenic risk score (PRS). While PRS has some predictive utility, it lacks the molecular specificity to be directly informative for clinical interventions. We therefore sought to develop a framework to quantify an individual's common variant enrichment in clinically actionable systems responsive to existing drugs. This was achieved with a metric designated the pharmagenic enrichment score (PES), which we demonstrate for individual SNP profiles in a cohort of cases with schizophrenia. A large proportion of these had elevated PES in one or more of eight clinically actionable gene-sets enriched with schizophrenia associated common variation. Notable candidates targeting these pathways included vitamins, antioxidants, insulin modulating agents, and cholinergic drugs. Interestingly, elevated PES was also observed in individuals with otherwise low common variant burden. The biological saliency of PES profiles were observed directly through their impact on gene expression in a subset of the cohort with matched transcriptomic data, supporting our assertion that this gene-set orientated approach could integrate an individual's common variant risk to inform personalised interventions, including drug repositioning, for complex disorders such as schizophrenia.

DOI

10.1038/s41598-020-57795-0

pgBoost

Tool

DESCRIPTION

pgBoost is an integrative modeling framework that trains a non-linear combination of existing linking strategies (including genomic distance) on fine-mapped eQTL data to assign a probabilistic score to each candidate SNP-gene link.

URL

https://github.com/elizabethdorans/pgBoost

KEYWORDS

eQTL-informed gradient boosting

PREPRINT_DOI

10.1101/2024.05.24.24307813

Main citation

Dorans, E. R., Jagadeesh, K., Dey, K., & Price, A. L. (2024). Linking regulatory variants to target genes by integrating single-cell multiome methods and genomic distance. medRxiv, 2024-05.

PGG.Han

Tool

PUBMED_LINK

31584086

URL

https://www.biosino.org/pgghan2/login

TITLE

PGG.Han: the Han Chinese genome database and analysis platform.

Main citation

Gao Y, Zhang C, Yuan L, Ling Y, ...&, Xu S. (2020) PGG.Han: the Han Chinese genome database and analysis platform. Nucleic Acids Res, 48 (D1) D971-D976. doi:10.1093/nar/gkz829. PMID 31584086

ABSTRACT

As the largest ethnic group in the world, the Han Chinese population is nonetheless underrepresented in global efforts to catalogue the genomic variability of natural populations. Here, we developed the PGG.Han, a population genome database to serve as the central repository for the genomic data of the Han Chinese Genome Initiative (Phase I). In its current version, the PGG.Han archives whole-genome sequences or high-density genome-wide single-nucleotide variants (SNVs) of 114 783 Han Chinese individuals (a.k.a. the Han100K), representing geographical sub-populations covering 33 of the 34 administrative divisions of China, as well as Singapore. The PGG.Han provides: (i) an interactive interface for visualization of the fine-scale genetic structure of the Han Chinese population; (ii) genome-wide allele frequencies of hierarchical sub-populations; (iii) ancestry inference for individual samples and controlling population stratification based on nested ancestry informative markers (AIMs) panels; (iv) population-structure-aware shared control data for genotype-phenotype association studies (e.g. GWASs) and (v) a Han-Chinese-specific reference panel for genotype imputation. Computational tools are implemented into the PGG.Han, and an online user-friendly interface is provided for data analysis and results visualization. The PGG.Han database is freely accessible via http://www.pgghan.org or https://www.hanchinesegenomes.org.

DOI

10.1093/nar/gkz829

PGG.Han panel (PGG.Han)

Tool

PUBMED_LINK

31584086

URL

https://www.biosino.org/pgghan2/index#home1

TITLE

PGG.Han: the Han Chinese genome database and analysis platform.

Main citation

Gao Y, Zhang C, Yuan L, Ling Y, ...&, Xu S. (2020) PGG.Han: the Han Chinese genome database and analysis platform. Nucleic Acids Res, 48 (D1) D971-D976. doi:10.1093/nar/gkz829. PMID 31584086

ABSTRACT

As the largest ethnic group in the world, the Han Chinese population is nonetheless underrepresented in global efforts to catalogue the genomic variability of natural populations. Here, we developed the PGG.Han, a population genome database to serve as the central repository for the genomic data of the Han Chinese Genome Initiative (Phase I). In its current version, the PGG.Han archives whole-genome sequences or high-density genome-wide single-nucleotide variants (SNVs) of 114 783 Han Chinese individuals (a.k.a. the Han100K), representing geographical sub-populations covering 33 of the 34 administrative divisions of China, as well as Singapore. The PGG.Han provides: (i) an interactive interface for visualization of the fine-scale genetic structure of the Han Chinese population; (ii) genome-wide allele frequencies of hierarchical sub-populations; (iii) ancestry inference for individual samples and controlling population stratification based on nested ancestry informative markers (AIMs) panels; (iv) population-structure-aware shared control data for genotype-phenotype association studies (e.g. GWASs) and (v) a Han-Chinese-specific reference panel for genotype imputation. Computational tools are implemented into the PGG.Han, and an online user-friendly interface is provided for data analysis and results visualization. The PGG.Han database is freely accessible via http://www.pgghan.org or https://www.hanchinesegenomes.org.

DOI

10.1093/nar/gkz829

PGS-adjusted GWAS

Tool

PUBMED_LINK

37723263

DESCRIPTION

adjustment of GWAS analyses for polygenic scores (PGSs) increases the statistical power for discovery across all ancestries

KEYWORDS

LOCO-PGSs, two-stage meta-analysis strategy

TITLE

Boosting the power of genome-wide association studies within and across ancestries by using polygenic scores.

Main citation

Campos AI, Namba S, Lin SC, Nam K, ...&, Yengo L. (2023) Boosting the power of genome-wide association studies within and across ancestries by using polygenic scores. Nat Genet, 55 (10) 1769-1776. doi:10.1038/s41588-023-01500-0. PMID 37723263

ABSTRACT

Genome-wide association studies (GWASs) have been mostly conducted in populations of European ancestry, which currently limits the transferability of their findings to other populations. Here, we show, through theory, simulations and applications to real data, that adjustment of GWAS analyses for polygenic scores (PGSs) increases the statistical power for discovery across all ancestries. We applied this method to analyze seven traits available in three large biobanks with participants of East Asian ancestry (n = 340,000 in total) and report 139 additional associations across traits. We also present a two-stage meta-analysis strategy whereby, in contributing cohorts, a PGS-adjusted GWAS is rerun using PGSs derived from a first round of a standard meta-analysis. On average, across traits, this approach yields a 1.26-fold increase in the number of detected associations (range 1.07- to 1.76-fold increase). Altogether, our study demonstrates the value of using PGSs to increase the power of GWASs in underrepresented populations and promotes such an analytical strategy for future GWAS meta-analyses.

DOI

10.1038/s41588-023-01500-0
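
The adjustment described in the abstract above can be illustrated with a small simulation. The sketch below is not the authors' implementation: it is a minimal, standard-library example that assumes one quantitative trait, one tested SNP, and a PGS that is independent of that SNP (the situation a leave-one-chromosome-out PGS is meant to approximate). The function names `residualize` and `snp_test` and all simulation parameters are hypothetical.

```python
import random
from statistics import fmean


def residualize(v, c):
    """Residuals of v after OLS regression on covariate c (with intercept)."""
    mv, mc = fmean(v), fmean(c)
    scc = sum((ci - mc) ** 2 for ci in c)
    slope = sum((ci - mc) * (vi - mv) for ci, vi in zip(c, v)) / scc
    return [(vi - mv) - slope * (ci - mc) for vi, ci in zip(v, c)]


def snp_test(y, g, covariate=None):
    """Return (beta, se) for the SNP effect, optionally adjusting for one covariate.

    Uses the Frisch-Waugh-Lovell identity: regressing residualized y on
    residualized g gives the same SNP coefficient as the full regression.
    """
    n = len(y)
    if covariate is None:
        my, mg = fmean(y), fmean(g)
        yr = [yi - my for yi in y]
        gr = [gi - mg for gi in g]
        n_params = 2  # intercept + SNP
    else:
        yr = residualize(y, covariate)
        gr = residualize(g, covariate)
        n_params = 3  # intercept + SNP + covariate
    sgg = sum(x * x for x in gr)
    beta = sum(a * b for a, b in zip(gr, yr)) / sgg
    rss = sum((a - beta * b) ** 2 for a, b in zip(yr, gr))
    se = (rss / (n - n_params) / sgg) ** 0.5
    return beta, se


# Simulated data (assumed values): SNP effect 0.1, PGS effect 1.0, unit noise.
rng = random.Random(0)
n = 2000
g = [sum(rng.random() < 0.3 for _ in range(2)) for _ in range(n)]  # genotypes 0/1/2
pgs = [rng.gauss(0, 1) for _ in range(n)]
y = [0.1 * gi + 1.0 * pi + rng.gauss(0, 1) for gi, pi in zip(g, pgs)]

beta_raw, se_raw = snp_test(y, g)
beta_adj, se_adj = snp_test(y, g, covariate=pgs)
print(f"unadjusted:   beta={beta_raw:.3f} se={se_raw:.4f}")
print(f"PGS-adjusted: beta={beta_adj:.3f} se={se_adj:.4f}")
```

Because the PGS absorbs phenotypic variance that is unrelated to the tested SNP, the adjusted standard error comes out smaller than the unadjusted one in this simulation, which is where the gain in power comes from.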

PGS-adjusted RVATs

Tool

PUBMED_LINK

36959364

FULL NAME

PGS-adjusted rare variant association tests

DESCRIPTION

adjusting for common variant polygenic scores improves yield in gene-based rare variant association tests

KEYWORDS

PGS, Rare variants

TITLE

Adjusting for common variant polygenic scores improves yield in rare variant association analyses.

Main citation

Jurgens SJ, Pirruccello JP, Choi SH, Morrill VN, ...&, Ellinor PT. (2023) Adjusting for common variant polygenic scores improves yield in rare variant association analyses. Nat Genet, 55 (4) 544-548. doi:10.1038/s41588-023-01342-w. PMID 36959364

ABSTRACT

With the emergence of large-scale sequencing data, methods for improving power in rare variant association tests are needed. Here we show that adjusting for common variant polygenic scores improves yield in gene-based rare variant association tests across 65 quantitative traits in the UK Biobank (up to 20% increase at α = 2.6 × 10⁻⁶), without marked increases in false-positive rates or genomic inflation. Benefits were seen for various models, with the largest improvements seen for efficient sparse mixed-effects models. Our results illustrate how polygenic score adjustment can efficiently improve power in rare variant association discovery.

DOI

10.1038/s41588-023-01342-w

PGS-hub

Tool

PUBMED_LINK

41580418

DESCRIPTION

PGS-hub platform features the deployment of eight single-ancestry PGS algorithms and two multi-ancestry PGS algorithms, providing comprehensive and versatile tools for genetic risk assessment.

URL

https://ngdc.cncb.ac.cn/pgs-hub/

KEYWORDS

PGS-hub

TITLE

Comprehensive benchmarking single and multi ancestry polygenic score methods with the PGS-hub platform.

Main citation

Chen X, Wang F, Zhao H, Hao J, ...&, Wang M. (2026) Comprehensive benchmarking single and multi ancestry polygenic score methods with the PGS-hub platform. Nat Commun, 17 (1) . doi:10.1038/s41467-026-68599-7. PMID 41580418

ABSTRACT

Polygenic scores (PGS) quantify genetic contributions to complex traits, yet existing single- and multi-ancestry methods lack multi-dimensional evaluation within a unified framework. Here, we benchmarked 13 state-of-the-art PGS methods across 36 traits in UK Biobank European and African samples. The prediction performance, computational efficiency, the number of variants, and the impact of different linkage disequilibrium (LD) reference sizes were thoroughly assessed for each method. Results of single-ancestry methods demonstrate that LDpred2 has superior performance across a broad spectrum of complex traits in terms of accuracy and computational efficiency; however, other methods remain valuable for specific traits. For multi-ancestry methods, PRS-CSx and X-Wing have comparable performance, whereas LDpred2-multi outperforms both. Notably, we find that increasing the panel size of the LD reference significantly elevates PGS performance for sample sizes below 1,000, and it reaches a plateau when it exceeds 5,000 samples. Furthermore, implementing PGS calculation methods requires considerable technical effort and resource allocation. To support easy use of these PGS methods, we developed a user-friendly online computing platform, PGS-hub, that integrates all evaluated methods and is pre-configured with ancestry-stratified LD panels. This resource enables a scalable and harmonized PGS computation platform for the PGS community.

DOI

10.1038/s41467-026-68599-7

pgsc_calc

Tool

FULL NAME

The Polygenic Score Catalog Calculator

DESCRIPTION

pgsc_calc is a bioinformatics best-practice analysis pipeline for calculating polygenic [risk] scores on samples with imputed genotypes using existing scoring files from the Polygenic Score (PGS) Catalog and/or user-defined PGS/PRS.

URL

https://github.com/PGScatalog/pgsc_calc

KEYWORDS

PRS calculation pipeline

Main citation

Lambert, Wingfield et al. (2024) The Polygenic Score Catalog: new functionality and tools to enable FAIR research. medRxiv. doi:10.1101/2024.05.29.24307783.

PGSCatalog (PGS Catalog)

Tool

PUBMED_LINK

33692568

FULL NAME

PGS Catalog

DESCRIPTION

The PGS Catalog is an open database of published polygenic scores (PGS). Each PGS in the Catalog is consistently annotated with relevant metadata; including scoring files (variants, effect alleles/weights), annotations of how the PGS was developed and applied, and evaluations of their predictive performance.

URL

https://www.pgscatalog.org/

KEYWORDS

PGS database

TITLE

The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation.

Main citation

Lambert SA, Gil L, Jupp S, Ritchie SC, ...&, Inouye M. (2021) The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation. Nat Genet, 53 (4) 420-425. doi:10.1038/s41588-021-00783-5. PMID 33692568

ABSTRACT

We present the Polygenic Score (PGS) Catalog (https://www.PGSCatalog.org), an open resource of published scores (including variants, alleles and weights) and consistently curated metadata required for reproducibility and independent applications. The PGS Catalog has capabilities for user deposition, expert curation and programmatic access, thus providing the community with a platform for PGS dissemination, research and translation.

DOI

10.1038/s41588-021-00783-5

PGSFusion

Tool

DESCRIPTION

PGSFusion is a free, comprehensive webserver for constructing polygenic scores (PGS), evaluating their performance, and unlocking epidemiological insights. The server implements 16 leading summary statistics-based PGS methods in a standardized interface and rigorously assesses their predictive capabilities using the UK Biobank dataset.

URL

http://www.pgsfusion.net/#/

PREPRINT_DOI

10.1101/2024.08.05.606619

Main citation

Yang, S., Ye, X., Ji, X., Li, Z., Tian, M., Huang, P., & Cao, C. (2024). PGSFusion streamlines polygenic score construction and epidemiological applications in biobank-scale cohorts. bioRxiv, 2024-08.

pheweb

Tool

PUBMED_LINK

32504056

URL

https://github.com/statgen/pheweb

TITLE

Exploring and visualizing large-scale genetic associations by using PheWeb.

Main citation

Gagliano Taliun SA, VandeHaar P, Boughton AP, Welch RP, ...&, Abecasis GR. (2020) Exploring and visualizing large-scale genetic associations by using PheWeb. Nat Genet, 52 (6) 550-552. doi:10.1038/s41588-020-0622-5. PMID 32504056

DOI

10.1038/s41588-020-0622-5

PHLASH

Tool

FULL NAME

Population History Learning by Averaging Sampled Histories

DESCRIPTION

PHLASH is a Bayesian method for inferring population size history from whole-genome sequence data using a coalescent-based hidden Markov model. It provides accurate and adaptive estimates with automatic uncertainty quantification, leveraging GPU acceleration for efficiency. It outperforms existing tools like SMC++ and MSMC2 in accuracy and computational speed, particularly with large sample sizes.

URL

https://github.com/jthlab/phlash

KEYWORDS

population size inference, Bayesian demographic inference, coalescent model, ancestral recombination graphs, whole-genome sequencing, GPU acceleration

Main citation

Terhorst J. Accelerated Bayesian inference of population size history from recombining sequence data. Nat Genet. 2025; DOI: 10.1038/s41588-025-02323-x.

DOI

10.1038/s41588-025-02323-x

ARROW_SUMMARY

Whole-genome sequence data → Bayesian coalescent model with gradient-based likelihood evaluation → Posterior distribution of population size history with uncertainty quantification

AI_GENERATED

1.0

PLINK

Tool

PUBMED_LINK

17701901

DESCRIPTION

A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner. The focus of PLINK is purely on analysis of genotype/phenotype data, so there is no support for steps prior to this (e.g. study design and planning, generating genotype or CNV calls from raw data). Through integration with gPLINK and Haploview, there is some support for the subsequent visualization, annotation and storage of results.

URL

https://www.cog-genomics.org/plink/

TITLE

PLINK: a tool set for whole-genome association and population-based linkage analyses.

Main citation

Purcell S, Neale B, Todd-Brown K, Thomas L, ...&, Sham PC. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet, 81 (3) 559-75. doi:10.1086/519795. PMID 17701901

ABSTRACT

Whole-genome association studies (WGAS) bring new computational, as well as analytic, challenges to researchers. Many existing genetic-analysis tools are not designed to handle such large data sets in a convenient manner and do not necessarily exploit the new opportunities that whole-genome data bring. To address these issues, we developed PLINK, an open-source C/C++ WGAS tool set. With PLINK, large data sets comprising hundreds of thousands of markers genotyped for thousands of individuals can be rapidly manipulated and analyzed in their entirety. As well as providing tools to make the basic analytic steps computationally efficient, PLINK also supports some novel approaches to whole-genome data that take advantage of whole-genome coverage. We introduce PLINK and describe the five main domains of function: data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation. In particular, we focus on the estimation and use of identity-by-state and identity-by-descent information in the context of population-based whole-genome studies. This information can be used to detect and correct for population stratification and to identify extended chromosomal segments that are shared identical by descent between very distantly related individuals. Analysis of the patterns of segmental sharing has the potential to map disease loci that contain multiple rare variants in a population-based linkage analysis.

DOI

10.1086/519795

PLINK-MDS (MDS)

Tool

PUBMED_LINK

17701901

FULL NAME

multidimensional scaling

URL

https://www.cog-genomics.org/plink/1.9/strat

KEYWORDS

MDS

TITLE

PLINK: a tool set for whole-genome association and population-based linkage analyses.

Main citation

Purcell S, Neale B, Todd-Brown K, Thomas L, ...&, Sham PC. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet, 81 (3) 559-75. doi:10.1086/519795. PMID 17701901

ABSTRACT

Whole-genome association studies (WGAS) bring new computational, as well as analytic, challenges to researchers. Many existing genetic-analysis tools are not designed to handle such large data sets in a convenient manner and do not necessarily exploit the new opportunities that whole-genome data bring. To address these issues, we developed PLINK, an open-source C/C++ WGAS tool set. With PLINK, large data sets comprising hundreds of thousands of markers genotyped for thousands of individuals can be rapidly manipulated and analyzed in their entirety. As well as providing tools to make the basic analytic steps computationally efficient, PLINK also supports some novel approaches to whole-genome data that take advantage of whole-genome coverage. We introduce PLINK and describe the five main domains of function: data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation. In particular, we focus on the estimation and use of identity-by-state and identity-by-descent information in the context of population-based whole-genome studies. This information can be used to detect and correct for population stratification and to identify extended chromosomal segments that are shared identical by descent between very distantly related individuals. Analysis of the patterns of segmental sharing has the potential to map disease loci that contain multiple rare variants in a population-based linkage analysis.

DOI

10.1086/519795

PLINK1.9

Tool

PUBMED_LINK

17701901

DESCRIPTION

PLINK: a tool set for whole-genome association and population-based linkage analyses. This is a comprehensive update to Shaun Purcell's PLINK command-line program, developed by Christopher Chang with support from the NIH-NIDDK's Laboratory of Biological Modeling, the Purcell Lab, and others.

URL

https://www.cog-genomics.org/plink/1.9/

TITLE

PLINK: a tool set for whole-genome association and population-based linkage analyses.

Main citation

Purcell S, Neale B, Todd-Brown K, Thomas L, ...&, Sham PC. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet, 81 (3) 559-75. doi:10.1086/519795. PMID 17701901

ABSTRACT

Whole-genome association studies (WGAS) bring new computational, as well as analytic, challenges to researchers. Many existing genetic-analysis tools are not designed to handle such large data sets in a convenient manner and do not necessarily exploit the new opportunities that whole-genome data bring. To address these issues, we developed PLINK, an open-source C/C++ WGAS tool set. With PLINK, large data sets comprising hundreds of thousands of markers genotyped for thousands of individuals can be rapidly manipulated and analyzed in their entirety. As well as providing tools to make the basic analytic steps computationally efficient, PLINK also supports some novel approaches to whole-genome data that take advantage of whole-genome coverage. We introduce PLINK and describe the five main domains of function: data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation. In particular, we focus on the estimation and use of identity-by-state and identity-by-descent information in the context of population-based whole-genome studies. This information can be used to detect and correct for population stratification and to identify extended chromosomal segments that are shared identical by descent between very distantly related individuals. Analysis of the patterns of segmental sharing has the potential to map disease loci that contain multiple rare variants in a population-based linkage analysis.

DOI

10.1086/519795

PLINK2

Tool

PUBMED_LINK

25722852

URL

https://www.cog-genomics.org/plink/2.0/

TITLE

Second-generation PLINK: rising to the challenge of larger and richer datasets.

Main citation

Chang CC, Chow CC, Tellier LC, Vattikuti S, ...&, Lee JJ. (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience, 4 () 7. doi:10.1186/s13742-015-0047-8. PMID 25722852

ABSTRACT

BACKGROUND: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format. FINDINGS: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, [Formula: see text]-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0). CONCLUSIONS: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.

DOI

10.1186/s13742-015-0047-8

PLINK2

Tool

PUBMED_LINK

25722852

DESCRIPTION

The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.

URL

https://www.cog-genomics.org/plink/2.0/

TITLE

Second-generation PLINK: rising to the challenge of larger and richer datasets.

Main citation

Chang CC, Chow CC, Tellier LC, Vattikuti S, ...&, Lee JJ. (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience, 4 () 7. doi:10.1186/s13742-015-0047-8. PMID 25722852

ABSTRACT

BACKGROUND: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format. FINDINGS: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, [Formula: see text]-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0). CONCLUSIONS: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.

DOI

10.1186/s13742-015-0047-8

PLINK2

Tool

PUBMED_LINK

25722852

DESCRIPTION

The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility.

URL

https://www.cog-genomics.org/plink/2.0/

USE

calculate PRS using genotype data.

TITLE

Second-generation PLINK: rising to the challenge of larger and richer datasets.

Main citation

Chang CC, Chow CC, Tellier LC, Vattikuti S, ...&, Lee JJ. (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience, 4 () 7. doi:10.1186/s13742-015-0047-8. PMID 25722852

ABSTRACT

BACKGROUND: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format. FINDINGS: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, [Formula: see text]-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0). CONCLUSIONS: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.

DOI

10.1186/s13742-015-0047-8
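
The USE note in this entry (calculating a PRS from genotype data) comes down to a weighted sum of effect-allele dosages. The sketch below shows that arithmetic for a single sample in plain Python. It is an illustration only, not PLINK code: the function name, variant IDs, and weights are made up for the example. In PLINK itself, this kind of linear scoring is done with the `--score` command.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages for one sample.

    dosages: dict mapping variant ID -> effect-allele dosage (0 to 2)
    weights: dict mapping variant ID -> per-allele effect weight
    Returns (score, number of variants actually used).
    Variants missing from the genotype data are skipped, not imputed.
    """
    shared = [v for v in weights if v in dosages]
    score = sum(weights[v] * dosages[v] for v in shared)
    return score, len(shared)


# Hypothetical scoring file and genotypes.
weights = {"rs1": 0.5, "rs2": -0.25, "rs3": 0.1}
dosages = {"rs1": 2, "rs2": 1, "rs4": 0}

score, n_used = polygenic_score(dosages, weights)
print(score, n_used)  # 0.75 2
```

Only `rs1` and `rs2` appear in both inputs, so the score is 0.5 × 2 + (−0.25) × 1 = 0.75. Reporting the number of matched variants alongside the score makes it easier to spot poor overlap between a scoring file and the genotype data.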

POLMM

Tool

PUBMED_LINK

33836139

FULL NAME

proportional odds logistic mixed model (POLMM)

DESCRIPTION

Proportional Odds Logistic Mixed Model (POLMM) for ordinal categorical data analysis

URL

https://github.com/WenjianBI/POLMM

KEYWORDS

ordinal categorical phenotypes

TITLE

Efficient mixed model approach for large-scale genome-wide association studies of ordinal categorical phenotypes.

Main citation

Bi W, Zhou W, Dey R, Mukherjee B, ...&, Lee S. (2021) Efficient mixed model approach for large-scale genome-wide association studies of ordinal categorical phenotypes. Am J Hum Genet, 108 (5) 825-839. doi:10.1016/j.ajhg.2021.03.019. PMID 33836139

ABSTRACT

In genome-wide association studies, ordinal categorical phenotypes are widely used to measure human behaviors, satisfaction, and preferences. However, because of the lack of analysis tools, methods designed for binary or quantitative traits are commonly used inappropriately to analyze categorical phenotypes. To accurately model the dependence of an ordinal categorical phenotype on covariates, we propose an efficient mixed model association test, proportional odds logistic mixed model (POLMM). POLMM is computationally efficient to analyze large datasets with hundreds of thousands of samples, can control type I error rates at a stringent significance level regardless of the phenotypic distribution, and is more powerful than alternative methods. In contrast, the standard linear mixed model approaches cannot control type I error rates for rare variants when the phenotypic distribution is unbalanced, although they performed well when testing common variants. We applied POLMM to 258 ordinal categorical phenotypes on array genotypes and imputed samples from 408,961 individuals in UK Biobank. In total, we identified 5,885 genome-wide significant variants, of which, 424 variants (7.2%) are rare variants with MAF < 0.01.

DOI

10.1016/j.ajhg.2021.03.019
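
For readers unfamiliar with the proportional odds model that POLMM builds on, the sketch below computes category probabilities for an ordinal phenotype from a linear predictor and a set of ordered cutpoints. It covers only the generic fixed-effects part of such a model, under the assumed convention logit P(Y ≤ j) = cutpoint_j − eta. It omits the random effect and the association test and is not taken from the POLMM code; the function name and example values are hypothetical.

```python
import math


def category_probs(eta, cutpoints):
    """Category probabilities under a proportional odds model.

    Assumed convention: logit P(Y <= j) = cutpoints[j] - eta,
    with cutpoints strictly increasing. Returns len(cutpoints) + 1
    probabilities, one per ordinal category.
    """
    # Cumulative probabilities P(Y <= j); the last category closes at 1.
    cumulative = [1.0 / (1.0 + math.exp(-(c - eta))) for c in cutpoints] + [1.0]
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs


# Three ordered categories (e.g. low / medium / high), two cutpoints.
cutpoints = [-1.0, 1.0]
baseline = category_probs(0.0, cutpoints)
shifted = category_probs(1.0, cutpoints)  # larger linear predictor
print(baseline)
print(shifted)
```

Under this convention a larger linear predictor moves probability mass toward higher categories while the cutpoints stay fixed, which is the "proportional odds" assumption: one covariate effect is shared across all category thresholds.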

POP-GWAS

Tool

PUBMED_LINK

39349818

FULL NAME

Post-Prediction GWAS

DESCRIPTION

POP-TOOLS (POst-Prediction TOOLS) is a Python3-based command line toolkit for conducting valid and powerful machine learning (ML)-assisted genetic association studies.

URL

https://github.com/qlu-lab/POP-TOOLS

KEYWORDS

imputed phenotypes, 3 GWASs

TITLE

Valid inference for machine learning-assisted genome-wide association studies.

Main citation

Miao J, Wu Y, Sun Z, Miao X, ...&, Lu Q. (2024) Valid inference for machine learning-assisted genome-wide association studies. Nat Genet, 56 (11) 2361-2369. doi:10.1038/s41588-024-01934-0. PMID 39349818

ABSTRACT

Machine learning (ML) has become increasingly popular in almost all scientific disciplines, including human genetics. Owing to challenges related to sample collection and precise phenotyping, ML-assisted genome-wide association study (GWAS), which uses sophisticated ML techniques to impute phenotypes and then performs GWAS on the imputed outcomes, have become increasingly common in complex trait genetics research. However, the validity of ML-assisted GWAS associations has not been carefully evaluated. Here, we report pervasive risks for false-positive associations in ML-assisted GWAS and introduce Post-Prediction GWAS (POP-GWAS), a statistical framework that redesigns GWAS on ML-imputed outcomes. POP-GWAS ensures valid and powerful statistical inference irrespective of imputation quality and choice of algorithm, requiring only GWAS summary statistics as input. We employed POP-GWAS to perform a GWAS of bone mineral density derived from dual-energy X-ray absorptiometry imaging at 14 skeletal sites, identifying 89 new loci and revealing skeletal site-specific genetic architecture. Our framework offers a robust analytic solution for future ML-assisted GWAS.

DOI

10.1038/s41588-024-01934-0

popcorn

Tool

PUBMED_LINK

27321947

DESCRIPTION

Popcorn is a program for estimating the correlation of causal variant effect sizes across populations in GWAS. This is the python3 version of Popcorn and is still under development.

URL

https://github.com/brielin/Popcorn

KEYWORDS

trans-ethnic

TITLE

Transethnic Genetic-Correlation Estimates from Summary Statistics.

Main citation

Brown BC, Asian Genetic Epidemiology Network Type 2 Diabetes Consortium, Ye CJ, Price AL, ...&, Zaitlen N. (2016) Transethnic Genetic-Correlation Estimates from Summary Statistics. Am J Hum Genet, 99 (1) 76-88. doi:10.1016/j.ajhg.2016.05.001. PMID 27321947

ABSTRACT

The increasing number of genetic association studies conducted in multiple populations provides an unprecedented opportunity to study how the genetic architecture of complex phenotypes varies between populations, a problem important for both medical and population genetics. Here, we have developed a method for estimating the transethnic genetic correlation: the correlation of causal-variant effect sizes at SNPs common in populations. This method takes advantage of the entire spectrum of SNP associations and uses only summary-level data from genome-wide association studies. This avoids the computational costs and privacy concerns associated with genotype-level information while remaining scalable to hundreds of thousands of individuals and millions of SNPs. We applied our method to data on gene expression, rheumatoid arthritis, and type 2 diabetes and overwhelmingly found that the genetic correlation was significantly less than 1. Our method is implemented in a Python package called Popcorn.

DOI

10.1016/j.ajhg.2016.05.001

popEVE

Tool

PUBMED_LINK

41286104

DESCRIPTION

popEVE is a proteome-wide deep generative model that scores missense variant pathogenicity by combining cross-species evolutionary predictors with human population cohort data, aiming for well-calibrated, human-specific deleteriousness estimates.

URL

https://github.com/debbiemarkslab/popEVE

KEYWORDS

missense, pathogenicity, deep generative model, proteome-wide, UK Biobank

TITLE

Proteome-wide model for human disease genetics.

Main citation

Orenbuch R, Shearer CA, Kollasch AW, Spinner AD, ...&, Marks DS. (2025) Proteome-wide model for human disease genetics. Nat Genet, 57 (12) 3165-3174. doi:10.1038/s41588-025-02400-1. PMID 41286104

ABSTRACT

Missense variants remain a challenge in genetic interpretation owing to their subtle and context-dependent effects. Although current prediction models perform well in known disease genes, their scores are not calibrated across the proteome, limiting generalizability. To address this knowledge gap, we developed popEVE, a deep generative model combining evolutionary and human population data to estimate variant deleteriousness on a proteome-wide scale. popEVE achieves state-of-the-art performance without overestimating the burden of deleterious variants and identifies variants in 442 genes in a severe developmental disorder cohort, including 123 novel candidates. These genes are functionally similar to known disease genes, and their variants often localize to critical regions. Remarkably, popEVE can prioritize likely causal variants using only child exomes, enabling diagnosis even without parental sequencing. This work provides a generalizable framework for rare disease variant interpretation, especially in singleton cases, and demonstrates the utility of calibrated, evolution-informed scoring models for clinical genomics.

DOI

10.1038/s41588-025-02400-1

popgen

Tool

PUBMED_LINK

27742697

FULL NAME

Geography of Genetic Variants Browser

URL

https://popgen.uchicago.edu/ggv/

TITLE

Visualizing the geography of genetic variants.

Main citation

Marcus JH, Novembre J. (2017) Visualizing the geography of genetic variants. Bioinformatics, 33 (4) 594-595. doi:10.1093/bioinformatics/btw643. PMID 27742697

ABSTRACT

SUMMARY: One of the key characteristics of any genetic variant is its geographic distribution. The geographic distribution can shed light on where an allele first arose, what populations it has spread to, and in turn on how migration, genetic drift, and natural selection have acted. The geographic distribution of a genetic variant can also be of great utility for medical/clinical geneticists and collectively many genetic variants can reveal population structure. Here we develop an interactive visualization tool for rapidly displaying the geographic distribution of genetic variants. Through a REST API and dynamic front-end, the Geography of Genetic Variants (GGV) browser ( http://popgen.uchicago.edu/ggv/ ) provides maps of allele frequencies in populations distributed across the globe. AVAILABILITY AND IMPLEMENTATION: GGV is implemented as a website ( http://popgen.uchicago.edu/ggv/ ) which employs an API to access frequency data ( http://popgen.uchicago.edu/freq_api/ ). Python and javascript source code for the website and the API are available at: http://github.com/NovembreLab/ggv/ and http://github.com/NovembreLab/ggv-api/ . CONTACT: jnovembre@uchicago.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

DOI

10.1093/bioinformatics/btw643

PoPs

Tool

PUBMED_LINK

37443254

FULL NAME

gene-level Polygenic Priority Score (PoPS)

DESCRIPTION

PoPS is a gene prioritization method that leverages genome-wide signal from GWAS summary statistics and incorporates data from an extensive set of public bulk and single-cell expression datasets, curated biological pathways, and predicted protein-protein interactions.

URL

https://github.com/FinucaneLab/pops

TITLE

Leveraging polygenic enrichments of gene features to predict genes underlying complex traits and diseases.

Main citation

Weeks EM, Ulirsch JC, Cheng NY, Trippe BL, ...&, Finucane HK. (2023) Leveraging polygenic enrichments of gene features to predict genes underlying complex traits and diseases. Nat Genet, 55 (8) 1267-1276. doi:10.1038/s41588-023-01443-6. PMID 37443254

ABSTRACT

Genome-wide association studies (GWASs) are a valuable tool for understanding the biology of complex human traits and diseases, but associated variants rarely point directly to causal genes. In the present study, we introduce a new method, polygenic priority score (PoPS), that learns trait-relevant gene features, such as cell-type-specific expression, to prioritize genes at GWAS loci. Using a large evaluation set of genes with fine-mapped coding variants, we show that PoPS and the closest gene individually outperform other gene prioritization methods, but observe the best overall performance by combining PoPS with orthogonal methods. Using this combined approach, we prioritize 10,642 unique gene-trait pairs across 113 complex traits and diseases with high precision, finding not only well-established gene-trait relationships but nominating new genes at unresolved loci, such as LGR4 for estimated glomerular filtration rate and CCR7 for deep vein thrombosis. Overall, we demonstrate that PoPS provides a powerful addition to the gene prioritization toolbox.

DOI

10.1038/s41588-023-01443-6

Porter

Tool

PUBMED_LINK

28287610

TITLE

Multivariate simulation framework reveals performance of multi-trait GWAS methods.

Main citation

Porter HF, O'Reilly PF. (2017) Multivariate simulation framework reveals performance of multi-trait GWAS methods. Sci Rep, 7 () 38837. doi:10.1038/srep38837. PMID 28287610

ABSTRACT

Burgeoning availability of genome-wide association study (GWAS) results and national biobank data has led to growing interest in performing multi-trait genetic analyses. Numerous multi-trait GWAS methods that exploit either summary statistics or individual-level data have been developed, but their relative performance is unclear. Here we develop a simulation framework to model the complex networks underlying multivariate genetic epidemiology, enabling the vast model space of genetic effects on multiple correlated traits to be explored systematically. We perform a comprehensive comparison of the leading multi-trait GWAS methods, finding: (1) method performance is highly sensitive to the specific combination of genetic effects and phenotypic correlations, (2) most of the current multivariate methods have remarkably similar statistical power, and (3) multivariate methods may offer a substantial increase in the discovery of genetic variants over the standard univariate approach. We believe our findings offer the clearest picture to date of the relative performance of multi-trait GWAS methods and act as a guide for method selection. We provide a web application and open-source software program implementing our simulation framework, for: (i) further benchmarking of multivariate GWAS methods, (ii) power calculations for multivariate genetic studies, and (iii) generating data for testing any multivariate method in genetic epidemiology.

DOI

10.1038/srep38837

PP-GWAS

GWAS, Privacy-preserving GWAS, Tool, Summary statistics

PUBMED_LINK

41365878

DESCRIPTION

Privacy-preserving framework for multi-site GWAS on quantitative traits using a distributed linear mixed model and randomized encoding so servers never see raw genotypes or phenotypes—only obfuscated intermediates—while improving speed versus several cryptographic baselines.

URL

https://github.com/mdppml/PP-GWAS, https://doi.org/10.1038/s41467-025-66771-z

KEYWORDS

Privacy-preserving GWAS, multi-site, quantitative traits, federated analysis

TITLE

PP-GWAS: Privacy Preserving Multi-Site Genome-wide Association Studies.

Main citation

Swaminathan A, Hannemann A, Ünal AB, Pfeifer N, ...&, Akgün M. (2025) PP-GWAS: Privacy Preserving Multi-Site Genome-wide Association Studies. Nat Commun, 16 (1) 11030. doi:10.1038/s41467-025-66771-z. PMID 41365878

ABSTRACT

Genome-wide association studies help uncover genetic influences on complex traits and diseases. Importantly, multi-site data collaborations enhance the statistical power of these studies but pose challenges due to the sensitivity of genomic data. Existing privacy-preserving approaches to performing multi-site genome-wide association studies rely on computationally expensive cryptographic techniques, which limit applicability. To address this, we present PP-GWAS, a privacy-preserving algorithm that improves efficiency and scalability while maintaining data privacy. Our method leverages randomized encoding within a distributed framework to perform stacked ridge regression on a linear mixed model, enabling robust analysis of quantitative phenotypes. We show experimentally using real-world and synthetic data that our approach achieves twice the computational speed of comparable methods while reducing resource consumption.

DOI

10.1038/s41467-025-66771-z

PredInterval

Tool

PUBMED_LINK

41083720

DESCRIPTION

PredInterval constructs statistically calibrated prediction intervals for phenotypes predicted from polygenic scores, compatible with arbitrary PGS methods and supporting individual-level data or GWAS summary statistics plus a small calibration sample.

URL

https://github.com/xuchang0201/PredInterval

KEYWORDS

polygenic score, prediction interval, uncertainty, calibration, summary statistics

TITLE

Statistical construction of calibrated prediction intervals for polygenic score-based phenotype prediction.

Main citation

Xu C, Ganesh SK, Zhou X. (2025) Statistical construction of calibrated prediction intervals for polygenic score-based phenotype prediction. Nat Genet, 57 (11) 2891-2900. doi:10.1038/s41588-025-02360-6. PMID 41083720

ABSTRACT

Accurately quantifying uncertainty in predicted phenotypes from polygenic score (PGS)-based applications is essential for reliable clinical interpretation of PGS, supporting effective disease risk assessment and informed decision-making. Here, we present PredInterval, a nonparametric method for constructing well-calibrated prediction intervals. PredInterval is compatible with any PGS method, takes either individual-level data or summary statistics as input and relies on information from quantiles of phenotypic residuals through cross-validation to achieve well-calibrated coverage of true phenotypic values across diverse genetic architectures. We apply PredInterval to analyze 17 traits in real-data applications, where PredInterval not only represents the sole method achieving well-calibrated prediction coverage across traits, but it also offers a principled approach to identify high-risk individuals using prediction intervals, leading to an average improvement of identification rates by 8.7-830.4% compared with existing approaches. Overall, PredInterval represents a robust and versatile tool for enhancing the clinical utility of PGS.
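
The idea of building an interval from quantiles of phenotypic residuals can be sketched as follows. This is a simplified, conformal-style illustration that uses a single calibration sample instead of cross-validation; it is not PredInterval's algorithm, and the function names and data values are hypothetical.

```python
import math

def residual_quantile(abs_residuals, coverage=0.95):
    """Quantile of absolute calibration residuals with a finite-sample rank correction."""
    n = len(abs_residuals)
    ranked = sorted(abs_residuals)
    # Use the ceil((n + 1) * coverage)-th smallest residual, capped at the largest one.
    k = min(n, math.ceil((n + 1) * coverage))
    return ranked[k - 1]

def prediction_interval(pgs_prediction, half_width):
    """Symmetric interval around a PGS-based phenotype prediction."""
    return (pgs_prediction - half_width, pgs_prediction + half_width)

# Hypothetical calibration sample: observed phenotypes and their PGS-based predictions.
observed = [1.2, 0.4, -0.3, 2.1, 0.9, -1.0, 0.2, 1.5, -0.6, 0.7]
predicted = [1.0, 0.1, 0.2, 1.4, 1.1, -0.5, 0.0, 1.0, -0.2, 0.9]
abs_res = [abs(o - p) for o, p in zip(observed, predicted)]

q = residual_quantile(abs_res, coverage=0.8)
low, high = prediction_interval(0.5, q)
```

Because the half-width comes from empirical residual quantiles, the sketch makes no distributional assumption about the phenotype, which is the sense in which such intervals are nonparametric.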

DOI

10.1038/s41588-025-02360-6

PrediXcan

Tool

PUBMED_LINK

26258848

DESCRIPTION

(deprecated) PrediXcan is a gene-based association test that prioritizes genes that are likely to be causal for the phenotype.

URL

https://github.com/hakyimlab/PrediXcan

TITLE

A gene-based association method for mapping traits using reference transcriptome data.

Main citation

Gamazon ER, Wheeler HE, Shah KP, Mozaffari SV, ...&, Im HK. (2015) A gene-based association method for mapping traits using reference transcriptome data. Nat Genet, 47 (9) 1091-8. doi:10.1038/ng.3367. PMID 26258848

ABSTRACT

Genome-wide association studies (GWAS) have identified thousands of variants robustly associated with complex traits. However, the biological mechanisms underlying these associations are, in general, not well understood. We propose a gene-based association method called PrediXcan that directly tests the molecular mechanisms through which genetic variation affects phenotype. The approach estimates the component of gene expression determined by an individual's genetic profile and correlates 'imputed' gene expression with the phenotype under investigation to identify genes involved in the etiology of the phenotype. Genetically regulated gene expression is estimated using whole-genome tissue-dependent prediction models trained with reference transcriptome data sets. PrediXcan enjoys the benefits of gene-based approaches such as reduced multiple-testing burden and a principled approach to the design of follow-up experiments. Our results demonstrate that PrediXcan can detect known and new genes associated with disease traits and provide insights into the mechanism of these associations.
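
The two steps described above, imputing expression from genotype and then correlating it with the phenotype, can be sketched in a few lines. This is a toy illustration under assumed inputs, not the PrediXcan software: the SNP weights, dosages, and phenotype values are hypothetical, and real analyses use trained tissue-specific prediction models and regression with covariates.

```python
import math

def predict_expression(dosages, weights):
    """Genetically predicted expression: weighted sum of allele dosages per individual."""
    return [sum(d * w for d, w in zip(person, weights)) for person in dosages]

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: 5 individuals, 3 SNPs in the prediction model for one gene.
dosages = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 0, 0], [1, 2, 2]]
weights = [0.5, -0.2, 0.1]            # assumed pre-trained SNP weights
phenotype = [0.1, 0.4, 1.3, -0.2, 0.5]

expression = predict_expression(dosages, weights)
r = pearson(expression, phenotype)
```

Testing one predicted-expression value per gene rather than every SNP is what reduces the multiple-testing burden mentioned in the abstract.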

DOI

10.1038/ng.3367

Priority index

Tool

PUBMED_LINK

31253980

DESCRIPTION

A Comprehensive Resource for Genetic Targets in Immune-Mediated Disease

URL

http://pi.well.ox.ac.uk:3010/

TITLE

A genetics-led approach defines the drug target landscape of 30 immune-related traits.

Main citation

Fang H, ULTRA-DD Consortium, De Wolf H, Knezevic B, ...&, Knight JC. (2019) A genetics-led approach defines the drug target landscape of 30 immune-related traits. Nat Genet, 51 (7) 1082-1091. doi:10.1038/s41588-019-0456-1. PMID 31253980

ABSTRACT

Most candidate drugs currently fail later-stage clinical trials, largely due to poor prediction of efficacy on early target selection. Drug targets with genetic support are more likely to be therapeutically valid, but the translational use of genome-scale data such as from genome-wide association studies for drug target discovery in complex diseases remains challenging. Here, we show that integration of functional genomic and immune-related annotations, together with knowledge of network connectivity, maximizes the informativeness of genetics for target validation, defining the target prioritization landscape for 30 immune traits at the gene and pathway level. We demonstrate how our genetics-led drug target prioritization approach (the priority index) successfully identifies current therapeutics, predicts activity in high-throughput cellular screens (including L1000, CRISPR, mutagenesis and patient-derived cell assays), enables prioritization of under-explored targets and allows for determination of target-level trait relationships. The priority index is an open-access, scalable system accelerating early-stage drug target selection for immune-mediated disease.

DOI

10.1038/s41588-019-0456-1

PROSPER

Tool

PUBMED_LINK

38622117

FULL NAME

Polygenic Risk scOres based on enSemble of PEnalized Regression models

DESCRIPTION

PROSPER is a new multi-ancestry PRS method that applies penalized regression followed by ensemble learning. The software is a command line tool based on the R programming language. A large-scale benchmarking study shows that PROSPER could be the leading method to reduce the disparity of PRS performance across ancestry groups.

URL

https://github.com/Jingning-Zhang/PROSPER

TITLE

An ensemble penalized regression method for multi-ancestry polygenic risk prediction.

Main citation

Zhang J, Zhan J, Jin J, Ma C, ...&, Chatterjee N. (2024) An ensemble penalized regression method for multi-ancestry polygenic risk prediction. Nat Commun, 15 (1) 3238. doi:10.1038/s41467-024-47357-7. PMID 38622117

ABSTRACT

Great efforts are being made to develop advanced polygenic risk scores (PRS) to improve the prediction of complex traits and diseases. However, most existing PRS are primarily trained on European ancestry populations, limiting their transferability to non-European populations. In this article, we propose a novel method for generating multi-ancestry Polygenic Risk scOres based on enSemble of PEnalized Regression models (PROSPER). PROSPER integrates genome-wide association studies (GWAS) summary statistics from diverse populations to develop ancestry-specific PRS with improved predictive power for minority populations. The method uses a combination of L1 (lasso) and L2 (ridge) penalty functions, a parsimonious specification of the penalty parameters across populations, and an ensemble step to combine PRS generated across different penalty parameters. We evaluate the performance of PROSPER and other existing methods on large-scale simulated and real datasets, including those from 23andMe Inc., the Global Lipids Genetics Consortium, and All of Us. Results show that PROSPER can substantially improve multi-ancestry polygenic prediction compared to alternative methods across a wide variety of genetic architectures. In real data analyses, for example, PROSPER increased out-of-sample prediction R2 for continuous traits by an average of 70% compared to a state-of-the-art Bayesian method (PRS-CSx) in the African ancestry population. Further, PROSPER is computationally highly scalable for the analysis of large SNP contents and many diverse populations.

DOI

10.1038/s41467-024-47357-7

Protter

Tool

PUBMED_LINK

24162465

DESCRIPTION

interactive protein feature visualization and integration with experimental proteomic data

URL

https://wlab.ethz.ch/protter/start/

TITLE

Protter: interactive protein feature visualization and integration with experimental proteomic data.

Main citation

Omasits U, Ahrens CH, Müller S, Wollscheid B. (2014) Protter: interactive protein feature visualization and integration with experimental proteomic data. Bioinformatics, 30 (6) 884-6. doi:10.1093/bioinformatics/btt607. PMID 24162465

ABSTRACT

SUMMARY: The ability to integrate and visualize experimental proteomic evidence in the context of rich protein feature annotations represents an unmet need of the proteomics community. Here we present Protter, a web-based tool that supports interactive protein data analysis and hypothesis generation by visualizing both annotated sequence features and experimental proteomic data in the context of protein topology. Protter supports numerous proteomic file formats and automatically integrates a variety of reference protein annotation sources, which can be readily extended via modular plug-ins. A built-in export function produces publication-quality customized protein illustrations, also for large datasets. Visualizations of surfaceome datasets show the specific utility of Protter for the integrated visual analysis of membrane proteins and peptide selection for targeted proteomics. AVAILABILITY AND IMPLEMENTATION: The Protter web application is available at http://wlab.ethz.ch/protter. Source code and installation instructions are available at http://ulo.github.io/Protter/. CONTACT: wbernd@ethz.ch SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

DOI

10.1093/bioinformatics/btt607

PRS atlas

Tool

PUBMED_LINK

30835202

DESCRIPTION

This web application can be used to query findings from an analysis of 162 polygenic risk scores and 551 complex traits using data from the UK Biobank study. Traits were selected based on the heritability analysis conducted by the Neale Lab (P<0.05). We encourage users of this resource to conduct follow-up analyses of associations to robustly identify causal relationships between complex traits.

URL

http://mrcieu.mrsoftware.org/PRS_atlas/

TITLE

An atlas of polygenic risk score associations to highlight putative causal relationships across the human phenome.

Main citation

Richardson TG, Harrison S, Hemani G, Davey Smith G. (2019) An atlas of polygenic risk score associations to highlight putative causal relationships across the human phenome. Elife, 8 () . doi:10.7554/eLife.43657. PMID 30835202

ABSTRACT

The age of large-scale genome-wide association studies (GWAS) has provided us with an unprecedented opportunity to evaluate the genetic liability of complex disease using polygenic risk scores (PRS). In this study, we have analysed 162 PRS (p<5×10^-05) derived from GWAS and 551 heritable traits from the UK Biobank study (N = 334,398). Findings can be investigated using a web application (http://mrcieu.mrsoftware.org/PRS_atlas/), which we envisage will help uncover both known and novel mechanisms which contribute towards disease susceptibility. To demonstrate this, we have investigated the results from a phenome-wide evaluation of schizophrenia genetic liability. Amongst findings were inverse associations with measures of cognitive function, for which extensive follow-up analyses using Mendelian randomization (MR) provided evidence of a causal relationship. We have also investigated the effect of multiple risk factors on disease using mediation and multivariable MR frameworks. Our atlas provides a resource for future endeavours seeking to unravel the causal determinants of complex disease.

DOI

10.7554/eLife.43657

PRS credible intervals

Tool

PUBMED_LINK

34931067

URL

https://privefl.github.io/bigsnpr/articles/prs_uncertainty.html

KEYWORDS

uncertainty

TITLE

Large uncertainty in individual polygenic risk score estimation impacts PRS-based risk stratification.

Main citation

Ding Y, Hou K, Burch KS, Lapinska S, ...&, Pasaniuc B. (2022) Large uncertainty in individual polygenic risk score estimation impacts PRS-based risk stratification. Nat Genet, 54 (1) 30-39. doi:10.1038/s41588-021-00961-5. PMID 34931067

ABSTRACT

Although the cohort-level accuracy of polygenic risk scores (PRSs), estimates of genetic value at the individual level, has been widely assessed, uncertainty in PRSs remains underexplored. In the present study, we show that Bayesian PRS methods can estimate the variance of an individual's PRS and can yield well-calibrated credible intervals via posterior sampling. For 13 real traits in the UK Biobank (n = 291,273 unrelated 'white British'), we observe large variances in individual PRS estimates which impact interpretation of PRS-based stratification; averaging across traits, only 0.8% (s.d. = 1.6%) of individuals with PRS point estimates in the top decile have corresponding 95% credible intervals fully contained in the top decile. We provide an analytical estimator for the expectation of individual PRS variance as a function of SNP heritability, number of causal SNPs and sample size. Our results showcase the importance of incorporating uncertainty in individual PRS estimates into subsequent analyses.
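
How posterior sampling leads to an individual-level credible interval can be sketched as below. This is an illustration under assumptions, not the authors' pipeline: the posterior draws are simulated here from independent normal distributions, whereas a Bayesian PRS method would supply draws from its own sampler, and all names and values are hypothetical.

```python
import random

def prs_draws(genotype, effect_draws):
    """One PRS value per posterior draw of the SNP effect sizes."""
    return [sum(g * b for g, b in zip(genotype, draw)) for draw in effect_draws]

def credible_interval(samples, level=0.95):
    """Equal-tailed interval taken from the empirical quantiles of the samples."""
    ranked = sorted(samples)
    n = len(ranked)
    lower = ranked[int((1 - level) / 2 * (n - 1))]
    upper = ranked[int((1 + level) / 2 * (n - 1))]
    return lower, upper

random.seed(0)
# Hypothetical posterior: 1000 draws of 4 SNP effects (simulated for illustration only).
posterior_mean = [0.10, -0.05, 0.20, 0.00]
effect_draws = [[random.gauss(m, 0.05) for m in posterior_mean] for _ in range(1000)]
genotype = [2, 1, 0, 1]  # allele dosages for one individual

draws = prs_draws(genotype, effect_draws)
point = sum(draws) / len(draws)      # posterior mean PRS
low, high = credible_interval(draws)
```

Comparing `low` and `high` with a stratification threshold, such as the top-decile cutoff, shows whether an individual's classification is robust to uncertainty in the effect sizes.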

DOI

10.1038/s41588-021-00961-5

PRS-CS

Tool

PUBMED_LINK

30992449

DESCRIPTION

PRS-CS is a Python based command line tool that infers posterior SNP effect sizes under continuous shrinkage (CS) priors using GWAS summary statistics and an external LD reference panel.

URL

https://github.com/getian107/PRScs

KEYWORDS

continuous shrinkage (CS) prior

TITLE

Polygenic prediction via Bayesian regression and continuous shrinkage priors.

Main citation

Ge T, Chen CY, Ni Y, Feng YA, ...&, Smoller JW. (2019) Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat Commun, 10 (1) 1776. doi:10.1038/s41467-019-09718-5. PMID 30992449

ABSTRACT

Polygenic risk scores (PRS) have shown promise in predicting human complex traits and diseases. Here, we present PRS-CS, a polygenic prediction method that infers posterior effect sizes of single nucleotide polymorphisms (SNPs) using genome-wide association summary statistics and an external linkage disequilibrium (LD) reference panel. PRS-CS utilizes a high-dimensional Bayesian regression framework, and is distinct from previous work by placing a continuous shrinkage (CS) prior on SNP effect sizes, which is robust to varying genetic architectures, provides substantial computational advantages, and enables multivariate modeling of local LD patterns. Simulation studies using data from the UK Biobank show that PRS-CS outperforms existing methods across a wide range of genetic architectures, especially when the training sample size is large. We apply PRS-CS to predict six common complex diseases and six quantitative traits in the Partners HealthCare Biobank, and further demonstrate the improvement of PRS-CS in prediction accuracy over alternative methods.

DOI

10.1038/s41467-019-09718-5

PRS-CSx

Tool

PUBMED_LINK

35513724

DESCRIPTION

PRS-CSx is a Python based command line tool that integrates GWAS summary statistics and external LD reference panels from multiple populations to improve cross-population polygenic prediction. Posterior SNP effect sizes are inferred under coupled continuous shrinkage (CS) priors across populations.

URL

https://github.com/getian107/PRScsx

KEYWORDS

continuous shrinkage (CS) prior, cross-population

TITLE

Improving polygenic prediction in ancestrally diverse populations.

Main citation

Ruan Y, Lin YF, Feng YA, Chen CY, ...&, Ge T. (2022) Improving polygenic prediction in ancestrally diverse populations. Nat Genet, 54 (5) 573-580. doi:10.1038/s41588-022-01054-7. PMID 35513724

ABSTRACT

Polygenic risk scores (PRS) have attenuated cross-population predictive performance. As existing genome-wide association studies (GWAS) have been conducted predominantly in individuals of European descent, the limited transferability of PRS reduces their clinical value in non-European populations, and may exacerbate healthcare disparities. Recent efforts to level ancestry imbalance in genomic research have expanded the scale of non-European GWAS, although most remain underpowered. Here, we present a new PRS construction method, PRS-CSx, which improves cross-population polygenic prediction by integrating GWAS summary statistics from multiple populations. PRS-CSx couples genetic effects across populations via a shared continuous shrinkage (CS) prior, enabling more accurate effect size estimation by sharing information between summary statistics and leveraging linkage disequilibrium diversity across discovery samples, while inheriting computational efficiency and robustness from PRS-CS. We show that PRS-CSx outperforms alternative methods across traits with a wide range of genetic architectures, cross-population genetic overlaps and discovery GWAS sample sizes in simulations, and improves the prediction of quantitative traits and schizophrenia risk in non-European populations.

DOI

10.1038/s41588-022-01054-7

PRS-FH

Tool

PUBMED_LINK

35935918

FULL NAME

family history

URL

https://alkesgroup.broadinstitute.org/UKBB/PRSFH/PRSFH/

KEYWORDS

family history

TITLE

Incorporating family history of disease improves polygenic risk scores in diverse populations.

Main citation

Hujoel MLA, Loh PR, Neale BM, Price AL. (2022) Incorporating family history of disease improves polygenic risk scores in diverse populations. Cell Genom, 2 (7) . doi:10.1016/j.xgen.2022.100152. PMID 35935918

ABSTRACT

Polygenic risk scores (PRSs) derived from genotype data and family history (FH) of disease provide valuable information for predicting disease risk, but PRSs perform poorly when applied to diverse populations. Here, we explore methods for combining both types of information (PRS-FH) in UK Biobank data. PRSs were trained using all British individuals (n = 409,000), and target samples consisted of unrelated non-British Europeans (n = 42,000), South Asians (n = 7,000), or Africans (n = 7,000). We evaluated PRS, FH, and PRS-FH using liability-scale R2, primarily focusing on 3 well-powered diseases (type 2 diabetes, hypertension, and depression). PRS attained average prediction R2s of 5.8%, 4.0%, and 0.53% in non-British Europeans, South Asians, and Africans, confirming poor cross-population transferability. In contrast, PRS-FH attained average prediction R2s of 13%, 12%, and 10%, respectively, representing a large improvement in Europeans and an extremely large improvement in Africans. In conclusion, including family history improves the accuracy of polygenic risk scores, particularly in diverse populations.

DOI

10.1016/j.xgen.2022.100152

PRS-RS

Tool

PUBMED_LINK

33692554

FULL NAME

Polygenic Risk Score Reporting Standards

TITLE

Improving reporting standards for polygenic scores in risk prediction studies.

Main citation

Wand H, Lambert SA, Tamburro C, Iacocca MA, ...&, Wojcik GL. (2021) Improving reporting standards for polygenic scores in risk prediction studies. Nature, 591 (7849) 211-219. doi:10.1038/s41586-021-03243-6. PMID 33692554

ABSTRACT

Polygenic risk scores (PRSs), which often aggregate results from genome-wide association studies, can bridge the gap between initial discovery efforts and clinical applications for the estimation of disease risk using genetics. However, there is notable heterogeneity in the application and reporting of these risk scores, which hinders the translation of PRSs into clinical care. Here, in a collaboration between the Clinical Genome Resource (ClinGen) Complex Disease Working Group and the Polygenic Score (PGS) Catalog, we present the Polygenic Risk Score Reporting Standards (PRS-RS), in which we update the Genetic Risk Prediction Studies (GRIPS) Statement to reflect the present state of the field. Drawing on the input of experts in epidemiology, statistics, disease-specific applications, implementation and policy, this comprehensive reporting framework defines the minimal information that is needed to interpret and evaluate PRSs, especially with respect to downstream clinical applications. Items span detailed descriptions of study populations, statistical methods for the development and validation of PRSs and considerations for the potential limitations of these scores. In addition, we emphasize the need for data availability and transparency, and we encourage researchers to deposit and share PRSs through the PGS Catalog to facilitate reproducibility and comparative benchmarking. By providing these criteria in a structured format that builds on existing standards and ontologies, the use of this framework in publishing PRSs will facilitate translation into clinical care and progress towards defining best practice.

DOI

10.1038/s41586-021-03243-6

PRS_to_Abs

Tool

PUBMED_LINK

34983942

DESCRIPTION

Converting Polygenic Score to Absolute Scale

URL

https://opain.github.io/GenoPred/PRS_to_Abs_tool.html

TITLE

A tool for translating polygenic scores onto the absolute scale using summary statistics.

Main citation

Pain O, Gillett AC, Austin JC, Folkersen L, ...&, Lewis CM. (2022) A tool for translating polygenic scores onto the absolute scale using summary statistics. Eur J Hum Genet, 30 (3) 339-348. doi:10.1038/s41431-021-01028-z. PMID 34983942

ABSTRACT

There is growing interest in the clinical application of polygenic scores as their predictive utility increases for a range of health-related phenotypes. However, providing polygenic score predictions on the absolute scale is an important step for their safe interpretation. We have developed a method to convert polygenic scores to the absolute scale for binary and normally distributed phenotypes. This method uses summary statistics, requiring only the area-under-the-ROC curve (AUC) or variance explained (R2) by the polygenic score, and the prevalence of binary phenotypes, or mean and standard deviation of normally distributed phenotypes. Polygenic scores are converted using normal distribution theory. We also evaluate methods for estimating polygenic score AUC/R2 from genome-wide association study (GWAS) summary statistics alone. We validate the absolute risk conversion and AUC/R2 estimation using data for eight binary and three continuous phenotypes in the UK Biobank sample. When the AUC/R2 of the polygenic score is known, the observed and estimated absolute values were highly concordant. Estimates of AUC/R2 from the lassosum pseudovalidation method were most similar to the observed AUC/R2 values, though estimated values deviated substantially from the observed for autoimmune disorders. This study enables accurate interpretation of polygenic scores using only summary statistics, providing a useful tool for educational and clinical purposes. Furthermore, we have created interactive webtools implementing the conversion to the absolute scale (https://opain.github.io/GenoPred/PRS_to_Abs_tool.html). Several further barriers must be addressed before clinical implementation of polygenic scores, such as ensuring target individuals are well represented by the GWAS sample.
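
The conversion described in the abstract (AUC plus prevalence, using normal distribution theory) can be illustrated with a minimal Python sketch. This is not the PRS_to_Abs implementation. It assumes the polygenic score is normally distributed with equal variance in cases and controls, so that AUC = Phi(d / sqrt(2)) for a standardized mean difference d, and it measures the score z in control standard deviations from the control mean. The function names `auc_to_cohens_d` and `absolute_risk` are hypothetical.

```python
from statistics import NormalDist


def auc_to_cohens_d(auc: float) -> float:
    """Standardized case-control mean difference implied by an AUC.

    Assumes equal-variance normal score distributions in cases and
    controls, under which AUC = Phi(d / sqrt(2)).
    """
    return 2 ** 0.5 * NormalDist().inv_cdf(auc)


def absolute_risk(z: float, auc: float, prevalence: float) -> float:
    """P(case | score = z) by Bayes' rule under the two-normal assumption.

    z is expressed in control standard deviations from the control mean
    (an assumption of this sketch, not necessarily of the published tool).
    """
    d = auc_to_cohens_d(auc)
    controls = NormalDist(0.0, 1.0)
    cases = NormalDist(d, 1.0)
    case_part = prevalence * cases.pdf(z)
    control_part = (1.0 - prevalence) * controls.pdf(z)
    return case_part / (case_part + control_part)


if __name__ == "__main__":
    # Illustrative inputs only: AUC 0.70, prevalence 10%.
    for z in (-2.0, 0.0, 2.0):
        print(z, round(absolute_risk(z, auc=0.70, prevalence=0.10), 4))
```

Under these assumptions an uninformative score (AUC = 0.5) returns the prevalence for every z, and the risk rises monotonically with z whenever AUC exceeds 0.5.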

DOI

10.1038/s41431-021-01028-z

PRSet

Tool

PUBMED_LINK

36749789

DESCRIPTION

A new feature of PRSice is the ability to perform set-based/pathway-based analysis. This new feature is called PRSet.

URL

https://www.prsice.info/quick_start_prset/

KEYWORDS

pathway-based

TITLE

PRSet: Pathway-based polygenic risk score analyses and software.

Main citation

Choi SW, García-González J, Ruan Y, Wu HM, ...&, O'Reilly PF. (2023) PRSet: Pathway-based polygenic risk score analyses and software. PLoS Genet, 19 (2) e1010624. doi:10.1371/journal.pgen.1010624. PMID 36749789

ABSTRACT

Polygenic risk scores (PRSs) have been among the leading advances in biomedicine in recent years. As a proxy of genetic liability, PRSs are utilised across multiple fields and applications. While numerous statistical and machine learning methods have been developed to optimise their predictive accuracy, these typically distil genetic liability to a single number based on aggregation of an individual's genome-wide risk alleles. This results in a key loss of information about an individual's genetic profile, which could be critical given the functional sub-structure of the genome and the heterogeneity of complex disease. In this manuscript, we introduce a 'pathway polygenic' paradigm of disease risk, in which multiple genetic liabilities underlie complex diseases, rather than a single genome-wide liability. We describe a method and accompanying software, PRSet, for computing and analysing pathway-based PRSs, in which polygenic scores are calculated across genomic pathways for each individual. We evaluate the potential of pathway PRSs in two distinct ways, creating two major sections: (1) In the first section, we benchmark PRSet as a pathway enrichment tool, evaluating its capacity to capture GWAS signal in pathways. We find that for target sample sizes of >10,000 individuals, pathway PRSs have similar power for evaluating pathway enrichment as leading methods MAGMA and LD score regression, with the distinct advantage of providing individual-level estimates of genetic liability for each pathway, opening up a range of pathway-based PRS applications. (2) In the second section, we evaluate the performance of pathway PRSs for disease stratification. We show that using a supervised disease stratification approach, pathway PRSs (computed by PRSet) outperform two standard genome-wide PRSs (computed by C+T and lassosum) for classifying disease subtypes in 20 of 21 scenarios tested.
As the definition and functional annotation of pathways becomes increasingly refined, we expect pathway PRSs to offer key insights into the heterogeneity of complex disease and treatment response, to generate biologically tractable therapeutic targets from polygenic signal, and, ultimately, to provide a powerful path to precision medicine.

DOI

10.1371/journal.pgen.1010624

PRSice-2

Tool

PUBMED_LINK

31307061

DESCRIPTION

PRSice (pronounced 'precise') is a Polygenic Risk Score software for calculating, applying, evaluating and plotting the results of polygenic risk scores (PRS) analyses.

URL

https://www.prsice.info/

TITLE

PRSice-2: Polygenic Risk Score software for biobank-scale data.

Main citation

Choi SW, O'Reilly PF. (2019) PRSice-2: Polygenic Risk Score software for biobank-scale data. Gigascience, 8 (7) . doi:10.1093/gigascience/giz082. PMID 31307061

ABSTRACT

BACKGROUND: Polygenic risk score (PRS) analyses have become an integral part of biomedical research, exploited to gain insights into shared aetiology among traits, to control for genomic profile in experimental studies, and to strengthen causal inference, among a range of applications. Substantial efforts are now devoted to biobank projects to collect large genetic and phenotypic data, providing unprecedented opportunity for genetic discovery and applications. To process the large-scale data provided by such biobank resources, highly efficient and scalable methods and software are required. RESULTS: Here we introduce PRSice-2, an efficient and scalable software program for automating and simplifying PRS analyses on large-scale data. PRSice-2 handles both genotyped and imputed data, provides empirical association P-values free from inflation due to overfitting, supports different inheritance models, and can evaluate multiple continuous and binary target traits simultaneously. We demonstrate that PRSice-2 is dramatically faster and more memory-efficient than PRSice-1 and alternative PRS software, LDpred and lassosum, while having comparable predictive power. CONCLUSION: PRSice-2's combination of efficiency and power will be increasingly important as data sizes grow and as the applications of PRS become more sophisticated, e.g., when incorporated into high-dimensional or gene set-based analyses. PRSice-2 is written in C++, with an R script for plotting, and is freely available for download from http://PRSice.info.

DOI

10.1093/gigascience/giz082

PRSMix_AOI

Tool

FULL NAME

add-one-in (AOI)

PREPRINT_DOI

10.1101/2024.07.24.24310897

Main citation

Misra, A. et al. Instability of high polygenic risk classification and mitigation by integrative scoring. bioRxiv 2024.07.24.24310897 (2024) doi:10.1101/2024.07.24.24310897.

PRStuning

Tool

PUBMED_LINK

37398263

DESCRIPTION

Estimate Testing AUC for Binary Phenotype Using GWAS Summary Statistics from the Training Data

TITLE

Tuning Parameters for Polygenic Risk Score Methods Using GWAS Summary Statistics from Training Data.

Main citation

Jiang W, Chen L, Girgenti MJ, Zhao H. (2023) Tuning Parameters for Polygenic Risk Score Methods Using GWAS Summary Statistics from Training Data. Res Sq, () . doi:10.21203/rs.3.rs-2939390/v1. PMID 37398263

ABSTRACT

Predicting genetic risks for common diseases may improve their prevention and early treatment. In recent years, various additive-model-based polygenic risk scores (PRS) methods have been proposed to combine the estimated effects of single nucleotide polymorphisms (SNPs) using data collected from genome-wide association studies (GWAS). Some of these methods require access to another external individual-level GWAS dataset to tune the hyperparameters, which can be difficult because of privacy and security-related concerns. Additionally, leaving out partial data for hyperparameter tuning can reduce the predictive accuracy of the constructed PRS model. In this article, we propose a novel method, called PRStuning, to automatically tune hyperparameters for different PRS methods using only GWAS summary statistics from the training data. The core idea is to first predict the performance of the PRS method with different parameter values, and then select the parameters with the best prediction performance. Because directly using the effects observed from the training data tends to overestimate the performance in the testing data (a phenomenon known as overfitting), we adopt an empirical Bayes approach to shrinking the predicted performance in accordance with the estimated genetic architecture of the disease. Results from extensive simulations and real data applications demonstrate that PRStuning can accurately predict the PRS performance across PRS methods and parameters, and it can help select the best-performing parameters.

DOI

10.21203/rs.3.rs-2939390/v1

PS4DR

Tool

PUBMED_LINK

32503412

FULL NAME

Pathway Signatures for Drug Repositioning

DESCRIPTION

This package comprises a modular workflow designed to identify drug repositioning candidates using multi-omics data sets. A schematic figure of the workflow is presented below. The R scripts necessary to run the MSDRP pipeline are located in the R directory.

URL

https://github.com/ps4dr/ps4dr

TITLE

PS4DR: a multimodal workflow for identification and prioritization of drugs based on pathway signatures.

Main citation

Emon MA, Domingo-Fernández D, Hoyt CT, Hofmann-Apitius M. (2020) PS4DR: a multimodal workflow for identification and prioritization of drugs based on pathway signatures. BMC Bioinformatics, 21 (1) 231. doi:10.1186/s12859-020-03568-5. PMID 32503412

ABSTRACT

BACKGROUND: During the last decade, there has been a surge towards computational drug repositioning owing to constantly increasing -omics data in the biomedical research field. While numerous existing methods focus on the integration of heterogeneous data to propose candidate drugs, it is still challenging to substantiate their results with mechanistic insights of these candidate drugs. Therefore, there is a need for more innovative and efficient methods which can enable better integration of data and knowledge for drug repositioning. RESULTS: Here, we present a customizable workflow (PS4DR) which not only integrates high-throughput data such as genome-wide association study (GWAS) data and gene expression signatures from disease and drug perturbations but also takes pathway knowledge into consideration to predict drug candidates for repositioning. We have collected and integrated publicly available GWAS data and gene expression signatures for several diseases and hundreds of FDA-approved drugs or those under clinical trial in this study. Additionally, different pathway databases were used for mechanistic knowledge integration in the workflow. Using this systematic consolidation of data and knowledge, the workflow computes pathway signatures that assist in the prediction of new indications for approved and investigational drugs. CONCLUSION: We showcase PS4DR with applications demonstrating how this tool can be used for repositioning and identifying new drugs as well as proposing drugs that can simulate disease dysregulations. We were able to validate our workflow by demonstrating its capability to predict FDA-approved drugs for their known indications for several diseases. Further, PS4DR returned many potential drug candidates for repositioning that were backed up by epidemiological evidence extracted from scientific literature. Source code is freely available at https://github.com/ps4dr/ps4dr.

DOI

10.1186/s12859-020-03568-5

PSMC

Tool

PUBMED_LINK

21753753

DESCRIPTION

Li, H. & Durbin, R. Inference of human population history from individual whole-genome sequences. Nature 475, 493–496 (2011).

URL

https://github.com/lh3/psmc

USE

This software package infers population size history from a diploid sequence
using the Pairwise Sequentially Markovian Coalescent (PSMC) model.

TITLE

Inference of human population history from individual whole-genome sequences.

Main citation

Li H, Durbin R. (2011) Inference of human population history from individual whole-genome sequences. Nature, 475 (7357) 493-6. doi:10.1038/nature10231. PMID 21753753

ABSTRACT

The history of human population size is important for understanding human evolution. Various studies have found evidence for a founder event (bottleneck) in East Asian and European populations, associated with the human dispersal out-of-Africa event around 60 thousand years (kyr) ago. However, these studies have had to assume simplified demographic models with few parameters, and they do not provide a precise date for the start and stop times of the bottleneck. Here, with fewer assumptions on population size changes, we present a more detailed history of human population sizes between approximately ten thousand and a million years ago, using the pairwise sequentially Markovian coalescent model applied to the complete diploid genome sequences of a Chinese male (YH), a Korean male (SJK), three European individuals (J. C. Venter, NA12891 and NA12878 (ref. 9)) and two Yoruba males (NA18507 (ref. 10) and NA19239). We infer that European and Chinese populations had very similar population-size histories before 10-20 kyr ago. Both populations experienced a severe bottleneck 10-60 kyr ago, whereas African populations experienced a milder bottleneck from which they recovered earlier. All three populations have an elevated effective population size between 60 and 250 kyr ago, possibly due to population substructure. We also infer that the differentiation of genetically modern humans may have started as early as 100-120 kyr ago, but considerable genetic exchanges may still have occurred until 20-40 kyr ago.

DOI

10.1038/nature10231

PTWAS

Tool

PUBMED_LINK

32912253

FULL NAME

probabilistic TWAS

URL

https://github.com/xqwen/ptwas

KEYWORDS

TWAS, instrumental variables

TITLE

PTWAS: investigating tissue-relevant causal molecular mechanisms of complex traits using probabilistic TWAS analysis.

Main citation

Zhang Y, Quick C, Yu K, Barbeira A, ...&, Wen X. (2020) PTWAS: investigating tissue-relevant causal molecular mechanisms of complex traits using probabilistic TWAS analysis. Genome Biol, 21 (1) 232. doi:10.1186/s13059-020-02026-y. PMID 32912253

ABSTRACT

We propose a new computational framework, probabilistic transcriptome-wide association study (PTWAS), to investigate causal relationships between gene expressions and complex traits. PTWAS applies the established principles from instrumental variables analysis and takes advantage of probabilistic eQTL annotations to delineate and tackle the unique challenges arising in TWAS. PTWAS not only confers higher power than the existing methods but also provides novel functionalities to evaluate the causal assumptions and estimate tissue- or cell-type-specific gene-to-trait effects. We illustrate the power of PTWAS by analyzing the eQTL data across 49 tissues from GTEx (v8) and GWAS summary statistics from 114 complex traits.

DOI

10.1186/s13059-020-02026-y

PUMA-CUBS

Tool

DESCRIPTION

PUMA-CUBS is an ensemble learning strategy that combines multiple PRS models into an ensemble score without requiring external data for model fitting.

URL

https://github.com/qlu-lab/PUMAS

Main citation

Zhao, Zijie, et al. "Optimizing and benchmarking polygenic risk scores with GWAS summary statistics." bioRxiv (2022).

QCTOOL v2 (QCTOOL)

Tool

PUBMED_LINK

15789306

FULL NAME

QCTOOL

DESCRIPTION

QCTOOL is a command-line utility program for manipulation and quality control of GWAS datasets and other genome-wide data.

URL

https://www.well.ox.ac.uk/~gav/qctool_v2/index.html

TITLE

A note on exact tests of Hardy-Weinberg equilibrium.

Main citation

Wigginton JE, Cutler DJ, Abecasis GR. (2005) A note on exact tests of Hardy-Weinberg equilibrium. Am J Hum Genet, 76 (5) 887-93. doi:10.1086/429864. PMID 15789306

ABSTRACT

Deviations from Hardy-Weinberg equilibrium (HWE) can indicate inbreeding, population stratification, and even problems in genotyping. In samples of affected individuals, these deviations can also provide evidence for association. Tests of HWE are commonly performed using a simple chi2 goodness-of-fit test. We show that this chi2 test can have inflated type I error rates, even in relatively large samples (e.g., samples of 1,000 individuals that include approximately 100 copies of the minor allele). On the basis of previous work, we describe exact tests of HWE together with efficient computational methods for their implementation. Our methods adequately control type I error in large and small samples and are computationally efficient. They have been implemented in freely available code that will be useful for quality assessment of genotype data and for the detection of genetic association or population stratification in very large data sets.
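
As an illustration of the exact test described above, the following Python sketch computes a two-sided exact HWE p-value from genotype counts. It follows the commonly used recurrence over heterozygote counts (conditioning on the observed allele counts, starting near the expected count and extending in both directions), but it is a simplified sketch rather than the authors' released code. The function name `hwe_exact_pvalue` and its argument order are assumptions made for this example.

```python
def hwe_exact_pvalue(n_het: int, n_hom1: int, n_hom2: int) -> float:
    """Two-sided exact Hardy-Weinberg p-value from genotype counts.

    Sums the probabilities of all heterozygote counts that are no more
    likely than the observed one, conditional on the allele counts.
    """
    n = n_het + n_hom1 + n_hom2
    if min(n_het, n_hom1, n_hom2) < 0:
        raise ValueError("genotype counts must be non-negative")
    if n == 0:
        return 1.0

    # Copies of the rarer allele; het counts share its parity.
    n_rare = 2 * min(n_hom1, n_hom2) + n_het
    probs = [0.0] * (n_rare + 1)

    # Start at a het count close to its expectation, with matching parity.
    mid = n_rare * (2 * n - n_rare) // (2 * n)
    if mid % 2 != n_rare % 2:
        mid += 1
    probs[mid] = 1.0  # unnormalized; divided by `total` at the end
    total = 1.0

    # Walk down: two hets become one rare and one common homozygote.
    het, hom_r = mid, (n_rare - mid) // 2
    hom_c = n - het - hom_r
    while het >= 2:
        probs[het - 2] = probs[het] * het * (het - 1) / (4.0 * (hom_r + 1) * (hom_c + 1))
        total += probs[het - 2]
        het, hom_r, hom_c = het - 2, hom_r + 1, hom_c + 1

    # Walk up: one rare and one common homozygote become two hets.
    het, hom_r = mid, (n_rare - mid) // 2
    hom_c = n - het - hom_r
    while het <= n_rare - 2:
        probs[het + 2] = probs[het] * 4.0 * hom_r * hom_c / ((het + 2.0) * (het + 1.0))
        total += probs[het + 2]
        het, hom_r, hom_c = het + 2, hom_r - 1, hom_c - 1

    observed = probs[n_het]
    p_value = sum(p for p in probs if p <= observed) / total
    return min(1.0, p_value)


if __name__ == "__main__":
    print(hwe_exact_pvalue(50, 25, 25))  # counts matching HWE proportions
    print(hwe_exact_pvalue(0, 50, 50))   # no heterozygotes at all
```

For the smallest non-trivial case (two individuals, one of each homozygote, no heterozygotes) the conditional distribution puts probability 2/3 on two heterozygotes and 1/3 on none, so the sketch returns 1/3.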

DOI

10.1086/429864

QRGWAS

Tool

PUBMED_LINK

39085219

FULL NAME

Quantile regression GWAS

URL

https://github.com/Iuliana-Ionita-Laza/QRGWAS

TITLE

Genome-wide discovery for biomarkers using quantile regression at biobank scale.

Main citation

Wang C, Wang T, Kiryluk K, Wei Y, ...&, Ionita-Laza I. (2024) Genome-wide discovery for biomarkers using quantile regression at biobank scale. Nat Commun, 15 (1) 6460. doi:10.1038/s41467-024-50726-x. PMID 39085219

ABSTRACT

Genome-wide association studies (GWAS) for biomarkers important for clinical phenotypes can lead to clinically relevant discoveries. Conventional GWAS for quantitative traits are based on simplified regression models modeling the conditional mean of a phenotype as a linear function of genotype. We draw attention here to an alternative, lesser known approach, namely quantile regression that naturally extends linear regression to the analysis of the entire conditional distribution of a phenotype of interest. Quantile regression can be applied efficiently at biobank scale, while having some unique advantages such as (1) identifying variants with heterogeneous effects across quantiles of the phenotype distribution; (2) accommodating a wide range of phenotype distributions including non-normal distributions, with invariance of results to trait transformations; and (3) providing more detailed information about genotype-phenotype associations even for those associations identified by conventional GWAS. We show in simulations that quantile regression is powerful across both homogeneous and various heterogeneous models. Applications to 39 quantitative traits in the UK Biobank demonstrate that quantile regression can be a helpful complement to linear regression in GWAS and can identify variants with larger effects on high-risk subgroups of individuals but with lower or no contribution overall.
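
To make the idea of heterogeneous effects across quantiles concrete, here is a small Python sketch of single-variant quantile regression. It is not the QRGWAS implementation and would not scale to biobank data: it minimizes the check (pinball) loss by brute force over lines through pairs of observations, relying on the standard result that an optimal two-parameter quantile fit interpolates at least two data points. The function names and the toy genotype and phenotype values are made up for illustration.

```python
def pinball_loss(residual: float, tau: float) -> float:
    """Check loss: weight tau on positive residuals, 1 - tau on negative ones."""
    return tau * residual if residual >= 0 else (tau - 1.0) * residual


def fit_quantile_line(x, y, tau):
    """Return (intercept, slope) minimizing total check loss at quantile tau.

    Brute-force search over lines through pairs of points with distinct x.
    Cost is O(n^3), so this is for toy data only.
    """
    best = None
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            if x[i] == x[j]:
                continue
            slope = (y[j] - y[i]) / (x[j] - x[i])
            intercept = y[i] - slope * x[i]
            loss = sum(
                pinball_loss(yk - (intercept + slope * xk), tau)
                for xk, yk in zip(x, y)
            )
            if best is None or loss < best[0]:
                best = (loss, intercept, slope)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1], best[2]


# Toy data: the variant shifts only the upper tail of the phenotype.
genotype = [0] * 5 + [1] * 5 + [2] * 5
phenotype = [0, 1, 2, 3, 4, 0, 1, 2, 3, 6, 0, 1, 2, 3, 8]

if __name__ == "__main__":
    for tau in (0.5, 0.9):
        intercept, slope = fit_quantile_line(genotype, phenotype, tau)
        print(f"tau={tau}: intercept={intercept}, slope={slope}")
```

In this toy example the fitted slope is 0 at the median and 2 at the 0.9 quantile, which is the kind of quantile-specific effect the abstract refers to.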

DOI

10.1038/s41467-024-50726-x

Quickdraws

Tool

PUBMED_LINK

39789286

DESCRIPTION

Quickdraws is a scalable method to perform genome-wide association studies (GWAS) for quantitative and binary traits. To run GWAS using Quickdraws, you will need three main input files: bed (and bgen) files with model-building and testing genetic variants, phenotype files, and covariate files. For certain analyses, you may also need a list of model SNPs and a file describing close genetic relatives.

URL

https://palamaralab.github.io/software/quickdraws/manual/

TITLE

A scalable variational inference approach for increased mixed-model association power.

Main citation

Loya H, Kalantzis G, Cooper F, Palamara PF. (2025) A scalable variational inference approach for increased mixed-model association power. Nat Genet, 57 (2) 461-468. doi:10.1038/s41588-024-02044-7. PMID 39789286

ABSTRACT

The rapid growth of modern biobanks is creating new opportunities for large-scale genome-wide association studies (GWASs) and the analysis of complex traits. However, performing GWASs on millions of samples often leads to trade-offs between computational efficiency and statistical power, reducing the benefits of large-scale data collection efforts. We developed Quickdraws, a method that increases association power in quantitative and binary traits without sacrificing computational efficiency, leveraging a spike-and-slab prior on variant effects, stochastic variational inference and graphics processing unit acceleration. We applied Quickdraws to 79 quantitative and 50 binary traits in 405,088 UK Biobank samples, identifying 4.97% and 3.25% more associations than REGENIE and 22.71% and 7.07% more than FastGWA. Quickdraws had costs comparable to REGENIE, FastGWA and SAIGE on the UK Biobank Research Analysis Platform service, while being substantially faster than BOLT-LMM. These results highlight the promise of leveraging machine learning techniques for scalable GWASs without sacrificing power or robustness.

DOI

10.1038/s41588-024-02044-7

QUILT1

Tool

PUBMED_LINK

34083788

URL

https://github.com/rwdavies/QUILT

TITLE

Rapid genotype imputation from sequence with reference panels.

Main citation

Davies RW, Kucka M, Su D, Shi S, ...&, Myers S. (2021) Rapid genotype imputation from sequence with reference panels. Nat Genet, 53 (7) 1104-1111. doi:10.1038/s41588-021-00877-0. PMID 34083788

ABSTRACT

Inexpensive genotyping methods are essential to modern genomics. Here we present QUILT, which performs diploid genotype imputation using low-coverage whole-genome sequence data. QUILT employs Gibbs sampling to partition reads into maternal and paternal sets, facilitating rapid haploid imputation using large reference panels. We show this partitioning to be accurate over many megabases, enabling highly accurate imputation close to theoretical limits and outperforming existing methods. Moreover, QUILT can impute accurately using diverse technologies, including long reads from Oxford Nanopore Technologies, and a new form of low-cost barcoded Illumina sequencing called haplotagging, with the latter showing improved accuracy at low coverages. Relative to DNA genotyping microarrays, QUILT offers improved accuracy at reduced cost, particularly for diverse populations that are traditionally underserved in modern genomic analyses, with accuracy nearly doubling at rare SNPs. Finally, QUILT can accurately impute (four-digit) human leukocyte antigen types, the first such method from low-coverage sequence data.

DOI

10.1038/s41588-021-00877-0

QUILT2

Tool

DESCRIPTION

QUILT2 is a fast and memory-efficient method for imputation from low coverage sequence. Statistically, QUILT2 operates on a per-read basis, and is base quality aware, meaning it can accurately impute from diverse inputs, including short read (e.g. Illumina), long read sequencing (that might be noisy) (e.g. Oxford Nanopore Technologies), barcoded Illumina sequencing (e.g. Haplotagging) and ancient DNA. In addition, QUILT2 can impute both the mother and fetal genome using cfDNA NIPT data.

URL

https://github.com/rwdavies/QUILT

PREPRINT_DOI

10.1101/2024.07.18.604149

Main citation

Li, Z., Albrechtsen, A. & Davies, R. W. Rapid and accurate genotype imputation from low coverage short read, long read, and cell free DNA sequence. bioRxiv 2024.07.18.604149 (2024) doi:10.1101/2024.07.18.604149.

RareMETAL

Tool

PUBMED_LINK

24894501

DESCRIPTION

RAREMETAL is a program that facilitates the meta-analysis of rare variants from genotype arrays or sequencing (manuscript in preparation).

URL

https://genome.sph.umich.edu/wiki/RAREMETAL

KEYWORDS

rare variants

TITLE

RAREMETAL: fast and powerful meta-analysis for rare variants.

Main citation

Feng S, Liu D, Zhan X, Wing MK, ...&, Abecasis GR. (2014) RAREMETAL: fast and powerful meta-analysis for rare variants. Bioinformatics, 30 (19) 2828-9. doi:10.1093/bioinformatics/btu367. PMID 24894501

ABSTRACT

SUMMARY: RAREMETAL is a computationally efficient tool for meta-analysis of rare variants genotyped using sequencing or arrays. RAREMETAL facilitates analyses of individual studies, accommodates a variety of input file formats, handles related and unrelated individuals, executes both single variant and burden tests and performs conditional association analyses. AVAILABILITY AND IMPLEMENTATION: http://genome.sph.umich.edu/wiki/RAREMETAL for executables, source code, documentation and tutorial.

DOI

10.1093/bioinformatics/btu367

RASQUAL

Tool

PUBMED_LINK

26656845

FULL NAME

Robust Allele Specific QUAntitation and quality controL

DESCRIPTION

RASQUAL (Robust Allele Specific QUAntitation and quality controL) maps QTLs for sequence-based cellular traits by combining population and allele-specific signals.

URL

https://github.com/natsuhiko/rasqual

TITLE

Fine-mapping cellular QTLs with RASQUAL and ATAC-seq.

Main citation

Kumasaka N, Knights AJ, Gaffney DJ. (2016) Fine-mapping cellular QTLs with RASQUAL and ATAC-seq. Nat Genet, 48 (2) 206-13. doi:10.1038/ng.3467. PMID 26656845

ABSTRACT

When cellular traits are measured using high-throughput DNA sequencing, quantitative trait loci (QTLs) manifest as fragment count differences between individuals and allelic differences within individuals. We present RASQUAL (Robust Allele-Specific Quantitation and Quality Control), a new statistical approach for association mapping that models genetic effects and accounts for biases in sequencing data using a single, probabilistic framework. RASQUAL substantially improves fine-mapping accuracy and sensitivity relative to existing methods in RNA-seq, DNase-seq and ChIP-seq data. We illustrate how RASQUAL can be used to maximize association detection by generating the first map of chromatin accessibility QTLs (caQTLs) in a European population using ATAC-seq. Despite a modest sample size, we identified 2,707 independent caQTLs (at a false discovery rate of 10%) and demonstrated how RASQUAL and ATAC-seq can provide powerful information for fine-mapping gene-regulatory variants and for linking distal regulatory elements with gene promoters. Our results highlight how combining between-individual and allele-specific genetic signals improves the functional interpretation of noncoding variation.

DOI

10.1038/ng.3467

Relate

Tool

PUBMED_LINK

31477933

DESCRIPTION

Speidel, L., Forest, M., Shi, S. & Myers, S. R. A method for genome-wide genealogy estimation for thousands of samples. Nat. Genet. 51, 1321–1329 (2019).

URL

https://myersgroup.github.io/relate/

USE

Relate estimates genome-wide genealogies in the form of trees that adapt to changes in local ancestry caused by recombination. The method, which is scalable to thousands of samples, is described in the following paper. Please cite this paper if you use our software in your study.

TITLE

A method for genome-wide genealogy estimation for thousands of samples.

Main citation

Speidel L, Forest M, Shi S, Myers SR. (2019) A method for genome-wide genealogy estimation for thousands of samples. Nat Genet, 51 (9) 1321-1329. doi:10.1038/s41588-019-0484-x. PMID 31477933

ABSTRACT

Knowledge of genome-wide genealogies for thousands of individuals would simplify most evolutionary analyses for humans and other species, but has remained computationally infeasible. We have developed a method, Relate, scaling to >10,000 sequences while simultaneously estimating branch lengths, mutational ages and variable historical population sizes, as well as allowing for data errors. Application to 1,000 Genomes Project haplotypes produces joint genealogical histories for 26 human populations. Highly diverged lineages are present in all groups, but most frequent in Africa. Outside Africa, these mainly reflect ancient introgression from groups related to Neanderthals and Denisovans, while African signals instead reflect unknown events unique to that continent. Our approach allows more powerful inferences of natural selection than has previously been possible. We identify multiple regions under strong positive selection, and multi-allelic traits including hair color, body mass index and blood pressure, showing strong evidence of directional selection, varying among human groups.

DOI

10.1038/s41588-019-0484-x

REMETA

Tool

PUBMED_LINK

41225158

DESCRIPTION

REMETA is a computationally efficient C++ toolkit for meta-analysis of gene-based association tests using single-variant summary statistics from REGENIE-style pipelines, including burden and variance-component tests, with sparse per-study LD references rescaled per phenotype.

URL

https://github.com/rgcgithub/remeta, https://rgcgithub.github.io/remeta/

KEYWORDS

gene-based test, meta-analysis, summary statistics, REGENIE, burden, SKAT-O

TITLE

Computationally efficient meta-analysis of gene-based tests using summary statistics in large-scale genetic studies.

Main citation

Joseph TA, Mbatchou J, Ghosh A, Marcketta A, ...&, Marchini J. (2025) Computationally efficient meta-analysis of gene-based tests using summary statistics in large-scale genetic studies. Nat Genet, 57 (12) 3193-3200. doi:10.1038/s41588-025-02390-0. PMID 41225158

ABSTRACT

Meta-analysis of gene-based tests using single-variant summary statistics is a powerful strategy for genetic association studies. However, current approaches require sharing the covariance matrix between variants for each study and trait of interest. For large-scale studies with many phenotypes, these matrices can be cumbersome to calculate, store and share. Here, to address this challenge, we present REMETA, an efficient tool for meta-analysis of gene-based tests. REMETA uses a single sparse covariance reference file per study that is rescaled for each phenotype using single-variant summary statistics. We develop new methods for binary traits with case-control imbalance, and to estimate allele frequencies, genotype counts and effect sizes of burden tests. We demonstrate the performance and advantages of our approach through meta-analysis of five traits in 469,376 samples in UK Biobank. The open-source REMETA software will facilitate meta-analysis across large-scale exome sequencing studies from diverse studies that cannot easily be combined.

DOI

10.1038/s41588-025-02390-0

RENT+

Tool

PUBMED_LINK

28065901

DESCRIPTION

Mirzaei, S. & Wu, Y. RENT+: an improved method for inferring local genealogical trees from haplotypes with recombination. Bioinformatics 33, 1021–1030 (2017).

TITLE

RENT+: an improved method for inferring local genealogical trees from haplotypes with recombination.

Main citation

Mirzaei S, Wu Y. (2017) RENT+: an improved method for inferring local genealogical trees from haplotypes with recombination. Bioinformatics, 33 (7) 1021-1030. doi:10.1093/bioinformatics/btw735. PMID 28065901

ABSTRACT

MOTIVATION: Haplotypes from one or multiple related populations share a common genealogical history. If this shared genealogy can be inferred from haplotypes, it can be very useful for many population genetics problems. However, with the presence of recombination, the genealogical history of haplotypes is complex and cannot be represented by a single genealogical tree. Therefore, inference of genealogical history with recombination is much more challenging than the case of no recombination. RESULTS: In this paper, we present a new approach called RENT+ for the inference of local genealogical trees from haplotypes with the presence of recombination. RENT+ builds on a previous genealogy inference approach called RENT, which infers a set of related genealogical trees at different genomic positions. RENT+ represents a significant improvement over RENT in the sense that it is more effective in extracting information contained in the haplotype data about the underlying genealogy than RENT. The key components of RENT+ are several greatly enhanced genealogy inference rules. Through simulation, we show that RENT+ is more efficient and accurate than several existing genealogy inference methods. As an application, we apply RENT+ in the inference of population demographic history from haplotypes, which outperforms several existing methods. AVAILABILITY AND IMPLEMENTATION: RENT+ is implemented in Java, and is freely available for download from: https://github.com/SajadMirzaei/RentPlus . CONTACTS: sajad@engr.uconn.edu or ywu@engr.uconn.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

DOI

10.1093/bioinformatics/btw735

RESHAPE

Tool

PUBMED_LINK

38745108

FULL NAME

REcombine and Share HAPlotypEs

DESCRIPTION

RESHAPE removes sample-level genetic information from a reference panel to create a synthetic reference panel. Given a genetic map and the VCF/BCF of a reference panel, RESHAPE outputs a VCF/BCF of the same size in which each haplotype is a mosaic of the original haplotypes of the reference panel.
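
The mosaic idea can be illustrated with a small sketch. This is not RESHAPE's code: the function name `mosaic_panel`, the in-memory 0/1 haplotype matrix, and the rule that switching haplotypes exchange their copying sources are all assumptions made for illustration. The sketch only assumes that crossovers between adjacent sites follow a Poisson process whose rate grows with genetic distance and with the number of simulated generations.

```python
import numpy as np

def mosaic_panel(haps, cm_pos, generations, rng):
    """Return a synthetic panel whose rows are mosaics of the rows of `haps`.

    haps        : (n_haps, n_sites) array of 0/1 alleles (hypothetical input layout)
    cm_pos      : genetic position of each site in centimorgans
    generations : number of simulated generations (scales the crossover rate)
    rng         : numpy.random.Generator
    """
    n_haps, n_sites = haps.shape
    # source[i] is the original haplotype that synthetic haplotype i currently copies
    source = rng.permutation(n_haps)
    out = np.empty_like(haps)
    out[:, 0] = haps[source, 0]
    for j in range(1, n_sites):
        morgans = (cm_pos[j] - cm_pos[j - 1]) / 100.0
        # Poisson process: probability of at least one crossover in this interval
        p_switch = 1.0 - np.exp(-generations * morgans)
        idx = np.flatnonzero(rng.random(n_haps) < p_switch)
        if idx.size > 1:
            # Switching haplotypes exchange sources, so `source` stays a permutation
            source[idx] = source[rng.permutation(idx)]
        out[:, j] = haps[source, j]
    return out
```

Because `source` remains a permutation at every site, the output has the same dimensions and the same per-site allele counts as the input, while no output row is guaranteed to match any single input row.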

URL

https://github.com/TheoCavinato/RESHAPE

TITLE

A resampling-based approach to share reference panels.

Main citation

Cavinato T, Rubinacci S, Malaspinas AS, Delaneau O. (2024) A resampling-based approach to share reference panels. Nat Comput Sci, 4 (5) 360-366. doi:10.1038/s43588-024-00630-7. PMID 38745108

ABSTRACT

For many genome-wide association studies, imputing genotypes from a haplotype reference panel is a necessary step. Over the past 15 years, reference panels have become larger and more diverse, leading to improvements in imputation accuracy. However, the latest generation of reference panels is subject to restrictions on data sharing due to concerns about privacy, limiting their usefulness for genotype imputation. In this context, here we propose RESHAPE, a method that employs a recombination Poisson process on a reference panel to simulate the genomes of hypothetical descendants after multiple generations. This data transformation helps to protect against re-identification threats and preserves data attributes, such as linkage disequilibrium patterns and, to some degree, identity-by-descent sharing, allowing for genotype imputation. Our experiments on gold-standard datasets show that simulated descendants up to eight generations can serve as reference panels without substantially reducing genotype imputation accuracy.

DOI

10.1038/s43588-024-00630-7

Review

Tool

PUBMED_LINK

32967067

TITLE

Genes and environments, development and time.

Main citation

Boyce WT, Sokolowski MB, Robinson GE. (2020) Genes and environments, development and time. Proc Natl Acad Sci U S A, 117 (38) 23235-23241. doi:10.1073/pnas.2016710117. PMID 32967067

ABSTRACT

A now substantial body of science implicates a dynamic interplay between genetic and environmental variation in the development of individual differences in behavior and health. Such outcomes are affected by molecular, often epigenetic, processes involving gene-environment (G-E) interplay that can influence gene expression. Early environments with exposures to poverty, chronic adversities, and acutely stressful events have been linked to maladaptive development and compromised health and behavior. Genetic differences can impart either enhanced or blunted susceptibility to the effects of such pathogenic environments. However, largely missing from present discourse regarding G-E interplay is the role of time, a "third factor" guiding the emergence of complex developmental endpoints across different scales of time. Trajectories of development increasingly appear best accounted for by a complex, dynamic interchange among the highly linked elements of genes, contexts, and time at multiple scales, including neurobiological (minutes to milliseconds), genomic (hours to minutes), developmental (years and months), and evolutionary (centuries and millennia) time. This special issue of PNAS thus explores time and timing among G-E transactions: The importance of timing and timescales in plasticity and critical periods of brain development; epigenetics and the molecular underpinnings of biologically embedded experience; the encoding of experience across time and biological levels of organization; and gene-regulatory networks in behavior and development and their linkages to neuronal networks. Taken together, the collection of papers offers perspectives on how G-E interplay operates contingently within and against a backdrop of time and timescales.

DOI

10.1073/pnas.2016710117

Review-Das

Tool

PUBMED_LINK

29799802

TITLE

Genotype Imputation from Large Reference Panels.

Main citation

Das S, Abecasis GR, Browning BL. (2018) Genotype Imputation from Large Reference Panels. Annu Rev Genomics Hum Genet, 19 () 73-96. doi:10.1146/annurev-genom-083117-021602. PMID 29799802

ABSTRACT

Genotype imputation has become a standard tool in genome-wide association studies because it enables researchers to inexpensively approximate whole-genome sequence data from genome-wide single-nucleotide polymorphism array data. Genotype imputation increases statistical power, facilitates fine mapping of causal variants, and plays a key role in meta-analyses of genome-wide association studies. Only variants that were previously observed in a reference panel of sequenced individuals can be imputed. However, the rapid increase in the number of deeply sequenced individuals will soon make it possible to assemble enormous reference panels that greatly increase the number of imputable variants. In this review, we present an overview of genotype imputation and describe the computational techniques that make it possible to impute genotypes from reference panels with millions of individuals.

DOI

10.1146/annurev-genom-083117-021602

Review-Fst

Tool

PUBMED_LINK

19687804

DESCRIPTION

Holsinger, K. E., & Weir, B. S. (2009). Genetics in geographically structured populations: defining, estimating and interpreting F(ST). Nature Reviews Genetics, 10(9), 639-650.

TITLE

Genetics in geographically structured populations: defining, estimating and interpreting F(ST).

Main citation

Holsinger KE, Weir BS. (2009) Genetics in geographically structured populations: defining, estimating and interpreting F(ST). Nat Rev Genet, 10 (9) 639-50. doi:10.1038/nrg2611. PMID 19687804

ABSTRACT

Wright's F-statistics, and especially F(ST), provide important insights into the evolutionary processes that influence the structure of genetic variation within and among populations, and they are among the most widely used descriptive statistics in population and evolutionary genetics. Estimates of F(ST) can identify regions of the genome that have been the target of selection, and comparisons of F(ST) from different parts of the genome can provide insights into the demographic history of populations. For these reasons and others, F(ST) has a central role in population and evolutionary genetics and has wide applications in fields that range from disease association mapping to forensic science. This Review clarifies how F(ST) is defined, how it should be estimated, how it is related to similar statistics and how estimates of F(ST) should be interpreted.

DOI

10.1038/nrg2611

Review-Kachuri

Tool

PUBMED_LINK

37620596

TITLE

Principles and methods for transferring polygenic risk scores across global populations.

Main citation

Kachuri L, Chatterjee N, Hirbo J, Schaid DJ, ...&, Ge T. (2024) Principles and methods for transferring polygenic risk scores across global populations. Nat Rev Genet, 25 (1) 8-25. doi:10.1038/s41576-023-00637-2. PMID 37620596

ABSTRACT

Polygenic risk scores (PRSs) summarize the genetic predisposition of a complex human trait or disease and may become a valuable tool for advancing precision medicine. However, PRSs that are developed in populations of predominantly European genetic ancestries can increase health disparities due to poor predictive performance in individuals of diverse and complex genetic ancestries. We describe genetic and modifiable risk factors that limit the transferability of PRSs across populations and review the strengths and weaknesses of existing PRS construction methods for diverse ancestries. Developing PRSs that benefit global populations in research and clinical settings provides an opportunity for innovation and is essential for health equity.

DOI

10.1038/s41576-023-00637-2

Review-Lappalainen

Tool

PUBMED_LINK

34554789

TITLE

From variant to function in human disease genetics.

Main citation

Lappalainen T, MacArthur DG. (2021) From variant to function in human disease genetics. Science, 373 (6562) 1464-1468. doi:10.1126/science.abi8207. PMID 34554789

ABSTRACT

Over the next decade, the primary challenge in human genetics will be to understand the biological mechanisms by which genetic variants influence phenotypes, including disease risk. Although the scale of this challenge is daunting, better methods for functional variant interpretation will have transformative consequences for disease diagnosis, risk prediction, and the development of new therapies. An array of new methods for characterizing variant impact at scale, using patient tissue samples as well as in vitro models, are already being applied to dissect variant mechanisms across a range of human cell types and environments. These approaches are also increasingly being deployed in clinical settings. We discuss the rationale, approaches, applications, and future outlook for characterizing the molecular and cellular effects of genetic variants.

DOI

10.1126/science.abi8207

Review-Li

Tool

PUBMED_LINK

19715440

TITLE

Genotype imputation.

Main citation

Li Y, Willer C, Sanna S, Abecasis G. (2009) Genotype imputation. Annu Rev Genomics Hum Genet, 10 () 387-406. doi:10.1146/annurev.genom.9.081307.164242. PMID 19715440

ABSTRACT

Genotype imputation is now an essential tool in the analysis of genome-wide association scans. This technique allows geneticists to accurately evaluate the evidence for association at genetic markers that are not directly genotyped. Genotype imputation is particularly useful for combining results across studies that rely on different genotyping platforms but also increases the power of individual scans. Here, we review the history and theoretical underpinnings of the technique. To illustrate performance of the approach, we summarize results from several gene mapping studies. Finally, we preview the role of genotype imputation in an era when whole genome resequencing is becoming increasingly common.

DOI

10.1146/annurev.genom.9.081307.164242

Review-Marchini

Tool

PUBMED_LINK

20517342

TITLE

Genotype imputation for genome-wide association studies.

Main citation

Marchini J, Howie B. (2010) Genotype imputation for genome-wide association studies. Nat Rev Genet, 11 (7) 499-511. doi:10.1038/nrg2796. PMID 20517342

ABSTRACT

In the past few years genome-wide association (GWA) studies have uncovered a large number of convincingly replicated associations for many complex human diseases. Genotype imputation has been used widely in the analysis of GWA studies to boost power, fine-map associations and facilitate the combination of results across studies using meta-analysis. This Review describes the details of several different statistical methods for imputing genotypes, illustrates and discusses the factors that influence imputation performance, and reviews methods that can be used to assess imputation performance and test association at imputed SNPs.

DOI

10.1038/nrg2796

Review-Peter

Tool

PUBMED_LINK

34554790

TITLE

Discovery and implications of polygenicity of common diseases.

Main citation

Visscher PM, Yengo L, Cox NJ, Wray NR. (2021) Discovery and implications of polygenicity of common diseases. Science, 373 (6562) 1468-1473. doi:10.1126/science.abi8206. PMID 34554790

ABSTRACT

The sequencing of the human genome has allowed the study of the genetic architecture of common diseases: the number of genomic variants that contribute to risk of disease and their joint frequency and effect size distribution. Common diseases are polygenic, with many loci contributing to phenotype, and the cumulative burden of risk alleles determines individual risk in conjunction with environmental factors. Most risk loci occur in noncoding regions of the genome regulating cell- and context-specific gene expression. Although the effect sizes of most risk alleles are small, their cumulative effects in individuals, quantified as a polygenic (risk) score, can identify people at increased risk of disease, thereby facilitating prevention or early intervention.

DOI

10.1126/science.abi8206

Review-Povysil

Tool

PUBMED_LINK

31605095

TITLE

Rare-variant collapsing analyses for complex traits: guidelines and applications.

Main citation

Povysil G, Petrovski S, Hostyk J, Aggarwal V, ...&, Goldstein DB. (2019) Rare-variant collapsing analyses for complex traits: guidelines and applications. Nat Rev Genet, 20 (12) 747-759. doi:10.1038/s41576-019-0177-4. PMID 31605095

ABSTRACT

The first phase of genome-wide association studies (GWAS) assessed the role of common variation in human disease. Advances optimizing and economizing high-throughput sequencing have enabled a second phase of association studies that assess the contribution of rare variation to complex disease in all protein-coding genes. Unlike the early microarray-based studies, sequencing-based studies catalogue the full range of genetic variation, including the evolutionarily youngest forms. Although the experience with common variants helped establish relevant standards for genome-wide studies, the analysis of rare variation introduces several challenges that require novel analysis approaches.

DOI

10.1038/s41576-019-0177-4

Review-Wang

Tool

PUBMED_LINK

35576555

TITLE

Challenges and Opportunities for Developing More Generalizable Polygenic Risk Scores.

Main citation

Wang Y, Tsuo K, Kanai M, Neale BM, ...&, Martin AR. (2022) Challenges and Opportunities for Developing More Generalizable Polygenic Risk Scores. Annu Rev Biomed Data Sci, 5 () 293-320. doi:10.1146/annurev-biodatasci-111721-074830. PMID 35576555

ABSTRACT

Polygenic risk scores (PRS) estimate an individual's genetic likelihood of complex traits and diseases by aggregating information across multiple genetic variants identified from genome-wide association studies. PRS can predict a broad spectrum of diseases and have therefore been widely used in research settings. Some work has investigated their potential applications as biomarkers in preventative medicine, but significant work is still needed to definitively establish and communicate absolute risk to patients for genetic and modifiable risk factors across demographic groups. However, the biggest limitation of PRS currently is that they show poor generalizability across diverse ancestries and cohorts. Major efforts are underway through methodological development and data generation initiatives to improve their generalizability. This review aims to comprehensively discuss current progress on the development of PRS, the factors that affect their generalizability, and promising areas for improving their accuracy, portability, and implementation.

DOI

10.1146/annurev-biodatasci-111721-074830

reviews

Tool

PUBMED_LINK

30387919

TITLE

Strategies for Pathway Analysis Using GWAS and WGS Data.

Main citation

White MJ, Yaspan BL, Veatch OJ, Goddard P, ...&, Contreras MG. (2019) Strategies for Pathway Analysis Using GWAS and WGS Data. Curr Protoc Hum Genet, 100 (1) e79. doi:10.1002/cphg.79. PMID 30387919

ABSTRACT

Single-allele study designs, commonly used in genome-wide association studies (GWAS) as well as the more recently developed whole genome sequencing (WGS) studies, are a standard approach for investigating the relationship of common variation within the human genome to a given phenotype of interest. However, single-allele association results published for many GWAS studies represent only the tip of the iceberg for the information that can be extracted from these datasets. The primary analysis strategy for GWAS entails association analysis in which only the single nucleotide polymorphisms (SNPs) with the strongest p-values are declared statistically significant due to issues arising from multiple testing and type I errors. Factors such as locus heterogeneity, epistasis, and multiple genes conferring small effects contribute to the complexity of the genetic models underlying phenotype expression. Thus, many biologically meaningful associations having lower effect sizes at individual genes are overlooked, making it difficult to separate true associations from a sea of false-positive associations. Organizing these individual SNPs into biologically meaningful groups to look at the overall effects of minor perturbations to genes and pathways is desirable. This pathway-based approach provides researchers with insight into the functional foundations of the phenotype being studied and allows testing of various genetic scenarios. © 2018 by John Wiley & Sons, Inc.

DOI

10.1002/cphg.79

Reviews

Tool

PUBMED_LINK

32860016

TITLE

Genetics meets proteomics: perspectives for large population-based studies.

Main citation

Suhre K, McCarthy MI, Schwenk JM. (2021) Genetics meets proteomics: perspectives for large population-based studies. Nat Rev Genet, 22 (1) 19-37. doi:10.1038/s41576-020-0268-2. PMID 32860016

ABSTRACT

Proteomic analysis of cells, tissues and body fluids has generated valuable insights into the complex processes influencing human biology. Proteins represent intermediate phenotypes for disease and provide insight into how genetic and non-genetic risk factors are mechanistically linked to clinical outcomes. Associations between protein levels and DNA sequence variants that colocalize with risk alleles for common diseases can expose disease-associated pathways, revealing novel drug targets and translational biomarkers. However, genome-wide, population-scale analyses of proteomic data are only now emerging. Here, we review current findings from studies of the plasma proteome and discuss their potential for advancing biomedical translation through the interpretation of genome-wide association analyses. We highlight the challenges faced by currently available technologies and provide perspectives relevant to their future application in large-scale biobank studies.

DOI

10.1038/s41576-020-0268-2

Reviews&Tutorials

Tool

PUBMED_LINK

27427429

TITLE

Commentary: Two-sample Mendelian randomization: opportunities and challenges.

Main citation

Lawlor DA. (2016) Commentary: Two-sample Mendelian randomization: opportunities and challenges. Int J Epidemiol, 45 (3) 908-15. doi:10.1093/ije/dyw127. PMID 27427429

DOI

10.1093/ije/dyw127

RFR SuSiE-inf FINEMAP-inf (RFR)

Tool

PUBMED_LINK

38036779

FULL NAME

Replication Failure Rate

DESCRIPTION

Replication Failure Rate (RFR) is a metric for assessing the consistency of fine-mapping results, based on downsampling a large cohort. SuSiE-inf and FINEMAP-inf extend SuSiE and FINEMAP to incorporate a term for infinitesimal effects in addition to a small number of larger causal effects of interest.
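
As a rough illustration of the downsampling idea, the sketch below computes a replication failure rate from two sets of posterior inclusion probabilities (PIPs): one from a downsampled cohort and one from the full cohort. The function name, the dictionary inputs, and the 0.9 and 0.1 thresholds are assumptions chosen for the example, not values taken from the fine-mapping-inf repository.

```python
def replication_failure_rate(pip_sub, pip_full, high=0.9, low=0.1):
    """Fraction of high-confidence variants in the subsample that fail to replicate.

    pip_sub  : dict mapping variant ID -> PIP in the downsampled cohort
    pip_full : dict mapping variant ID -> PIP in the full cohort
    high/low : illustrative thresholds (assumed, adjustable)
    """
    # Variants called with high confidence in the smaller sample
    high_conf = [v for v, p in pip_sub.items() if p > high]
    if not high_conf:
        return float("nan")
    # A call "fails" when the full cohort assigns it a low PIP;
    # variants missing from the full-cohort results are treated as PIP 0
    failed = sum(1 for v in high_conf if pip_full.get(v, 0.0) < low)
    return failed / len(high_conf)
```

A well-calibrated method should rarely assign a high PIP in the subsample to a variant that the larger, more powerful analysis then rejects, so a high value from this function points to overconfidence.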

URL

https://github.com/FinucaneLab/fine-mapping-inf

TITLE

Improving fine-mapping by modeling infinitesimal effects.

Main citation

Cui R, Elzur RA, Kanai M, Ulirsch JC, ...&, Finucane HK. (2024) Improving fine-mapping by modeling infinitesimal effects. Nat Genet, 56 (1) 162-169. doi:10.1038/s41588-023-01597-3. PMID 38036779

ABSTRACT

Fine-mapping aims to identify causal genetic variants for phenotypes. Bayesian fine-mapping algorithms (for example, SuSiE, FINEMAP, ABF and COJO-ABF) are widely used, but assessing posterior probability calibration remains challenging in real data, where model misspecification probably exists, and true causal variants are unknown. We introduce replication failure rate (RFR), a metric to assess fine-mapping consistency by downsampling. SuSiE, FINEMAP and COJO-ABF show high RFR, indicating potential overconfidence in their output. Simulations reveal that nonsparse genetic architecture can lead to miscalibration, while imputation noise, nonuniform distribution of causal variants and quality control filters have minimal impact. Here we present SuSiE-inf and FINEMAP-inf, fine-mapping methods modeling infinitesimal effects alongside fewer larger causal effects. Our methods show improved calibration, RFR and functional enrichment, competitive recall and computational efficiency. Notably, using our methods' posterior effect sizes substantially increases polygenic risk score accuracy over SuSiE and FINEMAP. Our work improves causal variant identification for complex traits, a fundamental goal of human genetics.

DOI

10.1038/s41588-023-01597-3

RolyPoly

Tool

PUBMED_LINK

29106824

DESCRIPTION

RolyPoly is a regression-based polygenic model that can prioritize trait-relevant cell types and genes from GWAS summary statistics and gene expression data.

URL

https://github.com/dcalderon/rolypoly

TITLE

Inferring Relevant Cell Types for Complex Traits by Using Single-Cell Gene Expression.

Main citation

Calderon D, Bhaskar A, Knowles DA, Golan D, ...&, Pritchard JK. (2017) Inferring Relevant Cell Types for Complex Traits by Using Single-Cell Gene Expression. Am J Hum Genet, 101 (5) 686-699. doi:10.1016/j.ajhg.2017.09.009. PMID 29106824

ABSTRACT

Previous studies have prioritized trait-relevant cell types by looking for an enrichment of genome-wide association study (GWAS) signal within functional regions. However, these studies are limited in cell resolution by the lack of functional annotations from difficult-to-characterize or rare cell populations. Measurement of single-cell gene expression has become a popular method for characterizing novel cell types, and yet limited work has linked single-cell RNA sequencing (RNA-seq) to phenotypes of interest. To address this deficiency, we present RolyPoly, a regression-based polygenic model that can prioritize trait-relevant cell types and genes from GWAS summary statistics and gene expression data. RolyPoly is designed to use expression data from either bulk tissue or single-cell RNA-seq. In this study, we demonstrated RolyPoly's accuracy through simulation and validated previously known tissue-trait associations. We discovered a significant association between microglia and late-onset Alzheimer disease and an association between schizophrenia and oligodendrocytes and replicating fetal cortical cells. Additionally, RolyPoly computes a trait-relevance score for each gene to reflect the importance of expression specific to a cell type. We found that differentially expressed genes in the prefrontal cortex of individuals with Alzheimer disease were significantly enriched with genes ranked highly by RolyPoly gene scores. Overall, our method represents a powerful framework for understanding the effect of common variants on cell types contributing to complex traits.

DOI

10.1016/j.ajhg.2017.09.009

rtPRS-CS

Tool

FULL NAME

real-time PRS-CS

DESCRIPTION

rtPRS-CS is a Python-based command line tool that performs real-time online updating of polygenic risk score (PRS) weights in a target dataset, one sample at a time. Given the most recent set of SNP weights, for each new target sample with both phenotypic and genetic information, rtPRS-CS uses stochastic gradient descent to update the SNP weights, adjusting for the effect of a set of covariates.
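
The one-sample-at-a-time update can be sketched as a plain stochastic gradient step on squared error. This is a generic illustration, not the rtPRS-CS implementation: the function name `sgd_update`, the learning rate, and the simple ridge-style `shrink` term (a stand-in for the tool's actual shrinkage prior) are all assumptions.

```python
import numpy as np

def sgd_update(weights, cov_effects, genotype, covariates, phenotype,
               lr=0.01, shrink=0.0):
    """One stochastic gradient step for a single new sample.

    weights     : current SNP weights, shape (n_snps,)
    cov_effects : current covariate effects, shape (n_cov,)
    genotype    : allele dosages for the new sample, shape (n_snps,)
    covariates  : covariate values for the new sample, shape (n_cov,)
    phenotype   : observed phenotype of the new sample (scalar)
    """
    # Prediction combines the genetic score and the covariate adjustment
    pred = genotype @ weights + covariates @ cov_effects
    resid = phenotype - pred
    # Gradient step on 0.5 * resid**2, with an optional ridge penalty on SNP weights
    new_weights = weights + lr * (resid * genotype - shrink * weights)
    new_cov_effects = cov_effects + lr * resid * covariates
    return new_weights, new_cov_effects
```

Calling the function once per incoming sample, and feeding each returned pair into the next call, gives the streaming behaviour described above without revisiting earlier samples.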

URL

https://github.com/getian107/rtPRS

PREPRINT_DOI

10.1101/2024.07.12.24310357

Main citation

Tubbs, J. D., Chen, Y., Duan, R., Huang, H. & Ge, T. Real-time dynamic polygenic prediction for streaming data. bioRxiv 2024.07.12.24310357 (2024) doi:10.1101/2024.07.12.24310357.

RWAS

Tool

PUBMED_LINK

35697866

FULL NAME

Regulome-Wide Association Study

URL

http://gusevlab.org/projects/fusion/#tcga-regulome-wide-association-study-rwas-atac-seq-models

TITLE

Allelic imbalance of chromatin accessibility in cancer identifies candidate causal risk variants and their mechanisms.

Main citation

Grishin D, Gusev A. (2022) Allelic imbalance of chromatin accessibility in cancer identifies candidate causal risk variants and their mechanisms. Nat Genet, 54 (6) 837-849. doi:10.1038/s41588-022-01075-2. PMID 35697866

ABSTRACT

While many germline cancer risk variants have been identified through genome-wide association studies (GWAS), the mechanisms by which these variants operate remain largely unknown. Here we used 406 cancer ATAC-Seq samples across 23 cancer types to identify 7,262 germline allele-specific accessibility QTLs (as-aQTLs). Cancer as-aQTLs had stronger enrichment for cancer risk heritability (up to 145 fold) than any other functional annotation across seven cancer GWAS. Most cancer as-aQTLs directly altered transcription factor (TF) motifs and exhibited differential TF binding and gene expression in functional screens. To connect as-aQTLs to putative risk mechanisms, we introduced the regulome-wide associations study (RWAS). RWAS identified genetically associated accessible peaks at >70% of known breast and prostate loci and discovered new risk loci in all examined cancer types. Integrating as-aQTL discovery, motif analysis and RWAS identified candidate causal regulatory elements and their probable upstream regulators. Our work establishes cancer as-aQTLs and RWAS analysis as powerful tools to study the genetic architecture of cancer risk.

DOI

10.1038/s41588-022-01075-2

S-LDXR

Tool

PUBMED_LINK

33597505

DESCRIPTION

S-LDXR is software for estimating enrichment of stratified squared trans-ethnic genetic correlation across genomic annotations from GWAS summary statistics data.

URL

https://huwenboshi.github.io/s-ldxr/

KEYWORDS

trans-ethnic, stratified, functional categories

TITLE

Population-specific causal disease effect sizes in functionally important regions impacted by selection.

Main citation

Shi H, Gazal S, Kanai M, Koch EM, ...&, Price AL. (2021) Population-specific causal disease effect sizes in functionally important regions impacted by selection. Nat Commun, 12 (1) 1098. doi:10.1038/s41467-021-21286-1. PMID 33597505

ABSTRACT

Many diseases exhibit population-specific causal effect sizes with trans-ethnic genetic correlations significantly less than 1, limiting trans-ethnic polygenic risk prediction. We develop a new method, S-LDXR, for stratifying squared trans-ethnic genetic correlation across genomic annotations, and apply S-LDXR to genome-wide summary statistics for 31 diseases and complex traits in East Asians (average N = 90K) and Europeans (average N = 267K) with an average trans-ethnic genetic correlation of 0.85. We determine that squared trans-ethnic genetic correlation is 0.82× (s.e. 0.01) depleted in the top quintile of background selection statistic, implying more population-specific causal effect sizes. Accordingly, causal effect sizes are more population-specific in functionally important regions, including conserved and regulatory regions. In regions surrounding specifically expressed genes, causal effect sizes are most population-specific for skin and immune genes, and least population-specific for brain genes. Our results could potentially be explained by stronger gene-environment interaction at loci impacted by selection, particularly positive selection.

DOI

10.1038/s41467-021-21286-1

S-PrediXcan

Tool

PUBMED_LINK

29739930

DESCRIPTION

S-PrediXcan is a mathematical expression for computing PrediXcan (a gene mapping approach) results from summary data.
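
A minimal sketch of the summary-statistic approximation, assuming the commonly cited form in which the gene-level Z-score is a weighted sum of SNP-level GWAS Z-scores, with each SNP weight scaled by the ratio of the SNP's standard deviation to the standard deviation of predicted expression. The function name and array inputs are hypothetical; the MetaXcan software reads these quantities from its own model and covariance files.

```python
import numpy as np

def spredixcan_z(weights, gwas_z, snp_cov):
    """Approximate gene-level Z-score from GWAS summary statistics.

    weights : expression prediction weights for the gene's SNPs, shape (n_snps,)
    gwas_z  : GWAS Z-scores (beta / se) for the same SNPs, shape (n_snps,)
    snp_cov : SNP genotype covariance from a reference panel, shape (n_snps, n_snps)
    """
    # Per-SNP standard deviations come from the diagonal of the covariance matrix
    sigma_l = np.sqrt(np.diag(snp_cov))
    # Standard deviation of predicted expression: sqrt(w' * Cov * w)
    sigma_g = np.sqrt(weights @ snp_cov @ weights)
    return float(np.sum(weights * sigma_l * gwas_z) / sigma_g)
```

With a single SNP in the model, the two standard deviations cancel and the gene-level Z-score is the SNP's GWAS Z-score, signed by the weight. This is a quick sanity check on the scaling.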

URL

https://github.com/hakyimlab/MetaXcan

TITLE

Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics.

Main citation

Barbeira AN, Dickinson SP, Bonazzola R, Zheng J, ...&, Im HK. (2018) Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nat Commun, 9 (1) 1825. doi:10.1038/s41467-018-03621-1. PMID 29739930

ABSTRACT

Scalable, integrative methods to understand mechanisms that link genetic variants with phenotypes are needed. Here we derive a mathematical expression to compute PrediXcan (a gene mapping approach) results using summary data (S-PrediXcan) and show its accuracy and general robustness to misspecified reference sets. We apply this framework to 44 GTEx tissues and 100+ phenotypes from GWAS and meta-analysis studies, creating a growing public catalog of associations that seeks to capture the effects of gene expression variation on human phenotypes. Replication in an independent cohort is shown. Most of the associations are tissue specific, suggesting context specificity of the trait etiology. Colocalized significant associations in unexpected tissues underscore the need for an agnostic scanning of multiple contexts to improve our ability to detect causal regulatory mechanisms. Monogenic disease genes are enriched among significant associations for related traits, suggesting that smaller alterations of these genes may cause a spectrum of milder phenotypes.

DOI

10.1038/s41467-018-03621-1

SAIGE

Tool

PUBMED_LINK

30104761

FULL NAME

Scalable and Accurate Implementation of GEneralized mixed model

DESCRIPTION

SAIGE is an R package providing a Scalable and Accurate Implementation of GEneralized mixed model (Chen, H. et al., 2016). It accounts for sample relatedness and is feasible for genetic association tests in large cohorts and biobanks (N > 400,000). SAIGE performs single-variant association tests for binary traits and quantitative traits. For binary traits, SAIGE uses the saddlepoint approximation (SPA) (Imhof, J. P., 1961; Kuonen, D., 1999; Dey, R. et al., 2017) to account for case-control imbalance.

URL

https://github.com/weizhouUMICH/SAIGE

KEYWORDS

case-control imbalance, saddlepoint approximation (SPA)

TITLE

Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies.

Main citation

Zhou W, Nielsen JB, Fritsche LG, Dey R, ...&, Lee S. (2018) Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat Genet, 50 (9) 1335-1341. doi:10.1038/s41588-018-0184-y. PMID 30104761

ABSTRACT

In genome-wide association studies (GWAS) for thousands of phenotypes in large biobanks, most binary traits have substantially fewer cases than controls. Both of the widely used approaches, the linear mixed model and the recently proposed logistic mixed model, perform poorly; they produce large type I error rates when used to analyze unbalanced case-control phenotypes. Here we propose a scalable and accurate generalized mixed model association test that uses the saddlepoint approximation to calibrate the distribution of score test statistics. This method, SAIGE (Scalable and Accurate Implementation of GEneralized mixed model), provides accurate P values even when case-control ratios are extremely unbalanced. SAIGE uses state-of-art optimization strategies to reduce computational costs; hence, it is applicable to GWAS for thousands of phenotypes by large biobanks. Through the analysis of UK Biobank data of 408,961 samples from white British participants with European ancestry for > 1,400 binary phenotypes, we show that SAIGE can efficiently analyze large sample data, controlling for unbalanced case-control ratios and sample relatedness.

DOI

10.1038/s41588-018-0184-y
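
To illustrate why a saddlepoint approximation helps with unbalanced case-control data, here is a minimal textbook-style sketch. It is not SAIGE's implementation: it ignores relatedness and covariate adjustment, and assumes a score statistic S = Σ g_i (Y_i − μ_i) with independent Y_i ~ Bernoulli(μ_i). The function name `spa_upper_tail`, the bisection bracket, and the fallback to the normal approximation within two standard deviations are all assumptions made for the example.

```python
import math

import numpy as np


def spa_upper_tail(g, mu, s):
    """Approximate P(S > s) for S = sum_i g_i * (Y_i - mu_i), Y_i ~ Bernoulli(mu_i).

    Uses a Lugannani-Rice / Barndorff-Nielsen type saddlepoint formula.
    Assumes small non-negative genotypes g and s inside the attainable range.
    """
    g = np.asarray(g, dtype=float)
    mu = np.asarray(mu, dtype=float)

    def K(t):   # cumulant generating function of S
        return np.sum(np.log(1 - mu + mu * np.exp(g * t))) - t * np.sum(g * mu)

    def K1(t):  # first derivative of K
        e = mu * np.exp(g * t)
        return np.sum(g * e / (1 - mu + e)) - np.sum(g * mu)

    def K2(t):  # second derivative of K
        e = mu * np.exp(g * t)
        return np.sum((1 - mu) * g ** 2 * e / (1 - mu + e) ** 2)

    # Near the mean the normal approximation is adequate (assumed cutoff: 2 sd)
    z = s / math.sqrt(K2(0.0))
    if abs(z) < 2:
        return 0.5 * math.erfc(z / math.sqrt(2))

    # Solve K'(t) = s by bisection; K' is increasing because K'' > 0
    lo, hi = -20.0, 20.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if K1(mid) < s:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)

    w = math.copysign(math.sqrt(2 * (t * s - K(t))), t)
    v = t * math.sqrt(K2(t))
    return 0.5 * math.erfc((w + math.log(v / w) / w) / math.sqrt(2))


# 100 carriers, 1% prevalence: the normal approximation understates the tail
g = np.ones(100)
mu = np.full(100, 0.01)
print(spa_upper_tail(g, mu, 3.0))
print(0.5 * math.erfc(3.0 / math.sqrt(0.99) / math.sqrt(2)))
```

When prevalence is low the distribution of the score statistic is strongly right-skewed, so the saddlepoint tail probability is noticeably larger than the normal one. With balanced data the two nearly coincide.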

SAIGE-GENE+

Tool

PUBMED_LINK

36138231

URL

https://github.com/weizhouUMICH/SAIGE

TITLE

SAIGE-GENE+ improves the efficiency and accuracy of set-based rare variant association tests.

Main citation

Zhou W, Bi W, Zhao Z, Dey KK, ...&, Lee S. (2022) SAIGE-GENE+ improves the efficiency and accuracy of set-based rare variant association tests. Nat Genet, 54 (10) 1466-1469. doi:10.1038/s41588-022-01178-w. PMID 36138231

ABSTRACT

Several biobanks, including UK Biobank (UKBB), are generating large-scale sequencing data. An existing method, SAIGE-GENE, performs well when testing variants with minor allele frequency (MAF) ≤ 1%, but inflation is observed in variance component set-based tests when restricting to variants with MAF ≤ 0.1% or 0.01%. Here, we propose SAIGE-GENE+ with greatly improved type I error control and computational efficiency to facilitate rare variant tests in large-scale data. We further show that incorporating multiple MAF cutoffs and functional annotations can improve power and thus uncover new gene-phenotype associations. In the analysis of UKBB whole exome sequencing data for 30 quantitative and 141 binary traits, SAIGE-GENE+ identified 551 gene-phenotype associations.

DOI

10.1038/s41588-022-01178-w

SAIGE-QTL

Tool

DESCRIPTION

SAIGE-QTL is a robust and scalable tool that can directly map eQTLs using single-cell profiles without needing aggregation at the pseudobulk level.

URL

https://github.com/weizhou0/qtl

KEYWORDS

single-cell eQTL, rare variant, set-based test, trans-eQTL, SPA

Main citation

Zhou, W., Cuomo, A., Xue, A., Kanai, M., Chau, G., Krishna, C., ... & Neale, B. M. (2024). Efficient and accurate mixed model association tool for single-cell eQTL analysis. medRxiv, 2024-05.

Sakaue

Tool

PUBMED_LINK

37495751

URL

https://github.com/immunogenomics/HLA_analyses_tutorial

KEYWORDS

HLA analyses tutorial

TITLE

Tutorial: a statistical genetics guide to identifying HLA alleles driving complex disease.

Main citation

Sakaue S, Gurajala S, Curtis M, Luo Y, ...&, Raychaudhuri S. (2023) Tutorial: a statistical genetics guide to identifying HLA alleles driving complex disease. Nat Protoc, 18 (9) 2625-2641. doi:10.1038/s41596-023-00853-4. PMID 37495751

ABSTRACT

The human leukocyte antigen (HLA) locus is associated with more complex diseases than any other locus in the human genome. In many diseases, HLA explains more heritability than all other known loci combined. In silico HLA imputation methods enable rapid and accurate estimation of HLA alleles in the millions of individuals that are already genotyped on microarrays. HLA imputation has been used to define causal variation in autoimmune diseases, such as type I diabetes, and in human immunodeficiency virus infection control. However, there are few guidelines on performing HLA imputation, association testing, and fine mapping. Here, we present a comprehensive tutorial to impute HLA alleles from genotype data. We provide detailed guidance on performing standard quality control measures for input genotyping data and describe options to impute HLA alleles and amino acids either locally or using the web-based Michigan Imputation Server, which hosts a multi-ancestry HLA imputation reference panel. We also offer best practice recommendations to conduct association tests to define the alleles, amino acids, and haplotypes that affect human traits. Along with the pipeline, we provide a step-by-step online guide with scripts and available software ( https://github.com/immunogenomics/HLA_analyses_tutorial ). This tutorial will be broadly applicable to large-scale genotyping data and will contribute to defining the role of HLA in human diseases across global populations.

DOI

10.1038/s41596-023-00853-4

Salinas

Tool

PUBMED_LINK

29020254

TITLE

Statistical Analysis of Multiple Phenotypes in Genetic Epidemiologic Studies: From Cross-Phenotype Associations to Pleiotropy.

Main citation

Salinas YD, Wang Z, DeWan AT. (2018) Statistical Analysis of Multiple Phenotypes in Genetic Epidemiologic Studies: From Cross-Phenotype Associations to Pleiotropy. Am J Epidemiol, 187 (4) 855-863. doi:10.1093/aje/kwx296. PMID 29020254

ABSTRACT

In the context of genetics, pleiotropy refers to the phenomenon in which a single genetic locus affects more than 1 trait or disease. Genetic epidemiologic studies have identified loci associated with multiple phenotypes, and these cross-phenotype associations are often incorrectly interpreted as examples of pleiotropy. Pleiotropy is only one possible explanation for cross-phenotype associations. Cross-phenotype associations may also arise due to issues related to study design, confounder bias, or nongenetic causal links between the phenotypes under analysis. Therefore, it is necessary to dissect cross-phenotype associations carefully to uncover true pleiotropic loci. In this review, we describe statistical methods that can be used to identify robust statistical evidence of pleiotropy. First, we provide an overview of univariate and multivariate methods for discovery of cross-phenotype associations and highlight important considerations for choosing among available methods. Then, we describe how to dissect cross-phenotype associations by using mediation analysis. Pleiotropic loci provide insights into the mechanistic underpinnings of disease comorbidity, and they may serve as novel targets for interventions that simultaneously treat multiple diseases. Discerning between different types of cross-phenotype associations is necessary to realize the public health potential of pleiotropic loci.

DOI

10.1093/aje/kwx296

Sanger

Tool

PUBMED_LINK

27548312

URL

https://imputation.sanger.ac.uk/

TITLE

A reference panel of 64,976 haplotypes for genotype imputation.

Main citation

McCarthy S, Das S, Kretzschmar W, Delaneau O, ...&, Haplotype Reference Consortium. (2016) A reference panel of 64,976 haplotypes for genotype imputation. Nat Genet, 48 (10) 1279-83. doi:10.1038/ng.3643. PMID 27548312

ABSTRACT

We describe a reference panel of 64,976 human haplotypes at 39,235,157 SNPs constructed using whole-genome sequence data from 20 studies of predominantly European ancestry. Using this resource leads to accurate genotype imputation at minor allele frequencies as low as 0.1% and a large increase in the number of SNPs tested in association studies, and it can help to discover and refine causal loci. We describe remote server resources that allow researchers to carry out imputation and phasing consistently and efficiently.

DOI

10.1038/ng.3643

SARGE

Tool

PUBMED_LINK

34272242

DESCRIPTION

Schaefer, N. K., Shapiro, B. & Green, R. E. An ancestral recombination graph of human, Neanderthal, and Denisovan genomes. Sci. Adv. 7, eabc0776 (2021).

TITLE

An ancestral recombination graph of human, Neanderthal, and Denisovan genomes.

Main citation

Schaefer NK, Shapiro B, Green RE. (2021) An ancestral recombination graph of human, Neanderthal, and Denisovan genomes. Sci Adv, 7 (29) . doi:10.1126/sciadv.abc0776. PMID 34272242

ABSTRACT

Many humans carry genes from Neanderthals, a legacy of past admixture. Existing methods detect this archaic hominin ancestry within human genomes using patterns of linkage disequilibrium or direct comparison to Neanderthal genomes. Each of these methods is limited in sensitivity and scalability. We describe a new ancestral recombination graph inference algorithm that scales to large genome-wide datasets and demonstrate its accuracy on real and simulated data. We then generate a genome-wide ancestral recombination graph including human and archaic hominin genomes. From this, we generate a map within human genomes of archaic ancestry and of genomic regions not shared with archaic hominins either by admixture or incomplete lineage sorting. We find that only 1.5 to 7% of the modern human genome is uniquely human. We also find evidence of multiple bursts of adaptive changes specific to modern humans within the past 600,000 years involving genes related to brain development and function.

DOI

10.1126/sciadv.abc0776

SBayesR

Tool

PUBMED_LINK

31704910

DESCRIPTION

SBayesR extends a powerful individual-level data Bayesian multiple regression model (BayesR) to one that utilises summary statistics from genome-wide association studies (GWAS).

URL

TITLE

Improved polygenic prediction by Bayesian multiple regression on summary statistics.

Main citation

Lloyd-Jones LR, Zeng J, Sidorenko J, Yengo L, ...&, Visscher PM. (2019) Improved polygenic prediction by Bayesian multiple regression on summary statistics. Nat Commun, 10 (1) 5086. doi:10.1038/s41467-019-12653-0. PMID 31704910

ABSTRACT

Accurate prediction of an individual's phenotype from their DNA sequence is one of the great promises of genomics and precision medicine. We extend a powerful individual-level data Bayesian multiple regression model (BayesR) to one that utilises summary statistics from genome-wide association studies (GWAS), SBayesR. In simulation and cross-validation using 12 real traits and 1.1 million variants on 350,000 individuals from the UK Biobank, SBayesR improves prediction accuracy relative to commonly used state-of-the-art summary statistics methods at a fraction of the computational resources. Furthermore, using summary statistics for variants from the largest GWAS meta-analysis (n ≈ 700,000) on height and BMI, we show that on average across traits and two independent data sets that SBayesR improves prediction R² by 5.2% relative to LDpred and by 26.5% relative to clumping and p value thresholding.

DOI

10.1038/s41467-019-12653-0

SBayesRC

Tool

PUBMED_LINK

38689000

DESCRIPTION

SBayesRC integrates GWAS summary statistics with functional genomic annotations to improve polygenic prediction of complex traits.

URL

KEYWORDS

functional genomic annotation, whole-genome variants, cross-ancestry

TITLE

Leveraging functional genomic annotations and genome coverage to improve polygenic prediction of complex traits within and between ancestries.

Main citation

Zheng Z, Liu S, Sidorenko J, Wang Y, ...&, Zeng J. (2024) Leveraging functional genomic annotations and genome coverage to improve polygenic prediction of complex traits within and between ancestries. Nat Genet, 56 (5) 767-777. doi:10.1038/s41588-024-01704-y. PMID 38689000

ABSTRACT

We develop a method, SBayesRC, that integrates genome-wide association study (GWAS) summary statistics with functional genomic annotations to improve polygenic prediction of complex traits. Our method is scalable to whole-genome variant analysis and refines signals from functional annotations by allowing them to affect both causal variant probability and causal effect distribution. We analyze 50 complex traits and diseases using ∼7 million common single-nucleotide polymorphisms (SNPs) and 96 annotations. SBayesRC improves prediction accuracy by 14% in European ancestry and up to 34% in cross-ancestry prediction compared to the baseline method SBayesR, which does not use annotations, and outperforms other methods, including LDpred2, LDpred-funct, MegaPRS, PolyPred-S and PRS-CSx. Investigation of factors affecting prediction accuracy identifies a significant interaction between SNP density and annotation information, suggesting whole-genome sequence variants with annotations may further improve prediction. Functional partitioning analysis highlights a major contribution of evolutionary constrained regions to prediction accuracy and the largest per-SNP contribution from nonsynonymous SNPs.

DOI

10.1038/s41588-024-01704-y

SBayesS

Tool

PUBMED_LINK

33608517

DESCRIPTION

SBayesS estimates multiple genetic architecture parameters, including selection signature, using only GWAS summary statistics.

URL

https://alkesgroup.broadinstitute.org/LDSCORE/Jagadeesh_Dey_sclinker

TITLE

Widespread signatures of natural selection across human complex traits and functional genomic categories.

Main citation

Zeng J, Xue A, Jiang L, Lloyd-Jones LR, ...&, Yang J. (2021) Widespread signatures of natural selection across human complex traits and functional genomic categories. Nat Commun, 12 (1) 1164. doi:10.1038/s41467-021-21446-3. PMID 33608517

ABSTRACT

Understanding how natural selection has shaped genetic architecture of complex traits is of importance in medical and evolutionary genetics. Bayesian methods have been developed using individual-level GWAS data to estimate multiple genetic architecture parameters including selection signature. Here, we present a method (SBayesS) that only requires GWAS summary statistics. We analyse data for 155 complex traits (n = 27k-547k) and project the estimates onto those obtained from evolutionary simulations. We estimate that, on average across traits, about 1% of human genome sequence are mutational targets with a mean selection coefficient of ~0.001. Common diseases, on average, show a smaller number of mutational targets and have been under stronger selection, compared to other traits. SBayesS analyses incorporating functional annotations reveal that selection signatures vary across genomic regions, among which coding regions have the strongest selection signature and are enriched for both the number of associated variants and the magnitude of effect sizes.

DOI

10.1038/s41467-021-21446-3

sc-linker

Tool

PUBMED_LINK

36175791

DESCRIPTION

a framework for integrating single-cell RNA-sequencing, epigenomic SNP-to-gene maps and genome-wide association study summary statistics to infer the underlying cell types and processes by which genetic variants influence disease

URL

KEYWORDS

GWAS, scRNA-seq

TITLE

Identifying disease-critical cell types and cellular processes by integrating single-cell RNA-sequencing and human genetics.

Main citation

Jagadeesh KA, Dey KK, Montoro DT, Mohan R, ...&, Regev A. (2022) Identifying disease-critical cell types and cellular processes by integrating single-cell RNA-sequencing and human genetics. Nat Genet, 54 (10) 1479-1492. doi:10.1038/s41588-022-01187-9. PMID 36175791

ABSTRACT

Genome-wide association studies provide a powerful means of identifying loci and genes contributing to disease, but in many cases, the related cell types/states through which genes confer disease risk remain unknown. Deciphering such relationships is important for identifying pathogenic processes and developing therapeutics. In the present study, we introduce sc-linker, a framework for integrating single-cell RNA-sequencing, epigenomic SNP-to-gene maps and genome-wide association study summary statistics to infer the underlying cell types and processes by which genetic variants influence disease. The inferred disease enrichments recapitulated known biology and highlighted notable cell-disease relationships, including γ-aminobutyric acid-ergic neurons in major depressive disorder, a disease-dependent M-cell program in ulcerative colitis and a disease-specific complement cascade process in multiple sclerosis. In autoimmune disease, both healthy and disease-dependent immune cell-type programs were associated, whereas only disease-dependent epithelial cell programs were prominent, suggesting a role in disease response rather than initiation. Our framework provides a powerful approach for identifying the cell types and cellular processes by which genetic variants influence disease.

DOI

10.1038/s41588-022-01187-9

ARROW_SUMMARY

scRNA-seq data → Derive cell-type-specific gene programs → Map SNPs to genes using epigenomic data → Integrate with GWAS summary statistics → Identify disease-critical cell types and processes

SCARlink

Tool

PUBMED_LINK

38514783

FULL NAME

single-cell ATAC + RNA linking

DESCRIPTION

Single-cell ATAC+RNA linking (SCARlink) uses multiomic single-cell ATAC and RNA to predict gene expression from chromatin accessibility and predict regulatory regions.

URL

https://github.com/snehamitra/SCARlink/

KEYWORDS

Poisson regression, scATAC, scRNA, tile-level accessibility

TITLE

Single-cell multi-ome regression models identify functional and disease-associated enhancers and enable chromatin potential analysis.

Main citation

Mitra S, Malik R, Wong W, Rahman A, ...&, Leslie CS. (2024) Single-cell multi-ome regression models identify functional and disease-associated enhancers and enable chromatin potential analysis. Nat Genet, 56 (4) 627-636. doi:10.1038/s41588-024-01689-8. PMID 38514783

ABSTRACT

We present a gene-level regulatory model, single-cell ATAC + RNA linking (SCARlink), which predicts single-cell gene expression and links enhancers to target genes using multi-ome (scRNA-seq and scATAC-seq co-assay) sequencing data. The approach uses regularized Poisson regression on tile-level accessibility data to jointly model all regulatory effects at a gene locus, avoiding the limitations of pairwise gene-peak correlations and dependence on peak calling. SCARlink outperformed existing gene scoring methods for imputing gene expression from chromatin accessibility across high-coverage multi-ome datasets while giving comparable to improved performance on low-coverage datasets. Shapley value analysis on trained models identified cell-type-specific gene enhancers that are validated by promoter capture Hi-C and are 11× to 15× and 5× to 12× enriched in fine-mapped eQTLs and fine-mapped genome-wide association study (GWAS) variants, respectively. We further show that SCARlink-predicted and observed gene expression vectors provide a robust way to compute a chromatin potential vector field to enable developmental trajectory analysis.

DOI

10.1038/s41588-024-01689-8

ARROW_SUMMARY

scRNA-seq + scATAC-seq → Tile-level chromatin accessibility modeling → Regularized Poisson regression (SCARlink) → Predict gene expression & link enhancers to genes → Identify functional and disease-associated enhancers
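
The regression step in the summary above can be sketched as a generic L2-regularized Poisson regression with a log link, fitted by gradient descent. This is an illustrative stand-in rather than the SCARlink code: the function name `fit_poisson_ridge`, the penalty form, and the optimizer settings are assumptions, and the toy "tiles" are simulated.

```python
import numpy as np


def fit_poisson_ridge(X, y, alpha=1.0, lr=0.1, n_iter=3000):
    """Fit y ~ Poisson(exp(X @ w + b)) with an L2 penalty on w.

    X : (cells x tiles) accessibility matrix, assumed scaled to roughly [0, 1]
    y : (cells,) expression counts for one gene
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    w = np.zeros(p)
    b = np.log(y.mean() + 1e-8)  # start the intercept at the log mean count
    for _ in range(n_iter):
        mu = np.exp(X @ w + b)                   # predicted mean counts
        grad_w = X.T @ (mu - y) / n + alpha * w  # gradient of penalized mean NLL
        grad_b = np.mean(mu - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data: only the first of three tiles drives expression
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(2000, 3))
y = rng.poisson(np.exp(0.2 + 1.0 * X[:, 0]))
w, b = fit_poisson_ridge(X, y, alpha=0.01)
print(w, b)
```

Modelling all tiles at a locus jointly, as here, lets the coefficient of each tile be read conditional on the others, which is the contrast with pairwise gene-peak correlations drawn in the abstract.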

SCAVENGE

Tool

PUBMED_LINK

35668323

FULL NAME

Single Cell Analysis of Variant Enrichment through Network propagation of GEnomic data

URL

https://github.com/sankaranlab/SCAVENGE

KEYWORDS

GWAS, scATAC, network propagation

TITLE

Variant to function mapping at single-cell resolution through network propagation.

Main citation

Yu F, Cato LD, Weng C, Liggett LA, ...&, Sankaran VG. (2022) Variant to function mapping at single-cell resolution through network propagation. Nat Biotechnol, 40 (11) 1644-1653. doi:10.1038/s41587-022-01341-y. PMID 35668323

ABSTRACT

Genome-wide association studies in combination with single-cell genomic atlases can provide insights into the mechanisms of disease-causal genetic variation. However, identification of disease-relevant or trait-relevant cell types, states and trajectories is often hampered by sparsity and noise, particularly in the analysis of single-cell epigenomic data. To overcome these challenges, we present SCAVENGE, a computational algorithm that uses network propagation to map causal variants to their relevant cellular context at single-cell resolution. We demonstrate how SCAVENGE can help identify key biological mechanisms underlying human genetic variation, applying the method to blood traits at distinct stages of human hematopoiesis, to monocyte subsets that increase the risk for severe Coronavirus Disease 2019 (COVID-19) and to intermediate lymphocyte developmental states that predispose to acute leukemia. Our approach not only provides a framework for enabling variant-to-function insights at single-cell resolution but also suggests a more general strategy for maximizing the inferences that can be made using single-cell genomic data.

DOI

10.1038/s41587-022-01341-y
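
Network propagation, as named in the abstract above, is often implemented as a random walk with restart on a cell-to-cell graph. The sketch below shows that generic form only; the graph construction, seed selection, and restart probability used by SCAVENGE are not described here, so `propagate` and its defaults are hypothetical.

```python
import numpy as np


def propagate(adj, seed_idx, restart=0.5, tol=1e-10, max_iter=1000):
    """Random walk with restart on a cell-cell graph.

    adj      : (n x n) symmetric adjacency matrix, e.g. a kNN graph of cells
    seed_idx : indices of seed cells with strong trait enrichment (assumed given)
    Returns a vector of propagated scores, one per cell.
    """
    adj = np.asarray(adj, dtype=float)
    col_sums = adj.sum(axis=0)
    col_sums[col_sums == 0] = 1.0  # avoid division by zero for isolated cells
    W = adj / col_sums             # column-normalized transition matrix

    p0 = np.zeros(adj.shape[0])
    p0[list(seed_idx)] = 1.0 / len(seed_idx)
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * (W @ p) + restart * p0
        if np.abs(p_next - p).sum() < tol:
            p = p_next
            break
        p = p_next
    return p


# Path graph of five cells with the seed at one end
adj = np.zeros((5, 5))
for i in range(4):
    adj[i, i + 1] = adj[i + 1, i] = 1.0
print(propagate(adj, [0]))
```

On this toy path the scores decay with graph distance from the seed, which is the intended effect: sparse, noisy per-cell signals are smoothed over neighbouring cells.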

scDRS

Tool

PUBMED_LINK

36050550

FULL NAME

single-cell Disease Relevance Score

DESCRIPTION

an approach that links scRNA-seq with polygenic disease risk at single-cell resolution, independent of annotated cell types

URL

https://github.com/martinjzhang/scDRS

KEYWORDS

GWAS, scRNA-seq

TITLE

Polygenic enrichment distinguishes disease associations of individual cells in single-cell RNA-seq data.

Main citation

Zhang MJ, Hou K, Dey KK, Sakaue S, ...&, Price AL. (2022) Polygenic enrichment distinguishes disease associations of individual cells in single-cell RNA-seq data. Nat Genet, 54 (10) 1572-1580. doi:10.1038/s41588-022-01167-z. PMID 36050550

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) provides unique insights into the pathology and cellular origin of disease. We introduce single-cell disease relevance score (scDRS), an approach that links scRNA-seq with polygenic disease risk at single-cell resolution, independent of annotated cell types. scDRS identifies cells exhibiting excess expression across disease-associated genes implicated by genome-wide association studies (GWASs). We applied scDRS to 74 diseases/traits and 1.3 million single-cell gene-expression profiles across 31 tissues/organs. Cell-type-level results broadly recapitulated known cell-type-disease associations. Individual-cell-level results identified subpopulations of disease-associated cells not captured by existing cell-type labels, including T cell subpopulations associated with inflammatory bowel disease, partially characterized by their effector-like states; neuron subpopulations associated with schizophrenia, partially characterized by their spatial locations; and hepatocyte subpopulations associated with triglyceride levels, partially characterized by their higher ploidy levels. Genes whose expression was correlated with the scDRS score across cells (reflecting coexpression with GWAS disease-associated genes) were strongly enriched for gold-standard drug target and Mendelian disease genes.

DOI

10.1038/s41588-022-01167-z

ARROW_SUMMARY

GWAS summary statistics → Select putative disease genes via MAGMA → Compute scDRS using Monte Carlo-based score aggregation → Normalize with control gene sets → Rank cells by disease relevance → Identify enriched subpopulations and co-expressed gene networks
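
The scoring steps in the summary above can be sketched in simplified form: average expression over the putative disease genes in each cell, then compare that score with scores from random control gene sets to obtain a per-cell empirical p-value. This is a simplified illustration, not the scDRS package: it omits gene weighting, matching of control genes on expression level, and score normalization, and the name `disease_relevance_pvalues` is hypothetical.

```python
import numpy as np


def disease_relevance_pvalues(expr, disease_idx, n_ctrl=200, seed=0):
    """Per-cell disease score and Monte Carlo p-value.

    expr        : (cells x genes) normalized expression matrix
    disease_idx : column indices of putative disease genes
    n_ctrl      : number of random control gene sets (assumed unmatched here)
    """
    rng = np.random.default_rng(seed)
    expr = np.asarray(expr, dtype=float)
    disease_idx = np.asarray(disease_idx)
    n_cells, n_genes = expr.shape

    raw = expr[:, disease_idx].mean(axis=1)  # raw disease score per cell

    ctrl = np.empty((n_cells, n_ctrl))
    for k in range(n_ctrl):
        idx = rng.choice(n_genes, size=len(disease_idx), replace=False)
        ctrl[:, k] = expr[:, idx].mean(axis=1)  # score of a random gene set

    # Empirical p-value: share of control scores at least as large as the raw score
    pvals = (1 + (ctrl >= raw[:, None]).sum(axis=1)) / (1 + n_ctrl)
    return raw, pvals


# Toy data: the first 20 cells over-express the 10 disease genes
rng = np.random.default_rng(1)
expr = rng.normal(size=(100, 300))
expr[:20, :10] += 3.0
raw, pvals = disease_relevance_pvalues(expr, np.arange(10))
print(pvals[:5], pvals[-5:])
```

Because each cell is tested against its own control scores, the result does not depend on cell-type labels, in line with the description above.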

SCENT

Tool

PUBMED_LINK

38594305

FULL NAME

single-cell enhancer target gene mapping

DESCRIPTION

SCENT uses single-cell multimodal data (e.g., 10X Multiome RNA/ATAC) and links ATAC-seq peaks (putative enhancers) to their target genes by modeling association between chromatin accessibility and gene expression across individual single cells.

URL

https://github.com/immunogenomics/SCENT

KEYWORDS

Poisson regression, scATAC-seq, scRNA-seq

TITLE

Tissue-specific enhancer-gene maps from multimodal single-cell data identify causal disease alleles.

Main citation

Sakaue S, Weinand K, Isaac S, Dey KK, ...&, Raychaudhuri S. (2024) Tissue-specific enhancer-gene maps from multimodal single-cell data identify causal disease alleles. Nat Genet, 56 (4) 615-626. doi:10.1038/s41588-024-01682-1. PMID 38594305

ABSTRACT

Translating genome-wide association study (GWAS) loci into causal variants and genes requires accurate cell-type-specific enhancer-gene maps from disease-relevant tissues. Building enhancer-gene maps is essential but challenging with current experimental methods in primary human tissues. Here we developed a nonparametric statistical method, SCENT (single-cell enhancer target gene mapping), that models association between enhancer chromatin accessibility and gene expression in single-cell or nucleus multimodal RNA sequencing and ATAC sequencing data. We applied SCENT to 9 multimodal datasets including >120,000 single cells or nuclei and created 23 cell-type-specific enhancer-gene maps. These maps were highly enriched for causal variants in expression quantitative loci and GWAS for 1,143 diseases and traits. We identified likely causal genes for both common and rare diseases and linked somatic mutation hotspots to target genes. We demonstrate that application of SCENT to multimodal data from disease-relevant human tissue enables the scalable construction of accurate cell-type-specific enhancer-gene maps, essential for defining noncoding variant function.

DOI

10.1038/s41588-024-01682-1

ARROW_SUMMARY

Extract chromatin accessibility (ATAC-seq) & gene expression (RNA-seq) from single cells → Group cells by type → For each gene, define candidate enhancers within 1 Mb → Use distance-weighted non-parametric regression to model enhancer–gene associations → Assess significance via permutation testing → Build enhancer–gene links per cell type

scGWAS

Tool

PUBMED_LINK

36253801

FULL NAME

scRNA-seq assisted GWAS analysis

DESCRIPTION

scGWAS leverages scRNA-seq data to identify the genetically mediated associations between traits and cell types.

URL

https://github.com/bsml320/scGWAS

TITLE

scGWAS: landscape of trait-cell type associations by integrating single-cell transcriptomics-wide and genome-wide association studies.

Main citation

Jia P, Hu R, Yan F, Dai Y, ...&, Zhao Z. (2022) scGWAS: landscape of trait-cell type associations by integrating single-cell transcriptomics-wide and genome-wide association studies. Genome Biol, 23 (1) 220. doi:10.1186/s13059-022-02785-w. PMID 36253801

ABSTRACT

BACKGROUND: The rapid accumulation of single-cell RNA sequencing (scRNA-seq) data presents unique opportunities to decode the genetically mediated cell-type specificity in complex diseases. Here, we develop a new method, scGWAS, which effectively leverages scRNA-seq data to achieve two goals: (1) to infer the cell types in which the disease-associated genes manifest and (2) to construct cellular modules which imply disease-specific activation of different processes. RESULTS: scGWAS only utilizes the average gene expression for each cell type followed by virtual search processes to construct the null distributions of module scores, making it scalable to large scRNA-seq datasets. We demonstrated scGWAS in 40 genome-wide association studies (GWAS) datasets (average sample size N ≈ 154,000) using 18 scRNA-seq datasets from nine major human/mouse tissues (totaling 1.08 million cells) and identified 2533 trait and cell-type associations, each with significant modules for further investigation. The module genes were validated using disease or clinically annotated references from ClinVar, OMIM, and pLI variants. CONCLUSIONS: We showed that the trait-cell type associations identified by scGWAS, while generally constrained to trait-tissue associations, could recapitulate many well-studied relationships and also reveal novel relationships, providing insights into the unsolved trait-tissue associations. Moreover, in each specific cell type, the associations with different traits were often mediated by different sets of risk genes, implying disease-specific activation of driving processes. In summary, scGWAS is a powerful tool for exploring the genetic basis of complex diseases at the cell type level using single-cell expression data.

DOI

10.1186/s13059-022-02785-w

scPRS

Tool

PUBMED_LINK

40715455

DESCRIPTION

We introduce scPRS, an interpretable geometric deep learning model that constructs single-cell-resolved PRS leveraging reference single-cell ATAC-seq data for enhanced disease prediction and biological discovery.

URL

https://github.com/szhang1112/scPRS

KEYWORDS

GWAS, scATAC, cell-resolved PRS, GNN

TITLE

Single-cell polygenic risk scores dissect cellular and molecular heterogeneity of complex human diseases.

Main citation

Zhang S, Shu H, Zhou J, Rubin-Sigler J, ...&, Snyder MP. (2025) Single-cell polygenic risk scores dissect cellular and molecular heterogeneity of complex human diseases. Nat Biotechnol, () . doi:10.1038/s41587-025-02725-6. PMID 40715455

ABSTRACT

Polygenic risk scores (PRSs) predict an individual's genetic risk for complex diseases, yet their utility in elucidating disease biology remains limited. We introduce scPRS, a graph neural network-based framework that computes single-cell-resolved PRSs by integrating reference single-cell chromatin accessibility profiles. scPRS outperforms traditional PRS approaches in genetic risk prediction, as demonstrated across multiple diseases including type 2 diabetes, hypertrophic cardiomyopathy, Alzheimer disease and severe COVID-19. Beyond risk prediction, scPRS prioritizes disease-critical cells and, when combined with a layered multiomic analysis, links risk variants to gene regulation in a cell-type-specific manner. Applied to these diseases, scPRS fine-maps causal cell types and cell-type-specific variants and genes, demonstrating its ability to bridge genetic risk with cell-specific biology. scPRS provides a unified framework for genetic risk prediction and mechanistic dissection of complex diseases, laying a methodological foundation for single-cell genetics.

DOI

10.1038/s41587-025-02725-6

ARROW_SUMMARY

GWAS summary statistics + scATAC‑seq data → per‑cell PRS calculation → GNN smoothing → aggregate to individual scPRS → interpret cell‑type contributions & fine‑map causal variants.

scTWAS

TWAS Single cell scRNA-seq Tool Summary statistics

PUBMED_LINK

41820391

DESCRIPTION

Statistical framework for cell-type-resolved transcriptome-wide association using single-cell RNA-seq: models sparsity and technical noise via latent variables and moment-based estimation to improve genetically regulated expression prediction and gene–trait discovery.

URL

https://github.com/ZhaotongL/scTWAS, https://doi.org/10.1038/s41467-026-70374-7

KEYWORDS

TWAS, single-cell, cell-type-specific, latent variable, GReX

TITLE

scTWAS: a powerful statistical framework for single-cell transcriptome-wide association studies.

Main citation

Lin Z, Su C. (2026) scTWAS: a powerful statistical framework for single-cell transcriptome-wide association studies. Nat Commun, () . doi:10.1038/s41467-026-70374-7. PMID 41820391

ABSTRACT

Transcriptome-wide association studies (TWAS) have successfully identified genes associated with complex traits and diseases, but most have been performed using bulk gene expression data, which aggregate signals across heterogeneous cell types. Population-scale single-cell RNA sequencing data now make it possible to perform TWAS at the cell-type resolution, but present unique challenges due to strong noises, technical variations, and high sparsity. Here, we propose scTWAS, a statistical method to conduct cell-type-specific TWAS using single-cell data. Leveraging a latent-variable model and moment-based estimation to address the challenges of single-cell data, scTWAS consistently improves the prediction of genetically regulated gene expression across cell types in both blood and brain tissues. Compared to existing methods, scTWAS identifies substantially more gene-trait associations across 29 hematological traits and three immune-related diseases in immune cell types. An application to Alzheimer's disease also reveals cell-subtype-specific associations, including MS4A6A in the disease-associated microglial subtype and PPP1R37 in the inflammatory microglial subtype.

DOI

10.1038/s41467-026-70374-7

SDPR

Tool

PUBMED_LINK

34310601

DESCRIPTION

SDPR (Summary statistics based Dirichlet Process Regression) is a method to compute polygenic risk scores (PRS) from summary statistics. It extends Dirichlet Process Regression (DPR) to the use of summary statistics.
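
To make the term concrete: whatever method supplies the per-variant weights, a polygenic risk score is typically a weighted sum of effect-allele dosages. The sketch below is not SDPR's code; the function name `polygenic_score` and the toy numbers are assumptions made purely for illustration.

```python
import numpy as np

def polygenic_score(dosages, weights):
    """Return one score per individual: the weighted sum of allele dosages.

    dosages: (n_individuals, n_variants) effect-allele counts in [0, 2]
    weights: (n_variants,) per-allele effect sizes (hypothetical values here)
    """
    dosages = np.asarray(dosages, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if dosages.shape[1] != weights.shape[0]:
        raise ValueError("dosages and weights disagree on the number of variants")
    return dosages @ weights

# Toy data: two individuals, three variants.
dosages = np.array([[0, 1, 2],
                    [2, 0, 1]])
weights = np.array([0.10, -0.05, 0.20])
scores = polygenic_score(dosages, weights)
print(scores)
```

In practice the dosage columns and the weights must refer to the same variants and the same effect alleles; this sketch assumes that alignment has already been done.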

URL

https://github.com/eldronzhou/SDPR

TITLE

A fast and robust Bayesian nonparametric method for prediction of complex traits using summary statistics.

Main citation

Zhou G, Zhao H. (2021) A fast and robust Bayesian nonparametric method for prediction of complex traits using summary statistics. PLoS Genet, 17 (7) e1009697. doi:10.1371/journal.pgen.1009697. PMID 34310601

ABSTRACT

Genetic prediction of complex traits has great promise for disease prevention, monitoring, and treatment. The development of accurate risk prediction models is hindered by the wide diversity of genetic architecture across different traits, limited access to individual level data for training and parameter tuning, and the demand for computational resources. To overcome the limitations of the most existing methods that make explicit assumptions on the underlying genetic architecture and need a separate validation data set for parameter tuning, we develop a summary statistics-based nonparametric method that does not rely on validation datasets to tune parameters. In our implementation, we refine the commonly used likelihood assumption to deal with the discrepancy between summary statistics and external reference panel. We also leverage the block structure of the reference linkage disequilibrium matrix for implementation of a parallel algorithm. Through simulations and applications to twelve traits, we show that our method is adaptive to different genetic architectures, statistically robust, and computationally efficient. Our method is available at https://github.com/eldronzhou/SDPR.

DOI

10.1371/journal.pgen.1009697

SDPRX

Tool

PUBMED_LINK

36460009

DESCRIPTION

SDPRX is a statistical method for cross-population prediction of complex traits. It integrates GWAS summary statistics and LD matrices from two populations (EUR and non-EUR) to compute polygenic risk scores.

URL

https://github.com/eldronzhou/SDPRX

TITLE

SDPRX: A statistical method for cross-population prediction of complex traits.

Main citation

Zhou G, Chen T, Zhao H. (2023) SDPRX: A statistical method for cross-population prediction of complex traits. Am J Hum Genet, 110 (1) 13-22. doi:10.1016/j.ajhg.2022.11.007. PMID 36460009

ABSTRACT

Polygenic risk score (PRS) has demonstrated its great utility in biomedical research through identifying high-risk individuals for different diseases from their genotypes. However, the broader application of PRS to the general population is hindered by the limited transferability of PRS developed in Europeans to non-European populations. To improve PRS prediction accuracy in non-European populations, we develop a statistical method called SDPRX that can effectively integrate genome wide association study summary statistics from different populations. SDPRX automatically adjusts for linkage disequilibrium differences between populations and characterizes the joint distribution of the effect sizes of a variant in two populations to be both null, population specific, or shared with correlation. Through simulations and applications to real traits, we show that SDPRX improves the prediction performance over existing methods in non-European populations.

DOI

10.1016/j.ajhg.2022.11.007

SDS

Tool

PUBMED_LINK

27738015

FULL NAME

singleton density score

DESCRIPTION

Field, Y., Boyle, E. A., Telis, N., Gao, Z., Gaulton, K. J., Golan, D., ... & Pritchard, J. K. (2016). Detection of human adaptation during the past 2000 years. Science, 354(6313), 760-764.

URL

https://github.com/yairf/SDS

KEYWORDS

singleton, recent selection

USE

SDS is a method to infer very recent changes in allele frequencies from contemporary genome sequences

TITLE

Detection of human adaptation during the past 2000 years.

Main citation

Field Y, Boyle EA, Telis N, Gao Z, ...&, Pritchard JK. (2016) Detection of human adaptation during the past 2000 years. Science, 354 (6313) 760-764. doi:10.1126/science.aag0776. PMID 27738015

ABSTRACT

Detection of recent natural selection is a challenging problem in population genetics. Here we introduce the singleton density score (SDS), a method to infer very recent changes in allele frequencies from contemporary genome sequences. Applied to data from the UK10K Project, SDS reflects allele frequency changes in the ancestors of modern Britons during the past ~2000 to 3000 years. We see strong signals of selection at lactase and the major histocompatibility complex, and in favor of blond hair and blue eyes. For polygenic adaptation, we find that recent selection for increased height has driven allele frequency shifts across most of the genome. Moreover, we identify shifts associated with other complex traits, suggesting that polygenic adaptation has played a pervasive role in shaping genotypic and phenotypic variation in modern humans.

DOI

10.1126/science.aag0776

SECRET-GWAS

Tool

DESCRIPTION

A privacy-preserving, population-scale genome-wide association study (GWAS) tool enabling collaborative analysis across multiple institutions using confidential computing. It employs optimizations like streaming, batching, and data parallelization on Intel SGX-based platforms to support linear and logistic regression efficiently while protecting against side-channel attacks.

URL

https://github.com/jonahrosenblum/SECRET-GWAS

KEYWORDS

Genome-wide association study (GWAS), Confidential computing, Privacy-preserving, Intel SGX, Secure multi-party computation

Main citation

Rosenblum, J., Dong, J. & Narayanasamy, S. Confidential computing for population-scale genome-wide association studies with SECRET-GWAS. Nat Comput Sci (2025). https://doi.org/10.1038/s43588-025-00856-z

ARROW_SUMMARY

Genomic data from multiple institutions → Confidential computing (Intel SGX) with optimized linear/logistic regression → Privacy-preserving GWAS results using streaming, batching, and parallelization

AI_GENERATED

1.0

seismic

GWAS Single cell scRNA-seq Gene prioritization Tool

PUBMED_LINK

41034207

FULL NAME

Single-cell Expression Integration System for Mapping genetically Implicated Cell types

DESCRIPTION

R framework that links GWAS signals to single-cell-defined cell types via a cell-type gene specificity score (expression magnitude and consistency) and regression on gene-level association statistics, with influential-gene follow-up for interpretability.

URL

https://github.com/ylaboratory/seismic, https://ylaboratory.github.io/seismic/, https://doi.org/10.1038/s41467-025-63753-z

KEYWORDS

GWAS, scRNA-seq, cell type, MAGMA, post-GWAS interpretation

TITLE

Disentangling associations between complex traits and cell types with seismic.

Main citation

Lai Q, Dannenfelser R, Roussarie JP, Yao V. (2025) Disentangling associations between complex traits and cell types with seismic. Nat Commun, 16 (1) 8744. doi:10.1038/s41467-025-63753-z. PMID 41034207

ABSTRACT

Integrating single-cell RNA sequencing with Genome-Wide Association Studies (GWAS) can uncover cell types involved in complex traits and disease. However, current methods often lack scalability, interpretability, and robustness. We present seismic, a framework that computes a novel specificity score capturing both expression magnitude and consistency across cell types and introduces influential gene analysis, an approach to identify genes driving each cell type-trait association. Across over 1000 cell-type characterizations at different granularities and 28 polygenic traits, seismic corroborates known associations and uncovers trait-relevant cell groups not apparent through other methodologies. In Parkinson's and Alzheimer's, seismic unveils both cell- and brain-region-specific differences in pathology. Analyzing a pathology-based Alzheimer's GWAS with seismic enables the identification of vulnerable neuron populations and molecular pathways implicated in their neurodegeneration. In general, seismic is a computationally efficient, powerful, and interpretable approach for mapping the relationships between polygenic traits and cell-type-specific expression, offering new insights into disease mechanisms.

DOI

10.1038/s41467-025-63753-z

SHAPEIT1

Tool

PUBMED_LINK

22138821

DESCRIPTION

(SHAPEIT1)

URL

https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html

TITLE

A linear complexity phasing method for thousands of genomes.

Main citation

Delaneau O, Marchini J, Zagury JF. (2011) A linear complexity phasing method for thousands of genomes. Nat Methods, 9 (2) 179-81. doi:10.1038/nmeth.1785. PMID 22138821

ABSTRACT

Human-disease etiology can be better understood with phase information about diploid sequences. We present a method for estimating haplotypes, using genotype data from unrelated samples or small nuclear families, that leads to improved accuracy and speed compared to several widely used methods. The method, segmented haplotype estimation and imputation tool (SHAPEIT), scales linearly with the number of haplotypes used in each iteration and can be run efficiently on whole chromosomes.

DOI

10.1038/nmeth.1785

SHAPEIT2

Tool

PUBMED_LINK

23269371

DESCRIPTION

(SHAPEIT2)

TITLE

Improved whole-chromosome phasing for disease and population genetic studies.

Main citation

Delaneau O, Zagury JF, Marchini J. (2013) Improved whole-chromosome phasing for disease and population genetic studies. Nat Methods, 10 (1) 5-6. doi:10.1038/nmeth.2307. PMID 23269371

DOI

10.1038/nmeth.2307

SHAPEIT3

Tool

PUBMED_LINK

27270105

DESCRIPTION

(SHAPEIT3)

URL

https://jmarchini.org/shapeit3/

TITLE

Haplotype estimation for biobank-scale data sets.

Main citation

O'Connell J, Sharp K, Shrine N, Wain L, ...&, Marchini J. (2016) Haplotype estimation for biobank-scale data sets. Nat Genet, 48 (7) 817-20. doi:10.1038/ng.3583. PMID 27270105

ABSTRACT

The UK Biobank (UKB) has recently released genotypes on 152,328 individuals together with extensive phenotypic and lifestyle information. We present a new phasing method, SHAPEIT3, that can handle such biobank-scale data sets and results in switch error rates as low as ∼0.3%. The method exhibits O(NlogN) scaling with sample size N, enabling fast and accurate phasing of even larger cohorts.

DOI

10.1038/ng.3583

SHAPEIT4

Tool

PUBMED_LINK

31780650

DESCRIPTION

(SHAPEIT4)

URL

https://odelaneau.github.io/shapeit4/

TITLE

Accurate, scalable and integrative haplotype estimation.

Main citation

Delaneau O, Zagury JF, Robinson MR, Marchini JL, ...&, Dermitzakis ET. (2019) Accurate, scalable and integrative haplotype estimation. Nat Commun, 10 (1) 5436. doi:10.1038/s41467-019-13225-y. PMID 31780650

ABSTRACT

The number of human genomes being genotyped or sequenced increases exponentially and efficient haplotype estimation methods able to handle this amount of data are now required. Here we present a method, SHAPEIT4, which substantially improves upon other methods to process large genotype and high coverage sequencing datasets. It notably exhibits sub-linear running times with sample size, provides highly accurate haplotypes and allows integrating external phasing information such as large reference panels of haplotypes, collections of pre-phased variants and long sequencing reads. We provide SHAPEIT4 in an open source format and demonstrate its performance in terms of accuracy and running times on two gold standard datasets: the UK Biobank data and the Genome In A Bottle.

DOI

10.1038/s41467-019-13225-y

SHAPEIT5

Tool

PUBMED_LINK

37386248

DESCRIPTION

(SHAPEIT5)

TITLE

Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank.

Main citation

Hofmeister RJ, Ribeiro DM, Rubinacci S, Delaneau O. (2023) Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. Nat Genet, 55 (7) 1243-1249. doi:10.1038/s41588-023-01415-w. PMID 37386248

ABSTRACT

Phasing involves distinguishing the two parentally inherited copies of each chromosome into haplotypes. Here, we introduce SHAPEIT5, a new phasing method that quickly and accurately processes large sequencing datasets and applied it to UK Biobank (UKB) whole-genome and whole-exome sequencing data. We demonstrate that SHAPEIT5 phases rare variants with low switch error rates of below 5% for variants present in just 1 sample out of 100,000. Furthermore, we outline a method for phasing singletons, which, although less precise, constitutes an important step towards future developments. We then demonstrate that the use of UKB as a reference panel improves the accuracy of genotype imputation, which is even more pronounced when phased with SHAPEIT5 compared with other methods. Finally, we screen the UKB data for loss-of-function compound heterozygous events and identify 549 genes where both gene copies are knocked out. These genes complement current knowledge of gene essentiality in the human genome.

DOI

10.1038/s41588-023-01415-w

shaPRS

Tool

PUBMED_LINK

38703768

DESCRIPTION

Leveraging shared genetic effects across traits and ancestries improves accuracy of polygenic scores

URL

https://github.com/mkelcb/shaprs

KEYWORDS

cross-ancestry, genetic correlation

TITLE

shaPRS: Leveraging shared genetic effects across traits or ancestries improves accuracy of polygenic scores.

Main citation

Kelemen M, Vigorito E, Fachal L, Anderson CA, ...&, Wallace C. (2024) shaPRS: Leveraging shared genetic effects across traits or ancestries improves accuracy of polygenic scores. Am J Hum Genet, 111 (6) 1006-1017. doi:10.1016/j.ajhg.2024.04.009. PMID 38703768

ABSTRACT

We present shaPRS, a method that leverages widespread pleiotropy between traits or shared genetic effects across ancestries, to improve the accuracy of polygenic scores. The method uses genome-wide summary statistics from two diseases or ancestries to improve the genetic effect estimate and standard error at SNPs where there is homogeneity of effect between the two datasets. When there is significant evidence of heterogeneity, the genetic effect from the disease or population closest to the target population is maintained. We show via simulation and a series of real-world examples that shaPRS substantially enhances the accuracy of polygenic risk scores (PRSs) for complex diseases and greatly improves PRS performance across ancestries. shaPRS is a PRS pre-processing method that is agnostic to the actual PRS generation method, and as a result, it can be integrated into existing PRS generation pipelines and continue to be applied as more performant PRS methods are developed over time.

DOI

10.1016/j.ajhg.2024.04.009

SiblingGWAS

Tool

PUBMED_LINK

35534559

FULL NAME

Within-sibship genome-wide association analyses

DESCRIPTION

Scripts for running GWAS using siblings to estimate Within-Family (WF) and Between-Family (BF) effects of genetic variants on continuous traits. Allows the inclusion of more than two siblings from one family.
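
One common way to separate the two effects, sketched here under stated assumptions, is to regress the phenotype on the family-mean genotype (between-family term) and on each sibling's deviation from that mean (within-family term). This is an illustrative ordinary-least-squares version, not the repository's scripts; the function name `within_between_effects` and the toy data are hypothetical, and covariates and family-clustered standard errors are omitted.

```python
import numpy as np

def within_between_effects(genotype, phenotype, family_id):
    """Estimate within-family (WF) and between-family (BF) effects for one variant.

    Model: y = b0 + b_WF * (g - family_mean_g) + b_BF * family_mean_g + error.
    Families may contain any number of siblings.
    """
    genotype = np.asarray(genotype, dtype=float)
    phenotype = np.asarray(phenotype, dtype=float)
    # Map each individual to an integer family index.
    _, fam_index = np.unique(np.asarray(family_id), return_inverse=True)
    fam_sum = np.bincount(fam_index, weights=genotype)
    fam_size = np.bincount(fam_index)
    fam_mean = (fam_sum / fam_size)[fam_index]   # family mean, repeated per sibling
    deviation = genotype - fam_mean              # within-family deviation
    design = np.column_stack([np.ones_like(genotype), deviation, fam_mean])
    coef, *_ = np.linalg.lstsq(design, phenotype, rcond=None)
    return coef[1], coef[2]                      # (b_WF, b_BF)

# Toy data: three families, one of them with three siblings.
family_id = np.array([1, 1, 2, 2, 2, 3, 3])
genotype = np.array([0, 1, 1, 2, 2, 0, 2], dtype=float)
fam_mean = np.array([0.5, 0.5, 5 / 3, 5 / 3, 5 / 3, 1.0, 1.0])
phenotype = 2.0 + 0.5 * (genotype - fam_mean) + 1.0 * fam_mean  # noise-free by construction
b_wf, b_bf = within_between_effects(genotype, phenotype, family_id)
print(b_wf, b_bf)
```

Because the toy phenotype is built without noise, the fit should recover the within-family and between-family coefficients used to construct it.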

URL

https://github.com/LaurenceHowe/SiblingGWAS

TITLE

Within-sibship genome-wide association analyses decrease bias in estimates of direct genetic effects.

Main citation

Howe LJ, Nivard MG, Morris TT, Hansen AF, ...&, Davies NM. (2022) Within-sibship genome-wide association analyses decrease bias in estimates of direct genetic effects. Nat Genet, 54 (5) 581-592. doi:10.1038/s41588-022-01062-7. PMID 35534559

ABSTRACT

Estimates from genome-wide association studies (GWAS) of unrelated individuals capture effects of inherited variation (direct effects), demography (population stratification, assortative mating) and relatives (indirect genetic effects). Family-based GWAS designs can control for demographic and indirect genetic effects, but large-scale family datasets have been lacking. We combined data from 178,086 siblings from 19 cohorts to generate population (between-family) and within-sibship (within-family) GWAS estimates for 25 phenotypes. Within-sibship GWAS estimates were smaller than population estimates for height, educational attainment, age at first birth, number of children, cognitive ability, depressive symptoms and smoking. Some differences were observed in downstream SNP heritability, genetic correlations and Mendelian randomization analyses. For example, the within-sibship genetic correlation between educational attainment and body mass index attenuated towards zero. In contrast, analyses of most molecular phenotypes (for example, low-density lipoprotein-cholesterol) were generally consistent. We also found within-sibship evidence of polygenic adaptation on taller height. Here, we illustrate the importance of family-based GWAS data for phenotypes influenced by demographic and indirect genetic effects.

DOI

10.1038/s41588-022-01062-7

sim1000G

Tool

PUBMED_LINK

30646839

DESCRIPTION

A user-friendly genetic variant simulator in R for unrelated individuals and family-based designs.

URL

https://github.com/adimitromanolakis/sim1000G

TITLE

sim1000G: a user-friendly genetic variant simulator in R for unrelated individuals and family-based designs.

Main citation

Dimitromanolakis A, Xu J, Krol A, Briollais L. (2019) sim1000G: a user-friendly genetic variant simulator in R for unrelated individuals and family-based designs. BMC Bioinformatics, 20 (1) 26. doi:10.1186/s12859-019-2611-1. PMID 30646839

ABSTRACT

BACKGROUND: Simulation of genetic variants data is frequently required for the evaluation of statistical methods in the fields of human and animal genetics. Although a number of high-quality genetic simulators have been developed, many of them require advanced knowledge in population genetics or in computation to be used effectively. In addition, generating simulated data in the context of family-based studies demands sophisticated methods and advanced computer programming. RESULTS: To address these issues, we propose a new user-friendly and integrated R package, sim1000G, which simulates variants in genomic regions among unrelated individuals or among families. The only input needed is a raw phased Variant Call Format (VCF) file. Haplotypes are extracted to compute linkage disequilibrium (LD) in the simulated genomic regions and for the generation of new genotype data among unrelated individuals. The covariance across variants is used to preserve the LD structure of the original population. Pedigrees of arbitrary sizes are generated by modeling recombination events with sim1000G. To illustrate the application of sim1000G, various scenarios are presented assuming unrelated individuals from a single population or two distinct populations, or alternatively for three-generation pedigree data. Sim1000G can capture allele frequency diversity, short and long-range linkage disequilibrium (LD) patterns and subtle population differences in LD structure without the need of any tuning parameters. CONCLUSION: Sim1000G fills a gap in the vast area of genetic variants simulators by its simplicity and independence from external tools. Currently, it is one of the few simulation packages completely integrated into R and able to simulate multiple genetic variants among unrelated individuals and within families. Its implementation will facilitate the application and development of computational methods for association studies with both rare and common variants.

DOI

10.1186/s12859-019-2611-1

SIMER

Tool

FULL NAME

Data Simulation for Life Science and Breeding

DESCRIPTION

Data Simulation for Life Science and Breeding

URL

https://github.com/xiaolei-lab/SIMER#genotype-data

simGWAS

Tool

PUBMED_LINK

30371734

DESCRIPTION

A fast method for simulation of large scale case–control GWAS summary statistics.

URL

https://github.com/chr1swallace/simGWAS

TITLE

simGWAS: a fast method for simulation of large scale case-control GWAS summary statistics.

Main citation

Fortune MD, Wallace C. (2019) simGWAS: a fast method for simulation of large scale case-control GWAS summary statistics. Bioinformatics, 35 (11) 1901-1906. doi:10.1093/bioinformatics/bty898. PMID 30371734

ABSTRACT

MOTIVATION: Methods for analysis of GWAS summary statistics have encouraged data sharing and democratized the analysis of different diseases. Ideal validation for such methods is application to simulated data, where some 'truth' is known. As GWAS increase in size, so does the computational complexity of such evaluations; standard practice repeatedly simulates and analyses genotype data for all individuals in an example study. RESULTS: We have developed a novel method based on an alternative approach, directly simulating GWAS summary data, without individual data as an intermediate step. We mathematically derive the expected statistics for any set of causal variants and their effect sizes, conditional upon control haplotype frequencies (available from public reference datasets). Simulation of GWAS summary output can be conducted independently of sample size by simulating random variates about these expected values. Across a range of scenarios, our method, produces very similar output to that from simulating individual genotypes with a substantial gain in speed even for modest sample sizes. Fast simulation of GWAS summary statistics will enable more complete and rapid evaluation of summary statistic methods as well as opening new potential avenues of research in fine mapping and gene set enrichment analysis. AVAILABILITY AND IMPLEMENTATION: Our method is available under a GPL license as an R package from http://github.com/chr1swallace/simGWAS. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

DOI

10.1093/bioinformatics/bty898

SINGER

Tool

FULL NAME

sampling and inferring of genealogies with recombination

DESCRIPTION

SINGER is a Bayesian method for accelerating ARG (Ancestral Recombination Graph) sampling from the posterior distribution, enabling accurate inference and uncertainty quantification for hundreds of whole-genome sequences. It addresses scalability and accuracy challenges in ARG reconstruction, improving robustness to model misspecification. Applications include detecting population differentiation, archaic introgression, and trans-species polymorphism in regions like the HLA locus.

URL

https://github.com/popgenmethods/SINGER

KEYWORDS

Ancestral Recombination Graph (ARG), Bayesian inference, population genomics, genealogical analysis, archaic introgression, trans-species polymorphism

Main citation

Deng, Y., Nielsen, R., & Song, Y.S. Robust and accurate Bayesian inference of genome-wide genealogies for hundreds of genomes. Nature Genetics, 57, 2124–2135. https://doi.org/10.1038/s41588-025-02317-9

ARROW_SUMMARY

Phased WGS → Bayesian MCMC sampling (threading, ARG re-scaling, SGPR moves) → Genome-wide ARGs with uncertainty quantification

AI_GENERATED

1.0

SKAT

Tool

PUBMED_LINK

21737059

FULL NAME

sequence kernel association test

DESCRIPTION

SKAT is a SNP-set (e.g., a gene or a region) level test for association between a set of rare (or common) variants and dichotomous or quantitative phenotypes. SKAT aggregates the individual score test statistics of the SNPs in a SNP set and efficiently computes a SNP-set level p-value (e.g., a gene- or region-level p-value) while adjusting for covariates, such as principal components to account for population stratification. SKAT also provides power and sample size calculations for designing sequence association studies.
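
The aggregation step can be sketched as follows, assuming the weighted linear kernel and the Beta(1, 25) weights on minor allele frequency proposed in the SKAT paper. This is a minimal illustration and not the SKAT R package: the helper names `beta_density` and `skat_q` are hypothetical, the residuals are taken from an intercept-only null model, any scaling by the residual variance is left out, and the p-value calculation (which needs the null distribution of Q, a mixture of chi-square variables) is omitted.

```python
import math
import numpy as np

def beta_density(x, a, b):
    """Beta(a, b) probability density evaluated at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return np.exp(log_norm + (a - 1) * np.log(x) + (b - 1) * np.log1p(-x))

def skat_q(genotypes, residuals, maf, a=1.0, b=25.0):
    """Variance-component score statistic Q = sum_j (w_j * S_j)^2.

    S_j = G_j' (y - mu_hat) is the score for variant j under the null model,
    and w_j = Beta(MAF_j; a, b) up-weights rarer variants.
    """
    G = np.asarray(genotypes, dtype=float)       # (n_samples, n_variants)
    r = np.asarray(residuals, dtype=float)       # y - mu_hat from the null model
    w = beta_density(np.asarray(maf, dtype=float), a, b)
    scores = G.T @ r                             # per-variant score statistics
    return float(np.sum((w * scores) ** 2))

# Simulated toy data (assumed values, for illustration only).
rng = np.random.default_rng(0)
maf = np.array([0.01, 0.02, 0.05, 0.005])
G = rng.binomial(2, maf, size=(200, 4))
y = rng.normal(size=200)
residuals = y - y.mean()                         # intercept-only null model
q = skat_q(G, residuals, maf)
print(q)
```

Only the null model is needed to form the residuals, which is why a single fit can be reused across every SNP set in a genome-wide scan.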

URL

https://www.hsph.harvard.edu/skat/

TITLE

Rare-variant association testing for sequencing data with the sequence kernel association test.

Main citation

Wu MC, Lee S, Cai T, Li Y, ...&, Lin X. (2011) Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet, 89 (1) 82-93. doi:10.1016/j.ajhg.2011.05.029. PMID 21737059

ABSTRACT

Sequencing studies are increasingly being conducted to identify rare variants associated with complex traits. The limited power of classical single-marker association analysis for rare variants poses a central challenge in such studies. We propose the sequence kernel association test (SKAT), a supervised, flexible, computationally efficient regression method to test for association between genetic variants (common and rare) in a region and a continuous or dichotomous trait while easily adjusting for covariates. As a score-based variance-component test, SKAT can quickly calculate p values analytically by fitting the null model containing only the covariates, and so can easily be applied to genome-wide data. Using SKAT to analyze a genome-wide sequencing study of 1000 individuals, by segmenting the whole genome into 30 kb regions, requires only 7 hr on a laptop. Through analysis of simulated data across a wide range of practical scenarios and triglyceride data from the Dallas Heart Study, we show that SKAT can substantially outperform several alternative rare-variant association tests. We also provide analytic power and sample-size calculations to help design candidate-gene, whole-exome, and whole-genome sequence association studies.

DOI

10.1016/j.ajhg.2011.05.029

SKAT-O

Tool

PUBMED_LINK

22699862

FULL NAME

sequence kernel association test - optimal test

DESCRIPTION

SKAT-O estimates the correlation parameter in the kernel matrix so as to maximize power. This parameter corresponds to the weight in the linear combination of the burden test and SKAT test statistics.
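
The linear combination can be written as Q_rho = (1 - rho) * Q_SKAT + rho * Q_burden, where rho = 0 gives SKAT and rho = 1 gives the burden test. The sketch below evaluates this family over a small grid; the function name `skat_o_q`, the grid and the toy weighted scores are assumptions for illustration. Selecting the best rho requires comparing p-values across the grid, which is not shown here.

```python
import numpy as np

def skat_o_q(weighted_scores, rho):
    """Q_rho = (1 - rho) * Q_SKAT + rho * Q_burden for 0 <= rho <= 1.

    weighted_scores holds w_j * S_j for each variant in the set.
    Q_SKAT sums the squared terms; Q_burden squares their sum.
    """
    s = np.asarray(weighted_scores, dtype=float)
    q_skat = np.sum(s ** 2)
    q_burden = np.sum(s) ** 2
    return float((1.0 - rho) * q_skat + rho * q_burden)

# Toy weighted scores with mixed signs (hypothetical values).
weighted_scores = np.array([1.0, -2.0, 0.5])
grid = [0.0, 0.5, 1.0]
q_values = [skat_o_q(weighted_scores, rho) for rho in grid]
print(q_values)
```

With effects in opposite directions, as in the toy scores, the terms cancel inside the burden statistic but not inside the SKAT statistic, which is the situation where a small rho is favoured.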

URL

https://www.hsph.harvard.edu/skat/

TITLE

Optimal tests for rare variant effects in sequencing association studies.

Main citation

Lee S, Wu MC, Lin X. (2012) Optimal tests for rare variant effects in sequencing association studies. Biostatistics, 13 (4) 762-75. doi:10.1093/biostatistics/kxs014. PMID 22699862

ABSTRACT

With development of massively parallel sequencing technologies, there is a substantial need for developing powerful rare variant association tests. Common approaches include burden and non-burden tests. Burden tests assume all rare variants in the target region have effects on the phenotype in the same direction and of similar magnitude. The recently proposed sequence kernel association test (SKAT) (Wu, M. C., and others, 2011. Rare-variant association testing for sequencing data with the SKAT. The American Journal of Human Genetics 89, 82-93), an extension of the C-alpha test (Neale, B. M., and others, 2011. Testing for an unusual distribution of rare variants. PLoS Genetics 7, 161-165), provides a robust test that is particularly powerful in the presence of protective and deleterious variants and null variants, but is less powerful than burden tests when a large number of variants in a region are causal and in the same direction. As the underlying biological mechanisms are unknown in practice and vary from one gene to another across the genome, it is of substantial practical interest to develop a test that is optimal for both scenarios. In this paper, we propose a class of tests that include burden tests and SKAT as special cases, and derive an optimal test within this class that maximizes power. We show that this optimal test outperforms burden tests and SKAT in a wide range of scenarios. The results are illustrated using simulation studies and triglyceride data from the Dallas Heart Study. In addition, we have derived sample size/power calculation formula for SKAT with a new family of kernels to facilitate designing new sequence association studies.

DOI

10.1093/biostatistics/kxs014

SMMAT

Tool

PUBMED_LINK

30639324

FULL NAME

variant set mixed model association tests

DESCRIPTION

For rare variant analysis from sequencing association studies, GMMAT performs the variant Set Mixed Model Association Tests (SMMAT) as proposed in Chen et al. (2019), including the burden test, the sequence kernel association test (SKAT), SKAT-O and an efficient hybrid test of the burden test and SKAT, based on user-defined variant sets.

URL

https://github.com/hanchenphd/GMMAT

TITLE

Efficient Variant Set Mixed Model Association Tests for Continuous and Binary Traits in Large-Scale Whole-Genome Sequencing Studies.

Main citation

Chen H, Huffman JE, Brody JA, Wang C, ...&, Lin X. (2019) Efficient Variant Set Mixed Model Association Tests for Continuous and Binary Traits in Large-Scale Whole-Genome Sequencing Studies. Am J Hum Genet, 104 (2) 260-274. doi:10.1016/j.ajhg.2018.12.012. PMID 30639324

ABSTRACT

With advances in whole-genome sequencing (WGS) technology, more advanced statistical methods for testing genetic association with rare variants are being developed. Methods in which variants are grouped for analysis are also known as variant-set, gene-based, and aggregate unit tests. The burden test and sequence kernel association test (SKAT) are two widely used variant-set tests, which were originally developed for samples of unrelated individuals and later have been extended to family data with known pedigree structures. However, computationally efficient and powerful variant-set tests are needed to make analyses tractable in large-scale WGS studies with complex study samples. In this paper, we propose the variant-set mixed model association tests (SMMAT) for continuous and binary traits using the generalized linear mixed model framework. These tests can be applied to large-scale WGS studies involving samples with population structure and relatedness, such as in the National Heart, Lung, and Blood Institute's Trans-Omics for Precision Medicine (TOPMed) program. SMMATs share the same null model for different variant sets, and a virtue of this null model, which includes covariates only, is that it needs to be fit only once for all tests in each genome-wide analysis. Simulation studies show that all the proposed SMMATs correctly control type I error rates for both continuous and binary traits in the presence of population structure and relatedness. We also illustrate our tests in a real data example of analysis of plasma fibrinogen levels in the TOPMed program (n = 23,763), using the Analysis Commons, a cloud-based computing platform.

DOI

10.1016/j.ajhg.2018.12.012

SMR

Tool

PUBMED_LINK

27019110

FULL NAME

Summary-data-based Mendelian Randomization

DESCRIPTION

The SMR software tool was originally developed to implement the SMR & HEIDI methods to test for pleiotropic association between the expression level of a gene and a complex trait of interest using summary-level data from GWAS and expression quantitative trait loci (eQTL) studies (Zhu et al. 2016 Nature Genetics). The SMR & HEIDI methodology can be interpreted as an analysis to test if the effect size of a SNP on the phenotype is mediated by gene expression. This tool can therefore be used to prioritize genes underlying GWAS hits for follow-up functional studies. The methods are applicable to all kinds of molecular QTL (xQTL) data, including DNA methylation QTL (mQTL) and protein abundance QTL (pQTL).
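
As an illustration of the SMR test described above, the following is a minimal Python sketch, assuming the approximate test statistic reported in Zhu et al. (2016): the effect of expression on the trait is estimated as b_xy = b_zy / b_zx, and T_SMR = z_zy^2 * z_zx^2 / (z_zy^2 + z_zx^2) is compared against a chi-squared distribution with 1 degree of freedom. The function name `smr_test` and its arguments are hypothetical and are not part of the SMR software; the HEIDI test is not covered here.

```python
import math


def smr_test(b_zx, se_zx, b_zy, se_zy):
    """Approximate SMR test from summary statistics of one top cis-eQTL SNP.

    b_zx, se_zx: SNP effect on gene expression (eQTL study) and its standard error.
    b_zy, se_zy: SNP effect on the trait (GWAS) and its standard error.
    Assumes b_zx is non-zero, i.e. the SNP is a usable instrument.
    """
    z_zx = b_zx / se_zx
    z_zy = b_zy / se_zy
    # Estimated effect of expression on the trait (ratio estimate).
    b_xy = b_zy / b_zx
    # Approximate chi-squared (1 df) statistic.
    t_smr = (z_zx ** 2 * z_zy ** 2) / (z_zx ** 2 + z_zy ** 2)
    # Upper tail of chi-squared with 1 df: P(X > t) = erfc(sqrt(t / 2)).
    p_value = math.erfc(math.sqrt(t_smr / 2.0))
    return b_xy, t_smr, p_value


if __name__ == "__main__":
    b_xy, t_smr, p_value = smr_test(b_zx=0.5, se_zx=0.05, b_zy=0.1, se_zy=0.02)
    print(f"b_xy={b_xy:.3f}  T_SMR={t_smr:.2f}  p={p_value:.2e}")
```

Because T_SMR can never exceed the smaller of the two squared z-scores, a gene can only pass this test when both the eQTL and the GWAS association at the SNP are strong.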

URL

https://yanglab.westlake.edu.cn/software/smr/#Overview

KEYWORDS

pleiotropy or causality, xQTL, eQTL, MR, HEIDI, linkage

TITLE

Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets.

Main citation

Zhu Z, Zhang F, Hu H, Bakshi A, ...&, Yang J. (2016) Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat Genet, 48 (5) 481-7. doi:10.1038/ng.3538. PMID 27019110

ABSTRACT

Genome-wide association studies (GWAS) have identified thousands of genetic variants associated with human complex traits. However, the genes or functional DNA elements through which these variants exert their effects on the traits are often unknown. We propose a method (called SMR) that integrates summary-level data from GWAS with data from expression quantitative trait locus (eQTL) studies to identify genes whose expression levels are associated with a complex trait because of pleiotropy. We apply the method to five human complex traits using GWAS data on up to 339,224 individuals and eQTL data on 5,311 individuals, and we prioritize 126 genes (for example, TRAF1 and ANKRD55 for rheumatoid arthritis and SNX19 and NMRAL1 for schizophrenia), of which 25 genes are new candidates; 77 genes are not the nearest annotated gene to the top associated GWAS SNP. These genes provide important leads to design future functional studies to understand the mechanism whereby DNA variation leads to complex trait variation.

DOI

10.1038/ng.3538

SMR-multi

Tool

PUBMED_LINK

29500431

TITLE

Integrative analysis of omics summary data reveals putative mechanisms underlying complex traits.

Main citation

Wu Y, Zeng J, Zhang F, Zhu Z, ...&, Yang J. (2018) Integrative analysis of omics summary data reveals putative mechanisms underlying complex traits. Nat Commun, 9 (1) 918. doi:10.1038/s41467-018-03371-0. PMID 29500431

ABSTRACT

The identification of genes and regulatory elements underlying the associations discovered by GWAS is essential to understanding the aetiology of complex traits (including diseases). Here, we demonstrate an analytical paradigm of prioritizing genes and regulatory elements at GWAS loci for follow-up functional studies. We perform an integrative analysis that uses summary-level SNP data from multi-omics studies to detect DNA methylation (DNAm) sites associated with gene expression and phenotype through shared genetic effects (i.e., pleiotropy). We identify pleiotropic associations between 7858 DNAm sites and 2733 genes. These DNAm sites are enriched in enhancers and promoters, and >40% of them are mapped to distal genes. Further pleiotropic association analyses, which link both the methylome and transcriptome to 12 complex traits, identify 149 DNAm sites and 66 genes, indicating a plausible mechanism whereby the effect of a genetic variant on phenotype is mediated by genetic regulation of transcription through DNAm.

DOI

10.1038/s41467-018-03371-0

snipar

Tool

PUBMED_LINK

35681053

FULL NAME

single nucleotide imputation of parents

DESCRIPTION

snipar (single nucleotide imputation of parents) is a Python package for inferring identity-by-descent (IBD) segments shared between siblings, imputing missing parental genotypes, and for performing family based genome-wide association and polygenic score analyses using observed and/or imputed parental genotypes.

URL

https://github.com/AlexTISYoung/snipar

TITLE

Mendelian imputation of parental genotypes improves estimates of direct genetic effects.

Main citation

Young AI, Nehzati SM, Benonisdottir S, Okbay A, ...&, Kong A. (2022) Mendelian imputation of parental genotypes improves estimates of direct genetic effects. Nat Genet, 54 (6) 897-905. doi:10.1038/s41588-022-01085-0. PMID 35681053

ABSTRACT

Effects estimated by genome-wide association studies (GWASs) include effects of alleles in an individual on that individual (direct genetic effects), indirect genetic effects (for example, effects of alleles in parents on offspring through the environment) and bias from confounding. Within-family genetic variation is random, enabling unbiased estimation of direct genetic effects when parents are genotyped. However, parental genotypes are often missing. We introduce a method that imputes missing parental genotypes and estimates direct genetic effects. Our method, implemented in the software package snipar (single-nucleotide imputation of parents), gives more precise estimates of direct genetic effects than existing approaches. Using 39,614 individuals from the UK Biobank with at least one genotyped sibling/parent, we estimate the correlation between direct genetic effects and effects from standard GWASs for nine phenotypes, including educational attainment (r = 0.739, standard error (s.e.) = 0.086) and cognitive ability (r = 0.490, s.e. = 0.086). Our results demonstrate substantial confounding bias in standard GWASs for some phenotypes.

DOI

10.1038/s41588-022-01085-0

snipar-unified estimator (snipar)

Tool

PUBMED_LINK

40065166

FULL NAME

single nucleotide imputation of parents

URL

https://github.com/AlexTISYoung/snipar

TITLE

Family-based genome-wide association study designs for increased power and robustness.

Main citation

Guan J, Tan T, Nehzati SM, Bennett M, ...&, Young AS. (2025) Family-based genome-wide association study designs for increased power and robustness. Nat Genet, 57 (4) 1044-1052. doi:10.1038/s41588-025-02118-0. PMID 40065166

ABSTRACT

Family-based genome-wide association studies (FGWASs) use random, within-family genetic variation to remove confounding from estimates of direct genetic effects (DGEs). Here we introduce a 'unified estimator' that includes individuals without genotyped relatives, unifying standard and FGWAS while increasing power for DGE estimation. We also introduce a 'robust estimator' that is not biased in structured and/or admixed populations. In an analysis of 19 phenotypes in the UK Biobank, the unified estimator in the White British subsample and the robust estimator (applied without ancestry restrictions) increased the effective sample size for DGEs by 46.9% to 106.5% and 10.3% to 21.0%, respectively, compared to using genetic differences between siblings. Polygenic predictors derived from the unified estimator demonstrated superior out-of-sample prediction ability compared to other family-based methods. We implemented the methods in the software package snipar in an efficient linear mixed model that accounts for sample relatedness and sibling shared environment.

DOI

10.1038/s41588-025-02118-0

SNP2HLA

Tool

PUBMED_LINK

23762245

URL

http://software.broadinstitute.org/mpg/snp2hla/

TITLE

Imputing amino acid polymorphisms in human leukocyte antigens.

Main citation

Jia X, Han B, Onengut-Gumuscu S, Chen WM, ...&, de Bakker PI. (2013) Imputing amino acid polymorphisms in human leukocyte antigens. PLoS One, 8 (6) e64683. doi:10.1371/journal.pone.0064683. PMID 23762245

ABSTRACT

DNA sequence variation within human leukocyte antigen (HLA) genes mediate susceptibility to a wide range of human diseases. The complex genetic structure of the major histocompatibility complex (MHC) makes it difficult, however, to collect genotyping data in large cohorts. Long-range linkage disequilibrium between HLA loci and SNP markers across the major histocompatibility complex (MHC) region offers an alternative approach through imputation to interrogate HLA variation in existing GWAS data sets. Here we describe a computational strategy, SNP2HLA, to impute classical alleles and amino acid polymorphisms at class I (HLA-A, -B, -C) and class II (-DPA1, -DPB1, -DQA1, -DQB1, and -DRB1) loci. To characterize performance of SNP2HLA, we constructed two European ancestry reference panels, one based on data collected in HapMap-CEPH pedigrees (90 individuals) and another based on data collected by the Type 1 Diabetes Genetics Consortium (T1DGC, 5,225 individuals). We imputed HLA alleles in an independent data set from the British 1958 Birth Cohort (N = 918) with gold standard four-digit HLA types and SNPs genotyped using the Affymetrix GeneChip 500 K and Illumina Immunochip microarrays. We demonstrate that the sample size of the reference panel, rather than SNP density of the genotyping platform, is critical to achieve high imputation accuracy. Using the larger T1DGC reference panel, the average accuracy at four-digit resolution is 94.7% using the low-density Affymetrix GeneChip 500 K, and 96.7% using the high-density Illumina Immunochip. For amino acid polymorphisms within HLA genes, we achieve 98.6% and 99.3% accuracy using the Affymetrix GeneChip 500 K and Illumina Immunochip, respectively. Finally, we demonstrate how imputation and association testing at amino acid resolution can facilitate fine-mapping of primary MHC association signals, giving a specific example from type 1 diabetes.

DOI

10.1371/journal.pone.0064683

SnpEff

Tool

PUBMED_LINK

22728672

FULL NAME

SNP effect

DESCRIPTION

Genetic variant annotation and functional effect prediction toolbox. It annotates and predicts the effects of genetic variants on genes and proteins (such as amino acid changes).

URL

http://pcingola.github.io/SnpEff/

TITLE

A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3.

Main citation

Cingolani P, Platts A, Wang le L, Coon M, ...&, Ruden DM. (2012) A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin), 6 (2) 80-92. doi:10.4161/fly.19695. PMID 22728672

ABSTRACT

We describe a new computer program, SnpEff, for rapidly categorizing the effects of variants in genome sequences. Once a genome is sequenced, SnpEff annotates variants based on their genomic locations and predicts coding effects. Annotated genomic locations include intronic, untranslated region, upstream, downstream, splice site, or intergenic regions. Coding effects such as synonymous or non-synonymous amino acid replacement, start codon gains or losses, stop codon gains or losses, or frame shifts can be predicted. Here the use of SnpEff is illustrated by annotating ~356,660 candidate SNPs in ~117 Mb unique sequences, representing a substitution rate of ~1/305 nucleotides, between the Drosophila melanogaster w(1118); iso-2; iso-3 strain and the reference y(1); cn(1) bw(1) sp(1) strain. We show that ~15,842 SNPs are synonymous and ~4,467 SNPs are non-synonymous (N/S ~0.28). The remaining SNPs are in other categories, such as stop codon gains (38 SNPs), stop codon losses (8 SNPs), and start codon gains (297 SNPs) in the 5'UTR. We found, as expected, that the SNP frequency is proportional to the recombination frequency (i.e., highest in the middle of chromosome arms). We also found that start-gain or stop-lost SNPs in Drosophila melanogaster often result in additions of N-terminal or C-terminal amino acids that are conserved in other Drosophila species. It appears that the 5' and 3' UTRs are reservoirs for genetic variations that changes the termini of proteins during evolution of the Drosophila genus. As genome sequencing is becoming inexpensive and routine, SnpEff enables rapid analyses of whole-genome sequencing data to be performed by an individual laboratory.
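
The coding-effect categories mentioned in the abstract (synonymous, non-synonymous, stop gained, stop lost) can be illustrated with a short Python sketch. This is not SnpEff's implementation: it assumes the standard genetic code, takes already-extracted reference and alternate codons as input, and uses the hypothetical function name `classify_coding_snp`.

```python
# Standard genetic code, codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_coding_snp(ref_codon, alt_codon):
    """Classify a single-codon change by comparing the encoded amino acids."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        # Includes stop-to-stop changes, where the stop codon is retained.
        return "synonymous"
    if alt_aa == "*":
        return "stop_gained"
    if ref_aa == "*":
        return "stop_lost"
    return "non_synonymous"


if __name__ == "__main__":
    for ref, alt in [("GAA", "GAG"), ("GAA", "GAT"), ("TGG", "TGA"), ("TAA", "CAA")]:
        print(ref, "->", alt, classify_coding_snp(ref, alt))
```

A real annotator also needs the transcript model (strand, exon boundaries, reading frame) to find the affected codon; the sketch starts after that step.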

DOI

10.4161/fly.19695

SOMAmer

Tool

PUBMED_LINK

29079756

TITLE

Assessment of Variability in the SOMAscan Assay.

Main citation

Candia J, Cheung F, Kotliarov Y, Fantoni G, ...&, Biancotto A. (2017) Assessment of Variability in the SOMAscan Assay. Sci Rep, 7 (1) 14248. doi:10.1038/s41598-017-14755-5. PMID 29079756

ABSTRACT

SOMAscan is an aptamer-based proteomics assay capable of measuring 1,305 human protein analytes in serum, plasma, and other biological matrices with high sensitivity and specificity. In this work, we present a comprehensive meta-analysis of performance based on multiple serum and plasma runs using the current 1.3 k assay, as well as the previous 1.1 k version. We discuss normalization procedures and examine different strategies to minimize intra- and interplate nuisance effects. We implement a meta-analysis based on calibrator samples to characterize the coefficient of variation and signal-over-background intensity of each protein analyte. By incorporating coefficient of variation estimates into a theoretical model of statistical variability, we also provide a framework to enable rigorous statistical tests of significance in intervention studies and clinical trials, as well as quality control within and across laboratories. Furthermore, we investigate the stability of healthy subject baselines and determine the set of analytes that exhibit biologically stable baselines after technical variability is factored in. This work is accompanied by an interactive web-based tool, an initiative with the potential to become the cornerstone of a regularly updated, high quality repository with data sharing, reproducibility, and reusability as ultimate goals.

DOI

10.1038/s41598-017-14755-5

South and East Asian Reference Database (SEAD)

Tool

FULL NAME

South and East Asian Reference Database

URL

https://imputationserver.westlake.edu.cn/

PREPRINT_DOI

10.1101/2023.12.23.23300480

Main citation

Yang, M. Y., Zhong, J. D., Li, X., Bai, W. Y., Yuan, C. D., Qiu, M. C., ... & Zheng, H. F. (2023). SEAD: an augmented reference panel with 22,134 haplotypes boosts the rare variants imputation and GWAS analysis in Asian population. medRxiv, 2023-12.

SPAGRM

Tool

PUBMED_LINK

39915470

DESCRIPTION

SPAGRM is a scalable and accurate analysis framework to control for sample relatedness in large-scale genome-wide association studies (GWAS).

URL

https://github.com/HeXuPKU/SPAGRM

KEYWORDS

SPA, longitudinal traits

TITLE

SPA

Main citation

Xu H, Ma Y, Xu LL, Li Y, ...&, Bi W. (2025) SPA Nat Commun, 16 (1) 1413. doi:10.1038/s41467-025-56669-1. PMID 39915470

ABSTRACT

Sample relatedness is a major confounder in genome-wide association studies (GWAS), potentially leading to inflated type I error rates if not appropriately controlled. A common strategy is to incorporate a random effect related to genetic relatedness matrix (GRM) into regression models. However, this approach is challenging for large-scale GWAS of complex traits, such as longitudinal traits. Here we propose a scalable and accurate analysis framework, SPAGRM, which controls for sample relatedness via a precise approximation of the joint distribution of genotypes. SPAGRM can utilize GRM-free models and thus is applicable to various trait types and statistical methods, including linear mixed models and generalized estimation equations for longitudinal traits. A hybrid strategy incorporating saddlepoint approximation greatly increases the accuracy to analyze low-frequency and rare genetic variants, especially in unbalanced phenotypic distributions. We also introduce SPAGRM(CCT) to aggregate the results following different models via Cauchy combination test. Extensive simulations and real data analyses demonstrated that SPAGRM maintains well-controlled type I error rates and SPAGRM(CCT) can serve as a broadly effective method. Applying SPAGRM to 79 longitudinal traits extracted from UK Biobank primary care data, we identified 7,463 genetic loci, making a pioneering attempt to conduct GWAS for these traits as longitudinal traits.
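
The Cauchy combination test used to aggregate p values across models can be sketched in a few lines of Python. This is a generic illustration of the standard Cauchy combination formula, not code from SPAGRM: the function name `cauchy_combination` is hypothetical, weights are assumed to be non-negative and to sum to 1, and p values are assumed to lie strictly between 0 and 1.

```python
import math


def cauchy_combination(pvalues, weights=None):
    """Combine p values with the Cauchy combination test.

    Each p value is mapped to a standard Cauchy variate via tan((0.5 - p) * pi);
    the weighted sum is again treated as standard Cauchy under the null.
    """
    n = len(pvalues)
    if weights is None:
        weights = [1.0 / n] * n  # equal weights (an assumption of this sketch)
    statistic = sum(
        w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvalues)
    )
    # Upper tail probability of the standard Cauchy distribution.
    return 0.5 - math.atan(statistic) / math.pi


if __name__ == "__main__":
    print(cauchy_combination([0.01, 0.20, 0.65]))
```

The combined p value is dominated by the smallest inputs, which is why the test remains useful when only one of the combined models captures the signal.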

DOI

10.1038/s41467-025-56669-1

SPAGxECCT

Tool

PUBMED_LINK

40157913

DESCRIPTION

A scalable and accurate framework for large-scale genome-wide gene-environment interaction (G×E) analysis.

URL

https://github.com/YuzhuoMa97/SPAGxECCT

KEYWORDS

Cauchy combination test

TITLE

Efficient and accurate framework for genome-wide gene-environment interaction analysis in large-scale biobanks.

Main citation

Ma Y, Zhao Y, Zhang JF, Bi W. (2025) Efficient and accurate framework for genome-wide gene-environment interaction analysis in large-scale biobanks. Nat Commun, 16 (1) 3064. doi:10.1038/s41467-025-57887-3. PMID 40157913

ABSTRACT

Gene-environment interaction (G×E) analysis elucidates the interplay between genetic and environmental factors. Genome-wide association studies (GWAS) have expanded to encompass complex traits like time-to-event and ordinal traits, which provide richer phenotypic information. However, most existing scalable approaches focus only on quantitative or binary traits. Here we propose SPAGxECCT, a scalable and accurate framework for diverse trait types. SPAGxECCT fits a genotype-independent model and employs a hybrid strategy including saddlepoint approximation (SPA) for accurate p value calculation, especially for low-frequency variants and unbalanced phenotypic distributions. We extend SPAGxECCT to SPAGxEmixCCT, which accounts for population stratification and is applicable to multi-ancestry or admixed populations. SPAGxEmixCCT can further be extended to SPAGxEmixCCT-local, which identifies ancestry-specific G×E effects using local ancestry. Through extensive simulations and real data analyses of UK Biobank data, we demonstrate that SPAGxECCT and SPAGxEmixCCT are scalable to analyze large-scale study cohort, control type I error rates effectively, and maintain power.

DOI

10.1038/s41467-025-57887-3

SparsePro

Tool

PUBMED_LINK

38153934

DESCRIPTION

SparsePro is a command line tool for efficiently conducting genome-wide fine-mapping. Our method has two key features: First, by creating a sparse low-dimensional projection of the high-dimensional genotype, we enable a linear search of causal variants instead of an exponential search of causal configurations in most existing methods; Second, we adopt a probabilistic framework with a highly efficient variational expectation-maximization algorithm to integrate statistical associations and functional priors.

URL

https://github.com/zhwm/SparsePro

TITLE

SparsePro: An efficient fine-mapping method integrating summary statistics and functional annotations.

Main citation

Zhang W, Najafabadi H, Li Y. (2023) SparsePro: An efficient fine-mapping method integrating summary statistics and functional annotations. PLoS Genet, 19 (12) e1011104. doi:10.1371/journal.pgen.1011104. PMID 38153934

ABSTRACT

Identifying causal variants from genome-wide association studies (GWAS) is challenging due to widespread linkage disequilibrium (LD) and the possible existence of multiple causal variants in the same genomic locus. Functional annotations of the genome may help to prioritize variants that are biologically relevant and thus improve fine-mapping of GWAS results. Classical fine-mapping methods conducting an exhaustive search of variant-level causal configurations have a high computational cost, especially when the underlying genetic architecture and LD patterns are complex. SuSiE provided an iterative Bayesian stepwise selection algorithm for efficient fine-mapping. In this work, we build connections between SuSiE and a paired mean field variational inference algorithm through the implementation of a sparse projection, and propose effective strategies for estimating hyperparameters and summarizing posterior probabilities. Moreover, we incorporate functional annotations into fine-mapping by jointly estimating enrichment weights to derive functionally-informed priors. We evaluate the performance of SparsePro through extensive simulations using resources from the UK Biobank. Compared to state-of-the-art methods, SparsePro achieved improved power for fine-mapping with reduced computation time. We demonstrate the utility of SparsePro through fine-mapping of five functional biomarkers of clinically relevant phenotypes. In summary, we have developed an efficient fine-mapping method for integrating summary statistics and functional annotations. Our method can have wide utility in understanding the genetics of complex traits and increasing the yield of functional follow-up studies of GWAS. SparsePro software is available on GitHub at https://github.com/zhwm/SparsePro.

DOI

10.1371/journal.pgen.1011104

sQTLseekeR

Tool

PUBMED_LINK

25140736

DESCRIPTION

sQTLseekeR is an R package to detect splicing QTLs (sQTLs), which are variants associated with changes in the splicing pattern of a gene. Here, splicing patterns are modeled by the relative expression of the transcripts of a gene.

URL

https://github.com/jmonlong/sQTLseekeR

TITLE

Identification of genetic variants associated with alternative splicing using sQTLseekeR.

Main citation

Monlong J, Calvo M, Ferreira PG, Guigó R. (2014) Identification of genetic variants associated with alternative splicing using sQTLseekeR. Nat Commun, 5, 4698. doi:10.1038/ncomms5698. PMID 25140736

ABSTRACT

Identification of genetic variants affecting splicing in RNA sequencing population studies is still in its infancy. Splicing phenotype is more complex than gene expression and ought to be treated as a multivariate phenotype to be recapitulated completely. Here we represent the splicing pattern of a gene as the distribution of the relative abundances of a gene's alternative transcript isoforms. We develop a statistical framework that uses a distance-based approach to compute the variability of splicing ratios across observations, and a non-parametric analogue to multivariate analysis of variance. We implement this approach in the R package sQTLseekeR and use it to analyze RNA-Seq data from the Geuvadis project in 465 individuals. We identify hundreds of single nucleotide polymorphisms (SNPs) as splicing QTLs (sQTLs), including some falling in genome-wide association study SNPs. By developing the appropriate metrics, we show that sQTLseekeR compares favorably with existing methods that rely on univariate approaches, predicting variants that behave as expected from mutations affecting splicing.
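
The idea of representing a gene's splicing pattern as relative transcript abundances and comparing individuals with a distance can be sketched as follows. This is a minimal Python illustration rather than the R package's code; the Hellinger distance is used here as one plausible choice of distance (an assumption of this sketch), and the function names are hypothetical.

```python
import math


def splicing_ratios(transcript_expression):
    """Convert transcript expression values of one gene into relative abundances."""
    total = sum(transcript_expression)
    if total <= 0:
        raise ValueError("gene is not expressed in this sample")
    return [x / total for x in transcript_expression]


def hellinger_distance(p, q):
    """Hellinger distance between two vectors of splicing ratios (range 0 to 1)."""
    return math.sqrt(
        0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    )


if __name__ == "__main__":
    # Two samples, three transcript isoforms of the same gene.
    sample_a = splicing_ratios([10.0, 30.0, 60.0])
    sample_b = splicing_ratios([40.0, 40.0, 20.0])
    print(sample_a, sample_b, hellinger_distance(sample_a, sample_b))
```

A distance-based test would then ask whether the distances between individuals are better explained by genotype group than expected by chance, which is the role of the non-parametric analogue to multivariate analysis of variance mentioned in the abstract.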

DOI

10.1038/ncomms5698

STAAR

Tool

PUBMED_LINK

32839606

FULL NAME

variant-set test for association using annotation information

DESCRIPTION

STAAR is an R package for performing the variant-Set Test for Association using Annotation infoRmation (STAAR) procedure in whole-genome sequencing (WGS) studies. STAAR is a general framework that incorporates both qualitative functional categories and quantitative complementary functional annotations using an omnibus multi-dimensional weighting scheme. STAAR accounts for population structure and relatedness, and is scalable for analyzing large WGS studies of continuous and dichotomous traits.

URL

https://github.com/xihaoli/STAAR

KEYWORDS

functional annotations

TITLE

Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale.

Main citation

Li X, Li Z, Zhou H, Gaynor SM, ...&, Lin X. (2020) Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale. Nat Genet, 52 (9) 969-983. doi:10.1038/s41588-020-0676-4. PMID 32839606

ABSTRACT

Large-scale whole-genome sequencing studies have enabled the analysis of rare variants (RVs) associated with complex phenotypes. Commonly used RV association tests have limited scope to leverage variant functions. We propose STAAR (variant-set test for association using annotation information), a scalable and powerful RV association test method that effectively incorporates both variant categories and multiple complementary annotations using a dynamic weighting scheme. For the latter, we introduce 'annotation principal components', multidimensional summaries of in silico variant annotations. STAAR accounts for population structure and relatedness and is scalable for analyzing very large cohort and biobank whole-genome sequencing studies of continuous and dichotomous traits. We applied STAAR to identify RVs associated with four lipid traits in 12,316 discovery and 17,822 replication samples from the Trans-Omics for Precision Medicine Program. We discovered and replicated new RV associations, including disruptive missense RVs of NPC1L1 and an intergenic region near APOC1P1 associated with low-density lipoprotein cholesterol.

DOI

10.1038/s41588-020-0676-4

STAARpipeline

Tool

PUBMED_LINK

36303018

FULL NAME

variant-set test for association using annotation information

DESCRIPTION

STAARpipeline is an R package for phenotype-genotype association analyses of biobank-scale WGS/WES data, including single variant analysis and variant set analysis.

URL

https://github.com/xihaoli/STAARpipeline/

TITLE

A framework for detecting noncoding rare-variant associations of large-scale whole-genome sequencing studies.

Main citation

Li Z, Li X, Zhou H, Gaynor SM, ...&, Lin X. (2022) A framework for detecting noncoding rare-variant associations of large-scale whole-genome sequencing studies. Nat Methods, 19 (12) 1599-1611. doi:10.1038/s41592-022-01640-x. PMID 36303018

ABSTRACT

Large-scale whole-genome sequencing studies have enabled analysis of noncoding rare-variant (RV) associations with complex human diseases and traits. Variant-set analysis is a powerful approach to study RV association. However, existing methods have limited ability in analyzing the noncoding genome. We propose a computationally efficient and robust noncoding RV association detection framework, STAARpipeline, to automatically annotate a whole-genome sequencing study and perform flexible noncoding RV association analysis, including gene-centric analysis and fixed window-based and dynamic window-based non-gene-centric analysis by incorporating variant functional annotations. In gene-centric analysis, STAARpipeline uses STAAR to group noncoding variants based on functional categories of genes and incorporate multiple functional annotations. In non-gene-centric analysis, STAARpipeline uses SCANG-STAAR to incorporate dynamic window sizes and multiple functional annotations. We apply STAARpipeline to identify noncoding RV sets associated with four lipid traits in 21,015 discovery samples from the Trans-Omics for Precision Medicine (TOPMed) program and replicate several of them in an additional 9,123 TOPMed samples. We also analyze five non-lipid TOPMed traits.

DOI

10.1038/s41592-022-01640-x

Stephens

Tool

PUBMED_LINK

23861737

TITLE

A unified framework for association analysis with multiple related phenotypes.

Main citation

Stephens M. (2013) A unified framework for association analysis with multiple related phenotypes. PLoS One, 8 (7) e65245. doi:10.1371/journal.pone.0065245. PMID 23861737

ABSTRACT

We consider the problem of assessing associations between multiple related outcome variables, and a single explanatory variable of interest. This problem arises in many settings, including genetic association studies, where the explanatory variable is genotype at a genetic variant. We outline a framework for conducting this type of analysis, based on Bayesian model comparison and model averaging for multivariate regressions. This framework unifies several common approaches to this problem, and includes both standard univariate and standard multivariate association tests as special cases. The framework also unifies the problems of testing for associations and explaining associations - that is, identifying which outcome variables are associated with genotype. This provides an alternative to the usual, but conceptually unsatisfying, approach of resorting to univariate tests when explaining and interpreting significant multivariate findings. The method is computationally tractable genome-wide for modest numbers of phenotypes (e.g. 5-10), and can be applied to summary data, without access to raw genotype and phenotype data. We illustrate the methods on both simulated examples, and to a genome-wide association study of blood lipid traits where we identify 18 potential novel genetic associations that were not identified by univariate analyses of the same data.

DOI

10.1371/journal.pone.0065245

StructLMM

Tool

PUBMED_LINK

30478441

FULL NAME

Structured Linear Mixed Model

DESCRIPTION

Structured Linear Mixed Model (StructLMM) is a computationally efficient method to test for and characterize loci that interact with multiple environments.

URL

https://github.com/limix/struct-lmm, https://github.com/limix/limix

TITLE

A linear mixed-model approach to study multivariate gene-environment interactions.

Main citation

Moore R, Casale FP, Jan Bonder M, Horta D, ...&, Stegle O. (2019) A linear mixed-model approach to study multivariate gene-environment interactions. Nat Genet, 51 (1) 180-186. doi:10.1038/s41588-018-0271-0. PMID 30478441

ABSTRACT

Different exposures, including diet, physical activity, or external conditions can contribute to genotype-environment interactions (G×E). Although high-dimensional environmental data are increasingly available and multiple exposures have been implicated with G×E at the same loci, multi-environment tests for G×E are not established. Here, we propose the structured linear mixed model (StructLMM), a computationally efficient method to identify and characterize loci that interact with one or more environments. After validating our model using simulations, we applied StructLMM to body mass index in the UK Biobank, where our model yields previously known and novel G×E signals. Finally, in an application to a large blood eQTL dataset, we demonstrate that StructLMM can be used to study interactions with hundreds of environmental variables.

DOI

10.1038/s41588-018-0271-0

SumHer

Tool

PUBMED_LINK

30510236

TITLE

SumHer better estimates the SNP heritability of complex traits from summary statistics.

Main citation

Speed D, Balding DJ. (2019) SumHer better estimates the SNP heritability of complex traits from summary statistics. Nat Genet, 51 (2) 277-284. doi:10.1038/s41588-018-0279-5. PMID 30510236

ABSTRACT

We present SumHer, software for estimating confounding bias, SNP heritability, enrichments of heritability and genetic correlations using summary statistics from genome-wide association studies. The key difference between SumHer and the existing software LD Score Regression (LDSC) is that SumHer allows the user to specify the heritability model. We apply SumHer to results from 24 large-scale association studies (average sample size 121,000) using our recommended heritability model. We show that these studies tended to substantially over-correct for confounding, and as a result the number of genome-wide significant loci was under-reported by about a quarter. We also estimate enrichments for 24 categories of SNPs defined by functional annotations. A previous study using LDSC reported that conserved regions were 13-fold enriched, and found a further six categories with above threefold enrichment. By contrast, our analysis using SumHer finds that none of the categories have enrichment above twofold. SumHer provides an improved understanding of the genetic architecture of complex traits, which enables more efficient analysis of future genetic data.

Show full abstractShow less

DOI

10.1038/s41588-018-0279-5

SUPERGNOVA

Tool

PUBMED_LINK

34493297

FULL NAME

SUPER GeNetic cOVariance Analyzer

DESCRIPTION

SUPERGNOVA (SUPER GeNetic cOVariance Analyzer) is a statistical framework to perform local genetic covariance analysis.

Show full descriptionShow less

URL

TITLE

SUPERGNOVA: local genetic correlation analysis reveals heterogeneous etiologic sharing of complex traits.

Main citation

Zhang Y, Lu Q, Ye Y, Huang K, ...&, Zhao H. (2021) SUPERGNOVA: local genetic correlation analysis reveals heterogeneous etiologic sharing of complex traits. Genome Biol, 22 (1) 262. doi:10.1186/s13059-021-02478-w. PMID 34493297

ABSTRACT

Local genetic correlation quantifies the genetic similarity of complex traits in specific genomic regions. However, accurate estimation of local genetic correlation remains challenging, due to linkage disequilibrium in local genomic regions and sample overlap across studies. We introduce SUPERGNOVA, a statistical framework to estimate local genetic correlations using summary statistics from genome-wide association studies. We demonstrate that SUPERGNOVA outperforms existing methods through simulations and analyses of 30 complex traits. In particular, we show that the positive yet paradoxical genetic correlation between autism spectrum disorder and cognitive performance could be explained by two etiologically distinct genetic signatures with bidirectional local genetic correlations.

DOI

10.1186/s13059-021-02478-w

SUSIE

Tool

PUBMED_LINK

37220626

FULL NAME

sum of single effects

DESCRIPTION

The susieR package implements a simple new way to perform variable selection in multiple regression (y = Xb + e). The methods implemented here are particularly well-suited to settings where some of the X variables are highly correlated, and the true effects are highly sparse (e.g. <20 non-zero effects in the vector b). One example of this is genetic fine-mapping applications, and this application was a major motivation for developing these methods.

URL

https://stephenslab.github.io/susieR/index.html

KEYWORDS

fine-mapping, sum of single-effects (SuSiE) regression, iterative Bayesian stepwise selection (IBSS)

TITLE

A simple new approach to variable selection in regression, with application to genetic fine mapping.

Main citation

Wang G, Sarkar A, Carbonetto P, Stephens M. (2020) A simple new approach to variable selection in regression, with application to genetic fine mapping. J R Stat Soc Series B Stat Methodol, 82 (5) 1273-1300. doi:10.1111/rssb.12388. PMID 37220626

ABSTRACT

We introduce a simple new approach to variable selection in linear regression, with a particular focus on quantifying uncertainty in which variables should be selected. The approach is based on a new model - the "Sum of Single Effects" (SuSiE) model - which comes from writing the sparse vector of regression coefficients as a sum of "single-effect" vectors, each with one non-zero element. We also introduce a corresponding new fitting procedure - Iterative Bayesian Stepwise Selection (IBSS) - which is a Bayesian analogue of stepwise selection methods. IBSS shares the computational simplicity and speed of traditional stepwise methods, but instead of selecting a single variable at each step, IBSS computes a distribution on variables that captures uncertainty in which variable to select. We provide a formal justification of this intuitive algorithm by showing that it optimizes a variational approximation to the posterior distribution under the SuSiE model. Further, this approximate posterior distribution naturally yields convenient novel summaries of uncertainty in variable selection, providing a Credible Set of variables for each selection. Our methods are particularly well-suited to settings where variables are highly correlated and detectable effects are sparse, both of which are characteristics of genetic fine-mapping applications. We demonstrate through numerical experiments that our methods outperform existing methods for this task, and illustrate their application to fine-mapping genetic variants influencing alternative splicing in human cell-lines. We also discuss the potential and challenges for applying these methods to generic variable selection problems.
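
The fitting procedure described in the abstract can be illustrated with a minimal NumPy sketch. It is not the susieR implementation: the function names `single_effect_regression` and `ibss` are hypothetical, the residual variance `sigma2` and prior effect variance `sigma0_2` are assumed fixed rather than estimated, the prior over variables is assumed uniform, and a fixed number of sweeps stands in for a convergence check.

```python
import numpy as np

def single_effect_regression(X, r, d, sigma2, sigma0_2):
    """Fit one 'single effect': exactly one variable has a non-zero coefficient."""
    betahat = X.T @ r / d                      # per-variable least-squares estimates
    s2 = sigma2 / d                            # their sampling variances
    z2 = betahat**2 / s2
    # Log Bayes factor of "variable j carries the effect" versus "no effect".
    log_bf = 0.5 * np.log(s2 / (s2 + sigma0_2)) + 0.5 * z2 * sigma0_2 / (sigma0_2 + s2)
    w = np.exp(log_bf - log_bf.max())          # subtract the max for numerical stability
    alpha = w / w.sum()                        # distribution over which variable to select
    post_var = 1.0 / (1.0 / s2 + 1.0 / sigma0_2)
    mu = post_var * betahat / s2               # posterior mean effect, given selection
    return alpha, mu

def ibss(X, y, L=5, sigma2=1.0, sigma0_2=1.0, n_iter=50):
    """Iteratively refit each of L single effects on the residual left by the others."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    p = X.shape[1]
    d = np.sum(X**2, axis=0)
    alpha = np.full((L, p), 1.0 / p)
    mu = np.zeros((L, p))
    for _ in range(n_iter):
        for l in range(L):
            # Posterior mean of the coefficient vector from all effects except l.
            b_others = (alpha * mu).sum(axis=0) - alpha[l] * mu[l]
            r = y - X @ b_others
            alpha[l], mu[l] = single_effect_regression(X, r, d, sigma2, sigma0_2)
    # Posterior inclusion probability: variable selected by at least one effect.
    pip = 1.0 - np.prod(1.0 - alpha, axis=0)
    return pip, alpha, mu
```

Each row of `alpha` is the "distribution on variables" that IBSS computes in place of a single stepwise selection; Credible Sets would be read off those rows, which this sketch leaves out.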

DOI

10.1111/rssb.12388

SuSiE PCA

Tool

PUBMED_LINK

37953948

DESCRIPTION

SuSiE PCA is the abbreviation for the Sum of Single Effects model for principal component analysis. We develop SuSiE PCA for efficient variable selection in PCA when dealing with sparse, high-dimensional data, and for quantifying the uncertainty of contributing features for each latent component through posterior inclusion probabilities (PIPs).

URL

https://github.com/mancusolab/susiepca

KEYWORDS

PCA, SuSiE

TITLE

SuSiE PCA: A scalable Bayesian variable selection technique for principal component analysis.

Main citation

Yuan D, Mancuso N. (2023) SuSiE PCA: A scalable Bayesian variable selection technique for principal component analysis. iScience, 26 (11) 108181. doi:10.1016/j.isci.2023.108181. PMID 37953948

ABSTRACT

Latent factor models, like principal component analysis (PCA), provide a statistical framework to infer low-rank representation in various biological contexts. However, feature selection is challenging when this low-rank structure manifests from a sparse subspace. We introduce SuSiE PCA, a scalable sparse latent factor approach that evaluates uncertainty in contributing variables through posterior inclusion probabilities. We validate our model in extensive simulations and demonstrate that SuSiE PCA outperforms other approaches in signal detection and model robustness. We apply SuSiE PCA to multi-tissue expression quantitative trait loci (eQTLs) data from GTEx v8 and identify tissue-specific factors and their contributing eGenes. We further investigate its performance on the large-scale perturbation data and find that SuSiE PCA identifies modules with a higher enrichment of ribosome-related genes than sparse PCA (false discovery rate [FDR] = 9.2×10⁻⁸² vs. 1.4×10⁻³³), while being ~18x faster. Overall, SuSiE PCA provides an efficient tool to identify relevant features in high-dimensional biological data.

DOI

10.1016/j.isci.2023.108181

SUSIE-RSS

Tool

PUBMED_LINK

35853082

FULL NAME

sum of single effects regression with summary statistics

DESCRIPTION

The susieR package implements a simple new way to perform variable selection in multiple regression (y = Xb + e). The methods implemented here are particularly well-suited to settings where some of the X variables are highly correlated, and the true effects are highly sparse (e.g. <20 non-zero effects in the vector b). One example of this is genetic fine-mapping applications, and this application was a major motivation for developing these methods.

URL

https://stephenslab.github.io/susieR/index.html

KEYWORDS

fine-mapping, summary statistics

TITLE

Fine-mapping from summary data with the "Sum of Single Effects" model.

Main citation

Zou Y, Carbonetto P, Wang G, Stephens M. (2022) Fine-mapping from summary data with the "Sum of Single Effects" model. PLoS Genet, 18 (7) e1010299. doi:10.1371/journal.pgen.1010299. PMID 35853082

ABSTRACT

In recent work, Wang et al introduced the "Sum of Single Effects" (SuSiE) model, and showed that it provides a simple and efficient approach to fine-mapping genetic variants from individual-level data. Here we present new methods for fitting the SuSiE model to summary data, for example to single-SNP z-scores from an association study and linkage disequilibrium (LD) values estimated from a suitable reference panel. To develop these new methods, we first describe a simple, generic strategy for extending any individual-level data method to deal with summary data. The key idea is to replace the usual regression likelihood with an analogous likelihood based on summary data. We show that existing fine-mapping methods such as FINEMAP and CAVIAR also (implicitly) use this strategy, but in different ways, and so this provides a common framework for understanding different methods for fine-mapping. We investigate other common practical issues in fine-mapping with summary data, including problems caused by inconsistencies between the z-scores and LD estimates, and we develop diagnostics to identify these inconsistencies. We also present a new refinement procedure that improves model fits in some data sets, and hence improves overall reliability of the SuSiE fine-mapping results. Detailed evaluations of fine-mapping methods in a range of simulated data sets show that SuSiE applied to summary data is competitive, in both speed and accuracy, with the best available fine-mapping methods for summary data.
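
The "replace the likelihood" idea can be sketched for the simplest case, where the summary data are z-scores `z` and an LD matrix `R` with unit diagonal, and the summary-data likelihood is taken to be z ~ N(Rb, R). The code below is an illustrative NumPy sketch under those assumptions, not the susieR `susie_rss` implementation: the function name `ibss_from_summary` is hypothetical, the prior variance `prior_var` (on the z-score scale) is fixed, the prior over variables is uniform, and the diagnostics and refinement step mentioned in the abstract are omitted.

```python
import numpy as np

def ibss_from_summary(z, R, L=5, prior_var=50.0, n_iter=50):
    """Fit L single effects using only z-scores and an LD matrix."""
    z = np.asarray(z, dtype=float)
    R = np.asarray(R, dtype=float)
    p = z.shape[0]
    alpha = np.full((L, p), 1.0 / p)           # selection probabilities per effect
    mu = np.zeros((L, p))                      # posterior mean effect given selection
    shrink = prior_var / (prior_var + 1.0)     # posterior shrinkage when s^2 = 1
    for _ in range(n_iter):
        for l in range(L):
            # R @ b plays the role of X'Xb in the individual-level algorithm.
            b_others = (alpha * mu).sum(axis=0) - alpha[l] * mu[l]
            z_res = z - R @ b_others
            # Log Bayes factor per variable, with unit sampling variance.
            log_bf = -0.5 * np.log(1.0 + prior_var) + 0.5 * z_res**2 * shrink
            w = np.exp(log_bf - log_bf.max())
            alpha[l] = w / w.sum()
            mu[l] = shrink * z_res
    # Posterior inclusion probability across the L effects.
    return 1.0 - np.prod(1.0 - alpha, axis=0)
```

The only change relative to an individual-level fit is that residuals are formed from `z` and `R` instead of from the genotype matrix and phenotype, which is why inconsistencies between the z-scores and the LD estimates can distort the result.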

DOI

10.1371/journal.pgen.1010299

SUSIEx

Tool

DESCRIPTION

SuSiEx is a Python-based command-line tool that performs cross-ethnic fine-mapping using GWAS summary statistics and LD reference panels. The method is built on the Sum of Single Effects (SuSiE) model.

URL

https://github.com/getian107/SuSiEx

KEYWORDS

cross-ancestry, fine-mapping

Main citation

Yuan K, Longchamps RJ, Pardiñas AF, Yu M, Chen TT, Lin SC, ...&, Schizophrenia Workgroup of Psychiatric Genomics Consortium. (2023) Fine-mapping across diverse ancestries drives the discovery of putative causal variants underlying human complex traits and diseases. medRxiv.

Swave

Tool

PUBMED_LINK

41807798

DESCRIPTION

Swave calls and genotypes structural variants from assembly-based pangenome graphs by encoding mapping patterns as images (“projection waves”) and classifying signals with a recurrent neural network, including complex and repetitive SVs for population-level characterization.

URL

https://github.com/songbowang125/Swave, https://github.com/songbowang125/Swave-Utils

KEYWORDS

structural variant, pangenome graph, deep learning, RNN, population genomics

TITLE

Population-level structural variant characterization using pangenome graphs.

Main citation

Wang S, Xu T, Zhang P, Ye K. (2026) Population-level structural variant characterization using pangenome graphs. Nat Genet, 58 (3) 664-672. doi:10.1038/s41588-026-02538-6. PMID 41807798

ABSTRACT

Population-level structural variant (SV) profiling is crucial in the era of pangenomes. However, identifying SVs from genome assemblies and pangenome graphs remains a substantial challenge. Here we present Swave, a sequence-to-image, deep learning-based method that accurately resolves both simple and complex SVs, along with their population characteristics, from assembly-derived pangenome graphs. Swave introduces 'projection waves' to summarize the dotplot images that capture mapping patterns between reference and SV-indicating alleles in the pangenome. Then, a recurrent neural network distinguishes true SV signals from background noise introduced by genomic repeats. Swave demonstrates superior performance in both SV-type classification and genotyping compared with existing methods. When applied to healthy cohorts and rare-disease cohorts, Swave reveals complex and polymorphic SV patterns across human populations and identifies potentially pathogenic SVs. These advancements will facilitate the creation of comprehensive population-level SV catalogs, deepening our understanding of SVs in genetic diversity and disease associations.

DOI

10.1038/s41588-026-02538-6

t-SNE

Tool

FULL NAME

t-Distributed Stochastic Neighbor Embedding

URL

https://lvdmaaten.github.io/tsne/

KEYWORDS

t-SNE

TATES

Tool

PUBMED_LINK

23359524

FULL NAME

Trait-based Association Test that uses Extended Simes procedure

TITLE

TATES: efficient multivariate genotype-phenotype analysis for genome-wide association studies.

Main citation

van der Sluis S, Posthuma D, Dolan CV. (2013) TATES: efficient multivariate genotype-phenotype analysis for genome-wide association studies. PLoS Genet, 9 (1) e1003235. doi:10.1371/journal.pgen.1003235. PMID 23359524

ABSTRACT

To date, the genome-wide association study (GWAS) is the primary tool to identify genetic variants that cause phenotypic variation. As GWAS analyses are generally univariate in nature, multivariate phenotypic information is usually reduced to a single composite score. This practice often results in loss of statistical power to detect causal variants. Multivariate genotype-phenotype methods do exist but attain maximal power only in special circumstances. Here, we present a new multivariate method that we refer to as TATES (Trait-based Association Test that uses Extended Simes procedure), inspired by the GATES procedure proposed by Li et al (2011). For each component of a multivariate trait, TATES combines p-values obtained in standard univariate GWAS to acquire one trait-based p-value, while correcting for correlations between components. Extensive simulations, probing a wide variety of genotype-phenotype models, show that TATES's false positive rate is correct, and that TATES's statistical power to detect causal variants explaining 0.5% of the variance can be 2.5-9 times higher than the power of univariate tests based on composite scores and 1.5-2 times higher than the power of the standard MANOVA. Unlike other multivariate methods, TATES detects both genetic variants that are common to multiple phenotypes and genetic variants that are specific to a single phenotype, i.e. TATES provides a more complete view of the genetic architecture of complex traits. As the actual causal genotype-phenotype model is usually unknown and probably phenotypically and genetically complex, TATES, available as an open source program, constitutes a powerful new multivariate strategy that allows researchers to identify novel causal variants, while the complexity of traits is no longer a limiting factor.
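
The combination step can be sketched as an extended Simes procedure over the univariate p-values for one variant. This is an illustration under stated assumptions, not the TATES program: the function names are hypothetical, the effective number of tests uses an eigenvalue-based estimate, and the correlation matrix between p-values (`corr`) is taken as given, whereas the published method derives it from the correlations between phenotypes.

```python
import numpy as np

def effective_number(corr):
    """Eigenvalue-based effective number of independent p-values."""
    eigvals = np.linalg.eigvalsh(corr)
    # Each eigenvalue above 1 signals redundancy among correlated tests.
    return corr.shape[0] - np.sum(np.where(eigvals > 1.0, eigvals - 1.0, 0.0))

def trait_based_pvalue(pvals, corr):
    """Combine per-phenotype p-values for one variant into one trait-based p-value."""
    pvals = np.asarray(pvals, dtype=float)
    corr = np.asarray(corr, dtype=float)
    order = np.argsort(pvals)                  # ascending: smallest p-value first
    sorted_p = pvals[order]
    m_e = effective_number(corr)               # effective number among all p-values
    candidates = []
    for j in range(1, len(pvals) + 1):
        top = order[:j]
        # Effective number among the j smallest p-values.
        m_ej = effective_number(corr[np.ix_(top, top)])
        candidates.append(m_e * sorted_p[j - 1] / m_ej)
    return min(1.0, min(candidates))
```

With uncorrelated p-values (`corr` equal to the identity) this reduces to the standard Simes test; with perfectly correlated p-values it returns the smallest p-value without any multiplicity penalty.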

DOI

10.1371/journal.pgen.1003235

TCSC

Tool

PUBMED_LINK

37580597

FULL NAME

Tissue co-regulation score regression

DESCRIPTION

TCSC is a statistical genetics method to identify causal tissues in diseases and complex traits. We leverage TWAS and GWAS summary statistics while explicitly modeling the genetic co-regulation of genes across tissues.

URL

https://github.com/TiffanyAmariuta/TCSC/

TITLE

Modeling tissue co-regulation estimates tissue-specific contributions to disease.

Main citation

Amariuta T, Siewert-Rocks K, Price AL. (2023) Modeling tissue co-regulation estimates tissue-specific contributions to disease. Nat Genet, 55 (9) 1503-1511. doi:10.1038/s41588-023-01474-z. PMID 37580597

ABSTRACT

Integrative analyses of genome-wide association studies and gene expression data have implicated many disease-critical tissues. However, co-regulation of genetic effects on gene expression across tissues impedes distinguishing biologically causal tissues from tagging tissues. In the present study, we introduce tissue co-regulation score regression (TCSC), which disentangles causal tissues from tagging tissues by regressing gene-disease association statistics (from transcriptome-wide association studies) on tissue co-regulation scores, reflecting correlations of predicted gene expression across genes and tissues. We applied TCSC to 78 diseases/traits (average n = 302,000) and gene expression prediction models for 48 GTEx tissues. TCSC identified 21 causal tissue-trait pairs at a 5% false discovery rate (FDR), including well-established findings, biologically plausible new findings (for example, aorta artery and glaucoma) and increased specificity of known tissue-trait associations (for example, subcutaneous adipose, but not visceral adipose, and high-density lipoprotein). TCSC also identified 17 causal tissue-trait covariance pairs at 5% FDR. In conclusion, TCSC is a precise method for distinguishing causal tissues from tagging tissues.

DOI

10.1038/s41588-023-01474-z

tensorQTL

Tool

PUBMED_LINK

31675989

DESCRIPTION

tensorQTL is a GPU-enabled QTL mapper, achieving ~200-300 fold faster cis- and trans-QTL mapping compared to CPU-based implementations.

URL

https://github.com/broadinstitute/tensorqtl

TITLE

Scaling computational genomics to millions of individuals with GPUs.

Main citation

Taylor-Weiner A, Aguet F, Haradhvala NJ, Gosai S, ...&, Getz G. (2019) Scaling computational genomics to millions of individuals with GPUs. Genome Biol, 20 (1) 228. doi:10.1186/s13059-019-1836-7. PMID 31675989

ABSTRACT

Current genomics methods are designed to handle tens to thousands of samples but will need to scale to millions to match the pace of data and hypothesis generation in biomedical science. Here, we show that high efficiency at low cost can be achieved by leveraging general-purpose libraries for computing using graphics processing units (GPUs), such as PyTorch and TensorFlow. We demonstrate > 200-fold decreases in runtime and ~ 5-10-fold reductions in cost relative to CPUs. We anticipate that the accessibility of these libraries will lead to a widespread adoption of GPUs in computational genomics.

DOI

10.1186/s13059-019-1836-7

TetraHer

Tool

PUBMED_LINK

38490208

DESCRIPTION

A method for estimating the liability heritability of binary phenotypes.

URL