Tool
Catalog entries using this tag (links open the entry card on its page):
- 1000 Genomes Phase 3 Version 5 (1KGp3v5) — GWAS Tools
- 1KG SV imputation panel — GWAS Tools
- 1KG+7K — GWAS Tools
- ADMIXTURE — GWAS Tools
- ALL-Sum — GWAS Tools
- aMAT — GWAS Tools
- ANNOVAR — GWAS Tools
- Arbores — GWAS Tools
- ARG-Needle — GWAS Tools
- ARGinfer — GWAS Tools
- ARGweaver — GWAS Tools
- ARGweaver-D — GWAS Tools
- ASMC — GWAS Tools
- ASSET — GWAS Tools
- BayesR — GWAS Tools
- BayesRR-RC — GWAS Tools
- BayesS — GWAS Tools
- BEAGLE — GWAS Tools
- BEAGLE4 — GWAS Tools
- BEAGLE5.4 (Imputation) — GWAS Tools
- BEAGLE5.4 (Phasing) — GWAS Tools
- BEATRICE — GWAS Tools
- Benchmark-Wang — GWAS Tools
- BOLT-LMM — GWAS Tools
- BridgePRS — GWAS Tools
- CAFEH — GWAS Tools
- CalPred — GWAS Tools
- Cancer PRSweb — GWAS Tools
- CaTS power calculator — GWAS Tools
- CAVIAR — GWAS Tools
- CAVIARBF — GWAS Tools
- CC-GWAS — GWAS Tools
- cellAdmix — GWAS Tools
- ChinaMAP — GWAS Tools
- ChinaMAP panel — GWAS Tools
- ChromoMap — GWAS Tools
- CKB reference panel — GWAS Tools
- Cmplot — GWAS Tools
- CMS — GWAS Tools
- CNGB Imputation Service — GWAS Tools
- CoCoNet — GWAS Tools
- Coloc — GWAS Tools
- Coloc-susie — GWAS Tools
- Comparison — GWAS Tools
- Concepts&Principals — GWAS Tools
- CookHLA — GWAS Tools
- CoPheScan — GWAS Tools
- corrplot — GWAS Tools
- COWAS — GWAS Tools
- cross-trait LDSC — GWAS Tools
- cS2G — GWAS Tools
- CT-SLEB — GWAS Tools
- cTWAS — GWAS Tools
- Ctyper — GWAS Tools
- DBSLMM — GWAS Tools
- DDx-PRS — GWAS Tools
- DEEP*HLA — GWAS Tools
- DEPICT — GWAS Tools
- DOG — GWAS Tools
- DRUG TARGETOR — GWAS Tools
- EAGLE — GWAS Tools
- EAGLE2 — GWAS Tools
- eCAVIAR — GWAS Tools
- EHH — GWAS Tools
- EIGENSTRAT — GWAS Tools
- Ellis CA — GWAS Tools
- EMMAX — GWAS Tools
- EPIC — GWAS Tools
- ExPRSweb — GWAS Tools
- f — GWAS Tools
- FactorGO — GWAS Tools
- fastASSET — GWAS Tools
- fastGWA — GWAS Tools
- fastGWA-GLMM — GWAS Tools
- fastPHASE — GWAS Tools
- FastQTL — GWAS Tools
- FINEMAP — GWAS Tools
- flashfmZero — GWAS Tools
- Four-digit Multi-ethnic HLA v1 (2021) — GWAS Tools
- Four-digit Multi-ethnic HLA v2 (2022) — GWAS Tools
- FUMA — GWAS Tools
- FUSION — GWAS Tools
- G2P — GWAS Tools
- Galesloot — GWAS Tools
- GAS Power Calculator — GWAS Tools
- GATE — GWAS Tools
- GCTA — GWAS Tools
- GCTA — GWAS Tools
- GCTA-GREML-Binary — GWAS Tools
- GCTA-GREML-Bivariate — GWAS Tools
- GCTA-GREML-LDMS — GWAS Tools
- GCTA-GREML-Partition — GWAS Tools
- GCTA-GREML-Quantitative — GWAS Tools
- GEMMA — GWAS Tools
- GenoBoost — GWAS Tools
- GenomeAsia 100K — GWAS Tools
- Genomic-SEM — GWAS Tools
- GLEANR — GWAS Tools
- GLIMPSE — GWAS Tools
- GMRM — GWAS Tools
- GNOVA — GWAS Tools
- GPLEMMA — GWAS Tools
- GREP — GWAS Tools
- GRPa-PRS — GWAS Tools
- gsMap — GWAS Tools
- Guideline-Namba — GWAS Tools
- GWAMA — GWAS Tools
- gwas diversity monitor — GWAS Tools
- GWAS SVatalog — GWAS Tools
- GWAS-by-Subtraction — GWAS Tools
- GWASLab — GWAS Tools
- gwaslab — GWAS Tools
- GWAX — GWAS Tools
- GWFM — GWAS Tools
- Hail — GWAS Tools
- Han-MHC — GWAS Tools
- HAPGEN2 — GWAS Tools
- haploview — GWAS Tools
- HDL — GWAS Tools
- HDL-L — GWAS Tools
- HEELS — GWAS Tools
- HESS — GWAS Tools
- HGDP+1kGP — GWAS Tools
- HIBAG — GWAS Tools
- HIPO — GWAS Tools
- HLA-TAPAS — GWAS Tools
- HLARIMNT — GWAS Tools
- HRC — GWAS Tools
- HWE — GWAS Tools
- HyPrColoc — GWAS Tools
- IBS — GWAS Tools
- iHS — GWAS Tools
- IMPUTE — GWAS Tools
- IMPUTE2 — GWAS Tools
- IMPUTE4 — GWAS Tools
- IMPUTE5 — GWAS Tools
- JAM — GWAS Tools
- JASS — GWAS Tools
- JointPRS — GWAS Tools
- karyoploteR — GWAS Tools
- KwARG — GWAS Tools
- lassosum — GWAS Tools
- lassosum2 — GWAS Tools
- LAVA — GWAS Tools
- LCP-GWAS — GWAS Tools
- LDAK — GWAS Tools
- LDAK-GBAT — GWAS Tools
- LDAK-KVIK — GWAS Tools
- LDlink — GWAS Tools
- LDlinkR — GWAS Tools
- LDpred — GWAS Tools
- LDpred-funct — GWAS Tools
- LDpred2 — GWAS Tools
- LDpred2-auto — GWAS Tools
- LDSC — GWAS Tools
- LDSC — GWAS Tools
- LDSC-SEG — GWAS Tools
- LDSTORE2 — GWAS Tools
- LeafCutter — GWAS Tools
- LEMMA — GWAS Tools
- Locityper — GWAS Tools
- locuszoom — GWAS Tools
- loftee — GWAS Tools
- Logica — GWAS Tools
- LT-FH — GWAS Tools
- MACH / minimach — GWAS Tools
- MACH / minimach pre-phasing — GWAS Tools
- MACH / minimach2 — GWAS Tools
- MACH / minimach3 — GWAS Tools
- MACH / minimach4 — GWAS Tools
- MAGMA — GWAS Tools
- MAGMA — GWAS Tools
- MANOVA — GWAS Tools
- MANTRA — GWAS Tools
- MatrixEQTL — GWAS Tools
- MegaPRS — GWAS Tools
- MENTR — GWAS Tools
- MESuSiE — GWAS Tools
- meta-PRS — GWAS Tools
- Meta-SAIGE — GWAS Tools
- metabolites PRS atlas — GWAS Tools
- metaCCA — GWAS Tools
- metafor — GWAS Tools
- METAL — GWAS Tools
- MetaSKAT — GWAS Tools
- MetaSTAAR — GWAS Tools
- metaUSAT/metaMANOVA — GWAS Tools
- MetaXcan — GWAS Tools
- Michigan Imputation Server — GWAS Tools
- MiXeR — GWAS Tools
- mJAM — GWAS Tools
- MOSTest — GWAS Tools
- MR-MEGA — GWAS Tools
- MR-MEGA — GWAS Tools
- mRnd — GWAS Tools
- ms — GWAS Tools
- MsCAVIAR — GWAS Tools
- MSMC — GWAS Tools
- MTAG — GWAS Tools
- Multi-PGS — GWAS Tools
- MultiBLUP — GWAS Tools
- MultiPhen — GWAS Tools
- MultiSTAAR — GWAS Tools
- MultiSuSiE — GWAS Tools
- MultiXcan — GWAS Tools
- MungeSumstats — GWAS Tools
- MuPIT — GWAS Tools
- MV-PLINK (MQFAM) — GWAS Tools
- mvGWAMA — GWAS Tools
- mvSuSiE — GWAS Tools
- NARD — GWAS Tools
- NARD2 — GWAS Tools
- Nyuwa Genome Resource Phase 1 — GWAS Tools
- NyuWa Imputation Server — GWAS Tools
- Olink — GWAS Tools
- OmiGA — GWAS Tools
- Open Targets — GWAS Tools
- Open Targets Genetics — GWAS Tools
- OpenADMIXTURE — GWAS Tools
- OTTERS — GWAS Tools
- PAINTOR — GWAS Tools
- PanMAN — GWAS Tools
- PASCAL — GWAS Tools
- PCHAT — GWAS Tools
- PennPRS — GWAS Tools
- PES — GWAS Tools
- pgBoost — GWAS Tools
- PGG.Han — GWAS Tools
- PGG.Han panel — GWAS Tools
- PGS-adjusted GWAS — GWAS Tools
- PGS-adjusted RVATs — GWAS Tools
- PGS-hub — GWAS Tools
- pgsc_calc — GWAS Tools
- PGSCatalog — GWAS Tools
- PGSFusion — GWAS Tools
- pheweb — GWAS Tools
- PHLASH — GWAS Tools
- PLINK — GWAS Tools
- PLINK-MDS — GWAS Tools
- PLINK1.9 — GWAS Tools
- PLINK2 — GWAS Tools
- PLINK2 — GWAS Tools
- PLINK2 — GWAS Tools
- POLMM — GWAS Tools
- POP-GWAS — GWAS Tools
- popcorn — GWAS Tools
- popEVE — GWAS Tools
- popgen — GWAS Tools
- PoPs — GWAS Tools
- Porter — GWAS Tools
- PP-GWAS — GWAS Tools
- PredInterval — GWAS Tools
- PrediXcan — GWAS Tools
- Priority index — GWAS Tools
- PROSPER — GWAS Tools
- Protter — GWAS Tools
- PRS atlas — GWAS Tools
- PRS credible intervals — GWAS Tools
- PRS-CS — GWAS Tools
- PRS-CSx — GWAS Tools
- PRS-FH — GWAS Tools
- PRS-RS — GWAS Tools
- PRS_to_Abs — GWAS Tools
- PRSet — GWAS Tools
- PRSice-2 — GWAS Tools
- PRSMix_AOI — GWAS Tools
- PRStuning — GWAS Tools
- PS4DR — GWAS Tools
- PSMC — GWAS Tools
- PTWAS — GWAS Tools
- PUMA-CUBS — GWAS Tools
- QCTOOL v2 — GWAS Tools
- QRGWAS — GWAS Tools
- Quickdraws — GWAS Tools
- QUILT1 — GWAS Tools
- QUILT2 — GWAS Tools
- RareMETAL — GWAS Tools
- RASQUAL — GWAS Tools
- Relate — GWAS Tools
- REMETA — GWAS Tools
- RENT+ — GWAS Tools
- RESHAPE — GWAS Tools
- Review — GWAS Tools
- Review-Das — GWAS Tools
- Review-Fst — GWAS Tools
- Review-Kachuri — GWAS Tools
- Review-Lappalainen — GWAS Tools
- Review-Li — GWAS Tools
- Review-Marchini — GWAS Tools
- Review-Peter — GWAS Tools
- Review-Povysil — GWAS Tools
- Review-Wang — GWAS Tools
- reviews — GWAS Tools
- Reviews — GWAS Tools
- Reviews&Tutorials — GWAS Tools
- RFR SuSiE-inf FINEMAP-inf — GWAS Tools
- RolyPoly — GWAS Tools
- rtPRS-CS — GWAS Tools
- RWAS — GWAS Tools
- S-LDXR — GWAS Tools
- S-PrediXcan — GWAS Tools
- SAIGE — GWAS Tools
- SAIGE-GENE — GWAS Tools
- SAIGE-QTL — GWAS Tools
- Sakaue — GWAS Tools
- Salinas — GWAS Tools
- Sanger — GWAS Tools
- SARGE — GWAS Tools
- SBayesR — GWAS Tools
- SBayesRC — GWAS Tools
- SBayesS — GWAS Tools
- sc-linker — GWAS Tools
- SCARlink — GWAS Tools
- SCAVENGE — GWAS Tools
- scDRS — GWAS Tools
- SCENT — GWAS Tools
- scGWAS — GWAS Tools
- scPRS — GWAS Tools
- scTWAS — GWAS Tools
- SDPR — GWAS Tools
- SDPRX — GWAS Tools
- SDS — GWAS Tools
- SECRET-GWAS — GWAS Tools
- seismic — GWAS Tools
- SHAPEIT1 — GWAS Tools
- SHAPEIT2 — GWAS Tools
- SHAPEIT3 — GWAS Tools
- SHAPEIT4 — GWAS Tools
- SHAPEIT5 — GWAS Tools
- shaPRS — GWAS Tools
- SiblingGWAS — GWAS Tools
- sim1000G — GWAS Tools
- SIMER — GWAS Tools
- simGWAS — GWAS Tools
- SINGER — GWAS Tools
- SKAT — GWAS Tools
- SKAT-O — GWAS Tools
- SMMAT — GWAS Tools
- SMR — GWAS Tools
- SMR-multi — GWAS Tools
- snipar — GWAS Tools
- snipar-unified estimator — GWAS Tools
- SNP2HLA — GWAS Tools
- SnpEff — GWAS Tools
- SOMAmer — GWAS Tools
- South and East Asian Reference Database (SEAD) — GWAS Tools
- SPAGRM — GWAS Tools
- SPAGxECCT — GWAS Tools
- SparsePro — GWAS Tools
- sQTLseekeR — GWAS Tools
- STAAR — GWAS Tools
- STAARpipeline — GWAS Tools
- Stephens — GWAS Tools
- StructLMM — GWAS Tools
- SumHer — GWAS Tools
- SUPERGNOVA — GWAS Tools
- SUSIE — GWAS Tools
- SuSiE PCA — GWAS Tools
- SUSIE-RSS — GWAS Tools
- SUSIEx — GWAS Tools
- Swave — GWAS Tools
- t-SNE — GWAS Tools
- TATES — GWAS Tools
- TCSC — GWAS Tools
- tensorQTL — GWAS Tools
- TetraHer — GWAS Tools
- TGVIS — GWAS Tools
- THISTLE — GWAS Tools
- TL-PRS — GWAS Tools
- TOPMED — GWAS Tools
- TrajGWAS — GWAS Tools
- Trans-Phar — GWAS Tools
- TReCASE — GWAS Tools
- TreeMix — GWAS Tools
- tsdate — GWAS Tools
- tsinfer — GWAS Tools
- Tutorial-Choi — GWAS Tools
- TWAS hub — GWAS Tools
- twas_sim — GWAS Tools
- Two-sample MR — GWAS Tools
- UMAP — GWAS Tools
- UpSetPlot — GWAS Tools
- VEGAS2 — GWAS Tools
- VEP — GWAS Tools
- VIPRS — GWAS Tools
- WBBC panel — GWAS Tools
- webTWAS — GWAS Tools
- Westlake Imputation Server — GWAS Tools
- winnerscurse — GWAS Tools
- wMT-SBLUP — GWAS Tools
- WtCoxG — GWAS Tools
- XP-EHH — GWAS Tools
- Yang — GWAS Tools
Entries
1000 Genomes Phase 3 Version 5 (1KGp3v5) (1KG)
TITLE
A global reference for human genetic variation.
Main citation
1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, ...&, Abecasis GR. (2015) A global reference for human genetic variation. Nature, 526 (7571) 68-74. doi:10.1038/nature15393. PMID 26432245
ABSTRACT
The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.
DOI
10.1038/nature15393
1KG SV imputation panel (1KG SV)
KEYWORDS
structural variants, long-read
PREPRINT_DOI
10.1101/2023.12.20.23300308
Main citation
Noyvert, B., Erzurumluoglu, A. M., Drichel, D., Omland, S., Andlauer, T. F., Mueller, S., ... & Ding, Z. (2023). Imputation of structural variants using a multi-ancestry long-read sequencing panel enables identification of disease associations. medRxiv, 2023-12.
1KG+7K
KEYWORDS
Japanese population-specific reference panel
PREPRINT_DOI
10.21203/rs.3.rs-3194976/v1
ADMIXTURE
DESCRIPTION
Alexander, D. H., Novembre, J., & Lange, K. (2009). Fast model-based estimation of ancestry in unrelated individuals. Genome research, 19(9), 1655-1664.
USE
ADMIXTURE is a software tool for maximum likelihood estimation of individual ancestries from multilocus SNP genotype datasets. It uses the same statistical model as STRUCTURE but calculates estimates much more rapidly using a fast numerical optimization algorithm.
TITLE
Fast model-based estimation of ancestry in unrelated individuals.
Main citation
Alexander DH, Novembre J, Lange K. (2009) Fast model-based estimation of ancestry in unrelated individuals. Genome Res, 19 (9) 1655-64. doi:10.1101/gr.094052.109. PMID 19648217
ABSTRACT
Population stratification has long been recognized as a confounding factor in genetic association studies. Estimated ancestries, derived from multi-locus genotype data, can be used to perform a statistical correction for population stratification. One popular technique for estimation of ancestry is the model-based approach embodied by the widely applied program structure. Another approach, implemented in the program EIGENSTRAT, relies on Principal Component Analysis rather than model-based estimation and does not directly deliver admixture fractions. EIGENSTRAT has gained in popularity in part owing to its remarkable speed in comparison to structure. We present a new algorithm and a program, ADMIXTURE, for model-based estimation of ancestry in unrelated individuals. ADMIXTURE adopts the likelihood model embedded in structure. However, ADMIXTURE runs considerably faster, solving problems in minutes that take structure hours. In many of our experiments, we have found that ADMIXTURE is almost as fast as EIGENSTRAT. The runtime improvements of ADMIXTURE rely on a fast block relaxation scheme using sequential quadratic programming for block updates, coupled with a novel quasi-Newton acceleration of convergence. Our algorithm also runs faster and with greater accuracy than the implementation of an Expectation-Maximization (EM) algorithm incorporated in the program FRAPPE. Our simulations show that ADMIXTURE's maximum likelihood estimates of the underlying admixture coefficients and ancestral allele frequencies are as accurate as structure's Bayesian estimates. On real-world data sets, ADMIXTURE's estimates are directly comparable to those from structure and EIGENSTRAT. Taken together, our results show that ADMIXTURE's computational speed opens up the possibility of using a much larger set of markers in model-based ancestry estimation and that its estimates are suitable for use in correcting for population stratification in association studies.
DOI
10.1101/gr.094052.109
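The abstract above says ADMIXTURE adopts the likelihood model of structure. As a minimal sketch, assuming the usual binomial form of that model (each genotype counts copies of one allele, and an individual's allele frequency at a SNP is the ancestry-weighted mix of population frequencies), the log-likelihood can be evaluated as below. The function name, the `Q`/`F` layout, and the clamping constant are illustrative choices, not ADMIXTURE's own code, and the block relaxation and quasi-Newton optimization steps are not shown.

```python
import math

def admixture_log_likelihood(genotypes, Q, F, eps=1e-12):
    """Log-likelihood of genotypes under a binomial admixture model.

    genotypes[i][j]: allele count (0, 1, 2) for individual i at SNP j, or None if missing.
    Q[i][k]: ancestry fraction of individual i from population k (rows sum to 1).
    F[k][j]: frequency of the counted allele at SNP j in population k.
    eps is a hypothetical clamp that keeps log() away from 0.
    """
    n_pops = len(F)
    total = 0.0
    for i, row in enumerate(genotypes):
        for j, g in enumerate(row):
            if g is None:  # skip missing genotypes
                continue
            # Individual-specific allele frequency: mixture over populations.
            p = sum(Q[i][k] * F[k][j] for k in range(n_pops))
            p = min(max(p, eps), 1.0 - eps)
            total += g * math.log(p) + (2 - g) * math.log(1.0 - p)
    return total

# Toy data: individual 0 is unadmixed, individual 1 is a 50/50 mix.
genotypes = [[2, 0], [1, 1]]
Q = [[1.0, 0.0], [0.5, 0.5]]
F = [[0.9, 0.1], [0.1, 0.9]]
log_lik = admixture_log_likelihood(genotypes, Q, F)
```

A maximum likelihood fit would search over `Q` and `F` to make this value as large as possible; the sketch only scores one candidate pair.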
ALL-Sum
FULL NAME
Aggregated L0Learn using Summary-level data
DESCRIPTION
ALL-Sum leverages an L0L2 penalized regression and ensemble learning across tuning parameters to flexibly model traits with diverse genetic architectures.
KEYWORDS
ensemble learning
TITLE
Fast and scalable ensemble learning method for versatile polygenic risk prediction.
Main citation
Chen T, Zhang H, Mazumder R, Lin X. (2024) Fast and scalable ensemble learning method for versatile polygenic risk prediction. Proc Natl Acad Sci U S A, 121 (33) e2403210121. doi:10.1073/pnas.2403210121. PMID 39110727
ABSTRACT
Polygenic risk scores (PRS) enhance population risk stratification and advance personalized medicine, but existing methods face several limitations, encompassing issues related to computational burden, predictive accuracy, and adaptability to a wide range of genetic architectures. To address these issues, we propose Aggregated L0Learn using Summary-level data (ALL-Sum), a fast and scalable ensemble learning method for computing PRS using summary statistics from genome-wide association studies (GWAS). ALL-Sum leverages an L0L2 penalized regression and ensemble learning across tuning parameters to flexibly model traits with diverse genetic architectures. In extensive large-scale simulations across a wide range of polygenicity and GWAS sample sizes, ALL-Sum consistently outperformed popular alternative methods in terms of prediction accuracy, runtime, and memory usage by 10%, 20-fold, and threefold, respectively, and demonstrated robustness to diverse genetic architectures. We validated the performance of ALL-Sum in real data analysis of 11 complex traits using GWAS summary statistics from nine data sources, including the Global Lipids Genetics Consortium, Breast Cancer Association Consortium, and FinnGen Biobank, with validation in the UK Biobank. Our results show that, on average, ALL-Sum obtained PRS with 25% higher accuracy, with 15 times faster computation and half the memory than the current state-of-the-art methods, and had robust performance across a wide range of traits and diseases. Furthermore, our method demonstrates stable prediction when using linkage disequilibrium computed from different data sources. ALL-Sum is available as a user-friendly R software package with publicly available reference data for streamlined analysis.
DOI
10.1073/pnas.2403210121
aMAT
FULL NAME
adaptive multi-trait association test
TITLE
Multi-trait Genome-Wide Analyses of the Brain Imaging Phenotypes in UK Biobank.
Main citation
Wu C. (2020) Multi-trait Genome-Wide Analyses of the Brain Imaging Phenotypes in UK Biobank. Genetics, 215 (4) 947-958. doi:10.1534/genetics.120.303242. PMID 32540950
ABSTRACT
Many genetic variants identified in genome-wide association studies (GWAS) are associated with multiple, sometimes seemingly unrelated, traits. This motivates multi-trait association analyses, which have successfully identified novel associated loci for many complex diseases. While appealing, most existing methods focus on analyzing a relatively small number of traits, and may yield inflated Type 1 error rates when a large number of traits need to be analyzed jointly. As deep phenotyping data are becoming rapidly available, we develop a novel method, referred to as aMAT (adaptive multi-trait association test), for multi-trait analysis of any number of traits. We applied aMAT to GWAS summary statistics for a set of 58 volumetric imaging derived phenotypes from the UK Biobank. aMAT had a genomic inflation factor of 1.04, indicating the Type 1 error rate was well controlled. More important, aMAT identified 24 distinct risk loci, 13 of which were ignored by standard GWAS. In comparison, the competing methods either had a suspicious genomic inflation factor or identified much fewer risk loci. Finally, four additional sets of traits have been analyzed and provided similar conclusions.
DOI
10.1534/genetics.120.303242
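The aMAT abstract uses the genomic inflation factor (reported as 1.04) as evidence of Type 1 error control. As a minimal sketch of how that statistic is commonly computed, assuming 1-degree-of-freedom chi-square tests, the code below converts p values to chi-square statistics and divides their median by the median expected under the null. The function name is illustrative, and this is not the aMAT implementation.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()

def genomic_inflation_factor(p_values):
    """Median observed chi-square (1 df) divided by its expected null median.

    Each p value must lie in (0, 1]. A two-sided p value p corresponds to
    z = Phi^{-1}(p / 2), and z**2 is the matching 1-df chi-square statistic.
    """
    chi2 = [_STD_NORMAL.inv_cdf(p / 2.0) ** 2 for p in p_values]
    # Null median of a 1-df chi-square: the statistic at p = 0.5 (about 0.455).
    null_median = _STD_NORMAL.inv_cdf(0.25) ** 2
    return median(chi2) / null_median

# Toy inputs: values centred on p = 0.5 give a factor of 1; small p values inflate it.
lambda_null = genomic_inflation_factor([0.1, 0.5, 0.9])
lambda_inflated = genomic_inflation_factor([0.01, 0.02, 0.03])
```

Values near 1 suggest the test statistics follow the null distribution for most variants; values well above 1 point to inflation.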
ANNOVAR
FULL NAME
Annotate Variation
DESCRIPTION
ANNOVAR is an efficient software tool that uses up-to-date information to functionally annotate genetic variants detected from diverse genomes (including human genome hg18, hg19, hg38, as well as mouse, worm, fly, yeast and many others).
TITLE
ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data.
Main citation
Wang K, Li M, Hakonarson H. (2010) ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res, 38 (16) e164. doi:10.1093/nar/gkq603. PMID 20601685
ABSTRACT
High-throughput sequencing platforms are generating massive amounts of genetic variation data for diverse genomes, but it remains a challenge to pinpoint a small subset of functionally important variants. To fill these unmet needs, we developed the ANNOVAR tool to annotate single nucleotide variants (SNVs) and insertions/deletions, such as examining their functional consequence on genes, inferring cytogenetic bands, reporting functional importance scores, finding variants in conserved regions, or identifying variants reported in the 1000 Genomes Project and dbSNP. ANNOVAR can utilize annotation databases from the UCSC Genome Browser or any annotation data set conforming to Generic Feature Format version 3 (GFF3). We also illustrate a 'variants reduction' protocol on 4.7 million SNVs and indels from a human genome, including two causal mutations for Miller syndrome, a rare recessive disease. Through a stepwise procedure, we excluded variants that are unlikely to be causal, and identified 20 candidate genes including the causal gene. Using a desktop computer, ANNOVAR requires ∼4 min to perform gene-based annotation and ∼15 min to perform variants reduction on 4.7 million variants, making it practical to handle hundreds of human genomes in a day. ANNOVAR is freely available at http://www.openbioinformatics.org/annovar/.
DOI
10.1093/nar/gkq603
Arbores
DESCRIPTION
Heine, K., Beskos, A., Jasra, A., Balding, D. & De Iorio, M. Bridging trees for posterior inference on ancestral recombination graphs. Proc. Math. Phys. Eng. Sci. 474, 20180568 (2018).
TITLE
Bridging trees for posterior inference on ancestral recombination graphs.
Main citation
Heine K, Beskos A, Jasra A, Balding D, ...&, De Iorio M. (2018) Bridging trees for posterior inference on ancestral recombination graphs. Proc Math Phys Eng Sci, 474 (2220) 20180568. doi:10.1098/rspa.2018.0568. PMID 30602937
ABSTRACT
We present a new Markov chain Monte Carlo algorithm, implemented in the software Arbores, for inferring the history of a sample of DNA sequences. Our principal innovation is a bridging procedure, previously applied only for simple stochastic processes, in which the local computations within a bridge can proceed independently of the rest of the DNA sequence, facilitating large-scale parallelization.
DOI
10.1098/rspa.2018.0568
ARG-Needle
DESCRIPTION
Biobank-scale inference of ancestral recombination graphs enables genealogical analysis of complex traits. B. C. Zhang, A. Biddanda, Á. F. Gunnarsson, F. Cooper, P. F. Palamara. Nature Genetics. May 2023.
TITLE
Biobank-scale inference of ancestral recombination graphs enables genealogical analysis of complex traits.
Main citation
Zhang BC, Biddanda A, Gunnarsson ÁF, Cooper F, ...&, Palamara PF. (2023) Biobank-scale inference of ancestral recombination graphs enables genealogical analysis of complex traits. Nat Genet, 55 (5) 768-776. doi:10.1038/s41588-023-01379-x. PMID 37127670
ABSTRACT
Genome-wide genealogies compactly represent the evolutionary history of a set of genomes and inferring them from genetic data has the potential to facilitate a wide range of analyses. We introduce a method, ARG-Needle, for accurately inferring biobank-scale genealogies from sequencing or genotyping array data, as well as strategies to utilize genealogies to perform association and other complex trait analyses. We use these methods to build genome-wide genealogies using genotyping data for 337,464 UK Biobank individuals and test for association across seven complex traits. Genealogy-based association detects more rare and ultra-rare signals (N = 134, frequency range 0.0007-0.1%) than genotype imputation using ~65,000 sequenced haplotypes (N = 64). In a subset of 138,039 exome sequencing samples, these associations strongly tag (average r = 0.72) underlying sequencing variants enriched (4.8×) for loss-of-function variation. These results demonstrate that inferred genome-wide genealogies may be leveraged in the analysis of complex traits, complementing approaches that require the availability of large, population-specific sequencing panels.
DOI
10.1038/s41588-023-01379-x
ARGinfer
DESCRIPTION
Mahmoudi, A., Koskela, J., Kelleher, J., Chan, Y.-B. & Balding, D. Bayesian inference of ancestral recombination graphs. PLoS Comput. Biol. 18, e1009960 (2022).
TITLE
Bayesian inference of ancestral recombination graphs.
Main citation
Mahmoudi A, Koskela J, Kelleher J, Chan YB, ...&, Balding D. (2022) Bayesian inference of ancestral recombination graphs. PLoS Comput Biol, 18 (3) e1009960. doi:10.1371/journal.pcbi.1009960. PMID 35263345
ABSTRACT
We present a novel algorithm, implemented in the software ARGinfer, for probabilistic inference of the Ancestral Recombination Graph under the Coalescent with Recombination. Our Markov Chain Monte Carlo algorithm takes advantage of the Succinct Tree Sequence data structure that has allowed great advances in simulation and point estimation, but not yet probabilistic inference. Unlike previous methods, which employ the Sequentially Markov Coalescent approximation, ARGinfer uses the Coalescent with Recombination, allowing more accurate inference of key evolutionary parameters. We show using simulations that ARGinfer can accurately estimate many properties of the evolutionary history of the sample, including the topology and branch lengths of the genealogical tree at each sequence site, and the times and locations of mutation and recombination events. ARGinfer approximates posterior probability distributions for these and other quantities, providing interpretable assessments of uncertainty that we show to be well calibrated. ARGinfer is currently limited to tens of DNA sequences of several hundreds of kilobases, but has scope for further computational improvements to increase its applicability.
DOI
10.1371/journal.pcbi.1009960
ARGweaver
DESCRIPTION
Rasmussen, M. D., Hubisz, M. J., Gronau, I. & Siepel, A. Genome-wide inference of ancestral recombination graphs. PLoS Genet. 10, e1004342 (2014).
USE
The ARGweaver software package contains programs and libraries for sampling and manipulating ancestral recombination graphs (ARGs). An ARG is a rich data structure for representing the ancestry of DNA sequences undergoing coalescence and recombination.
TITLE
Genome-wide inference of ancestral recombination graphs.
Main citation
Rasmussen MD, Hubisz MJ, Gronau I, Siepel A. (2014) Genome-wide inference of ancestral recombination graphs. PLoS Genet, 10 (5) e1004342. doi:10.1371/journal.pgen.1004342. PMID 24831947
ABSTRACT
The complex correlation structure of a collection of orthologous DNA sequences is uniquely captured by the "ancestral recombination graph" (ARG), a complete record of coalescence and recombination events in the history of the sample. However, existing methods for ARG inference are computationally intensive, highly approximate, or limited to small numbers of sequences, and, as a consequence, explicit ARG inference is rarely used in applied population genomics. Here, we introduce a new algorithm for ARG inference that is efficient enough to apply to dozens of complete mammalian genomes. The key idea of our approach is to sample an ARG of n chromosomes conditional on an ARG of n-1 chromosomes, an operation we call "threading." Using techniques based on hidden Markov models, we can perform this threading operation exactly, up to the assumptions of the sequentially Markov coalescent and a discretization of time. An extension allows for threading of subtrees instead of individual sequences. Repeated application of these threading operations results in highly efficient Markov chain Monte Carlo samplers for ARGs. We have implemented these methods in a computer program called ARGweaver. Experiments with simulated data indicate that ARGweaver converges rapidly to the posterior distribution over ARGs and is effective in recovering various features of the ARG for dozens of sequences generated under realistic parameters for human populations. In applications of ARGweaver to 54 human genome sequences from Complete Genomics, we find clear signatures of natural selection, including regions of unusually ancient ancestry associated with balancing selection and reductions in allele age in sites under directional selection. The patterns we observe near protein-coding genes are consistent with a primary influence from background selection rather than hitchhiking, although we cannot rule out a contribution from recurrent selective sweeps.
DOI
10.1371/journal.pgen.1004342
ARGweaver-D
DESCRIPTION
Hubisz, M. J., Williams, A. L. & Siepel, A. Mapping gene flow between ancient hominins through demography-aware inference of the ancestral recombination graph. PLoS Genet. 16, e1008895 (2020).
TITLE
Mapping gene flow between ancient hominins through demography-aware inference of the ancestral recombination graph.
Main citation
Hubisz MJ, Williams AL, Siepel A. (2020) Mapping gene flow between ancient hominins through demography-aware inference of the ancestral recombination graph. PLoS Genet, 16 (8) e1008895. doi:10.1371/journal.pgen.1008895. PMID 32760067
ABSTRACT
The sequencing of Neanderthal and Denisovan genomes has yielded many new insights about interbreeding events between extinct hominins and the ancestors of modern humans. While much attention has been paid to the relatively recent gene flow from Neanderthals and Denisovans into modern humans, other instances of introgression leave more subtle genomic evidence and have received less attention. Here, we present a major extension of the ARGweaver algorithm, called ARGweaver-D, which can infer local genetic relationships under a user-defined demographic model that includes population splits and migration events. This Bayesian algorithm probabilistically samples ancestral recombination graphs (ARGs) that specify not only tree topologies and branch lengths along the genome, but also indicate migrant lineages. The sampled ARGs can therefore be parsed to produce probabilities of introgression along the genome. We show that this method is well powered to detect the archaic migration into modern humans, even with only a few samples. We then show that the method can also detect introgressed regions stemming from older migration events, or from unsampled populations. We apply it to human, Neanderthal, and Denisovan genomes, looking for signatures of older proposed migration events, including ancient humans into Neanderthal, and unknown archaic hominins into Denisovans. We identify 3% of the Neanderthal genome that is putatively introgressed from ancient humans, and estimate that the gene flow occurred between 200-300kya. We find no convincing evidence that negative selection acted against these regions. Finally, we predict that 1% of the Denisovan genome was introgressed from an unsequenced, but highly diverged, archaic hominin ancestor. About 15% of these "super-archaic" regions (comprising at least about 4Mb) were, in turn, introgressed into modern humans and continue to exist in the genomes of people alive today.
DOI
10.1371/journal.pgen.1008895
ASMC
FULL NAME
Ascertained Sequentially Markovian Coalescent
DESCRIPTION
P. Palamara, J. Terhorst, Y. Song, A. Price. High-throughput inference of pairwise coalescence times identifies signals of selection and enriched disease heritability. Nature Genetics, 2018.
USE
The Ascertained Sequentially Markovian Coalescent is a method to efficiently estimate pairwise coalescence time along the genome. It can be run using SNP array or whole-genome sequencing (WGS) data.
TITLE
High-throughput inference of pairwise coalescence times identifies signals of selection and enriched disease heritability.
Main citation
Palamara PF, Terhorst J, Song YS, Price AL. (2018) High-throughput inference of pairwise coalescence times identifies signals of selection and enriched disease heritability. Nat Genet, 50 (9) 1311-1317. doi:10.1038/s41588-018-0177-x. PMID 30104759
ABSTRACT
Interest in reconstructing demographic histories has motivated the development of methods to estimate locus-specific pairwise coalescence times from whole-genome sequencing data. Here we introduce a powerful new method, ASMC, that can estimate coalescence times using only SNP array data, and is orders of magnitude faster than previous approaches. We applied ASMC to detect recent positive selection in 113,851 phased British samples from the UK Biobank, and detected 12 genome-wide significant signals, including 6 novel loci. We also applied ASMC to sequencing data from 498 Dutch individuals to detect background selection at deeper time scales. We detected strong heritability enrichment in regions of high background selection in an analysis of 20 independent diseases and complex traits using stratified linkage disequilibrium score regression, conditioned on a broad set of functional annotations (including other background selection annotations). These results underscore the widespread effects of background selection on the genetic architecture of complex traits.
DOI
10.1038/s41588-018-0177-x
ASSET
PUBMED_LINK
FULL NAME
association analysis based on subsets
URL
TITLE
A subset-based approach improves power and interpretation for the combined analysis of genetic association studies of heterogeneous traits.
Main citation
Bhattacharjee S, Rajaraman P, Jacobs KB, Wheeler WA, ...&, Chatterjee N. (2012) A subset-based approach improves power and interpretation for the combined analysis of genetic association studies of heterogeneous traits. Am J Hum Genet, 90 (5) 821-35. doi:10.1016/j.ajhg.2012.03.015. PMID 22560090
ABSTRACT
Pooling genome-wide association studies (GWASs) increases power but also poses methodological challenges because studies are often heterogeneous. For example, combining GWASs of related but distinct traits can provide promising directions for the discovery of loci with small but common pleiotropic effects. Classical approaches for meta-analysis or pooled analysis, however, might not be suitable for such analysis because individual variants are likely to be associated with only a subset of the traits or might demonstrate effects in different directions. We propose a method that exhaustively explores subsets of studies for the presence of true association signals that are in either the same direction or possibly opposite directions. An efficient approximation is used for rapid evaluation of p values. We present two illustrative applications, one for a meta-analysis of separate case-control studies of six distinct cancers and another for pooled analysis of a case-control study of glioma, a class of brain tumors that contains heterogeneous subtypes. Both the applications and additional simulation studies demonstrate that the proposed methods offer improved power and more interpretable results when compared to traditional methods for the analysis of heterogeneous traits. The proposed framework has applications beyond genetic association studies.
DOI
10.1016/j.ajhg.2012.03.015
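The subset search described in the ASSET abstract above can be illustrated with a minimal sketch. It assumes independent studies (no shared controls) and uses simple sample-size weights; the function names and example numbers are hypothetical, and the sketch does not implement the paper's p-value approximation, which has to account for the search over subsets.

```python
from itertools import combinations
from math import sqrt


def subset_z(z, n, subset):
    """Sample-size-weighted meta-analysis z-score using only the studies in `subset`."""
    total = sum(n[k] for k in subset)
    return sum(sqrt(n[k] / total) * z[k] for k in subset)


def best_subset(z, n):
    """Exhaustively score every non-empty subset and return (max z, subset)."""
    studies = range(len(z))
    candidates = (
        (subset_z(z, n, s), s)
        for r in range(1, len(z) + 1)
        for s in combinations(studies, r)
    )
    return max(candidates)


# Hypothetical per-study z-scores: only the first two studies carry signal.
z_scores = [3.0, 2.5, -0.2, 0.1]
sample_sizes = [1000, 1000, 1000, 1000]
z_max, chosen = best_subset(z_scores, sample_sizes)
print(chosen, round(z_max, 3))
```

The maximised statistic is larger than the ordinary all-study meta-analysis z-score whenever some studies carry no signal, which is why its null distribution cannot be treated as standard normal.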
BayesR
PUBMED_LINK
DESCRIPTION
Bayesian mixture model to dissect genetic variation for disease in human populations and to construct more powerful risk predictors
URL
TITLE
Simultaneous discovery, estimation and prediction analysis of complex traits using a bayesian mixture model.
Main citation
Moser G, Lee SH, Hayes BJ, Goddard ME, ...&, Visscher PM. (2015) Simultaneous discovery, estimation and prediction analysis of complex traits using a bayesian mixture model. PLoS Genet, 11 (4) e1004969. doi:10.1371/journal.pgen.1004969. PMID 25849665
ABSTRACT
Gene discovery, estimation of heritability captured by SNP arrays, inference on genetic architecture and prediction analyses of complex traits are usually performed using different statistical models and methods, leading to inefficiency and loss of power. Here we use a Bayesian mixture model that simultaneously allows variant discovery, estimation of genetic variance explained by all variants and prediction of unobserved phenotypes in new samples. We apply the method to simulated data of quantitative traits and Wellcome Trust Case Control Consortium (WTCCC) data on disease and show that it provides accurate estimates of SNP-based heritability, produces unbiased estimators of risk in new samples, and that it can estimate genetic architecture by partitioning variation across hundreds to thousands of SNPs. We estimated that, depending on the trait, 2,633 to 9,411 SNPs explain all of the SNP-based heritability in the WTCCC diseases. The majority of those SNPs (>96%) had small effects, confirming a substantial polygenic component to common diseases. The proportion of the SNP-based variance explained by large effects (each SNP explaining 1% of the variance) varied markedly between diseases, ranging from almost zero for bipolar disorder to 72% for type 1 diabetes. Prediction analyses demonstrate that for diseases with major loci, such as type 1 diabetes and rheumatoid arthritis, Bayesian methods outperform profile scoring or mixed model approaches.
DOI
10.1371/journal.pgen.1004969
BayesRR-RC
PUBMED_LINK
DESCRIPTION
gmrm is hybrid-parallel software for a Bayesian grouped mixture of regressions model for genome-wide association studies (GWAS). It is written in C++ using extensive optimisations and code vectorisation. It relies on plink's .bed format. It can handle multiple traits simultaneously.
URL
TITLE
Probabilistic inference of the genetic architecture underlying functional enrichment of complex traits.
Main citation
Patxot M, Banos DT, Kousathanas A, Orliac EJ, ...&, Robinson MR. (2021) Probabilistic inference of the genetic architecture underlying functional enrichment of complex traits. Nat Commun, 12 (1) 6972. doi:10.1038/s41467-021-27258-9. PMID 34848700
ABSTRACT
We develop a Bayesian model (BayesRR-RC) that provides robust SNP-heritability estimation, an alternative to marker discovery, and accurate genomic prediction, taking 22 seconds per iteration to estimate 8.4 million SNP-effects and 78 SNP-heritability parameters in the UK Biobank. We find that only ≤10% of the genetic variation captured for height, body mass index, cardiovascular disease, and type 2 diabetes is attributable to proximal regulatory regions within 10kb upstream of genes, while 12-25% is attributed to coding regions, 32-44% to introns, and 22-28% to distal 10-500kb upstream regions. Up to 24% of all cis and coding regions of each chromosome are associated with each trait, with over 3,100 independent exonic and intronic regions and over 5,400 independent regulatory regions having ≥95% probability of contributing ≥0.001% to the genetic variance of these four traits. Our open-source software (GMRM) provides a scalable alternative to current approaches for biobank data.
DOI
10.1038/s41467-021-27258-9
BayesS
PUBMED_LINK
URL
TITLE
Signatures of negative selection in the genetic architecture of human complex traits.
Main citation
Zeng J, de Vlaming R, Wu Y, Robinson MR, ...&, Yang J. (2018) Signatures of negative selection in the genetic architecture of human complex traits. Nat Genet, 50 (5) 746-753. doi:10.1038/s41588-018-0101-4. PMID 29662166
ABSTRACT
We develop a Bayesian mixed linear model that simultaneously estimates single-nucleotide polymorphism (SNP)-based heritability, polygenicity (proportion of SNPs with nonzero effects), and the relationship between SNP effect size and minor allele frequency for complex traits in conventionally unrelated individuals using genome-wide SNP data. We apply the method to 28 complex traits in the UK Biobank data (N = 126,752) and show that on average, 6% of SNPs have nonzero effects, which in total explain 22% of phenotypic variance. We detect significant (P < 0.05/28) signatures of natural selection in the genetic architecture of 23 traits, including reproductive, cardiovascular, and anthropometric traits, as well as educational attainment. The significant estimates of the relationship between effect size and minor allele frequency in complex traits are consistent with a model of negative (or purifying) selection, as confirmed by forward simulation. We conclude that negative selection acts pervasively on the genetic variants associated with human complex traits.
DOI
10.1038/s41588-018-0101-4
BEAGLE
PUBMED_LINK
URL
TITLE
Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering.
Main citation
Browning SR, Browning BL. (2007) Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am J Hum Genet, 81 (5) 1084-97. doi:10.1086/521987. PMID 17924348
ABSTRACT
Whole-genome association studies present many new statistical and computational challenges due to the large quantity of data obtained. One of these challenges is haplotype inference; methods for haplotype inference designed for small data sets from candidate-gene studies do not scale well to the large number of individuals genotyped in whole-genome association studies. We present a new method and software for inference of haplotype phase and missing data that can accurately phase data from whole-genome association studies, and we present the first comparison of haplotype-inference methods for real and simulated data sets with thousands of genotyped individuals. We find that our method outperforms existing methods in terms of both speed and accuracy for large data sets with thousands of individuals and densely spaced genetic markers, and we use our method to phase a real data set of 3,002 individuals genotyped for 490,032 markers in 3.1 days of computing time, with 99% of masked alleles imputed correctly. Our method is implemented in the Beagle software package, which is freely available.
DOI
10.1086/521987
BEAGLE4
PUBMED_LINK
DESCRIPTION
(beagle 4.1)
URL
TITLE
Genotype Imputation with Millions of Reference Samples.
Main citation
Browning BL, Browning SR. (2016) Genotype Imputation with Millions of Reference Samples. Am J Hum Genet, 98 (1) 116-26. doi:10.1016/j.ajhg.2015.11.020. PMID 26748515
ABSTRACT
We present a genotype imputation method that scales to millions of reference samples. The imputation method, based on the Li and Stephens model and implemented in Beagle v.4.1, is parallelized and memory efficient, making it well suited to multi-core computer processors. It achieves fast, accurate, and memory-efficient genotype imputation by restricting the probability model to markers that are genotyped in the target samples and by performing linear interpolation to impute ungenotyped variants. We compare Beagle v.4.1 with Impute2 and Minimac3 by using 1000 Genomes Project data, UK10K Project data, and simulated data. All three methods have similar accuracy but different memory requirements and different computation times. When imputing 10 Mb of sequence data from 50,000 reference samples, Beagle's throughput was more than 100× greater than Impute2's throughput on our computer servers. When imputing 10 Mb of sequence data from 200,000 reference samples in VCF format, Minimac3 consumed 26× more memory per computational thread and 15× more CPU time than Beagle. We demonstrate that Beagle v.4.1 scales to much larger reference panels by performing imputation from a simulated reference panel having 5 million samples and a mean marker density of one marker per four base pairs.
DOI
10.1016/j.ajhg.2015.11.020
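The interpolation step mentioned in the Beagle 4.1 abstract above can be pictured with a small sketch. It assumes that HMM state probabilities (one per reference haplotype) are already available at the two genotyped markers flanking an ungenotyped variant, and that positions are given in genetic-map units; the function names and numbers are hypothetical and this is not Beagle's actual code.

```python
def interpolate_state_probs(p_left, p_right, pos_left, pos_right, pos):
    """Linearly interpolate per-haplotype state probabilities between two markers."""
    w = (pos - pos_left) / (pos_right - pos_left)
    return [(1 - w) * a + w * b for a, b in zip(p_left, p_right)]


def alt_allele_prob(state_probs, ref_alleles):
    """Sum the probability mass of reference haplotypes carrying the ALT allele (coded 1)."""
    return sum(p for p, allele in zip(state_probs, ref_alleles) if allele == 1)


# Hypothetical state probabilities for three reference haplotypes.
p_left = [0.7, 0.2, 0.1]    # at the genotyped marker to the left
p_right = [0.1, 0.2, 0.7]   # at the genotyped marker to the right
probs = interpolate_state_probs(p_left, p_right, 0.0, 1.0, 0.25)

# Alleles carried by the three reference haplotypes at the ungenotyped variant.
ref_alleles = [1, 0, 1]
print(alt_allele_prob(probs, ref_alleles))
```

Because the HMM is only evaluated at genotyped markers, the cost of each additional ungenotyped variant in this scheme is a weighted sum rather than another forward-backward step.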
BEAGLE5.4 (Imputation)
PUBMED_LINK
DESCRIPTION
(beagle 5.4 imputation)
URL
TITLE
A One-Penny Imputed Genome from Next-Generation Reference Panels.
Main citation
Browning BL, Zhou Y, Browning SR. (2018) A One-Penny Imputed Genome from Next-Generation Reference Panels. Am J Hum Genet, 103 (3) 338-348. doi:10.1016/j.ajhg.2018.07.015. PMID 30100085
ABSTRACT
Genotype imputation is commonly performed in genome-wide association studies because it greatly increases the number of markers that can be tested for association with a trait. In general, one should perform genotype imputation using the largest reference panel that is available because the number of accurately imputed variants increases with reference panel size. However, one impediment to using larger reference panels is the increased computational cost of imputation. We present a new genotype imputation method, Beagle 5.0, which greatly reduces the computational cost of imputation from large reference panels. We compare Beagle 5.0 with Beagle 4.1, Impute4, Minimac3, and Minimac4 using 1000 Genomes Project data, Haplotype Reference Consortium data, and simulated data for 10k, 100k, 1M, and 10M reference samples. All methods produce nearly identical accuracy, but Beagle 5.0 has the lowest computation time and the best scaling of computation time with increasing reference panel size. For 10k, 100k, 1M, and 10M reference samples and 1,000 phased target samples, Beagle 5.0's computation time is 3× (10k), 12× (100k), 43× (1M), and 533× (10M) faster than the fastest alternative method. Cost data from the Amazon Elastic Compute Cloud show that Beagle 5.0 can perform genome-wide imputation from 10M reference samples into 1,000 phased target samples at a cost of less than one US cent per sample.
DOI
10.1016/j.ajhg.2018.07.015
BEAGLE5.4 (Phasing)
PUBMED_LINK
DESCRIPTION
(beagle 5.4 phasing)
URL
TITLE
Fast two-stage phasing of large-scale sequence data.
Main citation
Browning BL, Tian X, Zhou Y, Browning SR. (2021) Fast two-stage phasing of large-scale sequence data. Am J Hum Genet, 108 (10) 1880-1890. doi:10.1016/j.ajhg.2021.08.005. PMID 34478634
ABSTRACT
Haplotype phasing is the estimation of haplotypes from genotype data. We present a fast, accurate, and memory-efficient haplotype phasing method that scales to large-scale SNP array and sequence data. The method uses marker windowing and composite reference haplotypes to reduce memory usage and computation time. It incorporates a progressive phasing algorithm that identifies confidently phased heterozygotes in each iteration and fixes the phase of these heterozygotes in subsequent iterations. For data with many low-frequency variants, such as whole-genome sequence data, the method employs a two-stage phasing algorithm that phases high-frequency markers via progressive phasing in the first stage and phases low-frequency markers via genotype imputation in the second stage. This haplotype phasing method is implemented in the open-source Beagle 5.2 software package. We compare Beagle 5.2 and SHAPEIT 4.2.1 by using expanding subsets of 485,301 UK Biobank samples and 38,387 TOPMed samples. Both methods have very similar accuracy and computation time for UK Biobank SNP array data. However, for TOPMed sequence data, Beagle is more than 20 times faster than SHAPEIT, achieves similar accuracy, and scales to larger sample sizes.
DOI
10.1016/j.ajhg.2021.08.005
BEATRICE
PUBMED_LINK
FULL NAME
Bayesian finE-mapping from summAry daTa using deep vaRiational InferenCE
DESCRIPTION
In this repository, we introduce BEATRICE, a finemapping tool to identify putative causal variants from GWAS summary data. BEATRICE combines a hierarchical Bayesian model with a deep learning-based inference procedure. This combination provides greater inferential power to handle noise and spurious interactions due to polygenicity of the trait, trans-interactions of variants, or varying correlation structure of the genomic region.
URL
TITLE
BEATRICE: Bayesian fine-mapping from summary data using deep variational inference.
Main citation
Ghosal S, Schatz MC, Venkataraman A. (2024) BEATRICE: Bayesian fine-mapping from summary data using deep variational inference. Bioinformatics, 40 (10) . doi:10.1093/bioinformatics/btae590. PMID 39360993
ABSTRACT
MOTIVATION: We introduce a novel framework BEATRICE to identify putative causal variants from GWAS statistics. Identifying causal variants is challenging due to their sparsity and high correlation in the nearby regions. To account for these challenges, we rely on a hierarchical Bayesian model that imposes a binary concrete prior on the set of causal variants. We derive a variational algorithm for this fine-mapping problem by minimizing the KL divergence between an approximate density and the posterior probability distribution of the causal configurations. Correspondingly, we use a deep neural network as an inference machine to estimate the parameters of our proposal distribution. Our stochastic optimization procedure allows us to sample from the space of causal configurations, which we use to compute the posterior inclusion probabilities and determine credible sets for each causal variant. We conduct a detailed simulation study to quantify the performance of our framework against two state-of-the-art baseline methods across different numbers of causal variants and noise paradigms, as defined by the relative genetic contributions of causal and noncausal variants. RESULTS: We demonstrate that BEATRICE achieves uniformly better coverage with comparable power and set sizes, and that the performance gain increases with the number of causal variants. We also show the efficacy of BEATRICE in finding causal variants from the GWAS study of Alzheimer's disease. In comparison to the baselines, only BEATRICE can successfully find the APOE ϵ2 allele, a commonly associated variant of Alzheimer's. AVAILABILITY AND IMPLEMENTATION: BEATRICE is available for download at https://github.com/sayangsep/Beatrice-Finemapping.
DOI
10.1093/bioinformatics/btae590
Benchmark-Wang
PUBMED_LINK
TITLE
A comprehensive investigation of statistical and machine learning approaches for predicting complex human diseases on genomic variants.
Main citation
Wang C, Zhang J, Veldsman WP, Zhou X, ...&, Zhang L. (2023) A comprehensive investigation of statistical and machine learning approaches for predicting complex human diseases on genomic variants. Brief Bioinform, 24 (1) . doi:10.1093/bib/bbac552. PMID 36585786
ABSTRACT
Quantifying an individual's risk for common diseases is an important goal of precision health. The polygenic risk score (PRS), which aggregates multiple risk alleles of candidate diseases, has emerged as a standard approach for identifying high-risk individuals. Although several studies have been performed to benchmark the PRS calculation tools and assess their potential to guide future clinical applications, some issues remain to be further investigated, such as lacking (i) various simulated data with different genetic effects; (ii) evaluation of machine learning models and (iii) evaluation on multiple ancestries studies. In this study, we systematically validated and compared 13 statistical methods, 5 machine learning models and 2 ensemble models using simulated data with additive and genetic interaction models, 22 common diseases with internal training sets, 4 common diseases with external summary statistics and 3 common diseases for trans-ancestry studies in UK Biobank. The statistical methods were better in simulated data from additive models and machine learning models have edges for data that include genetic interactions. Ensemble models are generally the best choice by integrating various statistical methods. LDpred2 outperformed the other standalone tools, whereas PRS-CS, lassosum and DBSLMM showed comparable performance. We also identified that disease heritability strongly affected the predictive performance of all methods. Both the number and effect sizes of risk SNPs are important; and sample size strongly influences the performance of all methods. For the trans-ancestry studies, we found that the performance of most methods became worse when training and testing sets were from different populations.
DOI
10.1093/bib/bbac552
BOLT-LMM
PUBMED_LINK
DESCRIPTION
The BOLT-LMM software package currently consists of two main algorithms, the BOLT-LMM algorithm for mixed model association testing, and the BOLT-REML algorithm for variance components analysis (i.e., partitioning of SNP-heritability and estimation of genetic correlations).
URL
KEYWORDS
non-infinitesimal model, mixture of two Gaussian distributions
TITLE
Efficient Bayesian mixed-model analysis increases association power in large cohorts.
Main citation
Loh PR, Tucker G, Bulik-Sullivan BK, Vilhjálmsson BJ, ...&, Price AL. (2015) Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat Genet, 47 (3) 284-90. doi:10.1038/ng.3190. PMID 25642633
ABSTRACT
Linear mixed models are a powerful statistical tool for identifying genetic associations and avoiding confounding. However, existing methods are computationally intractable in large cohorts and may not optimize power. All existing methods require time cost O(MN²) (where N is the number of samples and M is the number of SNPs) and implicitly assume an infinitesimal genetic architecture in which effect sizes are normally distributed, which can limit power. Here we present a far more efficient mixed-model association method, BOLT-LMM, which requires only a small number of O(MN) time iterations and increases power by modeling more realistic, non-infinitesimal genetic architectures via a Bayesian mixture prior on marker effect sizes. We applied BOLT-LMM to 9 quantitative traits in 23,294 samples from the Women's Genome Health Study (WGHS) and observed significant increases in power, consistent with simulations. Theory and simulations show that the boost in power increases with cohort size, making BOLT-LMM appealing for genome-wide association studies in large cohorts.
DOI
10.1038/ng.3190
BridgePRS
PUBMED_LINK
DESCRIPTION
BridgePRS is a Bayesian-ridge (Bridge) approach, which "bridges" the PRS between two populations of different ancestry, developed to tackle the "PRS Portability Problem". The PRS Portability Problem causes lower accuracy PRS in underrepresented populations due to the biased sampling in GWAS data collection.
URL
TITLE
BridgePRS leverages shared genetic effects across ancestries to increase polygenic risk score portability.
Main citation
Hoggart CJ, Choi SW, García-González J, Souaiaia T, ...&, O'Reilly PF. (2024) BridgePRS leverages shared genetic effects across ancestries to increase polygenic risk score portability. Nat Genet, 56 (1) 180-186. doi:10.1038/s41588-023-01583-9. PMID 38123642
ABSTRACT
Here we present BridgePRS, a novel Bayesian polygenic risk score (PRS) method that leverages shared genetic effects across ancestries to increase PRS portability. We evaluate BridgePRS via simulations and real UK Biobank data across 19 traits in individuals of African, South Asian and East Asian ancestry, using both UK Biobank and Biobank Japan genome-wide association study summary statistics; out-of-cohort validation is performed in the Mount Sinai (New York) BioMe biobank. BridgePRS is compared with the leading alternative, PRS-CSx, and two other PRS methods. Simulations suggest that the performance of BridgePRS relative to PRS-CSx increases as uncertainty increases: with lower trait heritability, higher polygenicity and greater between-population genetic diversity; and when causal variants are not present in the data. In real data, BridgePRS has a 61% larger average R² than PRS-CSx in out-of-cohort prediction of African ancestry samples in BioMe (P = 6 × 10⁻⁵). BridgePRS is a computationally efficient, user-friendly and powerful approach for PRS analyses in non-European ancestries.
DOI
10.1038/s41588-023-01583-9
CAFEH
PUBMED_LINK
FULL NAME
colocalization and fine-mapping in the presence of allelic heterogeneity
DESCRIPTION
CAFEH is a method that performs finemapping and colocalization jointly over multiple phenotypes. CAFEH can be run with 10s of phenotypes and 1000s of variants in a few minutes.
URL
KEYWORDS
multi-trait, finemapping, colocalization
TITLE
Redefining tissue specificity of genetic regulation of gene expression in the presence of allelic heterogeneity.
Main citation
Arvanitis M, Tayeb K, Strober BJ, Battle A. (2022) Redefining tissue specificity of genetic regulation of gene expression in the presence of allelic heterogeneity. Am J Hum Genet, 109 (2) 223-239. doi:10.1016/j.ajhg.2022.01.002. PMID 35085493
ABSTRACT
Uncovering the functional impact of genetic variation on gene expression is important in understanding tissue biology and the pathogenesis of complex traits. Despite large efforts to map expression quantitative trait loci (eQTLs) across many human tissues, our ability to translate those findings to understanding human disease has been incomplete, and the majority of disease loci are not explained by association with expression of a target gene. Cell-type specificity and the presence of multiple independent causal variants for many eQTLs are potential confounders contributing to the apparent discrepancy with disease loci. In this study, we investigate the tissue specificity of genetic effects on gene expression and the overlap with disease loci while considering the presence of multiple causal variants within and across tissues. We find evidence of pervasive tissue specificity of eQTLs, often masked by linkage disequilibrium that misleads traditional meta-analytic approaches. We propose CAFEH (colocalization and fine-mapping in the presence of allelic heterogeneity), a Bayesian method that integrates genetic association data across multiple traits, incorporating linkage disequilibrium to identify causal variants. CAFEH outperforms previous approaches in colocalization and fine-mapping. Using CAFEH, we show that genes with highly tissue-specific genetic effects are under greater selection, enriched in differentiation and developmental processes, and more likely to be involved in human disease. Last, we demonstrate that CAFEH can efficiently leverage the widespread allelic heterogeneity in genetic regulation of gene expression to prioritize the target tissue in genome-wide association complex trait loci, thereby improving our ability to interpret complex trait genetics.
DOI
10.1016/j.ajhg.2022.01.002
CalPred
PUBMED_LINK
FULL NAME
Calibrated prediction intervals
DESCRIPTION
a statistical framework that jointly models the effects of all contexts on PGS accuracy with parameters learned in a calibration dataset
URL
KEYWORDS
trait prediction intervals
TITLE
Calibrated prediction intervals for polygenic scores across diverse contexts.
Main citation
Hou K, Xu Z, Ding Y, Mandla R, ...&, Pasaniuc B. (2024) Calibrated prediction intervals for polygenic scores across diverse contexts. Nat Genet, 56 (7) 1386-1396. doi:10.1038/s41588-024-01792-w. PMID 38886587
ABSTRACT
Polygenic scores (PGS) have emerged as the tool of choice for genomic prediction in a wide range of fields. We show that PGS performance varies broadly across contexts and biobanks. Contexts such as age, sex and income can impact PGS accuracy with similar magnitudes as genetic ancestry. Here we introduce an approach (CalPred) that models all contexts jointly to produce prediction intervals that vary across contexts to achieve calibration (include the trait with 90% probability), whereas existing methods are miscalibrated. In analyses of 72 traits across large and diverse biobanks (All of Us and UK Biobank), we find that prediction intervals required adjustment by up to 80% for quantitative traits. For disease traits, PGS-based predictions were miscalibrated across socioeconomic contexts such as annual household income levels, further highlighting the need of accounting for context information in PGS-based prediction across diverse populations.
DOI
10.1038/s41588-024-01792-w
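The idea of context-dependent prediction intervals in the CalPred entry above can be sketched as follows. This is an illustration only: it assumes a normal model whose mean is linear in the features and whose log-variance is linear in the same features, with coefficients already learned in a calibration dataset. The function name, feature layout and coefficient values are hypothetical and are not taken from the CalPred software.

```python
from math import exp, sqrt
from statistics import NormalDist


def prediction_interval(features, beta_mu, beta_sigma, coverage=0.90):
    """Return (lower, upper) for a trait modelled as N(mean(features), var(features))."""
    mean = sum(x * b for x, b in zip(features, beta_mu))
    # Assumed variance model: log-variance is linear in the features.
    variance = exp(sum(x * b for x, b in zip(features, beta_sigma)))
    z = NormalDist().inv_cdf(0.5 + coverage / 2)
    half_width = z * sqrt(variance)
    return mean - half_width, mean + half_width


# Feature layout (hypothetical): [intercept, PGS, standardised age].
beta_mu = [0.0, 0.5, 0.1]
beta_sigma = [-0.2, 0.0, 0.3]   # positive age coefficient: noisier at older ages

young = [1.0, 1.2, -1.0]
old = [1.0, 1.2, 1.0]
lo_y, hi_y = prediction_interval(young, beta_mu, beta_sigma)
lo_o, hi_o = prediction_interval(old, beta_mu, beta_sigma)
print(hi_y - lo_y, hi_o - lo_o)
```

Two individuals with the same PGS get intervals of different width here, which is the behaviour the abstract describes as necessary for calibration across contexts.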
Cancer PRSweb
PUBMED_LINK
DESCRIPTION
Our framework condenses these summary statistics into PRS using linkage disequilibrium pruning and p-value thresholding (fixed or data-adaptively optimized thresholds) or penalized, genome-wide effect size weighting. We evaluate them in the cancer-enriched cohort of the Michigan Genomics Initiative (MGI), a longitudinal biorepository effort at Michigan Medicine, and in the population-based UK Biobank Study (UKB). For each PRS construct, measures on performance, calibration, and discrimination are provided. Beyond the cancer PRS evaluation in MGI and UKB, the PRSweb platform features construct downloads, risk evaluation in the top percentiles, and phenome-wide PRS association studies (PRS-PheWAS) for a subset of PRS that are predictive for the primary cancer.
URL
KEYWORDS
Cancer PRS
TITLE
Cancer PRSweb: An Online Repository with Polygenic Risk Scores for Major Cancer Traits and Their Evaluation in Two Independent Biobanks.
Main citation
Fritsche LG, Patil S, Beesley LJ, VandeHaar P, ...&, Mukherjee B. (2020) Cancer PRSweb: An Online Repository with Polygenic Risk Scores for Major Cancer Traits and Their Evaluation in Two Independent Biobanks. Am J Hum Genet, 107 (5) 815-836. doi:10.1016/j.ajhg.2020.08.025. PMID 32991828
ABSTRACT
To facilitate scientific collaboration on polygenic risk scores (PRSs) research, we created an extensive PRS online repository for 35 common cancer traits integrating freely available genome-wide association studies (GWASs) summary statistics from three sources: published GWASs, the NHGRI-EBI GWAS Catalog, and UK Biobank-based GWASs. Our framework condenses these summary statistics into PRSs using various approaches such as linkage disequilibrium pruning/p value thresholding (fixed or data-adaptively optimized thresholds) and penalized, genome-wide effect size weighting. We evaluated the PRSs in two biobanks: the Michigan Genomics Initiative (MGI), a longitudinal biorepository effort at Michigan Medicine, and the population-based UK Biobank (UKB). For each PRS construct, we provide measures on predictive performance and discrimination. Besides PRS evaluation, the Cancer-PRSweb platform features construct downloads and phenome-wide PRS association study results (PRS-PheWAS) for predictive PRSs. We expect this integrated platform to accelerate PRS-related cancer research.
DOI
10.1016/j.ajhg.2020.08.025
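The pruning-and-thresholding construction mentioned in the Cancer PRSweb entry above can be sketched in a few lines. The sketch uses greedy, p-value-ordered pruning (often called clumping) on a precomputed pairwise r² matrix, followed by a weighted allele count; the thresholds, function names and toy data are hypothetical and do not reproduce the PRSweb pipeline.

```python
def prune_and_threshold(pvals, r2, p_max=0.05, r2_max=0.1):
    """Keep variants with p < p_max that are not in LD (r2 >= r2_max) with a kept variant."""
    kept = []
    for i in sorted(range(len(pvals)), key=lambda k: pvals[k]):
        if pvals[i] >= p_max:
            break  # remaining variants have even larger p-values
        if all(r2[i][j] < r2_max for j in kept):
            kept.append(i)
    return sorted(kept)


def polygenic_score(dosages, betas, kept):
    """Weighted sum of effect-allele dosages over the retained variants."""
    return sum(betas[i] * dosages[i] for i in kept)


# Toy summary statistics for four variants; variants 0 and 1 are in strong LD.
pvals = [1e-8, 1e-6, 0.2, 1e-4]
betas = [0.2, 0.15, 0.05, -0.1]
r2 = [
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
kept = prune_and_threshold(pvals, r2)
score = polygenic_score([2, 1, 0, 1], betas, kept)
print(kept, score)
```

A data-adaptive variant of this scheme would repeat the construction over a grid of `p_max` values and pick the one that predicts best in a tuning sample.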
CaTS power calculator
PUBMED_LINK
DESCRIPTION
CaTS is a simple, multi-platform interface for carrying out power calculations for large genetic association studies, including two stage genome wide association studies.
URL
TITLE
Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies.
Main citation
Skol AD, Scott LJ, Abecasis GR, Boehnke M. (2006) Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat Genet, 38 (2) 209-13. doi:10.1038/ng1706. PMID 16415888
ABSTRACT
Genome-wide association is a promising approach to identify common genetic variants that predispose to human disease. Because of the high cost of genotyping hundreds of thousands of markers on thousands of subjects, genome-wide association studies often follow a staged design in which a proportion (pi(samples)) of the available samples are genotyped on a large number of markers in stage 1, and a proportion (pi(markers)) of these markers are later followed up by genotyping them on the remaining samples in stage 2. The standard strategy for analyzing such two-stage data is to view stage 2 as a replication study and focus on findings that reach statistical significance when stage 2 data are considered alone. We demonstrate that the alternative strategy of jointly analyzing the data from both stages almost always results in increased power to detect genetic association, despite the need to use more stringent significance levels, even when effect sizes differ between the two stages. We recommend joint analysis for all two-stage genome-wide association studies, especially when a relatively large proportion of the samples are genotyped in stage 1 (pi(samples) ≥ 0.30), and a relatively large proportion of markers are selected for follow-up in stage 2 (pi(markers) ≥ 0.01).
DOI
10.1038/ng1706
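The contrast between replication-based and joint analysis in the CaTS entry above can be made concrete with a short sketch. It assumes the joint statistic is a weighted sum of the stage 1 and stage 2 z-scores, with weights given by the square roots of the sample fractions; the function names, thresholds and example values are hypothetical and are not taken from the CaTS software.

```python
from math import sqrt


def joint_z(z1, z2, pi_samples):
    """Combine stage 1 and stage 2 z-scores, weighting by the fraction of samples in each."""
    return sqrt(pi_samples) * z1 + sqrt(1 - pi_samples) * z2


def two_stage_call(z1, z2, pi_samples, c1, c_joint):
    """A marker is followed up if |z1| > c1 and declared associated if |z_joint| > c_joint."""
    if abs(z1) <= c1:
        return False  # never genotyped in stage 2
    return abs(joint_z(z1, z2, pi_samples)) > c_joint


# Hypothetical marker: moderate evidence in each stage, 36% of samples in stage 1.
z = joint_z(3.0, 4.0, 0.36)
print(z)
print(two_stage_call(3.0, 4.0, 0.36, c1=2.5, c_joint=4.9))
```

In this example neither stage alone would pass a threshold of 4.9, but the combined statistic does, which is the intuition behind the power gain reported in the abstract.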
CAVIAR
PUBMED_LINK
FULL NAME
causal variants identification in associated regions
DESCRIPTION
a statistical framework that quantifies the probability of each variant to be causal while allowing an arbitrary number of causal variants.
URL
TITLE
Identifying causal variants at loci with multiple signals of association.
Main citation
Hormozdiari F, Kostem E, Kang EY, Pasaniuc B, ...&, Eskin E. (2014) Identifying causal variants at loci with multiple signals of association. Genetics, 198 (2) 497-508. doi:10.1534/genetics.114.167908. PMID 25104515
ABSTRACT
Although genome-wide association studies have successfully identified thousands of risk loci for complex traits, only a handful of the biologically causal variants, responsible for association at these loci, have been successfully identified. Current statistical methods for identifying causal variants at risk loci either use the strength of the association signal in an iterative conditioning framework or estimate probabilities for variants to be causal. A main drawback of existing methods is that they rely on the simplifying assumption of a single causal variant at each risk locus, which is typically invalid at many risk loci. In this work, we propose a new statistical framework that allows for the possibility of an arbitrary number of causal variants when estimating the posterior probability of a variant being causal. A direct benefit of our approach is that we predict a set of variants for each locus that under reasonable assumptions will contain all of the true causal variants with a high confidence level (e.g., 95%) even when the locus contains multiple causal variants. We use simulations to show that our approach provides 20-50% improvement in our ability to identify the causal variants compared to the existing methods at loci harboring multiple causal variants. We validate our approach using empirical data from an expression QTL study of CHI3L2 to identify new causal variants that affect gene expression at this locus. CAVIAR is publicly available online at http://genetics.cs.ucla.edu/caviar/.
DOI
10.1534/genetics.114.167908
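Illustrative sketch for the CAVIAR entry above. It is not the CAVIAR software. It is a minimal pure-Python reading of the idea in the abstract: score every causal configuration from marginal z-scores and an LD matrix, then sum configuration posteriors into a per-variant posterior probability. The function names, the prior probability `gamma`, the effect variance `s2`, and the cap `max_causal` are assumptions chosen for illustration. The model used is z | c ~ N(0, Σ + s2·Σ·D_c·Σ), where D_c marks the causal variants.

```python
import math
from itertools import combinations


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def mvn_logpdf_zero_mean(z, cov):
    """Log density of N(0, cov) evaluated at z."""
    low = cholesky(cov)
    n = len(z)
    y = []  # forward substitution: solve low * y = z
    for i in range(n):
        y.append((z[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i])
    half_log_det = sum(math.log(low[i][i]) for i in range(n))
    return -0.5 * sum(v * v for v in y) - half_log_det - 0.5 * n * math.log(2 * math.pi)


def config_cov(ld, causal, s2):
    """Covariance of z for one causal set: LD + s2 * LD * D_c * LD."""
    m = len(ld)
    return [
        [ld[i][j] + s2 * sum(ld[i][k] * ld[k][j] for k in causal) for j in range(m)]
        for i in range(m)
    ]


def posterior_inclusion(z, ld, gamma=0.01, s2=5.2 ** 2, max_causal=2):
    """Per-variant posterior probability, enumerating sets of up to max_causal variants.

    gamma, s2 and max_causal are illustrative assumptions, not verified defaults.
    """
    m = len(z)
    log_w = {}
    for k in range(max_causal + 1):
        for causal in combinations(range(m), k):
            log_prior = k * math.log(gamma) + (m - k) * math.log(1 - gamma)
            log_w[causal] = log_prior + mvn_logpdf_zero_mean(z, config_cov(ld, causal, s2))
    top = max(log_w.values())
    total = sum(math.exp(v - top) for v in log_w.values())
    pip = [0.0] * m
    for causal, v in log_w.items():
        p = math.exp(v - top) / total
        for i in causal:
            pip[i] += p
    return pip


# Toy locus (made-up numbers): variants 0 and 1 are in strong LD, variant 2 is not.
z_scores = [5.0, 4.2, 0.5]
ld_matrix = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.1], [0.1, 0.1, 1.0]]
print(posterior_inclusion(z_scores, ld_matrix))
```

Exhaustive enumeration grows combinatorially with the number of variants, so this form is only practical for small loci or a small `max_causal`.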
CAVIARBF
PUBMED_LINK
FULL NAME
CAVIAR Bayes factor
DESCRIPTION
a fine-mapping method using marginal test statistics in the Bayesian framework
URL
KEYWORDS
Bayes factor
TITLE
Fine Mapping Causal Variants with an Approximate Bayesian Method Using Marginal Test Statistics.
Main citation
Chen W, Larrabee BR, Ovsyannikova IG, Kennedy RB, ...&, Schaid DJ. (2015) Fine Mapping Causal Variants with an Approximate Bayesian Method Using Marginal Test Statistics. Genetics, 200 (3) 719-36. doi:10.1534/genetics.115.176107. PMID 25948564
ABSTRACT
Two recently developed fine-mapping methods, CAVIAR and PAINTOR, demonstrate better performance over other fine-mapping methods. They also have the advantage of using only the marginal test statistics and the correlation among SNPs. Both methods leverage the fact that the marginal test statistics asymptotically follow a multivariate normal distribution and are likelihood based. However, their relationship with Bayesian fine mapping, such as BIMBAM, is not clear. In this study, we first show that CAVIAR and BIMBAM are actually approximately equivalent to each other. This leads to a fine-mapping method using marginal test statistics in the Bayesian framework, which we call CAVIAR Bayes factor (CAVIARBF). Another advantage of the Bayesian framework is that it can answer both association and fine-mapping questions. We also used simulations to compare CAVIARBF with other methods under different numbers of causal variants. The results showed that both CAVIARBF and BIMBAM have better performance than PAINTOR and other methods. Compared to BIMBAM, CAVIARBF has the advantage of using only marginal test statistics and takes about one-quarter to one-fifth of the running time. We applied different methods on two independent cohorts of the same phenotype. Results showed that CAVIARBF, BIMBAM, and PAINTOR selected the same top 3 SNPs; however, CAVIARBF and BIMBAM had better consistency in selecting the top 10 ranked SNPs between the two cohorts. Software is available at https://bitbucket.org/Wenan/caviarbf.
DOI
10.1534/genetics.115.176107
CC-GWAS
PUBMED_LINK
FULL NAME
case–case genome-wide association study
DESCRIPTION
The CCGWAS R package provides a tool for case-case association testing of two different disorders based on their respective case-control GWAS results
URL
TITLE
Identifying loci with different allele frequencies among cases of eight psychiatric disorders using CC-GWAS.
Main citation
Peyrot WJ, Price AL. (2021) Identifying loci with different allele frequencies among cases of eight psychiatric disorders using CC-GWAS. Nat Genet, 53 (4) 445-454. doi:10.1038/s41588-021-00787-1. PMID 33686288
ABSTRACT
Psychiatric disorders are highly genetically correlated, but little research has been conducted on the genetic differences between disorders. We developed a new method (case-case genome-wide association study; CC-GWAS) to test for differences in allele frequency between cases of two disorders using summary statistics from the respective case-control GWAS, transcending current methods that require individual-level data. Simulations and analytical computations confirm that CC-GWAS is well powered with effective control of type I error. We applied CC-GWAS to publicly available summary statistics for schizophrenia, bipolar disorder, major depressive disorder and five other psychiatric disorders. CC-GWAS identified 196 independent case-case loci, including 72 CC-GWAS-specific loci that were not significant at the genome-wide level in the input case-control summary statistics; two of the CC-GWAS-specific loci implicate the genes KLF6 and KLF16 (from the Krüppel-like family of transcription factors), which have been linked to neurite outgrowth and axon regeneration. CC-GWAS loci replicated convincingly in applications to datasets with independent replication data.
DOI
10.1038/s41588-021-00787-1
cellAdmix
PUBMED_LINK
DESCRIPTION
cellAdmix detects and corrects segmentation errors in imaging-based spatial transcriptomics by factorizing local molecular neighborhoods—analogous to doublet removal in scRNA-seq—to reassign transcripts that spill across cell boundaries.
URL
KEYWORDS
spatial transcriptomics, segmentation, matrix factorization, imaging-based ST
TITLE
Impact and correction of segmentation errors in spatial transcriptomics.
Main citation
Mitchel J, Gao T, Petukhov V, Cole E, ...&, Kharchenko PV. (2026) Impact and correction of segmentation errors in spatial transcriptomics. Nat Genet, 58 (2) 434-444. doi:10.1038/s41588-025-02497-4. PMID 41559218
ABSTRACT
Spatial transcriptomics aims to elucidate how cells coordinate within tissues by connecting cellular states to their native microenvironments. Imaging-based assays are especially promising, capturing molecular and cellular features at subcellular resolution in three dimensions. Interpretation of such data, however, hinges on accurate cell segmentation. Assigning individual molecules to the correct cells remains challenging. Here we re-analyze data from multiple tissues and platforms to find that segmentation errors currently confound most downstream analysis of cellular state, including differential expression, neighbor influence and ligand-receptor interactions. The extent to which misassigned molecules impact the results can be striking, frequently dominating the results. Thus, we show that matrix factorization of local molecular neighborhoods can effectively identify and isolate such molecular admixtures, thereby reducing their impact on downstream analyses, in a manner analogous to doublet filtering in single-cell RNA sequencing. As the applications of spatial transcriptomics assays become more widespread, accounting for segmentation errors will be important for resolving molecular mechanisms of tissue biology.
DOI
10.1038/s41588-025-02497-4
ChinaMAP
PUBMED_LINK
FULL NAME
China Metabolic Analytics Project
URL
TITLE
The ChinaMAP reference panel for the accurate genotype imputation in Chinese populations.
Main citation
Li L, Huang P, Sun X, Wang S, ...&, Wang W. (2021) The ChinaMAP reference panel for the accurate genotype imputation in Chinese populations. Cell Res, 31 (12) 1308-1310. doi:10.1038/s41422-021-00564-z. PMID 34489580
DOI
10.1038/s41422-021-00564-z
ChinaMAP panel (ChinaMAP)
PUBMED_LINK
FULL NAME
China Metabolic Analytics Project
URL
TITLE
The ChinaMAP reference panel for the accurate genotype imputation in Chinese populations.
Main citation
Li L, Huang P, Sun X, Wang S, ...&, Wang W. (2021) The ChinaMAP reference panel for the accurate genotype imputation in Chinese populations. Cell Res, 31 (12) 1308-1310. doi:10.1038/s41422-021-00564-z. PMID 34489580
DOI
10.1038/s41422-021-00564-z
ChromoMap
PUBMED_LINK
DESCRIPTION
an R package for interactive visualization of multi-omics data and annotation of chromosomes
URL
TITLE
ChromoMap: an R package for interactive visualization of multi-omics data and annotation of chromosomes.
Main citation
Anand L, Rodriguez Lopez CM. (2022) ChromoMap: an R package for interactive visualization of multi-omics data and annotation of chromosomes. BMC Bioinformatics, 23 (1) 33. doi:10.1186/s12859-021-04556-z. PMID 35016614
ABSTRACT
BACKGROUND: The recent advancements in high-throughput sequencing have resulted in the availability of annotated genomes, as well as of multi-omics data for many living organisms. This has increased the need for graphic tools that allow the concurrent visualization of genomes and feature-associated multi-omics data on single publication-ready plots. RESULTS: We present chromoMap, an R package, developed for the construction of interactive visualizations of chromosomes/chromosomal regions, mapping of any chromosomal feature with known coordinates (i.e., protein coding genes, transposable elements, non-coding RNAs, microsatellites, etc.), and chromosomal regional characteristics (i.e. genomic feature density, gene expression, DNA methylation, chromatin modifications, etc.) of organisms with a genome assembly. ChromoMap can also integrate multi-omics data (genomics, transcriptomics and epigenomics) in relation to their occurrence across chromosomes. ChromoMap takes tab-delimited files (BED like) or alternatively R objects to specify the genomic co-ordinates of the chromosomes and elements to annotate. Rendered chromosomes are composed of continuous windows of a given range, which, on hover, display detailed information about the elements annotated within that range. By adjusting parameters of a single function, users can generate a variety of plots that can either be saved as static image or as HTML documents. CONCLUSIONS: ChromoMap's flexibility allows for concurrent visualization of genomic data in each strand of a given chromosome, or of more than one homologous chromosome; allowing the comparison of multi-omic data between genotypes (e.g. species, varieties, etc.) or between homologous chromosomes of phased diploid/polyploid genomes. chromoMap is an extensive tool that can be potentially used in various bioinformatics analysis pipelines for genomic visualization of multi-omics data.
DOI
10.1186/s12859-021-04556-z
CKB reference panel (CKB)
PUBMED_LINK
FULL NAME
China Kadoorie Biobank
URL
TITLE
A high-resolution haplotype-resolved Reference panel constructed from the China Kadoorie Biobank Study.
Main citation
Yu C, Lan X, Tao Y, Guo Y, ...&, Li L. (2023) A high-resolution haplotype-resolved Reference panel constructed from the China Kadoorie Biobank Study. Nucleic Acids Res, 51 (21) 11770-11782. doi:10.1093/nar/gkad779. PMID 37870428
ABSTRACT
Precision medicine depends on high-accuracy individual-level genotype data. However, the whole-genome sequencing (WGS) is still not suitable for gigantic studies due to budget constraints. It is particularly important to construct highly accurate haplotype reference panel for genotype imputation. In this study, we used 10 000 samples with medium-depth WGS to construct a reference panel that we named the CKB reference panel. By imputing microarray datasets, it showed that the CKB panel outperformed compared panels in terms of both the number of well-imputed variants and imputation accuracy. In addition, we have completed the imputation of 100 706 microarrays with the CKB panel, and the after-imputed data is the hitherto largest whole genome data of the Chinese population. Furthermore, in the GWAS analysis of real phenotype height, the number of tested SNPs tripled and the number of significant SNPs doubled after imputation. Finally, we developed an online server for offering free genotype imputation service based on the CKB reference panel (https://db.cngb.org/imputation/). We believe that the CKB panel is of great value for imputing microarray or low-coverage genotype data of Chinese population, and potentially mixed populations. The imputation-completed 100 706 microarray data are enormous and precious resources of population genetic studies for complex traits and diseases.
DOI
10.1093/nar/gkad779
Cmplot
PUBMED_LINK
DESCRIPTION
an R package for drawing circular and rectangular Manhattan plots and Q-Q plots of GWAS results
URL
TITLE
rMVP: A Memory-efficient, Visualization-enhanced, and Parallel-accelerated Tool for Genome-wide Association Study.
Main citation
Yin L, Zhang H, Tang Z, Xu J, ...&, Liu X. (2021) rMVP: A Memory-efficient, Visualization-enhanced, and Parallel-accelerated Tool for Genome-wide Association Study. Genomics Proteomics Bioinformatics, 19 (4) 619-628. doi:10.1016/j.gpb.2020.10.007. PMID 33662620
ABSTRACT
Along with the development of high-throughput sequencing technologies, both sample size and SNP number are increasing rapidly in genome-wide association studies (GWAS), and the associated computation is more challenging than ever. Here, we present a memory-efficient, visualization-enhanced, and parallel-accelerated R package called "rMVP" to address the need for improved GWAS computation. rMVP can 1) effectively process large GWAS data, 2) rapidly evaluate population structure, 3) efficiently estimate variance components by Efficient Mixed-Model Association eXpedited (EMMAX), Factored Spectrally Transformed Linear Mixed Models (FaST-LMM), and Haseman-Elston (HE) regression algorithms, 4) implement parallel-accelerated association tests of markers using general linear model (GLM), mixed linear model (MLM), and fixed and random model circulating probability unification (FarmCPU) methods, 5) compute fast with a globally efficient design in the GWAS processes, and 6) generate various visualizations of GWAS-related information. Accelerated by block matrix multiplication strategy and multiple threads, the association test methods embedded in rMVP are significantly faster than PLINK, GEMMA, and FarmCPU_pkg. rMVP is freely available at https://github.com/xiaolei-lab/rMVP.
DOI
10.1016/j.gpb.2020.10.007
CMS
PUBMED_LINK
FULL NAME
Composite of multiple signals
DESCRIPTION
a method that combines tests for multiple signals of selection to localize causal variants within regions of positive selection
TITLE
A composite of multiple signals distinguishes causal variants in regions of positive selection.
Main citation
Grossman SR, Shlyakhter I, Karlsson EK, Byrne EH, ...&, Sabeti PC. (2010) A composite of multiple signals distinguishes causal variants in regions of positive selection. Science, 327 (5967) 883-6. doi:10.1126/science.1183863. PMID 20056855
ABSTRACT
The human genome contains hundreds of regions whose patterns of genetic variation indicate recent positive natural selection, yet for most the underlying gene and the advantageous mutation remain unknown. We developed a method, composite of multiple signals (CMS), that combines tests for multiple signals of selection and increases resolution by up to 100-fold. By applying CMS to candidate regions from the International Haplotype Map, we localized population-specific selective signals to 55 kilobases (median), identifying known and novel causal variants. CMS can not just identify individual loci but implicates precise variants selected by evolution.
DOI
10.1126/science.1183863
CNGB Imputation Service (CNGB)
PUBMED_LINK
FULL NAME
China National GeneBank
URL
TITLE
A high-resolution haplotype-resolved Reference panel constructed from the China Kadoorie Biobank Study.
Main citation
Yu C, Lan X, Tao Y, Guo Y, ...&, Li L. (2023) A high-resolution haplotype-resolved Reference panel constructed from the China Kadoorie Biobank Study. Nucleic Acids Res, 51 (21) 11770-11782. doi:10.1093/nar/gkad779. PMID 37870428
ABSTRACT
Precision medicine depends on high-accuracy individual-level genotype data. However, the whole-genome sequencing (WGS) is still not suitable for gigantic studies due to budget constraints. It is particularly important to construct highly accurate haplotype reference panel for genotype imputation. In this study, we used 10 000 samples with medium-depth WGS to construct a reference panel that we named the CKB reference panel. By imputing microarray datasets, it showed that the CKB panel outperformed compared panels in terms of both the number of well-imputed variants and imputation accuracy. In addition, we have completed the imputation of 100 706 microarrays with the CKB panel, and the after-imputed data is the hitherto largest whole genome data of the Chinese population. Furthermore, in the GWAS analysis of real phenotype height, the number of tested SNPs tripled and the number of significant SNPs doubled after imputation. Finally, we developed an online server for offering free genotype imputation service based on the CKB reference panel (https://db.cngb.org/imputation/). We believe that the CKB panel is of great value for imputing microarray or low-coverage genotype data of Chinese population, and potentially mixed populations. The imputation-completed 100 706 microarray data are enormous and precious resources of population genetic studies for complex traits and diseases.
DOI
10.1093/nar/gkad779
CoCoNet
PUBMED_LINK
DESCRIPTION
CoCoNet is a composite likelihood-based covariance regression network model for identifying trait-relevant tissues or cell types.
URL
KEYWORDS
composite likelihood-based inference algorithm
TITLE
Leveraging gene co-expression patterns to infer trait-relevant tissues in genome-wide association studies.
Main citation
Shang L, Smith JA, Zhou X. (2020) Leveraging gene co-expression patterns to infer trait-relevant tissues in genome-wide association studies. PLoS Genet, 16 (4) e1008734. doi:10.1371/journal.pgen.1008734. PMID 32310941
ABSTRACT
Genome-wide association studies (GWASs) have identified many SNPs associated with various common diseases. Understanding the biological functions of these identified SNP associations requires identifying disease/trait relevant tissues or cell types. Here, we develop a network method, CoCoNet, to facilitate the identification of trait-relevant tissues or cell types. Different from existing approaches, CoCoNet incorporates tissue-specific gene co-expression networks constructed from either bulk or single cell RNA sequencing (RNAseq) studies with GWAS data for trait-tissue inference. In particular, CoCoNet relies on a covariance regression network model to express gene-level effect measurements for the given GWAS trait as a function of the tissue-specific co-expression adjacency matrix. With a composite likelihood-based inference algorithm, CoCoNet is scalable to tens of thousands of genes. We validate the performance of CoCoNet through extensive simulations. We apply CoCoNet for an in-depth analysis of four neurological disorders and four autoimmune diseases, where we integrate the corresponding GWASs with bulk RNAseq data from 38 tissues and single cell RNAseq data from 10 cell types. In the real data applications, we show how CoCoNet can help identify specific glial cell types relevant for neurological disorders and identify disease-targeted colon tissues as relevant for autoimmune diseases.
DOI
10.1371/journal.pgen.1008734
Coloc
PUBMED_LINK
URL
KEYWORDS
Approximate Bayes Factor (ABF)
TITLE
Bayesian test for colocalisation between pairs of genetic association studies using summary statistics.
Main citation
Giambartolomei C, Vukcevic D, Schadt EE, Franke L, ...&, Plagnol V. (2014) Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS Genet, 10 (5) e1004383. doi:10.1371/journal.pgen.1004383. PMID 24830394
ABSTRACT
Genetic association studies, in particular the genome-wide association study (GWAS) design, have provided a wealth of novel insights into the aetiology of a wide range of human diseases and traits, in particular cardiovascular diseases and lipid biomarkers. The next challenge consists of understanding the molecular basis of these associations. The integration of multiple association datasets, including gene expression datasets, can contribute to this goal. We have developed a novel statistical methodology to assess whether two association signals are consistent with a shared causal variant. An application is the integration of disease scans with expression quantitative trait locus (eQTL) studies, but any pair of GWAS datasets can be integrated in this framework. We demonstrate the value of the approach by re-analysing a gene expression dataset in 966 liver samples with a published meta-analysis of lipid traits including >100,000 individuals of European ancestry. Combining all lipid biomarkers, our re-analysis supported 26 out of 38 reported colocalisation results with eQTLs and identified 14 new colocalisation results, hence highlighting the value of a formal statistical test. In three cases of reported eQTL-lipid pairs (SYPL2, IFT172, TBKBP1) for which our analysis suggests that the eQTL pattern is not consistent with the lipid association, we identify alternative colocalisation results with SORT1, GCKR, and KPNB1, indicating that these genes are more likely to be causal in these genomic intervals. A key feature of the method is the ability to derive the output statistics from single SNP summary statistics, hence making it possible to perform systematic meta-analysis type comparisons across multiple GWAS datasets (implemented online at http://coloc.cs.ucl.ac.uk/coloc/). 
Our methodology provides information about candidate causal genes in associated intervals and has direct implications for the understanding of complex diseases as well as the design of drugs to target disease pathways.
DOI
10.1371/journal.pgen.1004383
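Illustrative sketch for the Coloc entry above. It is not the coloc R package. It is a minimal Python reading of the Approximate Bayes Factor (ABF) approach named in the keywords: a Wakefield-style log ABF is computed per SNP from its effect estimate and standard error, and the per-SNP values for the two traits are combined into posterior probabilities for five hypotheses (H0: no association; H1 and H2: association with one trait only; H3: both traits, distinct causal variants; H4: both traits, shared causal variant). The function names, the prior standard deviation `prior_sd`, and the priors `p1`, `p2`, `p12` are assumptions for illustration.

```python
import math


def log_abf(beta, se, prior_sd=0.15):
    """Wakefield-style log approximate Bayes factor for one SNP.

    prior_sd is the assumed prior standard deviation of the effect size.
    """
    v = se * se
    r = prior_sd ** 2 / (prior_sd ** 2 + v)
    z = beta / se
    return 0.5 * (math.log(1 - r) + r * z * z)


def logsumexp(xs):
    top = max(xs)
    return top + math.log(sum(math.exp(x - top) for x in xs))


def coloc_posteriors(labf1, labf2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities of H0..H4 under a single-causal-variant assumption.

    labf1 and labf2 are per-SNP log ABFs for the same SNPs in the same order.
    Needs at least two SNPs so that H3 (distinct variants) is defined.
    """
    n = len(labf1)
    shared = logsumexp([a + b for a, b in zip(labf1, labf2)])
    # H3 sums over ordered pairs of different SNPs; done directly here for clarity.
    distinct = logsumexp(
        [labf1[i] + labf2[j] for i in range(n) for j in range(n) if i != j]
    )
    log_h = {
        "H0": 0.0,
        "H1": math.log(p1) + logsumexp(labf1),
        "H2": math.log(p2) + logsumexp(labf2),
        "H3": math.log(p1) + math.log(p2) + distinct,
        "H4": math.log(p12) + shared,
    }
    norm = logsumexp(list(log_h.values()))
    return {h: math.exp(v - norm) for h, v in log_h.items()}


# Toy region (made-up effect sizes): both traits peak at the first SNP.
trait1 = [log_abf(0.30, 0.04), log_abf(0.05, 0.04), log_abf(0.02, 0.04)]
trait2 = [log_abf(0.25, 0.04), log_abf(0.03, 0.04), log_abf(0.01, 0.04)]
print(coloc_posteriors(trait1, trait2))
```

Because only per-SNP effect estimates and standard errors enter the calculation, the same routine applies to any pair of summary-statistic datasets covering the same region.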
Coloc-susie
PUBMED_LINK
URL
KEYWORDS
Approximate Bayes Factor (ABF), Sum of Single Effects (SuSiE)
TITLE
A more accurate method for colocalisation analysis allowing for multiple causal variants.
Main citation
Wallace C. (2021) A more accurate method for colocalisation analysis allowing for multiple causal variants. PLoS Genet, 17 (9) e1009440. doi:10.1371/journal.pgen.1009440. PMID 34587156
ABSTRACT
In genome-wide association studies (GWAS) it is now common to search for, and find, multiple causal variants located in close proximity. It has also become standard to ask whether different traits share the same causal variants, but one of the popular methods to answer this question, coloc, makes the simplifying assumption that only a single causal variant exists for any given trait in any genomic region. Here, we examine the potential of the recently proposed Sum of Single Effects (SuSiE) regression framework, which can be used for fine-mapping genetic signals, for use with coloc. SuSiE is a novel approach that allows evidence for association at multiple causal variants to be evaluated simultaneously, whilst separating the statistical support for each variant conditional on the causal signal being considered. We show this results in more accurate coloc inference than other proposals to adapt coloc for multiple causal variants based on conditioning. We therefore recommend that coloc be used in combination with SuSiE to optimise accuracy of colocalisation analyses when multiple causal variants exist.
DOI
10.1371/journal.pgen.1009440
Comparison
PUBMED_LINK
TITLE
Synergistic insights into human health from aptamer- and antibody-based proteomic profiling.
Main citation
Pietzner M, Wheeler E, Carrasco-Zanini J, Kerrison ND, ...&, Langenberg C. (2021) Synergistic insights into human health from aptamer- and antibody-based proteomic profiling. Nat Commun, 12 (1) 6822. doi:10.1038/s41467-021-27164-0. PMID 34819519
ABSTRACT
Affinity-based proteomics has enabled scalable quantification of thousands of protein targets in blood enhancing biomarker discovery, understanding of disease mechanisms, and genetic evaluation of drug targets in humans through protein quantitative trait loci (pQTLs). Here, we integrate two partly complementary techniques-the aptamer-based SomaScan® v4 assay and the antibody-based Olink assays-to systematically assess phenotypic consequences of hundreds of pQTLs discovered for 871 protein targets across both platforms. We create a genetically anchored cross-platform proteome-phenome network comprising 547 protein-phenotype connections, 36.3% of which were only seen with one of the two platforms suggesting that both techniques capture distinct aspects of protein biology. We further highlight discordance of genetically predicted effect directions between assays, such as for PILRA and Alzheimer's disease. Our results showcase the synergistic nature of these technologies to better understand and identify disease mechanisms and provide a benchmark for future cross-platform discoveries.
DOI
10.1038/s41467-021-27164-0
Concepts&Principals
PUBMED_LINK
TITLE
Interpreting Mendelian-randomization estimates of the effects of categorical exposures such as disease status and educational attainment.
Main citation
Howe LJ, Tudball M, Davey Smith G, Davies NM. (2022) Interpreting Mendelian-randomization estimates of the effects of categorical exposures such as disease status and educational attainment. Int J Epidemiol, 51 (3) 948-957. doi:10.1093/ije/dyab208. PMID 34570226
ABSTRACT
BACKGROUND: Mendelian randomization has been previously used to estimate the effects of binary and ordinal categorical exposures-e.g. Type 2 diabetes or educational attainment defined by qualification-on outcomes. Binary and categorical phenotypes can be modelled in terms of liability-an underlying latent continuous variable with liability thresholds separating individuals into categories. Genetic variants influence an individual's categorical exposure via their effects on liability, thus Mendelian-randomization analyses with categorical exposures will capture effects of liability that act independently of exposure category. METHODS AND RESULTS: We discuss how groups in which the categorical exposure is invariant can be used to detect liability effects acting independently of exposure category. For example, associations between an adult educational-attainment polygenic score (PGS) and body mass index measured before the minimum school leaving age (e.g. age 10 years), cannot indicate the effects of years in full-time education on this outcome. Using UK Biobank data, we show that a higher educational-attainment PGS is strongly associated with lower smoking initiation and higher odds of glasses use at age 15 years. These associations were replicated in sibling models. An orthogonal approach using the raising of the school leaving age (ROSLA) policy change found that individuals who chose to remain in education to age 16 years before the reform likely had higher liability to educational attainment than those who were compelled to remain in education to age 16 years after the reform, and had higher income, lower pack-years of smoking, higher odds of glasses use and lower deprivation in adulthood. These results suggest that liability to educational attainment is associated with health and social outcomes independently of years in full-time education. 
CONCLUSIONS: Mendelian-randomization studies with non-continuous exposures should be interpreted in terms of liability, which may affect the outcome via changes in exposure category and/or independently.
DOI
10.1093/ije/dyab208
CookHLA
PUBMED_LINK
URL
TITLE
Accurate imputation of human leukocyte antigens with CookHLA.
Main citation
Cook S, Choi W, Lim H, Luo Y, ...&, Han B. (2021) Accurate imputation of human leukocyte antigens with CookHLA. Nat Commun, 12 (1) 1264. doi:10.1038/s41467-021-21541-5. PMID 33627654
ABSTRACT
The recent development of imputation methods enabled the prediction of human leukocyte antigen (HLA) alleles from intergenic SNP data, allowing studies to fine-map HLA for immune phenotypes. Here we report an accurate HLA imputation method, CookHLA, which has superior imputation accuracy compared to previous methods. CookHLA differs from other approaches in that it locally embeds prediction markers into highly polymorphic exons to account for exonic variability, and in that it adaptively learns the genetic map within MHC from the data to facilitate imputation. Our benchmarking with real datasets shows that our method achieves high imputation accuracy in a wide range of scenarios, including situations where the reference panel is small or ethnically unmatched.
DOI
10.1038/s41467-021-21541-5
CoPheScan
PUBMED_LINK
FULL NAME
Coloc adapted Phenome-wide Scan
URL
TITLE
CoPheScan: phenome-wide association studies accounting for linkage disequilibrium.
Main citation
Manipur I, Reales G, Sul JH, Shin MK, ...&, Wallace C. (2024) CoPheScan: phenome-wide association studies accounting for linkage disequilibrium. Nat Commun, 15 (1) 5862. doi:10.1038/s41467-024-49990-8. PMID 38997278
ABSTRACT
Phenome-wide association studies (PheWAS) facilitate the discovery of associations between a single genetic variant with multiple phenotypes. For variants which impact a specific protein, this can help identify additional therapeutic indications or on-target side effects of intervening on that protein. However, PheWAS is restricted by an inability to distinguish confounding due to linkage disequilibrium (LD) from true pleiotropy. Here we describe CoPheScan (Coloc adapted Phenome-wide Scan), a Bayesian approach that enables an intuitive and systematic exploration of causal associations while simultaneously addressing LD confounding. We demonstrate its performance through simulation, showing considerably better control of false positive rates than a conventional approach not accounting for LD. We used CoPheScan to perform PheWAS of protein-truncating variants and fine-mapped variants from disease and pQTL studies, in 2275 disease phenotypes from the UK Biobank. Our results identify the complexity of known pleiotropic genes such as APOE, and suggest a new causal role for TGM3 in skin cancer.
DOI
10.1038/s41467-024-49990-8
corrplot
DESCRIPTION
The R package corrplot provides a visual exploratory tool for correlation matrices that supports automatic variable reordering to help detect hidden patterns among variables.
URL
COWAS
PUBMED_LINK
FULL NAME
Co-expression-wide association study
DESCRIPTION
Co-expression-wide association study (COWAS) extends TWAS/PWAS by testing whether the genetically regulated co-expression or interaction of gene or protein pairs is associated with a trait; R software and trained imputation weights are provided for follow-up with GWAS summary statistics.
URL
KEYWORDS
TWAS, PWAS, co-expression, gene-gene interaction, GWAS summary statistics
TITLE
Co-expression-wide association studies link genetically regulated interactions with complex traits.
Main citation
Malakhov MM, Pan W. (2025) Co-expression-wide association studies link genetically regulated interactions with complex traits. Nat Commun, 16 (1) 11061. doi:10.1038/s41467-025-66039-6. PMID 41381446
ABSTRACT
Transcriptome- and proteome-wide association studies (TWAS/PWAS) have proven successful in prioritizing genes and proteins whose genetically regulated expression modulates disease risk, but they ignore potential co-expression and interaction effects. To address this limitation, we introduce the co-expression-wide association study (COWAS) method, which can identify pairs of genes or proteins whose genetically regulated co-expression is associated with complex traits. COWAS first trains models to predict expression and co-expression from genetic variation, and then tests for association between imputed co-expression and the trait of interest while also accounting for direct effects from each exposure. We applied our method to plasma proteomic concentrations from the UK Biobank, identifying dozens of interacting protein pairs associated with cholesterol levels, Alzheimer's disease, and Parkinson's disease. Notably, our results demonstrate that co-expression between proteins may affect complex traits even if neither protein is detected to influence the trait when considered on its own. We also show how COWAS can help to disentangle direct and interaction effects, providing a richer picture of the molecular networks that mediate genetic effects on disease outcomes.
DOI
10.1038/s41467-025-66039-6
cross-trait LDSC
PUBMED_LINK
FULL NAME
cross-trait LD Score Regression
DESCRIPTION
ldsc is a command line tool for estimating heritability and genetic correlation from GWAS summary statistics. ldsc also computes LD Scores.
URL
KEYWORDS
cross-trait, LD score regression
TITLE
An atlas of genetic correlations across human diseases and traits.
Main citation
Bulik-Sullivan B, Finucane HK, Anttila V, Gusev A, ...&, Neale BM. (2015) An atlas of genetic correlations across human diseases and traits. Nat Genet, 47 (11) 1236-41. doi:10.1038/ng.3406. PMID 26414676
ABSTRACT
Identifying genetic correlations between complex traits and diseases can provide useful etiological insights and help prioritize likely causal relationships. The major challenges preventing estimation of genetic correlation from genome-wide association study (GWAS) data with current methods are the lack of availability of individual-level genotype data and widespread sample overlap among meta-analyses. We circumvent these difficulties by introducing a technique-cross-trait LD Score regression-for estimating genetic correlation that requires only GWAS summary statistics and is not biased by sample overlap. We use this method to estimate 276 genetic correlations among 24 traits. The results include genetic correlations between anorexia nervosa and schizophrenia, anorexia and obesity, and educational attainment and several diseases. These results highlight the power of genome-wide analyses, as there currently are no significantly associated SNPs for anorexia nervosa and only three for educational attainment.
DOI
10.1038/ng.3406
cS2G
PUBMED_LINK
FULL NAME
optimal combined S2G strategy
DESCRIPTION
heritability-based framework for evaluating and combining different S2G strategies to optimize their informativeness for common disease risk
URL
TITLE
Combining SNP-to-gene linking strategies to identify disease genes and assess disease omnigenicity.
Main citation
Gazal S, Weissbrod O, Hormozdiari F, Dey KK, ...&, Price AL. (2022) Combining SNP-to-gene linking strategies to identify disease genes and assess disease omnigenicity. Nat Genet, 54 (6) 827-836. doi:10.1038/s41588-022-01087-y. PMID 35668300
ABSTRACT
Disease-associated single-nucleotide polymorphisms (SNPs) generally do not implicate target genes, as most disease SNPs are regulatory. Many SNP-to-gene (S2G) linking strategies have been developed to link regulatory SNPs to the genes that they regulate in cis. Here, we developed a heritability-based framework for evaluating and combining different S2G strategies to optimize their informativeness for common disease risk. Our optimal combined S2G strategy (cS2G) included seven constituent S2G strategies and achieved a precision of 0.75 and a recall of 0.33, more than doubling the recall of any individual strategy. We applied cS2G to fine-mapping results for 49 UK Biobank diseases/traits to predict 5,095 causal SNP-gene-disease triplets (with S2G-derived functional interpretation) with high confidence. We further applied cS2G to provide an empirical assessment of disease omnigenicity; we determined that the top 1% of genes explained roughly half of the SNP heritability linked to all genes and that gene-level architectures vary with variant allele frequency.
DOI
10.1038/s41588-022-01087-y
CT-SLEB
PUBMED_LINK
DESCRIPTION
CT-SLEB is a method designed to generate multi-ancestry PRSs that incorporate existing large GWAS from EUR populations and smaller GWAS from non-EUR populations. The method has three key steps: 1. Clumping and Thresholding for selecting SNPs to be included in a PRS for the target population; 2. Empirical-Bayes method for estimating the coefficients of the SNPs; 3. Super-learning model to combine a series of PRSs generated under different SNP selection thresholds.
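To make step 1 concrete, here is a minimal sketch of generic single-ancestry clumping and thresholding, not the CT-SLEB code itself. The function name, the input layout (tuples of name, position, p-value, plus a dictionary of pairwise r² values), and the default thresholds are all assumptions chosen for illustration.

```python
def clump_and_threshold(snps, r2, p_max=5e-8, r2_max=0.1, window=250_000):
    """Greedy clumping with a p-value cutoff.

    snps: list of (name, position, p_value) tuples on one chromosome.
    r2:   dict mapping frozenset({name_a, name_b}) -> pairwise r^2 (missing = 0).
    Returns the names of retained index SNPs, most significant first.
    """
    kept = []
    for name, pos, p in sorted(snps, key=lambda s: s[2]):
        if p > p_max:
            break  # sorted by p-value, so no later SNP can pass the threshold
        # Drop the SNP if it is close to, and correlated with, a retained SNP.
        in_ld = any(
            abs(pos - kept_pos) <= window
            and r2.get(frozenset((name, kept_name)), 0.0) > r2_max
            for kept_name, kept_pos in kept
        )
        if not in_ld:
            kept.append((name, pos))
    return [name for name, _ in kept]

# Toy input (hypothetical SNPs and LD values).
snps = [
    ("rs1", 1_000, 1e-10),
    ("rs2", 2_000, 1e-9),
    ("rs3", 900_000, 1e-8),
    ("rs4", 3_000, 0.2),
]
r2 = {frozenset(("rs1", "rs2")): 0.8}

strict = clump_and_threshold(snps, r2)
lenient = clump_and_threshold(snps, r2, p_max=1.0)
print(strict, lenient)
```

Running the same function over a grid of `p_max` and `r2_max` values yields the series of candidate SNP sets that a later combining step, such as the super-learning model in step 3, could draw on.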
URL
TITLE
A new method for multiancestry polygenic prediction improves performance across diverse populations.
Main citation
Zhang H, Zhan J, Jin J, Zhang J, ...&, Chatterjee N. (2023) A new method for multiancestry polygenic prediction improves performance across diverse populations. Nat Genet, 55 (10) 1757-1768. doi:10.1038/s41588-023-01501-z. PMID 37749244
ABSTRACT
Polygenic risk scores (PRSs) increasingly predict complex traits; however, suboptimal performance in non-European populations raises concerns about clinical applications and health inequities. We developed CT-SLEB, a powerful and scalable method to calculate PRSs, using ancestry-specific genome-wide association study summary statistics from multiancestry training samples, integrating clumping and thresholding, empirical Bayes and superlearning. We evaluated CT-SLEB and nine alternative methods with large-scale simulated genome-wide association studies (~19 million common variants) and datasets from 23andMe, Inc., the Global Lipids Genetics Consortium, All of Us and UK Biobank, involving 5.1 million individuals of diverse ancestry, with 1.18 million individuals from four non-European populations across 13 complex traits. Results demonstrated that CT-SLEB significantly improves PRS performance in non-European populations compared with simple alternatives, with comparable or superior performance to a recent, computationally intensive method. Moreover, our simulation studies offered insights into sample size requirements and SNP density effects on multiancestry risk prediction.
DOI
10.1038/s41588-023-01501-z
cTWAS
PUBMED_LINK
FULL NAME
causal-TWAS
DESCRIPTION
Expression quantitative trait loci (eQTLs) have often been used to nominate candidate genes from genome-wide association studies (GWAS). However, commonly used methods are susceptible to false positives, largely due to linkage disequilibrium of eQTLs with causal variants acting on the phenotype directly. Our method, causal-TWAS (cTWAS), addresses this challenge by borrowing ideas from statistical fine-mapping. It is a generalization of transcriptome-wide association studies (TWAS), but when analyzing any gene, it adjusts for other nearby genes and all nearby genetic variants.
URL
KEYWORDS
TWAS, fine-mapping
TITLE
Adjusting for genetic confounders in transcriptome-wide association studies improves discovery of risk genes of complex traits.
Main citation
Zhao S, Crouse W, Qian S, Luo K, ...&, He X. (2024) Adjusting for genetic confounders in transcriptome-wide association studies improves discovery of risk genes of complex traits. Nat Genet, 56 (2) 336-347. doi:10.1038/s41588-023-01648-9. PMID 38279041
ABSTRACT
Many methods have been developed to leverage expression quantitative trait loci (eQTL) data to nominate candidate genes from genome-wide association studies. These methods include colocalization, transcriptome-wide association studies (TWAS) and Mendelian randomization-based methods; however, all suffer from a key problem: when assessing the role of a gene in a trait using its eQTLs, nearby variants and genetic components of other genes' expression may be correlated with these eQTLs and have direct effects on the trait, acting as potential confounders. Our extensive simulations showed that existing methods fail to account for these 'genetic confounders', resulting in severe inflation of false positives. Our new method, causal-TWAS (cTWAS), borrows ideas from statistical fine-mapping and allows us to adjust all genetic confounders. cTWAS showed calibrated false discovery rates in simulations, and its application on several common traits discovered new candidate genes. In conclusion, cTWAS provides a robust statistical framework for gene discovery.
DOI
10.1038/s41588-023-01648-9
Ctyper
PUBMED_LINK
DESCRIPTION
Ctyper genotypes sequence-resolved copy-number variation and other complex polymorphic genes using a pangenome reference matrix, enabling allele- and copy-aware calls at scale for biobank-style cohorts.
URL
KEYWORDS
CNV, copy number, pangenome, sequence-resolved, biobank scale
TITLE
Genotyping sequence-resolved copy number variation using pangenomes reveals paralog-specific global diversity and expression divergence of duplicated genes.
Main citation
Ma W, Chaisson MJP. (2025) Genotyping sequence-resolved copy number variation using pangenomes reveals paralog-specific global diversity and expression divergence of duplicated genes. Nat Genet, 57 (11) 2909-2919. doi:10.1038/s41588-025-02346-4. PMID 41107550
ABSTRACT
Copy number variable (CNV) genes are important in evolution and disease, yet their sequence variation remains a blind spot in large-scale studies. We present ctyper, a method that leverages pangenomes to produce allele-specific copy numbers with locally phased variants from next-generation sequencing samples. Benchmarking on 3,351 CNV genes and 212 challenging medically relevant (CMR) genes, ctyper captures 96.5% of phased variants with ≥99.1% correctness of copy number in CNV genes and 94.8% of phased variants in CMR genes. Ctyper takes 1.5 h to genotype a genome on one CPU. The ctyper genotypes give a 4.81-fold improvement in predictions of gene expression compared to known expression quantitative trait locus (eQTL) variants. Allele-specific expression quantified divergent expression in 7.94% of paralogs and tissue-specific biases in 4.68%. We found reduced expression of SMN2 due to SMN1 conversion, potentially affecting spinal muscular atrophy, and increased expression of translocated duplications of AMY2B. Overall, ctyper enables biobank-scale genotyping of CNV and CMR genes.
DOI
10.1038/s41588-025-02346-4
DBSLMM
PUBMED_LINK
FULL NAME
Deterministic Bayesian Sparse Linear Mixed Model
DESCRIPTION
There are two versions of DBSLMM: the tuning version and the deterministic version. The tuning version examines three different heritability choices and requires validation data to tune the heritability hyper-parameter. The deterministic version uses one heritability estimate and directly fits the model in the training data without separate validation data. Both versions require reference data to compute the SNP correlation matrix. In our experience, the tuning version may work more accurately than the deterministic version.
URL
TITLE
Accurate and Scalable Construction of Polygenic Scores in Large Biobank Data Sets.
Main citation
Yang S, Zhou X. (2020) Accurate and Scalable Construction of Polygenic Scores in Large Biobank Data Sets. Am J Hum Genet, 106 (5) 679-693. doi:10.1016/j.ajhg.2020.03.013. PMID 32330416
ABSTRACT
Accurate construction of polygenic scores (PGS) can enable early diagnosis of diseases and facilitate the development of personalized medicine. Accurate PGS construction requires prediction models that are both adaptive to different genetic architectures and scalable to biobank scale datasets with millions of individuals and tens of millions of genetic variants. Here, we develop such a method called Deterministic Bayesian Sparse Linear Mixed Model (DBSLMM). DBSLMM relies on a flexible modeling assumption on the effect size distribution to achieve robust and accurate prediction performance across a range of genetic architectures. DBSLMM also relies on a simple deterministic search algorithm to yield an approximate analytic estimation solution using summary statistics only. The deterministic search algorithm, when paired with further algebraic innovations, results in substantial computational savings. With simulations, we show that DBSLMM achieves scalable and accurate prediction performance across a range of realistic genetic architectures. We then apply DBSLMM to analyze 25 traits in UK Biobank. For these traits, compared to existing approaches, DBSLMM achieves an average of 2.03%-101.09% accuracy gain in internal cross-validations. In external validations on two separate datasets, including one from BioBank Japan, DBSLMM achieves an average of 14.74%-522.74% accuracy gain. In these real data applications, DBSLMM is 1.03-28.11 times faster and uses only 7.4%-24.8% of physical memory as compared to other multiple regression-based PGS methods. Overall, DBSLMM represents an accurate and scalable method for constructing PGS in biobank scale datasets.
DOI
10.1016/j.ajhg.2020.03.013
DDx-PRS
FULL NAME
Differential Diagnosis-Polygenic Risk Score
DESCRIPTION
The DDxPRS R function provides a tool for distinguishing different disorders based on polygenic prediction.
URL
Main citation
Peyrot, W. J., Panagiotaropoulou, G., Olde Loohuis, L. M., Adams, M., Awasthi, S., Ge, T., ... & Price, A. L. (2024). Distinguishing different psychiatric disorders using DDx-PRS. medRxiv, 2024-02.
DEEP*HLA
PUBMED_LINK
URL
TITLE
A deep learning method for HLA imputation and trans-ethnic MHC fine-mapping of type 1 diabetes.
Main citation
Naito T, Suzuki K, Hirata J, Kamatani Y, ...&, Okada Y. (2021) A deep learning method for HLA imputation and trans-ethnic MHC fine-mapping of type 1 diabetes. Nat Commun, 12 (1) 1639. doi:10.1038/s41467-021-21975-x. PMID 33712626
ABSTRACT
Conventional human leukocyte antigen (HLA) imputation methods drop their performance for infrequent alleles, which is one of the factors that reduce the reliability of trans-ethnic major histocompatibility complex (MHC) fine-mapping due to inter-ethnic heterogeneity in allele frequency spectra. We develop DEEP*HLA, a deep learning method for imputing HLA genotypes. Through validation using the Japanese and European HLA reference panels (n = 1,118 and 5,122), DEEP*HLA achieves the highest accuracies with significant superiority for low-frequency and rare alleles. DEEP*HLA is less dependent on distance-dependent linkage disequilibrium decay of the target alleles and might capture the complicated region-wide information. We apply DEEP*HLA to type 1 diabetes GWAS data from BioBank Japan (n = 62,387) and UK Biobank (n = 354,459), and successfully disentangle independently associated class I and II HLA variants with shared risk among diverse populations (the top signal at amino acid position 71 of HLA-DRβ1; P = 7.5 × 10^-120). Our study illustrates the value of deep learning in genotype imputation and trans-ethnic MHC fine-mapping.
DOI
10.1038/s41467-021-21975-x
DEPICT
PUBMED_LINK
FULL NAME
Data-driven Expression Prioritized Integration for Complex Traits
DESCRIPTION
an integrative tool that employs predicted gene functions to systematically prioritize the most likely causal genes at associated loci, highlight enriched pathways and identify tissues/cell types where genes from associated loci are highly expressed. DEPICT is not limited to genes with established functions and prioritizes relevant gene sets for many phenotypes.
URL
KEYWORDS
co-regulation of gene expression
TITLE
Biological interpretation of genome-wide association studies using predicted gene functions.
Main citation
Pers TH, Karjalainen JM, Chan Y, Westra HJ, ...&, Franke L. (2015) Biological interpretation of genome-wide association studies using predicted gene functions. Nat Commun, 6 () 5890. doi:10.1038/ncomms6890. PMID 25597830
ABSTRACT
The main challenge for gaining biological insights from genetic associations is identifying which genes and pathways explain the associations. Here we present DEPICT, an integrative tool that employs predicted gene functions to systematically prioritize the most likely causal genes at associated loci, highlight enriched pathways and identify tissues/cell types where genes from associated loci are highly expressed. DEPICT is not limited to genes with established functions and prioritizes relevant gene sets for many phenotypes.
DOI
10.1038/ncomms6890
DOG
PUBMED_LINK
FULL NAME
Domain Graph
DESCRIPTION
DOG (Domain Graph) is software for experimentalists to prepare publication-quality figures of protein domain structures.
URL
TITLE
DOG 1.0: illustrator of protein domain structures.
Main citation
Ren J, Wen L, Gao X, Jin C, ...&, Yao X. (2009) DOG 1.0: illustrator of protein domain structures. Cell Res, 19 (2) 271-3. doi:10.1038/cr.2009.6. PMID 19153597
DOI
10.1038/cr.2009.6
DRUG TARGETOR
PUBMED_LINK
DESCRIPTION
This website harnesses results from genome-wide association studies (GWAS), and drug bioactivity data, to prioritize drugs and targets for a given phenotype. Drug Targetor networks are constructed using genetically scored drugs and genes, connected by the type of drug-target or drug-gene interaction
URL
TITLE
Drug Targetor: a web interface to investigate the human druggome for over 500 phenotypes.
Main citation
Gaspar HA, Hübel C, Breen G. (2019) Drug Targetor: a web interface to investigate the human druggome for over 500 phenotypes. Bioinformatics, 35 (14) 2515-2517. doi:10.1093/bioinformatics/bty982. PMID 30517594
ABSTRACT
SUMMARY: Results from hundreds of genome-wide association studies (GWAS) are now freely available and offer a catalogue of the association between phenotypes across medicine with variants in the genome. With the aim of using this data to better understand therapeutic mechanisms, we have developed Drug Targetor, a web interface that allows the generation and exploration of drug-target networks of hundreds of phenotypes using GWAS data. Drug Targetor networks consist of drug and target nodes ordered by genetic association and connected by drug-target or drug-gene relationship. We show that Drug Targetor can help prioritize drugs, targets and drug-target interactions for a specific phenotype based on genetic evidence. AVAILABILITY AND IMPLEMENTATION: Drug Targetor v1.21 is a web application freely available online at drugtargetor.com and under MIT licence. The source code can be found at https://github.com/hagax8/drugtargetor. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
DOI
10.1093/bioinformatics/bty982
EAGLE
PUBMED_LINK
DESCRIPTION
(EAGLE1)
URL
TITLE
Fast and accurate long-range phasing in a UK Biobank cohort.
Main citation
Loh PR, Palamara PF, Price AL. (2016) Fast and accurate long-range phasing in a UK Biobank cohort. Nat Genet, 48 (7) 811-6. doi:10.1038/ng.3571. PMID 27270109
ABSTRACT
Recent work has leveraged the extensive genotyping of the Icelandic population to perform long-range phasing (LRP), enabling accurate imputation and association analysis of rare variants in target samples typed on genotyping arrays. Here we develop a fast and accurate LRP method, Eagle, that extends this paradigm to populations with much smaller proportions of genotyped samples by harnessing long (>4-cM) identical-by-descent (IBD) tracts shared among distantly related individuals. We applied Eagle to N ≈ 150,000 samples (0.2% of the British population) from the UK Biobank, and we determined that it is 1-2 orders of magnitude faster than existing methods while achieving similar or better phasing accuracy (switch error rate ≈ 0.3%, corresponding to perfect phase in a majority of 10-Mb segments). We also observed that, when used within an imputation pipeline, Eagle prephasing improved downstream imputation accuracy in comparison to prephasing in batches using existing methods, as necessary to achieve comparable computational cost.
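The switch error rate quoted in this abstract can be illustrated with a small sketch. It uses a common definition: among heterozygous sites, count how often the inferred phase orientation flips between consecutive sites. The function name and the string encoding of haplotypes are assumptions for illustration, and the paper's exact evaluation procedure may differ.

```python
def switch_error_rate(true_h1, true_h2, inferred_h1):
    """Fraction of consecutive heterozygous-site pairs whose phase orientation flips.

    true_h1, true_h2: the two true haplotypes, as equal-length strings of alleles.
    inferred_h1:      one inferred haplotype (the other is its complement at het sites).
    """
    # Phase is only defined at heterozygous sites.
    het = [i for i, (a, b) in enumerate(zip(true_h1, true_h2)) if a != b]
    if len(het) < 2:
        raise ValueError("need at least two heterozygous sites")
    # True where the inferred haplotype agrees with the first true haplotype.
    agrees = [inferred_h1[i] == true_h1[i] for i in het]
    switches = sum(x != y for x, y in zip(agrees, agrees[1:]))
    return switches / (len(het) - 1)

# Toy example (hypothetical data): one switch between the second and third sites.
one_switch = switch_error_rate("01011", "10100", "01100")
# Reporting the haplotypes in the opposite order is not an error.
swapped_labels = switch_error_rate("01011", "10100", "10100")
print(one_switch, swapped_labels)
```

Because only changes in orientation are counted, a single switch is penalised once even though every downstream site then disagrees with the first true haplotype.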
DOI
10.1038/ng.3571
EAGLE2
PUBMED_LINK
DESCRIPTION
(EAGLE2)
URL
TITLE
Reference-based phasing using the Haplotype Reference Consortium panel.
Main citation
Loh PR, Danecek P, Palamara PF, Fuchsberger C, ...&, Price AL. (2016) Reference-based phasing using the Haplotype Reference Consortium panel. Nat Genet, 48 (11) 1443-1448. doi:10.1038/ng.3679. PMID 27694958
ABSTRACT
Haplotype phasing is a fundamental problem in medical and population genetics. Phasing is generally performed via statistical phasing in a genotyped cohort, an approach that can yield high accuracy in very large cohorts but attains lower accuracy in smaller cohorts. Here we instead explore the paradigm of reference-based phasing. We introduce a new phasing algorithm, Eagle2, that attains high accuracy across a broad range of cohort sizes by efficiently leveraging information from large external reference panels (such as the Haplotype Reference Consortium; HRC) using a new data structure based on the positional Burrows-Wheeler transform. We demonstrate that Eagle2 attains a ∼20× speedup and ∼10% increase in accuracy compared to reference-based phasing using SHAPEIT2. On European-ancestry samples, Eagle2 with the HRC panel achieves >2× the accuracy of 1000 Genomes-based phasing. Eagle2 is open source and freely available for HRC-based phasing via the Sanger Imputation Service and the Michigan Imputation Server.
DOI
10.1038/ng.3679
eCAVIAR
PUBMED_LINK
FULL NAME
eQTL and GWAS Causal Variant Identification in Associated Regions
URL
TITLE
Colocalization of GWAS and eQTL Signals Detects Target Genes.
Main citation
Hormozdiari F, van de Bunt M, Segrè AV, Li X, ...&, Eskin E. (2016) Colocalization of GWAS and eQTL Signals Detects Target Genes. Am J Hum Genet, 99 (6) 1245-1260. doi:10.1016/j.ajhg.2016.10.003. PMID 27866706
ABSTRACT
The vast majority of genome-wide association study (GWAS) risk loci fall in non-coding regions of the genome. One possible hypothesis is that these GWAS risk loci alter the individual's disease risk through their effect on gene expression in different tissues. In order to understand the mechanisms driving a GWAS risk locus, it is helpful to determine which gene is affected in specific tissue types. For example, the relevant gene and tissue could play a role in the disease mechanism if the same variant responsible for a GWAS locus also affects gene expression. Identifying whether or not the same variant is causal in both GWASs and expression quantitative trait locus (eQTL) studies is challenging because of the uncertainty induced by linkage disequilibrium and the fact that some loci harbor multiple causal variants. However, current methods that address this problem assume that each locus contains a single causal variant. In this paper, we present eCAVIAR, a probabilistic method that has several key advantages over existing methods. First, our method can account for more than one causal variant in any given locus. Second, it can leverage summary statistics without accessing the individual genotype data. We use both simulated and real datasets to demonstrate the utility of our method. Using publicly available eQTL data on 45 different tissues, we demonstrate that eCAVIAR can prioritize likely relevant tissues and target genes for a set of glucose- and insulin-related trait loci.
DOI
10.1016/j.ajhg.2016.10.003
EHH
PUBMED_LINK
FULL NAME
Extended haplotype homozygosity
DESCRIPTION
Sabeti, P. C., Reich, D. E., Higgins, J. M., Levine, H. Z., Richter, D. J., Schaffner, S. F., ... & Lander, E. S. (2002). Detecting recent positive selection in the human genome from haplotype structure. Nature, 419(6909), 832-837.
TITLE
Detecting selection using extended haplotype homozygosity (EHH)-based statistics in unphased or unpolarized data.
Main citation
Klassmann A, Gautier M. (2022) Detecting selection using extended haplotype homozygosity (EHH)-based statistics in unphased or unpolarized data. PLoS One, 17 (1) e0262024. doi:10.1371/journal.pone.0262024. PMID 35041674
ABSTRACT
Analysis of population genetic data often includes a search for genomic regions with signs of recent positive selection. One such approach involves the concept of extended haplotype homozygosity (EHH) and its associated statistics. These statistics typically require phased haplotypes, and some of them necessitate polarized variants. Here, we unify and extend previously proposed modifications to loosen these requirements. We compare the modified versions with the original ones by measuring the false discovery rate in simulated whole-genome scans and by quantifying the overlap of inferred candidate regions in empirical data. We find that phasing information is indispensable for accurate estimation of within-population statistics (for all but very large samples) and of cross-population statistics for small samples. Ancestry information, in contrast, is of lesser importance for both types of statistic. Our publicly available R package rehh incorporates the modified statistics presented here.
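For orientation, the basic phased EHH statistic (Sabeti et al., 2002) is the probability that two randomly chosen chromosomes carrying the same core allele are identical over the interval from the core to a given marker. The sketch below computes it for toy haplotypes. The function name and string encoding are assumptions for illustration; this is not the rehh API and does not cover the unphased or unpolarized variants discussed in the abstract.

```python
from collections import Counter
from math import comb

def ehh(haplotypes, core, end):
    """Extended haplotype homozygosity between index `core` and index `end`.

    haplotypes: equal-length strings of alleles, all carrying the same core allele.
    Returns sum_k C(n_k, 2) / C(n, 2), where n_k counts each distinct
    haplotype over the interval (both ends inclusive).
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    lo, hi = sorted((core, end))
    counts = Counter(h[lo:hi + 1] for h in haplotypes)
    return sum(comb(c, 2) for c in counts.values()) / comb(n, 2)

# Toy phased haplotypes (hypothetical), core marker at index 0.
haps = ["0101", "0101", "0111", "0100"]
decay = [ehh(haps, 0, end) for end in range(4)]
print(decay)
```

EHH starts at 1 at the core and can only stay level or fall as the interval grows, because extending the interval can split haplotype classes but never merge them.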
DOI
10.1371/journal.pone.0262024
EIGENSTRAT
PUBMED_LINK
URL
KEYWORDS
PCA, Linear
TITLE
Principal components analysis corrects for stratification in genome-wide association studies.
Main citation
Price AL, Patterson NJ, Plenge RM, Weinblatt ME, ...&, Reich D. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet, 38 (8) 904-9. doi:10.1038/ng1847. PMID 16862161
ABSTRACT
Population stratification (allele frequency differences between cases and controls due to systematic ancestry differences) can cause spurious associations in disease studies. We describe a method that enables explicit detection and correction of population stratification on a genome-wide scale. Our method uses principal components analysis to explicitly model ancestry differences between cases and controls. The resulting correction is specific to a candidate marker's variation in frequency across ancestral populations, minimizing spurious associations while maximizing power to detect true associations. Our simple, efficient approach can easily be applied to disease studies with hundreds of thousands of markers.
DOI
10.1038/ng1847
Ellis CA
PUBMED_LINK
TITLE
Inflation of polygenic risk scores caused by sample overlap and relatedness: Examples of a major risk of bias.
Main citation
Ellis CA, Oliver KL, Harris RV, Ottman R, ...&, Bahlo M. (2024) Inflation of polygenic risk scores caused by sample overlap and relatedness: Examples of a major risk of bias. Am J Hum Genet, 111 (9) 1805-1809. doi:10.1016/j.ajhg.2024.07.014. PMID 39168121
ABSTRACT
Polygenic risk scores (PRSs) are an important tool for understanding the role of common genetic variants in human disease. Standard best practices recommend that PRSs be analyzed in cohorts that are independent of the genome-wide association study (GWAS) used to derive the scores without sample overlap or relatedness between the two cohorts. However, identifying sample overlap and relatedness can be challenging in an era of GWASs performed by large biobanks and international research consortia. Although most genomics researchers are aware of best practices and theoretical concerns about sample overlap and relatedness between GWAS and PRS cohorts, the prevailing assumption is that the risk of bias is small for very large GWASs. Here, we present two real-world examples demonstrating that sample overlap and relatedness is not a minor or theoretical concern but an important potential source of bias in PRS studies. Using a recently developed statistical adjustment tool, we found that excluding overlapping and related samples was equal to or more powerful than adjusting for overlap bias. Our goal is to make genomics researchers aware of the magnitude of risk of bias from sample overlap and relatedness and to highlight the need for mitigation tools, including independent validation cohorts in PRS studies, continued development of statistical adjustment methods, and tools for researchers to test their cohorts for overlap and relatedness with GWAS cohorts without sharing individual-level data.
DOI
10.1016/j.ajhg.2024.07.014
EMMAX
PUBMED_LINK
FULL NAME
efficient mixed-model association eXpedited
DESCRIPTION
EMMAX is a statistical test for large-scale human or model organism association mapping that accounts for sample structure. In addition to the computational efficiency obtained by the EMMA algorithm, EMMAX takes advantage of the fact that each locus explains only a small fraction of a complex trait, which allows it to avoid the repetitive variance component estimation procedure, resulting in a substantial reduction in the computational time of association mapping using mixed models.
URL
TITLE
Variance component model to account for sample structure in genome-wide association studies.
Main citation
Kang HM, Sul JH, Service SK, Zaitlen NA, ...&, Eskin E. (2010) Variance component model to account for sample structure in genome-wide association studies. Nat Genet, 42 (4) 348-54. doi:10.1038/ng.548. PMID 20208533
ABSTRACT
Although genome-wide association studies (GWASs) have identified numerous loci associated with complex traits, imprecise modeling of the genetic relatedness within study samples may cause substantial inflation of test statistics and possibly spurious associations. Variance component approaches, such as efficient mixed-model association (EMMA), can correct for a wide range of sample structures by explicitly accounting for pairwise relatedness between individuals, using high-density markers to model the phenotype distribution; but such approaches are computationally impractical. We report here a variance component approach implemented in publicly available software, EMMA eXpedited (EMMAX), that reduces the computational time for analyzing large GWAS data sets from years to hours. We apply this method to two human GWAS data sets, performing association analysis for ten quantitative traits from the Northern Finland Birth Cohort and seven common diseases from the Wellcome Trust Case Control Consortium. We find that EMMAX outperforms both principal component analysis and genomic control in correcting for sample structure.
DOI
10.1038/ng.548
EPIC
PUBMED_LINK
FULL NAME
cEll tyPe enrIChment
DESCRIPTION
Inferring relevant tissues and cell types for complex traits in genome-wide association studies
URL
KEYWORDS
GWAS, scRNA-seq
TITLE
EPIC: Inferring relevant cell types for complex traits by integrating genome-wide association studies and single-cell RNA sequencing.
Main citation
Wang R, Lin DY, Jiang Y. (2022) EPIC: Inferring relevant cell types for complex traits by integrating genome-wide association studies and single-cell RNA sequencing. PLoS Genet, 18 (6) e1010251. doi:10.1371/journal.pgen.1010251. PMID 35709291
ABSTRACT
More than a decade of genome-wide association studies (GWASs) have identified genetic risk variants that are significantly associated with complex traits. Emerging evidence suggests that the function of trait-associated variants likely acts in a tissue- or cell-type-specific fashion. Yet, it remains challenging to prioritize trait-relevant tissues or cell types to elucidate disease etiology. Here, we present EPIC (cEll tyPe enrIChment), a statistical framework that relates large-scale GWAS summary statistics to cell-type-specific gene expression measurements from single-cell RNA sequencing (scRNA-seq). We derive powerful gene-level test statistics for common and rare variants, separately and jointly, and adopt generalized least squares to prioritize trait-relevant cell types while accounting for the correlation structures both within and between genes. Using enrichment of loci associated with four lipid traits in the liver and enrichment of loci associated with three neurological disorders in the brain as ground truths, we show that EPIC outperforms existing methods. We apply our framework to multiple scRNA-seq datasets from different platforms and identify cell types underlying type 2 diabetes and schizophrenia. The enrichment is replicated using independent GWAS and scRNA-seq datasets and further validated using PubMed search and existing bulk case-control testing results.
DOI
10.1371/journal.pgen.1010251
ExPRSweb
PUBMED_LINK
FULL NAME
exposure polygenic risk scores (ExPRSs)
DESCRIPTION
Integrating published and freely available genome-wide association studies (GWAS) summary statistics from multiple sources (published GWAS, the NHGRI-EBI GWAS Catalog, FinnGen- or UKB-based GWAS), we created an online repository for exposure polygenic risk scores (ExPRS) for health-related exposure traits. Our framework condenses these summary statistics into ExPRS using linkage disequilibrium pruning and p-value thresholding (P&T) or penalized, genome-wide effect size weighting. We evaluate them in the cohort of the Michigan Genomics Initiative (MGI), a longitudinal biorepository effort at Michigan Medicine, and in the population-based UK Biobank Study (UKB). For each ExPRS construct, measures on performance, accuracy, and discrimination are provided. Beyond the ExPRS evaluation in MGI and UKB, the ExPRSweb platform features construct downloads, evaluation in the top percentiles, and phenome-wide ExPRS association studies (ExPRS-PheWAS) for a subset of ExPRS that are predictive for the corresponding exposure.
URL
KEYWORDS
exposure PRS
TITLE
ExPRSweb: An online repository with polygenic risk scores for common health-related exposures.
Main citation
Ma Y, Patil S, Zhou X, Mukherjee B, ...&, Fritsche LG. (2022) ExPRSweb: An online repository with polygenic risk scores for common health-related exposures. Am J Hum Genet, 109 (10) 1742-1760. doi:10.1016/j.ajhg.2022.09.001. PMID 36152628
ABSTRACT
Complex traits are influenced by genetic risk factors, lifestyle, and environmental variables, so-called exposures. Some exposures, e.g., smoking or lipid levels, have common genetic modifiers identified in genome-wide association studies. Because measurements are often unfeasible, exposure polygenic risk scores (ExPRSs) offer an alternative to study the influence of exposures on various phenotypes. Here, we collected publicly available summary statistics for 28 exposures and applied four common PRS methods to generate ExPRSs in two large biobanks: the Michigan Genomics Initiative and the UK Biobank. We established ExPRSs for 27 exposures and demonstrated their applicability in phenome-wide association studies and as predictors for common chronic conditions. Especially the addition of multiple ExPRSs showed, for several chronic conditions, an improvement compared to prediction models that only included traditional, disease-focused PRSs. To facilitate follow-up studies, we share all ExPRS constructs and generated results via an online repository called ExPRSweb.
DOI
10.1016/j.ajhg.2022.09.001
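The pruning-and-thresholding (P&T) construction named in the ExPRSweb description can be illustrated with a minimal sketch. It assumes a greedy, p-value-ordered selection (often called clumping) and hypothetical inputs: the variant IDs, effect sizes, the `r2` lookup, and both thresholds are illustrative and are not taken from ExPRSweb itself.

```python
def prune_and_threshold(variants, r2, p_max=5e-8, r2_max=0.1):
    """Greedy P&T selection (illustrative, not the ExPRSweb pipeline).

    variants: list of (variant_id, beta, p_value) from GWAS summary statistics.
    r2: dict mapping frozenset({id_a, id_b}) -> squared LD correlation;
        pairs missing from the dict are treated as unlinked.
    """
    kept = []
    # Visit variants from most to least significant.
    for vid, beta, p in sorted(variants, key=lambda v: v[2]):
        if p > p_max:
            break  # thresholding: all remaining variants are less significant
        # Pruning: keep a variant only if it is not in LD with a kept one.
        if all(r2.get(frozenset((vid, k)), 0.0) < r2_max for k, _ in kept):
            kept.append((vid, beta))
    return kept


def polygenic_score(kept, dosages):
    """Weighted sum of effect-allele dosages (0 to 2) for one individual."""
    return sum(beta * dosages.get(vid, 0.0) for vid, beta in kept)
```

Loosening `p_max` or `r2_max` admits more variants into the score, which is the tuning trade-off that P&T approaches typically explore.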
f
PUBMED_LINK
FULL NAME
fraction of sites under selection
DESCRIPTION
Moon, S., & Akey, J. M. (2016). A flexible method for estimating the fraction of fitness influencing mutations from large sequencing data sets. Genome Research, 26(6), 834-843.
TITLE
A flexible method for estimating the fraction of fitness influencing mutations from large sequencing data sets.
Main citation
Moon S, Akey JM. (2016) A flexible method for estimating the fraction of fitness influencing mutations from large sequencing data sets. Genome Res, 26 (6) 834-43. doi:10.1101/gr.203059.115. PMID 27197222
ABSTRACT
A continuing challenge in the analysis of massively large sequencing data sets is quantifying and interpreting non-neutrally evolving mutations. Here, we describe a flexible and robust approach based on the site frequency spectrum to estimate the fraction of deleterious and adaptive variants from large-scale sequencing data sets. We applied our method to approximately 1 million single nucleotide variants (SNVs) identified in high-coverage exome sequences of 6515 individuals. We estimate that the fraction of deleterious nonsynonymous SNVs is higher than previously reported; quantify the effects of genomic context, codon bias, chromatin accessibility, and number of protein-protein interactions on deleterious protein-coding SNVs; and identify pathways and networks that have likely been influenced by positive selection. Furthermore, we show that the fraction of deleterious nonsynonymous SNVs is significantly higher for Mendelian versus complex disease loci and in exons harboring dominant versus recessive Mendelian mutations. In summary, as genome-scale sequencing data accumulate in progressively larger sample sizes, our method will enable increasingly high-resolution inferences into the characteristics and determinants of non-neutral variation.
DOI
10.1101/gr.203059.115
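The site frequency spectrum (SFS) on which this method is based is a histogram of derived-allele counts across segregating sites. The sketch below only tabulates an unfolded SFS from hypothetical per-site counts; it does not reproduce the estimator of Moon and Akey, and the function name is an assumption made for illustration.

```python
from collections import Counter


def site_frequency_spectrum(derived_counts, n_chromosomes):
    """Unfolded SFS: entry k-1 is the number of sites at which the derived
    allele is carried by exactly k of the n sampled chromosomes.

    Sites that are fixed (count 0 or n) are not segregating and are skipped.
    """
    counts = Counter(c for c in derived_counts if 0 < c < n_chromosomes)
    return [counts.get(k, 0) for k in range(1, n_chromosomes)]
```

An excess of low-frequency classes in one functional category relative to a putatively neutral category is the kind of signal that SFS-based approaches build on.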
FactorGO
PUBMED_LINK
FULL NAME
Factor analysis model in Genetic assOciation
DESCRIPTION
FactorGo is a scalable variational factor analysis model that learns pleiotropic factors using GWAS summary statistics.
URL
KEYWORDS
pleiotropy, factor analysis
TITLE
A scalable approach to characterize pleiotropy across thousands of human diseases and complex traits using GWAS summary statistics.
Main citation
Zhang Z, Jung J, Kim A, Suboc N, ...&, Mancuso N. (2023) A scalable approach to characterize pleiotropy across thousands of human diseases and complex traits using GWAS summary statistics. Am J Hum Genet, 110 (11) 1863-1874. doi:10.1016/j.ajhg.2023.09.015. PMID 37879338
ABSTRACT
Genome-wide association studies (GWASs) across thousands of traits have revealed the pervasive pleiotropy of trait-associated genetic variants. While methods have been proposed to characterize pleiotropic components across groups of phenotypes, scaling these approaches to ultra-large-scale biobanks has been challenging. Here, we propose FactorGo, a scalable variational factor analysis model to identify and characterize pleiotropic components using biobank GWAS summary data. In extensive simulations, we observe that FactorGo outperforms the state-of-the-art (model-free) approach tSVD in capturing latent pleiotropic factors across phenotypes while maintaining a similar computational cost. We apply FactorGo to estimate 100 latent pleiotropic factors from GWAS summary data of 2,483 phenotypes measured in European-ancestry Pan-UK BioBank individuals (N = 420,531). Next, we find that factors from FactorGo are more enriched with relevant tissue-specific annotations than those identified by tSVD (p = 2.58E-10) and validate our approach by recapitulating brain-specific enrichment for BMI and the height-related connection between reproductive system and muscular-skeletal growth. Finally, our analyses suggest shared etiologies between rheumatoid arthritis and periodontal condition in addition to alkaline phosphatase as a candidate prognostic biomarker for prostate cancer. Overall, FactorGo improves our biological understanding of shared etiologies across thousands of GWASs.
DOI
10.1016/j.ajhg.2023.09.015
fastASSET
PUBMED_LINK
URL
TITLE
Genome-wide large-scale multi-trait analysis characterizes global patterns of pleiotropy and unique trait-specific variants.
Main citation
Qi G, Chhetri SB, Ray D, Dutta D, ...&, Chatterjee N. (2024) Genome-wide large-scale multi-trait analysis characterizes global patterns of pleiotropy and unique trait-specific variants. Nat Commun, 15 (1) 6985. doi:10.1038/s41467-024-51075-5. PMID 39143063
ABSTRACT
Genome-wide association studies (GWAS) have found widespread evidence of pleiotropy, but characterization of global patterns of pleiotropy remain highly incomplete due to insufficient power of current approaches. We develop fastASSET, a method that allows efficient detection of variant-level pleiotropic association across many traits. We analyze GWAS summary statistics of 116 complex traits of diverse types collected from the GRASP repository and large GWAS Consortia. We identify 2293 independent loci and find that the lead variants in nearly all these loci (~99%) to be associated with ≥ 2 traits (median = 6). We observe that degree of pleiotropy estimated from our study predicts that observed in the UK Biobank for a much larger number of traits (K = 4114) (correlation = 0.43, p-value < 2.2 × 10^-16). Follow-up analyzes of 21 trait-specific variants indicate their link to the expression in trait-related tissues for a small number of genes involved in relevant biological processes. Our findings provide deeper insight into the nature of pleiotropy and leads to identification of highly trait-specific susceptibility variants.
DOI
10.1038/s41467-024-51075-5
fastGWA
PUBMED_LINK
URL
KEYWORDS
grid-search-based REML algorithm
TITLE
A resource-efficient tool for mixed model association analysis of large-scale data.
Main citation
Jiang L, Zheng Z, Qi T, Kemper KE, ...&, Yang J. (2019) A resource-efficient tool for mixed model association analysis of large-scale data. Nat Genet, 51 (12) 1749-1755. doi:10.1038/s41588-019-0530-8. PMID 31768069
ABSTRACT
The genome-wide association study (GWAS) has been widely used as an experimental design to detect associations between genetic variants and a phenotype. Two major confounding factors, population stratification and relatedness, could potentially lead to inflated GWAS test statistics and hence to spurious associations. Mixed linear model (MLM)-based approaches can be used to account for sample structure. However, genome-wide association (GWA) analyses in biobank samples such as the UK Biobank (UKB) often exceed the capability of most existing MLM-based tools especially if the number of traits is large. Here, we develop an MLM-based tool (fastGWA) that controls for population stratification by principal components and for relatedness by a sparse genetic relationship matrix for GWA analyses of biobank-scale data. We demonstrate by extensive simulations that fastGWA is reliable, robust and highly resource-efficient. We then apply fastGWA to 2,173 traits on array-genotyped and imputed samples from 456,422 individuals and to 2,048 traits on whole-exome-sequenced samples from 46,191 individuals in the UKB.
DOI
10.1038/s41588-019-0530-8
fastGWA-GLMM
PUBMED_LINK
URL
TITLE
A generalized linear mixed model association tool for biobank-scale data.
Main citation
Jiang L, Zheng Z, Fang H, Yang J. (2021) A generalized linear mixed model association tool for biobank-scale data. Nat Genet, 53 (11) 1616-1621. doi:10.1038/s41588-021-00954-4. PMID 34737426
ABSTRACT
Compared with linear mixed model-based genome-wide association (GWA) methods, generalized linear mixed model (GLMM)-based methods have better statistical properties when applied to binary traits but are computationally much slower. In the present study, leveraging efficient sparse matrix-based algorithms, we developed a GLMM-based GWA tool, fastGWA-GLMM, that is severalfold to orders of magnitude faster than the state-of-the-art tools when applied to the UK Biobank (UKB) data and scalable to cohorts with millions of individuals. We show by simulation that the fastGWA-GLMM test statistics of both common and rare variants are well calibrated under the null, even for traits with extreme case-control ratios. We applied fastGWA-GLMM to the UKB data of 456,348 individuals, 11,842,647 variants and 2,989 binary traits (full summary statistics available at http://fastgwa.info/ukbimpbin ), and identified 259 rare variants associated with 75 traits, demonstrating the use of imputed genotype data in a large cohort to discover rare variants for binary complex traits.
DOI
10.1038/s41588-021-00954-4
fastPHASE
PUBMED_LINK
URL
TITLE
A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase.
Main citation
Scheet P, Stephens M. (2006) A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet, 78 (4) 629-44. doi:10.1086/502802. PMID 16532393
ABSTRACT
We present a statistical model for patterns of genetic variation in samples of unrelated individuals from natural populations. This model is based on the idea that, over short regions, haplotypes in a population tend to cluster into groups of similar haplotypes. To capture the fact that, because of recombination, this clustering tends to be local in nature, our model allows cluster memberships to change continuously along the chromosome according to a hidden Markov model. This approach is flexible, allowing for both "block-like" patterns of linkage disequilibrium (LD) and gradual decline in LD with distance. The resulting model is also fast and, as a result, is practicable for large data sets (e.g., thousands of individuals typed at hundreds of thousands of markers). We illustrate the utility of the model by applying it to dense single-nucleotide-polymorphism genotype data for the tasks of imputing missing genotypes and estimating haplotypic phase. For imputing missing genotypes, methods based on this model are as accurate or more accurate than existing methods. For haplotype estimation, the point estimates are slightly less accurate than those from the best existing methods (e.g., for unrelated Centre d'Etude du Polymorphisme Humain individuals from the HapMap project, switch error was 0.055 for our method vs. 0.051 for PHASE) but require a small fraction of the computational cost. In addition, we demonstrate that the model accurately reflects uncertainty in its estimates, in that probabilities computed using the model are approximately well calibrated. The methods described in this article are implemented in a software package, fastPHASE, which is available from the Stephens Lab Web site.
DOI
10.1086/502802
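The switch error quoted in the fastPHASE abstract measures how often the inferred phase flips relative to the truth between consecutive heterozygous sites. A minimal sketch of that metric follows, assuming haplotypes are given as equal-length sequences of 0/1 alleles; the function name and input layout are assumptions, not part of fastPHASE.

```python
def switch_error_rate(true_h1, true_h2, est_h1, est_h2):
    """Fraction of consecutive heterozygous-site pairs whose relative phase
    differs between the true and the estimated haplotype pair."""
    orientation = []
    for t1, t2, e1, e2 in zip(true_h1, true_h2, est_h1, est_h2):
        if t1 == t2:
            continue  # homozygous sites carry no phase information
        if (e1, e2) == (t1, t2):
            orientation.append(0)  # estimate matches the true orientation
        elif (e1, e2) == (t2, t1):
            orientation.append(1)  # estimate is flipped at this site
        else:
            raise ValueError("estimated genotype differs from the truth")
    if len(orientation) < 2:
        return 0.0  # fewer than two heterozygous sites: no switch possible
    switches = sum(a != b for a, b in zip(orientation, orientation[1:]))
    return switches / (len(orientation) - 1)
```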
FastQTL
PUBMED_LINK
DESCRIPTION
In order to discover quantitative trait loci (QTLs), multi-dimensional genomic datasets combining DNA-seq and ChiP-/RNA-seq require methods that rapidly correlate tens of thousands of molecular phenotypes with millions of genetic variants while appropriately controlling for multiple testing. FastQTL implements a popular cis-QTL mapping strategy in a user- and cluster-friendly tool. FastQTL also proposes an efficient permutation procedure to control for multiple testing.
URL
TITLE
Fast and efficient QTL mapper for thousands of molecular phenotypes.
Main citation
Ongen H, Buil A, Brown AA, Dermitzakis ET, ...&, Delaneau O. (2016) Fast and efficient QTL mapper for thousands of molecular phenotypes. Bioinformatics, 32 (10) 1479-85. doi:10.1093/bioinformatics/btv722. PMID 26708335
ABSTRACT
MOTIVATION: In order to discover quantitative trait loci, multi-dimensional genomic datasets combining DNA-seq and ChiP-/RNA-seq require methods that rapidly correlate tens of thousands of molecular phenotypes with millions of genetic variants while appropriately controlling for multiple testing. RESULTS: We have developed FastQTL, a method that implements a popular cis-QTL mapping strategy in a user- and cluster-friendly tool. FastQTL also proposes an efficient permutation procedure to control for multiple testing. The outcome of permutations is modeled using beta distributions trained from a few permutations and from which adjusted P-values can be estimated at any level of significance with little computational cost. The Geuvadis & GTEx pilot datasets can be now easily analyzed an order of magnitude faster than previous approaches. AVAILABILITY AND IMPLEMENTATION: Source code, binaries and comprehensive documentation of FastQTL are freely available to download at http://fastqtl.sourceforge.net/ CONTACT: emmanouil.dermitzakis@unige.ch or olivier.delaneau@unige.ch SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
DOI
10.1093/bioinformatics/btv722
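The FastQTL abstract describes modeling the outcome of a few permutations with a beta distribution and reading adjusted P-values from it. The sketch below illustrates the idea under stated assumptions: the beta parameters are fitted by the method of moments (a simplification chosen here, not necessarily the fitting procedure FastQTL uses), the beta CDF is evaluated with a standard continued-fraction expansion, and all function names are hypothetical.

```python
import math
from statistics import mean, pvariance

_TINY = 1e-300  # guards against division by zero in the continued fraction


def _betacf(a, b, x, max_iter=300, eps=1e-12):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > _TINY else _TINY)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),            # even step
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):  # odd step
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > _TINY else _TINY)
            c = 1.0 + aa / c
            c = c if abs(c) > _TINY else _TINY
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def beta_cdf(x, a, b):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def fit_beta_moments(null_min_pvalues):
    """Method-of-moments beta fit to the best P-value of each permutation."""
    m, v = mean(null_min_pvalues), pvariance(null_min_pvalues)
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common


def adjusted_pvalue(nominal_p, null_min_pvalues):
    """Probability that a permutation yields a best P-value <= nominal_p."""
    a, b = fit_beta_moments(null_min_pvalues)
    return beta_cdf(nominal_p, a, b)
```

Because the fitted distribution is continuous, the adjusted P-value is not bounded below by one over the number of permutations, which is what makes a small permutation budget workable.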
FINEMAP
PUBMED_LINK
DESCRIPTION
FINEMAP is a program for (1) identifying causal SNPs, (2) estimating effect sizes of causal SNPs, and (3) estimating the heritability contribution of causal SNPs.
URL
KEYWORDS
Shotgun Stochastic Search (SSS)
TITLE
FINEMAP: efficient variable selection using summary data from genome-wide association studies.
Main citation
Benner C, Spencer CC, Havulinna AS, Salomaa V, ...&, Pirinen M. (2016) FINEMAP: efficient variable selection using summary data from genome-wide association studies. Bioinformatics, 32 (10) 1493-501. doi:10.1093/bioinformatics/btw018. PMID 26773131
ABSTRACT
MOTIVATION: The goal of fine-mapping in genomic regions associated with complex diseases and traits is to identify causal variants that point to molecular mechanisms behind the associations. Recent fine-mapping methods using summary data from genome-wide association studies rely on exhaustive search through all possible causal configurations, which is computationally expensive. RESULTS: We introduce FINEMAP, a software package to efficiently explore a set of the most important causal configurations of the region via a shotgun stochastic search algorithm. We show that FINEMAP produces accurate results in a fraction of processing time of existing approaches and is therefore a promising tool for analyzing growing amounts of data produced in genome-wide association studies and emerging sequencing projects. AVAILABILITY AND IMPLEMENTATION: FINEMAP v1.0 is freely available for Mac OS X and Linux at http://www.christianbenner.com CONTACT: christian.benner@helsinki.fi or matti.pirinen@helsinki.fi.
DOI
10.1093/bioinformatics/btw018
flashfmZero
PUBMED_LINK
DESCRIPTION
flashfmZero performs zero-correlation latent-factor-based multi-trait fine-mapping from GWAS summary statistics for high-dimensional trait panels (e.g., blood cell counts). Latent-factor GWAS can surface signals below univariate thresholds; in INTERVAL blood-cell analyses, 99% credible sets were at least as small as univariate fine-mapping in most comparisons and were nested within univariate latent-factor credible sets.
URL
KEYWORDS
latent factor, multi-trait, fine-mapping, GWAS summary statistics, high-dimensional traits
TITLE
Improved genetic discovery and fine-mapping resolution through multivariate latent factor analysis of high-dimensional traits.
Main citation
Zhou F, Astle WJ, Butterworth AS, Asimit JL. (2025) Improved genetic discovery and fine-mapping resolution through multivariate latent factor analysis of high-dimensional traits. Cell Genom, 5 (5) 100847. doi:10.1016/j.xgen.2025.100847. PMID 40220762
ABSTRACT
Genome-wide association studies (GWASs) of high-dimensional traits, such as blood cell or metabolic traits, often use univariate approaches, ignoring trait relationships. Biological mechanisms generating variation in high-dimensional traits can be captured parsimoniously through a GWAS of latent factors. Here, we introduce flashfmZero, a zero-correlation latent-factor-based multi-trait fine-mapping approach. In an application to 25 latent factors derived from 99 blood cell traits in the INTERVAL cohort, we show that latent factor GWASs enable the detection of signals generating sub-threshold associations with several blood cell traits. The 99% credible sets (CS99) from flashfmZero were equal to or smaller in size than those from univariate fine-mapping of blood cell traits in 87% of our comparisons. In all cases univariate latent factor CS99 contained those from flashfmZero. Our latent factor approaches can be applied to GWAS summary statistics and will enhance power for the discovery and fine-mapping of associations for many traits.
DOI
10.1016/j.xgen.2025.100847
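The 99% credible sets (CS99) compared in the flashfmZero abstract are commonly built by ranking variants by posterior probability and accumulating until the target coverage is reached. The sketch below assumes one causal variant per signal and per-variant posterior probabilities that sum to about one; it is not flashfmZero's own implementation, and the function name is hypothetical.

```python
def credible_set(posteriors, level=0.99):
    """Smallest set of top-ranked variants whose posterior mass >= level.

    posteriors: dict mapping variant_id -> posterior probability of causality.
    """
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    selected, total = [], 0.0
    for variant_id, prob in ranked:
        selected.append(variant_id)
        total += prob
        if total >= level:
            break
    return selected
```

A smaller credible set at the same coverage level means finer resolution, which is the comparison the abstract reports.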
Four-digit Multi-ethnic HLA v1 (2021)
PUBMED_LINK
DESCRIPTION
Available on Michigan imputation server
URL
TITLE
A high-resolution HLA reference panel capturing global population diversity enables multi-ancestry fine-mapping in HIV host response.
Main citation
Luo Y, Kanai M, Choi W, Li X, ...&, Raychaudhuri S. (2021) A high-resolution HLA reference panel capturing global population diversity enables multi-ancestry fine-mapping in HIV host response. Nat Genet, 53 (10) 1504-1516. doi:10.1038/s41588-021-00935-7. PMID 34611364
ABSTRACT
Fine-mapping to plausible causal variation may be more effective in multi-ancestry cohorts, particularly in the MHC, which has population-specific structure. To enable such studies, we constructed a large (n = 21,546) HLA reference panel spanning five global populations based on whole-genome sequences. Despite population-specific long-range haplotypes, we demonstrated accurate imputation at G-group resolution (94.2%, 93.7%, 97.8% and 93.7% in admixed African (AA), East Asian (EAS), European (EUR) and Latino (LAT) populations). Applying HLA imputation to genome-wide association study data for HIV-1 viral load in three populations (EUR, AA and LAT), we obviated effects of previously reported associations from population-specific HIV studies and discovered a novel association at position 156 in HLA-B. We pinpointed the MHC association to three amino acid positions (97, 67 and 156) marking three consecutive pockets (C, B and D) within the HLA-B peptide-binding groove, explaining 12.9% of trait variance.
DOI
10.1038/s41588-021-00935-7
Four-digit Multi-ethnic HLA v2 (2022)
PUBMED_LINK
DESCRIPTION
Available on Michigan imputation server
URL
TITLE
A high-resolution HLA reference panel capturing global population diversity enables multi-ancestry fine-mapping in HIV host response.
Main citation
Luo Y, Kanai M, Choi W, Li X, ...&, Raychaudhuri S. (2021) A high-resolution HLA reference panel capturing global population diversity enables multi-ancestry fine-mapping in HIV host response. Nat Genet, 53 (10) 1504-1516. doi:10.1038/s41588-021-00935-7. PMID 34611364
ABSTRACT
Fine-mapping to plausible causal variation may be more effective in multi-ancestry cohorts, particularly in the MHC, which has population-specific structure. To enable such studies, we constructed a large (n = 21,546) HLA reference panel spanning five global populations based on whole-genome sequences. Despite population-specific long-range haplotypes, we demonstrated accurate imputation at G-group resolution (94.2%, 93.7%, 97.8% and 93.7% in admixed African (AA), East Asian (EAS), European (EUR) and Latino (LAT) populations). Applying HLA imputation to genome-wide association study data for HIV-1 viral load in three populations (EUR, AA and LAT), we obviated effects of previously reported associations from population-specific HIV studies and discovered a novel association at position 156 in HLA-B. We pinpointed the MHC association to three amino acid positions (97, 67 and 156) marking three consecutive pockets (C, B and D) within the HLA-B peptide-binding groove, explaining 12.9% of trait variance.
DOI
10.1038/s41588-021-00935-7
FUMA
PUBMED_LINK
DESCRIPTION
FUMA is a platform that can be used to annotate, prioritize, visualize and interpret GWAS results.
URL
TITLE
Functional mapping and annotation of genetic associations with FUMA.
Main citation
Watanabe K, Taskesen E, van Bochoven A, Posthuma D. (2017) Functional mapping and annotation of genetic associations with FUMA. Nat Commun, 8 (1) 1826. doi:10.1038/s41467-017-01261-5. PMID 29184056
ABSTRACT
A main challenge in genome-wide association studies (GWAS) is to pinpoint possible causal variants. Results from GWAS typically do not directly translate into causal variants because the majority of hits are in non-coding or intergenic regions, and the presence of linkage disequilibrium leads to effects being statistically spread out across multiple variants. Post-GWAS annotation facilitates the selection of most likely causal variant(s). Multiple resources are available for post-GWAS annotation, yet these can be time consuming and do not provide integrated visual aids for data interpretation. We, therefore, develop FUMA: an integrative web-based platform using information from multiple biological resources to facilitate functional annotation of GWAS results, gene prioritization and interactive visualization. FUMA accommodates positional, expression quantitative trait loci (eQTL) and chromatin interaction mappings, and provides gene-based, pathway and tissue enrichment results. FUMA results directly aid in generating hypotheses that are testable in functional experiments aimed at proving causal relations.
DOI
10.1038/s41467-017-01261-5
FUSION
PUBMED_LINK
FULL NAME
Functional Summary-based Imputation
DESCRIPTION
FUSION is a suite of tools for performing transcriptome-wide and regulome-wide association studies (TWAS and RWAS). FUSION builds predictive models of the genetic component of a functional/molecular phenotype and predicts and tests that component for association with disease using GWAS summary statistics. The goal is to identify associations between a GWAS phenotype and a functional phenotype that was only measured in reference data. We provide precomputed predictive models from multiple studies to facilitate this analysis.
URL
TITLE
Integrative approaches for large-scale transcriptome-wide association studies.
Main citation
Gusev A, Ko A, Shi H, Bhatia G, ...&, Pasaniuc B. (2016) Integrative approaches for large-scale transcriptome-wide association studies. Nat Genet, 48 (3) 245-52. doi:10.1038/ng.3506. PMID 26854917
ABSTRACT
Many genetic variants influence complex traits by modulating gene expression, thus altering the abundance of one or multiple proteins. Here we introduce a powerful strategy that integrates gene expression measurements with summary association statistics from large-scale genome-wide association studies (GWAS) to identify genes whose cis-regulated expression is associated with complex traits. We leverage expression imputation from genetic data to perform a transcriptome-wide association study (TWAS) to identify significant expression-trait associations. We applied our approaches to expression data from blood and adipose tissue measured in ~3,000 individuals overall. We imputed gene expression into GWAS data from over 900,000 phenotype measurements to identify 69 new genes significantly associated with obesity-related traits (BMI, lipids and height). Many of these genes are associated with relevant phenotypes in the Hybrid Mouse Diversity Panel. Our results showcase the power of integrating genotype, gene expression and phenotype to gain insights into the genetic basis of complex traits.
DOI
10.1038/ng.3506
G2P
PUBMED_LINK
FULL NAME
A Genome-Wide-Association-Study Simulation Tool for Genotype Simulation, Phenotype Simulation, and Power Evaluation
DESCRIPTION
A genome-wide association study (GWAS) simulation tool for genotype simulation, phenotype simulation, and power evaluation.
URL
TITLE
G2P: a Genome-Wide-Association-Study simulation tool for genotype simulation, phenotype simulation and power evaluation.
Main citation
Tang Y, Liu X. (2019) G2P: a Genome-Wide-Association-Study simulation tool for genotype simulation, phenotype simulation and power evaluation. Bioinformatics, 35 (19) 3852-3854. doi:10.1093/bioinformatics/btz126. PMID 30848784
ABSTRACT
MOTIVATION: Plenty of Genome-Wide-Association-Study (GWAS) methods have been developed for mapping genetic markers that associated with human diseases and agricultural economic traits. Computer simulation is a nice tool to test the performances of various GWAS methods under certain scenarios. Existing tools are either inefficient in terms of computation and memory efficiency or inconvenient to use to simulate big, realistic genotype data and phenotype data to evaluate available GWAS methods. RESULTS: Here, we present a GWAS simulation tool named G2P that can be used to simulate genotype data, phenotype data and perform power evaluation of GWAS methods. G2P is a user-friendly tool with all functions is provided in both graphical user interface and pipeline manners and it is available for Windows, Mac and Linux environments. Furthermore, G2P achieves maximum efficiency in terms of both memory usage and simulation speed; with G2P, the simulation of genotype data that includes 1 000 000 samples and 2 000 000 markers can be accomplished in 5 h. AVAILABILITY AND IMPLEMENTATION: The G2P software, user manual, and example datasets are freely available at GitHub: https://github.com/XiaoleiLiuBio/G2P. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
DOI
10.1093/bioinformatics/btz126
Galesloot
PUBMED_LINK
TITLE
A comparison of multivariate genome-wide association methods.
Main citation
Galesloot TE, van Steen K, Kiemeney LA, Janss LL, ...&, Vermeulen SH. (2014) A comparison of multivariate genome-wide association methods. PLoS One, 9 (4) e95923. doi:10.1371/journal.pone.0095923. PMID 24763738
ABSTRACT
Joint association analysis of multiple traits in a genome-wide association study (GWAS), i.e. a multivariate GWAS, offers several advantages over analyzing each trait in a separate GWAS. In this study we directly compared a number of multivariate GWAS methods using simulated data. We focused on six methods that are implemented in the software packages PLINK, SNPTEST, MultiPhen, BIMBAM, PCHAT and TATES, and also compared them to standard univariate GWAS, analysis of the first principal component of the traits, and meta-analysis of univariate results. We simulated data (N = 1000) for three quantitative traits and one bi-allelic quantitative trait locus (QTL), and varied the number of traits associated with the QTL (explained variance 0.1%), minor allele frequency of the QTL, residual correlation between the traits, and the sign of the correlation induced by the QTL relative to the residual correlation. We compared the power of the methods using empirically fixed significance thresholds (α = 0.05). Our results showed that the multivariate methods implemented in PLINK, SNPTEST, MultiPhen and BIMBAM performed best for the majority of the tested scenarios, with a notable increase in power for scenarios with an opposite sign of genetic and residual correlation. All multivariate analyses resulted in a higher power than univariate analyses, even when only one of the traits was associated with the QTL. Hence, use of multivariate GWAS methods can be recommended, even when genetic correlations between traits are weak.
DOI
10.1371/journal.pone.0095923
GAS Power Calculator
FULL NAME
Genetic Association Study Power Calculator
DESCRIPTION
This Genetic Association Study (GAS) Power Calculator is a simple interface that can be used to compute statistical power for large one-stage genetic association studies. The underlying method is derived from the CaTS power calculator for two-stage association studies (2006).
URL
PREPRINT_DOI
10.1101/164343
Main citation
Johnson, J. L., & Abecasis, G. R. (2017). GAS Power Calculator: web-based power calculator for genetic association studies. BioRxiv, 164343.
GATE
PUBMED_LINK
FULL NAME
Genetic Analysis of Time-to-Event phenotypes
DESCRIPTION
GATE (Genetic Analysis of Time-to-Event phenotypes) is an R package for scalable and accurate genome-wide association analysis of censored survival data in large-scale biobanks using frailty models.
GATE performs single-variant association tests for time-to-event endpoints. GATE uses the saddlepoint approximation (SPA) (Imhof, J. P., 1961; Kuonen, D., 1999; Dey, R. et al., 2017) to account for heavy censoring rates.
URL
KEYWORDS
censored time-to-event (TTE) phenotypes
TITLE
Efficient and accurate frailty model approach for genome-wide survival association analysis in large-scale biobanks.
Main citation
Dey R, Zhou W, Kiiskinen T, Havulinna A, ...&, Lin X. (2022) Efficient and accurate frailty model approach for genome-wide survival association analysis in large-scale biobanks. Nat Commun, 13 (1) 5437. doi:10.1038/s41467-022-32885-x. PMID 36114182
ABSTRACT
With decades of electronic health records linked to genetic data, large biobanks provide unprecedented opportunities for systematically understanding the genetics of the natural history of complex diseases. Genome-wide survival association analysis can identify genetic variants associated with ages of onset, disease progression and lifespan. We propose an efficient and accurate frailty model approach for genome-wide survival association analysis of censored time-to-event (TTE) phenotypes by accounting for both population structure and relatedness. Our method utilizes state-of-the-art optimization strategies to reduce the computational cost. The saddlepoint approximation is used to allow for analysis of heavily censored phenotypes (>90%) and low frequency variants (down to minor allele count 20). We demonstrate the performance of our method through extensive simulation studies and analysis of five TTE phenotypes, including lifespan, with heavy censoring rates (90.9% to 99.8%) on ~400,000 UK Biobank participants with white British ancestry and ~180,000 individuals in FinnGen. We further analyzed 871 TTE phenotypes in the UK Biobank and presented the genome-wide scale phenome-wide association results with the PheWeb browser.
DOI
10.1038/s41467-022-32885-x
GCTA
PUBMED_LINK
FULL NAME
Genome-wide complex trait analysis (GCTA)
DESCRIPTION
GCTA-GREML analysis: GCTA can simulate a GWAS based on real genotype data.
URL
TITLE
GCTA: a tool for genome-wide complex trait analysis.
Main citation
Yang J, Lee SH, Goddard ME, Visscher PM. (2011) GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet, 88 (1) 76-82. doi:10.1016/j.ajhg.2010.11.011. PMID 21167468
ABSTRACT
For most human complex diseases and traits, SNPs identified by genome-wide association studies (GWAS) explain only a small fraction of the heritability. Here we report a user-friendly software tool called genome-wide complex trait analysis (GCTA), which was developed based on a method we recently developed to address the "missing heritability" problem. GCTA estimates the variance explained by all the SNPs on a chromosome or on the whole genome for a complex trait rather than testing the association of any particular SNP to the trait. We introduce GCTA's five main functions: data management, estimation of the genetic relationships from SNPs, mixed linear model analysis of variance explained by the SNPs, estimation of the linkage disequilibrium structure, and GWAS simulation. We focus on the function of estimating the variance explained by all the SNPs on the X chromosome and testing the hypotheses of dosage compensation. The GCTA software is a versatile tool to estimate and partition complex trait variation with large GWAS data sets.
DOI
10.1016/j.ajhg.2010.11.011
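The "estimation of the genetic relationships from SNPs" listed among GCTA's functions can be illustrated with a small sketch of a genetic relationship matrix (GRM). It assumes the standard form in which each SNP is centered by twice its allele frequency and scaled by 2p(1-p), with frequencies estimated from the sample; for simplicity the same expression is used on the diagonal, which GCTA itself handles differently. The function name and input layout are hypothetical.

```python
def genetic_relationship_matrix(genotypes):
    """GRM from allele counts.

    genotypes: list of SNPs, each a list of allele counts (0, 1, 2) with one
    entry per individual. Monomorphic SNPs are skipped.
    """
    n_ind = len(genotypes[0])
    grm = [[0.0] * n_ind for _ in range(n_ind)]
    n_used = 0
    for snp in genotypes:
        p = sum(snp) / (2.0 * n_ind)  # sample allele frequency
        if p <= 0.0 or p >= 1.0:
            continue  # no variation, so no information on relatedness
        n_used += 1
        scale = 2.0 * p * (1.0 - p)
        centered = [x - 2.0 * p for x in snp]
        for j in range(n_ind):
            for k in range(j, n_ind):
                grm[j][k] += centered[j] * centered[k] / scale
    if n_used == 0:
        raise ValueError("no polymorphic SNPs")
    for j in range(n_ind):
        for k in range(j, n_ind):
            grm[j][k] /= n_used       # average over SNPs
            grm[k][j] = grm[j][k]     # fill the symmetric half
    return grm
```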
GCTA-GREML-Binary (GREML)
PUBMED_LINK
FULL NAME
Genome-wide complex trait analysis (GCTA) Genome-based restricted maximum likelihood (GREML)
DESCRIPTION
(case-control)
URL
TITLE
Estimating missing heritability for disease from genome-wide association studies.
Main citation
Lee SH, Wray NR, Goddard ME, Visscher PM. (2011) Estimating missing heritability for disease from genome-wide association studies. Am J Hum Genet, 88 (3) 294-305. doi:10.1016/j.ajhg.2011.02.002. PMID 21376301
ABSTRACT
Genome-wide association studies are designed to discover SNPs that are associated with a complex trait. Employing strict significance thresholds when testing individual SNPs avoids false positives at the expense of increasing false negatives. Recently, we developed a method for quantitative traits that estimates the variation accounted for when fitting all SNPs simultaneously. Here we develop this method further for case-control studies. We use a linear mixed model for analysis of binary traits and transform the estimates to a liability scale by adjusting both for scale and for ascertainment of the case samples. We show by theory and simulation that the method is unbiased. We apply the method to data from the Wellcome Trust Case Control Consortium and show that a substantial proportion of variation in liability for Crohn disease, bipolar disorder, and type I diabetes is tagged by common SNPs.
DOI
10.1016/j.ajhg.2011.02.002
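The abstract describes transforming observed-scale estimates to the liability scale while adjusting for ascertainment. A minimal sketch of that transformation, assuming the commonly quoted form h2_liability = h2_observed * [K(1 - K)]^2 / [z^2 * P(1 - P)], where K is the population prevalence, P is the proportion of cases in the sample, and z is the standard normal density at the liability threshold. The function and argument names are hypothetical.

```python
from statistics import NormalDist


def observed_to_liability(h2_observed, prevalence, case_fraction):
    """Convert observed-scale SNP heritability to the liability scale."""
    if not 0.0 < prevalence < 1.0 or not 0.0 < case_fraction < 1.0:
        raise ValueError("prevalence and case_fraction must lie in (0, 1)")

    normal = NormalDist()
    threshold = normal.inv_cdf(1.0 - prevalence)  # liability threshold t
    z = normal.pdf(threshold)                     # density height at t

    k, p = prevalence, case_fraction
    return h2_observed * (k * (1 - k)) ** 2 / (z ** 2 * p * (1 - p))


if __name__ == "__main__":
    # Hypothetical inputs: 1% prevalence, balanced case-control sample.
    print(observed_to_liability(0.30, prevalence=0.01, case_fraction=0.5))
```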
GCTA-GREML-Bivariate (GREML)
PUBMED_LINK
FULL NAME
Genome-wide complex trait analysis (GCTA) Genome-based restricted maximum likelihood (GREML)
DESCRIPTION
(Bivariate GREML)
URL
KEYWORDS
bivariate
TITLE
Estimation of pleiotropy between complex diseases using single-nucleotide polymorphism-derived genomic relationships and restricted maximum likelihood.
Main citation
Lee SH, Yang J, Goddard ME, Visscher PM, ...&, Wray NR. (2012) Estimation of pleiotropy between complex diseases using single-nucleotide polymorphism-derived genomic relationships and restricted maximum likelihood. Bioinformatics, 28 (19) 2540-2. doi:10.1093/bioinformatics/bts474. PMID 22843982
ABSTRACT
SUMMARY: Genetic correlations are the genome-wide aggregate effects of causal variants affecting multiple traits. Traditionally, genetic correlations between complex traits are estimated from pedigree studies, but such estimates can be confounded by shared environmental factors. Moreover, for diseases, low prevalence rates imply that even if the true genetic correlation between disorders was high, co-aggregation of disorders in families might not occur or could not be distinguished from chance. We have developed and implemented statistical methods based on linear mixed models to obtain unbiased estimates of the genetic correlation between pairs of quantitative traits or pairs of binary traits of complex diseases using population-based case-control studies with genome-wide single-nucleotide polymorphism data. The method is validated in a simulation study and applied to estimate genetic correlation between various diseases from Wellcome Trust Case Control Consortium data in a series of bivariate analyses. We estimate a significant positive genetic correlation between risk of Type 2 diabetes and hypertension of ~0.31 (SE 0.14, P = 0.024). AVAILABILITY: Our methods, appropriate for both quantitative and binary traits, are implemented in the freely available software GCTA (http://www.complextraitgenomics.com/software/gcta/reml_bivar.html). CONTACT: hong.lee@uq.edu.au SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
DOI
10.1093/bioinformatics/bts474
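A bivariate analysis of this kind yields a genetic variance for each trait and a genetic covariance between them; the genetic correlation is the covariance divided by the geometric mean of the two variances. The sketch below shows only that last arithmetic step, with hypothetical names and inputs; it does not fit the mixed model.

```python
import math


def genetic_correlation(cov_g, var_g1, var_g2):
    """r_g = cov_g / sqrt(var_g1 * var_g2)."""
    if var_g1 <= 0.0 or var_g2 <= 0.0:
        raise ValueError("genetic variances must be positive")
    return cov_g / math.sqrt(var_g1 * var_g2)


if __name__ == "__main__":
    # Hypothetical variance components.
    print(genetic_correlation(0.1, 0.4, 0.1))
```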
GCTA-GREML-LDMS
PUBMED_LINK
DESCRIPTION
(GREML-LDMS)
URL
TITLE
Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index.
Main citation
Yang J, Bakshi A, Zhu Z, Hemani G, ...&, Visscher PM. (2015) Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nat Genet, 47 (10) 1114-20. doi:10.1038/ng.3390. PMID 26323059
ABSTRACT
We propose a method (GREML-LDMS) to estimate heritability for human complex traits in unrelated individuals using whole-genome sequencing data. We demonstrate using simulations based on whole-genome sequencing data that ∼97% and ∼68% of variation at common and rare variants, respectively, can be captured by imputation. Using the GREML-LDMS method, we estimate from 44,126 unrelated individuals that all ∼17 million imputed variants explain 56% (standard error (s.e.) = 2.3%) of variance for height and 27% (s.e. = 2.5%) of variance for body mass index (BMI), and we find evidence that height- and BMI-associated variants have been under natural selection. Considering the imperfect tagging of imputation and potential overestimation of heritability from previous family-based studies, heritability is likely to be 60-70% for height and 30-40% for BMI. Therefore, the missing heritability is small for both traits. For further discovery of genes associated with complex traits, a study design with SNP arrays followed by imputation is more cost-effective than whole-genome sequencing at current prices.
DOI
10.1038/ng.3390
GCTA-GREML-Partition (GREML)
PUBMED_LINK
FULL NAME
Genome-wide complex trait analysis (GCTA) Genome-based restricted maximum likelihood (GREML)
DESCRIPTION
(partition the genetic variance into individual chromosomes and genomic segments)
URL
TITLE
Genome partitioning of genetic variation for complex traits using common SNPs.
Main citation
Yang J, Manolio TA, Pasquale LR, Boerwinkle E, ...&, Visscher PM. (2011) Genome partitioning of genetic variation for complex traits using common SNPs. Nat Genet, 43 (6) 519-25. doi:10.1038/ng.823. PMID 21552263
ABSTRACT
We estimate and partition genetic variation for height, body mass index (BMI), von Willebrand factor and QT interval (QTi) using 586,898 SNPs genotyped on 11,586 unrelated individuals. We estimate that ∼45%, ∼17%, ∼25% and ∼21% of the variance in height, BMI, von Willebrand factor and QTi, respectively, can be explained by all autosomal SNPs and a further ∼0.5-1% can be explained by X chromosome SNPs. We show that the variance explained by each chromosome is proportional to its length, and that SNPs in or near genes explain more variation than SNPs between genes. We propose a new approach to estimate variation due to cryptic relatedness and population stratification. Our results provide further evidence that a substantial proportion of heritability is captured by common SNPs, that height, BMI and QTi are highly polygenic traits, and that the additive variation explained by a part of the genome is approximately proportional to the total length of DNA contained within genes therein.
DOI
10.1038/ng.823
GCTA-GREML-Quantitative (GREML)
PUBMED_LINK
FULL NAME
Genome-wide complex trait analysis (GCTA) Genome-based restricted maximum likelihood (GREML)
DESCRIPTION
GCTA-GREML analysis: estimating the variance explained by the SNPs / GCTA-GREML analysis for a case-control study
URL
TITLE
Common SNPs explain a large proportion of the heritability for human height.
Main citation
Yang J, Benyamin B, McEvoy BP, Gordon S, ...&, Visscher PM. (2010) Common SNPs explain a large proportion of the heritability for human height. Nat Genet, 42 (7) 565-9. doi:10.1038/ng.608. PMID 20562875
ABSTRACT
SNPs discovered by genome-wide association studies (GWASs) account for only a small fraction of the genetic variation of complex traits in human populations. Where is the remaining heritability? We estimated the proportion of variance for human height explained by 294,831 SNPs genotyped on 3,925 unrelated individuals using a linear model analysis, and validated the estimation method with simulations based on the observed genotype data. We show that 45% of variance can be explained by considering all SNPs simultaneously. Thus, most of the heritability is not missing but has not previously been detected because the individual effects are too small to pass stringent significance tests. We provide evidence that the remaining heritability is due to incomplete linkage disequilibrium between causal variants and genotyped SNPs, exacerbated by causal variants having lower minor allele frequency than the SNPs explored to date.
DOI
10.1038/ng.608
GEMMA
PUBMED_LINK
FULL NAME
genome-wide efficient mixed-model association
DESCRIPTION
GEMMA is the software implementing the Genome-wide Efficient Mixed Model Association algorithm for a standard linear mixed model and some of its close relatives for genome-wide association studies (GWAS). It fits a standard linear mixed model (LMM) to account for population stratification and sample structure for single marker association tests. It fits a Bayesian sparse linear mixed model (BSLMM) using Markov chain Monte Carlo (MCMC) for estimating the proportion of variance in phenotypes explained (PVE) by typed genotypes (i.e. chip heritability), predicting phenotypes, and identifying associated markers by jointly modeling all markers while controlling for population structure. It is computationally efficient for large scale GWAS and uses freely available open-source numerical libraries.
URL
TITLE
Genome-wide efficient mixed-model analysis for association studies.
Main citation
Zhou X, Stephens M. (2012) Genome-wide efficient mixed-model analysis for association studies. Nat Genet, 44 (7) 821-4. doi:10.1038/ng.2310. PMID 22706312
ABSTRACT
Linear mixed models have attracted considerable attention recently as a powerful and effective tool for accounting for population stratification and relatedness in genetic association tests. However, existing methods for exact computation of standard test statistics are computationally impractical for even moderate-sized genome-wide association studies. To address this issue, several approximate methods have been proposed. Here, we present an efficient exact method, which we refer to as genome-wide efficient mixed-model association (GEMMA), that makes approximations unnecessary in many contexts. This method is approximately n times faster than the widely used exact method known as efficient mixed-model association (EMMA), where n is the sample size, making exact genome-wide association analysis computationally practical for large numbers of individuals.
DOI
10.1038/ng.2310
GenoBoost
PUBMED_LINK
DESCRIPTION
GenoBoost is a polygenic score method to capture additive and non-additive genetic inheritance effects.
URL
KEYWORDS
additive effects, non-additive effects, statistical boosting
TITLE
A polygenic score method boosted by non-additive models.
Main citation
Ohta R, Tanigawa Y, Suzuki Y, Kellis M, ...&, Morishita S. (2024) A polygenic score method boosted by non-additive models. Nat Commun, 15 (1) 4433. doi:10.1038/s41467-024-48654-x. PMID 38811555
ABSTRACT
Dominance heritability in complex traits has received increasing recognition. However, most polygenic score (PGS) approaches do not incorporate non-additive effects. Here, we present GenoBoost, a flexible PGS modeling framework capable of considering both additive and non-additive effects, specifically focusing on genetic dominance. Building on statistical boosting theory, we derive provably optimal GenoBoost scores and provide its efficient implementation for analyzing large-scale cohorts. We benchmark it against seven commonly used PGS methods and demonstrate its competitive predictive performance. GenoBoost is ranked the best for four traits and second-best for three traits among twelve tested disease outcomes in UK Biobank. We reveal that GenoBoost improves prediction for autoimmune diseases by incorporating non-additive effects localized in the MHC locus and, more broadly, works best in less polygenic traits. We further demonstrate that GenoBoost can infer the mode of genetic inheritance without requiring prior knowledge. For example, GenoBoost finds non-zero genetic dominance effects for 602 of 900 selected genetic variants, resulting in 2.5% improvements in predicting psoriasis cases. Lastly, we show that GenoBoost can prioritize genetic loci with genetic dominance not previously reported in the GWAS catalog. Our results highlight the increased accuracy and biological insights from incorporating non-additive effects in PGS models.
DOI
10.1038/s41467-024-48654-x
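To make the additive versus non-additive distinction concrete, the sketch below scores an individual with a per-variant additive weight plus a dominance deviation that applies only to heterozygotes. This is the textbook additive-plus-dominance coding, used here purely as an illustration; it is not GenoBoost's model or fitting procedure, and all names and weights are hypothetical.

```python
def polygenic_score(genotypes, additive, dominance):
    """Sum of additive and dominance contributions over variants.

    genotypes: allele counts (0, 1 or 2) per variant.
    additive:  per-allele weight for each variant.
    dominance: extra weight applied only when the genotype is heterozygous.
    """
    score = 0.0
    for count, a, d in zip(genotypes, additive, dominance):
        if count not in (0, 1, 2):
            raise ValueError("genotypes must be coded 0, 1 or 2")
        score += a * count                 # additive part scales with allele count
        score += d if count == 1 else 0.0  # dominance deviation, heterozygotes only
    return score


if __name__ == "__main__":
    print(polygenic_score([0, 1, 2], [0.1, 0.2, 0.3], [0.5, 0.5, 0.5]))
```

Setting every dominance weight to zero recovers an ordinary additive score.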
GenomeAsia 100K
PUBMED_LINK
URL
TITLE
The GenomeAsia 100K Project enables genetic discoveries across Asia.
Main citation
GenomeAsia100K Consortium. (2019) The GenomeAsia 100K Project enables genetic discoveries across Asia. Nature, 576 (7785) 106-111. doi:10.1038/s41586-019-1793-z. PMID 31802016
ABSTRACT
The underrepresentation of non-Europeans in human genetic studies so far has limited the diversity of individuals in genomic datasets and led to reduced medical relevance for a large proportion of the world's population. Population-specific reference genome datasets as well as genome-wide association studies in diverse populations are needed to address this issue. Here we describe the pilot phase of the GenomeAsia 100K Project. This includes a whole-genome sequencing reference dataset from 1,739 individuals of 219 population groups and 64 countries across Asia. We catalogue genetic variation, population structure, disease associations and founder effects. We also explore the use of this dataset in imputation, to facilitate genetic studies in populations across Asia and worldwide.
DOI
10.1038/s41586-019-1793-z
Genomic-SEM
PUBMED_LINK
FULL NAME
genomic structural equation modelling
DESCRIPTION
R-package which allows the user to fit structural equation models based on the summary statistics obtained from genome wide association studies (GWAS).
URL
KEYWORDS
SEM
TITLE
Genomic structural equation modelling provides insights into the multivariate genetic architecture of complex traits.
Main citation
Grotzinger AD, Rhemtulla M, de Vlaming R, Ritchie SJ, ...&, Tucker-Drob EM. (2019) Genomic structural equation modelling provides insights into the multivariate genetic architecture of complex traits. Nat Hum Behav, 3 (5) 513-525. doi:10.1038/s41562-019-0566-x. PMID 30962613
ABSTRACT
Genetic correlations estimated from genome-wide association studies (GWASs) reveal pervasive pleiotropy across a wide variety of phenotypes. We introduce genomic structural equation modelling (genomic SEM): a multivariate method for analysing the joint genetic architecture of complex traits. Genomic SEM synthesizes genetic correlations and single-nucleotide polymorphism heritabilities inferred from GWAS summary statistics of individual traits from samples with varying and unknown degrees of overlap. Genomic SEM can be used to model multivariate genetic associations among phenotypes, identify variants with effects on general dimensions of cross-trait liability, calculate more predictive polygenic scores and identify loci that cause divergence between traits. We demonstrate several applications of genomic SEM, including a joint analysis of summary statistics from five psychiatric traits. We identify 27 independent single-nucleotide polymorphisms not previously identified in the contributing univariate GWASs. Polygenic scores from genomic SEM consistently outperform those from univariate GWASs. Genomic SEM is flexible and open ended, and allows for continuous innovation in multivariate genetic analysis.
DOI
10.1038/s41562-019-0566-x
GLEANR
PUBMED_LINK
FULL NAME
GWAS latent embeddings accounting for noise and regularization
DESCRIPTION
GLEANR is a GWAS matrix factorization tool to estimate sparse latent pleiotropic genetic factors. Factors map traits to a distribution of SNP effects that may capture biological pathways or mechanisms shared by these traits.
URL
TITLE
Sparse matrix factorization robust to sample sharing across GWASs reveals interpretable genetic components.
Main citation
Omdahl AR, Weinstock JS, Keener R, Chhetri SB, ...&, Battle A. (2025) Sparse matrix factorization robust to sample sharing across GWASs reveals interpretable genetic components. Am J Hum Genet, 112 (9) 2178-2197. doi:10.1016/j.ajhg.2025.07.003. PMID 40730164
ABSTRACT
Complex trait-associated genetic variation is highly pleiotropic. This extensive pleiotropy implies that multi-phenotype analyses are informative for characterizing genetic associations, as they facilitate the discovery of trait-shared and trait-specific variants and pathways ("genetic factors"). Previous efforts have estimated genetic factors using matrix factorization (MF) applied to numerous genome-wide association studies (GWASs). However, existing methods are susceptible to spurious factors arising from residual confounding due to sample sharing in biobank GWASs. Furthermore, MF approaches have historically estimated dense factors, loaded on most traits and variants, that are challenging to map onto interpretable biological pathways. To address these shortcomings, we introduce "GWAS latent embeddings accounting for noise and regularization" (GLEANR), an MF method for detection of sparse genetic factors from summary statistics. GLEANR accounts for sample sharing between studies and uses regularization to estimate a data-driven number of interpretable factors. GLEANR is robust to confounding induced by shared samples and improves the replication of genetic factors derived from distinct biobanks. We used GLEANR to evaluate 137 diverse GWASs from the UK Biobank, identifying 58 factors that decompose the genetic architecture of input traits and have distinct signatures of negative selection and degrees of polygenicity. These sparse factors can be interpreted with respect to disease, cell type, and pathway enrichment. We highlight three such factors that captured platelet-measure phenotypes and were enriched for disease-relevant markers corresponding to distinct stages of platelet differentiation. Overall, GLEANR is a powerful tool for discovering both trait-specific and trait-shared pathways underlying complex traits from GWAS summary statistics.
DOI
10.1016/j.ajhg.2025.07.003
GLIMPSE
PUBMED_LINK
FULL NAME
Genotype Likelihoods IMputation and PhaSing mEthod
DESCRIPTION
GLIMPSE is a phasing and imputation method for large-scale low-coverage sequencing studies.
URL
TITLE
Efficient phasing and imputation of low-coverage sequencing data using large reference panels.
Main citation
Rubinacci S, Ribeiro DM, Hofmeister RJ, Delaneau O. (2021) Efficient phasing and imputation of low-coverage sequencing data using large reference panels. Nat Genet, 53 (1) 120-126. doi:10.1038/s41588-020-00756-0. PMID 33414550
ABSTRACT
Low-coverage whole-genome sequencing followed by imputation has been proposed as a cost-effective genotyping approach for disease and population genetics studies. However, its competitiveness against SNP arrays is undermined because current imputation methods are computationally expensive and unable to leverage large reference panels. Here, we describe a method, GLIMPSE, for phasing and imputation of low-coverage sequencing datasets from modern reference panels. We demonstrate its remarkable performance across different coverages and human populations. GLIMPSE achieves imputation of a genome for less than US$1 in computational cost, considerably outperforming other methods and improving imputation accuracy over the full allele frequency range. As a proof of concept, we show that 1× coverage enables effective gene expression association studies and outperforms dense SNP arrays in rare variant burden tests. Overall, this study illustrates the promising potential of low-coverage imputation and suggests a paradigm shift in the design of future genomic studies.
DOI
10.1038/s41588-020-00756-0
GMRM
PUBMED_LINK
FULL NAME
Bayesian grouped mixture of regressions model
DESCRIPTION
gmrm is hybrid-parallel software for a Bayesian grouped mixture of regressions model for genome-wide association studies (GWAS). It is written in C++ using extensive optimisations and code vectorisation. It relies on PLINK's .bed format. It can handle multiple traits simultaneously.
URL
TITLE
Improving GWAS discovery and genomic prediction accuracy in biobank data.
Main citation
Orliac EJ, Trejo Banos D, Ojavee SE, Läll K, ...&, Robinson MR. (2022) Improving GWAS discovery and genomic prediction accuracy in biobank data. Proc Natl Acad Sci U S A, 119 (31) e2121279119. doi:10.1073/pnas.2121279119. PMID 35905320
ABSTRACT
Genetically informed, deep-phenotyped biobanks are an important research resource and it is imperative that the most powerful, versatile, and efficient analysis approaches are used. Here, we apply our recently developed Bayesian grouped mixture of regressions model (GMRM) in the UK and Estonian Biobanks and obtain the highest genomic prediction accuracy reported to date across 21 heritable traits. When compared to other approaches, GMRM accuracy was greater than annotation prediction models run in the LDAK or LDPred-funct software by 15% (SE 7%) and 14% (SE 2%), respectively, and was 18% (SE 3%) greater than a baseline BayesR model without single-nucleotide polymorphism (SNP) markers grouped into minor allele frequency-linkage disequilibrium (MAF-LD) annotation categories. For height, the prediction accuracy R2 was 47% in a UK Biobank holdout sample, which was 76% of the estimated [Formula: see text]. We then extend our GMRM prediction model to provide mixed-linear model association (MLMA) SNP marker estimates for genome-wide association (GWAS) discovery, which increased the independent loci detected to 16,162 in unrelated UK Biobank individuals, compared to 10,550 from BoltLMM and 10,095 from Regenie, a 62 and 65% increase, respectively. The average [Formula: see text] value of the leading markers increased by 15.24 (SE 0.41) for every 1% increase in prediction accuracy gained over a baseline BayesR model across the traits. Thus, we show that modeling genetic associations accounting for MAF and LD differences among SNP markers, and incorporating prior knowledge of genomic function, is important for both genomic prediction and discovery in large-scale individual-level studies.
DOI
10.1073/pnas.2121279119
GNOVA
PUBMED_LINK
FULL NAME
GeNetic cOVariance Analyzer
DESCRIPTION
A principled framework to estimate annotation-stratified genetic covariance using GWAS summary statistics.
URL
TITLE
A Powerful Approach to Estimating Annotation-Stratified Genetic Covariance via GWAS Summary Statistics.
Main citation
Lu Q, Li B, Ou D, Erlendsdottir M, ...&, Zhao H. (2017) A Powerful Approach to Estimating Annotation-Stratified Genetic Covariance via GWAS Summary Statistics. Am J Hum Genet, 101 (6) 939-964. doi:10.1016/j.ajhg.2017.11.001. PMID 29220677
ABSTRACT
Despite the success of large-scale genome-wide association studies (GWASs) on complex traits, our understanding of their genetic architecture is far from complete. Jointly modeling multiple traits' genetic profiles has provided insights into the shared genetic basis of many complex traits. However, large-scale inference sets a high bar for both statistical power and biological interpretability. Here we introduce a principled framework to estimate annotation-stratified genetic covariance between traits using GWAS summary statistics. Through theoretical and numerical analyses, we demonstrate that our method provides accurate covariance estimates, thereby enabling researchers to dissect both the shared and distinct genetic architecture across traits to better understand their etiologies. Among 50 complex traits with publicly accessible GWAS summary statistics (Ntotal≈ 4.5 million), we identified more than 170 pairs with statistically significant genetic covariance. In particular, we found strong genetic covariance between late-onset Alzheimer disease (LOAD) and amyotrophic lateral sclerosis (ALS), two major neurodegenerative diseases, in single-nucleotide polymorphisms (SNPs) with high minor allele frequencies and in SNPs located in the predicted functional genome. Joint analysis of LOAD, ALS, and other traits highlights LOAD's correlation with cognitive traits and hints at an autoimmune component for ALS.
DOI
10.1016/j.ajhg.2017.11.001
GPLEMMA
PUBMED_LINK
FULL NAME
Gaussian Prior Linear Environment Mixed Model Analysis
DESCRIPTION
GPLEMMA (Gaussian Prior Linear Environment Mixed Model Analysis) is a non-linear randomized Haseman-Elston regression method for flexible modeling of gene-environment interactions in large datasets such as the UK Biobank.
URL
TITLE
A non-linear regression method for estimation of gene-environment heritability.
Main citation
Kerin M, Marchini J. (2021) A non-linear regression method for estimation of gene-environment heritability. Bioinformatics, 36 (24) 5632-5639. doi:10.1093/bioinformatics/btaa1079. PMID 33367483
ABSTRACT
MOTIVATION: Gene-environment (GxE) interactions are one of the least studied aspects of the genetic architecture of human traits and diseases. The environment of an individual is inherently high dimensional, evolves through time and can be expensive and time consuming to measure. The UK Biobank study, with all 500 000 participants having undergone an extensive baseline questionnaire, represents a unique opportunity to assess GxE heritability for many traits and diseases in a well powered setting. RESULTS: We have developed a randomized Haseman-Elston non-linear regression method applicable when many environmental variables have been measured on each individual. The method (GPLEMMA) simultaneously estimates a linear environmental score (ES) and its GxE heritability. We compare the method via simulation to a whole-genome regression approach (LEMMA) for estimating GxE heritability. We show that GPLEMMA is more computationally efficient than LEMMA on large datasets, and produces results highly correlated with those from LEMMA when applied to simulated data and real data from the UK Biobank. AVAILABILITY AND IMPLEMENTATION: Software implementing the GPLEMMA method is available from https://jmarchini.org/gplemma/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
DOI
10.1093/bioinformatics/btaa1079
GREP
PUBMED_LINK
FULL NAME
Genome for REPositioning drugs
DESCRIPTION
GREP can quantify an enrichment of the user-defined set of genes in the target of clinical indication categories and capture potentially repositionable drugs targeting the gene set. Both can be run in a few seconds!
URL
TITLE
GREP: genome for REPositioning drugs.
Main citation
Sakaue S, Okada Y. (2019) GREP: genome for REPositioning drugs. Bioinformatics, 35 (19) 3821-3823. doi:10.1093/bioinformatics/btz166. PMID 30859178
ABSTRACT
SUMMARY: Making use of accumulated genetic knowledge for clinical practice is our next goal in human genetics. Here we introduce GREP (Genome for REPositioning drugs), a standalone python software to quantify an enrichment of the user-defined set of genes in the target of clinical indication categories and to capture potentially repositionable drugs targeting the gene set. We show that genes identified by the large-scale genome-wide association studies were robustly enriched in the approved drugs to treat the trait of interest. This enrichment analysis was also highly applicable to other sets of biological genes such as those identified by gene expression studies and genes somatically mutated in cancers. This software should accelerate investigators to reposition drugs to other indications with the guidance of known genomics. AVAILABILITY AND IMPLEMENTATION: GREP is available at https://github.com/saorisakaue/GREP as a python source code. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
DOI
10.1093/bioinformatics/btz166
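One common way to quantify enrichment of a gene set among drug targets is a one-sided hypergeometric (Fisher's exact) test. The entry does not state which test GREP uses, so the sketch below is a generic illustration under that assumption, with hypothetical names and counts.

```python
from math import comb


def enrichment_pvalue(n_universe, n_targets, n_query, n_overlap):
    """P(overlap >= n_overlap) when n_query genes are drawn at random.

    n_universe: total number of genes considered.
    n_targets:  genes targeted by drugs in one indication category.
    n_query:    size of the user-defined gene set.
    n_overlap:  query genes that are also targets.
    """
    if not 0 <= n_overlap <= min(n_targets, n_query) <= n_universe:
        raise ValueError("inconsistent counts")

    total = comb(n_universe, n_query)
    tail = sum(
        comb(n_targets, k) * comb(n_universe - n_targets, n_query - k)
        for k in range(n_overlap, min(n_targets, n_query) + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical counts: 20,000 genes, 150 targets, 60 query genes, 5 shared.
    print(enrichment_pvalue(20000, 150, 60, 5))
```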
GRPa-PRS
PUBMED_LINK
FULL NAME
genetically-regulated pathways
URL
TITLE
GRPa-PRS: A risk stratification method to identify genetically-regulated pathways in polygenic diseases.
Main citation
Li X, Fernandes BS, Liu A, Chen J, ...&, Dai Y. (2024) GRPa-PRS: A risk stratification method to identify genetically-regulated pathways in polygenic diseases. medRxiv. doi:10.1101/2023.06.19.23291621. PMID 37425929
ABSTRACT
BACKGROUND: Polygenic risk scores (PRS) are tools used to evaluate an individual's susceptibility to polygenic diseases based on their genetic profile. A considerable proportion of people carry a high genetic risk but evade the disease. On the other hand, some individuals with a low risk eventually develop the disease. We hypothesized that unknown counterfactors might be involved in reversing the PRS prediction, which might provide new insights into the pathogenesis, prevention, and early intervention of diseases. METHODS: We built a novel computational framework to identify genetically-regulated pathways (GRPas) using PRS-based stratification for each cohort. We curated two AD cohorts with genotyping data; the discovery (disc) and the replication (rep) datasets include 2722 and 2854 individuals, respectively. First, we calculated the optimized PRS model based on the three recent AD GWAS summary statistics for each cohort. Then, we stratified the individuals by their PRS and clinical diagnosis into six biologically meaningful PRS strata, such as AD cases with low/high risk and cognitively normal (CN) with low/high risk. Lastly, we imputed individual genetically-regulated expression (GReX) and identified differential GReX and GRPas between risk strata using gene-set enrichment and variational analyses in two models, with and without APOE effects. An orthogonality test was further conducted to verify those GRPas are independent of PRS risk. To verify the generalizability of other polygenic diseases, we further applied a default model of GRPa-PRS for schizophrenia (SCZ). RESULTS: For each stratum, we conducted the same procedures in both the disc and rep datasets for comparison. In AD, we identified several well-known AD-related pathways, including amyloid-beta clearance, tau protein binding, and astrocyte response to oxidative stress. Additionally, we discovered resilience-related GRPas that are orthogonal to AD PRS, such as the calcium signaling pathway and divalent inorganic cation homeostasis. In SCZ, pathways related to mitochondrial function and muscle development were highlighted. Finally, our GRPa-PRS method identified more consistent differential pathways compared to another variant-based pathway PRS method. CONCLUSIONS: We developed a framework, GRPa-PRS, to systematically explore the differential GReX and GRPas among individuals stratified by their estimated PRS. The GReX-level comparison among those strata unveiled new insights into the pathways associated with disease risk and resilience. Our framework is extendable to other polygenic complex diseases.
DOI
10.1101/2023.06.19.23291621
gsMap
PUBMED_LINK
FULL NAME
genetically informed spatial mapping of cells for complex traits
DESCRIPTION
gsMap (genetically informed spatial mapping of cells for complex traits) integrates spatial transcriptomics (ST) data with genome-wide association study (GWAS) summary statistics to map cells to human complex traits, including diseases, in a spatially resolved manner.
URL
KEYWORDS
spatial transcriptomics
TITLE
Spatially resolved mapping of cells associated with human complex traits.
Main citation
Song L, Chen W, Hou J, Guo M, ...&, Yang J. (2025) Spatially resolved mapping of cells associated with human complex traits. Nature, 641 (8064) 932-941. doi:10.1038/s41586-025-08757-x. PMID 40108460
ABSTRACT
Depicting spatial distributions of disease-relevant cells is crucial for understanding disease pathology. Here we present genetically informed spatial mapping of cells for complex traits (gsMap), a method that integrates spatial transcriptomics data with summary statistics from genome-wide association studies to map cells to human complex traits, including diseases, in a spatially resolved manner. Using embryonic spatial transcriptomics datasets covering 25 organs, we benchmarked gsMap through simulation and by corroborating known trait-associated cells or regions in various organs. Applying gsMap to brain spatial transcriptomics data, we reveal that the spatial distribution of glutamatergic neurons associated with schizophrenia more closely resembles that for cognitive traits than that for mood traits such as depression. The schizophrenia-associated glutamatergic neurons were distributed near the dorsal hippocampus, with upregulated expression of calcium signalling and regulation genes, whereas depression-associated glutamatergic neurons were distributed near the deep medial prefrontal cortex, with upregulated expression of neuroplasticity and psychiatric drug target genes. Our study provides a method for spatially resolved mapping of trait-associated cells and demonstrates the gain of biological insights (such as the spatial distribution of trait-relevant cells and related signature genes) through these maps.
DOI
10.1038/s41586-025-08757-x
ARROW_SUMMARY
Spatial transcriptomics data + GWAS summary statistics → Graph Neural Network identifies homogeneous spatial domains → Compute Gene Specificity Scores (GSS) for each spot → Map GSS to nearby SNPs → Perform Stratified LD Score Regression (S-LDSC) to assess trait heritability enrichment → Aggregate spot-level p-values using the Cauchy Combination Test to identify trait-associated spatial regions
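The last step of the summary above aggregates spot-level p-values with the Cauchy combination test. The following is a minimal Python sketch of that test in its generic form, not code taken from gsMap: the function name `cauchy_combination`, the equal default weights, and the clamping constant `eps` are assumptions made for illustration.

```python
import math


def cauchy_combination(p_values, weights=None, eps=1e-15):
    """Combine p-values with the Cauchy combination test.

    Each p-value is mapped to a standard Cauchy variate via
    tan((0.5 - p) * pi); the weighted mean of these variates is then
    converted back to a p-value with the Cauchy survival function.
    """
    if not p_values:
        raise ValueError("p_values must not be empty")
    if weights is None:
        # Assumption for this sketch: equal weights for every spot.
        weights = [1.0] * len(p_values)
    if len(weights) != len(p_values):
        raise ValueError("weights and p_values must have the same length")

    total_weight = sum(weights)
    statistic = 0.0
    for p, w in zip(p_values, weights):
        # Clamp away from 0 and 1 so the tangent stays finite.
        p = min(max(p, eps), 1.0 - eps)
        statistic += w * math.tan((0.5 - p) * math.pi)
    statistic /= total_weight

    # Survival function of the standard Cauchy distribution.
    return 0.5 - math.atan(statistic) / math.pi


if __name__ == "__main__":
    spot_p_values = [0.01, 0.5, 0.9]  # made-up example values
    print(cauchy_combination(spot_p_values))
```

A useful property of this form is that a single p-value, or several identical p-values, combine to the same value, which makes the function easy to sanity-check.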
Guideline-Namba
PUBMED_LINK
DESCRIPTION
a practical guideline for genomics-driven drug discovery for cross-population meta-analysis, as lessons from the Global Biobank Meta-analysis Initiative (GBMI)
TITLE
A practical guideline of genomics-driven drug discovery in the era of global biobank meta-analysis.
Main citation
Namba S, Konuma T, Wu KH, Zhou W, ...&, Okada Y. (2022) A practical guideline of genomics-driven drug discovery in the era of global biobank meta-analysis. Cell Genom, 2 (10) 100190. doi:10.1016/j.xgen.2022.100190. PMID 36778001
ABSTRACT
Genomics-driven drug discovery is indispensable for accelerating the development of novel therapeutic targets. However, the drug discovery framework based on evidence from genome-wide association studies (GWASs) has not been established, especially for cross-population GWAS meta-analysis. Here, we introduce a practical guideline for genomics-driven drug discovery for cross-population meta-analysis, as lessons from the Global Biobank Meta-analysis Initiative (GBMI). Our drug discovery framework encompassed three methodologies and was applied to the 13 common diseases targeted by GBMI (N_mean = 1,329,242). Individual methodologies complementarily prioritized drugs and drug targets, which were systematically validated by referring previously known drug-disease relationships. Integration of the three methodologies provided a comprehensive catalog of candidate drugs for repositioning, nominating promising drug candidates targeting the genes involved in the coagulation process for venous thromboembolism and the interleukin-4 and interleukin-13 signaling pathway for gout. Our study highlighted key factors for successful genomics-driven drug discovery using cross-population meta-analyses.
DOI
10.1016/j.xgen.2022.100190
GWAMA
PUBMED_LINK
FULL NAME
Genome-Wide Association Meta-Analysis
DESCRIPTION
Software tool for meta-analysis of genome-wide association data
URL
TITLE
GWAMA: software for genome-wide association meta-analysis.
Main citation
Mägi R, Morris AP. (2010) GWAMA: software for genome-wide association meta-analysis. BMC Bioinformatics, 11 () 288. doi:10.1186/1471-2105-11-288. PMID 20509871
ABSTRACT
BACKGROUND: Despite the recent success of genome-wide association studies in identifying novel loci contributing effects to complex human traits, such as type 2 diabetes and obesity, much of the genetic component of variation in these phenotypes remains unexplained. One way to improve power to detect further novel loci is through meta-analysis of studies from the same population, increasing the sample size over any individual study. Although statistical software analysis packages incorporate routines for meta-analysis, they are ill equipped to meet the challenges of the scale and complexity of data generated in genome-wide association studies. RESULTS: We have developed flexible, open-source software for the meta-analysis of genome-wide association studies. The software incorporates a variety of error trapping facilities, and provides a range of meta-analysis summary statistics. The software is distributed with scripts that allow simple formatting of files containing the results of each association study and generate graphical summaries of genome-wide meta-analysis results. CONCLUSIONS: The GWAMA (Genome-Wide Association Meta-Analysis) software has been developed to perform meta-analysis of summary statistics generated from genome-wide association studies of dichotomous phenotypes or quantitative traits. Software with source files, documentation and example data files are freely available online at http://www.well.ox.ac.uk/GWAMA.
DOI
10.1186/1471-2105-11-288
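To make the idea of meta-analysing per-study summary statistics concrete, here is a minimal Python sketch of an inverse-variance-weighted fixed-effects meta-analysis for a single variant. It illustrates one common approach only and is not taken from the GWAMA source code; the function name `fixed_effects_meta` and the example numbers are hypothetical.

```python
import math


def fixed_effects_meta(betas, standard_errors):
    """Inverse-variance-weighted fixed-effects meta-analysis for one variant.

    betas: per-study effect estimates (e.g. log odds ratios).
    standard_errors: per-study standard errors of those estimates.
    Returns (pooled_beta, pooled_se, z_score, two_sided_p).
    """
    if len(betas) != len(standard_errors) or not betas:
        raise ValueError("need one standard error per effect estimate")
    if any(se <= 0 for se in standard_errors):
        raise ValueError("standard errors must be positive")

    # Each study is weighted by the inverse of its sampling variance.
    weights = [1.0 / (se * se) for se in standard_errors]
    weight_sum = sum(weights)

    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / weight_sum
    pooled_se = math.sqrt(1.0 / weight_sum)
    z_score = pooled_beta / pooled_se

    # Two-sided p-value under a standard normal null distribution.
    p_value = math.erfc(abs(z_score) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z_score, p_value


if __name__ == "__main__":
    # Made-up effect estimates from three studies of the same variant.
    print(fixed_effects_meta([0.10, 0.08, 0.12], [0.02, 0.03, 0.04]))
```

In practice the per-study effects must first be aligned to the same effect allele, which this sketch assumes has already been done.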
gwas diversity monitor
PUBMED_LINK
URL
TITLE
The GWAS Diversity Monitor tracks diversity by disease in real time.
Main citation
Mills MC, Rahal C. (2020) The GWAS Diversity Monitor tracks diversity by disease in real time. Nat Genet, 52 (3) 242-243. doi:10.1038/s41588-020-0580-y. PMID 32139905
DOI
10.1038/s41588-020-0580-y
GWAS SVatalog
FULL NAME
GWAS SVatalog: a visualization tool to aid fine-mapping of GWAS loci with structural variations
DESCRIPTION
Novel open-source web tool combining GWAS Catalog's SNP-trait associations with LD statistics to identify SVs explaining GWAS loci
URL
KEYWORDS
GWAS, structural variations, visualization, fine-mapping
USE
Computes and visualizes linkage disequilibrium between structural variations and GWAS-associated SNPs
PREPRINT_DOI
10.1101/2025.09.03.674075
Main citation
Chirmade S, Wang Z, et al. (2025). GWAS SVatalog: a visualization tool to aid fine-mapping of GWAS loci with structural variations. bioRxiv
GWAS-by-Subtraction
PUBMED_LINK
URL
TITLE
Investigating the genetic architecture of noncognitive skills using GWAS-by-subtraction.
Main citation
Demange PA, Malanchini M, Mallard TT, Biroli P, ...&, Nivard MG. (2021) Investigating the genetic architecture of noncognitive skills using GWAS-by-subtraction. Nat Genet, 53 (1) 35-44. doi:10.1038/s41588-020-00754-2. PMID 33414549
ABSTRACT
Little is known about the genetic architecture of traits affecting educational attainment other than cognitive ability. We used genomic structural equation modeling and prior genome-wide association studies (GWASs) of educational attainment (n = 1,131,881) and cognitive test performance (n = 257,841) to estimate SNP associations with educational attainment variation that is independent of cognitive ability. We identified 157 genome-wide-significant loci and a polygenic architecture accounting for 57% of genetic variance in educational attainment. Noncognitive genetics were enriched in the same brain tissues and cell types as cognitive performance, but showed different associations with gray-matter brain volumes. Noncognitive genetics were further distinguished by associations with personality traits, less risky behavior and increased risk for certain psychiatric disorders. For socioeconomic success and longevity, noncognitive and cognitive-performance genetics demonstrated associations of similar magnitude. By conducting a GWAS of a phenotype that was not directly measured, we offer a view of genetic architecture of noncognitive skills influencing educational success.
DOI
10.1038/s41588-020-00754-2
GWASLab
DESCRIPTION
A Python package for handling GWAS summary statistics (sumstats).
URL
PREPRINT_DOI
10.51094/jxiv.370
Main citation
GWASLab preprint: He, Y., Koido, M., Shimmori, Y., Kamatani, Y. (2023). GWASLab: a Python package for processing and visualizing GWAS summary statistics. Preprint at Jxiv, 2023-5. https://doi.org/10.51094/jxiv.370
gwaslab
GWAX
PUBMED_LINK
FULL NAME
genome-wide association by proxy
DESCRIPTION
In randomly ascertained cohorts, replacing cases with their first-degree relatives enables studies of diseases that are absent (or nearly absent) in the cohort.
TITLE
Case-control association mapping by proxy using family history of disease.
Main citation
Liu JZ, Erlich Y, Pickrell JK. (2017) Case-control association mapping by proxy using family history of disease. Nat Genet, 49 (3) 325-331. doi:10.1038/ng.3766. PMID 28092683
ABSTRACT
Collecting cases for case-control genetic association studies can be time-consuming and expensive. In some situations (such as studies of late-onset or rapidly lethal diseases), it may be more practical to identify family members of cases. In randomly ascertained cohorts, replacing cases with their first-degree relatives enables studies of diseases that are absent (or nearly absent) in the cohort. We refer to this approach as genome-wide association study by proxy (GWAX) and apply it to 12 common diseases in 116,196 individuals from the UK Biobank. Meta-analysis with published genome-wide association study summary statistics replicated established risk loci and yielded four newly associated loci for Alzheimer's disease, eight for coronary artery disease and five for type 2 diabetes. In addition to informing disease biology, our results demonstrate the utility of association mapping without directly observing cases. We anticipate that GWAX will prove useful in future genetic studies of complex traits in large population cohorts.
DOI
10.1038/ng.3766
GWFM
PUBMED_LINK
FULL NAME
Genome-wide fine-mapping with functional annotations
DESCRIPTION
Genome-wide fine-mapping (GWFM) with functional annotations models the global genetic architecture rather than isolated loci; compared with region-specific approaches it improves error control, power, resolution, precision, replication, and cross-ancestry phenotype prediction. Distributed as part of the GCTB software suite.
URL
KEYWORDS
fine-mapping, functional annotation, credible sets, trans-ancestry
TITLE
Genome-wide fine-mapping improves identification of causal variants.
Main citation
Wu Y, Zheng Z, Thibaut L, Lin T, ...&, Zeng J. (2026) Genome-wide fine-mapping improves identification of causal variants. Nat Genet, () . doi:10.1038/s41588-026-02549-3. PMID 41912930
ABSTRACT
Fine-mapping refines genotype-phenotype association signals to identify causal variants underlying complex traits. However, current methods typically focus on individual genomic loci and do not account for the global genetic architecture. Here we demonstrate the advantages of performing genome-wide fine-mapping (GWFM) with functional annotations and develop methods to facilitate GWFM. In simulations and real data analyses, GWFM outperforms current methods across several metrics, including error control, mapping power, resolution, precision, replication rate and trans-ancestry phenotype prediction. Across 48 complex traits, we identify credible sets that collectively explain 18% of the SNP-based heritability (h²_SNP) on average, with 30% credible sets located outside genome-wide significant loci. Leveraging the genetic architecture estimated from GWFM, we predict that fine-mapping over 50% of h²_SNP would require an average of 2 million samples. Finally, as proof-of-principle, we highlight a known causal variant at FTO influencing body mass index and identify new missense causal variants influencing schizophrenia and Crohn's disease risk.
DOI
10.1038/s41588-026-02549-3
Hail
DESCRIPTION
Hail is an open-source, general-purpose, Python-based data analysis tool with additional data types and methods for working with genomic data.
URL
Han-MHC
PUBMED_LINK
URL
TITLE
Deep sequencing of the MHC region in the Chinese population contributes to studies of complex disease.
Main citation
Zhou F, Cao H, Zuo X, Zhang T, ...&, Zhang X. (2016) Deep sequencing of the MHC region in the Chinese population contributes to studies of complex disease. Nat Genet, 48 (7) 740-6. doi:10.1038/ng.3576. PMID 27213287
ABSTRACT
The human major histocompatibility complex (MHC) region has been shown to be associated with numerous diseases. However, it remains a challenge to pinpoint the causal variants for these associations because of the extreme complexity of the region. We thus sequenced the entire 5-Mb MHC region in 20,635 individuals of Han Chinese ancestry (10,689 controls and 9,946 patients with psoriasis) and constructed a Han-MHC database that includes both variants and HLA gene typing results of high accuracy. We further identified multiple independent new susceptibility loci in HLA-C, HLA-B, HLA-DPB1 and BTNL2 and an intergenic variant, rs118179173, associated with psoriasis and confirmed the well-established risk allele HLA-C*06:02. We anticipate that our Han-MHC reference panel built by deep sequencing of a large number of samples will serve as a useful tool for investigating the role of the MHC region in a variety of diseases and thus advance understanding of the pathogenesis of these disorders.
DOI
10.1038/ng.3576
HAPGEN2
PUBMED_LINK
DESCRIPTION
HAPGEN2 is an updated version of the program HAPGEN, which simulates case-control datasets at SNP markers. The new version can simulate multiple disease SNPs on a single chromosome, on the assumption that each disease SNP acts independently and is in Hardy-Weinberg equilibrium. We also supply an R package that can simulate interaction between the disease SNPs.
URL
TITLE
HAPGEN2: simulation of multiple disease SNPs.
Main citation
Su Z, Marchini J, Donnelly P. (2011) HAPGEN2: simulation of multiple disease SNPs. Bioinformatics, 27 (16) 2304-5. doi:10.1093/bioinformatics/btr341. PMID 21653516
ABSTRACT
MOTIVATION: Performing experiments with simulated data is an inexpensive approach to evaluating competing experimental designs and analysis methods in genome-wide association studies. Simulation based on resampling known haplotypes is fast and efficient and can produce samples with patterns of linkage disequilibrium (LD), which mimic those in real data. However, the inability of current methods to simulate multiple nearby disease SNPs on the same chromosome can limit their application. RESULTS: We introduce a new simulation algorithm based on a successful resampling method, HAPGEN, that can simulate multiple nearby disease SNPs on the same chromosome. The new method, HAPGEN2, retains many advantages of resampling methods and expands the range of disease models that current simulators offer. AVAILABILITY: HAPGEN2 is freely available from http://www.stats.ox.ac.uk/~marchini/software/gwas/gwas.html. CONTACT: zhan@well.ox.ac.uk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
DOI
10.1093/bioinformatics/btr341
haploview
PUBMED_LINK
DESCRIPTION
Haploview is designed to simplify and expedite the process of haplotype analysis by providing a common interface to several tasks relating to such analyses.
URL
TITLE
Haploview: analysis and visualization of LD and haplotype maps.
Main citation
Barrett JC, Fry B, Maller J, Daly MJ. (2005) Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics, 21 (2) 263-5. doi:10.1093/bioinformatics/bth457. PMID 15297300
ABSTRACT
UNLABELLED: Research over the last few years has revealed significant haplotype structure in the human genome. The characterization of these patterns, particularly in the context of medical genetic association studies, is becoming a routine research activity. Haploview is a software package that provides computation of linkage disequilibrium statistics and population haplotype patterns from primary genotype data in a visually appealing and interactive interface. AVAILABILITY: http://www.broad.mit.edu/mpg/haploview/ CONTACT: jcbarret@broad.mit.edu
DOI
10.1093/bioinformatics/bth457
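The linkage disequilibrium statistics mentioned in the abstract can be illustrated with a small Python sketch that computes D, D' and r² for two biallelic markers. It assumes phased haplotype counts are already available, whereas a tool working from genotype data would first need to estimate haplotype frequencies; the function name `ld_stats` and its argument names are hypothetical and not part of Haploview.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Pairwise LD statistics from counts of the four two-locus haplotypes.

    n_AB, n_Ab, n_aB, n_ab: counts of haplotypes carrying alleles
    A/a at the first marker and B/b at the second marker.
    Returns (D, D_prime, r_squared), with D_prime reported as |D'|.
    """
    total = n_AB + n_Ab + n_aB + n_ab
    if total <= 0:
        raise ValueError("at least one haplotype is required")

    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total

    denominator = p_A * (1.0 - p_A) * p_B * (1.0 - p_B)
    if denominator == 0.0:
        raise ValueError("both markers must be polymorphic")

    # D is the deviation of the observed haplotype frequency from the
    # frequency expected if the two markers were independent.
    d = p_AB - p_A * p_B
    r_squared = d * d / denominator

    # D' rescales D by its maximum possible magnitude given the allele
    # frequencies; the bound depends on the sign of D.
    if d >= 0.0:
        d_max = min(p_A * (1.0 - p_B), (1.0 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1.0 - p_A) * (1.0 - p_B))
    d_prime = abs(d) / d_max if d_max > 0.0 else 0.0

    return d, d_prime, r_squared


if __name__ == "__main__":
    # Made-up haplotype counts for two markers.
    print(ld_stats(40, 10, 10, 40))
```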
HDL
PUBMED_LINK
FULL NAME
High-Definition Likelihood
DESCRIPTION
High-Definition Likelihood (HDL) is a likelihood-based method for estimating genetic correlation using GWAS summary statistics. Compared to LD Score regression (LDSC), it reduces the variance of a genetic correlation estimate by about 60%.
URL
TITLE
High-definition likelihood inference of genetic correlations across human complex traits.
Main citation
Ning Z, Pawitan Y, Shen X. (2020) High-definition likelihood inference of genetic correlations across human complex traits. Nat Genet, 52 (8) 859-864. doi:10.1038/s41588-020-0653-y. PMID 32601477
ABSTRACT
Genetic correlation is a central parameter for understanding shared genetic architecture between complex traits. By using summary statistics from genome-wide association studies (GWAS), linkage disequilibrium score regression (LDSC) was developed for unbiased estimation of genetic correlations. Although easy to use, LDSC only partially utilizes LD information. By fully accounting for LD across the genome, we develop a high-definition likelihood (HDL) method to improve precision in genetic correlation estimation. Compared to LDSC, HDL reduces the variance of genetic correlation estimates by about 60%, equivalent to a 2.5-fold increase in sample size. We apply HDL and LDSC to estimate 435 genetic correlations among 30 behavioral and disease-related phenotypes measured in the UK Biobank (UKBB). In addition to 154 significant genetic correlations observed for both methods, HDL identified another 57 significant genetic correlations, compared to only another 2 significant genetic correlations identified by LDSC. HDL brings more power to genomic analyses and better reveals the underlying connections across human complex traits.
DOI
10.1038/s41588-020-0653-y
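The abstract equates a 60% variance reduction with a 2.5-fold increase in sample size. The short derivation below shows where that figure comes from, under the simplifying assumption (ours, for illustration) that the variance of the estimate scales inversely with sample size $N$.

```latex
% Assumption: Var(\hat{r}_g) \propto 1/N for a fixed method.
\begin{align*}
\operatorname{Var}_{\mathrm{HDL}}(N)
  &\approx (1 - 0.60)\,\operatorname{Var}_{\mathrm{LDSC}}(N)
   = 0.40\,\operatorname{Var}_{\mathrm{LDSC}}(N), \\
\operatorname{Var}_{\mathrm{LDSC}}(N')
  &= \frac{N}{N'}\,\operatorname{Var}_{\mathrm{LDSC}}(N)
   = 0.40\,\operatorname{Var}_{\mathrm{LDSC}}(N)
  \;\Longrightarrow\;
  \frac{N'}{N} = \frac{1}{0.40} = 2.5 .
\end{align*}
```

In other words, LDSC would need roughly 2.5 times as many samples to reach the precision HDL attains at sample size $N$.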
HDL-L
PUBMED_LINK
FULL NAME
high-definition likelihood (local)
DESCRIPTION
High-Definition Likelihood (HDL) is a likelihood-based method for estimating genetic correlation using GWAS summary statistics. Compared to LD Score regression (LDSC), it reduces the variance of a genetic correlation estimate by about 60%. Here, we provide an R-based computational tool HDL to implement our method.
URL
KEYWORDS
likelihood-based inference
TITLE
An enhanced framework for local genetic correlation analysis.
Main citation
Li Y, Pawitan Y, Shen X. (2025) An enhanced framework for local genetic correlation analysis. Nat Genet, 57 (4) 1053-1058. doi:10.1038/s41588-025-02123-3. PMID 40065165
ABSTRACT
Genetic correlation is a key parameter in the joint genetic model of complex traits, but it is usually estimated on a global genomic scale. Understanding local genetic correlations provides more detailed insight into the shared genetic architecture of complex traits. However, a state-of-the-art tool for local genetic correlation analysis, LAVA, is prone to false inference. Here we extend the high-definition likelihood (HDL) method to a local version, HDL-L, which performs genetic correlation analysis in small, approximately independent linkage disequilibrium blocks. HDL-L allows a more granular estimation of genetic variances and covariances. Simulations show that HDL-L offers more consistent heritability estimates and more efficient genetic correlation estimates compared with LAVA. HDL-L demonstrated robust performance across a wide range of simulations conducted under varying parameter settings. In the analysis of 30 phenotypes from the UK Biobank, HDL-L identified 109 significant local genetic correlations and showed a notable computational advantage. HDL-L proves to be a powerful tool for uncovering the detailed genetic landscape that underlies complex human traits, offering both accuracy and computational efficiency.
DOI
10.1038/s41588-025-02123-3
HEELS
PUBMED_LINK
FULL NAME
Heritability Estimation with high Efficiency using LD and association Summary Statistics
DESCRIPTION
HEELS is a Python-based command-line tool that produces accurate and precise local heritability estimates using summary-level statistics (marginal association test statistics along with the empirical (in-sample) LD statistics).
URL
TITLE
Accurate and efficient estimation of local heritability using summary statistics and the linkage disequilibrium matrix.
Main citation
Li H, Mazumder R, Lin X. (2023) Accurate and efficient estimation of local heritability using summary statistics and the linkage disequilibrium matrix. Nat Commun, 14 (1) 7954. doi:10.1038/s41467-023-43565-9. PMID 38040712
ABSTRACT
Existing SNP-heritability estimators that leverage summary statistics from genome-wide association studies (GWAS) are much less efficient (i.e., have larger standard errors) than the restricted maximum likelihood (REML) estimators which require access to individual-level data. We introduce a new method for local heritability estimation-Heritability Estimation with high Efficiency using LD and association Summary Statistics (HEELS)-that significantly improves the statistical efficiency of summary-statistics-based heritability estimator and attains comparable statistical efficiency as REML (with a relative statistical efficiency >92%). Moreover, we propose representing the empirical LD matrix as the sum of a low-rank matrix and a banded matrix. We show that this way of modeling the LD can not only reduce the storage and memory cost, but also improve the computational efficiency of heritability estimation. We demonstrate the statistical efficiency of HEELS and the advantages of our proposed LD approximation strategies both in simulations and through empirical analyses of the UK Biobank data.
DOI
10.1038/s41467-023-43565-9
HESS
PUBMED_LINK
FULL NAME
Heritability Estimation from Summary Statistics
DESCRIPTION
HESS (Heritability Estimation from Summary Statistics) is a software package for estimating and visualizing local SNP-heritability and genetic covariance (correlation) from GWAS summary association data.
URL
TITLE
Contrasting the Genetic Architecture of 30 Complex Traits from Summary Association Data.
Main citation
Shi H, Kichaev G, Pasaniuc B. (2016) Contrasting the Genetic Architecture of 30 Complex Traits from Summary Association Data. Am J Hum Genet, 99 (1) 139-53. doi:10.1016/j.ajhg.2016.05.013. PMID 27346688
ABSTRACT
Variance-component methods that estimate the aggregate contribution of large sets of variants to the heritability of complex traits have yielded important insights into the genetic architecture of common diseases. Here, we introduce methods that estimate the total trait variance explained by the typed variants at a single locus in the genome (local SNP heritability) from genome-wide association study (GWAS) summary data while accounting for linkage disequilibrium among variants. We applied our estimator to ultra-large-scale GWAS summary data of 30 common traits and diseases to gain insights into their local genetic architecture. First, we found that common SNPs have a high contribution to the heritability of all studied traits. Second, we identified traits for which the majority of the SNP heritability can be confined to a small percentage of the genome. Third, we identified GWAS risk loci where the entire locus explains significantly more variance in the trait than the GWAS reported variants. Finally, we identified loci that explain a significant amount of heritability across multiple traits.
DOI
10.1016/j.ajhg.2016.05.013
HGDP+1kGP
PUBMED_LINK
FULL NAME
Human Genome Diversity Project + 1000 Genomes project
URL
TITLE
A harmonized public resource of deeply sequenced diverse human genomes.
Main citation
Koenig Z, Yohannes MT, Nkambule LL, Zhao X, ...&, Martin AR. (2024) A harmonized public resource of deeply sequenced diverse human genomes. Genome Res, 34 (5) 796-809. doi:10.1101/gr.278378.123. PMID 38749656
ABSTRACT
Underrepresented populations are often excluded from genomic studies owing in part to a lack of resources supporting their analyses. The 1000 Genomes Project (1kGP) and Human Genome Diversity Project (HGDP), which have recently been sequenced to high coverage, are valuable genomic resources because of the global diversity they capture and their open data sharing policies. Here, we harmonized a high-quality set of 4094 whole genomes from 80 populations in the HGDP and 1kGP with data from the Genome Aggregation Database (gnomAD) and identified over 153 million high-quality SNVs, indels, and SVs. We performed a detailed ancestry analysis of this cohort, characterizing population structure and patterns of admixture across populations, analyzing site frequency spectra, and measuring variant counts at global and subcontinental levels. We also show substantial added value from this data set compared with the prior versions of the component resources, typically combined via liftOver and variant intersection; for example, we catalog millions of new genetic variants, mostly rare, compared with previous releases. In addition to unrestricted individual-level public release, we provide detailed tutorials for conducting many of the most common quality-control steps and analyses with these data in a scalable cloud-computing environment and publicly release this new phased joint callset for use as a haplotype resource in phasing and imputation pipelines. This jointly called reference panel will serve as a key resource to support research of diverse ancestry populations.
DOI
10.1101/gr.278378.123
HIBAG
PUBMED_LINK
URL
TITLE
HIBAG--HLA genotype imputation with attribute bagging.
Main citation
Zheng X, Shen J, Cox C, Wakefield JC, ...&, Weir BS. (2014) HIBAG--HLA genotype imputation with attribute bagging. Pharmacogenomics J, 14 (2) 192-200. doi:10.1038/tpj.2013.18. PMID 23712092
ABSTRACT
Genotyping of classical human leukocyte antigen (HLA) alleles is an essential tool in the analysis of diseases and adverse drug reactions with associations mapping to the major histocompatibility complex (MHC). However, deriving high-resolution HLA types subsequent to whole-genome single-nucleotide polymorphism (SNP) typing or sequencing is often cost prohibitive for large samples. An alternative approach takes advantage of the extended haplotype structure within the MHC to predict HLA alleles using dense SNP genotypes, such as those available from genome-wide SNP panels. Current methods for HLA imputation are difficult to apply or may require the user to have access to large training data sets with SNP and HLA types. We propose HIBAG, HLA Imputation using attribute BAGging, that makes predictions by averaging HLA-type posterior probabilities over an ensemble of classifiers built on bootstrap samples. We assess the performance of HIBAG using our study data (n=2668 subjects of European ancestry) as a training set and HLA data from the British 1958 birth cohort study (n≈1000 subjects) as independent validation samples. Prediction accuracies for HLA-A, B, C, DRB1 and DQB1 range from 92.2% to 98.1% using a set of SNP markers common to the Illumina 1M Duo, OmniQuad, OmniExpress, 660K and 550K platforms. HIBAG performed well compared with the other two leading methods, HLA*IMP and BEAGLE. This method is implemented in a freely available HIBAG R package that includes pre-fit classifiers for European, Asian, Hispanic and African ancestries, providing a readily available imputation approach without the need to have access to large training data sets.
DOI
10.1038/tpj.2013.18
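The abstract describes predictions made by averaging HLA-type posterior probabilities over an ensemble of classifiers. The Python sketch below illustrates only that averaging step, under the assumption that each classifier has already produced a posterior distribution over candidate genotypes; it is not the HIBAG R package API, and the function name `average_posteriors` and the example genotype labels are hypothetical.

```python
from collections import defaultdict


def average_posteriors(per_classifier_posteriors):
    """Average genotype posteriors across an ensemble of classifiers.

    per_classifier_posteriors: list of dicts, one per classifier, each
    mapping a candidate HLA genotype label to its posterior probability.
    A genotype missing from a classifier's dict counts as probability 0.
    Returns (best_genotype, averaged_posteriors).
    """
    if not per_classifier_posteriors:
        raise ValueError("at least one classifier is required")

    totals = defaultdict(float)
    for posterior in per_classifier_posteriors:
        for genotype, probability in posterior.items():
            totals[genotype] += probability

    n_classifiers = len(per_classifier_posteriors)
    averaged = {g: total / n_classifiers for g, total in totals.items()}

    # The predicted genotype is the one with the highest averaged posterior.
    best_genotype = max(averaged, key=averaged.get)
    return best_genotype, averaged


if __name__ == "__main__":
    # Made-up posteriors from three classifiers in the ensemble.
    ensemble = [
        {"A*01:01/A*02:01": 0.75, "A*01:01/A*03:01": 0.25},
        {"A*01:01/A*02:01": 0.50, "A*01:01/A*03:01": 0.50},
        {"A*01:01/A*02:01": 0.25, "A*02:01/A*03:01": 0.75},
    ]
    print(average_posteriors(ensemble))
```

The averaged posterior of the winning genotype can also serve as a rough confidence measure for the call.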
HIPO
PUBMED_LINK
FULL NAME
heritability informed power optimization
DESCRIPTION
hipo is an R package that performs heritability informed power optimization (HIPO) for conducting multi-trait association analysis on summary level data.
URL
TITLE
Heritability informed power optimization (HIPO) leads to enhanced detection of genetic associations across multiple traits.
Main citation
Qi G, Chatterjee N. (2018) Heritability informed power optimization (HIPO) leads to enhanced detection of genetic associations across multiple traits. PLoS Genet, 14 (10) e1007549. doi:10.1371/journal.pgen.1007549. PMID 30289880
ABSTRACT
Genome-wide association studies have shown that pleiotropy is a common phenomenon that can potentially be exploited for enhanced detection of susceptibility loci. We propose heritability informed power optimization (HIPO) for conducting powerful pleiotropic analysis using summary-level association statistics. We find optimal linear combinations of association coefficients across traits that are expected to maximize non-centrality parameter for the underlying test statistics, taking into account estimates of heritability, sample size variations and overlaps across the traits. Simulation studies show that the proposed method has correct type I error, robust to population stratification and leads to desired genome-wide enrichment of association signals. Application of the proposed method to publicly available data for three groups of genetically related traits, lipids (N = 188,577), psychiatric diseases (Ncase = 33,332, Ncontrol = 27,888) and social science traits (N ranging between 161,460 to 298,420 across individual traits) increased the number of genome-wide significant loci by 12%, 200% and 50%, respectively, compared to those found by analysis of individual traits. Evidence of replication is present for many of these loci in subsequent larger studies for individual traits. HIPO can potentially be extended to high-dimensional phenotypes as a way of dimension reduction to maximize power for subsequent genetic association testing.
DOI
10.1371/journal.pgen.1007549
HLA-TAPAS
PUBMED_LINK
FULL NAME
HLA-Typing At Protein for Association Studies
DESCRIPTION
HLA-TAPAS (HLA-Typing At Protein for Association Studies) is an HLA-focused pipeline that can handle HLA reference panel construction (MakeReference), HLA imputation (SNP2HLA), and HLA association (HLAassoc).
URL
KEYWORDS
HLA pipeline
TITLE
A high-resolution HLA reference panel capturing global population diversity enables multi-ancestry fine-mapping in HIV host response.
Main citation
Luo Y, Kanai M, Choi W, Li X, ...&, Raychaudhuri S. (2021) A high-resolution HLA reference panel capturing global population diversity enables multi-ancestry fine-mapping in HIV host response. Nat Genet, 53 (10) 1504-1516. doi:10.1038/s41588-021-00935-7. PMID 34611364
ABSTRACT
Fine-mapping to plausible causal variation may be more effective in multi-ancestry cohorts, particularly in the MHC, which has population-specific structure. To enable such studies, we constructed a large (n = 21,546) HLA reference panel spanning five global populations based on whole-genome sequences. Despite population-specific long-range haplotypes, we demonstrated accurate imputation at G-group resolution (94.2%, 93.7%, 97.8% and 93.7% in admixed African (AA), East Asian (EAS), European (EUR) and Latino (LAT) populations). Applying HLA imputation to genome-wide association study data for HIV-1 viral load in three populations (EUR, AA and LAT), we obviated effects of previously reported associations from population-specific HIV studies and discovered a novel association at position 156 in HLA-B. We pinpointed the MHC association to three amino acid positions (97, 67 and 156) marking three consecutive pockets (C, B and D) within the HLA-B peptide-binding groove, explaining 12.9% of trait variance.
DOI
10.1038/s41588-021-00935-7
HLARIMNT
FULL NAME
HLA Reliable IMputatioN by Transformer
URL
KEYWORDS
HLA, imputation
Main citation
Tanaka, K., Kato, K., Nonaka, N., & Seita, J. (2022). Efficient HLA imputation from sequential SNPs data by Transformer. arXiv preprint arXiv:2211.06430.
HRC
PUBMED_LINK
URL
TITLE
A reference panel of 64,976 haplotypes for genotype imputation.
Main citation
McCarthy S, Das S, Kretzschmar W, Delaneau O, ...&, Haplotype Reference Consortium. (2016) A reference panel of 64,976 haplotypes for genotype imputation. Nat Genet, 48 (10) 1279-83. doi:10.1038/ng.3643. PMID 27548312
ABSTRACT
We describe a reference panel of 64,976 human haplotypes at 39,235,157 SNPs constructed using whole-genome sequence data from 20 studies of predominantly European ancestry. Using this resource leads to accurate genotype imputation at minor allele frequencies as low as 0.1% and a large increase in the number of SNPs tested in association studies, and it can help to discover and refine causal loci. We describe remote server resources that allow researchers to carry out imputation and phasing consistently and efficiently.
DOI
10.1038/ng.3643
HWE
PUBMED_LINK
FULL NAME
Exact Tests of Hardy-Weinberg Equilibrium
DESCRIPTION
Wigginton, J. E., Cutler, D. J., & Abecasis, G. R. (2005). A note on exact tests of Hardy-Weinberg equilibrium. The American Journal of Human Genetics, 76(5), 887-893.
TITLE
A note on exact tests of Hardy-Weinberg equilibrium.
Main citation
Wigginton JE, Cutler DJ, Abecasis GR. (2005) A note on exact tests of Hardy-Weinberg equilibrium. Am J Hum Genet, 76 (5) 887-93. doi:10.1086/429864. PMID 15789306
ABSTRACT
Deviations from Hardy-Weinberg equilibrium (HWE) can indicate inbreeding, population stratification, and even problems in genotyping. In samples of affected individuals, these deviations can also provide evidence for association. Tests of HWE are commonly performed using a simple χ² goodness-of-fit test. We show that this χ² test can have inflated type I error rates, even in relatively large samples (e.g., samples of 1,000 individuals that include approximately 100 copies of the minor allele). On the basis of previous work, we describe exact tests of HWE together with efficient computational methods for their implementation. Our methods adequately control type I error in large and small samples and are computationally efficient. They have been implemented in freely available code that will be useful for quality assessment of genotype data and for the detection of genetic association or population stratification in very large data sets.
DOI
10.1086/429864
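The exact test summarized in this entry can be illustrated with a short sketch. It assumes a biallelic marker and uses the standard recurrence over possible heterozygote counts, conditional on the observed allele counts. The function name `hwe_exact_pvalue` is hypothetical, and this is an illustration of the idea rather than the authors' released code.

```python
def hwe_exact_pvalue(n_het, n_hom1, n_hom2):
    """Two-sided exact HWE p-value from genotype counts at a biallelic marker."""
    if min(n_het, n_hom1, n_hom2) < 0:
        raise ValueError("genotype counts must be non-negative")
    n = n_het + n_hom1 + n_hom2
    if n == 0:
        return 1.0
    # Number of copies of the rarer allele; this is held fixed by the test.
    n_rare = 2 * min(n_hom1, n_hom2) + n_het

    # probs[h] will hold the (unnormalized) probability of h heterozygotes.
    probs = [0.0] * (n_rare + 1)

    # Start near the expected heterozygote count, with matching parity.
    mid = n_rare * (2 * n - n_rare) // (2 * n)
    if mid % 2 != n_rare % 2:
        mid += 1
    probs[mid] = 1.0

    # Walk down: two fewer heterozygotes, one more of each homozygote.
    hets, hom_r = mid, (n_rare - mid) // 2
    hom_c = n - hets - hom_r
    while hets >= 2:
        probs[hets - 2] = (probs[hets] * hets * (hets - 1)
                           / (4.0 * (hom_r + 1) * (hom_c + 1)))
        hets, hom_r, hom_c = hets - 2, hom_r + 1, hom_c + 1

    # Walk up: two more heterozygotes, one fewer of each homozygote.
    hets, hom_r = mid, (n_rare - mid) // 2
    hom_c = n - hets - hom_r
    while hets <= n_rare - 2:
        probs[hets + 2] = (probs[hets] * 4.0 * hom_r * hom_c
                           / ((hets + 2) * (hets + 1)))
        hets, hom_r, hom_c = hets + 2, hom_r - 1, hom_c - 1

    # p-value: total probability of outcomes no more likely than the observed one.
    total = sum(probs)
    p_obs = probs[n_het]
    return min(1.0, sum(p for p in probs if p <= p_obs) / total)
```

For example, `hwe_exact_pvalue(0, 50, 50)` describes a sample with no heterozygotes despite equal allele frequencies, and yields a very small p-value.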
HyPrColoc
PUBMED_LINK
FULL NAME
Hypothesis Prioritisation for multi-trait Colocalization
URL
KEYWORDS
multiple traits
TITLE
A fast and efficient colocalization algorithm for identifying shared genetic risk factors across multiple traits.
Main citation
Foley CN, Staley JR, Breen PG, Sun BB, ...&, Howson JMM. (2021) A fast and efficient colocalization algorithm for identifying shared genetic risk factors across multiple traits. Nat Commun, 12 (1) 764. doi:10.1038/s41467-020-20885-8. PMID 33536417
ABSTRACT
Genome-wide association studies (GWAS) have identified thousands of genomic regions affecting complex diseases. The next challenge is to elucidate the causal genes and mechanisms involved. One approach is to use statistical colocalization to assess shared genetic aetiology across multiple related traits (e.g. molecular traits, metabolic pathways and complex diseases) to identify causal pathways, prioritize causal variants and evaluate pleiotropy. We propose HyPrColoc (Hypothesis Prioritisation for multi-trait Colocalization), an efficient deterministic Bayesian algorithm using GWAS summary statistics that can detect colocalization across vast numbers of traits simultaneously (e.g. 100 traits can be jointly analysed in around 1 s). We perform a genome-wide multi-trait colocalization analysis of coronary heart disease (CHD) and fourteen related traits, identifying 43 regions in which CHD colocalized with ≥1 trait, including 5 previously unknown CHD loci. Across the 43 loci, we further integrate gene and protein expression quantitative trait loci to identify candidate causal genes.
DOI
10.1038/s41467-020-20885-8
IBS
PUBMED_LINK
FULL NAME
illustrator of biological sequences
DESCRIPTION
An illustrator for the presentation and visualization of biological sequences.
URL
TITLE
IBS: an illustrator for the presentation and visualization of biological sequences.
Main citation
Liu W, Xie Y, Ma J, Luo X, ...&, Ren J. (2015) IBS: an illustrator for the presentation and visualization of biological sequences. Bioinformatics, 31 (20) 3359-61. doi:10.1093/bioinformatics/btv362. PMID 26069263
ABSTRACT
Biological sequence diagrams are fundamental for visualizing various functional elements in protein or nucleotide sequences that enable a summarization and presentation of existing information as well as means of intuitive new discoveries. Here, we present a software package called illustrator of biological sequences (IBS) that can be used for representing the organization of either protein or nucleotide sequences in a convenient, efficient and precise manner. Multiple options are provided in IBS, and biological sequences can be manipulated, recolored or rescaled in a user-defined mode. Also, the final representational artwork can be directly exported into a publication-quality figure. AVAILABILITY AND IMPLEMENTATION: The standalone package of IBS was implemented in JAVA, while the online service was implemented in HTML5 and JavaScript. Both the standalone package and online service are freely available at http://ibs.biocuckoo.org. CONTACT: renjian.sysu@gmail.com or xueyu@hust.edu.cn SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
DOI
10.1093/bioinformatics/btv362
iHS
PUBMED_LINK
FULL NAME
Integrated haplotype score
DESCRIPTION
Voight, B. F., Kudaravalli, S., Wen, X., & Pritchard, J. K. (2006). A map of recent positive selection in the human genome. PLoS biology, 4(3), e72.
TITLE
A map of recent positive selection in the human genome.
Main citation
Voight BF, Kudaravalli S, Wen X, Pritchard JK. (2006) A map of recent positive selection in the human genome. PLoS Biol, 4 (3) e72. doi:10.1371/journal.pbio.0040072. PMID 16494531
ABSTRACT
The identification of signals of very recent positive selection provides information about the adaptation of modern humans to local conditions. We report here on a genome-wide scan for signals of very recent positive selection in favor of variants that have not yet reached fixation. We describe a new analytical method for scanning single nucleotide polymorphism (SNP) data for signals of recent selection, and apply this to data from the International HapMap Project. In all three continental groups we find widespread signals of recent positive selection. Most signals are region-specific, though a significant excess are shared across groups. Contrary to some earlier low resolution studies that suggested a paucity of recent selection in sub-Saharan Africans, we find that by some measures our strongest signals of selection are from the Yoruba population. Finally, since these signals indicate the existence of genetic variants that have substantially different fitnesses, they must indicate loci that are the source of significant phenotypic variation. Though the relevant phenotypes are generally not known, such loci should be of particular interest in mapping studies of complex traits. For this purpose we have developed a set of SNPs that can be used to tag the strongest approximately 250 signals of recent selection in each population.
DOI
10.1371/journal.pbio.0040072
IMPUTE
PUBMED_LINK
URL
TITLE
A new multipoint method for genome-wide association studies by imputation of genotypes.
Main citation
Marchini J, Howie B, Myers S, McVean G, ...&, Donnelly P. (2007) A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet, 39 (7) 906-13. doi:10.1038/ng2088. PMID 17572673
ABSTRACT
Genome-wide association studies are set to become the method of choice for uncovering the genetic basis of human diseases. A central challenge in this area is the development of powerful multipoint methods that can detect causal variants that have not been directly genotyped. We propose a coherent analysis framework that treats the problem as one involving missing or uncertain genotypes. Central to our approach is a model-based imputation method for inferring genotypes at observed or unobserved SNPs, leading to improved power over existing methods for multipoint association mapping. Using real genome-wide association study data, we show that our approach (i) is accurate and well calibrated, (ii) provides detailed views of associated regions that facilitate follow-up studies and (iii) can be used to validate and correct data at genotyped markers. A notable future use of our method will be to boost power by combining data from genome-wide scans that use different SNP sets.
DOI
10.1038/ng2088
IMPUTE2
PUBMED_LINK
TITLE
A flexible and accurate genotype imputation method for the next generation of genome-wide association studies.
Main citation
Howie BN, Donnelly P, Marchini J. (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet, 5 (6) e1000529. doi:10.1371/journal.pgen.1000529. PMID 19543373
ABSTRACT
Genotype imputation methods are now being widely used in the analysis of genome-wide association studies. Most imputation analyses to date have used the HapMap as a reference dataset, but new reference panels (such as controls genotyped on multiple SNP chips and densely typed samples from the 1,000 Genomes Project) will soon allow a broader range of SNPs to be imputed with higher accuracy, thereby increasing power. We describe a genotype imputation method (IMPUTE version 2) that is designed to address the challenges presented by these new datasets. The main innovation of our approach is a flexible modelling framework that increases accuracy and combines information across multiple reference panels while remaining computationally feasible. We find that IMPUTE v2 attains higher accuracy than other methods when the HapMap provides the sole reference panel, but that the size of the panel constrains the improvements that can be made. We also find that imputation accuracy can be greatly enhanced by expanding the reference panel to contain thousands of chromosomes and that IMPUTE v2 outperforms other methods in this setting at both rare and common SNPs, with overall error rates that are 15%-20% lower than those of the closest competing method. One particularly challenging aspect of next-generation association studies is to integrate information across multiple reference panels genotyped on different sets of SNPs; we show that our approach to this problem has practical advantages over other suggested solutions.
DOI
10.1371/journal.pgen.1000529
IMPUTE4
PUBMED_LINK
TITLE
The UK Biobank resource with deep phenotyping and genomic data.
Main citation
Bycroft C, Freeman C, Petkova D, Band G, ...&, Marchini J. (2018) The UK Biobank resource with deep phenotyping and genomic data. Nature, 562 (7726) 203-209. doi:10.1038/s41586-018-0579-z. PMID 30305743
ABSTRACT
The UK Biobank project is a prospective cohort study with deep genetic and phenotypic data collected on approximately 500,000 individuals from across the United Kingdom, aged between 40 and 69 at recruitment. The open resource is unique in its size and scope. A rich variety of phenotypic and health-related information is available on each participant, including biological measurements, lifestyle indicators, biomarkers in blood and urine, and imaging of the body and brain. Follow-up information is provided by linking health and medical records. Genome-wide genotype data have been collected on all participants, providing many opportunities for the discovery of new genetic associations and the genetic bases of complex traits. Here we describe the centralized analysis of the genetic data, including genotype quality, properties of population structure and relatedness of the genetic data, and efficient phasing and genotype imputation that increases the number of testable variants to around 96 million. Classical allelic variation at 11 human leukocyte antigen genes was imputed, resulting in the recovery of signals with known associations between human leukocyte antigen alleles and many diseases.
DOI
10.1038/s41586-018-0579-z
IMPUTE5
PUBMED_LINK
TITLE
Genotype imputation using the Positional Burrows Wheeler Transform.
Main citation
Rubinacci S, Delaneau O, Marchini J. (2020) Genotype imputation using the Positional Burrows Wheeler Transform. PLoS Genet, 16 (11) e1009049. doi:10.1371/journal.pgen.1009049. PMID 33196638
ABSTRACT
Genotype imputation is the process of predicting unobserved genotypes in a sample of individuals using a reference panel of haplotypes. In the last 10 years reference panels have increased in size by more than 100 fold. Increasing reference panel size improves accuracy of markers with low minor allele frequencies but poses ever increasing computational challenges for imputation methods. Here we present IMPUTE5, a genotype imputation method that can scale to reference panels with millions of samples. This method continues to refine the observation made in the IMPUTE2 method, that accuracy is optimized via use of a custom subset of haplotypes when imputing each individual. It achieves fast, accurate, and memory-efficient imputation by selecting haplotypes using the Positional Burrows Wheeler Transform (PBWT). By using the PBWT data structure at genotyped markers, IMPUTE5 identifies locally best matching haplotypes and long identical by state segments. The method then uses the selected haplotypes as conditioning states within the IMPUTE model. Using the HRC reference panel, which has ∼65,000 haplotypes, we show that IMPUTE5 is up to 30x faster than MINIMAC4 and up to 3x faster than BEAGLE5.1, and uses less memory than both these methods. Using simulated reference panels we show that IMPUTE5 scales sub-linearly with reference panel size. For example, keeping the number of imputed markers constant, increasing the reference panel size from 10,000 to 1 million haplotypes requires less than twice the computation time. As the reference panel increases in size IMPUTE5 is able to utilize a smaller number of reference haplotypes, thus reducing computational cost.
DOI
10.1371/journal.pgen.1009049
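The haplotype selection step described in this entry relies on the Positional Burrows Wheeler Transform. The sketch below shows only the core PBWT idea: at each site, haplotypes are kept in an order sorted by their reversed prefixes, so haplotypes that match over a long stretch ending at that site sit next to each other. The function name `pbwt_prefix_arrays` and the toy data are hypothetical, and this is not IMPUTE5's implementation.

```python
def pbwt_prefix_arrays(haplotypes):
    """Yield the positional prefix array after each site.

    haplotypes: list of equal-length sequences of 0/1 alleles.
    Each yielded list orders haplotype indices by their reversed prefix
    up to and including the current site (ties keep the previous order).
    """
    order = list(range(len(haplotypes)))
    n_sites = len(haplotypes[0]) if haplotypes else 0
    for site in range(n_sites):
        # Stable partition: carriers of allele 0 first, then allele 1.
        zeros = [h for h in order if haplotypes[h][site] == 0]
        ones = [h for h in order if haplotypes[h][site] == 1]
        order = zeros + ones
        yield list(order)


# Toy panel: haplotypes 0 and 2 are identical, so they end up adjacent.
toy_panel = [[0, 1, 0], [1, 1, 0], [0, 1, 0], [1, 0, 1]]
final_order = list(pbwt_prefix_arrays(toy_panel))[-1]
```

Because each update is a single stable pass over the current order, the cost per site is linear in the number of haplotypes, which is the property that makes neighbor lookups cheap in large panels.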
JAM
PUBMED_LINK
FULL NAME
joint analysis of marginal summary statistics
DESCRIPTION
Bayesian variable selection under a range of likelihoods, including linear regression for continuous outcomes, logistic regression for binary outcomes, Weibull regression for survival outcomes, and the "JAM" model for summary genetic association data.
URL
TITLE
JAM: A Scalable Bayesian Framework for Joint Analysis of Marginal SNP Effects.
Main citation
Newcombe PJ, Conti DV, Richardson S. (2016) JAM: A Scalable Bayesian Framework for Joint Analysis of Marginal SNP Effects. Genet Epidemiol, 40 (3) 188-201. doi:10.1002/gepi.21953. PMID 27027514
ABSTRACT
Recently, large scale genome-wide association study (GWAS) meta-analyses have boosted the number of known signals for some traits into the tens and hundreds. Typically, however, variants are only analysed one-at-a-time. This complicates the ability of fine-mapping to identify a small set of SNPs for further functional follow-up. We describe a new and scalable algorithm, joint analysis of marginal summary statistics (JAM), for the re-analysis of published marginal summary statistics under joint multi-SNP models. The correlation is accounted for according to estimates from a reference dataset, and models and SNPs that best explain the complete joint pattern of marginal effects are highlighted via an integrated Bayesian penalized regression framework. We provide both enumerated and Reversible Jump MCMC implementations of JAM and present some comparisons of performance. In a series of realistic simulation studies, JAM demonstrated identical performance to various alternatives designed for single region settings. In multi-region settings, where the only multivariate alternative involves stepwise selection, JAM offered greater power and specificity. We also present an application to real published results from MAGIC (meta-analysis of glucose and insulin related traits consortium) - a GWAS meta-analysis of more than 15,000 people. We re-analysed several genomic regions that produced multiple significant signals with glucose levels 2 hr after oral stimulation. Through joint multivariate modelling, JAM was able to formally rule out many SNPs, and for one gene, ADCY5, suggests that an additional SNP, which transpired to be more biologically plausible, should be followed up with equal priority to the reported index.
DOI
10.1002/gepi.21953
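The step from marginal to joint effects that this entry describes can be illustrated for two SNPs. The sketch assumes standardized genotypes and phenotype, so that the marginal effects equal the LD correlation matrix times the joint effects. It shows only this least-squares identity, not JAM's Bayesian model search, and the function name `joint_from_marginal` is hypothetical.

```python
def joint_from_marginal(m1, m2, rho):
    """Joint effects of two standardized SNPs from their marginal effects.

    m1, m2: marginal (one-SNP-at-a-time) effect estimates.
    rho:    LD correlation between the SNPs, e.g. from a reference panel.
    Solves [[1, rho], [rho, 1]] @ [b1, b2] = [m1, m2] in closed form.
    """
    if abs(rho) >= 1.0:
        raise ValueError("SNPs in perfect LD cannot be separated")
    denom = 1.0 - rho * rho
    return (m1 - rho * m2) / denom, (m2 - rho * m1) / denom


# SNP 2 only tags SNP 1: its marginal signal disappears in the joint model.
b1, b2 = joint_from_marginal(1.0, 0.5, 0.5)
```

In this toy case the second SNP has a marginal effect of 0.5 purely through LD, and its joint effect is zero, which is the sense in which a joint model can rule out SNPs.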
JASS
PUBMED_LINK
FULL NAME
Joint Analysis of Summary Statistics
DESCRIPTION
JASS is a Python package that handles the computation of joint statistics over sets of selected GWAS results and the interactive exploration of the results through a web interface. The generation of joint statistics over a set of selected studies, and of static plots to display the results, is performed using the command line interface. These functionalities can also be accessed through a web application embedded in the Python package, which also enables exploration of the results through a dynamic JavaScript interface. The JASS analysis module handles the data processing, from the import of the data to the computation of the joint statistics and the generation of the various static plots that illustrate the results. The pre-processing of raw GWAS data can be performed through a companion script provided with the JASS package.
URL
TITLE
JASS: command line and web interface for the joint analysis of GWAS results.
Main citation
Julienne H, Lechat P, Guillemot V, Lasry C, ...&, Aschard H. (2020) JASS: command line and web interface for the joint analysis of GWAS results. NAR Genom Bioinform, 2 (1) lqaa003. doi:10.1093/nargab/lqaa003. PMID 32002517
ABSTRACT
Genome-wide association study (GWAS) has been the driving force for identifying association between genetic variants and human phenotypes. Thousands of GWAS summary statistics covering a broad range of human traits and diseases are now publicly available. These GWAS have proven their utility for a range of secondary analyses, including in particular the joint analysis of multiple phenotypes to identify new associated genetic variants. However, although several methods have been proposed, there are very few large-scale applications published so far because of challenges in implementing these methods on real data. Here, we present JASS (Joint Analysis of Summary Statistics), a polyvalent Python package that addresses this need. Our package incorporates recently developed joint tests such as the omnibus approach and various weighted sum of Z-score tests while solving all practical and computational barriers for large-scale multivariate analysis of GWAS summary statistics. This includes data cleaning and harmonization tools, an efficient algorithm for fast derivation of joint statistics, an optimized data management process and a web interface for exploration purposes. Both benchmark analyses and real data applications demonstrated the robustness and strong potential of JASS for the detection of new associated genetic variants. Our package is freely available at https://gitlab.pasteur.fr/statistical-genetics/jass.
DOI
10.1093/nargab/lqaa003
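A weighted sum of Z-scores test of the kind mentioned in this entry can be sketched in its textbook form. Under the null, the Z-scores across traits are assumed to be multivariate normal with a known correlation matrix, so the weighted sum is normal with variance w'Σw. The function name `sum_z` and its arguments are hypothetical and are not part of the JASS interface.

```python
import math


def sum_z(z, weights, corr):
    """Weighted sum of Z-scores across traits for one variant.

    z:       per-trait Z-scores.
    weights: per-trait weights.
    corr:    correlation matrix of the Z-scores under the null.
    Returns the standardized statistic and its two-sided normal p-value.
    """
    k = len(z)
    numerator = sum(w * zi for w, zi in zip(weights, z))
    variance = sum(weights[i] * corr[i][j] * weights[j]
                   for i in range(k) for j in range(k))
    if variance <= 0.0:
        raise ValueError("weights and correlation give non-positive variance")
    stat = numerator / math.sqrt(variance)
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return stat, p_value
```

With two uncorrelated traits, `sum_z([1.0, 2.0], [1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]])` gives a statistic of 3/√2, larger than either input Z-score; with perfectly correlated traits the combination adds no information.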
JointPRS
PUBMED_LINK
DESCRIPTION
Data-adaptive polygenic score framework that borrows strength across populations via genetic correlations using only GWAS summary statistics and LD references—supporting prediction with or without individual-level tuning data.
URL
KEYWORDS
PRS, multi-population, genetic correlation, summary statistics, cross-ancestry
TITLE
JointPRS: A data-adaptive framework for multi-population genetic risk prediction incorporating genetic correlation.
Main citation
Xu L, Zhou G, Jiang W, Zhang H, ...&, Zhao H. (2025) JointPRS: A data-adaptive framework for multi-population genetic risk prediction incorporating genetic correlation. Nat Commun, 16 (1) 3841. doi:10.1038/s41467-025-59243-x. PMID 40268942
ABSTRACT
Genetic risk prediction for non-European populations is hindered by limited Genome-Wide Association Study (GWAS) sample sizes and small tuning datasets. We propose JointPRS, a data-adaptive framework that leverages genetic correlations across multiple populations using GWAS summary statistics. It achieves accurate predictions without individual-level tuning data and remains effective in the presence of a small tuning set thanks to its data-adaptive approach. Through extensive simulations and real data applications to 22 quantitative and four binary traits in five continental populations evaluated using the UK Biobank (UKBB) and All of Us (AoU), JointPRS consistently outperforms six state-of-the-art methods across three data scenarios: no tuning data, same-cohort tuning and testing, and cross-cohort tuning and testing. Notably, in the Admixed American population, JointPRS improves lipid trait prediction in AoU by 6.46%-172.00% compared to the other existing methods.
DOI
10.1038/s41467-025-59243-x
karyoploteR
PUBMED_LINK
DESCRIPTION
karyoploteR is an R package to create karyoplots, that is, representations of whole genomes with arbitrary data plotted on them. It is inspired by the R base graphics system and does not depend on other graphics packages. The aim of karyoploteR is to offer the user an easy way to plot data along the genome and obtain a broad genome-wide view that facilitates the identification of genome-wide relations and distributions.
URL
TITLE
karyoploteR: an R/Bioconductor package to plot customizable genomes displaying arbitrary data.
Main citation
Gel B, Serra E. (2017) karyoploteR: an R/Bioconductor package to plot customizable genomes displaying arbitrary data. Bioinformatics, 33 (19) 3088-3090. doi:10.1093/bioinformatics/btx346. PMID 28575171
ABSTRACT
MOTIVATION: Data visualization is a crucial tool for data exploration, analysis and interpretation. For the visualization of genomic data there lacks a tool to create customizable non-circular plots of whole genomes from any species. RESULTS: We have developed karyoploteR, an R/Bioconductor package to create linear chromosomal representations of any genome with genomic annotations and experimental data plotted along them. Plot creation process is inspired in R base graphics, with a main function creating karyoplots with no data and multiple additional functions, including custom functions written by the end-user, adding data and other graphical elements. This approach allows the creation of highly customizable plots from arbitrary data with complete freedom on data positioning and representation. AVAILABILITY AND IMPLEMENTATION: karyoploteR is released under Artistic-2.0 License. Source code and documentation are freely available through Bioconductor (http://www.bioconductor.org/packages/karyoploteR) and at the examples and tutorial page at https://bernatgel.github.io/karyoploter_tutorial. CONTACT: bgel@igtp.cat.
DOI
10.1093/bioinformatics/btx346
KwARG
PUBMED_LINK
DESCRIPTION
Ignatieva, A., Lyngsø, R. B., Jenkins, P. A. & Hein, J. KwARG: parsimonious reconstruction of ancestral recombination graphs with recurrent mutation. Bioinformatics 37, 3277–3284 (2021).
TITLE
KwARG: parsimonious reconstruction of ancestral recombination graphs with recurrent mutation.
Main citation
Ignatieva A, Lyngsø RB, Jenkins PA, Hein J. (2021) KwARG: parsimonious reconstruction of ancestral recombination graphs with recurrent mutation. Bioinformatics, 37 (19) 3277-3284. doi:10.1093/bioinformatics/btab351. PMID 33970217
ABSTRACT
MOTIVATION: The reconstruction of possible histories given a sample of genetic data in the presence of recombination and recurrent mutation is a challenging problem, but can provide key insights into the evolution of a population. We present KwARG, which implements a parsimony-based greedy heuristic algorithm for finding plausible genealogical histories (ancestral recombination graphs) that are minimal or near-minimal in the number of posited recombination and mutation events. RESULTS: Given an input dataset of aligned sequences, KwARG outputs a list of possible candidate solutions, each comprising a list of mutation and recombination events that could have generated the dataset; the relative proportion of recombinations and recurrent mutations in a solution can be controlled via specifying a set of 'cost' parameters. We demonstrate that the algorithm performs well when compared against existing methods. AVAILABILITY AND IMPLEMENTATION: The software is available at https://github.com/a-ignatieva/kwarg. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
DOI
10.1093/bioinformatics/btab351
lassosum
PUBMED_LINK
DESCRIPTION
lassosum is a method for computing LASSO/Elastic Net estimates of a linear regression problem given summary statistics from GWAS and Genome-wide meta-analyses, accounting for Linkage Disequilibrium (LD), via a reference panel.
URL
KEYWORDS
penalized regression
TITLE
Polygenic scores via penalized regression on summary statistics.
Main citation
Mak TSH, Porsch RM, Choi SW, Zhou X, ...&, Sham PC. (2017) Polygenic scores via penalized regression on summary statistics. Genet Epidemiol, 41 (6) 469-480. doi:10.1002/gepi.22050. PMID 28480976
ABSTRACT
Polygenic scores (PGS) summarize the genetic contribution of a person's genotype to a disease or phenotype. They can be used to group participants into different risk categories for diseases, and are also used as covariates in epidemiological analyses. A number of possible ways of calculating PGS have been proposed, and recently there is much interest in methods that incorporate information available in published summary statistics. As there is no inherent information on linkage disequilibrium (LD) in summary statistics, a pertinent question is how we can use LD information available elsewhere to supplement such analyses. To answer this question, we propose a method for constructing PGS using summary statistics and a reference panel in a penalized regression framework, which we call lassosum. We also propose a general method for choosing the value of the tuning parameter in the absence of validation data. In our simulations, we showed that pseudovalidation often resulted in prediction accuracy that is comparable to using a dataset with validation phenotype and was clearly superior to the conservative option of setting the tuning parameter of lassosum to its lowest value. We also showed that lassosum achieved better prediction accuracy than simple clumping and P-value thresholding in almost all scenarios. It was also substantially faster and more accurate than the recently proposed LDpred.
DOI
10.1002/gepi.22050
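The penalized regression on summary statistics described in this entry can be sketched with coordinate descent. The sketch assumes standardized variables, a vector `r` of SNP-phenotype correlations derived from summary statistics, and an LD correlation matrix `ld` from a reference panel, and it minimizes b'R_s b - 2 b'r + 2λ‖b‖₁ with R_s = (1 - s)·ld + s·I. The function names are hypothetical, and this is not the lassosum package code.

```python
def soft_threshold(x, lam):
    """Shrink x toward zero by lam, returning 0.0 inside [-lam, lam]."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def lasso_from_summary(r, ld, lam, s=0.0, n_iter=200):
    """Coordinate descent for b'R_s b - 2 b'r + 2*lam*|b|_1.

    r:   SNP-phenotype correlations (from summary statistics).
    ld:  LD correlation matrix from a reference panel.
    lam: L1 penalty; s: shrinkage of the LD matrix toward the identity.
    """
    p = len(r)
    # Shrunken LD matrix R_s = (1 - s) * ld + s * I.
    shrunk = [[(1.0 - s) * ld[i][j] + (s if i == j else 0.0)
               for j in range(p)] for i in range(p)]
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Residual correlation for SNP j given the other current effects.
            partial = r[j] - sum(shrunk[j][k] * beta[k]
                                 for k in range(p) if k != j)
            beta[j] = soft_threshold(partial, lam) / shrunk[j][j]
    return beta
```

With two SNPs in LD of 0.5 and `r = [1.0, 0.5]`, `lam=0.1` gives effects `[0.9, 0.0]`: the tagging SNP is dropped and the retained effect is shrunk by the penalty.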
lassosum2
PUBMED_LINK
DESCRIPTION
lassosum2 is a re-implementation of the lassosum model that now uses the exact same input parameters as LDpred2 (corr and df_beta). It should be fast to run. It can be run next to LDpred2 and the best model can be chosen using the validation set. Note that parameter ‘s’ from lassosum has been replaced by a new parameter ‘delta’ in lassosum2, in order to better reflect that the lassosum model also uses L2-regularization (therefore, elastic-net regularization).
URL
TITLE
Identifying and correcting for misspecifications in GWAS summary statistics and polygenic scores.
Main citation
Privé F, Arbel J, Aschard H, Vilhjálmsson BJ. (2022) Identifying and correcting for misspecifications in GWAS summary statistics and polygenic scores. HGG Adv, 3 (4) 100136. doi:10.1016/j.xhgg.2022.100136. PMID 36105883
ABSTRACT
Publicly available genome-wide association studies (GWAS) summary statistics exhibit uneven quality, which can impact the validity of follow-up analyses. First, we present an overview of possible misspecifications that come with GWAS summary statistics. Then, in both simulations and real-data analyses, we show that additional information such as imputation INFO scores, allele frequencies, and per-variant sample sizes in GWAS summary statistics can be used to detect possible issues and correct for misspecifications in the GWAS summary statistics. One important motivation for us is to improve the predictive performance of polygenic scores built from these summary statistics. Unfortunately, owing to the lack of reporting standards for GWAS summary statistics, this additional information is not systematically reported. We also show that using well-matched linkage disequilibrium (LD) references can improve model fit and translate into more accurate prediction. Finally, we discuss how to make polygenic score methods such as lassosum and LDpred2 more robust to these misspecifications to improve their predictive power.
DOI
10.1016/j.xhgg.2022.100136
LAVA
PUBMED_LINK
FULL NAME
Local Analysis of [co]Variant Association
DESCRIPTION
LAVA is a tool to conduct genome-wide, local genetic correlation analysis on multiple traits, using GWAS summary statistics as input.
URL
TITLE
An integrated framework for local genetic correlation analysis.
Main citation
Werme J, van der Sluis S, Posthuma D, de Leeuw CA. (2022) An integrated framework for local genetic correlation analysis. Nat Genet, 54 (3) 274-282. doi:10.1038/s41588-022-01017-y. PMID 35288712
ABSTRACT
Genetic correlation (rg) analysis is used to identify phenotypes that may have a shared genetic basis. Traditionally, rg is studied globally, considering only the average of the shared signal across the genome, although this approach may fail when the rg is confined to particular genomic regions or in opposing directions at different loci. Current tools for local rg analysis are restricted to analysis of two phenotypes. Here we introduce LAVA, an integrated framework for local rg analysis that, in addition to testing the standard bivariate local rgs between two phenotypes, can evaluate local heritabilities and analyze conditional genetic relations between several phenotypes using partial correlation and multiple regression. Applied to 25 behavioral and health phenotypes, we show considerable heterogeneity in the bivariate local rgs across the genome, which is often masked by the global rg patterns, and demonstrate how our conditional approaches can elucidate more complex, multivariate genetic relations.
DOI
10.1038/s41588-022-01017-y
LCP-GWAS
PUBMED_LINK
FULL NAME
Linear Combination Phenotype GWAS
KEYWORDS
multivariate GWAS follow-up analyses
TITLE
An expanded analysis framework for multivariate GWAS connects inflammatory biomarkers to functional variants and disease.
Main citation
Ruotsalainen SE, Partanen JJ, Cichonska A, Lin J, ...&, Koskela J. (2021) An expanded analysis framework for multivariate GWAS connects inflammatory biomarkers to functional variants and disease. Eur J Hum Genet, 29 (2) 309-324. doi:10.1038/s41431-020-00730-8. PMID 33110245
ABSTRACT
Multivariate methods are known to increase the statistical power to detect associations in the case of shared genetic basis between phenotypes. They have, however, lacked essential analytic tools to follow-up and understand the biology underlying these associations. We developed a novel computational workflow for multivariate GWAS follow-up analyses, including fine-mapping and identification of the subset of traits driving associations (driver traits). Many follow-up tools require univariate regression coefficients which are lacking from multivariate results. Our method overcomes this problem by using Canonical Correlation Analysis to turn each multivariate association into its optimal univariate Linear Combination Phenotype (LCP). This enables an LCP-GWAS, which in turn generates the statistics required for follow-up analyses. We implemented our method on 12 highly correlated inflammatory biomarkers in a Finnish population-based study. Altogether, we identified 11 associations, four of which (F5, ABO, C1orf140 and PDGFRB) were not detected by biomarker-specific analyses. Fine-mapping identified 19 signals within the 11 loci and driver trait analysis determined the traits contributing to the associations. A phenome-wide association study on the 19 representative variants from the signals in 176,899 individuals from the FinnGen study revealed 53 disease associations (p < 1 × 10⁻⁴). Several reported pQTLs in the 11 loci provided orthogonal evidence for the biologically relevant functions of the representative variants. Our novel multivariate analysis workflow provides a powerful addition to standard univariate GWAS analyses by enabling multivariate GWAS follow-up and thus promoting the advancement of powerful multivariate methods in genomics.
DOI
10.1038/s41431-020-00730-8
LDAK
PUBMED_LINK
FULL NAME
LD-adjusted kinships
DESCRIPTION
LDAK is a software package for analysing association study data.
URL
TITLE
Improved heritability estimation from genome-wide SNPs.
Main citation
Speed D, Hemani G, Johnson MR, Balding DJ. (2012) Improved heritability estimation from genome-wide SNPs. Am J Hum Genet, 91 (6) 1011-21. doi:10.1016/j.ajhg.2012.10.010. PMID 23217325
ABSTRACT
Estimation of narrow-sense heritability, h(2), from genome-wide SNPs genotyped in unrelated individuals has recently attracted interest and offers several advantages over traditional pedigree-based methods. With the use of this approach, it has been estimated that over half the heritability of human height can be attributed to the ~300,000 SNPs on a genome-wide genotyping array. In comparison, only 5%-10% can be explained by SNPs reaching genome-wide significance. We investigated via simulation the validity of several key assumptions underpinning the mixed-model analysis used in SNP-based h(2) estimation. Although we found that the method is reasonably robust to violations of four key assumptions, it can be highly sensitive to uneven linkage disequilibrium (LD) between SNPs: contributions to h(2) are overestimated from causal variants in regions of high LD and are underestimated in regions of low LD. The overall direction of the bias can be up or down depending on the genetic architecture of the trait, but it can be substantial in realistic scenarios. We propose a modified kinship matrix in which SNPs are weighted according to local LD. We show that this correction greatly reduces the bias and increases the precision of h(2) estimates. We demonstrate the impact of our method on the first seven diseases studied by the Wellcome Trust Case Control Consortium. Our LD adjustment revises downward the h(2) estimate for immune-related diseases, as expected because of high LD in the major-histocompatibility region, but increases it for some nonimmune diseases. To calculate our revised kinship matrix, we developed LDAK, software for computing LD-adjusted kinships.
DOI
10.1016/j.ajhg.2012.10.010
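The LDAK abstract above describes a kinship matrix in which SNPs are weighted according to local LD. The sketch below illustrates only the weighting step with NumPy; the function name `weighted_kinship` and the randomly drawn weights are hypothetical, and the procedure LDAK uses to derive weights from local LD is not reproduced here.

```python
import numpy as np

def weighted_kinship(genotypes, weights):
    """Kinship matrix with per-SNP weights (hypothetical helper, not LDAK's API)."""
    G = np.asarray(genotypes, dtype=float)   # shape (n_individuals, n_snps), coded 0/1/2
    w = np.asarray(weights, dtype=float)     # shape (n_snps,), one weight per SNP
    # Standardize each SNP to mean 0 and variance 1 across individuals.
    # Assumes no monomorphic SNPs (their standard deviation would be 0).
    X = (G - G.mean(axis=0)) / G.std(axis=0)
    # Weighted sum of per-SNP outer products, scaled by the total weight.
    return (X * w) @ X.T / w.sum()

# Toy data: 40 individuals, 30 SNPs, and made-up weights standing in for LD weights.
rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(40, 30))
w = rng.uniform(0.5, 1.5, size=30)
K = weighted_kinship(G, w)
```

Setting every weight to 1 gives the usual unweighted kinship matrix, and setting a weight to 0 drops that SNP, which is how down-weighting SNPs in high-LD regions reduces their contribution.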
LDAK-GBAT
PUBMED_LINK
FULL NAME
LDAK gene-based association testing
URL
TITLE
LDAK-GBAT: Fast and powerful gene-based association testing using summary statistics.
Main citation
Berrandou TE, Balding D, Speed D. (2023) LDAK-GBAT: Fast and powerful gene-based association testing using summary statistics. Am J Hum Genet, 110 (1) 23-29. doi:10.1016/j.ajhg.2022.11.010. PMID 36480927
ABSTRACT
We present LDAK-GBAT, a tool for gene-based association testing using summary statistics from genome-wide association studies that is computationally efficient, produces well-calibrated p values, and is significantly more powerful than existing tools. LDAK-GBAT takes approximately 30 min to analyze imputed data (2.9M common, genic SNPs), requiring less than 10 Gb memory. It shows good control of type 1 error given an appropriate reference panel. Across 109 phenotypes (82 from the UK Biobank, 18 from the Million Veteran Program, and nine from the Psychiatric Genetics Consortium), LDAK-GBAT finds on average 19% (SE: 1%) more significant genes than the existing tool sumFREGAT-ACAT, with even greater gains in comparison with MAGMA, GCTA-fastBAT, sumFREGAT-SKAT-O, and sumFREGAT-PCA.
DOI
10.1016/j.ajhg.2022.11.010
LDAK-KVIK
URL
Main citation
Hof, J. P. & Speed, D. LDAK-KVIK performs fast and powerful mixed-model association analysis of quantitative and binary phenotypes. bioRxiv 2024.07.25.24311005 (2024) doi:10.1101/2024.07.25.24311005.
LDlink
PUBMED_LINK
DESCRIPTION
LDlink is a suite of web-based applications designed to easily and efficiently interrogate linkage disequilibrium in population groups. Each included application is specialized for querying and displaying unique aspects of linkage disequilibrium.
URL
TITLE
LDlink: a web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants.
Main citation
Machiela MJ, Chanock SJ. (2015) LDlink: a web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants. Bioinformatics, 31 (21) 3555-7. doi:10.1093/bioinformatics/btv402. PMID 26139635
ABSTRACT
Assessing linkage disequilibrium (LD) across ancestral populations is a powerful approach for investigating population-specific genetic structure as well as functionally mapping regions of disease susceptibility. Here, we present LDlink, a web-based collection of bioinformatic modules that query single nucleotide polymorphisms (SNPs) in population groups of interest to generate haplotype tables and interactive plots. Modules are designed with an emphasis on ease of use, query flexibility, and interactive visualization of results. Phase 3 haplotype data from the 1000 Genomes Project are referenced for calculating pairwise metrics of LD, searching for proxies in high LD, and enumerating all observed haplotypes. LDlink is tailored for investigators interested in mapping common and uncommon disease susceptibility loci by focusing on output linking correlated alleles and highlighting putative functional variants. AVAILABILITY AND IMPLEMENTATION: LDlink is a free and publicly available web tool which can be accessed at http://analysistools.nci.nih.gov/LDlink/. CONTACT: mitchell.machiela@nih.gov.
DOI
10.1093/bioinformatics/btv402
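The LDlink abstract above refers to pairwise metrics of LD calculated from phased haplotypes. As a rough illustration, the standard definitions of D, D' and r² for two biallelic sites can be sketched as follows; the function `pairwise_ld` and the toy haplotypes are hypothetical and are not part of LDlink or its API.

```python
import numpy as np

def pairwise_ld(hap_a, hap_b):
    """Return (D, D', r^2) for two biallelic sites from phased 0/1 haplotypes."""
    a = np.asarray(hap_a, dtype=float)
    b = np.asarray(hap_b, dtype=float)
    p_a, p_b = a.mean(), b.mean()     # frequency of allele 1 at each site
    p_ab = (a * b).mean()             # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b              # raw disequilibrium coefficient D
    # D' rescales D by its maximum attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2

# Eight toy haplotypes; each position is one chromosome carrying allele 0 or 1.
site_1 = [1, 1, 1, 0, 0, 0, 1, 0]
site_2 = [1, 1, 0, 0, 0, 0, 1, 0]
d, d_prime, r2 = pairwise_ld(site_1, site_2)
```

For these toy haplotypes D' is 1 while r² is 0.6, which shows why the two metrics are usually reported together: D' reaches 1 whenever one of the four possible haplotypes is absent, whereas r² also requires matching allele frequencies.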
LDlinkR
PUBMED_LINK
DESCRIPTION
An R Package for Rapidly Calculating Linkage Disequilibrium Statistics in Diverse Populations
URL
Main citation
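Myers TA, Chanock SJ, Machiela MJ. (2020) LDlinkR: An R Package for Rapidly Calculating Linkage Disequilibrium Statistics in Diverse Populations. Front Genet, 11, 157. doi:10.3389/fgene.2020.00157. PMID 32180801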
ABSTRACT
Genomic research involving human genetics and evolutionary biology relies heavily on linkage disequilibrium (LD) to investigate population-specific genetic structure, functionally map regions of disease susceptibility and uncover evolutionary history. Interactive and powerful tools are needed to calculate population-specific LD estimates for integrative genomics research. LDlink is an interactive suite of web-based tools developed to query germline variants in 1000 Genomes Project population groups of interest and generate interactive tables and plots of LD estimates. As an expansion to this resource, we have developed an R package, LDlinkR, designed to rapidly calculate statistics for large lists of variants and LD attributes that eliminates the time needed to perform repetitive requests from the web-based LDlink tool. LDlinkR accelerates genomic research by providing efficient and user-friendly functions to programmatically interrogate and download pairwise LD estimates from expansive lists of genetic variants. LDlinkR is a free and publicly available R package that can be installed from the Comprehensive R Archive Network (CRAN) or downloaded from https://github.com/CBIIT/LDlinkR.
DOI
10.3389/fgene.2020.00157
LDpred
PUBMED_LINK
DESCRIPTION
LDpred is a Python-based software package that adjusts GWAS summary statistics for the effects of linkage disequilibrium (LD).
URL
KEYWORDS
Bayesian, Gaussian infinitesimal prior, python
TITLE
Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores.
Main citation
Vilhjálmsson BJ, Yang J, Finucane HK, Gusev A, ...&, Price AL. (2015) Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores. Am J Hum Genet, 97 (4) 576-92. doi:10.1016/j.ajhg.2015.09.001. PMID 26430803
ABSTRACT
Polygenic risk scores have shown great promise in predicting complex disease risk and will become more accurate as training sample sizes increase. The standard approach for calculating risk scores involves linkage disequilibrium (LD)-based marker pruning and applying a p value threshold to association statistics, but this discards information and can reduce predictive accuracy. We introduce LDpred, a method that infers the posterior mean effect size of each marker by using a prior on effect sizes and LD information from an external reference panel. Theory and simulations show that LDpred outperforms the approach of pruning followed by thresholding, particularly at large sample sizes. Accordingly, predicted R(2) increased from 20.1% to 25.3% in a large schizophrenia dataset and from 9.8% to 12.0% in a large multiple sclerosis dataset. A similar relative improvement in accuracy was observed for three additional large disease datasets and for non-European schizophrenia samples. The advantage of LDpred over existing methods will grow as sample sizes increase.
DOI
10.1016/j.ajhg.2015.09.001
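The LDpred abstract above describes inferring posterior mean effect sizes from a prior on effect sizes and LD from a reference panel. Under the Gaussian infinitesimal prior listed in the keywords, the posterior mean has the closed form (M / (N h²) · I + D)⁻¹ β̂, where D is the LD matrix, M the number of markers, N the GWAS sample size and h² the heritability. The sketch below applies that formula to a single toy LD block; `ldpred_inf` is a hypothetical helper rather than the LDpred package API, and the numbers are made up.

```python
import numpy as np

def ldpred_inf(beta_hat, ld, n, h2):
    """Posterior mean effects under an infinitesimal prior (illustrative sketch)."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    ld = np.asarray(ld, dtype=float)
    m = len(beta_hat)                 # markers in this toy example
    shrink = m / (n * h2)             # ridge-like shrinkage term M / (N h^2)
    # Solve (shrink * I + D) x = beta_hat instead of forming the inverse explicitly.
    return np.linalg.solve(shrink * np.eye(m) + ld, beta_hat)

# Toy marginal effects and a made-up LD matrix for four markers.
beta_hat = np.array([0.02, 0.015, -0.01, 0.0])
ld = np.array([
    [1.0, 0.8, 0.0, 0.0],
    [0.8, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.3],
    [0.0, 0.0, 0.3, 1.0],
])
beta_post = ldpred_inf(beta_hat, ld, n=1000, h2=0.5)
```

With no LD (an identity matrix for D) the formula reduces to uniform shrinkage of every marginal effect by 1 / (1 + M / (N h²)); with LD, correlated markers share signal instead of each being counted in full.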
LDpred-funct
PUBMED_LINK
DESCRIPTION
LDpred-funct is a method for polygenic prediction that leverages trait-specific functional priors to increase prediction accuracy.
URL
KEYWORDS
Bayesian, functional priors
TITLE
Incorporating functional priors improves polygenic prediction accuracy in UK Biobank and 23andMe data sets.
Main citation
Márquez-Luna C, Gazal S, Loh PR, Kim SS, ...&, Price AL. (2021) Incorporating functional priors improves polygenic prediction accuracy in UK Biobank and 23andMe data sets. Nat Commun, 12 (1) 6052. doi:10.1038/s41467-021-25171-9. PMID 34663819
ABSTRACT
Polygenic risk prediction is a widely investigated topic because of its promising clinical applications. Genetic variants in functional regions of the genome are enriched for complex trait heritability. Here, we introduce a method for polygenic prediction, LDpred-funct, that leverages trait-specific functional priors to increase prediction accuracy. We fit priors using the recently developed baseline-LD model, including coding, conserved, regulatory, and LD-related annotations. We analytically estimate posterior mean causal effect sizes and then use cross-validation to regularize these estimates, improving prediction accuracy for sparse architectures. We applied LDpred-funct to predict 21 highly heritable traits in the UK Biobank (avg N = 373 K as training data). LDpred-funct attained a +4.6% relative improvement in average prediction accuracy (avg prediction R² = 0.144; highest R² = 0.413 for height) compared to SBayesR (the best method that does not incorporate functional information). For height, meta-analyzing training data from UK Biobank and 23andMe cohorts (N = 1107 K) increased prediction R² to 0.431. Our results show that incorporating functional priors improves polygenic prediction accuracy, consistent with the functional architecture of complex traits.
DOI
10.1038/s41467-021-25171-9
LDpred2
PUBMED_LINK
DESCRIPTION
LDpred2 is a dedicated PRS program, implemented in the R package bigsnpr, that uses a Bayesian approach to polygenic risk scoring.
URL
KEYWORDS
Bayesian, R, LDpred2-grid (LDpred2), LDpred2-auto, LDpred2-sparse
TITLE
LDpred2: better, faster, stronger.
Main citation
Privé F, Arbel J, Vilhjálmsson BJ. (2021) LDpred2: better, faster, stronger. Bioinformatics, 36 (22-23) 5424-5431. doi:10.1093/bioinformatics/btaa1029. PMID 33326037
ABSTRACT
MOTIVATION: Polygenic scores have become a central tool in human genetics research. LDpred is a popular method for deriving polygenic scores based on summary statistics and a matrix of correlation between genetic variants. However, LDpred has limitations that may reduce its predictive performance. RESULTS: Here, we present LDpred2, a new version of LDpred that addresses these issues. We also provide two new options in LDpred2: a 'sparse' option that can learn effects that are exactly 0, and an 'auto' option that directly learns the two LDpred parameters from data. We benchmark predictive performance of LDpred2 against the previous version on simulated and real data, demonstrating substantial improvements in robustness and predictive accuracy compared to LDpred1. We then show that LDpred2 also outperforms other polygenic score methods recently developed, with a mean AUC over the 8 real traits analyzed here of 65.1%, compared to 63.8% for lassosum, 62.9% for PRS-CS and 61.5% for SBayesR. Note that LDpred2 provides more accurate polygenic scores when run genome-wide, instead of per chromosome. AVAILABILITY AND IMPLEMENTATION: LDpred2 is implemented in R package bigsnpr. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
DOI
10.1093/bioinformatics/btaa1029
LDpred2-auto
PUBMED_LINK
DESCRIPTION
LDpred2 is a widely used Bayesian method for building polygenic scores (PGS). LDpred2-auto can infer the two parameters from the LDpred model, h^2 and p, so that it does not require an additional validation dataset to choose best-performing parameters. Here, we present a new version of LDpred2-auto, which adds a third parameter alpha to its model for modeling negative selection. Additional changes are also made to provide better sampling of these parameters.
URL
KEYWORDS
Bayesian, new LDpred2-auto, α (relationship between MAF and beta)
TITLE
Inferring disease architecture and predictive ability with LDpred2-auto.
Main citation
Privé F, Albiñana C, Arbel J, Pasaniuc B, ...&, Vilhjálmsson BJ. (2023) Inferring disease architecture and predictive ability with LDpred2-auto. Am J Hum Genet, 110 (12) 2042-2055. doi:10.1016/j.ajhg.2023.10.010. PMID 37944514
ABSTRACT
LDpred2 is a widely used Bayesian method for building polygenic scores (PGSs). LDpred2-auto can infer the two parameters from the LDpred model, the SNP heritability h² and polygenicity p, so that it does not require an additional validation dataset to choose best-performing parameters. The main aim of this paper is to properly validate the use of LDpred2-auto for inferring multiple genetic parameters. Here, we present a new version of LDpred2-auto that adds an optional third parameter α to its model, for modeling negative selection. We then validate the inference of these three parameters (or two, when using the previous model). We also show that LDpred2-auto provides per-variant probabilities of being causal that are well calibrated and can therefore be used for fine-mapping purposes. We also introduce a formula to infer the out-of-sample predictive performance r² of the resulting PGS directly from the Gibbs sampler of LDpred2-auto. Finally, we extend the set of HapMap3 variants recommended to use with LDpred2 with 37% more variants to improve the coverage of this set, and we show that this new set of variants captures 12% more heritability and provides 6% more predictive performance, on average, in UK Biobank analyses.
DOI
10.1016/j.ajhg.2023.10.010
LDSC
PUBMED_LINK
FULL NAME
LD Score Regression
DESCRIPTION
ldsc is a command line tool for estimating heritability and genetic correlation from GWAS summary statistics. ldsc also computes LD Scores.
URL
TITLE
LD Score regression distinguishes confounding from polygenicity in genome-wide association studies.
Main citation
Bulik-Sullivan BK, Loh PR, Finucane HK, Ripke S, ...&, Neale BM. (2015) LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat Genet, 47 (3) 291-5. doi:10.1038/ng.3211. PMID 25642630
ABSTRACT
Both polygenicity (many small genetic effects) and confounding biases, such as cryptic relatedness and population stratification, can yield an inflated distribution of test statistics in genome-wide association studies (GWAS). However, current methods cannot distinguish between inflation from a true polygenic signal and bias. We have developed an approach, LD Score regression, that quantifies the contribution of each by examining the relationship between test statistics and linkage disequilibrium (LD). The LD Score regression intercept can be used to estimate a more powerful and accurate correction factor than genomic control. We find strong evidence that polygenicity accounts for the majority of the inflation in test statistics in many GWAS of large sample size.
DOI
10.1038/ng.3211
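The LDSC abstract above describes separating polygenicity from confounding by relating test statistics to LD. The underlying model is E[χ²_j] = N h² ℓ_j / M + N a + 1, where ℓ_j is the LD Score of SNP j, so the slope of a regression of χ² on LD Score carries the heritability and the intercept carries the confounding term. The sketch below fits that line by ordinary least squares on noise-free toy data; `ldsc_fit` is a hypothetical helper, and the weighting and block-jackknife standard errors used by the actual ldsc tool are omitted.

```python
import numpy as np

def ldsc_fit(chi2, ld_scores, n, m):
    """Unweighted LD Score regression (illustrative sketch, not the ldsc CLI)."""
    # np.polyfit returns coefficients from highest degree down: [slope, intercept].
    slope, intercept = np.polyfit(ld_scores, chi2, deg=1)
    h2 = slope * m / n                # slope = N * h2 / M
    return h2, intercept

# Noise-free toy data generated from the model with h2 = 0.4 and intercept = 1.05.
n, m = 100_000, 1_000_000
ld_scores = np.linspace(1.0, 200.0, 500)
chi2 = n * 0.4 / m * ld_scores + 1.05
h2_est, intercept_est = ldsc_fit(chi2, ld_scores, n, m)
```

An intercept above 1 points to inflation that does not scale with LD, such as population stratification or cryptic relatedness, while a steep slope with an intercept near 1 points to polygenic signal.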
LDSC-SEG
PUBMED_LINK
FULL NAME
LD score regression applied to specifically expressed genes
URL
KEYWORDS
LDSC, tissue, cell type
TITLE
Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types.
Main citation
Finucane HK, Reshef YA, Anttila V, Slowikowski K, ...&, Price AL. (2018) Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types. Nat Genet, 50 (4) 621-629. doi:10.1038/s41588-018-0081-4. PMID 29632380
ABSTRACT
We introduce an approach to identify disease-relevant tissues and cell types by analyzing gene expression data together with genome-wide association study (GWAS) summary statistics. Our approach uses stratified linkage disequilibrium (LD) score regression to test whether disease heritability is enriched in regions surrounding genes with the highest specific expression in a given tissue. We applied our approach to gene expression data from several sources together with GWAS summary statistics for 48 diseases and traits (average N = 169,331) and found significant tissue-specific enrichments (false discovery rate (FDR) < 5%) for 34 traits. In our analysis of multiple tissues, we detected a broad range of enrichments that recapitulated known biology. In our brain-specific analysis, significant enrichments included an enrichment of inhibitory over excitatory neurons for bipolar disorder, and excitatory over inhibitory neurons for schizophrenia and body mass index. Our results demonstrate that our polygenic approach is a powerful way to leverage gene expression data for interpreting GWAS signals.
DOI
10.1038/s41588-018-0081-4
LDSTORE2
PUBMED_LINK
DESCRIPTION
LDstore is a computationally efficient program for estimating and storing Linkage Disequilibrium (SNP correlations). It combines some of the best features from RAREMETALWORKER and PLINK by implementing parallel processing using OPENMP and storing of the SNP correlations with information about the SNPs in the same binary file for fast lookups. LDstore is therefore the ideal tool for sharing SNP correlations in large-scale meta-analyses of genome-wide association studies and for on-the-fly computing/querying within web portals.
URL
TITLE
Prospects of Fine-Mapping Trait-Associated Genomic Regions by Using Summary Statistics from Genome-wide Association Studies.
Main citation
Benner C, Havulinna AS, Järvelin MR, Salomaa V, ...&, Pirinen M. (2017) Prospects of Fine-Mapping Trait-Associated Genomic Regions by Using Summary Statistics from Genome-wide Association Studies. Am J Hum Genet, 101 (4) 539-551. doi:10.1016/j.ajhg.2017.08.012. PMID 28942963
ABSTRACT
During the past few years, various novel statistical methods have been developed for fine-mapping with the use of summary statistics from genome-wide association studies (GWASs). Although these approaches require information about the linkage disequilibrium (LD) between variants, there has not been a comprehensive evaluation of how estimation of the LD structure from reference genotype panels performs in comparison with that from the original individual-level GWAS data. Using population genotype data from Finland and the UK Biobank, we show here that a reference panel of 1,000 individuals from the target population is adequate for a GWAS cohort of up to 10,000 individuals, whereas smaller panels, such as those from the 1000 Genomes Project, should be avoided. We also show, both theoretically and empirically, that the size of the reference panel needs to scale with the GWAS sample size; this has important consequences for the application of these methods in ongoing GWAS meta-analyses and large biobank studies. We conclude by providing software tools and by recommending practices for sharing LD information to more efficiently exploit summary statistics in genetics research.
DOI
10.1016/j.ajhg.2017.08.012
LeafCutter
PUBMED_LINK
DESCRIPTION
Leafcutter quantifies RNA splicing variation using short-read RNA-seq data. The core idea is to leverage spliced reads (reads that span an intron) to quantify (differential) intron usage across samples.
URL
TITLE
Annotation-free quantification of RNA splicing using LeafCutter.
Main citation
Li YI, Knowles DA, Humphrey J, Barbeira AN, ...&, Pritchard JK. (2018) Annotation-free quantification of RNA splicing using LeafCutter. Nat Genet, 50 (1) 151-158. doi:10.1038/s41588-017-0004-9. PMID 29229983
ABSTRACT
The excision of introns from pre-mRNA is an essential step in mRNA processing. We developed LeafCutter to study sample and population variation in intron splicing. LeafCutter identifies variable splicing events from short-read RNA-seq data and finds events of high complexity. Our approach obviates the need for transcript annotations and circumvents the challenges in estimating relative isoform or exon usage in complex splicing events. LeafCutter can be used both to detect differential splicing between sample groups and to map splicing quantitative trait loci (sQTLs). Compared with contemporary methods, our approach identified 1.4-2.1 times more sQTLs, many of which helped us ascribe molecular effects to disease-associated variants. Transcriptome-wide associations between LeafCutter intron quantifications and 40 complex traits increased the number of associated disease genes at a 5% false discovery rate by an average of 2.1-fold compared with that detected through the use of gene expression levels alone. LeafCutter is fast, scalable, easy to use, and available online.
DOI
10.1038/s41588-017-0004-9
LEMMA
PUBMED_LINK
FULL NAME
Linear Environment Mixed Model Analysis
DESCRIPTION
LEMMA (Linear Environment Mixed Model Analysis) is a whole-genome regression method for flexible modeling of gene-environment interactions in large datasets such as the UK Biobank.
URL
TITLE
Inferring Gene-by-Environment Interactions with a Bayesian Whole-Genome Regression Model.
Main citation
Kerin M, Marchini J. (2020) Inferring Gene-by-Environment Interactions with a Bayesian Whole-Genome Regression Model. Am J Hum Genet, 107 (4) 698-713. doi:10.1016/j.ajhg.2020.08.009. PMID 32888427
ABSTRACT
The contribution of gene-by-environment (GxE) interactions for many human traits and diseases is poorly characterized. We propose a Bayesian whole-genome regression model for joint modeling of main genetic effects and GxE interactions in large-scale datasets, such as the UK Biobank, where many environmental variables have been measured. The method is called LEMMA (Linear Environment Mixed Model Analysis) and estimates a linear combination of environmental variables, called an environmental score (ES), that interacts with genetic markers throughout the genome. The ES provides a readily interpretable way to examine the combined effect of many environmental variables. The ES can be used both to estimate the proportion of phenotypic variance attributable to GxE effects and to test for GxE effects at genetic variants across the genome. GxE effects can induce heteroskedasticity in quantitative traits, and LEMMA accounts for this by using robust standard error estimates when testing for GxE effects. When applied to body mass index, systolic blood pressure, diastolic blood pressure, and pulse pressure in the UK Biobank, we estimate that 9.3%, 3.9%, 1.6%, and 12.5%, respectively, of phenotypic variance is explained by GxE interactions and that low-frequency variants explain most of this variance. We also identify three loci that interact with the estimated environmental scores (-log10p>7.3).
DOI
10.1016/j.ajhg.2020.08.009
Locityper
PUBMED_LINK
DESCRIPTION
Locityper performs targeted genotyping of structurally variable and hyperpolymorphic genes—including HLA, KIR, MUC, and FCGR families—from short- or long-read whole-genome sequencing by aligning reads to locus haplotypes (often from pangenomes) and scoring depth and insert-size consistency.
URL
KEYWORDS
genotyping, complex loci, HLA, short read, long read, WGS
TITLE
Locityper enables targeted genotyping of complex polymorphic genes.
Main citation
Prodanov T, Plender EG, Seebohm G, Meuth SG, ...&, Marschall T. (2025) Locityper enables targeted genotyping of complex polymorphic genes. Nat Genet, 57 (11) 2901-2908. doi:10.1038/s41588-025-02362-4. PMID 41107551
ABSTRACT
The human genome contains many structurally variable polymorphic loci, including several hundred disease-associated genes, almost inaccessible for accurate variant calling. Here we present Locityper, a tool capable of genotyping such challenging genes using short-read and long-read whole-genome sequencing. For each target, Locityper recruits and aligns reads to locus haplotypes, for instance, extracted from a pangenome, and finds the likeliest haplotype pair by optimizing read alignment, insert size and read depth profiles. Across 256 challenging medically relevant loci, Locityper achieves a median quality value (QV) above 35 from both long-read and short-read data, outperforming state-of-the-art Illumina and PacBio HiFi variant calling pipelines by 10.9 and 1.7 points, respectively. Furthermore, Locityper provides access to hyperpolymorphic HLA genes and other gene families, including KIR, MUC and FCGR. With its low running time of 1 h 35 m per sample at eight threads, Locityper is scalable to biobank-sized cohorts, enabling association studies for previously intractable disease-relevant genes.
DOI
10.1038/s41588-025-02362-4
locuszoom
PUBMED_LINK
URL
TITLE
LocusZoom: regional visualization of genome-wide association scan results.
Main citation
Pruim RJ, Welch RP, Sanna S, Teslovich TM, ...&, Willer CJ. (2010) LocusZoom: regional visualization of genome-wide association scan results. Bioinformatics, 26 (18) 2336-7. doi:10.1093/bioinformatics/btq419. PMID 20634204
ABSTRACT
Genome-wide association studies (GWAS) have revealed hundreds of loci associated with common human genetic diseases and traits. We have developed a web-based plotting tool that provides fast visual display of GWAS results in a publication-ready format. LocusZoom visually displays regional information such as the strength and extent of the association signal relative to genomic position, local linkage disequilibrium (LD) and recombination patterns and the positions of genes in the region. AVAILABILITY: LocusZoom can be accessed from a web interface at http://csg.sph.umich.edu/locuszoom. Users may generate a single plot using a web form, or many plots using batch mode. The software utilizes LD information from HapMap Phase II (CEU, YRI and JPT+CHB) or 1000 Genomes (CEU) and gene information from the UCSC browser, and will accept SNP identifiers in dbSNP or 1000 Genomes format. Single plots are generated in approximately 20 s. Source code and associated databases are available for download and local installation, and full documentation is available online.
DOI
10.1093/bioinformatics/btq419
loftee
PUBMED_LINK
FULL NAME
Loss-Of-Function Transcript Effect Estimator
DESCRIPTION
A VEP plugin to identify loss-of-function (LoF) variation. It currently assesses stop-gained, splice-site-disrupting, and frameshift variants.
URL
TITLE
The mutational constraint spectrum quantified from variation in 141,456 humans.
Main citation
Karczewski KJ, Francioli LC, Tiao G, Cummings BB, ...&, MacArthur DG. (2020) The mutational constraint spectrum quantified from variation in 141,456 humans. Nature, 581 (7809) 434-443. doi:10.1038/s41586-020-2308-7. PMID 32461654
ABSTRACT
Genetic variants that inactivate protein-coding genes are a powerful source of information about the phenotypic consequences of gene disruption: genes that are crucial for the function of an organism will be depleted of such variants in natural populations, whereas non-essential genes will tolerate their accumulation. However, predicted loss-of-function variants are enriched for annotation errors, and tend to be found at extremely low frequencies, so their analysis requires careful variant annotation and very large sample sizes. Here we describe the aggregation of 125,748 exomes and 15,708 genomes from human sequencing studies into the Genome Aggregation Database (gnomAD). We identify 443,769 high-confidence predicted loss-of-function variants in this cohort after filtering for artefacts caused by sequencing and annotation errors. Using an improved model of human mutation rates, we classify human protein-coding genes along a spectrum that represents tolerance to inactivation, validate this classification using data from model organisms and engineered human cells, and show that it can be used to improve the power of gene discovery for both common and rare diseases.
DOI
10.1038/s41586-020-2308-7
Logica
FULL NAME
LOcal GenetIc Correlation across Ancestries
DESCRIPTION
Logica (LOcal GenetIc Correlation across Ancestries) is a method specifically designed to estimate local genetic correlations across ancestries. Logica employs a bivariate linear mixed model that explicitly accounts for diverse LD patterns across ancestries, operates on GWAS summary statistics, and utilizes a maximum likelihood framework for robust inference. Logica is implemented as an open-source R package.
URL
LT-FH
PUBMED_LINK
FULL NAME
liability threshold model, conditional on case–control status and family history
DESCRIPTION
An association method based on posterior mean genetic liabilities under a liability threshold model, conditional on case-control status and family history (LT-FH).
URL
TITLE
Liability threshold modeling of case-control status and family history of disease increases association power.
Main citation
Hujoel MLA, Gazal S, Loh PR, Patterson N, ...&, Price AL. (2020) Liability threshold modeling of case-control status and family history of disease increases association power. Nat Genet, 52 (5) 541-547. doi:10.1038/s41588-020-0613-6. PMID 32313248
ABSTRACT
Family history of disease can provide valuable information in case-control association studies, but it is currently unclear how to best combine case-control status and family history of disease. We developed an association method based on posterior mean genetic liabilities under a liability threshold model, conditional on case-control status and family history (LT-FH). Analyzing 12 diseases from the UK Biobank (average N = 350,000) we compared LT-FH to genome-wide association without using family history (GWAS) and a previous proxy-based method incorporating family history (GWAX). LT-FH was 63% (standard error (s.e.) 6%) more powerful than GWAS and 36% (s.e. 4%) more powerful than the trait-specific maximum of GWAS and GWAX, based on the number of independent genome-wide-significant loci across all diseases (for example, 690 loci for LT-FH versus 423 for GWAS); relative improvements were similar when applying BOLT-LMM to GWAS, GWAX and LT-FH phenotypes. Thus, LT-FH greatly increases association power when family history of disease is available.
DOI
10.1038/s41588-020-0613-6
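The idea of a posterior mean genetic liability can be illustrated with a small Monte Carlo sketch. This is not the LT-FH software: it is a simplified illustration that conditions on the individual's own status and on a single parent only, whereas the published method also uses both parents and siblings. The function name, the heritability `h2`, and the prevalence value are assumptions chosen for the example.

```python
import random
from statistics import NormalDist


def posterior_mean_liability(h2, prevalence, n_draws=200_000, seed=1):
    """Estimate E[genetic liability | own status, one parent's status].

    Hypothetical helper for illustration. Liabilities have unit variance,
    and a person is a case when liability exceeds the prevalence threshold.
    Returns a dict keyed by (child_is_case, parent_is_case).
    """
    rng = random.Random(seed)
    threshold = NormalDist().inv_cdf(1.0 - prevalence)
    env_sd = (1.0 - h2) ** 0.5
    sums, counts = {}, {}
    for _ in range(n_draws):
        g_child = rng.gauss(0.0, h2 ** 0.5)
        # Parent and child share half their genetic liability:
        # cov(g_child, g_parent) = h2 / 2 and var(g_parent) = h2.
        g_parent = 0.5 * g_child + rng.gauss(0.0, (0.75 * h2) ** 0.5)
        child_case = g_child + rng.gauss(0.0, env_sd) >= threshold
        parent_case = g_parent + rng.gauss(0.0, env_sd) >= threshold
        key = (child_case, parent_case)
        sums[key] = sums.get(key, 0.0) + g_child
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


if __name__ == "__main__":
    for key, value in sorted(posterior_mean_liability(0.5, 0.1).items()):
        print(key, round(value, 3))
```

The resulting values form a quantitative phenotype: a control with an affected parent receives a higher liability than a control with an unaffected parent, which is the extra information a plain case-control coding discards.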
MACH / minimach
PUBMED_LINK
DESCRIPTION
(MACH)
URL
TITLE
MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes.
Main citation
Li Y, Willer CJ, Ding J, Scheet P, ...&, Abecasis GR. (2010) MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol, 34 (8) 816-34. doi:10.1002/gepi.20533. PMID 21058334
ABSTRACT
Genome-wide association studies (GWAS) can identify common alleles that contribute to complex disease susceptibility. Despite the large number of SNPs assessed in each study, the effects of most common SNPs must be evaluated indirectly using either genotyped markers or haplotypes thereof as proxies. We have previously implemented a computationally efficient Markov Chain framework for genotype imputation and haplotyping in the freely available MaCH software package. The approach describes sampled chromosomes as mosaics of each other and uses available genotype and shotgun sequence data to estimate unobserved genotypes and haplotypes, together with useful measures of the quality of these estimates. Our approach is already widely used to facilitate comparison of results across studies as well as meta-analyses of GWAS. Here, we use simulations and experimental genotypes to evaluate its accuracy and utility, considering choices of genotyping panels, reference panel configurations, and designs where genotyping is replaced with shotgun sequencing. Importantly, we show that genotype imputation not only facilitates cross study analyses but also increases power of genetic association studies. We show that genotype imputation of common variants using HapMap haplotypes as a reference is very accurate using either genome-wide SNP data or smaller amounts of data typical in fine-mapping studies. Furthermore, we show the approach is applicable in a variety of populations. Finally, we illustrate how association analyses of unobserved variants will benefit from ongoing advances such as larger HapMap reference panels and whole genome shotgun sequencing technologies.
DOI
10.1002/gepi.20533
MACH / minimach pre-phasing
PUBMED_LINK
DESCRIPTION
(pre-phasing, minimac)
URL
TITLE
Fast and accurate genotype imputation in genome-wide association studies through pre-phasing.
Main citation
Howie B, Fuchsberger C, Stephens M, Marchini J, ...&, Abecasis GR. (2012) Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet, 44 (8) 955-9. doi:10.1038/ng.2354. PMID 22820512
ABSTRACT
The 1000 Genomes Project and disease-specific sequencing efforts are producing large collections of haplotypes that can be used as reference panels for genotype imputation in genome-wide association studies (GWAS). However, imputing from large reference panels with existing methods imposes a high computational burden. We introduce a strategy called 'pre-phasing' that maintains the accuracy of leading methods while reducing computational costs. We first statistically estimate the haplotypes for each individual within the GWAS sample (pre-phasing) and then impute missing genotypes into these estimated haplotypes. This reduces the computational cost because (i) the GWAS samples must be phased only once, whereas standard methods would implicitly repeat phasing with each reference panel update, and (ii) it is much faster to match a phased GWAS haplotype to one reference haplotype than to match two unphased GWAS genotypes to a pair of reference haplotypes. We implemented our approach in the MaCH and IMPUTE2 frameworks, and we tested it on data sets from the Wellcome Trust Case Control Consortium 2 (WTCCC2), the Genetic Association Information Network (GAIN), the Women's Health Initiative (WHI) and the 1000 Genomes Project. This strategy will be particularly valuable for repeated imputation as reference panels evolve.
DOI
10.1038/ng.2354
MACH / minimach2
PUBMED_LINK
DESCRIPTION
(minimac2)
URL
TITLE
minimac2: faster genotype imputation.
Main citation
Fuchsberger C, Abecasis GR, Hinds DA. (2015) minimac2: faster genotype imputation. Bioinformatics, 31 (5) 782-4. doi:10.1093/bioinformatics/btu704. PMID 25338720
ABSTRACT
UNLABELLED: Genotype imputation is a key step in the analysis of genome-wide association studies. Upcoming very large reference panels, such as those from The 1000 Genomes Project and the Haplotype Consortium, will improve imputation quality of rare and less common variants, but will also increase the computational burden. Here, we demonstrate how the application of software engineering techniques can help to keep imputation broadly accessible. Overall, these improvements speed up imputation by an order of magnitude compared with our previous implementation. AVAILABILITY AND IMPLEMENTATION: minimac2, including source code, documentation, and examples is available at http://genome.sph.umich.edu/wiki/Minimac2
DOI
10.1093/bioinformatics/btu704
MACH / minimach3
PUBMED_LINK
DESCRIPTION
(minimac3)
URL
TITLE
Next-generation genotype imputation service and methods.
Main citation
Das S, Forer L, Schönherr S, Sidore C, ...&, Fuchsberger C. (2016) Next-generation genotype imputation service and methods. Nat Genet, 48 (10) 1284-1287. doi:10.1038/ng.3656. PMID 27571263
ABSTRACT
Genotype imputation is a key component of genetic association studies, where it increases power, facilitates meta-analysis, and aids interpretation of signals. Genotype imputation is computationally demanding and, with current tools, typically requires access to a high-performance computing cluster and to a reference panel of sequenced genomes. Here we describe improvements to imputation machinery that reduce computational requirements by more than an order of magnitude with no loss of accuracy in comparison to standard imputation tools. We also describe a new web-based service for imputation that facilitates access to new reference panels and greatly improves user experience and productivity.
DOI
10.1038/ng.3656
MACH / minimach4
DESCRIPTION
(minimac4)
URL
MAGMA
PUBMED_LINK
FULL NAME
Multi-marker Analysis of GenoMic Annotation
DESCRIPTION
MAGMA is a tool for gene analysis and generalized gene-set analysis of GWAS data. It can be used to analyse both raw genotype data as well as summary SNP p-values from a previous GWAS or meta-analysis.
URL
TITLE
MAGMA: generalized gene-set analysis of GWAS data.
Main citation
de Leeuw CA, Mooij JM, Heskes T, Posthuma D. (2015) MAGMA: generalized gene-set analysis of GWAS data. PLoS Comput Biol, 11 (4) e1004219. doi:10.1371/journal.pcbi.1004219. PMID 25885710
ABSTRACT
By aggregating data for complex traits in a biologically meaningful way, gene and gene-set analysis constitute a valuable addition to single-marker analysis. However, although various methods for gene and gene-set analysis currently exist, they generally suffer from a number of issues. Statistical power for most methods is strongly affected by linkage disequilibrium between markers, multi-marker associations are often hard to detect, and the reliance on permutation to compute p-values tends to make the analysis computationally very expensive. To address these issues we have developed MAGMA, a novel tool for gene and gene-set analysis. The gene analysis is based on a multiple regression model, to provide better statistical performance. The gene-set analysis is built as a separate layer around the gene analysis for additional flexibility. This gene-set analysis also uses a regression structure to allow generalization to analysis of continuous properties of genes and simultaneous analysis of multiple gene sets and other gene properties. Simulations and an analysis of Crohn's Disease data are used to evaluate the performance of MAGMA and to compare it to a number of other gene and gene-set analysis tools. The results show that MAGMA has significantly more power than other tools for both the gene and the gene-set analysis, identifying more genes and gene sets associated with Crohn's Disease while maintaining a correct type 1 error rate. Moreover, the MAGMA analysis of the Crohn's Disease data was found to be considerably faster as well.
DOI
10.1371/journal.pcbi.1004219
MANOVA
FULL NAME
multivariate analysis of variance
MANTRA
PUBMED_LINK
FULL NAME
Meta-ANalysis of Transethnic Association studies
KEYWORDS
cross-population
TITLE
Transethnic meta-analysis of genomewide association studies.
Main citation
Morris AP. (2011) Transethnic meta-analysis of genomewide association studies. Genet Epidemiol, 35 (8) 809-22. doi:10.1002/gepi.20630. PMID 22125221
ABSTRACT
The detection of loci contributing effects to complex human traits, and their subsequent fine-mapping for the location of causal variants, remains a considerable challenge for the genetics research community. Meta-analyses of genomewide association studies, primarily ascertained from European-descent populations, have made considerable advances in our understanding of complex trait genetics, although much of their heritability is still unexplained. With the increasing availability of genomewide association data from diverse populations, transethnic meta-analysis may offer an exciting opportunity to increase the power to detect novel complex trait loci and to improve the resolution of fine-mapping of causal variants by leveraging differences in local linkage disequilibrium structure between ethnic groups. However, we might also expect there to be substantial genetic heterogeneity between diverse populations, both in terms of the spectrum of causal variants and their allelic effects, which cannot easily be accommodated through traditional approaches to meta-analysis. In order to address this challenge, I propose novel transethnic meta-analysis methodology that takes account of the expected similarity in allelic effects between the most closely related populations, while allowing for heterogeneity between more diverse ethnic groups. This approach yields substantial improvements in performance, compared to fixed-effects meta-analysis, both in terms of power to detect association, and localization of the causal variant, over a range of models of heterogeneity between ethnic groups. Furthermore, when the similarity in allelic effects between populations is well captured by their relatedness, this approach has increased power and mapping resolution over random-effects meta-analysis.
DOI
10.1002/gepi.20630
MatrixEQTL (Matrix eQTL)
PUBMED_LINK
FULL NAME
Matrix eQTL
DESCRIPTION
Matrix eQTL is designed for fast eQTL analysis on large datasets. Matrix eQTL can test for association between genotype and gene expression using linear regression with either additive or ANOVA genotype effects. The models can include covariates to account for factors such as population stratification, gender, and clinical variables. It also supports models with heteroscedastic and/or correlated errors, false discovery rate estimation, and separate treatment of local (cis) and distant (trans) eQTLs.
URL
TITLE
Matrix eQTL: ultra fast eQTL analysis via large matrix operations.
Main citation
Shabalin AA. (2012) Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics, 28 (10) 1353-8. doi:10.1093/bioinformatics/bts163. PMID 22492648
ABSTRACT
MOTIVATION: Expression quantitative trait loci (eQTL) analysis links variations in gene expression levels to genotypes. For modern datasets, eQTL analysis is a computationally intensive task as it involves testing for association of billions of transcript-SNP (single-nucleotide polymorphism) pair. The heavy computational burden makes eQTL analysis less popular and sometimes forces analysts to restrict their attention to just a small subset of transcript-SNP pairs. As more transcripts and SNPs get interrogated over a growing number of samples, the demand for faster tools for eQTL analysis grows stronger. RESULTS: We have developed a new software for computationally efficient eQTL analysis called Matrix eQTL. In tests on large datasets, it was 2-3 orders of magnitude faster than existing popular tools for QTL/eQTL analysis, while finding the same eQTLs. The fast performance is achieved by special preprocessing and expressing the most computationally intensive part of the algorithm in terms of large matrix operations. Matrix eQTL supports additive linear and ANOVA models with covariates, including models with correlated and heteroskedastic errors. The issue of multiple testing is addressed by calculating false discovery rate; this can be done separately for cis- and trans-eQTLs.
DOI
10.1093/bioinformatics/bts163
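The abstract attributes the speed to expressing the heavy computation as large matrix operations. A minimal sketch of that idea, for the simplest case only (additive genotype, no covariates), is shown here: after each row is centered and scaled to unit length, every SNP-gene Pearson correlation is one entry of a single matrix product. The function names are hypothetical and this is not the package's own code; a real implementation would use an optimized linear algebra library rather than Python lists.

```python
import math


def unit_rows(matrix):
    """Center each row and scale it to unit length (rows must not be constant)."""
    scaled = []
    for row in matrix:
        mean = sum(row) / len(row)
        centered = [value - mean for value in row]
        norm = math.sqrt(sum(value * value for value in centered))
        scaled.append([value / norm for value in centered])
    return scaled


def all_pairs_correlation(genotypes, expression):
    """Return the SNP-by-gene matrix of Pearson correlations.

    Both inputs are lists of rows over the same samples. The result is the
    matrix product of the standardized genotype and expression matrices.
    """
    g_rows = unit_rows(genotypes)
    e_rows = unit_rows(expression)
    return [[sum(a * b for a, b in zip(g, e)) for e in e_rows] for g in g_rows]


def t_statistic(r, n_samples):
    """Convert a correlation into the t statistic of simple linear regression."""
    return r * math.sqrt((n_samples - 2) / (1.0 - r * r))


if __name__ == "__main__":
    snps = [[0, 1, 2, 1, 0, 2]]
    genes = [[0.1, 1.0, 2.2, 0.9, 0.2, 1.8], [5.0, 4.1, 4.8, 5.2, 4.9, 5.1]]
    for r in all_pairs_correlation(snps, genes)[0]:
        print(round(r, 3), round(t_statistic(r, 6), 3))
```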
MegaPRS
PUBMED_LINK
DESCRIPTION
individual level: big_spLinReg, LDAK-Ridge-Predict, LDAK-Bolt-Predict and LDAK-BayesR-Predict
sumstats: LDAK-Lasso-SS, LDAK-Ridge-SS, LDAK-Bolt-SS and LDAK-BayesR-SS
URL
TITLE
Improved genetic prediction of complex traits from individual-level data or summary statistics.
Main citation
Zhang Q, Privé F, Vilhjálmsson B, Speed D. (2021) Improved genetic prediction of complex traits from individual-level data or summary statistics. Nat Commun, 12 (1) 4192. doi:10.1038/s41467-021-24485-y. PMID 34234142
ABSTRACT
Most existing tools for constructing genetic prediction models begin with the assumption that all genetic variants contribute equally towards the phenotype. However, this represents a suboptimal model for how heritability is distributed across the genome. Therefore, we develop prediction tools that allow the user to specify the heritability model. We compare individual-level data prediction tools using 14 UK Biobank phenotypes; our new tool LDAK-Bolt-Predict outperforms the existing tools Lasso, BLUP, Bolt-LMM and BayesR for all 14 phenotypes. We compare summary statistic prediction tools using 225 UK Biobank phenotypes; our new tool LDAK-BayesR-SS outperforms the existing tools lassosum, sBLUP, LDpred and SBayesR for 223 of the 225 phenotypes. When we improve the heritability model, the proportion of phenotypic variance explained increases by on average 14%, which is equivalent to increasing the sample size by a quarter.
DOI
10.1038/s41467-021-24485-y
MENTR
PUBMED_LINK
FULL NAME
mutation effect prediction on ncRNA transcription
DESCRIPTION
A machine-learning model (MENTR) that reliably links genome sequence and ncRNA expression at the cell type level
URL
TITLE
Prediction of the cell-type-specific transcription of non-coding RNAs from genome sequences via machine learning.
Main citation
Koido M, Hon CC, Koyama S, Kawaji H, ...&, Terao C. (2023) Prediction of the cell-type-specific transcription of non-coding RNAs from genome sequences via machine learning. Nat Biomed Eng, 7 (6) 830-844. doi:10.1038/s41551-022-00961-8. PMID 36411359
ABSTRACT
Gene transcription is regulated through complex mechanisms involving non-coding RNAs (ncRNAs). As the transcription of ncRNAs, especially of enhancer RNAs, is often low and cell type specific, how the levels of RNA transcription depend on genotype remains largely unexplored. Here we report the development and utility of a machine-learning model (MENTR) that reliably links genome sequence and ncRNA expression at the cell type level. Effects on ncRNA transcription predicted by the model were concordant with estimates from published studies in a cell-type-dependent manner, regardless of allele frequency and genetic linkage. Among 41,223 variants from genome-wide association studies, the model identified 7,775 enhancer RNAs and 3,548 long ncRNAs causally associated with complex traits across 348 major human primary cells and tissues, such as rare variants plausibly altering the transcription of enhancer RNAs to influence the risks of Crohn's disease and asthma. The model may aid the discovery of causal variants and the generation of testable hypotheses for biological mechanisms driving complex traits.
DOI
10.1038/s41551-022-00961-8
MESuSiE
PUBMED_LINK
FULL NAME
multi-ancestry sum of the single effects model
DESCRIPTION
MESuSiE relies on GWAS summary statistics from multiple ancestries, properly accounts for the LD structure of the local genomic region in multiple ancestries, and explicitly models both shared and ancestry-specific causal signals to accommodate causal effect size similarity as well as heterogeneity across ancestries. MESuSiE outputs the posterior inclusion probability of each variant being a shared or an ancestry-specific causal variant.
URL
KEYWORDS
multi-trait, fine-mapping
TITLE
MESuSiE enables scalable and powerful multi-ancestry fine-mapping of causal variants in genome-wide association studies.
Main citation
Gao B, Zhou X. (2024) MESuSiE enables scalable and powerful multi-ancestry fine-mapping of causal variants in genome-wide association studies. Nat Genet, 56 (1) 170-179. doi:10.1038/s41588-023-01604-7. PMID 38168930
ABSTRACT
Fine-mapping in genome-wide association studies attempts to identify causal SNPs from a set of candidate SNPs in a local genomic region of interest and is commonly performed in one genetic ancestry at a time. Here, we present multi-ancestry sum of the single effects model (MESuSiE), a probabilistic multi-ancestry fine-mapping method, to improve the accuracy and resolution of fine-mapping by leveraging association information across ancestries. MESuSiE uses summary statistics as input, accounts for the diverse linkage disequilibrium pattern observed in different ancestries, explicitly models both shared and ancestry-specific causal SNPs, and relies on a variational inference algorithm for scalable computation. We evaluated the performance of MESuSiE through comprehensive simulations and multi-ancestry fine-mapping of four lipid traits with both European and African samples. In the real data, MESuSiE improves fine-mapping resolution by 19.0% to 72.0% compared to existing approaches, is an order of magnitude faster, and captures and categorizes shared and ancestry-specific causal signals with enhanced functional enrichment.
DOI
10.1038/s41588-023-01604-7
meta-PRS
PUBMED_LINK
FULL NAME
linear combination of PRSs
URL
TITLE
Leveraging both individual-level genetic data and GWAS summary statistics increases polygenic prediction.
Main citation
Albiñana C, Grove J, McGrath JJ, Agerbo E, ...&, Vilhjálmsson BJ. (2021) Leveraging both individual-level genetic data and GWAS summary statistics increases polygenic prediction. Am J Hum Genet, 108 (6) 1001-1011. doi:10.1016/j.ajhg.2021.04.014. PMID 33964208
ABSTRACT
The accuracy of polygenic risk scores (PRSs) to predict complex diseases increases with the training sample size. PRSs are generally derived based on summary statistics from large meta-analyses of multiple genome-wide association studies (GWASs). However, it is now common for researchers to have access to large individual-level data as well, such as the UK Biobank data. To the best of our knowledge, it has not yet been explored how best to combine both types of data (summary statistics and individual-level data) to optimize polygenic prediction. The most widely used approach to combine data is the meta-analysis of GWAS summary statistics (meta-GWAS), but we show that it does not always provide the most accurate PRS. Through simulations and using 12 real case-control and quantitative traits from both iPSYCH and UK Biobank along with external GWAS summary statistics, we compare meta-GWAS with two alternative data-combining approaches, stacked clumping and thresholding (SCT) and meta-PRS. We find that, when large individual-level data are available, the linear combination of PRSs (meta-PRS) is both a simple alternative to meta-GWAS and often more accurate.
DOI
10.1016/j.ajhg.2021.04.014
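A linear combination of two PRSs can be sketched as follows. The entry only states that meta-PRS is a linear combination of PRSs; fitting the weights by ordinary least squares in a sample with individual-level data is an assumption made for this illustration, and the function names are hypothetical.

```python
def fit_meta_prs_weights(prs_a, prs_b, phenotype):
    """Least-squares weights for two scores (assumed fitting rule).

    Solves the 2x2 normal equations on mean-centered data, so an
    intercept is fitted implicitly.
    """
    n = len(phenotype)

    def centered(values):
        mean = sum(values) / n
        return [value - mean for value in values]

    a, b, y = centered(prs_a), centered(prs_b), centered(phenotype)
    s_aa = sum(v * v for v in a)
    s_bb = sum(v * v for v in b)
    s_ab = sum(u * v for u, v in zip(a, b))
    s_ay = sum(u * v for u, v in zip(a, y))
    s_by = sum(u * v for u, v in zip(b, y))
    det = s_aa * s_bb - s_ab * s_ab  # zero if the two scores are collinear
    w_a = (s_bb * s_ay - s_ab * s_by) / det
    w_b = (s_aa * s_by - s_ab * s_ay) / det
    return w_a, w_b


def meta_prs(prs_a, prs_b, w_a, w_b):
    """Combine two scores per individual with the given weights."""
    return [w_a * x + w_b * z for x, z in zip(prs_a, prs_b)]


if __name__ == "__main__":
    external = [0.0, 1.0, 2.0, 3.0]   # e.g. PRS from external summary statistics
    internal = [1.0, 0.0, 1.0, 0.0]   # e.g. PRS trained on individual-level data
    trait = [0.5, 2.0, 4.5, 6.0]
    weights = fit_meta_prs_weights(external, internal, trait)
    print(weights, meta_prs(external, internal, *weights))
```

In practice the weights would be estimated in one set of individuals and the combined score evaluated in another, to avoid overfitting.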
Meta-SAIGE
PUBMED_LINK
DESCRIPTION
Meta-SAIGE performs scalable cohort-level rare-variant meta-analysis from study-level outputs, emphasizing accurate null calibration (including low-prevalence binary traits), computational efficiency via reuse of LD structure across phenotypes, and power close to pooled individual-level analysis with SAIGE-GENE+.
URL
KEYWORDS
rare variant, meta-analysis, SAIGE, summary statistics, type I error
TITLE
Scalable and accurate rare variant meta-analysis with Meta-SAIGE.
Main citation
Park E, Nam K, Jeong S, Keat K, ...&, Lee S. (2025) Scalable and accurate rare variant meta-analysis with Meta-SAIGE. Nat Genet, 57 (12) 3185-3192. doi:10.1038/s41588-025-02403-y. PMID 41266648
ABSTRACT
Meta-analysis enhances the power of rare variant association tests by combining summary statistics across several cohorts. However, existing methods often fail to control type I error for low-prevalence binary traits and are computationally intensive. Here we introduce Meta-SAIGE, a scalable method for rare variant meta-analysis that accurately estimates the null distribution to control type I error and reuses the linkage disequilibrium matrix across phenotypes to boost computational efficiency in phenome-wide analyses. Simulations using UK Biobank whole-exome sequencing data show that Meta-SAIGE effectively controls type I error and achieves power comparable to pooled individual-level analysis with SAIGE-GENE+. Applying Meta-SAIGE to 83 low-prevalence phenotypes in UK Biobank and All of Us whole-exome sequencing data identified 237 gene-trait associations. Notably, 80 of these associations were not significant in either dataset alone, underscoring the power of our meta-analysis.
DOI
10.1038/s41588-025-02403-y
metabolites PRS atlas
PUBMED_LINK
DESCRIPTION
This web application can be used to query findings from a systematic analysis of 129 polygenic risk scores and 249 circulating metabolites using high-throughput nuclear magnetic resonance data from the UK Biobank study. We encourage users of this resource to conduct follow-up analyses of associations to investigate potential causal and non-causal metabolic biomarkers. Age-stratified results can be used to investigate how potential sources of collider bias (e.g. statin therapy) may influence findings in the full sample.
URL
TITLE
Constructing an atlas of associations between polygenic scores from across the human phenome and circulating metabolic biomarkers.
Main citation
Fang S, Holmes MV, Gaunt TR, Davey Smith G, ...&, Richardson TG. (2022) Constructing an atlas of associations between polygenic scores from across the human phenome and circulating metabolic biomarkers. Elife, 11. doi:10.7554/eLife.73951. PMID 36219204
ABSTRACT
BACKGROUND: Polygenic scores (PGS) are becoming an increasingly popular approach to predict complex disease risk, although they also hold the potential to develop insight into the molecular profiles of patients with an elevated genetic predisposition to disease. METHODS: We sought to construct an atlas of associations between 125 different PGS derived using results from genome-wide association studies and 249 circulating metabolites in up to 83,004 participants from the UK Biobank. RESULTS: As an exemplar to demonstrate the value of this atlas, we conducted a hypothesis-free evaluation of all associations with glycoprotein acetyls (GlycA), an inflammatory biomarker. Using bidirectional Mendelian randomization, we find that the associations highlighted likely reflect the effect of risk factors, such as adiposity or liability towards smoking, on systemic inflammation as opposed to the converse direction. Moreover, we repeated all analyses in our atlas within age strata to investigate potential sources of collider bias, such as medication usage. This was exemplified by comparing associations between lipoprotein lipid profiles and the coronary artery disease PGS in the youngest and oldest age strata, which had differing proportions of individuals undergoing statin therapy. Lastly, we generated all PGS-metabolite associations stratified by sex and separately after excluding 13 established lipid-associated loci to further evaluate the robustness of findings. CONCLUSIONS: We envisage that the atlas of results constructed in our study will motivate future hypothesis generation and help prioritize and deprioritize circulating metabolic traits for in-depth investigations. All results can be visualized and downloaded at http://mrcieu.mrsoftware.org/metabolites_PRS_atlas. FUNDING: This work is supported by funding from the Wellcome Trust, the British Heart Foundation, and the Medical Research Council Integrative Epidemiology Unit.
DOI
10.7554/eLife.73951
metaCCA
PUBMED_LINK
FULL NAME
meta canonical correlation analysis
DESCRIPTION
metaCCA performs multivariate analysis of a single or multiple GWAS based on univariate regression coefficients. It allows multivariate representation of both phenotype and genotype. metaCCA extends the statistical technique of canonical correlation analysis to the setting where original individual-level records are not available, and employs a covariance shrinkage algorithm to achieve robustness.
URL
TITLE
metaCCA: summary statistics-based multivariate meta-analysis of genome-wide association studies using canonical correlation analysis.
Main citation
Cichonska A, Rousu J, Marttinen P, Kangas AJ, ...&, Pirinen M. (2016) metaCCA: summary statistics-based multivariate meta-analysis of genome-wide association studies using canonical correlation analysis. Bioinformatics, 32 (13) 1981-9. doi:10.1093/bioinformatics/btw052. PMID 27153689
ABSTRACT
MOTIVATION: A dominant approach to genetic association studies is to perform univariate tests between genotype-phenotype pairs. However, analyzing related traits together increases statistical power, and certain complex associations become detectable only when several variants are tested jointly. Currently, modest sample sizes of individual cohorts, and restricted availability of individual-level genotype-phenotype data across the cohorts limit conducting multivariate tests. RESULTS: We introduce metaCCA, a computational framework for summary statistics-based analysis of a single or multiple studies that allows multivariate representation of both genotype and phenotype. It extends the statistical technique of canonical correlation analysis to the setting where original individual-level records are not available, and employs a covariance shrinkage algorithm to achieve robustness. Multivariate meta-analysis of two Finnish studies of nuclear magnetic resonance metabolomics by metaCCA, using standard univariate output from the program SNPTEST, shows an excellent agreement with the pooled individual-level analysis of original data. Motivated by strong multivariate signals in the lipid genes tested, we envision that multivariate association testing using metaCCA has a great potential to provide novel insights from already published summary statistics from high-throughput phenotyping technologies. AVAILABILITY AND IMPLEMENTATION: Code is available at https://github.com/aalto-ics-kepaco CONTACTS: anna.cichonska@helsinki.fi or matti.pirinen@helsinki.fi SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
DOI
10.1093/bioinformatics/btw052
metafor
METAL
PUBMED_LINK
DESCRIPTION
METAL is a tool for meta-analysis of genomewide association scans. METAL can combine either (a) test statistics and standard errors or (b) p-values across studies (taking sample size and direction of effect into account). METAL analysis is a convenient alternative to a direct analysis of merged data from multiple studies. It is especially appropriate when data from the individual studies cannot be analyzed together because of differences in ethnicity, phenotype distribution, or gender, or because of constraints imposed on the sharing of individual-level data. Meta-analysis results in little or no loss of efficiency compared to analysis of a combined dataset including data from all individual studies.
URL
TITLE
METAL: fast and efficient meta-analysis of genomewide association scans.
Main citation
Willer CJ, Li Y, Abecasis GR. (2010) METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics, 26 (17) 2190-1. doi:10.1093/bioinformatics/btq340. PMID 20616382
ABSTRACT
SUMMARY: METAL provides a computationally efficient tool for meta-analysis of genome-wide association scans, which is a commonly used approach for improving power complex traits gene mapping studies. METAL provides a rich scripting interface and implements efficient memory management to allow analyses of very large data sets and to support a variety of input file formats. AVAILABILITY AND IMPLEMENTATION: METAL, including source code, documentation, examples, and executables, is available at http://www.sph.umich.edu/csg/abecasis/metal/.
DOI
10.1093/bioinformatics/btq340
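The two combination schemes named in the description can be sketched with the standard formulas: inverse-variance weighting of effect estimates, and a sample-size-weighted sum of signed Z-scores recovered from p-values. This is a minimal illustration and not METAL's implementation or scripting interface; the function names and input layout are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def inverse_variance_meta(betas, standard_errors):
    """Scheme (a): weight each study's effect by 1 / SE^2."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    return beta, se, beta / se


def sample_size_meta(p_values, directions, sample_sizes):
    """Scheme (b): combine signed Z-scores with weights sqrt(N).

    directions holds +1 or -1 per study, relative to the same allele.
    """
    z_scores = [
        d * _STD_NORMAL.inv_cdf(1.0 - p / 2.0)
        for p, d in zip(p_values, directions)
    ]
    weights = [math.sqrt(n) for n in sample_sizes]
    z = sum(w * zi for w, zi in zip(weights, z_scores))
    z /= math.sqrt(sum(w * w for w in weights))
    p = 2.0 * (1.0 - _STD_NORMAL.cdf(abs(z)))
    return z, p


if __name__ == "__main__":
    print(inverse_variance_meta([0.1, 0.3], [0.1, 0.1]))
    print(sample_size_meta([0.05, 0.05], [1, 1], [100, 100]))
```

Scheme (b) needs only p-values, effect directions, and sample sizes, which is useful when studies report effects on different scales; scheme (a) additionally yields a pooled effect estimate.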
MetaSKAT
PUBMED_LINK
DESCRIPTION
MetaSKAT is an R package for multiple marker meta-analysis. It can carry out meta-analysis of SKAT, SKAT-O and burden tests with individual level genotype data or gene level summary statistics.
URL
TITLE
General framework for meta-analysis of rare variants in sequencing association studies.
Main citation
Lee S, Teslovich TM, Boehnke M, Lin X. (2013) General framework for meta-analysis of rare variants in sequencing association studies. Am J Hum Genet, 93 (1) 42-53. doi:10.1016/j.ajhg.2013.05.010. PMID 23768515
ABSTRACT
We propose a general statistical framework for meta-analysis of gene- or region-based multimarker rare variant association tests in sequencing association studies. In genome-wide association studies, single-marker meta-analysis has been widely used to increase statistical power by combining results via regression coefficients and standard errors from different studies. In analysis of rare variants in sequencing studies, region-based multimarker tests are often used to increase power. We propose meta-analysis methods for commonly used gene- or region-based rare variants tests, such as burden tests and variance component tests. Because estimation of regression coefficients of individual rare variants is often unstable or not feasible, the proposed method avoids this difficulty by calculating score statistics instead that only require fitting the null model for each study and then aggregating these score statistics across studies. Our proposed meta-analysis rare variant association tests are conducted based on study-specific summary statistics, specifically score statistics for each variant and between-variant covariance-type (linkage disequilibrium) relationship statistics for each gene or region. The proposed methods are able to incorporate different levels of heterogeneity of genetic effects across studies and are applicable to meta-analysis of multiple ancestry groups. We show that the proposed methods are essentially as powerful as joint analysis by directly pooling individual level genotype data. We conduct extensive simulations to evaluate the performance of our methods by varying levels of heterogeneity across studies, and we apply the proposed methods to meta-analysis of rare variant effects in a multicohort study of the genetics of blood lipid levels.
DOI
10.1016/j.ajhg.2013.05.010
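The score-statistic aggregation described in the Lee et al. (2013) abstract above can be illustrated with a minimal sketch. It assumes homogeneous genetic effects across studies and a simple burden test; the function name `meta_burden`, the toy inputs, and the flat weights are hypothetical and are not taken from the authors' software.

```python
from statistics import NormalDist


def meta_burden(scores, covs, weights):
    """Combine per-study score statistics into one burden test.

    scores  : list over studies of per-variant score statistics U_k
    covs    : list over studies of between-variant covariance matrices V_k
    weights : per-variant weights w (hypothetical choice, e.g. all ones)
    Returns (Q, p), where Q = (w'U)^2 / (w'Vw) is referred to chi-square(1).
    """
    m = len(weights)
    # Sum the score vectors and covariance matrices across studies.
    U = [sum(s[j] for s in scores) for j in range(m)]
    V = [[sum(c[i][j] for c in covs) for j in range(m)] for i in range(m)]
    num = sum(weights[j] * U[j] for j in range(m)) ** 2
    den = sum(weights[i] * V[i][j] * weights[j]
              for i in range(m) for j in range(m))
    q = num / den
    # For 1 degree of freedom, P(chi2 > q) = 2 * (1 - Phi(sqrt(q))).
    p = 2.0 * (1.0 - NormalDist().cdf(q ** 0.5))
    return q, p


# Toy inputs (made up for illustration): two studies, three variants.
scores = [[1.0, 0.5, -0.2], [0.8, 0.1, 0.4]]
covs = [
    [[1.0, 0.1, 0.0], [0.1, 1.0, 0.2], [0.0, 0.2, 1.0]],
    [[0.9, 0.0, 0.1], [0.0, 1.1, 0.0], [0.1, 0.0, 0.8]],
]
q, p = meta_burden(scores, covs, [1.0, 1.0, 1.0])
print(f"Q = {q:.3f}, p = {p:.3f}")
```

Only the null model needs to be fitted in each study to obtain the inputs, which is why no per-variant regression coefficients appear in the sketch.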
MetaSTAAR
PUBMED_LINK
DESCRIPTION
MetaSTAAR is an R package for performing the Meta-analysis of variant-Set Test for Association using Annotation infoRmation (MetaSTAAR) procedure in whole-genome sequencing (WGS) studies. MetaSTAAR enables functionally informed rare variant meta-analysis of large WGS studies using an efficient sparse-matrix approach for storing summary statistics, while protecting the privacy of study participants by avoiding the sharing of subject-level data. MetaSTAAR accounts for relatedness and population structure for continuous and dichotomous traits, and boosts the power of rare variant meta-analysis by incorporating multiple variant functional annotations.
URL
TITLE
Powerful, scalable and resource-efficient meta-analysis of rare variant associations in large whole genome sequencing studies.
Main citation
Li X, Quick C, Zhou H, Gaynor SM, ...&, Lin X. (2023) Powerful, scalable and resource-efficient meta-analysis of rare variant associations in large whole genome sequencing studies. Nat Genet, 55 (1) 154-164. doi:10.1038/s41588-022-01225-6. PMID 36564505
ABSTRACT
Meta-analysis of whole genome sequencing/whole exome sequencing (WGS/WES) studies provides an attractive solution to the problem of collecting large sample sizes for discovering rare variants associated with complex phenotypes. Existing rare variant meta-analysis approaches are not scalable to biobank-scale WGS data. Here we present MetaSTAAR, a powerful and resource-efficient rare variant meta-analysis framework for large-scale WGS/WES studies. MetaSTAAR accounts for relatedness and population structure, can analyze both quantitative and dichotomous traits and boosts the power of rare variant tests by incorporating multiple variant functional annotations. Through meta-analysis of four lipid traits in 30,138 ancestrally diverse samples from 14 studies of the Trans Omics for Precision Medicine (TOPMed) Program, we show that MetaSTAAR performs rare variant meta-analysis at scale and produces results comparable to using pooled data. Additionally, we identified several conditionally significant rare variant associations with lipid traits. We further demonstrate that MetaSTAAR is scalable to biobank-scale cohorts through meta-analysis of TOPMed WGS data and UK Biobank WES data of ~200,000 samples.
DOI
10.1038/s41588-022-01225-6
metaUSAT/metaMANOVA
PUBMED_LINK
FULL NAME
unified score-based association test
DESCRIPTION
metaUSAT is a data-adaptive statistical approach for testing genetic associations of multiple traits from single/multiple studies using univariate GWAS summary statistics. This multivariate meta-analysis method can appropriately account for overlapping samples (if any) and can potentially test binary and/or continuous traits.
URL
TITLE
Methods for meta-analysis of multiple traits using GWAS summary statistics.
Main citation
Ray D, Boehnke M. (2018) Methods for meta-analysis of multiple traits using GWAS summary statistics. Genet Epidemiol, 42 (2) 134-145. doi:10.1002/gepi.22105. PMID 29226385
ABSTRACT
Genome-wide association studies (GWAS) for complex diseases have focused primarily on single-trait analyses for disease status and disease-related quantitative traits. For example, GWAS on risk factors for coronary artery disease analyze genetic associations of plasma lipids such as total cholesterol, LDL-cholesterol, HDL-cholesterol, and triglycerides (TGs) separately. However, traits are often correlated and a joint analysis may yield increased statistical power for association over multiple univariate analyses. Recently several multivariate methods have been proposed that require individual-level data. Here, we develop metaUSAT (where USAT is unified score-based association test), a novel unified association test of a single genetic variant with multiple traits that uses only summary statistics from existing GWAS. Although the existing methods either perform well when most correlated traits are affected by the genetic variant in the same direction or are powerful when only a few of the correlated traits are associated, metaUSAT is designed to be robust to the association structure of correlated traits. metaUSAT does not require individual-level data and can test genetic associations of categorical and/or continuous traits. One can also use metaUSAT to analyze a single trait over multiple studies, appropriately accounting for overlapping samples, if any. metaUSAT provides an approximate asymptotic P-value for association and is computationally efficient for implementation at a genome-wide level. Simulation experiments show that metaUSAT maintains proper type-I error at low error levels. It has similar and sometimes greater power to detect association across a wide array of scenarios compared to existing methods, which are usually powerful for some specific association scenarios only. 
When applied to plasma lipids summary data from the METSIM and the T2D-GENES studies, metaUSAT detected genome-wide significant loci beyond the ones identified by univariate analyses. Evidence from larger studies suggest that the variants additionally detected by our test are, indeed, associated with lipid levels in humans. In summary, metaUSAT can provide novel insights into the genetic architecture of a common disease or traits.
DOI
10.1002/gepi.22105
MetaXcan
PUBMED_LINK
DESCRIPTION
MetaXcan is a set of tools to integrate genomic information of biological mechanisms with complex traits.
URL
TITLE
Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics.
Main citation
Barbeira AN, Dickinson SP, Bonazzola R, Zheng J, ...&, Im HK. (2018) Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nat Commun, 9 (1) 1825. doi:10.1038/s41467-018-03621-1. PMID 29739930
ABSTRACT
Scalable, integrative methods to understand mechanisms that link genetic variants with phenotypes are needed. Here we derive a mathematical expression to compute PrediXcan (a gene mapping approach) results using summary data (S-PrediXcan) and show its accuracy and general robustness to misspecified reference sets. We apply this framework to 44 GTEx tissues and 100+ phenotypes from GWAS and meta-analysis studies, creating a growing public catalog of associations that seeks to capture the effects of gene expression variation on human phenotypes. Replication in an independent cohort is shown. Most of the associations are tissue specific, suggesting context specificity of the trait etiology. Colocalized significant associations in unexpected tissues underscore the need for an agnostic scanning of multiple contexts to improve our ability to detect causal regulatory mechanisms. Monogenic disease genes are enriched among significant associations for related traits, suggesting that smaller alterations of these genes may cause a spectrum of milder phenotypes.
DOI
10.1038/s41467-018-03621-1
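The summary-data expression mentioned in the MetaXcan abstract (S-PrediXcan) is commonly written as Z_g ≈ Σ_l w_lg (σ_l / σ_g) (β_l / se_l), where w_lg are expression prediction weights, σ_l is the reference-panel standard deviation of SNP l, and σ_g² = w' Γ w is the variance of predicted expression. The sketch below evaluates that expression in plain Python. The function name `spredixcan_z` and the toy numbers are hypothetical, and this is not the MetaXcan code base.

```python
import math


def spredixcan_z(weights, gwas_z, snp_cov):
    """Approximate gene-level z-score from GWAS summary statistics.

    weights : prediction weights w_l for the SNPs in the gene model
    gwas_z  : GWAS z-scores (beta_l / se_l) for the same SNPs
    snp_cov : reference-panel genotype covariance matrix (Gamma)
    """
    m = len(weights)
    # Variance of predicted expression: sigma_g^2 = w' Gamma w.
    var_g = sum(weights[i] * snp_cov[i][j] * weights[j]
                for i in range(m) for j in range(m))
    sigma_g = math.sqrt(var_g)
    # Weighted sum of SNP z-scores, each scaled by sigma_l / sigma_g.
    return sum(weights[l] * math.sqrt(snp_cov[l][l]) / sigma_g * gwas_z[l]
               for l in range(m))


# Toy inputs (made up): a two-SNP expression model.
z_gene = spredixcan_z([0.3, -0.2], [2.0, -1.5], [[0.5, 0.1], [0.1, 0.4]])
print(f"gene-level z = {z_gene:.3f}")
```

With a single-SNP model the gene-level z-score reduces to the SNP's own z-score (up to the sign of the weight), which is a quick sanity check on any implementation.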
Michigan Imputation Server (Michigan)
PUBMED_LINK
URL
TITLE
Next-generation genotype imputation service and methods.
Main citation
Das S, Forer L, Schönherr S, Sidore C, ...&, Fuchsberger C. (2016) Next-generation genotype imputation service and methods. Nat Genet, 48 (10) 1284-1287. doi:10.1038/ng.3656. PMID 27571263
ABSTRACT
Genotype imputation is a key component of genetic association studies, where it increases power, facilitates meta-analysis, and aids interpretation of signals. Genotype imputation is computationally demanding and, with current tools, typically requires access to a high-performance computing cluster and to a reference panel of sequenced genomes. Here we describe improvements to imputation machinery that reduce computational requirements by more than an order of magnitude with no loss of accuracy in comparison to standard imputation tools. We also describe a new web-based service for imputation that facilitates access to new reference panels and greatly improves user experience and productivity.
DOI
10.1038/ng.3656
MiXeR
PUBMED_LINK
FULL NAME
MiXeR (univariate)
DESCRIPTION
Causal Mixture Model for GWAS summary statistics
URL
TITLE
Beyond SNP heritability: Polygenicity and discoverability of phenotypes estimated with a univariate Gaussian mixture model.
Main citation
Holland D, Frei O, Desikan R, Fan CC, ...&, Dale AM. (2020) Beyond SNP heritability: Polygenicity and discoverability of phenotypes estimated with a univariate Gaussian mixture model. PLoS Genet, 16 (5) e1008612. doi:10.1371/journal.pgen.1008612. PMID 32427991
ABSTRACT
Estimating the polygenicity (proportion of causally associated single nucleotide polymorphisms (SNPs)) and discoverability (effect size variance) of causal SNPs for human traits is currently of considerable interest. SNP-heritability is proportional to the product of these quantities. We present a basic model, using detailed linkage disequilibrium structure from a reference panel of 11 million SNPs, to estimate these quantities from genome-wide association studies (GWAS) summary statistics. We apply the model to diverse phenotypes and validate the implementation with simulations. We find model polygenicities (as a fraction of the reference panel) ranging from ≃ 2 × 10⁻⁵ to ≃ 4 × 10⁻³, with discoverabilities similarly ranging over two orders of magnitude. A power analysis allows us to estimate the proportions of phenotypic variance explained additively by causal SNPs reaching genome-wide significance at current sample sizes, and map out sample sizes required to explain larger portions of additive SNP heritability. The model also allows for estimating residual inflation (or deflation from over-correcting of z-scores), and assessing compatibility of replication and discovery GWAS summary statistics.
DOI
10.1371/journal.pgen.1008612
mJAM
FULL NAME
multi-population JAM
URL
KEYWORDS
multi-population
Main citation
Shen, J., Jiang, L., Wang, K., Wang, A., Chen, F., Newcombe, P. J., ... & Conti, D. V. (2022). Fine-mapping and credible set construction using a multi-population joint analysis of marginal summary statistics from genome-wide association studies. bioRxiv, 2022-12.
MOSTest
PUBMED_LINK
FULL NAME
Multivariate Omnibus Statistical Test
DESCRIPTION
MOSTest is a tool for joint genetic analysis of multiple traits, using multivariate analysis to boost the power to discover associated loci.
URL
TITLE
Understanding the genetic determinants of the brain with MOSTest.
Main citation
van der Meer D, Frei O, Kaufmann T, Shadrin AA, ...&, Dale AM. (2020) Understanding the genetic determinants of the brain with MOSTest. Nat Commun, 11 (1) 3512. doi:10.1038/s41467-020-17368-1. PMID 32665545
ABSTRACT
Regional brain morphology has a complex genetic architecture, consisting of many common polymorphisms with small individual effects. This has proven challenging for genome-wide association studies (GWAS). Due to the distributed nature of genetic signal across brain regions, multivariate analysis of regional measures may enhance discovery of genetic variants. Current multivariate approaches to GWAS are ill-suited for complex, large-scale data of this kind. Here, we introduce the Multivariate Omnibus Statistical Test (MOSTest), with an efficient computational design enabling rapid and reliable inference, and apply it to 171 regional brain morphology measures from 26,502 UK Biobank participants. At the conventional genome-wide significance threshold of α = 5 × 10⁻⁸, MOSTest identifies 347 genomic loci associated with regional brain morphology, more than any previous study, improving upon the discovery of established GWAS approaches more than threefold. Our findings implicate more than 5% of all protein-coding genes and provide evidence for gene sets involved in neuron development and differentiation.
DOI
10.1038/s41467-020-17368-1
MR-MEGA
PUBMED_LINK
FULL NAME
Meta-Regression of Multi-AncEstry Genetic Association
DESCRIPTION
MR-MEGA (Meta-Regression of Multi-AncEstry Genetic Association) is a tool to detect and fine-map complex trait association signals via multi-ancestry meta-regression. This approach uses genome-wide metrics of diversity between populations to derive axes of genetic variation via multi-dimensional scaling [Purcell 2007]. Allelic effects of a variant across GWAS, weighted by their corresponding standard errors, can then be modelled in a linear regression framework, including the axes of genetic variation as covariates. The flexibility of this model enables partitioning of the heterogeneity into components due to ancestry and residual variation, which would be expected to improve fine-mapping resolution.
URL
KEYWORDS
Multi-AncEstry, cross-population, Meta-Regression
TITLE
Trans-ethnic meta-regression of genome-wide association studies accounting for ancestry increases power for discovery and improves fine-mapping resolution.
Main citation
Mägi R, Horikoshi M, Sofer T, Mahajan A, ...&, Morris AP. (2017) Trans-ethnic meta-regression of genome-wide association studies accounting for ancestry increases power for discovery and improves fine-mapping resolution. Hum Mol Genet, 26 (18) 3639-3650. doi:10.1093/hmg/ddx280. PMID 28911207
ABSTRACT
Trans-ethnic meta-analysis of genome-wide association studies (GWAS) across diverse populations can increase power to detect complex trait loci when the underlying causal variants are shared between ancestry groups. However, heterogeneity in allelic effects between GWAS at these loci can occur that is correlated with ancestry. Here, a novel approach is presented to detect SNP association and quantify the extent of heterogeneity in allelic effects that is correlated with ancestry. We employ trans-ethnic meta-regression to model allelic effects as a function of axes of genetic variation, derived from a matrix of mean pairwise allele frequency differences between GWAS, and implemented in the MR-MEGA software. Through detailed simulations, we demonstrate increased power to detect association for MR-MEGA over fixed- and random-effects meta-analysis across a range of scenarios of heterogeneity in allelic effects between ethnic groups. We also demonstrate improved fine-mapping resolution, in loci containing a single causal variant, compared to these meta-analysis approaches and PAINTOR, and equivalent performance to MANTRA at reduced computational cost. Application of MR-MEGA to trans-ethnic GWAS of kidney function in 71,461 individuals indicates stronger signals of association than fixed-effects meta-analysis when heterogeneity in allelic effects is correlated with ancestry. Application of MR-MEGA to fine-mapping four type 2 diabetes susceptibility loci in 22,086 cases and 42,539 controls highlights: (i) strong evidence for heterogeneity in allelic effects that is correlated with ancestry only at the index SNP for the association signal at the CDKAL1 locus; and (ii) 99% credible sets with six or fewer variants for five distinct association signals.
DOI
10.1093/hmg/ddx280
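The meta-regression idea in the MR-MEGA description (allelic effects regressed on axes of genetic variation, weighted by their precision, with heterogeneity split into an ancestry-correlated and a residual part) can be sketched for a single variant and a single axis. This is an illustrative simplification: the function name `meta_regress`, the inverse-variance weights, the single pre-computed axis, and the toy numbers are assumptions, and the sketch does not reproduce the MR-MEGA software, which derives several axes by multi-dimensional scaling.

```python
def meta_regress(effects, ses, axis):
    """Weighted regression of per-study effects on one ancestry axis.

    Returns (intercept, slope, q_ancestry, q_residual), where
    q_ancestry + q_residual equals Cochran's Q of the fixed-effects model.
    """
    w = [1.0 / s ** 2 for s in ses]                 # inverse-variance weights
    sw = sum(w)
    swx = sum(wi * x for wi, x in zip(w, axis))
    swy = sum(wi * b for wi, b in zip(w, effects))
    swxx = sum(wi * x * x for wi, x in zip(w, axis))
    swxy = sum(wi * x * b for wi, x, b in zip(w, axis, effects))
    # Closed-form weighted least squares for one covariate plus intercept.
    slope = (sw * swxy - swx * swy) / (sw * swxx - swx ** 2)
    intercept = (swy - slope * swx) / sw
    # Residual heterogeneity left after accounting for the axis.
    q_res = sum(wi * (b - intercept - slope * x) ** 2
                for wi, b, x in zip(w, effects, axis))
    # Total heterogeneity around the fixed-effects estimate (Cochran's Q).
    b_fe = swy / sw
    q_tot = sum(wi * (b - b_fe) ** 2 for wi, b in zip(w, effects))
    return intercept, slope, q_tot - q_res, q_res


# Toy inputs (made up): four GWAS, two per ancestry cluster.
intercept, slope, q_anc, q_res = meta_regress(
    effects=[0.10, 0.12, 0.30, 0.28],
    ses=[0.02, 0.02, 0.03, 0.03],
    axis=[-1.0, -1.0, 1.0, 1.0],
)
print(f"intercept={intercept:.3f} slope={slope:.3f} "
      f"Q_ancestry={q_anc:.2f} Q_residual={q_res:.2f}")
```

In this toy example most of the heterogeneity is absorbed by the axis, which is the situation in which a meta-regression is expected to gain power over a fixed-effects analysis.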
mRnd
PUBMED_LINK
FULL NAME
Power calculations for Mendelian Randomization
URL
TITLE
Calculating statistical power in Mendelian randomization studies.
Main citation
Brion MJ, Shakhbazov K, Visscher PM. (2013) Calculating statistical power in Mendelian randomization studies. Int J Epidemiol, 42 (5) 1497-501. doi:10.1093/ije/dyt179. PMID 24159078
ABSTRACT
In Mendelian randomization (MR) studies, where genetic variants are used as proxy measures for an exposure trait of interest, obtaining adequate statistical power is frequently a concern due to the small amount of variation in a phenotypic trait that is typically explained by genetic variants. A range of power estimates based on simulations and specific parameters for two-stage least squares (2SLS) MR analyses based on continuous variables has previously been published. However there are presently no specific equations or software tools one can implement for calculating power of a given MR study. Using asymptotic theory, we show that in the case of continuous variables and a single instrument, for example a single-nucleotide polymorphism (SNP) or multiple SNP predictor, statistical power for a fixed sample size is a function of two parameters: the proportion of variation in the exposure variable explained by the genetic predictor and the true causal association between the exposure and outcome variable. We demonstrate that power for 2SLS MR can be derived using the non-centrality parameter (NCP) of the statistical test that is employed to test whether the 2SLS regression coefficient is zero. We show that the previously published power estimates from simulations can be represented theoretically using this NCP-based approach, with similar estimates observed when the simulation-based estimates are compared with our NCP-based approach. General equations for calculating statistical power for 2SLS MR using the NCP are provided in this note, and we implement the calculations in a web-based application.
DOI
10.1093/ije/dyt179
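The NCP-based power calculation described in the mRnd abstract can be sketched as follows for a continuous exposure and outcome with a single instrument. The sketch uses the standard asymptotic variance of the 2SLS estimate, var(β̂) ≈ σ²_u / (N · R² · σ²_x), and a 1-degree-of-freedom test. The function name `mr_power`, the parameter names, and the default variances of 1 are assumptions; the published mRnd equations may parameterize the residual variance differently.

```python
from statistics import NormalDist


def mr_power(n, r2_xz, beta_yx, var_x=1.0, var_resid=1.0, alpha=0.05):
    """Approximate power of a single-instrument 2SLS MR analysis.

    n         : sample size
    r2_xz     : proportion of exposure variance explained by the instrument
    beta_yx   : true causal effect of exposure on outcome
    var_x     : exposure variance (assumed 1 by default)
    var_resid : variance of the outcome residual y - beta*x (assumed 1)
    Returns (ncp, power).
    """
    nd = NormalDist()
    # Asymptotic variance of the 2SLS estimate.
    var_beta = var_resid / (n * r2_xz * var_x)
    # Non-centrality parameter of the 1-df chi-square test of beta = 0.
    ncp = beta_yx ** 2 / var_beta
    z_crit = nd.inv_cdf(1.0 - alpha / 2.0)
    root = ncp ** 0.5
    # P(chi2_1(ncp) > crit) written with the normal distribution.
    power = nd.cdf(root - z_crit) + nd.cdf(-root - z_crit)
    return ncp, power


ncp, power = mr_power(n=10_000, r2_xz=0.01, beta_yx=0.2)
print(f"NCP = {ncp:.2f}, power = {power:.3f}")
```

For a fixed sample size, power in this sketch depends only on the instrument strength and the true causal effect, matching the two parameters the abstract highlights.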
ms
PUBMED_LINK
TITLE
Generating samples under a Wright-Fisher neutral model of genetic variation.
Main citation
Hudson RR. (2002) Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics, 18 (2) 337-8. doi:10.1093/bioinformatics/18.2.337. PMID 11847089
ABSTRACT
A Monte Carlo computer program is available to generate samples drawn from a population evolving according to a Wright-Fisher neutral model. The program assumes an infinite-sites model of mutation, and allows recombination, gene conversion, symmetric migration among subpopulations, and a variety of demographic histories. The samples produced can be used to investigate the sampling properties of any sample statistic under these neutral models.
DOI
10.1093/bioinformatics/18.2.337
MsCAVIAR
PUBMED_LINK
FULL NAME
multiple study causal variants identification in associated regions
DESCRIPTION
MsCAVIAR is a method for fine-mapping (identifying causal variants among GWAS associated variants) by leveraging information from multiple studies. One important application area is trans-ethnic fine mapping.
URL
KEYWORDS
multi-study finemapping
TITLE
Identifying causal variants by fine mapping across multiple studies.
Main citation
LaPierre N, Taraszka K, Huang H, He R, ...&, Eskin E. (2021) Identifying causal variants by fine mapping across multiple studies. PLoS Genet, 17 (9) e1009733. doi:10.1371/journal.pgen.1009733. PMID 34543273
ABSTRACT
Increasingly large Genome-Wide Association Studies (GWAS) have yielded numerous variants associated with many complex traits, motivating the development of "fine mapping" methods to identify which of the associated variants are causal. Additionally, GWAS of the same trait for different populations are increasingly available, raising the possibility of refining fine mapping results further by leveraging different linkage disequilibrium (LD) structures across studies. Here, we introduce multiple study causal variants identification in associated regions (MsCAVIAR), a method that extends the popular CAVIAR fine mapping framework to a multiple study setting using a random effects model. MsCAVIAR only requires summary statistics and LD as input, accounts for uncertainty in association statistics using a multivariate normal model, allows for multiple causal variants at a locus, and explicitly models the possibility of different SNP effect sizes in different populations. We demonstrate the efficacy of MsCAVIAR in both a simulation study and a trans-ethnic, trans-biobank fine mapping analysis of High Density Lipoprotein (HDL).
DOI
10.1371/journal.pgen.1009733
MSMC
PUBMED_LINK
DESCRIPTION
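MSMC (multiple sequentially Markovian coalescent) infers population size and separation history from multiple genome sequences. It analyzes the observed pattern of mutations in multiple individuals, focusing on the first coalescence between any two individuals (Schiffels & Durbin, 2014).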
URL
TITLE
Inferring human population size and separation history from multiple genome sequences.
Main citation
Schiffels S, Durbin R. (2014) Inferring human population size and separation history from multiple genome sequences. Nat Genet, 46 (8) 919-25. doi:10.1038/ng.3015. PMID 24952747
ABSTRACT
The availability of complete human genome sequences from populations across the world has given rise to new population genetic inference methods that explicitly model ancestral relationships under recombination and mutation. So far, application of these methods to evolutionary history more recent than 20,000-30,000 years ago and to population separations has been limited. Here we present a new method that overcomes these shortcomings. The multiple sequentially Markovian coalescent (MSMC) analyzes the observed pattern of mutations in multiple individuals, focusing on the first coalescence between any two individuals. Results from applying MSMC to genome sequences from nine populations across the world suggest that the genetic separation of non-African ancestors from African Yoruban ancestors started long before 50,000 years ago and give information about human population history as recent as 2,000 years ago, including the bottleneck in the peopling of the Americas and separations within Africa, East Asia and Europe.
DOI
10.1038/ng.3015
MTAG
PUBMED_LINK
FULL NAME
Multi-Trait Analysis of GWAS
DESCRIPTION
mtag is a Python-based command-line tool for jointly analyzing multiple sets of GWAS summary statistics, as described by Turley et al. (2018). It can also be used to meta-analyze GWAS results.
URL
KEYWORDS
Multi-trait
TITLE
Multi-trait analysis of genome-wide association summary statistics using MTAG.
Main citation
Turley P, Walters RK, Maghzian O, Okbay A, ...&, Benjamin DJ. (2018) Multi-trait analysis of genome-wide association summary statistics using MTAG. Nat Genet, 50 (2) 229-237. doi:10.1038/s41588-017-0009-4. PMID 29292387
ABSTRACT
We introduce multi-trait analysis of GWAS (MTAG), a method for joint analysis of summary statistics from genome-wide association studies (GWAS) of different traits, possibly from overlapping samples. We apply MTAG to summary statistics for depressive symptoms (N_eff = 354,862), neuroticism (N = 168,105), and subjective well-being (N = 388,538). As compared to the 32, 9, and 13 genome-wide significant loci identified in the single-trait GWAS (most of which are themselves novel), MTAG increases the number of associated loci to 64, 37, and 49, respectively. Moreover, association statistics from MTAG yield more informative bioinformatics analyses and increase the variance explained by polygenic scores by approximately 25%, matching theoretical expectations.
DOI
10.1038/s41588-017-0009-4
Multi-PGS
PUBMED_LINK
DESCRIPTION
Multi-PGS is a framework for generating enriched polygenic scores (PGS) from the wealth of publicly available genome-wide association studies, combining thousands of studies of many different phenotypes into a single multi-PGS.
URL
TITLE
Multi-PGS enhances polygenic prediction by combining 937 polygenic scores.
Main citation
Albiñana C, Zhu Z, Schork AJ, Ingason A, ...&, Vilhjálmsson BJ. (2023) Multi-PGS enhances polygenic prediction by combining 937 polygenic scores. Nat Commun, 14 (1) 4702. doi:10.1038/s41467-023-40330-w. PMID 37543680
ABSTRACT
The predictive performance of polygenic scores (PGS) is largely dependent on the number of samples available to train the PGS. Increasing the sample size for a specific phenotype is expensive and takes time, but this sample size can be effectively increased by using genetically correlated phenotypes. We propose a framework to generate multi-PGS from thousands of publicly available genome-wide association studies (GWAS) with no need to individually select the most relevant ones. In this study, the multi-PGS framework increases prediction accuracy over single PGS for all included psychiatric disorders and other available outcomes, with prediction R² increases of up to 9-fold for attention-deficit/hyperactivity disorder compared to a single PGS. We also generate multi-PGS for phenotypes without an existing GWAS and for case-case predictions. We benchmark the multi-PGS framework against other methods and highlight its potential application to new emerging biobanks.
DOI
10.1038/s41467-023-40330-w
MultiBLUP
PUBMED_LINK
URL
TITLE
MultiBLUP: improved SNP-based prediction for complex traits.
Main citation
Speed D, Balding DJ. (2014) MultiBLUP: improved SNP-based prediction for complex traits. Genome Res, 24 (9) 1550-7. doi:10.1101/gr.169375.113. PMID 24963154
ABSTRACT
BLUP (best linear unbiased prediction) is widely used to predict complex traits in plant and animal breeding, and increasingly in human genetics. The BLUP mathematical model, which consists of a single random effect term, was adequate when kinships were measured from pedigrees. However, when genome-wide SNPs are used to measure kinships, the BLUP model implicitly assumes that all SNPs have the same effect-size distribution, which is a severe and unnecessary limitation. We propose MultiBLUP, which extends the BLUP model to include multiple random effects, allowing greatly improved prediction when the random effects correspond to classes of SNPs with distinct effect-size variances. The SNP classes can be specified in advance, for example, based on SNP functional annotations, and we also provide an adaptive procedure for determining a suitable partition of SNPs. We apply MultiBLUP to genome-wide association data from the Wellcome Trust Case Control Consortium (seven diseases), and from much larger studies of celiac disease and inflammatory bowel disease, finding that it consistently provides better prediction than alternative methods. Moreover, MultiBLUP is computationally very efficient; for the largest data set, which includes 12,678 individuals and 1.5 M SNPs, the total analysis can be run on a single desktop PC in less than a day and can be parallelized to run even faster. Tools to perform MultiBLUP are freely available in our software LDAK.
DOI
10.1101/gr.169375.113
MultiPhen
PUBMED_LINK
DESCRIPTION
Performs genetic association tests between SNPs (one-at-a-time) and multiple phenotypes (separately or in joint model).
URL
TITLE
MultiPhen: joint model of multiple phenotypes can increase discovery in GWAS.
Main citation
O'Reilly PF, Hoggart CJ, Pomyen Y, Calboli FC, ...&, Coin LJ. (2012) MultiPhen: joint model of multiple phenotypes can increase discovery in GWAS. PLoS One, 7 (5) e34861. doi:10.1371/journal.pone.0034861. PMID 22567092
ABSTRACT
The genome-wide association study (GWAS) approach has discovered hundreds of genetic variants associated with diseases and quantitative traits. However, despite clinical overlap and statistical correlation between many phenotypes, GWAS are generally performed one-phenotype-at-a-time. Here we compare the performance of modelling multiple phenotypes jointly with that of the standard univariate approach. We introduce a new method and software, MultiPhen, that models multiple phenotypes simultaneously in a fast and interpretable way. By performing ordinal regression, MultiPhen tests the linear combination of phenotypes most associated with the genotypes at each SNP, and thus potentially captures effects hidden to single phenotype GWAS. We demonstrate via simulation that this approach provides a dramatic increase in power in many scenarios. There is a boost in power for variants that affect multiple phenotypes and for those that affect only one phenotype. While other multivariate methods have similar power gains, we describe several benefits of MultiPhen over these. In particular, we demonstrate that other multivariate methods that assume the genotypes are normally distributed, such as canonical correlation analysis (CCA) and MANOVA, can have highly inflated type-1 error rates when testing case-control or non-normal continuous phenotypes, while MultiPhen produces no such inflation. To test the performance of MultiPhen on real data we applied it to lipid traits in the Northern Finland Birth Cohort 1966 (NFBC1966). In these data MultiPhen discovers 21% more independent SNPs with known associations than the standard univariate GWAS approach, while applying MultiPhen in addition to the standard approach provides 37% increased discovery. 
The most associated linear combinations of the lipids estimated by MultiPhen at the leading SNPs accurately reflect the Friedewald Formula, suggesting that MultiPhen could be used to refine the definition of existing phenotypes or uncover novel heritable phenotypes.
DOI
10.1371/journal.pone.0034861
MultiSTAAR
PUBMED_LINK
FULL NAME
Multi-trait variant-Set Test for Association using Annotation infoRmation
DESCRIPTION
MultiSTAAR is an R package for performing Multi-trait variant-Set Test for Association using Annotation infoRmation (MultiSTAAR) procedure in whole-genome sequencing (WGS) studies. MultiSTAAR is a general framework that (1) leverages the correlation structure between multiple phenotypes to improve power of multi-trait analysis over single-trait analysis, and (2) incorporates both qualitative functional categories and quantitative complementary functional annotations using an omnibus multi-dimensional weighting scheme. MultiSTAAR accounts for population structure and relatedness, and is scalable for jointly analyzing large WGS studies of multiple correlated traits.
URL
TITLE
A statistical framework for multi-trait rare variant analysis in large-scale whole-genome sequencing studies.
Main citation
Li X, Chen H, Selvaraj MS, Van Buren E, ...&, Lin X. (2025) A statistical framework for multi-trait rare variant analysis in large-scale whole-genome sequencing studies. Nat Comput Sci, 5 (2) 125-143. doi:10.1038/s43588-024-00764-8. PMID 39920506
ABSTRACT
Large-scale whole-genome sequencing (WGS) studies have improved our understanding of the contributions of coding and noncoding rare variants to complex human traits. Leveraging association effect sizes across multiple traits in WGS rare variant association analysis can improve statistical power over single-trait analysis, and also detect pleiotropic genes and regions. Existing multi-trait methods have limited ability to perform rare variant analysis of large-scale WGS data. We propose MultiSTAAR, a statistical framework and computationally scalable analytical pipeline for functionally informed multi-trait rare variant analysis in large-scale WGS studies. MultiSTAAR accounts for relatedness, population structure and correlation among phenotypes by jointly analyzing multiple traits, and further empowers rare variant association analysis by incorporating multiple functional annotations. We applied MultiSTAAR to jointly analyze three lipid traits in 61,838 multi-ethnic samples from the Trans-Omics for Precision Medicine (TOPMed) Program. We discovered and replicated new associations with lipid traits missed by single-trait analysis.
DOI
10.1038/s43588-024-00764-8
MultiSuSiE
PUBMED_LINK
DESCRIPTION
MultiSuSiE is a multi-ancestry SuSiE-style fine-mapping framework that allows causal effect sizes to differ across ancestries, improving credible sets in diverse whole-genome sequencing cohorts such as All of Us.
URL
KEYWORDS
cross-ancestry, fine-mapping
TITLE
MultiSuSiE improves multi-ancestry fine-mapping in All of Us whole-genome sequencing data.
Main citation
Rossen J, Shi H, Strober BJ, Zhang MJ, ...&, Price AL. (2026) MultiSuSiE improves multi-ancestry fine-mapping in All of Us whole-genome sequencing data. Nat Genet, 58 (1) 67-76. doi:10.1038/s41588-025-02450-5. PMID 41491094
ABSTRACT
Leveraging multi-ancestry data can improve fine-mapping power. We propose MultiSuSiE, an extension of Sum of Single Effects (SuSiE), to multiple ancestries that allows causal effect sizes to vary across ancestries. We evaluated MultiSuSiE using whole-genome sequencing data from 47,000 African-ancestry, 36,000 Latino-ancestry and 116,000 European-ancestry individuals from All of Us. In simulations, MultiSuSiE applied to Afr36k + Lat36k + Eur36k was well-calibrated and attained higher power than SuSiE applied to Eur109k; compared to recent multi-ancestry methods (SuSiEx and MESuSiE), MultiSuSiE attained higher power and lower computational cost. In analyses of 14 quantitative traits, MultiSuSiE applied to Afr47k + Lat36k + Eur116k identified 348 fine-mapped variants with posterior inclusion probability (PIP) > 0.9, and MultiSuSiE applied to Afr36k + Lat36k + Eur36k identified 59% more PIP > 0.9 variants than SuSiE applied to Eur109k; MultiSuSiE identified 29% more PIP > 0.9 variants than SuSiEx, and MESuSiE was not included due to its high computational cost. We validated these findings through functional enrichment of fine-mapped variants and highlighted examples implicating biologically plausible fine-mapped variants.
DOI
10.1038/s41588-025-02450-5
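NOTE
The abstract reports results as counts of variants with posterior inclusion probability (PIP) > 0.9. As a rough sketch of what that quantity means in a Sum of Single Effects model, the code below computes PIPs and per-effect credible sets from the per-effect posterior probability vectors (often called alpha). It uses the generic SuSiE-style definitions and is not MultiSuSiE's own code; the function names and the toy numbers are made up for illustration.

```python
import math


def pips_from_alpha(alpha):
    """PIP of variant j = 1 - prod over effects l of (1 - alpha[l][j]).

    alpha is a list of rows, one per single effect; each row holds the
    posterior probability that each variant is that effect's causal variant.
    """
    n_variants = len(alpha[0])
    return [
        1.0 - math.prod(1.0 - row[j] for row in alpha)
        for j in range(n_variants)
    ]


def credible_set(alpha_row, coverage=0.95):
    """Smallest set of variants whose alpha values sum to at least coverage."""
    # Rank variants from most to least probable (ties keep index order).
    order = sorted(range(len(alpha_row)), key=lambda j: -alpha_row[j])
    chosen, total = [], 0.0
    for j in order:
        chosen.append(j)
        total += alpha_row[j]
        if total >= coverage - 1e-12:  # small tolerance for float rounding
            break
    return sorted(chosen)


# Toy example: 2 single effects over 4 variants (hypothetical values).
alpha = [
    [0.92, 0.05, 0.02, 0.01],  # effect 1 is concentrated on variant 0
    [0.01, 0.01, 0.49, 0.49],  # effect 2 cannot separate variants 2 and 3
]
pips = pips_from_alpha(alpha)
high_confidence = [j for j, pip in enumerate(pips) if pip > 0.9]
```

In this toy example only variant 0 passes the PIP > 0.9 threshold, while the second effect yields a two-variant credible set. Better-resolved posteriors, which multi-ancestry data can provide when linkage disequilibrium differs between ancestries, would move more variants above the threshold.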
MultiXcan
PUBMED_LINK
DESCRIPTION
MultiXcan is an efficient statistical method that leverages the substantial sharing of eQTLs across tissues and contexts to improve the identification of potential target genes. It integrates evidence across multiple panels using multivariate regression, which naturally takes into account the correlation structure.
URL
TITLE
Integrating predicted transcriptome from multiple tissues improves association detection.
Main citation
Barbeira AN, Pividori M, Zheng J, Wheeler HE, ...&, Im HK. (2019) Integrating predicted transcriptome from multiple tissues improves association detection. PLoS Genet, 15 (1) e1007889. doi:10.1371/journal.pgen.1007889. PMID 30668570
ABSTRACT
Integration of genome-wide association studies (GWAS) and expression quantitative trait loci (eQTL) studies is needed to improve our understanding of the biological mechanisms underlying GWAS hits, and our ability to identify therapeutic targets. Gene-level association methods such as PrediXcan can prioritize candidate targets. However, limited eQTL sample sizes and absence of relevant developmental and disease context restrict our ability to detect associations. Here we propose an efficient statistical method (MultiXcan) that leverages the substantial sharing of eQTLs across tissues and contexts to improve our ability to identify potential target genes. MultiXcan integrates evidence across multiple panels using multivariate regression, which naturally takes into account the correlation structure. We apply our method to simulated and real traits from the UK Biobank and show that, in realistic settings, we can detect a larger set of significantly associated genes than using each panel separately. To improve applicability, we developed a summary result-based extension called S-MultiXcan, which we show yields highly concordant results with the individual level version when LD is well matched. Our multivariate model-based approach allowed us to use the individual level results as a gold standard to calibrate and develop a robust implementation of the summary-based extension. Results from our analysis as well as software and necessary resources to apply our method are publicly available.
DOI
10.1371/journal.pgen.1007889
MungeSumstats
PUBMED_LINK
DESCRIPTION
MungeSumstats is a Bioconductor R package for the standardization and quality control of GWAS summary statistics.
URL
TITLE
MungeSumstats: a Bioconductor package for the standardization and quality control of many GWAS summary statistics.
Main citation
Murphy AE, Schilder BM, Skene NG. (2021) MungeSumstats: a Bioconductor package for the standardization and quality control of many GWAS summary statistics. Bioinformatics, 37 (23) 4593-4596. doi:10.1093/bioinformatics/btab665. PMID 34601555
ABSTRACT
MOTIVATION: Genome-wide association studies (GWAS) summary statistics have popularized and accelerated genetic research. However, a lack of standardization of the file formats used has proven problematic when running secondary analysis tools or performing meta-analysis studies. RESULTS: To address this issue, we have developed MungeSumstats, a Bioconductor R package for the standardization and quality control of GWAS summary statistics. MungeSumstats can handle the most common summary statistic formats, including variant call format (VCF) producing a reformatted, standardized, tabular summary statistic file, VCF or R native data object. AVAILABILITY AND IMPLEMENTATION: MungeSumstats is available on Bioconductor (v 3.13) and can also be found on Github at: https://neurogenomics.github.io/MungeSumstats. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
DOI
10.1093/bioinformatics/btab665
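NOTE
To illustrate the kind of problem the abstract describes (the same field appearing under different column names in different GWAS summary statistics files), here is a minimal header-standardization sketch. MungeSumstats itself is an R package; this Python snippet does not use it or reproduce its column mapping. The alias table and the function name are hypothetical and deliberately small.

```python
# Hypothetical alias table for illustration; NOT MungeSumstats' own mapping.
ALIASES = {
    "SNP": {"SNP", "RSID", "MARKERNAME"},
    "CHR": {"CHR", "CHROM", "CHROMOSOME"},
    "BP": {"BP", "POS", "POSITION"},
    "P": {"P", "PVAL", "PVALUE", "P_VALUE"},
}

# Invert the table once: alias -> standard name.
_LOOKUP = {alias: std for std, names in ALIASES.items() for alias in names}


def standardize_header(columns):
    """Map raw column names to standard names; unknown names are upper-cased."""
    standardized = []
    for col in columns:
        # Normalize case and common separators before the lookup.
        key = col.strip().upper().replace("-", "_").replace(".", "_")
        standardized.append(_LOOKUP.get(key, key))
    return standardized


# Example: a header as it might appear in a raw summary statistics file.
raw_header = ["rsid", "chrom", "pos", "p.value", "beta"]
clean_header = standardize_header(raw_header)
```

A real standardization tool also has to handle allele orientation, genome build, missing fields and quality-control filters, none of which are attempted here.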
MuPIT
PUBMED_LINK
FULL NAME
Mutation position imaging toolbox
DESCRIPTION
Webserver for mapping variant positions to annotated, interactive 3D structures
URL
TITLE
MuPIT interactive: webserver for mapping variant positions to annotated, interactive 3D structures.
Main citation
Niknafs N, Kim D, Kim R, Diekhans M, ...&, Karchin R. (2013) MuPIT interactive: webserver for mapping variant positions to annotated, interactive 3D structures. Hum Genet, 132 (11) 1235-43. doi:10.1007/s00439-013-1325-0. PMID 23793516
ABSTRACT
Mutation position imaging toolbox (MuPIT) interactive is a browser-based application for single-nucleotide variants (SNVs), which automatically maps the genomic coordinates of SNVs onto the coordinates of available three-dimensional (3D) protein structures. The application is designed for interactive browser-based visualization of the putative functional relevance of SNVs by biologists who are not necessarily experts either in bioinformatics or protein structure. Users may submit batches of several thousand SNVs and review all protein structures that cover the SNVs, including available functional annotations such as binding sites, mutagenesis experiments, and common polymorphisms. Multiple SNVs may be mapped onto each structure, enabling 3D visualization of SNV clusters and their relationship to functionally annotated positions. We illustrate the utility of MuPIT interactive in rationalizing the impact of selected polymorphisms in the PharmGKB database, somatic mutations identified in the Cancer Genome Atlas study of invasive breast carcinomas, and rare variants identified in the exome sequencing project. MuPIT interactive is freely available for non-profit use at http://mupit.icm.jhu.edu .
DOI
10.1007/s00439-013-1325-0
MV-PLINK (MQFAM)
PUBMED_LINK
TITLE
A multivariate test of association.
Main citation
Ferreira MA, Purcell SM. (2009) A multivariate test of association. Bioinformatics, 25 (1) 132-3. doi:10.1093/bioinformatics/btn563. PMID 19019849
ABSTRACT
UNLABELLED: Although genetic association studies often test multiple, related phenotypes, few formal multivariate tests of association are available. We describe a test of association that can be efficiently applied to large population-based designs. AVAILABILITY: A C++ implementation can be obtained from the authors.
DOI
10.1093/bioinformatics/btn563
mvGWAMA
PUBMED_LINK
FULL NAME
Multivariate Genome-Wide Association Meta-Analysis
DESCRIPTION
mvGWAMA is a Python script for performing GWAS meta-analysis when there is sample overlap.
URL
TITLE
Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer's disease risk.
Main citation
Jansen IE, Savage JE, Watanabe K, Bryois J, ...&, Posthuma D. (2019) Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer's disease risk. Nat Genet, 51 (3) 404-413. doi:10.1038/s41588-018-0311-9. PMID 30617256
ABSTRACT
Alzheimer's disease (AD) is highly heritable and recent studies have identified over 20 disease-associated genomic loci. Yet these only explain a small proportion of the genetic variance, indicating that undiscovered loci remain. Here, we performed a large genome-wide association study of clinically diagnosed AD and AD-by-proxy (71,880 cases, 383,378 controls). AD-by-proxy, based on parental diagnoses, showed strong genetic correlation with AD (rg = 0.81). Meta-analysis identified 29 risk loci, implicating 215 potential causative genes. Associated genes are strongly expressed in immune-related tissues and cell types (spleen, liver, and microglia). Gene-set analyses indicate biological mechanisms involved in lipid-related processes and degradation of amyloid precursor proteins. We show strong genetic correlations with multiple health-related outcomes, and Mendelian randomization results suggest a protective effect of cognitive ability on AD risk. These results are a step forward in identifying the genetic factors that contribute to AD risk and add novel insights into the neurobiology of AD.
DOI
10.1038/s41588-018-0311-9
mvSuSiE
PUBMED_LINK
DESCRIPTION
mvSuSiE extends the Sum of Single Effects (SuSiE) model to joint fine-mapping of multiple traits, improving power and resolution relative to separate single-trait analyses while remaining computationally practical.
URL
KEYWORDS
multi-trait, fine-mapping
TITLE
Fast and flexible joint fine-mapping of multiple traits via the Sum of Single Effects model.
Main citation
Zou Y, Carbonetto P, Xie D, Wang G, ...&, Stephens M. (2026) Fast and flexible joint fine-mapping of multiple traits via the Sum of Single Effects model. Nat Genet, 58 (2) 454-462. doi:10.1038/s41588-025-02486-7. PMID 41634413
ABSTRACT
We introduce mvSuSiE, a multitrait fine-mapping method, to identify putative causal variants from genetic association data (individual-level or summary). mvSuSiE learns patterns of shared genetic effects from data, and exploits these patterns to improve power to identify causal single nucleotide polymorphisms (SNPs). Comparisons on simulated data show that mvSuSiE is competitive in speed, power and precision with existing multitrait methods, and uniformly improves over single-trait fine-mapping (Sum of Single Effects) performed separately for each trait. We applied mvSuSiE to jointly fine-map 16 blood cell traits using data from the UK Biobank. By jointly analyzing traits and modeling heterogeneous effect-sharing patterns, we identified a substantially larger number of causal SNPs (>3,000) than single-trait fine-mapping and achieved narrower credible sets. mvSuSiE also more comprehensively characterized how genetic variants affect blood cell traits; 68% of causal SNPs showed significant effects across more than one blood cell type.
DOI
10.1038/s41588-025-02486-7
NARD
PUBMED_LINK
FULL NAME
Northeast Asian Reference Database
URL
TITLE
NARD: whole-genome reference panel of 1779 Northeast Asians improves imputation accuracy of rare and low-frequency variants.
Main citation
Yoo SK, Kim CU, Kim HL, Kim S, ...&, Seo JS. (2019) NARD: whole-genome reference panel of 1779 Northeast Asians improves imputation accuracy of rare and low-frequency variants. Genome Med, 11 (1) 64. doi:10.1186/s13073-019-0677-z. PMID 31640730
ABSTRACT
Here, we present the Northeast Asian Reference Database (NARD), including whole-genome sequencing data of 1779 individuals from Korea, Mongolia, Japan, China, and Hong Kong. NARD provides the genetic diversity of Korean (n = 850) and Mongolian (n = 384) ancestries that were not present in the 1000 Genomes Project Phase 3 (1KGP3). We combined and re-phased the genotypes from NARD and 1KGP3 to construct a union set of haplotypes. This approach established a robust imputation reference panel for Northeast Asians, which yields the greatest imputation accuracy of rare and low-frequency variants compared with the existing panels. NARD imputation panel is available at https://nard.macrogen.com/ .
DOI
10.1186/s13073-019-0677-z
NARD2
PUBMED_LINK
FULL NAME
Northeast Asian Reference Database 2
URL
TITLE
A whole-genome reference panel of 14,393 individuals for East Asian populations accelerates discovery of rare functional variants.
Main citation
Choi J, Kim S, Kim J, Son HY, ...&, Im SW. (2023) A whole-genome reference panel of 14,393 individuals for East Asian populations accelerates discovery of rare functional variants. Sci Adv, 9 (32) eadg6319. doi:10.1126/sciadv.adg6319. PMID 37556544
ABSTRACT
Underrepresentation of non-European (EUR) populations hinders growth of global precision medicine. Resources such as imputation reference panels that match the study population are necessary to find low-frequency variants with substantial effects. We created a reference panel consisting of 14,393 whole-genome sequences including more than 11,000 Asian individuals. Genome-wide association studies were conducted using the reference panel and a population-specific genotype array of 72,298 subjects for eight phenotypes. This panel yields improved imputation accuracy of rare and low-frequency variants within East Asian populations compared with the largest reference panel. Thirty-nine previously unidentified associations were found, and more than half of the variants were East Asian specific. We discovered genes with rare protein-altering variants, including LTBP1 for height and GPR75 for body mass index, as well as putative regulatory mechanisms for rare noncoding variants with cell type-specific effects. We suggest that this dataset will add to the potential value of Asian precision medicine.
DOI
10.1126/sciadv.adg6319
NyuWa Genome Resource Phase 1
PUBMED_LINK
URL
TITLE
NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population.
Main citation
Zhang P, Luo H, Li Y, Wang Y, ...&, He S. (2021) NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population. Cell Rep, 37 (7) 110017. doi:10.1016/j.celrep.2021.110017. PMID 34788621
ABSTRACT
The lack of haplotype reference panels and whole-genome sequencing resources specific to the Chinese population has greatly hindered genetic studies in the world's largest population. Here, we present the NyuWa genome resource, based on deep (26.2×) sequencing of 2,999 Chinese individuals, and construct a NyuWa reference panel of 5,804 haplotypes and 19.3 million variants, which is a high-quality publicly available Chinese population-specific reference panel with thousands of samples. Compared with other panels, the NyuWa reference panel reduces the Han Chinese imputation error rate by a margin ranging from 30% to 51%. Population structure and imputation simulation tests support the applicability of one integrated reference panel for northern and southern Chinese. In addition, a total of 22,504 loss-of-function variants in coding and noncoding genes are identified, including 11,493 novel variants. These results highlight the value of the NyuWa genome resource in facilitating genetic research in Chinese and Asian populations.
DOI
10.1016/j.celrep.2021.110017
NyuWa Imputation Server (NyuWa)
PUBMED_LINK
URL
TITLE
NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population.
Main citation
Zhang P, Luo H, Li Y, Wang Y, ...&, He S. (2021) NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population. Cell Rep, 37 (7) 110017. doi:10.1016/j.celrep.2021.110017. PMID 34788621
ABSTRACT
The lack of haplotype reference panels and whole-genome sequencing resources specific to the Chinese population has greatly hindered genetic studies in the world's largest population. Here, we present the NyuWa genome resource, based on deep (26.2×) sequencing of 2,999 Chinese individuals, and construct a NyuWa reference panel of 5,804 haplotypes and 19.3 million variants, which is a high-quality publicly available Chinese population-specific reference panel with thousands of samples. Compared with other panels, the NyuWa reference panel reduces the Han Chinese imputation error rate by a margin ranging from 30% to 51%. Population structure and imputation simulation tests support the applicability of one integrated reference panel for northern and southern Chinese. In addition, a total of 22,504 loss-of-function variants in coding and noncoding genes are identified, including 11,493 novel variants. These results highlight the value of the NyuWa genome resource in facilitating genetic research in Chinese and Asian populations.
DOI
10.1016/j.celrep.2021.110017
Olink
PUBMED_LINK
TITLE
Proximity Extension Assay in Combination with Next-Generation Sequencing for High-throughput Proteome-wide Analysis.
Main citation
Wik L, Nordberg N, Broberg J, Björkesten J, ...&, Lundberg M. (2021) Proximity Extension Assay in Combination with Next-Generation Sequencing for High-throughput Proteome-wide Analysis. Mol Cell Proteomics, 20, 100168. doi:10.1016/j.mcpro.2021.100168. PMID 34715355
ABSTRACT
Understanding the dynamics of the human proteome is crucial for developing biomarkers to be used as measurable indicators for disease severity and progression, patient stratification, and drug development. The Proximity Extension Assay (PEA) is a technology that translates protein information into actionable knowledge by linking protein-specific antibodies to DNA-encoded tags. In this report we demonstrate how we have combined the unique PEA technology with an innovative and automated sample preparation and high-throughput sequencing readout enabling parallel measurement of nearly 1500 proteins in 96 samples generating close to 150,000 data points per run. This advancement will have a major impact on the discovery of new biomarkers for disease prediction and prognosis and contribute to the development of the rapidly evolving fields of wellness monitoring and precision medicine.
DOI
10.1016/j.mcpro.2021.100168
OmiGA
PUBMED_LINK
DESCRIPTION
Toolkit for molecular QTL (molQTL) mapping using linear mixed models that handle complex relatedness, aimed at high-throughput omics phenotypes with strong performance for discovery, fine mapping, and trait–molQTL colocalization versus common linear-mapper pipelines.
URL
KEYWORDS
molQTL, xQTL, LMM, relatedness, colocalization, fine mapping
TITLE
OmiGA for ultra-efficient molecular quantitative trait loci mapping.
Main citation
Teng J, Zhang W, Gong W, Chen J, ...&, Zhang Z. (2026) OmiGA for ultra-efficient molecular quantitative trait loci mapping. Nat Commun, 17 (1). doi:10.1038/s41467-026-68978-0. PMID 41680153
ABSTRACT
Molecular quantitative trait loci (molQTL) mapping is one of the most popular approaches to systematically characterize functional impacts of genomic variants, leading to advanced understanding of the regulatory mechanisms underpinning complex traits and diseases. However, when applied to high-throughput molecular phenotypes, the existing molQTL mapping tools often implement simple linear models, overlooking complex inter-individual relatedness, leading to false positives and insufficient statistical power. Here, we introduce OmiGA, an ultra-efficient omics genetic analysis toolkit, for molQTL mapping based on linear mixed model in populations with complex relatedness. Both computational simulations and real data analyses demonstrate that OmiGA outperforms the existing popular tools regarding molQTL discovery power, fine mapping of causal variants, colocalization of molQTL and trait associations, and computational efficiency. In summary, we recommend OmiGA for molQTL mapping in populations with complex relatedness, for example, those in the Farm animal Genotype-Tissue Expression project and family-based molQTL studies in humans.
DOI
10.1038/s41467-026-68978-0
Open Targets
PUBMED_LINK
DESCRIPTION
Open Targets is an innovative, large-scale, multi-year, public-private partnership that uses human genetics and genomics data for systematic drug target identification and prioritisation.
URL
TITLE
Open Targets: a platform for therapeutic target identification and validation.
Main citation
Koscielny G, An P, Carvalho-Silva D, Cham JA, ...&, Dunham I. (2017) Open Targets: a platform for therapeutic target identification and validation. Nucleic Acids Res, 45 (D1) D985-D994. doi:10.1093/nar/gkw1055. PMID 27899665
ABSTRACT
We have designed and developed a data integration and visualization platform that provides evidence about the association of known and potential drug targets with diseases. The platform is designed to support identification and prioritization of biological targets for follow-up. Each drug target is linked to a disease using integrated genome-wide data from a broad range of data sources. The platform provides either a target-centric workflow to identify diseases that may be associated with a specific target, or a disease-centric workflow to identify targets that may be associated with a specific disease. Users can easily transition between these target- and disease-centric workflows. The Open Targets Validation Platform is accessible at https://www.targetvalidation.org.
DOI
10.1093/nar/gkw1055
Open Targets Genetics
PUBMED_LINK
DESCRIPTION
Open Targets Genetics is a comprehensive tool highlighting variant-centric statistical evidence to allow both prioritisation of candidate causal variants at trait-associated loci and identification of potential drug targets.
URL
TITLE
An open approach to systematically prioritize causal variants and genes at all published human GWAS trait-associated loci.
Main citation
Mountjoy E, Schmidt EM, Carmona M, Schwartzentruber J, ...&, Ghoussaini M. (2021) An open approach to systematically prioritize causal variants and genes at all published human GWAS trait-associated loci. Nat Genet, 53 (11) 1527-1533. doi:10.1038/s41588-021-00945-5. PMID 34711957
ABSTRACT
Genome-wide association studies (GWASs) have identified many variants associated with complex traits, but identifying the causal gene(s) is a major challenge. In the present study, we present an open resource that provides systematic fine mapping and gene prioritization across 133,441 published human GWAS loci. We integrate genetics (GWAS Catalog and UK Biobank) with transcriptomic, proteomic and epigenomic data, including systematic disease-disease and disease-molecular trait colocalization results across 92 cell types and tissues. We identify 729 loci fine mapped to a single-coding causal variant and colocalized with a single gene. We trained a machine-learning model using the fine-mapped genetics and functional genomics data and 445 gold-standard curated GWAS loci to distinguish causal genes from neighboring genes, outperforming a naive distance-based model. Our prioritized genes were enriched for known approved drug targets (odds ratio = 8.1, 95% confidence interval = 5.7, 11.5). These results are publicly available through a web portal ( http://genetics.opentargets.org ), enabling users to easily prioritize genes at disease-associated loci and assess their potential as drug targets.
DOI
10.1038/s41588-021-00945-5
OpenADMIXTURE
PUBMED_LINK
DESCRIPTION
Ko, S., Chu, B. B., Peterson, D., Okenwa, C., Papp, J. C., Alexander, D. H., ... & Lange, K. L. (2023). Unsupervised discovery of ancestry-informative markers and genetic admixture proportions in biobank-scale datasets. The American Journal of Human Genetics.
URL
USE
This software package is an open-source Julia reimplementation of the ADMIXTURE package. It estimates ancestry by maximum likelihood for large SNP genotype datasets in which individuals are assumed to be unrelated.
TITLE
Unsupervised discovery of ancestry-informative markers and genetic admixture proportions in biobank-scale datasets.
Main citation
Ko S, Chu BB, Peterson D, Okenwa C, ...&, Lange KL. (2023) Unsupervised discovery of ancestry-informative markers and genetic admixture proportions in biobank-scale datasets. Am J Hum Genet, 110 (2) 314-325. doi:10.1016/j.ajhg.2022.12.008. PMID 36610401
ABSTRACT
Admixture estimation plays a crucial role in ancestry inference and genome-wide association studies (GWASs). Computer programs such as ADMIXTURE and STRUCTURE are commonly employed to estimate the admixture proportions of sample individuals. However, these programs can be overwhelmed by the computational burdens imposed by the 10^5 to 10^6 samples and millions of markers commonly found in modern biobanks. An attractive strategy is to run these programs on a set of ancestry-informative SNP markers (AIMs) that exhibit substantially different frequencies across populations. Unfortunately, existing methods for identifying AIMs require knowing ancestry labels for a subset of the sample. This supervised learning approach creates a chicken and the egg scenario. In this paper, we present an unsupervised, scalable framework that seamlessly carries out AIM selection and likelihood-based estimation of admixture proportions. Our simulated and real data examples show that this approach is scalable to modern biobank datasets. OpenADMIXTURE, our Julia implementation of the method, is open source and available for free.
DOI
10.1016/j.ajhg.2022.12.008
OTTERS
PUBMED_LINK
FULL NAME
Omnibus Transcriptome Test using Expression Reference Summary data
DESCRIPTION
Dai, Q. et al. OTTERS: a powerful TWAS framework leveraging summary-level reference data. Nat. Commun. 14, 1271 (2023).
URL
TITLE
OTTERS: a powerful TWAS framework leveraging summary-level reference data.
Main citation
Dai Q, Zhou G, Zhao H, Võsa U, ...&, Yang J. (2023) OTTERS: a powerful TWAS framework leveraging summary-level reference data. Nat Commun, 14 (1) 1271. doi:10.1038/s41467-023-36862-w. PMID 36882394
ABSTRACT
Most existing TWAS tools require individual-level eQTL reference data and thus are not applicable to summary-level reference eQTL datasets. The development of TWAS methods that can harness summary-level reference data is valuable to enable TWAS in broader settings and enhance power due to increased reference sample size. Thus, we develop a TWAS framework called OTTERS (Omnibus Transcriptome Test using Expression Reference Summary data) that adapts multiple polygenic risk score (PRS) methods to estimate eQTL weights from summary-level eQTL reference data and conducts an omnibus TWAS. We show that OTTERS is a practical and powerful TWAS tool by both simulations and application studies.
DOI
10.1038/s41467-023-36862-w
PAINTOR
PUBMED_LINK
FULL NAME
Probabilistic Annotation INtegraTOR
DESCRIPTION
Finding causal variants that underlie known risk loci is one of the main post-GWAS challenges. Here we present PAINTOR (Probabilistic Annotation INtegraTOR), a probabilistic framework that integrates association strength with genomic functional annotation data to improve accuracy in selecting plausible causal variants for functional validation. The main output of PAINTOR is, for every variant, a probability of being causal; these probabilities can be used for prioritization in functional assays to establish biological causality.
URL
KEYWORDS
Empirical Bayes prior
TITLE
Integrating functional data to prioritize causal variants in statistical fine-mapping studies.
Main citation
Kichaev G, Yang WY, Lindstrom S, Hormozdiari F, ...&, Pasaniuc B. (2014) Integrating functional data to prioritize causal variants in statistical fine-mapping studies. PLoS Genet, 10 (10) e1004722. doi:10.1371/journal.pgen.1004722. PMID 25357204
ABSTRACT
Standard statistical approaches for prioritization of variants for functional testing in fine-mapping studies either use marginal association statistics or estimate posterior probabilities for variants to be causal under simplifying assumptions. Here, we present a probabilistic framework that integrates association strength with functional genomic annotation data to improve accuracy in selecting plausible causal variants for functional validation. A key feature of our approach is that it empirically estimates the contribution of each functional annotation to the trait of interest directly from summary association statistics while allowing for multiple causal variants at any risk locus. We devise efficient algorithms that estimate the parameters of our model across all risk loci to further increase performance. Using simulations starting from the 1000 Genomes data, we find that our framework consistently outperforms the current state-of-the-art fine-mapping methods, reducing the number of variants that need to be selected to capture 90% of the causal variants from an average of 13.3 to 10.4 SNPs per locus (as compared to the next-best performing strategy). Furthermore, we introduce a cost-to-benefit optimization framework for determining the number of variants to be followed up in functional assays and assess its performance using real and simulation data. We validate our findings using a large scale meta-analysis of four blood lipids traits and find that the relative probability for causality is increased for variants in exons and transcription start sites and decreased in repressed genomic regions at the risk loci of these traits. Using these highly predictive, trait-specific functional annotations, we estimate causality probabilities across all traits and variants, reducing the size of the 90% confidence set from an average of 17.5 to 13.5 variants per locus in this data.
DOI
10.1371/journal.pgen.1004722
PanMAN
PUBMED_LINK
FULL NAME
Pangenome Mutation-Annotated Network
DESCRIPTION
PanMAN is a compact pangenome representation built from mutation-annotated trees (PanMATs) linked into a network, designed to compress and query shared evolutionary history across large microbial pathogen collections.
URL
KEYWORDS
pangenome, microbial genomics, compression, mutation-annotated tree, phylogeny
TITLE
Compressive pangenomics using mutation-annotated networks.
Main citation
Walia S, Motwani H, Tseng YH, Smith K, ...&, Turakhia Y. (2026) Compressive pangenomics using mutation-annotated networks. Nat Genet, 58 (2) 445-453. doi:10.1038/s41588-025-02478-7. PMID 41526696
ABSTRACT
Pangenomics is an emerging field that uses collections of genomes, rather than a single reference, to reduce bias and capture intra-species diversity. However, existing pangenomic data formats face challenges in scaling to millions of genomes and primarily emphasize variation, often neglecting the underlying mutational events and evolutionary relationships. This work introduces Pangenome Mutation-Annotated Network (PanMAN), a lossless pangenome representation that achieves compression ratios ranging from 3.5-1,391× in file sizes compared to existing variation-preserving formats, with performance generally improving on larger datasets. In addition to compression, PanMAN increases representational capacity by encoding detailed mutational and evolutionary histories inferred across genomes, thereby enabling new biological insights. Using PanMAN, a comprehensive SARS-CoV-2 pangenome was constructed from 8 million publicly available sequences, requiring only 366 MB of disk space. We also present 'panmanUtils', a toolkit that supports common analyses and ensures interoperability with existing software. PanMAN is poised to greatly improve the scale, speed, resolution and scope of pangenomic analysis and data sharing.
DOI
10.1038/s41588-025-02478-7
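To illustrate why storing mutations on a tree can be compact, the sketch below keeps one root sequence plus the substitutions on each branch and rebuilds any node's sequence by replaying mutations from the root. It is a toy under stated assumptions: the dictionary layout, node names and sequences are hypothetical, only substitutions are modeled, and none of this reflects the actual PanMAN file format or the panmanUtils interface.

```python
# Hypothetical toy structure: each node stores its parent and the substitutions
# (0-based position, new base) that occurred on the branch leading to it.
ROOT_SEQUENCE = "ACGTACGT"
TREE = {
    "root": {"parent": None, "mutations": []},
    "n1": {"parent": "root", "mutations": [(2, "T")]},
    "s1": {"parent": "n1", "mutations": [(5, "A")]},
    "s2": {"parent": "n1", "mutations": [(2, "G"), (7, "C")]},
}


def reconstruct(node):
    """Rebuild a node's sequence by replaying mutations from the root down."""
    path = []
    while node is not None:
        path.append(node)
        node = TREE[node]["parent"]
    seq = list(ROOT_SEQUENCE)
    for name in reversed(path):  # root first, requested node last
        for pos, base in TREE[name]["mutations"]:
            seq[pos] = base
    return "".join(seq)


print(reconstruct("s1"))
print(reconstruct("s2"))
```

Samples that share ancestry share the mutations on their common branches, so those are stored only once.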
PASCAL
PUBMED_LINK
FULL NAME
Pathway scoring algorithm
DESCRIPTION
Pascal (Pathway scoring algorithm) is an easy-to-use tool for gene scoring and pathway analysis from GWAS results.
URL
TITLE
Fast and Rigorous Computation of Gene and Pathway Scores from SNP-Based Summary Statistics.
Main citation
Lamparter D, Marbach D, Rueedi R, Kutalik Z, ...&, Bergmann S. (2016) Fast and Rigorous Computation of Gene and Pathway Scores from SNP-Based Summary Statistics. PLoS Comput Biol, 12 (1) e1004714. doi:10.1371/journal.pcbi.1004714. PMID 26808494
ABSTRACT
Integrating single nucleotide polymorphism (SNP) p-values from genome-wide association studies (GWAS) across genes and pathways is a strategy to improve statistical power and gain biological insight. Here, we present Pascal (Pathway scoring algorithm), a powerful tool for computing gene and pathway scores from SNP-phenotype association summary statistics. For gene score computation, we implemented analytic and efficient numerical solutions to calculate test statistics. We examined in particular the sum and the maximum of chi-squared statistics, which measure the average and the strongest association signals per gene, respectively. For pathway scoring, we use a modified Fisher method, which offers not only significant power improvement over more traditional enrichment strategies, but also eliminates the problem of arbitrary threshold selection inherent in any binary membership based pathway enrichment approach. We demonstrate the marked increase in power by analyzing summary statistics from dozens of large meta-studies for various traits. Our extensive testing indicates that our method not only excels in rigorous type I error control, but also results in more biologically meaningful discoveries.
DOI
10.1371/journal.pcbi.1004714
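The abstract above describes pathway scoring with a modified Fisher method. As background only, the sketch below implements the classic Fisher combination of independent p-values, which needs no significance threshold. It is not Pascal's modified procedure; the function names and the example gene-level p-values are hypothetical, and independence between the combined p-values is assumed.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-squared variable with even `df`.

    For df = 2k the survival function has the closed form
    exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_combined_p(p_values):
    """Combine independent p-values: -2 * sum(ln p) ~ chi-squared with 2k df."""
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    return chi2_sf_even_df(statistic, 2 * len(p_values))


gene_p_values = [0.01, 0.20, 0.03]  # hypothetical gene-level p-values in one pathway
print(fisher_combined_p(gene_p_values))
```

Every gene contributes to the statistic, so no cutoff is needed to decide which genes count as "hits".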
PCHAT
PUBMED_LINK
FULL NAME
principal component of heritability association test
TITLE
Pleiotropy and principal components of heritability combine to increase power for association analysis.
Main citation
Klei L, Luca D, Devlin B, Roeder K. (2008) Pleiotropy and principal components of heritability combine to increase power for association analysis. Genet Epidemiol, 32 (1) 9-19. doi:10.1002/gepi.20257. PMID 17922480
ABSTRACT
When many correlated traits are measured the potential exists to discover the coordinated control of these traits via genotyped polymorphisms. A common statistical approach to this problem involves assessing the relationship between each phenotype and each single nucleotide polymorphism (SNP) individually (PHN); and taking a Bonferroni correction for the effective number of independent tests conducted. Alternatively, one can apply a dimension reduction technique, such as estimation of principal components, and test for an association with the principal components of the phenotypes (PCP) rather than the individual phenotypes. Building on the work of Lange and colleagues we develop an alternative method based on the principal component of heritability (PCH). For each SNP the PCH approach reduces the phenotypes to a single trait that has a higher heritability than any other linear combination of the phenotypes. As a result, the association between a SNP and derived trait is often easier to detect than an association with any of the individual phenotypes or the PCP. When applied to unrelated subjects, PCH has a drawback. For each SNP it is necessary to estimate the vector of loadings that maximize the heritability over all phenotypes. We develop a method of iterated sample splitting that uses one portion of the data for training and the remainder for testing. This cross-validation approach maintains the type I error control and yet utilizes the data efficiently, resulting in a powerful test for association.
DOI
10.1002/gepi.20257
PennPRS
DESCRIPTION
PennPRS is a centralized cloud computing platform for efficient polygenic risk score (PRS) model training in precision medicine. Users can either upload their own GWAS summary data or directly query data from public data sources we provide. PennPRS supports both single-ancestry and multi-ancestry PRS training.
URL
KEYWORDS
PennPRS
PREPRINT_DOI
10.1101/2025.02.07.25321875
Main citation
Jin, J., Li, B., Wang, X., Yang, X., Li, Y., Wang, R., ... & Zhao, B. (2025). PennPRS: a centralized cloud computing platform for efficient polygenic risk score training in precision medicine. medRxiv.
PES
PUBMED_LINK
FULL NAME
Pharmagenic enrichment score
DESCRIPTION
A framework to quantify an individual’s common variant enrichment in clinically actionable systems responsive to existing drugs.
TITLE
Pharmacological enrichment of polygenic risk for precision medicine in complex disorders.
Main citation
Reay WR, Atkins JR, Carr VJ, Green MJ, ...&, Cairns MJ. (2020) Pharmacological enrichment of polygenic risk for precision medicine in complex disorders. Sci Rep, 10 (1) 879. doi:10.1038/s41598-020-57795-0. PMID 31964963
ABSTRACT
Individuals with complex disorders typically have a heritable burden of common variation that can be expressed as a polygenic risk score (PRS). While PRS has some predictive utility, it lacks the molecular specificity to be directly informative for clinical interventions. We therefore sought to develop a framework to quantify an individual's common variant enrichment in clinically actionable systems responsive to existing drugs. This was achieved with a metric designated the pharmagenic enrichment score (PES), which we demonstrate for individual SNP profiles in a cohort of cases with schizophrenia. A large proportion of these had elevated PES in one or more of eight clinically actionable gene-sets enriched with schizophrenia associated common variation. Notable candidates targeting these pathways included vitamins, antioxidants, insulin modulating agents, and cholinergic drugs. Interestingly, elevated PES was also observed in individuals with otherwise low common variant burden. The biological saliency of PES profiles were observed directly through their impact on gene expression in a subset of the cohort with matched transcriptomic data, supporting our assertion that this gene-set orientated approach could integrate an individual's common variant risk to inform personalised interventions, including drug repositioning, for complex disorders such as schizophrenia.
DOI
10.1038/s41598-020-57795-0
pgBoost
DESCRIPTION
pgBoost is an integrative modeling framework that trains a non-linear combination of existing linking strategies (including genomic distance) on fine-mapped eQTL data to assign a probabilistic score to each candidate SNP-gene link.
URL
KEYWORDS
eQTL-informed gradient boosting
PREPRINT_DOI
10.1101/2024.05.24.24307813
Main citation
Dorans, E. R., Jagadeesh, K., Dey, K., & Price, A. L. (2024). Linking regulatory variants to target genes by integrating single-cell multiome methods and genomic distance. medRxiv, 2024-05.
PGG.Han
PUBMED_LINK
URL
TITLE
PGG.Han: the Han Chinese genome database and analysis platform.
Main citation
Gao Y, Zhang C, Yuan L, Ling Y, ...&, Xu S. (2020) PGG.Han: the Han Chinese genome database and analysis platform. Nucleic Acids Res, 48 (D1) D971-D976. doi:10.1093/nar/gkz829. PMID 31584086
ABSTRACT
As the largest ethnic group in the world, the Han Chinese population is nonetheless underrepresented in global efforts to catalogue the genomic variability of natural populations. Here, we developed the PGG.Han, a population genome database to serve as the central repository for the genomic data of the Han Chinese Genome Initiative (Phase I). In its current version, the PGG.Han archives whole-genome sequences or high-density genome-wide single-nucleotide variants (SNVs) of 114 783 Han Chinese individuals (a.k.a. the Han100K), representing geographical sub-populations covering 33 of the 34 administrative divisions of China, as well as Singapore. The PGG.Han provides: (i) an interactive interface for visualization of the fine-scale genetic structure of the Han Chinese population; (ii) genome-wide allele frequencies of hierarchical sub-populations; (iii) ancestry inference for individual samples and controlling population stratification based on nested ancestry informative markers (AIMs) panels; (iv) population-structure-aware shared control data for genotype-phenotype association studies (e.g. GWASs) and (v) a Han-Chinese-specific reference panel for genotype imputation. Computational tools are implemented into the PGG.Han, and an online user-friendly interface is provided for data analysis and results visualization. The PGG.Han database is freely accessible via http://www.pgghan.org or https://www.hanchinesegenomes.org.
DOI
10.1093/nar/gkz829
PGG.Han panel (PGG.Han)
PUBMED_LINK
URL
TITLE
PGG.Han: the Han Chinese genome database and analysis platform.
Main citation
Gao Y, Zhang C, Yuan L, Ling Y, ...&, Xu S. (2020) PGG.Han: the Han Chinese genome database and analysis platform. Nucleic Acids Res, 48 (D1) D971-D976. doi:10.1093/nar/gkz829. PMID 31584086
ABSTRACT
As the largest ethnic group in the world, the Han Chinese population is nonetheless underrepresented in global efforts to catalogue the genomic variability of natural populations. Here, we developed the PGG.Han, a population genome database to serve as the central repository for the genomic data of the Han Chinese Genome Initiative (Phase I). In its current version, the PGG.Han archives whole-genome sequences or high-density genome-wide single-nucleotide variants (SNVs) of 114 783 Han Chinese individuals (a.k.a. the Han100K), representing geographical sub-populations covering 33 of the 34 administrative divisions of China, as well as Singapore. The PGG.Han provides: (i) an interactive interface for visualization of the fine-scale genetic structure of the Han Chinese population; (ii) genome-wide allele frequencies of hierarchical sub-populations; (iii) ancestry inference for individual samples and controlling population stratification based on nested ancestry informative markers (AIMs) panels; (iv) population-structure-aware shared control data for genotype-phenotype association studies (e.g. GWASs) and (v) a Han-Chinese-specific reference panel for genotype imputation. Computational tools are implemented into the PGG.Han, and an online user-friendly interface is provided for data analysis and results visualization. The PGG.Han database is freely accessible via http://www.pgghan.org or https://www.hanchinesegenomes.org.
DOI
10.1093/nar/gkz829
PGS-adjusted GWAS
PUBMED_LINK
DESCRIPTION
Adjusting GWAS analyses for polygenic scores (PGSs) increases the statistical power for discovery across all ancestries.
KEYWORDS
LOCO-PGSs, two-stage meta-analysis strategy
TITLE
Boosting the power of genome-wide association studies within and across ancestries by using polygenic scores.
Main citation
Campos AI, Namba S, Lin SC, Nam K, ...&, Yengo L. (2023) Boosting the power of genome-wide association studies within and across ancestries by using polygenic scores. Nat Genet, 55 (10) 1769-1776. doi:10.1038/s41588-023-01500-0. PMID 37723263
ABSTRACT
Genome-wide association studies (GWASs) have been mostly conducted in populations of European ancestry, which currently limits the transferability of their findings to other populations. Here, we show, through theory, simulations and applications to real data, that adjustment of GWAS analyses for polygenic scores (PGSs) increases the statistical power for discovery across all ancestries. We applied this method to analyze seven traits available in three large biobanks with participants of East Asian ancestry (n = 340,000 in total) and report 139 additional associations across traits. We also present a two-stage meta-analysis strategy whereby, in contributing cohorts, a PGS-adjusted GWAS is rerun using PGSs derived from a first round of a standard meta-analysis. On average, across traits, this approach yields a 1.26-fold increase in the number of detected associations (range 1.07- to 1.76-fold increase). Altogether, our study demonstrates the value of using PGSs to increase the power of GWASs in underrepresented populations and promotes such an analytical strategy for future GWAS meta-analyses.
DOI
10.1038/s41588-023-01500-0
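The power gain described above can be illustrated with a small simulation: adding a PGS as a covariate removes part of the residual variance, which shrinks the standard error of the tested SNP. This is a sketch under stated assumptions, not the authors' pipeline. The data are simulated, the effect sizes are arbitrary, the `pgs` vector merely stands in for a leave-one-chromosome-out PGS (assumed independent of the tested SNP), and `ols` is a hypothetical helper written for this example.

```python
import math
import random


def ols(y, columns):
    """Least squares of y on an intercept plus `columns`.

    Returns (coefficients, standard_errors); index 0 is the intercept.
    """
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    xtx = [[sum(X[r][a] * X[r][b] for r in range(n)) for b in range(p)] for a in range(p)]
    # Invert X'X by Gauss-Jordan elimination on the augmented matrix [X'X | I].
    aug = [row + [1.0 if a == b else 0.0 for b in range(p)] for a, row in enumerate(xtx)]
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        scale = aug[c][c]
        aug[c] = [v / scale for v in aug[c]]
        for r in range(p):
            if r != c:
                factor = aug[r][c]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[c])]
    inv = [row[p:] for row in aug]
    xty = [sum(X[r][a] * y[r] for r in range(n)) for a in range(p)]
    beta = [sum(inv[a][b] * xty[b] for b in range(p)) for a in range(p)]
    rss = sum((y[r] - sum(X[r][a] * beta[a] for a in range(p))) ** 2 for r in range(n))
    sigma2 = rss / (n - p)
    se = [math.sqrt(sigma2 * inv[a][a]) for a in range(p)]
    return beta, se


random.seed(1)
n = 2000
snp = [float(random.choice([0, 1, 2])) for _ in range(n)]  # tested variant (simulated)
pgs = [random.gauss(0, 1) for _ in range(n)]               # stand-in for a LOCO-PGS
y = [0.1 * g + 1.0 * s + random.gauss(0, 1) for g, s in zip(snp, pgs)]

_, se_plain = ols(y, [snp])
_, se_adjusted = ols(y, [snp, pgs])
print("SE without PGS:", se_plain[1])
print("SE with PGS:   ", se_adjusted[1])
```

In this setup the PGS explains about half of the phenotypic variance, so the adjusted standard error should be noticeably smaller than the unadjusted one.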
PGS-adjusted RVATs
PUBMED_LINK
FULL NAME
PGS-adjusted rare variant association tests
DESCRIPTION
Adjusting for common variant polygenic scores improves yield in gene-based rare variant association tests.
KEYWORDS
PGS, Rare variants
TITLE
Adjusting for common variant polygenic scores improves yield in rare variant association analyses.
Main citation
Jurgens SJ, Pirruccello JP, Choi SH, Morrill VN, ...&, Ellinor PT. (2023) Adjusting for common variant polygenic scores improves yield in rare variant association analyses. Nat Genet, 55 (4) 544-548. doi:10.1038/s41588-023-01342-w. PMID 36959364
ABSTRACT
With the emergence of large-scale sequencing data, methods for improving power in rare variant association tests are needed. Here we show that adjusting for common variant polygenic scores improves yield in gene-based rare variant association tests across 65 quantitative traits in the UK Biobank (up to 20% increase at α = 2.6 × 10⁻⁶), without marked increases in false-positive rates or genomic inflation. Benefits were seen for various models, with the largest improvements seen for efficient sparse mixed-effects models. Our results illustrate how polygenic score adjustment can efficiently improve power in rare variant association discovery.
DOI
10.1038/s41588-023-01342-w
PGS-hub
PUBMED_LINK
DESCRIPTION
The PGS-hub platform deploys eight single-ancestry PGS algorithms and two multi-ancestry PGS algorithms, providing comprehensive and versatile tools for genetic risk assessment.
URL
KEYWORDS
PGS-hub
TITLE
Comprehensive benchmarking single and multi ancestry polygenic score methods with the PGS-hub platform.
Main citation
Chen X, Wang F, Zhao H, Hao J, ...&, Wang M. (2026) Comprehensive benchmarking single and multi ancestry polygenic score methods with the PGS-hub platform. Nat Commun, 17 (1) . doi:10.1038/s41467-026-68599-7. PMID 41580418
ABSTRACT
Polygenic scores (PGS) quantify genetic contributions to complex traits, yet existing single- and multi-ancestry methods lack multi-dimensional evaluation within a unified framework. Here, we benchmarked 13 state-of-the-art PGS methods across 36 traits in UK Biobank European and African samples. The prediction performance, computational efficiency, the number of variants, and the impact of different linkage disequilibrium (LD) reference sizes were thoroughly assessed for each method. Results of single-ancestry methods demonstrate that LDpred2 has superior performance across a broad spectrum of complex traits in terms of accuracy and computational efficiency; however, other methods remain valuable for specific traits. For multi-ancestry methods, PRS-CSx and X-Wing have comparable performance, whereas LDpred2-multi outperforms both. Notably, we find that increasing the panel size of the LD reference significantly elevates PGS performance for sample sizes below 1,000, and it reaches a plateau when it exceeds 5,000 samples. Furthermore, implementing PGS calculation methods requires considerable technical effort and resource allocation. To support easy use of these PGS methods, we developed a user-friendly online computing platform, PGS-hub, that integrates all evaluated methods and is pre-configured with ancestry-stratified LD panels. This resource enables a scalable and harmonized PGS computation platform for the PGS community.
DOI
10.1038/s41467-026-68599-7
pgsc_calc
FULL NAME
The Polygenic Score Catalog Calculator
DESCRIPTION
pgsc_calc is a bioinformatics best-practice analysis pipeline for calculating polygenic [risk] scores on samples with imputed genotypes using existing scoring files from the Polygenic Score (PGS) Catalog and/or user-defined PGS/PRS.
URL
KEYWORDS
PRS calculation pipeline
Main citation
Lambert, Wingfield et al. (2024) The Polygenic Score Catalog: new functionality and tools to enable FAIR research. medRxiv. doi:10.1101/2024.05.29.24307783.
PGSCatalog (PGS Catalog)
PUBMED_LINK
FULL NAME
PGS Catalog
DESCRIPTION
The PGS Catalog is an open database of published polygenic scores (PGS). Each PGS in the Catalog is consistently annotated with relevant metadata, including scoring files (variants, effect alleles/weights), annotations of how the PGS was developed and applied, and evaluations of its predictive performance.
URL
KEYWORDS
PGS database
TITLE
The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation.
Main citation
Lambert SA, Gil L, Jupp S, Ritchie SC, ...&, Inouye M. (2021) The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation. Nat Genet, 53 (4) 420-425. doi:10.1038/s41588-021-00783-5. PMID 33692568
ABSTRACT
We present the Polygenic Score (PGS) Catalog (https://www.PGSCatalog.org), an open resource of published scores (including variants, alleles and weights) and consistently curated metadata required for reproducibility and independent applications. The PGS Catalog has capabilities for user deposition, expert curation and programmatic access, thus providing the community with a platform for PGS dissemination, research and translation.
DOI
10.1038/s41588-021-00783-5
PGSFusion
DESCRIPTION
PGSFusion is a free, comprehensive web server for constructing polygenic scores (PGS), evaluating their performance, and unlocking epidemiological insights. The server implements 16 leading summary statistics-based PGS methods in a standardized interface and rigorously assesses their predictive capabilities using the UK Biobank dataset.
URL
PREPRINT_DOI
10.1101/2024.08.05.606619
Main citation
Yang, S., Ye, X., Ji, X., Li, Z., Tian, M., Huang, P., & Cao, C. (2024). PGSFusion streamlines polygenic score construction and epidemiological applications in biobank-scale cohorts. bioRxiv, 2024-08.
pheweb
PUBMED_LINK
URL
TITLE
Exploring and visualizing large-scale genetic associations by using PheWeb.
Main citation
Gagliano Taliun SA, VandeHaar P, Boughton AP, Welch RP, ...&, Abecasis GR. (2020) Exploring and visualizing large-scale genetic associations by using PheWeb. Nat Genet, 52 (6) 550-552. doi:10.1038/s41588-020-0622-5. PMID 32504056
DOI
10.1038/s41588-020-0622-5
PHLASH
FULL NAME
Population History Learning by Averaging Sampled Histories
DESCRIPTION
PHLASH is a Bayesian method for inferring population size history from whole-genome sequence data using a coalescent-based hidden Markov model. It provides accurate and adaptive estimates with automatic uncertainty quantification, leveraging GPU acceleration for efficiency. It outperforms existing tools like SMC++ and MSMC2 in accuracy and computational speed, particularly with large sample sizes.
URL
KEYWORDS
population size inference, Bayesian demographic inference, coalescent model, ancestral recombination graphs, whole-genome sequencing, GPU acceleration
Main citation
Terhorst J. Accelerated Bayesian inference of population size history from recombining sequence data. Nat Genet. 2025; DOI: 10.1038/s41588-025-02323-x.
DOI
10.1038/s41588-025-02323-x
ARROW_SUMMARY
Whole-genome sequence data → Bayesian coalescent model with gradient-based likelihood evaluation → Posterior distribution of population size history with uncertainty quantification
AI_GENERATED
1.0
PLINK
PUBMED_LINK
DESCRIPTION
A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner. The focus of PLINK is purely on analysis of genotype/phenotype data, so there is no support for steps prior to this (e.g. study design and planning, generating genotype or CNV calls from raw data). Through integration with gPLINK and Haploview, there is some support for the subsequent visualization, annotation and storage of results.
URL
TITLE
PLINK: a tool set for whole-genome association and population-based linkage analyses.
Main citation
Purcell S, Neale B, Todd-Brown K, Thomas L, ...&, Sham PC. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet, 81 (3) 559-75. doi:10.1086/519795. PMID 17701901
ABSTRACT
Whole-genome association studies (WGAS) bring new computational, as well as analytic, challenges to researchers. Many existing genetic-analysis tools are not designed to handle such large data sets in a convenient manner and do not necessarily exploit the new opportunities that whole-genome data bring. To address these issues, we developed PLINK, an open-source C/C++ WGAS tool set. With PLINK, large data sets comprising hundreds of thousands of markers genotyped for thousands of individuals can be rapidly manipulated and analyzed in their entirety. As well as providing tools to make the basic analytic steps computationally efficient, PLINK also supports some novel approaches to whole-genome data that take advantage of whole-genome coverage. We introduce PLINK and describe the five main domains of function: data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation. In particular, we focus on the estimation and use of identity-by-state and identity-by-descent information in the context of population-based whole-genome studies. This information can be used to detect and correct for population stratification and to identify extended chromosomal segments that are shared identical by descent between very distantly related individuals. Analysis of the patterns of segmental sharing has the potential to map disease loci that contain multiple rare variants in a population-based linkage analysis.
DOI
10.1086/519795
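The abstract above highlights identity-by-state (IBS) information between individuals. As a minimal sketch of the underlying quantity, the function below averages the fraction of alleles shared per site for two genotype vectors. The function name, the 0/1/2 coding with `None` for missing calls, and the example data are assumptions made for illustration; this is not PLINK's implementation or file format.

```python
def ibs_similarity(genotypes_a, genotypes_b):
    """Mean identity-by-state proportion between two individuals.

    Genotypes are allele counts (0, 1, 2); None marks a missing call.
    Each site contributes 1 - |a - b| / 2, i.e. 1, 0.5 or 0.
    """
    shared, sites = 0.0, 0
    for a, b in zip(genotypes_a, genotypes_b):
        if a is None or b is None:
            continue  # skip sites where either call is missing
        shared += 1.0 - abs(a - b) / 2.0
        sites += 1
    if sites == 0:
        raise ValueError("no sites with both genotypes called")
    return shared / sites


ind1 = [0, 1, 2, 2, None, 1]
ind2 = [0, 2, 0, 2, 1, 1]
print(ibs_similarity(ind1, ind2))
```

One minus this value gives a pairwise distance of the kind that can feed clustering or multidimensional scaling.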
PLINK-MDS (MDS)
PUBMED_LINK
FULL NAME
multidimensional scaling
URL
KEYWORDS
MDS
TITLE
PLINK: a tool set for whole-genome association and population-based linkage analyses.
Main citation
Purcell S, Neale B, Todd-Brown K, Thomas L, ...&, Sham PC. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet, 81 (3) 559-75. doi:10.1086/519795. PMID 17701901
ABSTRACT
Whole-genome association studies (WGAS) bring new computational, as well as analytic, challenges to researchers. Many existing genetic-analysis tools are not designed to handle such large data sets in a convenient manner and do not necessarily exploit the new opportunities that whole-genome data bring. To address these issues, we developed PLINK, an open-source C/C++ WGAS tool set. With PLINK, large data sets comprising hundreds of thousands of markers genotyped for thousands of individuals can be rapidly manipulated and analyzed in their entirety. As well as providing tools to make the basic analytic steps computationally efficient, PLINK also supports some novel approaches to whole-genome data that take advantage of whole-genome coverage. We introduce PLINK and describe the five main domains of function: data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation. In particular, we focus on the estimation and use of identity-by-state and identity-by-descent information in the context of population-based whole-genome studies. This information can be used to detect and correct for population stratification and to identify extended chromosomal segments that are shared identical by descent between very distantly related individuals. Analysis of the patterns of segmental sharing has the potential to map disease loci that contain multiple rare variants in a population-based linkage analysis.
DOI
10.1086/519795
PLINK1.9
PUBMED_LINK
DESCRIPTION
PLINK: a tool set for whole-genome association and population-based linkage analyses. This is a comprehensive update to Shaun Purcell's PLINK command-line program, developed by Christopher Chang with support from the NIH-NIDDK's Laboratory of Biological Modeling, the Purcell Lab, and others.
URL
TITLE
PLINK: a tool set for whole-genome association and population-based linkage analyses.
Main citation
Purcell S, Neale B, Todd-Brown K, Thomas L, ...&, Sham PC. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet, 81 (3) 559-75. doi:10.1086/519795. PMID 17701901
ABSTRACT
Whole-genome association studies (WGAS) bring new computational, as well as analytic, challenges to researchers. Many existing genetic-analysis tools are not designed to handle such large data sets in a convenient manner and do not necessarily exploit the new opportunities that whole-genome data bring. To address these issues, we developed PLINK, an open-source C/C++ WGAS tool set. With PLINK, large data sets comprising hundreds of thousands of markers genotyped for thousands of individuals can be rapidly manipulated and analyzed in their entirety. As well as providing tools to make the basic analytic steps computationally efficient, PLINK also supports some novel approaches to whole-genome data that take advantage of whole-genome coverage. We introduce PLINK and describe the five main domains of function: data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation. In particular, we focus on the estimation and use of identity-by-state and identity-by-descent information in the context of population-based whole-genome studies. This information can be used to detect and correct for population stratification and to identify extended chromosomal segments that are shared identical by descent between very distantly related individuals. Analysis of the patterns of segmental sharing has the potential to map disease loci that contain multiple rare variants in a population-based linkage analysis.
DOI
10.1086/519795
PLINK2
PUBMED_LINK
URL
TITLE
Second-generation PLINK: rising to the challenge of larger and richer datasets.
Main citation
Chang CC, Chow CC, Tellier LC, Vattikuti S, ...&, Lee JJ. (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience, 4, 7. doi:10.1186/s13742-015-0047-8. PMID 25722852
ABSTRACT
BACKGROUND: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format. FINDINGS: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(√n)-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0). CONCLUSIONS: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
DOI
10.1186/s13742-015-0047-8
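The abstract above credits much of the speed-up to bit-level parallelism. The sketch below shows the general idea: genotypes are packed two bits each into one machine-word-sized integer, and the total allele count is obtained with two masked popcounts instead of a per-genotype loop. The 2-bit coding used here (the value is the alt-allele count, with binary 11 unused) is a hypothetical choice for illustration and is not PLINK's actual .bed encoding; the function names are also hypothetical.

```python
LOW_BITS = 0x5555555555555555  # selects the low bit of each 2-bit slot in a 64-bit word


def pack(genotypes):
    """Pack up to 32 genotypes (alt-allele counts 0, 1, 2) into one integer, 2 bits each."""
    if len(genotypes) > 32 or any(g not in (0, 1, 2) for g in genotypes):
        raise ValueError("expects at most 32 genotypes coded 0, 1 or 2")
    word = 0
    for i, g in enumerate(genotypes):
        word |= g << (2 * i)
    return word


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def alt_allele_count(word):
    """Total alt-allele count over all packed genotypes using two popcounts."""
    ones = popcount(word & LOW_BITS)         # slots holding 1 (binary 01)
    twos = popcount((word >> 1) & LOW_BITS)  # slots holding 2 (binary 10)
    return ones + 2 * twos


genotypes = [0, 1, 2, 2, 1, 0, 2]
print(alt_allele_count(pack(genotypes)))
```

In compiled code each popcount maps to a single CPU instruction on many processors, so 32 genotypes are summarized in a handful of operations.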
PLINK2
PUBMED_LINK
DESCRIPTION
The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
URL
TITLE
Second-generation PLINK: rising to the challenge of larger and richer datasets.
Main citation
Chang CC, Chow CC, Tellier LC, Vattikuti S, ...&, Lee JJ. (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience, 4, 7. doi:10.1186/s13742-015-0047-8. PMID 25722852
ABSTRACT
BACKGROUND: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format. FINDINGS: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(√n)-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0). CONCLUSIONS: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
DOI
10.1186/s13742-015-0047-8
PLINK2
PUBMED_LINK
DESCRIPTION
The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility.
URL
USE
Calculate PRS from genotype data.
TITLE
Second-generation PLINK: rising to the challenge of larger and richer datasets.
Main citation
Chang CC, Chow CC, Tellier LC, Vattikuti S, ...&, Lee JJ. (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience, 4, 7. doi:10.1186/s13742-015-0047-8. PMID 25722852
ABSTRACT
BACKGROUND: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format. FINDINGS: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, [Formula: see text]-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0). CONCLUSIONS: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
DOI
10.1186/s13742-015-0047-8
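The USE field above mentions calculating a PRS from genotype data. As a rough illustration of what that computation amounts to, the sketch below sums effect-allele dosages weighted by per-allele effect sizes. It is not PLINK's implementation: the function name, variant IDs, weights, and the choice to skip missing genotypes are all assumptions made for this example.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages for one individual.

    dosages: dict of variant ID -> effect-allele dosage (0 to 2), or None if missing
    weights: dict of variant ID -> per-allele effect size (for example a GWAS beta)
    Returns (score, number of variants actually used).
    """
    total = 0.0
    used = 0
    for variant, beta in weights.items():
        dosage = dosages.get(variant)
        if dosage is None:
            # Assumption for this sketch: missing genotypes are skipped,
            # not imputed from allele frequencies.
            continue
        total += beta * dosage
        used += 1
    return total, used


# Hypothetical weights and genotypes, for illustration only.
weights = {"rs1": 0.10, "rs2": -0.05, "rs3": 0.20}
genotypes = {
    "sample_A": {"rs1": 2, "rs2": 1, "rs3": 0},
    "sample_B": {"rs1": 0, "rs2": 2, "rs3": None},
}
scores = {sample: polygenic_score(d, weights) for sample, d in genotypes.items()}
for sample, (score, used) in scores.items():
    print(sample, round(score, 4), used)
```

Returning the number of variants used alongside the score makes it easy to spot samples whose scores rest on fewer variants than expected.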
POLMM
PUBMED_LINK
FULL NAME
proportional odds logistic mixed model (POLMM)
DESCRIPTION
Proportional Odds Logistic Mixed Model (POLMM) for ordinal categorical data analysis
URL
KEYWORDS
ordinal categorical phenotypes
TITLE
Efficient mixed model approach for large-scale genome-wide association studies of ordinal categorical phenotypes.
Main citation
Bi W, Zhou W, Dey R, Mukherjee B, ...&, Lee S. (2021) Efficient mixed model approach for large-scale genome-wide association studies of ordinal categorical phenotypes. Am J Hum Genet, 108 (5) 825-839. doi:10.1016/j.ajhg.2021.03.019. PMID 33836139
ABSTRACT
In genome-wide association studies, ordinal categorical phenotypes are widely used to measure human behaviors, satisfaction, and preferences. However, because of the lack of analysis tools, methods designed for binary or quantitative traits are commonly used inappropriately to analyze categorical phenotypes. To accurately model the dependence of an ordinal categorical phenotype on covariates, we propose an efficient mixed model association test, proportional odds logistic mixed model (POLMM). POLMM is computationally efficient to analyze large datasets with hundreds of thousands of samples, can control type I error rates at a stringent significance level regardless of the phenotypic distribution, and is more powerful than alternative methods. In contrast, the standard linear mixed model approaches cannot control type I error rates for rare variants when the phenotypic distribution is unbalanced, although they performed well when testing common variants. We applied POLMM to 258 ordinal categorical phenotypes on array genotypes and imputed samples from 408,961 individuals in UK Biobank. In total, we identified 5,885 genome-wide significant variants, of which, 424 variants (7.2%) are rare variants with MAF < 0.01.
DOI
10.1016/j.ajhg.2021.03.019
POP-GWAS
PUBMED_LINK
FULL NAME
Post-Prediction GWAS
DESCRIPTION
POP-TOOLS (POst-Prediction TOOLS) is a Python3-based command line toolkit for conducting valid and powerful machine learning (ML)-assisted genetic association studies.
URL
KEYWORDS
imputed phenotypes, 3 GWASs
TITLE
Valid inference for machine learning-assisted genome-wide association studies.
Main citation
Miao J, Wu Y, Sun Z, Miao X, ...&, Lu Q. (2024) Valid inference for machine learning-assisted genome-wide association studies. Nat Genet, 56 (11) 2361-2369. doi:10.1038/s41588-024-01934-0. PMID 39349818
ABSTRACT
Machine learning (ML) has become increasingly popular in almost all scientific disciplines, including human genetics. Owing to challenges related to sample collection and precise phenotyping, ML-assisted genome-wide association studies (GWASs), which use sophisticated ML techniques to impute phenotypes and then perform GWAS on the imputed outcomes, have become increasingly common in complex trait genetics research. However, the validity of ML-assisted GWAS associations has not been carefully evaluated. Here, we report pervasive risks for false-positive associations in ML-assisted GWAS and introduce Post-Prediction GWAS (POP-GWAS), a statistical framework that redesigns GWAS on ML-imputed outcomes. POP-GWAS ensures valid and powerful statistical inference irrespective of imputation quality and choice of algorithm, requiring only GWAS summary statistics as input. We employed POP-GWAS to perform a GWAS of bone mineral density derived from dual-energy X-ray absorptiometry imaging at 14 skeletal sites, identifying 89 new loci and revealing skeletal site-specific genetic architecture. Our framework offers a robust analytic solution for future ML-assisted GWAS.
DOI
10.1038/s41588-024-01934-0
popcorn
PUBMED_LINK
DESCRIPTION
Popcorn is a program for estimating the correlation of causal variant effect sizes across populations in GWAS. This is the Python 3 version of Popcorn and is still under development.
URL
KEYWORDS
trans-ethnic
TITLE
Transethnic Genetic-Correlation Estimates from Summary Statistics.
Main citation
Brown BC, Asian Genetic Epidemiology Network Type 2 Diabetes Consortium, Ye CJ, Price AL, ...&, Zaitlen N. (2016) Transethnic Genetic-Correlation Estimates from Summary Statistics. Am J Hum Genet, 99 (1) 76-88. doi:10.1016/j.ajhg.2016.05.001. PMID 27321947
ABSTRACT
The increasing number of genetic association studies conducted in multiple populations provides an unprecedented opportunity to study how the genetic architecture of complex phenotypes varies between populations, a problem important for both medical and population genetics. Here, we have developed a method for estimating the transethnic genetic correlation: the correlation of causal-variant effect sizes at SNPs common in populations. This method takes advantage of the entire spectrum of SNP associations and uses only summary-level data from genome-wide association studies. This avoids the computational costs and privacy concerns associated with genotype-level information while remaining scalable to hundreds of thousands of individuals and millions of SNPs. We applied our method to data on gene expression, rheumatoid arthritis, and type 2 diabetes and overwhelmingly found that the genetic correlation was significantly less than 1. Our method is implemented in a Python package called Popcorn.
DOI
10.1016/j.ajhg.2016.05.001
popEVE
PUBMED_LINK
DESCRIPTION
popEVE is a proteome-wide deep generative model that scores missense variant pathogenicity by combining cross-species evolutionary predictors with human population cohort data, aiming for well-calibrated, human-specific deleteriousness estimates.
URL
KEYWORDS
missense, pathogenicity, deep generative model, proteome-wide, UK Biobank
TITLE
Proteome-wide model for human disease genetics.
Main citation
Orenbuch R, Shearer CA, Kollasch AW, Spinner AD, ...&, Marks DS. (2025) Proteome-wide model for human disease genetics. Nat Genet, 57 (12) 3165-3174. doi:10.1038/s41588-025-02400-1. PMID 41286104
ABSTRACT
Missense variants remain a challenge in genetic interpretation owing to their subtle and context-dependent effects. Although current prediction models perform well in known disease genes, their scores are not calibrated across the proteome, limiting generalizability. To address this knowledge gap, we developed popEVE, a deep generative model combining evolutionary and human population data to estimate variant deleteriousness on a proteome-wide scale. popEVE achieves state-of-the-art performance without overestimating the burden of deleterious variants and identifies variants in 442 genes in a severe developmental disorder cohort, including 123 novel candidates. These genes are functionally similar to known disease genes, and their variants often localize to critical regions. Remarkably, popEVE can prioritize likely causal variants using only child exomes, enabling diagnosis even without parental sequencing. This work provides a generalizable framework for rare disease variant interpretation, especially in singleton cases, and demonstrates the utility of calibrated, evolution-informed scoring models for clinical genomics.
DOI
10.1038/s41588-025-02400-1
popgen
PUBMED_LINK
FULL NAME
Geography of Genetic Variants Browser
URL
TITLE
Visualizing the geography of genetic variants.
Main citation
Marcus JH, Novembre J. (2017) Visualizing the geography of genetic variants. Bioinformatics, 33 (4) 594-595. doi:10.1093/bioinformatics/btw643. PMID 27742697
ABSTRACT
SUMMARY: One of the key characteristics of any genetic variant is its geographic distribution. The geographic distribution can shed light on where an allele first arose, what populations it has spread to, and in turn on how migration, genetic drift, and natural selection have acted. The geographic distribution of a genetic variant can also be of great utility for medical/clinical geneticists and collectively many genetic variants can reveal population structure. Here we develop an interactive visualization tool for rapidly displaying the geographic distribution of genetic variants. Through a REST API and dynamic front-end, the Geography of Genetic Variants (GGV) browser ( http://popgen.uchicago.edu/ggv/ ) provides maps of allele frequencies in populations distributed across the globe. AVAILABILITY AND IMPLEMENTATION: GGV is implemented as a website ( http://popgen.uchicago.edu/ggv/ ) which employs an API to access frequency data ( http://popgen.uchicago.edu/freq_api/ ). Python and javascript source code for the website and the API are available at: http://github.com/NovembreLab/ggv/ and http://github.com/NovembreLab/ggv-api/ . CONTACT: jnovembre@uchicago.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
DOI
10.1093/bioinformatics/btw643
PoPs
PUBMED_LINK
FULL NAME
gene-level Polygenic Priority Score (PoPS)
DESCRIPTION
PoPS is a gene prioritization method that leverages genome-wide signal from GWAS summary statistics and incorporates data from an extensive set of public bulk and single-cell expression datasets, curated biological pathways, and predicted protein-protein interactions.
URL
TITLE
Leveraging polygenic enrichments of gene features to predict genes underlying complex traits and diseases.
Main citation
Weeks EM, Ulirsch JC, Cheng NY, Trippe BL, ...&, Finucane HK. (2023) Leveraging polygenic enrichments of gene features to predict genes underlying complex traits and diseases. Nat Genet, 55 (8) 1267-1276. doi:10.1038/s41588-023-01443-6. PMID 37443254
ABSTRACT
Genome-wide association studies (GWASs) are a valuable tool for understanding the biology of complex human traits and diseases, but associated variants rarely point directly to causal genes. In the present study, we introduce a new method, polygenic priority score (PoPS), that learns trait-relevant gene features, such as cell-type-specific expression, to prioritize genes at GWAS loci. Using a large evaluation set of genes with fine-mapped coding variants, we show that PoPS and the closest gene individually outperform other gene prioritization methods, but observe the best overall performance by combining PoPS with orthogonal methods. Using this combined approach, we prioritize 10,642 unique gene-trait pairs across 113 complex traits and diseases with high precision, finding not only well-established gene-trait relationships but nominating new genes at unresolved loci, such as LGR4 for estimated glomerular filtration rate and CCR7 for deep vein thrombosis. Overall, we demonstrate that PoPS provides a powerful addition to the gene prioritization toolbox.
DOI
10.1038/s41588-023-01443-6
Porter
PUBMED_LINK
TITLE
Multivariate simulation framework reveals performance of multi-trait GWAS methods.
Main citation
Porter HF, O'Reilly PF. (2017) Multivariate simulation framework reveals performance of multi-trait GWAS methods. Sci Rep, 7 () 38837. doi:10.1038/srep38837. PMID 28287610
ABSTRACT
Burgeoning availability of genome-wide association study (GWAS) results and national biobank data has led to growing interest in performing multi-trait genetic analyses. Numerous multi-trait GWAS methods that exploit either summary statistics or individual-level data have been developed, but their relative performance is unclear. Here we develop a simulation framework to model the complex networks underlying multivariate genetic epidemiology, enabling the vast model space of genetic effects on multiple correlated traits to be explored systematically. We perform a comprehensive comparison of the leading multi-trait GWAS methods, finding: (1) method performance is highly sensitive to the specific combination of genetic effects and phenotypic correlations, (2) most of the current multivariate methods have remarkably similar statistical power, and (3) multivariate methods may offer a substantial increase in the discovery of genetic variants over the standard univariate approach. We believe our findings offer the clearest picture to date of the relative performance of multi-trait GWAS methods and act as a guide for method selection. We provide a web application and open-source software program implementing our simulation framework, for: (i) further benchmarking of multivariate GWAS methods, (ii) power calculations for multivariate genetic studies, and (iii) generating data for testing any multivariate method in genetic epidemiology.
DOI
10.1038/srep38837
PP-GWAS
PUBMED_LINK
DESCRIPTION
Privacy-preserving framework for multi-site GWAS on quantitative traits using a distributed linear mixed model and randomized encoding so servers never see raw genotypes or phenotypes—only obfuscated intermediates—while improving speed versus several cryptographic baselines.
URL
KEYWORDS
Privacy-preserving GWAS, multi-site, quantitative traits, federated analysis
TITLE
PP-GWAS: Privacy Preserving Multi-Site Genome-wide Association Studies.
Main citation
Swaminathan A, Hannemann A, Ünal AB, Pfeifer N, ...&, Akgün M. (2025) PP-GWAS: Privacy Preserving Multi-Site Genome-wide Association Studies. Nat Commun, 16 (1) 11030. doi:10.1038/s41467-025-66771-z. PMID 41365878
ABSTRACT
Genome-wide association studies help uncover genetic influences on complex traits and diseases. Importantly, multi-site data collaborations enhance the statistical power of these studies but pose challenges due to the sensitivity of genomic data. Existing privacy-preserving approaches to performing multi-site genome-wide association studies rely on computationally expensive cryptographic techniques, which limit applicability. To address this, we present PP-GWAS, a privacy-preserving algorithm that improves efficiency and scalability while maintaining data privacy. Our method leverages randomized encoding within a distributed framework to perform stacked ridge regression on a linear mixed model, enabling robust analysis of quantitative phenotypes. We show experimentally using real-world and synthetic data that our approach achieves twice the computational speed of comparable methods while reducing resource consumption.
DOI
10.1038/s41467-025-66771-z
PredInterval
PUBMED_LINK
DESCRIPTION
PredInterval constructs statistically calibrated prediction intervals for phenotypes predicted from polygenic scores, compatible with arbitrary PGS methods and supporting individual-level data or GWAS summary statistics plus a small calibration sample.
URL
KEYWORDS
polygenic score, prediction interval, uncertainty, calibration, summary statistics
TITLE
Statistical construction of calibrated prediction intervals for polygenic score-based phenotype prediction.
Main citation
Xu C, Ganesh SK, Zhou X. (2025) Statistical construction of calibrated prediction intervals for polygenic score-based phenotype prediction. Nat Genet, 57 (11) 2891-2900. doi:10.1038/s41588-025-02360-6. PMID 41083720
ABSTRACT
Accurately quantifying uncertainty in predicted phenotypes from polygenic score (PGS)-based applications is essential for reliable clinical interpretation of PGS, supporting effective disease risk assessment and informed decision-making. Here, we present PredInterval, a nonparametric method for constructing well-calibrated prediction intervals. PredInterval is compatible with any PGS method, takes either individual-level data or summary statistics as input and relies on information from quantiles of phenotypic residuals through cross-validation to achieve well-calibrated coverage of true phenotypic values across diverse genetic architectures. We apply PredInterval to analyze 17 traits in real-data applications, where PredInterval not only represents the sole method achieving well-calibrated prediction coverage across traits, but it also offers a principled approach to identify high-risk individuals using prediction intervals, leading to an average improvement of identification rates by 8.7-830.4% compared with existing approaches. Overall, PredInterval represents a robust and versatile tool for enhancing the clinical utility of PGS.
DOI
10.1038/s41588-025-02360-6
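The abstract above describes building prediction intervals from quantiles of phenotypic residuals. The sketch below shows that general idea in its simplest form, a split-conformal-style interval from held-out absolute residuals. It is not the PredInterval algorithm, which relies on cross-validation and is specified in the paper; the function names and residual values here are hypothetical.

```python
import math


def residual_quantile(residuals, coverage):
    """Quantile of absolute calibration residuals with a finite-sample correction."""
    abs_res = sorted(abs(r) for r in residuals)
    n = len(abs_res)
    # 1-based rank ceil((n + 1) * coverage); clamped to n in this sketch,
    # whereas a strict conformal interval would be unbounded in that case.
    rank = min(math.ceil((n + 1) * coverage), n)
    return abs_res[rank - 1]


def prediction_interval(predicted, residuals, coverage=0.9):
    """Symmetric interval around a predicted phenotype value."""
    half_width = residual_quantile(residuals, coverage)
    return predicted - half_width, predicted + half_width


# Hypothetical residuals (observed minus PGS-predicted phenotype) from a
# small calibration sample, for illustration only.
calibration_residuals = [0.5, -1.0, 0.2, 1.5, -0.3, 0.8, -0.6, 0.1, -1.2]
low, high = prediction_interval(3.0, calibration_residuals, coverage=0.8)
print(low, high)
```

Because the interval depends only on predictions and residuals, the same construction can wrap any scoring method, which mirrors the compatibility claim in the description above.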
PrediXcan
PUBMED_LINK
DESCRIPTION
(deprecated) PrediXcan is a gene-based association test that prioritizes genes that are likely to be causal for the phenotype.
URL
TITLE
A gene-based association method for mapping traits using reference transcriptome data.
Main citation
Gamazon ER, Wheeler HE, Shah KP, Mozaffari SV, ...&, Im HK. (2015) A gene-based association method for mapping traits using reference transcriptome data. Nat Genet, 47 (9) 1091-8. doi:10.1038/ng.3367. PMID 26258848
ABSTRACT
Genome-wide association studies (GWAS) have identified thousands of variants robustly associated with complex traits. However, the biological mechanisms underlying these associations are, in general, not well understood. We propose a gene-based association method called PrediXcan that directly tests the molecular mechanisms through which genetic variation affects phenotype. The approach estimates the component of gene expression determined by an individual's genetic profile and correlates 'imputed' gene expression with the phenotype under investigation to identify genes involved in the etiology of the phenotype. Genetically regulated gene expression is estimated using whole-genome tissue-dependent prediction models trained with reference transcriptome data sets. PrediXcan enjoys the benefits of gene-based approaches such as reduced multiple-testing burden and a principled approach to the design of follow-up experiments. Our results demonstrate that PrediXcan can detect known and new genes associated with disease traits and provide insights into the mechanism of these associations.
DOI
10.1038/ng.3367
Priority index
PUBMED_LINK
DESCRIPTION
A Comprehensive Resource for Genetic Targets in Immune-Mediated Disease
URL
TITLE
A genetics-led approach defines the drug target landscape of 30 immune-related traits.
Main citation
Fang H, ULTRA-DD Consortium, De Wolf H, Knezevic B, ...&, Knight JC. (2019) A genetics-led approach defines the drug target landscape of 30 immune-related traits. Nat Genet, 51 (7) 1082-1091. doi:10.1038/s41588-019-0456-1. PMID 31253980
ABSTRACT
Most candidate drugs currently fail later-stage clinical trials, largely due to poor prediction of efficacy on early target selection. Drug targets with genetic support are more likely to be therapeutically valid, but the translational use of genome-scale data such as from genome-wide association studies for drug target discovery in complex diseases remains challenging. Here, we show that integration of functional genomic and immune-related annotations, together with knowledge of network connectivity, maximizes the informativeness of genetics for target validation, defining the target prioritization landscape for 30 immune traits at the gene and pathway level. We demonstrate how our genetics-led drug target prioritization approach (the priority index) successfully identifies current therapeutics, predicts activity in high-throughput cellular screens (including L1000, CRISPR, mutagenesis and patient-derived cell assays), enables prioritization of under-explored targets and allows for determination of target-level trait relationships. The priority index is an open-access, scalable system accelerating early-stage drug target selection for immune-mediated disease.
DOI
10.1038/s41588-019-0456-1
PROSPER
PUBMED_LINK
FULL NAME
Polygenic Risk scOres based on enSemble of PEnalized Regression models
DESCRIPTION
PROSPER is a multi-ancestry PRS method that applies penalized regression followed by ensemble learning. The software is a command line tool based on the R programming language. A large-scale benchmarking study shows that PROSPER could be the leading method for reducing the disparity of PRS performance across ancestry groups.
URL
TITLE
An ensemble penalized regression method for multi-ancestry polygenic risk prediction.
Main citation
Zhang J, Zhan J, Jin J, Ma C, ...&, Chatterjee N. (2024) An ensemble penalized regression method for multi-ancestry polygenic risk prediction. Nat Commun, 15 (1) 3238. doi:10.1038/s41467-024-47357-7. PMID 38622117
ABSTRACT
Great efforts are being made to develop advanced polygenic risk scores (PRS) to improve the prediction of complex traits and diseases. However, most existing PRS are primarily trained on European ancestry populations, limiting their transferability to non-European populations. In this article, we propose a novel method for generating multi-ancestry Polygenic Risk scOres based on enSemble of PEnalized Regression models (PROSPER). PROSPER integrates genome-wide association studies (GWAS) summary statistics from diverse populations to develop ancestry-specific PRS with improved predictive power for minority populations. The method uses a combination of L1 (lasso) and L2 (ridge) penalty functions, a parsimonious specification of the penalty parameters across populations, and an ensemble step to combine PRS generated across different penalty parameters. We evaluate the performance of PROSPER and other existing methods on large-scale simulated and real datasets, including those from 23andMe Inc., the Global Lipids Genetics Consortium, and All of Us. Results show that PROSPER can substantially improve multi-ancestry polygenic prediction compared to alternative methods across a wide variety of genetic architectures. In real data analyses, for example, PROSPER increased out-of-sample prediction R² for continuous traits by an average of 70% compared to a state-of-the-art Bayesian method (PRS-CSx) in the African ancestry population. Further, PROSPER is computationally highly scalable for the analysis of large SNP contents and many diverse populations.
DOI
10.1038/s41467-024-47357-7
Protter
PUBMED_LINK
DESCRIPTION
interactive protein feature visualization and integration with experimental proteomic data
URL
TITLE
Protter: interactive protein feature visualization and integration with experimental proteomic data.
Main citation
Omasits U, Ahrens CH, Müller S, Wollscheid B. (2014) Protter: interactive protein feature visualization and integration with experimental proteomic data. Bioinformatics, 30 (6) 884-6. doi:10.1093/bioinformatics/btt607. PMID 24162465
ABSTRACT
SUMMARY: The ability to integrate and visualize experimental proteomic evidence in the context of rich protein feature annotations represents an unmet need of the proteomics community. Here we present Protter, a web-based tool that supports interactive protein data analysis and hypothesis generation by visualizing both annotated sequence features and experimental proteomic data in the context of protein topology. Protter supports numerous proteomic file formats and automatically integrates a variety of reference protein annotation sources, which can be readily extended via modular plug-ins. A built-in export function produces publication-quality customized protein illustrations, also for large datasets. Visualizations of surfaceome datasets show the specific utility of Protter for the integrated visual analysis of membrane proteins and peptide selection for targeted proteomics. AVAILABILITY AND IMPLEMENTATION: The Protter web application is available at http://wlab.ethz.ch/protter. Source code and installation instructions are available at http://ulo.github.io/Protter/. CONTACT: wbernd@ethz.ch SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
DOI
10.1093/bioinformatics/btt607
PRS atlas
PUBMED_LINK
DESCRIPTION
This web application can be used to query findings from an analysis of 162 polygenic risk scores and 551 complex traits using data from the UK Biobank study. Traits were selected based on the heritability analysis conducted by the Neale Lab (P<0.05). We encourage users of this resource to conduct follow-up analyses of associations to robustly identify causal relationships between complex traits.
URL
TITLE
An atlas of polygenic risk score associations to highlight putative causal relationships across the human phenome.
Main citation
Richardson TG, Harrison S, Hemani G, Davey Smith G. (2019) An atlas of polygenic risk score associations to highlight putative causal relationships across the human phenome. Elife, 8 () . doi:10.7554/eLife.43657. PMID 30835202
ABSTRACT
The age of large-scale genome-wide association studies (GWAS) has provided us with an unprecedented opportunity to evaluate the genetic liability of complex disease using polygenic risk scores (PRS). In this study, we have analysed 162 PRS (p < 5×10⁻⁵) derived from GWAS and 551 heritable traits from the UK Biobank study (N = 334,398). Findings can be investigated using a web application (http://mrcieu.mrsoftware.org/PRS_atlas/), which we envisage will help uncover both known and novel mechanisms which contribute towards disease susceptibility. To demonstrate this, we have investigated the results from a phenome-wide evaluation of schizophrenia genetic liability. Amongst findings were inverse associations with measures of cognitive function, for which extensive follow-up analyses using Mendelian randomization (MR) provided evidence of a causal relationship. We have also investigated the effect of multiple risk factors on disease using mediation and multivariable MR frameworks. Our atlas provides a resource for future endeavours seeking to unravel the causal determinants of complex disease.
DOI
10.7554/eLife.43657
PRS credible intervals
PUBMED_LINK
URL
KEYWORDS
uncertainty
TITLE
Large uncertainty in individual polygenic risk score estimation impacts PRS-based risk stratification.
Main citation
Ding Y, Hou K, Burch KS, Lapinska S, ...&, Pasaniuc B. (2022) Large uncertainty in individual polygenic risk score estimation impacts PRS-based risk stratification. Nat Genet, 54 (1) 30-39. doi:10.1038/s41588-021-00961-5. PMID 34931067
ABSTRACT
Although the cohort-level accuracy of polygenic risk scores (PRSs), estimates of genetic value at the individual level, has been widely assessed, uncertainty in PRSs remains underexplored. In the present study, we show that Bayesian PRS methods can estimate the variance of an individual's PRS and can yield well-calibrated credible intervals via posterior sampling. For 13 real traits in the UK Biobank (n = 291,273 unrelated 'white British'), we observe large variances in individual PRS estimates which impact interpretation of PRS-based stratification; averaging across traits, only 0.8% (s.d. = 1.6%) of individuals with PRS point estimates in the top decile have corresponding 95% credible intervals fully contained in the top decile. We provide an analytical estimator for the expectation of individual PRS variance as a function of SNP heritability, number of causal SNPs and sample size. Our results showcase the importance of incorporating uncertainty in individual PRS estimates into subsequent analyses.
DOI
10.1038/s41588-021-00961-5
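The abstract above notes that Bayesian PRS methods can yield credible intervals for an individual's score via posterior sampling. A minimal sketch of that step follows, assuming posterior draws of per-variant effect sizes are already available from some Bayesian method. The draws are simulated here with hypothetical means and spread, and the function names are illustrative, not taken from the authors' software.

```python
import math
import random


def individual_prs_draws(dosages, effect_samples):
    """One PRS value per posterior draw of the per-variant effect sizes."""
    return [sum(b * g for b, g in zip(sample, dosages)) for sample in effect_samples]


def credible_interval(draws, level=0.95):
    """Equal-tailed interval taken from the order statistics of the draws."""
    ordered = sorted(draws)
    tail = (1.0 - level) / 2.0
    last = len(ordered) - 1
    lower = ordered[math.floor(tail * last)]
    upper = ordered[math.ceil((1.0 - tail) * last)]
    return lower, upper


# Hypothetical posterior: 1000 draws for three variants, simulated as
# independent normal draws around assumed posterior means.
rng = random.Random(0)
posterior_means = [0.10, -0.05, 0.20]
effect_samples = [[rng.gauss(m, 0.05) for m in posterior_means] for _ in range(1000)]

dosages = [2, 1, 0]  # one individual's effect-allele dosages
draws = individual_prs_draws(dosages, effect_samples)
point_estimate = sum(draws) / len(draws)
lower, upper = credible_interval(draws)
print(point_estimate, lower, upper)
```

Checking whether an interval lies entirely within a given decile of the cohort's point estimates is the kind of stratification check the abstract reports.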
PRS-CS
PUBMED_LINK
DESCRIPTION
PRS-CS is a Python based command line tool that infers posterior SNP effect sizes under continuous shrinkage (CS) priors using GWAS summary statistics and an external LD reference panel.
URL
KEYWORDS
continuous shrinkage (CS) prior
TITLE
Polygenic prediction via Bayesian regression and continuous shrinkage priors.
Main citation
Ge T, Chen CY, Ni Y, Feng YA, ...&, Smoller JW. (2019) Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat Commun, 10 (1) 1776. doi:10.1038/s41467-019-09718-5. PMID 30992449
ABSTRACT
Polygenic risk scores (PRS) have shown promise in predicting human complex traits and diseases. Here, we present PRS-CS, a polygenic prediction method that infers posterior effect sizes of single nucleotide polymorphisms (SNPs) using genome-wide association summary statistics and an external linkage disequilibrium (LD) reference panel. PRS-CS utilizes a high-dimensional Bayesian regression framework, and is distinct from previous work by placing a continuous shrinkage (CS) prior on SNP effect sizes, which is robust to varying genetic architectures, provides substantial computational advantages, and enables multivariate modeling of local LD patterns. Simulation studies using data from the UK Biobank show that PRS-CS outperforms existing methods across a wide range of genetic architectures, especially when the training sample size is large. We apply PRS-CS to predict six common complex diseases and six quantitative traits in the Partners HealthCare Biobank, and further demonstrate the improvement of PRS-CS in prediction accuracy over alternative methods.
DOI
10.1038/s41467-019-09718-5
PRS-CSx
PUBMED_LINK
DESCRIPTION
PRS-CSx is a Python based command line tool that integrates GWAS summary statistics and external LD reference panels from multiple populations to improve cross-population polygenic prediction. Posterior SNP effect sizes are inferred under coupled continuous shrinkage (CS) priors across populations.
URL
KEYWORDS
continuous shrinkage (CS) prior, cross-population
TITLE
Improving polygenic prediction in ancestrally diverse populations.
Main citation
Ruan Y, Lin YF, Feng YA, Chen CY, ...&, Ge T. (2022) Improving polygenic prediction in ancestrally diverse populations. Nat Genet, 54 (5) 573-580. doi:10.1038/s41588-022-01054-7. PMID 35513724
ABSTRACT
Polygenic risk scores (PRS) have attenuated cross-population predictive performance. As existing genome-wide association studies (GWAS) have been conducted predominantly in individuals of European descent, the limited transferability of PRS reduces their clinical value in non-European populations, and may exacerbate healthcare disparities. Recent efforts to level ancestry imbalance in genomic research have expanded the scale of non-European GWAS, although most remain underpowered. Here, we present a new PRS construction method, PRS-CSx, which improves cross-population polygenic prediction by integrating GWAS summary statistics from multiple populations. PRS-CSx couples genetic effects across populations via a shared continuous shrinkage (CS) prior, enabling more accurate effect size estimation by sharing information between summary statistics and leveraging linkage disequilibrium diversity across discovery samples, while inheriting computational efficiency and robustness from PRS-CS. We show that PRS-CSx outperforms alternative methods across traits with a wide range of genetic architectures, cross-population genetic overlaps and discovery GWAS sample sizes in simulations, and improves the prediction of quantitative traits and schizophrenia risk in non-European populations.
DOI
10.1038/s41588-022-01054-7
PRS-FH
PUBMED_LINK
FULL NAME
family history
URL
KEYWORDS
family history
TITLE
Incorporating family history of disease improves polygenic risk scores in diverse populations.
Main citation
Hujoel MLA, Loh PR, Neale BM, Price AL. (2022) Incorporating family history of disease improves polygenic risk scores in diverse populations. Cell Genom, 2 (7) . doi:10.1016/j.xgen.2022.100152. PMID 35935918
ABSTRACT
Polygenic risk scores (PRSs) derived from genotype data and family history (FH) of disease provide valuable information for predicting disease risk, but PRSs perform poorly when applied to diverse populations. Here, we explore methods for combining both types of information (PRS-FH) in UK Biobank data. PRSs were trained using all British individuals (n = 409,000), and target samples consisted of unrelated non-British Europeans (n = 42,000), South Asians (n = 7,000), or Africans (n = 7,000). We evaluated PRS, FH, and PRS-FH using liability-scale R², primarily focusing on 3 well-powered diseases (type 2 diabetes, hypertension, and depression). PRS attained average prediction R² values of 5.8%, 4.0%, and 0.53% in non-British Europeans, South Asians, and Africans, confirming poor cross-population transferability. In contrast, PRS-FH attained average prediction R² values of 13%, 12%, and 10%, respectively, representing a large improvement in Europeans and an extremely large improvement in Africans. In conclusion, including family history improves the accuracy of polygenic risk scores, particularly in diverse populations.
DOI
10.1016/j.xgen.2022.100152
PRS-RS
PUBMED_LINK
FULL NAME
Polygenic Risk Score Reporting Standards
TITLE
Improving reporting standards for polygenic scores in risk prediction studies.
Main citation
Wand H, Lambert SA, Tamburro C, Iacocca MA, ...&, Wojcik GL. (2021) Improving reporting standards for polygenic scores in risk prediction studies. Nature, 591 (7849) 211-219. doi:10.1038/s41586-021-03243-6. PMID 33692554
ABSTRACT
Polygenic risk scores (PRSs), which often aggregate results from genome-wide association studies, can bridge the gap between initial discovery efforts and clinical applications for the estimation of disease risk using genetics. However, there is notable heterogeneity in the application and reporting of these risk scores, which hinders the translation of PRSs into clinical care. Here, in a collaboration between the Clinical Genome Resource (ClinGen) Complex Disease Working Group and the Polygenic Score (PGS) Catalog, we present the Polygenic Risk Score Reporting Standards (PRS-RS), in which we update the Genetic Risk Prediction Studies (GRIPS) Statement to reflect the present state of the field. Drawing on the input of experts in epidemiology, statistics, disease-specific applications, implementation and policy, this comprehensive reporting framework defines the minimal information that is needed to interpret and evaluate PRSs, especially with respect to downstream clinical applications. Items span detailed descriptions of study populations, statistical methods for the development and validation of PRSs and considerations for the potential limitations of these scores. In addition, we emphasize the need for data availability and transparency, and we encourage researchers to deposit and share PRSs through the PGS Catalog to facilitate reproducibility and comparative benchmarking. By providing these criteria in a structured format that builds on existing standards and ontologies, the use of this framework in publishing PRSs will facilitate translation into clinical care and progress towards defining best practice.
DOI
10.1038/s41586-021-03243-6
PRS_to_Abs
PUBMED_LINK
DESCRIPTION
Converting Polygenic Score to Absolute Scale
URL
TITLE
A tool for translating polygenic scores onto the absolute scale using summary statistics.
Main citation
Pain O, Gillett AC, Austin JC, Folkersen L, ...&, Lewis CM. (2022) A tool for translating polygenic scores onto the absolute scale using summary statistics. Eur J Hum Genet, 30 (3) 339-348. doi:10.1038/s41431-021-01028-z. PMID 34983942
ABSTRACT
There is growing interest in the clinical application of polygenic scores as their predictive utility increases for a range of health-related phenotypes. However, providing polygenic score predictions on the absolute scale is an important step for their safe interpretation. We have developed a method to convert polygenic scores to the absolute scale for binary and normally distributed phenotypes. This method uses summary statistics, requiring only the area-under-the-ROC curve (AUC) or variance explained (R²) by the polygenic score, and the prevalence of binary phenotypes, or mean and standard deviation of normally distributed phenotypes. Polygenic scores are converted using normal distribution theory. We also evaluate methods for estimating polygenic score AUC/R² from genome-wide association study (GWAS) summary statistics alone. We validate the absolute risk conversion and AUC/R² estimation using data for eight binary and three continuous phenotypes in the UK Biobank sample. When the AUC/R² of the polygenic score is known, the observed and estimated absolute values were highly concordant. Estimates of AUC/R² from the lassosum pseudovalidation method were most similar to the observed AUC/R² values, though estimated values deviated substantially from the observed for autoimmune disorders. This study enables accurate interpretation of polygenic scores using only summary statistics, providing a useful tool for educational and clinical purposes. Furthermore, we have created interactive webtools implementing the conversion to the absolute scale (https://opain.github.io/GenoPred/PRS_to_Abs_tool.html). Several further barriers must be addressed before clinical implementation of polygenic scores, such as ensuring target individuals are well represented by the GWAS sample.
DOI
10.1038/s41431-021-01028-z
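As an illustration of the normal-theory conversion summarised in the PRS_to_Abs abstract, the sketch below turns a standardized polygenic score into an absolute risk for a binary phenotype using only the AUC and the prevalence. It is a minimal sketch and not the authors' code. It assumes the score is normally distributed with equal variance in cases and controls, and is scaled to mean 0 and variance 1 in the whole population. The function name `absolute_risk` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def absolute_risk(z, auc, prevalence):
    """Probability of being a case given a standardized polygenic score z.

    Assumptions (for illustration only): equal-variance normal score
    distributions in cases and controls; population mean 0, variance 1.
    """
    k = prevalence
    # For two equal-variance normals, AUC = Phi(d / sqrt(2)), where d is the
    # case-control mean difference in units of the within-group SD.
    d = sqrt(2) * NormalDist().inv_cdf(auc)
    # Choose the within-group SD so that the case/control mixture has variance 1:
    # var = sd^2 + k * (1 - k) * (d * sd)^2 = 1
    sd_within = 1 / sqrt(1 + k * (1 - k) * d * d)
    # Choose the group means so that the mixture has mean 0.
    mean_case = (1 - k) * d * sd_within
    mean_control = -k * d * sd_within
    # Bayes' rule: weight each group's density at z by its population share.
    case_part = k * NormalDist(mean_case, sd_within).pdf(z)
    control_part = (1 - k) * NormalDist(mean_control, sd_within).pdf(z)
    return case_part / (case_part + control_part)


if __name__ == "__main__":
    # Example inputs are arbitrary: AUC 0.70, prevalence 10%.
    for z in (-2.0, 0.0, 2.0):
        print(z, round(absolute_risk(z, auc=0.70, prevalence=0.10), 4))
```

With an uninformative score (AUC of 0.5) the function returns the prevalence for every score value, which is a quick sanity check on the scaling.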
PRSet
PUBMED_LINK
DESCRIPTION
A new feature of PRSice is the ability to perform set-based/pathway-based analysis. This new feature is called PRSet.
URL
KEYWORDS
pathway-based
TITLE
PRSet: Pathway-based polygenic risk score analyses and software.
Main citation
Choi SW, García-González J, Ruan Y, Wu HM, ...&, O'Reilly PF. (2023) PRSet: Pathway-based polygenic risk score analyses and software. PLoS Genet, 19 (2) e1010624. doi:10.1371/journal.pgen.1010624. PMID 36749789
ABSTRACT
Polygenic risk scores (PRSs) have been among the leading advances in biomedicine in recent years. As a proxy of genetic liability, PRSs are utilised across multiple fields and applications. While numerous statistical and machine learning methods have been developed to optimise their predictive accuracy, these typically distil genetic liability to a single number based on aggregation of an individual's genome-wide risk alleles. This results in a key loss of information about an individual's genetic profile, which could be critical given the functional sub-structure of the genome and the heterogeneity of complex disease. In this manuscript, we introduce a 'pathway polygenic' paradigm of disease risk, in which multiple genetic liabilities underlie complex diseases, rather than a single genome-wide liability. We describe a method and accompanying software, PRSet, for computing and analysing pathway-based PRSs, in which polygenic scores are calculated across genomic pathways for each individual. We evaluate the potential of pathway PRSs in two distinct ways, creating two major sections: (1) In the first section, we benchmark PRSet as a pathway enrichment tool, evaluating its capacity to capture GWAS signal in pathways. We find that for target sample sizes of >10,000 individuals, pathway PRSs have similar power for evaluating pathway enrichment as leading methods MAGMA and LD score regression, with the distinct advantage of providing individual-level estimates of genetic liability for each pathway, opening up a range of pathway-based PRS applications, (2) In the second section, we evaluate the performance of pathway PRSs for disease stratification. We show that using a supervised disease stratification approach, pathway PRSs (computed by PRSet) outperform two standard genome-wide PRSs (computed by C+T and lassosum) for classifying disease subtypes in 20 of 21 scenarios tested.
As the definition and functional annotation of pathways becomes increasingly refined, we expect pathway PRSs to offer key insights into the heterogeneity of complex disease and treatment response, to generate biologically tractable therapeutic targets from polygenic signal, and, ultimately, to provide a powerful path to precision medicine.
DOI
10.1371/journal.pgen.1010624
PRSice-2
PUBMED_LINK
DESCRIPTION
PRSice (pronounced 'precise') is a Polygenic Risk Score software for calculating, applying, evaluating and plotting the results of polygenic risk scores (PRS) analyses.
URL
TITLE
PRSice-2: Polygenic Risk Score software for biobank-scale data.
Main citation
Choi SW, O'Reilly PF. (2019) PRSice-2: Polygenic Risk Score software for biobank-scale data. Gigascience, 8 (7) . doi:10.1093/gigascience/giz082. PMID 31307061
ABSTRACT
BACKGROUND: Polygenic risk score (PRS) analyses have become an integral part of biomedical research, exploited to gain insights into shared aetiology among traits, to control for genomic profile in experimental studies, and to strengthen causal inference, among a range of applications. Substantial efforts are now devoted to biobank projects to collect large genetic and phenotypic data, providing unprecedented opportunity for genetic discovery and applications. To process the large-scale data provided by such biobank resources, highly efficient and scalable methods and software are required. RESULTS: Here we introduce PRSice-2, an efficient and scalable software program for automating and simplifying PRS analyses on large-scale data. PRSice-2 handles both genotyped and imputed data, provides empirical association P-values free from inflation due to overfitting, supports different inheritance models, and can evaluate multiple continuous and binary target traits simultaneously. We demonstrate that PRSice-2 is dramatically faster and more memory-efficient than PRSice-1 and alternative PRS software, LDpred and lassosum, while having comparable predictive power. CONCLUSION: PRSice-2's combination of efficiency and power will be increasingly important as data sizes grow and as the applications of PRS become more sophisticated, e.g., when incorporated into high-dimensional or gene set-based analyses. PRSice-2 is written in C++, with an R script for plotting, and is freely available for download from http://PRSice.info.
DOI
10.1093/gigascience/giz082
PRSMix_AOI
FULL NAME
add-one-in (AOI)
PREPRINT_DOI
10.1101/2024.07.24.24310897
Main citation
Misra, A. et al. Instability of high polygenic risk classification and mitigation by integrative scoring. medRxiv 2024.07.24.24310897 (2024) doi:10.1101/2024.07.24.24310897.
PRStuning
PUBMED_LINK
DESCRIPTION
Estimate Testing AUC for Binary Phenotype Using GWAS Summary Statistics from the Training Data
TITLE
Tuning Parameters for Polygenic Risk Score Methods Using GWAS Summary Statistics from Training Data.
Main citation
Jiang W, Chen L, Girgenti MJ, Zhao H. (2023) Tuning Parameters for Polygenic Risk Score Methods Using GWAS Summary Statistics from Training Data. Res Sq, () . doi:10.21203/rs.3.rs-2939390/v1. PMID 37398263
ABSTRACT
Predicting genetic risks for common diseases may improve their prevention and early treatment. In recent years, various additive-model-based polygenic risk scores (PRS) methods have been proposed to combine the estimated effects of single nucleotide polymorphisms (SNPs) using data collected from genome-wide association studies (GWAS). Some of these methods require access to another external individual-level GWAS dataset to tune the hyperparameters, which can be difficult because of privacy and security-related concerns. Additionally, leaving out partial data for hyperparameter tuning can reduce the predictive accuracy of the constructed PRS model. In this article, we propose a novel method, called PRStuning, to automatically tune hyperparameters for different PRS methods using only GWAS summary statistics from the training data. The core idea is to first predict the performance of the PRS method with different parameter values, and then select the parameters with the best prediction performance. Because directly using the effects observed from the training data tends to overestimate the performance in the testing data (a phenomenon known as overfitting), we adopt an empirical Bayes approach to shrinking the predicted performance in accordance with the estimated genetic architecture of the disease. Results from extensive simulations and real data applications demonstrate that PRStuning can accurately predict the PRS performance across PRS methods and parameters, and it can help select the best-performing parameters.
DOI
10.21203/rs.3.rs-2939390/v1
PS4DR
PUBMED_LINK
FULL NAME
Pathway Signatures for Drug Repositioning
DESCRIPTION
This package comprises a modular workflow designed to identify drug repositioning candidates using multi-omics data sets. A schematic figure of the workflow is presented below. The R scripts necessary to run the MSDRP pipeline are located in the R directory.
URL
TITLE
PS4DR: a multimodal workflow for identification and prioritization of drugs based on pathway signatures.
Main citation
Emon MA, Domingo-Fernández D, Hoyt CT, Hofmann-Apitius M. (2020) PS4DR: a multimodal workflow for identification and prioritization of drugs based on pathway signatures. BMC Bioinformatics, 21 (1) 231. doi:10.1186/s12859-020-03568-5. PMID 32503412
ABSTRACT
BACKGROUND: During the last decade, there has been a surge towards computational drug repositioning owing to constantly increasing -omics data in the biomedical research field. While numerous existing methods focus on the integration of heterogeneous data to propose candidate drugs, it is still challenging to substantiate their results with mechanistic insights of these candidate drugs. Therefore, there is a need for more innovative and efficient methods which can enable better integration of data and knowledge for drug repositioning. RESULTS: Here, we present a customizable workflow (PS4DR) which not only integrates high-throughput data such as genome-wide association study (GWAS) data and gene expression signatures from disease and drug perturbations but also takes pathway knowledge into consideration to predict drug candidates for repositioning. We have collected and integrated publicly available GWAS data and gene expression signatures for several diseases and hundreds of FDA-approved drugs or those under clinical trial in this study. Additionally, different pathway databases were used for mechanistic knowledge integration in the workflow. Using this systematic consolidation of data and knowledge, the workflow computes pathway signatures that assist in the prediction of new indications for approved and investigational drugs. CONCLUSION: We showcase PS4DR with applications demonstrating how this tool can be used for repositioning and identifying new drugs as well as proposing drugs that can simulate disease dysregulations. We were able to validate our workflow by demonstrating its capability to predict FDA-approved drugs for their known indications for several diseases. Further, PS4DR returned many potential drug candidates for repositioning that were backed up by epidemiological evidence extracted from scientific literature. Source code is freely available at https://github.com/ps4dr/ps4dr.
DOI
10.1186/s12859-020-03568-5
PSMC
PUBMED_LINK
DESCRIPTION
Li, H. & Durbin, R. Inference of human population history from individual whole-genome sequences. Nature 475, 493–496 (2011).
URL
USE
This software package infers population size history from a diploid sequence
using the Pairwise Sequentially Markovian Coalescent (PSMC) model.
TITLE
Inference of human population history from individual whole-genome sequences.
Main citation
Li H, Durbin R. (2011) Inference of human population history from individual whole-genome sequences. Nature, 475 (7357) 493-6. doi:10.1038/nature10231. PMID 21753753
ABSTRACT
The history of human population size is important for understanding human evolution. Various studies have found evidence for a founder event (bottleneck) in East Asian and European populations, associated with the human dispersal out-of-Africa event around 60 thousand years (kyr) ago. However, these studies have had to assume simplified demographic models with few parameters, and they do not provide a precise date for the start and stop times of the bottleneck. Here, with fewer assumptions on population size changes, we present a more detailed history of human population sizes between approximately ten thousand and a million years ago, using the pairwise sequentially Markovian coalescent model applied to the complete diploid genome sequences of a Chinese male (YH), a Korean male (SJK), three European individuals (J. C. Venter, NA12891 and NA12878 (ref. 9)) and two Yoruba males (NA18507 (ref. 10) and NA19239). We infer that European and Chinese populations had very similar population-size histories before 10-20 kyr ago. Both populations experienced a severe bottleneck 10-60 kyr ago, whereas African populations experienced a milder bottleneck from which they recovered earlier. All three populations have an elevated effective population size between 60 and 250 kyr ago, possibly due to population substructure. We also infer that the differentiation of genetically modern humans may have started as early as 100-120 kyr ago, but considerable genetic exchanges may still have occurred until 20-40 kyr ago.
DOI
10.1038/nature10231
PTWAS
PUBMED_LINK
FULL NAME
probabilistic TWAS
URL
KEYWORDS
TWAS, instrumental variables
TITLE
PTWAS: investigating tissue-relevant causal molecular mechanisms of complex traits using probabilistic TWAS analysis.
Main citation
Zhang Y, Quick C, Yu K, Barbeira A, ...&, Wen X. (2020) PTWAS: investigating tissue-relevant causal molecular mechanisms of complex traits using probabilistic TWAS analysis. Genome Biol, 21 (1) 232. doi:10.1186/s13059-020-02026-y. PMID 32912253
ABSTRACT
We propose a new computational framework, probabilistic transcriptome-wide association study (PTWAS), to investigate causal relationships between gene expressions and complex traits. PTWAS applies the established principles from instrumental variables analysis and takes advantage of probabilistic eQTL annotations to delineate and tackle the unique challenges arising in TWAS. PTWAS not only confers higher power than the existing methods but also provides novel functionalities to evaluate the causal assumptions and estimate tissue- or cell-type-specific gene-to-trait effects. We illustrate the power of PTWAS by analyzing the eQTL data across 49 tissues from GTEx (v8) and GWAS summary statistics from 114 complex traits.
DOI
10.1186/s13059-020-02026-y
PUMA-CUBS
DESCRIPTION
PUMA-CUBS is an ensemble learning strategy that combines multiple PRS models into a single ensemble score without requiring external data for model fitting.
URL
Main citation
Zhao, Zijie, et al. "Optimizing and benchmarking polygenic risk scores with GWAS summary statistics." bioRxiv (2022).
QCTOOL v2 (QCTOOL)
PUBMED_LINK
FULL NAME
QCTOOL
DESCRIPTION
QCTOOL is a command-line utility program for manipulation and quality control of GWAS datasets and other genome-wide data.
URL
TITLE
A note on exact tests of Hardy-Weinberg equilibrium.
Main citation
Wigginton JE, Cutler DJ, Abecasis GR. (2005) A note on exact tests of Hardy-Weinberg equilibrium. Am J Hum Genet, 76 (5) 887-93. doi:10.1086/429864. PMID 15789306
ABSTRACT
Deviations from Hardy-Weinberg equilibrium (HWE) can indicate inbreeding, population stratification, and even problems in genotyping. In samples of affected individuals, these deviations can also provide evidence for association. Tests of HWE are commonly performed using a simple chi2 goodness-of-fit test. We show that this chi2 test can have inflated type I error rates, even in relatively large samples (e.g., samples of 1,000 individuals that include approximately 100 copies of the minor allele). On the basis of previous work, we describe exact tests of HWE together with efficient computational methods for their implementation. Our methods adequately control type I error in large and small samples and are computationally efficient. They have been implemented in freely available code that will be useful for quality assessment of genotype data and for the detection of genetic association or population stratification in very large data sets.
DOI
10.1086/429864
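The exact test of Hardy-Weinberg equilibrium described in the abstract above can be sketched in a few lines. This is a minimal illustration of the recurrence-based approach from the cited paper as understood here, not the code used inside QCTOOL. The function name `hwe_exact_p` and the small floating-point tolerance are assumptions made for this example.

```python
def hwe_exact_p(obs_hets, obs_hom1, obs_hom2):
    """Exact HWE p-value from genotype counts at one biallelic variant.

    Conditions on the observed allele counts, builds the distribution of the
    heterozygote count by recurrence, and sums the probabilities of all
    outcomes that are no more likely than the observed one.
    """
    if min(obs_hets, obs_hom1, obs_hom2) < 0:
        raise ValueError("genotype counts must be non-negative")
    hom_rare = min(obs_hom1, obs_hom2)
    hom_common = max(obs_hom1, obs_hom2)
    n = obs_hets + hom_rare + hom_common      # number of individuals
    rare = 2 * hom_rare + obs_hets            # copies of the rarer allele
    if n == 0:
        return 1.0

    # Unnormalised probabilities indexed by heterozygote count.
    probs = [0.0] * (rare + 1)

    # Start near the expected heterozygote count, matching the parity of `rare`.
    mid = rare * (2 * n - rare) // (2 * n)
    if mid % 2 != rare % 2:
        mid += 1
    probs[mid] = 1.0

    # Walk down: two heterozygotes become one rare and one common homozygote.
    hets, hr, hc = mid, (rare - mid) // 2, n - mid - (rare - mid) // 2
    while hets >= 2:
        probs[hets - 2] = probs[hets] * hets * (hets - 1) / (4.0 * (hr + 1) * (hc + 1))
        hets, hr, hc = hets - 2, hr + 1, hc + 1

    # Walk up: one rare and one common homozygote become two heterozygotes.
    hets, hr, hc = mid, (rare - mid) // 2, n - mid - (rare - mid) // 2
    while hets <= rare - 2:
        probs[hets + 2] = probs[hets] * 4.0 * hr * hc / ((hets + 2) * (hets + 1))
        hets, hr, hc = hets + 2, hr - 1, hc - 1

    total = sum(probs)
    p_obs = probs[obs_hets]
    # Relative tolerance guards against rounding when comparing equal terms.
    p_value = sum(p for p in probs if p <= p_obs * (1 + 1e-9)) / total
    return min(1.0, p_value)


if __name__ == "__main__":
    # Arbitrary example counts: heterozygotes, homozygotes of each type.
    print(hwe_exact_p(50, 25, 25))
    print(hwe_exact_p(0, 50, 50))
```

Working with unnormalised terms and dividing by their sum at the end avoids computing large factorials directly.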
QRGWAS
PUBMED_LINK
FULL NAME
Quantile regression GWAS
URL
TITLE
Genome-wide discovery for biomarkers using quantile regression at biobank scale.
Main citation
Wang C, Wang T, Kiryluk K, Wei Y, ...&, Ionita-Laza I. (2024) Genome-wide discovery for biomarkers using quantile regression at biobank scale. Nat Commun, 15 (1) 6460. doi:10.1038/s41467-024-50726-x. PMID 39085219
ABSTRACT
Genome-wide association studies (GWAS) for biomarkers important for clinical phenotypes can lead to clinically relevant discoveries. Conventional GWAS for quantitative traits are based on simplified regression models modeling the conditional mean of a phenotype as a linear function of genotype. We draw attention here to an alternative, lesser known approach, namely quantile regression that naturally extends linear regression to the analysis of the entire conditional distribution of a phenotype of interest. Quantile regression can be applied efficiently at biobank scale, while having some unique advantages such as (1) identifying variants with heterogeneous effects across quantiles of the phenotype distribution; (2) accommodating a wide range of phenotype distributions including non-normal distributions, with invariance of results to trait transformations; and (3) providing more detailed information about genotype-phenotype associations even for those associations identified by conventional GWAS. We show in simulations that quantile regression is powerful across both homogeneous and various heterogeneous models. Applications to 39 quantitative traits in the UK Biobank demonstrate that quantile regression can be a helpful complement to linear regression in GWAS and can identify variants with larger effects on high-risk subgroups of individuals but with lower or no contribution overall.
DOI
10.1038/s41467-024-50726-x
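To make the idea of heterogeneous effects across quantiles concrete, the toy sketch below fits a quantile by minimising the check (pinball) loss and compares carriers with non-carriers of a variant at two quantile levels. It is an illustration only and not the QRGWAS implementation, which fits regression models with covariates at biobank scale. The function names and the data are invented for this example.

```python
def pinball_loss(residual, tau):
    """Check loss used in quantile regression for quantile level tau."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def fit_quantile(values, tau):
    """Constant that minimises the summed check loss.

    A minimiser always exists among the observed values, so it is enough
    to search over them.
    """
    return min(values, key=lambda c: sum(pinball_loss(v - c, tau) for v in values))


def carrier_effect(non_carriers, carriers, tau):
    """Difference in the fitted tau-quantile between carriers and non-carriers.

    With a single binary predictor, quantile regression reduces to fitting
    each group separately, which is what this helper does.
    """
    return fit_quantile(carriers, tau) - fit_quantile(non_carriers, tau)


if __name__ == "__main__":
    # Invented phenotype values: carriers differ only in the upper tail.
    non_carriers = [1.0, 2.0, 3.0, 4.0, 5.0]
    carriers = [1.0, 2.0, 3.0, 4.0, 9.0]
    for tau in (0.5, 0.9):
        print(tau, carrier_effect(non_carriers, carriers, tau))
```

In this invented data set the two groups share the same median, so a comparison at the centre of the distribution sees no difference, while the comparison at the 0.9 quantile picks up the shift in the upper tail.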
Quickdraws
PUBMED_LINK
DESCRIPTION
Quickdraws is a scalable method to perform genome-wide association studies (GWAS) for quantitative and binary traits. To run GWAS using Quickdraws, you will need three main input files: bed (and bgen) files with model-building and testing genetic variants, phenotype files, and covariate files. For certain analyses, you may also need a list of model SNPs and a file describing close genetic relatives.
URL
TITLE
A scalable variational inference approach for increased mixed-model association power.
Main citation
Loya H, Kalantzis G, Cooper F, Palamara PF. (2025) A scalable variational inference approach for increased mixed-model association power. Nat Genet, 57 (2) 461-468. doi:10.1038/s41588-024-02044-7. PMID 39789286
ABSTRACT
The rapid growth of modern biobanks is creating new opportunities for large-scale genome-wide association studies (GWASs) and the analysis of complex traits. However, performing GWASs on millions of samples often leads to trade-offs between computational efficiency and statistical power, reducing the benefits of large-scale data collection efforts. We developed Quickdraws, a method that increases association power in quantitative and binary traits without sacrificing computational efficiency, leveraging a spike-and-slab prior on variant effects, stochastic variational inference and graphics processing unit acceleration. We applied Quickdraws to 79 quantitative and 50 binary traits in 405,088 UK Biobank samples, identifying 4.97% and 3.25% more associations than REGENIE and 22.71% and 7.07% more than FastGWA. Quickdraws had costs comparable to REGENIE, FastGWA and SAIGE on the UK Biobank Research Analysis Platform service, while being substantially faster than BOLT-LMM. These results highlight the promise of leveraging machine learning techniques for scalable GWASs without sacrificing power or robustness.
DOI
10.1038/s41588-024-02044-7
QUILT1
PUBMED_LINK
URL
TITLE
Rapid genotype imputation from sequence with reference panels.
Main citation
Davies RW, Kucka M, Su D, Shi S, ...&, Myers S. (2021) Rapid genotype imputation from sequence with reference panels. Nat Genet, 53 (7) 1104-1111. doi:10.1038/s41588-021-00877-0. PMID 34083788
ABSTRACT
Inexpensive genotyping methods are essential to modern genomics. Here we present QUILT, which performs diploid genotype imputation using low-coverage whole-genome sequence data. QUILT employs Gibbs sampling to partition reads into maternal and paternal sets, facilitating rapid haploid imputation using large reference panels. We show this partitioning to be accurate over many megabases, enabling highly accurate imputation close to theoretical limits and outperforming existing methods. Moreover, QUILT can impute accurately using diverse technologies, including long reads from Oxford Nanopore Technologies, and a new form of low-cost barcoded Illumina sequencing called haplotagging, with the latter showing improved accuracy at low coverages. Relative to DNA genotyping microarrays, QUILT offers improved accuracy at reduced cost, particularly for diverse populations that are traditionally underserved in modern genomic analyses, with accuracy nearly doubling at rare SNPs. Finally, QUILT can accurately impute (four-digit) human leukocyte antigen types, the first such method from low-coverage sequence data.
DOI
10.1038/s41588-021-00877-0
QUILT2
DESCRIPTION
QUILT2 is a fast and memory-efficient method for imputation from low coverage sequence. Statistically, QUILT2 operates on a per-read basis, and is base quality aware, meaning it can accurately impute from diverse inputs, including short read (e.g. Illumina), long read sequencing (that might be noisy) (e.g. Oxford Nanopore Technologies), barcoded Illumina sequencing (e.g. Haplotagging) and ancient DNA. In addition, QUILT2 can impute both the mother and fetal genome using cfDNA NIPT data.
URL
PREPRINT_DOI
10.1101/2024.07.18.604149
Main citation
Li, Z., Albrechtsen, A. & Davies, R. W. Rapid and accurate genotype imputation from low coverage short read, long read, and cell free DNA sequence. bioRxiv 2024.07.18.604149 (2024) doi:10.1101/2024.07.18.604149.
RareMETAL
PUBMED_LINK
DESCRIPTION
RAREMETAL is a program that facilitates the meta-analysis of rare variants from genotype arrays or sequencing (manuscript in preparation).
URL
KEYWORDS
rare variants
TITLE
RAREMETAL: fast and powerful meta-analysis for rare variants.
Main citation
Feng S, Liu D, Zhan X, Wing MK, ...&, Abecasis GR. (2014) RAREMETAL: fast and powerful meta-analysis for rare variants. Bioinformatics, 30 (19) 2828-9. doi:10.1093/bioinformatics/btu367. PMID 24894501
ABSTRACT
SUMMARY: RAREMETAL is a computationally efficient tool for meta-analysis of rare variants genotyped using sequencing or arrays. RAREMETAL facilitates analyses of individual studies, accommodates a variety of input file formats, handles related and unrelated individuals, executes both single variant and burden tests and performs conditional association analyses. AVAILABILITY AND IMPLEMENTATION: http://genome.sph.umich.edu/wiki/RAREMETAL for executables, source code, documentation and tutorial.
DOI
10.1093/bioinformatics/btu367
RASQUAL
PUBMED_LINK
FULL NAME
Robust Allele Specific QUAntitation and quality controL
DESCRIPTION
RASQUAL (Robust Allele Specific QUAntitation and quality controL) maps QTLs for sequence-based cellular traits by combining population and allele-specific signals.
URL
TITLE
Fine-mapping cellular QTLs with RASQUAL and ATAC-seq.
Main citation
Kumasaka N, Knights AJ, Gaffney DJ. (2016) Fine-mapping cellular QTLs with RASQUAL and ATAC-seq. Nat Genet, 48 (2) 206-13. doi:10.1038/ng.3467. PMID 26656845
ABSTRACT
When cellular traits are measured using high-throughput DNA sequencing, quantitative trait loci (QTLs) manifest as fragment count differences between individuals and allelic differences within individuals. We present RASQUAL (Robust Allele-Specific Quantitation and Quality Control), a new statistical approach for association mapping that models genetic effects and accounts for biases in sequencing data using a single, probabilistic framework. RASQUAL substantially improves fine-mapping accuracy and sensitivity relative to existing methods in RNA-seq, DNase-seq and ChIP-seq data. We illustrate how RASQUAL can be used to maximize association detection by generating the first map of chromatin accessibility QTLs (caQTLs) in a European population using ATAC-seq. Despite a modest sample size, we identified 2,707 independent caQTLs (at a false discovery rate of 10%) and demonstrated how RASQUAL and ATAC-seq can provide powerful information for fine-mapping gene-regulatory variants and for linking distal regulatory elements with gene promoters. Our results highlight how combining between-individual and allele-specific genetic signals improves the functional interpretation of noncoding variation.
DOI
10.1038/ng.3467
Relate
PUBMED_LINK
DESCRIPTION
Speidel, L., Forest, M., Shi, S. & Myers, S. R. A method for genome-wide genealogy estimation for thousands of samples. Nat. Genet. 51, 1321–1329 (2019).
URL
USE
Relate estimates genome-wide genealogies in the form of trees that adapt to changes in local ancestry caused by recombination. The method, which is scalable to thousands of samples, is described in the following paper. Please cite this paper if you use our software in your study.
TITLE
A method for genome-wide genealogy estimation for thousands of samples.
Main citation
Speidel L, Forest M, Shi S, Myers SR. (2019) A method for genome-wide genealogy estimation for thousands of samples. Nat Genet, 51 (9) 1321-1329. doi:10.1038/s41588-019-0484-x. PMID 31477933
ABSTRACT
Knowledge of genome-wide genealogies for thousands of individuals would simplify most evolutionary analyses for humans and other species, but has remained computationally infeasible. We have developed a method, Relate, scaling to >10,000 sequences while simultaneously estimating branch lengths, mutational ages and variable historical population sizes, as well as allowing for data errors. Application to 1,000 Genomes Project haplotypes produces joint genealogical histories for 26 human populations. Highly diverged lineages are present in all groups, but most frequent in Africa. Outside Africa, these mainly reflect ancient introgression from groups related to Neanderthals and Denisovans, while African signals instead reflect unknown events unique to that continent. Our approach allows more powerful inferences of natural selection than has previously been possible. We identify multiple regions under strong positive selection, and multi-allelic traits including hair color, body mass index and blood pressure, showing strong evidence of directional selection, varying among human groups.
DOI
10.1038/s41588-019-0484-x
REMETA
PUBMED_LINK
DESCRIPTION
REMETA is a computationally efficient C++ toolkit for meta-analysis of gene-based association tests using single-variant summary statistics from REGENIE-style pipelines, including burden and variance-component tests, with sparse per-study LD references rescaled per phenotype.
URL
KEYWORDS
gene-based test, meta-analysis, summary statistics, REGENIE, burden, SKAT-O
TITLE
Computationally efficient meta-analysis of gene-based tests using summary statistics in large-scale genetic studies.
Main citation
Joseph TA, Mbatchou J, Ghosh A, Marcketta A, ...&, Marchini J. (2025) Computationally efficient meta-analysis of gene-based tests using summary statistics in large-scale genetic studies. Nat Genet, 57 (12) 3193-3200. doi:10.1038/s41588-025-02390-0. PMID 41225158
ABSTRACT
Meta-analysis of gene-based tests using single-variant summary statistics is a powerful strategy for genetic association studies. However, current approaches require sharing the covariance matrix between variants for each study and trait of interest. For large-scale studies with many phenotypes, these matrices can be cumbersome to calculate, store and share. Here, to address this challenge, we present REMETA, an efficient tool for meta-analysis of gene-based tests. REMETA uses a single sparse covariance reference file per study that is rescaled for each phenotype using single-variant summary statistics. We develop new methods for binary traits with case-control imbalance, and to estimate allele frequencies, genotype counts and effect sizes of burden tests. We demonstrate the performance and advantages of our approach through meta-analysis of five traits in 469,376 samples in UK Biobank. The open-source REMETA software will facilitate meta-analysis across large-scale exome sequencing studies from diverse studies that cannot easily be combined.
DOI
10.1038/s41588-025-02390-0
RENT+
PUBMED_LINK
DESCRIPTION
Mirzaei, S. & Wu, Y. RENT+: an improved method for inferring local genealogical trees from haplotypes with recombination. Bioinformatics 33, 1021–1030 (2017).
TITLE
RENT+: an improved method for inferring local genealogical trees from haplotypes with recombination.
Main citation
Mirzaei S, Wu Y. (2017) RENT+: an improved method for inferring local genealogical trees from haplotypes with recombination. Bioinformatics, 33 (7) 1021-1030. doi:10.1093/bioinformatics/btw735. PMID 28065901
ABSTRACT
MOTIVATION: Haplotypes from one or multiple related populations share a common genealogical history. If this shared genealogy can be inferred from haplotypes, it can be very useful for many population genetics problems. However, with the presence of recombination, the genealogical history of haplotypes is complex and cannot be represented by a single genealogical tree. Therefore, inference of genealogical history with recombination is much more challenging than the case of no recombination. RESULTS: In this paper, we present a new approach called RENT+ for the inference of local genealogical trees from haplotypes with the presence of recombination. RENT+ builds on a previous genealogy inference approach called RENT, which infers a set of related genealogical trees at different genomic positions. RENT+ represents a significant improvement over RENT in the sense that it is more effective in extracting information contained in the haplotype data about the underlying genealogy than RENT. The key components of RENT+ are several greatly enhanced genealogy inference rules. Through simulation, we show that RENT+ is more efficient and accurate than several existing genealogy inference methods. As an application, we apply RENT+ in the inference of population demographic history from haplotypes, which outperforms several existing methods. AVAILABILITY AND IMPLEMENTATION: RENT+ is implemented in Java, and is freely available for download from: https://github.com/SajadMirzaei/RentPlus. CONTACTS: sajad@engr.uconn.edu or ywu@engr.uconn.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
DOI
10.1093/bioinformatics/btw735
RESHAPE
PUBMED_LINK
FULL NAME
REcombine and Share HAPlotypEs
DESCRIPTION
RESHAPE removes sample-level genetic information from a reference panel to create a synthetic reference panel. Given a genetic map and the VCF/BCF of a reference panel, RESHAPE outputs a VCF/BCF of the same size in which each haplotype is a mosaic of the original haplotypes of the reference panel.
URL
TITLE
A resampling-based approach to share reference panels.
Main citation
Cavinato T, Rubinacci S, Malaspinas AS, Delaneau O. (2024) A resampling-based approach to share reference panels. Nat Comput Sci, 4 (5) 360-366. doi:10.1038/s43588-024-00630-7. PMID 38745108
ABSTRACT
For many genome-wide association studies, imputing genotypes from a haplotype reference panel is a necessary step. Over the past 15 years, reference panels have become larger and more diverse, leading to improvements in imputation accuracy. However, the latest generation of reference panels is subject to restrictions on data sharing due to concerns about privacy, limiting their usefulness for genotype imputation. In this context, here we propose RESHAPE, a method that employs a recombination Poisson process on a reference panel to simulate the genomes of hypothetical descendants after multiple generations. This data transformation helps to protect against re-identification threats and preserves data attributes, such as linkage disequilibrium patterns and, to some degree, identity-by-descent sharing, allowing for genotype imputation. Our experiments on gold-standard datasets show that simulated descendants up to eight generations can serve as reference panels without substantially reducing genotype imputation accuracy.
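The mosaic idea described for RESHAPE can be illustrated with a toy sketch. This is not the RESHAPE implementation: the function name `reshape_like_mosaic`, the list-based inputs, and the choice to swap sources between output haplotypes (so that allele counts at every site are unchanged) are assumptions made for illustration only. The switch probability between adjacent sites is taken as 1 - exp(-G * d), with d the genetic distance in Morgans and G the number of generations, which is the probability of at least one event under a Poisson process with rate G * d.

```python
import math
import random


def reshape_like_mosaic(haps, cm_pos, generations, seed=0):
    """Toy mosaic resampling of a haplotype panel (illustrative only).

    haps: list of K haplotypes, each a list of M alleles.
    cm_pos: list of M genetic positions in centimorgans (non-decreasing).
    generations: number of simulated generations G.
    """
    rng = random.Random(seed)
    k, m = len(haps), len(cm_pos)
    # source[i] is the original haplotype currently copied by output haplotype i.
    # Keeping it a permutation means every original allele is used exactly once per site.
    source = list(range(k))
    out = [[None] * m for _ in range(k)]
    for j in range(m):
        if j > 0:
            d_morgans = (cm_pos[j] - cm_pos[j - 1]) / 100.0
            # Probability of at least one Poisson recombination event in this interval.
            p_switch = 1.0 - math.exp(-generations * d_morgans)
            for i in range(k):
                if rng.random() < p_switch:
                    other = rng.randrange(k)
                    source[i], source[other] = source[other], source[i]
        for i in range(k):
            out[i][j] = haps[source[i]][j]
    return out
```

Because the sources are only ever swapped, the output panel has the same dimensions and the same per-site allele counts as the input, while individual haplotypes become mosaics of the originals.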
DOI
10.1038/s43588-024-00630-7
Review
PUBMED_LINK
TITLE
Genes and environments, development and time.
Main citation
Boyce WT, Sokolowski MB, Robinson GE. (2020) Genes and environments, development and time. Proc Natl Acad Sci U S A, 117 (38) 23235-23241. doi:10.1073/pnas.2016710117. PMID 32967067
ABSTRACT
A now substantial body of science implicates a dynamic interplay between genetic and environmental variation in the development of individual differences in behavior and health. Such outcomes are affected by molecular, often epigenetic, processes involving gene-environment (G-E) interplay that can influence gene expression. Early environments with exposures to poverty, chronic adversities, and acutely stressful events have been linked to maladaptive development and compromised health and behavior. Genetic differences can impart either enhanced or blunted susceptibility to the effects of such pathogenic environments. However, largely missing from present discourse regarding G-E interplay is the role of time, a "third factor" guiding the emergence of complex developmental endpoints across different scales of time. Trajectories of development increasingly appear best accounted for by a complex, dynamic interchange among the highly linked elements of genes, contexts, and time at multiple scales, including neurobiological (minutes to milliseconds), genomic (hours to minutes), developmental (years and months), and evolutionary (centuries and millennia) time. This special issue of PNAS thus explores time and timing among G-E transactions: The importance of timing and timescales in plasticity and critical periods of brain development; epigenetics and the molecular underpinnings of biologically embedded experience; the encoding of experience across time and biological levels of organization; and gene-regulatory networks in behavior and development and their linkages to neuronal networks. Taken together, the collection of papers offers perspectives on how G-E interplay operates contingently within and against a backdrop of time and timescales.
DOI
10.1073/pnas.2016710117
Review-Das
PUBMED_LINK
TITLE
Genotype Imputation from Large Reference Panels.
Main citation
Das S, Abecasis GR, Browning BL. (2018) Genotype Imputation from Large Reference Panels. Annu Rev Genomics Hum Genet, 19, 73-96. doi:10.1146/annurev-genom-083117-021602. PMID 29799802
ABSTRACT
Genotype imputation has become a standard tool in genome-wide association studies because it enables researchers to inexpensively approximate whole-genome sequence data from genome-wide single-nucleotide polymorphism array data. Genotype imputation increases statistical power, facilitates fine mapping of causal variants, and plays a key role in meta-analyses of genome-wide association studies. Only variants that were previously observed in a reference panel of sequenced individuals can be imputed. However, the rapid increase in the number of deeply sequenced individuals will soon make it possible to assemble enormous reference panels that greatly increase the number of imputable variants. In this review, we present an overview of genotype imputation and describe the computational techniques that make it possible to impute genotypes from reference panels with millions of individuals.
DOI
10.1146/annurev-genom-083117-021602
Review-Fst
PUBMED_LINK
DESCRIPTION
Holsinger, K. E., & Weir, B. S. (2009). Genetics in geographically structured populations: defining, estimating and interpreting F(ST). Nature Reviews Genetics, 10(9), 639-650.
TITLE
Genetics in geographically structured populations: defining, estimating and interpreting F(ST).
Main citation
Holsinger KE, Weir BS. (2009) Genetics in geographically structured populations: defining, estimating and interpreting F(ST). Nat Rev Genet, 10 (9) 639-50. doi:10.1038/nrg2611. PMID 19687804
ABSTRACT
Wright's F-statistics, and especially F(ST), provide important insights into the evolutionary processes that influence the structure of genetic variation within and among populations, and they are among the most widely used descriptive statistics in population and evolutionary genetics. Estimates of F(ST) can identify regions of the genome that have been the target of selection, and comparisons of F(ST) from different parts of the genome can provide insights into the demographic history of populations. For these reasons and others, F(ST) has a central role in population and evolutionary genetics and has wide applications in fields that range from disease association mapping to forensic science. This Review clarifies how F(ST) is defined, how it should be estimated, how it is related to similar statistics and how estimates of F(ST) should be interpreted.
DOI
10.1038/nrg2611
Review-Kachuri
PUBMED_LINK
TITLE
Principles and methods for transferring polygenic risk scores across global populations.
Main citation
Kachuri L, Chatterjee N, Hirbo J, Schaid DJ, ...&, Ge T. (2024) Principles and methods for transferring polygenic risk scores across global populations. Nat Rev Genet, 25 (1) 8-25. doi:10.1038/s41576-023-00637-2. PMID 37620596
ABSTRACT
Polygenic risk scores (PRSs) summarize the genetic predisposition of a complex human trait or disease and may become a valuable tool for advancing precision medicine. However, PRSs that are developed in populations of predominantly European genetic ancestries can increase health disparities due to poor predictive performance in individuals of diverse and complex genetic ancestries. We describe genetic and modifiable risk factors that limit the transferability of PRSs across populations and review the strengths and weaknesses of existing PRS construction methods for diverse ancestries. Developing PRSs that benefit global populations in research and clinical settings provides an opportunity for innovation and is essential for health equity.
DOI
10.1038/s41576-023-00637-2
Review-Lappalainen
PUBMED_LINK
TITLE
From variant to function in human disease genetics.
Main citation
Lappalainen T, MacArthur DG. (2021) From variant to function in human disease genetics. Science, 373 (6562) 1464-1468. doi:10.1126/science.abi8207. PMID 34554789
ABSTRACT
Over the next decade, the primary challenge in human genetics will be to understand the biological mechanisms by which genetic variants influence phenotypes, including disease risk. Although the scale of this challenge is daunting, better methods for functional variant interpretation will have transformative consequences for disease diagnosis, risk prediction, and the development of new therapies. An array of new methods for characterizing variant impact at scale, using patient tissue samples as well as in vitro models, are already being applied to dissect variant mechanisms across a range of human cell types and environments. These approaches are also increasingly being deployed in clinical settings. We discuss the rationale, approaches, applications, and future outlook for characterizing the molecular and cellular effects of genetic variants.
DOI
10.1126/science.abi8207
Review-Li
PUBMED_LINK
TITLE
Genotype imputation.
Main citation
Li Y, Willer C, Sanna S, Abecasis G. (2009) Genotype imputation. Annu Rev Genomics Hum Genet, 10, 387-406. doi:10.1146/annurev.genom.9.081307.164242. PMID 19715440
ABSTRACT
Genotype imputation is now an essential tool in the analysis of genome-wide association scans. This technique allows geneticists to accurately evaluate the evidence for association at genetic markers that are not directly genotyped. Genotype imputation is particularly useful for combining results across studies that rely on different genotyping platforms but also increases the power of individual scans. Here, we review the history and theoretical underpinnings of the technique. To illustrate performance of the approach, we summarize results from several gene mapping studies. Finally, we preview the role of genotype imputation in an era when whole genome resequencing is becoming increasingly common.
DOI
10.1146/annurev.genom.9.081307.164242
Review-Marchini
PUBMED_LINK
TITLE
Genotype imputation for genome-wide association studies.
Main citation
Marchini J, Howie B. (2010) Genotype imputation for genome-wide association studies. Nat Rev Genet, 11 (7) 499-511. doi:10.1038/nrg2796. PMID 20517342
ABSTRACT
In the past few years genome-wide association (GWA) studies have uncovered a large number of convincingly replicated associations for many complex human diseases. Genotype imputation has been used widely in the analysis of GWA studies to boost power, fine-map associations and facilitate the combination of results across studies using meta-analysis. This Review describes the details of several different statistical methods for imputing genotypes, illustrates and discusses the factors that influence imputation performance, and reviews methods that can be used to assess imputation performance and test association at imputed SNPs.
DOI
10.1038/nrg2796
Review-Peter
PUBMED_LINK
TITLE
Discovery and implications of polygenicity of common diseases.
Main citation
Visscher PM, Yengo L, Cox NJ, Wray NR. (2021) Discovery and implications of polygenicity of common diseases. Science, 373 (6562) 1468-1473. doi:10.1126/science.abi8206. PMID 34554790
ABSTRACT
The sequencing of the human genome has allowed the study of the genetic architecture of common diseases: the number of genomic variants that contribute to risk of disease and their joint frequency and effect size distribution. Common diseases are polygenic, with many loci contributing to phenotype, and the cumulative burden of risk alleles determines individual risk in conjunction with environmental factors. Most risk loci occur in noncoding regions of the genome regulating cell- and context-specific gene expression. Although the effect sizes of most risk alleles are small, their cumulative effects in individuals, quantified as a polygenic (risk) score, can identify people at increased risk of disease, thereby facilitating prevention or early intervention.
DOI
10.1126/science.abi8206
Review-Povysil
PUBMED_LINK
TITLE
Rare-variant collapsing analyses for complex traits: guidelines and applications.
Main citation
Povysil G, Petrovski S, Hostyk J, Aggarwal V, ...&, Goldstein DB. (2019) Rare-variant collapsing analyses for complex traits: guidelines and applications. Nat Rev Genet, 20 (12) 747-759. doi:10.1038/s41576-019-0177-4. PMID 31605095
ABSTRACT
The first phase of genome-wide association studies (GWAS) assessed the role of common variation in human disease. Advances optimizing and economizing high-throughput sequencing have enabled a second phase of association studies that assess the contribution of rare variation to complex disease in all protein-coding genes. Unlike the early microarray-based studies, sequencing-based studies catalogue the full range of genetic variation, including the evolutionarily youngest forms. Although the experience with common variants helped establish relevant standards for genome-wide studies, the analysis of rare variation introduces several challenges that require novel analysis approaches.
DOI
10.1038/s41576-019-0177-4
Review-Wang
PUBMED_LINK
TITLE
Challenges and Opportunities for Developing More Generalizable Polygenic Risk Scores.
Main citation
Wang Y, Tsuo K, Kanai M, Neale BM, ...&, Martin AR. (2022) Challenges and Opportunities for Developing More Generalizable Polygenic Risk Scores. Annu Rev Biomed Data Sci, 5, 293-320. doi:10.1146/annurev-biodatasci-111721-074830. PMID 35576555
ABSTRACT
Polygenic risk scores (PRS) estimate an individual's genetic likelihood of complex traits and diseases by aggregating information across multiple genetic variants identified from genome-wide association studies. PRS can predict a broad spectrum of diseases and have therefore been widely used in research settings. Some work has investigated their potential applications as biomarkers in preventative medicine, but significant work is still needed to definitively establish and communicate absolute risk to patients for genetic and modifiable risk factors across demographic groups. However, the biggest limitation of PRS currently is that they show poor generalizability across diverse ancestries and cohorts. Major efforts are underway through methodological development and data generation initiatives to improve their generalizability. This review aims to comprehensively discuss current progress on the development of PRS, the factors that affect their generalizability, and promising areas for improving their accuracy, portability, and implementation.
DOI
10.1146/annurev-biodatasci-111721-074830
reviews
PUBMED_LINK
TITLE
Strategies for Pathway Analysis Using GWAS and WGS Data.
Main citation
White MJ, Yaspan BL, Veatch OJ, Goddard P, ...&, Contreras MG. (2019) Strategies for Pathway Analysis Using GWAS and WGS Data. Curr Protoc Hum Genet, 100 (1) e79. doi:10.1002/cphg.79. PMID 30387919
ABSTRACT
Single-allele study designs, commonly used in genome-wide association studies (GWAS) as well as the more recently developed whole genome sequencing (WGS) studies, are a standard approach for investigating the relationship of common variation within the human genome to a given phenotype of interest. However, single-allele association results published for many GWAS studies represent only the tip of the iceberg for the information that can be extracted from these datasets. The primary analysis strategy for GWAS entails association analysis in which only the single nucleotide polymorphisms (SNPs) with the strongest p-values are declared statistically significant due to issues arising from multiple testing and type I errors. Factors such as locus heterogeneity, epistasis, and multiple genes conferring small effects contribute to the complexity of the genetic models underlying phenotype expression. Thus, many biologically meaningful associations having lower effect sizes at individual genes are overlooked, making it difficult to separate true associations from a sea of false-positive associations. Organizing these individual SNPs into biologically meaningful groups to look at the overall effects of minor perturbations to genes and pathways is desirable. This pathway-based approach provides researchers with insight into the functional foundations of the phenotype being studied and allows testing of various genetic scenarios. © 2018 by John Wiley & Sons, Inc.
DOI
10.1002/cphg.79
Reviews
PUBMED_LINK
TITLE
Genetics meets proteomics: perspectives for large population-based studies.
Main citation
Suhre K, McCarthy MI, Schwenk JM. (2021) Genetics meets proteomics: perspectives for large population-based studies. Nat Rev Genet, 22 (1) 19-37. doi:10.1038/s41576-020-0268-2. PMID 32860016
ABSTRACT
Proteomic analysis of cells, tissues and body fluids has generated valuable insights into the complex processes influencing human biology. Proteins represent intermediate phenotypes for disease and provide insight into how genetic and non-genetic risk factors are mechanistically linked to clinical outcomes. Associations between protein levels and DNA sequence variants that colocalize with risk alleles for common diseases can expose disease-associated pathways, revealing novel drug targets and translational biomarkers. However, genome-wide, population-scale analyses of proteomic data are only now emerging. Here, we review current findings from studies of the plasma proteome and discuss their potential for advancing biomedical translation through the interpretation of genome-wide association analyses. We highlight the challenges faced by currently available technologies and provide perspectives relevant to their future application in large-scale biobank studies.
DOI
10.1038/s41576-020-0268-2
Reviews&Tutorials
PUBMED_LINK
TITLE
Commentary: Two-sample Mendelian randomization: opportunities and challenges.
Main citation
Lawlor DA. (2016) Commentary: Two-sample Mendelian randomization: opportunities and challenges. Int J Epidemiol, 45 (3) 908-15. doi:10.1093/ije/dyw127. PMID 27427429
DOI
10.1093/ije/dyw127
RFR SuSiE-inf FINEMAP-inf (RFR)
PUBMED_LINK
FULL NAME
Replication Failure Rate
DESCRIPTION
Replication Failure Rate (RFR) is a metric that assesses the consistency of fine-mapping results by downsampling a large cohort. SuSiE-inf and FINEMAP-inf extend SuSiE and FINEMAP to incorporate a term for infinitesimal effects in addition to a small number of larger causal effects of interest.
URL
TITLE
Improving fine-mapping by modeling infinitesimal effects.
Main citation
Cui R, Elzur RA, Kanai M, Ulirsch JC, ...&, Finucane HK. (2024) Improving fine-mapping by modeling infinitesimal effects. Nat Genet, 56 (1) 162-169. doi:10.1038/s41588-023-01597-3. PMID 38036779
ABSTRACT
Fine-mapping aims to identify causal genetic variants for phenotypes. Bayesian fine-mapping algorithms (for example, SuSiE, FINEMAP, ABF and COJO-ABF) are widely used, but assessing posterior probability calibration remains challenging in real data, where model misspecification probably exists, and true causal variants are unknown. We introduce replication failure rate (RFR), a metric to assess fine-mapping consistency by downsampling. SuSiE, FINEMAP and COJO-ABF show high RFR, indicating potential overconfidence in their output. Simulations reveal that nonsparse genetic architecture can lead to miscalibration, while imputation noise, nonuniform distribution of causal variants and quality control filters have minimal impact. Here we present SuSiE-inf and FINEMAP-inf, fine-mapping methods modeling infinitesimal effects alongside fewer larger causal effects. Our methods show improved calibration, RFR and functional enrichment, competitive recall and computational efficiency. Notably, using our methods' posterior effect sizes substantially increases polygenic risk score accuracy over SuSiE and FINEMAP. Our work improves causal variant identification for complex traits, a fundamental goal of human genetics.
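A downsampling-based consistency metric of this kind can be sketched as follows. The function name, the dictionary inputs, and the PIP thresholds (0.9 for "confident in the downsampled cohort" and 0.1 for "not replicated in the full cohort") are assumptions for illustration; the entry above does not specify them, and this is not the authors' code.

```python
def replication_failure_rate(pip_down, pip_full, high=0.9, low=0.1):
    """Fraction of variants that are confidently fine-mapped in a downsampled
    cohort but receive a low posterior inclusion probability (PIP) in the
    full cohort.

    pip_down, pip_full: dicts mapping variant ID -> PIP.
    high, low: illustrative thresholds (assumptions, not taken from the entry).
    """
    confident = [v for v, p in pip_down.items() if p > high]
    if not confident:
        return None  # undefined when nothing is confidently fine-mapped
    # Assumption: a variant missing from the full-cohort results counts as a failure.
    failures = sum(1 for v in confident if pip_full.get(v, 0.0) < low)
    return failures / len(confident)
```

For example, with two confident variants in the downsampled analysis, one of which drops to a PIP of 0.05 in the full cohort, the function returns 0.5.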
DOI
10.1038/s41588-023-01597-3
RolyPoly
PUBMED_LINK
DESCRIPTION
RolyPoly is a regression-based polygenic model that can prioritize trait-relevant cell types and genes from GWAS summary statistics and gene expression data.
URL
TITLE
Inferring Relevant Cell Types for Complex Traits by Using Single-Cell Gene Expression.
Main citation
Calderon D, Bhaskar A, Knowles DA, Golan D, ...&, Pritchard JK. (2017) Inferring Relevant Cell Types for Complex Traits by Using Single-Cell Gene Expression. Am J Hum Genet, 101 (5) 686-699. doi:10.1016/j.ajhg.2017.09.009. PMID 29106824
ABSTRACT
Previous studies have prioritized trait-relevant cell types by looking for an enrichment of genome-wide association study (GWAS) signal within functional regions. However, these studies are limited in cell resolution by the lack of functional annotations from difficult-to-characterize or rare cell populations. Measurement of single-cell gene expression has become a popular method for characterizing novel cell types, and yet limited work has linked single-cell RNA sequencing (RNA-seq) to phenotypes of interest. To address this deficiency, we present RolyPoly, a regression-based polygenic model that can prioritize trait-relevant cell types and genes from GWAS summary statistics and gene expression data. RolyPoly is designed to use expression data from either bulk tissue or single-cell RNA-seq. In this study, we demonstrated RolyPoly's accuracy through simulation and validated previously known tissue-trait associations. We discovered a significant association between microglia and late-onset Alzheimer disease and an association between schizophrenia and oligodendrocytes and replicating fetal cortical cells. Additionally, RolyPoly computes a trait-relevance score for each gene to reflect the importance of expression specific to a cell type. We found that differentially expressed genes in the prefrontal cortex of individuals with Alzheimer disease were significantly enriched with genes ranked highly by RolyPoly gene scores. Overall, our method represents a powerful framework for understanding the effect of common variants on cell types contributing to complex traits.
DOI
10.1016/j.ajhg.2017.09.009
rtPRS-CS
FULL NAME
real-time PRS-CS
DESCRIPTION
rtPRS-CS is a Python-based command line tool that performs real-time online updating of polygenic risk score (PRS) weights in a target dataset, one sample at a time. Given the most recent set of SNP weights, for each new target sample with both phenotypic and genetic information, rtPRS-CS uses stochastic gradient descent to update the SNP weights, adjusting for the effect of a set of covariates.
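The one-sample-at-a-time update can be illustrated with a plain stochastic gradient step on squared error. This is a generic sketch, not the rtPRS-CS update rule: the function name `sgd_update`, the learning rate, and the absence of any shrinkage prior are assumptions made for illustration.

```python
def sgd_update(weights, cov_effects, genotypes, covariates, phenotype, lr=0.01):
    """One stochastic gradient step for a linear predictor
    y ~ sum(weights * genotypes) + sum(cov_effects * covariates).

    Returns the updated SNP weights, the updated covariate effects, and the
    residual computed before the update.
    """
    prediction = sum(w * g for w, g in zip(weights, genotypes))
    prediction += sum(b * c for b, c in zip(cov_effects, covariates))
    residual = phenotype - prediction
    # Gradient of 0.5 * residual**2 with respect to each coefficient is -residual * input.
    new_weights = [w + lr * residual * g for w, g in zip(weights, genotypes)]
    new_cov_effects = [b + lr * residual * c for b, c in zip(cov_effects, covariates)]
    return new_weights, new_cov_effects, residual
```

Calling the function once per incoming sample, and feeding the returned weights into the next call, gives the streaming behaviour described above.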
URL
PREPRINT_DOI
10.1101/2024.07.12.24310357
Main citation
Tubbs, J. D., Chen, Y., Duan, R., Huang, H. & Ge, T. Real-time dynamic polygenic prediction for streaming data. medRxiv 2024.07.12.24310357 (2024) doi:10.1101/2024.07.12.24310357.
RWAS
PUBMED_LINK
FULL NAME
Regulome-Wide Association Study
URL
TITLE
Allelic imbalance of chromatin accessibility in cancer identifies candidate causal risk variants and their mechanisms.
Main citation
Grishin D, Gusev A. (2022) Allelic imbalance of chromatin accessibility in cancer identifies candidate causal risk variants and their mechanisms. Nat Genet, 54 (6) 837-849. doi:10.1038/s41588-022-01075-2. PMID 35697866
ABSTRACT
While many germline cancer risk variants have been identified through genome-wide association studies (GWAS), the mechanisms by which these variants operate remain largely unknown. Here we used 406 cancer ATAC-Seq samples across 23 cancer types to identify 7,262 germline allele-specific accessibility QTLs (as-aQTLs). Cancer as-aQTLs had stronger enrichment for cancer risk heritability (up to 145 fold) than any other functional annotation across seven cancer GWAS. Most cancer as-aQTLs directly altered transcription factor (TF) motifs and exhibited differential TF binding and gene expression in functional screens. To connect as-aQTLs to putative risk mechanisms, we introduced the regulome-wide association study (RWAS). RWAS identified genetically associated accessible peaks at >70% of known breast and prostate loci and discovered new risk loci in all examined cancer types. Integrating as-aQTL discovery, motif analysis and RWAS identified candidate causal regulatory elements and their probable upstream regulators. Our work establishes cancer as-aQTLs and RWAS analysis as powerful tools to study the genetic architecture of cancer risk.
DOI
10.1038/s41588-022-01075-2
S-LDXR
PUBMED_LINK
DESCRIPTION
S-LDXR is software for estimating enrichment of stratified squared trans-ethnic genetic correlation across genomic annotations from GWAS summary statistics data.
URL
KEYWORDS
trans-ethnic, stratified, functional categories
TITLE
Population-specific causal disease effect sizes in functionally important regions impacted by selection.
Main citation
Shi H, Gazal S, Kanai M, Koch EM, ...&, Price AL. (2021) Population-specific causal disease effect sizes in functionally important regions impacted by selection. Nat Commun, 12 (1) 1098. doi:10.1038/s41467-021-21286-1. PMID 33597505
ABSTRACT
Many diseases exhibit population-specific causal effect sizes with trans-ethnic genetic correlations significantly less than 1, limiting trans-ethnic polygenic risk prediction. We develop a new method, S-LDXR, for stratifying squared trans-ethnic genetic correlation across genomic annotations, and apply S-LDXR to genome-wide summary statistics for 31 diseases and complex traits in East Asians (average N = 90K) and Europeans (average N = 267K) with an average trans-ethnic genetic correlation of 0.85. We determine that squared trans-ethnic genetic correlation is 0.82× (s.e. 0.01) depleted in the top quintile of background selection statistic, implying more population-specific causal effect sizes. Accordingly, causal effect sizes are more population-specific in functionally important regions, including conserved and regulatory regions. In regions surrounding specifically expressed genes, causal effect sizes are most population-specific for skin and immune genes, and least population-specific for brain genes. Our results could potentially be explained by stronger gene-environment interaction at loci impacted by selection, particularly positive selection.
DOI
10.1038/s41467-021-21286-1
S-PrediXcan
PUBMED_LINK
DESCRIPTION
S-PrediXcan is a mathematical expression for computing PrediXcan (a gene mapping approach) results from summary data.
URL
TITLE
Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics.
Main citation
Barbeira AN, Dickinson SP, Bonazzola R, Zheng J, ...&, Im HK. (2018) Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nat Commun, 9 (1) 1825. doi:10.1038/s41467-018-03621-1. PMID 29739930
ABSTRACT
Scalable, integrative methods to understand mechanisms that link genetic variants with phenotypes are needed. Here we derive a mathematical expression to compute PrediXcan (a gene mapping approach) results using summary data (S-PrediXcan) and show its accuracy and general robustness to misspecified reference sets. We apply this framework to 44 GTEx tissues and 100+ phenotypes from GWAS and meta-analysis studies, creating a growing public catalog of associations that seeks to capture the effects of gene expression variation on human phenotypes. Replication in an independent cohort is shown. Most of the associations are tissue specific, suggesting context specificity of the trait etiology. Colocalized significant associations in unexpected tissues underscore the need for an agnostic scanning of multiple contexts to improve our ability to detect causal regulatory mechanisms. Monogenic disease genes are enriched among significant associations for related traits, suggesting that smaller alterations of these genes may cause a spectrum of milder phenotypes.
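The summary-data expression is usually written as Z_g ≈ Σ_l w_lg (σ_l / σ_g) (β_l / se(β_l)), where w_lg are the expression-prediction weights for gene g, β_l and se(β_l) are the GWAS effect size and standard error of SNP l, σ_l is the SNP's standard deviation in a reference panel, and σ_g² = wᵀ Γ w is the variance of predicted expression under the reference SNP covariance Γ. A minimal sketch of that computation follows; the function name and list-based inputs are assumptions for illustration, not the MetaXcan software API.

```python
import math


def spredixcan_z(weights, gwas_beta, gwas_se, snp_cov):
    """Gene-level Z-score from summary statistics (illustrative sketch).

    weights: prediction weights w_l for the SNPs in the gene's model.
    gwas_beta, gwas_se: GWAS effect sizes and standard errors for the same SNPs.
    snp_cov: reference-panel covariance matrix of those SNPs (list of lists).
    """
    n = len(weights)
    # Variance of predicted expression: w' * Gamma * w.
    var_g = sum(
        weights[i] * snp_cov[i][j] * weights[j] for i in range(n) for j in range(n)
    )
    sigma_g = math.sqrt(var_g)
    z = 0.0
    for l in range(n):
        sigma_l = math.sqrt(snp_cov[l][l])
        z += weights[l] * (sigma_l / sigma_g) * (gwas_beta[l] / gwas_se[l])
    return z
```

With a single SNP in the model, the weight and variance terms cancel and the gene-level Z-score reduces to that SNP's GWAS Z-score.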
DOI
10.1038/s41467-018-03621-1
SAIGE
PUBMED_LINK
FULL NAME
Scalable and Accurate Implementation of GEneralized mixed model
DESCRIPTION
SAIGE is an R package that provides a Scalable and Accurate Implementation of GEneralized mixed model (Chen, H. et al., 2016). It accounts for sample relatedness and is feasible for genetic association tests in large cohorts and biobanks (N > 400,000). SAIGE performs single-variant association tests for binary and quantitative traits. For binary traits, SAIGE uses the saddlepoint approximation (SPA) (Imhof, J. P., 1961; Kuonen, D., 1999; Dey, R. et al., 2017) to account for case-control imbalance.
URL
KEYWORDS
case-control imbalance, saddlepoint approximation (SPA)
TITLE
Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies.
Main citation
Zhou W, Nielsen JB, Fritsche LG, Dey R, ...&, Lee S. (2018) Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat Genet, 50 (9) 1335-1341. doi:10.1038/s41588-018-0184-y. PMID 30104761
ABSTRACT
In genome-wide association studies (GWAS) for thousands of phenotypes in large biobanks, most binary traits have substantially fewer cases than controls. Both of the widely used approaches, the linear mixed model and the recently proposed logistic mixed model, perform poorly; they produce large type I error rates when used to analyze unbalanced case-control phenotypes. Here we propose a scalable and accurate generalized mixed model association test that uses the saddlepoint approximation to calibrate the distribution of score test statistics. This method, SAIGE (Scalable and Accurate Implementation of GEneralized mixed model), provides accurate P values even when case-control ratios are extremely unbalanced. SAIGE uses state-of-art optimization strategies to reduce computational costs; hence, it is applicable to GWAS for thousands of phenotypes by large biobanks. Through the analysis of UK Biobank data of 408,961 samples from white British participants with European ancestry for > 1,400 binary phenotypes, we show that SAIGE can efficiently analyze large sample data, controlling for unbalanced case-control ratios and sample relatedness.
DOI
10.1038/s41588-018-0184-y
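The saddlepoint idea in the SAIGE entry can be illustrated with a small, self-contained sketch. It approximates the upper tail of a score statistic S = sum_i g_i (Y_i - mu_i) with independent Bernoulli outcomes and compares it with the normal approximation. This is not SAIGE's implementation: it ignores sample relatedness, covariate adjustment, and the variance-ratio calibration, and the function names (`spa_upper_tail`, `normal_upper_tail`) are hypothetical.

```python
import math


def normal_upper_tail(s, variance):
    """Normal approximation to P(S >= s) for a mean-zero score statistic."""
    return 0.5 * math.erfc(s / math.sqrt(2.0 * variance))


def spa_upper_tail(genotypes, mu, s):
    """Saddlepoint approximation to P(S >= s), S = sum_i g_i * (Y_i - mu_i).

    Assumes independent Y_i ~ Bernoulli(mu_i), small non-negative genotype
    dosages, and a value of s inside the attainable range of S.
    """
    pairs = list(zip(genotypes, mu))

    def cgf(t):  # cumulant generating function K(t) of S
        return sum(math.log(1 - m + m * math.exp(g * t)) - t * g * m
                   for g, m in pairs)

    def cgf_d1(t):  # K'(t)
        total = 0.0
        for g, m in pairs:
            e = m * math.exp(g * t)
            total += g * e / (1 - m + e) - g * m
        return total

    def cgf_d2(t):  # K''(t)
        total = 0.0
        for g, m in pairs:
            e = m * math.exp(g * t)
            total += g * g * e * (1 - m) / (1 - m + e) ** 2
        return total

    # K'(t) is increasing in t, so bisection locates the saddlepoint K'(t) = s.
    lo, hi = -30.0, 30.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if cgf_d1(mid) < s:
            lo = mid
        else:
            hi = mid
    t_hat = 0.5 * (lo + hi)

    # Near the mean the saddlepoint formula is numerically unstable.
    if abs(t_hat) < 1e-6:
        return normal_upper_tail(s, cgf_d2(0.0))

    w = math.copysign(
        math.sqrt(max(2.0 * (t_hat * s - cgf(t_hat)), 0.0)), t_hat)
    v = t_hat * math.sqrt(cgf_d2(t_hat))
    # Barndorff-Nielsen form: P(S >= s) ~ 1 - Phi(w + log(v / w) / w)
    return 0.5 * math.erfc((w + math.log(v / w) / w) / math.sqrt(2.0))
```

With 200 carriers of one allele each and a 1% case probability, the score distribution is strongly right-skewed, and the saddlepoint tail probability is much larger than the normal one. This is the type of miscalibration that the SPA is meant to correct under case-control imbalance.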
SAIGE-GENE+
PUBMED_LINK
URL
TITLE
SAIGE-GENE+ improves the efficiency and accuracy of set-based rare variant association tests.
Main citation
Zhou W, Bi W, Zhao Z, Dey KK, ...&, Lee S. (2022) SAIGE-GENE+ improves the efficiency and accuracy of set-based rare variant association tests. Nat Genet, 54 (10) 1466-1469. doi:10.1038/s41588-022-01178-w. PMID 36138231
ABSTRACT
Several biobanks, including UK Biobank (UKBB), are generating large-scale sequencing data. An existing method, SAIGE-GENE, performs well when testing variants with minor allele frequency (MAF) ≤ 1%, but inflation is observed in variance component set-based tests when restricting to variants with MAF ≤ 0.1% or 0.01%. Here, we propose SAIGE-GENE+ with greatly improved type I error control and computational efficiency to facilitate rare variant tests in large-scale data. We further show that incorporating multiple MAF cutoffs and functional annotations can improve power and thus uncover new gene-phenotype associations. In the analysis of UKBB whole exome sequencing data for 30 quantitative and 141 binary traits, SAIGE-GENE+ identified 551 gene-phenotype associations.
DOI
10.1038/s41588-022-01178-w
SAIGE-QTL
DESCRIPTION
SAIGE-QTL is a robust and scalable tool that can directly map eQTLs using single-cell profiles without needing aggregation at the pseudobulk level.
URL
KEYWORDS
single-cell eQTL, rare variant, set-based test, trans-eQTL, SPA
Main citation
Zhou, W., Cuomo, A., Xue, A., Kanai, M., Chau, G., Krishna, C., ... & Neale, B. M. (2024). Efficient and accurate mixed model association tool for single-cell eQTL analysis. medRxiv, 2024-05.
Sakaue
PUBMED_LINK
URL
KEYWORDS
HLA analyses tutorial
TITLE
Tutorial: a statistical genetics guide to identifying HLA alleles driving complex disease.
Main citation
Sakaue S, Gurajala S, Curtis M, Luo Y, ...&, Raychaudhuri S. (2023) Tutorial: a statistical genetics guide to identifying HLA alleles driving complex disease. Nat Protoc, 18 (9) 2625-2641. doi:10.1038/s41596-023-00853-4. PMID 37495751
ABSTRACT
The human leukocyte antigen (HLA) locus is associated with more complex diseases than any other locus in the human genome. In many diseases, HLA explains more heritability than all other known loci combined. In silico HLA imputation methods enable rapid and accurate estimation of HLA alleles in the millions of individuals that are already genotyped on microarrays. HLA imputation has been used to define causal variation in autoimmune diseases, such as type I diabetes, and in human immunodeficiency virus infection control. However, there are few guidelines on performing HLA imputation, association testing, and fine mapping. Here, we present a comprehensive tutorial to impute HLA alleles from genotype data. We provide detailed guidance on performing standard quality control measures for input genotyping data and describe options to impute HLA alleles and amino acids either locally or using the web-based Michigan Imputation Server, which hosts a multi-ancestry HLA imputation reference panel. We also offer best practice recommendations to conduct association tests to define the alleles, amino acids, and haplotypes that affect human traits. Along with the pipeline, we provide a step-by-step online guide with scripts and available software ( https://github.com/immunogenomics/HLA_analyses_tutorial ). This tutorial will be broadly applicable to large-scale genotyping data and will contribute to defining the role of HLA in human diseases across global populations.
DOI
10.1038/s41596-023-00853-4
Salinas
PUBMED_LINK
TITLE
Statistical Analysis of Multiple Phenotypes in Genetic Epidemiologic Studies: From Cross-Phenotype Associations to Pleiotropy.
Main citation
Salinas YD, Wang Z, DeWan AT. (2018) Statistical Analysis of Multiple Phenotypes in Genetic Epidemiologic Studies: From Cross-Phenotype Associations to Pleiotropy. Am J Epidemiol, 187 (4) 855-863. doi:10.1093/aje/kwx296. PMID 29020254
ABSTRACT
In the context of genetics, pleiotropy refers to the phenomenon in which a single genetic locus affects more than 1 trait or disease. Genetic epidemiologic studies have identified loci associated with multiple phenotypes, and these cross-phenotype associations are often incorrectly interpreted as examples of pleiotropy. Pleiotropy is only one possible explanation for cross-phenotype associations. Cross-phenotype associations may also arise due to issues related to study design, confounder bias, or nongenetic causal links between the phenotypes under analysis. Therefore, it is necessary to dissect cross-phenotype associations carefully to uncover true pleiotropic loci. In this review, we describe statistical methods that can be used to identify robust statistical evidence of pleiotropy. First, we provide an overview of univariate and multivariate methods for discovery of cross-phenotype associations and highlight important considerations for choosing among available methods. Then, we describe how to dissect cross-phenotype associations by using mediation analysis. Pleiotropic loci provide insights into the mechanistic underpinnings of disease comorbidity, and they may serve as novel targets for interventions that simultaneously treat multiple diseases. Discerning between different types of cross-phenotype associations is necessary to realize the public health potential of pleiotropic loci.
DOI
10.1093/aje/kwx296
Sanger
PUBMED_LINK
URL
TITLE
A reference panel of 64,976 haplotypes for genotype imputation.
Main citation
McCarthy S, Das S, Kretzschmar W, Delaneau O, ...&, Haplotype Reference Consortium. (2016) A reference panel of 64,976 haplotypes for genotype imputation. Nat Genet, 48 (10) 1279-83. doi:10.1038/ng.3643. PMID 27548312
ABSTRACT
We describe a reference panel of 64,976 human haplotypes at 39,235,157 SNPs constructed using whole-genome sequence data from 20 studies of predominantly European ancestry. Using this resource leads to accurate genotype imputation at minor allele frequencies as low as 0.1% and a large increase in the number of SNPs tested in association studies, and it can help to discover and refine causal loci. We describe remote server resources that allow researchers to carry out imputation and phasing consistently and efficiently.
DOI
10.1038/ng.3643
SARGE
PUBMED_LINK
DESCRIPTION
Schaefer, N. K., Shapiro, B. & Green, R. E. An ancestral recombination graph of human, Neanderthal, and Denisovan genomes. Sci. Adv. 7, eabc0776 (2021).
TITLE
An ancestral recombination graph of human, Neanderthal, and Denisovan genomes.
Main citation
Schaefer NK, Shapiro B, Green RE. (2021) An ancestral recombination graph of human, Neanderthal, and Denisovan genomes. Sci Adv, 7 (29) . doi:10.1126/sciadv.abc0776. PMID 34272242
ABSTRACT
Many humans carry genes from Neanderthals, a legacy of past admixture. Existing methods detect this archaic hominin ancestry within human genomes using patterns of linkage disequilibrium or direct comparison to Neanderthal genomes. Each of these methods is limited in sensitivity and scalability. We describe a new ancestral recombination graph inference algorithm that scales to large genome-wide datasets and demonstrate its accuracy on real and simulated data. We then generate a genome-wide ancestral recombination graph including human and archaic hominin genomes. From this, we generate a map within human genomes of archaic ancestry and of genomic regions not shared with archaic hominins either by admixture or incomplete lineage sorting. We find that only 1.5 to 7% of the modern human genome is uniquely human. We also find evidence of multiple bursts of adaptive changes specific to modern humans within the past 600,000 years involving genes related to brain development and function.
DOI
10.1126/sciadv.abc0776
SBayesR
PUBMED_LINK
DESCRIPTION
SBayesR extends a powerful individual-level data Bayesian multiple regression model (BayesR) to one that utilises summary statistics from genome-wide association studies.
URL
TITLE
Improved polygenic prediction by Bayesian multiple regression on summary statistics.
Main citation
Lloyd-Jones LR, Zeng J, Sidorenko J, Yengo L, ...&, Visscher PM. (2019) Improved polygenic prediction by Bayesian multiple regression on summary statistics. Nat Commun, 10 (1) 5086. doi:10.1038/s41467-019-12653-0. PMID 31704910
ABSTRACT
Accurate prediction of an individual's phenotype from their DNA sequence is one of the great promises of genomics and precision medicine. We extend a powerful individual-level data Bayesian multiple regression model (BayesR) to one that utilises summary statistics from genome-wide association studies (GWAS), SBayesR. In simulation and cross-validation using 12 real traits and 1.1 million variants on 350,000 individuals from the UK Biobank, SBayesR improves prediction accuracy relative to commonly used state-of-the-art summary statistics methods at a fraction of the computational resources. Furthermore, using summary statistics for variants from the largest GWAS meta-analysis (n ≈ 700,000) on height and BMI, we show that on average across traits and two independent data sets that SBayesR improves prediction R2 by 5.2% relative to LDpred and by 26.5% relative to clumping and p value thresholding.
DOI
10.1038/s41467-019-12653-0
SBayesRC
PUBMED_LINK
DESCRIPTION
SBayesRC integrates GWAS summary statistics with functional genomic annotations to improve polygenic prediction of complex traits.
URL
KEYWORDS
functional genomic annotation, whole-genome variants, cross-ancestry
TITLE
Leveraging functional genomic annotations and genome coverage to improve polygenic prediction of complex traits within and between ancestries.
Main citation
Zheng Z, Liu S, Sidorenko J, Wang Y, ...&, Zeng J. (2024) Leveraging functional genomic annotations and genome coverage to improve polygenic prediction of complex traits within and between ancestries. Nat Genet, 56 (5) 767-777. doi:10.1038/s41588-024-01704-y. PMID 38689000
ABSTRACT
We develop a method, SBayesRC, that integrates genome-wide association study (GWAS) summary statistics with functional genomic annotations to improve polygenic prediction of complex traits. Our method is scalable to whole-genome variant analysis and refines signals from functional annotations by allowing them to affect both causal variant probability and causal effect distribution. We analyze 50 complex traits and diseases using ∼7 million common single-nucleotide polymorphisms (SNPs) and 96 annotations. SBayesRC improves prediction accuracy by 14% in European ancestry and up to 34% in cross-ancestry prediction compared to the baseline method SBayesR, which does not use annotations, and outperforms other methods, including LDpred2, LDpred-funct, MegaPRS, PolyPred-S and PRS-CSx. Investigation of factors affecting prediction accuracy identifies a significant interaction between SNP density and annotation information, suggesting whole-genome sequence variants with annotations may further improve prediction. Functional partitioning analysis highlights a major contribution of evolutionary constrained regions to prediction accuracy and the largest per-SNP contribution from nonsynonymous SNPs.
DOI
10.1038/s41588-024-01704-y
SBayesS
PUBMED_LINK
DESCRIPTION
SBayesS estimates multiple genetic architecture parameters, including selection signature, using only GWAS summary statistics.
URL
TITLE
Widespread signatures of natural selection across human complex traits and functional genomic categories.
Main citation
Zeng J, Xue A, Jiang L, Lloyd-Jones LR, ...&, Yang J. (2021) Widespread signatures of natural selection across human complex traits and functional genomic categories. Nat Commun, 12 (1) 1164. doi:10.1038/s41467-021-21446-3. PMID 33608517
ABSTRACT
Understanding how natural selection has shaped genetic architecture of complex traits is of importance in medical and evolutionary genetics. Bayesian methods have been developed using individual-level GWAS data to estimate multiple genetic architecture parameters including selection signature. Here, we present a method (SBayesS) that only requires GWAS summary statistics. We analyse data for 155 complex traits (n = 27k-547k) and project the estimates onto those obtained from evolutionary simulations. We estimate that, on average across traits, about 1% of human genome sequence are mutational targets with a mean selection coefficient of ~0.001. Common diseases, on average, show a smaller number of mutational targets and have been under stronger selection, compared to other traits. SBayesS analyses incorporating functional annotations reveal that selection signatures vary across genomic regions, among which coding regions have the strongest selection signature and are enriched for both the number of associated variants and the magnitude of effect sizes.
DOI
10.1038/s41467-021-21446-3
sc-linker
PUBMED_LINK
DESCRIPTION
a framework for integrating single-cell RNA-sequencing, epigenomic SNP-to-gene maps and genome-wide association study summary statistics to infer the underlying cell types and processes by which genetic variants influence disease
URL
KEYWORDS
GWAS, scRNA-seq
TITLE
Identifying disease-critical cell types and cellular processes by integrating single-cell RNA-sequencing and human genetics.
Main citation
Jagadeesh KA, Dey KK, Montoro DT, Mohan R, ...&, Regev A. (2022) Identifying disease-critical cell types and cellular processes by integrating single-cell RNA-sequencing and human genetics. Nat Genet, 54 (10) 1479-1492. doi:10.1038/s41588-022-01187-9. PMID 36175791
ABSTRACT
Genome-wide association studies provide a powerful means of identifying loci and genes contributing to disease, but in many cases, the related cell types/states through which genes confer disease risk remain unknown. Deciphering such relationships is important for identifying pathogenic processes and developing therapeutics. In the present study, we introduce sc-linker, a framework for integrating single-cell RNA-sequencing, epigenomic SNP-to-gene maps and genome-wide association study summary statistics to infer the underlying cell types and processes by which genetic variants influence disease. The inferred disease enrichments recapitulated known biology and highlighted notable cell-disease relationships, including γ-aminobutyric acid-ergic neurons in major depressive disorder, a disease-dependent M-cell program in ulcerative colitis and a disease-specific complement cascade process in multiple sclerosis. In autoimmune disease, both healthy and disease-dependent immune cell-type programs were associated, whereas only disease-dependent epithelial cell programs were prominent, suggesting a role in disease response rather than initiation. Our framework provides a powerful approach for identifying the cell types and cellular processes by which genetic variants influence disease.
DOI
10.1038/s41588-022-01187-9
ARROW_SUMMARY
scRNA-seq data → Derive cell-type-specific gene programs → Map SNPs to genes using epigenomic data → Integrate with GWAS summary statistics → Identify disease-critical cell types and processes
SCARlink
PUBMED_LINK
FULL NAME
single-cell ATAC + RNA linking
DESCRIPTION
Single-cell ATAC+RNA linking (SCARlink) uses multiomic single-cell ATAC and RNA to predict gene expression from chromatin accessibility and predict regulatory regions.
URL
KEYWORDS
Poisson regression, scATAC, scRNA, tile-level accessibility
TITLE
Single-cell multi-ome regression models identify functional and disease-associated enhancers and enable chromatin potential analysis.
Main citation
Mitra S, Malik R, Wong W, Rahman A, ...&, Leslie CS. (2024) Single-cell multi-ome regression models identify functional and disease-associated enhancers and enable chromatin potential analysis. Nat Genet, 56 (4) 627-636. doi:10.1038/s41588-024-01689-8. PMID 38514783
ABSTRACT
We present a gene-level regulatory model, single-cell ATAC + RNA linking (SCARlink), which predicts single-cell gene expression and links enhancers to target genes using multi-ome (scRNA-seq and scATAC-seq co-assay) sequencing data. The approach uses regularized Poisson regression on tile-level accessibility data to jointly model all regulatory effects at a gene locus, avoiding the limitations of pairwise gene-peak correlations and dependence on peak calling. SCARlink outperformed existing gene scoring methods for imputing gene expression from chromatin accessibility across high-coverage multi-ome datasets while giving comparable to improved performance on low-coverage datasets. Shapley value analysis on trained models identified cell-type-specific gene enhancers that are validated by promoter capture Hi-C and are 11× to 15× and 5× to 12× enriched in fine-mapped eQTLs and fine-mapped genome-wide association study (GWAS) variants, respectively. We further show that SCARlink-predicted and observed gene expression vectors provide a robust way to compute a chromatin potential vector field to enable developmental trajectory analysis.
DOI
10.1038/s41588-024-01689-8
ARROW_SUMMARY
scRNA-seq + scATAC-seq → Tile-level chromatin accessibility modeling → Regularized Poisson regression (SCARlink) → Predict gene expression & link enhancers to genes → Identify functional and disease-associated enhancers
SCAVENGE
PUBMED_LINK
FULL NAME
Single Cell Analysis of Variant Enrichment through Network propagation of GEnomic data
URL
KEYWORDS
GWAS, scATAC, network propagation
TITLE
Variant to function mapping at single-cell resolution through network propagation.
Main citation
Yu F, Cato LD, Weng C, Liggett LA, ...&, Sankaran VG. (2022) Variant to function mapping at single-cell resolution through network propagation. Nat Biotechnol, 40 (11) 1644-1653. doi:10.1038/s41587-022-01341-y. PMID 35668323
ABSTRACT
Genome-wide association studies in combination with single-cell genomic atlases can provide insights into the mechanisms of disease-causal genetic variation. However, identification of disease-relevant or trait-relevant cell types, states and trajectories is often hampered by sparsity and noise, particularly in the analysis of single-cell epigenomic data. To overcome these challenges, we present SCAVENGE, a computational algorithm that uses network propagation to map causal variants to their relevant cellular context at single-cell resolution. We demonstrate how SCAVENGE can help identify key biological mechanisms underlying human genetic variation, applying the method to blood traits at distinct stages of human hematopoiesis, to monocyte subsets that increase the risk for severe Coronavirus Disease 2019 (COVID-19) and to intermediate lymphocyte developmental states that predispose to acute leukemia. Our approach not only provides a framework for enabling variant-to-function insights at single-cell resolution but also suggests a more general strategy for maximizing the inferences that can be made using single-cell genomic data.
DOI
10.1038/s41587-022-01341-y
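Network propagation, as named in the SCAVENGE entry, is commonly implemented as a random walk with restart on a cell-cell graph. The sketch below shows that generic technique on a small adjacency matrix. How SCAVENGE builds its cell graph and picks seed cells is outside this sketch, and the function name and default restart probability are assumptions.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.05,
                             tol=1e-9, max_iter=1000):
    """Propagate seed-cell signal over a graph (generic sketch).

    adjacency: square list of lists with non-negative edge weights.
    seeds: set of node indices where the walk restarts.
    Returns one propagation score per node.
    """
    n = len(adjacency)
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    start = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    scores = start[:]
    for _ in range(max_iter):
        updated = []
        for i in range(n):
            # Mass flowing into node i from its neighbours
            # (column-normalised transition weights).
            flow = sum(adjacency[i][j] / col_sums[j] * scores[j]
                       for j in range(n) if col_sums[j] > 0)
            updated.append((1.0 - restart) * flow + restart * start[i])
        change = sum(abs(a - b) for a, b in zip(updated, scores))
        scores = updated
        if change < tol:
            break
    return scores
```

Cells close to the seeds in the graph end up with higher scores than distant cells, which is how sparse per-cell signal can be smoothed across similar cells.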
scDRS
PUBMED_LINK
FULL NAME
single-cell Disease Relevance Score
DESCRIPTION
an approach that links scRNA-seq with polygenic disease risk at single-cell resolution, independent of annotated cell types
URL
KEYWORDS
GWAS, scRNA-seq
TITLE
Polygenic enrichment distinguishes disease associations of individual cells in single-cell RNA-seq data.
Main citation
Zhang MJ, Hou K, Dey KK, Sakaue S, ...&, Price AL. (2022) Polygenic enrichment distinguishes disease associations of individual cells in single-cell RNA-seq data. Nat Genet, 54 (10) 1572-1580. doi:10.1038/s41588-022-01167-z. PMID 36050550
ABSTRACT
Single-cell RNA sequencing (scRNA-seq) provides unique insights into the pathology and cellular origin of disease. We introduce single-cell disease relevance score (scDRS), an approach that links scRNA-seq with polygenic disease risk at single-cell resolution, independent of annotated cell types. scDRS identifies cells exhibiting excess expression across disease-associated genes implicated by genome-wide association studies (GWASs). We applied scDRS to 74 diseases/traits and 1.3 million single-cell gene-expression profiles across 31 tissues/organs. Cell-type-level results broadly recapitulated known cell-type-disease associations. Individual-cell-level results identified subpopulations of disease-associated cells not captured by existing cell-type labels, including T cell subpopulations associated with inflammatory bowel disease, partially characterized by their effector-like states; neuron subpopulations associated with schizophrenia, partially characterized by their spatial locations; and hepatocyte subpopulations associated with triglyceride levels, partially characterized by their higher ploidy levels. Genes whose expression was correlated with the scDRS score across cells (reflecting coexpression with GWAS disease-associated genes) were strongly enriched for gold-standard drug target and Mendelian disease genes.
DOI
10.1038/s41588-022-01167-z
ARROW_SUMMARY
GWAS summary statistics → Select putative disease genes via MAGMA → Compute scDRS using Monte Carlo-based score aggregation → Normalize with control gene sets → Rank cells by disease relevance → Identify enriched subpopulations and co-expressed gene networks
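The Monte Carlo step in the summary above can be sketched as follows: each cell's score over the disease gene set is compared with its scores over random control gene sets of the same size. This is a simplified illustration, not the scDRS implementation. It uses unweighted mean expression and unmatched random controls, and the function name and input layout are hypothetical.

```python
import random


def disease_score_pvalues(expr, disease_genes, n_controls=200, seed=0):
    """Per-cell Monte Carlo p-values for a disease gene set (sketch).

    expr: dict mapping gene name -> list of per-cell expression values.
    disease_genes: list of gene names present in expr.
    """
    rng = random.Random(seed)
    n_cells = len(next(iter(expr.values())))
    disease = set(disease_genes)
    background = sorted(g for g in expr if g not in disease)

    def cell_scores(gene_set):
        # Mean expression of the gene set in each cell.
        return [sum(expr[g][c] for g in gene_set) / len(gene_set)
                for c in range(n_cells)]

    raw = cell_scores(disease_genes)
    exceed = [0] * n_cells
    for _ in range(n_controls):
        control = rng.sample(background, len(disease_genes))
        control_scores = cell_scores(control)
        for c in range(n_cells):
            if control_scores[c] >= raw[c]:
                exceed[c] += 1
    # Add-one correction keeps p-values strictly positive.
    return [(1 + e) / (1 + n_controls) for e in exceed]
```

Cells with small p-values express the disease genes more than expected from comparable random gene sets. Ranking cells by these values gives a per-cell view that does not depend on cell-type labels.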
SCENT
PUBMED_LINK
FULL NAME
single-cell enhancer target gene mapping
DESCRIPTION
SCENT uses single-cell multimodal data (e.g., 10X Multiome RNA/ATAC) and links ATAC-seq peaks (putative enhancers) to their target genes by modeling association between chromatin accessibility and gene expression across individual single cells.
URL
KEYWORDS
Poisson regression, scATAC-seq, scRNA-seq
TITLE
Tissue-specific enhancer-gene maps from multimodal single-cell data identify causal disease alleles.
Main citation
Sakaue S, Weinand K, Isaac S, Dey KK, ...&, Raychaudhuri S. (2024) Tissue-specific enhancer-gene maps from multimodal single-cell data identify causal disease alleles. Nat Genet, 56 (4) 615-626. doi:10.1038/s41588-024-01682-1. PMID 38594305
ABSTRACT
Translating genome-wide association study (GWAS) loci into causal variants and genes requires accurate cell-type-specific enhancer-gene maps from disease-relevant tissues. Building enhancer-gene maps is essential but challenging with current experimental methods in primary human tissues. Here we developed a nonparametric statistical method, SCENT (single-cell enhancer target gene mapping), that models association between enhancer chromatin accessibility and gene expression in single-cell or nucleus multimodal RNA sequencing and ATAC sequencing data. We applied SCENT to 9 multimodal datasets including >120,000 single cells or nuclei and created 23 cell-type-specific enhancer-gene maps. These maps were highly enriched for causal variants in expression quantitative loci and GWAS for 1,143 diseases and traits. We identified likely causal genes for both common and rare diseases and linked somatic mutation hotspots to target genes. We demonstrate that application of SCENT to multimodal data from disease-relevant human tissue enables the scalable construction of accurate cell-type-specific enhancer-gene maps, essential for defining noncoding variant function.
DOI
10.1038/s41588-024-01682-1
ARROW_SUMMARY
Extract chromatin accessibility (ATAC-seq) & gene expression (RNA-seq) from single cells → Group cells by type → For each gene, define candidate enhancers within 1 Mb → Use distance-weighted non-parametric regression to model enhancer–gene associations → Assess significance via permutation testing → Build enhancer–gene links per cell type
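The significance step in the summary above can be illustrated with a generic permutation test for one peak-gene pair within a cell type. The sketch below is a stand-in and not SCENT's estimator: it uses a difference in mean expression between cells with an open and a closed peak instead of a regression model, and the function name and inputs are hypothetical.

```python
import random


def permutation_pvalue(accessibility, expression, n_perm=1000, seed=0):
    """Permutation p-value for a peak-gene association (generic sketch).

    accessibility: list of 0/1 flags (peak open in the cell or not);
                   must contain at least one 0 and one 1.
    expression: list of expression values for the gene, same cell order.
    """
    rng = random.Random(seed)

    def statistic(flags):
        open_cells = [e for f, e in zip(flags, expression) if f]
        closed_cells = [e for f, e in zip(flags, expression) if not f]
        return abs(sum(open_cells) / len(open_cells)
                   - sum(closed_cells) / len(closed_cells))

    observed = statistic(accessibility)
    shuffled = list(accessibility)
    hits = 0
    for _ in range(n_perm):
        # Shuffling the flags breaks any real peak-gene link
        # while keeping the number of open cells fixed.
        rng.shuffle(shuffled)
        if statistic(shuffled) >= observed:
            hits += 1
    return (1 + hits) / (1 + n_perm)
```

Running this for every candidate peak near a gene, separately per cell type, would give a table of peak-gene links with empirical p-values. Multiple-testing correction would still be needed afterwards.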
scGWAS
PUBMED_LINK
FULL NAME
scRNA-seq assisted GWAS analysis
DESCRIPTION
scGWAS leverages scRNA-seq data to identify the genetically mediated associations between traits and cell types.
URL
TITLE
scGWAS: landscape of trait-cell type associations by integrating single-cell transcriptomics-wide and genome-wide association studies.
Main citation
Jia P, Hu R, Yan F, Dai Y, ...&, Zhao Z. (2022) scGWAS: landscape of trait-cell type associations by integrating single-cell transcriptomics-wide and genome-wide association studies. Genome Biol, 23 (1) 220. doi:10.1186/s13059-022-02785-w. PMID 36253801
ABSTRACT
BACKGROUND: The rapid accumulation of single-cell RNA sequencing (scRNA-seq) data presents unique opportunities to decode the genetically mediated cell-type specificity in complex diseases. Here, we develop a new method, scGWAS, which effectively leverages scRNA-seq data to achieve two goals: (1) to infer the cell types in which the disease-associated genes manifest and (2) to construct cellular modules which imply disease-specific activation of different processes. RESULTS: scGWAS only utilizes the average gene expression for each cell type followed by virtual search processes to construct the null distributions of module scores, making it scalable to large scRNA-seq datasets. We demonstrated scGWAS in 40 genome-wide association studies (GWAS) datasets (average sample size N ≈ 154,000) using 18 scRNA-seq datasets from nine major human/mouse tissues (totaling 1.08 million cells) and identified 2533 trait and cell-type associations, each with significant modules for further investigation. The module genes were validated using disease or clinically annotated references from ClinVar, OMIM, and pLI variants. CONCLUSIONS: We showed that the trait-cell type associations identified by scGWAS, while generally constrained to trait-tissue associations, could recapitulate many well-studied relationships and also reveal novel relationships, providing insights into the unsolved trait-tissue associations. Moreover, in each specific cell type, the associations with different traits were often mediated by different sets of risk genes, implying disease-specific activation of driving processes. In summary, scGWAS is a powerful tool for exploring the genetic basis of complex diseases at the cell type level using single-cell expression data.
DOI
10.1186/s13059-022-02785-w
scPRS
PUBMED_LINK
DESCRIPTION
scPRS is an interpretable geometric deep learning model that constructs single-cell-resolved PRS, leveraging reference single-cell ATAC-seq data for enhanced disease prediction and biological discovery.
URL
KEYWORDS
GWAS, scATAC, cell-resolved PRS, GNN
TITLE
Single-cell polygenic risk scores dissect cellular and molecular heterogeneity of complex human diseases.
Main citation
Zhang S, Shu H, Zhou J, Rubin-Sigler J, ...&, Snyder MP. (2025) Single-cell polygenic risk scores dissect cellular and molecular heterogeneity of complex human diseases. Nat Biotechnol, () . doi:10.1038/s41587-025-02725-6. PMID 40715455
ABSTRACT
Polygenic risk scores (PRSs) predict an individual's genetic risk for complex diseases, yet their utility in elucidating disease biology remains limited. We introduce scPRS, a graph neural network-based framework that computes single-cell-resolved PRSs by integrating reference single-cell chromatin accessibility profiles. scPRS outperforms traditional PRS approaches in genetic risk prediction, as demonstrated across multiple diseases including type 2 diabetes, hypertrophic cardiomyopathy, Alzheimer disease and severe COVID-19. Beyond risk prediction, scPRS prioritizes disease-critical cells and, when combined with a layered multiomic analysis, links risk variants to gene regulation in a cell-type-specific manner. Applied to these diseases, scPRS fine-maps causal cell types and cell-type-specific variants and genes, demonstrating its ability to bridge genetic risk with cell-specific biology. scPRS provides a unified framework for genetic risk prediction and mechanistic dissection of complex diseases, laying a methodological foundation for single-cell genetics.
DOI
10.1038/s41587-025-02725-6
ARROW_SUMMARY
GWAS summary statistics + scATAC-seq data → per-cell PRS calculation → GNN smoothing → aggregate to individual scPRS → interpret cell-type contributions & fine-map causal variants
scTWAS
PUBMED_LINK
DESCRIPTION
Statistical framework for cell-type-resolved transcriptome-wide association using single-cell RNA-seq: models sparsity and technical noise via latent variables and moment-based estimation to improve genetically regulated expression prediction and gene–trait discovery.
URL
KEYWORDS
TWAS, single-cell, cell-type-specific, latent variable, GReX
TITLE
scTWAS: a powerful statistical framework for single-cell transcriptome-wide association studies.
Main citation
Lin Z, Su C. (2026) scTWAS: a powerful statistical framework for single-cell transcriptome-wide association studies. Nat Commun, () . doi:10.1038/s41467-026-70374-7. PMID 41820391
ABSTRACT
Transcriptome-wide association studies (TWAS) have successfully identified genes associated with complex traits and diseases, but most have been performed using bulk gene expression data, which aggregate signals across heterogeneous cell types. Population-scale single-cell RNA sequencing data now make it possible to perform TWAS at the cell-type resolution, but present unique challenges due to strong noises, technical variations, and high sparsity. Here, we propose scTWAS, a statistical method to conduct cell-type-specific TWAS using single-cell data. Leveraging a latent-variable model and moment-based estimation to address the challenges of single-cell data, scTWAS consistently improves the prediction of genetically regulated gene expression across cell types in both blood and brain tissues. Compared to existing methods, scTWAS identifies substantially more gene-trait associations across 29 hematological traits and three immune-related diseases in immune cell types. An application to Alzheimer's disease also reveals cell-subtype-specific associations, including MS4A6A in the disease-associated microglial subtype and PPP1R37 in the inflammatory microglial subtype.
DOI
10.1038/s41467-026-70374-7
SDPR
PUBMED_LINK
DESCRIPTION
SDPR (Summary statistics based Dirichlet Process Regression) is a method to compute polygenic risk scores (PRS) from summary statistics. It extends Dirichlet Process Regression (DPR) to the use of summary statistics.
URL
TITLE
A fast and robust Bayesian nonparametric method for prediction of complex traits using summary statistics.
Main citation
Zhou G, Zhao H. (2021) A fast and robust Bayesian nonparametric method for prediction of complex traits using summary statistics. PLoS Genet, 17 (7) e1009697. doi:10.1371/journal.pgen.1009697. PMID 34310601
ABSTRACT
Genetic prediction of complex traits has great promise for disease prevention, monitoring, and treatment. The development of accurate risk prediction models is hindered by the wide diversity of genetic architecture across different traits, limited access to individual level data for training and parameter tuning, and the demand for computational resources. To overcome the limitations of the most existing methods that make explicit assumptions on the underlying genetic architecture and need a separate validation data set for parameter tuning, we develop a summary statistics-based nonparametric method that does not rely on validation datasets to tune parameters. In our implementation, we refine the commonly used likelihood assumption to deal with the discrepancy between summary statistics and external reference panel. We also leverage the block structure of the reference linkage disequilibrium matrix for implementation of a parallel algorithm. Through simulations and applications to twelve traits, we show that our method is adaptive to different genetic architectures, statistically robust, and computationally efficient. Our method is available at https://github.com/eldronzhou/SDPR.
DOI
10.1371/journal.pgen.1009697
SDPRX
PUBMED_LINK
DESCRIPTION
SDPRX is a statistical method for cross-population prediction of complex traits. It integrates GWAS summary statistics and LD matrices from two populations (EUR and non-EUR) to compute polygenic risk scores.
URL
TITLE
SDPRX: A statistical method for cross-population prediction of complex traits.
Main citation
Zhou G, Chen T, Zhao H. (2023) SDPRX: A statistical method for cross-population prediction of complex traits. Am J Hum Genet, 110 (1) 13-22. doi:10.1016/j.ajhg.2022.11.007. PMID 36460009
ABSTRACT
Polygenic risk score (PRS) has demonstrated its great utility in biomedical research through identifying high-risk individuals for different diseases from their genotypes. However, the broader application of PRS to the general population is hindered by the limited transferability of PRS developed in Europeans to non-European populations. To improve PRS prediction accuracy in non-European populations, we develop a statistical method called SDPRX that can effectively integrate genome wide association study summary statistics from different populations. SDPRX automatically adjusts for linkage disequilibrium differences between populations and characterizes the joint distribution of the effect sizes of a variant in two populations to be both null, population specific, or shared with correlation. Through simulations and applications to real traits, we show that SDPRX improves the prediction performance over existing methods in non-European populations.
DOI
10.1016/j.ajhg.2022.11.007
SDS
PUBMED_LINK
FULL NAME
singleton density score
DESCRIPTION
Field, Y., Boyle, E. A., Telis, N., Gao, Z., Gaulton, K. J., Golan, D., ... & Pritchard, J. K. (2016). Detection of human adaptation during the past 2000 years. Science, 354(6313), 760-764.
URL
KEYWORDS
singleton, recent selection
USE
SDS is a method to infer very recent changes in allele frequencies from contemporary genome sequences
TITLE
Detection of human adaptation during the past 2000 years.
Main citation
Field Y, Boyle EA, Telis N, Gao Z, ...&, Pritchard JK. (2016) Detection of human adaptation during the past 2000 years. Science, 354 (6313) 760-764. doi:10.1126/science.aag0776. PMID 27738015
ABSTRACT
Detection of recent natural selection is a challenging problem in population genetics. Here we introduce the singleton density score (SDS), a method to infer very recent changes in allele frequencies from contemporary genome sequences. Applied to data from the UK10K Project, SDS reflects allele frequency changes in the ancestors of modern Britons during the past ~2000 to 3000 years. We see strong signals of selection at lactase and the major histocompatibility complex, and in favor of blond hair and blue eyes. For polygenic adaptation, we find that recent selection for increased height has driven allele frequency shifts across most of the genome. Moreover, we identify shifts associated with other complex traits, suggesting that polygenic adaptation has played a pervasive role in shaping genotypic and phenotypic variation in modern humans.
DOI
10.1126/science.aag0776
SECRET-GWAS
DESCRIPTION
A privacy-preserving, population-scale genome-wide association study (GWAS) tool enabling collaborative analysis across multiple institutions using confidential computing. It employs optimizations like streaming, batching, and data parallelization on Intel SGX-based platforms to support linear and logistic regression efficiently while protecting against side-channel attacks.
URL
KEYWORDS
Genome-wide association study (GWAS), Confidential computing, Privacy-preserving, Intel SGX, Secure multi-party computation
Main citation
Rosenblum, J., Dong, J. & Narayanasamy, S. Confidential computing for population-scale genome-wide association studies with SECRET-GWAS. Nat Comput Sci (2025). https://doi.org/10.1038/s43588-025-00856-z
ARROW_SUMMARY
Genomic data from multiple institutions → Confidential computing (Intel SGX) with optimized linear/logistic regression → Privacy-preserving GWAS results using streaming, batching, and parallelization
AI_GENERATED
1.0
seismic
PUBMED_LINK
FULL NAME
Single-cell Expression Integration System for Mapping genetically Implicated Cell types
DESCRIPTION
R framework that links GWAS signals to single-cell-defined cell types via a cell-type gene specificity score (expression magnitude and consistency) and regression on gene-level association statistics, with influential-gene follow-up for interpretability.
URL
KEYWORDS
GWAS, scRNA-seq, cell type, MAGMA, post-GWAS interpretation
TITLE
Disentangling associations between complex traits and cell types with seismic.
Main citation
Lai Q, Dannenfelser R, Roussarie JP, Yao V. (2025) Disentangling associations between complex traits and cell types with seismic. Nat Commun, 16 (1) 8744. doi:10.1038/s41467-025-63753-z. PMID 41034207
ABSTRACT
Integrating single-cell RNA sequencing with Genome-Wide Association Studies (GWAS) can uncover cell types involved in complex traits and disease. However, current methods often lack scalability, interpretability, and robustness. We present seismic, a framework that computes a novel specificity score capturing both expression magnitude and consistency across cell types and introduces influential gene analysis, an approach to identify genes driving each cell type-trait association. Across over 1000 cell-type characterizations at different granularities and 28 polygenic traits, seismic corroborates known associations and uncovers trait-relevant cell groups not apparent through other methodologies. In Parkinson's and Alzheimer's, seismic unveils both cell- and brain-region-specific differences in pathology. Analyzing a pathology-based Alzheimer's GWAS with seismic enables the identification of vulnerable neuron populations and molecular pathways implicated in their neurodegeneration. In general, seismic is a computationally efficient, powerful, and interpretable approach for mapping the relationships between polygenic traits and cell-type-specific expression, offering new insights into disease mechanisms.
DOI
10.1038/s41467-025-63753-z
SHAPEIT1
PUBMED_LINK
DESCRIPTION
(SHAPEIT1)
URL
TITLE
A linear complexity phasing method for thousands of genomes.
Main citation
Delaneau O, Marchini J, Zagury JF. (2011) A linear complexity phasing method for thousands of genomes. Nat Methods, 9 (2) 179-81. doi:10.1038/nmeth.1785. PMID 22138821
ABSTRACT
Human-disease etiology can be better understood with phase information about diploid sequences. We present a method for estimating haplotypes, using genotype data from unrelated samples or small nuclear families, that leads to improved accuracy and speed compared to several widely used methods. The method, segmented haplotype estimation and imputation tool (SHAPEIT), scales linearly with the number of haplotypes used in each iteration and can be run efficiently on whole chromosomes.
DOI
10.1038/nmeth.1785
SHAPEIT2
PUBMED_LINK
DESCRIPTION
(SHAPEIT2)
TITLE
Improved whole-chromosome phasing for disease and population genetic studies.
Main citation
Delaneau O, Zagury JF, Marchini J. (2013) Improved whole-chromosome phasing for disease and population genetic studies. Nat Methods, 10 (1) 5-6. doi:10.1038/nmeth.2307. PMID 23269371
DOI
10.1038/nmeth.2307
SHAPEIT3
PUBMED_LINK
DESCRIPTION
(SHAPEIT3)
URL
TITLE
Haplotype estimation for biobank-scale data sets.
Main citation
O'Connell J, Sharp K, Shrine N, Wain L, ...&, Marchini J. (2016) Haplotype estimation for biobank-scale data sets. Nat Genet, 48 (7) 817-20. doi:10.1038/ng.3583. PMID 27270105
ABSTRACT
The UK Biobank (UKB) has recently released genotypes on 152,328 individuals together with extensive phenotypic and lifestyle information. We present a new phasing method, SHAPEIT3, that can handle such biobank-scale data sets and results in switch error rates as low as ∼0.3%. The method exhibits O(NlogN) scaling with sample size N, enabling fast and accurate phasing of even larger cohorts.
DOI
10.1038/ng.3583
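Note on the SHAPEIT3 entry above: the abstract reports accuracy as a switch error rate. The sketch below shows one common way to compute that metric for a single sample against a known truth. It is not taken from the SHAPEIT3 paper or software; the function name, the biallelic 0/1 coding, and the rule of skipping sites where the estimated genotype is not heterozygous are assumptions made for illustration.

```python
def switch_error_rate(true_h1, true_h2, est_h1, est_h2):
    """Fraction of consecutive heterozygous site pairs whose relative phase is wrong.

    Each argument is a list of 0/1 alleles along one chromosome (hypothetical
    input layout). Homozygous sites carry no phase information and are skipped.
    """
    phases = []
    for t1, t2, e1, e2 in zip(true_h1, true_h2, est_h1, est_h2):
        if t1 == t2 or e1 == e2:
            continue  # not heterozygous in truth or in the estimate
        # 0 if the estimate has the same orientation as the truth, 1 if flipped
        phases.append(0 if e1 == t1 else 1)
    if len(phases) < 2:
        raise ValueError("need at least two heterozygous sites")
    # A switch is a change of orientation between neighbouring het sites
    switches = sum(1 for a, b in zip(phases, phases[1:]) if a != b)
    return switches / (len(phases) - 1)


if __name__ == "__main__":
    truth1 = [0, 1, 0, 1, 1]
    truth2 = [1, 0, 1, 0, 0]
    est1 = [0, 1, 1, 0, 1]
    est2 = [1, 0, 0, 1, 0]
    print(switch_error_rate(truth1, truth2, est1, est2))
```

A fully flipped estimate (haplotype labels swapped everywhere) scores zero under this definition, because only changes in orientation between neighbouring heterozygous sites are counted.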
SHAPEIT4
PUBMED_LINK
DESCRIPTION
(SHAPEIT4)
URL
TITLE
Accurate, scalable and integrative haplotype estimation.
Main citation
Delaneau O, Zagury JF, Robinson MR, Marchini JL, ...&, Dermitzakis ET. (2019) Accurate, scalable and integrative haplotype estimation. Nat Commun, 10 (1) 5436. doi:10.1038/s41467-019-13225-y. PMID 31780650
ABSTRACT
The number of human genomes being genotyped or sequenced increases exponentially and efficient haplotype estimation methods able to handle this amount of data are now required. Here we present a method, SHAPEIT4, which substantially improves upon other methods to process large genotype and high coverage sequencing datasets. It notably exhibits sub-linear running times with sample size, provides highly accurate haplotypes and allows integrating external phasing information such as large reference panels of haplotypes, collections of pre-phased variants and long sequencing reads. We provide SHAPEIT4 in an open source format and demonstrate its performance in terms of accuracy and running times on two gold standard datasets: the UK Biobank data and the Genome In A Bottle.
DOI
10.1038/s41467-019-13225-y
SHAPEIT5
PUBMED_LINK
DESCRIPTION
(SHAPEIT5)
TITLE
Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank.
Main citation
Hofmeister RJ, Ribeiro DM, Rubinacci S, Delaneau O. (2023) Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. Nat Genet, 55 (7) 1243-1249. doi:10.1038/s41588-023-01415-w. PMID 37386248
ABSTRACT
Phasing involves distinguishing the two parentally inherited copies of each chromosome into haplotypes. Here, we introduce SHAPEIT5, a new phasing method that quickly and accurately processes large sequencing datasets and applied it to UK Biobank (UKB) whole-genome and whole-exome sequencing data. We demonstrate that SHAPEIT5 phases rare variants with low switch error rates of below 5% for variants present in just 1 sample out of 100,000. Furthermore, we outline a method for phasing singletons, which, although less precise, constitutes an important step towards future developments. We then demonstrate that the use of UKB as a reference panel improves the accuracy of genotype imputation, which is even more pronounced when phased with SHAPEIT5 compared with other methods. Finally, we screen the UKB data for loss-of-function compound heterozygous events and identify 549 genes where both gene copies are knocked out. These genes complement current knowledge of gene essentiality in the human genome.
DOI
10.1038/s41588-023-01415-w
shaPRS
PUBMED_LINK
DESCRIPTION
Leverages shared genetic effects across traits or ancestries to improve the accuracy of polygenic scores.
URL
KEYWORDS
cross-ancestry, genetic correlation
TITLE
shaPRS: Leveraging shared genetic effects across traits or ancestries improves accuracy of polygenic scores.
Main citation
Kelemen M, Vigorito E, Fachal L, Anderson CA, ...&, Wallace C. (2024) shaPRS: Leveraging shared genetic effects across traits or ancestries improves accuracy of polygenic scores. Am J Hum Genet, 111 (6) 1006-1017. doi:10.1016/j.ajhg.2024.04.009. PMID 38703768
ABSTRACT
We present shaPRS, a method that leverages widespread pleiotropy between traits or shared genetic effects across ancestries, to improve the accuracy of polygenic scores. The method uses genome-wide summary statistics from two diseases or ancestries to improve the genetic effect estimate and standard error at SNPs where there is homogeneity of effect between the two datasets. When there is significant evidence of heterogeneity, the genetic effect from the disease or population closest to the target population is maintained. We show via simulation and a series of real-world examples that shaPRS substantially enhances the accuracy of polygenic risk scores (PRSs) for complex diseases and greatly improves PRS performance across ancestries. shaPRS is a PRS pre-processing method that is agnostic to the actual PRS generation method, and as a result, it can be integrated into existing PRS generation pipelines and continue to be applied as more performant PRS methods are developed over time.
DOI
10.1016/j.ajhg.2024.04.009
SiblingGWAS
PUBMED_LINK
FULL NAME
Within-sibship genome-wide association analyses
DESCRIPTION
Scripts for running GWAS using siblings to estimate Within-Family (WF) and Between-Family (BF) effects of genetic variants on continuous traits. Allows the inclusion of more than two siblings from one family.
URL
TITLE
Within-sibship genome-wide association analyses decrease bias in estimates of direct genetic effects.
Main citation
Howe LJ, Nivard MG, Morris TT, Hansen AF, ...&, Davies NM. (2022) Within-sibship genome-wide association analyses decrease bias in estimates of direct genetic effects. Nat Genet, 54 (5) 581-592. doi:10.1038/s41588-022-01062-7. PMID 35534559
ABSTRACT
Estimates from genome-wide association studies (GWAS) of unrelated individuals capture effects of inherited variation (direct effects), demography (population stratification, assortative mating) and relatives (indirect genetic effects). Family-based GWAS designs can control for demographic and indirect genetic effects, but large-scale family datasets have been lacking. We combined data from 178,086 siblings from 19 cohorts to generate population (between-family) and within-sibship (within-family) GWAS estimates for 25 phenotypes. Within-sibship GWAS estimates were smaller than population estimates for height, educational attainment, age at first birth, number of children, cognitive ability, depressive symptoms and smoking. Some differences were observed in downstream SNP heritability, genetic correlations and Mendelian randomization analyses. For example, the within-sibship genetic correlation between educational attainment and body mass index attenuated towards zero. In contrast, analyses of most molecular phenotypes (for example, low-density lipoprotein-cholesterol) were generally consistent. We also found within-sibship evidence of polygenic adaptation on taller height. Here, we illustrate the importance of family-based GWAS data for phenotypes influenced by demographic and indirect genetic effects.
DOI
10.1038/s41588-022-01062-7
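Note on the SiblingGWAS entry above: the description mentions Within-Family (WF) and Between-Family (BF) effects. One way to picture the split, assuming a model in which the phenotype is regressed on the family-mean genotype (BF) and on each sibling's deviation from that mean (WF), is sketched below. This is an illustration only, not the SiblingGWAS scripts: the function name and input layout are hypothetical, and covariates, standard errors, and family-clustered inference are omitted. Because the deviations sum to zero within every family, they are orthogonal to both the intercept and the family mean, so each slope can be computed on its own.

```python
from collections import defaultdict


def within_between_effects(family_ids, genotypes, phenotypes):
    """Return (b_within, b_between) for one variant.

    family_ids, genotypes, phenotypes are parallel lists with one element per
    individual (hypothetical input layout). Families of any size are allowed.
    """
    by_family = defaultdict(list)
    for fid, g in zip(family_ids, genotypes):
        by_family[fid].append(g)
    family_mean = {fid: sum(gs) / len(gs) for fid, gs in by_family.items()}

    mean_g = [family_mean[fid] for fid in family_ids]
    deviation = [g - m for g, m in zip(genotypes, mean_g)]

    # Within-family slope: projection of the phenotype on the deviations.
    # Fails with ZeroDivisionError if no family shows genotype variation.
    b_within = (sum(d * y for d, y in zip(deviation, phenotypes))
                / sum(d * d for d in deviation))

    # Between-family slope: simple regression (with intercept) on family means
    n = len(phenotypes)
    g_bar = sum(mean_g) / n
    y_bar = sum(phenotypes) / n
    b_between = (sum((m - g_bar) * (y - y_bar) for m, y in zip(mean_g, phenotypes))
                 / sum((m - g_bar) ** 2 for m in mean_g))
    return b_within, b_between


if __name__ == "__main__":
    fams = ["A", "A", "B", "B", "C", "C"]
    geno = [0, 2, 1, 2, 0, 0]
    pheno = [2.5, 3.5, 3.75, 4.25, 1.0, 1.0]
    print(within_between_effects(fams, geno, pheno))
```

The toy phenotypes were constructed as 1 + 0.5 × deviation + 2 × family mean, so the two returned slopes should recover 0.5 and 2.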
sim1000G
PUBMED_LINK
DESCRIPTION
a user-friendly genetic variant simulator in R for unrelated individuals and family-based designs
URL
TITLE
sim1000G: a user-friendly genetic variant simulator in R for unrelated individuals and family-based designs.
Main citation
Dimitromanolakis A, Xu J, Krol A, Briollais L. (2019) sim1000G: a user-friendly genetic variant simulator in R for unrelated individuals and family-based designs. BMC Bioinformatics, 20 (1) 26. doi:10.1186/s12859-019-2611-1. PMID 30646839
ABSTRACT
BACKGROUND: Simulation of genetic variants data is frequently required for the evaluation of statistical methods in the fields of human and animal genetics. Although a number of high-quality genetic simulators have been developed, many of them require advanced knowledge in population genetics or in computation to be used effectively. In addition, generating simulated data in the context of family-based studies demands sophisticated methods and advanced computer programming. RESULTS: To address these issues, we propose a new user-friendly and integrated R package, sim1000G, which simulates variants in genomic regions among unrelated individuals or among families. The only input needed is a raw phased Variant Call Format (VCF) file. Haplotypes are extracted to compute linkage disequilibrium (LD) in the simulated genomic regions and for the generation of new genotype data among unrelated individuals. The covariance across variants is used to preserve the LD structure of the original population. Pedigrees of arbitrary sizes are generated by modeling recombination events with sim1000G. To illustrate the application of sim1000G, various scenarios are presented assuming unrelated individuals from a single population or two distinct populations, or alternatively for three-generation pedigree data. Sim1000G can capture allele frequency diversity, short and long-range linkage disequilibrium (LD) patterns and subtle population differences in LD structure without the need of any tuning parameters. CONCLUSION: Sim1000G fills a gap in the vast area of genetic variants simulators by its simplicity and independence from external tools. Currently, it is one of the few simulation packages completely integrated into R and able to simulate multiple genetic variants among unrelated individuals and within families. Its implementation will facilitate the application and development of computational methods for association studies with both rare and common variants.
DOI
10.1186/s12859-019-2611-1
SIMER
FULL NAME
Data Simulation for Life Science and Breeding
DESCRIPTION
Data Simulation for Life Science and Breeding
URL
simGWAS
PUBMED_LINK
DESCRIPTION
a fast method for simulation of large scale case–control GWAS summary statistics
URL
TITLE
simGWAS: a fast method for simulation of large scale case-control GWAS summary statistics.
Main citation
Fortune MD, Wallace C. (2019) simGWAS: a fast method for simulation of large scale case-control GWAS summary statistics. Bioinformatics, 35 (11) 1901-1906. doi:10.1093/bioinformatics/bty898. PMID 30371734
ABSTRACT
MOTIVATION: Methods for analysis of GWAS summary statistics have encouraged data sharing and democratized the analysis of different diseases. Ideal validation for such methods is application to simulated data, where some 'truth' is known. As GWAS increase in size, so does the computational complexity of such evaluations; standard practice repeatedly simulates and analyses genotype data for all individuals in an example study. RESULTS: We have developed a novel method based on an alternative approach, directly simulating GWAS summary data, without individual data as an intermediate step. We mathematically derive the expected statistics for any set of causal variants and their effect sizes, conditional upon control haplotype frequencies (available from public reference datasets). Simulation of GWAS summary output can be conducted independently of sample size by simulating random variates about these expected values. Across a range of scenarios, our method, produces very similar output to that from simulating individual genotypes with a substantial gain in speed even for modest sample sizes. Fast simulation of GWAS summary statistics will enable more complete and rapid evaluation of summary statistic methods as well as opening new potential avenues of research in fine mapping and gene set enrichment analysis. AVAILABILITY AND IMPLEMENTATION: Our method is available under a GPL license as an R package from http://github.com/chr1swallace/simGWAS. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
DOI
10.1093/bioinformatics/bty898
SINGER
FULL NAME
sampling and inferring of genealogies with recombination
DESCRIPTION
SINGER is a Bayesian method for accelerating ARG (Ancestral Recombination Graph) sampling from the posterior distribution, enabling accurate inference and uncertainty quantification for hundreds of whole-genome sequences. It addresses scalability and accuracy challenges in ARG reconstruction, improving robustness to model misspecification. Applications include detecting population differentiation, archaic introgression, and trans-species polymorphism in regions like the HLA locus.
URL
KEYWORDS
Ancestral Recombination Graph (ARG), Bayesian inference, population genomics, genealogical analysis, archaic introgression, trans-species polymorphism
Main citation
Deng, Y., Nielsen, R., & Song, Y.S. Robust and accurate Bayesian inference of genome-wide genealogies for hundreds of genomes. Nature Genetics, 57, 2124–2135. https://doi.org/10.1038/s41588-025-02317-9
ARROW_SUMMARY
Phased WGS → Bayesian MCMC sampling (threading, ARG re-scaling, SGPR moves) → Genome-wide ARGs with uncertainty quantification
AI_GENERATED
1.0
SKAT
PUBMED_LINK
FULL NAME
sequence kernel association test
DESCRIPTION
SKAT is a SNP-set (e.g., a gene or a region) level test for association between a set of rare (or common) variants and dichotomous or quantitative phenotypes. SKAT aggregates the individual score test statistics of the SNPs in a SNP set and efficiently computes SNP-set level p-values (e.g., a gene- or region-level p-value) while adjusting for covariates, such as principal components to account for population stratification. SKAT also supports power and sample size calculations for designing sequence association studies.
URL
TITLE
Rare-variant association testing for sequencing data with the sequence kernel association test.
Main citation
Wu MC, Lee S, Cai T, Li Y, ...&, Lin X. (2011) Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet, 89 (1) 82-93. doi:10.1016/j.ajhg.2011.05.029. PMID 21737059
ABSTRACT
Sequencing studies are increasingly being conducted to identify rare variants associated with complex traits. The limited power of classical single-marker association analysis for rare variants poses a central challenge in such studies. We propose the sequence kernel association test (SKAT), a supervised, flexible, computationally efficient regression method to test for association between genetic variants (common and rare) in a region and a continuous or dichotomous trait while easily adjusting for covariates. As a score-based variance-component test, SKAT can quickly calculate p values analytically by fitting the null model containing only the covariates, and so can easily be applied to genome-wide data. Using SKAT to analyze a genome-wide sequencing study of 1000 individuals, by segmenting the whole genome into 30 kb regions, requires only 7 hr on a laptop. Through analysis of simulated data across a wide range of practical scenarios and triglyceride data from the Dallas Heart Study, we show that SKAT can substantially outperform several alternative rare-variant association tests. We also provide analytic power and sample-size calculations to help design candidate-gene, whole-exome, and whole-genome sequence association studies.
DOI
10.1016/j.ajhg.2011.05.029
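Note on the SKAT entry above: the description says SKAT aggregates per-SNP score statistics into a set-level statistic. A minimal sketch of that aggregation follows, under simplifying assumptions that are not part of the SKAT software: a quantitative trait, an intercept-only null model (no covariates), optional per-variant multipliers supplied by the caller, and a permutation p-value standing in for the analytic p-value described in the abstract. All function names are hypothetical.

```python
import random


def skat_q(genotypes, phenotype, weights=None):
    """Set-level statistic Q = sum_j (w_j * S_j)^2.

    genotypes: list of rows, one per individual, each a list of allele counts.
    S_j is the score for variant j under an intercept-only null model.
    """
    n = len(phenotype)
    m = len(genotypes[0])
    mu = sum(phenotype) / n                      # null-model fitted value
    resid = [y - mu for y in phenotype]
    if weights is None:
        weights = [1.0] * m                      # unweighted by default
    q = 0.0
    for j in range(m):
        s_j = sum(genotypes[i][j] * resid[i] for i in range(n))
        q += (weights[j] * s_j) ** 2
    return q


def skat_permutation_p(genotypes, phenotype, weights=None, n_perm=999, seed=0):
    """Permutation p-value for Q (an assumption of this sketch, not SKAT's method)."""
    rng = random.Random(seed)
    q_obs = skat_q(genotypes, phenotype, weights)
    shuffled = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if skat_q(genotypes, shuffled, weights) >= q_obs:
            hits += 1
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    G = [[0, 1], [1, 0], [0, 0], [2, 0]]
    y = [1.0, 2.0, 0.0, 5.0]
    print(skat_q(G, y), skat_permutation_p(G, y))
```

Squaring each score before summing is what lets variants with opposite effect directions contribute without cancelling, in contrast to a burden test.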
SKAT-O
PUBMED_LINK
FULL NAME
sequence kernel association test - optimal test
DESCRIPTION
SKAT-O estimates the correlation parameter in the kernel matrix so as to maximize power. This parameter corresponds to the weight in the linear combination of the burden test and SKAT test statistics.
URL
TITLE
Optimal tests for rare variant effects in sequencing association studies.
Main citation
Lee S, Wu MC, Lin X. (2012) Optimal tests for rare variant effects in sequencing association studies. Biostatistics, 13 (4) 762-75. doi:10.1093/biostatistics/kxs014. PMID 22699862
ABSTRACT
With development of massively parallel sequencing technologies, there is a substantial need for developing powerful rare variant association tests. Common approaches include burden and non-burden tests. Burden tests assume all rare variants in the target region have effects on the phenotype in the same direction and of similar magnitude. The recently proposed sequence kernel association test (SKAT) (Wu, M. C., and others, 2011. Rare-variant association testing for sequencing data with the SKAT. The American Journal of Human Genetics 89, 82-93), an extension of the C-alpha test (Neale, B. M., and others, 2011. Testing for an unusual distribution of rare variants. PLoS Genetics 7, 161-165), provides a robust test that is particularly powerful in the presence of protective and deleterious variants and null variants, but is less powerful than burden tests when a large number of variants in a region are causal and in the same direction. As the underlying biological mechanisms are unknown in practice and vary from one gene to another across the genome, it is of substantial practical interest to develop a test that is optimal for both scenarios. In this paper, we propose a class of tests that include burden tests and SKAT as special cases, and derive an optimal test within this class that maximizes power. We show that this optimal test outperforms burden tests and SKAT in a wide range of scenarios. The results are illustrated using simulation studies and triglyceride data from the Dallas Heart Study. In addition, we have derived sample size/power calculation formula for SKAT with a new family of kernels to facilitate designing new sequence association studies.
DOI
10.1093/biostatistics/kxs014
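Note on the SKAT-O entry above: the description refers to a linear combination of the burden and SKAT statistics indexed by a correlation parameter. The sketch below computes that family of statistics, assuming the form Q_rho = (1 - rho) * Q_SKAT + rho * Q_burden, with Q_SKAT the sum of squared weighted scores and Q_burden the square of their sum. It stops short of the actual optimal test: choosing rho and obtaining a p-value requires the null distribution of each Q_rho, which is not reproduced here. Function names, the intercept-only null model, and the rho grid are assumptions for illustration.

```python
def score_vector(genotypes, phenotype):
    """Per-variant scores S_j under an intercept-only null model."""
    n = len(phenotype)
    m = len(genotypes[0])
    mu = sum(phenotype) / n
    return [sum(genotypes[i][j] * (phenotype[i] - mu) for i in range(n))
            for j in range(m)]


def q_rho(scores, rho, weights=None):
    """Combined statistic: rho = 0 gives the SKAT form, rho = 1 the burden form."""
    if weights is None:
        weights = [1.0] * len(scores)
    weighted = [w * s for w, s in zip(weights, scores)]
    q_skat = sum(ws ** 2 for ws in weighted)     # direction-agnostic
    q_burden = sum(weighted) ** 2                # assumes shared direction
    return (1.0 - rho) * q_skat + rho * q_burden


if __name__ == "__main__":
    G = [[0, 1], [1, 0], [0, 0], [2, 0]]
    y = [1.0, 2.0, 0.0, 5.0]
    S = score_vector(G, y)
    for rho in (0.0, 0.25, 0.5, 0.75, 1.0):      # hypothetical grid
        print(rho, q_rho(S, rho))
```

In the toy data the two scores have opposite signs, so the burden end of the grid is smaller than the SKAT end, which is the situation the abstract describes as favouring SKAT.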
SMMAT
PUBMED_LINK
FULL NAME
variant set mixed model association tests
DESCRIPTION
For rare variant analysis from sequencing association studies, GMMAT performs the variant Set Mixed Model Association Tests (SMMAT) as proposed in Chen et al. (2019), including the burden test, the sequence kernel association test (SKAT), SKAT-O and an efficient hybrid test of the burden test and SKAT, based on user-defined variant sets.
URL
TITLE
Efficient Variant Set Mixed Model Association Tests for Continuous and Binary Traits in Large-Scale Whole-Genome Sequencing Studies.
Main citation
Chen H, Huffman JE, Brody JA, Wang C, ...&, Lin X. (2019) Efficient Variant Set Mixed Model Association Tests for Continuous and Binary Traits in Large-Scale Whole-Genome Sequencing Studies. Am J Hum Genet, 104 (2) 260-274. doi:10.1016/j.ajhg.2018.12.012. PMID 30639324
ABSTRACT
With advances in whole-genome sequencing (WGS) technology, more advanced statistical methods for testing genetic association with rare variants are being developed. Methods in which variants are grouped for analysis are also known as variant-set, gene-based, and aggregate unit tests. The burden test and sequence kernel association test (SKAT) are two widely used variant-set tests, which were originally developed for samples of unrelated individuals and later have been extended to family data with known pedigree structures. However, computationally efficient and powerful variant-set tests are needed to make analyses tractable in large-scale WGS studies with complex study samples. In this paper, we propose the variant-set mixed model association tests (SMMAT) for continuous and binary traits using the generalized linear mixed model framework. These tests can be applied to large-scale WGS studies involving samples with population structure and relatedness, such as in the National Heart, Lung, and Blood Institute's Trans-Omics for Precision Medicine (TOPMed) program. SMMATs share the same null model for different variant sets, and a virtue of this null model, which includes covariates only, is that it needs to be fit only once for all tests in each genome-wide analysis. Simulation studies show that all the proposed SMMATs correctly control type I error rates for both continuous and binary traits in the presence of population structure and relatedness. We also illustrate our tests in a real data example of analysis of plasma fibrinogen levels in the TOPMed program (n = 23,763), using the Analysis Commons, a cloud-based computing platform.
DOI
10.1016/j.ajhg.2018.12.012
SMR
PUBMED_LINK
FULL NAME
Summary-data-based Mendelian Randomization
DESCRIPTION
The SMR software tool was originally developed to implement the SMR & HEIDI methods to test for pleiotropic association between the expression level of a gene and a complex trait of interest using summary-level data from GWAS and expression quantitative trait loci (eQTL) studies (Zhu et al. 2016 Nature Genetics). The SMR & HEIDI methodology can be interpreted as an analysis to test if the effect size of a SNP on the phenotype is mediated by gene expression. This tool can therefore be used to prioritize genes underlying GWAS hits for follow-up functional studies. The methods are applicable to all kinds of molecular QTL (xQTL) data, including DNA methylation QTL (mQTL) and protein abundance QTL (pQTL).
URL
KEYWORDS
pleiotropy or causality, xQTL, eQTL, MR, HEIDI, linkage
TITLE
Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets.
Main citation
Zhu Z, Zhang F, Hu H, Bakshi A, ...&, Yang J. (2016) Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat Genet, 48 (5) 481-7. doi:10.1038/ng.3538. PMID 27019110
ABSTRACT
Genome-wide association studies (GWAS) have identified thousands of genetic variants associated with human complex traits. However, the genes or functional DNA elements through which these variants exert their effects on the traits are often unknown. We propose a method (called SMR) that integrates summary-level data from GWAS with data from expression quantitative trait locus (eQTL) studies to identify genes whose expression levels are associated with a complex trait because of pleiotropy. We apply the method to five human complex traits using GWAS data on up to 339,224 individuals and eQTL data on 5,311 individuals, and we prioritize 126 genes (for example, TRAF1 and ANKRD55 for rheumatoid arthritis and SNX19 and NMRAL1 for schizophrenia), of which 25 genes are new candidates; 77 genes are not the nearest annotated gene to the top associated GWAS SNP. These genes provide important leads to design future functional studies to understand the mechanism whereby DNA variation leads to complex trait variation.
DOI
10.1038/ng.3538
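Note on the SMR entry above: the description says SMR tests whether a SNP's effect on the phenotype is mediated by gene expression, using summary statistics only. A minimal sketch of the single-SNP calculation follows, assuming the ratio estimate b_xy = b_zy / b_zx and the approximate statistic T = z_zy^2 * z_zx^2 / (z_zy^2 + z_zx^2), referred to a chi-square distribution with 1 degree of freedom. This is not the SMR software: the function name and argument names are hypothetical, and the HEIDI heterogeneity test is not covered.

```python
import math


def smr_single_snp(b_zx, se_zx, b_zy, se_zy):
    """Return (b_xy, se_xy, t_smr, p) from one instrument SNP.

    b_zx, se_zx: SNP effect on gene expression and its standard error (eQTL).
    b_zy, se_zy: SNP effect on the trait and its standard error (GWAS).
    """
    z_zx = b_zx / se_zx
    z_zy = b_zy / se_zy
    b_xy = b_zy / b_zx                            # expression-on-trait effect
    t_smr = (z_zx ** 2 * z_zy ** 2) / (z_zx ** 2 + z_zy ** 2)
    if t_smr == 0.0:
        return b_xy, float("inf"), 0.0, 1.0       # no GWAS signal at this SNP
    se_xy = abs(b_xy) / math.sqrt(t_smr)          # implied by t = (b / se)^2
    # Upper tail of a chi-square with 1 df: P(X > t) = erfc(sqrt(t / 2))
    p = math.erfc(math.sqrt(t_smr / 2.0))
    return b_xy, se_xy, t_smr, p


if __name__ == "__main__":
    print(smr_single_snp(b_zx=0.5, se_zx=0.05, b_zy=0.1, se_zy=0.02))
```

The statistic can never exceed the smaller of the two squared z-scores, which is why a strong eQTL instrument is needed before a GWAS signal can translate into an SMR signal.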
SMR-multi
PUBMED_LINK
TITLE
Integrative analysis of omics summary data reveals putative mechanisms underlying complex traits.
Main citation
Wu Y, Zeng J, Zhang F, Zhu Z, ...&, Yang J. (2018) Integrative analysis of omics summary data reveals putative mechanisms underlying complex traits. Nat Commun, 9 (1) 918. doi:10.1038/s41467-018-03371-0. PMID 29500431
ABSTRACT
The identification of genes and regulatory elements underlying the associations discovered by GWAS is essential to understanding the aetiology of complex traits (including diseases). Here, we demonstrate an analytical paradigm of prioritizing genes and regulatory elements at GWAS loci for follow-up functional studies. We perform an integrative analysis that uses summary-level SNP data from multi-omics studies to detect DNA methylation (DNAm) sites associated with gene expression and phenotype through shared genetic effects (i.e., pleiotropy). We identify pleiotropic associations between 7858 DNAm sites and 2733 genes. These DNAm sites are enriched in enhancers and promoters, and >40% of them are mapped to distal genes. Further pleiotropic association analyses, which link both the methylome and transcriptome to 12 complex traits, identify 149 DNAm sites and 66 genes, indicating a plausible mechanism whereby the effect of a genetic variant on phenotype is mediated by genetic regulation of transcription through DNAm.
DOI
10.1038/s41467-018-03371-0
snipar
PUBMED_LINK
FULL NAME
single nucleotide imputation of parents
DESCRIPTION
snipar (single nucleotide imputation of parents) is a Python package for inferring identity-by-descent (IBD) segments shared between siblings, imputing missing parental genotypes, and for performing family based genome-wide association and polygenic score analyses using observed and/or imputed parental genotypes.
URL
TITLE
Mendelian imputation of parental genotypes improves estimates of direct genetic effects.
Main citation
Young AI, Nehzati SM, Benonisdottir S, Okbay A, ...&, Kong A. (2022) Mendelian imputation of parental genotypes improves estimates of direct genetic effects. Nat Genet, 54 (6) 897-905. doi:10.1038/s41588-022-01085-0. PMID 35681053
ABSTRACT
Effects estimated by genome-wide association studies (GWASs) include effects of alleles in an individual on that individual (direct genetic effects), indirect genetic effects (for example, effects of alleles in parents on offspring through the environment) and bias from confounding. Within-family genetic variation is random, enabling unbiased estimation of direct genetic effects when parents are genotyped. However, parental genotypes are often missing. We introduce a method that imputes missing parental genotypes and estimates direct genetic effects. Our method, implemented in the software package snipar (single-nucleotide imputation of parents), gives more precise estimates of direct genetic effects than existing approaches. Using 39,614 individuals from the UK Biobank with at least one genotyped sibling/parent, we estimate the correlation between direct genetic effects and effects from standard GWASs for nine phenotypes, including educational attainment (r = 0.739, standard error (s.e.) = 0.086) and cognitive ability (r = 0.490, s.e. = 0.086). Our results demonstrate substantial confounding bias in standard GWASs for some phenotypes.
DOI
10.1038/s41588-022-01085-0
snipar-unified estimator (snipar)
PUBMED_LINK
FULL NAME
single nucleotide imputation of parents
URL
TITLE
Family-based genome-wide association study designs for increased power and robustness.
Main citation
Guan J, Tan T, Nehzati SM, Bennett M, ...&, Young AS. (2025) Family-based genome-wide association study designs for increased power and robustness. Nat Genet, 57 (4) 1044-1052. doi:10.1038/s41588-025-02118-0. PMID 40065166
ABSTRACT
Family-based genome-wide association studies (FGWASs) use random, within-family genetic variation to remove confounding from estimates of direct genetic effects (DGEs). Here we introduce a 'unified estimator' that includes individuals without genotyped relatives, unifying standard and FGWAS while increasing power for DGE estimation. We also introduce a 'robust estimator' that is not biased in structured and/or admixed populations. In an analysis of 19 phenotypes in the UK Biobank, the unified estimator in the White British subsample and the robust estimator (applied without ancestry restrictions) increased the effective sample size for DGEs by 46.9% to 106.5% and 10.3% to 21.0%, respectively, compared to using genetic differences between siblings. Polygenic predictors derived from the unified estimator demonstrated superior out-of-sample prediction ability compared to other family-based methods. We implemented the methods in the software package snipar in an efficient linear mixed model that accounts for sample relatedness and sibling shared environment.
DOI
10.1038/s41588-025-02118-0
SNP2HLA
PUBMED_LINK
URL
TITLE
Imputing amino acid polymorphisms in human leukocyte antigens.
Main citation
Jia X, Han B, Onengut-Gumuscu S, Chen WM, ...&, de Bakker PI. (2013) Imputing amino acid polymorphisms in human leukocyte antigens. PLoS One, 8 (6) e64683. doi:10.1371/journal.pone.0064683. PMID 23762245
ABSTRACT
DNA sequence variation within human leukocyte antigen (HLA) genes mediate susceptibility to a wide range of human diseases. The complex genetic structure of the major histocompatibility complex (MHC) makes it difficult, however, to collect genotyping data in large cohorts. Long-range linkage disequilibrium between HLA loci and SNP markers across the major histocompatibility complex (MHC) region offers an alternative approach through imputation to interrogate HLA variation in existing GWAS data sets. Here we describe a computational strategy, SNP2HLA, to impute classical alleles and amino acid polymorphisms at class I (HLA-A, -B, -C) and class II (-DPA1, -DPB1, -DQA1, -DQB1, and -DRB1) loci. To characterize performance of SNP2HLA, we constructed two European ancestry reference panels, one based on data collected in HapMap-CEPH pedigrees (90 individuals) and another based on data collected by the Type 1 Diabetes Genetics Consortium (T1DGC, 5,225 individuals). We imputed HLA alleles in an independent data set from the British 1958 Birth Cohort (N = 918) with gold standard four-digit HLA types and SNPs genotyped using the Affymetrix GeneChip 500 K and Illumina Immunochip microarrays. We demonstrate that the sample size of the reference panel, rather than SNP density of the genotyping platform, is critical to achieve high imputation accuracy. Using the larger T1DGC reference panel, the average accuracy at four-digit resolution is 94.7% using the low-density Affymetrix GeneChip 500 K, and 96.7% using the high-density Illumina Immunochip. For amino acid polymorphisms within HLA genes, we achieve 98.6% and 99.3% accuracy using the Affymetrix GeneChip 500 K and Illumina Immunochip, respectively. Finally, we demonstrate how imputation and association testing at amino acid resolution can facilitate fine-mapping of primary MHC association signals, giving a specific example from type 1 diabetes.
DOI
10.1371/journal.pone.0064683
SnpEff
PUBMED_LINK
FULL NAME
SNP effect
DESCRIPTION
Genetic variant annotation and functional effect prediction toolbox. It annotates and predicts the effects of genetic variants on genes and proteins (such as amino acid changes).
URL
TITLE
A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3.
Main citation
Cingolani P, Platts A, Wang le L, Coon M, ...&, Ruden DM. (2012) A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin), 6 (2) 80-92. doi:10.4161/fly.19695. PMID 22728672
ABSTRACT
We describe a new computer program, SnpEff, for rapidly categorizing the effects of variants in genome sequences. Once a genome is sequenced, SnpEff annotates variants based on their genomic locations and predicts coding effects. Annotated genomic locations include intronic, untranslated region, upstream, downstream, splice site, or intergenic regions. Coding effects such as synonymous or non-synonymous amino acid replacement, start codon gains or losses, stop codon gains or losses, or frame shifts can be predicted. Here the use of SnpEff is illustrated by annotating ~356,660 candidate SNPs in ~117 Mb unique sequences, representing a substitution rate of ~1/305 nucleotides, between the Drosophila melanogaster w(1118); iso-2; iso-3 strain and the reference y(1); cn(1) bw(1) sp(1) strain. We show that ~15,842 SNPs are synonymous and ~4,467 SNPs are non-synonymous (N/S ~0.28). The remaining SNPs are in other categories, such as stop codon gains (38 SNPs), stop codon losses (8 SNPs), and start codon gains (297 SNPs) in the 5'UTR. We found, as expected, that the SNP frequency is proportional to the recombination frequency (i.e., highest in the middle of chromosome arms). We also found that start-gain or stop-lost SNPs in Drosophila melanogaster often result in additions of N-terminal or C-terminal amino acids that are conserved in other Drosophila species. It appears that the 5' and 3' UTRs are reservoirs for genetic variations that changes the termini of proteins during evolution of the Drosophila genus. As genome sequencing is becoming inexpensive and routine, SnpEff enables rapid analyses of whole-genome sequencing data to be performed by an individual laboratory.
DOI
10.4161/fly.19695
SOMAmer
PUBMED_LINK
TITLE
Assessment of Variability in the SOMAscan Assay.
Main citation
Candia J, Cheung F, Kotliarov Y, Fantoni G, ...&, Biancotto A. (2017) Assessment of Variability in the SOMAscan Assay. Sci Rep, 7 (1) 14248. doi:10.1038/s41598-017-14755-5. PMID 29079756
ABSTRACT
SOMAscan is an aptamer-based proteomics assay capable of measuring 1,305 human protein analytes in serum, plasma, and other biological matrices with high sensitivity and specificity. In this work, we present a comprehensive meta-analysis of performance based on multiple serum and plasma runs using the current 1.3 k assay, as well as the previous 1.1 k version. We discuss normalization procedures and examine different strategies to minimize intra- and interplate nuisance effects. We implement a meta-analysis based on calibrator samples to characterize the coefficient of variation and signal-over-background intensity of each protein analyte. By incorporating coefficient of variation estimates into a theoretical model of statistical variability, we also provide a framework to enable rigorous statistical tests of significance in intervention studies and clinical trials, as well as quality control within and across laboratories. Furthermore, we investigate the stability of healthy subject baselines and determine the set of analytes that exhibit biologically stable baselines after technical variability is factored in. This work is accompanied by an interactive web-based tool, an initiative with the potential to become the cornerstone of a regularly updated, high quality repository with data sharing, reproducibility, and reusability as ultimate goals.
DOI
10.1038/s41598-017-14755-5
South and East Asian Reference Database (SEAD)
FULL NAME
South and East Asian Reference Database
URL
PREPRINT_DOI
10.1101/2023.12.23.23300480
Main citation
Yang, M. Y., Zhong, J. D., Li, X., Bai, W. Y., Yuan, C. D., Qiu, M. C., ... & Zheng, H. F. (2023). SEAD: an augmented reference panel with 22,134 haplotypes boosts the rare variants imputation and GWAS analysis in Asian population. medRxiv, 2023-12.
SPAGRM
PUBMED_LINK
DESCRIPTION
SPAGRM is a scalable and accurate analysis framework to control for sample relatedness in large-scale genome-wide association studies (GWAS).
URL
KEYWORDS
SPA, longitudinal traits
TITLE
SPA
Main citation
Xu H, Ma Y, Xu LL, Li Y, ...&, Bi W. (2025) SPA Nat Commun, 16 (1) 1413. doi:10.1038/s41467-025-56669-1. PMID 39915470
ABSTRACT
Sample relatedness is a major confounder in genome-wide association studies (GWAS), potentially leading to inflated type I error rates if not appropriately controlled. A common strategy is to incorporate a random effect related to genetic relatedness matrix (GRM) into regression models. However, this approach is challenging for large-scale GWAS of complex traits, such as longitudinal traits. Here we propose a scalable and accurate analysis framework, SPAGRM, which controls for sample relatedness via a precise approximation of the joint distribution of genotypes. SPAGRM can utilize GRM-free models and thus is applicable to various trait types and statistical methods, including linear mixed models and generalized estimation equations for longitudinal traits. A hybrid strategy incorporating saddlepoint approximation greatly increases the accuracy to analyze low-frequency and rare genetic variants, especially in unbalanced phenotypic distributions. We also introduce SPAGRM(CCT) to aggregate the results following different models via Cauchy combination test. Extensive simulations and real data analyses demonstrated that SPAGRM maintains well-controlled type I error rates and SPAGRM(CCT) can serve as a broadly effective method. Applying SPAGRM to 79 longitudinal traits extracted from UK Biobank primary care data, we identified 7,463 genetic loci, making a pioneering attempt to conduct GWAS for these traits as longitudinal traits.
DOI
10.1038/s41467-025-56669-1
SPAGxECCT
PUBMED_LINK
DESCRIPTION
A scalable and accurate framework for large-scale genome-wide gene-environment interaction (G×E) analysis.
URL
KEYWORDS
Cauchy combination test
TITLE
Efficient and accurate framework for genome-wide gene-environment interaction analysis in large-scale biobanks.
Main citation
Ma Y, Zhao Y, Zhang JF, Bi W. (2025) Efficient and accurate framework for genome-wide gene-environment interaction analysis in large-scale biobanks. Nat Commun, 16 (1) 3064. doi:10.1038/s41467-025-57887-3. PMID 40157913
ABSTRACT
Gene-environment interaction (G×E) analysis elucidates the interplay between genetic and environmental factors. Genome-wide association studies (GWAS) have expanded to encompass complex traits like time-to-event and ordinal traits, which provide richer phenotypic information. However, most existing scalable approaches focus only on quantitative or binary traits. Here we propose SPAGxECCT, a scalable and accurate framework for diverse trait types. SPAGxECCT fits a genotype-independent model and employs a hybrid strategy including saddlepoint approximation (SPA) for accurate p value calculation, especially for low-frequency variants and unbalanced phenotypic distributions. We extend SPAGxECCT to SPAGxEmixCCT, which accounts for population stratification and is applicable to multi-ancestry or admixed populations. SPAGxEmixCCT can further be extended to SPAGxEmixCCT-local, which identifies ancestry-specific G×E effects using local ancestry. Through extensive simulations and real data analyses of UK Biobank data, we demonstrate that SPAGxECCT and SPAGxEmixCCT are scalable to analyze large-scale study cohort, control type I error rates effectively, and maintain power.
DOI
10.1038/s41467-025-57887-3
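The SPAGRM(CCT) and SPAGxECCT entries above both aggregate p-values with a Cauchy combination test. A minimal sketch of that combination step follows, assuming the standard form of the test (each p-value is mapped to a Cauchy variate, the variates are averaged with weights, and the average is mapped back to a p-value). The function name `cauchy_combination` and the equal default weights are illustrative assumptions and are not taken from either package.

```python
import math


def cauchy_combination(p_values, weights=None):
    """Combine p-values with a Cauchy combination test.

    Hypothetical helper for illustration; equal weights are assumed
    unless `weights` (non-negative, summing to 1) is supplied.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p < 1.0) for p in p_values):
        raise ValueError("p-values must lie strictly between 0 and 1")
    if weights is None:
        weights = [1.0 / len(p_values)] * len(p_values)
    if len(weights) != len(p_values) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must match p_values and sum to 1")

    # Under the null, tan((0.5 - p) * pi) follows a standard Cauchy distribution.
    statistic = sum(
        w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values)
    )
    # Map the weighted sum back through the Cauchy survival function.
    return 0.5 - math.atan(statistic) / math.pi


# Example: combine p-values obtained from two different models.
combined = cauchy_combination([0.01, 0.5])
```

With a single input the function returns that p-value unchanged, which is a quick sanity check on the transformation.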
SparsePro
PUBMED_LINK
DESCRIPTION
SparsePro is a command line tool for efficiently conducting genome-wide fine-mapping. Our method has two key features: First, by creating a sparse low-dimensional projection of the high-dimensional genotype, we enable a linear search of causal variants instead of an exponential search of causal configurations in most existing methods; Second, we adopt a probabilistic framework with a highly efficient variational expectation-maximization algorithm to integrate statistical associations and functional priors.
URL
TITLE
SparsePro: An efficient fine-mapping method integrating summary statistics and functional annotations.
Main citation
Zhang W, Najafabadi H, Li Y. (2023) SparsePro: An efficient fine-mapping method integrating summary statistics and functional annotations. PLoS Genet, 19 (12) e1011104. doi:10.1371/journal.pgen.1011104. PMID 38153934
ABSTRACT
Identifying causal variants from genome-wide association studies (GWAS) is challenging due to widespread linkage disequilibrium (LD) and the possible existence of multiple causal variants in the same genomic locus. Functional annotations of the genome may help to prioritize variants that are biologically relevant and thus improve fine-mapping of GWAS results. Classical fine-mapping methods conducting an exhaustive search of variant-level causal configurations have a high computational cost, especially when the underlying genetic architecture and LD patterns are complex. SuSiE provided an iterative Bayesian stepwise selection algorithm for efficient fine-mapping. In this work, we build connections between SuSiE and a paired mean field variational inference algorithm through the implementation of a sparse projection, and propose effective strategies for estimating hyperparameters and summarizing posterior probabilities. Moreover, we incorporate functional annotations into fine-mapping by jointly estimating enrichment weights to derive functionally-informed priors. We evaluate the performance of SparsePro through extensive simulations using resources from the UK Biobank. Compared to state-of-the-art methods, SparsePro achieved improved power for fine-mapping with reduced computation time. We demonstrate the utility of SparsePro through fine-mapping of five functional biomarkers of clinically relevant phenotypes. In summary, we have developed an efficient fine-mapping method for integrating summary statistics and functional annotations. Our method can have wide utility in understanding the genetics of complex traits and increasing the yield of functional follow-up studies of GWAS. SparsePro software is available on GitHub at https://github.com/zhwm/SparsePro.
DOI
10.1371/journal.pgen.1011104
sQTLseekeR
PUBMED_LINK
DESCRIPTION
sQTLseekeR is an R package to detect splicing QTLs (sQTLs), which are variants associated with change in the splicing pattern of a gene. Here, splicing patterns are modeled by the relative expression of the transcripts of a gene.
URL
TITLE
Identification of genetic variants associated with alternative splicing using sQTLseekeR.
Main citation
Monlong J, Calvo M, Ferreira PG, Guigó R. (2014) Identification of genetic variants associated with alternative splicing using sQTLseekeR. Nat Commun, 5 () 4698. doi:10.1038/ncomms5698. PMID 25140736
ABSTRACT
Identification of genetic variants affecting splicing in RNA sequencing population studies is still in its infancy. Splicing phenotype is more complex than gene expression and ought to be treated as a multivariate phenotype to be recapitulated completely. Here we represent the splicing pattern of a gene as the distribution of the relative abundances of a gene's alternative transcript isoforms. We develop a statistical framework that uses a distance-based approach to compute the variability of splicing ratios across observations, and a non-parametric analogue to multivariate analysis of variance. We implement this approach in the R package sQTLseekeR and use it to analyze RNA-Seq data from the Geuvadis project in 465 individuals. We identify hundreds of single nucleotide polymorphisms (SNPs) as splicing QTLs (sQTLs), including some falling in genome-wide association study SNPs. By developing the appropriate metrics, we show that sQTLseekeR compares favorably with existing methods that rely on univariate approaches, predicting variants that behave as expected from mutations affecting splicing.
DOI
10.1038/ncomms5698
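The sQTLseekeR entry above models a gene's splicing pattern as the relative expression of its transcripts. A minimal sketch of that representation follows, assuming transcript expression is given as non-negative numbers keyed by transcript ID. The function name `splicing_ratios` and the transcript IDs are hypothetical, and this shows only the ratio step, not the package's distance-based test.

```python
def splicing_ratios(transcript_expression):
    """Return each transcript's share of the gene's total expression.

    `transcript_expression` maps a transcript ID to a non-negative
    expression value for one gene in one sample (illustrative input).
    """
    if any(value < 0 for value in transcript_expression.values()):
        raise ValueError("expression values must be non-negative")
    total = sum(transcript_expression.values())
    if total <= 0:
        raise ValueError("gene has no expressed transcripts")
    # Relative abundances sum to 1 across the gene's transcripts.
    return {tx: value / total for tx, value in transcript_expression.items()}


# Example: a gene with two isoforms in one sample.
ratios = splicing_ratios({"tx1": 30.0, "tx2": 10.0})
# ratios == {"tx1": 0.75, "tx2": 0.25}
```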
STAAR
PUBMED_LINK
FULL NAME
variant-set test for association using annotation information
DESCRIPTION
STAAR is an R package for performing variant-Set Test for Association using Annotation infoRmation (STAAR) procedure in whole-genome sequencing (WGS) studies. STAAR is a general framework that incorporates both qualitative functional categories and quantitative complementary functional annotations using an omnibus multi-dimensional weighting scheme. STAAR accounts for population structure and relatedness, and is scalable for analyzing large WGS studies of continuous and dichotomous traits.
URL
KEYWORDS
functional annotations
TITLE
Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale.
Main citation
Li X, Li Z, Zhou H, Gaynor SM, ...&, Lin X. (2020) Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale. Nat Genet, 52 (9) 969-983. doi:10.1038/s41588-020-0676-4. PMID 32839606
ABSTRACT
Large-scale whole-genome sequencing studies have enabled the analysis of rare variants (RVs) associated with complex phenotypes. Commonly used RV association tests have limited scope to leverage variant functions. We propose STAAR (variant-set test for association using annotation information), a scalable and powerful RV association test method that effectively incorporates both variant categories and multiple complementary annotations using a dynamic weighting scheme. For the latter, we introduce 'annotation principal components', multidimensional summaries of in silico variant annotations. STAAR accounts for population structure and relatedness and is scalable for analyzing very large cohort and biobank whole-genome sequencing studies of continuous and dichotomous traits. We applied STAAR to identify RVs associated with four lipid traits in 12,316 discovery and 17,822 replication samples from the Trans-Omics for Precision Medicine Program. We discovered and replicated new RV associations, including disruptive missense RVs of NPC1L1 and an intergenic region near APOC1P1 associated with low-density lipoprotein cholesterol.
DOI
10.1038/s41588-020-0676-4
STAARpipeline
PUBMED_LINK
FULL NAME
variant-set test for association using annotation information
DESCRIPTION
STAARpipeline is an R package for phenotype-genotype association analyses of biobank-scale WGS/WES data, including single variant analysis and variant set analysis.
URL
TITLE
A framework for detecting noncoding rare-variant associations of large-scale whole-genome sequencing studies.
Main citation
Li Z, Li X, Zhou H, Gaynor SM, ...&, Lin X. (2022) A framework for detecting noncoding rare-variant associations of large-scale whole-genome sequencing studies. Nat Methods, 19 (12) 1599-1611. doi:10.1038/s41592-022-01640-x. PMID 36303018
ABSTRACT
Large-scale whole-genome sequencing studies have enabled analysis of noncoding rare-variant (RV) associations with complex human diseases and traits. Variant-set analysis is a powerful approach to study RV association. However, existing methods have limited ability in analyzing the noncoding genome. We propose a computationally efficient and robust noncoding RV association detection framework, STAARpipeline, to automatically annotate a whole-genome sequencing study and perform flexible noncoding RV association analysis, including gene-centric analysis and fixed window-based and dynamic window-based non-gene-centric analysis by incorporating variant functional annotations. In gene-centric analysis, STAARpipeline uses STAAR to group noncoding variants based on functional categories of genes and incorporate multiple functional annotations. In non-gene-centric analysis, STAARpipeline uses SCANG-STAAR to incorporate dynamic window sizes and multiple functional annotations. We apply STAARpipeline to identify noncoding RV sets associated with four lipid traits in 21,015 discovery samples from the Trans-Omics for Precision Medicine (TOPMed) program and replicate several of them in an additional 9,123 TOPMed samples. We also analyze five non-lipid TOPMed traits.
DOI
10.1038/s41592-022-01640-x
Stephens
PUBMED_LINK
TITLE
A unified framework for association analysis with multiple related phenotypes.
Main citation
Stephens M. (2013) A unified framework for association analysis with multiple related phenotypes. PLoS One, 8 (7) e65245. doi:10.1371/journal.pone.0065245. PMID 23861737
ABSTRACT
We consider the problem of assessing associations between multiple related outcome variables, and a single explanatory variable of interest. This problem arises in many settings, including genetic association studies, where the explanatory variable is genotype at a genetic variant. We outline a framework for conducting this type of analysis, based on Bayesian model comparison and model averaging for multivariate regressions. This framework unifies several common approaches to this problem, and includes both standard univariate and standard multivariate association tests as special cases. The framework also unifies the problems of testing for associations and explaining associations - that is, identifying which outcome variables are associated with genotype. This provides an alternative to the usual, but conceptually unsatisfying, approach of resorting to univariate tests when explaining and interpreting significant multivariate findings. The method is computationally tractable genome-wide for modest numbers of phenotypes (e.g. 5-10), and can be applied to summary data, without access to raw genotype and phenotype data. We illustrate the methods on both simulated examples, and to a genome-wide association study of blood lipid traits where we identify 18 potential novel genetic associations that were not identified by univariate analyses of the same data.
DOI
10.1371/journal.pone.0065245
StructLMM
PUBMED_LINK
FULL NAME
Structured Linear Mixed Model
DESCRIPTION
Structured Linear Mixed Model (StructLMM) is a computationally efficient method to test for and characterize loci that interact with multiple environments.
URL
TITLE
A linear mixed-model approach to study multivariate gene-environment interactions.
Main citation
Moore R, Casale FP, Jan Bonder M, Horta D, ...&, Stegle O. (2019) A linear mixed-model approach to study multivariate gene-environment interactions. Nat Genet, 51 (1) 180-186. doi:10.1038/s41588-018-0271-0. PMID 30478441
ABSTRACT
Different exposures, including diet, physical activity, or external conditions can contribute to genotype-environment interactions (G×E). Although high-dimensional environmental data are increasingly available and multiple exposures have been implicated with G×E at the same loci, multi-environment tests for G×E are not established. Here, we propose the structured linear mixed model (StructLMM), a computationally efficient method to identify and characterize loci that interact with one or more environments. After validating our model using simulations, we applied StructLMM to body mass index in the UK Biobank, where our model yields previously known and novel G×E signals. Finally, in an application to a large blood eQTL dataset, we demonstrate that StructLMM can be used to study interactions with hundreds of environmental variables.
DOI
10.1038/s41588-018-0271-0
SumHer
PUBMED_LINK
URL
TITLE
SumHer better estimates the SNP heritability of complex traits from summary statistics.
Main citation
Speed D, Balding DJ. (2019) SumHer better estimates the SNP heritability of complex traits from summary statistics. Nat Genet, 51 (2) 277-284. doi:10.1038/s41588-018-0279-5. PMID 30510236
ABSTRACT
We present SumHer, software for estimating confounding bias, SNP heritability, enrichments of heritability and genetic correlations using summary statistics from genome-wide association studies. The key difference between SumHer and the existing software LD Score Regression (LDSC) is that SumHer allows the user to specify the heritability model. We apply SumHer to results from 24 large-scale association studies (average sample size 121,000) using our recommended heritability model. We show that these studies tended to substantially over-correct for confounding, and as a result the number of genome-wide significant loci was under-reported by about a quarter. We also estimate enrichments for 24 categories of SNPs defined by functional annotations. A previous study using LDSC reported that conserved regions were 13-fold enriched, and found a further six categories with above threefold enrichment. By contrast, our analysis using SumHer finds that none of the categories have enrichment above twofold. SumHer provides an improved understanding of the genetic architecture of complex traits, which enables more efficient analysis of future genetic data.
DOI
10.1038/s41588-018-0279-5
SUPERGNOVA
PUBMED_LINK
FULL NAME
SUPER GeNetic cOVariance Analyzer
DESCRIPTION
SUPERGNOVA (SUPER GeNetic cOVariance Analyzer) is a statistical framework to perform local genetic covariance analysis.
URL
TITLE
SUPERGNOVA: local genetic correlation analysis reveals heterogeneous etiologic sharing of complex traits.
Main citation
Zhang Y, Lu Q, Ye Y, Huang K, ...&, Zhao H. (2021) SUPERGNOVA: local genetic correlation analysis reveals heterogeneous etiologic sharing of complex traits. Genome Biol, 22 (1) 262. doi:10.1186/s13059-021-02478-w. PMID 34493297
ABSTRACT
Local genetic correlation quantifies the genetic similarity of complex traits in specific genomic regions. However, accurate estimation of local genetic correlation remains challenging, due to linkage disequilibrium in local genomic regions and sample overlap across studies. We introduce SUPERGNOVA, a statistical framework to estimate local genetic correlations using summary statistics from genome-wide association studies. We demonstrate that SUPERGNOVA outperforms existing methods through simulations and analyses of 30 complex traits. In particular, we show that the positive yet paradoxical genetic correlation between autism spectrum disorder and cognitive performance could be explained by two etiologically distinct genetic signatures with bidirectional local genetic correlations.
DOI
10.1186/s13059-021-02478-w
SUSIE
PUBMED_LINK
FULL NAME
sum of single effects
DESCRIPTION
The susieR package implements a simple new way to perform variable selection in multiple regression (y = Xb + e). The methods implemented here are particularly well-suited to settings where some of the X variables are highly correlated, and the true effects are highly sparse (e.g. <20 non-zero effects in the vector b). One example of this is genetic fine-mapping applications, and this application was a major motivation for developing these methods.
URL
KEYWORDS
fine-mapping, sum of single-effects (SuSiE) regression, iterative Bayesian stepwise selection (IBSS)
TITLE
A simple new approach to variable selection in regression, with application to genetic fine mapping.
Main citation
Wang G, Sarkar A, Carbonetto P, Stephens M. (2020) A simple new approach to variable selection in regression, with application to genetic fine mapping. J R Stat Soc Series B Stat Methodol, 82 (5) 1273-1300. doi:10.1111/rssb.12388. PMID 37220626
ABSTRACT
We introduce a simple new approach to variable selection in linear regression, with a particular focus on quantifying uncertainty in which variables should be selected. The approach is based on a new model - the "Sum of Single Effects" (SuSiE) model - which comes from writing the sparse vector of regression coefficients as a sum of "single-effect" vectors, each with one non-zero element. We also introduce a corresponding new fitting procedure - Iterative Bayesian Stepwise Selection (IBSS) - which is a Bayesian analogue of stepwise selection methods. IBSS shares the computational simplicity and speed of traditional stepwise methods, but instead of selecting a single variable at each step, IBSS computes a distribution on variables that captures uncertainty in which variable to select. We provide a formal justification of this intuitive algorithm by showing that it optimizes a variational approximation to the posterior distribution under the SuSiE model. Further, this approximate posterior distribution naturally yields convenient novel summaries of uncertainty in variable selection, providing a Credible Set of variables for each selection. Our methods are particularly well-suited to settings where variables are highly correlated and detectable effects are sparse, both of which are characteristics of genetic fine-mapping applications. We demonstrate through numerical experiments that our methods outperform existing methods for this task, and illustrate their application to fine-mapping genetic variants influencing alternative splicing in human cell-lines. We also discuss the potential and challenges for applying these methods to generic variable selection problems.
DOI
10.1111/rssb.12388
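The SuSiE entry above describes writing the sparse coefficient vector b in y = Xb + e as a sum of "single-effect" vectors, each with one non-zero element. A minimal sketch of that model structure follows. It only constructs b and simulates y; it does not implement IBSS or any part of the susieR fitting procedure. All names (`sum_of_single_effects`, `simulate_response`) and all numbers are illustrative assumptions.

```python
import random


def sum_of_single_effects(n_variables, single_effects):
    """Build b as a sum of single-effect vectors.

    `single_effects` is a list of (index, effect) pairs; each pair
    defines one vector with exactly one non-zero entry.
    """
    b = [0.0] * n_variables
    for index, effect in single_effects:
        single = [0.0] * n_variables  # one "single-effect" vector
        single[index] = effect
        b = [total + s for total, s in zip(b, single)]
    return b


def simulate_response(X, b, noise_sd, rng):
    """Simulate y = Xb + e with independent Gaussian noise e."""
    return [
        sum(x_ij * b_j for x_ij, b_j in zip(row, b)) + rng.gauss(0.0, noise_sd)
        for row in X
    ]


# Example: 5 samples, 10 variables, 2 single effects (made-up values).
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(5)]
b = sum_of_single_effects(10, [(2, 0.8), (7, -0.5)])
y = simulate_response(X, b, noise_sd=1.0, rng=rng)
```

The number of (index, effect) pairs bounds the number of non-zero entries in b, which is the sparsity assumption the description refers to.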
SuSiE PCA
PUBMED_LINK
DESCRIPTION
SuSiE PCA is the abbreviation for the Sum of Single Effects model for principal component analysis. We develop SuSiE PCA for efficient variable selection in PCA when dealing with high dimensional data with sparsity, and for quantifying uncertainty of contributing features for each latent component through posterior inclusion probabilities (PIPs).
URL
KEYWORDS
PCA, SuSiE
TITLE
SuSiE PCA: A scalable Bayesian variable selection technique for principal component analysis.
Main citation
Yuan D, Mancuso N. (2023) SuSiE PCA: A scalable Bayesian variable selection technique for principal component analysis. iScience, 26 (11) 108181. doi:10.1016/j.isci.2023.108181. PMID 37953948
ABSTRACT
Latent factor models, like principal component analysis (PCA), provide a statistical framework to infer low-rank representation in various biological contexts. However, feature selection is challenging when this low-rank structure manifests from a sparse subspace. We introduce SuSiE PCA, a scalable sparse latent factor approach that evaluates uncertainty in contributing variables through posterior inclusion probabilities. We validate our model in extensive simulations and demonstrate that SuSiE PCA outperforms other approaches in signal detection and model robustness. We apply SuSiE PCA to multi-tissue expression quantitative trait loci (eQTLs) data from GTEx v8 and identify tissue-specific factors and their contributing eGenes. We further investigate its performance on the large-scale perturbation data and find that SuSiE PCA identifies modules with a higher enrichment of ribosome-related genes than sparse PCA (false discovery rate [FDR] = 9.2×10^-82 vs. 1.4×10^-33), while being ∼18x faster. Overall, SuSiE PCA provides an efficient tool to identify relevant features in high-dimensional biological data.
DOI
10.1016/j.isci.2023.108181
SUSIE-RSS
PUBMED_LINK
FULL NAME
sum of single effects regression with summary statistics
DESCRIPTION
The susieR package implements a simple new way to perform variable selection in multiple regression (y = Xb + e). The methods implemented here are particularly well-suited to settings where some of the X variables are highly correlated, and the true effects are highly sparse (e.g. <20 non-zero effects in the vector b). One example of this is genetic fine-mapping applications, and this application was a major motivation for developing these methods.
URL
KEYWORDS
fine-mapping, summary statistics
TITLE
Fine-mapping from summary data with the "Sum of Single Effects" model.
Main citation
Zou Y, Carbonetto P, Wang G, Stephens M. (2022) Fine-mapping from summary data with the "Sum of Single Effects" model. PLoS Genet, 18 (7) e1010299. doi:10.1371/journal.pgen.1010299. PMID 35853082
ABSTRACT
In recent work, Wang et al introduced the "Sum of Single Effects" (SuSiE) model, and showed that it provides a simple and efficient approach to fine-mapping genetic variants from individual-level data. Here we present new methods for fitting the SuSiE model to summary data, for example to single-SNP z-scores from an association study and linkage disequilibrium (LD) values estimated from a suitable reference panel. To develop these new methods, we first describe a simple, generic strategy for extending any individual-level data method to deal with summary data. The key idea is to replace the usual regression likelihood with an analogous likelihood based on summary data. We show that existing fine-mapping methods such as FINEMAP and CAVIAR also (implicitly) use this strategy, but in different ways, and so this provides a common framework for understanding different methods for fine-mapping. We investigate other common practical issues in fine-mapping with summary data, including problems caused by inconsistencies between the z-scores and LD estimates, and we develop diagnostics to identify these inconsistencies. We also present a new refinement procedure that improves model fits in some data sets, and hence improves overall reliability of the SuSiE fine-mapping results. Detailed evaluations of fine-mapping methods in a range of simulated data sets show that SuSiE applied to summary data is competitive, in both speed and accuracy, with the best available fine-mapping methods for summary data.
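The "single effect" building block of the SuSiE model can be illustrated from z-scores alone. The sketch below is not the susieR implementation: it assumes exactly one causal variant, a uniform prior over variants, and a hypothetical prior variance `prior_variance` for the z-score under association, and it ignores the LD matrix that SuSiE-RSS uses when fitting a sum of several such effects. The function name `single_effect_pips` is introduced here for illustration only.

```python
import math


def single_effect_pips(z_scores, prior_variance=25.0):
    """Posterior inclusion probabilities under a one-causal-variant assumption.

    Each variant's Bayes factor compares z ~ N(0, 1 + W) (associated)
    against z ~ N(0, 1) (null), where W = prior_variance is an assumed value.
    """
    shrink = prior_variance / (1.0 + prior_variance)
    log_bfs = [
        -0.5 * math.log(1.0 + prior_variance) + 0.5 * z * z * shrink
        for z in z_scores
    ]
    # Subtract the maximum before exponentiating to avoid overflow.
    max_log = max(log_bfs)
    weights = [math.exp(v - max_log) for v in log_bfs]
    total = sum(weights)
    # With a uniform prior, each PIP is proportional to its Bayes factor.
    return [w / total for w in weights]


pips = single_effect_pips([0.5, 4.0, 3.5, -1.0])
print([round(p, 3) for p in pips])
```

The variant with the largest absolute z-score receives the largest probability, and the probabilities sum to one because the sketch assumes a single effect.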
DOI
10.1371/journal.pgen.1010299
SUSIEx
DESCRIPTION
SuSiEx is a Python-based command-line tool that performs cross-ethnic fine-mapping using GWAS summary statistics and LD reference panels. The method is built on the Sum of Single Effects (SuSiE) model.
URL
KEYWORDS
cross-ancestry, fine-mapping
Main citation
Yuan, K., Longchamps, R. J., Pardiñas, A. F., Yu, M., Chen, T. T., Lin, S. C., ... & Schizophrenia Workgroup of Psychiatric Genomics Consortium. (2023). Fine-mapping across diverse ancestries drives the discovery of putative causal variants underlying human complex traits and diseases. medRxiv.
Swave
PUBMED_LINK
DESCRIPTION
Swave calls and genotypes structural variants from assembly-based pangenome graphs by encoding mapping patterns as images (“projection waves”) and classifying signals with a recurrent neural network, including complex and repetitive SVs for population-level characterization.
URL
KEYWORDS
structural variant, pangenome graph, deep learning, RNN, population genomics
TITLE
Population-level structural variant characterization using pangenome graphs.
Main citation
Wang S, Xu T, Zhang P, Ye K. (2026) Population-level structural variant characterization using pangenome graphs. Nat Genet, 58 (3) 664-672. doi:10.1038/s41588-026-02538-6. PMID 41807798
ABSTRACT
Population-level structural variant (SV) profiling is crucial in the era of pangenomes. However, identifying SVs from genome assemblies and pangenome graphs remains a substantial challenge. Here we present Swave, a sequence-to-image, deep learning-based method that accurately resolves both simple and complex SVs, along with their population characteristics, from assembly-derived pangenome graphs. Swave introduces 'projection waves' to summarize the dotplot images that capture mapping patterns between reference and SV-indicating alleles in the pangenome. Then, a recurrent neural network distinguishes true SV signals from background noise introduced by genomic repeats. Swave demonstrates superior performance in both SV-type classification and genotyping compared with existing methods. When applied to healthy cohorts and rare-disease cohorts, Swave reveals complex and polymorphic SV patterns across human populations and identifies potentially pathogenic SVs. These advancements will facilitate the creation of comprehensive population-level SV catalogs, deepening our understanding of SVs in genetic diversity and disease associations.
DOI
10.1038/s41588-026-02538-6
t-SNE
FULL NAME
t-Distributed Stochastic Neighbor Embedding
URL
KEYWORDS
t-SNE
TATES
PUBMED_LINK
FULL NAME
Trait-based Association Test that uses Extended Simes procedure
TITLE
TATES: efficient multivariate genotype-phenotype analysis for genome-wide association studies.
Main citation
van der Sluis S, Posthuma D, Dolan CV. (2013) TATES: efficient multivariate genotype-phenotype analysis for genome-wide association studies. PLoS Genet, 9 (1) e1003235. doi:10.1371/journal.pgen.1003235. PMID 23359524
ABSTRACT
To date, the genome-wide association study (GWAS) is the primary tool to identify genetic variants that cause phenotypic variation. As GWAS analyses are generally univariate in nature, multivariate phenotypic information is usually reduced to a single composite score. This practice often results in loss of statistical power to detect causal variants. Multivariate genotype-phenotype methods do exist but attain maximal power only in special circumstances. Here, we present a new multivariate method that we refer to as TATES (Trait-based Association Test that uses Extended Simes procedure), inspired by the GATES procedure proposed by Li et al (2011). For each component of a multivariate trait, TATES combines p-values obtained in standard univariate GWAS to acquire one trait-based p-value, while correcting for correlations between components. Extensive simulations, probing a wide variety of genotype-phenotype models, show that TATES's false positive rate is correct, and that TATES's statistical power to detect causal variants explaining 0.5% of the variance can be 2.5-9 times higher than the power of univariate tests based on composite scores and 1.5-2 times higher than the power of the standard MANOVA. Unlike other multivariate methods, TATES detects both genetic variants that are common to multiple phenotypes and genetic variants that are specific to a single phenotype, i.e. TATES provides a more complete view of the genetic architecture of complex traits. As the actual causal genotype-phenotype model is usually unknown and probably phenotypically and genetically complex, TATES, available as an open source program, constitutes a powerful new multivariate strategy that allows researchers to identify novel causal variants, while the complexity of traits is no longer a limiting factor.
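The combination step described in the abstract can be sketched as an extended Simes procedure: sort the univariate p-values, weight the j-th smallest by the ratio of effective numbers of independent tests, and take the minimum. This is a minimal sketch and not the TATES program. It assumes the correlation matrix between p-values (`pvalue_corr`) is supplied by the caller, and it uses the eigenvalue-based effective-number estimate from the GATES procedure; the function names are hypothetical.

```python
import numpy as np


def effective_number(corr):
    """Effective number of independent tests from a correlation matrix.

    Uses m - sum(lambda_i - 1) over eigenvalues lambda_i greater than 1.
    """
    eigenvalues = np.linalg.eigvalsh(corr)
    return corr.shape[0] - np.sum(eigenvalues[eigenvalues > 1.0] - 1.0)


def extended_simes(pvalues, pvalue_corr):
    """Combine per-phenotype p-values for one variant into a trait-based p-value."""
    pvalues = np.asarray(pvalues, dtype=float)
    pvalue_corr = np.asarray(pvalue_corr, dtype=float)
    order = np.argsort(pvalues)
    sorted_p = pvalues[order]
    sorted_corr = pvalue_corr[np.ix_(order, order)]
    m_e = effective_number(sorted_corr)
    candidates = []
    for j in range(1, len(sorted_p) + 1):
        # Effective number of tests among the j smallest p-values.
        m_ej = effective_number(sorted_corr[:j, :j])
        candidates.append(m_e * sorted_p[j - 1] / m_ej)
    return float(min(min(candidates), 1.0))


p_independent = extended_simes([0.01, 0.04, 0.03], np.eye(3))
p_redundant = extended_simes([0.01, 0.04, 0.03], np.ones((3, 3)))
print(p_independent, p_redundant)
```

With uncorrelated p-values the sketch reduces to the ordinary Simes test, and with perfectly correlated p-values it returns the smallest p-value unchanged, since only one effective test is being performed.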
DOI
10.1371/journal.pgen.1003235
TCSC
PUBMED_LINK
FULL NAME
Tissue co-regulation score regression
DESCRIPTION
TCSC is a statistical genetics method to identify causal tissues in diseases and complex traits. We leverage TWAS and GWAS summary statistics while explicitly modeling the genetic co-regulation of genes across tissues.
URL
TITLE
Modeling tissue co-regulation estimates tissue-specific contributions to disease.
Main citation
Amariuta T, Siewert-Rocks K, Price AL. (2023) Modeling tissue co-regulation estimates tissue-specific contributions to disease. Nat Genet, 55 (9) 1503-1511. doi:10.1038/s41588-023-01474-z. PMID 37580597
ABSTRACT
Integrative analyses of genome-wide association studies and gene expression data have implicated many disease-critical tissues. However, co-regulation of genetic effects on gene expression across tissues impedes distinguishing biologically causal tissues from tagging tissues. In the present study, we introduce tissue co-regulation score regression (TCSC), which disentangles causal tissues from tagging tissues by regressing gene-disease association statistics (from transcriptome-wide association studies) on tissue co-regulation scores, reflecting correlations of predicted gene expression across genes and tissues. We applied TCSC to 78 diseases/traits (average n = 302,000) and gene expression prediction models for 48 GTEx tissues. TCSC identified 21 causal tissue-trait pairs at a 5% false discovery rate (FDR), including well-established findings, biologically plausible new findings (for example, aorta artery and glaucoma) and increased specificity of known tissue-trait associations (for example, subcutaneous adipose, but not visceral adipose, and high-density lipoprotein). TCSC also identified 17 causal tissue-trait covariance pairs at 5% FDR. In conclusion, TCSC is a precise method for distinguishing causal tissues from tagging tissues.
DOI
10.1038/s41588-023-01474-z
tensorQTL
PUBMED_LINK
DESCRIPTION
tensorQTL is a GPU-enabled QTL mapper, achieving ~200-300 fold faster cis- and trans-QTL mapping compared to CPU-based implementations.
URL
TITLE
Scaling computational genomics to millions of individuals with GPUs.
Main citation
Taylor-Weiner A, Aguet F, Haradhvala NJ, Gosai S, ...&, Getz G. (2019) Scaling computational genomics to millions of individuals with GPUs. Genome Biol, 20 (1) 228. doi:10.1186/s13059-019-1836-7. PMID 31675989
ABSTRACT
Current genomics methods are designed to handle tens to thousands of samples but will need to scale to millions to match the pace of data and hypothesis generation in biomedical science. Here, we show that high efficiency at low cost can be achieved by leveraging general-purpose libraries for computing using graphics processing units (GPUs), such as PyTorch and TensorFlow. We demonstrate > 200-fold decreases in runtime and ~ 5-10-fold reductions in cost relative to CPUs. We anticipate that the accessibility of these libraries will lead to a widespread adoption of GPUs in computational genomics.
DOI
10.1186/s13059-019-1836-7
TetraHer
PUBMED_LINK
DESCRIPTION
TetraHer is a method for estimating the liability heritability of binary phenotypes.
URL
TITLE
Estimating disease heritability from complex pedigrees allowing for ascertainment and covariates.
Main citation
Speed D, Evans DM. (2024) Estimating disease heritability from complex pedigrees allowing for ascertainment and covariates. Am J Hum Genet, 111 (4) 680-690. doi:10.1016/j.ajhg.2024.02.010. PMID 38490208
ABSTRACT
We propose TetraHer, a method for estimating the liability heritability of binary phenotypes. TetraHer has five key features. First, it can be applied to data from complex pedigrees that contain multiple types of relationships. Second, it can correct for ascertainment of cases or controls. Third, it can accommodate covariates. Fourth, it can model the contribution of common environment. Fifth, it produces a likelihood that can be used for significance testing. We first demonstrate the validity of TetraHer on simulated data. We then use TetraHer to estimate liability heritability for 229 codes from the tenth International Classification of Diseases (ICD-10). We identify 107 codes with significant heritability (p < 0.05/229), which can be used in future analyses for investigating the genetic architecture of human diseases.
DOI
10.1016/j.ajhg.2024.02.010
TGVIS
PUBMED_LINK
FULL NAME
Tissue-Gene pairs, direct causal Variants, and Infinitesimal effects selector
DESCRIPTION
Multivariate TWAS approach that prioritizes causal gene–tissue pairs and candidate causal variants from GWAS summary data while explicitly controlling for genome-wide infinitesimal (polygenic) effects that can otherwise inflate false gene discoveries.
URL
KEYWORDS
multivariate TWAS, infinitesimal model, causal gene-tissue, eQTL, sQTL
TITLE
Uncovering causal gene-tissue pairs and variants through a multivariate TWAS controlling for infinitesimal effects.
Main citation
Yang Y, Lorincz-Comi N, Zhu X. (2025) Uncovering causal gene-tissue pairs and variants through a multivariate TWAS controlling for infinitesimal effects. Nat Commun, 16 (1) 6098. doi:10.1038/s41467-025-61423-8. PMID 40603866
ABSTRACT
Transcriptome-wide association studies (TWAS) are commonly used to prioritize causal genes underlying associations found in genome-wide association studies (GWAS) and have been extended to identify causal genes through multivariate TWAS methods. However, recent studies have shown that widespread infinitesimal effects due to polygenicity can impair the performance of these methods. In this report, we introduce a multivariate TWAS method named tissue-gene pairs, direct causal variants, and infinitesimal effects selector (TGVIS) to identify tissue-specific causal genes and direct causal variants while accounting for infinitesimal effects. In simulations, TGVIS maintains an accurate prioritization of causal gene-tissue pairs and variants and demonstrates comparable or superior power to existing approaches, regardless of the presence of infinitesimal effects. In the real data analysis of GWAS summary data of 45 cardiometabolic traits and expression/splicing quantitative trait loci from 31 tissues, TGVIS is able to improve causal gene prioritization and identifies novel genes that were missed by conventional TWAS.
DOI
10.1038/s41467-025-61423-8
THISTLE
PUBMED_LINK
FULL NAME
testing for heterogeneity between isoform-eQTL effects
DESCRIPTION
THISTLE (testing for heterogeneity between isoform-eQTL effects) is a transcript-based splicing QTL (sQTL) mapping method that uses either individual-level genotype and RNA-seq data or summary-level isoform-eQTL data.
URL
TITLE
Genetic control of RNA splicing and its distinct role in complex trait variation.
Main citation
Qi T, Wu Y, Fang H, Zhang F, ...&, Yang J. (2022) Genetic control of RNA splicing and its distinct role in complex trait variation. Nat Genet, 54 (9) 1355-1363. doi:10.1038/s41588-022-01154-4. PMID 35982161
ABSTRACT
Most genetic variants identified from genome-wide association studies (GWAS) in humans are noncoding, indicating their role in gene regulation. Previous studies have shown considerable links of GWAS signals to expression quantitative trait loci (eQTLs) but the links to other genetic regulatory mechanisms, such as splicing QTLs (sQTLs), are underexplored. Here, we introduce an sQTL mapping method, testing for heterogeneity between isoform-eQTL effects (THISTLE), with improved power over competing methods. Applying THISTLE together with a complementary sQTL mapping strategy to brain transcriptomic (n = 2,865) and genotype data, we identified 12,794 genes with cis-sQTLs at P < 5 × 10^-8, approximately 61% of which were distinct from eQTLs. Integrating the sQTL data into GWAS for 12 brain-related complex traits (including diseases), we identified 244 genes associated with the traits through cis-sQTLs, approximately 61% of which could not be discovered using the corresponding eQTL data. Our study demonstrates the distinct role of most sQTLs in the genetic regulation of transcription and complex trait variation.
DOI
10.1038/s41588-022-01154-4
TL-PRS
PUBMED_LINK
FULL NAME
transfer learning PRS
DESCRIPTION
This R package helps users construct multi-ethnic polygenic risk scores (PRS) using transfer learning. It can help predict the PRS of minor ancestry groups using summary statistics from existing resources, such as UK Biobank.
URL
TITLE
The construction of cross-population polygenic risk scores using transfer learning.
Main citation
Zhao Z, Fritsche LG, Smith JA, Mukherjee B, ...&, Lee S. (2022) The construction of cross-population polygenic risk scores using transfer learning. Am J Hum Genet, 109 (11) 1998-2008. doi:10.1016/j.ajhg.2022.09.010. PMID 36240765
ABSTRACT
As most existing genome-wide association studies (GWASs) were conducted in European-ancestry cohorts, and as the existing polygenic risk score (PRS) models have limited transferability across ancestry groups, PRS research on non-European-ancestry groups needs to make efficient use of available data until we attain large sample sizes across all ancestry groups. Here we propose a PRS method using transfer learning techniques. Our approach, TL-PRS, uses gradient descent to fine-tune the baseline PRS model from an ancestry group with large sample GWASs to the dataset of target ancestry. In our application of constructing PRS for seven quantitative and two dichotomous traits for 10,285 individuals of South Asian ancestry and 8,168 individuals of African ancestry in UK Biobank, TL-PRS using PRS-CS as a baseline method obtained 25% average relative improvement for South Asian samples and 29% for African samples compared to the standard PRS-CS method in terms of predicted R^2. Our approach increases the transferability of PRSs across ancestries and thereby helps reduce existing inequities in genetics research.
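The fine-tuning idea can be illustrated with a toy example: start from baseline weights trained in a large source population and take a limited number of gradient steps on target-population data. This is a sketch of the general idea only, not the TL-PRS package. It assumes individual-level genotypes and a squared-error loss, and the data, learning rate, and step count are made-up illustrative values.

```python
def predict(genotypes, weights):
    """PRS for each individual: weighted sum of allele counts."""
    return [sum(g * w for g, w in zip(row, weights)) for row in genotypes]


def mean_squared_error(genotypes, phenotypes, weights):
    predictions = predict(genotypes, weights)
    return sum((p - y) ** 2 for p, y in zip(predictions, phenotypes)) / len(phenotypes)


def fine_tune(genotypes, phenotypes, baseline_weights, learning_rate=0.01, n_steps=50):
    """Gradient descent on squared error, initialised at the baseline weights."""
    weights = list(baseline_weights)
    n = len(phenotypes)
    for _ in range(n_steps):
        residuals = [p - y for p, y in zip(predict(genotypes, weights), phenotypes)]
        gradient = [
            2.0 / n * sum(r * row[j] for r, row in zip(residuals, genotypes))
            for j in range(len(weights))
        ]
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights


# Toy target-ancestry data (allele counts 0/1/2); values are illustrative only.
target_genotypes = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [1, 2, 1], [0, 0, 2], [2, 1, 0]]
target_phenotypes = [0.6, 0.7, 0.7, 1.2, 0.2, 1.0]
baseline = [0.5, 0.2, -0.1]  # hypothetical weights from the source ancestry

tuned = fine_tune(target_genotypes, target_phenotypes, baseline)
loss_before = mean_squared_error(target_genotypes, target_phenotypes, baseline)
loss_after = mean_squared_error(target_genotypes, target_phenotypes, tuned)
print(loss_before, loss_after)
```

Keeping the number of steps small leaves the tuned weights close to the baseline. In practice the learning rate and stopping point would be chosen on held-out target data rather than fixed in advance.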
DOI
10.1016/j.ajhg.2022.09.010
TOPMED
PUBMED_LINK
FULL NAME
Trans-Omics for Precision Medicine
URL
TITLE
Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program.
Main citation
Taliun D, Harris DN, Kessler MD, Carlson J, ...&, Abecasis GR. (2021) Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature, 590 (7845) 290-299. doi:10.1038/s41586-021-03205-y. PMID 33568819
ABSTRACT
The Trans-Omics for Precision Medicine (TOPMed) programme seeks to elucidate the genetic architecture and biology of heart, lung, blood and sleep disorders, with the ultimate goal of improving diagnosis, treatment and prevention of these diseases. The initial phases of the programme focused on whole-genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here we describe the TOPMed goals and design as well as the available resources and early insights obtained from the sequence data. The resources include a variant browser, a genotype imputation server, and genomic and phenotypic data that are available through dbGaP (Database of Genotypes and Phenotypes). In the first 53,831 TOPMed samples, we detected more than 400 million single-nucleotide and insertion or deletion variants after alignment with the reference genome. Additional previously undescribed variants were detected through assembly of unmapped reads and customized analysis in highly variable loci. Among the more than 400 million detected variants, 97% have frequencies of less than 1% and 46% are singletons that are present in only one individual (53% among unrelated individuals). These rare variants provide insights into mutational processes and recent human evolutionary history. The extensive catalogue of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and noncoding sequence variants to phenotypic variation. Furthermore, combining TOPMed haplotypes with modern imputation methods improves the power and reach of genome-wide association studies to include variants down to a frequency of approximately 0.01%.
DOI
10.1038/s41586-021-03205-y
TrajGWAS
PUBMED_LINK
FULL NAME
GWAS of longitudinal trajectories
DESCRIPTION
TrajGWAS.jl is a Julia package for performing genome-wide association studies (GWAS) for continuous longitudinal phenotypes using a modified linear mixed effects model. It builds upon the within-subject variance estimation by robust regression (WiSER) method and can be used to identify variants associated with changes in the mean and within-subject variability of the longitudinal trait.
URL
KEYWORDS
biomarker trajectories, mean, within-subject (WS) variability, linear mixed effect model, within-subject variance estimation by robust regression (WiSER) method
TITLE
GWAS of longitudinal trajectories at biobank scale.
Main citation
Ko S, German CA, Jensen A, Shen J, ...&, Zhou JJ. (2022) GWAS of longitudinal trajectories at biobank scale. Am J Hum Genet, 109 (3) 433-445. doi:10.1016/j.ajhg.2022.01.018. PMID 35196515
ABSTRACT
Biobanks linked to massive, longitudinal electronic health record (EHR) data make numerous new genetic research questions feasible. One among these is the study of biomarker trajectories. For example, high blood pressure measurements over visits strongly predict stroke onset, and consistently high fasting glucose and HbA1c levels define diabetes. Recent research reveals that not only the mean level of biomarker trajectories but also their fluctuations, or within-subject (WS) variability, are risk factors for many diseases. Glycemic variation, for instance, is recently considered an important clinical metric in diabetes management. It is crucial to identify the genetic factors that shift the mean or alter the WS variability of a biomarker trajectory. Compared to traditional cross-sectional studies, trajectory analysis utilizes more data points and captures a complete picture of the impact of time-varying factors, including medication history and lifestyle. Currently, there are no efficient tools for genome-wide association studies (GWASs) of biomarker trajectories at the biobank scale, even for just mean effects. We propose TrajGWAS, a linear mixed effect model-based method for testing genetic effects that shift the mean or alter the WS variability of a biomarker trajectory. It is scalable to biobank data with 100,000 to 1,000,000 individuals and many longitudinal measurements and robust to distributional assumptions. Simulation studies corroborate that TrajGWAS controls the type I error rate and is powerful. Analysis of eleven biomarkers measured longitudinally and extracted from UK Biobank primary care data for more than 150,000 participants with 1,800,000 observations reveals loci that significantly alter the mean or WS variability.
DOI
10.1016/j.ajhg.2022.01.018
Trans-Phar
PUBMED_LINK
FULL NAME
integration of Transcriptome-wide association study and Pharmacological database
DESCRIPTION
This software performs in silico screening of chemical compounds from a large-scale pharmacological database (Connectivity Map [CMap] L1000 library), selecting those whose expression profiles are inverse to the genetically regulated gene expression of common diseases.
URL
TITLE
Integration of genetically regulated gene expression and pharmacological library provides therapeutic drug candidates.
Main citation
Konuma T, Ogawa K, Okada Y. (2021) Integration of genetically regulated gene expression and pharmacological library provides therapeutic drug candidates. Hum Mol Genet, 30 (3-4) 294-304. doi:10.1093/hmg/ddab049. PMID 33577681
ABSTRACT
Approaches toward new therapeutics using disease genomics, such as genome-wide association study (GWAS), are anticipated. Here, we developed Trans-Phar [integration of transcriptome-wide association study (TWAS) and pharmacological database], achieving in silico screening of compounds from a large-scale pharmacological database (L1000 Connectivity Map), which have inverse expression profiles compared with tissue-specific genetically regulated gene expression. Firstly, we confirmed the statistical robustness by the application of the null GWAS data and enrichment in the true-positive drug-disease relationships by the application of UK-Biobank GWAS summary statistics in broad disease categories, then we applied the GWAS summary statistics of large-scale European meta-analysis (17 traits; average n = 201 849) and the hospitalized COVID-19 (n = 900 687), which has urgent need for drug development. We detected potential therapeutic compounds as well as anisomycin in schizophrenia (false discovery rate (FDR)-q = 0.056) and verapamil in hospitalized COVID-19 (FDR-q = 0.068) as top-associated compounds. This approach could be effective in disease genomics-driven drug development.
DOI
10.1093/hmg/ddab049
TReCASE (TReCASE (asSeq))
PUBMED_LINK
FULL NAME
Total Read Count and Allele-Specific Expression
DESCRIPTION
A Statistical Framework for eQTL Mapping Using RNA-seq Data.
URL
TITLE
A statistical method for joint estimation of cis-eQTLs and parent-of-origin effects under family trio design.
Main citation
Zhabotynsky V, Inoue K, Magnuson T, Mauro Calabrese J, ...&, Sun W. (2019) A statistical method for joint estimation of cis-eQTLs and parent-of-origin effects under family trio design. Biometrics, 75 (3) 864-874. doi:10.1111/biom.13026. PMID 30666629
ABSTRACT
RNA sequencing allows one to study allelic imbalance of gene expression, which may be due to genetic factors or genomic imprinting (i.e., higher expression of maternal or paternal allele). It is desirable to model both genetic and parent-of-origin effects simultaneously to avoid confounding and to improve the power to detect either effect. In studies of genetically tractable model organisms, separation of genetic and parent-of-origin effects can be achieved by studying reciprocal cross of two inbred strains. In contrast, this task is much more challenging in outbred populations such as humans. To address this challenge, we propose a new framework to combine experimental strategies and novel statistical methods. Specifically, we propose to study genetic and imprinting effects in family trios with RNA-seq data from the children and genotype data from both parents and children, and quantify genetic effects by cis-eQTLs. Towards this end, we have extended our method that studies the eQTLs of RNA-seq data (Sun, Biometrics 2012, 68(1): 1-11) to model both cis-eQTL and parent-of-origin effects, and evaluated its performance using extensive simulations. Since sample size may be limited in family trios, we have developed a data analysis pipeline that borrows information from external data of unrelated individuals for cis-eQTL mapping. We have also collected RNA-seq data from the children of 30 family trios, applied our method to analyze this dataset, and identified some previously reported imprinted genes as well as some new candidates of imprinted genes.
DOI
10.1111/biom.13026
TreeMix
PUBMED_LINK
DESCRIPTION
Pickrell, J., & Pritchard, J. (2012). Inference of population splits and mixtures from genome-wide allele frequency data. Nature Precedings, 1-1.
URL
TITLE
Inference of population splits and mixtures from genome-wide allele frequency data.
Main citation
Pickrell JK, Pritchard JK. (2012) Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet, 8 (11) e1002967. doi:10.1371/journal.pgen.1002967. PMID 23166502
ABSTRACT
Many aspects of the historical relationships between populations in a species are reflected in genetic data. Inferring these relationships from genetic data, however, remains a challenging task. In this paper, we present a statistical model for inferring the patterns of population splits and mixtures in multiple populations. In our model, the sampled populations in a species are related to their common ancestor through a graph of ancestral populations. Using genome-wide allele frequency data and a Gaussian approximation to genetic drift, we infer the structure of this graph. We applied this method to a set of 55 human populations and a set of 82 dog breeds and wild canids. In both species, we show that a simple bifurcating tree does not fully describe the data; in contrast, we infer many migration events. While some of the migration events that we find have been detected previously, many have not. For example, in the human data, we infer that Cambodians trace approximately 16% of their ancestry to a population ancestral to other extant East Asian populations. In the dog data, we infer that both the boxer and basenji trace a considerable fraction of their ancestry (9% and 25%, respectively) to wolves subsequent to domestication and that East Asian toy breeds (the Shih Tzu and the Pekingese) result from admixture between modern toy breeds and "ancient" Asian breeds. Software implementing the model described here, called TreeMix, is available at http://treemix.googlecode.com.
DOI
10.1371/journal.pgen.1002967
tsdate
PUBMED_LINK
DESCRIPTION
Wohns, A. W. et al. A unified genealogy of modern and ancient genomes. Science 375, eabi8264 (2022).
URL
USE
The tsdate program [Wohns et al., 2022] infers dates for nodes in a genetic genealogy, sometimes loosely known as an ancestral recombination graph or ARG [Wong et al., 2023]. More precisely, it takes a genealogy in tree sequence format as an input and returns a copy of that tree sequence with altered node and mutation times. These times have been estimated on the basis of the number of mutations along the edges connecting genomes in the genealogy (i.e. using the “molecular clock”).
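The "molecular clock" reasoning can be shown with a single-edge calculation. Under a Poisson mutation model, an edge that spans `span_bp` base pairs and lasts t generations is expected to carry mutation_rate × span_bp × t mutations, so a naive point estimate of t divides the observed mutation count by mutation_rate × span_bp. This is a sketch of that arithmetic only, not the tsdate algorithm, which estimates node times jointly across the whole genealogy. The function name and the numbers are illustrative assumptions.

```python
def naive_edge_duration(n_mutations, span_bp, mutation_rate):
    """Naive molecular-clock estimate of an edge's duration in generations.

    mutation_rate is per base pair per generation.
    """
    if span_bp <= 0 or mutation_rate <= 0:
        raise ValueError("span_bp and mutation_rate must be positive")
    return n_mutations / (mutation_rate * span_bp)


# 5 mutations on an edge spanning 10 kb, with an assumed rate of 1e-8.
estimate = naive_edge_duration(n_mutations=5, span_bp=10_000, mutation_rate=1e-8)
print(estimate)  # about 50,000 generations
```

An edge with few mutations gives a very noisy estimate on its own, which is one reason a method would combine information across edges rather than date each one separately.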
TITLE
A unified genealogy of modern and ancient genomes.
Main citation
Wohns AW, Wong Y, Jeffery B, Akbari A, ...&, McVean G. (2022) A unified genealogy of modern and ancient genomes. Science, 375 (6583) eabi8264. doi:10.1126/science.abi8264. PMID 35201891
ABSTRACT
The sequencing of modern and ancient genomes from around the world has revolutionized our understanding of human history and evolution. However, the problem of how best to characterize ancestral relationships from the totality of human genomic variation remains unsolved. Here, we address this challenge with nonparametric methods that enable us to infer a unified genealogy of modern and ancient humans. This compact representation of multiple datasets explores the challenges of missing and erroneous data and uses ancient samples to constrain and date relationships. We demonstrate the power of the method to recover relationships between individuals and populations as well as to identify descendants of ancient samples. Finally, we introduce a simple nonparametric estimator of the geographical location of ancestors that recapitulates key events in human history.
DOI
10.1126/science.abi8264
tsinfer
PUBMED_LINK
DESCRIPTION
Kelleher, J. et al. Inferring whole-genome histories in large population datasets. Nat. Genet. 51, 1330–1338 (2019).
URL
USE
Infer a tree sequence from genetic variation data
TITLE
Inferring whole-genome histories in large population datasets.
Main citation
Kelleher J, Wong Y, Wohns AW, Fadil C, ...&, McVean G. (2019) Inferring whole-genome histories in large population datasets. Nat Genet, 51 (9) 1330-1338. doi:10.1038/s41588-019-0483-y. PMID 31477934
ABSTRACT
Inferring the full genealogical history of a set of DNA sequences is a core problem in evolutionary biology, because this history encodes information about the events and forces that have influenced a species. However, current methods are limited, and the most accurate techniques are able to process no more than a hundred samples. As datasets that consist of millions of genomes are now being collected, there is a need for scalable and efficient inference methods to fully utilize these resources. Here we introduce an algorithm that is able to not only infer whole-genome histories with comparable accuracy to the state-of-the-art but also process four orders of magnitude more sequences. The approach also provides an 'evolutionary encoding' of the data, enabling efficient calculation of relevant statistics. We apply the method to human data from the 1000 Genomes Project, Simons Genome Diversity Project and UK Biobank, showing that the inferred genealogies are rich in biological signal and efficient to process.
DOI
10.1038/s41588-019-0483-y
Tutorial-Choi (PRS Tutorial)
PUBMED_LINK
FULL NAME
PRS Tutorial
DESCRIPTION
This tutorial provides a step-by-step guide to performing basic polygenic risk score (PRS) analyses and accompanies our PRS Guide paper. The aim of this tutorial is to provide a simple introduction of PRS analyses to those new to PRS, while equipping existing users with a better understanding of the processes and implementation "underneath the hood" of popular PRS software.
URL
TITLE
Tutorial: a guide to performing polygenic risk score analyses.
Main citation
Choi SW, Mak TS, O'Reilly PF. (2020) Tutorial: a guide to performing polygenic risk score analyses. Nat Protoc, 15 (9) 2759-2772. doi:10.1038/s41596-020-0353-1. PMID 32709988
ABSTRACT
A polygenic score (PGS) or polygenic risk score (PRS) is an estimate of an individual's genetic liability to a trait or disease, calculated according to their genotype profile and relevant genome-wide association study (GWAS) data. While present PRSs typically explain only a small fraction of trait variance, their correlation with the single largest contributor to phenotypic variation-genetic liability-has led to the routine application of PRSs across biomedical research. Among a range of applications, PRSs are exploited to assess shared etiology between phenotypes, to evaluate the clinical utility of genetic data for complex disease and as part of experimental studies in which, for example, experiments are performed that compare outcomes (e.g., gene expression and cellular response to treatment) between individuals with low and high PRS values. As GWAS sample sizes increase and PRSs become more powerful, PRSs are set to play a key role in research and stratified medicine. However, despite the importance and growing application of PRSs, there are limited guidelines for performing PRS analyses, which can lead to inconsistency between studies and misinterpretation of results. Here, we provide detailed guidelines for performing and interpreting PRS analyses. We outline standard quality control steps, discuss different methods for the calculation of PRSs, provide an introductory online tutorial, highlight common misconceptions relating to PRS results, offer recommendations for best practice and discuss future challenges.
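The basic calculation behind a PRS is a weighted sum: each individual's effect-allele counts are multiplied by the corresponding GWAS effect sizes and added up. The sketch below shows only this scoring step, with made-up genotypes and effect sizes. It omits the quality control, allele matching, and variant selection or shrinkage that the guide discusses, and the function name is hypothetical.

```python
def polygenic_scores(genotypes, effect_sizes):
    """Weighted sum of effect-allele counts (0, 1 or 2) for each individual."""
    scores = []
    for row in genotypes:
        if len(row) != len(effect_sizes):
            raise ValueError("each genotype row needs one entry per variant")
        scores.append(sum(count * beta for count, beta in zip(row, effect_sizes)))
    return scores


# Two individuals, three variants; the values are illustrative only.
genotypes = [[0, 1, 2], [2, 0, 1]]
effect_sizes = [0.1, -0.2, 0.3]  # per-allele effects from a hypothetical GWAS
scores = polygenic_scores(genotypes, effect_sizes)
print(scores)
```

The allele counted in each genotype must be the same allele the GWAS effect size refers to. A mismatch flips the sign of that variant's contribution.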
DOI
10.1038/s41596-020-0353-1
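As a minimal sketch of the calculation the abstract describes, a PRS can be viewed as the sum of an individual's effect-allele counts weighted by GWAS effect sizes. The variant IDs, effect sizes, and genotypes below are made up for illustration, and the function name `polygenic_score` is hypothetical. A real analysis would add the quality control and variant selection steps that the tutorial covers.

```python
from typing import Dict


def polygenic_score(dosages: Dict[str, float], weights: Dict[str, float]) -> float:
    """Sum effect-allele dosages (0 to 2) weighted by GWAS effect sizes.

    Variants missing from either input are skipped; that is an
    assumption of this sketch, not a recommendation.
    """
    shared = dosages.keys() & weights.keys()
    return sum(dosages[v] * weights[v] for v in shared)


# Hypothetical per-variant effect sizes (for example, log odds ratios).
gwas_weights = {"rs1": 0.10, "rs2": -0.05, "rs3": 0.20}

# Hypothetical effect-allele counts for two individuals.
genotypes = {
    "ind_A": {"rs1": 2, "rs2": 0, "rs3": 1},
    "ind_B": {"rs1": 0, "rs2": 2, "rs3": 0},
}

scores = {ind: polygenic_score(g, gwas_weights) for ind, g in genotypes.items()}
```

Here `scores` maps each individual to a single number; only the relative ranking of individuals is usually interpreted, since the scale depends on the variants and weights used.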
TWAS hub
PUBMED_LINK
DESCRIPTION
TWAS hub is an interactive browser of results from integrative analyses of GWAS and functional data for hundreds of traits and >100k expression models. The aim is to facilitate the investigation of individual TWAS associations; pleiotropic disease/trait associations for a given gene of interest; predicted gene associations for a given disease/trait of interest with detailed per-locus statistics; and pleiotropic relationships between traits based on shared associated genes.
URL
TITLE
Integrating Gene Expression with Summary Association Statistics to Identify Genes Associated with 30 Complex Traits.
Main citation
Mancuso N, Shi H, Goddard P, Kichaev G, ...&, Pasaniuc B. (2017) Integrating Gene Expression with Summary Association Statistics to Identify Genes Associated with 30 Complex Traits. Am J Hum Genet, 100 (3) 473-487. doi:10.1016/j.ajhg.2017.01.031. PMID 28238358
ABSTRACT
Although genome-wide association studies (GWASs) have identified thousands of risk loci for many complex traits and diseases, the causal variants and genes at these loci remain largely unknown. Here, we introduce a method for estimating the local genetic correlation between gene expression and a complex trait and utilize it to estimate the genetic correlation due to predicted expression between pairs of traits. We integrated gene expression measurements from 45 expression panels with summary GWAS data to perform 30 multi-tissue transcriptome-wide association studies (TWASs). We identified 1,196 genes whose expression is associated with these traits; of these, 168 reside more than 0.5 Mb away from any previously reported GWAS significant variant. We then used our approach to find 43 pairs of traits with significant genetic correlation at the level of predicted expression; of these, eight were not found through genetic correlation at the SNP level. Finally, we used bi-directional regression to find evidence that BMI causally influences triglyceride levels and that triglyceride levels causally influence low-density lipoprotein. Together, our results provide insight into the role of gene expression in the susceptibility of complex traits and diseases.
DOI
10.1016/j.ajhg.2017.01.031
twas_sim
PUBMED_LINK
DESCRIPTION
A Python tool that leverages real genotype data to simulate a complex trait as a function of latent expression, fit eQTL weights in independent data, and perform GWAS/TWAS on the complex trait.
URL
TITLE
twas_sim, a Python-based tool for simulation and power analysis of transcriptome-wide association analysis.
Main citation
Wang X, Lu Z, Bhattacharya A, Pasaniuc B, ...&, Mancuso N. (2023) twas_sim, a Python-based tool for simulation and power analysis of transcriptome-wide association analysis. Bioinformatics, 39 (5) . doi:10.1093/bioinformatics/btad288. PMID 37099718
ABSTRACT
SUMMARY: Genome-wide association studies (GWASs) have identified numerous genetic variants associated with complex disease risk; however, most of these associations are non-coding, complicating identifying their proximal target gene. Transcriptome-wide association studies (TWASs) have been proposed to mitigate this gap by integrating expression quantitative trait loci (eQTL) data with GWAS data. Numerous methodological advancements have been made for TWAS, yet each approach requires ad hoc simulations to demonstrate feasibility. Here, we present twas_sim, a computationally scalable and easily extendable tool for simplified performance evaluation and power analysis for TWAS methods. AVAILABILITY AND IMPLEMENTATION: Software and documentation are available at https://github.com/mancusolab/twas_sim.
DOI
10.1093/bioinformatics/btad288
Two-sample MR
PUBMED_LINK
DESCRIPTION
Two-sample Mendelian randomisation (2SMR) is a method to estimate the causal effect of an exposure on an outcome using only summary statistics from genome-wide association studies (GWAS). Though conceptually straightforward, the analysis requires a number of steps to be performed properly, and these can be cumbersome.
URL
TITLE
The MR-Base platform supports systematic causal inference across the human phenome.
Main citation
Hemani G, Zheng J, Elsworth B, Wade KH, ...&, Haycock PC. (2018) The MR-Base platform supports systematic causal inference across the human phenome. Elife, 7 () . doi:10.7554/eLife.34408. PMID 29846171
ABSTRACT
Results from genome-wide association studies (GWAS) can be used to infer causal relationships between phenotypes, using a strategy known as 2-sample Mendelian randomization (2SMR) and bypassing the need for individual-level data. However, 2SMR methods are evolving rapidly and GWAS results are often insufficiently curated, undermining efficient implementation of the approach. We therefore developed MR-Base (http://www.mrbase.org): a platform that integrates a curated database of complete GWAS results (no restrictions according to statistical significance) with an application programming interface, web app and R packages that automate 2SMR. The software includes several sensitivity analyses for assessing the impact of horizontal pleiotropy and other violations of assumptions. The database currently comprises 11 billion single nucleotide polymorphism-trait associations from 1673 GWAS and is updated on a regular basis. Integrating data with software ensures more rigorous application of hypothesis-driven analyses and allows millions of potential causal relationships to be efficiently evaluated in phenome-wide association studies.
DOI
10.7554/eLife.34408
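To illustrate how a causal estimate can be obtained from summary statistics alone, the sketch below computes a textbook fixed-effect inverse-variance weighted (IVW) estimate from per-SNP exposure and outcome effects. It is not the MR-Base or R package implementation; the function name `ivw_estimate` and the SNP values are hypothetical, and steps such as allele harmonisation, instrument selection, and sensitivity analyses are omitted.

```python
import math
from typing import List, Tuple


def ivw_estimate(instruments: List[Tuple[float, float, float]]) -> Tuple[float, float]:
    """Fixed-effect IVW estimate and standard error.

    Each instrument is (beta_exposure, beta_outcome, se_outcome).
    Uncertainty in beta_exposure is ignored (first-order weights).
    """
    num = sum(bx * by / se ** 2 for bx, by, se in instruments)
    den = sum(bx ** 2 / se ** 2 for bx, by, se in instruments)
    return num / den, math.sqrt(1.0 / den)


# Made-up summary statistics for three independent SNPs.
snps = [
    (0.10, 0.050, 0.010),
    (0.20, 0.100, 0.020),
    (0.15, 0.075, 0.015),
]

beta_ivw, se_ivw = ivw_estimate(snps)
```

Every SNP in this toy input has the same Wald ratio (outcome effect divided by exposure effect) of 0.5, so the pooled estimate is also 0.5; with real data the ratios differ and the weights favour the more precisely estimated ones.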
UMAP
FULL NAME
Uniform Manifold Approximation and Projection
URL
KEYWORDS
UMAP
Main citation
McInnes, L., Healy, J., & Melville, J. (2018). Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
UpSetPlot
PUBMED_LINK
DESCRIPTION
This is another Python implementation of UpSet plots by Lex et al. [Lex2014]. UpSet plots are used to visualise set overlaps; like Venn diagrams but more readable.
URL
TITLE
UpSet: Visualization of Intersecting Sets.
Main citation
Lex A, Gehlenborg N, Strobelt H, Vuillemot R, ...&, Pfister H. (2014) UpSet: Visualization of Intersecting Sets. IEEE Trans Vis Comput Graph, 20 (12) 1983-92. doi:10.1109/TVCG.2014.2346248. PMID 26356912
ABSTRACT
Understanding relationships between sets is an important analysis task that has received widespread attention in the visualization community. The major challenge in this context is the combinatorial explosion of the number of set intersections if the number of sets exceeds a trivial threshold. In this paper we introduce UpSet, a novel visualization technique for the quantitative analysis of sets, their intersections, and aggregates of intersections. UpSet is focused on creating task-driven aggregates, communicating the size and properties of aggregates and intersections, and a duality between the visualization of the elements in a dataset and their set membership. UpSet visualizes set intersections in a matrix layout and introduces aggregates based on groupings and queries. The matrix layout enables the effective representation of associated data, such as the number of elements in the aggregates and intersections, as well as additional summary statistics derived from subset or element attributes. Sorting according to various measures enables a task-driven analysis of relevant intersections and aggregates. The elements represented in the sets and their associated attributes are visualized in a separate view. Queries based on containment in specific intersections, aggregates or driven by attribute filters are propagated between both views. We also introduce several advanced visual encodings and interaction methods to overcome the problems of varying scales and to address scalability. UpSet is web-based and open source. We demonstrate its general utility in multiple use cases from various domains.
DOI
10.1109/TVCG.2014.2346248
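The quantity an UpSet-style matrix layout summarises can be sketched without any plotting library: for each element, record the exact combination of sets it belongs to, then count elements per combination. The function name `exclusive_intersections` and the gene sets below are hypothetical, and this sketch does not use the UpSetPlot API.

```python
from collections import Counter
from typing import Dict, Set


def exclusive_intersections(sets: Dict[str, Set[str]]) -> Counter:
    """Count elements by the exact combination of sets containing them."""
    counts: Counter = Counter()
    for element in set().union(*sets.values()):
        membership = frozenset(
            name for name, members in sets.items() if element in members
        )
        counts[membership] += 1
    return counts


# Made-up gene lists from three kinds of analysis.
gene_sets = {
    "GWAS": {"g1", "g2", "g3", "g4", "g7"},
    "TWAS": {"g2", "g3", "g5"},
    "eQTL": {"g3", "g4", "g5", "g6"},
}

counts = exclusive_intersections(gene_sets)
```

Each key of `counts` corresponds to one column of the matrix (a combination of filled dots) and each value to the height of the bar above it.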
VEGAS2
PUBMED_LINK
FULL NAME
Versatile Gene-based Association Study - 2 version 2
DESCRIPTION
This is the VEGAS2 web platform. Here, users can perform gene-based and pathway-based analyses on GWAS summary data using the VEGAS and VEGAS2Pathway approaches, respectively. It is publicly available for non-commercial use.
URL
TITLE
VEGAS2: Software for More Flexible Gene-Based Testing.
Main citation
Mishra A, Macgregor S. (2015) VEGAS2: Software for More Flexible Gene-Based Testing. Twin Res Hum Genet, 18 (1) 86-91. doi:10.1017/thg.2014.79. PMID 25518859
ABSTRACT
Gene-based tests such as versatile gene-based association study (VEGAS) are commonly used following per-single nucleotide polymorphism (SNP) GWAS (genome-wide association studies) analysis. Two limitations of VEGAS were that the HapMap2 reference set was used to model the correlation between SNPs and only autosomal genes were considered. HapMap2 has now been superseded by the 1,000 Genomes reference set, and whereas early GWASs frequently ignored the X chromosome, it is now commonly included. Here we have developed VEGAS2, an extension that uses 1,000 Genomes data to model SNP correlations across the autosomes and chromosome X. VEGAS2 allows greater flexibility when defining gene boundaries. VEGAS2 offers both a user-friendly, web-based front end and a command line Linux version. The online version of VEGAS2 can be accessed through https://vegas2.qimrberghofer.edu.au/. The command line version can be downloaded from https://vegas2.qimrberghofer.edu.au/zVEGAS2offline.tgz. The command line version is developed in Perl, R and shell scripting languages; source code is available for further development.
DOI
10.1017/thg.2014.79
VEP
PUBMED_LINK
FULL NAME
Ensembl Variant Effect Predictor
DESCRIPTION
The Ensembl Variant Effect Predictor is a powerful toolset for the analysis, annotation, and prioritization of genomic variants in coding and non-coding regions. It provides access to an extensive collection of genomic annotation, with a variety of interfaces to suit different requirements, and simple options for configuring and extending analysis. It is open source, free to use, and supports full reproducibility of results. The Ensembl Variant Effect Predictor can simplify and accelerate variant interpretation in a wide range of study designs.
URL
TITLE
The Ensembl Variant Effect Predictor.
Main citation
McLaren W, Gil L, Hunt SE, Riat HS, ...&, Cunningham F. (2016) The Ensembl Variant Effect Predictor. Genome Biol, 17 (1) 122. doi:10.1186/s13059-016-0974-4. PMID 27268795
ABSTRACT
The Ensembl Variant Effect Predictor is a powerful toolset for the analysis, annotation, and prioritization of genomic variants in coding and non-coding regions. It provides access to an extensive collection of genomic annotation, with a variety of interfaces to suit different requirements, and simple options for configuring and extending analysis. It is open source, free to use, and supports full reproducibility of results. The Ensembl Variant Effect Predictor can simplify and accelerate variant interpretation in a wide range of study designs.
DOI
10.1186/s13059-016-0974-4
VIPRS
PUBMED_LINK
FULL NAME
Variational inference of polygenic risk scores
DESCRIPTION
viprs is a Python package that implements scripts and utilities for running variational inference algorithms on genome-wide association study (GWAS) data for the purpose of polygenic risk estimation.
URL
KEYWORDS
Variational Inference (VI)
TITLE
Fast and accurate Bayesian polygenic risk modeling with variational inference.
Main citation
Zabad S, Gravel S, Li Y. (2023) Fast and accurate Bayesian polygenic risk modeling with variational inference. Am J Hum Genet, 110 (5) 741-761. doi:10.1016/j.ajhg.2023.03.009. PMID 37030289
ABSTRACT
The advent of large-scale genome-wide association studies (GWASs) has motivated the development of statistical methods for phenotype prediction with single-nucleotide polymorphism (SNP) array data. These polygenic risk score (PRS) methods use a multiple linear regression framework to infer joint effect sizes of all genetic variants on the trait. Among the subset of PRS methods that operate on GWAS summary statistics, sparse Bayesian methods have shown competitive predictive ability. However, most existing Bayesian approaches employ Markov chain Monte Carlo (MCMC) algorithms, which are computationally inefficient and do not scale favorably to higher dimensions, for posterior inference. Here, we introduce variational inference of polygenic risk scores (VIPRS), a Bayesian summary statistics-based PRS method that utilizes variational inference techniques to approximate the posterior distribution for the effect sizes. Our experiments with 36 simulation configurations and 12 real phenotypes from the UK Biobank dataset demonstrated that VIPRS is consistently competitive with the state-of-the-art in prediction accuracy while being more than twice as fast as popular MCMC-based approaches. This performance advantage is robust across a variety of genetic architectures, SNP heritabilities, and independent GWAS cohorts. In addition to its competitive accuracy on the "White British" samples, VIPRS showed improved transferability when applied to other ethnic groups, with up to 1.7-fold increase in R2 among individuals of Nigerian ancestry for low-density lipoprotein (LDL) cholesterol. To illustrate its scalability, we applied VIPRS to a dataset of 9.6 million genetic markers, which conferred further improvements in prediction accuracy for highly polygenic traits, such as height.
DOI
10.1016/j.ajhg.2023.03.009
WBBC panel (WBBC)
PUBMED_LINK
URL
TITLE
Genomic analyses of 10,376 individuals in the Westlake BioBank for Chinese (WBBC) pilot project.
Main citation
Cong PK, Bai WY, Li JC, Yang MY, ...&, Zheng HF. (2022) Genomic analyses of 10,376 individuals in the Westlake BioBank for Chinese (WBBC) pilot project. Nat Commun, 13 (1) 2939. doi:10.1038/s41467-022-30526-x. PMID 35618720
ABSTRACT
We initiate the Westlake BioBank for Chinese (WBBC) pilot project with 4,535 whole-genome sequencing (WGS) individuals and 5,841 high-density genotyping individuals, and identify 81.5 million SNPs and INDELs, of which 38.5% are absent in dbSNP Build 151. We provide a population-specific reference panel and an online imputation server ( https://wbbc.westlake.edu.cn/ ) which could yield substantial improvement of imputation performance in Chinese population, especially for low-frequency and rare variants. By analyzing the singleton density of the WGS data, we find selection signatures in SNX29, DNAH1 and WDR1 genes, and the derived alleles of the alcohol metabolism genes (ADH1A and ADH1B) emerge around 7,000 years ago and tend to be more common from 4,000 years ago in East Asia. Genetic evidence supports the corresponding geographical boundaries of the Qinling-Huaihe Line and Nanling Mountains, which separate the Han Chinese into subgroups, and we reveal that North Han was more homogeneous than South Han.
DOI
10.1038/s41467-022-30526-x
webTWAS
PUBMED_LINK
DESCRIPTION
A resource for disease candidate susceptibility genes identified by transcriptome-wide association study.
URL
TITLE
webTWAS: a resource for disease candidate susceptibility genes identified by transcriptome-wide association study.
Main citation
Cao C, Wang J, Kwok D, Cui F, ...&, Zou Q. (2022) webTWAS: a resource for disease candidate susceptibility genes identified by transcriptome-wide association study. Nucleic Acids Res, 50 (D1) D1123-D1130. doi:10.1093/nar/gkab957. PMID 34669946
ABSTRACT
The development of transcriptome-wide association studies (TWAS) has enabled researchers to better identify and interpret causal genes in many diseases. However, there are currently no resources providing a comprehensive listing of gene-disease associations discovered by TWAS from published GWAS summary statistics. TWAS analyses are also difficult to conduct due to the complexity of TWAS software pipelines. To address these issues, we introduce a new resource called webTWAS, which integrates a database of the most comprehensive disease GWAS datasets currently available with credible sets of potential causal genes identified by multiple TWAS software packages. Specifically, a total of 235 064 gene-diseases associations for a wide range of human diseases are prioritized from 1298 high-quality downloadable European GWAS summary statistics. Associations are calculated with seven different statistical models based on three popular and representative TWAS software packages. Users can explore associations at the gene or disease level, and easily search for related studies or diseases using the MeSH disease tree. Since the effects of diseases are highly tissue-specific, webTWAS applies tissue-specific enrichment analysis to identify significant tissues. A user-friendly web server is also available to run custom TWAS analyses on user-provided GWAS summary statistics data. webTWAS is freely available at http://www.webtwas.net.
DOI
10.1093/nar/gkab957
Westlake Imputation Server
PUBMED_LINK
URL
TITLE
Genomic analyses of 10,376 individuals in the Westlake BioBank for Chinese (WBBC) pilot project.
Main citation
Cong PK, Bai WY, Li JC, Yang MY, ...&, Zheng HF. (2022) Genomic analyses of 10,376 individuals in the Westlake BioBank for Chinese (WBBC) pilot project. Nat Commun, 13 (1) 2939. doi:10.1038/s41467-022-30526-x. PMID 35618720
ABSTRACT
We initiate the Westlake BioBank for Chinese (WBBC) pilot project with 4,535 whole-genome sequencing (WGS) individuals and 5,841 high-density genotyping individuals, and identify 81.5 million SNPs and INDELs, of which 38.5% are absent in dbSNP Build 151. We provide a population-specific reference panel and an online imputation server ( https://wbbc.westlake.edu.cn/ ) which could yield substantial improvement of imputation performance in Chinese population, especially for low-frequency and rare variants. By analyzing the singleton density of the WGS data, we find selection signatures in SNX29, DNAH1 and WDR1 genes, and the derived alleles of the alcohol metabolism genes (ADH1A and ADH1B) emerge around 7,000 years ago and tend to be more common from 4,000 years ago in East Asia. Genetic evidence supports the corresponding geographical boundaries of the Qinling-Huaihe Line and Nanling Mountains, which separate the Han Chinese into subgroups, and we reveal that North Han was more homogeneous than South Han.
DOI
10.1038/s41467-022-30526-x
winnerscurse
DESCRIPTION
This package, winnerscurse, has been designed to provide easy access to published methods which aim to correct for Winner’s Curse, using GWAS summary statistics.
URL
wMT-SBLUP
PUBMED_LINK
FULL NAME
weighted approximate multi-trait summary statistic BLUP
URL
TITLE
Improving genetic prediction by leveraging genetic correlations among human diseases and traits.
Main citation
Maier RM, Zhu Z, Lee SH, Trzaskowski M, ...&, Robinson MR. (2018) Improving genetic prediction by leveraging genetic correlations among human diseases and traits. Nat Commun, 9 (1) 989. doi:10.1038/s41467-017-02769-6. PMID 29515099
ABSTRACT
Genomic prediction has the potential to contribute to precision medicine. However, to date, the utility of such predictors is limited due to low accuracy for most traits. Here theory and simulation study are used to demonstrate that widespread pleiotropy among phenotypes can be utilised to improve genomic risk prediction. We show how a genetic predictor can be created as a weighted index that combines published genome-wide association study (GWAS) summary statistics across many different traits. We apply this framework to predict risk of schizophrenia and bipolar disorder in the Psychiatric Genomics consortium data, finding substantial heterogeneity in prediction accuracy increases across cohorts. For six additional phenotypes in the UK Biobank data, we find increases in prediction accuracy ranging from 0.7% for height to 47% for type 2 diabetes, when using a multi-trait predictor that combines published summary statistics from multiple traits, as compared to a predictor based only on one trait.
DOI
10.1038/s41467-017-02769-6
WtCoxG
DESCRIPTION
A computationally efficient Cox-based method for genome-wide association studies (GWAS) of time-to-event phenotypes that addresses case ascertainment bias using a weighted Cox proportional hazards model, incorporates saddlepoint approximation for accuracy with rare variants, and boosts statistical power by integrating external minor allele frequency (MAF) data from public resources.
URL
KEYWORDS
Genome-wide association study, time-to-event phenotypes, Cox regression, case ascertainment, weighted Cox model, rare variants, minor allele frequencies, UK Biobank
Main citation
Li, Y., Ma, Y., Xu, H. et al. Applying weighted Cox regression to genome-wide association studies of time-to-event phenotypes. Nat Comput Sci (2025). https://doi.org/10.1038/s43588-025-00864-z
ARROW_SUMMARY
Genomic data + time-to-event phenotypes → Weighted Cox model with saddlepoint approximation → association signals; Leverages external MAF data → enhanced statistical power.
AI_GENERATED
1.0
XP-EHH
PUBMED_LINK
FULL NAME
Cross-population extended haplotype homozygosity
DESCRIPTION
Klassmann, A., & Gautier, M. (2022). Detecting selection using extended haplotype homozygosity (EHH)-based statistics in unphased or unpolarized data. PloS one, 17(1), e0262024.
TITLE
Detecting selection using extended haplotype homozygosity (EHH)-based statistics in unphased or unpolarized data.
Main citation
Klassmann A, Gautier M. (2022) Detecting selection using extended haplotype homozygosity (EHH)-based statistics in unphased or unpolarized data. PLoS One, 17 (1) e0262024. doi:10.1371/journal.pone.0262024. PMID 35041674
ABSTRACT
Analysis of population genetic data often includes a search for genomic regions with signs of recent positive selection. One of such approaches involves the concept of extended haplotype homozygosity (EHH) and its associated statistics. These statistics typically require phased haplotypes, and some of them necessitate polarized variants. Here, we unify and extend previously proposed modifications to loosen these requirements. We compare the modified versions with the original ones by measuring the false discovery rate in simulated whole-genome scans and by quantifying the overlap of inferred candidate regions in empirical data. We find that phasing information is indispensable for accurate estimation of within-population statistics (for all but very large samples) and of cross-population statistics for small samples. Ancestry information, in contrast, is of lesser importance for both types of statistic. Our publicly available R package rehh incorporates the modified statistics presented here.
DOI
10.1371/journal.pone.0262024
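For readers unfamiliar with the underlying statistic, the sketch below computes plain EHH from phased haplotypes using the standard definition: among chromosomes carrying a given core allele, the fraction of chromosome pairs that are identical from the core out to a chosen marker. It does not compute XP-EHH itself, which compares integrated EHH between two populations, and it is not the rehh implementation. The function name `ehh` and the haplotype strings are hypothetical.

```python
from collections import Counter
from math import comb
from typing import List


def ehh(haplotypes: List[str], core: int, end: int, core_allele: str) -> float:
    """EHH between marker indices core and end (inclusive).

    haplotypes are phased strings of '0'/'1' alleles, one per chromosome.
    Returns 0.0 when fewer than two chromosomes carry the core allele.
    """
    lo, hi = sorted((core, end))
    carriers = [h[lo:hi + 1] for h in haplotypes if h[core] == core_allele]
    n = len(carriers)
    if n < 2:
        return 0.0
    groups = Counter(carriers)  # identical segments fall in the same group
    return sum(comb(k, 2) for k in groups.values()) / comb(n, 2)


# Made-up phased haplotypes over four markers; the core is marker 0.
haps = ["1101", "1101", "1100", "1011", "0110"]

ehh_at_core = ehh(haps, core=0, end=0, core_allele="1")
ehh_one_marker = ehh(haps, core=0, end=1, core_allele="1")
ehh_three_markers = ehh(haps, core=0, end=3, core_allele="1")
```

EHH starts at 1 at the core and decays as the window widens, because recombination and mutation break up identical haplotypes; the input must be phased for the segment comparison to be meaningful.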
Yang
PUBMED_LINK
TITLE
Methods for Analyzing Multivariate Phenotypes in Genetic Association Studies.
Main citation
Yang Q, Wang Y. (2012) Methods for Analyzing Multivariate Phenotypes in Genetic Association Studies. J Probab Stat, 2012 () 652569. doi:10.1155/2012/652569. PMID 24748889
ABSTRACT
Multivariate phenotypes are frequently encountered in genetic association studies. The purpose of analyzing multivariate phenotypes usually includes discovery of novel genetic variants of pleiotropy effects, that is, affecting multiple phenotypes, and the ultimate goal of uncovering the underlying genetic mechanism. In recent years, there have been new method development and application of existing statistical methods to such phenotypes. In this paper, we provide a review of the available methods for analyzing association between a single marker and a multivariate phenotype consisting of the same type of components (e.g., all continuous or all categorical) or different types of components (e.g., some are continuous and others are categorical). We also reviewed causal inference methods designed to test whether the detected association with the multivariate phenotype is truly pleiotropy or the genetic marker exerts its effects on some phenotypes through affecting the others.
DOI
10.1155/2012/652569