Data_processing

Summary Table

NAME | CITATION | YEAR
GCTA | Yang J, Lee SH, Goddard ME, Visscher PM. (2011) GCTA: a tool for genome-wide complex trait analysis Am. J. Hum. Genet., 88 (1) 76-82. doi:10.1016/j.ajhg.2010.11.011. PMID 21167468 | 2011
GWASLab | GWASLab preprint: He, Y., Koido, M., Shimmori, Y., Kamatani, Y. (2023). GWASLab: a Python package for processing and visualizing GWAS summary statistics. Preprint at Jxiv, 2023-5. https://doi.org/10.51094/jxiv.370 | NA
Hail | Hail Team. Hail 0.2.13-81ab564db2b4. https://github.com/hail-is/hail/releases/tag/0.2.13. | NA
LDSC | Bulik-Sullivan B, Loh PR, Finucane HK, Ripke S, ...&, O'Donovan MC. (2015) LD Score regression distinguishes confounding from polygenicity in genome-wide association studies Nat. Genet., 47 (3) 291-295. doi:10.1038/ng.3211. PMID 25642630 | 2015
LDSTORE2 | Benner C, Havulinna AS, Järvelin MR, Salomaa V, ...&, Pirinen M. (2017) Prospects of fine-mapping trait-associated genomic regions by using summary statistics from genome-wide association studies Am. J. Hum. Genet., 101 (4) 539-551. doi:10.1016/j.ajhg.2017.08.012. PMID 28942963 | 2017
MungeSumstats | Murphy AE, Schilder BM, Skene NG. (2021) MungeSumstats: a Bioconductor package for the standardization and quality control of many GWAS summary statistics Bioinformatics, 37 (23) 4593-4596. doi:10.1093/bioinformatics/btab665. PMID 34601555 | 2021
PLINK1.9 | Purcell S, Neale B, Todd-Brown K, Thomas L, ...&, Sham PC. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses Am. J. Hum. Genet., 81 (3) 559-575. doi:10.1086/519795. PMID 17701901 | 2007
PLINK2 | Chang CC, Chow CC, Tellier LC, Vattikuti S, ...&, Lee JJ. (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets Gigascience, 4 (1) 7. doi:10.1186/s13742-015-0047-8. PMID 25722852 | 2015
QCTOOL v2 | Wigginton JE, Cutler DJ, Abecasis GR. (2005) A note on exact tests of Hardy-Weinberg equilibrium Am. J. Hum. Genet., 76 (5) 887-893. doi:10.1086/429864. PMID 15789306 | 2005

GCTA

  • NAME : GCTA
  • SHORT NAME : GCTA
  • FULL NAME : Genome-wide complex trait analysis (GCTA)
  • DESCRIPTION : GCTA estimates the variance in a complex trait explained by all SNPs (GCTA-GREML analysis) and can also simulate a GWAS based on real genotype data.
  • URL : https://yanglab.westlake.edu.cn/software/gcta/
  • TITLE : GCTA: a tool for genome-wide complex trait analysis
  • DOI : 10.1016/j.ajhg.2010.11.011
  • ABSTRACT : For most human complex diseases and traits, SNPs identified by genome-wide association studies (GWAS) explain only a small fraction of the heritability. Here we report a user-friendly software tool called genome-wide complex trait analysis (GCTA), which was developed based on a method we recently developed to address the "missing heritability" problem. GCTA estimates the variance explained by all the SNPs on a chromosome or on the whole genome for a complex trait rather than testing the association of any particular SNP to the trait. We introduce GCTA's five main functions: data management, estimation of the genetic relationships from SNPs, mixed linear model analysis of variance explained by the SNPs, estimation of the linkage disequilibrium structure, and GWAS simulation. We focus on the function of estimating the variance explained by all the SNPs on the X chromosome and testing the hypotheses of dosage compensation. The GCTA software is a versatile tool to estimate and partition complex trait variation with large GWAS data sets.
  • CITATION : Yang J, Lee SH, Goddard ME, Visscher PM. (2011) GCTA: a tool for genome-wide complex trait analysis Am. J. Hum. Genet., 88 (1) 76-82. doi:10.1016/j.ajhg.2010.11.011. PMID 21167468
  • JOURNAL_INFO : American journal of human genetics ; Am. J. Hum. Genet. ; 2011 ; 88 ; 1 ; 76-82
  • PUBMED_LINK : 21167468
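
The abstract lists "estimation of the genetic relationships from SNPs" as one of GCTA's main functions. A minimal sketch of that idea follows, assuming a commonly used SNP-based estimator in which the relationship between individuals j and k is the average over SNPs of (x_ij - 2p_i)(x_ik - 2p_i) / (2p_i(1 - p_i)), where x is the allele count and p_i the allele frequency. The function name `genetic_relationship_matrix` and the input layout are hypothetical, and the sketch applies the same formula to diagonal and off-diagonal entries and ignores missing genotypes, so it is not a reproduction of GCTA's own implementation.

```python
def genetic_relationship_matrix(genotypes):
    """Estimate a SNP-based relationship matrix (illustrative sketch).

    genotypes[j][i] is the allele count (0, 1 or 2) of SNP i in individual j.
    Assumption: no missing genotypes; monomorphic SNPs are skipped.
    """
    n_ind = len(genotypes)
    n_snp = len(genotypes[0])

    # Allele frequency of each SNP, estimated from the sample itself.
    freqs = [sum(row[i] for row in genotypes) / (2 * n_ind) for i in range(n_snp)]

    # Only polymorphic SNPs have a non-zero variance 2p(1-p).
    used = [i for i, p in enumerate(freqs) if 0.0 < p < 1.0]
    if not used:
        raise ValueError("no polymorphic SNPs to compute relationships from")

    grm = [[0.0] * n_ind for _ in range(n_ind)]
    for j in range(n_ind):
        for k in range(j, n_ind):
            total = 0.0
            for i in used:
                p = freqs[i]
                total += (
                    (genotypes[j][i] - 2 * p)
                    * (genotypes[k][i] - 2 * p)
                    / (2 * p * (1 - p))
                )
            # The matrix is symmetric, so fill both triangles at once.
            grm[j][k] = grm[k][j] = total / len(used)
    return grm


if __name__ == "__main__":
    toy = [[0, 1, 2], [2, 1, 0], [1, 1, 1]]  # 3 individuals x 3 SNPs
    for row in genetic_relationship_matrix(toy):
        print(row)
```

A matrix like this is the input to the mixed linear model step that the abstract describes; for real data the GCTA binary should be used, since it handles scale, missingness and file formats that this sketch does not.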

GWASLab

  • NAME : GWASLab
  • SHORT NAME : GWASLab
  • FULL NAME : GWASLab
  • DESCRIPTION : A Python package for handling GWAS summary statistics (sumstats).
  • URL : https://github.com/Cloufield/gwaslab
  • PREPRINT_DOI : 10.51094/jxiv.370
  • CITATION : GWASLab preprint: He, Y., Koido, M., Shimmori, Y., Kamatani, Y. (2023). GWASLab: a Python package for processing and visualizing GWAS summary statistics. Preprint at Jxiv, 2023-5. https://doi.org/10.51094/jxiv.370

Hail

  • NAME : Hail
  • SHORT NAME : Hail
  • FULL NAME : Hail
  • DESCRIPTION : Hail is an open-source, general-purpose, Python-based data analysis tool with additional data types and methods for working with genomic data.
  • URL : https://hail.is/
  • CITATION : Hail Team. Hail 0.2.13-81ab564db2b4. https://github.com/hail-is/hail/releases/tag/0.2.13.

LDSC

  • NAME : LDSC
  • SHORT NAME : LDSC
  • FULL NAME : LD Score Regression
  • DESCRIPTION : ldsc is a command line tool for estimating heritability and genetic correlation from GWAS summary statistics. ldsc also computes LD Scores.
  • URL : https://github.com/bulik/ldsc
  • TITLE : LD Score regression distinguishes confounding from polygenicity in genome-wide association studies
  • DOI : 10.1038/ng.3211
  • ABSTRACT : Both polygenicity (many small genetic effects) and confounding biases, such as cryptic relatedness and population stratification, can yield an inflated distribution of test statistics in genome-wide association studies (GWAS). However, current methods cannot distinguish between inflation from a true polygenic signal and bias. We have developed an approach, LD Score regression, that quantifies the contribution of each by examining the relationship between test statistics and linkage disequilibrium (LD). The LD Score regression intercept can be used to estimate a more powerful and accurate correction factor than genomic control. We find strong evidence that polygenicity accounts for the majority of the inflation in test statistics in many GWAS of large sample size.
  • CITATION : Bulik-Sullivan B, Loh PR, Finucane HK, Ripke S, ...&, O'Donovan MC. (2015) LD Score regression distinguishes confounding from polygenicity in genome-wide association studies Nat. Genet., 47 (3) 291-295. doi:10.1038/ng.3211. PMID 25642630
  • JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2015 ; 47 ; 3 ; 291-295
  • PUBMED_LINK : 25642630
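
The abstract describes examining "the relationship between test statistics and linkage disequilibrium". As a rough illustration, the sketch below fits the LD Score regression model E[χ²_j] = N·h²·ℓ_j/M + N·a + 1 by ordinary least squares, where ℓ_j is the LD Score of SNP j, N the sample size and M the number of SNPs. The function name `ld_score_regression` is hypothetical. The unweighted fit is a simplifying assumption made here: the ldsc tool itself applies regression weights and estimates standard errors, neither of which is attempted in this sketch.

```python
def ld_score_regression(chi2, ld_scores, n_samples, n_snps):
    """Unweighted least-squares fit of chi-square statistics on LD Scores.

    Returns (intercept, h2). The intercept estimates N*a + 1 (inflation from
    confounding); the slope N*h2/M is rescaled to a heritability estimate.
    Assumption: plain OLS, with no regression weights or standard errors.
    """
    if len(chi2) != len(ld_scores):
        raise ValueError("chi2 and ld_scores must have the same length")
    m = len(chi2)
    mean_x = sum(ld_scores) / m
    mean_y = sum(chi2) / m
    sxx = sum((x - mean_x) ** 2 for x in ld_scores)
    if sxx == 0:
        raise ValueError("LD Scores must vary across SNPs")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ld_scores, chi2))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    h2 = slope * n_snps / n_samples
    return intercept, h2


if __name__ == "__main__":
    # Noise-free toy data generated from the model itself.
    scores = [1.0, 2.0, 3.0, 4.0, 5.0]
    stats = [1.05 + 0.2 * l for l in scores]
    print(ld_score_regression(stats, scores, n_samples=10000, n_snps=1000))
```

An intercept near 1 suggests that inflation in the test statistics comes mostly from polygenic signal, while an intercept well above 1 points to confounding such as population stratification.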

LDSTORE2

  • NAME : LDSTORE2
  • SHORT NAME : LDSTORE2
  • FULL NAME : LDSTORE2
  • DESCRIPTION : LDstore is a computationally efficient program for estimating and storing Linkage Disequilibrium (SNP correlations). It combines some of the best features from RAREMETALWORKER and PLINK by implementing parallel processing using OPENMP and storing of the SNP correlations with information about the SNPs in the same binary file for fast lookups. LDstore is therefore the ideal tool for sharing SNP correlations in large-scale meta-analyses of genome-wide association studies and for on-the-fly computing/querying within web portals.
  • URL : http://www.christianbenner.com/#
  • TITLE : Prospects of fine-mapping trait-associated genomic regions by using summary statistics from genome-wide association studies
  • DOI : 10.1016/j.ajhg.2017.08.012
  • ABSTRACT : During the past few years, various novel statistical methods have been developed for fine-mapping with the use of summary statistics from genome-wide association studies (GWASs). Although these approaches require information about the linkage disequilibrium (LD) between variants, there has not been a comprehensive evaluation of how estimation of the LD structure from reference genotype panels performs in comparison with that from the original individual-level GWAS data. Using population genotype data from Finland and the UK Biobank, we show here that a reference panel of 1,000 individuals from the target population is adequate for a GWAS cohort of up to 10,000 individuals, whereas smaller panels, such as those from the 1000 Genomes Project, should be avoided. We also show, both theoretically and empirically, that the size of the reference panel needs to scale with the GWAS sample size; this has important consequences for the application of these methods in ongoing GWAS meta-analyses and large biobank studies. We conclude by providing software tools and by recommending practices for sharing LD information to more efficiently exploit summary statistics in genetics research.
  • CITATION : Benner C, Havulinna AS, Järvelin MR, Salomaa V, ...&, Pirinen M. (2017) Prospects of fine-mapping trait-associated genomic regions by using summary statistics from genome-wide association studies Am. J. Hum. Genet., 101 (4) 539-551. doi:10.1016/j.ajhg.2017.08.012. PMID 28942963
  • JOURNAL_INFO : The American Journal of Human Genetics ; Am. J. Hum. Genet. ; 2017 ; 101 ; 4 ; 539-551
  • PUBMED_LINK : 28942963
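
The description equates linkage disequilibrium with SNP correlations. A minimal sketch of that quantity, assuming it is computed as the Pearson correlation between the genotype dosage vectors of two SNPs, is shown below. The names `snp_correlation` and `ld_matrix` are hypothetical and unrelated to the LDstore command line or file format; the sketch only illustrates what is being estimated, not how LDstore computes or stores it.

```python
import math


def snp_correlation(dosages_a, dosages_b):
    """Pearson correlation between two SNPs' dosage vectors (same individuals).

    Returns NaN when either SNP is monomorphic in the sample.
    """
    if len(dosages_a) != len(dosages_b):
        raise ValueError("both SNPs must be measured on the same individuals")
    n = len(dosages_a)
    mean_a = sum(dosages_a) / n
    mean_b = sum(dosages_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(dosages_a, dosages_b))
    var_a = sum((a - mean_a) ** 2 for a in dosages_a)
    var_b = sum((b - mean_b) ** 2 for b in dosages_b)
    if var_a == 0 or var_b == 0:
        return float("nan")
    return cov / math.sqrt(var_a * var_b)


def ld_matrix(genotypes_by_snp):
    """Full correlation matrix for a list of per-SNP dosage vectors."""
    return [
        [snp_correlation(a, b) for b in genotypes_by_snp]
        for a in genotypes_by_snp
    ]


if __name__ == "__main__":
    snps = [[0, 1, 2, 1], [2, 1, 0, 1], [0, 0, 1, 2]]
    for row in ld_matrix(snps):
        print(row)
```

The nested loop is quadratic in the number of SNPs, which is why dedicated tools rely on parallel processing and compact binary storage for genome-scale regions.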

MungeSumstats

  • NAME : MungeSumstats
  • SHORT NAME : MungeSumstats
  • FULL NAME : MungeSumstats
  • DESCRIPTION : A Bioconductor package for the standardization and quality control of many GWAS summary statistics.
  • URL : https://github.com/neurogenomics/MungeSumstats
  • TITLE : MungeSumstats: a Bioconductor package for the standardization and quality control of many GWAS summary statistics
  • DOI : 10.1093/bioinformatics/btab665
  • ABSTRACT : MOTIVATION: Genome-wide association studies (GWAS) summary statistics have popularized and accelerated genetic research. However, a lack of standardization of the file formats used has proven problematic when running secondary analysis tools or performing meta-analysis studies. RESULTS: To address this issue, we have developed MungeSumstats, a Bioconductor R package for the standardization and quality control of GWAS summary statistics. MungeSumstats can handle the most common summary statistic formats, including variant call format (VCF) producing a reformatted, standardized, tabular summary statistic file, VCF or R native data object. AVAILABILITY AND IMPLEMENTATION: MungeSumstats is available on Bioconductor (v 3.13) and can also be found on Github at: https://neurogenomics.github.io/MungeSumstats. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
  • COPYRIGHT : https://creativecommons.org/licenses/by/4.0/
  • CITATION : Murphy AE, Schilder BM, Skene NG. (2021) MungeSumstats: a Bioconductor package for the standardization and quality control of many GWAS summary statistics Bioinformatics, 37 (23) 4593-4596. doi:10.1093/bioinformatics/btab665. PMID 34601555
  • JOURNAL_INFO : Bioinformatics (Oxford, England) ; Bioinformatics ; 2021 ; 37 ; 23 ; 4593-4596
  • PUBMED_LINK : 34601555

PLINK1.9

  • NAME : PLINK1.9
  • SHORT NAME : PLINK1.9
  • FULL NAME : PLINK1.9
  • DESCRIPTION : PLINK: a tool set for whole-genome association and population-based linkage analyses. This is a comprehensive update to Shaun Purcell's PLINK command-line program, developed by Christopher Chang with support from the NIH-NIDDK's Laboratory of Biological Modeling, the Purcell Lab, and others.
  • URL : https://www.cog-genomics.org/plink/1.9/
  • TITLE : PLINK: a tool set for whole-genome association and population-based linkage analyses
  • DOI : 10.1086/519795
  • ABSTRACT : Whole-genome association studies (WGAS) bring new computational, as well as analytic, challenges to researchers. Many existing genetic-analysis tools are not designed to handle such large data sets in a convenient manner and do not necessarily exploit the new opportunities that whole-genome data bring. To address these issues, we developed PLINK, an open-source C/C++ WGAS tool set. With PLINK, large data sets comprising hundreds of thousands of markers genotyped for thousands of individuals can be rapidly manipulated and analyzed in their entirety. As well as providing tools to make the basic analytic steps computationally efficient, PLINK also supports some novel approaches to whole-genome data that take advantage of whole-genome coverage. We introduce PLINK and describe the five main domains of function: data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation. In particular, we focus on the estimation and use of identity-by-state and identity-by-descent information in the context of population-based whole-genome studies. This information can be used to detect and correct for population stratification and to identify extended chromosomal segments that are shared identical by descent between very distantly related individuals. Analysis of the patterns of segmental sharing has the potential to map disease loci that contain multiple rare variants in a population-based linkage analysis.
  • COPYRIGHT : https://www.elsevier.com/open-access/userlicense/1.0/
  • CITATION : Purcell S, Neale B, Todd-Brown K, Thomas L, ...&, Sham PC. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses Am. J. Hum. Genet., 81 (3) 559-575. doi:10.1086/519795. PMID 17701901
  • JOURNAL_INFO : The American Journal of Human Genetics ; Am. J. Hum. Genet. ; 2007 ; 81 ; 3 ; 559-575
  • PUBMED_LINK : 17701901

PLINK2

  • NAME : PLINK2
  • SHORT NAME : PLINK2
  • FULL NAME : PLINK2
  • DESCRIPTION : The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
  • URL : https://www.cog-genomics.org/plink/2.0/
  • TITLE : Second-generation PLINK: rising to the challenge of larger and richer datasets
  • DOI : 10.1186/s13742-015-0047-8
  • ABSTRACT : BACKGROUND: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format. FINDINGS: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(√n)-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0). CONCLUSIONS: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
  • CITATION : Chang CC, Chow CC, Tellier LC, Vattikuti S, ...&, Lee JJ. (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets Gigascience, 4 (1) 7. doi:10.1186/s13742-015-0047-8. PMID 25722852
  • JOURNAL_INFO : GigaScience ; Gigascience ; 2015 ; 4 ; 1 ; 7
  • PUBMED_LINK : 25722852

QCTOOL v2

  • NAME : QCTOOL v2
  • SHORT NAME : QCTOOL
  • FULL NAME : QCTOOL
  • DESCRIPTION : QCTOOL is a command-line utility program for manipulation and quality control of GWAS datasets and other genome-wide data.
  • URL : https://www.well.ox.ac.uk/~gav/qctool_v2/index.html
  • TITLE : A note on exact tests of Hardy-Weinberg equilibrium
  • DOI : 10.1086/429864
  • ABSTRACT : Deviations from Hardy-Weinberg equilibrium (HWE) can indicate inbreeding, population stratification, and even problems in genotyping. In samples of affected individuals, these deviations can also provide evidence for association. Tests of HWE are commonly performed using a simple χ² goodness-of-fit test. We show that this χ² test can have inflated type I error rates, even in relatively large samples (e.g., samples of 1,000 individuals that include approximately 100 copies of the minor allele). On the basis of previous work, we describe exact tests of HWE together with efficient computational methods for their implementation. Our methods adequately control type I error in large and small samples and are computationally efficient. They have been implemented in freely available code that will be useful for quality assessment of genotype data and for the detection of genetic association or population stratification in very large data sets.
  • COPYRIGHT : https://www.elsevier.com/open-access/userlicense/1.0/
  • CITATION : Wigginton JE, Cutler DJ, Abecasis GR. (2005) A note on exact tests of Hardy-Weinberg equilibrium Am. J. Hum. Genet., 76 (5) 887-893. doi:10.1086/429864. PMID 15789306
  • JOURNAL_INFO : The American Journal of Human Genetics ; Am. J. Hum. Genet. ; 2005 ; 76 ; 5 ; 887-893
  • PUBMED_LINK : 15789306
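
The abstract contrasts the χ² goodness-of-fit test with exact tests of HWE. The sketch below illustrates one such exact test under stated assumptions: it conditions on the observed allele counts, enumerates every possible heterozygote count, and sums the probabilities of all configurations that are no more likely than the observed one. It uses direct enumeration with log-factorials rather than the efficient recurrence described in the paper, and the function name `hwe_exact_p` is hypothetical, so treat it as an illustration of the idea and not as the published implementation or QCTOOL's code.

```python
import math


def hwe_exact_p(n_het, n_hom1, n_hom2):
    """Exact Hardy-Weinberg test p-value from genotype counts (sketch).

    Conditional on N individuals and n_a copies of the rarer allele, the
    probability of observing h heterozygotes is
        2**h * N! / (hom_a! * h! * hom_b!) * n_a! * n_b! / (2N)!
    """
    n = n_het + n_hom1 + n_hom2
    if n == 0:
        raise ValueError("at least one genotype is required")
    n_a = 2 * min(n_hom1, n_hom2) + n_het  # copies of the rarer allele
    n_b = 2 * n - n_a                      # copies of the other allele

    def log_fact(k):
        return math.lgamma(k + 1)

    def log_prob(het):
        hom_a = (n_a - het) // 2
        hom_b = (n_b - het) // 2
        return (
            het * math.log(2)
            + log_fact(n) - log_fact(hom_a) - log_fact(het) - log_fact(hom_b)
            + log_fact(n_a) + log_fact(n_b) - log_fact(2 * n)
        )

    # Heterozygote counts must share the parity of n_a.
    probs = {h: math.exp(log_prob(h)) for h in range(n_a % 2, n_a + 1, 2)}
    total = sum(probs.values())  # equals 1 up to rounding error
    observed = probs[n_het]

    # Sum configurations at most as likely as the observed one; the small
    # relative tolerance guards against floating-point ties.
    p_value = sum(p for p in probs.values() if p <= observed * (1 + 1e-9)) / total
    return min(1.0, p_value)


if __name__ == "__main__":
    print(hwe_exact_p(n_het=0, n_hom1=1, n_hom2=1))
    print(hwe_exact_p(n_het=57, n_hom1=14, n_hom2=50))
```

Enumeration costs time proportional to the rarer allele count per variant, which is acceptable for illustration; production tools use faster recurrences for genome-wide quality control.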