Data_processing

Summary Table

NAME	CITATION	YEAR
GCTA	Yang J, Lee SH, Goddard ME, Visscher PM. (2011) GCTA: a tool for genome-wide complex trait analysis Am. J. Hum. Genet., 88 (1) 76-82. doi:10.1016/j.ajhg.2010.11.011. PMID 21167468	2011
GWASLab	GWASLab preprint: He, Y., Koido, M., Shimmori, Y., Kamatani, Y. (2023). GWASLab: a Python package for processing and visualizing GWAS summary statistics. Preprint at Jxiv, 2023-5. https://doi.org/10.51094/jxiv.370	NA
Hail	Hail Team. Hail 0.2.13-81ab564db2b4. https://github.com/hail-is/hail/releases/tag/0.2.13.	NA
LDSC	Bulik-Sullivan B, Loh PR, Finucane HK, Ripke S, ...&, O'Donovan MC. (2015) LD Score regression distinguishes confounding from polygenicity in genome-wide association studies Nat. Genet., 47 (3) 291-295. doi:10.1038/ng.3211. PMID 25642630	2015
LDSTORE2	Benner C, Havulinna AS, Järvelin MR, Salomaa V, ...&, Pirinen M. (2017) Prospects of fine-mapping trait-associated genomic regions by using summary statistics from genome-wide association studies Am. J. Hum. Genet., 101 (4) 539-551. doi:10.1016/j.ajhg.2017.08.012. PMID 28942963	2017
MungeSumstats	Murphy AE, Schilder BM, Skene NG. (2021) MungeSumstats: a Bioconductor package for the standardization and quality control of many GWAS summary statistics Bioinformatics, 37 (23) 4593-4596. doi:10.1093/bioinformatics/btab665. PMID 34601555	2021
PLINK1.9	Purcell S, Neale B, Todd-Brown K, Thomas L, ...&, Sham PC. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses Am. J. Hum. Genet., 81 (3) 559-575. doi:10.1086/519795. PMID 17701901	2007
PLINK2	Chang CC, Chow CC, Tellier LC, Vattikuti S, ...&, Lee JJ. (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets Gigascience, 4 (1) 7. doi:10.1186/s13742-015-0047-8. PMID 25722852	2015
QCTOOL v2	Wigginton JE, Cutler DJ, Abecasis GR. (2005) A note on exact tests of Hardy-Weinberg equilibrium Am. J. Hum. Genet., 76 (5) 887-893. doi:10.1086/429864. PMID 15789306	2005

GCTA

NAME : GCTA
SHORT NAME : GCTA
FULL NAME : Genome-wide complex trait analysis (GCTA)
DESCRIPTION : GCTA estimates the variance in a complex trait explained by genome-wide SNPs (GCTA-GREML analysis). It can also simulate a GWAS based on real genotype data.
URL : https://yanglab.westlake.edu.cn/software/gcta/
TITLE : GCTA: a tool for genome-wide complex trait analysis
DOI : 10.1016/j.ajhg.2010.11.011
ABSTRACT : For most human complex diseases and traits, SNPs identified by genome-wide association studies (GWAS) explain only a small fraction of the heritability. Here we report a user-friendly software tool called genome-wide complex trait analysis (GCTA), which was developed based on a method we recently developed to address the "missing heritability" problem. GCTA estimates the variance explained by all the SNPs on a chromosome or on the whole genome for a complex trait rather than testing the association of any particular SNP to the trait. We introduce GCTA's five main functions: data management, estimation of the genetic relationships from SNPs, mixed linear model analysis of variance explained by the SNPs, estimation of the linkage disequilibrium structure, and GWAS simulation. We focus on the function of estimating the variance explained by all the SNPs on the X chromosome and testing the hypotheses of dosage compensation. The GCTA software is a versatile tool to estimate and partition complex trait variation with large GWAS data sets.
CITATION : Yang J, Lee SH, Goddard ME, Visscher PM. (2011) GCTA: a tool for genome-wide complex trait analysis Am. J. Hum. Genet., 88 (1) 76-82. doi:10.1016/j.ajhg.2010.11.011. PMID 21167468
JOURNAL_INFO : American journal of human genetics ; Am. J. Hum. Genet. ; 2011 ; 88 ; 1 ; 76-82
PUBMED_LINK : 21167468
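
The "estimation of the genetic relationships from SNPs" step named in the abstract can be illustrated with a small sketch. This is not GCTA's code: it assumes the commonly cited estimator A_jk = (1/M) * sum_i (x_ij - 2p_i)(x_ik - 2p_i) / (2p_i(1 - p_i)), estimates allele frequencies from the sample itself, and applies the same expression on the diagonal for brevity, whereas GCTA handles diagonal elements separately. The function name and toy data are hypothetical.

```python
def genetic_relationship_matrix(genotypes):
    """Sketch of a SNP-based genetic relationship matrix (GRM).

    genotypes[j][i] is the allele count (0, 1 or 2) of individual j at SNP i.
    Allele frequencies are estimated from the sample (an assumption of this
    sketch); monomorphic SNPs are skipped to avoid division by zero.
    """
    n_ind = len(genotypes)
    n_snp = len(genotypes[0])
    grm = [[0.0] * n_ind for _ in range(n_ind)]
    used = 0  # number of polymorphic SNPs that contribute to the GRM

    for i in range(n_snp):
        column = [row[i] for row in genotypes]
        p = sum(column) / (2 * n_ind)  # sample allele frequency
        if p <= 0.0 or p >= 1.0:
            continue
        used += 1
        denom = 2 * p * (1 - p)  # expected genotype variance under HWE
        centered = [x - 2 * p for x in column]
        for j in range(n_ind):
            for k in range(j, n_ind):
                grm[j][k] += centered[j] * centered[k] / denom

    if used == 0:
        raise ValueError("no polymorphic SNPs available")

    # Average over SNPs and mirror the upper triangle.
    for j in range(n_ind):
        for k in range(j, n_ind):
            grm[j][k] /= used
            grm[k][j] = grm[j][k]
    return grm


# Hypothetical toy data: 3 individuals, 2 SNPs.
toy_genotypes = [[0, 2], [1, 1], [2, 0]]
toy_grm = genetic_relationship_matrix(toy_genotypes)
```

In the toy data the first and third individuals carry opposite genotypes at both SNPs, so their off-diagonal relationship is negative, while the heterozygous second individual sits exactly at the sample mean.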

GWASLab

NAME : GWASLab
SHORT NAME : GWASLab
FULL NAME : GWASLab
DESCRIPTION : A Python package for handling GWAS summary statistics (sumstats).
URL : https://github.com/Cloufield/gwaslab
PREPRINT_DOI : 10.51094/jxiv.370
CITATION : GWASLab preprint: He, Y., Koido, M., Shimmori, Y., Kamatani, Y. (2023). GWASLab: a Python package for processing and visualizing GWAS summary statistics. Preprint at Jxiv, 2023-5. https://doi.org/10.51094/jxiv.370

Hail

NAME : Hail
SHORT NAME : Hail
FULL NAME : Hail
DESCRIPTION : Hail is an open-source, general-purpose, Python-based data analysis tool with additional data types and methods for working with genomic data.
URL : https://hail.is/
CITATION : Hail Team. Hail 0.2.13-81ab564db2b4. https://github.com/hail-is/hail/releases/tag/0.2.13.

LDSC

NAME : LDSC
SHORT NAME : LDSC
FULL NAME : LD Score Regression
DESCRIPTION : ldsc is a command line tool for estimating heritability and genetic correlation from GWAS summary statistics. ldsc also computes LD Scores.
URL : https://github.com/bulik/ldsc
TITLE : LD Score regression distinguishes confounding from polygenicity in genome-wide association studies
DOI : 10.1038/ng.3211
ABSTRACT : Both polygenicity (many small genetic effects) and confounding biases, such as cryptic relatedness and population stratification, can yield an inflated distribution of test statistics in genome-wide association studies (GWAS). However, current methods cannot distinguish between inflation from a true polygenic signal and bias. We have developed an approach, LD Score regression, that quantifies the contribution of each by examining the relationship between test statistics and linkage disequilibrium (LD). The LD Score regression intercept can be used to estimate a more powerful and accurate correction factor than genomic control. We find strong evidence that polygenicity accounts for the majority of the inflation in test statistics in many GWAS of large sample size.
CITATION : Bulik-Sullivan B, Loh PR, Finucane HK, Ripke S, ...&, O'Donovan MC. (2015) LD Score regression distinguishes confounding from polygenicity in genome-wide association studies Nat. Genet., 47 (3) 291-295. doi:10.1038/ng.3211. PMID 25642630
JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2015 ; 47 ; 3 ; 291-295
PUBMED_LINK : 25642630
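
The relationship between test statistics and LD described in the abstract is usually written as E[chi2_j] = N * h2 * l_j / M + N * a + 1, where l_j is the LD Score of SNP j, N the sample size, M the number of SNPs, h2 the SNP heritability and a the contribution of confounding. The sketch below is a simplified illustration and not the ldsc implementation: it fits an unweighted ordinary least squares line to noise-free toy data, whereas ldsc uses a weighted regression with block-jackknife standard errors. All function names and numbers are hypothetical.

```python
def ols_fit(x, y):
    """Return (slope, intercept) of an ordinary least squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def ld_score_regression(ld_scores, chi2, n_samples, n_snps):
    """Regress chi-square statistics on LD Scores.

    The slope equals N * h2 / M, so h2 = slope * M / N.
    The intercept equals 1 + N * a; values above 1 suggest confounding.
    """
    slope, intercept = ols_fit(ld_scores, chi2)
    h2 = slope * n_snps / n_samples
    return h2, intercept


# Hypothetical, noise-free toy data generated from the model itself.
n_samples = 50_000
n_snps = 1_000_000
true_h2 = 0.4
true_intercept = 1.05
ld_scores = [10.0, 50.0, 100.0, 200.0, 400.0]
chi2 = [true_intercept + n_samples * true_h2 / n_snps * l for l in ld_scores]

h2_hat, intercept_hat = ld_score_regression(ld_scores, chi2, n_samples, n_snps)
```

Because the toy statistics are generated without noise, the fit recovers the heritability and intercept used to create them; real summary statistics would scatter around the line.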

LDSTORE2

NAME : LDSTORE2
SHORT NAME : LDSTORE2
FULL NAME : LDSTORE2
DESCRIPTION : LDstore is a computationally efficient program for estimating and storing Linkage Disequilibrium (SNP correlations). It combines some of the best features from RAREMETALWORKER and PLINK by implementing parallel processing using OPENMP and storing of the SNP correlations with information about the SNPs in the same binary file for fast lookups. LDstore is therefore the ideal tool for sharing SNP correlations in large-scale meta-analyses of genome-wide association studies and for on-the-fly computing/querying within web portals.
URL : http://www.christianbenner.com/#
TITLE : Prospects of fine-mapping trait-associated genomic regions by using summary statistics from genome-wide association studies
DOI : 10.1016/j.ajhg.2017.08.012
ABSTRACT : During the past few years, various novel statistical methods have been developed for fine-mapping with the use of summary statistics from genome-wide association studies (GWASs). Although these approaches require information about the linkage disequilibrium (LD) between variants, there has not been a comprehensive evaluation of how estimation of the LD structure from reference genotype panels performs in comparison with that from the original individual-level GWAS data. Using population genotype data from Finland and the UK Biobank, we show here that a reference panel of 1,000 individuals from the target population is adequate for a GWAS cohort of up to 10,000 individuals, whereas smaller panels, such as those from the 1000 Genomes Project, should be avoided. We also show, both theoretically and empirically, that the size of the reference panel needs to scale with the GWAS sample size; this has important consequences for the application of these methods in ongoing GWAS meta-analyses and large biobank studies. We conclude by providing software tools and by recommending practices for sharing LD information to more efficiently exploit summary statistics in genetics research.
CITATION : Benner C, Havulinna AS, Järvelin MR, Salomaa V, ...&, Pirinen M. (2017) Prospects of fine-mapping trait-associated genomic regions by using summary statistics from genome-wide association studies Am. J. Hum. Genet., 101 (4) 539-551. doi:10.1016/j.ajhg.2017.08.012. PMID 28942963
JOURNAL_INFO : The American Journal of Human Genetics ; Am. J. Hum. Genet. ; 2017 ; 101 ; 4 ; 539-551
PUBMED_LINK : 28942963

MungeSumstats

NAME : MungeSumstats
SHORT NAME : MungeSumstats
FULL NAME : MungeSumstats
DESCRIPTION : A Bioconductor package for the standardization and quality control of many GWAS summary statistics.
URL : https://github.com/neurogenomics/MungeSumstats
TITLE : MungeSumstats: a Bioconductor package for the standardization and quality control of many GWAS summary statistics
DOI : 10.1093/bioinformatics/btab665
ABSTRACT : MOTIVATION: Genome-wide association studies (GWAS) summary statistics have popularized and accelerated genetic research. However, a lack of standardization of the file formats used has proven problematic when running secondary analysis tools or performing meta-analysis studies. RESULTS: To address this issue, we have developed MungeSumstats, a Bioconductor R package for the standardization and quality control of GWAS summary statistics. MungeSumstats can handle the most common summary statistic formats, including variant call format (VCF) producing a reformatted, standardized, tabular summary statistic file, VCF or R native data object. AVAILABILITY AND IMPLEMENTATION: MungeSumstats is available on Bioconductor (v 3.13) and can also be found on Github at: https://neurogenomics.github.io/MungeSumstats. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
COPYRIGHT : https://creativecommons.org/licenses/by/4.0/
CITATION : Murphy AE, Schilder BM, Skene NG. (2021) MungeSumstats: a Bioconductor package for the standardization and quality control of many GWAS summary statistics Bioinformatics, 37 (23) 4593-4596. doi:10.1093/bioinformatics/btab665. PMID 34601555
JOURNAL_INFO : Bioinformatics (Oxford, England) ; Bioinformatics ; 2021 ; 37 ; 23 ; 4593-4596
PUBMED_LINK : 34601555

PLINK1.9

NAME : PLINK1.9
SHORT NAME : PLINK1.9
FULL NAME : PLINK1.9
DESCRIPTION : PLINK: a tool set for whole-genome association and population-based linkage analyses. This is a comprehensive update to Shaun Purcell's PLINK command-line program, developed by Christopher Chang with support from the NIH-NIDDK's Laboratory of Biological Modeling, the Purcell Lab, and others.
URL : https://www.cog-genomics.org/plink/1.9/
TITLE : PLINK: a tool set for whole-genome association and population-based linkage analyses
DOI : 10.1086/519795
ABSTRACT : Whole-genome association studies (WGAS) bring new computational, as well as analytic, challenges to researchers. Many existing genetic-analysis tools are not designed to handle such large data sets in a convenient manner and do not necessarily exploit the new opportunities that whole-genome data bring. To address these issues, we developed PLINK, an open-source C/C++ WGAS tool set. With PLINK, large data sets comprising hundreds of thousands of markers genotyped for thousands of individuals can be rapidly manipulated and analyzed in their entirety. As well as providing tools to make the basic analytic steps computationally efficient, PLINK also supports some novel approaches to whole-genome data that take advantage of whole-genome coverage. We introduce PLINK and describe the five main domains of function: data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation. In particular, we focus on the estimation and use of identity-by-state and identity-by-descent information in the context of population-based whole-genome studies. This information can be used to detect and correct for population stratification and to identify extended chromosomal segments that are shared identical by descent between very distantly related individuals. Analysis of the patterns of segmental sharing has the potential to map disease loci that contain multiple rare variants in a population-based linkage analysis.
COPYRIGHT : https://www.elsevier.com/open-access/userlicense/1.0/
CITATION : Purcell S, Neale B, Todd-Brown K, Thomas L, ...&, Sham PC. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses Am. J. Hum. Genet., 81 (3) 559-575. doi:10.1086/519795. PMID 17701901
JOURNAL_INFO : The American Journal of Human Genetics ; Am. J. Hum. Genet. ; 2007 ; 81 ; 3 ; 559-575
PUBMED_LINK : 17701901

PLINK2

NAME : PLINK2
SHORT NAME : PLINK2
FULL NAME : PLINK2
DESCRIPTION : The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
URL : https://www.cog-genomics.org/plink/2.0/
TITLE : Second-generation PLINK: rising to the challenge of larger and richer datasets
DOI : 10.1186/s13742-015-0047-8
ABSTRACT : BACKGROUND: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format. FINDINGS: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(√n)-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0). CONCLUSIONS: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
CITATION : Chang CC, Chow CC, Tellier LC, Vattikuti S, ...&, Lee JJ. (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets Gigascience, 4 (1) 7. doi:10.1186/s13742-015-0047-8. PMID 25722852
JOURNAL_INFO : GigaScience ; Gigascience ; 2015 ; 4 ; 1 ; 7
PUBMED_LINK : 25722852

QCTOOL v2

NAME : QCTOOL v2
SHORT NAME : QCTOOL
FULL NAME : QCTOOL
DESCRIPTION : QCTOOL is a command-line utility program for manipulation and quality control of GWAS datasets and other genome-wide data.
URL : https://www.well.ox.ac.uk/~gav/qctool_v2/index.html
TITLE : A note on exact tests of Hardy-Weinberg equilibrium
DOI : 10.1086/429864
ABSTRACT : Deviations from Hardy-Weinberg equilibrium (HWE) can indicate inbreeding, population stratification, and even problems in genotyping. In samples of affected individuals, these deviations can also provide evidence for association. Tests of HWE are commonly performed using a simple χ² goodness-of-fit test. We show that this χ² test can have inflated type I error rates, even in relatively large samples (e.g., samples of 1,000 individuals that include approximately 100 copies of the minor allele). On the basis of previous work, we describe exact tests of HWE together with efficient computational methods for their implementation. Our methods adequately control type I error in large and small samples and are computationally efficient. They have been implemented in freely available code that will be useful for quality assessment of genotype data and for the detection of genetic association or population stratification in very large data sets.
COPYRIGHT : https://www.elsevier.com/open-access/userlicense/1.0/
CITATION : Wigginton JE, Cutler DJ, Abecasis GR. (2005) A note on exact tests of Hardy-Weinberg equilibrium Am. J. Hum. Genet., 76 (5) 887-893. doi:10.1086/429864. PMID 15789306
JOURNAL_INFO : The American Journal of Human Genetics ; Am. J. Hum. Genet. ; 2005 ; 76 ; 5 ; 887-893
PUBMED_LINK : 15789306
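
The exact test of HWE described in the abstract can be sketched as follows. This is an illustration of the idea and not QCTOOL's or the authors' code: it conditions on the observed allele counts, evaluates the probability of every possible heterozygote count directly with log-factorials (the paper describes a more efficient recurrence), and sums the probabilities that are no larger than that of the observed count. The function name is hypothetical.

```python
from math import exp, lgamma, log


def hwe_exact_pvalue(n_aa, n_ab, n_bb):
    """Two-sided exact HWE p-value from genotype counts (AA, AB, BB)."""
    n = n_aa + n_ab + n_bb   # number of individuals
    n_a = 2 * n_aa + n_ab    # copies of allele A
    n_b = 2 * n_bb + n_ab    # copies of allele B

    def log_fact(k):
        return lgamma(k + 1)

    def log_prob(het):
        # P(het | n, n_a) = 2^het * n! / (hom_a! het! hom_b!) * n_a! n_b! / (2n)!
        hom_a = (n_a - het) // 2
        hom_b = (n_b - het) // 2
        return (het * log(2) + log_fact(n)
                - log_fact(hom_a) - log_fact(het) - log_fact(hom_b)
                + log_fact(n_a) + log_fact(n_b) - log_fact(2 * n))

    # Possible heterozygote counts share the parity of n_a.
    het_counts = range(n_a % 2, min(n_a, n_b) + 1, 2)
    p_observed = exp(log_prob(n_ab))

    # Small relative tolerance so ties with the observed count are included.
    p_value = sum(p for p in (exp(log_prob(h)) for h in het_counts)
                  if p <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)


# Genotype counts in perfect HWE proportions versus a sample with no heterozygotes.
p_balanced = hwe_exact_pvalue(25, 50, 25)
p_no_hets = hwe_exact_pvalue(50, 0, 50)
```

Counts in HWE proportions give a p-value near 1, while a complete absence of heterozygotes at a common variant gives a p-value close to 0.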