Tools Data processing

Curation of Data processing — listings under the GWAS Tools tab.

Summary Table

| NAME | Main citation | YEAR |
| --- | --- | --- |
| Ctyper | Ma W et al., Nat Genet, 2025 | 2025 |
| GCTA | Yang J et al., Am J Hum Genet, 2011 | 2011 |
| GWASLab | He Y et al., Jxiv (preprint), 2023 | 2023 |
| Hail | (none listed) | 2018 |
| LDSC | Bulik-Sullivan BK et al., Nat Genet, 2015 | 2015 |
| LDSTORE2 | Benner C et al., Am J Hum Genet, 2017 | 2017 |
| Locityper | Prodanov T et al., Nat Genet, 2025 | 2025 |
| MungeSumstats | Murphy AE et al., Bioinformatics, 2021 | 2021 |
| PLINK1.9 | Purcell S et al., Am J Hum Genet, 2007 | 2007 |
| PLINK2 | Chang CC et al., Gigascience, 2015 | 2015 |
| PanMAN | Walia S et al., Nat Genet, 2026 | 2026 |
| QCTOOL v2 | Wigginton JE et al., Am J Hum Genet, 2005 | 2005 |
| Swave | Wang S et al., Nat Genet, 2026 | 2026 |

Ctyper

Tool
PUBMED_LINK
41107550
DESCRIPTION
Ctyper genotypes sequence-resolved copy-number variation and other complex polymorphic genes using a pangenome reference matrix, enabling allele- and copy-aware calls at scale for biobank-style cohorts.
URL
https://github.com/ChaissonLab/Ctyper, https://www.nature.com/articles/s41588-025-02346-4
KEYWORDS
CNV, copy number, pangenome, sequence-resolved, biobank scale
TITLE
Genotyping sequence-resolved copy number variation using pangenomes reveals paralog-specific global diversity and expression divergence of duplicated genes.
Main citation
Ma W, Chaisson MJP. (2025) Genotyping sequence-resolved copy number variation using pangenomes reveals paralog-specific global diversity and expression divergence of duplicated genes. Nat Genet, 57 (11) 2909-2919. doi:10.1038/s41588-025-02346-4. PMID 41107550
ABSTRACT
Copy number variable (CNV) genes are important in evolution and disease, yet their sequence variation remains a blind spot in large-scale studies. We present ctyper, a method that leverages pangenomes to produce allele-specific copy numbers with locally phased variants from next-generation sequencing samples. Benchmarking on 3,351 CNV genes and 212 challenging medically relevant (CMR) genes, ctyper captures 96.5% of phased variants with ≥99.1% correctness of copy number in CNV genes and 94.8% of phased variants in CMR genes. Ctyper takes 1.5 h to genotype a genome on one CPU. The ctyper genotypes give a 4.81-fold improvement in predictions of gene expression compared to known expression quantitative trait locus (eQTL) variants. Allele-specific expression quantified divergent expression in 7.94% of paralogs and tissue-specific biases in 4.68%. We found reduced expression of SMN2 due to SMN1 conversion, potentially affecting spinal muscular atrophy, and increased expression of translocated duplications of AMY2B. Overall, ctyper enables biobank-scale genotyping of CNV and CMR genes.
DOI
10.1038/s41588-025-02346-4

GCTA

Tool
PUBMED_LINK
21167468
FULL NAME
Genome-wide complex trait analysis (GCTA)
DESCRIPTION
GCTA estimates the variance in a complex trait explained by genome-wide SNPs (GCTA-GREML analysis). It can also simulate a GWAS based on real genotype data.
URL
https://yanglab.westlake.edu.cn/software/gcta/
TITLE
GCTA: a tool for genome-wide complex trait analysis.
Main citation
Yang J, Lee SH, Goddard ME, Visscher PM. (2011) GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet, 88 (1) 76-82. doi:10.1016/j.ajhg.2010.11.011. PMID 21167468
ABSTRACT
For most human complex diseases and traits, SNPs identified by genome-wide association studies (GWAS) explain only a small fraction of the heritability. Here we report a user-friendly software tool called genome-wide complex trait analysis (GCTA), which was developed based on a method we recently developed to address the "missing heritability" problem. GCTA estimates the variance explained by all the SNPs on a chromosome or on the whole genome for a complex trait rather than testing the association of any particular SNP to the trait. We introduce GCTA's five main functions: data management, estimation of the genetic relationships from SNPs, mixed linear model analysis of variance explained by the SNPs, estimation of the linkage disequilibrium structure, and GWAS simulation. We focus on the function of estimating the variance explained by all the SNPs on the X chromosome and testing the hypotheses of dosage compensation. The GCTA software is a versatile tool to estimate and partition complex trait variation with large GWAS data sets.
DOI
10.1016/j.ajhg.2010.11.011
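
The abstract lists "estimation of the genetic relationships from SNPs" as one of GCTA's functions. As a rough illustration of that idea, the sketch below computes a SNP-based relationship matrix with the standardized-genotype estimator, A_jk = (1/N) Σ_i (x_ij − 2p_i)(x_ik − 2p_i) / (2p_i(1 − p_i)). The function name and the input layout are assumptions made for this example. GCTA itself treats the diagonal differently and works on PLINK binary files, neither of which is attempted here.

```python
from typing import List


def genetic_relationship_matrix(genotypes: List[List[int]]) -> List[List[float]]:
    """Toy SNP-based relationship matrix.

    genotypes[i][j] is the allele count (0, 1 or 2) of individual j at SNP i.
    Allele frequencies are estimated from the same sample.
    """
    n_ind = len(genotypes[0])
    grm = [[0.0] * n_ind for _ in range(n_ind)]
    n_used = 0
    for snp in genotypes:
        p = sum(snp) / (2 * n_ind)  # sample allele frequency
        if p <= 0.0 or p >= 1.0:
            continue  # a monomorphic SNP carries no information
        scale = 2 * p * (1 - p)  # expected genotype variance under HWE
        centred = [g - 2 * p for g in snp]
        for j in range(n_ind):
            for k in range(j, n_ind):
                grm[j][k] += centred[j] * centred[k] / scale
        n_used += 1
    if n_used == 0:
        raise ValueError("no polymorphic SNPs supplied")
    for j in range(n_ind):
        for k in range(j, n_ind):
            grm[j][k] /= n_used  # average over SNPs
            grm[k][j] = grm[j][k]  # fill the lower triangle
    return grm
```

Because genotypes are centred on the sample allele frequency, every row of the resulting matrix sums to zero, which is a quick sanity check on toy data.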

GWASLab

Tool
DESCRIPTION
A Python package for handling GWAS summary statistics.
URL
https://github.com/Cloufield/gwaslab
PREPRINT_DOI
10.51094/jxiv.370
Main citation
GWASLab preprint: He, Y., Koido, M., Shimmori, Y., Kamatani, Y. (2023). GWASLab: a Python package for processing and visualizing GWAS summary statistics. Preprint at Jxiv, 2023-5. https://doi.org/10.51094/jxiv.370

Hail

Tool
DESCRIPTION
Hail is an open-source, general-purpose, Python-based data analysis tool with additional data types and methods for working with genomic data.
URL
https://hail.is/

LDSC

Tool
PUBMED_LINK
25642630
FULL NAME
LD Score Regression
DESCRIPTION
ldsc is a command line tool for estimating heritability and genetic correlation from GWAS summary statistics. ldsc also computes LD Scores.
URL
https://github.com/bulik/ldsc
TITLE
LD Score regression distinguishes confounding from polygenicity in genome-wide association studies.
Main citation
Bulik-Sullivan BK, Loh PR, Finucane HK, Ripke S, ...&, Neale BM. (2015) LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat Genet, 47 (3) 291-5. doi:10.1038/ng.3211. PMID 25642630
ABSTRACT
Both polygenicity (many small genetic effects) and confounding biases, such as cryptic relatedness and population stratification, can yield an inflated distribution of test statistics in genome-wide association studies (GWAS). However, current methods cannot distinguish between inflation from a true polygenic signal and bias. We have developed an approach, LD Score regression, that quantifies the contribution of each by examining the relationship between test statistics and linkage disequilibrium (LD). The LD Score regression intercept can be used to estimate a more powerful and accurate correction factor than genomic control. We find strong evidence that polygenicity accounts for the majority of the inflation in test statistics in many GWAS of large sample size.
DOI
10.1038/ng.3211
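
The relationship the abstract describes can be written as E[χ²_j] = N·h²·ℓ_j / M + intercept, where ℓ_j is the LD Score of SNP j (the sum of r² with neighbouring SNPs), N is the sample size and M the number of SNPs. The sketch below fits that line by ordinary least squares. It is a simplified illustration only: the function names are made up for this example, and the ldsc tool itself uses a weighted regression with block-jackknife standard errors, which this does not reproduce.

```python
from typing import List, Tuple


def ols_fit(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def ld_score_regression(
    chi2: List[float], ld_scores: List[float], n_samples: int, n_snps: int
) -> Tuple[float, float]:
    """Unweighted toy LD Score regression.

    Returns (h2, intercept): the slope is rescaled by M / N to give a
    heritability estimate, and the intercept captures inflation that does
    not scale with LD (confounding), as opposed to polygenic signal.
    """
    slope, intercept = ols_fit(ld_scores, chi2)
    h2 = slope * n_snps / n_samples
    return h2, intercept
```

An intercept near 1 suggests that inflation is mostly polygenic, while an intercept well above 1 points to confounding.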

LDSTORE2

Tool
PUBMED_LINK
28942963
DESCRIPTION
LDstore is a computationally efficient program for estimating and storing Linkage Disequilibrium (SNP correlations). It combines some of the best features from RAREMETALWORKER and PLINK by implementing parallel processing using OPENMP and storing of the SNP correlations with information about the SNPs in the same binary file for fast lookups. LDstore is therefore the ideal tool for sharing SNP correlations in large-scale meta-analyses of genome-wide association studies and for on-the-fly computing/querying within web portals.
URL
http://www.christianbenner.com/#
TITLE
Prospects of Fine-Mapping Trait-Associated Genomic Regions by Using Summary Statistics from Genome-wide Association Studies.
Main citation
Benner C, Havulinna AS, Järvelin MR, Salomaa V, ...&, Pirinen M. (2017) Prospects of Fine-Mapping Trait-Associated Genomic Regions by Using Summary Statistics from Genome-wide Association Studies. Am J Hum Genet, 101 (4) 539-551. doi:10.1016/j.ajhg.2017.08.012. PMID 28942963
ABSTRACT
During the past few years, various novel statistical methods have been developed for fine-mapping with the use of summary statistics from genome-wide association studies (GWASs). Although these approaches require information about the linkage disequilibrium (LD) between variants, there has not been a comprehensive evaluation of how estimation of the LD structure from reference genotype panels performs in comparison with that from the original individual-level GWAS data. Using population genotype data from Finland and the UK Biobank, we show here that a reference panel of 1,000 individuals from the target population is adequate for a GWAS cohort of up to 10,000 individuals, whereas smaller panels, such as those from the 1000 Genomes Project, should be avoided. We also show, both theoretically and empirically, that the size of the reference panel needs to scale with the GWAS sample size; this has important consequences for the application of these methods in ongoing GWAS meta-analyses and large biobank studies. We conclude by providing software tools and by recommending practices for sharing LD information to more efficiently exploit summary statistics in genetics research.
DOI
10.1016/j.ajhg.2017.08.012
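
The "SNP correlations" mentioned in the description are, in the simplest reading, Pearson correlations between genotype vectors. A minimal sketch of that calculation follows. The function names and the list-of-lists layout are assumptions for illustration, and nothing here reflects LDstore's file format or its command-line interface.

```python
import math
from typing import List


def snp_correlation(a: List[float], b: List[float]) -> float:
    """Pearson correlation between two genotype (dosage) vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return float("nan")  # correlation is undefined for a monomorphic SNP
    return cov / math.sqrt(var_a * var_b)


def ld_matrix(genotypes: List[List[float]]) -> List[List[float]]:
    """Pairwise correlation matrix; genotypes[i] holds SNP i across individuals."""
    m = len(genotypes)
    return [
        [snp_correlation(genotypes[i], genotypes[j]) for j in range(m)]
        for i in range(m)
    ]
```

Squaring the entries gives r², the quantity most fine-mapping and LD Score methods work with.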

Locityper

Tool
PUBMED_LINK
41107551
DESCRIPTION
Locityper performs targeted genotyping of structurally variable and hyperpolymorphic genes—including HLA, KIR, MUC, and FCGR families—from short- or long-read whole-genome sequencing by aligning reads to locus haplotypes (often from pangenomes) and scoring depth and insert-size consistency.
URL
https://github.com/tprodanov/locityper
KEYWORDS
genotyping, complex loci, HLA, short read, long read, WGS
TITLE
Locityper enables targeted genotyping of complex polymorphic genes.
Main citation
Prodanov T, Plender EG, Seebohm G, Meuth SG, ...&, Marschall T. (2025) Locityper enables targeted genotyping of complex polymorphic genes. Nat Genet, 57 (11) 2901-2908. doi:10.1038/s41588-025-02362-4. PMID 41107551
ABSTRACT
The human genome contains many structurally variable polymorphic loci, including several hundred disease-associated genes, almost inaccessible for accurate variant calling. Here we present Locityper, a tool capable of genotyping such challenging genes using short-read and long-read whole-genome sequencing. For each target, Locityper recruits and aligns reads to locus haplotypes, for instance, extracted from a pangenome, and finds the likeliest haplotype pair by optimizing read alignment, insert size and read depth profiles. Across 256 challenging medically relevant loci, Locityper achieves a median quality value (QV) above 35 from both long-read and short-read data, outperforming state-of-the-art Illumina and PacBio HiFi variant calling pipelines by 10.9 and 1.7 points, respectively. Furthermore, Locityper provides access to hyperpolymorphic HLA genes and other gene families, including KIR, MUC and FCGR. With its low running time of 1 h 35 m per sample at eight threads, Locityper is scalable to biobank-sized cohorts, enabling association studies for previously intractable disease-relevant genes.
DOI
10.1038/s41588-025-02362-4

MungeSumstats

Tool
PUBMED_LINK
34601555
DESCRIPTION
A Bioconductor package for the standardization and quality control of many GWAS summary statistics.
URL
https://github.com/neurogenomics/MungeSumstats
TITLE
MungeSumstats: a Bioconductor package for the standardization and quality control of many GWAS summary statistics.
Main citation
Murphy AE, Schilder BM, Skene NG. (2021) MungeSumstats: a Bioconductor package for the standardization and quality control of many GWAS summary statistics. Bioinformatics, 37 (23) 4593-4596. doi:10.1093/bioinformatics/btab665. PMID 34601555
ABSTRACT
MOTIVATION: Genome-wide association studies (GWAS) summary statistics have popularized and accelerated genetic research. However, a lack of standardization of the file formats used has proven problematic when running secondary analysis tools or performing meta-analysis studies. RESULTS: To address this issue, we have developed MungeSumstats, a Bioconductor R package for the standardization and quality control of GWAS summary statistics. MungeSumstats can handle the most common summary statistic formats, including variant call format (VCF) producing a reformatted, standardized, tabular summary statistic file, VCF or R native data object. AVAILABILITY AND IMPLEMENTATION: MungeSumstats is available on Bioconductor (v 3.13) and can also be found on Github at: https://neurogenomics.github.io/MungeSumstats. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
DOI
10.1093/bioinformatics/btab665
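
The standardization problem the abstract describes comes down to mapping the many header spellings found in summary statistic files onto one agreed set of column names. The Python sketch below shows the idea. The alias table and the target names (SNP, CHR, BP, A1, A2, P) are assumptions chosen for this example and are not taken from MungeSumstats, which is an R package with its own, much larger mapping.

```python
from typing import Dict, List

# Hypothetical alias table: lower-cased header variants -> standard name.
ALIASES: Dict[str, str] = {
    "snp": "SNP", "rsid": "SNP", "markername": "SNP", "variant_id": "SNP",
    "chr": "CHR", "chrom": "CHR", "chromosome": "CHR",
    "bp": "BP", "pos": "BP", "position": "BP", "base_pair_location": "BP",
    "a1": "A1", "effect_allele": "A1",
    "a2": "A2", "other_allele": "A2",
    "p": "P", "pval": "P", "p_value": "P",
}


def standardize_header(columns: List[str]) -> List[str]:
    """Map known header variants to standard names; leave unknown columns untouched."""
    return [ALIASES.get(col.strip().lower(), col) for col in columns]


def standardize_rows(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Rename the keys of each record using the same alias table."""
    renamed = []
    for row in rows:
        new_keys = standardize_header(list(row.keys()))
        renamed.append(dict(zip(new_keys, row.values())))
    return renamed
```

Unknown columns are passed through rather than dropped, so no information is lost when a file carries extra fields.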

PLINK1.9

Tool
PUBMED_LINK
17701901
DESCRIPTION
PLINK 1.9 is a comprehensive update to Shaun Purcell's PLINK command-line program, a tool set for whole-genome association and population-based linkage analyses. It was developed by Christopher Chang with support from the NIH-NIDDK's Laboratory of Biological Modeling, the Purcell Lab, and others.
URL
https://www.cog-genomics.org/plink/1.9/
TITLE
PLINK: a tool set for whole-genome association and population-based linkage analyses.
Main citation
Purcell S, Neale B, Todd-Brown K, Thomas L, ...&, Sham PC. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet, 81 (3) 559-75. doi:10.1086/519795. PMID 17701901
ABSTRACT
Whole-genome association studies (WGAS) bring new computational, as well as analytic, challenges to researchers. Many existing genetic-analysis tools are not designed to handle such large data sets in a convenient manner and do not necessarily exploit the new opportunities that whole-genome data bring. To address these issues, we developed PLINK, an open-source C/C++ WGAS tool set. With PLINK, large data sets comprising hundreds of thousands of markers genotyped for thousands of individuals can be rapidly manipulated and analyzed in their entirety. As well as providing tools to make the basic analytic steps computationally efficient, PLINK also supports some novel approaches to whole-genome data that take advantage of whole-genome coverage. We introduce PLINK and describe the five main domains of function: data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation. In particular, we focus on the estimation and use of identity-by-state and identity-by-descent information in the context of population-based whole-genome studies. This information can be used to detect and correct for population stratification and to identify extended chromosomal segments that are shared identical by descent between very distantly related individuals. Analysis of the patterns of segmental sharing has the potential to map disease loci that contain multiple rare variants in a population-based linkage analysis.
DOI
10.1086/519795
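
The abstract highlights identity-by-state (IBS) information. One common way to summarize it is the proportion of alleles two individuals share across SNPs, where each SNP contributes 2 − |g1 − g2| shared alleles out of 2. The sketch below implements that definition. The function name and the use of None for missing genotypes are assumptions for this example, and PLINK's own handling of missing data and output formats may differ.

```python
from typing import List, Optional


def ibs_similarity(g1: List[Optional[int]], g2: List[Optional[int]]) -> float:
    """Proportion of alleles shared identical by state between two individuals.

    Genotypes are allele counts (0, 1 or 2); None marks a missing call,
    and SNPs missing in either individual are skipped.
    """
    shared = 0
    n_snps = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue
        shared += 2 - abs(a - b)  # 0, 1 or 2 alleles shared at this SNP
        n_snps += 1
    if n_snps == 0:
        raise ValueError("no SNPs with calls in both individuals")
    return shared / (2 * n_snps)
```

Values close to 1 between supposedly unrelated samples can flag duplicates or close relatives during quality control.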

PLINK2

Tool
PUBMED_LINK
25722852
DESCRIPTION
The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
URL
https://www.cog-genomics.org/plink/2.0/
TITLE
Second-generation PLINK: rising to the challenge of larger and richer datasets.
Main citation
Chang CC, Chow CC, Tellier LC, Vattikuti S, ...&, Lee JJ. (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience, 4, 7. doi:10.1186/s13742-015-0047-8. PMID 25722852
ABSTRACT
BACKGROUND: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format. FINDINGS: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(√n)-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0). CONCLUSIONS: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
DOI
10.1186/s13742-015-0047-8

PanMAN

Tool
PUBMED_LINK
41526696
FULL NAME
Pangenome Mutation-Annotated Network
DESCRIPTION
PanMAN is a compact pangenome representation built from mutation-annotated trees (PanMATs) linked into a network, designed to compress and query shared evolutionary history across large microbial pathogen collections.
URL
https://github.com/TurakhiaLab/panman, https://turakhia.ucsd.edu/panman/
KEYWORDS
pangenome, microbial genomics, compression, mutation-annotated tree, phylogeny
TITLE
Compressive pangenomics using mutation-annotated networks.
Main citation
Walia S, Motwani H, Tseng YH, Smith K, ...&, Turakhia Y. (2026) Compressive pangenomics using mutation-annotated networks. Nat Genet, 58 (2) 445-453. doi:10.1038/s41588-025-02478-7. PMID 41526696
ABSTRACT
Pangenomics is an emerging field that uses collections of genomes, rather than a single reference, to reduce bias and capture intra-species diversity. However, existing pangenomic data formats face challenges in scaling to millions of genomes and primarily emphasize variation, often neglecting the underlying mutational events and evolutionary relationships. This work introduces Pangenome Mutation-Annotated Network (PanMAN), a lossless pangenome representation that achieves compression ratios ranging from 3.5-1,391× in file sizes compared to existing variation-preserving formats, with performance generally improving on larger datasets. In addition to compression, PanMAN increases representational capacity by encoding detailed mutational and evolutionary histories inferred across genomes, thereby enabling new biological insights. Using PanMAN, a comprehensive SARS-CoV-2 pangenome was constructed from 8 million publicly available sequences, requiring only 366 MB of disk space. We also present 'panmanUtils', a toolkit that supports common analyses and ensures interoperability with existing software. PanMAN is poised to greatly improve the scale, speed, resolution and scope of pangenomic analysis and data sharing.
DOI
10.1038/s41588-025-02478-7
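
The compression idea behind a mutation-annotated tree is that each node stores only the changes relative to its parent, so a full sequence is recovered by replaying mutations from the root down. The toy sketch below illustrates this for substitutions only. The Node class and reconstruct function are hypothetical names invented for this example. PanMAN's actual data model also covers other mutation types and links several trees into a network, none of which is modelled here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """A tree node annotated with substitutions relative to its parent."""
    name: str
    mutations: List[Tuple[int, str]] = field(default_factory=list)  # (0-based position, new base)
    children: List["Node"] = field(default_factory=list)


def reconstruct(root: Node, root_sequence: str) -> Dict[str, str]:
    """Replay mutations from the root to recover every node's full sequence."""
    sequences: Dict[str, str] = {}

    def visit(node: Node, parent_sequence: str) -> None:
        bases = list(parent_sequence)
        for position, base in node.mutations:
            bases[position] = base  # apply this node's substitutions
        current = "".join(bases)
        sequences[node.name] = current
        for child in node.children:
            visit(child, current)  # children inherit the mutated sequence

    visit(root, root_sequence)
    return sequences
```

Only the root sequence and the per-node mutation lists need to be stored, which is why closely related genomes compress well under this kind of representation.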

QCTOOL v2 (QCTOOL)

Tool
PUBMED_LINK
15789306
FULL NAME
QCTOOL
DESCRIPTION
QCTOOL is a command-line utility program for manipulation and quality control of GWAS datasets and other genome-wide data.
URL
https://www.well.ox.ac.uk/~gav/qctool_v2/index.html
TITLE
A note on exact tests of Hardy-Weinberg equilibrium.
Main citation
Wigginton JE, Cutler DJ, Abecasis GR. (2005) A note on exact tests of Hardy-Weinberg equilibrium. Am J Hum Genet, 76 (5) 887-93. doi:10.1086/429864. PMID 15789306
ABSTRACT
Deviations from Hardy-Weinberg equilibrium (HWE) can indicate inbreeding, population stratification, and even problems in genotyping. In samples of affected individuals, these deviations can also provide evidence for association. Tests of HWE are commonly performed using a simple χ² goodness-of-fit test. We show that this χ² test can have inflated type I error rates, even in relatively large samples (e.g., samples of 1,000 individuals that include approximately 100 copies of the minor allele). On the basis of previous work, we describe exact tests of HWE together with efficient computational methods for their implementation. Our methods adequately control type I error in large and small samples and are computationally efficient. They have been implemented in freely available code that will be useful for quality assessment of genotype data and for the detection of genetic association or population stratification in very large data sets.
DOI
10.1086/429864
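
The exact test in the cited paper conditions on the observed allele counts and sums the probabilities of all heterozygote counts that are no more likely than the one observed. The sketch below follows that general recipe, building the distribution with a recurrence that moves two heterozygotes at a time away from a starting point near the expected count. It is an illustrative reimplementation written for this listing, not the authors' released code and not QCTOOL's implementation, and the function name is an assumption.

```python
def hwe_exact_pvalue(obs_hets: int, obs_hom1: int, obs_hom2: int) -> float:
    """Exact Hardy-Weinberg p-value from the three genotype counts."""
    if min(obs_hets, obs_hom1, obs_hom2) < 0:
        raise ValueError("genotype counts must be non-negative")
    n = obs_hets + obs_hom1 + obs_hom2
    if n == 0:
        raise ValueError("no genotypes supplied")
    rare = 2 * min(obs_hom1, obs_hom2) + obs_hets  # copies of the rarer allele

    # Unnormalized probabilities indexed by heterozygote count.
    probs = [0.0] * (rare + 1)
    mid = rare * (2 * n - rare) // (2 * n)  # start near the expected count
    if mid % 2 != rare % 2:
        mid += 1  # heterozygote count must share parity with `rare`
    probs[mid] = 1.0

    # Walk down: two fewer heterozygotes means one more of each homozygote.
    hets, hom_r = mid, (rare - mid) // 2
    hom_c = n - hets - hom_r
    while hets > 1:
        probs[hets - 2] = probs[hets] * hets * (hets - 1) / (4.0 * (hom_r + 1) * (hom_c + 1))
        hets, hom_r, hom_c = hets - 2, hom_r + 1, hom_c + 1

    # Walk up: two more heterozygotes means one fewer of each homozygote.
    hets, hom_r = mid, (rare - mid) // 2
    hom_c = n - hets - hom_r
    while hets <= rare - 2:
        probs[hets + 2] = probs[hets] * 4.0 * hom_r * hom_c / ((hets + 2) * (hets + 1))
        hets, hom_r, hom_c = hets + 2, hom_r - 1, hom_c - 1

    total = sum(probs)
    p_obs = probs[obs_hets] / total
    # Sum every outcome that is no more likely than the observed one.
    p_value = sum(p / total for p in probs if p / total <= p_obs)
    return min(1.0, p_value)
```

Genotype counts that match HWE expectations give a p-value near 1, while a sample with two common homozygote classes and no heterozygotes gives a p-value close to 0.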

Swave

Tool
PUBMED_LINK
41807798
DESCRIPTION
Swave calls and genotypes structural variants from assembly-based pangenome graphs by encoding mapping patterns as images (“projection waves”) and classifying signals with a recurrent neural network, including complex and repetitive SVs for population-level characterization.
URL
https://github.com/songbowang125/Swave, https://github.com/songbowang125/Swave-Utils
KEYWORDS
structural variant, pangenome graph, deep learning, RNN, population genomics
TITLE
Population-level structural variant characterization using pangenome graphs.
Main citation
Wang S, Xu T, Zhang P, Ye K. (2026) Population-level structural variant characterization using pangenome graphs. Nat Genet, 58 (3) 664-672. doi:10.1038/s41588-026-02538-6. PMID 41807798
ABSTRACT
Population-level structural variant (SV) profiling is crucial in the era of pangenomes. However, identifying SVs from genome assemblies and pangenome graphs remains a substantial challenge. Here we present Swave, a sequence-to-image, deep learning-based method that accurately resolves both simple and complex SVs, along with their population characteristics, from assembly-derived pangenome graphs. Swave introduces 'projection waves' to summarize the dotplot images that capture mapping patterns between reference and SV-indicating alleles in the pangenome. Then, a recurrent neural network distinguishes true SV signals from background noise introduced by genomic repeats. Swave demonstrates superior performance in both SV-type classification and genotyping compared with existing methods. When applied to healthy cohorts and rare-disease cohorts, Swave reveals complex and polymorphic SV patterns across human populations and identifies potentially pathogenic SVs. These advancements will facilitate the creation of comprehensive population-level SV catalogs, deepening our understanding of SVs in genetic diversity and disease associations.
DOI
10.1038/s41588-026-02538-6