Tools Data processing

Curation of Data processing — listings under the GWAS Tools tab.

Summary Table

NAME	Main citation	YEAR
Ctyper	Ma W et al., Nat Genet, 2025	2025
GCTA	Yang J et al., Am J Hum Genet, 2011	2011
GWASLab	He Y et al., Jxiv (preprint), 2023	2023
Hail	2018	2018
LDSC	Bulik-Sullivan BK et al., Nat Genet, 2015	2015
LDSTORE2	Benner C et al., Am J Hum Genet, 2017	2017
Locityper	Prodanov T et al., Nat Genet, 2025	2025
MungeSumstats	Murphy AE et al., Bioinformatics, 2021	2021
PLINK1.9	Purcell S et al., Am J Hum Genet, 2007	2007
PLINK2	Chang CC et al., Gigascience, 2015	2015
PanMAN	Walia S et al., Nat Genet, 2026	2026
QCTOOL v2	Wigginton JE et al., Am J Hum Genet, 2005	2005
Swave	Wang S et al., Nat Genet, 2026	2026

Ctyper

Tool

PUBMED_LINK

41107550

DESCRIPTION

Ctyper genotypes sequence-resolved copy-number variation and other complex polymorphic genes using a pangenome reference matrix, enabling allele- and copy-aware calls at scale for biobank-style cohorts.

URL

https://github.com/ChaissonLab/Ctyper, https://www.nature.com/articles/s41588-025-02346-4

KEYWORDS

CNV, copy number, pangenome, sequence-resolved, biobank scale

TITLE

Genotyping sequence-resolved copy number variation using pangenomes reveals paralog-specific global diversity and expression divergence of duplicated genes.

Main citation

Ma W, Chaisson MJP. (2025) Genotyping sequence-resolved copy number variation using pangenomes reveals paralog-specific global diversity and expression divergence of duplicated genes. Nat Genet, 57 (11) 2909-2919. doi:10.1038/s41588-025-02346-4. PMID 41107550

ABSTRACT

Copy number variable (CNV) genes are important in evolution and disease, yet their sequence variation remains a blind spot in large-scale studies. We present ctyper, a method that leverages pangenomes to produce allele-specific copy numbers with locally phased variants from next-generation sequencing samples. Benchmarking on 3,351 CNV genes and 212 challenging medically relevant (CMR) genes, ctyper captures 96.5% of phased variants with ≥99.1% correctness of copy number in CNV genes and 94.8% of phased variants in CMR genes. Ctyper takes 1.5 h to genotype a genome on one CPU. The ctyper genotypes give a 4.81-fold improvement in predictions of gene expression compared to known expression quantitative trait locus (eQTL) variants. Allele-specific expression quantified divergent expression in 7.94% of paralogs and tissue-specific biases in 4.68%. We found reduced expression of SMN2 due to SMN1 conversion, potentially affecting spinal muscular atrophy, and increased expression of translocated duplications of AMY2B. Overall, ctyper enables biobank-scale genotyping of CNV and CMR genes.

DOI

10.1038/s41588-025-02346-4

GCTA

Tool

PUBMED_LINK

21167468

FULL NAME

Genome-wide complex trait analysis (GCTA)

DESCRIPTION

GCTA estimates the variance in a complex trait explained by all SNPs (GCTA-GREML analysis). It can also simulate a GWAS based on real genotype data.

URL

https://yanglab.westlake.edu.cn/software/gcta/

TITLE

GCTA: a tool for genome-wide complex trait analysis.

Main citation

Yang J, Lee SH, Goddard ME, Visscher PM. (2011) GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet, 88 (1) 76-82. doi:10.1016/j.ajhg.2010.11.011. PMID 21167468

ABSTRACT

For most human complex diseases and traits, SNPs identified by genome-wide association studies (GWAS) explain only a small fraction of the heritability. Here we report a user-friendly software tool called genome-wide complex trait analysis (GCTA), which was developed based on a method we recently developed to address the "missing heritability" problem. GCTA estimates the variance explained by all the SNPs on a chromosome or on the whole genome for a complex trait rather than testing the association of any particular SNP to the trait. We introduce GCTA's five main functions: data management, estimation of the genetic relationships from SNPs, mixed linear model analysis of variance explained by the SNPs, estimation of the linkage disequilibrium structure, and GWAS simulation. We focus on the function of estimating the variance explained by all the SNPs on the X chromosome and testing the hypotheses of dosage compensation. The GCTA software is a versatile tool to estimate and partition complex trait variation with large GWAS data sets.
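One of the functions listed in the abstract, estimating genetic relationships from SNPs, can be illustrated with a minimal sketch. It assumes the commonly used estimator in which each SNP is standardized by its sample allele frequency p and the products are averaged over SNPs, A_jk = (1/N) Σ_i (x_ij − 2p_i)(x_ik − 2p_i) / (2p_i(1 − p_i)). The function name and the simulated genotypes are hypothetical, the same formula is applied to the diagonal for simplicity, and this is not GCTA's own code.

```python
import numpy as np

def genetic_relationship_matrix(genotypes):
    """Sketch of a SNP-based relationship matrix (hypothetical helper, not GCTA code).

    genotypes: array of shape (n_individuals, n_snps) holding 0/1/2 allele counts.
    """
    x = np.asarray(genotypes, dtype=float)
    p = x.mean(axis=0) / 2.0              # sample allele frequency per SNP
    keep = (p > 0.0) & (p < 1.0)          # drop monomorphic SNPs (zero variance)
    x, p = x[:, keep], p[keep]
    z = (x - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))   # standardize each SNP
    return z @ z.T / z.shape[1]           # average product over SNPs

# Simulated toy genotypes (assumption: unrelated individuals, independent SNPs).
rng = np.random.default_rng(0)
freqs = rng.uniform(0.1, 0.9, size=500)
geno = rng.binomial(2, freqs, size=(20, 500))
grm = genetic_relationship_matrix(geno)
print(grm.shape)
```

Because each SNP is centered on its sample mean, every row of the resulting matrix sums to zero up to rounding error, which is a quick sanity check on the computation.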

DOI

10.1016/j.ajhg.2010.11.011

GWASLab

Tool

DESCRIPTION

A Python package for handling GWAS summary statistics (sumstats).

URL

https://github.com/Cloufield/gwaslab

PREPRINT_DOI

10.51094/jxiv.370

Main citation

GWASLab preprint: He, Y., Koido, M., Shimmori, Y., Kamatani, Y. (2023). GWASLab: a Python package for processing and visualizing GWAS summary statistics. Preprint at Jxiv, 2023-5. https://doi.org/10.51094/jxiv.370

Hail

Tool

DESCRIPTION

Hail is an open-source, general-purpose, Python-based data analysis tool with additional data types and methods for working with genomic data.

URL

https://hail.is/

LDSC

Tool

PUBMED_LINK

25642630

FULL NAME

LD Score Regression

DESCRIPTION

ldsc is a command line tool for estimating heritability and genetic correlation from GWAS summary statistics. ldsc also computes LD Scores.

URL

https://github.com/bulik/ldsc

TITLE

LD Score regression distinguishes confounding from polygenicity in genome-wide association studies.

Main citation

Bulik-Sullivan BK, Loh PR, Finucane HK, Ripke S, ...&, Neale BM. (2015) LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat Genet, 47 (3) 291-5. doi:10.1038/ng.3211. PMID 25642630

ABSTRACT

Both polygenicity (many small genetic effects) and confounding biases, such as cryptic relatedness and population stratification, can yield an inflated distribution of test statistics in genome-wide association studies (GWAS). However, current methods cannot distinguish between inflation from a true polygenic signal and bias. We have developed an approach, LD Score regression, that quantifies the contribution of each by examining the relationship between test statistics and linkage disequilibrium (LD). The LD Score regression intercept can be used to estimate a more powerful and accurate correction factor than genomic control. We find strong evidence that polygenicity accounts for the majority of the inflation in test statistics in many GWAS of large sample size.
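The relationship between test statistics and LD that the abstract describes is usually written as E[χ²_j] = N h² ℓ_j / M + N a + 1, where ℓ_j is the LD Score of SNP j, N the sample size, M the number of SNPs, h² the SNP heritability, and a the contribution of confounding. The sketch below fits that line by ordinary least squares on noise-free toy values. The function name and all numbers are assumptions for illustration, and the ldsc tool itself uses a weighted regression with jackknife standard errors rather than this plain fit.

```python
import numpy as np

def ld_score_regression(chi2, ld_scores, n_samples, n_snps):
    """Unweighted fit of chi2 ~ intercept + slope * ld_score (illustrative only).

    Returns (intercept, h2), where h2 = slope * M / N.
    """
    chi2 = np.asarray(chi2, dtype=float)
    ld_scores = np.asarray(ld_scores, dtype=float)
    design = np.column_stack([np.ones_like(ld_scores), ld_scores])
    coef, *_ = np.linalg.lstsq(design, chi2, rcond=None)
    intercept, slope = coef
    return float(intercept), float(slope * n_snps / n_samples)

# Toy inputs set exactly to their expectation (assumed values, no sampling noise).
n_samples, n_snps = 50_000, 1_000_000
true_h2, true_intercept = 0.3, 1.05
rng = np.random.default_rng(1)
ld_scores = rng.uniform(1.0, 200.0, size=2_000)
chi2 = true_intercept + n_samples * true_h2 / n_snps * ld_scores

intercept, h2 = ld_score_regression(chi2, ld_scores, n_samples, n_snps)
print(round(intercept, 3), round(h2, 3))
```

An intercept above 1 points to inflation from confounding, while the slope carries the polygenic signal.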

DOI

10.1038/ng.3211

LDSTORE2

Tool

PUBMED_LINK

28942963

DESCRIPTION

LDstore is a computationally efficient program for estimating and storing Linkage Disequilibrium (SNP correlations). It combines some of the best features from RAREMETALWORKER and PLINK by implementing parallel processing using OPENMP and storing of the SNP correlations with information about the SNPs in the same binary file for fast lookups. LDstore is therefore the ideal tool for sharing SNP correlations in large-scale meta-analyses of genome-wide association studies and for on-the-fly computing/querying within web portals.
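As a small illustration of what "SNP correlations" means here, the sketch below computes pairwise Pearson correlations (r) between the SNP columns of a genotype matrix with NumPy. It only shows the quantity being stored. The helper name and the toy genotypes are assumptions, and none of LDstore's file format, parallelism, or command-line options are reproduced.

```python
import numpy as np

def ld_correlation_matrix(genotypes):
    """Pairwise Pearson correlation between SNP columns (hypothetical helper)."""
    x = np.asarray(genotypes, dtype=float)
    return np.corrcoef(x, rowvar=False)   # columns are SNPs, rows are individuals

# Toy genotypes: SNP 2 copies SNP 1; SNP 3 counts the opposite allele.
geno = np.array([
    [0, 0, 2],
    [1, 1, 1],
    [2, 2, 0],
    [1, 1, 1],
])
r = ld_correlation_matrix(geno)
r2 = r ** 2                               # r-squared, the usual LD summary
print(r)
```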

URL

http://www.christianbenner.com/#

TITLE

Prospects of Fine-Mapping Trait-Associated Genomic Regions by Using Summary Statistics from Genome-wide Association Studies.

Main citation

Benner C, Havulinna AS, Järvelin MR, Salomaa V, ...&, Pirinen M. (2017) Prospects of Fine-Mapping Trait-Associated Genomic Regions by Using Summary Statistics from Genome-wide Association Studies. Am J Hum Genet, 101 (4) 539-551. doi:10.1016/j.ajhg.2017.08.012. PMID 28942963

ABSTRACT

During the past few years, various novel statistical methods have been developed for fine-mapping with the use of summary statistics from genome-wide association studies (GWASs). Although these approaches require information about the linkage disequilibrium (LD) between variants, there has not been a comprehensive evaluation of how estimation of the LD structure from reference genotype panels performs in comparison with that from the original individual-level GWAS data. Using population genotype data from Finland and the UK Biobank, we show here that a reference panel of 1,000 individuals from the target population is adequate for a GWAS cohort of up to 10,000 individuals, whereas smaller panels, such as those from the 1000 Genomes Project, should be avoided. We also show, both theoretically and empirically, that the size of the reference panel needs to scale with the GWAS sample size; this has important consequences for the application of these methods in ongoing GWAS meta-analyses and large biobank studies. We conclude by providing software tools and by recommending practices for sharing LD information to more efficiently exploit summary statistics in genetics research.

DOI

10.1016/j.ajhg.2017.08.012

Locityper

Tool

PUBMED_LINK

41107551

DESCRIPTION

Locityper performs targeted genotyping of structurally variable and hyperpolymorphic genes—including HLA, KIR, MUC, and FCGR families—from short- or long-read whole-genome sequencing by aligning reads to locus haplotypes (often from pangenomes) and scoring depth and insert-size consistency.

URL

https://github.com/tprodanov/locityper

KEYWORDS

genotyping, complex loci, HLA, short read, long read, WGS

TITLE

Locityper enables targeted genotyping of complex polymorphic genes.

Main citation

Prodanov T, Plender EG, Seebohm G, Meuth SG, ...&, Marschall T. (2025) Locityper enables targeted genotyping of complex polymorphic genes. Nat Genet, 57 (11) 2901-2908. doi:10.1038/s41588-025-02362-4. PMID 41107551

ABSTRACT

The human genome contains many structurally variable polymorphic loci, including several hundred disease-associated genes, almost inaccessible for accurate variant calling. Here we present Locityper, a tool capable of genotyping such challenging genes using short-read and long-read whole-genome sequencing. For each target, Locityper recruits and aligns reads to locus haplotypes, for instance, extracted from a pangenome, and finds the likeliest haplotype pair by optimizing read alignment, insert size and read depth profiles. Across 256 challenging medically relevant loci, Locityper achieves a median quality value (QV) above 35 from both long-read and short-read data, outperforming state-of-the-art Illumina and PacBio HiFi variant calling pipelines by 10.9 and 1.7 points, respectively. Furthermore, Locityper provides access to hyperpolymorphic HLA genes and other gene families, including KIR, MUC and FCGR. With its low running time of 1 h 35 m per sample at eight threads, Locityper is scalable to biobank-sized cohorts, enabling association studies for previously intractable disease-relevant genes.

DOI

10.1038/s41588-025-02362-4

MungeSumstats

Tool

PUBMED_LINK

34601555

DESCRIPTION

A Bioconductor package for the standardization and quality control of many GWAS summary statistics.

URL

https://github.com/neurogenomics/MungeSumstats

TITLE

MungeSumstats: a Bioconductor package for the standardization and quality control of many GWAS summary statistics.

Main citation

Murphy AE, Schilder BM, Skene NG. (2021) MungeSumstats: a Bioconductor package for the standardization and quality control of many GWAS summary statistics. Bioinformatics, 37 (23) 4593-4596. doi:10.1093/bioinformatics/btab665. PMID 34601555

ABSTRACT

MOTIVATION: Genome-wide association studies (GWAS) summary statistics have popularized and accelerated genetic research. However, a lack of standardization of the file formats used has proven problematic when running secondary analysis tools or performing meta-analysis studies. RESULTS: To address this issue, we have developed MungeSumstats, a Bioconductor R package for the standardization and quality control of GWAS summary statistics. MungeSumstats can handle the most common summary statistic formats, including variant call format (VCF) producing a reformatted, standardized, tabular summary statistic file, VCF or R native data object. AVAILABILITY AND IMPLEMENTATION: MungeSumstats is available on Bioconductor (v 3.13) and can also be found on Github at: https://neurogenomics.github.io/MungeSumstats. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

DOI

10.1093/bioinformatics/btab665

PLINK1.9

Tool

PUBMED_LINK

17701901

DESCRIPTION

PLINK: a tool set for whole-genome association and population-based linkage analyses. This is a comprehensive update to Shaun Purcell's PLINK command-line program, developed by Christopher Chang with support from the NIH-NIDDK's Laboratory of Biological Modeling, the Purcell Lab, and others.

URL

https://www.cog-genomics.org/plink/1.9/

TITLE

PLINK: a tool set for whole-genome association and population-based linkage analyses.

Main citation

Purcell S, Neale B, Todd-Brown K, Thomas L, ...&, Sham PC. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet, 81 (3) 559-75. doi:10.1086/519795. PMID 17701901

ABSTRACT

Whole-genome association studies (WGAS) bring new computational, as well as analytic, challenges to researchers. Many existing genetic-analysis tools are not designed to handle such large data sets in a convenient manner and do not necessarily exploit the new opportunities that whole-genome data bring. To address these issues, we developed PLINK, an open-source C/C++ WGAS tool set. With PLINK, large data sets comprising hundreds of thousands of markers genotyped for thousands of individuals can be rapidly manipulated and analyzed in their entirety. As well as providing tools to make the basic analytic steps computationally efficient, PLINK also supports some novel approaches to whole-genome data that take advantage of whole-genome coverage. We introduce PLINK and describe the five main domains of function: data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation. In particular, we focus on the estimation and use of identity-by-state and identity-by-descent information in the context of population-based whole-genome studies. This information can be used to detect and correct for population stratification and to identify extended chromosomal segments that are shared identical by descent between very distantly related individuals. Analysis of the patterns of segmental sharing has the potential to map disease loci that contain multiple rare variants in a population-based linkage analysis.

DOI

10.1086/519795

PLINK2

Tool

PUBMED_LINK

25722852

DESCRIPTION

The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.

URL

https://www.cog-genomics.org/plink/2.0/

TITLE

Second-generation PLINK: rising to the challenge of larger and richer datasets.

Main citation

Chang CC, Chow CC, Tellier LC, Vattikuti S, ...&, Lee JJ. (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience, 4, 7. doi:10.1186/s13742-015-0047-8. PMID 25722852

ABSTRACT

BACKGROUND: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format. FINDINGS: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(√n)-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0). CONCLUSIONS: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.

DOI

10.1186/s13742-015-0047-8

PanMAN

Tool

PUBMED_LINK

41526696

FULL NAME

Pangenome Mutation-Annotated Network

DESCRIPTION

PanMAN is a compact pangenome representation built from mutation-annotated trees (PanMATs) linked into a network, designed to compress and query shared evolutionary history across large microbial pathogen collections.

URL

https://github.com/TurakhiaLab/panman, https://turakhia.ucsd.edu/panman/

KEYWORDS

pangenome, microbial genomics, compression, mutation-annotated tree, phylogeny

TITLE

Compressive pangenomics using mutation-annotated networks.

Main citation

Walia S, Motwani H, Tseng YH, Smith K, ...&, Turakhia Y. (2026) Compressive pangenomics using mutation-annotated networks. Nat Genet, 58 (2) 445-453. doi:10.1038/s41588-025-02478-7. PMID 41526696

ABSTRACT

Pangenomics is an emerging field that uses collections of genomes, rather than a single reference, to reduce bias and capture intra-species diversity. However, existing pangenomic data formats face challenges in scaling to millions of genomes and primarily emphasize variation, often neglecting the underlying mutational events and evolutionary relationships. This work introduces Pangenome Mutation-Annotated Network (PanMAN), a lossless pangenome representation that achieves compression ratios ranging from 3.5-1,391× in file sizes compared to existing variation-preserving formats, with performance generally improving on larger datasets. In addition to compression, PanMAN increases representational capacity by encoding detailed mutational and evolutionary histories inferred across genomes, thereby enabling new biological insights. Using PanMAN, a comprehensive SARS-CoV-2 pangenome was constructed from 8 million publicly available sequences, requiring only 366 MB of disk space. We also present 'panmanUtils', a toolkit that supports common analyses and ensures interoperability with existing software. PanMAN is poised to greatly improve the scale, speed, resolution and scope of pangenomic analysis and data sharing.

DOI

10.1038/s41588-025-02478-7

QCTOOL v2 (QCTOOL)

Tool

PUBMED_LINK

15789306

FULL NAME

QCTOOL

DESCRIPTION

QCTOOL is a command-line utility program for manipulation and quality control of GWAS datasets and other genome-wide data.

URL

https://www.well.ox.ac.uk/~gav/qctool_v2/index.html

TITLE

A note on exact tests of Hardy-Weinberg equilibrium.

Main citation

Wigginton JE, Cutler DJ, Abecasis GR. (2005) A note on exact tests of Hardy-Weinberg equilibrium. Am J Hum Genet, 76 (5) 887-93. doi:10.1086/429864. PMID 15789306

ABSTRACT

Deviations from Hardy-Weinberg equilibrium (HWE) can indicate inbreeding, population stratification, and even problems in genotyping. In samples of affected individuals, these deviations can also provide evidence for association. Tests of HWE are commonly performed using a simple χ² goodness-of-fit test. We show that this χ² test can have inflated type I error rates, even in relatively large samples (e.g., samples of 1,000 individuals that include approximately 100 copies of the minor allele). On the basis of previous work, we describe exact tests of HWE together with efficient computational methods for their implementation. Our methods adequately control type I error in large and small samples and are computationally efficient. They have been implemented in freely available code that will be useful for quality assessment of genotype data and for the detection of genetic association or population stratification in very large data sets.
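The exact test described in the abstract can be sketched as follows. Conditional on the allele counts, the probability of each possible heterozygote count is built up by a recurrence starting near the most likely value, and the p-value sums the probabilities of all outcomes no more likely than the observed one. This is an independent illustrative rewrite under those assumptions. The function name and argument order are hypothetical, and it is not the authors' released code or QCTOOL's implementation.

```python
def hwe_exact_pvalue(n_het, n_hom1, n_hom2):
    """Exact Hardy-Weinberg test p-value from genotype counts (illustrative sketch)."""
    n = n_het + n_hom1 + n_hom2
    if n == 0:
        return 1.0
    rare = 2 * min(n_hom1, n_hom2) + n_het    # copies of the rarer allele
    probs = [0.0] * (rare + 1)                # unnormalized P(heterozygotes = k)

    # Start near the expected heterozygote count, matching the parity of `rare`.
    mid = rare * (2 * n - rare) // (2 * n)
    if mid % 2 != rare % 2:
        mid += 1
    probs[mid] = 1.0
    total = 1.0

    # Walk down: each step turns two heterozygotes into one of each homozygote.
    het, hom_r = mid, (rare - mid) // 2
    hom_c = n - het - hom_r
    while het >= 2:
        probs[het - 2] = probs[het] * het * (het - 1) / (4.0 * (hom_r + 1) * (hom_c + 1))
        total += probs[het - 2]
        het, hom_r, hom_c = het - 2, hom_r + 1, hom_c + 1

    # Walk up: each step turns one of each homozygote into two heterozygotes.
    het, hom_r = mid, (rare - mid) // 2
    hom_c = n - het - hom_r
    while het <= rare - 2:
        probs[het + 2] = probs[het] * 4.0 * hom_r * hom_c / ((het + 2) * (het + 1))
        total += probs[het + 2]
        het, hom_r, hom_c = het + 2, hom_r - 1, hom_c - 1

    # Sum every outcome that is no more likely than the observed one.
    observed = probs[n_het]
    return min(1.0, sum(p for p in probs if p <= observed) / total)

print(hwe_exact_pvalue(0, 1, 1))   # two individuals, one of each homozygote
```

For the two-individual example, the only possible heterozygote counts are 0 and 2, with probabilities 1/3 and 2/3, so observing no heterozygotes gives a p-value of 1/3.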

DOI

10.1086/429864

Swave

Tool

PUBMED_LINK

41807798

DESCRIPTION

Swave calls and genotypes structural variants from assembly-based pangenome graphs by encoding mapping patterns as images (“projection waves”) and classifying signals with a recurrent neural network, including complex and repetitive SVs for population-level characterization.

URL

https://github.com/songbowang125/Swave, https://github.com/songbowang125/Swave-Utils

KEYWORDS

structural variant, pangenome graph, deep learning, RNN, population genomics

TITLE

Population-level structural variant characterization using pangenome graphs.

Main citation

Wang S, Xu T, Zhang P, Ye K. (2026) Population-level structural variant characterization using pangenome graphs. Nat Genet, 58 (3) 664-672. doi:10.1038/s41588-026-02538-6. PMID 41807798

ABSTRACT

Population-level structural variant (SV) profiling is crucial in the era of pangenomes. However, identifying SVs from genome assemblies and pangenome graphs remains a substantial challenge. Here we present Swave, a sequence-to-image, deep learning-based method that accurately resolves both simple and complex SVs, along with their population characteristics, from assembly-derived pangenome graphs. Swave introduces 'projection waves' to summarize the dotplot images that capture mapping patterns between reference and SV-indicating alleles in the pangenome. Then, a recurrent neural network distinguishes true SV signals from background noise introduced by genomic repeats. Swave demonstrates superior performance in both SV-type classification and genotyping compared with existing methods. When applied to healthy cohorts and rare-disease cohorts, Swave reveals complex and polymorphic SV patterns across human populations and identifies potentially pathogenic SVs. These advancements will facilitate the creation of comprehensive population-level SV catalogs, deepening our understanding of SVs in genetic diversity and disease associations.

DOI

10.1038/s41588-026-02538-6