Dimension reduction

Summary Table

NAME | CITATION | YEAR
EIGENSTRAT | Price AL, Patterson NJ, Plenge RM, Weinblatt ME, ...&, Reich D. (2006) Principal components analysis corrects for stratification in genome-wide association studies Nat. Genet., 38 (8) 904-909. doi:10.1038/ng1847. PMID 16862161 | 2006
PLINK-MDS | Purcell S, Neale B, Todd-Brown K, Thomas L, ...&, Sham PC. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses Am. J. Hum. Genet., 81 (3) 559-575. doi:10.1086/519795. PMID 17701901 | 2007
SuSiE PCA | Yuan D, Mancuso N. (2023) SuSiE PCA: A scalable Bayesian variable selection technique for principal component analysis iScience, 26 (11) 108181. doi:10.1016/j.isci.2023.108181. PMID 37953948 | 2023
UMAP | McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426. | 2018
t-SNE | Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11). | 2008

EIGENSTRAT

  • NAME : EIGENSTRAT
  • SHORT NAME : EIGENSTRAT
  • FULL NAME : EIGENSTRAT
  • URL : https://github.com/DReichLab/EIG
  • KEYWORDS : PCA, Linear
  • TITLE : Principal components analysis corrects for stratification in genome-wide association studies
  • DOI : 10.1038/ng1847
  • ABSTRACT : Population stratification (allele frequency differences between cases and controls due to systematic ancestry differences) can cause spurious associations in disease studies. We describe a method that enables explicit detection and correction of population stratification on a genome-wide scale. Our method uses principal components analysis to explicitly model ancestry differences between cases and controls. The resulting correction is specific to a candidate marker's variation in frequency across ancestral populations, minimizing spurious associations while maximizing power to detect true associations. Our simple, efficient approach can easily be applied to disease studies with hundreds of thousands of markers.
  • CITATION : Price AL, Patterson NJ, Plenge RM, Weinblatt ME, ...&, Reich D. (2006) Principal components analysis corrects for stratification in genome-wide association studies Nat. Genet., 38 (8) 904-909. doi:10.1038/ng1847. PMID 16862161
  • JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2006 ; 38 ; 8 ; 904-909
  • PUBMED_LINK : 16862161
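
The idea in the EIGENSTRAT abstract (infer ancestry axes by principal components analysis, then adjust for them) can be illustrated with a minimal NumPy sketch. This is not the EIG software: the function names `top_pcs` and `residualize`, the toy two-population data, and the choice of two components are all assumptions made for illustration, and the final association statistic is omitted.

```python
import numpy as np

def top_pcs(genotypes, k=2):
    """Top-k principal component scores for an (individuals x SNPs) 0/1/2 matrix."""
    g = np.asarray(genotypes, dtype=float)
    freq = g.mean(axis=0) / 2.0                      # per-SNP allele frequency
    keep = (freq > 0) & (freq < 1)                   # drop monomorphic SNPs
    g, freq = g[:, keep], freq[keep]
    z = (g - 2 * freq) / np.sqrt(freq * (1 - freq))  # centre and scale each SNP
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :k] * s[:k]

def residualize(values, pcs):
    """Remove the part of `values` that is linearly explained by the PCs."""
    design = np.column_stack([np.ones(len(pcs)), pcs])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ coef

# Hypothetical toy data: two populations of 50 individuals whose allele
# frequencies differ, with the population label standing in for a phenotype.
rng = np.random.default_rng(0)
freq_a = rng.uniform(0.1, 0.9, size=500)
freq_b = np.clip(freq_a + rng.normal(0, 0.15, size=500), 0.05, 0.95)
geno = np.vstack([rng.binomial(2, freq_a, size=(50, 500)),
                  rng.binomial(2, freq_b, size=(50, 500))])
labels = np.repeat([0, 1], 50)

pcs = top_pcs(geno, k=2)                             # inferred ancestry axes
adjusted = residualize(labels.astype(float), pcs)    # ancestry-adjusted phenotype
```

In this sketch the first component should track the simulated population split, and the adjusted phenotype is orthogonal to the retained components. A full analysis would adjust each candidate marker's genotypes in the same way before testing for association.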

PLINK-MDS

  • NAME : PLINK-MDS
  • SHORT NAME : MDS
  • FULL NAME : multidimensional scaling
  • URL : https://www.cog-genomics.org/plink/1.9/strat
  • KEYWORDS : MDS
  • TITLE : PLINK: a tool set for whole-genome association and population-based linkage analyses
  • DOI : 10.1086/519795
  • ABSTRACT : Whole-genome association studies (WGAS) bring new computational, as well as analytic, challenges to researchers. Many existing genetic-analysis tools are not designed to handle such large data sets in a convenient manner and do not necessarily exploit the new opportunities that whole-genome data bring. To address these issues, we developed PLINK, an open-source C/C++ WGAS tool set. With PLINK, large data sets comprising hundreds of thousands of markers genotyped for thousands of individuals can be rapidly manipulated and analyzed in their entirety. As well as providing tools to make the basic analytic steps computationally efficient, PLINK also supports some novel approaches to whole-genome data that take advantage of whole-genome coverage. We introduce PLINK and describe the five main domains of function: data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation. In particular, we focus on the estimation and use of identity-by-state and identity-by-descent information in the context of population-based whole-genome studies. This information can be used to detect and correct for population stratification and to identify extended chromosomal segments that are shared identical by descent between very distantly related individuals. Analysis of the patterns of segmental sharing has the potential to map disease loci that contain multiple rare variants in a population-based linkage analysis.
  • COPYRIGHT : https://www.elsevier.com/open-access/userlicense/1.0/
  • CITATION : Purcell S, Neale B, Todd-Brown K, Thomas L, ...&, Sham PC. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses Am. J. Hum. Genet., 81 (3) 559-575. doi:10.1086/519795. PMID 17701901
  • JOURNAL_INFO : The American Journal of Human Genetics ; Am. J. Hum. Genet. ; 2007 ; 81 ; 3 ; 559-575
  • PUBMED_LINK : 17701901
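
As a rough illustration of multidimensional scaling on identity-by-state (IBS) information, the sketch below builds a pairwise IBS distance matrix and applies classical (Torgerson) MDS with NumPy. It is not PLINK's implementation: the helper names `ibs_distance` and `classical_mds`, the definition of distance as one minus the mean proportion of shared alleles, and the simulated data are assumptions made for illustration.

```python
import numpy as np

def ibs_distance(genotypes):
    """Pairwise distance = 1 - IBS, with IBS the mean proportion of shared alleles."""
    g = np.asarray(genotypes, dtype=float)
    # Mean absolute genotype difference per SNP, divided by 2 to lie in [0, 1].
    return np.abs(g[:, None, :] - g[None, :, :]).mean(axis=2) / 2.0

def classical_mds(dist, k=2):
    """Classical MDS: double-centre squared distances, then eigendecompose."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    vals, vecs = np.linalg.eigh(gram)                # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]               # indices of the k largest
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))

# Hypothetical toy data: two populations of 30 individuals, 400 SNPs.
rng = np.random.default_rng(1)
freq_a = rng.uniform(0.1, 0.9, size=400)
freq_b = np.clip(freq_a + rng.normal(0, 0.2, size=400), 0.05, 0.95)
geno = np.vstack([rng.binomial(2, freq_a, size=(30, 400)),
                  rng.binomial(2, freq_b, size=(30, 400))])
labels = np.repeat([0, 1], 30)

dist = ibs_distance(geno)
coords = classical_mds(dist, k=2)                    # one row of coordinates per individual
```

The pairwise array in `ibs_distance` grows with the square of the sample size, so this form only suits small examples; the first MDS coordinate is expected to separate the two simulated populations.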

SuSiE PCA

  • NAME : SuSiE PCA
  • SHORT NAME : SuSiE PCA
  • FULL NAME : SuSiE PCA
  • DESCRIPTION : SuSiE PCA stands for the Sum of Single Effects model for principal component analysis. We develop SuSiE PCA for efficient variable selection in PCA when dealing with sparse, high-dimensional data, and for quantifying the uncertainty of the features contributing to each latent component through posterior inclusion probabilities (PIPs).
  • URL : https://github.com/mancusolab/susiepca
  • KEYWORDS : PCA, SuSiE
  • TITLE : SuSiE PCA: A scalable Bayesian variable selection technique for principal component analysis
  • DOI : 10.1016/j.isci.2023.108181
  • ABSTRACT : Latent factor models, like principal component analysis (PCA), provide a statistical framework to infer low-rank representation in various biological contexts. However, feature selection is challenging when this low-rank structure manifests from a sparse subspace. We introduce SuSiE PCA, a scalable sparse latent factor approach that evaluates uncertainty in contributing variables through posterior inclusion probabilities. We validate our model in extensive simulations and demonstrate that SuSiE PCA outperforms other approaches in signal detection and model robustness. We apply SuSiE PCA to multi-tissue expression quantitative trait loci (eQTLs) data from GTEx v8 and identify tissue-specific factors and their contributing eGenes. We further investigate its performance on the large-scale perturbation data and find that SuSiE PCA identifies modules with a higher enrichment of ribosome-related genes than sparse PCA (false discovery rate [FDR] = 9.2×10^-82 vs. 1.4×10^-33), while being ~18x faster. Overall, SuSiE PCA provides an efficient tool to identify relevant features in high-dimensional biological data.
  • CITATION : Yuan D, Mancuso N. (2023) SuSiE PCA: A scalable Bayesian variable selection technique for principal component analysis iScience, 26 (11) 108181. doi:10.1016/j.isci.2023.108181. PMID 37953948
  • JOURNAL_INFO : iScience ; iScience ; 2023 ; 26 ; 11 ; 108181
  • PUBMED_LINK : 37953948

UMAP

  • NAME : UMAP
  • SHORT NAME : UMAP
  • FULL NAME : Uniform Manifold Approximation and Projection
  • URL : https://github.com/lmcinnes/umap
  • KEYWORDS : UMAP
  • CITATION : McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.

t-SNE

  • NAME : t-SNE
  • SHORT NAME : t-SNE
  • FULL NAME : t-Distributed Stochastic Neighbor Embedding
  • URL : https://lvdmaaten.github.io/tsne/
  • KEYWORDS : t-SNE
  • CITATION : Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11).