Dimension_reduction

Summary Table

NAME	CITATION	YEAR
EIGENSTRAT	Price AL, Patterson NJ, Plenge RM, Weinblatt ME, ...&, Reich D. (2006) Principal components analysis corrects for stratification in genome-wide association studies Nat. Genet., 38 (8) 904-909. doi:10.1038/ng1847. PMID 16862161	2006
PLINK-MDS	Purcell S, Neale B, Todd-Brown K, Thomas L, ...&, Sham PC. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses Am. J. Hum. Genet., 81 (3) 559-575. doi:10.1086/519795. PMID 17701901	2007
SuSiE PCA	Yuan D, Mancuso N. (2023) SuSiE PCA: A scalable Bayesian variable selection technique for principal component analysis iScience, 26 (11) 108181. doi:10.1016/j.isci.2023.108181. PMID 37953948	2023
UMAP	McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.	2018
t-SNE	Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of machine learning research, 9(11).	2008

EIGENSTRAT

NAME : EIGENSTRAT
SHORT NAME : EIGENSTRAT
FULL NAME : EIGENSTRAT
URL : https://github.com/DReichLab/EIG
KEYWORDS : PCA, Linear
TITLE : Principal components analysis corrects for stratification in genome-wide association studies
DOI : 10.1038/ng1847
ABSTRACT : Population stratification--allele frequency differences between cases and controls due to systematic ancestry differences--can cause spurious associations in disease studies. We describe a method that enables explicit detection and correction of population stratification on a genome-wide scale. Our method uses principal components analysis to explicitly model ancestry differences between cases and controls. The resulting correction is specific to a candidate marker's variation in frequency across ancestral populations, minimizing spurious associations while maximizing power to detect true associations. Our simple, efficient approach can easily be applied to disease studies with hundreds of thousands of markers.
CITATION : Price AL, Patterson NJ, Plenge RM, Weinblatt ME, ...&, Reich D. (2006) Principal components analysis corrects for stratification in genome-wide association studies Nat. Genet., 38 (8) 904-909. doi:10.1038/ng1847. PMID 16862161
JOURNAL_INFO : Nature genetics ; Nat. Genet. ; 2006 ; 38 ; 8 ; 904-909
PUBMED_LINK : 16862161
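
The abstract above describes using principal components of genotype data to model ancestry. The following is a minimal sketch of that idea on simulated data, not the EIGENSTRAT implementation: the function name `genotype_pcs`, the toy populations, and the simple per-SNP frequency normalization are assumptions made for illustration.

```python
import numpy as np

def genotype_pcs(genotypes, n_components=2):
    """Top principal components of an (n_samples, n_snps) matrix of 0/1/2
    allele counts with no missing values (hypothetical helper)."""
    g = np.asarray(genotypes, dtype=float)
    p = g.mean(axis=0) / 2.0                 # sample allele frequency per SNP
    keep = (p > 0) & (p < 1)                 # drop monomorphic SNPs
    g, p = g[:, keep], p[keep]
    # Center each SNP and scale by sqrt(p(1-p)) so SNPs are comparable
    z = (g - 2.0 * p) / np.sqrt(p * (1.0 - p))
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Toy data (assumption): two populations whose allele frequencies differ
rng = np.random.default_rng(0)
freq_a = rng.uniform(0.1, 0.9, size=500)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.2, size=500), 0.05, 0.95)
pop_a = rng.binomial(2, freq_a, size=(20, 500))
pop_b = rng.binomial(2, freq_b, size=(20, 500))

pcs = genotype_pcs(np.vstack([pop_a, pop_b]))
print(pcs[:3])   # first three individuals, PC1 and PC2
```

In this toy setting the first component is expected to separate the two simulated populations; in a real analysis such components would then be used to adjust the association test for ancestry.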

PLINK-MDS

NAME : PLINK-MDS
SHORT NAME : MDS
FULL NAME : multidimensional scaling
URL : https://www.cog-genomics.org/plink/1.9/strat
KEYWORDS : MDS
TITLE : PLINK: a tool set for whole-genome association and population-based linkage analyses
DOI : 10.1086/519795
ABSTRACT : Whole-genome association studies (WGAS) bring new computational, as well as analytic, challenges to researchers. Many existing genetic-analysis tools are not designed to handle such large data sets in a convenient manner and do not necessarily exploit the new opportunities that whole-genome data bring. To address these issues, we developed PLINK, an open-source C/C++ WGAS tool set. With PLINK, large data sets comprising hundreds of thousands of markers genotyped for thousands of individuals can be rapidly manipulated and analyzed in their entirety. As well as providing tools to make the basic analytic steps computationally efficient, PLINK also supports some novel approaches to whole-genome data that take advantage of whole-genome coverage. We introduce PLINK and describe the five main domains of function: data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation. In particular, we focus on the estimation and use of identity-by-state and identity-by-descent information in the context of population-based whole-genome studies. This information can be used to detect and correct for population stratification and to identify extended chromosomal segments that are shared identical by descent between very distantly related individuals. Analysis of the patterns of segmental sharing has the potential to map disease loci that contain multiple rare variants in a population-based linkage analysis.
COPYRIGHT : https://www.elsevier.com/open-access/userlicense/1.0/
CITATION : Purcell S, Neale B, Todd-Brown K, Thomas L, ...&, Sham PC. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses Am. J. Hum. Genet., 81 (3) 559-575. doi:10.1086/519795. PMID 17701901
JOURNAL_INFO : The American Journal of Human Genetics ; Am. J. Hum. Genet. ; 2007 ; 81 ; 3 ; 559-575
PUBMED_LINK : 17701901
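
As a rough illustration of how identity-by-state information can be turned into low-dimensional coordinates, the sketch below computes a pairwise distance of one minus the average proportion of shared alleles and applies classical (metric) multidimensional scaling. It is not PLINK's code: the helper names `ibs_distance` and `classical_mds` and the random toy genotypes are assumptions.

```python
import numpy as np

def ibs_distance(genotypes):
    """Pairwise distance = 1 - mean proportion of alleles shared IBS.
    Builds an n x n x m array, so it only suits small toy inputs."""
    g = np.asarray(genotypes, dtype=float)
    # |g_i - g_j| / 2 is the fraction of alleles not shared at one SNP
    return np.abs(g[:, None, :] - g[None, :, :]).mean(axis=2) / 2.0

def classical_mds(dist, n_components=2):
    """Classical MDS: eigendecomposition of the double-centered
    squared-distance matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (dist ** 2) @ centering
    vals, vecs = np.linalg.eigh(b)
    order = np.argsort(vals)[::-1][:n_components]
    vals = np.clip(vals[order], 0.0, None)   # guard against small negatives
    return vecs[:, order] * np.sqrt(vals)

# Toy genotypes (assumption): 6 individuals, 50 SNPs coded 0/1/2
rng = np.random.default_rng(1)
toy = rng.integers(0, 3, size=(6, 50))
dist = ibs_distance(toy)
coords = classical_mds(dist, n_components=2)
print(coords)
```

The resulting coordinates play the same role as principal components when plotting individuals or adjusting for stratification.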

SuSiE PCA

NAME : SuSiE PCA
SHORT NAME : SuSiE PCA
FULL NAME : SuSiE PCA
DESCRIPTION : SuSiE PCA is the abbreviation for the Sum of Single Effects model for principal component analysis. We develop SuSiE PCA for efficient variable selection in PCA when dealing with sparse, high-dimensional data, and for quantifying the uncertainty of contributing features for each latent component through posterior inclusion probabilities (PIPs).
URL : https://github.com/mancusolab/susiepca
KEYWORDS : PCA, SuSiE
TITLE : SuSiE PCA: A scalable Bayesian variable selection technique for principal component analysis
DOI : 10.1016/j.isci.2023.108181
ABSTRACT : Latent factor models, like principal component analysis (PCA), provide a statistical framework to infer low-rank representation in various biological contexts. However, feature selection is challenging when this low-rank structure manifests from a sparse subspace. We introduce SuSiE PCA, a scalable sparse latent factor approach that evaluates uncertainty in contributing variables through posterior inclusion probabilities. We validate our model in extensive simulations and demonstrate that SuSiE PCA outperforms other approaches in signal detection and model robustness. We apply SuSiE PCA to multi-tissue expression quantitative trait loci (eQTLs) data from GTEx v8 and identify tissue-specific factors and their contributing eGenes. We further investigate its performance on the large-scale perturbation data and find that SuSiE PCA identifies modules with a higher enrichment of ribosome-related genes than sparse PCA (false discovery rate [FDR] = 9.2×10^-82 vs. 1.4×10^-33), while being ~18x faster. Overall, SuSiE PCA provides an efficient tool to identify relevant features in high-dimensional biological data.
CITATION : Yuan D, Mancuso N. (2023) SuSiE PCA: A scalable Bayesian variable selection technique for principal component analysis iScience, 26 (11) 108181. doi:10.1016/j.isci.2023.108181. PMID 37953948
JOURNAL_INFO : iScience ; iScience ; 2023 ; 26 ; 11 ; 108181
PUBMED_LINK : 37953948

UMAP

NAME : UMAP
SHORT NAME : UMAP
FULL NAME : Uniform Manifold Approximation and Projection
URL : https://github.com/lmcinnes/umap
KEYWORDS : UMAP
CITATION : McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.

t-SNE

NAME : t-SNE
SHORT NAME : t-SNE
FULL NAME : t-Distributed Stochastic Neighbor Embedding
URL : https://lvdmaaten.github.io/tsne/
KEYWORDS : t-SNE
CITATION : Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of machine learning research, 9(11).