AI GWAS Phenotyping
Curated listing of AI-based phenotyping methods for GWAS.
AI-driven Phenotype Extraction
ML methods that create or enhance phenotypes from clinical data for GWAS:
- Ensemble ML using biomarkers and multimodal features for case ascertainment (MILTON, Garg et al. PMID 39471869, Nat Genet 2024).
- Synthetic surrogates enabling GWAS on imputed phenotypes (SynSurr, McCaw et al. PMID 39468299, Nat Genet 2024).
- Multimodal topic models integrating diagnoses, procedures, and medications (MixEHR-SAGE, Yang et al. PMID 39843619, Brief Bioinform 2026).
Summary Table
| Name | Main citation | Year |
|---|---|---|
| MILTON | Garg M et al., Nat Genet, 2024 | 2024 |
| MixEHR-SAGE | Yang Z et al., Brief Bioinform, 2026 | 2026 |
| SynSurr | McCaw ZR et al., Nat Genet, 2024 | 2024 |
MILTON
FULL NAME
MILTON - Machine Learning with Phenotype Associations for Disease Prediction
DESCRIPTION
MILTON is an ensemble machine learning framework that utilizes biomarkers and multi-omics data to predict 3,213 diseases in the UK Biobank. It predicts incident disease cases undiagnosed at time of recruitment and demonstrates utility in augmenting genetic association discovery by empowering case-control GWAS with predicted phenotypes. Published in Nature Genetics.
TITLE
Disease prediction with multi-omics and biomarkers empowers case-control genetic discoveries in the UK Biobank.
ABSTRACT
The emergence of biobank-level datasets offers new opportunities to discover novel biomarkers and develop predictive algorithms for human disease. Here, we present an ensemble machine-learning framework (machine learning with phenotype associations, MILTON) utilizing a range of biomarkers to predict 3,213 diseases in the UK Biobank. MILTON predicts incident disease cases undiagnosed at time of recruitment, largely outperforming available polygenic risk scores, and augments genetic association discovery.
DOI
10.1038/s41588-024-01898-1
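EXAMPLE
The case-augmentation idea described for MILTON can be sketched roughly as follows. This is a minimal illustration and not the published pipeline: the `Participant` record, the per-model probabilities, the plain averaging used as the ensemble, and the 0.7 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Tuple


@dataclass
class Participant:
    pid: str
    diagnosed: bool            # has a recorded diagnosis for the disease
    model_probs: List[float]   # hypothetical per-model predicted probabilities


def ensemble_probability(p: Participant) -> float:
    """Average the per-model probabilities (an assumed, simplistic ensemble)."""
    return mean(p.model_probs)


def augment_cohort(
    participants: List[Participant], threshold: float = 0.7
) -> Tuple[List[str], List[str]]:
    """Split participants into cases and controls for a case-control GWAS.

    Diagnosed participants are cases. Undiagnosed participants whose
    ensemble probability reaches the (assumed) threshold are treated as
    predicted cases and therefore kept out of the control group.
    """
    cases, controls = [], []
    for p in participants:
        if p.diagnosed or ensemble_probability(p) >= threshold:
            cases.append(p.pid)
        else:
            controls.append(p.pid)
    return cases, controls


cohort = [
    Participant("A", True, [0.9, 0.8]),     # diagnosed case
    Participant("B", False, [0.85, 0.75]),  # undiagnosed, high predicted risk
    Participant("C", False, [0.1, 0.2]),    # undiagnosed, low predicted risk
]
cases, controls = augment_cohort(cohort)
print(cases, controls)
```

In this toy cohort, participant B has no diagnosis but is moved into the case group, which is the sense in which predicted phenotypes add cases to a case-control analysis.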
MixEHR-SAGE
FULL NAME
MixEHR-SAGE - Multi-modal Topic Modeling for PheWAS and GWAS
DESCRIPTION
MixEHR-SAGE is a PheCode-guided multi-modal topic model that integrates diagnoses, procedures, and medications from EHR to enhance phenotyping for GWAS. By combining expert-informed priors with probabilistic inference, it identifies over 1000 interpretable phenotype topics from UK Biobank data and improves disease incidence prediction and GWAS discovery. Published in Briefings in Bioinformatics.
TITLE
PheCode-guided multi-modal topic modeling of electronic health records improves disease incidence prediction and GWAS discovery from UK Biobank.
ABSTRACT
Phenome-wide association studies rely on disease definitions derived from diagnostic codes, often failing to leverage the full richness of electronic health records (EHR). We present MixEHR-SAGE, a PheCode-guided multi-modal topic model that integrates diagnoses, procedures, and medications to enhance phenotyping from large-scale EHRs. Applied to 350,000 individuals with high-quality genetic data, MixEHR-SAGE-derived risk scores accurately predicted disease incidence and improved GWAS discovery.
DOI
10.1093/bib/bbag030
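EXAMPLE
One generic way to combine expert-informed priors with a topic model is to seed each phenotype topic with the codes that define it. The sketch below illustrates only that idea and is not the MixEHR-SAGE model or its inference procedure: the modality-tagged tokens (`dx:`, `rx:`, `proc:`), the placeholder code names, and the `base` and `boost` pseudo-counts are hypothetical.

```python
from typing import Dict, List


def seeded_prior(
    vocab: List[str],
    seeds: Dict[str, List[str]],
    base: float = 0.01,
    boost: float = 1.0,
) -> Dict[str, Dict[str, float]]:
    """Build per-topic pseudo-counts over the vocabulary.

    Every token gets a small base weight; tokens listed as seeds for a
    topic get an extra boost, so the topic starts out anchored to them.
    """
    prior = {}
    for topic, seed_codes in seeds.items():
        seed_set = set(seed_codes)
        prior[topic] = {
            tok: base + (boost if tok in seed_set else 0.0) for tok in vocab
        }
    return prior


def normalize(weights: Dict[str, float]) -> Dict[str, float]:
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}


def topic_scores(
    record: List[str], prior: Dict[str, Dict[str, float]]
) -> Dict[str, float]:
    """Score a patient record against each topic using the prior mean only.

    A real topic model would update these distributions from data; this
    step only shows how the seeds tilt the starting point.
    """
    raw = {}
    for topic, weights in prior.items():
        probs = normalize(weights)
        score = 1.0
        for tok in record:
            score *= probs[tok]
        raw[topic] = score
    return normalize(raw)


# Placeholder tokens from three modalities (not real clinical codes).
vocab = ["dx:D1", "dx:D2", "rx:R1", "proc:P1"]
seeds = {"topic_1": ["dx:D1"], "topic_2": ["dx:D2"]}
prior = seeded_prior(vocab, seeds)

patient = ["dx:D1", "rx:R1"]
scores = topic_scores(patient, prior)
print(scores)
```

The patient record contains the seed code of `topic_1`, so that topic receives most of the weight even before any learning takes place.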
SynSurr
FULL NAME
SynSurr - Synthetic Surrogates for GWAS of Missing Phenotypes
DESCRIPTION
SynSurr (Synthetic Surrogate analysis) is a method that makes GWAS on imputed phenotypes robust to imputation errors. Rather than replacing missing values, SynSurr jointly analyzes the observed and imputed data to provide calibrated association statistics, improving power for genome-wide association studies of partially missing phenotypes in population biobanks. Published in Nature Genetics.
TITLE
Synthetic surrogates improve power for genome-wide association studies of partially missing phenotypes in population biobanks.
ABSTRACT
Within population biobanks, incomplete measurement of certain traits limits the power for genetic discovery. Machine learning is increasingly used to impute the missing values from the available data. However, performing GWAS on imputed traits can introduce spurious associations. Here we introduce SynSurr analysis, which makes GWAS on imputed phenotypes robust to imputation errors by jointly analyzing observed and imputed data.
DOI
10.1038/s41588-024-01793-9
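EXAMPLE
The idea of analyzing observed and imputed values together, rather than substituting one for the other, can be illustrated with a simple control-variate estimator. This is explicitly not the SynSurr estimator, whose details are not given in this listing; it is a simpler stand-in under the assumption that the phenotype is missing completely at random. The function names and the toy data are hypothetical.

```python
from statistics import mean
from typing import List, Optional


def slope(x: List[float], y: List[float]) -> float:
    """Ordinary least-squares slope of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def residuals(x: List[float], y: List[float]) -> List[float]:
    """Residuals of y after regressing on x."""
    b = slope(x, y)
    mx, my = mean(x), mean(y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]


def surrogate_adjusted_slope(
    g: List[float], s: List[float], y: List[Optional[float]]
) -> float:
    """Genotype effect on y, using a surrogate s available for everyone.

    g: genotype dosage for all subjects
    s: surrogate (for example a model-imputed phenotype) for all subjects
    y: measured phenotype, None where missing
    """
    obs = [i for i, yi in enumerate(y) if yi is not None]
    g_o = [g[i] for i in obs]
    s_o = [s[i] for i in obs]
    y_o = [y[i] for i in obs]

    b_y = slope(g_o, y_o)  # effect estimated from measured phenotypes only
    ry, rs = residuals(g_o, y_o), residuals(g_o, s_o)
    var_rs = sum(r * r for r in rs)
    # Control-variate coefficient: how strongly the surrogate tracks y.
    c = sum(a * b for a, b in zip(ry, rs)) / var_rs if var_rs > 0 else 0.0
    # Correct b_y by the gap between the surrogate's slope in the observed
    # subset and in the full cohort; a poor surrogate yields c near 0.
    return b_y - c * (slope(g_o, s_o) - slope(g, s))


g = [0, 1, 2, 0, 1, 2, 1, 0]
y = [0.1, 0.4, 1.2, None, 0.7, None, None, -0.2]
s = [0.1, 0.4, 1.2, 0.0, 0.7, 0.9, 0.5, -0.2]
print(surrogate_adjusted_slope(g, s, y))
```

The estimate is always anchored to the measured phenotypes, and the surrogate only contributes a correction weighted by how well it tracks them. That is the intuition for why errors in the imputed values need not translate into spurious associations.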