Vision-Language

Catalog entries using this tag (links open the entry card on its page):

Entries

CONCH

AI Imaging Pathology Foundation Model Vision-Language Histopathology Mahmood Lab Zero-Shot
PUBMED_LINK
38504017
FULL NAME
CONCH — Contrastive learning from Captions for Histopathology (Vision-Language Foundation Model)
DESCRIPTION
CONCH (CONtrastive learning from Captions for Histopathology) is a vision-language foundation model from Mahmood Lab (Harvard/BWH), pretrained on 1.17M histopathology image-text pairs from diverse sources (PubMed, educational resources, textbooks). It was evaluated across 14 clinically relevant tasks, including zero-shot cancer classification, text-to-image retrieval, image-to-text retrieval, caption generation, and tissue segmentation, and it outperforms baseline vision-language models including CLIP and PLIP. CONCH also works on non-H&E stains (IHC, special stains), demonstrating broad applicability. The model is available as open source for academic use.
URL
https://github.com/mahmoodlab/CONCH
TITLE
A visual-language foundation model for computational pathology.
Main citation
Lu MY, Chen B, Williamson DFK, Chen RJ, Liang I, Ding T, Jaume G, Odintsov I, Le LP, Gerber G, Parwani AV, Zhang A, Mahmood F. (2024) A visual-language foundation model for computational pathology. Nature Medicine, 30(3):863-874. doi:10.1038/s41591-024-02856-4. PMID 38504017
ABSTRACT
We introduce CONCH, a visual-language foundation model developed using diverse sources of histopathology images and text. Trained on 1.17 million pathology image-text pairs, CONCH achieves state-of-the-art performance across 14 clinically relevant tasks, including zero-shot cancer classification, text-to-image and image-to-text retrieval, caption generation, and tissue segmentation. CONCH outperforms standard models like CLIP and PLIP, and generalizes to non-H&E stains including immunohistochemistry and special stains, demonstrating its versatility as a foundation model for computational pathology.
DOI
10.1038/s41591-024-02856-4
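
The zero-shot classification mentioned in the description can be illustrated with a minimal sketch. It assumes the usual setup for contrastively trained image-text models: the image and one text prompt per class are embedded in a shared space, and the class whose prompt is most similar to the image wins. The function names, the toy 3-dimensional vectors, and the temperature value are illustrative assumptions; this is not the CONCH API, and real embeddings would come from the model's image and text encoders.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def zero_shot_classify(image_emb, prompt_embs, temperature=0.07):
    """Pick the class whose text-prompt embedding is closest to the image embedding.

    temperature is an assumed, illustrative value; it only sharpens the softmax.
    """
    img = l2_normalize(image_emb)
    logits = {}
    for label, emb in prompt_embs.items():
        txt = l2_normalize(emb)
        logits[label] = sum(a * b for a, b in zip(img, txt)) / temperature
    # Numerically stable softmax over the class logits.
    top = max(logits.values())
    exps = {label: math.exp(v - top) for label, v in logits.items()}
    total = sum(exps.values())
    probs = {label: v / total for label, v in exps.items()}
    return max(probs, key=probs.get), probs

# Hypothetical toy embeddings standing in for real encoder outputs.
prompt_embs = {
    "invasive ductal carcinoma": [0.9, 0.1, 0.0],
    "invasive lobular carcinoma": [0.1, 0.9, 0.0],
    "normal breast tissue": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.3, 0.1]

label, probs = zero_shot_classify(image_emb, prompt_embs)
print(label)
```

No labeled training images are needed in this scheme: adding a class only requires writing a new text prompt and embedding it.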

CONCH

AI Multimodal Vision-Language Pathology Foundation Model Representation Learning
PUBMED_LINK
38504017
FULL NAME
CONCH — Contrastive Learning from Captions for Histopathology
DESCRIPTION
CONCH (CONtrastive learning from Captions for Histopathology) is a vision-language foundation model pretrained on 1.17 million histopathology image-caption pairs. It achieves state-of-the-art performance across 14 diverse benchmarks, including histology image classification, segmentation, captioning, and text-to-image and image-to-text retrieval. As a multimodal model bridging visual pathology data with biomedical text, CONCH enables zero-shot transfer and minimal fine-tuning for diverse computational pathology tasks.
URL
https://github.com/mahmoodlab/CONCH
TITLE
A visual-language foundation model for computational pathology.
Main citation
Lu MY, Chen B, Williamson DFK, Chen RJ, Liang I, Ding T, Jaume G, Odintsov I, Le LP, Gerber G, Parwani AV, Zhang A, Mahmood F. (2024) A visual-language foundation model for computational pathology. Nature Medicine, 30(3):863-874. doi:10.1038/s41591-024-02856-4. PMID 38504017
ABSTRACT
The accelerated adoption of digital pathology and advances in deep learning have enabled the development of robust models for various pathology tasks. However, model training is often difficult due to label scarcity. Additionally, most models in histopathology leverage only image data. We introduce CONCH, a visual-language foundation model developed using diverse sources of histopathology images, biomedical text, and over 1.17 million image-caption pairs via task-agnostic pretraining. Evaluated on 14 diverse benchmarks, CONCH achieves state-of-the-art performance on histology image classification, segmentation, captioning, and cross-modal retrieval.
DOI
10.1038/s41591-024-02856-4

KEEP

AI Imaging Pathology Foundation Model Vision-Language Knowledge Graph Rare Cancer Cancer Cell
PUBMED_LINK
41720085
FULL NAME
KEEP — Knowledge-Enhanced Pathology Vision-Language Foundation Model
DESCRIPTION
KEEP (KnowledgE-Enhanced Pathology) is a vision-language foundation model from Shanghai AI Lab / SJTU that systematically integrates disease knowledge into pretraining for cancer diagnosis. It uses a comprehensive disease knowledge graph with 11,454 diseases and 139,143 attributes, drawn from the Disease Ontology (DO) and the Unified Medical Language System (UMLS), to reorganize millions of pathology image-text pairs into 143,000 semantically structured groups aligned with disease ontology hierarchies. Across 18 public benchmarks (14,000+ WSIs) and 4 institutional rare cancer datasets (926 cases), KEEP consistently outperforms existing foundation models (CHIEF, CONCH, UNI), with substantial gains for rare subtypes (+8.5 points balanced accuracy vs CONCH on 30 rare brain cancers). Published in Cancer Cell, Feb 2026.
URL
https://github.com/MAGIC-AI4Med/KEEP
TITLE
Knowledge-enhanced pretraining for vision-language pathology foundation model on cancer diagnosis.
Main citation
Zhou X, Sun L, He D, Guan W, Wang G, Wang R, Wang L, Yuan X, Sun X, Zhang Y, Sun K, Wang Y, Xie W. (2026) Knowledge-enhanced pretraining for vision-language pathology foundation model on cancer diagnosis. Cancer Cell, 44(4):777-791. doi:10.1016/j.ccell.2026.01.019. PMID 41720085
ABSTRACT
Vision-language foundation models have shown great promise in computational pathology but remain primarily data-driven, lacking explicit integration of medical knowledge. We introduce KEEP, a foundation model that systematically incorporates disease knowledge into pretraining for cancer diagnosis. KEEP leverages a comprehensive disease knowledge graph encompassing 11,454 diseases and 139,143 attributes to reorganize millions of pathology image-text pairs into 143,000 semantically structured groups aligned with disease ontology hierarchies. Across 18 public benchmarks (over 14,000 WSIs) and 4 institutional rare cancer datasets (926 cases), KEEP consistently outperformed existing foundation models, showing substantial gains for rare subtypes.
DOI
10.1016/j.ccell.2026.01.019
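
The description reports gains in balanced accuracy, a metric that matters when some classes are rare. As a minimal sketch, balanced accuracy is the mean of per-class recall, so a rare subtype counts as much as a common one. The labels below are hypothetical toy data, not results from the paper, and the function is a plain-Python illustration rather than the authors' evaluation code.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall: every class gets equal weight regardless of size."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[cls] / total[cls] for cls in total]
    return sum(recalls) / len(recalls)

# Hypothetical toy labels: 8 cases of a common subtype, 2 of a rare subtype.
y_true = ["common"] * 8 + ["rare"] * 2
# The model gets every common case right but misses one of the two rare cases.
y_pred = ["common"] * 8 + ["common", "rare"]

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal = balanced_accuracy(y_true, y_pred)
print(plain, bal)  # 0.9 0.75
```

Plain accuracy hides the missed rare case (0.9), while balanced accuracy exposes it (0.75), which is why it is a natural choice for rare-subtype benchmarks.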

TITAN

AI Imaging Pathology Foundation Model Vision-Language Whole-Slide Mahmood Lab
PUBMED_LINK
41193692
FULL NAME
TITAN — Transformer-based pathology Image and Text Alignment Network
DESCRIPTION
TITAN (Transformer-based pathology Image and Text Alignment Network) is a multimodal whole-slide foundation model from Mahmood Lab (Harvard/BWH). It is pretrained on 335,645 WSIs via visual self-supervised learning and vision-language alignment with 423K synthetic captions from PathChat and 183K pathology reports. Without any fine-tuning, TITAN produces general-purpose slide representations for zero-shot classification, rare cancer retrieval, cross-modal retrieval, and pathology report generation. It outperforms both region-of-interest (ROI) and slide foundation models across diverse clinical tasks.
URL
https://github.com/mahmoodlab/TITAN
TITLE
A multimodal whole-slide foundation model for pathology.
Main citation
Ding T, Wagner SJ, Song AH, Chen RJ, Lu MY, Zhang A, Vaidya AJ, Jaume G, Shaban M, Kim A, Williamson DFK, Oldenburg L, Chen B, Alajaji A, Noor G, Sang Y, Peng T, Le LP, Mahmood F. (2025) A multimodal whole-slide foundation model for pathology. Nature Medicine, 31:3749-3761. doi:10.1038/s41591-025-03982-3. PMID 41193692
ABSTRACT
The field of computational pathology has been transformed with recent advances in foundation models that encode histopathology regions of interest into versatile feature representations. However, translating these advancements to address complex clinical challenges at the patient and slide level remains constrained by limited clinical data. We propose TITAN, a multimodal whole-slide foundation model pretrained using 335,645 whole-slide images via visual self-supervised learning and vision-language alignment with pathology reports and 423,122 synthetic captions. Without any fine-tuning, TITAN can extract general-purpose slide representations and generate pathology reports that generalize to resource-limited clinical scenarios such as rare disease retrieval and cancer prognosis.
DOI
10.1038/s41591-025-03982-3
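
The cross-modal retrieval mentioned in the description can be sketched as a nearest-neighbor search: a text query and a set of slides are embedded in a shared space, and slides are ranked by cosine similarity to the query. The slide IDs, the toy 3-dimensional vectors, and the function names are hypothetical; this is not the TITAN API, and real embeddings would come from the model's slide and text encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_emb, slide_embs, k=2):
    """Return the IDs of the k slides most similar to the query, best first."""
    scored = [(cosine(query_emb, emb), slide_id) for slide_id, emb in slide_embs.items()]
    scored.sort(reverse=True)
    return [slide_id for _, slide_id in scored[:k]]

# Hypothetical toy slide embeddings standing in for real encoder outputs.
slide_embs = {
    "slide_A": [0.9, 0.1, 0.1],
    "slide_B": [0.2, 0.8, 0.1],
    "slide_C": [0.7, 0.4, 0.0],
}
# Hypothetical embedding of a text query, e.g. a short diagnostic phrase.
query_emb = [1.0, 0.2, 0.0]

print(retrieve_top_k(query_emb, slide_embs))  # ['slide_A', 'slide_C']
```

The same ranking works in the other direction (slide query, text candidates) because both modalities share one embedding space.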