Multimodal

Catalog entries using this tag (links open the entry card on its page):

Entries

CONCH

AI, Multimodal, Vision-Language, Pathology, Foundation Model, Representation Learning

PUBMED_LINK

38480913

FULL NAME

CONCH — Contrastive Learning from Captions for Histopathology

DESCRIPTION

CONCH (CONtrastive learning from Captions for Histopathology) is a vision-language foundation model pretrained on 1.17 million histopathology image-caption pairs. It achieves state-of-the-art performance across 14 diverse benchmarks covering histology image classification, segmentation, captioning, and both text-to-image and image-to-text retrieval. As a multimodal model bridging visual pathology data with biomedical text, CONCH supports zero-shot transfer and needs only minimal fine-tuning for diverse computational pathology tasks.
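
To illustrate what zero-shot transfer means for a vision-language model of this kind, here is a minimal sketch in plain NumPy. It is not the CONCH API: the function names, the toy 3-dimensional embeddings, the class names, and the temperature value are all assumptions made for illustration. In practice the embeddings would come from the model's image and text encoders.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def zero_shot_classify(image_emb, text_embs, temperature=0.07):
    """Pick the class whose text-prompt embedding is closest to the image embedding.

    image_emb: (d,) embedding of one image (hypothetical encoder output).
    text_embs: (num_classes, d) embeddings of one text prompt per class.
    temperature: assumed softmax temperature, not a value from the paper.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_embs)
    logits = (txt @ img) / temperature      # cosine similarity per class
    logits = logits - logits.max()          # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.argmax(probs)), probs

# Toy stand-ins for encoder outputs (illustrative values only).
class_names = ["tumor", "normal"]
text_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
image_emb = np.array([0.9, 0.1, 0.0])

pred, probs = zero_shot_classify(image_emb, text_embs)
print(class_names[pred], probs)
```

No labelled training images are involved here: the class is chosen purely by comparing the image embedding against text prompts, which is why a shared image-text embedding space enables zero-shot use.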


URL

https://github.com/mahmoodlab/CONCH

TITLE

A visual-language foundation model for computational pathology.

Main citation

Lu MY, Chen B, Williamson DFK, Chen RJ, Liang I, Ding T, Noor G, Sang Y, Mahmood F. (2024) A visual-language foundation model for computational pathology. Nature Medicine, 30(3):863-874. doi:10.1038/s41591-024-02856-4. PMID 38480913

ABSTRACT

The accelerated adoption of digital pathology and advances in deep learning have enabled the development of robust models for various pathology tasks. However, model training is often difficult due to label scarcity. Additionally, most models in histopathology leverage only image data. We introduce CONCH, a visual-language foundation model developed using diverse sources of histopathology images, biomedical text, and over 1.17 million image-caption pairs via task-agnostic pretraining. Evaluated on 14 diverse benchmarks, CONCH achieves state-of-the-art performance on histology image classification, segmentation, captioning, and cross-modal retrieval.


DOI

10.1038/s41591-024-02856-4

mSTAR

AI, Imaging, Pathology, Foundation Model, Multimodal, Gene Expression, Whole-Slide, HKUST

PUBMED_LINK

41387679

FULL NAME

mSTAR — Multimodal Self-TAught Pretraining (WSI + Reports + Gene Expression)

DESCRIPTION

mSTAR (Multimodal Self-TAught PRetraining) is a pathology foundation model from HKUST/SJTU that integrates three modalities: pathology slides (WSIs), expert pathology reports, and gene expression (RNA-Seq) data. It curates the largest multimodal dataset, with 26,169 slide-level modality pairs across 32 cancer types from 10,275 TCGA patients (>116M patch images). Pretraining follows a two-stage paradigm: (1) slide-level contrastive learning across the WSI, report, and gene expression modalities, and (2) self-taught training that propagates multimodal knowledge from the slide aggregator (teacher) to the patch extractor (student). The model was evaluated on 97 tasks across 15 application types, where it outperformed UNI, CONCH, CHIEF, and GigaPath. Key finding: multimodal integration yields greater improvements than simply expanding vision-only datasets (53x data efficiency vs Virchow). Published in Nat Commun, Dec 2025.
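
As a rough illustration of the first stage (slide-level contrastive learning), the sketch below computes a symmetric InfoNCE-style loss between two batches of paired embeddings in NumPy. This is an assumed, generic formulation, not the mSTAR implementation: the function names, the temperature, and the toy embeddings are hypothetical, and applying the loss to each pair of modalities (WSI-report, WSI-gene) is one possible reading of the description.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_info_nce(a, b, temperature=0.07):
    """Symmetric contrastive loss for paired embeddings.

    a, b: (n, d) arrays where row i of `a` and row i of `b` describe the
    same slide in two modalities (e.g. WSI and report). Matching rows are
    positives; all other rows in the batch act as negatives.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature         # (n, n) scaled cosine similarities
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b direction
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)

# Toy embeddings: four slides with mutually orthogonal representations.
wsi_emb = np.eye(4)
report_aligned = np.eye(4)           # each report matches its own slide
report_shuffled = np.eye(4)[::-1]    # reports paired with the wrong slides

loss_aligned = symmetric_info_nce(wsi_emb, report_aligned)
loss_shuffled = symmetric_info_nce(wsi_emb, report_shuffled)
print(loss_aligned, loss_shuffled)
```

The loss is near zero when paired embeddings agree and large when the pairing is wrong, which is the signal that pulls the modalities of the same slide together during pretraining.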


URL

https://github.com/Innse/mSTAR

TITLE

A multimodal knowledge-enhanced whole-slide pathology foundation model.

Main citation

Xu Y, Wang Y, Zhou F, Ma J, Yang S, Lin H, Wang X, Wang J, Liang L, Han A, Jin C, Cheng KT, Chen H. (2025) A multimodal knowledge-enhanced whole-slide pathology foundation model. Nature Communications, 16:11406. doi:10.1038/s41467-025-66220-x. PMID 41387679

ABSTRACT

Computational pathology has advanced through foundation models, yet faces challenges in multimodal integration and capturing whole-slide context. We present mSTAR, a pathology foundation model that incorporates three modalities (pathology slides, expert-created reports, and gene expression data) within a unified framework. Our dataset includes 26,169 slide-level modality pairs across 32 cancer types, comprising over 116 million patch images. This approach injects multimodal whole-slide context into patch representations, expanding modeling from single to multiple modalities and from patch-level to slide-level analysis. Across 97 tasks, mSTAR outperforms previous SOTA models, particularly in molecular prediction, revealing that multimodal integration yields greater improvements than simply expanding vision-only datasets.


DOI

10.1038/s41467-025-66220-x