Datasets
Catalog entries using this tag (links open the entry card on its page):
Entries
BRFSS
FULL NAME
Behavioral Risk Factor Surveillance System
DESCRIPTION
BRFSS (Behavioral Risk Factor Surveillance System) is the nation's premier system of health-related telephone surveys, conducted by the CDC and state health departments. Established in 1984, it collects state-level data from US adult residents on health-related risk behaviors, chronic health conditions, and use of preventive services. It completes over 400,000 interviews annually across all 50 states, DC, and US territories. Topics include high blood pressure, cholesterol, diabetes, asthma, BMI/obesity, smoking/tobacco use, alcohol consumption, physical activity, diet, cancer screening (breast, cervical, colorectal), immunizations, HIV/AIDS, mental health, and healthcare access. The data include demographic variables (age, sex, race/ethnicity, income, education) and survey weights for population-representative estimates. The data are free and publicly available on the CDC website, with no registration or data use agreement required, in SAS, ASCII, and CSV formats. BRFSS is often used alongside NHANES for complementary population health analyses: BRFSS offers a larger sample and broader coverage, while NHANES offers deeper phenotyping with physical exams and lab measurements.
KEYWORDS
BRFSS, CDC, behavioral risk, telephone survey, public health, chronic disease, risk factors, US surveillance
Main citation
CDC. Behavioral Risk Factor Surveillance System Survey Data. Atlanta, GA: US Department of Health and Human Services, Centers for Disease Control and Prevention. https://www.cdc.gov/brfss/
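The BRFSS description mentions survey weights for population-representative estimates. A minimal sketch of what that means in practice is below. The record layout and the key names `smoker` and `weight` are hypothetical toy values, not BRFSS variable names, and the function computes a weighted point estimate only; standard errors for a complex survey also need the design's strata and sampling units, which this sketch does not handle.

```python
def weighted_prevalence(records, indicator_key, weight_key):
    """Return the weighted share of respondents whose indicator equals 1.

    Records with a missing indicator (None) are skipped, so they
    contribute to neither the numerator nor the denominator.
    """
    numerator = 0.0
    denominator = 0.0
    for record in records:
        value = record[indicator_key]
        if value is None:
            continue
        weight = record[weight_key]
        numerator += weight * value
        denominator += weight
    if denominator == 0:
        raise ValueError("no usable records")
    return numerator / denominator


# Hypothetical toy respondents: 1 = current smoker, 0 = not, None = missing.
respondents = [
    {"smoker": 1, "weight": 100.0},
    {"smoker": 0, "weight": 300.0},
    {"smoker": 1, "weight": 50.0},
    {"smoker": 0, "weight": 50.0},
    {"smoker": None, "weight": 500.0},
]

estimate = weighted_prevalence(respondents, "smoker", "weight")
print(f"weighted prevalence: {estimate:.2f}")
```

In this toy sample, half of the respondents with a usable answer are smokers, but the weighted estimate is lower because the non-smokers carry more weight. That gap is why unweighted tabulations of survey data can mislead.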
eICU Collaborative Research Database (eICU-CRD)
DESCRIPTION
The eICU Collaborative Research Database (eICU-CRD) is a large multi-center intensive care unit database from Philips Healthcare's eICU telehealth program, built in partnership with the MIT Laboratory for Computational Physiology. It contains de-identified data for over 200,000 admissions from about 139,000 unique patients across 335 units at 208 US hospitals (2014-2015). Contents include demographics, vital signs, care plan documentation, severity of illness measures (APACHE IV), diagnoses (3,933 unique active problems), laboratory measurements (158 lab types), medications, continuous infusions, intake/output, microbiology, nurse charting, and structured notes. Data access follows the same process as MIMIC: it is free of charge but requires PhysioNet credentialed access (CITI Data or Specimens Only Research course plus a Data Use Agreement). eICU-CRD complements MIMIC-IV (single-center, Boston) with multi-center US coverage, which supports external validation and generalizability testing of ICU ML models.
KEYWORDS
eICU, critical care, ICU, multicenter, Philips, PhysioNet, vital signs, severity of illness
TITLE
The eICU Collaborative Research Database, a freely available multi-center database for critical care research.
Main citation
Pollard TJ, Johnson AEW, Raffa JD, Celi LA, Mark RG, Badawi O. (2018) The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Scientific Data, 5:180178. doi:10.1038/sdata.2018.178. PMID 30204154
ABSTRACT
Critical care patients are monitored closely through the course of their illness. Philips Healthcare has developed a telehealth system, the eICU Program, which leverages these data to support management of critically ill patients. Here we describe the eICU Collaborative Research Database, a multi-center intensive care unit (ICU) database with high granularity data for over 200,000 admissions to ICUs monitored by eICU Programs across the United States.
DOI
10.1038/sdata.2018.178
MIMIC-IV
FULL NAME
Medical Information Mart for Intensive Care IV
DESCRIPTION
MIMIC-IV is a large, freely available, de-identified clinical database comprising over 300,000 patients admitted to the Beth Israel Deaconess Medical Center (2008-2019). It includes comprehensive ICU and Emergency Department data: demographics, vital signs, laboratory measurements, medications, procedures, diagnoses (ICD codes), imaging reports, nursing notes, and mortality outcomes. The relational database (BigQuery or local PostgreSQL) links hospital admissions (admissions), ICU stays (icustays), charted observations (chartevents), lab events (labevents), microbiology data (microbiologyevents), prescriptions (prescriptions), and discharge summaries. MIMIC-IV replaces MIMIC-III (2001-2012) with a modernized schema, cleaner data model, and expanded coverage. It is widely used for developing and benchmarking clinical AI models (mortality prediction, sepsis detection, phenotyping, NLP). Access requires PhysioNet credentialing (CITI Data or Specimens Only Research course). Supporting datasets include MIMIC-CXR (chest X-ray images) and MIMIC-IV-Note (de-identified clinical notes).
KEYWORDS
EHR, ICU, clinical database, de-identified, Beth Israel, critical care, MIMIC, medical informatics
TITLE
MIMIC-IV, a freely accessible electronic health record dataset.
Main citation
Johnson AEW, Bulgarelli L, Shen L, Gayles A, Shammout A, Horng S, Pollard TJ, Hao S, Moody B, Gow B, Lehman LH, Celi LA, Mark RG. (2023) MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data, 10:1. doi:10.1038/s41597-022-01899-x. PMID 36596836
ABSTRACT
MIMIC-IV is a publicly available database of de-identified electronic health records for patients admitted to the Beth Israel Deaconess Medical Center (BIDMC) in Boston, Massachusetts. The database is updated annually and is freely available to credentialed researchers. MIMIC-IV contains information on patient demographic characteristics, vital signs, laboratory measurements, medications, and diagnoses. We describe the process of creating the database, the structure of the data, and the tools available to users. MIMIC-IV is a valuable resource for researchers in critical care, clinical informatics, and machine learning.
DOI
10.1038/s41597-022-01899-x
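The MIMIC-IV description notes that the relational schema links hospital admissions to ICU stays. A minimal sketch of that linkage follows, using an in-memory SQLite database in place of BigQuery or PostgreSQL. The two toy tables carry only a few columns (`subject_id`, `hadm_id`, `stay_id`, `admittime`, `los`), the rows are invented for illustration, and the helper name `icu_stays_per_admission` is hypothetical.

```python
import sqlite3

# Toy stand-in for the real database: two tables with a handful of columns.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE admissions (
        subject_id INTEGER,
        hadm_id    INTEGER PRIMARY KEY,
        admittime  TEXT
    );
    CREATE TABLE icustays (
        subject_id INTEGER,
        hadm_id    INTEGER,
        stay_id    INTEGER PRIMARY KEY,
        los        REAL
    );
    INSERT INTO admissions VALUES
        (1, 100, '2150-01-01'),
        (1, 101, '2150-03-01'),
        (2, 200, '2151-06-15');
    INSERT INTO icustays VALUES
        (1, 100, 9000, 2.5),
        (1, 100, 9001, 1.0),
        (2, 200, 9002, 4.0);
    """
)


def icu_stays_per_admission(connection):
    """Count ICU stays for every hospital admission, including those with none."""
    query = """
        SELECT a.subject_id, a.hadm_id, COUNT(i.stay_id) AS n_icu_stays
        FROM admissions AS a
        LEFT JOIN icustays AS i ON i.hadm_id = a.hadm_id
        GROUP BY a.subject_id, a.hadm_id
        ORDER BY a.hadm_id
    """
    return connection.execute(query).fetchall()


rows = icu_stays_per_admission(conn)
for subject_id, hadm_id, n_icu_stays in rows:
    print(subject_id, hadm_id, n_icu_stays)
```

The LEFT JOIN keeps admissions that never reached an ICU, which matters when a cohort is defined at the hospital-admission level rather than the ICU-stay level.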
NHANES
FULL NAME
National Health and Nutrition Examination Survey
DESCRIPTION
NHANES is a program of studies by the US CDC designed to assess the health and nutritional status of adults and children in the United States. It combines interviews, physical examinations, and laboratory tests across 2-year continuous cycles (1999-present). The survey includes: demographics (age, sex, race/ethnicity, income); 40 or more questionnaire components (diabetes DIQ, cardiovascular CDQ, depression PHQ-9/DPQ, sleep SLQ, smoking SMQ, alcohol ALQ, kidney KIQ, physical function PFQ, osteoporosis OSQ, weight history WHQ); physical examinations (anthropometry BMX, blood pressure BPX, DXA bone density, spirometry SPX, hearing AUXAR, vision VIX, accelerometry PAXRAW); and 50 or more laboratory biomarkers (comprehensive metabolic panel BIOPRO, complete blood count CBC, HbA1c GHB, glucose GLU, insulin INS, lipids TCHOL/HDL, CRP/HSCRP, vitamin D VID, ferritin FERTIN, testosterone TST, hepatitis/HIV/HPV serology, PFAS chemicals, heavy metals, cotinine). Dietary intake data include 2-day 24-hour recall (DR1IFF/DR2IFF). NHANES is multi-ethnic and US-representative, with survey weights. In total, the data comprise about 1,600 SAS XPT files (about 5 GB). The data are available at no cost, with no registration required. NHANES is widely used for epidemiological research, biomarker GWAS replication, and cross-population comparisons with biobank data (UKBB, BBJ).
KEYWORDS
NHANES, CDC, health survey, biomarkers, nutrition, demographics, epidemiology, US representative, public health
Main citation
CDC/National Center for Health Statistics. National Health and Nutrition Examination Survey Data. Hyattsville, MD: US Department of Health and Human Services, Centers for Disease Control and Prevention. https://www.cdc.gov/nchs/nhanes/
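Because NHANES ships each component (demographics, BMX, laboratory panels, and so on) as a separate file, most analyses start by joining components on the respondent sequence number, SEQN. A minimal sketch of that join on toy rows follows. The values are invented, the helper name `merge_on_seqn` is hypothetical, and RIDAGEYR (age in years) and BMXBMI (body mass index) are used as example columns.

```python
def merge_on_seqn(left, right):
    """Inner-join two lists of row dicts on the SEQN respondent identifier."""
    right_by_seqn = {row["SEQN"]: row for row in right}
    merged = []
    for row in left:
        match = right_by_seqn.get(row["SEQN"])
        if match is not None:
            # Columns from both components end up in one combined row.
            merged.append({**row, **match})
    return merged


# Toy demographics rows: every sampled person has one.
demographics = [
    {"SEQN": 1, "RIDAGEYR": 34},
    {"SEQN": 2, "RIDAGEYR": 61},
    {"SEQN": 3, "RIDAGEYR": 8},
]

# Toy body-measures rows: respondent 2 has no examination record here.
body_measures = [
    {"SEQN": 1, "BMXBMI": 24.1},
    {"SEQN": 3, "BMXBMI": 16.8},
]

merged_rows = merge_on_seqn(demographics, body_measures)
for row in merged_rows:
    print(row)
```

With real files, the same join is typically done after loading each XPT file into a data frame, for example with `pandas.read_sas(path, format="xport")` followed by a merge on `SEQN`. An inner join drops respondents missing from either component, so the choice of join type and of the matching survey weight should follow the analysis population.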
PTB-XL
FULL NAME
PTB-XL: A Large Publicly Available Electrocardiography Dataset
DESCRIPTION
PTB-XL was, at its 2020 publication, the largest freely accessible clinical 12-lead ECG-waveform dataset, comprising 21,837 10-second records from 18,885 patients. Records are annotated by up to two cardiologists with 71 SCP-ECG statements (diagnostic, form, and rhythm); the diagnostic statements are organized into 5 superclasses (NORM, CD, MI, HYP, STTC) and 24 subclasses. The dataset includes raw waveforms at 500 Hz and downsampled versions at 100 Hz, plus rich metadata: demographics (age, sex, height, weight), signal quality annotations (noise, baseline drift, electrodes), and recommended stratified 10-fold cross-validation splits. It is fully open access on PhysioNet, with no registration or training required, and is widely used as a standard benchmark for automated ECG interpretation, arrhythmia detection, and deep learning in cardiology.
KEYWORDS
ECG, electrocardiography, 12-lead, cardiovascular, waveform, signal processing, PhysioNet
TITLE
PTB-XL, a large publicly available electrocardiography dataset.
Main citation
Wagner P, Strodthoff N, Bousseljot RD, Kreiseler D, Lunze FI, Samek W, Schaeffter T. (2020) PTB-XL, a large publicly available electrocardiography dataset. Scientific Data, 7:154. doi:10.1038/s41597-020-0495-6.
ABSTRACT
Electrocardiography (ECG) is a key non-invasive diagnostic tool for cardiovascular diseases which is increasingly supported by algorithms based on machine learning. Major obstacles for the development of automatic ECG interpretation algorithms are both the lack of public datasets and well-defined benchmarking procedures to allow comparisons of different algorithms. To address these issues, we put forward PTB-XL, the to-date largest freely accessible clinical 12-lead ECG-waveform dataset comprising 21837 records from 18885 patients of 10 seconds length.
DOI
10.1038/s41597-020-0495-6
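The PTB-XL description mentions recommended stratified folds and two sampling rates. A minimal sketch of how the fold assignments are commonly turned into train/validation/test splits is below. It assumes each metadata row carries an `ecg_id` and a `strat_fold` value from 1 to 10, and it follows the convention of training on folds 1-8, validating on fold 9, and testing on fold 10. The toy rows and both helper names are hypothetical.

```python
def split_by_fold(records, val_fold=9, test_fold=10):
    """Partition record ids into train/val/test using each row's strat_fold."""
    splits = {"train": [], "val": [], "test": []}
    for record in records:
        fold = int(record["strat_fold"])
        if not 1 <= fold <= 10:
            raise ValueError(f"unexpected fold {fold}")
        if fold == test_fold:
            splits["test"].append(record["ecg_id"])
        elif fold == val_fold:
            splits["val"].append(record["ecg_id"])
        else:
            splits["train"].append(record["ecg_id"])
    return splits


def samples_per_lead(duration_s, sampling_rate_hz):
    """Number of samples in one lead of a recording."""
    return int(duration_s * sampling_rate_hz)


# Toy metadata: 20 records cycling through folds 1..10 (values as strings,
# as they would be when read with csv.DictReader).
toy_records = [
    {"ecg_id": i, "strat_fold": str((i % 10) + 1)} for i in range(1, 21)
]

splits = split_by_fold(toy_records)
print({name: len(ids) for name, ids in splits.items()})
print(samples_per_lead(10, 500), samples_per_lead(10, 100))
```

Using the published fold assignments rather than a fresh random split keeps results comparable across papers. The sample counts also show the storage trade-off: a 10-second lead holds 5,000 samples at 500 Hz and 1,000 at 100 Hz.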
TDC
FULL NAME
Therapeutics Data Commons — AI Foundation for Therapeutic Science
DESCRIPTION
Therapeutics Data Commons (TDC) is a coordinated initiative providing AI-ready datasets and curated benchmarks across the full spectrum of therapeutic modalities (small molecules, biologics, gene therapy) and stages (target identification, hit discovery, lead optimization, manufacturing). It features 100+ datasets across 50+ learning tasks, with standardized evaluation protocols, data splits, and public leaderboards, and it supports systematic evaluation of AI methods for drug discovery and development.
TITLE
Artificial intelligence foundation for therapeutic science.
Main citation
Huang K, Fu T, Gao W, Zhao Y, Roohani Y, Leskovec J, Coley CW, Xiao C, Sun J, Zitnik M. (2022) Artificial intelligence foundation for therapeutic science. Nature Chemical Biology, 18(10):1034-1036. doi:10.1038/s41589-022-01131-2. PMID 35970914
ABSTRACT
Artificial intelligence is poised to enable breakthroughs and discoveries in therapeutic science. Therapeutics Data Commons is a coordinated initiative to access and evaluate AI capability across therapeutic modalities and stages of discovery. The Commons is a resource with AI-solvable tasks, AI-ready datasets, and curated benchmarks, providing an ecosystem of tools, libraries, leaderboards, and community resources.
DOI
10.1038/s41589-022-01131-2