| ReXGradient-160K | 160k multi-site CXR + reports — vision-language radiology sandbox. | Harvard DUA, non-commercial |
| CheXpert Plus | Large paired CXR–report set; benchmarked in many papers. | Stanford DUA, free research |
| Endoscapes 2023 | Open laparoscopic cholecystectomy frames; segmentation / critical view of safety (CVS) detection. | CC BY-NC-SA 4.0 |
| Surg-3M | 3M surgical frames powering "SurgFM" foundation model. | TBA, expect research-only |
| AFRICAI Repository | Imaging sets from African centres — fairness & domain shift. | Mixed open licenses |
| OpenOximetry | Waveforms + skin-tone data for pulse-ox bias work. | PhysioNet credentialed |
| DeepLesion | 32k CT slices with bounding-box lesions; detection / tracking. | NIH DUA, research-only |
| BioASQ Synergy 2024 | Biomedical Q-A pairs — LLM eval set. | CC BY 2.5 |
| CliniFact | Clinical-trial fact-checking corpus — fine-tune retrieval / RAG. | MIT |
| Hallucination Annotations | Doctor- & LLM-written discharge summaries with token-level labels. | PhysioNet credentialed |
| Clinical-Trial Eligibility QA | QA pairs linking MIMIC-IV to apixaban RCT criteria. | PhysioNet credentialed |
| PIFIR | Wearable PPG/ECG for arrhythmia-free interval prediction. | PhysioNet restricted |
| GREGoR R02 | Rare-disease genomic + phenotypic harmonised data. | dbGaP controlled |
| Synthetic Rare-Disease EHRs | Benchmark synthetic EHRs for low-prevalence conditions. | CC BY |
| Korea4K | 4k Korean genomes — ancestry diversity for variant calling. | EGA controlled |
| OpenNeuro | 20k+ public neuro-imaging sessions; BIDS-ready. | CC0 / CC BY-SA |
| Bridge2AI-Voice | Multimodal speech (voice, vitals) for health AI. | PhysioNet restricted |
| PMDB Pain Monitoring | Wearable IMU + self-report pain diary. | CC BY 4.0 |
| DREAMT Wearable Sleep | Empatica E4 wristband + PSG pairs for sleep-staging models. | PhysioNet restricted |
| MC-MED | Multimodal clinical monitoring data from emergency department visits. | PhysioNet credentialed |
| Wearable Stress Dataset | Smartwatch vitals + stress labels — mental health ML. | PhysioNet restricted |
| MIMIC-IV v3.1 | Flagship de-identified EHR, ~365k patients; ED + ICU tables. | PhysioNet credentialed + CITI |
| MIETIC | MIMIC-IV-Ext Triage Instruction Corpus; emergency-triage instruction data derived from MIMIC-IV. | PhysioNet credentialed |
| ODD (Opioid Behavior) | Annotated notes for opioid-related behaviour NLP. | PhysioNet credentialed |
| UK Biobank | 500k UK adult cohort — EHR, surveys, genetics. | Controlled access |
| All of Us (NIH) | 1M-goal US cohort — EHR, surveys, genomics, wearables. | Registered + Controlled tiers |
| TCGA | ~11k patients across 33 cancer types — multi-omics + clinical. | Partially open |
| AmsterdamUMCdb | First open European ICU DB — 23k admissions. | DUA required |
| ADNI | Longitudinal Alzheimer's — serial MRI/PET, clinical, biomarkers. | Free non-commercial |
| ABCD Study | ~11.9k youths: neuroimaging, cognitive, mental health, genetic. | NIMH controlled access |
| NHANES (CDC) | US national survey — health, nutrition, lab data. | Public domain |
| CheXpert (original) | 224k chest X-rays, 65k patients — labeled findings. | Free non-commercial |
| EchoNet-Dynamic | 10k+ cardiac ultrasound videos with EF + ventricle volumes. | Non-commercial |
| Synthea | Realistic synthetic patient records; full EHR. | Apache 2.0 |
| 1000 Genomes | WGS from ~2,500 diverse individuals — human variation reference. | Open access |
| DementiaBank (Pitt) | Speech recordings + transcripts from Alzheimer's patients + controls. | Consortium access |
| VitalDB | 6,300+ surgeries with continuous high-freq vital sign waveforms. | Open, registration + DUA |
| Medical Segmentation Decathlon | 10 open datasets for 3D medical image segmentation. | CC BY-SA 4.0 |
| PANDA | 10k+ prostate biopsy WSIs with Gleason grades. | CC BY-NC-SA 4.0 |