Imagine an emergency physician poring over a new patient’s chart. The screen shows a patchwork of information: a chief complaint with scant history, some labs still pending, allergies “unknown,” and no prior records in the system. This incomplete, in-progress data is the reality of emergency medicine and inpatient care. Yet open a popular synthetic EHR dataset like Synthea and you’ll find a neatly filled-out record – a complete medical history from birth to the present, every diagnosis and lab result in place (Synthea). While synthetic datasets are valuable, their assumption of data completeness can make them mismatched to the messy reality of acute care settings. In this article, we’ll explore how synthetic EHR data’s perfection clashes with real-world incompleteness, why it matters for clinical informatics in emergency and inpatient environments, and how we might bridge the gap to make synthetic data more useful for these applications.

When Synthetic Data Paints a Perfect Picture

Synthetic EHR generators like Synthea are designed to create realistic but complete patient records. Synthea, for example, simulates a patient’s entire lifespan, outputting a full history of encounters, diagnoses, medications, and outcomes. By design, it produces fully populated data fields – in Synthea’s output, every synthetic patient has documented allergies, medications, lab results, and so on (Synthea). There are no dangling tests without results, no missing problem lists, and no unknown social history; everything is statistically consistent and complete. This completeness is a deliberate feature: it enables data scientists to work with clean, comprehensive datasets free of privacy concerns. In fact, one can generate a million lifelike patient records with Synthea and find that each contains a coherent, end-to-end story of the patient’s health journey.

However, this “perfection” is a double-edged sword. In synthetic data, every patient’s medication list is up to date, and every diagnosis is eventually recorded – essentially an idealized scenario. Patients always adhere to treatments, and clinicians follow guidelines to the letter in the simulated world (Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record - PMC). For instance, Synthea’s hypertension patients reliably take their meds, and diabetic patients get all recommended tests on time. From a data perspective, there are few if any nulls or “unknown” values. This approach makes synthetic datasets excellent for testing algorithms under ideal conditions, or for research questions that assume complete data. It also sidesteps the messy variability of real clinical documentation. But it’s precisely this tidiness that can become a limitation when we move from the lab to the chaotic environment of an ER or busy hospital ward.
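This tidiness is easy to verify empirically. Below is a minimal sketch – the field names and inline records are illustrative stand-ins, not actual Synthea output – that audits what fraction of records have each field populated:

```python
# Hedged sketch: audit per-field completeness of patient records.
# Field names loosely mimic an EHR export; the records are inline
# examples, not real Synthea output.

def completeness(records, fields):
    """Fraction of records with a non-empty, non-'unknown' value per field."""
    result = {}
    for field in fields:
        filled = sum(1 for r in records
                     if r.get(field) not in (None, "", "unknown"))
        result[field] = filled / len(records)
    return result

synthetic = [  # Synthea-style rows: every field populated
    {"allergies": "penicillin", "medications": "lisinopril", "lab_lactate": 1.2},
    {"allergies": "none known", "medications": "metformin", "lab_lactate": 0.9},
]
real_ed = [  # triage-time rows: gaps everywhere
    {"allergies": "unknown", "medications": "", "lab_lactate": None},
    {"allergies": "penicillin", "medications": "", "lab_lactate": 2.4},
]

fields = ["allergies", "medications", "lab_lactate"]
print(completeness(synthetic, fields))  # every field at 1.0
print(completeness(real_ed, fields))    # 0.5 / 0.0 / 0.5
```

Run an audit like this on any synthetic corpus and the completeness fractions sit at or near 1.0; run it on triage-time data and the gaps appear immediately.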

The Messy Reality of Incomplete Data in ED and Inpatient Care

Contrast synthetic completeness with what happens in a real emergency department or inpatient setting. Real-world EHR data is often incomplete, fragmented, and evolving. A patient arrives at the ED with an undifferentiated condition; clinicians might only have a triage note and vitals to start with. Labs and imaging results trickle in over time, and some data may never be captured at all. It’s common for key pieces of information – medication history, allergies, past medical records – to be missing on arrival. A study in a neurological emergency setting found that only 66% of ED patients had complete information (diagnoses, medications, allergies) at admission; in 11% of cases, no information in those categories was available initially, and about 27% of patients lacked critical data in two or more categories (Missing Medical Data in Neurological Emergency Care Compromise Patient Safety and Healthcare Resources). In other words, about one in four patients arrived with seriously incomplete records, creating a high-risk situation from the start.
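An audit like the one in that study can be sketched in a few lines. The field names and example admissions below are illustrative, not data from the cited paper; the point is simply counting how many core categories are missing per record:

```python
# Classify admission records by how many core information categories
# (diagnoses, medications, allergies) are missing at arrival.
# Field names and example records are illustrative.

CORE = ("diagnoses", "medications", "allergies")

def n_missing(record):
    """Number of core categories with no information in this record."""
    return sum(1 for f in CORE if not record.get(f))

admissions = [
    {"diagnoses": ["stroke"], "medications": ["aspirin"], "allergies": ["none"]},
    {"diagnoses": [], "medications": [], "allergies": []},           # nothing known
    {"diagnoses": ["seizure"], "medications": [], "allergies": []},  # two gaps
]

complete   = sum(1 for a in admissions if n_missing(a) == 0)
none_known = sum(1 for a in admissions if n_missing(a) == len(CORE))
high_risk  = sum(1 for a in admissions if n_missing(a) >= 2)
print(complete, none_known, high_risk)  # 1 1 2
```

On the study’s population the analogous counts were roughly 66%, 11%, and 27% of patients respectively.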

Emergency physicians and hospitalists are trained to make the best decisions under these conditions. As one emergency medicine review put it, “chaotic environments and incomplete information” are inherent to acute care, and doctors must act despite significant uncertainty (Being explicit about the uncertainty of clinical practice in training - PMC). This uncertainty is not a rare edge case; it’s the norm. Medication lists are a prime example – multiple studies have shown that anywhere from 48% to 87% of ED patients’ medication lists contain errors or omissions (Frequency of incomplete medication histories obtained at triage). It’s not unusual for an inpatient team to discover on day 2 of a hospitalization that the patient’s home medication list was inaccurate, or that a key piece of history (“actually, the patient had a seizure last year”) was missing on admission. In intensive care units, data may be continuously streaming from monitors, yet documentation of a patient’s history might still be incomplete for days.

This stands in stark contrast to a synthetic dataset like Synthea, where you’d never encounter an “unknown allergy” flag or an undocumented medication. The implications for clinical informatics are significant. If we develop or test clinical decision support tools and predictive models purely on fully complete synthetic data, we risk overlooking how those tools handle missing or partial information. For example, an early warning model for sepsis might perform perfectly on synthetic patients (who conveniently have every lab value), but in a real ED it might falter when faced with a patient whose lactate or blood cultures haven’t resulted yet. Real-world data is messy – and systems built on overly tidy data can struggle when deployed in the wild.

Why “Perfect Data” Falls Short in ED and Inpatient Settings

Incomplete data isn’t just an inconvenience; it’s a defining feature of acute care data workflows. Clinical informatics professionals know that data in the hospital comes in waves – first a chief complaint, then some vitals, later lab results, maybe a preliminary diagnosis, and so on. Often, decisions must be made mid-stream, not after all data is collected. In emergency medicine especially, you rarely have the luxury of completeness. As Dr. Michael Pulia, an ED physician and informatics researcher, notes, doctors often must act with “varying degrees of diagnostic uncertainty” because of incomplete information, and still provide safe care (Being explicit about the uncertainty of clinical practice in training - PMC). Inpatient care, while more controlled than the ED, also deals with evolving data: consider a hospitalist adjusting treatment plans as new labs return over several days, or waiting on a specialist consult to fill in pieces of the puzzle.

Given this reality, synthetic datasets that assume a fully filled-out record can be less useful for simulating or studying these scenarios. If a dataset always provides a final diagnosis, one might ask: how do we simulate the process of reaching that diagnosis with limited initial data? If every synthetic patient’s lab results are instantly available, how do we test an algorithm meant to trigger an alert before all results are known? There’s a risk of developing a false sense of security in our analytics – a model might look highly accurate on synthetic data but fail to accommodate missingness when deployed. In practice, models and decision support need to explicitly handle absent data (sometimes treating “no data” as a critical piece of information itself).
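One concrete way to treat “no data” as information in its own right is to pair each imputed value with an explicit missingness indicator, so a model can learn that an absent lactate at triage means something different from a normal one. Here is a minimal sketch in plain Python (scikit-learn’s `SimpleImputer` with `add_indicator=True` implements the same idea at scale):

```python
# Pair each feature value with an explicit missingness flag, so the
# downstream model sees "value was absent" as a signal of its own.
# Fallback values here are illustrative; in practice they might come
# from a population median or a clinically chosen default.

def encode_with_indicator(value, fallback):
    """Return (imputed_value, was_missing_flag)."""
    if value is None:
        return fallback, 1.0
    return value, 0.0

# A lactate that hasn't resulted yet still produces a usable feature pair.
features = [encode_with_indicator(lactate, fallback=1.0)
            for lactate in (2.1, None, 0.8)]
print(features)  # [(2.1, 0.0), (1.0, 1.0), (0.8, 0.0)]
```

The design choice matters: plain imputation silently erases the fact that a value was missing, while the indicator preserves it for the model to use.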

Experts have pointed out this gap. In a pilot project using Synthea to emulate real-world evidence, researchers cautioned that synthetic data may not reflect “typical” EHR source data, largely because real EHRs suffer issues like missing and inconsistent data that Synthea doesn’t fully reproduce (Use of Fast Healthcare Interoperability Resources (FHIR) in the Generation of Real World Evidence (RWE) | CDISC). In other words, a synthetic database’s completeness is atypical compared to the patchy data hospitals grapple with. Another study validating Synthea’s output against real clinical data found notable differences in key quality measures, underscoring that what looks like a comprehensive record in Synthea might diverge from reality in important ways (Application of Bayesian networks to generate synthetic health data - PMC). These findings echo what many informatics experts suspect: if your data is too clean, you’re probably not mimicking the real world.

Real-world case examples bring these challenges to life. Consider a hospital IT team trying to test a new clinical decision support (CDS) tool for sepsis detection. They initially use synthetic patient data (where every patient record conveniently has a diagnosis code for sepsis by discharge and all relevant vitals and labs). The tool performs amazingly well in the test environment. But when they pilot it live in the ED, it starts missing obvious sepsis cases or firing false alerts. Why? Because in the real ED, patients don’t come with a diagnosis label – the model had been effectively “peeking” at the answer in the synthetic data. It also didn’t know how to handle cases where, say, lactate wasn’t measured (in synthetic data, every septic patient had a lactate drawn and resulted). The synthetic training gave it a misleadingly complete picture. Similarly, an inpatient fall-risk prediction algorithm might be trained on synthetic records where every patient’s mobility status is documented; once on a real ward, it encounters blanks (nurses sometimes haven’t charted mobility yet) and its accuracy drops.
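A common safeguard against this kind of “peeking” is to timestamp every observation and evaluate the tool only on what was charted by a given decision time. The sketch below is illustrative – the field names and times are invented – but it shows the masking step:

```python
# Evaluate a tool at "decision time": only observations timestamped
# at or before the cutoff are visible, reproducing the partial view
# clinicians actually had. All field names and times are invented.

def visible_at(observations, cutoff):
    """Keep only observations charted at or before cutoff (minutes after arrival)."""
    return {name: value
            for name, (value, charted_at) in observations.items()
            if charted_at <= cutoff}

obs = {                          # (value, minutes after arrival)
    "heart_rate": (118, 5),
    "temp_c": (38.9, 10),
    "lactate": (3.4, 95),            # resulted late
    "discharge_dx": ("sepsis", 600),  # never available at triage!
}

at_triage = visible_at(obs, cutoff=15)
print(sorted(at_triage))  # ['heart_rate', 'temp_c'] -- no lactate, no label
```

Testing a sepsis alert against `at_triage` rather than the full record prevents the model from being scored on data (like the discharge diagnosis) it could never have seen.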

These scenarios illustrate a key point: **for emergency and inpatient informatics, we often need to train and test systems on partial information, just as clinicians operate on partial information.** Synthetic data that only offers the final, fully populated record misses the intermediate states that are so critical.

Expert Voices: The Challenge of Incomplete Data

Clinicians and informatics leaders have long been aware of the challenges posed by incomplete data. Dr. Dimitrios Papanagnou, an emergency physician, highlights that “diagnostic uncertainty” is central to ED practice, and it stems from exactly those chaotic, incomplete data situations (Being explicit about the uncertainty of clinical practice in training - PMC). It’s a point often made in patient safety circles: incomplete assessments and missing information contribute to errors and adverse outcomes in emergency care (The Role of Patients' Stories in Emergency Medicine Triage - PubMed). One malpractice analysis noted that “incomplete assessments are often the starting point for adverse patient outcomes” in the ED, leading to misdiagnoses and treatment delays (Surprising New Data on Closed ED Claims: Incomplete…).

From the informatics perspective, missing data is a well-recognized problem for predictive modeling. A Penn Medicine study in 2023 emphasized that EHR data are collected at irregular intervals with varying completeness, and the authors developed a framework to simulate realistic missing data because no standard approach existed before (Mining for equitable health: Assessing the impact of missing data in electronic health records - PMC). The researchers found that introducing realistic patterns of missingness (as opposed to completely filled data) significantly affected model performance in an ICU setting (Mining for equitable health: Assessing the impact of missing data in electronic health records - PMC). This suggests that ignoring missing data can lead to biased or over-optimistic models. Another real-world insight: the National COVID Cohort Collaborative (N3C), which pooled EHR data for COVID-19 research, had to grapple with many incomplete fields. They even developed algorithms to flag patients with “high data-completeness” in EHR data (National COVID Cohort Collaborative (N3C): Rationale, design ...), an effort to identify which records were sufficiently filled to trust for certain analyses. These examples underline a consensus in the community: data completeness is the exception, not the rule, in operational healthcare settings.
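The core idea behind such frameworks – degrading complete data with realistic missingness before training or evaluation – can be approximated simply. The sketch below is not the Penn Medicine implementation; the per-field rates are invented, where a real study would estimate them from production EHR data:

```python
# Degrade a complete (e.g. synthetic) record set with per-field
# missingness rates, seeded for reproducibility. Rates here are
# illustrative; a real study would estimate them from production data.

import random

def inject_missingness(records, rates, seed=0):
    """Return copies of records with fields nulled out at the given rates."""
    rng = random.Random(seed)
    degraded_records = []
    for record in records:
        degraded = dict(record)  # leave the original untouched
        for field, rate in rates.items():
            if field in degraded and rng.random() < rate:
                degraded[field] = None
        degraded_records.append(degraded)
    return degraded_records

complete = [{"sbp": 120, "lactate": 1.1}, {"sbp": 95, "lactate": 3.0}]
messy = inject_missingness(complete, {"lactate": 0.5, "sbp": 0.1})
print(messy)
```

Training and evaluating on `messy` rather than `complete` forces a model to confront the gaps it will meet in production, which is precisely the effect the Penn study measured.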

So when synthetic datasets assume completeness by default, experts see a mismatch. As one report bluntly put it, Synthea data may not mirror the quirks of real EHR data, cautioning users not to get too comfortable with its cleanliness (Use of Fast Healthcare Interoperability Resources (FHIR) in the Generation of Real World Evidence (RWE) | CDISC). Clinical informatics professionals who have tried using synthetic data for ED or inpatient projects often observe, “It didn’t prepare us for the real data’s messiness.” The devil is in the details – it’s not just whether a blood pressure value is plausible, but whether it might be missing altogether during a critical window.

Bridging the Gap: Making Synthetic Data More Useful for Acute Care

Synthetic data isn’t going away – nor should it. The potential benefits (unlimited data with no PHI, ability to simulate rare scenarios, etc.) are enormous. The question is how to adapt or augment synthetic EHR datasets to better serve use cases in emergency and inpatient care. Here are several potential improvements and alternatives that could help make synthetic data more realistic and useful for these settings: