Limitations¶
Things to know before publishing anything based on these outputs.
Parser-level¶
Patient-portal exports are inconsistent¶
C-CDA is a standard. Patient-portal exports vary in their adherence to it. Common issues:
- Narrative-only sections. Some vendors ship a beautiful HTML-rendered patient summary with no <entry> elements at all. Patient Edition falls back to narrative records, but downstream analyses (anything that filters on a code) skip these.
- Non-standard section codes. A section that should be 11450-4 (Problem List) might appear with a vendor-specific LOINC or no code at all. These end up in category = observations and are partially recoverable.
- Non-UTF-8 encodings. Older Epic exports occasionally use Windows-1252 with a UTF-8 declaration. Parser tolerance varies by Python version.
- Self-author chains. A patient who is listed as their own document author throws off any author-aware downstream tool. Patient Edition ignores authorship, so this isn't a problem here, but it is one in an OMOP or Phenopackets handoff.
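For the encoding issue above, one possible workaround is to normalize the raw bytes before they reach the XML parser. The sketch below is not part of Patient Edition; the function name normalize_to_utf8 is hypothetical, and it assumes the only mislabeled encoding you will meet is Windows-1252.

```python
def normalize_to_utf8(raw: bytes) -> bytes:
    """Return bytes that really are UTF-8, so the XML declaration is truthful.

    Hypothetical helper: assumes a file that fails UTF-8 decoding
    was written as Windows-1252 (cp1252).
    """
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8, leave untouched
    except UnicodeDecodeError:
        # cp1252 leaves a few byte values undefined (for example 0x81),
        # so replace anything unmappable instead of raising again.
        return raw.decode("cp1252", errors="replace").encode("utf-8")


# A cp1252-encoded fragment that is not valid UTF-8.
sample = "Sjögren's syndrome".encode("cp1252")
fixed = normalize_to_utf8(sample)
```

Because the result is re-encoded rather than returned as a string, it can be handed to a bytes-based parser without touching the declaration in the file.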
What the parser has and hasn't been tested against¶
We have tested against:
- Epic MyChart C-CDA 2.1 exports - the most common patient-portal source. The parser handles Epic's concern-act wrappers, narrative anchor references, BATTERY organizers, and the section codes Epic actually emits.
- Synthetic C-CDA samples in notebook/sample_data/ (the JaneDemoPatient / JoeDemoPatient / AlexDemoPatient files).
We have not systematically tested against:
- Cerner HealtheLife exports
- Athenahealth patient portal exports
- Allscripts patient portals
- Veterans Affairs Blue Button / VA HealtheVet exports
- Apple Health Records C-CDA exports (which package data from multiple sources)
If you have access to scrubbed samples from any of these and the parser produces empty or wrong records, please file a GitHub issue with the diagnostic-cell output. Vendor coverage is the most actionable thing contributors can help with.
Identity-resolution limitations¶
See Stage 3 - Identity resolution for the full discussion. In short:
- No fuzzy matching. Spelling differences across documents split one patient into two.
- Name collisions. Two real patients sharing name + DOB without an MRN collide into one ID.
- No cross-system linkage. Documents from two hospitals for the same patient won't merge unless name + DOB match exactly and no conflicting MRN is present.
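The first two failure modes follow directly from exact matching. The sketch below is an illustration only, not the Stage 3 implementation: exact_match_key is a hypothetical function, and the lowercasing and whitespace trimming are assumptions made for the example.

```python
import hashlib


def exact_match_key(name: str, dob: str) -> str:
    """Hypothetical patient key built from an exact name + DOB match."""
    basis = f"{name.strip().lower()}|{dob}"
    return hashlib.sha256(basis.encode("utf-8")).hexdigest()[:12]


# A spelling variant of the same person produces a different key (a split).
key_a = exact_match_key("Katherine Smith", "1984-03-02")
key_b = exact_match_key("Kathryn Smith", "1984-03-02")

# Two different people with the same name and DOB produce the same key
# (a collision), because nothing else enters the key.
key_c = exact_match_key("Katherine Smith", "1984-03-02")
```

Here key_a and key_b differ even though they may describe one patient, while key_a and key_c are identical even if they describe two.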
Modeling limitations¶
These apply to any analysis built on the outputs, but especially to the feature matrix:
Small-n¶
Cohorts assembled from patient-mediated data are usually small (dozens, not thousands). AUCs and clustering scores at small n are noisy. Use cross-validation with permutation tests, and report confidence intervals, not point estimates.
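As a rough illustration of that advice, the following standard-library sketch computes an AUC, a label-permutation p-value, and a percentile bootstrap confidence interval. It assumes the scores are out-of-fold predictions from a cross-validation you have already run; the function names and toy data are made up for the example and are not part of the pipeline.

```python
import random


def auc(labels, scores):
    """AUC as the chance a positive outscores a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_p_value(labels, scores, n_perm=2000, seed=0):
    """Share of label shuffles whose AUC is at least the observed AUC."""
    rng = random.Random(seed)
    observed = auc(labels, scores)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if auc(shuffled, scores) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_perm + 1)


def bootstrap_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC, resampling patients."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # an AUC needs both classes in the resample
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy cohort of 12 patients (invented numbers, for illustration only).
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.2, 0.5, 0.45, 0.7, 0.8, 0.9, 0.55]

observed_auc = auc(labels, scores)
p_value = permutation_p_value(labels, scores)
ci_low, ci_high = bootstrap_ci(labels, scores)
```

At this cohort size the interval is typically wide even when the point estimate looks strong, which is the reason to report it alongside the AUC.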
Top-K is a hack¶
The default feature matrix uses the top 30 problem codes and top 30 medication codes by frequency. This throws away rare-but-important codes. For real work, map codes to a hierarchy (ATC for meds, ICD-10 chapters or Mondo for problems) before modeling.
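A minimal sketch of that roll-up for ICD-10 problem codes is shown below. The lookup table is deliberately partial and covers only a few chapters that correspond to a single leading letter; a real mapping would come from a full ICD-10 or Mondo resource. The function names and the sample codes are assumptions for illustration.

```python
from collections import Counter

# Partial, illustrative lookup. Only chapters that map to one leading
# letter are listed; everything else falls into a catch-all bucket.
ICD10_CHAPTER_BY_LETTER = {
    "F": "Mental and behavioural disorders",
    "G": "Diseases of the nervous system",
    "I": "Diseases of the circulatory system",
    "J": "Diseases of the respiratory system",
    "K": "Diseases of the digestive system",
}


def icd10_category(code: str) -> str:
    """Truncate a code to its three-character category (G40.909 -> G40)."""
    return code.split(".")[0].upper()


def icd10_chapter(code: str) -> str:
    """Map a code to a chapter name using the partial table above."""
    return ICD10_CHAPTER_BY_LETTER.get(code[:1].upper(), "Other / unmapped")


codes = ["G40.909", "G40.309", "G43.1", "F84.0", "J45.20", "Q04.3"]
category_counts = Counter(icd10_category(c) for c in codes)
chapter_counts = Counter(icd10_chapter(c) for c in codes)
```

Rolling up before counting means a rare leaf code still contributes to its parent feature instead of being dropped by the top-30 cutoff.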
Temporal leakage¶
Pivoting all records into features means post-outcome data can leak backward in time. For a prediction task, censor the feature matrix at a defined index date and only count records before it. The current pipeline doesn't do this for you.
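Censoring can be applied to the long-format records before pivoting. The sketch below assumes each record is a dict with patient_id, date, and code keys and that you supply one index date per patient; those field names are hypothetical and may not match the master CSV columns.

```python
from datetime import date


def censor_before(records, index_dates):
    """Keep only records dated strictly before each patient's index date.

    Records with no date, or for patients with no index date, are dropped,
    because they cannot be shown to precede the outcome.
    """
    kept = []
    for rec in records:
        index_date = index_dates.get(rec["patient_id"])
        if index_date is None or rec["date"] is None:
            continue
        if rec["date"] < index_date:
            kept.append(rec)
    return kept


records = [
    {"patient_id": "p1", "date": date(2020, 1, 5), "code": "G40.909"},
    {"patient_id": "p1", "date": date(2021, 6, 1), "code": "G40.309"},
    {"patient_id": "p1", "date": None, "code": "F84.0"},
    {"patient_id": "p2", "date": date(2019, 3, 3), "code": "J45.20"},
]
index_dates = {"p1": date(2021, 1, 1)}

pre_index = censor_before(records, index_dates)
```

The comparison is strict, so a record stamped on the index date itself is excluded. Whether that is right depends on how the index date is defined for your task.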
Class imbalance¶
Patient-mediated cohorts are often very imbalanced (most participants share lots of records, a few share almost none). class_weight='balanced' helps but isn't a substitute for proper sampling.
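To see why reweighting alone is limited, it helps to compute what the balanced heuristic does. scikit-learn documents it as n_samples / (n_classes * count_per_class); the sketch below reproduces that formula in plain Python on an invented 18-versus-2 cohort.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weights from the 'balanced' heuristic: n_samples / (n_classes * count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


# Toy labels: 18 majority-class patients, 2 minority-class patients.
y = [0] * 18 + [1] * 2
weights = balanced_class_weights(y)
```

The two minority patients each receive a weight of 5.0, but the model still sees only two distinct examples of that class, so the weights rescale the loss without adding information.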
Selection bias¶
Patients who go to the trouble of downloading their records and sharing them with researchers are not a random sample. They tend to be:
- More engaged with their health
- Higher socioeconomic status
- More technologically comfortable
- Often patient-advocate or rare disease populations
Any inference from a patient-mediated cohort needs to account for this.
What's missing compared to Registry Forge¶
Already covered in How it differs from RegistryForge, but to summarize what's absent or incomplete and might bite you:
- OMOP ETL coverage. Patient Edition ships seven core OMOP tables (PERSON, OBSERVATION_PERIOD, CONDITION_OCCURRENCE, DRUG_EXPOSURE, MEASUREMENT, OBSERVATION, DEVICE_EXPOSURE). Missing: PROCEDURE_OCCURRENCE, VISIT_OCCURRENCE, NOTE, VISIT_DETAIL, CARE_SITE, and the derived ERA tables. Registry Forge's full ETL accepts our master CSV if you need the broader coverage.
- Mondo coverage. Starter crosswalk covers ~80 codes around epilepsy, neurodevelopmental disorders, adjacent neurology, and common comorbidities. For broader rare disease coverage, supply a custom crosswalk CSV.
- Phenopackets completeness. phenotypicFeatures not yet populated from note flags; no variants or biosamples; no full HPO mapping. Subject, diseases, medical actions, and measurements are all present.
- Note flagging. Naive regex matching only. No negation detection, no section-context awareness, no entity normalization. Use it for screening, not for clinical extraction.
- No drug repurposing analysis.
- No formal QC framework (Kahn 2016 / OHDSI DQD).
- No code-system harmonization beyond what's preserved (we keep all codings; we don't actively map across vocabularies except via Mondo or Athena lookup).
- No survival analysis.
- No NLP on narrative content beyond keyword flagging.
Most of these are deliberate scope choices for the patient-mediated use case. A few (HPO mapping, broader OMOP coverage) are roadmap items waiting on contributors.
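The note-flagging caveat in the list above is easy to demonstrate. The pattern and notes below are invented for illustration and are not the patterns the notebook ships; they only show how keyword matching behaves without negation or section awareness.

```python
import re

# Hypothetical keyword pattern of the kind a naive flagger would use.
SEIZURE_FLAG = re.compile(r"\bseizures?\b", re.IGNORECASE)

notes = [
    "Patient had a generalized seizure last week.",    # true mention
    "Denies seizures since starting levetiracetam.",   # negated
    "Family history: mother with seizures.",           # not about the patient
]

flags = [bool(SEIZURE_FLAG.search(note)) for note in notes]
```

All three notes are flagged, although only the first describes a current finding in the patient. That is acceptable for screening which charts to read, but not for extracting clinical facts.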
What we won't fix¶
- PHI scrubbing. Patient Edition will not become a de-identification tool. The role-appropriate place to do de-identification is a separate, validated pipeline like the NIH NLM Scrubber or i2b2 NLP, not a parsing notebook.
- EHR-side ingestion. That's what Registry Forge proper is for. If you have backend EHR access, use it.
- Clinical decision support. This tool exists to support research workflows. It is not designed for, validated for, or intended for clinical use.
Reporting issues¶
Before opening an issue:
- Try to reproduce with a synthetic or fully-scrubbed C-CDA. Don't post real PHI.
- Include the Python traceback if any, the exact version of the notebook you're running, and a short description of what you expected vs. what happened.
- If the issue is parser tolerance for a specific vendor format, providing a redacted example helps enormously.