Overview
A one-page summary of the design.
Input
A folder, a subfolder tree, or a .zip containing C-CDA XML documents downloaded by a patient from a health-system patient portal - most commonly MyChart, but the format is also produced by Cerner HealtheLife, Athenahealth, and any patient portal that implements the Blue Button / Patient Access API export.
Every document has the following overall shape:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type='text/xsl' href='STYLE.XSL'?>
<ClinicalDocument xmlns="urn:hl7-org:v3" ...>
<recordTarget>
<patientRole>
<id extension="MRN-..."/>
<patient>
<name><given>...</given><family>...</family></name>
...
</patient>
</patientRole>
</recordTarget>
<component>
<structuredBody>
<component><section>
<code code="11450-4" .../> <!-- Problem List -->
...
</section></component>
...
</structuredBody>
</component>
</ClinicalDocument>
The parser doesn't care about the wrapping zip layout - it walks recursively.
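As a rough illustration of what "walks recursively" could mean in practice, here is a minimal standard-library sketch. The function name `discover_xml` and the `scratch_dir` parameter are hypothetical, not taken from the notebook, and the sketch assumes the scratch directory lives outside the input folder.

```python
import zipfile
from pathlib import Path


def discover_xml(input_path, scratch_dir):
    """Return every .xml file under input_path, unpacking any .zip met on the way.

    Hypothetical helper: scratch_dir is assumed to be outside input_path so that
    unpacked archives are not visited twice.
    """
    scratch_dir = Path(scratch_dir)
    found = []
    pending = [Path(input_path)]
    unpacked = 0
    while pending:
        current = pending.pop()
        if current.is_dir():
            # Queue children; nesting depth does not matter.
            pending.extend(sorted(current.iterdir()))
        elif current.suffix.lower() == ".zip" and zipfile.is_zipfile(current):
            # Unpack into its own scratch subfolder, then walk that folder too.
            target = scratch_dir / f"unzipped_{unpacked}"
            unpacked += 1
            with zipfile.ZipFile(current) as archive:
                archive.extractall(target)
            pending.append(target)
        elif current.suffix.lower() == ".xml":
            found.append(current)
    return sorted(found)
```

Because archives are unpacked and then queued like any other folder, a zip nested inside a subfolder (or inside another zip) is handled by the same loop.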
Processing
Five stages, all standard-library Python plus pandas:
- Folder discovery. Walk the input recursively, unzip if needed, collect every `.xml` file.
- C-CDA parsing. For each file: extract demographics from `recordTarget/patientRole`, then walk every section under `structuredBody` and convert each `<entry>` into one long-format record. Fall back to section narrative if structured entries are missing.
- Identity resolution. Generate a stable `patient_id` per document. Use MRN when present; otherwise `hash(last_name, first_name, dob)`. Across documents with the same identity, records merge into one logical patient.
- Long-format assembly. Concatenate every record into the Registry Forge `patient_master.csv` schema, sorted by `(last_name, first_name, patient_id, category, effective_date)`.
- Dashboard + feature matrix. Build a single self-contained HTML dashboard with client-side search/filter, plus an ML-ready patient × feature CSV.
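The parsing and identity-resolution stages might look roughly like the sketch below. It is not the notebook's code: `parse_document` and `resolve_patient_id` are hypothetical names, the `birthTime/@value` lookup assumes the usual C-CDA location for date of birth (the sample above elides it), and SHA-256 is an assumed choice of hash. Python's built-in `hash()` is randomized per process for strings, so it would not give a stable ID across runs.

```python
import hashlib
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}  # default namespace of every C-CDA document


def parse_document(xml_bytes):
    """Return (demographics dict, list of section codes) for one document."""
    root = ET.fromstring(xml_bytes)
    demo = {"mrn": None, "first_name": "", "last_name": "", "dob": ""}
    role = root.find("hl7:recordTarget/hl7:patientRole", NS)
    if role is not None:
        id_el = role.find("hl7:id", NS)
        if id_el is not None:
            demo["mrn"] = id_el.get("extension") or None
        patient = role.find("hl7:patient", NS)
        if patient is not None:
            demo["first_name"] = patient.findtext("hl7:name/hl7:given", "", NS)
            demo["last_name"] = patient.findtext("hl7:name/hl7:family", "", NS)
            birth = patient.find("hl7:birthTime", NS)  # assumed location of DOB
            if birth is not None:
                demo["dob"] = birth.get("value", "")
    section_path = "hl7:component/hl7:structuredBody/hl7:component/hl7:section/hl7:code"
    section_codes = [el.get("code") for el in root.iterfind(section_path, NS)]
    return demo, section_codes


def resolve_patient_id(demo):
    """Use the MRN when present, otherwise a stable hash of name and DOB."""
    if demo["mrn"]:
        return demo["mrn"]
    # Normalize case and whitespace so trivially different spellings collide.
    key = "|".join(demo[k].strip().lower() for k in ("last_name", "first_name", "dob"))
    return "h_" + hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```

Two documents that lack an MRN but agree on normalized name and date of birth resolve to the same `patient_id`, which is what lets their records merge into one logical patient.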
Output
| File | Purpose |
|---|---|
| `patient_master.csv` | Long-format, every record from every document. Schema-compatible with Registry Forge. |
| `dashboard.html` | Single offline browser app. Filter by patient/category/vocabulary; full-text global search; CSV export of filtered view. |
| `patient_features.csv` | Patient × feature matrix. Demographics, category counts, top-K diagnosis/medication binary flags, lab-value means. |
| `parse_log.csv` | One row per source file: which patient it resolved to, how many records it contributed. |
| `registry_forge_patient_bundle.zip` | The above four files in one archive for handoff. |
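To make the relationship between the long-format file and the feature matrix concrete, here is a small sketch of the "category counts" part of the pivot. It uses only the standard library for brevity (the pipeline itself uses pandas), and the function name `category_counts`, the `n_` column prefix, and the example category values are all hypothetical. Only the `patient_id` and `category` column names come from the sort key described above.

```python
from collections import Counter, defaultdict


def category_counts(records):
    """Pivot long-format records into one row of per-category counts per patient."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec["patient_id"]][rec["category"]] += 1

    # Every patient gets every column, with 0 where they have no records.
    categories = sorted({cat for per_patient in counts.values() for cat in per_patient})
    rows = []
    for patient_id in sorted(counts):
        row = {"patient_id": patient_id}
        for cat in categories:
            row[f"n_{cat}"] = counts[patient_id][cat]
        rows.append(row)
    return rows
```

The top-K binary flags and lab-value means listed in the table would be further columns built the same way, one row per `patient_id`.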
Modes
Auto-detected. No flag to flip.
- Single-patient mode. One person across many documents. Dashboard becomes a timeline; feature matrix is one row.
- Multi-patient mode. Several people. Dashboard adds a patient filter; feature matrix is suitable for clustering or supervised modeling.
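One plausible way to auto-detect the mode is to count distinct resolved identities. The sketch below assumes that rule; the source does not spell out the actual criterion, and `detect_mode` is a hypothetical name.

```python
def detect_mode(records):
    """Pick a mode from the number of distinct patient_id values (assumed rule)."""
    patient_ids = {rec["patient_id"] for rec in records}
    if not patient_ids:
        return "empty"
    return "single-patient" if len(patient_ids) == 1 else "multi-patient"
```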
What we deliberately don't do
- No EHR connection. Never authenticates to anything. Never makes a network call with patient data.
- No FHIR resource fetching. Patient portals occasionally embed FHIR resources inside C-CDA narrative blocks; we leave those alone.
- No PHI scrubbing. Outputs contain names, MRNs, DOBs. That is intentional - the researcher needs them. The user is responsible for handling outputs as PHI.
- No code-system harmonization. SNOMED stays SNOMED, ICD-10 stays ICD-10. We preserve `all_codings_json` so downstream consumers can do their own mapping.
- No deduplication across overlapping documents. A patient who exported their record three times will produce three copies of overlapping records. That preserves provenance; deduping is a downstream choice.
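For consumers who do want to deduplicate downstream, a sketch along these lines keeps one copy of each record while still recording which files it came from. The `code` and `source_file` field names are assumptions for illustration and may not match the actual `patient_master.csv` columns; `dedupe` is a hypothetical helper.

```python
def dedupe(records, key_fields=("patient_id", "category", "effective_date", "code")):
    """Collapse records that agree on key_fields, keeping the first copy seen.

    Each kept record gains a 'source_files' list so provenance is not lost.
    The default key_fields and the 'source_file' field are assumed names.
    """
    merged = {}
    for rec in records:
        key = tuple(rec.get(field) for field in key_fields)
        source = rec.get("source_file")
        if key not in merged:
            kept = dict(rec)  # copy, so the caller's records are untouched
            kept["source_files"] = [source]
            merged[key] = kept
        elif source not in merged[key]["source_files"]:
            merged[key]["source_files"].append(source)
    return list(merged.values())
```

Choosing the key is the real decision: a stricter key keeps more near-duplicates, a looser one risks merging distinct events that share a date.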
What lives where
- Notebook: `notebook/RegistryForge_Patient.ipynb`
- Docs site source: `docs/`
- mkdocs config: `mkdocs.yml`