Overview

A one-page summary of the design.

Input

A folder, a subfolder tree, or a .zip containing C-CDA XML documents downloaded by a patient from a health-system patient portal - most commonly MyChart, but the format is also produced by Cerner HealtheLife, Athenahealth, and any patient portal that implements the Blue Button / Patient Access API export.

Every document has this overall shape:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type='text/xsl' href='STYLE.XSL'?>
<ClinicalDocument xmlns="urn:hl7-org:v3" ...>
  <recordTarget>
    <patientRole>
      <id extension="MRN-..."/>
      <patient>
        <name><given>...</given><family>...</family></name>
        ...
      </patient>
    </patientRole>
  </recordTarget>
  <component>
    <structuredBody>
      <component><section>
        <code code="11450-4" .../>  <!-- Problem List -->
        ...
      </section></component>
      ...
    </structuredBody>
  </component>
</ClinicalDocument>
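
A minimal sketch of reading that skeleton with the standard library's xml.etree.ElementTree. The function name read_header, the returned keys, and the sample values are hypothetical; only the element names and the urn:hl7-org:v3 namespace come from the skeleton above.

```python
import xml.etree.ElementTree as ET

# C-CDA elements live in the HL7 v3 namespace shown in the skeleton above.
NS = {"hl7": "urn:hl7-org:v3"}

# Hypothetical, heavily trimmed document used only to exercise the function.
SAMPLE = """<ClinicalDocument xmlns="urn:hl7-org:v3">
  <recordTarget>
    <patientRole>
      <id extension="MRN-0001"/>
      <patient>
        <name><given>Ada</given><family>Example</family></name>
      </patient>
    </patientRole>
  </recordTarget>
  <component>
    <structuredBody>
      <component><section>
        <code code="11450-4"/>
      </section></component>
    </structuredBody>
  </component>
</ClinicalDocument>"""

SECTION_CODE_PATH = (
    "hl7:component/hl7:structuredBody/hl7:component/hl7:section/hl7:code"
)


def read_header(xml_text):
    """Return the MRN, patient name, and section codes of one document."""
    root = ET.fromstring(xml_text)
    header = {"mrn": None, "given": "", "family": "", "section_codes": []}

    role = root.find("hl7:recordTarget/hl7:patientRole", NS)
    if role is not None:
        id_el = role.find("hl7:id", NS)
        if id_el is not None:
            header["mrn"] = id_el.get("extension")
        name = role.find("hl7:patient/hl7:name", NS)
        if name is not None:
            header["given"] = name.findtext("hl7:given", default="", namespaces=NS)
            header["family"] = name.findtext("hl7:family", default="", namespaces=NS)

    # One <code> per section identifies the section type (e.g. Problem List).
    header["section_codes"] = [
        code.get("code") for code in root.findall(SECTION_CODE_PATH, NS)
    ]
    return header
```

Every lookup needs the namespace prefix; an unprefixed path such as "recordTarget/patientRole" matches nothing in a namespaced document.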

The parser does not depend on the folder or zip layout: it walks the input recursively and picks up every .xml file it finds.
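
The recursive walk might look like the sketch below, using only pathlib and zipfile. The function name discover_xml and the scratch_dir parameter are assumptions for illustration, not the notebook's actual code. Archives are extracted into scratch_dir, which should sit outside the input folder so extracted files are not counted twice.

```python
import zipfile
from pathlib import Path


def discover_xml(input_path, scratch_dir):
    """Return every .xml file under input_path, looking inside .zip archives too."""
    scratch_dir = Path(scratch_dir)
    pending = [Path(input_path)]   # folders, archives, or single files still to scan
    found = []
    extracted = 0

    while pending:
        root = pending.pop()
        if root.is_file():
            candidates = [root]
        else:
            candidates = sorted(p for p in root.rglob("*") if p.is_file())

        for path in candidates:
            suffix = path.suffix.lower()
            if suffix == ".xml":
                found.append(path)
            elif suffix == ".zip" and zipfile.is_zipfile(path):
                # Extract to a fresh subfolder and scan it on a later pass,
                # which also handles archives nested inside archives.
                target = scratch_dir / f"zip_{extracted}"
                extracted += 1
                with zipfile.ZipFile(path) as archive:
                    archive.extractall(target)
                pending.append(target)

    return sorted(found)
```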

Processing

Five stages, all standard-library Python plus pandas:

  1. Folder discovery. Walk the input recursively, unzip if needed, collect every .xml file.
  2. C-CDA parsing. For each file: extract demographics from recordTarget/patientRole, then walk every section under structuredBody and convert each <entry> into one long-format record. Fall back to section narrative if structured entries are missing.
  3. Identity resolution. Assign each document a stable patient_id: the MRN when present, otherwise a hash of (last_name, first_name, dob). Records from documents that resolve to the same patient_id merge into one logical patient.
  4. Long-format assembly. Concatenate every record into the Registry Forge patient_master.csv schema, sorted by (last_name, first_name, patient_id, category, effective_date).
  5. Dashboard + feature matrix. Build a single self-contained HTML dashboard with client-side search/filter, plus an ML-ready patient × feature CSV.
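
Stage 3 is the step most worth seeing in code. The sketch below is one way to do it, not necessarily what the notebook does: the function name make_patient_id, the "mrn:" / "hash:" prefixes, the lowercase normalization, and the choice of SHA-256 truncated to 16 hex characters are all assumptions.

```python
import hashlib


def make_patient_id(mrn, last_name, first_name, dob):
    """Return a stable identifier: the MRN when present, else a demographic hash."""
    if mrn and mrn.strip():
        return f"mrn:{mrn.strip()}"

    # Normalize so "SMITH " and "smith" resolve to the same patient (assumption).
    parts = ((value or "").strip().lower() for value in (last_name, first_name, dob))
    key = "|".join(parts)
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return "hash:" + digest[:16]
```

The same inputs always yield the same patient_id, so records from separate documents group together without any shared state. The hash is a grouping key only; it does not anonymize anything.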

Output

File                               Purpose
patient_master.csv                 Long-format, every record from every document. Schema-compatible with Registry Forge.
dashboard.html                     Single offline browser app. Filter by patient/category/vocabulary; full-text global search; CSV export of filtered view.
patient_features.csv               Patient × feature matrix. Demographics, category counts, top-K diagnosis/medication binary flags, lab-value means.
parse_log.csv                      One row per source file: which patient it resolved to, how many records it contributed.
registry_forge_patient_bundle.zip  The above four files in one archive for handoff.

Modes

Auto-detected. No flag to flip.

  • Single-patient mode. One person across many documents. Dashboard becomes a timeline; feature matrix is one row.
  • Multi-patient mode. Several people. Dashboard adds a patient filter; feature matrix is suitable for clustering or supervised modeling.
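
Auto-detection can be as simple as counting distinct patients after identity resolution. The sketch below assumes that rule; the function name detect_mode, the returned strings, and the error on empty input are hypothetical.

```python
def detect_mode(records):
    """Return "single" or "multi" from the distinct patient_id values in records."""
    patient_ids = {record["patient_id"] for record in records}
    if not patient_ids:
        # Assumption: an input with no parsed records is an error, not a mode.
        raise ValueError("no records to detect a mode from")
    return "single" if len(patient_ids) == 1 else "multi"
```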

What we deliberately don't do

  • No EHR connection. Never authenticates to anything. Never makes a network call with patient data.
  • No FHIR resource fetching. Patient portals occasionally embed FHIR resources inside C-CDA narrative blocks; we leave those alone.
  • No PHI scrubbing. Outputs contain names, MRNs, DOBs. That is intentional - the researcher needs them. The user is responsible for handling outputs as PHI.
  • No code-system harmonization. SNOMED stays SNOMED, ICD-10 stays ICD-10. We preserve all_codings_json so downstream consumers can do their own mapping.
  • No deduplication across overlapping documents. A patient who exported their record three times will produce three copies of overlapping records. That preserves provenance; deduping is a downstream choice.
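
For consumers who do want deduplication, a downstream pass over the long-format records could look like this sketch. The columns patient_id, category, and effective_date appear in the sort key described earlier; "code", "value", and "source_file" are hypothetical column names, and keeping the first occurrence is an arbitrary choice.

```python
def dedupe(records,
           key_fields=("patient_id", "category", "effective_date", "code", "value")):
    """Keep the first record seen for each distinct combination of key_fields."""
    first_seen = {}
    for record in records:
        key = tuple(record.get(field) for field in key_fields)
        # setdefault leaves an existing entry alone, so later repeats are dropped.
        first_seen.setdefault(key, record)
    return list(first_seen.values())


# Two exports of the same record plus one different record (made-up values).
rows = [
    {"patient_id": "mrn:1", "category": "lab", "effective_date": "2023-01-05",
     "code": "X1", "value": "6.1", "source_file": "export_a.xml"},
    {"patient_id": "mrn:1", "category": "lab", "effective_date": "2023-01-05",
     "code": "X1", "value": "6.1", "source_file": "export_b.xml"},
    {"patient_id": "mrn:1", "category": "lab", "effective_date": "2023-06-01",
     "code": "X1", "value": "5.9", "source_file": "export_b.xml"},
]
unique_rows = dedupe(rows)
```

Which fields define "the same record" is a judgment call, which is why the pipeline leaves it to the consumer.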

What lives where

  • Notebook: notebook/RegistryForge_Patient.ipynb
  • Docs site source: docs/
  • mkdocs config: mkdocs.yml