Stage 2 - C-CDA parsing

This stage is the core of Patient Edition. For each XML file produced by Stage 1, it extracts demographics from the document header, then walks every section under structuredBody and emits long-format records, normally one per <entry>.

The document model

A C-CDA Continuity of Care Document has the following structure:

ClinicalDocument
├── recordTarget/patientRole       ← demographics (Stage 2.1)
│   ├── id (MRN)
│   └── patient
│       ├── name (given, family)
│       ├── administrativeGenderCode
│       ├── birthTime
│       └── maritalStatusCode
├── code                            ← document type (LOINC, e.g. 34133-9)
├── effectiveTime                   ← document date
├── custodian/.../name              ← source organization
└── component/structuredBody
    └── component/section (×N)      ← each clinical section (Stage 2.2)
        ├── code                    ← section type (LOINC)
        ├── title
        ├── text                    ← narrative
        └── entry (×N)
            └── observation | substanceAdministration | procedure | encounter | act
                ├── code            ← the clinical code (SNOMED, RxNorm, LOINC, ...)
                ├── effectiveTime
                ├── value           ← lab value (PQ, CD, ST types)
                └── statusCode
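
As a quick orientation, the tree above can be walked with the standard library. This is a minimal sketch, not the project's code: the `outline` helper and the sample document are hypothetical. The only assumption about real files is that C-CDA elements live in the HL7 v3 namespace (`urn:hl7-org:v3`).

```python
import xml.etree.ElementTree as ET

# C-CDA elements live in the HL7 v3 namespace.
NS = {"hl7": "urn:hl7-org:v3"}

# Made-up miniature document, only for illustration.
SAMPLE = """<ClinicalDocument xmlns="urn:hl7-org:v3">
  <code code="34133-9" codeSystem="2.16.840.1.113883.6.1"/>
  <effectiveTime value="20240115"/>
  <component><structuredBody>
    <component><section>
      <code code="11450-4"/><title>Problems</title>
      <entry><observation/></entry>
      <entry><observation/></entry>
    </section></component>
    <component><section>
      <code code="10160-0"/><title>Medications</title>
    </section></component>
  </structuredBody></component>
</ClinicalDocument>"""

SECTION_PATH = "hl7:component/hl7:structuredBody/hl7:component/hl7:section"


def outline(root):
    """Return (section LOINC, title, entry count) for each body section."""
    rows = []
    for section in root.findall(SECTION_PATH, NS):
        code = section.find("hl7:code", NS)
        loinc = code.get("code", "") if code is not None else ""
        title = section.findtext("hl7:title", default="", namespaces=NS)
        rows.append((loinc, title, len(section.findall("hl7:entry", NS))))
    return rows


root = ET.fromstring(SAMPLE)
for row in outline(root):
    print(row)
```

Running an outline like this on an unfamiliar export shows which sections carry structured entries and which carry narrative only.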

Two-pass logic

Pass 1: Header

parse_patient_header(root) extracts:

| Field | XPath | Notes |
| --- | --- | --- |
| last_name | recordTarget/patientRole/patient/name/family | |
| first_name | recordTarget/patientRole/patient/name/given | First <given> element only |
| dob | recordTarget/patientRole/patient/birthTime/@value | Reformatted to YYYY-MM-DD |
| mrn | recordTarget/patientRole/id/@extension | First <id> with an extension attribute |
| gender | recordTarget/patientRole/patient/administrativeGenderCode/@code | Mapped M→male, F→female, UN→unknown; otherwise the displayName |
| marital_status | recordTarget/patientRole/patient/maritalStatusCode/@displayName | |

The stable patient_id is built at this step (see Stage 3).
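
The header pass can be sketched as follows. The function name matches the one above, but the body is an assumption reconstructed from the field table, not the notebook's actual code; `format_dob`, `GENDER_MAP`, and the sample patient are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}
GENDER_MAP = {"M": "male", "F": "female", "UN": "unknown"}


def format_dob(raw):
    """Turn an HL7 timestamp such as 19800102 or 198001021200 into YYYY-MM-DD."""
    digits = (raw or "")[:8]
    if len(digits) == 8 and digits.isdigit():
        return f"{digits[:4]}-{digits[4:6]}-{digits[6:]}"
    return raw or ""


def parse_patient_header(root):
    """Sketch of the header pass; returns None when patientRole is missing."""
    role = root.find("hl7:recordTarget/hl7:patientRole", NS)
    if role is None:
        return None

    def text(path):
        node = role.find(path, NS)  # find() returns the first match only
        return (node.text or "").strip() if node is not None else ""

    def attr(path, name):
        node = role.find(path, NS)
        return node.get(name, "") if node is not None else ""

    # First <id> that carries a non-empty extension attribute.
    mrn = next((i.get("extension") for i in role.findall("hl7:id", NS)
                if i.get("extension")), "")
    gender_path = "hl7:patient/hl7:administrativeGenderCode"
    gender = GENDER_MAP.get(attr(gender_path, "code")) or attr(gender_path, "displayName")
    return {
        "last_name": text("hl7:patient/hl7:name/hl7:family"),
        "first_name": text("hl7:patient/hl7:name/hl7:given"),
        "dob": format_dob(attr("hl7:patient/hl7:birthTime", "value")),
        "mrn": mrn,
        "gender": gender,
        "marital_status": attr("hl7:patient/hl7:maritalStatusCode", "displayName"),
    }


# Made-up patient, only for illustration.
SAMPLE = """<ClinicalDocument xmlns="urn:hl7-org:v3">
  <recordTarget><patientRole>
    <id root="1.2.3"/>
    <id root="1.2.3.4" extension="MRN-001"/>
    <patient>
      <name><given>Ada</given><given>M</given><family>Example</family></name>
      <administrativeGenderCode code="F"/>
      <birthTime value="19800102"/>
      <maritalStatusCode code="M" displayName="Married"/>
    </patient>
  </patientRole></recordTarget>
</ClinicalDocument>"""

header = parse_patient_header(ET.fromstring(SAMPLE))
print(header)
```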

Pass 2: Sections

For each <section> under structuredBody:

  1. Map the section's LOINC code to a Registry Forge category via the SECTION_TO_CATEGORY table:

    SECTION_TO_CATEGORY = {
        '11450-4': 'problems',           # Problem list
        '10160-0': 'medications',        # History of medication use
        '47519-4': 'procedures',         # History of procedures
        '30954-2': 'labs_vitals',        # Lab results
        '8716-3':  'labs_vitals',        # Vital signs
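        '48765-2': 'allergies',          # Allergies and adverse reactions
        '11369-6': 'immunizations',      # History of immunizations
        '18776-5': 'careplans',          # Plan of care
        '18748-4': 'diagnostic_reports', # Diagnostic imaging
        '46239-0': 'encounters',         # Chief complaint and reason for visit
        '61146-7': 'goals',              # Goals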
        '29762-2': 'notes',              # Social history (notes)
        '10157-6': 'notes',              # Family history
    }
    
  2. For each <entry> under the section, drill into the first structured child (observation, substanceAdministration, procedure, encounter, act, organizer, supply).

  3. Build one long-format record dict.
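
The three steps above might look roughly like this. It is a sketch under stated assumptions: `iter_section_entries` and `first_structured_child` are hypothetical names, the category table is abbreviated, and skipping unmapped sections is a choice made for this example rather than documented behavior.

```python
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}
HL7 = "{urn:hl7-org:v3}"
SECTION_PATH = "hl7:component/hl7:structuredBody/hl7:component/hl7:section"

# Abbreviated copy of the table above.
SECTION_TO_CATEGORY = {"11450-4": "problems", "10160-0": "medications"}
STRUCTURED_TAGS = ("observation", "substanceAdministration", "procedure",
                   "encounter", "act", "organizer", "supply")


def first_structured_child(entry):
    """Return (local tag name, element) for the first structured child of an entry."""
    for child in entry:
        local = child.tag.replace(HL7, "")
        if local in STRUCTURED_TAGS:
            return local, child
    return None, None


def iter_section_entries(root):
    """Yield (category, entry_type, element) for each entry in a mapped section."""
    for section in root.findall(SECTION_PATH, NS):
        code = section.find("hl7:code", NS)
        loinc = code.get("code", "") if code is not None else ""
        category = SECTION_TO_CATEGORY.get(loinc)
        if category is None:
            continue  # assumption: this sketch skips unmapped sections
        for entry in section.findall("hl7:entry", NS):
            kind, node = first_structured_child(entry)
            if node is not None:
                yield category, kind, node


# Made-up document, only for illustration.
SAMPLE = """<ClinicalDocument xmlns="urn:hl7-org:v3">
  <component><structuredBody>
    <component><section><code code="11450-4"/>
      <entry><act/></entry>
    </section></component>
    <component><section><code code="10160-0"/>
      <entry><substanceAdministration/></entry>
      <entry><substanceAdministration/></entry>
    </section></component>
    <component><section><code code="99999-9"/>
      <entry><observation/></entry>
    </section></component>
  </structuredBody></component>
</ClinicalDocument>"""

found = [(cat, kind) for cat, kind, _ in iter_section_entries(ET.fromstring(SAMPLE))]
print(found)
```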

Coded fields per entry type

The "primary code" lives on different elements depending on the entry type:

| Entry type | Code lives on | Typical code system |
| --- | --- | --- |
| observation (problem, lab) | observation/code | SNOMED for problems, LOINC for labs |
| substanceAdministration (medication) | substanceAdministration/consumable/manufacturedProduct/manufacturedMaterial/code | RxNorm |
| procedure | procedure/code | CPT or SNOMED |
| encounter | encounter/code | CPT or HL7-ActCode |
| observation with participant/participantRole/playingEntity (allergy) | participant/participantRole/playingEntity/code | RxNorm (drug allergy) or SNOMED |

The parser tries these in order. The first non-empty code wins. All other codings (e.g. ICD-10 translations of SNOMED problems) are collected into all_codings_json.
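
A "first non-empty code wins" selection could be sketched like this. `CODE_PATHS`, `coding`, and `pick_codes` are hypothetical names, the paths are written relative to the entry's structured child, and the sample codes are illustrative. Treating a `<translation>` as the primary code when its parent `<code>` has no `@code` is a choice made for this sketch, not confirmed behavior.

```python
import json
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}

# Candidate locations relative to the structured child, tried in order.
CODE_PATHS = [
    "hl7:code",
    "hl7:consumable/hl7:manufacturedProduct/hl7:manufacturedMaterial/hl7:code",
    "hl7:participant/hl7:participantRole/hl7:playingEntity/hl7:code",
]


def coding(node):
    """Flatten a <code> or <translation> element into a plain dict."""
    return {
        "code": node.get("code", ""),
        "system": node.get("codeSystem", ""),
        "display": node.get("displayName", ""),
    }


def pick_codes(element):
    """Return (primary coding or None, JSON string of all other codings)."""
    primary, others = None, []
    for path in CODE_PATHS:
        node = element.find(path, NS)
        if node is None:
            continue
        for candidate in [node] + node.findall("hl7:translation", NS):
            if not candidate.get("code"):
                continue  # skip nullFlavor / empty codes
            if primary is None:
                primary = coding(candidate)
            else:
                others.append(coding(candidate))
    return primary, json.dumps(others)


# Made-up entries, only for illustration.
PROBLEM = """<observation xmlns="urn:hl7-org:v3">
  <code code="38341003" codeSystem="2.16.840.1.113883.6.96" displayName="Hypertension">
    <translation code="I10" codeSystem="2.16.840.1.113883.6.90"
                 displayName="Essential hypertension"/>
  </code>
</observation>"""
MEDICATION = """<substanceAdministration xmlns="urn:hl7-org:v3">
  <consumable><manufacturedProduct><manufacturedMaterial>
    <code code="12345" codeSystem="2.16.840.1.113883.6.88" displayName="Example drug"/>
  </manufacturedMaterial></manufacturedProduct></consumable>
</substanceAdministration>"""

problem_primary, problem_all = pick_codes(ET.fromstring(PROBLEM))
med_primary, med_all = pick_codes(ET.fromstring(MEDICATION))
print(problem_primary, problem_all)
print(med_primary, med_all)
```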

Value extraction

Lab/vital observations have a <value> element whose type is declared by xsi:type:

| xsi:type | What it means | How we parse |
| --- | --- | --- |
| PQ (Physical Quantity) | Numeric + unit | value = @value, unit = @unit |
| CD / CO (Coded) | Coded result | value = @displayName or @code, unit = '' |
| ST / ED (String / Encapsulated Data) | Free text | value = element text |
| (absent type) | Fallback | value = @value if present, else element text |
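
The dispatch in the table above can be written as a small function. This is a sketch: `extract_value` is a hypothetical name, and returning an empty unit for the free-text and fallback rows is an assumption, since the table does not specify one.

```python
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}
# ElementTree exposes xsi:type under its fully qualified attribute name.
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"


def extract_value(observation):
    """Return (value, unit) for an observation's <value> element."""
    node = observation.find("hl7:value", NS)
    if node is None:
        return "", ""
    # The type may carry a prefix (e.g. "cda:PQ"); keep only the local part.
    vtype = node.get(XSI_TYPE, "").split(":")[-1]
    text = "".join(node.itertext()).strip()
    if vtype == "PQ":
        return node.get("value", ""), node.get("unit", "")
    if vtype in ("CD", "CO"):
        return node.get("displayName") or node.get("code", ""), ""
    if vtype in ("ST", "ED"):
        return text, ""
    # No recognized type: prefer @value, else the element text.
    return node.get("value") or text, ""


# Made-up observations, only for illustration.
SAMPLE = """<section xmlns="urn:hl7-org:v3"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <observation><value xsi:type="PQ" value="5.4" unit="mmol/L"/></observation>
  <observation><value xsi:type="CD" code="260385009" displayName="Negative"/></observation>
  <observation><value xsi:type="ST">trace amounts</value></observation>
  <observation><value value="12"/></observation>
</section>"""

values = [extract_value(o) for o in ET.fromstring(SAMPLE).findall("hl7:observation", NS)]
print(values)
```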

Narrative fallback

If a section has zero <entry> elements but does have a <text> block, the parser emits one record with:

  • category = the section's normal category (for example, notes)
  • code = '', code_system = ''
  • text = flattened narrative

This is common in patient-portal exports - some vendors ship structured entries, others ship only narrative. The fallback ensures no text is silently dropped.
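
A minimal sketch of the fallback, assuming whitespace-normalized text is an acceptable "flattened" form. `flatten_narrative` and `narrative_record` are hypothetical names, and the record carries only the fields listed above.

```python
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}


def flatten_narrative(text_el):
    """Join all text nodes under <text> and collapse runs of whitespace."""
    return " ".join(" ".join(text_el.itertext()).split())


def narrative_record(section, category):
    """Return one narrative-only record, or None when the fallback does not apply."""
    if section.find("hl7:entry", NS) is not None:
        return None  # structured entries exist; the normal path handles them
    text_el = section.find("hl7:text", NS)
    if text_el is None:
        return None
    flat = flatten_narrative(text_el)
    if not flat:
        return None
    return {"category": category, "code": "", "code_system": "", "text": flat}


# Made-up section, only for illustration.
SAMPLE = """<section xmlns="urn:hl7-org:v3">
  <code code="29762-2"/>
  <text><paragraph>Never   smoker.</paragraph><list><item>Reviewed 2024</item></list></text>
</section>"""

record = narrative_record(ET.fromstring(SAMPLE), "notes")
print(record)
```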

OID-to-vocabulary mapping

C-CDA codeSystem attributes are numeric OIDs. We map the common ones:

OID_TO_VOCAB = {
    '2.16.840.1.113883.6.96':  'SNOMED-CT',
    '2.16.840.1.113883.6.88':  'RxNorm',
    '2.16.840.1.113883.6.1':   'LOINC',
    '2.16.840.1.113883.6.103': 'ICD-9-CM',
    '2.16.840.1.113883.6.90':  'ICD-10-CM',
    '2.16.840.1.113883.6.12':  'CPT-4',
    '2.16.840.1.113883.6.14':  'HCPCS',
    '2.16.840.1.113883.12.292':'CVX',
    ...
}

Unknown OIDs fall back to the document's codeSystemName attribute when present. Genuinely unknown systems land in the output as-is - we don't drop them, just don't normalize them.
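
The lookup order described above is short enough to show in full. `resolve_vocab` is a hypothetical helper name, and the table here is an abbreviated copy.

```python
# Abbreviated copy of the table above.
OID_TO_VOCAB = {
    "2.16.840.1.113883.6.96": "SNOMED-CT",
    "2.16.840.1.113883.6.88": "RxNorm",
    "2.16.840.1.113883.6.1": "LOINC",
}


def resolve_vocab(oid, code_system_name=None):
    """Known OID -> vocabulary name; else codeSystemName; else the raw OID."""
    if oid in OID_TO_VOCAB:
        return OID_TO_VOCAB[oid]
    return code_system_name or oid


print(resolve_vocab("2.16.840.1.113883.6.96"))   # known OID
print(resolve_vocab("1.2.3.4", "LocalCodes"))    # unknown OID, name present
print(resolve_vocab("1.2.3.4"))                  # unknown OID, kept as-is
```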

Error handling

parse_ccda() returns (demographics, records, error):

  • On parse failure (malformed XML): (None, [], 'parse_error: ...').
  • On wrong root element: (None, [], 'not a ClinicalDocument: <...>').
  • On missing patient header: (None, [], 'no recordTarget/patientRole').
  • On success: (demographics_dict, [record_dict, ...], None).

Stage 3 collects the failures into an errors list without halting the run.
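
The return contract and the way a caller might collect failures can be sketched as below. This is not the real `parse_ccda()`: it takes an XML string (the actual function may take a path), the header and section passes are placeholders, and `run` is a hypothetical stand-in for the Stage 3 loop.

```python
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}
ROOT_TAG = "{urn:hl7-org:v3}ClinicalDocument"


def parse_ccda(xml_text):
    """Return (demographics, records, error) following the contract above."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return None, [], f"parse_error: {exc}"
    if root.tag != ROOT_TAG:
        return None, [], f"not a ClinicalDocument: <{root.tag}>"
    if root.find("hl7:recordTarget/hl7:patientRole", NS) is None:
        return None, [], "no recordTarget/patientRole"
    demographics = {}  # placeholder for the header pass
    records = []       # placeholder for the section pass
    return demographics, records, None


def run(documents):
    """Parse every document; collect failures instead of raising."""
    parsed, errors = [], []
    for name, xml_text in documents.items():
        demographics, records, error = parse_ccda(xml_text)
        if error is not None:
            errors.append({"file": name, "error": error})
            continue
        parsed.append((demographics, records))
    return parsed, errors


# Made-up inputs, only for illustration.
DOCS = {
    "broken.xml": "<ClinicalDocument",
    "page.xml": "<html/>",
    "no_header.xml": '<ClinicalDocument xmlns="urn:hl7-org:v3"/>',
    "ok.xml": '<ClinicalDocument xmlns="urn:hl7-org:v3">'
              "<recordTarget><patientRole/></recordTarget></ClinicalDocument>",
}

parsed, errors = run(DOCS)
for item in errors:
    print(item["file"], "->", item["error"])
```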

Vendor-specific patterns the parser handles

Epic MyChart exports - the most common patient-portal source - follow several conventions that the parser explicitly recognizes:

Concern-act wrappers

Problems and allergies in Epic C-CDA 2.1 are wrapped in an <act classCode="ACT"><code code="CONC"/> envelope. The actual clinical content lives two levels down:

<entry>
  <act ...>
    <code code="CONC".../>                       ← wrapper, ignored
    <entryRelationship typeCode="SUBJ">
      <observation>
        <code code="64572001" displayName="Disease"/>     ← problem-template marker
        <value xsi:type="CD" code="253153000" .../>       ← the REAL diagnosis SNOMED
      </observation>
    </entryRelationship>
  </act>
</entry>

The parser recognizes the CONCERN_ACT_TEMPLATES set (2.16.840.1.113883.10.20.22.4.3 for problems, 2.16.840.1.113883.10.20.22.4.30 for allergies) and drills through to the inner observation. When the inner observation's <code> is a "problem observation placeholder" such as 64572001, the parser promotes the coded result on the <value> element to be the primary code.
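
The drill-through and code promotion might be sketched as follows. `unwrap_concern_act`, `primary_code`, and `PLACEHOLDER_CODES` are hypothetical names; the placeholder set contains only the one code mentioned above, and the real set is probably longer.

```python
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}
CONCERN_ACT_TEMPLATES = {
    "2.16.840.1.113883.10.20.22.4.3",   # problem concern act
    "2.16.840.1.113883.10.20.22.4.30",  # allergy concern act
}
PLACEHOLDER_CODES = {"64572001"}  # assumption: only the code named in the text


def template_ids(element):
    """Collect the templateId roots declared directly on an element."""
    return {t.get("root") for t in element.findall("hl7:templateId", NS)}


def unwrap_concern_act(act):
    """Return the inner observation of a concern act, else the element itself."""
    if template_ids(act) & CONCERN_ACT_TEMPLATES:
        inner = act.find("hl7:entryRelationship/hl7:observation", NS)
        if inner is not None:
            return inner
    return act


def primary_code(observation):
    """Promote the <value> code when <code> is only a template marker."""
    code = observation.find("hl7:code", NS)
    value = observation.find("hl7:value", NS)
    if (code is not None and code.get("code") in PLACEHOLDER_CODES
            and value is not None and value.get("code")):
        return value.get("code")
    return code.get("code", "") if code is not None else ""


# Based on the snippet above, with a templateId added for the sketch.
SAMPLE = """<entry xmlns="urn:hl7-org:v3"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <act classCode="ACT">
    <templateId root="2.16.840.1.113883.10.20.22.4.3"/>
    <code code="CONC"/>
    <entryRelationship typeCode="SUBJ">
      <observation>
        <code code="64572001" displayName="Disease"/>
        <value xsi:type="CD" code="253153000"/>
      </observation>
    </entryRelationship>
  </act>
</entry>"""

act = ET.fromstring(SAMPLE).find("hl7:act", NS)
inner = unwrap_concern_act(act)
print(inner.tag, primary_code(inner))
```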

Narrative anchor references

Epic populates structured entries with <text><reference value="#problem19name"/></text> pointing to anchored elements in the section's <text> table. The actual human-readable diagnosis name (e.g. "Infantile spasms without mention of intractable epilepsy") is in <td ID="problem19name">…</td> within the narrative.

The parser builds a per-section narrative index (build_narrative_index()) mapping every ID-tagged element to its rendered text, then resolves entry references against that index. Without this, the display_name column would be empty even for richly-coded entries.
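
A per-section index of this kind can be built in a few lines. The function name matches the one above, but the body is an assumption; `resolve_display_name` is a hypothetical helper, and the sample reuses the anchor ID from the example.

```python
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}


def build_narrative_index(section):
    """Map every ID-tagged element in the section narrative to its text."""
    index = {}
    text_el = section.find("hl7:text", NS)
    if text_el is None:
        return index
    for element in text_el.iter():
        element_id = element.get("ID")
        if element_id:
            # itertext() covers nested markup; split/join collapses whitespace.
            index[element_id] = " ".join("".join(element.itertext()).split())
    return index


def resolve_display_name(entry_element, index):
    """Follow <text><reference value="#id"/></text> into the narrative index."""
    ref = entry_element.find(".//hl7:text/hl7:reference", NS)
    if ref is None:
        return ""
    return index.get(ref.get("value", "").lstrip("#"), "")


# Made-up section, only for illustration.
SAMPLE = """<section xmlns="urn:hl7-org:v3">
  <text><table><tbody><tr>
    <td ID="problem19name">Infantile spasms without mention of
      intractable epilepsy</td>
  </tr></tbody></table></text>
  <entry><observation>
    <text><reference value="#problem19name"/></text>
  </observation></entry>
</section>"""

section = ET.fromstring(SAMPLE)
index = build_narrative_index(section)
observation = section.find("hl7:entry/hl7:observation", NS)
print(resolve_display_name(observation, index))
```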

BATTERY organizers

Lab results are grouped under <organizer classCode="BATTERY"> with the panel's order code (e.g. CPT-4 71046) on the organizer itself and individual measurements as <component><observation> children. The parser emits the organizer as one record (status "panel") and each component as its own record with its proper LOINC, value, and unit.
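
The panel-plus-components expansion could look like this. `organizer_records` is a hypothetical name, the record fields are reduced to four for brevity, and the sample panel and its codes are illustrative.

```python
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}


def organizer_records(organizer):
    """Emit one 'panel' record for the organizer, then one per component."""
    panel = organizer.find("hl7:code", NS)
    records = [{
        "code": panel.get("code", "") if panel is not None else "",
        "status": "panel",
        "value": "",
        "unit": "",
    }]
    for obs in organizer.findall("hl7:component/hl7:observation", NS):
        code = obs.find("hl7:code", NS)
        value = obs.find("hl7:value", NS)
        status = obs.find("hl7:statusCode", NS)
        records.append({
            "code": code.get("code", "") if code is not None else "",
            "status": status.get("code", "") if status is not None else "",
            "value": value.get("value", "") if value is not None else "",
            "unit": value.get("unit", "") if value is not None else "",
        })
    return records


# Made-up panel, only for illustration.
SAMPLE = """<organizer xmlns="urn:hl7-org:v3" classCode="BATTERY"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <code code="80048" codeSystem="2.16.840.1.113883.6.12"/>
  <component><observation>
    <code code="2345-7" codeSystem="2.16.840.1.113883.6.1"/>
    <statusCode code="completed"/>
    <value xsi:type="PQ" value="95" unit="mg/dL"/>
  </observation></component>
  <component><observation>
    <code code="2951-2" codeSystem="2.16.840.1.113883.6.1"/>
    <statusCode code="completed"/>
    <value xsi:type="PQ" value="140" unit="mmol/L"/>
  </observation></component>
</organizer>"""

panel_records = organizer_records(ET.fromstring(SAMPLE))
for rec in panel_records:
    print(rec)
```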

Allergy substance lookup

In allergy observations (template 2.16.840.1.113883.10.20.22.4.7), the <code> is ASSERTION (a meaningless wrapper) and the actual allergen lives on participant/participantRole/playingEntity/code. The parser detects allergy templates and pulls the substance code from playingEntity.
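
A sketch of the substance lookup, assuming template detection happens on the observation's own templateId elements. `allergy_substance` is a hypothetical name, the returned field names are illustrative, and the sample allergen is made-up data.

```python
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}
ALLERGY_OBSERVATION_TEMPLATE = "2.16.840.1.113883.10.20.22.4.7"
SUBSTANCE_PATH = "hl7:participant/hl7:participantRole/hl7:playingEntity/hl7:code"


def allergy_substance(observation):
    """Return the allergen coding for an allergy observation, else None."""
    roots = {t.get("root") for t in observation.findall("hl7:templateId", NS)}
    if ALLERGY_OBSERVATION_TEMPLATE not in roots:
        return None  # not an allergy observation
    node = observation.find(SUBSTANCE_PATH, NS)
    if node is None or not node.get("code"):
        return None
    return {
        "code": node.get("code"),
        "code_system": node.get("codeSystem", ""),
        "display_name": node.get("displayName", ""),
    }


# Made-up allergy observation, only for illustration.
SAMPLE = """<observation xmlns="urn:hl7-org:v3">
  <templateId root="2.16.840.1.113883.10.20.22.4.7"/>
  <code code="ASSERTION" codeSystem="2.16.840.1.113883.5.4"/>
  <participant typeCode="CSM"><participantRole><playingEntity>
    <code code="70618" codeSystem="2.16.840.1.113883.6.88" displayName="Penicillin"/>
  </playingEntity></participantRole></participant>
</observation>"""

substance = allergy_substance(ET.fromstring(SAMPLE))
print(substance)
```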

If you hit a vendor whose exports use a different pattern, the diagnostic cell (notebook section 3b) is the fastest way to see what's actually in your files; share the output in a GitHub issue and we can add handling.

What we don't extract

  • Document-level <author>. Patient-portal exports often list the patient as their own author. We don't surface it.
  • <participant> other than allergy substances. Encounter participants, ordering providers, performing providers - all dropped.
  • Free-text reasons inside structured entries. A medication entry might include a <text> block with the prescription instructions; we capture the medication code but not the SIG. (You can recover it from raw_record_json if needed.)
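  • C-CDA templates and templateIds. Template OIDs are not emitted as output fields, and category routing relies on section LOINC codes. The parser inspects templateIds only to detect the concern-act and allergy patterns described above.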

Adding any of these is a one-function change in the notebook.