Stage 2 - C-CDA parsing
The core of Patient Edition. For each XML file produced by Stage 1, this stage extracts demographics from the document header and walks every section under structuredBody to produce one long-format record per <entry>.
The document model
A C-CDA Continuity of Care Document looks like this in structure:
ClinicalDocument
├── recordTarget/patientRole              ← demographics (Stage 2.1)
│   ├── id (MRN)
│   └── patient
│       ├── name (given, family)
│       ├── administrativeGenderCode
│       ├── birthTime
│       └── maritalStatusCode
├── code                                  ← document type (LOINC, e.g. 34133-9)
├── effectiveTime                         ← document date
├── custodian/.../name                    ← source organization
└── component/structuredBody
    └── component/section (×N)            ← each clinical section (Stage 2.2)
        ├── code                          ← section type (LOINC)
        ├── title
        ├── text                          ← narrative
        └── entry (×N)
            └── observation | substanceAdministration | procedure | encounter | act
                ├── code                  ← the clinical code (SNOMED, RxNorm, LOINC, ...)
                ├── effectiveTime
                ├── value                 ← lab value (PQ, CD, ST types)
                └── statusCode
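To make the tree concrete, here is a minimal sketch of walking it with Python's standard-library ElementTree. The sample document is invented for illustration, and list_sections is a hypothetical helper, not a function from the notebook. The one fixed fact is that C-CDA elements live in the urn:hl7-org:v3 namespace, so every path step needs a prefix.

```python
import xml.etree.ElementTree as ET

# C-CDA elements live in the HL7 v3 namespace.
NS = {"hl7": "urn:hl7-org:v3"}

# Invented, heavily trimmed sample document.
SAMPLE = """<ClinicalDocument xmlns="urn:hl7-org:v3">
  <code code="34133-9" codeSystem="2.16.840.1.113883.6.1"/>
  <effectiveTime value="20240115"/>
  <component><structuredBody>
    <component><section>
      <code code="11450-4" codeSystem="2.16.840.1.113883.6.1"/>
      <title>Problems</title>
      <entry><observation/></entry>
      <entry><observation/></entry>
    </section></component>
    <component><section>
      <code code="10160-0" codeSystem="2.16.840.1.113883.6.1"/>
      <title>Medications</title>
    </section></component>
  </structuredBody></component>
</ClinicalDocument>"""

SECTION_PATH = "hl7:component/hl7:structuredBody/hl7:component/hl7:section"

def list_sections(root):
    """Return (section LOINC, title, entry count) for each section (hypothetical helper)."""
    out = []
    for sec in root.findall(SECTION_PATH, NS):
        code = sec.find("hl7:code", NS)
        out.append((
            code.get("code", "") if code is not None else "",
            sec.findtext("hl7:title", default="", namespaces=NS),
            len(sec.findall("hl7:entry", NS)),
        ))
    return out

root = ET.fromstring(SAMPLE)
doc_type = root.find("hl7:code", NS).get("code")
sections = list_sections(root)
```

A section with an entry count of zero, like the second one here, is the case the narrative fallback described later is meant for.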
Two-pass logic
Pass 1: Header
parse_patient_header(root) extracts:
| Field | XPath | Notes |
|---|---|---|
| last_name | recordTarget/patientRole/patient/name/family | |
| first_name | recordTarget/patientRole/patient/name/given | First <given> element only |
| dob | recordTarget/patientRole/patient/birthTime/@value | Reformatted to YYYY-MM-DD |
| mrn | recordTarget/patientRole/id/@extension | First <id> with an extension attribute |
| gender | recordTarget/patientRole/patient/administrativeGenderCode/@code | Mapped M→male, F→female, UN→unknown; otherwise the displayName |
| marital_status | recordTarget/patientRole/patient/maritalStatusCode/@displayName | |
The stable patient_id is built at this step (see Stage 3).
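The field table can be read as code roughly like the sketch below. It is an approximation written from the table, not the notebook's parse_patient_header: the name parse_patient_header_sketch, the helper functions, and the sample XML are all assumptions, and it does not build patient_id.

```python
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}
GENDER_MAP = {"M": "male", "F": "female", "UN": "unknown"}

def _attr(parent, path, name):
    """Attribute of the first element matching path, or '' when missing."""
    el = parent.find(path, NS) if parent is not None else None
    return el.get(name, "") if el is not None else ""

def _text(parent, path):
    """Text of the first element matching path, or '' when missing."""
    if parent is None:
        return ""
    return (parent.findtext(path, default="", namespaces=NS) or "").strip()

def format_dob(raw):
    """Turn an HL7 timestamp such as 19800315 or 19800315120000 into YYYY-MM-DD."""
    if len(raw) < 8 or not raw[:8].isdigit():
        return ""
    return f"{raw[:4]}-{raw[4:6]}-{raw[6:8]}"

def parse_patient_header_sketch(root):
    role = root.find("hl7:recordTarget/hl7:patientRole", NS)
    if role is None:
        return None
    patient = role.find("hl7:patient", NS)
    # The first <id> that carries an extension attribute is treated as the MRN.
    mrn = next((i.get("extension") for i in role.findall("hl7:id", NS)
                if i.get("extension")), "")
    gender_code = _attr(patient, "hl7:administrativeGenderCode", "code")
    return {
        "last_name": _text(patient, "hl7:name/hl7:family"),
        "first_name": _text(patient, "hl7:name/hl7:given"),  # first <given> only
        "dob": format_dob(_attr(patient, "hl7:birthTime", "value")),
        "mrn": mrn,
        "gender": GENDER_MAP.get(gender_code)
                  or _attr(patient, "hl7:administrativeGenderCode", "displayName"),
        "marital_status": _attr(patient, "hl7:maritalStatusCode", "displayName"),
    }

# Invented sample header.
SAMPLE = """<ClinicalDocument xmlns="urn:hl7-org:v3">
  <recordTarget><patientRole>
    <id root="1.2.3"/>
    <id root="1.2.3.4" extension="MRN001"/>
    <patient>
      <name><given>Ada</given><given>Marie</given><family>Example</family></name>
      <administrativeGenderCode code="F"/>
      <birthTime value="19800315"/>
      <maritalStatusCode code="M" displayName="Married"/>
    </patient>
  </patientRole></recordTarget>
</ClinicalDocument>"""

demographics = parse_patient_header_sketch(ET.fromstring(SAMPLE))
```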
Pass 2: Sections
For each <section> under structuredBody:
- Map the section's LOINC code to a Registry Forge category via the SECTION_TO_CATEGORY table:
  SECTION_TO_CATEGORY = {
      '11450-4': 'problems',            # Problem list
      '10160-0': 'medications',         # History of medication use
      '47519-4': 'procedures',          # History of procedures
      '30954-2': 'labs_vitals',         # Lab results
      '8716-3': 'labs_vitals',          # Vital signs
      '48765-2': 'allergies',
      '11369-6': 'immunizations',
      '18776-5': 'careplans',
      '18748-4': 'diagnostic_reports',
      '46239-0': 'encounters',
      '61146-7': 'goals',
      '29762-2': 'notes',               # Social history (notes)
      '10157-6': 'notes',               # Family history
  }
- For each <entry> under the section, drill into the first structured child (observation, substanceAdministration, procedure, encounter, act, organizer, supply).
- Build one long-format record dict.
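The three steps above can be sketched as a generator over sections and entries. This is an illustration, not the notebook's code: iter_section_entries and first_structured_child are hypothetical names, the category table is abbreviated, and the 'uncategorized' default for unmapped LOINC codes is an assumption, since the text does not say what happens to them.

```python
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}
HL7_PREFIX = "{urn:hl7-org:v3}"

# Abbreviated copy of the table above, for illustration only.
SECTION_TO_CATEGORY = {"11450-4": "problems", "10160-0": "medications"}

STRUCTURED_CHILDREN = ("observation", "substanceAdministration", "procedure",
                       "encounter", "act", "organizer", "supply")
SECTION_PATH = "hl7:component/hl7:structuredBody/hl7:component/hl7:section"

def first_structured_child(entry):
    """Return (local tag name, element) for the first structured child of an <entry>."""
    for child in entry:
        if not isinstance(child.tag, str):  # skip comments / processing instructions
            continue
        local = child.tag.replace(HL7_PREFIX, "")
        if local in STRUCTURED_CHILDREN:
            return local, child
    return None, None

def iter_section_entries(root):
    """Yield (category, entry type, element) for every structured entry."""
    for sec in root.findall(SECTION_PATH, NS):
        code = sec.find("hl7:code", NS)
        loinc = code.get("code", "") if code is not None else ""
        # Assumption: unmapped sections get a placeholder category.
        category = SECTION_TO_CATEGORY.get(loinc, "uncategorized")
        for entry in sec.findall("hl7:entry", NS):
            kind, el = first_structured_child(entry)
            if el is not None:
                yield category, kind, el

SAMPLE = """<ClinicalDocument xmlns="urn:hl7-org:v3"><component><structuredBody>
  <component><section><code code="11450-4"/>
    <entry><act/></entry>
    <entry><observation/></entry>
  </section></component>
  <component><section><code code="10160-0"/>
    <entry><substanceAdministration/></entry>
  </section></component>
</structuredBody></component></ClinicalDocument>"""

found = [(cat, kind) for cat, kind, _ in iter_section_entries(ET.fromstring(SAMPLE))]
```

Building the record dict from each yielded element is where the code, value, and vocabulary handling described in the following sections would plug in.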
Coded fields per entry type
The "primary code" lives on different elements depending on the entry type:
| Entry type | Code lives on | Example |
|---|---|---|
| observation (problem, lab) | observation/code | SNOMED for problems, LOINC for labs |
| substanceAdministration (medication) | substanceAdministration/consumable/manufacturedProduct/manufacturedMaterial/code | RxNorm |
| procedure | procedure/code | CPT or SNOMED |
| encounter | encounter/code | CPT or HL7-ActCode |
| observation with participant/participantRole/playingEntity (allergy) | participant/participantRole/playingEntity/code | RxNorm (drug allergy) or SNOMED |
The parser tries these in order. The first non-empty code wins. All other codings (e.g. ICD-10 translations of SNOMED problems) are collected into all_codings_json.
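A minimal sketch of "first non-empty code wins, the rest go to all_codings_json" might look like this. pick_codes, coding, and CODE_PATHS are hypothetical names, and treating <translation> children as the source of the extra codings is an assumption based on the ICD-10 example in the text. Allergy observations need the template-aware handling described under "Allergy substance lookup" and are not covered here.

```python
import json
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}

# Candidate code locations, relative to the structured entry element, in priority order.
CODE_PATHS = [
    "hl7:code",
    "hl7:consumable/hl7:manufacturedProduct/hl7:manufacturedMaterial/hl7:code",
    "hl7:participant/hl7:participantRole/hl7:playingEntity/hl7:code",
]

def coding(el):
    return {
        "code": el.get("code", ""),
        "code_system": el.get("codeSystem", ""),
        "display_name": el.get("displayName", ""),
    }

def pick_codes(entry_el):
    """Return (primary coding or None, JSON string of every other coding found)."""
    primary, others = None, []
    for path in CODE_PATHS:
        for code_el in entry_el.findall(path, NS):
            # A code element may carry <translation> children (e.g. ICD-10 for SNOMED).
            for cand in [code_el] + code_el.findall("hl7:translation", NS):
                if not cand.get("code"):
                    continue  # empty codes never win
                if primary is None:
                    primary = coding(cand)
                else:
                    others.append(coding(cand))
    return primary, json.dumps(others)

PROBLEM = ET.fromstring("""<observation xmlns="urn:hl7-org:v3">
  <code code="38341003" codeSystem="2.16.840.1.113883.6.96">
    <translation code="I10" codeSystem="2.16.840.1.113883.6.90"/>
  </code>
</observation>""")

MEDICATION = ET.fromstring("""<substanceAdministration xmlns="urn:hl7-org:v3">
  <consumable><manufacturedProduct><manufacturedMaterial>
    <code code="197361" codeSystem="2.16.840.1.113883.6.88"/>
  </manufacturedMaterial></manufacturedProduct></consumable>
</substanceAdministration>""")

problem_primary, problem_all = pick_codes(PROBLEM)
med_primary, med_all = pick_codes(MEDICATION)
```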
Value extraction
Lab/vital observations have a <value> element whose type is declared by xsi:type:
| xsi:type | What it means | How we parse |
|---|---|---|
| PQ (Physical Quantity) | Numeric + unit | value = @value, unit = @unit |
| CD / CO (Coded) | Coded result | value = @displayName or @code, unit = '' |
| ST / ED (String / Encapsulated Data) | Free text | value = element text |
| (absent type) | Fallback | value = @value if present, else element text |
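The dispatch in the table translates fairly directly into code. The sketch below is one way to write it; extract_value is a hypothetical name, and stripping a namespace prefix from xsi:type (e.g. "hl7:PQ") is an extra precaution the text does not mention.

```python
import xml.etree.ElementTree as ET

# ElementTree expands xsi:type to this fully qualified attribute name.
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

def extract_value(value_el):
    """Return (value, unit) for a <value> element, following the table above."""
    if value_el is None:
        return "", ""
    # Keep only the local part in case the type is written with a prefix.
    vtype = value_el.get(XSI_TYPE, "").split(":")[-1]
    text = "".join(value_el.itertext()).strip()
    if vtype == "PQ":
        return value_el.get("value", ""), value_el.get("unit", "")
    if vtype in ("CD", "CO"):
        return value_el.get("displayName") or value_el.get("code", ""), ""
    if vtype in ("ST", "ED"):
        return text, ""
    # Absent or unrecognized type: prefer @value, else the element text.
    return value_el.get("value") or text, ""

XSI = 'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"'

def demo(xml):
    return extract_value(ET.fromstring(xml))

pq = demo(f'<value {XSI} xsi:type="PQ" value="5.4" unit="mmol/L"/>')
cd = demo(f'<value {XSI} xsi:type="CD" code="10828004" displayName="Positive"/>')
st = demo(f'<value {XSI} xsi:type="ST">trace</value>')
untyped = demo('<value value="7"/>')
```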
Narrative fallback
If a section has zero <entry> elements but does have a <text> block, the parser emits one record with:
- category = notes (or whatever the section's normal category is)
- code = '', code_system = ''
- text = flattened narrative
This is common in patient-portal exports - some vendors ship structured entries, others ship only narrative. The fallback ensures no text is silently dropped.
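As a rough sketch of the fallback, assuming that "flattened" means joining all text nodes and collapsing whitespace (the notebook may flatten differently); narrative_fallback and flatten_narrative are hypothetical names:

```python
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}

def flatten_narrative(text_el):
    """Join every text node under <text> and collapse runs of whitespace."""
    return " ".join(" ".join(text_el.itertext()).split())

def narrative_fallback(section, category):
    """Return one narrative-only record, or None when the fallback does not apply."""
    if section.find("hl7:entry", NS) is not None:
        return None  # structured entries exist; the normal path handles them
    text_el = section.find("hl7:text", NS)
    if text_el is None:
        return None
    flat = flatten_narrative(text_el)
    if not flat:
        return None
    return {"category": category, "code": "", "code_system": "", "text": flat}

# Invented narrative-only section.
SECTION = ET.fromstring("""<section xmlns="urn:hl7-org:v3">
  <code code="29762-2"/>
  <text>
    <paragraph>Never smoker.</paragraph>
    <paragraph>Drinks  socially.</paragraph>
  </text>
</section>""")

record = narrative_fallback(SECTION, "notes")
```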
OID-to-vocabulary mapping
C-CDA codeSystem attributes are numeric OIDs. We map the common ones:
OID_TO_VOCAB = {
'2.16.840.1.113883.6.96': 'SNOMED-CT',
'2.16.840.1.113883.6.88': 'RxNorm',
'2.16.840.1.113883.6.1': 'LOINC',
'2.16.840.1.113883.6.103': 'ICD-9-CM',
'2.16.840.1.113883.6.90': 'ICD-10-CM',
'2.16.840.1.113883.6.12': 'CPT-4',
'2.16.840.1.113883.6.14': 'HCPCS',
'2.16.840.1.113883.12.292':'CVX',
...
}
Unknown OIDs fall back to the coded element's codeSystemName attribute when present. Genuinely unknown systems land in the output as-is - we don't drop them, we just don't normalize them.
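The lookup order can be sketched as a small function. resolve_vocab is a hypothetical name, the mapping is abbreviated, and reading "as-is" as "emit the raw OID" is an assumption.

```python
# Abbreviated copy of the mapping above, for illustration only.
OID_TO_VOCAB = {
    "2.16.840.1.113883.6.96": "SNOMED-CT",
    "2.16.840.1.113883.6.88": "RxNorm",
    "2.16.840.1.113883.6.1": "LOINC",
}

def resolve_vocab(code_system, code_system_name=""):
    """Map a codeSystem OID to a vocabulary label.

    Order: known OID, then the element's codeSystemName, then the raw OID
    (the last step is an assumption about what 'as-is' means).
    """
    if code_system in OID_TO_VOCAB:
        return OID_TO_VOCAB[code_system]
    return code_system_name or code_system

known = resolve_vocab("2.16.840.1.113883.6.96", "SNOMED CT")
named = resolve_vocab("1.2.3.4", "LocalCodes")
raw = resolve_vocab("1.2.3.4")
```

Note that a known OID takes precedence over codeSystemName, so vendor spelling variants such as "SNOMED CT" versus "SNOMED-CT" normalize to one label.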
Error handling
parse_ccda() returns (demographics, records, error):
- On parse failure (malformed XML): (None, [], 'parse_error: ...').
- On wrong root element: (None, [], 'not a ClinicalDocument: <...>').
- On missing patient header: (None, [], 'no recordTarget/patientRole').
- On success: (demographics_dict, [record_dict, ...], None).
Stage 3 collects the failures into an errors list without halting the run.
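The return contract and the "collect, don't halt" behavior can be sketched as follows. This is not the notebook's parse_ccda: parse_ccda_sketch and parse_many are hypothetical, the demographics and records are stubbed out, and the exact text after each error prefix (elided as "..." above) is a guess.

```python
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}

def parse_ccda_sketch(xml_text):
    """Return (demographics, records, error) following the contract above."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return None, [], f"parse_error: {exc}"
    local_name = root.tag.split("}")[-1]  # drop the {namespace} prefix
    if local_name != "ClinicalDocument":
        return None, [], f"not a ClinicalDocument: <{local_name}>"
    if root.find("hl7:recordTarget/hl7:patientRole", NS) is None:
        return None, [], "no recordTarget/patientRole"
    demographics = {}  # stub: header parsing would go here
    records = []       # stub: section parsing would go here
    return demographics, records, None

def parse_many(docs):
    """Parse every document, collecting failures instead of raising."""
    parsed, errors = [], []
    for name, xml_text in docs.items():
        demographics, records, error = parse_ccda_sketch(xml_text)
        if error is not None:
            errors.append({"file": name, "error": error})
            continue
        parsed.append((name, demographics, records))
    return parsed, errors

DOCS = {
    "broken.xml": "<a><b></a>",
    "wrong_root.xml": '<Foo xmlns="urn:hl7-org:v3"/>',
    "no_header.xml": '<ClinicalDocument xmlns="urn:hl7-org:v3"/>',
    "ok.xml": '<ClinicalDocument xmlns="urn:hl7-org:v3">'
              "<recordTarget><patientRole/></recordTarget></ClinicalDocument>",
}

parsed, errors = parse_many(DOCS)
```

Returning the error as a value rather than raising keeps one bad file from aborting a batch of hundreds, which matters for portal exports of uneven quality.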
Vendor-specific patterns the parser handles
Epic MyChart exports - the most common patient-portal source - follow several conventions that the parser explicitly recognizes:
Concern-act wrappers
Problems and allergies in Epic C-CDA 2.1 are wrapped in an <act classCode="ACT"><code code="CONC"/> envelope. The actual clinical content lives two levels down:
<entry>
  <act ...>
    <code code="CONC".../>                                ← wrapper, ignored
    <entryRelationship typeCode="SUBJ">
      <observation>
        <code code="64572001" displayName="Disease"/>     ← problem-template marker
        <value xsi:type="CD" code="253153000" .../>       ← the REAL diagnosis SNOMED
      </observation>
    </entryRelationship>
  </act>
</entry>
The parser recognizes the CONCERN_ACT_TEMPLATES set (2.16.840.1.113883.10.20.22.4.3 for problems, .30 for allergies) and drills through to the inner observation. When it sees a "problem observation placeholder" code like 64572001, it promotes the <value> element's coded result to be the primary code.
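A minimal sketch of the drill-through and the code promotion, under a few assumptions: unwrap_concern_act and promoted_code are hypothetical names, PLACEHOLDER_CODES holds only the one code the text mentions, and the sample act carries a templateId that the abbreviated XML above does not show.

```python
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}

CONCERN_ACT_TEMPLATES = {
    "2.16.840.1.113883.10.20.22.4.3",   # problem concern act
    "2.16.840.1.113883.10.20.22.4.30",  # allergy concern act
}
# Assumption: only the placeholder named in the text; the real set may be larger.
PLACEHOLDER_CODES = {"64572001"}

def unwrap_concern_act(act):
    """Return the inner observation of a concern act, or the element unchanged."""
    roots = {t.get("root") for t in act.findall("hl7:templateId", NS)}
    if not roots & CONCERN_ACT_TEMPLATES:
        return act
    inner = act.find("hl7:entryRelationship/hl7:observation", NS)
    return inner if inner is not None else act

def promoted_code(observation):
    """Use <value>'s code when <code> is only a placeholder marker."""
    code = observation.find("hl7:code", NS)
    value = observation.find("hl7:value", NS)
    if (code is not None and code.get("code") in PLACEHOLDER_CODES
            and value is not None and value.get("code")):
        return value.get("code")
    return code.get("code", "") if code is not None else ""

ENTRY = ET.fromstring("""<entry xmlns="urn:hl7-org:v3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <act classCode="ACT" moodCode="EVN">
    <templateId root="2.16.840.1.113883.10.20.22.4.3"/>
    <code code="CONC"/>
    <entryRelationship typeCode="SUBJ">
      <observation>
        <code code="64572001" displayName="Disease"/>
        <value xsi:type="CD" code="253153000"/>
      </observation>
    </entryRelationship>
  </act>
</entry>""")

inner = unwrap_concern_act(ENTRY.find("hl7:act", NS))
primary = promoted_code(inner)
```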
Narrative anchor references
Epic populates structured entries with <text><reference value="#problem19name"/></text> pointing to anchored elements in the section's <text> table. The actual human-readable diagnosis name (e.g. "Infantile spasms without mention of intractable epilepsy") is in <td ID="problem19name">…</td> within the narrative.
The parser builds a per-section narrative index (build_narrative_index()) mapping every ID-tagged element to its rendered text, then resolves entry references against that index. Without this, the display_name column would be empty even for richly-coded entries.
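The index-then-resolve idea can be sketched like this. The function names carry a _sketch suffix because they are written from the description, not copied from the notebook's build_narrative_index(), and the sample section is trimmed to a single table cell.

```python
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}

def build_narrative_index_sketch(section):
    """Map every ID-tagged element inside the section's <text> to its flattened text."""
    index = {}
    text_el = section.find("hl7:text", NS)
    if text_el is None:
        return index
    for el in text_el.iter():
        el_id = el.get("ID")
        if el_id:
            index[el_id] = " ".join("".join(el.itertext()).split())
    return index

def resolve_reference_sketch(entry, index):
    """Follow <text><reference value="#id"/></text> inside an entry to its narrative text."""
    ref = entry.find(".//hl7:text/hl7:reference", NS)
    if ref is None:
        return ""
    return index.get(ref.get("value", "").lstrip("#"), "")

SECTION = ET.fromstring("""<section xmlns="urn:hl7-org:v3">
  <text><table><tbody><tr>
    <td ID="problem19name">Infantile spasms without mention of intractable epilepsy</td>
  </tr></tbody></table></text>
  <entry><observation>
    <text><reference value="#problem19name"/></text>
  </observation></entry>
</section>""")

index = build_narrative_index_sketch(SECTION)
display_name = resolve_reference_sketch(SECTION.find("hl7:entry", NS), index)
```

Building the index once per section keeps reference resolution to a dictionary lookup per entry instead of a fresh tree search each time.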
BATTERY organizers
Lab results are grouped under <organizer classCode="BATTERY"> with the panel's order code (e.g. CPT-4 71046) on the organizer itself and individual measurements as <component><observation> children. The parser emits the organizer as one record (status "panel") and each component as its own record with its proper LOINC, value, and unit.
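A sketch of the one-panel-plus-N-components expansion follows. records_from_organizer is a hypothetical name, the record keys are a reduced illustration of the long format, and the codes in the sample are placeholders chosen only to make the example parse.

```python
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}

def _code(el):
    code = el.find("hl7:code", NS)
    return code.get("code", "") if code is not None else ""

def records_from_organizer(organizer):
    """Emit one 'panel' record for the organizer, then one record per component."""
    records = [{"code": _code(organizer), "status": "panel", "value": "", "unit": ""}]
    for obs in organizer.findall("hl7:component/hl7:observation", NS):
        value = obs.find("hl7:value", NS)
        status = obs.find("hl7:statusCode", NS)
        records.append({
            "code": _code(obs),
            "status": status.get("code", "") if status is not None else "",
            # Simplification: assumes PQ-style values carrying @value and @unit.
            "value": value.get("value", "") if value is not None else "",
            "unit": value.get("unit", "") if value is not None else "",
        })
    return records

ORGANIZER = ET.fromstring("""<organizer xmlns="urn:hl7-org:v3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" classCode="BATTERY">
  <code code="80048" codeSystem="2.16.840.1.113883.6.12"/>
  <statusCode code="completed"/>
  <component><observation>
    <code code="2345-7" codeSystem="2.16.840.1.113883.6.1"/>
    <statusCode code="completed"/>
    <value xsi:type="PQ" value="95" unit="mg/dL"/>
  </observation></component>
  <component><observation>
    <code code="2160-0" codeSystem="2.16.840.1.113883.6.1"/>
    <statusCode code="completed"/>
    <value xsi:type="PQ" value="0.9" unit="mg/dL"/>
  </observation></component>
</organizer>""")

records = records_from_organizer(ORGANIZER)
```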
Allergy substance lookup
In allergy observations (template 2.16.840.1.113883.10.20.22.4.7), the <code> is ASSERTION (a meaningless wrapper) and the actual allergen lives on participant/participantRole/playingEntity/code. The parser detects allergy templates and pulls the substance code from playingEntity.
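In code, the template check and substance lookup might look like the sketch below. allergen_code is a hypothetical name, and the substance code and display name in the sample are invented placeholders rather than real RxNorm content.

```python
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}
ALLERGY_OBSERVATION_TEMPLATE = "2.16.840.1.113883.10.20.22.4.7"
ALLERGEN_PATH = "hl7:participant/hl7:participantRole/hl7:playingEntity/hl7:code"

def allergen_code(observation):
    """Return the allergen coding for an allergy observation, else None."""
    roots = {t.get("root") for t in observation.findall("hl7:templateId", NS)}
    if ALLERGY_OBSERVATION_TEMPLATE not in roots:
        return None  # not an allergy observation; use the normal code lookup
    el = observation.find(ALLERGEN_PATH, NS)
    if el is None or not el.get("code"):
        return None
    return {
        "code": el.get("code"),
        "code_system": el.get("codeSystem", ""),
        "display_name": el.get("displayName", ""),
    }

# Invented sample; the substance code is a placeholder.
ALLERGY = ET.fromstring("""<observation xmlns="urn:hl7-org:v3">
  <templateId root="2.16.840.1.113883.10.20.22.4.7"/>
  <code code="ASSERTION" codeSystem="2.16.840.1.113883.5.4"/>
  <participant typeCode="CSM"><participantRole><playingEntity>
    <code code="000000" codeSystem="2.16.840.1.113883.6.88"
          displayName="Example substance"/>
  </playingEntity></participantRole></participant>
</observation>""")

allergen = allergen_code(ALLERGY)
```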
If you hit a vendor whose exports use a different pattern, the diagnostic cell (notebook section 3b) is the fastest way to see what's actually in your files; share the output in a GitHub issue and we can add handling.
What we don't extract
- Document-level <author>. Patient-portal exports often list the patient as their own author. We don't surface it.
- <participant> other than allergy substances. Encounter participants, ordering providers, performing providers - all dropped.
- Free-text reasons inside structured entries. A medication entry might include a <text> block with the prescription instructions; we capture the medication code but not the SIG. (You can recover it from raw_record_json if needed.)
- C-CDA templates and templateIds. We rely on section LOINC codes for category routing, not template OIDs.
Bridging any of these is a one-function change in the notebook.