Architecture
Pipeline flow
```mermaid
flowchart TD
    A[ccda_chunks.csv<br/>fhir_chunks.csv<br/>uuid_mapping.csv] --> B[Stage 1<br/>Reassembly + decode]
    B --> C[Stage 2<br/>Format detection]
    C --> D{format?}
    D -->|CCDA XML| E[CCDA parser]
    D -->|FHIR JSON| F[FHIR parser]
    D -->|RTF / HTML / PDF| G[Text extraction]
    E --> H[Stage 4<br/>Join + dedupe]
    F --> H
    G --> H
    H --> I[Stage 5<br/>Display enrichment]
    I --> J[Stage 6<br/>Test-patient filter]
    J --> K[dashboard_data.json<br/>dashboard_data.xlsx<br/>per-tab CSVs]
```
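The format branch in the flowchart can be illustrated with a small content-sniffing function. This is a minimal sketch, not the code in run_pipeline.py: the name detect_format, the returned labels, and the specific byte signatures checked are assumptions chosen for illustration.

```python
import json


def detect_format(payload: bytes) -> str:
    """Classify a reassembled document payload (hypothetical helper)."""
    head = payload.lstrip()[:2048]
    # PDF files begin with the "%PDF-" signature.
    if head.startswith(b"%PDF-"):
        return "pdf"
    # RTF also starts with "{", so it must be checked before JSON.
    if head.startswith(b"{\\rtf"):
        return "rtf"
    if head.startswith((b"{", b"[")):
        try:
            doc = json.loads(payload)
        except ValueError:  # covers JSON and Unicode decode errors
            return "unknown"
        # FHIR resources and bundles carry a top-level resourceType.
        if isinstance(doc, dict) and "resourceType" in doc:
            return "fhir"
        return "unknown"
    lowered = head.lower()
    # CCDA documents use a ClinicalDocument root element.
    if b"<clinicaldocument" in lowered:
        return "ccda"
    if b"<html" in lowered or b"<!doctype html" in lowered:
        return "html"
    return "unknown"


print(detect_format(b'{"resourceType": "Bundle", "entry": []}'))
```

Checking RTF before JSON matters because both start with an opening brace; the real pipeline may use different or additional heuristics.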
Module structure
The entire pipeline is a single Python file (run_pipeline.py) organized
into labelled sections:
| Section | Purpose |
|---|---|
| Part A | Imports & user-configurable constants |
| Part B | Display-enrichment lookup tables (LOINC, SNOMED) |
| Part C | Name cleaning + MRN normalization |
| Part D | Code-walking utilities (CCDA <translation>, FHIR coding[]) |
| Part E | Format detection + decoding + text strippers |
| Part F | Stage 1 -- reassembly |
| Part G | Stage 2 -- CCDA parsing |
| Part H | Stage 3 -- FHIR parsing (13 resource types) |
| Part I | Stage 4 -- patient deduplication & schema aliasing |
| Part J | Stage 4 -- bundle assembly |
| Part K | Stage 5 -- display enrichment |
| Part L | Stage 6 -- test-patient filter |
| Part M | Output writers (JSON, XLSX, CSV) |
| Part N | main() orchestrator |
Data model
The output bundle is a flat JSON object with these top-level keys:
| Key | Type | Description |
|---|---|---|
| metadata | object | Generation timestamp, pipeline version, total patient count |
| patients | array | Patient records with demographics + num_documents |
| documents | array | One record per source document (CCDA, FHIR, RTF, HTML, PDF) with plain_text |
| encounters | array | Visit / encounter records |
| problems | array | Conditions / diagnoses |
| medications | array | MedicationRequest, MedicationStatement, MedicationAdministration |
| procedures | array | Procedures performed |
| labs | array | Laboratory observations |
| vitals | array | Vital-sign observations |
| labs_vitals | array | Convenience union of labs + vitals |
| allergies | array | AllergyIntolerance records |
| immunizations | array | Immunization records |
| careplans | array | CarePlan records |
| diagnostic_reports | array | DiagnosticReport records |
| goals | array | Goal records |
| notes | array | One row per CCDA section narrative (title + body + char count) |
| document_references | array | FHIR DocumentReference records |
Every clinical record carries a patient_id field that joins back to
patients[].patient_id.
See Output schema for per-row field definitions.
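As an illustration of the patient_id join, the sketch below groups records from one top-level array by patient and checks for records that point at no known patient. The bundle is a hand-written stand-in for dashboard_data.json; the helper names and every field other than patient_id and num_documents (for example display) are hypothetical.

```python
from collections import defaultdict

# Hand-written stand-in for dashboard_data.json. Fields other than
# patient_id and num_documents are hypothetical.
bundle = {
    "patients": [
        {"patient_id": "p1", "num_documents": 2},
        {"patient_id": "p2", "num_documents": 1},
    ],
    "labs": [
        {"patient_id": "p1", "display": "Hemoglobin A1c"},
        {"patient_id": "p1", "display": "Glucose"},
    ],
    "allergies": [{"patient_id": "p2", "display": "Penicillin"}],
}


def group_by_patient(bundle, key):
    """Group the records under one top-level key by patient_id."""
    grouped = defaultdict(list)
    for record in bundle.get(key, []):
        grouped[record["patient_id"]].append(record)
    return grouped


def orphaned_records(bundle, key):
    """Return records whose patient_id matches no entry in patients."""
    known = {p["patient_id"] for p in bundle["patients"]}
    return [r for r in bundle.get(key, []) if r["patient_id"] not in known]


labs = group_by_patient(bundle, "labs")
print(len(labs["p1"]))                       # 2
print(orphaned_records(bundle, "allergies"))  # []
```

With a real output file, the same functions would apply after loading it with json.load.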
Cross-bundle reference resolution
Real-world FHIR exports often split resources across bundles, so a
MedicationRequest in bundle A may reference a Medication resource in
bundle B. The same applies to DocumentReference -> Patient links.
Stage 3 handles this with a two-pass design:
- Pre-pass: walk every FHIR bundle once. Build a global medication_index keyed by Medication resource id, full URL, and Medication/<id> form. Build a global docref_pid_map from DocumentReference.subject.reference so any document can resolve back to a patient.
- Main pass: walk every bundle again. When a MedicationRequest's medicationReference points outside the current bundle, look it up in the global index. When a CCDA / RTF / HTML / PDF document UUID isn't in uuid_mapping.csv, look it up in docref_pid_map.
Without this resolution, most medications end up with a generic display such as "medication-12345". With it, every medication carries its RxNorm code, RxNorm display, and dose. The same holds for document linkage.
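The medication half of the two-pass design can be sketched as follows. This is an illustration under assumptions, not the pipeline's code: the function names are hypothetical, the bundles are minimal hand-written examples, and the RxNorm code is a placeholder. The docref_pid_map half is omitted because the text does not say how a document UUID is derived from a DocumentReference.

```python
def build_medication_index(bundles):
    """Pre-pass: index every Medication by id, fullUrl, and Medication/<id>."""
    index = {}
    for bundle in bundles:
        for entry in bundle.get("entry", []):
            resource = entry.get("resource", {})
            if resource.get("resourceType") != "Medication":
                continue
            rid = resource.get("id")
            if rid:
                index[rid] = resource
                index[f"Medication/{rid}"] = resource
            if entry.get("fullUrl"):
                index[entry["fullUrl"]] = resource
    return index


def resolve_medication(request, index):
    """Main pass: follow medicationReference through the global index."""
    ref = request.get("medicationReference", {}).get("reference")
    return index.get(ref) if ref else None


# Bundle A holds the request; bundle B holds the referenced Medication.
bundle_a = {"entry": [{"resource": {
    "resourceType": "MedicationRequest",
    "id": "req-1",
    "medicationReference": {"reference": "Medication/med-1"},
}}]}
bundle_b = {"entry": [{
    "fullUrl": "urn:uuid:med-1",
    "resource": {
        "resourceType": "Medication",
        "id": "med-1",
        # Placeholder coding, not a real RxNorm entry.
        "code": {"coding": [{"code": "EXAMPLE-CODE", "display": "Example drug"}]},
    },
}]}

index = build_medication_index([bundle_a, bundle_b])
request = bundle_a["entry"][0]["resource"]
medication = resolve_medication(request, index)
print(medication["code"]["coding"][0]["display"])  # Example drug
```

Indexing the same resource under three keys lets the lookup succeed regardless of which reference style the exporting system used.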