Architecture¶

Pipeline flow¶

flowchart TD
  A[ccda_chunks.csv<br/>fhir_chunks.csv<br/>uuid_mapping.csv]
    --> B[Stage 1<br/>Reassembly + decode]
  B --> C[Stage 2<br/>Format detection]
  C --> D{format?}
  D -->|CCDA XML| E[CCDA parser]
  D -->|FHIR JSON| F[FHIR parser]
  D -->|RTF / HTML / PDF| G[Text extraction]
  E --> H[Stage 4<br/>Join + dedupe]
  F --> H
  G --> H
  H --> I[Stage 5<br/>Display enrichment]
  I --> J[Stage 6<br/>Test-patient filter]
  J --> K[dashboard_data.json<br/>dashboard_data.xlsx<br/>per-tab CSVs]

Module structure¶

The entire pipeline is a single Python file (run_pipeline.py) organized into labelled sections:

Section	Purpose
Part A	Imports & user-configurable constants
Part B	Display-enrichment lookup tables (LOINC, SNOMED)
Part C	Name cleaning + MRN normalization
Part D	Code-walking utilities (CCDA `<translation>`, FHIR `coding[]`)
Part E	Format detection + decoding + text strippers
Part F	Stage 1 -- reassembly
Part G	Stage 2 -- CCDA parsing
Part H	Stage 3 -- FHIR parsing (13 resource types)
Part I	Stage 4 -- patient deduplication & schema aliasing
Part J	Stage 4 -- bundle assembly
Part K	Stage 5 -- display enrichment
Part L	Stage 6 -- test-patient filter
Part M	Output writers (JSON, XLSX, CSV)
Part N	`main()` orchestrator

Data model¶

The output bundle is a flat JSON object with these top-level keys:

Key	Type	Description
`metadata`	object	Generation timestamp, pipeline version, total patient count
`patients`	array	Patient records with demographics + `num_documents`
`documents`	array	One record per source document (CCDA, FHIR, RTF, HTML, PDF) with `plain_text`
`encounters`	array	Visit / encounter records
`problems`	array	Conditions / diagnoses
`medications`	array	MedicationRequest, MedicationStatement, MedicationAdministration
`procedures`	array	Procedures performed
`labs`	array	Laboratory observations
`vitals`	array	Vital-sign observations
`labs_vitals`	array	Convenience union of `labs` + `vitals`
`allergies`	array	AllergyIntolerance records
`immunizations`	array	Immunization records
`careplans`	array	CarePlan records
`diagnostic_reports`	array	DiagnosticReport records
`goals`	array	Goal records
`notes`	array	One row per CCDA section narrative (title + body + char count)
`document_references`	array	FHIR DocumentReference records

Every clinical record carries a patient_id field that joins back to patients[].patient_id.

See Output schema for per-row field definitions.

Cross-bundle reference resolution¶

A real world FHIR export often splits resources across bundles such that a MedicationRequest in bundle A references a Medication resource in bundle B. The same applies to DocumentReference -> Patient.

Stage 3 handles this with a two-pass design:

Pre-pass: walk every FHIR bundle once. Build a global medication_index keyed by Medication resource id, full URL, and Medication/<id> form. Build a global docref_pid_map from DocumentReference.subject.reference so any document can resolve back to a patient.
Main pass: walk every bundle again. When a MedicationRequest's medicationReference points outside the current bundle, look it up in the global index. When a CCDA / RTF / HTML / PDF document UUID isn't in uuid_mapping.csv, look it up in docref_pid_map.

This is the difference between "most medications have a generic 'medication-12345' display" and "every medication has its RxNorm code, RxNorm display, and dose". Same for document linkage.