Overview
Inputs
The pipeline reads up to four input files from a working directory; only the two chunked CSVs are required:
| File | Required | Description |
|---|---|---|
| CCDA and FHIR data/ccda_chunks.csv | yes | base64-encoded CCDA XML documents, chunked into rows |
| CCDA and FHIR data/fhir_chunks.csv | yes | FHIR Bundle JSON documents, chunked into rows |
| uuid_mapping.csv | recommended | maps document_uuid to patient_id, plus optional demographic columns |
| test_patients.txt | optional | exclusion rules, one per line |
Each chunked CSV has three columns: id, chunk_index, chunk_data. The
chunking is required because clinical documents routinely exceed
warehouse-to-CSV cell-size limits. The pipeline reassembles each document
by concatenating its chunks in chunk_index order, then decodes the result:
base64 decoding for CCDA, plain JSON parsing for FHIR.
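The reassembly step described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names `reassemble`, `decode_ccda`, and `decode_fhir` are hypothetical, and the sketch assumes chunk_index holds integers and that a whole CSV fits in memory.

```python
import base64
import csv
import io
import json
from collections import defaultdict


def reassemble(csv_text: str) -> dict:
    """Group rows by id and join chunk_data in chunk_index order."""
    parts = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Cast to int so chunk 10 sorts after chunk 2, not before it.
        parts[row["id"]].append((int(row["chunk_index"]), row["chunk_data"]))
    return {
        doc_id: "".join(data for _, data in sorted(chunks, key=lambda c: c[0]))
        for doc_id, chunks in parts.items()
    }


def decode_ccda(payload: str) -> bytes:
    """CCDA payloads are base64 text; decoding yields the raw document bytes."""
    return base64.b64decode(payload)


def decode_fhir(payload: str) -> dict:
    """FHIR payloads are plain JSON text; no base64 step is needed."""
    return json.loads(payload)


# Made-up demo data: rows arrive out of order and are reordered by chunk_index.
demo = "id,chunk_index,chunk_data\ndoc1,1,bG8=\ndoc1,0,aGVs\n"
print(decode_ccda(reassemble(demo)["doc1"]))  # b'hello'
```

Sorting on the integer value rather than the raw string matters once a document has ten or more chunks, because a string sort would place "10" before "2".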
See Data extraction (Databricks) for the warehouse-side code that produces these files.
Stages
| # | Stage | Purpose |
|---|---|---|
| 1 | Decoding & reassembly | Concatenate chunks; base64 decode CCDA |
| 2 | Format detection | Magic-byte detection: PDF / RTF / CCDA XML / HTML / unknown |
| 3 | FHIR resource extraction | 13 resource types, code-system-aware coding walk |
| 4 | Joining & assembly | Cross-format patient linkage; deduplication by name+DOB |
| 5 | Display-name enrichment | LOINC / SNOMED display backfill where missing |
| 6 | Test-patient exclusion | Rule-based filter (file-based + hardcoded) |
Each stage is documented separately under Pipeline stages.
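As an illustration of Stage 2, magic-byte detection might look like the sketch below. The function name `detect_format`, the returned labels, and the 1024-byte sniffing window are assumptions made for this example; the `%PDF-` and `{\rtf` signatures and the `ClinicalDocument` root element come from the respective format specifications, but the pipeline's real rules may differ.

```python
def detect_format(payload: bytes) -> str:
    """Classify a decoded document by inspecting its leading bytes."""
    head = payload[:1024]
    if head.startswith(b"\xef\xbb\xbf"):  # drop a UTF-8 byte-order mark
        head = head[3:]
    head = head.lstrip()

    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"{\\rtf"):
        return "rtf"

    lowered = head.lower()
    # CCDA documents are XML whose root element is ClinicalDocument.
    if b"<clinicaldocument" in lowered:
        return "ccda_xml"
    if lowered.startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"
```

Checking the binary signatures first keeps the cheaper, unambiguous cases out of the text-based XML and HTML checks.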
Outputs
| File | Format | Purpose |
|---|---|---|
| dashboard_data.json | JSON | The full bundle, read by the dashboard |
| dashboard_data.xlsx | Excel | One sheet per category for spreadsheet review |
| dashboard_data.prefilter.json | JSON | Snapshot before Stage 6 (audit) |
| csv_exports/*.csv | CSV | One file per category |
| pipeline_run.log | text | Time-stamped run log |
The bundle structure is documented under Output schema.
Key design choices
Chunked CSVs over Parquet. CSV is universally readable, requires no warehouse-specific tooling, and travels through any data-sharing workflow (email, SFTP, object storage). The chunking pattern handles multi-megabyte clinical documents that would otherwise exceed CSV cell-size limits.
Cross-bundle global indexing. Real-world FHIR exports often place
Medication resources in a separate bundle from the MedicationRequest
that references them. Stage 3 does a pre-pass over all bundles to build
global Medication and DocumentReference indices before any per-bundle
parsing, so cross-bundle references resolve correctly.
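A minimal sketch of such a pre-pass is shown below. The helper names `build_global_index` and `resolve_medication` are hypothetical, the bundles are made-up sample data, and the lookup assumes FHIR R4, where a MedicationRequest points at its Medication through `medicationReference.reference`.

```python
def build_global_index(bundles, resource_types=("Medication", "DocumentReference")):
    """Pre-pass: index selected resources from every bundle before parsing any."""
    index = {}
    for bundle in bundles:
        for entry in bundle.get("entry", []):
            resource = entry.get("resource") or {}
            rtype = resource.get("resourceType")
            rid = resource.get("id")
            if rtype in resource_types and rid:
                # Key by the relative reference form, e.g. "Medication/med-1".
                index[f"{rtype}/{rid}"] = resource
                # Also key by fullUrl, since references may use that form.
                if entry.get("fullUrl"):
                    index[entry["fullUrl"]] = resource
    return index


def resolve_medication(request, index):
    """Look up the Medication a MedicationRequest refers to, if indexed."""
    ref = (request.get("medicationReference") or {}).get("reference")
    return index.get(ref) if ref else None


# Made-up sample data: the Medication lives in a different bundle.
bundle_a = {"entry": [{"resource": {
    "resourceType": "Medication", "id": "med-1",
    "code": {"text": "Metformin 500 mg"}}}]}
bundle_b = {"entry": [{"resource": {
    "resourceType": "MedicationRequest", "id": "rx-1",
    "medicationReference": {"reference": "Medication/med-1"}}}]}

index = build_global_index([bundle_a, bundle_b])
request = bundle_b["entry"][0]["resource"]
print(resolve_medication(request, index)["code"]["text"])  # Metformin 500 mg
```

Because the index is complete before per-bundle parsing starts, the order in which bundles are later processed does not affect whether a reference resolves.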
Schema aliasing. The bundle emits both canonical field names
(effective_date, display_name) and aliases (start_date, authored_on,
onset_datetime, allergen, vaccine) so downstream consumers can use
whichever schema is convenient.
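One way to emit aliases alongside canonical fields is sketched here. The grouping of aliases under each canonical name is an assumption made for illustration (the source lists the names but not which alias maps to which field), and `ALIASES` and `with_aliases` are hypothetical names.

```python
# Assumed mapping, for illustration only: canonical field -> alias names.
ALIASES = {
    "effective_date": ("start_date", "authored_on", "onset_datetime"),
    "display_name": ("allergen", "vaccine"),
}


def with_aliases(record, aliases=ALIASES):
    """Return a copy of record with each alias set to its canonical value."""
    out = dict(record)
    for canonical, names in aliases.items():
        if canonical not in record:
            continue
        for name in names:
            # setdefault keeps any alias value the record already carries.
            out.setdefault(name, record[canonical])
    return out


row = with_aliases({"effective_date": "2021-03-04", "display_name": "Penicillin"})
print(row["onset_datetime"], row["allergen"])  # 2021-03-04 Penicillin
```

A real implementation would likely restrict each alias to the category it belongs to (for example, allergen only on allergy records) rather than copying every alias onto every record.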
Browser-only dashboard. The viewer is a single static HTML file that
loads dashboard_data.json via a file picker. It needs no server, no API
key, and no deployment, which makes it suitable for sharing with
collaborators who lack engineering infrastructure.