Overview

Inputs

The pipeline reads up to four input files from a working directory; only the two chunked CSVs are required:

| File | Required | Description |
| --- | --- | --- |
| CCDA and FHIR data/ccda_chunks.csv | yes | base64-encoded CCDA XML documents, chunked into rows |
| CCDA and FHIR data/fhir_chunks.csv | yes | FHIR Bundle JSON documents, chunked into rows |
| uuid_mapping.csv | recommended | maps document_uuid to patient_id, plus optional demographic columns |
| test_patients.txt | optional | exclusion rules, one per line |

Each chunked CSV has three columns: id, chunk_index, chunk_data. Chunking is required because clinical documents routinely exceed the cell-size limits of warehouse-to-CSV exports. The pipeline reassembles each document by concatenating its chunks in chunk_index order and then decodes the result: CCDA payloads are base64-decoded, while FHIR payloads are already plain JSON.
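The reassembly step can be sketched as follows. This is a minimal illustration and not the pipeline's actual code: the function names (reassemble_chunks, decode_ccda, decode_fhir) are hypothetical, and it assumes chunk_index holds integers and that the CSV has a header row with the three column names above.

```python
import base64
import csv
import io
import json
from collections import defaultdict


def reassemble_chunks(csv_text: str) -> dict[str, str]:
    """Group rows by id and join chunk_data in numeric chunk_index order."""
    parts = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Cast to int so that chunk 10 sorts after chunk 2, not before it.
        parts[row["id"]].append((int(row["chunk_index"]), row["chunk_data"]))
    return {
        doc_id: "".join(data for _, data in sorted(chunks, key=lambda c: c[0]))
        for doc_id, chunks in parts.items()
    }


def decode_ccda(payload: str) -> bytes:
    """Base64-decode a fully reassembled CCDA payload."""
    return base64.b64decode(payload)


def decode_fhir(payload: str) -> dict:
    """Parse a fully reassembled FHIR Bundle payload as JSON."""
    return json.loads(payload)
```

Decoding happens only after concatenation because a chunk boundary can fall in the middle of a base64 group or a JSON token, so individual chunks are not guaranteed to decode on their own. If chunk_data cells are larger than the csv module's default field size limit, csv.field_size_limit can be used to raise it.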

See Data extraction (Databricks) for the warehouse-side code that produces these files.

Stages

| # | Stage | Purpose |
| --- | --- | --- |
| 1 | Decoding & reassembly | Concatenate chunks; base64 decode CCDA |
| 2 | Format detection | Magic-byte detection: PDF / RTF / CCDA XML / HTML / unknown |
| 3 | FHIR resource extraction | 13 resource types, code-system-aware coding walk |
| 4 | Joining & assembly | Cross-format patient linkage; deduplication by name+DOB |
| 5 | Display-name enrichment | LOINC / SNOMED display backfill where missing |
| 6 | Test-patient exclusion | Rule-based filter (file-based + hardcoded) |

Each stage is documented separately under Pipeline stages.
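As an illustration of what magic-byte detection in Stage 2 might look like, here is a minimal sketch. The function name detect_format, the returned labels, and the exact signatures checked are assumptions made for this example; the real stage may inspect different or additional byte patterns.

```python
def detect_format(payload: bytes) -> str:
    """Classify a decoded document by inspecting its leading bytes."""
    # Look only at the start of the document; skip a UTF-8 BOM and whitespace.
    head = payload[:1024].lstrip(b"\xef\xbb\xbf \t\r\n")
    lowered = head.lower()

    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"{\\rtf"):
        return "rtf"
    # CCDA documents use ClinicalDocument as the root element. This simple
    # check does not handle a namespace-prefixed root such as <cda:ClinicalDocument>.
    if b"<clinicaldocument" in lowered:
        return "ccda_xml"
    if lowered.startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"
```

For example, detect_format(b"%PDF-1.7 ...") would return "pdf", and any payload matching none of the signatures falls through to "unknown" rather than raising, so unrecognized documents can still be counted and reported.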

Outputs

| File | Format | Purpose |
| --- | --- | --- |
| dashboard_data.json | JSON | The full bundle, read by the dashboard |
| dashboard_data.xlsx | Excel | One sheet per category for spreadsheet review |
| dashboard_data.prefilter.json | JSON | Snapshot before Stage 6 (audit) |
| csv_exports/*.csv | CSV | One file per category |
| pipeline_run.log | text | Time-stamped run log |

The bundle structure is documented under Output schema.

Key design choices

Chunked CSVs over Parquet. CSV is universally readable, requires no warehouse-specific tooling, and travels through any data-sharing workflow (email, SFTP, object storage). Chunking lets multi-megabyte clinical documents fit within CSV cell-size limits they would otherwise exceed.

Cross-bundle global indexing. Real-world FHIR exports often place Medication resources in a separate bundle from the MedicationRequest that references them. Stage 3 therefore makes a pre-pass over all bundles to build global Medication and DocumentReference indices before any per-bundle parsing, so cross-bundle references resolve correctly.
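The pre-pass idea can be sketched as below. This is a simplified illustration, not the pipeline's implementation: build_global_index and resolve_medication are hypothetical names, the bundles are assumed to be already-parsed FHIR R4 Bundle dicts, and only relative references of the form "Medication/<id>" are handled (absolute URLs and urn:uuid references are ignored here).

```python
def build_global_index(bundles, resource_types=("Medication", "DocumentReference")):
    """Pre-pass: index selected resources from every bundle by 'Type/id'."""
    index = {}
    for bundle in bundles:
        for entry in bundle.get("entry", []):
            resource = entry.get("resource", {})
            rtype = resource.get("resourceType")
            rid = resource.get("id")
            if rtype in resource_types and rid:
                index[f"{rtype}/{rid}"] = resource
    return index


def resolve_medication(request, index):
    """Look up the Medication that a MedicationRequest points to, if any."""
    # FHIR R4 MedicationRequest carries the link in medicationReference.reference.
    ref = request.get("medicationReference", {}).get("reference")
    return index.get(ref) if ref else None


# Usage: the Medication lives in one bundle, the request in another.
bundle_a = {"resourceType": "Bundle", "entry": [
    {"resource": {"resourceType": "Medication", "id": "med-1",
                  "code": {"text": "Amoxicillin 500 mg"}}},
]}
bundle_b = {"resourceType": "Bundle", "entry": [
    {"resource": {"resourceType": "MedicationRequest", "id": "req-1",
                  "medicationReference": {"reference": "Medication/med-1"}}},
]}

global_index = build_global_index([bundle_a, bundle_b])
request = bundle_b["entry"][0]["resource"]
medication = resolve_medication(request, global_index)
```

Because the index is built from all bundles before any single bundle is parsed, the request in bundle_b resolves even though its Medication appears only in bundle_a. A per-bundle lookup would return nothing in this case.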

Schema aliasing. The bundle emits both canonical field names (effective_date, display_name) and aliases (start_date, authored_on, onset_datetime, allergen, vaccine) so downstream consumers can use whichever schema is convenient.
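One way such aliasing could work is sketched below. The grouping of aliases under each canonical field is an assumption made for illustration (the actual mapping is defined by the output schema and may vary by category), and with_aliases is a hypothetical helper name.

```python
# Assumed mapping for illustration only: canonical field -> alias names.
ALIASES = {
    "effective_date": ("start_date", "authored_on", "onset_datetime"),
    "display_name": ("allergen", "vaccine"),
}


def with_aliases(record, aliases=ALIASES):
    """Return a copy of record with alias keys mirroring canonical values."""
    out = dict(record)
    for canonical, names in aliases.items():
        if canonical in record:
            for name in names:
                # setdefault keeps any alias value the record already carries.
                out.setdefault(name, record[canonical])
    return out
```

A record such as {"display_name": "Penicillin", "effective_date": "2020-01-01"} would come back with allergen, vaccine, start_date, authored_on, and onset_datetime filled in alongside the canonical keys. The trade-off is a larger bundle in exchange for consumers not needing a translation layer.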

Browser-only dashboard. The viewer is a single static HTML file that loads dashboard_data.json through a file picker. It needs no server, API key, or deployment, which makes it suitable for sharing with collaborators who lack engineering infrastructure.