Overview

Inputs

The pipeline reads up to four input files from a working directory; two are required:

File	Required	Description
`CCDA and FHIR data/ccda_chunks.csv`	yes	base64-encoded CCDA XML documents, chunked into rows
`CCDA and FHIR data/fhir_chunks.csv`	yes	FHIR Bundle JSON documents, chunked into rows
`uuid_mapping.csv`	recommended	maps `document_uuid` to `patient_id`, plus optional demographic columns
`test_patients.txt`	optional	exclusion rules, one per line

Each chunked CSV has three columns: `id`, `chunk_index`, and `chunk_data`. Chunking is required because clinical documents routinely exceed the cell-size limits of warehouse-to-CSV exports. The pipeline reassembles each document by concatenating its chunks in `chunk_index` order, then decodes the result: base64 decoding for CCDA, JSON parsing for FHIR.
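The reassembly step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names `reassemble_chunks`, `decode_ccda`, and `decode_fhir` are hypothetical, and it assumes `chunk_index` holds integers and that CCDA payloads decode to UTF-8 text.

```python
import base64
import csv
import io
import json
from collections import defaultdict


def reassemble_chunks(csv_text):
    """Group rows by id and join chunk_data in numeric chunk_index order."""
    parts = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Cast to int so that chunk 10 sorts after chunk 2, not before it.
        parts[row["id"]].append((int(row["chunk_index"]), row["chunk_data"]))
    return {
        doc_id: "".join(data for _, data in sorted(chunks, key=lambda c: c[0]))
        for doc_id, chunks in parts.items()
    }


def decode_ccda(payload):
    """CCDA: the reassembled string is base64; decode it to XML text."""
    # Assumption: the XML is UTF-8; undecodable bytes are replaced, not fatal.
    return base64.b64decode(payload).decode("utf-8", errors="replace")


def decode_fhir(payload):
    """FHIR: the reassembled string is plain JSON; parse it into a dict."""
    return json.loads(payload)
```

Chunks are concatenated before decoding, so a base64 string split at an arbitrary character boundary still decodes correctly. For very large cells, Python's `csv` module may also need its limit raised with `csv.field_size_limit`.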

See Data extraction (Databricks) for the warehouse-side code that produces these files.

Stages

#	Stage	Purpose
1	Decoding & reassembly	Concatenate chunks; base64 decode CCDA
2	Format detection	Magic-byte detection: PDF / RTF / CCDA XML / HTML / unknown
3	FHIR resource extraction	13 resource types, code-system-aware coding walk
4	Joining & assembly	Cross-format patient linkage; deduplication by name+DOB
5	Display-name enrichment	LOINC / SNOMED display backfill where missing
6	Test-patient exclusion	Rule-based filter (file-based + hardcoded)

Each stage is documented separately under Pipeline stages.
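As an illustration of Stage 2, magic-byte detection inspects the first bytes of a decoded document. The sketch below is an assumption about how such a check might look: the function name `detect_format`, the returned labels, and the 4096-byte window are hypothetical. The PDF (`%PDF-`) and RTF (`{\rtf`) signatures are standard.

```python
def detect_format(data: bytes) -> str:
    """Classify a decoded document by its leading bytes."""
    head = data[:4096]
    # Drop a UTF-8 byte-order mark, then any leading whitespace.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    head = head.lstrip()
    lowered = head.lower()

    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"{\\rtf"):
        return "rtf"
    if lowered.startswith((b"<!doctype html", b"<html")):
        return "html"
    # Assumption: a CCDA document is XML whose root element is
    # ClinicalDocument and appears within the inspected window.
    if head.startswith(b"<") and b"ClinicalDocument" in head:
        return "ccda_xml"
    return "unknown"
```

HTML is checked before CCDA so that an HTML page which merely mentions `ClinicalDocument` in its text is not misclassified as CCDA.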

Outputs

File	Format	Purpose
`dashboard_data.json`	JSON	The full bundle, read by the dashboard
`dashboard_data.xlsx`	Excel	One sheet per category for spreadsheet review
`dashboard_data.prefilter.json`	JSON	Snapshot before Stage 6 (audit)
`csv_exports/*.csv`	CSV	One file per category
`pipeline_run.log`	text	Time-stamped run log

The bundle structure is documented under Output schema.

Key design choices

Chunked CSVs over Parquet. CSV is universally readable, requires no warehouse-specific tooling, and travels through any data-sharing workflow (email, SFTP, object storage). The chunking pattern handles multi-megabyte clinical documents that would otherwise exceed CSV cell-size limits.

Cross-bundle global indexing. Real-world FHIR exports often place Medication resources in a separate bundle from the MedicationRequest that references them. Stage 3 therefore makes a pre-pass over all bundles to build global Medication and DocumentReference indices before any per-bundle parsing, so cross-bundle references resolve correctly.
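A minimal sketch of the pre-pass idea, for Medication only. The function names `build_medication_index` and `resolve_medication` are hypothetical, and the code assumes FHIR R4-style bundles in which a MedicationRequest points to its Medication through `medicationReference.reference`, either as a relative `Medication/<id>` reference or as the entry's `fullUrl`.

```python
def build_medication_index(bundles):
    """Pre-pass: collect every Medication across all bundles, keyed by reference."""
    index = {}
    for bundle in bundles:
        for entry in bundle.get("entry", []):
            resource = entry.get("resource", {})
            if resource.get("resourceType") != "Medication":
                continue
            if "id" in resource:
                # Relative reference form, e.g. "Medication/med-1".
                index[f"Medication/{resource['id']}"] = resource
            if entry.get("fullUrl"):
                # Absolute or urn:uuid form taken from the bundle entry.
                index[entry["fullUrl"]] = resource
    return index


def resolve_medication(request, index):
    """Look up the Medication a MedicationRequest refers to, or None."""
    ref = request.get("medicationReference", {}).get("reference")
    return index.get(ref)
```

Because the index is built before any bundle is parsed, a request in one bundle can resolve a Medication that only appears in another. A DocumentReference index could presumably follow the same pattern.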

Schema aliasing. The bundle emits both canonical field names (`effective_date`, `display_name`) and aliases (`start_date`, `authored_on`, `onset_datetime`, `allergen`, `vaccine`) so downstream consumers can use whichever schema is convenient.
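One way aliasing could work is to copy each canonical value under its alias names when a record is emitted. The sketch below is illustrative only: the helper `with_aliases` is hypothetical, and the grouping of aliases under each canonical field is a guess from the names, not taken from the output schema. The real pipeline may well apply aliases per category rather than to every record.

```python
# Assumed mapping: which aliases mirror which canonical field is a guess.
ALIASES = {
    "effective_date": ("start_date", "authored_on", "onset_datetime"),
    "display_name": ("allergen", "vaccine"),
}


def with_aliases(record, aliases=ALIASES):
    """Return a copy of record with alias keys mirroring canonical values."""
    out = dict(record)
    for canonical, names in aliases.items():
        if canonical not in record:
            continue
        for name in names:
            # Never overwrite a value the record already carries.
            out.setdefault(name, record[canonical])
    return out
```

A consumer written against either naming scheme then reads the same value, at the cost of some redundancy in the bundle.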

Browser-only dashboard. The viewer is a single static HTML file that loads `dashboard_data.json` through a file picker. It needs no server, API key, or deployment, which makes it suitable for sharing with collaborators who lack engineering infrastructure.