Stage 1 -- Decoding & reassembly
Purpose
Reassemble the chunked CSV exports back into individual document files on disk, decoding base64 where needed.
Inputs
- <work_dir>/CCDA and FHIR data/ccda_chunks.csv
- <work_dir>/CCDA and FHIR data/fhir_chunks.csv
Each is a long, narrow CSV with three columns:
| Column | Type | Meaning |
|---|---|---|
| id | string | Document UUID (one document = many chunk rows) |
| chunk_index | int | 0-based ordinal within the document |
| chunk_data | string | Up to 30,000 characters of payload |
For CCDA the payload is base64-encoded XML. For FHIR the payload is raw JSON text (no encoding -- JSON is already CSV-safe when properly quoted).
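To make the row shape concrete, here is a minimal sketch of how rows like these could be produced, for example when building test fixtures. The helper name `to_chunk_rows` and the sample IDs are hypothetical, and the warehouse's actual export logic may differ; only the three column names and the 30,000-character limit come from the description above.

```python
import base64

CHUNK_SIZE = 30_000  # the "up to 30,000 characters" limit described above


def to_chunk_rows(doc_id: str, payload: str, chunk_size: int = CHUNK_SIZE) -> list[dict]:
    """Split one document's payload into rows shaped like the export CSVs (hypothetical helper)."""
    return [
        {"id": doc_id, "chunk_index": i, "chunk_data": payload[start:start + chunk_size]}
        for i, start in enumerate(range(0, len(payload), chunk_size))
    ]


# CCDA: the XML is base64-encoded first, then split into chunks.
xml = "<ClinicalDocument>" + "x" * 50_000 + "</ClinicalDocument>"
ccda_payload = base64.b64encode(xml.encode("utf-8")).decode("ascii")
ccda_rows = to_chunk_rows("doc-1", ccda_payload)

# FHIR: the JSON text is split as-is, with no encoding step.
fhir_rows = to_chunk_rows("bundle-1", '{"resourceType": "Bundle"}')
```

Joining the `chunk_data` values of one document in `chunk_index` order gives back the original payload, which is the property the reassembly stage relies on.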
Outputs
- <work_dir>/ccda_assembled/<id>.xml -- one decoded XML file per CCDA document
- <work_dir>/fhir_assembled/<id>.json -- one JSON file per FHIR bundle
If the output directory already contains many files, the stage skips reassembly. To force a clean rebuild, delete the assembled directories before running.
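The skip check could look roughly like the sketch below. The function name `already_assembled` and the threshold `MIN_EXISTING_FILES` are assumptions made for illustration; this page does not state the cutoff the stage actually uses.

```python
from pathlib import Path

# Hypothetical threshold; the stage's real cutoff is not documented here.
MIN_EXISTING_FILES = 100


def already_assembled(out_dir: Path, suffix: str, min_files: int = MIN_EXISTING_FILES) -> bool:
    """Return True if out_dir already holds at least min_files files ending in suffix."""
    if not out_dir.is_dir():
        return False
    count = 0
    for path in out_dir.iterdir():
        if path.suffix == suffix:
            count += 1
            if count >= min_files:
                return True  # stop early; no need to list the whole directory
    return False
```

A forced rebuild then amounts to removing the directory first, for instance with `shutil.rmtree(out_dir, ignore_errors=True)`, so that the check returns False on the next run.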
Implementation notes
Concatenation
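    import pandas as pd

    df = pd.read_csv(csv_path).sort_values(['id', 'chunk_index'])
    for fid, chunks in df.groupby('id'):
        # join the payloads in chunk_index order
        data = ''.join(chunks['chunk_data'].dropna().astype(str))
        # write data (decoded if base64)

groupby preserves the row order within each group, so the chunks of a
document are joined in chunk_index order. The .dropna() guards against the
rare case where a chunk's chunk_data is empty: pandas reads an empty field
as NaN, and a NaN that is not dropped before the join can end up in the
document as the literal text nan and corrupt it.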
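The concatenation step can be exercised end to end on a tiny in-memory CSV. This is a sketch under assumptions: the sample rows and the `assembled` dictionary are illustrative, and the real stage writes each document to disk instead of keeping it in memory.

```python
import io

import pandas as pd

# Stand-in for fhir_chunks.csv: rows are deliberately out of order,
# and document "b" has one chunk whose chunk_data field is empty.
csv_text = (
    "id,chunk_index,chunk_data\n"
    "a,1,world\n"
    "b,0,only\n"
    "a,0,hello \n"
    "b,1,\n"
)

df = pd.read_csv(io.StringIO(csv_text)).sort_values(["id", "chunk_index"])

# The empty field is parsed as a missing value, not as an empty string.
missing_chunks = int(df["chunk_data"].isna().sum())

# Join each document's chunks in chunk_index order, dropping missing values.
assembled = {
    fid: "".join(chunks["chunk_data"].dropna().astype(str))
    for fid, chunks in df.groupby("id")
}
```

Here `assembled["a"]` is rebuilt in the right order even though its rows arrived reversed, and the empty chunk of document `b` contributes nothing to the output.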
Base64 decoding (robust_decode)
CCDA chunks are base64-encoded. Some warehouses double-encode (the source
table itself stores base64, then the export wraps it in another base64
layer). robust_decode handles both cases:
- If the input already looks decoded (PDF / RTF / XML / HTML magic bytes), return it untouched.
- Strip non-base64 characters and pad to a multiple of 4.
- Decode once. If the result still doesn't look decoded, try a second pass.
The "looks decoded" check is critical when the input is binary bytes.
Round-tripping binary content through a UTF-8 decode (errors='ignore')
silently drops non-UTF-8 bytes from PDF streams and corrupts them, so the
stage detects bytes that already pass the magic-byte check and returns
them unchanged.
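The steps above can be sketched as follows. This is not the stage's actual code: the prefix list in `MAGIC_PREFIXES`, the helper names `looks_decoded` and `_b64_once`, and the fallback behaviour when decoding fails are all assumptions made for illustration.

```python
import base64
import binascii
import re
from typing import Optional, Union

# Lower-cased prefixes treated as "already decoded". Illustrative only; the
# list used by the real robust_decode is not shown in this document.
MAGIC_PREFIXES = (b"%pdf", b"{\\rtf", b"<?xml", b"<clinicaldocument", b"<!doctype", b"<html")

_NON_B64 = re.compile(rb"[^A-Za-z0-9+/]")


def looks_decoded(data: bytes) -> bool:
    """True if data starts with a known magic prefix (ignoring a BOM and leading whitespace)."""
    head = data[:64].lstrip(b"\xef\xbb\xbf \t\r\n").lower()
    return head.startswith(MAGIC_PREFIXES)


def _b64_once(data: bytes) -> Optional[bytes]:
    """Strip non-base64 characters, pad to a multiple of 4, decode; None on failure."""
    cleaned = _NON_B64.sub(b"", data)  # also removes old '=' padding
    if not cleaned:
        return None
    cleaned += b"=" * (-len(cleaned) % 4)
    try:
        return base64.b64decode(cleaned)
    except (binascii.Error, ValueError):
        return None


def robust_decode(payload: Union[str, bytes]) -> bytes:
    """Decode a singly or doubly base64-encoded payload; pass decoded input through."""
    data = payload.encode("utf-8") if isinstance(payload, str) else bytes(payload)
    if looks_decoded(data):
        return data  # binary input is returned as-is, with no text round trip
    first = _b64_once(data)
    if first is None:
        return data  # assumption: fall back to the input when decoding fails
    if looks_decoded(first):
        return first
    second = _b64_once(first)
    if second is not None and looks_decoded(second):
        return second
    return first
```

Working on bytes throughout is the design point: the only str-to-bytes conversion happens on the way in, so a PDF stream that is already decoded never passes through a lossy UTF-8 decode.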
CSV resilience
pandas.read_csv handles quoting, embedded commas, and embedded newlines
correctly as long as the warehouse's CSV writer used the standard RFC 4180
escape rules. The Databricks export examples in
Data extraction use the default Spark CSV writer,
which produces RFC 4180-compliant output.
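A quick way to check this behaviour is a round trip through a writer that follows the RFC 4180 conventions. The sketch below uses Python's `csv` module as a stand-in for the warehouse's writer, which is an assumption; the sample payload is made up to include a comma, doubled-quote candidates, a backslash, and an embedded newline.

```python
import csv
import io

import pandas as pd

# A JSON-like payload with characters that need CSV quoting.
payload = '{"text": "a, b", "quote": "say \\"hi\\""}\nsecond line'

# Write one chunk row; csv.writer quotes the field and doubles embedded quotes.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "chunk_index", "chunk_data"])
writer.writerow(["doc-1", 0, payload])

# Read it back with pandas' default settings.
df = pd.read_csv(io.StringIO(buf.getvalue()))
round_tripped = df.loc[0, "chunk_data"]
```

If `round_tripped` differs from `payload` for a real export, the writer's quoting or escaping options are the first thing to inspect.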
Performance
For a typical export (~22,000 documents, ~10 GB of source XML), reassembly takes about three minutes on commodity hardware. The bottleneck is CSV parsing; once decoded, writing files is fast.