Stage 1 -- Decoding & reassembly¶

Purpose¶

Reassemble the chunked CSV exports back into individual document files on disk, decoding base64 where needed.

Inputs¶

<work_dir>/CCDA and FHIR data/ccda_chunks.csv
<work_dir>/CCDA and FHIR data/fhir_chunks.csv

Each is a long, narrow CSV with three columns:

Column	Type	Meaning
`id`	string	Document UUID (one document = many chunk rows)
`chunk_index`	int	0-based ordinal within the document
`chunk_data`	string	Up to 30,000 characters of payload

For CCDA the payload is base64-encoded XML. For FHIR the payload is raw JSON text (no encoding -- JSON is already CSV-safe when properly quoted).

Outputs¶

<work_dir>/ccda_assembled/<id>.xml -- one decoded XML file per CCDA document
<work_dir>/fhir_assembled/<id>.json -- one JSON file per FHIR bundle

If the output directory already contains many files, the stage skips reassembly. To force a clean rebuild, delete the assembled directories before running.
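As an illustration only, the skip check could be sketched as below. The helper name `already_assembled` and the `min_files` threshold are hypothetical; the text above does not say how many files count as "many", so the default here is a placeholder, not the stage's actual value.

```python
from pathlib import Path


def already_assembled(out_dir, suffix, min_files=100):
    """Return True if out_dir already holds at least min_files matching files.

    min_files is a hypothetical threshold chosen for this sketch.
    """
    out_dir = Path(out_dir)
    if not out_dir.is_dir():
        # Nothing has been assembled yet, so the stage must run.
        return False
    count = sum(1 for _ in out_dir.glob(f"*{suffix}"))
    return count >= min_files
```

A caller would check `already_assembled(work_dir / "ccda_assembled", ".xml")` before reading the CSV, and deleting the directory makes the check fail so that the next run rebuilds everything.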

Implementation notes¶

Concatenation¶

df = pd.read_csv(csv_path).sort_values(['id', 'chunk_index'])
for fid, chunks in df.groupby('id'):
    data = ''.join(chunks['chunk_data'].dropna().astype(str))
    # write data (decoded if base64)

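groupby preserves the row order within each group, so the chunks are joined in chunk_index order. The .dropna() guards against the rare case where a chunk's chunk_data is empty: pandas reads an empty field as NaN, and without the guard astype(str) would splice the literal text "nan" into the document and corrupt it.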
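A fuller, runnable version of the loop might look like the sketch below. The function name `reassemble` and its parameters are assumptions made for illustration, and it only handles plain single-layer base64; the double-encoded case is left to robust_decode.

```python
import base64
from pathlib import Path

import pandas as pd


def reassemble(csv_path, out_dir, suffix, is_base64):
    """Write one file per document id; return the number of files written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Force string dtypes so numeric-looking ids or payloads are not coerced.
    df = pd.read_csv(csv_path, dtype={"id": str, "chunk_data": str})
    df = df.sort_values(["id", "chunk_index"])
    written = 0
    for fid, chunks in df.groupby("id"):
        # Empty chunks arrive as NaN; drop them before joining.
        data = "".join(chunks["chunk_data"].dropna())
        # Simplification: single-layer base64 only (assumption of this sketch).
        payload = base64.b64decode(data) if is_base64 else data.encode("utf-8")
        (out_dir / f"{fid}{suffix}").write_bytes(payload)
        written += 1
    return written
```

Under these assumptions, CCDA would be processed with `reassemble(ccda_csv, work_dir / "ccda_assembled", ".xml", True)` and FHIR with `reassemble(fhir_csv, work_dir / "fhir_assembled", ".json", False)`.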

Base64 decoding (`robust_decode`)¶

CCDA chunks are base64-encoded. Some warehouses double-encode (the source table itself stores base64, then the export wraps it in another base64 layer). robust_decode handles both cases:

If the input already looks decoded (PDF / RTF / XML / HTML magic bytes), return it untouched.
Strip non-base64 characters and pad to a multiple of 4.
Decode once. If the result still doesn't look decoded, try a second pass.

The "looks decoded" check is critical when the input is binary bytes. Round-tripping binary content through a UTF-8 decode (errors='ignore') silently drops non-UTF-8 bytes from PDF streams and corrupts them, so the stage detects bytes that already pass the magic-byte check and returns them unchanged.
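The three steps can be sketched as follows. This is not the stage's actual robust_decode: the magic-byte list, the helper names `looks_decoded` and `_b64_once`, and the choice to keep the second pass only when its result is recognisable are all assumptions made for this example.

```python
import base64
import binascii
import re

# Assumed magic prefixes (lowercase); the real stage's list may differ.
MAGIC = (b"%pdf", b"{\\rtf", b"<?xml", b"<!doctype", b"<html", b"<clinicaldocument")
_NON_B64 = re.compile(rb"[^A-Za-z0-9+/]")


def looks_decoded(data: bytes) -> bool:
    """True if data starts with a known magic prefix (after BOM/whitespace)."""
    head = data.lstrip(b"\xef\xbb\xbf \t\r\n")[:32].lower()
    return head.startswith(MAGIC)


def _b64_once(data: bytes) -> bytes:
    """Strip non-base64 characters, re-pad to a multiple of 4, decode."""
    cleaned = _NON_B64.sub(b"", data)  # also drops existing '=' padding
    cleaned += b"=" * (-len(cleaned) % 4)
    return base64.b64decode(cleaned)


def robust_decode(data) -> bytes:
    """Decode single- or double-encoded base64; pass decoded input through."""
    raw = data.encode("utf-8") if isinstance(data, str) else bytes(data)
    if looks_decoded(raw):
        return raw  # step 1: already decoded, bytes returned unchanged
    try:
        first = _b64_once(raw)  # steps 2 and 3: clean, pad, decode once
    except binascii.Error:
        return raw
    if looks_decoded(first):
        return first
    try:
        second = _b64_once(first)  # possible double-encoded export
    except binascii.Error:
        return first
    # Keep the second pass only if it produced something recognisable.
    return second if looks_decoded(second) else first
```

The function works on bytes throughout, so binary payloads such as PDF streams never pass through a lossy UTF-8 decode.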

CSV resilience¶

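pandas.read_csv handles quoting, embedded commas, and embedded newlines correctly as long as the warehouse's CSV writer follows the RFC 4180 quoting rules (fields wrapped in double quotes, embedded quotes doubled). The Databricks export examples in Data extraction use the Spark CSV writer. Note that Spark's default escape character is a backslash rather than a doubled quote, so its output matches RFC 4180 only when the writer sets option("escape", "\""); if an export kept the backslash default, pass escapechar='\\' to read_csv instead.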
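A small round trip shows the behaviour this relies on. The sample payload and the `doc-1` id are made up for the example; Python's csv module stands in for the warehouse writer because its default dialect doubles embedded quotes in the RFC 4180 style.

```python
import csv
import io

import pandas as pd

# A JSON payload containing quotes, a comma, and a newline.
payload = '{"resourceType": "Bundle", "note": "line one,\nline two"}'

buf = io.StringIO()
writer = csv.writer(buf)  # default dialect: quote when needed, double embedded quotes
writer.writerow(["id", "chunk_index", "chunk_data"])
writer.writerow(["doc-1", 0, payload])

# Read the text back with pandas, keeping chunk_data as a string.
df = pd.read_csv(io.StringIO(buf.getvalue()), dtype={"chunk_data": str})
restored = df.loc[0, "chunk_data"]
print(restored == payload)
```

If the comparison fails on a real export, the writer's quoting and escape settings are the first thing to check.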

Performance¶

For a typical export (~22,000 documents, ~10 GB of source XML), reassembly takes about three minutes on commodity hardware. The bottleneck is CSV parsing; once decoded, writing files is fast.