Data extraction (Databricks)¶
The ARC pipeline reads three CSV files generated by your clinical data warehouse. This page provides ready-to-run Databricks code that produces those files from typical SMART-on-FHIR ingestion tables.
Output files¶
The export produces three CSVs:
| File | Format | Per-row content |
|---|---|---|
ccda_chunks.csv |
id, chunk_index, chunk_data |
base64-encoded CCDA, 30k chars per chunk |
fhir_chunks.csv |
id, chunk_index, chunk_data |
raw FHIR JSON text, 30k chars per chunk |
uuid_mapping.csv |
document_uuid, patient_id (+ optional demographics) |
one row per document |
Why chunked CSV¶
Clinical CCDA documents and FHIR bundles routinely exceed 100 KB and can exceed several megabytes. Most warehouses (Databricks Unity Catalog included) enforce a per-cell size limit on CSV exports -- typically 32 KB. Chunking sidesteps this by splitting each document's content into multiple rows.
The chunking is reproducible and stable: chunk index N always contains
characters [N*chunk_size, (N+1)*chunk_size) of the source. The pipeline
reassembles by sorting on (id, chunk_index) and concatenating.
PySpark notebook¶
The full notebook is included in the documentation assets:
databricks_export.py
Configuration block at the top:
CCDA_TABLE = "your_database.documents"
CCDA_ID_COL = "document_uuid"
CCDA_CONTENT_COL = "content_base64"
CCDA_FORMAT_COL = "document_format"
CCDA_FORMAT_VALUE = "ccda"
FHIR_TABLE = "your_database.fhir_bundles"
FHIR_ID_COL = "bundle_uuid"
FHIR_CONTENT_COL = "bundle_json"
MAPPING_TABLE = "your_database.documents"
MAPPING_DOC_ID_COL = "document_uuid"
MAPPING_PATIENT_COL = "patient_id"
# ... optional demographic columns
OUTPUT_DIR = "/dbfs/FileStore/arc_pipeline_export"
CHUNK_SIZE = 30000
The chunking SQL¶
The core operation is a single Spark SQL expression that produces one row per chunk without a Python UDF:
SELECT id, chunk_index, chunk_data
FROM (
SELECT
id,
posexplode(
transform(
sequence(0, CAST(ceil(length(content) / 30000.0) AS INT) - 1),
i -> substring(content, i * 30000 + 1, 30000)
)
) AS (chunk_index, chunk_data)
FROM source
)
ORDER BY id, chunk_index
How it works:
length(content) / 30000-- number of chunks needed (with ceiling).sequence(0, n - 1)-- generate chunk indices[0, 1, 2, ...].transform(seq, i -> substring(...))-- slice content for each index.posexplode-- flatten the array into one row per slice.
This runs entirely in Spark's JVM execution path -- no Python serialization overhead. A 100 GB source table chunks in a few minutes on a standard cluster.
Pure SQL alternative¶
For sites that prefer plain SQL notebooks, the same logic in
databricks_export.sql.
Output handling¶
Spark's CSV writer creates a directory containing a part-file rather than a single named CSV. The notebook's last cell renames the part-files:
parts = glob.glob(os.path.join(spark_output_dir, "part-*.csv"))
target = os.path.join(parent_dir, "ccda_chunks.csv")
shutil.move(parts[0], target)
After this step, you have three plain CSV files ready to drop into the ARC pipeline working directory.
Adapting to other warehouses¶
The chunking SQL is portable to any engine that supports sequence,
transform, and posexplode-style array unnesting:
| Warehouse | Equivalent functions |
|---|---|
| Snowflake | LATERAL FLATTEN(INPUT => SPLIT_TO_ARRAY(...)) or a recursive CTE |
| BigQuery | UNNEST(GENERATE_ARRAY(0, n)) + SUBSTR() |
| Postgres | generate_series(0, n) + substring() |
| DuckDB | range(0, n) + substr(), identical pattern |
The pipeline doesn't care where the chunks come from -- as long as the CSV
has id, chunk_index, chunk_data columns and the rows can be sorted by
(id, chunk_index) to reassemble the document, you're good.
Validation¶
After the export, the notebook validates output by reading the CSVs back and counting rows + distinct documents. A sane export looks like:
ccda_chunks: 47,213 rows, 22,691 distinct documents
fhir_chunks: 3,612 rows, 1,806 distinct documents
uuid_mapping: 23,593 rows
If ccda_chunks rows / distinct ids is approximately 2-3, your CCDA
documents are 60-90 KB on average, which is normal. If it's 10+ you have
some very large documents (100s of pages) -- still fine, just expect
slower reassembly.