Skip to content

Data extraction (Databricks)

The ARC pipeline reads three CSV files generated by your clinical data warehouse. This page provides ready-to-run Databricks code that produces those files from typical SMART-on-FHIR ingestion tables.

Output files

The export produces three CSVs:

File Format Per-row content
ccda_chunks.csv id, chunk_index, chunk_data base64-encoded CCDA, 30k chars per chunk
fhir_chunks.csv id, chunk_index, chunk_data raw FHIR JSON text, 30k chars per chunk
uuid_mapping.csv document_uuid, patient_id (+ optional demographics) one row per document

Why chunked CSV

Clinical CCDA documents and FHIR bundles routinely exceed 100 KB and can exceed several megabytes. Most warehouses (Databricks Unity Catalog included) enforce a per-cell size limit on CSV exports -- typically 32 KB. Chunking sidesteps this by splitting each document's content into multiple rows.

The chunking is reproducible and stable: chunk index N always contains characters [N*chunk_size, (N+1)*chunk_size) of the source. The pipeline reassembles by sorting on (id, chunk_index) and concatenating.

PySpark notebook

The full notebook is included in the documentation assets: databricks_export.py

Configuration block at the top:

CCDA_TABLE        = "your_database.documents"
CCDA_ID_COL       = "document_uuid"
CCDA_CONTENT_COL  = "content_base64"
CCDA_FORMAT_COL   = "document_format"
CCDA_FORMAT_VALUE = "ccda"

FHIR_TABLE       = "your_database.fhir_bundles"
FHIR_ID_COL      = "bundle_uuid"
FHIR_CONTENT_COL = "bundle_json"

MAPPING_TABLE       = "your_database.documents"
MAPPING_DOC_ID_COL  = "document_uuid"
MAPPING_PATIENT_COL = "patient_id"
# ... optional demographic columns

OUTPUT_DIR = "/dbfs/FileStore/arc_pipeline_export"
CHUNK_SIZE = 30000

The chunking SQL

The core operation is a single Spark SQL expression that produces one row per chunk without a Python UDF:

SELECT id, chunk_index, chunk_data
FROM (
  SELECT
    id,
    posexplode(
      transform(
        sequence(0, CAST(ceil(length(content) / 30000.0) AS INT) - 1),
        i -> substring(content, i * 30000 + 1, 30000)
      )
    ) AS (chunk_index, chunk_data)
  FROM source
)
ORDER BY id, chunk_index

How it works:

  1. length(content) / 30000 -- number of chunks needed (with ceiling).
  2. sequence(0, n - 1) -- generate chunk indices [0, 1, 2, ...].
  3. transform(seq, i -> substring(...)) -- slice content for each index.
  4. posexplode -- flatten the array into one row per slice.

This runs entirely in Spark's JVM execution path -- no Python serialization overhead. A 100 GB source table chunks in a few minutes on a standard cluster.

Pure SQL alternative

For sites that prefer plain SQL notebooks, the same logic in databricks_export.sql.

Output handling

Spark's CSV writer creates a directory containing a part-file rather than a single named CSV. The notebook's last cell renames the part-files:

parts = glob.glob(os.path.join(spark_output_dir, "part-*.csv"))
target = os.path.join(parent_dir, "ccda_chunks.csv")
shutil.move(parts[0], target)

After this step, you have three plain CSV files ready to drop into the ARC pipeline working directory.

Adapting to other warehouses

The chunking SQL is portable to any engine that supports sequence, transform, and posexplode-style array unnesting:

Warehouse Equivalent functions
Snowflake LATERAL FLATTEN(INPUT => SPLIT_TO_ARRAY(...)) or a recursive CTE
BigQuery UNNEST(GENERATE_ARRAY(0, n)) + SUBSTR()
Postgres generate_series(0, n) + substring()
DuckDB range(0, n) + substr(), identical pattern

The pipeline doesn't care where the chunks come from -- as long as the CSV has id, chunk_index, chunk_data columns and the rows can be sorted by (id, chunk_index) to reassemble the document, you're good.

Validation

After the export, the notebook validates output by reading the CSVs back and counting rows + distinct documents. A sane export looks like:

ccda_chunks: 47,213 rows, 22,691 distinct documents
fhir_chunks:  3,612 rows,  1,806 distinct documents
uuid_mapping: 23,593 rows

If ccda_chunks rows / distinct ids is approximately 2-3, your CCDA documents are 60-90 KB on average, which is normal. If it's 10+ you have some very large documents (100s of pages) -- still fine, just expect slower reassembly.