Skip to content

Installation

The pipeline runs on any standard Python environment. There are no infrastructure requirements beyond Python and a few libraries.

Get the code

Every file referenced on this page is downloadable from the Downloads page.

Requirements

  • Python 3.10 or newer
  • 4 GB RAM (more for very large cohorts)
  • Local disk space roughly 3x the size of the chunked CSV inputs

Dependencies

pip install pandas pypdf openpyxl
Package Purpose
pandas CSV reading, Excel output
pypdf PDF text extraction
openpyxl Excel writer engine

That's the entire dependency list. The pipeline uses only standard library modules (xml.etree.ElementTree, json, base64, re, unicodedata, logging) for everything else.

Files you need

Copy these into your working directory:

File Source
run_pipeline.py from the project release
dashboard.html from the project release
test_patients.txt from the project release (or write your own)

And generate (or receive from your warehouse team):

File How
CCDA and FHIR data/ccda_chunks.csv see Databricks export
CCDA and FHIR data/fhir_chunks.csv see Databricks export
uuid_mapping.csv see Databricks export

Running

The pipeline is a single Python module. Three equivalent ways to run it:

cd /path/to/working/directory
python run_pipeline.py
import sys
sys.path.insert(0, '/path/to/working/directory')
import run_pipeline
bundle = run_pipeline.main()
from run_pipeline import main
bundle = main()
print(f"Patients: {len(bundle['patients'])}")

The first form writes outputs to disk and exits. The second and third return the bundle dict in memory for further programmatic use.

Configuration

The default working directory and input paths are defined as constants near the top of run_pipeline.py. See Configuration for the full list and how to override them.