Skip to content

Downloads

Everything you need to run the pipeline locally is hosted directly from this site. Right-click any link and pick "Save link as..." (or just click to download).

Pipeline runtime

The two files needed to ingest your own data and view the result:

File Size Purpose
run_pipeline.py ~78 KB The full ETL pipeline. Single-file Python module with all seven stages (six in the JAMIA manuscript plus the code-inventory output). Edit BASE_DIR near the top and HARDCODED_TEST_EXCLUSIONS if needed, then run.
dashboard.html ~60 KB Browser-only viewer. Open it locally, click the file picker, point it at your dashboard_data.json. No server required.
omop_etl.py ~16 KB OMOP CDM v5.4 ETL. Reads dashboard_data.json plus an Athena vocabulary download and writes nine CDM tables to a folder tagged with the vocabulary release version. See OMOP ETL.
phenopackets_etl.py ~30 KB GA4GH Phenopackets v2 ETL. Reads dashboard_data.json plus optional note_extractions.csv and an Athena vocabulary directory; emits one Phenopacket JSON per patient with HPO phenotypes, Mondo diseases, LOINC measurements, RxNorm/SNOMED medical actions, and a placeholder genetic-interpretation block. Seed mappings cover ALS, epilepsy, and autoimmune disease. See Phenopackets ETL.
mondo_omop_bridge.py ~16 KB Mondo-OMOP bridge & rare disease cohort builder. Given a Mondo term ID, walks the Mondo disease hierarchy to find every descendant and emits a code list (SNOMED CT + ICD-10-CM) defining the cohort, plus OMOP standard concept_ids if Athena is available. Includes GARD/NORD/Orphanet rare disease subset flags. Adapted from Monarch Initiative's mondo2omop. See Mondo-OMOP bridge.
note_extraction.py ~12 KB Regex-based recovery of ALS-specific content (ALSFRS-R, ECAS, El Escorial, family history, genetic mutations, treatment milestones) from CCDA narratives and decoded note text. Seed patterns intended for site-specific tuning. See Note extraction.
device_extraction.py ~28 KB Equipment & DME extraction. Walks the bundle for HCPCS Level II / SNOMED CT / CPT-4 device codes and runs regex against narratives for AAC devices, wheelchairs, BiPAP/NIV, cough-assist, hospital beds, PEG tubes, and OT/PT/SLP equipment referrals. Emits device_codes.csv and device_extractions.csv. See Device & equipment extraction.

Example notebook

File Size Purpose
ARC_pipeline.ipynb ~12 KB Six-cell example notebook for end-to-end runs. Loads inputs, runs the pipeline, prints diagnostics, saves outputs. Works in any Jupyter-compatible environment.

Configuration templates

File Size Purpose
test_patients.txt ~1 KB File-based exclusion-rule template with the four match modes commented in. See Stage 6.

Databricks data extraction

These run inside Databricks (or any Spark-SQL engine) and produce the chunked CSV inputs the pipeline reads. See Data extraction (Databricks) for full instructions.

File Size Purpose
databricks_export.py ~9 KB Full PySpark notebook with config block, validation, and consolidation.
databricks_export.sql ~3 KB Pure-SQL alternative for SQL-only Databricks notebooks.

Sample data

A complete synthetic ALS patient -- 84 records covering every category -- that you can run the pipeline against without any real EHR access. See Sample data for what's in each file.

File Size Purpose
sample_ccda.xml ~19 KB Rich CCDA continuity-of-care document, 8 populated sections
sample_fhir_bundle.json ~80 KB FHIR R4 Bundle, 57 resources covering every extracted type
patient_master.csv ~70 KB Long-format master CSV produced by Stage 7b. One row per record, patient demographics on every row. See Patient master CSV.
uuid_mapping.csv <1 KB Document-to-patient bridge with full demographics
ccda_chunks.csv ~25 KB Pipeline-ready chunked CSV for the CCDA
fhir_chunks.csv ~85 KB Pipeline-ready chunked CSV for the FHIR bundle

Quick local setup

To get a fully working local copy, download these four files plus the five sample-data files:

your-work-dir/
├── run_pipeline.py            <- from "Pipeline runtime"
├── dashboard.html             <- from "Pipeline runtime"
├── test_patients.txt          <- from "Configuration templates"
├── uuid_mapping.csv           <- from "Sample data"
└── CCDA and FHIR data/
    ├── ccda_chunks.csv        <- from "Sample data"
    └── fhir_chunks.csv        <- from "Sample data"

Then pip install pandas pypdf openpyxl and python run_pipeline.py. See Quickstart for full instructions.