Downloads¶
Everything you need to run the pipeline locally is hosted directly from this site. Right-click any link and pick "Save link as..." (or just click to download).
Pipeline runtime¶
The two files needed to ingest your own data and view the result:
| File | Size | Purpose |
|---|---|---|
run_pipeline.py |
~78 KB | The full ETL pipeline. Single-file Python module with all seven stages (six in the JAMIA manuscript plus the code-inventory output). Edit BASE_DIR near the top and HARDCODED_TEST_EXCLUSIONS if needed, then run. |
dashboard.html |
~60 KB | Browser-only viewer. Open it locally, click the file picker, point it at your dashboard_data.json. No server required. |
omop_etl.py |
~16 KB | OMOP CDM v5.4 ETL. Reads dashboard_data.json plus an Athena vocabulary download and writes nine CDM tables to a folder tagged with the vocabulary release version. See OMOP ETL. |
phenopackets_etl.py |
~30 KB | GA4GH Phenopackets v2 ETL. Reads dashboard_data.json plus optional note_extractions.csv and an Athena vocabulary directory; emits one Phenopacket JSON per patient with HPO phenotypes, Mondo diseases, LOINC measurements, RxNorm/SNOMED medical actions, and a placeholder genetic-interpretation block. Seed mappings cover ALS, epilepsy, and autoimmune disease. See Phenopackets ETL. |
mondo_omop_bridge.py |
~16 KB | Mondo-OMOP bridge & rare disease cohort builder. Given a Mondo term ID, walks the Mondo disease hierarchy to find every descendant and emits a code list (SNOMED CT + ICD-10-CM) defining the cohort, plus OMOP standard concept_ids if Athena is available. Includes GARD/NORD/Orphanet rare disease subset flags. Adapted from Monarch Initiative's mondo2omop. See Mondo-OMOP bridge. |
note_extraction.py |
~12 KB | Regex-based recovery of ALS-specific content (ALSFRS-R, ECAS, El Escorial, family history, genetic mutations, treatment milestones) from CCDA narratives and decoded note text. Seed patterns intended for site-specific tuning. See Note extraction. |
device_extraction.py |
~28 KB | Equipment & DME extraction. Walks the bundle for HCPCS Level II / SNOMED CT / CPT-4 device codes and runs regex against narratives for AAC devices, wheelchairs, BiPAP/NIV, cough-assist, hospital beds, PEG tubes, and OT/PT/SLP equipment referrals. Emits device_codes.csv and device_extractions.csv. See Device & equipment extraction. |
Example notebook¶
| File | Size | Purpose |
|---|---|---|
ARC_pipeline.ipynb |
~12 KB | Six-cell example notebook for end-to-end runs. Loads inputs, runs the pipeline, prints diagnostics, saves outputs. Works in any Jupyter-compatible environment. |
Configuration templates¶
| File | Size | Purpose |
|---|---|---|
test_patients.txt |
~1 KB | File-based exclusion-rule template with the four match modes commented in. See Stage 6. |
Databricks data extraction¶
These run inside Databricks (or any Spark-SQL engine) and produce the chunked CSV inputs the pipeline reads. See Data extraction (Databricks) for full instructions.
| File | Size | Purpose |
|---|---|---|
databricks_export.py |
~9 KB | Full PySpark notebook with config block, validation, and consolidation. |
databricks_export.sql |
~3 KB | Pure-SQL alternative for SQL-only Databricks notebooks. |
Sample data¶
A complete synthetic ALS patient -- 84 records covering every category -- that you can run the pipeline against without any real EHR access. See Sample data for what's in each file.
| File | Size | Purpose |
|---|---|---|
sample_ccda.xml |
~19 KB | Rich CCDA continuity-of-care document, 8 populated sections |
sample_fhir_bundle.json |
~80 KB | FHIR R4 Bundle, 57 resources covering every extracted type |
patient_master.csv |
~70 KB | Long-format master CSV produced by Stage 7b. One row per record, patient demographics on every row. See Patient master CSV. |
uuid_mapping.csv |
<1 KB | Document-to-patient bridge with full demographics |
ccda_chunks.csv |
~25 KB | Pipeline-ready chunked CSV for the CCDA |
fhir_chunks.csv |
~85 KB | Pipeline-ready chunked CSV for the FHIR bundle |
Quick local setup¶
To get a fully working local copy, download these four files plus the five sample-data files:
your-work-dir/
├── run_pipeline.py <- from "Pipeline runtime"
├── dashboard.html <- from "Pipeline runtime"
├── test_patients.txt <- from "Configuration templates"
├── uuid_mapping.csv <- from "Sample data"
└── CCDA and FHIR data/
├── ccda_chunks.csv <- from "Sample data"
└── fhir_chunks.csv <- from "Sample data"
Then pip install pandas pypdf openpyxl and python run_pipeline.py.
See Quickstart for full instructions.