Funding acknowledgement. This work was supported by the Centers for Disease Control and Prevention grant # R01-TS000341.

Open-source · Local-first · No servers required

Registry Forge

From raw EHR data to registry analytics — the whole pipeline, in one place.

Take raw EHR exports — C-CDA XML, FHIR R4 Bundles, RTF clinical notes, PDFs, and HTML fragments — and produce a research-ready data set in one pass: a structured patient bundle, OMOP CDM v5.4 tables, GA4GH Phenopackets, a browser dashboard, a shareable cohort EDA report, a drug repurposing report with CURE ID export, and privacy-safe device-and-equipment and environmental-exposure dashboards. Single-file Python, runs locally, ships with a five-tier QC framework and an inventory of every vocabulary and code it touched.

C-CDA XML · FHIR R4 JSON · RTF notes · PDF · HTML

Quickstart · Cohort dashboard · Cohort EDA · Note extraction · Drug repurposing report · Device dashboard · Exposure dashboard · Downloads

Six live demos run entirely in your browser: the interactive cohort dashboard (load your own bundle JSON), the privacy-safe cohort EDA report, the clinical-note extraction dashboard, the drug repurposing report, and the device-and-equipment and environmental-exposure dashboards — all pre-loaded against synthetic ALS cohorts.

Live demos

Six privacy-safe demos — open the HTML and you're done

Every Registry Forge demo is a single self-contained HTML file. No server, no JavaScript build, no external scripts. PT-NNNN pseudonyms, year-only dates, and k-anonymity suppression are baked into the static dashboards and reports. Open one in any browser, or share it with collaborators as an email attachment.

Cohort dashboard

The interactive patient-and-cohort dashboard produced by the core pipeline. Loads your own bundle JSON in the browser and shows demographics, problems, medications, labs, encounters, and an integrated note viewer for every patient.

Open demo
Cohort EDA report

Privacy-safe exploratory analysis report. ALS-specific measurements (ALSFRS-R, FVC% predicted, El Escorial), demographics, comorbidities, and code-vocabulary distribution. Designed for sharing outside the clinical firewall.

Open demo
Clinical-note extraction dashboard

ALS-specific clinical content recovered from unstructured narrative: ALSFRS-R total and four subdomain scores, FVC % predicted, El Escorial / Awaji-Shima certainty, onset region, family history, genetic mutations, FTD-spectrum mentions, and dated treatment milestones (PEG, tracheostomy, NIV, riluzole, edaravone).

Open demo
Drug repurposing report

Reimer-methodology pharmacoepidemiology hypothesis generation. Produces an HTML report with the ALS clinical-interest medication panel, an ATC-class-grouped medication summary, a forest plot, and a CURE ID intake CSV ready for off-label-use reporting.

Open demo
Device & equipment dashboard

Devices, durable-medical-equipment indicators, and ALS-care-pathway procedures (spirometry, sleep studies, EMG/NCS, PT/OT evaluations, speech screening, mobility status). Combines structured codes and note-regex matches with source-snippet drill-down.

Open demo
Environmental / occupational exposure dashboard

ALS-relevant environmental, occupational, and toxic exposures grounded in ECTO, the GA4GH-Phenopackets-compatible exposure ontology. Smoking, military service, pesticides, heavy metals, solvents, head trauma, cyanotoxins, asbestos, EMF, air pollution, mold — with verified ECTO term IDs where available and an explicit curation worklist for the rest.

Open demo

Everything you get

A complete EHR-to-research pipeline, in sixteen modules

Registry Forge is organized as a small core pipeline (run_pipeline.py, Stages 1–7, plus the browser dashboard) and a set of independent add-on modules that consume the bundle the core produces. Run only the core and you have a researcher-usable bundle and a dashboard. Add OMOP for OHDSI federation, Phenopackets for rare disease research, the cohort EDA for PHI-safe sharing, drug repurposing for pharmacoepidemiology hypothesis generation — whatever your use case calls for. Every module is plain Python with editable seed dictionaries; the modules are starting points, not finished products.

Multi-format ingest

Reads raw C-CDA XML, FHIR R4 JSON Bundles, RTF clinical notes, HTML fragments, base64-chunked CSV exports, and PDFs, routing each to a format-specific parser by magic-byte detection. No conversion required upstream.
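As a rough illustration of magic-byte routing, the sketch below inspects the first bytes of a payload and returns a format label. The function name `detect_format`, the label strings, and the exact set of checks are assumptions made for this example; they are not taken from the Registry Forge source.

```python
def detect_format(data: bytes) -> str:
    """Guess a payload's format from its leading bytes (hypothetical helper)."""
    head = data[:512]
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]  # drop a UTF-8 byte-order mark
    head = head.lstrip()  # ignore leading whitespace

    if head.startswith(b"%PDF-"):
        return "pdf"
    # RTF must be checked before JSON because both start with "{".
    if head.startswith(b"{\\rtf"):
        return "rtf"
    if head.startswith((b"{", b"[")):
        return "fhir-json" if b'"resourceType"' in data else "json"

    lowered = head.lower()
    if lowered.startswith(b"<?xml") or b"<clinicaldocument" in lowered:
        return "ccda-xml" if b"ClinicalDocument" in data else "xml"
    if lowered.startswith(b"<"):
        return "html"  # covers full pages and bare fragments
    return "unknown"


if __name__ == "__main__":
    print(detect_format(b"%PDF-1.7\n"))
    print(detect_format(b'{"resourceType": "Bundle", "entry": []}'))
```

Ordering matters in a detector like this: the most specific signatures are tested first, and the permissive "starts with `<`" rule comes last so it only catches what nothing else claimed.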
Cross-format patient linkage

Resolves FHIR medicationReference, DocumentReference.subject, and UUID-to-identifier bridges so the same patient across CCDA + FHIR + notes appears as one record. Survives messy production identifier drift.
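One way to picture reference resolution inside a single FHIR Bundle is the sketch below. It indexes every entry by `fullUrl` and by `ResourceType/id`, then follows a `medicationReference`. The helper names `index_bundle` and `resolve_reference` and the sample bundle are hypothetical; the real pipeline's linkage across C-CDA, FHIR, and notes is more involved than this.

```python
from typing import Optional


def index_bundle(bundle: dict) -> dict:
    """Index Bundle entries by fullUrl and by 'ResourceType/id'."""
    index = {}
    for entry in bundle.get("entry", []):
        resource = entry.get("resource") or {}
        full_url = entry.get("fullUrl")
        if full_url:
            index[full_url] = resource
        if resource.get("resourceType") and resource.get("id"):
            index[f"{resource['resourceType']}/{resource['id']}"] = resource
    return index


def resolve_reference(holder: dict, field: str, index: dict) -> Optional[dict]:
    """Follow holder[field].reference through the index; None if unresolved."""
    reference = (holder.get(field) or {}).get("reference")
    return index.get(reference) if reference else None


# Illustrative synthetic bundle (not real data).
bundle = {
    "resourceType": "Bundle",
    "entry": [
        {"fullUrl": "urn:uuid:med-1",
         "resource": {"resourceType": "Medication", "id": "med-1",
                      "code": {"text": "riluzole 50 mg tablet"}}},
        {"resource": {"resourceType": "MedicationRequest", "id": "rx-1",
                      "medicationReference": {"reference": "urn:uuid:med-1"}}},
    ],
}
index = index_bundle(bundle)
request = index["MedicationRequest/rx-1"]
medication = resolve_reference(request, "medicationReference", index)
```

Indexing under both key styles is what lets a `urn:uuid:` reference and a relative `Medication/med-1` reference land on the same resource.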
Coding troubleshooting & enrichment

Built-in display-name lookup for common LOINC and SNOMED codes fills the blanks the EHR left empty. Multiple codings preserved per record so downstream consumers can pick the one that fits.
Code inventory report

Stage 7 walks every record and emits a CSV listing every (vocabulary, code) pair encountered, with reference and unique-patient counts and the bundle categories the code appears in. Your seed-mapping worklist.
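A minimal sketch of such an inventory follows. The record shape (`patient_id`, `category`, and a `codings` list of `system`/`code` pairs), the column names, and the function names are assumptions for illustration; the actual bundle schema and CSV layout may differ.

```python
import csv
import io
from collections import defaultdict

FIELDS = ["vocabulary", "code", "reference_count", "unique_patients", "categories"]


def build_code_inventory(records):
    """Aggregate (vocabulary, code) pairs across records (hypothetical shape)."""
    stats = defaultdict(lambda: {"refs": 0, "patients": set(), "categories": set()})
    for record in records:
        for coding in record.get("codings", []):
            entry = stats[(coding["system"], coding["code"])]
            entry["refs"] += 1
            entry["patients"].add(record["patient_id"])
            entry["categories"].add(record["category"])
    return [
        {
            "vocabulary": system,
            "code": code,
            "reference_count": entry["refs"],
            "unique_patients": len(entry["patients"]),
            "categories": ";".join(sorted(entry["categories"])),
        }
        for (system, code), entry in sorted(stats.items())
    ]


def inventory_to_csv(rows):
    """Serialize inventory rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative synthetic records.
records = [
    {"patient_id": "PT-0001", "category": "problems",
     "codings": [{"system": "ICD-10-CM", "code": "G12.21"}]},
    {"patient_id": "PT-0002", "category": "problems",
     "codings": [{"system": "ICD-10-CM", "code": "G12.21"}]},
    {"patient_id": "PT-0001", "category": "encounters",
     "codings": [{"system": "ICD-10-CM", "code": "G12.21"},
                 {"system": "SNOMED", "code": "86044005"}]},
]
rows = build_code_inventory(records)
```

Because a record may carry several codings, the reference count tallies codings rather than records, while the patient count deduplicates by `patient_id`.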
Patient master CSV

A long-format master CSV that joins every patient's demographics onto every coded record, with FHIR-only and CCDA-only splits, raw record JSON preserved per row, and Excel-safe encoding for collaborator review.
OMOP CDM v5.4 ETL

Maps every source code to its standard concept via the Athena CONCEPT_RELATIONSHIP "Maps to" relationship and routes records by domain into nine CDM tables. The output folder is tagged with the vocabulary release version for reproducibility.
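The lookup described here can be sketched with small in-memory stand-ins for the Athena CONCEPT and CONCEPT_RELATIONSHIP files. The column names follow the OMOP vocabulary tables, but the concept IDs below are placeholders, the function names are hypothetical, and the domain-to-table map lists only six common domains rather than whatever nine tables the ETL actually writes.

```python
# Common OMOP domain_id values and the CDM table each routes to.
DOMAIN_TO_TABLE = {
    "Condition": "condition_occurrence",
    "Drug": "drug_exposure",
    "Procedure": "procedure_occurrence",
    "Measurement": "measurement",
    "Observation": "observation",
    "Device": "device_exposure",
}


def build_lookups(concepts, relationships):
    """Build source-code, concept-id, and 'Maps to' lookups from row dicts."""
    by_code = {(c["vocabulary_id"], c["concept_code"]): c for c in concepts}
    by_id = {c["concept_id"]: c for c in concepts}
    maps_to = {}
    for rel in relationships:
        # Keep only currently valid 'Maps to' rows.
        if rel["relationship_id"] == "Maps to" and not rel.get("invalid_reason"):
            maps_to.setdefault(rel["concept_id_1"], []).append(rel["concept_id_2"])
    return by_code, by_id, maps_to


def map_source_code(vocabulary_id, code, by_code, by_id, maps_to):
    """Return (standard_concept_id, target_table) pairs for one source code."""
    source = by_code.get((vocabulary_id, code))
    if source is None:
        return []  # unmapped: a candidate for the seed-mapping worklist
    results = []
    for target_id in maps_to.get(source["concept_id"], []):
        target = by_id.get(target_id)
        if target and target.get("standard_concept") == "S":
            results.append((target_id, DOMAIN_TO_TABLE.get(target["domain_id"])))
    return results


# Placeholder rows; the concept_id values are NOT real Athena identifiers.
concepts = [
    {"concept_id": 1001, "vocabulary_id": "ICD10CM", "concept_code": "G12.21",
     "domain_id": "Condition", "standard_concept": None},
    {"concept_id": 2001, "vocabulary_id": "SNOMED", "concept_code": "86044005",
     "domain_id": "Condition", "standard_concept": "S"},
]
relationships = [
    {"concept_id_1": 1001, "concept_id_2": 2001,
     "relationship_id": "Maps to", "invalid_reason": None},
]
by_code, by_id, maps_to = build_lookups(concepts, relationships)
mapped = map_source_code("ICD10CM", "G12.21", by_code, by_id, maps_to)
```

Routing on the target concept's domain, not the source vocabulary, is the point of the design: a code recorded as a diagnosis can legitimately land in a different table once mapped.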
GA4GH Phenopackets v2 ETL

Structured-code-driven Phenopackets: ICD-10 / SNOMED codes map to HPO and Mondo terms via seed tables covering the motor neuron disease spectrum, epilepsy, and autoimmune disease. Plug in a curated genetics CSV to populate full HGNC/HGVS/ACMG GenomicInterpretations.
Mondo-OMOP cohort builder

Given a Mondo term ID, walks the disease hierarchy to find every descendant and emits a code list (SNOMED + ICD-10) defining the cohort, plus OMOP standard concept_ids. Built-in rare disease subset flags (GARD, NORD, Orphanet). Adapted from Monarch Initiative's mondo2omop.
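The hierarchy walk can be illustrated with a breadth-first traversal over a parent-to-children map. The term IDs below are made-up placeholders, not real Mondo identifiers, and the `descendants` helper is a sketch, not the mondo2omop implementation.

```python
from collections import deque


def descendants(root, children):
    """Return every term reachable below `root` in a parent -> children map."""
    seen = set()
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for child in children.get(term, ()):
            if child not in seen:  # guards against shared children and cycles
                seen.add(child)
                queue.append(child)
    seen.discard(root)  # a term is not its own descendant
    return seen


# Placeholder hierarchy; EX:... IDs are illustrative only.
children = {
    "EX:0001": ["EX:0002", "EX:0003"],
    "EX:0002": ["EX:0004"],
    "EX:0003": ["EX:0004"],  # EX:0004 has two parents
}
cohort_terms = descendants("EX:0001", children)
```

The `seen` set matters because disease ontologies are graphs, not trees: a term with several parents would otherwise be visited and emitted more than once.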
Note extraction (regex)

Recovers ALS-specific content from free-text narratives that rarely makes it into discrete fields: ALSFRS-R total + 4 subdomains, ECAS scores, El Escorial, family history, gene mentions, treatment milestones. This is a demonstration layer; production use requires site-specific clinical review.
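To give a sense of what one such pattern might look like, here is a sketch that pulls an ALSFRS-R total out of a narrative and rejects values outside the instrument's 0 to 48 range. The pattern and the function name `extract_alsfrs_totals` are assumptions written for this example, not the expressions the module ships with.

```python
import re

# "ALSFRS-R", up to 30 non-digit characters of filler, then a 1-2 digit score.
ALSFRS_TOTAL = re.compile(
    r"ALSFRS[-\s]?R\b[^\d\n]{0,30}(\d{1,2})(?!\d)",
    re.IGNORECASE,
)


def extract_alsfrs_totals(text: str) -> list:
    """Return plausible ALSFRS-R totals (0-48) mentioned in `text`."""
    totals = []
    for match in ALSFRS_TOTAL.finditer(text):
        score = int(match.group(1))
        if 0 <= score <= 48:  # ALSFRS-R totals cannot exceed 48
            totals.append(score)
    return totals


note = "Seen in clinic today. ALSFRS-R total score: 38/48, down from prior visit."
totals = extract_alsfrs_totals(note)
```

A pattern this simple will misfire on sentences where a date or another number follows the instrument name before the score, which is exactly the kind of case that needs tuning against local note style.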
Device & equipment extraction

Walks the bundle for HCPCS Level II + SNOMED + CPT codes (speech-generating devices, wheelchairs, BiPAP, cough-assist, feeding tubes, hospital beds, orthotics) and runs regex against narratives for both generic equipment terms and brand-name detection (Tobii Dynavox, Trilogy, Hoyer lift, The Vest, Kangaroo pump, AVAPS / NIPPV, OT/PT/SLP eval). Emits two joinable CSVs.
Browser dashboard

A single static HTML file that loads the bundle in any browser — no server, no install. Per-patient views, cohort overview, format filters, and a global keyword search across every document body and clinical record.
Cohort EDA report

A single self-contained HTML page summarizing demographics, code coverage, observation period, vocabulary distribution, and ALS-specific signal — safe to share with colleagues. Pseudonymized IDs, banded ages, k-anonymity suppression, no per-patient diagnostic codes, no free text. Live demo →
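The three privacy measures named above can each be sketched in a few lines. The threshold `k=5`, the ten-year band width, and all function names are assumptions chosen for illustration; the report's actual parameters are not stated here.

```python
from collections import Counter


def pseudonymize(patient_ids):
    """Assign PT-NNNN pseudonyms. Sorting is for repeatability only; a real
    implementation should avoid any ordering that could leak source IDs."""
    return {pid: f"PT-{i:04d}" for i, pid in enumerate(sorted(patient_ids), start=1)}


def age_band(age: int, width: int = 10) -> str:
    """Collapse an exact age into a band such as '60-69'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def suppressed_counts(values, k: int = 5):
    """Count values, replacing any cell smaller than k with a '<k' marker."""
    counts = Counter(values)
    return {value: (n if n >= k else f"<{k}") for value, n in counts.items()}


# Illustrative synthetic ages.
ages = [61, 63, 64, 66, 67, 69, 34, 38]
table = suppressed_counts(age_band(a) for a in ages)
```

Suppressing small cells after banding, not before, keeps the published table useful while ensuring no displayed count describes fewer than `k` people.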
Drug repurposing analysis

Adapts the methodology of Reimer et al., Lancet Digit Health 2026 to your bundle: identifies the motor-neuron-disease cohort, applies the paper’s exposure criteria A and B, groups medications by ATC class, and exports a clean cohort table for downstream Cox / propensity analysis plus a CURE ID intake CSV ready for FDA / NCATS-NIH submission. Highly customizable via the ATC_SEED and exposure-window constants. Read more →
Five-tier QC framework

A starting set of checks — schema validation, mapping coverage tracking, cross-output consistency, clinician spot-review, and synthetic-cohort regression testing — with recommended cadence. Built in, not bolted on. Adopters review their data through the established frameworks: the Kahn et al. 2016 taxonomy and the Book of OHDSI Chapter 15 + OHDSI DQD for OMOP.
Reproducibility built in

OMOP and Phenopackets output folders carry the vocabulary release version in the folder name; metaData.resources records every ontology version used. Any output file is traceable to the exact mappings that produced it.
Synthetic demo cohort

Jane Marie Demo is a clinically realistic synthetic ALS patient with 84 records spanning every bundle category. She ships with the pipeline, runs end-to-end in seconds, and serves as the regression-test bedrock for every change.

Tested across multiple production source-data variants

Production C-CDA exports · FHIR R4 Bundle · Real ARC production data · Production RTF notes · Generic PDF · Databricks chunked CSV

Pipeline at a glance

Databricks export (chunked CSVs)  +  FHIR Bundle pulls  +  C-CDA XML
                              |
                              v
+----------------------------------------------------------+
|  Stage 1   Decoding & reassembly   (base64, chunk concat) |
|  Stage 2   Format detection         (magic-byte routing)  |
|  Stage 3   FHIR resource extraction (13 resource types)   |
|  Stage 4   Joining & assembly       (cross-format linkage)|
|  Stage 5   Display-name enrichment  (LOINC / SNOMED)      |
|  Stage 6   Test-patient exclusion   (rule-based filter)   |
|  Stage 7   Code inventory + master CSV + note extraction  |
+----------------------------------------------------------+
                              |
            +-----------------+-----------------+
            v                 v                 v
   dashboard_data.json   omop_etl.py    phenopackets_etl.py
   + dashboard.html      9 OMOP CDM     GA4GH Phenopackets v2
   + patient_master.csv  v5.4 tables    + cohort + summary
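Stage 1 in the diagram (base64 decoding and chunk concatenation) might look roughly like the sketch below. It assumes each document was base64-encoded once and then split into ordered text chunks, and that the CSV has `doc_id`, `chunk_index`, and `chunk_b64` columns. Those column names, the chunking scheme, and the `reassemble` helper are all hypothetical; check them against your own export.

```python
import base64
import csv
import io
from collections import defaultdict


def reassemble(csv_text: str) -> dict:
    """Rebuild documents from chunked base64 rows (assumed column names)."""
    chunks = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        chunks[row["doc_id"]].append((int(row["chunk_index"]), row["chunk_b64"]))

    documents = {}
    for doc_id, parts in chunks.items():
        # Rows may arrive in any order, so sort by chunk_index before joining.
        joined = "".join(text for _, text in sorted(parts))
        documents[doc_id] = base64.b64decode(joined)
    return documents


if __name__ == "__main__":
    payload = b"<ClinicalDocument>synthetic demo</ClinicalDocument>"
    encoded = base64.b64encode(payload).decode("ascii")
    pieces = [encoded[i:i + 16] for i in range(0, len(encoded), 16)]
    lines = ["doc_id,chunk_index,chunk_b64"]
    lines += [f"doc-1,{i},{piece}" for i, piece in enumerate(pieces)]
    print(reassemble("\n".join(lines))["doc-1"])
```

Joining the chunks before decoding is deliberate under this assumption: a chunk boundary can fall in the middle of a four-character base64 group, so decoding chunk by chunk would fail or corrupt the output.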

Get started

Overview — a single-page summary
Installation — Python environment setup
Quickstart — run end-to-end against the included synthetic data
Live dashboard demo — the patient dashboard running on the synthetic cohort
Cohort EDA demo — the no-PHI cohort report you can share with colleagues
Data extraction (Databricks) — generate the chunked CSV inputs from your warehouse

Built on the same foundation as established consumer registry platforms

Registry Forge uses the same patient-directed SMART on FHIR + OAuth 2.0 acquisition pattern that has become standard across consumer health-record applications, registry platforms, and rare disease frameworks built on REDCap or similar tools. It is the open-source ETL layer for organizations that want the same data flow without standing up a vendor platform.

A note on validation

Registry Forge has been built and tested against the ALS Therapy Development Institute's ARC Study data. We have not yet tested it against data from other organizations, EHR vendors, or registry deployments. If you would like to use Registry Forge with your own data and help us understand how it performs in other settings, please contact us — we would welcome the collaboration.

Registry Forge gives you the bones. You bring the judgment.

Registry Forge automates the engineering work — schema parsing, code system normalization, vocabulary harmonization, FHIR / C-CDA chunk reassembly, the OMOP and Phenopackets ETLs, the device and note extraction layers — that would otherwise take a small team many thousands of hours to write from scratch. What it does not do is replace your own clinical, methodological, and editorial judgment about your registry.

Real ETL is iterative:

You will look at the OMOP CONDITION_OCCURRENCE table and decide some source codes shouldn't have mapped to those standard concepts — or that they need a different mapping for your study.
You will look at the cohort EDA and decide to drop a handful of patients who shouldn't be in the analysis, or to revisit a participant whose record looks anomalous.
You will look at the Phenopackets output and add disease seed mappings before submitting to a Matchmaker network.
You will look at the device extraction CSVs and decide which brand-name matches to collapse to a generic class, and which to keep distinct.
You will tune the note-extraction regex against your own narrative style.

That iteration is your work, and it's what makes the registry trustworthy. Registry Forge is a starting point that gets you to that iteration much faster — not a substitute for it.

For the canonical reference on doing this work well, read the OHDSI community's Book of OHDSI, Chapter 6 — Extract Transform Load. It covers how to plan an ETL, how to think about source-to-target mapping, when to add custom logic, and how to validate the result. Registry Forge follows the spirit of those practices; the depth comes from you.

About, citation, and contact

Registry Forge was developed by Danielle Boyce, MPH, DPA, of the ALS Therapy Development Institute, as part of the ARC Study natural history registry program. De-identified ARC data is being made available through the ARC Data Commons.

A DOI for this software and its companion manuscript is forthcoming. Registry Forge is released under the MIT License — permissive, no warranty, free to use and modify. If you would like help adapting it to your own registry or EHR vendor, please reach out: dboyce@als.net.

See the About & license page for full details, acknowledgements, and how to cite. For related tools and training — CDAtransformer (single-file C-CDA inspection), the ALS TDI Real World Evidence Resources hub, Guide to Real-World Data for Clinical Research, and the OMOP introductory course — see the Resources & related tools page.