How it differs from Registry Forge

A side-by-side comparison. Both projects produce the same downstream data shape; they differ in everything upstream and in what the optional advanced exports cover.

| Dimension | Registry Forge | Patient Edition |
| --- | --- | --- |
| Primary input | Databricks chunked CSV exports, FHIR R4 Bundles, C-CDA from system-level exports | A folder or zip of patient-downloaded C-CDA XML files |
| Source environment | EHR / data warehouse, SMART-on-FHIR backend, OAuth 2.0 | Researcher's laptop or local Jupyter notebook |
| Network calls during run | Yes - fetches FHIR resources, optionally vocabulary services | None. Everything offline. |
| Identity model | EHR-supplied patient IDs, cross-format reference resolution (CCDA ↔ FHIR ↔ Notes), MRN-based dedup | Stable hash of MRN if available, else (last_name, first_name, dob). No cross-format joining (everything is C-CDA). |
| Scale assumption | Tens to thousands of patients | One to a few hundred |
| Pipeline stages | 7 core + 6 add-ons | 5 core (discovery, parsing, identity, assembly, outputs) + 5 optional (OMOP, Mondo, Phenopackets, EDA, notes flagging) |
| Output schema | Long-format patient_master.csv + multi-format JSON bundle | Same long-format patient_master.csv |
| Dashboard | Multi-patient browser dashboard with per-patient drill-down | Same concept; rebuilt to be agnostic to single- vs. multi-patient mode |
| Vocabularies parsed | SNOMED, RxNorm, LOINC, ICD-9/10, CPT, HCPCS, CVX, ATC, HPO, Mondo | SNOMED, RxNorm, LOINC, ICD-9/10, CPT, HCPCS, CVX, NDC; optional Mondo via starter crosswalk; optional OMOP concept resolution via Athena vocabulary |
| Validation | Tested against ALS TDI ARC Study production data | Tested against synthetic epilepsy demo data and real MyChart exports |
| License | MIT | MIT |
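
The Patient Edition identity rule in the table (a stable hash of the MRN if available, else last name, first name, and date of birth) can be sketched as follows. This is an illustration only, not the project's code: the function name `patient_key`, the choice of SHA-256, the strip-and-lowercase normalization, and the 16-character truncation are all assumptions.

```python
import hashlib
from typing import Optional


def patient_key(mrn: Optional[str], last_name: str, first_name: str, dob: str) -> str:
    """Return a stable patient identifier (hypothetical helper).

    Prefers the MRN; falls back to (last_name, first_name, dob) when the
    MRN is missing or blank. All normalization choices here are assumed.
    """
    if mrn and mrn.strip():
        basis = "mrn:" + mrn.strip()
    else:
        parts = [last_name, first_name, dob]
        basis = "demo:" + "|".join(p.strip().lower() for p in parts)
    # Same input always yields the same key, so re-runs stay consistent.
    return hashlib.sha256(basis.encode("utf-8")).hexdigest()[:16]
```

Note that the fallback key changes if a name is spelled differently between two exports, which is one reason to review identity resolution before merging at the patient level.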

Advanced exports - both projects offer these

Patient Edition now ships starter implementations of every major advanced export the parent project offers, with the exception of drug repurposing. All are opt-in via flags or kwargs; none run by default. All are marked under construction and produce starter-quality output suitable for development and evaluation, not production registries.

| Export | Registry Forge | Patient Edition |
| --- | --- | --- |
| OMOP CDM v5.4 | Production ETL across the full schema | Starter ETL covering 7 core tables (PERSON, OBSERVATION_PERIOD, CONDITION_OCCURRENCE, DRUG_EXPOSURE, MEASUREMENT, OBSERVATION, DEVICE_EXPOSURE). Concept IDs resolved against the Athena vocabulary when supplied. |
| GA4GH Phenopackets v2 | Full per-patient JSON with resources, evidence, and provenance | One JSON per patient with subject, diseases (Mondo-mapped), medical actions (RxNorm), measurements (LOINC). phenotypicFeatures not yet populated. |
| Mondo disease ontology | Comprehensive crosswalk integrated with the OHDSI vocab stack | Starter crosswalk (~80 codes) focused on epilepsy, adjacent neurology, neurodevelopmental disorders, and common comorbidities. User-supplied override CSV supported. |
| EDA reports | ydata-profiling + sweetviz reports integrated into the multi-patient dashboard | ydata-profiling + sweetviz reports + a custom Chart.js dashboard, all with PHI-aware disclaimer banners |
| Note keyword flagging | NLP pipeline with negation detection and section context | Naive regex screening with snippet extraction. Default keyword list neurology/rare disease focused. |
| Drug repurposing | Reimer-methodology cohort-level exposure analysis | Not implemented (cohort sizes typically too small) |

What Patient Edition deliberately omits

Several Registry Forge features make less sense for patient-shared data and are intentionally absent:

No SMART-on-FHIR ingest

Patient downloads come as files. The OAuth dance, the FHIR Bundle pagination, the medicationReference resolution - none of that applies. The parser only knows how to read XML on disk.
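
As a rough illustration of what "reading XML on disk" involves, the sketch below collects C-CDA candidates from either a folder or a zip archive using only the standard library. The function name `discover_ccda` and the "any file ending in .xml" rule are assumptions for illustration; the project's actual discovery stage may filter differently.

```python
import zipfile
from pathlib import Path
from typing import Dict


def discover_ccda(source: str) -> Dict[str, bytes]:
    """Map each XML file name to its raw bytes, from a folder or a zip.

    Hypothetical helper: no network access, no parsing, just discovery.
    """
    path = Path(source)
    found: Dict[str, bytes] = {}
    if path.is_dir():
        # Walk the folder recursively; keys are paths relative to the root.
        for xml_path in sorted(path.rglob("*.xml")):
            found[str(xml_path.relative_to(path))] = xml_path.read_bytes()
    elif zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            for name in sorted(archive.namelist()):
                if name.lower().endswith(".xml"):
                    found[name] = archive.read(name)
    else:
        raise ValueError(f"{source!r} is neither a folder nor a zip archive")
    return found
```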

No drug repurposing report

The Reimer-methodology analysis Registry Forge implements requires cohort-level exposure groups (≥ N patients per ATC class) and time-to-event endpoints that aren't typically reconstructable from patient-downloaded data alone.

No five-tier QC framework

The Kahn 2016 + OHDSI DQD framework is designed for institutional pipelines with clinician spot-review and regression testing across releases. Patient Edition users typically work with one fixed bundle of files and don't have a release cadence.

No negation or context detection in note flagging

The flagger does simple case-insensitive regex matching, so a phrase such as "no history of seizures" still produces a seizure flag. Patient Edition's flagger is a screening tool only; Registry Forge's NLP layer would be a more appropriate target if you need clinical-grade extraction.
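
The limitation is easy to see in a minimal sketch of case-insensitive keyword screening with snippet extraction. The function name `flag_note`, the two-word keyword tuple, and the 30-character snippet window are assumptions made for this example; they are not the project's defaults.

```python
import re
from typing import Dict, List, Sequence

# Illustrative keywords only, not the project's default list.
KEYWORDS = ("seizure", "epilepsy")


def flag_note(text: str, keywords: Sequence[str] = KEYWORDS,
              window: int = 30) -> List[Dict[str, str]]:
    """Return one flag per keyword hit, with surrounding text as a snippet."""
    flags = []
    for keyword in keywords:
        pattern = re.compile(re.escape(keyword), re.IGNORECASE)
        for match in pattern.finditer(text):
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            flags.append({"keyword": keyword, "snippet": text[start:end]})
    return flags


# A negated mention is flagged anyway: the regex has no notion of negation.
negated = flag_note("Patient reports no history of seizures.")
```

Here `negated` contains a single `seizure` flag even though the note denies seizures, which is why the snippets are meant for human review rather than automated extraction.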

What we kept identical

  • Long-format schema for the master CSV. Every column matches Registry Forge byte-for-byte.
  • raw_record_json column on every record for reviewer drill-down.
  • Sort order: (last_name, first_name, patient_id, category, effective_date).
  • Self-contained HTML dashboard philosophy: open, no server.
  • Excel-safe encoding (UTF-8 with BOM, 32K char truncation on long cells).
  • OMOP CDM v5.4 column names byte-for-byte across the seven tables both projects emit.
  • GA4GH Phenopackets v2 structural compliance (pxf validate should pass).
  • MIT license.
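
Two of the shared conventions above, the sort order and the Excel-safe encoding, can be sketched with the standard library. This is a hedged illustration: the helper name `write_master_csv` is hypothetical, and the exact cutoff of 32,000 characters is an assumed reading of "32K" (Excel's hard per-cell limit is 32,767 characters).

```python
import csv
from typing import Dict, Iterable, List

SORT_KEYS = ("last_name", "first_name", "patient_id", "category", "effective_date")
MAX_CELL_CHARS = 32_000  # assumed value for the "32K" truncation


def write_master_csv(rows: Iterable[Dict[str, str]], fieldnames: List[str],
                     path: str) -> None:
    """Write rows in the shared sort order, as UTF-8 with BOM (hypothetical helper)."""
    ordered = sorted(rows, key=lambda r: tuple(r.get(k, "") or "" for k in SORT_KEYS))
    # "utf-8-sig" prepends the byte order mark that Excel uses to detect UTF-8.
    with open(path, "w", encoding="utf-8-sig", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for row in ordered:
            # Truncate long cells (e.g. raw_record_json) so Excel can open them.
            writer.writerow({k: str(row.get(k, "") or "")[:MAX_CELL_CHARS]
                             for k in fieldnames})
```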

Can outputs be mixed?

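Yes - that's the point. If you have an institutional cohort processed through Registry Forge and a participant-shared cohort processed through Patient Edition, you can concatenate the two patient_master.csv files and run a single analysis. The source column records each row's input format: Registry Forge tags rows as ccda or fhir; Patient Edition only ever writes ccda (or empty for patient headers). Because both projects can write ccda, that column alone does not identify which pipeline produced a row, so keep track of file origin yourself if you need it. Before merging at the patient level, account for the identity-resolution differences: Registry Forge uses EHR-supplied patient IDs, whereas Patient Edition uses a hash of the MRN or of (last_name, first_name, dob), so the same person may carry different patient_id values in the two files.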
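
A minimal sketch of the concatenation, assuming both files share an identical header as described above. Since both pipelines can write ccda in the source column, the sketch appends a `pipeline` column to record which file each row came from; that column, the function name `concat_masters`, and the label strings are hypothetical additions, not part of either project's schema.

```python
import csv
from typing import List, Optional, Tuple


def concat_masters(inputs: List[Tuple[str, str]], out_path: str) -> int:
    """Concatenate master CSVs given as (csv_path, origin_label) pairs.

    Returns the number of data rows written. Raises ValueError if a later
    file's header differs from the first file's header.
    """
    header: Optional[List[str]] = None
    writer = None
    written = 0
    with open(out_path, "w", encoding="utf-8-sig", newline="") as out:
        for path, origin in inputs:
            with open(path, encoding="utf-8-sig", newline="") as handle:
                reader = csv.DictReader(handle)
                columns = list(reader.fieldnames or [])
                if header is None:
                    header = columns
                    writer = csv.DictWriter(out, fieldnames=header + ["pipeline"])
                    writer.writeheader()
                elif columns != header:
                    raise ValueError(f"{path} columns differ from the first file")
                for row in reader:
                    row["pipeline"] = origin  # hypothetical origin column
                    writer.writerow(row)
                    written += 1
    return written
```

A call might look like `concat_masters([("forge/patient_master.csv", "registry_forge"), ("patient/patient_master.csv", "patient_edition")], "combined.csv")`, where both paths are placeholders. The result is not re-sorted, so apply the shared sort order afterwards if you depend on it.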

For OMOP CDM merging, the column schemas line up directly. Concatenate the same-named tables, run de-duplication by (person_source_value, *_source_value, *_date), and apply your own person ID reconciliation across the two sources.
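
The de-duplication step can be sketched as a first-occurrence-wins filter over a composite key. The example uses CONDITION_OCCURRENCE and assumes each row has already been joined to PERSON so that it carries person_source_value (in the CDM that column lives on PERSON, not on the event tables). The helper name `dedupe` and the sample values are illustrative.

```python
from typing import Dict, Iterable, List, Sequence


def dedupe(rows: Iterable[Dict[str, str]],
           key_columns: Sequence[str]) -> List[Dict[str, str]]:
    """Keep the first row seen for each combination of key column values."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.get(col, "") for col in key_columns)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Key for CONDITION_OCCURRENCE rows, following the
# (person_source_value, *_source_value, *_date) pattern.
CONDITION_KEY = ("person_source_value", "condition_source_value", "condition_start_date")

# Placeholder rows: the first two describe the same event from two sources.
combined = [
    {"person_source_value": "abc", "condition_source_value": "G40.909",
     "condition_start_date": "2021-03-01"},
    {"person_source_value": "abc", "condition_source_value": "G40.909",
     "condition_start_date": "2021-03-01"},
    {"person_source_value": "abc", "condition_source_value": "G40.909",
     "condition_start_date": "2022-07-15"},
]
deduped = dedupe(combined, CONDITION_KEY)
```

This only removes duplicates when person_source_value already matches across the two sources, so person ID reconciliation has to happen first for the key to be meaningful.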