How it differs from Registry Forge
A side-by-side comparison. Both projects produce the same downstream data shape; they differ in everything upstream and in what the optional advanced exports cover.
| Dimension | Registry Forge | Patient Edition |
|---|---|---|
| Primary input | Databricks chunked CSV exports, FHIR R4 Bundles, C-CDA from system-level exports | A folder or zip of patient-downloaded C-CDA XML files |
| Source environment | EHR / data warehouse, SMART-on-FHIR backend, OAuth 2.0 | Researcher's laptop or local Jupyter notebook |
| Network calls during run | Yes - fetches FHIR resources, optionally vocabulary services | None. Everything offline. |
| Identity model | EHR-supplied patient IDs, cross-format reference resolution (CCDA ↔ FHIR ↔ Notes), MRN-based dedup | Stable hash of MRN if available, else (last_name, first_name, dob). No cross-format joining (everything is C-CDA). |
| Scale assumption | Tens to thousands of patients | One to a few hundred |
| Pipeline stages | 7 core + 6 add-ons | 5 core (discovery, parsing, identity, assembly, outputs) + 5 optional (OMOP, Mondo, Phenopackets, EDA, notes flagging) |
| Output schema | Long-format patient_master.csv + multi-format JSON bundle | Same long-format patient_master.csv |
| Dashboard | Multi-patient browser dashboard with per-patient drill-down | Same concept; rebuilt to be agnostic to single- vs. multi-patient mode |
| Vocabularies parsed | SNOMED, RxNorm, LOINC, ICD-9/10, CPT, HCPCS, CVX, ATC, HPO, Mondo | SNOMED, RxNorm, LOINC, ICD-9/10, CPT, HCPCS, CVX, NDC; optional Mondo via starter crosswalk; optional OMOP concept resolution via Athena vocabulary |
| Validation | Tested against ALS TDI ARC Study production data | Tested against synthetic epilepsy demo data and real MyChart exports |
| License | MIT | MIT |
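
The Patient Edition identity rule in the table (hash the MRN when present, otherwise fall back to the name and date-of-birth triple) can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `patient_key`, the SHA-256 choice, the 16-character truncation, the optional salt, and the lowercase/strip normalization are all assumptions.

```python
import hashlib


def patient_key(mrn, last_name, first_name, dob, salt=""):
    """Return a stable hex ID for one patient (hypothetical helper).

    Uses the MRN when one is available; otherwise falls back to the
    (last_name, first_name, dob) triple, normalized so that case and
    surrounding whitespace do not change the result.
    """
    if mrn and mrn.strip():
        basis = "mrn:" + mrn.strip()
    else:
        parts = (last_name, first_name, dob)
        basis = "demo:" + "|".join(p.strip().lower() for p in parts)
    # Assumption: SHA-256 truncated to 16 hex characters.
    return hashlib.sha256((salt + basis).encode("utf-8")).hexdigest()[:16]
```

Prefixing the hashed string with `mrn:` or `demo:` keeps the two key spaces apart, so an MRN can never collide with a demographic string that happens to look the same.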
Advanced exports - both projects offer these
Patient Edition now ships starter implementations of most of the advanced exports the parent project offers; drug repurposing is the one exception. All are opt-in via flags or kwargs; none run by default. All are marked under construction and produce starter-quality output suitable for development and evaluation, not production registries.
| Export | Registry Forge | Patient Edition |
|---|---|---|
| OMOP CDM v5.4 | Production ETL across the full schema | Starter ETL covering 7 core tables (PERSON, OBSERVATION_PERIOD, CONDITION_OCCURRENCE, DRUG_EXPOSURE, MEASUREMENT, OBSERVATION, DEVICE_EXPOSURE). Concept IDs resolved against the Athena vocabulary when supplied. |
| GA4GH Phenopackets v2 | Full per-patient JSON with resources, evidence, and provenance | One JSON per patient with subject, diseases (Mondo-mapped), medical actions (RxNorm), measurements (LOINC). phenotypicFeatures not yet populated. |
| Mondo disease ontology | Comprehensive crosswalk integrated with the OHDSI vocab stack | Starter crosswalk (~80 codes) focused on epilepsy, adjacent neurology, neurodevelopmental disorders, and common comorbidities. User-supplied override CSV supported. |
| EDA reports | ydata-profiling + sweetviz reports integrated into the multi-patient dashboard | ydata-profiling + sweetviz reports + a custom Chart.js dashboard, all with PHI-aware disclaimer banners |
| Note keyword flagging | NLP pipeline with negation detection and section context | Naive regex screening with snippet extraction. The default keyword list focuses on neurology and rare disease. |
| Drug repurposing | Reimer-methodology cohort-level exposure analysis | Not implemented (cohort sizes typically too small) |
What Patient Edition deliberately omits
Several Registry Forge features make less sense for patient-shared data and are intentionally absent:
No SMART-on-FHIR ingest
Patient downloads come as files. The OAuth dance, the FHIR Bundle pagination, the medicationReference resolution - none of that applies. The parser only knows how to read XML on disk.
No drug repurposing report
The Reimer-methodology analysis Registry Forge implements requires cohort-level exposure groups (≥ N patients per ATC class) and time-to-event endpoints that aren't typically reconstructable from patient-downloaded data alone.
No five-tier QC framework
The Kahn 2016 + OHDSI DQD framework is designed for institutional pipelines with clinician spot-review and regression testing across releases. Patient Edition users typically work with one fixed bundle of files and don't have a release cadence.
No negation or context detection in note flagging
The flagger does simple case-insensitive regex matching, so a phrase such as "no history of seizures" still produces a seizure flag. Patient Edition's flagger is a screening tool only; Registry Forge's NLP layer would be a more appropriate target if you need clinical-grade extraction.
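
To make the limitation concrete, here is a minimal sketch of keyword flagging with snippet extraction. It is not the project's flagger: the `KEYWORDS` list, the `flag_note` name, and the 30-character snippet window are hypothetical choices for illustration.

```python
import re

# Hypothetical keyword list; the real default list is not shown in this document.
KEYWORDS = ["seizure", "epilepsy"]


def flag_note(text, keywords=KEYWORDS, window=30):
    """Return one flag per keyword hit, with a surrounding text snippet.

    Matching is case-insensitive and has no notion of negation or
    section context, so negated mentions are flagged like any other.
    """
    flags = []
    for kw in keywords:
        # \w* lets "seizure" also match "seizures".
        pattern = re.compile(r"\b" + re.escape(kw) + r"\w*", re.IGNORECASE)
        for m in pattern.finditer(text):
            start = max(0, m.start() - window)
            end = min(len(text), m.end() + window)
            flags.append({"keyword": kw, "snippet": text[start:end]})
    return flags
```

Calling `flag_note("Patient reports no history of seizures.")` returns a single `seizure` flag whose snippet includes the words "no history of". The negation is visible to a human reviewer reading the snippet, but the flag itself does not record it.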
What we kept identical
- Long-format schema for the master CSV. Every column matches Registry Forge byte-for-byte.
- raw_record_json column on every record for reviewer drill-down.
- Sort order: (last_name, first_name, patient_id, category, effective_date).
- Self-contained HTML dashboard philosophy: open, no server.
- Excel-safe encoding (UTF-8 with BOM, 32K char truncation on long cells).
- OMOP CDM v5.4 column names byte-for-byte across the seven tables both projects emit.
- GA4GH Phenopackets v2 structural compliance (pxf validate should pass).
- MIT license.
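
The shared sort order and Excel-safe encoding from the list above could look roughly like this when writing the master CSV. This is a sketch using only the standard library. The function names are hypothetical, and the exact truncation limit is an assumption: the list says "32K", and 32,000 is used here to stay under Excel's 32,767-character cell limit.

```python
import csv

SORT_KEYS = ("last_name", "first_name", "patient_id", "category", "effective_date")
MAX_CELL = 32_000  # assumption: the exact "32K" cutoff is not specified here


def _cell(value):
    """Render a value as text, treating None as an empty cell."""
    return "" if value is None else str(value)


def prepare_rows(rows, fieldnames):
    """Sort rows by the shared key tuple and truncate over-long cells."""
    ordered = sorted(rows, key=lambda r: tuple(_cell(r.get(k)) for k in SORT_KEYS))
    return [{k: _cell(r.get(k))[:MAX_CELL] for k in fieldnames} for r in ordered]


def write_master(rows, path, fieldnames):
    """Write rows as UTF-8 with a BOM so Excel detects the encoding."""
    # "utf-8-sig" makes Python emit the byte order mark at the start of the file.
    with open(path, "w", newline="", encoding="utf-8-sig") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(prepare_rows(rows, fieldnames))
```

Sorting on string values keeps ISO-formatted dates (YYYY-MM-DD) in chronological order; other date formats would need parsing first.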
Can outputs be mixed?
Yes - that's the point. If you have an institutional cohort processed through Registry Forge and a participant-shared cohort processed through Patient Edition, you can concatenate the two patient_master.csv files and run a single analysis. The source column distinguishes records: Registry Forge tags rows as ccda or fhir; Patient Edition only ever writes ccda (or empty for patient headers). Just be aware of identity-resolution differences before merging at the patient level.
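
A minimal sketch of that concatenation, using only the standard library. The function name and the extra `origin_pipeline` column are assumptions added for illustration, not part of either project's schema. The extra column is there because both pipelines can write `ccda` in the source column, so recording which pipeline produced each row keeps provenance unambiguous after the merge.

```python
import csv


def concat_masters(paths_by_origin, out_path, origin_col="origin_pipeline"):
    """Concatenate patient_master.csv files that share one header.

    paths_by_origin maps a label (e.g. "registry_forge") to a file path.
    Returns the number of data rows written.
    """
    header = None
    written = 0
    with open(out_path, "w", newline="", encoding="utf-8-sig") as out:
        writer = None
        for origin, path in paths_by_origin.items():
            # "utf-8-sig" reads files with or without a BOM.
            with open(path, newline="", encoding="utf-8-sig") as fh:
                reader = csv.DictReader(fh)
                if reader.fieldnames is None:
                    continue  # skip completely empty files
                if header is None:
                    header = list(reader.fieldnames)
                    writer = csv.DictWriter(out, fieldnames=header + [origin_col])
                    writer.writeheader()
                elif list(reader.fieldnames) != header:
                    raise ValueError(f"{path}: columns differ from the first file")
                for row in reader:
                    row[origin_col] = origin
                    writer.writerow(row)
                    written += 1
    return written
```

The header check fails loudly if the two files have drifted apart, which is preferable to silently misaligned columns. Patient-level reconciliation across the two identity models is deliberately left out of this sketch.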
For OMOP CDM merging, the column schemas line up directly. Concatenate the same-named tables, run de-duplication by (person_source_value, *_source_value, *_date), and apply your own person ID reconciliation across the two sources.
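
The de-duplication step might be sketched as below. The column names follow OMOP CDM v5.4, but the helper itself is hypothetical. One assumption to note: in OMOP, person_source_value lives on the PERSON table, so this sketch expects each event row to have already been joined to it.

```python
# Per-table (source value column, date column) pairs, per OMOP CDM v5.4.
DEDUP_KEYS = {
    "condition_occurrence": ("condition_source_value", "condition_start_date"),
    "drug_exposure": ("drug_source_value", "drug_exposure_start_date"),
    "measurement": ("measurement_source_value", "measurement_date"),
    "observation": ("observation_source_value", "observation_date"),
    "device_exposure": ("device_source_value", "device_exposure_start_date"),
}


def dedup_table(table, rows):
    """Drop repeated events from the concatenated rows of one OMOP table.

    rows is a list of dicts; each must carry person_source_value
    (assumed to be joined in from PERSON beforehand). The first
    occurrence of each key is kept and input order is preserved.
    """
    source_col, date_col = DEDUP_KEYS[table]
    seen = set()
    kept = []
    for row in rows:
        key = (row["person_source_value"], row[source_col], row[date_col])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept
```

Because the first occurrence wins, the order in which the two sources are concatenated decides which copy of a duplicated event survives.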