How it differs from Registry Forge

A side-by-side. Both projects produce the same downstream data shape; they differ in everything upstream and in what the optional advanced exports cover.

| Dimension | Registry Forge | Patient Edition |
| --- | --- | --- |
| Primary input | Databricks chunked CSV exports, FHIR R4 Bundles, C-CDA from system-level exports | A folder or zip of patient-downloaded C-CDA XML files |
| Source environment | EHR / data warehouse, SMART-on-FHIR backend, OAuth 2.0 | Researcher's laptop or local Jupyter notebook |
| Network calls during run | Yes - fetches FHIR resources, optionally vocabulary services | None. Everything offline. |
| Identity model | EHR-supplied patient IDs, cross-format reference resolution (CCDA ↔ FHIR ↔ Notes), MRN-based dedup | Stable hash of MRN if available, else `(last_name, first_name, dob)`. No cross-format joining (everything is C-CDA). |
| Scale assumption | Tens to thousands of patients | One to a few hundred |
| Pipeline stages | 7 core + 6 add-ons | 5 core (discovery, parsing, identity, assembly, outputs) + 5 optional (OMOP, Mondo, Phenopackets, EDA, notes flagging) |
| Output schema | Long-format `patient_master.csv` + multi-format JSON bundle | Same long-format `patient_master.csv` |
| Dashboard | Multi-patient browser dashboard with per-patient drill-down | Same concept; rebuilt to be agnostic to single- vs. multi-patient mode |
| Vocabularies parsed | SNOMED, RxNorm, LOINC, ICD-9/10, CPT, HCPCS, CVX, ATC, HPO, Mondo | SNOMED, RxNorm, LOINC, ICD-9/10, CPT, HCPCS, CVX, NDC; optional Mondo via starter crosswalk; optional OMOP concept resolution via Athena vocabulary |
| Validation | Tested against ALS TDI ARC Study production data | Tested against synthetic epilepsy demo data and real MyChart exports |
| License | MIT | MIT |
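
The Patient Edition identity rule in the table (hash the MRN when present, otherwise fall back to the name and date-of-birth triple) can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the normalization steps, the choice of SHA-256, and the 16-character truncation are all assumptions.

```python
import hashlib
import unicodedata
from typing import Optional


def _norm(text: str) -> str:
    """Normalize a fragment: Unicode NFKC, collapsed whitespace, casefolded."""
    return " ".join(unicodedata.normalize("NFKC", text).split()).casefold()


def stable_patient_id(mrn: Optional[str], last_name: str, first_name: str, dob: str) -> str:
    """Return a deterministic ID from the MRN if present, else from the demographic triple."""
    if mrn and mrn.strip():
        # Prefix keeps MRN-based and demographic-based IDs in separate namespaces.
        basis = "mrn|" + _norm(mrn)
    else:
        basis = "demo|" + "|".join(_norm(p) for p in (last_name, first_name, dob))
    # SHA-256 and the 16-character truncation are assumptions for this sketch.
    return hashlib.sha256(basis.encode("utf-8")).hexdigest()[:16]
```

Because the fallback key depends on how names and dates are spelled in each file, some normalization before hashing (as in the hypothetical `_norm` above) helps keep the ID stable across exports from the same person.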

Advanced exports - both projects offer these

Patient Edition now ships starter implementations of the parent project's major advanced exports, with the exception of drug repurposing. All are opt-in via flags or kwargs; none run by default. All are marked under construction and produce starter-quality output suitable for development and evaluation, not for production registries.

| Export | Registry Forge | Patient Edition |
| --- | --- | --- |
| OMOP CDM v5.4 | Production ETL across the full schema | Starter ETL covering 7 core tables (PERSON, OBSERVATION_PERIOD, CONDITION_OCCURRENCE, DRUG_EXPOSURE, MEASUREMENT, OBSERVATION, DEVICE_EXPOSURE). Concept IDs resolved against the Athena vocabulary when supplied. |
| GA4GH Phenopackets v2 | Full per-patient JSON with resources, evidence, and provenance | One JSON per patient with subject, diseases (Mondo-mapped), medical actions (RxNorm), measurements (LOINC). `phenotypicFeatures` not yet populated. |
| Mondo disease ontology | Comprehensive crosswalk integrated with the OHDSI vocab stack | Starter crosswalk (~80 codes) focused on epilepsy, adjacent neurology, neurodevelopmental disorders, and common comorbidities. User-supplied override CSV supported. |
| EDA reports | ydata-profiling + sweetviz reports integrated into the multi-patient dashboard | ydata-profiling + sweetviz reports + a custom Chart.js dashboard, all with PHI-aware disclaimer banners |
| Note keyword flagging | NLP pipeline with negation detection and section context | Naive regex screening with snippet extraction. Default keyword list is focused on neurology and rare disease. |
| Drug repurposing | Reimer-methodology cohort-level exposure analysis | Not implemented (cohort sizes typically too small) |

What Patient Edition deliberately omits

Several Registry Forge features make less sense for patient-shared data and are intentionally absent:

No SMART-on-FHIR ingest

Patient downloads come as files. The OAuth dance, the FHIR Bundle pagination, the `medicationReference` resolution - none of that applies. The parser only knows how to read XML on disk.

No drug repurposing report

The Reimer-methodology analysis Registry Forge implements requires cohort-level exposure groups (≥ N patients per ATC class) and time-to-event endpoints that aren't typically reconstructable from patient-downloaded data alone.

No five-tier QC framework

The Kahn 2016 + OHDSI DQD framework is designed for institutional pipelines with clinician spot-review and regression testing across releases. Patient Edition users typically work with one fixed bundle of files and don't have a release cadence.

No negation or context detection in note flagging

The flagger does simple case-insensitive regex matching, so a phrase like "no history of seizures" still produces a seizure flag. Patient Edition's flagger is a screening tool only; Registry Forge's NLP layer would be a more appropriate target if you need clinical-grade extraction.
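
To make the limitation concrete, here is a minimal sketch of naive keyword flagging with snippet extraction. The keyword list, the function name `flag_note`, and the 30-character snippet window are illustrative assumptions, not the project's defaults.

```python
import re
from typing import Dict, List

# Hypothetical keyword list; the project's default list is not reproduced here.
KEYWORDS = ["seizure", "status epilepticus"]


def flag_note(text: str, keywords: List[str] = KEYWORDS, window: int = 30) -> List[Dict[str, str]]:
    """Return one hit per keyword match, with a snippet of the surrounding text."""
    hits = []
    for kw in keywords:
        # Case-insensitive match on the keyword plus any word suffix (e.g. plurals).
        pattern = re.compile(r"\b" + re.escape(kw) + r"\w*", re.IGNORECASE)
        for m in pattern.finditer(text):
            start = max(0, m.start() - window)
            end = min(len(text), m.end() + window)
            hits.append({"keyword": kw, "match": m.group(0), "snippet": text[start:end]})
    return hits


# A negated mention is flagged exactly like a positive one.
hits = flag_note("Patient reports no history of seizures.")
```

The snippet is what lets a human reviewer catch the negation that the regex cannot, which is why the output is best treated as a list of passages to read rather than as findings.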

What we kept identical

- Long-format schema for the master CSV. Every column matches Registry Forge byte-for-byte.
- `raw_record_json` column on every record for reviewer drill-down.
- Sort order: `(last_name, first_name, patient_id, category, effective_date)`.
- Self-contained HTML dashboard philosophy: open, no server.
- Excel-safe encoding (UTF-8 with BOM, 32K char truncation on long cells).
- OMOP CDM v5.4 column names byte-for-byte across the seven tables both projects emit.
- GA4GH Phenopackets v2 structural compliance (`pxf validate` should pass).
- MIT license.
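
The sort order and Excel-safe encoding conventions above can be sketched with the standard library. This is an illustration under assumptions, not either project's writer: the function names are hypothetical, and the exact cutoff behind "32K" is assumed to be 32,000 characters (Excel's hard per-cell limit is 32,767).

```python
import csv
from typing import Dict, Iterable, List

MAX_CELL = 32_000  # assumed cutoff for "32K"; Excel's hard limit is 32,767 characters
SORT_KEYS = ("last_name", "first_name", "patient_id", "category", "effective_date")


def _cell(value) -> str:
    """Render a value as text, treating None as an empty cell."""
    return "" if value is None else str(value)


def sort_rows(rows: Iterable[Dict]) -> List[Dict]:
    """Sort rows by the shared key order, comparing values as text."""
    return sorted(rows, key=lambda r: tuple(_cell(r.get(k)) for k in SORT_KEYS))


def write_excel_safe(path: str, rows: Iterable[Dict], fieldnames: List[str]) -> None:
    """Write a sorted CSV as UTF-8 with BOM, truncating over-long cells."""
    # "utf-8-sig" prepends the BOM that lets Excel detect UTF-8.
    with open(path, "w", encoding="utf-8-sig", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        for row in sort_rows(rows):
            writer.writerow({k: _cell(row.get(k))[:MAX_CELL] for k in fieldnames})
```

Reading the file back with `encoding="utf-8-sig"` strips the BOM again, so the first column name round-trips cleanly.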

Can outputs be mixed?

Yes - that's the point. If you have an institutional cohort processed through Registry Forge and a participant-shared cohort processed through Patient Edition, you can concatenate the two `patient_master.csv` files and run a single analysis. The `source` column distinguishes records: Registry Forge tags rows as `ccda` or `fhir`; Patient Edition only ever writes `ccda` (or leaves it empty for patient headers). Before merging at the patient level, account for the identity-resolution differences: Registry Forge uses EHR-supplied patient IDs, while Patient Edition derives its ID from a hash of the MRN or of `(last_name, first_name, dob)`, so the same person may carry different IDs in the two files.
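
A minimal sketch of that concatenation, using only the standard library. The function name `concat_master_csvs` is hypothetical; the sketch assumes both files are UTF-8 with BOM and refuses to merge if the headers are not identical, since the whole premise is a matching schema.

```python
import csv
from collections import Counter
from typing import List


def concat_master_csvs(paths: List[str], out_path: str) -> Counter:
    """Concatenate master CSVs with identical headers; return row counts per `source` value."""
    header = None
    writer = None
    source_counts: Counter = Counter()
    with open(out_path, "w", encoding="utf-8-sig", newline="") as out:
        for path in paths:
            with open(path, encoding="utf-8-sig", newline="") as fh:
                reader = csv.DictReader(fh)
                if reader.fieldnames is None:
                    raise ValueError(f"{path}: file has no header row")
                if header is None:
                    # The first file defines the expected column order.
                    header = list(reader.fieldnames)
                    writer = csv.DictWriter(out, fieldnames=header)
                    writer.writeheader()
                elif list(reader.fieldnames) != header:
                    raise ValueError(f"{path}: columns differ from the first file")
                for row in reader:
                    source_counts[row.get("source", "")] += 1
                    writer.writerow(row)
    return source_counts
```

The returned counts give a quick sanity check on the mix of `ccda`, `fhir`, and empty `source` values after the merge. Note that this step only stacks rows; it does not reconcile patient IDs between the two pipelines.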

For OMOP CDM merging, the column schemas line up directly. Concatenate the same-named tables, run de-duplication by `(person_source_value, *_source_value, *_date)`, and apply your own person ID reconciliation across the two sources.
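
The de-duplication step might look like the sketch below, shown for one event table. It is illustrative only: the function name is hypothetical, rows are assumed to be dictionaries keyed by OMOP column names, and `person_source_value` is looked up from each pipeline's own PERSON table because `person_id` values from the two sources cannot be assumed to agree. It also assumes `person_source_value` has already been reconciled so that the same person carries the same value in both sources.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def dedup_events(sources: List[Tuple[List[Row], List[Row]]], source_col: str, date_col: str) -> List[Row]:
    """De-duplicate event rows across pipelines.

    `sources` holds one (person_rows, event_rows) pair per pipeline. The key is
    (person_source_value, <source_col>, <date_col>); the first occurrence wins.
    """
    seen = set()
    kept: List[Row] = []
    for person_rows, event_rows in sources:
        # Resolve person_id to person_source_value within this pipeline only.
        psv = {p["person_id"]: p["person_source_value"] for p in person_rows}
        for ev in event_rows:
            key = (psv.get(ev["person_id"]), ev[source_col], ev[date_col])
            if key in seen:
                continue
            seen.add(key)
            # Carry person_source_value along to support later person ID reconciliation;
            # it is not a CDM column on event tables and should be dropped before loading.
            kept.append({**ev, "person_source_value": key[0]})
    return kept
```

For CONDITION_OCCURRENCE, for example, the call would pass `source_col="condition_source_value"` and `date_col="condition_start_date"`. The surviving rows still carry their original `person_id`, so the person ID reconciliation mentioned above remains a separate step.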