Registry Forge - Patient Edition

From a folder of patient-downloaded health records to a research-ready data set.

Open-source · Local-first · No servers required

How Registry Forge Patient Edition works: patients and advocacy-group participants download health records from their patient portal and send the zip file to a researcher, who runs the open-source notebook to produce a long-format CSV, a searchable dashboard, an ML-ready feature matrix, and a parse log. All processing is local; no PHI leaves the researcher's machine.


Built for two audiences

If you're a patient, caregiver, or family member

You can already download your full health record from MyChart, Epic patient portal, or any Blue Button-style export. That download is a folder of XML files - comprehensive, but not easy to read or share usefully.

What you can do with Registry Forge - Patient Edition:

  1. Download your records from your patient portal.
  2. Zip the folder.
  3. If you choose to share it with a researcher - and the study's IRB protocol, data-use agreement, and your personal preferences allow it - send them the zip.
  4. Their analyst can run this tool and turn the zip into a structured, searchable record set in minutes. No vendor, no EHR integration, no API keys.

You stay in control of which records leave your possession. Once shared, the researcher can search every diagnosis, medication, lab, procedure, and note in your record set the same way an institutional researcher would search a clinical registry.

If you run a patient advocacy group, registry, or natural history study

You no longer need a third-party data-aggregation vendor in the loop for participant-mediated data acquisition.

The workflow becomes:

  1. (With IRB approval) ask participants to export their records from their patient portal and send the zip to your study team.
  2. Your analyst points this notebook at the folder of zips.
  3. Out comes a long-format CSV across every participant, a searchable dashboard, and an ML-ready feature matrix - the same shape any institutional registry produces.

For rare-disease, pediatric, or geographically dispersed cohorts where centralized EHR access isn't feasible, this is a direct path from patient consent to analyzable data.


A note on privacy. Patient-downloaded records contain identifiable PHI (names, MRNs, dates of birth, free-text clinical notes). Treat the outputs accordingly. The tool runs entirely locally and makes no network calls with patient data, but the researcher receiving the zip is responsible for handling it under HIPAA, their IRB, and any applicable data-use agreement. See Privacy & PHI before processing real records.


What the tool actually does

This project is a companion to Registry Forge, reusing its long-format schema, dashboard concept, and research-ready output design - but optimized for a different input shape.

Where Registry Forge is built around EHR-side extracts (Databricks chunked CSVs, FHIR Bundle pulls via SMART on FHIR + OAuth), Patient Edition is built around the data shape researchers actually see when participants share their own records: a zipped folder of C-CDA XML files downloaded from MyChart, Epic patient portal, or any Blue Button-style consumer health record export.

Why a separate tool?

Patient-mediated data exchange is a distinct workflow with distinct constraints:

  • No EHR access. The researcher never authenticates to the source system. Documents arrive as files on a hard drive.
  • No vendor-supplied identifiers. Patients are identified by name and DOB across documents; the MRN, when present, only makes sense inside the originating health system.
  • Repeated exports. A patient who downloads their record once a year for five years gives you five overlapping C-CDA documents, not five clean snapshots.
  • Variable completeness. Some patient portals export rich FHIR resources behind the scenes and render them as structured C-CDA entries; others ship narrative-only documents.
  • Single-patient deep-dives are common. A researcher studying a rare disease n-of-1 case wants the same tooling as one running a 200-patient cohort.
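
The repeated-exports constraint means the same diagnosis, medication, or lab result can show up in several documents. One way to collapse those overlaps is sketched below. This is only an illustration, not the tool's actual logic: the field names (`patient_id`, `category`, `code`, `date`, `value`, `source_file`) and the choice of matching key are assumptions made for the example.

```python
def dedupe_records(records):
    """Keep the first occurrence of each clinically identical record.

    Two records count as the same fact when patient, category, code,
    date, and value all match. The source file is ignored on purpose,
    because overlapping exports repeat the same fact in different files.
    """
    key_fields = ("patient_id", "category", "code", "date", "value")  # hypothetical schema
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Two yearly exports that overlap on the 2021 lab result.
exports = [
    {"patient_id": "p1", "category": "lab", "code": "2345-7",
     "date": "2021-03-01", "value": "98", "source_file": "export_2021.xml"},
    {"patient_id": "p1", "category": "lab", "code": "2345-7",
     "date": "2021-03-01", "value": "98", "source_file": "export_2022.xml"},
    {"patient_id": "p1", "category": "lab", "code": "2345-7",
     "date": "2022-03-01", "value": "101", "source_file": "export_2022.xml"},
]
deduped = dedupe_records(exports)
```

Keeping the first occurrence preserves the earliest source file for each fact; a real pipeline might instead prefer the most recent export, which is a design choice this sketch does not settle.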

Registry Forge Patient Edition gives you the same outputs Registry Forge does - a long-format patient master CSV, a searchable HTML dashboard, an ML-ready feature matrix - starting from a folder of XML files.

The parser is tested against Epic MyChart C-CDA 2.1 exports and handles the Epic-specific patterns researchers typically encounter: concern-act wrappers around problems and allergies, narrative anchor references (where Epic puts diagnosis names in a section's narrative table and the structured entries reference them), BATTERY organizers for lab panels, and the dozen-plus LOINC section codes Epic actually emits.
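
To make the narrative anchor pattern concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It is not the project's parser: the function name `parse_problem_rows`, the output keys, and the heavily simplified sample document are assumptions for illustration. The sample shows a concern act wrapping an observation whose `originalText` reference points at an `ID` in the section's narrative table.

```python
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}  # C-CDA elements live in the HL7 v3 namespace

# Simplified, hand-written sample; real exports carry many more elements.
SAMPLE = """<ClinicalDocument xmlns="urn:hl7-org:v3">
  <component><structuredBody><component>
    <section>
      <code code="11450-4" codeSystem="2.16.840.1.113883.6.1"/>
      <text>
        <table><tbody><tr><td ID="problem1">Essential hypertension</td></tr></tbody></table>
      </text>
      <entry>
        <act classCode="ACT" moodCode="EVN">
          <entryRelationship typeCode="SUBJ">
            <observation classCode="OBS" moodCode="EVN">
              <value code="59621000">
                <originalText><reference value="#problem1"/></originalText>
              </value>
            </observation>
          </entryRelationship>
        </act>
      </entry>
    </section>
  </component></structuredBody></component>
</ClinicalDocument>"""


def parse_problem_rows(xml_text):
    """Return one row per observation, with names resolved from the narrative."""
    root = ET.fromstring(xml_text)
    rows = []
    for section in root.iterfind(".//hl7:section", NS):
        code_el = section.find("hl7:code", NS)
        section_code = code_el.get("code") if code_el is not None else None

        # Index every narrative element that carries an ID attribute.
        anchors = {}
        narrative = section.find("hl7:text", NS)
        if narrative is not None:
            for el in narrative.iter():
                if "ID" in el.attrib:
                    anchors[el.attrib["ID"]] = "".join(el.itertext()).strip()

        # Observations may sit inside a concern-act wrapper, so search at any depth.
        for obs in section.iterfind(".//hl7:observation", NS):
            value = obs.find("hl7:value", NS)
            if value is None:
                continue
            display = value.get("displayName")
            ref = value.find("hl7:originalText/hl7:reference", NS)
            if ref is not None:
                display = anchors.get(ref.get("value", "").lstrip("#"), display)
            rows.append({"section": section_code,
                         "code": value.get("code"),
                         "display": display})
    return rows


rows = parse_problem_rows(SAMPLE)
```

The sketch falls back to the `displayName` attribute when no anchor resolves, so structured entries that carry their own label still produce a readable row.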

Pipeline at a glance

Folder of MyChart / Epic / Blue Button C-CDA XML downloads (or a .zip)
                              |
                              v
+----------------------------------------------------------+
|  Stage 1   Folder discovery       (recursive walk, zip)   |
|  Stage 2   C-CDA parsing          (header + 13 sections)  |
|  Stage 3   Identity resolution    (MRN | name+DOB hash)   |
|  Stage 4   Long-format assembly   (RegistryForge schema)  |
|  Stage 5   Dashboard + features   (single HTML, CSV)      |
+----------------------------------------------------------+
                              |
            +-----------------+-----------------+
            v                 v                 v
   patient_master.csv   dashboard.html    patient_features.csv
   (long format,        (offline,         (one row per patient,
    one row/record)      searchable)       ready for ML)
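
Stage 1 can be pictured with a short standard-library sketch. The function name `discover_xml` and the `archive!member` labelling are assumptions for illustration, not the notebook's actual interface; the sketch only shows how a recursive walk and zip handling might be combined.

```python
import zipfile
from pathlib import Path


def discover_xml(source):
    """Yield (label, xml_bytes) for every .xml under a folder or inside a .zip."""
    source = Path(source)

    # A zip is read in place, so nothing is extracted to disk.
    if source.is_file() and zipfile.is_zipfile(source):
        with zipfile.ZipFile(source) as archive:
            for name in sorted(archive.namelist()):
                if name.lower().endswith(".xml"):
                    yield f"{source.name}!{name}", archive.read(name)
        return

    # Otherwise walk the folder recursively, descending into any zips found.
    for path in sorted(source.rglob("*")):
        if not path.is_file():
            continue
        if path.suffix.lower() == ".xml":
            yield str(path), path.read_bytes()
        elif zipfile.is_zipfile(path):
            yield from discover_xml(path)
```

Calling `list(discover_xml("path/to/downloads"))` on a folder that mixes loose XML files and participant zips would return every document in a stable, sorted order, which keeps later stages reproducible.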

What you get

  • patient_master.csv - every record from every document, with demographics joined on the front. Same schema as Registry Forge's master CSV, so any downstream tool that consumes Registry Forge outputs consumes these too.
  • dashboard.html - a single self-contained HTML file. Open it in any browser for client-side filtering, sorting, and global search across the cohort or a single patient. No server. No build step.
  • patient_features.csv - patient × feature matrix with demographics, category counts, top-K diagnosis/medication flags, and lab-value summaries. Ready to feed into a clustering or classification notebook.
  • parse_log.csv - one row per source file, with the patient it resolved to and the record count it contributed. This gives file-level traceability for the parse.
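
The relationship between the long-format master file and the feature matrix can be sketched in a few lines. This is a simplified illustration covering only the category counts; the column names (`patient_id`, `category`) and the `n_` prefix are assumptions, and the real matrix also carries demographics, top-K flags, and lab summaries.

```python
from collections import Counter, defaultdict


def category_counts(long_rows):
    """Pivot long-format rows into one dict per patient with a count per category."""
    counts = defaultdict(Counter)
    for row in long_rows:
        counts[row["patient_id"]][row["category"]] += 1

    # Every patient gets every category column, with 0 where nothing was recorded.
    categories = sorted({cat for per_patient in counts.values() for cat in per_patient})
    return [
        {"patient_id": pid, **{f"n_{cat}": counts[pid][cat] for cat in categories}}
        for pid in sorted(counts)
    ]


long_rows = [
    {"patient_id": "p1", "category": "diagnosis"},
    {"patient_id": "p1", "category": "lab"},
    {"patient_id": "p1", "category": "lab"},
    {"patient_id": "p2", "category": "medication"},
]
features = category_counts(long_rows)
```

Filling absent categories with zero keeps the matrix rectangular, which most clustering and classification code expects.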

Auto-detection: single-patient or multi-patient

The notebook decides based on identity matching across the input documents:

  • Single-patient mode. All files resolve to the same person. Outputs are a longitudinal record assembly across every document. The dashboard becomes a single-patient timeline; the feature matrix has one row.
  • Multi-patient mode. Documents resolve to different people. Outputs are a cohort. The dashboard adds a patient filter; the feature matrix has one row per person and is suitable for clustering and supervised modeling.

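You don't have to declare which mode you're in. The parser walks every file, computes a stable identity hash for each document, and counts the distinct identities: one means single-patient mode, more than one means multi-patient mode.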
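
A minimal sketch of how identity hashing and mode detection might fit together is shown below. It covers only the name-plus-DOB path, not the MRN path; the function names, the normalization rules, and the truncated SHA-256 key are assumptions for illustration rather than the notebook's actual implementation.

```python
import hashlib


def identity_key(family, given, dob):
    """Hash a normalized name + date of birth into a short, stable key."""
    normalized = "|".join(part.strip().lower() for part in (family, given, dob))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


def detect_mode(documents):
    """Return 'single-patient' or 'multi-patient' from the distinct identities."""
    if not documents:
        raise ValueError("no documents to inspect")
    keys = {identity_key(d["family"], d["given"], d["dob"]) for d in documents}
    return "single-patient" if len(keys) == 1 else "multi-patient"


# Hypothetical header fields pulled from three documents.
docs = [
    {"family": "Rivera", "given": "Ana", "dob": "1980-02-14"},
    {"family": " rivera ", "given": "ANA", "dob": "1980-02-14"},
]
mode_one = detect_mode(docs)
mode_many = detect_mode(docs + [{"family": "Chen", "given": "Li", "dob": "1975-09-30"}])
```

Normalizing case and whitespace before hashing lets repeated exports of the same person collapse to one key, although spelling variants or changed names would still need handling that this sketch leaves out.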

Built for researchers, not for vendors

The intended user is a research analyst who has been handed a USB stick or a zip file by a participant or a partner site, and who needs to turn that into a queryable data set without standing up a server, an EHR connection, or a vendor account. Everything runs in a local Python environment with two dependencies (pandas, numpy).

A note on privacy

Patient-downloaded health records are still PHI. The notebook runs entirely on your machine and never makes a network call with patient data. Do not commit the outputs to a public repository. See Privacy & PHI for the full discussion.

Status & funding

Registry Forge Patient Edition is an independent companion to Registry Forge. It is not affiliated with, endorsed by, or sponsored by the original Registry Forge authors, the ALS Therapy Development Institute, the CDC, or any patient-portal vendor. It exists to give researchers working with patient-shared data a starting point with the same shape as the Registry Forge ecosystem.

Registry Forge itself was developed by Danielle Boyce, MPH, DPA, at the ALS Therapy Development Institute, supported by CDC grant # R01-TS000341. See About & license for full attribution.