Privacy & PHI

Patient-downloaded health records are still PHI under HIPAA when handled by a researcher. The participant's act of downloading their own data and handing it to you doesn't strip the identifiers - and the outputs of this pipeline preserve those identifiers on purpose, because researchers need them.

What stays identifiable

The default outputs contain:

Names - first_name, last_name on every row
Medical record numbers - mrn when the source document includes one
Dates of birth - dob on every row, full precision
Event dates - effective_date, end_date at day-level precision
Custodian organization names - visible in the document header parse
Source filenames - source_file may itself encode identifying information depending on how the patient or vendor named the file
Free-text narratives - section narratives can contain names, addresses, phone numbers, or other identifiers the patient's clinicians wrote into the record
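As a quick check, the sketch below reports which of the identifier columns named above are present in a given output file. It assumes the master CSV has a header row using exactly those column names; the function name `identifier_columns_present` is hypothetical and not part of the pipeline.

```python
import csv
from pathlib import Path

# Column names taken from the list above.
IDENTIFIER_COLUMNS = [
    "first_name", "last_name", "mrn", "dob",
    "effective_date", "end_date", "source_file",
]

def identifier_columns_present(csv_path):
    """Return the identifier columns that appear in the CSV header."""
    with Path(csv_path).open(newline="", encoding="utf-8") as fh:
        header = next(csv.reader(fh), [])  # empty file -> empty header
    return [col for col in IDENTIFIER_COLUMNS if col in header]
```

A header check like this says nothing about identifiers buried in free-text narratives; those need a separate review.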

These are kept on purpose. The researcher's analysis depends on at least some of them.

What the pipeline does not do

No PHI scrubbing. The notebook will not redact names, dates, or other identifiers. It is not Safe Harbor de-identification.
No network calls with patient data. The parser runs entirely locally. No vocabulary lookup against external services, no FHIR resource fetching, no telemetry.
No logging beyond parse_log.csv. That file is written to the same OUTPUT_DIR as the master CSV and is not transmitted anywhere.

Where the data is during a run

Two scenarios depending on how you run the notebook:

Local Python

Files stay on your machine.
Outputs land wherever OUTPUT_DIR points.
Nothing leaves the machine unless you explicitly move it.

Recommendations

Concrete steps to reduce risk:

Before running

Confirm IRB / DUA coverage. Patient-mediated data exchange is a relatively new category for many IRBs. Make sure your protocol explicitly covers receiving and analyzing patient-downloaded records.
Choose your runtime deliberately. Local Python on an institutional laptop with full-disk encryption is the lowest-risk default for handling PHI.
Use a clean working directory. Don't put PHI in a directory that's synced to a personal Dropbox / iCloud / OneDrive account.
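The working-directory advice can be backed by a rough guard at the top of a notebook. This is a heuristic sketch only: the marker list is an illustrative assumption about how sync clients commonly name their folders, and `looks_synced` is a hypothetical helper, not something the pipeline provides.

```python
from pathlib import Path

# Folder-name fragments commonly used by sync clients.
# Illustrative and not exhaustive; adjust for your environment.
SYNC_MARKERS = ("dropbox", "onedrive", "icloud", "mobile documents")

def looks_synced(path):
    """Heuristic: True if any path component resembles a sync-folder name."""
    parts = [part.lower() for part in Path(path).expanduser().parts]
    return any(marker in part for part in parts for marker in SYNC_MARKERS)
```

A False result is not proof that a directory is unsynced, since sync clients can be pointed at arbitrary folders; treat it as a reminder rather than a control.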

After running

Treat outputs as PHI. patient_master.csv and dashboard.html contain everything the source documents contained, in more convenient form. They are not safer than the source.
Do not commit outputs to a public Git repository. This sounds obvious. It happens anyway. Add out/, *.csv, and *.html (scoped to your output paths) to .gitignore from day one.
Strip before sharing. If you need to share a dashboard with a collaborator who doesn't need PHI, write a post-processing step that:
- Replaces last_name / first_name with PT-XXXX shorthand
- Year-bands dates (effective_date → effective_year)
- Truncates free text or replaces it with code-only display
- Drops raw_record_json
Audit the dashboard before sharing. Open it in a text editor and search for last names. The embedded JSON in <script type="application/json"> is the easiest place to spot a PHI leak you missed.
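The manual audit can be complemented by a scripted pass. The sketch below assumes the master CSV has a `last_name` column, as described earlier, and simply checks whether any of those values still occur anywhere in the HTML text; `find_name_leaks` is a hypothetical helper.

```python
import csv
from pathlib import Path

def find_name_leaks(html_path, master_csv_path, column="last_name"):
    """Return values from `column` that still appear in the HTML text."""
    with Path(master_csv_path).open(newline="", encoding="utf-8") as fh:
        names = {row[column].strip() for row in csv.DictReader(fh) if row.get(column)}
    html = Path(html_path).read_text(encoding="utf-8", errors="replace").lower()
    # Case-insensitive substring match over the whole file, embedded JSON included.
    return sorted(name for name in names if name and name.lower() in html)
```

Substring matching is crude: short surnames will produce false positives, and names stored in an escaped or encoded form will be missed. An empty result does not replace reading the file yourself.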

Patient Edition does not yet ship a privacy-safe analog of Registry Forge's cohort EDA report, the kind of artifact you can send to a colleague outside the clinical firewall. A pseudonymizing post-processor for the master CSV is on the roadmap; contributions welcome.
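Until that exists, a hand-rolled version of the strip-before-sharing steps might look like the sketch below. It is not the roadmap post-processor. It makes several assumptions: dates are ISO strings (YYYY-MM-DD), the free-text column is called `narrative` (a hypothetical name), patients are distinguished by name plus `dob`, and free text is dropped outright as a conservative stand-in for truncation. `patient_code`, `strip_rows`, and `strip_csv` are illustrative names.

```python
import csv

def strip_rows(rows, free_text_columns=("narrative",)):
    """Return stripped copies of master-CSV rows (dicts); inputs are not modified."""
    codes = {}  # (first_name, last_name, dob) -> "PT-0001", "PT-0002", ...
    stripped = []
    for row in rows:
        key = (row.get("first_name"), row.get("last_name"), row.get("dob"))
        code = codes.setdefault(key, f"PT-{len(codes) + 1:04d}")
        out = {"patient_code": code}
        for col, value in row.items():
            if col in ("first_name", "last_name", "raw_record_json"):
                continue  # names replaced by the code; raw JSON dropped
            if col in free_text_columns:
                continue  # free text dropped rather than truncated
            if col in ("effective_date", "end_date"):
                # Year-band: assumes ISO dates; blank values stay blank.
                out[col.replace("_date", "_year")] = (value or "")[:4]
                continue
            out[col] = value
        stripped.append(out)
    return stripped

def strip_csv(src, dst):
    """Read the master CSV at `src` and write a stripped copy to `dst`."""
    with open(src, newline="", encoding="utf-8") as fh:
        rows = strip_rows(list(csv.DictReader(fh)))
    if rows:
        with open(dst, "w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
```

Note that `mrn`, `dob`, and `source_file` pass through unchanged here, because the steps above do not say how to treat them. They are identifiers too and need their own decision before anything is shared.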

What to delete when the project ends

A clean tear-down:

out/
├── patient_master.csv               ← delete
├── patient_features.csv             ← delete (it's derived but contains demographics)
├── dashboard.html                   ← delete (embedded data)
├── parse_log.csv                    ← delete (filenames + IDs)
└── registry_forge_patient_bundle.zip ← delete

And the input folder, if your DUA requires it.
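The tear-down can be scripted so nothing is forgotten. This sketch assumes the outputs carry exactly the file names shown in the tree above and sit directly in the output directory; `tear_down` is a hypothetical helper, and it defaults to a dry run so you can see what would be removed first.

```python
from pathlib import Path

# File names taken from the tree above.
OUTPUT_FILES = [
    "patient_master.csv",
    "patient_features.csv",
    "dashboard.html",
    "parse_log.csv",
    "registry_forge_patient_bundle.zip",
]

def tear_down(output_dir, dry_run=True):
    """Delete the known output files; return the paths removed (or that would be)."""
    removed = []
    for name in OUTPUT_FILES:
        path = Path(output_dir) / name
        if path.is_file():
            if not dry_run:
                path.unlink()
            removed.append(path)
    return removed
```

Call `tear_down("out")` to preview and `tear_down("out", dry_run=False)` to delete. Ordinary file deletion is not secure erasure, and copies in backups or shared drives are untouched, so check what your DUA and institution require.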

What this guidance is not

It is not legal advice. It is not a substitute for your institution's privacy officer. When in doubt, ask them before running anything against real patient records.