Privacy & PHI
Patient-downloaded health records are still PHI under HIPAA when handled by a researcher. The participant's act of downloading their own data and handing it to you doesn't strip the identifiers - and the outputs of this pipeline preserve those identifiers on purpose, because researchers need them.
What stays identifiable
The default outputs contain:
- Names - `first_name`, `last_name` on every row
- Medical record numbers - `mrn` when the source document includes one
- Dates of birth - `dob` on every row, full precision
- Event dates - `effective_date`, `end_date` at day-level precision
- Custodian organization names - visible in the document header parse
- Source filenames - `source_file` may itself encode identifying information depending on how the patient or vendor named the file
- Free-text narratives - section narratives can contain names, addresses, phone numbers, or other identifiers the patient's clinicians wrote into the record
These are kept on purpose. The researcher's analysis depends on at least some of them.
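To confirm which of these fields a particular output actually carries, a quick header check can help. The following is a minimal sketch, assuming the master CSV is UTF-8 with a header row and uses the column names listed above; the function name `identifying_columns_present` is hypothetical and not part of the pipeline.

```python
import csv
from pathlib import Path

# Column names taken from the list above; the real file may carry more.
IDENTIFYING_COLUMNS = [
    "first_name", "last_name", "mrn", "dob",
    "effective_date", "end_date", "source_file",
]


def identifying_columns_present(csv_path):
    """Return the identifying columns that appear in the CSV header."""
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    return [name for name in IDENTIFYING_COLUMNS if name in header]
```

This only inspects column names. It says nothing about identifiers embedded in free-text narratives or in the values of `source_file`.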
What the pipeline does not do
- No PHI scrubbing. The notebook will not redact names, dates, or other identifiers. It is not Safe Harbor de-identification.
- No network calls with patient data. The parser runs entirely locally. No vocabulary lookup against external services, no FHIR resource fetching, no telemetry.
- No logging beyond `parse_log.csv`. And that file is written to the same `OUTPUT_DIR` as the master CSV; it is not transmitted anywhere.
Where the data is during a run
Two scenarios depending on how you run the notebook:
Local Python
- Files stay on your machine.
- Outputs land wherever `OUTPUT_DIR` points.
- Nothing leaves the machine unless you explicitly move it.
Recommendations
Concrete steps to reduce risk:
Before running
- Confirm IRB / DUA coverage. Patient-mediated data exchange is a relatively new category for many IRBs. Make sure your protocol explicitly covers receiving and analyzing patient-downloaded records.
- Choose your runtime deliberately. Local Python on an institutional laptop with full-disk encryption is the lowest-risk default for handling PHI.
- Use a clean working directory. Don't put PHI in a directory that's synced to a personal Dropbox / iCloud / OneDrive account.
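The working-directory check can be partly automated before a run. The sketch below is a name-based heuristic only: `SYNC_MARKERS` and `synced_ancestors` are hypothetical names introduced for illustration, and the marker list simply mirrors the three services mentioned above.

```python
from pathlib import Path

# Folder-name fragments that suggest a consumer sync client.
# Assumption for illustration; extend this for your own environment.
SYNC_MARKERS = ("dropbox", "icloud", "onedrive")


def synced_ancestors(directory):
    """Return the path components whose names look like a sync-client folder."""
    resolved = Path(directory).expanduser().resolve()
    return [
        part for part in resolved.parts
        if any(marker in part.lower() for marker in SYNC_MARKERS)
    ]
```

An empty result is not proof that the directory is unsynced. A name check cannot see a sync client that mirrors an ordinarily named folder, so treat a non-empty result as a reason to stop rather than an empty one as clearance.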
After running
- Treat outputs as PHI. `patient_master.csv` and `dashboard.html` contain everything the source documents contained, in more convenient form. They are not safer than the source.
- Do not commit outputs to a public Git repository. This sounds obvious. It happens anyway. Add `out/` and `*.csv` and `*.html` (in output paths) to `.gitignore` from day one.
- Strip before sharing. If you need to share a dashboard with a collaborator who doesn't need PHI, write a post-processing step that:
  - Replaces `last_name` / `first_name` with `PT-XXXX` shorthand
  - Year-bands dates (`effective_date` → `effective_year`)
  - Truncates free text or replaces it with code-only display
  - Drops `raw_record_json`
- Audit the dashboard before sharing. Open it in a text editor and search for last names. The embedded JSON in `<script type="application/json">` is the easiest place to spot a PHI leak you missed.
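The audit step can be scripted so it is repeatable. This is a minimal sketch, assuming the master CSV has a `last_name` column and both files are UTF-8; `load_last_names` and `find_name_leaks` are hypothetical helpers, not functions the pipeline provides.

```python
import csv
from pathlib import Path


def load_last_names(master_csv):
    """Collect the distinct non-empty last_name values from the master CSV."""
    with Path(master_csv).open(newline="", encoding="utf-8") as handle:
        names = {
            (row.get("last_name") or "").strip()
            for row in csv.DictReader(handle)
        }
    return names - {""}


def find_name_leaks(html_path, last_names):
    """Return (line_number, name) pairs where a last name appears in the file."""
    text = Path(html_path).read_text(encoding="utf-8", errors="replace")
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        for name in sorted(last_names):
            if name.lower() in lowered:
                hits.append((line_number, name))
    return hits
```

Matching is a case-insensitive substring search, so short surnames will produce false positives, and a clean result covers last names only. First names, MRNs, dates, and free-text identifiers need their own checks.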
For sharing externally
Patient Edition does not yet ship a privacy-safe analog of Registry Forge's cohort EDA report, the kind of artifact you can send to a colleague outside the clinical firewall. A pseudonymizing post-processor for the master CSV is on the roadmap; contributions are welcome.
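Until that ships, the four "Strip before sharing" steps can be sketched by hand. What follows is an illustration under stated assumptions, not the planned post-processor. It assumes dates are ISO `YYYY-MM-DD` strings, uses a hypothetical `patient_label` output column, assumes a hypothetical free-text column named `narrative`, and drops free text outright as a stricter stand-in for truncation.

```python
import csv
from pathlib import Path

NAME_COLUMNS = ("last_name", "first_name")
YEAR_BANDS = {"effective_date": "effective_year"}
# raw_record_json comes from the steps above; "narrative" is a
# hypothetical free-text column name used only for illustration.
DROP_COLUMNS = ("raw_record_json", "narrative")


def pseudonymize_rows(rows):
    """Yield copies of rows with names replaced, dates year-banded, columns dropped."""
    labels = {}  # (last_name, first_name) -> "PT-0001", in order of first appearance
    for row in rows:
        key = tuple((row.get(col) or "").strip().lower() for col in NAME_COLUMNS)
        label = labels.setdefault(key, f"PT-{len(labels) + 1:04d}")
        out = {"patient_label": label}
        for column, value in row.items():
            if column in NAME_COLUMNS or column in DROP_COLUMNS:
                continue
            if column in YEAR_BANDS:
                # Assumes ISO dates, so the year is the first four characters.
                year = (value or "")[:4]
                out[YEAR_BANDS[column]] = year if year.isdigit() else ""
            else:
                out[column] = value
        yield out


def pseudonymize_csv(source_path, target_path):
    """Write a pseudonymized copy of source_path; return the number of rows written."""
    with Path(source_path).open(newline="", encoding="utf-8") as src:
        rows = list(pseudonymize_rows(csv.DictReader(src)))
    if not rows:
        return 0
    with Path(target_path).open("w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Columns such as `mrn`, `dob`, `end_date`, and `source_file` pass through untouched here; add them to `DROP_COLUMNS` or handle them separately before sharing. The result is pseudonymized, not de-identified, and two patients who share a first and last name would receive the same label.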
What to delete when the project ends
A clean tear-down:
out/
├── patient_master.csv ← delete
├── patient_features.csv ← delete (it's derived but contains demographics)
├── dashboard.html ← delete (embedded data)
├── parse_log.csv ← delete (filenames + IDs)
└── registry_forge_patient_bundle.zip ← delete
And the input folder, if your DUA requires it.
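The tear-down can be scripted so that no file is forgotten. This is a minimal sketch, assuming the outputs sit directly in the output directory under the file names listed above; `tear_down` is a hypothetical helper, and it defaults to a dry run so that nothing is deleted by accident.

```python
from pathlib import Path

# File names from the tear-down listing above.
OUTPUT_FILES = (
    "patient_master.csv",
    "patient_features.csv",
    "dashboard.html",
    "parse_log.csv",
    "registry_forge_patient_bundle.zip",
)


def tear_down(output_dir, dry_run=True):
    """Delete the known output files; return the paths removed (or that would be)."""
    removed = []
    for name in OUTPUT_FILES:
        path = Path(output_dir) / name
        if path.is_file():
            if not dry_run:
                path.unlink()
            removed.append(path)
    return removed
```

Call `tear_down("out")` first to review the list, then `tear_down("out", dry_run=False)` to delete. Unlinking a file does not securely erase it, and copies in backups, sync folders, or email are out of this script's reach; follow your institution's disposal policy for those.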
What this guidance is not
It is not legal advice. It is not a substitute for your institution's privacy officer. When in doubt, ask them before running anything against real patient records.