Patient identifier handling in Registry Forge outputs
EHR-vendor-issued patient IDs, medical record numbers (MRNs), and FHIR resource IDs derived from those source identifiers are direct identifiers under HIPAA Safe Harbor §164.514(b)(2)(i)(H). This page documents where they appear in each output Registry Forge produces, and what to do about it.
Audit by output
| Output file | Contains raw patient_id? | Default behavior | Notes |
|---|---|---|---|
| dashboard_data.json | Yes | PHI by design | Full bundle that the dashboard loads. Internal use only. |
| dashboard.html | No identifiers in the HTML itself | Safe to share the HTML file alone | The HTML file is just the viewer; it only exposes PHI when paired with dashboard_data.json |
| patient_master.csv | Yes | PHI | One row per patient with demographics. Internal use only. |
| code_inventory.csv | No | Privacy-safe | Aggregate counts only, no per-patient data |
| cohort_eda_report.html | No (uses PT-NNNN pseudonyms) | Privacy-safe by design | Six layered privacy transforms; safe to share with any colleague |
| omop_output/PERSON.csv | Yes, in the person_source_value column | PHI | The OMOP spec stores the source identifier here. The integer person_id is internal and not directly identifying, but downstream OMOP tables link by person_id |
| omop_output/*.csv (other tables) | No raw patient IDs, only the integer person_id | PHI through linkage to PERSON.csv | Strip or pseudonymize person_source_value in PERSON.csv before sharing |
| phenopackets_output/*.json | Yes, in subject.id | PHI | The Phenopackets spec uses subject.id as the canonical identifier. Pseudonymize before submitting to Matchmaker Exchange or sharing |
| drug_repurposing_cohort.csv | No by default (PT-NNNN); raw if pseudonymize=False | Privacy-safe by default since v2026-05 | New pseudonymize=True default. Set False only for internal chart-review linkage |
| drug_repurposing_summary.csv | No | Privacy-safe | Aggregate per-medication, k-anonymity applied |
| cure_id_intake.csv | No (uses PT-NNNN pseudonyms) | Privacy-safe by design | Intended for FDA / NCATS submission, must be PII-free |
| drug_repurposing_report.html | No | Privacy-safe | No per-patient IDs in the rendered HTML |
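The audit can also be encoded as a lookup that a sharing script consults before copying files out of a PHI-controlled location. The following is a minimal sketch and not part of Registry Forge; `SHAREABLE` and `blocked_for_sharing` are hypothetical names. It is deliberately conservative: any file not marked privacy-safe in the table is treated as PHI, and drug_repurposing_cohort.csv is left off the list because it is only safe when pseudonymize=True was in effect.

```python
from pathlib import PurePosixPath

# Files the audit table marks as privacy-safe regardless of settings.
# (Hypothetical allow-list; keep it in sync with the table by hand.)
SHAREABLE = {
    "dashboard.html",  # viewer only; the PHI lives in dashboard_data.json
    "code_inventory.csv",
    "cohort_eda_report.html",
    "drug_repurposing_summary.csv",
    "cure_id_intake.csv",
    "drug_repurposing_report.html",
}


def blocked_for_sharing(paths):
    """Return the paths whose file name is not on the privacy-safe list."""
    return [p for p in paths if PurePosixPath(p).name not in SHAREABLE]


if __name__ == "__main__":
    candidates = ["code_inventory.csv", "omop_output/PERSON.csv"]
    print(blocked_for_sharing(candidates))
```

An allow-list fails closed: a new output file is blocked until someone audits it and adds it.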
What to do before sharing
For the dashboard: share dashboard.html and dashboard_data.json only with people who have ARC data access under your IRB protocol. There is no PHI-safe version of the dashboard data — the per-patient view is the point.
For OMOP outputs going to external sites or federated networks: before sharing the omop_output/ folder, blank or pseudonymize the person_source_value column in PERSON.csv. A short script is enough:

    from pathlib import Path
    import pandas as pd

    # Read IDs as strings so MRNs with leading zeros are not coerced to integers
    p = pd.read_csv('omop_output_2026-05-03/PERSON.csv', dtype={'person_source_value': str})
    ids = sorted(p['person_source_value'].dropna().unique())
    mapping = {pid: f'PT-{i+1:04d}' for i, pid in enumerate(ids)}
    p['person_source_value'] = p['person_source_value'].map(mapping).fillna('')
    # Create the output folders if they do not exist yet
    Path('omop_output_share').mkdir(parents=True, exist_ok=True)
    Path('PHI_VAULT').mkdir(parents=True, exist_ok=True)
    p.to_csv('omop_output_share/PERSON.csv', index=False)
    # Save mapping to your PHI vault, NOT to the share folder
    pd.DataFrame(list(mapping.items()), columns=['real_id', 'pseudo_id']).to_csv('PHI_VAULT/pid_mapping.csv', index=False)
The other OMOP tables only carry the integer person_id, so they don't need rewriting — only PERSON.csv does.
For Phenopackets going to Matchmaker Exchange, Beacon networks, or any external party: before submission, rewrite subject.id in each Phenopacket JSON with a pseudonym. Same approach as above. Store the mapping in your PHI vault.
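The subject.id rewrite can be sketched along the same lines as the PERSON.csv step. This is a minimal sketch, not a Registry Forge function: `pseudonymize_phenopackets` and the output file naming are hypothetical. It assumes each file holds a single Phenopacket JSON object with a top-level subject.id. It rewrites only subject.id; any other field that echoes the source identifier (for example the packet's own top-level id) needs a separate review.

```python
import json
from pathlib import Path


def pseudonymize_phenopackets(src_dir, dst_dir, mapping=None):
    """Copy each Phenopacket JSON to dst_dir with subject.id replaced.

    Returns the real_id -> pseudonym mapping so the caller can store it
    in the PHI vault. Passing an existing mapping keeps pseudonyms stable
    across outputs; this assumes it was numbered sequentially from PT-0001.
    """
    mapping = dict(mapping or {})
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for n, path in enumerate(sorted(Path(src_dir).glob("*.json")), start=1):
        packet = json.loads(path.read_text(encoding="utf-8"))
        subject = packet.get("subject")
        if not subject or "id" not in subject:
            raise ValueError(f"{path.name}: no subject.id found")
        real_id = subject["id"]
        if real_id not in mapping:
            mapping[real_id] = f"PT-{len(mapping) + 1:04d}"
        subject["id"] = mapping[real_id]
        # Neutral output names, in case the source file names embed the real ID
        out = dst / f"phenopacket_{n:04d}.json"
        out.write_text(json.dumps(packet, indent=2), encoding="utf-8")
    return mapping
```

Write the returned mapping to the PHI vault, never to the destination folder.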
For drug repurposing: the default pseudonymize=True already replaces raw patient_ids in drug_repurposing_cohort.csv with PT-NNNN. No further action needed unless you explicitly turned pseudonymization off for internal linkage.
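If it is unclear whether a cohort file was produced with pseudonymization on, a pattern check can catch raw IDs before the file leaves a PHI-controlled location. This is a minimal sketch under stated assumptions: the helper name `non_pseudonymized_ids` is hypothetical, the `patient_id` column name is a guess at the cohort CSV's layout, and "PT- followed by at least four digits" is inferred from the PT-NNNN notation used on this page.

```python
import csv
import re

# Assumed pseudonym shape, inferred from the PT-NNNN notation
PSEUDONYM = re.compile(r"PT-\d{4,}")


def non_pseudonymized_ids(csv_path, id_column="patient_id"):
    """Return the distinct values in id_column that do not look like PT-NNNN."""
    bad = set()
    with open(csv_path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        if id_column not in (reader.fieldnames or []):
            raise KeyError(f"column {id_column!r} not found in {csv_path}")
        for row in reader:
            value = (row[id_column] or "").strip()
            if value and not PSEUDONYM.fullmatch(value):
                bad.add(value)
    return sorted(bad)
```

An empty result means only that every ID has the pseudonym shape, not that the file is de-identified. Treat it as a guard against accidentally shipping a pseudonymize=False run.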
Why we don't pseudonymize by default in run_pipeline.py or omop_etl.py
Because both the dashboard and the OMOP / Phenopackets standards require the source identifier for their primary use cases:
- The dashboard's per-patient view is keyed on patient_id; pseudonymizing it would break the dashboard.
- OMOP person_source_value exists because OHDSI sites need to link back to source records for QC, chart review, and incident-finding. Pseudonymizing it by default would undercut the standard's intent.
- Phenopacket subject.id is similarly canonical; the spec assumes you'll pseudonymize before submission, not during construction.
The right pattern is: produce the full-identifier output once, store it in a PHI-controlled location, and pseudonymize on the way out to anywhere else. The drug repurposing module is the exception because its analytic output (the cohort CSV) has no reason to carry the source identifier — downstream pharmacoepi tools work on the data, not the patient lookup. So it pseudonymizes by default.
What to ask your IRB if you're unsure
Three concrete questions:
- Is the recipient covered under the same IRB protocol that authorized ARC's data acquisition? If yes, identifiers can stay. If no, pseudonymize.
- Is the recipient a covered entity, business associate, or external researcher? Different rules apply (BAA in place vs. data use agreement vs. de-identified data set).
- Are you sharing for treatment, payment, or operations — or for research? Research sharing without IRB authorization or a DUA generally requires de-identification.
The Registry Forge defaults assume the most cautious answer in each case.