Patient identifier handling in Registry Forge outputs

EHR-vendor-issued patient IDs, medical record numbers (MRNs), and FHIR resource IDs derived from those source identifiers are direct identifiers under HIPAA Safe Harbor §164.514(b)(2)(i)(H). This page documents where they appear in each output Registry Forge produces, and what to do about it.

Audit by output

| Output file | Contains raw patient_id? | Default behavior | Notes |
|---|---|---|---|
| dashboard_data.json | Yes | PHI by design | Full bundle that the dashboard loads. Internal use only. |
| dashboard.html | No identifiers in the HTML itself | Safe to share the HTML file alone | The HTML file is just the viewer; it only exposes PHI when paired with dashboard_data.json |
| patient_master.csv | Yes | PHI | One row per patient with demographics. Internal use only. |
| code_inventory.csv | No | Privacy-safe | Aggregate counts only, no per-patient data |
| cohort_eda_report.html | No (uses PT-NNNN pseudonyms) | Privacy-safe by design | Six layered privacy transforms; safe to share with any colleague |
| omop_output/PERSON.csv | Yes, in the person_source_value column | PHI | The OMOP spec stores the source identifier here. The integer person_id is internal and not directly identifying, but downstream OMOP tables link by person_id |
| omop_output/*.csv (other tables) | No raw patient IDs, only the integer person_id | PHI through linkage to PERSON.csv | Strip or pseudonymize person_source_value in PERSON.csv before sharing |
| phenopackets_output/*.json | Yes, in subject.id | PHI | The Phenopackets spec uses subject.id as the canonical identifier. Pseudonymize before submitting to Matchmaker Exchange or sharing |
| drug_repurposing_cohort.csv | No by default (PT-NNNN); raw if pseudonymize=False | Privacy-safe by default since v2026-05 | New pseudonymize=True default. Set False only for internal chart-review linkage |
| drug_repurposing_summary.csv | No | Privacy-safe | Aggregate per-medication, k-anonymity applied |
| cure_id_intake.csv | No (uses PT-NNNN pseudonyms) | Privacy-safe by design | Intended for FDA / NCATS submission, must be PII-free |
| drug_repurposing_report.html | No | Privacy-safe | No per-patient IDs in the rendered HTML |

What to do before sharing

For the dashboard: share dashboard.html and dashboard_data.json only with people who have ARC data access under your IRB protocol. There is no PHI-safe version of the dashboard data — the per-patient view is the point.

For OMOP outputs going to external sites or federated networks: before sharing the omop_output/ folder, blank or pseudonymize the person_source_value column in PERSON.csv. A short pandas approach:

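import os
import pandas as pd

# Read every column as text so IDs keep leading zeros and other columns are written back unchanged
p = pd.read_csv('omop_output_2026-05-03/PERSON.csv', dtype=str)
# Pseudonym numbers follow the sorted order of the real IDs; shuffle the IDs first if that order must not be recoverable
mapping = {pid: f'PT-{i+1:04d}' for i, pid in enumerate(sorted(p['person_source_value'].dropna().unique()))}
p['person_source_value'] = p['person_source_value'].map(mapping).fillna('')
# Create the share folder if it does not exist yet
os.makedirs('omop_output_share', exist_ok=True)
p.to_csv('omop_output_share/PERSON.csv', index=False)
# Save mapping to your PHI vault, NOT to the share folder
pd.DataFrame(list(mapping.items()), columns=['real_id', 'pseudo_id']).to_csv('PHI_VAULT/pid_mapping.csv', index=False)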

The other OMOP tables only carry the integer person_id, so they don't need rewriting — only PERSON.csv does.
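
Before the share folder leaves the PHI environment, it can be worth confirming that no raw identifier survived the rewrite. The following is a minimal sketch and not part of Registry Forge: `find_leaked_ids` is a hypothetical helper that takes the share folder and the real IDs (for example, the real_id column of the vault mapping) and reports every CSV cell that exactly equals one of them. It uses only the standard library and assumes UTF-8 CSV files with a header row. Purely numeric source IDs can coincide with integer person_id values, so treat a hit as a prompt for review, not as proof of a leak.

```python
import csv
from pathlib import Path


def find_leaked_ids(share_dir, real_ids):
    """Return (file name, row number, column) for each cell equal to a raw ID.

    Only exact cell matches are detected; an ID embedded in longer text
    would not be caught by this check.
    """
    real_ids = {str(r) for r in real_ids}
    hits = []
    for csv_path in sorted(Path(share_dir).glob("*.csv")):
        with open(csv_path, newline="", encoding="utf-8") as fh:
            reader = csv.DictReader(fh)
            # start=2 because row 1 of the file is the header
            for row_number, row in enumerate(reader, start=2):
                for column, value in row.items():
                    # Ragged rows can yield list or None values; skip those
                    if isinstance(value, str) and value in real_ids:
                        hits.append((csv_path.name, row_number, column))
    return hits
```

An empty list means no cell in any CSV under the folder matched a known raw ID. Run it from inside the PHI environment, since it needs the real IDs as input.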

For Phenopackets going to Matchmaker Exchange, Beacon networks, or any external party: before submission, rewrite subject.id in each Phenopacket JSON with a pseudonym. Same approach as above. Store the mapping in your PHI vault.
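
As an illustration of the Phenopackets step, here is a minimal sketch under stated assumptions: each file in the source folder is a single Phenopacket JSON object, and `mapping` is a real-ID-to-pseudonym dict loaded from your PHI vault. `pseudonymize_phenopackets` is a hypothetical helper, not a Registry Forge function. It rewrites only subject.id; whether other fields (for example, the top-level id) also carry the source identifier in Registry Forge output is not stated here and should be checked before submission.

```python
import json
from pathlib import Path


def pseudonymize_phenopackets(src_dir, dst_dir, mapping):
    """Write a copy of each Phenopacket to dst_dir with subject.id replaced.

    `mapping` is {real_id: pseudonym}. A file whose subject.id is absent
    from the mapping is skipped and reported, never copied with the raw ID.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    skipped = []
    for src_path in sorted(Path(src_dir).glob("*.json")):
        packet = json.loads(src_path.read_text(encoding="utf-8"))
        real_id = (packet.get("subject") or {}).get("id")
        pseudo = mapping.get(real_id)
        if pseudo is None:
            skipped.append(src_path.name)
            continue
        packet["subject"]["id"] = pseudo
        # Name the output after the pseudonym, since the source
        # file name may itself embed the raw identifier
        out_path = dst / f"{pseudo}.json"
        out_path.write_text(json.dumps(packet, indent=2), encoding="utf-8")
    return skipped
```

The function returns the names of the files it skipped, so an unexpected entry in that list signals a patient missing from the mapping.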

For drug repurposing: the default pseudonymize=True already replaces raw patient_ids in drug_repurposing_cohort.csv with PT-NNNN. No further action needed unless you explicitly turned pseudonymization off for internal linkage.
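
If you are unsure whether a given cohort file was produced with pseudonymization on, a quick format check can help. This is a sketch under assumptions: the ID column is taken to be named patient_id, and PT-NNNN is read as "PT-" followed by four or more digits. Neither detail is confirmed on this page, so adjust both to match your files. `non_pseudonymized_rows` is a hypothetical helper name.

```python
import csv
import re

# Assumed reading of the PT-NNNN format: "PT-" plus at least four digits
PSEUDONYM = re.compile(r"PT-\d{4,}")


def non_pseudonymized_rows(csv_path, id_column="patient_id"):
    """Return file row numbers whose ID does not look like a PT-NNNN pseudonym."""
    bad_rows = []
    with open(csv_path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        if id_column not in (reader.fieldnames or []):
            raise KeyError(f"column {id_column!r} not found in {csv_path}")
        # start=2 because row 1 of the file is the header
        for row_number, row in enumerate(reader, start=2):
            if not PSEUDONYM.fullmatch(row[id_column] or ""):
                bad_rows.append(row_number)
    return bad_rows
```

An empty result means every ID in the column matches the pseudonym pattern. A non-empty result suggests the file was generated with pseudonymize=False and should stay internal.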

Why we don't pseudonymize by default in run_pipeline.py or omop_etl.py

Because both the dashboard and the OMOP / Phenopackets standards require the source identifier for their primary use cases:

  • The dashboard's per-patient view is keyed on patient_id; pseudonymizing it would break the dashboard.
  • OMOP person_source_value exists because OHDSI sites need to link back to source records for QC, chart review, and incident-finding. Pseudonymizing it by default would undercut the standard's intent.
  • Phenopacket subject.id is similarly canonical — the spec assumes you'll pseudonymize before submission, not during construction.

The right pattern is: produce the full-identifier output once, store it in a PHI-controlled location, and pseudonymize on the way out to anywhere else. The drug repurposing module is the exception because its analytic output (the cohort CSV) has no reason to carry the source identifier — downstream pharmacoepi tools work on the data, not the patient lookup. So it pseudonymizes by default.

What to ask your IRB if you're unsure

Three concrete questions:

  1. Is the recipient covered under the same IRB protocol that authorized ARC's data acquisition? If yes, identifiers can stay. If no, pseudonymize.
  2. Is the recipient a covered entity, business associate, or external researcher? Different rules apply (BAA in place vs. data use agreement vs. de-identified data set).
  3. Are you sharing for treatment, payment, or operations — or for research? Research sharing without IRB authorization or a DUA generally requires de-identification.

The Registry Forge defaults assume the most cautious answer in each case.