Patient identifier handling in Registry Forge outputs

Medical record numbers (MRNs) are direct identifiers under HIPAA Safe Harbor §164.514(b)(2)(i)(H). EHR-vendor-issued patient IDs and FHIR resource IDs derived from those source identifiers fall under the catch-all for any other unique identifying number, §164.514(b)(2)(i)(R). This page documents where these identifiers appear in each output Registry Forge produces, and what to do about it.

Audit by output

| Output file | Contains raw patient_id? | Default behavior | Notes |
| --- | --- | --- | --- |
| `dashboard_data.json` | Yes | PHI by design | Full bundle that the dashboard loads. Internal use only. |
| `dashboard.html` | No identifiers in the HTML itself | Safe to share the HTML file alone | The HTML file is just the viewer; it only exposes PHI when paired with `dashboard_data.json`. |
| `patient_master.csv` | Yes | PHI | One row per patient with demographics. Internal use only. |
| `code_inventory.csv` | No | Privacy-safe | Aggregate counts only, no per-patient data. |
| `cohort_eda_report.html` | No; uses `PT-NNNN` pseudonyms | Privacy-safe by design | Six layered privacy transforms; safe to share with any colleague. |
| `omop_output/PERSON.csv` | Yes, in the `person_source_value` column | PHI | The OMOP spec stores the source identifier here. The integer `person_id` is internal and not directly identifying, but downstream OMOP tables link by `person_id`. |
| `omop_output/*.csv` (other tables) | No raw patient IDs, only the integer `person_id` | PHI through linkage to PERSON.csv | Strip or pseudonymize `person_source_value` in PERSON.csv before sharing. |
| `phenopackets_output/*.json` | Yes, in `subject.id` | PHI | The Phenopackets spec uses `subject.id` as the canonical identifier. Pseudonymize before submitting to Matchmaker Exchange or sharing. |
| `drug_repurposing_cohort.csv` | No by default (`PT-NNNN`); raw if `pseudonymize=False` | Privacy-safe by default since v2026-05 | New `pseudonymize=True` default. Set False only for internal chart-review linkage. |
| `drug_repurposing_summary.csv` | No | Privacy-safe | Aggregate per-medication, k-anonymity applied. |
| `cure_id_intake.csv` | No; uses `PT-NNNN` pseudonyms | Privacy-safe by design | Intended for FDA / NCATS submission, must be PII-free. |
| `drug_repurposing_report.html` | No | Privacy-safe | No per-patient IDs in the rendered HTML. |
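
The audit table can also be encoded as a small lookup so that an output folder is triaged before anything leaves it. This is a minimal sketch: the set names and the `split_outputs` helper are hypothetical and not part of Registry Forge. The classification is copied from the table above, and anything the lookup does not recognize (including subfolders such as `omop_output/`) is treated as PHI.

```python
from pathlib import Path

# Classification copied from the audit table. These sets are a hypothetical
# helper, not something Registry Forge ships.
PRIVACY_SAFE = {
    "dashboard.html",  # viewer only; the PHI lives in dashboard_data.json
    "code_inventory.csv",
    "cohort_eda_report.html",
    "drug_repurposing_summary.csv",
    "cure_id_intake.csv",
    "drug_repurposing_report.html",
}
# Safe only when produced with pseudonymize=True (the default since v2026-05).
CONDITIONAL = {"drug_repurposing_cohort.csv"}


def split_outputs(folder):
    """Sort the entries directly inside `folder` into shareable / conditional / phi."""
    result = {"shareable": [], "conditional": [], "phi": []}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.name in PRIVACY_SAFE:
            result["shareable"].append(path.name)
        elif path.is_file() and path.name in CONDITIONAL:
            result["conditional"].append(path.name)
        else:
            # Unrecognized files and all subfolders count as PHI until reviewed.
            result["phi"].append(path.name)
    return result
```

Defaulting unknown entries to the PHI list is deliberate: a new output file should need an explicit review before it can appear as shareable.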

What to do before sharing

For the dashboard: share dashboard.html together with dashboard_data.json only with people who have ARC data access under your IRB protocol. There is no PHI-safe version of the dashboard data, because the per-patient view is the point.

For OMOP outputs going to external sites or federated networks: before sharing the omop_output/ folder, blank or pseudonymize the person_source_value column in PERSON.csv. A short pandas approach:

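import os
import pandas as pd

# Read IDs as strings so MRNs with leading zeros are not converted to integers
p = pd.read_csv('omop_output_2026-05-03/PERSON.csv', dtype={'person_source_value': str})
# Assign PT-NNNN pseudonyms in sorted order of the distinct source IDs
mapping = {pid: f'PT-{i+1:04d}' for i, pid in enumerate(sorted(p['person_source_value'].dropna().unique()))}
p['person_source_value'] = p['person_source_value'].map(mapping).fillna('')
os.makedirs('omop_output_share', exist_ok=True)
p.to_csv('omop_output_share/PERSON.csv', index=False)
# Save mapping to your PHI vault, NOT to the share folder
pd.DataFrame(list(mapping.items()), columns=['real_id', 'pseudo_id']).to_csv('PHI_VAULT/pid_mapping.csv', index=False)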

The other OMOP tables only carry the integer person_id, so they don't need rewriting — only PERSON.csv does.

For Phenopackets going to Matchmaker Exchange, Beacon networks, or any external party: before submission, rewrite subject.id in each Phenopacket JSON with a pseudonym. Same approach as above. Store the mapping in your PHI vault.
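
A minimal sketch of that rewrite for a folder of Phenopacket JSON files, using only the standard library. The function name, the folder arguments, and the renamed output files are assumptions for illustration. Only `subject.id` is rewritten, so any other field that might embed the source identifier still needs a separate review.

```python
import csv
import json
from pathlib import Path


def pseudonymize_phenopackets(src_dir, share_dir, vault_csv):
    """Copy Phenopacket JSON files to share_dir with subject.id replaced by PT-NNNN."""
    src, out = Path(src_dir), Path(share_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Load every packet first so pseudonyms are assigned over the full ID set.
    packets = {p: json.loads(p.read_text(encoding="utf-8")) for p in sorted(src.glob("*.json"))}
    real_ids = sorted({str(doc["subject"]["id"]) for doc in packets.values()})
    mapping = {rid: f"PT-{i + 1:04d}" for i, rid in enumerate(real_ids)}

    for n, doc in enumerate(packets.values(), start=1):
        doc["subject"]["id"] = mapping[str(doc["subject"]["id"])]
        # Assumption: source file names may embed the raw ID, so rename on the way out.
        (out / f"phenopacket_{n:04d}.json").write_text(json.dumps(doc, indent=2), encoding="utf-8")

    # The mapping is PHI: write it to the vault path, never into share_dir.
    with open(vault_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["real_id", "pseudo_id"])
        writer.writerows(mapping.items())
    return mapping
```

If the same patients also appear in a shared OMOP export, reusing one mapping for both keeps the pseudonyms consistent across the two outputs.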

For drug repurposing: the default pseudonymize=True already replaces raw patient_ids in drug_repurposing_cohort.csv with PT-NNNN. No further action needed unless you explicitly turned pseudonymization off for internal linkage.
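
If it is unclear whether a given cohort file was produced with pseudonymization on, a quick pattern check can confirm it before sharing. This is a sketch under two assumptions: the identifier column is named `patient_id`, and pseudonyms are `PT-` followed by at least four digits. Adjust both to match the actual file.

```python
import csv
import re

# Assumed pseudonym shape: "PT-" plus four or more digits (PT-NNNN).
PSEUDONYM = re.compile(r"PT-\d{4,}")


def non_pseudonymized_ids(csv_path, id_column="patient_id"):
    """Return the distinct values in id_column that do not look like PT-NNNN."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        values = {row[id_column] for row in csv.DictReader(fh)}
    return sorted(v for v in values if not PSEUDONYM.fullmatch(v))
```

An empty list means every value in the column matches the pseudonym shape. A non-empty list suggests the file was written with pseudonymize=False and should stay internal.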

Why we don't pseudonymize by default in run_pipeline.py or omop_etl.py

The dashboard and the OMOP and Phenopackets standards all need the source identifier for their primary use cases:

The dashboard's per-patient view is keyed on patient_id; pseudonymizing it would break the dashboard.
OMOP person_source_value exists because OHDSI sites need to link back to source records for QC, chart review, and incident-finding. Pseudonymizing it by default would undercut the standard's intent.
Phenopacket subject.id is similarly canonical — the spec assumes you'll pseudonymize before submission, not during construction.

The right pattern is: produce the full-identifier output once, store it in a PHI-controlled location, and pseudonymize on the way out to anywhere else. The drug repurposing module is the exception because its analytic output (the cohort CSV) has no reason to carry the source identifier — downstream pharmacoepi tools work on the data, not the patient lookup. So it pseudonymizes by default.
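
One way to back up the pseudonymize-on-the-way-out pattern is a last check that no raw identifier survived in the share folder. The sketch below assumes the vault mapping file has the `real_id` and `pseudo_id` columns used in the OMOP example. The `find_leaks` helper is hypothetical, and it uses plain substring matching, so short numeric IDs can produce false positives.

```python
import csv
from pathlib import Path


def find_leaks(share_dir, vault_csv):
    """Return (file, real_id) pairs where a raw identifier appears in a shared file."""
    with open(vault_csv, newline="", encoding="utf-8") as fh:
        real_ids = [row["real_id"] for row in csv.DictReader(fh) if row["real_id"]]

    leaks = []
    for path in sorted(Path(share_dir).rglob("*")):
        if not path.is_file():
            continue
        # Substring search over the whole file; undecodable bytes are skipped.
        text = path.read_text(encoding="utf-8", errors="ignore")
        leaks.extend((str(path), rid) for rid in real_ids if rid in text)
    return leaks
```

Run it from inside the PHI-controlled environment, since it has to read the vault mapping. An empty result is a useful signal but not proof of de-identification.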

What to ask your IRB if you're unsure

Three concrete questions:

Is the recipient covered under the same IRB protocol that authorized ARC's data acquisition? If yes, identifiers can stay. If no, pseudonymize.
Is the recipient a covered entity, business associate, or external researcher? Different rules apply (BAA in place vs. data use agreement vs. de-identified data set).
Are you sharing for treatment, payment, or operations — or for research? Research sharing without IRB authorization or a DUA generally requires de-identification.

For the outputs intended for sharing, the Registry Forge defaults assume the most cautious answer in each case.
