Stage 3 - Identity resolution
Each C-CDA document represents one patient. To merge records across multiple documents from the same patient - and to keep records from different patients separate - we need a stable, deterministic patient_id.
The identity hash
import hashlib
# Prefer the MRN; otherwise fall back to name + DOB, trimming stray '|' separators.
identity = (mrn or f'{last_name}|{first_name}|{dob}').strip('|')
patient_id = 'PT-' + hashlib.sha1(identity.encode('utf-8')).hexdigest()[:10].upper()
Two layers:
- Prefer MRN. When the C-CDA includes an MRN (extension attribute on recordTarget/patientRole/id), use it as the identity string. Within one health system, this is the strongest possible identifier.
- Fallback: name + DOB. When no MRN is present (some Blue Button exports strip it), concatenate last_name|first_name|dob.
The identity string is SHA-1 hashed and truncated to 10 hex characters, prefixed PT-. Example: PT-1D2DF76565.
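The two layers can be wrapped in a small function to make the precedence explicit. This is a minimal sketch rather than the pipeline's actual code: the function name make_patient_id and the sample demographics are assumptions for illustration.

```python
import hashlib


def make_patient_id(mrn, last_name, first_name, dob):
    """Hypothetical helper: build a patient_id from an MRN or from name + DOB."""
    # Layer 1: use the MRN when present. Layer 2: fall back to name + DOB.
    identity = (mrn or f"{last_name}|{first_name}|{dob}").strip("|")
    digest = hashlib.sha1(identity.encode("utf-8")).hexdigest()
    # Keep the first 10 hex characters (40 bits) and add the PT- prefix.
    return "PT-" + digest[:10].upper()


# Made-up demographics, for illustration only.
with_mrn = make_patient_id("123456", "Doe", "Jane", "1980-01-01")
without_mrn = make_patient_id(None, "Doe", "Jane", "1980-01-01")

print(with_mrn, without_mrn)
# True: once an MRN is present, the demographics do not affect the ID.
print(with_mrn == make_patient_id("123456", "DOE", "J.", ""))
```

One consequence worth noting: a document that carries an MRN and a document for the same person that lacks one produce different identity strings, and therefore different patient_ids.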
Design rationale
Stable across runs. The same input documents will always produce the same patient_id. Reproducibility is important when researchers re-run the pipeline as documents are added.
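No PHI in the ID. The hash is one-way, so the patient_id itself can be shared in a paper or a forum without exposing name or DOB. One caveat: the hash is unsalted, so anyone who already knows a patient's MRN, or their name and DOB, can recompute the ID and confirm a match. (Demographics are still in patient_master.csv, of course; the ID just isn't itself PHI.)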
Cross-system safe. A patient's MRN at hospital A is unrelated to their MRN at hospital B. Hashing the MRN doesn't help cross-system linkage - but it doesn't make it worse either. If you need cross-system patient matching, that's a separate problem (probabilistic linkage, hashed identifiers via HIE) and not something Patient Edition tries to solve.
Known limitations
Inconsistent demographics split one patient into two
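If a patient's name appears as Jane M. Doe in one document and Jane Doe in another (no middle initial), and neither document carries an MRN, the fallback identity strings differ. The hashes then differ too, and the patient is split across two patient_ids.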
Symptoms. Two patient header rows in patient_master.csv that obviously refer to the same person.
Mitigation. Post-process by editing the last_name / first_name cells to a canonical spelling and re-running the notebook from cell 4 onward. Or add a hand-maintained identity_overrides.csv step before the hashing (not yet built in; would be a useful contribution).
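The identity_overrides.csv idea could look roughly like the sketch below. None of this exists in the pipeline yet: the column names raw_identity and canonical_identity, the helper names, and the sample row are all hypothetical. The override is applied to the identity string before hashing, so both spellings resolve to one patient_id.

```python
import csv
import hashlib
import io

# Hypothetical contents of identity_overrides.csv, inlined so the sketch is self-contained.
OVERRIDES_CSV = """raw_identity,canonical_identity
Doe|Jane M.|1980-01-01,Doe|Jane|1980-01-01
"""


def load_overrides(handle):
    """Read raw -> canonical identity mappings from an open CSV file."""
    return {row["raw_identity"]: row["canonical_identity"] for row in csv.DictReader(handle)}


def patient_id_for(identity, overrides):
    """Hash the canonical identity if an override exists, else the identity as given."""
    canonical = overrides.get(identity, identity)
    return "PT-" + hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:10].upper()


overrides = load_overrides(io.StringIO(OVERRIDES_CSV))
id_a = patient_id_for("Doe|Jane M.|1980-01-01", overrides)
id_b = patient_id_for("Doe|Jane|1980-01-01", overrides)
print(id_a == id_b)  # True: both spellings map to the same canonical identity
```

Keeping the mapping in a separate file leaves the source documents untouched, and the same overrides are applied on every re-run.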
Two patients with the same name + DOB collide
This is the classic identity-resolution failure mode. Two people named "John Smith" born "1980-01-01" hash to the same ID. The MRN-first strategy mitigates this when MRNs are present.
Symptoms. Records that obviously don't belong to the same person grouped under one patient_id - e.g. inconsistent gender or marital status across the patient header rows.
Mitigation. Inspect parse_log.csv. If two source files for the "same" patient have wildly different categories or codes, manually disambiguate by introducing a suffix to one set of files' demographics before re-running.
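One way to surface candidate collisions is to look for patient_ids whose header rows disagree on a field that should not vary. A minimal sketch, assuming the header rows are available as dictionaries with patient_id and gender keys; the gender column name and the sample rows are assumptions, not taken from patient_master.csv.

```python
from collections import defaultdict

# Made-up header rows; in practice these would be read from patient_master.csv.
header_rows = [
    {"patient_id": "PT-AAAAAAAAAA", "gender": "M"},
    {"patient_id": "PT-AAAAAAAAAA", "gender": "F"},
    {"patient_id": "PT-BBBBBBBBBB", "gender": "F"},
    {"patient_id": "PT-BBBBBBBBBB", "gender": "F"},
]


def find_conflicts(rows, field):
    """Return {patient_id: sorted distinct values} for IDs with more than one value of `field`."""
    seen = defaultdict(set)
    for row in rows:
        value = row.get(field)
        if value:  # ignore blank cells
            seen[row["patient_id"]].add(value)
    return {pid: sorted(values) for pid, values in seen.items() if len(values) > 1}


suspects = find_conflicts(header_rows, "gender")
print(suspects)  # {'PT-AAAAAAAAAA': ['F', 'M']}
```

A hit is only a prompt for manual review: a data-entry error in one document produces the same signal as a real collision.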
The MRN isn't always stable
Within one EHR vendor's system, MRNs are stable. Across vendors, or after a system migration, MRNs can change. A patient who downloaded their record before and after their hospital's EHR migration may therefore hold documents with two different MRNs.
Symptoms. Two patient_ids with identical name and DOB but different MRNs.
Mitigation. Same as the name-spelling case - edit toward a canonical identity before re-running.
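This symptom can be checked mechanically by grouping header rows on name and DOB and listing the groups that span more than one patient_id. A sketch under the assumption that the rows expose last_name, first_name, dob, and patient_id keys; the sample values are invented.

```python
from collections import defaultdict

# Invented rows: the same person exported before and after an MRN change.
header_rows = [
    {"patient_id": "PT-1111111111", "last_name": "Doe", "first_name": "Jane", "dob": "1980-01-01"},
    {"patient_id": "PT-2222222222", "last_name": "Doe", "first_name": "Jane", "dob": "1980-01-01"},
    {"patient_id": "PT-3333333333", "last_name": "Roe", "first_name": "Sam", "dob": "1975-06-30"},
]


def find_split_identities(rows):
    """Return {(last, first, dob): sorted patient_ids} for people spread across several IDs."""
    groups = defaultdict(set)
    for row in rows:
        # Case-fold names so 'DOE' and 'Doe' land in the same group.
        key = (row["last_name"].casefold(), row["first_name"].casefold(), row["dob"])
        groups[key].add(row["patient_id"])
    return {key: sorted(ids) for key, ids in groups.items() if len(ids) > 1}


splits = find_split_identities(header_rows)
print(splits)  # {('doe', 'jane', '1980-01-01'): ['PT-1111111111', 'PT-2222222222']}
```

The grouping is still exact-match, so it will not catch the Jane M. Doe versus Jane Doe case, and a flagged group may equally be two distinct people who share a name and DOB.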
What this is not doing
- No probabilistic linkage. No fuzzy matching on names, no Levenshtein distance, no Soundex. Identity is exact-match-after-hashing.
- No cross-source identity resolution. Registry Forge resolves medicationReference URLs across CCDA, FHIR, and notes. Patient Edition has only one source (C-CDA on disk) so this isn't relevant.
- No use of <id> roots. The CDA id element has both a root (an OID namespace) and an extension (the actual identifier). We use the extension. The root is recorded in raw_record_json if you need it later.
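To make the root/extension distinction concrete, the sketch below reads both attributes from recordTarget/patientRole/id using the standard library. It is an illustration, not the pipeline's parser: the function name, the sample fragment, and the OID and MRN values are placeholders.

```python
import xml.etree.ElementTree as ET

# CDA documents use the HL7 v3 namespace.
NS = {"hl7": "urn:hl7-org:v3"}

# Minimal made-up fragment; the OID and MRN values are placeholders.
SAMPLE = """<ClinicalDocument xmlns="urn:hl7-org:v3">
  <recordTarget>
    <patientRole>
      <id root="2.16.840.1.113883.19.5" extension="123456"/>
    </patientRole>
  </recordTarget>
</ClinicalDocument>"""


def read_patient_id_element(xml_text):
    """Return (root, extension) from recordTarget/patientRole/id, or (None, None)."""
    doc = ET.fromstring(xml_text)
    node = doc.find("hl7:recordTarget/hl7:patientRole/hl7:id", NS)
    if node is None:
        return None, None
    return node.get("root"), node.get("extension")


root, extension = read_patient_id_element(SAMPLE)
print(root, extension)  # 2.16.840.1.113883.19.5 123456
```

A patientRole can carry more than one id element; this sketch reads only the first. Since only the extension feeds the hash, two systems that issue the same extension value would map to the same patient_id, which is one reason the recorded root can be useful later.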
How patient_id flows downstream
- Every row in patient_master.csv carries the patient_id.
- The dashboard's patient filter dropdown is built from unique patient_ids.
- The feature matrix is indexed by patient_id.
If you spot an identity-resolution problem at any of those layers, the fix is upstream in Stage 3.