Stage 4 - Long-format assembly¶
Stage 4 turns the flat list of record dicts produced by Stage 2 into a properly typed, sorted, schema-compliant DataFrame and writes it as patient_master.csv.
The schema¶
Identical to Registry Forge's patient_master.csv. Columns in order:
last_name, first_name, dob, patient_id, mrn, gender, marital_status,
category, source, effective_date, end_date,
code, code_system, display_name, all_codings_json,
value, unit, status, text, source_file, raw_record_json
See the Output schema for column-by-column documentation.
Assembly steps¶
df = pd.DataFrame(all_records)
# Ensure every expected column exists, even if no source records populated it
for c in MASTER_COLUMNS:
if c not in df.columns:
df[c] = ''
# Reorder columns to the canonical schema
df = df[MASTER_COLUMNS].fillna('')
# Stable sort: each patient's records grouped together, in category order, by date
df = df.sort_values(
['last_name','first_name','patient_id','category','effective_date'],
kind='stable'
).reset_index(drop=True)
# UTF-8 with BOM so Excel opens it correctly
df.to_csv(master_csv_path, index=False, encoding='utf-8-sig')
Sort order rationale¶
(last_name, first_name, patient_id, category, effective_date) puts:
- All of one patient's rows contiguously.
- Within a patient: the
patientheader row first (sorts before other categories alphabetically? No - but the join with the header row in Stage 2 means the header is one of the records being sorted; the emptyeffective_dateof categories without dates sorts first, which is acceptable). - Within a patient's category: chronologically by
effective_date.
A researcher scrolling top-to-bottom for one patient gets a chronological-ish narrative; a researcher filtering to category = medications across the cohort sees every prescription grouped by person.
Encoding choices¶
UTF-8 with BOM (utf-8-sig). Excel on Windows defaults to interpreting CSV as Windows-1252 unless it sees a BOM. The BOM forces UTF-8, which preserves non-ASCII characters in names and clinical text.
No quoting policy override. pandas defaults to quoting only when necessary (commas, quotes, newlines). That's fine for downstream pandas, R, and Excel consumption.
32K cell truncation. Excel's per-cell limit is 32,767 characters. Long narratives and very large raw_record_json blobs are truncated to ~32,000 characters with a ...[truncated] marker. This affects only edge cases (very long discharge summaries).
What's in raw_record_json¶
A JSON-encoded snapshot of the parsed entry, including:
section_title- the human-readable section heading.tag- the XML element type (observation,substanceAdministration, etc.).primary_code- the code we promoted to thecode/code_system/display_namecolumns.all_codings- every coding on the entry, including translations.effective_date,end_date,value,unit,status- duplicated from the structured columns for self-containment.
A reviewer who wants to verify "did the parser correctly identify the ICD-10 translation of this SNOMED problem?" opens raw_record_json for the row, parses the JSON, and inspects all_codings. No need to re-open the source XML.
What we don't do here¶
- No deduplication. Three identical records from three repeated downloads stay as three rows. The
source_filecolumn tells you where each one came from. Deduping is a researcher choice -df.drop_duplicates(['patient_id','category','code','effective_date','value'])is one line in pandas if you want it. - No date normalization beyond
YYYY-MM-DD. Times of day are dropped (intentional - patient downloads rarely have meaningful sub-day precision). Time zones are ignored. - No standardization of
valueunits. A creatinine inmg/dLand a creatinine inµmol/Lwill both appear as-is. Unit harmonization is a downstream choice that depends on the analysis.