Skip to content

Stage 4 - Long-format assembly

Stage 4 turns the flat list of record dicts produced by Stage 2 into a properly typed, sorted, schema-compliant DataFrame and writes it as patient_master.csv.

The schema

Identical to Registry Forge's patient_master.csv. Columns in order:

last_name, first_name, dob, patient_id, mrn, gender, marital_status,
category, source, effective_date, end_date,
code, code_system, display_name, all_codings_json,
value, unit, status, text, source_file, raw_record_json

See the Output schema for column-by-column documentation.

Assembly steps

df = pd.DataFrame(all_records)

# Ensure every expected column exists, even if no source records populated it
for c in MASTER_COLUMNS:
    if c not in df.columns:
        df[c] = ''

# Reorder columns to the canonical schema
df = df[MASTER_COLUMNS].fillna('')

# Stable sort: each patient's records grouped together, in category order, by date
df = df.sort_values(
    ['last_name','first_name','patient_id','category','effective_date'],
    kind='stable'
).reset_index(drop=True)

# UTF-8 with BOM so Excel opens it correctly
df.to_csv(master_csv_path, index=False, encoding='utf-8-sig')

Sort order rationale

(last_name, first_name, patient_id, category, effective_date) puts:

  1. All of one patient's rows contiguously.
  2. Within a patient: the patient header row first (sorts before other categories alphabetically? No - but the join with the header row in Stage 2 means the header is one of the records being sorted; the empty effective_date of categories without dates sorts first, which is acceptable).
  3. Within a patient's category: chronologically by effective_date.

A researcher scrolling top-to-bottom for one patient gets a chronological-ish narrative; a researcher filtering to category = medications across the cohort sees every prescription grouped by person.

Encoding choices

UTF-8 with BOM (utf-8-sig). Excel on Windows defaults to interpreting CSV as Windows-1252 unless it sees a BOM. The BOM forces UTF-8, which preserves non-ASCII characters in names and clinical text.

No quoting policy override. pandas defaults to quoting only when necessary (commas, quotes, newlines). That's fine for downstream pandas, R, and Excel consumption.

32K cell truncation. Excel's per-cell limit is 32,767 characters. Long narratives and very large raw_record_json blobs are truncated to ~32,000 characters with a ...[truncated] marker. This affects only edge cases (very long discharge summaries).

What's in raw_record_json

A JSON-encoded snapshot of the parsed entry, including:

  • section_title - the human-readable section heading.
  • tag - the XML element type (observation, substanceAdministration, etc.).
  • primary_code - the code we promoted to the code / code_system / display_name columns.
  • all_codings - every coding on the entry, including translations.
  • effective_date, end_date, value, unit, status - duplicated from the structured columns for self-containment.

A reviewer who wants to verify "did the parser correctly identify the ICD-10 translation of this SNOMED problem?" opens raw_record_json for the row, parses the JSON, and inspects all_codings. No need to re-open the source XML.

What we don't do here

  • No deduplication. Three identical records from three repeated downloads stay as three rows. The source_file column tells you where each one came from. Deduping is a researcher choice - df.drop_duplicates(['patient_id','category','code','effective_date','value']) is one line in pandas if you want it.
  • No date normalization beyond YYYY-MM-DD. Times of day are dropped (intentional - patient downloads rarely have meaningful sub-day precision). Time zones are ignored.
  • No standardization of value units. A creatinine in mg/dL and a creatinine in µmol/L will both appear as-is. Unit harmonization is a downstream choice that depends on the analysis.