Patient master CSV¶

Stage 7b of the pipeline writes three long-format master CSVs alongside dashboard_data.json. Each row is one record (problem, medication, lab, narrative section, etc.) with the patient's demographics joined on the front, so a colleague can:

Open the file in Excel and filter by patient_id to see one person's complete record across every category.
Filter by category to see all medications, all problems, etc., across the cohort.
Filter by source to compare what came from CCDA documents vs. discrete FHIR resources.
Cross-check any dashboard cell against the spreadsheet, then drill into raw_record_json to see the exact original record the row came from.

Three files are written:

File	Contents
`patient_master.csv`	Every record from every patient, both CCDA- and FHIR-sourced
`patient_master_fhir.csv`	Patient demographic header rows + FHIR-sourced records only
`patient_master_ccda.csv`	Patient demographic header rows + CCDA-sourced records only

All three share the same schema. The split files are conveniences — same data as the master, pre-filtered.

Schema¶

The patient demographic columns come first (per request) so that name + DOB are always the leftmost identifiers, followed by patient_id and the rest of the row.

Column	Always present?	Notes
`last_name`	yes	Patient's last name; blank for FHIR-only patients with no demographics
`first_name`	yes	Patient's first name
`dob`	yes	Date of birth, `YYYY-MM-DD`
`patient_id`	yes	Canonical patient identifier used throughout the pipeline
`mrn`	yes	Medical record number (when available)
`gender`	yes	FHIR-style `male` / `female` / `other` / `unknown`
`marital_status`	yes	When recorded
`category`	yes	`patient`, `problems`, `medications`, `procedures`, `labs_vitals`, `allergies`, `immunizations`, `careplans`, `diagnostic_reports`, `goals`, `notes`, `documents`, `document_references`, `encounters`
`source`	yes	`fhir`, `ccda`, or empty for the patient header row
`effective_date`	when known	Best available date for the record (effective, start, authored, recorded). For the `patient` row, this is DOB.
`end_date`	when known	End or resolution date when applicable
`code`	when coded	Primary code value
`code_system`	when coded	Vocabulary name (`SNOMED-CT`, `RxNorm`, `LOINC`, `ICD-10-CM`, `CVX`, `CPT-4`, etc.)
`display_name`	when known	Human-readable label for the primary code
`all_codings_json`	when coded	Full list of all codings on the record (e.g. SNOMED + ICD-10 translation), as a compact JSON string
`value`	for labs/vitals	Numeric or text value
`unit`	for labs/vitals	Units of measurement
`status`	when recorded	`active`, `completed`, `final`, etc.
`text`	varies	Best free-text column for the category — section narrative, dosage instruction, conclusion, description, allergic reaction, etc.
`source_file`	when known	Original source file path or FHIR resource ID
`raw_record_json`	always	Compact JSON of the entire original record, so a reviewer can verify every other column

Cells that exceed Excel's per-cell character limit (32,767) are truncated to ~32,000 characters with a ...[truncated] marker. This affects only very long narratives and the raw_record_json for unusually large records.

Sort order¶

Rows are sorted by (last_name, first_name, patient_id, category, effective_date). This puts each patient's records together, with the patient header row at the top, followed by every other category in a stable order, and within each category by date. For one patient, scrolling top-to-bottom is roughly chronological after the demographic header.

Recommended Excel workflow¶

Open patient_master.csv in Excel. Excel will prompt about encoding — the file is UTF-8 with BOM, so it opens cleanly.
Select the header row, Data → Filter to add filter dropdowns to every column.
Filter patient_id (or last_name) to one patient. The rows above the filter become a tidy chronological record for that person.
Compare against the dashboard's per-tab views. Each dashboard row should correspond to a master CSV row with the same code, display_name, and effective_date.
To cross-check a single dashboard cell, copy the raw_record_json value into a JSON viewer and confirm the underlying record matches.

pandas / R workflow¶

import pandas as pd

df = pd.read_csv('patient_master.csv', encoding='utf-8-sig', dtype=str)

# How many records per patient per category?
df.groupby(['patient_id', 'category']).size().unstack(fill_value=0)

# All medications for one patient, in date order
df.query("patient_id == '<id>' and category == 'medications'") \
  .sort_values('effective_date')[['effective_date','code','display_name','text']]

# All FHIR-only or all CCDA-only via the split files
fhir = pd.read_csv('patient_master_fhir.csv', encoding='utf-8-sig', dtype=str)
ccda = pd.read_csv('patient_master_ccda.csv', encoding='utf-8-sig', dtype=str)

df <- read.csv('patient_master.csv', encoding='UTF-8', stringsAsFactors=FALSE)
table(df$category, df$source)

When to use which file¶

patient_master.csv — default. One source of truth across both pipelines. Use this unless you have a specific reason to look at one source in isolation.
patient_master_fhir.csv — for analyses that require the discrete-resource fidelity of FHIR (e.g. exact medication dosage, lab numeric values, structured encounter classes). FHIR rows have richer value/unit and code coverage.
patient_master_ccda.csv — for analyses that need the CCDA section-narrative text content (clinical impressions, plan-of-care narratives, problem-list descriptions). CCDA rows preserve the prose that FHIR resources strip out.

The split files include the patient header rows so they're self-contained — you don't need to merge with the main file to know which patient each record belongs to.

Relationship to other outputs¶

The master CSVs do not replace any existing output:

dashboard_data.json — canonical bundle, consumed by the HTML dashboard. Master CSVs are derived from it.
dashboard_data.xlsx — multi-sheet Excel workbook, one sheet per category, no patient demographics joined. Better for category-by-category review; worse for one-patient deep-dives.
csv_exports/ — per-category CSVs (medications.csv, problems.csv, etc.), each with patient_id but not the joined demographics.
patient_master.csv (this file) — long-format with demographics joined, designed for cross-referencing dashboard rows and for analysts who prefer one big file.
code_inventory.csv — aggregated by (vocabulary, code), one row per unique code, not per record.

For a colleague who only wants to spot-check the dashboard against a spreadsheet, the master CSV is the right starting point.