Patient master CSV¶
Stage 7b of the pipeline writes three long-format master CSVs alongside dashboard_data.json. Each row is one record (problem, medication, lab, narrative section, etc.) with the patient's demographics joined on the front, so a colleague can:
- Open the file in Excel and filter by
patient_idto see one person's complete record across every category. - Filter by
categoryto see all medications, all problems, etc., across the cohort. - Filter by
sourceto compare what came from CCDA documents vs. discrete FHIR resources. - Cross-check any dashboard cell against the spreadsheet, then drill into
raw_record_jsonto see the exact original record the row came from.
Three files are written:
| File | Contents |
|---|---|
patient_master.csv |
Every record from every patient, both CCDA- and FHIR-sourced |
patient_master_fhir.csv |
Patient demographic header rows + FHIR-sourced records only |
patient_master_ccda.csv |
Patient demographic header rows + CCDA-sourced records only |
All three share the same schema. The split files are conveniences — same data as the master, pre-filtered.
Schema¶
The patient demographic columns come first (per request) so that name + DOB are always the leftmost identifiers, followed by patient_id and the rest of the row.
| Column | Always present? | Notes |
|---|---|---|
last_name |
yes | Patient's last name; blank for FHIR-only patients with no demographics |
first_name |
yes | Patient's first name |
dob |
yes | Date of birth, YYYY-MM-DD |
patient_id |
yes | Canonical patient identifier used throughout the pipeline |
mrn |
yes | Medical record number (when available) |
gender |
yes | FHIR-style male / female / other / unknown |
marital_status |
yes | When recorded |
category |
yes | patient, problems, medications, procedures, labs_vitals, allergies, immunizations, careplans, diagnostic_reports, goals, notes, documents, document_references, encounters |
source |
yes | fhir, ccda, or empty for the patient header row |
effective_date |
when known | Best available date for the record (effective, start, authored, recorded). For the patient row, this is DOB. |
end_date |
when known | End or resolution date when applicable |
code |
when coded | Primary code value |
code_system |
when coded | Vocabulary name (SNOMED-CT, RxNorm, LOINC, ICD-10-CM, CVX, CPT-4, etc.) |
display_name |
when known | Human-readable label for the primary code |
all_codings_json |
when coded | Full list of all codings on the record (e.g. SNOMED + ICD-10 translation), as a compact JSON string |
value |
for labs/vitals | Numeric or text value |
unit |
for labs/vitals | Units of measurement |
status |
when recorded | active, completed, final, etc. |
text |
varies | Best free-text column for the category — section narrative, dosage instruction, conclusion, description, allergic reaction, etc. |
source_file |
when known | Original source file path or FHIR resource ID |
raw_record_json |
always | Compact JSON of the entire original record, so a reviewer can verify every other column |
Cells that exceed Excel's per-cell character limit (32,767) are truncated to ~32,000 characters with a ...[truncated] marker. This affects only very long narratives and the raw_record_json for unusually large records.
Sort order¶
Rows are sorted by (last_name, first_name, patient_id, category, effective_date). This puts each patient's records together, with the patient header row at the top, followed by every other category in a stable order, and within each category by date. For one patient, scrolling top-to-bottom is roughly chronological after the demographic header.
Recommended Excel workflow¶
- Open
patient_master.csvin Excel. Excel will prompt about encoding — the file is UTF-8 with BOM, so it opens cleanly. - Select the header row, Data → Filter to add filter dropdowns to every column.
- Filter
patient_id(orlast_name) to one patient. The rows above the filter become a tidy chronological record for that person. - Compare against the dashboard's per-tab views. Each dashboard row should correspond to a master CSV row with the same
code,display_name, andeffective_date. - To cross-check a single dashboard cell, copy the
raw_record_jsonvalue into a JSON viewer and confirm the underlying record matches.
pandas / R workflow¶
import pandas as pd
df = pd.read_csv('patient_master.csv', encoding='utf-8-sig', dtype=str)
# How many records per patient per category?
df.groupby(['patient_id', 'category']).size().unstack(fill_value=0)
# All medications for one patient, in date order
df.query("patient_id == '<id>' and category == 'medications'") \
.sort_values('effective_date')[['effective_date','code','display_name','text']]
# All FHIR-only or all CCDA-only via the split files
fhir = pd.read_csv('patient_master_fhir.csv', encoding='utf-8-sig', dtype=str)
ccda = pd.read_csv('patient_master_ccda.csv', encoding='utf-8-sig', dtype=str)
df <- read.csv('patient_master.csv', encoding='UTF-8', stringsAsFactors=FALSE)
table(df$category, df$source)
When to use which file¶
patient_master.csv— default. One source of truth across both pipelines. Use this unless you have a specific reason to look at one source in isolation.patient_master_fhir.csv— for analyses that require the discrete-resource fidelity of FHIR (e.g. exact medication dosage, lab numeric values, structured encounter classes). FHIR rows have richervalue/unitand code coverage.patient_master_ccda.csv— for analyses that need the CCDA section-narrative text content (clinical impressions, plan-of-care narratives, problem-list descriptions). CCDA rows preserve the prose that FHIR resources strip out.
The split files include the patient header rows so they're self-contained — you don't need to merge with the main file to know which patient each record belongs to.
Relationship to other outputs¶
The master CSVs do not replace any existing output:
dashboard_data.json— canonical bundle, consumed by the HTML dashboard. Master CSVs are derived from it.dashboard_data.xlsx— multi-sheet Excel workbook, one sheet per category, no patient demographics joined. Better for category-by-category review; worse for one-patient deep-dives.csv_exports/— per-category CSVs (medications.csv, problems.csv, etc.), each withpatient_idbut not the joined demographics.patient_master.csv(this file) — long-format with demographics joined, designed for cross-referencing dashboard rows and for analysts who prefer one big file.code_inventory.csv— aggregated by (vocabulary, code), one row per unique code, not per record.
For a colleague who only wants to spot-check the dashboard against a spreadsheet, the master CSV is the right starting point.