Skip to content

Patient master CSV

Stage 7b of the pipeline writes three long-format master CSVs alongside dashboard_data.json. Each row is one record (problem, medication, lab, narrative section, etc.) with the patient's demographics joined on the front, so a colleague can:

  • Open the file in Excel and filter by patient_id to see one person's complete record across every category.
  • Filter by category to see all medications, all problems, etc., across the cohort.
  • Filter by source to compare what came from CCDA documents vs. discrete FHIR resources.
  • Cross-check any dashboard cell against the spreadsheet, then drill into raw_record_json to see the exact original record the row came from.

Three files are written:

File Contents
patient_master.csv Every record from every patient, both CCDA- and FHIR-sourced
patient_master_fhir.csv Patient demographic header rows + FHIR-sourced records only
patient_master_ccda.csv Patient demographic header rows + CCDA-sourced records only

All three share the same schema. The split files are conveniences — same data as the master, pre-filtered.

Schema

The patient demographic columns come first (per request) so that name + DOB are always the leftmost identifiers, followed by patient_id and the rest of the row.

Column Always present? Notes
last_name yes Patient's last name; blank for FHIR-only patients with no demographics
first_name yes Patient's first name
dob yes Date of birth, YYYY-MM-DD
patient_id yes Canonical patient identifier used throughout the pipeline
mrn yes Medical record number (when available)
gender yes FHIR-style male / female / other / unknown
marital_status yes When recorded
category yes patient, problems, medications, procedures, labs_vitals, allergies, immunizations, careplans, diagnostic_reports, goals, notes, documents, document_references, encounters
source yes fhir, ccda, or empty for the patient header row
effective_date when known Best available date for the record (effective, start, authored, recorded). For the patient row, this is DOB.
end_date when known End or resolution date when applicable
code when coded Primary code value
code_system when coded Vocabulary name (SNOMED-CT, RxNorm, LOINC, ICD-10-CM, CVX, CPT-4, etc.)
display_name when known Human-readable label for the primary code
all_codings_json when coded Full list of all codings on the record (e.g. SNOMED + ICD-10 translation), as a compact JSON string
value for labs/vitals Numeric or text value
unit for labs/vitals Units of measurement
status when recorded active, completed, final, etc.
text varies Best free-text column for the category — section narrative, dosage instruction, conclusion, description, allergic reaction, etc.
source_file when known Original source file path or FHIR resource ID
raw_record_json always Compact JSON of the entire original record, so a reviewer can verify every other column

Cells that exceed Excel's per-cell character limit (32,767) are truncated to ~32,000 characters with a ...[truncated] marker. This affects only very long narratives and the raw_record_json for unusually large records.

Sort order

Rows are sorted by (last_name, first_name, patient_id, category, effective_date). This puts each patient's records together, with the patient header row at the top, followed by every other category in a stable order, and within each category by date. For one patient, scrolling top-to-bottom is roughly chronological after the demographic header.

  1. Open patient_master.csv in Excel. Excel will prompt about encoding — the file is UTF-8 with BOM, so it opens cleanly.
  2. Select the header row, Data → Filter to add filter dropdowns to every column.
  3. Filter patient_id (or last_name) to one patient. The rows above the filter become a tidy chronological record for that person.
  4. Compare against the dashboard's per-tab views. Each dashboard row should correspond to a master CSV row with the same code, display_name, and effective_date.
  5. To cross-check a single dashboard cell, copy the raw_record_json value into a JSON viewer and confirm the underlying record matches.

pandas / R workflow

import pandas as pd

df = pd.read_csv('patient_master.csv', encoding='utf-8-sig', dtype=str)

# How many records per patient per category?
df.groupby(['patient_id', 'category']).size().unstack(fill_value=0)

# All medications for one patient, in date order
df.query("patient_id == '<id>' and category == 'medications'") \
  .sort_values('effective_date')[['effective_date','code','display_name','text']]

# All FHIR-only or all CCDA-only via the split files
fhir = pd.read_csv('patient_master_fhir.csv', encoding='utf-8-sig', dtype=str)
ccda = pd.read_csv('patient_master_ccda.csv', encoding='utf-8-sig', dtype=str)
df <- read.csv('patient_master.csv', encoding='UTF-8', stringsAsFactors=FALSE)
table(df$category, df$source)

When to use which file

  • patient_master.csv — default. One source of truth across both pipelines. Use this unless you have a specific reason to look at one source in isolation.
  • patient_master_fhir.csv — for analyses that require the discrete-resource fidelity of FHIR (e.g. exact medication dosage, lab numeric values, structured encounter classes). FHIR rows have richer value/unit and code coverage.
  • patient_master_ccda.csv — for analyses that need the CCDA section-narrative text content (clinical impressions, plan-of-care narratives, problem-list descriptions). CCDA rows preserve the prose that FHIR resources strip out.

The split files include the patient header rows so they're self-contained — you don't need to merge with the main file to know which patient each record belongs to.

Relationship to other outputs

The master CSVs do not replace any existing output:

  • dashboard_data.json — canonical bundle, consumed by the HTML dashboard. Master CSVs are derived from it.
  • dashboard_data.xlsx — multi-sheet Excel workbook, one sheet per category, no patient demographics joined. Better for category-by-category review; worse for one-patient deep-dives.
  • csv_exports/ — per-category CSVs (medications.csv, problems.csv, etc.), each with patient_id but not the joined demographics.
  • patient_master.csv (this file) — long-format with demographics joined, designed for cross-referencing dashboard rows and for analysts who prefer one big file.
  • code_inventory.csv — aggregated by (vocabulary, code), one row per unique code, not per record.

For a colleague who only wants to spot-check the dashboard against a spreadsheet, the master CSV is the right starting point.