Output schema¶

patient_master.csv follows the exact same schema as Registry Forge's master CSV. This page documents each column. If you're already familiar with Registry Forge, you can skim - the only behavioral differences are in the source column (always ccda or empty) and the absence of fhir rows.

Columns¶

Demographic block (always present)¶

Column	Required	Notes
`last_name`	yes	Patient's last name
`first_name`	yes	Patient's first name
`dob`	yes	Date of birth, `YYYY-MM-DD`
`patient_id`	yes	Canonical patient identifier, format `PT-XXXXXXXXXX`
`mrn`	yes	Medical record number when available; blank otherwise
`gender`	yes	`male`, `female`, `other`, `unknown`
`marital_status`	when recorded	Free-text from `maritalStatusCode/@displayName`

The demographic block is duplicated onto every row of every record, so a filter to one patient's rows is fully self-describing.

Record block¶

Column	Required	Notes
`category`	yes	One of: `patient`, `problems`, `medications`, `procedures`, `labs_vitals`, `allergies`, `immunizations`, `careplans`, `diagnostic_reports`, `goals`, `notes`, `encounters`, `observations`
`source`	yes	Always `ccda` in Patient Edition (or empty for patient header rows)
`effective_date`	when known	`YYYY-MM-DD`. For the `patient` row, this is DOB.
`end_date`	when known	End/resolution date
`code`	when coded	Primary code value
`code_system`	when coded	Vocabulary name: `SNOMED-CT`, `RxNorm`, `LOINC`, `ICD-10-CM`, `ICD-9-CM`, `CPT-4`, `HCPCS`, `CVX`, etc.
`display_name`	when known	Human-readable label for the primary code
`all_codings_json`	when coded	JSON array of every coding on the entry, including translations
`value`	for labs/vitals	Numeric or text value
`unit`	for labs/vitals	Units (UCUM strings, usually)
`status`	when recorded	`active`, `completed`, `final`, etc.
`text`	varies	Free-text or display name fallback

Provenance¶

Column	Required	Notes
`source_file`	yes	Relative path to the source XML file
`raw_record_json`	yes	Compact JSON of the parsed entry, for reviewer drill-down

Categories¶

Category	Source section LOINC	Typical content
`patient`	-	One header row per (patient, document) pair
`problems`	11450-4	Diagnoses, problem list entries
`medications`	10160-0	Prescribed and self-reported medications
`procedures`	47519-4	Surgical and medical procedures
`labs_vitals`	30954-2, 8716-3	Lab results and vital signs
`allergies`	48765-2	Drug, food, environmental allergies
`immunizations`	11369-6	Vaccines administered
`careplans`	18776-5	Plan of care narratives
`diagnostic_reports`	18748-4	Imaging and other diagnostic reports
`encounters`	46239-0, 46240-8	Visits, admissions
`goals`	61146-7	Patient and care-team goals
`notes`	29762-2, 10157-6	Social and family history; fallback for narrative-only sections
`observations`	(other)	Sections we don't recognize map here

Example rows¶

A condensed look at what real rows in the master CSV contain (synthetic data):

last_name | first_name | dob        | patient_id    | category    | source | effective_date | code      | code_system | display_name        | value | unit
Patient   | Test       | 1965-03-15 | PT-1D2DF76565 | patient     |        | 1965-03-15     |           |             | Test Patient        |       |
Patient   | Test       | 1965-03-15 | PT-1D2DF76565 | problems    | ccda   | 2023-01-01     | 55505003  | SNOMED-CT   | Motor neuron disease|       |
Patient   | Test       | 1965-03-15 | PT-1D2DF76565 | medications | ccda   | 2023-02-01     | 9468      | RxNorm      | Riluzole 50 MG ...  |       |
Patient   | Test       | 1965-03-15 | PT-1D2DF76565 | labs_vitals | ccda   | 2024-06-01     | 2160-0    | LOINC       | Creatinine, serum   | 0.9   | mg/dL

Reading the CSV¶

Python / pandas¶

import pandas as pd
df = pd.read_csv('patient_master.csv', encoding='utf-8-sig', dtype=str)

# Counts per category per patient
df.groupby(['patient_id','category']).size().unstack(fill_value=0)

# All medications, sorted by date, for one patient
df.query("patient_id == 'PT-1D2DF76565' and category == 'medications'") \
  .sort_values('effective_date')[['effective_date','code','display_name','text']]

# All LOINC labs across the cohort with numeric values
labs = df[df['category'] == 'labs_vitals'].copy()
labs['value_num'] = pd.to_numeric(labs['value'], errors='coerce')

R¶

df <- read.csv('patient_master.csv', encoding='UTF-8', stringsAsFactors=FALSE)
table(df$category, df$source)

Excel¶

The file is UTF-8 with BOM, so it opens cleanly in Excel without an encoding prompt. Add filters via Data → Filter and pivot on patient_id or category.

Compatibility with Registry Forge¶

Because the schemas match, you can:

Concatenate a Patient Edition patient_master.csv with a Registry Forge patient_master.csv in pandas (pd.concat).
Feed our output into the Registry Forge OMOP ETL - it accepts the master CSV as input.
Use any analytic notebook built for Registry Forge against Patient Edition outputs and vice versa.

The only practical difference is the source column distribution: Registry Forge produces a mix of ccda and fhir rows; Patient Edition produces only ccda. Any downstream code that filters or groups on source will work but find fewer source types.