Skip to content

Output schema

patient_master.csv follows the exact same schema as Registry Forge's master CSV. This page documents each column. If you're already familiar with Registry Forge, you can skim - the only behavioral differences are in the source column (always ccda or empty) and the absence of fhir rows.

Columns

Demographic block (always present)

Column Required Notes
last_name yes Patient's last name
first_name yes Patient's first name
dob yes Date of birth, YYYY-MM-DD
patient_id yes Canonical patient identifier, format PT-XXXXXXXXXX
mrn yes Medical record number when available; blank otherwise
gender yes male, female, other, unknown
marital_status when recorded Free-text from maritalStatusCode/@displayName

The demographic block is duplicated onto every row of every record, so a filter to one patient's rows is fully self-describing.

Record block

Column Required Notes
category yes One of: patient, problems, medications, procedures, labs_vitals, allergies, immunizations, careplans, diagnostic_reports, goals, notes, encounters, observations
source yes Always ccda in Patient Edition (or empty for patient header rows)
effective_date when known YYYY-MM-DD. For the patient row, this is DOB.
end_date when known End/resolution date
code when coded Primary code value
code_system when coded Vocabulary name: SNOMED-CT, RxNorm, LOINC, ICD-10-CM, ICD-9-CM, CPT-4, HCPCS, CVX, etc.
display_name when known Human-readable label for the primary code
all_codings_json when coded JSON array of every coding on the entry, including translations
value for labs/vitals Numeric or text value
unit for labs/vitals Units (UCUM strings, usually)
status when recorded active, completed, final, etc.
text varies Free-text or display name fallback

Provenance

Column Required Notes
source_file yes Relative path to the source XML file
raw_record_json yes Compact JSON of the parsed entry, for reviewer drill-down

Categories

Category Source section LOINC Typical content
patient - One header row per (patient, document) pair
problems 11450-4 Diagnoses, problem list entries
medications 10160-0 Prescribed and self-reported medications
procedures 47519-4 Surgical and medical procedures
labs_vitals 30954-2, 8716-3 Lab results and vital signs
allergies 48765-2 Drug, food, environmental allergies
immunizations 11369-6 Vaccines administered
careplans 18776-5 Plan of care narratives
diagnostic_reports 18748-4 Imaging and other diagnostic reports
encounters 46239-0, 46240-8 Visits, admissions
goals 61146-7 Patient and care-team goals
notes 29762-2, 10157-6 Social and family history; fallback for narrative-only sections
observations (other) Sections we don't recognize map here

Example rows

A condensed look at what real rows in the master CSV contain (synthetic data):

last_name | first_name | dob        | patient_id    | category    | source | effective_date | code      | code_system | display_name        | value | unit
Patient   | Test       | 1965-03-15 | PT-1D2DF76565 | patient     |        | 1965-03-15     |           |             | Test Patient        |       |
Patient   | Test       | 1965-03-15 | PT-1D2DF76565 | problems    | ccda   | 2023-01-01     | 55505003  | SNOMED-CT   | Motor neuron disease|       |
Patient   | Test       | 1965-03-15 | PT-1D2DF76565 | medications | ccda   | 2023-02-01     | 9468      | RxNorm      | Riluzole 50 MG ...  |       |
Patient   | Test       | 1965-03-15 | PT-1D2DF76565 | labs_vitals | ccda   | 2024-06-01     | 2160-0    | LOINC       | Creatinine, serum   | 0.9   | mg/dL

Reading the CSV

Python / pandas

import pandas as pd
df = pd.read_csv('patient_master.csv', encoding='utf-8-sig', dtype=str)

# Counts per category per patient
df.groupby(['patient_id','category']).size().unstack(fill_value=0)

# All medications, sorted by date, for one patient
df.query("patient_id == 'PT-1D2DF76565' and category == 'medications'") \
  .sort_values('effective_date')[['effective_date','code','display_name','text']]

# All LOINC labs across the cohort with numeric values
labs = df[df['category'] == 'labs_vitals'].copy()
labs['value_num'] = pd.to_numeric(labs['value'], errors='coerce')

R

df <- read.csv('patient_master.csv', encoding='UTF-8', stringsAsFactors=FALSE)
table(df$category, df$source)

Excel

The file is UTF-8 with BOM, so it opens cleanly in Excel without an encoding prompt. Add filters via Data → Filter and pivot on patient_id or category.

Compatibility with Registry Forge

Because the schemas match, you can:

  • Concatenate a Patient Edition patient_master.csv with a Registry Forge patient_master.csv in pandas (pd.concat).
  • Feed our output into the Registry Forge OMOP ETL - it accepts the master CSV as input.
  • Use any analytic notebook built for Registry Forge against Patient Edition outputs and vice versa.

The only practical difference is the source column distribution: Registry Forge produces a mix of ccda and fhir rows; Patient Edition produces only ccda. Any downstream code that filters or groups on source will work but find fewer source types.