Output schema¶
patient_master.csv follows the exact same schema as Registry Forge's master CSV. This page documents each column. If you're already familiar with Registry Forge, you can skim - the only behavioral differences are in the source column (always ccda or empty) and the absence of fhir rows.
Columns¶
Demographic block (always present)¶
| Column | Required | Notes |
|---|---|---|
last_name |
yes | Patient's last name |
first_name |
yes | Patient's first name |
dob |
yes | Date of birth, YYYY-MM-DD |
patient_id |
yes | Canonical patient identifier, format PT-XXXXXXXXXX |
mrn |
yes | Medical record number when available; blank otherwise |
gender |
yes | male, female, other, unknown |
marital_status |
when recorded | Free-text from maritalStatusCode/@displayName |
The demographic block is duplicated onto every row of every record, so a filter to one patient's rows is fully self-describing.
Record block¶
| Column | Required | Notes |
|---|---|---|
category |
yes | One of: patient, problems, medications, procedures, labs_vitals, allergies, immunizations, careplans, diagnostic_reports, goals, notes, encounters, observations |
source |
yes | Always ccda in Patient Edition (or empty for patient header rows) |
effective_date |
when known | YYYY-MM-DD. For the patient row, this is DOB. |
end_date |
when known | End/resolution date |
code |
when coded | Primary code value |
code_system |
when coded | Vocabulary name: SNOMED-CT, RxNorm, LOINC, ICD-10-CM, ICD-9-CM, CPT-4, HCPCS, CVX, etc. |
display_name |
when known | Human-readable label for the primary code |
all_codings_json |
when coded | JSON array of every coding on the entry, including translations |
value |
for labs/vitals | Numeric or text value |
unit |
for labs/vitals | Units (UCUM strings, usually) |
status |
when recorded | active, completed, final, etc. |
text |
varies | Free-text or display name fallback |
Provenance¶
| Column | Required | Notes |
|---|---|---|
source_file |
yes | Relative path to the source XML file |
raw_record_json |
yes | Compact JSON of the parsed entry, for reviewer drill-down |
Categories¶
| Category | Source section LOINC | Typical content |
|---|---|---|
patient |
- | One header row per (patient, document) pair |
problems |
11450-4 | Diagnoses, problem list entries |
medications |
10160-0 | Prescribed and self-reported medications |
procedures |
47519-4 | Surgical and medical procedures |
labs_vitals |
30954-2, 8716-3 | Lab results and vital signs |
allergies |
48765-2 | Drug, food, environmental allergies |
immunizations |
11369-6 | Vaccines administered |
careplans |
18776-5 | Plan of care narratives |
diagnostic_reports |
18748-4 | Imaging and other diagnostic reports |
encounters |
46239-0, 46240-8 | Visits, admissions |
goals |
61146-7 | Patient and care-team goals |
notes |
29762-2, 10157-6 | Social and family history; fallback for narrative-only sections |
observations |
(other) | Sections we don't recognize map here |
Example rows¶
A condensed look at what real rows in the master CSV contain (synthetic data):
last_name | first_name | dob | patient_id | category | source | effective_date | code | code_system | display_name | value | unit
Patient | Test | 1965-03-15 | PT-1D2DF76565 | patient | | 1965-03-15 | | | Test Patient | |
Patient | Test | 1965-03-15 | PT-1D2DF76565 | problems | ccda | 2023-01-01 | 55505003 | SNOMED-CT | Motor neuron disease| |
Patient | Test | 1965-03-15 | PT-1D2DF76565 | medications | ccda | 2023-02-01 | 9468 | RxNorm | Riluzole 50 MG ... | |
Patient | Test | 1965-03-15 | PT-1D2DF76565 | labs_vitals | ccda | 2024-06-01 | 2160-0 | LOINC | Creatinine, serum | 0.9 | mg/dL
Reading the CSV¶
Python / pandas¶
import pandas as pd
df = pd.read_csv('patient_master.csv', encoding='utf-8-sig', dtype=str)
# Counts per category per patient
df.groupby(['patient_id','category']).size().unstack(fill_value=0)
# All medications, sorted by date, for one patient
df.query("patient_id == 'PT-1D2DF76565' and category == 'medications'") \
.sort_values('effective_date')[['effective_date','code','display_name','text']]
# All LOINC labs across the cohort with numeric values
labs = df[df['category'] == 'labs_vitals'].copy()
labs['value_num'] = pd.to_numeric(labs['value'], errors='coerce')
R¶
df <- read.csv('patient_master.csv', encoding='UTF-8', stringsAsFactors=FALSE)
table(df$category, df$source)
Excel¶
The file is UTF-8 with BOM, so it opens cleanly in Excel without an encoding prompt. Add filters via Data → Filter and pivot on patient_id or category.
Compatibility with Registry Forge¶
Because the schemas match, you can:
- Concatenate a Patient Edition
patient_master.csvwith a Registry Forgepatient_master.csvin pandas (pd.concat). - Feed our output into the Registry Forge OMOP ETL - it accepts the master CSV as input.
- Use any analytic notebook built for Registry Forge against Patient Edition outputs and vice versa.
The only practical difference is the source column distribution: Registry Forge produces a mix of ccda and fhir rows; Patient Edition produces only ccda. Any downstream code that filters or groups on source will work but find fewer source types.