Code inventory¶
Stage 7 of the pipeline writes code_inventory.csv next to dashboard_data.json. It lists every unique (vocabulary, code) pair seen anywhere in the bundle, along with how often it was referenced, how many distinct patients had it, the most-common display name observed, and which bundle categories it appeared in.
Columns¶
| Column | Meaning |
|---|---|
vocabulary |
Source vocabulary as named by the pipeline (SNOMED-CT, ICD-10-CM, RxNorm, LOINC, CVX, CPT-4, FHIR encounter class, ...) |
code |
Source code value |
display_name |
Most-common display string seen for this code across all references |
n_references |
Total number of records (across all categories) referencing this code |
n_unique_patients |
Number of distinct patients who have at least one reference to this code |
source_categories |
Semicolon-separated list of bundle tabs this code appears in (e.g. problems;diagnostic_reports) |
How codes are collected¶
For each record in the coded categories — problems, medications, procedures, allergies, immunizations, labs_vitals, diagnostic_reports, document_references, encounters — the pipeline walks each entry's all_codings array. A record with both a SNOMED concept and an ICD-10-CM translation contributes two rows to the inventory (one per vocabulary). When all_codings is missing, the pipeline falls back to the record's top-level code and code_system fields.
The split labs and vitals views are skipped because their rows are duplicated in the unified labs_vitals tab; including all three would triple-count the same observations.
Reading the inventory¶
The CSV is sorted by vocabulary, then by descending reference count, then by code. A typical first look:
import pandas as pd
inv = pd.read_csv('code_inventory.csv')
print(inv['vocabulary'].value_counts()) # codes per vocab
print(inv.sort_values('n_references', ascending=False).head(20)) # most-used codes
print(inv[inv['n_unique_patients'] >= 10].shape[0]) # codes hit by many patients
The inventory is the input to the OMOP ETL and is also useful on its own — for QA passes, to spot codes whose displays are inconsistent across records, or to size a vocabulary download from Athena.