OMOP ETL¶
Preview — under active development. This module is still being refined. Patterns, mappings, and category boundaries should be validated against your own corpus before being relied on for analysis or publication.
omop_etl.py reads dashboard_data.json plus a folder of OMOP vocabulary CSVs downloaded from Athena, maps every source code to its standard concept, and writes OMOP CDM v5.4 tables to a folder tagged with the vocabulary release version.
What gets produced¶
| OMOP table | Source data | Notes |
|---|---|---|
PERSON.csv |
patients | Gender mapped to 8507/8532; race and ethnicity left as 0 (not in source) |
VISIT_OCCURRENCE.csv |
encounters | Visit class (AMB/IMP/EMER) mapped to standard visit concepts |
CONDITION_OCCURRENCE.csv |
problems whose standard concept is in the Condition domain | |
DRUG_EXPOSURE.csv |
medications + immunizations + drug-allergy entries whose standard concept is in the Drug domain | sig populated from the medication display name |
PROCEDURE_OCCURRENCE.csv |
procedures whose standard concept is in the Procedure domain | |
MEASUREMENT.csv |
labs/vitals + diagnostic reports whose standard concept is in the Measurement domain | value_as_number and unit_source_value preserved |
OBSERVATION.csv |
non-drug allergies and other Observation-domain entries | |
OBSERVATION_PERIOD.csv |
derived | Earliest to latest event date per patient |
CDM_SOURCE.csv |
metadata | Records the exact vocabulary versions used |
Records are routed by the domain of the standard concept, not by the source category. An ICD-10-CM code in problems whose Maps to target lives in the Procedure domain (rare but real, e.g. some Z-codes) ends up in PROCEDURE_OCCURRENCE, where it belongs.
Mapping logic¶
For each source (vocabulary, code) pair the ETL:
- Looks up the concept in
CONCEPT.csvfiltered to that vocabulary →source_concept_id. - Follows
CONCEPT_RELATIONSHIP.csvrows whererelationship_id = 'Maps to'andconcept_id_1 = source_concept_id→concept_id_2is the standard concept. - Reads
domain_idof the standard concept → that's the target table. - Writes a row with both
*_concept_id(standard, queryable) and*_source_concept_id+*_source_value(lineage back to the raw code).
Codes that don't resolve are still written, with *_concept_id = 0 and the source values populated, so nothing is silently dropped. The log lists the unmapped codes with frequency counts.
Picking the primary code when a record has multiple codings¶
FHIR MedicationRequest.medicationCodeableConcept.coding[] (and the corresponding CCDA <code> plus <translation> children) frequently carries several codes for the same record — for example an Ingredient (RxNorm IN), a Semantic Clinical Drug (SCD), and a Brand Name (BN), all referring to the same prescription. The ETL captures every coding on all_codings so nothing is lost, but downstream rows in DRUG_EXPOSURE.drug_source_concept_id and drug_source_value need a single source code chosen as primary.
For the medication category the ETL applies a TTY-aware preference order based on each coding's concept_class_id in CONCEPT.csv:
SCD (Clinical Drug) > SBD (Branded Drug)
> SCDC/SBDC (Quantified) > GPCK/BPCK (Pack)
> SCDF/SBDF (Drug Form)
> IN (Ingredient) > PIN > BN (Brand Name) > DF
SCD and SBD encode ingredient + strength + dose form, which is what OHDSI's DRUG_STRENGTH table and most downstream analytics actually key off. IN is ingredient-only (too generic); BN loses the strength. Picking SCD/SBD as primary lines up drug_source_value with the level of detail OHDSI expects. The OMOP standard drug_concept_id is reached via Maps to and is generally identical regardless of which TTY was chosen as primary — Athena's curation already collapses TTY variants — but drug_source_concept_id and drug_source_value carry the TTY-correct source code through to the analyst.
When Athena is not loaded, the upstream ingestion stage falls back to a lightweight heuristic that prefers codings whose display string carries both a strength pattern (50 MG, 500 MG/ML) and a dose-form word (Tablet, Solution, Injection). The OMOP ETL re-confirms with concept_class_id once Athena is present.
For non-medication categories the first coding matching the domain's preferred system wins. Future revisions may extend TTY-aware selection to LOINC (component vs panel codes) and SNOMED CT (preferred vs synonyms).
Vocabulary-release tagging¶
The output folder is named omop_output_<vocab_version>/ — the version is read from VOCABULARY.csv in your Athena download. For example:
omop_output_SNOMED_CT_International_Edition_2024-09-01/
omop_output_SNOMED_CT_International_Edition_2025-01-01/
Different Athena releases never overwrite each other. CDM_SOURCE.csv inside each folder records every vocabulary's version string verbatim.
Downloading the vocabulary from Athena¶
- Visit https://athena.ohdsi.org/vocabulary/list and sign in.
- Select at minimum: SNOMED, ICD10CM, RxNorm, RxNorm Extension, LOINC, CVX, CPT4.
- Click Download Vocabularies, give the bundle a name, accept the CPT-4 license terms (Athena will prompt; this requires a separate one-time license click).
- Wait for the email with the download link (typically a few minutes to an hour).
- Unzip the archive into a single folder. You should see
CONCEPT.csv,CONCEPT_RELATIONSHIP.csv,VOCABULARY.csv, and several others. PointVOCAB_DIRat this folder.
Running it¶
Defaults assume your inputs are in the current directory:
# omop_etl.py top-of-file:
BUNDLE_PATH = './dashboard_data.json'
VOCAB_DIR = './vocab/'
OMOP_OUTPUT_BASE = './omop_output'
From a notebook:
import omop_etl
omop_etl.main(
bundle_path='./dashboard_data.json',
vocab_dir='./vocab/',
omop_output_base='./omop_output',
)
From the command line:
Either path produces a folder like ./omop_output_<release_tag>/ with the nine CDM CSVs and a log file.
Memory and runtime¶
The Athena CONCEPT table is large — tens of millions of rows when all vocabularies are included. The ETL streams it once and keeps only:
- All concepts whose
vocabulary_idmatches one of the source vocabs in your bundle, plus - All standard concepts in the target domains (used as
Maps todestinations).
That keeps the in-memory footprint to a few hundred thousand rows for a typical setup. Runtime on a Colab CPU is dominated by the CONCEPT_RELATIONSHIP scan (a few minutes for a full Athena download).
Sanity checks after a run¶
import pandas as pd, os
out = './omop_output_<your_release>/'
for fn in sorted(os.listdir(out)):
if not fn.endswith('.csv'): continue
df = pd.read_csv(os.path.join(out, fn))
print(f"{fn:<32} rows={len(df):>6,}")
# unmapped-rate per table
for fn, idcol in [('CONDITION_OCCURRENCE.csv','condition_concept_id'),
('DRUG_EXPOSURE.csv','drug_concept_id'),
('PROCEDURE_OCCURRENCE.csv','procedure_concept_id'),
('MEASUREMENT.csv','measurement_concept_id')]:
df = pd.read_csv(os.path.join(out, fn))
if len(df):
unmapped = (df[idcol] == 0).sum()
print(f"{fn:<32} unmapped: {unmapped}/{len(df)} ({unmapped/len(df)*100:.1f}%)")
A high unmapped rate usually means your Athena download is missing one of the source vocabularies — double-check CDM_SOURCE.csv and re-download from Athena with the missing vocab selected.