Skip to content

OMOP ETL

Preview — under active development. This module is still being refined. Patterns, mappings, and category boundaries should be validated against your own corpus before being relied on for analysis or publication.

omop_etl.py reads dashboard_data.json plus a folder of OMOP vocabulary CSVs downloaded from Athena, maps every source code to its standard concept, and writes OMOP CDM v5.4 tables to a folder tagged with the vocabulary release version.

What gets produced

OMOP table Source data Notes
PERSON.csv patients Gender mapped to 8507/8532; race and ethnicity left as 0 (not in source)
VISIT_OCCURRENCE.csv encounters Visit class (AMB/IMP/EMER) mapped to standard visit concepts
CONDITION_OCCURRENCE.csv problems whose standard concept is in the Condition domain
DRUG_EXPOSURE.csv medications + immunizations + drug-allergy entries whose standard concept is in the Drug domain sig populated from the medication display name
PROCEDURE_OCCURRENCE.csv procedures whose standard concept is in the Procedure domain
MEASUREMENT.csv labs/vitals + diagnostic reports whose standard concept is in the Measurement domain value_as_number and unit_source_value preserved
OBSERVATION.csv non-drug allergies and other Observation-domain entries
OBSERVATION_PERIOD.csv derived Earliest to latest event date per patient
CDM_SOURCE.csv metadata Records the exact vocabulary versions used

Records are routed by the domain of the standard concept, not by the source category. An ICD-10-CM code in problems whose Maps to target lives in the Procedure domain (rare but real, e.g. some Z-codes) ends up in PROCEDURE_OCCURRENCE, where it belongs.

Mapping logic

For each source (vocabulary, code) pair the ETL:

  1. Looks up the concept in CONCEPT.csv filtered to that vocabulary → source_concept_id.
  2. Follows CONCEPT_RELATIONSHIP.csv rows where relationship_id = 'Maps to' and concept_id_1 = source_concept_idconcept_id_2 is the standard concept.
  3. Reads domain_id of the standard concept → that's the target table.
  4. Writes a row with both *_concept_id (standard, queryable) and *_source_concept_id + *_source_value (lineage back to the raw code).

Codes that don't resolve are still written, with *_concept_id = 0 and the source values populated, so nothing is silently dropped. The log lists the unmapped codes with frequency counts.

Picking the primary code when a record has multiple codings

FHIR MedicationRequest.medicationCodeableConcept.coding[] (and the corresponding CCDA <code> plus <translation> children) frequently carries several codes for the same record — for example an Ingredient (RxNorm IN), a Semantic Clinical Drug (SCD), and a Brand Name (BN), all referring to the same prescription. The ETL captures every coding on all_codings so nothing is lost, but downstream rows in DRUG_EXPOSURE.drug_source_concept_id and drug_source_value need a single source code chosen as primary.

For the medication category the ETL applies a TTY-aware preference order based on each coding's concept_class_id in CONCEPT.csv:

SCD (Clinical Drug) > SBD (Branded Drug)
  > SCDC/SBDC (Quantified) > GPCK/BPCK (Pack)
  > SCDF/SBDF (Drug Form)
  > IN (Ingredient) > PIN > BN (Brand Name) > DF

SCD and SBD encode ingredient + strength + dose form, which is what OHDSI's DRUG_STRENGTH table and most downstream analytics actually key off. IN is ingredient-only (too generic); BN loses the strength. Picking SCD/SBD as primary lines up drug_source_value with the level of detail OHDSI expects. The OMOP standard drug_concept_id is reached via Maps to and is generally identical regardless of which TTY was chosen as primary — Athena's curation already collapses TTY variants — but drug_source_concept_id and drug_source_value carry the TTY-correct source code through to the analyst.

When Athena is not loaded, the upstream ingestion stage falls back to a lightweight heuristic that prefers codings whose display string carries both a strength pattern (50 MG, 500 MG/ML) and a dose-form word (Tablet, Solution, Injection). The OMOP ETL re-confirms with concept_class_id once Athena is present.

For non-medication categories the first coding matching the domain's preferred system wins. Future revisions may extend TTY-aware selection to LOINC (component vs panel codes) and SNOMED CT (preferred vs synonyms).

Vocabulary-release tagging

The output folder is named omop_output_<vocab_version>/ — the version is read from VOCABULARY.csv in your Athena download. For example:

omop_output_SNOMED_CT_International_Edition_2024-09-01/
omop_output_SNOMED_CT_International_Edition_2025-01-01/

Different Athena releases never overwrite each other. CDM_SOURCE.csv inside each folder records every vocabulary's version string verbatim.

Downloading the vocabulary from Athena

  1. Visit https://athena.ohdsi.org/vocabulary/list and sign in.
  2. Select at minimum: SNOMED, ICD10CM, RxNorm, RxNorm Extension, LOINC, CVX, CPT4.
  3. Click Download Vocabularies, give the bundle a name, accept the CPT-4 license terms (Athena will prompt; this requires a separate one-time license click).
  4. Wait for the email with the download link (typically a few minutes to an hour).
  5. Unzip the archive into a single folder. You should see CONCEPT.csv, CONCEPT_RELATIONSHIP.csv, VOCABULARY.csv, and several others. Point VOCAB_DIR at this folder.

Running it

Defaults assume your inputs are in the current directory:

# omop_etl.py top-of-file:
BUNDLE_PATH       = './dashboard_data.json'
VOCAB_DIR         = './vocab/'
OMOP_OUTPUT_BASE  = './omop_output'

From a notebook:

import omop_etl
omop_etl.main(
    bundle_path='./dashboard_data.json',
    vocab_dir='./vocab/',
    omop_output_base='./omop_output',
)

From the command line:

python omop_etl.py

Either path produces a folder like ./omop_output_<release_tag>/ with the nine CDM CSVs and a log file.

Memory and runtime

The Athena CONCEPT table is large — tens of millions of rows when all vocabularies are included. The ETL streams it once and keeps only:

  • All concepts whose vocabulary_id matches one of the source vocabs in your bundle, plus
  • All standard concepts in the target domains (used as Maps to destinations).

That keeps the in-memory footprint to a few hundred thousand rows for a typical setup. Runtime on a Colab CPU is dominated by the CONCEPT_RELATIONSHIP scan (a few minutes for a full Athena download).

Sanity checks after a run

import pandas as pd, os
out = './omop_output_<your_release>/'
for fn in sorted(os.listdir(out)):
    if not fn.endswith('.csv'): continue
    df = pd.read_csv(os.path.join(out, fn))
    print(f"{fn:<32} rows={len(df):>6,}")

# unmapped-rate per table
for fn, idcol in [('CONDITION_OCCURRENCE.csv','condition_concept_id'),
                  ('DRUG_EXPOSURE.csv','drug_concept_id'),
                  ('PROCEDURE_OCCURRENCE.csv','procedure_concept_id'),
                  ('MEASUREMENT.csv','measurement_concept_id')]:
    df = pd.read_csv(os.path.join(out, fn))
    if len(df):
        unmapped = (df[idcol] == 0).sum()
        print(f"{fn:<32} unmapped: {unmapped}/{len(df)}  ({unmapped/len(df)*100:.1f}%)")

A high unmapped rate usually means your Athena download is missing one of the source vocabularies — double-check CDM_SOURCE.csv and re-download from Athena with the missing vocab selected.