Mondo-OMOP bridge & rare disease cohort builder
Preview — under active development. This module is still being refined. Patterns, mappings, and category boundaries should be validated against your own corpus before being relied on for analysis or publication.
mondo_omop_bridge.py is the inverse of phenopackets_etl.py. The Phenopackets ETL runs forward — source codes in the bundle → HPO/Mondo for a Phenopacket. The bridge runs backward — given a Mondo term ID, walk the Mondo disease hierarchy to find every descendant, then produce a code list that defines the cohort.
Use it when you want to answer questions like:
- "Give me everyone with anything in the spectrum of motor neuron disease."
- "Build me a code list for an epilepsy cohort I can paste into a query against any EHR or claims warehouse."
- "Find all GARD-designated rare neurologic diseases in our cohort."
Attribution
The mapping logic and the use of Mondo's KGX same_as cross-references is adapted directly from the Monarch Initiative's mondo2omop repository (MIT-licensed). We reuse their approach with credit, and extend it with a per-target cohort-build entry point, an optional Athena-free fallback, and a stdlib BFS implementation for environments without NetworkX.
What it produces
The module emits one master mapping table and two per-cohort files for every target Mondo ID:
| File | Contents |
|---|---|
| MONDO2OMOP_<release>.tsv | One row per (Mondo term, source vocabulary, source code). Columns: mondo_id, mondo_label, mondo_description, source_vocabulary (ICD10CM, SNOMED, MeSH), source_code, standard_concept_id (OMOP, Condition domain), standard_concept_name, standard_vocabulary, standard_concept_code, plus six rare disease subset flags. The Athena-derived columns are omitted when no vocab_dir is provided. |
| cohort_<MONDO_id>_codes.tsv | The code list that defines the cohort: one row per (Mondo descendant of target, source vocabulary, code). Drop the source-code column into a Databricks query or an EHR chart-review filter. |
| cohort_<MONDO_id>_omop.tsv | Same cohort joined to OMOP standard concept_ids. Drop the standard_concept_id list into a CONDITION_OCCURRENCE filter on your OMOP CDM extract. Skipped if vocab_dir is None. |
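The "one row per (Mondo term, source vocabulary, source code)" shape can be illustrated with a small sketch. It assumes the cross-references arrive as a pipe-delimited string of CURIEs, which is an assumption about the input rather than something this page states. The `expand_xrefs` helper and the `PREFIX_TO_VOCAB` map (including the `SCTID` and `MESH` prefixes) are hypothetical names for illustration, not the module's API.

```python
# Hypothetical map from CURIE prefix to the vocabulary name used in the output table.
# The prefixes are assumptions for this sketch.
PREFIX_TO_VOCAB = {"ICD10CM": "ICD10CM", "SCTID": "SNOMED", "MESH": "MeSH"}


def expand_xrefs(mondo_id, mondo_label, same_as, sep="|"):
    """Return one (mondo_id, mondo_label, source_vocabulary, source_code) row per mapped xref."""
    rows = []
    for curie in same_as.split(sep):
        prefix, _, code = curie.strip().partition(":")
        vocab = PREFIX_TO_VOCAB.get(prefix.upper())
        if vocab and code:  # silently skip prefixes outside the three target vocabularies
            rows.append((mondo_id, mondo_label, vocab, code))
    return rows


# Illustrative input: two mappable xrefs and one prefix that is not in the map.
rows = expand_xrefs(
    "MONDO:0004976",
    "amyotrophic lateral sclerosis",
    "ICD10CM:G12.21|SCTID:86044005|DOID:332",
)
for row in rows:
    print("\t".join(row))
```

Keeping the prefix map in one place makes it easy to see which cross-references are dropped, which is worth checking against your own corpus.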
Every output row carries the six rare disease subset flags that Mondo annotates:
| Flag | What it means |
|---|---|
| rare | Mondo's general rare disease flag |
| gard_rare | Listed in GARD (NIH Genetic and Rare Diseases Information Center) |
| nord_rare | Listed by NORD (National Organization for Rare Disorders) |
| orphanet_rare | Listed in Orphanet |
| inferred_rare | Inferred rare by Mondo's reasoner |
| mondo_rare | Mondo's own rare disease subset designation |
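Because the flags travel with every row, selecting a subset such as the GARD-listed terms is a plain filter over the TSV. The sketch below assumes the flags are written as 1/0, as in the worked-example snippet on this page. `rows_with_flag` is a hypothetical helper, and the sample rows are copied from that snippet.

```python
import csv
import io


def rows_with_flag(tsv_file, flag):
    """Return the rows of a tab-separated output file whose `flag` column equals "1"."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return [row for row in reader if row.get(flag) == "1"]


HEADER = ["mondo_id", "mondo_label", "source_vocabulary", "source_code",
          "rare", "gard_rare", "nord_rare", "orphanet_rare", "inferred_rare", "mondo_rare"]
ROWS = [
    ["MONDO:0004976", "amyotrophic lateral sclerosis",
     "ICD10CM", "G12.21", "1", "1", "0", "1", "0", "0"],
    ["MONDO:0019469", "amyotrophic lateral sclerosis-frontotemporal dementia",
     "ICD10CM", "G31.09", "1", "1", "0", "1", "0", "0"],
]
sample_tsv = "\n".join("\t".join(r) for r in [HEADER] + ROWS) + "\n"

gard_rows = rows_with_flag(io.StringIO(sample_tsv), "gard_rare")
nord_rows = rows_with_flag(io.StringIO(sample_tsv), "nord_rare")
print(len(gard_rows), len(nord_rows))
```

For a real file, pass `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` wrapper.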
How the descendant walk works
Mondo's KGX edges file records is_a (subclass_of) relationships between disease terms. The module loads those edges into a directed graph (NetworkX if available, plain dict-of-sets BFS otherwise) and computes the descendants of the target term.
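A minimal sketch of that walk, assuming the edges have already been parsed into (child, parent) pairs. The function name `descendants` and the toy edge list are illustrative and not taken from the module. With edges pointing child to parent, the ontology descendants of a term are its graph ancestors in NetworkX terms, which is why the sketch calls `nx.ancestors`.

```python
from collections import deque

try:
    import networkx as nx  # optional dependency
except ImportError:
    nx = None


def descendants(edges, target):
    """Return every term below `target`, given (child, parent) subclass_of pairs."""
    if nx is not None:
        graph = nx.DiGraph()
        graph.add_edges_from(edges)  # edge direction: child -> parent
        if target not in graph:
            return set()
        # Nodes with a path *to* target are its ontology descendants.
        return set(nx.ancestors(graph, target))

    # Stdlib fallback: parent -> {children} adjacency, then breadth-first search.
    children = {}
    for child, parent in edges:
        children.setdefault(parent, set()).add(child)
    seen, queue = set(), deque([target])
    while queue:
        for child in children.get(queue.popleft(), ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    seen.discard(target)  # match nx.ancestors, which never includes the start node
    return seen


# Toy hierarchy: B and C are subclasses of A; D is a subclass of B.
toy_edges = [("B", "A"), ("C", "A"), ("D", "B")]
print(sorted(descendants(toy_edges, "A")))
```

Both branches return the same set on an acyclic hierarchy, so the choice between them only affects speed.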
Following the upstream mondo2omop logic, the module restricts to terms that are descendants of MONDO:0700096 (human disease) and excludes descendants of MONDO:0042489 (disease susceptibility), MONDO:0021125 (disease characteristic), and MONDO:0021178 (injury) before the walk.
Picking the right target term matters. "ALS" specifically (MONDO:0004976) returns just ALS and ALS-FTD because PMA and PLS are siblings in the Mondo hierarchy, not descendants. To get the full spectrum of motor neuron disease (ALS + PMA + PLS + ALS-FTD), target their shared parent: MONDO:0019056 motor neuron disease.
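The restriction step and the effect of target choice can be sketched together on a toy hierarchy. The Mondo IDs below are the ones named in this section; the `EX:` identifiers are placeholders, and the parent links are simplified for illustration rather than copied from a Mondo release. Whether the module removes the excluded roots themselves, as this sketch does, is an assumption.

```python
def descendants(children, root):
    """Iteratively collect every term below `root` in a parent -> {children} map."""
    seen, stack = set(), [root]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


HUMAN_DISEASE = "MONDO:0700096"
EXCLUDED_ROOTS = ["MONDO:0042489", "MONDO:0021125", "MONDO:0021178"]

# Toy parent -> children map (simplified; EX:* terms are placeholders).
children = {
    HUMAN_DISEASE: {"MONDO:0019056", "EX:INJ"},
    "MONDO:0021178": {"EX:INJ"},                                  # injury branch
    "MONDO:0019056": {"MONDO:0004976", "EX:PMA", "EX:PLS"},      # motor neuron disease
    "MONDO:0004976": {"MONDO:0019469"},                          # ALS -> ALS-FTD
}

# Keep human-disease descendants, then drop each excluded branch.
allowed = descendants(children, HUMAN_DISEASE)
for root in EXCLUDED_ROOTS:
    allowed -= {root} | descendants(children, root)


def cohort_terms(target):
    """Target plus its descendants, limited to the allowed disease terms."""
    return ({target} | descendants(children, target)) & allowed


print(sorted(cohort_terms("MONDO:0004976")))  # ALS only reaches ALS-FTD
print(sorted(cohort_terms("MONDO:0019056")))  # the parent also reaches PMA and PLS
```

In this toy graph the ALS target yields two terms while the motor neuron disease target yields five, which mirrors the sibling-versus-descendant point above.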
A few useful anchors for ALS TDI's domain:
| Anchor | What you get |
|---|---|
| MONDO:0019056 motor neuron disease | Spectrum of motor neuron disease (ALS, PMA, PLS, ALS-FTD, juvenile MND, hereditary forms) |
| MONDO:0004976 amyotrophic lateral sclerosis | Just ALS proper and its sub-types |
| MONDO:0005027 epilepsy | All epilepsy syndromes (focal, generalized, syndromic) |
| MONDO:0005301 multiple sclerosis | MS and its phenotypes (RRMS, SPMS, PPMS) |
| MONDO:0007915 systemic lupus erythematosus | SLE and its subtypes |
| MONDO:0005071 nervous system disorder | Everything neurologic (very broad) |
Browse the full hierarchy at Monarch Initiative or the Mondo OBO browser.
Running it
The module accepts a pre-downloaded Mondo KGX directory or downloads a release itself:
```python
import mondo_omop_bridge

mondo_omop_bridge.main(
    mondo_kgx_dir=None,                  # None = download
    mondo_version='2026-04-07',          # if downloading; current release
    vocab_dir='/path/to/Athena/Vocab',   # optional; enables OMOP join
    target_mondo_ids=[
        'MONDO:0019056',                 # spectrum of motor neuron disease
        'MONDO:0005027',                 # epilepsy
        'MONDO:0005301',                 # MS
    ],
    out_root='./mondo_omop_output',
)
```
From the command line, with a few defaults pre-set:
In Colab, with Mondo cached on Drive:
```python
import sys
sys.path.insert(0, '/content/work')
import mondo_omop_bridge

DRIVE = '/content/drive/MyDrive/ALS_TDI_complete_FINAL_PIPELINE'
mondo_omop_bridge.main(
    mondo_kgx_dir=f'{DRIVE}/mondo_kgx',   # pre-downloaded
    vocab_dir=f'{DRIVE}/Vocab',           # Athena bundle
    target_mondo_ids=['MONDO:0019056', 'MONDO:0005027'],
    out_root=f'{DRIVE}/mondo_omop_output',
)
```
Mondo download sources. The module fetches the Mondo ontology from the canonical sources in the following order, falling back as needed:
- GitHub releases (primary, recommended): https://github.com/monarch-initiative/mondo/releases. The Monarch Initiative publishes monthly tagged releases (e.g. v2026-04-07) with assets including mondo.json, mondo.owl, and mondo.obo. The bridge module reads mondo.json from the release matching mondo_version (or latest if mondo_version=None). This is the stable, versioned, and citable source.
- OBO PURL (fallback): http://purl.obolibrary.org/obo/mondo.json. Always-current latest version; use this when you want the bleeding edge without specifying a version. The Mondo download page is https://mondo.monarchinitiative.org/pages/download/.
- Legacy KGX TSV: the previous KG-OBO mirror at kg-hub.berkeleybop.io/kg-obo/mondo/ is deprecated and no longer maintained. The module retains a load path for KGX TSV (mondo_kgx_tsv_nodes.tsv + mondo_kgx_tsv_edges.tsv) so adopters with cached KGX files can still use them, but new users should use the GitHub or PURL paths above.
The mondo.json file is approximately 50–100 MB; first run downloads it, subsequent runs against the same Mondo version reuse the cached copy.
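The ordered fallback and the cache reuse can be sketched as follows. This is not the module's code: `candidate_urls`, `fetch_mondo_json`, and the cache file naming are hypothetical, the sketch uses the standard library's `urllib` where the module itself relies on `requests`, and the release asset URL follows GitHub's general `releases/download/<tag>/<asset>` pattern, which should be checked against the actual release page.

```python
from pathlib import Path
from urllib.request import urlretrieve


def candidate_urls(version=None):
    """Sources in priority order: versioned GitHub release asset, then the OBO PURL."""
    release = f"download/v{version}" if version else "latest/download"
    return [
        f"https://github.com/monarch-initiative/mondo/releases/{release}/mondo.json",
        "http://purl.obolibrary.org/obo/mondo.json",
    ]


def fetch_mondo_json(cache_dir, version=None, download=urlretrieve):
    """Return a cached mondo.json path, downloading it only when it is missing."""
    dest = Path(cache_dir) / f"mondo_{version or 'latest'}.json"
    if dest.exists():
        return dest  # subsequent runs reuse the cached copy
    dest.parent.mkdir(parents=True, exist_ok=True)
    partial = dest.with_name(dest.name + ".part")
    errors = []
    for url in candidate_urls(version):
        try:
            download(url, partial)  # write to a temp name first
            partial.replace(dest)   # publish only complete downloads
            return dest
        except OSError as exc:      # urllib's URLError is an OSError subclass
            errors.append(f"{url}: {exc}")
    raise RuntimeError("all Mondo sources failed: " + "; ".join(errors))
```

A call such as `fetch_mondo_json("./mondo_cache", "2026-04-07")` would try the tagged release first and the PURL second. Note that an unversioned "latest" copy cached this way never refreshes on its own, so pinning a version is the safer default for reproducible cohorts.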
Worked example: spectrum of motor neuron disease cohort
Run against the synthetic 10-term Mondo subset that ships with Registry Forge as a smoke test:
```text
$ python mondo_omop_bridge.py
[12:31] NetworkX available: True
[12:31] Mondo KGX dir: ./synthetic_mondo
[12:31] Loaded 10 nodes, 9 edges
[12:31] Disease nodes (post-obsolete filter): 10
[12:31] Building master MONDO2OMOP table ...
[12:31] Kept 9 Mondo human-disease nodes (post-exclusions)
[12:31] 14 (Mondo, source-code) rows written
[12:31] Building cohort for MONDO:0004976 ...
[12:31] Target: MONDO:0004976 amyotrophic lateral sclerosis
[12:31] 2 Mondo terms in cohort (target + descendants)
[12:31] Wrote ./out/cohort_MONDO_0004976_codes.tsv (4 rows)
```
cohort_MONDO_0004976_codes.tsv (snippet):
| mondo_id | mondo_label | source_vocabulary | source_code | rare | gard_rare | nord_rare | orphanet_rare | inferred_rare | mondo_rare |
|---|---|---|---|---|---|---|---|---|---|
| MONDO:0004976 | amyotrophic lateral sclerosis | ICD10CM | G12.21 | 1 | 1 | 0 | 1 | 0 | 0 |
| MONDO:0004976 | amyotrophic lateral sclerosis | SNOMED | 86044005 | 1 | 1 | 0 | 1 | 0 | 0 |
| MONDO:0019469 | amyotrophic lateral sclerosis-frontotemporal dementia | ICD10CM | G31.09 | 1 | 1 | 0 | 1 | 0 | 0 |
| MONDO:0019469 | amyotrophic lateral sclerosis-frontotemporal dementia | SNOMED | 230260009 | 1 | 1 | 0 | 1 | 0 | 0 |
With the full Mondo release, this same query returns dozens of rows covering hereditary ALS subtypes, ALS variants by causative gene, juvenile MND, and the full ALS-FTD spectrum — exactly the cohort you want to query in CONDITION_OCCURRENCE.
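Turning a cohort file into a CONDITION_OCCURRENCE filter can be sketched as below. It assumes the cohort_<MONDO_id>_omop.tsv file has a standard_concept_id column as described earlier and that the target is a standard OMOP CDM table with person_id and condition_concept_id columns. `condition_filter_sql` is a hypothetical helper, and the concept IDs in the sample are placeholders, not real mappings.

```python
import csv
import io


def condition_filter_sql(tsv_file):
    """Build a person-level query from the standard_concept_id column of a cohort TSV."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    # int() both de-duplicates cleanly and guarantees only numeric IDs reach the SQL text.
    ids = sorted({int(row["standard_concept_id"])
                  for row in reader if row.get("standard_concept_id")})
    if not ids:
        raise ValueError("no standard_concept_id values found")
    id_list = ", ".join(str(i) for i in ids)
    return (
        "SELECT DISTINCT person_id\n"
        "FROM condition_occurrence\n"
        f"WHERE condition_concept_id IN ({id_list})"
    )


# Placeholder concept IDs; one duplicate and one unmapped row to show the clean-up.
sample = io.StringIO(
    "mondo_id\tsource_code\tstandard_concept_id\n"
    "MONDO:0004976\tG12.21\t1000002\n"
    "MONDO:0004976\t86044005\t1000001\n"
    "MONDO:0019469\tG31.09\t1000002\n"
    "MONDO:0019469\t230260009\t\n"
)
sql = condition_filter_sql(sample)
print(sql)
```

Rows without a standard concept are skipped here, so it is worth counting them separately before trusting the cohort size.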
Relationship to the other modules
phenopackets_etl.py and mondo_omop_bridge.py operate on the same vocabulary mappings but in opposite directions:
| Module | Direction | Input | Output |
|---|---|---|---|
| phenopackets_etl.py | Source → Mondo | EHR records with SNOMED/ICD-10 codes | GA4GH Phenopacket with Mondo diseases[] |
| mondo_omop_bridge.py | Mondo → Source + OMOP | A Mondo term ID | Code list (SNOMED, ICD-10-CM) + OMOP standard concept_ids |
Both consume the same Athena vocabulary bundle when present. Both honor the same ontology release version in their output filenames. Neither requires the other to run.
The bridge's output also doubles as a high-quality input for extending phenopackets_etl.py's seed mapping tables — the MONDO2OMOP_<release>.tsv master table is exactly the (source-vocab, source-code) → Mondo mapping that the Phenopackets ETL needs. A small script that reads the master TSV and emits a Python dict literal can drop straight into the SNOMED_ICD_TO_MONDO block of phenopackets_etl.py.
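One possible shape for that small script is sketched below. The layout of SNOMED_ICD_TO_MONDO inside phenopackets_etl.py is not shown on this page, so the (source_vocabulary, source_code) tuple key and the first-match-wins rule are assumptions to adapt. `master_to_dict_literal` is a hypothetical name, and the MeSH row uses a placeholder code.

```python
import csv
import io
import pprint


def master_to_dict_literal(tsv_file, name="SNOMED_ICD_TO_MONDO",
                           vocabularies=("SNOMED", "ICD10CM")):
    """Read the master TSV and return Python source text for a lookup dict."""
    mapping = {}
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        if row["source_vocabulary"] in vocabularies:
            key = (row["source_vocabulary"], row["source_code"])
            # Assumption: keep the first Mondo term when a code maps to several.
            mapping.setdefault(key, row["mondo_id"])
    return f"{name} = {pprint.pformat(mapping)}\n"


sample = io.StringIO(
    "mondo_id\tsource_vocabulary\tsource_code\n"
    "MONDO:0004976\tICD10CM\tG12.21\n"
    "MONDO:0004976\tSNOMED\t86044005\n"
    "MONDO:0019469\tICD10CM\tG31.09\n"
    "MONDO:0019469\tMeSH\tPLACEHOLDER\n"
)
literal = master_to_dict_literal(sample)
print(literal)
```

Reviewing the printed literal by hand before pasting it in is a cheap guard against codes that map to more than one Mondo term.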
Dependencies
- Python 3.9+
- requests (for downloading Mondo KGX). Pre-download manually if not available.
- networkx (optional). Without it, the module falls back to a pure-stdlib BFS over a dict-of-sets graph. The fallback is slightly slower on full Mondo but functionally equivalent.
There is no pandas dependency, even though the upstream mondo2omop uses it: the bridge relies on the csv module and stdlib joins. This keeps the dependency footprint small for adopters who want to drop the module into a constrained environment.