Advanced exports

Under construction

The OMOP, Mondo, and Phenopackets modules are starter implementations. They produce schema-conforming output for the fields they populate, but each has documented gaps: OMOP concept IDs are resolved only when an Athena vocabulary is supplied, Mondo ships an epilepsy-focused starter crosswalk, and Phenopackets don't yet carry variants, biosamples, or full HPO-mapped phenotypic features. Function signatures and output shape may evolve. Treat the output as a starting point for downstream ETL, not a finished product.

Demos: see Downloads & demos for working OMOP, Mondo, and Phenopackets output generated from sample_data/.

Beyond the four core outputs (master CSV, dashboard, feature matrix, parse log), Registry Forge - Patient Edition can produce three standards-conforming exports that the parent Registry Forge also generates, plus EDA reports and note keyword flagging:

  • OMOP CDM v5.4 - the de facto research data model for observational health studies
  • Mondo disease ontology mapping - unified disease IDs spanning ICD, SNOMED, OMIM, Orphanet
  • GA4GH Phenopackets v2 - the international standard for sharing clinical and phenotypic data
  • EDA reports - see EDA reports
  • Note keyword flagging - see Note flagging

Each export is opt-in via CLI flags (--omop, --phenopackets, --mondo, --eda, --flag-notes) or the matching with_*= kwargs of build_outputs().

OMOP CDM v5.4

OMOP CDM is the data model that backs OHDSI's federated research network. If you want to run an existing OMOP-based study against patient-mediated data, you need your data in this shape.

registryforge-patient parse ./my_ccda_folder --output ./out --omop

Produces under ./out/omop/:

File                      What's in it
PERSON.csv                One row per patient with gender, birth date, alias as person_source_value
OBSERVATION_PERIOD.csv    One row per patient with min/max date span across all their records
CONDITION_OCCURRENCE.csv  One row per problem-list entry, with start/end dates, source code
DRUG_EXPOSURE.csv         One row per medication entry, with start/end dates, source code, SIG
MEASUREMENT.csv           One row per lab/vital with a numeric value, with date, value, unit
OBSERVATION.csv           Allergies, social and family history, behavioral observations
DEVICE_EXPOSURE.csv       Implants, durable medical equipment from devices and medical_equipment categories

Column names and types match the OMOP CDM v5.4 specification.
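
The OBSERVATION_PERIOD row is the easiest one to reason about: it spans the earliest and latest dated record for each patient. The following is a minimal sketch of that rollup, not the exporter's actual code. The keys patient_id and date are hypothetical column names; the real patient_master.csv columns may differ.

```python
from datetime import date

# Toy records. "patient_id" and "date" are hypothetical column names.
records = [
    {"patient_id": "PT-1", "date": "2021-03-04"},
    {"patient_id": "PT-1", "date": "2019-11-20"},
    {"patient_id": "PT-2", "date": "2022-06-01"},
    {"patient_id": "PT-1", "date": ""},  # undated rows are skipped
]


def observation_periods(rows):
    """Return {patient_id: (earliest_date, latest_date)} over dated rows."""
    spans = {}
    for row in rows:
        if not row["date"]:
            continue
        d = date.fromisoformat(row["date"])
        start, end = spans.get(row["patient_id"], (d, d))
        spans[row["patient_id"]] = (min(start, d), max(end, d))
    return spans


periods = observation_periods(records)
```

A patient with a single dated record gets a period whose start and end dates are the same day.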

Concept ID resolution

When you pass an Athena vocabulary directory via --omop-vocab (CLI) or omop_vocab= (Python), the exporter resolves source codes to OMOP standard concept IDs via the OHDSI canonical two-step lookup: (code, vocabulary_id) → source concept_id via CONCEPT.csv, then 'Maps to' edge → standard concept_id via CONCEPT_RELATIONSHIP.csv. Without a vocab, *_concept_id columns are left as 0.

registryforge-patient parse ./my_ccda_folder --output ./out \
    --omop --omop-vocab /path/to/athena/vocab

Or from Python:

from registryforge_patient import build_outputs, AthenaVocab

vocab = AthenaVocab('/path/to/athena/vocab')
vocab.load(vocabs=['RxNorm', 'SNOMED', 'LOINC', 'ICD10CM', 'ICD9CM'])

build_outputs(
    input_path='./my_ccda_folder',
    output_dir='./out',
    with_omop=True,
    omop_vocab=vocab,
)
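
To make the two-step lookup concrete, here is a small sketch over in-memory rows. It is an illustration of the idea only, not the AthenaVocab implementation. The column names (concept_id, vocabulary_id, concept_code, concept_id_1, concept_id_2, relationship_id) follow the Athena files, but the concept IDs are invented for the example.

```python
# Toy stand-ins for rows of CONCEPT.csv and CONCEPT_RELATIONSHIP.csv.
# The concept IDs are made up for illustration; they are not real OMOP IDs.
CONCEPT_ROWS = [
    {"concept_id": 1001, "vocabulary_id": "ICD10CM", "concept_code": "G12.21"},
    {"concept_id": 2001, "vocabulary_id": "SNOMED", "concept_code": "86044005"},
]
RELATIONSHIP_ROWS = [
    {"concept_id_1": 1001, "relationship_id": "Maps to", "concept_id_2": 2001},
    {"concept_id_1": 2001, "relationship_id": "Maps to", "concept_id_2": 2001},
    {"concept_id_1": 2001, "relationship_id": "Mapped from", "concept_id_2": 1001},
]

# Step 1 index: (code, vocabulary_id) -> source concept_id
source_index = {
    (row["concept_code"], row["vocabulary_id"]): row["concept_id"]
    for row in CONCEPT_ROWS
}

# Step 2 index: source concept_id -> standard concept_id via 'Maps to' edges.
# A source concept can have several 'Maps to' targets; this sketch keeps one.
maps_to = {
    row["concept_id_1"]: row["concept_id_2"]
    for row in RELATIONSHIP_ROWS
    if row["relationship_id"] == "Maps to"
}


def resolve(code, vocabulary_id):
    """Return (source_concept_id, standard_concept_id), using 0 when unresolved."""
    source_id = source_index.get((code, vocabulary_id), 0)
    standard_id = maps_to.get(source_id, 0)
    return source_id, standard_id
```

With these toy rows, resolve('G12.21', 'ICD10CM') walks from the ICD-10-CM source concept to the standard concept it maps to, and an unknown code falls back to 0 for both IDs, matching the no-vocab behaviour described above.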

The full Athena bundle is large (~5 GB total); restrict the vocabs= filter to the vocabularies that appear in patient-portal C-CDAs to keep memory manageable.
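
The memory saving comes from discarding rows for unwanted vocabularies while the file is being read, rather than after it is loaded. A rough sketch of that idea, assuming the tab-delimited layout Athena uses for CONCEPT.csv; the function name load_concepts is hypothetical, the sample rows are invented, and this is not how AthenaVocab.load() is necessarily implemented:

```python
import csv
import io

# Tiny tab-delimited stand-in for CONCEPT.csv. Rows and IDs are invented.
SAMPLE = (
    "concept_id\tvocabulary_id\tconcept_code\n"
    "1001\tICD10CM\tG12.21\n"
    "2001\tSNOMED\t86044005\n"
    "3001\tNDC\t00000000000\n"
)


def load_concepts(handle, wanted):
    """Stream rows and keep only those whose vocabulary_id is in `wanted`."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {
        (row["concept_code"], row["vocabulary_id"]): int(row["concept_id"])
        for row in reader
        if row["vocabulary_id"] in wanted
    }


index = load_concepts(io.StringIO(SAMPLE), wanted={"ICD10CM", "SNOMED"})
```

Against a real bundle you would pass an open file handle instead of the StringIO; only the kept rows are ever held in memory.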

What's not done

This is a starter ETL:

  • These seven tables only. The full OMOP CDM has 16+ tables including PROCEDURE_OCCURRENCE, VISIT_OCCURRENCE, CARE_SITE, NOTE, etc. PROCEDURE_OCCURRENCE is the most obvious next target.
  • Derived era tables (CONDITION_ERA, DRUG_ERA, DOSE_ERA) are not produced. OHDSI's standard tooling derives these from CONDITION_OCCURRENCE and DRUG_EXPOSURE after the data is loaded, so the source-data ETL doesn't try to construct them.
  • No vocabulary or metadata tables (such as VOCABULARY, CDM_SOURCE, or METADATA) - those belong on the receiving OMOP instance.

For a complete production OMOP ETL with full table coverage and concept mapping, see the parent Registry Forge.

From Python

from registryforge_patient import to_omop, AthenaVocab
import pandas as pd

df = pd.read_csv('./out/patient_master.csv', encoding='utf-8-sig')

# Without vocab: source values populated, concept_ids = 0
to_omop(df, output_dir='./omop/')

# With vocab: source values + resolved concept_ids
vocab = AthenaVocab('/path/to/athena/vocab').load()
to_omop(df, output_dir='./omop/', vocab=vocab)

Mondo disease ontology mapping

Mondo unifies ICD-10-CM, ICD-9-CM, SNOMED-CT, OMIM, Orphanet, and other disease vocabularies into a single ID space. Mondo IDs are particularly valuable for rare disease cohort identification because:

  • One ID covers a disease across every source vocabulary
  • Stable across vocabulary updates
  • Used as the canonical disease term in GA4GH Phenopackets

registryforge-patient parse ./my_ccda_folder --output ./out --mondo

This adds two columns to patient_master.csv:

  • mondo_id - the MONDO:XXXXXXX identifier
  • mondo_label - the human-readable disease name

Only rows where category == 'problems' get mapped. Rows without a match get empty strings.
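
The mapping rule can be sketched in a few lines. This is an illustration under assumptions, not the map_to_mondo implementation: the category column comes from the rule above, while the code column name is hypothetical. The single crosswalk entry is borrowed from the custom-crosswalk example on this page.

```python
# One-entry crosswalk: source code -> (mondo_id, mondo_label).
CROSSWALK = {"G12.21": ("MONDO:0004976", "amyotrophic lateral sclerosis")}

# Toy rows. "code" is a hypothetical column name.
rows = [
    {"category": "problems", "code": "G12.21"},
    {"category": "medications", "code": "G12.21"},  # not a problem row
    {"category": "problems", "code": "UNMAPPED"},   # no crosswalk match
]


def add_mondo_columns(rows, crosswalk):
    """Return copies of rows with mondo_id and mondo_label filled or left empty."""
    mapped = []
    for row in rows:
        mondo_id, mondo_label = "", ""
        if row["category"] == "problems":
            mondo_id, mondo_label = crosswalk.get(row["code"], ("", ""))
        mapped.append({**row, "mondo_id": mondo_id, "mondo_label": mondo_label})
    return mapped


mapped = add_mondo_columns(rows, CROSSWALK)
```

Only the first row picks up a Mondo ID; the medication row and the unmatched problem row both end up with empty strings.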

Starter mapping

The package ships with an embedded starter crosswalk focused on epilepsy and adjacent neurology, plus a small set of common comorbidities so the starter still works on typical patient data. It covers:

  • Epilepsy: focal, generalized, Dravet, Lennox-Gastaut, West syndrome, myoclonic, absence
  • Intellectual disability spectrum and autism
  • Neurogenetic syndromes: Rett, tuberous sclerosis, Angelman, DMD
  • Other neurology: ALS, Parkinson, MS, Alzheimer, Huntington

Inspect what's available:

from registryforge_patient.mondo import MONDO_STARTER
print(len(MONDO_STARTER), 'codes in starter')

Using a custom/larger crosswalk

For broader coverage, download Mondo's SSSOM crosswalks from github.com/monarch-initiative/mondo and provide your own CSV/TSV:

source_code,mondo_id,mondo_label,source_vocab
86044005,MONDO:0004976,amyotrophic lateral sclerosis,SNOMED-CT
G12.21,MONDO:0004976,amyotrophic lateral sclerosis,ICD-10-CM
...

Then pass the file with --crosswalk:

registryforge-patient mondo ./out/patient_master.csv \
    --output ./out/patient_master_mondo.csv \
    --crosswalk ./my_full_mondo_crosswalk.csv

Or from Python:

from registryforge_patient.mondo import map_to_mondo, load_mondo_crosswalk

crosswalk = load_mondo_crosswalk('./my_full_mondo_crosswalk.csv')
df_mapped = map_to_mondo(df, override_crosswalk=crosswalk)
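
If you start from an SSSOM file, you need to reshape it into the four-column CSV shown above. The sketch below is one way to do that, under stated assumptions: SSSOM TSVs begin with '#'-prefixed metadata lines, the Mondo term is in subject_id, and the source code is a CURIE in object_id. The sample rows, the VOCAB_LABELS mapping, and the function name are all illustrative, so check the columns and prefixes of the file you actually download.

```python
import csv
import io

# Illustrative SSSOM-style sample; the EXAMPLE: row stands for an unwanted vocabulary.
SSSOM_SAMPLE = (
    "# curie_map:\n"
    "#   MONDO: http://purl.obolibrary.org/obo/MONDO_\n"
    "subject_id\tsubject_label\tpredicate_id\tobject_id\n"
    "MONDO:0004976\tamyotrophic lateral sclerosis\tskos:exactMatch\tICD10CM:G12.21\n"
    "MONDO:0004976\tamyotrophic lateral sclerosis\tskos:exactMatch\tEXAMPLE:123\n"
)

# Hypothetical CURIE-prefix -> source_vocab labels; adjust to your data.
VOCAB_LABELS = {"ICD10CM": "ICD-10-CM", "SCTID": "SNOMED-CT"}
FIELDS = ["source_code", "mondo_id", "mondo_label", "source_vocab"]


def sssom_to_crosswalk_rows(handle):
    """Convert SSSOM rows into crosswalk rows, skipping unlisted prefixes."""
    data_lines = (line for line in handle if not line.startswith("#"))
    rows = []
    for row in csv.DictReader(data_lines, delimiter="\t"):
        prefix, _, code = row["object_id"].partition(":")
        if prefix in VOCAB_LABELS:
            rows.append({
                "source_code": code,
                "mondo_id": row["subject_id"],
                "mondo_label": row["subject_label"],
                "source_vocab": VOCAB_LABELS[prefix],
            })
    return rows


rows = sssom_to_crosswalk_rows(io.StringIO(SSSOM_SAMPLE))

# Serialize to the CSV layout expected by --crosswalk.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
crosswalk_csv = out.getvalue()
```

You may also want to filter on predicate_id (for example, keeping only exact matches) before writing the file, depending on how strict the cohort definition needs to be.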

GA4GH Phenopackets v2

Phenopackets is the GA4GH standard for sharing clinical and phenotypic data about individuals. It's the data exchange format used by Matchmaker Exchange, several rare disease federation projects, and a growing list of clinical research platforms.

registryforge-patient parse ./my_ccda_folder --output ./out --phenopackets

Produces one .json file per patient under ./out/phenopackets/. Each phenopacket follows the v2 schema and includes:

  • subject - the Individual with id, sex, dateOfBirth
  • diseases - one entry per active diagnosis. If Mondo mapping was applied, the disease term uses the Mondo ID; otherwise the original SNOMED/ICD code. Onset date and status (active/resolved) are populated.
  • medicalActions - one entry per medication, with RxNorm-coded agent and start/end dates
  • measurements - one entry per lab or vital with a numeric value, with LOINC assay code and quantity
  • metaData - schema version, resources list (the vocabularies used in this packet), creation timestamp
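
For orientation, the overall JSON shape looks roughly like the dictionary built below. This is a hand-written sketch of a minimal v2 packet, not the output of to_phenopackets: the helper name, the id suffix, the createdBy value, and the resource version are placeholders, and the medicalActions and measurements blocks are omitted for brevity.

```python
import json


def make_phenopacket(patient_id, sex, date_of_birth, diseases, created):
    """Build a minimal Phenopackets v2-style dict with subject, diseases, metaData."""
    return {
        "id": f"{patient_id}-phenopacket",  # placeholder id scheme
        "subject": {"id": patient_id, "sex": sex, "dateOfBirth": date_of_birth},
        "diseases": [
            {"term": {"id": term_id, "label": label}, "onset": {"timestamp": onset}}
            for term_id, label, onset in diseases
        ],
        "metaData": {
            "created": created,
            "createdBy": "example-script",  # placeholder
            "resources": [
                {
                    "id": "mondo",
                    "name": "Mondo Disease Ontology",
                    "url": "http://purl.obolibrary.org/obo/mondo.obo",
                    "version": "unknown",  # placeholder; record the release you used
                    "namespacePrefix": "MONDO",
                    "iriPrefix": "http://purl.obolibrary.org/obo/MONDO_",
                }
            ],
            "phenopacketSchemaVersion": "2.0",
        },
    }


packet = make_phenopacket(
    patient_id="PT-EXAMPLE",
    sex="FEMALE",
    date_of_birth="1970-01-01T00:00:00Z",
    diseases=[("MONDO:0004976", "amyotrophic lateral sclerosis", "2019-11-20T00:00:00Z")],
    created="2024-01-01T00:00:00Z",
)
text = json.dumps(packet, indent=2)
```

Every ontology prefix used in a term id should have a matching entry in metaData.resources, which is why the resources list grows with the vocabularies present in the packet.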

What's not exported

Phenopackets v2 can carry information that patient-portal C-CDA exports don't include:

  • Variant calls (interpretations)
  • Biosamples
  • Pedigree
  • HPO-mapped phenotypic features (these would need an HPO crosswalk; the current export uses Mondo for diseases instead)

If you need any of these, layer them on by editing the JSON files after generation, or build them from a different data source.
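
As one example of layering data on afterwards, the sketch below appends an HPO-coded phenotypic feature to an already-loaded packet and registers HPO in metaData.resources. It assumes the packet is a plain dict read with json.load; the function name and the version placeholder are hypothetical, and the term assignment itself would have to come from your own curation.

```python
import copy


def add_phenotypic_feature(packet, hpo_id, label, hpo_version):
    """Return a copy of packet with one HPO feature and an HPO resource entry."""
    updated = copy.deepcopy(packet)
    updated.setdefault("phenotypicFeatures", []).append(
        {"type": {"id": hpo_id, "label": label}}
    )
    resources = updated.setdefault("metaData", {}).setdefault("resources", [])
    # Register HPO once so the HP: prefix is declared in the packet metadata.
    if not any(r.get("namespacePrefix") == "HP" for r in resources):
        resources.append({
            "id": "hp",
            "name": "Human Phenotype Ontology",
            "url": "http://purl.obolibrary.org/obo/hp.owl",
            "version": hpo_version,  # record the HPO release you curated against
            "namespacePrefix": "HP",
            "iriPrefix": "http://purl.obolibrary.org/obo/HP_",
        })
    return updated


original = {"id": "example", "metaData": {"resources": []}}
updated = add_phenotypic_feature(original, "HP:0001250", "Seizure", "unknown")
updated = add_phenotypic_feature(updated, "HP:0001250", "Seizure", "unknown")
```

The function returns a modified copy, so the file on disk changes only when you write the result back with json.dump. Calling it twice adds two features but a single HPO resource entry.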

From Python

from registryforge_patient import to_phenopackets, map_to_mondo
import pandas as pd

df = pd.read_csv('./out/patient_master.csv', encoding='utf-8-sig')
df = map_to_mondo(df)            # apply Mondo first for better disease terms
to_phenopackets(df, output_dir='./phenopackets/')

Validating the output

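phenopacket-tools is the official CLI for validating Phenopackets. It is distributed as a Java JAR, and its documentation uses pxf as a shell alias for the java -jar invocation, so set up that alias first. After generating the files: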

pxf validate ./out/phenopackets/PT-XXXXXXXXXX.json
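
If you want a quick sanity check in Python before reaching for the full validator, something like the following can catch grossly malformed files. It is a rough pre-check only and no substitute for phenopacket-tools; the list of required keys reflects my reading of the v2 schema and should be treated as an assumption.

```python
# Keys assumed to be required by the Phenopackets v2 schema.
REQUIRED_TOP_LEVEL = ("id", "metaData")
REQUIRED_META = ("created", "createdBy", "resources", "phenopacketSchemaVersion")


def quick_check(packet):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = [f"missing {key}" for key in REQUIRED_TOP_LEVEL if key not in packet]
    meta = packet.get("metaData", {})
    problems += [f"missing metaData.{key}" for key in REQUIRED_META if key not in meta]
    return problems


good = {
    "id": "example",
    "metaData": {
        "created": "2024-01-01T00:00:00Z",
        "createdBy": "example-script",
        "resources": [],
        "phenopacketSchemaVersion": "2.0",
    },
}
bad = {"metaData": {"created": "2024-01-01T00:00:00Z"}}

good_problems = quick_check(good)
bad_problems = quick_check(bad)
```

To run it over generated output, load each file under ./out/phenopackets/ with json.load and pass the resulting dict to quick_check.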

Combining all three

Pass everything at once:

registryforge-patient parse ./my_ccda_folder --output ./out \
    --omop --phenopackets --mondo

--phenopackets automatically applies Mondo mapping to the disease terms it writes, because Mondo IDs make those terms cleaner. Pass --mondo explicitly as well if you also want the mondo_id and mondo_label columns added to patient_master.csv.

Direct conversion from an existing master CSV

If you already have a patient_master.csv (from a previous run, or from someone else who used this tool), you can run the export modules without re-parsing the source XMLs:

registryforge-patient omop         ./patient_master.csv --output ./omop/
registryforge-patient phenopackets ./patient_master.csv --output ./phenopackets/
registryforge-patient mondo        ./patient_master.csv --output ./patient_master_mondo.csv