Phenopackets ETL (GA4GH / HPO / Mondo)

Preview — under active development. This module is still being refined. Patterns, mappings, and category boundaries should be validated against your own corpus before being relied on for analysis or publication.

phenopackets_etl.py reads the bundle that run_pipeline.py produces and emits one GA4GH Phenopacket v2 JSON document per patient — submittable to the Matchmaker Exchange, GA4GH Beacon networks, the Monarch Initiative, and other rare disease research consortia. The Phenopacket sits next to the OMOP CDM output as the second standard interchange artifact: OMOP for federated observational research, Phenopackets for deep phenotyping and matchmaking.

The module's primary path is fully driven by the structured codes that the EHR feed already supplies — ICD-10-CM and SNOMED CT problem codes, LOINC labs and vitals, RxNorm medications, SNOMED CT and CPT-4 procedures. No NLP. No manual abstraction. Every Phenopacket field traces back to a coded record in the bundle.

Two optional layers extend the structured-code output: a demonstration layer that shows note-derived content (ALSFRS-R from narratives, gene mentions) can be surfaced as Phenopacket fields, and a production-quality genetic-data path for variant data maintained outside the EHR pipeline.

What ships in the seed tables

Hand-curated mappings for three disease areas:

| Area | SNOMED+ICD-10 → HPO | SNOMED+ICD-10 → Mondo |
| --- | --- | --- |
| Spectrum of motor neuron disease | ~25 | ~9 |
| Epilepsy | ~14 | ~8 |
| Autoimmune neurologic & rheumatic | ~17 | ~15 |

Adopters extend the dicts at the top of phenopackets_etl.py for their disease area. For broader coverage, point the module at an Athena vocabulary download with HPO and Mondo selected; the mapping table is then supplemented automatically via Athena's CONCEPT_RELATIONSHIP Maps-to edges.
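As a rough illustration of how a seed dict and an Athena download might combine, the sketch below reads "Maps to" rows from Athena's tab-delimited CONCEPT.csv and CONCEPT_RELATIONSHIP.csv and merges them under a seed table. The names SEED_HPO, load_athena_maps_to, and build_lookup are hypothetical and not taken from phenopackets_etl.py, and letting seed entries override Athena edges is an assumption.

```python
import csv
import io

# Hypothetical seed table: (vocabulary, code) -> (HPO CURIE, label).
SEED_HPO = {("ICD10CM", "R13.10"): ("HP:0002015", "Dysphagia")}


def _read_tsv(handle):
    # Athena exports are tab-delimited and unquoted.
    return csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)


def load_athena_maps_to(concept_file, relationship_file, target_vocab):
    """Collect 'Maps to' edges whose target concept belongs to target_vocab."""
    concepts = {row["concept_id"]: row for row in _read_tsv(concept_file)}
    mapping = {}
    for edge in _read_tsv(relationship_file):
        if edge["relationship_id"] != "Maps to":
            continue
        src = concepts.get(edge["concept_id_1"])
        dst = concepts.get(edge["concept_id_2"])
        if src is None or dst is None or dst["vocabulary_id"] != target_vocab:
            continue
        key = (src["vocabulary_id"], src["concept_code"])
        # Assumes the target concept_code is already a CURIE such as HP:0001260.
        mapping[key] = (dst["concept_code"], dst["concept_name"])
    return mapping


def build_lookup(seed, athena):
    """Merge both sources; hand-curated seed entries win on conflict (assumption)."""
    merged = dict(athena)
    merged.update(seed)
    return merged
```

Both loaders accept any open text handle, so the same code can be exercised against a small in-memory sample before pointing it at a full download.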

How the structured-code path populates a Phenopacket

| Phenopacket field | Source in the bundle | Mapping mechanism |
| --- | --- | --- |
| subject | patients[] (demographics, sex, DOB) | Direct |
| phenotypicFeatures[] | problems[] (ICD-10-CM, SNOMED CT) | Seed table → HPO; or Athena Maps-to → HPO |
| diseases[] | problems[] (ICD-10-CM, SNOMED CT) | Seed table → Mondo; or Athena Maps-to → Mondo |
| measurements[] | labs_vitals[] (LOINC, occasionally SNOMED) | Pass-through (LOINC is native to Phenopackets) |
| medicalActions[].treatment | medications[] (RxNorm) | Pass-through |
| medicalActions[].procedure | procedures[] (SNOMED CT, CPT-4) | Pass-through |
| medicalActions[] (immunization) | immunizations[] (CVX) | Pass-through |
| metaData.resources[] | Athena VOCABULARY.csv if present | Records exact ontology release versions |
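To make the pass-through rows concrete, here is a minimal sketch of turning one LOINC-coded lab row into a Phenopacket v2 Measurement. The input keys (code, display, value, unit, timestamp) are assumptions about the bundle layout, and lab_to_measurement is a hypothetical helper, not the module's own function.

```python
import json


def lab_to_measurement(lab):
    """Wrap a LOINC-coded lab row in the Phenopacket v2 Measurement shape."""
    measurement = {
        "assay": {"id": f"LOINC:{lab['code']}", "label": lab["display"]},
        "value": {
            "quantity": {
                "unit": {"id": f"UCUM:{lab['unit']}", "label": lab["unit"]},
                "value": float(lab["value"]),
            }
        },
    }
    # timeObserved is optional; emit it only when the row carries a timestamp.
    if lab.get("timestamp"):
        measurement["timeObserved"] = {"timestamp": lab["timestamp"]}
    return measurement


example = lab_to_measurement({
    "code": "2345-7",
    "display": "Glucose [Mass/volume] in Serum or Plasma",
    "value": "95",
    "unit": "mg/dL",
    "timestamp": "2024-03-01T08:30:00Z",
})
print(json.dumps(example, indent=2))
```

No vocabulary lookup is involved: the LOINC code travels through unchanged and only gains the CURIE prefix.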

A primary diagnosis like ALS is emitted only as a Disease (Mondo-coded) and is not duplicated in phenotypicFeatures. The module looks up each problem code against both Mondo and HPO and routes the code to the canonical slot.
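The routing rule can be sketched as a two-step lookup. This is an illustration under assumptions: route_problem and the two sample maps are hypothetical, and checking Mondo before HPO is one plausible way to keep a diagnosis out of phenotypicFeatures, not necessarily how the module decides.

```python
# Hypothetical excerpts of the two lookup tables.
MONDO_MAP = {("ICD10CM", "G12.21"): ("MONDO:0004976", "amyotrophic lateral sclerosis")}
HPO_MAP = {("ICD10CM", "R13.10"): ("HP:0002015", "Dysphagia")}


def route_problem(vocabulary, code, mondo_map=MONDO_MAP, hpo_map=HPO_MAP):
    """Return (slot, entry) for one problem code; slot is 'unmapped' on a miss."""
    key = (vocabulary, code)
    if key in mondo_map:
        # Diagnoses go to diseases[] only, never to phenotypicFeatures[].
        term_id, label = mondo_map[key]
        return "diseases", {"term": {"id": term_id, "label": label}}
    if key in hpo_map:
        term_id, label = hpo_map[key]
        return "phenotypicFeatures", {"type": {"id": term_id, "label": label}}
    return "unmapped", None


for vocab, code in [("ICD10CM", "G12.21"), ("ICD10CM", "R13.10"), ("ICD10CM", "Z00.00")]:
    slot, entry = route_problem(vocab, code)
    print(vocab, code, "->", slot)
```

Codes that land in the unmapped slot are the natural input for the frequency report described next.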

unmapped_codes.csv is written alongside the Phenopackets — every (vocabulary, code) that didn't map to either HPO or Mondo, ranked by frequency. That report is the worklist for extending the seed tables.
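A frequency-ranked report of this kind takes only a few lines. The sketch below assumes a three-column layout (vocabulary, code, count); the actual column names in unmapped_codes.csv may differ, and write_unmapped_report is a hypothetical name.

```python
import csv
import io
from collections import Counter


def write_unmapped_report(unmapped, out):
    """Write (vocabulary, code) pairs ranked by how often they failed to map."""
    counts = Counter(unmapped)
    writer = csv.writer(out)
    writer.writerow(["vocabulary", "code", "count"])
    # Most frequent first; ties broken alphabetically for stable output.
    for (vocabulary, code), n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        writer.writerow([vocabulary, code, n])


buffer = io.StringIO()
write_unmapped_report(
    [("ICD10CM", "Z00.00"), ("SNOMED", "1"), ("ICD10CM", "Z00.00")],
    buffer,
)
print(buffer.getvalue())
```

Sorting by descending count puts the codes that would add the most coverage at the top of the worklist.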

How the medication coding is picked

A single FHIR MedicationRequest.medicationCodeableConcept.coding[] frequently contains several codes for the same prescription, most commonly an Ingredient (RxNorm IN), a Semantic Clinical Drug (SCD), and a Brand Name (BN). The Phenopacket MedicalAction.treatment.agent.id carries one of those as a CURIE (e.g. RxNorm:860975), and the choice matters. An Ingredient code loses the strength and dose form, so RxNorm:6809 for metformin tells a Matchmaker partner only that something containing metformin was prescribed. An SCD code (RxNorm:860975, "24 HR metformin hydrochloride 500 MG Extended Release Oral Tablet") carries the full drug-product semantics that downstream consumers expect.

The ETL applies a TTY-aware preference order to RxNorm codings using each concept's concept_class_id from CONCEPT.csv:

SCD (Clinical Drug) > SBD (Branded Drug)
  > SCDC/SBDC (Quantified)
  > GPCK/BPCK (Pack)
  > SCDF/SBDF (Drug Form)
  > IN (Ingredient) > PIN
  > BN (Brand Name) > DF

If the same MedicationRequest carries RxNorm:6809 (Ingredient) and RxNorm:860975 (SCD), the SCD wins regardless of which came first in coding[]. If Athena is not loaded, the ETL falls back to whichever RxNorm coding the upstream ingestion picked — which uses a display-string heuristic (strength pattern + dose-form word) to approximate SCD preference. The log line TTY-aware RxNorm selection: ENABLED (...) or ... DISABLED (...) records which mode the run used.
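The selection logic might look roughly like the sketch below. It takes a hypothetical tty_by_code dict (RxNorm code to term type, however that is derived from concept_class_id) and, when that dict is absent, falls back to a display-string check. The regexes are a guess at what "strength pattern + dose-form word" could mean, not the upstream ingestion's actual patterns, and all function names are illustrative.

```python
import re

RXNORM_SYSTEM = "http://www.nlm.nih.gov/research/umls/rxnorm"

# Preference tiers from the list above; term types in one tier rank equally.
TIERS = [("SCD",), ("SBD",), ("SCDC", "SBDC"), ("GPCK", "BPCK"),
         ("SCDF", "SBDF"), ("IN",), ("PIN",), ("BN",), ("DF",)]
RANK = {tty: rank for rank, tier in enumerate(TIERS) for tty in tier}

# Assumed heuristic patterns for the no-Athena fallback.
STRENGTH = re.compile(r"\d+(?:\.\d+)?\s*(?:MG|MCG|ML|G|UNT)\b", re.IGNORECASE)
DOSE_FORM = re.compile(r"\b(?:tablet|capsule|solution|injection|suspension)\b", re.IGNORECASE)


def looks_like_clinical_drug(display):
    """True when the display string has both a strength and a dose-form word."""
    return bool(STRENGTH.search(display) and DOSE_FORM.search(display))


def pick_rxnorm_coding(codings, tty_by_code=None):
    """Pick one RxNorm coding; position in coding[] only breaks ties."""
    rxnorm = [c for c in codings if c.get("system") == RXNORM_SYSTEM]
    if not rxnorm:
        return None
    if tty_by_code:
        # Unknown term types sort after every known tier.
        return min(rxnorm, key=lambda c: RANK.get(tty_by_code.get(c["code"]), len(TIERS)))
    return min(rxnorm, key=lambda c: 0 if looks_like_clinical_drug(c.get("display", "")) else 1)


def to_curie(coding):
    return f"RxNorm:{coding['code']}"


codings = [
    {"system": RXNORM_SYSTEM, "code": "6809"},    # Ingredient listed first
    {"system": RXNORM_SYSTEM, "code": "860975"},  # SCD listed second
]
print(to_curie(pick_rxnorm_coding(codings, {"6809": "IN", "860975": "SCD"})))  # RxNorm:860975
```

Because min() returns the first of several equally ranked items, the original coding[] order is preserved whenever the tiers cannot separate two candidates.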

The same approach extends naturally to other vocabularies that have a concept_class_id distinction (LOINC component vs panel, SNOMED CT preferred vs synonym). For now it's RxNorm-only; the OMOP ETL applies the same logic to populate DRUG_EXPOSURE.drug_source_concept_id.

Demonstration layer — note-derived content

The note extraction module (note_extraction.py) recovers ALS-specific content from free-text narratives using regex patterns. The Phenopackets ETL can fold those rows into the output:

  • ALSFRS-R total and subdomain scores → Measurement with LOINC 67131-4 and related codes
  • ECAS total and subscores → Measurement with placeholder LOINCs (LOINC has not yet assigned official codes for ECAS subscores)
  • FVC % predicted → Measurement with LOINC 19868-9
  • Gene-symbol mentions (C9orf72, SOD1, FUS, …) → enrichment of the interpretations block with HGNC IDs and the source snippet

These fields are emitted with _provenance: "note_extraction_demonstration" so downstream consumers can identify them. Toggle with include_note_measurements=True/False in main().
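A downstream consumer that wants only structured-code content could filter on that tag. The sketch below assumes _provenance sits as a top-level key on each measurements[] entry; strip_note_derived is a hypothetical helper on the consumer side, not part of the module.

```python
NOTE_DEMO = "note_extraction_demonstration"


def strip_note_derived(phenopacket):
    """Return a shallow copy with demonstration-layer measurements removed."""
    cleaned = dict(phenopacket)
    if "measurements" in cleaned:
        cleaned["measurements"] = [
            m for m in cleaned["measurements"] if m.get("_provenance") != NOTE_DEMO
        ]
    return cleaned


packet = {
    "id": "demo-patient-001",
    "measurements": [
        {"assay": {"id": "LOINC:2345-7"}},
        {"assay": {"id": "LOINC:67131-4"}, "_provenance": NOTE_DEMO},
    ],
}
print(len(strip_note_derived(packet)["measurements"]))  # 1
```

The input Phenopacket is left untouched, so the same object can still be written out in full for internal review.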

Important framing for our use case: ARC is not yet using note-extracted content for production output. The demonstration shows the plumbing works end-to-end — that note-derived ALSFRS-R can travel from narrative through to a properly-coded Phenopacket Measurement — without committing to the abstraction quality required for production. The note extraction patterns are seed patterns for site-specific tuning; until that tuning is validated against real ARC narratives by clinical reviewers, set include_note_measurements=False for any output that leaves the registry. Other registries with a mature, validated NLP layer can flip the flag to True and have the structure already in place.

Production-grade genetic data — external_genetics_csv

ARC variant data lives in a separate data set under a different patient identifier system, and similar splits are common in registries (a research-only VCF, a CLIA-lab report, an outside-of-EHR genetic-counseling record). The Phenopackets ETL accepts a curated CSV — maintained outside the EHR pipeline — that joins by patient identifier and produces fully-coded GenomicInterpretation entries.

Schema (header row required, columns case-insensitive):

patient_id,gene_symbol,hgnc_id,hgvs,pathogenicity,zygosity,source
demo-patient-001,C9orf72,HGNC:28337,NM_018325.5:c.*1733_*1734insGGGGCCGGGGCCGGGGCC,Pathogenic,heterozygous,Invitae 2024-08 ALS panel
demo-patient-002,SOD1,HGNC:11179,NM_000454.4:c.272A>C,Likely pathogenic,heterozygous,Research VCF chr21:31668928 verified by Sanger 2024-02
demo-patient-003,,,,,,"Negative: multi-gene ALS panel, Invitae 2024-08"

Field notes:

  • patient_id — Must match the identifier used in dashboard_data.json. If your variant data carries a different identifier system, build a small bridge step that joins them before producing this CSV.
  • gene_symbol — HGNC-approved symbol. The module auto-resolves the HGNC ID for the 13 ALS-associated genes shipped in the GENE_HGNC lookup (C9orf72, SOD1, FUS, TARDBP, TBK1, VCP, UBQLN2, PFN1, MATR3, CHCHD10, ANG, OPTN, ATXN2). For other genes, also fill hgnc_id.
  • hgnc_id — With or without the HGNC: prefix. Optional when gene_symbol is in the shipped list.
  • hgvs — HGVS variant descriptor. Strongly recommended; without it the variant interpretation is gene-only. Coding-DNA reference (NM_018325.5:c.*1733...) and genomic reference (NC_000009.12:g.27573529G>A) are both accepted; Phenopackets stores the string verbatim.
  • pathogenicity — One of: Pathogenic, Likely pathogenic, Uncertain significance, Likely benign, Benign. Mapped to the Phenopackets AcmgPathogenicityClassification enum.
  • zygosity — One of: heterozygous, homozygous, hemizygous. Mapped to GENO ontology terms.
  • source — Free text describing where the finding came from. Recorded under _source on the interpretation; useful for clinician review and provenance audits.
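The field rules above can be sketched as a small normalizing reader. This is an assumption-laden illustration: read_genetics_csv is a hypothetical name, GENE_HGNC shows only the two IDs that appear in the sample rows, the GENO identifiers are the ones commonly used for these allelic states, and mapping a blank pathogenicity to NOT_PROVIDED is a choice made here rather than documented behaviour.

```python
import csv
import io

GENE_HGNC = {"C9orf72": "HGNC:28337", "SOD1": "HGNC:11179"}  # excerpt only

ACMG = {
    "pathogenic": "PATHOGENIC",
    "likely pathogenic": "LIKELY_PATHOGENIC",
    "uncertain significance": "UNCERTAIN_SIGNIFICANCE",
    "likely benign": "LIKELY_BENIGN",
    "benign": "BENIGN",
}

ZYGOSITY = {
    "heterozygous": {"id": "GENO:0000135", "label": "heterozygous"},
    "homozygous": {"id": "GENO:0000136", "label": "homozygous"},
    "hemizygous": {"id": "GENO:0000134", "label": "hemizygous"},
}


def read_genetics_csv(handle):
    """Parse the curated CSV into normalized rows (header names case-insensitive)."""
    rows = []
    for raw in csv.DictReader(handle):
        # Lower-case the headers; skip overflow fields that DictReader files under None.
        row = {k.strip().lower(): (v or "").strip() for k, v in raw.items() if k is not None}
        symbol = row.get("gene_symbol", "")
        hgnc = row.get("hgnc_id") or GENE_HGNC.get(symbol, "")
        if hgnc and not hgnc.upper().startswith("HGNC:"):
            hgnc = f"HGNC:{hgnc}"
        rows.append({
            "patient_id": row.get("patient_id", ""),
            "gene_symbol": symbol,
            "hgnc_id": hgnc,
            "hgvs": row.get("hgvs", ""),  # kept verbatim
            "acmg": ACMG.get(row.get("pathogenicity", "").lower(), "NOT_PROVIDED"),
            "zygosity": ZYGOSITY.get(row.get("zygosity", "").lower()),
            "source": row.get("source", ""),
        })
    return rows
```

Note that a source value containing a comma has to be quoted in the CSV; otherwise the row gains an extra column and the text after the comma is lost.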

When this CSV is supplied, the Phenopacket's interpretations[] block has progressStatus: "COMPLETED" and full GenomicInterpretation entries. When it isn't, the module falls back to either a note-derived demonstration block (if include_note_measurements=True and the patient has a gene mention in notes) or a clearly-labeled placeholder.
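For orientation, a completed interpretation built from one curated row could be shaped like the sketch below, which follows the Phenopacket v2 Interpretation, GenomicInterpretation, and VariationDescriptor layout. The helper name, the id conventions, the CONTRIBUTORY status, and the placement of _source are all assumptions; the module's real output may differ in each of these.

```python
import json
import re


def build_interpretation(row, disease):
    """Build one Interpretation from a normalized genetics row (hypothetical helper)."""
    descriptor = {
        "id": f"{row['patient_id']}-{row['gene_symbol']}",
        "geneContext": {"valueId": row["hgnc_id"], "symbol": row["gene_symbol"]},
    }
    if row.get("hgvs"):
        # Derive hgvs.c / hgvs.g / ... from the reference-type letter after the colon.
        match = re.search(r":([cgmnpr])\.", row["hgvs"])
        syntax = f"hgvs.{match.group(1)}" if match else "hgvs"
        descriptor["expressions"] = [{"syntax": syntax, "value": row["hgvs"]}]
    if row.get("zygosity"):
        descriptor["allelicState"] = row["zygosity"]
    return {
        "id": f"{row['patient_id']}-interpretation",
        "progressStatus": "COMPLETED",
        "diagnosis": {
            "disease": disease,
            "genomicInterpretations": [{
                "subjectOrBiosampleId": row["patient_id"],
                "interpretationStatus": "CONTRIBUTORY",  # assumption, not from the source
                "variantInterpretation": {
                    "acmgPathogenicityClassification": row["acmg"],
                    "variationDescriptor": descriptor,
                },
                "_source": row["source"],
            }],
        },
    }


interpretation = build_interpretation(
    {
        "patient_id": "demo-patient-002",
        "gene_symbol": "SOD1",
        "hgnc_id": "HGNC:11179",
        "hgvs": "NM_000454.4:c.272A>C",
        "acmg": "LIKELY_PATHOGENIC",
        "zygosity": {"id": "GENO:0000135", "label": "heterozygous"},
        "source": "Research VCF chr21:31668928 verified by Sanger 2024-02",
    },
    {"id": "MONDO:0004976", "label": "amyotrophic lateral sclerosis"},
)
print(json.dumps(interpretation, indent=2))
```

When hgvs is empty the descriptor carries only geneContext, which matches the gene-only interpretation described in the field notes.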

Running it

From the command line:

# Production path: structured codes only
python phenopackets_etl.py

From Python with all flags:

import phenopackets_etl

phenopackets_etl.main(
    bundle_path                 = './dashboard_data.json',
    note_extractions_path       = './note_extractions.csv',     # used only if include_note_measurements
    vocab_dir                   = '/path/to/Athena/Vocab',      # optional broader mapping
    out_root                    = './',
    include_note_measurements   = False,                        # production: structured only
    external_genetics_csv       = './external_genetics.csv',    # curated outside the EHR
)

Three operating modes that match the way registries actually work:

| Mode | include_note_measurements | external_genetics_csv | When to use |
| --- | --- | --- | --- |
| Structured-only | False | None | Default for ARC and any registry whose NLP and external-genetics linkage are not yet validated |
| Structured + curated genetics | False | path to CSV | Production: variant findings come from curated, manually-maintained data |
| Structured + demo NLP + curated genetics | True | path to CSV | Mature NLP pipeline; note-derived ALSFRS-R is intended for downstream use |

For the manuscript and live demo we run the second mode against the synthetic patient with a hand-built genetics CSV — that's what produces the worked example.

Output

A versioned output folder named phenopackets_output_HPO-<release>_Mondo-<release>/ containing:

| File | Contents |
| --- | --- |
| <patient_id>.json | One GA4GH Phenopacket v2 per patient |
| cohort.json | All patients combined as a Phenopackets Cohort resource |
| summary.csv | Per-patient counts: phenotypic features, diseases, measurements (split into _structured vs _note_demo), medical actions, interpretation_source (external_genetics_csv / note_extraction_demo / placeholder_only) |
| unmapped_codes.csv | Every (vocabulary, code) that didn't map to HPO or Mondo, ranked by frequency |

The folder name carries the ontology release identifiers so successive runs against updated Athena downloads don't overwrite each other.
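One way such a folder name could be derived is sketched below: read vocabulary_version from Athena's tab-delimited VOCABULARY.csv and sanitize it for use in a path. The vocabulary_id values "HPO" and "Mondo", the "unknown" fallback, and the sanitizing rule are assumptions, and output_folder_name is a hypothetical helper.

```python
import csv
import io
import re


def output_folder_name(vocabulary_file):
    """Build the versioned folder name from VOCABULARY.csv release strings."""
    versions = {}
    for row in csv.DictReader(vocabulary_file, delimiter="\t", quoting=csv.QUOTE_NONE):
        versions[row["vocabulary_id"]] = row["vocabulary_version"]

    def tag(vocabulary_id):
        raw = versions.get(vocabulary_id) or "unknown"
        # Keep only characters that are safe in a directory name.
        return re.sub(r"[^A-Za-z0-9._-]+", "-", raw).strip("-")

    return f"phenopackets_output_HPO-{tag('HPO')}_Mondo-{tag('Mondo')}"


sample = io.StringIO(
    "vocabulary_id\tvocabulary_version\n"
    "HPO\t2024-04-26\n"
    "Mondo\tv2024-06-04\n"
)
print(output_folder_name(sample))  # phenopackets_output_HPO-2024-04-26_Mondo-v2024-06-04
```

Two runs against different Athena downloads then resolve to different directories, which is what keeps earlier outputs from being overwritten.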

Quality control

Phenopackets are one tier of an integrated QC framework. Every output is schema-validated against the GA4GH Phenopacket v2 JSON Schema, the unmapped-codes report tracks coverage, and metaData.resources records exact ontology release versions for full reproducibility. Spot-review by a clinician — comparing the dashboard's per-patient view side-by-side with the Phenopacket — catches mapping defects that pure schema validation cannot.

What this is and isn't

  • Is: a structured-code-driven Phenopacket producer that wraps the data Registry Forge already extracts (demographics, problems, labs, medications, procedures) into the GA4GH-blessed interchange format. End-to-end traceable, no NLP required.
  • Is: a clean injection point for variant data maintained outside the EHR feed, via the external_genetics_csv parameter. Structured the way real registries actually keep that data.
  • Is: a demonstration that note-derived content can be folded in once the NLP layer is validated. The plumbing is real; the production decision is held back until clinical review of the regex patterns is complete.
  • Isn't: a clinical NLP system. The note extraction patterns are seed patterns and should not be relied on for production output without site-specific validation.
  • Isn't: complete on the genetic side without the external CSV. Without it the interpretations block is honest about what's missing — _placeholder: true and an explanatory _comment — so downstream consumers handle it correctly.