Quality control framework

Registry Forge produces six classes of output (browser dashboard, master CSVs, code inventory, OMOP CDM, note extractions, Phenopackets). Every one is derived from the same underlying bundle, so a defect anywhere upstream propagates everywhere downstream. Quality control is not optional and is not one-time. This page describes the recommended QC framework, organized as five tiers from cheapest to most labor-intensive.

The QC philosophy: catch defects at the layer they originate, prefer automated checks over manual ones, and make manual review cheap enough that you actually do it.

Data-quality review frameworks adopters should use

The checks on this page are a practical starting set for Registry Forge outputs — not a complete data-quality program. For the full picture, adopters should review their data through two well-established frameworks:

Kahn et al. (2016), A harmonized data quality assessment terminology and framework for the secondary use of electronic health record data, EGEMS 4(1):1244 (PMID 27713905) — the canonical methodological reference for observational-research data quality. Defines the conformance / completeness / plausibility taxonomy that is the standard vocabulary for talking about EHR data quality.
The OMOP / OHDSI data quality framework, specifically the Book of OHDSI, Chapter 15 — Data Quality and the OHDSI Data Quality Dashboard — OMOP-specific guidance and a reference tool that runs Kahn-aligned checks against an OMOP CDM instance.

Adopters running production OMOP workloads should layer those frameworks on top of the checks described below.

Tier 1 — Schema validation (every run)

Every output should pass a schema check before it leaves the workstation.

Output	Schema
`dashboard_data.json`	Internal schema embedded in `run_pipeline.py` (column presence, type sanity)
`dashboard_data.xlsx`	Implicit (column headers; one row per record)
`code_inventory.csv`	Fixed columns; non-empty `vocabulary` and `code`
`patient_master*.csv`	Fixed 21-column header; UTF-8 with BOM; quoted cells
OMOP CDM v5.4 tables	Official OMOP CDM v5.4 spec — types, FK integrity, required columns
Phenopackets JSON	GA4GH Phenopacket v2 JSON Schema

Schema validation catches the most common defect class — a column rename or a missing required field — in seconds. Run it as the last step of every pipeline invocation. The OMOP CDM schemas are published as DDLs by OHDSI; the Phenopackets schema is published as JSON Schema by GA4GH.

A simple validation snippet for Phenopackets:

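import json
import jsonschema

# json.loads cannot parse the protobuf (.proto) definitions in the
# phenopacket-schema repository, so load a JSON Schema file generated
# from them instead. Both paths below are placeholders.
with open('phenopacket_v2.schema.json', encoding='utf-8') as f:
    schema = json.load(f)
with open('<patient_id>.json', encoding='utf-8') as f:
    packet = json.load(f)
jsonschema.validate(instance=packet, schema=schema)  # raises ValidationError on failure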
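The CSV rows of the table above can be checked in the same spirit. The sketch below relies only on what the table states: a 21-column header and a UTF-8 BOM for `patient_master*.csv`, and non-empty `vocabulary` and `code` cells for `code_inventory.csv`. The function names are hypothetical, and a real check would also compare the header against the pipeline's actual column names.

```python
import codecs
import csv
import io

EXPECTED_MASTER_COLUMNS = 21  # taken from the schema table


def check_patient_master(raw: bytes) -> list[str]:
    """Return the schema problems found in a patient_master*.csv payload."""
    problems = []
    if not raw.startswith(codecs.BOM_UTF8):
        problems.append("missing UTF-8 BOM")
    try:
        text = raw.decode("utf-8-sig")  # strips the BOM when present
    except UnicodeDecodeError as exc:
        return problems + [f"not valid UTF-8: {exc}"]
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if not rows:
        return problems + ["file is empty"]
    if len(rows[0]) != EXPECTED_MASTER_COLUMNS:
        problems.append(
            f"header has {len(rows[0])} columns, expected {EXPECTED_MASTER_COLUMNS}"
        )
    # Every data record should have as many cells as the header.
    for record_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(rows[0]):
            problems.append(f"row {record_no} has {len(row)} cells")
    return problems


def check_code_inventory(raw: bytes) -> list[str]:
    """Flag code_inventory.csv rows with an empty `vocabulary` or `code` cell."""
    reader = csv.DictReader(io.StringIO(raw.decode("utf-8-sig"), newline=""))
    problems = []
    for record_no, row in enumerate(reader, start=2):
        for column in ("vocabulary", "code"):
            if not (row.get(column) or "").strip():
                problems.append(f"row {record_no}: empty {column}")
    return problems
```

Both functions return a list of problems rather than raising, so a wrapper can print every finding before it fails the run.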

Tier 2 — Mapping coverage (every run)

code_inventory.csv and unmapped_codes.csv (from the OMOP and Phenopackets ETLs) are the inputs to this tier.

Track three numbers per run:

Metric	What it tells you
Unique source codes / total source codes	Should be stable; sudden change means upstream data source changed
`concept_id != 0` rate in OMOP outputs	What fraction of records mapped to a standard concept
`unmapped_codes.csv` row count, ranked by frequency	The seed-table extension worklist

Set thresholds the registry is comfortable with (e.g., "Phenopackets unmapped rate <10% on the top 100 most-referenced problem codes") and fail the pipeline if a run breaches them. The pipeline already computes these counts and writes them to pipeline_run.log; thresholding is a few lines of additional Python.
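A minimal sketch of that thresholding follows. It takes the `concept_id` values and the unmapped source codes as plain lists, because the format of `pipeline_run.log` is not described on this page. The function names and the threshold values are illustrative assumptions, not part of the pipeline.

```python
from collections import Counter


def mapped_rate(concept_ids):
    """Fraction of records whose concept_id is non-zero (mapped to a standard concept)."""
    ids = [int(c) for c in concept_ids]
    if not ids:
        # An empty table is suspicious in itself, so refuse to report a rate.
        raise ValueError("no records to evaluate")
    return sum(1 for c in ids if c != 0) / len(ids)


def top_unmapped(unmapped_codes, n=10):
    """Rank unmapped source codes by frequency: the seed-table extension worklist."""
    return Counter(unmapped_codes).most_common(n)


def breaches(concept_ids, min_mapped_rate):
    """Return readable threshold breaches; an empty list means the run passes."""
    rate = mapped_rate(concept_ids)
    if rate < min_mapped_rate:
        return [f"mapped rate {rate:.1%} is below threshold {min_mapped_rate:.1%}"]
    return []
```

To fail the pipeline, a caller can print whatever `breaches(...)` returns and exit with a non-zero status when the list is not empty.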

A note on what unmapped means. "Unmapped" is not the same as "wrong." A code that has no HPO or Mondo mapping is genuinely unmapped, but a code that does map may still map incorrectly. Mapping coverage is necessary but not sufficient. Tier 4 (manual review) is the only check that catches incorrect mappings.

Tier 3 — Cross-output consistency (every run)

The same patient should look like the same patient across every output. Cross-checks:

Check	Expected
Patient count	Equal across `patients` array, OMOP `PERSON.csv`, Phenopackets cohort
Per-patient medication count (FHIR-only)	Equal between `patient_master_fhir.csv` filtered to medications and OMOP `DRUG_EXPOSURE.csv` filtered to person_id
Per-patient problem count	Equal across `patient_master.csv` filtered to problems, OMOP `CONDITION_OCCURRENCE.csv`, and Phenopackets `phenotypicFeatures` and `diseases` combined
ALSFRS-R total in `note_extractions.csv`	Present as a Measurement in the corresponding Phenopacket

Discrepancies are usually one of three things: a record was filtered at one stage but not another; a code mapping diverged; or a patient_id normalization slipped between stages. The cross-output diff is the fastest way to find them.

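A short Python script that runs after the pipeline can cover the first row of that table. It compares patient identifiers across the three outputs and prints any identifier that is missing from one of them:

import glob
import json
import pandas as pd

def single_match(pattern):
    # Output folders carry a release suffix, so resolve the wildcard to one path
    (path,) = glob.glob(pattern)  # raises ValueError unless exactly one match
    return path

pm   = pd.read_csv('patient_master.csv', encoding='utf-8-sig', dtype=str)
pers = pd.read_csv(single_match('omop_output_*/PERSON.csv'), encoding='utf-8-sig', dtype=str)
with open(single_match('phenopackets_output_*/cohort.json'), encoding='utf-8') as f:
    pp = json.load(f)

ids_master = set(pm.query("category == 'patient'")['patient_id'])
ids_omop   = set(pers['person_source_value'])
ids_pp     = {m['subject']['id'] for m in pp['members']}

# An ID missing from any one output is a defect, even if the other two agree
all_ids = ids_master | ids_omop | ids_pp
print('Missing from master:      ', all_ids - ids_master)
print('Missing from OMOP:        ', all_ids - ids_omop)
print('Missing from Phenopackets:', all_ids - ids_pp)

Any identifier the script prints is a defect.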
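The per-patient rows of the table can be checked with a count comparison. The sketch below handles the medication row and makes several assumptions:

- Rows are already loaded as dictionaries, for example with `csv.DictReader`.
- The master CSV marks medications with a `category` value of `medication`. That label is a guess; only `patient` is confirmed on this page.
- The OMOP `person_id` maps back to the source identifier through `person_source_value` in `PERSON.csv`.

```python
from collections import Counter


def medication_count_mismatches(master_rows, person_rows, drug_rows,
                                medication_category="medication"):
    """Compare per-patient medication counts between the master CSV and OMOP.

    Returns {patient_id: (master_count, omop_count)} for every patient whose
    counts differ. `medication_category` is an assumed label, not a documented one.
    """
    master_counts = Counter(
        row["patient_id"] for row in master_rows
        if row["category"] == medication_category
    )
    # OMOP rows carry person_id; translate it back to the source identifier.
    source_id = {p["person_id"]: p["person_source_value"] for p in person_rows}
    omop_counts = Counter(
        source_id.get(d["person_id"], f"unresolved person_id {d['person_id']}")
        for d in drug_rows
    )
    # A Counter returns 0 for missing keys, so patients absent from one side
    # show up as a mismatch against zero.
    return {
        pid: (master_counts[pid], omop_counts[pid])
        for pid in sorted(master_counts.keys() | omop_counts.keys())
        if master_counts[pid] != omop_counts[pid]
    }
```

The same shape works for the problem-count row by swapping in `CONDITION_OCCURRENCE.csv` rows and the matching category label.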

Tier 4 — Spot-review by a clinician (recurring)

Tiers 1–3 catch structural defects. Tier 4 catches semantic defects: codes that map to the wrong target, dates that are off, regex patterns that captured the wrong number from a sentence, code prioritizations that hide an important coding.

Recommended cadence: monthly during pilot, quarterly thereafter, after any change to the seed mapping tables, and after every Athena vocabulary release update.

Recommended sample: stratified random sample of N=5–10 patients. For each, the reviewer looks at:

The dashboard's per-patient view, side-by-side with
The patient's row in patient_master.csv, and
The corresponding <patient_id>.json Phenopacket

Cross-referencing the three is the canonical clinician QC workflow. Discrepancies fall into a small number of buckets:

Bucket	Example	Fix
Wrong code mapping	SNOMED 86044005 mapping to a non-ALS HPO term	Edit seed table
Missed phenotype	"Spasticity" in narrative not captured anywhere	Add regex pattern OR seed mapping
Mis-extracted value	ALSFRS-R captured `38` from "ALSFRS-R 38/48 in March 2023" but the year was actually `2024`	Tighten regex
Date drift	A medication start_date that's months earlier in OMOP than in Phenopackets	Trace through merge logic
Identifier drift	Patient appears in dashboard but not in OMOP	Normalization mismatch upstream

Build a short reviewer template (PDF or Word) that lists the bucket categories and a free-text field for "other." File the completed reviews in version control so the registry has a paper trail of QC findings and the seed-table edits that resulted.
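Drawing the review sample reproducibly makes it easier to file alongside the completed reviews. This page does not say what to stratify on, so the sketch below assumes each patient carries a stratum label (for example, the originating data source) and takes patients from the strata in round-robin order. That gives equal rather than proportional allocation. The function name and the seed handling are illustrative.

```python
import random
from collections import defaultdict


def stratified_sample(patient_strata, n, seed):
    """Draw up to n patient IDs, taking one per stratum in round-robin order.

    patient_strata maps patient_id -> stratum label. The choice of label is an
    assumption; the source does not specify a stratification variable.
    """
    rng = random.Random(seed)  # fixed seed makes the draw repeatable
    by_stratum = defaultdict(list)
    for pid in sorted(patient_strata):  # sort so input order cannot change the draw
        by_stratum[patient_strata[pid]].append(pid)
    for members in by_stratum.values():
        rng.shuffle(members)

    sample = []
    strata = sorted(by_stratum)
    while len(sample) < n and any(by_stratum[s] for s in strata):
        for s in strata:
            if by_stratum[s] and len(sample) < n:
                sample.append(by_stratum[s].pop())
    return sample
```

Recording the seed in the reviewer template lets a later reader regenerate the same patient list.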

Tier 5 — Synthetic regression test (every change)

The synthetic demonstration cohort (Jane Marie Demo) is the regression bedrock. Every code path in the pipeline runs against a bundle whose answers are known. When a defect is fixed in production data, write a fixture for it — add a clinically-realistic narrative or coded record to the synthetic CCDA that exercises the previously-broken path — and confirm the pipeline still produces the expected output.

The synthetic cohort is small enough to run end-to-end in seconds, so making this a pre-commit / CI check is reasonable. A minimum CI check:

python run_pipeline.py        # produces dashboard_data.json + master CSVs + code_inventory.csv
python omop_etl.py            # produces OMOP output
python note_extraction.py     # produces note_extractions.csv
python phenopackets_etl.py    # produces Phenopackets output
python qc_assert.py           # the script that codifies the expected counts

qc_assert.py reads the outputs and checks invariants: patient count = 1, ALS appears in diseases, ALSFRS-R total appears in measurements, etc. Any failure is a regression.
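The invariants named above could be codified roughly as follows. This is a sketch, not the contents of the real `qc_assert.py`. It assumes the Phenopackets are already loaded as dictionaries and matches on label text through the Phenopacket v2 fields `diseases[].term.label` and `measurements[].assay.label`. A stricter check would pin the expected ontology identifiers instead.

```python
def check_invariants(packets):
    """Return the failed invariants for the synthetic single-patient cohort."""
    failures = []
    if len(packets) != 1:
        failures.append(f"expected 1 patient, found {len(packets)}")
    for packet in packets:
        pid = packet.get("subject", {}).get("id", "<unknown>")

        # ALS must appear among the diseases (label match is an assumption).
        disease_labels = [
            d.get("term", {}).get("label", "").lower()
            for d in packet.get("diseases", [])
        ]
        if not any("amyotrophic lateral sclerosis" in label for label in disease_labels):
            failures.append(f"{pid}: ALS missing from diseases")

        # An ALSFRS-R measurement must be present (label match is an assumption).
        assay_labels = [
            m.get("assay", {}).get("label", "").lower()
            for m in packet.get("measurements", [])
        ]
        if not any("alsfrs" in label for label in assay_labels):
            failures.append(f"{pid}: ALSFRS-R total missing from measurements")
    return failures
```

In CI, the script would exit non-zero whenever the returned list is not empty, so any regression stops the build.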

Reproducibility expectations

Every output that depends on a vocabulary should record the vocabulary version it used. OMOP's CDM_SOURCE.csv does this for the standard concept set; the Phenopackets metaData.resources[] block does the same for HPO/Mondo/LOINC/RxNorm/SNOMED/ICD-10-CM/HGNC. The folder names — omop_output_<vocab_release>/ and phenopackets_output_HPO-<release>_Mondo-<release>/ — carry the same information at the filesystem level.

The combination means that for any output file, anyone can reconstruct the exact mapping decisions that produced it. This matters for audits, for downstream sharing, and for the case where a vocabulary update produces a different mapping for the same source code.
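One way to enforce this expectation is to assert that each Phenopacket records a version for every vocabulary it relies on. The sketch below reads the `namespacePrefix` and `version` fields of `metaData.resources[]`. The set of required prefixes is an assumption and should be adjusted to the vocabularies the registry actually uses.

```python
REQUIRED_PREFIXES = {"HP", "MONDO"}  # assumed namespace prefixes; extend as needed


def resource_versions(packet):
    """Map namespacePrefix -> version from a Phenopacket's metaData.resources[]."""
    resources = packet.get("metaData", {}).get("resources", [])
    return {r.get("namespacePrefix"): r.get("version") for r in resources}


def missing_versions(packet, required=REQUIRED_PREFIXES):
    """Return the required vocabularies that have no recorded version."""
    versions = resource_versions(packet)
    return sorted(p for p in required if not versions.get(p))
```

An empty result from `missing_versions` means every required vocabulary has a version on record. It does not confirm that the recorded version matches the release named in the output folder.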

Summary — what to do, when

Trigger	Tiers to run
Every pipeline invocation	1, 2, 3, 5 (automated)
Pilot phase, monthly	4 (clinician spot-review)
Post-pilot, quarterly	4
After seed-table edit	4 + 5 (focused on changed mappings)
After Athena vocabulary update	All five tiers; diff vs. previous run
Before submitting Phenopackets to Matchmaker	4 on the submission set
Before publishing OMOP cohort to a federated study	All five tiers

The only tier requiring human time is Tier 4. The others should run automatically and fail loudly. If a registry can't allocate clinician time for monthly Tier 4 review during pilot, scale back automated outputs (e.g., produce Phenopackets but don't submit them) until that capacity exists.