Quality control framework
Registry Forge produces six classes of output (browser dashboard, master CSVs, code inventory, OMOP CDM, note extractions, Phenopackets). Every one is derived from the same underlying bundle, so a defect anywhere upstream propagates everywhere downstream. Quality control is therefore neither optional nor a one-time exercise. This page describes the recommended QC framework, organized as five tiers: four that run automatically and one (Tier 4) that needs recurring clinician time.
The QC philosophy: catch defects at the layer they originate, prefer automated checks over manual ones, and make manual review cheap enough that you actually do it.
Data-quality review frameworks adopters should use
The checks on this page are a practical starting set for Registry Forge outputs — not a complete data-quality program. For the full picture, adopters should review their data through two well-established frameworks:
- Kahn et al. (2016), A harmonized data quality assessment terminology and framework for the secondary use of electronic health record data, EGEMS 4(1):1244 (PMID 27713905) — the canonical methodological reference for observational-research data quality. Defines the conformance / completeness / plausibility taxonomy that is the standard vocabulary for talking about EHR data quality.
- The OMOP / OHDSI data quality framework, specifically The Book of OHDSI, Chapter 15 (Data Quality), and the OHDSI Data Quality Dashboard: OMOP-specific guidance and a reference tool that runs Kahn-aligned checks against an OMOP CDM instance.
Adopters running production OMOP workloads should layer those frameworks on top of the checks described below.
Tier 1 — Schema validation (every run)
Every output should pass a schema check before it leaves the workstation.
| Output | Schema |
|---|---|
| dashboard_data.json | Internal schema embedded in run_pipeline.py (column presence, type sanity) |
| dashboard_data.xlsx | Implicit (column headers; one row per record) |
| code_inventory.csv | Fixed columns; non-empty vocabulary and code |
| patient_master*.csv | Fixed 21-column header; UTF-8 with BOM; quoted cells |
| OMOP CDM v5.4 tables | Official OMOP CDM v5.4 spec — types, FK integrity, required columns |
| Phenopackets JSON | GA4GH Phenopacket v2 JSON Schema |
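Schema validation catches the most common defect class (a column rename or a missing required field) in seconds. Run it as the last step of every pipeline invocation. The OMOP CDM schemas are published as DDLs by OHDSI. The Phenopacket schema is published by GA4GH as Protocol Buffers definitions, so JSON validation needs a JSON Schema generated from those definitions or supplied by validation tooling.
A simple validation snippet for Phenopackets:

```python
import json
import jsonschema

# https://raw.githubusercontent.com/phenopackets/phenopacket-schema/master/src/main/proto/phenopackets/schema/v2/phenopackets.proto
# is a Protocol Buffers definition, not JSON, so json.loads() cannot parse it.
# Validate against a generated JSON Schema instead. Both paths below are placeholders.
with open('phenopacket_v2.schema.json', encoding='utf-8') as f:
    schema = json.load(f)
with open('path/to/<patient_id>.json', encoding='utf-8') as f:
    packet = json.load(f)

jsonschema.validate(instance=packet, schema=schema)  # raises ValidationError on a defect
```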
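The CSV outputs in the table can be checked in the same spirit. A minimal sketch for patient_master*.csv, assuming only what the table states (a UTF-8 byte-order mark and a 21-column header). The function name `check_master_csv` is hypothetical, and a real check would also compare the header against the registry's actual column names:

```python
import csv
import io

EXPECTED_COLUMNS = 21  # column count stated in the schema table


def check_master_csv(raw: bytes) -> list[str]:
    """Return a list of schema problems found in a patient_master CSV (empty if none)."""
    problems = []
    if not raw.startswith(b'\xef\xbb\xbf'):
        problems.append('missing UTF-8 BOM')
    # utf-8-sig strips the BOM when present and is harmless when it is absent.
    text = raw.decode('utf-8-sig')
    rows = list(csv.reader(io.StringIO(text, newline='')))
    if not rows:
        return problems + ['file is empty']
    header = rows[0]
    if len(header) != EXPECTED_COLUMNS:
        problems.append(f'header has {len(header)} columns, expected {EXPECTED_COLUMNS}')
    # Every record should have exactly as many cells as the header.
    for record_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(f'record {record_no} has {len(row)} cells, header has {len(header)}')
    return problems
```

Calling `check_master_csv(open(path, 'rb').read())` at the end of a run and failing on a non-empty result keeps the check to one line in the pipeline.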
Tier 2 — Mapping coverage (every run)
code_inventory.csv and unmapped_codes.csv (from the OMOP and Phenopackets ETLs) are the inputs to this tier.
Track three numbers per run:
| Metric | What it tells you |
|---|---|
| Unique source codes / total source codes | Should be stable; sudden change means upstream data source changed |
| concept_id != 0 rate in OMOP outputs | What fraction of records mapped to a standard concept |
| unmapped_codes.csv row count, ranked by frequency | The seed-table extension worklist |
Set thresholds the registry is comfortable with (e.g., "Phenopackets unmapped rate <10% on the top 100 most-referenced problem codes") and fail the pipeline if a run breaches them. The pipeline already computes these counts and writes them to pipeline_run.log; thresholding is a few lines of additional Python.
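A minimal sketch of such a threshold check, assuming the concept IDs have already been read from an OMOP output column and that 0 marks a record with no standard concept. The function names and the 10% default are illustrative and are not part of Registry Forge:

```python
from collections import Counter


def unmapped_rate(concept_ids):
    """Fraction of records whose concept_id is 0, i.e. not mapped to a standard concept."""
    ids = [int(c) for c in concept_ids]
    return sum(1 for c in ids if c == 0) / len(ids) if ids else 0.0


def worklist(unmapped_source_codes, top_n=100):
    """Rank unmapped source codes by how often they occur, most frequent first."""
    return Counter(unmapped_source_codes).most_common(top_n)


def enforce(concept_ids, max_unmapped_rate=0.10):
    """Stop the pipeline with a non-zero exit status if the threshold is breached."""
    rate = unmapped_rate(concept_ids)
    if rate >= max_unmapped_rate:
        raise SystemExit(f'unmapped rate {rate:.1%} is not below {max_unmapped_rate:.0%}')
    return rate
```

Raising `SystemExit` with a message makes the run exit non-zero, which is what a CI job or wrapper script needs in order to treat the breach as a failure.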
A note on what unmapped means. "Unmapped" is not the same as "wrong." A code that has no HPO or Mondo mapping is genuinely unmapped, but a code that does map may still map incorrectly. Mapping coverage is necessary but not sufficient. Tier 4 (manual review) is the only check that catches incorrect mappings.
Tier 3 — Cross-output consistency (every run)
The same patient should look like the same patient across every output. Cross-checks:
| Check | Expected |
|---|---|
| Patient count | Equal across patients array, OMOP PERSON.csv, Phenopackets cohort |
| Per-patient medication count (FHIR-only) | Equal between patient_master_fhir.csv filtered to medications and OMOP DRUG_EXPOSURE.csv filtered to person_id |
| Per-patient problem count | Equal between patient_master.csv filtered to problems and OMOP CONDITION_OCCURRENCE.csv plus Phenopackets phenotypicFeatures + diseases |
| ALSFRS-R total in note_extractions.csv | Present as a Measurement in the corresponding Phenopacket |
Discrepancies are usually one of three things: a record was filtered at one stage but not another; a code mapping diverged; or a patient_id normalization slipped between stages. The cross-output diff is the fastest way to find them.
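A short Python script that runs after the pipeline can put the patient identifiers from the three outputs side by side and list any ID that is missing from one of them. The per-patient count checks follow the same pattern, flagging any patient whose counts disagree by more than 1.

```python
import glob
import json
import pandas as pd

# pandas and open() do not expand wildcards, so resolve the release folders first.
# Taking the last match in sorted order is an assumption; pin the path if several releases coexist.
omop_dir = sorted(glob.glob('omop_output_*'))[-1]
pp_dir = sorted(glob.glob('phenopackets_output_*'))[-1]

pm = pd.read_csv('patient_master.csv', encoding='utf-8-sig', dtype=str)
pers = pd.read_csv(f'{omop_dir}/PERSON.csv', encoding='utf-8-sig', dtype=str)
with open(f'{pp_dir}/cohort.json', encoding='utf-8') as f:
    pp = json.load(f)

ids_master = set(pm.query("category == 'patient'")['patient_id'])
ids_omop = set(pers['person_source_value'])
ids_pp = {m['subject']['id'] for m in pp['members']}
all_ids = ids_master | ids_omop | ids_pp
print('Missing from master:      ', all_ids - ids_master)
print('Missing from OMOP:        ', all_ids - ids_omop)
print('Missing from Phenopackets:', all_ids - ids_pp)
```

Any ID that appears in any of those three sets is a defect.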
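The per-patient count checks in the table can be sketched along the same lines. The example assumes each output has already been reduced to a mapping of patient ID to record count (for instance with a pandas `groupby(...).size()`). `compare_counts` is a hypothetical helper, and its default tolerance of 1 is the figure this page uses for count disagreements:

```python
def compare_counts(counts_a, counts_b, tolerance=1):
    """Return {patient_id: (count_a, count_b)} where the counts differ by more than tolerance.

    A patient missing from one side is treated as having a count of 0 there.
    """
    flagged = {}
    for pid in sorted(set(counts_a) | set(counts_b)):
        a = counts_a.get(pid, 0)
        b = counts_b.get(pid, 0)
        if abs(a - b) > tolerance:
            flagged[pid] = (a, b)
    return flagged


# Illustrative counts only: problems per patient in the master CSV vs. OMOP.
master = {'P001': 4, 'P002': 2, 'P003': 7}
omop = {'P001': 4, 'P002': 3, 'P003': 3}
print(compare_counts(master, omop))  # prints {'P003': (7, 3)}
```

P002 differs by exactly 1 and is therefore not flagged; setting `tolerance=0` turns the check into strict equality, which is what the table's "Equal" column describes.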
Tier 4 — Spot-review by a clinician (recurring)
Tiers 1–3 catch structural defects. Tier 4 catches semantic defects: codes that map to the wrong target, dates that are off, regex patterns that captured the wrong number from a sentence, code prioritizations that hide an important coding.
Recommended cadence: monthly during pilot, quarterly thereafter, after any change to the seed mapping tables, and after every Athena vocabulary release update.
Recommended sample: stratified random sample of N=5–10 patients. For each, the reviewer looks at:
- The dashboard's per-patient view, side-by-side with
- The patient's row in patient_master.csv, and
- The corresponding <patient_id>.json Phenopacket
Cross-referencing the three is the canonical clinician QC workflow. Discrepancies fall into a small number of buckets:
| Bucket | Example | Fix |
|---|---|---|
| Wrong code mapping | SNOMED 86044005 mapping to a non-ALS HPO term | Edit seed table |
| Missed phenotype | "Spasticity" in narrative not captured anywhere | Add regex pattern OR seed mapping |
| Mis-extracted value | ALSFRS-R captured 38 from "ALSFRS-R 38/48 in March 2023" but the year was actually 2024 | Tighten regex |
| Date drift | A medication start_date that's months earlier in OMOP than in Phenopackets | Trace through merge logic |
| Identifier drift | Patient appears in dashboard but not in OMOP | Normalization mismatch upstream |
Build a short reviewer template (PDF or Word) that lists the bucket categories and a free-text field for "other." File the completed reviews in version control so the registry has a paper trail of QC findings and the seed-table edits that resulted.
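Drawing the stratified sample for each review can be scripted so that it is reproducible and can be filed with the completed review. A minimal sketch, assuming a hypothetical stratum label per patient (for example the data source; this page does not prescribe one) and a fixed random seed:

```python
import random
from collections import defaultdict


def stratified_sample(patient_strata, n, seed=0):
    """Pick up to n patient IDs, spreading the picks across strata round-robin.

    patient_strata maps patient_id -> stratum label (both strings).
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for pid, stratum in sorted(patient_strata.items()):
        by_stratum[stratum].append(pid)
    for ids in by_stratum.values():
        rng.shuffle(ids)
    sample = []
    # Take one patient from each stratum in turn until n are chosen or none remain.
    while len(sample) < n and any(by_stratum.values()):
        for stratum in sorted(by_stratum):
            if by_stratum[stratum] and len(sample) < n:
                sample.append(by_stratum[stratum].pop())
    return sample
```

Recording the seed next to the reviewer's findings lets anyone regenerate the same patient list later.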
Tier 5 — Synthetic regression test (every change)
The synthetic demonstration cohort (Jane Marie Demo) is the regression bedrock. Every code path in the pipeline runs against a bundle whose answers are known. When a defect is fixed in production data, write a fixture for it — add a clinically-realistic narrative or coded record to the synthetic CCDA that exercises the previously-broken path — and confirm the pipeline still produces the expected output.
The synthetic cohort is small enough to run end-to-end in seconds, so making this a pre-commit / CI check is reasonable. A minimum CI check:
```shell
python run_pipeline.py        # produces dashboard_data.json + master CSVs + code_inventory.csv
python omop_etl.py            # produces OMOP output
python note_extraction.py     # produces note_extractions.csv
python phenopackets_etl.py    # produces Phenopackets output
python qc_assert.py           # the script that codifies the expected counts
```
qc_assert.py reads the outputs and checks invariants: patient count = 1, ALS appears in diseases, ALSFRS-R total appears in measurements, etc. Any failure is a regression.
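The contents of qc_assert.py are not shown on this page. The sketch below illustrates the three invariants named above against a Phenopackets cohort object with a `members` list, the same shape the Tier 3 script reads. Matching on the labels "amyotrophic lateral sclerosis" and "ALSFRS-R" is an assumption made for illustration; a real check would more likely match the ontology IDs the ETL emits:

```python
def check_cohort(cohort):
    """Return a list of failed invariants for the synthetic single-patient cohort."""
    failures = []
    members = cohort.get('members', [])
    if len(members) != 1:
        failures.append(f'expected 1 patient, found {len(members)}')
    for packet in members:
        pid = packet.get('subject', {}).get('id', '?')
        # Each disease carries an ontology term with an id and a label.
        disease_labels = [d.get('term', {}).get('label', '').lower()
                          for d in packet.get('diseases', [])]
        if not any('amyotrophic lateral sclerosis' in label for label in disease_labels):
            failures.append(f'{pid}: ALS missing from diseases')
        # Each measurement names its assay with an ontology term.
        assay_labels = [m.get('assay', {}).get('label', '').upper()
                        for m in packet.get('measurements', [])]
        if not any('ALSFRS-R' in label for label in assay_labels):
            failures.append(f'{pid}: ALSFRS-R total missing from measurements')
    return failures
```

In CI, the script would load cohort.json, call `check_cohort`, print any failures, and exit non-zero when the list is not empty.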
Reproducibility expectations
Every output that depends on a vocabulary should record the vocabulary version it used. OMOP's CDM_SOURCE.csv does this for the standard concept set; the Phenopackets metaData.resources[] block does the same for HPO/Mondo/LOINC/RxNorm/SNOMED/ICD-10-CM/HGNC. The folder names — omop_output_<vocab_release>/ and phenopackets_output_HPO-<release>_Mondo-<release>/ — carry the same information at the filesystem level.
The combination means that for any output file, anyone can reconstruct the exact mapping decisions that produced it. This matters for audits, for downstream sharing, and for the case where a vocabulary update produces a different mapping for the same source code.
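To see which vocabularies changed between two runs, the recorded versions can be read back out of the Phenopackets. A minimal sketch, assuming each entry in metaData.resources[] carries the `namespacePrefix` and `version` fields of the Phenopacket v2 Resource element; the function names are hypothetical:

```python
def resource_versions(phenopacket):
    """Map each resource's namespacePrefix (falling back to its id) to its recorded version."""
    resources = phenopacket.get('metaData', {}).get('resources', [])
    return {r.get('namespacePrefix', r.get('id', '?')): r.get('version', '')
            for r in resources}


def version_changes(old_packet, new_packet):
    """Return {prefix: (old_version, new_version)} for vocabularies whose version differs."""
    old = resource_versions(old_packet)
    new = resource_versions(new_packet)
    return {prefix: (old.get(prefix), new.get(prefix))
            for prefix in sorted(set(old) | set(new))
            if old.get(prefix) != new.get(prefix)}
```

An empty result from `version_changes` means any mapping difference between the two runs came from the seed tables or the source data rather than from a vocabulary release.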
Summary — what to do, when
| Trigger | Tiers to run |
|---|---|
| Every pipeline invocation | 1, 2, 3, 5 (automated) |
| Pilot phase, monthly | 4 (clinician spot-review) |
| Post-pilot, quarterly | 4 |
| After seed-table edit | 4 + 5 (focused on changed mappings) |
| After Athena vocabulary update | All five tiers; diff vs. previous run |
| Before submitting Phenopackets to Matchmaker | 4 on the submission set |
| Before publishing OMOP cohort to a federated study | All five tiers |
The only tier requiring human time is Tier 4. The others should run automatically and fail loudly. If a registry can't allocate clinician time for monthly Tier 4 review during pilot, scale back automated outputs (e.g., produce Phenopackets but don't submit them) until that capacity exists.