Cohort EDA report

A single self-contained interactive HTML file summarizing cohort demographics, code coverage, and temporal patterns — ready to share with collaborators who don't have access to the underlying bundle. No PHI is included; identifiers are pseudonymized and dates are aggregated.

Generated by cohort_eda.py from dashboard_data.json in seconds, with one entry point:

import cohort_eda
cohort_eda.generate_report(
    bundle_path = 'dashboard_data.json',
    out_path    = 'cohort_eda_report.html',
    k           = 5,                       # k-anonymity threshold
    cohort_name = 'ALS TDI ARC Study',
)

Live demo

Below is the actual cohort EDA report, pre-loaded with a synthetic 71-patient ALS-shaped cohort that exercises every section. Click through the tabs and try the search/filter controls in the last two tabs (Top codes and Patients).

What's in the report

Eight tabs, each subject to the privacy transforms described below.

Cohort — total N, sex distribution donut, age band histogram (10-year bands, 90+ collapsed per HIPAA Safe Harbor), race and ethnicity bars. Median observation length and percent female shown as headline cards.

Data volume — total records per clinical category (problems, medications, labs, etc.), unique codes per category, distribution of records per patient as a histogram, and a source-format pie (FHIR vs CCDA vs PDF vs RTF).

Coverage — observation period histogram in years (per patient, from earliest to latest dated record), plus banded buckets (<1 year, 1–2 years, 2–5 years, 5–10 years, 10–20 years, 20+ years).

Trends — stacked-bar showing records per calendar year, broken out by clinical category. Year-level aggregation only — exact dates are never plotted.

Vocabularies — horizontal bar of records by coding system across all categories. Raw OIDs are normalized to friendly names (SNOMED CT, LOINC, RxNorm, ICD-10-CM, plus an EHR-vendor-local rollup category per detected vendor) so a single vendor flowsheet table doesn't dominate with twenty separate bars.

ALS signal — count of patients with any motor-neuron-disease-spectrum marker, plus a horizontal bar of per-marker counts with k-anonymity suppression. The markers are:

  • Diagnosis codes: ICD-10-CM G12.20/21/22/23/29, ICD-9-CM 335.20–29, SNOMED CT 86044005 / 230258005, Mondo MONDO:0004976 / MONDO:0019056
  • ALSFRS-R: LOINC 82953-1 (total score), the 12 item codes 82940-8 through 82952-3, and the panel code 82954-9
  • FVC: LOINC 19868-9 / 19870-5 / 19872-1 / 19874-7
  • Medications: riluzole, edaravone, tofersen, sodium phenylbutyrate-taurursodiol (Relyvrio)

Top codes — top 25 codes per vocabulary, ranked by total references, with a search box and a vocabulary filter dropdown. Patient counts below the k-anonymity threshold are shown as <5. Where the friendly vocabulary name was derived from an OID (e.g. a vendor-local rollup such as Vendor-A local from urn:oid:<vendor-enterprise-root>.<customer>.<table>.…), passing show_raw_oids=True to generate_report shows the raw OID as a small gray sub-line on each row so analysts can tell different vendor tables apart. The customer-specific portion of an EHR-vendor enterprise OID can often be traced back to a specific health system, so it identifies the data source even though it is not a HIPAA Safe Harbor identifier. Leave show_raw_oids at its default of False to suppress the sub-line; this is the right setting for reports leaving the registry.

Patients — one row per patient with the synthetic ID, sex, age band, race, ethnicity, observation period band, and a banded record count (1–9, 10–49, 50–99, 100–499, 500–999, 1000+). Diagnostic codes are not shown per patient to prevent re-identification of rare disease cases. The row also flags whether any ALS markers were seen but does not name them.
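The Coverage and Trends aggregates described above can be sketched as follows. This is an illustration only, not the code in cohort_eda.py: the function names, the input shapes, the 365.25-day year, and the choice to treat each band's upper bound as exclusive are all assumptions.

```python
from collections import Counter
from datetime import date

# (exclusive upper bound in years, label) pairs matching the Coverage tab's bands.
# Treating the upper bound as exclusive is an assumption; the docs do not say.
BANDS = [(1, "<1 year"), (2, "1–2 years"), (5, "2–5 years"),
         (10, "5–10 years"), (20, "10–20 years")]


def observation_band(record_dates):
    """Band the span between a patient's earliest and latest dated record."""
    years = (max(record_dates) - min(record_dates)).days / 365.25
    for upper, label in BANDS:
        if years < upper:
            return label
    return "20+ years"


def records_per_year(records):
    """Count (calendar year, category) pairs; month and day are discarded."""
    return Counter((recorded_on.year, category) for category, recorded_on in records)
```

Only the band label and the yearly counts would need to reach the report; the dates themselves stay inside these functions.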

Privacy transforms applied

The output contains no PHI by design. Privacy protections are layered:

  • Patient identifiers — every patient gets a synthetic PT-NNNN identifier assigned by sorted-order pseudonymization. The mapping is computed at generation time and is not written to disk; running the report twice against the same cohort produces stable IDs but reading the report alone gives no way to invert them.

  • Ages — dates of birth are converted to 10-year age bands (<20, 20–29, …, 80–89). Ages over 89 collapse to 90+ per the HIPAA Safe Harbor de-identification rule §164.514(b)(2)(i)(C). The raw DOB never leaves the bundle.

  • Observation period — reported as a duration in years, never as absolute calendar dates. A patient with a 4-year follow-up shows in the 2–5 years band; the actual dates aren't in the report.

  • Activity over time — year-level aggregation only. Even the trend chart never shows a month or day.

  • k-anonymity — every cross-tab cell with fewer than k patients (default 5) is suppressed. In tables the cell renders as <5; in categorical breakdowns small categories collapse to Other (<k each) so they don't disappear from sight but can't be inverted to identify the underlying individuals.

  • No free text — note narratives and clinical comments are never included.

  • No diagnostic codes per patient — the per-patient table excludes condition codes specifically because a rare condition + sex + age band can be re-identifying in a small cohort.
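The first two transforms in the list above can be sketched as follows. The function names are hypothetical, and several details are assumptions the text does not settle: that the sort key is the source identifier itself, that numbering starts at 0001, and that age is measured against a caller-supplied reference date.

```python
from datetime import date


def pseudonymize(patient_ids):
    """Assign PT-NNNN labels by sorted order of the source identifiers."""
    ordered = sorted(set(patient_ids))
    # The mapping lives only in memory; nothing here writes it to disk.
    return {pid: f"PT-{i:04d}" for i, pid in enumerate(ordered, start=1)}


def age_band(dob, as_of):
    """Map a date of birth to a 10-year band, collapsing ages over 89 to 90+."""
    # Whole years elapsed; subtract one if the birthday has not yet occurred.
    age = as_of.year - dob.year - ((as_of.month, as_of.day) < (dob.month, dob.day))
    if age < 0:
        raise ValueError("date of birth is after the reference date")
    if age < 20:
        return "<20"
    if age >= 90:
        return "90+"
    lower = (age // 10) * 10
    return f"{lower}–{lower + 9}"
```

In this sketch the IDs are stable only while the set of patients is unchanged: adding or removing a patient shifts the numbering of everyone sorted after them.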

For the most conservative settings on rare disease cohorts, raise k from 5 to 10 or 15; it is a single parameter change. The ALS signal tab in particular benefits from a higher k if your cohort has very specific subtypes.
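A minimal sketch of the k-anonymity rule, assuming per-category patient counts arrive as a plain dict. The function names are illustrative and are not taken from cohort_eda.py.

```python
def suppress_small_cells(counts, k=5):
    """For tables: replace any count below k with the string '<k'."""
    return {key: (n if n >= k else f"<{k}") for key, n in counts.items()}


def collapse_small_categories(counts, k=5):
    """For charts: merge every category with fewer than k patients into 'Other'."""
    kept = {key: n for key, n in counts.items() if n >= k}
    other = sum(n for n in counts.values() if n < k)
    if other:
        kept[f"Other (<{k} each)"] = other
    return kept
```

For example, with k=5 the counts {"A": 12, "B": 3, "C": 4} become {"A": 12, "Other (<5 each)": 7} in a chart. The text does not say what happens when the Other bucket itself totals fewer than k; this sketch leaves such a bucket as-is, so a real implementation would need to decide that case.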

Vocabulary normalization

A real production FHIR export typically carries 20+ distinct vocabulary strings even though most resolve to the same handful of code systems. Without normalization the Vocabularies tab fills with unreadable noise. The report consolidates:

Standard code systems (resolved by OID, URL, or string alias):

  • SNOMED CT (2.16.840.1.113883.6.96 / http://snomed.info/sct)
  • RxNorm (2.16.840.1.113883.6.88)
  • LOINC (2.16.840.1.113883.6.1)
  • ICD-10-CM (2.16.840.1.113883.6.90) and ICD-10-PCS (2.16.840.1.113883.6.4) tracked as separate code systems — one is diagnosis coding, the other is inpatient procedure coding
  • ICD-9-CM (diagnosis) and ICD-9-PCS (procedure; formally ICD-9-CM Volume 3) likewise separated
  • CPT-4 (2.16.840.1.113883.6.12)
  • HCPCS Level II (2.16.840.1.113883.6.14)
  • CVX (2.16.840.1.113883.12.292)
  • NDC (2.16.840.1.113883.6.69)
  • CDC Race & Ethnicity rollup (aggregated for cohort-summary readability; CDC publishes multiple OIDs that map to this label)
  • Common HL7 v3 code systems (ActCode, RoleCode, Confidentiality, ObservationInterpretation, AdministrativeGender)

Third-party clinical terminologies (resolved by name):

  • IMO ProblemIT (Intelligent Medical Objects problem list terminology)
  • Medi-Span Drug Descriptor / Generic Product Identifier
  • NDDF (First Databank National Drug Data File)

EHR-vendor-local code systems. When an OID, URL, or string label matches a known EHR-vendor enterprise root, the record is grouped under a per-vendor "local" rollup label so a single vendor's flowsheet tables don't fragment the Vocabularies tab into many tiny bars. Aggregation keys still use the raw vocabulary string so codes from different vendor tables don't collide; only the display label is rolled up. The Top codes table can optionally show the raw OID as a small sub-line, gated on show_raw_oids=True (default False; see the Top codes entry above for the privacy rationale).

The specific OID prefixes and string patterns that map to each vendor rollup are coded in the OID_FRIENDLY_NAMES, STRING_ALIASES, SUBSTRING_TO_LABEL, and prefix-matching blocks at the top of cohort_eda.py. Adopters are expected to extend these dictionaries for the vendors and local code-system strings their data actually contains.
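As a rough illustration of how such dictionaries might be consulted, the sketch below resolves a raw vocabulary string to a display label. The three dictionary names come from cohort_eda.py, but their contents here, the lookup order, the OID_PREFIX_TO_LABEL name, and the 1.2.3.4 vendor root are placeholders chosen for the example.

```python
from collections import Counter

OID_FRIENDLY_NAMES = {
    "2.16.840.1.113883.6.96": "SNOMED CT",
    "2.16.840.1.113883.6.1": "LOINC",
    "2.16.840.1.113883.6.88": "RxNorm",
}
STRING_ALIASES = {"http://snomed.info/sct": "SNOMED CT"}
SUBSTRING_TO_LABEL = {"medi-span": "Medi-Span", "nddf": "NDDF"}
# Hypothetical vendor root: 1.2.3.4 is a placeholder, not a real enterprise OID.
OID_PREFIX_TO_LABEL = {"1.2.3.4.": "Vendor-A local"}


def friendly_name(raw):
    """Return a display label for a raw vocabulary string, or None if unmapped."""
    s = raw.strip()
    oid = s[len("urn:oid:"):] if s.lower().startswith("urn:oid:") else s
    if oid in OID_FRIENDLY_NAMES:
        return OID_FRIENDLY_NAMES[oid]
    if s.lower() in STRING_ALIASES:
        return STRING_ALIASES[s.lower()]
    for prefix, label in OID_PREFIX_TO_LABEL.items():
        if oid.startswith(prefix):
            return label
    for needle, label in SUBSTRING_TO_LABEL.items():
        if needle in s.lower():
            return label
    return None


def display_totals(raw_counts):
    """Roll raw-keyed counts up to display labels; raw keys stay the source of truth."""
    totals = Counter()
    for raw, n in raw_counts.items():
        totals[friendly_name(raw) or raw] += n
    return totals
```

Keeping counts keyed on the raw string and applying friendly_name only at display time is what lets two vendor tables share one bar without their codes colliding.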

Catch-all fallbacks (only fire when no specific mapping matches):

  • Bare numeric OIDs that don't match any known prefix → Unmapped OID (<first 20 chars>…). We do not assume an unmapped OID is local — it could be a private registry, a regional code system, or a standard OID that's not in our list yet.
  • Bare integer strings → HL7 v2 table <N>. HL7 v2 table numbers (0001 Administrative Sex, 0078 Interpretation codes, 0203 Identifier type, etc.) come from the HL7 v2.x specification, so a bare integer is not assumed to be a local code system. The specification distinguishes HL7-defined tables from user-defined ones whose values sites may set themselves, so analysts who see a v2 table number should look it up to determine which kind it is.
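The two fallbacks above might look like the sketch below. The function name and the regular expressions are assumptions, as are the choices to keep leading zeros in the table number, to always append the ellipsis, and to return any other string unchanged.

```python
import re


def fallback_label(raw):
    """Label a vocabulary string that matched no specific mapping."""
    s = raw.strip()
    if s.lower().startswith("urn:oid:"):
        s = s[len("urn:oid:"):]
    if re.fullmatch(r"\d+", s):
        # Bare integer: treat as an HL7 v2 table number.
        return f"HL7 v2 table {s}"
    if re.fullmatch(r"\d+(?:\.\d+)+", s):
        # Dotted numeric OID with no known prefix: show only the first 20 characters.
        return f"Unmapped OID ({s[:20]}…)"
    return s
```

For instance, fallback_label("0203") gives "HL7 v2 table 0203", and the made-up OID "1.2.840.99999.1.2.3.4.5.6.7" gives "Unmapped OID (1.2.840.99999.1.2.3.…)".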

Help adapting normalization to your EHR vendor

The normalization rules above are based on what we have seen in our own production data. They cover the most common EHR-vendor patterns we have encountered, but EHR exports vary enormously between vendors, deployments, and individual sites. If you are running Registry Forge against an export with vocabulary strings or OID patterns you would like help normalizing — particularly for vendor-specific or site-local code systems — please email dboyce@als.net. We are happy to look at sample (de-identified) vocabulary strings privately and recommend mappings, and to fold useful patterns into a future release where appropriate.

Sharing the report

The HTML file needs no server or supporting files: open it in any browser, or share it via email or a Drive link. The only network dependency is Chart.js, which loads from cdnjs at view time and renders the bars and donuts. If a colleague views the report offline, the tables and per-patient grid still render; only the charts are blank.
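Recipients who want to confirm what the file fetches at view time could list its external resource references with a short script like the one below. It is a sketch using Python's standard html.parser, not a tool shipped with the report; the tag list is an assumption, and it does not detect requests made from inside JavaScript.

```python
from html.parser import HTMLParser

# Tags whose attribute triggers a fetch when the page loads (assumed list).
LOADING_ATTRS = {"script": "src", "img": "src", "iframe": "src", "link": "href"}


class ExternalRefFinder(HTMLParser):
    """Collect (tag, url) pairs for resources loaded from another host."""

    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        wanted = LOADING_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value and value.startswith(("http://", "https://", "//")):
                self.refs.append((tag, value))


def external_refs(html_text):
    finder = ExternalRefFinder()
    finder.feed(html_text)
    finder.close()
    return finder.refs
```

Run against the report's HTML, for example external_refs(open('cohort_eda_report.html', encoding='utf-8').read()), the result should contain only the Chart.js script if the description above holds.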

For colleagues who need to verify there's no PHI in the file, point them at the privacy banner at the top of every report ("Privacy protections applied: …") and at this docs page.