Exploratory data analysis (EDA)

Beyond the searchable record-explorer dashboard, Registry Forge - Patient Edition can generate four additional EDA artifacts that summarize the cohort visually and in tables. They follow the EDA pattern of the parent Registry Forge, so analyses translate between the two.

Under construction

The EDA module is a starter implementation. Charts, report sections, and column groupings may change between releases. The ydata-profiling and sweetviz reports embed the full underlying data and are NOT de-identified - do not share generated HTML that was built from PHI.

The four EDA artifacts

| File | What's in it | Best for |
| --- | --- | --- |
| eda_dashboard.html | Custom Chart.js charts (categories, vocabularies, top dx/rx, time series) | Fast visual overview, no dependencies |
| profile_master.html | ydata-profiling report on long-format records | Per-column types, distributions, missingness |
| profile_features.html | ydata-profiling report on the patient feature matrix | Patient-level patterns, correlations |
| sweetviz_master.html | sweetviz report with type detection and overview tables | Quick scan, target-aware exploration when applicable |

How to generate them

# Custom dashboard only (no extra dependencies)
registryforge-patient parse ./your/ccda/folder --output ./out --eda

# Full set (requires the optional extras)
pip install "registryforge-patient[eda]"   # quotes keep shells such as zsh from expanding the brackets
registryforge-patient parse ./your/ccda/folder --output ./out --eda

From Python:

from registryforge_patient import build_outputs

build_outputs(
    input_path='./your/ccda/folder',
    output_dir='./out',
    with_eda=True,
    eda_is_phi=False,    # set True for real patient data
)

Outputs land in <output_dir>/eda/.
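To confirm which artifacts a run actually produced, a small check along the following lines may help. It is a minimal sketch rather than part of the package: `check_eda_outputs` is a hypothetical helper, and it assumes only the file names from the table above and the `<output_dir>/eda/` layout.

```python
from pathlib import Path

# File names as listed in the artifact table; adjust if a release renames them.
EXPECTED_EDA_FILES = (
    "eda_dashboard.html",
    "profile_master.html",
    "profile_features.html",
    "sweetviz_master.html",
)


def check_eda_outputs(output_dir):
    """Return {file name: size in bytes, or None if the file is missing}."""
    eda_dir = Path(output_dir) / "eda"
    sizes = {}
    for name in EXPECTED_EDA_FILES:
        path = eda_dir / name
        sizes[name] = path.stat().st_size if path.is_file() else None
    return sizes


if __name__ == "__main__":
    # "./out" matches the --output value used in the commands above.
    for name, size in check_eda_outputs("./out").items():
        status = "missing" if size is None else f"{size / 1_000_000:.1f} MB"
        print(f"{name}: {status}")
```

If the profiling and sweetviz files are reported as missing while the dashboard is present, the likely cause is that the optional extras are not installed.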

Custom EDA dashboard (demo)

Generated from the synthetic demo files in sample_data/. Three demo patients (Jane Demo, Joe Demo, Alex Demo), three years of records each.

Synthetic data

The patients and records shown below are not real. Distributions are illustrative; do not draw clinical or epidemiological conclusions from them.

Open the dashboard in a new tab

ydata-profiling and sweetviz reports

These require the optional [eda] extras (pip install registryforge-patient[eda]). The generated HTML files are typically 5-30 MB each and take 1-5 minutes to produce.

Demo reports against sample_data/ are not shipped in the docs site because the synthetic cohort is too small (3 patients) to show meaningful distributions. Generate your own against either your real data or a larger synthetic cohort:

registryforge-patient parse ./your/ccda/folder --output ./out --eda
ls out/eda/
# eda_dashboard.html
# profile_master.html
# profile_features.html
# sweetviz_master.html

What the charts show

The custom dashboard has seven charts:

Records by category - doughnut chart of how many rows fall into each Registry Forge category (problems, medications, labs_vitals, procedures, allergies, immunizations, encounters, notes, etc.). Whether the cohort is dominated by problem-list rows or by encounters reflects how the source EHR structures its exports.

Records by vocabulary - doughnut chart of code system mix. Healthy C-CDA exports use SNOMED for problems, RxNorm for medications, LOINC for labs. Heavy ICD-10-CM presence suggests problem-list rows that haven't been mapped through to SNOMED.

Records per year - bar chart of records timestamped to each year. Useful for spotting data gaps (e.g. years before the patient enrolled in MyChart) or backfill artifacts (huge spike in a single year when historical records got bulk-loaded).

Top 15 diagnoses - horizontal bar of the most prevalent diagnoses by unique-patient count (not record count). Counting patients rather than rows means a single chronic condition repeated in every export does not dominate the ranking.

Top 15 medications - same logic for medications, grouped by RxNorm code.

Top 10 labs/vitals by record count - which lab tests show up most. Vitals like blood pressure, weight, and pulse usually top this; routine chemistry panels follow.

Patient demographics - doughnut of gender breakdown from the feature matrix. With one patient this is a single slice; with a cohort it shows the gender mix.
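The difference between ranking by unique patients and ranking by record count (used for the top diagnoses and medications charts) can be illustrated with a short sketch. This is not the dashboard's actual code: the function name, the `patient_id` and `code` keys, and the placeholder codes `dx_a` and `dx_b` are assumptions made for the example.

```python
from collections import defaultdict


def top_by_unique_patients(records, n=15):
    """Rank codes by the number of distinct patients with at least one record."""
    patients_by_code = defaultdict(set)
    for rec in records:
        # A set keeps each patient once per code, however many rows repeat it.
        patients_by_code[rec["code"]].add(rec["patient_id"])
    ranked = sorted(
        patients_by_code.items(),
        key=lambda item: (-len(item[1]), item[0]),  # most patients first, then code
    )
    return [(code, len(patients)) for code, patients in ranked[:n]]


# Hypothetical rows: one patient carries dx_a in five exports,
# while three different patients each have dx_b once.
records = [{"patient_id": "p1", "code": "dx_a"} for _ in range(5)]
records += [{"patient_id": pid, "code": "dx_b"} for pid in ("p1", "p2", "p3")]

print(top_by_unique_patients(records))
# [('dx_b', 3), ('dx_a', 1)]
```

By record count `dx_a` would rank first (5 rows against 3); by unique patients `dx_b` does, which is the behavior described for these two charts.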

Disclaimers

Every generated report has an embedded disclaimer banner.

For runs against sample_data/ or other synthetic sources, the banner is amber and reads:

Synthetic demo data. This report was generated from synthetic C-CDA documents that ship with Registry Forge - Patient Edition for demonstration purposes. The patients Jane Demo, Joe Demo, and Alex Demo are not real. The distributions, time series, and correlations shown are illustrative only - they do not represent any real cohort. Do not use these patterns to draw clinical or epidemiological conclusions.

For runs against real patient data (eda_is_phi=True), the banner is red and reads:

Contains PHI. This report was generated from identifiable patient-mediated health records. It embeds full record data including diagnoses, medications, lab values, and dates. Do NOT share this file by email, cloud sync, or any unencrypted channel. Do NOT commit it to any repository unless consulting with experts. Treat it as you would the source records under HIPAA, your IRB protocol, and any applicable data-use agreement.

The disclaimer is injected directly into every generated HTML file, so it travels with the report even if the file is moved.
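One way such an injection could work is sketched below. It is an illustration under stated assumptions, not the module's implementation: `inject_banner`, the inline styles, and the exact amber and red color values are all hypothetical.

```python
import re
from html import escape

# Hypothetical styles: amber for synthetic runs, red for PHI runs.
BANNER_STYLES = {
    "synthetic": "background:#ffbf00;color:#000",
    "phi": "background:#b00020;color:#fff",
}


def inject_banner(document, message, kind):
    """Insert a disclaimer <div> right after the opening <body> tag.

    If the document has no <body> tag, the banner is prepended instead.
    """
    banner = (
        f'<div role="alert" style="{BANNER_STYLES[kind]};padding:8px">'
        f"{escape(message)}</div>"
    )
    match = re.search(r"<body[^>]*>", document, flags=re.IGNORECASE)
    if match is None:
        return banner + document
    pos = match.end()
    return document[:pos] + banner + document[pos:]


page = "<html><body><h1>Report</h1></body></html>"
print(inject_banner(page, "Contains PHI.", "phi"))
```

Because the banner becomes part of the HTML itself rather than a separate file, copying or renaming the report cannot separate the two.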

What's not done yet

  • HPO-mapped phenotypic features as a category in the EDA
  • Lab values plotted over time per patient (single-patient timeline view)
  • Cohort-comparison reports (sweetviz's main strength) - the current sweetviz call uses single-table mode
  • Auto-detection of meaningful target variables for sweetviz
  • Drug-class breakdown via ATC (currently flat by RxNorm code)
  • A per-patient one-page summary suitable for sharing with the patient themselves

These will likely arrive in future releases as the module matures.