Stage 5 - Dashboard & feature matrix

Stage 5 takes the assembled patient_master.csv and produces two analyst-facing artifacts: a self-contained HTML dashboard and an ML-ready patient × feature matrix.

The dashboard

build_dashboard(df, mode, output_path) writes a single HTML file that contains everything - data, styling, search logic. No external scripts, no CDN, no server. Open it from a USB stick on an air-gapped machine; it still works.

What goes into the HTML

  • Embedded data. Records (minus the heavy raw_record_json column to keep file size manageable) are serialized to JSON and dropped into a <script type="application/json"> tag. A page with ~20,000 records produces an HTML file around 5–10 MB.
  • Controls. Global search input, patient dropdown (multi-patient mode only), category dropdown, vocabulary dropdown, export-filtered-CSV button.
  • Table. Sortable columns, hover highlighting, category pills, truncated long text with full text on tooltip.
  • Footer. Schema reference and generation timestamp.
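
The embedding step can be sketched roughly as follows. This is a minimal illustration, not the actual body of build_dashboard: the helper name embed_records and the element id "records" are hypothetical, and only the raw_record_json column name comes from the description above.

```python
import json

import pandas as pd


def embed_records(df: pd.DataFrame, element_id: str = "records") -> str:
    """Return a <script type="application/json"> tag holding df's rows.

    Hypothetical helper; the real build_dashboard may differ.
    """
    # Drop the heavy raw payload column (if present) to keep the file small.
    slim = df.drop(columns=["raw_record_json"], errors="ignore")
    # Round-trip through pandas' JSON writer so NaN becomes null.
    records = json.loads(slim.to_json(orient="records"))
    payload = json.dumps(records, ensure_ascii=False)
    # Escape "</" so a value such as "</script>" cannot close the tag early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/json" id="{element_id}">{payload}</script>'
```

In the browser, the page's own script would read the tag's text content and parse it with JSON.parse, so no network request is needed.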

Why client-side only

The same reasons Registry Forge does it:

  • Offline. Researchers email these to collaborators or attach them to IRB submissions. They have to work without network.
  • No deployment. No Flask, no Streamlit, no Dash. The file is the app.
  • Auditable. A reviewer can open the HTML in a text editor and verify exactly what JavaScript runs and what data is embedded.

Rendering performance

The first 2,000 filtered rows are rendered to the DOM. If more match, a footer row says so and prompts the user to narrow the filter. This keeps the browser responsive even on a 50,000-row cohort. The underlying filter set is still in memory, so the Export filtered CSV button gives you all matching rows - not just the displayed ones.

The feature matrix

patient_features.csv is a wide DataFrame indexed by patient_id, with one column per engineered feature.

Features built

| Feature group | Columns | How |
| --- | --- | --- |
| Demographics | age_years, gender_male, gender_female, gender_unknown, ... | Age computed from DOB to today; gender one-hot. |
| Category counts | n_problems, n_medications, n_labs_vitals, ... | Group-by-category record counts after deduplication. |
| Top-K diagnoses | dx__55505003_Motor_neuron_disease, ... | Top 30 most-common problem codes by unique patient count; binary 0/1 per patient. One column per SNOMED code, with a canonical display name. |
| Top-K medications | rx__9468_Riluzole_50_MG_Oral_Tablet, ... | Top 30 RxNorm codes by unique patients; binary. Grouped by code only, so the same drug appearing with different prescription SIGs collapses to one column. |
| Top-K labs (numeric) | lab__2160-0_Creatinine_serum, ... | Top 20 LOINC codes by unique patients; value is the patient's mean numeric result. Grouped by LOINC only, so the same lab appearing with different reference ranges or unit strings collapses. |

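K is configurable in the notebook. With the defaults (30 diagnoses, 30 medications, 20 labs), the matrix stays at about 80–100 columns even on Epic exports with verbose display names, because columns are grouped by code rather than by display string.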
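
As a rough sketch of how two of these feature groups could be derived with pandas, the snippet below builds category counts and per-patient lab means from a toy frame. The category labels (problems, medications, labs_vitals) are inferred from the n_* column names and are an assumption, as is the exact column layout of patient_master.csv; the real notebook may do this differently.

```python
import pandas as pd

# Toy stand-in for patient_master.csv; category labels are assumed.
master = pd.DataFrame({
    "patient_id": ["p1", "p1", "p1", "p2", "p2"],
    "category": ["problems", "labs_vitals", "labs_vitals", "labs_vitals", "medications"],
    "code": ["55505003", "2160-0", "2160-0", "2160-0", "9468"],
    "value": [None, "1.0", "1.2", "0.8", None],
})

# Category counts: one n_<category> column per category, zero-filled.
counts = (
    master.groupby(["patient_id", "category"]).size()
    .unstack(fill_value=0)
    .add_prefix("n_")
)

# Lab means: coerce values to numeric, then average per patient and code.
labs = master[master["category"] == "labs_vitals"].copy()
labs["value"] = pd.to_numeric(labs["value"], errors="coerce")
lab_means = (
    labs.pivot_table(index="patient_id", columns="code", values="value", aggfunc="mean")
    .add_prefix("lab__")
)

# Wide frame indexed by patient_id, one column per feature.
features = counts.join(lab_means)
print(features)
```

Non-numeric lab values become NaN under errors="coerce" and are skipped by the mean, which matches the "numeric" qualifier in the table.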

Deduplication

Epic patient-portal exports repeat the same clinical facts across every annual download. A single chronic diagnosis can appear hundreds of times in the master CSV - once per encounter export that restates the problem list. For modeling, that inflates counts in a way that doesn't reflect distinct clinical events.

The feature builder applies a deduplication step by default (DEDUPLICATE_OVERLAPPING = True at the top of the cell). It collapses records that share (patient_id, category, code, effective_date, value, unit), keeping the first occurrence with its source_file provenance. Toggle it off if you specifically want raw counts.

Typical reduction: a 78,000-record master CSV from ~700 Epic documents drops to perhaps 3,000–6,000 deduplicated records - closer to the patient's actual distinct clinical history.
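
The deduplication rule described above maps naturally onto pandas' drop_duplicates. The following is a minimal sketch under that assumption; the function name deduplicate and the toy file names are hypothetical, while the key columns and the DEDUPLICATE_OVERLAPPING flag are the ones named in this section.

```python
import pandas as pd

DEDUPLICATE_OVERLAPPING = True
DEDUP_KEYS = ["patient_id", "category", "code", "effective_date", "value", "unit"]


def deduplicate(master: pd.DataFrame) -> pd.DataFrame:
    """Collapse records sharing all DEDUP_KEYS, keeping the first occurrence."""
    if not DEDUPLICATE_OVERLAPPING:
        return master
    # keep="first" preserves the source_file of the earliest matching row.
    return master.drop_duplicates(subset=DEDUP_KEYS, keep="first").reset_index(drop=True)


# Toy data: the same creatinine result restated in two exports, plus a later result.
master = pd.DataFrame({
    "patient_id": ["p1", "p1", "p1"],
    "category": ["labs_vitals"] * 3,
    "code": ["2160-0"] * 3,
    "effective_date": ["2021-03-01", "2021-03-01", "2022-03-01"],
    "value": ["1.0", "1.0", "1.1"],
    "unit": ["mg/dL"] * 3,
    "source_file": ["export_2021.xml", "export_2022.xml", "export_2022.xml"],
})
deduped = deduplicate(master)
print(len(master), "->", len(deduped))
```

Because source_file is not part of the key, the restated row from the second export is dropped and the surviving row keeps the provenance of the first export.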

Why top-K, and why "by unique-patient count"

Two reasons for top-K:

  1. Interpretability. A model with 30 disease features and 30 drug features is something a researcher can read directly. A model with 5,000 sparse code columns is not.
  2. Sample-size realism. Patient-mediated cohorts are small. Throwing thousands of features at a 50-patient cohort is a recipe for overfit nonsense. Top-K forces you to engage with which codes actually appear in your data.

Selecting top-K by unique patient count rather than raw record count matters because a chronic condition restated in every export for one patient would otherwise crowd out diagnoses that are genuinely common across the cohort. A diagnosis shared by 35 of 50 patients says more about the cohort than a single patient whose diabetes is mentioned in 500 documents.
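
A small toy example shows how the two ranking rules can disagree. The codes "1001" and "2002" are placeholders rather than real SNOMED codes, and the dx__ prefix simply mirrors the column naming shown earlier; this is a sketch of the idea, not the notebook's exact code.

```python
import pandas as pd

# One patient with a code restated five times; three patients sharing another code.
problems = pd.DataFrame({
    "patient_id": ["p1"] * 5 + ["p2", "p3", "p4"],
    "code": ["1001"] * 5 + ["2002"] * 3,
})

K = 1

# Ranking by raw record count favors the heavily restated code.
by_records = problems["code"].value_counts().head(K).index.tolist()

# Ranking by unique patients favors the code that is common across the cohort.
by_patients = (
    problems.groupby("code")["patient_id"].nunique()
    .sort_values(ascending=False)
    .head(K)
    .index.tolist()
)

# Binary 0/1 matrix for the selected codes, with a row for every patient.
top = problems[problems["code"].isin(by_patients)]
dx = (pd.crosstab(top["patient_id"], top["code"]) > 0).astype(int).add_prefix("dx__")
dx = dx.reindex(sorted(problems["patient_id"].unique()), fill_value=0)

print(by_records, by_patients)
print(dx)
```

The reindex step matters: patients with none of the top-K codes would otherwise be missing from the matrix instead of appearing as rows of zeros.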

Bump K when your cohort is larger. Replace top-K with a code hierarchy (RxNorm ingredient classes, ATC groups, ICD-10 chapters, Mondo ancestors) when you have the time and the standards mappings.

What the feature matrix doesn't include

  • Temporal features. No "first event," "last event," "duration since first event." For longitudinal modeling, add these from patient_master.csv directly.
  • Survival outcomes. No event date, no follow-up duration. For time-to-event, build a separate outcomes frame.
  • Free-text features. The text column is ignored. Adding TF-IDF or embeddings is left as a downstream choice - patient-portal narratives are often boilerplate-heavy and benefit from cleaning before any NLP.
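
For the temporal gap, one possible starting point is a per-patient aggregation over effective_date. The output column names (first_event, last_event, span_days) are illustrative choices, not part of the pipeline, and the toy frame stands in for patient_master.csv.

```python
import pandas as pd

# Toy stand-in for patient_master.csv.
master = pd.DataFrame({
    "patient_id": ["p1", "p1", "p2"],
    "effective_date": ["2020-01-01", "2021-01-01", "2019-06-15"],
})

# Unparseable dates become NaT and are ignored by min/max.
master["effective_date"] = pd.to_datetime(master["effective_date"], errors="coerce")

temporal = master.groupby("patient_id")["effective_date"].agg(
    first_event="min",
    last_event="max",
)
temporal["span_days"] = (temporal["last_event"] - temporal["first_event"]).dt.days
print(temporal)
```

Since the result is indexed by patient_id, it can be joined onto patient_features.csv after loading that file with patient_id as its index.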

How the two outputs relate

The dashboard is for looking at the data. The feature matrix is for modeling the data. Most researchers use both: explore in the dashboard, hypothesize, then build a model against the feature matrix (or against custom features derived from the master CSV).

Both files reference the same patient_id values, so you can take an interesting cluster from the modeling side and filter the dashboard to those patients to read their actual records.

Bundling for handoff

After Stage 5, registry_forge_patient_bundle.zip packages all four output files:

registry_forge_patient_bundle.zip
├── patient_master.csv
├── patient_features.csv
├── dashboard.html
└── parse_log.csv

This is the natural unit of work - a single file you can email, attach to a JIRA ticket, or upload to a secure share. Each artifact is meaningful on its own; together they're a complete analytic data set.
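
If you need to rebuild the bundle by hand, a sketch using the standard-library zipfile module might look like this. The function name make_bundle and the assumption that all four files sit in one output directory are illustrative; the pipeline's own bundling code may differ.

```python
import zipfile
from pathlib import Path

BUNDLE_FILES = [
    "patient_master.csv",
    "patient_features.csv",
    "dashboard.html",
    "parse_log.csv",
]


def make_bundle(output_dir, bundle_name="registry_forge_patient_bundle.zip"):
    """Zip the four Stage 5 outputs found in output_dir (hypothetical helper)."""
    output_dir = Path(output_dir)
    # Fail early rather than ship an incomplete bundle.
    missing = [name for name in BUNDLE_FILES if not (output_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing output files: {missing}")
    bundle_path = output_dir / bundle_name
    with zipfile.ZipFile(bundle_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name in BUNDLE_FILES:
            # arcname keeps the archive flat, matching the tree shown above.
            zf.write(output_dir / name, arcname=name)
    return bundle_path
```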