Researcher workflow

A practical walk-through of common analyses once you have the four output files.

Sanity-check the parse first

Before any analysis, three quick checks:

import pandas as pd
df = pd.read_csv('out/patient_master.csv', encoding='utf-8-sig', dtype=str)
log = pd.read_csv('out/parse_log.csv')

# 1. Did all your documents parse?
print(f'Documents in log: {len(log)}')
print(f'Documents in master: {df["source_file"].nunique()}')
# Should be equal. If not, check the errors output from notebook cell 4.

# 2. Are all your patients distinct?
print(f'Unique patient_ids: {df["patient_id"].nunique()}')
print(df.drop_duplicates("patient_id")[["patient_id","last_name","first_name","dob"]])
# Eyeball the table. Was a patient you expected to merge split across several patient_ids?
# Do rows that obviously belong to different people share one ID?

# 3. Does the category distribution look right?
print(df["category"].value_counts())
# If you expected medications but see zero, the source documents may have
# stripped that section, or the parser may not have recognized the section code.
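
If the two counts in check 1 differ, a set difference shows which files never reached the master table. The following is a minimal sketch, assuming parse_log.csv carries a `source_file` column that matches the one in patient_master.csv (the log's column names are not documented on this page, so adjust `key` as needed). The helper name `missing_documents` and the toy frames are illustrative only.

```python
import pandas as pd

def missing_documents(log: pd.DataFrame, master: pd.DataFrame,
                      key: str = "source_file") -> list[str]:
    """Return files listed in the parse log that contribute no rows to the master table."""
    logged = set(log[key].dropna())
    parsed = set(master[key].dropna())
    return sorted(logged - parsed)

# Toy data standing in for out/parse_log.csv and out/patient_master.csv
log_demo = pd.DataFrame({"source_file": ["a.xml", "b.xml", "c.json"]})
master_demo = pd.DataFrame({"source_file": ["a.xml", "a.xml", "c.json"]})

print(missing_documents(log_demo, master_demo))
```

With the real files, pass `log` and `df` from the snippet above in place of the toy frames.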

Build a patient timeline

For single-patient mode (or one patient out of a cohort):

pid = 'PT-1D2DF76565'
timeline = (df.query("patient_id == @pid and category != 'patient'")
              .sort_values('effective_date')
              [['effective_date','category','code_system','code','display_name','value','unit','text']])
timeline.head(30)

For a visual timeline:

import matplotlib.pyplot as plt
import seaborn as sns

ts = timeline.assign(date=pd.to_datetime(timeline['effective_date'], errors='coerce')).dropna(subset=['date'])
plt.figure(figsize=(11, 4))
sns.scatterplot(data=ts, x='date', y='category', hue='category', legend=False, s=40)
plt.title(f'Clinical event timeline - {pid}')
plt.tight_layout(); plt.show()

Find every record matching a code or keyword

# By code
df.query("code == '55505003'")[['patient_id','effective_date','display_name','source_file']]

# By free-text search across the cohort
# (\b keeps 'peg' from matching inside longer words such as 'pegfilgrastim')
pattern = r'\bpeg\b|gastrostomy'
mask = (df['text'].str.contains(pattern, case=False, na=False)
        | df['display_name'].str.contains(pattern, case=False, na=False))
df.loc[mask, ['patient_id','category','effective_date','display_name','text']].head()

The dashboard does the same thing in a UI, but pandas is faster for repeated scripted analyses.
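
For repeated searches it can help to wrap the keyword lookup in a small function. This is a sketch under the assumption that the `text` and `display_name` columns shown above are the ones worth searching; `search_text` is a hypothetical helper, not part of the project. It matches whole words only and escapes each term, so a term containing regex characters is treated literally.

```python
import re
import pandas as pd

def search_text(frame: pd.DataFrame, terms: list[str],
                columns: tuple[str, ...] = ("text", "display_name")) -> pd.DataFrame:
    """Rows where any term appears as a whole word in any of the given columns."""
    # Non-capturing group, so pandas does not warn about match groups
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    mask = pd.Series(False, index=frame.index)
    for col in columns:
        mask |= frame[col].str.contains(pattern, case=False, na=False, regex=True)
    return frame[mask]

# Toy rows standing in for patient_master.csv
demo = pd.DataFrame({
    "text": ["PEG tube placed", None, "pegfilgrastim given"],
    "display_name": [None, "Percutaneous endoscopic gastrostomy", "Pegfilgrastim"],
})
hits = search_text(demo, ["peg", "gastrostomy"])
print(hits)
```

Whole-word matching is a deliberate trade-off: it avoids hits inside longer drug names but will also miss abbreviations glued to other characters, so loosen the pattern if recall matters more than precision.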

Cohort summary: who's on what

# Pivot table: patient × medication name (record count per display_name)
meds = (df[df['category']=='medications']
        .pivot_table(index='patient_id', columns='display_name',
                     values='effective_date', aggfunc='count', fill_value=0))
meds.head()

# Patients on riluzole at any point
ril = df.query("category == 'medications' and display_name.str.contains('riluzole', case=False, na=False)", engine='python')
print(f'{ril["patient_id"].nunique()} patient(s) on riluzole')

Track a lab value over time

# Reuses pid from the timeline section above.
# With dtype=str, empty cells are read as NaN, so filter on both NaN and ''.
labs = df[(df['patient_id'] == pid)
          & (df['category'] == 'labs_vitals')
          & (df['value'].fillna('') != '')].copy()
labs['date'] = pd.to_datetime(labs['effective_date'], errors='coerce')
labs['num']  = pd.to_numeric(labs['value'], errors='coerce')

# Focus on one analyte - creatinine, for example
cr = (labs[labs['display_name'].str.contains('creatinine', case=False, na=False)]
      .dropna(subset=['date','num'])
      .sort_values('date'))
cr.plot(x='date', y='num', marker='o', title='Serum creatinine over time')
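
Before trusting a trend line, it is worth checking that the analyte was reported in a single unit, since documents from different sources may mix units for the same lab. The sketch below assumes the `unit` column listed in the timeline section is populated for lab rows; `split_by_unit` is a hypothetical helper and the numbers are made up.

```python
import pandas as pd

def split_by_unit(frame: pd.DataFrame) -> dict[str, pd.DataFrame]:
    """Group numeric lab rows by reported unit so each series can be plotted separately."""
    out = {}
    for unit, grp in frame.groupby(frame["unit"].fillna("(no unit)")):
        out[unit] = grp.sort_values("date")
    return out

# Toy creatinine rows with mixed units
cr_demo = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-10", "2023-03-02", "2023-02-01"]),
    "num": [0.9, 80.0, 1.0],
    "unit": ["mg/dL", "umol/L", "mg/dL"],
})
series = split_by_unit(cr_demo)
for unit, grp in series.items():
    print(unit, len(grp))
```

If more than one key comes back, plot each group on its own axis or convert to a common unit first.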

Use the feature matrix for ML

X = pd.read_csv('out/patient_features.csv', index_col=0)

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize numeric columns
num = X.select_dtypes(include='number').fillna(0)
Xs = StandardScaler().fit_transform(num)

# Cluster
km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(Xs)
pca = PCA(n_components=2).fit_transform(Xs)

import matplotlib.pyplot as plt
plt.scatter(pca[:,0], pca[:,1], c=km.labels_, cmap='tab10')
plt.title('Patient clusters in PCA space'); plt.show()
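
The choice of three clusters above is arbitrary. One common way to compare candidate values of k is the silhouette score from scikit-learn. This is a sketch on synthetic data, not a recommendation for any particular cohort; `pick_k` is a hypothetical helper, and with very few patients the scores are unlikely to be meaningful.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_k(Xs: np.ndarray, candidates=range(2, 6), seed: int = 42):
    """Return (best_k, scores) where scores maps each tried k to its silhouette score."""
    scores = {}
    for k in candidates:
        if k >= len(Xs):  # silhouette needs at most n_samples - 1 clusters
            break
        labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(Xs)
        scores[k] = silhouette_score(Xs, labels)
    best = max(scores, key=scores.get)
    return best, scores

# Synthetic stand-in for the standardized matrix Xs: three well-separated blobs
rng = np.random.default_rng(0)
demo = np.vstack([rng.normal(c, 0.5, size=(20, 2)) for c in (0, 10, 20)])
best_k, scores = pick_k(demo)
print(best_k, scores)
```

On real data, pass the `Xs` array from the snippet above and treat the result as one input among several rather than a definitive answer.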

For a more elaborate ML notebook that does PCA, KMeans, UMAP (optional), and a supervised demo with feature importance, see the companion notebook from the original Registry Forge analysis - it accepts our patient_master.csv unchanged.

Combine with other Registry Forge outputs

If you also have an institutional cohort processed by Registry Forge:

institutional = pd.read_csv('institutional/patient_master.csv', encoding='utf-8-sig', dtype=str)
patient_shared = pd.read_csv('out/patient_master.csv', encoding='utf-8-sig', dtype=str)

combined = pd.concat([institutional, patient_shared], ignore_index=True)
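# dropna=False so rows with no source are counted too (value_counts drops NaN by default)
print(combined['source'].value_counts(dropna=False))
# ccda    ...
# fhir    ...
# NaN     ... (patient header rows have an empty source)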

Just be careful at the patient level: a patient who exists in both cohorts will have two different patient_ids unless you do explicit linkage.
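
A crude form of explicit linkage is an exact match on normalized name plus date of birth. The sketch below assumes both files carry the `last_name`, `first_name`, and `dob` columns shown earlier; `linkage_key` and `candidate_links` are hypothetical helpers. Exact matching misses typos and name changes and can pair different people who share a name and birth date, so treat the output as candidates to review, not as confirmed links.

```python
import pandas as pd

KEY_COLS = ["last_name", "first_name", "dob"]

def linkage_key(frame: pd.DataFrame) -> pd.Series:
    """Normalized 'last|first|dob' key; NaN where any component is missing or empty."""
    norm = frame[KEY_COLS].apply(lambda s: s.str.strip().str.casefold())
    complete = norm.notna().all(axis=1) & (norm != "").all(axis=1)
    return norm.fillna("").apply("|".join, axis=1).where(complete)

def candidate_links(a: pd.DataFrame, b: pd.DataFrame) -> pd.DataFrame:
    """Pairs of patient_ids from cohorts a and b that share the same linkage key."""
    pa = a.drop_duplicates("patient_id").assign(key=linkage_key).dropna(subset=["key"])
    pb = b.drop_duplicates("patient_id").assign(key=linkage_key).dropna(subset=["key"])
    return pa[["key", "patient_id"]].merge(pb[["key", "patient_id"]],
                                           on="key", suffixes=("_a", "_b"))

# Toy cohorts with made-up people
inst_demo = pd.DataFrame({
    "patient_id": ["PT-AAA", "PT-BBB"],
    "last_name": ["Doe", "Roe"], "first_name": ["Jane ", "Rick"],
    "dob": ["1970-01-02", "1980-05-05"],
})
shared_demo = pd.DataFrame({
    "patient_id": ["PT-CCC", "PT-DDD"],
    "last_name": ["DOE", "Poe"], "first_name": ["jane", None],
    "dob": ["1970-01-02", "1990-01-01"],
})
links = candidate_links(inst_demo, shared_demo)
print(links)
```

Once a pair is confirmed, map one of the two IDs onto the other in `combined` before running any per-patient analysis.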

Export a redacted dashboard for sharing

Roll your own pseudonymization before re-building the dashboard:

df_safe = df.copy()
# Replace patient_ids with sequential pseudonyms and blank the direct identifiers
idmap = {pid: f'PT-{i:04d}' for i, pid in enumerate(df_safe['patient_id'].unique())}
df_safe['patient_id'] = df_safe['patient_id'].map(idmap)
df_safe['last_name']  = ''
df_safe['first_name'] = ''
df_safe['mrn']        = ''
# Year-only dates
df_safe['effective_date'] = df_safe['effective_date'].str.slice(0, 4)
df_safe['dob']            = df_safe['dob'].str.slice(0, 4)
# Strip free text
df_safe['text'] = ''
df_safe.to_csv('out/patient_master_safe.csv', index=False, encoding='utf-8-sig')
# Then re-run the dashboard cell pointing at df_safe

This is rough but adequate for sharing structure with a methodologist who doesn't need PHI.
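
The sequential IDs above change whenever the row order or cohort changes. If the same patient needs the same pseudonym across several exports, a keyed hash is one alternative. This is a sketch using only the standard library, not something the project provides; `pseudonym`, the `PX-` prefix, and the secret value are illustrative assumptions. Keep the secret out of anything you share, since anyone who has it can recompute the mapping from known IDs.

```python
import hashlib
import hmac

def pseudonym(patient_id: str, secret: bytes, length: int = 10, prefix: str = "PX-") -> str:
    """Stable pseudonym: HMAC-SHA256 of the ID under a secret key, truncated for readability."""
    digest = hmac.new(secret, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return prefix + digest[:length].upper()

secret = b"replace-with-a-private-random-value"  # placeholder, never commit a real key
a = pseudonym("PT-1D2DF76565", secret)
b = pseudonym("PT-1D2DF76565", secret)
c = pseudonym("PT-1D2DF76565", b"a-different-secret")
print(a == b, a == c)
```

To apply it to the frame, something like `df_safe['patient_id'] = df['patient_id'].map(lambda p: pseudonym(p, secret))` would replace the `idmap` step. Truncating the digest trades collision resistance for shorter IDs, so lengthen it for large cohorts.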

When to graduate to a heavier tool

Patient Edition is deliberately minimal. Reach for something heavier when:

  • You have more than a few hundred patients. OMOP + a real database backend is more sustainable than CSVs.
  • You need rare disease federation. GA4GH Phenopackets and Matchmaker Exchange exist for a reason.
  • You're doing time-to-event analysis at cohort scale. lifelines plus a study-specific feature pipeline is the right framing; the feature matrix here is a snapshot, not a longitudinal model.
  • You need formal validation. The five-tier QC framework Registry Forge ships with (Kahn 2016 + OHDSI DQD) is the right reference; this project doesn't try to replicate it.