Researcher workflow¶
A practical walk-through of common analyses once you have the four output files.
Sanity-check the parse first¶
Before any analysis, three quick checks:
import pandas as pd
df = pd.read_csv('out/patient_master.csv', encoding='utf-8-sig', dtype=str)
log = pd.read_csv('out/parse_log.csv')
# 1. Did all your documents parse?
print(f'Documents in log: {len(log)}')
print(f'Documents in master: {df["source_file"].nunique()}')
# Should be equal. If not, check the errors output from notebook cell 4.
# 2. Are all your patients distinct?
print(f'Unique patient_ids: {df["patient_id"].nunique()}')
print(df.drop_duplicates("patient_id")[["patient_id","last_name","first_name","dob"]])
# Eyeball the table. Are any patient_ids you expected to merge actually split?
# Any IDs that obviously belong to different people but share one ID?
# 3. Does the category distribution look right?
print(df["category"].value_counts())
# If you expected medications but see zero, the source documents may have
# stripped that section, or the parser may not have recognized the section code.
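The eyeball test in check 2 can be backed by a mechanical one. The following is a minimal sketch, assuming that `last_name`, `first_name`, and `dob` are filled on at least one row per patient and that exact string equality is a good enough match for a first pass. The helper name `id_conflicts` and the toy frame are hypothetical, not part of the project.

```python
import pandas as pd

DEMO_COLS = ["last_name", "first_name", "dob"]

def id_conflicts(df: pd.DataFrame) -> tuple[pd.Series, pd.Series]:
    """Return (split, shared): suspicious demographics and suspicious patient_ids."""
    # One row per distinct (patient_id, demographics) pair; rows without
    # complete demographics are ignored.
    people = (df.dropna(subset=DEMO_COLS)
                .drop_duplicates(["patient_id"] + DEMO_COLS))
    # Same demographics under several patient_ids: possibly one person split in two.
    ids_per_person = people.groupby(DEMO_COLS)["patient_id"].nunique()
    split = ids_per_person[ids_per_person > 1]
    # Same patient_id with several demographic tuples: possibly two people merged.
    persons_per_id = people.groupby("patient_id").size()
    shared = persons_per_id[persons_per_id > 1]
    return split, shared

# Toy data standing in for patient_master.csv (hypothetical values).
toy = pd.DataFrame({
    "patient_id": ["PT-A", "PT-B", "PT-C", "PT-C"],
    "last_name":  ["Doe", "Doe", "Roe", "Poe"],
    "first_name": ["Jane", "Jane", "Sam", "Ann"],
    "dob":        ["1960-01-02", "1960-01-02", "1971-03-04", "1980-05-06"],
})
split, shared = id_conflicts(toy)
print(split)   # demographics that appear under more than one patient_id
print(shared)  # patient_ids that carry more than one set of demographics
```

Exact matching will miss spelling variants and transposed dates, so an empty result is not proof that the IDs are clean.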
Build a patient timeline¶
For single-patient mode (or one patient out of a cohort):
pid = 'PT-1D2DF76565'
timeline = (df.query("patient_id == @pid and category != 'patient'")
.sort_values('effective_date')
[['effective_date','category','code_system','code','display_name','value','unit','text']])
timeline.head(30)
For a visual timeline:
import matplotlib.pyplot as plt
import seaborn as sns
ts = timeline.assign(date=pd.to_datetime(timeline['effective_date'], errors='coerce')).dropna(subset=['date'])
plt.figure(figsize=(11, 4))
sns.scatterplot(data=ts, x='date', y='category', hue='category', legend=False, s=40)
plt.title(f'Clinical event timeline - {pid}')
plt.tight_layout(); plt.show()
Find every record matching a code or keyword¶
# By code
df.query("code == '55505003'")[['patient_id','effective_date','display_name','source_file']]
# By free-text search across the cohort
pattern = r'\bpeg\b|gastrostomy'  # \b stops 'peg' from matching inside longer words
mask = df['text'].str.contains(pattern, case=False, na=False) | \
df['display_name'].str.contains(pattern, case=False, na=False)
df[mask][['patient_id','category','effective_date','display_name','text']].head()
The dashboard does the same thing in a UI, but pandas is faster for repeated scripted analyses.
Cohort summary: who's on what¶
# Pivot table: patient × medication code
meds = (df[df['category']=='medications']
.pivot_table(index='patient_id', columns='display_name',
values='effective_date', aggfunc='count', fill_value=0))
meds.head()
# Patients on riluzole at any point
# na=False treats rows with a missing display_name as non-matches
ril = df.query("category == 'medications' and display_name.str.contains('riluzole', case=False, na=False)", engine='python')
print(f'{ril["patient_id"].nunique()} patient(s) on riluzole')
Longitudinal lab trends for one patient¶
# Empty CSV cells load as NaN (not ''), so drop missing values explicitly
labs = df.query("patient_id == @pid and category == 'labs_vitals'").dropna(subset=['value']).copy()
labs['date'] = pd.to_datetime(labs['effective_date'], errors='coerce')
labs['num'] = pd.to_numeric(labs['value'], errors='coerce')
# Focus on one analyte - creatinine, for example
cr = labs[labs['display_name'].str.contains('creatinine', case=False, na=False)].dropna(subset=['date','num'])
cr.plot(x='date', y='num', marker='o', title='Serum creatinine over time');
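One thing the plot above does not guard against is the same analyte being reported in more than one unit, which would put incomparable numbers on one axis. The following is a minimal sketch that splits an analyte by the `unit` column before plotting. The helper name `analyte_by_unit`, the `(no unit)` label, and the toy rows are hypothetical.

```python
import pandas as pd

def analyte_by_unit(labs: pd.DataFrame, pattern: str) -> dict[str, pd.DataFrame]:
    """Return one date-sorted frame per unit for rows whose display_name matches."""
    mask = labs["display_name"].str.contains(pattern, case=False, na=False)
    hits = labs[mask].assign(
        date=lambda d: pd.to_datetime(d["effective_date"], errors="coerce"),
        num=lambda d: pd.to_numeric(d["value"], errors="coerce"),
    )
    hits = (hits.dropna(subset=["date", "num"])   # drop unparseable dates/values
                .sort_values("date")
                .assign(unit=lambda d: d["unit"].fillna("(no unit)")))
    return {unit: grp for unit, grp in hits.groupby("unit")}

# Toy rows standing in for one patient's labs (hypothetical values).
toy_labs = pd.DataFrame({
    "display_name":   ["Creatinine", "Creatinine", "Creatinine", "Sodium", "Creatinine"],
    "effective_date": ["2022-02-10", "2021-06-01", "2021-01-05", "2021-01-05", "2021-03-01"],
    "value":          ["1.1", "80", "0.9", "140", "pending"],
    "unit":           ["mg/dL", "umol/L", "mg/dL", "mmol/L", None],
})
series = analyte_by_unit(toy_labs, "creatinine")
for unit, grp in series.items():
    print(unit, list(grp["num"]))
```

Each value of `series` can then be plotted separately with `grp.plot(x='date', y='num', marker='o')`, or converted to a common unit first.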
Use the feature matrix for ML¶
X = pd.read_csv('out/patient_features.csv', index_col=0)
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Standardize numeric columns
num = X.select_dtypes(include='number').fillna(0)
Xs = StandardScaler().fit_transform(num)
# Cluster
km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(Xs)
pca = PCA(n_components=2).fit_transform(Xs)
import matplotlib.pyplot as plt
plt.scatter(pca[:,0], pca[:,1], c=km.labels_, cmap='tab10')
plt.title('Patient clusters in PCA space'); plt.show()
For a more elaborate ML notebook that does PCA, KMeans, UMAP (optional), and a supervised demo with feature importance, see the companion notebook from the original Registry Forge analysis - it accepts our patient_master.csv unchanged.
Combine with other Registry Forge outputs¶
If you also have an institutional cohort processed by Registry Forge:
institutional = pd.read_csv('institutional/patient_master.csv', encoding='utf-8-sig', dtype=str)
patient_shared = pd.read_csv('out/patient_master.csv', encoding='utf-8-sig', dtype=str)
combined = pd.concat([institutional, patient_shared], ignore_index=True)
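# dropna=False keeps the missing-source rows visible in the counts
print(combined['source'].value_counts(dropna=False))
# ccda ...
# fhir ...
# NaN ... (patient header rows, which have an empty source)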
Just be careful at the patient level: a patient who exists in both cohorts will have two different patient_ids unless you do explicit linkage.
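As an illustration of what explicit linkage could look like, here is a minimal sketch of deterministic linkage on normalized name plus date of birth. It assumes both files carry `last_name`, `first_name`, and `dob`, and that an exact match on all three is acceptable evidence for your purposes, which it often is not. The helpers `link_key`, `people`, and `build_crosswalk`, and the toy frames, are hypothetical.

```python
import pandas as pd

KEY_COLS = ["last_name", "first_name", "dob"]

def link_key(frame: pd.DataFrame) -> pd.Series:
    """Lower-cased, whitespace-stripped name parts plus dob, joined with '|'."""
    parts = [frame[c].str.strip().str.lower() for c in KEY_COLS]
    return parts[0] + "|" + parts[1] + "|" + parts[2]

def people(frame: pd.DataFrame) -> pd.DataFrame:
    """One row per (patient_id, key), using only rows with complete demographics."""
    complete = frame.dropna(subset=KEY_COLS)
    return (complete.assign(key=link_key(complete))
                    [["patient_id", "key"]]
                    .drop_duplicates())

def build_crosswalk(institutional: pd.DataFrame, patient_shared: pd.DataFrame) -> dict:
    """Map patient_shared IDs to institutional IDs where the key matches."""
    inst = people(institutional)
    # Ignore keys that point at more than one institutional patient.
    inst = inst[~inst["key"].duplicated(keep=False)]
    pairs = people(patient_shared).merge(inst, on="key", suffixes=("_shared", "_inst"))
    return dict(zip(pairs["patient_id_shared"], pairs["patient_id_inst"]))

# Toy frames standing in for the two patient_master.csv files (hypothetical values).
inst_toy = pd.DataFrame({
    "patient_id": ["PT-I1", "PT-I2"],
    "last_name": ["Doe", "Roe"], "first_name": ["Jane", "Sam"],
    "dob": ["1960-01-02", "1971-03-04"],
})
shared_toy = pd.DataFrame({
    "patient_id": ["PT-S1", "PT-S9"],
    "last_name": [" doe ", "Poe"], "first_name": ["JANE", "Ann"],
    "dob": ["1960-01-02", "1980-05-06"],
})
xwalk = build_crosswalk(inst_toy, shared_toy)
# Rewrite linked IDs; unlinked patients keep their original ID.
relinked = shared_toy.assign(
    patient_id=shared_toy["patient_id"].map(lambda p: xwalk.get(p, p)))
print(xwalk)
```

Apply the crosswalk before the `pd.concat` step so that a linked patient ends up under a single `patient_id`. Review the matched pairs by hand, since two different people can share a name and birth date.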
Export a redacted dashboard for sharing¶
Roll your own pseudonymization before re-building the dashboard:
df_safe = df.copy()
# Replace patient IDs with sequential pseudonyms and blank direct identifiers
idmap = {pid: f'PT-{i:04d}' for i, pid in enumerate(df_safe['patient_id'].unique())}
df_safe['patient_id'] = df_safe['patient_id'].map(idmap)
df_safe['last_name'] = ''
df_safe['first_name'] = ''
df_safe['mrn'] = ''
# Year-only dates
df_safe['effective_date'] = df_safe['effective_date'].str.slice(0, 4)
df_safe['dob'] = df_safe['dob'].str.slice(0, 4)
# Strip free text
df_safe['text'] = ''
df_safe.to_csv('out/patient_master_safe.csv', index=False, encoding='utf-8-sig')
# Then re-run the dashboard cell pointing at df_safe
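This is rough but adequate for sharing structure with a methodologist who doesn't need PHI. Columns the snippet does not touch, such as source_file, value, and display_name, pass through unchanged, so check them before sharing.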
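Before sending the file anywhere, it may be worth confirming that the redaction actually took effect. The following is a minimal sketch that checks only the columns the snippet above modifies. The helper name `redaction_problems` and the toy frames are hypothetical, and passing this check does not make a file de-identified.

```python
import pandas as pd

BLANKED = ["last_name", "first_name", "mrn", "text"]   # should be empty
YEAR_ONLY = ["effective_date", "dob"]                  # should be 4-digit years

def redaction_problems(safe: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for col in BLANKED:
        if col in safe.columns:
            values = safe[col].fillna("").astype(str).str.strip()
            if values.ne("").any():
                problems.append(f"{col} still has values")
    for col in YEAR_ONLY:
        if col in safe.columns:
            # Missing dates are allowed; present ones must be exactly four digits.
            values = safe[col].dropna().astype(str)
            if not values.str.fullmatch(r"\d{4}").all():
                problems.append(f"{col} is not year-only")
    return problems

# Toy frames (hypothetical values): one properly redacted, one not.
good = pd.DataFrame({
    "patient_id": ["PT-0000"], "last_name": [""], "first_name": [""],
    "mrn": [""], "text": [""], "effective_date": ["2021"], "dob": ["1960"],
})
bad = good.assign(last_name=["Doe"], dob=["1960-01-02"])
print(redaction_problems(good))
print(redaction_problems(bad))
```

If you reload `patient_master_safe.csv` to run the check, read it with `dtype=str` as elsewhere on this page so that year values are not converted to numbers.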
When to graduate to a heavier tool¶
Patient Edition is deliberately minimal. Reach for something heavier when:
- You have more than a few hundred patients. OMOP + a real database backend is more sustainable than CSVs.
- You need rare disease federation. GA4GH Phenopackets and Matchmaker Exchange exist for a reason.
- You're doing time-to-event analysis at cohort scale. lifelines plus a study-specific feature pipeline is the right framing; the feature matrix here is a snapshot, not a longitudinal model.
- You need formal validation. The five-tier QC framework Registry Forge ships with (Kahn 2016 + OHDSI DQD) is the right reference; this project doesn't try to replicate it.