Drug repurposing analysis

drug_repurposing.py is the add-on module that adapts the methodology of Reimer et al., Lancet Digital Health 2026 — an EHR-based study identifying drug repurposing candidates for ALS in the US Veterans Health Administration database — to the bundle that run_pipeline.py produces. It also exports candidate cases in the schema of the FDA / NCATS-NIH / Critical Path Institute CURE Drug Repurposing Collaboratory Treatment Registry intake CRF.

Live demo

Below is the actual drug repurposing report, pre-loaded against a synthetic 220-patient ALS cohort that exercises every section — cohort overview, 21 medications grouped under 15 ATC classes, crude exposed-vs-unexposed median survival comparison, and protective / harmful directional markers. All values are synthetic; no real ARC data.

What it does, what it does not do

This module is responsible for cohort assembly and exposure assignment. It identifies who counts as exposed to a given medication under explicit criteria, computes baseline characteristics, attaches survival information, groups medications by ATC class, and exports both an analytic data set and a CURE ID intake file.

It deliberately does not run propensity-score matching or Cox proportional hazards regression. Rigorous causal inference on observational EHR data needs dedicated tooling such as lifelines or statsmodels, together with per-cohort tuning of caliper widths, immortal-time-bias correction, covariate-balance checks, and multiple-testing correction. The cohort data set this module emits (drug_repurposing_cohort.csv) is the input to those tools, not a substitute for them.

Cohort definition

The module walks the bundle's problems[] array for any record coded under the motor-neuron-disease spectrum:

Vocabulary	Code	Meaning
ICD-10-CM	G12.21	Amyotrophic lateral sclerosis
ICD-10-CM	G12.20	Motor neuron disease, unspecified
ICD-10-CM	G12.22	Progressive bulbar palsy
ICD-10-CM	G12.23	Primary lateral sclerosis (adult)
ICD-10-CM	G12.29	Other motor neuron disease
ICD-9-CM	335.20–29	Motor neuron disease (legacy)
SNOMED CT	86044005	Amyotrophic lateral sclerosis
SNOMED CT	230258005	Motor neuron disease
Mondo	MONDO:0004976	Amyotrophic lateral sclerosis
Mondo	MONDO:0019056	Motor neuron disease

Cohort entry date is the earliest such diagnosis. Patients with no medication record after cohort entry are excluded as a proxy for lack of engagement with the index health system, following Reimer 2026.
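The cohort-entry rule above can be sketched as follows. This is a minimal illustration, not the module's code: the record keys `patient_id`, `system`, `code`, and `onset_date` are hypothetical, and only a few of the codes from the table are included.

```python
from datetime import date

# Illustrative subset of the codes in the table above. The (system, code) key
# layout follows the COHORT_CODES description; the contents here are assumed.
COHORT_CODES = {
    ("ICD-10-CM", "G12.21"): "Amyotrophic lateral sclerosis",
    ("ICD-10-CM", "G12.20"): "Motor neuron disease, unspecified",
    ("SNOMED CT", "86044005"): "Amyotrophic lateral sclerosis",
}

def cohort_entry_dates(problems):
    """Return {patient_id: earliest qualifying diagnosis date}."""
    entry = {}
    for rec in problems:
        # Skip problem rows that are not in the motor-neuron-disease code set.
        if (rec["system"], rec["code"]) not in COHORT_CODES:
            continue
        onset = date.fromisoformat(rec["onset_date"])
        pid = rec["patient_id"]
        if pid not in entry or onset < entry[pid]:
            entry[pid] = onset
    return entry

# Hypothetical problem rows: p1 has two qualifying codes, p2 has none.
problems = [
    {"patient_id": "p1", "system": "ICD-10-CM", "code": "G12.21", "onset_date": "2021-03-10"},
    {"patient_id": "p1", "system": "SNOMED CT", "code": "86044005", "onset_date": "2020-11-02"},
    {"patient_id": "p2", "system": "ICD-10-CM", "code": "I10", "onset_date": "2019-01-01"},
]
entries = cohort_entry_dates(problems)
```

Here `entries` holds only `p1`, with the earlier of its two diagnosis dates as cohort entry.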

Exposure criteria

The module supports four exposure-criterion modes via the criterion parameter.

criterion='any' (default) — Criterion C, fits C-CDA / FHIR registry data. A patient is exposed to a medication if they have at least one medication record with effective_date in the window from 6 months before to 12 months after cohort entry. This is the right default for Registry Forge bundles produced by run_pipeline.py, where each medication row represents one prescription event (C-CDA SubstanceAdministration or FHIR MedicationRequest / MedicationStatement) and there is typically no separate end-date.

criterion='reimer', 'reimer-a', 'reimer-b' — the strict Reimer 2026 criteria, designed for VHA-style dispense-event data. Reimer's data had one row per pharmacy fill with dispense_date and supply_days, allowing the reconstruction of treatment intervals. Criterion A requires the prescription start or end to fall within 12 months of cohort entry and the end date to be at least 6 months after the start. Criterion B requires at least two dispenses in the [-6, +12] month window. Neither will fire on C-CDA / FHIR data with one record per medication and no end_date.
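The three exposure rules can be sketched as below. This is a hedged illustration under stated assumptions: the function names are hypothetical, month arithmetic is done by calendar month with the day clamped to month end (the module may count differently), and "within 12 months of cohort entry" for Criterion A is read here as the 12 months after entry.

```python
import calendar
from datetime import date

def add_months(d, months):
    """Shift a date by whole calendar months, clamping the day to month end."""
    idx = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(idx, 12)
    month = month0 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

def in_window(d, entry, before=6, after=12):
    """True if d lies in [entry - before months, entry + after months]."""
    return add_months(entry, -before) <= d <= add_months(entry, after)

def exposed_any(record_dates, entry):
    """Criterion C: at least one medication record inside the window."""
    return any(in_window(d, entry) for d in record_dates)

def exposed_b(dispense_dates, entry, min_dispenses=2):
    """Criterion B: at least two dispenses inside the window."""
    return sum(in_window(d, entry) for d in dispense_dates) >= min_dispenses

def exposed_a(start, end, entry, after=12, min_duration=6):
    """Criterion A: start or end near entry, and duration of 6+ months."""
    if start is None or end is None:
        return False  # no reconstructable treatment interval
    near_entry = any(entry <= d <= add_months(entry, after) for d in (start, end))
    return near_entry and end >= add_months(start, min_duration)

# One prescription event with no end date, as in typical C-CDA / FHIR data.
entry = date(2021, 1, 15)
single_record = [date(2020, 8, 1)]
print(exposed_any(single_record, entry))
print(exposed_b(single_record, entry))
print(exposed_a(date(2020, 8, 1), None, entry))
```

With a single undated-end record, only the 'any' rule marks the patient as exposed, which is why the stricter modes produce no exposed patients on one-record-per-medication data.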

import drug_repurposing
drug_repurposing.main(
    bundle_path = 'dashboard_data.json',
    out_dir     = 'drug_repurposing_output',
    min_exposed = 5,           # see "Tuning min_exposed for small cohorts" below
    k           = 5,           # k-anonymity threshold for reported counts
    criterion   = 'any',       # default; 'reimer' for dispense-event data
    cohort_name = 'ARC Study EHR cohort',
)

Tuning `min_exposed` for small cohorts

Reimer used min_exposed = 30 against 11,003 ALS patients. For a 96-patient cohort that threshold is far too high — almost no medication will have 30+ exposed patients, and the output will be empty. The default in drug_repurposing.py is min_exposed = 5, which is a sensible starting value for cohorts of a few dozen to a few hundred patients.

If your run produces empty output, the diagnostic log prints a hint telling you what the top medication's exposed-patient count is, so you can lower min_exposed accordingly:

top exposed medications across cohort:
     12 patients  riluzole    (dropped, < min_exposed=30)
      8 patients  atorvastatin    (dropped, < min_exposed=30)
HINT: top medication has 12 exposed patients (need >= 30).
      Try drug_repurposing.main(..., min_exposed=12) to see signal.
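The filtering and hint logic behind that log can be approximated as follows. This is a sketch only: `filter_by_min_exposed` and its input shape are hypothetical, and the hint wording differs from the module's actual log line.

```python
from collections import Counter

def filter_by_min_exposed(exposed_pairs, min_exposed=5):
    """Keep medications with enough exposed patients.

    exposed_pairs: iterable of (patient_id, medication) for exposed patients.
    Returns (kept, hint), where kept maps medication to exposed-patient count
    and hint is a suggestion string when nothing passes the threshold.
    """
    # De-duplicate so each patient counts once per medication.
    counts = Counter(med for _pid, med in set(exposed_pairs))
    kept = {med: n for med, n in counts.items() if n >= min_exposed}
    hint = None
    if not kept and counts:
        top_med, top_n = counts.most_common(1)[0]
        hint = (f"top medication {top_med} has {top_n} exposed patients; "
                f"try min_exposed={top_n}")
    return kept, hint

# Synthetic example mirroring the log above: 12 riluzole, 8 atorvastatin.
pairs = [(f"p{i}", "riluzole") for i in range(12)]
pairs += [(f"p{i}", "atorvastatin") for i in range(8)]
kept_strict, hint_strict = filter_by_min_exposed(pairs, min_exposed=30)
kept_default, hint_default = filter_by_min_exposed(pairs)
```

With `min_exposed=30` nothing is kept and the hint suggests 12; with the default of 5 both medications are kept and no hint is produced.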

Diagnostic output

Empty outputs are not silent. The module logs counts at every step:

How many problem rows were scanned, how many matched a motor-neuron-disease code, and which coding system they came from.
How many medication rows had a patient_id, how many normalized to a non-empty name, how many had a parseable date, how many had an end_date (with an explicit note if none did), and how many resolved to a known ATC_SEED entry.
The top 15 unmatched medication names — the right list to consult when extending ATC_SEED or BRAND_TO_GENERIC for your data.
How many cohort patients had any medication record, how many medications had ≥1 exposed patient under the selected criterion, and which medications met min_exposed.

ATC class seed mapping

Medications are grouped by Anatomical Therapeutic Chemical (ATC) classification via the ATC_SEED dictionary at the top of the module. The seed covers every drug Reimer reported plus obvious extensions:

ATC code	Class	Drugs in seed
C10AA	HMG-CoA reductase inhibitors	simvastatin, lovastatin, pravastatin, atorvastatin, rosuvastatin, pitavastatin, fluvastatin
G04BE	PDE5 inhibitors	sildenafil, vardenafil, tadalafil, avanafil
G04CA	α-adrenoreceptor antagonists	tamsulosin, terazosin, alfuzosin, silodosin
M03BX	Centrally acting muscle relaxants	cyclobenzaprine, baclofen, tizanidine
N07XX	Other nervous system drugs — ALS	riluzole, edaravone, tofersen
N07XX59	Other nervous system drugs — pseudobulbar affect	dextromethorphan-quinidine (FDA-approved combination, brand name Nuedexta)
R05DA09	Opium alkaloid antitussive	dextromethorphan (when prescribed alone, almost always an antitussive; also the active CNS ingredient in Nuedexta)
C01BA01	Class IA antiarrhythmic	quinidine (when prescribed alone at therapeutic doses; in Nuedexta it is the subtherapeutic CYP2D6-inhibitor ingredient)
M01AC	Oxicam NSAIDs	meloxicam
A11CC	Vitamin D	colecalciferol
B01AF	Direct factor Xa inhibitors	rivaroxaban
C03AA	Thiazide diuretics	hydrochlorothiazide
C09AA	ACE inhibitors	lisinopril
A03AB	Anticholinergics (for sialorrhea)	glycopyrrolate
A04AA	5HT3 antagonists	ondansetron
N02AB	Opioid analgesics	fentanyl
N02BF	Gabapentinoids	gabapentin
R01BA	Decongestants	pseudoephedrine
R05CA	Expectorants	guaifenesin
R06AE	H1 antihistamines	cetirizine
R03AC	Short-acting β-agonists	salbutamol / albuterol
R03DC	Leukotriene receptor antagonists	montelukast

Nuedexta and its ingredients are tracked three ways

Following Reimer 2026, the module tracks the FDA-approved dextromethorphan-quinidine combination and its two active ingredients separately, because they appear in EHR data in different clinical contexts:

Combination (Nuedexta) → N07XX59. Prescribed for pseudobulbar affect in ALS. Resolves from Nuedexta, dextromethorphan-quinidine, dextromethorphan/quinidine, or any dosage-suffixed variant.
Dextromethorphan alone → R05DA09. Almost always an antitussive when prescribed alone. Reimer reported standalone dextromethorphan as a separate harm-direction signal under this class.
Quinidine alone → C01BA01. A Class IA antiarrhythmic when prescribed alone at therapeutic doses; the subtherapeutic CYP2D6-inhibitor role in Nuedexta is distinct.

Adopters with a different therapeutic-area focus can extend ATC_SEED directly in the source.
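The three-way resolution described above could be sketched as below. The function `resolve_dm_q` is hypothetical, and substring matching is an assumption made for brevity; the module's own matching rules are not shown on this page.

```python
# ATC codes as listed in the seed table above.
COMBINATION = ("dextromethorphan-quinidine", "N07XX59")
SINGLE = {"dextromethorphan": "R05DA09", "quinidine": "C01BA01"}

def resolve_dm_q(display_name):
    """Return (normalized_name, atc_code), or None if neither ingredient appears."""
    name = display_name.lower()
    has_dm = "dextromethorphan" in name
    has_q = "quinidine" in name
    # Check the combination first so it is never split into its ingredients.
    if "nuedexta" in name or (has_dm and has_q):
        return COMBINATION
    if has_dm:
        return ("dextromethorphan", SINGLE["dextromethorphan"])
    if has_q:
        return ("quinidine", SINGLE["quinidine"])
    return None

for label in ("Nuedexta 20 mg-10 mg capsule",
              "dextromethorphan/quinidine",
              "dextromethorphan HBr 15 mg",
              "quinidine sulfate 200 mg"):
    print(label, "->", resolve_dm_q(label))
```

The ordering is the point of the sketch: testing for the combination before the single ingredients keeps Nuedexta records out of the antitussive and antiarrhythmic classes. A plain substring test would also label other dextromethorphan combination products as dextromethorphan alone, so a real implementation needs stricter matching.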

Brand-name resolution

Real EHR data routinely names medications by brand (Lipitor, Flomax, Nuedexta, Robinul, Radicava) rather than generic. The module ships a second dictionary, BRAND_TO_GENERIC, that _norm_med_name consults before the ATC_SEED lookup. A record with display name "Lipitor 20 mg tablet" is therefore normalized to "atorvastatin" and matches the atorvastatin entry; "Nuedexta 20 mg-10 mg tablet" is normalized to "dextromethorphan-quinidine" and matches the combination entry.

Brand coverage in the shipped seed:

Category	Brand → generic mappings
ALS FDA-approved	Rilutek / Tiglutik / Exservan → riluzole · Radicava / Radicava ORS → edaravone · Qalsody → tofersen · Nuedexta → dextromethorphan-quinidine · Relyvrio / Albrioza → sodium phenylbutyrate-taurursodiol
Statins	Lipitor → atorvastatin · Crestor → rosuvastatin · Zocor → simvastatin · Mevacor / Altoprev → lovastatin · Pravachol → pravastatin · Livalo / Zypitamag → pitavastatin · Lescol → fluvastatin
PDE5 inhibitors	Viagra / Revatio → sildenafil · Levitra / Staxyn → vardenafil · Cialis / Adcirca → tadalafil · Stendra → avanafil
α-blockers	Flomax → tamsulosin · Hytrin → terazosin · Uroxatral → alfuzosin · Rapaflo → silodosin · Cardura → doxazosin
Muscle relaxants	Flexeril / Amrix / Fexmid → cyclobenzaprine · Zanaflex → tizanidine · Lioresal / Gablofen / Ozobax → baclofen
Sialorrhea	Robinul / Cuvposa / Glycate / Dartisla → glycopyrrolate
Antiemetics	Zofran / Zuplenz → ondansetron
Opioids	Sublimaze / Duragesic / Actiq / Fentora / Abstral / Subsys / Lazanda → fentanyl
Gabapentinoids	Neurontin / Gralise / Horizant → gabapentin
Cough / cold	Sudafed → pseudoephedrine · Mucinex / Robitussin → guaifenesin
Antihistamines	Zyrtec / Reactine → cetirizine
Bronchodilators	Ventolin / Proventil / ProAir / AccuNeb → salbutamol · Xopenex → levalbuterol
Leukotrienes	Singulair → montelukast
ACE inhibitors	Prinivil / Zestril / Qbrelis → lisinopril
Anticoagulants	Xarelto → rivaroxaban
Vitamin D	Vitamin D / Vitamin D3 / D3 / cholecalciferol → colecalciferol

Adopters extend BRAND_TO_GENERIC for brand names common in their own site's EHR; the dictionary lookup is performed before the ATC_SEED lookup so adding a new brand requires no other changes.
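The lookup order described in this section (brand dictionary first, then ATC seed) can be illustrated with a small sketch. The names `norm_med_name` and `lookup_atc`, the dose-stripping regular expression, and the `name -> (code, class)` layout of `ATC_SEED` are all assumptions; the dictionaries are tiny subsets of the shipped ones.

```python
import re

# Tiny illustrative subsets; the shipped dictionaries are much larger.
BRAND_TO_GENERIC = {"lipitor": "atorvastatin", "flomax": "tamsulosin"}
ATC_SEED = {
    "atorvastatin": ("C10AA", "HMG-CoA reductase inhibitors"),
    "tamsulosin": ("G04CA", "α-adrenoreceptor antagonists"),
}

# Drop everything from the first dose-like token onward, e.g. " 20 mg tablet".
_DOSE_TAIL = re.compile(r"\s+\d.*$")

def norm_med_name(display_name):
    """Lowercase, strip the dosage suffix, then map brand to generic."""
    base = _DOSE_TAIL.sub("", display_name.strip().lower())
    return BRAND_TO_GENERIC.get(base, base)

def lookup_atc(display_name):
    """Return (atc_code, atc_class) for a display name, or None if unmatched."""
    return ATC_SEED.get(norm_med_name(display_name))

print(norm_med_name("Lipitor 20 mg tablet"))
print(lookup_atc("Flomax 0.4 mg capsule"))
print(lookup_atc("unknown drug 5 mg"))
```

Because the brand lookup runs before the seed lookup, adding a new brand in this sketch means adding one `BRAND_TO_GENERIC` entry and nothing else, which mirrors the behaviour described above.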

Outputs

Four files are produced in out_dir:

1. `drug_repurposing_cohort.csv`

Long-form per-(patient, medication) data set, one row per cohort patient per kept medication.

Column	Description
`patient_id`	Bundle patient identifier
`cohort_entry`	Earliest motor-neuron-disease diagnosis date (ISO)
`age_at_entry`	Years (float)
`sex`, `race`, `ethnicity`, `marital_status`	Baseline covariates
`medication`	Normalized medication name
`atc_code`, `atc_class`	From `ATC_SEED` lookup
`exposed`	1 if the patient met the selected exposure criterion, else 0
`criterion`	"A", "B", "A+B", or empty
`n_dispenses_window`	Number of dispenses in −6 to +12 month window
`first_dispense`	First dispense date (ISO)
`survival_days`	Days from cohort entry to death (or to today if censored)
`death_observed`	1 if death recorded, 0 if censored
`deceased_date`	Death date if recorded

This is the input to Cox proportional hazards regression and propensity-score-matched analysis. A typical follow-on notebook does:

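import pandas as pd
from lifelines import CoxPHFitter
from lifelines.exceptions import ConvergenceError

df = pd.read_csv('drug_repurposing_cohort.csv')
cols = ['survival_days', 'death_observed', 'exposed',
        'age_at_entry', 'sex', 'race']

# Fit one covariate-adjusted Cox model per medication.
for med in df['medication'].unique():
    sub = df.loc[df['medication'] == med, cols].dropna()
    sub['exposed'] = sub['exposed'].astype(int)
    if sub['exposed'].nunique() < 2:
        continue  # no exposed-vs-unexposed contrast to estimate
    cph = CoxPHFitter()
    try:
        cph.fit(sub,
                duration_col='survival_days',
                event_col='death_observed',
                formula='exposed + age_at_entry + C(sex) + C(race)')
    except ConvergenceError:
        # Small or sparse strata can prevent the fit from converging.
        print(f'{med:20s} model did not converge')
        continue
    hr = cph.hazard_ratios_['exposed']
    p  = cph.summary.loc['exposed', 'p']
    print(f'{med:20s} HR={hr:.2f}  p={p:.4f}')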

For propensity matching, use psmpy or implement caliper matching directly per Reimer's Methods (caliper width 0.2 × SD of the logit of the propensity score, up to three matched controls per treated unit).
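For readers implementing the matching step directly, the caliper rule can be sketched with the standard library alone. This is a simplified greedy nearest-neighbour matcher, not Reimer's implementation: it assumes propensity scores have already been estimated elsewhere, pools treated and control logits to compute the SD, and processes treated units in input order.

```python
import math
import statistics

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def caliper_match(treated, controls, ratio=3, caliper_sd=0.2):
    """Greedy nearest-neighbour matching on the logit of the propensity score.

    treated, controls: dicts {unit_id: propensity score in (0, 1)}.
    Returns {treated_id: [control_id, ...]} with at most `ratio` controls
    per treated unit; each control is used at most once.
    """
    lt = {u: logit(p) for u, p in treated.items()}
    lc = {u: logit(p) for u, p in controls.items()}
    # Caliper = 0.2 x SD of the pooled logit propensity scores.
    caliper = caliper_sd * statistics.stdev(list(lt.values()) + list(lc.values()))
    available = set(lc)
    matches = {}
    for t, x in lt.items():
        # Candidate controls inside the caliper, closest first.
        near = sorted((abs(lc[c] - x), c) for c in available
                      if abs(lc[c] - x) <= caliper)
        picked = [c for _dist, c in near[:ratio]]
        available.difference_update(picked)
        matches[t] = picked
    return matches

# Hypothetical propensity scores for two treated and six control patients.
treated = {"t1": 0.30, "t2": 0.70}
controls = {"c1": 0.31, "c2": 0.29, "c3": 0.32, "c4": 0.28, "c5": 0.69, "c6": 0.05}
matches = caliper_match(treated, controls)
```

Greedy matching depends on the order of treated units, and a real analysis would follow it with covariate-balance checks before fitting any survival model.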

2. `drug_repurposing_summary.csv`

Per-medication summary, one row per medication.

Column	Description
`medication`	Normalized medication name
`atc_code`, `atc_class`	From `ATC_SEED`
`n_exposed`	Count with k-anonymity threshold applied (renders as "<5" if below)
`n_unexposed`	Count of patients in cohort not exposed
`n_deaths_exposed`, `n_deaths_unexposed`	Deaths in each arm
`median_survival_days_exposed`, `median_survival_days_unexposed`	Crude median survival in each arm
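A summary row of this shape could be assembled roughly as follows. The helper names are hypothetical, and the median here is a plain median of `survival_days` that ignores censoring, which is only one possible reading of "crude"; the module may compute it differently.

```python
import statistics

def mask_count(n, k=5):
    """Render a count, suppressing values below the k-anonymity threshold."""
    return f"<{k}" if n < k else str(n)

def summarize(rows, k=5):
    """Summarize one medication.

    rows: list of dicts with 'exposed' (0/1), 'survival_days', 'death_observed'.
    """
    out = {}
    for arm, label in ((1, "exposed"), (0, "unexposed")):
        arm_rows = [r for r in rows if r["exposed"] == arm]
        days = [r["survival_days"] for r in arm_rows]
        out[f"n_{label}"] = len(arm_rows)
        out[f"n_deaths_{label}"] = sum(r["death_observed"] for r in arm_rows)
        out[f"median_survival_days_{label}"] = statistics.median(days) if days else None
    # Only the exposed count is masked, matching the column description above.
    out["n_exposed"] = mask_count(out["n_exposed"], k)
    return out

# Synthetic rows: 3 exposed patients and 6 unexposed patients.
rows = [{"exposed": 1, "survival_days": d, "death_observed": e}
        for d, e in ((100, 1), (200, 0), (300, 1))]
rows += [{"exposed": 0, "survival_days": d, "death_observed": 1}
         for d in (50, 60, 70, 80, 90, 100)]
summary = summarize(rows)
```

With three exposed patients and `k=5`, `n_exposed` is rendered as the string "<5" rather than the exact count.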

3. `cure_id_intake.csv`

One row per exposed (patient, medication) pair, formatted to match the FDA / NCATS-NIH / Critical Path Institute CURE ID Treatment Registry intake CRF. Banded and pseudonymized for PII-free submission:

Column	Description
`therapeutic_area`	"Rare Genetic Disorders"
`disease`	"Amyotrophic lateral sclerosis (or motor neuron disease spectrum)"
`user_type`	"A Healthcare Provider"
`pseudo_patient_id`	"PT-NNNN" (stable per run, not persisted)
`age_group`	CURE ID age band ("51 - 60 years", "61 - 70 years", etc.)
`sex`	"Female" / "Male" / "Unknown"
`country_treated`	"United States"
`races`	Comma-separated CURE ID race options
`medication_name`, `medication_atc_class`, `medication_atc_code`	From cohort table
`exposure_criterion`	"A", "B", or "A+B"
`n_dispenses_window`	Number of dispenses in window
`why_new_way`	"Repurposing candidate identified from EHR survival signal"
`treatment_outcome`	Banded by survival days from cohort entry
`source_pipeline`	"Registry Forge drug_repurposing.py (Reimer 2026 methodology)"

This file is the artifact a registry submits to the CURE ID platform. The PII-free posture (pseudonymized identifiers, banded ages, no absolute dates, k-anonymity at the medication level) matches the registry's documented expectations.
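The banding and pseudonymization steps can be sketched as below. Both helpers are hypothetical: the sketch assumes every band is a decade wide, extrapolating from the "51 - 60 years" and "61 - 70 years" examples, whereas the real `CURE_ID_AGE_BANDS` may define other bands, and "Unknown" is a placeholder label.

```python
def age_band(age_years):
    """Map an age in years to a decade band such as '51 - 60 years'."""
    if age_years is None or age_years < 1:
        return "Unknown"  # placeholder; real CRF bands for infants may differ
    completed = int(age_years)            # completed years of age
    lower = (completed - 1) // 10 * 10 + 1
    return f"{lower} - {lower + 9} years"

def pseudonymize(patient_ids):
    """Assign PT-NNNN labels in first-seen order; nothing is written to disk."""
    mapping = {}
    for pid in patient_ids:
        if pid not in mapping:
            mapping[pid] = f"PT-{len(mapping) + 1:04d}"
    return mapping

pseudo = pseudonymize(["mrn-778", "mrn-102", "mrn-778"])
print(age_band(60.4), "|", age_band(61), "|", pseudo)
```

Because the mapping is rebuilt from the input order on every run and never stored, the same patient keeps one label within a run but labels cannot be linked across runs.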

4. `drug_repurposing_report.html`

Single-file analytical report grouped by ATC class. Cohort overview cards, per-class medication tables with exposed and unexposed counts, crude median survival comparison, and methodological limitations.

Customization

The module exposes several knobs for site-specific tuning:

MIN_EXPOSED_PATIENTS — default 5 (Reimer used 30 against 11,003 patients); raise for larger cohorts.
DEFAULT_K_ANONYMITY — default 5; raise for smaller or more sensitive cohorts.
CRITERION_A_*, CRITERION_B_* — exposure-window constants (criterion A: 12 months after entry, 6 months minimum duration; criterion B: 6 months before to 12 months after entry, 2 dispenses minimum). Site-specific exposure definitions can adjust these directly.
COHORT_CODES — dictionary of (system, code) → display for cohort identification. Adapt to a different disease by replacing with the relevant ICD-10-CM / SNOMED / Mondo codes.
ATC_SEED — medication-to-ATC-class mapping. Extend with additional drugs relevant to your therapeutic area.
CURE_ID_AGE_BANDS — age-band definitions, currently aligned to the CURE ID Generic Other Disease CRF.
CURE_ID_RACE_MAP — bundle-to-CURE-ID race-option mapping.

All of these live as plain Python constants and dictionaries at the top of the module file.

Limitations to keep in mind

Indication bias (confounding by indication) — a patient prescribed a PDE5 inhibitor may by virtue of that indication be healthier than average; a patient prescribed an opioid may by virtue of that indication be in advanced disease. Propensity matching helps but does not eliminate this.
Immortal-time bias — Reimer's criteria A and B are meant to limit this by requiring a minimum duration or repeat dispensing, but criterion='any' imposes neither requirement. In every mode, the downstream Cox analysis still has to handle immortal time explicitly.
EHR diagnosis accuracy — Reimer addressed this by running a sensitivity analysis restricted to riluzole-treated patients (a drug with a very ALS-specific indication). The same sensitivity analysis is appropriate here.
Dose-response unavailable — dosage information is not currently surfaced from the bundle; adopters wanting a dose-response analysis should extend the medication parser.
Multiple-testing burden — screening hundreds of medications inflates the false-positive rate. Bonferroni correction at the medication count is the conservative default; Reimer reported both uncorrected and Bonferroni-corrected results.
Single-cohort generalization — outputs are for hypothesis generation, not for confirmatory inference. Prospective randomized trials remain the gold standard for therapeutic efficacy.
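The Bonferroni adjustment mentioned in the multiple-testing item is simple enough to sketch directly. The function name and the example p-values are illustrative only; the number of tests is taken to be the number of medications screened.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni-adjust per-medication p-values.

    p_values: {medication: raw p-value}.
    Returns {medication: (adjusted p-value, significant at alpha)}.
    """
    m = len(p_values)  # number of medications screened
    adjusted = {}
    for med, p in p_values.items():
        adj = min(1.0, p * m)
        adjusted[med] = (adj, adj < alpha)
    return adjusted

# Made-up p-values for three hypothetical medications.
raw = {"drug_a": 0.001, "drug_b": 0.03, "drug_c": 0.5}
corrected = bonferroni(raw)
```

In this made-up example only `drug_a` remains below 0.05 after adjustment, while `drug_b` would have passed uncorrected, which is why reporting both sets of results, as Reimer did, is informative.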

Reference

Reimer RJ, Soper B, Wilson JL, Goncalves AR, Cadena J, Suarez P, Gryshuk AL, Osborne TF, Grimes KV, Ray P. Identification of drug repurposing candidates for amyotrophic lateral sclerosis using electronic health records: a retrospective cohort study. Lancet Digit Health 2026. https://doi.org/10.1016/j.landig.2025.100963