Regex extraction from unstructured notes
Preview — under active development. This module is still being refined. Patterns, mappings, and category boundaries should be validated against your own corpus before being relied on for analysis or publication.
note_extraction.py reads dashboard_data.json and walks every free-text source — CCDA section narratives in the notes array and the plain_text field of each decoded document — matching a library of regular expressions. The output is a single CSV (note_extractions.csv) with one row per match: patient, source kind, source identifier, pattern name, captured value, surrounding snippet, character offset, and a one-line description.
This is a starter module, not validated NLP. Local clinical phrasing varies a lot, so every pattern needs site-specific tuning before captured values feed analysis. For production clinical NLP, look at cTAKES, MedSpaCy, or Clinical-BERT-based pipelines. The patterns shipped here are aimed at things that are commonly missing from discrete coded fields and have to be reconstructed from notes.
Why it's needed
Even a well-coded EHR export is missing things that matter to a natural history registry:
- Disease severity at a point in time — ALSFRS-R and ECAS scores are recorded as note text, not as discrete observations, in most EHR systems. (LOINC 67131-4 exists for ALSFRS-R but is rarely populated.)
- Diagnostic certainty — El Escorial / Awaji-Shima classification ("clinically probable ALS") almost never reaches a discrete field.
- Family history and genetic status — recorded in social-history notes; rarely coded.
- Phenotype — site of onset, cognitive/behavioral involvement, FTD spectrum diagnoses are typically free-text.
- Disease milestones — PEG placement date, NIV initiation date, riluzole start date are sometimes coded as procedures or medications but often only narrated.
A pipeline that stops at structured codes leaves all of this on the table. A regex pass over the same parsed bundle, even imperfect, recovers a meaningful fraction of it cheaply.
Patterns shipped
| Group | Patterns |
|---|---|
| ALSFRS-R | total, bulbar, fine motor, gross motor, respiratory subdomains |
| ECAS | total, ALS-specific, non-ALS-specific, FTD-spectrum mentions |
| Diagnosis | El Escorial / Awaji-Shima certainty (definite/probable/possible), site of onset (bulbar/limb/spinal/respiratory) |
| Family history | negative ("no family history of MND"), positive (named-relative constructions, +FH), genetic mutations (C9orf72, SOD1, FUS, TARDBP, TBK1, VCP, UBQLN2, PFN1, MATR3, CHCHD10, ANG, OPTN, ATXN2) |
| Pulmonary | FVC percent-predicted |
| Treatment milestones | PEG placement date, NIV/BiPAP initiation date, riluzole start date, edaravone start date, tracheostomy date |
Twenty patterns total in the shipped file. Each is registered with a name, regex, capture-group index, and a one-line description that ends up in the output CSV alongside the match.
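A registry of this shape could be sketched as follows. This is a minimal illustration, not the code shipped in note_extraction.py: the `NotePattern` dataclass, the `extract` helper, the 40-character snippet window, the choice of match start as the character offset, and the regex itself are all assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class NotePattern:
    """Hypothetical registry entry: name, regex, capture-group index, description."""
    name: str
    regex: re.Pattern
    group: int
    description: str


# Illustrative entry only; the shipped regex may differ.
PATTERNS = [
    NotePattern(
        name="alsfrs_r_total",
        regex=re.compile(r"ALSFRS-?R(?:\s+total)?\s*:?\s*(\d{1,2})\b", re.IGNORECASE),
        group=1,
        description="ALSFRS-R total score, range 0-48",
    ),
]


def extract(text, patterns=PATTERNS, window=40):
    """Yield one dict per regex match with value, snippet, and offset."""
    for p in patterns:
        for m in p.regex.finditer(text):
            start = max(0, m.start() - window)
            end = min(len(text), m.end() + window)
            yield {
                "pattern": p.name,
                "value": m.group(p.group),
                "snippet": "..." + text[start:end] + "...",
                "char_offset": m.start(),  # assumed: offset of the match start
                "description": p.description,
            }
```

With this sketch, `list(extract("Most recent ALSFRS-R total: 32 / 48"))` yields a single row whose value is the string "32". A caller would add patient and source identifiers before writing the row to CSV.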
Output format
patient_id,source_kind,source_id,pattern,value,snippet,char_offset,description
demo-patient-001,ccda_section,doc-001::Plan of Care,alsfrs_r_total,32,"...Most recent ALSFRS-R total: 32 / 48 (bulbar 8...",433,"ALSFRS-R total score, range 0-48"
demo-patient-001,ccda_section,doc-001::Plan of Care,el_escorial,clinically probable,"...Diagnosis: clinically probable ALS by Awaji-Shima criteria...",178,El Escorial / Awaji-Shima diagnostic certainty level
demo-patient-001,ccda_section,doc-001::Plan of Care,family_history_negative,no known family history of ALS,"...Family history: no known family history of ALS, MND, or FTD...",247,Negative family history
demo-patient-001,ccda_section,doc-001::Plan of Care,fvc_percent_predicted,62,"...FVC: 62% predicted (down from 78% predicted...",612,"Forced vital capacity, percent predicted"
source_kind is ccda_section for matches in CCDA section narratives, or document:<format> (e.g. document:ccda_xml, document:rtf, document:pdf, document:html) for matches in decoded document body text.
The same content can match in both source types (a Plan-of-Care narrative lives inside the CCDA section and inside the full CCDA document plain_text), which produces two rows for one underlying clinical fact. Dedupe by (patient_id, pattern, value) if you only want one row per fact, or keep both for provenance.
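The deduplication described above can be sketched with the standard library alone. The helper names `dedupe_rows` and `dedupe_csv` are hypothetical, and the sketch keeps whichever row appears first in the file, which is an assumption about the ordering you want.

```python
import csv


def dedupe_rows(rows):
    """Keep the first row seen for each (patient_id, pattern, value) key."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["patient_id"], row["pattern"], row["value"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def dedupe_csv(in_path, out_path):
    """Read an extraction CSV, drop duplicate facts, write the result."""
    with open(in_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = dedupe_rows(reader)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

If provenance matters, sort the rows so the preferred source_kind comes first before deduplicating, since only the first row per key survives.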
Running it
Defaults assume the bundle and the output sit in the current directory:
# top of note_extraction.py
BUNDLE_PATH = './dashboard_data.json'
OUT_PATH = './note_extractions.csv'
From a notebook:
import note_extraction
note_extraction.main(
    bundle_path='./dashboard_data.json',
    out_path='./note_extractions.csv',
)
From the command line:
The log at the end of the run summarizes per-pattern hit counts and per-patient match totals, which gives a quick check that the patterns actually fired.
Adapting to your data
Local phrasing matters. The shipped patterns were written against synthetic and lightly disguised examples and miss a lot of real-world variation. Steps to localize:
- Sample first. Pull a few hundred narrative documents from your bundle and grep for each pattern's anchor word (ALSFRS, ECAS, family history, FVC, riluzole). Read the matches and the surrounding sentences.
- Add alternations. "ALSFRS-R: 32" and "ALSFRS-R 32/48" and "Total ALSFRS-R score is thirty-two" all need to match. The shipped patterns handle the first two; the third would need a numeric-words extension.
- Watch negation and hypotheticals. "Considering PEG", "discussed risks of tracheostomy", "patient declined NIV" all contain the trigger word but no actual milestone. A negative lookbehind (as the shipped family_history_positive uses to avoid "no family history of...") is a quick patch; clinical NLP libraries do this properly with assertion and modality detection.
- Add patterns specific to your disease. A cardiology registry would pattern on ejection fraction phrasing and NYHA class; an oncology registry on tumor stage and response criteria; a neurology registry on UPDRS, MoCA, and Hoehn & Yahr.
- Validate before trusting captured values. Hand-review a stratified sample of note_extractions.csv (one in twenty matches across patterns) and compute precision per pattern. Anything below ~95% needs another pattern revision before its values feed downstream analysis.
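The validation step could look roughly like this. It is a sketch under stated assumptions: the function names are hypothetical, the reviewer is assumed to record a verdict in an added `correct` column as "y" or "n", and each pattern contributes at least one row to the sample even when it has fewer than twenty matches.

```python
import random
from collections import defaultdict


def stratified_sample(rows, rate=0.05, seed=0):
    """Sample about `rate` of the rows for each pattern, at least one per pattern."""
    by_pattern = defaultdict(list)
    for row in rows:
        by_pattern[row["pattern"]].append(row)
    rng = random.Random(seed)  # fixed seed so the review set is reproducible
    sample = []
    for pattern in sorted(by_pattern):
        group = by_pattern[pattern]
        k = max(1, round(len(group) * rate))
        sample.extend(rng.sample(group, k))
    return sample


def precision_by_pattern(reviewed_rows):
    """Compute the fraction of reviewed matches marked correct, per pattern.

    Assumes a hypothetical 'correct' column holding 'y' or 'n'.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in reviewed_rows:
        totals[row["pattern"]] += 1
        if row["correct"].strip().lower() == "y":
            hits[row["pattern"]] += 1
    return {p: hits[p] / totals[p] for p in totals}
```

Patterns whose precision falls under the ~95% threshold can then be listed with `[p for p, v in precision_by_pattern(reviewed).items() if v < 0.95]`. Small samples give noisy estimates, so a low-volume pattern may need every match reviewed.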
When to step up to real NLP
Regex is appropriate when:
- The phrasing is fairly stereotyped (lab values, structured headers, score grids).
- You're piloting and need first-pass coverage in days, not months.
- The domain is small enough that a few dozen patterns cover most of it.
Regex breaks down when:
- Negation, hedging, and temporality matter ("considering tracheostomy", "tracheostomy declined", "tracheostomy was considered three years ago").
- Coreference resolution is needed (multiple paragraphs about different family members).
- Numbers are written as words, abbreviations vary across providers, or notes are heavily templated with stale auto-populated content.
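The tracheostomy examples above can be used to show where a regex guard helps and where it stops helping. The cue list, the 30-character window, and the `classify` function are illustrative assumptions, not part of the shipped module.

```python
import re

TRIGGER = re.compile(r"\btracheostomy\b", re.IGNORECASE)

# Hypothetical cue words suggesting the procedure was not actually performed.
NON_EVENT_CUES = re.compile(
    r"\b(considering|considered|declined|discussed|risks of)\b", re.IGNORECASE
)


def classify(sentence, window=30):
    """Label a sentence by checking for cue words near the trigger word."""
    m = TRIGGER.search(sentence)
    if m is None:
        return "no mention"
    context = sentence[max(0, m.start() - window): m.end() + window]
    if NON_EVENT_CUES.search(context):
        return "non-event"
    return "candidate milestone"


examples = [
    "considering tracheostomy",
    "tracheostomy declined",
    "tracheostomy was considered three years ago",
    "Tracheostomy placed 2021-03-04",
    "Patient declined PEG; tracheostomy performed 2021-03-04",
]
for sentence in examples:
    print(f"{classify(sentence):20s} {sentence}")
```

The first three sentences are labeled non-event and the fourth is a candidate milestone. The fifth is also labeled non-event, even though the tracheostomy did happen, because "declined" refers to the PEG and a window-based cue check has no notion of scope.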
At that point a clinical NLP pipeline is the right answer:
- cTAKES (Apache): mature, UMLS-backed, includes assertion/temporal modules.
- MedSpaCy: lighter-weight, modern Python, with negspacy for negation.
- Spark NLP for Healthcare: commercial; full clinical entity extraction including ICD/RxNorm linking.
- Clinical-BERT family models: useful for relation extraction (e.g. "drug X caused side-effect Y") that pattern-matching can't reach.
This module's CSV output is intentionally compatible with these tools: (patient_id, source_id, pattern, value, char_offset) is the same row shape clinical-NLP outputs use, so a downstream system can consume note_extractions.csv directly while a richer NLP pipeline is being stood up.