Regex extraction from unstructured notes

Preview — under active development. This module is still being refined. Patterns, mappings, and category boundaries should be validated against your own corpus before being relied on for analysis or publication.

note_extraction.py reads dashboard_data.json and walks every free-text source — CCDA section narratives in the notes array and the plain_text field of each decoded document — matching a library of regular expressions. The output is a single CSV (note_extractions.csv) with one row per match: patient, source kind, source identifier, pattern name, captured value, surrounding snippet, character offset, and a one-line description.

This is a starter module, not validated NLP. Local clinical phrasing varies a lot, so every pattern needs site-specific tuning before captured values feed analysis. For production clinical NLP, look at cTAKES, MedSpaCy, or Clinical-BERT-based pipelines. The patterns shipped here are aimed at things that are commonly missing from discrete coded fields and have to be reconstructed from notes.

Why it's needed

Even a well-coded EHR export is missing things that matter to a natural history registry:

  • Disease severity at a point in time — ALSFRS-R and ECAS scores are recorded as note text, not as discrete observations, in most EHR systems. (LOINC 67131-4 exists for ALSFRS-R but is rarely populated.)
  • Diagnostic certainty — El Escorial / Awaji-Shima classification ("clinically probable ALS") almost never reaches a discrete field.
  • Family history and genetic status — recorded in social-history notes; rarely coded.
  • Phenotype — site of onset, cognitive/behavioral involvement, FTD spectrum diagnoses are typically free-text.
  • Disease milestones — PEG placement date, NIV initiation date, riluzole start date are sometimes coded as procedures or medications but often only narrated.

A pipeline that stops at structured codes leaves all of this on the table. A regex pass over the same parsed bundle, even imperfect, recovers a meaningful fraction of it cheaply.

Patterns shipped

Group                | Patterns
---------------------|---------
ALSFRS-R             | total, bulbar, fine motor, gross motor, respiratory subdomains
ECAS                 | total, ALS-specific, non-ALS-specific, FTD-spectrum mentions
Diagnosis            | El Escorial / Awaji-Shima certainty (definite/probable/possible), site of onset (bulbar/limb/spinal/respiratory)
Family history       | negative ("no family history of MND"), positive (named-relative constructions, +FH), genetic mutations (C9orf72, SOD1, FUS, TARDBP, TBK1, VCP, UBQLN2, PFN1, MATR3, CHCHD10, ANG, OPTN, ATXN2)
Pulmonary            | FVC percent-predicted
Treatment milestones | PEG placement date, NIV/BiPAP initiation date, riluzole start date, edaravone start date, tracheostomy date

Twenty patterns total in the shipped file. Each is registered with a name, regex, capture-group index, and a one-line description that ends up in the output CSV alongside the match.
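
A registry of this kind can be sketched as follows. This is a minimal illustration, not the shipped code: the NotePattern type, the scan helper, the 40-character snippet window, and both regexes are assumptions made for the example, and the real patterns in note_extraction.py may differ.

```python
import re
from typing import Iterator, NamedTuple


class NotePattern(NamedTuple):
    """One registered pattern (hypothetical structure for illustration)."""
    name: str
    regex: re.Pattern
    group: int          # index of the capture group that holds the value
    description: str    # copied into the CSV next to each match


# Illustrative entries only; the shipped regexes may differ.
PATTERNS = [
    NotePattern(
        name="alsfrs_r_total",
        regex=re.compile(
            r"ALSFRS-?R(?:\s+total)?\s*[:=]?\s*(\d{1,2})(?:\s*/\s*48)?",
            re.IGNORECASE,
        ),
        group=1,
        description="ALSFRS-R total score, range 0-48",
    ),
    NotePattern(
        name="fvc_percent_predicted",
        regex=re.compile(
            r"FVC\s*[:=]?\s*(\d{1,3})\s*%\s*(?:of\s+)?pred(?:icted)?",
            re.IGNORECASE,
        ),
        group=1,
        description="Forced vital capacity, percent predicted",
    ),
]


def scan(text: str, window: int = 40) -> Iterator[dict]:
    """Yield one row per match with value, snippet, and character offset."""
    for pat in PATTERNS:
        for m in pat.regex.finditer(text):
            start = max(0, m.start() - window)
            end = min(len(text), m.end() + window)
            yield {
                "pattern": pat.name,
                "value": m.group(pat.group),
                "snippet": "..." + text[start:end] + "...",
                "char_offset": m.start(),
                "description": pat.description,
            }


note = "Most recent ALSFRS-R total: 32 / 48. FVC: 62% predicted."
rows = list(scan(note))
```

Keeping the capture-group index in the registry lets a pattern use extra groups for context while still reporting a single value per match.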

Output format

patient_id,source_kind,source_id,pattern,value,snippet,char_offset,description
demo-patient-001,ccda_section,doc-001::Plan of Care,alsfrs_r_total,32,"...Most recent ALSFRS-R total: 32 / 48 (bulbar 8...",433,"ALSFRS-R total score, range 0-48"
demo-patient-001,ccda_section,doc-001::Plan of Care,el_escorial,clinically probable,"...Diagnosis: clinically probable ALS by Awaji-Shima criteria...",178,El Escorial / Awaji-Shima diagnostic certainty level
demo-patient-001,ccda_section,doc-001::Plan of Care,family_history_negative,no known family history of ALS,"...Family history: no known family history of ALS, MND, or FTD...",247,Negative family history
demo-patient-001,ccda_section,doc-001::Plan of Care,fvc_percent_predicted,62,"...FVC: 62% predicted (down from 78% predicted...",612,"Forced vital capacity, percent predicted"

source_kind takes one of two forms: ccda_section for matches in CCDA section narratives, or document:<format> (e.g. document:ccda_xml, document:rtf, document:pdf, document:html) for matches in decoded document body text.

The same content can match in both source types (a Plan-of-Care narrative lives inside the CCDA section and inside the full CCDA document plain_text), which produces two rows for one underlying clinical fact. Dedupe by (patient_id, pattern, value) if you only want one row per fact, or keep both for provenance.
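
The dedupe step can be sketched with the standard library alone. The dedupe_rows helper and the in-memory sample rows are assumptions for illustration (including the source_id shown for the document-level row); in practice the reader would be opened on note_extractions.csv.

```python
import csv
import io


def dedupe_rows(rows):
    """Keep the first row seen for each (patient_id, pattern, value) key."""
    seen = set()
    kept = []
    for row in rows:
        key = (row["patient_id"], row["pattern"], row["value"])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Tiny in-memory stand-in for note_extractions.csv (illustrative rows only).
sample = io.StringIO(
    "patient_id,source_kind,source_id,pattern,value\n"
    "demo-patient-001,ccda_section,doc-001::Plan of Care,fvc_percent_predicted,62\n"
    "demo-patient-001,document:ccda_xml,doc-001,fvc_percent_predicted,62\n"
    "demo-patient-001,ccda_section,doc-001::Plan of Care,alsfrs_r_total,32\n"
)
unique_rows = dedupe_rows(csv.DictReader(sample))
```

Note that this key also collapses a value that genuinely recurs across different notes (the same FVC at two visits, for example), so keep the full file if longitudinal detail matters.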

Running it

Defaults assume the bundle and the output sit in the current directory:

# top of note_extraction.py
BUNDLE_PATH = './dashboard_data.json'
OUT_PATH    = './note_extractions.csv'

From a notebook:

import note_extraction
note_extraction.main(
    bundle_path='./dashboard_data.json',
    out_path='./note_extractions.csv',
)

From the command line:

python note_extraction.py

The log at the end of the run summarizes per-pattern hit counts and per-patient match totals, which gives a quick check that the patterns actually fired.
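
If the log is no longer at hand, similar counts can be recomputed from the CSV. The sketch below is an assumption-laden illustration: the summarize helper, the EXPECTED set, and the sample rows are hypothetical, and the module's own log format is not reproduced here.

```python
import csv
import io
from collections import Counter


def summarize(rows):
    """Count matches per pattern and per patient from extraction rows."""
    per_pattern = Counter()
    per_patient = Counter()
    for row in rows:
        per_pattern[row["pattern"]] += 1
        per_patient[row["patient_id"]] += 1
    return per_pattern, per_patient


# Tiny in-memory stand-in for note_extractions.csv (illustrative rows only).
sample = io.StringIO(
    "patient_id,pattern,value\n"
    "demo-patient-001,fvc_percent_predicted,62\n"
    "demo-patient-001,fvc_percent_predicted,78\n"
    "demo-patient-001,alsfrs_r_total,32\n"
    "demo-patient-002,alsfrs_r_total,41\n"
)
per_pattern, per_patient = summarize(csv.DictReader(sample))

# Hypothetical list of patterns expected to fire at least once.
EXPECTED = {"alsfrs_r_total", "fvc_percent_predicted", "el_escorial"}
never_fired = sorted(EXPECTED - set(per_pattern))

for name, count in per_pattern.most_common():
    print(f"{name}: {count}")
print("patterns with zero hits:", never_fired)
```

A pattern with zero hits across a large corpus usually means the local phrasing differs from what the regex expects, not that the concept is absent.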

Adapting to your data

Local phrasing matters. The shipped patterns were written against synthetic and lightly disguised examples and miss a lot of real-world variation. Steps to localize:

  1. Sample first. Pull a few hundred narrative documents from your bundle and grep for each pattern's anchor word (ALSFRS, ECAS, family history, FVC, riluzole). Read the matches and the surrounding sentences.
  2. Add alternations. "ALSFRS-R: 32" and "ALSFRS-R 32/48" and "Total ALSFRS-R score is thirty-two" all need to match. The shipped patterns handle the first two; the third would need a numeric-words extension.
  3. Watch negation and hypotheticals. "Considering PEG", "discussed risks of tracheostomy", "patient declined NIV" all contain the trigger word but no actual milestone. Negation lookbehind (as the shipped family_history_positive uses to avoid "no family history of...") is a quick patch; clinical NLP libraries do this properly with assertion and modality detection.
  4. Add patterns specific to your disease. A cardiology registry would pattern on ejection fraction phrasing, NYHA class. An oncology registry on tumor stage, response criteria. A neurology registry on UPDRS, MoCA, Hoehn & Yahr.
  5. Validate before trusting captured values. Hand-review a sample of note_extractions.csv stratified by pattern (for example, one in twenty matches for each pattern) and compute precision per pattern. Anything below ~95% needs another pattern revision before its values feed downstream analysis.
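
Steps 2 and 3 can be sketched as below. Everything here is illustrative and hypothetical: the ALSFRS_TOTAL regex, the cue list, and the milestone_mentions helper are not the shipped patterns. The negation guard is a preceding-window cue check, not a lookbehind, because Python's re module only supports fixed-width lookbehind and cue phrases vary in length.

```python
import re

# Hypothetical pattern with alternations: accepts "ALSFRS-R: 32",
# "ALSFRS-R 32/48", and "ALSFRS-R score is 32". Numbers written as
# words ("thirty-two") are deliberately not handled.
ALSFRS_TOTAL = re.compile(
    r"ALSFRS-?R(?:\s+total)?(?:\s+score)?\s*(?:[:=]|is|of)?\s*(\d{1,2})\b",
    re.IGNORECASE,
)

# Cue words suggesting the milestone did not happen (not exhaustive).
NON_EVENT_CUES = re.compile(
    r"\b(?:considering|considered|discussed|declined|refused|deferred|no|not)\b",
    re.IGNORECASE,
)

MILESTONE = re.compile(r"\b(?:PEG|tracheostomy|NIV|BiPAP)\b", re.IGNORECASE)


def milestone_mentions(text, lookback=40):
    """Return (term, asserted) pairs.

    asserted is False when a cue word appears before the term, within
    `lookback` characters and inside the same sentence.
    """
    results = []
    for m in MILESTONE.finditer(text):
        preceding = text[max(0, m.start() - lookback):m.start()]
        # Restrict the window to the current sentence or clause.
        preceding = re.split(r"[.;\n]", preceding)[-1]
        asserted = NON_EVENT_CUES.search(preceding) is None
        results.append((m.group(0), asserted))
    return results


mentions = milestone_mentions("Patient declined NIV. PEG placed 2023-03-14.")
```

This guard only sees cues that precede the trigger word, so "tracheostomy declined" would still be reported as asserted; that gap is one reason assertion detection in a clinical NLP library is the more robust option.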

When to step up to real NLP

Regex is appropriate when:

  • The phrasing is fairly stereotyped (lab values, structured headers, score grids).
  • You're piloting and need first-pass coverage in days, not months.
  • The domain is small enough that a few dozen patterns cover most of it.

Regex breaks down when:

  • Negation, hedging, and temporality matter ("considering tracheostomy", "tracheostomy declined", "tracheostomy was considered three years ago").
  • Coreference resolution is needed (multiple paragraphs about different family members).
  • Numbers are written as words, abbreviations vary across providers, or notes are heavily templated with stale auto-populated content.

At that point a clinical NLP pipeline is the right answer:

  • cTAKES (Apache): mature, UMLS-backed, includes assertion/temporal modules.
  • MedSpaCy: lighter-weight, modern Python, built on spaCy, with a ConText component for negation and other assertion modifiers.
  • Spark NLP for Healthcare: commercial; full clinical entity extraction including ICD/RxNorm linking.
  • Clinical-BERT family models: useful for relation extraction (e.g. "drug X caused side-effect Y") that pattern-matching can't reach.

This module's CSV output is intentionally shaped like the span-level output these tools typically produce: one row per extracted mention, keyed by (patient_id, source_id, pattern, value, char_offset). A downstream system can therefore consume note_extractions.csv directly while a richer NLP pipeline is being stood up.