Note keyword flagging

Scans the free-text fields of every record (the text and display_name columns) for case-insensitive regex matches against a keyword list. Returns a flag DataFrame with patient ID, source file, category, keyword matched, and a snippet of surrounding context.

Under construction, screening only

The flagger does naive case-insensitive regex matching. It's a starting point for cohort screening and chart review prioritization, not a clinically validated NLP pipeline. Treat presence or absence of a flag as a starting question, never as definitive evidence.
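
To make "naive case-insensitive regex matching" concrete, here is a minimal sketch of the idea using only the standard library. It is not the flagger's actual implementation: the helper name `scan_text` and the dictionary keys are assumptions chosen to mirror the output columns described on this page.

```python
import re

def scan_text(text, label, pattern, context=80):
    """Hypothetical helper: return one dict per case-insensitive regex match."""
    hits = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        # Keep up to `context` characters on each side of the match.
        start = max(0, m.start() - context)
        end = min(len(text), m.end() + context)
        hits.append({
            "keyword": label,
            "matched_text": m.group(0),
            "snippet": text[start:end],
        })
    return hits

note = "Patient denies any Seizures since the last visit."
hits = scan_text(note, "seizure", r"\bseizures?\b")
print(hits[0]["matched_text"])
```

Note that the denial in this sentence is still flagged, which is exactly why a flag should be read as a prompt for review rather than a finding.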

What it produces

Two CSV files in <output_dir>/:

  • note_flags.csv - one row per match, with these columns:

    | Column | What's in it |
    |---|---|
    | patient_id | Stable patient ID from the master |
    | source_file | Which C-CDA file produced the match |
    | category | Registry Forge category of the source row (notes, problems, observations, etc.) |
    | effective_date | Date on the source row |
    | keyword | Label of the keyword that matched |
    | matched_text | The exact substring that matched |
    | snippet | ±80 characters of context around the match |
    | source_column | Which column the match came from (text or display_name) |
    | source_row_index | Row index in the master DataFrame |

  • note_flags_summary.csv - one row per (patient_id, keyword) pair with a hit count. Useful as a cohort-screening matrix.
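
As a rough illustration of the cohort-screening matrix idea, the sketch below counts hits per (patient_id, keyword) pair and spreads them into one row per patient. It uses only the documented patient_id and keyword columns; the rows themselves are made-up illustrative values, not output from the tool, and the real summary file's count column name is not assumed here.

```python
from collections import Counter

# Illustrative rows shaped like note_flags.csv (values are made up).
rows = [
    {"patient_id": "P001", "keyword": "seizure"},
    {"patient_id": "P001", "keyword": "seizure"},
    {"patient_id": "P001", "keyword": "AAC"},
    {"patient_id": "P002", "keyword": "genetic testing"},
]

# Count hits per (patient_id, keyword) pair.
counts = Counter((r["patient_id"], r["keyword"]) for r in rows)

patients = sorted({p for p, _ in counts})
keywords = sorted({k for _, k in counts})

# Wide matrix: one row per patient, one column per keyword, 0 where absent.
matrix = {p: {k: counts.get((p, k), 0) for k in keywords} for p in patients}

for p in patients:
    print(p, matrix[p])
```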

Default keyword list

The defaults are tuned for neurology and rare disease cohorts (Boyce Lab's focus):

| Label | Pattern (case-insensitive regex) |
|---|---|
| seizure | seizure(s)?, convulsion, status epilepticus |
| intellectual | intellectual disability, developmental delay, cognitive impairment/delay |
| wheelchair | wheelchair, wheel chair |
| AAC | AAC, augmentative (and\|or)? alternative communication, speech-generating device |
| genetic testing | genetic testing, chromosomal microarray, exome sequencing, whole genome sequencing, karyotype, gene panel |
| epilepsy | epilepsy, epileptic, epileptiform |
| autism | autism, autistic, ASD |
| feeding tube | g-tube, gastrostomy, feeding tube, nasogastric, enteral feeding |
| ventilator | ventilator/ventilation, tracheostomy, BiPAP, CPAP |
| physical therapy | physical/occupational/speech therapy, PT, OT, SLP |

Word boundaries (\b) prevent unwanted matches, such as AAC inside the name Isaac.
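
The effect of the word boundary is easy to check with Python's `re` module. This is a standalone sketch, not the flagger's own code; the two sample sentences are invented for illustration.

```python
import re

bounded = re.compile(r"\bAAC\b", re.IGNORECASE)
unbounded = re.compile(r"AAC", re.IGNORECASE)

samples = ["Isaac attended the visit", "Uses an AAC device daily"]

# Record (text, unbounded match?, bounded match?) for each sample.
results = [
    (text, bool(unbounded.search(text)), bool(bounded.search(text)))
    for text in samples
]

for text, loose, strict in results:
    print(f"{text!r}: without \\b={loose}, with \\b={strict}")
```

Without `\b`, the case-insensitive pattern hits the "aac" in "Isaac"; with it, only the standalone abbreviation matches.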

How to use

CLI

registryforge-patient parse ./your/ccda/folder --output ./out --flag-notes

Outputs land at ./out/note_flags.csv and ./out/note_flags_summary.csv.

Python

from registryforge_patient import flag_notes, flag_summary, build_outputs

# As part of the pipeline
build_outputs(
    input_path='./your/ccda/folder',
    output_dir='./out',
    with_notes=True,
)

# Or standalone against a DataFrame
import pandas as pd
df = pd.read_csv('./out/patient_master.csv', encoding='utf-8-sig')
flags = flag_notes(df)
print(flags.head())

summary = flag_summary(flags)
print(summary)

Custom keyword list

Pass your own list of (label, pattern) tuples or plain strings:

my_keywords = [
    ('mobility', r'\b(wheelchair|walker|cane|gait)\b'),
    ('respiratory', r'\b(BiPAP|CPAP|tracheostomy|ventilator)\b'),
    'NMOSD',                  # plain string treated as literal substring
    'rituximab',
]

flags = flag_notes(df, keywords=my_keywords)

Plain strings are auto-escaped, so special regex characters in your search term won't break matching. Tuples let you specify the full regex.
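
The difference between the two entry types can be sketched with `re.escape`. The helper below, `normalize_keywords`, is hypothetical and only illustrates the idea; in particular, using the plain string as its own label is an assumption, not documented behavior.

```python
import re

def normalize_keywords(keywords):
    """Hypothetical helper: turn mixed entries into (label, compiled regex) pairs."""
    compiled = []
    for entry in keywords:
        if isinstance(entry, str):
            # Plain string: escape regex metacharacters so it matches literally.
            # Assumption: the string doubles as its own label.
            label, pattern = entry, re.escape(entry)
        else:
            # Tuple: the caller supplies the full regex.
            label, pattern = entry
        compiled.append((label, re.compile(pattern, re.IGNORECASE)))
    return compiled

pairs = normalize_keywords([
    ("mobility", r"\b(wheelchair|walker|cane|gait)\b"),
    "22q11.2",  # the "." would match any character if left unescaped
])

literal = pairs[1][1]
print(bool(literal.search("22q11.2 deletion syndrome")))
print(bool(literal.search("22q1132")))
```

Escaping means the `.` in `22q11.2` only matches a literal period, so a string like `22q1132` is not flagged.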

Scan additional columns

By default the flagger checks text and display_name. To also scan the full raw JSON blob (slower but catches matches the parser didn't surface to top-level columns):

flags = flag_notes(df, text_columns=['text', 'display_name', 'raw_record_json'])

What's not done

  • No entity normalization (AAC and augmentative communication produce separate flag entries even though they refer to the same concept)
  • No negation detection (no history of seizures produces a seizure flag)
  • No section-context awareness (a seizure mention in a problem-list row vs. a family-history row vs. a denied-symptoms note all flag identically)
  • No HPO term resolution - the underlying ontology mapping isn't connected to the flagger yet

These are tractable improvements for future releases. The proper solution for negation/context is something like NegEx or a small spaCy pipeline, which would graduate this from a screening tool to something closer to a real NLP layer.
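
To show what the negation gap looks like in practice, here is a deliberately crude sketch that looks for a negation cue earlier in the same sentence. It is not NegEx and not part of the flagger; the cue list, the helper name `is_probably_negated`, and the 40-character window are all assumptions made for illustration.

```python
import re

# Assumed, very incomplete cue list; real NegEx uses a much larger curated set.
NEGATION_CUES = re.compile(r"\b(no|denies|without|negative for)\b", re.IGNORECASE)

def is_probably_negated(text, match_start, window=40):
    """Crude check: is there a negation cue shortly before the match, in the same sentence?"""
    # Only look at text after the last period, i.e. the current sentence so far.
    sentence_so_far = text[:match_start].rsplit(".", 1)[-1]
    return bool(NEGATION_CUES.search(sentence_so_far[-window:]))

text = "No history of seizures. Mother reports seizure last week."
results = [
    (m.group(0), is_probably_negated(text, m.start()))
    for m in re.finditer(r"\bseizures?\b", text, re.IGNORECASE)
]
print(results)
```

Even this toy version needs sentence splitting to avoid carrying "No" into the next sentence, which hints at why a dedicated negation component is the better long-term answer.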

Demo output

To see the flagger in action, run it against the synthetic sample_data/ with the default keyword list:

from registryforge_patient import parse_folder, flag_notes
df, _ = parse_folder('./sample_data')
flags = flag_notes(df)   # use default neurology/rare disease keyword list
print(flags.groupby('keyword').size().sort_values(ascending=False))

You should get around 80 matches across roughly 7 of the default keywords. Joe Demo (the Dravet syndrome persona) accounts for most of them: his clinical notes mention seizures, AAC, wheelchair, physical therapy, and genetic testing in detail. Alex Demo contributes a smaller number of intellectual disability and genetic testing flags. Jane Demo (the well-controlled focal epilepsy persona) has minimal note text and few keyword matches.