Note keyword flagging

Scans the free-text fields of every record (the text and display_name columns) for matches against a keyword list of case-insensitive regex patterns. Returns a flag DataFrame with one row per match: patient ID, source file, category, matched keyword, and a snippet of surrounding context.

Under construction, screening only

The flagger does naive case-insensitive regex matching. It's a starting point for cohort screening and chart review prioritization, not a clinically validated NLP pipeline. Treat presence or absence of a flag as a starting question, never as definitive evidence.
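As a rough picture of what "naive case-insensitive regex matching" means here, the sketch below finds every match of one pattern and cuts a context snippet around it. `find_flags` and `CONTEXT_CHARS` are hypothetical names used only for illustration; they are not part of registryforge_patient, and the real flagger may differ in detail.

```python
import re

CONTEXT_CHARS = 80  # assumed to mirror the ±80-character snippet in note_flags.csv


def find_flags(text, label, pattern):
    """Yield one dict per case-insensitive regex match in `text` (illustrative only)."""
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        # Clamp the context window to the bounds of the text.
        start = max(m.start() - CONTEXT_CHARS, 0)
        end = min(m.end() + CONTEXT_CHARS, len(text))
        yield {
            "keyword": label,
            "matched_text": m.group(0),
            "snippet": text[start:end],
        }


note = "Mother reports two brief Seizures last week; no history of seizures before age 3."
hits = list(find_flags(note, "seizure", r"\bseizures?\b"))
for hit in hits:
    print(hit["keyword"], "|", hit["matched_text"])
```

Both mentions are flagged, including the negated one, which is exactly why a flag should be read as a prompt for chart review and not as a finding.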

What it produces

Two CSV files in <output_dir>/:

note_flags.csv - one row per match, with these columns:

Column	What's in it
`patient_id`	Stable patient ID from the master
`source_file`	Which C-CDA file produced the match
`category`	Registry Forge category of the source row (notes, problems, observations, etc.)
`effective_date`	Date on the source row
`keyword`	Label of the keyword that matched
`matched_text`	The exact substring that matched
`snippet`	±80 characters of context around the match
`source_column`	Which column the match came from (`text` or `display_name`)
`source_row_index`	Row index in the master DataFrame

note_flags_summary.csv - one row per (patient_id, keyword) pair with a hit count. Useful as a cohort-screening matrix.
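To see how the two files relate, here is a minimal pandas sketch that derives a summary from toy flag rows and pivots it into a patient-by-keyword matrix. The patient IDs are made up, and the count column name `hit_count` is an assumption; check the header of your own note_flags_summary.csv before relying on it.

```python
import pandas as pd

# Toy stand-in for note_flags.csv: one row per match (IDs are invented).
flags = pd.DataFrame(
    {
        "patient_id": ["P1", "P1", "P1", "P2"],
        "keyword": ["seizure", "seizure", "AAC", "genetic testing"],
    }
)

# One row per (patient_id, keyword) pair with a hit count.
# "hit_count" is an assumed column name, not confirmed by the package.
summary = (
    flags.groupby(["patient_id", "keyword"])
    .size()
    .reset_index(name="hit_count")
)

# Wide cohort-screening matrix: patients as rows, keywords as columns, 0 where no hit.
matrix = (
    summary.pivot(index="patient_id", columns="keyword", values="hit_count")
    .fillna(0)
    .astype(int)
)
print(matrix)
```

A zero in the matrix only means no pattern matched in the scanned columns; it does not mean the concept is absent from the chart.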

Default keyword list

The defaults are tuned for neurology and rare disease cohorts (Boyce Lab's focus):

Label	Pattern (case-insensitive regex)
seizure	`seizure(s)?`, `convulsion`, `status epilepticus`
intellectual	`intellectual disability`, `developmental delay`, `cognitive impairment/delay`
wheelchair	`wheelchair`, `wheel chair`
AAC	`AAC`, `augmentative (and|or)? alternative communication`, `speech-generating device`
genetic testing	`genetic testing`, `chromosomal microarray`, `exome sequencing`, `whole genome sequencing`, `karyotype`, `gene panel`
epilepsy	`epilepsy`, `epileptic`, `epileptiform`
autism	`autism`, `autistic`, `ASD`
feeding tube	`g-tube`, `gastrostomy`, `feeding tube`, `nasogastric`, `enteral feeding`
ventilator	`ventilator/ventilation`, `tracheostomy`, `BiPAP`, `CPAP`
physical therapy	`physical/occupational/speech therapy`, `PT`, `OT`, `SLP`

Word boundaries (\b) prevent unwanted matches inside longer words, such as AAC inside Isaac or PT inside symptom.
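The effect of the word boundary is easy to check with the standard `re` module. This is a standalone illustration of the regex behavior, not the package's own code:

```python
import re

bare = re.compile(r"AAC", re.IGNORECASE)        # no word boundaries
bounded = re.compile(r"\bAAC\b", re.IGNORECASE)  # whole-token match only

samples = ["Uses an AAC device at school", "Seen by Dr. Isaac today"]
for s in samples:
    print(f"{s!r}: bare={bool(bare.search(s))}, bounded={bool(bounded.search(s))}")
```

One caveat that follows from plain regex behavior: word boundaries do not help with short abbreviations that are also ordinary tokens. A case-insensitive `\bPT\b` would still match "pt" written as shorthand for patient, so expect extra noise on labels built from two-letter abbreviations.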

How to use

CLI

registryforge-patient parse ./your/ccda/folder --output ./out --flag-notes

Outputs land at ./out/note_flags.csv and ./out/note_flags_summary.csv.

Python

from registryforge_patient import flag_notes, flag_summary, build_outputs

# As part of the pipeline
build_outputs(
    input_path='./your/ccda/folder',
    output_dir='./out',
    with_notes=True,
)

# Or standalone against a DataFrame
import pandas as pd
df = pd.read_csv('./out/patient_master.csv', encoding='utf-8-sig')
flags = flag_notes(df)
print(flags.head())

summary = flag_summary(flags)
print(summary)

Custom keyword list

Pass your own list of (label, pattern) tuples or plain strings:

my_keywords = [
    ('mobility', r'\b(wheelchair|walker|cane|gait)\b'),
    ('respiratory', r'\b(BiPAP|CPAP|tracheostomy|ventilator)\b'),
    'NMOSD',                  # plain string treated as literal substring
    'rituximab',
]

flags = flag_notes(df, keywords=my_keywords)

Plain strings are auto-escaped, so special regex characters in your search term won't break matching. Tuples let you specify the full regex.
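The difference between the two input forms can be sketched with `re.escape`. `compile_keywords` below is a hypothetical helper written for this page, not the package's internal function, and the search term is only an example:

```python
import re


def compile_keywords(keywords):
    """Return (label, compiled_pattern) pairs from mixed tuples and plain strings."""
    compiled = []
    for item in keywords:
        if isinstance(item, str):
            # Plain string: escape it so characters like '(' or '+' match literally.
            label, pattern = item, re.escape(item)
        else:
            # Tuple: the caller supplies the full regex and is responsible for it.
            label, pattern = item
        compiled.append((label, re.compile(pattern, re.IGNORECASE)))
    return compiled


compiled = compile_keywords(
    [("mobility", r"\b(wheelchair|walker)\b"), "AQP4-IgG (+)"]
)
for label, pattern in compiled:
    print(label, "->", pattern.pattern)
```

Without escaping, `AQP4-IgG (+)` is not even a valid regex, because `(+)` is a group with nothing to repeat. Inside a tuple you get no such protection, so escape literal punctuation yourself.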

Scan additional columns

By default the flagger checks text and display_name. To also scan the full raw JSON blob (slower but catches matches the parser didn't surface to top-level columns):

flags = flag_notes(df, text_columns=['text', 'display_name', 'raw_record_json'])
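Why the raw JSON can catch more is easiest to see on a toy record. The record shape below is hypothetical (the real raw_record_json layout may differ); the point is only that a value nested below the top level is visible to a search over the serialized blob but not to a search over text and display_name:

```python
import json
import re

# Hypothetical record shape, invented for illustration.
record = {
    "display_name": "Durable medical equipment",
    "text": "",
    "details": {"comment": "Family requests tracheostomy care teaching"},
}
raw_record_json = json.dumps(record)

pattern = re.compile(r"\btracheostomy\b", re.IGNORECASE)

# Search only the two default columns, then the whole serialized record.
top_level_hit = any(pattern.search(record[col]) for col in ("text", "display_name"))
raw_hit = pattern.search(raw_record_json) is not None
print(top_level_hit, raw_hit)
```

The trade-off is noise: a serialized blob also contains keys, codes, and metadata, so some matches may land on text that is not narrative at all. The source_column value on each flag row lets you separate those hits out afterwards.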

What's not done

No entity normalization ("AAC" and "augmentative communication" produce separate flag entries even though they refer to the same concept)
No negation detection ("no history of seizures" still produces a seizure flag)
No section-context awareness (a seizure mention in a problem-list row vs. a family-history row vs. a denied-symptoms note all flag identically)
No HPO term resolution - the underlying ontology mapping isn't connected to the flagger yet

These are tractable improvements for future releases. The proper solution for negation/context is something like NegEx or a small spaCy pipeline, which would graduate this from a screening tool to something closer to a real NLP layer.
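To make the negation gap concrete, here is a deliberately crude sketch of a cue-window check. It is not NegEx and not part of the package: the cue list, the 60-character window, and the name `is_probably_negated` are all assumptions chosen for illustration, and a real implementation would need a curated trigger set and proper scope rules.

```python
import re

# Tiny illustrative cue list. A cue counts only if no '.' or ';' separates it
# from the match, which is a rough stand-in for "same clause".
NEGATION_CUES = re.compile(
    r"\b(no|denies|denied|without|negative for)\b[^.;]{0,40}$",
    re.IGNORECASE,
)


def is_probably_negated(text, match_start, window=60):
    """True if a negation cue appears shortly before the match in the same clause."""
    preceding = text[max(match_start - window, 0):match_start]
    return NEGATION_CUES.search(preceding) is not None


text = "No history of seizures. Mother reports seizures at night."
results = [
    (m.group(0), is_probably_negated(text, m.start()))
    for m in re.finditer(r"\bseizures?\b", text, re.IGNORECASE)
]
print(results)
```

Even this toy version separates the two mentions in the example, but it will fail on constructions like "no change in seizures", which is why a maintained approach is the better long-term route.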

Demo output

You can try the flagger on the synthetic sample_data/ without tuning anything: the demo narratives contain neurology vocabulary that the default keyword list matches. Build a flag table yourself:

from registryforge_patient import parse_folder, flag_notes
df, _ = parse_folder('./sample_data')
flags = flag_notes(df)   # use default neurology/rare disease keyword list
print(flags.groupby('keyword').size().sort_values(ascending=False))

You should get around 80 matches across roughly 7 of the default keywords. Joe Demo (the Dravet syndrome persona) accounts for most of them - his clinical notes mention seizures, AAC, wheelchair, physical therapy, and genetic testing in detail. Alex Demo contributes a smaller number of intellectual disability and genetic testing flags. Jane Demo (the well-controlled focal epilepsy persona) has minimal note text and therefore few keyword matches.