Note keyword flagging
Scans the free-text fields of every record (the text and display_name columns) for substring matches against a keyword list. Returns a flag DataFrame with patient ID, source file, category, keyword matched, and a snippet of surrounding context.
Under construction, screening only
The flagger does naive case-insensitive regex matching. It's a starting point for cohort screening and chart review prioritization, not a clinically validated NLP pipeline. Treat presence or absence of a flag as a starting question, never as definitive evidence.
What it produces
Two CSV files in <output_dir>/:
note_flags.csv - one row per match, with these columns:
| Column | What's in it |
|---|---|
| patient_id | Stable patient ID from the master |
| source_file | Which C-CDA file produced the match |
| category | Registry Forge category of the source row (notes, problems, observations, etc.) |
| effective_date | Date on the source row |
| keyword | Label of the keyword that matched |
| matched_text | The exact substring that matched |
| snippet | ±80 characters of context around the match |
| source_column | Which column the match came from (text or display_name) |
| source_row_index | Row index in the master DataFrame |
note_flags_summary.csv - one row per (patient_id, keyword) pair with a hit count. Useful as a cohort-screening matrix.
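As a rough illustration of how the two outputs relate, the sketch below builds flag rows with an 80-character context window and then counts hits per (patient_id, keyword) pair. It is not the library's implementation: `find_flags`, `summarize`, and the sample records are made up for this example, only a subset of the columns is shown, and only a `text` field is scanned.

```python
import re
from collections import Counter

CONTEXT = 80  # characters of context kept on each side of a match


def find_flags(records, keywords):
    """Hypothetical helper: yield one dict per regex match in the 'text' field."""
    for rec in records:
        text = rec["text"]
        for label, pattern in keywords:
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                start = max(0, m.start() - CONTEXT)
                end = min(len(text), m.end() + CONTEXT)
                yield {
                    "patient_id": rec["patient_id"],
                    "keyword": label,
                    "matched_text": m.group(0),
                    "snippet": text[start:end],
                    "source_column": "text",
                }


def summarize(flags):
    """Hypothetical helper: count hits per (patient_id, keyword) pair."""
    return Counter((f["patient_id"], f["keyword"]) for f in flags)


# Invented sample records, not taken from sample_data/.
records = [
    {"patient_id": "P1", "text": "Two seizures last week; uses a wheelchair. Seizure log attached."},
    {"patient_id": "P2", "text": "Wheel chair ordered."},
]
keywords = [
    ("seizure", r"\bseizures?\b"),
    ("wheelchair", r"\bwheel\s?chair\b"),
]

flags = list(find_flags(records, keywords))
summary = summarize(flags)
for (patient_id, keyword), hits in sorted(summary.items()):
    print(patient_id, keyword, hits)
```

Each match becomes its own row, so a patient whose note mentions a keyword twice contributes two flag rows but a single summary entry with a count of 2.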
Default keyword list
The defaults are tuned for neurology and rare disease cohorts (Boyce Lab's focus):
| Label | Pattern (case-insensitive regex) |
|---|---|
| seizure | seizure(s)?, convulsion, status epilepticus |
| intellectual | intellectual disability, developmental delay, cognitive impairment/delay |
| wheelchair | wheelchair, wheel chair |
| AAC | AAC, augmentative (and\|or)? alternative communication, speech-generating device |
| genetic testing | genetic testing, chromosomal microarray, exome sequencing, whole genome sequencing, karyotype, gene panel |
| epilepsy | epilepsy, epileptic, epileptiform |
| autism | autism, autistic, ASD |
| feeding tube | g-tube, gastrostomy, feeding tube, nasogastric, enteral feeding |
| ventilator | ventilator/ventilation, tracheostomy, BiPAP, CPAP |
| physical therapy | physical/occupational/speech therapy, PT, OT, SLP |
Word boundaries (\b) prevent unwanted matches inside longer words, such as AAC inside Isaac.
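The effect of the word boundaries can be checked with Python's standard `re` module. The two patterns below are illustrative stand-ins, not copied from the library's default list.

```python
import re

# Same keyword with and without word boundaries, both case-insensitive.
unbounded = re.compile(r"AAC", re.IGNORECASE)
bounded = re.compile(r"\bAAC\b", re.IGNORECASE)

for text in ["Isaac attended clinic", "Uses an AAC device daily"]:
    print(
        f"{text!r}: unbounded={bool(unbounded.search(text))}, "
        f"bounded={bool(bounded.search(text))}"
    )
```

The unbounded pattern hits the name "Isaac", while the bounded one only matches the standalone token.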
How to use
CLI
Outputs land at ./out/note_flags.csv and ./out/note_flags_summary.csv.
Python
```python
import pandas as pd

from registryforge_patient import flag_notes, flag_summary, build_outputs

# As part of the pipeline
build_outputs(
    input_path='./your/ccda/folder',
    output_dir='./out',
    with_notes=True,
)

# Or standalone against a DataFrame
df = pd.read_csv('./out/patient_master.csv', encoding='utf-8-sig')
flags = flag_notes(df)
print(flags.head())

summary = flag_summary(flags)
print(summary)
```
Custom keyword list
Pass your own list of (label, pattern) tuples or plain strings:
```python
my_keywords = [
    ('mobility', r'\b(wheelchair|walker|cane|gait)\b'),
    ('respiratory', r'\b(BiPAP|CPAP|tracheostomy|ventilator)\b'),
    'NMOSD',  # plain string treated as literal substring
    'rituximab',
]
flags = flag_notes(df, keywords=my_keywords)
```
Plain strings are auto-escaped, so special regex characters in your search term won't break matching. Tuples let you specify the full regex.
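One way the two input forms could be normalized is sketched below with `re.escape`. `normalize_keywords` is a hypothetical helper written for this page, and using the plain string as its own label is an assumption; the library's internals may differ.

```python
import re


def normalize_keywords(keywords):
    """Hypothetical helper: turn mixed input into (label, compiled_regex) pairs."""
    normalized = []
    for item in keywords:
        if isinstance(item, str):
            # Plain string: escape regex metacharacters so it matches literally.
            label, pattern = item, re.escape(item)
        else:
            # Tuple: the caller supplies the full regex as-is.
            label, pattern = item
        normalized.append((label, re.compile(pattern, re.IGNORECASE)))
    return normalized


pairs = normalize_keywords([
    ("mobility", r"\b(wheelchair|walker|cane|gait)\b"),
    "anti-AQP4 (IgG)",  # parentheses would form a regex group if left unescaped
])

note = "Positive for anti-AQP4 (IgG); ambulates with a walker."
for label, regex in pairs:
    match = regex.search(note)
    print(label, "->", match.group(0) if match else None)
```

Without escaping, the parentheses in the second term would be read as a capture group and the literal text "(IgG)" would not match.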
Scan additional columns
By default the flagger checks text and display_name. To also scan the full raw JSON blob (slower but catches matches the parser didn't surface to top-level columns):
What's not done
- No entity normalization ("AAC" and "augmentative communication" produce separate flag entries even though they refer to the same concept)
- No negation detection ("no history of seizures" produces a seizure flag)
- No section-context awareness (a seizure mention in a problem-list row vs. a family-history row vs. a denied-symptoms note all flag identically)
- No HPO term resolution - the underlying ontology mapping isn't connected to the flagger yet
These are tractable improvements for future releases. The proper solution for negation/context is something like NegEx or a small spaCy pipeline, which would graduate this from a screening tool to something closer to a real NLP layer.
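To make the negation gap concrete, the sketch below shows the false positive alongside a deliberately crude cue-window filter. This is not NegEx and not part of the flagger: `NEGATION_CUES`, `is_negated`, and the 40-character window are assumptions chosen for illustration, and a rule this simple will misfire on many real notes.

```python
import re

# Hypothetical, far-from-complete list of negation cues.
NEGATION_CUES = re.compile(r"\b(no|denies|denied|without|negative for)\b", re.IGNORECASE)
KEYWORD = re.compile(r"\bseizures?\b", re.IGNORECASE)
WINDOW = 40  # characters before the match to inspect (arbitrary choice)


def is_negated(text, match):
    """Return True if a negation cue appears shortly before the match, in the same sentence."""
    preceding = text[max(0, match.start() - WINDOW):match.start()]
    # Keep only the current sentence so an earlier sentence cannot negate this one.
    preceding = re.split(r"[.;!?]", preceding)[-1]
    return NEGATION_CUES.search(preceding) is not None


notes = [
    "No history of seizures.",
    "Patient had a seizure on Tuesday.",
    "Denies headaches. Seizures continue weekly.",
]
for note in notes:
    for m in KEYWORD.finditer(note):
        status = "negated" if is_negated(note, m) else "affirmed"
        print(f"{m.group(0)!r} in {note!r}: {status}")
```

All three notes would produce a seizure flag in the current flagger; a filter along these lines would only mark the first as negated.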
Demo output
To see the flagger's output on the synthetic sample_data/, build a flag table from it with the default keyword list:
```python
from registryforge_patient import parse_folder, flag_notes

df, _ = parse_folder('./sample_data')
flags = flag_notes(df)  # use default neurology/rare disease keyword list
print(flags.groupby('keyword').size().sort_values(ascending=False))
```
You should get around 80 matches across roughly 7 of the default keywords. Joe Demo (the Dravet syndrome persona) accounts for most of them - his clinical notes mention seizures, AAC, wheelchair, physical therapy, and genetic testing in detail. Alex Demo contributes a smaller number of intellectual disability + genetic testing flags. Jane Demo (the well-controlled focal epilepsy persona) has minimal note text and few keyword matches.