Stage 2 -- Format detection

Purpose

Classify each reassembled document by format so the right parser can run on it. EHR exports routinely mix CCDA XML, RTF clinical notes, HTML fragments, and PDF scans under a single "document" abstraction.

Detection logic

detect_format(data) returns one of:

Value | Detected via
pdf | first 4 bytes are %PDF
rtf | content starts with {\rtf after leading whitespace is stripped
ccda_xml | starts with <?xml, or contains <ClinicalDocument in the first 512 bytes
html_fragment | contains an <html>, <body>, <div>, <p>, <span>, <br>, or <table> tag in the first 512 bytes (case-insensitive)
unknown | none of the above

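The first-512-byte window is sufficient because clinical documents have predictable preambles. The check order matters -- PDF first (binary signature), then RTF, then strict XML, then forgiving HTML. The HTML check is the most permissive, so it runs last: an XML document that happens to contain a tag such as <table> or <br> is classified as ccda_xml before the HTML rule can claim it.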
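
The rules and their ordering can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name and return values come from the table above, while the regular expression, the constant names, and the choice to accept tags with attributes (e.g. <div class="x">) are assumptions.

```python
import re

SNIFF_WINDOW = 512  # bytes inspected for the XML and HTML checks

# Assumption: a "tag" may carry attributes or be self-closing (<br/>).
HTML_TAG_RE = re.compile(
    rb"<(?:html|body|div|p|span|br|table)\b[^>]*>", re.IGNORECASE
)


def detect_format(data: bytes) -> str:
    """Classify a reassembled document by sniffing its leading bytes."""
    # 1. PDF: fixed binary signature in the first 4 bytes.
    if data[:4] == b"%PDF":
        return "pdf"
    # 2. RTF: header group after leading whitespace is stripped.
    if data.lstrip().startswith(b"{\\rtf"):
        return "rtf"
    head = data[:SNIFF_WINDOW]
    # 3. Strict XML: XML declaration or a CCDA root element in the window.
    if data.startswith(b"<?xml") or b"<ClinicalDocument" in head:
        return "ccda_xml"
    # 4. Forgiving HTML: any common tag in the window, case-insensitive.
    if HTML_TAG_RE.search(head):
        return "html_fragment"
    return "unknown"
```

Because the XML check runs before the HTML check, a document such as b"<ClinicalDocument><table>...</table></ClinicalDocument>" is reported as ccda_xml rather than html_fragment.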

Per-format text extraction

After detection, each document goes through a format-specific extractor:

Format | Extractor | Notes
pdf | extract_pdf_text (uses pypdf) | Returns extracted text. Image-only scans return [PDF - no extractable text] -- they need OCR for content.
rtf | strip_rtf | Hand-written RTF stripper. Handles \'XX hex escapes and \uNNNN? Unicode escapes, removes font/color/style/list tables, preserves text.
html_fragment | strip_html | Removes <script> and <style> blocks, strips remaining tags, decodes HTML entities.
ccda_xml | full structured parse | See Stage 3 -- structured records are extracted in addition to narrative text.
unknown | best-effort UTF-8 decode | Treated as plain text.

Every document, regardless of format, ends up with a plain_text field on its document record. This is what powers the dashboard's keyword search.
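
As an illustration of the two simplest extractors, here is a minimal sketch of the HTML stripper and the unknown-format fallback using only the standard library. The name strip_html comes from the table above; decode_unknown is a hypothetical name, and the whitespace collapsing and errors="replace" behaviour are assumptions rather than documented behaviour.

```python
import html
import re

# <script>...</script> and <style>...</style> blocks, including their content.
_SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
_TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text: str) -> str:
    """Reduce an HTML fragment to plain text."""
    text = _SCRIPT_STYLE_RE.sub("", text)   # drop script/style blocks first
    text = _TAG_RE.sub(" ", text)           # strip remaining tags
    text = html.unescape(text)              # decode entities such as &lt;
    return re.sub(r"\s+", " ", text).strip()  # assumption: collapse whitespace


def decode_unknown(data: bytes) -> str:
    """Best-effort UTF-8 decode; undecodable bytes become U+FFFD (assumption)."""
    return data.decode("utf-8", errors="replace")
```

Entities are decoded after tags are stripped, so an escaped "&lt;b&gt;" in the source text survives as literal "<b>" instead of being removed as a tag.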

Why a hand-written RTF stripper

The standard Python ecosystem doesn't ship a maintained pure-Python RTF parser. Available libraries either require system dependencies (e.g. LibreOffice) or fail on the non-standard RTF that EHR vendors emit. The hand-written stripper is ~60 lines, has no dependencies, and is tolerant of malformed groups.

The trade-off is that table layout is lost -- output is plain prose with paragraphs preserved. For keyword search and clinical review, that's sufficient.
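
To make the approach concrete, here is a minimal sketch of a tokenizing RTF stripper along the lines described above. It is not the project's strip_rtf: the tokenizer regex, the list of skipped destinations, and the cp1252 fallback for \'XX escapes are assumptions chosen for illustration.

```python
import re

# One token per match: control word, \'XX hex escape, control symbol,
# brace, raw newline (ignored), or a literal character.
_TOKEN_RE = re.compile(
    r"\\([a-z]{1,32})(-?\d{1,10})? ?"
    r"|\\'([0-9a-fA-F]{2})"
    r"|\\([^a-z])"
    r"|([{}])"
    r"|[\r\n]+"
    r"|(.)"
)

# Assumption: destinations whose whole group is dropped.
_SKIP_DESTINATIONS = {
    "fonttbl", "colortbl", "stylesheet", "listtable",
    "listoverridetable", "info",
}


def strip_rtf(text: str) -> str:
    out = []
    stack = []         # (ignorable, uc) saved at each "{"
    ignorable = False  # True while inside a skipped destination
    uc = 1             # fallback characters that follow each \uN
    skip = 0           # fallback characters still to discard
    for m in _TOKEN_RE.finditer(text):
        word, arg, hexval, symbol, brace, char = m.groups()
        if brace:
            skip = 0
            if brace == "{":
                stack.append((ignorable, uc))
            elif stack:  # tolerate an unmatched "}"
                ignorable, uc = stack.pop()
        elif symbol:
            skip = 0
            if symbol == "*":            # \* marks an optional destination
                ignorable = True
            elif not ignorable and symbol in "\\{}":
                out.append(symbol)       # escaped literal
        elif word:
            skip = 0
            if word in _SKIP_DESTINATIONS:
                ignorable = True
            elif ignorable:
                pass
            elif word in ("par", "line"):
                out.append("\n")         # paragraphs are preserved
            elif word == "tab":
                out.append("\t")
            elif word == "uc" and arg:
                uc = int(arg)
            elif word == "u" and arg:
                code = int(arg)
                if code < 0:             # RTF stores \u as signed 16-bit
                    code += 0x10000
                out.append(chr(code))
                skip = uc                # discard the "?" fallback
        elif hexval:
            if skip:
                skip -= 1
            elif not ignorable:
                out.append(bytes.fromhex(hexval).decode("cp1252", "replace"))
        elif char:
            if skip:
                skip -= 1
            elif not ignorable:
                out.append(char)
    return "".join(out)
```

For example, the input {\rtf1\ansi{\fonttbl{\f0 Arial;}}\f0 Caf\'e9 \u8364?5\par Done} yields "Café €5" and "Done" on separate lines: the font table is dropped, both escape forms are decoded, and \par becomes a newline.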

Empty PDFs

Many EHR PDFs are scanned images embedded as JPEG/JPEG2000 streams with no text layer. pypdf cannot recover text from those; they need a separate OCR pass. The dashboard has a "Hide empty PDFs" toggle (default on) that suppresses these so users can focus on PDFs that actually contain extractable text.
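
The toggle amounts to filtering on the placeholder that the PDF extractor writes into plain_text. A minimal sketch, assuming document records are dictionaries with a "format" key alongside plain_text; the function name visible_documents and the record shape are hypothetical.

```python
EMPTY_PDF_PLACEHOLDER = "[PDF - no extractable text]"


def visible_documents(documents, hide_empty_pdfs=True):
    """Return the records to display, optionally hiding image-only PDFs."""
    if not hide_empty_pdfs:
        return list(documents)
    return [
        doc for doc in documents
        # Only PDFs carrying the placeholder are hidden; other formats
        # with empty text remain visible.
        if not (
            doc.get("format") == "pdf"
            and doc.get("plain_text") == EMPTY_PDF_PLACEHOLDER
        )
    ]
```

Matching on both the format and the placeholder keeps a non-PDF document visible even if its text happens to be empty.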