Stage 2 -- Format detection

Purpose

Classify each reassembled document by format so the right parser can run on it. EHR exports routinely mix CCDA XML, RTF clinical notes, HTML fragments, and PDF scans under a single "document" abstraction.

Detection logic

`detect_format(data)` returns one of:

| Value | Detected via |
| --- | --- |
| `pdf` | first 4 bytes are `%PDF` |
| `rtf` | content starts with `{\rtf` after whitespace strip |
| `ccda_xml` | starts with `<?xml` or contains `<ClinicalDocument` in the first 512 bytes |
| `html_fragment` | contains `<html>`, `<body>`, `<div>`, `<p>`, `<span>`, `<br>`, or `<table>` tag in the first 512 bytes (case-insensitive) |
| `unknown` | none of the above |

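The first-512-byte window is sufficient: clinical documents have predictable preambles. The check order matters -- PDF first (binary signature), then RTF, then strict XML, then forgiving HTML. The HTML check runs last because its tag list (for example `<table>` and `<br>`) can also match XML content, so a document that passes the XML check must never reach it.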
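The checks above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the regex, the 512-byte slicing, and the choice to let HTML tags carry attributes are assumptions made for the example.

```python
import re

# Assumption: tags may carry attributes (e.g. <div class="x">), so the
# pattern allows anything up to the closing ">".
_HTML_TAG_RE = re.compile(
    rb"<(?:html|body|div|p|span|br|table)\b[^>]*>", re.IGNORECASE
)


def detect_format(data: bytes) -> str:
    """Classify raw document bytes by looking only at the leading bytes."""
    # 1. PDF: fixed binary signature in the first 4 bytes.
    if data[:4] == b"%PDF":
        return "pdf"

    head = data[:512]

    # 2. RTF: "{\rtf" once leading whitespace is stripped.
    if head.lstrip().startswith(b"{\\rtf"):
        return "rtf"

    # 3. Strict XML: XML declaration, or a CCDA root element in the window.
    if head.startswith(b"<?xml") or b"<ClinicalDocument" in head:
        return "ccda_xml"

    # 4. Forgiving HTML: any common tag in the window, case-insensitive.
    if _HTML_TAG_RE.search(head):
        return "html_fragment"

    return "unknown"
```

Because the XML check runs before the HTML check, an input such as `b"<?xml version='1.0'?><html>..."` is classified as `ccda_xml` in this sketch, which shows why the order of the checks changes the result.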

Per-format text extraction

After detection, each document goes through a format-specific extractor:

| Format | Extractor | Notes |
| --- | --- | --- |
| `pdf` | `extract_pdf_text` (uses `pypdf`) | Returns extracted text. Image-only scans return `[PDF - no extractable text]` -- they need OCR for content. |
| `rtf` | `strip_rtf` | Hand-written RTF stripper. Handles `\'XX` hex escapes, `\uNNNN?` Unicode escapes, removes font/color/style/list tables, preserves text. |
| `html_fragment` | `strip_html` | Removes `<script>` and `<style>` blocks, strips remaining tags, decodes HTML entities. |
| `ccda_xml` | full structured parse | See Stage 3 -- structured records are extracted in addition to narrative text. |
| `unknown` | UTF-8 decode best-effort | Treated as plain text. |
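As an illustration of the three steps listed for `strip_html`, here is a regex-based sketch. It assumes the input has already been decoded to `str`, and the whitespace collapsing at the end is an addition for readability; the real extractor may differ on both points.

```python
import html
import re

# <script>...</script> and <style>...</style>, including their contents.
_SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
# Any remaining tag.
_TAG_RE = re.compile(r"<[^>]+>")


def strip_html(fragment: str) -> str:
    """Reduce an HTML fragment to plain text."""
    text = _SCRIPT_STYLE_RE.sub(" ", fragment)  # 1. drop script/style blocks
    text = _TAG_RE.sub(" ", text)               # 2. strip remaining tags
    text = html.unescape(text)                  # 3. decode HTML entities
    return re.sub(r"\s+", " ", text).strip()    # assumption: collapse whitespace


print(strip_html("<div><script>alert(1)</script><p>BP 120/80 &amp; stable</p></div>"))
# BP 120/80 & stable
```

Entities are decoded after tag stripping so that an escaped `&lt;` in the note text is not mistaken for the start of a tag.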

Every document, regardless of format, ends up with a `plain_text` field on its document record. This is what powers the dashboard's keyword search.
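One way to picture that guarantee is a dispatcher that falls back to a best-effort UTF-8 decode when no extractor is registered for the format. The names `build_document_record`, `decode_best_effort`, and `keyword_search`, the `format` key, and the `errors="replace"` policy are all hypothetical and used only for this sketch.

```python
from typing import Callable, Dict, List


def decode_best_effort(data: bytes) -> str:
    """Fallback for `unknown`: decode as UTF-8, replacing undecodable bytes."""
    return data.decode("utf-8", errors="replace")


def build_document_record(
    data: bytes, fmt: str, extractors: Dict[str, Callable[[bytes], str]]
) -> dict:
    """Run the extractor for `fmt` so every record carries `plain_text`."""
    extractor = extractors.get(fmt, decode_best_effort)
    return {"format": fmt, "plain_text": extractor(data)}


def keyword_search(records: List[dict], keyword: str) -> List[dict]:
    """Case-insensitive substring match over the `plain_text` field."""
    needle = keyword.casefold()
    return [r for r in records if needle in r["plain_text"].casefold()]
```

Since search only ever reads `plain_text`, it does not need to know which extractor produced the text.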

Why a hand-written RTF stripper

The standard Python ecosystem doesn't ship a maintained pure-Python RTF parser. Available libraries either require system dependencies (e.g. LibreOffice) or fail on the non-standard RTF that EHR vendors emit. The hand-written stripper is ~60 lines, has no dependencies, and is tolerant of malformed groups.

The trade-off is that table layout is lost -- output is plain prose with paragraphs preserved. For keyword search and clinical review, that's sufficient.
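To make the approach concrete, here is a compact tokenizer-style sketch of such a stripper. It is not the project's `strip_rtf`: the list of skipped destinations, the cp1252 decoding of `\'XX` escapes, and the fixed single fallback character after `\uNNNN` are assumptions chosen for the example.

```python
import re

_TOKEN_RE = re.compile(
    r"\\([a-z]{1,32})(-?\d{1,10})?[ ]?"  # control word, optional numeric arg
    r"|\\'([0-9a-fA-F]{2})"              # \'XX hex escape
    r"|\\([^a-z])"                       # control symbol such as \* or \{
    r"|([{}])"                           # group delimiters
    r"|[\r\n]+"                          # raw line breaks carry no text
    r"|(.)",                             # literal character
    re.DOTALL,
)

# Assumption: destinations whose contents are never document text.
_SKIP_DESTINATIONS = {
    "fonttbl", "colortbl", "stylesheet", "listtable",
    "listoverridetable", "info", "pict",
}


def strip_rtf(rtf: str) -> str:
    out = []
    stack = []        # saved "skipping" flag for each open group
    skipping = False  # True while inside a table or ignorable destination
    fallback = 0      # fallback characters still to drop after \uNNNN
    for match in _TOKEN_RE.finditer(rtf):
        word, arg, hexval, symbol, brace, char = match.groups()
        if brace == "{":
            stack.append(skipping)
        elif brace == "}":
            if stack:  # a stray "}" is ignored instead of raising
                skipping = stack.pop()
        elif skipping:
            continue
        elif word is not None:
            if word in _SKIP_DESTINATIONS:
                skipping = True
            elif word in ("par", "line"):
                out.append("\n")
            elif word == "tab":
                out.append("\t")
            elif word == "u" and arg is not None:
                code = int(arg)  # RTF stores code points as signed 16-bit
                out.append(chr(code + 65536 if code < 0 else code))
                fallback = 1
        elif symbol is not None:
            if symbol == "*":
                skipping = True  # {\*\dest ...} marks an ignorable group
            elif symbol in "{}\\":
                out.append(symbol)
        elif fallback and (hexval is not None or char is not None):
            fallback -= 1  # drop the "?" that follows \uNNNN
        elif hexval is not None:
            out.append(bytes.fromhex(hexval).decode("cp1252", errors="replace"))
        elif char is not None:
            out.append(char)
    return "".join(out).strip()
```

The group stack is what gives the tolerance for malformed input: an extra closing brace finds the stack empty and is skipped, and a missing one simply leaves unused entries behind when the loop ends.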

Empty PDFs

Many EHR PDFs are scanned images embedded as JPEG/JPEG2000 streams with no text layer. `pypdf` cannot recover text from those without an OCR pass. The dashboard has a "Hide empty PDFs" toggle (default on) that suppresses these so users can focus on PDFs that actually contain extractable text.
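The placeholder and the toggle can be sketched together. `PdfReader`, `reader.pages`, and `page.extract_text()` are real `pypdf` calls; everything else (the helper names, the `format` key on the record, and the page-joining policy) is an assumption made for illustration.

```python
import io
from typing import Iterable, List, Optional

NO_TEXT_PLACEHOLDER = "[PDF - no extractable text]"


def join_page_text(page_texts: Iterable[Optional[str]]) -> str:
    """Join non-empty page texts, or return the placeholder if none exist."""
    parts = [t.strip() for t in page_texts if t and t.strip()]
    return "\n\n".join(parts) if parts else NO_TEXT_PLACEHOLDER


def extract_pdf_text(data: bytes) -> str:
    # Imported lazily so the helpers above work without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(io.BytesIO(data))
    return join_page_text(page.extract_text() for page in reader.pages)


def visible_documents(records: List[dict], hide_empty_pdfs: bool = True) -> List[dict]:
    """Apply the 'Hide empty PDFs' toggle (on by default)."""
    if not hide_empty_pdfs:
        return list(records)
    return [
        r for r in records
        if not (r["format"] == "pdf" and r["plain_text"] == NO_TEXT_PLACEHOLDER)
    ]
```

Comparing against the placeholder keeps the toggle cheap: the filter reuses the extraction result and never reopens the PDF.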