Troubleshooting

"No patients in output"¶

Check, in this order:

1. Did Stage 1 reassembly write files? Look in `<work_dir>/ccda_assembled/` and `<work_dir>/fhir_assembled/`. If they are empty, the input CSVs are missing or empty.
2. Did Stage 3 find any FHIR Patient resources? The log line `Patients discovered (pre-mapping fill): N` tells you.
3. If FHIR has 0 patients but CCDAs exist, the patients are coming from the CCDA `<recordTarget>` only. Make sure your CCDAs have well-formed `recordTarget/patientRole/patient/name` blocks.
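
The first check can be scripted. This is a minimal sketch, assuming only the two directory names given above; `count_assembled_files` is a hypothetical helper, not part of the pipeline.

```python
from pathlib import Path


def count_assembled_files(work_dir):
    """Return {subdirectory name: number of files} for the Stage 1 outputs."""
    counts = {}
    for sub in ("ccda_assembled", "fhir_assembled"):
        directory = Path(work_dir) / sub
        if directory.is_dir():
            counts[sub] = sum(1 for p in directory.iterdir() if p.is_file())
        else:
            # A missing directory is reported as 0 files rather than raising.
            counts[sub] = 0
    return counts
```

Call it as `count_assembled_files("/path/to/work_dir")`. A count of 0 for both directories points back at the input CSVs.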

"Patient appears with name `[Smith]` (with brackets)"¶

The source data contains Python `repr()` leftovers such as `['Smith']` or `b'Smith'`. The `clean_name` utility strips these, but only after the upstream parser has put them on a record. If you are still seeing them, the source column is being serialized in Python's `repr()` form rather than as the underlying string. Fix: change the warehouse export to write `first_name` as a plain string rather than as the representation of a list or bytes value.
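
To confirm the problem is in the export rather than in the pipeline, you can scan the source column for values that look like `repr()` output. The sketch below is an illustration only: `looks_like_repr` and `count_repr_values` are hypothetical helpers (this is not the pipeline's `clean_name`), and the CSV path and column name are whatever your export uses.

```python
import csv
import re

# Matches strings shaped like the repr() of a list or bytes value,
# e.g. "['Smith']", "[Smith]", or "b'Smith'".
_REPR_LEFTOVER = re.compile(r"""^\s*(\[.*\]|b(['"]).*\2)\s*$""")


def looks_like_repr(value):
    """True if value is a string that looks like list/bytes repr() output."""
    return isinstance(value, str) and bool(_REPR_LEFTOVER.match(value))


def count_repr_values(csv_path, column="first_name"):
    """Count rows in a CSV whose `column` value looks like repr() output."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return sum(1 for row in csv.DictReader(f)
                   if looks_like_repr(row.get(column)))
```

A non-zero count on the raw export means the fix belongs in the warehouse query, not in the pipeline.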

"Medications all show generic codes, not RxNorm"¶

Stage 3 prefers RxNorm by default, but if the source CodeableConcept contains only NDC codes (or only manufacturer codes), there's nothing to prefer. Check `all_codings` on a sample medication record -- if RxNorm isn't in the list, the warehouse export is missing it.
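
A quick way to run that check is sketched below. It assumes `all_codings` is a list of FHIR-style coding dicts with a `system` key, which is a guess about the record shape; `coding_systems` and `has_rxnorm` are hypothetical helpers. The RxNorm system URI is the standard one used in FHIR.

```python
# Standard FHIR code system URI for RxNorm.
RXNORM_SYSTEM = "http://www.nlm.nih.gov/research/umls/rxnorm"


def coding_systems(record):
    """Return the set of code systems present in a record's all_codings.

    ASSUMPTION: all_codings is a list of dicts that carry a 'system' key.
    """
    codings = record.get("all_codings") or []
    return {c.get("system") for c in codings if isinstance(c, dict)}


def has_rxnorm(record):
    """True if at least one coding on the record uses the RxNorm system."""
    return RXNORM_SYSTEM in coding_systems(record)
```

Printing `coding_systems(record)` for a few medications shows at a glance which systems the export actually carries.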

"Documents show `(Unlinked)` in the dashboard"¶

A document's `patient_id` is null. Possible causes, in order of likelihood:

1. The document UUID isn't in `uuid_mapping.csv`. Check the mapping coverage: the line count from `wc -l uuid_mapping.csv` should be close to the number of documents you exported.
2. The document is a CCDA but its `<recordTarget>` is missing or malformed.
3. The MRN-based fallback didn't find a match. Print a sample CCDA's MRN and a sample FHIR Patient's MRN -- they need to normalize to the same string (alphanumeric characters only, uppercased).
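
For the MRN comparison, the normalization described above can be approximated as follows. This is a sketch of "alphanumeric uppercase", not the pipeline's own function; `normalize_mrn` is a hypothetical name.

```python
import re


def normalize_mrn(mrn):
    """Drop everything except ASCII letters and digits, then uppercase."""
    return re.sub(r"[^A-Za-z0-9]", "", str(mrn or "")).upper()
```

If `normalize_mrn(ccda_mrn) != normalize_mrn(fhir_mrn)` for a pair you expect to match, the two sources are carrying different identifiers (for example, one has a facility prefix), and the fallback cannot link them.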

"All PDFs say `[PDF - no extractable text]`"¶

Most likely the PDFs are image-only scans (no embedded text layer) and need OCR. Confirm by opening one in a PDF viewer and trying to copy text -- if it doesn't select, the PDF has no text to extract.

If some PDFs do show selectable text in a viewer but the pipeline reports them as empty, check that the input bytes are intact -- a chunked CSV that was re-encoded to UTF-8 anywhere in the pipeline (warehouse, transfer, storage) will lose the PDF's binary structure.
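
A rough integrity check on the reassembled bytes is sketched below. It is a heuristic, not a PDF validator: it looks for the `%PDF-` header, the `%%EOF` trailer, and the UTF-8 encoding of U+FFFD, which is what invalid bytes turn into when binary data is decoded with replacement and written back out. `pdf_byte_report` is a hypothetical helper.

```python
def pdf_byte_report(data):
    """Return simple integrity signals for a PDF's raw bytes."""
    return {
        # Readers tolerate a little leading junk, so search the first 1 KiB.
        "has_header": b"%PDF-" in data[:1024],
        # The trailer marker normally sits near the end of the file.
        "has_eof": b"%%EOF" in data[-1024:],
        # U+FFFD as UTF-8. A stray hit can occur by chance inside a
        # compressed stream; hundreds of hits indicate a lossy re-encode.
        "replacement_chars": data.count(b"\xef\xbf\xbd"),
    }
```

Run it on the bytes of one affected file, for example `pdf_byte_report(open(path, "rb").read())`. A missing header or trailer suggests truncation or a chunk-ordering problem; a large `replacement_chars` count suggests a re-encode.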

Excel export fails on large bundles

openpyxl is slow and memory-hungry for very large workbooks (50+ MB). Workarounds:

- Use the per-tab CSV exports instead.
- Filter the bundle to a smaller cohort before exporting.
- Use a streaming Excel writer (`xlsxwriter` with its `constant_memory` option enabled) -- requires modifying `write_outputs`.
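
Filtering to a cohort can be done on the JSON bundle before it reaches the Excel step. The sketch below assumes each entry in the filtered collections carries a `patient_id` key (true for documents in the snippets on this page, assumed for patients), and `filter_bundle` is a hypothetical helper. Pass any additional per-patient collections your bundle has via `keys`.

```python
def filter_bundle(bundle, keep_ids, keys=("patients", "documents")):
    """Return a copy of the bundle restricted to the given patient IDs.

    ASSUMPTION: every entry under each key in `keys` is a dict with a
    'patient_id' field. Other top-level keys are copied through unchanged.
    """
    keep = set(keep_ids)
    out = dict(bundle)  # shallow copy; the input bundle is not modified
    for key in keys:
        out[key] = [item for item in bundle.get(key, [])
                    if item.get("patient_id") in keep]
    return out
```

Write the result back out with `json.dump` and point the export at the smaller file.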

"Dedupe merged patients I didn't want merged"¶

The dedupe key is `(lower(first_name), lower(last_name), dob)`. Two distinct people with identical names and DOB will merge incorrectly. To disable dedupe, comment out the `_dedupe_patients_by_name_dob` call in `main()`. To use a different key (e.g. one that includes MRN), edit the function -- it's 20 lines.
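
As an illustration of what a stricter key could look like, the sketch below builds the key described above and optionally appends the MRN. It is not the body of `_dedupe_patients_by_name_dob`; `dedupe_key` and the `mrn` field name are assumptions made for the example.

```python
def dedupe_key(patient, include_mrn=False):
    """Build a dedupe key from name and DOB, optionally adding the MRN."""
    key = (
        (patient.get("first_name") or "").lower(),
        (patient.get("last_name") or "").lower(),
        patient.get("dob"),
    )
    if include_mrn:
        # ASSUMPTION: the patient record stores its MRN under 'mrn'.
        key += ((patient.get("mrn") or "").upper(),)
    return key
```

With `include_mrn=True`, two people who share a name and DOB but have different MRNs get different keys and stay separate. The trade-off is that the same person recorded under two MRNs will no longer merge.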

"Stage 6 removed 0 test patients but I added rules"¶

Check the log line `Total active exclusion rules: N`. If N is 0:

- `test_patients.txt` wasn't found at `<work_dir>/`.
- All lines in the file are comments or empty.
- `HARDCODED_TEST_EXCLUSIONS` is empty.

If N > 0 but the log shows `Removed 0 test patients`, the rules don't match any patient. Check the rule syntax:

- `name:` requires an exact full-name match (case-insensitive).
- `name_contains:` is a substring match (case-insensitive).
- `mrn_contains:` checks the MRN field, not the `patient_id` UUID.
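
To test a rule against a known patient without rerunning Stage 6, the matching semantics above can be mimicked as below. This is a sketch, not the pipeline's matcher, and it makes several assumptions: comment lines start with `#`, a full name is `first_name` and `last_name` joined by a space, the MRN lives under `mrn`, and `mrn_contains:` is compared case-insensitively. `parse_rules` and `rule_matches` are hypothetical names.

```python
RULE_KINDS = ("name", "name_contains", "mrn_contains")


def parse_rules(lines):
    """Parse 'kind: value' lines, skipping blanks and '#' comments."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # ASSUMPTION: '#' marks comments
            continue
        kind, sep, value = line.partition(":")
        kind, value = kind.strip(), value.strip().lower()
        if sep and kind in RULE_KINDS and value:
            rules.append((kind, value))
    return rules


def rule_matches(rule, patient):
    """Apply one (kind, value) rule to a patient dict."""
    kind, value = rule
    first = patient.get("first_name") or ""
    last = patient.get("last_name") or ""
    full_name = f"{first} {last}".strip().lower()
    mrn = (patient.get("mrn") or "").lower()
    if kind == "name":
        return full_name == value  # exact full-name match
    if kind == "name_contains":
        return value in full_name  # substring match
    if kind == "mrn_contains":
        return value in mrn  # MRN field only, never patient_id
    return False
```

If `parse_rules` returns fewer rules than you wrote, the dropped lines have a misspelled prefix or an empty value. If a rule parses but `rule_matches` is false for a patient you expected to remove, compare the rule value against the patient's actual name or MRN rather than its `patient_id`.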

Diagnostic snippets

Print PDF extraction rate

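import json

# Load the dashboard bundle written by the pipeline.
with open('dashboard_data.json', encoding='utf-8') as f:
    b = json.load(f)

# A PDF counts as "with text" when plain_text is non-empty and is not the
# "[PDF - ..." placeholder.
pdfs = [d for d in b['documents'] if d.get('source_format') == 'pdf']
has_text = sum(1 for d in pdfs
               if d.get('plain_text') and not d['plain_text'].startswith('[PDF -'))
print(f"PDFs with text: {has_text} / {len(pdfs)}")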

Find unlinked documents

# Reuses `b`, the bundle loaded in the previous snippet.
unlinked = [d for d in b['documents'] if not d.get('patient_id')]
print(f"Unlinked: {len(unlinked)}")
for d in unlinked[:5]:  # first five as a sample
    print(f"  {d.get('source_format')}: {d.get('source_file')}")

Audit dedupe

Compare `dashboard_data.prefilter.json` (snapshot before Stage 6) to `dashboard_data.json`:

import json
with open('dashboard_data.prefilter.json', encoding='utf-8') as f:
    pre = json.load(f)
with open('dashboard_data.json', encoding='utf-8') as f:
    post = json.load(f)
print(f"Patients: {len(pre['patients'])} -> {len(post['patients'])}")