Troubleshooting¶
"No patients in output"¶
Check, in this order:
- Did Stage 1 reassembly write files? Look in <work_dir>/ccda_assembled/ and <work_dir>/fhir_assembled/. If empty, the input CSVs are missing or empty.
- Did Stage 3 find any FHIR Patient resources? The log line Patients discovered (pre-mapping fill): N tells you.
- If FHIR has 0 patients but CCDAs exist, the patients are coming from CCDA <recordTarget> only. Make sure your CCDAs have well-formed recordTarget/patientRole/patient/name blocks.
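The first check is easy to script. The following is a minimal sketch that assumes only the two directory names given above; the helper name count_assembled_files is hypothetical and not part of the pipeline.

```python
from pathlib import Path


def count_assembled_files(work_dir):
    """Count regular files under each Stage 1 output directory."""
    counts = {}
    for name in ("ccda_assembled", "fhir_assembled"):
        directory = Path(work_dir) / name
        if directory.is_dir():
            # rglob also picks up files in nested sub-directories.
            counts[name] = sum(1 for p in directory.rglob("*") if p.is_file())
        else:
            # A missing directory is reported the same way as an empty one.
            counts[name] = 0
    return counts
```

Calling count_assembled_files with your work directory returns a dict with one count per directory; a 0 for either entry points back at the input CSVs.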
"Patient appears with name [Smith] (with brackets)"¶
The source data has Python list-repr leftovers like ['Smith'] or
b'Smith'. The clean_name utility strips these, but only after the
upstream parser puts them on a record. If you're seeing them, the source
column is being serialized in Python's repr() form rather than the
underlying string. Fix: change the warehouse export to write
first_name (string) instead of first_name (list/bytes representation).
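To find which rows of an export are affected, something like the following can help. It is a sketch only and is not the pipeline's clean_name utility; the function name strip_repr_leftovers is hypothetical. It parses values that look like a Python list or bytes literal and leaves everything else untouched.

```python
import ast


def strip_repr_leftovers(value):
    """Return the plain string hidden inside a list or bytes repr, if any."""
    text = value.strip()
    # Only values that look like a list or bytes literal are parsed.
    if not text.startswith(("[", "b'", 'b"')):
        return text
    try:
        parsed = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Not a valid literal (for example "[Smith]"); leave it unchanged.
        return text
    if isinstance(parsed, (list, tuple)):
        parsed = parsed[0] if parsed else ""
    if isinstance(parsed, bytes):
        parsed = parsed.decode("utf-8", errors="replace")
    return str(parsed).strip()
```

Any value for which the result differs from the input was exported in repr() form and should be fixed at the warehouse.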
"Medications all show generic codes, not RxNorm"¶
Stage 3 prefers RxNorm by default, but if the source CodeableConcept
contains only NDC codes (or only manufacturer codes), there's nothing to
prefer. Check all_codings on a sample medication record -- if RxNorm
isn't in the list, the warehouse export is missing it.
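A quick way to inspect a sample record is to list the coding systems it carries. This sketch assumes all_codings is a list of dicts with a FHIR-style system key, which may not match the pipeline's actual record shape; the function name summarize_codings is hypothetical. The two system URIs are the standard FHIR identifiers for RxNorm and NDC.

```python
RXNORM_SYSTEM = "http://www.nlm.nih.gov/research/umls/rxnorm"
NDC_SYSTEM = "http://hl7.org/fhir/sid/ndc"


def summarize_codings(all_codings):
    """Report which coding systems are present and whether RxNorm is one of them."""
    # Assumption: each coding is a dict with a "system" key.
    systems = sorted({c.get("system", "") for c in all_codings})
    return {"has_rxnorm": RXNORM_SYSTEM in systems, "systems": systems}
```

If has_rxnorm is False for every sampled medication, the warehouse export is the place to fix it.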
"Documents show (Unlinked) in the dashboard"¶
A document's patient_id is null. Possible causes, in order of
likelihood:
- The document UUID isn't in uuid_mapping.csv. Check the mapping coverage: wc -l uuid_mapping.csv should be close to the number of documents you exported.
- The document is a CCDA but its <recordTarget> is missing or malformed.
- The MRN-based fallback didn't find a match. Print a sample CCDA's MRN and a sample FHIR Patient's MRN -- they need to normalize to the same string (alphanumeric uppercase).
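For the MRN comparison, the "alphanumeric uppercase" rule can be approximated as below. This is a sketch of that rule as described, not the pipeline's own normalizer, and the name normalize_mrn is hypothetical.

```python
import re


def normalize_mrn(raw):
    """Keep only ASCII letters and digits, then uppercase the result."""
    return re.sub(r"[^A-Za-z0-9]", "", raw or "").upper()
```

Note that this sketch keeps leading zeros, so 00123 and 123 still normalize to different strings; if the two sources differ in that way, the fallback will not match them.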
"All PDFs say [PDF - no extractable text]"¶
Most likely the PDFs are image-only scans (no embedded text layer) and need OCR. Confirm by opening one in a PDF viewer and trying to copy text -- if it doesn't select, the PDF has no text to extract.
If some PDFs DO have text in a viewer but the pipeline shows them as empty, check the input bytes are intact -- a chunked CSV that was re-encoded to UTF-8 anywhere in the pipeline (warehouse, transfer, storage) will lose the PDF's binary structure.
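A rough integrity check on the raw bytes can narrow this down. The sketch below is a heuristic under stated assumptions, not a PDF validator: it looks for the %PDF- header, the %%EOF trailer, and the byte sequence that UTF-8 produces for the U+FFFD replacement character, which often appears after a lossy re-encoding. The function name pdf_byte_problems is hypothetical.

```python
def pdf_byte_problems(data):
    """Return a list of reasons the bytes may not be an intact PDF."""
    problems = []
    # Lenient readers accept the header anywhere in the first 1024 bytes.
    if b"%PDF-" not in data[:1024]:
        problems.append("missing %PDF- header")
    if b"%%EOF" not in data[-1024:]:
        problems.append("missing %%EOF trailer")
    # EF BF BD is U+FFFD in UTF-8. It can occur by chance in compressed
    # streams, so treat this as a hint rather than proof.
    if b"\xef\xbf\xbd" in data:
        problems.append("contains U+FFFD bytes (possible lossy UTF-8 round trip)")
    return problems
```

An empty list means none of the three warning signs were found; it does not guarantee that the file is valid.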
Excel export fails on large bundles¶
openpyxl is slow and memory-hungry for very large workbooks (50+ MB).
Workarounds:
- Use the per-tab CSV exports instead.
- Filter the bundle to a smaller cohort before exporting.
- Use a streaming Excel writer (xlsxwriter with constant_memory=True) -- requires modifying write_outputs.
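The cohort-filtering workaround could look roughly like this. It is a sketch that assumes the bundle is a dict whose top-level lists hold records carrying a patient_id field, as the documents list does; other parts of the real bundle may be keyed differently. The function name filter_bundle is hypothetical.

```python
def filter_bundle(bundle, keep_patient_ids, id_field="patient_id"):
    """Return a copy of the bundle restricted to the given patient IDs."""
    keep = set(keep_patient_ids)
    filtered = {}
    for key, value in bundle.items():
        is_record_list = isinstance(value, list) and all(
            isinstance(r, dict) for r in value
        )
        if is_record_list and any(id_field in r for r in value):
            # Records with a missing or null ID are dropped as well.
            filtered[key] = [r for r in value if r.get(id_field) in keep]
        else:
            # Anything that does not look like patient-keyed records is kept.
            filtered[key] = value
    return filtered
```

Write the filtered dict back out with json.dump and run the Excel export against that smaller file.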
"Dedupe merged patients I didn't want merged"¶
The dedupe key is (lower(first_name), lower(last_name), dob). Two
distinct people with identical names and DOB will merge incorrectly. To
disable: comment out the _dedupe_patients_by_name_dob call in main().
To use a different key (e.g. include MRN), edit the function -- it's
20 lines.
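Before changing the key, it can help to see which records collide on it. The sketch below rebuilds the key as described above and reports groups whose members carry more than one MRN. The field names last_name, dob, and mrn are assumptions about the patient record, and neither function is the pipeline's _dedupe_patients_by_name_dob.

```python
from collections import defaultdict


def dedupe_key(patient):
    """Build the (lower(first_name), lower(last_name), dob) key."""
    return (
        (patient.get("first_name") or "").lower(),
        (patient.get("last_name") or "").lower(),
        patient.get("dob"),
    )


def suspicious_merges(patients):
    """Group patients by dedupe key and keep groups with differing MRNs."""
    groups = defaultdict(list)
    for patient in patients:
        groups[dedupe_key(patient)].append(patient)
    return {
        key: members
        for key, members in groups.items()
        if len({m.get("mrn") for m in members}) > 1
    }
```

Run it on patient records taken before the dedupe step; every group it returns is a candidate for an incorrect merge.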
"Stage 6 removed 0 test patients but I added rules"¶
Check the log line Total active exclusion rules: N. If 0:
- test_patients.txt wasn't found at <work_dir>/.
- All lines in the file are comments or empty.
- HARDCODED_TEST_EXCLUSIONS is empty.
If N > 0 but Removed 0 test patients, the rules don't match any
patient. Check the rule syntax:
- name: requires an exact full-name match. name_contains: is a substring match.
- Both are case-insensitive.
- mrn_contains: checks the MRN field, not the patient_id UUID.
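To test a rule against a known patient outside the pipeline, the matching semantics above can be sketched as follows. This is an illustration, not the Stage 6 implementation: the function name rule_matches, the "first last" full-name format, the last_name and mrn field names, and the case-insensitive MRN comparison are all assumptions.

```python
def rule_matches(rule, patient):
    """Check one 'kind:value' exclusion rule against a patient record."""
    kind, _, value = rule.partition(":")
    kind = kind.strip()
    value = value.strip().lower()
    if not value:
        return False
    # Assumption: the full name is "first last".
    first = patient.get("first_name") or ""
    last = patient.get("last_name") or ""
    full_name = f"{first} {last}".strip().lower()
    # Assumption: MRN matching ignores case, like the name rules.
    mrn = (patient.get("mrn") or "").lower()
    if kind == "name":
        return full_name == value
    if kind == "name_contains":
        return value in full_name
    if kind == "mrn_contains":
        return value in mrn
    return False
```

If a rule you expect to match returns False here, compare the rule text with the stored name or MRN character by character, including whitespace.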
Diagnostic snippets¶
Print PDF extraction rate¶
import json

# Load the bundle once; later snippets reuse b.
with open('dashboard_data.json') as f:
    b = json.load(f)

pdfs = [d for d in b['documents'] if d.get('source_format') == 'pdf']
# A PDF counts as extracted if it has text that is not the placeholder.
has_text = sum(
    1 for d in pdfs
    if d.get('plain_text') and not d['plain_text'].startswith('[PDF -')
)
print(f"PDFs with text: {has_text} / {len(pdfs)}")
Find unlinked documents¶
# Reuses b, the bundle loaded from dashboard_data.json in the previous snippet.
unlinked = [d for d in b['documents'] if not d.get('patient_id')]
print(f"Unlinked: {len(unlinked)}")
for d in unlinked[:5]:
    print(f"  {d['source_format']}: {d['source_file']}")
Audit dedupe¶
Compare dashboard_data.prefilter.json (snapshot before Stage 6) to
dashboard_data.json: