Stage 1 - Folder discovery
The first stage walks the input and produces an in-memory list of (path, raw_bytes) tuples - one per XML file.
Inputs accepted
- A path to a folder. Walked recursively; subfolders are followed.
- A path to a .zip file. Unzipped in memory; nested folders preserved as path prefixes.
- A mix of both (set both INPUT_FOLDER and INPUT_ZIP if you have data in two places - the parser concatenates).
What it does
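import os
import zipfile

def collect_xml_files(folder='', zip_path=''):
    """Return a list of (path, raw_bytes) tuples, one per XML file found."""
    files_out = []
    if zip_path:
        with zipfile.ZipFile(zip_path) as z:
            for name in z.namelist():
                # Skip Finder's __MACOSX/ resource-fork entries.
                if name.lower().endswith('.xml') and not name.startswith('__MACOSX'):
                    with z.open(name) as f:
                        files_out.append((name, f.read()))
    # Plain `if`, not `elif`: when both inputs are set, the results are concatenated.
    if folder:
        for root, _, names in os.walk(folder):
            for name in names:
                if name.lower().endswith('.xml'):
                    full = os.path.join(root, name)
                    # Store the path relative to the input folder.
                    rel = os.path.relpath(full, folder)
                    with open(full, 'rb') as f:
                        files_out.append((rel, f.read()))
    return files_out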
Design choices
Read bytes, not text. C-CDA files declare their encoding inside the XML prolog (encoding="UTF-8" typically, but Windows-1252 and UTF-16 happen). We hand raw bytes to xml.etree.ElementTree, which respects the declared encoding.
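To illustrate why bytes matter, here is a minimal sketch that encodes a small document as Windows-1252 and parses the raw bytes directly. The sample document and the variable names are made up for this example; they are not part of the notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a tiny document declared and encoded as Windows-1252.
text = '<?xml version="1.0" encoding="windows-1252"?><name>Jos\u00e9</name>'
raw = text.encode('cp1252')

# Parsing the bytes lets the parser honor the encoding named in the prolog.
root = ET.fromstring(raw)
parsed_name = root.text

# Decoding as UTF-8 up front would fail: a lone 0xE9 byte is not valid UTF-8.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

Had Stage 1 opened files in text mode with a fixed encoding, this document would either fail to read or come through with a corrupted name.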
Skip __MACOSX/. When a Mac user creates a zip via Finder, macOS adds a parallel __MACOSX/ tree of resource-fork files (for example __MACOSX/export/._summary.xml). Their names end in .xml, but their contents are binary metadata, not XML, so we skip them.
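The filter can be tried in isolation. The sketch below builds a small zip in memory that mimics a Finder-created archive and applies the same name test as collect_xml_files. All entry names and contents are invented for illustration.

```python
import io
import zipfile

# Build a small in-memory zip that mimics a Finder-created archive.
# Entry names and contents are made up for this example.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as z:
    z.writestr('export/summary.xml', b'<ClinicalDocument/>')
    z.writestr('__MACOSX/export/._summary.xml', b'\x00\x05\x16\x07binary')
    z.writestr('export/readme.txt', b'not xml')

# Apply the same name test that collect_xml_files uses.
buf.seek(0)
with zipfile.ZipFile(buf) as z:
    kept = [
        name for name in z.namelist()
        if name.lower().endswith('.xml') and not name.startswith('__MACOSX')
    ]
```

Only export/summary.xml survives: the resource-fork entry is dropped by the prefix test and readme.txt by the extension test.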
No filtering by filename pattern. Patient portals name files inconsistently: summary.xml, Continuity_of_Care_Document.xml, ccd-2024-06-15.xml, export_001.xml, GUIDs, timestamps. We accept any .xml extension and let Stage 2 reject documents whose root element isn't a ClinicalDocument.
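Stage 2 is not shown in this section, so the following is only a sketch of what a root-element check might look like. The helper name looks_like_ccda is hypothetical, and the code assumes the document uses the HL7 v3 namespace (urn:hl7-org:v3), as CDA documents normally do.

```python
import xml.etree.ElementTree as ET

# Assumption: the document root lives in the HL7 v3 namespace.
HL7_V3_NS = 'urn:hl7-org:v3'

def looks_like_ccda(raw_bytes):
    """Return True if the bytes parse as XML with a ClinicalDocument root."""
    try:
        root = ET.fromstring(raw_bytes)
    except ET.ParseError:
        return False
    # ElementTree writes namespaced tags as '{namespace}localname'.
    return root.tag == '{%s}ClinicalDocument' % HL7_V3_NS
```

With a check like this, a stray export_001.xml that turns out to be some other kind of XML is rejected by content, not by name.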
In-memory. Bytes are held in memory between Stage 1 and Stage 2. For a few hundred documents this is fine - typical C-CDAs are 20-200 KB each. For thousands, you'd want to stream, but Patient Edition isn't designed for that scale.
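For comparison, a streaming variant could yield one pair at a time so that only the current file's bytes are held in memory. This is a sketch, not something the notebook does; the name iter_xml_files is hypothetical and only the folder case is covered for brevity.

```python
import os

def iter_xml_files(folder):
    """Yield (relative_path, raw_bytes) one file at a time."""
    for root, _, names in os.walk(folder):
        for name in names:
            if name.lower().endswith('.xml'):
                full = os.path.join(root, name)
                with open(full, 'rb') as f:
                    # Each file's bytes can be released before the next is read.
                    yield os.path.relpath(full, folder), f.read()
```

A consumer would loop over iter_xml_files(folder) and process each document before the next one is read, instead of building the full list first.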
What happens if files are missing or unreadable
- Permission errors → Python raises PermissionError at the open() call. Notebook stops.
- Malformed zip → zipfile.BadZipFile. Notebook stops with a clear message.
- Zero XML files found → Stage 1 raises RuntimeError('No XML files found. Check INPUT_FOLDER or INPUT_ZIP above.').
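The zero-files case could be expressed as a small guard applied to the result of discovery. Where exactly the notebook performs this check is not shown here, so the helper name require_files is hypothetical; only the error message comes from the text above.

```python
def require_files(raw_files):
    """Raise if discovery found nothing; otherwise return the list unchanged."""
    if not raw_files:
        raise RuntimeError(
            'No XML files found. Check INPUT_FOLDER or INPUT_ZIP above.'
        )
    return raw_files
```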
Per-file parse failures aren't caught here - they happen in Stage 2 and are collected into an errors list rather than halting the run.
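A collect-and-continue loop of that kind might look like the sketch below. It is an assumption about the general shape of Stage 2, not its actual code; the function name parse_all and the (path, message) error format are hypothetical.

```python
import xml.etree.ElementTree as ET

def parse_all(raw_files):
    """Parse each (path, raw_bytes) pair; collect failures instead of raising."""
    parsed, errors = [], []
    for path, raw in raw_files:
        try:
            parsed.append((path, ET.fromstring(raw)))
        except ET.ParseError as exc:
            # Record the failure and move on to the next file.
            errors.append((path, str(exc)))
    return parsed, errors
```

One truncated or corrupt export then shows up as an entry in errors while the remaining documents are still processed.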
Output of this stage
A Python list:
raw_files = [
    ('Continuity_of_Care_2022.xml', b'<?xml version="1.0"...'),
    ('subfolder/CCDA_2023.xml', b'<?xml version="1.0"...'),
    ...
]
Stage 2 consumes this list one element at a time.