Stage 1 - Folder discovery

The first stage walks the input and produces an in-memory list of (path, raw_bytes) tuples - one per XML file.

Inputs accepted

A path to a folder. Walked recursively; subfolders are followed.
A path to a .zip file. Unzipped in memory; nested folders preserved as path prefixes.
A mix of both (set both INPUT_FOLDER and INPUT_ZIP if you have data in two places - the parser concatenates).

What it does

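import os
import zipfile

def collect_xml_files(folder='', zip_path=''):
    """Return a list of (path, raw_bytes) tuples, one per .xml file found."""
    files_out = []
    if zip_path:
        with zipfile.ZipFile(zip_path) as z:
            for name in z.namelist():
                # Skip the __MACOSX/ metadata tree that Finder adds to zips.
                if name.lower().endswith('.xml') and not name.startswith('__MACOSX'):
                    with z.open(name) as f:
                        files_out.append((name, f.read()))
    # Plain `if`, not `elif`: when both inputs are set, the results are concatenated.
    if folder:
        for root, _, names in os.walk(folder):
            for name in names:
                if name.lower().endswith('.xml'):
                    full = os.path.join(root, name)
                    rel = os.path.relpath(full, folder)  # path relative to the input folder
                    with open(full, 'rb') as f:  # binary mode keeps the raw bytes
                        files_out.append((rel, f.read()))
    return files_out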

Design choices

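Read bytes, not text. C-CDA files declare their encoding in the XML declaration at the top of the file (typically encoding="UTF-8", but Windows-1252 and UTF-16 happen). Decoding to text ourselves would mean guessing that encoding up front, so we hand raw bytes to xml.etree.ElementTree, which reads the declaration and decodes accordingly.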
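To illustrate the point, here is a minimal sketch that parses the same one-element document from raw bytes in three encodings. The sample document and the parse_name helper are made up for this example and are not part of the pipeline.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one tiny document, serialized in three encodings.
TEMPLATE = '<?xml version="1.0" encoding="{enc}"?><name>Zoë</name>'

def parse_name(raw_bytes):
    """Parse raw bytes and return the text of the root element."""
    return ET.fromstring(raw_bytes).text

# Encode the text with the same encoding that the declaration names.
samples = {
    enc: TEMPLATE.format(enc=enc).encode(enc)
    for enc in ("UTF-8", "windows-1252", "UTF-16")
}

# The parser reads the declaration in each byte string and decodes accordingly.
decoded = {enc: parse_name(raw) for enc, raw in samples.items()}
```

The byte strings differ, but each one should decode to the same text, with no encoding argument passed by the caller.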

Skip __MACOSX/. When a Mac user creates a zip via Finder, macOS adds a parallel __MACOSX/ tree of AppleDouble metadata files such as __MACOSX/._summary.xml. Their names end in .xml, but their contents are binary metadata, not XML, so we skip them.
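The effect of the filter can be seen on a small zip built in memory. This is a sketch only: the xml_names helper and the archive contents are invented for the example, and the filter condition is the one used in collect_xml_files.

```python
import io
import zipfile

def xml_names(zip_bytes):
    """Return the .xml entry names that the Stage 1 filter would keep."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as z:
        return [
            n for n in z.namelist()
            if n.lower().endswith('.xml') and not n.startswith('__MACOSX')
        ]

# Build a zip shaped like a Finder export (hypothetical contents).
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as z:
    z.writestr('export/summary.xml', b'<ClinicalDocument/>')
    # Placeholder bytes standing in for binary AppleDouble metadata.
    z.writestr('__MACOSX/export/._summary.xml', b'\x00\x05\x16\x07')
    z.writestr('export/readme.txt', b'not xml')

kept = xml_names(buf.getvalue())
```

Only export/summary.xml survives: the metadata entry is dropped by the prefix check and the text file by the extension check.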

No filtering by filename pattern. Patient portals name files inconsistently: summary.xml, Continuity_of_Care_Document.xml, ccd-2024-06-15.xml, export_001.xml, GUIDs, timestamps. We accept any .xml extension and let Stage 2 reject documents whose root element isn't a ClinicalDocument.
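For context, a root-element check of the kind Stage 2 performs might look like the sketch below. The is_clinical_document name is hypothetical, and we assume here that the check compares against the HL7 v3 namespace (urn:hl7-org:v3) that CDA documents use; the actual Stage 2 logic may differ.

```python
import xml.etree.ElementTree as ET

# ElementTree reports namespaced tags as '{namespace}localname'.
CDA_ROOT = '{urn:hl7-org:v3}ClinicalDocument'

def is_clinical_document(raw_bytes):
    """Return True if the bytes parse as XML with a ClinicalDocument root."""
    try:
        root = ET.fromstring(raw_bytes)
    except ET.ParseError:
        return False  # not well-formed XML at all
    return root.tag == CDA_ROOT

ccda = b'<?xml version="1.0"?><ClinicalDocument xmlns="urn:hl7-org:v3"/>'
other = b'<?xml version="1.0"?><invoice/>'
```

With a check like this, an oddly named file such as export_001.xml is accepted as long as its content is a C-CDA, and a well-named file with the wrong root is rejected.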

In-memory. Bytes are held in memory between Stage 1 and Stage 2. For a few hundred documents this is fine: typical C-CDAs are 20-200 KB each, so 500 documents at the upper end come to about 100 MB. For thousands, you'd want to stream, but Patient Edition isn't designed for that scale.
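If streaming were ever needed, one option would be a generator that yields one file at a time instead of building the whole list. The iter_xml_files function below is a hypothetical sketch for the folder case only, not something Patient Edition ships.

```python
import os
import tempfile

def iter_xml_files(folder):
    """Yield (relative_path, raw_bytes) one file at a time."""
    for root, _, names in os.walk(folder):
        for name in sorted(names):
            if name.lower().endswith('.xml'):
                full = os.path.join(root, name)
                with open(full, 'rb') as f:
                    # Only this file's bytes are held while the caller works on it.
                    yield os.path.relpath(full, folder), f.read()

# Demo on a throwaway folder with two XML files and one non-XML file.
with tempfile.TemporaryDirectory() as tmp:
    for fname in ('a.xml', 'b.XML', 'notes.txt'):
        with open(os.path.join(tmp, fname), 'wb') as f:
            f.write(b'<ClinicalDocument/>')
    sizes = {rel: len(raw) for rel, raw in iter_xml_files(tmp)}
```

The trade-off is that a generator can be consumed only once, so a later stage that needs a second pass would have to walk the folder again.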

What happens if files are missing or unreadable

Permission errors → Python raises PermissionError at the open() call. Notebook stops.
Malformed zip → zipfile.BadZipFile. Notebook stops with a clear message.
Zero XML files found → Stage 1 raises RuntimeError('No XML files found. Check INPUT_FOLDER or INPUT_ZIP above.').
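The zero-files case can be sketched as a small guard applied to the collected list. The require_xml_files name is hypothetical; only the error message is taken from the text above.

```python
NO_FILES_MESSAGE = 'No XML files found. Check INPUT_FOLDER or INPUT_ZIP above.'

def require_xml_files(raw_files):
    """Return raw_files unchanged, or raise if discovery found nothing."""
    if not raw_files:
        raise RuntimeError(NO_FILES_MESSAGE)
    return raw_files

# An empty result stops the run with the message shown above.
try:
    require_xml_files([])
    message = None
except RuntimeError as exc:
    message = str(exc)
```

PermissionError and zipfile.BadZipFile need no such guard, because they propagate on their own from open() and zipfile.ZipFile().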

Per-file parse failures aren't caught here - they happen in Stage 2 and are collected into an errors list rather than halting the run.
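The collect-instead-of-halt pattern could look roughly like this. It is a sketch under assumptions: parse_all and the shape of the errors list are invented here, and the real Stage 2 may record more detail per failure.

```python
import xml.etree.ElementTree as ET

def parse_all(raw_files):
    """Parse every (path, raw_bytes) pair, collecting failures instead of raising."""
    parsed, errors = [], []
    for path, raw in raw_files:
        try:
            parsed.append((path, ET.fromstring(raw)))
        except ET.ParseError as exc:
            # Record the failure and keep going with the next file.
            errors.append((path, str(exc)))
    return parsed, errors

parsed, errors = parse_all([
    ('good.xml', b'<ClinicalDocument/>'),
    ('broken.xml', b'<ClinicalDocument>'),  # unclosed root element
])
```

One broken export therefore costs one entry in errors rather than the whole run.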

Output of this stage

A Python list:

raw_files = [
    ('Continuity_of_Care_2022.xml', b'<?xml version="1.0"...'),
    ('subfolder/CCDA_2023.xml',     b'<?xml version="1.0"...'),
    ...
]

Stage 2 consumes this list one element at a time.