Contributing

Issues, pull requests, and parser samples are welcome.

What helps most

In rough priority order:

Vendor-specific parsing samples. If your patient-portal export produces zero records or malformed records, a scrubbed C-CDA file plus a description of what went wrong is the most useful thing you can contribute.
Section-code mappings. The SECTION_TO_CATEGORY table in the notebook handles common LOINC section codes. If your vendor uses one we don't recognize and you can identify what category it should map to, that's a one-line PR.
Documentation clarifications. If a stage page doesn't match what you actually saw the parser do, that's a bug in the docs.
Privacy / de-identification tooling. A separate companion notebook that takes patient_master.csv and produces a Safe Harbor-compatible version would unlock external sharing of dashboards. Big lift but high value.
Code-system harmonization. Mapping SNOMED → ICD-10, or RxNorm → ATC, against a local copy of Athena vocabularies. Optional add-on, not core.
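
As an illustration of the section-code mapping item above, a new mapping might look like the sketch below. The table contents and category names here are assumptions for illustration only; the real SECTION_TO_CATEGORY table lives in the notebook and may use different names. The helper categorize_section is hypothetical.

```python
# Hypothetical excerpt: the real SECTION_TO_CATEGORY table lives in the notebook,
# and its category names may differ from the ones assumed here.
SECTION_TO_CATEGORY = {
    "11450-4": "problems",     # LOINC: Problem list
    "10160-0": "medications",  # LOINC: History of medication use
    "30954-2": "labs",         # LOINC: Relevant diagnostic tests / laboratory data
    "48765-2": "allergies",    # LOINC: Allergies and adverse reactions
}


def categorize_section(loinc_code, table=SECTION_TO_CATEGORY):
    """Return the category for a section code, or None if the code is unmapped."""
    return table.get(loinc_code.strip())


# The "one-line PR": add the section code your vendor emits.
SECTION_TO_CATEGORY["8716-3"] = "vitals"  # LOINC: Vital signs
```

When proposing a mapping, it helps to say in the PR description which vendor export produced the code.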

What does NOT help

Untested vendor-specific code paths. Adding logic that handles your specific export but breaks generic CDA isn't useful. Tests against synthetic C-CDA samples are required.
Sample data containing real PHI. We won't accept any data we can't verify is synthetic or fully scrubbed. Even partial PHI is a hard no.
Renaming things to match a different schema standard. The Registry Forge schema is what gives this tool interoperability with the parent project. Pick a different name if you want to fork in a different direction.
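
Before attaching a sample file, a quick pattern scan can catch obvious leftovers. The sketch below is not part of the project: the patterns and the scan_for_phi helper are assumptions, and they only flag a few US-style identifiers. A clean result does not show that a file is de-identified; manual review is still needed.

```python
import re

# Heuristic patterns only; a clean result does NOT prove a file is free of PHI.
PHI_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def scan_for_phi(text):
    """Return a list of (pattern_name, matched_text) pairs found in text."""
    hits = []
    for name, pattern in PHI_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group(0)))
    return hits


# Made-up fragment with planted values, for demonstration only.
sample = '<telecom value="tel:555-010-1234"/><id extension="123-45-6789"/>'
for name, value in scan_for_phi(sample):
    print(f"possible {name}: {value}")
```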

Development setup

git clone https://github.com/BoyceLab/RegistryForge4Patients.git
cd RegistryForge4Patients

python -m venv .venv
source .venv/bin/activate

pip install pandas numpy matplotlib jupyter mkdocs-material

Run the notebook against synthetic samples in notebook/sample_data/. Run the docs site locally with:

mkdocs serve

The site is served at http://127.0.0.1:8000.

Testing changes to the parser

The notebook ships with a test_ccda.xml synthetic document covering the main entry types (problems with SNOMED + ICD-10 translation, medications with RxNorm, labs with LOINC + PQ values, allergies). Run the notebook end-to-end against that file as a smoke test before opening a PR.
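
A quick structural check can complement the notebook run. The sketch below is an assumption, not the notebook's parser: it uses only the standard library to list the section codes in a CDA document, with a tiny inline synthetic document standing in for test_ccda.xml. The section_codes helper is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}  # HL7 v3 namespace used by CDA documents


def section_codes(xml_text):
    """Return the code attribute of every <section>'s <code> child, in document order."""
    root = ET.fromstring(xml_text)
    codes = []
    for section in root.iterfind(".//hl7:section", NS):
        code = section.find("hl7:code", NS)
        if code is not None:
            codes.append(code.get("code"))
    return codes


# Minimal synthetic document; a real C-CDA file carries far more structure.
synthetic = """<ClinicalDocument xmlns="urn:hl7-org:v3">
  <component><structuredBody>
    <component><section>
      <code code="11450-4" codeSystem="2.16.840.1.113883.6.1"/>
    </section></component>
    <component><section>
      <code code="10160-0" codeSystem="2.16.840.1.113883.6.1"/>
    </section></component>
  </structuredBody></component>
</ClinicalDocument>"""

print(section_codes(synthetic))
```

For a file on disk, ET.parse(path).getroot() can replace ET.fromstring. If a section you expect is missing from the list, the export likely uses a code the parser does not map yet.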

For a heavier regression test, run against any synthetic C-CDA corpus you have access to. A vendor-neutral synthetic test suite is on the roadmap.

Style

Python: roughly PEP 8. The notebook is the source of truth; if you're refactoring, keep cell boundaries meaningful.
Docs: short paragraphs, plain prose. Avoid jargon when a clear word exists.
Pull requests: one logical change per PR. A docs typo and a parser bugfix should be separate PRs.

Code of conduct

Be kind. Researchers working with patient-shared data are usually solving a real problem, often for one specific patient or family they care about. Treat questions and bug reports accordingly.

License contributions

By submitting a pull request, you agree your contribution is licensed under the project's MIT License.