Environmental / occupational / toxic exposure extraction
Preview — under active development. This module is still being refined. Patterns, mappings, and category boundaries should be validated against your own corpus before being relied on for analysis or publication.
exposure_extraction.py walks the dashboard bundle and surfaces
environmental, occupational, and toxic exposures relevant to the ALS
risk factor literature. It combines two extraction layers, structured
ICD-10-CM codes and regex patterns over clinical narrative, and tags
each finding with a representative ECTO (Environmental Conditions,
Treatments, and Exposures Ontology) term where one has been verified.
Module source (exposure_extraction.py)
What it captures
The patterns and codes are drawn from the published ALS environmental risk factor literature. Each category is grounded in ECTO; the specificity of the ontology grounding varies (see ECTO grounding below).
Structured codes (ICD-10-CM)

| Family | Codes | Captures |
|---|---|---|
| Nicotine | F17.x, Z87.891 | Current and former nicotine dependence, history of smoking |
| Occupational | Z57.x (0–9) | Noise, radiation, dust, agricultural toxins, industrial chemicals, temperature, vibration |
| Substance contact | Z77.0xx | Lead, uranium, arsenic, aromatic compounds, asbestos, other hazardous chemicals |
| Pollution | Z77.110–.118 | Air, water, soil pollution |
| Physical hazards | Z77.121–.128 | Noise, mold, harmful algae / cyanotoxins |
| Body fluids | Z77.21, Z77.22, Z77.29, Z77.9 | Hazardous body fluids, environmental tobacco smoke, other |
| Military | Z91.82, Z56.82 | Personal history of military deployment, military deployment status |
| Trauma | Z87.820, S06.0X0A, S06.9X0A | Personal history of TBI, concussion, unspecified intracranial injury |
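
The code families above lend themselves to simple pattern matching. The following is a minimal sketch, assuming one regular expression per family; `CODE_FAMILIES` and `classify_code` are hypothetical names used for illustration, and the module's own matching rules may differ.

```python
import re

# Hypothetical family -> pattern list, transcribed from the table above.
# This is an illustration, not the module's actual rule set.
CODE_FAMILIES = [
    ("Nicotine", re.compile(r"^(F17\.\w+|Z87\.891)$")),
    ("Occupational", re.compile(r"^Z57\.\d\w*$")),
    ("Substance contact", re.compile(r"^Z77\.0\d\d$")),
    ("Pollution", re.compile(r"^Z77\.11[0-8]$")),
    ("Physical hazards", re.compile(r"^Z77\.12[1-8]$")),
    ("Body fluids", re.compile(r"^Z77\.(21|22|29|9)$")),
    ("Military", re.compile(r"^(Z91\.82|Z56\.82)$")),
    ("Trauma", re.compile(r"^(Z87\.820|S06\.0X0A|S06\.9X0A)$")),
]


def classify_code(code):
    """Return the exposure family for an ICD-10-CM code, or None if unmatched."""
    normalized = code.strip().upper()
    for family, pattern in CODE_FAMILIES:
        if pattern.match(normalized):
            return family
    return None
```

Calling `classify_code("F17.210")` returns `"Nicotine"`, while a code outside every family, such as `"E11.9"`, returns `None`.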
Regex patterns (clinical narrative)
The pattern set covers 14 categories across the major ALS risk factor domains:
| Category | Example patterns |
|---|---|
| Smoking | current / former smoker, pack-years, chewing tobacco |
| Military service | general service, Gulf War / OIF / OEF / OND, Vietnam + Agent Orange, Camp Lejeune |
| Pesticides | general pesticide, organophosphate, paraquat, glyphosate / Roundup, DDT, agricultural occupations |
| Heavy metals | lead, mercury, manganese / welding fumes, arsenic, cadmium |
| Solvents | TCE / PCE (context-required), benzene, formaldehyde, general organic solvents, welding |
| Asbestos | asbestos exposure, asbestos-containing material, mesothelioma risk |
| Head trauma | TBI history, multiple concussions, football career, contact sports, military blast injury |
| EMF / electrical | electrical worker, lineman, electrician, EMF / electromagnetic field exposure |
| Cyanotoxins | BMAA, cyanobacteria, blue-green algae, harmful algal blooms |
| Mold | mold exposure, black / toxic mold, stachybotrys, mycotoxin |
| Air pollution | PM2.5 / PM10, particulate matter, diesel exhaust, traffic-related air pollution |
| Occupational dust | silica, coal, wood, grain, cotton dust; silicosis; coal workers' pneumoconiosis |
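
As a rough illustration of how category patterns like these can be applied to a note, here is a minimal sketch covering three of the categories. The regular expressions are simplified assumptions written for this example, and `CATEGORY_PATTERNS` and `scan_note` are hypothetical names; none of this is the module's real pattern set.

```python
import re

# Simplified, illustrative patterns for three categories from the table above.
# These are assumptions for the sketch, not the module's actual patterns.
CATEGORY_PATTERNS = {
    "smoking": re.compile(
        r"\b(?:current|former)\s+smoker\b|\b\d+(?:\.\d+)?\s*pack[- ]years?\b",
        re.IGNORECASE,
    ),
    "cyanotoxins": re.compile(
        r"\bBMAA\b|\bcyanobacteria\b|\bblue[- ]green\s+algae\b",
        re.IGNORECASE,
    ),
    "mold": re.compile(
        r"\b(?:black|toxic)\s+mold\b|\bstachybotrys\b|\bmycotoxins?\b",
        re.IGNORECASE,
    ),
}


def scan_note(text):
    """Return (category, matched_text, start_offset) tuples, ordered by offset."""
    hits = []
    for category, pattern in CATEGORY_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((category, match.group(0), match.start()))
    return sorted(hits, key=lambda hit: hit[2])


note = "Former smoker, 20 pack-years. Lived near a lake with blue-green algae blooms."
for hit in scan_note(note):
    print(hit)
```

Keeping the matched text and offset alongside the category makes each hit easy to review against the source note.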
Patterns that historically caused false positives are documented in comments inside the module:
- `NFL` alone is not a contact-sport match: in an ALS cohort, NFL overwhelmingly means neurofilament light (a serum biomarker for motor-neuron-disease activity). `NFL` only matches when paired with explicit football vocabulary (`played NFL`, `NFL player`, `NFL career`, `in the NFL`, etc.).
- Bare `PERC` is not a perchloroethylene match: in clinical notes, PERC is the Pulmonary Embolism Rule-out Criteria. PCE / PERC matches require explicit exposure context.
- Bare `TCE` requires exposure / contamination / occupational context for the same reason: `TCE clinical trial` does not match.
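
The context requirement described above can be sketched as follows. This is a hedged illustration only: the vocabulary lists, the sentence-level co-occurrence rule, and the names `is_contact_sport` and `is_solvent_exposure` are assumptions made for this example, not the module's actual patterns.

```python
import re

# NFL counts as a contact-sport mention only next to football vocabulary.
# The phrase list is an assumption based on the examples in the text.
NFL_FOOTBALL = re.compile(
    r"\b(?:played\s+(?:in\s+the\s+)?NFL|NFL\s+(?:player|career)|in\s+the\s+NFL)\b",
    re.IGNORECASE,
)

# Solvent abbreviations are matched case-sensitively (upper case only).
SOLVENT_ABBREV = re.compile(r"\b(?:TCE|PCE|PERC)\b")

# Hypothetical exposure vocabulary that must appear in the same sentence.
EXPOSURE_CONTEXT = re.compile(
    r"\b(?:expos\w*|contaminat\w*|occupational)\b", re.IGNORECASE
)


def is_contact_sport(sentence):
    """True if NFL appears together with explicit football vocabulary."""
    return bool(NFL_FOOTBALL.search(sentence))


def is_solvent_exposure(sentence):
    """True only if TCE / PCE / PERC co-occurs with exposure vocabulary."""
    return bool(
        SOLVENT_ABBREV.search(sentence) and EXPOSURE_CONTEXT.search(sentence)
    )
```

With these definitions, "Serum NFL elevated" and "TCE clinical trial" are both rejected, while "played in the NFL" and "occupational TCE exposure" are accepted. A sentence-level check is deliberately coarse; a production pattern would likely restrict the distance between the abbreviation and its context.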
ECTO grounding
Each exposure category is mapped to a representative term in ECTO. ECTO is an OBO Foundry ontology with ~2,700 exposure terms. It is the terminology recommended for exposure annotation in GA4GH Phenopackets and by the GA4GH Human Exposome Data Standards Study Group, which is extending Phenopackets with formal schemas for exposure data.
The mapping table inside the module distinguishes three statuses:
- `verified`: the ECTO term ID has been confirmed against the published SSSOM mapping file. These appear in the dashboard as clickable links to the OLS browser.
- `pending`: no verified term ID yet. The dashboard displays "curation pending" rather than a guessed identifier. Adopters should search the OLS ECTO browser for the closest term and replace the `None` placeholder.
- `catch_all` / `out_of_scope`: buckets that don't correspond to real ECTO concepts, or that ECTO doesn't cover (e.g. psychosocial exposures).
Currently verified IDs:
| Category / agent | ECTO term | Label |
|---|---|---|
| smoking (all sub-patterns) | ECTO:9000250 | exposure to nicotine |
| lead | ECTO:9000945 | exposure to lead |
| mercury | ECTO:0001571 | exposure to mercury |
| arsenic | ECTO:9000032 | exposure to arsenic |
| asbestos | ECTO:9000033 | exposure to asbestos |
| air pollution (PM inhalation) | ECTO:0000977 | exposure to ultrafine respirable suspended particulate matter via inhalation |
The remaining patterns and categories are flagged as pending and
appear in the dashboard's ECTO Mapping tab as a curation worklist.
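
The statuses and verified IDs described in this section could be represented along the following lines. This is a minimal sketch: the dictionary layout and the names `ECTO_MAP`, `render_term`, and `curation_worklist` are hypothetical, and the link uses the generic OBO PURL form as an assumption rather than the OLS URLs the dashboard produces.

```python
# Hypothetical mapping structure; the verified IDs and labels are copied
# from the table above, and paraquat stands in for a pending entry.
ECTO_MAP = {
    "smoking": {"status": "verified", "ecto_id": "ECTO:9000250",
                "label": "exposure to nicotine"},
    "lead": {"status": "verified", "ecto_id": "ECTO:9000945",
             "label": "exposure to lead"},
    "paraquat": {"status": "pending", "ecto_id": None, "label": None},
}

# Assumption: standard OBO PURL prefix, used here instead of an OLS link.
OBO_PURL_BASE = "http://purl.obolibrary.org/obo/"


def render_term(category):
    """Return a link for verified terms and a placeholder for pending ones."""
    entry = ECTO_MAP[category]
    if entry["status"] == "verified":
        return OBO_PURL_BASE + entry["ecto_id"].replace(":", "_")
    if entry["status"] == "pending":
        return "curation pending"
    # Display for catch_all / out_of_scope is not specified in the text.
    return None


def curation_worklist():
    """List categories that still need a verified ECTO term."""
    return sorted(
        name for name, entry in ECTO_MAP.items() if entry["status"] == "pending"
    )
```

Storing `None` for unverified IDs, rather than a best guess, keeps unreviewed identifiers out of downstream outputs.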
Outputs

| File | Description |
|---|---|
| `exposure_codes.csv` | Structured ICD-10-CM matches |
| `exposure_extractions.csv` | Regex matches against narrative text |
Both feed into exposure_dashboard.py for
visual review, and into the downstream Phenopackets generator for
formal encoding.
Usage

    import exposure_extraction

    result = exposure_extraction.main(
        bundle_path='./dashboard_data.json',
        out_root='./',
    )
    print(result)
    # {'exposure_codes_csv': './exposure_codes.csv',
    #  'exposure_extractions_csv': './exposure_extractions.csv',
    #  'code_rows': N,
    #  'extraction_rows': M}
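
For a quick sanity check of the two output files, the returned paths can be read back with the standard library. The sketch below assumes only the two path keys shown in the example output and UTF-8 encoded CSVs; it makes no assumption about column names, and `summarize_outputs` is a hypothetical helper, not part of the module.

```python
import csv


def summarize_outputs(result):
    """Report the header and data-row count of each CSV listed in `result`."""
    summary = {}
    for key in ("exposure_codes_csv", "exposure_extractions_csv"):
        # Assumption: the files are UTF-8 encoded with a header row.
        with open(result[key], newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            rows = list(reader)
            summary[key] = {"columns": reader.fieldnames, "rows": len(rows)}
    return summary
```

Passing the dictionary returned by `exposure_extraction.main(...)` to `summarize_outputs` gives row counts that can be compared with the `code_rows` and `extraction_rows` values.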
Important caveat
Most environmental and occupational exposure information lives in
social-history narrative sections of clinical notes rather than as
structured ICD-10 codes. The Z57.x and Z77.x families are
systematically under-coded in clinical practice; the bulk of signal in
any ALS-cohort run of this module typically comes from the regex side
scanning narratives. Adopters should not over-interpret the absence of
structured codes as absence of exposure, and detection sensitivity will
vary substantially across EHR vendors and clinical-documentation
conventions.
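
One way to see how much each layer contributes in a given cohort is to compare which patients each layer flags. The sketch below assumes that rows from both output files carry a shared patient identifier; the column name `patient_id` and the function `layer_coverage` are hypothetical, since the actual CSV schema is not described here.

```python
def layer_coverage(code_rows, extraction_rows, id_field="patient_id"):
    """Split patient IDs by which extraction layer found an exposure.

    `id_field` is a hypothetical column name; adjust it to the real schema.
    """
    coded = {row[id_field] for row in code_rows}
    narrated = {row[id_field] for row in extraction_rows}
    return {
        "codes_only": coded - narrated,
        "narrative_only": narrated - coded,
        "both": coded & narrated,
    }


# Toy rows standing in for the two CSV outputs.
codes = [{"patient_id": "p1"}]
narrative = [{"patient_id": "p1"}, {"patient_id": "p2"}, {"patient_id": "p3"}]
coverage = layer_coverage(codes, narrative)
print({name: sorted(ids) for name, ids in coverage.items()})
```

A large `narrative_only` set relative to `both` would be consistent with the under-coding described above.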
Related
- Exposure dashboard — visualization layer for the outputs of this module.
- Note extraction — companion module that surfaces structured ALS findings (ALSFRS-R scores, FVC% predicted, El Escorial category, etc.) from the same narrative sources.
- GA4GH Phenopackets — downstream consumer of these exposure annotations.