This code accompanies our submissions to the N2C2 Shared Task, Track 2, on the extraction of social determinants of health (SDoH) from clinical notes. It was also used in a subsequent study on the effects of different NLP models on downstream medical association study results.
This repository contains the following scripts, including the submission script and the two main Python files used to train or apply our BIO-scheme-based SDoH models:
sdoh_model_bert_bio.py
: The code used for all BERT settings (call sdoh_model_bert_bio.py -h for more detailed information).
sdoh_model_bio.py
: The code used for all other settings (call sdoh_model_bio.py -h for more detailed information).
Submission_script.sh
: The script that was used to make our submissions for the shared task.
pretrain_embs.py
: The script used to pretrain the fastText embeddings (on the MIMIC III and the UCSF data).
association_study_experiments.sh
: The script used to conduct the experiments from the arXiv article.
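For readers unfamiliar with the BIO scheme mentioned above, the following is a minimal, hypothetical sketch of how token-level span annotations can be turned into BIO tags (B = beginning of a span, I = inside a span, O = outside any span). It is not taken from sdoh_model_bio.py or sdoh_model_bert_bio.py; the function name `spans_to_bio`, the example sentence, and the labels `Tobacco` and `Amount` are assumptions chosen for illustration only.

```python
from typing import List, Tuple


def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert token-level spans (start, end_exclusive, label) to BIO tags.

    Illustrative helper only; assumes the spans do not overlap.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        # The first token of a span gets the B- prefix.
        tags[start] = f"B-{label}"
        # Every following token inside the span gets the I- prefix.
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Hypothetical example sentence and labels (not from the shared task data).
tokens = ["Patient", "smokes", "one", "pack", "daily", "."]
spans = [(1, 2, "Tobacco"), (2, 5, "Amount")]
bio_tags = spans_to_bio(tokens, spans)

for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

A sequence-labeling model trained under this scheme predicts one such tag per token, and contiguous B-/I- tags are then merged back into labeled spans.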
The text data (clinical notes from MIMIC III and the University of Washington) and SDoH annotations were provided by the task organizers under a data sharing agreement to protect patient privacy. We therefore cannot share this data here.
The do-not-resuscitate/do-not-intubate (DNR/DNI) annotations can be found in this repository: https://github.com/tuur/code-status-annotations-mimic
Results from our submissions are reported in the attached abstract:
Madhumita Sushil, Atul J. Butte, Ewoud Schuit, Artuur M. Leeuwenberg. Cross-institution extraction of social determinants of health from clinical notes: an evaluation of methods. AMIA Natural Language Processing Working Group Pre-Symposium. November, 2022.
Results from the subsequent study on the impact of NLP modeling choices on downstream association study results were published in the Journal of Clinical Epidemiology:
Sushil, Madhumita, et al. Cross-institution natural language processing for reliable clinical association studies: a methodological exploration. Journal of Clinical Epidemiology. 2024 Mar 1;167:111258.