This repository contains research code from the Lindvall Lab at Dana-Farber Cancer Institute for extracting present (current) patient-reported symptoms from the electronic health record (EHR). Symptoms are vital outcomes for cancer clinical trials, observational research, and population-level surveillance. We sought to develop, test, and externally validate a deep learning model that extracts symptoms from unstructured clinical notes in the EHR.
- Processing
- Training
- Inference
python processing/label_output.py \
--input {location of the label-studio output json files} \
--label_config {configuration used to set up label-studio; xml file} \
--label all OR --keep goals_or care \
--hpi \
--stratified_split 0.3 \
--test
- Without the --test argument, the data is split (stratified) into train/valid at 0.7/0.3.
- With the --test argument, the data is split (stratified) into train/valid/test at 0.7/0.15/0.15.
- Loading the spaCy en_core_sci_lg model takes around 17 s; please wait.
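The split ratios above can be illustrated with a minimal standard-library sketch. It is not the code in processing/label_output.py: the function name `stratified_split`, its parameters, and the toy labels are assumptions, and it assumes the held-out 0.3 is simply halved into valid and test when a test set is requested.

```python
import random
from collections import defaultdict


def stratified_split(items, labels, valid_frac=0.3, with_test=False, seed=0):
    """Split items into train/valid(/test), keeping label proportions similar.

    Hypothetical helper: valid_frac mirrors --stratified_split and
    with_test mirrors --test; the real script may behave differently.
    """
    # Group items by label so each label is split independently.
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)

    rng = random.Random(seed)
    train, valid, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_held = round(len(group) * valid_frac)
        held, kept = group[:n_held], group[n_held:]
        train.extend(kept)
        if with_test:
            # Assumption: the held-out part is halved into valid and test.
            half = len(held) // 2
            valid.extend(held[:half])
            test.extend(held[half:])
        else:
            valid.extend(held)
    return train, valid, test


# Toy data: 100 notes, 20 of them carrying a made-up "symptom" label.
notes = [f"note_{i}" for i in range(100)]
labels = ["symptom" if i % 5 == 0 else "none" for i in range(100)]

train, valid, test = stratified_split(notes, labels, valid_frac=0.3, with_test=True)
print(len(train), len(valid), len(test))  # 70 15 15
```

Calling the same function with `with_test=False` leaves `test` empty and puts the full 0.3 into `valid`.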
- Transformer model choices: 'bert', 'xlnet', 'roberta', 'xlm-roberta', 'camembert', 'distilbert', 'electra'
conda activate transformers
python ner.py \
--dset {location of the data that has been converted to CoNLL format} \
--model_class electra \
--pretrained_model google/electra-base-discriminator \
--lr 6e-5 \
--decay 0.02 \
--warmups 500
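For readers unfamiliar with the CoNLL layout expected by --dset, the sketch below shows the general shape: one token and its tag per line, with a blank line between sentences. The tag names (`B-SYMPTOM`, `I-SYMPTOM`, `O`) and the `read_conll` helper are illustrative assumptions, not the label set or loader used by ner.py.

```python
def read_conll(lines):
    """Group 'token tag' lines into (tokens, tags) pairs, one per sentence."""
    sentences, tokens, tags = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])   # first column: the token
        tags.append(parts[-1])    # last column: the tag
    if tokens:
        sentences.append((tokens, tags))
    return sentences


# Made-up example text with hypothetical BIO tags.
sample = """Patient O
reports O
shortness B-SYMPTOM
of I-SYMPTOM
breath I-SYMPTOM
. O

She O
has O
fatigue B-SYMPTOM
. O
""".splitlines()

sentences = read_conll(sample)
for tokens, tags in sentences:
    print(list(zip(tokens, tags)))
```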
- Bayesian optimization with Gaussian processes
- Open the interactive plots (contour_plot, slice_plot, cv_plot, etc.) in a browser
python optimization.py \
--model bert \
--lr 1e-6 1e-4 \
--decay 0.01 0.1 \
--warmups 0 3000 \
--eps 1e-9 1e-7
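Each range flag in the command above takes a lower and an upper bound. The sketch below shows one way such pairs could be parsed with `argparse` and turned into a single candidate configuration. It is an assumption about the interface, not the contents of optimization.py, and the draw is plain random sampling (log-uniform for lr and eps), not Bayesian optimization.

```python
import argparse
import math
import random


def build_parser():
    """Hypothetical parser mirroring the flags shown in the command above."""
    parser = argparse.ArgumentParser(description="Search-space sketch")
    parser.add_argument("--model", default="bert")
    for name in ("lr", "decay", "eps"):
        # nargs=2 collects the lower and upper bound as a two-element list.
        parser.add_argument(f"--{name}", type=float, nargs=2, metavar=("LOW", "HIGH"))
    parser.add_argument("--warmups", type=int, nargs=2, metavar=("LOW", "HIGH"))
    return parser


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


args = build_parser().parse_args(
    "--model bert --lr 1e-6 1e-4 --decay 0.01 0.1 --warmups 0 3000 --eps 1e-9 1e-7".split()
)

rng = random.Random(0)
candidate = {
    "lr": sample_log_uniform(*args.lr, rng),
    "decay": rng.uniform(*args.decay),
    "warmups": rng.randint(*args.warmups),
    "eps": sample_log_uniform(*args.eps, rng),
}
print(candidate)
```

Sampling lr and eps on a log scale is a common choice when a range spans several orders of magnitude; whether the repository does so is not stated here.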
python processing/model_output.py \
--model_output processing/output/symptoms_hpi_all/prediction_test.txt \
--label_output_dir symptoms/storage/label-studio/project/completions/ \
--label_config symptoms/storage/label-studio/project/config.xml
Use raw CSV files with a column containing the clinical notes; there is no need to convert them into CoNLL format.
python inference/run_and_predict.py -ipf {location of the input file} -opf {location of dummy output file} -cn {name of the column containing the clinical note}
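As an illustration of the expected input, the sketch below reads a CSV and pulls out the note column by name, which is the role the -cn argument plays. The column names, sample rows, and the `read_notes` helper are made up for the example and are not part of inference/run_and_predict.py.

```python
import csv
import io


def read_notes(csv_file, column):
    """Return the text of every row in the named column (hypothetical helper)."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found; available: {reader.fieldnames}")
    return [row[column] for row in reader]


# In-memory stand-in for an input file; quoted fields may contain commas.
raw = io.StringIO(
    "note_id,note_text\n"
    '1,"Patient reports nausea, worse after meals."\n'
    '2,"No new complaints today."\n'
)

notes = read_notes(raw, "note_text")
print(len(notes))  # 2
```

With a real file, `open(path, newline="")` would replace the `io.StringIO` object.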
All code is modified from
The GNU GPL v2 version of PathML is made available via Open Source licensing. The user is free to use, modify, and distribute under the terms of the GNU General Public License version 2.
Commercial license options are also available.
Questions? Comments? Suggestions? Get in touch!