A Simple Machine Learning Framework for Assisted Citation Screening
This repository contains the source code of a research project to develop a machine learning framework that semi-automates citation collection and screening in systematic reviews and meta-analyses.
The framework was developed and evaluated in the context of aging and longevity research studies and tested on a dataset related to the "Dasatinib and Quercetin Senolytic Therapy Risk-Benefit Analysis" (D&Q Analysis) published by the Forever Healthy Foundation.
The results are available as 3 interactive tables of exported documents, accessible here.
You can read more about it here or check the presentation slides for a quick overview.
The empirical results show that the proposed system identifies 95% of relevant documents on average, with 17% precision. This performance is reasonable in practice: reviewers have to screen only around 35% of the retrieved documents on average to reach the desired 95% recall. Compared to random screening, where they would need to screen 95% of the documents on average to reach the same recall, this saves around 60% of the work.
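The relation between the screened fraction and the work saved can be made concrete with the Work Saved over Sampling metric, commonly defined as WSS@R = (TN + FN)/N - (1 - R). The following is a minimal pure-Python sketch of that definition, not the repository's own implementation; the function names and the toy scores and labels are hypothetical.

```python
import math


def screened_fraction_at_recall(scores, labels, target_recall=0.95):
    """Fraction of documents a reviewer must read, in descending score
    order, to find `target_recall` of the relevant ones."""
    n_relevant = sum(labels)
    if n_relevant == 0:
        raise ValueError("need at least one relevant document")
    # Small epsilon guards against floating-point noise before rounding up.
    needed = math.ceil(target_recall * n_relevant - 1e-9)
    # Rank documents from most to least likely relevant.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    found = 0
    for position, (_, label) in enumerate(ranked, start=1):
        found += label
        if found >= needed:
            return position / len(ranked)
    return 1.0


def wss(screened_fraction, target_recall=0.95):
    """Work saved over sampling: unscreened fraction minus allowed misses."""
    return (1.0 - screened_fraction) - (1.0 - target_recall)


# Toy ranking (hypothetical data): 10 documents, 2 of them relevant.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
toy_fraction = screened_fraction_at_recall(toy_scores, toy_labels)
print(toy_fraction, round(wss(toy_fraction), 2))

# Screening 35% of the documents at 95% recall gives 0.65 - 0.05 = 0.60.
print(round(wss(0.35), 2))
```

Under this definition, screening 35% of the documents to reach 95% recall corresponds to a WSS@95 of about 0.60, which matches the 60% saving quoted above.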
Below are the results obtained on the D&Q Analysis dataset, evaluated on 153 labeled documents using 5-fold cross-validation.
Fold | Precision | Recall | PR-AUC | WSS@R |
---|---|---|---|---|
1 | 0.13 | 0.94 | 0.54 | 0.53 |
2 | 0.16 | 0.90 | 0.33 | 0.61 |
3 | 0.14 | 1.00 | 0.48 | 0.63 |
4 | 0.19 | 0.94 | 0.54 | 0.67 |
5 | 0.20 | 0.97 | 0.43 | 0.71 |
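The five rows above correspond to the five folds of the cross-validation over 153 labeled documents. As a rough illustration of how such a split can be produced, here is a minimal pure-Python sketch. It assumes a plain shuffled split; the source does not say whether the repository shuffles or stratifies, and the function name and seed are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        start += size
        yield train, test


folds = list(k_fold_indices(153, k=5))
fold_sizes = [len(test) for _, test in folds]
print(fold_sizes)  # [31, 31, 31, 30, 30]
```

Each document lands in exactly one test fold, so every labeled document is used for evaluation once and for training four times.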
Below are the PR curves and WSS@R curves obtained on the D&Q Analysis dataset.
The code is tested with Ubuntu 20.04.1 LTS. The framework is implemented in Python and uses JavaScript for the front-end. The ./src directory contains a Makefile that creates a Python virtual environment in ./src/longevity-research-screening-venv and installs all the dependencies from ./src/requirements.txt.
Below is an overview of the proposed framework.
To reproduce the experimental results, use the Makefile in ./src/.
For the results of a simple model that uses only binary features, run:
cd src
make run
You should see the results and updated plots in the figures directory.
The full model uses features extracted from an LDA topic model built with Mallet, a Java-based package for statistical natural language processing. To build Mallet you need Java and the Apache Ant build tool installed. On a Debian-based distro, run:
sudo apt-get install default-jdk
sudo apt-get install ant
Then simply run make full, which also downloads Mallet to ./src:
cd src
make full
The framework uses a local MySQL database. To re-run the pre-processing steps you need a MySQL server installed. On a Debian-based distro, run:
sudo apt install mysql-server
Then import the database dump of the D&Q Analysis dataset, longevity_research.sql (5.4 MB), into the local MySQL database.
Then, to execute all the pre-processing steps, run:
python3 preprocess_articles.py
Besides the MySQL server, you also need the chromedriver tool for web scraping. On a Debian-based distro you can download the latest release of chromedriver to the ./src directory by running the get_chromedriver.sh script in the src directory:
cd src
./get_chromedriver.sh
Then, to re-create the dataset for D&Q Analysis, run the Python script create_database.py:
python3 ./src/create_database.py
The script queries the PubMed database through the pymed API, using the search terms devised by the Forever Healthy Foundation. Additionally, it scrapes some data directly from the websites of journals or clinical trials (ClinicalTrials.gov) using chromedriver. The script creates a database called longevity_research and saves the retrieved data into its dasatinib_and_quercetin_senolytic_therapy table.
The script takes some time to finish because it waits a certain time interval between successive get calls, to avoid sending too many requests in a short time. When started, it prints the estimated scraping time.
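The waiting behaviour described above can be sketched as follows. This is a minimal illustration rather than the code in create_database.py: it assumes a fixed delay between requests, and the names fetch_all and estimate_seconds, the 2-second delay, and the stub fetch function are all hypothetical. In the real script the fetch step would presumably wrap the chromedriver get call.

```python
import time


def estimate_seconds(n_requests, delay_seconds):
    """Lower bound on run time: one pause between consecutive requests."""
    return max(n_requests - 1, 0) * delay_seconds


def fetch_all(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Call `fetch(url)` for each URL, pausing between consecutive calls."""
    estimate = estimate_seconds(len(urls), delay_seconds)
    print(f"Estimated scraping time: {estimate:.0f} s")
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # throttle to avoid too many requests
        results.append(fetch(url))
    return results


# Usage with a stub fetch and a recording sleep, so nothing is downloaded
# and no real waiting happens.
recorded_pauses = []
pages = fetch_all(["a", "b", "c"], fetch=str.upper,
                  delay_seconds=2.0, sleep=recorded_pauses.append)
print(pages, recorded_pauses)
```

Passing the sleep function as a parameter keeps the throttling logic testable without actually waiting.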
If you find this work useful, please cite:
Lalovic, Marko. (2021, March 10). A Simple Machine Learning Framework for Citation Screening of Aging and Longevity Research Studies. Zenodo. http://doi.org/10.5281/zenodo.4603365
Or use the bib entry.
The code is released under MIT License. See the LICENSE file for more details.