ml-citation-screening

A Simple Machine Learning Framework for Assisted Citation Screening

Introduction

This repository contains the source code of a research project that develops a machine learning framework to semi-automate citation collection and screening in systematic reviews and meta-analyses.

The framework was developed and evaluated in the context of aging and longevity research studies and tested on a dataset related to the "Dasatinib and Quercetin Senolytic Therapy Risk-Benefit Analysis" (D&Q Analysis) published by the Forever Healthy Foundation.

The results are available as three interactive tables of exported documents, accessible here.

You can read more about it here or check the presentation slides for a quick overview.

Evaluation Results

The empirical results show that the proposed system identifies 95% of relevant documents on average, with 17% precision. This performance is reasonable: reviewers need to screen only around 35% of the retrieved documents on average to reach the desired 95% recall. Compared with random screening, where they would need to screen 95% of the documents on average to reach the same recall, this saves around 60% of the work.
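The workload figures above follow from simple arithmetic. The sketch below uses the common definition of work saved over sampling, WSS@R = (TN + FN) / N - (1 - R), where (TN + FN) / N is the share of documents the reviewers never have to read. The function name and its arguments are illustrative, and the repository's own evaluation code may compute the metric differently.

```python
def wss_at_recall(screened_fraction: float, recall: float) -> float:
    """Work saved over sampling (WSS@R), in its common form.

    WSS@R = (TN + FN) / N - (1 - R). The unscreened share (TN + FN) / N
    equals 1 - screened_fraction, so the result is recall - screened_fraction.
    """
    if not 0.0 <= screened_fraction <= 1.0 or not 0.0 <= recall <= 1.0:
        raise ValueError("fractions must lie in [0, 1]")
    return (1.0 - screened_fraction) - (1.0 - recall)


# Figures quoted above: about 35% of documents screened at 95% recall.
saved = wss_at_recall(screened_fraction=0.35, recall=0.95)
print(f"WSS@95 = {saved:.2f}")
```

With these inputs the saving comes out at about 0.60, which matches the "around 60% of work" stated above.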

Below are the results on the D&Q Analysis dataset, evaluated on 153 labeled documents using 5-fold cross-validation.

Fold	Precision	Recall	PR-AUC	WSS@R
1	0.13	0.94	0.54	0.53
2	0.16	0.90	0.33	0.61
3	0.14	1.00	0.48	0.63
4	0.19	0.94	0.54	0.67
5	0.20	0.97	0.43	0.71
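The per-fold rows above can be summarized with an unweighted mean per metric. This is a minimal sketch, not the repository's evaluation code: the dictionary keys and the helper name are illustrative, and because the table entries are rounded to two decimals, the means may differ slightly from the averages quoted in the text.

```python
from statistics import mean

# Per-fold metrics copied from the table above.
folds = [
    {"precision": 0.13, "recall": 0.94, "pr_auc": 0.54, "wss": 0.53},
    {"precision": 0.16, "recall": 0.90, "pr_auc": 0.33, "wss": 0.61},
    {"precision": 0.14, "recall": 1.00, "pr_auc": 0.48, "wss": 0.63},
    {"precision": 0.19, "recall": 0.94, "pr_auc": 0.54, "wss": 0.67},
    {"precision": 0.20, "recall": 0.97, "pr_auc": 0.43, "wss": 0.71},
]


def average_metrics(rows):
    """Return the unweighted mean of each metric across folds."""
    return {name: mean(row[name] for row in rows) for name in rows[0]}


averages = average_metrics(folds)
for name, value in averages.items():
    print(f"{name}: {value:.3f}")
```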

Below are the PR curves and WSS@R curves obtained on the D&Q Analysis dataset.

Installation

The code is tested on Ubuntu 20.04.1 LTS. The framework is implemented in Python and uses JavaScript for the front-end. The ./src directory contains a Makefile that creates a Python virtual environment in ./src/longevity-research-screening-venv and installs all the dependencies from ./src/requirements.txt.

Usage

Below is an overview of the proposed framework.

Running the Model

To reproduce the experimental results, use the Makefile in ./src/.

For the results of a simple model that uses only binary features, run:

cd src
make run

You should see the results and the updated plots in the figures directory.

The full model uses features extracted from an LDA topic model built with Mallet, a Java-based package for statistical natural language processing. To build Mallet, you need Java and the Apache Ant build tool installed. On a Debian-based distribution, run:

sudo apt-get install default-jdk
sudo apt-get install ant

Then run make full, which also downloads Mallet to ./src:

cd src
make full

Pre-processing

The framework uses a local MySQL database. To re-run the pre-processing steps, you need a MySQL server installed. On a Debian-based distribution, run:

sudo apt install mysql-server

Then import the database dump of the D&Q Analysis dataset, longevity_research.sql (5.4 MB), into the local MySQL database.
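One way to script the import is to feed the dump to the standard mysql command-line client. The sketch below makes several assumptions that the README does not state: the MySQL user is root, the target database is named longevity_research and already exists (or is created by the dump itself), and the dump file sits in the current directory. The helper names are illustrative.

```python
import subprocess
from pathlib import Path
from typing import List


def build_import_command(user: str, database: str) -> List[str]:
    """Build the mysql client invocation; -p makes the client prompt for a password."""
    return ["mysql", "-u", user, "-p", database]


def import_dump(dump_path: str, user: str = "root",
                database: str = "longevity_research") -> None:
    """Feed the SQL dump to the mysql client on standard input."""
    dump = Path(dump_path)
    if not dump.is_file():
        raise FileNotFoundError(dump)
    with dump.open("rb") as handle:
        # Equivalent to: mysql -u USER -p DATABASE < dump.sql
        subprocess.run(build_import_command(user, database),
                       stdin=handle, check=True)


# Example call (assumed user and database name, adjust to your setup):
# import_dump("longevity_research.sql")
```

Depending on how MySQL authentication is configured on your machine, you may need a different user or elevated privileges.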

Then, to execute all the pre-processing steps, run:

python3 preprocess_articles.py

Creating a Dataset

Besides a MySQL server, you also need the chromedriver tool for web scraping. On a Debian-based distribution, you can download the latest release of chromedriver to the ./src directory by running the get_chromedriver.sh script in the src directory:

cd src
./get_chromedriver.sh

Then, to re-create the dataset for D&Q Analysis, run the Python script create_database.py:

python3 ./src/create_database.py

The script queries the PubMed database through the pymed API, using search terms devised by the Forever Healthy Foundation. Additionally, it scrapes some data directly from the websites of journals or clinical trials (ClinicalTrials.gov) using chromedriver. The script creates a database called longevity_research and saves the retrieved data in its dasatinib_and_quercetin_senolytic_therapy table.

The script takes some time to finish because it waits a certain time interval between successive get calls, to avoid sending too many requests in a short time. When started, the script prints the estimated scraping time.
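The waiting behavior described above can be sketched as a throttled loop. This is an illustration only, not the code in create_database.py: the function names, the 2-second default delay, and the injected fetch and sleep callables are all assumptions made so the example runs without a browser or network access.

```python
import time
from typing import Callable, Iterable, List


def estimate_seconds(n_requests: int, delay_seconds: float) -> float:
    """Lower bound on run time: one pause between consecutive requests."""
    return max(n_requests - 1, 0) * delay_seconds


def fetch_all(urls: Iterable[str], fetch: Callable[[str], str],
              delay_seconds: float = 2.0,
              sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Call fetch(url) for each URL, pausing between calls to limit the request rate."""
    url_list = list(urls)
    print(f"Estimated time: {estimate_seconds(len(url_list), delay_seconds):.0f} s")
    pages = []
    for index, url in enumerate(url_list):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        pages.append(fetch(url))
    return pages


# Demo with a stand-in fetch function and a recorder instead of real waiting.
pauses = []
pages = fetch_all(["a", "b", "c"], fetch=str.upper,
                  delay_seconds=2.0, sleep=pauses.append)
```

In a real run, fetch would wrap the browser's page load and sleep would be left at its default, so the total time grows linearly with the number of pages.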

Citing this Work

If you find this work useful, please cite:

Lalovic, Marko. (2021, March 10). A Simple Machine Learning Framework for Citation Screening of Aging and Longevity Research Studies. Zenodo. http://doi.org/10.5281/zenodo.4603365

Or use the bib entry.

License

The code is released under the MIT License. See the LICENSE file for more details.
