Science Concierge

A Python repository for content-based recommendation based on Latent Semantic Analysis (LSA) topic distance and the Rocchio algorithm. Science Concierge is the backend algorithm for Scholarfy (www.scholarfy.net), an automatic scheduler for conferences.

See the full article on PLOS ONE or arXiv, or the full TeX manuscript and presentation here. A version of Scholarfy scaled to 14.3M articles from PubMed is available at pubmed.scholarfy.net.

Usage

First, clone the repository.

$ git clone https://github.com/titipata/science_concierge

Install dependencies using pip,

$ pip install -r requirements.txt

Install the library using setup.py,

$ python setup.py develop install

Download example data

We provide example CSV files from the PubMed Open Access Subset that you can download and experiment with (parsed using pubmed_parser). Each file contains the columns pmc, pmid, title, abstract, and publication_year. Use the download function to fetch the example data:

import science_concierge
science_concierge.download(['pubmed_oa_2015.csv', 'pubmed_oa_2016.csv'])

We provide pubmed_oa_{year}.csv for {year} = 2007, ..., 2016 (note that the 2007 file contains all publications before 2008). Alternatively, use awscli to download the data:

$ aws s3 cp s3://science-of-science-bucket/science_concierge/data/ . --recursive
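Once several yearly files are on disk, they can be combined into a single table before fitting. The following is a minimal sketch, assuming the files sit in a local `data/` directory and follow the `pubmed_oa_{year}.csv` naming above; the helper name `load_years` and the choice to drop rows without an abstract are illustrative assumptions, not part of the library.

```python
import os
import pandas as pd


def load_years(years, data_dir='data'):
    """Read pubmed_oa_{year}.csv for each year and stack them into one DataFrame.

    `load_years` is a hypothetical helper, not a science_concierge function.
    """
    frames = []
    for year in years:
        path = os.path.join(data_dir, 'pubmed_oa_{}.csv'.format(year))
        frames.append(pd.read_csv(path, encoding='utf-8'))
    df = pd.concat(frames, ignore_index=True)
    # Rows without an abstract cannot be vectorized, so drop them here
    # (a cleaning choice made for this sketch, not a library requirement).
    return df.dropna(subset=['abstract']).reset_index(drop=True)
```

For example, `load_years([2015, 2016])` would return one DataFrame with the pmc, pmid, title, abstract, and publication_year columns of both files.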

Example usage of Science Concierge

You can build a quick recommender by importing the ScienceConcierge class and using the fit method to fit a list of documents. Then use recommend to get recommendations based on the documents you like or dislike.

import pandas as pd
from science_concierge import ScienceConcierge

df = pd.read_csv('data/pubmed_oa_2016.csv', encoding='utf-8')
docs = list(df.abstract) # provide list of abstracts
titles = list(df.title) # titles
# select weighting from 'count', 'tfidf', or 'entropy'
recommend_model = ScienceConcierge(stemming=True, ngram_range=(1,1),
                                   weighting='entropy', norm=None,
                                   n_components=200, n_recommend=200,
                                   verbose=True)
recommend_model.fit(docs) # input list of documents or abstracts
index = recommend_model.recommend(likes=[10000], dislikes=[]) # input lists of liked/disliked indices (here we like titles[10000])
docs_recommend = [titles[i] for i in index[0:10]] # recommended documents
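To give some intuition for how liked and disliked documents can steer a recommendation, here is a minimal sketch of a Rocchio-style update on topic vectors. It is not the library's implementation: the function name `rocchio_recommend`, the weights `alpha` and `beta`, and the use of cosine similarity for ranking are all assumptions made for illustration. The idea is to move a query vector toward the mean of the liked documents and away from the mean of the disliked ones, then rank the remaining documents by similarity to that query.

```python
import numpy as np


def rocchio_recommend(topic_vectors, likes, dislikes=(), alpha=1.0, beta=0.5,
                      n_recommend=10):
    """Rank documents against a Rocchio-style query vector (illustrative sketch).

    topic_vectors: array-like of shape (n_docs, n_topics), e.g. LSA output.
    alpha, beta: hypothetical weights for liked and disliked documents.
    """
    X = np.asarray(topic_vectors, dtype=float)
    # Scale each document to unit length so dot products act as cosine similarity.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_unit = X / np.where(norms == 0, 1.0, norms)

    # Move the query toward liked documents and away from disliked ones.
    query = alpha * X_unit[list(likes)].mean(axis=0)
    if len(dislikes) > 0:
        query = query - beta * X_unit[list(dislikes)].mean(axis=0)

    scores = X_unit @ query
    order = np.argsort(-scores, kind='stable')
    # Do not recommend documents the user has already voted on.
    seen = set(likes) | set(dislikes)
    return [int(i) for i in order if int(i) not in seen][:n_recommend]
```

With four toy documents `[[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]`, liking document 0 and disliking document 2 ranks document 1 ahead of document 3, since document 1 points in nearly the same direction as the liked one.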

Vectorizer available

We also provide add-on vectorizer classes, including LogEntropyVectorizer and BM25Vectorizer, for computing document-term weighting from an input list of documents. Here is an example:

from science_concierge import LogEntropyVectorizer
l_model = LogEntropyVectorizer(norm=None, ngram_range=(1,2),
                               stop_words='english', min_df=1, max_df=0.8)
X = l_model.fit_transform(docs) # where docs is list of documents

When you already have a sparse document-term matrix like this, you can pass it to the fit_document_matrix method directly.

recommend_model = ScienceConcierge(n_components=200, n_recommend=200)
recommend_model.fit_document_matrix(X)
index = recommend_model.recommend(likes=[10000], dislikes=[])
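For readers unfamiliar with log-entropy weighting, the sketch below computes the textbook form on a tiny tokenized corpus using only the standard library. It is meant as an illustration of the general technique and may differ in detail (log base, normalization, tokenization) from what LogEntropyVectorizer does; the function name `log_entropy_weights` is hypothetical. Each term count is damped with log(1 + tf) and multiplied by a global weight 1 + sum_j p_j log p_j / log n, where p_j is the share of the term's occurrences in document j and n is the number of documents.

```python
import math
from collections import Counter


def log_entropy_weights(tokenized_docs):
    """Textbook log-entropy weights for a list of token lists (illustrative sketch)."""
    n_docs = len(tokenized_docs)
    counts = [Counter(doc) for doc in tokenized_docs]

    # Total number of occurrences of each term across the corpus.
    global_freq = Counter()
    for c in counts:
        global_freq.update(c)

    # Global weight: close to 1 for terms concentrated in few documents,
    # close to 0 for terms spread evenly over all documents.
    global_weight = {}
    for term, gf in global_freq.items():
        entropy_sum = 0.0
        for c in counts:
            tf = c.get(term, 0)
            if tf > 0:
                p = tf / gf
                entropy_sum += p * math.log(p)
        if n_docs > 1:
            global_weight[term] = 1.0 + entropy_sum / math.log(n_docs)
        else:
            global_weight[term] = 1.0

    # Local weight log(1 + tf) times the global weight, one dict per document.
    return [{t: math.log(1 + tf) * global_weight[t] for t, tf in c.items()}
            for c in counts]
```

In a two-document corpus `[['a', 'b'], ['a', 'c']]`, the term `a` appears equally in both documents and receives weight 0, while `b` and `c` each appear in one document only and keep their full local weight.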

Dependencies

Members

License

Copyright (c) 2015 Titipat Achakulvisut, Daniel E. Acuna, Tulakan Ruangrong, Konrad Kording