#About
The goal of this app is to compute the similarity of movies based on their plots. It is modeled on the following research paper: Tania Farinella, Sonia Bergamaschi, Laura Po: A Non-intrusive Movie Recommendation System. OTM Conferences (2) 2012: 736-751.
The complete application workflow consists of 6 major steps:
- Collect movie data (movie URI, title and plot) from an RDF triplestore and save it to local storage
- Read the data from local storage and create vectors of tokenized movie plots to be used in step 3; stop-words are also removed from the movie plots
- Generate TF-IDF (Term Frequency–Inverse Document Frequency) vectors for each tokenized movie plot
- Use the TF-IDF vectors from step 3 to build the matrix on which Singular Value Decomposition (SVD) will be performed; rows of this matrix represent movie plots, while columns represent tokens extracted from the movie plots
- Perform SVD to reduce the dimensionality of the matrix
- Starting from the reduced matrix (produced in step 5), calculate the cosine similarity between all pairs of rows of the matrix (i.e., vectors representing movie plots), and save the values in a separate .csv file for each movie, for further use
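
The matrix layout described in step 4 (rows are movie plots, columns are tokens) can be sketched in plain Java as follows. This is only an illustration under assumptions: the class `TermMatrixSketch` and its methods are hypothetical, the per-plot weights are assumed to arrive as term-to-weight maps, and a `double[][]` stands in for the matrix type the project actually uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper, not part of the project: builds a plot-by-token matrix.
class TermMatrixSketch {

    // Collects every token that occurs in any plot; sorting keeps column order deterministic.
    static List<String> vocabulary(List<Map<String, Double>> weightsPerPlot) {
        TreeSet<String> terms = new TreeSet<>();
        for (Map<String, Double> plot : weightsPerPlot) {
            terms.addAll(plot.keySet());
        }
        return new ArrayList<>(terms);
    }

    // Row r holds the weights of plot r; column c corresponds to vocab.get(c).
    // Tokens that do not occur in a plot get weight 0.
    static double[][] toMatrix(List<Map<String, Double>> weightsPerPlot, List<String> vocab) {
        double[][] matrix = new double[weightsPerPlot.size()][vocab.size()];
        for (int r = 0; r < weightsPerPlot.size(); r++) {
            for (int c = 0; c < vocab.size(); c++) {
                matrix[r][c] = weightsPerPlot.get(r).getOrDefault(vocab.get(c), 0.0);
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        Map<String, Double> first = new HashMap<>();
        first.put("hobbit", 0.5);
        first.put("ring", 0.25);
        Map<String, Double> second = new HashMap<>();
        second.put("wizard", 0.75);

        List<Map<String, Double>> weights = new ArrayList<>();
        weights.add(first);
        weights.add(second);

        List<String> vocab = vocabulary(weights);
        System.out.println(vocab);
        System.out.println(java.util.Arrays.deepToString(toMatrix(weights, vocab)));
    }
}
```

Because most plots contain only a small fraction of all tokens, such a matrix is mostly zeros, which is one reason the dimensionality reduction in step 5 is useful.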
#Implementation
The Jena RDF API is used for retrieving movie URIs, titles and plots from an RDF triplestore. The triplestore contains movie data gathered from DBpedia, whereas the movie plots were added from Wikipedia.
Movie plot tokenization is performed using the Stanford CoreNLP framework, with an added list of stop-words from the Lucene library.
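
To illustrate what tokenization with stop-word removal produces, here is a minimal, dependency-free sketch. It is not the project's code: the project relies on Stanford CoreNLP and Lucene's stop-word list, whereas the class `PlotTokenizerSketch`, the regex-based splitting, and the tiny stop-word set below are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the project: a simplified stand-in for the tokenizer.
class PlotTokenizerSketch {

    // Tiny illustrative stop-word set; the project takes its list from Lucene.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to", "in", "is"));

    // Lower-cases the plot, splits on runs of non-letter characters, and drops stop-words.
    static List<String> tokenize(String plot) {
        List<String> tokens = new ArrayList<>();
        for (String raw : plot.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [hobbit, leaves, shire, destroy, ring]
        System.out.println(tokenize("The hobbit leaves the Shire to destroy a ring."));
    }
}
```

A real tokenizer such as CoreNLP handles punctuation, contractions and sentence boundaries far more carefully than a regex split does.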
TF-IDF calculations are performed using a custom-written class.
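
The exact weighting used by the custom class is not described here, so the following is only a sketch of one common TF-IDF variant. It assumes term frequency is the term count divided by the plot length, and inverse document frequency is the natural logarithm of (number of plots / number of plots containing the term). The class `TfIdfSketch` is hypothetical and may differ from the project's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the project: one common TF-IDF variant.
class TfIdfSketch {

    // Returns one term -> weight map per tokenized plot, in input order.
    static List<Map<String, Double>> tfIdf(List<List<String>> plots) {
        // Document frequency: in how many plots each term occurs at least once.
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (List<String> plot : plots) {
            for (String term : new HashSet<>(plot)) {
                documentFrequency.merge(term, 1, Integer::sum);
            }
        }

        List<Map<String, Double>> result = new ArrayList<>();
        for (List<String> plot : plots) {
            // Raw term counts within this plot.
            Map<String, Integer> counts = new HashMap<>();
            for (String term : plot) {
                counts.merge(term, 1, Integer::sum);
            }
            Map<String, Double> weights = new HashMap<>();
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                double tf = (double) entry.getValue() / plot.size();
                double idf = Math.log((double) plots.size() / documentFrequency.get(entry.getKey()));
                weights.put(entry.getKey(), tf * idf);
            }
            result.add(weights);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> plots = Arrays.asList(
                Arrays.asList("ring", "hobbit", "ring"),
                Arrays.asList("ring", "wizard"));
        System.out.println(tfIdf(plots));
    }
}
```

With this variant, a term that occurs in every plot (such as "ring" above) gets weight 0, since it does not help to tell plots apart.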
Matrices and vectors are generated using the RealMatrix interface and its implementation BlockRealMatrix, together with RealVector, from Apache Commons Math.
SVD calculation is performed using the Apache SVD class.
Cosine similarity is calculated by a method that is a part of the RealVector implementation.
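
For reference, the cosine similarity of two vectors is their dot product divided by the product of their Euclidean norms. The plain-Java sketch below shows that computation and its application to all pairs of rows. It is an illustration only: the class `CosineSketch` is hypothetical, arrays stand in for the vector type used in the project, and returning 0 for a zero vector is an assumption of this sketch (the library method may behave differently in that case).

```java
import java.util.Arrays;

// Hypothetical helper, not part of the project: cosine similarity on plain arrays.
class CosineSketch {

    // cos(a, b) = (a . b) / (|a| * |b|); this sketch returns 0 if either norm is 0.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Computes the similarity for every pair of rows; the result is symmetric.
    static double[][] pairwise(double[][] rows) {
        int n = rows.length;
        double[][] similarity = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                similarity[i][j] = cosine(rows[i], rows[j]);
                similarity[j][i] = similarity[i][j];
            }
        }
        return similarity;
    }

    public static void main(String[] args) {
        // Each row stands for one movie plot in the reduced space.
        double[][] reduced = { {1.0, 2.0}, {2.0, 4.0}, {-2.0, 1.0} };
        System.out.println(Arrays.deepToString(pairwise(reduced)));
    }
}
```

Row `i` of the resulting matrix holds the similarities between movie `i` and every other movie, which matches the per-movie layout of the .csv output.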
All of the data is written into the appropriate .csv files using the opencsv library.
Steps 1 and 2 of the workflow above are handled by the dataRetrieval/DataSelection class. Custom TF-IDF calculations are performed by the methods of the measures/MeasureTFIDF class. All matrix generation and calculation is performed by the methods of the measures/MatrixGenerator class. The measures/CosineSimilarity class is used for calculating the similarity of the vectors representing movie plots.
#Notes
In order to run the application, it is necessary to download the Stanford CoreNLP library and add it to the build path of the project.
#Acknowledgements
This application has been developed as a part of the project assignment for the subject Intelligent Systems at the Faculty of Organization Sciences, University of Belgrade, Serbia.