- Python 3.x and pip
- Gensim, NumPy, NLTK, NLTK Trainer, spaCy, scikit-learn, Pandas, Pyphen, Pyspellchecker
It's highly recommended to create a virtualenv before installing the dependencies.
- Dependencies

```sh
pip3 install virtualenv
virtualenv <YOU_NAME_IT>
source <THE_NAME_ABOVE>/bin/activate
pip install -r requirements.txt
sh setup.sh
```
- NLTK setup (within a Python terminal)

```python
import nltk
nltk.download('punkt')
nltk.download('mac_morpho')
nltk.download('stopwords')
```
The steps above download the corpora into your `nltk_data` folder (`~/nltk_data`).
# Usage
- TBD
- Extracts textual content from different sources (PDF, Docs, and plain text files)
- Converts textual documents into stylometric features
- Contains Random Forest and simple neural network classifiers trained on the data described in the next section
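To illustrate what "stylometric features" means here, below is a minimal pure-Python sketch. The actual project computes ~50 features; the three shown (and their names) are hypothetical stand-ins for the same idea:

```python
import re
import string

def stylometric_features(text: str) -> dict:
    """Toy stylometric feature extractor (illustrative only).

    The real pipeline produces ~50 features; these three are
    common, simple examples of the same kind of signal.
    """
    words = re.findall(r"[A-Za-z']+", text)
    n_words = len(words) or 1  # avoid division by zero on empty input
    return {
        # average word length in characters
        'avg_word_len': sum(len(w) for w in words) / n_words,
        # vocabulary richness: distinct words / total words
        'type_token_ratio': len({w.lower() for w in words}) / n_words,
        # punctuation marks per word
        'punct_per_word': sum(text.count(p) for p in string.punctuation) / n_words,
    }

features = stylometric_features("The cat sat on the mat. The cat slept.")
```

A feature vector like this, computed per document, is what the classifiers above consume instead of raw text.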
- There are two main types of data set inside the data/parsed-data folder:
  - Regular data files, with textual content and a masked author name
  - Stylometric data files, which represent the conversion of the raw text into stylometric features (~50)
PS: Each data set comes in two versions: 'selected' means that authors with fewer than 3 samples were removed; 'data' is the complete data set with no exclusions.
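The 'selected' filtering described above can be sketched as follows. The threshold of 3 samples per author comes from the note; the `(author, text)` pair structure is a hypothetical stand-in for the real file format:

```python
from collections import Counter

def select_samples(samples, min_per_author=3):
    """Keep only samples whose author has at least `min_per_author`
    samples in the data set -- the 'selected' variant described above.

    `samples` is a hypothetical list of (author, text) pairs.
    """
    counts = Counter(author for author, _ in samples)
    return [(a, t) for a, t in samples if counts[a] >= min_per_author]

data = [
    ('alice', 'text 1'), ('alice', 'text 2'), ('alice', 'text 3'),
    ('bob', 'text 4'), ('bob', 'text 5'),   # only 2 samples -> excluded
]
selected = select_samples(data)
```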