Running Hermes
One of the easiest ways to run the Hermes code is to run Spark as a standalone instance. The instructions below are for a Linux box in particular, but only the Anaconda download differs for other operating systems.
There are currently both Python 2.7 and Python 3.5 compatible versions of the Hermes project. Both versions work with Python 2.7, but the Python 2.7 version will not run on Python 3.5. Both versions work best with Spark 2.0.
Install Anaconda
wget https://repo.continuum.io/archive/Anaconda2-4.2.0-Linux-x86_64.sh
chmod +x Anaconda2-4.2.0-Linux-x86_64.sh
./Anaconda2-4.2.0-Linux-x86_64.sh
Install Hermes dependencies
conda install networkx xlrd beautifulsoup4
pip install rdflib pyshp
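After installing, a quick sanity check can confirm that the dependencies resolve on the Python interpreter you intend to use. This is a minimal sketch; the helper name `check_missing` is ours, not part of Hermes:

```python
import importlib

def check_missing(modules):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

# Module names as imported, which can differ from the package names above
# (beautifulsoup4 is imported as bs4, pyshp as shapefile).
deps = ["networkx", "xlrd", "bs4", "rdflib", "shapefile"]
print(check_missing(deps))  # an empty list means everything is installed
```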
Install Spark
wget http://d3kbcqa49mib13.cloudfront.net/spark-2.0.2-bin-hadoop2.7.tgz
tar xzf spark-2.0.2-bin-hadoop2.7.tgz
Clone the Hermes repository. While you are at it, create a zip file of the Hermes code, which is what we use to run scripts. We have found that zipping the code yourself generally yields better results than using a zip file we push onto GitHub.
git clone https://github.com/Lab41/hermes.git
cd hermes/
zip -r hermes.zip src __init__.py
cd ..
Convert the data into JSON files. The exact format depends on which dataset you are working with. For example, to convert the Kaggle dataset, you would do the following (adjusting the paths to wherever your files are located):
python hermes/src/utils/kaggle_etl/scripts_to_json.py /path/to/kaggle/files/ -o /output/directory/
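Spark's JSON reader expects newline-delimited JSON, one object per line; the converters target that layout, but check the generated files to confirm. The record below is illustrative only, not the actual Kaggle schema:

```python
import json

# Hypothetical record in newline-delimited JSON; the real field names
# depend on the dataset and converter.
line = '{"user_id": 1, "item_id": 42, "rating": 4.0}'

record = json.loads(line)
print(record["rating"])
```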
Run Spark
cd spark-2.0.2-bin-hadoop2.7
PYSPARK_PYTHON=python3 bin/pyspark --master local[30] --driver-memory 8g
While you are in the Spark shell, execute a Hermes Python script
# Python 2 Version
execfile('hermes_script.py')
# Python 3 Version
exec(open('hermes_script.py').read())