This tutorial shows how to build Spark-based machine learning models that capture word meanings. You will learn how to build a word2vec model from Twitter data on IBM's Watson Studio using Apache Spark.
To read the blog, please click here.
Step 1. If you already have an account on IBM's Watson Studio, go to Step 2. If not, you can create a free account here.
- Download some tweets from here to your laptop, without uncompressing them. The tweets.gz file contains a 10% sample (taken with the Twitter decahose API) of a 15-minute batch of the public tweets from December 23rd. The compressed file is 116 MB (the compression ratio is about 10 to 1).
- Go to your recently created project on Watson Studio and click on the +New data asset icon.
- Load the file by clicking on browse or by dropping it in.
- Once the file is loaded, click on Apply to add this file to your project. You should see your tweets file under the data assets list of your project.
- Go to your environments tab and create a Default Spark Python 3.5 XS environment.
- Click on From URL (3rd tab), choose a name for your notebook (e.g., "Spark-based ML to capture word meaning"), copy and paste this URL https://github.com/IBMDataScience/word2vec/blob/master/Spark-based%20machine%20learning%20for%20word%20meanings.ipynb into the Notebook URL rectangle, and click on Create Notebook.
You are now in your new notebook and the rest of the instructions are in there.
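If you want to sanity-check the downloaded tweets.gz on your laptop before uploading it, the following is a minimal sketch that reads the first few records while leaving the file compressed on disk. It rests on assumptions that are not stated in this tutorial: that the file holds one JSON record per line, and that the tweet body lives in a field called text (pass a different text_field if your records differ). The helper name peek_tweets is hypothetical and is not part of the notebook.

```python
import gzip
import json


def peek_tweets(path, limit=5, text_field="text"):
    """Return the `text_field` value of the first `limit` records that have it.

    Assumes (not confirmed by the tutorial) a gzipped file with one JSON
    object per line. Lines that are blank, malformed, or lack the field
    are skipped.
    """
    texts = []
    # Text mode ("rt") decompresses on the fly, so the archive on disk
    # is never unpacked.
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            if len(texts) >= limit:
                break
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip lines that are not valid JSON
            if isinstance(record, dict) and text_field in record:
                texts.append(record[text_field])
    return texts


# Example call (hypothetical local path):
#   for text in peek_tweets("tweets.gz", limit=3):
#       print(text)
```

Streaming line by line keeps memory use small, which matters here because a 116 MB file at a roughly 10 to 1 compression ratio expands to more than 1 GB.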