Week 3
NOTE: THIS EXERCISE HAS BEEN POSTPONED UNTIL NEXT WEEK (WEEK 4) DUE TO PROBLEMS WITH THE UKKO CLUSTER. APOLOGIES FOR THE INCONVENIENCE.
NOTE ADDED 26 SEPT 2017: UKKO NODES 3-16 SHOULD BE ACCESSIBLE NOW.
Thanks to Prof Jiaheng Lu for providing this assignment!
General instruction: In this exercise, you will use Hadoop to perform a two-table join on a large data set. We provide the data sets. You may use the Ukko cluster (or other available computing machines) to run your programs. Please read the instructions on Hadoop programming and the WordCount example. (Updated on October 1, 2017)
Pay attention to the running time of your programs, the quality of the program code, and associated documentation. Make sure to find all correct results and then try to reduce the running time of your programs.
-
Implement one executable Hadoop MapReduce program to perform the inner join of the two tables on Student ID, satisfying the following two filtering conditions simultaneously:
a. the year of birth is greater than (>) 1990, and
b. the score of course 1 is greater than (>) 80 and that of course 2 is no more than (<=) 95.
Sample data of Student table:
| Student ID  | Name                  | Year of Birth |
|-------------|-----------------------|---------------|
| 20170126453 | Kristalee Copperwaite | 2000          |
| 20170433596 | Roeberta Naden        | 1997          |
Sample data of Score table:
| Student ID  | Score for course1 | Score for course2 | Score for course3 |
|-------------|-------------------|-------------------|-------------------|
| 20170126453 | 93                | 97                | 80                |
| 20170140241 | 86                | 85                | 87                |
| 20170433596 | 82                | 60                | 80                |
Join result:
| Student ID  | Name           | Year of Birth | Score for course1 | Score for course2 | Score for course3 |
|-------------|----------------|---------------|-------------------|-------------------|-------------------|
| 20170433596 | Roeberta Naden | 1997          | 82                | 60                | 80                |
Click here to download the data.
There are two files in the folder: one `Score` table and one `Student` table. `Student` is a big table, but `Score` is small, with three thousand tuples. The Reduce-side join algorithm from the lecture may not be the most efficient one in this case. For example, consider using a distributed cache with Hadoop.
-
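As a sketch of the distributed-cache idea, the small `Score` table can be loaded into memory in every mapper and the join done entirely on the map side, e.g. with Hadoop Streaming in Python. The file name `score.txt`, the delimiters, and the column layout below are assumptions about the provided data, not confirmed details.

```python
# Map-side join sketch for Hadoop Streaming. The small Score table is
# shipped to every mapper via the distributed cache (-files score.txt),
# loaded into a dict, and each Student record is joined and filtered in
# the map phase -- no reduce step is needed. File names and column
# layout are assumptions about the provided data.

def load_scores(path):
    """Read the small Score table into a dict: id -> (c1, c2, c3)."""
    scores = {}
    with open(path) as f:
        for line in f:
            sid, c1, c2, c3 = line.split()
            scores[sid] = (int(c1), int(c2), int(c3))
    return scores

def map_join(student_lines, scores):
    """Inner-join Student rows with the cached scores, applying both filters."""
    for line in student_lines:
        sid, name, year = line.rstrip("\n").split("\t")
        if int(year) <= 1990 or sid not in scores:
            continue
        c1, c2, c3 = scores[sid]
        if c1 > 80 and c2 <= 95:
            yield "\t".join([sid, name, year, str(c1), str(c2), str(c3)])

# In the actual streaming job the mapper would run:
#   import sys
#   for row in map_join(sys.stdin, load_scores("score.txt")):
#       print(row)
```

On the sample data above, only student 20170433596 survives both filters (20170126453 fails the course 2 condition), matching the sample join result.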
Run your MapReduce programs and then report the result size and the performance, e.g. the elapsed time. (Hint: you can get the job information and the elapsed time through the Web interface of the master node; the default address in a Hadoop cluster is `hostname:8088`.) In the documentation, explain how your code solves the problem and how it uses Hadoop. In particular, analyze your results by answering the following questions:
a. What is the result size of the join, and what is the elapsed time of your program?
b. How many computer nodes did you use to run the program? Have you tried to reduce the running time by using more nodes in the Ukko cluster?
c. Upload your source code and analyze the performance of your program. What optimizations have you tried to reduce the running time?
References for your programming
- Miner D, Shook A. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. O'Reilly Media, Inc., 2012.
- Lam C. Hadoop in Action. Manning Publications Co., 2010.
- Bakhshi R. How to Plan and Configure Yarn and MapReduce 2 in HDP 2.0, HortonWorks.com
We'll be looking into machine learning by checking out the HASYv2 dataset, which contains handwritten mathematical symbols as images. The whole dataset is quite big, so we'll restrict ourselves to doing 10-class classification on some of the symbols. Download the data and complete the following tasks.
-
Extract the data and find inside a file called `hasy-data-labels.csv`. This file contains the labels for each of the images in the `hasy_data` folder. Read the labels in and only keep the rows where the `symbol_id` is within the inclusive range `[70, 80]`. Read the corresponding images as black-and-white images and flatten them so that each image is a single vector of shape `32x32 == 1024`. Your dataset should now consist of your input data of shape `(1020, 1024)` and your labels of shape `(1020,)`. That is, a matrix of shape `1020 x 1024` and a vector of size `1020`.
-
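One possible shape for this loading step. The `path` and `symbol_id` column names, the use of matplotlib to read the PNGs, and keeping a single color channel are all choices and assumptions, not requirements:

```python
import csv
import numpy as np

def filter_labels(rows, low=70, high=80):
    """Keep only rows whose symbol_id lies in the inclusive range [low, high]."""
    return [r for r in rows if low <= int(r["symbol_id"]) <= high]

def load_dataset(label_csv="hasy-data-labels.csv"):
    """Read the filtered labels, then load each 32x32 image as a flat 1024-vector.

    Assumes the CSV's path column points at the images relative to the
    working directory, as in the extracted archive.
    """
    import matplotlib.pyplot as plt  # plt.imread is one way to read PNGs
    with open(label_csv) as f:
        rows = filter_labels(csv.DictReader(f))
    # Keep one channel so the images are black-and-white, then flatten.
    X = np.array([plt.imread(r["path"])[:, :, 0].reshape(1024) for r in rows])
    y = np.array([int(r["symbol_id"]) for r in rows])
    return X, y  # expected shapes: (1020, 1024) and (1020,)
```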
Shuffle the data, and then split it into training and test sets, using the first 80% of the data for training and the rest for evaluation.
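A minimal way to do the shuffle and 80/20 split with NumPy (the fixed `seed` is only for reproducibility):

```python
import numpy as np

def split_train_test(X, y, train_frac=0.8, seed=0):
    """Shuffle X and y with the same permutation, then cut at train_frac."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(train_frac * len(X))
    train, test = idx[:cut], idx[cut:]
    return X[train], y[train], X[test], y[test]
```

With the `(1020, 1024)` data this gives 816 training and 204 test examples.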
-
Fit a logistic regression classifier on your data. Note that since logistic regression is a binary classifier, you will have to, for example, use a so-called "one-vs-all" strategy where the prediction task is formulated as "is the input class X or one of the others?" and the classifier selects the class with the highest probability. Most library implementations will do this for you - feel free to use one.
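For instance, with scikit-learn (an assumption; any library with a multi-class logistic regression works) the one-vs-all handling is internal:

```python
from sklearn.linear_model import LogisticRegression

def fit_logreg(X_train, y_train):
    """Multi-class logistic regression; sklearn applies a one-vs-rest or
    multinomial scheme internally, so no manual loop over classes is needed."""
    return LogisticRegression(max_iter=1000).fit(X_train, y_train)
```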
-
To get an idea of how well the model did, let's create our own classifier that simply guesses the most common class in the training set. Then, evaluate your logistic regression model on the test data and compare it to the majority-class classifier. The logistic regression model should have significantly better accuracy - our naive model is merely making a guess.
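The baseline needs only a few lines; the class below mimics scikit-learn's interface and behaves like sklearn's `DummyClassifier(strategy="most_frequent")`:

```python
import numpy as np

class MajorityClassifier:
    """Always predicts the most common class seen during fit."""

    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)

    def score(self, X, y):
        return float(np.mean(self.predict(X) == y))
```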
-
Plot some of the images that the classifier classified wrongly. Can you think of why this happens? Would you have gotten it right? Hint: scipy has a function for this.
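One sketch for inspecting the mistakes: collect the misclassified test indices, then draw a few of them as 32x32 images. Matplotlib is an assumption here; the scipy function the hint refers to is presumably for reading the images in the first place.

```python
import numpy as np

def misclassified(y_true, y_pred):
    """Indices where the prediction differs from the true label."""
    return np.flatnonzero(np.asarray(y_true) != np.asarray(y_pred))

def plot_mistakes(X_test, y_true, y_pred, n=8):
    """Draw up to n wrongly classified images with predicted vs true labels."""
    import matplotlib.pyplot as plt
    idx = misclassified(y_true, y_pred)[:n]
    fig, axes = plt.subplots(1, len(idx), figsize=(2 * len(idx), 2))
    for ax, i in zip(np.atleast_1d(axes), idx):
        ax.imshow(X_test[i].reshape(32, 32), cmap="gray")
        ax.set_title(f"pred {y_pred[i]}, true {y_true[i]}")
        ax.axis("off")
    plt.show()
```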
Note that you are meant to use Python in this exercise. However, if you can find a suitable AutoML implementation for your favorite language (e.g. here seems to be one for R), then you are free to use that language as well.
-
This time, train a random forest classifier on the data. A random forest is a collection of decision trees, which makes it an ensemble of classifiers. Each tree uses a random subset of the features to make its prediction. Without tuning any parameters, how is the accuracy?
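With scikit-learn (assumed, as above), the untuned forest is one call:

```python
from sklearn.ensemble import RandomForestClassifier

def fit_forest(X_train, y_train):
    """Random forest with default hyperparameters (100 trees in recent sklearn)."""
    return RandomForestClassifier(random_state=0).fit(X_train, y_train)
```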
-
The number of trees in the random forest is an example of a hyperparameter, because it is a parameter that is set prior to the learning process. In contrast, a parameter is a value in the model that is learned from the data. Train 20 classifiers with varying numbers of decision trees, starting from 10 and going up to 200, and plot the test accuracy as a function of the number of trees. Does the accuracy keep increasing? Is more better?
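The sweep can be written directly from the description: `np.linspace(10, 200, 20)` gives exactly 20 evenly spaced tree counts (scikit-learn assumed, as above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def accuracy_per_forest_size(X_train, y_train, X_test, y_test):
    """Fit 20 forests with 10, 20, ..., 200 trees; return sizes and accuracies."""
    sizes = np.linspace(10, 200, 20).astype(int)
    accs = [RandomForestClassifier(n_estimators=n, random_state=0)
            .fit(X_train, y_train).score(X_test, y_test)
            for n in sizes]
    return sizes, accs

# Plotting, e.g.:
#   import matplotlib.pyplot as plt
#   sizes, accs = accuracy_per_forest_size(X_train, y_train, X_test, y_test)
#   plt.plot(sizes, accs)
#   plt.xlabel("number of trees"); plt.ylabel("test accuracy")
```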
-
If we had picked the number of decision trees by taking the value with the best test accuracy from the last plot, we would have overfit our hyperparameters to the test data. Can you see why it is a mistake to tune the hyperparameters of your model using the test data?
-
Reshuffle and resplit the data so that it is divided into 3 parts: training (80%), validation (10%) and test (10%). Repeatedly train a model of your choosing (e.g. a random forest) on the training data and evaluate its performance on the validation set, tuning the hyperparameters so that the accuracy on the validation set increases. Then, finally, evaluate the performance of your model on the test data. What can you say in terms of the generalization of your model?
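The three-way split is the same permutation trick as before, with two cut points:

```python
import numpy as np

def split_three_way(X, y, seed=0):
    """Shuffle, then split 80/10/10 into train, validation and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    a, b = int(0.8 * len(X)), int(0.9 * len(X))
    train, val, test = idx[:a], idx[a:b], idx[b:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])
```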
-
This process of picking a suitable model, evaluating its performance and tuning the hyperparameters is very time consuming. A new idea in machine learning is to automate this by using an optimization algorithm to find the best model in the space of models and their hyperparameters. Have a look at TPOT, an automated ML solution that finds a good model and a good set of hyperparameters automatically. Try it on this data; it should easily outperform simple models like the ones we tried. Note that running the algorithm might take a while, depending on the strength of your computer. TPOT uses cross-validation internally, so we don't need our own validation set.
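A minimal TPOT sketch, following TPOT's documented `TPOTClassifier` API; the search-budget settings (`generations`, `population_size`) here are arbitrary choices, and the run can take a long while:

```python
def run_tpot(X_train, y_train, X_test, y_test):
    """Let TPOT search for a pipeline; it cross-validates internally."""
    from tpot import TPOTClassifier  # pip install tpot
    tpot = TPOTClassifier(generations=5, population_size=20,
                          random_state=0, verbosity=2)
    tpot.fit(X_train, y_train)
    print("test accuracy:", tpot.score(X_test, y_test))
    tpot.export("best_pipeline.py")  # writes the winning pipeline as a script
```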