Week 3
NOTE: THIS EXERCISE HAS BEEN POSTPONED UNTIL NEXT WEEK (WEEK 4) DUE TO PROBLEMS WITH THE UKKO CLUSTER. APOLOGIES FOR THE INCONVENIENCE.
NOTE ADDED 26 SEPT 2017: UKKO NODES 3-16 SHOULD BE ACCESSIBLE NOW.
Thanks to Prof Jiaheng Lu for providing this assignment!
General instruction: In this exercise, you will use Hadoop to perform a two-table join on a large data set. We provide the data sets. You may use the Ukko cluster (or other available computing machines) to run your programs. Please read the instructions on Hadoop programming and the WordCount example. (Updated on October 1, 2017)
Pay attention to the running time of your programs, the quality of the program code, and associated documentation. Make sure to find all correct results and then try to reduce the running time of your programs.
-
Implement one executable Hadoop MapReduce program to perform the inner join of the two tables on Student ID, satisfying the following two filtering conditions simultaneously:
a. the year of birth is greater than (>) 1990, and
b. the score of course 1 is greater than (>) 80 and that of course 2 is no more than (<=) 95.
Sample data of Student table:
| Student ID  | Name                  | Year of Birth |
|-------------|-----------------------|---------------|
| 20170126453 | Kristalee Copperwaite | 2000          |
| 20170433596 | Roeberta Naden        | 1997          |
Sample data of Score table:
| Student ID  | Score for course1 | Score for course2 | Score for course3 |
|-------------|-------------------|-------------------|-------------------|
| 20170126453 | 93                | 97                | 80                |
| 20170140241 | 86                | 85                | 87                |
| 20170433596 | 82                | 60                | 80                |
Join result:
| Student ID  | Name           | Year of Birth | Score for course1 | Score for course2 | Score for course3 |
|-------------|----------------|---------------|-------------------|-------------------|-------------------|
| 20170433596 | Roeberta Naden | 1997          | 82                | 60                | 80                |
Click here to download the data.
There are two files in the folder: one `Score` table and one `Student` table. `Student` is a big table, but `Score` is small, with three thousand tuples. The Reduce-side join algorithm from the lecture may not be the most efficient one in this case. For example, consider using a distributed cache with Hadoop.
-
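As a sketch of the distributed-cache idea, the small `Score` table can be loaded into memory in every mapper and the join done entirely on the map side, e.g. with Hadoop Streaming in Python. The file name `score.txt`, the delimiters, and the column layout below are assumptions about the provided data, not confirmed details.

```python
# Map-side join sketch for Hadoop Streaming. The small Score table is
# shipped to every mapper via the distributed cache (-files score.txt),
# loaded into a dict, and each Student record is joined and filtered in
# the map phase -- no reduce step is needed. File names and column
# layout are assumptions about the provided data.

def load_scores(path):
    """Read the small Score table into a dict: id -> (c1, c2, c3)."""
    scores = {}
    with open(path) as f:
        for line in f:
            sid, c1, c2, c3 = line.split()
            scores[sid] = (int(c1), int(c2), int(c3))
    return scores

def map_join(student_lines, scores):
    """Inner-join Student rows with the cached scores, applying both filters."""
    for line in student_lines:
        sid, name, year = line.rstrip("\n").split("\t")
        if int(year) <= 1990 or sid not in scores:
            continue
        c1, c2, c3 = scores[sid]
        if c1 > 80 and c2 <= 95:
            yield "\t".join([sid, name, year, str(c1), str(c2), str(c3)])

# In the actual streaming job the mapper would run:
#   import sys
#   for row in map_join(sys.stdin, load_scores("score.txt")):
#       print(row)
```

On the sample data above, only student 20170433596 survives both filters (20170126453 fails the course 2 condition), matching the sample join result.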
Run your MapReduce programs and then report the result size and the performance, e.g. the elapsed time. (Hint: you can get the job information and the elapsed time through the Web interface of the master node; the default address in a Hadoop cluster is `hostname:8088`.) In the documentation, explain how your code solves the problem and how it uses Hadoop. In particular, analyze your results by answering the following questions:
a. What is the result size of the join, and what is the elapsed time of your program?
b. How many computer nodes did you use to run the program? Have you tried to reduce the running time by using more nodes in the Ukko cluster?
c. Upload your source code and analyze the performance of your program. What optimizations have you tried to reduce the running time?
References for your programming
- Miner D, Shook A. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. O'Reilly Media, Inc., 2012.
- Lam C. Hadoop in Action. Manning Publications Co., 2010.
- Bakhshi R. How to Plan and Configure Yarn and MapReduce 2 in HDP 2.0, HortonWorks.com
We'll be looking into machine learning by checking out the HASYv2 dataset, which contains handwritten mathematical symbols as images. The whole dataset is quite big, so we'll restrict ourselves to doing 10-class classification on some of the symbols. Download the data and complete the following tasks.
-
Extract the data and find inside a file called `hasy-data-labels.csv`. This file contains the labels for each of the images in the `hasy_data` folder. Read the labels in and only keep the rows where the `symbol_id` is within the inclusive range `[70, 80]`. Read the corresponding images as black-and-white images and flatten them so that each image is a single vector of shape `32x32 == 1024`. Your dataset should now consist of your input data of shape `(1020, 1024)` and your labels of shape `(1020,)`. That is, a matrix of shape `1020 x 1024` and a vector of size `1020`.
-
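One possible shape for this loading step. The `path` and `symbol_id` column names, the use of matplotlib to read the PNGs, and keeping a single color channel are all choices and assumptions, not requirements:

```python
import csv
import numpy as np

def filter_labels(rows, low=70, high=80):
    """Keep only rows whose symbol_id lies in the inclusive range [low, high]."""
    return [r for r in rows if low <= int(r["symbol_id"]) <= high]

def load_dataset(label_csv="hasy-data-labels.csv"):
    """Read the filtered labels, then load each 32x32 image as a flat 1024-vector.

    Assumes the CSV's path column points at the images relative to the
    working directory, as in the extracted archive.
    """
    import matplotlib.pyplot as plt  # plt.imread is one way to read PNGs
    with open(label_csv) as f:
        rows = filter_labels(csv.DictReader(f))
    # Keep one channel so the images are black-and-white, then flatten.
    X = np.array([plt.imread(r["path"])[:, :, 0].reshape(1024) for r in rows])
    y = np.array([int(r["symbol_id"]) for r in rows])
    return X, y  # expected shapes: (1020, 1024) and (1020,)
```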
Shuffle the data, and then split it into training and test sets, using the first 80% of the data for training and the rest for evaluation.
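A minimal way to do the shuffle and 80/20 split with NumPy (the fixed `seed` is only for reproducibility):

```python
import numpy as np

def split_train_test(X, y, train_frac=0.8, seed=0):
    """Shuffle X and y with the same permutation, then cut at train_frac."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(train_frac * len(X))
    train, test = idx[:cut], idx[cut:]
    return X[train], y[train], X[test], y[test]
```

With the `(1020, 1024)` data this gives 816 training and 204 test examples.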
-
Fit a logistic regression classifier on your data. Note that since logistic regression is a binary classifier, you will have to, for example, use a so-called "one-vs-all" strategy where the prediction task is formulated as "is the input class X or one of the others?" and the classifier selects the class with the highest probability. Most library implementations will do this for you - feel free to use one.
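For instance, with scikit-learn (an assumption; any library with a multi-class logistic regression works) the one-vs-all handling is internal:

```python
from sklearn.linear_model import LogisticRegression

def fit_logreg(X_train, y_train):
    """Multi-class logistic regression; sklearn applies a one-vs-rest or
    multinomial scheme internally, so no manual loop over classes is needed."""
    return LogisticRegression(max_iter=1000).fit(X_train, y_train)
```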
-
To get an idea of how well the model did, let's create our own classifier that simply guesses the most common class in the training set. Then, evaluate your logistic regression model on the test data and compare it to the majority-class classifier. The logistic regression model should have significantly better accuracy - our naive model is merely making a guess.
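The baseline needs only a few lines; the class below mimics scikit-learn's interface and behaves like sklearn's `DummyClassifier(strategy="most_frequent")`:

```python
import numpy as np

class MajorityClassifier:
    """Always predicts the most common class seen during fit."""

    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)

    def score(self, X, y):
        return float(np.mean(self.predict(X) == y))
```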
-
Plot some of the images that the classifier classified wrongly. Can you think of why this happens? Would you have gotten it right? Hint: scipy has a function for this.
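One sketch for inspecting the mistakes: collect the misclassified test indices, then draw a few of them as 32x32 images. Matplotlib is an assumption here; the scipy function the hint refers to is presumably for reading the images in the first place.

```python
import numpy as np

def misclassified(y_true, y_pred):
    """Indices where the prediction differs from the true label."""
    return np.flatnonzero(np.asarray(y_true) != np.asarray(y_pred))

def plot_mistakes(X_test, y_true, y_pred, n=8):
    """Draw up to n wrongly classified images with predicted vs true labels."""
    import matplotlib.pyplot as plt
    idx = misclassified(y_true, y_pred)[:n]
    fig, axes = plt.subplots(1, len(idx), figsize=(2 * len(idx), 2))
    for ax, i in zip(np.atleast_1d(axes), idx):
        ax.imshow(X_test[i].reshape(32, 32), cmap="gray")
        ax.set_title(f"pred {y_pred[i]}, true {y_true[i]}")
        ax.axis("off")
    plt.show()
```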
Note that you are meant to use Python in this exercise. However, if you can find a suitable AutoML implementation for your favorite language (e.g. here seems to be one for R), then you are free to use that language as well.
-
This time, train a random forest classifier on the data. A random forest is a collection of decision trees, which makes it an ensemble of classifiers. Each tree uses a random subset of the features to make its prediction. Without tuning any parameters, how is the accuracy?
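With scikit-learn (assumed, as above), the untuned forest is one call:

```python
from sklearn.ensemble import RandomForestClassifier

def fit_forest(X_train, y_train):
    """Random forest with default hyperparameters (100 trees in recent sklearn)."""
    return RandomForestClassifier(random_state=0).fit(X_train, y_train)
```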
-
The number of trees in the random forest is an example of a hyperparameter, because it is a parameter that is set prior to the learning process. In contrast, a parameter is a value in the model that is learned from the data. Train 20 classifiers with varying numbers of decision trees, starting from 10 and going up to 200, and plot the test accuracy as a function of the number of trees. Does the accuracy keep increasing? Is more better?
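The sweep can be written directly from the description: `np.linspace(10, 200, 20)` gives exactly 20 evenly spaced tree counts (scikit-learn assumed, as above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def accuracy_per_forest_size(X_train, y_train, X_test, y_test):
    """Fit 20 forests with 10, 20, ..., 200 trees; return sizes and accuracies."""
    sizes = np.linspace(10, 200, 20).astype(int)
    accs = [RandomForestClassifier(n_estimators=n, random_state=0)
            .fit(X_train, y_train).score(X_test, y_test)
            for n in sizes]
    return sizes, accs

# Plotting, e.g.:
#   import matplotlib.pyplot as plt
#   sizes, accs = accuracy_per_forest_size(X_train, y_train, X_test, y_test)
#   plt.plot(sizes, accs)
#   plt.xlabel("number of trees"); plt.ylabel("test accuracy")
```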
-
If we had picked the number of decision trees by taking the value with the best test accuracy from the last plot, we would have overfit our hyperparameters to the test data. Can you see why it is a mistake to tune the hyperparameters of your model using the test data?
-
Reshuffle and resplit the data so that it is divided into 3 parts: training (80%), validation (10%) and test (10%). Repeatedly train a model of your choosing (e.g. a random forest) on the training data and evaluate its performance on the validation set, tuning the hyperparameters so that the accuracy on the validation set increases. Then, finally, evaluate the performance of your model on the test data. What can you say in terms of the generalization of your model?
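The three-way split is the same permutation trick as before, with two cut points:

```python
import numpy as np

def split_three_way(X, y, seed=0):
    """Shuffle, then split 80/10/10 into train, validation and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    a, b = int(0.8 * len(X)), int(0.9 * len(X))
    train, val, test = idx[:a], idx[a:b], idx[b:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])
```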
-
This process of picking a suitable model, evaluating its performance and tuning the hyperparameters is very time consuming. A new idea in machine learning is to automate this by using an optimization algorithm to find the best model in the space of models and their hyperparameters. Have a look at TPOT, an automated ML solution that finds a good model and a good set of hyperparameters automatically. Try it on this data; it should easily outperform simple models like the ones we tried. Note that running the algorithm might take a while, depending on the strength of your computer. TPOT uses cross-validation internally, so we don't need our own validation set.
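A minimal TPOT sketch, following TPOT's documented `TPOTClassifier` API; the search-budget settings (`generations`, `population_size`) here are arbitrary choices, and the run can take a long while:

```python
def run_tpot(X_train, y_train, X_test, y_test):
    """Let TPOT search for a pipeline; it cross-validates internally."""
    from tpot import TPOTClassifier  # pip install tpot
    tpot = TPOTClassifier(generations=5, population_size=20,
                          random_state=0, verbosity=2)
    tpot.fit(X_train, y_train)
    print("test accuracy:", tpot.score(X_test, y_test))
    tpot.export("best_pipeline.py")  # writes the winning pipeline as a script
```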