Introduction to Using R with Hadoop
========================================================
author: Andrie de Vries & Michele Usuelli
date: 2015-07-01, UseR!2015
width: 1680
height: 1050
css: css/custom.css
```{r include=FALSE}
knitr::opts_chunk$set(cache=TRUE)
```
About us
========================================================
Andrie de Vries
- Programme Manager, Community projects (Microsoft)
- Set up an independent market research firm in 2009
- Joined Revolution Analytics in 2013
- Author of `ggdendro`, `checkpoint` and `miniCRAN` packages on CRAN
- Co-author of *R for Dummies*
![](images/r-for-dummies.jpg)
***
Michele Usuelli
- Data Scientist (Microsoft)
- Joined Revolution Analytics in 2014
- Author of *R Machine Learning Essentials*
![](images/r-machine-learning.png)
Connecting to Azure with your browser
========================================================
http://ra-ldn-cluster-master-02.cloudapp.net:8787
You should already have received individual login details
Why Hadoop?
========================================================
![](images/indeed-job-trend-stats.png)
Source: http://www.indeed.com/jobtrends?q=HTML5%2C+hadoop%2C+SAS&l=
Hype central
========================================================
![](images/dilbert-big-data-in-the-cloud.png)
Is your problem big enough for Hadoop?
================================================
When to use Hadoop?
* Conventional processing tools won’t work on your data
* Your data is really BIG
- Won’t fit/process in your favourite database or file-system
* Your data is really diverse!
- Semi-structured – JSON, XML, Logs, Images, Sounds
* You’re a whiz at programming
***
When not to use Hadoop?
* !(When to use Hadoop?)
* You’re in a hurry!
My job is reducing
==================
![](images/xkcd-my-job-is-compiling.png)
Some important components
=====================
* HDFS
- distributed file system
- `rhdfs`
* mapreduce
- task manager
- `rmr2`
* hbase
- NoSQL database
- `rhbase`
* hive
- SQL-like database
- `RHive`
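Each component is exposed to R through its own package. As a minimal sketch (not from the original deck), loading the packages on a cluster node might look like this; the environment variable paths are assumptions and depend on your Hadoop installation:
```{r, eval=FALSE}
# Assumed install locations -- adjust for your distribution
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar")

library(rmr2)    # write mapreduce jobs from R
library(rhdfs)   # access HDFS from R
hdfs.init()      # initialise the connection to HDFS
```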
The Azure cluster
==========
You are using a Hortonworks Hadoop cluster, provisioned in the Microsoft Azure cloud
![](images/cluster-structure.png)
MapReduce
============
type: section
MapReduce
=========
A programming abstraction
* Applies to many types of big data calculation
Hides messy implementation detail in library
* Implicit parallelisation
* Load balancing
* Reduce data movement
* Robust job / machine failure management
**You as the programmer don't need to think about this (too much)**
MapReduce solves a generic problem
==================================
* Read a large amount of data
* MAP
* Extract a summary from each record / block
* Shuffle and sort
* REDUCE
* Aggregate, filter, transform
The problem outline is generic – simply implement map and reduce to solve the problem at hand
![](images/img-hadoop-logical-flow.png)
The hadoop mapreduce magic
==========================
type:alert
So how does Hadoop do its magic?
**Remember: the key is key**
The Hadoop promise:
**Hadoop guarantees that records with the same key will be processed by the same reducer**
This magic happens during the shuffle and sort phase
**During shuffle and sort, all items with the same key get moved to the same node**
![](images/img-hadoop-logical-flow.png)
rmr2
============
type: section
rmr2
============
The `rmr2` package allows you to write R code in the mapper and reducer, without having to know anything about Java.
![](images/img-rmr2.png)
MapReduce in R pseudo-code
==========================
In the mapper, `v'` is available as data – no need for an explicit read statement
```{r, eval=FALSE}
mapper <- function(k, v){
  ...
  keyval(k', v')
}
```
In the reducer, all `v'` with the same `k'` are processed together
```{r, eval=FALSE}
reducer <- function(k', v'){
  ...
  keyval(k'', v'')
}
```
Pass these functions to `mapreduce()`:
```{r, eval=FALSE}
mapreduce(input,
map = mapper,
reduce = reducer,
...)
```
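To make the pseudo-code concrete, here is a sketch (not from the original deck) of a mapper and reducer that count how often each value occurs; the input is assumed to be written with `to.dfs()`, so the mapper receives a plain numeric vector as `v`:
```{r, eval=FALSE}
count.map <- function(k, v) {
  keyval(v, 1)          # emit each value as a key, paired with a count of 1
}
count.reduce <- function(k, v) {
  keyval(k, sum(v))     # all counts for the same key arrive in one call
}
m <- mapreduce(input  = to.dfs(sample(1:5, 100, replace = TRUE)),
               map    = count.map,
               reduce = count.reduce)
```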
Testing using the local backend
===============================
Local backend
```{r, eval=FALSE}
rmr.options(backend = "local")
```
* The `rmr2` package has a "local" backend, completely implemented in R
* Useful for development, testing and debugging
* **Since computation runs entirely in memory, on small data, it's fast!**
***
Hadoop backend
```{r, eval=FALSE}
rmr.options(backend = "hadoop")
```
* Scale to this backend once the job works on the "local" backend
* Computation distributed to hdfs and mapreduce
* Hadoop computation overhead
Sending data R <---> Hadoop
============================
* In a real Hadoop context, your data will be ingested into dfs by dedicated tools, e.g. Sqoop or Flume
* For easy testing and development, the `rmr2` package has two convenience functions that move data between your R session and dfs (as a `big.data.object`)
```{r, eval=FALSE}
to.dfs()
from.dfs()
```
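For example, a quick round trip, sketched here with the local backend so no cluster is involved:
```{r, eval=FALSE}
rmr.options(backend = "local")
ints <- to.dfs(1:1000)   # returns a big.data.object pointing at the stored data
from.dfs(ints)           # brings the key-value pairs back into the R session
```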
Using the mapreduce() function
==============================
```{r, eval=FALSE}
m <- mapreduce(input,
input.format,
map,
reduce,
output = NULL, ...)
```
* Specify the `input`, `map` and `reduce` functions
* Optionally, specify `output` to persist the result in hdfs
* If `output = NULL`, `mapreduce()` returns a temporary `big.data.object`
* A `big.data.object` is a pointer to a temporary file in dfs
If you know that the resulting object is small, use `from.dfs()` to return data from dfs into your R session
```{r, eval=FALSE}
m() # returns the file location of the big.data.object
from.dfs(m) # available in R session
```
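As a sketch, the same job with and without a persisted result (the output path below is hypothetical):
```{r, eval=FALSE}
# output = NULL (the default): the result is a temporary big.data.object
m <- mapreduce(input = to.dfs(1:10),
               map   = function(k, v) keyval(v, v^2))
from.dfs(m)

# output given: the result is persisted at that dfs location
mapreduce(input  = to.dfs(1:10),
          map    = function(k, v) keyval(v, v^2),
          output = "/tmp/squares")
```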
Using the key-value pair
========================
Everything in Hadoop uses a key-value pair. (**Remember: the key is key!**)
Use `keyval()` to create the key-value pair inside your mapper and reducer.
```{r, eval=FALSE}
mapper <- function(k, v){
...
keyval(k', v')
}
reducer <- function(k, v){
...
keyval(k', v')
}
```
Use the helper functions `keys()` and `values()` to extract the keys and values from the result of `from.dfs()`
```{r, eval=FALSE}
m <- mapreduce(input, map, reduce, ...)
x <- from.dfs(m) # available in R session
keys(x)
values(x)
```
Cheat sheet for a mapreduce flow using rmr2
=====================
Optional: move a sample of data to dfs:
```{r, eval=FALSE}
hdp.file <- to.dfs(...)
```
Mapreduce:
```{r, eval=FALSE}
map.fun <- function(k, v){...; keyval(k', v')}
reduce.fun <- function(k, v){...; keyval(k', v')}
m <- mapreduce(input,
map = map.fun,
reduce = reduce.fun,
...
)
```
Inspect the results:
```{r, eval=FALSE}
x <- from.dfs(m)
keys(x)
values(x)
```
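Cheat sheet in action (a sketch)
================================
A minimal end-to-end run, sketched with the built-in `mtcars` data and the "local" backend; the column names `cyl` and `mpg` come from `mtcars`, everything else follows the cheat sheet on the previous slide.
```{r, eval=FALSE}
# Mean mpg per number of cylinders, computed as a mapreduce job
rmr.options(backend = "local")

hdp.file <- to.dfs(mtcars)

map.fun    <- function(k, v) keyval(v$cyl, v$mpg)   # key: cyl, value: mpg
reduce.fun <- function(k, v) keyval(k, mean(v))     # mean mpg for each key

m <- mapreduce(input  = hdp.file,
               map    = map.fun,
               reduce = reduce.fun)

x <- from.dfs(m)
keys(x)     # the distinct values of cyl
values(x)   # the corresponding mean mpg
```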
End
===
type: section
Thank you.