Introduction to Using R with Hadoop
========================================================
author: Andrie de Vries & Michele Usuelli
date: 2015-07-01, UseR!2015
width: 1680
height: 1050
css: css/custom.css
```{r include=FALSE}
knitr::opts_chunk$set(cache=TRUE)
```
About us
========================================================
Andrie de Vries
- Programme Manager, Community projects (Microsoft)
- Set up an independent market research firm in 2009
- Joined Revolution Analytics in 2013
- Author of `ggdendro`, `checkpoint` and `miniCRAN` packages on CRAN
- Co-author of *R for Dummies*
![](images/r-for-dummies.jpg)
***
Michele Usuelli
- Data Scientist (Microsoft)
- Joined Revolution Analytics in 2014
- Author of *R Machine Learning Essentials*
![](images/r-machine-learning.png)
Connecting to Azure with your browser
========================================================
http://ra-ldn-cluster-master-02.cloudapp.net:8787
You should already have received individual login details
Why Hadoop?
========================================================
![](images/indeed-job-trend-stats.png)
Source: http://www.indeed.com/jobtrends?q=HTML5%2C+hadoop%2C+SAS&l=
Hype central
========================================================
![](images/dilbert-big-data-in-the-cloud.png)
Is your problem big enough for Hadoop?
================================================
When to use Hadoop?
* Conventional processing tools won’t work on your data
* Your data is really BIG
- Won’t fit/process in your favourite database or file-system
* Your data is really diverse!
- Semi-structured – JSON, XML, Logs, Images, Sounds
* You’re a whiz at programming
***
When not to use Hadoop?
* !(When to use Hadoop?)
* You’re in a hurry!
My job is reducing
==================
![](images/xkcd-my-job-is-compiling.png)
Some important components
=====================
* HDFS
- distributed file system
- `rhdfs`
* mapreduce
- task manager
- `rmr2`
* hbase
- NoSQL database
- `rhbase`
* hive
- SQL-like database
- `RHive`
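Each component is exposed to R through its own package. As a minimal sketch (not from the original deck), loading the packages on a cluster node might look like this; the environment variable paths are assumptions and depend on your Hadoop installation:
```{r, eval=FALSE}
# Assumed install locations -- adjust for your distribution
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar")

library(rmr2)    # write mapreduce jobs from R
library(rhdfs)   # access HDFS from R
hdfs.init()      # initialise the connection to HDFS
```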
The Azure cluster
==========
You are using a Hortonworks Hadoop cluster, provisioned in the Microsoft Azure cloud
![](images/cluster-structure.png)
MapReduce
============
type: section
MapReduce
=========
A programming abstraction
* Applies to many types of big data calculation
Hides messy implementation detail in library
* Implicit parallelisation
* Load balancing
* Reduce data movement
* Robust job / machine failure management
**You as the programmer don't need to think about this (too much)**
MapReduce solves a generic problem
==================================
* Read a large amount of data
* MAP
* Extract a summary from each record / block
* Shuffle and sort
* REDUCE
* Aggregate, filter, transform
The problem outline is generic – simply implement map and reduce to solve the problem at hand
![](images/img-hadoop-logical-flow.png)
The hadoop mapreduce magic
==========================
type:alert
So how does Hadoop do its magic?
**Remember: the key is key**
The Hadoop promise:
**Hadoop guarantees that records with the same key will be processed by the same reducer**
This magic happens during the shuffle and sort phase
**During shuffle and sort, all items with the same key get moved to the same node**
![](images/img-hadoop-logical-flow.png)
rmr2
============
type: section
rmr2
============
The `rmr2` package allows you to write R code in the mapper and reducer, without having to know anything about Java.
![](images/img-rmr2.png)
MapReduce in R pseudo-code
==========================
In the mapper, `v'` is available as data – no need for an explicit read statement
```{r, eval=FALSE}
mapper <- function(k, v){
  ...
  keyval(k', v')
}
```
In the reducer, all `v'` with the same `k'` are processed together
```{r, eval=FALSE}
reducer <- function(k', v'){
  ...
  keyval(k'', v'')
}
```
Pass these functions to `mapreduce()`:
```{r, eval=FALSE}
mapreduce(input,
map = mapper,
reduce = reducer,
...)
```
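To make the pseudo-code concrete, here is a sketch (not from the original deck) of a mapper and reducer that count how often each value occurs; the input is assumed to be written with `to.dfs()`, so the mapper receives a plain numeric vector as `v`:
```{r, eval=FALSE}
count.map <- function(k, v) {
  keyval(v, 1)          # emit each value as a key, paired with a count of 1
}
count.reduce <- function(k, v) {
  keyval(k, sum(v))     # all counts for the same key arrive in one call
}
m <- mapreduce(input  = to.dfs(sample(1:5, 100, replace = TRUE)),
               map    = count.map,
               reduce = count.reduce)
```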
Testing using the local backend
===============================
Local backend
```{r, eval=FALSE}
rmr.options(backend = "local")
```
* The `rmr2` package has a "local" backend, completely implemented in R
* Useful for development, testing and debugging
* **Since computation runs entirely in memory, on small data, it's fast!**
***
Hadoop backend
```{r, eval=FALSE}
rmr.options(backend = "hadoop")
```
* Scale to this backend once the job works on the "local" backend
* Computation distributed to hdfs and mapreduce
* Hadoop computation overhead
Sending data R <---> Hadoop
============================
* In a real Hadoop context, your data will be ingested into dfs by dedicated tools, e.g. Sqoop or Flume
* For easy testing and development, the `rmr2` package has two convenience functions that move data between your R session and dfs (as a `big.data.object`)
```{r, eval=FALSE}
to.dfs()
from.dfs()
```
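For example, a quick round trip, sketched here with the local backend so no cluster is involved:
```{r, eval=FALSE}
rmr.options(backend = "local")
ints <- to.dfs(1:1000)   # returns a big.data.object pointing at the stored data
from.dfs(ints)           # brings the key-value pairs back into the R session
```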
Using the mapreduce() function
==============================
```{r, eval=FALSE}
m <- mapreduce(input,
input.format,
map,
reduce,
output = NULL, ...)
```
* Specify the `input`, `map` and `reduce` functions
* Optionally, specify `output` to persist the result in hdfs
* If `output = NULL`, `mapreduce()` returns a temporary `big.data.object`
* A `big.data.object` is a pointer to a temporary file in dfs
If you know that the resulting object is small, use `from.dfs()` to return data from dfs into your R session
```{r, eval=FALSE}
m() # returns the file location of the big.data.object
from.dfs(m) # available in R session
```
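As a sketch, the same job with and without a persisted result (the output path below is hypothetical):
```{r, eval=FALSE}
# output = NULL (the default): the result is a temporary big.data.object
m <- mapreduce(input = to.dfs(1:10),
               map   = function(k, v) keyval(v, v^2))
from.dfs(m)

# output given: the result is persisted at that dfs location
mapreduce(input  = to.dfs(1:10),
          map    = function(k, v) keyval(v, v^2),
          output = "/tmp/squares")
```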
Using the key-value pair
========================
Everything in Hadoop uses a key-value pair. (**Remember: the key is key!**)
Use `keyval()` to create the key-value pair inside your mapper and reducer.
```{r, eval=FALSE}
mapper <- function(k, v){
...
keyval(k', v')
}
reducer <- function(k, v){
...
keyval(k', v')
}
```
Use the helper functions `keys()` and `values()` to extract the keys and values from the result of `from.dfs()`
```{r, eval=FALSE}
m <- mapreduce(input, map, reduce, ...)
x <- from.dfs(m) # available in R session
keys(x)
values(x)
```
Cheat sheet for a mapreduce flow using rmr2
=====================
Optional: move a sample of data to dfs:
```{r, eval=FALSE}
hdp.file <- to.dfs(...)
```
Mapreduce:
```{r, eval=FALSE}
map.fun <- function(k, v){...; keyval(k', v')}
reduce.fun <- function(k, v){...; keyval(k', v')}
m <- mapreduce(input,
map = map.fun,
reduce = reduce.fun,
...
)
```
Inspect the results:
```{r, eval=FALSE}
x <- from.dfs(m)
keys(x)
values(x)
```
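Cheat sheet in action (a sketch)
================================
A minimal end-to-end run, sketched with the built-in `mtcars` data and the "local" backend; the column names `cyl` and `mpg` come from `mtcars`, everything else follows the cheat sheet on the previous slide.
```{r, eval=FALSE}
# Mean mpg per number of cylinders, computed as a mapreduce job
rmr.options(backend = "local")

hdp.file <- to.dfs(mtcars)

map.fun    <- function(k, v) keyval(v$cyl, v$mpg)   # key: cyl, value: mpg
reduce.fun <- function(k, v) keyval(k, mean(v))     # mean mpg for each key

m <- mapreduce(input  = hdp.file,
               map    = map.fun,
               reduce = reduce.fun)

x <- from.dfs(m)
keys(x)     # the distinct values of cyl
values(x)   # the corresponding mean mpg
```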
End
===
type: section
Thank you.