2-Taxi-analysis-with-RHadoop.Rpres

Analysing New York taxis with RHadoop
========================================================
author: Andrie de Vries & Michele Usuelli
date: 2015-07-01, UseR!2015
width: 1680
height: 1050
css: css/custom.css


Motivation: New York taxi data
=============

An inconveniently sized data set (~200GB uncompressed CSV).

Contains information about every single taxi trip in New York over a 4-year period.

***

![](images/taxi-tweet.png)

Source: http://chriswhong.com/open-data/foil_nyc_taxi/

Introduction to the taxi data
====
type: section

```{r include=FALSE}
knitr::opts_chunk$set(cache = FALSE)
```


Introduction to the taxi data
====

The data is at http://publish.illinois.edu/dbwork/open-data/

Previous analysis published at

> Brian Donovan and Daniel B. Work. “Using coarse GPS data to quantify city-scale transportation system resilience to extreme events.”  to appear, Transportation Research Board 94th Annual Meeting, August 2014.  [preprint](https://www.dropbox.com/s/deruyszudfqrll0/TRB15DonovanWork.pdf), [source code](https://github.com/UIUC-Transportation-Data/gpsresilience).


Data dictionary
===============

Field | Description
----- | -----------
medallion |  a permit to operate a yellow taxi cab in New York City, it is effectively a (randomly assigned) car ID.  See also medallions.
hack_license |  a license to drive the vehicle, it is effectively a (randomly assigned) driver ID. See also hack license.
vendor_id |   e.g., Verifone Transportation Systems (VTS), or Mobile Knowledge Systems Inc (CMT), implemented as part of the Technology Passenger Enhancements Project.
rate_code |  taximeter rate, see NYCT&L description.
store_and_fwd_flag |  unknown attribute.
pickup_datetime |  start time of the trip, mm-dd-yyyy hh24:mm:ss EDT.
dropoff_datetime |  end time of the trip, mm-dd-yyyy hh24:mm:ss EDT.
passenger_count |  number of passengers on the trip, default value is one.
trip_time_in_secs |  trip time measured by the taximeter in seconds.
trip_distance |  trip distance measured by the taximeter in miles.
pickup_longitude and pickup_latitude |  GPS coordinates at the start of the trip.
dropoff_longitude and dropoff_latitude |  GPS coordinates at the end of the trip.

Source: http://publish.illinois.edu/dbwork/open-data/

Introduction to mapreduce()
==========
type: section


Using a local context in rmr2
======

```{r taxi-2-rmr-local, cache=FALSE, include=FALSE}
read_chunk("taxi/taxi-2-rmr-local.R")
```

```{r load-packages}
```


Exercise 1
==========
type: cobalt

* Open the folder `exercises`
* Open the file `ex-1-lapply.R`
* Source the script

You are looking at a simple `mapreduce()` job that simulates `lapply()` in base R.


make.input.format()
======

The function `make.input.format()` allows you to specify the attributes of your input data.

The argument `format = "csv"` specifies a csv file. This is a wrapper around `read.table()`.

```{r make.input.format}
```

from.dfs()
======

Use from.dfs() to get a (small) file from dfs to local memory

```{r from.dfs-1}
```

from.dfs()
======

Use `keys()` and `values()` to extract the components

```{r from.dfs-2}
```

Why do the columns not have labels?
======
type: alert

Remember: hdfs splits the individual files across nodes

Implications:
* The chunk that's available to your mapper may not have a header file
* Thus, in general, csv files should not have a header at all!

The solution:
* Make a custom input format, specifying the column names
* (in the same way you specify `col.names` to `read.table()`)

Make an input format
======

The file, `data/dictionary_trip_data.csv` (in the linux file system) contains a data dictionary:

```{r make.input.format-with-colnames-1}
```

Use this information to configure the input format

Make an input format
======

Use the data dictionary information to configure the input format:


```{r make.input.format-with-colnames-2}
```


mapreduce() without any transformation
======

```{r mapreduce-1-a}
```

mapreduce() without any transformation
======

```{r mapreduce-1-b}
```

mapreduce() with a simpler mapper
======

```{r mapreduce-2}
```

mapreduce() with a simpler mapper
======

![](images/mapreduce-weekdays-0.png)


mapreduce() emitting a keyvalue pair
======

```{r mapreduce-3}
```

mapreduce() with a reducer
======

```{r mapreduce-4}
```

mapreduce() with a reducer
======

![](images/mapreduce-weekdays-1.png)

Can we write a more sensible mapper and reducer?
======

![](images/mapreduce-weekdays-2.png)

mapreduce() with a more sensible mapper
======

```{r mapreduce-5}
```

mapreduce() the final step
======

```{r mapreduce-6}
```

Exercise 2
==========
type: cobalt

* Open the folder `exercises`
* Open the file `ex-2-taxi-local.R`
* Source the script

This script contains a short `mapreduce()` script on the taxi data.

Run the code, section by section.

Ensure you understand what happens.

Try to compute the mean of number of passengers.

Hint:

* The mapper should compute sum and count of passengers
* The reduced should compute `mean <- sum / count`


Analysing by hour
=================
type: section

Doing the analysis by hour
==========================

```{r mapreduce-7-a}
```

Doing the analysis by hour
==========================

```{r mapreduce-7-b}
```


Plotting the results
====================

```{r mapreduce-7-plot-1, out.width="1500px", out.height="1000px"}
```

Plotting the results
====================

```{r mapreduce-7-plot-2, echo=FALSE, out.width="1500px", out.height="1000px"}
```


Putting files in hdfs with rhdfs
===============
type: section

rhdfs function overview
=======================

* Initialize
  - `hdfs.init()`
  - `hdfs.defaults()`
* File and directory manipulation
  - `hdfs.ls()`
  - `hdfs.delete()`
  - `hdfs.mkdir()`
  - `hdfs.exists()`
* Copy and move from local <-> HDFS
  - `hdfs.put()`
  - `hdfs.get()`

***

* Manipulate files within HDFS
  - `hdfs.copy()`
  - `hdfs.move()`
  - `hdfs.rename()`
* Reading files directly from HDFS
  - `hdfs.file()`
  - `hdfs.read()`
  - `hdfs.write()`
  - `hdfs.flush()`
  - `hdfs.seek()`
  - `hdfs.tell(con)`
  - `hdfs.close()`
  - `hdfs.line.reader()`
  -  `hdfs.read.text.file()`


Putting the data from local file system to dfs
============

You use the `rhdfs` package to manipulate files in the hadoop dfs.

```{r put-taxi-in-dfs, cache=FALSE, include=FALSE}
read_chunk("utils/put-taxi-data-to-dfs.R")
```

```{r rhdfs, eval=FALSE}
```


Exercise 3
==========
type: cobalt

* Open the folder `exercises`
* Open the file `ex-3-put-taxi-data-to-dfs.R`
* Source the script

Use the `rhdfs` functions to satisfy yourself the files are in hadoop.

Hint: Use `hdfs.ls()`

Running the script in Hadoop compute context
=================
type: section


Running the script in Hadoop compute context
=================

Once you've tested your script in local context, it is generally very easy to deploy on all your data.

```{r, eval=FALSE}
rmr.options(backend = "hadoop")
```

What the hadoop output means
============

```
packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.4.0.2.1.5.0-695.jar] /tmp/streamjob4577673905010649130.jar tmpDir=null
15/06/23 14:18:01 INFO impl.TimelineClientImpl: Timeline service address: http://0.0.0.0:8188/ws/v1/timeline/
15/06/23 14:18:01 INFO client.RMProxy: Connecting to ResourceManager at ra-ldn-cluster-master-02.cloudapp.net/172.16.0.5:8050
15/06/23 14:18:02 INFO impl.TimelineClientImpl: Timeline service address: http://0.0.0.0:8188/ws/v1/timeline/
....
15/06/23 14:18:17 INFO mapreduce.Job: Job job_1434365936669_0041 running in uber mode : false
15/06/23 14:18:17 INFO mapreduce.Job:  map 0% reduce 0%
15/06/23 14:18:31 INFO mapreduce.Job:  map 1% reduce 0%
15/06/23 14:18:34 INFO mapreduce.Job:  map 17% reduce 0%
15/06/23 14:18:40 INFO mapreduce.Job:  map 19% reduce 0%
15/06/23 14:18:41 INFO mapreduce.Job:  map 21% reduce 0%
15/06/23 14:18:47 INFO mapreduce.Job:  map 37% reduce 0%
15/06/23 14:18:48 INFO mapreduce.Job:  map 92% reduce 0%
15/06/23 14:18:49 INFO mapreduce.Job:  map 92% reduce 19%
15/06/23 14:18:50 INFO mapreduce.Job:  map 100% reduce 19%
15/06/23 14:18:52 INFO mapreduce.Job:  map 100% reduce 100%
15/06/23 14:19:00 INFO mapreduce.Job: Job job_1434365936669_0041 completed successfully
...
  rmr
		reduce calls=24
15/06/23 14:19:01 INFO streaming.StreamJob: Output directory: /tmp/file55a346591243
```

Exercise 4
==========
type: cobalt

* Open the folder `exercises`
* Open the file `ex-4-taxi-hadoop.R`
* Source the script


Summary
=========
type: section


Summary
=========

In this session you:

* Used an Azure cluster running HortonWorks
* Developed scripts in local context with the `rmr2` package:
  - Write mappers
  - Wrote reducers
  - Used `mapreduce()`
* Returned data to your local R session, for plotting
* Uploaded files into hdfs using `rhdfs`
* Ran the script directly in hadoop, using the hadoop context

Questions
=========
type: section


Questions
=========


?
?
?


End
===
type: section

Thank you.