
Titanic Data

File list:
1. AggregatedReport.java
2. ExtractExcelToCSV.java
3. titanic3.xls
4. newFile.txt
5. pom.xml

Design Decisions & Project Issues:

I configured Spark on Windows and placed the Spark code under the D:/spark-1.5.2-bin-hadoop2.6 folder on my local machine.

Reading the Excel file as-is is not a great way to handle the data, because the exact data types can be lost (certain columns are integers but may appear as doubles when read). I therefore parsed the file and converted it to a '|' delimited file without altering the data.
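The integer-versus-double concern above can be illustrated with a minimal sketch. It assumes the Excel reader hands numeric cells back as `double` values (Apache POI's `getNumericCellValue()` does this) and that the cell values have already been extracted into a list. The class `PipeRow`, its method names, and the sample row are hypothetical and are not taken from ExtractExcelToCSV.java.

```java
import java.util.ArrayList;
import java.util.List;

class PipeRow {
    // Writes whole numbers without a decimal part (29.0 -> "29") and leaves
    // fractional values untouched (211.3375 -> "211.3375").
    static String formatNumeric(double value) {
        if (value == Math.rint(value) && Math.abs(value) < 1e15) {
            return String.valueOf((long) value);
        }
        return String.valueOf(value);
    }

    // Joins already-extracted cell values into one '|' delimited line.
    // Empty cells (null) become empty fields so column positions are kept.
    static String toPipeLine(List<Object> cells) {
        List<String> fields = new ArrayList<>();
        for (Object cell : cells) {
            if (cell == null) {
                fields.add("");
            } else if (cell instanceof Double) {
                fields.add(formatNumeric((Double) cell));
            } else {
                fields.add(cell.toString());
            }
        }
        return String.join("|", fields);
    }

    public static void main(String[] args) {
        // Illustrative row, not real data from titanic3.xls.
        List<Object> cells = new ArrayList<>();
        cells.add(1.0);
        cells.add("Doe, Miss. Jane");
        cells.add(null);
        cells.add(7.25);
        System.out.println(toPipeLine(cells)); // prints 1|Doe, Miss. Jane||7.25
    }
}
```

A '|' delimiter is convenient here because name fields in this kind of data contain commas, which would otherwise need quoting in a comma-separated file.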

The first report appears as below.

+-----+------------+---------------+
|Count|  CatDetails|            Cat|
+-----+------------+---------------+
|  200|           1|Passenger Class|
|  181|           3|Passenger Class|
|  119|           2|Passenger Class|
|  339|      female|         Gender|
|  161|        male|         Gender|
|   45|     Teenage|        AgeBand|
|   89|  Middle Age|        AgeBand|
|  205|       Young|        AgeBand|
|  111|     Elderly|        AgeBand|
|   50|Less than 10|        AgeBand|
+-----+------------+---------------+
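The report above has the shape (Count, CatDetails, Cat): one block of per-value counts for each category, stacked into a single table. A minimal plain-Java sketch of that shape follows. It is not the Spark code in AggregatedReport.java, the `Passenger` class and its field names are hypothetical, and the AgeBand category is left out because the age boundaries are not stated here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class CategoryCounts {
    // Hypothetical minimal record; field names are illustrative only.
    static final class Passenger {
        final int pclass;
        final String sex;

        Passenger(int pclass, String sex) {
            this.pclass = pclass;
            this.sex = sex;
        }
    }

    // Counts rows per distinct key value and labels each result row with the
    // category name, giving rows of the form {Count, CatDetails, Cat}.
    static List<String[]> countBy(List<Passenger> rows,
                                  Function<Passenger, String> key,
                                  String category) {
        Map<String, Long> counts = rows.stream()
                .collect(Collectors.groupingBy(key, TreeMap::new, Collectors.counting()));
        List<String[]> out = new ArrayList<>();
        counts.forEach((value, count) ->
                out.add(new String[] {String.valueOf(count), value, category}));
        return out;
    }

    // Stacks the per-category counts into one report, like a union of group-bys.
    static List<String[]> report(List<Passenger> rows) {
        List<String[]> out = new ArrayList<>();
        out.addAll(countBy(rows, p -> String.valueOf(p.pclass), "Passenger Class"));
        out.addAll(countBy(rows, p -> p.sex, "Gender"));
        return out;
    }

    public static void main(String[] args) {
        // Illustrative rows, not real data.
        List<Passenger> rows = new ArrayList<>();
        rows.add(new Passenger(1, "female"));
        rows.add(new Passenger(1, "male"));
        rows.add(new Passenger(3, "female"));
        for (String[] r : report(rows)) {
            System.out.println(String.join("|", r));
        }
    }
}
```

Because every passenger falls into exactly one value per category, the counts within each category sum to the same total, which matches the table above (each category adds up to 500).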

The second report appears as below.

+-----+--------------------+
|Count|               State|
+-----+--------------------+
|    1|                  MA|
|    1|  Ireland Washington|
|    1|                null|
|    1| NY / Briarcliff ...|
|    1| England / Cottag...|
|    4|    PA / Cooperstown|
|    1|                  IL|
|    2|                  AB|
|    3|                  PA|
|    1|                  NJ|
|    1|     NY / Washington|
|    1| England / New Du...|
|    1|             Co Cork|
|    2|                  NY|
|    2|                  MA|
|    1|                  NY|
|    1|             England|
|    2|                  VT|
|    1|                  OH|
+-----+--------------------+
only showing top 20 rows
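In the output above, MA and NY each appear on more than one row. One possible cause, stated here as an assumption rather than a confirmed diagnosis, is that the raw State values differ only in surrounding whitespace. A minimal sketch of trimming the key before counting is shown below. The class `StateCounts` and the `(missing)` label are hypothetical and are not part of AggregatedReport.java.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class StateCounts {
    // Trims each raw value before grouping so that " MA" and "MA " are counted
    // under the same key. Null values are counted under a placeholder label.
    static Map<String, Long> countByState(List<String> rawStates) {
        return rawStates.stream()
                .map(s -> s == null ? "(missing)" : s.trim())
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Illustrative values, not real data.
        List<String> raw = Arrays.asList(" MA", "MA ", " NY", null);
        System.out.println(countByState(raw)); // prints {(missing)=1, MA=2, NY=1}
    }
}
```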

haidernitc/SparkCodeTest

Folders and files

Latest commit

History

Repository files navigation

Titanic Data

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages