haidernitc/SparkCodeTest

Titanic Data

File list:
1. AggregatedReport.java
2. ExtractExcelToCSV.java
3. titanic3.xls
4. newFile.txt
5. pom.xml

Design Decisions & Project Issues:

I configured Spark on Windows and placed the Spark code under the D:/spark-1.5.2-bin-hadoop2.6 folder on my local machine.

Reading the Excel file as-is is not a great way of handling the data, since we might lose the data's exact behavior (certain columns are integers, but they might appear as doubles when read). So I parsed the file and converted it to a '|'-delimited file without altering the data.
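The integer-versus-double concern above can be illustrated with a minimal sketch in plain Java. It is not the code from ExtractExcelToCSV.java; the class and method names (`PipeRowFormatter`, `formatCell`, `toPipeRow`) are hypothetical, and it assumes the spreadsheet reader hands every numeric cell back as a `Double`. Whole-number values are written without a trailing ".0" and cells are joined with '|'.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not the repository's converter.
class PipeRowFormatter {

    // Render one cell value as text. Assumption: numeric cells arrive as Double.
    static String formatCell(Object value) {
        if (value == null) {
            return ""; // empty cell stays empty
        }
        if (value instanceof Double) {
            double d = (Double) value;
            boolean finite = !Double.isNaN(d) && !Double.isInfinite(d);
            // Whole numbers in a safe range are printed as integers (29.0 -> "29").
            if (finite && d == Math.rint(d) && Math.abs(d) < 1e15) {
                return Long.toString((long) d);
            }
            return Double.toString(d);
        }
        return value.toString();
    }

    // Join the formatted cells of one row with the '|' delimiter.
    static String toPipeRow(List<Object> cells) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < cells.size(); i++) {
            if (i > 0) {
                sb.append('|');
            }
            sb.append(formatCell(cells.get(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Illustrative row, not taken from titanic3.xls.
        List<Object> row = Arrays.<Object>asList(1.0, "female", 29.0, 211.3375);
        System.out.println(toPipeRow(row)); // prints 1|female|29|211.3375
    }
}
```

A '|' delimiter is convenient here because free-text fields in the data may contain commas, which a comma-delimited file would need to quote.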

First Report would appear as below.

+-----+------------+---------------+
|Count|  CatDetails|            Cat|
+-----+------------+---------------+
|  200|           1|Passenger Class|
|  181|           3|Passenger Class|
|  119|           2|Passenger Class|
|  339|      female|         Gender|
|  161|        male|         Gender|
|   45|     Teenage|        AgeBand|
|   89|  Middle Age|        AgeBand|
|  205|       Young|        AgeBand|
|  111|     Elderly|        AgeBand|
|   50|Less than 10|        AgeBand|
+-----+------------+---------------+
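The AgeBand rows in the report above could be produced by bucketing each age and counting per bucket. The sketch below shows that idea in plain Java, without Spark. The band labels come from the report, but the age thresholds are assumptions made for illustration only (the source does not state them), and `AgeBandCounter` is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; the cut-off ages below are assumed, not from the project.
class AgeBandCounter {

    // Map an age to one of the band labels shown in the report.
    static String ageBand(double age) {
        if (age < 10) return "Less than 10"; // assumed threshold
        if (age < 20) return "Teenage";      // assumed threshold
        if (age < 40) return "Young";        // assumed threshold
        if (age < 60) return "Middle Age";   // assumed threshold
        return "Elderly";
    }

    // Count how many ages fall into each band, keeping first-seen order.
    static Map<String, Integer> countBands(double[] ages) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (double age : ages) {
            counts.merge(ageBand(age), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Illustrative ages, not taken from the data set.
        double[] ages = {2, 15, 25, 30, 45, 70};
        System.out.println(countBands(ages));
    }
}
```

The Passenger Class and Gender rows follow the same pattern with the raw column value used as the category instead of a computed band.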

Second Report would appear as below.

+-----+--------------------+
|Count|               State|
+-----+--------------------+
|    1|                  MA|
|    1|  Ireland Washington|
|    1|                null|
|    1| NY / Briarcliff ...|
|    1|                  PQ|
|    1| England / Cottag...|
|    4|    PA / Cooperstown|
|    1|                  IL|
|    2|                  AB|
|    3|                  PA|
|    1|                  NJ|
|    1|     NY / Washington|
|    1| England / New Du...|
|    1|             Co Cork|
|    2|                  NY|
|    2|                  MA|
|    1|                  NY|
|    1|             England|
|    2|                  VT|
|    1|                  OH|
+-----+--------------------+
only showing top 20 rows
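The State values above (for example "PA / Cooperstown") look consistent with taking the text between the first and second comma of a free-text destination field. That is an inference from the output, not something the source confirms. A minimal plain-Java sketch under that assumption follows; `StateCounter` and `stateOf` are hypothetical names, the input strings are illustrative, and trimming whitespace is a choice of this sketch rather than known behavior of the report.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; the extraction rule is inferred from the report output.
class StateCounter {

    // Return the second comma-separated token, trimmed, or null if there is none.
    static String stateOf(String destination) {
        if (destination == null) {
            return null;
        }
        String[] parts = destination.split(",");
        return parts.length > 1 ? parts[1].trim() : null;
    }

    // Count rows per extracted state; missing states are grouped under "null".
    static Map<String, Integer> countByState(String[] destinations) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String destination : destinations) {
            String state = stateOf(destination);
            counts.merge(state == null ? "null" : state, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Illustrative inputs only.
        String[] destinations = {
            "Haverford, PA / Cooperstown, NY",
            "New York, NY",
            "Boston, MA",
            "New York, NY",
            "London",
            null
        };
        System.out.println(countByState(destinations));
    }
}
```

Trimming before grouping keeps values that differ only in surrounding whitespace from being counted as separate states.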
