Spark + AWS Data Lake and ETL

Project Summary

Sparkify, a music streaming startup, wanted to centralize the logs it keeps on user activity and song metadata so it could run analytics on them. This AWS S3 data lake, organized as a star schema, lets them access their data in an intuitive fashion and start drawing rich insights about their user base.

I set up an AWS EMR cluster running Spark to process their logs, reading them in from an S3 bucket. I then ran transformations on that big data, splitting it out into separate tables and writing it back into an S3 data lake.
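To make the flow concrete, here is a minimal PySpark sketch of the extract-transform-load steps. The bucket paths, table, and column names are illustrative assumptions, not the actual values used in etl.py.

```python
# A minimal sketch of the ETL flow, assuming hypothetical S3 paths and
# column names -- the real paths and schemas live in etl.py.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sparkify-data-lake")
    .getOrCreate()
)

# Extract: read the raw JSON song metadata from the input bucket.
song_data = spark.read.json("s3a://your-input-bucket/song_data/*/*/*/*.json")

# Transform: project out a songs dimension table, dropping duplicate records.
songs_table = song_data.select(
    "song_id", "title", "artist_id", "year", "duration"
).dropDuplicates(["song_id"])

# Load: write the table back to the lake as partitioned parquet.
songs_table.write.mode("overwrite") \
    .partitionBy("year", "artist_id") \
    .parquet("s3a://your-output-bucket/songs/")
```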

Why this Database and ETL design?

My client Sparkify has moved to a cloud-based system and now keeps its big data logs in an AWS S3 bucket. The end goal was to get that raw JSON log data into fact and dimension tables stored as parquet files in an S3 data lake.
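Once the tables land in the lake as parquet, analysts can query them directly with Spark, which is the point of this design. A hedged example, reusing a songs table like the one sketched above (paths and column names are assumptions):

```python
# Read a dimension table straight from the parquet data lake and run a
# simple analytical query. Paths and column names are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sparkify-analytics").getOrCreate()

songs = spark.read.parquet("s3a://your-output-bucket/songs/")

# Example: count songs per release year, most prolific years first.
songs.groupBy("year") \
    .agg(F.count("song_id").alias("num_songs")) \
    .orderBy(F.desc("num_songs")) \
    .show(10)
```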

Database structure overview

ER Diagram From Udacity

How to run

  • Start by cloning this repository
  • Install the Python requirements from requirements.txt
  • Create an S3 output bucket and set its path in the output_data variable in etl.py's main()
  • Initialize an EMR cluster with Spark
  • Fill in dl_template.cfg with your own details (see the config sketch after this list)
  • SSH into the EMR cluster and upload your dl_template.cfg and etl.py files
  • Run spark-submit etl.py to launch the Spark job and write the resulting tables as parquet files to your S3 output path
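The config step above assumes dl_template.cfg holds your AWS credentials; a rough sketch of how etl.py might read it is below. The section and key names are hypothetical, not the actual template contents.

```python
# Hypothetical sketch of reading dl_template.cfg -- the [AWS] section and
# key names are assumptions, not the actual template contents.
import configparser
import os

config = configparser.ConfigParser()
config.read("dl_template.cfg")

# Export credentials so Spark's S3 connector can pick them up.
os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```

With the config and script on the cluster, spark-submit etl.py kicks off the job as described in the last step.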