A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.
The purpose of this data engineering project is to create a Postgres database with tables designed to optimise queries for song play analysis. My task is to create a database schema and an ETL pipeline for this analysis.
The first dataset is a subset of real data from the Million Song Dataset.
Each file is in JSON format and contains metadata about a song and the artist of that song.
The files are partitioned by the first three letters of each song's track ID. For example, here is the content of a single song file in this dataset.
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
The second dataset consists of log files in JSON format generated by an event simulator, based on the songs in the dataset above.
These simulate activity logs from a music streaming app based on specified configurations.
The log files in the dataset I'll be working with are partitioned by year and month. For example, here are filepaths to two files in this dataset.
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
I will create a star schema for this project with one fact table and four dimension tables.
songplays (fact table)
- Records in log data associated with song plays, i.e. records with page `NextSong`
- songplay_id INT PRIMARY KEY
- start_time TIMESTAMP
- user_id INT
- level VARCHAR
- song_id VARCHAR
- artist_id VARCHAR
- session_id INT
- location VARCHAR
- user_agent VARCHAR
users (dimension table)
- Users in the app
- user_id INT PRIMARY KEY
- first_name VARCHAR
- last_name VARCHAR
- gender VARCHAR
- level VARCHAR
songs (dimension table)
- Songs in the music database
- song_id VARCHAR PRIMARY KEY
- title VARCHAR
- artist_id VARCHAR
- year INT
- duration NUMERIC
artists (dimension table)
- Artists in the music database
- artist_id VARCHAR PRIMARY KEY
- name VARCHAR
- location VARCHAR
- latitude FLOAT
- longitude FLOAT
time (dimension table)
- Timestamps of records in songplays broken down into specific units
- start_time TIMESTAMP PRIMARY KEY
- hour INT
- day INT
- week INT
- month INT
- year INT
- weekday INT
To create the tables, I first connect to the Sparkify database and then run the CREATE TABLE statements for each of the 5 tables above.
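A minimal sketch of this step, assuming a local Postgres instance and illustrative credentials (the actual connection string and CREATE statements live in `create_tables.py` and `sql_queries.py`):

```python
import psycopg2

# Connection parameters are placeholders; adjust them for your own Postgres setup.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# One of the five CREATE TABLE statements (users shown here as an example).
user_table_create = """
    CREATE TABLE IF NOT EXISTS users (
        user_id    INT PRIMARY KEY,
        first_name VARCHAR,
        last_name  VARCHAR,
        gender     VARCHAR,
        level      VARCHAR
    );
"""
cur.execute(user_table_create)
conn.commit()
```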
I extract all the song data from the JSON files using `get_files`.
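A sketch of how such a helper can be implemented (the data directory name is an assumption):

```python
import glob
import os

def get_files(filepath):
    """Walk a directory tree and collect the absolute paths of every .json file."""
    all_files = []
    for root, dirs, files in os.walk(filepath):
        for f in glob.glob(os.path.join(root, "*.json")):
            all_files.append(os.path.abspath(f))
    return all_files

song_files = get_files("data/song_data")  # directory name is an assumption
```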
songs_data
- Select the columns that I need from the JSON files and turn these columns into a dataframe.
- Then I insert the song data row by row into the `songs` table that I previously created (see the sketch below).
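A sketch of the songs step, reusing `cur`, `conn` and `song_files` from the sketches above; the INSERT statement below (and its conflict policy) is an assumption about what sits in `sql_queries.py`:

```python
import pandas as pd

# Assumed INSERT statement; in the project it lives in sql_queries.py.
song_table_insert = """
    INSERT INTO songs (song_id, title, artist_id, year, duration)
    VALUES (%s, %s, %s, %s, %s)
    ON CONFLICT (song_id) DO NOTHING;
"""

# Read one song file and keep only the columns the songs table needs.
df = pd.read_json(song_files[0], lines=True)
song_data = df[["song_id", "title", "artist_id", "year", "duration"]]

# Insert the song data row by row.
for _, row in song_data.iterrows():
    cur.execute(song_table_insert, row.tolist())
conn.commit()
```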
artists_data
- Select the columns that I need from the JSON files and turn these columns into a dataframe.
- Then I insert the artists' data row by row into the `artists` table that I previously created (sketch below).
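The artists step follows the same pattern with a different column selection, again reusing `df`, `cur` and `conn` from above; the INSERT statement is an assumption:

```python
# Assumed INSERT statement for the artists table.
artist_table_insert = """
    INSERT INTO artists (artist_id, name, location, latitude, longitude)
    VALUES (%s, %s, %s, %s, %s)
    ON CONFLICT (artist_id) DO NOTHING;
"""

# The artist columns come from the same song files, just a different selection.
artist_data = df[["artist_id", "artist_name", "artist_location",
                  "artist_latitude", "artist_longitude"]]
for _, row in artist_data.iterrows():
    cur.execute(artist_table_insert, row.tolist())
conn.commit()
```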
time data
- Select the data in the `ts` column and use `to_datetime` to convert the timestamps from milliseconds to datetimes.
- Use `datetime` functions to break each timestamp into hour, day, week, month, year and weekday.
- Then I insert the time data row by row into the `time` table that I previously created (sketch below).
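A sketch of the time step. It assumes `get_files`, `cur` and `conn` from the earlier sketches, a `time_table_insert` statement analogous to the ones above, and that the log dataframe is first filtered to `NextSong` events:

```python
import pandas as pd

# Assumed INSERT statement for the time table.
time_table_insert = """
    INSERT INTO time (start_time, hour, day, week, month, year, weekday)
    VALUES (%s, %s, %s, %s, %s, %s, %s)
    ON CONFLICT (start_time) DO NOTHING;
"""

log_files = get_files("data/log_data")          # directory name is an assumption
log_df = pd.read_json(log_files[0], lines=True)
log_df = log_df[log_df["page"] == "NextSong"]   # keep song-play events only

# ts is in milliseconds since the epoch; convert it to proper datetimes.
t = pd.to_datetime(log_df["ts"], unit="ms")

# Break each timestamp into the units stored in the time table and insert row by row.
for ts in t:
    cur.execute(time_table_insert,
                (ts, ts.hour, ts.day, ts.isocalendar()[1],
                 ts.month, ts.year, ts.weekday()))
conn.commit()
```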
users data
- Select the columns that I need from the JSON files and turn these columns into a dataframe.
- Then I insert the users' data row by row into the `users` table that I previously created (sketch below).
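A sketch of the users step, reusing `log_df`, `cur` and `conn`; the conflict policy (keeping the latest `level` for returning users) is an assumption rather than something stated above:

```python
# Assumed INSERT; updating level on conflict keeps the latest subscription tier.
user_table_insert = """
    INSERT INTO users (user_id, first_name, last_name, gender, level)
    VALUES (%s, %s, %s, %s, %s)
    ON CONFLICT (user_id) DO UPDATE SET level = EXCLUDED.level;
"""

# User attributes come from the log records (userId arrives as a string in the logs).
user_df = log_df[["userId", "firstName", "lastName", "gender", "level"]]
for _, row in user_df.iterrows():
    cur.execute(user_table_insert, row.tolist())
conn.commit()
```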
song play table
- To build the fact table, join the `songs` and `artists` tables to get the song_id and artist_id in one place.
- Get all the other relevant data from the log data files.
- Insert the data row by row into the `songplays` table I previously created (see the sketch below).
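A sketch of the fact-table step, reusing `log_df`, `cur` and `conn`. The join of `songs` and `artists` is expressed as a lookup query; the query and insert names, their exact shape, and the assumption that `songplay_id` is auto-generated (e.g. SERIAL) are all illustrative guesses about what `sql_queries.py` contains:

```python
import pandas as pd

# Assumed lookup: join songs and artists to resolve IDs by title, artist name and duration.
song_select = """
    SELECT s.song_id, a.artist_id
    FROM songs s
    JOIN artists a ON s.artist_id = a.artist_id
    WHERE s.title = %s AND a.name = %s AND s.duration = %s;
"""

# Assumed INSERT for the fact table (songplay_id assumed to be auto-generated).
songplay_table_insert = """
    INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                           session_id, location, user_agent)
    VALUES (%s, %s, %s, %s, %s, %s, %s, %s);
"""

for _, row in log_df.iterrows():
    # Look up song_id and artist_id for this log record; NULLs if no match is found.
    cur.execute(song_select, (row["song"], row["artist"], row["length"]))
    result = cur.fetchone()
    song_id, artist_id = result if result else (None, None)

    cur.execute(songplay_table_insert, (
        pd.to_datetime(row["ts"], unit="ms"), row["userId"], row["level"],
        song_id, artist_id, row["sessionId"], row["location"], row["userAgent"],
    ))
conn.commit()
```

The lookup is needed because the log files carry only the song title, artist name and duration, not the IDs used by the dimension tables.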
create_tables.py
- Drops and creates the tables in the database
etl.ipynb
- Reads and processes a single file from song_data and log_data and loads the data into tables
etl.py
- Reads and processes all files from song_data and log_data and loads them into the tables; this script runs the full ETL job
sql_queries.py
- Contains all SQL queries and is imported by the three files above
test.ipynb
- Shows the first few rows of each table after the tables have been created and loaded; its main purpose is to validate the ETL process
To create the database and table structure, run the following command:
!python create_tables.py
To parse the log files, run the following command:
!python etl.py