A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.
The purpose of this data engineering project is to create a Postgres database with tables designed to optimise queries for song play analysis. My task is to create a database schema and an ETL pipeline for this analysis.
The first dataset is a subset of real data from the Million Song Dataset.
Each file is in JSON format and contains metadata about a song and the artist of that song.
The files are partitioned by the first three letters of each song's track ID. For example, here is the content of a single song file in this dataset.
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
The second dataset consists of log files in JSON format generated by an event simulator, based on the songs in the dataset above.
These simulate activity logs from a music streaming app based on specified configurations.
The log files in the dataset I'll be working with are partitioned by year and month. For example, here are filepaths to two files in this dataset.
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
I will create a star schema for this project with one fact table and four dimension tables.
songplays (fact table)
- Records in log data associated with song plays, i.e. records with page `NextSong`
- songplay_id INT PRIMARY KEY
- start_time TIMESTAMP
- user_id INT
- level VARCHAR
- song_id VARCHAR
- artist_id VARCHAR
- session_id INT
- location VARCHAR
- user_agent VARCHAR
users (dimension table)
- Users in the app
- user_id INT PRIMARY KEY
- first_name VARCHAR
- last_name VARCHAR
- gender VARCHAR
- level VARCHAR
songs (dimension table)
- Songs in the music database
- song_id VARCHAR PRIMARY KEY
- title VARCHAR
- artist_id VARCHAR
- year INT
- duration NUMERIC
artists (dimension table)
- Artists in the music database
- artist_id VARCHAR PRIMARY KEY
- name VARCHAR
- location VARCHAR
- latitude FLOAT
- longitude FLOAT
time (dimension table)
- Timestamps of records in songplays broken down into specific units
- start_time TIMESTAMP PRIMARY KEY
- hour INT
- day INT
- week INT
- month INT
- year INT
- weekday INT
To create the tables, I first connect to the Sparkify database and then run the CREATE TABLE statements for each of the 5 tables above.
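A minimal sketch of this step, assuming a local Postgres instance and illustrative credentials (the actual connection string and CREATE statements live in `create_tables.py` and `sql_queries.py`):

```python
import psycopg2

# Connection parameters are placeholders; adjust them for your own Postgres setup.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# One of the five CREATE TABLE statements (users shown here as an example).
user_table_create = """
    CREATE TABLE IF NOT EXISTS users (
        user_id    INT PRIMARY KEY,
        first_name VARCHAR,
        last_name  VARCHAR,
        gender     VARCHAR,
        level      VARCHAR
    );
"""
cur.execute(user_table_create)
conn.commit()
```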
I extract all the song data from the JSON files using `get_files`.
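A sketch of how such a helper can be implemented (the data directory name is an assumption):

```python
import glob
import os

def get_files(filepath):
    """Walk a directory tree and collect the absolute paths of every .json file."""
    all_files = []
    for root, dirs, files in os.walk(filepath):
        for f in glob.glob(os.path.join(root, "*.json")):
            all_files.append(os.path.abspath(f))
    return all_files

song_files = get_files("data/song_data")  # directory name is an assumption
```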
songs_data
- Select the columns that I need from the JSON files and turn these columns into a dataframe.
- Then I insert the song data row by row into the `songs` table that I previously created (see the sketch below).
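A sketch of the songs step, reusing `cur`, `conn` and `song_files` from the sketches above; the INSERT statement below (and its conflict policy) is an assumption about what sits in `sql_queries.py`:

```python
import pandas as pd

# Assumed INSERT statement; in the project it lives in sql_queries.py.
song_table_insert = """
    INSERT INTO songs (song_id, title, artist_id, year, duration)
    VALUES (%s, %s, %s, %s, %s)
    ON CONFLICT (song_id) DO NOTHING;
"""

# Read one song file and keep only the columns the songs table needs.
df = pd.read_json(song_files[0], lines=True)
song_data = df[["song_id", "title", "artist_id", "year", "duration"]]

# Insert the song data row by row.
for _, row in song_data.iterrows():
    cur.execute(song_table_insert, row.tolist())
conn.commit()
```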
artists_data
- Select the columns that I need from the JSON files and turn these columns into a dataframe.
- Then I insert the artists' data row by row into the `artists` table that I previously created (sketch below).
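The artists step follows the same pattern with a different column selection, again reusing `df`, `cur` and `conn` from above; the INSERT statement is an assumption:

```python
# Assumed INSERT statement for the artists table.
artist_table_insert = """
    INSERT INTO artists (artist_id, name, location, latitude, longitude)
    VALUES (%s, %s, %s, %s, %s)
    ON CONFLICT (artist_id) DO NOTHING;
"""

# The artist columns come from the same song files, just a different selection.
artist_data = df[["artist_id", "artist_name", "artist_location",
                  "artist_latitude", "artist_longitude"]]
for _, row in artist_data.iterrows():
    cur.execute(artist_table_insert, row.tolist())
conn.commit()
```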
time data
- Select the data in the `ts` column and use `to_datetime` to convert the timestamps from milliseconds to datetimes.
- Use `datetime` functions to break each timestamp into hour, day, week, month, year and weekday.
- Then I insert the time data row by row into the `time` table that I previously created (sketch below).
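A sketch of the time step. It assumes `get_files`, `cur` and `conn` from the earlier sketches, a `time_table_insert` statement analogous to the ones above, and that the log dataframe is first filtered to `NextSong` events:

```python
import pandas as pd

# Assumed INSERT statement for the time table.
time_table_insert = """
    INSERT INTO time (start_time, hour, day, week, month, year, weekday)
    VALUES (%s, %s, %s, %s, %s, %s, %s)
    ON CONFLICT (start_time) DO NOTHING;
"""

log_files = get_files("data/log_data")          # directory name is an assumption
log_df = pd.read_json(log_files[0], lines=True)
log_df = log_df[log_df["page"] == "NextSong"]   # keep song-play events only

# ts is in milliseconds since the epoch; convert it to proper datetimes.
t = pd.to_datetime(log_df["ts"], unit="ms")

# Break each timestamp into the units stored in the time table and insert row by row.
for ts in t:
    cur.execute(time_table_insert,
                (ts, ts.hour, ts.day, ts.isocalendar()[1],
                 ts.month, ts.year, ts.weekday()))
conn.commit()
```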
users data
- Select the columns that I need from the JSON files and turn these columns into a dataframe.
- Then I insert the users' data row by row into the `users` table that I previously created (sketch below).
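A sketch of the users step, reusing `log_df`, `cur` and `conn`; the conflict policy (keeping the latest `level` for returning users) is an assumption rather than something stated above:

```python
# Assumed INSERT; updating level on conflict keeps the latest subscription tier.
user_table_insert = """
    INSERT INTO users (user_id, first_name, last_name, gender, level)
    VALUES (%s, %s, %s, %s, %s)
    ON CONFLICT (user_id) DO UPDATE SET level = EXCLUDED.level;
"""

# User attributes come from the log records (userId arrives as a string in the logs).
user_df = log_df[["userId", "firstName", "lastName", "gender", "level"]]
for _, row in user_df.iterrows():
    cur.execute(user_table_insert, row.tolist())
conn.commit()
```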
song play table
- To build the fact table, join the `songs` and `artists` tables to get the song_id and artist_id in one place.
- Get all the other relevant data from the log data files.
- Insert the data row by row into the `songplays` table I previously created (see the sketch below).
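A sketch of the fact-table step, reusing `log_df`, `cur` and `conn`. The join of `songs` and `artists` is expressed as a lookup query; the query and insert names, their exact shape, and the assumption that `songplay_id` is auto-generated (e.g. SERIAL) are all illustrative guesses about what `sql_queries.py` contains:

```python
import pandas as pd

# Assumed lookup: join songs and artists to resolve IDs by title, artist name and duration.
song_select = """
    SELECT s.song_id, a.artist_id
    FROM songs s
    JOIN artists a ON s.artist_id = a.artist_id
    WHERE s.title = %s AND a.name = %s AND s.duration = %s;
"""

# Assumed INSERT for the fact table (songplay_id assumed to be auto-generated).
songplay_table_insert = """
    INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                           session_id, location, user_agent)
    VALUES (%s, %s, %s, %s, %s, %s, %s, %s);
"""

for _, row in log_df.iterrows():
    # Look up song_id and artist_id for this log record; NULLs if no match is found.
    cur.execute(song_select, (row["song"], row["artist"], row["length"]))
    result = cur.fetchone()
    song_id, artist_id = result if result else (None, None)

    cur.execute(songplay_table_insert, (
        pd.to_datetime(row["ts"], unit="ms"), row["userId"], row["level"],
        song_id, artist_id, row["sessionId"], row["location"], row["userAgent"],
    ))
conn.commit()
```

The lookup is needed because the log files carry only the song title, artist name and duration, not the IDs used by the dimension tables.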
create_tables.py
- Drops and creates the tables in the database
etl.ipynb
- Reads and processes a single file from song_data and log_data and loads the data into tables
etl.py
- Reads and processes all files from song_data and log_data and loads them into the tables; this script runs the full ETL job
sql_queries.py
- Contains all SQL queries and is imported by the three files above
test.ipynb
- Shows the first few rows of each table after the tables have been created and loaded; its main purpose is to validate the ETL process
To create the database and table structure, run the following command:
!python create_tables.py
To parse the log files, run the following command:
!python etl.py