Udacity Data Engineering Nanodegree - Data Modelling with Postgres

Purpose

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.

The purpose of this data engineering project is to create a Postgres database with tables designed to optimise queries for song play analysis. My task is to create a database schema and ETL pipeline for this analysis.

Datasets

Song Dataset


The first dataset is a subset of real data from the Million Song Dataset.
Each file is in JSON format and contains metadata about a song and the artist of that song.
The files are partitioned by the first three letters of each song's track ID. For example, below is the content of a single song file in this dataset.

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

Log Dataset


The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above.
These simulate activity logs from a music streaming app based on specified configurations.
The log files in the dataset I'll be working with are partitioned by year and month. For example, here are filepaths to two files in this dataset.

log_data/2018/11/2018-11-12-events.json

log_data/2018/11/2018-11-13-events.json

Schema

I will create a star schema for this project with one fact table and four dimension tables.

Fact Table

songplays table

  • Records in log data associated with song plays i.e. records with page NextSong
  1. songplay_id INT PRIMARY KEY
  2. start_time TIMESTAMP
  3. user_id INT
  4. level VARCHAR
  5. song_id VARCHAR
  6. artist_id VARCHAR
  7. session_id INT
  8. location VARCHAR
  9. user_agent VARCHAR
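
As an illustration only, the songplays definition above could be written in sql_queries.py roughly like this. It is a sketch using the column types listed above; the actual project query may add constraints such as NOT NULL or use a SERIAL key.

# Sketch of the songplays fact table DDL, kept as a Python string
# as in sql_queries.py. Column types follow the list above.
songplay_table_create = ("""
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INT PRIMARY KEY,
        start_time  TIMESTAMP,
        user_id     INT,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
""")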

Dimension Tables

users

  • Users in the app
  1. user_id INT PRIMARY KEY
  2. first_name VARCHAR
  3. last_name VARCHAR
  4. gender VARCHAR
  5. level VARCHAR

songs

  • Songs in music database
  1. song_id VARCHAR PRIMARY KEY
  2. title VARCHAR
  3. artist_id VARCHAR
  4. year INT
  5. duration NUMERIC

artists

  • Artists in music database
  1. artist_id VARCHAR PRIMARY KEY
  2. name VARCHAR
  3. location VARCHAR
  4. latitude FLOAT
  5. longitude FLOAT

time

  • Timestamps of records in songplays broken down into specific units
  1. start_time TIMESTAMP PRIMARY KEY
  2. hour INT
  3. day INT
  4. week INT
  5. month INT
  6. year INT
  7. weekday INT
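
The dimension tables follow the same pattern. For example, the time table above could be sketched as follows (again an illustration, not the project's exact query):

# Sketch of the time dimension table DDL, mirroring the column list above.
time_table_create = ("""
    CREATE TABLE IF NOT EXISTS time (
        start_time TIMESTAMP PRIMARY KEY,
        hour    INT,
        day     INT,
        week    INT,
        month   INT,
        year    INT,
        weekday INT
    );
""")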

ETL Processes

To create the tables, I first connect to the Sparkify database, then run the CREATE TABLE statements to create the five tables above.
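
A minimal sketch of that flow, assuming a standard local Sparkify setup; the connection string and the create_table_queries / drop_table_queries names are assumptions about how sql_queries.py is organised:

import psycopg2
from sql_queries import create_table_queries, drop_table_queries

# Connect to the Sparkify database (host/user/password here are assumptions
# about the local setup used in this project).
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
conn.set_session(autocommit=True)
cur = conn.cursor()

# Drop any existing tables, then create the five tables described above.
for query in drop_table_queries + create_table_queries:
    cur.execute(query)

conn.close()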

Dimension Tables

songs and artists tables


I extract all the song data from the JSON files using `get_files` (see the sketch after this list).
  1. songs_data
  • Select the columns that I need from the JSON files and turn these columns into a dataframe.
  • Then I insert all the song data row by row into the songs table that I previously created.
  2. artists_data
  • Select the columns that I need from the JSON files and turn these columns into a dataframe.
  • Then I insert all the artists' data row by row into the artists table that I previously created.
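
A sketch of how one song file might be processed, assuming pandas and the psycopg2 cursor from above; song_table_insert and artist_table_insert are assumed names for the INSERT statements in sql_queries.py, and filepath is a placeholder:

import pandas as pd

# Each song file holds one JSON record per line.
df = pd.read_json(filepath, lines=True)

# songs_data: keep only the song columns and insert the row into songs.
song_data = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()
cur.execute(song_table_insert, song_data)

# artists_data: keep only the artist columns and insert the row into artists.
artist_data = df[["artist_id", "artist_name", "artist_location",
                  "artist_latitude", "artist_longitude"]].values[0].tolist()
cur.execute(artist_table_insert, artist_data)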

time and users tables

  1. time data
  • Select the data in the ts column and use to_datetime to convert the timestamps from milliseconds to datetime.
  • Use datetime functions to break each timestamp into hour, day, week, month, year, and weekday.
  • Then I insert all the time data row by row into the time table that I previously created.
  2. users data
  • Select the columns that I need from the JSON files and turn these columns into a dataframe.
  • Then I insert all the users' data row by row into the users table that I previously created (both steps are sketched below).
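
Continuing the sketch for one log file; the column names (ts, userId, firstName, and so on) and the insert query names are assumptions about the log schema and sql_queries.py:

# Keep only NextSong events, as described for the songplays table above.
df = pd.read_json(filepath, lines=True)
df = df[df["page"] == "NextSong"]

# time data: convert ts from milliseconds to datetime, then break it into units.
t = pd.to_datetime(df["ts"], unit="ms")
time_df = pd.DataFrame({
    "start_time": t,
    "hour": t.dt.hour,
    "day": t.dt.day,
    "week": t.dt.isocalendar().week,
    "month": t.dt.month,
    "year": t.dt.year,
    "weekday": t.dt.weekday,
})
for _, row in time_df.iterrows():
    cur.execute(time_table_insert, list(row))

# users data: keep only the user columns and insert row by row.
user_df = df[["userId", "firstName", "lastName", "gender", "level"]]
for _, row in user_df.iterrows():
    cur.execute(user_table_insert, list(row))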

Fact Table

songplays table

  1. To create the fact table, I join the songs and artists tables to get the song_id and artist_id in one place (sketched below).
  2. Get all the other relevant data from the log data file.
  3. Insert the data row by row into the songplays table that I previously created.
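
A sketch of those three steps, continuing from the log dataframe above; song_select and songplay_table_insert are assumed names for queries in sql_queries.py:

# song_select joins songs and artists so song_id and artist_id are in one place.
song_select = ("""
    SELECT s.song_id, a.artist_id
    FROM songs s
    JOIN artists a ON s.artist_id = a.artist_id
    WHERE s.title = %s AND a.name = %s AND s.duration = %s;
""")

for _, row in df.iterrows():
    # Look up song_id and artist_id for this log record, if the song is known.
    cur.execute(song_select, (row.song, row.artist, row.length))
    result = cur.fetchone()
    song_id, artist_id = result if result else (None, None)

    # Combine the IDs with the remaining log fields and insert into songplays.
    songplay_data = (pd.to_datetime(row.ts, unit="ms"), row.userId, row.level,
                     song_id, artist_id, row.sessionId, row.location, row.userAgent)
    cur.execute(songplay_table_insert, songplay_data)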

File Structure

create_tables.py - Drops and creates the tables in the database

etl.ipynb - Reads and processes a single file from song_data and log_data and loads the data into tables

etl.py - Reads and processes all the files from song_data and log_data and loads them into the tables; this script is responsible for the ETL job

sql_queries.py - Contains all SQL queries, and is imported into the three files above

test.ipynb - Displays the first few rows of each table after the tables have been created; the main purpose of this file is to validate the process

Running the Python Scripts


To create the database and table structure, run the following command:

!python create_tables.py

To run the ETL pipeline over the song and log files, run the following command:

!python etl.py
