sparkify

This project is intended to analyze data for a hypothetical start-up called Sparkify. This music streaming start-up wants to analyze their song- and log-related data in a more efficient way.

More specifically, the goal of this project is to build a PostgreSQL database and establish an ETL process to optimize the querying of their (JSON) song- and log-related data. This in turn will facilitate the analysis of their data.

Database schema

The implemented database schema is shown in the ER diagram below.

The Entity Relationship Diagram (ERD) above is a star schema in which the facts (or metrics) are represented by the songplays relation. This places the analysis of log and song data at the heart of the business. The dimensions of the Sparkify business branch out from the songplays relation: users, artists, songs and time. Each of these relations represents a core aspect of the business.
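The star schema described above can be sketched as a list of CREATE TABLE statements, one fact table surrounded by four dimension tables. This is only an illustration: the column names are assumptions (this README names the relations but not their columns), and sqlite3 is used as a stand-in so the sketch runs without a PostgreSQL server. The actual statements in sql_queries.py may differ.

```python
import sqlite3

# Hypothetical columns: only the five relation names come from this README.
CREATE_TABLE_QUERIES = [
    # Dimension tables
    """CREATE TABLE IF NOT EXISTS users (
        user_id INTEGER PRIMARY KEY,
        first_name TEXT,
        last_name TEXT,
        level TEXT
    )""",
    """CREATE TABLE IF NOT EXISTS artists (
        artist_id TEXT PRIMARY KEY,
        name TEXT
    )""",
    """CREATE TABLE IF NOT EXISTS songs (
        song_id TEXT PRIMARY KEY,
        title TEXT,
        artist_id TEXT REFERENCES artists (artist_id)
    )""",
    """CREATE TABLE IF NOT EXISTS time (
        start_time TEXT PRIMARY KEY,
        hour INTEGER,
        weekday INTEGER
    )""",
    # Fact table: each row is one song play and points at every dimension
    """CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INTEGER PRIMARY KEY,
        start_time TEXT REFERENCES time (start_time),
        user_id INTEGER REFERENCES users (user_id),
        song_id TEXT REFERENCES songs (song_id),
        artist_id TEXT REFERENCES artists (artist_id)
    )""",
]


def create_tables(conn):
    """Run every CREATE TABLE statement and commit."""
    cur = conn.cursor()
    for query in CREATE_TABLE_QUERIES:
        cur.execute(query)
    conn.commit()


# In-memory SQLite database as a stand-in for the PostgreSQL connection.
conn = sqlite3.connect(":memory:")
create_tables(conn)
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```

Keeping the foreign keys only on songplays is what gives the schema its star shape: analytical queries start from the fact table and join outwards to whichever dimension they need.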

ETL pipeline

To pipe Sparkify's JSON data into the PostgreSQL database, an ETL process was established from the data source. Once the relations had been created, the schema shown above was implemented by building and filtering pandas.DataFrame objects, combined with a series of INSERT and SELECT SQL statements.
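The extract, filter and insert steps can be sketched roughly as follows. This is a simplified illustration rather than the code in etl.py: the log field names and sample records are hypothetical, plain Python lists take the place of pandas.DataFrame objects, and sqlite3 stands in for PostgreSQL so that the example runs with the standard library only.

```python
import json
import sqlite3
from datetime import datetime, timezone

# Hypothetical log lines, one JSON object per line; the field names are assumptions.
RAW_LOG = """\
{"page": "NextSong", "ts": 1541106106796, "userId": "8", "level": "free", "song": "Song A"}
{"page": "Home", "ts": 1541106132796, "userId": "8", "level": "free", "song": null}
{"page": "NextSong", "ts": 1541106352796, "userId": "10", "level": "paid", "song": "Song B"}
"""


def extract(raw):
    """Parse newline-delimited JSON into a list of dicts."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def transform(records):
    """Keep only song-play events and convert the millisecond timestamp."""
    rows = []
    for rec in records:
        if rec["page"] != "NextSong":
            continue
        start = datetime.fromtimestamp(rec["ts"] / 1000, tz=timezone.utc)
        rows.append((start.isoformat(), int(rec["userId"]), rec["level"], rec["song"]))
    return rows


def load(conn, rows):
    """Create the target table if needed and insert the transformed rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS songplays ("
        "songplay_id INTEGER PRIMARY KEY, start_time TEXT, "
        "user_id INTEGER, level TEXT, song TEXT)"
    )
    conn.executemany(
        "INSERT INTO songplays (start_time, user_id, level, song) VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_LOG)))
loaded = conn.execute(
    "SELECT user_id, level, song FROM songplays ORDER BY songplay_id"
).fetchall()
print(loaded)
```

With psycopg2 the same pattern applies, except that query parameters use %s placeholders instead of the ? placeholders sqlite3 expects.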

Dependencies

The following libraries need to be installed for the code to work: numpy, pandas and psycopg2.

Use the package manager pip to install any of these libraries if needed.

pip install numpy
pip install pandas
pip install psycopg2

Authors

The author of this repo is me, Raul Bermejo. I built it as part of the Data Engineer program at Udacity.

Usage

So far the usage of this project is minimal, and therefore there are no examples. This section will be updated when enough progress has been made.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.
