Reddit Data Pipeline Engineering | AWS End to End Data Engineering
Tools & Services
Apache Airflow (version 2.8.1)
Amazon Redshift data warehouse
AWS Glue
Amazon S3 bucket
Amazon Athena
PostgreSQL
Prerequisites
Python 3.9 or higher
AWS Account with permissions for S3, Glue, Athena, and Redshift
Reddit API credentials
Docker installed
Set-Up
# Change the VS Code Python interpreter to the virtual environment's Python (3.9.6 under venv)
pip freeze > requirements.txt # make sure the requirements in this virtual environment are the same as those listed
docker compose up -d --build # build the images, then create and start the containers defined in docker-compose.yml in detached mode
# Airflow is deployed in Docker
Project Introduction
Connect Airflow to a Reddit instance in the cloud
Push the data from Airflow to an S3 bucket, then connect AWS Glue to that bucket
Transform the data
Query and visualize the results
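The transform step that sits between extraction and the S3 upload could look roughly like the sketch below. It is an illustration only, not the code from this repository: the field list, the function names (transform_post, posts_to_csv, s3_key_for) and the raw/<subreddit>/ key layout are all assumptions made for the example. It uses only the Python standard library, so the upload itself is left out.

```python
import csv
import io
from datetime import datetime, timezone

# Assumed subset of Reddit post fields to keep; the real pipeline may differ.
POST_FIELDS = ["id", "title", "score", "num_comments", "author", "created_utc"]


def transform_post(raw: dict) -> dict:
    """Keep only the chosen fields and normalize the title and timestamp."""
    row = {field: raw.get(field) for field in POST_FIELDS}
    row["title"] = (row["title"] or "").strip()
    # Reddit reports created_utc as epoch seconds; store it as ISO 8601 in UTC.
    row["created_utc"] = datetime.fromtimestamp(
        float(raw["created_utc"]), tz=timezone.utc
    ).isoformat()
    return row


def posts_to_csv(posts) -> str:
    """Serialize transformed posts to CSV text, ready to be written to S3."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=POST_FIELDS, lineterminator="\n")
    writer.writeheader()
    for post in posts:
        writer.writerow(transform_post(post))
    return buffer.getvalue()


def s3_key_for(subreddit: str, run_date: datetime) -> str:
    """Build a hypothetical object key such as raw/<subreddit>/reddit_YYYYMMDD.csv."""
    return f"raw/{subreddit}/reddit_{run_date:%Y%m%d}.csv"
```

In an Airflow task, the CSV text returned by posts_to_csv would be uploaded to the bucket under the key from s3_key_for, after which a Glue crawler or job can pick it up.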
reddit_dag.py
If you run an Airflow instance in Docker, by the time you start the DAG,