- Apache Airflow (—version 2.8.1)
- AWS Redshift data Warehouse
- AWS Glue
- Amazon S3 bucket
- Amazon Athena
- PostgreSQL
- Python 3.9 or higher
- AWS Account with permissions for S3, Glue, Athena, and Redshift
- Reddit API credentials
- Docker Installed
# change VScode python interpreter to the virtual environment python (3.9.6 under venv)
pip freeze > requirements.txt # make sure the requirements in this virtual environment is same as mentioned
docker compose up -d --build # create containers defined in docker-compose.yml file; Build images before starting containers.
# Airflow deployed in Docker
- Connect Airflow to Reddit instance on the cloud
- Push the data from Airflow → S3 bucket → Connect AWS Glue
- Perform manipulation on these data
- Querying & Visualization
- If you run Airflow instance in Docker, by the time you start the dag,