This project demonstrates how to use Apache Airflow to orchestrate a daily ETL pipeline: it fetches weather data from the free OpenWeather API, cleans and validates it, and loads it into a PostgreSQL database. The result is a reproducible example that can be extended for downstream work such as training ML models on weather-related datasets or powering dashboard visualizations.
To run this project:
- Follow the steps in the Setup Instructions section in the specified order, as the sequence is crucial to properly configuring your environment, database, and connections.
- Download or clone the DAG and related code, preserving the exact structure described in the Project Structure section.
- Use the Airflow UI to trigger and monitor the pipeline execution (a CLI alternative is sketched below).
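If you prefer the terminal, the same run can be triggered and watched with the Airflow CLI (Airflow 2.x commands; the DAG id matches the screenshots below):

```bash
# Enable the daily schedule, then kick off and inspect a manual run
airflow dags unpause weather_data_pipeline
airflow dags trigger weather_data_pipeline
airflow dags list-runs -d weather_data_pipeline
```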
Below are some screenshots showcasing the Airflow UI and the `weather_data_pipeline` DAG in action. These visuals provide an overview of how the pipeline is orchestrated within Apache Airflow.
| Airflow Dashboard | Pipeline Execution Detail |
|---|---|
| Pipeline Graph View | Pipeline XCom |
- Airflow Dashboard (AirFlow01): Displays the main Airflow UI with an active `weather_data_pipeline` DAG.
- Pipeline Execution Logs (AirFlow02): Showcases task logs and data retrieved from the OpenWeather API.
- Pipeline Graph View (AirFlow03): Visualizes the flow of tasks in the pipeline, including `load_locations`, `fetch_weather_data`, and `insert_weather_data` (see the sketch after this list).
- Pipeline DAG Details (AirFlow04): Summarizes DAG execution details such as run duration, task statuses, and DAG configuration.
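For orientation, here is a minimal sketch of how the three tasks named above could be wired together with Airflow's TaskFlow API. It is illustrative only: the connection id, CSV columns, and cleaning logic are assumptions, and the real implementation lives in `dags/ML_Data_ETL_dag.py`.

```python
# Illustrative sketch only -- the real DAG lives in dags/ML_Data_ETL_dag.py.
import csv
from pathlib import Path

import pendulum
import requests
from airflow.decorators import dag, task
from airflow.hooks.base import BaseHook

# Airflow 2.4+ uses `schedule`; older 2.x versions use `schedule_interval`
@dag(schedule="@daily", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def weather_data_pipeline():

    @task
    def load_locations() -> list[dict]:
        # CSV -> list of dicts (JSON-serializable, so it can travel via XCom)
        csv_path = Path(__file__).parent / "data" / "locations.csv"
        with open(csv_path, newline="") as f:
            return list(csv.DictReader(f))

    @task
    def fetch_weather_data(locations: list[dict]) -> list[dict]:
        # API key stored as an Airflow connection (connection id assumed here)
        api_key = BaseHook.get_connection("openweather_api").password
        results = []
        for loc in locations:
            resp = requests.get(
                "https://api.openweathermap.org/data/2.5/weather",
                params={"lat": loc["lat"], "lon": loc["lon"], "appid": api_key},
                timeout=10,
            )
            resp.raise_for_status()
            results.append(resp.json())
        return results

    @task
    def insert_weather_data(records: list[dict]) -> None:
        # Cleaning, validation, and the SQLAlchemy insert happen here
        ...

    insert_weather_data(fetch_weather_data(load_locations()))

weather_data_pipeline()
```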
- Automated Orchestration: Daily scheduled runs controlled by Airflow.
- Data Cleaning & Validation: Ensures consistent, reliable data for analysis.
- Modular Design: Separate files for DAGs, database connection, and data processing.
- Easily Extensible: Add more cities, transformations, or ML tasks as needed.
```text
┌─────────────────┐
│ OpenWeather API │
└──────┬──────────┘
       │ (Fetch JSON)
       v
┌───────────────────┐
│    Airflow DAG    │
│ (weather_pipeline)│
└──────┬────────────┘
       │ (Tasks Execution)
       v
┌─────────────────┐
│ Load Locations  │
│  (CSV to JSON)  │
└──────┬──────────┘
       │
       v
┌─────────────────────┐
│ Fetch Weather Data  │
│   (API Requests)    │
└──────┬──────────────┘
       │
       v
┌───────────────────┐
│   Insert to DB    │
│ (PostgreSQL Table)│
└──────┬────────────┘
       │ (SQLAlchemy engine)
       v
┌─────────────────┐
│  PostgreSQL DB  │
└─────────────────┘
```
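The last hop in the diagram (loading into PostgreSQL through a SQLAlchemy engine) could look roughly like the sketch below; the table name, columns, and the use of pandas are assumptions, not the project's actual schema:

```python
# Sketch of the insert step; table name and pandas usage are assumptions.
import pandas as pd
from sqlalchemy import create_engine

def insert_weather_rows(records: list[dict], conn_uri: str) -> None:
    # e.g. conn_uri = "postgresql://user:password@localhost:5432/weather_db"
    engine = create_engine(conn_uri)
    df = pd.json_normalize(records)   # flatten nested API JSON into columns
    df = df.drop_duplicates()         # basic cleaning before loading
    df.to_sql("weather_data", engine, if_exists="append", index=False)
```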
Apache Airflow is not fully supported on Windows. It is recommended to run Airflow on a Linux-based system.
- If you're using Windows, you can install and run Airflow inside the Windows Subsystem for Linux (WSL); an example install is sketched below.
- Alternatively, you can run Airflow in a Linux virtual machine (VM) or in Docker on Windows.
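As one example, on Windows you could enable WSL and install Airflow inside it using the official constraints file (the version pins below are illustrative, not project requirements):

```bash
# In PowerShell as administrator: install WSL with the default Ubuntu distro
wsl --install

# Inside the WSL shell: install Airflow pinned to its published constraints
AIRFLOW_VERSION=2.9.3
PYTHON_VERSION="$(python3 --version | cut -d ' ' -f 2 | cut -d '.' -f 1-2)"
pip install "apache-airflow==${AIRFLOW_VERSION}" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
```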
To set up the project, follow the instructions in the Setup folder in the order listed below.
1. Airflow Setup: Step-by-step instructions for installing and configuring Apache Airflow.
2. Database Setup: Guide for setting up the PostgreSQL database, including creating schemas and tables.
3. Folder Setup: Instructions for verifying and setting up the required folder structure for your Airflow project.
4. IDE Setup: Guide for installing and configuring Visual Studio Code (or another IDE) for developing Airflow DAGs and helper scripts.
5. Airflow Database Connection Setup: Detailed steps for securely setting up database connections within Airflow.
6. API Connection Setup: Instructions for obtaining an OpenWeather API key and securely adding it as an Airflow connection (a CLI sketch of both connection types follows this list).
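As a preview of what the last two guides set up, both connections can also be created from the Airflow CLI. The connection ids and credentials below are placeholders; use the values from the Setup guides:

```bash
# Placeholder ids and credentials -- substitute your real values
airflow connections add 'postgres_weather' \
    --conn-uri 'postgresql://airflow_user:airflow_pass@localhost:5432/weather_db'

airflow connections add 'openweather_api' \
    --conn-type 'http' \
    --conn-host 'https://api.openweathermap.org' \
    --conn-password '<YOUR_OPENWEATHER_API_KEY>'
```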
The following structure represents how the project should be organized on your local machine where the Airflow pipeline is running. It includes essential folders and files required for executing the ETL pipeline, managing logs, and configuring Airflow. Files related to repository setup and documentation (such as setup guides) are not included here.
```text
AIRFLOW/
├── dags/                        # Airflow DAGs for ETL orchestration
│   ├── data/                    # Directory for data-related files
│   │   └── locations.csv        # CSV file containing city location data
│   └── ML_Data_ETL_dag.py       # Main Airflow DAG script
│
├── logs/                        # Airflow logs for debugging and tracking
│   ├── dag_id=weather_data_pipeline   # Logs for specific DAG execution
│   ├── dag_processor_manager          # Logs for Airflow's DAG processor
│   └── scheduler                      # Logs for the Airflow scheduler
│
├── airflow.cfg                  # Airflow configuration file
├── airflow.db                   # SQLite database for Airflow metadata (development only)
├── airflow-webserver.pid        # Process ID file for the Airflow webserver
├── webserver_config.py          # Airflow webserver configuration
│
├── README.md                    # Project documentation
└── requirements.txt             # Python dependencies
```
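The cities fetched by the pipeline come from `dags/data/locations.csv`. Its exact columns depend on the DAG, but a minimal version might look like this (headers assumed for illustration):

```csv
city,lat,lon
London,51.5074,-0.1278
Berlin,52.5200,13.4050
New York,40.7128,-74.0060
```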
Contributions are welcome and appreciated! To contribute to this project, please follow these steps:
1. Fork the Repository: Click the "Fork" button on the GitHub page of this repository to create a copy under your own account.

2. Create a New Branch:

   ```bash
   git checkout -b feature/your-feature-name
   ```

   Choose a clear, descriptive name for your branch that reflects the changes you're making.

3. Make Your Changes:
   - Add or modify code, tests, or documentation as needed.
   - Ensure that your code adheres to the style and format defined by this project (PEP 8 for Python).
   - If you are adding new features, include tests or update existing tests to maintain coverage and confirm that your additions work as intended.

4. Run Tests:

   ```bash
   # Example test command
   pytest tests/
   ```

   Make sure all tests pass and there are no regressions.

5. Commit Your Changes:

   ```bash
   git add .
   git commit -m "Add your commit message here"
   ```

   Write clear and concise commit messages that explain what your changes do.

6. Push and Open a Pull Request:

   ```bash
   git push origin feature/your-feature-name
   ```

   Go to your forked repository on GitHub and open a Pull Request (PR) against the main branch of this repository. Describe your changes, why they're needed, and how to test them.

7. Code Review and Feedback: Be open to feedback and make the requested changes where applicable.

8. Merge: Once your PR is approved, it will be merged into the main branch.
Note: If you’re unsure about any aspect of your contribution or would like to propose an idea before coding, feel free to open an issue first. Constructive discussion helps ensure we move in a direction that benefits the entire community.