This project implements a fully automated MLOps pipeline for a K-Means clustering model using Apache Airflow. The pipeline ingests data, preprocesses it, performs feature engineering, trains the model, and logs results, all in an event-driven workflow.
The data comes from a Kaggle e-commerce dataset, and the project focuses on:
✅ Automating data ingestion & preprocessing
✅ Ensuring model reproducibility with DVC & MLflow
✅ Deploying Apache Airflow for workflow orchestration
✅ Continuous monitoring & retraining based on parameter changes
To build a scalable MLOps pipeline, the following technologies were integrated:
| Technology | Purpose |
|---|---|
| Apache Airflow 🏗️ | Workflow orchestration & scheduling |
| DVC (Data Version Control) 📊 | Data tracking & versioning for reproducibility |
| MLflow 🔎 | Model tracking, experiment logging, and evaluation |
| Python, Pandas, Scikit-learn 🐍 | Data processing & K-Means model training |
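To make the stack concrete, here is a minimal sketch of what the training-and-logging step could look like with Scikit-learn and MLflow. The file path, feature columns, and `n_clusters` value are illustrative assumptions, not the exact ones used in this repo.

```python
# Minimal sketch: fit a K-Means model and log parameters/metrics to MLflow.
# Path, column names, and n_clusters are illustrative assumptions.
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("data/processed/customers.csv")       # hypothetical path
X = StandardScaler().fit_transform(
    df[["recency", "frequency", "monetary"]]            # hypothetical features
)

with mlflow.start_run(run_name="kmeans_training"):
    n_clusters = 4                                       # assumed hyperparameter
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X)

    mlflow.log_param("n_clusters", n_clusters)
    mlflow.log_metric("inertia", model.inertia_)
    mlflow.log_metric("silhouette_score", silhouette_score(X, model.labels_))
    mlflow.sklearn.log_model(model, "kmeans_model")      # save the fitted model
```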
This pipeline is event-driven and continuously monitors dataset & parameter updates.
1️⃣ Airflow detects dataset/parameter changes 📡
2️⃣ Triggers DVC to ensure data version consistency 📂
3️⃣ Feature engineering & preprocessing are executed 🔄
4️⃣ Retrains the K-Means model on the latest data 🎯
5️⃣ MLflow logs metrics & hyperparameters for monitoring 📈
```mermaid
graph TD;
    A[Airflow Scheduler] -->|Detects Changes| B[DVC Data Versioning];
    B -->|Triggers DAG Run| C[Feature Engineering];
    C -->|Prepares Data| D[Model Training];
    D -->|Trains & Evaluates Model| E[MLflow Tracking];
    E -->|Logs Metrics| F[Performance Dashboard];
```
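A minimal Airflow DAG mirroring this flow could look like the sketch below. The task callables are placeholders, and only the `ml_pipeline` DAG id is taken from the trigger command shown in the setup steps further down.

```python
# Sketch of an Airflow DAG for the flow above (e.g. dags/ml_pipeline.py).
# The three callables are placeholders for the repo's actual task logic.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def pull_data():            # e.g. sync the DVC-tracked dataset
    ...

def build_features():       # preprocessing & feature engineering
    ...

def train_model():          # fit K-Means and log the run to MLflow
    ...


with DAG(
    dag_id="ml_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule=None,          # run only when triggered by detected changes (Airflow 2.4+ API)
    catchup=False,
):
    (
        PythonOperator(task_id="dvc_versioning", python_callable=pull_data)
        >> PythonOperator(task_id="feature_engineering", python_callable=build_features)
        >> PythonOperator(task_id="train_model", python_callable=train_model)
    )
```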
```bash
# Clone the repository and install dependencies
git clone https://github.com/muntakim1/machine-learning-pipeline-airflow-mlflow-dvc.git
cd machine-learning-pipeline-airflow-mlflow-dvc
pip install -r requirements.txt

# Initialize DVC and start the MLflow tracking server
dvc init
mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./mlruns

# Initialize Airflow, then start the scheduler & webserver
airflow db init
airflow scheduler & airflow webserver

# Trigger the pipeline DAG
airflow dags trigger ml_pipeline
```
- Automates model retraining based on data drift detection (see the drift-check sketch after this list)
- Ensures model reproducibility with DVC & MLflow
- Scales efficiently with Airflow's DAG scheduling
- Continuous monitoring of metrics & hyperparameters
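The exact drift test is not documented in this repo; the sketch below shows one common, lightweight approach (a per-feature two-sample Kolmogorov-Smirnov test) that a retraining trigger could use. The file paths and threshold are hypothetical.

```python
# Illustrative drift check that could gate retraining; the repo's actual
# drift detection may differ. Uses a per-feature Kolmogorov-Smirnov test.
import pandas as pd
from scipy.stats import ks_2samp


def has_drifted(reference: pd.DataFrame, current: pd.DataFrame,
                p_threshold: float = 0.05) -> bool:
    """Return True if any shared numeric column shows significant drift."""
    for col in reference.columns.intersection(current.columns):
        if pd.api.types.is_numeric_dtype(reference[col]):
            _, p_value = ks_2samp(reference[col].dropna(), current[col].dropna())
            if p_value < p_threshold:
                return True
    return False


# Example (hypothetical paths): compare the last training snapshot with new data
# reference = pd.read_csv("data/reference.csv")
# current = pd.read_csv("data/latest.csv")
# if has_drifted(reference, current):
#     ...trigger the ml_pipeline DAG to retrain
```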
✅ Fully automated end-to-end MLOps pipeline
✅ Dynamic parameter tuning for real-time model updates
✅ Seamless data & model tracking with DVC & MLflow
✅ Scalable Airflow DAGs for production-ready workflows
I am actively seeking Data Science roles where I can leverage my expertise in Machine Learning, MLOps, and Automation to build scalable AI solutions.
📩 Let’s connect! LinkedIn | Email
This project is licensed under the MIT License.
🔹 Contributions are welcome! Feel free to fork this repo, submit issues, and create pull requests! 🚀
#DataScience #MachineLearning #MLOps #ApacheAirflow #DVC #MLflow #AI #Automation #Hiring #OpenToWork 🚀