This repository demonstrates a machine learning pipeline that predicts salary from years of experience with a linear regression model. The pipeline uses DVC for data and model versioning, Airflow for parameter scanning, and MLflow for tracking model performance.
- Overview
- Project Structure
- Features
- Tools and Technologies
- Setup and Installation
- Usage
- Contributing
- License
The project focuses on:
- Preprocessing: Cleaning and transforming the dataset for linear regression.
- Training: Developing a linear regression model to predict salaries.
- Evaluation: Assessing model performance with metrics like MAE, MSE, and R².
- Versioning and Tracking: Using DVC to manage data/model versions and MLflow to log and track experiment performance.
- Parameter Scanning: Automating hyperparameter exploration with Airflow DAGs.
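
The training and evaluation steps above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not the code in scripts/: the function name `train_and_evaluate`, the split ratio, and the synthetic data are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def train_and_evaluate(years, salaries, test_size=0.25, seed=0):
    """Fit a linear regression on years of experience and score it on a held-out split."""
    X = np.asarray(years, dtype=float).reshape(-1, 1)  # scikit-learn expects a 2-D feature array
    y = np.asarray(salaries, dtype=float)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    predictions = model.predict(X_test)
    metrics = {
        "mae": mean_absolute_error(y_test, predictions),
        "mse": mean_squared_error(y_test, predictions),
        "r2": r2_score(y_test, predictions),
    }
    return model, metrics


# Synthetic stand-in for the real dataset (values are made up for illustration).
rng = np.random.default_rng(0)
years = rng.uniform(1, 10, size=40)
salaries = 30_000 + 9_000 * years + rng.normal(0, 3_000, size=40)

model, metrics = train_and_evaluate(years, salaries)
print(metrics)
```

The returned dictionary is the kind of payload that could be written to metrics/ or logged to MLflow.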
.
├── data/ # Raw and processed datasets
├── scripts/ # Source code for preprocessing, training, and evaluation
├── models/ # Saved models
├── metrics/ # Saved metrics
├── dvc.yaml # DVC pipeline configuration
├── params.yaml # Model and processing configuration
├── dags/ # Airflow DAGs for parameter scanning
├── mlflow_logs/ # MLflow tracking logs
├── requirements.txt # Python dependencies
├── README.md # Project documentation
└── LICENSE # License file
Key features:
- Data Version Control: Track raw and processed datasets using DVC.
- Pipeline Automation: Modular steps for preprocessing, training, and evaluation.
- Parameter Scanning: Airflow DAGs to explore hyperparameters such as learning rate and batch size.
- Model Tracking: MLflow integration to log metrics, parameters, and model artifacts.
- Reproducibility: End-to-end pipeline ensures reproducible results.
Tools and technologies:
- DVC: Data versioning and pipeline management.
- Airflow: Workflow orchestration and parameter scanning.
- MLflow: Experiment tracking and model lifecycle management.
- scikit-learn: Machine learning library for linear regression.
- pandas: Data manipulation and preprocessing.
- Python: Primary programming language.
To set up and install the project:
- Clone the repository:

      git clone https://github.com/muntakim1/yearsexperience-salary-pipeline.git
      cd yearsexperience-salary-pipeline

- Install dependencies:

      pip install -r requirements.txt

- Initialize DVC:

      dvc init

- Set up Airflow:

      airflow db init
      airflow scheduler &
      airflow webserver

- Configure MLflow:

      mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./mlruns

Run the complete pipeline using DVC:

    dvc repro

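
`dvc repro` re-runs only the stages whose dependencies have changed, which DVC detects by comparing content hashes against those recorded in dvc.lock. The toy sketch below illustrates that idea only; it is not DVC's implementation, and the helper names `file_md5` and `stage_is_stale`, the file name, and the CSV values are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path):
    """Return the hex MD5 digest of a file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def stage_is_stale(deps, recorded):
    """Return True if any dependency's current hash differs from the recorded one."""
    return any(file_md5(p) != recorded.get(str(p)) for p in deps)


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "salary.csv"  # hypothetical dependency file
    data.write_text("YearsExperience,Salary\n1.0,40000\n")  # made-up row

    # Record the hash, as a lock file would after a successful run.
    recorded = {str(data): file_md5(data)}
    unchanged = stage_is_stale([data], recorded)  # nothing edited yet

    # Editing the data changes its hash, so the stage would need to re-run.
    data.write_text("YearsExperience,Salary\n1.0,40000\n2.0,45000\n")
    changed = stage_is_stale([data], recorded)

print(unchanged, changed)
```

In practice this bookkeeping is handled by DVC itself; the sketch only shows why an unchanged dataset lets a stage be skipped.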
To scan parameters with Airflow:
- Place the Airflow DAG in the dags/ directory.
- Start the Airflow webserver and trigger the DAG.
- Monitor execution and view logs through the Airflow UI.
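
A parameter scan amounts to expanding a search space into one configuration per run. The sketch below shows that expansion in plain Python, assuming a hypothetical helper `expand_grid` and made-up values for the learning rate and batch size; the Airflow wiring is left out because the repository's DAG is not reproduced here.

```python
from itertools import product


def expand_grid(grid):
    """Return one dict per combination of the values listed in `grid`."""
    names = list(grid)
    return [
        dict(zip(names, values))
        for values in product(*(grid[name] for name in names))
    ]


# Hypothetical search space; the real one would live in params.yaml or the DAG.
search_space = {
    "learning_rate": [0.01, 0.1],
    "batch_size": [16, 32, 64],
}

runs = expand_grid(search_space)
for run in runs:
    print(run)
```

Each dictionary could become the configuration of one task in the DAG. Note that these two parameters only apply to a gradient-based estimator; scikit-learn's ordinary least squares `LinearRegression` has neither.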
To track experiments with MLflow:
- Start the MLflow server:

      mlflow ui

- Access the UI at http://127.0.0.1:5000.
- Compare experiment metrics and download models.
Contributions are welcome! Please fork the repository and submit a pull request. For major changes, open an issue to discuss your proposal first.
This project is licensed under the MIT License.