Rahul Balaji edited this page Jun 26, 2024 · 2 revisions

Welcome to the list_price_estimation wiki!

Manufacturer List Price Estimation and Prediction Model

Introduction

Welcome to the Manufacturer List Price Estimation and Prediction Model project. This project aims to estimate and predict manufacturer list prices for various products using machine learning models. We collect and analyze data from multiple retail sources to fill in gaps in our database and anticipate future price changes.

Project Overview

Objectives

  1. Current Objective: Estimate missing manufacturer list prices using existing data.
  2. Future Objective: Predict list price changes based on e-commerce data trends over time.

Data Sources

We scrape data from various retail sites, including:

  • CVS
  • Walgreens
  • Walmart
  • Amazon
  • Costco
  • Sam's Club
  • Pick N Save

Collected Metrics

For each product, the following metrics are collected:

  • Retail price
  • Promoted price
  • Product size
  • Manufacturer
  • Category
  • Retail site

Data Preprocessing

Handling Missing Data

  • Promo_Price: Filled with Retail_Price if missing.
  • Category: Filled with the most frequent category.
  • Count: Filled with 0 if missing.
  • Multiplier: Filled with 0 if missing.
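
The fill rules above can be sketched with pandas as follows. This is a minimal illustration, not the project's actual preprocessing script: the column names follow this page, while the sample rows and category values are made up.

```python
import numpy as np
import pandas as pd

# Toy data: column names follow the wiki, values are invented for illustration.
df = pd.DataFrame({
    "Retail_Price": [9.99, 4.50, 12.00, 7.25],
    "Promo_Price": [8.99, np.nan, np.nan, 6.00],
    "Category": ["Vitamins", None, "Vitamins", "Skin Care"],
    "Count": [60, np.nan, 30, np.nan],
    "Multiplier": [1, 2, np.nan, np.nan],
})

# Promo_Price falls back to Retail_Price when no promoted price was recorded.
df["Promo_Price"] = df["Promo_Price"].fillna(df["Retail_Price"])

# Category falls back to the most frequent observed category.
most_frequent_category = df["Category"].mode().iloc[0]
df["Category"] = df["Category"].fillna(most_frequent_category)

# Count and Multiplier fall back to 0.
df[["Count", "Multiplier"]] = df[["Count", "Multiplier"]].fillna(0)
```

Note that a Count or Multiplier of 0 also zeroes out any feature computed as a product with that column.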

Feature Engineering

We created interaction features to improve model performance:

  • Retail_Price_Count: Retail_Price multiplied by Count
  • Promo_Price_Count: Promo_Price multiplied by Count
  • Retail_Price_Multiplier: Retail_Price multiplied by Multiplier
  • Promo_Price_Multiplier: Promo_Price multiplied by Multiplier
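
As a rough sketch, the four interaction features can be built with element-wise column products in pandas. The feature names match the list above; the sample rows are invented, and the loop is just one convenient way to write it.

```python
import pandas as pd

# Toy rows: values are invented for illustration.
df = pd.DataFrame({
    "Retail_Price": [9.99, 4.50],
    "Promo_Price": [8.99, 4.50],
    "Count": [60, 0],
    "Multiplier": [1, 2],
})

# Each price column is multiplied by each quantity column, producing
# Retail_Price_Count, Retail_Price_Multiplier, Promo_Price_Count,
# and Promo_Price_Multiplier.
for price in ("Retail_Price", "Promo_Price"):
    for factor in ("Count", "Multiplier"):
        df[f"{price}_{factor}"] = df[price] * df[factor]
```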

Categorical Encoding

Categorical features such as Manufacturer and Category are encoded using OneHotEncoder.

Model Development

Models Used

  1. Linear Regression
  2. Random Forest Regressor
  3. XGBoost Regressor

Hyperparameter Tuning

We performed hyperparameter tuning using Grid Search to find the best parameters for our models.
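
A grid search of this kind could be set up with scikit-learn's GridSearchCV, sketched here for the Random Forest model. The parameter grid, the 3-fold cross-validation, and the synthetic data are all illustrative assumptions; the grid actually searched in this project is not listed on this page.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the real feature matrix.
rng = np.random.default_rng(0)
X = rng.uniform(1, 20, size=(80, 3))
y = 1.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, size=80)

# Hypothetical grid: every combination of these values is tried.
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [3, None],
}

# scikit-learn maximizes scores, so MSE is exposed as its negative.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_
best_cv_mse = -search.best_score_  # flip the sign back to a plain MSE
```

After fitting, search.best_estimator_ holds a model refit on all of the data with the best parameter combination.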

Evaluation Metrics

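We used Mean Squared Error (MSE) to evaluate model performance. MSE is the average of the squared differences between predicted and actual list prices, so lower values indicate better predictions.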
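
For reference, MSE can be computed by hand as below. The prices are made-up numbers chosen to keep the arithmetic easy to follow; they are not project data.

```python
# Made-up actual and predicted list prices.
actual = [10.0, 12.0, 8.0, 15.0]
predicted = [11.0, 12.0, 6.0, 14.0]

# Squared error per product: 1.0, 0.0, 4.0, 1.0
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]

# Mean of the squared errors: 6.0 / 4 = 1.5
mse = sum(squared_errors) / len(squared_errors)
```

Because the errors are squared, a single large miss (the 2.0 error above) contributes more to the score than several small ones.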

Results

Linear Regression

  • Mean Squared Error: 8.56

Random Forest Regressor

  • Best Parameters: (list best parameters from grid search)
  • Mean Squared Error: 3.31 (after adding interaction features)

XGBoost Regressor

  • Best Parameters: (list best parameters from grid search)
  • Mean Squared Error: 4.24

Comparison of Models

The Random Forest model achieved the lowest Mean Squared Error, outperforming both the XGBoost and Linear Regression models. The MSE for each model, from highest to lowest:

  • Linear Regression: 8.56
  • XGBoost: 4.24
  • Random Forest: 3.31
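
To put these figures on a more intuitive scale, the short sketch below takes the square root of each reported MSE (the RMSE, which is in the same units as the prices themselves) and computes how much lower the best model's MSE is relative to each of the others. It uses only the three MSE values reported above; the variable names are illustrative.

```python
import math

# MSE values as reported on this page.
reported_mse = {
    "Linear Regression": 8.56,
    "XGBoost": 4.24,
    "Random Forest": 3.31,
}

# Model with the lowest MSE.
best_model = min(reported_mse, key=reported_mse.get)

# RMSE is the square root of MSE, expressed in price units.
rmse = {name: math.sqrt(mse) for name, mse in reported_mse.items()}

# Fractional MSE reduction of the best model relative to each model.
reduction_vs_best = {
    name: 1 - reported_mse[best_model] / mse
    for name, mse in reported_mse.items()
}

for name in reported_mse:
    print(f"{name}: RMSE={rmse[name]:.2f}, "
          f"best model's MSE is {reduction_vs_best[name]:.0%} lower")
```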

Predictions

We applied the trained models to predict list prices for new data. The predictions are saved in predictions_rf.csv (Random Forest) and predictions_xgb.csv (XGBoost).
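
The prediction step might look roughly like the following for the Random Forest model. Only the file name predictions_rf.csv comes from this page; the List_Price and Predicted_List_Price column names, the feature subset, and the toy rows are hypothetical, and the file is written to a temporary directory to keep the example self-contained.

```python
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy training rows with a known list price (hypothetical column name).
train = pd.DataFrame({
    "Retail_Price": [9.99, 4.50, 12.00, 7.25, 15.00],
    "Promo_Price": [8.99, 4.50, 12.00, 6.00, 13.50],
    "List_Price": [7.00, 3.10, 8.50, 5.00, 10.40],
})

# New products whose list price is missing.
new_data = pd.DataFrame({
    "Retail_Price": [10.50, 5.00],
    "Promo_Price": [9.50, 5.00],
})

features = ["Retail_Price", "Promo_Price"]
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(train[features], train["List_Price"])

# Attach the predictions to the new rows and write them out.
output = new_data.copy()
output["Predicted_List_Price"] = model.predict(new_data[features])

output_path = Path(tempfile.mkdtemp()) / "predictions_rf.csv"
output.to_csv(output_path, index=False)
```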

Future Work

  1. Enhanced Feature Engineering: Continue exploring additional features and transformations.
  2. Model Ensemble: Combine multiple models to improve accuracy.
  3. Time Series Analysis: Develop models to predict list price changes over time.
  4. Deployment: Deploy the models to a production environment and monitor performance.

Contributing

Contributions are welcome! Please fork the repository and submit pull requests.
