Welcome to the documentation for the Manufacturer List Price Estimation and Prediction Model. This project aims to estimate manufacturer list prices for products not in our database and predict list price changes based on e-commerce data.
- Current Objective: Estimate missing manufacturer list prices using a trained model on existing data.
- Future Objective: Predict list price changes based on historical e-commerce data.
We scrape data from various retail sites, including:
- CVS
- Walgreens
- Walmart
- Amazon
- Costco
- Sam's Club
- Pick N Save
For each product, we collect:
- Retail Price
- Promoted Price
- Product Size
- Manufacturer
- Category
- Retail Site
Identifying important features is crucial for the model's accuracy. The features used in our model include:
- Retail Price
- Promoted Price
- Product Size
- Manufacturer
- Category
- Retail Site
To ensure the model performs well, we preprocess the data by:
- Handling missing values
- Standardizing or normalizing features
- Encoding categorical variables
We considered several models, including:
- Regression Models: Linear Regression, Ridge Regression, Lasso Regression
- Tree-based Models: Decision Trees, Random Forest, Gradient Boosting
- Ensemble Methods: Combining multiple models for better performance
import pandas as pd
def load_data(file_path):
return pd.read_csv(file_path)
def preprocess_data(data):
# Handling missing 'Count' values
data['Count'] = data['Count'].apply(lambda x: 0 if pd.isnull(x) or x % 1 != 0 else x)
return data
from sklearn.ensemble import RandomForestRegressor
import joblib
def train_model(train_data):
features = ['Retail_Price', 'Promo_Price', 'Count', 'Manufacturer', 'Category']
X = pd.get_dummies(train_data[features])
y = train_data['List_Price']
model = RandomForestRegressor(random_state=42)
model.fit(X, y)
joblib.dump(model, 'random_forest_model.joblib')
return model
def predict_list_price(model, predict_data):
features = ['Retail_Price', 'Promo_Price', 'Count', 'Manufacturer', 'Category']
X = pd.get_dummies(predict_data[features])
predictions = model.predict(X)
predict_data['Predicted_List_Price'] = predictions
predict_data.to_csv('predicted_data.csv', index=False)
We achieved the following accuracy metrics on our training data:
- Mean Absolute Error (MAE): 0.92
- Mean Squared Error (MSE): 3.30
- Root Mean Squared Error (RMSE): 1.82
- R-squared (R2): 0.96
These metrics indicate a highly accurate model, with an R-squared value of 0.96 meaning that 96% of the variance in list prices is explained by the model. The MAE and RMSE values are low, suggesting that the model's predictions are close to the actual values.
- MAE: On average, our predictions are off by 0.92 units from the actual list prices.
- MSE: This metric gives more weight to larger errors, which helps identify any significant discrepancies in the predictions.
- RMSE: Similar to MSE, but in the same units as the target variable, providing an average error magnitude.
- R2: Indicates that 96% of the variability in list prices is accounted for by our model, demonstrating its robustness and reliability.
- Data Collection: Continuously collect more data to improve the model.
- Feature Engineering: Explore additional features to enhance model performance.
- Model Tuning: Optimize hyperparameters for better accuracy.
- Prediction of Price Changes: Develop a model to predict list price changes based on historical trends and external factors.
This project demonstrates a successful approach to estimating and predicting manufacturer list prices using machine learning. The high accuracy metrics achieved underscore the effectiveness of the models and preprocessing techniques employed.
For detailed code and further explanations, please refer to the source code repository.