This repository contains a complete pipeline for predicting house prices using regression models. The project covers data preprocessing, model training, evaluation, and generating predictions.
The goal is to build a model that accurately estimates the sale price of a house from its features.
The dataset used in this project includes:
- **Train Data**: Contains the features and the target variable (`SalePrice`) for training the model.
- **Test Data**: Contains the features for which predictions need to be made.
The dataset may include missing values and irrelevant columns, which are addressed during preprocessing.
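Before preprocessing, both CSV files need to be loaded into pandas DataFrames. The sketch below assumes the files are named `train.csv` and `test.csv`; adjust the paths to match your copy of the dataset:

```python
import pandas as pd

# Assumed file names; adjust to where the dataset is stored
df = pd.read_csv('train.csv')        # training data, includes SalePrice
test_data = pd.read_csv('test.csv')  # test data, SalePrice must be predicted

print(df.shape, test_data.shape)
```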
- **Check for Missing Values**:
```python
# Count missing values per column and show only the columns that have any
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0])
```
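The README notes that missing values are addressed during preprocessing. One simple way to do that, shown here as a hedged sketch rather than the exact strategy used in this project, is to impute numeric columns with the median and categorical columns with the most frequent value:

```python
# Illustrative imputation: median for numeric columns, mode for object columns
for col in df.columns:
    if df[col].isnull().any():
        if df[col].dtype == 'object':
            df[col] = df[col].fillna(df[col].mode()[0])
        else:
            df[col] = df[col].fillna(df[col].median())
```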
- **Drop Irrelevant Columns**:
```python
# 'SomeIrrelevantColumn' is a placeholder for any column that does not help prediction
df_cleaned = df.drop(['Id', 'SomeIrrelevantColumn'], axis=1)
```
- **Convert Categorical Columns**:
```python
from sklearn.preprocessing import OneHotEncoder

# One-hot encode a categorical column ('CategoricalColumn' is a placeholder name)
encoder = OneHotEncoder()
encoded_features = encoder.fit_transform(df_cleaned[['CategoricalColumn']])
```
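`fit_transform` returns a sparse matrix that still has to be joined back onto the remaining features before modeling. A minimal sketch of one way to do that, using the same placeholder column name (`pd.get_dummies` is a simpler alternative when a full scikit-learn pipeline is not needed):

```python
import pandas as pd

# Convert the encoded matrix to a DataFrame with readable column names,
# then swap it in for the original categorical column.
encoded_df = pd.DataFrame(
    encoded_features.toarray(),
    columns=encoder.get_feature_names_out(['CategoricalColumn']),
    index=df_cleaned.index,
)
df_cleaned = pd.concat([df_cleaned.drop('CategoricalColumn', axis=1), encoded_df], axis=1)
```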
- **Prepare Data for Modeling**:
```python
from sklearn.model_selection import train_test_split

# Separate the features from the target and hold out 20% of the rows for validation
X = df_cleaned.drop('SalePrice', axis=1)
y = df_cleaned['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
- **Fit the Model**:
```python
import xgboost as xgb

# Train an XGBoost regressor with default hyperparameters
model = xgb.XGBRegressor()
model.fit(X_train, y_train)
```
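It can also be useful to watch how the model performs on the held-out split while it trains. This optional step is not part of the original pipeline; it is a sketch using xgboost's `eval_set` argument:

```python
import xgboost as xgb

# Refit while tracking the validation metric (RMSE by default for regression)
model = xgb.XGBRegressor()
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)

# evals_result() returns the per-round metric values for each eval set
history = model.evals_result()
print(history['validation_0'].keys())
```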
- **Generate Predictions**:
```python
# Predict sale prices for the held-out validation split
predictions = model.predict(X_test)
```
- **Evaluate Model Performance**:
```python
from sklearn.metrics import mean_squared_error, r2_score

# Compare the predictions against the held-out validation targets
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print(f"Mean Squared Error: {mse}")
print(f"R^2 Score: {r2}")
```
The XGBoost model achieved an accuracy of around 65-70% on the training dataset. Due to time constraints it was not tuned further, so there is room for improvement (see the sketch below).
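One low-effort way to improve the untuned model is a small cross-validated hyperparameter search. The grid below is a hedged sketch with illustrative values, not the settings behind the reported result:

```python
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

# Illustrative grid; expand it as time allows
param_grid = {
    'n_estimators': [200, 500],
    'max_depth': [3, 5],
    'learning_rate': [0.05, 0.1],
}

search = GridSearchCV(xgb.XGBRegressor(random_state=42), param_grid, cv=5, scoring='r2')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```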
- **Prepare Submission File**:
```python
import pandas as pd

# Note: these predictions must come from the preprocessed test data, not the validation split above
submission = pd.DataFrame({'Id': test_data['Id'], 'SalePrice': predictions})
submission.to_csv('submission.csv', index=False)
```
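For the submission to line up with the test rows, the test data has to go through the same preprocessing as the training data before calling `predict`. A minimal sketch, assuming the same placeholder column names as above and that the test set contains no categories unseen during training:

```python
import pandas as pd

# Apply the same cleaning and encoding steps to the test data
test_cleaned = test_data.drop(['Id', 'SomeIrrelevantColumn'], axis=1)
encoded_test = pd.DataFrame(
    encoder.transform(test_cleaned[['CategoricalColumn']]).toarray(),
    columns=encoder.get_feature_names_out(['CategoricalColumn']),
    index=test_cleaned.index,
)
test_features = pd.concat([test_cleaned.drop('CategoricalColumn', axis=1), encoded_test], axis=1)

# Predict on the processed test set and use these values in the submission file
predictions = model.predict(test_features)
```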
Attached are the `Submission` and `Submission2` files.