Pandas 1.5.3 causes `ValueError` #46

ianmeinert · 2023-03-13T18:11:30Z

Course:
"Complete Machine Learning & Data Science Bootcamp 2023"
Section 12, video 195, "Preprocessing Our Data", in the exercise "Make Predictions on Test Data"

Issue:
Calling `ideal_model.predict(df_test)` raises a `ValueError` when the following cells are run.

# Manually adjust to have auctioneerID_is_missing column
df_test["auctioneerID_is_missing"] = False
df_test.head()

# Make predictions on the test data
test_preds = ideal_model.predict(df_test)

A ValueError occurs:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[75], line 2
      1 # Make predictions on the test data
----> 2 test_preds = ideal_model.predict(df_test)

File ~/Documents/code/udemy/udemy_ml_ds_ztm/.venv/lib/python3.9/site-packages/sklearn/ensemble/_forest.py:981, in ForestRegressor.predict(self, X)
    979 check_is_fitted(self)
    980 # Check data
--> 981 X = self._validate_X_predict(X)
    983 # Assign chunk of trees to jobs
    984 n_jobs, _, _ = _partition_estimators(self.n_estimators, self.n_jobs)

File ~/Documents/code/udemy/udemy_ml_ds_ztm/.venv/lib/python3.9/site-packages/sklearn/ensemble/_forest.py:602, in BaseForest._validate_X_predict(self, X)
    599 """
    600 Validate X whenever one tries to predict, apply, predict_proba."""
    601 check_is_fitted(self)
--> 602 X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr", reset=False)
    603 if issparse(X) and (X.indices.dtype != np.intc or X.indptr.dtype != np.intc):
    604     raise ValueError("No support for np.int64 index based sparse matrices")

File ~/Documents/code/udemy/udemy_ml_ds_ztm/.venv/lib/python3.9/site-packages/sklearn/base.py:548, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
    483 def _validate_data(
    484     self,
    485     X="no_validation",
   (...)
    489     **check_params,
    490 ):
    491     """Validate input data and set or check the `n_features_in_` attribute.
    492 
    493     Parameters
   (...)
    546         validated.
    547     """
--> 548     self._check_feature_names(X, reset=reset)
    550     if y is None and self._get_tags()["requires_y"]:
    551         raise ValueError(
    552             f"This {self.__class__.__name__} estimator "
    553             "requires y to be passed, but the target y is None."
    554         )

File ~/Documents/code/udemy/udemy_ml_ds_ztm/.venv/lib/python3.9/site-packages/sklearn/base.py:481, in BaseEstimator._check_feature_names(self, X, reset)
    476 if not missing_names and not unexpected_names:
    477     message += (
    478         "Feature names must be in the same order as they were in fit.\n"
    479     )
--> 481 raise ValueError(message)

ValueError: The feature names should match those that were passed during fit.
Feature names must be in the same order as they were in fit.

Tests:
From the error message alone, one might assume the problem was the newly added `auctioneerID_is_missing` column. After some research and troubleshooting, I ran the following checks to see whether `df_test` and `X_train` had the same columns, in the same order. They contain the same set of columns, but not in the same order.

set(df_test.columns) == set(X_train.columns)
[Output]: True

df_test.columns.tolist() == X_train.columns.tolist()
[Output]: False

sorted(df_test.columns) == sorted(X_train.columns)
[Output]: True
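To see exactly which positions differ, rather than only that the order differs, a small helper along these lines may help. This is a minimal sketch: `column_order_mismatches` is a hypothetical name, and the two lists are made-up stand-ins for `X_train.columns` and `df_test.columns` from the exercise.

```python
def column_order_mismatches(expected, actual):
    """Return (position, expected_name, actual_name) for each position that differs."""
    return [
        (i, exp, act)
        for i, (exp, act) in enumerate(zip(expected, actual))
        if exp != act
    ]

# Hypothetical stand-ins; in the notebook, pass X_train.columns and df_test.columns
train_columns = ["a", "b", "c"]
test_columns = ["a", "c", "b"]

mismatches = column_order_mismatches(train_columns, test_columns)
print(mismatches)  # [(1, 'b', 'c'), (2, 'c', 'b')]
```

Because it uses `zip`, the helper only compares positions that both sequences share, so it is meant to be used after the `set(...) == set(...)` check above has passed.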

Solution:
To fix the column order, I reindexed the test data using the columns of the training data:

df_test = df_test.reindex(X_train.columns, axis=1)
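One caveat worth noting: `reindex` fills any column that is missing from the test data with NaN instead of raising, so it may be safer to confirm the column sets match first. The sketch below assumes small made-up frames in place of the real `X_train` and `df_test`; `align_columns` is a hypothetical helper, and `YearMade` is an illustrative column name.

```python
import pandas as pd

# Made-up stand-ins for the exercise's X_train and df_test
X_train = pd.DataFrame(
    {"YearMade": [2004], "auctioneerID": [3.0], "auctioneerID_is_missing": [False]}
)
df_test = pd.DataFrame(
    {"auctioneerID_is_missing": [False], "YearMade": [1999], "auctioneerID": [1.0]}
)

def align_columns(df, reference_columns):
    """Reorder df's columns to match reference_columns, failing loudly on a mismatch."""
    missing = set(reference_columns) - set(df.columns)
    extra = set(df.columns) - set(reference_columns)
    if missing or extra:
        raise ValueError(
            f"column mismatch: missing={sorted(missing)}, extra={sorted(extra)}"
        )
    # Same columns on both sides, so reindex only reorders and adds no NaN columns
    return df.reindex(columns=reference_columns)

df_test_aligned = align_columns(df_test, X_train.columns)
print(df_test_aligned.columns.tolist() == X_train.columns.tolist())  # True
```

Selecting with `df_test[X_train.columns]` is an alternative that raises a `KeyError` on a missing column by itself.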

The fix worked, as shown by the next lines of the exercise:

# Make predictions on the test data
test_preds = ideal_model.predict(df_test)
test_preds

which resulted in:

array([17030.00927386, 14355.53565165, 46623.08774286, ...,
       11964.85073347, 16496.71079281, 27119.99044029])

fancellu · 2023-06-26T20:36:21Z

Thanks for flagging up this issue. I saw the same thing.
