Metric comparison - Colab code - Validation set missing #111

Open
simoneroviaro opened this issue Nov 3, 2024 · 0 comments

This issue refers to the follow-up Google Colab code of the "Note: Metric Comparison Improvement" section, at the end of the chapter "Scikit-learn: Creating Machine Learning Models".

In the Colab code, both RandomizedSearchCV and GridSearchCV were applied directly to the training set without an explicit validation set.

Quote "The most important part is they all use the same data splits created using train_test_split() and np.random.seed(42)".

I initially assumed this referred to the fact that, in the previous lessons, a validation set was created for RandomizedSearchCV, which was not consistent with GridSearchCV, where an 80/20 train_test_split was used instead.

This turned out not to be the case in the Colab code. In fact, both RandomizedSearchCV and GridSearchCV were applied directly to the training set without an explicit validation set (see the sketch below).
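
To make it concrete, this is roughly the pattern I am referring to. The dataset, model, and parameter grid below are placeholders of my own choosing, not the exact ones from the notebook:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

np.random.seed(42)
X, y = load_breast_cancer(return_X_y=True)

# Single 80/20 split -- no explicit validation set is created.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

grid = {"n_estimators": [10, 100], "max_depth": [None, 10]}

# GridSearchCV is fitted directly on the training set;
# RandomizedSearchCV is used the same way in the notebook.
gs_clf = GridSearchCV(RandomForestClassifier(), param_grid=grid, cv=5)
gs_clf.fit(X_train, y_train)

# The held-out test set is only used afterwards for the metric comparison.
print(gs_clf.best_params_, gs_clf.score(X_test, y_test))
```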

  • Couldn't this approach lead to overfitting? Any tuning process based on test set performance indirectly leaks information about the test set into the model selection process.

This is not consistent with the content of the previous lessons, where the validation set was explained (roughly along the lines sketched below).
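
For contrast, this is how I understood the explicit validation-set workflow from the earlier lessons; the 70/15/15 split sizes, dataset, and manual tuning loop are my own paraphrase, not the exact course code:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

np.random.seed(42)
X, y = load_breast_cancer(return_X_y=True)

# 70/15/15 train/validation/test split.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5)

best_score, best_n = -np.inf, None
for n_estimators in (10, 50, 100):
    clf = RandomForestClassifier(n_estimators=n_estimators)
    clf.fit(X_train, y_train)
    score = clf.score(X_valid, y_valid)  # tune on the validation set only
    if score > best_score:
        best_score, best_n = score, n_estimators

# The test set is touched only once, with the chosen hyperparameters.
final_clf = RandomForestClassifier(n_estimators=best_n).fit(X_train, y_train)
print(best_n, final_clf.score(X_test, y_test))
```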

Could you please clarify?
Thanks,
Simone.
