Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MAINT Introduce use of set_output to output dataframes #683

Merged
merged 15 commits into from
Jun 14, 2023

Conversation

ArturoAmorQ
Copy link
Collaborator

Pandas output with set_output API is available since v 1.2.

This PR introduces such a nice feature to the MOOC.

@ArturoAmorQ ArturoAmorQ marked this pull request as draft February 9, 2023 09:27
Copy link
Collaborator

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very natural way to introduce this. +1.

@ogrisel
Copy link
Collaborator

ogrisel commented Feb 12, 2023

This is still draft so I did not merge. But feel free to undraft and merge.

@ogrisel
Copy link
Collaborator

ogrisel commented Feb 12, 2023

I think we should use set_output(transform="pandas") by default in the notebook titled "Encoding of categorical variables".

@ArturoAmorQ
Copy link
Collaborator Author

I think we should use set_output(transform="pandas") by default in the notebook titled "Encoding of categorical variables".

The global setting raises an ValueError: Pandas output does not support sparse data when training the model at the end of the notebook.

We can still set the output to be dataframe when creating the instances in the rest of the notebook, and use new instances with default input for the pipeline.

@ArturoAmorQ ArturoAmorQ marked this pull request as ready for review February 16, 2023 14:05
@ArturoAmorQ ArturoAmorQ added this to the MOOC 4.0 milestone Feb 23, 2023
@glemaitre glemaitre self-requested a review June 5, 2023 12:29
Copy link
Collaborator

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will make a second check to be sure that we don't have other places where we should be using this type of output.

python_scripts/02_numerical_pipeline_scaling.py Outdated Show resolved Hide resolved
# %%
data_train_scaled = pd.DataFrame(data_train_scaled, columns=data_train.columns)
scaler = StandardScaler().set_output(transform="pandas")
data_train_scaled = scaler.fit_transform(data_train)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the analysis, I would also some link to the documentation: https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_set_output.html.

I would probably mention that we can set the output of a Pipeline using the sklearn.set_config function without going into details but instead providing delegating to the scikit-learn example.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment may be relevant for Issue #675.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a suggestion above to add the link right away without waiting for a PR dedicated to address #675.

@glemaitre
Copy link
Collaborator

In the notebook linear_model_regularization, I am wondering if we should advocate for trying to get the feature names from model[:-1].get_feature_names_out(...) or instead have set_output and then access model[-1].feature_names_in_.

Co-authored-by: Guillaume Lemaitre <[email protected]>
@ogrisel
Copy link
Collaborator

ogrisel commented Jun 5, 2023

+1 for model[-1].feature_names_in_ which should make the code even shorter.

@glemaitre
Copy link
Collaborator

Otherwise LGTM.

Copy link
Collaborator

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM again. Just a few more comments. Feel free to merge once the suggested changes are integrated (assuming you agree with those).

python_scripts/ensemble_adaboost.py Outdated Show resolved Hide resolved
python_scripts/02_numerical_pipeline_scaling.py Outdated Show resolved Hide resolved
# %%
data_train_scaled = pd.DataFrame(data_train_scaled, columns=data_train.columns)
scaler = StandardScaler().set_output(transform="pandas")
data_train_scaled = scaler.fit_transform(data_train)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a suggestion above to add the link right away without waiting for a PR dedicated to address #675.

ArturoAmorQ and others added 2 commits June 14, 2023 15:05
@ArturoAmorQ ArturoAmorQ merged commit c34c8cb into INRIA:main Jun 14, 2023
@ArturoAmorQ ArturoAmorQ deleted the set_output_pandas branch June 14, 2023 13:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants