MAINT Introduce use of set_output to output dataframes #683

ArturoAmorQ · 2023-02-09T09:26:42Z

Pandas output through the set_output API has been available since scikit-learn v1.2.

This PR introduces that feature to the MOOC.

ogrisel

This is a very natural way to introduce this. +1.

ogrisel · 2023-02-12T17:47:51Z

This is still draft so I did not merge. But feel free to undraft and merge.

ogrisel · 2023-02-12T18:04:07Z

I think we should use set_output(transform="pandas") by default in the notebook titled "Encoding of categorical variables".

ArturoAmorQ · 2023-02-13T09:45:44Z

> I think we should use set_output(transform="pandas") by default in the notebook titled "Encoding of categorical variables".

The global setting raises a ValueError: Pandas output does not support sparse data when training the model at the end of the notebook.

We can still set the output to be a dataframe when creating the instances in the rest of the notebook, and use new instances with the default output for the pipeline.

python_scripts/02_numerical_pipeline_introduction.py

glemaitre

I will make a second check to be sure that we don't have other places where we should be using this type of output.

python_scripts/02_numerical_pipeline_scaling.py

glemaitre · 2023-06-05T12:56:17Z

 # %%
-data_train_scaled = pd.DataFrame(data_train_scaled, columns=data_train.columns)
+scaler = StandardScaler().set_output(transform="pandas")
+data_train_scaled = scaler.fit_transform(data_train)


After the analysis, I would also add a link to the documentation: https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_set_output.html.

I would probably mention that we can set the output of a Pipeline using the sklearn.set_config function, without going into details and instead delegating to the scikit-learn example.

This comment may be relevant for Issue #675.

I added a suggestion above to add the link right away without waiting for a PR dedicated to address #675.

glemaitre · 2023-06-05T13:38:35Z

In the notebook linear_model_regularization, I am wondering whether we should advocate getting the feature names from model[:-1].get_feature_names_out(...) or instead use set_output and then access model[-1].feature_names_in_.

ogrisel · 2023-06-05T14:48:28Z

+1 for model[-1].feature_names_in_ which should make the code even shorter.

glemaitre · 2023-06-05T14:53:05Z

Otherwise LGTM.

ogrisel

LGTM again. Just a few more comments. Feel free to merge once the suggested changes are integrated (assuming you agree with those).

python_scripts/ensemble_adaboost.py

python_scripts/02_numerical_pipeline_scaling.py

ogrisel · 2023-06-14T13:00:45Z

I added a suggestion above to add the link right away without waiting for a PR dedicated to address #675.

MAINT Introduce use of set_output to output dataframes

74bc33e

ArturoAmorQ marked this pull request as draft February 9, 2023 09:27

ArturoAmorQ mentioned this pull request Feb 10, 2023

MAINT Update scikit-learn to v 1.2.1 #684

Merged

ogrisel approved these changes Feb 12, 2023

Use set_output on categorical encoders

bb2f4f4

ArturoAmorQ marked this pull request as ready for review February 16, 2023 14:05

ArturoAmorQ added this to the MOOC 4.0 milestone Feb 23, 2023

ArturoAmorQ added 6 commits February 23, 2023 15:55

Fix conflicts

4928a02

MNT Use lint and black format

c1fb684

Merge branch 'main' of github.com:INRIA/scikit-learn-mooc into set_output_pandas

56d8abf

Fix conflicts

45b1fb6

Merge remote-tracking branch 'upstream/main' into set_output_pandas

27a0029

black

5d04e0a

ArturoAmorQ commented Jun 5, 2023

python_scripts/02_numerical_pipeline_introduction.py (outdated)

Erase trailing comma

bf034ca

glemaitre self-requested a review June 5, 2023 12:29

Revert change in table

8d5ac08

glemaitre reviewed Jun 5, 2023

Apply suggestions from code review

4d25da2

Co-authored-by: Guillaume Lemaitre <[email protected]>

Wording tweak

bc1cedd

Apply suggestion from Guillaume

dc35679

ogrisel approved these changes Jun 14, 2023

ArturoAmorQ and others added 2 commits June 14, 2023 15:05

Apply suggestions from code review

1458b86

Co-authored-by: Olivier Grisel <[email protected]>

Lint

1ef3aff

ArturoAmorQ merged commit c34c8cb into INRIA:main Jun 14, 2023

ArturoAmorQ deleted the set_output_pandas branch June 14, 2023 13:10

github-actions bot pushed a commit that referenced this pull request Jun 14, 2023

[ci skip] MAINT Introduce use of set_output to output dataframes (#683)

5f18db8

c34c8cb
