
Standard Scaler fit-transform interface #179

Merged 9 commits into elixir-nx:main on Dec 14, 2023

Conversation

@santiago-imelio (Contributor) commented Sep 14, 2023

There is already Preprocessing.standard_scale that performs this transformation. However, with that function we cannot standardize a test set using the training set's distribution statistics (mean and standard deviation).

Scaler.StandardScaler.fit/2 will return a struct containing the mean and standard deviation of the observations it was fitted on.
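A minimal sketch of the fit-then-transform flow this PR is proposing. The function names follow the PR description; the exact module path, arities, and struct fields are assumptions for illustration:

```elixir
# Sketch of the proposed interface, assuming a scaler struct that
# stores the training distribution's mean and standard deviation.
alias Scholar.Preprocessing.StandardScaler

x_train = Nx.tensor([[1, -1, 2], [2, 0, 0], [0, 1, -1]])
x_test = Nx.tensor([[2, 1, 0]])

# fit learns the mean and standard deviation from the training set...
scaler = StandardScaler.fit(x_train)

# ...and transform reuses those statistics on unseen data, so the
# test set is scaled with the *training* distribution, not its own.
x_train_std = StandardScaler.transform(scaler, x_train)
x_test_std = StandardScaler.transform(scaler, x_test)
```

This is exactly the case the functional standard_scale cannot express: scaling x_test with statistics learned from x_train.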

@josevalim (Contributor)

@msluszniak should we generally convert our preprocessing steps into modules to mirror scikit-learn?

@msluszniak (Contributor)

I think we shouldn't convert preprocessing into modules. For functions that calculate some statistics and then apply them to data, this interface would be easy to build. However, for functions like one-hot encoding or ordinal encoding, we would need to somehow store the relation key -> calculated_mapping; scikit-learn uses dictionaries for this. One of the functions, normalize, already has an equivalent in Nx.LinAlg.norm. If we decide to support only standard_scaler, max_abs_scaler, min_max_scaler, and binarizer, then the interface would be inconsistent. And if we change the interface to use a struct, users wouldn't be able to pipe the results of standard_scale. On the other hand, it would be cool to support both.
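To illustrate the key -> calculated_mapping point: a fitted ordinal encoder has to carry a per-category mapping learned at fit time, which in Elixir would naturally live in a map inside the fitted struct. A hypothetical sketch (not Scholar code):

```elixir
# Hypothetical sketch: why encoders need a stored, learned mapping.
# A plain one-shot function cannot reuse the categories seen at fit time.
defmodule OrdinalEncoderSketch do
  defstruct [:mapping]

  # Learn category -> integer from the training data.
  def fit(categories) do
    mapping =
      categories
      |> Enum.uniq()
      |> Enum.sort()
      |> Enum.with_index()
      |> Map.new()

    %__MODULE__{mapping: mapping}
  end

  # Apply the *fitted* mapping to new data.
  def transform(%__MODULE__{mapping: mapping}, categories) do
    Enum.map(categories, &Map.fetch!(mapping, &1))
  end
end

encoder = OrdinalEncoderSketch.fit(["cat", "dog", "cat"])
OrdinalEncoderSketch.transform(encoder, ["dog"])
# => [1]
```

For the scalers the fitted state is just two tensors (mean and standard deviation), which is why a module-less design is still feasible for them but awkward for the encoders.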

@msluszniak (Contributor)

What we can actually do is create separate functions like standard_scale(reference_tensor, tensor_to_apply, opts) and support applying the standardization to a different tensor. WDYT?

@msluszniak (Contributor)

And standard_scale/2 would be the simplified version of this, where the reference tensor is the same as tensor_to_apply.
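The purely functional alternative being discussed might look like the following sketch. The function names and shapes are assumptions drawn from the comment above (the opts argument is dropped for brevity):

```elixir
# Sketch of the functional alternative: compute the statistics from a
# reference tensor and apply them to another tensor, with no struct.
defmodule ScaleSketch do
  import Nx.Defn

  # Scale tensor_to_apply using the reference tensor's distribution.
  defn standard_scale(reference_tensor, tensor_to_apply) do
    mean = Nx.mean(reference_tensor)
    std = Nx.standard_deviation(reference_tensor)
    (tensor_to_apply - mean) / std
  end

  # Simplified case: the tensor acts as its own reference.
  defn standard_scale(tensor), do: standard_scale(tensor, tensor)
end
```

The trade-off raised below is that the statistics are recomputed on every call, so there is no way to fit once and reuse the same fitting cheaply.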

@josevalim (Contributor)

@msluszniak the downside of this approach is that we would not be able to fit once and apply the same fitting multiple times. Is that a concern?

@msluszniak (Contributor)

You're right, hmm. In a standard use case, this function won't be called multiple times. But if it is, then the benefit of a modular implementation could be considerable.

@santiago-imelio (Contributor, Author)

@msluszniak From what I understand, it seems too soon to agree on an interface that fits both the language and the needs of developers, so in my opinion keeping the preprocessing API functional is appropriate for now.

Closing in favor of continuing the discussion.

@josevalim (Contributor)

I have reopened this PR in case you would like to address the comments in it. :)

@msluszniak (Contributor) left a review comment

We also need to add docs for fit, transform, and fit_transform.

Review comments on lib/scholar/preprocessing/standard_scaler.ex (outdated, resolved)
@josevalim josevalim merged commit f651fdc into elixir-nx:main Dec 14, 2023
2 checks passed
@santiago-imelio santiago-imelio deleted the feat/scaler-fit-transform branch December 14, 2023 17:20
3 participants