AutoML Pipeline – Java API #15855

sebhrusen · 2023-10-23T13:42:58Z

Sub-issue of #15854

Provide pipeline abstraction (Model, ModelBuilder, Transformation, …) encapsulating a sequence of transformations and a final (optional) predictive model.
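
The abstraction described above could look roughly like the sketch below. It is only an illustration: `Transformation`, `Estimator`, `Multiplier`, `SlopeEstimator` and `Pipeline` are hypothetical names for this example, not the actual H2O-3 classes, and plain `double[][]` arrays stand in for frames.

```java
import java.util.List;

// Hypothetical sketch: none of these types are the actual H2O-3 classes.
class PipelineSketch {

    /** One preprocessing step: learns state in fit(), applies it in transform(). */
    interface Transformation {
        void fit(double[][] rows);
        double[][] transform(double[][] rows);
    }

    /** The final predictive model; a pipeline may have none. */
    interface Estimator {
        void fit(double[][] rows, double[] target);
        double[] predict(double[][] rows);
    }

    /** Stateless example step: multiplies one column by a constant factor. */
    static final class Multiplier implements Transformation {
        private final int column;
        private final double factor;
        Multiplier(int column, double factor) { this.column = column; this.factor = factor; }
        @Override public void fit(double[][] rows) { /* nothing to learn */ }
        @Override public double[][] transform(double[][] rows) {
            double[][] out = new double[rows.length][];
            for (int i = 0; i < rows.length; i++) {
                out[i] = rows[i].clone();   // never mutate the caller's data
                out[i][column] *= factor;
            }
            return out;
        }
    }

    /** Toy estimator: least-squares slope through the origin on column 0. */
    static final class SlopeEstimator implements Estimator {
        private double slope;
        @Override public void fit(double[][] rows, double[] target) {
            double xy = 0, xx = 0;
            for (int i = 0; i < rows.length; i++) {
                xy += rows[i][0] * target[i];
                xx += rows[i][0] * rows[i][0];
            }
            slope = xx == 0 ? 0 : xy / xx;
        }
        @Override public double[] predict(double[][] rows) {
            double[] out = new double[rows.length];
            for (int i = 0; i < rows.length; i++) out[i] = slope * rows[i][0];
            return out;
        }
    }

    /** Applies the steps in order, then delegates to the optional estimator. */
    static final class Pipeline {
        private final List<Transformation> steps;
        private final Estimator estimator; // null means "transform only"

        Pipeline(List<Transformation> steps, Estimator estimator) {
            this.steps = List.copyOf(steps);
            this.estimator = estimator;
        }

        void fit(double[][] rows, double[] target) {
            double[][] current = rows;
            for (Transformation step : steps) {
                step.fit(current);                  // each step sees the previous step's output
                current = step.transform(current);
            }
            if (estimator != null) estimator.fit(current, target);
        }

        double[][] transform(double[][] rows) {
            double[][] current = rows;
            for (Transformation step : steps) current = step.transform(current);
            return current;
        }

        double[] predict(double[][] rows) {
            if (estimator == null) throw new IllegalStateException("pipeline has no estimator");
            return estimator.predict(transform(rows));
        }
    }

    static double[] demoPredict() {
        Pipeline p = new Pipeline(List.of(new Multiplier(0, 2.0)), new SlopeEstimator());
        p.fit(new double[][] {{1}, {2}, {3}}, new double[] {2, 4, 6});
        return p.predict(new double[][] {{5}});
    }

    public static void main(String[] args) {
        System.out.println(demoPredict()[0]);
    }
}
```

In this sketch the estimator only ever sees transformed data, which is the property the notes below rely on when discussing MOJO support.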

**Requirements**
Training of the pipeline should support cross-validation without data leakage.
The trained pipeline model must be able to score/predict, including from the clients.
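
To make the "no data leakage" requirement concrete, here is a minimal sketch, assuming a stateful transformation (mean centering) and a simple modulo fold assignment. `MeanCenterer` and `crossValTransform` are hypothetical names for this example and do not reflect how H2O-3 assigns folds or implements its transformers. The point is that the transformation is refit on the training part of every fold, so the holdout rows never influence the state used to transform them.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical sketch: names and fold assignment are illustrative only.
class LeakFreeCvSketch {

    /** Stateful transformation: subtracts the mean learned in fit(). */
    static final class MeanCenterer {
        private double mean;
        void fit(double[] train) { mean = Arrays.stream(train).average().orElse(0.0); }
        double transform(double value) { return value - mean; }
    }

    /**
     * Transforms every value with a centerer fitted on the OTHER folds only.
     * Row i belongs to fold (i % nfolds).
     */
    static double[] crossValTransform(double[] values, int nfolds) {
        double[] out = new double[values.length];
        for (int fold = 0; fold < nfolds; fold++) {
            final int holdout = fold;
            // Training part: every row that is not in the holdout fold.
            double[] train = IntStream.range(0, values.length)
                    .filter(i -> i % nfolds != holdout)
                    .mapToDouble(i -> values[i])
                    .toArray();
            MeanCenterer centerer = new MeanCenterer(); // fresh state per fold
            centerer.fit(train);
            // Holdout part: transformed with state learned from the training part only.
            for (int i = holdout; i < values.length; i += nfolds) {
                out[i] = centerer.transform(values[i]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [-2.0, 0.0, 0.0, 2.0]
        System.out.println(Arrays.toString(crossValTransform(new double[] {1, 2, 3, 4}, 2)));
    }
}
```

A leaky variant would fit the centerer once on all four values (mean 2.5) and return `[-1.5, -0.5, 0.5, 1.5]`, letting each holdout row contribute to its own encoding.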

**Notes**
MOJO not yet supported.

The current implementation focuses on the backend: client support exists only to manipulate pipeline models built elsewhere (for example by AutoML), and in particular to predict with them.
There are many ways to extend the pipeline logic and make it more customizable:

* in AutoML, the preprocessing param can be used to select various predefined transformers that will make up the training pipeline.
* for single models and grids, the pipeline client can later be extended to allow users to define their own pipeline (for example using a syntax similar to sklearn Pipeline).
* for even more ad-hoc customization, like the scenario you're suggesting, we could allow code ingestion, probably as Jython scripts; it should not be too difficult to implement a JythonDataTransformer.
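
As a rough illustration of the last two ideas, the sketch below shows a named-step builder in the spirit of sklearn's Pipeline, with ad-hoc steps supplied as user code. It is an assumption-heavy example: `Builder`, `step` and the step names are hypothetical, and plain Java lambdas over a single row stand in for the Jython scripts mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: the builder and step names are illustrative, not an H2O-3 API.
class CustomPipelineSketch {

    /** Collects named row-level steps and composes them in insertion order. */
    static final class Builder {
        private final Map<String, UnaryOperator<double[]>> steps = new LinkedHashMap<>();

        Builder step(String name, UnaryOperator<double[]> rowFunction) {
            if (steps.putIfAbsent(name, rowFunction) != null) {
                throw new IllegalArgumentException("duplicate step name: " + name);
            }
            return this;
        }

        UnaryOperator<double[]> build() {
            // Snapshot the steps so later builder changes do not affect the built pipeline.
            List<UnaryOperator<double[]>> ordered = new ArrayList<>(steps.values());
            return row -> {
                double[] current = row.clone(); // protect the caller's row from in-place steps
                for (UnaryOperator<double[]> fn : ordered) current = fn.apply(current);
                return current;
            };
        }
    }

    static UnaryOperator<double[]> demo() {
        return new Builder()
                .step("double_first", r -> { r[0] *= 2; return r; })
                .step("append_sum", r -> {
                    double[] out = Arrays.copyOf(r, r.length + 1);
                    out[r.length] = Arrays.stream(r).sum();
                    return out;
                })
                .build();
    }

    public static void main(String[] args) {
        // prints [2.0, 3.0, 5.0]
        System.out.println(Arrays.toString(demo().apply(new double[] {1, 3})));
    }
}
```

Naming each step, as sklearn does, would make it possible to address individual steps later (for example to set hyperparameters per step), although nothing in this sketch goes that far.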

Finally, let's keep in mind that MOJO is not supported yet. It is likely to be much easier to support with this Pipeline mechanism than with the legacy target encoding (TE) support embedded in Model/ModelBuilder: every transformation now applies strictly sequentially, and the estimator model (e.g. GLM) contains only post-transformation information. With the legacy TE integration, the model contained a mix of pre-encoding and post-encoding information, which made the MOJO extremely difficult to implement because other categorical encodings were mixed in as well.

@TomF

* core pipeline API
* remove unnecessary explicit casting
* remove grid some integration logic (moving to dedicated PR)
* fix ref comparison in PipelineHelperTest
* remove Pipeline from sklearn estimators support
* fix dynamic test for py pipeline algo
* fix R CRAN check
* revert changes on Model.scoreMetrics, but extracted suspicious code
* added example of multiplier transformer in PipelineTest to show how columns transformations can be easily implemented and applied declaratively
* Apply suggestions from @TomF 's code review

  Co-authored-by: Tomáš Frýda <[email protected]>
* addressed tomf suggestions

---------

Co-authored-by: Tomáš Frýda <[email protected]>

* Revert "GH-15857: cleanup legacy TE integration in ModelBuilder and AutoML (#16061)"
  This reverts commit a8f309b.
* Revert "GH-15857: AutoML pipeline support (#16041)"
  This reverts commit 17fa9ee.
* Revert "GH-15856: Grid pipeline support (#16040)"
  This reverts commit b7ac670.
* Revert "GH-15855: core pipeline API (#16039)"
  This reverts commit c15ea1e.

sebhrusen self-assigned this Oct 23, 2023

sebhrusen mentioned this issue Oct 23, 2023

Feature munging pipeline for AutoML #15854

Open

7 tasks

sebhrusen added this to the 3.46.0.1 milestone Oct 23, 2023

sebhrusen added AutoML feature labels Oct 23, 2023

sebhrusen mentioned this issue Jan 29, 2024

GH-15855: core pipeline API #16039

Merged

sebhrusen linked a pull request Jan 29, 2024 that will close this issue

GH-15855: core pipeline API #16039

Merged

sebhrusen closed this as completed in #16039 Feb 12, 2024

mn-mikke added a commit that referenced this issue Feb 27, 2024

Revert "GH-15855: core pipeline API (#16039)"

fa54892

This reverts commit c15ea1e

valenad1 added a commit that referenced this issue Mar 8, 2024

Revert "GH-15855: core pipeline API (#16039)"

224c5df

This reverts commit c15ea1e.
