-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-15855: core pipeline API #16039
GH-15855: core pipeline API #16039
Conversation
* @param ignoredFields A {@link Set} of fields to ignore. Can be empty or null. | ||
* @return checksum A 64-bit long representing the checksum of the object | ||
*/ | ||
public static <T> long checksum(final T obj, final Set<String> ignoredFields, final long initVal) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic extracted for Model.Parameters
so that it can also be used in pipeline DataTransformer
s
private ModelBuilderListener _callback; | ||
|
||
public void setCallback(ModelBuilderListener callback) { | ||
this._callback = callback; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
callbacks have been moved from Driver to ModelBuilder itself as it allows much better configurability needed by the pipeline
676ad67
to
458f902
Compare
@@ -71,18 +74,23 @@ protected int nModelsInParallel(int folds) { | |||
* This is called before cross-validation is carried out | |||
*/ | |||
@Override | |||
public void computeCrossValidation() { | |||
protected void cv_init() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changes in the algos are mainly due to the fact that the CV API now only exposes protected
hooks at various places in the model building cycle, otherwise it breaks the pipeline logic that needs a very strict behaviour when building CV models (esp. as it needs full control over the frames being used at that time).
Algos are therefore encouraged to override only those small hooks, and the ModelBuilder itself remains algo-agnostic.
16b2444
to
009c161
Compare
assertNotNull(k.get()); | ||
assertVecEquals(fr.vec(i), k.get(), 1e-10); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add an example with pipeline say multiply a column by 2 and then build a GLM or GBM or DRF model?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wendycwong added a quick MultiplyNumericColumnTransformer
for the test below, with assertions verifying that it's applied correctly and that the original frame is not modified.
@sebhrusen : Is it possible for a user to add her own munging pipeline? For example, I want to transform a column by subtracting the column mean? |
@wendycwong no, this is not currently supported:
I'm answering more than your simple question here, but wanted to explain the scope of this, so maybe I will copy this answer to the original issue for reference. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have just some minor comments, otherwise it looks great! Thank you @sebhrusen for not just hacking it together (as I probably would 😅 ) and inventing those nice abstractions!
model$estimator_model <- NULL | ||
} | ||
model$transformers <- unlist(lapply(model$transformers, function(dt) new("H2ODataTransformer", id=dt$id, description=dt$description))) | ||
# class(model) <- "H2OPipeline" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this commented? If needed, you can assign multiple classes, e.g., class(model) <- c("H2OPipeline", "H2OModel")
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point! right I forgot about the multiple inheritance. I think I commented it out because it didn't seem necessary (single algo don't have dedicated class) and it broke some behaviour somewhere (can't remember what exactly).
The funny part is that the class is defined as follow:
setClass("H2OPipeline", contains="H2OModel",
but afair, it was still not recognized as a model somewhere…
I will give a try to your suggestion, and see if it breaks some R tests.
|
||
private enum Command { | ||
SUBSTITUTE() { | ||
private final Pattern CMD = Pattern.compile("s/(.*?)/(.*?)/?"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For other reviewers:
It took me couple minutes before I realized this is not a PCRE to substitute a non-greedy match with verbatim (.*?)
. The whole expression is actually the part for matching (not substitution). The expected string is s/something/something
or s/something/something/
.
Please correct me if I'm wrong @sebhrusen .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correct, the idea behind those KeyGen
is to better understand how all those keys are created, and ensure consistency when some keys are created based on other ones. Pipeline can use a lot of keys for all temp frames, for models used in data transformers and so on.
and example of this substitution pattern is used in the pipeline integration in AutoML:
// in ModelingStep.applyPipeline(...)
pparams._estimatorKeyGen = hyperParams == null
? new ConstantKeyGen(resultKey)
: new PatternKeyGen("{0}|s/"+PIPELINE_KEY_PREFIX+"//") // in case of grid, remove the Pipeline prefix to obtain the estimator key, this allows naming compatibility with the classic mode.
;
this way the key formatting doesn't have to leak later in the code (in this case in Grid) that should be pipeline-agnostic
@@ -0,0 +1 @@ | |||
#hex.pipeline.PipelineRegistration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intentional #
prefix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, right, I should remove this
import water.fvec.Frame; | ||
|
||
/** | ||
* WiP: not used for now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you specify what remains to be done here? (if anything)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's mainly a high level abstraction for other transformers that would for example remove rows (not) containing a specific value or that have too few values…
As we don't use this in AutoML (yet), I didn't implement any specific transformer, and so this abstraction is probably missing useful stuff…
Basically, it's just an idea for now. I can delete it if you want.
…olumns transformations can be easily implemented and applied declaratively
Co-authored-by: Tomáš Frýda <[email protected]>
aac3156
to
f3d717d
Compare
* Revert "GH-15857: cleanup legacy TE integration in ModelBuilder and AutoML (#16061)" This reverts commit a8f309b. * Revert "GH-15857: AutoML pipeline support (#16041)" This reverts commit 17fa9ee. * Revert "GH-15856: Grid pipeline support (#16040)" This reverts commit b7ac670. * Revert "GH-15855: core pipeline API (#16039)" This reverts commit c15ea1e.
implements core API for #15855