
Single future.control argument rather than multiple, individual future.* arguments #27

Open
HenrikBengtsson opened this issue Sep 13, 2018 · 9 comments

Comments

@HenrikBengtsson
Collaborator

@mllg wrote:

Have you considered introducing a control object (in the fashion of passing an rpart.control object to rpart())? A future.control object could bundle all arguments. Would be good for the overview (also in the documentation) and ...

Yes, I've been having internal (as in lots of inner voices ;)) debates about this, and it's been discussed with others in the past. I'm not opposed to it. The main reason I've stayed away from it is that we have to decide exactly how the control elements should be controlled.

... you would avoid inconsistencies like #26.

You mean in the sense that the future.apply package and other similar high-level packages (e.g. furrr) won't have to know about future-specific arguments and can just pass whatever down to the future package? That's a nice side effect I hadn't thought of before.

However, not all elements in future.control should be passed down to the future package. For instance, scheduling and chunk.size are higher level properties. If so, do they belong to a future.control argument or should they be separate?

So several things to think of. Thanks for bringing this up.

@mllg

mllg commented Sep 14, 2018

However, not all elements in future.control should be passed down to the future package. For instance, scheduling and chunk.size are higher level properties. If so, do they belong to a future.control argument or should they be separate?

I don't see a problem with future.apply introducing some additional arguments (i.e. scheduling and chunk.size) and then passing the object forward to future. I think my first attempt to do this for multiple packages would look like this:

# In future:
future.control <- function(globals = FALSE) {
  list(globals = globals)
}

# In future.apply:
future.apply.control <- function(..., scheduling = 1, chunk.size = NULL) {
  c(future::future.control(...), list(scheduling = scheduling, chunk.size = chunk.size))
}

This way, you could update future and introduce new options w/o having to touch future.apply.
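To make the idea concrete, here is a runnable sketch of the layered constructors described above, with plain function names (the actual names and homes of these functions are part of the proposal, not existing API), plus an example call:

```r
# Sketch of the layered control constructors (names are illustrative):
future.control <- function(globals = FALSE) {
  list(globals = globals)
}

future.apply.control <- function(..., scheduling = 1, chunk.size = NULL) {
  # Extra, higher-level arguments are appended to the base control list;
  # anything in ... is forwarded to the lower-level constructor.
  c(future.control(...), list(scheduling = scheduling, chunk.size = chunk.size))
}

ctrl <- future.apply.control(globals = TRUE, chunk.size = 10)
stopifnot(isTRUE(ctrl$globals), ctrl$scheduling == 1, ctrl$chunk.size == 10)
```

Because future.apply.control() forwards `...` blindly, a new argument added to future.control() becomes available without touching future.apply.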

@HenrikBengtsson
Collaborator Author

I was actually thinking of taking an easier approach, to avoid introducing yet another function to the Future API. I was thinking this could be added to the API layered above the Future API (e.g. the future.apply package).

For instance, instead of doing:

y <- future_lapply(X, FUN = identity, future.seed = 42, future.scheduling = 2.0)

one could do:

y <- future_lapply(X, FUN = identity, future.control = list(seed = 42, scheduling = 2.0))

where the future.control argument would be used to override the defaults. Conceptually, something like

control <- do.call(.update_control, future.control)

can be used internally with:

.update_control <- function(...) {
  args <- list(...)
  ## Hard-coded defaults; user-supplied values take precedence below.
  control <- list(
    globals = TRUE, packages = NULL, lazy = FALSE,
    seed = FALSE, scheduling = 1, chunk.size = NULL
  )
  for (name in names(args)) control[[name]] <- args[[name]]
  control
}

The current arguments would then correspond to:

future.globals <- control$globals
future.packages <- control$packages
future.seed <- control$seed
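Putting the pieces together, here is a runnable sketch of the override semantics: do.call() spreads the user's future.control list into the `...` of .update_control(), so named entries replace the hard-coded defaults and everything else keeps its default value.

```r
# Sketch of the proposed override mechanism (defaults as listed above):
.update_control <- function(...) {
  args <- list(...)
  control <- list(
    globals = TRUE, packages = NULL, lazy = FALSE,
    seed = FALSE, scheduling = 1, chunk.size = NULL
  )
  # User-supplied entries override (or extend) the defaults.
  for (name in names(args)) control[[name]] <- args[[name]]
  control
}

future.control <- list(seed = 42, scheduling = 2.0)
control <- do.call(.update_control, future.control)
stopifnot(control$seed == 42, control$scheduling == 2.0, isTRUE(control$globals))
```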

@mllg

mllg commented Sep 17, 2018

Where does .update_control() live? In the future package? I find it odd that it knows about the defaults of an "extension" package like future.apply. What if you introduce a new argument in future or future.apply?

@HenrikBengtsson
Collaborator Author

In future.apply. There are no plans to introduce a control argument in the future package at this point.

@mllg

mllg commented Sep 17, 2018

Please clarify how this is intended to be used:

The choice of backend can be controlled by modifying the global state via plan() (and tweak?).

Some options of future apparently should be set by the package developer, like globals (at least if the function f to apply is not user-provided), or lazy. This is basically hard-coded in my package.

Other options seem to be relevant for the user, e.g. seed or the output handling via %stdout%/%stderr%. How do I expose these options to the user? If there is a control object in future.apply, I can just pass it down; that works for me. But what am I supposed to do in my packages if I want to run futureCall?

@HenrikBengtsson
Collaborator Author

So, I'm not thinking of a global option here, just wrapping up the existing future.* arguments into a future.control = list(...) argument. The latter will override and/or add to the defaults (which are hard-coded in the package, not set by the user). So the idea is that it works just as it does now, only with a different way of specifying the arguments (to future_lapply() et al.).

@mllg

mllg commented Sep 17, 2018

Hm, okay. If I understand you correctly, most options can/should be hard-coded in the package. But what about scheduling or chunk.size? Do I have to expose these arguments and pass them down? Can they be set via plan()?

Sorry if these are dumb questions, I should really take more time to RTFM...

@HenrikBengtsson
Collaborator Author

No dumb questions. No, you cannot set those via plan(). Can you give me an example where you think it would make more sense for the end user to control the "chunking" (via plan()) rather than you, as the developer of the method/algorithm, controlling it (via future_lapply(), parLapply(), foreach(), what have you)? If I can understand your use case(s), I can probably give you a better answer/explanation.

@mllg

mllg commented Sep 20, 2018

I'm currently working on mlr3 (https://github.com/mlr-org/mlr3), a successor to mlr. The benchmark() function is used to benchmark multiple learning algorithms on multiple machine learning tasks via resampling. Internally, I basically expand.grid() over learners, tasks, and resampling iterations.
The runtimes of the iterations (one iteration = a single learning algorithm on a single task in a single resampling iteration) are often very heterogeneous (linear model takes a few seconds, deep neural net takes many hours).

As the user typically has some expectations about the runtime and which iterations or combinations will be expensive (e.g., a random forest is more expensive than a single tree), they could optimize the parallelization by evenly distributing the heavy jobs among the available workers. So it would be nice to be able to control the chunking, or at least "shuffle" the jobs (as suggested in another issue).

A more general use case for scheduling/chunk.size for homogeneous runtimes:

  • If you have 1e7 very fast jobs you want to chunk to [ncpu] jobs in order to reduce the overhead.
  • If you have 10 very slow jobs you want to have 10 jobs which start in a load-balanced fashion (see parLapply vs. parLapplyLB or mclapply's mc.preschedule).
  • Heuristics like defaulting to min(iters, ncpu * 2) chunks might be helpful here, but do not solve the issue for all setups.
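The two regimes above can be sketched with a toy chunking helper. This is purely illustrative (make_chunks() is a hypothetical function, not future.apply's actual implementation): a fixed chunk.size caps chunk granularity for load balancing, while scheduling controls how many chunks each worker gets.

```r
# Hypothetical helper: split jobs 1..n into chunks, either by a fixed
# chunk.size or into roughly ncpu * scheduling chunks.
make_chunks <- function(n, ncpu, chunk.size = NULL, scheduling = 1) {
  if (!is.null(chunk.size)) {
    nchunks <- ceiling(n / chunk.size)       # fixed-size chunks
  } else {
    nchunks <- min(n, max(1, round(ncpu * scheduling)))  # chunks per worker
  }
  split(seq_len(n), cut(seq_len(n), breaks = nchunks, labels = FALSE))
}

# Many fast jobs on 4 CPUs -> 4 big chunks, minimizing per-job overhead:
stopifnot(length(make_chunks(100, ncpu = 4)) == 4)
# Few slow jobs with chunk.size = 1 -> one chunk per job, load-balanced:
stopifnot(length(make_chunks(10, ncpu = 4, chunk.size = 1)) == 10)
```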

I'm also not saying that I definitely need this. Especially the manual chunking might unnecessarily blow up the interface. I was just curious if this is possible.
