
Adding a time limit (checked per iteration) to asymptoticTimings #4

Closed

Anirban166 opened this issue Jun 2, 2020 · 9 comments


Anirban166 commented Jun 2, 2020

@tdhock @neerajdhanraj

Consider a threshold time.limit as the maximum time the benchmark is allowed to take, on average, in an iteration (i.e., for a particular data set size). Taking the mean means only one comparison is needed per iteration.

If the set threshold is crossed (which is checked after each run/iteration, like here), it stops further computation by breaking out of the loop, avoiding extra time spent on larger data set sizes.

This would be a double-edged sword depending on the time limit the user provides: if a very small time bound is given, the inferred complexity might not be correct. But user-set limits always come with the possibility of unfavourable values (the same holds for GuessCompx), and then again, the user can always re-test with a higher time bound to verify.

For instance, cDPA is classified incorrectly as log-linear for N = 2 or 1 (as expected, since the trend is similar to PeakSegPDPA/log-linear complexity at smaller data set sizes) and classified correctly as quadratic from N = 3 onwards. So, based on the timings obtained for N = 3, I selected time.limit = 10^7 as a default time limit for our function (will modify if the value is not suitable), a possibly safe value at and above which it classifies correctly. For demonstration, I also tested time.limit = 10^6, one-tenth of our chosen threshold value, which classifies incorrectly, as shown below. (The function also runs pretty fast in this case, with both the timing computation from asymptoticTimings and the complexity classification from asymptoticComplexityClass taking under 5 seconds, since N = 3 here.)
[screenshot: test_time_limit]
Timings function:

library(microbenchmark)

asymptoticTimings <- function(e, N.seq, time.limit)
{
  # Capture the user's expression so it can be evaluated for each data set size:
  lang.obj <- substitute(e)
  fun.obj  <- function(N)
  {
    eval(lang.obj)
  }

  # Default threshold of 10^7 nanoseconds (microbenchmark reports time in ns):
  maximum.time <- ifelse(missing(time.limit), 10^7, time.limit)

  data.set.sizes <- 10^seq(1, N.seq, by = 0.5)
  l <- length(data.set.sizes)
  timings.list <- list()

  for(i in 1:l)
  {
    benchmarked.timings <- as.data.frame(microbenchmark(fun.obj(data.set.sizes[i])))
    benchmarked.timings$data.set.size <- data.set.sizes[i]
    timings.list[[i]] <- data.frame(benchmarked.timings$time, benchmarked.timings$data.set.size)
    # Stop benchmarking larger sizes once the mean timing crosses the threshold:
    if(mean(benchmarked.timings$time) > maximum.time) break
  }

  resultant.df <- do.call(rbind, timings.list)
  colnames(resultant.df) <- c("Timings", "Data set sizes")
  return(resultant.df)
}
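
For reference, a minimal usage sketch in line with the test described above (the cDPA arguments mirror those used later in this thread; treat the exact values as illustrative):

library(PeakSegDP)
# Benchmark cDPA over sizes 10^seq(1, 3, by = 0.5), with the 10^6 ns limit tested above:
df <- asymptoticTimings(cDPA(rpois(N, 10), rep(1, length(rpois(N, 10))), 3L),
                        N.seq = 3, time.limit = 10^6)
asymptoticComplexityClass(df)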

Also, I've merged the asymptoticTimings and asymptoticTimingsFun parts, since a single function suffices.

Do these changes seem okay or should I discard this logic?


tdhock commented Jun 2, 2020

first of all please use variable names like seconds or nanoseconds instead of time which can be confusing because the unit is not specified.

second, " cDPA is classified incorrectly as log-linear for N = 2 or 1" is also confusing, because N is typically used to refer to absolute number of rows/data/observations. but here you are using it for the 10^N value right? can you please use something like N = 10^power.of.ten instead?

yes adding a time limit makes sense, but please make sure that our default time limit works with all of the test cases we have discussed in #2

@tdhock
Copy link

tdhock commented Jun 2, 2020

a default time limit of 0.1 second seems reasonable to me (then the user probably has a result in less than one second?)
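
(For reference, microbenchmark reports its time column in nanoseconds, so a 0.1-second default corresponds to a threshold along these lines; the variable name here is just illustrative:)

time.limit.default <- 0.1 * 10^9  # 0.1 s = 10^8 ns, the unit of microbenchmark's time column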

@Anirban166 (Owner, Author)

> first of all please use variable names like seconds or nanoseconds instead of time which can be confusing because the unit is not specified.

Right! I'm using seconds now (it should be more convenient for the user than nanoseconds).

second, " cDPA is classified incorrectly as log-linear for N = 2 or 1" is also confusing, because N is typically used to refer to absolute number of rows/data/observations. but here you are using it for the 10^N value right? can you please use something like N = 10^power.of.ten instead?

Yup - I get what you're saying; the user would typically expect that for N. In our case, since we currently use it for the number of data set sizes in our sequence of powers of 10, we could instead take something like the maximum data set size as the parameter, i.e. the size up to which the expression would be tested. (We could then directly use a sequence over (10, N), but to include the sub-range values like 31.6, 316.2, etc. in a sequence based on powers of ten, I'm following the approach below; the sequence itself is illustrated next.)
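
For instance, the half-integer powers of ten supply the intermediate sizes:

10^seq(1, 3, by = 0.5)
# 10, 31.62..., 100, 316.22..., 1000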

asymptoticTimings <- function(e, N, max.seconds)
{
  lang.obj <- substitute(e)
  fun.obj  <- function(N)
  {
    eval(lang.obj)
  }

  time.limit <- ifelse(missing(max.seconds), 10^7, max.seconds*10^9)
  power.of.ten <- 0
  while(N > 1)
  { if(N %% 10 == 0)
    { power.of.ten <- power.of.ten + 1
      N <- N / 10
    }
  }
  data.set.sizes <- 10^seq(1, power.of.ten, by = 0.5)
  l <- length(data.set.sizes)
  timings.list <- list()

  for(i in 1:l)
  {
    benchmarked.timings <- as.data.frame(microbenchmark(fun.obj(data.set.sizes[i])))
    benchmarked.timings$data.set.size <- data.set.sizes[i] # collecting data set sizes at each iteration
    timings.list[[i]] <- data.frame(benchmarked.timings$time, benchmarked.timings$data.set.size)
    if(mean(benchmarked.timings$time) > time.limit) break
  }

  resultant.df <- do.call(rbind, timings.list)
  colnames(resultant.df) <- c("Timings", "Data set sizes")
  return(resultant.df)
}

Following this, the user can conveniently enter the max data set size (we can rename N to max.size or something equivalent) as a power of ten (would mention that in @params), such as df <- asymptoticTimings(cDPA(rpois(N,10), rep(1,length(rpois(N,10))), 3L), 1000) for the function above, instead of the number of data set sizes in our sequence of powers of 10.

Or we can rename the parameter to power.of.ten to fit right, but then the user would have to use that naming convention in their expression (such as df <- asymptoticTimings(cDPA(rpois(power.of.ten,10), rep(1,length(rpois(power.of.ten,10))), 3L), 3, 1), for instance):

asymptoticTimings <- function(e, power.of.ten, max.seconds)
{
  lang.obj <- substitute(e)
  fun.obj  <- function(power.of.ten)
  {
    eval(lang.obj)
  }

  time.limit <- ifelse(missing(max.seconds), 10^7, max.seconds*10^9)
  data.set.sizes <- 10^seq(1, power.of.ten, by = 0.5)
  l <- length(data.set.sizes)
  timings.list <- list()

  for(i in 1:l)
  {
    benchmarked.timings <- as.data.frame(microbenchmark(fun.obj(data.set.sizes[i])))
    benchmarked.timings$data.set.size <- data.set.sizes[i] # collecting data set sizes at each iteration
    timings.list[[i]] <- data.frame(benchmarked.timings$time, benchmarked.timings$data.set.size)
    if(mean(benchmarked.timings$time) > time.limit) break
  }

  resultant.df <- do.call(rbind, timings.list)
  colnames(resultant.df) <- c("Timings", "Data set sizes")
  return(resultant.df)
}

Also, I'll add the additional parameter RN (as mentioned in the proposal), or something with a better name such as sizes.vec, for the user to enter an arbitrary vector instead of being restricted to powers of 10. But before that I would like to sort out the case where we are dealing with powers of 10, i.e. with our hardcoded vector or a sequence based on N. So which of the above two approaches should I go with, or should I go with something else? (Directly using N = 10^(power.of.ten) would result in other values.)

> yes adding a time limit makes sense, but please make sure that our default time limit works with all of the test cases we have discussed in #2

Yup, sure - I'll test them and let you know the results in that issue's thread.

> a default time limit of 0.1 second seems reasonable to me (then the user probably has a result in less than one second?)

In my case it takes a bit more than one second (2-3 seconds in most runs) to get the result, but I guess that's because of other factors, such as the time taken to evaluate the rest of the R code, or my laptop's speed.
Setting the default limit to 0.1 sec would mean checking for that limit at the end of each iteration, so it will surely compute up to N = 3 (if N is not less than 3) but will skip higher values of N. (That is, for N > 3 the computational time would be the same if we don't specify a higher time limit, which saves the user time. E.g., asymptoticTimings(cDPA(rpois(power.of.ten,10), rep(1,length(rpois(power.of.ten,10))), 3L), 5) and asymptoticTimings(cDPA(rpois(power.of.ten,10), rep(1,length(rpois(power.of.ten,10))), 3L), 3) would take the same time unless we specify a higher time limit for the former.)


tdhock commented Jun 3, 2020

I'm sorry but can you explain what you are using N for? I am super confused now. For me N is the number of data points, data size, or rows/observations.

the user should be able to specify N in their expression.

asymptoticTimings(cDPA(rpois(N,10)))


Anirban166 commented Jun 4, 2020

> I'm sorry but can you explain what you are using N for? I am super confused now. For me N is the number of data points, data size, or rows/observations.

I'm using N for the data size.
It will be the data size up to which we compute our benchmarks; that's why I referred to it as max.size above. We let the user enter one single data size N, but inside our function we run computations for multiple data sizes up to N, all of which are powers of 10.
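
A condensed sketch of that behaviour, assuming N is an exact power of ten (log10 is used here for brevity; the thread below arrives at the same simplification):

N <- 10^5
data.set.sizes <- 10^seq(1, log10(N), by = 0.5)  # 10, 31.6, ..., up to 10^5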

> the user should be able to specify N in their expression.
>
> asymptoticTimings(cDPA(rpois(N,10)))

Yup, they can specify the data set size directly, like asymptoticTimings(cDPA(rpois(N,10), rep(1,length(rpois(N,10))), 3L), 10^5) (a data size of 10^5), with this approach:

asymptoticTimings <- function(e, N, max.seconds)
{
  lang.obj <- substitute(e)
  fun.obj  <- function(N)
  {
    eval(lang.obj)
  }

  time.limit <- ifelse(missing(max.seconds), 10^7, max.seconds*10^9)
  power.of.ten <- 0
  while(N > 1)
  { if(N %% 10 == 0)
    { power.of.ten <- power.of.ten + 1
      N <- N / 10
    }
  }

  data.set.sizes <- 10^seq(1, power.of.ten, by = 0.5)
  l <- length(data.set.sizes)
  timings.list <- list()
  for(i in 1:l)
  {
    benchmarked.timings <- as.data.frame(microbenchmark(fun.obj(data.set.sizes[i])))
    benchmarked.timings$data.set.size <- data.set.sizes[i]
    timings.list[[i]] <- data.frame(benchmarked.timings$time, benchmarked.timings$data.set.size)
    if(mean(benchmarked.timings$time) > time.limit) break
  }

  resultant.df <- do.call(rbind, timings.list)
  colnames(resultant.df) <- c("Timings", "Data set sizes")
  return(resultant.df)
}


tdhock commented Jun 4, 2020

this is inefficient, what are you trying to do?


  while(N > 1)
  { if(N %% 10 == 0)
    { power.of.ten <- power.of.ten + 1
      N <- N / 10
    }
  }

@Anirban166 (Owner, Author)

> this is inefficient, what are you trying to do?
>
>   while(N > 1)
>   { if(N %% 10 == 0)
>     { power.of.ten <- power.of.ten + 1
>       N <- N / 10
>     }
>   }

Sorry, I hadn't thought much while writing that - I was trying to extract the power of ten from the value supplied as N, and the simple way to do that is:

power.of.ten <- log(N, 10)
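
(Equivalently, with base R's log10; a quick sanity check, assuming N is an exact power of ten:)

power.of.ten <- log10(N)  # e.g. log10(1000) gives 3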


tdhock commented Jun 4, 2020 via email


Anirban166 commented Jun 5, 2020

> why are you trying to get the power of ten from the value supplied as N? if the user wants to specify a sequence of data sizes manually I think the syntax should be something like testComplexity(cDPA(rpois(N, 10)), N=10^seq(1, 4))

So we can use N for a range of values? Sorry for the confusion - I initially thought we had to take a single data size as N.
This makes more sense, but then we wouldn't need the extra parameter RN anymore, since the user can pass any arbitrary vector (which need not be a sequence of powers of ten) for N in this case:

library(microbenchmark)

asymptoticTimings <- function(e, data.sizes, max.seconds)
{
  # Capture the user's expression, to be evaluated once per data size:
  lang.obj <- substitute(e)
  fun.obj  <- function(data.sizes)
  {
    eval(lang.obj)
  }

  # Default limit of 10^8 ns (0.1 s); a user-supplied limit is given in seconds
  # and converted to nanoseconds, the unit of microbenchmark's time column:
  time.limit <- ifelse(missing(max.seconds), 10^8, max.seconds*10^9)
  l <- length(data.sizes)
  timings.list <- list()

  for(i in 1:l)
  {
    benchmarked.timings <- as.data.frame(microbenchmark(fun.obj(data.sizes[i])))
    benchmarked.timings$data.size <- data.sizes[i]
    timings.list[[i]] <- data.frame(benchmarked.timings$time, benchmarked.timings$data.size)
    # Stop once the mean timing for the current size crosses the limit:
    if(mean(benchmarked.timings$time) > time.limit) break
  }

  resultant.df <- do.call(rbind, timings.list)
  colnames(resultant.df) <- c("Timings", "Data sizes")
  return(resultant.df)
}
> asymptoticTimeComplexityClass(asymptoticTimings(PeakSegDP::cDPA(rpois(data.sizes, 1), rep(1, length(rpois(data.sizes, 1))), 3L), data.sizes = 10^seq(1, 3, by = 0.5)))
[1] "quadratic"

I renamed it to data.sizes since that seemed more appropriate now... or is N more convenient?

Edit: Closing this thread, as the core objective here (the implementation of the time-limit parameter, max.seconds) is done. Please re-open for discussion when required.
