Creating the memory usage (in bytes) quantifying function #6

Closed
Anirban166 opened this issue Jun 10, 2020 · 9 comments
Labels
enhancement, log

@Anirban166
Owner

Anirban166 commented Jun 10, 2020

Started with the memory metrics (usage in bytes) quantifier, with a few adaptations relative to the timings quantifier. Key points:

bench essentially benchmarks both time and memory allocations for an R expression, so it doesn't return a data.frame of memory metrics specifically, in contrast to what microbenchmark did for time results (which only covered timings, and didn't reduce to a single output value anyway).
The equivalent for extracting memory is to use bench::bench_memory, or to retrieve mem_alloc from bench::mark. Both ways yield a single value for whatever input expression is supplied, which we can extract by selecting the memory allocation column, for example:

> bench::mark(cDPA(rpois(100,10), rep(1,length(rpois(100,10))), 3L))$mem_alloc
[1] 19.6KB
> bench::bench_memory(cDPA(rpois(100,10), rep(1,length(rpois(100,10))), 3L))$mem_alloc
[1] 19.6KB

mem_alloc is one of the two columns that bench_memory returns:

> attributes(bench::bench_memory(cDPA(rpois(100,10), rep(1,length(rpois(100,10))), 3L)))
$names
[1] "mem_alloc" "memory"   

However, the second column, memory, just shows the allocations per step (it is not usable as a data.frame) and provides nothing extra; the total allocated value in bytes (see the last row below) is what gets supplied to mem_alloc:

Memory allocations:
       what bytes                                                      calls
1     alloc   448 <Anonymous> -> force() -> cDPA() -> stopifnot() -> rpois()
2     alloc  2552 <Anonymous> -> force() -> cDPA() -> stopifnot() -> rpois()
3     alloc   448 <Anonymous> -> force() -> cDPA() -> stopifnot() -> rpois()
4     alloc  2552 <Anonymous> -> force() -> cDPA() -> stopifnot() -> rpois()
5     alloc   848            <Anonymous> -> force() -> cDPA() -> stopifnot()
6     alloc   448                           <Anonymous> -> force() -> cDPA()
7     alloc  2448               <Anonymous> -> force() -> cDPA() -> double()
8     alloc  1248              <Anonymous> -> force() -> cDPA() -> integer()
9     alloc  2448               <Anonymous> -> force() -> cDPA() -> double()
10    alloc   448                           <Anonymous> -> force() -> cDPA()
11    alloc  2448               <Anonymous> -> force() -> cDPA() -> matrix()
12    alloc  1248               <Anonymous> -> force() -> cDPA() -> matrix()
13    alloc  2448               <Anonymous> -> force() -> cDPA() -> matrix()
total       20032                                                           

Ideally, mem_alloc is the only value we will be requiring after all.
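
As a quick sanity check (a sketch based on the cDPA example above, assuming the memory list-column holds the per-step allocation data frame printed above), the per-step bytes should sum to the single mem_alloc value:

res <- bench::bench_memory(cDPA(rpois(100,10), rep(1,length(rpois(100,10))), 3L))
sum(res$memory[[1]]$bytes) # should equal as.numeric(res$mem_alloc), i.e. the 'total' row above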

However, this does raise a question: as seen in the results for cDPA above, mem_alloc reports sizes in kilobytes (KB), or for larger values in equivalent units such as MB, while the memory allocations are calculated in bytes. So are those reported values (KB, MB) directly comparable against byte counts? Yes, they are. Here's a short test which proves that:

> memorytester(cDPA(rpois(N,10), rep(1,length(rpois(N,10))), 3L), N=10)
[1] 6.11KB
> memorytester(cDPA(rpois(N,10), rep(1,length(rpois(N,10))), 3L), N=10) > 6100
[1] TRUE
> memorytester(cDPA(rpois(N,10), rep(1,length(rpois(N,10))), 3L), N=10) > 6300
[1] FALSE

This is preferred, since otherwise we would have to resort to converting the value back into bytes by parsing the "KB"/"MB" strings (or with a regex), and we strictly need it in bytes since it will be compared with max.bytes (analogous to max.seconds), the maximum memory allocation limit per iteration.
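
Put differently, here is a minimal sketch (assuming mem_alloc carries the bench_bytes class, as the output above suggests): the pretty-printed KB/MB value is just a numeric count of bytes with a formatting class attached, so numeric comparisons operate on bytes directly:

x <- bench::bench_memory(cDPA(rpois(100,10), rep(1,length(rpois(100,10))), 3L))$mem_alloc
class(x)   # bench_bytes
unclass(x) # the underlying numeric value, in bytes
x > 10^4   # compared against the raw byte count, not the "19.6KB" string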

But this also means that there must be some implicit conversion going on, which led me to the realization behind the error I received next.

Following the established logic from asymptoticTimings, making the required changes and extracting that single allocated memory size value per iteration, we would expect this to work:

asymptoticMemoryUsage <- function(e, data.sizes, max.bytes)
{
  if(!all(!is.infinite(data.sizes) & !is.na(data.sizes) & !is.nan(data.sizes)))
    stop("data.sizes must not contain any NA/NaN/Infinite value.")
  if(length(data.sizes) == 0)
    stop("Cannot run on an empty vector for 'data.sizes'.")
  # capture the unevaluated expression and wrap it in a function of data.sizes:
  lang.obj <- substitute(e)
  fun.obj  <- function(data.sizes)
    eval(lang.obj)
  # default memory allocation limit of 10^6 bytes if max.bytes isn't supplied:
  memory.size.limit <- ifelse(missing(max.bytes), 10^6, max.bytes)
  l <- length(data.sizes)
  memory.metrics.list <- list()
  for(i in 1:l)
  {
    benchmarked.memory.size <- bench::bench_memory(fun.obj(data.sizes[i]))$mem_alloc
    data.size <- data.sizes[i]
    memory.metrics.list[[i]] <- data.frame(benchmarked.memory.size, data.size)
    # stop once a single iteration's allocation exceeds the limit:
    ifelse((benchmarked.memory.size > memory.size.limit), break, next)
  }
  resultant.df <- do.call(rbind, memory.metrics.list)
  colnames(resultant.df) <- c("Memory Usage", "Data sizes")
  return(resultant.df)
}

but it throws:

Error in round(bytes/byte_units[unit], digits = digits) :
invalid second argument of length 0

which didn't make much sense at first glance, but I eventually got to understand what that error message meant.
Yes, it's the conversion from the higher memory size units (MB/KB) to bytes.
I made small test functions for different cases to figure out the reason, and this is the one which reproduced the same error:

memorytester <- function(expression, N)
{
  lang.obj <- substitute(expression)
  fun.obj  <- function(N) eval(lang.obj)
  x <- bench::bench_memory(fun.obj(N))$mem_alloc
  return(data.frame(x, N)) # for a single value of N
}

Calling it reproduces the error:

Error in round(bytes/byte_units[unit], digits = digits) : 
  invalid second argument of length 0

This error makes sense now: we are collecting the mem_alloc value and our data size (N here) in a data.frame, which makes the aforementioned conversion (dividing the total bytes by the byte-unit value, such as 1024 for KB, and then rounding, as mentioned in the error message) apply to all the columns within. It works fine for the mem_alloc object, which needs that conversion, but not for data.size, which does not require it (hence the invalid 'second' argument).

To verify, just returning the allocated memory size value (without the data frame) works:

memorytester <- function(expression, N)
{
  lang.obj <- substitute(expression)
  fun.obj  <- function(N) eval(lang.obj)
  x <- bench::bench_memory(fun.obj(N))$mem_alloc
  return(x)
}
> memorytester(cDPA(rpois(N,10), rep(1,length(rpois(N,10))), 3L), N = 100)
[1] 19.6KB

which means I'll need to devise something else instead of collecting it directly in a data frame along with the data size for each iteration.
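
One possible workaround (just a sketch, not necessarily what ends up in the package) would be to coerce the bench_bytes value to a plain numeric byte count before building the data frame, so that no unit formatting gets triggered:

memorytester <- function(expression, N)
{
  lang.obj <- substitute(expression)
  fun.obj  <- function(N) eval(lang.obj)
  x <- as.numeric(bench::bench_memory(fun.obj(N))$mem_alloc) # plain bytes, no bench_bytes class
  return(data.frame(x, N))
}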


Branch : Memtest
Objectives: Add memory usage quantifier, complexity classifier and plotter function with corresponding documentation & unit-tests.

Anirban166 added the enhancement and log labels on Jun 10, 2020
Anirban166 added a commit that referenced this issue Jun 11, 2020
For additional details, check this thread:
#6
@Anirban166
Owner Author

Anirban166 commented Jun 11, 2020

Done with the functionality, a test run:

> asymptoticMemoryUsage(cDPA(rpois(data.sizes,10), rep(1,length(rpois(data.sizes,10))), 3L), data.sizes = 10^seq(1, 4, by = 0.5))
  Memory usage  Data sizes
1         6256    10.00000
2         9416    31.62278
3        20032   100.00000
4        51136   316.22777
5       149632  1000.00000
6       460960  3162.27766
7      1445632 10000.00000

Since it only gives one memory size value for each data size, it would be sensible to include more intermediate values in the sequence; setting by = 0.1 or lower would be favourable to increase the accuracy of the complexity prediction from asymptoticMemoryComplexityClass.

Here's a subsequent test for the parameter max.bytes which works as expected:

> asymptoticMemoryUsage(cDPA(rpois(data.sizes,10), rep(1,length(rpois(data.sizes,10))), 3L), data.sizes = 10^seq(1, 5, by = 0.1))
   Memory usage Data sizes
1          6256   10.00000
2          6832   12.58925
3          7200   15.84893
4          7880   19.95262
5          8648   25.11886
6          9416   31.62278
7         11272   39.81072
8         12832   50.11872
9         14728   63.09573
10        17032   79.43282
11        20032  100.00000
12        23656  125.89254
13        28384  158.48932
14        34312  199.52623
15        41800  251.18864
16        51136  316.22777
17        62944  398.10717
18        77800  501.18723
19        96352  630.95734
20       119968  794.32823
21       149632 1000.00000
22       186784 1258.92541
23       233728 1584.89319
24       292936 1995.26231
25       367240 2511.88643
26       460960 3162.27766
27       578920 3981.07171
28       727240 5011.87234
29       914152 6309.57344
30      1149448 7943.28235

It stops further computation after a memory size value exceeds the default limit of 10^6 bytes (1 MB); the 30th value above exceeds that hardcoded limit: 1,149,448 > 1,000,000.

The user can increase/adjust it accordingly with max.bytes. It might save a bit of computation time as well, since the parameter is compared directly with a single memory size value inside the loop, instead of summing up per-iteration values (which is what we do when we extract the mean of the timings in asymptoticTimings).
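
For instance (a usage sketch with the same cDPA expression as above), raising the limit to 10^7 bytes would let the loop continue until a single iteration allocates more than that:

asymptoticMemoryUsage(cDPA(rpois(data.sizes, 10), rep(1, length(rpois(data.sizes, 10))), 3L),
                      data.sizes = 10^seq(1, 5, by = 0.1),
                      max.bytes = 10^7)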

In contrast, it runs faster compared to asymptoticTimings, probably because bench computes the memory 'allocations' and doesn't spend extra time on replicates; it only sums up to one value per data size.

Edit: PR merged.



@Anirban166
Owner Author

@tdhock

Can you provide some expected complexity cases for memory complexity testing? (i.e. functions with expected/known memory complexity)

@tdhock

tdhock commented Jun 19, 2020

allocate a vector of size N -> O(N)
allocate an N x N matrix -> O(N^2)
PeakSegDisk -> O( log N )
PeakSegOptimal -> O( N log N )

@Anirban166
Owner Author

allocate a vector of size N -> O(N)
allocate an N x N matrix -> O(N^2)
PeakSegDisk -> O( log N )
PeakSegOptimal -> O( N log N )

Thanks! I'll test them and let you know the results here.
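
For the first two cases, here's a sketch of how they could be expressed for asymptoticMemoryUsage (following the same expression-with-data.sizes convention as the cDPA examples; these exact calls are illustrative, not tested yet):

# O(N) memory: allocate a vector of length data.sizes
asymptoticMemoryUsage(numeric(round(data.sizes)), data.sizes = 10^seq(1, 4, by = 0.1))
# O(N^2) memory: allocate a data.sizes x data.sizes matrix
asymptoticMemoryUsage(matrix(0, nrow = round(data.sizes), ncol = round(data.sizes)), data.sizes = 10^seq(1, 3, by = 0.1))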

@Anirban166
Owner Author

[screenshot: Memtests]

@Anirban166
Owner Author

The first three work fine, but when I test PeakSegDisk::PeakSegFPOP_vec with larger data sizes (say 10^seq(1, 5, by = 0.5) or 10^seq(1, 5, by = 0.1)) it gives loglinear complexity, and when I test it with smaller sizes such as 10^seq(1, 3, by = 0.5) and 10^seq(1, 2, by = 0.1) it gives linear and squareroot complexities respectively. The squareroot result could be mistaken for a log complexity (it would probably show log if we removed squareroot from the candidates), but it's confusing that it reports higher complexities as the data size increases (at most it shows loglinear).

@Anirban166
Owner Author

Anirban166 commented Jun 20, 2020

[screenshot: Memtest_PeakSegDiskFPOP_vec]
Okay, I just tested it on random Poisson data as well (which should show a log-linear trend for time complexity, instead of the quadratic time complexity seen for sequentially increasing data), and it shows linear memory complexity for larger data sizes and squareroot for smaller sizes like 10^seq(1, 2, by = 0.1). Does the O(log N) memory complexity apply to all the functions and all cases of PeakSegDisk?

I also read about an O(N log N) disk space case for PeakSegFPOP_dir (it says O(log N) for memory, as you mentioned); is disk space different from normal memory complexity in terms of the memory allocated?

@tdhock

tdhock commented Jun 24, 2020

good point.
PeakSegDisk is O(N log N) time, O(log N) memory for most data sets (like rpois)
it is O(N^2) time, O(N) memory for pathological data (like always increasing data)

what is the squareroot complexity for? Do you have any examples of real algorithms with that time/memory complexity? (I don't know of any)

@Anirban166
Owner Author

good point.
PeakSegDisk is O(N log N) time, O(log N) memory for most data sets (like rpois)
it is O(N^2) time, O(N) memory for pathological data (like always increasing data)

ah okay, thanks!

what is the squareroot complexity for? Do you have any examples of real algorithms with that time/memory complexity? (I don't know of any)

No I don't, and I can't think of a use-case either, unless I explicitly make a function which runs with squareroot complexity.

I just included it since it can be represented in a glm formula, as observed from GuessCompx, but yeah, I think it would be good to drop some of the complexities which we practically never encounter, such as constant and cubic (quadratic works as the worst-case limit) as well as squareroot; that might also save some time in the error computation and the comparison for the best model.
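
To make that concrete, here's a sketch of the GuessCompx-style idea (not the actual asymptoticMemoryComplexityClass implementation), with constant, cubic and squareroot already dropped as suggested; the data is the cDPA result from the earlier comment:

df <- data.frame(size   = 10^seq(1, 4, by = 0.5),
                 memory = c(6256, 9416, 20032, 51136, 149632, 460960, 1445632))
fits <- list(
  log       = glm(memory ~ log(size),           data = df),
  linear    = glm(memory ~ size,                data = df),
  loglinear = glm(memory ~ I(size * log(size)), data = df),
  quadratic = glm(memory ~ I(size^2),           data = df)
)
mse <- sapply(fits, function(f) mean(residuals(f)^2))
names(which.min(mse)) # the candidate complexity class with the lowest fitting error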
