Creating the memory usage (in bytes) quantifying function #6

Closed
Anirban166 opened this issue Jun 10, 2020 · 9 comments
Labels
enhancement, log

@Anirban166
Owner

Anirban166 commented Jun 10, 2020

Started with the memory metrics (usage in bytes) quantifier, with a few adaptations relative to the timings quantifier. Key points:

bench essentially benchmarks both time and memory allocations for an R expression, so it doesn't return a data.frame of memory metrics specifically, in contrast to what microbenchmark did for time results (which only covered timings, and didn't reduce to a single output value anyway).
The equivalent for extracting memory is to use bench::bench_memory, or to retrieve mem_alloc from bench::mark. Both ways yield a single value for whatever input expression is supplied, which we can extract by selecting the memory allocation column, for example:

> bench::mark(cDPA(rpois(100,10), rep(1,length(rpois(100,10))), 3L))$mem_alloc
[1] 19.6KB
> bench::bench_memory(cDPA(rpois(100,10), rep(1,length(rpois(100,10))), 3L))$mem_alloc
[1] 19.6KB

mem_alloc is one of the two columns that bench_memory returns:

> attributes(bench::bench_memory(cDPA(rpois(100,10), rep(1,length(rpois(100,10))), 3L)))
$names
[1] "mem_alloc" "memory"   

However, the second column, memory, just shows the allocations per step (it is not usable as a data.frame) and provides nothing extra; the total allocated value in bytes (see the last row below) is what gets supplied to mem_alloc:

Memory allocations:
       what bytes                                                      calls
1     alloc   448 <Anonymous> -> force() -> cDPA() -> stopifnot() -> rpois()
2     alloc  2552 <Anonymous> -> force() -> cDPA() -> stopifnot() -> rpois()
3     alloc   448 <Anonymous> -> force() -> cDPA() -> stopifnot() -> rpois()
4     alloc  2552 <Anonymous> -> force() -> cDPA() -> stopifnot() -> rpois()
5     alloc   848            <Anonymous> -> force() -> cDPA() -> stopifnot()
6     alloc   448                           <Anonymous> -> force() -> cDPA()
7     alloc  2448               <Anonymous> -> force() -> cDPA() -> double()
8     alloc  1248              <Anonymous> -> force() -> cDPA() -> integer()
9     alloc  2448               <Anonymous> -> force() -> cDPA() -> double()
10    alloc   448                           <Anonymous> -> force() -> cDPA()
11    alloc  2448               <Anonymous> -> force() -> cDPA() -> matrix()
12    alloc  1248               <Anonymous> -> force() -> cDPA() -> matrix()
13    alloc  2448               <Anonymous> -> force() -> cDPA() -> matrix()
total       20032                                                           

Ideally, mem_alloc is the only value we will be requiring after all.
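
As a quick sanity check (a sketch based on the cDPA example above, assuming the memory list-column holds the per-step allocation data frame printed above), the per-step bytes should sum to the single mem_alloc value:

res <- bench::bench_memory(cDPA(rpois(100,10), rep(1,length(rpois(100,10))), 3L))
sum(res$memory[[1]]$bytes) # should equal as.numeric(res$mem_alloc), i.e. the 'total' row above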

However, this does raise a question: as seen in the results for cDPA above, mem_alloc reports sizes in kilobytes (KB), or for larger values in equivalent units such as MB, while the memory allocations are calculated in bytes. So are those reported values (KB, MB) directly comparable against byte counts? Yes, they are. Here's a short test which proves that:

> memorytester(cDPA(rpois(N,10), rep(1,length(rpois(N,10))), 3L), N=10)
[1] 6.11KB
> memorytester(cDPA(rpois(N,10), rep(1,length(rpois(N,10))), 3L), N=10) > 6100
[1] TRUE
> memorytester(cDPA(rpois(N,10), rep(1,length(rpois(N,10))), 3L), N=10) > 6300
[1] FALSE

This is preferred, since otherwise we would have to resort to converting the value back into bytes by parsing the "KB"/"MB" strings (or with a regex), and we strictly need it in bytes since it will be compared with max.bytes (analogous to max.seconds), the maximum memory allocation limit per iteration.
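
Put differently, here is a minimal sketch (assuming mem_alloc carries the bench_bytes class, as the output above suggests): the pretty-printed KB/MB value is just a numeric count of bytes with a formatting class attached, so numeric comparisons operate on bytes directly:

x <- bench::bench_memory(cDPA(rpois(100,10), rep(1,length(rpois(100,10))), 3L))$mem_alloc
class(x)   # bench_bytes
unclass(x) # the underlying numeric value, in bytes
x > 10^4   # compared against the raw byte count, not the "19.6KB" string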

But this also means that there must be some implicit conversion going on, which led me to the realization behind the error I received next.

Following the established logic from asymptoticTimings, making the required changes and extracting that single allocated memory size value per iteration, we would expect this to work:

asymptoticMemoryUsage <- function(e, data.sizes, max.bytes)
{
  if(!all(!is.infinite(data.sizes) & !is.na(data.sizes) & !is.nan(data.sizes)))
    stop("data.sizes must not contain any NA/NaN/Infinite value.")
  if(length(data.sizes) == 0)
    stop("Cannot run on an empty vector for 'data.sizes'.")
  # capture the unevaluated expression and wrap it in a function of data.sizes:
  lang.obj <- substitute(e)
  fun.obj  <- function(data.sizes)
    eval(lang.obj)
  # default memory allocation limit of 10^6 bytes if max.bytes isn't supplied:
  memory.size.limit <- ifelse(missing(max.bytes), 10^6, max.bytes)
  l <- length(data.sizes)
  memory.metrics.list <- list()
  for(i in 1:l)
  {
    benchmarked.memory.size <- bench::bench_memory(fun.obj(data.sizes[i]))$mem_alloc
    data.size <- data.sizes[i]
    memory.metrics.list[[i]] <- data.frame(benchmarked.memory.size, data.size)
    # stop once a single iteration's allocation exceeds the limit:
    ifelse((benchmarked.memory.size > memory.size.limit), break, next)
  }
  resultant.df <- do.call(rbind, memory.metrics.list)
  colnames(resultant.df) <- c("Memory Usage", "Data sizes")
  return(resultant.df)
}

but it throws:

Error in round(bytes/byte_units[unit], digits = digits) :
invalid second argument of length 0

which didn't make much sense at first glance, but I eventually got to understand what that error message meant.
Yes, it's the conversion from the higher memory size units (MB/KB) to bytes.
I made small test functions for different cases to figure out the reason, and this is the one which reproduced the same error:

memorytester <- function(expression, N)
{
  lang.obj <- substitute(expression)
  fun.obj  <- function(N) eval(lang.obj)
  x <- bench::bench_memory(fun.obj(N))$mem_alloc
  return(data.frame(x, N)) # for a single value of N
}

Calling it reproduces the error:

Error in round(bytes/byte_units[unit], digits = digits) : 
  invalid second argument of length 0

This error makes sense now: we are collecting the mem_alloc value and our data size (N here) in a data.frame, which makes the aforementioned conversion (dividing the total bytes by the byte-unit value, such as 1024 for KB, and then rounding, as mentioned in the error message) apply to all the columns within. It works fine for the mem_alloc object, which needs that conversion, but not for data.size, which does not require it (hence the invalid 'second' argument).

To verify, just returning the allocated memory size value (without the data frame) works:

memorytester <- function(expression, N)
{
  lang.obj <- substitute(expression)
  fun.obj  <- function(N) eval(lang.obj)
  x <- bench::bench_memory(fun.obj(N))$mem_alloc
  return(x)
}
> memorytester(cDPA(rpois(N,10), rep(1,length(rpois(N,10))), 3L), N = 100)
[1] 19.6KB

which means I'll need to devise something else instead of collecting it directly in a data frame along with the data size for each iteration.
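
One possible workaround (just a sketch, not necessarily what ends up in the package) would be to coerce the bench_bytes value to a plain numeric byte count before building the data frame, so that no unit formatting gets triggered:

memorytester <- function(expression, N)
{
  lang.obj <- substitute(expression)
  fun.obj  <- function(N) eval(lang.obj)
  x <- as.numeric(bench::bench_memory(fun.obj(N))$mem_alloc) # plain bytes, no bench_bytes class
  return(data.frame(x, N))
}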


Branch : Memtest
Objectives: Add memory usage quantifier, complexity classifier and plotter function with corresponding documentation & unit-tests.

Anirban166 added the enhancement and log labels on Jun 10, 2020
Anirban166 added a commit that referenced this issue Jun 11, 2020
For additional details, check this thread:
#6
@Anirban166
Owner Author

Anirban166 commented Jun 11, 2020

Done with the functionality, a test run:

> asymptoticMemoryUsage(cDPA(rpois(data.sizes,10), rep(1,length(rpois(data.sizes,10))), 3L), data.sizes = 10^seq(1, 4, by = 0.5))
  Memory usage  Data sizes
1         6256    10.00000
2         9416    31.62278
3        20032   100.00000
4        51136   316.22777
5       149632  1000.00000
6       460960  3162.27766
7      1445632 10000.00000

Since it only gives one memory size value for each data size, it would be sensible to include more intermediate values in the sequence; setting by = 0.1 or lower would be favourable to increase the accuracy of the complexity prediction from asymptoticMemoryComplexityClass.

Here's a subsequent test for the parameter max.bytes which works as expected:

> asymptoticMemoryUsage(cDPA(rpois(data.sizes,10), rep(1,length(rpois(data.sizes,10))), 3L), data.sizes = 10^seq(1, 5, by = 0.1))
   Memory usage Data sizes
1          6256   10.00000
2          6832   12.58925
3          7200   15.84893
4          7880   19.95262
5          8648   25.11886
6          9416   31.62278
7         11272   39.81072
8         12832   50.11872
9         14728   63.09573
10        17032   79.43282
11        20032  100.00000
12        23656  125.89254
13        28384  158.48932
14        34312  199.52623
15        41800  251.18864
16        51136  316.22777
17        62944  398.10717
18        77800  501.18723
19        96352  630.95734
20       119968  794.32823
21       149632 1000.00000
22       186784 1258.92541
23       233728 1584.89319
24       292936 1995.26231
25       367240 2511.88643
26       460960 3162.27766
27       578920 3981.07171
28       727240 5011.87234
29       914152 6309.57344
30      1149448 7943.28235

It stops further computation after a memory size value exceeds the default limit of 10^6 bytes (1 MB); the 30th value above exceeds that hardcoded limit: 1,149,448 > 1,000,000.

The user can increase/adjust it accordingly with max.bytes. It might save a bit of computation time as well, since the parameter is compared directly with a single memory size value inside the loop, instead of summing up per-iteration values (which is what we do when we extract the mean of the timings in asymptoticTimings).
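
For instance (a usage sketch with the same cDPA expression as above), raising the limit to 10^7 bytes would let the loop continue until a single iteration allocates more than that:

asymptoticMemoryUsage(cDPA(rpois(data.sizes, 10), rep(1, length(rpois(data.sizes, 10))), 3L),
                      data.sizes = 10^seq(1, 5, by = 0.1),
                      max.bytes = 10^7)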

In contrast, it runs faster compared to asymptoticTimings, probably because bench computes the memory 'allocations' and doesn't spend extra time on replicates; it only sums up to one value per data size.

Edit: PR merged.



@Anirban166
Owner Author

@tdhock

Can you provide some expected complexity cases for memory complexity testing? (i.e. functions with expected/known memory complexity)

@tdhock

tdhock commented Jun 19, 2020

allocate a vector of size N -> O(N)
allocate an N x N matrix -> O(N^2)
PeakSegDisk -> O( log N )
PeakSegOptimal -> O( N log N )

@Anirban166
Owner Author

allocate a vector of size N -> O(N)
allocate an N x N matrix -> O(N^2)
PeakSegDisk -> O( log N )
PeakSegOptimal -> O( N log N )

Thanks! I'll test them and let you know the results here.
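
For the first two cases, here's a sketch of how they could be expressed for asymptoticMemoryUsage (following the same expression-with-data.sizes convention as the cDPA examples; these exact calls are illustrative, not tested yet):

# O(N) memory: allocate a vector of length data.sizes
asymptoticMemoryUsage(numeric(round(data.sizes)), data.sizes = 10^seq(1, 4, by = 0.1))
# O(N^2) memory: allocate a data.sizes x data.sizes matrix
asymptoticMemoryUsage(matrix(0, nrow = round(data.sizes), ncol = round(data.sizes)), data.sizes = 10^seq(1, 3, by = 0.1))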

@Anirban166
Owner Author

[screenshot: Memtests]

@Anirban166
Owner Author

The first three work fine, but when I test PeakSegDisk::PeakSegFPOP_vec with larger data sizes (say 10^seq(1, 5, by = 0.5) or 10^seq(1, 5, by = 0.1)) it gives loglinear complexity, and when I test it with smaller sizes such as 10^seq(1, 3, by = 0.5) and 10^seq(1, 2, by = 0.1) it gives linear and squareroot complexities respectively. The squareroot result could be mistaken for a log complexity (it would probably show log if we removed squareroot from the candidates), but it's confusing that it reports higher complexities as the data size increases (at most it shows loglinear).

@Anirban166
Owner Author

Anirban166 commented Jun 20, 2020

[screenshot: Memtest_PeakSegDiskFPOP_vec]
Okay, I just tested it on random Poisson data as well (which should show a log-linear trend for time complexity, instead of the quadratic time complexity seen for sequentially increasing data), and it shows linear memory complexity for larger data sizes and squareroot for smaller sizes like 10^seq(1, 2, by = 0.1). Does the O(log N) memory complexity apply to all the functions and all cases of PeakSegDisk?

I also read about an O(N log N) disk space case for PeakSegFPOP_dir (it says O(log N) for memory, as you mentioned); is disk space different from normal memory complexity in terms of the memory allocated?

@tdhock

tdhock commented Jun 24, 2020

good point.
PeakSegDisk is O(N log N) time, O(log N) memory for most data sets (like rpois)
it is O(N^2) time, O(N) memory for pathological data (like always increasing data)

what is the squareroot complexity for? Do you have any examples of real algorithms with that time/memory complexity? (I don't know of any)

@Anirban166
Owner Author

good point.
PeakSegDisk is O(N log N) time, O(log N) memory for most data sets (like rpois)
it is O(N^2) time, O(N) memory for pathological data (like always increasing data)

ah okay, thanks!

what is the squareroot complexity for? Do you have any examples of real algorithms with that time/memory complexity? (I don't know of any)

No I don't, and I can't think of a use-case either, unless I explicitly make a function which runs with squareroot complexity.

I just included it since it can be represented in a glm formula, as observed from GuessCompx, but yeah, I think it would be good to drop some of the complexities which we practically never encounter, such as constant and cubic (quadratic works as the worst-case limit) as well as squareroot; that might also save some time in the error computation and the comparison for the best model.
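
To make that concrete, here's a sketch of the GuessCompx-style idea (not the actual asymptoticMemoryComplexityClass implementation), with constant, cubic and squareroot already dropped as suggested; the data is the cDPA result from the earlier comment:

df <- data.frame(size   = 10^seq(1, 4, by = 0.5),
                 memory = c(6256, 9416, 20032, 51136, 149632, 460960, 1445632))
fits <- list(
  log       = glm(memory ~ log(size),           data = df),
  linear    = glm(memory ~ size,                data = df),
  loglinear = glm(memory ~ I(size * log(size)), data = df),
  quadratic = glm(memory ~ I(size^2),           data = df)
)
mse <- sapply(fits, function(f) mean(residuals(f)^2))
names(which.min(mse)) # the candidate complexity class with the lowest fitting error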
