Adding multiple regression lines for different complexity classes #50

Anirban166 · 2020-08-26T16:37:30Z

My basic idea was to mimic the trends of algorithms we have already tested, one per complexity class (substring for linear, PeakSegPDPA for log-linear, cDPA for quadratic), by fitting glm/lm models (either the plain y ~ x formula or curves such as y ~ I(x^2) or y ~ log(x)), and then to combine the fits into a single plot whose three regression lines mark the boundaries of the linear, log-linear and quadratic complexity classes.
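As a rough illustration of the formula choice, here is a minimal sketch on made-up timings (the data, the constant 2e-7 and the names `timings`, `forms` and `rss` are assumptions for this example, not output from testComplexity). Note that y ~ log(x) describes logarithmic growth; for a log-linear (n log n) trend, one option is to wrap the term as I(x * log(x)).

```r
# Synthetic timings that grow roughly like n * log(n); values are illustrative only.
set.seed(1)
n <- rep(10^seq(1, 4, by = 0.5), each = 10)
timings <- data.frame(
  n = n,
  seconds = 2e-7 * n * log(n) * exp(rnorm(length(n), sd = 0.01))
)

# One candidate formula per complexity class. I() is needed so that
# ^ and * are treated as arithmetic rather than formula operators.
forms <- list(
  linear    = seconds ~ n,
  loglinear = seconds ~ I(n * log(n)),
  quadratic = seconds ~ I(n^2)
)

# Residual sum of squares of each fit; the smallest value marks the
# closest-matching class on this data set.
rss <- sapply(forms, function(f) sum(residuals(lm(f, data = timings))^2))
best <- names(which.min(rss))
print(rss)
print(best)
```

Comparing residuals like this is only one simple way to rank the candidate classes; it is not necessarily how testComplexity classifies timings internally.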

For a first try with geom_smooth(), I made the mistake of reusing the old logic: add a third column to each data frame so the algorithms can be distinguished by the color aesthetic, rbind() the data frames, and plot the result. I tried it with two algorithms first, PDPA and cDPA, and kept their geom_line layers for comparison. The regression line ended up stretched between the two lines, which made me realize I was fitting one regression to the whole rbind()'ed data frame (an average that falls between the two, hence its position) instead of fitting a separate regression line for each algorithm:

library(testComplexity)
library(ggplot2)
df.PeakSegPDPA <- asymptoticTimings(PeakSegOptimal::PeakSegPDPA(rpois(N, 1),rep(1, length(rpois(N, 1))), 3L), data.sizes = 10^seq(1, 4, by = 0.5))
df.cDPA <- asymptoticTimings(PeakSegDP::cDPA(rpois(N, 1), rep(1, length(rpois(N, 1))), 3L), data.sizes = 10^seq(1, 4, by = 0.5))
# Label each data frame with its algorithm, then stack them:
df.PeakSegPDPA$expr <- "PDPA"
df.cDPA$expr <- "cDPA"
plot.obj <- rbind(df.PeakSegPDPA, df.cDPA)
ggplot(plot.obj, aes(x = `Data sizes`, y = Timings)) +
  geom_point(aes(color = expr)) + geom_line(aes(color = expr)) +
  labs(x = "Data size", y = "Runtime (in nanoseconds)") +
  scale_x_log10() + scale_y_log10() +
  ggtitle("Timings comparison plot", subtitle = "Log-linear vs Quadratic trend") +
  # No grouping aesthetic reaches this layer, so a single line is fit to all rows:
  geom_smooth(method = "glm", formula = y~x) +
  directlabels::geom_dl(aes(label = expr), method = "last.qp")
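One way to get a separate regression line per algorithm from the stacked data frame is to map color in the top-level aes(), so that geom_smooth() inherits the grouping. The sketch below uses made-up timings in place of the asymptoticTimings() output (the constants 50 and 5 and the noise level are assumptions), with the same column names as above.

```r
library(ggplot2)

# Synthetic stand-in for the rbind()'ed timings; the numbers are made up.
set.seed(1)
sizes <- rep(10^seq(1, 4, by = 0.5), each = 5)
plot.obj <- rbind(
  data.frame(Timings = 50 * sizes * log(sizes) * exp(rnorm(length(sizes), sd = 0.1)),
             `Data sizes` = sizes, expr = "PDPA", check.names = FALSE),
  data.frame(Timings = 5 * sizes^2 * exp(rnorm(length(sizes), sd = 0.1)),
             `Data sizes` = sizes, expr = "cDPA", check.names = FALSE)
)

# color = expr in the top-level aes() is inherited by every layer,
# so geom_smooth() fits one line per algorithm instead of one overall.
p <- ggplot(plot.obj, aes(x = `Data sizes`, y = Timings, color = expr)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  scale_x_log10() + scale_y_log10() +
  labs(x = "Data size", y = "Runtime (in nanoseconds)")
p
```

Because both axes use log10 scales, the fit is computed on the transformed values, so the slope of each straight line estimates the exponent of a power-law trend (close to 2 for a quadratic algorithm).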

So I decided to create separate geom_smooth layers for the algorithms, each with its own aesthetics and its own point/line colour. I renamed the 'Timings' column of each benchmark data frame (PDPA and cDPA) to the name of its complexity class, so that the classes can be told apart on the y-axis, combined the two data frames with cbind() (the shared column being Data sizes), and plotted first the PDPA layer (y set to Loglinear) and then the cDPA layer (y set to Quadratic):

library(testComplexity)
library(data.table)
library(ggplot2)
df.PeakSegPDPA <- asymptoticTimings(PeakSegOptimal::PeakSegPDPA(rpois(N, 1),rep(1, length(rpois(N, 1))), 3L), data.sizes = 10^seq(1, 4, by = 0.5))
df.cDPA <- asymptoticTimings(PeakSegDP::cDPA(rpois(N, 1), rep(1, length(rpois(N, 1))), 3L), data.sizes = 10^seq(1, 4, by = 0.5))
# Rename the 'Timings' column to the complexity class, so the classes can be told apart on the y-axis:
colnames(df.cDPA)[1] <- "Quadratic"
colnames(df.PeakSegPDPA)[1] <- "Loglinear"
# merge() on the data-size column duplicated rows (70k obs. instead of the 700 here), so cbind() is used, which just adds the extra columns:
df <- cbind(df.PeakSegPDPA, df.cDPA)
df[2] <- NULL # Drop one data-size column, since cbind() keeps it from both data frames (at the 2nd and 4th indices here)
# Resultant data frame:
data.table(df)
     Loglinear  Quadratic Data sizes
  1:    296102     138101         10
  2:    157700      53802         10
  3:    123300      45301         10
  4:    141501      41402         10
  5:    126301      41201         10
 ---                                
696: 306150800 3749064301      10000
697: 335036102 3261133301      10000
698: 310979501 3951080402      10000
699: 354960002 3211433401      10000
700: 381043901 3728054701      10000
# plot:
ggplot(df, aes(x = `Data sizes`, y = "Loglinear")) +
geom_point(color = "red") + # using point since line would get overlapped by the straight regression line
geom_smooth(method = "glm", se = FALSE, fullrange = TRUE) +
geom_point(aes(y = "Quadratic"), color = "blue") +
geom_smooth(aes(y = "Quadratic"), method = "glm", se = FALSE, fullrange=TRUE) +
labs(x = "Data size", y = "Runtime") +
scale_x_log10() + scale_y_log10()

This throws an error.

I removed the log scale on the y-axis, but still did not get the expected plot.
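The error message itself is not shown here, so the following is only a guess at the cause: inside aes(), a quoted name such as "Loglinear" is a constant string rather than a reference to the column, so ggplot2 treats y as a discrete value, which would clash with scale_y_log10(). A minimal sketch with unquoted column names, on made-up data shaped like the cbind() result above (the values and constants are assumptions):

```r
library(ggplot2)

# Synthetic wide data frame with the same column names as above; values are made up.
set.seed(1)
sizes <- rep(10^seq(1, 4, by = 0.5), each = 5)
df <- data.frame(
  Loglinear = 50 * sizes * log(sizes) * exp(rnorm(length(sizes), sd = 0.1)),
  Quadratic = 5 * sizes^2 * exp(rnorm(length(sizes), sd = 0.1)),
  `Data sizes` = sizes,
  check.names = FALSE
)

# Unquoted names inside aes() refer to the columns themselves.
p <- ggplot(df, aes(x = `Data sizes`)) +
  geom_point(aes(y = Loglinear), color = "red") +
  geom_smooth(aes(y = Loglinear), method = "lm", formula = y ~ x,
              se = FALSE, color = "red") +
  geom_point(aes(y = Quadratic), color = "blue") +
  geom_smooth(aes(y = Quadratic), method = "lm", formula = y ~ x,
              se = FALSE, color = "blue") +
  labs(x = "Data size", y = "Runtime") +
  scale_x_log10() + scale_y_log10()
p
```

Each geom_smooth() layer here gets its own y mapping, so two separate lines are fit even though the data frame stays in wide format.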

tdhock · 2020-08-26T17:13:19Z

You are using glm to fit the models, right? Use the predict method of each glm model to get predicted times, save them in a data table, then use geom_line with aes(color=complexity), something like:

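library(data.table)
# timing.data is assumed to have columns n.data and seconds
form.list <- list(
  linear = seconds ~ n.data,
  quadratic = seconds ~ I(n.data^2) # I() so that ^ means arithmetic power
  # ...... (formulas for the other complexity classes)
)
pred.dt.list <- list()
for(complexity in names(form.list)){
  form <- form.list[[complexity]]
  # data must be passed by name: the second positional argument of glm() is family
  fit <- glm(form, data = timing.data)
  seconds <- predict(fit, newdata = timing.data)
  pred.dt.list[[complexity]] <- data.table(
    complexity, n.data = timing.data$n.data, seconds)
}
pred.dt <- do.call(rbind, pred.dt.list)

geom_line(aes(
  n.data, seconds, color=complexity),
  data=pred.dt)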
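For reference, a self-contained sketch of this predict-then-plot approach on made-up timings. The synthetic timing.data, the constant 2e-7, the loglinear formula and the use of rbindlist() are assumptions added for this example and are not part of the suggestion above.

```r
library(data.table)
library(ggplot2)

# Synthetic timings that grow roughly like n * log(n); values are illustrative only.
set.seed(1)
n.data <- rep(10^seq(1, 4, by = 0.5), each = 10)
timing.data <- data.table(
  n.data = n.data,
  seconds = 2e-7 * n.data * log(n.data) * exp(rnorm(length(n.data), sd = 0.05))
)

# One formula per complexity class.
form.list <- list(
  linear    = seconds ~ n.data,
  loglinear = seconds ~ I(n.data * log(n.data)),
  quadratic = seconds ~ I(n.data^2)
)

# Fit each model and keep its predictions, labelled by complexity class.
pred.dt.list <- list()
for (complexity in names(form.list)) {
  fit <- glm(form.list[[complexity]], data = timing.data)
  pred.dt.list[[complexity]] <- data.table(
    complexity,
    n.data = timing.data$n.data,
    seconds = predict(fit, newdata = timing.data)
  )
}
pred.dt <- rbindlist(pred.dt.list)

# Observed timings as points, one fitted line per complexity class.
# y stays on a linear scale because a fitted line can dip below zero
# at small n, which a log scale cannot display.
p <- ggplot() +
  geom_point(aes(n.data, seconds), data = timing.data) +
  geom_line(aes(n.data, seconds, color = complexity), data = pred.dt) +
  scale_x_log10() +
  labs(x = "Data size", y = "Runtime (seconds)")
p
```

Keeping the predictions in one long data table means a single geom_line() layer draws every class, and adding another class only requires another entry in form.list.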

Anirban166 · 2020-08-26T17:20:46Z

That's a nice way to do it, thanks!

Anirban166 closed this as completed Aug 29, 2020
