Adding multiple regression lines for different complexity classes #50

Closed
Anirban166 opened this issue Aug 26, 2020 · 2 comments

Anirban166 commented Aug 26, 2020

My basic idea was to mimic the trends of algorithms already tested, one per complexity class (substring for linear, PeakSegPDPA for log-linear, cDPA for quadratic), via a glm/lm fit (using the straight y ~ x formula; curves like y ~ I(x^2) or y ~ log(x) would work too), and then combine them into one plot with three regression lines marking the linear, log-linear and quadratic complexity classes.

As a first try with geom_smooth(), I made the mistake of applying the old logic: adding a third column to each data frame to distinguish the algorithms by the color aesthetic, rbind()'ing the data frames and plotting the result. This produced a single line running between the two geom lines (kept at first just for comparison) for the PDPA and cDPA algorithms (I tried with two of them first). That made me realize I was plotting one regression line for the whole rbind()'ed data frame (an average that falls between the two, hence its position) instead of separate regression lines for the two algorithms:

library(testComplexity)
library(ggplot2)
df.PeakSegPDPA <- asymptoticTimings(PeakSegOptimal::PeakSegPDPA(rpois(N, 1), rep(1, length(rpois(N, 1))), 3L), data.sizes = 10^seq(1, 4, by = 0.5))
df.cDPA <- asymptoticTimings(PeakSegDP::cDPA(rpois(N, 1), rep(1, length(rpois(N, 1))), 3L), data.sizes = 10^seq(1, 4, by = 0.5))
# Third column used to tell the two algorithms apart after rbind()
df.PeakSegPDPA$expr <- "PDPA"
df.cDPA$expr <- "cDPA"
plot.obj <- rbind(df.PeakSegPDPA, df.cDPA)
ggplot(plot.obj, aes(x = `Data sizes`, y = Timings)) +
  geom_point(aes(color = expr)) + geom_line(aes(color = expr)) +
  labs(x = "Data size", y = "Runtime (in nanoseconds)") +
  scale_x_log10() + scale_y_log10() +
  ggtitle("Timings comparison plot", subtitle = "Log-linear vs Quadratic trend") +
  # no color/group mapping on this layer, so one fit is drawn for the whole rbind()'ed data
  geom_smooth(method = "glm", formula = y ~ x) +
  directlabels::geom_dl(aes(label = expr), method = "last.qp")

image
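
A sketch of one way to get a separate regression line per algorithm from the rbind()'ed data frame: map color = expr in the top-level aes(), so that geom_smooth() inherits the grouping. The data here is simulated (the constants and noise are made up, not benchmark results), and method = "lm" stands in for "glm" to keep it short. With log scales on both axes the fit is computed on the transformed values, so each line is a straight-line fit in log-log space.

```r
library(ggplot2)

# Simulated stand-in for the rbind()'ed benchmark data (values are made up).
set.seed(1)
sizes <- rep(10^seq(1, 4, by = 0.5), each = 5)
noise <- function() exp(rnorm(length(sizes), sd = 0.1))
plot.obj <- rbind(
  data.frame(Timings = 50 * sizes * log(sizes) * noise(),
             `Data sizes` = sizes, expr = "PDPA", check.names = FALSE),
  data.frame(Timings = 5 * sizes^2 * noise(),
             `Data sizes` = sizes, expr = "cDPA", check.names = FALSE))

# color = expr at the top level groups every layer by algorithm,
# so geom_smooth() fits one line per algorithm instead of one overall.
p <- ggplot(plot.obj, aes(x = `Data sizes`, y = Timings, color = expr)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  scale_x_log10() + scale_y_log10() +
  labs(x = "Data size", y = "Runtime (in nanoseconds)")
print(p)
```

Keeping the data in long format (one row per timing, one column naming the algorithm) means no extra layers are needed when a third algorithm is added.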

So I decided to create separate geom_smooth layers (plus points/lines in a different colour) for each algorithm, specifying the aesthetics separately. I started by renaming columns in the data frames holding the benchmarked PDPA/cDPA data, replacing 'Timings' with the name of the respective complexity class (to distinguish them on the y-axis), then combined them with cbind (the common column being Data sizes), and finally plotted the PDPA layer first (y set to Loglinear) and then the cDPA layer (y = Quadratic):

library(testComplexity)
library(data.table)
library(ggplot2)
df.PeakSegPDPA <- asymptoticTimings(PeakSegOptimal::PeakSegPDPA(rpois(N, 1), rep(1, length(rpois(N, 1))), 3L), data.sizes = 10^seq(1, 4, by = 0.5))
df.cDPA <- asymptoticTimings(PeakSegDP::cDPA(rpois(N, 1), rep(1, length(rpois(N, 1))), 3L), data.sizes = 10^seq(1, 4, by = 0.5))
# Rename the 'Timings' column to the complexity class name, to distinguish the y components:
colnames(df.cDPA)[1] <- "Quadratic"
colnames(df.PeakSegPDPA)[1] <- "Loglinear"
# merge() on the data sizes column duplicated rows (70k obs. instead of the 700 here), so cbind is used, which just adds columns:
df <- cbind(df.PeakSegPDPA, df.cDPA)
df[2] <- NULL # Drop one data size column, since cbind keeps it from both data frames (2nd and 4th positions here)
# Resultant data frame:
data.table(df)
     Loglinear  Quadratic Data sizes
  1:    296102     138101         10
  2:    157700      53802         10
  3:    123300      45301         10
  4:    141501      41402         10
  5:    126301      41201         10
 ---                                
696: 306150800 3749064301      10000
697: 335036102 3261133301      10000
698: 310979501 3951080402      10000
699: 354960002 3211433401      10000
700: 381043901 3728054701      10000
# plot:
ggplot(df, aes(x = `Data sizes`, y = "Loglinear")) +
geom_point(color = "red") + # using point since line would get overlapped by the straight regression line
geom_smooth(method = "glm", se = FALSE, fullrange = TRUE) +
geom_point(aes(y = "Quadratic"), color = "blue") +
geom_smooth(aes(y = "Quadratic"), method = "glm", se = FALSE, fullrange=TRUE) +
labs(x = "Data size", y = "Runtime") +
scale_x_log10() + scale_y_log10()

This throws an error:

image

I removed the log scale on Y, but did not get what was expected:

image
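
A likely cause, stated as an assumption since the error text is only in the screenshot: aes(y = "Loglinear") maps the literal string rather than the column, so y becomes a discrete constant that scale_y_log10() cannot handle. A minimal sketch with unquoted column names, using simulated data (made-up values) in the same wide layout as the cbind()'ed data frame:

```r
library(ggplot2)

# Simulated wide-format data with the same column names (values are made up).
set.seed(1)
sizes <- rep(10^seq(1, 4, by = 0.5), each = 5)
df <- data.frame(
  Loglinear = 50 * sizes * log(sizes) * exp(rnorm(length(sizes), sd = 0.1)),
  Quadratic = 5 * sizes^2 * exp(rnorm(length(sizes), sd = 0.1)),
  `Data sizes` = sizes, check.names = FALSE)

# Unquoted names refer to columns; a quoted "Loglinear" would be a constant string.
p <- ggplot(df, aes(x = `Data sizes`)) +
  geom_point(aes(y = Loglinear), color = "red") +
  geom_smooth(aes(y = Loglinear), method = "lm", formula = y ~ x,
              se = FALSE, color = "red") +
  geom_point(aes(y = Quadratic), color = "blue") +
  geom_smooth(aes(y = Quadratic), method = "lm", formula = y ~ x,
              se = FALSE, color = "blue") +
  labs(x = "Data size", y = "Runtime") +
  scale_x_log10() + scale_y_log10()
print(p)
```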

tdhock commented Aug 26, 2020

You are using glm to fit the models, right? Use the predict method of each glm model to get predicted times, save them to a data table, then use geom_line with aes(color=complexity), something like:

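form.list <- list(
  linear = seconds ~ n.data,
  # I() is needed: inside a formula, ^ means term crossing, not arithmetic
  quadratic = seconds ~ I(n.data^2)
  # ......
)
pred.dt.list <- list()
for(complexity in names(form.list)){
  form <- form.list[[complexity]]
  # name the data argument: the second positional argument of glm is family
  fit <- glm(form, data = timing.data)
  # predict expects a data frame containing the n.data column
  seconds <- predict(fit, timing.data)
  pred.dt.list[[complexity]] <- data.table(
    complexity, n.data = timing.data$n.data, seconds)
}
pred.dt <- do.call(rbind, pred.dt.list)

geom_line(aes(
  n.data, seconds, color=complexity),
  data=pred.dt)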
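
For reference, a self-contained sketch of this predict-then-plot approach. The timing.data table is simulated (a made-up quadratic trend with noise), and the loglinear formula and the prediction grid are illustrative additions, not part of the suggestion above. Each formula wraps its term in I() so that the arithmetic is evaluated literally.

```r
library(data.table)
library(ggplot2)

# Hypothetical timings with a roughly quadratic trend (values are made up).
set.seed(1)
n.data <- rep(10^seq(1, 4, by = 0.5), each = 5)
timing.data <- data.table(
  n.data,
  seconds = 1e-8 * n.data^2 * exp(rnorm(length(n.data), sd = 0.1)))

# One formula per complexity class.
form.list <- list(
  linear = seconds ~ n.data,
  loglinear = seconds ~ I(n.data * log(n.data)),
  quadratic = seconds ~ I(n.data^2))

# Predict on the distinct data sizes so each class gives one clean line.
grid <- data.table(n.data = unique(timing.data$n.data))
pred.dt.list <- list()
for (complexity in names(form.list)) {
  fit <- glm(form.list[[complexity]], data = timing.data)
  pred.dt.list[[complexity]] <- data.table(
    complexity,
    n.data = grid$n.data,
    seconds = predict(fit, newdata = grid))
}
pred.dt <- rbindlist(pred.dt.list)

# Observed timings as points, one fitted line per complexity class.
p <- ggplot() +
  geom_point(aes(n.data, seconds), data = timing.data) +
  geom_line(aes(n.data, seconds, color = complexity), data = pred.dt)
print(p)
```

Because the predictions live in their own table, the fitted lines can be laid over any other layer without touching the raw timings.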

@Anirban166

That's a nice way to do it, thanks!
