Errors as a result of incorrect glm parameters #11

Closed
Anirban166 opened this issue Jun 14, 2020 · 10 comments

Anirban166 commented Jun 14, 2020

Accepting two columns of the data frame as parameters should ideally work the same as accepting no column parameters (only the data frame) and referring to the data frame's columns directly inside the glm() calls (as done in asymptoticTimeComplexityClass), but it does not, based on these attempts:

  • Accepting string inputs for the columns such as "col1", "col2"
asymptoticComplexityClass = function(model.df, data.size, output.size)
{ if(class(model.df) == "data.frame" & data.size %in% colnames(model.df) & output.size %in% colnames(model.df))
  {
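    # note: data.size and output.size hold column *names* as strings here, so the formulas
    # below end up modelling the strings themselves rather than the corresponding columns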
    constant   <- glm(output.size~1,                        data = model.df); model.df['constant'] = fitted(constant)
    linear     <- glm(output.size~data.size,                data = model.df); model.df['linear'] = fitted(linear)
    squareroot <- glm(output.size~sqrt(data.size),          data = model.df); model.df['squareroot'] = fitted(squareroot)
    log        <- glm(output.size~log(data.size),           data = model.df); model.df['log'] = fitted(log)
    log.linear <- glm(output.size~data.size*log(data.size), data = model.df); model.df['log-linear'] = fitted(log.linear)
    quadratic  <- glm(output.size~I(data.size^2),           data = model.df); model.df['quadratic'] = fitted(quadratic)
    cubic      <- glm(output.size~I(data.size^3),           data = model.df); model.df['cubic'] = fitted(cubic)
    model.list <- list('constant'   = constant,
                       'linear'     = linear,
                       'squareroot' = squareroot,
                       'log'        = log,
                       'log-linear' = log.linear,
                       'quadratic'  = quadratic,
                       'cubic'      = cubic)
    cross.validated.errors <- lapply(model.list, function(x) cv.glm(model.df, x)$delta[2])
    best.model <- names(which.min(cross.validated.errors))
    print(best.model)
  }
  else stop("Input parameter must be a data frame containing the two specified columns passed as parameters")
}
# Quick code for reproducing:
library(microbenchmark)
library(boot)
# Paste code for asymptoticTimings()
df <- asymptoticTimings(changepoint::cpt.mean(rnorm(data.sizes), method = "PELT"), data.sizes = 10^seq(1,3,by=0.5))
colnames(df)<-c("Timings","Datasetsize")
asymptoticComplexityClass(model.df = df, data.size = "Datasetsize", output.size = "Timings")

Error thrown:

 Error in y - mu : non-numeric argument to binary operator 
6.
dev.resids(y, mu, weights) 
5.
glm.fit(x = structure(1, .Dim = c(1L, 1L), .Dimnames = list("1", 
    "(Intercept)"), assign = 0L), y = c(`1` = "Timings"), weights = NULL, 
    start = NULL, etastart = NULL, mustart = NULL, offset = NULL, 
    family = structure(list(family = "gaussian", link = "identity",  ... 
4.
eval(call(if (is.function(method)) "method" else method, x = X, 
    y = Y, weights = weights, start = start, etastart = etastart, 
    mustart = mustart, offset = offset, family = family, control = control, 
    intercept = attr(mt, "intercept") > 0L, singular.ok = singular.ok)) 
3.
eval(call(if (is.function(method)) "method" else method, x = X, 
    y = Y, weights = weights, start = start, etastart = etastart, 
    mustart = mustart, offset = offset, family = family, control = control, 
    intercept = attr(mt, "intercept") > 0L, singular.ok = singular.ok)) 
2.
glm(output.size ~ 1, data = model.df) 
1.

As the traceback suggests, the problem is that output.size is not of numeric type; in a formula such as output.size ~ data.size, both variables (output.size and data.size here) are expected to be numeric. If we comment out the constant part (this is a traceback for constant complexity, i.e. the glm formula output.size~1), we receive a further, different error on the next complexity term (traceback for the linear formula):

 Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels 
6.
stop("contrasts can be applied only to factors with 2 or more levels") 
5.
`contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) 
4.
model.matrix.default(mt, mf, contrasts) 
3.
model.matrix(mt, mf, contrasts) 
2.
glm(output.size ~ data.size, data = model.df) 
1.
asymptoticComplexityClass(model.df = df, data.size = "Datasetsize", 
    output.size = "Timings") 

This will presumably go on for all the complexity classes, which means we are passing our data frame's columns incorrectly inside the glm() calls.

  • The same error would occur if we were to accept non-string values such as col1, col2 as well:
asymptoticComplexityClass = function(model.df, data.size, output.size)
{ data.size <- deparse(substitute(data.size))
  output.size <- deparse(substitute(output.size))
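  # deparse(substitute(...)) turns the unquoted arguments into the column-name strings,
  # so the formulas below hit the same problem as with string inputs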
  if(class(model.df) == "data.frame" & data.size %in% colnames(model.df) & output.size %in% colnames(model.df))
  {
    constant   <- glm(output.size~1,                        data = model.df); model.df['constant'] = fitted(constant)
    linear     <- glm(output.size~data.size,                data = model.df); model.df['linear'] = fitted(linear)
    squareroot <- glm(output.size~sqrt(data.size),          data = model.df); model.df['squareroot'] = fitted(squareroot)
    log        <- glm(output.size~log(data.size),           data = model.df); model.df['log'] = fitted(log)
    log.linear <- glm(output.size~data.size*log(data.size), data = model.df); model.df['log-linear'] = fitted(log.linear)
    quadratic  <- glm(output.size~I(data.size^2),           data = model.df); model.df['quadratic'] = fitted(quadratic)
    cubic      <- glm(output.size~I(data.size^3),           data = model.df); model.df['cubic'] = fitted(cubic)
    model.list <- list('constant'   = constant,
                       'linear'     = linear,
                       'squareroot' = squareroot,
                       'log'        = log,
                       'log-linear' = log.linear,
                       'quadratic'  = quadratic,
                       'cubic'      = cubic)
    cross.validated.errors <- lapply(model.list, function(x) cv.glm(model.df, x)$delta[2])
    best.model <- names(which.min(cross.validated.errors))
    print(best.model)
  }
  else stop("Input parameter must be a data frame containing the two specified columns passed as parameters")
}
# Pass the columns as non-string values to reproduce:
asymptoticComplexityClass(model.df = df, data.size = Datasetsize, output.size = Timings)
  • Coercing them to numeric is not an option either:
asymptoticComplexityClass = function(model.df, data.size, output.size)
{ data.size <- deparse(substitute(data.size))
  output.size <- deparse(substitute(output.size))
  if(class(model.df) == "data.frame" & data.size %in% colnames(model.df) & output.size %in% colnames(model.df))
  {
    constant   <- glm(as.numeric(output.size)~1,                        data = model.df); model.df['constant'] = fitted(constant)
    linear     <- glm(as.numeric(output.size)~as.numeric(data.size),                data = model.df); model.df['linear'] = fitted(linear)
    squareroot <- glm(as.numeric(output.size)~sqrt(as.numeric(data.size)),          data = model.df); model.df['squareroot'] = fitted(squareroot)
    log        <- glm(as.numeric(output.size)~log(as.numeric(data.size)),           data = model.df); model.df['log'] = fitted(log)
    log.linear <- glm(as.numeric(output.size)~as.numeric(data.size)*log(as.numeric(data.size)), data = model.df); model.df['log-linear'] = fitted(log.linear)
    quadratic  <- glm(as.numeric(output.size)~I(as.numeric(data.size^2)),           data = model.df); model.df['quadratic'] = fitted(quadratic)
    cubic      <- glm(as.numeric(output.size)~I(as.numeric(data.size^3)),           data = model.df); model.df['cubic'] = fitted(cubic)
    model.list <- list('constant'   = constant,
                       'linear'     = linear,
                       'squareroot' = squareroot,
                       'log'        = log,
                       'log-linear' = log.linear,
                       'quadratic'  = quadratic,
                       'cubic'      = cubic)
    cross.validated.errors <- lapply(model.list, function(x) cv.glm(model.df, x)$delta[2])
    best.model <- names(which.min(cross.validated.errors))
    print(best.model)
  }
  else stop("Input parameter must be a data frame containing the two specified columns passed as parameters")
}

The above throws:

Error in glm.fit(x = numeric(0), y = numeric(0), weights = NULL, start = NULL,  : 
  object 'fit' not found
In addition: Warning messages:
1: In eval(predvars, data, env) : NAs introduced by coercion
2: In glm.fit(x = numeric(0), y = numeric(0), weights = NULL, start = NULL,  :
  no observations informative at iteration 1
 Error in glm.fit(x = numeric(0), y = numeric(0), weights = NULL, start = NULL,  : 
  object 'fit' not found 
  • It seems the values get changed when passed this way around, or are not what we expect from df$Timings and df$Datasetsize, so I explicitly extract them from our data frame, using syntax such as df[[col]] or df[,col] to extract the values of column col from a data frame df:
asymptoticComplexityClass = function(model.df, data.size, output.size)
{ data.size <- deparse(substitute(data.size))
  output.size <- deparse(substitute(output.size))
  if(class(model.df) == "data.frame" & data.size %in% colnames(model.df) & output.size %in% colnames(model.df))
  {
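    # the formulas below use full-length vectors extracted here, rather than columns of model.df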
    constant   <- glm(model.df[,output.size]~1,                        data = model.df); model.df['constant'] = fitted(constant)
    linear     <- glm(model.df[,output.size]~model.df[,data.size],                data = model.df); model.df['linear'] = fitted(linear)
    squareroot <- glm(model.df[,output.size]~sqrt(model.df[,data.size]),          data = model.df); model.df['squareroot'] = fitted(squareroot)
    log        <- glm(model.df[,output.size]~log(model.df[,data.size]),           data = model.df); model.df['log'] = fitted(log)
    log.linear <- glm(model.df[,output.size]~model.df[,data.size]*log(model.df[,data.size]), data = model.df); model.df['log-linear'] = fitted(log.linear)
    quadratic  <- glm(model.df[,output.size]~I(model.df[,data.size]^2),           data = model.df); model.df['quadratic'] = fitted(quadratic)
    cubic      <- glm(model.df[,output.size]~I(model.df[,data.size]^3),           data = model.df); model.df['cubic'] = fitted(cubic)
    model.list <- list('constant'   = constant,
                       'linear'     = linear,
                       'squareroot' = squareroot,
                       'log'        = log,
                       'log-linear' = log.linear,
                       'quadratic'  = quadratic,
                       'cubic'      = cubic)
    cross.validated.errors <- lapply(model.list, function(x) cv.glm(model.df, x)$delta[2])
    best.model <- names(which.min(cross.validated.errors))
    print(best.model)
  }
  else stop("Input parameter must be a data frame containing the two specified columns passed as parameters")
}

This does not produce any error, but it gives incorrect results and always sticks to 'constant' (whereas asymptoticTimeComplexityClass would predict the correct complexity):

> asymptoticComplexityClass(model.df = df, data.size = Datasetsize, output.size = Timings)
[1] "constant"
There were 50 or more warnings (use warnings() to see the first 50)

From this it seems this isn't the correct approach either, and since we pass the extracted columns directly instead of referencing them from our data frame inside the glm() calls, this isn't what's expected. I need to figure out the correct way to emplace the two parameters.
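As an aside, a conventional way around this in base R (a minimal sketch with made-up toy data, not the package's code) is to build the formula from the column-name strings, e.g. with reformulate() or as.formula(), so that the formula refers to real columns of the data frame:

# Minimal sketch: construct glm() formulas from column-name strings
library(boot)
toy <- data.frame(Timings = rnorm(50), Datasetsize = rep(10^seq(1, 3, by = 0.5), 10))
output.size <- "Timings"
data.size   <- "Datasetsize"
# reformulate(<right-hand-side terms>, response = <response name>) builds Timings ~ Datasetsize
linear   <- glm(reformulate(data.size, response = output.size), data = toy)
# or paste the formula together and convert it
constant <- glm(as.formula(paste(output.size, "~ 1")), data = toy)
cv.glm(toy, linear)$delta[2] # cross-validation works, since the formula names actual columns of toy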


Branch : GeneralizedComplexity
Objectives : Add complexity classifier for user-defined parameters (output size and data size) from the data frame passed.

Anirban166 commented Jun 15, 2020

Created small functions to test some things, based on a data frame df:

df <- asymptoticTimings(changepoint::cpt.mean(rnorm(data.sizes), method = "PELT"), data.sizes = 10^seq(1,3,by=0.5))

Columns it possesses: df$Timings and df$`Data sizes`

Renamed `Data sizes` to Datasize by colnames(df) <- c("Timings", "Datasize")

Expectations: It should not return any warnings and should predict the correct complexity (like asymptoticTimeComplexityClass does for it).

Tests:

  • Checking accessibility inside the function when the user passes columns extracted from the data frame:
> f <- function(df, col) 
{
  print(col)               
}
> f(df, df[,"Timings"]) 
# or
> f(df, df[["Timings"]])

This works, but it won't work for checking column names inside via col %in% colnames(df).

Taking note of that for our function and omitting the check for column names:

asymptoticComplexityClass = function(df, output, size)
{
  #data.size <- deparse(substitute(data.size))
  #output.size <- deparse(substitute(output.size))
if(class(df) == "data.frame") # & 'output' %in% colnames(df) & 'size' %in% colnames(df) <- won't work
{
  constant   <- glm(output~1,                 data = df); df['constant'] = fitted(constant)
  linear     <- glm(output~size,              data = df); df['linear'] = fitted(linear)
  squareroot <- glm(output~sqrt(size),        data = df); df['squareroot'] = fitted(squareroot)
  log        <- glm(output~log(size),         data = df); df['log'] = fitted(log)
  log.linear <- glm(output~size*log(size),    data = df); df['log-linear'] = fitted(log.linear)
  quadratic  <- glm(output~I(size^2),         data = df); df['quadratic'] = fitted(quadratic)
  cubic      <- glm(output~I(size^3),         data = df); df['cubic'] = fitted(cubic)

  model.list <- list('constant'   = constant,
                     'linear'     = linear,
                     'squareroot' = squareroot,
                     'log'        = log,
                     'log-linear' = log.linear,
                     'quadratic'  = quadratic,
                     'cubic'      = cubic)

  cross.validated.errors <- lapply(model.list, function(x) cv.glm(df, x)$delta[2])
  best.model <- names(which.min(cross.validated.errors))
  print(best.model)
}
else stop("Input parameter must be a data frame containing the two specified columns passed as parameters")
}

Result: works, but still gives incorrect complexity (always constant) + warnings

> asymptoticComplexityClass(df, output = df[["Timings"]], size = df[["Datasize"]])
[1] "constant"
There were 50 or more warnings (use warnings() to see the first 50)

The warnings kind of suggest what the problem is:

> warnings()
Warning messages:
1: 'newdata' had 1 row but variables found have 500 rows
2: 'newdata' had 1 row but variables found have 500 rows
3: 'newdata' had 1 row but variables found have 500 rows
4: 'newdata' had 1 row but variables found have 500 rows
5: 'newdata' had 1 row but variables found have 500 rows
# goes up to 50

But I am not able to infer what 'newdata' is (the 500 rows correspond to the number of rows in df), or how to resolve this and place the parameters correctly to avoid it (see the sketch at the end of this comment).

  • Checking for and giving as input the column names directly:
f <- function(df, col) 
{
  if(col %in% colnames(df)) print(1)
}
> f(df, "Timings")

This would work to pass the test of column names, but it gives the error we received first, since (I guess) the column names passed aren't resolved as columns of the passed data frame df inside the glm() calls:

> asymptoticComplexityClass(df, output = "Timings", size = "Datasize")
 Error in y - mu : non-numeric argument to binary operator 

This leads to the question: since we have changed the column names, are the column names themselves posing problems? The answer is no, because I tried a separate asymptoticTimings function which returns a data frame with the columns Timings and Datasize directly, so the renaming (from `Data set sizes` to Datasize) isn't needed, and it still gives the same warnings plus the incorrect complexity. asymptoticTimeComplexityClass doesn't give the warnings for this case, although its complexity prediction will still be incorrect, which is expected since it expects a column named `Data set sizes`.
I had to try this since the column names were the obvious suspect, and knowing what glm() readily accepts inside its formula scope is probably the piece I'm missing, which leads to this problem. In the correctly working example, the time complexity classifier just had the data frame passed as the parameter to the function, with the columns referenced directly in the glm() scope; when I tried to incorporate that in asymptoticComplexityClass, including the column names (passed to the function along with the data frame) inside the glm() calls gave the errors above.
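To illustrate where the 'newdata' warnings from the first test come from (a minimal sketch with made-up toy data; this is my reading of boot::cv.glm's behaviour): cv.glm() refits the model on a subset of the data frame and then calls predict(fit, newdata = held.out.rows). If the formula hard-codes full-length vectors from the calling environment instead of naming columns of the data frame, predict() keeps using those full-length vectors, which triggers the warning and makes the resulting cross-validated errors meaningless, which presumably explains why 'constant' always wins above:

library(boot)
toy <- data.frame(y = rnorm(20), x = 1:20)
fit.env <- glm(toy[["y"]] ~ toy[["x"]])  # formula references vectors, not columns
# cv.glm(toy, fit.env)                   # warns: 'newdata' had 1 row but variables found have 20 rows
fit.col <- glm(y ~ x, data = toy)        # formula references columns of the data frame
cv.glm(toy, fit.col)$delta[2]            # no warnings; the held-out rows are actually used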


Branch : GeneralizedComplexity
Objectives : Add complexity classifier for user-defined parameters (output size and data size) from the data frame passed.

Anirban166 commented Jun 15, 2020

Finally resolved it! whew :')

Firstly, what led to the solution was to create a new data frame by extracting the columns from our data frame and naming them as used inside asymptoticComplexityClass, which does work:

new_df <- data.frame('output' = df[["Timings"]], 'size' = df[["Datasize"]])
> asymptoticComplexityClass(new_df)
[1] "log-linear"

But again, this requires knowing the column names beforehand (like "Timings" and "Datasize" as used here), so we create a function that accepts arbitrary column names and internally creates a data frame with the required 'output' and 'size' columns set up for our complexity classifying function:

f <- function(df, col1, col2) 
{
   d <- data.frame('output' = df[[col1]], 'size' = df[[col2]])
   return(d) 
}
> x <- f(df, "Timings", "Datasize")
> asymptoticComplexityClass(x)
[1] "log-linear"

Branch : GeneralizedComplexity
Objectives : Add complexity classifier for user-defined parameters (output size and data size) from the data frame passed.

Anirban166 commented Jun 15, 2020

Done with the expected functionality, to summarize:

Now the user can specify the columns of their data frame to classify complexity from, as strings, passing them along with the data frame; it's as simple as calling asymptoticComplexityClass(df, "outputsize_colname", "datasize_colname").

What it does internally is create a new data frame with only the user-specified columns (the original data frame may have any number of columns), renamed as per the naming convention followed in asymptoticComplexityClassifier, which resolves the problem discussed extensively above.

Code:

asymptoticComplexityClass = function(df, output.size, data.size)
{
  if(class(df) == "data.frame" & output.size %in% colnames(df) & data.size %in% colnames(df))
  {
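    # build a two-column data frame with fixed names, so the formulas in
    # asymptoticComplexityClassifier can resolve 'output' and 'size' as columns of its data argument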
    d <- data.frame('output' = df[[output.size]], 'size' = df[[data.size]])
    asymptoticComplexityClassifier(d)
  }
  else stop("Input parameter must be a data frame containing the two specified columns passed as parameters")
}
asymptoticComplexityClassifier = function(df)
{
  if(class(df) == "data.frame" & 'output' %in% colnames(df) & 'size' %in% colnames(df))
  {
    constant   <- glm(output~1,                 data = df); df['constant'] = fitted(constant)
    linear     <- glm(output~size,              data = df); df['linear'] = fitted(linear)
    squareroot <- glm(output~sqrt(size),        data = df); df['squareroot'] = fitted(squareroot)
    log        <- glm(output~log(size),         data = df); df['log'] = fitted(log)
    log.linear <- glm(output~size*log(size),    data = df); df['log-linear'] = fitted(log.linear)
    quadratic  <- glm(output~I(size^2),         data = df); df['quadratic'] = fitted(quadratic)
    cubic      <- glm(output~I(size^3),         data = df); df['cubic'] = fitted(cubic)

    model.list <- list('constant'   = constant,
                       'linear'     = linear,
                       'squareroot' = squareroot,
                       'log'        = log,
                       'log-linear' = log.linear,
                       'quadratic'  = quadratic,
                       'cubic'      = cubic)

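    # leave-one-out cross-validation (cv.glm's default) error for each model; delta[2] is the adjusted estimate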
    cross.validated.errors <- lapply(model.list, function(x) cv.glm(df, x)$delta[2])
    best.model <- names(which.min(cross.validated.errors))
    print(best.model)
  }
  else stop("Input parameter must be a data frame containing the two specified columns passed as parameters")
}
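
For example, with the df whose columns were renamed to Timings and Datasize in the earlier comment, the call would now simply be:

asymptoticComplexityClass(df, output.size = "Timings", data.size = "Datasize")
# expected to print "log-linear", matching the manual test above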

Branch : GeneralizedComplexity
Objectives : Add complexity classifier for user-defined parameters (output size and data size) from the data frame passed.

tdhock commented Jun 19, 2020

hi @Anirban166 you should think about re-writing that code using a for loop over model complexity classes, e.g.

for(complexity.class in c("constant", "linear", etc)){
  model.list[[complexity.class]] <- etc
}

tdhock commented Jun 19, 2020

(would be good to reduce repetition)

tdhock commented Jun 19, 2020

(make it easier to add/remove/edit complexity classes)

@Anirban166

hi @Anirban166 you should think about re-writing that code using a for loop over model complexity classes, e.g.

for(complexity.class in c("constant", "linear", etc)){
  model.list[[complexity.class]] <- etc
}

Done

model.list <- list()
for(complexity.class in c('constant', 'squareroot', 'log', 'linear', 'loglinear', 'quadratic', 'cubic'))
{
   model.list[[complexity.class]] = eval(as.name(complexity.class))
}  

(would be good to reduce repetition)

adheres to DRY style!

(make it easier to add/remove/edit complexity classes)

Yup, that's a good point
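
(For illustration, a minimal sketch that goes one step further, assuming a data frame df with the output and size columns as above: looping over a named list of formulas means no per-class model variables are needed at all, and adding or removing a complexity class becomes a one-line change.)

formula.list <- list('constant'   = output ~ 1,
                     'squareroot' = output ~ sqrt(size),
                     'log'        = output ~ log(size),
                     'linear'     = output ~ size,
                     'loglinear'  = output ~ size * log(size),
                     'quadratic'  = output ~ I(size^2),
                     'cubic'      = output ~ I(size^3))
model.list <- lapply(formula.list, glm, data = df)  # fit one glm per formula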

@Anirban166

Note that I had to rename instances of log-linear to loglinear (or it could be log.linear), because get('log-linear') or eval(as.name('log-linear')) won't work here: log-linear isn't a syntactically valid variable name (a plain variable name can't contain a dash, so it would need backticks).

@tdhock
Copy link

tdhock commented Jun 24, 2020

also log_linear would be ok, but loglinear is fine with me

@Anirban166

also log_linear would be ok, but loglinear is fine with me

okay!
