Errors as a result of incorrect glm parameters #11
Created small functions to test some things, based on a data frame obtained via:
df <- asymptoticTimings(changepoint::cpt.mean(rnorm(data.sizes), method = "PELT"), data.sizes = 10^seq(1,3,by=0.5))
The columns it possesses are Timings and Datasize. Passing one of them by value works fine:
> f <- function(df, col)
{
print(col)
}
> f(df, df[,"Timings"])
# or
> f(df, df[["Timings"]])
This works, but it won't work for checking column names inside a function such as:
asymptoticComplexityClass = function(df, output, size)
{
#data.size <- deparse(substitute(data.size))
#output.size <- deparse(substitute(output.size))
if(class(df) == "data.frame") # & 'output' %in% colnames(df) & 'size' %in% colnames(df) <- won't work
{
constant <- glm(output~1, data = df); df['constant'] = fitted(constant)
linear <- glm(output~size, data = df); df['linear'] = fitted(linear)
squareroot <- glm(output~sqrt(size), data = df); df['squareroot'] = fitted(squareroot)
log <- glm(output~log(size), data = df); df['log'] = fitted(log)
log.linear <- glm(output~size*log(size), data = df); df['log-linear'] = fitted(log.linear)
quadratic <- glm(output~I(size^2), data = df); df['quadratic'] = fitted(quadratic)
cubic <- glm(output~I(size^3), data = df); df['cubic'] = fitted(cubic)
model.list <- list('constant' = constant,
'linear' = linear,
'squareroot' = squareroot,
'log' = log,
'log-linear' = log.linear,
'quadratic' = quadratic,
'cubic' = cubic)
cross.validated.errors <- lapply(model.list, function(x) cv.glm(df, x)$delta[2])
best.model <- names(which.min(cross.validated.errors))
print(best.model)
}
else stop("Input parameter must be a data frame containing the two specified columns passed as parameters")
}
Result: it works, but still gives an incorrect complexity class (always constant), plus warnings:
> asymptoticComplexityClass(df, output = df[["Timings"]], size = df[["Datasize"]])
[1] "constant"
There were 50 or more warnings (use warnings() to see the first 50)
The warnings kind of suggest what the problem is:
> warnings()
Warning messages:
1: 'newdata' had 1 row but variables found have 500 rows
2: 'newdata' had 1 row but variables found have 500 rows
3: 'newdata' had 1 row but variables found have 500 rows
4: 'newdata' had 1 row but variables found have 500 rows
5: 'newdata' had 1 row but variables found have 500 rows
# goes up till 50
But I am unable to infer what 'newdata' is (the 500 rows correspond to the number of rows in df).
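For what it's worth, the warning seems to come from predict() inside cv.glm(): because output and size are free-standing vectors rather than columns of df, predict() cannot find them in the held-out rows (the 'newdata') and falls back to the full-length vectors from the calling environment, so every fold is effectively scored against the whole data set, which would also explain why 'constant' never loses. A minimal sketch that reproduces the warning, using made-up names (time_vec, size_vec) and a toy 20-row frame:
library(boot)
set.seed(1)
n <- 20
size_vec <- seq_len(n)                    # lives outside the data frame
time_vec <- rnorm(n)                      # lives outside the data frame
d <- data.frame(id = seq_len(n))          # the frame handed to cv.glm() has neither column
fit <- glm(time_vec ~ size_vec, data = d) # formula variables resolve outside 'd'
cv.glm(d, fit)$delta
# Warning: 'newdata' had 1 row but variables found have 20 rows (once per fold)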
f <- function(df, col)
{
if(col %in% colnames(df)) print(1)
}
> f(df, "Timings") This would work to pass the test of column names, but it would give the error that we recieved first, since I guess the column names passed aren't inferred correctly as columns for the passed data frame > asymptoticComplexityClass(df, output = "Timings", size = "Datasize")
Error in y - mu : non-numeric argument to binary operator
This raises the question: since we have changed the column names, are the column names themselves posing problems? The answer is no, because I tried with a separate set of column names too.
Branch : GeneralizedComplexity
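An aside on that exact error: the reason appears to be that output is bound to the character value "Timings", which is not a column of df, so the string itself becomes the response of output ~ 1 and the gaussian family's deviance computation (y - mu) fails. A toy frame (names assumed) reproduces it:
toy <- data.frame(Timings = rnorm(5), Datasize = 1:5)   # stand-in for df
output <- "Timings"
glm(output ~ 1, data = toy)
# Error in y - mu : non-numeric argument to binary operator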
Finally resolved it! whew :')
What led to the solution was to create a new data frame by extracting the columns from our data frame and naming them after the parameters used inside the function:
new_df <- data.frame('output' = df[["Timings"]], 'size' = df[["Datasize"]])
> asymptoticComplexityClass(new_df)
[1] "log-linear" But again, I must be able to know the column names before-hand like "Timings" and "Datasetsize" as used here, so we create a function that accepts random column names, and inside it creates a data frame with required 'output' and 'size' set up for our complexity classifying function: f <- function(df, col1, col2)
{
d <- data.frame('output' = df[[col1]], 'size' = df[[col2]])
return(d)
}
> x <- f(df, "Timings", "Datasize")
> asymptoticComplexityClass(x)
[1] "log-linear" Branch : GeneralizedComplexity |
Done with the expected functionality. To summarize: the user can now specify, as strings, the columns of their data frame to classify complexity from, passing them along with the data frame in a single call. Code:
asymptoticComplexityClass = function(df, output.size, data.size)
{
if(class(df) == "data.frame" & output.size %in% colnames(df) & data.size %in% colnames(df))
{
d <- data.frame('output' = df[[output.size]], 'size' = df[[data.size]])
asymptoticComplexityClassifier(d)
}
else stop("Input parameter must be a data frame containing the two specified columns passed as parameters")
}

asymptoticComplexityClassifier = function(df)
{
if(class(df) == "data.frame" & 'output' %in% colnames(df) & 'size' %in% colnames(df))
{
constant <- glm(output~1, data = df); df['constant'] = fitted(constant)
linear <- glm(output~size, data = df); df['linear'] = fitted(linear)
squareroot <- glm(output~sqrt(size), data = df); df['squareroot'] = fitted(squareroot)
log <- glm(output~log(size), data = df); df['log'] = fitted(log)
log.linear <- glm(output~size*log(size), data = df); df['log-linear'] = fitted(log.linear)
quadratic <- glm(output~I(size^2), data = df); df['quadratic'] = fitted(quadratic)
cubic <- glm(output~I(size^3), data = df); df['cubic'] = fitted(cubic)
model.list <- list('constant' = constant,
'linear' = linear,
'squareroot' = squareroot,
'log' = log,
'log-linear' = log.linear,
'quadratic' = quadratic,
'cubic' = cubic)
cross.validated.errors <- lapply(model.list, function(x) cv.glm(df, x)$delta[2])
best.model <- names(which.min(cross.validated.errors))
print(best.model)
}
else stop("Input parameter must be a data frame containing the two specified columns passed as parameters")
}
Branch : GeneralizedComplexity
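A quick usage sketch, tying this back to the timings data frame from the top of the thread (boot is assumed to be attached for cv.glm; the result shown is the one reported above):
library(boot)
df <- asymptoticTimings(changepoint::cpt.mean(rnorm(data.sizes), method = "PELT"), data.sizes = 10^seq(1,3,by=0.5))
asymptoticComplexityClass(df, output.size = "Timings", data.size = "Datasize")
# [1] "log-linear"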
hi @Anirban166 you should think about re-writing that code using a for loop over model complexity classes, e.g. something along the lines of the sketch below
(would be good to reduce repetition)
(make it easier to add/remove/edit complexity classes)
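(A sketch of what such a loop might look like, assuming the 'output'/'size' data frame used by the classifier above; the formula list mirrors the seven classes already present, and the function name classifyComplexity is just a placeholder, not the mentor's exact suggestion.)
library(boot)
classifyComplexity <- function(df)
{
  formula.list <- list('constant'   = output ~ 1,
                       'squareroot' = output ~ sqrt(size),
                       'log'        = output ~ log(size),
                       'linear'     = output ~ size,
                       'loglinear'  = output ~ size * log(size),
                       'quadratic'  = output ~ I(size^2),
                       'cubic'      = output ~ I(size^3))
  cross.validated.errors <- c()
  for(complexity.class in names(formula.list))
  { # fit and cross-validate each candidate class in one pass
    fit <- glm(formula.list[[complexity.class]], data = df)
    cross.validated.errors[complexity.class] <- cv.glm(df, fit)$delta[2]
  }
  names(which.min(cross.validated.errors))
}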
Done
model.list <- list()
for(complexity.class in c('constant', 'squareroot', 'log', 'linear', 'loglinear', 'quadratic', 'cubic'))
{
model.list[[complexity.class]] = eval(as.name(complexity.class)) # fetch the already-fitted model object by its name
}
adheres to DRY style!
Yup, that's a good point
Note that I had to rename instances of 'log-linear' to 'loglinear' for this to work.
also log_linear would be ok, but loglinear is fine with me
okay!
Additionally, accepting two columns from the data frame should ideally work the same as accepting no columnar parameters (only the data frame) and directly referring to our data frame's columns inside the glm scope (as seen in asymptoticTimeComplexityClass), but that is not what happens, based on these attempts:

Error thrown: (traceback screenshot for constant complexity, i.e. the glm formula output.size ~ 1)
As the message says, it pertains to the fact that output.size is not of numeric type, which is required for a formula (x ~ y) wherein both x and y (output.size and data.size here) should be numeric. If we comment out the constant part, we receive a further error (with a different message) on the next complexity term: (traceback screenshot for the linear formula)
This presumably goes on for all the complexity classes, which means we are passing our data frame's columns incorrectly inside the glm() call.

The above attempt throws: (error screenshot)
What the formulas actually need are the column values, df$Timings and df$Datasetsizes, so I explicitly extract them from our data frame, using syntax such as df[[col]] or df[,col] to take the values of a column col from a data frame df. This does not produce any error, but it gives incorrect results and sticks to 'constant' always (whereas asymptoticTimeComplexityClass predicts the correct complexity). From this it seems that this isn't the correct approach either: since we pass the column values directly instead of extracting them from our data frame inside the calls to glm(), this isn't what is expected. I need to figure out the correct way to pass the two parameters.
Branch : GeneralizedComplexity
Objectives : Add complexity classifier for user-defined parameters (output size and data size) from the data frame passed.