Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

predict gives incorrect CIs when collinear columns have been dropped #413

Open
nalimilan opened this issue Mar 20, 2021 · 3 comments
Open

Comments

@nalimilan
Copy link
Member

As noted at #410, predict appears to give incorrect results when some predictors are dropped because of collinearity. We should really do something about this.

julia> x = [1.0 1.0 2.0
            1.0 2.0 3.0
            1.0 -1.0 0.0];

julia> y = [1.0, 3.0, -2.0];

julia> m1 = lm(x, y, dropcollinear=true)
LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,CholeskyPivoted{Float64,Array{Float64,2}}}}:

Coefficients:
─────────────────────────────────────────────────────────────────
        Coef.  Std. Error       t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────
x1   0.0       NaN         NaN       NaN     NaN        NaN
x2   2.07143     0.257539    8.04    0.0787   -1.20092    5.34378
x3  -0.428571    0.174964   -2.45    0.2468   -2.65169    1.79455
─────────────────────────────────────────────────────────────────


julia> m2 = lm(x[:, 1:2], y, dropcollinear=false)
LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}}:

Coefficients:
────────────────────────────────────────────────────────────────
        Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────
x1  -0.428571    0.174964  -2.45    0.2468  -2.65169     1.79455
x2   1.64286     0.123718  13.28    0.0479   0.070872    3.21484
────────────────────────────────────────────────────────────────


julia> p1 = predict(m1, x, interval=:confidence);

julia> p2 = predict(m2, x[:, 1:2], interval=:confidence);

julia> @test p1.prediction  p2.prediction
Test Passed

julia> @test p1.upper  p2.upper
Test Failed at REPL[254]:1
  Expression: p1.upper  p2.upper
   Evaluated: [3.354342064529162; 5.728437323463455; 1.6152034019232366]  [3.2437098232940285; 5.727181955909349; 1.200919478058661]
ERROR: There was an error during testing

julia> @test p1.lower  p2.lower
Test Failed at REPL[255]:1
  Expression: p1.lower  p2.lower
   Evaluated: [-0.925770635957734; -0.014151609177741165; -5.758060544780381]  [-0.8151383947225999; -0.012896241623634452; -5.343776620915804]
ERROR: There was an error during testing

In the worst case we could throw an error when calling predict in such cases until we have a fix.

(BTW I'm not sure why the intercept is dropped in this example rather than the last column.)

Cc: @piever @pdeffebach @mkborregaard @dmbates @andreasnoack @Nosferican @DilumAluthge

@pdeffebach
Copy link
Contributor

pdeffebach commented Mar 20, 2021

Thanks for highlighting this.

I can confirm that R does not have this behavior.

The error stems from this line where we work with the pivoted cholesky instead of the original matrix from the original regression.

I haven't tracked down exactly how R is handling it. But there is an interesting warning here

    p <- object$rank
    p1 <- seq_len(p)
    piv <- if(p) qr.lm(object)$pivot[p1]
    if(p < ncol(X) && !(missing(newdata) || is.null(newdata)))
	warning("prediction from a rank-deficient fit may be misleading")

I can trigger the warning


r$> predict(m1, as.data.frame(x), interval = "confidence")                 
        fit         lwr      upr
1  1.214286 -0.81513839 3.243710
2  2.857143 -0.01289624 5.727182
3 -2.071429 -5.34377662 1.200919
Warning message:
In predict.lm(m1, as.data.frame(x), interval = "confidence") :
  prediction from a rank-deficient fit may be misleading

@nalimilan
Copy link
Member Author

Thanks. So apparently R doesn't know how to handle it either, so we should probably throw an error (I don't like warnings that tell you we return invalid results).

@andreasnoack
Copy link
Member

@nalimilan do we have to resolve this before 2.0? Will a "fix" here be breaking any other functionality?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants