GH-15958: GLM fix Tweedie ML dispersion estimation #15959
Conversation
The Torczon multi-directional search method is meant to avoid simplex degeneration across iterations. If you do not run into the degeneration problem, then you won't need it.
Since this is just a one-dimensional optimization, I don't think it's a problem. If the 1-simplex (a line segment) degenerates, it becomes a point; that could happen, but it should only be due to finite precision, and that 0-simplex should be the local optimum. In practice it shouldn't happen, since we would converge well before running into finite-precision issues, unless the user sets the dispersion epsilon to the machine epsilon or smaller (zero or a negative number). So I think it should be OK unless I'm missing something. Does that sound reasonable @wendycwong?
Yes, your reasoning sounds good. I obviously did not read your message carefully, and it is the very first one. I really have no excuse here.
  /**
   * This method estimates the Tweedie dispersion parameter. It will use Newton's update if the new update will
   * increase the loglikelihood. Otherwise, the dispersion will be updated as
   * dispersionNew = dispersionCurr + learningRate * update.
   * In addition, line search is used to increase the magnitude of the update when the update magnitude is too small
   * (< 1e-3).
   *
-  * For details, please see seciton IV.I, IV.II, and IV.III in document here:
+  * For details, please see section IV.I, IV.II, and IV.III in document here:
   */
replace "section" with "sections"
Will do. Looking at the nice documentation you wrote in the comments made me realize that I didn't update it. I'll do it so it's up-to-date with the code. Thank you for directing my attention here!
Don't worry about it @wendycwong. No excuse needed. It happens to me all the time :)
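For readers following along, the update rule described in the doc comment quoted above might look roughly like this. This is only a sketch, not the actual h2o-3 code: the logLik oracle, the positivity guard, and the doubling line search are my own stand-ins for whatever the real implementation does.

import java.util.function.DoubleUnaryOperator;

public final class DampedNewtonStep {
    /**
     * One dispersion update following the scheme in the doc comment:
     * take the full Newton step when it improves the log likelihood,
     * otherwise damp it with the learning rate, and grow tiny steps
     * via line search. logLik stands in for the Tweedie log-likelihood
     * evaluation at a given dispersion.
     */
    static double step(DoubleUnaryOperator logLik, double dispersionCurr,
                       double update, double learningRate) {
        double llCurr = logLik.applyAsDouble(dispersionCurr);
        double newton = dispersionCurr + update;            // full Newton step
        if (newton > 0 && logLik.applyAsDouble(newton) > llCurr)
            return newton;                                  // Newton improved the LL
        double delta = learningRate * update;               // damped update
        if (Math.abs(delta) < 1e-3) {                       // magnitude too small (my reading of "< 1e-3"):
            // line search: keep doubling the step while the LL improves
            double best = dispersionCurr, bestLL = llCurr;
            for (int i = 0; i < 10; i++) {
                double cand = dispersionCurr + delta;
                if (cand <= 0) break;                       // dispersion must stay positive
                double ll = logLik.applyAsDouble(cand);
                if (ll <= bestLL) break;                    // stopped improving
                best = cand; bestLL = ll;
                delta *= 2;
            }
            return best;
        }
        return dispersionCurr + delta;
    }
}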
#15958
I looked at Torczon's multi-directional method, but I don't think it's suitable for this task, since this is just a one-dimensional optimization problem. Furthermore, the expansion (2) and contraction (0.5) constants proposed in the thesis (https://repository.rice.edu/server/api/core/bitstreams/6bfc12f5-a69e-44bf-9e67-1d2e738bec5a/content) would be suboptimal for a one-dimensional problem (the golden section search derivation explains why: https://homepages.math.uic.edu/~jan/mcs471f05/Lec9/gss.pdf). So I used golden section search.
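For reference, here is a minimal sketch of a golden section maximizer for a one-dimensional function, assuming a log-likelihood oracle passed in as a DoubleUnaryOperator; the actual h2o-3 implementation will differ in details such as how the initial bracket is chosen.

import java.util.function.DoubleUnaryOperator;

public final class GoldenSectionSearch {
    private static final double INV_PHI = (Math.sqrt(5) - 1) / 2; // ~0.618

    /**
     * Maximizes f on [lo, hi] by shrinking the bracket by a constant
     * factor of INV_PHI per iteration, reusing one interior evaluation
     * each time; returns the midpoint of the final bracket.
     */
    public static double maximize(DoubleUnaryOperator f, double lo, double hi, double eps) {
        double a = lo, b = hi;
        double x1 = b - INV_PHI * (b - a);
        double x2 = a + INV_PHI * (b - a);
        double f1 = f.applyAsDouble(x1);
        double f2 = f.applyAsDouble(x2);
        while (b - a > eps) {
            if (f1 < f2) {              // maximum lies in [x1, b]
                a = x1;
                x1 = x2; f1 = f2;       // old x2 becomes the new x1 (reused)
                x2 = a + INV_PHI * (b - a);
                f2 = f.applyAsDouble(x2);
            } else {                    // maximum lies in [a, x2]
                b = x2;
                x2 = x1; f2 = f1;       // old x1 becomes the new x2 (reused)
                x1 = b - INV_PHI * (b - a);
                f1 = f.applyAsDouble(x1);
            }
        }
        return (a + b) / 2;
    }
}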
Description
I found out that the problem is not only with the gradient and Hessian but also with the Tweedie likelihood calculated using the series method. So the fix is to calculate, once in a while (every 10th iteration), the log likelihood using the Tweedie estimator (which I implemented for variance power and dispersion estimation, since it combines the series and Fourier inversion methods to achieve a more precise log likelihood estimate) and compare it with the best value so far. I call this a sanity check: basically, check whether we are improving, and if we got worse, switch to golden section search. The last sanity check happens when the algorithm thinks it converged; sometimes the value explodes too fast, so it wouldn't otherwise reach the periodic sanity check that would detect we actually got worse.
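Putting the pieces together, the scheme could look roughly like the sketch below. This is again under my own naming, not the actual h2o-3 API: newtonStep and preciseLogLik are placeholders for the series-based update and the series + Fourier inversion likelihood, and it reuses the GoldenSectionSearch sketch above.

import java.util.function.DoubleUnaryOperator;

public final class SanityCheckedDispersion {
    /**
     * Iterate Newton updates driven by the series log likelihood, but every
     * 10th iteration (and once more on convergence) recompute the log
     * likelihood with the more precise estimator and compare it with the
     * best value seen so far; if we got worse, fall back to golden section
     * search over the precise likelihood.
     */
    static double estimate(DoubleUnaryOperator newtonStep,    // d -> next d (series-based)
                           DoubleUnaryOperator preciseLogLik, // d -> precise log likelihood
                           double d, double lo, double hi,
                           int maxIter, double eps) {
        double bestLL = preciseLogLik.applyAsDouble(d);
        for (int iter = 1; iter <= maxIter; iter++) {
            double next = newtonStep.applyAsDouble(d);
            boolean converged = Math.abs(next - d) < eps;
            d = next;
            if (iter % 10 == 0 || converged) {    // periodic + final sanity check
                double ll = preciseLogLik.applyAsDouble(d);
                if (ll < bestLL)                  // we actually got worse:
                    return GoldenSectionSearch.maximize(preciseLogLik, lo, hi, eps);
                bestLL = ll;
            }
            if (converged) break;
        }
        return d;
    }
}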
Speed concerns
Basically the same as for the Tweedie variance power and dispersion estimation: for some values (e.g., Tweedie variance power close to 1, p < 1.2), it takes longer to estimate the likelihood.
Golden section search has linear convergence, so it's asymptotically slower than Newton's method, but it appears more robust to noise, and since it doesn't require calculating the gradient and Hessian, it doesn't seem that much slower for practical purposes.
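To make the linear-vs-quadratic comparison concrete (standard textbook facts, nothing specific to this PR): golden section search shrinks the bracket by a constant factor of (√5 − 1)/2 ≈ 0.618 per iteration, so going from an initial bracket of width w down to a tolerance eps takes about log(w/eps) / log(1/0.618) ≈ 4.8 · log10(w/eps) likelihood evaluations (each iteration reuses one interior point, so it costs only one new evaluation), while Newton's method roughly doubles the number of correct digits per iteration once it's near the optimum. For the bracket widths and tolerances involved here, that's a modest constant-factor difference, with no gradient or Hessian needed.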
Results
I managed to get results close to R's MLE dispersion estimation (often a little better than R; see the test). Note that the value we get from summary(glm)$dispersion is not the MLE, but it seems close.

Estimation from summary:

MLE dispersion in R:
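As an aside (this is my reading of R, not part of this PR): the dispersion reported by summary.glm is, as far as I know, the Pearson estimator, φ̂ = (1/(n − p)) Σᵢ (yᵢ − μ̂ᵢ)² / V(μ̂ᵢ), which is why it differs from the MLE while usually staying close to it.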
A problem with R's MLE dispersion estimation is that it sometimes takes a very long time (and I don't know whether it finishes). So I used the estimate from summary for the following plots.
These plots show how it used to behave for Tweedie variance powers in [1.2, 1.8]:
and how it behaves after the change:
And these two show the same thing but with log scale:
before
after
The previous plots were generated using the dispersion estimation from summary(glm), so I recalculated the same thing with the true MLE. It matches up until Tweedie variance power = 1.7, where R gets stuck (shown as the MLE threshold in the plot). For the rest of the values ([1.7, 1.85]) I used the summary type of calculation.