Calculation of mean can be inaccurate for rate measures with varying denominators #230

Open
francisbarton opened this issue Oct 15, 2024 · 0 comments

Contributor

francisbarton commented Oct 15, 2024

Currently the mean line in a chart is calculated as the mean of the value field (subject to the fix_after_n_points option).

This is fine for count values, or for rates (percentages, fractions, etc.) that have a constant denominator.
However, if you pass in a rate as the value field, and that rate is based on a varying denominator, the mean generated by ptd_spc_standard() will just be the unweighted mean of the rates, which is likely not the overall mean rate.

This issue is probably out of scope - what else is the package supposed to do?! And it is kind of the user's responsibility to pass in the right kind of data and to validate the outputs. But still.

One potential action is to add a caveat to the documentation, reminding users that the mean calculated for rate data is just the mean of the rates and may therefore differ from the overall rate.

Another solution is for ptd_spc to gain an argument for a user-supplied, pre-calculated mean column in the data. This would be a breaking change to the user interface.

Here's a reprex for a situation where an event happens 18 times (numerator) out of a possible 90 (denominator). The mean rate for the year is 0.2 (or 20%). But the mean of the rates is 0.216 (which is used for the mean line in the chart below).

tibble::tibble(
  date = seq.Date(as.Date("2023-01-01"), as.Date("2023-12-01"), "1 month"),
  num = rep(seq(2), 6),
  dnm = rep(c(4, 11), 6)
) |>
  dplyr::summarise(rate = num / dnm, .by = "date") |>
  NHSRplotthedots::ptd_spc("rate", "date")
#> Warning in ptd_add_short_group_warnings(.): Some groups have 'n < 12'
#> observations. These have trial limits, which will be revised with each
#> additional observation until 'n = fix_after_n_points' has been reached.

Potentially I could pass in my pre-calculated mean of 0.2 here, and PTD would then use that instead.

Like this:

tibble::tibble(
  date = seq.Date(as.Date("2023-01-01"), as.Date("2023-12-01"), "1 month"),
  num = rep(seq(2), 6),
  dnm = rep(c(4, 11), 6)
) |>
  dplyr::mutate(overall_mean = sum(num) / sum(dnm)) |>
  dplyr::summarise(rate = num / dnm, .by = c("date", "overall_mean")) |>
  # using a data column allows for facets and rebasing,
  # but the user would have to account for these manually
  # (mean_field is the argument proposed above, not an existing one)
  NHSRplotthedots::ptd_spc("rate", "date", mean_field = "overall_mean")
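
As a rough sketch of what "manually allowing for" facets might look like, the pooled mean could be calculated within each facet before the data is handed to the chart function. The `site` column and the numbers for site B are made up for illustration, and the sketch stops short of calling `ptd_spc()` because `mean_field` is only a proposal.

```r
# Hypothetical data: two sites (facets), 12 monthly observations each
df <- tibble::tibble(
  site = rep(c("A", "B"), each = 12),
  date = rep(
    seq.Date(as.Date("2023-01-01"), as.Date("2023-12-01"), "1 month"),
    2
  ),
  num = c(rep(seq(2), 6), rep(c(3, 1), 6)),
  dnm = c(rep(c(4, 11), 6), rep(c(10, 5), 6))
)

df_with_means <- df |>
  dplyr::mutate(
    rate = num / dnm,
    # pooled rate within each facet, repeated on every row of that facet
    overall_mean = sum(num) / sum(dnm),
    .by = "site"
  )

# One pooled mean per site
dplyr::distinct(df_with_means, site, overall_mean)
```

Rebasing could presumably be handled the same way, by adding a column that identifies the rebase period to `.by`, so that each period gets its own pooled mean.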