geom_binomdensity #147
Ah interesting. The problem is that the dynamic binwidth must be determined when the grob created by the geom object is drawn, because it is only at that point that you have access to viewport information. If you're not familiar with ggplot internals, basically geoms create a bunch of grobs ("graphics objects"), which are objects that {grid} uses to draw stuff. When the geoms/grobs are created they don't know anything about viewport dimensions; this happens at a later step when the plot is actually rendered (this is why you can resize plots and whatnot even after creating them). It is only at this drawing step that you can do things that require information about viewport dimensions. To deal with this in ggdist, a single geom_dots layer creates a single grob representing all the dotplots it is going to draw so that when it is drawn, it can use the plot dimensions to dynamically select a binwidth that works across all dotplots it draws. The snag you are hitting is that because you have to create two geom_dots objects (since they don't have the same value of Some possible solutions:
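A minimal sketch of the "decide at draw time" idea, using only {grid}. The class name `lazy_label` and the helper `lazy_label_grob()` are hypothetical names for illustration, not anything from ggdist. The grob is created empty, and its contents are built in a `makeContent()` method, which is the point where viewport dimensions can be queried.

```r
library(grid)

# Hypothetical grob: created with no children and no knowledge of plot size.
lazy_label_grob <- function() {
  gTree(cl = "lazy_label")
}

# grid calls makeContent() when the grob is drawn, so the viewport
# dimensions are available here (and only here).
makeContent.lazy_label <- function(x) {
  w <- convertWidth(unit(1, "npc"), "inches", valueOnly = TRUE)
  h <- convertHeight(unit(1, "npc"), "inches", valueOnly = TRUE)
  label <- sprintf("viewport: %.1f x %.1f in", w, h)
  setChildren(x, gList(textGrob(label)))
}

grid.newpage()
grid.draw(lazy_label_grob())
```

Resizing the device and redrawing changes the label, which is the same mechanism that lets a dotplot grob pick its binwidth only once the final plot size is known.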
The less hairy option for you would be the best afaic :) From my understanding, option 2) sounds like a way of addressing the two issues I mentioned at the same time, i.e., so that It would indeed require us to understand a bit more how to use these grobs, but that's something that I foresee we'll have to do anyway (#135).
I suppose it's not as simple as putting the code in a separate (thanks a lot for your help) |
Happy to help! This gives me a good reason to factor out binning stuff and clean it up a bit, which I've been meaning to do anyway.
Heh :). Sadly no, especially if I want to make some kind of commitment to maintaining a public API. Currently the bin detection code is tightly coupled with other dotplot stuff in the makeContent.dots_grob function, which is more or less what you'll have to make a custom variant of. While that function looks like a beast, it's really just three main steps: (1) determine bin size, (2) calculate data positions using that bin size, (3) create the grobs. Most likely I will end up with a function for (1) and a function for (2); (3) is pretty straightforward on its own. Hopefully you will then be able to use those two functions to make a custom grob easily. |
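Steps (1) and (2) can be sketched in base R. This is only an illustration under simplifying assumptions (dots are circles whose diameter equals the binwidth, stacked with no gaps, bins aligned to the minimum of `x`); `find_binwidth_sketch()` and `bin_positions_sketch()` are hypothetical names and do not reproduce ggdist's actual algorithm.

```r
# Step 1 (sketch): pick the largest candidate binwidth whose tallest
# stack of dots still fits within max_height (same units as binwidth).
find_binwidth_sketch <- function(x, max_height) {
  x <- x[!is.na(x)]
  stopifnot(length(x) > 1, diff(range(x)) > 0)
  candidates <- diff(range(x)) / seq(5, 200)  # decreasing widths
  for (bw in candidates) {
    bins <- floor((x - min(x)) / bw)
    tallest <- max(table(bins))
    if (tallest * bw <= max_height) return(bw)
  }
  min(candidates)  # nothing fit: fall back to the finest candidate
}

# Step 2 (sketch): place each dot at its bin centre, stacked upwards.
bin_positions_sketch <- function(x, binwidth) {
  x <- sort(x[!is.na(x)])
  bin <- floor((x - min(x)) / binwidth)
  data.frame(
    x = min(x) + (bin + 0.5) * binwidth,
    y = (ave(bin, bin, FUN = seq_along) - 0.5) * binwidth
  )
}

x <- c(1, 1, 1, 2, 3, 3.1, 5)
bw <- find_binwidth_sketch(x, max_height = 1)
bin_positions_sketch(x, bw)
```

Step (3) would then turn each row of the resulting data frame into a point grob; in the real geom, `max_height` is only known at draw time.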
Could this be done more simply by adding an option to draw a normal dotplot density but flip its orientation? |
Not sure what you mean. Unless I'm misunderstanding that's essentially what the existing version does. Can you elaborate? |
So right now I can do:

library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)
library(ggdist)
RankCorr_u_tau %>%
filter(i %in% 1:2) %>%
ggplot() +
aes(x = u_tau, y = factor(i), fill = stat(x > 6)) +
stat_dotsinterval(quantiles = 100)

Created on 2021-06-14 by the reprex package (v2.0.0)

I think |
Otherwise @DominiqueMakowski I think two dotplots like shown above with a sigmoid line would be the best obvious solution. |
@mjskay So something like a |
Yeah, that was my suggestion number 1:
Got it. Sorry I didn't follow before. I don't think a second parameter makes sense. How hairy do you think allowing side to vary by a factor would be? |
no worries, I probably wasn't explaining the internals very clearly :).
It would be... interesting. Currently side is used in a bunch of places since it determines several things, including how slabs are scaled and what the interpretation of the On the one hand it could be cool to be able to do this, and in theory I think it's possible, but I don't have a clear idea atm of the best way to re-arrange the internals to make it happen. |
Aside: I have basically factored out most of the dynamic binning stuff into The result is if you do want to go route 2 (a custom grob and geom) you should be able to base it on these two functions much more easily. |
@bwiernik I think you've managed to nerd snipe me because I keep thinking about how to make side/scale vary within a geom. So I'm probably gonna give that a try soon. |
Okay, if you install the aes-side branch of ggdist from GitHub ( something like this should work:

p = plot(modelbased::estimate_relation(glm(sex ~ body_mass_g, data = palmerpenguins::penguins, family = "binomial")))
prop = as.vector(prop.table(xtabs(~ sex, palmerpenguins::penguins)))
p +
ggdist::geom_dots(
aes(
x = body_mass_g,
y = sex,
side = ifelse(sex == "male", "bottom", "top"),
scale = prop[sex] * 0.9,
justification = ifelse(sex == "male", 1, 0)
),
data = palmerpenguins::penguins,
na.rm = TRUE,
color = "black",
fill = "gray50",
alpha = 0.5
) +
theme_light() |
Wow @mjskay that's blazing fast and awesome 🤩 We'll try it out asap! |
I think I installed the branch correctly, but I get this error:

library(ggplot2)
library(ggdist)
prop <- as.vector(prop.table(xtabs(~ sex, palmerpenguins::penguins)))
# Default look
ggplot() +
ggdist::geom_dots(
aes(
x = body_mass_g,
y = sex,
side = ifelse(sex == "male", "bottom", "top"),
scale = prop[sex] * 0.9,
justification = ifelse(sex == "male", 1, 0)
),
data = palmerpenguins::penguins,
na.rm = TRUE,
color = "black",
fill = "gray50",
alpha = 0.5
)
#> Error in if (!all(d[[a]] == d[[a]][[1]])) {: missing value where TRUE/FALSE needed

Created on 2021-06-16 by the reprex package (v2.0.0)

The |
ah sorry that should be:

prop = prop.table(xtabs(~ sex, palmerpenguins::penguins))
p +
ggdist::geom_dots(
aes(
x = body_mass_g,
y = sex,
side = ifelse(sex == "male", "bottom", "top"),
scale = as.vector(prop[sex] * 0.9),
justification = ifelse(sex == "male", 1, 0)
),
data = palmerpenguins::penguins,
na.rm = TRUE,
color = "black",
fill = "gray50",
alpha = 0.5
) +
theme_light() |
Same error 🤔

library(ggplot2)
library(ggdist)
data <- palmerpenguins::penguins
prop <- prop.table(xtabs(~ sex, data))
ggplot() +
ggdist::geom_dots(
aes(
x = body_mass_g,
y = sex,
side = ifelse(sex == "male", "bottom", "top"),
scale = as.vector(prop[sex] * 0.9),
justification = ifelse(sex == "male", 1, 0)
),
data = data,
na.rm = TRUE
)
#> Error in if (!all(d[[a]] == d[[a]][[1]])) {: missing value where TRUE/FALSE needed

Created on 2021-06-16 by the reprex package (v2.0.0) |
huh weird. Could you run this again to be sure: devtools::install_github("mjskay/ggdist@aes-side") and if that doesn't work send along the output of |
nevermind your code is giving me an error here too, let me check it out |
Ah, the problem was the presence of some

library(ggplot2)
library(ggdist)
data <- palmerpenguins::penguins
prop <- prop.table(xtabs(~ sex, data))
ggplot() +
ggdist::geom_dots(
aes(
x = body_mass_g,
y = sex,
side = ifelse(sex == "male", "bottom", "top"),
scale = as.vector(prop[sex] * 0.9),
justification = ifelse(sex == "male", 1, 0)
),
data = data,
na.rm = TRUE
)

Which is the correct output for that input (since a slab can't be drawn with

library(ggplot2)
library(ggdist)
data <- palmerpenguins::penguins[!is.na(palmerpenguins::penguins$sex),]
prop <- prop.table(xtabs(~ sex, data))
ggplot() +
ggdist::geom_dots(
aes(
x = body_mass_g,
y = sex,
side = ifelse(sex == "male", "bottom", "top"),
scale = as.vector(prop[sex] * 0.9),
justification = ifelse(sex == "male", 1, 0)
),
data = data,
na.rm = TRUE
) |
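The error message is consistent with a missing value reaching an `if()` condition. A small base-R illustration (the vector `side` is made-up data, standing in for an aesthetic computed from a column that contains `NA`):

```r
# A per-group aesthetic where one value is missing (made-up data)
side <- c("top", NA, "top")

# Comparing against NA yields NA, and all() of (TRUE, NA, TRUE) is NA
res <- all(side == side[[1]])
is.na(res)

# if () cannot branch on NA, which triggers
# "missing value where TRUE/FALSE needed"
outcome <- tryCatch(
  if (!res) "values differ" else "values identical",
  error = function(e) "error"
)
outcome

# Dropping incomplete rows first, as in the filtered example, avoids this
clean <- side[!is.na(side)]
all(clean == clean[[1]])
```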
It works! So I put it together with the necessary preprocessing into a little convenience function:

library(ggplot2)
library(ggdist)
geom_binomdensity <- function(data, x, y, ...) {
# Find y-axis levels
y_levels <- levels(as.factor(data[[y]]))
if(length(y_levels) != 2) {
stop("The y-variable should have exactly two levels.")
}
# Drop NaNs
vars <- c(x, y) # later can eventually add variables specified as color, fill, ...
data <- na.omit(data[vars])
# Other parameters
data$.side <- ifelse(data[[y]] == y_levels[1], "top", "bottom")
data$.justification <- ifelse(data[[y]] == y_levels[1], 0, 1)
prop <- prop.table(xtabs(paste("~", y), data))
data$.scale <- as.vector(prop[data[[y]]] * 0.9)
# ggdist geom
ggdist::geom_dots(
ggplot2::aes_string(
x = x,
y = y,
side = ".side",
justification = ".justification",
scale = ".scale"
),
data = data,
na.rm = TRUE,
...
)
}

It seems to work okay in most cases, although you would expect the bottom density to be "taller" in the last example (when there are many more females than males). Here it seems like the imbalance leads to a decrease in point size.

# Default case
data <- palmerpenguins::penguins
ggplot() + geom_binomdensity(data, x = "body_mass_g", y = "sex")

# Unbalanced
data[1:150, "sex"] <- "female"
ggplot() + geom_binomdensity(data, x = "body_mass_g", y = "sex")

# More unbalanced
data[1:250, "sex"] <- "female"
ggplot() + geom_binomdensity(data, x = "body_mass_g", y = "sex")

Created on 2021-06-17 by the reprex package (v2.0.0)

Another improvement would be to turn that convenience function into a real geom so that it can inherit from the data and aesthetics. Any quick suggestion on how to do that (if it's easy to do)? |
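One subtlety in the `prop[data[[y]]]` lookup used in the function above: indexing a named vector or table with a factor uses the factor's integer codes, not its labels. The values below are made up to show the effect. With `xtabs()` the table order normally matches the level order, so the lookup happens to work there, but converting to character is the safer habit.

```r
# Made-up proportions, deliberately ordered differently from the levels
prop <- c(male = 0.3, female = 0.7)
sex <- factor(c("female", "male"), levels = c("female", "male"))

# Factor index: uses the integer codes 1, 2, so positions not names
by_code <- unname(prop[sex])
by_code

# Character index: looks up by name, which is what we want
by_name <- unname(prop[as.character(sex)])
by_name
```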
Ah, I think I see the problem. Here's a modified version of your geom that adds two things: (1) reference lines marking the height each dotplot should stay below (to make sure scaling is working) and (2) a couple of other methods of choosing the scale (I explain below).

geom_binomdensity <- function(data, x, y, scale_method = c("density","prop","sqrt_prop"), ...) {
# Resolve the default vector to a single choice; switch() needs a length-1 value
scale_method <- match.arg(scale_method)
# Find y-axis levels
y_levels <- levels(as.factor(data[[y]]))
if(length(y_levels) != 2) {
stop("The y-variable should have exactly two levels.")
}
# Drop NaNs
vars <- c(x, y) # later can eventually add variables specified as color, fill, ...
data <- na.omit(data[vars])
# Other parameters
data$.side <- ifelse(data[[y]] == y_levels[1], "top", "bottom")
data$.justification <- ifelse(data[[y]] == y_levels[1], 0, 1)
prop <- switch(scale_method,
density = {
density_height <- sapply(split(data, data[[y]]), function(df) {
max(density(df[[x]], na.rm = TRUE)$y) * nrow(df)
})
density_height / sum(density_height)
},
prop = {
prop.table(xtabs(paste("~", y), data))
},
sqrt_prop = {
sqrt_prop <- sqrt(prop.table(xtabs(paste("~", y), data)))
sqrt_prop / sum(sqrt_prop)
}
)
# NOTE: I added as.character() here because I realized without it it may
# not select the correct value from prop if data[[y]] is a factor (since it
# would select based on the numeric version of the factor, not its level name)
data$.scale <- as.vector(prop[as.character(data[[y]])] * 0.9)
# ggdist geom
list(
ggdist::geom_dots(
ggplot2::aes_string(
x = x,
y = y,
side = ".side",
justification = ".justification",
scale = ".scale"
),
data = data,
na.rm = TRUE,
...
),
geom_hline(yintercept = c(prop[[1]], - prop[[2]]) * .9 + c(1,2))
)
}

The problem is that we shouldn't expect the heights of the dotplots to be proportional to group size, we should expect the areas to be. So for your example the dotplots are being correctly scaled, it's just that the proportions based on raw group size aren't quite right:

data <- palmerpenguins::penguins
data[1:300, "sex"] <- "female"
ggplot() + geom_binomdensity(data, x = "body_mass_g", y = "sex", scale_method = "prop")

My first thought was to instead scale by the relative heights of their densities, which improved things but not perfectly:

data <- palmerpenguins::penguins
data[1:300, "sex"] <- "female"
ggplot() + geom_binomdensity(data, x = "body_mass_g", y = "sex", scale_method = "density")

I also tried using a histogram as a density estimator with similar results. So then I tried just using a dumb area-based heuristic of the square root of the proportion instead of the raw proportion:

data <- palmerpenguins::penguins
data[1:300, "sex"] <- "female"
ggplot() + geom_binomdensity(data, x = "body_mass_g", y = "sex", scale_method = "sqrt_prop")

That worked surprisingly well! Also seems to do okay if I change the proportion:

data <- palmerpenguins::penguins
data[1:200, "sex"] <- "female"
ggplot() + geom_binomdensity(data, x = "body_mass_g", y = "sex", scale_method = "sqrt_prop")

data <- palmerpenguins::penguins
data[1:100, "sex"] <- "female"
ggplot() + geom_binomdensity(data, x = "body_mass_g", y = "sex", scale_method = "sqrt_prop")

No idea how well this will work on other datasets. Ultimately the problem is that the real quantity we care about is the height of the tallest bin, which we don't know until draw time :). So I'd say easiest thing is to use a heuristic that seems to do well most of the time (maybe sqrt of proportion) and let the user set the scaling proportion manually if needed.

Re: the geom thing, you will probably want to create a geom that calls down to setup_params, setup_data, and draw_panel methods from ggdist::GeomDots. You can see an example of this kind of geom in ggplot2::geom_pointrange, which delegates most of its work to GeomLinerange (and GeomPoint): https://github.com/tidyverse/ggplot2/blob/master/R/geom-pointrange.r

You may also want to check out https://cran.r-project.org/web/packages/ggplot2/vignettes/extending-ggplot2.html |
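To see how the two proportion-based heuristics differ numerically, here is a small base-R sketch. `scale_weights()` is a hypothetical helper name, and the group sizes are made up. The rough rationale, offered as an assumption rather than a derivation: if both dotplots use the same dot size, the area of each is proportional to its group size, and for similarly shaped distributions height grows roughly with the square root of area.

```r
# Hypothetical helper: share of the available height given to each group
scale_weights <- function(groups, method = c("sqrt_prop", "prop")) {
  method <- match.arg(method)
  p <- prop.table(table(groups))
  w <- switch(method,
    prop = p,
    sqrt_prop = sqrt(p) / sum(sqrt(p))
  )
  setNames(as.vector(w), names(p))
}

# Made-up, heavily unbalanced groups
groups <- c(rep("female", 300), rep("male", 33))
w_prop <- scale_weights(groups, "prop")
w_sqrt <- scale_weights(groups, "sqrt_prop")
rbind(prop = w_prop, sqrt_prop = w_sqrt)
```

Both sets of weights sum to 1, but the square-root version gives the smaller group a larger share of the height than the raw proportion does, which is what keeps its dots from being forced smaller.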
Square root of the proportion makes sense from a geometric perspective. I suggest we go with that. |
Sorry for the delay on that, but I finally found some time to double-check and it works like a charm. Thanks a ton again @mjskay! For now I've added ggdist to Suggests, but we'll probably end up importing it as it'll be useful in some of our other functions.

library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.0.5
library(see)
data <- iris[1:100, ]
ggplot() +
geom_binomdensity(data, x = "Sepal.Length", y = "Species")

# Different scales
data[1:70, "Species"] <- "setosa" # Create unbalanced proportions
ggplot() +
geom_binomdensity(data, x = "Sepal.Length", y = "Species", scale = "auto")

ggplot() +
geom_binomdensity(data, x = "Sepal.Length", y = "Species", scale = "density")

ggplot() +
geom_binomdensity(data, x = "Sepal.Length", y = "Species", scale = "proportion")

ggplot() +
geom_binomdensity(data, x = "Sepal.Length", y = "Species",
scale = list("setosa" = 0.4, "versicolor" = 0.6))

Created on 2021-07-11 by the reprex package (v2.0.0)

The next step is to consolidate that (and the other "geoms" from see) and make it a real geom that takes an actual aesthetics call instead of strings... but this is a bit outside of my range; @bwiernik and @IndrajeetPatil I think this is more your jazz :) |
Lovely! Glad to be of assistance :). |
@IndrajeetPatil I've never built a geom before. Can you take care of this? |
Okay, I can take the first stab at it. |
FYI the version of ggdist supporting this (3.0) is now on CRAN. It includes a minor improvement compared to the examples above in that you should only need to set the |
An idea for #135 related to easystats/modelbased#120
Aim: a geom for dot-densities for binomial y variables. Mostly a helper to get nice geoms without the need of much manual parametrization.
A quick draft, made simply by assembling two ggdist geoms, works well when the two levels are equally balanced:
Created on 2021-06-14 by the reprex package (v2.0.0)
However, when the number of observations is unbalanced, it looks a bit worse, since the fixed scale makes the size of the points vary (which is expected, but maybe not the optimal solution?):
Created on 2021-06-14 by the reprex package (v2.0.0)
I tried adjusting the upper and lower scale based on the relative number of rows, but it doesn't make it much better (?):
Created on 2021-06-14 by the reprex package (v2.0.0)
(note that the points above are still of different sizes)
ggproto magic, which is beyond me). Is it easy? Is it worth it? @mjskay @bwiernik