Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG - Fixes response leakage #78

Merged
merged 3 commits into from
Aug 15, 2024
Merged

BUG - Fixes response leakage #78

merged 3 commits into from
Aug 15, 2024

Conversation

mayer79
Copy link
Owner

@mayer79 mayer79 commented Aug 15, 2024

Fixes a major issue #77: the model response was used as covariate, which does not make sense.

@mayer79 mayer79 self-assigned this Aug 15, 2024
@mayer79 mayer79 added the bug label Aug 15, 2024
@mayer79 mayer79 merged commit 9f4fc61 into main Aug 15, 2024
7 checks passed
@mayer79 mayer79 deleted the fix-reponse-used-as-covariate branch August 15, 2024 09:57
@teecrow
Copy link

teecrow commented Sep 11, 2024

@mayer79 This may be worth revising. Counterintuitively, the response/outcome variable should sometimes be used for imputation. Here's a paper that explains this in much more detail. The TLDR is in the abstract:

We mathematically demonstrate that including the outcome variable in imputation models is not just a recommendation but a requirement to achieve unbiased results when using stochastic imputation methods. Moreover, we dispel common misconceptions about deterministic imputation models and demonstrate why the outcome should not be included in these models.

So it depends on the person's use case - so I would argue there should be an option to include or not include the response variable in the imputation. My vote would be the default is to not include it, given my assumption that most people using your package will be from the data science world where deterministic imputation is more common, rather than e.g., the social science world, where multiple imputation is more common.

@mayer79
Copy link
Owner Author

mayer79 commented Sep 11, 2024

Thanks for digging into this.

  1. When your task is to build a regression model with a response, and your covariates have missing values (at random), then it can make sense to use the regression model response in the imputation models (for the features).
  2. The bug in missRanger(), however, used each variable y both as response and covariate in the same model. This does not make sense. It is not so bad as it seems though, because any missing value in y had been imputed by the other variables in previous iterations. Still, I expect imputations to become better by fixing the bug.

@teecrow
Copy link

teecrow commented Sep 11, 2024

Ah I see - I misunderstood the nature of the bug. My apologies and thanks for a great package!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants