Enhancements to `get_duplicates_dataset()` #2019

thomas-neitmann · 2021-11-25T15:38:05Z

thomas-neitmann
Nov 25, 2021

User feedback:

"This function is absolutely cool. It would be great to create a message which dataset is being printed by this function. For example, there may be a situation where you have duplicates in your dataset. Then:

you run get_duplicates_dataset()

you identify a duplicate

you fix a duplicate

you run signal_duplicate_records()

you no longer see a duplicate

you run get_duplicates_dataset()

The initial dataset is still being printed.
It may bring some confusion, so it would be cool to either reset it back to NULL (if the former function hasn't spotted any issue) once the call of a function shows no issues or to simply notify a user which dataset is being printed in the message log. A minor thing. Overall, this is an EXTREMELY useful function and having an ability to quickly spot duplicates by a desired key is BEYOND awesome; "

Based upon that I think we should implement the following:

When you call get_duplicates_dataset(), information about which call generated this dataset should be printed.
The duplicates dataset should be "reset" the next time signal_duplicate_records() is called.
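The first point could be sketched roughly as follows. This is only an illustration in base R, not admiral's actual implementation: `.dup_env`, `check_duplicates()` and `get_duplicates_with_origin()` are hypothetical names standing in for the package-internal storage, the duplicate check and the getter.

```r
# Hypothetical package-internal storage (an assumption, not admiral's code)
.dup_env <- new.env(parent = emptyenv())

# Hypothetical stand-in for the internal duplicate check
check_duplicates <- function(dataset, by) {
  dups <- dataset[duplicated(dataset[by]), , drop = FALSE]
  if (nrow(dups) > 0L) {
    .dup_env$dataset <- dups
    # Remember which call produced the stored duplicates
    .dup_env$source_call <- paste(deparse(sys.call()), collapse = " ")
    warning("Duplicate records found.", call. = FALSE)
  }
  invisible(dataset)
}

# Hypothetical getter that tells the user where the dataset came from
get_duplicates_with_origin <- function() {
  if (!is.null(.dup_env$dataset)) {
    message("Duplicates were recorded by: ", .dup_env$source_call)
  }
  .dup_env$dataset
}
```

With this sketch, the message printed by the getter names the call that stored the duplicates, so a stale dataset is at least recognizable as such.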

bundfussr · 2021-11-30T12:24:48Z

bundfussr
Nov 30, 2021
Maintainer

Always resetting the duplicates dataset when signal_duplicate_records() is called could break the functionality. Consider the case where a derivation calls signal_duplicate_records() twice. If the first call issues a warning and the second call passes without any issues, get_duplicates_dataset() would not return anything.

What about using a list and a separate reset function? Each time signal_duplicate_records() detects an issue, it appends an element to the list. The element could contain the call or a description of the context, a timestamp, and the duplicates dataset. By default, get_duplicates_dataset() would return the last element. Additionally, get_duplicates_dataset() could provide parameters to get other elements of the list.
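A minimal sketch of the list-plus-reset idea, assuming a package-internal environment as storage. All names here (`.dup_store`, `record_duplicates()`, `get_duplicates_entry()`, `reset_duplicates()`) are hypothetical and only illustrate the proposal.

```r
# Hypothetical storage for the history of detected duplicates
.dup_store <- new.env(parent = emptyenv())
.dup_store$history <- list()

# Append one entry each time a duplicate check detects an issue
record_duplicates <- function(dataset, context) {
  entry <- list(context = context, timestamp = Sys.time(), dataset = dataset)
  .dup_store$history[[length(.dup_store$history) + 1L]] <- entry
  invisible(entry)
}

# Return the most recent entry by default, or an earlier one by index
get_duplicates_entry <- function(index = NULL) {
  n <- length(.dup_store$history)
  if (n == 0L) {
    return(NULL)
  }
  if (is.null(index)) {
    index <- n
  }
  stopifnot(index >= 1L, index <= n)
  .dup_store$history[[index]]
}

# Manual reset; nothing is cleared automatically
reset_duplicates <- function() {
  .dup_store$history <- list()
  invisible(NULL)
}
```

Because a passing check never touches the list, a warning from the first of two checks inside one derivation would still be retrievable afterwards.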

0 replies

thomas-neitmann · 2022-11-01T12:45:13Z

thomas-neitmann
Nov 1, 2022
Author

@zdz2101 @sadchla-codes Another interesting issue you might want to tackle.

0 replies

zdz2101 · 2022-11-01T20:06:41Z

zdz2101
Nov 1, 2022
Maintainer

It seems get_duplicates_dataset() is directly tied to signal_duplicate_records(), and the end goal is to prevent the confusion that arises from running get_duplicates_dataset() randomly or pre-emptively before signal_duplicate_records(). My proposal would be a warning message if they run get_duplicates_dataset() without having called signal_duplicate_records() as the command immediately before. I have a draft solution for this that I think already passes the CI/CD checks, and it would end up looking something like:

By restricting the use of get_duplicates_dataset() to directly after signal_duplicate_records(), we don't have to keep track of the various changes the user might make along the way while fixing the duplicates, like renaming objects or creating any number of intermediate datasets. get_duplicates_dataset() becomes a tool that returns the duplicates data frame, if it exists, for a specific dataset we ask the user to supply. What are your thoughts, @thomas-neitmann?

0 replies

thomas-neitmann · 2022-11-02T11:18:51Z

thomas-neitmann
Nov 2, 2022
Author

I'm not sure this approach would work in practice. signal_duplicate_records() is usually not called at the top level as in your example but inside a function (which again might be called from within another function and so forth). So how would you restrict get_duplicates_dataset() to work only directly after signal_duplicate_records() is called?

0 replies

bundfussr · 2022-11-02T11:58:38Z

bundfussr
Nov 2, 2022
Maintainer

I think we cannot automate resetting the duplicates dataset, or issue a warning or error if it is out of date, because we do not know when the user has fixed the program and rerun it.
So we could provide a function for resetting the duplicates dataset, but the user would need to call it manually.

0 replies

zdz2101 · 2022-11-02T13:16:51Z

zdz2101
Nov 2, 2022
Maintainer

Within get_duplicates_dataset(), I was playing around and added an if (interactive()) statement that grabs the .Rhistory whenever the function is called and searches the command prior to the get_duplicates_dataset() call for the string "signal_duplicate_records". See lines 38-49. I also added lines 135-137 to reset the duplicates to NULL if no duplicates were found (for example, if they fed in a cleaned/fixed dataset). But if signal_duplicate_records() is wrapped inside other functions, this would indeed lose most of its functionality. Perhaps instead of a warning/error, it can still be useful as a message reminding the user to first run signal_duplicate_records(), or whatever wrapped version of it they're using, and it could string-search for functions like derive_vars_merged() as well.

0 replies

thomas-neitmann · 2022-11-02T15:06:21Z

thomas-neitmann
Nov 2, 2022
Author

Users should never run signal_duplicate_records() themselves. It is always called from inside a derive_*() function when something is off, to make the user aware of it. Then we tell them to use get_duplicates_dataset() to see the records which caused the issue.

0 replies

zdz2101 · 2022-12-02T18:45:07Z

zdz2101
Dec 2, 2022
Maintainer

After going back to the drafting board for a bit, how does this sound (if the user is running in an interactive environment like RStudio):

1. Grabbing/saving user's rhistory during their current session
2. Writing these commands to a temporary .R file
3. Rerunning last command, using source() on the file created from 2
4. Saving the possible error message using sink() to a temporary .txt file
5. Checking if the message contains: "Run get_duplicates_dataset() to access the duplicate records"
6. Message the user that the duplicate records may have been stale, based on what we know of in 5
7. Return the dataset, regardless

Draft available

0 replies

bundfussr · 2022-12-05T16:00:11Z

bundfussr
Dec 5, 2022
Maintainer

Rerunning user commands looks dangerous to me. It could change the global environment and cause confusion for the user. E.g., if the last command was adlb <- derive_extreme_records(adlb, ...), the new records are duplicated.
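To illustrate the concern with a toy example: `append_max_record()` below is a hypothetical stand-in for a derivation that appends records, not an admiral function. Replaying the same assignment a second time appends the record again.

```r
adlb <- data.frame(USUBJID = c("01", "02"), AVAL = c(5, 7))

# Hypothetical stand-in for a derivation that appends a summary record
append_max_record <- function(dataset) {
  rbind(dataset, data.frame(USUBJID = "MAX", AVAL = max(dataset$AVAL)))
}

adlb <- append_max_record(adlb) # the user's original command
nrow(adlb)
#> [1] 3

adlb <- append_max_record(adlb) # the same command replayed from history
nrow(adlb)
#> [1] 4
```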

0 replies

bms63 · 2023-01-30T15:21:04Z

bms63
Jan 30, 2023
Maintainer

Hey @galachad can we get your ideas on this one as well?

0 replies

2023-07-07T02:44:42Z

github-actions[bot]
bot Jul 7, 2023

This issue is stale because it has been open for 90 days with no activity.

0 replies

bms63 · 2023-07-19T19:53:31Z

bms63
Jul 19, 2023
Maintainer

Hi all,

@pharmaverse/admiral This issue is stale.

Reviewing the discussion between @zdz2101, @bundfussr and @thomas-neitmann I don't see us implementing this proposal as it seems to invite a lot of potential user-error and confusion. What do you think?

We can move it to a discussion and continue with folks experimenting and discussing and move it up the ladder for discussion at a Core Meeting?

0 replies

bundfussr · 2023-07-20T07:48:46Z

bundfussr
Jul 20, 2023
Maintainer

> Reviewing the discussion between @zdz2101, @bundfussr and @thomas-neitmann I don't see us implementing this proposal as it seems to invite a lot of potential user-error and confusion. What do you think?

I agree. I would close the issue.

1 reply

zdz2101 Jul 20, 2023
Maintainer

Agreed, I think we can revisit this another day if it becomes a popular request.

bms63 · 2023-08-09T14:36:18Z

bms63
Aug 9, 2023
Maintainer

@ddsjoberg @millerg23 @zdz2101 Hi all - Don't forget to put in your ideas from today's meeting around this issue. I was having trouble focusing at the end (lot of ideas being discussed today) and so wasn't able to capture it in my notes.

3 replies

millerg23 Aug 9, 2023
Collaborator

A message with the function call to pull out duplicates would be fine by me. I usually just put the output from get_duplicates() into a data frame, then review it to see what the issue is. So running a different function would be no problem; I assume it would just be the underlying function being called in get_duplicates()?

zdz2101 Aug 9, 2023
Maintainer

The proposal was to give users code to run to create the duplicates object themselves, instead of storing and extracting it for them using a get_ function. The only caveat there is that signal_duplicate_records() is called inside the function body in a handful of places like derive_vars_merged() and derive_vars_obs_number(), which are then called inside a function like derive_vars_joined(), so it is inherently a little hard to parse out to begin with.

bundfussr Aug 16, 2023
Maintainer

I think this works only if the dataset which is checked for duplicates is available to the user. In many cases this is not the case.

By the way, why do you want to provide code for creating the dataset of duplicates instead of providing the dataset directly?

ddsjoberg · 2023-08-09T15:30:16Z

ddsjoberg
Aug 9, 2023
Maintainer

One way around this would be to print the code to run to see the duplicate values.
(Rather than including this complex call to duplicated(), we could wrap it in a simple fn.)

# create a data frame with duplicates
df <- data.frame(USUBJID = c(letters, letters[1:2]))
columns_to_check <- "USUBJID"

cli::cli_warn(
  c("!" = "Duplicate values were found in columns {.val {columns_to_check}}",
    "i" = "Run {.run df[duplicated(df[{shQuote(columns_to_check)}]), {shQuote(columns_to_check)}, drop = FALSE]} to print the duplicate rows.")
)
#> Warning: ! Duplicate values were found in columns "USUBJID"
#> ℹ Run `df[duplicated(df['USUBJID']), 'USUBJID', drop = FALSE]` to print the
#>   duplicate rows.

df[duplicated(df['USUBJID']), 'USUBJID', drop = FALSE]
#>    USUBJID
#> 27       a
#> 28       b

Created on 2023-08-09 with reprex v2.0.2

CAVEAT: If we don't know the name (i.e. the symbol) of the data frame we're checking, the printed code would need a placeholder in place of the data frame name, which the user would then have to replace.
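One possible way to handle that caveat, sketched in base R: capture the argument expression with `substitute()` and fall back to a placeholder when it is not a plain symbol. `build_duplicates_hint()` and the `"<your_dataset>"` placeholder are hypothetical and only illustrate the idea.

```r
# Hypothetical helper: build the code string to show to the user
build_duplicates_hint <- function(data, cols, placeholder = "<your_dataset>") {
  expr <- substitute(data)
  # Use the symbol the caller passed; otherwise fall back to a placeholder
  data_name <- if (is.symbol(expr)) as.character(expr) else placeholder
  col_vec <- paste0(
    "c(", paste(encodeString(cols, quote = "\""), collapse = ", "), ")"
  )
  sprintf(
    "%s[duplicated(%s[%s]), %s, drop = FALSE]",
    data_name, data_name, col_vec, col_vec
  )
}

df <- data.frame(USUBJID = c(letters, letters[1:2]))
build_duplicates_hint(df, "USUBJID")
build_duplicates_hint(df[1:3, , drop = FALSE], "USUBJID")
```

Note that the captured symbol is the one used at the immediate call site. If the check runs several function levels deep, that symbol is a local argument name rather than the name of the user's object, so the placeholder branch (or a different mechanism) would still be needed there.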

2 replies

ddsjoberg Aug 9, 2023
Maintainer

We may be able to just use the .Last.value object, e.g.

# create a data frame with duplicates
df <- data.frame(USUBJID = c(letters, letters[1:2]))
columns_to_check <- "USUBJID"

cli::cli_warn(
  c("!" = "Duplicate values were found in columns {.val {columns_to_check}}",
    "i" = "Run {.run show_duplicates(.Last.value, !!!columns_to_check)} to print the duplicate rows.")
)
#> Warning: ! Duplicate values were found in columns "USUBJID"
#> ℹ Run `show_duplicates(.Last.value, USUBJID)` to print the
#>   duplicate rows.

show_duplicates(.Last.value, USUBJID)
#>    USUBJID
#> 27       a
#> 28       b

(In this example, assume that show_duplicates() exists.)
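For reference, a wrapper like `show_duplicates()` could be sketched in base R as below. The function does not exist in the package; the name and the bare-column-name interface are assumptions taken from the example above, and the data frame is passed explicitly here rather than via `.Last.value`.

```r
# Hypothetical helper: show the duplicated values of the given columns
show_duplicates <- function(data, ...) {
  # Capture the bare column names passed via `...` as strings
  cols <- vapply(as.list(substitute(list(...)))[-1], deparse, character(1))
  stopifnot(length(cols) > 0, all(cols %in% names(data)))
  data[duplicated(data[cols]), cols, drop = FALSE]
}

df <- data.frame(USUBJID = c(letters, letters[1:2]))
show_duplicates(df, USUBJID)
#>    USUBJID
#> 27       a
#> 28       b
```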

bms63 Aug 10, 2023
Maintainer

Daniel - do you mind making this into a new issue with your suggestion and linking it to the discussion? I think you have a lot of balls in the air right now! Let's see if others might be able to implement your proposal.
