
New duplicate algorithm to check for similar entries #52

Open
wants to merge 10 commits into base: main

Conversation

george-gca
Copy link

I added the option to check for duplicate entries based on the similarity of the title and abstract. Sometimes we can have a duplicate entry that is a fixed version of another entry, with a corrected typo or added comma, for example.

I decided to go with difflib.SequenceMatcher for this similarity, since it is a built-in solution. Added the options to use only the title for this check, set the similarity score, discard stopwords for a more strict check considering only the useful words, and also added a pretty diff print support thanks to the rich library:

(screenshot: dedup_similar pretty diff output)
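The approach described above can be sketched as follows. This is an illustrative reconstruction, not the PR's actual code: the function names, the tiny stopword list, and the default threshold are all hypothetical, but the core technique (`difflib.SequenceMatcher` on cleaned titles) is the one named in the comment.

```python
# Hypothetical sketch of the fuzzy-duplicate check described above:
# compare cleaned titles with difflib.SequenceMatcher and flag pairs
# whose similarity ratio exceeds a threshold.
from difflib import SequenceMatcher
import re

# Illustrative stopword list; a real implementation would use a fuller one.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to"}

def clean(title: str, drop_stopwords: bool = False) -> str:
    """Lowercase, strip punctuation, optionally drop stopwords."""
    words = re.sub(r"[^\w\s]", "", title.lower()).split()
    if drop_stopwords:
        words = [w for w in words if w not in STOPWORDS]
    return " ".join(words)

def is_similar(a: str, b: str, threshold: float = 0.95,
               drop_stopwords: bool = False) -> bool:
    """True if the cleaned titles are at least `threshold` similar."""
    ratio = SequenceMatcher(None,
                            clean(a, drop_stopwords),
                            clean(b, drop_stopwords)).ratio()
    return ratio >= threshold

# Differ only in punctuation and case, so they match:
print(is_similar("Attention Is All You Need!",
                 "Attention is all you need"))  # → True
```

Raising or lowering `threshold` trades recall for precision, and dropping stopwords makes near-identical titles converge on the same cleaned string.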

@J535D165 J535D165 added the enhancement New feature or request label Jan 9, 2025
Signed-off-by: George Araújo <[email protected]>
@PeterLombaers
Copy link
Member

Thanks for this contribution, it looks really nice! It makes a lot of sense to me to want to deduplicate using some fuzzy matching. The code looks clean! I was testing out the features and it seems to be working well. Just a few comments, which I'll put in the comments below.

@PeterLombaers
Copy link
Member

When I was testing with --verbose, I got confused when I did not see anything pretty printing. I think this happened because my duplicates were already dropped by the original dedup method. When I added rows that were dropped only by your method, I indeed saw the nice diff. If I'm using --verbose I expect to get feedback no matter which method does the deduplicating.

@PeterLombaers
Copy link
Member

But thanks again, it looks very nice! We'll also need to have a chat about how this relates to asreview2.0 @J535D165 !

@george-gca
Copy link
Author

george-gca commented Jan 9, 2025

What do you think should be done when verbose is true for the current algorithm? I mean, what would be the expected output? Because the pretty print probably will print everything dimmed in most cases.

Signed-off-by: George Araújo <[email protected]>
@Rensvandeschoot
Copy link
Member

Great contribution!! I have recently used fuzzy matching for our FORAS project, where we obtained over 10K records to screen. I checked for duplicates within the dataset and between the titles obtained via the database search and the most likely title match in OpenAlex. I saved the matching score and went through the data, checking these scores from low to high, and I found many fuzzy duplicates of the following type:

• Titles containing extra or different punctuation.
• One title has a spelling mistake corrected in the other.
• Presence of HTML code in one title (e.g., PTSD), but not in the other.
• An abstract number at the beginning of one title (e.g., “T262.”), missing from the other.
• A subtitle in one record versus a single-title format in the other.

All such cases are precise duplicates and can be corrected without losing any information.

But I also found cases with different versions of the same work:

  • re-prints of the same paper in a different journal,
  • pre-print + journal version,
  • conference abstract + journal version,
  • book chapter + journal version,
  • dissertation + journal version,
  • version 1 + version 2 of the same paper

You might want to keep both records in these cases, but the labeling decisions will be the same.

So, my question is whether it is possible to store the matching score so that a user can manually check the records with lower matching scores?

and, hopefully, my comment helps with starting the discussion on what to do with fuzzy-duplicates in ASReview v2.0 :-)

@george-gca
Copy link
Author

george-gca commented Jan 10, 2025

Currently you can choose to print the duplicates on the terminal instead of removing them right away, by omitting the -o option. It will print the line numbers of the entries and a pretty diff between them, but in order of discovery, not sorted by score. Sorting by score could be added as an option, for instance, or the score could be added as a column to the dataset like you said. Which option would be best? Also, wouldn't adding this column somehow affect using the dataset with ASReview, or does it simply ignore extra columns?
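The score-column idea discussed here can be sketched with plain pandas. This is not the PR's code; the column names (`best_match_row`, `match_score`) and the brute-force all-pairs loop are illustrative, just to show how low-scoring matches could be surfaced for manual review, as suggested above.

```python
# Sketch: record, for every row, its best fuzzy match and the
# SequenceMatcher ratio, so a user can review pairs from low to
# high score. Column names are hypothetical.
from difflib import SequenceMatcher
import pandas as pd

df = pd.DataFrame({"title": [
    "Attention is all you need",
    "Attention Is All You Need!",
    "A completely different study",
]})

scores, matches = [], []
for i, a in enumerate(df["title"]):
    best_score, best_j = 0.0, None
    for j, b in enumerate(df["title"]):
        if i == j:
            continue
        s = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if s > best_score:
            best_score, best_j = s, j
    scores.append(round(best_score, 3))
    matches.append(best_j)

df["best_match_row"] = matches
df["match_score"] = scores

# Reviewing ascending scores puts the doubtful matches first.
print(df.sort_values("match_score"))
```

Note the quadratic pairwise loop is fine for small datasets but would need blocking or indexing for 10K+ records like the FORAS case.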

This should be enough to match most of the cases you pointed out here just by playing with the threshold param, since I do some cleaning before doing the actual match. I've also added an option to remove stopwords before checking, to make similar titles converge even further. The cases that could fall short would be:

  • presence/absence of HTML or LaTeX code, if it is too long
  • added subtitle to the title

The added subtitle might be a pitfall, since some papers build on top of others, title included, while being completely different papers with similar titles. And sometimes authors like to follow a title trend, as in the case of "x is all you need".

The project you mentioned would actually be a great test for this code.
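The title-trend pitfall mentioned above can be made concrete. The second title below is made up for illustration; the point is that two genuinely different papers following the same naming pattern still score moderately high, so a loose threshold on a title-only check could wrongly flag them as duplicates.

```python
# Two distinct (hypothetical) papers following the "... all you need"
# title trend still share a long common suffix, inflating the ratio.
from difflib import SequenceMatcher

a = "attention is all you need"
b = "diffusion models are all you need"  # hypothetical follow-on title
ratio = SequenceMatcher(None, a, b).ratio()
print(round(ratio, 3))  # moderately high despite different papers
```

With a strict threshold such as 0.95 these would not match, but a permissive one around 0.6 would, which is why checking the abstract as well can help.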

@PeterLombaers
Copy link
Member

> What do you think should be done when verbose is true for the current algorithm? I mean, what would be the expected output? Because the pretty print probably will print everything dimmed in most cases.

True, I'm not sure what I would want to see in verbose mode for the other case. I would leave it as it is for now. We might want to think in the future about whether we want to maintain this feature and how it would look then, also for the other deduplication. I do think it's very nice to have such verbose output when deduplicating, so that you can clearly see which ones get marked as duplicate and why.

Signed-off-by: George Araújo <[email protected]>
@george-gca
Copy link
Author

I agree with you. I had a quick look at the ASReviewData.drop_duplicates code, and it resets the index by default, meaning that comparing dataframes before and after the deduplication would fail.

I changed the code a little bit to allow verbose when not using similar, and also moved part of the code to dedup.py, since the entrypoint was starting to accumulate too much code.
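The index-reset problem mentioned above can be shown with a generic pandas sketch (this is not ASReview's actual code): once the index is reset, set-differencing indices before and after deduplication points at the wrong rows.

```python
# Generic pandas illustration of the index-reset pitfall: with a
# reset index you can no longer tell which original rows were
# dropped by comparing index labels.
import pandas as pd

df = pd.DataFrame({"title": ["a", "b", "a", "c"]})

reset = df.drop_duplicates(ignore_index=True)  # index relabelled 0..2
kept = df.drop_duplicates()                    # original labels preserved

print(set(df.index) - set(reset.index))  # misleading: {3}
print(set(df.index) - set(kept.index))   # actually dropped row: {2}
```

Keeping the original index (and resetting only at the very end, if needed) makes before/after comparisons meaningful.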
