New duplicate algorithm to check for similar entries #52

Open · wants to merge 10 commits into base: main
47 changes: 44 additions & 3 deletions README.md
@@ -178,6 +178,47 @@ asreview data dedup synergy:van_de_schoot_2018 -o van_de_schoot_2018_dedup.csv
Removed 104 records from dataset with 6189 records.
```

We can also choose to deduplicate based on the similarity of the title and abstract, instead of checking for an exact match. This way we can find records that differ only slightly (for example, an extra comma or a corrected typo) but are actually duplicates. This can be done with the `--drop_similar` flag. This process takes about 4 seconds on a dataset of 2068 entries.

```bash
asreview data dedup neurips_2020.tsv --drop_similar
```
```
Not using doi for deduplication because there is no such data.
Deduplicating: 100%|████████████████████████████████████| 2068/2068 [00:03<00:00, 531.93it/s]
Found 2 duplicates in dataset with 2068 records.
```

If we want to check which entries were flagged as duplicates, we can use the `--verbose` flag. This prints the lines of the dataset that were found to be duplicates, together with the differences between them. Any text that has to be removed from the first entry to obtain the second is shown in red with a strikethrough, any text that has to be added is shown in green, and text that is identical in both entries is dimmed.

```bash
asreview data dedup neurips_2020.tsv --drop_similar --verbose
```

![Verbose drop similar](./dedup_similar.png)

The similarity threshold can be set with the `--similarity` flag. The default similarity threshold is `0.98`. We can also choose to only use the title for deduplication by using the `--skip_abstract` flag.

```bash
asreview data dedup neurips_2020.tsv --drop_similar --similarity 0.98 --skip_abstract
```
```
Not using doi for deduplication because there is no such data.
Deduplicating: 100%|████████████████████████████████████| 2068/2068 [00:02<00:00, 770.74it/s]
Found 4 duplicates in dataset with 2068 records.
```

Note that you might have to adjust the similarity threshold if you choose to use only the title for deduplication. The similarity score is calculated with the [SequenceMatcher](https://docs.python.org/3/library/difflib.html#difflib.SequenceMatcher) class from the `difflib` package, as twice the number of matching characters divided by the total number of characters in the two strings. For example, the similarity score between the strings "hello" and "hello world" is 2 * 5 / (5 + 11) = 0.625. By default, we use the [real_quick_ratio](https://docs.python.org/3/library/difflib.html#difflib.SequenceMatcher.real_quick_ratio) and [quick_ratio](https://docs.python.org/3/library/difflib.html#difflib.SequenceMatcher.quick_ratio) methods, which are faster upper bounds and usually good enough, but less accurate. If you want the exact [ratio](https://docs.python.org/3/library/difflib.html#difflib.SequenceMatcher.ratio) to be checked as well, use the `--strict_similarity` flag.
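
To get a feel for how these ratios behave, here is a minimal standalone snippet (illustrative only, not part of `asreview-datatools`) that reproduces the example above:

```python
from difflib import SequenceMatcher

# ratio() = 2 * (matching characters) / (total characters in both strings)
m = SequenceMatcher(None, "hello", "hello world")
print(m.ratio())             # 0.625, i.e. 2 * 5 / (5 + 11)

# real_quick_ratio() and quick_ratio() are cheap upper bounds on ratio();
# the similarity check tries them first and only computes the exact ratio()
# when --strict_similarity is passed
print(m.real_quick_ratio())  # >= ratio()
print(m.quick_ratio())       # >= ratio()
```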

Now, if we want to discard stopwords during deduplication (for a stricter check on the important words), we can use the `--discard_stopwords` flag. The default language for the stopwords is `english`, but it can be set with the `--stopwords_language` flag. The supported stopword languages are the same as those supported by the [nltk](https://www.nltk.org/index.html) package. To check the list of available languages, you can run the following commands in your Python environment:

```python
from nltk.corpus import stopwords  # run nltk.download('stopwords') first if the corpus is missing
print(stopwords.fileids())
```
```
['arabic', 'azerbaijani', 'basque', 'bengali', 'catalan', 'chinese', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hebrew', 'hinglish', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']
```
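
For illustration, the sketch below shows the kind of stopword stripping that `--discard_stopwords` performs before the texts are compared. The title is made up, and it assumes the `stopwords` corpus has already been downloaded:

```python
import re

from nltk.corpus import stopwords  # requires nltk.download('stopwords')

title = "a new algorithm for the deduplication of records"
stop = set(stopwords.words('english'))

# drop whole stopwords only (word boundaries), then collapse leftover whitespace
cleaned = re.sub(r'\b(' + '|'.join(stop) + r')\b', '', title)
print(re.sub(r'\s+', ' ', cleaned).strip())
# -> 'new algorithm deduplication records'
```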

### Data Vstack (Experimental)

@@ -186,7 +227,7 @@ Vertical stacking: combine as many datasets in the same file format as you want
❗ Vstack is an experimental feature. We would love to hear your feedback.
Please keep in mind that this feature can change in the future.

Stack several datasets on top of each other:
```
asreview data vstack output.csv MY_DATASET_1.csv MY_DATASET_2.csv MY_DATASET_3.csv
```
@@ -206,7 +247,7 @@ Compose is where datasets containing records with different labels (or no
labels) can be assembled into a single dataset.

❗ Compose is an experimental feature. We would love to hear your feedback.
Please keep in mind that this feature can change in the future.

Overview of possible input files and corresponding properties; use at least
one of the following arguments:
@@ -231,7 +272,7 @@ case of conflicts, use the `--conflict_resolve`/`-c` flag. This is set to
| Resolve method | Action in case of conflict |
|----------------|-----------------------------------------------------------------------------------------|
| `keep_one` | Keep one label, using `--hierarchy` to determine which label to keep |
| `keep_all`     | Keep conflicting records as duplicates in the composed dataset (ignoring `--hierarchy`)  |
| `abort` | Abort |


28 changes: 17 additions & 11 deletions Tutorials.md
@@ -1,6 +1,6 @@
# Tutorials

---
Below are several examples to illustrate how to use `ASReview-datatools`. Make
sure to have installed
[asreview-datatools](https://github.com/asreview/asreview-datatools) and
@@ -18,17 +18,17 @@ ASReview converts the labeling decisions in [RIS files](https://asreview.readthe
irrelevant as `0` and relevant as `1`. Records marked as unseen or with
missing labeling decisions are converted to `-1`.

---

## Update Systematic Review

Assume you are working on a systematic review and you want to update the
review with newly available records. The original data is stored in
`MY_LABELED_DATASET.csv` and the file contains a
[column](https://asreview.readthedocs.io/en/latest/data_labeled.html#label-format)
containing the labeling decisions. In order to update the systematic review,
you run the original search query again but with a new date. You save the
newly found records in `SEARCH_UPDATE.ris`.


In the command line interface (CLI), navigate to the directory where the
@@ -52,12 +52,18 @@ asreview data convert SEARCH_UPDATE.ris SEARCH_UPDATE.csv

Duplicate records can be removed with the `dedup` script. The algorithm
removes duplicates using the Digital Object Identifier
([DOI](https://www.doi.org/)) and the title plus abstract.

```bash
asreview data dedup SEARCH_UPDATE.csv -o SEARCH_UPDATE_DEDUP.csv
```

This can also be done using a similarity threshold on the titles and abstracts.

```bash
asreview data dedup SEARCH_UPDATE.csv -o SEARCH_UPDATE_DEDUP.csv --drop_similar
```

### Describe input

If you want to see descriptive info on your input datasets, run these commands:
@@ -78,12 +84,12 @@ asreview data compose updated_search.csv -l MY_LABELED_DATASET.csv -u SEARCH_UPD
The flag `-l` means the labels in `MY_LABELED_DATASET.csv` will be kept.

The flag `-u` means all records from `SEARCH_UPDATE_DEDUP.csv` will be
added as unlabeled to the composed dataset.

If a record exists in both datasets, the record containing a
label is kept by default, see the [conflict resolving
strategy](https://github.com/asreview/asreview-datatools#resolving-conflicting-labels).
To keep both records (with and without label), use

```bash
asreview data compose updated_search.csv -l MY_LABELED_DATASET.csv -u SEARCH_UPDATE_DEDUP.csv -c keep_all
@@ -154,14 +160,14 @@ added as unlabeled.

If any duplicate records exist across the datasets, by default the order of
keeping labels is:
1. relevant
2. irrelevant
3. unlabeled

You can configure the behavior in resolving conflicting labels by setting the
hierarchy differently. To do so, pass the letters r (relevant), i
(irrelevant), and u (unlabeled) in any order to, for example, `--hierarchy
uir`.


The composed dataset will be exported to `search_with_priors.ris`.
@@ -193,12 +199,12 @@ new search.
Assume you want to use the [simulation
mode](https://asreview.readthedocs.io/en/latest/simulation_overview.html) of
ASReview but the data is not stored in one single file containing the meta-data
and labeling decisions as required by ASReview.

Suppose the following files are available:

- `SCREENED.ris`: all records that were screened
- `RELEVANT.ris`: the subset of relevant records after manually screening all the records.

You need to compose the files into a single file where all records from
`RELEVANT.csv` are relevant and all other records are irrelevant.
161 changes: 161 additions & 0 deletions asreviewcontrib/datatools/dedup.py
@@ -0,0 +1,161 @@
import re
from argparse import Namespace
from difflib import SequenceMatcher

import ftfy
import pandas as pd
from asreview import ASReviewData
from rich.console import Console
from rich.text import Text
from tqdm import tqdm


def _print_similar_list(similar_list: list[tuple[int, int]], data: pd.Series) -> None:
    print_seq_matcher = SequenceMatcher()
    console = Console()
    print('Found similar titles at lines:')

    for i, j in similar_list:
        print_seq_matcher.set_seq1(data.iloc[i])
        print_seq_matcher.set_seq2(data.iloc[j])
        text = Text()
        text.append(f"\nLines {i+1} and {j+1}:\n", style='bold')

        for tag, i1, i2, j1, j2 in print_seq_matcher.get_opcodes():
            if tag == 'replace':
                # add rich strikethrough
                text.append(f'{data.iloc[i][i1:i2]}', style='red strike')
                text.append(f'{data.iloc[j][j1:j2]}', style='green')
            if tag == 'delete':
                text.append(f'{data.iloc[i][i1:i2]}', style='red strike')
            if tag == 'insert':
                text.append(f'{data.iloc[j][j1:j2]}', style='green')
            if tag == 'equal':
                text.append(f'{data.iloc[i][i1:i2]}', style='dim')

        console.print(text)

    print('')


def drop_duplicates_by_similarity(
        asdata: ASReviewData,
        similarity: float = 0.98,
        skip_abstract: bool = False,
        discard_stopwords: bool = False,
        stopwords_language: str = 'english',
        strict_similarity: bool = False,
        verbose: bool = False) -> None:

    if skip_abstract:
        data = asdata.df['title']
    else:
        data = pd.Series(asdata.texts)

    symbols_regex = re.compile(r'[^ \w\d\-_]')
    spaces_regex = re.compile(r'\s+')

    # normalize the texts before comparing: fix encoding issues, drop symbols,
    # collapse whitespace, lowercase, and treat empty strings as missing
    s = (
        data
        .apply(ftfy.fix_text)
        .str.replace(symbols_regex, '', regex=True)
        .str.replace(spaces_regex, ' ', regex=True)
        .str.lower()
        .str.strip()
        .replace("", None)
    )

    if discard_stopwords:
        try:
            from nltk.corpus import stopwords
            stopwords_set = set(stopwords.words(stopwords_language))
        except LookupError:
            import nltk
            nltk.download('stopwords')
            stopwords_set = set(stopwords.words(stopwords_language))

        # join the stopwords into one word-boundary pattern and strip them
        # (built by concatenation to avoid backslashes inside an f-string
        # expression, which is a syntax error before Python 3.12)
        stopwords_regex = re.compile(r'\b(' + '|'.join(stopwords_set) + r')\b')
        s = s.str.replace(stopwords_regex, '', regex=True)

    duplicated = [False] * len(s)
    seq_matcher = SequenceMatcher()

    if verbose:
        similar_list = []
    else:
        similar_list = None

    for i, text in tqdm(s.items(), total=len(s), desc="Deduplicating"):
        seq_matcher.set_seq2(text)

        # only compare the current record with later records whose normalized
        # length differs by fewer than 5 characters
        for j, t in s.iloc[i+1:][abs(s.str.len() - len(text)) < 5].items():
            seq_matcher.set_seq1(t)

            # real_quick_ratio() and quick_ratio() are cheap upper bounds on
            # ratio(); the exact ratio() is only computed with --strict_similarity
            if seq_matcher.real_quick_ratio() > similarity and \
                    seq_matcher.quick_ratio() > similarity and \
                    (not strict_similarity or seq_matcher.ratio() > similarity):

                if verbose and not duplicated[j]:
                    similar_list.append((i, j))

                duplicated[j] = True

    if verbose:
        _print_similar_list(similar_list, data)

    asdata.df = asdata.df[~pd.Series(duplicated)].reset_index(drop=True)


def deduplicate_data(asdata: ASReviewData, args: Namespace) -> None:
    initial_length = len(asdata.df)

    if args.pid not in asdata.df.columns:
        print(
            f"Not using {args.pid} for deduplication "
            "because there is no such data."
        )

    if not args.similar:
        if args.verbose:
            before_dedup = asdata.df.copy()

            # retrieve deduplicated ASReview data object
            asdata.drop_duplicates(pid=args.pid, inplace=True, reset_index=False)
            duplicate_entries = before_dedup[~before_dedup.index.isin(asdata.df.index)]

            if len(duplicate_entries) > 0:
                print("Duplicate entries:")
                for i, row in duplicate_entries.iterrows():
                    print(f"\tLine {i} - {row['title']}")

            asdata.df.reset_index(drop=True, inplace=True)

        else:
            # retrieve deduplicated ASReview data object
            asdata.drop_duplicates(pid=args.pid, inplace=True)

    else:
        drop_duplicates_by_similarity(
            asdata,
            args.threshold,
            args.title_only,
            args.stopwords,
            args.stopwords_language,
            args.strict,
            args.verbose,
        )

    # count duplicates
    n_dup = initial_length - len(asdata.df)

    if args.output_path:
        asdata.to_file(args.output_path)
        print(
            f"Removed {n_dup} duplicates from dataset with"
            f" {initial_length} records."
        )
    else:
        print(
            f"Found {n_dup} duplicates in dataset with"
            f" {initial_length} records."
        )