diff --git a/README.md b/README.md
index ffa4135..689696d 100644
--- a/README.md
+++ b/README.md
@@ -178,6 +178,47 @@ asreview data dedup synergy:van_de_schoot_2018 -o van_de_schoot_2018_dedup.csv
 Removed 104 records from dataset with 6189 records.
 ```
 
+We can also choose to deduplicate based on the similarity of the title and abstract, instead of checking for an exact match. This way we can find duplicates that differ only slightly but are actually the same record (for example, an extra comma or a fixed typo). This can be done with the `--similar` flag. This process takes about 4 seconds on a dataset of 2068 entries.
+
+```bash
+asreview data dedup neurips_2020.tsv --similar
+```
+```
+Not using doi for deduplication because there is no such data.
+Deduplicating: 100%|████████████████████████████████████| 2068/2068 [00:03<00:00, 531.93it/s]
+Found 2 duplicates in dataset with 2068 records.
+```
+
+If we want to check which entries were flagged as duplicates, we can use the `--verbose` flag. This prints the lines of the dataset that were found to be duplicates, along with the difference between them. Any text that has to be removed from the first entry to obtain the second is shown in red with a strikethrough, and any text that has to be added to the first entry is shown in green. Text that is identical in both entries is dimmed.
+
+```bash
+asreview data dedup neurips_2020.tsv --similar --verbose
+```
+
+![Verbose drop similar](./dedup_similar.png)
+
+The similarity threshold can be set with the `--threshold` flag. The default similarity threshold is `0.98`. We can also choose to use only the title for deduplication with the `--title_only` flag.
+
+```bash
+asreview data dedup neurips_2020.tsv --similar --threshold 0.98 --title_only
+```
+```
+Not using doi for deduplication because there is no such data.
+Deduplicating: 100%|████████████████████████████████████| 2068/2068 [00:02<00:00, 770.74it/s]
+Found 4 duplicates in dataset with 2068 records.
+```
+
+Note that you might have to adjust the similarity threshold if you choose to use only the title for deduplication. The similarity score is calculated with the [SequenceMatcher](https://docs.python.org/3/library/difflib.html#difflib.SequenceMatcher) class from the `difflib` package, as twice the number of matching characters divided by the total number of characters in the two strings. For example, the similarity score between the strings "hello" and "hello world" is 2 * 5 / 16 = 0.625. By default, we use the [real_quick_ratio](https://docs.python.org/3/library/difflib.html#difflib.SequenceMatcher.real_quick_ratio) and [quick_ratio](https://docs.python.org/3/library/difflib.html#difflib.SequenceMatcher.quick_ratio) methods, which are faster and usually good enough, but less accurate. If you want to use the exact ratio method instead, pass the `--strict` flag.
+
+Now, if we want to discard stopwords during deduplication (for a stricter check on the informative words), we can use the `--stopwords` flag. The default language for the stopwords is `english`, but it can be set with the `--stopwords_language` flag. The supported stopword languages are those provided by the [nltk](https://www.nltk.org/index.html) package. To check the list of available languages, you can run the following commands in your Python environment:
+
+```python
+from nltk.corpus import stopwords
+print(stopwords.fileids())
+```
+```
+['arabic', 'azerbaijani', 'basque', 'bengali', 'catalan', 'chinese', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hebrew', 'hinglish', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']
+```
 
 ### Data Vstack (Experimental)
 
@@ -186,7 +227,7 @@ Vertical stacking: combine as many datasets in the same file format as you want
 
 ❗ Vstack is an experimental feature. We would love to hear your feedback.
 Please keep in mind that this feature can change in the future.
 
-Stack several datasets on top of each other: 
+Stack several datasets on top of each other:
 ```
 asreview data vstack output.csv MY_DATASET_1.csv MY_DATASET_2.csv MY_DATASET_3.csv
 ```
@@ -206,7 +247,7 @@ Compose is where datasets containing records with different labels (or no
 labels) can be assembled into a single dataset.
 
 ❗ Compose is an experimental feature. We would love to hear your feedback.
-Please keep in mind that this feature can change in the future. 
+Please keep in mind that this feature can change in the future.
 
 Overview of possible input files and corresponding properties, use at least
 one of the following arguments:
@@ -231,7 +272,7 @@ case of conflicts, use the `--conflict_resolve`/`-c` flag. This is set to
 
 | Resolve method | Action in case of conflict                                                              |
 |----------------|-----------------------------------------------------------------------------------------|
 | `keep_one`     | Keep one label, using `--hierarchy` to determine which label to keep                    |
-| `keep_all`     | Keep conflicting records as duplicates in the composed dataset (ignoring `--hierarchy`) | 
+| `keep_all`     | Keep conflicting records as duplicates in the composed dataset (ignoring `--hierarchy`) |
 | `abort`        | Abort                                                                                   |
 
diff --git a/Tutorials.md b/Tutorials.md
index 0f6a02e..6f777cc 100644
--- a/Tutorials.md
+++ b/Tutorials.md
@@ -1,6 +1,6 @@
 # Tutorials
 
---- 
+---
 
 Below are several examples to illustrate how to use `ASReview-datatools`.
 Make sure to have installed
 [asreview-datatools](https://github.com/asreview/asreview-datatools) and
@@ -18,9 +18,9 @@ ASReview converts the labeling decisions in [RIS files](https://asreview.readthe
 irrelevant as `0` and relevant as `1`. Records marked as unseen or with
 missing labeling decisions are converted to `-1`.
 
---- 
+---
 
-## Update Systematic Review 
+## Update Systematic Review
 
 Assume you are working on a systematic review and you want to update the
 review with newly available records. The original data is stored in
@@ -28,7 +28,7 @@ review with newly available records. The original data is stored in
 [column](https://asreview.readthedocs.io/en/latest/data_labeled.html#label-format)
 containing the labeling decissions. In order to update the systematic
 review, you run the original search query again but with a new date. You save the
-newly found records in `SEARCH_UPDATE.ris`. 
+newly found records in `SEARCH_UPDATE.ris`.
 
 In the command line interface (CLI), navigate to the directory where the
@@ -52,12 +52,18 @@ asreview data convert SEARCH_UPDATE.ris SEARCH_UPDATE.csv
 
 Duplicate records can be removed with with `dedup` script. The algorithm
 removes duplicates using the Digital Object Indentifier
-([DOI](https://www.doi.org/)) and the title plus abstract. 
+([DOI](https://www.doi.org/)) and the title plus abstract.
 
 ```bash
 asreview data dedup SEARCH_UPDATE.csv -o SEARCH_UPDATE_DEDUP.csv
 ```
 
+This can also be done using a similarity threshold on the titles and abstracts.
+
+```bash
+asreview data dedup SEARCH_UPDATE.csv -o SEARCH_UPDATE_DEDUP.csv --similar
+```
+
 ### Describe input
 
 If you want to see descriptive info on your input datasets, run these commands:
@@ -78,12 +84,12 @@ asreview data compose updated_search.csv -l MY_LABELED_DATASET.csv -u SEARCH_UPD
 The flag `-l` means the labels in `MY_LABELED_DATASET.csv` will be kept.
 The flag `-u` means all records from `SEARCH_UPDATE_DEDUP.csv` will be
-added as unlabeled to the composed dataset. 
+added as unlabeled to the composed dataset.
 
 If a record exists in both datasets, it is assumed the record containing a
 label is maintained, see the default [conflict resolving
 strategy](https://github.com/asreview/asreview-datatools#resolving-conflicting-labels).
-To keep both records (with and without label), use 
+To keep both records (with and without label), use
 
 ```bash
 asreview data compose updated_search.csv -l MY_LABELED_DATASET.csv -u SEARCH_UPDATE_DEDUP.csv -c keep
 ```
@@ -154,14 +160,14 @@ added as unlabeled.
 If any duplicate records exist across the datasets, by default the order of
 keeping labels is:
 
-1. relevant 
+1. relevant
 2. irrelevant
 3. unlabeled
 
 You can configure the behavior in resolving conflicting labels by setting
 the hierarchy differently. To do so, pass the letters r (relevant), i
 (irrelevant), and u (unlabeled) in any order to, for example, `--hierarchy
-uir`. 
+uir`.
 
 The composed dataset will be exported to `search_with_priors.ris`.
@@ -193,12 +199,12 @@ new search.
 
 Assume you want to use the [simulation
 mode](https://asreview.readthedocs.io/en/latest/simulation_overview.html) of
 ASReview but the data is not stored in one singe file containing the meta-data
-and labelling decissions as required by ASReview. 
+and labelling decisions as required by ASReview.
 
 Suppose the following files are available:
 
 - `SCREENED.ris`: all records that were screened
-- `RELEVANT.ris`: the subset of relevant records after manually screening all the records. 
+- `RELEVANT.ris`: the subset of relevant records after manually screening all the records.
 
 You need to compose the files into a single file where all records from
 `RELEVANT.csv` are relevant all other records are irrelevant.
diff --git a/asreviewcontrib/datatools/dedup.py b/asreviewcontrib/datatools/dedup.py
new file mode 100644
index 0000000..daf5b82
--- /dev/null
+++ b/asreviewcontrib/datatools/dedup.py
@@ -0,0 +1,161 @@
+import re
+from argparse import Namespace
+from difflib import SequenceMatcher
+
+import ftfy
+import pandas as pd
+from asreview import ASReviewData
+from rich.console import Console
+from rich.text import Text
+from tqdm import tqdm
+
+
+def _print_similar_list(similar_list: "list[tuple[int, int]]", data: pd.Series) -> None:
+    print_seq_matcher = SequenceMatcher()
+    console = Console()
+    print('Found similar titles at lines:')
+
+    for i, j in similar_list:
+        print_seq_matcher.set_seq1(data.iloc[i])
+        print_seq_matcher.set_seq2(data.iloc[j])
+        text = Text()
+        text.append(f"\nLines {i+1} and {j+1}:\n", style='bold')
+
+        for tag, i1, i2, j1, j2 in print_seq_matcher.get_opcodes():
+            if tag == 'replace':
+                # show text unique to the first entry in red strikethrough,
+                # text unique to the second entry in green
+                text.append(f'{data.iloc[i][i1:i2]}', style='red strike')
+                text.append(f'{data.iloc[j][j1:j2]}', style='green')
+            if tag == 'delete':
+                text.append(f'{data.iloc[i][i1:i2]}', style='red strike')
+            if tag == 'insert':
+                text.append(f'{data.iloc[j][j1:j2]}', style='green')
+            if tag == 'equal':
+                text.append(f'{data.iloc[i][i1:i2]}', style='dim')
+
+        console.print(text)
+
+    print('')
+
+
+def drop_duplicates_by_similarity(
+        asdata: ASReviewData,
+        similarity: float = 0.98,
+        skip_abstract: bool = False,
+        discard_stopwords: bool = False,
+        stopwords_language: str = 'english',
+        strict_similarity: bool = False,
+        verbose: bool = False) -> None:
+
+    if skip_abstract:
+        data = asdata.df['title']
+    else:
+        data = pd.Series(asdata.texts)
+
+    symbols_regex = re.compile(r'[^ \w\d\-_]')
+    spaces_regex = re.compile(r'\s+')
+
+    # normalize texts: fix encoding, strip symbols, collapse whitespace
+    s = (
+        data
+        .apply(ftfy.fix_text)
+        .str.replace(symbols_regex, '', regex=True)
+        .str.replace(spaces_regex, ' ', regex=True)
+        .str.lower()
+        .str.strip()
+        .replace("", None)
+    )
+
+    if discard_stopwords:
+        try:
+            from nltk.corpus import stopwords
+            stopwords_set = set(stopwords.words(stopwords_language))
+        except LookupError:
+            import nltk
+            nltk.download('stopwords')
+            stopwords_set = set(stopwords.words(stopwords_language))
+
+        stopwords_regex = re.compile(r'\b(?:' + '|'.join(map(re.escape, stopwords_set)) + r')\b')
+        s = s.str.replace(stopwords_regex, '', regex=True)
+
+    duplicated = [False] * len(s)
+    seq_matcher = SequenceMatcher()
+
+    if verbose:
+        similar_list = []
+    else:
+        similar_list = None
+
+    for i, text in tqdm(s.dropna().items(), total=len(s), desc="Deduplicating"):
+        seq_matcher.set_seq2(text)
+
+        # only compare against later records of roughly the same length
+        for j, t in s.iloc[i+1:][abs(s.str.len() - len(text)) < 5].items():
+            seq_matcher.set_seq1(t)
+
+            if seq_matcher.real_quick_ratio() > similarity and \
+                    seq_matcher.quick_ratio() > similarity and \
+                    (not strict_similarity or seq_matcher.ratio() > similarity):
+
+                if verbose and not duplicated[j]:
+                    similar_list.append((i, j))
+
+                duplicated[j] = True
+
+    if verbose:
+        _print_similar_list(similar_list, data)
+
+    asdata.df = asdata.df[~pd.Series(duplicated)].reset_index(drop=True)
+
+
+def deduplicate_data(asdata: ASReviewData, args: Namespace) -> None:
+    initial_length = len(asdata.df)
+
+    if args.pid not in asdata.df.columns:
+        print(
+            f"Not using {args.pid} for deduplication "
+            "because there is no such data."
+        )
+
+    if not args.similar:
+        if args.verbose:
+            before_dedup = asdata.df.copy()
+
+            # retrieve deduplicated ASReview data object
+            asdata.drop_duplicates(pid=args.pid, inplace=True, reset_index=False)
+            duplicate_entries = before_dedup[~before_dedup.index.isin(asdata.df.index)]
+
+            if len(duplicate_entries) > 0:
+                print("Duplicate entries:")
+                for i, row in duplicate_entries.iterrows():
+                    print(f"\tLine {i} - {row['title']}")
+
+            asdata.df.reset_index(drop=True, inplace=True)
+
+        else:
+            # retrieve deduplicated ASReview data object
+            asdata.drop_duplicates(pid=args.pid, inplace=True)
+
+    else:
+        drop_duplicates_by_similarity(
+            asdata,
+            args.threshold,
+            args.title_only,
+            args.stopwords,
+            args.stopwords_language,
+            args.strict,
+            args.verbose,
+        )
+
+    # count duplicates
+    n_dup = initial_length - len(asdata.df)
+
+    if args.output_path:
+        asdata.to_file(args.output_path)
+        print(
+            f"Removed {n_dup} duplicates from dataset with"
+            f" {initial_length} records."
+        )
+    else:
+        print(
+            f"Found {n_dup} duplicates in dataset with"
+            f" {initial_length} records."
+        )
diff --git a/asreviewcontrib/datatools/entrypoint.py b/asreviewcontrib/datatools/entrypoint.py
index 647bc6a..2f1d0cc 100644
--- a/asreviewcontrib/datatools/entrypoint.py
+++ b/asreviewcontrib/datatools/entrypoint.py
@@ -8,6 +8,7 @@
 from asreviewcontrib.datatools.compose import compose
 from asreviewcontrib.datatools.convert import _parse_arguments_convert
 from asreviewcontrib.datatools.convert import convert
+from asreviewcontrib.datatools.dedup import deduplicate_data
 from asreviewcontrib.datatools.describe import _parse_arguments_describe
 from asreviewcontrib.datatools.describe import describe
 from asreviewcontrib.datatools.sample import _parse_arguments_sample
@@ -59,36 +60,50 @@ def execute(self, argv):
                 type=str,
                 help="Persistent identifier used for deduplication. Default: doi.",
             )
+            dedup_parser.add_argument(
+                "--similar",
+                action='store_true',
+                help="Drop similar records.",
+            )
+            dedup_parser.add_argument(
+                "--threshold",
+                default=0.98,
+                type=float,
+                help="Similarity threshold for deduplication. Default: 0.98.",
+            )
+            dedup_parser.add_argument(
+                "--title_only",
+                action='store_true',
+                help="Use only the title for deduplication.",
+            )
+            dedup_parser.add_argument(
+                "--stopwords",
+                action='store_true',
+                help="Ignore stopwords for deduplication, focusing on the main words.",
+            )
+            dedup_parser.add_argument(
+                "--strict",
+                action='store_true',
+                help="Use the slower but exact ratio method for similarity.",
+            )
+            dedup_parser.add_argument(
+                "--stopwords_language",
+                default="english",
+                type=str,
+                help="Language for the stopwords. Default: english.",
+            )
+            dedup_parser.add_argument(
+                "--verbose",
+                action='store_true',
+                help="Print verbose output.",
+            )
             args_dedup = dedup_parser.parse_args(argv[1:])
 
             # read data in ASReview data object
             asdata = load_data(args_dedup.input_path)
-            initial_length = len(asdata.df)
-
-            if args_dedup.pid not in asdata.df.columns:
-                print(
-                    f"Not using {args_dedup.pid} for deduplication"
-                    "because there is no such data."
-                )
-
-            # retrieve deduplicated ASReview data object
-            asdata.drop_duplicates(pid=args_dedup.pid, inplace=True)
-
-            # count duplicates
-            n_dup = initial_length - len(asdata.df)
+            deduplicate_data(asdata, args_dedup)
 
-            if args_dedup.output_path:
-                asdata.to_file(args_dedup.output_path)
-                print(
-                    f"Removed {n_dup} duplicates from dataset with"
-                    f" {initial_length} records."
-                )
-            else:
-                print(
-                    f"Found {n_dup} duplicates in dataset with"
-                    f" {initial_length} records."
-                )
 
         if argv[0] == "compose":
             args_compose_parser = _parse_arguments_compose()
             args_compose = args_compose_parser.parse_args(argv[1:])
diff --git a/dedup_similar.png b/dedup_similar.png
new file mode 100644
index 0000000..53c7d52
Binary files /dev/null and b/dedup_similar.png differ
diff --git a/pyproject.toml b/pyproject.toml
index 0034c41..1471822 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -14,7 +14,7 @@ classifiers = [
     "Programming Language :: Python :: 3.11"
 ]
 license = {text = "MIT License"}
-dependencies = ["asreview>=1.1,<2", "pandas", "pyalex"]
+dependencies = ["asreview>=1.1,<2", "ftfy", "nltk", "pandas", "pyalex", "rich", "tqdm"]
 dynamic = ["version"]
 requires-python = ">=3.8"
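As a sanity check on the similarity scoring the README changes describe, here is a standalone sketch (not part of the patch) of how `difflib.SequenceMatcher` scores the "hello" / "hello world" example, and why the patch checks `real_quick_ratio()` and `quick_ratio()` before the optional strict `ratio()`:

```python
from difflib import SequenceMatcher

# ratio() returns 2*M/T, where M is the number of matching characters
# and T is the total number of characters in the two strings.
m = SequenceMatcher(None, "hello", "hello world")
print(m.ratio())  # 2 * 5 / (5 + 11) = 0.625

# real_quick_ratio() and quick_ratio() are cheap upper bounds on ratio(),
# so they can reject non-duplicates early without the full comparison.
assert m.real_quick_ratio() >= m.quick_ratio() >= m.ratio()
```

Because the two cheap methods never under-estimate `ratio()`, checking them first can only skip pairs that the strict method would also reject.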
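The text normalization in `dedup.py` (strip symbols, collapse whitespace, lowercase, optionally remove stopwords) can be sketched in isolation roughly as follows. This is a simplified stdlib-only sketch: the `ftfy.fix_text` encoding-repair step is omitted, the sample sentence and the one-word stopword set are made up for illustration, and each stopword is passed through `re.escape` before being joined into a single alternation pattern (which also avoids putting backslashes inside an f-string expression, a SyntaxError on Python < 3.12):

```python
import re

symbols_regex = re.compile(r'[^ \w\d\-_]')
spaces_regex = re.compile(r'\s+')

def normalize(text, stopwords=()):
    # strip symbols, collapse runs of whitespace, lowercase, trim
    text = symbols_regex.sub('', text)
    text = spaces_regex.sub(' ', text).lower().strip()
    if stopwords:
        # build \b(?:word1|word2|...)\b safely, escaping each word
        pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, stopwords)) + r')\b')
        text = pattern.sub('', text)
    return text

print(normalize("The  Quick, Brown Fox!"))                     # 'the quick brown fox'
print(normalize("The  Quick, Brown Fox!", stopwords={'the'}))  # ' quick brown fox'
```

Note that, as in the patch, removing stopwords leaves the surrounding spaces in place; since both sides of a comparison are normalized the same way, this does not affect the similarity score much.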
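The inner loop of `drop_duplicates_by_similarity` only compares each record against later records whose normalized length differs by fewer than 5 characters, which keeps the pairwise comparison cheap. It relies on pandas aligning a full-length boolean mask to a sliced Series by index label. A small illustration with made-up titles:

```python
import pandas as pd

s = pd.Series(['deep learning', 'deep learnin', 'graph neural networks'])
text = s.iloc[0]

# boolean mask over the whole Series; indexing the tail slice with it
# aligns on index labels, so only rows after the current one are kept
near_length = abs(s.str.len() - len(text)) < 5
candidates = s.iloc[1:][near_length]
print(list(candidates))  # ['deep learnin']
```

The length window is a prefilter, not a guarantee: two records whose lengths differ by 5 or more are never compared, which is consistent with a high default threshold of 0.98.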