Diverse Unionable Tuple Search (DUST)

This repository contains the implementation codes for paper: Diverse Unionable Tuple Search, currently under submission.

Authors: Aamod Khatiwada, Roee Shraga and Renée J. Miller

Block Diagram of DUST System

Abstract

Unionable table search techniques input a query table from the user and search for the data lake tables that can contribute additional rows to it. The definition of unionability is generally based on similarity measures which may include similarity between columns (e.g., value overlap or semantic similarity of the values in the columns) or tables (e.g., similarity of table embeddings). Due to this and the large redundancy in many data lakes (which can contain copies and versions of the same table), the most unionable tables may be identical or nearly identical to the query table and contain little new information. Hence, we introduce the problem of searching for unionable tuples from the data lake that are diverse with respect to the tuples already present in a query table. As a solution, we propose DUST system that searches for unionable tuples from the data lake and diversifies them. DUST uses a novel embedding model to represent unionable tuples. Then, we present an efficient algorithm that diversifies a set of candidate unionable tuples. We show that our embedding outperforms other tuple representation models by at least 15 %. Furthermore, using real data lake benchmarks, we show that our diversification algorithm is more than six times faster than the most efficient diversification baseline. We also show that it is more effective in diversifying unionable tuples than existing diversification algorithms.

Repository Organization

data folder contains the sub-folders for some datasets and placeholder for additional datasets.
diversity_algorithm folder contains our novel diversity algorithm and baseline implementations.
dust.jpg file shows the block diagram of DUST system.
README.md file explains the repository.
requirements.txt file contains necessary packages to run the project.

Setup

Clone the repo
CD to the repo directory. Create and activate a virtual environment for this project. We recommend using python version 3.8 or higher.

On macOS or Linux:

python3 -m venv env
source env/bin/activate
which python

On windows:

python -m venv env
.\env\Scripts\activate.bat
where.exe python

Install necessary packages.
```
pip install -r requirements.txt
```

Fine-tuned model

All the fine-tuned DUST models are available at this link. Please download them and upload to out_model folder under respective benchmark's subfolders.

Reproducibility

CD to the repo.
To run column alignment experiments, run preprocess_align.py file by updating the parameters.
To finetune DUST model, run preprocess_finetune.py file by updating the parameters.
To run diversity algorithm, CD to diversity_algorithms and run main.py by updating the parameters.

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
data		data
diversity_algorithms		diversity_algorithms
groundtruth		groundtruth
out_model		out_model
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
create_tuple_datalake.py		create_tuple_datalake.py
dust.jpg		dust.jpg
fasttext_embeddings.py		fasttext_embeddings.py
glove_embeddings.py		glove_embeddings.py
model_classes.py		model_classes.py
prepare_dataset_utilities.py		prepare_dataset_utilities.py
prepare_finetuning_dataset.ipynb		prepare_finetuning_dataset.ipynb
prepare_finetuning_dataset_split.ipynb		prepare_finetuning_dataset_split.ipynb
preprocess_align.py		preprocess_align.py
preprocess_finetune.py		preprocess_finetune.py
preprocess_sentence_bert.py		preprocess_sentence_bert.py
requirements.txt		requirements.txt
utilities.py		utilities.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Diverse Unionable Tuple Search (DUST)

Abstract

Repository Organization

Setup

Fine-tuned model

Reproducibility

About

Releases

Packages

Languages

License

northeastern-datalab/dust

Folders and files

Latest commit

History

Repository files navigation

Diverse Unionable Tuple Search (DUST)

Abstract

Repository Organization

Setup

Fine-tuned model

Reproducibility

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages