Crowdsourced datasets including the individual crowd votes.

saties/crowdsourced-datasets

 
 

CrowdData

CrowdData is an open repository that aggregates crowdsourced datasets that include individual crowd votes. We aim to provide the available datasets in a standard format (explained in the Download section below) so that they can be used directly in experiments without any preprocessing effort. The datasets in this repo serve classification tasks (mainly text classification, except for the Emotion dataset). CrowdData can benefit researchers investigating hybrid machine and human-in-the-loop approaches to classification (the repo includes 5 datasets that contain the actual content of the tasks), the role of humans in classification and ranking tasks, truth discovery based on crowdsourced data, estimation of crowd bias, and active learning. If you use any of the datasets in this repository, please make sure that you have read and followed the usage consent explained at the bottom of this page.

Datasets

We categorized the datasets into two folders: binary-classification and multi-class-classification. Within each folder, each dataset is kept in a separate folder that contains a link to the original source. The table below gives an overview of the datasets. Its columns are as follows:

  • Dataset: Name of the dataset including a link to the original source.
  • Description: A brief description of the dataset.
  • Number of tasks: The number of tasks asked to the crowd.
  • Number of workers: Number of crowd workers completing the tasks.
  • Number of total votes: Number of votes collected for all tasks.
  • Ground Truth: Whether the ground truths of the corresponding tasks are available in the dataset (Yes, No, or Partially).
  • Task Type: Type of the task asked to the crowd. It can be either a binary or a multi-class question. If it is a multi-class question, we specify whether it is categorical (with the number of categories), interval (with the range), or ordinal (with the number of classes).
  • Task Content: Content type of the task asked to the crowd (text, image, etc.), and whether the content is available in the dataset (available, unavailable, or partially available).
  • I don't know option: Whether the crowd workers had an "I don't know" option while completing the tasks.
  • Time spent on the task: Whether the dataset includes any information about the time spent on the tasks.

| Dataset | Description | Number of tasks | Number of workers | Number of total votes | Ground Truth | Task Type | Task Content | I don't know option | Time spent on the task |
|---|---|---|---|---|---|---|---|---|---|
| Blue Birds | The task is to identify whether the image contains a blue bird. The dataset contains both the individual votes and the ground truths. | 108 | 39 | 4212 | Yes | binary | image, unavailable | No | No |
| Crowdsourced Amazon Sentiment | The task is sentiment analysis of Amazon product reviews. There are two predicates: "is_book" and "is_negative". | 1011 | 284 | 7803 | Yes | binary | text, available | No | Unavailable |
| Crowdsourced loneliness-slr | Each paper is assessed by three questions: (i) Is it related to the use of technology? (ii) Is it related to older adults? (iii) Is it related to the intervention? | 319 | 34 | 797 | Yes | binary | text, unavailable | Yes | Unavailable |
| HITspam-UsingCrowdflower | The dataset contains individual worker judgments and the related ground truths about whether a HIT (from Crowdflower data) should be considered a "spam" task. | 5380 | 153 | 42762 | Partially | binary | text, unavailable | No | Unavailable |
| HITspam-UsingMTurk | The dataset contains individual worker judgments and the related ground truths about whether a HIT (from MTurk data) should be considered a "spam" task. | 5840 | 135 | 28354 | Partially | binary | text, unavailable | No | Unavailable |
| Recognizing Textual Entailment | The dataset contains individual worker judgments and the related ground truths about whether a given hypothesis sentence is implied by the information in the given text. | 800 | 164 | 8000 | Yes | binary | text, available | No | Unavailable |
| Sentiment popularity - AMT | The dataset contains workers' positive or negative judgments for 500 sentences extracted from movie reviews, with gold labels assigned by the website. | 500 | 143 | 10000 | Yes | binary | text, unavailable | No | Yes |
| Temporal Ordering | The dataset contains individual worker votes and the corresponding ground truths for the task of identifying whether one event happens before another event in a given context. | 462 | 76 | 4620 | Yes | binary | text, partially available | No | Unavailable |
| Text Highlighting | The dataset contains two kinds of tasks: (i) classification tasks with highlighting support, and (ii) highlighting tasks, where the workers highlight evidence. | 685 | 1851 | 27711 | Yes | binary | text, available | Maybe option | Available |
| Toloka Aggregation Relevance 2 | The dataset contains approximately 0.5 million anonymized individual votes that were collected in the "Relevance 2 Gradations" project in 2016. | 99319 | 7139 | 475536 | Partially | binary | text, unavailable | No | Unavailable |
| 2010 Crowdsourced Web Relevance Judgments Data | The dataset contains judgments about the relevance of English Web pages from the ClueWeb09 collection (http://lemurproject.org/clueweb09/). The judgments use three grades: highly relevant, relevant, and non-relevant. A fourth judgment option indicated a broken link that could not be judged. | 20232 | 766 | 98453 | Yes | multi, 3 classes | text, unavailable | No | Unavailable |
| AdultContent2 | The dataset contains approximately 100K individual worker judgments and the related ground truths for classification of websites into 5 categories. | 11040 | 269 | 92721 | Partially | multi, 5 categories | text, unavailable | No | Unavailable |
| AdultContent3 | The dataset contains approximately 50K individual worker judgments and the related ground truths for classification of websites into 4 categories. | 500 | 100 | 50000 | No | multi, 4 categories | text, unavailable | No | Unavailable |
| Emotion | The dataset contains individual worker votes that rate the emotion of a given text along the following dimensions: anger, disgust, fear, joy, sadness, surprise, valence. Each rating contains a value from -100 to 100 for each emotion about the text. | 700 | 10 | 7000 | Yes | multi, interval (-100,100) | text, available | No | Unavailable |
| Toloka Aggregation Relevance 5 | The dataset contains judgments on the relevance of a document for a query on a 5-graded scale. | 363814 | 1274 | 1091918 | Partially | multi, 5 classes | text, unavailable | No | Unavailable |
| Weather Sentiment - AMT | The dataset contains sentiment judgments of 300 tweets. The classification task is based on the following categories: negative (0), neutral (1), positive (2), tweet not related to weather (3), and can't tell (4). | 300 | 110 | 6000 | Yes | multi, 5 classes | text, unavailable | Yes | Yes |
| Word Pair Similarity | The dataset contains individual worker votes that assign a numerical similarity score between 0 and 10 to a given text. | 30 | 10 | 300 | Yes | multi, interval (0,10) | text, unavailable | No | Unavailable |

Download

We provide two Python scripts that download all the datasets and then transform them to a standard format. To use them, first run download_datasets.py, and then run transform_datasets.py. The required Python version is 3.7. The scripts use the following modules: os, pandas, wget, zipfile, tarfile, re, platform, and shutil. Of these, only pandas and wget are third-party packages that need to be installed; the others are part of the Python standard library.
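The two steps can also be driven from a single Python call. The following is a minimal sketch, not part of the repository: the helper name `run_pipeline` is hypothetical, and it assumes both scripts sit in the repository root and take no command-line arguments.

```python
import subprocess
import sys
from pathlib import Path

# Script names are taken from the text above; running them from the
# repository root without arguments is an assumption.
SCRIPTS = ["download_datasets.py", "transform_datasets.py"]


def run_pipeline(repo_root=".", runner=subprocess.run):
    """Run the download step, then the transform step, in that order."""
    root = Path(repo_root)
    for script in SCRIPTS:
        # check=True makes subprocess.run raise CalledProcessError on a
        # non-zero exit status, so the transform step is never attempted
        # after a failed download.
        runner([sys.executable, script], cwd=root, check=True)
```

Calling `run_pipeline()` from the repository root runs both scripts with the same interpreter that runs the helper, which avoids accidentally mixing Python versions.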

Running the two scripts in the given order will create one csv file within each dataset folder. These csv files will be in a standard format that includes the following columns, in this order:

  • workerID: ID of the crowd worker.
  • taskID: ID of the task answered by the corresponding worker.
  • response: Response of the corresponding worker on the task identified by taskID.
  • goldLabel: Gold label of the corresponding task (if available).
  • taskContent: Content of the task answered by the worker (if available).

Only the Sentiment popularity - AMT and Weather Sentiment - AMT datasets will have an additional column:

  • timeSpent: The time the corresponding worker spent on this task.
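As an illustration of how a file in this standard format might be consumed, the sketch below aggregates the individual votes by majority vote and compares the result with the gold labels. The rows are invented toy data, not taken from any dataset in the repository, and the helper names `majority_vote` and `accuracy` are hypothetical. Majority voting is used here only as a simple baseline.

```python
import csv
import io
from collections import Counter, defaultdict

# Toy rows in the standard column order; the values are invented for
# illustration and do not come from any dataset in the repository.
SAMPLE_CSV = """workerID,taskID,response,goldLabel,taskContent
w1,t1,1,1,first example text
w2,t1,1,1,first example text
w3,t1,0,1,first example text
w1,t2,0,0,second example text
w2,t2,0,0,second example text
w3,t2,1,0,second example text
"""


def majority_vote(rows):
    """Return {taskID: most frequent response} over all worker votes."""
    votes = defaultdict(Counter)
    for row in rows:
        votes[row["taskID"]][row["response"]] += 1
    # most_common(1) yields the response with the highest vote count.
    return {task: counts.most_common(1)[0][0] for task, counts in votes.items()}


def accuracy(predictions, rows):
    """Fraction of gold-labelled tasks whose prediction matches the gold label."""
    gold = {r["taskID"]: r["goldLabel"] for r in rows if r["goldLabel"] != ""}
    if not gold:
        return None  # the dataset has no ground truth to compare against
    hits = sum(predictions[task] == label for task, label in gold.items())
    return hits / len(gold)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
predictions = majority_vote(rows)
print(predictions)
print(accuracy(predictions, rows))
```

To apply the same functions to a real file, replace `io.StringIO(SAMPLE_CSV)` with an opened csv file from one of the dataset folders. All values are read as strings, so responses and gold labels are compared as text.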

P.S. If the original dataset includes multiple predicates for a task, we create one csv file for each predicate in the transformed version of the dataset.
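Because a dataset folder may therefore hold more than one csv file, a small helper can list the files per dataset. This is a sketch under assumptions: the helper name `find_csv_files` is hypothetical, the csv files are assumed to sit directly inside each dataset folder, and the result may also include csv files shipped with the original download, since the file names of the transformed output are not stated here.

```python
from pathlib import Path

# The two top-level folder names are taken from the Datasets section.
CATEGORY_FOLDERS = ["binary-classification", "multi-class-classification"]


def find_csv_files(repo_root="."):
    """Return {dataset folder name: sorted list of csv paths found in it}."""
    root = Path(repo_root)
    found = {}
    for category in CATEGORY_FOLDERS:
        category_dir = root / category
        if not category_dir.is_dir():
            continue  # folder missing, e.g. when run outside the repo
        # Match csv files located directly inside each dataset folder.
        for csv_path in sorted(category_dir.glob("*/*.csv")):
            found.setdefault(csv_path.parent.name, []).append(csv_path)
    return found
```

For example, `for name, paths in find_csv_files().items(): print(name, len(paths))` prints how many csv files each dataset folder contains.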

(To obtain the resulting csv files accurately, do not modify any of the directory names or dataset files you downloaded from this repo.)

Usage consent

By using this tool you agree to acknowledge the original datasets and to check their terms and conditions. Some data providers may require authentication, filling in forms, etc. For convenience, we include a link to the original source both in the table above and in the individual repository folders.
