This repository contains materials associated to the paper:
Alan Ramponi and Camilla Casula. 2023. DiatopIt: A Corpus of Social Media Posts for the Study of Diatopic Language Variation in Italy. In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), pages 187–199, Dubrovnik, Croatia. Association for Computational Linguistics. [cite] [paper]
The dataset has recently been used in the context of the GeoLingIt shared task at EVALITA 2023.
The corpus is released in the form of tweet IDs to be hydrated, in compliance to the Twitter developer policy. Data is in the data/
folder and includes three files:
train.txt
: the training set, totalling 13,669 posts;dev.txt
: the development set, totalling 552 posts;test.txt
: the test set, totalling 818 posts.
The text of the post can be rehydrated from the Tweet object of the Twitter json response. Just iterate over the posts (i.e., for post in json_object["data"]
) and get the post["text"]
content.
The location labels can be rehydrated from the Place object of the Twitter json response. Just iterate over the places for the posts (i.e., for place in json_object["includes"]["places"]
) and get the relevant information:
- region:
place["name"].split(", ")[1]
; - latitude:
from_bbox_to_point(place["geo"]["bbox"])[0]
; - longitude:
from_bbox_to_point(place["geo"]["bbox"])[1]
.
The from_bbox_to_point()
function computes latitude and longitude coordinates by
taking the central point from the 4-point bounding box of city areas given by the Twitter APIs. It is defined as follows:
def from_bbox_to_point(bbox):
w_long, s_lat, e_long, n_lat = bbox
latitude = (s_lat + n_lat) / 2
longitude = (w_long + e_long) / 2
return [latitude, longitude]
If you find any issues, please get in touch with us.
If you use or build on top of this work, please cite our paper as follows:
@inproceedings{ramponi-casula-2023-diatopit,
title = "{D}iatop{I}t: A Corpus of Social Media Posts for the Study of Diatopic Language Variation in {I}taly",
author = "Ramponi, Alan and
Casula, Camilla",
booktitle = "Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.vardial-1.19",
pages = "187--199",
}