
Task 2

This page describes Task 2 of the Semantic Publishing Challenge 2017.

Motivation

A lot of information about the papers published on CEUR-WS.org is hidden within the PDF files. Our goal is to extract this data and make it available as Linked Open Data (LOD).

This information should describe how the content of each paper is organized and should provide a deeper understanding of the context in which it was written. In particular, the extracted information is expected to answer queries about the internal organization of sections, tables, figures and footnotes, and about the authors’ affiliations and research institutions.

The queries participants are required to answer are shown below.

The task requires techniques for extracting data from PDF, complemented by techniques for Named-Entity Recognition and Natural Language Processing.
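As a rough illustration (not part of the official rules), the text of a paper can be extracted with an off-the-shelf PDF library and then passed to an NER pipeline. The libraries below (pdfminer.six and spaCy) are only examples of tools one might use, not tools mandated by the challenge.

```python
# Minimal sketch: extract text from a training PDF and look for organisation names.
# pdfminer.six and spaCy are arbitrary choices; any PDF/NER toolkit would do.
from pdfminer.high_level import extract_text
import spacy

text = extract_text("paper2.pdf")          # local copy of http://ceur-ws.org/Vol-1006/paper2.pdf

nlp = spacy.load("en_core_web_sm")         # small English model, installed separately
doc = nlp(text[:2000])                     # the paper header usually contains authors and affiliations

# Organisation entities are candidate affiliations / research institutions (cf. Q2.1, Q2.2).
for ent in doc.ents:
    if ent.label_ == "ORG":
        print(ent.text)
```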

Data Source

The input dataset consists of a set of PDF papers. The papers use different formats and follow different rules for bibliographic references, headers, affiliations and acknowledgements.

Datasets can be downloaded here:

Training Dataset TD2

PDF papers available on CEUR-WS.org. An individual description of each paper is given here; a list of URLs for convenient one-time download is provided below, together with the expected output on TD2.

List of URLs for one-time download (a short download sketch follows the list):

http://ceur-ws.org/Vol-1006/paper2.pdf 
http://ceur-ws.org/Vol-1116/paper6.pdf 
http://ceur-ws.org/Vol-1116/paper7.pdf 
http://ceur-ws.org/Vol-1116/paper8.pdf 
http://ceur-ws.org/Vol-1123/paper4.pdf 
http://ceur-ws.org/Vol-1123/paper5.pdf 
http://ceur-ws.org/Vol-1184/ldow2014_paper_02.pdf 
http://ceur-ws.org/Vol-1184/ldow2014_paper_05.pdf 
http://ceur-ws.org/Vol-1313/paper_14.pdf 
http://ceur-ws.org/Vol-1315/paper15.pdf 
http://ceur-ws.org/Vol-1320/paper_25.pdf 
http://ceur-ws.org/Vol-1324/paper_10.pdf 
http://ceur-ws.org/Vol-1324/paper_4.pdf 
http://ceur-ws.org/Vol-1405/paper-02.pdf 
http://ceur-ws.org/Vol-1405/paper-06.pdf 
http://ceur-ws.org/Vol-1500/paper3.pdf 
http://ceur-ws.org/Vol-1503/01_pap_batot.pdf 
http://ceur-ws.org/Vol-1503/06_pap_rosa.pdf 
http://ceur-ws.org/Vol-1522/Badreddin2015HuFaMo.pdf 
http://ceur-ws.org/Vol-1522/Liebel2015HuFaMo.pdf 
http://ceur-ws.org/Vol-1531/paper8.pdf 
http://ceur-ws.org/Vol-1746/paper-11.pdf 
http://ceur-ws.org/Vol-1746/paper-13.pdf 
http://ceur-ws.org/Vol-1749/paper_006.pdf 
http://ceur-ws.org/Vol-1749/paper_008.pdf 
http://ceur-ws.org/Vol-1751/AICS_2016_paper_12.pdf 
http://ceur-ws.org/Vol-1751/AICS_2016_paper_15.pdf 
http://ceur-ws.org/Vol-1751/AICS_2016_paper_30.pdf 
http://ceur-ws.org/Vol-1755/160-168.pdf 
http://ceur-ws.org/Vol-1755/79-84.pdf 
http://ceur-ws.org/Vol-1758/paper1.pdf 
http://ceur-ws.org/Vol-1758/paper3.pdf 
http://ceur-ws.org/Vol-1760/paper1.pdf 
http://ceur-ws.org/Vol-1760/paper6.pdf 
http://ceur-ws.org/Vol-1766/om2016_poster2.pdf 
http://ceur-ws.org/Vol-1766/om2016_poster4.pdf 
http://ceur-ws.org/Vol-1769/paper01.pdf 
http://ceur-ws.org/Vol-1769/paper09.pdf 
http://ceur-ws.org/Vol-1771/paper13.pdf 
http://ceur-ws.org/Vol-1771/paper3.pdf 
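The list above can be fetched in one go. A minimal sketch, assuming the URLs have been copied into a local file named td2-urls.txt (the file name and output folder are our own choices, not part of the challenge material):

```python
# Download every training paper listed above into a local td2/ folder.
import urllib.request
from pathlib import Path

out_dir = Path("td2")
out_dir.mkdir(exist_ok=True)

with open("td2-urls.txt") as f:                # one URL per line, copied from the list above
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    # Keep the volume in the file name, since names like paper2.pdf repeat across volumes.
    name = "_".join(url.split("/")[-2:])       # e.g. Vol-1006_paper2.pdf
    urllib.request.urlretrieve(url, out_dir / name)
    print("downloaded", name)
```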

Expected output on TD2

The following ZIP file contains the expected output of all queries on all papers in the training dataset: sempub17-TD2.zip

The archive contains the full list of queries (in QUERIES-LIST.csv) and the output of each of them in a separate .csv file.

For each query there is an entry in QUERIES-LIST.csv indicating the identifier of the query and the natural language description. The output of that query is contained in the corresponding .csv file, as shown below:

| QueryID | Natural language description | CSV output file |
|---------|------------------------------|-----------------|
| Q1.1 | Identify the affiliations of the authors of the paper http://ceur-ws.org/Vol-1006/paper2.pdf | Q1.1.csv |
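As a quick sanity check, the unzipped archive can be traversed programmatically. A minimal sketch, assuming the archive has been extracted into a folder named sempub17-TD2 and that the first two columns of QUERIES-LIST.csv are the query identifier and its description (the exact column layout should be verified against the file itself):

```python
# List every query in QUERIES-LIST.csv and check that its output CSV is present.
import csv
from pathlib import Path

base = Path("sempub17-TD2")                      # folder obtained by unzipping the archive (name assumed)

with open(base / "QUERIES-LIST.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        query_id, description = row[0], row[1]   # assumed column order: identifier, description
        expected = base / f"{query_id}.csv"
        status = "ok" if expected.exists() else "missing"
        print(f"{query_id}: {status}  ({description[:60]})")
```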

Evaluation Dataset ED2

The final evaluation will be performed using SemPubEvaluator on a set of papers to be disclosed a few days before the deadline.

Queries

Participants are required to produce a dataset for answering the following queries.

  • Q2.1 (Affiliations in a paper): Identify the affiliations of the authors of the paper X.
  • Q2.2 (Countries in affiliations): Identify the countries of the affiliations of the authors in the paper X.
  • Q2.3 (Supplementary material): Identify the supplementary material(s) for the paper X.
  • Q2.4 (Sections): Identify the titles of the first-level sections of the paper X.
  • Q2.5 (Tables): Identify the captions of the tables in the paper X.
  • Q2.6 (Figures): Identify the captions of the figures in the paper X.
  • Q2.7 (Footnotes): Identify the footnotes in the paper X (or part of it).
  • Q2.8 (EU projects): Identify the EU project(s) that supported the research presented in the paper X (or part of it).

These queries have to be translated into SPARQL according to the challenge's general rules and have to produce output that conforms to the detailed rules.
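To give a flavour of what such a translation might look like, here is a minimal sketch of a Q2.1-style query run with rdflib over a participant-produced dataset. The ex: vocabulary, the property names and the file task2-output.ttl are purely illustrative assumptions; the actual vocabulary is chosen by each participant, subject to the general and detailed rules.

```python
# Illustrative only: evaluate a Q2.1-style SPARQL query over a hypothetical Task 2 dataset.
from rdflib import Graph

g = Graph()
g.parse("task2-output.ttl", format="turtle")     # hypothetical LOD output produced for TD2

Q2_1 = """
PREFIX ex: <http://example.org/sempub#>          # made-up vocabulary, for illustration only
SELECT DISTINCT ?affiliation
WHERE {
  <http://ceur-ws.org/Vol-1006/paper2.pdf> ex:hasAuthor ?author .
  ?author ex:hasAffiliation ?affiliation .
}
"""

for row in g.query(Q2_1):
    print(row.affiliation)
```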