SemPub15_QueriesTask2
Queries for Task 2 of the Semantic Publishing Challenge
More details and explanations will be gradually added to this page. Participants are invited to use the mailing list (https://groups.google.com/forum/#!forum/sempub-challenge) to comment, to ask questions, and to get in touch with chairs and other participants.
Participants are required to translate the input queries into SPARQL queries that can be executed against the produced LOD. The dataset can use any vocabulary, but the query result output must conform to the rules described on this page.
Some preliminary information and general rules:
- queries must produce a CSV output, according to the rules detailed below. The evaluation will be performed automatically by comparing this output (on the evaluation dataset) with the expected results.
- IRIs of workshop volumes and papers must follow this naming convention:
type of resource | URI example |
---|---|
workshop volume | http://ceur-ws.org/Vol-1010/ |
paper | http://ceur-ws.org/Vol-1099/#paper3 |
Papers have fragment IDs like `paper3` in the most recently published workshop proceedings. When processing older workshop proceedings, please derive such IDs from the filenames of the papers by removing the PDF extension (e.g. `paper3.pdf` → `paper3`, or `ldow2011-paper12.pdf` → `ldow2011-paper12`).
- IRIs of other resources (e.g. affiliations, funding agencies) must also be within the http://ceur-ws.org/ namespace, but in a path separate from http://ceur-ws.org/Vol-NNN/ for any number NNN.
- the structure of the IRIs used in the examples is not normative and should not be taken as a hint: participants are free to use their own IRI structure and their own organization of classes and instances.
- All data relevant for the queries and available in the input dataset must be extracted and produced as output. Although the evaluation mechanisms will be implemented so as to take minor differences into account and to normalize them, participants are asked to extract as much information as possible. Further details are given below for each query.
- Since most of the queries take as input a paper (usually denoted as X), participants are required to use an unambiguous way of identifying input papers. To avoid errors, papers are identified by the URL of the PDF file, as available on http://ceur-ws.org (a non-normative sketch of this identification is given after these notes).
- The order of output records does not matter.
We do not provide further explanations for queries whose output looks clear. If anything is unclear, or if there is any other issue, please feel free to ask on the mailing list.
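To make the conventions above concrete, the following minimal sketch shows how an input paper could be selected by the URL of its PDF file and how the fragment-style paper IRI can be derived from the filename. It is not normative: the property `ex:pdfUrl` is a hypothetical placeholder, since participants are free to model the link between a paper and its PDF file in any way they like.

```sparql
# A minimal, non-normative sketch. ex:pdfUrl is a hypothetical placeholder
# for whatever the produced dataset uses to link a paper to its PDF file.
PREFIX ex: <http://example.org/vocab#>

SELECT ?paper ?derivedIri
WHERE {
  # Input paper X, identified by the URL of its PDF file
  BIND(<http://ceur-ws.org/Vol-1099/paper3.pdf> AS ?pdf)
  ?paper ex:pdfUrl ?pdf .

  # Derive the fragment-style paper IRI from the PDF filename, e.g.
  # http://ceur-ws.org/Vol-1099/paper3.pdf -> http://ceur-ws.org/Vol-1099/#paper3
  BIND(STRAFTER(STR(?pdf), "http://ceur-ws.org/Vol-1099/") AS ?filename)
  BIND(IRI(CONCAT("http://ceur-ws.org/Vol-1099/#",
                  STRBEFORE(?filename, ".pdf"))) AS ?derivedIri)
}
```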
--TODO--
The attached ZIP file contains the expected output of all eight queries on all papers in the training dataset.
The archive contains the full list of queries (in `QUERIES-LIST.csv`) and the output of each of them in a separate .csv file. For each query there is an entry in `QUERIES-LIST.csv` indicating the identifier of the query and its natural language description. The output of that query is contained in the corresponding .csv file.
An example follows:
QueryID | Natural language description | CSV output file |
---|---|---|
Q1.1 | Identify the affiliations of the authors of the paper http://ceur-ws.org/Vol-1518/paper1.pdf | Q1.1.csv |
Query: Identify the affiliations of the authors of the paper X
The correct identification of affiliations is tricky and would require modelling complex organizations, sub-organizations and units. A simplified approach is adopted for this task: participants are required to extract one single string for each affiliation, as it appears in the header of the paper, excluding data about the location (address, city, state).
Participants are also asked to extract the full name of each author, without any further processing: author names must be extracted as they appear in the header. No normalization of middle names and initials is required.
During the evaluation process, these values will be lowercased, and spaces and punctuation will be stripped.
Further notes:
- If the affiliation is composed of multiple parts (for instance, it indicates a Department of a University), all these parts must be included in the same affiliation.
- If the affiliation is described on multiple lines, all these lines must be included, apart from data about the location (according to the general rule above). Multiple lines can be collapsed into a single one, since newlines and punctuation will be stripped during the evaluation.
- In case of multiple affiliations for the same author, the query must return one line for each affiliation.
- In case of multiple authors with the same affiliation, the query must return one line for each author.
Expected output format (CSV):
```
affiliation-iri, affiliation-fullname, author-iri, author-fullname
<IRI>,rdfs:Literal,<IRI>,rdfs:Literal
<IRI>,rdfs:Literal,<IRI>,rdfs:Literal
[...]
```
Some examples of output are shown below, others can be found in the training dataset file.
Query Q1.1: Identify the affiliations of the authors of the paper http://ceur-ws.org/Vol-1518/paper1.pdf
```
affiliation-iri, affiliation-fullname, author-iri, author-fullname
<http://ceur-ws.org/affiliation/escuela-superior-politecnica-del-litoral>, "Escuela Superior Politécnica del Litoral", <http://ceur-ws.org/author/xavier-ochoa>, "Xavier Ochoa"
```
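A possible SPARQL translation of this query is sketched below. The `ex:` properties are hypothetical placeholders, since participants choose their own vocabulary; only the four-column output layout shown above is fixed.

```sparql
# A minimal, non-normative sketch: ex:pdfUrl, ex:author, ex:fullName,
# ex:affiliation and ex:name are hypothetical placeholders.
PREFIX ex: <http://example.org/vocab#>

SELECT ?affiliationIri ?affiliationFullname ?authorIri ?authorFullname
WHERE {
  # Input paper X, identified by the URL of its PDF file
  ?paper ex:pdfUrl <http://ceur-ws.org/Vol-1518/paper1.pdf> ;
         ex:author ?authorIri .
  ?authorIri ex:fullName ?authorFullname ;
             ex:affiliation ?affiliationIri .
  ?affiliationIri ex:name ?affiliationFullname .
}
```

Each solution yields one output row per author/affiliation pair, which matches the rules above for authors with multiple affiliations and for affiliations shared by several authors.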
Query: Identify the countries of the affiliations of the authors in the paper X
Participants are required to extract data about affiliations and to identify the country where each research institution is located.
During the evaluation process, the names of the countries will be normalized to lowercase.
Further notes:
- the country names must be in English
- if the country is not explicitly mentioned in the affiliation, it should be derived from external sources
- the article 'the' in the country name is not relevant (for instance, 'The Netherlands' is considered equal to 'Netherlands')
- some acronyms are normalized: for instance, 'UK', 'U.K.' and 'United Kingdom' are considered equivalent, and so are 'USA', 'U.S.A.' and 'United States of America'
Expected output format (CSV):
```
country-iri, country-fullname
<IRI>,rdfs:Literal
<IRI>,rdfs:Literal
[...]
```
Some examples of output are shown below, others can be found in the training dataset file.
Query Q2.3: Identify the countries of the affiliations of the authors in the paper http://ceur-ws.org/Vol-1518/paper3.pdf
```
country-iri, country-fullname
<http://ceur-ws.org/country/germany>, "Germany"
<http://ceur-ws.org/country/the-netherlands>, "The Netherlands"
```
Query Q2.20: Identify the countries of the affiliations of the authors in the paper http://ceur-ws.org/Vol-1500/paper4.pdf
```
country-iri, country-fullname
<http://ceur-ws.org/country/canada>, "Canada"
<http://ceur-ws.org/country/united-kingdom>, "United Kingdom"
```
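A corresponding non-normative sketch for this query, again with hypothetical `ex:` properties, could look as follows; `DISTINCT` avoids repeating a country that hosts several of the paper's affiliations.

```sparql
# A minimal, non-normative sketch with hypothetical ex: properties.
PREFIX ex: <http://example.org/vocab#>

SELECT DISTINCT ?countryIri ?countryFullname
WHERE {
  # Input paper X, identified by the URL of its PDF file
  ?paper ex:pdfUrl <http://ceur-ws.org/Vol-1518/paper3.pdf> ;
         ex:author ?author .
  ?author ex:affiliation ?affiliation .
  ?affiliation ex:country ?countryIri .
  ?countryIri ex:name ?countryFullname .
}
```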
Query: Identify the supplementary material(s) for the paper X
-- TODO --
Expected output format (CSV):
```
material-url
<IRI>
<IRI>
[...]
```
-- TODO --
Query: Identify the titles of the first-level sections of the paper X.
Expected output format (CSV):
```
section-iri, section-number, section-title
<IRI>,xsd:integer,rdfs:Literal
<IRI>,xsd:integer,rdfs:Literal
[...]
```
-- TODO --
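Pending the detailed notes, a non-normative sketch of a possible translation is shown below; the table and figure queries that follow would differ only in the property names. All `ex:` properties are hypothetical placeholders.

```sparql
# A minimal, non-normative sketch with hypothetical ex: properties.
PREFIX ex: <http://example.org/vocab#>

SELECT ?sectionIri ?sectionNumber ?sectionTitle
WHERE {
  # Input paper X, identified by the URL of its PDF file
  ?paper ex:pdfUrl <http://ceur-ws.org/Vol-1518/paper1.pdf> ;
         ex:section ?sectionIri .
  ?sectionIri ex:number ?sectionNumber ;   # expected to be an xsd:integer
              ex:title ?sectionTitle .
}
```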
Query: Identify the captions of the tables in the paper X
-- TODO --
Expected output format (CSV):
```
table-iri, table-number, table-caption
<IRI>,xsd:integer,rdfs:Literal
<IRI>,xsd:integer,rdfs:Literal
[...]
```
-- TODO --
Query: Identify the captions of the figures in the paper X
-- TODO --
Expected output format (CSV):
```
figure-iri, figure-number, figure-caption
<IRI>,xsd:integer,rdfs:Literal
<IRI>,xsd:integer,rdfs:Literal
[...]
```
-- TODO --
Query: Identify the funding agencies that funded the research presented in the paper X (or part of it).
The analysis is restricted to agencies explicitly mentioned in the paper. During the evaluation process, queries that do not meet this requirement will not be used.
Funding agencies must be represented as resources in the produced dataset identified by the resource-iri value.
The name of the agency must be copied directly from the paper, without looking for other information in external data sources. Punctuation, spaces, prepositions and articles in these values will be normalized during the evaluation process.
Note: in case of papers whose research is supported by an EU project, it is not required to include the EU Commission among the funding agencies. That is covered by query Q2.8.
Expected output format (CSV):
```
funding-agency-iri, funding-agency-name, funding-agency-acronym
<IRI>,rdfs:Literal,rdfs:Literal
<IRI>,rdfs:Literal,rdfs:Literal
[...]
```
-- TODO --
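A non-normative sketch of a possible translation, with hypothetical `ex:` properties, is given below; the acronym is wrapped in `OPTIONAL` because not every paper states one.

```sparql
# A minimal, non-normative sketch with hypothetical ex: properties.
PREFIX ex: <http://example.org/vocab#>

SELECT ?fundingAgencyIri ?fundingAgencyName ?fundingAgencyAcronym
WHERE {
  # Input paper X, identified by the URL of its PDF file
  ?paper ex:pdfUrl <http://ceur-ws.org/Vol-1518/paper1.pdf> ;
         ex:fundingAgency ?fundingAgencyIri .
  ?fundingAgencyIri ex:name ?fundingAgencyName .
  OPTIONAL { ?fundingAgencyIri ex:acronym ?fundingAgencyAcronym }
}
```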
Query: Identify the EU project(s) that supported the research presented in the paper X (or part of it).
The analysis is restricted to projects explicitly mentioned in the paper. During the evaluation process, queries that do not meet this requirement will not be used.
Projects must be represented as resources in the produced dataset identified by the resource-iri value.
The names of the projects must be copied directly from the paper, without looking for other information in external data sources. Punctuation, spaces, prepositions and articles in these values will be normalized during the evaluation process.
Expected output format (CSV):
```
project-iri, project-name
<IRI>,rdfs:Literal
<IRI>,rdfs:Literal
[...]
```
-- TODO --
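Finally, a non-normative sketch for this query, again with hypothetical `ex:` properties:

```sparql
# A minimal, non-normative sketch with hypothetical ex: properties.
PREFIX ex: <http://example.org/vocab#>

SELECT ?projectIri ?projectName
WHERE {
  # Input paper X, identified by the URL of its PDF file
  ?paper ex:pdfUrl <http://ceur-ws.org/Vol-1518/paper1.pdf> ;
         ex:euProject ?projectIri .
  ?projectIri ex:name ?projectName .
}
```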