SemPub15_QueriesTask2
Queries for Task 2 of the Semantic Publishing Challenge
More details and explanations will be gradually added to this page. Participants are invited to use the mailing list (https://groups.google.com/forum/#!forum/sempub-challenge) to comment, to ask questions, and to get in touch with chairs and other participants.
Participants are required to translate the input queries into SPARQL queries that can be executed against the produced LOD. The dataset can use any vocabulary, but the query result output must conform to the rules described on this page.
Some preliminary information and general rules:
- queries must produce a CSV output, according to the rules detailed below. The evaluation will be performed automatically by comparing this output (on the evaluation dataset) with the expected results.
- IRIs of workshop volumes and papers must follow this naming convention:
type of resource | URI example |
---|---|
workshop volume | http://ceur-ws.org/Vol-1010/ |
paper | http://ceur-ws.org/Vol-1099/#paper3 |
Papers have fragment IDs like `paper3` in the most recently published workshop proceedings. When processing older workshop proceedings, please derive such IDs from the filenames of the papers by removing the PDF extension (e.g. `paper3.pdf` → `paper3`, or `ldow2011-paper12.pdf` → `ldow2011-paper12`).
- IRIs of other resources (e.g. affiliations, funding agencies) must also be within the http://ceur-ws.org/ namespace, but in a path separate from http://ceur-ws.org/Vol-NNN/ for any number NNN.
- the structure of the IRIs used in the examples is not normative and should not be taken as a hint: participants are free to use their own IRI structure and their own organization of classes and instances.
- All data relevant for the queries and available in the input dataset must be extracted and produced as output. Although the evaluation mechanisms will be implemented so as to take minor differences into account and to normalize them, participants are asked to extract as much information as possible. Further details are given below for each query.
- Since most of the queries take as input a paper (usually denoted as X), participants are required to use an unambiguous way of identifying input papers. To avoid errors, papers are identified by the URL of the PDF file, as available on http://ceur-ws.org (a non-normative sketch of this identification is given after these notes).
- The order of output records does not matter.
We do not provide further explanations for queries whose output looks clear. If anything is unclear, or if there is any other issue, please feel free to ask on the mailing list.
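To make the conventions above concrete, the following minimal sketch shows how an input paper could be selected by the URL of its PDF file and how the fragment-style paper IRI can be derived from the filename. It is not normative: the property `ex:pdfUrl` is a hypothetical placeholder, since participants are free to model the link between a paper and its PDF file in any way they like.

```sparql
# A minimal, non-normative sketch. ex:pdfUrl is a hypothetical placeholder
# for whatever the produced dataset uses to link a paper to its PDF file.
PREFIX ex: <http://example.org/vocab#>

SELECT ?paper ?derivedIri
WHERE {
  # Input paper X, identified by the URL of its PDF file
  BIND(<http://ceur-ws.org/Vol-1099/paper3.pdf> AS ?pdf)
  ?paper ex:pdfUrl ?pdf .

  # Derive the fragment-style paper IRI from the PDF filename, e.g.
  # http://ceur-ws.org/Vol-1099/paper3.pdf -> http://ceur-ws.org/Vol-1099/#paper3
  BIND(STRAFTER(STR(?pdf), "http://ceur-ws.org/Vol-1099/") AS ?filename)
  BIND(IRI(CONCAT("http://ceur-ws.org/Vol-1099/#",
                  STRBEFORE(?filename, ".pdf"))) AS ?derivedIri)
}
```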
--TODO--
The attached ZIP file contains the expected output of all eight queries on all papers in the training dataset.
The archive contains the full list of queries (in `QUERIES-LIST.csv`) and the output of each of them in a separate .csv file. For each query there is an entry in `QUERIES-LIST.csv` indicating the identifier of the query and its natural language description. The output of that query is contained in the corresponding .csv file.
An example follows:
QueryID | Natural language description | CSV output file |
---|---|---|
Q1.1 | Identify the affiliations of the authors of the paper http://ceur-ws.org/Vol-1518/paper1.pdf | Q1.1.csv |
Query: Identify the affiliations of the authors of the paper X
The correct identification of affiliations is tricky and would require modelling complex organizations, sub-organizations and units. A simplified approach is adopted for this task: participants are required to extract one single string for each affiliation, as it appears in the header of the paper, excluding data about the location (address, city, state).
Participants are also asked to extract the full name of each author, without any further processing: author names must be extracted as they appear in the header. No normalization of middle names and initials is required.
During the evaluation process, these values will be lowercased, and spaces and punctuation will be stripped.
Further notes:
- If the affiliation is composed of multiple parts (for instance, it indicates a Department of a University), all these parts must be included in the same affiliation.
- If the affiliation is described on multiple lines, all these lines must be included, apart from data about the location (according to the general rule above). Multiple lines can be collapsed into a single one, since newlines and punctuation will be stripped during the evaluation.
- In case of multiple affiliations for the same author, the query must return one line for each affiliation.
- In case of multiple authors with the same affiliation, the query must return one line for each author.
Expected output format (CSV):
```
affiliation-iri, affiliation-fullname, author-iri, author-fullname
<IRI>,rdfs:Literal,<IRI>,rdfs:Literal
<IRI>,rdfs:Literal,<IRI>,rdfs:Literal
[...]
```
Some examples of output are shown below, others can be found in the training dataset file.
Query Q1.1: Identify the affiliations of the authors of the paper http://ceur-ws.org/Vol-1518/paper1.pdf
```
affiliation-iri, affiliation-fullname, author-iri, author-fullname
<http://ceur-ws.org/affiliation/escuela-superior-politecnica-del-litoral>, "Escuela Superior Politécnica del Litoral", <http://ceur-ws.org/author/xavier-ochoa>, "Xavier Ochoa"
```
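A possible SPARQL translation of this query is sketched below. The `ex:` properties are hypothetical placeholders, since participants choose their own vocabulary; only the four-column output layout shown above is fixed.

```sparql
# A minimal, non-normative sketch: ex:pdfUrl, ex:author, ex:fullName,
# ex:affiliation and ex:name are hypothetical placeholders.
PREFIX ex: <http://example.org/vocab#>

SELECT ?affiliationIri ?affiliationFullname ?authorIri ?authorFullname
WHERE {
  # Input paper X, identified by the URL of its PDF file
  ?paper ex:pdfUrl <http://ceur-ws.org/Vol-1518/paper1.pdf> ;
         ex:author ?authorIri .
  ?authorIri ex:fullName ?authorFullname ;
             ex:affiliation ?affiliationIri .
  ?affiliationIri ex:name ?affiliationFullname .
}
```

Each solution yields one output row per author/affiliation pair, which matches the rules above for authors with multiple affiliations and for affiliations shared by several authors.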
Query: Identify the countries of the affiliations of the authors in the paper X
Participants are required to extract data about affiliations and to identify the country where each research institution is located.
During the evaluation process, the names of the countries will be normalized to lowercase.
Further notes:
- the country names must be in English
- if the country is not explicitly mentioned in the affiliation, it should be derived from external sources
- the article 'the' in the country name is not relevant (for instance, 'The Netherlands' is considered equal to 'Netherlands')
- some acronyms are normalized: for instance, 'UK', 'U.K.' and 'United Kingdom' are considered equivalent, and so are 'USA', 'U.S.A.' and 'United States of America'
Expected output format (CSV):
```
country-iri, country-fullname
<IRI>,rdfs:Literal
<IRI>,rdfs:Literal
[...]
```
Some examples of output are shown below, others can be found in the training dataset file.
Query Q2.3: Identify the countries of the affiliations of the authors in the paper http://ceur-ws.org/Vol-1518/paper3.pdf
```
country-iri, country-fullname
<http://ceur-ws.org/country/germany>, "Germany"
<http://ceur-ws.org/country/the-netherlands>, "The Netherlands"
```
Query Q2.20: Identify the countries of the affiliations of the authors in the paper http://ceur-ws.org/Vol-1500/paper4.pdf
```
country-iri, country-fullname
<http://ceur-ws.org/country/canada>, "Canada"
<http://ceur-ws.org/country/united-kingdom>, "United Kingdom"
```
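A corresponding non-normative sketch for this query, again with hypothetical `ex:` properties, could look as follows; `DISTINCT` avoids repeating a country that hosts several of the paper's affiliations.

```sparql
# A minimal, non-normative sketch with hypothetical ex: properties.
PREFIX ex: <http://example.org/vocab#>

SELECT DISTINCT ?countryIri ?countryFullname
WHERE {
  # Input paper X, identified by the URL of its PDF file
  ?paper ex:pdfUrl <http://ceur-ws.org/Vol-1518/paper3.pdf> ;
         ex:author ?author .
  ?author ex:affiliation ?affiliation .
  ?affiliation ex:country ?countryIri .
  ?countryIri ex:name ?countryFullname .
}
```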
Query: Identify the supplementary material(s) for the paper X
-- TODO --
Expected output format (CSV):
```
material-url
<IRI>
<IRI>
[...]
```
-- TODO --
Query: Identify the titles of the first-level sections of the paper X.
Expected output format (CSV):
```
section-iri, section-number, section-title
<IRI>,xsd:integer,rdfs:Literal
<IRI>,xsd:integer,rdfs:Literal
[...]
```
-- TODO --
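Pending the detailed notes, a non-normative sketch of a possible translation is shown below; the table and figure queries that follow would differ only in the property names. All `ex:` properties are hypothetical placeholders.

```sparql
# A minimal, non-normative sketch with hypothetical ex: properties.
PREFIX ex: <http://example.org/vocab#>

SELECT ?sectionIri ?sectionNumber ?sectionTitle
WHERE {
  # Input paper X, identified by the URL of its PDF file
  ?paper ex:pdfUrl <http://ceur-ws.org/Vol-1518/paper1.pdf> ;
         ex:section ?sectionIri .
  ?sectionIri ex:number ?sectionNumber ;   # expected to be an xsd:integer
              ex:title ?sectionTitle .
}
```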
Query: Identify the captions of the tables in the paper X
-- TODO --
Expected output format (CSV):
```
table-iri, table-number, table-caption
<IRI>,xsd:integer,rdfs:Literal
<IRI>,xsd:integer,rdfs:Literal
[...]
```
-- TODO --
Query: Identify the captions of the figures in the paper X
-- TODO --
Expected output format (CSV):
```
figure-iri, figure-number, figure-caption
<IRI>,xsd:integer,rdfs:Literal
<IRI>,xsd:integer,rdfs:Literal
[...]
```
-- TODO --
Query: Identify the funding agencies that funded the research presented in the paper X (or part of it).
The analysis is restricted to agencies explicitly mentioned in the paper. During the evaluation process, queries that do not meet this requirement will not be used.
Funding agencies must be represented as resources in the produced dataset identified by the resource-iri value.
The name of the agency must be copied directly from the paper, without looking for other information in external data sources. Punctuation, spaces, prepositions and articles in these values will be normalized during the evaluation process.
Note: in case of papers whose research is supported by an EU project, it is not required to include the EU Commission among the funding agencies. That is covered by query Q2.8.
Expected output format (CSV):
```
funding-agency-iri, funding-agency-name, funding-agency-acronym
<IRI>,rdfs:Literal,rdfs:Literal
<IRI>,rdfs:Literal,rdfs:Literal
[...]
```
-- TODO --
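A non-normative sketch of a possible translation, with hypothetical `ex:` properties, is given below; the acronym is wrapped in `OPTIONAL` because not every paper states one.

```sparql
# A minimal, non-normative sketch with hypothetical ex: properties.
PREFIX ex: <http://example.org/vocab#>

SELECT ?fundingAgencyIri ?fundingAgencyName ?fundingAgencyAcronym
WHERE {
  # Input paper X, identified by the URL of its PDF file
  ?paper ex:pdfUrl <http://ceur-ws.org/Vol-1518/paper1.pdf> ;
         ex:fundingAgency ?fundingAgencyIri .
  ?fundingAgencyIri ex:name ?fundingAgencyName .
  OPTIONAL { ?fundingAgencyIri ex:acronym ?fundingAgencyAcronym }
}
```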
Query: Identify the EU project(s) that supported the research presented in the paper X (or part of it).
The analysis is restricted to projects explicitly mentioned in the paper. During the evaluation process, queries that do not meet this requirement will not be used.
Projects must be represented as resources in the produced dataset identified by the resource-iri value.
The names of the projects must be copied directly from the paper, without looking for other information in external data sources. Punctuation, spaces, prepositions and articles in these values will be normalized during the evaluation process.
Expected output format (CSV):
```
project-iri, project-name
<IRI>,rdfs:Literal
<IRI>,rdfs:Literal
[...]
```
-- TODO --
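Finally, a non-normative sketch for this query, again with hypothetical `ex:` properties:

```sparql
# A minimal, non-normative sketch with hypothetical ex: properties.
PREFIX ex: <http://example.org/vocab#>

SELECT ?projectIri ?projectName
WHERE {
  # Input paper X, identified by the URL of its PDF file
  ?paper ex:pdfUrl <http://ceur-ws.org/Vol-1518/paper1.pdf> ;
         ex:euProject ?projectIri .
  ?projectIri ex:name ?projectName .
}
```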