SemPub2015 Tools and Extensions

This project implements FitLayout-based applications and tools for automatic information extraction from the CEUR-WS.org workshop proceedings pages. The tools were created as a proposed solution to Task 1 of the Semantic Publishing Challenge 2015, co-located with the Extended Semantic Web Conference 2015.

The project was later extended to process all the workshops hosted at CEUR. The most recently generated dataset is available in dataset/all.ttl.tar.gz.

How to Build the tool

The whole package is built using Maven, so Maven must be installed. After cloning the repository, run mvn package to create the runnable SemPub2015Extractor.jar program. We also provide an already compiled version at https://sourceforge.net/projects/toolesws/files/SemPub2015Extractor.jar.

To run the extraction tool, use:

java -jar SemPub2015Extractor.jar

This starts a FitLayout JavaScript console. Use the help() command to obtain more information.

Please note that Java version 1.7 is required; otherwise you will get errors.

Data storage

The program stores the generated data in Blazegraph; for details, see About_Blazegraph. The program assumes that the Blazegraph storage is running at http://localhost:9999/blazegraph. To connect to another repository, use storage.connect(). You can get the latest version of the Blazegraph software from https://www.blazegraph.com/download/.
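Before starting an extraction, it can help to confirm that the Blazegraph storage is actually reachable. The following is a minimal Python sketch and not part of this project. The helper names sparql_endpoint and is_reachable are hypothetical, and the namespace/kb/sparql path assumes Blazegraph's default kb namespace, which may differ in your installation.

```python
import urllib.error
import urllib.request

# Base URL that the extractor assumes by default (taken from this README).
BLAZEGRAPH_BASE = "http://localhost:9999/blazegraph"


def sparql_endpoint(base=BLAZEGRAPH_BASE, namespace="kb"):
    """Build a SPARQL endpoint URL (assumes Blazegraph's default URL layout)."""
    return "{}/namespace/{}/sparql".format(base.rstrip("/"), namespace)


def is_reachable(url, timeout=5.0):
    """Return True if an HTTP request to `url` receives any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status.
        return True
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, or malformed URL.
        return False
```

Calling is_reachable(BLAZEGRAPH_BASE) before running the extraction commands gives an early hint that the storage needs to be started or that storage.connect() should point elsewhere.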

Running the Extraction Task

Option 1. To accomplish SemPub2015 Task 1, enter the following commands at the prompt of the FitLayout JavaScript console:

processEvaluationSet(); 
transformToDomain();

Option 2. To process all the workshops hosted at CEUR, use the following commands:

processAllData(); 
transformToDomain();

Option 3. To process a single volume, such as http://ceur-ws.org/Vol-1/, use the following commands:

processPage('http://ceur-ws.org/Vol-1/'); 
transformToDomain();

After this, the storage should contain the complete extracted data.

Serialize RDF Data

You can serialize the generated RDF dataset from the repository using the Python script provided at dataset/serializer.py; the generated file is called all.ttl.gz. You can also use the most recently generated dataset, located at dataset/all.ttl.tar.gz.
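To peek into the published archive without unpacking it by hand, something like the following Python sketch could be used. It is not the project's serializer.py. The function name iter_turtle_lines is hypothetical, and the sketch assumes only that the archive is a gzip-compressed tar file containing one or more UTF-8 encoded Turtle files.

```python
import io
import tarfile


def iter_turtle_lines(archive_path, limit=10):
    """Yield (member name, line) for up to `limit` lines of each file in a .tar.gz."""
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and other special entries
            raw = archive.extractfile(member)
            text = io.TextIOWrapper(raw, encoding="utf-8")
            for index, line in enumerate(text):
                if index >= limit:
                    break
                yield member.name, line.rstrip("\n")
```

For example, iterating over iter_turtle_lines("dataset/all.ttl.tar.gz") and printing each pair shows the prefix declarations and the first few triples of the dataset.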

SPARQL Queries

The SPARQL queries corresponding to the individual SemPub2015 queries are located in sparql/ESWC2015-queries.txt.

The transformation query from the domain-independent logical model to the domain-dependent CEUR workshop ontology is located in logicalTree2domain.sparql. The transformation itself is included in the transformToDomain() call so it's not necessary to execute this query manually.
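Once the extracted data is in the storage, the queries from sparql/ESWC2015-queries.txt can be sent to the repository over HTTP. The following Python sketch shows one way to do this using the standard SPARQL 1.1 Protocol (a form-encoded POST with a query parameter). The function names build_query_request and run_select are hypothetical, and the endpoint URL assumes Blazegraph's default kb namespace under the base address mentioned above; adjust it to your setup.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the README's base URL plus Blazegraph's default "kb" namespace.
ENDPOINT = "http://localhost:9999/blazegraph/namespace/kb/sparql"


def build_query_request(query, endpoint=ENDPOINT):
    """Build a SPARQL 1.1 Protocol POST request that asks for JSON results."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Accept": "application/sparql-results+json",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


def run_select(query, endpoint=ENDPOINT):
    """Send a SELECT query and return the list of result bindings."""
    with urllib.request.urlopen(build_query_request(query, endpoint)) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload["results"]["bindings"]
```

With a running storage, run_select("SELECT * WHERE { ?s ?p ?o } LIMIT 5") returns a list of dictionaries, one per result row, keyed by variable name.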

Related publications by original developers of the tool

The related publication is the following:

MILIČKA Martin and BURGET Radek. Information Extraction from Web Sources based on Multi-aspect Content Analysis. In: Semantic Web Evaluation Challenges, SemWebEval 2015 at ESWC 2015. Portorož: Springer International Publishing, 2015, pp. 81-92. ISBN 978-3-319-25517-0. ISSN 1865-0929.

LICENSE of the tool

Detailed information is contained in LICENSE.

Acknowledgements

The original work was supported by the BUT FIT grant FIT-S-14-2299 and the IT4Innovations Centre of Excellence CZ.1.05/1.1.00/02.0070. Currently, this work is related to the EU project OpenAIRE2020 (643410).