Wikipedia NIF corpus creator

Reads triples whose objects are HTML literals and converts them into NIF 2.0 files.

WikiCorpusGenerator

Expects an N-Triples file with HTML literals containing Wikipedia abstracts. Parses them, cleans the text, and produces one nif:Context resource for each abstract and one nif:Word resource for each link. The output is split across a number of Turtle files, because file sizes get unwieldy otherwise.
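To make the conversion step concrete, here is a minimal sketch of turning one HTML abstract into a nif:Context plus one nif:Word per link. It is not the code from this repository: the class name `AbstractToNifSketch`, the method `toTurtle`, and the regex-based tag stripping are assumptions made for illustration, and a real run over Wikipedia HTML would want a proper HTML parser and RDF library. The `#char=begin,end` URIs follow the RFC 5147 style commonly used with NIF 2.0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; names and approach are assumptions, not the repository's code. */
public class AbstractToNifSketch {

    /** A link found in the abstract: offsets into the cleaned text plus the link target. */
    static final class Link {
        final int begin, end;
        final String anchor, target;
        Link(int begin, int end, String anchor, String target) {
            this.begin = begin; this.end = end; this.anchor = anchor; this.target = target;
        }
    }

    static final String PREFIXES =
        "@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .\n"
      + "@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .\n\n";

    // Handles simple <a href="..."> anchors only; HTML entities are not decoded.
    private static final Pattern ANCHOR =
        Pattern.compile("<a\\s+href=\"([^\"]*)\"[^>]*>(.*?)</a>", Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    private static String stripTags(String s) {
        return TAG.matcher(s).replaceAll("");
    }

    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    /** Builds the cleaned text and records each link's offsets within it. */
    private static String extract(String html, List<Link> links) {
        StringBuilder text = new StringBuilder();
        Matcher m = ANCHOR.matcher(html);
        int last = 0;
        while (m.find()) {
            text.append(stripTags(html.substring(last, m.start())));
            String anchor = stripTags(m.group(2));
            int begin = text.length();          // offset in UTF-16 code units
            text.append(anchor);
            links.add(new Link(begin, text.length(), anchor, m.group(1)));
            last = m.end();
        }
        text.append(stripTags(html.substring(last)));
        return text.toString();
    }

    /** Serializes one abstract as Turtle: one nif:Context and one nif:Word per link. */
    public static String toTurtle(String baseUri, String html) {
        List<Link> links = new ArrayList<>();
        String text = extract(html, links);
        String context = "<" + baseUri + "#char=0," + text.length() + ">";
        StringBuilder ttl = new StringBuilder(PREFIXES);
        ttl.append(context).append(" a nif:Context ;\n")
           .append("    nif:isString \"").append(escape(text)).append("\" .\n");
        for (Link l : links) {
            // Indices are written as plain integers; typed literals are left out for brevity.
            ttl.append("<").append(baseUri).append("#char=").append(l.begin).append(',').append(l.end)
               .append("> a nif:Word ;\n")
               .append("    nif:referenceContext ").append(context).append(" ;\n")
               .append("    nif:beginIndex ").append(l.begin).append(" ;\n")
               .append("    nif:endIndex ").append(l.end).append(" ;\n")
               .append("    nif:anchorOf \"").append(escape(l.anchor)).append("\" ;\n")
               .append("    itsrdf:taIdentRef <").append(l.target).append("> .\n");
        }
        return ttl.toString();
    }
}
```

Calling `AbstractToNifSketch.toTurtle("http://example.org/Leipzig", html)` on an abstract with a single link yields one context resource and one word resource whose offsets point into the cleaned text, not into the original HTML.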

NIFCorpusSurfaceFormEnricher

Takes the generated NIF corpus and a number of surface forms extracted from the corpus with another tool (not contained in this repository). For each entity that was linked in a text, it adds nif:Word resources for that entity's surface forms to the corpus and writes a new, enriched corpus.
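The enrichment idea can be sketched roughly as follows. This is a guess at the general approach, not the repository's implementation: the class `SurfaceFormEnricherSketch`, the `Mention` type, the surface-form map (surface form to entity URI), and the matching rules (exact string match, word boundaries, longest form first, no overlap with existing annotations) are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; names and matching rules are assumptions. */
public class SurfaceFormEnricherSketch {

    /** An annotated span of the context string, pointing at an entity URI. */
    public static final class Mention {
        public final int begin, end;
        public final String entity;
        public Mention(int begin, int end, String entity) {
            this.begin = begin; this.end = end; this.entity = entity;
        }
    }

    private static boolean overlaps(Mention c, List<Mention> taken) {
        for (Mention t : taken) {
            if (c.begin < t.end && t.begin < c.end) return true;
        }
        return false;
    }

    // Rejects matches inside longer tokens, e.g. a form followed directly by more letters.
    private static boolean atWordBoundary(String text, int begin, int end) {
        boolean left = begin == 0 || !Character.isLetterOrDigit(text.charAt(begin - 1));
        boolean right = end == text.length() || !Character.isLetterOrDigit(text.charAt(end));
        return left && right;
    }

    /**
     * Returns new mentions for surface forms of entities already linked in this text.
     * Surface forms of entities that are not linked in the text are ignored.
     */
    public static List<Mention> enrich(String text, List<Mention> linked,
                                       Map<String, String> surfaceForms) {
        Set<String> linkedEntities = new HashSet<>();
        for (Mention m : linked) linkedEntities.add(m.entity);

        // Try longer surface forms first so they win over shorter, nested ones.
        List<String> forms = new ArrayList<>(surfaceForms.keySet());
        forms.sort(Comparator.comparingInt(String::length).reversed());

        List<Mention> taken = new ArrayList<>(linked);
        List<Mention> added = new ArrayList<>();
        for (String form : forms) {
            String entity = surfaceForms.get(form);
            if (form.isEmpty() || !linkedEntities.contains(entity)) continue;
            for (int i = text.indexOf(form); i >= 0; i = text.indexOf(form, i + 1)) {
                Mention candidate = new Mention(i, i + form.length(), entity);
                if (atWordBoundary(text, candidate.begin, candidate.end)
                        && !overlaps(candidate, taken)) {
                    taken.add(candidate);
                    added.add(candidate);
                }
            }
        }
        added.sort(Comparator.comparingInt(m -> m.begin));
        return added;
    }
}
```

Each returned `Mention` would then be written out as an additional nif:Word with the same properties as the link-derived ones. Restricting matches to entities already linked in the same text keeps ambiguous surface forms from being attached to unrelated abstracts.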

Both classes are in an extremely brittle state. Sorry about that.

der-bruemmer/NifWikiCorpus

Folders and files

Latest commit

History

Repository files navigation

Wikipedia NIF corpus creator

WikiCorpusGenerator

NIFCorpusSurfaceFormEnricher

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages