Empty URLs have been incorrectly resolved in the downloaded turtle files #1

Open
isotes opened this issue Jan 19, 2021 · 1 comment

isotes commented Jan 19, 2021

It seems that empty URLs are in some cases resolved against the enclosing RDF file, leading to incorrect (and, at least for Münster, malformed) entries in the Turtle files.

Example from Münster_(Westfalen).rdf:

<https://opendata.stadt-muenster.de/dataset/sporthallen-und-sportst%C3%A4tten-standorte/resource/96e271af-7e05-4c2e-9406-17a3535e88a2>
        a                    cat:Distribution ;
        dcterms:description  "" ;
        dcterms:format       "wms" ;
        dcterms:issued       "2019-07-01T17:33:24+02:00"^^xsd:date ;
        dcterms:modified     "2019-07-18T11:21:22+02:00"^^xsd:date ;
        dcterms:title        "Sporthallen und Sportstätten - Standorte - WMS-GetMap" ;
        cat:accessURL        <https://opendata.stadt-muenster.de/dataset/sporthallen-und-sportst%C3%A4tten-standorte/resource/96e271af-7e05-4c2e-9406-17a3535e88a2> ;
        cat:byteSize         "" ;
        cat:downloadURL      <file:///home/lisa/repos/crawling/target/Münster_(Westfalen).rdf> ;
        cat:mediaType        "" ;
        foaf:page            "https://opendata.stadt-muenster.de/dataset/sporthallen-und-sportst%C3%A4tten-standorte/resource/96e271af-7e05-4c2e-9406-17a3535e88a2" .

The value cat:downloadURL <file:///home/lisa/repos/crawling/target/Münster_(Westfalen).rdf> is incorrect, and it is also malformed because of the unencoded 'ü'.

Looking at the catalog entry on the website, it should be empty: <dcat:downloadURL rdf:resource=""/>

Grepping for 'home/lisa' in catalog/toLoad yields 2948 matches across various fields (at least dcat:accessURL, dcat:downloadURL, and vcard:hasURL). I did not check whether the cause is always an empty URL in the original data.
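
To see which fields are affected, the grep can be broken down by predicate. The following is a rough sketch, not part of the crawling pipeline: the function name `count_by_predicate` is made up for illustration, and the regex assumes pretty-printed Turtle with one predicate and one IRI object per line, as in the example above.

```python
import re
from collections import Counter

# Heuristic (assumption): a predicate followed by an IRI object on the same line,
# as produced by pretty-printed Turtle. This is not a real Turtle parser.
PREDICATE_IRI = re.compile(r"^\s*(\S+)\s+<([^>]*)>")


def count_by_predicate(lines, needle="home/lisa"):
    """Count lines whose IRI object contains `needle`, grouped by predicate."""
    counts = Counter()
    for line in lines:
        match = PREDICATE_IRI.match(line)
        if match and needle in match.group(2):
            counts[match.group(1)] += 1
    return counts


# Two lines taken from the example above.
sample = [
    "        cat:accessURL        <https://opendata.stadt-muenster.de/dataset/sporthallen-und-sportst%C3%A4tten-standorte/resource/96e271af-7e05-4c2e-9406-17a3535e88a2> ;",
    "        cat:downloadURL      <file:///home/lisa/repos/crawling/target/Münster_(Westfalen).rdf> ;",
]
result = count_by_predicate(sample)
print(result)
```

The function could be fed the lines of each file under catalog/toLoad. Lines that the heuristic does not recognize are silently skipped, so the totals may be lower than the plain grep count.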

Aklakan (Member) commented Jan 20, 2021

So there are multiple problems:

  • Sometimes the rdf:resource attribute contains HTML markup instead of a plain URL, for example:
<dcat:downloadURL rdf:resource="<div class="field field-name-field-link-remote-file field-type-file field-label-hidden"><div class="field-items"><div class="field-item even">https://www.stadt-muenster.de/ows/mapserv706/odgruenserv?REQUEST=GetFeature&SERVICE=WFS&VERSION=1.0.0&TYPENAME=ms:Gruenflaechen&OUTPUTFORMAT=CSV&EXCEPTIONS=XML&MAXFEATURES=70000</div></div></div>"/>
  • Sometimes there are also empty IRIs, i.e. <>. We need to ensure that relative IRIs are resolved against the URL from which the document was retrieved.
  • sparql-integrate might not set the base URL correctly when downloading RDF from the web; this needs to be checked.
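
The effect of the base URL on an empty IRI can be illustrated with Python's standard `urllib.parse.urljoin`. This is only a sketch of the resolution rule, not of what sparql-integrate does internally; the helper `resolve` and both base URLs are hypothetical.

```python
from urllib.parse import urljoin


def resolve(reference: str, base: str) -> str:
    """Resolve a (possibly empty) relative reference against a base URL."""
    return urljoin(base, reference)


# Hypothetical bases: a local download location and the original retrieval URL.
local_base = "file:///home/user/target/catalog.rdf"
remote_base = "https://example.org/dcat/catalog.rdf"

# An empty reference resolves to the base itself, so the choice of base
# decides whether a local file path ends up in the output.
print(resolve("", local_base))   # file:///home/user/target/catalog.rdf
print(resolve("", remote_base))  # https://example.org/dcat/catalog.rdf

# A non-empty relative reference is resolved against the base's directory.
print(resolve("dataset/1", remote_base))  # https://example.org/dcat/dataset/1
```

If the parser is given the local file path as base, every empty IRI turns into that path, which would match the `file:///home/lisa/...` values reported above.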
