Metadata exploration #194

Closed · bpoldrack opened this issue Mar 6, 2023 · 1 comment

bpoldrack commented Mar 6, 2023

This is a record of some experiences while exploring issues with depositing metadata on JülichDATA.

The first thing to notice is that the creation of a dataset on dataverse (or at least on this instance) succeeds even if the provided metadata record is considered invalid. This is particularly true for not (completely) providing what is considered required, leading to issues #191 and #192 (while providing a full record in this case led to #193).

This means that running create-sibling-dataverse can lead to a situation on dataverse that we cannot currently fix (or even properly report on), and that would need "manual" fixing via the web interface before we can continue to use it with this extension.
To me this alone implies that we'd want two things:

  • some way to locally validate metadata before we produce that situation
  • an update-metadata command that can be used to fix the situation. This need seems to persist even once a metalad-based push patch is implemented (which I suppose is not happening soon anyway); such a patch can later still utilize this command or its internal functions (see the sketch after this list for the underlying API call). @mih seems to have essentially come to the same conclusion (have that command).
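
As far as I can tell from the Dataverse native API guide, the call such a command would wrap is a PUT against the draft version of the dataset. A minimal sketch with requests (BASE_URL, API_TOKEN, PID and the file name are placeholders; I assume the request body is the datasetVersion content incl. metadataBlocks, i.e. the structure shown further below):

import json
import requests

BASE_URL = "https://data.fz-juelich.de"
API_TOKEN = "..."                        # a Dataverse API token
PID = "doi:10.5072/FK2/EXAMPLE"          # hypothetical persistent identifier

# the corrected metadata record (assumed: datasetVersion content incl. metadataBlocks)
with open("dataset-update-metadata.json") as f:
    new_version = json.load(f)

response = requests.put(
    f"{BASE_URL}/api/datasets/:persistentId/versions/:draft",
    params={"persistentId": PID},
    headers={"X-Dataverse-key": API_TOKEN},
    json=new_version,
)
response.raise_for_status()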

Which metadata is required is determined by the collection, though, not the instance. dataverse comes with a default (the citation metadata block) and an instance can change that default, but ultimately it's the collection that really determines it. This implies that knowing the default of a concrete instance is still insufficient. According to dataverse's Matrix channel and my own search, there is no way of querying a collection for what actually is required.

What we can query an instance for is which metadata blocks are used with it and some information about their structure (without an annotation of which fields are required).
The names of the available metadata blocks can be fetched via curl "https://data.fz-juelich.de/api/metadatablocks", and the "schema" per block like so (for the metadata block "fzj"): curl "https://data.fz-juelich.de/api/metadatablocks/fzj". What that gives is sadly not a JSON schema that could be used to generically validate against. It has most of the things that would be needed for that, but not in the right form (especially the data types - they seem to be the types used by the DB backend, rather than something JSON schema would allow for).
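
Both are plain GETs; a small sketch with requests (reusing the instance URL from above):

import requests

BASE_URL = "https://data.fz-juelich.de"

# names of all metadata blocks the instance knows about
blocks = requests.get(f"{BASE_URL}/api/metadatablocks").json()["data"]
print([block["name"] for block in blocks])

# per-block structure: field names and (DB-flavored) types,
# but no annotation of what is required
fzj = requests.get(f"{BASE_URL}/api/metadatablocks/fzj").json()["data"]
for name, field in fzj["fields"].items():
    print(name, field.get("type"))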

The way dataverse expects the metadata JSON to be provided is actually hard to formalize in a JSON schema. That is mostly because the expected structure comes with a list of anonymous objects, which carry their name as a property rather than having that name as a key with the object as the value. That way one cannot easily define the requirements in a schema.

@loj has provided an example and some requirements here: https://github.com/psychoinformatics-de/org/issues/134

The way pydataverse deals with that problem is an incomplete schema for dataverse's default blocks (https://github.com/gdcc/pyDataverse/blob/master/src/pyDataverse/schemas/json/dataset_upload_default_schema.json) and a lot of hardcoded additional knowledge that is used by a customized validator (https://github.com/gdcc/pyDataverse/blob/master/src/pyDataverse/models.py#L334 ff).

In an ideal world we can come up with a proper schema that can be generically validated via the jsonschema package and would be reasonably easy to extend for new metadata blocks, so that users can deploy and configure their own version and have the validation run against it (a sketch of that workflow follows below).
The hardcoding approach of pydataverse I would deem unmaintainable. Take the example of JülichDATA: even if everybody sticks to the default requiring the citation and the fzj block, that would need regular updates (because for some reason pof3 and pof4 are separate fields, so obviously pof5, etc. need to be added whenever they become available). If anyone chooses to have different requirements for their institute's collection (like "also fill in the life sciences block"), someone would need to provide the hardcoded validation for that. Hence, if we can find a reasonable way to have them provide a definition that can be used generically, that would be much better.
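
To illustrate, the kind of workflow I mean, as a sketch (the schema and record file names are made up):

import json
from jsonschema import validate, ValidationError

# a schema file the user (or their institution) deploys and points us to
with open("dataverse_dataset_schema.json") as f:
    schema = json.load(f)

# the metadata record we are about to deposit
with open("metadata_record.json") as f:
    record = json.load(f)

try:
    validate(instance=record, schema=schema)
except ValidationError as e:
    print(f"Invalid metadata: {e.message}")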

I do think that it likely is possible to do that properly with JSON schema, since it actually allows for quite complex things, including conditionals, but I have failed to wrap my head around it so far.
The problem is this. What dataverse expects is a structure like this:

{
  "datasetVersion": {
    "metadataBlocks": {
      "citation": {
        "displayName": "Citation Metadata",
        "fields": [
          {
            "typeName": "title",
            "multiple": false,
            "typeClass": "primitive",
            "value": "Darwin's Finches"
          },
          {
            "typeName": "author",
            "multiple": true,
            "typeClass": "compound",
            "value": [
              {
                "authorName": {
                  "typeName": "authorName",
                  "multiple": false,
                  "typeClass": "primitive",
                  "value": "Finch, Fiona"
                }
              }
            ]
          },
          
          ...

The trouble is that list in fields. Required fields cannot be named as such. Instead the requirement is that this list contains objects whose typeName property's value matches the list of required fields. There may be additional objects, but for everything required there needs to be an object in that list with the respective value in that property. I can't figure out how exactly to express this; a partial attempt follows below.
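
That said, JSON Schema's contains keyword (draft 06 and later) seems to match this pattern: one allOf branch per required typeName, each demanding that at least one item of the fields array carries that value. A sketch using the jsonschema package (the set of required typeNames is just an example, not what JülichDATA actually requires):

from jsonschema import validate

def required_field(type_name):
    # at least one item of the array must be an object
    # whose typeName equals the given name
    return {
        "contains": {
            "type": "object",
            "properties": {"typeName": {"const": type_name}},
            "required": ["typeName"],
        }
    }

fields_schema = {
    "type": "array",
    # one clause per required field; additional objects remain allowed
    "allOf": [required_field(name) for name in ("title", "author")],
}

schema = {
    "type": "object",
    "required": ["datasetVersion"],
    "properties": {
        "datasetVersion": {
            "type": "object",
            "required": ["metadataBlocks"],
            "properties": {
                "metadataBlocks": {
                    "type": "object",
                    "required": ["citation"],
                    "properties": {
                        "citation": {
                            "type": "object",
                            "required": ["fields"],
                            "properties": {"fields": fields_schema},
                        }
                    },
                }
            },
        }
    },
}

record = {
    "datasetVersion": {
        "metadataBlocks": {
            "citation": {
                "fields": [
                    {"typeName": "title", "multiple": False,
                     "typeClass": "primitive", "value": "Darwin's Finches"},
                    {"typeName": "author", "multiple": True,
                     "typeClass": "compound", "value": []},
                ]
            }
        }
    }
}

validate(instance=record, schema=schema)  # passes; drop the title item and it raises

This only covers top-level required fields; required children of compound fields (like authorName within author) would need the same contains pattern applied to each compound value list.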

mih commented Mar 14, 2023

I am closing this FOI issue. Metadata handling is (for now) no longer in scope for the planned set of commands.

mih closed this as completed Mar 14, 2023