No "identifier" to correspond to the one in datacite XSD/docs? #102

Closed
yarikoptic opened this issue Nov 19, 2024 · 10 comments

@yarikoptic

Package version (if known): currently v1.1.2-22-g26f3974

Describe the bug

There is "Identifier" defined

among the rest:

image

but it seems that jsonschema does not have it defined anywhere... for comparison to above here are the properties in jsonschema

image

@yarikoptic
Author

well, I guess it is not there because you do not expect people to provide it -- that "identifier" (DOI) is set by datacite. But nevertheless - it is part of the schema, so likely should be there!

@tmorrell
Contributor

Version 4.5 uses the doi for the content that is in the XML identifiers field, and alternateIdentifiers for any other identifiers. See all the changes in 4.5 at https://github.com/inveniosoftware/datacite/blob/master/CHANGES.rst

DOI is the only valid identifier for the XML identifiers field, and DataCite is no longer putting the DOI in the json identifiers field (see https://api.test.datacite.org/dois/10.82433/B09Z-4K37?publisher=true&affiliation=true). So for 4.5 the group decided that doi made more sense.

In general the jsonschema tries to represent the DataCite json format, and not be an exact copy of the XML schema.

I'll add a quick response in dandi/dandi-schema#261 but it looks like you came to the same conclusion. Feel free to reopen if you want to discuss more.
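
As a rough illustration of that 4.5 change, the sketch below moves a DOI entry out of an older-style identifiers list into a top-level doi field and puts everything else into alternateIdentifiers. The function name and the sample record are hypothetical, and the key names alternateIdentifier / alternateIdentifierType are an assumption about the 4.5 JSON layout rather than something confirmed in this thread.

```python
import copy


def split_identifiers(record):
    """Return a copy of `record` with `identifiers` split into `doi` and
    `alternateIdentifiers` (hypothetical helper, not part of the datacite package)."""
    result = copy.deepcopy(record)
    alternates = list(result.get("alternateIdentifiers", []))
    for entry in result.pop("identifiers", []):
        if entry.get("identifierType", "").upper() == "DOI":
            # Only the DOI corresponds to the XML identifier field.
            result["doi"] = entry["identifier"]
        else:
            # Assumed 4.5 key names for every other identifier.
            alternates.append(
                {
                    "alternateIdentifier": entry["identifier"],
                    "alternateIdentifierType": entry.get("identifierType", ""),
                }
            )
    if alternates:
        result["alternateIdentifiers"] = alternates
    return result


old_style = {
    "identifiers": [
        {"identifier": "10.48324/DANDI.000897/0.240605.1710", "identifierType": "DOI"},
        {
            "identifier": "https://dandiarchive.org/dandiset/000897/0.240605.1710",
            "identifierType": "URL",
        },
    ]
}
new_style = split_identifiers(old_style)
print(new_style["doi"])
```

The input record is left untouched, so the old and new shapes can be compared side by side.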

@yarikoptic
Author

> In general the jsonschema tries to represent the DataCite json format, and not be an exact copy of the XML schema.

That's what I am trying overall to grasp here -- where is the ground truth? ;) I thought it was that XML schema, with docs and jsonschema to provide alternative serializations. But if it is "DataCite json format" -- where is that one "defined" and how does it relate to the XML schema? (sorry for all the questions)

On that aspect, how does the jsonschema relate to the schema of the "datacite json" formatted output from doi.org?

They also seem to diverge quite a bit:

❯ pwd
/home/yoh/proj/datacite/inveniosoftware-datacite/datacite/schemas
❯ curl --silent -LH "Accept: application/vnd.datacite.datacite+json" https://doi.org/10.48324/DANDI.000897/0.240605.1710 | check-jsonschema --traceback-mode full --schemafile datacite-v4.5.json -
Schema validation errors were encountered.
  -::$: Additional properties are not allowed ('agency', 'clientId', 'id', 'identifiers', 'providerId', 'state' were unexpected)
  -::$.types: Additional properties are not allowed ('bibtex', 'citeproc', 'ris', 'schemaOrg' were unexpected)
  -::$.publicationYear: 2024 is not of type 'string'
Here is that JSON, pretty-printed:
{
  "id": "https://doi.org/10.48324/dandi.000897/0.240605.1710",
  "doi": "10.48324/DANDI.000897/0.240605.1710",
  "url": "https://dandiarchive.org/dandiset/000897/0.240605.1710",
  "types": {
    "ris": "DATA",
    "bibtex": "misc",
    "citeproc": "dataset",
    "schemaOrg": "Dataset",
    "resourceType": "Neural Data",
    "resourceTypeGeneral": "Dataset"
  },
  "creators": [
    {
      "name": "Neupane, Sujaya",
      "nameType": "Personal",
      "givenName": "Sujaya",
      "familyName": "Neupane",
      "affiliation": [],
      "nameIdentifiers": [
        {
          "schemeUri": "https://orcid.org/",
          "nameIdentifier": "0000-0002-0052-3122",
          "nameIdentifierScheme": "ORCID"
        }
      ]
    },
    {
      "name": "Fiete, Ila",
      "nameType": "Personal",
      "givenName": "Ila",
      "familyName": "Fiete",
      "affiliation": [],
      "nameIdentifiers": [
        {
          "schemeUri": "https://orcid.org/",
          "nameIdentifier": "0000-0003-4738-2539",
          "nameIdentifierScheme": "ORCID"
        }
      ]
    },
    {
      "name": "Jazayeri, Mehrdad",
      "nameType": "Personal",
      "givenName": "Mehrdad",
      "familyName": "Jazayeri",
      "affiliation": [],
      "nameIdentifiers": [
        {
          "schemeUri": "https://orcid.org/",
          "nameIdentifier": "0000-0002-9764-6961",
          "nameIdentifierScheme": "ORCID"
        }
      ]
    }
  ],
  "titles": [
    {
      "title": "Neupane_Fiete_Jazayeri_Mental navigation_NHP_EntorhinalCortex"
    }
  ],
  "publisher": {
    "name": "DANDI Archive"
  },
  "subjects": [
    {
      "subject": "entorhinal cortex, cognitive map, mental navigation,"
    }
  ],
  "contributors": [
    {
      "name": "Neupane, Sujaya",
      "nameType": "Personal",
      "givenName": "Sujaya",
      "familyName": "Neupane",
      "affiliation": [],
      "contributorType": "ContactPerson",
      "nameIdentifiers": [
        {
          "schemeUri": "https://orcid.org/",
          "nameIdentifier": "0000-0002-0052-3122",
          "nameIdentifierScheme": "ORCID"
        }
      ]
    },
    {
      "name": "Jazayeri, Mehrdad",
      "nameType": "Personal",
      "givenName": "Mehrdad",
      "familyName": "Jazayeri",
      "affiliation": [],
      "contributorType": "ContactPerson",
      "nameIdentifiers": [
        {
          "schemeUri": "https://orcid.org/",
          "nameIdentifier": "0000-0002-9764-6961",
          "nameIdentifierScheme": "ORCID"
        }
      ]
    }
  ],
  "publicationYear": 2024,
  "identifiers": [
    {
      "identifier": "https://identifiers.org/DANDI:000897/0.240605.1710",
      "identifierType": "URL"
    },
    {
      "identifier": "https://dandiarchive.org/dandiset/000897/0.240605.1710",
      "identifierType": "URL"
    }
  ],
  "rightsList": [
    {
      "rightsIdentifier": "cc_by_40",
      "rightsIdentifierScheme": "SPDX"
    }
  ],
  "descriptions": [
    {
      "description": "The dataset contains electrophysiology data recorded from the entorhinal cortex of two NHPs performing a mental navigation task. The recording probes used were V-probe with 32 channels or 64 channels, manufactured by Plexon Inc. ",
      "descriptionType": "Abstract"
    }
  ],
  "fundingReferences": [
    {
      "funderName": "National Institute of Mental Health",
      "awardNumber": "NIMH-MH129046",
      "funderIdentifier": "https://ror.org/05xj56w78",
      "funderIdentifierType": "ROR"
    },
    {
      "funderName": "Natural Science and Engineering Council of Canada",
      "awardNumber": "NSERC PDF-516867-2018",
      "funderIdentifier": "https://ror.org/01h531d29",
      "funderIdentifierType": "ROR"
    }
  ],
  "schemaVersion": "http://datacite.org/schema/kernel-4",
  "providerId": "dartlib",
  "clientId": "dartlib.dandi",
  "agency": "datacite",
  "state": "findable"
}
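
The three error messages above can be turned into a small normalization step. This is only a sketch, assuming the differences are exactly those reported by check-jsonschema for this one record. The function name is hypothetical, and the result is not guaranteed to validate against datacite-v4.5.json or to be accepted for minting.

```python
import copy

# Keys reported as unexpected at the top level and under "types".
EXTRA_TOP_LEVEL = {"agency", "clientId", "id", "identifiers", "providerId", "state"}
EXTRA_TYPES = {"bibtex", "citeproc", "ris", "schemaOrg"}


def normalize_doi_org_record(record):
    """Hypothetical helper: strip doi.org-only keys and fix publicationYear."""
    cleaned = {
        key: copy.deepcopy(value)
        for key, value in record.items()
        if key not in EXTRA_TOP_LEVEL
    }
    if isinstance(cleaned.get("types"), dict):
        cleaned["types"] = {
            key: value
            for key, value in cleaned["types"].items()
            if key not in EXTRA_TYPES
        }
    # The schema wants a string, while doi.org returns an integer.
    if isinstance(cleaned.get("publicationYear"), int):
        cleaned["publicationYear"] = str(cleaned["publicationYear"])
    return cleaned


sample = {
    "id": "https://doi.org/10.48324/dandi.000897/0.240605.1710",
    "doi": "10.48324/DANDI.000897/0.240605.1710",
    "types": {"ris": "DATA", "resourceType": "Neural Data", "resourceTypeGeneral": "Dataset"},
    "publicationYear": 2024,
    "state": "findable",
}
cleaned = normalize_doi_org_record(sample)
print(sorted(cleaned))
```

Note that simply dropping identifiers discards the two URL entries. Mapping them to alternateIdentifiers instead would be a separate decision.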

@yarikoptic
Author

oh, and there is yet another json format model in datacite api output

❯ curl --silent -L "https://api.datacite.org/dois/10.48324/DANDI.000897/0.240605.1710" | jq . >| /tmp/dandi-000897-datacite-api.json
❯ jq .data /tmp/dandi-000897-datacite-api.json | check-jsonschema --traceback-mode full --schemafile datacite-v4.5.json -
Schema validation errors were encountered.
  -::$: Additional properties are not allowed ('attributes', 'id', 'relationships', 'type' were unexpected)
  -::$: 'creators' is a required property
  -::$: 'titles' is a required property
  -::$: 'publisher' is a required property
  -::$: 'publicationYear' is a required property
  -::$: 'types' is a required property
  -::$: 'schemaVersion' is a required property

My poor brain needs a diagram ... here is what chatgpt gave me (I didn't even try to correct it since I lack the full picture) -- could you improve the relationships there to better reflect the situation? (it can be live-edited on https://mermaid.live)

graph TD
    XML[DataCite XML Model] -- Basis for --> Doc[DataCite Documentation]
    Doc -- Explains mapping to --> JSONSchema[JSON Schema Model]
    JSONSchema -- Implements --> API[DataCite API Model]
    API -- Serves --> DOI[doi.org vnd.datacite.datacite+json Model]
    XML -- Provides structure for --> DOI
    Doc -- Guides --> API
    JSONSchema -- Validates --> DOI

@tmorrell
Contributor

For the last curl call you want the content in attributes

The ground truth for me is the XML and json representations at https://github.com/inveniosoftware/datacite/tree/master/tests/data. This comes back to #101 and how we should have a more transparent process for generating those.

I'll try to tweak your diagram....one sec

@tmorrell
Contributor

The root of the problem is that DataCite doesn't make official JSON representations available, nor make a JSON schema available. You can't use the JSON from doi.org vnd.datacite.datacite+json directly to mint a DOI. What's served out of doi.org vnd.datacite.datacite+json and what's accepted for DOI minting also changes over time (independent of the metadata schema version).

So we do the best we can. We make our own "DataCite JSON Model" that works as both a representation of the metadata and converts between XML and JSON formats. We test that our examples work for DOI minting, and keep fixed examples at https://github.com/inveniosoftware/datacite/tree/master/tests/data. We make changes each version to try to make the jsonschema work well....but since it's not official there are no guarantees we make the right decisions on things.

Here's my version of the diagram.

@yarikoptic
Author

> For the last curl call you want the content in attributes

ah, cool, indeed -- I should have checked -- it looks closer, although it also does not validate

❯ jq .data.attributes /tmp/dandi-000897-datacite-api.json | check-jsonschema --traceback-mode full --schemafile datacite-v4.5.json -
Schema validation errors were encountered.
  -::$: Additional properties are not allowed ('citationCount', 'citationsOverTime', 'contentUrl', 'created', 'downloadCount', 'downloadsOverTime', 'identifiers', 'isActive', 'metadataVersion', 'partCount', 'partOfCount', 'published', 'reason', 'referenceCount', 'registered', 'source', 'state', 'updated', 'versionCount', 'versionOfCount', 'viewCount', 'viewsOverTime', 'xml' were unexpected)
  -::$.types: Additional properties are not allowed ('bibtex', 'citeproc', 'ris', 'schemaOrg' were unexpected)
  -::$.publisher: 'DANDI Archive' is not of type 'object'
  -::$.publicationYear: 2024 is not of type 'string'
  -::$.language: None is not of type 'string'
  -::$.version: None is not of type 'string'
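
A sketch of the same idea for the api.datacite.org response: unwrap data.attributes, then address the mismatches listed above. The helper name is hypothetical, the deny-list is copied from the error message for this one record, and dropping every null value (not just language and version) is an assumption. Nothing here guarantees the result validates against datacite-v4.5.json.

```python
import copy

# Properties the validator reported as unexpected in data.attributes.
API_ONLY_KEYS = {
    "citationCount", "citationsOverTime", "contentUrl", "created",
    "downloadCount", "downloadsOverTime", "identifiers", "isActive",
    "metadataVersion", "partCount", "partOfCount", "published", "reason",
    "referenceCount", "registered", "source", "state", "updated",
    "versionCount", "versionOfCount", "viewCount", "viewsOverTime", "xml",
}
EXTRA_TYPES = {"bibtex", "citeproc", "ris", "schemaOrg"}


def attributes_to_metadata(api_response):
    """Hypothetical helper: reduce an API response to schema-like metadata."""
    attributes = api_response["data"]["attributes"]
    metadata = {}
    for key, value in attributes.items():
        if key in API_ONLY_KEYS or value is None:
            continue  # drop API bookkeeping and null language/version
        metadata[key] = copy.deepcopy(value)
    # Without ?publisher=true the API returns the publisher as a plain string.
    if isinstance(metadata.get("publisher"), str):
        metadata["publisher"] = {"name": metadata["publisher"]}
    if isinstance(metadata.get("publicationYear"), int):
        metadata["publicationYear"] = str(metadata["publicationYear"])
    if isinstance(metadata.get("types"), dict):
        metadata["types"] = {
            key: value
            for key, value in metadata["types"].items()
            if key not in EXTRA_TYPES
        }
    return metadata


response = {
    "data": {
        "id": "10.48324/dandi.000897/0.240605.1710",
        "type": "dois",
        "attributes": {
            "doi": "10.48324/DANDI.000897/0.240605.1710",
            "publisher": "DANDI Archive",
            "publicationYear": 2024,
            "language": None,
            "version": None,
            "types": {"bibtex": "misc", "resourceTypeGeneral": "Dataset"},
            "viewCount": 0,
        },
    }
}
metadata = attributes_to_metadata(response)
print(metadata["publisher"])
```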

> The ground truth for me is the XML and json representations at https://github.com/inveniosoftware/datacite/tree/master/tests/data.

so it is the "DataCite XML", good! By JSON, do you mean the examples? Correct me if I am wrong: following your explanation, those "official" XML examples are then converted into JSON using your tools and validated against datacite fabric... right?

> The root of the problem is that DataCite doesn't make official JSON representations available, nor make a JSON schema available.

But they do operate on JSON records. So how do they verify input records or produce output records -- is there a model, or is there only an "in code" implementation of the XML model? Are sources available?

> We test that our examples work for DOI minting

Just to make sure -- you have that automated, right?

FWIW and FTR, here is our testing against api.test.datacite.org -- https://github.com/dandi/dandi-schema/blob/master/dandischema/datacite/tests/test_datacite.py .

@yarikoptic
Author

yarikoptic commented Nov 19, 2024

I tried to separate out notions of JSON vs JSON Model etc... but just ended up with a total mess ;) giving up on my artsy attempts for now
---
DataCite Representations and actors
---
flowchart TD
    XMLModel[DataCite XML Model] -- Basis for --> Doc[DataCite Documentation]
    XMLModel -- Informs --> JSONModel[DataCite JSON Model]
    XML -- Allows user to mint --> DOI[DOI]

    XML -- instantiates --> XMLModel
    JSON -- instantiates --> JSONModel

    JSON -- Allows user to mint --> DOI
    JSONModel -- Enhancement --> Representation[doi.org vnd.datacite.datacite+json Model]
    JSONSchema -- Validates --> JSON
    JSON -- Validated against --> test[api.test.datacite fabric]
    XMLModel -- Validated against --> test
    test -- Implements --> XMLModel

    click XMLModel "https://github.com/datacite/schema"
    click Doc "https://datacite-metadata-schema.readthedocs.io"
    click JSONModel "https://github.com/inveniosoftware/datacite"

@tmorrell
Contributor

You also need to add ?publisher=true&affiliation=true. The DataCite API isn't versioned, so breaking changes are added with parameters. It's....not ideal.
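
For reference, a small sketch of building such a request URL with the two parameters. The helper name is hypothetical. The host and parameter names are taken from the URLs quoted in this thread, and no request is actually sent.

```python
from urllib.parse import quote, urlencode


def datacite_doi_url(doi, base="https://api.datacite.org/dois/"):
    """Hypothetical helper: URL for one DOI with the opt-in parameters set."""
    # Keep "/" inside the DOI unescaped, as in the URLs used above.
    query = urlencode({"publisher": "true", "affiliation": "true"})
    return f"{base}{quote(doi, safe='/')}?{query}"


url = datacite_doi_url("10.48324/DANDI.000897/0.240605.1710")
print(url)
```

Passing base="https://api.test.datacite.org/dois/" would point the same helper at the test instance mentioned earlier.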

We wouldn't include the additional properties...they are added by DataCite but they aren't metadata.

> so it is the "DataCite XML ", good! JSON - you mean examples? Correct me if I am wrong, following your explanation those ["official" XML examples](https://github.com/datacite/schema/tree/master/source/meta/kernel-4.5/include) are then converted into JSON using your tools and validated against datacite fabric... right?

Yup!

> But they do operate on JSON records. So how do they verify input records or produce output records -- is there a model, or is there only an "in code" implementation of the XML model? Are sources available?

As far as I know only "in code". I believe https://github.com/datacite/bolognese does the serialization and https://github.com/datacite/lupo does the API endpoints....but it's not particularly approachable code.

> Just to make sure -- you have that automated, right?

Yup, see the test parametrized with @pytest.mark.parametrize("example_json43", TEST_43_JSON_FILES). It currently runs offline when a fabrica password is provided. It doesn't run in GitHub Actions so as not to spam fabrica test.

@yarikoptic
Author

FWIW I did file a sample issue

Let's see what it "brings" if anything.
