Initial draft of a design document for the Zenodo-like DOI per dandiset #2012

Open
wants to merge 4 commits into
base: master

yarikoptic
Member

@yarikoptic yarikoptic commented Aug 22, 2024

A design doc composed with @djarecka to avoid dummy DOIs for dandisets

refs:

TODOs

  • complete initial pass
  • seek review

The draft can already be looked over by @dandi/archive-maintainers folks: the overall idea is formulated, so early concerns/questions can already be asked and answered.

- If minting a DOI fails, we need to raise exception to inform developers about the issue but proceed with the creation of the dandiset.
- *minimal metadata* entered during creation request (title, description, license)
- DLP URL `https://dandiarchive.org/dandiset/{dandiset.id}`
- For embargoed dandiset, **do not** specify any metadata besides the DLP URL.
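
A minimal sketch of how the creation-time rules quoted above might fit together, under stated assumptions: `mint_doi` is a hypothetical callable standing in for whatever client talks to datacite, the DOI string in the usage note is a made-up placeholder, and the flat `title`/`description`/`license` keys are simplified placeholders rather than datacite schema field names.

```python
import logging

logger = logging.getLogger(__name__)

# DLP URL template, as quoted in the design doc lines above.
DLP_URL = "https://dandiarchive.org/dandiset/{dandiset_id}"

# Minimal metadata entered during the creation request.
MINIMAL_FIELDS = ("title", "description", "license")


def build_draft_doi_attributes(dandiset_id, doi, metadata, embargoed):
    """Build simplified attributes for the DOI minted at creation time.

    The flat keys are placeholders, not datacite schema field names.
    """
    attributes = {"doi": doi, "url": DLP_URL.format(dandiset_id=dandiset_id)}
    if not embargoed:
        # Embargoed dandisets expose nothing besides the DLP URL.
        for field in MINIMAL_FIELDS:
            if field in metadata:
                attributes[field] = metadata[field]
    return attributes


def create_dandiset(dandiset_id, doi, metadata, embargoed, mint_doi):
    """Create a dandiset record; `mint_doi` is a hypothetical callable."""
    attributes = build_draft_doi_attributes(dandiset_id, doi, metadata, embargoed)
    try:
        mint_doi(attributes)
        minted = True
    except Exception:
        # Surface the failure to developers, but do not block creation.
        logger.exception("Minting DOI %s failed", doi)
        minted = False
    return {"id": dandiset_id, "doi": doi if minted else None}
```

With a `mint_doi` that raises, `create_dandiset("000125", "10.0000/dandi.000125", {}, False, mint_doi)` still returns a record, with `doi` set to `None`, and the traceback goes to the log.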
Member

A thought: perhaps DOI generation could happen only once a dataset is unembargoed, i.e. embargoed datasets cannot be pointed to by a DOI (even if we get reviewer view-only access). An owner would have to unembargo it first.

For implementation, DOI generation would happen at creation for public dandisets and at unembargoing for embargoed dandisets.

Member Author

We could. But one of the goals here is to get away from using "fake DOIs" (#1709), which delaying DOI creation until unembargo would "prevent".

In the current design (might change) we would make the overall dandiset DOI Findable only upon initial publication. Since an embargoed dandiset would never be published, its DOI would remain Draft and thus not available to users, so IMHO there is no harm. I think we could even keep updating it with metadata etc. That IMHO would simplify the logic and make "embargoed" less special (thus easier to code/troubleshoot etc).

Member Author

> In current design (might change) we would make overall dandiset DOI Findable only upon initial publication.

Given demand from zarr users to get DOIs for dandisets with zarrs, we had better make them Findable as early as possible in the life cycle of those dandisets (ref: dandi/helpdesk#165, reply in thread). So I think we should proceed that way: make them Findable as soon as datacite validation passes, and also inform the user about datacite model issues as part of the validation.

Comment on lines +41 to +44
- For `Draft DOI` (dandiset was not published yet), there is no validation, try to update datacite metadata record while keeping the same target URL
- **Question to clear up**: what happens to Draft DOI if metadata record is invalid? Does it fail to update altogether? does it update only the fields it knows about?
- For `Findable DOI` (dandiset was published at least once), we do not update anything since DLP points to that published version.
- **TODO: figure out how to annotate Draft version, so it always says that it is a draft version and thus potentially not used for citation if that could be avoided**
Member Author

To facilitate citation of dandisets which are not yet published (the immediate use case being dandisets with zarrs, which we cannot publish yet since we cannot guarantee their versions), we could try migrating the Draft DOI to Findable upon every modification of metadata. If that fails, keep the prior state (Draft). If the DOI is already Findable and the update fails, we are doomed to keep the prior record until edits bring the metadata to a "good state".

Suggested change
- For `Draft DOI` (dandiset was not published yet): try to update it and make it `Findable`.
  - If that fails, keep it as Draft: since Draft DOIs are not validated, still try to update the datacite metadata record while keeping the same target URL.
  - **Question to clear up**: what happens to a Draft DOI if the metadata record is invalid? It seems to create one with no metadata, but does it update only the fields it knows about?
- For `Findable DOI`:
  - If it is still a draft version but one which had legit metadata, we try to update the metadata. If that fails, we either ignore the failure or just add a comment somewhere that "record might not reflect the most recent changes to draft version".
    - I think we need to add validation against the datacite metadata record to our validation procedures, and report errors to the user so that users address them before trying to publish. Maybe we should validate only if no other errors (from our schema validation) were detected, to reduce noise, or just give a summary that "Metadata is not satisfying datacite model, fix known metadata errors first."
  - If the dandiset was published at least once (has a version), we do not update anything since the DLP points to that published version.
- **TODO: figure out how to annotate Draft version, so it always says that it is a draft version and thus potentially not used for citation if that could be avoided**
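
To make the branches in this suggestion easier to check, here is a rough sketch of the decision as a pure function. It is only an illustration under assumptions: the `DoiState` enum, the function name, and the action labels are all hypothetical, and `datacite_valid` stands for the outcome of whatever datacite validation ends up being used.

```python
from enum import Enum


class DoiState(Enum):
    DRAFT = "draft"
    FINDABLE = "findable"


def on_draft_metadata_change(state, has_published_version, datacite_valid):
    """Return (new_state, action) for the dandiset-level DOI.

    The action labels are hypothetical names for what the archive would do.
    """
    if has_published_version:
        # The DLP DOI points to the published version: leave it alone.
        return state, "skip"
    if state is DoiState.DRAFT:
        if datacite_valid:
            # Metadata passes datacite validation: promote to Findable.
            return DoiState.FINDABLE, "publish"
        # Keep Draft, but still try to refresh the (unvalidated) record.
        return DoiState.DRAFT, "update-draft"
    # Findable DOI that still describes a draft version.
    if datacite_valid:
        return DoiState.FINDABLE, "update"
    # Update would fail: keep the prior record until edits fix the metadata.
    return DoiState.FINDABLE, "keep-stale"
```

Note that nothing here moves a DOI from Findable back to Draft, which matches the "doomed to keep prior one" case described in the comment above.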

Member

@djarecka djarecka Jan 19, 2025

regarding validation,

  • for PublishedDandiset
    I think we aimed to create a PublishedDandiset class that already has all the fields that datacite requires, so if we are able to create a PublishedDandiset, we should be able to create the datacite record for it.
    In addition, at the end of to_datacite we have validation_datacite that checks against the datacite schema (or at least one of its versions...)
    So I believe that once we have a published dandiset and a Findable DOI, we should only update the DOI with each new published version, and if our schema is right, we should not have problems updating the DOI.
    Of course, datacite can change the schema (or at least the validation function), and then we could have issues.

  • for Dandiset
    We should run to_datacite with the option validate=True and see if the validation against the datacite schema passes.
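
As a rough sketch of the "validate against datacite only if our own schema validation is clean" idea from the suggestion above: both validators below are hypothetical callables that return a list of error strings. The datacite one could, for instance, wrap a to_datacite call with validate=True and convert any raised error into a message, but that wiring is an assumption and is not shown.

```python
def collect_validation_errors(metadata, schema_validate, datacite_validate):
    """Run datacite validation only when schema validation finds nothing.

    Both validators are hypothetical callables returning error strings.
    """
    schema_errors = list(schema_validate(metadata))
    if schema_errors:
        # Reduce noise: report our own schema errors first.
        return schema_errors
    # Prefix datacite problems so users can tell the two sources apart.
    return ["datacite: " + error for error in datacite_validate(metadata)]
```

An empty list then means the draft metadata should be safe to push to a Findable DOI.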

Member

djarecka commented Jan 19, 2025

I created some tests to simulate the workflow in dandi/dandi-schema#275
