Marker metadata for tracking #43

Open
noamross opened this issue Mar 3, 2023 · 8 comments
@noamross

noamross commented Mar 3, 2023

Is it possible for us to include metadata that would let us search for records created by deposits? Perhaps with some opt-in mechanism at deposition time, like "Would you like to add the keyword deposits-client to make it possible for us to track records created by deposits?" Ideally it's something in a minimally obtrusive or visible metadata field, but something we could find via the various search APIs across repositories.

@mpadge
Member

mpadge commented Mar 3, 2023

Yep, indeed. Here it is in action:

#' search_results <- cli$deposits_search ("keywords='frictionlessdata'&type='dataset'")

Just need to resolve the precise keyword to use for deposits rather than frictionless. "deposits" is sufficiently unambiguous, but perhaps not sufficiently informative? I imagine the procedure will be automatic rather than opt-in. The data are still exposed to, and controllable by, users, so in my current view we only need to demonstrate the possibility of manually removing the keyword.

@noamross
Author

noamross commented Mar 3, 2023

Cool. While I'd love to have high-coverage data, I'd want to be maximally transparent and opt-in with this.

Yes, "deposits" is probably not a great keyword. (This makes me wonder if another name would be better for the package.) That said, if there is already a "frictionlessdata" tag, perhaps we could place our marker only inside datapackage.json, and query that file in repositories with that tag. It would be more intensive but doable, and would avoid cluttering records with tags that users might not want.
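
A minimal sketch of that filtering step, assuming each datapackage.json has already been downloaded and parsed into an R list (for example with a JSON parser). The field name `deposits_client` and the helper `has_deposits_marker` are hypothetical; nothing in this thread settles on a marker name, and this is not part of the deposits package.

```r
# Hypothetical marker field inside datapackage.json; the name is an
# assumption made for illustration only.
marker_field <- "deposits_client"

# TRUE only when the parsed datapackage.json list carries the marker.
has_deposits_marker <- function (datapackage, field = marker_field) {
    isTRUE (datapackage [[field]])
}

# Two mock parsed files standing in for records found via the
# "frictionlessdata" tag.
records <- list (
    a = list (name = "pkg-a", deposits_client = TRUE),
    b = list (name = "pkg-b")
)

# Keep only the records that carry the marker.
tracked <- names (Filter (has_deposits_marker, records))
```

The more intensive part would be fetching the file for every tagged record; the check itself stays cheap.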

@noamross
Author

noamross commented Mar 3, 2023

That would actually be a great example for a tutorial!

@noamross noamross added this to the Phase II milestone Mar 6, 2023
@mpadge
Member

mpadge commented Mar 28, 2023

The necessary precursor issue on keywords, #36, is now done. Output is copied here to demonstrate the functionality needed for this issue. Keywords always have to be defined in "subject", not "description". The following code illustrates the new functionality, starting with what happens when "keywords" are defined in the wrong field:

library (deposits)
packageVersion ("deposits")
#> [1] '0.1.0.53'
metadata <- list (
    title = "New Title",
    abstract = "This is the abstract",
    creator = list (list (name = "A. Person"), list (name = "B. Person")),
    description = paste0 (
        "This is the description\n\n",
        "## keywords\none, two\nthree\n\n## version\n1.0"
    )
)
cli <- depositsClient$new (service = "zenodo", metadata = metadata, sandbox = TRUE)
#> Error: Metadata source for [keywords] should be [subject] and not [description]
cli <- depositsClient$new (service = "figshare", metadata = metadata)
#> Error: Metadata source for [keywords] should be [subject] and not [description]

The error message for both services is sufficiently informative to know what to do next:

metadata$description <- "This is the description\n\n## version\n1.0"
metadata$subject <- "## keywords\none, two\nthree"
cli <- depositsClient$new (service = "zenodo", metadata = metadata, sandbox = TRUE)
cli$deposit_new ()
#> ID of new deposit : 1177062
cli$hostdata$metadata$keywords
#> [[1]]
#> [1] "one"
#> 
#> [[2]]
#> [1] "two"
#> 
#> [[3]]
#> [1] "three"

cli <- depositsClient$new (service = "figshare", metadata = metadata)
cli$deposit_new ()
#> Files for private Figshare deposits can only be downloaded manually; no metadata can be retrieved for this deposit.
#> ID of new deposit : 22348531
cli$hostdata$tags
#> [1] "one"   "two"   "three"

Created on 2023-03-28 with reprex v2.0.2

And keywords are appropriately translated into service-specific terms, with the services themselves then returning their own representations. This issue then just needs optional or automatic insertion of a deposits-specific keyword, potentially alongside the "frictionlessdata" keyword illustrated in this Zenodo search query.

@peterdesmet Can you comment on any "official" frictionless positions on the use of such keywords? Is "frictionlessdata" supported or encouraged, or just something you personally use? (Seems to be the latter from the Zenodo records.) Do you have any advice or recommendations for us to extend upon your own usage to flag our own as a direct extension of frictionless? Any advice or input would be really appreciated 👍 😄

@peterdesmet

What keyword to use (frictionlessdata vs frictionless) was recently brought up in the Frictionless Slack, but I don't think it was conclusive. I have referenced this issue there and I'm tagging Community Manager @sapetti9 here. 😄

Regarding:

Do you have any advice or recommendations for us to extend upon your own usage to flag our own as a direct extension of frictionless? Any advice or input would be really appreciated

Can you clarify your use case? Is it "what keywords to automatically assign to a deposit in Zenodo/... that was created with the deposits package?"

@mpadge
Member

mpadge commented Mar 28, 2023

Can you clarify your use case? Is it "what keywords to automatically assign to a deposit in Zenodo/... that was created with the deposits package?"

Yes, that is precisely what I meant. We are intending to have a (likely optional, but possibly default) keyword that we can use to identify all deposits created via this package. And those will also likely include an additional keyword to align with your current "frictionlessdata" usage. So ultimately two keywords.
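
As a rough sketch of the opt-out behaviour described here, the helper below appends the two marker keywords unless tracking is switched off. The function name `add_tracking_keywords` and its arguments are hypothetical and not part of the deposits API; "deposits-client" is the keyword floated at the top of this issue, and the final choice is still undecided.

```r
# Hypothetical helper, not part of the deposits package.
# Appends marker keywords unless the user opts out via `track = FALSE`.
add_tracking_keywords <- function (keywords = character (0),
                                   track = TRUE,
                                   markers = c ("frictionlessdata",
                                                "deposits-client")) {
    if (!track) {
        return (keywords) # user opted out: leave keywords untouched
    }
    # unique() keeps first occurrences, so existing keywords stay in order
    # and a marker the user already supplied is not duplicated.
    unique (c (keywords, markers))
}

with_markers <- add_tracking_keywords (c ("one", "two"))
without_markers <- add_tracking_keywords (c ("one", "two"), track = FALSE)
```

Because the markers end up in the ordinary keyword vector, users can still remove them by hand before or after deposition.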

@peterdesmet

I think the proper way to do it would be to assign a related identifier with relationType=IsCompiledBy. This is defined in the DataCite Metadata Schema as "indicates B is used to compile or create A". I think this applies here.

  • The advantages over a keyword would be that you clearly identify the relationship (= linked data) and that there is a URL to the R package. This helps machines and humans.
  • The disadvantages would be that it is likely more complex to implement, that you need a stable identifier for the R package (the DOI of the R package on Zenodo could work), and that I don't know if Zenodo allows searching on this.

In any case, I tried it out for one of the animal tracking datasets I published with the movepub R package: https://doi.org/10.5281/zenodo.5653311 Here's how it looks:

[Screenshot, 2023-03-28: the related identifier as displayed on the Zenodo record]

@mpadge
Member

mpadge commented Mar 28, 2023

That's a great idea! deposits builds on a DCMI metadata structure, which includes a few terms where that might fit. Zenodo has a "related_identifiers" field which allows the "compiled by" relation, and it can also construct custom search queries on any field. So that should work for Zenodo, and therefore also for Dryad, which we'll soon expand to. (We currently do figshare too, but full functionality there is not so important.)
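
A minimal sketch of what such an entry could look like on the R side, assuming Zenodo's deposit metadata takes "related_identifiers" as a list of entries with `identifier` and `relation` fields, and that the relation is spelled "isCompiledBy". The helper `add_compiled_by` is hypothetical, it works on a plain list rather than through the deposits client, and the DOI is a placeholder, not a real identifier for the package.

```r
# Hypothetical helper, not part of the deposits package.
# Appends an "isCompiledBy" related identifier to a plain metadata list.
add_compiled_by <- function (metadata, doi) {
    entry <- list (identifier = doi, relation = "isCompiledBy")
    # c (NULL, list (entry)) yields a one-element list, so this also works
    # when no related identifiers exist yet.
    metadata$related_identifiers <-
        c (metadata$related_identifiers, list (entry))
    metadata
}

metadata <- list (title = "New Title")
# Placeholder DOI (assumption): replace with a stable identifier for the
# software that created the deposit.
metadata <- add_compiled_by (metadata, "10.5281/zenodo.0000000")
```

Whether the deposits metadata schema would accept this structure directly, and how it would be mapped from the DCMI terms, is left open here.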
