-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Workflow Run RO-crate format #39
base: master
Are you sure you want to change the base?
Conversation
add encodingFormat for nextflow.config
feat: add wrroc to valid formats
* fx: make getIntermediateOutputFiles work again * Fix bugs fixes #16 fixes #17 --------- Co-authored-by: fbartusch <[email protected]>
* fix #7 Signed-off-by: fbartusch <[email protected]>
* feat: add README to create * feat: ignore vscode * fix: make getIntermediateOutputFiles work again (#18) (#19) * fx: make getIntermediateOutputFiles work again * Fix bugs fixes #16 fixes #17 --------- Co-authored-by: fbartusch <[email protected]> * feat: add README to json * feat: check first if readme exists * Add readme to hasPart Signed-off-by: fbartusch <[email protected]> --------- Signed-off-by: fbartusch <[email protected]> Co-authored-by: fbartusch <[email protected]>
* Add getEncodingFormat function that return the encoding format for a file * handle YAML files manually Signed-off-by: fbartusch <[email protected]>
* implements #1 Signed-off-by: fbartusch <[email protected]>
Iss7 directory type
* start with metaYaml imports * merge dev-wrroc into metaYaml (#23) * add encodingFormat for nextflow.config * add encodingFormat for main.nf * feat: add wrroc to valid formats * fix: make getIntermediateOutputFiles work again (#18) * fx: make getIntermediateOutputFiles work again * Fix bugs fixes #16 fixes #17 --------- Co-authored-by: fbartusch <[email protected]> * feat: add README to crate (#14) * feat: add README to create * feat: ignore vscode * fix: make getIntermediateOutputFiles work again (#18) (#19) * fx: make getIntermediateOutputFiles work again * Fix bugs fixes #16 fixes #17 --------- Co-authored-by: fbartusch <[email protected]> * feat: add README to json * feat: check first if readme exists * Add readme to hasPart Signed-off-by: fbartusch <[email protected]> --------- Signed-off-by: fbartusch <[email protected]> Co-authored-by: fbartusch <[email protected]> --------- Signed-off-by: fbartusch <[email protected]> Co-authored-by: fbartusch <[email protected]> * WIP * only add from meta if meta exists * remove usage from ext args * add module name to id --------- Signed-off-by: fbartusch <[email protected]> Co-authored-by: fbartusch <[email protected]>
@bentsherman maybe you can have a look here :) |
Have you got an example RO-Crate generated with this version of the plugin? |
ro-crate-metadata.json |
Is this superseding #33 now? |
I ran
Since I don't have the original crate I'm wondering, do all relative paths correspond to existing files in the crate, e.g., is there a
I tried to run the pipeline following the instructions at the repo above. I had to manually change the plugin's version (to 1.1.0-DEV) after installing it, otherwise Nextflow downloads the
Despite that, the run went on and I got an RO-Crate, but without the |
@simleo I will try and answer to the issues you had :)
Yes this path is copied like this to the folder in which the json can be found. But that might not be the most elegant solution as the input files could also be put in a folder called input or something. What do you think?
Yes the installation of the plugin needs to be fixed. I always run Seems like something is off with your inputs. Did you download the files as described in the README of the pipeline? |
It's not that important, what matters is that there are no clashes between files due to their names.
Yes, and I put them in the root directory of the
The command I ran is:
|
Then I think you might need to update to a more recent Nextflow version as I only tested this for versions above 24.04. But you should be able to test the plugin with any pipeline - maybe running an nf-core pipeline that is better maintained is more reliable! |
@bentsherman I would appreciate any feedback on this so we can get this finished up this month? As soon as people start using the plugin I guess more requests for changes / improvements / updates will come in. |
The metadata file looks OK. One thing I suggest changing is Regarding the crate structure, it looks fine from your screenshot, though some files added to the crate are not listed in the metadata (e.g. However, I still cannot reproduce your result after upgrading Nextflow to 24.10.3. The output directory is missing several files and directories including
(with respect to Nextflow 23.10.0 the |
Can you try changing the config file to something like: (mainly updating the version of the plugin)
|
Thanks @famosab, I finally managed to install the right version of the plugin. The run finished with no errors and no warnings, and I got a valid RO-Crate. It's looking pretty good already, but some things need to be fixed. I think the most important one is
|
Taking a look this afternoon. Expect some minor edits soon |
Signed-off-by: Ben Sherman <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some minor cleanup so far. The render()
function is pretty long, so I'm going to see if I can move some code into helper functions to make it easier to read at a high-level. Please refrain from making edits for now as I work through the code. I left some comments/questions in the meantime.
// Copy workflow input files into RO-Crate | ||
workflowInputMapping.each { source, dest -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like you are copying the intermediate files into the RO-crate? I don't think this is feasible in general. I think it would be better to save only a record of the task inputs/outputs with a checksum. That is what the BCO does at least.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recording object
(inputs) and result
(outputs) for CreateAction
is respectively a MAY and a SHOULD in WRROC, so it's not strictly required. Also, workflow inputs and outputs are more important than intermediate files. However, if intermediate inputs/outputs are available, it is appropriate to include them in the RO-Crate, so that it provides a more complete representation of the run. If this is not feasible in general, could it be controlled by a boolean option (e.g. include_intermediates
) in the configuration file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have kept the object
and result
metadata without copying the actual intermediate files. This way you can still e.g. construct the provenance graph. I will consider adding a flag to copy the intermediates in a future iteration
// Copy workflow into crate directory | ||
Files.copy(scriptFile, crateRootDir.resolve(scriptFile.getFileName()), StandardCopyOption.REPLACE_EXISTING) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simleo you mentioned copying all of the pipeline code into the crate, but is this really necessary? If the crate specifies a git repository, revision, and main script path, the user can reproduce the pipeline code at any time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the crate specifies a git repository, revision, and main script path
This would be a nice alternative, though including the code is safer since a git repository can be deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use case I'm thinking of is using a standard pipeline e.g. nf-core/rnaseq v3.12, which may have been previously "verified" by a regulatory body. In this case it would be better to include an immutable reference rather than the actual files because then you don't have to check that those files correspond to a "verified" pipeline. But this is another topic I would like to see more discussion from community. I will consider a flag to copy the pipeline in a future iteration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simleo Is there an appropriate entity for recording the git url + commit hash? For example the BCO has an "scm extension" that looks like this:
"scm_extension": {
"scm_repository": "https://github.com/nextflow-io/rnaseq-nf",
"scm_type": "git",
"scm_commit": "1815dc2a18bb2c2a8e4c7915260d77bb04ec8c91",
"scm_path": "main.nf",
"scm_preview": "https://github.com/nextflow-io/rnaseq-nf/tree/1815dc2a18bb2c2a8e4c7915260d77bb04ec8c91/main.nf"
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add to the ComputationalWorkflow using codeRepository and version. Would be nice to also record the URL or path of the main script since it is the entrypoint to execute the workflow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ComputationalWorkflow could have a "url" property pointing to the exact revision, e.g.:
"url": "https://github.com/famosab/wrrocmetatest/blob/e8665c295b12132aa886f3abd0fc34a71b08d4a6/main.nf"
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
…oject dir Signed-off-by: Ben Sherman <[email protected]>
Agreed. We'll have to use either the parameter JSON schema or (in the future) static types to determine which params are files and thereby should be included in the crate. EDIT: I'll probably table this for a future iteration since the parameter schema is a whole can of worms. |
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
…epository URL + commit hash Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
Okay I think I am done with cleanup and refactoring. You can see my changes from the commits, but to summarize:
Another round of testing would be good to make sure I didn't break anything. I have tested with the rnaseq-nf toy pipeline which is pretty basic. |
I updated and reinstalled the plugin, then I ran https://github.com/famosab/wrrocmetatest again, but the run failed. There are no error messages on the console, but the results dir does not contain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for improving the quality of the code so much. I tested it with the nf-core demo pipeline with the test profile and found some problems.
I have some scripts for running the plugin on most nf-core pipelines. Maybe I find more problems in the next days.
} | ||
|
||
// -- copy input files from params to crate | ||
params.each { name, value -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested this with the nf-core demo pipeline and the test profile.
The test profile uses a samplesheet hosted on GitHub. This lets the plugin fail with;
java.nio.file.NoSuchFileException: <pipeline directory>/https:/raw.githubusercontent.com/nf-core/test-datasets/viralrecon/samplesheet/samplesheet_test_illumina_amplicon.csv
[...]
at org.codehaus.groovy.runtime.DefaultGroovyMethods.each(DefaultGroovyMethods.java:2471)
at nextflow.prov.WrrocRenderer.render(WrrocRenderer.groovy:199)
at nextflow.prov.ProvObserver.onFlowComplete(ProvObserver.groovy:121)
[...]
A quick&dirty solution that could fix this looks like this:
params.each { name, value ->
final schema = paramSchema[name] ?: [:]
final type = getParameterType(name, value, schema)
if( type == "File" || type == "Directory" ) {
String valueString = value.toString()
Path path = Path.of(valueString);
if (path.isFile()) {
final source = Path.of(value.toString()).toAbsolutePath()
// don't copy params.outdir into itself...
if( source == crateDir )
return
source.copyTo(crateDir)
} else {
// Check if file parameter is an URL
try {
URL url = new URL(valueString)
url.toURI() // Performs more checks than URL constructor
ReadableByteChannel readableByteChannel = Channels.newChannel(url.openStream());
// Download file into crate
String fileName = valueString.substring(valueString.lastIndexOf('/') + 1);
FileOutputStream fileOutputStream = new FileOutputStream(crateDir.resolve(fileName).toString());
FileChannel fileChannel = fileOutputStream.getChannel();
fileChannel.transferFrom(readableByteChannel, 0, Long.MAX_VALUE);
} catch(Exception e) {
}
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same problem with:
igenomes_base = 's3://ngi-igenomes/igenomes/'
The current thinks it's a File/Directory in the Nextflow sense and fails to copy it.
case CharSequence: | ||
return "Text" | ||
default: | ||
return null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately the additionalType
is a MUST for RO-Crates. The resulting crate for the nf-core demo pipeline is invalid, according to the rocrate-validator
:
"severity": "REQUIRED",
"message": "FormalParameter MUST have an additionalType",
"violatingEntity": "./main.nf#param#genomes",
This is the only problematic formal parameter in my test and it should originate from here
Can you test for Map
and List
values and return "Text"
as type, as these nested structures are converted to JSON-Strings in the metadata file.
@@ -0,0 +1,820 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The RO-Crate has to contain the main workflow file main.nf
.
We have to at least copy this one file into crate otherwise the rocrate-validator
considers the crate invalid:
"severity": "REQUIRED",
"message": "Main Workflow main.nf not found in crate",
I think that could be same issue I described in my review |
This makes the RO-Crate invalid, though currently the validator does not check for it: see crs4/rocrate-validator#60 |
Discussing this at the WRROC meeting. @elichad used |
The crate that @simleo references is available here: https://zenodo.org/records/12987289. In the metadata of that crate (v1.0.2), entities like From what I can tell, this doesn't violate the RO-Crate 1.1 spec, as I don't think there's a direct statement that anything with |
We worked on a first version of the plugin which is able to render valid RO-crates for any workflow run.
Happy to receive feedback to get this finished up :)
Continues #19 and #33.