Proper way to access local node paths in Toil, AWS Batch backend, for CWL workflows #5163

Open
robertbio opened this issue Nov 18, 2024 · 1 comment

Comments

robertbio commented Nov 18, 2024

Hi,

I'm using Toil in server mode as a WES server, configured with the AWS Batch backend. In the AWS Batch compute environment, I have a drive mounted on each of the cluster nodes (currently via mount-s3).

I want to give access to this local path inside a CommandLineTool defined in a CWL workflow. Referencing files from S3 works fine, and they are staged correctly using InitialWorkDirRequirement. However, when I reference a file or directory using a local path (e.g., file:///mnt/service-data/blast/gramene_01_2022.tar.gz), I get the following error:

[2024-11-18T11:47:36+0000] [MainThread] [E] [root] No available job store implementation can import the URL 'missing:'. Ensure Toil has been installed with the appropriate extras.

Here is the relevant cwl definition and parameters for context:

list_mounted.tool.cwl:

cwlVersion: v1.2
class: CommandLineTool

hints:
  DockerRequirement:
    dockerPull: openjdk:9.0.1-11-slim
baseCommand: ls
arguments:
  - -l

inputs:
  in1:
    type: File
    inputBinding:
      position: 1
      valueFrom: $(self.basename)


requirements:
  InitialWorkDirRequirement:
    listing:
      - $(inputs.in1)
      
 
outputs:
  out1: stdout

workflow_params:

workflow_params='{
  "in1" : {
    "class" : "File",
    "basename" : "gramene_01_2022.tar.gz",
    "path": "file:///mnt/service-data/cdblast/gramene_01_2022.tar.gz"
  }
}'

What is the proper way to make these local node paths available to the jobs in this scenario? Is there a recommended approach for this? Am I missing something in the Toil configuration?

Thanks for your help!

┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-1674

adamnovak (Member) commented Nov 18, 2024

If you want the leader to not see the file as missing, you need to have it available from the leader as well as from the workers. So the first thing to try is probably mounting your filesystem on the node you are issuing the command from.

If you are using import symlinking and a file job store on your shared filesystem, your files should be imported into Toil as symlinks to their actual location on the shared filesystem, not copied. But I am not sure that mounted S3 provides the consistency guarantees that Toil needs from a shared filesystem in order to use it as the job store, or that it supports symlinks.

If you are using the AWS job store for storage, Toil will want to copy your files into the job store, and from there to each node. You could try to turn on toil-cwl-runner's --bypass-file-store option to make Toil just assume all paths are accessible from all nodes. But then you might need to set --tmp-outdir-prefix or some of the other CWL path settings to get Toil to create job outputs on your shared filesystem instead of in node-local temporary storage, because you're turning off the whole system responsible for moving files between nodes.
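
As a rough sketch of that last suggestion, the command line might be assembled as below. This is an illustration under assumptions, not a tested configuration: `--bypass-file-store` and `--tmp-outdir-prefix` are the options named above, `/mnt/service-data` is the mount point from the question, and the `toil-out/` directory and `job.json` file are hypothetical names. It also assumes the mount is writable from every node. Job store and AWS Batch options are left out. In server mode, these options would have to reach the runner through whatever engine-parameter mechanism the WES setup exposes.

```shell
#!/bin/sh
# Sketch only: assembles a toil-cwl-runner command line and prints it for review.
# SHARED_MOUNT is the mount point from the issue; OUT_PREFIX and job.json are
# hypothetical names chosen for illustration.
SHARED_MOUNT="/mnt/service-data"
OUT_PREFIX="${SHARED_MOUNT}/toil-out/"

# Job store and AWS Batch options are intentionally omitted; add the ones
# your deployment already uses.
set -- toil-cwl-runner \
  --bypass-file-store \
  --tmp-outdir-prefix "${OUT_PREFIX}" \
  list_mounted.tool.cwl \
  job.json

# Print the command instead of running it; to execute it, run: "$@"
printf '%s\n' "$*"
```

Printing the command first makes it easy to check the output prefix before anything is submitted. Whether outputs can safely be written to an S3-backed mount is a separate question that this sketch does not settle.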
