Inquiry Regarding Failed Jobs in Cactus Pangenome Workflow #1603

Open
jinhua2024 opened this issue Jan 31, 2025 · 0 comments
jinhua2024 commented Jan 31, 2025

Dear Cactus Development Team,

I am using Cactus (version 2.9.3) to construct a pangenome from five individuals, each with two haplotypes. The workflow ran for approximately three weeks and then terminated with 12 failed jobs. I would appreciate any guidance on resolving these issues.

System and Setup

  • Cactus version: 2.9.3

  • Execution environment: Cluster computing system

  • Number of input samples: 5 individuals, each with 2 haplotypes

  • Command used:
    /osmgfs10000/home/wujinhua/singularity/4.1.2/bin/singularity exec \
      --bind /osmgfs10000/home/wujinhua/assembly/cactus_pangen/2.9.3_dsd_trio_family_2:/osmgfs10000/home/wujinhua/assembly/cactus_pangen/2.9.3_dsd_trio_family_2 \
      /osmgfs10000/home/wujinhua/assembly/cactus_pangen/cactus_v2.9.3.sif \
      cactus-pangenome \
      /osmgfs10000/home/wujinhua/assembly/cactus_pangen/2.9.3_dsd_trio_family_2/jobs \
      /osmgfs10000/home/wujinhua/assembly/cactus_pangen/250109samples_family_trio.txt \
      --outDir /osmgfs10000/home/wujinhua/assembly/cactus_pangen/2.9.3_dsd_trio_family_2 \
      --outName 2519_trio \
      --reference ref_pig \
      --workDir /osmgfs10000/home/wujinhua/assembly/cactus_pangen/workdir \
      --logFile /osmgfs10000/home/wujinhua/assembly/cactus_pangen/2.9.3_dsd_trio_family_2/run1.log \
      --maxCores 40 --maxMemory 320G --mapCores 12 \
      --vcf full clip filter \
      --gfa clip full filter \
      --gbz full clip filter \
      --odgi clip full \
      --xg --viz --draw \
      --chrom-vg clip filter \
      --chrom-og clip full \
      --filter 1 \
      --vcfbub 1000000 \
      --vcfwave \
      --haplo \
      --giraffe filter \
      --permissiveContigFilter 0.25 \
      --collapse

  • Error summary from the log:
    [2025-01-29T05:22:13-0600] [MainThread] [I] [toil.leader] Failed jobs at end of the run: 'graphmap_join_workflow' kind-export_align_wrapper/instance-dng4x0hz v6 'Job' kind-export_graphmap_wrapper/instance-ilcgdwkj v11 'make_haplo_index' kind-make_haplo_index/instance-4pjn1v78 v6 'Job' kind-Job/instance-kgy7538p v2 'sanitize_fasta_headers' kind-pangenome_end_to_end_workflow/instance-z1zq5ofu v6 'join_vg' kind-join_vg/instance-hktg_jsr v3 'Job' kind-Job/instance-3akbws9h v2 'check_vcfwave' kind-Job/instance-lo809xmj v4 'make_vg_indexes' kind-make_vg_indexes/instance-itmlb_z9 v3 'Job' kind-export_minigraph_wrapper/instance-bs72or7f v8 'batch_align_jobs' kind-export_split_wrapper/instance-plflc1ie v12 'sort_minigraph_input_with_mash' kind-minigraph_construct_workflow/instance-0cukg2mf v7
    [2025-01-29T05:22:13-0600] [MainThread] [I] [toil.realtimeLogger] Stopping real-time logging server.
    [2025-01-29T05:22:13-0600] [MainThread] [I] [toil.realtimeLogger] Joining real-time logging server thread.
    Traceback (most recent call last):
    File "/home/cactus/cactus_env/bin/cactus-pangenome", line 8, in <module>
    sys.exit(main())
    File "/home/cactus/cactus_env/lib/python3.10/site-packages/cactus/refmap/cactus_pangenome.py", line 256, in main
    toil.start(Job.wrapJobFn(pangenome_end_to_end_workflow, options, config_wrapper, input_seq_id_map, input_path_map, input_seq_order, ref_collapse_paf_id, last_scores_id))
    File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 930, in start
    return self._runMainLoop(rootJobDescription)
    File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 1417, in _runMainLoop
    jobCache=self._jobCache).run()
    File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/leader.py", line 304, in run
    raise FailedJobsException(self.jobStore, failed_jobs, exit_code=self.recommended_fail_exit_code)
    toil.exceptions.FailedJobsException: The job store '/osmgfs10000/home/wujinhua/assembly/cactus_pangen/2.9.3_dsd_trio_family_2/jobs' contains 12 failed jobs: 'graphmap_join_workflow' kind-export_align_wrapper/instance-dng4x0hz v6, 'Job' kind-export_graphmap_wrapper/instance-ilcgdwkj v11, 'make_haplo_index' kind-make_haplo_index/instance-4pjn1v78 v6, 'Job' kind-Job/instance-kgy7538p v2, 'sanitize_fasta_headers' kind-pangenome_end_to_end_workflow/instance-z1zq5ofu v6, 'join_vg' kind-join_vg/instance-hktg_jsr v3, 'Job' kind-Job/instance-3akbws9h v2, 'check_vcfwave' kind-Job/instance-lo809xmj v4, 'make_vg_indexes' kind-make_vg_indexes/instance-itmlb_z9 v3, 'Job' kind-export_minigraph_wrapper/instance-bs72or7f v8, 'batch_align_jobs' kind-export_split_wrapper/instance-plflc1ie v12, 'sort_minigraph_input_with_mash' kind-minigraph_construct_workflow/instance-0cukg2mf v7
    Log from job "'make_haplo_index' kind-make_haplo_index/instance-4pjn1v78 v6" follows:
    =========>
    [2025-01-11T17:42:48-0600] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
    [2025-01-11T17:42:48-0600] [MainThread] [I] [toil] Running Toil version 7.0.0-d569ea5711eb310ffd5703803f7250ebf7c19576 on host computer-node05.
    [2025-01-11T17:42:48-0600] [MainThread] [I] [toil.worker] Working on job 'make_haplo_index' kind-make_haplo_index/instance-4pjn1v78 v4
    [2025-01-11T17:42:48-0600] [MainThread] [I] [toil.worker] Loaded body Job('make_haplo_index' kind-make_haplo_index/instance-4pjn1v78 v4) from description 'make_haplo_index' kind-make_haplo_index/instance-4pjn1v78 v4
    [2025-01-11T17:42:55-0600] [MainThread] [I] [cactus.shared.common] Running the command ['vg', 'index', '-t', '40', '-j', '/osmgfs10000/home/wujinhua/assembly/cactus_pangen/workdir/toilwf-0d2e62899a1f59da93f9dc3950de185b/3516/job/tmpps336_pl/clip.2519_trio.dist1', '/osmgfs10000/home/wujinhua/assembly/cactus_pangen/workdir/toilwf-0d2e62899a1f59da93f9dc3950de185b/3516/job/tmpps336_pl/clip.2519_trio.gbz', '--snarl-limit', '1']
    [2025-01-11T17:42:55-0600] [MainThread] [I] [toil-rt] 2025-01-11 17:42:55.412790: Running the command: "vg index -t 40 -j /osmgfs10000/home/wujinhua/assembly/cactus_pangen/workdir/toilwf-0d2e62899a1f59da93f9dc3950de185b/3516/job/tmpps336_pl/clip.2519_trio.dist1 /osmgfs10000/home/wujinhua/assembly/cactus_pangen/workdir/toilwf-0d2e62899a1f59da93f9dc3950de185b/3516/job/tmpps336_pl/clip.2519_trio.gbz --snarl-limit 1"
    [2025-01-11T18:02:03-0600] [MainThread] [W] [toil.lib.humanize] Deprecated toil method. Please use "toil.lib.conversions.bytes2human()" instead."
    [2025-01-11T18:02:03-0600] [MainThread] [W] [toil.lib.humanize] Deprecated toil method. Please use "toil.lib.conversions.bytes2human()" instead."
    [2025-01-11T18:02:03-0600] [MainThread] [I] [toil-rt] 2025-01-11 18:02:03.851045: Successfully ran: "vg index -t 40 -j /osmgfs10000/home/wujinhua/assembly/cactus_pangen/workdir/toilwf-0d2e62899a1f59da93f9dc3950de185b/3516/job/tmpps336_pl/clip.2519_trio.dist1 /osmgfs10000/home/wujinhua/assembly/cactus_pangen/workdir/toilwf-0d2e62899a1f59da93f9dc3950de185b/3516/job/tmpps336_pl/clip.2519_trio.gbz --snarl-limit 1" in 1148.424 seconds and 50.7 Gi memory with job-memory 64.9 Gi. Percent utilization: 78.1
    [2025-01-11T18:02:03-0600] [MainThread] [I] [toil-rt] 2025-01-11 18:02:03.852830: Running the command: "vg gbwt -p --num-threads 40 -r /osmgfs10000/home/wujinhua/assembly/cactus_pangen/workdir/toilwf-0d2e62899a1f59da93f9dc3950de185b/3516/job/tmpps336_pl/clip.2519_trio.ri -Z /osmgfs10000/home/wujinhua/assembly/cactus_pangen/workdir/toilwf-0d2e62899a1f59da93f9dc3950de185b/3516/job/tmpps336_pl/clip.2519_trio.gbz"
    [2025-01-11T18:02:42-0600] [MainThread] [W] [toil.lib.humanize] Deprecated toil method. Please use "toil.lib.conversions.bytes2human()" instead."
    [2025-01-11T18:02:42-0600] [MainThread] [W] [toil.lib.humanize] Deprecated toil method. Please use "toil.lib.conversions.bytes2human()" instead."
    [2025-01-11T18:02:42-0600] [MainThread] [I] [toil-rt] 2025-01-11 18:02:42.077345: Successfully ran: "vg gbwt -p --num-threads 40 -r /osmgfs10000/home/wujinhua/assembly/cactus_pangen/workdir/toilwf-0d2e62899a1f59da93f9dc3950de185b/3516/job/tmpps336_pl/clip.2519_trio.ri -Z /osmgfs10000/home/wujinhua/assembly/cactus_pangen/workdir/toilwf-0d2e62899a1f59da93f9dc3950de185b/3516/job/tmpps336_pl/clip.2519_trio.gbz" in 38.2169 seconds and 9.4 Gi memory with job-memory 64.9 Gi. Percent utilization: 14.42
    [2025-01-11T18:02:42-0600] [MainThread] [I] [toil-rt] 2025-01-11 18:02:42.078316: Running the command: "vg haplotypes -v 2 -t 40 -H /osmgfs10000/home/wujinhua/assembly/cactus_pangen/workdir/toilwf-0d2e62899a1f59da93f9dc3950de185b/3516/job/tmpps336_pl/clip.2519_trio.hapl -d /osmgfs10000/home/wujinhua/assembly/cactus_pangen/workdir/toilwf-0d2e62899a1f59da93f9dc3950de185b/3516/job/tmpps336_pl/clip.2519_trio.dist1 -r /osmgfs10000/home/wujinhua/assembly/cactus_pangen/workdir/toilwf-0d2e62899a1f59da93f9dc3950de185b/3516/job/tmpps336_pl/clip.2519_trio.ri /osmgfs10000/home/wujinhua/assembly/cactus_pangen/workdir/toilwf-0d2e62899a1f59da93f9dc3950de185b/3516/job/tmpps336_pl/clip.2519_trio.gbz"
    [2025-01-11T18:05:15-0600] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
    [2025-01-11T18:05:15-0600] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-make_vg_indexes/instance-itmlb_z9/file-21fe21687297455284f67c4fe5a32591/clip.merged.gbz' to path '/osmgfs10000/home/wujinhua/assembly/cactus_pangen/workdir/toilwf-0d2e62899a1f59da93f9dc3950de185b/3516/job/tmpps336_pl/clip.2519_trio.gbz'
    [2025-01-11T18:05:15-0600] [MainThread] [C] [toil.worker] Worker crashed with traceback:
    Traceback (most recent call last):
    File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/worker.py", line 438, in workerScript
    job._runner(jobGraph=None, jobStore=job_store, fileStore=fileStore, defer=defer)
    File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 2984, in _runner
    returnValues = self._run(jobGraph=None, fileStore=fileStore)
    File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 2895, in _run
    return self.run(fileStore)
    File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 3158, in run
    rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
    File "/home/cactus/cactus_env/lib/python3.10/site-packages/cactus/refmap/cactus_graphmap_join.py", line 1227, in make_haplo_index
    cactus_call(parameters=['vg', 'haplotypes'] + hapl_opts + ['-t', str(job.cores), '-H', hapl_path, '-d', dist_path, '-r', ri_path, gbz_path])
    File "/home/cactus/cactus_env/lib/python3.10/site-packages/cactus/shared/common.py", line 914, in cactus_call
    raise RuntimeError("{}Command {} exited {}: {}".format(sigill_msg, call, process.returncode, out))
    RuntimeError: Command /usr/bin/time -f "CACTUS-LOGGED-MEMORY-IN-KB: %M" vg haplotypes -v 2 -t 40 -H /osmgfs10000/home/wujinhua/assembly/cactus_pangen/workdir/toilwf-0d2e62899a1f59da93f9dc3950de185b/3516/job/tmpps336_pl/clip.2519_trio.hapl -d /osmgfs10000/home/wujinhua/assembly/cactus_pangen/workdir/toilwf-0d2e62899a1f59da93f9dc3950de185b/3516/job/tmpps336_pl/clip.2519_trio.dist1 -r /osmgfs10000/home/wujinhua/assembly/cactus_pangen/workdir/toilwf-0d2e62899a1f59da93f9dc3950de185b/3516/job/tmpps336_pl/clip.2519_trio.ri /osmgfs10000/home/wujinhua/assembly/cactus_pangen/workdir/toilwf-0d2e62899a1f59da93f9dc3950de185b/3516/job/tmpps336_pl/clip.2519_trio.gbz exited 1: stderr=Loading GBZ from /osmgfs10000/home/wujinhua/assembly/cactus_pangen/workdir/toilwf-0d2e62899a1f59da93f9dc3950de185b/3516/job/tmpps336_pl/clip.2519_trio.gbz
    Generating haplotype information
    Loading distance index from /osmgfs10000/home/wujinhua/assembly/cactus_pangen/workdir/toilwf-0d2e62899a1f59da93f9dc3950de185b/3516/job/tmpps336_pl/clip.2519_trio.dist1
    Building minimizer index
    Built the minimizer index in 113.048 seconds
    Loading r-index from /osmgfs10000/home/wujinhua/assembly/cactus_pangen/workdir/toilwf-0d2e62899a1f59da93f9dc3950de185b/3516/job/tmpps336_pl/clip.2519_trio.ri
    Partitioning parameters:
    - target length 10000 bp
    - 32 jobs
    Determining construction jobs
    error: [vg haplotypes] HaplotypePartitioner::partition_haplotypes(): there are 638 top-level chains and 621 weakly connected components; haplotype sampling cannot be used with this graph
    Command exited with non-zero status 1
    CACTUS-LOGGED-MEMORY-IN-KB: 32820436
    [2025-01-11T18:05:15-0600] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host computer-node05
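
When triaging a log like the one above, it can help to separate the tool's own error message from the surrounding Toil traceback. The following is a minimal sketch, assuming the log has been saved as plain text. The function names `tool_error_lines` and `chain_component_counts` are hypothetical helpers, not part of Cactus, Toil, or vg. It pulls out lines beginning with `error:` and reads the two counts from the `vg haplotypes` message quoted above.

```python
import re

# Matches the counts in the vg message quoted in the log above.
VG_PARTITION_RE = re.compile(
    r"there are (\d+) top-level chains and (\d+) weakly connected components"
)


def tool_error_lines(log_text):
    """Return stripped lines that start with 'error:' (tool stderr, not Toil frames)."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if line.strip().startswith("error:")
    ]


def chain_component_counts(line):
    """Return (chains, components) from a vg partitioning error, or None if absent."""
    match = VG_PARTITION_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Sample copied from the worker log above.
SAMPLE = """\
Determining construction jobs
error: [vg haplotypes] HaplotypePartitioner::partition_haplotypes(): there are 638 top-level chains and 621 weakly connected components; haplotype sampling cannot be used with this graph
Command exited with non-zero status 1
"""

for line in tool_error_lines(SAMPLE):
    print(line)
    counts = chain_component_counts(line)
    if counts is not None:
        chains, components = counts
        print(f"chains={chains} components={components} difference={chains - components}")
```

On the excerpt above, this reports 638 chains against 621 components, a difference of 17. The wording of the message suggests that `vg haplotypes` expects the two counts to be equal, though that reading is an inference from the message itself.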

Questions:

  1. Are there known issues that might cause these job failures?
  2. How can I debug and resolve these failures?
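
As a starting point for the second question, the failed-job summary can be compared against the worker logs that Toil actually attached. This is a rough sketch under the assumption that job descriptions always follow the `'name' kind-.../instance-... vN` shape seen in the log above. The helper names `failed_jobs` and `jobs_with_worker_logs` are hypothetical and do not come from Toil or Cactus.

```python
import re

# One job description as printed in the summary, e.g.
# 'join_vg' kind-join_vg/instance-hktg_jsr v3
JOB_RE = re.compile(r"'([^']+)' kind-([\w.-]+)/instance-(\w+) v(\d+)")

# Header that precedes an attached worker log.
LOG_HEADER_RE = re.compile(
    r'''Log from job "'([^']+)' kind-([\w.-]+)/instance-(\w+) v\d+" follows:'''
)


def failed_jobs(summary_line):
    """Return (name, instance_id) for each job description in a summary line."""
    return [(m.group(1), m.group(3)) for m in JOB_RE.finditer(summary_line)]


def jobs_with_worker_logs(log_text):
    """Return the set of (name, instance_id) that have a 'Log from job' section."""
    return {(m.group(1), m.group(3)) for m in LOG_HEADER_RE.finditer(log_text)}


# Shortened samples taken from the log above (three of the twelve jobs).
SAMPLE_SUMMARY = (
    "Failed jobs at end of the run: "
    "'graphmap_join_workflow' kind-export_align_wrapper/instance-dng4x0hz v6 "
    "'make_haplo_index' kind-make_haplo_index/instance-4pjn1v78 v6 "
    "'join_vg' kind-join_vg/instance-hktg_jsr v3"
)
SAMPLE_LOG = """Log from job "'make_haplo_index' kind-make_haplo_index/instance-4pjn1v78 v6" follows:"""

failed = failed_jobs(SAMPLE_SUMMARY)
logged = jobs_with_worker_logs(SAMPLE_LOG)
for job in failed:
    status = "has worker log" if job in logged else "no worker log in this excerpt"
    print(f"{job[0]} ({job[1]}): {status}")
```

In the excerpt posted here, only `make_haplo_index` comes with a worker log. Whether the other eleven entries failed independently or only because a job they depend on failed is not confirmed by this excerpt and would need the full run111.log to check.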

I appreciate any insights or suggestions you can provide. Thank you for your time and assistance.

Best regards,
jinhua
run111.log
