whatshap error: Read name occurs more than twice in the input file #37

Closed
A97paupic opened this issue Mar 22, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@A97paupic

Description of the bug

Whenever I run the pipeline with a reduced Revio FASTQ file as a test, I hit this error: Read name occurs more than twice in the input file. Up to that point I had not experienced any problems. Do you have any idea what is going on, and is there a way to get around this error?

Command used and terminal output

ERROR ~ Error executing process > 'FELLEN31_SKIERFE:SKIERFE:PHASING:WHATSHAP_PHASE (pr_001_001)'

Caused by:
  Process `FELLEN31_SKIERFE:SKIERFE:PHASING:WHATSHAP_PHASE (pr_001_001)` terminated with an error exit status (1)

Command executed:

  whatshap phase \
      --ignore-read-groups --indels --distrust-genotypes \
      -o pr_001_001.sorted.vcf.gz.phased.vcf \
      --reference GCA_000001405.15_GRCh38_no_alt_analysis_set_nochr.fna \
      pr_001_001.sorted.vcf.gz \
      pr_001_001.bam

  bgzip \
      -@ 2 \
      pr_001_001.sorted.vcf.gz.phased.vcf

  tabix \
      -p vcf \
      pr_001_001.sorted.vcf.gz.phased.vcf.gz

  cat <<-END_VERSIONS > versions.yml
  "FELLEN31_SKIERFE:SKIERFE:PHASING:WHATSHAP_PHASE":
      whatshap: $( whatshap --version )
      bgzip: $( bgzip --version | head -n 1 | sed 's/bgzip (htslib) //g')
      tabix: $( tabix --version | head -n 1 | sed 's/tabix (htslib) //g')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  WARNING: Ignoring --row-limit as heuristic is not used as algorithm.
  WARNING: Ignoring --indels as indel phasing is default in WhatsHap 2.0+
  This is WhatsHap 2.2 running under Python 3.9.18
  [E::idx_find_and_load] Could not retrieve index file for 'pr_001_001.sorted.vcf.gz'
  [E::idx_find_and_load] Could not retrieve index file for 'pr_001_001.sorted.vcf.gz'
  [E::idx_find_and_load] Could not retrieve index file for 'pr_001_001.sorted.vcf.gz'
  Working on 1 sample from 1 family

  # Working on contig 1 in individual pr_001_001
  Found 6443 usable heterozygous variants (3298 skipped due to missing genotypes)
  ERROR: whatshap error: Read name 'm84045_230420_150058_s1/198967705/ccs' occurs more than twice in the input file

Relevant files

No response

System information

Nextflow v.23.04.2,
HPC,
slurm,
Singularity,
CentOS Linux,
fellen31/skierfe v1.0dev

@A97paupic A97paupic added the bug Something isn't working label Mar 22, 2024
@fellen31
Collaborator

Hi,

Could you double check if you have multiple copies of this read in your input file? For example with:

zgrep -c "m84045_230420_150058_s1/198967705/ccs" REVIO_TEST_DATA.fastq.gz
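
If more than one read might be affected, the same check can be generalized to list every duplicated read name at once. The following is only a rough sketch: the function name `duplicated_read_names` is made up for illustration, and it assumes a gzip-compressed FASTQ with standard 4-line records (no wrapped sequence lines).

```shell
# Hypothetical helper: print each read name that occurs more than once
# in a gzip-compressed FASTQ file.
# Assumption: every record is exactly 4 lines, so header lines are
# lines 1, 5, 9, ... of the decompressed stream.
duplicated_read_names() {
    gzip -cd "$1" |
        # Keep only header lines and strip the leading "@";
        # $1 drops any description after the first whitespace.
        awk 'NR % 4 == 1 { sub(/^@/, "", $1); print $1 }' |
        # uniq -d prints one line per name that appears at least twice.
        sort | uniq -d
}

# Example call, using the file name from the command above:
# duplicated_read_names REVIO_TEST_DATA.fastq.gz
```

An empty result would mean that no read name is repeated in the FASTQ file.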

@A97paupic
Author

Ah, there was indeed one too many copies of that read in the FASTQ file! That has been fixed, and the pipeline now runs through the WhatsHap process as it should, thanks!

Regards,
Paul

@fellen31
Collaborator

Nice to hear you could fix it!
I'm currently working on adding a test profile and test data to make the pipeline easier to test and set up.

Just let me know if you encounter any more problems.
Felix
