Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problems with --show-ref (gvcf mode) #6

Open
mattdmem opened this issue Nov 28, 2024 · 13 comments
Open

Problems with --show-ref (gvcf mode) #6

mattdmem opened this issue Nov 28, 2024 · 13 comments
Labels
bug Something isn't working

Comments

@mattdmem
Copy link

Hello...

I'm attempting to run v0.2.0 of Clair3-rna and have some observations:

docker run -v `pwd`:/data -it hkubal/clair3-rna:v0.2.0   /opt/bin/run_clair3_rna --bam_fn /data/test.bam --ref_fn /data/reference.fasta --sample_name test --platform=ont_dorado_drna004 --output_dir /data/test --output_prefix=test --include_all_ctgs --print_ref_calls -t 10 --min_coverage 0

Leads to the following error:

usage: call_var_bam.py [-h] [--platform PLATFORM] --bam_fn BAM_FN --chkpnt_fn
                       CHKPNT_FN --ref_fn REF_FN [--call_fn CALL_FN]
                       [--vcf_fn VCF_FN] [--ctgName CTGNAME]
                       [--ctgStart CTGSTART] [--ctgEnd CTGEND]
                       [--bed_fn [BED_FN]] [--sampleName [SAMPLENAME]]
                       [--min_af MIN_AF] [--snp_min_af SNP_MIN_AF]
                       [--indel_min_af INDEL_MIN_AF] [--gvcf GVCF]
                       [--qual QUAL] [--samtools SAMTOOLS] [--pypy PYPY]
                       [--python PYTHON] [--fast_mode FAST_MODE]
                       [--enable_phasing_model ENABLE_PHASING_MODEL]
                       [--minCoverage MINCOVERAGE] [--minMQ MINMQ]
                       [--minBQ MINBQ] [--bp_resolution] [--haploid_precise]
                       [--haploid_sensitive] [--call_snp_only CALL_SNP_ONLY]
                       [--enable_long_indel ENABLE_LONG_INDEL]
                       [--keep_iupac_bases KEEP_IUPAC_BASES]
                       [--phasing_info_in_bam] [--base_err BASE_ERR]
                       [--gq_bin_size GQ_BIN_SIZE]
                       [--temp_file_dir TEMP_FILE_DIR] [--use_gpu USE_GPU]
                       [--tensorflow_threads TENSORFLOW_THREADS]
                       [--extend_bed [EXTEND_BED]]
call_var_bam.py: error: unrecognized arguments: --show_ref

We can prevent this error by editing /opt/bin/run_clair3_rna to envoke the --gvcf option.

But, when we get the gvcf not all reference positions are present. You get only a handful of RefCall sites.

It seems if you set --snp_min_af to 0 this outputs most of the sites in the reference as expected, however those at the very start of the reference are omitted. Is this expected behaviour? Adding --indel_min_af 0 does not help any further.

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  test
REF       17      .       T       .       19.07   RefCall .       GT:GQ:DP:AD:AF:PL       0/0:19:558:545:0.9767:990
REF        18      .       T       .       19.19   RefCall .       GT:GQ:DP:AD:AF:PL       0/0:19:563:549:0.9751:990
REF        19      .       T       .       15.37   RefCall .       GT:GQ:DP:AD:AF:PL       0/0:15:573:567:0.9895:990
REF        20      .       A       .       16.23   RefCall .       GT:GQ:DP:AD:AF:PL       0/0:16:574:565:0.9843:990

They are well covered:

samtools depth test.bam | more
REF        1       364
REF       2       415
REF        3       425
REF        4       434
REF        5       436
REF        6       448
REF        7       449
REF        8       399
REF       9       391
REF        10      500
REF       11      478
REF        12      532
REF        13      520
REF        14      544
REF        15      554
REF        16      556

I do notice the following message:

[faidx] Truncated sequence: REF:1-2312

If I index my reference there is no problem:

samtools faidx reference.fasta
cat reference.fasta.fai
REF       1312    10      1312    1313

This may or may not be related to omission of positions at the start of the reference in the VCF?

@davidnewman02
Copy link

We further observed that the sites output to the vcf file varied with depth. Using a ~1kb transcript at very high depth (>1000x) we find that we get a line in the vcf for all except the first/last 16 bases (as noted above), but at 50x depth a number of sites are missing from the output.

1000x depth output

...
sample  90      .       C       .       20.49   RefCall .       GT:GQ:DP:AD:AF:PL       0/0:20:991:964:0.9728:990
sample  91      .       A       .       18.54   RefCall .       GT:GQ:DP:AD:AF:PL       0/0:18:992:962:0.9698:990
sample  92      .       A       .       10.16   RefCall .       GT:GQ:DP:AD:AF:PL       0/0:10:993:981:0.9879:990
sample  93      .       G       .       11.40   RefCall .       GT:GQ:DP:AD:AF:PL       0/0:11:993:986:0.9930:990
sample  94      .       G       .       13.96   RefCall .       GT:GQ:DP:AD:AF:PL       0/0:13:993:988:0.9950:990
sample  95      .       G       .       16.93   RefCall .       GT:GQ:DP:AD:AF:PL       0/0:16:993:988:0.9950:990
sample  96      .       C       .       18.39   RefCall .       GT:GQ:DP:AD:AF:PL       0/0:18:993:974:0.9809:990
sample  97      .       G       .       15.67   RefCall .       GT:GQ:DP:AD:AF:PL       0/0:15:993:978:0.9849:990
sample  98      .       A       .       20.74   RefCall .       GT:GQ:DP:AD:AF:PL       0/0:20:993:973:0.9799:990
sample  99      .       G       .       19.93   RefCall .       GT:GQ:DP:AD:AF:PL       0/0:19:996:983:0.9869:990
sample  100     .       G       .       16.31   RefCall .       GT:GQ:DP:AD:AF:PL       0/0:16:996:987:0.9910:990
...

50x depth output

...
sample  89      .       G       .       17.67   RefCall .       GT:GQ:DP:AD:AF:PL       0/0:17:50:49:0.9800:990
sample  91      .       A       .       19.12   RefCall .       GT:GQ:DP:AD:AF:PL       0/0:19:50:48:0.9600:990
sample  96      .       C       .       21.01   RefCall .       GT:GQ:DP:AD:AF:PL       0/0:21:50:49:0.9800:990
sample  99      .       G       .       20.49   RefCall .       GT:GQ:DP:AD:AF:PL       0/0:20:50:47:0.9400:990
sample  100     .       G       .       21.81   RefCall .       GT:GQ:DP:AD:AF:PL       0/0:21:50:49:0.9800:990
...

There doesn't seem an obvious pattern to those included/excluded at lower depths. For example position 93, 98 have the lowest/highest QUAL at high depth and are both missing from the low depth data, similarly 91, 100 are included and represent the lowest/highest allelic depth.

These output were run with the following settings, with --print_ref_calls switch for --gvcf in the run_clair3_rna script

./run_clair3_rna --bam_fn input_data.bam --ref_fn reference.fasta --threads 16 --platform ont_dorado_drna004 --output_dir output --snp_min_af 0.0 --min_coverage 1 --sample_name sample1 --output_prefix sample --include_all_ctgs --print_ref_calls --indel_min_af 0.0

@aquaskyline aquaskyline added the bug Something isn't working label Nov 29, 2024
@aquaskyline
Copy link
Member

aquaskyline commented Nov 29, 2024

The GVCF output module in Clair3-RNA was for Clair3 and has not yet been tuned for Clair3-RNA. For me to plan better, may I ask how often would GVCF output be used for RNA-Seq based variant calling? Or just fixing the logic of --print_ref_calls would suffice?

@mattdmem
Copy link
Author

Well, I would think fixing the logic so it does not error on --print-ref-calls or has unexpected behaviour described above would be in all users interests.

In our case getting an estimation of the QUAL of ref calls at each site is important, so I think it would be used very frequently.

@zhengzhenxian
Copy link
Collaborator

zhengzhenxian commented Dec 2, 2024

@mattdmem

We have released an update v0.2.1 to fix the bug that misses a few variants with the --print_ref_calls option.

Missing variants in the head and tail 16 bp of each sequence was another issue and was caused by the design of the Clair3 networks. Alignments were usually less reliable in these regions, and insufficient context will be fed to the Clair3 networks for candidates in these regions, causing increased FP. In the new version, we added a new option --enable_variant_calling_at_sequence_head_and_tail to get the variants in these regions called. We required reads with MQ≥5, and not secondary and supplementary alignments in these regions.

Please re-pull the docker image and give the new version a try. Let us know for any questions.

@zhengzhenxian
Copy link
Collaborator

@davidnewman02

Clair3-RNA skips considering those positions with only reference allele support as candidates, even when --snp_min_af and --indel_min_af are set to 0. This explains why some RefCall positions with a higher coverage are shown, but those with a lower coverage are not shown (because there is no alternative allele at the positions). This is the current behavior of Clair3-RNA that we have chosen to speed up Clair3-RNA by reducing the number of candidates. Let us know if those positions with no alternative alleles should also be considered as candidates so as to show them as RefCall when --print_ref_call is enabled, and we can modify the behavior.

@mattdmem
Copy link
Author

mattdmem commented Dec 4, 2024

Thanks - I think we would expect, like you do with a gVCF to have all positions regardless of the presence of an alternative allele.

@zhengzhenxian
Copy link
Collaborator

@mattdmem

Thanks. We have modified the logic to output all positions with at least one read support when --snp_min_af or --indel_min_af is set to 0.0. Please kindly try the updated docker image.

@davidnewman02
Copy link

Hi @zhengzhenxian,

Thanks for implementing these changes. I can confirm that the --print_ref_calls options now works as expected, however, I couldn't get the --enable_variant_calling_at_sequence_head_and_tail option to work.

(built-locally from commit: a101dc6b)

[INFO] Pileup Model Prediction                                                                                                                                                                                                                                                    [INFO] RUN THE FOLLOWING COMMAND:

( parallel -C " " --joblog /data/clair3_rna_eval/tmp_out/logs/parallel_1_predict.log -j 16 python3 /data/clair3_rna_eval/Clair3-RNA/clair3_rna.py call_var_bam --chkpnt_fn ~/miniconda3/envs/clair3_rna/bin/clair3_rna_models/ont_dorado_drna004/variables --bam_fn /data/clair3_rna_eval/input_data/subsample.8000.bam --call_fn /data/clair3_rna_eval/tmp_out/tmp/pileup_output/pileup_{1}_{2}.vcf --sampleName sample1 --ref_fn /data/clair3_rna_eval/fake_refs/refs/ref_10_G.fasta --extend_bed /data/clair3_rna_eval/tmp_out/tmp/split_beds/{1} --ctgName {1} --chunk_id {2} --chunk_num {3} --platform ont --snp_min_af 0.0 --indel_min_af 0.0 --minMQ 5 --minCoverage 20 --samtools samtools --pileup  --cmd_fn /data/clair3_rna_eval/tmp_out/tmp/CMD --enable_variant_calling_at_sequence_head_and_tail True  :::: /data/clair3_rna_eval/tmp_out/tmp/CHUNK_LIST ) 2>&1 | tee /data/clair3_rna_eval/tmp_out/logs/1_call_var_bam_pileup.log

[INFO] Delay 1 seconds before starting variant calling ...
[faidx] Truncated sequence: sample:1-2022
[mpileup] 1 samples in 1 input files
Calling variants ...
Exception in thread Thread-1:
Traceback (most recent call last):
  File "~/miniconda3/envs/clair3_rna/lib/python3.9/threading.py", line 950, in _bootstrap_inner
    self.run()
  File "~/miniconda3/envs/clair3_rna/lib/python3.9/threading.py", line 888, in run
    self._target(*self._args, **self._kwargs)
  File "/data/clair3_rna_eval/Clair3-RNA/clair3_rna/call_variants.py", line 1491, in load_mini_batch
    mini_batches_loaded.append(next(tensor_generator))
  File "/data/clair3_rna_eval/Clair3-RNA/clair3_rna/utils.py", line 113, in tensor_generator_from
    if seq[param.flankingBaseNum] not in BASE2NUM:
IndexError: string index out of range
Total processed positions in sample (chunk 1/1) : 0
Total time elapsed: 0.17 s
[INFO] No vcf output for file /data/clair3_rna_eval/tmp_out/tmp/pileup_output/pileup_sample_1.vcf, remove empty file
Traceback (most recent call last):
  File "/data/clair3_rna_eval/Clair3-RNA/clair3_rna/../clair3_rna.py", line 108, in <module>
    main()
  File "/data/clair3_rna_eval/Clair3-RNA/clair3_rna/../clair3_rna.py", line 102, in main
    submodule.main()
  File "/data/clair3_rna_eval/Clair3-RNA/src/create_tensor_pileup.py", line 684, in main
    CreateTensorPileup(args)
  File "/data/clair3_rna_eval/Clair3-RNA/src/create_tensor_pileup.py", line 539, in CreateTensorPileup
    tensor_can_fp.stdin.write(l)
BrokenPipeError: [Errno 32] Broken pipe

I tried a few different configurations including with/without --print_ref_calls and introducing a fake-variant in the edge region, but couldn't get this to work.

@zhengzhenxian
Copy link
Collaborator

@davidnewman02

We attempted to repeat your error using different data and configurations but still failed on our slide. Not sure if the issue is related to the reference. Could you please send your reference and BAM files to my email at [email protected] for our testing? Many thanks for your help.

Zhenxian

@aquaskyline
Copy link
Member

@mattdmem just in case the reference and bam cannot be shared, would you mind tweaking the code a bit and print out param.flankingBaseNum to see how it goes out of the range?

@davidnewman02
Copy link

Although I cannot share the original data I've been able to reproduce this error by generating a random pileup of data and running Clair3-RNA on that.

Clair3-RNA_issue-6.zip

@zhengzhenxian
Copy link
Collaborator

@davidnewman02 Great thanks for sharing the demo data. We have fixed the reference range issue, please try the latest code, thanks.

Zhenxian

@davidnewman02
Copy link

Hi @zhengzhenxian,
Thanks for the quick response to this issue, I have run the latest Clair3-RNA on my sample data and confirm the issue is resolved.
David

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants