Assembly does not improve #172

Open
gubrins opened this issue Mar 7, 2023 · 3 comments

Comments

@gubrins

gubrins commented Mar 7, 2023

Hey,

I am working with two closely related species, and I have HiFi and Hi-C data for both. I ran exactly the same pipeline on each. For species 1, the assembly improves after SALSA. For species 2, however, the N50 after SALSA is the same as it was before scaffolding.
During scaffolding, I get this warning: `WARNING: Not enough Hi-C reads for scaffolding`. What does this mean?
This is the summary I get from gfastats after the scaffolding:

`+++Summary+++:

scaffolds: 356

Total scaffold length: 1502913456
Average scaffold length: 4221667.01
Scaffold N50: 67491308
Scaffold auN: 81379285.48
Scaffold L50: 7
Largest scaffold: 203202437

contigs: 403

Total contig length: 1502889956
Average contig length: 3729255.47
Contig N50: 67491308
Contig auN: 81095181.14
Contig L50: 7
Largest contig: 203202437

gaps: 47

Total gap length: 23500
Average gap length: 500.00
Gap N50: 500
Gap auN: 500.00
Gap L50: 24
Largest gap: 500
Base composition (ACGT): 448804358, 302773097, 302741034, 448571467
GC content %: 40.29

soft-masked bases: 0

paths: 356

`

As you can see, the scaffold N50 and the contig N50 are the same: 67491308.
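
For reference, N50 is the length of the sequence at which the cumulative sum of lengths, sorted longest first, reaches half of the total. The sketch below uses made-up toy lengths (not the real assembly) to illustrate one situation in which scaffold N50 and contig N50 coincide: joins that only involve the smallest contigs. The function name `n50_l50` is hypothetical.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # 'length' is the N50, 'count' is the L50
            return length, count
    raise ValueError("no sequence lengths given")


# Toy lengths, made up for illustration only.
contigs = [100, 80, 60, 40, 20]
# Same set after joining only the two smallest contigs with a 5 bp gap.
scaffolds = [100, 80, 60, 65]

print(n50_l50(contigs))
print(n50_l50(scaffolds))
```

Both calls report the same N50 and L50, because the sequences that carry the first half of the total length were not joined to anything.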

I am also adding the log output, just in case it helps:

bedfile loaded
Starting Iteration 1
bedfile started
bedfile loaded
Loading Hi-C links 
Hybrid scaffold graph loaded, nodes = 806 edges = 450
Hi-C implied edges = 0
Starting Iteration 2
bedfile started
bedfile loaded
Starting Iteration 2
WARNING: Not enough Hi-C reads for scaffolding
Loading Hi-C links 
Hybrid scaffold graph loaded, nodes = 688 edges = 350
Hi-C implied edges = 0
python2 /home/panthera/bin/RE_sites.py -a scafolding_omanensis/assembly.cleaned.fasta -e GANTC > scafolding_omanensis/re_counts_iteration_1
python2 /home/panthera/bin/make_links.py -b scafolding_omanensis/alignment_iteration_1.bed -d scafolding_omanensis -i 1 -x abc
python2 /home/panthera/bin/fast_scaled_scores.py -d scafolding_omanensis -i 1
sort -k 5 -gr scafolding_omanensis/contig_links_scaled_iteration_1 > scafolding_omanensis/contig_links_scaled_sorted_iteration_1
python2 /home/panthera/bin/layout_unitigs.py -x abc -l scafolding_omanensis/contig_links_scaled_sorted_iteration_1 -c 1000 -i 1 -d scafolding_omanensis
/home/panthera/bin/break_contigs -a scafolding_omanensis/alignment_iteration_2.bed -b scafolding_omanensis/breakpoints_iteration_2.txt -l scafolding_omanensis/scaffold_length_iteration_2 -i 2 -s 100   > scafolding_omanensis/misasm_iteration_2.report
python2 /home/panthera/bin/refactor_breaks.py -d scafolding_omanensis -i 2
python2 /home/panthera/bin/make_links.py -b scafolding_omanensis/alignment_iteration_2.bed -d scafolding_omanensis -i 2
python2 /home/panthera/bin/layout_unitigs.py -x abc -l scafolding_omanensis/contig_links_scaled_sorted_iteration_2 -c 1000 -i 2 -d scafolding_omanensis
/home/panthera/bin/break_contigs -a scafolding_omanensis/alignment_iteration_3.bed -b scafolding_omanensis/breakpoints_iteration_3.txt -l scafolding_omanensis/scaffold_length_iteration_3 -i 3 -s 100  > scafolding_omanensis/misasm_iteration_3.report
python2 /home/panthera/bin/refactor_breaks.py -d scafolding_omanensis -i 3 > scafolding_omanensis/misasm_3.log

This is the command I used to scaffold the assembly:
run_pipeline.py --assembly purged.fa --length purged.fa.fai --bed combined.bed --enzyme GANTC --output scaffolded

Any help would be appreciated!

Thanks in advance!

@ckeeling

ckeeling commented Jun 15, 2023

Hi @gubrins, did you find a solution? I'm just another SALSA user. Maybe your combined.bed file isn't as it should be. Did you use the method described at https://github.com/ArimaGenomics/mapping_pipeline to clean up, map, and prepare the sorted bed file? What does `cat final.bam.stats` give you? Something like this, with substantial "All inter" mappings?

All 110216052
All intra 61586755
All intra 1kb 24551570
All intra 10kb 15238027
All intra 15kb 13597611
All intra 20kb 12478943
All inter 48629297
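
To compare two stats files at a glance, the inter-contig share can be computed from the counts. This is a rough sketch that assumes the file has the "label count" layout shown above; the function name `parse_bam_stats` is hypothetical, and the numbers are the ones from my own file, not from yours.

```python
def parse_bam_stats(text):
    """Parse 'label count' lines such as 'All inter 48629297' into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only lines that end in an integer count.
        if len(parts) >= 2 and parts[-1].isdigit():
            stats[" ".join(parts[:-1])] = int(parts[-1])
    return stats


example = """\
All 110216052
All intra 61586755
All intra 1kb 24551570
All intra 10kb 15238027
All intra 15kb 13597611
All intra 20kb 12478943
All inter 48629297
"""

stats = parse_bam_stats(example)
inter_fraction = stats["All inter"] / stats["All"]
print(f"inter fraction: {inter_fraction:.3f}")
```

For a real file, `example` could be replaced with `open("final.bam.stats").read()`.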

Chris

@gubrins
Author

gubrins commented Jun 16, 2023

Hi Chris, thanks for your interest! Unfortunately, I did not manage to improve it. The bed file should be fine, because I did exactly the same for both species and it worked for one but not the other, so I suspect the Hi-C data is not as good as I thought.
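
One check that does not depend on Hi-C data quality is whether every contig name in the BED file also appears in the .fai index passed via `--length`. A minimal sketch, assuming a standard tab-separated BED (contig name in column 1) and a samtools-style .fai (contig name in column 1); the function name and the toy records are hypothetical:

```python
def contig_names_missing_from_index(bed_lines, fai_lines):
    """Return contig names used in the BED that are absent from the .fai."""
    fai_names = {line.split()[0] for line in fai_lines if line.strip()}
    bed_names = {line.split()[0] for line in bed_lines if line.strip()}
    return sorted(bed_names - fai_names)


# Toy records, made up for illustration only.
toy_fai = [
    "ctg1\t5000\t6\t60\t61",
    "ctg2\t3000\t5096\t60\t61",
]
toy_bed = [
    "ctg1\t100\t250\tread_a/1\t60\t+",
    "ctg2\t300\t450\tread_a/2\t60\t-",
    "ctg3\t10\t160\tread_b/1\t60\t+",
]

print(contig_names_missing_from_index(toy_bed, toy_fai))
```

For the real files, the two lists could be replaced with `open("combined.bed")` and `open("purged.fa.fai")`; an empty result means the names are consistent.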

@ckeeling

ckeeling commented Jun 16, 2023

It seems that your "All inter" counts in your bam stats file are low or zero. Is that true? That would confirm poor Hi-C data.
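
If the stats file is not at hand, a similar count can be taken from the BED file itself. The sketch below counts read pairs whose mates land on the same contig versus different contigs, since the pairs linking different contigs are the ones scaffolding relies on. It assumes a bamToBed-style BED in which the read name is in column 4 and the two mates differ only by a `/1` or `/2` suffix; the function name and the toy records are hypothetical.

```python
from collections import defaultdict


def count_pair_types(bed_lines):
    """Count read pairs whose mates map to the same or to different contigs."""
    contigs_by_pair = defaultdict(list)
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        contig, read_name = fields[0], fields[3]
        # Assumed naming: mates share a name and differ only in /1 or /2.
        if read_name.endswith(("/1", "/2")):
            pair_id = read_name[:-2]
        else:
            pair_id = read_name
        contigs_by_pair[pair_id].append(contig)

    counts = {"intra": 0, "inter": 0, "unpaired": 0}
    for contigs in contigs_by_pair.values():
        if len(contigs) != 2:
            counts["unpaired"] += 1  # mate missing or extra alignments
        elif contigs[0] == contigs[1]:
            counts["intra"] += 1
        else:
            counts["inter"] += 1
    return counts


# Toy records, made up for illustration only.
toy_bed = [
    "ctg1\t100\t250\tread_a/1\t60\t+",
    "ctg1\t9000\t9150\tread_a/2\t60\t-",
    "ctg1\t500\t650\tread_b/1\t60\t+",
    "ctg2\t300\t450\tread_b/2\t60\t-",
    "ctg2\t700\t850\tread_c/1\t60\t+",
]

print(count_pair_types(toy_bed))
```

This version keeps every pair in memory, so for a full-size file it is probably better to run it on a subsample or to stream through a name-sorted BED instead.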
