-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathlab_book
482 lines (349 loc) · 34.5 KB
/
lab_book
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
April 17th, 2017
Today I've done the BLAST of all 60717 L. fortunei proteins predicted in the genome against A. gambiae proteins related to fertility publishe
in a Nature Biotechnology paper (Hammond et al., 2016 doi:10.1038/nbt.3439 )
Database was all the protein sequences present at Supplementary Table 5 of the above paper, although 4 sequences listed on Table 5 was not
present on the genome of A. gambiae downloaded from the Esembl browser, nor were find went specificaly searched on the Esembl website.
Those not found sequences are: AGAP011235-PA, AGAP010915-PA, AGAP007997-PA, AGAP012170-PA.
First local BLAST results are at biobureau gdrive: https://drive.google.com/drive/u/1/folders/0B2pYU5fjrq3EczhILWRyWDBTa3M
April 24th, 2017.
Organizing all the BLASTs to find the fertility genes in L. fortunei. We are using the genes employed by Crisanti
blastp -query Anopheles_gambiae.AgamP4.pep.crisanti.fasta -db /home/ubuntu/nr -evalue 1e-05 -num_threads 4 -out Anopheles_gambiae.AgamP4.pep.crisanti.fasta.blast-NR.1e05 &
1-) Another BLAST from today: L. fortunei proteins against molluscan reproduction genes Ju_Americo have choosen.
makeblastdb -in mollusca.rep.fa -dbtype prot
blastp -query marcela.fasta -db molluscan_genes/mollusca.rep.fa -evalue 1e-05 -num_threads 4 -out marcela.mollusca.rep.fa1e05 &
April, 25th 2017
-BLASTing all reproduction mollucans mRNA and proteins against marcela.fasta
I'm doing this in Amazon, in the isntance called CTG_BLAST i-0b326ee60f77046d7
blastp -query marcela.fasta -db molluscan_genes/mollusca.rep.fa -evalue 1e-05 -num_threads 4 -out marcela.mollusca.rep.fa1e05 &
-BLASTing L. fortunei VASA protein JuAmerico found with the transcripts assembled from Trinity. Just to check. The protein is well conserved
but it looks like its missing 100aa at the beginning.
/data2/Marcela/scripts/ncbi-blast-2.4.0+/bin/tblastn -query vasa.lf.fa -db /data2/Marcela/data/rna-seq/trinti.1.2.3/Trinities.1.2.3.fasta -evalue 1e-05 -num_threads 4 -out vasa.trinity1.2.3.tblastn &
After that, I got all the trinity hits and translated them using transdecode, so we can check how similar to the genomic de novo preditec vasa the transcripts are:
perl /home/bioma/scripts/PASApipeline-2.0.2/pasa-plugins/transdecoder/transcripts_to_best_scoring_ORFs.pl -t vasa.trinity1.2.3.tblastn.cov.trinityID.fasta
-BLASTing L. fortunei (marcela.fasta) against C. elegans proteins selected by Ju (3979 seqs, sterile-phenoype).
makeblastdb -in c_elegans.PRJNA13758.WS258.protein.sterile.fa -dbtype prot
nohup blastp -query ../marcela.fasta -db /home/ubuntu/c_elegans/c_elegans.PRJNA13758.WS258.protein.sterile.fa -evalue 1e-05 -num_threads 4 -out marcela.c_elegans.PRJNA13758.WS258.protein.sterile.1e05 &
26 April, 2017
How to get a whole line by a list of IDs.
awk 'NR==FNR{tgts[$1]; next} $1 in tgts' file1 file2
April 27th, 2017.
Today I'll curate the possible VASA gene Ju found among the golden Mussel proteins. Which means I'll look into the scaffolds from
where it was predicted, and see how much RNA-seq there is aligned to it.
So the first thing is to find which scaffold it is. And it is this:
L_FORTUNEI.itr6_4515
Then, you get all the rna-seq aligned to it in a bam file using samtools
samtools view -h all.merged.sorted.bam L_FORTUNEI.itr6_4515_ > L_FORTUNEI.itr6_4515_all.merged.sam
then
samtools view -Sb L_FORTUNEI.itr6_4515_all.merged.sam -o L_FORTUNEI.itr6_4515_all.merged.bam
samtools sort
samtools index
then
Get the fasta for the scaffold
and
Get the gff for the scaffold
/data2/Marcela/scripts/Python-3.5.2/python /data2/Marcela/scripts/biocode/gff/filter_gff3_by_id_list.py -i /data2/Marcela/pasa.itr6/marcelAAAa_pasa.gene_structures_post_PASA_updates.85229.gff3 -o id.bla.gff3 -l /data2/Marcela/hisat2/id.bla
Then sort, zip and tabix the gff
sort -k1,1 -k4,4n -rk3,3n id.bla.gff3 > id.bla.sorted.gff
bgzip id.bla.sorted.gff
tabix -p gff id.bla.sorted.gff.gz
May, 4th (be with you!) 2017.
Finding best reciprocal BLAST hit!
I found this script and I'm using it: https://github.com/peterjc/galaxy_blast/blob/master/tools/blast_rbh/blast_rbh.py
The first thing I did was to use the BLAST matches ju found between LF and C_elegans (Celegans_Agambiae_Hits_evalue50_ID60.xlsx).
The fasta files are at local machine:
/home/bioma/2017_marcela/CTG/lf.blas.celegans.interest.fasta
/home/bioma/2017_marcela/CTG/c_elegens.interested.fasta
I've modified same names in the c_elegans file because there are more than 1 protein per locus, so they have the same inicial name
the file is called c_elegens.interested.id.modified.forRBBH (on gdocs)
Then I run the program localy:
bioma@bioma-XPS-8300:~/2017_marcela/CTG$ python ../../scripts/filterfasta.py -i lf.blas.celegans.interest.id ../mexilhao_tese_preeprint/marcela.fasta > lf.blas.celegans.interest.fasta
bioma@bioma-XPS-8300:~/2017_marcela/CTG$ python ../../scripts/rbbh.py -a prot -t blastp -o output.tsv lf.blas.celegans.interest.fasta c_elegens.interested.fasta
Result:
Done, 19 RBH found
Warning: Sequences with tied best hits found, you may have duplicates/clusters
Result file called: lf.ce-elegans.RBBH (on gdocs)
- RBBH analysis with molluscan proteins. Only vasa protein got that for now
python ../../scripts/filterfasta.py -i molluscan.reproduction.LF.blastp.lf.id ../mexilhao_tese_preeprint/marcela.fasta > molluscan.reproduction.LF.blastp.lf.fasta
less molluscan.reproduction.LF.blastp.lf.fasta
python ../../scripts/rbbh.py -h
python ../../scripts/rbbh.py -a prot -t blastp -o lf.molluscan-reproduction.RBBH --threads=4 molluscan.reproduction.LF.blastp.lf.fasta sequence-proteins-JU.fasta
ls -ltr
less lf.molluscan-reproduction.RBBH
10th May of 2017.
- Already asked for a quotation to resequence a male and a female golden mussel (genone).
- Today I'm looking into which proteins and scaffolds the trinity (vasa) transcripts TR79527-c2__g1__i3 len=3794, TR76527-c0_g1_i2 len=3728, TR76527-c0_g1_i5 len=3477
represent.
So, First I've blasted them against marcela.fasta
blastx -query trinity.vasas.fasta -db /data2/Marcela/marcela.fasta.an/marcela.fasta -evalue 1e-05 -out trinity-vasas_marcela.fasta.blastx1e05
Best hits:
TR79527-c2__g1__i3 itr6_4515_pi.g108.t1 100.00 562 0 0 1111 2796 17 578 0.0 1127 44.44 91.68
TR76527-c0_g1_i2 itr6_4515_pi.g108.t1.1.57f69240 100.00 537 0 0 1120 2730 17 553 0.0 1075 43.21 91.33
TR76527-c0_g1_i5 itr6_4515_pi.g108.t1.1.57f69240 100.00 537 0 0 869 2479 17 553 0.0 1075 46.33 91.33
All "good" hits:
TR79527-c2__g1__i3 itr6_4515_pi.g108.t1 100.00 562 0 0 1111 2796 17 578 0.0 1127 14.81 91.68
TR79527-c2__g1__i3 itr6_4515_pi.g108.t1.1.57f69240 99.46 552 3 0 1141 2796 2 553 0.0 1099 14.47 93.37
TR79527-c2__g1__i3 itr6_4515_pi.g108.t1.2.57f69240 99.43 527 3 0 1216 2796 2 528 0.0 1045 13.81 93.07
TR79527-c2__g1__i3 itr6_4515_pi.g108.t1.3.57f69240 99.81 518 1 0 1111 2664 17 534 0.0 1031 13.63 92.99
TR76527-c0_g1_i2 itr6_4515_pi.g108.t1.1.57f69240 100.00 537 0 0 1120 2730 17 553 0.0 1075 14.40 91.33
TR76527-c0_g1_i2 itr6_4515_pi.g108.t1 100.00 537 0 0 1120 2730 42 578 0.0 1075 14.40 87.60
TR76527-c0_g1_i2 itr6_4515_pi.g108.t1.2.57f69240 99.43 527 3 0 1150 2730 2 528 0.0 1046 14.06 93.07
TR76527-c0_g1_i2 itr6_4515_pi.g108.t1.3.57f69240 99.80 493 1 0 1120 2598 42 534 0.0 979 13.20 88.49
TR76527-c0_g1_i5 itr6_4515_pi.g108.t1.1.57f69240 100.00 537 0 0 869 2479 17 553 0.0 1075 15.44 91.33
TR76527-c0_g1_i5 itr6_4515_pi.g108.t1 100.00 537 0 0 869 2479 42 578 0.0 1075 15.44 87.60
TR76527-c0_g1_i5 itr6_4515_pi.g108.t1.2.57f69240 99.43 527 3 0 899 2479 2 528 0.0 1046 15.07 93.07
TR76527-c0_g1_i5 itr6_4515_pi.g108.t1.3.57f69240 99.80 493 1 0 869 2347 42 534 0.0 979 14.15 88.49
- Ok, so I got these three transcripts from above and got the results from them from transdecoder. I then alligned these transdecoder
protein sequences + vasa from uniprot for mytilus and c. gigas + the 3 vasa proteins predicted in the genome. The alligment shows
that is possible the genome-predicted proteins are incomplete, but that the transdecoder has the beginning of them. Now I'll
investigate the gff and the scaffold where they are coded.
Aligment result in the local computer:
/home/bioma/2017_marcela/CTG/transdecoderlong.genome.mitgigas.vasas.aln
11th of May, 2017
Trying to find RBBH with all mollusks versus marcela.fasta
python ../../scripts/rbbh.py -a prot -t blastp -o allmoluscs.marcelafasta sequence-proteins-JU.fasta /home/bioma/2017_marcela/mexilhao_tese_preeprint/marcela.fasta
18th of May, 2017
Testing how to create consensus from mapped reads using samtools.
So, I've downloaded a bam file from allegro, where the 500bp pair end were aligned to the genome. Then we have:
samtools mpileup -d 8000 -E -uf Draftfinal_itr6.pilon6.fasta itr6finalcoverage.500.006.sorted.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fastq
Also check:
https://www.biostars.org/p/70585/
https://www.biostars.org/p/225064/
22th of May, 2017.
How to find promotor regions in silico??
Chapter I'm reading at the moment:
https://www.researchgate.net/publication/272440150_In_silico_Identification_of_Eukaryotic_Promoters
To download packages like softberry
http://www.softberry.com/berry.phtml?topic=fdp.htm&no_menu=on
Running Fprom: Program predicts potential transcription start positions by linear discriminant function combining
characteristics describing functional motifs and oligonucleotide composition of these sites. FProm uses file with
selected factor binding sites from currently supported functional site data base.
./fprom /home/bioma/2017_marcela/mexilhao_tese_preeprint/LF.genome.NCBI4.1.fasta > /home/bioma/2017_marcela/CTG/LF.genome.NCBI4.1.fasta.fprom
Other online tools for single sequences search:
http://www.genetools.us/genomics/Promoter%20databases%20and%20prediction%20tools.htm
Promoter regions:
A promoter region is generally defined as any genomic DNA where the transcrip-
tion machinery assembles and initiates transcription. The promoter region consists
of protein binding regions along with the transcription start site (TSS).
In Eukaryotes, the promoter regions are broadly classified as core promoters,
proximal promoters and distal promoters. The core promoter region, where the actual
basal transcription machinery assembles, is 30–100 nucleotides in length. These
regions are characterized by the presence of sequence motifs such as the TATA box
and the Inr element.
They may also contain downstream elements like DPE, MTE
(in humans) along with the associated TSS (Juven-Gershon et al. 2008; Thomas
and Chiang 2006). The proximal promoter regions are the sequences located within
500 base pairs relative to the TSS and contain certain proximal promoter elements,
which include the GC box, the CAAT box, cis-regulatory modules (CRM) (Lenhard
and Sandelin 2012), etc. Distal promoter elements include enhancers, insulators and
silencers. The distal promoter region does not have a well-defined length and can
extend up to 10 kb from the TSS in upstream as well as downstream regions. Distal
promoters interact with transcription activators to increase the rate of transcription. In
vertebrates, it is known that 5 % of the genes code for specific transcription activators,
which interact with proximal and distal promoter regions.
Promoter: core (TATA Box, Inr elements 30-100 bp) + proximal (500 bp relative to the TSS – CG box, CAAT box, cis-regulatory elements) + distal
Programs available:
Fprom – softberry (já rodando)
22th of May, 2017
Identification of beta-tubulins:
I've downloaded some beta-tubulins from Uniprot for C. gigas and for Mytilus. File: /home/bioma/2017_marcela/CTG/CG.mytilus.uniprot.betaTubulin.fa
Then runned a BLAST:
makeblastdb -in CG.mytilus.uniprot.betaTubulin.fa -dbtype prot
blastp -query marcela.fasta -db CG.mytilus.uniprot.betaTubulin.fa -evalue 1e-05 -out marcela.CG.mytilus.uniprot.betaTubulin1e05 &
BLAST BHcov result:
itr6_5034_pi.g1088.t1 tr|K1R278|K1R278_CRAGI 38.39 435 256 12 404 838 2 424 3e-113 343 50.88 94.84
itr6_5034_pi.g1091.t1 tr|K1R278|K1R278_CRAGI 39.26 433 253 10 37 469 2 424 5e-122 354 89.09 94.84
itr6_5034_pi.g1085.t1 tr|K1R278|K1R278_CRAGI 39.14 419 245 10 1 419 16 424 3e-117 340 96.10 91.70
itr6_5034_pi.g1096.t1 tr|K1R278|K1R278_CRAGI 39.40 434 253 10 15 448 1 424 3e-123 357 93.33 95.07
itr6_3927_pi.g551.t1 tr|K1R7V7|K1R7V7_CRAGI 45.14 288 154 4 43 329 2 286 2e-81 246 73.21 63.90
itr6_3927_pi.g551.t1.1.57f6923f tr|K1R7V7|K1R7V7_CRAGI 45.33 289 154 4 1 288 1 286 1e-82 248 82.05 64.13
itr6_2039_pi.g1537.t1 tr|K1R278|K1R278_CRAGI 39.63 434 252 10 1 434 1 424 2e-123 357 96.23 95.07
itr6_26_pilo.g431.t1 tr|K1R278|K1R278_CRAGI 47.37 57 30 0 2 58 368 424 4e-18 65.9 73.08 12.78
itr6_4255_pi.g441.t1 tr|K1R278|K1R278_CRAGI 24.79 121 80 11 273 393 313 422 1e-09 48.5 24.35 24.66
itr6_4255_pi.g438.t1 tr|K1R7V7|K1R7V7_CRAGI 37.43 171 73 34 10 147 106 275 3e-27 97.1 67.98 38.12
itr6_222_pil.g2785.t1 tr|K1QAJ5|K1QAJ5_CRAGI 92.62 325 23 1 42 366 10 333 0.0 595 47.17 75.52
itr6_222_pil.g2784.t1 tr|K1PN21|K1PN21_CRAGI 91.73 411 24 10 134 534 20 430 0.0 781 73.04 92.36
itr6_222_pil.g2778.t1 tr|K1R7V7|K1R7V7_CRAGI 99.33 446 3 0 1 446 1 446 0.0 935 100.00 100.00
itr6_222_pil.g2782.t1 tr|K1QAJ5|K1QAJ5_CRAGI 98.51 202 3 0 158 359 228 429 1e-149 419 56.27 47.09
itr6_222_pil.g2786.t1 tr|K1R7V7|K1R7V7_CRAGI 98.65 446 5 1 1 445 1 446 0.0 925 100.00 100.00
itr6_222_pil.g2783.t1 tr|K1QAJ5|K1QAJ5_CRAGI 96.49 399 14 0 3 401 31 429 0.0 814 99.50 93.01
itr6_2662_pi.g1051.t1.1.57f6923e tr|Q5NT88|Q5NT88_CRAGI 38.54 410 242 10 20 429 2 401 4e-112 330 77.65 89.69
itr6_2662_pi.g1051.t1 tr|Q5NT88|Q5NT88_CRAGI 38.54 410 242 10 20 429 2 401 4e-112 330 77.65 89.69
itr6_3015_pi.g1241.t1 tr|K1R278|K1R278_CRAGI 38.87 301 176 8 6 306 132 424 9e-80 240 93.19 65.70
itr6_3015_pi.g1240.t1 tr|G0YFD6|G0YFD6_MYTED 42.25 71 39 2 486 556 1 69 1e-14 61.2 12.68 40.59
itr6_3841_pi.g2427.t1 tr|K1QAJ5|K1QAJ5_CRAGI 48.78 82 34 8 1 82 250 323 2e-19 70.5 90.11 17.25
itr6_3841_pi.g2425.t1 tr|Q5NT88|Q5NT88_CRAGI 92.41 316 24 0 93 408 58 373 0.0 610 77.07 70.85
itr6_3841_pi.g2428.t1 tr|K1PN21|K1PN21_CRAGI 97.30 74 2 0 88 161 175 248 4e-48 151 45.68 16.63
itr6_3841_pi.g2426.t1 tr|K1PN21|K1PN21_CRAGI 55.32 188 49 35 75 227 66 253 6e-57 177 67.11 42.25
itr6_165_pil.g2044.t1 tr|K1R278|K1R278_CRAGI 38.37 443 254 19 1 443 1 424 2e-119 347 96.30 95.07
itr6_165_pil.g2047.t1 tr|K1R278|K1R278_CRAGI 38.86 368 217 8 856 1223 65 424 5e-93 296 29.70 80.72
itr6_165_pil.g2045.t1 tr|K1R278|K1R278_CRAGI 39.55 359 209 8 38 396 74 424 1e-101 299 86.92 78.70
itr6_165_pil.g2046.t1 tr|G0YGJ4|G0YGJ4_MYTED 43.33 60 34 0 12 71 41 100 1e-17 62.8 83.33 35.93
itr6_1167_pi.g2577.t1 tr|K1R7V7|K1R7V7_CRAGI 43.24 37 20 1 1 37 1 36 3e-08 40.4 26.81 8.07
itr6_1167_pi.g2579.t1.1.57f6923b tr|Q5NT88|Q5NT88_CRAGI 32.14 56 37 1 21 75 150 205 2e-06 38.5 11.48 12.56
itr6_1167_pi.g2579.t1 tr|Q5NT88|Q5NT88_CRAGI 32.14 56 37 1 21 75 150 205 1e-06 38.9 12.30 12.56
itr6_2024_pi.g1264.t1 tr|K1R7V7|K1R7V7_CRAGI 97.98 445 7 2 1 443 1 445 0.0 918 99.11 99.78
itr6_2024_pi.g1264.t1.1.57f6923d tr|K1R278|K1R278_CRAGI 96.78 373 12 0 1 373 73 445 0.0 768 99.47 83.63
itr6_2024_pi.g1261.t1 tr|K1R7V7|K1R7V7_CRAGI 97.77 448 8 2 1 448 1 446 0.0 920 100.00 100.00
itr6_4995_pi.g934.t1 tr|K1R7V7|K1R7V7_CRAGI 27.68 112 73 8 58 168 298 402 3e-06 35.4 61.33 23.54
itr6_4995_pi.g933.t1 tr|K1R278|K1R278_CRAGI 28.81 236 140 28 4 215 5 236 1e-20 82.8 49.42 52.02
itr6_3744_pi.g1657.t1 tr|K1PN21|K1PN21_CRAGI 46.24 266 140 3 29 293 2 265 4e-77 232 87.17 59.33
itr6_3744_pi.g1658.t1 tr|K1R278|K1R278_CRAGI 34.59 133 79 8 1 133 300 424 4e-31 105 88.08 28.03
itr6_224_pil.g2812.t1 tr|K1QAJ5|K1QAJ5_CRAGI 83.39 283 23 24 42 302 78 358 3e-171 472 82.33 65.50
itr6_8379_pi.g36.t1 tr|K1QAJ5|K1QAJ5_CRAGI 97.20 214 5 1 1 214 216 428 7e-158 434 99.53 49.65
itr6_176_pil.g2226.t1 tr|K1QAJ5|K1QAJ5_CRAGI 28.28 99 69 2 528 624 277 375 9e-13 58.9 15.25 23.08
itr6_3438_pi.g1869.t1 tr|K1R278|K1R278_CRAGI 38.66 419 247 10 1 419 16 424 6e-108 337 34.07 91.70
itr6_3830_pi.g2323.t1 tr|K1R278|K1R278_CRAGI 94.55 404 22 0 106 509 26 429 0.0 804 76.37 90.58
itr6_3830_pi.g2324.t1 tr|K1R278|K1R278_CRAGI 94.86 214 11 0 186 399 1 214 3e-146 413 53.50 47.98
itr6_3830_pi.g2322.t1 tr|K1QAJ5|K1QAJ5_CRAGI 96.77 403 13 0 22 424 27 429 0.0 815 95.05 93.94
itr6_3830_pi.g2326.t1 tr|K1PN21|K1PN21_CRAGI 84.88 291 44 0 315 605 1 291 0.0 515 47.86 65.39
itr6_3830_pi.g2330.t1 tr|K1PN21|K1PN21_CRAGI 88.24 255 30 0 26 280 175 429 1e-178 491 84.16 57.30
itr6_2880_pi.g3194.t1 tr|K1R278|K1R278_CRAGI 36.18 152 89 8 175 326 281 424 4e-34 120 44.19 32.29
23th May, 2017
I've deleted by mistake the file with the promoters. Running again:
/home/bioma/scripts/fprom /home/bioma/2017_marcela/mexilhao_tese_preeprint/LF.genome.NCBI4.1.fasta > /home/bioma/2017_marcela/CTG/LF.genome.NCBI4.1.fasta.fprom.2
About the promoters: (i) its important to find out how reliable the in silico tool is, (ii) how to integrate the promoters information witht he gff, if its possible.
OBS 2: I've created a folder Promoters in the googledocs, inside the folders EXECUÇÃO TÉCNICA and WP#1 to upload files and results about the promoters search.
23th of May, 2017
Today I've discussed with Ju Americo that we should check a best reciprocal Blast hit (RBBH) for marcela.fasta against all proteins from
(1) anopheles genome, (2) c.elegans genome, (3) c. gigas, (4) m. galloprovincialis.
I've started to run the RBBH for anopheles (locally)
bioma@bioma-XPS-8300:~/2017_marcela/CTG$ python ../../scripts/rbbh.py -a prot -t blastp -o genomes-marcela.Anopheles_gambiae.AgamP4.pep.all.RBBH --threads=4 marcela.fasta Anopheles_gambiae.AgamP4.pep.all.fa
nohup python ../scripts/rbbh.py -a prot -t blastp -o marcela.c_elegans.PRJNA13758.WS258.protein.fa.RBBH --threads=8 ../marcela.fasta.an/marcela.fasta c_elegans.PRJNA13758.WS258.protein.fa
I need to run the oher ones too, but I cannot run it at the same time in the local machine. Let's see how long the first job is going to take.
Next question is: where is the BLAST of marcela.fasta against ncbi nr? Do I have it or not?
Even though I have this file all.lf.uniprot.final.blast - which comprises all the marcela.fasta matches with UNIPROT or (the ones
that did not have a match with Uniprot) with NR NCBI, I'm running a complete BLAST against NR NCBI
perl /data2/Marcela/scripts/split_multifasta.pl --input_file=marcela.fasta --output_dir=/data2/Marcela/datanote/ --seqs_per_file=20239
nohup /data2/Marcela/scripts/ncbi-blast-2.4.0+/bin/blastp -query marcela.*.fasta -db /data2/ncbl/db/nr -evalue 1e-05 -num_threads 4 -out marcela.*.fasta.NR.1e-05 &
24th of May, 2017
Running the RBBH for marcela.fasta and Mytilus proteins (murgarella et al, 2016).
python ../../scripts/rbbh.py -a prot -t blastp -o marcela.Mytilus.galloprovincialis.aa.RBBH --threads=2 marcela.fasta Mytilus.galloprovincialis.aa
python ../../scripts/rbbh.py -a prot -t blastp -o marcela.Crassostrea_gigas.GCA_000297895.1.30.pep.all.fa.RBBH --threads=4 marcela.fasta Crassostrea_gigas.GCA_000297895.1.30.pep.all.fa
Organizing RBBH files and results after running it with whole genomes.
For anopheles, the RBBH are similar when done with the smaller db, there is only one more, and one out
#A_id B_id A_length B_length A_qcovhsp B_qcovhsp length pident bitscore
itr6_18008_p.g24.t1 AGAP003228-PA 208 429 92 53 228 72.807 329
itr6_2417_pi.g1791.t1 AGAP009459-PA 303 298 98 99 297 70.370 458
itr6_4125_pi.g1566.t1 AGAP004283-PA 97 157 93 64 101 72.277 144
itr6_222_pil.g2778.t1 AGAP010929-PA 446 447 99 98 440 96.591 905
itr6_2014_pi.g1174.t1.1.57f6923d AGAP003622-PA 287 415 95 66 273 73.626 427
itr6_10681_p.g56.t1 AGAP005800-PA 416 717 100 58 416 73.558 619
itr6_12467_p.g169.t1 AGAP000462-PA 164 164 100 100 164 71.951 258 (não achou no RBBH com os genomas completos)
Outros:
itr6_222_pil.g2778.t1 AGAP010929-PA 446 447 99 98 440 96.591 905
Uploading all these tables into google docs:
1-) RBBH-marcela.anopheles-wholegenomeAndsmallDB-check.xlsx - are the RBBH (reciprocal best blast hits) found when reciprocally
blasting the whole proteins predicted for L. fortunei (marcela.fasta) with the whole proteins from the anopheles genome (esembl).
Also, these RBBH were checked against RBBH run with a smaller db (from 24th of April, 2017 descripton), which had only the fertility genes of interest.
2-) genomes-marcela.Anopheles_gambiae.AgamP4.pep.all.RBBH - analysis result of RBBH (reciprocal best blast hits) found when reciprocally blasting the whole
proteins predicted for L. fortunei (marcela.fasta) with the whole proteins from the anopheles genome (esembl).
3-) Anopheles_gambiae.AgamP4.pep.crisanti.blastp1e05.covBH.RBBH.xlsx - its an update of previous Ju’s table, but now presenting the RBBH in bold.
Running RBBH for c_elegans again because of the IDs
python ../../scripts/rbbh.py -a prot -t blastp -o marcela.c_elegans.PRJNA13758.WS258.protein.1.glued.faRBBH --threads=4 marcela.fasta c_elegans.PRJNA13758.WS258.protein.1.glued.fa
So, I've run RBBH for marcela.fasta against m.gallo and c.gigas whole-genome-predicted proteins and now I'm blasting this db
against the reduced reproductive proteins of interested ju has created (sequence-proteins-JU.fasta).
cat Crassostrea_gigas.GCA_000297895.1.30.pep.all.fa Mytilus.galloprovincialis.aa > C.gigas.M.gallo.all.genome.proteins.fa
blastp -query sequence-proteins-JU.fasta -db C.gigas.M.gallo.all.genome.proteins.fa -evalue 1e-05 -num_threads 4 -out sequence-proteins-JU.C.gigas.M.gallo.all.genome.proteins.1e05 &
The idea is to identify which protein is which between the two db, because they have different IDs. And then I can explore more the RBBH results.
perl ../../scripts/BLASTcovBH.pl sequence-proteins-JU.C.gigas.M.gallo.all.genome.proteins.1e05 > sequence-proteins-JU.C.gigas.M.gallo.all.genome.proteins.1e05.covBH
Ok, so for the 17 proteins of molluscan reproduction only vasa is present in the RBBH (with both: small db and whole c.gigas genome).
Searching for vitelogenin:
1-) I went to Uniprot and download vitelogenins for c. gigas and 'mytilus'. Total of 6 proteins, including fragments (file: vitelogenin.gigas.mytilus.uniprot.fa)
The first BLAST with marcela.fasta:
blastp -query vitelogenin.gigas.mytilus.uniprot.fa -db marcela.fasta -evalue 1e-05 -out vitelogenin.gigas.mytilus.uniprot.marcela.fasta1e05 &
tr|G9JJT9|G9JJT9_CRAGI itr6_234_pil.g2949.t1 39.13 92 56 0 23 114 1483 1574 3e-19 85.9 44.02 5.46
tr|K1QNA2|K1QNA2_CRAGI itr6_234_pil.g2949.t1 32.89 681 336 121 52 722 162 731 7e-109 387 27.83 33.81
tr|K1QFM6|K1QFM6_CRAGI itr6_11658_p.g77.t1 50.25 396 194 3 70 463 1 395 3e-137 436 83.30 14.58
tr|K1QMJ1|K1QMJ1_CRAGI itr6_11658_p.g77.t1 38.99 277 167 2 6 282 8 282 2e-60 207 96.52 10.15
tr|Q49P77|Q49P77_MYTED itr6_234_pil.g2949.t1 25.56 223 85 81 14 236 162 303 7e-16 76.6 94.09 8.42
Now I've decided to do a BLAST of these sequences with the whole-genome-predicted sequences from m. gallo and c. gigas.
blastp -query vitelogenin.gigas.mytilus.uniprot.fa -db C.gigas.M.gallo.all.genome.proteins.fa -evalue 1e-05 -out vitelogenin.gigas.mytilus.uniprot.C.gigas.M.gallo.all.genome.proteins.fa1e05 &
perl ../../scripts/BLASTcovBH.pl vitelogenin.gigas.mytilus.uniprot.C.gigas.M.gallo.all.genome.proteins.fa1e05 > vitelogenin.gigas.mytilus.uniprot.C.gigas.M.gallo.all.genome.proteins.fa1e05.covBH
25th of May, 2017
Today I'm organizing all the results for the genes of interest found in comparison with A. gambiae, c. elegans or molluscan (c. gigas
and m. gallo). The results of the RBBH from whole-genome comparisons are similar to the ones done with the smaller datasets.
After conferming the RBBH, I went to see which was the function of the LF protein when BLASTed against Uniprot/NR.
So files containing the RBBH (RBBH-marcela.anopheles-wholegenomeAndsmallDB-check.xlsx , c_elegans_RBBH-confirmation.xlsx) contain
also the fuction description from the best hit found in the BLAST against Uniprot/NR. For the molluscan, before the only RBBH was
for the vasa protein, and it was confirmed with the RBBH against c. gigas whole genome.
20th of June, 2017.
Juliana Americo had selected 266 molluscan genes of interested in NCBI or Uniprot. I have - however - run the RBBH for the complete set of
proteins for the C. gigas and M. gallo (murgarella) genomes. Then, the second thing I did was to run RBBH for the 266 proteins
selected by Ju against C.gigas and M. gallo genome-predicted proteins: so then I could have the ID of the protein in the genome
to compare with the RBBH result against LF (marcela.Mytilus.galloprovincialis.aa.RBBH and marcela.Crassostrea_gigas.GCA_000297895.1.30.pep.all.fa.RBBH).
The RBBH for 266 agains genome-predicted-proteins:
python ../../scripts/rbbh.py -a prot -t blastp -o Ptn_Reproduction_Mollusca_dbC.gigas.M.gallo.all.genome.proteins.fa.RBBH --threads=4 C.gigas.M.gallo.all.genome.proteins.fa Ptn_Reproduction_Mollusca_db.fasta
After the comparisons I found the vasa protein but also the estrogen receptor protein with RBBH against LF. How I did that?
I checked the IDS of the genome-predicted-proteins in here Ptn_Reproduction_Mollusca_dbC.gigas.M.gallo.all.genome.proteins.fa.RBBH
and checked if they were present in here marcela.Mytilus.galloprovincialis.aa.RBBH or marcela.Crassostrea_gigas.GCA_000297895.1.30.pep.all.fa.RBBH
26th of June, 2017
Today I'm trying another approach to find the insertion area in LF genome. Ju Americo found a paper where they state housekeeping
genes are usually organized in clusters inside the geneome (https://drive.google.com/file/d/0B2pYU5fjrq3EYThkUjBBZWFKLUk/view?ts=595102d4)
Therefore, if we find a area between such genes, the possibility of this area to be expressed is high.
So, right now I'll download the 400 proteins that state as coming from housekeeping genes from RefSeq:
Batchentry - list of IDs - send to coding file (proteins).
To eliminate all empty lines of a file:
:g/^$/d
And now running blast with the 408 housekeeping proteins
makeblastdb -in housekeeping.fa -dbtype prot
blastp -query marcela.fasta -db housekeeping.fa -evalue 1e-05 -num_threads 4 -out marcela.fasta.housekeeping.1e-05 &
27th of June, 2017
How to get all spaces were there is no gene in a gff?
beedtools complement -i marcelAAAa_pasa.gene_structures_post_PASA_updates.85229.gff3 -g /home/bioma/LF.genome.blas.fasta.genomefile
How to generate the genome file (-g)?
samtools faidx LF.genome.blas.fasta
awk -v OFS='\t' {'print $1,$2'} lLF.genome.blas.fasta.fai > LF.genome.blas.fasta.genomefile
29 June,2017
Nearly all results for the first technical delivery (gene targets and insetion area) are ready. Bellow one find the description
and gdrive location of all files:
For insertion area: Files listed bellow are here: https://drive.google.com/drive/u/1/folders/0B2IClYPbMhLAZF83U2ZfSjZZV1E
- Intergenic regions coodernates for all scaffolds (LF.genome.intergenic)
- Housekeeping LF blast result (marcela.fasta.housekeeping.1e-05.covBH)
- Intergenic regions coordenates only for scaffolds that have housekeeping genes (LF.genome.Intergenic-for-housekeeping.coords)
- Specific area chosen for insertion: scaffold FASTA file (L_FORTUNEI.itr6_10266_.fa)
- Specific area chosen for insertion: gff of scaffold with all the coordenates for the 4 genes coded in this scaffold, including the two convergent genes (itr6_10266_.ok.gff)
- 2 FASTA files for each of the two convergent genes containing: mrna, coding seq and protein (itr6_10266_p.g262.t1.codingSeq-mrna-protein , itr6_10266_p.g263.t1.codingSeq-mrna-protein.fasta)
https://drive.google.com/drive/u/1/folders/0B2IClYPbMhLAZF83U2ZfSjZZV1E
For gene targets. Files listed bellow are here: https://drive.google.com/drive/u/1/folders/0B2pYU5fjrq3ENGZrY3IteGN4T0U
- gff (genome location) for all 27 gene targets (LF.gene.targets.gff3)
- mrna sequences for all 27 gene targets (gene-targets.mrna)
- coding seq for all 27 gene targets (gene-targets.codingseq)
- genome scaffolds were this sequences are (LF.gene.targets.scaffold.fasta)
31th July, 2017
Ju Americo required to find the promotor region of 9 genes in L. fortunei. These genes were found in C. gigas or P. yessoensis.
Requirement table is here: https://docs.google.com/spreadsheets/d/18NxP29QCvPcm0z8sIMKE0lSYLbL8YbtM-svhHWC33q0/edit?ts=597f2fd9#gid=0
So, I've downloaded all sequences she asked for from Uniprot and made a BLAST against LF proteins (marcela.fasta).
For all sequences I've downloaded sequences from C. gigas or Patinopecten yessoensis. But for GAPDH and Cytochrome
I've also downloaded the human protein. In any case, there was no blast with any marcela.fasta proteins.
9th august, 2017
I'm looking for the promotor regions in some specific genes Ju Americo has selected from other species. In this link (https://docs.google.com/spreadsheets/d/18NxP29QCvPcm0z8sIMKE0lSYLbL8YbtM-svhHWC33q0/edit?ts=597f2fd9#gid=16907541
) in sheet3, you find the LF ortologs of such genes, them the RBBH or not, then the BLAST result once run against NCBI nr, then gene position in scaffold, and
the genes that come just before and after.
Also, Ju Americo would like to know if such genes are expressed in the transcriptomes we already have sequenced (from Marcela's thesis).
The thing is we have not done a diferential expression analysis in such data. But one thing we can check is if such genes are present
or absent in the de novo assembled transcripts for each mussel (we have sequenced 3 mussels, 4 tissues). Even better would be to
assembly with trinity transcripts for each tissue, or use cufflinks for such.
Right now, the only thing we have is a blast result against the trinity sets for each mussel.
1-) First I've created a db with the 7 LF ortologs of the sequences Ju has required:
/home/bioma/2017_marcela/CTG/lf.genes.for.promoters.fa
Then, I've run a blastx of the 3 sets of trinity transcripts against such db. The blast results are:
/home/bioma/2017_marcela/CTG/Trinity.1.lf.genes.for.promotersblastx1e05.covBH
/home/bioma/2017_marcela/CTG/Trinity.3.lf.genes.for.promotersblastx1e05.covBH
/home/bioma/2017_marcela/CTG/Trinity.6.lf.genes.for.promotersblastx1e05.covBH
Then I saved only blast results where 90% or more of the db sequence is inside the blast result.
/home/bioma/2017_marcela/CTG/Trinity.3.lf.genes.for.promotersblastx1e05.covBH.id2.90
/home/bioma/2017_marcela/CTG/Trinity.1.lf.genes.for.promotersblastx1e05.covBH.id2.90
/home/bioma/2017_marcela/CTG/Trinity.6.lf.genes.for.promotersblastx1e05.covBH.id2.90
All the files are also in biobureau gdocs: https://drive.google.com/drive/u/1/folders/0B2IClYPbMhLARFVhaGNEMFNwcms
All the 7 genes Ju has selected are constitutive, or expressed in the embrio and on (https://docs.google.com/spreadsheets/d/18NxP29QCvPcm0z8sIMKE0lSYLbL8YbtM-svhHWC33q0/edit#gid=0)
Only vasa is supposed to be expressed only in the germinative cells. So, I've checked the trinity sets - that should not have this transcript,
since it was sequenced for digestive gland, mantle, foot and gills of adults - blast results, and really did not found vasa for the trinity set 3 and 6.
For mussel 1, it seems to have a hit,eventhough is not a perfect hit:
TR79527-c2_g1_i3 itr6_4515_pi.g108.t1 100.00 562 0 0 1111 2796 17 578 0.0 1127 44.44 91.68
This is a m8 result plus 2 last collumns, that show % of query coverage and % of subject coverage. As we can see, this results
is the best hit and only covers 44.44% of the trinity transcript.
In the blast results (here lines 449-451) one can also find other blast results for the 7 genes Ju has selected (line 447).
October 30th, 2017
Luana and João have been working on assemblying L.fortunei transcriptomes of each indidual and specific tissue. The software being used for it is Trinity.
The assembly of tissue C, from organism 1, was the following:
nohup Trinity --seqType fq --left 1c_S4_R1_001.D.gland.fastq --right 1c_S4_R2_001.D.gland.fastq --max_memory 8G