lab_book

April 17th, 2017

Today I've done the BLAST of all 60717 L. fortunei proteins predicted in the genome against A. gambiae proteins related to fertility publishe
in a Nature Biotechnology paper (Hammond et al., 2016 doi:10.1038/nbt.3439 ) 

Database was all the protein sequences present at Supplementary Table 5 of the above paper, although 4 sequences listed on Table 5 was not
present on the genome of A. gambiae downloaded from the Esembl browser, nor were find went specificaly searched on the Esembl website.
Those not found sequences are: AGAP011235-PA, AGAP010915-PA, AGAP007997-PA, AGAP012170-PA.

First local BLAST results are at biobureau gdrive: https://drive.google.com/drive/u/1/folders/0B2pYU5fjrq3EczhILWRyWDBTa3M 


April 24th, 2017.

Organizing all the BLASTs to find the fertility genes in L. fortunei. We are using the genes employed by Crisanti 

blastp -query Anopheles_gambiae.AgamP4.pep.crisanti.fasta -db /home/ubuntu/nr -evalue 1e-05 -num_threads 4 -out Anopheles_gambiae.AgamP4.pep.crisanti.fasta.blast-NR.1e05 &


1-) Another BLAST from today: L. fortunei proteins against molluscan reproduction genes Ju_Americo have choosen.

makeblastdb -in mollusca.rep.fa -dbtype prot
blastp -query marcela.fasta -db molluscan_genes/mollusca.rep.fa -evalue 1e-05 -num_threads 4 -out marcela.mollusca.rep.fa1e05 &

April, 25th 2017

-BLASTing all reproduction mollucans mRNA and proteins against marcela.fasta
I'm doing this in Amazon, in the isntance called CTG_BLAST i-0b326ee60f77046d7

 blastp -query marcela.fasta -db molluscan_genes/mollusca.rep.fa -evalue 1e-05 -num_threads 4 -out marcela.mollusca.rep.fa1e05 &
 

-BLASTing L. fortunei VASA protein JuAmerico found with the transcripts assembled from Trinity. Just to check. The protein is well conserved
but it looks like its missing 100aa at the beginning.

/data2/Marcela/scripts/ncbi-blast-2.4.0+/bin/tblastn -query vasa.lf.fa -db /data2/Marcela/data/rna-seq/trinti.1.2.3/Trinities.1.2.3.fasta -evalue 1e-05 -num_threads 4 -out vasa.trinity1.2.3.tblastn &
After that, I got all the trinity hits and translated them using transdecode, so we can check how similar to the genomic de novo preditec vasa the transcripts are:

perl /home/bioma/scripts/PASApipeline-2.0.2/pasa-plugins/transdecoder/transcripts_to_best_scoring_ORFs.pl -t vasa.trinity1.2.3.tblastn.cov.trinityID.fasta

-BLASTing L. fortunei (marcela.fasta) against C. elegans proteins selected by Ju (3979 seqs, sterile-phenoype).
makeblastdb -in c_elegans.PRJNA13758.WS258.protein.sterile.fa -dbtype prot

nohup blastp -query ../marcela.fasta -db /home/ubuntu/c_elegans/c_elegans.PRJNA13758.WS258.protein.sterile.fa -evalue 1e-05 -num_threads 4 -out marcela.c_elegans.PRJNA13758.WS258.protein.sterile.1e05 &


26 April, 2017

How to get a whole line by a list of IDs.

awk 'NR==FNR{tgts[$1]; next} $1 in tgts' file1 file2


April 27th, 2017.

Today I'll curate the possible VASA gene Ju found among the golden Mussel proteins. Which means I'll look into the scaffolds from 
where it was predicted, and see how much RNA-seq there is aligned to it.

So the first thing is to find which scaffold it is. And it is this:

L_FORTUNEI.itr6_4515

Then, you get all the rna-seq aligned to it in a bam file using samtools

samtools view -h all.merged.sorted.bam L_FORTUNEI.itr6_4515_ > L_FORTUNEI.itr6_4515_all.merged.sam

then
samtools view -Sb L_FORTUNEI.itr6_4515_all.merged.sam -o L_FORTUNEI.itr6_4515_all.merged.bam
samtools sort
samtools index

then
Get the fasta for the scaffold
and
Get the gff for the scaffold

/data2/Marcela/scripts/Python-3.5.2/python /data2/Marcela/scripts/biocode/gff/filter_gff3_by_id_list.py -i /data2/Marcela/pasa.itr6/marcelAAAa_pasa.gene_structures_post_PASA_updates.85229.gff3 -o id.bla.gff3 -l /data2/Marcela/hisat2/id.bla

Then sort, zip and tabix the gff

sort -k1,1 -k4,4n -rk3,3n id.bla.gff3 > id.bla.sorted.gff
bgzip id.bla.sorted.gff
tabix -p gff id.bla.sorted.gff.gz

May, 4th (be with you!) 2017.

Finding best reciprocal BLAST hit!

I found this script and I'm using it: https://github.com/peterjc/galaxy_blast/blob/master/tools/blast_rbh/blast_rbh.py

The first thing I did was to use the BLAST matches ju found between LF and C_elegans (Celegans_Agambiae_Hits_evalue50_ID60.xlsx).
The fasta files are at local machine:
/home/bioma/2017_marcela/CTG/lf.blas.celegans.interest.fasta
/home/bioma/2017_marcela/CTG/c_elegens.interested.fasta
I've modified same names in the c_elegans file because there are more than 1 protein per locus, so they have the same inicial name
the file is called c_elegens.interested.id.modified.forRBBH (on gdocs)
Then I run the program localy:
bioma@bioma-XPS-8300:~/2017_marcela/CTG$ python ../../scripts/filterfasta.py -i lf.blas.celegans.interest.id ../mexilhao_tese_preeprint/marcela.fasta > lf.blas.celegans.interest.fasta
bioma@bioma-XPS-8300:~/2017_marcela/CTG$ python ../../scripts/rbbh.py -a prot -t blastp -o output.tsv lf.blas.celegans.interest.fasta c_elegens.interested.fasta
Result:
Done, 19 RBH found
Warning: Sequences with tied best hits found, you may have duplicates/clusters

Result file called: lf.ce-elegans.RBBH (on gdocs)

- RBBH analysis with molluscan proteins. Only vasa protein got that for now

python ../../scripts/filterfasta.py -i molluscan.reproduction.LF.blastp.lf.id ../mexilhao_tese_preeprint/marcela.fasta > molluscan.reproduction.LF.blastp.lf.fasta
less molluscan.reproduction.LF.blastp.lf.fasta
python ../../scripts/rbbh.py -h
python ../../scripts/rbbh.py -a prot -t blastp -o lf.molluscan-reproduction.RBBH --threads=4 molluscan.reproduction.LF.blastp.lf.fasta sequence-proteins-JU.fasta
ls -ltr
less lf.molluscan-reproduction.RBBH

10th May of 2017.
- Already asked for a quotation to resequence a male and a female golden mussel (genone).
- Today I'm looking into which proteins and scaffolds the trinity (vasa) transcripts TR79527-c2__g1__i3 len=3794, TR76527-c0_g1_i2 len=3728, TR76527-c0_g1_i5 len=3477 
represent.

So, First I've blasted them against marcela.fasta
blastx -query trinity.vasas.fasta -db /data2/Marcela/marcela.fasta.an/marcela.fasta -evalue 1e-05 -out trinity-vasas_marcela.fasta.blastx1e05
Best hits:
TR79527-c2__g1__i3      itr6_4515_pi.g108.t1    100.00  562     0       0       1111    2796    17      578     0.0     1127    44.44   91.68
TR76527-c0_g1_i2        itr6_4515_pi.g108.t1.1.57f69240 100.00  537     0       0       1120    2730    17      553     0.0     1075    43.21   91.33
TR76527-c0_g1_i5        itr6_4515_pi.g108.t1.1.57f69240 100.00  537     0       0       869     2479    17      553     0.0     1075    46.33   91.33

All "good" hits:
TR79527-c2__g1__i3      itr6_4515_pi.g108.t1    100.00  562     0       0       1111    2796    17      578     0.0     1127    14.81   91.68
TR79527-c2__g1__i3      itr6_4515_pi.g108.t1.1.57f69240 99.46   552     3       0       1141    2796    2       553     0.0     1099    14.47   93.37
TR79527-c2__g1__i3      itr6_4515_pi.g108.t1.2.57f69240 99.43   527     3       0       1216    2796    2       528     0.0     1045    13.81   93.07
TR79527-c2__g1__i3      itr6_4515_pi.g108.t1.3.57f69240 99.81   518     1       0       1111    2664    17      534     0.0     1031    13.63   92.99
TR76527-c0_g1_i2        itr6_4515_pi.g108.t1.1.57f69240 100.00  537     0       0       1120    2730    17      553     0.0     1075    14.40   91.33
TR76527-c0_g1_i2        itr6_4515_pi.g108.t1    100.00  537     0       0       1120    2730    42      578     0.0     1075    14.40   87.60
TR76527-c0_g1_i2        itr6_4515_pi.g108.t1.2.57f69240 99.43   527     3       0       1150    2730    2       528     0.0     1046    14.06   93.07
TR76527-c0_g1_i2        itr6_4515_pi.g108.t1.3.57f69240 99.80   493     1       0       1120    2598    42      534     0.0     979     13.20   88.49
TR76527-c0_g1_i5        itr6_4515_pi.g108.t1.1.57f69240 100.00  537     0       0       869     2479    17      553     0.0     1075    15.44   91.33
TR76527-c0_g1_i5        itr6_4515_pi.g108.t1    100.00  537     0       0       869     2479    42      578     0.0     1075    15.44   87.60
TR76527-c0_g1_i5        itr6_4515_pi.g108.t1.2.57f69240 99.43   527     3       0       899     2479    2       528     0.0     1046    15.07   93.07
TR76527-c0_g1_i5        itr6_4515_pi.g108.t1.3.57f69240 99.80   493     1       0       869     2347    42      534     0.0     979     14.15   88.49

- Ok, so I got these three transcripts from above and got the results from them from transdecoder. I then alligned these transdecoder
protein sequences + vasa from uniprot for mytilus and c. gigas + the 3 vasa proteins predicted in the genome. The alligment shows
that is possible the genome-predicted proteins are incomplete, but that the transdecoder has the beginning of them. Now I'll
investigate the gff and the scaffold where they are coded.
Aligment result in the local computer:
/home/bioma/2017_marcela/CTG/transdecoderlong.genome.mitgigas.vasas.aln

11th of May, 2017
Trying to find RBBH with all mollusks versus marcela.fasta
python ../../scripts/rbbh.py -a prot -t blastp -o allmoluscs.marcelafasta sequence-proteins-JU.fasta /home/bioma/2017_marcela/mexilhao_tese_preeprint/marcela.fasta

18th of May, 2017

Testing how to create consensus from mapped reads using samtools. 
So, I've downloaded a bam file from allegro, where the 500bp pair end were aligned to the genome. Then we have:
samtools mpileup -d 8000 -E -uf Draftfinal_itr6.pilon6.fasta itr6finalcoverage.500.006.sorted.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fastq
Also check:
https://www.biostars.org/p/70585/
https://www.biostars.org/p/225064/

22th of May, 2017.

How to find promotor regions in silico??
Chapter I'm reading at the moment:
https://www.researchgate.net/publication/272440150_In_silico_Identification_of_Eukaryotic_Promoters
To download packages like softberry
http://www.softberry.com/berry.phtml?topic=fdp.htm&no_menu=on

Running Fprom: Program predicts potential transcription start positions by linear discriminant function combining
characteristics describing functional motifs and oligonucleotide composition of these sites. FProm uses file with
selected factor binding sites from currently supported functional site data base.

./fprom /home/bioma/2017_marcela/mexilhao_tese_preeprint/LF.genome.NCBI4.1.fasta > /home/bioma/2017_marcela/CTG/LF.genome.NCBI4.1.fasta.fprom


Other online tools for single sequences search:
http://www.genetools.us/genomics/Promoter%20databases%20and%20prediction%20tools.htm


Promoter regions:

A promoter region is generally defined as any genomic DNA where the transcrip-
tion machinery assembles and initiates transcription. The promoter region consists
of protein binding regions along with the transcription start site (TSS).

In Eukaryotes, the promoter regions are broadly classified as core promoters,
proximal promoters and distal promoters. The core promoter region, where the actual
basal transcription machinery assembles, is 30–100 nucleotides in length. These
regions are characterized by the presence of sequence motifs such as the TATA box
and the Inr element.

They may also contain downstream elements like DPE, MTE
(in humans) along with the associated TSS (Juven-Gershon et al. 2008; Thomas
and Chiang 2006). The proximal promoter regions are the sequences located within
500 base pairs relative to the TSS and contain certain proximal promoter elements,
which include the GC box, the CAAT box, cis-regulatory modules (CRM) (Lenhard
and Sandelin 2012), etc. Distal promoter elements include enhancers, insulators and
silencers. The distal promoter region does not have a well-defined length and can
extend up to 10 kb from the TSS in upstream as well as downstream regions. Distal
promoters interact with transcription activators to increase the rate of transcription. In
vertebrates, it is known that 5 % of the genes code for specific transcription activators,
which interact with proximal and distal promoter regions.

Promoter: core (TATA Box, Inr elements  30-100 bp) + proximal (500 bp relative to the TSS – CG box, CAAT box, cis-regulatory elements) + distal 


Programs available:
Fprom – softberry (já rodando)


22th of May, 2017

Identification of beta-tubulins:

I've downloaded some beta-tubulins from Uniprot for C. gigas and for Mytilus. File: /home/bioma/2017_marcela/CTG/CG.mytilus.uniprot.betaTubulin.fa
Then runned a BLAST:
makeblastdb -in CG.mytilus.uniprot.betaTubulin.fa -dbtype prot
blastp -query marcela.fasta -db CG.mytilus.uniprot.betaTubulin.fa -evalue 1e-05 -out marcela.CG.mytilus.uniprot.betaTubulin1e05 &

BLAST BHcov result:
itr6_5034_pi.g1088.t1   tr|K1R278|K1R278_CRAGI  38.39   435     256     12      404     838     2       424     3e-113  343     50.88   94.84
itr6_5034_pi.g1091.t1   tr|K1R278|K1R278_CRAGI  39.26   433     253     10      37      469     2       424     5e-122  354     89.09   94.84
itr6_5034_pi.g1085.t1   tr|K1R278|K1R278_CRAGI  39.14   419     245     10      1       419     16      424     3e-117  340     96.10   91.70
itr6_5034_pi.g1096.t1   tr|K1R278|K1R278_CRAGI  39.40   434     253     10      15      448     1       424     3e-123  357     93.33   95.07
itr6_3927_pi.g551.t1    tr|K1R7V7|K1R7V7_CRAGI  45.14   288     154     4       43      329     2       286     2e-81   246     73.21   63.90
itr6_3927_pi.g551.t1.1.57f6923f tr|K1R7V7|K1R7V7_CRAGI  45.33   289     154     4       1       288     1       286     1e-82   248     82.05   64.13
itr6_2039_pi.g1537.t1   tr|K1R278|K1R278_CRAGI  39.63   434     252     10      1       434     1       424     2e-123  357     96.23   95.07
itr6_26_pilo.g431.t1    tr|K1R278|K1R278_CRAGI  47.37   57      30      0       2       58      368     424     4e-18   65.9    73.08   12.78
itr6_4255_pi.g441.t1    tr|K1R278|K1R278_CRAGI  24.79   121     80      11      273     393     313     422     1e-09   48.5    24.35   24.66
itr6_4255_pi.g438.t1    tr|K1R7V7|K1R7V7_CRAGI  37.43   171     73      34      10      147     106     275     3e-27   97.1    67.98   38.12
itr6_222_pil.g2785.t1   tr|K1QAJ5|K1QAJ5_CRAGI  92.62   325     23      1       42      366     10      333     0.0     595     47.17   75.52
itr6_222_pil.g2784.t1   tr|K1PN21|K1PN21_CRAGI  91.73   411     24      10      134     534     20      430     0.0     781     73.04   92.36
itr6_222_pil.g2778.t1   tr|K1R7V7|K1R7V7_CRAGI  99.33   446     3       0       1       446     1       446     0.0     935     100.00  100.00
itr6_222_pil.g2782.t1   tr|K1QAJ5|K1QAJ5_CRAGI  98.51   202     3       0       158     359     228     429     1e-149  419     56.27   47.09
itr6_222_pil.g2786.t1   tr|K1R7V7|K1R7V7_CRAGI  98.65   446     5       1       1       445     1       446     0.0     925     100.00  100.00
itr6_222_pil.g2783.t1   tr|K1QAJ5|K1QAJ5_CRAGI  96.49   399     14      0       3       401     31      429     0.0     814     99.50   93.01
itr6_2662_pi.g1051.t1.1.57f6923e        tr|Q5NT88|Q5NT88_CRAGI  38.54   410     242     10      20      429     2       401     4e-112  330     77.65   89.69
itr6_2662_pi.g1051.t1   tr|Q5NT88|Q5NT88_CRAGI  38.54   410     242     10      20      429     2       401     4e-112  330     77.65   89.69
itr6_3015_pi.g1241.t1   tr|K1R278|K1R278_CRAGI  38.87   301     176     8       6       306     132     424     9e-80   240     93.19   65.70
itr6_3015_pi.g1240.t1   tr|G0YFD6|G0YFD6_MYTED  42.25   71      39      2       486     556     1       69      1e-14   61.2    12.68   40.59
itr6_3841_pi.g2427.t1   tr|K1QAJ5|K1QAJ5_CRAGI  48.78   82      34      8       1       82      250     323     2e-19   70.5    90.11   17.25
itr6_3841_pi.g2425.t1   tr|Q5NT88|Q5NT88_CRAGI  92.41   316     24      0       93      408     58      373     0.0     610     77.07   70.85
itr6_3841_pi.g2428.t1   tr|K1PN21|K1PN21_CRAGI  97.30   74      2       0       88      161     175     248     4e-48   151     45.68   16.63
itr6_3841_pi.g2426.t1   tr|K1PN21|K1PN21_CRAGI  55.32   188     49      35      75      227     66      253     6e-57   177     67.11   42.25
itr6_165_pil.g2044.t1   tr|K1R278|K1R278_CRAGI  38.37   443     254     19      1       443     1       424     2e-119  347     96.30   95.07
itr6_165_pil.g2047.t1   tr|K1R278|K1R278_CRAGI  38.86   368     217     8       856     1223    65      424     5e-93   296     29.70   80.72
itr6_165_pil.g2045.t1   tr|K1R278|K1R278_CRAGI  39.55   359     209     8       38      396     74      424     1e-101  299     86.92   78.70
itr6_165_pil.g2046.t1   tr|G0YGJ4|G0YGJ4_MYTED  43.33   60      34      0       12      71      41      100     1e-17   62.8    83.33   35.93
itr6_1167_pi.g2577.t1   tr|K1R7V7|K1R7V7_CRAGI  43.24   37      20      1       1       37      1       36      3e-08   40.4    26.81   8.07
itr6_1167_pi.g2579.t1.1.57f6923b        tr|Q5NT88|Q5NT88_CRAGI  32.14   56      37      1       21      75      150     205     2e-06   38.5    11.48   12.56
itr6_1167_pi.g2579.t1   tr|Q5NT88|Q5NT88_CRAGI  32.14   56      37      1       21      75      150     205     1e-06   38.9    12.30   12.56
itr6_2024_pi.g1264.t1   tr|K1R7V7|K1R7V7_CRAGI  97.98   445     7       2       1       443     1       445     0.0     918     99.11   99.78
itr6_2024_pi.g1264.t1.1.57f6923d        tr|K1R278|K1R278_CRAGI  96.78   373     12      0       1       373     73      445     0.0     768     99.47   83.63
itr6_2024_pi.g1261.t1   tr|K1R7V7|K1R7V7_CRAGI  97.77   448     8       2       1       448     1       446     0.0     920     100.00  100.00
itr6_4995_pi.g934.t1    tr|K1R7V7|K1R7V7_CRAGI  27.68   112     73      8       58      168     298     402     3e-06   35.4    61.33   23.54
itr6_4995_pi.g933.t1    tr|K1R278|K1R278_CRAGI  28.81   236     140     28      4       215     5       236     1e-20   82.8    49.42   52.02
itr6_3744_pi.g1657.t1   tr|K1PN21|K1PN21_CRAGI  46.24   266     140     3       29      293     2       265     4e-77   232     87.17   59.33
itr6_3744_pi.g1658.t1   tr|K1R278|K1R278_CRAGI  34.59   133     79      8       1       133     300     424     4e-31   105     88.08   28.03
itr6_224_pil.g2812.t1   tr|K1QAJ5|K1QAJ5_CRAGI  83.39   283     23      24      42      302     78      358     3e-171  472     82.33   65.50
itr6_8379_pi.g36.t1     tr|K1QAJ5|K1QAJ5_CRAGI  97.20   214     5       1       1       214     216     428     7e-158  434     99.53   49.65
itr6_176_pil.g2226.t1   tr|K1QAJ5|K1QAJ5_CRAGI  28.28   99      69      2       528     624     277     375     9e-13   58.9    15.25   23.08
itr6_3438_pi.g1869.t1   tr|K1R278|K1R278_CRAGI  38.66   419     247     10      1       419     16      424     6e-108  337     34.07   91.70
itr6_3830_pi.g2323.t1   tr|K1R278|K1R278_CRAGI  94.55   404     22      0       106     509     26      429     0.0     804     76.37   90.58
itr6_3830_pi.g2324.t1   tr|K1R278|K1R278_CRAGI  94.86   214     11      0       186     399     1       214     3e-146  413     53.50   47.98
itr6_3830_pi.g2322.t1   tr|K1QAJ5|K1QAJ5_CRAGI  96.77   403     13      0       22      424     27      429     0.0     815     95.05   93.94
itr6_3830_pi.g2326.t1   tr|K1PN21|K1PN21_CRAGI  84.88   291     44      0       315     605     1       291     0.0     515     47.86   65.39
itr6_3830_pi.g2330.t1   tr|K1PN21|K1PN21_CRAGI  88.24   255     30      0       26      280     175     429     1e-178  491     84.16   57.30
itr6_2880_pi.g3194.t1   tr|K1R278|K1R278_CRAGI  36.18   152     89      8       175     326     281     424     4e-34   120     44.19   32.29

23th May, 2017
I've deleted by mistake the file with the promoters. Running again:
/home/bioma/scripts/fprom /home/bioma/2017_marcela/mexilhao_tese_preeprint/LF.genome.NCBI4.1.fasta > /home/bioma/2017_marcela/CTG/LF.genome.NCBI4.1.fasta.fprom.2
About the promoters: (i) its important to find out how reliable the in silico tool is, (ii) how to integrate the promoters information witht he gff, if its possible.
OBS 2: I've created a folder Promoters in the googledocs, inside the folders EXECUÇÃO TÉCNICA and WP#1 to upload files and results about the promoters search.

23th of May, 2017
Today I've discussed with Ju Americo that we should check a best reciprocal Blast hit (RBBH) for marcela.fasta against all proteins from
(1) anopheles genome, (2) c.elegans genome, (3) c. gigas, (4) m. galloprovincialis.
I've started to run the RBBH for anopheles (locally)
bioma@bioma-XPS-8300:~/2017_marcela/CTG$ python ../../scripts/rbbh.py -a prot -t blastp -o  genomes-marcela.Anopheles_gambiae.AgamP4.pep.all.RBBH --threads=4 marcela.fasta Anopheles_gambiae.AgamP4.pep.all.fa

nohup python ../scripts/rbbh.py -a prot -t blastp -o marcela.c_elegans.PRJNA13758.WS258.protein.fa.RBBH --threads=8 ../marcela.fasta.an/marcela.fasta c_elegans.PRJNA13758.WS258.protein.fa

I need to run the oher ones too, but I cannot run it at the same time in the local machine. Let's see how long the first job is going to take.
Next question is: where is the BLAST of marcela.fasta against ncbi nr? Do I have it or not?

Even though I have this file all.lf.uniprot.final.blast - which comprises all the marcela.fasta matches with UNIPROT or (the ones
that did not have a match with Uniprot) with NR NCBI, I'm running a complete BLAST against NR NCBI
perl /data2/Marcela/scripts/split_multifasta.pl --input_file=marcela.fasta --output_dir=/data2/Marcela/datanote/ --seqs_per_file=20239
nohup /data2/Marcela/scripts/ncbi-blast-2.4.0+/bin/blastp -query marcela.*.fasta -db /data2/ncbl/db/nr -evalue 1e-05 -num_threads 4 -out marcela.*.fasta.NR.1e-05 &


24th of May, 2017
Running the RBBH for marcela.fasta and Mytilus proteins (murgarella et al, 2016).
python ../../scripts/rbbh.py -a prot -t blastp -o marcela.Mytilus.galloprovincialis.aa.RBBH --threads=2 marcela.fasta Mytilus.galloprovincialis.aa
python ../../scripts/rbbh.py -a prot -t blastp -o marcela.Crassostrea_gigas.GCA_000297895.1.30.pep.all.fa.RBBH --threads=4 marcela.fasta Crassostrea_gigas.GCA_000297895.1.30.pep.all.fa

Organizing RBBH files and results after running it with whole genomes.
For anopheles, the RBBH are similar when done with the smaller db, there is only one more, and one out
#A_id   B_id    A_length        B_length        A_qcovhsp       B_qcovhsp       length  pident  bitscore
itr6_18008_p.g24.t1     AGAP003228-PA   208     429     92      53      228     72.807  329
itr6_2417_pi.g1791.t1   AGAP009459-PA   303     298     98      99      297     70.370  458
itr6_4125_pi.g1566.t1   AGAP004283-PA   97      157     93      64      101     72.277  144
itr6_222_pil.g2778.t1   AGAP010929-PA   446     447     99      98      440     96.591  905
itr6_2014_pi.g1174.t1.1.57f6923d        AGAP003622-PA   287     415     95      66      273     73.626  427
itr6_10681_p.g56.t1     AGAP005800-PA   416     717     100     58      416     73.558  619
itr6_12467_p.g169.t1    AGAP000462-PA   164     164     100     100     164     71.951  258 (não achou no RBBH com os genomas completos)
Outros:
itr6_222_pil.g2778.t1   AGAP010929-PA   446     447     99      98      440     96.591  905

Uploading all these tables into google docs:

1-) RBBH-marcela.anopheles-wholegenomeAndsmallDB-check.xlsx - are the RBBH (reciprocal best blast hits) found when reciprocally 
blasting the whole proteins predicted for L. fortunei (marcela.fasta) with the whole proteins from the anopheles genome (esembl). 
Also, these RBBH were checked against RBBH run with a smaller db (from 24th of April, 2017 descripton), which had only the fertility genes of interest. 
2-) genomes-marcela.Anopheles_gambiae.AgamP4.pep.all.RBBH - analysis result of RBBH (reciprocal best blast hits) found when reciprocally blasting the whole 
proteins predicted for L. fortunei (marcela.fasta) with the whole proteins from the anopheles genome (esembl). 
3-) Anopheles_gambiae.AgamP4.pep.crisanti.blastp1e05.covBH.RBBH.xlsx - its an update of previous Ju’s table, but now presenting the RBBH in bold.


Running RBBH for c_elegans again because of the IDs
python ../../scripts/rbbh.py -a prot -t blastp -o marcela.c_elegans.PRJNA13758.WS258.protein.1.glued.faRBBH --threads=4 marcela.fasta c_elegans.PRJNA13758.WS258.protein.1.glued.fa

So, I've run RBBH for marcela.fasta against m.gallo and c.gigas whole-genome-predicted proteins and now I'm blasting this db
against the reduced reproductive proteins of interested ju has created (sequence-proteins-JU.fasta).

cat Crassostrea_gigas.GCA_000297895.1.30.pep.all.fa Mytilus.galloprovincialis.aa > C.gigas.M.gallo.all.genome.proteins.fa
blastp -query sequence-proteins-JU.fasta -db C.gigas.M.gallo.all.genome.proteins.fa -evalue 1e-05 -num_threads 4 -out sequence-proteins-JU.C.gigas.M.gallo.all.genome.proteins.1e05 &
The idea is to identify which protein is which between the two db, because they have different IDs. And then I can explore more the RBBH results.

perl ../../scripts/BLASTcovBH.pl sequence-proteins-JU.C.gigas.M.gallo.all.genome.proteins.1e05 > sequence-proteins-JU.C.gigas.M.gallo.all.genome.proteins.1e05.covBH
Ok, so for the 17 proteins of molluscan reproduction only vasa is present in the RBBH (with both: small db and whole c.gigas genome).

Searching for vitelogenin:
1-) I went to Uniprot and download vitelogenins for c. gigas and 'mytilus'. Total of 6 proteins, including fragments (file: vitelogenin.gigas.mytilus.uniprot.fa)
The first BLAST with marcela.fasta:

 blastp -query vitelogenin.gigas.mytilus.uniprot.fa -db marcela.fasta -evalue 1e-05 -out vitelogenin.gigas.mytilus.uniprot.marcela.fasta1e05 &

tr|G9JJT9|G9JJT9_CRAGI  itr6_234_pil.g2949.t1   39.13   92      56      0       23      114     1483    1574    3e-19   85.9    44.02   5.46
tr|K1QNA2|K1QNA2_CRAGI  itr6_234_pil.g2949.t1   32.89   681     336     121     52      722     162     731     7e-109  387     27.83   33.81
tr|K1QFM6|K1QFM6_CRAGI  itr6_11658_p.g77.t1     50.25   396     194     3       70      463     1       395     3e-137  436     83.30   14.58
tr|K1QMJ1|K1QMJ1_CRAGI  itr6_11658_p.g77.t1     38.99   277     167     2       6       282     8       282     2e-60   207     96.52   10.15
tr|Q49P77|Q49P77_MYTED  itr6_234_pil.g2949.t1   25.56   223     85      81      14      236     162     303     7e-16   76.6    94.09   8.42

Now I've decided to do a BLAST of these sequences with the whole-genome-predicted sequences from m. gallo and c. gigas.
blastp -query vitelogenin.gigas.mytilus.uniprot.fa -db C.gigas.M.gallo.all.genome.proteins.fa -evalue 1e-05 -out vitelogenin.gigas.mytilus.uniprot.C.gigas.M.gallo.all.genome.proteins.fa1e05 &
perl ../../scripts/BLASTcovBH.pl vitelogenin.gigas.mytilus.uniprot.C.gigas.M.gallo.all.genome.proteins.fa1e05 > vitelogenin.gigas.mytilus.uniprot.C.gigas.M.gallo.all.genome.proteins.fa1e05.covBH

25th of May, 2017

Today I'm organizing all the results for the genes of interest found in comparison with A. gambiae, c. elegans or molluscan (c. gigas
and m. gallo). The results of the RBBH from whole-genome comparisons are similar to the ones done with the smaller datasets.
After conferming the RBBH, I went to see which was the function of the LF protein when BLASTed against Uniprot/NR.
So files containing the RBBH (RBBH-marcela.anopheles-wholegenomeAndsmallDB-check.xlsx , c_elegans_RBBH-confirmation.xlsx) contain
also the fuction description from the best hit found in the BLAST against Uniprot/NR. For the molluscan, before the only RBBH was
for the vasa protein, and it was confirmed with the RBBH against c. gigas whole genome.

20th of June, 2017.

Juliana Americo had selected 266 molluscan genes of interested in NCBI or Uniprot. I have - however - run the RBBH for the complete set of 
proteins for the C. gigas and M. gallo (murgarella) genomes. Then, the second thing I did was to run RBBH for the 266 proteins
selected by Ju against C.gigas and M. gallo genome-predicted proteins: so then I could have the ID of the protein in the genome
to compare with the RBBH result against LF (marcela.Mytilus.galloprovincialis.aa.RBBH and marcela.Crassostrea_gigas.GCA_000297895.1.30.pep.all.fa.RBBH). 

The RBBH for 266 agains genome-predicted-proteins:
python ../../scripts/rbbh.py -a prot -t blastp -o Ptn_Reproduction_Mollusca_dbC.gigas.M.gallo.all.genome.proteins.fa.RBBH --threads=4 C.gigas.M.gallo.all.genome.proteins.fa Ptn_Reproduction_Mollusca_db.fasta

After the comparisons I found the vasa protein but also the estrogen receptor protein with RBBH against LF. How I did that?
I checked the IDS of the genome-predicted-proteins in here Ptn_Reproduction_Mollusca_dbC.gigas.M.gallo.all.genome.proteins.fa.RBBH
and checked if they were present in here marcela.Mytilus.galloprovincialis.aa.RBBH or marcela.Crassostrea_gigas.GCA_000297895.1.30.pep.all.fa.RBBH


26th of June, 2017
Today I'm trying another approach to find the insertion area in LF genome. Ju Americo found a paper where they state housekeeping
genes are usually organized in clusters inside the geneome (https://drive.google.com/file/d/0B2pYU5fjrq3EYThkUjBBZWFKLUk/view?ts=595102d4)
Therefore, if we find a area between such genes, the possibility of this area to be expressed is high.

So, right now I'll download the 400 proteins that state as coming from housekeeping genes from RefSeq:
Batchentry - list of IDs - send to coding file (proteins).
To eliminate all empty lines of a file:
:g/^$/d
And now running blast with the 408 housekeeping proteins
makeblastdb -in housekeeping.fa -dbtype prot
blastp -query marcela.fasta -db housekeeping.fa -evalue 1e-05 -num_threads 4 -out marcela.fasta.housekeeping.1e-05 &


27th of June, 2017

How to get all spaces were there is no gene in a gff?

beedtools complement -i marcelAAAa_pasa.gene_structures_post_PASA_updates.85229.gff3 -g /home/bioma/LF.genome.blas.fasta.genomefile

How to generate the genome file (-g)?
samtools faidx LF.genome.blas.fasta
awk -v OFS='\t' {'print $1,$2'} lLF.genome.blas.fasta.fai > LF.genome.blas.fasta.genomefile


29 June,2017

Nearly all results for the first technical delivery (gene targets and insetion area) are ready. Bellow one find the description
and gdrive location of all files:

For insertion area: Files listed bellow are here: https://drive.google.com/drive/u/1/folders/0B2IClYPbMhLAZF83U2ZfSjZZV1E
- Intergenic regions coodernates for all scaffolds (LF.genome.intergenic)
- Housekeeping LF blast result (marcela.fasta.housekeeping.1e-05.covBH)
- Intergenic regions coordenates only for scaffolds that have housekeeping genes (LF.genome.Intergenic-for-housekeeping.coords)
- Specific area chosen for insertion: scaffold FASTA file (L_FORTUNEI.itr6_10266_.fa)
- Specific area chosen for insertion: gff of scaffold with all the coordenates for the 4 genes coded in this scaffold, including the two convergent genes (itr6_10266_.ok.gff)
- 2 FASTA files for each of the two convergent genes containing: mrna, coding seq and protein (itr6_10266_p.g262.t1.codingSeq-mrna-protein , itr6_10266_p.g263.t1.codingSeq-mrna-protein.fasta)

https://drive.google.com/drive/u/1/folders/0B2IClYPbMhLAZF83U2ZfSjZZV1E


For gene targets. Files listed bellow are here: https://drive.google.com/drive/u/1/folders/0B2pYU5fjrq3ENGZrY3IteGN4T0U
- gff (genome location) for all 27 gene targets (LF.gene.targets.gff3)
- mrna sequences for all 27 gene targets (gene-targets.mrna)
- coding seq for all 27 gene targets (gene-targets.codingseq)
- genome scaffolds were this sequences are (LF.gene.targets.scaffold.fasta)


31th July, 2017

Ju Americo required to find the promotor region of 9 genes in L. fortunei. These genes were found in C. gigas or P. yessoensis. 
Requirement table is here: https://docs.google.com/spreadsheets/d/18NxP29QCvPcm0z8sIMKE0lSYLbL8YbtM-svhHWC33q0/edit?ts=597f2fd9#gid=0

So, I've downloaded all sequences she asked for from Uniprot and made a BLAST against LF proteins (marcela.fasta).
For all sequences I've downloaded sequences from C. gigas or Patinopecten yessoensis. But for GAPDH and Cytochrome
I've also downloaded the human protein. In any case, there was no blast with any marcela.fasta proteins.


9th august, 2017

I'm looking for the promotor regions in some specific genes Ju Americo has selected from other species. In this link (https://docs.google.com/spreadsheets/d/18NxP29QCvPcm0z8sIMKE0lSYLbL8YbtM-svhHWC33q0/edit?ts=597f2fd9#gid=16907541
) in sheet3, you find the LF ortologs of such genes, them the RBBH or not, then the BLAST result once run against NCBI nr, then gene position in scaffold, and
the genes that come just before and after.

Also, Ju Americo would like to know if such genes are expressed in the transcriptomes we already have sequenced (from Marcela's thesis).
The thing is we have not done a diferential expression analysis in such data. But one thing we can check is if such genes are present
or absent in the de novo assembled transcripts for each mussel (we have sequenced 3 mussels, 4 tissues). Even better would be to
assembly with trinity transcripts for each tissue, or use cufflinks for such.

Right now, the only thing we have is a blast result against the trinity sets for each mussel.
1-) First I've created a db with the 7 LF ortologs of the sequences Ju has required:
/home/bioma/2017_marcela/CTG/lf.genes.for.promoters.fa
Then, I've run a blastx of the 3 sets of trinity transcripts against such db. The blast results are:
/home/bioma/2017_marcela/CTG/Trinity.1.lf.genes.for.promotersblastx1e05.covBH
/home/bioma/2017_marcela/CTG/Trinity.3.lf.genes.for.promotersblastx1e05.covBH
/home/bioma/2017_marcela/CTG/Trinity.6.lf.genes.for.promotersblastx1e05.covBH
Then I saved only blast results where 90% or more of the db sequence is inside the blast result.
/home/bioma/2017_marcela/CTG/Trinity.3.lf.genes.for.promotersblastx1e05.covBH.id2.90
/home/bioma/2017_marcela/CTG/Trinity.1.lf.genes.for.promotersblastx1e05.covBH.id2.90
/home/bioma/2017_marcela/CTG/Trinity.6.lf.genes.for.promotersblastx1e05.covBH.id2.90

All the files are also in biobureau gdocs: https://drive.google.com/drive/u/1/folders/0B2IClYPbMhLARFVhaGNEMFNwcms

All the 7 genes Ju has selected are constitutive, or expressed in the embrio and on (https://docs.google.com/spreadsheets/d/18NxP29QCvPcm0z8sIMKE0lSYLbL8YbtM-svhHWC33q0/edit#gid=0)
Only vasa is supposed to be expressed only in the germinative cells. So, I've checked the trinity sets - that should not have this transcript,
since it was sequenced for digestive gland, mantle, foot and gills of adults - blast results, and really did not found vasa for the trinity set 3 and 6.
For mussel 1, it seems to have a hit,eventhough is not a perfect hit:

TR79527-c2_g1_i3  itr6_4515_pi.g108.t1  100.00  562 0  0  1111 2796 17 578 0.0  1127  44.44  91.68

This is a m8 result plus 2 last collumns, that show % of query coverage and % of subject coverage. As we can see, this results
is the best hit and only covers 44.44% of the trinity transcript.

In the blast results (here lines 449-451) one can also find other blast results for the 7 genes Ju has selected (line 447). 


October 30th, 2017

Luana and João have been working on assemblying L.fortunei transcriptomes of each indidual and specific tissue. The software being used for it is Trinity.
The assembly of tissue C, from organism 1, was the following:

nohup Trinity --seqType fq --left 1c_S4_R1_001.D.gland.fastq --right 1c_S4_R2_001.D.gland.fastq --max_memory 8G