Chapter 5 ‐ Sequence Annotation
- PGA: https://github.com/quxiaojian/PGA
- Ugene: https://ugene.net/
We have an assembled plastid sequence; the next step is to annotate it. This can be done in different ways. Many annotation tools are available, but here we will use Plastid Genome Annotator (PGA), a tool made specifically for plastid annotation. PGA takes a number of GenBank files as references. In our case these files contain the plastid genome of a specific species together with the genes present in it. PGA also needs the fasta files of the sequences we want to annotate; we will annotate both of the paths we created. We have already downloaded the GenBank files for several related species.
$ cd NERC_training/plastid_annotation/
First we copy the reference GenBank files into the working directory. Then we make a folder for our assembled sequences and link them into it.
$ cp -R NERC_training/reference/plastid_annotation_references/ ./
$ mkdir paths
$ ln -s NERC_training/plastid_assembly/ptgaul/result_3000/ptGAUL_final_assembly/*.fasta ./paths/
$ ls *
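A symbolic link created with a relative target is resolved relative to the folder that holds the link, not the folder the command was run from. It is therefore worth checking that every link in paths/ points at a real file before starting the annotation. The sketch below is one way to do that; the function name check_links is our own and not part of any tool used in this chapter.

```shell
# Hypothetical helper (not part of PGA or ptGAUL): report symlinks
# in a folder whose target does not exist.
check_links() {
  dir="$1"
  broken=0
  for f in "$dir"/*; do
    # -L: the entry is a symlink; ! -e: its target cannot be reached
    if [ -L "$f" ] && [ ! -e "$f" ]; then
      echo "broken link: $f"
      broken=$((broken + 1))
    fi
  done
  echo "$broken broken link(s) in $dir"
  # Return success only when no broken links were found
  [ "$broken" -eq 0 ]
}
```

Running `check_links paths` after the `ln -s` step should report 0 broken links. If it does not, recreating the links with absolute paths is usually the simplest fix.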
Once we have the reference ready and we have our sequence we can start the annotation.
$ perl PGA.pl -r NERC_training/plastid_annotation/plastid_annotation_references/ -t paths/ -out annotated/
where,
-r is the folder containing reference GenBank files
-t is the folder with the fasta sequences to be annotated
-out is the folder to which the results will be written
Once the tool has finished we can check the output folder. We should see that our two sequences are now GenBank files.
$ ls annotated/
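Beyond listing the files, a quick way to see what was annotated is to pull out the gene names from each GenBank file. The sketch below assumes that genes are recorded as /gene="name" qualifiers, as is usual in GenBank format; the function name list_genes is our own, and the file names in the comment depend on your run.

```shell
# Hypothetical helper: print the unique gene names found in one GenBank file.
list_genes() {
  # Keep only the /gene="..." qualifiers, take the text between the
  # quotes, and collapse duplicates (gene and CDS share the same name).
  grep -o '/gene="[^"]*"' "$1" | cut -d'"' -f2 | sort -u
}

# Example (file names depend on your run):
#   for f in annotated/*; do echo "$f: $(list_genes "$f" | wc -l) genes"; done
```

If the two paths report a different number of genes, that is a first hint of where to look when you compare them in UGENE.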
We can look at these annotations using UGENE. This tool can be used to do different types of visualisation, comparison and many other things. If you have time, take a look at the functionalities of this tool. To start UGENE run the following command:
$ ugeneui
In UGENE, open the created annotation by clicking File → Open. Find your annotated file and open it. The image you see should look something like the following.
➔ Open both created paths. Can you find differences between the two paths by looking at the annotations?
UGENE can also compare two sequences using a dotplot. A dotplot needs two sequences, one along each axis. Every position in the first sequence that is similar to a position anywhere in the second sequence is drawn as a dot; where positions do not match, the plot stays empty. Using the two different paths, create a dotplot by clicking Tools → Build dotplot. Select path1 and path2 as the two sequences, and make sure to use the GenBank files so that you also see the annotations. Click next, check “Search inverted repeat” and click “OK”. Your image should look something like this:
➔ Can you see differences between the two paths? If so, what are the differences?
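To make the idea behind a dotplot concrete, here is a toy version for very short sequences. It is only an illustration of the principle, not how UGENE computes its plots: it compares every word of length k in the first sequence with every word of length k in the second, and prints a * where they are identical. The function name dotplot and the default k of 3 are our own choices.

```shell
# Toy dotplot: rows follow sequence A, columns follow sequence B.
# A '*' marks a position where the k-letter words of A and B are identical.
dotplot() {
  awk -v a="$1" -v b="$2" -v k="${3:-3}" 'BEGIN {
    na = length(a) - k + 1   # number of k-letter words in A
    nb = length(b) - k + 1   # number of k-letter words in B
    for (i = 1; i <= na; i++) {
      row = ""
      for (j = 1; j <= nb; j++)
        row = row (substr(a, i, k) == substr(b, j, k) ? "*" : ".")
      print row
    }
  }'
}

dotplot ACGTACGT ACGTTCGT 3
```

Stretches that match in both sequences appear as diagonal runs of *. Real plastid sequences are far too long for this approach, which is why we use UGENE.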
We want to see whether our paths have a similar pattern and order to previously published data, so we have downloaded a reference to compare our sequences against. Once again go to Tools → Build dotplot. Select one of your sequences and the reference sequence (located at /NERC_training/plastid_annotation/plastid_annotation_references/Cotinus_coggygria_NC_054342.1.gb). Again check “Search inverted repeat”, and this time also set the identity to 90%.
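An inverted repeat is a stretch of sequence that reappears elsewhere as its reverse complement, so it is only found when the second strand orientation is searched as well. As a small illustration of what a reverse complement is (not UGENE's implementation), it can be produced on the command line as follows, assuming the rev utility is available on your system:

```shell
# Reverse-complement a DNA string: reverse it, then swap A<->T and C<->G.
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

revcomp ACGTT   # prints AACGT
```

A region of your sequence that matches the reverse complement of a region in the other sequence is plotted as a line running in the opposite direction to the main diagonal.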
➔ Are there differences between your sequence and the reference sequence?
➔ Check both paths in this manner. Does one seem more similar to the reference?
Make a note of this, you will need it later on.
Do not worry if your sequences are different to the reference sequence, we can continue with these sequences.
If the sequence has a different order, we would normally try to find out why: try different settings for the plastid assembly, or try different tools to see whether another one is more suitable for this species or sequencing run. In this case, however, we do not have the time.
If you do have the time and want to take a look, you can rerun ptGAUL with different settings; use -c in ptGAUL to change the minimal coverage. You do not have to redo the annotation to create the dotplot, since UGENE can build a dotplot between a fasta file and a GenBank file without a problem.
In some cases, ptGAUL has made assumptions which are not true. We will take a look together to see if we can fix some of these assumptions.
Now that we have a sequence and an annotation, we finally need to find out what we are actually working on. For this we will extract a barcode gene from one of the paths and run it through BOLD to get an idea of what we are looking at.
The barcode gene we will look at is matK. Open your GenBank file in UGENE by going to File → Open and locating the file. We could look for the gene visually, but searching for it is easier. At the bottom of your screen, below the visual representation of our plastid, there is a list of feature information which can be opened. Right click the list and choose Find Qualifier. In the pop-up (shown in the figure below), enter the name of the gene we are looking for in the value field.
In this case we will look for matK. Click next and we should find the gene of interest. Once we have found it, we can click on the gene and see where it is on our plastid. To copy the sequence, right click the gene in the feature list, go to copy/paste and select “Copy annotation sequence”. We can now paste the sequence anywhere. You can save it in a separate file if you would like, but in this case we can paste it into BOLD directly.
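If you prefer to keep the copied sequence, one option is to wrap it in a FASTA record and check its length and alphabet before submitting it anywhere. The sketch below uses a hypothetical helper, to_fasta; the record name is whatever you choose, and the set of accepted letters (A, C, G, T and N) is an assumption that you may want to widen.

```shell
# Hypothetical helper: write a raw sequence as a FASTA record on stdout
# and report its length and any unexpected characters on stderr.
to_fasta() {
  name="$1"
  # Strip spaces, tabs and line breaks that may come along with the paste
  seq=$(printf '%s' "$2" | tr -d ' \t\r\n')
  printf '>%s\n%s\n' "$name" "$seq"
  echo "length: ${#seq}" >&2
  case "$seq" in
    *[!ACGTNacgtn]*) echo "warning: unexpected characters in sequence" >&2 ;;
  esac
}

# Example (replace PASTED_SEQUENCE with the sequence copied from UGENE):
#   to_fasta matK_path1 "PASTED_SEQUENCE" > matK_path1.fasta
```

Because the report goes to stderr, redirecting stdout to a file leaves a clean FASTA record that can be uploaded or pasted later.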
Go to https://www.boldsystems.org/index.php/IDS_OpenIdEngine, select “Plant Identification” and paste the sequence into the fasta sequence field. Continue by pressing Submit at the bottom of the form. It might take a few minutes before you get any results. The results should look something like the following.
As a second identification method BLAST can be used (https://blast.ncbi.nlm.nih.gov/Blast.cgi). Click on nucleotide blast and paste the sequence into the top box. In this case we do not need to change any of the other settings, so we can click BLAST at the bottom to start the comparison.
If you have time, do the same for the rbcL gene. Do the results match?
➔ What species do you think you are working on?