
Allele Calling

luissian edited this page Apr 7, 2019 · 15 revisions

The allele calling feature in Taranis identifies, for each sample FASTA file, the allele number defined in the cgMLST schema for each core gene.

Based on this check, each core gene is classified into one of the following categories:

  • ASM
  • ALM
  • Exact Match
  • INF
  • LNF
  • NIPH
  • PLOT

The figure below describes the algorithm used to classify each gene from the cgMLST schema.

taranis_diagram_identify_alleles.jpg

The workflow starts by creating a local BLAST database for each of the samples.

Then we take the first gene in the cgMLST schema and run a BLAST query against the local database of the first sample file.

This first BLAST query requires 100% identity and 100% alignment.
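As a rough illustration of the two BLAST steps described above, the sketch below builds the command lines for `makeblastdb` and `blastn` from the BLAST+ suite. The function names, the file paths and the selected output columns are assumptions made for this example and are not taken from the Taranis source code. The alignment percentage is computed here from the hit length and the allele length; Taranis may apply that filter differently.

```python
def makeblastdb_cmd(sample_fasta, db_path):
    """Command that builds a nucleotide BLAST database from one sample."""
    return ["makeblastdb", "-in", sample_fasta, "-dbtype", "nucl", "-out", db_path]


def blastn_cmd(gene_fasta, db_path, identity):
    """Command that queries the alleles of one gene against a sample database."""
    return [
        "blastn",
        "-query", gene_fasta,
        "-db", db_path,
        "-perc_identity", str(identity),
        # Tabular output; this column selection is an assumption of the sketch.
        "-outfmt", "6 qseqid sseqid pident qlen length sstart send sseq",
    ]


def alignment_percent(hit_length, allele_length):
    """Percentage of the schema allele that is covered by the hit."""
    return 100.0 * hit_length / allele_length
```

The returned lists can be passed to `subprocess.run(cmd, capture_output=True, text=True, check=True)`. The first query of the workflow would then use `identity=100` and keep only hits whose `alignment_percent` is 100.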

Suppose this query returns a match. In that case we look more closely to find out whether the gene is repeated in the sample sequence. If it is, we mark the match as NIPHEM.

If we find only one match (remember that this match had 100% identity and 100% alignment), more matches could still appear when the identity threshold is lowered. So we repeat the query, this time with 90% identity and 100% alignment.

If this second query shows that the gene sequence occurs two or more times in the sample, we mark the gene as NIPH.

If, on the contrary, only one match is found, we mark the gene as Exact Match and record the number of the matching allele in the core gene schema.
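The exact-match branch described above can be summarised in a few lines. This is only a sketch of the decision logic as stated in the text: the function name and the idea of passing the two hit lists as plain Python lists are assumptions, and the real implementation may organise these checks differently.

```python
def classify_exact_branch(hits_100_100, hits_90_100):
    """Classify a gene once the 100% identity / 100% alignment query is done.

    hits_100_100: hits found with 100% identity and 100% alignment.
    hits_90_100:  hits found with 90% identity and 100% alignment
                  (only relevant when there is exactly one exact hit).
    Returns None when there is no exact hit, so the caller can continue
    with the relaxed search.
    """
    if not hits_100_100:
        return None
    if len(hits_100_100) > 1:
        # The gene is repeated in the sample with exact copies.
        return "NIPHEM"
    if len(hits_90_100) >= 2:
        # One exact copy plus at least one similar copy.
        return "NIPH"
    return "Exact Match"
```

Note that the 90% identity query also returns the exact hit itself, which is why two or more hits are needed before the gene is marked as NIPH.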

Going back to the first query (100% identity, 100% alignment): if no match is found, we relax the search by lowering the identity to 90% and the alignment to 80%.

If there is still no match, we mark the gene as LNF.

If a match is found, we try to determine whether the allele found can be handled as a new inferred allele.

To decide whether a match can be marked as inferred, we first need to introduce the "allele variability".

The alleles of a gene in the schema do not all have the same sequence length. For example, they could have lengths of 90, 100 or 120 nucleotides; in that case the variability is 90, 100, 120. If the length of the match is one of these values, the match is marked as INF and is assigned a number.

The number and the sequence are stored so that they can be compared with the matches found in other samples. If another sample has the same sequence, it is assigned the same number; if not, the number is incremented to show that the sequences differ.

If the length of the match is not in the list, the match is considered to contain an insertion or a deletion.
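A minimal sketch of the length check and of the numbering of inferred alleles is shown below. The names `is_inferred` and `InferredAlleleRegistry`, and the choice to start numbering at 1, are assumptions for illustration only; the text does not specify how Taranis stores or formats these numbers.

```python
def allele_lengths(schema_alleles):
    """Set of sequence lengths found among the alleles of one gene."""
    return {len(seq) for seq in schema_alleles}


def is_inferred(match_seq, schema_alleles):
    """True when the match has one of the lengths present in the schema."""
    return len(match_seq) in allele_lengths(schema_alleles)


class InferredAlleleRegistry:
    """Gives the same number to identical sequences across samples."""

    def __init__(self):
        self._numbers = {}

    def number_for(self, match_seq):
        key = match_seq.upper()
        if key not in self._numbers:
            # New sequence: step the counter (numbering from 1 is an assumption).
            self._numbers[key] = len(self._numbers) + 1
        return self._numbers[key]
```

With schema alleles of length 90, 100 and 120, a match of length 100 would be treated as inferred, while a match of length 97 would fall through to the insertion/deletion checks.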

In some situations the matched sequence does not reach 100% identity and 100% alignment because some nucleotides were not assembled and the match lies at the beginning or the end of a contig in the sample. We mark these cases as PLOT. This can happen when the samples are of poor quality.
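One possible way to express the PLOT condition is to test whether a partial hit touches either end of the contig. The sketch below assumes BLAST-style 1-based subject coordinates (`sstart`, `send`) and a known contig length; the function name and the exact criterion are assumptions, not the check Taranis actually performs.

```python
def is_plot(hit_length, allele_length, sstart, send, contig_length):
    """True when a partial hit lies at the beginning or end of a contig.

    Coordinates are 1-based. For hits on the minus strand BLAST reports
    sstart > send, so both values are ordered first.
    """
    first, last = min(sstart, send), max(sstart, send)
    is_partial = hit_length < allele_length
    touches_edge = first == 1 or last == contig_length
    return is_partial and touches_edge
```

For a 100-nucleotide allele, an 80-nucleotide hit covering positions 1 to 80 of a contig would be flagged, whereas the same hit in the middle of the contig would not.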

At this point there are two possible options: the match is shorter than the gene because of deletions (we mark it as DELETE), or, on the contrary, insertions make the sequence longer (we mark it as INSERT).

We have extended the DELETE/INSERT classification by checking the protein that the new gene sequence encodes.

For example, a deleted nucleotide changes the resulting protein compared with the "original". Depending on the sequence, the new protein can be longer than the original, in which case we name it ALM. If the new protein is shorter, we add the ASM prefix.
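The DELETE/INSERT decision and the protein comparison can be sketched as follows. The example assumes that both sequences are read on the forward strand starting at their first nucleotide, that the standard genetic code applies, and that translation stops at the first stop codon. The function names and the reference allele used for the comparison are assumptions of this sketch.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate up to the first stop codon; unknown codons become 'X'."""
    seq = seq.upper()
    protein = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def classify_indel(match_length, reference_length):
    """DELETE when the match is shorter than the reference, INSERT when longer."""
    if match_length < reference_length:
        return "DELETE"
    if match_length > reference_length:
        return "INSERT"
    return None


def protein_change(reference_seq, match_seq):
    """ALM when the new protein is longer, ASM when it is shorter."""
    ref_len = len(translate(reference_seq))
    new_len = len(translate(match_seq))
    if new_len > ref_len:
        return "ALM"
    if new_len < ref_len:
        return "ASM"
    return None
```

For instance, removing the fourth nucleotide of `ATGGTAACCGGTTGA` gives `ATGTAACCGGTTGA`, which introduces an early stop codon (`TAA`) and therefore a shorter protein.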

As mentioned before, Taranis writes its output files in TSV format so that bioinformaticians can use them for deeper investigation. These files are:

  • deletions.tsv
  • inferred_alleles.tsv
  • insertions.tsv
  • matching_contigs.tsv
  • paralog.tsv
  • plot.tsv
  • result.tsv
  • result_for_tree_diagram.tsv
  • snp.tsv
  • summary_result.tsv

Detailed descriptions of these files are given in separate chapters.