Skip to content

RGASP-consortium/reconstruction

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 

Repository files navigation

2013, July 4th RGASP_analyzer (developed for RGASP, for details see: http://www.gencodegenes.org/rgasp/)

This small program can be used to compare an automatically generated gtf file with a gtf file annotation. It will output how many features (exons, transcript and genes) are within the annotation and the prediction and how many of the predicted features are correct.

Installation: - make sure Java 1.5 or newer is installed - jargs.jar must be on the Java classpath - bash must be available - awk must be installed and in the path

Usage: java -Xms1G -Xmx2G org.rgasp.PredictionAnalyser -l, --level ... one of "cds", "exon", "both"

						if "cds" is specified, only cds records within the annotation and
						the prediction are considered
						
						if "exon" is specified only exon records within the annotation and
						the prediction are considered
						
						if "both" is specified a "cds" and an "exon" analysis
						will be performed separately
						
-m, --mode			... one of "fixed", "flexible", "both"

						if "fixed" mode is specified, predicted records are verified
						only if there is an annotation record with the same borders
						
						if "flexible" mode is specified, predicted records, that correspond to
						a first or last exon of a transcript in the annotation, may have flexible
						outer borders
						
						if "both" is specified, fixed and flexible mode analysis will be performed
						separately
						
-a, --annotation	... path to annotation file
-p, --prediction	... path to prediction file / folder
-o, -- output		... output folder

Example: java -Xms1G -Xmx2G -jar RGASP_analyzer.jar -l exon -m both -a path2annotation -p path2prediction -o outputfolder

File format: Both the annotation and the prediction files are expected to be in gtf file format. Please note that the key-value pairs gene_id and transcript_id are required! (for details see: http://www.sanger.ac.uk/resources/databases/encode/gencodeformat.html)

Please note, that this analysis only considers gtf records with the feature-type exon or CDS.
Make sure, that you adapt your gtf file accordingly.
Example: If your gtf file contains information about coding exons and UTRs, use this information
to create exon records.

Please make sure, that the file format for the annotation and the prediction files are consistent,
especially make sure that:
	- chromosome names are called consistent
	- stop codons are consistently within the coding exons or not

Output: In the folder specified with the -o, --output parameter the following tables will be created

- LEVEL_MODE_statistics.txt
  lists for a single prediction or each prediction file in the prediction folder how many exons,
  transcripts, or genes have been predicted, and how many of them are true positives.
  Example:
  	sensitivity for (coding) exons can be calculated by: TP_Exons / Num_Exons_Ref
  	specificity for (coding) exons can be calculated by: TP_Exons_Pred / Num_Exons_Pred

- LEVEL_MODE_exon.tbl
  binary table that lists for each exon of an annotated transcript whether it has been
  predicted (1) or not (0) by a prediction file.
  Exons are identified by their gene and transcript identifier and their position within the transcript

- LEVEL_MODE_exon_unique.tbl
  binary table as LEVEL_MODE_exon.tbl except that exons that are shared between different transcript
  isoforms will only appear once

- LEVEL_MODE_transcript.tbl
  binary table that lists for each transcript of an annotation file whether it has been predicted (1) or not (0)
  by a prediction file.

- LEVEL_MODE_gene.tbl
  binary table that lists for each gene of an annotation file whether it has been predicted (1) or not (0)
  by a prediction file.
  
While the statistic table gives you fast access to sensitivity and specificity values, the other tables can
be used to make more complex follow-up analyses.

Memory: The memory Java allocated by default will most probably not be sufficient to run your analysis. Therefore, you would usually run the analysis with the -Xms1G -Xmx2G arguments. Depending on your input file sizes, you might want to increase these argument.

Contact: When facing any problems please contact: Tamara Steijger European Bioinformatics Institute [email protected]

About

Transcript reconstruction evaluation software

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages