Miscalling integration site and breakpoint positions on genome #65

cnobles · 2017-02-15T16:34:34Z

On line 476 of intSiteLogic.R, we've assigned the "tStart" and "tEnd" from the alignment output to the GRanges start and end positions. This is correct most of the time, but the "tStart" and "tEnd" in the psl output file correspond to the edges of the alignment. If there are bases at the beginning and end of the read that do not align, they are not considered when assigning the position.

I think this is a mistake in logic. Instead, the range assignment should be incorporate the flanking nucleotides that may not align, especially on the LTR end. This region is known to have a drop in sequence quality that can lead to base miscalling. These miscalled bases may not be aligned which will change the position of the integration site position calling. A quality metric used for filtering is that the "qStart" must be less than or equal to 5nt (line 519, adjustible with maxAlignStart), which means that the called position could be up to 5 nt away from the actual junction of the LTR.

This is fixed downstream in our reports by standardizing within a 5bp window, but could be refined in the current informatic processing. The proposed change would look something like below:

intSiteLogic.R : Line 476

    ranges=IRanges(start=algns$tStart, end=algns$tEnd)

changed to:

    ranges = ifelse(algns$strand == "+",
        IRanges(
            start = (algns$tStart - algns$qStart + 1),
            end = (algns$end + (algns$qSize - algns$qEnd))),
        IRanges(
            start = (algns$tStart - (algns$qSize - algns$qEnd)),
            end = (algns$tEnd + algns$qStart - 1)))

The text was updated successfully, but these errors were encountered:

chasberry · 2017-02-15T17:33:24Z

Whether it is `logical' or not matters less than whether the data support this assessment.

Please do not commit to this before running checks of the sort in

/scp:[email protected]:/home/berry/projects/bushman/WAS/derep-6-16/derep-and-jack.Rnw

to support your reasoning.

In that report, I did note after the first table:

In both tables, There are higher counts at 1 than at -1 in the + orientation and lower counts on the - orientation. Similar observations apply to +2 vs -2, +3 vs -3, and so on.

This might occur if a more fragments are mapped to a position that is in the host DNA a few bases away from the junction rather than to a position in the host DNA on the opposite side of hte insertion. To say this another way, it is more likely that base pairs of host DNA are omitted rather than added. I do not show this here, but a similar observation applies to ...

So rerunning those tables with/without your proposed fix would be particularly helpful.

I do not understand the `fixed downstream' sentence. Maybe OK for rough and ready reporting, but makes me itch to think this might roll into data that are used for finer analyses.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Miscalling integration site and breakpoint positions on genome #65

Miscalling integration site and breakpoint positions on genome #65

cnobles commented Feb 15, 2017 •

edited

Loading

chasberry commented Feb 15, 2017 •

edited

Loading

Miscalling integration site and breakpoint positions on genome #65

Miscalling integration site and breakpoint positions on genome #65

Comments

cnobles commented Feb 15, 2017 • edited Loading

chasberry commented Feb 15, 2017 • edited Loading

cnobles commented Feb 15, 2017 •

edited

Loading

chasberry commented Feb 15, 2017 •

edited

Loading