Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Miscalling integration site and breakpoint positions on genome #65

Open
cnobles opened this issue Feb 15, 2017 · 1 comment
Open

Miscalling integration site and breakpoint positions on genome #65

cnobles opened this issue Feb 15, 2017 · 1 comment

Comments

@cnobles
Copy link

cnobles commented Feb 15, 2017

On line 476 of intSiteLogic.R, we've assigned the "tStart" and "tEnd" from the alignment output to the GRanges start and end positions. This is correct most of the time, but the "tStart" and "tEnd" in the psl output file correspond to the edges of the alignment. If there are bases at the beginning and end of the read that do not align, they are not considered when assigning the position.

I think this is a mistake in logic. Instead, the range assignment should be incorporate the flanking nucleotides that may not align, especially on the LTR end. This region is known to have a drop in sequence quality that can lead to base miscalling. These miscalled bases may not be aligned which will change the position of the integration site position calling. A quality metric used for filtering is that the "qStart" must be less than or equal to 5nt (line 519, adjustible with maxAlignStart), which means that the called position could be up to 5 nt away from the actual junction of the LTR.

This is fixed downstream in our reports by standardizing within a 5bp window, but could be refined in the current informatic processing. The proposed change would look something like below:

intSiteLogic.R : Line 476

    ranges=IRanges(start=algns$tStart, end=algns$tEnd)

changed to:

    ranges = ifelse(algns$strand == "+",
        IRanges(
            start = (algns$tStart - algns$qStart + 1),
            end = (algns$end + (algns$qSize - algns$qEnd))),
        IRanges(
            start = (algns$tStart - (algns$qSize - algns$qEnd)),
            end = (algns$tEnd + algns$qStart - 1)))
@chasberry
Copy link

chasberry commented Feb 15, 2017

Whether it is `logical' or not matters less than whether the data support this assessment.

Please do not commit to this before running checks of the sort in

/scp:[email protected]:/home/berry/projects/bushman/WAS/derep-6-16/derep-and-jack.Rnw

to support your reasoning.

In that report, I did note after the first table:

In both tables, There are higher counts at 1 than at -1 in the + orientation and lower counts on the - orientation. Similar observations apply to +2 vs -2, +3 vs -3, and so on.

This might occur if a more fragments are mapped to a position that is in the host DNA a few bases away from the junction rather than to a position in the host DNA on the opposite side of hte insertion. To say this another way, it is more likely that base pairs of host DNA are omitted rather than added. I do not show this here, but a similar observation applies to ...

So rerunning those tables with/without your proposed fix would be particularly helpful.


I do not understand the `fixed downstream' sentence. Maybe OK for rough and ready reporting, but makes me itch to think this might roll into data that are used for finer analyses.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants