Redux: flagging microhaps for problematic repetitive or indel content #153

standage opened this issue Apr 17, 2024 · 0 comments

Excluding microhap markers from a candidate panel design based on a crude determination of repetitive content has proven ineffective. The flagging method implemented in #147 relied on a simple overlap analysis that took into account neither each repetitive element’s class nor its placement relative to the relevant genomic intervals. Not all repetitive elements are created equal, and where repeats fall relative to the body of a microhap makes a difference. In this thread, I’ll describe ideas I have for a more fine-grained and nuanced integration of microhap, repeat, and other variant data.

microhaplotype repeat analysis

Specificity of flanking regions for primer design

Many of my initial observations were based on samples sequenced following hybridization capture enrichment (HCE) for microhap targets. While I still have an interest in at least one more probe set redesign for testing purposes, microhap sequencing in general will primarily be implemented with assays based on targeted amplicon sequencing. Consequently, repetitive sequence content is a concern in the long run only to the extent that it impacts the specificity of primer binding. Repetitive sequence within or around a microhap shouldn’t be problematic if sufficiently unique flanking primer sites can be identified.

Thus, the highest priority for multiplex PCR panel design is measuring the specificity of each marker’s flanking regions. This could be done with BLAST or Primer-BLAST, and maybe it would be good to look at the results of such an analysis in a handful of representative cases. But I think a much more direct and efficient way to measure specificity is through k-mer abundance. In broad strokes, we could count the abundance of each k-mer in the GRCh38 reference genome, and then go back and look at the flanking regions around each marker (extend symmetrically to a maximum length L), determine the abundance of each k-mer in the flanking regions, and use this information to look for a unique pair of binding sites.

I had initially thought about using a k-mer size of 21 due to habit, but Rob Lagace suggested that k=11 might work better for this purpose. It’s probably good to examine a range of k values (11, 13, 15) on markers with proven performance in published assays to see what the k-mer abundance distributions look like.
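The k-mer abundance idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the MicroHapDB implementation: the function names (`canonical`, `count_kmers`, `flank_abundance`, `unique_windows`) and the `primer_len` and `max_abundance` parameters are hypothetical, k-mers are canonicalized so that both strands are treated alike, and the in-memory `Counter` would likely need to be replaced by a dedicated k-mer counting tool for the full GRCh38 reference.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    revcomp = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, revcomp)


def count_kmers(sequence, k):
    """Count canonical k-mers in a sequence, skipping k-mers with non-ACGT characters."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    return counts


def flank_abundance(flank, counts, k):
    """Look up the reference abundance of each successive k-mer in a flanking sequence."""
    seq = flank.upper()
    return [counts[canonical(seq[i:i + k])] for i in range(len(seq) - k + 1)]


def unique_windows(abundances, k, primer_len, max_abundance=1):
    """Return start offsets of primer-length windows in which every k-mer
    has a reference abundance of at most max_abundance (hypothetical cutoff)."""
    n_kmers = primer_len - k + 1
    return [
        i for i in range(len(abundances) - n_kmers + 1)
        if max(abundances[i:i + n_kmers]) <= max_abundance
    ]
```

Running `count_kmers` once per k value (for example 11, 13, and 15) and comparing the resulting `flank_abundance` profiles for markers with proven performance would be one way to see how the abundance distributions change with k.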

Exclusion ranges for indels and short simple repeats

Short repeats (such as low-complexity sequences or simple tandem repeats) and short indels should not be a problem for calling microhaplotypes so long as they don’t occur directly adjacent to a target SNP. My initial thought is that the required distance on either side of a target SNP should be between 4 and 8 bp, probably 5 bp. In the analysis performed in #147, approximately 21% of the excluded markers were discarded because of simple repeats or low-complexity sequences, so handling these cases with more nuance should have a fairly significant impact. Filtering based on indels has not yet been incorporated into the MicroHapDB build process, but an auxiliary table flagging microhaps with problematic indels could be added, similar to the way the repeat flagging table was added in #147.
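As a rough sketch of how such an exclusion range could be checked, the snippet below flags any short repeat or indel that comes within a minimum distance of a target SNP. The function name `features_near_snps`, the 0-based half-open interval convention, and the interpretation of "distance" as the number of bases strictly between the SNP and the feature are all assumptions made for illustration.

```python
def features_near_snps(snp_positions, features, min_distance=5):
    """Return (snp, feature) pairs where the feature overlaps the SNP or leaves
    fewer than min_distance bases between itself and the SNP.

    snp_positions: 0-based SNP coordinates
    features: (start, end) intervals, 0-based and half-open (assumed convention)
    """
    flagged = []
    for snp in snp_positions:
        for start, end in features:
            if snp < start:
                gap = start - snp - 1  # bases between the SNP and the feature start
            elif snp >= end:
                gap = snp - end  # bases between the feature end and the SNP
            else:
                gap = -1  # the SNP falls inside the feature
            if gap < min_distance:
                flagged.append((snp, (start, end)))
    return flagged
```

A microhap would then be flagged in the auxiliary table only if this function returns at least one pair for its target SNPs.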

Filtering of SINE/LINE/LTR elements

SINE, LINE, and LTR elements account for approximately 71% of the excluded markers in #147. Handling these markers with more nuance will have a huge impact, at least on probe design for HCE. As mentioned above, repetitive sequence itself isn’t necessarily an issue if the target has sufficiently specific primer binding sites and if the repeats aren’t directly adjacent to SNP targets.

RepeatMasker assigns a score to each annotated repeat element. High-scoring elements occur in many identical or near-identical copies throughout the genome. Low-scoring elements tend to be shorter fragments that resemble their paralogs but also exhibit nucleotide-level divergence, resulting in unique sequence.

For each of these repeat classes, it makes sense to determine a score threshold below which the extent of repetitiveness is unlikely to cause problems, and then exclude elements scoring below that threshold from the analysis.
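A minimal sketch of per-class score filtering might look like the following. The threshold values are placeholders only (suitable values have yet to be determined), and the record layout (dictionaries with `class` and `score` keys, where `class` holds a class/family label such as `SINE/Alu`) is an assumption about how the RepeatMasker annotations would be loaded.

```python
# Placeholder thresholds for illustration; real values would need to be
# determined empirically for each repeat class.
PLACEHOLDER_THRESHOLDS = {"SINE": 1000, "LINE": 1000, "LTR": 1000}


def split_by_score(repeats, thresholds):
    """Partition repeat annotations into those retained for flagging and those
    ignored because they score below the threshold for their class.

    Repeats whose class has no threshold are always retained.
    """
    retained, ignored = [], []
    for repeat in repeats:
        top_class = repeat["class"].split("/")[0]  # "SINE/Alu" -> "SINE"
        threshold = thresholds.get(top_class)
        if threshold is not None and repeat["score"] < threshold:
            ignored.append(repeat)
        else:
            retained.append(repeat)
    return retained, ignored
```

Only the retained elements would then feed into the overlap and adjacency checks for each microhap.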
