Excluding microhap markers from a candidate panel design based on a crude determination of repetitive content has proven ineffective. The flagging method implemented in #147 relied on a simple overlap analysis that took into account neither each repetitive element’s class nor its placement relative to the relevant genomic intervals. Not all repetitive elements are created equal, and where repeats fall relative to the body of a microhap makes a difference. In this thread, I’ll describe ideas I have for a more fine-grained and nuanced integration of microhap, repeat, and other variant data.
Specificity of flanking regions for primer design
Many of my initial observations were based on samples sequenced following hybridization capture enrichment (HCE) for microhap targets. While I still have an interest in at least one more probe set redesign for testing purposes, in general microhap sequencing will primarily be implemented with targeted amplicon sequencing assays. Consequently, repetitive sequence content is a concern in the long run only to the extent that it impacts the specificity of primer binding. Repetitive sequence within or around a microhap shouldn’t be problematic if sufficiently unique flanking primer sites can be identified.
Thus, the highest priority for multiplex PCR panel design is measuring the specificity of each marker’s flanking regions. This could be done with BLAST or Primer-BLAST, and it would probably be good to look at the results of such an analysis in a handful of representative cases. But I think a much more direct and efficient way to measure specificity is through k-mer abundance. In broad strokes, we could count the abundance of each k-mer in the GRCh38 reference genome, then go back and look at the flanking regions around each marker (extending symmetrically to a maximum length L), determine the abundance of each k-mer in the flanking regions, and use this information to look for a unique pair of binding sites.
I had initially thought about using a k-mer size of 21 due to habit, but Rob Lagace suggested that k=11 might work better for this purpose. It’s probably good to examine a range of k values (11, 13, 15) on markers with proven performance in published assays to see what the k-mer abundance distributions look like.
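The k-mer abundance idea above could be sketched roughly as follows. This is a minimal pure-Python illustration, not a proposal for the actual implementation: at GRCh38 scale a dedicated k-mer counter would be needed. The function names, the use of canonical (strand-independent) k-mers, and the `max_abundance` cutoff are all assumptions made for illustration.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    revcomp = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, revcomp)


def count_kmers(seq, k):
    """Count canonical k-mers in a sequence, skipping k-mers with non-ACGT bases."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    return counts


def kmer_abundances(region, counts, k):
    """Look up the genome-wide abundance of each k-mer along a flanking region.

    k-mers containing non-ACGT bases are never counted, so they report 0 here.
    """
    region = region.upper()
    return [
        counts.get(canonical(region[i:i + k]), 0)
        for i in range(len(region) - k + 1)
    ]


def candidate_sites(region, counts, k, site_len, max_abundance=1):
    """Return start offsets of windows of length site_len in which every k-mer
    has abundance <= max_abundance (an assumed definition of "unique")."""
    abundances = kmer_abundances(region, counts, k)
    kmers_per_site = site_len - k + 1
    return [
        start
        for start in range(len(abundances) - kmers_per_site + 1)
        if max(abundances[start:start + kmers_per_site]) <= max_abundance
    ]


# Toy example: a 9 bp "genome", k=3, and 4 bp candidate binding sites
genome = "ACGACGTTT"
counts = count_kmers(genome, 3)
sites = candidate_sites(genome, counts, k=3, site_len=4)
```

Running the same functions with k = 11, 13, and 15 on the flanks of markers with proven assay performance would yield the abundance distributions mentioned above.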
Exclusion ranges for indels and short simple repeats
Short repeats (such as low-complexity sequences or simple tandem repeats) and short indels should not be a problem for calling microhaplotypes so long as they don’t occur directly adjacent to a target SNP. My initial thought is that the required distance on either side of a target SNP should be somewhere between 4 and 8 bp, probably 5 bp. In the analysis performed in #147, approximately 21% of the excluded markers were discarded because of simple repeats or low-complexity sequences, so handling these cases with more nuance should have a fairly significant impact. Filtering based on indels has not yet been incorporated into the MicroHapDB build process, but an auxiliary table flagging microhaps with problematic indels could be added, similar to the way the repeat flagging table was added in #147.
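To make the exclusion range concrete, here is a small sketch of the proximity check. It assumes 0-based SNP positions and half-open `(start, end)` intervals for the short repeats or indels, and it flags a SNP when it lies inside a feature or within `min_distance` bases of either edge. The function name and the coordinate conventions are illustrative assumptions, not part of MicroHapDB.

```python
def features_near_snps(snp_positions, features, min_distance=5):
    """Report (snp, feature) pairs where a feature crowds a target SNP.

    snp_positions: iterable of 0-based SNP coordinates on one chromosome
    features: iterable of (start, end) half-open intervals (short repeats, indels)
    min_distance: required clearance in bp on either side of the SNP (assumed 5)
    """
    hits = []
    for pos in snp_positions:
        for start, end in features:
            # Pad the feature by min_distance on both sides and test containment
            if start - min_distance <= pos < end + min_distance:
                hits.append((pos, (start, end)))
    return hits


# Toy example: one simple repeat at [10, 15) and five candidate SNP positions
flagged = features_near_snps([4, 5, 12, 19, 20], [(10, 15)], min_distance=5)
```

In this toy case the SNPs at 5, 12, and 19 are flagged, while those at 4 and 20 are far enough away. A marker would only need to be flagged when at least one of its target SNPs produces a hit, rather than whenever a short repeat overlaps the marker anywhere.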
Filtering of SINE/LINE/LTR elements
SINE, LINE, and LTR elements account for approximately 71% of the excluded markers in #147. Handling these markers with more nuance will have a huge impact, at least on probe design for HCE. As mentioned above, repetitive sequence itself isn’t necessarily an issue if the target has sufficiently specific primer binding sites and if the repeats aren’t directly adjacent to SNP targets.
RepeatMasker assigns a score to each annotated repeat element. High-scoring elements occur in many identical or near-identical copies throughout the genome. Low-scoring elements tend to be shorter fragments: they bear resemblance to their paralogs but also exhibit nucleotide-level divergence, resulting in unique sequence.
For each of these repeat classes, it makes sense to determine a score threshold below which the extent of repetitiveness is unlikely to cause problems, and then exclude repeat elements scoring below that threshold from the analysis.
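A per-class score filter along these lines might look like the sketch below. The `Repeat` record, the function name, and especially the threshold values are hypothetical placeholders: appropriate cutoffs have not been determined yet. The sketch assumes class labels of the form `SINE/Alu`, where the part before the slash is the repeat class.

```python
from typing import NamedTuple


class Repeat(NamedTuple):
    """Hypothetical minimal record for one annotated repeat element."""
    chrom: str
    start: int
    end: int
    rep_class: str  # e.g. "SINE/Alu", "LINE/L1", "LTR/ERVL"
    score: int      # score assigned by RepeatMasker


# PLACEHOLDER values only; real thresholds would need to be determined empirically
PLACEHOLDER_THRESHOLDS = {"SINE": 1000, "LINE": 1000, "LTR": 1000}


def repeats_of_concern(repeats, thresholds):
    """Keep repeats scoring at or above the threshold for their class.

    Repeats whose class has no configured threshold are always kept, so other
    classes (e.g. simple repeats) pass through for separate handling.
    """
    kept = []
    for rep in repeats:
        top_class = rep.rep_class.split("/")[0]
        cutoff = thresholds.get(top_class)
        if cutoff is None or rep.score >= cutoff:
            kept.append(rep)
    return kept


# Toy example with made-up coordinates and scores
repeats = [
    Repeat("chr1", 100, 400, "SINE/Alu", 2300),
    Repeat("chr1", 900, 960, "LINE/L1", 310),
    Repeat("chr1", 1500, 1530, "Simple_repeat", 25),
]
concerning = repeats_of_concern(repeats, PLACEHOLDER_THRESHOLDS)
```

Only the elements that survive this filter would then go into the overlap or proximity analysis against each microhap.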