Brief theoretical introduction
Position weight matrices (PWMs), also known as position-specific scoring matrices (PSSMs) or weighted patterns, are a simple yet powerful model for sequence signals used in bioinformatics. For example, they can be used to model transcription factor binding sites in DNA or other binding sites.
PWMs are usually obtained from empirically observed instances of binding sites by counting the occurrences of each symbol at each position; the result is represented as a count or frequency matrix. For example, a count matrix could look like this (example from the JASPAR database), where the four rows give the counts or frequencies for nucleotides A, C, G and T, respectively:
10.00 12.00 4.00 1.00 2.00 2.00 0.00 0.00 0.00 8.00 13.00
2.00 2.00 7.00 1.00 0.00 8.00 0.00 0.00 1.00 2.00 2.00
3.00 1.00 1.00 0.00 23.00 0.00 26.00 26.00 0.00 0.00 4.00
11.00 11.00 14.00 24.00 1.00 16.00 0.00 0.00 25.00 16.00 7.00
MOODS can read matrix files in JASPAR count format (.pfm) by default.
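As an illustration of the format, the following minimal sketch parses a plain four-row count matrix such as the one above into a list of rows. It is not MOODS's own parser; the function name `parse_pfm` and the handling of optional `>` header lines are assumptions made for this example.

```python
def parse_pfm(text):
    """Parse a four-row count matrix (rows A, C, G, T) into a list of float rows.

    Hypothetical helper for illustration; blank lines and lines starting
    with '>' (assumed to be headers) are skipped.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        rows.append([float(x) for x in line.split()])
    if len(rows) != 4:
        raise ValueError("expected 4 rows (A, C, G, T), got %d" % len(rows))
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows must have the same number of columns")
    return rows


EXAMPLE_PFM = """\
10.00 12.00 4.00 1.00 2.00 2.00 0.00 0.00 0.00 8.00 13.00
2.00 2.00 7.00 1.00 0.00 8.00 0.00 0.00 1.00 2.00 2.00
3.00 1.00 1.00 0.00 23.00 0.00 26.00 26.00 0.00 0.00 4.00
11.00 11.00 14.00 24.00 1.00 16.00 0.00 0.00 25.00 16.00 7.00
"""

counts = parse_pfm(EXAMPLE_PFM)
# Each column sums to the number of observed binding sites.
column_totals = [sum(col) for col in zip(*counts)]
print(len(counts), len(counts[0]))  # 4 rows, 11 columns
print(column_totals)                # every column of this example sums to 26.0
```

Every column of the example sums to 26, which is consistent with the matrix having been built from 26 aligned binding site instances.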
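By default, MOODS converts count/frequency matrices to PWMs using log-likelihood scoring that defines a score against a DNA sequence; this is then used to identify putative sites matching the motif described by the matrix. Specifically, given a count matrix $C$ with $m$ columns and entries $c_{a,i}$ for nucleotide $a$ at position $i$, the counts are first converted to smoothed frequencies
$$f_{a,i} = \frac{c_{a,i} + p\,b_a}{\sum_{a'} c_{a',i} + p},$$
where $b_a$ is the background probability of nucleotide $a$ and $p$ is the pseudocount (both can be specified as parameters of moods_dna.py).
The frequencies are then used to compute the PWM $W$ with entries $w_{a,i} = \log(f_{a,i} / b_a)$.
For any sequence $u = u_1 u_2 \dots u_m$ of length $m$, the score is $W(u) = \sum_{i=1}^{m} w_{u_i,i}$.
Intuitively, this score compares the probability that the model specified by the original frequency matrix generated the sequence $u$ with the probability that the background model generated it.
Given a PWM $W$ of length $m$, a sequence $s = s_1 s_2 \dots s_n$ and a threshold $t$, the PWM matching problem is to find all positions $i$ such that $W(s_i \dots s_{i+m-1}) \geq t$.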
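The conversion and scoring steps can be sketched in a few lines of plain Python. This is a minimal illustration of log-odds scoring with additive smoothing, not the MOODS implementation: the function names, the uniform background, the pseudocount of 1 and the natural logarithm are all assumptions chosen for the example.

```python
import math

ALPHABET = "ACGT"

# Count matrix from the example above; rows are A, C, G, T.
COUNTS = [
    [10, 12, 4, 1, 2, 2, 0, 0, 0, 8, 13],
    [2, 2, 7, 1, 0, 8, 0, 0, 1, 2, 2],
    [3, 1, 1, 0, 23, 0, 26, 26, 0, 0, 4],
    [11, 11, 14, 24, 1, 16, 0, 0, 25, 16, 7],
]


def log_odds(counts, background=(0.25, 0.25, 0.25, 0.25), pseudocount=1.0):
    """Convert counts to a log-odds PWM (hypothetical helper).

    Assumes the background sums to 1 and the pseudocount is positive,
    so no smoothed frequency is ever zero.
    """
    m = len(counts[0])
    pwm = [[0.0] * m for _ in counts]
    for i in range(m):
        total = sum(counts[a][i] for a in range(4)) + pseudocount
        for a in range(4):
            freq = (counts[a][i] + pseudocount * background[a]) / total
            pwm[a][i] = math.log(freq / background[a])
    return pwm


def score(pwm, seq):
    """Sum the PWM entries selected by each symbol of seq."""
    if len(seq) != len(pwm[0]):
        raise ValueError("sequence length must equal the matrix length")
    return sum(pwm[ALPHABET.index(c)][i] for i, c in enumerate(seq))


pwm = log_odds(COUNTS)
# Consensus: the most frequent nucleotide in each column.
consensus = "".join(
    ALPHABET[max(range(4), key=lambda a: COUNTS[a][i])]
    for i in range(len(COUNTS[0]))
)
print(consensus, score(pwm, consensus))
print("A" * 11, score(pwm, "A" * 11))
```

With a uniform background the consensus sequence gets the highest possible score, and a positive score means the sequence is more likely under the motif model than under the background.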
The idea is that these matches should correspond e.g. to putative transcription factor binding sites for the factor described by the original count matrix.
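To make the matching step concrete, here is a naive sliding-window scan that scores every window of a sequence and reports those reaching a threshold. It is only a sketch: the toy matrix, the function name `find_matches`, the 0-based positions and the choice to skip windows containing symbols other than A, C, G, T are assumptions for this example, and only the forward strand is scanned. The articles listed at the end describe much faster algorithms than this direct approach.

```python
ALPHABET = "ACGT"

# Toy PWM of length 3 that favours the word "GAT" (rows A, C, G, T).
TOY_PWM = [
    [-1.0, 2.0, -1.0],
    [-1.0, -1.0, -1.0],
    [2.0, -1.0, -1.0],
    [-1.0, -1.0, 2.0],
]


def find_matches(pwm, seq, threshold):
    """Return (position, score) for every window scoring >= threshold.

    Hypothetical helper; positions are 0-based and the running time is
    proportional to len(seq) * matrix length.
    """
    m = len(pwm[0])
    hits = []
    for start in range(len(seq) - m + 1):
        window = seq[start:start + m]
        if any(c not in ALPHABET for c in window):
            continue  # skip windows with ambiguous symbols such as N
        s = sum(pwm[ALPHABET.index(c)][i] for i, c in enumerate(window))
        if s >= threshold:
            hits.append((start, s))
    return hits


hits = find_matches(TOY_PWM, "CCGATTGACGATN", 4.0)
print(hits)  # -> [(2, 6.0), (9, 6.0)]
```

Lowering the threshold to 3.0 additionally reports the window `GAC` at position 6, which shows how the threshold trades the number of reported sites against their quality.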
The exact choice of the threshold $t$ depends on the application.
For more discussion and background, see e.g.
- https://en.wikipedia.org/wiki/Position_weight_matrix
- https://en.wikipedia.org/wiki/Additive_smoothing
For more technical details, see the following articles.
- C. Pizzi, P. Rastas and E. Ukkonen: Finding Significant Matches of Position Weight Matrices in Linear Time. IEEE/ACM Transactions on Computational Biology and Bioinformatics. (2011)
- J.H. Korhonen, K. Palin, J. Taipale and E. Ukkonen: Fast motif matching revisited: high-order PWMs, SNPs and indels. Bioinformatics. (2016)