mood-score threshold choice for a given p-value and PWM is different across chromosomes #44

Open
osyafinkelberg opened this issue Jan 8, 2023 · 4 comments

Comments

@osyafinkelberg

Hello!

I am scanning the human genome for CTCF motifs using a PWM-format matrix:

moods-dna.py --sep ";" -s hg38.fna --p-value 0.0001 -S MA0139.1_pwm -o ctcf_scan

The file hg38.fna contains the sequences of all chromosomes, each starting with a >chr...\n header line. I plotted the mood-score distribution for each chromosome separately and found that the mood-score threshold differs between chromosomes. For example, for chr1 it is -13.209, while for chr17 it is -12.677.

As far as I understand the threshold choice procedure from the MOODS wiki page, this threshold should depend only on the given PWM and p-value. How can the mood-score thresholds be different for each chromosome?

Thank you very much!

@jhkorhonen
Owner

The log-odds and p-value computation also depends on the background distribution, which intuitively describes what the sequence looks like when it carries no coding information. See the corresponding wiki page for more information on this.
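To make the dependence on the background concrete, here is a toy sketch. The 2-column matrix and both background distributions are made up for illustration (this is not MA0139.1), and the brute-force enumeration is only a stand-in for whatever MOODS does internally. It picks the lowest score T such that P(score >= T) <= p under an i.i.d. background.

```python
from itertools import product

# Hypothetical 2-column score matrix; rows are A, C, G, T.
PWM = [
    [2.0, -1.0],   # A
    [-1.0, 2.0],   # C
    [1.0, 0.0],    # G
    [-2.0, -2.0],  # T
]

def threshold_from_p(pwm, bg, p):
    """Lowest score T with P(score >= T) <= p under an i.i.d. background bg."""
    width = len(pwm[0])
    dist = {}  # score -> total probability of sequences with that score
    for seq in product(range(4), repeat=width):
        score = round(sum(pwm[base][pos] for pos, base in enumerate(seq)), 9)
        prob = 1.0
        for base in seq:
            prob *= bg[base]
        dist[score] = dist.get(score, 0.0) + prob
    threshold = float("inf")  # returned if even the top score is too likely
    tail = 0.0
    for score in sorted(dist, reverse=True):
        tail += dist[score]
        if tail > p:
            break
        threshold = score
    return threshold

uniform = [0.25, 0.25, 0.25, 0.25]
at_rich = [0.4, 0.1, 0.1, 0.4]  # assumed AT-rich composition
print(threshold_from_p(PWM, uniform, 0.15))  # prints 3.0
print(threshold_from_p(PWM, at_rich, 0.15))  # prints 2.0
```

The same matrix and the same p-value give two different thresholds, purely because the assumed symbol frequencies differ.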

By default, MOODS just looks at the current input sequence at hand and estimates this distribution from the frequencies of the different symbols in that input. If you want to use a consistent background for all sequences, you can use the parameters

--batch --bg pA pC pG pT --lo-bg pA pC pG pT

where pA pC pG pT is the background distribution you want to use (and --batch just optimises the process somewhat). You can in principle estimate this distribution by computing the nucleotide frequencies from the whole human genome, but you may want to consult an actual biologist on what the correct assumption is here.
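One possible way to estimate such a background is to count nucleotides over the whole FASTA file. The sketch below is only an illustration: the helper names `background_from_fasta` and `bg_arguments` are made up, and the choices to skip N and other ambiguity codes and to count soft-masked (lower-case) bases are assumptions, not something MOODS prescribes.

```python
from collections import Counter

def background_from_fasta(lines):
    """Return [pA, pC, pG, pT] estimated from an iterable of FASTA lines."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):  # header line such as ">chr1"
            continue
        counts.update(line.strip().upper())
    # Assumption: only A/C/G/T are counted; N and other codes are ignored.
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        raise ValueError("no A/C/G/T symbols found")
    return [counts[base] / total for base in "ACGT"]

def bg_arguments(bg):
    """Format a background as the --batch --bg arguments quoted above."""
    return ["--batch", "--bg"] + [f"{p:.4f}" for p in bg]

# Tiny in-memory example instead of a real genome file.
example = [">chr_demo", "ACGTNNAC", "ggta"]
bg = background_from_fasta(example)
print(" ".join(bg_arguments(bg)))  # prints --batch --bg 0.3000 0.2000 0.3000 0.2000
```

For a real run, the same function can be fed an open file handle, for example `with open("hg38.fna") as fh: bg = background_from_fasta(fh)`, since it reads line by line and never holds the genome in memory.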

@osyafinkelberg
Author

Thank you so much for your answer!

So is the --lo-bg flag used both for PWM construction and, later and "independently", for choosing the threshold T?

Did I understand correctly that if I provide an already computed PWM and the --lo-bg parameter, the latter will influence only the choice of the threshold T but not the PWM (that is, all individual scores will be computed using the provided PWM without normalization by the --lo-bg frequencies)?

@jhkorhonen
Owner

With a pre-computed PWM passed via -S, the --lo-bg parameter doesn't do anything, as it is used only for the log-odds conversion, if that is done at all.
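For context, this is roughly what a log-odds conversion with a background looks like. The sketch uses a common pseudocount formulation; the function name `log_odds`, the default pseudocount, and the natural-log base are assumptions for illustration, and the exact normalization in MOODS may differ.

```python
import math

def log_odds(counts, lo_bg, pseudocount=1.0):
    """Convert a 4 x width count matrix (rows A, C, G, T) to log-odds scores."""
    width = len(counts[0])
    scores = [[0.0] * width for _ in range(4)]
    for pos in range(width):
        column_total = sum(counts[base][pos] for base in range(4))
        for base in range(4):
            # Smoothed frequency of this base in this column.
            freq = (counts[base][pos] + pseudocount * lo_bg[base]) / (
                column_total + pseudocount
            )
            # Score is the log ratio against the background frequency.
            scores[base][pos] = math.log(freq / lo_bg[base])
    return scores

# Hypothetical counts: column 0 is all A, column 1 is perfectly uniform.
counts = [[10, 25], [0, 25], [0, 25], [0, 25]]
uniform_scores = log_odds(counts, [0.25, 0.25, 0.25, 0.25])
at_rich_scores = log_odds(counts, [0.4, 0.1, 0.1, 0.4])
print(uniform_scores[0][0], at_rich_scores[0][0])
```

The background enters only through this conversion step, so a matrix that already contains scores is unaffected by it.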

If I understood your use case correctly, you want to set --batch --bg. Then the input matrices will not be converted, and the threshold is computed from the given p-value once, using the given background distribution, and that threshold is used for all sequences.

Now that I look at this, there is some illogical and poorly documented behaviour regarding this in the moods_dna.py script. Marking this for improvement.

@osyafinkelberg
Author

Thank you very much!
