KeyError: 'R' #12

Open
ppgardne opened this issue Nov 26, 2021 · 2 comments
Comments

@ppgardne

Hi Nicola,

I've just been trying to run your handy phastSim method on several different sequences. It seems to fail on a few -- for example on the genome AE001825:

phastSim --outpath ./ --outputFile AE001825.1.phastsim-001.temp --reference AE001825.1.fasta --treeFile AE001825.1.phastsim-001.phastsimtree --createFasta --hyperMutProbs 0.01 0.04 --hyperMutRates 100 10 --indels --insertionRate GAMMA 0.1 1.0 --deletionRate CONSTANT 0.1 --insertionLength GEOMETRIC 0.9 --deletionLength NEGBINOMIAL 2 0.95

With a simple tree file (AE001825.1.phastsim-001.phastsimtree):
((phastsim0:0.1,phastsim1:0.1,phastsim2:0.1,phastsim3:0.1,phastsim4:0.1,phastsim5:0.1,phastsim6:0.1,phastsim7:0.1,phastsim8:0.1,phastsim10:0.1):0.001);

The AE001825 sequence does have some non-ACGT characters (R and K), which I presume is the issue for phastSim? Should I randomly select A/G for R and G/T for K?

Best wishes,
Paul.

@NicolaDM
Owner

Hi Paul,

Indeed, the non-ACGT characters in the reference are the issue. I will think about what the best default behaviour for phastSim should be in this case; we should definitely give a clearer error message.
The reason I am hesitant to let phastSim automatically pick a nucleotide at random is that it would make the interpretation of the output more complicated and inconsistent (the concise output formats need to be interpreted in terms of differences with respect to the reference).
But we could write an additional script that randomly resolves each ambiguous character in the reference and creates a new ambiguity-free reference, if useful.
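
For illustration only, a minimal sketch of what such a pre-processing script might look like. It is not part of phastSim; the function names `resolve_ambiguities` and `resolve_fasta` are hypothetical, and it assumes a plain FASTA file in which every line not starting with `>` is sequence. Each IUPAC ambiguity code is replaced by one of the bases it stands for, chosen uniformly at random, and all other characters are left untouched.

```python
import random

# Standard IUPAC nucleotide ambiguity codes and the bases each can represent.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def resolve_ambiguities(seq, rng):
    """Return seq with each IUPAC ambiguity code replaced by a random compatible base."""
    out = []
    for ch in seq:
        choices = IUPAC.get(ch.upper())
        if choices is None:
            # A, C, G, T and anything unrecognised are copied through unchanged.
            out.append(ch)
        else:
            base = rng.choice(choices)
            # Preserve the case of the original character (e.g. soft-masked regions).
            out.append(base.lower() if ch.islower() else base)
    return "".join(out)


def resolve_fasta(in_path, out_path, seed=None):
    """Write an ambiguity-free copy of the FASTA file at in_path to out_path."""
    rng = random.Random(seed)  # a fixed seed makes the new reference reproducible
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            if line.startswith(">"):
                fout.write(line)  # header lines are copied verbatim
            else:
                fout.write(resolve_ambiguities(line.rstrip("\n"), rng) + "\n")
```

Something like `resolve_fasta("AE001825.1.fasta", "AE001825.1.resolved.fasta", seed=1)` (the output file name is made up here) would produce a reference that could then be passed to `--reference`. Because the substitution happens once, before simulation, the concise outputs remain differences with respect to a single fixed reference.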

@ppgardne
Author

Kia ora Nicola,

Yes -- my workaround would be to randomly sample from the bases allowed by each IUPAC ambiguity character. These characters are rare in most assembled sequences, so this shouldn't throw phastSim's results off by too much (AFAIK).

Best wishes,
P.
