Wrong identification of Python code #284
Comments
And a slightly different case: wrongly identified JavaScript. Interestingly, the mime type was originally
A new example: Python code identified as text, which is interesting as it's just a revshell: setup.py(4).zip
Out of the 12 samples you uploaded, these two PRs are going to fix 8 of them. I'll give another try at tweaking the executor for the kopia one, but I want to make sure we don't reduce the confidence in the executor too much, as it could cause false positives. It currently checks for a mandatory use of base64, like DataDog was doing. Since we are using the default score of 0 for that yara rule, we will prefer yara rules with a higher score if there are conflicting results, so false positives are less of a worry.

That brings us to the ps1/batch misidentification. As you saw, there are either full scripts of that other language sprinkled in the Python file, or very specific language function names. Since those yara identification rules have a higher score, they are preferred over the Python one. We could bump the Python score higher, but we'd need to make sure not to cause any false positives on batch/PowerShell scripts.
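To illustrate the score-based preference described above, here is a minimal sketch. The `RuleHit` structure, the `pick_identification` helper, and the scores are hypothetical and only mirror the behaviour described in the comment (the highest-scoring rule wins); they are not the actual Identify code.

```python
from dataclasses import dataclass


@dataclass
class RuleHit:
    rule: str       # YARA rule name, e.g. "code_python"
    file_type: str  # file type the rule maps to
    score: int      # rule score; 0 stands for the default score


def pick_identification(hits):
    """Return the file type of the highest-scoring hit, or None without hits."""
    if not hits:
        return None
    return max(hits, key=lambda h: h.score).file_type


# Illustrative values only: a default-score Python rule loses to a
# higher-scored PowerShell rule when both match the same file.
hits = [
    RuleHit("code_python", "code/python", 0),
    RuleHit("code_ps1", "code/ps1", 1),
]
print(pick_identification(hits))
```

This is why a Python script that embeds PowerShell snippets ends up as code/ps1: the lower-scored rule never gets a say once a higher-scored rule matches.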
Thanks for taking a look at this. Yeah, I expect those to be difficult to fix. Currently, I can only imagine having very specific rules (probably not worth it), or designing a way to handle multiple possible identifications and/or passing such cases through some AI-based identification, which could be more capable of handling mixed code.
Alternatively, it could be a score bump for "what exists first", e.g. if the Python characteristic was found earlier in the file than the ps1/batch one AND was not after some non-Python indicators. I mean, if we have a strong Python indicator at the top of the file, and it's not inside a PowerShell variable assignment, it's reasonable to assume the file has to be processed by Python. But recognizing whether it was or wasn't inside a string (or something like it) sounds too complex for YARA.
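A rough sketch of the "what exists first" idea, assuming the offsets of each rule's strong strings are already available. The `first_strong_rule` helper and the offsets are hypothetical, and the sketch deliberately ignores the hard part mentioned above (telling whether a match sits inside a string of the other language).

```python
def first_strong_rule(strong_offsets):
    """Return the rule whose strong string appears earliest in the file.

    strong_offsets maps a rule name to the byte offsets of its strong matches.
    Rules without any strong match are ignored.
    """
    earliest = {rule: min(offs) for rule, offs in strong_offsets.items() if offs}
    if not earliest:
        return None
    return min(earliest, key=earliest.get)


# Illustrative offsets: Python indicator at the top, PowerShell later on.
print(first_strong_rule({"code_python": [12, 900], "code_ps1": [340]}))
```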
I modified the first executor to allow for different encodings. It will now check for a zlib or lzma or base64 payload inside an execution, instead of a base64 payload inside an optional zlib or lzma step inside an execution.

Regarding the mixed code scripts, I agree that having the list of locations where the yara strings were found (and hopefully their name) could be of great help to deconflict the two languages. I'll try to look into it, but I think that's already a great start. 🙂
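As a rough illustration of the relaxed executor check, the Python regex below flags an `exec(` call whose first argument is a zlib, lzma or base64 decoding call. The pattern and the `looks_like_executor` helper are assumptions made for illustration; the real check lives in a YARA rule whose exact strings are not shown in this thread.

```python
import re

# Hypothetical pattern: exec( directly wrapping one of the three decoders.
EXECUTOR = re.compile(
    rb"\bexec\s*\(\s*(?:zlib\.decompress|lzma\.decompress|base64\.b64decode)\s*\("
)


def looks_like_executor(data: bytes) -> bool:
    """True if the data contains an exec() of a decoded or decompressed payload."""
    return EXECUTOR.search(data) is not None


print(looks_like_executor(b"exec(lzma.decompress(base64.b64decode(blob)))"))
print(looks_like_executor(b"print(base64.b64decode(blob))"))
```

With this shape, `exec(lzma.decompress(` matches without requiring a base64 step, which is the difference described above.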
By adding the following code in yara_ident(), we can see that we have a lot of information coming back in the matches:

    for m in matches:
        print(m.rule)
        for s in m.strings:
            print(f"\t{s.identifier}")
            for i in s.instances:
                print(f"\t\t{i.matched_data} @{i.offset}")

For the three currently misidentified files, we have:
4a2353d4be195e06172985931534fe14dbe5746c452c7a27c1d4a5d51d516eb6
init.ps1.py
In all three cases, we can determine that all strong* strings of the code_ps1 rule are found between strong* strings of code_python. It would be easy to make code_python take precedence over code_ps1 in those cases. The one currently identified as code/batch is trickier, as it doesn't use strong* strings.

Right now, we do not have a real convention for the Identify yara rules. If we categorize the strings as strong*, weak* and others, maybe we could only use strong* to determine surrounding-ness of non-weak* strings of other rules.

We already have the concept of rule score, and the higher the score, the higher the priority of the rule. Overriding that after the fact for certain string names may be puzzling/misleading, and it could remove the possibility for an admin to fix their Identify locally. Maybe we should consider the difference between the two rules' scores, or never override a rule that has a score higher than a certain number.

Regarding the strong_py4, I wonder how strong it is... It's a simple regex that looks for

And even if this is all super fancy and interesting (😃), we could simply bump the score of the code_python rule. I am not certain how many false positives it could cause on the other languages. It may be wiser to change that strong_py4 if we do: I just found out that it became a strong indicator through refactoring, but it was originally a weak one. I'll try to find a lot of various scripts to check if the score bump would change the identification of true VBS/Batch/ps1/... scripts.
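The surrounding-ness idea, combined with the score-gap guard, could be sketched roughly as follows. The function names, the offsets, and the `max_score_gap` threshold are hypothetical; this is one possible reading of the proposal, not an agreed design.

```python
def is_surrounded(inner_offsets, outer_offsets):
    """True if every inner offset lies strictly between the first and last outer offsets."""
    if not inner_offsets or len(outer_offsets) < 2:
        return False
    lo, hi = min(outer_offsets), max(outer_offsets)
    return all(lo < off < hi for off in inner_offsets)


def python_takes_precedence(py_strong, ps1_strong, py_score, ps1_score, max_score_gap=2):
    """Let code_python override code_ps1 only when the score gap is small
    and all ps1 strong strings sit between Python strong strings."""
    if ps1_score - py_score > max_score_gap:
        # Never override a rule that outranks the Python one by too much.
        return False
    return is_surrounded(ps1_strong, py_strong)


# Illustrative offsets: PowerShell strings embedded inside a Python script.
print(python_takes_precedence([10, 5000], [200, 340], py_score=1, ps1_score=2))
```

Keeping the gap check first means a locally raised score on another rule is still respected, which addresses the concern about admins fixing their Identify rules.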
I love the idea of surrounding-ness :D Re: strong_py4, I agree that it doesn't look strong at all, and it may really cause multiple texts to be identified as Python.
But if we have any strong identifiers, maybe we could say that having weak strings at the end would be enough, as long as they are not present for both languages?
I feel that we may need some kind of identification log explaining the reason for the given file type. It could include the debug output you added and explain what led to the decision. Maybe we could attach it to the results in the profiled submission? Then administrators could more easily understand what's going on.
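A minimal sketch of what such an identification log could look like, assuming the rule names, scores, and matched string offsets are collected during identification. The `build_ident_log` helper, its input shape, and the sample identifiers are hypothetical.

```python
def build_ident_log(decision, candidates):
    """Build a human-readable log of an identification decision.

    candidates is a list of (rule, score, [(identifier, offset), ...]) tuples.
    Rules are listed from highest to lowest score.
    """
    lines = [f"identified as {decision}"]
    for rule, score, strings in sorted(candidates, key=lambda c: c[1], reverse=True):
        lines.append(f"  rule {rule} (score {score})")
        for identifier, offset in strings:
            lines.append(f"    {identifier} @ {offset}")
    return "\n".join(lines)


# Illustrative data only.
log = build_ident_log(
    "code/ps1",
    [
        ("code_python", 1, [("$strong_py1", 0)]),
        ("code_ps1", 2, [("$strong_ps1", 40)]),
    ],
)
print(log)
```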
BTW, the
A few more samples :) Here is one that is obfuscated: settings.py.zip
And a new one :D
A new sample where I'm quite surprised it wasn't identified. Am I right in thinking that the yara identification has been tricked by unusual spaces, like dabafa823ad6a838790dc0e5d9d6c190f0400406e9f32bc9297252801d305a99.zip
Yes. Why is that even allowed? 😢 I'm trying to re-evaluate Magika, to see whether it could be fast enough to run as a fallback for files that end up
Only half of the new samples would be correctly identified with high confidence. I'll try to find a way to get more than one guess from the model in low-confidence cases, as I'm curious to see what confidence Python gets for post_install.py.
Describe the bug
As usual, there are a couple of Python code files that were not identified correctly :)
cc: @gdesmar
Password for all files: zippy. As usual, they can contain dangerous code.

The following files were identified as text/plain:
- import urllib.request, subprocess.run, urllib.request.urlretrieve
- exec(lzma.decompress(
- import urllib.parse, import aiohttp
- subprocess.Popen
- exec(lzma.decompress(base64.b64decode

The following files were identified as code/ps1. Those are more complicated, as they do contain PowerShell commands, but they are Python scripts:

The same story, but with code/batch identification:

To Reproduce
Steps to reproduce the behavior:

Expected behavior
Identification as code/python.

Screenshots
If applicable, add screenshots to help explain your problem.
Environment (please complete the following information if pertinent):
Additional context
Add any other context about the problem here.