Wrong identification of Python code #284

Open
kam193 opened this issue Nov 8, 2024 · 14 comments

kam193 commented Nov 8, 2024

Describe the bug
As usual, there are a couple of Python code files that were not identified correctly :)

cc: @gdesmar

Password for all files: zippy. As usual, they can contain dangerous code.

The following files were identified as text/plain:

  1. BitForger.py.zip
    • possible characteristics: import urllib.request, subprocess.run, urllib.request.urlretrieve
  2. _deobfuscated_code_FINAL.py (kopia).zip
    • characteristic executor: exec(lzma.decompress(
  3. init.py.zip
    • quite similar to (1), but a little longer
  4. init.py(1).zip
    • import urllib.parse, import aiohttp
  5. init.py (kopia).zip
    • subprocess.Popen
  6. tools.py.zip
    • similar to (2), exec(lzma.decompress(base64.b64decode

The following files were identified as code/ps1. Those are more complicated, as they do contain PowerShell commands, but they are Python scripts:

  1. init.py.zip
  2. uidesign.py.zip

The same story, but with code/batch identification:

  1. 4a2353d4be195e06172985931534fe14dbe5746c452c7a27c1d4a5d51d516eb6.zip

To Reproduce
Steps to reproduce the behavior:

  1. Upload and see the wrong file type

Expected behavior
Identification as code/python.

Environment (please complete the following information if pertinent):

  • Assemblyline Version: 4.5.0.x - those files were collected for some time

@kam193 kam193 added assess We still haven't decided if this will be worked on or not bug Something isn't working labels Nov 8, 2024
@gdesmar gdesmar self-assigned this Nov 8, 2024

kam193 commented Nov 9, 2024

And a slightly different case: wrongly identified JavaScript. Interestingly, the mime type was originally application/javascript, but the final one is unknown. Maybe even for untrusted mimes, it would be useful to fall back to them if there is nothing better?
9bc1e972b4c7e11256f817954f51c07963fd6f17161c0c2ce867f0c3ba4173d1.zip
3adf910188dc6f0df92eca9a835aac4b31e3232b57adf661c8cfef4992812ea2.zip


kam193 commented Nov 14, 2024

A new example - Python code identified as text, interesting as it's just a revshell: setup.py(4).zip


gdesmar commented Nov 18, 2024

Out of the 12 samples you uploaded, these two PRs are going to fix 8 of them.
Still left, if it can be improved, are the two files identified as code/ps1, the one identified as code/batch and _deobfuscated_code_FINAL.py (kopia).zip.

I'll give another try at tweaking the executor for the kopia one, but I want to make sure we don't reduce the confidence in the executor too much, as it could cause false positives. It currently checks for a mandatory use of base64, like DataDog was doing. Since we are using the default score of 0 for that yara rule, we will prefer yara rules with a higher score if there are conflicting results, so false positives are less of a worry.

That brings us to the ps1/batch misidentification. As you saw, there are either full scripts of that other language sprinkled in the python file, or very specific language function names. Since those yara identification rules have a higher score, they are preferred over the python one. We could bump the python score higher, but we'd need to make sure not to cause any false positives on batch/powershell scripts.


kam193 commented Nov 18, 2024

Thanks for taking a look at this - yeah, I expect those to be difficult to fix. Currently, I can only imagine having very specific rules (probably not worth it), or designing a way to handle multiple possible identifications and/or pass such cases through some AI-based identification, which could be more capable of handling mixed content.


kam193 commented Nov 18, 2024

Eventually, it could be a score bump for "what exists first" - e.g. if the Python characteristic was found earlier in the file than ps1/batch AND was not after some non-Python indicators. I mean, if we have a strong Python indicator at the top of the file, and it's not inside a PowerShell variable assignment, it's reasonable to assume it has to be processed by Python. But recognizing whether it was or wasn't inside a string (or something like it) sounds too complex for YARA.


gdesmar commented Nov 18, 2024

I modified the first executor to allow for different encodings. It will now check for a zlib or lzma or base64 payload inside an execution, instead of a base64 payload inside an optional zlib or lzma step inside an execution.
This will identify _deobfuscated_code_FINAL.py (kopia).zip as a python script. I don't think it will cause too many false positives, as the rule has a default score that will lower the chances.
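
As a rough illustration of the relaxed check, here is a Python regex sketch. It is not the actual YARA rule: the pattern and the `looks_like_python_executor` name are assumptions made for this example, and only the `exec(lzma.decompress(` and `exec(lzma.decompress(base64.b64decode(` shapes come from the samples listed in the issue.

```python
import re

# Hypothetical pattern: an exec() whose first argument is a call into
# zlib, lzma or base64. The real Identify rule is written in YARA and may differ.
EXECUTOR_RE = re.compile(rb"\bexec\s*\(\s*(?:zlib|lzma|base64)\s*\.\s*\w+\s*\(")


def looks_like_python_executor(data: bytes) -> bool:
    """Return True if the data contains an exec() of a decoded payload."""
    return EXECUTOR_RE.search(data) is not None


# base64 is no longer mandatory: a bare lzma step is enough.
print(looks_like_python_executor(b"exec(lzma.decompress(payload))"))
print(looks_like_python_executor(b"exec(lzma.decompress(base64.b64decode(payload)))"))
print(looks_like_python_executor(b"print('hello')"))
```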

Regarding the mixed code scripts, I agree that having the list of locations where the yara strings were found (and hopefully their name) could be of great help to deconflict the two languages. I'll try to look into it, but I think that's already a great start. 🙂


gdesmar commented Nov 19, 2024

By adding the following code in yara_ident(), we can see that we have a lot of information coming back in the matches:

# Print each matching rule, its matched string identifiers, and every instance with its offset
for m in matches:
    print(m.rule)
    for s in m.strings:
        print(f"\t{s.identifier}")
        for i in s.instances:
            print(f"\t\t{i.matched_data} @{i.offset}")

For the three currently misidentified files, we have:
uidesign.py

code_ps1
        $strong_pwsh6
                b'Start-Process' @849
        $strong_pwsh100
                b'-Command' @421
                b'-Command' @952
code_python
        $strong_py4
                b'else:' @1171
        $strong_py24
                b'subprocess.run(' @390
                b'subprocess.run(' @921
        $strong_py108
                b'os.getcwd()' @109

4a2353d4be195e06172985931534fe14dbe5746c452c7a27c1d4a5d51d516eb6

code_ps1
        $strong_pwsh6
                b'Start-Process' @2489
        $strong_pwsh38
                b'Invoke-WebRequest' @2320
                b'Invoke-WebRequest' @2706
        $strong_pwsh41
                b'Expand-Archive' @2878
        $strong_pwsh100
                b'-Command' @2307
                b'-Uri' @2338
                b'-OutFile' @2442
                b'-Command' @2693
                b'-Uri' @2724
                b'-OutFile' @2835
code_python
        $strong_py3
                b'\n    def install_chrome_and_driver():' @83
        $strong_py4
                b'else:' @444
        $strong_py23
                b'platform.system()' @143
        $strong_py24
                b'subprocess.run(' @1673
                b'subprocess.run(' @3397
        $strong_py108
                b'os.getcwd()' @1886
                b'os.getcwd()' @3652
code_batch
        $power1
                b'powershell' @2296
                b'powershell' @2682
        $command
                b'-Command ' @2307
                b'-Command ' @2693
        $cmd0
                b'@echo off' @2225

init.ps1.py

code_ps1
        $strong_pwsh6
                b'start-process' @3310
        $strong_pwsh9
                b'get-process' @3238
        $strong_pwsh27
                b'set-location' @3254
        $strong_pwsh34
                b'stop-process' @3337
        $strong_pwsh38
                b'invoke-webrequest' @3354
        $strong_pwsh39
                b'copy-item' @3280
        $strong_pwsh100
                b'-command' @3203
                b'-Command' @8281
                b'-Command' @9991
                b'-Command' @10006
                b'-Command' @10066
code_python
        $strong_py1
                b'\n    if __name__ == "__main__":' @10706
        $strong_py2
                b'\n    from getpass import getpass' @29
                b'\n    from subprocess import run' @62
                b'\n    from sys import argv' @94
                b'\n    from typing import Any' @120
                b'\n    from prompt_toolkit import PromptSession' @158
                b'\n    from prompt_toolkit.formatted_text import HTML' @204
                b'\n    from prompt_toolkit.history import InMemoryHistory' @256
                b'\n    from prompt_toolkit.completion import WordCompleter' @312
        $strong_py4
                b'try:' @0
                b'try:' @4548
                b'try:' @4852
                b'try:' @4870
                b'try:' @6310
                b'else:' @6609
                b'else:' @7110
                b'else:' @8432
                b'try:' @8994
                b'else:' @9443
                b'try:' @10326
                b'try:' @10465
        $strong_py108
                b'os.getcwd()' @4365
                b'os.getcwd()' @9108
                b'os.getcwd()' @9250
        $strong_py150
                b'os.system(' @5220
                b'os.system(' @8854
                b'os.system(' @9567
                b'os.system(' @9655

In all three cases, we can determine that all strong* strings of the code_ps1 rule are found between strong* strings of code_python. It would be easy to make code_python take precedence over code_ps1 in those cases.
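
A minimal sketch of that "between" check, using the strong* offsets printed above for uidesign.py. The `is_surrounded` helper is hypothetical and not part of Identify; it only compares the first and last offsets of each rule.

```python
def is_surrounded(outer: list[int], inner: list[int]) -> bool:
    """True if every inner offset lies strictly between the first and last outer offset."""
    if not outer or not inner:
        return False
    return min(outer) < min(inner) and max(inner) < max(outer)


# strong* offsets reported above for uidesign.py
python_offsets = [1171, 390, 921, 109]
ps1_offsets = [849, 421, 952]

print(is_surrounded(python_offsets, ps1_offsets))  # True

# Without the $strong_py4 match at 1171, python no longer appears after the last ps1 match.
print(is_surrounded([390, 921, 109], ps1_offsets))  # False
```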

The one currently identified as code/batch is trickier, as it doesn't use strong* strings. Right now, we do not have a real convention for the Identify yara rules. If we categorize the strings as strong*, weak* and others, maybe we could only use strong* to determine surrounding-ness of non-weak* strings of other rules. We already have the concept of rule score, and the higher the score, the higher the priority of the rule. Overriding that after the fact for certain string name may be puzzling/misleading, and it could remove the possibility for an admin to fix their Identify locally. Maybe we should consider the difference between the two rules' score, or never override a rule that has a score higher than a certain number.
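
The two safeguards mentioned at the end of that paragraph could look roughly like this. It is only a sketch: the `may_override` name and both threshold values are made up for illustration and do not correspond to any existing Identify setting.

```python
def may_override(winner_score: int, challenger_score: int,
                 max_gap: int = 5, protected_score: int = 10) -> bool:
    """Decide whether a lower-scored rule may take precedence over the current winner.

    max_gap and protected_score are hypothetical values chosen for this example.
    """
    # Never override a rule whose score is at or above the protected threshold.
    if winner_score >= protected_score:
        return False
    # Only override when the two scores are close enough to each other.
    return winner_score - challenger_score <= max_gap


print(may_override(winner_score=3, challenger_score=0))
print(may_override(winner_score=8, challenger_score=0))
print(may_override(winner_score=12, challenger_score=10))
```

An admin who raises a local rule above the protected threshold would then keep full control over the outcome.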

Regarding the strong_py4, I wonder how strong it is... It's a simple regex that looks for try: or except: or else:. I feel like those are not that strong, even with the colon, as they may show up in normal text. If we remove it, or change it to weak_py1, it would stop uidesign.py from having python both before and after all ps1 indicators. In that case, how confident are we that having two python identifiers before three ps1 identifiers is enough?

And even if this is all super fancy and interesting (😃), we could simply bump the score of the code_python rule. I am not certain how many false positives of the other languages it could cause. It may be wiser to change that strong_py4 if we do: I just found out that it became a strong indicator through refactoring, but it was originally a weak one. I'll try to find a lot of various scripts to check if the score bump would change the identification of true VBS/Batch/ps1/... scripts.


kam193 commented Nov 20, 2024

I love the idea of surrounding-ness :D

re: strong_py4, I agree that it doesn't look strong at all - and it may really cause many plain texts to be identified as Python.

it would stop uidesign.py from having python both before and after all ps1 indicators.

But if we have any strong identifiers, maybe we could say that having weak strings at the end would be enough, as long as they are not present for both languages?

Overriding that after the fact for certain string name may be puzzling/misleading, and it could remove the possibility for an admin to fix their Identify locally.

I feel that we may need some kind of identification log explaining the reason for the given file type. It could include the debug output you added and explain the reason for the decision. Maybe we could attach it to the results in the profiled submission? Then administrators could more easily understand what's going on.


kam193 commented Nov 20, 2024

BTW, the try:, else:, except: could also be strong if we require them to have at least a tab or four spaces before them, and treat them as weak otherwise. It's rather unusual to have four spaces in normal text (two could happen, so that wouldn't be enough), and very common in Python code.
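
A quick sketch of that split, written as Python regexes rather than YARA strings. The pattern names and the `classify` helper are assumptions for this example, not the current strong_py4 definition.

```python
import re

# Strong: the keyword starts a line and is indented by a tab or at least four spaces.
STRONG_RE = re.compile(r"^(?:\t| {4})[ \t]*(?:try|else|except):", re.MULTILINE)
# Weak: the keyword followed by a colon anywhere in the text.
ANY_RE = re.compile(r"\b(?:try|else|except):")


def classify(text: str) -> str:
    """Return 'strong', 'weak' or 'none' for the try:/else:/except: indicator."""
    if STRONG_RE.search(text):
        return "strong"
    if ANY_RE.search(text):
        return "weak"
    return "none"


print(classify("def f():\n    try:\n        pass\n"))  # strong
print(classify("Give it a try: it usually works."))    # weak
print(classify("  else: only two spaces of indent"))   # weak
```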


kam193 commented Nov 29, 2024

A few more samples :)

Here is one that is obfuscated: settings.py.zip
Two samples that are quite short, but with rather important content - revshells: linux(1).txt.zip, windows(1).txt.zip
And one that I'm surprised wasn't detected, but overall it contains only data, so it's not a big loss not to identify it; it's rather a side note: native.py.zip (note that it has a BOM, but that doesn't seem to be a problem, as other files with one were correctly identified).


kam193 commented Dec 13, 2024

And a new one :D
post_install.py.zip


kam193 commented Jan 10, 2025

A new sample that I'm quite surprised wasn't identified. Am I right that the yara identification has been tricked by unusual spaces, like base64 .b64encode? It was identified as text/plain.

dabafa823ad6a838790dc0e5d9d6c190f0400406e9f32bc9297252801d305a99.zip


gdesmar commented Jan 10, 2025

the yara identification has been tricked by unusual spaces

Yes. Why is that even allowed. 😢
At least the solution is easy: simply transform all string matches into regex matches that allow spaces, but it's going to slow down the rule a bit.
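
To illustrate the idea in Python (the real change would be made in the YARA rule itself), a literal can be turned into a whitespace-tolerant regex along these lines. The `tolerant_pattern` helper is hypothetical, and it only loosens whitespace around `.` and `(`.

```python
import re


def tolerant_pattern(literal: str) -> re.Pattern:
    """Build a regex matching `literal` with optional whitespace around '.' and '('."""
    pieces = []
    for part in re.split(r"([.(])", literal):
        if part in (".", "("):
            # Python accepts whitespace here, e.g. `base64 .b64encode (data)`.
            pieces.append(r"\s*" + re.escape(part) + r"\s*")
        else:
            pieces.append(re.escape(part))
    return re.compile("".join(pieces))


pattern = tolerant_pattern("base64.b64encode(")
print(bool(pattern.search("x = base64 .b64encode(data)")))
print(bool(pattern.search("x = base64.b64encode(data)")))
```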

I'm trying to re-evaluate Magika, to see if it could be fast enough to run as a fallback for files that end up unknown or text/plain. It would not help for the files we already wrongly identify as code/*, but may help for those:

settings.py: Generic text document (text) [Low-confidence model best-guess: Python source (code), score=65] 65%
linux.1.txt: Generic text document (text) [Low-confidence model best-guess: Python source (code), score=69] 69%
windows.1.txt: Python source (code) 99%
native.py: Python source (code) 99%
post_install.py: Generic text document (text) [Low-confidence model best-guess: JavaScript source (code), score=60] 60%
dabafa823ad6a838790dc0e5d9d6c190f0400406e9f32bc9297252801d305a99: Python source (code) 99%

Only half of the new samples would be correctly identified with high confidence. I'll try to find a way to get more than one guess from the model in the case of low-confidence as I'm curious to see what confidence python has for post_install.py.
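
A sketch of how such a fallback could be gated, fed with the best guesses and scores listed above. Everything here is an assumption for illustration: the 90% cut-off, the `apply_fallback` helper, the `code/<label>` mapping, and treating all six files as currently text/plain. It does not call Magika itself.

```python
FALLBACK_TYPES = {"unknown", "text/plain"}
MIN_SCORE = 90  # hypothetical cut-off, in percent


def apply_fallback(current_type: str, guess_label: str, guess_score: int) -> str:
    """Keep the current type unless it is a fallback candidate and the guess is confident."""
    if current_type not in FALLBACK_TYPES or guess_score < MIN_SCORE:
        return current_type
    return f"code/{guess_label}"


# (file, best-guess label, score in percent) taken from the Magika output above
guesses = [
    ("settings.py", "python", 65),
    ("linux.1.txt", "python", 69),
    ("windows.1.txt", "python", 99),
    ("native.py", "python", 99),
    ("post_install.py", "javascript", 60),
    ("dabafa823ad6a838790dc0e5d9d6c190f0400406e9f32bc9297252801d305a99", "python", 99),
]

resolved = [name for name, label, score in guesses
            if apply_fallback("text/plain", label, score) == "code/python"]
print(f"{len(resolved)} of {len(guesses)} resolved to code/python")  # 3 of 6
```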


kam193 commented Jan 10, 2025

Yes. Why is that even allowed. 😢

The same reason why this code works:

[image]

To make our life miserable 😄

Only half of the new samples would be correctly identified with high confidence.

Looks like Magika still needs some magic 😅 I can understand that this code may look like minified JS, but still :((

ChatGPT was a little more intelligent, but also totally hallucinated how the file formats are described in AL:
[image]
