Filter for ConfigExtractor to Improve Performance #299

Open
utkonos opened this issue Jan 3, 2025 · 11 comments
Labels
assess We still haven't decided if this will be worked on or not enhancement New feature or request

Comments

@utkonos

utkonos commented Jan 3, 2025

Is your feature request related to a problem? Please describe.
All files, even file fragments from PE component extraction, are sent to the ConfigExtractor. There should be a filter so that only files that can potentially have a config extracted are sent to this service. I understand that this may increase the burden of maintaining YARA rules or other filtration methods for identifying files that could even have a config extracted in the first place.

Describe the solution you'd like
YARA or other detection methods to identify files from malware families that the ConfigExtractor can actually handle.

Additional context
I submitted a test file: a basic PE NSIS installer with three identical PE files in its archive. On a MicroK8s appliance deployment, all other processing was nearly instantaneous, but the submission took about 10 minutes in total while 16 files, including file fragments, were queued in the ConfigExtractor. ConfigExtractor processing accounted for the remaining ~9 minutes.

@utkonos utkonos added assess We still haven't decided if this will be worked on or not enhancement New feature or request labels Jan 3, 2025
@cccs-rs
Contributor

cccs-rs commented Jan 3, 2025

Hmm... most extractors within ConfigExtractor should have an associated YARA rule that triggers the extractor. That being said, some extractors may not have any associated rules, so they fall back to brute-force analysis (the assumption is that they run checks at runtime to determine whether the file is relevant to the extractor).

If you have any shareable samples that resulted in long processing, I would be curious to test them out and push PRs back to the maintainers 😁

@utkonos
Author

utkonos commented Jan 3, 2025

I think the bottleneck I see on ConfigExtractor is that many file objects are being queued for processing by the service that should not be queued at all. I understand the filters inside the service, but I'm thinking of a feature that prevents junk from being queued in the first place. The test file I have been using produces 16 objects that then get queued for this service. Almost all of the objects are PE components, except for two: the installer and the one file in the NSIS archive overlay. I'd even allow for the uninstaller to be considered a third file. But sending a bunch of sections and other chunks carved from a PE to the config extraction service is what I'm thinking about preventing.

Test file: c453b20437d728f5c6f0133bc3709ac24a0edb964304724bfbe62fa65ba77b1d

@utkonos
Author

utkonos commented Jan 4, 2025

That test file consistently blows up the queue on ConfigExtractor
image
Everything else finishes in a few seconds. Then it grinds on the ConfigExtractor service for 9-10 minutes.

@utkonos
Author

utkonos commented Jan 4, 2025

Are these file fragment/extract components being considered full files and being sent (incorrectly in my opinion) to the ConfigExtractor Service?
image

Is there a way to look at what exactly those files in the queue for that service are?

@utkonos
Author

utkonos commented Jan 4, 2025

image

@kam193

kam193 commented Jan 4, 2025

I think there is a general issue with the Config Extractor performance - I disabled the service by default and use it only when I suspect it may be helpful.

> Are these file fragment/extract components being considered full files and being sent (incorrectly in my opinion) to the ConfigExtractor Service?

You can verify it afterwards by looking at the results: if there is an empty result, then they were processed.

It is, however, a good question whether the extractors really expect any file type to potentially contain a configuration (though I could imagine this). If not, maybe the accepted file types should be limited to executables?

@utkonos
Author

utkonos commented Jan 4, 2025

I need to dig into everything happening inside the ConfigExtractor (CEx) service to make a complete recommendation or PR, but in general I could see a benefit of doing some decision making outside the service before an object is queued to be processed. I sketched out a diagram that should help understand what I am thinking.

ConfigExtractor

After a sample is processed, there are three general categories of objects downstream from the processing: the input object, whole object children, and fragment children. To illustrate the difference with the NSIS installer test file above: there are three identical executables in the archive overlay. These are whole object children. There is also an array of PE components, such as sections, produced by the PE analysis. These are fragment children. The service that produces these files knows what they are and, based on that, should mark them somehow. Fragment children should never be queued for ConfigExtractor in the first place, so I have colored that red.
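The three categories described above could be modelled as a marker that the producing service attaches to each object and that is checked before queueing. This is only a minimal sketch: `ObjectKind`, `SubmissionObject`, and `should_queue_for_config_extraction` are hypothetical names for illustration and are not part of Assemblyline.

```python
from dataclasses import dataclass
from enum import Enum, auto


class ObjectKind(Enum):
    """Hypothetical marker set by the service that produces the object."""
    INPUT = auto()           # the originally submitted file
    WHOLE_CHILD = auto()     # a complete file extracted from the parent
    FRAGMENT_CHILD = auto()  # a carved piece, e.g. a PE section


@dataclass(frozen=True)
class SubmissionObject:
    sha256: str
    kind: ObjectKind


def should_queue_for_config_extraction(obj: SubmissionObject) -> bool:
    """Fragments are never queued; the input and whole children are."""
    return obj.kind is not ObjectKind.FRAGMENT_CHILD
```

With a marker like this, the decision is made from what the producing service already knows, without inspecting the file contents again.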

The ConfigExtractor I have split into three flavors: Targeted, YOLO, and Brute Force. These can all run from the exact same container, deployed with configuration options that change, enable, or disable processing as appropriate. The result is actually three service flavors running separately. The targeted flavor would only process files that are known a priori to belong to a malware family handled by code in the service. The YOLO flavor is a more generalist configuration that handles any exe or document. And the brute force flavor would do its thing on every object that is sent to it.

Depending on the use case, a user can enable or disable any of these three service flavors.

All objects would be processed and have their file type identified, except for the parent, which would already have one. They would all also go through YARA scanning to get tags.

Based on the file typing, some of the resulting files would be sent to the YOLO flavor. Based on the YARA tagging, some of the resulting files would be sent to the targeted flavor. And then, optionally, everything can be sent to brute force.
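The routing described above could look roughly like the following. This is a sketch under assumptions: `select_flavors`, the flavor names as strings, and the `YOLO_TYPE_PREFIXES` tuple are all hypothetical, and the choice of `executable/` and `document/` prefixes is only one reading of "any exe or document".

```python
from typing import Iterable, Set

# Assumed type prefixes for "any exe or document"; not taken from the service.
YOLO_TYPE_PREFIXES = ("executable/", "document/")


def select_flavors(file_type: str,
                   yara_family_tags: Iterable[str],
                   supported_families: Set[str],
                   brute_force_enabled: bool = False) -> Set[str]:
    """Return the hypothetical service flavors an object would be sent to."""
    flavors = set()
    # Targeted: a YARA tag names a family that an extractor supports.
    if supported_families.intersection(yara_family_tags):
        flavors.add("targeted")
    # YOLO: the identified file type looks like an executable or document.
    if file_type.startswith(YOLO_TYPE_PREFIXES):
        flavors.add("yolo")
    # Brute force: optional, receives everything when enabled.
    if brute_force_enabled:
        flavors.add("brute_force")
    return flavors
```

An object with an unidentified type and no family tags would go nowhere unless brute force is enabled, which is the behavior the proposal is after.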

@kam193

kam193 commented Jan 5, 2025

This sounds reasonable, and a more specific configuration sounds good, but I'd suggest first confirming that the ConfigExtractor is slow on every file (or have you done that already?). If one file took 9 minutes to process and the rest were rejected immediately, filtering them earlier would bring almost no improvement.
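The point above can be made concrete with a small calculation. The timings below are invented for illustration and are not measurements from this issue; `time_saved_by_filtering` is a hypothetical helper that assumes objects are processed one after another.

```python
def time_saved_by_filtering(durations_s, is_fragment):
    """Serial processing time avoided if fragments were never queued."""
    return sum(d for d, frag in zip(durations_s, is_fragment) if frag)


# 16 queued objects, 14 of them fragments, 540 s in total (made-up split).
fragment_flags = [False, False] + [True] * 14

# Scenario A: one whole file dominates, everything else is rejected quickly.
scenario_a = [525.0, 1.0] + [1.0] * 14
# Scenario B: every object costs the same.
scenario_b = [33.75] * 16

saved_a = time_saved_by_filtering(scenario_a, fragment_flags)  # 14.0 s
saved_b = time_saved_by_filtering(scenario_b, fragment_flags)  # 472.5 s
```

Only per-file timings from the actual deployment can tell which scenario is closer to reality.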

@utkonos
Author

utkonos commented Jan 5, 2025

> confirming that the ConfigExtractor is slow on every file (or did you do it already?)

No, not yet. I need to do some deeper analysis on this problem. I am making an educated guess based on the number of files that were queued for processing in this service compared with the number of objects shown as child objects from the test file I submitted in the UI. I have a bunch of projects going at the moment, but I will dive into this more completely soon.

@cccs-rs
Contributor

cccs-rs commented Jan 6, 2025

One way this could be done is to limit the file acceptance to executable/.* so that the service isn't tasked with non-executables.

That being said, based on our usage of the service, we've seen hits for files that match the pattern: executable/.*|java/jar|code/.*|unknown but these hits can come from different extractors and I haven't verified if the hits are FPs or not (I just ran a facet query in Elasticsearch to produce the candidate pattern).

unknown file types are files that Assemblyline hasn't been able to identify using magic, mime, or YARA rules, i.e., blobs of data.
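To see which identified types the candidate pattern above would let through, it can be tried against a few type strings. This sketch assumes the pattern must match the whole type string; how Assemblyline itself applies an acceptance pattern is not confirmed here, and `is_accepted` is a hypothetical helper.

```python
import re

# Candidate pattern quoted in the comment above.
ACCEPT_PATTERN = re.compile(r"executable/.*|java/jar|code/.*|unknown")


def is_accepted(file_type: str) -> bool:
    """True if the whole type string matches the candidate pattern."""
    return ACCEPT_PATTERN.fullmatch(file_type) is not None


for file_type in ("executable/windows/pe32", "java/jar", "unknown", "document/pdf"):
    print(file_type, is_accepted(file_type))
```

Note that accepting `unknown` means carved fragments that could not be identified would still reach the service.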

> The ConfigExtractor I have split into three flavors: Targeted, YOLO, and Brute Force

We do kind of have something like this, but not at the service level. It's handled by the underlying library, and it only really handles the use cases of targeted (self-declared by the extractor via YARA rules) or brute force (where the expectation is that the extractor can handle things at runtime, quitting early or not):
https://github.com/CybercentreCanada/configextractor-py/blob/main/configextractor/main.py#L190-L205
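The targeted versus brute-force split described above can be illustrated as follows. This is not the code at the linked lines; it is a simplified sketch with hypothetical names (`Extractor`, `extractors_to_run`) that assumes an extractor without YARA rules always runs and one with rules runs only when a rule matched.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class Extractor:
    name: str
    # Names of the YARA rules this extractor declares; empty means brute force.
    yara_rule_names: List[str] = field(default_factory=list)


def extractors_to_run(extractors: Iterable[Extractor],
                      matched_rules: Set[str]) -> List[str]:
    """Pick extractors for one file given the YARA rules that matched it."""
    selected = []
    for extractor in extractors:
        if not extractor.yara_rule_names:
            # Brute force: no rules declared, so it is always attempted.
            selected.append(extractor.name)
        elif matched_rules.intersection(extractor.yara_rule_names):
            # Targeted: at least one declared rule matched this file.
            selected.append(extractor.name)
    return selected
```

Under this model, every queued file still costs at least one run of each rule-less extractor, which is where early rejection inside the extractor matters.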

> The YOLO flavor is a more generalist configuration that handles any exe or document.

YOLO is what I would describe as an extractor that could use a YARA rule to loosely target anything that resembles an exe or an acceptable document based on the magic bytes, or that could make that determination at runtime by performing validity checks before any attempt at config extraction (basically a smarter brute force).

@cccs-rs
Contributor

cccs-rs commented Jan 6, 2025

> Test file: c453b20437d728f5c6f0133bc3709ac24a0edb964304724bfbe62fa65ba77b1d
> ... processing took a total of about 10 minutes while 16 files including file fragments were queued in the ConfigExtractor. The ConfExt processing took the rest of the remaining ~9 minutes until the processing was fully complete.

Running a similar test in our production system with all service categories, we were able to complete processing of all 17 (root + children) files in ~90s. I wonder whether the additional time on your deployment comes from the system having to scale up the number of service instances in response to the backlog for the service. On our deployment, we have min_instances: 3, whereas I believe the appliance may use min_instances: 1.
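A rough way to reason about the effect of the instance count is a batch model. This sketch assumes every file takes the same time, each instance handles one file at a time, and there is no scale-up delay; the 30 s per-file figure is hypothetical, and `estimated_drain_time_s` is not an Assemblyline function.

```python
import math


def estimated_drain_time_s(n_files: int, per_file_s: float, instances: int) -> float:
    """Time to empty the queue when files are processed in equal batches."""
    return math.ceil(n_files / instances) * per_file_s


one_instance = estimated_drain_time_s(16, 30.0, 1)     # 480.0 s
three_instances = estimated_drain_time_s(16, 30.0, 3)  # 180.0 s
```

The model ignores the time the scaler needs to start additional instances, which is exactly the overhead suspected in the comment above.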

Performing custom filtering before the service is tricky: it would need to involve scheduling in the dispatcher, and it would make the service depend on the results of another service from a previous stage, whereas services should maintain their independence if they can.

3 participants