Filter for ConfigExtractor to Improve Performance #299

utkonos · 2025-01-03T18:23:24Z

Is your feature request related to a problem? Please describe.
All files, even file fragments from PE component extraction, are sent to the ConfigExtractor. There should be a filter so that only files that can potentially have a config extracted are sent to this service. I understand that this may increase the burden of maintaining YARA rules or other filtration methods for identifying files that could even have a config extracted in the first place.

Describe the solution you'd like
YARA or other detection methods for identifying malware families that are within the realm of the possible for the ConfigExtractor

Additional context
I submitted a test file, a basic PE NSIS installer with three identical PE files in its archive. On a MicroK8s appliance deployment, every other service finished almost instantly, but the submission took about 10 minutes in total because 16 files, including file fragments, were queued in the ConfigExtractor. ConfigExtractor processing accounted for the remaining ~9 minutes.

cccs-rs · 2025-01-03T19:01:16Z

Hmm... most extractors within ConfigExtractor should have an associated YARA rule that triggers the extractor. That said, some extractors may not have any rules associated, so they fall back to brute-force analysis (the assumption is that they run checks at runtime to determine whether the file is relevant to the extractor).

If you have any samples that are shareable that did result in long processing, I would be curious to test them out and push PRs back to the maintainers 😁

utkonos · 2025-01-03T20:55:52Z

I think the bottleneck that I see on ConfigExtractor is that many file objects are being queued for processing by the service that should not be queued at all. I understand the filters inside the service, but I'm thinking of a feature to prevent junk from being queued in the first place. The test file I have been using produces 16 objects that then get queued for this service. Almost all of the objects are PE components, except for two: the installer and the one file in the NSIS archive overlay. I'd even allow for the uninstaller to be considered a third file. But sending a bunch of sections and other chunks carved from a PE to the config extraction service is what I'm thinking about preventing.

Test file: c453b20437d728f5c6f0133bc3709ac24a0edb964304724bfbe62fa65ba77b1d

utkonos · 2025-01-04T01:43:43Z

That test file consistently blows up the queue on ConfigExtractor

Everything else finishes in a few seconds. Then it grinds on the ConfigExtractor service for 9-10 minutes.

utkonos · 2025-01-04T01:45:41Z

Are these file fragment/extract components being considered full files and being sent (incorrectly in my opinion) to the ConfigExtractor Service?

Is there a way to look at what exactly those files in the queue for that service are?

utkonos · 2025-01-04T01:47:39Z

kam193 · 2025-01-04T14:04:55Z

I think there is a general issue with the Config Extractor performance - I disabled the service by default and use it only when I suspect it may be helpful.

"Are these file fragment/extract components being considered full files and being sent (incorrectly in my opinion) to the ConfigExtractor Service?"

You can verify it afterwards by looking at the results: if there is an empty result, then they were processed.

It's a good question, however, whether the extractors really expect any file type to possibly contain a configuration (though I could imagine this). If not, maybe the accepted file types should be limited to executables?

utkonos · 2025-01-04T16:53:18Z

I need to dig into everything happening inside the ConfigExtractor (CEx) service to make a complete recommendation or PR, but in general I could see a benefit of doing some decision making outside the service before an object is queued to be processed. I sketched out a diagram that should help understand what I am thinking.

After a sample is processed, there are three general categories of objects downstream from the processing: the input object, whole object children, and fragment children. For example, in the NSIS installer test file above, there are three identical executables in the archive overlay. These are whole object children. There is also an array of PE components, like sections, that are produced by the PE analysis. These are fragment children. The service that produces these files has knowledge of what they are and, based on that, should mark them somehow. Fragment children should never be queued for ConfigExtractor in the first place, so I have colored that red.

The ConfigExtractor I have split into three flavors: Targeted, YOLO, and Brute Force. These can all work from the exact same container but be deployed with configuration options that change, enable, or disable processing as appropriate. The result is actually three service flavors running separately. The targeted flavor would only process files that are known a priori to belong to a malware family handled by code in the service. The YOLO flavor is a more generalist configuration that handles any exe or document. And the brute force flavor would do its thing on every object that is sent to it.

Depending on the use case, a user can enable or disable any of these three service flavors.

All objects would be processed and have their file type identified, except for the parent, which would already have one. They would all also go through YARA scanning to get tags.

Based on the file typing, some of the resulting files would be sent to the YOLO flavor. Based on the YARA tagging, some of the resulting files would be sent to the targeted flavor. And then, optionally, everything can be sent to brute force.
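
A minimal sketch of the routing described above, in Python. It is only an illustration of the proposal, not how Assemblyline's dispatcher works today: the `Kind` marker, the `FileObject` fields, `KNOWN_FAMILIES`, and `YOLO_PREFIXES` are all hypothetical names. It also assumes one reading of the proposal, namely that fragment children are excluded even from the brute force flavor.

```python
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    INPUT = "input"        # the submitted file itself
    WHOLE = "whole"        # complete file extracted from a container
    FRAGMENT = "fragment"  # carved piece such as a PE section


@dataclass
class FileObject:
    sha256: str
    kind: Kind
    file_type: str                      # e.g. "executable/windows/pe32"
    yara_tags: set[str] = field(default_factory=set)


# Hypothetical: families the targeted flavor has extractors for.
KNOWN_FAMILIES = {"family_a", "family_b"}
# Hypothetical: file type prefixes the YOLO flavor accepts.
YOLO_PREFIXES = ("executable/", "document/")


def route(obj: FileObject, brute_force_enabled: bool = False) -> list[str]:
    """Return the service flavors an object would be queued for."""
    if obj.kind is Kind.FRAGMENT:
        return []  # assumption: fragments are never queued at all
    flavors = []
    if obj.yara_tags & KNOWN_FAMILIES:
        flavors.append("targeted")
    if obj.file_type.startswith(YOLO_PREFIXES):
        flavors.append("yolo")
    if brute_force_enabled:
        flavors.append("brute_force")
    return flavors
```

With this shape, the producing service only has to set `kind` when it emits a child, and the queueing decision needs nothing beyond file type and YARA tags.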

kam193 · 2025-01-05T10:58:17Z

This sounds reasonable, and more specific configuration sounds good, but I'd suggest first confirming that the ConfigExtractor is slow on every file (or did you do that already?). If one file took 9 minutes to process and the rest were rejected immediately, we would get almost no improvement from filtering them earlier.

utkonos · 2025-01-05T18:25:35Z

"confirming that the ConfigExtractor is slow on every file (or did you do it already?)"

No, not yet. I need to do some deeper analysis on this problem. I am making an educated guess based on the number of files that were queued for processing in this service compared with the number of objects shown as child objects from the test file I submitted in the UI. I have a bunch of projects going at the moment, but I will dive into this more completely soon.

cccs-rs · 2025-01-06T16:38:21Z

One way this could be done is to limit the file acceptance to executable/.* so that the service isn't tasked with non-executables.

That being said, based on our usage of the service, we've seen hits for files that match the pattern: executable/.*|java/jar|code/.*|unknown but these hits can come from different extractors and I haven't verified if the hits are FPs or not (I just ran a facet query in Elasticsearch to produce the candidate pattern).

unknown file types are files that Assemblyline hasn't been able to identify using magic, mime, or YARA rules, i.e. blobs of data.
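
To see what the candidate pattern above would let through, here is a small Python check. It assumes the pattern is applied as a full match against the file type string, which may not be exactly how Assemblyline evaluates a service's accepted types, and the sample type strings are only illustrative.

```python
import re

# Candidate acceptance pattern taken from the comment above.
ACCEPTED = re.compile(r"executable/.*|java/jar|code/.*|unknown")


def is_accepted(file_type: str) -> bool:
    """True if the whole file type string matches the candidate pattern."""
    return ACCEPTED.fullmatch(file_type) is not None


for file_type in ("executable/windows/pe32", "java/jar", "code/python",
                  "unknown", "document/pdf", "archive/zip"):
    print(f"{file_type}: {is_accepted(file_type)}")
```

Note that a type-based filter like this cannot tell a carved PE section from a whole file if both end up typed the same way, so it narrows the queue by type rather than by whether the object is a fragment.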

"The ConfigExtractor I have split into three flavors: Targeted, YOLO, and Brute Force"

We do kind of have something like this, but not at the service level. It's handled by the underlying library, and it only really handles the use cases of targeted (self-declared by the extractor through YARA rules) or brute force (where the expectation is that the extractor will be able to handle things at runtime, quitting early or not):
https://github.com/CybercentreCanada/configextractor-py/blob/main/configextractor/main.py#L190-L205
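
The targeted versus brute force split can be pictured roughly as follows. This is a simplified Python sketch and not the code at the link above: the `Extractor` class and `select_extractors` function are hypothetical, and a plain callable stands in for a YARA rule so the example has no dependencies.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Extractor:
    name: str
    # Stand-in for a YARA rule: returns True when the sample looks relevant.
    # None means the extractor declared no rule.
    rule: Optional[Callable[[bytes], bool]]


def select_extractors(data: bytes, extractors: list[Extractor]) -> dict[str, list[str]]:
    """Split extractors into those triggered by a rule and those run blindly."""
    selected = {"targeted": [], "brute_force": []}
    for ext in extractors:
        if ext.rule is None:
            # No rule: the extractor must decide relevance itself at runtime.
            selected["brute_force"].append(ext.name)
        elif ext.rule(data):
            selected["targeted"].append(ext.name)
    return selected


extractors = [
    Extractor("with_rule", rule=lambda d: d.startswith(b"MZ")),
    Extractor("no_rule", rule=None),
]
print(select_extractors(b"MZ\x90\x00", extractors))
print(select_extractors(b"\x00\x01", extractors))
```

In this model, every rule-less extractor runs on every file the service receives, which is why the number of objects queued matters even when most extractors are targeted.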

"The YOLO flavor is a more generalist configuration that handles any exe or document."

YOLO is what I would describe as an extractor that could use a YARA rule to loosely target anything that resembles an exe or an acceptable document based on the magic bytes or it could make that determination at runtime by performing those validity checks before any attempt at config extraction (basically a smarter brute force).

cccs-rs · 2025-01-06T17:54:03Z

"Test file: c453b20437d728f5c6f0133bc3709ac24a0edb964304724bfbe62fa65ba77b1d
... processing took a total of about 10 minutes while 16 files including file fragments were queued in the ConfigExtractor. The ConfExt processing took the rest of the remaining ~9 minutes until the processing was fully complete."

Running a similar test in our production system with all service categories, we were able to complete processing of all 17 (root + children) files in ~90s. What I wonder is whether the additional time on your deployment comes from the system having to scale up the number of service instances in response to the backlog for the service. On our deployment, we have min_instances: 3, whereas I believe the appliance may use min_instances: 1.
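
As a rough way to reason about how instance count affects a backlog, here is a back-of-the-envelope model in Python. It assumes every file takes the same time and that instances work through the queue in parallel batches, and it ignores scale-up delay entirely. The 30-second per-file figure is purely hypothetical and is not a measurement from either deployment.

```python
import math


def drain_seconds(n_files: int, per_file_seconds: float, instances: int) -> float:
    """Rough time to clear a backlog processed in parallel batches."""
    if instances < 1:
        raise ValueError("instances must be at least 1")
    return math.ceil(n_files / instances) * per_file_seconds


per_file = 30.0  # hypothetical per-file cost, for illustration only
print(drain_seconds(16, per_file, instances=1))  # 480.0
print(drain_seconds(16, per_file, instances=3))  # 180.0
```

Under these assumptions, a single instance drains the queue strictly serially, so any fixed per-file cost is multiplied by the number of queued objects.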

Performing some custom filtering before the service is a tricky situation as it would need to involve scheduling in the dispatcher, and it introduces a situation of the service needing to depend on the results of another from a previous stage where they should maintain their independence if they can.

utkonos added the labels assess (We still haven't decided if this will be worked on or not) and enhancement (New feature or request) on Jan 3, 2025
