Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suricata service can be stuck for hours if suricata didn't start #217

Open
kam193 opened this issue Apr 22, 2024 · 9 comments
Open

Suricata service can be stuck for hours if suricata didn't start #217

kam193 opened this issue Apr 22, 2024 · 9 comments
Assignees
Labels
assess We still haven't decided if this will be worked on or not bug Something isn't working

Comments

@kam193
Copy link

kam193 commented Apr 22, 2024

Describe the bug
Recently, I've recognized a few times that Suricata service stopped processing files. After analysing what's going on, I've found that Suricata is not running, and the service seems to have no timeout for the initialization. I'm not sure why Suricata isn't running, I haven't found any tracks in logs, neither from the service nor the Suricata inside the container. Restarting container is enough to repair it.

The behaviour of the service looks a little unhealthy as it's trying to reach Suricata in an infinite loop, throwing RecoverableError what prevents scaler from stepping in and recreating the service.

To Reproduce
Steps to reproduce the behavior:

Unfortunately, I don't know an easy way. I assume killing the Suricata process or modifying the service not to start Suricata at all could reproduce the behaviour.

Expected behavior
I would like the service to have a timeout / limit for Suricata to start, and just give up throwing a normal error. Using RecoverableError for the time of the initialization is perfectly fine, but should be stopped after some time.

Screenshots
If applicable, add screenshots to help explain your problem.

Environment (please complete the following information if pertinent):

  • Assemblyline Version: 4.5.0.18, Suricata service 4.5.0.5

Additional context
The service container has logs like below. Note the time - between the start and last sample are two hours of throwing just RecoverableError.

{"@timestamp": "2024-04-22 14:30:04,319", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "Launching Suricata: /usr/local/bin/suricata -vvvv -c /etc/suricata/suricata.yaml --unix-socket=/var/run/suricata/suricata.socket --pidfile /var/run/suricata/suricata.pid --set logging.outputs.1.file.filename=/var/log/suricata/suricata.log"}
{"@timestamp": "2024-04-22 14:30:04,320", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "Suricata not started yet: [Errno 2] No such file or directory"}
...
{"@timestamp": "2024-04-22 14:30:18,322", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "Ruleset 0: 71348 rules loaded"}
{"@timestamp": "2024-04-22 14:30:18,322", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "WARNING", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "Ruleset 0: 3458 rules failed to load.This can be due to duplication of rules among muliple rulesets being loaded."}
{"@timestamp": "2024-04-22 14:30:18,323", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "Suricata started with service version: 4.5.0.5"}
{"@timestamp": "2024-04-22 14:30:18,353", "event": { "module": "assemblyline", "dataset": "assemblyline.odm" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "WARNING", "logger": "assemblyline.odm" }, "process": { "pid": "1" }, "message": "The following parameters were ignored from object 'Service': monitored_keys"}
{"@timestamp": "2024-04-22 14:30:18,378", "event": { "module": "assemblyline", "dataset": "assemblyline.dispatching.client" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.dispatching.client" }, "process": { "pid": "1" }, "message": "[3BH0jBEinXRrBLTJlPFgnj/704e5e5b3234433c01fcfd1b20a306e77e985038120492dc53965c3edd38a4ea] Suricata:f2680f09c0a4 task found"}
{"@timestamp": "2024-04-22 14:30:18,396", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "[3BH0jBEinXRrBLTJlPFgnj] New task received"}
{"@timestamp": "2024-04-22 14:30:18,396", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "[3BH0jBEinXRrBLTJlPFgnj] Downloading file: 704e5e5b3234433c01fcfd1b20a306e77e985038120492dc53965c3edd38a4ea"}
{"@timestamp": "2024-04-22 14:30:18,409", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "[3BH0jBEinXRrBLTJlPFgnj] Starting task for file: 704e5e5b3234433c01fcfd1b20a306e77e985038120492dc53965c3edd38a4ea (network/tcpdump)"}
{"@timestamp": "2024-04-22 14:31:19,037", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "Recoverable Service Error (3BH0jBEinXRrBLTJlPFgnj/704e5e5b3234433c01fcfd1b20a306e77e985038120492dc53965c3edd38a4ea) [Errno 104] Connection reset by peer:   File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/assemblyline_v4_service/common/base.py\", line 181, in handle_task\n    self.execute(request)\n  File \"/opt/al_service/suricata_/suricata_.py\", line 512, in execute\n    raise RecoverableError(e)\nRecoverableError: [Errno 104] Connection reset by peer"}
{"@timestamp": "2024-04-22 14:31:19,038", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "[3BH0jBEinXRrBLTJlPFgnj] Saving error to: /tmp/3BH0jBEinXRrBLTJlPFgnj_704e5e5b3234433c01fcfd1b20a306e77e985038120492dc53965c3edd38a4ea_error.json"}
{"@timestamp": "2024-04-22 14:31:19,042", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "There are no new signatures. (1713789546.8030632 >= 1713789546.8030632)"}
{"@timestamp": "2024-04-22 14:31:19,042", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "[3BH0jBEinXRrBLTJlPFgnj] Task completed with errors"}
{"@timestamp": "2024-04-22 14:31:19,042", "event": { "module": "assemblyline", "dataset": "assemblyline.tasking_client" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.tasking_client" }, "process": { "pid": "1" }, "message": "[3BH0jBEinXRrBLTJlPFgnj] f2680f09c0a4 - Suricata failed to complete task "}
...
{"@timestamp": "2024-04-22 14:33:13,160", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "Suricata not started yet: [Errno 111] Connection refused"}
{"@timestamp": "2024-04-22 14:33:23,160", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "Suricata not started yet: [Errno 111] Connection refused"}
{"@timestamp": "2024-04-22 14:33:23,161", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "Recoverable Service Error (3BH0jBEinXRrBLTJlPFgnj/1ffd5e0894eef1ef357e5554ada335827398568b8287331221dff8abe4c42622) RetryError[Attempts: 15, Value: False]:   File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/assemblyline_v4_service/common/base.py\", line 181, in handle_task\n    self.execute(request)\n  File \"/opt/al_service/suricata_/suricata_.py\", line 476, in execute\n    self.start_suricata_if_necessary()\n  File \"/opt/al_service/suricata_/suricata_.py\", line 145, in start_suricata_if_necessary\n    raise RecoverableError(e)\nRecoverableError: RetryError[Attempts: 15, Value: False]"}
...
{"@timestamp": "2024-04-22 16:25:03,645", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "Suricata not started yet: [Errno 111] Connection refused"}
{"@timestamp": "2024-04-22 16:25:03,645", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "Recoverable Service Error (77rC62QgvSPJeHvVdIp6dY/3382b9a14870049b854563106c33caa3da560b215a60e175e0aedfeff27a562b) RetryError[Attempts: 15, Value: False]:   File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/assemblyline_v4_service/common/base.py\", line 181, in handle_task\n    self.execute(request)\n  File \"/opt/al_service/suricata_/suricata_.py\", line 476, in execute\n    self.start_suricata_if_necessary()\n  File \"/opt/al_service/suricata_/suricata_.py\", line 145, in start_suricata_if_necessary\n    raise RecoverableError(e)\nRecoverableError: RetryError[Attempts: 15, Value: False]"}
{"@timestamp": "2024-04-22 16:25:03,646", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "f2680f09c0a4" }, "log": { "level": "INFO", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "[77rC62QgvSPJeHvVdIp6dY] Saving error to: /tmp/77rC62QgvSPJeHvVdIp6dY_3382b9a14870049b854563106c33caa3da560b215a60e175e0aedfeff27a562b_error.json"}

The /var/log/suricata/suricata.log is finished at 14:30 with a fail of loading some signatures, what isn't a reason to stop Suricata, and looks the same as when everything works.

@kam193 kam193 added assess We still haven't decided if this will be worked on or not bug Something isn't working labels Apr 22, 2024
@cccs-rs
Copy link
Contributor

cccs-rs commented May 28, 2024

Can you confirm if this is still happening on more recent versions of the Suricata service?

@kam193
Copy link
Author

kam193 commented May 28, 2024

I've just disabled my workaround (script looking for stuck Suricata services every 30 minutes, it was impossible to work without it) - I'll come back in a few days with results.

@kam193
Copy link
Author

kam193 commented May 30, 2024

I'm confirming this is still an issue - with Suricata 4.5.0.10 and AL v26

@cccs-rs
Copy link
Contributor

cccs-rs commented Jun 6, 2024

Want to try testing with the latest stable?

The service was updated to use Suricata 8.0-dev and a flag was added to the command to run as daemon which I'm hoping should resolve the issue mentioned (unsure if that flag was available in previous versions).

@kam193
Copy link
Author

kam193 commented Jun 6, 2024

So... I'm not really able to test it, as on my server the newest Suricata (service v16 or even v15) doesn't even start:

{"@timestamp": "2024-06-06 20:59:55,903", "event": { "module": "assemblyline", "dataset": "assemblyline.service.suricata" }, "host": { "ip": "x.x.x.x", "hostname": "6879c90012be" }, "log": { "level": "ERROR", "logger": "assemblyline.service.suricata" }, "process": { "pid": "1" }, "message": "Exiting:\nTraceback (most recent call last):\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/assemblyline_core/server_base.py\", line 100, in run\n    self.try_run()\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/assemblyline_v4_service/run_privileged_service.py\", line 148, in try_run\n    self.service_tool_version = self.service.get_tool_version()\n                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/opt/al_service/suricata_/suricata_.py\", line 114, in get_tool_version\n    return f\"{self.get_suricata_version()}.r{self.rules_hash}\"\n              ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/opt/al_service/suricata_/suricata_.py\", line 107, in get_suricata_version\n    return safe_str(subprocess.check_output([\"suricata\", \"-V\"]).strip().replace(b\"This is Suricata version \", b\"\"))\n                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/usr/local/lib/python3.11/subprocess.py\", line 466, in check_output\n    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/usr/local/lib/python3.11/subprocess.py\", line 571, in run\n    raise CalledProcessError(retcode, process.args,\nsubprocess.CalledProcessError: Command '['suricata', '-V']' died with <Signals.SIGILL: 4>.\n"}

What is caused by:

kernel: traps: suricata[2946516] trap invalid opcode ip:56509c8df61c sp:7ffd368f99e0 error:0 in suricata[56509c8c5000+5e1000]

Indicating incompatible binary instruction... Strange.

To be honest, even when I believe that looking for the root cause is important, my suggestion is to rather prepare for cases like surprisingly dead suricata process. When it's not the main process in the docker container, it could be killed for any reason (even manually - good for testing), and the container will live further. I'd add a check in the error handling to see if the process is still running or exited.

@cccs-rs
Copy link
Contributor

cccs-rs commented Nov 26, 2024

@kam193 Would you be able to evaluate the latest stable release of Suricata after the patch that @jasper-vdhoven made?

I believe this should resolve the longstanding issue we've had with Suricata.

@kam193
Copy link
Author

kam193 commented Nov 29, 2024

Hey, I re-enabled Suricata (previously, I had to disable it as it was almost always producing just errors) and let it work for a while. Unfortunately, it doesn't look like fixed at all :( I see the log about removing stale PID, but tasks are again not processed. AL keeps pre-empting tasks with an error:

The number of retries has passed the limit.

Submitted to queue at 2024-11-29T10:38:55.908986Z
Popped from queue and running at 2024-11-29T11:15:40.838524Z on worker 1a2e7bd9738e
Service error:   File "/var/lib/assemblyline/.local/lib/python3.11/site-packages/assemblyline_v4_service/common/base.py", line 181, in handle_task
    self.execute(request)
  File "/opt/al_service/suricata_/suricata_.py", line 248, in execute
    self.start_suricata_if_necessary()
  File "/opt/al_service/suricata_/suricata_.py", line 170, in start_suricata_if_necessary
    raise RecoverableError(retry_error) from retry_error
RecoverableError: RetryError[<Future at 0x7fdd71400e10 state=finished returned bool>]
Submitted to queue at 2024-11-29T11:17:45.895559Z
Popped from queue and running at 2024-11-29T11:55:19.236295Z on worker 1a2e7bd9738e
Service error:   File "/var/lib/assemblyline/.local/lib/python3.11/site-packages/assemblyline_v4_service/common/base.py", line 181, in handle_task
    self.execute(request)
  File "/opt/al_service/suricata_/suricata_.py", line 248, in execute
    self.start_suricata_if_necessary()
  File "/opt/al_service/suricata_/suricata_.py", line 170, in start_suricata_if_necessary
    raise RecoverableError(retry_error) from retry_error
RecoverableError: RetryError[<Future at 0x7fdd7300fd90 state=finished returned bool>]
Submitted to queue at 2024-11-29T11:57:24.458721Z
Popped from queue and running at 2024-11-29T12:30:47.260748Z on worker 1a2e7bd9738e
Service error:   File "/var/lib/assemblyline/.local/lib/python3.11/site-packages/assemblyline_v4_service/common/base.py", line 181, in handle_task
    self.execute(request)
  File "/opt/al_service/suricata_/suricata_.py", line 248, in execute
    self.start_suricata_if_necessary()
  File "/opt/al_service/suricata_/suricata_.py", line 170, in start_suricata_if_necessary
    raise RecoverableError(retry_error) from retry_error
RecoverableError: RetryError[<Future at 0x7fdd71ab1e90 state=finished returned bool>]

@cccs-rs
Copy link
Contributor

cccs-rs commented Nov 29, 2024

Is there anything you can tell me about your deployment of Suricata to see if I can re-create the same conditions that's causing it to fail?

  • Are there any additional sources besides the default set?
  • Is the resource allocation different than default?
  • Are the pre-emptions on certain kinds of tasks (ie. filesize dependent) or all?

@kam193
Copy link
Author

kam193 commented Nov 29, 2024

Sure, let's see:

ad. 1:

obraz

ad. 2:
What can be important: I have a docker-compose setup, the minimum number of service instances is 0, so they are started when needed.

obraz
obraz

ad. 3:

I don't have currently any pcap file that I've noticed to be processed at all 😅 According to my observation, it doesn't matter - files are preempted because the Suricata wasn't working and never started doing anything.

I don't have the newest analysis, my previous understanding was, that starting the Suricata process was not always successful and it was dying. I didn't see any clear reason why - no logs in Suricata, no oom-kills etc., and the service wasn't checking if the process still exists, just trying to connect in a loop. For some time, I was trying to overcome it by regularly checking the logs and eventually killing the container, but some time ago something changed and it basically stopped processing anything. Finally, I gave up and disabled the service.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
assess We still haven't decided if this will be worked on or not bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants