Docker service scaling recovery problems #291

Open
precise0 opened this issue Dec 17, 2024 · 4 comments
Labels
assess: We still haven't decided if this will be worked on or not
bug: Something isn't working

Comments

@precise0

Describe the bug
In some cases a service does not scale back up after its instances fail, whether from OOM or another cause. Overall throughput of the deployment then drops to nothing, because a service with samples queued for processing has 0 instances running. Disabling and re-enabling the service in the Administrator panel seems to help. I have seen this affect every type of service that ships with Assemblyline. Recently I have had to apply that remedy to CAPA (4.5.0.stable9), DeobfuScripter (4.5.0.stable14), Batchdeobfuscator (4.5.0.stable19), and Espresso (4.5.0.stable7).

To Reproduce
Steps to reproduce the behavior:

  1. Normal Docker deployment
  2. Run for a number of days at around 10k samples per day
  3. After enough errors accumulate for a given service, this failure mode eventually appears.

Expected behavior
After a service fails, it should recover within a reasonable time.

Screenshots
N/A

Environment (please complete the following information if pertinent):
Assemblyline Docker deployment 0.4.5 stable, last updated 2 weeks ago

Additional context
I have created a service that detects this condition using the client socketio log listener, then disables and re-enables the afflicted service. I have been running it for about 4 days now and throughput has improved greatly. However, I wanted to pass this along in the hope of finding a root cause.
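The detect-and-restart idea described above can be sketched roughly as follows. This is not the actual watchdog: `ServiceStatus`, `find_stalled`, `bounce`, and the `set_enabled` callback are hypothetical names, and how the queue depth and instance count are obtained (for example from the socketio log listener) is deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class ServiceStatus:
    """Hypothetical snapshot of one service; field names are illustrative."""
    name: str
    queue_depth: int  # samples waiting for this service
    instances: int    # containers currently running for this service


def find_stalled(statuses: Iterable[ServiceStatus]) -> list[str]:
    """Return names of services that have queued work but no running instance."""
    return [s.name for s in statuses if s.queue_depth > 0 and s.instances == 0]


def bounce(name: str, set_enabled: Callable[[str, bool], None]) -> None:
    """Disable, then re-enable a service.

    `set_enabled` is an assumed callback that would wrap whatever API call
    toggles the service; it is not a real Assemblyline client function.
    """
    set_enabled(name, False)
    set_enabled(name, True)
```

A service with an empty queue and 0 instances is intentionally not treated as stalled, since scaling to zero while idle can be normal behavior.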

precise0 added the assess and bug labels Dec 17, 2024
@cccs-rs
Contributor

cccs-rs commented Dec 18, 2024

Are there any error logs surrounding these incidents, such as from the dispatcher or scaler containers?

@precise0
Author

The dispatcher doesn't seem to have any errors, but the scaler threw quite a few; here is an example. It may correlate with the issue, but that is hard to pinpoint exactly. Either way, the scaler doesn't seem to be operating normally.

{
  "@timestamp": "2024-12-08 23:38:38,879",
  "event": {
    "module": "assemblyline",
    "dataset": "assemblyline.scaler"
  },
  "host": {
    "ip": "x.x.x.x",
    "hostname": "9b0f4e6f0c71"
  },
  "log": {
    "level": "ERROR",
    "logger": "assemblyline.scaler"
  },
  "process": {
    "pid": "1"
  },
  "message": "Crash in scaler: update_scaling\nTraceback (most recent call last):\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/urllib3/connectionpool.py\", line 789, in urlopen\n    response = self._make_request(\n               ^^^^^^^^^^^^^^^^^^^\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/urllib3/connectionpool.py\", line 495, in _make_request\n    conn.request(\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/urllib3/connection.py\", line 441, in request\n    self.endheaders()\n  File \"/usr/local/lib/python3.11/http/client.py\", line 1298, in endheaders\n    self._send_output(message_body, encode_chunked=encode_chunked)\n  File \"/usr/local/lib/python3.11/http/client.py\", line 1058, in _send_output\n    self.send(msg)\n  File \"/usr/local/lib/python3.11/http/client.py\", line 996, in send\n    self.connect()\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/docker/transport/unixconn.py\", line 26, in connect\n    sock.connect(self.unix_socket)\nBlockingIOError: [Errno 11] Resource temporarily unavailable\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/requests/adapters.py\", line 667, in send\n    resp = conn.urlopen(\n           ^^^^^^^^^^^^^\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/elasticapm/instrumentation/packages/base.py\", line 213, in call_if_sampling\n    return self.call(module, method, wrapped, instance, args, kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/elasticapm/instrumentation/packages/urllib3.py\", line 132, in call\n    response = wrapped(*args, **kwargs)\n               ^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/urllib3/connectionpool.py\", line 843, in 
urlopen\n    retries = retries.increment(\n              ^^^^^^^^^^^^^^^^^^\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/urllib3/util/retry.py\", line 474, in increment\n    raise reraise(type(error), error, _stacktrace)\n          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/urllib3/util/util.py\", line 38, in reraise\n    raise value.with_traceback(tb)\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/urllib3/connectionpool.py\", line 789, in urlopen\n    response = self._make_request(\n               ^^^^^^^^^^^^^^^^^^^\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/urllib3/connectionpool.py\", line 495, in _make_request\n    conn.request(\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/urllib3/connection.py\", line 441, in request\n    self.endheaders()\n  File \"/usr/local/lib/python3.11/http/client.py\", line 1298, in endheaders\n    self._send_output(message_body, encode_chunked=encode_chunked)\n  File \"/usr/local/lib/python3.11/http/client.py\", line 1058, in _send_output\n    self.send(msg)\n  File \"/usr/local/lib/python3.11/http/client.py\", line 996, in send\n    self.connect()\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/docker/transport/unixconn.py\", line 26, in connect\n    sock.connect(self.unix_socket)\nurllib3.exceptions.ProtocolError: ('Connection aborted.', BlockingIOError(11, 'Resource temporarily unavailable'))\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/assemblyline_core/scaler/scaler_server.py\", line 414, in with_logs\n    fn(*args, **kwargs)\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/assemblyline_core/scaler/scaler_server.py\", line 742, in update_scaling\n    raw_targets = self.controller.get_targets()\n             
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/assemblyline_core/scaler/controllers/docker_ctl.py\", line 360, in get_targets\n    return {name: self.get_target(name) for name in names}\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/assemblyline_core/scaler/controllers/docker_ctl.py\", line 360, in <dictcomp>\n    return {name: self.get_target(name) for name in names}\n                  ^^^^^^^^^^^^^^^^^^^^^\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/assemblyline_core/scaler/controllers/docker_ctl.py\", line 347, in get_target\n    for container in self.client.containers.list(filters=filters, ignore_removed=True):\n                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/docker/models/containers.py\", line 1018, in list\n    containers.append(self.get(r['Id']))\n                      ^^^^^^^^^^^^^^^^^\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/docker/models/containers.py\", line 954, in get\n    resp = self.client.api.inspect_container(container_id)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/docker/utils/decorators.py\", line 19, in wrapped\n    return f(self, resource_id, *args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/docker/api/container.py\", line 794, in inspect_container\n    self._get(self._url(\"/containers/{0}/json\", container)), True\n    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/docker/utils/decorators.py\", line 44, in inner\n    return f(self, *args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^^^\n  File 
\"/var/lib/assemblyline/.local/lib/python3.11/site-packages/docker/api/client.py\", line 246, in _get\n    return self.get(url, **self._set_request_timeout(kwargs))\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/requests/sessions.py\", line 602, in get\n    return self.request(\"GET\", url, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/requests/sessions.py\", line 589, in request\n    resp = self.send(prep, **send_kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/elasticapm/instrumentation/packages/base.py\", line 213, in call_if_sampling\n    return self.call(module, method, wrapped, instance, args, kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/elasticapm/instrumentation/packages/requests.py\", line 58, in call\n    response = wrapped(*args, **kwargs)\n               ^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/requests/sessions.py\", line 703, in send\n    r = adapter.send(request, **kwargs)\n        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/var/lib/assemblyline/.local/lib/python3.11/site-packages/requests/adapters.py\", line 682, in send\n    raise ConnectionError(err, request=request)\nrequests.exceptions.ConnectionError: ('Connection aborted.', BlockingIOError(11, 'Resource temporarily unavailable'))\n"
}
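Because the traceback in this record is chained three levels deep, it can help to reduce it to one summary line per exception. The following is a minimal sketch, assuming records shaped like the one above with the traceback in the `message` field; `exception_chain` is a hypothetical helper name, and the sample record is heavily abbreviated.

```python
import json


def exception_chain(record_json: str) -> list[str]:
    """Return the summary line of each chained exception, root cause first."""
    message = json.loads(record_json)["message"]
    chain = []
    in_traceback = False
    for line in message.splitlines():
        if line.startswith("Traceback (most recent call last):"):
            in_traceback = True
        elif in_traceback and line and not line[0].isspace():
            # The first unindented line after the frames is the exception itself.
            chain.append(line)
            in_traceback = False
    return chain


# Abbreviated stand-in for the scaler record; most frames are omitted.
sample = json.dumps({"message": (
    "Crash in scaler: update_scaling\n"
    "Traceback (most recent call last):\n"
    '  File "unixconn.py", line 26, in connect\n'
    "    sock.connect(self.unix_socket)\n"
    "BlockingIOError: [Errno 11] Resource temporarily unavailable\n"
    "\n"
    "During handling of the above exception, another exception occurred:\n"
    "\n"
    "Traceback (most recent call last):\n"
    '  File "adapters.py", line 682, in send\n'
    "    raise ConnectionError(err, request=request)\n"
    "requests.exceptions.ConnectionError: ('Connection aborted.', "
    "BlockingIOError(11, 'Resource temporarily unavailable'))\n"
)})

for summary in exception_chain(sample):
    print(summary)
```

Read this way, the root cause in the record is the `BlockingIOError` raised while connecting to the Docker unix socket, which surfaces in `update_scaling` as a `requests.exceptions.ConnectionError`.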

@cccs-rs
Contributor

cccs-rs commented Dec 18, 2024

This seems to be an issue that occurs under high I/O, when the socket used to talk to Docker is temporarily "unavailable".

Since this does seem to happen during update_scaling, might I suggest setting SCALE_INTERVAL to something higher than the current check every 5s?

Currently the only way to do this is by mounting over that file in the scaler container, but if increasing that value does work, we could make it more configurable.
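Separately from the interval change, a transient "Resource temporarily unavailable" error is the kind of failure that is commonly absorbed with a retry and exponential backoff. The sketch below only illustrates that general technique; it is not how the scaler currently behaves, and `call_with_retry` and all of its parameters are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.1,
    transient: tuple[type[BaseException], ...] = (BlockingIOError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on transient errors with exponential backoff.

    The default `transient` tuple uses built-in exceptions only. A caller
    going through the Docker SDK would likely need to add
    requests.exceptions.ConnectionError, which is the type seen at the end
    of the traceback and is not a subclass of the built-in ConnectionError.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except transient:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Retrying only hides short spikes. If the Docker socket stays saturated, the last error is still raised, so a longer SCALE_INTERVAL may still be needed.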

@precise0
Author

I can certainly try this, thank you. I'll report back when I get it working.


2 participants