High Disk usage without any messages queued #852

Closed

viktorerlingsson opened this issue Nov 18, 2024 · 8 comments

viktorerlingsson (Member) commented Nov 18, 2024

Describe the bug
After running LavinMQ for a while, the disk fills up even if there are no messages left in any queues.

df reports usage, but du does not show it, pointing towards memory-mapped files being deleted but not unmapped.
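
A quick way to check for this symptom (a minimal sketch; the data directory path and process name below are assumptions based on a default install):

# Filesystem-level usage counts deleted-but-still-open files...
df -h /var/lib/lavinmq

# ...while a directory walk does not, so du reports less than df.
du -sh /var/lib/lavinmq

# List open files with a link count below one, i.e. files that were
# deleted but are still held open or mapped by the process.
lsof +L1 -p "$(pidof lavinmq)"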

Describe your setup

How to reproduce
Not sure yet.

Expected behavior
LavinMQ should not use any significant disk space if there are no messages queued.

viktorerlingsson changed the title from High Disk usage to High Disk usage without any messages queued on Nov 18, 2024
kickster97 added the bug label on Nov 18, 2024
viktorerlingsson removed their assignment on Dec 17, 2024
fkollmann commented

We experience this behavior on our PROD system. Maybe file descriptors are being leaked?

We are keeping an eye on this issue and will inject lsof into the image to continue our diagnosis.

fkollmann commented Jan 8, 2025

This is the disk-usage behavior we see (on Dec 31st, 2024):

[screenshot: 2024-12-30_19-25]

It only happens on PROD and runs totally fine on DEV. And it only affects the master node, never one of the followers.

This is before the restart:

[screenshot: 2024-12-30_19-03]

And after the restart:

[screenshot: 2024-12-30_19-04]

viktorerlingsson (Member, Author) commented

Thanks for the extra information @fkollmann 👍

Sorry for not updating earlier. We're aware of what's causing the issue, but we haven't been able to come up with a good solution yet.
The issue is that memory-mapped files are sometimes deleted but not unmapped (finalize is not properly called for them). When this happens, lsof should show a number of memory-mapped files that are marked as deleted but not released by the process.
It seems to happen somewhat randomly, and we think garbage collection is the culprit. It also seems to happen more frequently when an instance is under high load, which explains why you see this more often in PROD than in DEV environments.

Sending multiple USR2 signals in succession (killall -USR2 lavinmq or pkill -USR2 lavinmq), forcing LavinMQ to run GC, might work as a work-around for now if you do not wish to restart LavinMQ, but we've had mixed results with it.
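
As a sketch of that work-around in practice (the retry loop and the lsof verification step are illustrative assumptions, not part of LavinMQ):

# Send USR2 a few times to force GC runs, pausing between signals.
for i in 1 2 3; do
    pkill -USR2 lavinmq
    sleep 5
done

# Files marked as deleted but still mapped should drop out of this list.
lsof +L1 -p "$(pidof lavinmq)"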

fkollmann commented

Thanks for the feedback! This matches what we see: more open files than files actually present in the file system:

kubectl exec lavinmq-2 --namespace uplift -c lavinmq -- sh -c 'find /var/lib/lavinmq/42099b4af021e53fd8fd4e056c2568d7c2e3ffa8 -type f | wc -l'
--> 293

kubectl exec lavinmq-2 --namespace uplift -c lavinmq -- sh -c 'lsof +D /var/lib/lavinmq/42099b4af021e53fd8fd4e056c2568d7c2e3ffa8 | wc -l'
--> 352

The workaround indeed does help. Running it once freed the disk space:

kubectl exec lavinmq-2 --namespace uplift -c lavinmq -- pkill -USR2 lavinmq

Thank you very much for this! We will add a container which runs this on a regular basis.

fkollmann commented

The workaround works fine for us:

[screenshot: 2025-01-11_13-17]

This is what we did:

In the k8s manifest, we added a container which sends the USR2 signal to the LavinMQ process:

    spec:
      shareProcessNamespace: true # required to allow sending signal from 'garbage-collect' to 'lavinmq'

      containers:
      - name: lavinmq
        ....

      - name: garbage-collect
        ....

        command: [ "/usr/local/bin/sp_garbage_collect.sh" ]

We use the following script to send the signal:

#!/bin/sh

# This script manually triggers the garbage collection of LavinMQ.
#
# There is currently a bug which prevents LavinMQ from actually freeing
# disk space, because garbage collection is not triggered correctly.
#
# For more details, see https://github.com/cloudamqp/lavinmq/issues/852

echo "Starting garbage collection every 100 minutes..."

while true
do
    sleep 100m

    echo "Triggering garbage collection..."

    pkill -USR2 lavinmq
done

Hope this helps anyone else who has this issue.
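
As a rough way to confirm each pass actually reclaims space (the path below is an assumption; adjust to your data directory), watch the gap between df and du shrink after a signal:

# df counts deleted-but-still-mapped files, du does not;
# the difference between the two is the leaked space.
df /var/lib/lavinmq
du -s /var/lib/lavinmq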

viktorerlingsson (Member, Author) commented

This should be fixed with the release of LavinMQ v2.1.0. Please upgrade to that and let us know if any problems remain!

fkollmann commented

I can confirm that the issue is fixed for our environment:

[screenshot]

Thanks for the hard work!

viktorerlingsson (Member, Author) commented

> I can confirm that the issue is fixed for our environment

That's great, thanks for verifying!
