Exclusive queues can be deleted without the consumers being notified #12949
Comments
Thank you @mkuratczyk for creating the bug ticket. Since this issue is important for us, we would like to try contributing a fix ourselves. Sending the consumer cancel for any auto-delete queue when the partition resolves would be sufficient, and it is probably the only thing we could fix. I will have a look at the code and come back with a proposal, ideas, or maybe some questions. Thanks again!
Hi @mkuratczyk, an update on this issue: I tried out a few things and, in the end, to solve the problem for us, I implemented and tested the idea below:
This has worked for us: we receive the consumer cancel after a partition resolves. I also didn't see any obvious issues in what I tested. That said, the code is only a first draft and still needs to be polished and tested further. What I would like now is some early feedback from you on whether this approach is sound and would be acceptable to merge into the main repositories. If not, we need other ideas :) Thanks,
Hi! Thank you for taking the time to dig into this and provide an implementation, I really appreciate it! The feature you are adding to Khepri already exists; perhaps it is incorrectly documented, or it didn't match your expectations or use case? See Stored procedures and triggers. Someone suggested a simpler API than stored procedures for triggers a few years ago. I filed rabbitmq/khepri#57 to keep track of it but have not gotten around to implementing it so far. That idea might be a better fit for this use case. Now on the RabbitMQ side, I don't know what the right approach should be yet, I need to think about this. I mean, I'm not sure if the queue should be deleted in the first place. What happens with Mnesia by the way?
@Rmarian thank you for taking the time to contribute. Have you performed your test with, say, 200K queues? In our experience, what works well with 5 queues won't necessarily work as well with hundreds of thousands. Khepri overload would be a very serious operational problem for a cluster.
Hi @dumbbell,
"The feature you are adding to Khepri already exists, perhaps it’s incorrectly documented or it didn’t match your expectations or use case?"
"Now on the RabbitMQ side, I don’t know what the right approach should be yet, I need to think about this. I mean, I’m not sure if the queue should be deleted in the first place. What happens with Mnesia by the way?"
Not yet, but it is on my to-do list. I am thinking about what such a test would look like:
I see, so the "notification through disconnection" is kind of a byproduct of the partition recovery strategy. Thus, the code to explicitly notify consumers in this case might be missing in the first place.
@dumbbell the code to notify consumers when a queue is stopped is already present here. If I understand @Rmarian correctly, the problem is that the queue record is deleted in Khepri while the actual queue process keeps running, even after the partition resolves (a "zombie" process). Hence, the client doesn't get notified.
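To make the "zombie" state concrete, here is a minimal toy model in plain Python. It is not RabbitMQ or Khepri code: the names `QueueProcess`, `Node`, `zombies` and `reconcile` are hypothetical and exist only for illustration, and the metadata store is reduced to a set of queue names. The `reconcile` step is one possible reading of the idea of sending a consumer cancel once the partition resolves, not the actual implementation discussed in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class QueueProcess:
    """Toy stand-in for a running queue process and its consumers."""
    name: str
    consumers: list = field(default_factory=list)
    alive: bool = True


@dataclass
class Node:
    """Toy node: local queue processes plus a log of cancels sent to clients."""
    processes: dict = field(default_factory=dict)
    cancels_sent: list = field(default_factory=list)

    def reconcile(self, metadata: set) -> list:
        """Hypothetical post-partition check: stop every local queue process
        whose record is gone from the metadata store, notifying its consumers."""
        stopped = []
        for name, proc in self.processes.items():
            if proc.alive and name not in metadata:
                for consumer in proc.consumers:
                    self.cancels_sent.append((consumer, name))
                proc.alive = False
                stopped.append(name)
        return stopped


def zombies(node: Node, metadata: set) -> list:
    """Queue processes still running although their record was deleted."""
    return sorted(n for n, p in node.processes.items()
                  if p.alive and n not in metadata)


# Metadata store as seen by the cluster (here: just a set of queue names).
metadata = {"excl-1", "excl-2"}
node2 = Node(processes={
    "excl-1": QueueProcess("excl-1", ["consumer-a"]),
    "excl-2": QueueProcess("excl-2", ["consumer-b"]),
})

# While node2 is partitioned, the other nodes delete its exclusive queue records.
metadata.clear()

# After the partition heals, the processes still run and no client was told.
print(zombies(node2, metadata))  # ['excl-1', 'excl-2']
print(node2.cancels_sent)        # []

# The reconcile step sends one cancel per consumer and stops the processes.
node2.reconcile(metadata)
print(node2.cancels_sent)
print(zombies(node2, metadata))  # []
```

In this model the bug is simply that nothing like `reconcile` runs on the previously partitioned node, so consumers keep waiting on a queue that no longer exists in the metadata store.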
Thank you @ansd!
Describe the bug
Given a 3-node cluster with Khepri enabled, if a node is partitioned (this certainly happens when the leader is partitioned; I'm not sure whether it can also happen when a follower is partitioned), exclusive queues can be deleted without their corresponding connections being notified or terminated. From the client's perspective, it just looks like there are no messages.
Originally reported in:
#12829 (comment)
Reproduction steps
Expected behavior
We discussed removing the special case for exclusive queues altogether. Currently, they are considered transient internally, but with Khepri, there's really no concept of a transient declaration, since all declarations are persisted in Khepri's log. Getting rid of the whole "transient" concept once and for all should simplify the code and documentation, and in general make RabbitMQ easier to use and understand.
Either way, the queues should not be deleted or the consumers should be notified.
Additional context
All logs from a reproduction:
rmq-server-0.log
rmq-server-1.log
rmq-server-2.log
perf-test -H amqp://rmq-server-2.rmq-nodes -E -t fanout -e amq.fanout -y 10 -r 1
- all connections and exclusive queues are on rmq-server-2; as expected, perf-test publishes 1 message per second and consumes 10 messages per second (fanout to 10 exclusive queues)
The logs contain some additional debug logs which show that all nodes attempted to delete the exclusive queues on a down node, but:
10 transient queues from node '[email protected]' deleted in 0.066262s
) and sent notifications, but this was not effective, since server-2 didn't get the memo