Error bringing back up failed master #63
Comments
Hmm, I haven't seen that particular error before. Which version of Redis is this? In this scenario, the resurrected node (the previous master) should be made a slave of the newly promoted master node, which we do by sending it a "slaveof" command. After that happens, we issue periodic "state reports" for each node. A state report is really just the result of sending an "INFO" command to the Redis node. I wonder if Redis is somehow in a "read-only" mode and rejects INFO commands when in that mode?
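To make the "state report" idea concrete, here is a rough sketch of turning the raw text of an INFO reply into a small role summary. This is not the redis_failover implementation. `parse_info`, `state_report` and `SAMPLE_INFO` are hypothetical names made up for illustration, and the only INFO fields assumed are `role`, `master_host`, `master_port` and `master_link_status` from the replication section.

```ruby
# Parse the raw text of an INFO reply into a hash of field => value.
# Section headers ("# Replication") and blank lines are skipped.
def parse_info(raw)
  raw.each_line.with_object({}) do |line, fields|
    line = line.strip
    next if line.empty? || line.start_with?('#')
    key, value = line.split(':', 2)
    fields[key] = value unless value.nil?
  end
end

# Hypothetical helper: summarise a node's replication state from INFO output.
def state_report(raw)
  info = parse_info(raw)
  report = { role: info['role'] }
  if info['role'] == 'slave'
    report[:master] = "#{info['master_host']}:#{info['master_port']}"
    report[:link] = info['master_link_status']
  end
  report
end

# Hand-written sample resembling what a resurrected 7777 node might report
# after being made a slave of 8888 (illustrative, not captured from a server).
SAMPLE_INFO = "# Replication\r\nrole:slave\r\nmaster_host:localhost\r\n" \
              "master_port:8888\r\nmaster_link_status:up\r\n"

report = state_report(SAMPLE_INFO)
puts report.inspect
```

If the old master came back up still reporting `role:master`, a summary like this would show the mismatch right away.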
2.8.3
You must have 'slave-read-only yes' set in your redis.conf: https://github.com/antirez/redis/blob/unstable/redis.conf#L223 Your old master process is coming back up thinking it's master (probably set in redis.conf), at which point node_manager::update_master_state is called. Additionally, while I haven't spent much time looking at the node manager code, it seems to me that the node.rb wait() health-checking functionality simply won't work against read-only slaves (the default since 2.6), since BLPOP and DEL are write operations. I can confirm that only the master node is getting redis_failover* ops. What is weird is that there should be a NodeWatcher thread attempting (and failing) this against the slaves as well, yet I'm not seeing those read-only exceptions in the logs.
I believe this might be something new in 2.8. Slaves are now read only by default.
That definitely sounds like what's happening here @arohter given that slaves are read-only by default in 2.8. The Node Watcher definitely uses the blpop/del/lpush redis commands when watching a node's status. I went that route to avoid having to be in a busy-wait loop, but maybe that's what we should do given that slaves are read-only by default. @arohter, is this something that you'd be interested in modifying? I think getting redis_failover working with 2.8 and Ruby 2.0 would be great along with the other fixes that you've already submitted. I'll create an issue for it.
Actually, slaves became read-only by default in 2.6, not just 2.8 :( We're investigating this issue now. There are really two issues here: 1) the one reported above, where the wakeup() method is called on cluster reconfig, and 2) slave node_watcher health checking generally. This leads to a couple of questions. Is there a reason for using BLPOP beyond avoiding the busy-wait loop? It can be handy for immediately detecting node process failures, but other parts of the code (specifically the snapshot processing loop) still rely on relatively long (5s) intervals (unless there's some zk Watch code I've failed to see), so I'm not sure this is an active feature. The same comment applies to the wakeup() method. I just want to make sure there are no other reasons/advantages for using BLPOP. I vaguely recall (from digging into the history a while back) that you were using a polling loop before switching to BLPOP. What was the main motivator for changing? Anything we should be on alert for? I'm assuming a switch to a simple INFO or PING polling health check, although I shall ponder whether there might be anything more sophisticated/better.
I think switching back to a polling approach using PING or INFO should be fine. I can't recall at this time why I switched, other than to avoid hitting redis servers with commands every N seconds, but like you said, it's probably not a big deal at all and sounds like it's what we need to do now that slaves are read-only by default in 2.6. Thanks for taking this on!
I think we'll use ECHO, since it properly fails when slave-serve-stale-data is set to no.
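For anyone following along, an ECHO-based polling check could look roughly like the sketch below. It is only an illustration under stated assumptions, not the code that went into redis_failover. `node_alive?`, `watch_node`, the `'health_check'` payload and the `interval`/`max_failures`/`checks` parameters are all hypothetical. `client` is assumed to be any object that responds to `echo(value)`, as a redis-rb `Redis` instance does.

```ruby
# Returns true when the node echoes the payload back unchanged.
# Any error (connection refused, error reply, timeout) counts as a failure.
def node_alive?(client, payload = 'health_check')
  client.echo(payload) == payload
rescue StandardError
  false
end

# Poll the node up to `checks` times, sleeping `interval` seconds between polls.
# Returns :unavailable as soon as `max_failures` consecutive checks fail,
# otherwise :available once all polls are done.
def watch_node(client, interval: 5, max_failures: 3, checks: 10)
  failures = 0
  checks.times do
    if node_alive?(client)
      failures = 0 # a healthy reply resets the streak
    else
      failures += 1
      return :unavailable if failures >= max_failures
    end
    sleep(interval)
  end
  :available
end

# Stand-in clients so the sketch runs without a Redis server.
class HealthyClient
  def echo(value)
    value
  end
end

class FailingClient
  def echo(_value)
    raise 'simulated error reply'
  end
end

puts watch_node(HealthyClient.new, interval: 0, checks: 3)
puts watch_node(FailingClient.new, interval: 0, max_failures: 2)
```

Since ECHO is not a write, a check like this should also work against read-only slaves, unlike the BLPOP/DEL approach. Requiring several consecutive failures is just one way to avoid flapping on a single dropped reply.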
I have master localhost:7777 and slave localhost:8888
I take down 7777 and everything switches over to 8888 without a problem. I bring 7777 back up and Redis properly switches itself to a slave of 8888 but I see this error in the log output of the leader redis_node_manager. Any idea what's wrong?