Stuck Verifier #46
Comments
If there's no red outline, it is almost certainly (~99.998% chance) a different issue. Also, the area around that block was rather uneventful.
Due to delays in vote flipping, we would never see such close times if there were any question about which block would be frozen. Looking at your log, it appears that a temporary communication issue caused the verifier to miss votes for a block. We had a similar issue when one of our Digital Ocean droplets (their term for a VPS) went down last week, but as soon as the droplet came back to life, the verifier was able to sync back up.

The "removed node from new out-of-cycle queue due to size" message is not a problem. This map is kept small (1,000 entries) to protect the verifier, and the only consequence of this removal is that some new verifiers will be temporarily delayed in their eligibility for the lottery.

So, a few interesting questions here are:
Side note: I had mentioned that we tend to assume honest misconfiguration, not malice, when we see high bandwidth usage from out-of-cycle verifiers. This was due to some earlier bandwidth issues we saw from some misconfigured out-of-cycle verifiers. I had forgotten the amount of bandwidth, though, and I had forgotten that the release notes for v586 noted it. At the time, we were consuming over 14TB of traffic per day to serve the seed transaction files, and this traffic was actually generated by a rather small group of out-of-cycle verifiers (about 50, I think).
Thanks for looking into it.
I'm not a Java pro, but if OOM errors can play havoc, or some background threads stop under some circumstances (I saw that while debugging the sentinel, IIRC: some threads stopped because of an unlogged error, leaving the whole thing in a not-fully-working state), couldn't the process "just" log and stop when this happens, so it can be restarted cleanly by supervisord?
Your hardware is definitely not a problem. :) The idea of logging/dying is really good. It's certainly preferable to getting stuck in a non-working state, and it might also help us identify and fix issues faster. This could probably be implemented robustly in the current design with checkpoints in all long-running threads and a watchdog that does a state dump and termination if any checkpoint is untouched for a specified interval. Thank you for the detailed info. We'll dig a little more and let you know what we find.
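A minimal sketch of that checkpoint/watchdog idea, assuming each long-running thread calls a `touch()` method at the top of its loop. The class and names below are hypothetical and are not part of the current verifier code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the checkpoint/watchdog idea: each long-running thread
// "touches" its checkpoint regularly; a watchdog thread dumps state and exits
// if any checkpoint goes stale, so supervisord (or systemd) can restart cleanly.
public class ThreadWatchdog {

    private static final long staleThresholdMillis = 5L * 60L * 1000L;  // assumed 5-minute limit
    private static final Map<String, Long> checkpoints = new ConcurrentHashMap<>();

    // Called by each long-running thread at the top of its main loop.
    public static void touch(String threadName) {
        checkpoints.put(threadName, System.currentTimeMillis());
    }

    // Started once at verifier startup.
    public static void start() {
        Thread watchdog = new Thread(() -> {
            while (true) {
                long now = System.currentTimeMillis();
                for (Map.Entry<String, Long> entry : checkpoints.entrySet()) {
                    if (now - entry.getValue() > staleThresholdMillis) {
                        System.err.println("checkpoint stale for thread " + entry.getKey() +
                                "; dumping state and terminating");
                        for (Map.Entry<String, Long> e : checkpoints.entrySet()) {
                            System.err.println("  " + e.getKey() + ": " +
                                    (now - e.getValue()) + "ms since last checkpoint");
                        }
                        System.exit(1);  // supervisord restarts the process
                    }
                }
                try {
                    Thread.sleep(10000L);
                } catch (InterruptedException ignored) { }
            }
        }, "watchdog");
        watchdog.setDaemon(true);
        watchdog.start();
    }
}
```

With supervisord already managing the process, the `System.exit(1)` is enough to get a clean restart instead of a stuck-but-running verifier.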
Another verifier, likely the same issue. Verifier stuck at block 16216601
logs when stuck:
And so on. I rewound the logs to find the moment it went from tracking OK to stuck:
(No more exceptions sending BlockWithVotesRequest37 messages after the restart, but no more requests for blocks with votes either.) Maybe more important:
Another verifier.
Logs from tracking to no more tracking:
Then it loops with the same logs until I restarted. (This time, it was showing yellow on the nyzo.co homepage.)
@EggPool Thank you for this. You've included a lot of really helpful information. The hardware should not be a concern in either case. All of our verifiers use the $20/month Digital Ocean "basic" droplets (4GB RAM, 2 CPU), and we have not seen any issues with RAM or CPU. The high CPU load is a really interesting bit of information and will likely help us pinpoint what's happening.

The status responses show that block votes are not coming in at all in the second and third examples. The first example was a little different: block votes were coming in, and it looks like it may have caught up on its own after a temporary communication outage. The second example was over 11,000 blocks behind, though, which is almost a full day. And the latest blocks had no votes whatsoever. The third example, while you captured it only 172 blocks behind, was showing yellow, not red.

The fact that you could communicate with it, but it was receiving no votes whatsoever for the latest blocks, shows that there was something specific to receiving/processing UDP. The fact that the restart fixed the problem could point to either an issue with the UDP message queue or that node joins associated with the restart corrected the problem. High CPU usage before the restart suggests that there's a process in a bad state, but it's also possible that the high CPU usage was due to some recovery process working harder than expected.

On example 2 (the one 11,000 blocks behind), the exceptions on (…)

The white display on the nyzo.co homepage, while the verifier status is yellow, is actually very interesting. I'll look at the logic there to see why that's happening. This should be an easy fix.

Thank you again for all this information. It's very helpful, and we'll keep digging and let you know what we find.
This same verifier is, right now, 50 blocks behind. Still tracking, requesting blocks with votes, but late. Yellow status on its detail page, still white on the nyzo.co homepage.
Resynced and restarted, but given the startup time, it's still about 30 blocks behind and drifting away. It only moves up one by one; it does not batch-catch-up like it usually does once it reaches the blocks it saw live. After the restart and a while:
...
etc. The usual cure in this case is running a backup verifier for the same ID, since in previous similar cases, switching the verifier to another IP only delays the issue. The new verifier also gets stuck in the same way.
One note: to maintain score, just set a backup vote source identifier to any verifier that isn't having trouble tracking, yours or another: http://tech.nyzo.co/setupInstructions/preferences#:~:text=fallback_vote_source_identifier
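For reference, the entry looks roughly like this in the preferences file (typically /var/lib/nyzo/production/preferences per the standard setup instructions; the value below is a placeholder, and the linked page is authoritative for the exact identifier format):

```
fallback_vote_source_identifier=<identifier of a verifier that is tracking well>
```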
Yep, it has - like all my verifiers - but its score is rising anyway.
This would make sense if it cannot receive any votes at all, because it does not receive the vote that it needs to use the fallback. This is the real problem, and solving this problem would solve everything:
For some reason, your verifier is simply not receiving votes. I'm looking at scores now on Argo 746, and I'm checking the information the nyzo.co server has to see if there's any indication of why your verifier isn't getting votes. The nyzo.co server doesn't have any special access, but it does do a lot of queries.
And the specific problem is this:
I'm pretty sure I found the verifier you described. It was running version 610, joined the cycle not too long ago, is having tracking problems, and was displaying as white on the nyzo.co cycle page. I added some extra logging to see why it was displaying as white on the cycle page, and it displayed as yellow on the cycle page after the restart. The logic on the web server for yellow is:
This logic is meant to display the health of the healthier of multiple verifiers when multiple verifiers are present for a particular identifier on multiple IP addresses. So, it appears that the nyzo.co server had some invisible entry that was not displaying on the status page but was convincing it that the verifier was healthy. I'd really like to understand this, but I wasn't able to find the particular cause in this case. I'll add additional monitoring in the next few days to surface this discrepancy next time it happens.

For this particular verifier, I currently see a score of about 25,000 to 30,000 across the verifiers I've checked. The removal threshold is 148,116. So, you still have a little more than 1.5 days of non-voting before removal. I recommend keeping the verifier on the current IP address and restarting it one or two more times.

I'm next going to add some logging to the Argos and Killr to see if they are sending votes to your verifier. If they are sending votes, then this is likely a UDP message queue issue. If they are not, then this is likely a node-join issue.
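As a rough sanity check of that estimate, back-solving from the figures above (a ~27,500 midpoint of the quoted 25,000 to 30,000 score, the 148,116 removal threshold, and the standard ~7-second block time):

$$
\frac{148{,}116 - 27{,}500}{(1.5 \times 86{,}400\,\text{s}) / (7\,\text{s/block})} \approx \frac{120{,}616}{18{,}514} \approx 6.5\ \text{points per missed block}
$$

So the score would need to rise by roughly 6 to 7 points per missed block for "a little more than 1.5 days" to hold; that rate is only implied by the numbers here, not confirmed anywhere in this thread.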
I've been watching your verifier for the past several hours, and it has gone from a consistent 0 votes at many block levels to, right now, receiving around 2500 votes per block level. With the current cycle length, the threshold is around 1,900 votes to freeze a block. I've also seen many heights where the block votes are somewhere in between: 600, 1000, or 1500 votes. The best I can figure is that a lot of incoming UDP is getting dropped, either deliberately, as misguided protection, or inadvertently, by the hosting provider.

If you want to try to save this one, I suggest moving it to a totally different hosting provider for a little while to get its score back. If the information I have is correct, this one is currently on Contabo. If you have a Digital Ocean account, I highly recommend them for network stability, and you could get one of their $20/month droplets for just a little while and destroy it when the verifier is no longer in danger.

Another solution that would be really nice would be to begin storing block votes on the client and configuring the client to respond to block-with-votes requests. We also need to see whether the chain filler can/should be stepping in to freeze entire chain sections. Right now, this verifier is receiving enough votes to freeze several blocks around the cycle's frozen edge, and I think that should be sufficient for the chain filler to do its job, but I'd need to review the code again to confirm.
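To make the numbers above concrete, here is a tiny sketch of the vote-threshold arithmetic. It assumes the freeze threshold is roughly three-quarters of the cycle length, which matches the ~1,900 figure for a cycle of about 2,500 verifiers, but it is not the actual Nyzo consensus code:

```java
// Sketch only: illustrates the vote-threshold arithmetic discussed above, assuming
// a freeze threshold of about 3/4 of the cycle length. Not the actual Nyzo code.
public class VoteThresholdSketch {

    static int freezeThreshold(int cycleLength) {
        return cycleLength * 3 / 4;  // assumed threshold; ~1,897 for a cycle of 2,530
    }

    public static void main(String[] args) {
        int cycleLength = 2530;  // approximate cycle size implied by the discussion
        int threshold = freezeThreshold(cycleLength);
        int[] observedVoteCounts = { 0, 600, 1000, 1500, 2500 };  // counts mentioned above
        for (int votes : observedVoteCounts) {
            System.out.println(votes + " votes -> " +
                    (votes >= threshold ? "block can freeze" : "block cannot freeze"));
        }
    }
}
```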
Thanks for digging. About the yellow on the homepage and the invisible entry: this verifier was on another IP when in-queue, and it was migrated shortly after it joined. (Many other verifiers do this, so this is likely a larger issue than mine alone; reborn experienced it as well with a migrated verifier.) At that time I set up the new verifier, it went white, and I stopped the old one. I just restarted the verifier, and it's keeping track for now.
Update: the IP the current verifier is using was previously used by a backup in-cycle verifier.
The second verifier (v605) was stuck again, with high CPU load.
I just added this to the preferences, hoping to reduce CPU load when stuck and recover faster:
After rsync and restart, it moves up, but it drifts behind for lack of votes.
Please note that here, I do take the time to collect data and report every incident over a short time frame, but this is only a part of what has been happening at large scale for months, to several users, small or large operators, fighting day after day to keep such verifiers in-cycle, or just letting them go if they don't notice soon enough, or simply giving up.
Yeah, I've seen this issue, but for me it was OOM on a 1GB VPS with a great CPU and 1GB of swap. I moved them to another VPS; the low RAM should not be the issue, but maybe the VPS had some memory problem I can't see. The CPU load was the same, stuck at 100% on one core. Also, the main nyzo page showed the verifier as white, so the only detection came from the sentinel: still white, but all zeros all the time.
One from here and several different ones are stuck. Just pasting some info from one of the new ones.
VPS with 4 CPUs, 8 GB RAM. Logs:
Before stalling on this block, it was stalling on many previous ones.
Side note: yet, its performance score continues to slowly rise.
Update on the 2-CPU, 4GB one.
Rsync'd, restarted, tracking again.
I can't speak to all of these issues, and I don't know the details of the configurations, but I have been watching the one verifier closely, and I can tell you what I've seen there.

When the verifier was not receiving any block votes, I added additional logging to Killr, Argo 746, and Argo 752. They were all sending block votes to the other verifier for every height, and they were sending to the correct IP address. So this wasn't some node-join connectedness issue related to a changed IP. The votes were being sent, and Killr, Argo 746, and Argo 752 were having no problems communicating with one another and with the rest of the cycle. But the votes were not being received on the verifier that was having trouble tracking.

Yesterday, I was also seeing the score continue to creep up, despite the verifier tracking well. I did, however, see that the score is slightly lower (better) this morning. The current score I'm seeing is 54462 on Killr, 67179 on Argo 746, and 53339 on Argo 752. So, it appears that outbound UDP issues persisted for at least 24 hours, and I'm unsure whether they are resolved yet.

Killr is in Digital Ocean's AMS3 (Amsterdam) data center. Argo 746 is in Digital Ocean's SGP1 (Singapore) data center. Argo 752 is in Digital Ocean's NYC3 (New York) data center. They are all running on Digital Ocean's "basic" shared CPU plan with a "regular" (not premium) CPU, 4GB RAM, 80GB of SSD storage, 2 CPUs, and 4TB of transfer. In the United States, this is $0.03/hour or $20/month.

The root of the problem on that verifier is an issue with UDP communication, both inbound and outbound. But these are not problems that we can fix: they appear to be issues outside the Java application. If a verifier does not have reliable communication with the rest of the cycle, it will not be able to participate effectively in consensus, and it cannot be allowed to stay in the cycle over the long term. This is why we have performance scores, and removals due to performance scores have been essential to keeping the cycle healthy.

We can work on some of the issues with resynchronization. The two biggest will likely be a watchdog for all important threads and ensuring that the mechanism for freezing chain sections always works when two blocks can be frozen at the end of the chain. I'm thinking the chain-section mechanism may only be active at startup now, but I can't think of a reason, right now, that it couldn't be active all the time. We can also start new cycle joins with the maximum negative score instead of a score of zero. This will give them more room to work out issues soon after they join. One of the most disappointing things I see is verifiers that drop shortly after they join.

Based on what I've seen the past few days, Contabo and OVH seem to have the most problems with UDP communication. I think it would be wise to avoid these providers for verifiers. @EggPool I know that you have verifiers on multiple providers, but I would like you to try Digital Ocean specifically, if that's an option, to see what it looks like when a verifier has truly stable UDP communication.

UDP, by its very design, is not guaranteed. As we see from the differences in performance scores on Argo 746, the reliability of UDP will vary based on geographical location. But Argo 746 is also a really good example of how stable a verifier can be when it is geographically distant from much of the rest of the cycle. We have had Argo 746 in Singapore for more than 2 years now, and it has never had a problem with performance scores.
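As a side note, anyone who wants to check raw UDP delivery independently of the verifier can use a throwaway listener like the sketch below: run it on the verifier host (with the verifier stopped so the port is free) and send datagrams to it from another machine. The port number 9446 is assumed here as the verifier's standard UDP port, and this is a diagnostic sketch, not anything from the Nyzo codebase:

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;

// Throwaway UDP listener for checking whether a provider is dropping inbound UDP.
// Run on the verifier host (with the verifier stopped so the port is free), then
// send datagrams from another host, e.g.: echo test | nc -u <host> 9446
public class UdpReceiveTest {

    public static void main(String[] args) throws Exception {
        int port = args.length > 0 ? Integer.parseInt(args[0]) : 9446;  // assumed verifier UDP port
        try (DatagramSocket socket = new DatagramSocket(port)) {
            byte[] buffer = new byte[2048];
            long received = 0;
            System.out.println("listening for UDP on port " + port);
            while (true) {
                DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
                socket.receive(packet);  // blocks until a datagram arrives
                received++;
                System.out.println("received " + packet.getLength() + " bytes from " +
                        packet.getAddress().getHostAddress() + " (total " + received + ")");
            }
        }
    }
}
```

If datagrams sent from an outside host never show up here, the loss is happening before the Java process, which is consistent with the provider-level UDP issues described above.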
All of our setups on Digital Ocean follow the instructions we posted on tech.nyzo.co: https://tech.nyzo.co/setupInstructions/verifier. We always use the latest Ubuntu LTS version available when we build a verifier. Our preferences files all contain a single option (…).

I'd really like to see how this works out for you. We spend almost no time at all maintaining our verifiers. When we restart them, it's either to test new code or to update to a newly released version. Of our 10 verifiers, we have 4 that are still running version 600. Each of these verifiers has transmitted more than 2900 blocks since its last restart, and each has frozen more than 7.64 million blocks since its last restart. In the right environment, the Nyzo verifier is ridiculously stable, and it has been for quite some time.

If you can try out a verifier on Digital Ocean, I'd really like to see how it works for you. I know you and a lot of other people are frustrated right now, but it really doesn't have to be this way. Of course, we don't want all of the Nyzo cycle to move to Digital Ocean. This is just one provider that we know provides consistent UDP ingress and egress. We had similarly good experiences on AWS, but bandwidth was too expensive. I'm sure other good providers could be identified by looking at verifiers with long uptimes in the cycle.
Thanks. What I'm aiming at here is to give you some real-time feedback on what it's like to operate a verifier as a regular user: having to monitor and restart every now and then, having verifiers go yellow without showing as such on the home page, having them drop for bad performance despite everything you can do (resync, restart, fallback votes, migrating, backups), and hearing in return that "the system works as intended".

As for D.O.:

Last for now: even if we conclude that some specific providers end up dropping UDP traffic from time to time (not all the time, and there could be a good reason why), knowing that this happens and gets the verifier stuck can lead to better warnings, backup/degraded modes, and recovery mechanisms that could avoid most of the drops for "regular" users, or at least show them clearly what is going on.

I need some time to gather and detail higher-level thoughts in a synthetic way.
For one very small piece of this, I can offer a fix. The following change was just deployed to the nyzo.co web server:
There should be no more instances of a stuck verifier not displaying as yellow on the nyzo.co homepage.
For those who haven't followed this whole thread, the previous version of the method was:
This would sometimes leave a phantom entry that would cause a verifier to show as white on the nyzo.co homepage due to an IP switch. The new version may result in some false positives (yellow when there is a healthy verifier), but it should completely eliminate the false negatives (white when the verifier is unhealthy).
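For anyone skimming, here is a purely illustrative sketch of the behavioral difference described above. The class, method, and enum names are made up for this comment and do not come from the actual nyzo.co web-server code; it assumes health collapses to a simple healthy/unhealthy flag:

```java
import java.util.List;

// Illustrative sketch only, not the actual nyzo.co web-server code. It contrasts
// the two behaviors described above for an identifier that has entries on more
// than one IP (e.g. after an IP switch leaves a phantom entry behind).
public class CyclePageColorSketch {

    enum Health { HEALTHY, UNHEALTHY }

    // Old behavior (as described): show the best health among all entries, so one
    // stale-but-"healthy" phantom entry could hide a stuck verifier (white, not yellow).
    static Health oldDisplayHealth(List<Health> entriesForIdentifier) {
        return entriesForIdentifier.contains(Health.HEALTHY) ? Health.HEALTHY : Health.UNHEALTHY;
    }

    // New behavior (as described): flag the identifier if any entry is unhealthy. This
    // can produce false positives (yellow despite a healthy verifier) but no false
    // negatives (white while the verifier is actually stuck).
    static Health newDisplayHealth(List<Health> entriesForIdentifier) {
        return entriesForIdentifier.contains(Health.UNHEALTHY) ? Health.UNHEALTHY : Health.HEALTHY;
    }

    public static void main(String[] args) {
        List<Health> entries = List.of(Health.HEALTHY, Health.UNHEALTHY);  // phantom + stuck entry
        System.out.println("old: " + oldDisplayHealth(entries));  // HEALTHY -> displays white
        System.out.println("new: " + newDisplayHealth(entries));  // UNHEALTHY -> displays yellow
    }
}
```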
Seems like a different issue, no red outline on nyzo.co.
Verifier was showing as yellow, tracking issue. Stuck at block 16148619
v611002
and so on...
When in this state, it won't move by itself; it needs a block resync and restart.
Also note the "removed node from new out-of-cycle queue due to size" lines.