
Trigger hangs but does not send an error #668

Open
coderdj opened this issue Jan 31, 2018 · 2 comments

Comments

coderdj (Contributor) commented Jan 31, 2018

Got this report from @darrylmasson, who was on shift.

There was an odd trigger crash over the weekend. The trigger itself stopped working on Friday evening, but didn't throw any errors that reached the shifters. The Trigger Status box on the DAQ site complained that its info was more than a minute old, and nothing in the run queue was listed as "running", despite both check boxes at the top being green. The entries were still showing up in the runs DB, just with 0 events, and without the datamanager seeing any of the data. The muon veto appears to have been totally unaffected.

The system ran like this for most of the weekend, pushing into the untriggered database without any apparent problems. Dan tried restarting it on Sunday, and it ran for a bit before crashing again, this time alerting the shifters. We acknowledged the error and it picked up and kept going without further issue (apart from the loss of 4 runs from the backlog).

I'm not sure if this kind of delayed failure mode has been seen before, but I think it's a bit concerning that the trigger can stop working without alerting the shifters.

This might be related to #666, or at least that issue happened at the same time, while we were recovering the ~1.5 days of runs in the backlog after the crash.

One editorial note: there was no 'second crash' as Darryl describes, though it certainly appeared that way to an operator. The 'second crash' was just the error message from the first crash finally making its way through the system.

JelleAalbers (Contributor) commented

OK, so you think the "operation aborted because: all indexes on collection dropped" error was the cause of the hang? Then it sounds like a Mongo-related race condition.

I thought there was an initial hang (unknown cause), and then, while catching up, we hit #666, crashing with the "operation aborted" error. I don't think #666 can happen during normal operation, only while catching up on runs older than two hours.

For the initial hang, my speculation would be that a shipment of events got lost between two components of the eventbuilder communicating over the network (not saying the network is to blame; it could be RabbitMQ or something in our own code). This might cause an infinite wait. I added timeouts to most event-queue operations, but in one place I didn't see an easy way to do so (#444).
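
For illustration, here is a minimal sketch of the timeout pattern on an internal event queue; the queue object, timeout value, and error handling are hypothetical, not the actual eventbuilder code:

```python
import queue

event_queue = queue.Queue()    # stand-in for an internal event-shipment queue
SHIPMENT_TIMEOUT_SEC = 300     # hypothetical bound on how long we wait for events

def get_next_shipment():
    """Block for the next event shipment, but never wait forever."""
    try:
        return event_queue.get(block=True, timeout=SHIPMENT_TIMEOUT_SEC)
    except queue.Empty:
        # Nothing arrived within the timeout: fail loudly instead of hanging,
        # so a watchdog (or the shifters) can notice that the trigger stalled.
        raise RuntimeError(
            "No event shipment received for %d s" % SHIPMENT_TIMEOUT_SEC)
```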

Perhaps we could define a new alarm condition based on the absence of communication from the trigger for more than ~30 minutes?
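
A rough sketch of what that check could look like, assuming the trigger writes a timestamped status document to MongoDB (the database, collection, and field names below are made up for illustration):

```python
from datetime import datetime, timedelta
from pymongo import MongoClient, DESCENDING

MAX_TRIGGER_SILENCE = timedelta(minutes=30)   # hypothetical alarm threshold

def trigger_is_silent(mongo_uri="mongodb://localhost:27017"):
    """Return True if the trigger has been silent for more than MAX_TRIGGER_SILENCE."""
    client = MongoClient(mongo_uri)
    latest = client["daq"]["trigger_status"].find_one(sort=[("time", DESCENDING)])
    if latest is None:
        return True   # no status document at all also counts as silence
    return datetime.utcnow() - latest["time"] > MAX_TRIGGER_SILENCE
```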

coderdj (Contributor, Author) commented Jan 31, 2018

Ah no, I tend to agree with you that the hang had an unknown cause; I referenced #666 just because it happened on the same day we started recovery.

The alarm on trigger inactivity is a good idea. I'll make an issue to add an alarm on that and on DB buffer size. That should be enough to ensure any slowdown in the trigger gets noticed.
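
For the buffer-size half, a similar hedged sketch (the collection name and threshold are placeholders, not the real untriggered-buffer configuration):

```python
from pymongo import MongoClient

BUFFER_DOC_THRESHOLD = 5_000_000   # hypothetical alarm level for the untriggered buffer

def buffer_too_large(mongo_uri="mongodb://localhost:27017"):
    """Return True if the untriggered buffer has grown past the alarm threshold."""
    client = MongoClient(mongo_uri)
    n_docs = client["untriggered"]["buffer"].estimated_document_count()
    return n_docs > BUFFER_DOC_THRESHOLD
```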
