
Trigger hangs but does not send an error #668

Open
coderdj opened this issue Jan 31, 2018 · 2 comments

Comments

coderdj (Contributor) commented Jan 31, 2018

Got this report from @darrylmasson, who was on shift.

There was an odd trigger crash over the weekend. The trigger itself stopped working on Friday evening, but didn't throw any errors that reached the shifters. The Trigger Status box on the DAQ site complained that its info was more than a minute old, and nothing in the run queue was listed as "running", despite both check boxes at the top being green. The entries were still showing up in the runs DB, just with 0 events, and without the datamanager seeing any of the data. The muon veto appears to have been totally unaffected.

The system ran like this for most of the weekend, pushing into the untriggered database without any apparent problems. Dan tried restarting it on Sunday, and it ran for a bit before crashing again, this time alerting the shifters. We acknowledged the error and it picked up and kept going without further issue (apart from the loss of 4 runs from the backlog).

I'm not sure if this kind of delayed failure mode has been seen before, but I think it's a bit concerning that the trigger can stop working without alerting the shifters.

This might be related to #666, or at least that issue happened at the same time, while we were recovering the ~1.5 days of runs in the backlog after the crash.

One editorial note: there was no 'second crash' as Darryl describes, though it certainly appeared that way to an operator. The 'second crash' was just the error message from the first crash finally making its way through the system.

JelleAalbers (Contributor) commented

OK, so you think the "operation aborted because: all indexes on collection dropped" error was the cause of the hang? Then it sounds like a Mongo-related race condition.

I thought there was an initial hang (unknown cause), and then, while catching up, we hit #666, crashing with the "operation aborted" error. I don't think #666 can happen during normal operation, only while catching up on runs older than two hours.

For the initial hang, my speculation would be that a shipment of events got lost between two components of the eventbuilder communicating over the network (not saying the network is to blame; it could be RabbitMQ or something in our own code). This might cause an infinite wait. I added timeouts to most event-queue operations, but in one place I didn't see an easy way to do so (#444).
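
For illustration, here is a minimal sketch of the timeout pattern on an internal event queue; the queue object, timeout value, and error handling are hypothetical, not the actual eventbuilder code:

```python
import queue

event_queue = queue.Queue()    # stand-in for an internal event-shipment queue
SHIPMENT_TIMEOUT_SEC = 300     # hypothetical bound on how long we wait for events

def get_next_shipment():
    """Block for the next event shipment, but never wait forever."""
    try:
        return event_queue.get(block=True, timeout=SHIPMENT_TIMEOUT_SEC)
    except queue.Empty:
        # Nothing arrived within the timeout: fail loudly instead of hanging,
        # so a watchdog (or the shifters) can notice that the trigger stalled.
        raise RuntimeError(
            "No event shipment received for %d s" % SHIPMENT_TIMEOUT_SEC)
```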

Perhaps we could define a new alarm condition based on the absence of communication from the trigger for more than ~30 minutes?
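
A rough sketch of what that check could look like, assuming the trigger writes a timestamped status document to MongoDB (the database, collection, and field names below are made up for illustration):

```python
from datetime import datetime, timedelta
from pymongo import MongoClient, DESCENDING

MAX_TRIGGER_SILENCE = timedelta(minutes=30)   # hypothetical alarm threshold

def trigger_is_silent(mongo_uri="mongodb://localhost:27017"):
    """Return True if the trigger has been silent for more than MAX_TRIGGER_SILENCE."""
    client = MongoClient(mongo_uri)
    latest = client["daq"]["trigger_status"].find_one(sort=[("time", DESCENDING)])
    if latest is None:
        return True   # no status document at all also counts as silence
    return datetime.utcnow() - latest["time"] > MAX_TRIGGER_SILENCE
```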

coderdj (Contributor, Author) commented Jan 31, 2018

Ah no, I tend to agree with you that the hang had an unknown cause; I referenced #666 just because it happened on the same day we started recovery.

The alarm on trigger inactivity is a good idea. I'll make an issue to add an alarm on that and on DB buffer size. That should be enough to ensure any slowdown in the trigger gets noticed.
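
For the buffer-size half, a similar hedged sketch (the collection name and threshold are placeholders, not the real untriggered-buffer configuration):

```python
from pymongo import MongoClient

BUFFER_DOC_THRESHOLD = 5_000_000   # hypothetical alarm level for the untriggered buffer

def buffer_too_large(mongo_uri="mongodb://localhost:27017"):
    """Return True if the untriggered buffer has grown past the alarm threshold."""
    client = MongoClient(mongo_uri)
    n_docs = client["untriggered"]["buffer"].estimated_document_count()
    return n_docs > BUFFER_DOC_THRESHOLD
```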
