MCTP recoverability chokes the main thread #14
The nvmed log:
This MI cmd runs on the worker thread, so the async code (on the main thread) for the device will yield, during which the dbus call will be handled. So the dbus call should take at most ~5 seconds. Why did the dbus tree call take ~20 seconds?
OK, that 20 seconds may be because we have multiple devices stuck in the recovery loop. When I reduce the number of broken devices to 1, …
So it seems that the MI worker thread just returned too quickly, so the gap is not long enough to service all the async tasks queued on the main thread.
I'm working to understand this. My current approach is to try to capture the problem in an integration test case so we have something that we can reliably reproduce. I may have to change tack if that's not completely feasible, but I want to explore it.
What is emitting this log message? And what's its relationship to the main thread? Regardless of the answer to that, we probably want to back off and limit the attempts to retry the endpoint optimisation. A tight-ish infinite loop of it failing isn't helpful to anyone.
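For illustration, a minimal sketch of what bounded, backed-off retries could look like, assuming an asio `steady_timer` on the main io_context. The names `retryOptimize` and `epOptimize` are placeholders for this sketch, not the actual NVMeMi members:

```cpp
#include <boost/asio/steady_timer.hpp>
#include <chrono>
#include <iostream>

// Stand-in for the real optimisation entry point; assume it returns false
// on failure for the purposes of this sketch.
bool epOptimize();

// Hypothetical sketch: retry endpoint optimisation with exponential backoff
// and a hard cap rather than looping immediately on failure.
void retryOptimize(boost::asio::steady_timer& timer, unsigned attempt = 0)
{
    constexpr unsigned maxAttempts = 5;
    if (attempt >= maxAttempts)
    {
        std::cerr << "giving up on endpoint optimisation\n";
        return;
    }

    // Back off: 1s, 2s, 4s, 8s, 16s between attempts.
    timer.expires_after(std::chrono::seconds(1U << attempt));
    timer.async_wait([&timer, attempt](const boost::system::error_code& ec) {
        if (ec)
        {
            return; // timer cancelled or failed
        }
        if (!epOptimize())
        {
            retryOptimize(timer, attempt + 1);
        }
    });
}
```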
Thanks. Mostly, these software issues are caused by device malfunction or partial functionality, which is not easy to capture in an integration test without error injection. For this one in particular, the device's MCTP layer is serving but the NVMe layer is dead, so nvmed is stuck at https://gbmc.googlesource.com/dbus-sensors/+/refs/heads/google-dev/src/NVMeMi.cpp#192
That is printed via a libnvme patch: we print timestamps for any libnvme call that exceeds a threshold. It shows the quirk id_cntrl call takes ~6s and then times out.
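For readers unfamiliar with that kind of instrumentation, a rough, generic sketch of timing a call and logging only when it exceeds a threshold; this is not the actual patch, and the wrapped libnvme call is deliberately left as an opaque callable:

```cpp
#include <chrono>
#include <iostream>

// Hypothetical sketch: time a wrapped call (e.g. a libnvme{,-mi} invocation
// passed in as a lambda) and emit a log line only when it runs longer than
// the given threshold.
template <typename Fn>
int timedCall(const char* name, Fn&& fn,
              std::chrono::milliseconds threshold = std::chrono::seconds(1))
{
    auto start = std::chrono::steady_clock::now();
    int rc = fn(); // the wrapped call
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start);
    if (elapsed > threshold)
    {
        std::cerr << name << " took " << elapsed.count() << "ms (rc=" << rc
                  << ")\n";
    }
    return rc;
}
```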
The libnvme quirk probing is done at the libnvme level as part of ep setup. ep setup is supposed to perform only local operations (i.e. non-IO), so it is executed on the main thread instead of the worker thread. https://gbmc.googlesource.com/dbus-sensors/+/refs/heads/google-dev/src/NVMeMi.cpp#113
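To make the consequence concrete, a small self-contained sketch (not project code) of why a blocking call inside a handler on the main io_context starves everything else queued on it:

```cpp
#include <boost/asio/io_context.hpp>
#include <boost/asio/post.hpp>
#include <chrono>
#include <iostream>
#include <thread>

int main()
{
    boost::asio::io_context io;

    // Stand-in for endpoint setup doing IO (e.g. the quirk probe) directly
    // on the main thread: nothing else queued on `io` can run meanwhile.
    boost::asio::post(io, [] {
        std::this_thread::sleep_for(std::chrono::seconds(6));
    });

    // Stand-in for a D-Bus request handler queued behind it; it is only
    // serviced after the blocking work above finishes.
    boost::asio::post(io, [] { std::cout << "dbus handler finally ran\n"; });

    io.run();
}
```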
Exercise the NVMeMi endpoint recovery implementation by exploiting the link seam for libnvme{,-mi}. To do this we first reuse the headers provided by libnvme with a partial-dependency declaration in the test. From there we provide our own (largely neutered) implementation of the APIs to link against. With this in place we can trivially recreate the error discussed in [1] by returning erroneous values from the libnvme APIs. The higher-level consequence is that we drive the NVMeMi implementation through the endpoint recovery process while trying to optimize the connection. At that point we can use the usual GMock APIs to define the behaviors we want from MctpDevice and MctpEndpoint during the recovery process.

[1]: #14

An interesting future direction is to allow mocks to be injected behind the libnvme{,-mi} link seam. Adding a method to inject mocks enables describing arbitrary NVMe scenarios in the usual gmock style, which has the potential to drastically increase test code coverage for nvmesensor.

Returning to the change at hand, note that the test case causes valgrind to report a leak; however, this appears to come from the gmock error report itself. Given we're working to resolve the gmock error, the leak will disappear with it:

```
...
[Mock MCTP Device] MCTP connection is not established
error reading ctemp from subsystem, reason:Transport endpoint is not connected
Sensor bar reading error!
../tests/test_nvme_mi_recovery.cpp:63: Failure
Mock function called more times than expected - returning directly.
    Function call: recover()
         Expected: to be called between 1 and 3 times
           Actual: called 5 times - over-saturated and active
[Mock MCTP Device] MCTP connection is not established
error reading ctemp from subsystem, reason:Transport endpoint is not connected
../tests/test_nvme_mi_recovery.cpp:63: Failure
Mock function called more times than expected - returning directly.
    Function call: recover()
         Expected: to be called between 1 and 3 times
           Actual: called 6 times - over-saturated and active
status else poll loop has been canceled
../tests/test_nvme_mi_recovery.cpp:90: Failure
Value of: testing::Mock::VerifyAndClearExpectations(mctpEp.get())
  Actual: false
Expected: true
[  FAILED  ] NVMeRecovery.optimisationFailure (8348 ms)
[----------] 1 test from NVMeRecovery (8359 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (8415 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] NVMeRecovery.optimisationFailure

 1 FAILED TEST
==197574==
==197574== HEAP SUMMARY:
==197574==     in use at exit: 8 bytes in 1 blocks
==197574==   total heap usage: 7,500 allocs, 7,499 frees, 1,751,392 bytes allocated
==197574==
==197574== 8 bytes in 1 blocks are still reachable in loss record 1 of 1
==197574==    at 0x4840F83: operator new(unsigned long) (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==197574==    by 0xD157EF: testing::internal::GetFailureReporter() (gmock-internal-utils.cc:124)
==197574==    by 0x9C16E3: testing::internal::Expect(bool, char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (gmock-internal-utils.h:259)
==197574==    by 0x9C2303: testing::internal::UntypedFunctionMockerBase::FailureCleanupHandler::~FailureCleanupHandler() (gmock-spec-builders.h:1427)
==197574==    by 0x9D852C: testing::internal::FunctionMocker<void ()>::InvokeWith(std::tuple<>&&) (gmock-spec-builders.h:1898)
==197574==    by 0x9CB22B: testing::internal::FunctionMocker<void ()>::Invoke() (gmock-spec-builders.h:1548)
==197574==    by 0x9C2EF8: MockMctpEndpoint::recover() (test_nvme_mi_recovery.cpp:27)
==197574==    by 0xAD57EB: NVMeMi::recover() (NVMeMi.cpp:192)
==197574==    by 0xAD50B1: NVMeMi::epOptimize()::{lambda(boost::system::error_code)#1}::operator()(boost::system::error_code) const::{lambda(std::error_code const&)#1}::operator()(std::error_code const&) const (NVMeMi.cpp:167)
==197574==    by 0xAFF26F: void std::__invoke_impl<void, NVMeMi::epOptimize()::{lambda(boost::system::error_code)#1}::operator()(boost::system::error_code) const::{lambda(std::error_code const&)#1}&, std::error_code const&>(std::__invoke_other, NVMeMi::epOptimize()::{lambda(boost::system::error_code)#1}::operator()(boost::system::error_code) const::{lambda(std::error_code const&)#1}&, std::error_code const&) (invoke.h:61)
==197574==    by 0xAF3ED2: std::enable_if<is_invocable_r_v<void, NVMeMi::epOptimize()::{lambda(boost::system::error_code)#1}::operator()(boost::system::error_code) const::{lambda(std::error_code const&)#1}&, std::error_code const&>, void>::type std::__invoke_r<void, NVMeMi::epOptimize()::{lambda(boost::system::error_code)#1}::operator()(boost::system::error_code) const::{lambda(std::error_code const&)#1}&, std::error_code const&>(NVMeMi::epOptimize()::{lambda(boost::system::error_code)#1}::operator()(boost::system::error_code) const::{lambda(std::error_code const&)#1}&, std::error_code const&) (invoke.h:111)
==197574==    by 0xAEFE8F: std::_Function_handler<void (std::error_code const&), NVMeMi::epOptimize()::{lambda(boost::system::error_code)#1}::operator()(boost::system::error_code) const::{lambda(std::error_code const&)#1}>::_M_invoke(std::_Any_data const&, std::error_code const&) (std_function.h:290)
==197574==
==197574== LEAK SUMMARY:
==197574==    definitely lost: 0 bytes in 0 blocks
==197574==    indirectly lost: 0 bytes in 0 blocks
==197574==      possibly lost: 0 bytes in 0 blocks
==197574==    still reachable: 8 bytes in 1 blocks
==197574==         suppressed: 0 bytes in 0 blocks
==197574==
==197574== For lists of detected and suppressed errors, rerun with: -s
==197574== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
```

Change-Id: Idf3fb6d167a9e11ea1d554c4ab2f2c50d1489421
Signed-off-by: Andrew Jeffery <[email protected]>
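For context, the expectation the failure output above refers to looks roughly like the following; the mock class and matchers here are illustrative and may differ from the actual code in test_nvme_mi_recovery.cpp:

```cpp
#include <gmock/gmock.h>
#include <gtest/gtest.h>

// Illustrative mock of the endpoint interface exercised by the recovery
// test; the real MockMctpEndpoint likely derives from the MctpEndpoint
// interface used by NVMeMi.
class MockMctpEndpoint
{
  public:
    MOCK_METHOD(void, recover, ());
};

TEST(NVMeRecoverySketch, optimisationFailure)
{
    MockMctpEndpoint ep;

    // Bound the number of recovery attempts considered acceptable; the
    // "over-saturated" failures in the log come from exceeding such a bound.
    EXPECT_CALL(ep, recover()).Times(testing::Between(1, 3));

    // In the real test, NVMeMi is driven through endpoint optimisation
    // against the neutered libnvme{,-mi} link-seam implementation here.
    ep.recover();
}
```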
To mitigate starving other tasks out of the main thread, implement a grace-period that delays restarting the NVMeMiIntf implementation once MCTP endpoint recovery completes. Together with reducing libnvme's default timeout for quirk probing this should be enough to mitigate the issue discussed in [1].

[1]: #14

Change-Id: Iae742fd1e67cb84c163840e7b74868bd6256d6f1
Signed-off-by: Andrew Jeffery <[email protected]>
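A minimal sketch of the grace-period idea, assuming an asio `steady_timer` on the main io_context; `onRecoveryComplete`, `restartMi`, and the 5-second delay are placeholders, not the names or values in the actual change:

```cpp
#include <boost/asio/error.hpp>
#include <boost/asio/steady_timer.hpp>
#include <chrono>

// Placeholder for re-creating the NVMeMiIntf implementation.
void restartMi();

// Hypothetical sketch: once MCTP endpoint recovery completes, wait out a
// grace period before restarting, so work already queued on the main thread
// gets a chance to run first.
void onRecoveryComplete(boost::asio::steady_timer& graceTimer)
{
    graceTimer.expires_after(std::chrono::seconds(5));
    graceTimer.async_wait([](const boost::system::error_code& ec) {
        if (ec == boost::asio::error::operation_aborted)
        {
            return; // recovery re-entered or shutting down in the meantime
        }
        restartMi();
    });
}
```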
@drakedog2008 sorry for the delay. I've put together a series of fixes that you can find in the dev/issue-14 branch. Here's the series view.

I mentioned above that I wanted to develop an integration test to capture the problem, and then use that to demonstrate that it is resolved. The first part of the series does some leg-work that allows the test cases to run both outside and inside of the OpenBMC CI container, which made it easier for me to develop the test. I expect this makes it easier for others to write and debug tests as well.

There are a few necessary patches for the OpenBMC CI container that go with the change above: https://gerrit.openbmc.org/q/topic:%22test-setups%22+repo:openbmc/openbmc-build-scripts You will need to include these in your test environment as well if you rely on openbmc-build-scripts.

After dealing with the tests, the second phase of the series performs some refactoring to make what is currently …

Beyond that the series adds the test that exercises the problem in …

As for the timeout on the probe IO generated by libnvme, for now I applied the following patch:
I intend to start a discussion upstream about making the probe process explicit (i.e. adding an API to invoke it). With that we could move the IO onto the worker thread, avoiding the blocking in …
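As a rough illustration of that direction, a sketch under the assumption that libnvme grew an explicit probe entry point (called `nvme_mi_probe_quirks()` here purely as a placeholder; no such API exists today), so the IO could be dispatched like other MI commands:

```cpp
#include <boost/asio/io_context.hpp>
#include <boost/asio/post.hpp>
#include <iostream>

// Hypothetical, proposed API only; libnvme does not provide this today.
int nvme_mi_probe_quirks();

// Sketch: run the blocking probe IO on the worker io_context and post the
// result back to the main io_context, keeping the main thread responsive.
void probeOnWorker(boost::asio::io_context& workerIo,
                   boost::asio::io_context& mainIo)
{
    boost::asio::post(workerIo, [&mainIo]() {
        int rc = nvme_mi_probe_quirks();
        boost::asio::post(mainIo, [rc]() {
            // Continue endpoint setup on the main thread with the result.
            std::cout << "probe finished, rc=" << rc << "\n";
        });
    });
}
```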
When the MCTP layer of the device is working but the NVMe MI layer is not, nvmed will get stuck in the MCTP recovery loop and consume most of the CPU slices on the main thread. Dbus clients will time out when they:

- query the unhealthy status for the bad device.
- query the health status for the other good devices.