MCTP recoverability chokes the main thread #14

Open
drakedog2008 opened this issue Apr 26, 2024 · 9 comments

Comments

@drakedog2008

When the MCTP layer of a device is working but its NVMe-MI layer is not, nvmed gets stuck in the MCTP recovery loop and consumes most of the CPU time on the main thread. D-Bus clients then time out when they:

query the health status of the bad device, and
query the health status of the other, healthy devices.

root@mvbce5-nfd11:~# time busctl tree xyz.openbmc_project.NVMe
└─ /xyz
  └─ /xyz/openbmc_project
    ├─ /xyz/openbmc_project/inventory
    │ └─ /xyz/openbmc_project/inventory/system
    │   └─ /xyz/openbmc_project/inventory/system/board
    │     ├─ /xyz/openbmc_project/inventory/system/board/Chassis_1
    │     │ ├─ /xyz/openbmc_project/inventory/system/board/Chassis_1/Chassis_1_NVMe_1
    │     │ ├─ /xyz/openbmc_project/inventory/system/board/Chassis_1/Chassis_1_NVMe_2
    │     │ │ ├─ /xyz/openbmc_project/inventory/system/board/Chassis_1/Chassis_1_NVMe_2/controllers
    │     │ │ │ ├─ /xyz/openbmc_project/inventory/system/board/Chassis_1/Chassis_1_NVMe_2/controllers/0
    │     │ │ │ ├─ /xyz/openbmc_project/inventory/system/board/Chassis_1/Chassis_1_NVMe_2/controllers/1
    │     │ │ │ └─ /xyz/openbmc_project/inventory/system/board/Chassis/Chassis_1_NVMe_2/controllers/2
    │     │ │ └─ /xyz/openbmc_project/inventory/system/board/Chassis_1/Chassis_1_NVMe_2/volumes
    │     │ │   └─ /xyz/openbmc_project/inventory/system/board/Chassis_1/Chassis_1_NVMe_2/volumes/1
 ...
    └─ /xyz/openbmc_project/sensors
      └─ /xyz/openbmc_project/sensors/temperature
        ├─ /xyz/openbmc_project/sensors/temperature/NVMe_2_NVMe_1
...
real	0m19.398s
user	0m0.080s
sys	0m0.021s
@drakedog2008
Author

The nvmed log:

Apr 25 20:41:00 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Available
Apr 25 20:41:06 machineA nvmesensor[16144]: mctp: net 1 eid 18, Admin op:0x06, cdw(1:0x0, 10:0x1, 11:0x0, 12:0x0 13:0x0, 14:0x0, 15:0x0) T:5000ms, elapsed:6003ms
Apr 25 20:41:06 machineA nvmesensor[16144]: Identify Controller failed, no quirks applied
Apr 25 20:41:07 machineA nvmesensor[16144]: Failed reading subsystem info failing: Success
Apr 25 20:41:07 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]Failed setting up MTU for the MCTP endpoint.
Apr 25 20:41:07 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Degraded
Apr 25 20:41:07 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]start MCTP closure
Apr 25 20:41:07 machineA nvmesensor[16144]: [bus: 8, address: 48] end MCTP closure
Apr 25 20:41:07 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Available
Apr 25 20:41:13 machineA nvmesensor[16144]: mctp: net 1 eid 18, Admin op:0x06, cdw(1:0x0, 10:0x1, 11:0x0, 12:0x0 13:0x0, 14:0x0, 15:0x0) T:5000ms, elapsed:6003ms
Apr 25 20:41:13 machineA nvmesensor[16144]: Identify Controller failed, no quirks applied
Apr 25 20:41:13 machineA nvmesensor[16144]: Failed reading subsystem info failing: Success
Apr 25 20:41:13 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]Failed setting up MTU for the MCTP endpoint.
Apr 25 20:41:13 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Degraded
Apr 25 20:41:13 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]start MCTP closure
Apr 25 20:41:13 machineA nvmesensor[16144]: [bus: 8, address: 48] end MCTP closure
Apr 25 20:41:13 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Available
Apr 25 20:41:19 machineA nvmesensor[16144]: mctp: net 1 eid 18, Admin op:0x06, cdw(1:0x0, 10:0x1, 11:0x0, 12:0x0 13:0x0, 14:0x0, 15:0x0) T:5000ms, elapsed:6010ms
Apr 25 20:41:19 machineA nvmesensor[16144]: Identify Controller failed, no quirks applied
Apr 25 20:41:20 machineA nvmesensor[16144]: Failed reading subsystem info failing: Success
Apr 25 20:41:20 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]Failed setting up MTU for the MCTP endpoint.
Apr 25 20:41:20 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Degraded
Apr 25 20:41:20 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]start MCTP closure
Apr 25 20:41:20 machineA nvmesensor[16144]: [bus: 8, address: 48] end MCTP closure
Apr 25 20:41:20 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Available
Apr 25 20:41:26 machineA nvmesensor[16144]: mctp: net 1 eid 18, Admin op:0x06, cdw(1:0x0, 10:0x1, 11:0x0, 12:0x0 13:0x0, 14:0x0, 15:0x0) T:5000ms, elapsed:6007ms
Apr 25 20:41:26 machineA nvmesensor[16144]: Identify Controller failed, no quirks applied
Apr 25 20:41:26 machineA nvmesensor[16144]: Failed reading subsystem info failing: Success
Apr 25 20:41:26 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]Failed setting up MTU for the MCTP endpoint.
Apr 25 20:41:26 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Degraded
Apr 25 20:41:26 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]start MCTP closure
Apr 25 20:41:26 machineA nvmesensor[16144]: [bus: 8, address: 48] end MCTP closure
Apr 25 20:41:26 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Available
Apr 25 20:41:32 machineA nvmesensor[16144]: mctp: net 1 eid 18, Admin op:0x06, cdw(1:0x0, 10:0x1, 11:0x0, 12:0x0 13:0x0, 14:0x0, 15:0x0) T:5000ms, elapsed:6009ms
Apr 25 20:41:32 machineA nvmesensor[16144]: Identify Controller failed, no quirks applied
Apr 25 20:41:33 machineA nvmesensor[16144]: Failed reading subsystem info failing: Success
Apr 25 20:41:33 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]Failed setting up MTU for the MCTP endpoint.
Apr 25 20:41:33 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Degraded
Apr 25 20:41:33 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]start MCTP closure
Apr 25 20:41:33 machineA nvmesensor[16144]: [bus: 8, address: 48] end MCTP closure
Apr 25 20:41:33 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Available

@drakedog2008
Author

drakedog2008 commented Apr 26, 2024

Apr 25 20:41:33 machineA nvmesensor[16144]: Failed reading subsystem info failing: Success

This MI command runs on the worker thread, so the async code for the device (on the main thread) yields, and the D-Bus call can be handled in the meantime. The D-Bus call should therefore take at most ~5 seconds.

Why did the D-Bus tree call take ~20 seconds?
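
For reference, the split being described looks roughly like this. This is a minimal standalone sketch, not nvmesensor code: the blocking MI transaction runs on a worker io_context and only its completion is posted back to the main io_context, which should leave the main thread free to service D-Bus in the meantime.

```cpp
// Sketch only (not nvmesensor code): blocking MI I/O on a worker io_context,
// completion posted back to the main io_context, leaving the main thread free
// to service D-Bus while the worker blocks.
#include <boost/asio.hpp>
#include <chrono>
#include <iostream>
#include <thread>

int main()
{
    boost::asio::io_context mainCtx;
    boost::asio::io_context workerCtx;
    auto mainGuard = boost::asio::make_work_guard(mainCtx);
    auto workerGuard = boost::asio::make_work_guard(workerCtx);
    std::thread worker([&] { workerCtx.run(); });

    // Stand-in for the blocking Identify Controller / MI transaction.
    boost::asio::post(workerCtx, [&] {
        std::this_thread::sleep_for(std::chrono::seconds(5)); // blocking I/O
        boost::asio::post(mainCtx, [&] {
            std::cout << "MI command completed, handled on main thread\n";
            mainGuard.reset(); // let the sketch terminate
        });
    });

    // While the worker blocks, the main thread stays responsive; a timer
    // handler stands in for an incoming D-Bus request here.
    boost::asio::steady_timer dbusRequest(mainCtx,
                                          std::chrono::milliseconds(100));
    dbusRequest.async_wait([](const boost::system::error_code&) {
        std::cout << "D-Bus request serviced while the MI command is in flight\n";
    });

    mainCtx.run();
    workerGuard.reset();
    worker.join();
}
```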

@drakedog2008
Author

OK, the ~20 seconds may be because multiple devices are stuck in the recovery loop.

When I reduce the number of broken devices to 1, busctl tree takes in the range of 2 to 5 seconds.

@drakedog2008
Author

So it seems that the MI worker thread returns too quickly, and the gap is not long enough for the main thread to work through all of the async tasks queued on it.

@amboar
Collaborator

amboar commented May 6, 2024

I'm working to understand this. My current approach is to try to capture the problem in an integration test case so we have something that we can reliably reproduce. I may have to change tack if that's not completely feasible, but I want to explore it.

@amboar
Collaborator

amboar commented May 6, 2024

Apr 25 20:41:26 machineA nvmesensor[16144]: mctp: net 1 eid 18, Admin op:0x06, cdw(1:0x0, 10:0x1, 11:0x0, 12:0x0 13:0x0, 14:0x0, 15:0x0) T:5000ms, elapsed:6007ms

What is emitting this log message? And what's its relationship to the main thread?

However, regardless of the answer to that, we probably want to back off and limit the retry attempts for the endpoint optimisation. A tight-ish infinite loop of it failing isn't helpful to anyone.
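
For illustration, a hedged sketch of what a capped exponential back-off around the retry could look like with boost::asio. The class and the `attempt` callback are made-up names, not anything from nvmesensor.

```cpp
// Hypothetical sketch of a capped exponential back-off around the endpoint
// optimisation retry; none of these names exist in nvmesensor.
#include <boost/asio/io_context.hpp>
#include <boost/asio/steady_timer.hpp>

#include <algorithm>
#include <chrono>
#include <functional>

class OptimizeRetry
{
  public:
    OptimizeRetry(boost::asio::io_context& io, std::function<bool()> attempt) :
        timer(io), attempt(std::move(attempt))
    {}

    void start()
    {
        delay = std::chrono::seconds(1);
        schedule();
    }

  private:
    void schedule()
    {
        timer.expires_after(delay);
        timer.async_wait([this](const boost::system::error_code& ec) {
            if (ec)
            {
                return; // timer cancelled, e.g. endpoint removed
            }
            if (attempt())
            {
                return; // optimisation succeeded, stop retrying
            }
            // Back off: double the delay up to a cap so a persistently broken
            // endpoint can't monopolise the main thread with retries.
            delay = std::min(delay * 2, std::chrono::seconds(60));
            schedule();
        });
    }

    boost::asio::steady_timer timer;
    std::function<bool()> attempt;
    std::chrono::seconds delay{1};
};
```

Something like this would be driven with `start()` from wherever the optimisation currently restarts; `attempt` stands in for whatever the optimisation entry point ends up being.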

@drakedog2008
Author

drakedog2008 commented May 6, 2024

I'm working to understand this. My current approach is to try to capture the problem in an integration test case so we have something that we can reliably reproduce. I may have to change tack if that's not completely feasible, but I want to explore it.

Thanks.

Mostly, these software issues are caused by device malfunction or partial functionality, which is not easy to capture in an integration test without error injection.

For this one in particular, the device's MCTP layer is serving but the NVMe layer is dead, so nvmed gets stuck at

https://gbmc.googlesource.com/dbus-sensors/+/refs/heads/google-dev/src/NVMeMi.cpp#192

@drakedog2008
Author

drakedog2008 commented May 6, 2024

What is emitting this log message?

That is printed via a libnvme patch: we log the elapsed time for any libnvme call that exceeds the threshold. It shows that the quirk probe's Identify Controller (id_cntrl) takes ~6 s and times out.
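
The real change is a patch inside libnvme itself (C), but the idea is just an elapsed-time check around each call that logs when a configured threshold is exceeded, much like the `T:5000ms, elapsed:6003ms` lines above. A generic illustration of that idea:

```cpp
// Illustrative only: wrap a call, measure elapsed time, and log when it
// exceeds the given threshold (assumes the wrapped call returns a value).
#include <chrono>
#include <iostream>
#include <utility>

template <typename Fn, typename... Args>
auto timedCall(const char* what, std::chrono::milliseconds threshold, Fn&& fn,
               Args&&... args)
{
    auto start = std::chrono::steady_clock::now();
    auto result = std::forward<Fn>(fn)(std::forward<Args>(args)...);
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start);
    if (elapsed > threshold)
    {
        std::cerr << what << ": T:" << threshold.count()
                  << "ms, elapsed:" << elapsed.count() << "ms\n";
    }
    return result;
}
```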

And what's its relationship to the main thread?

The libnvme quirk probe happens inside libnvme, as part of endpoint (ep) setup. ep setup is supposed to perform only local (i.e. non-I/O) operations, so it runs on the main thread instead of the worker thread.

https://gbmc.googlesource.com/dbus-sensors/+/refs/heads/google-dev/src/NVMeMi.cpp#113

amboar added a commit that referenced this issue May 17, 2024
Exercise the NVMeMi endpoint recovery implementation by exploiting the
link seam for libnvme{,-mi}.

To do this we first reuse the headers provided by libnvme with a
partial-dependency declaration in the test. From there we provide our
own (largely neutered) implementation of the APIs to link against.

With this in place we can trivially recreate the error discussed in [1]
by returning erroneous values from the libnvme APIs. The higher-level
consequence is we drive the NVMeMi implementation through the endpoint
recovery process while trying to optimize the connection. At that point
we can use the usual GMock APIs to define the behaviors we want from
MctpDevice and MctpEndpoint during the recovery process.

[1]: #14

An interesting future direction is to allow mocks to be injected behind
the libnvme{,-mi} link seam. Adding a method to inject mocks enables
describing arbitrary NVMe scenarios in the usual gmock style, which has
the potential to drastically increase test code coverage for nvmesensor.

Returning to the change at hand, note that the test case causes valgrind
to report a leak; however, this appears to be from the gmock error report
itself. Given we're working to resolve the gmock error, the leak will
disappear with it:

```
...
Mock MCTP Device] MCTP connection is not established
error reading ctemp from subsystem, reason:Transport endpoint is not connected
Sensor bar reading error!
../tests/test_nvme_mi_recovery.cpp:63: Failure
Mock function called more times than expected - returning directly.
    Function call: recover()
         Expected: to be called between 1 and 3 times
           Actual: called 5 times - over-saturated and active

[Mock MCTP Device] MCTP connection is not established
error reading ctemp from subsystem, reason:Transport endpoint is not connected
../tests/test_nvme_mi_recovery.cpp:63: Failure
Mock function called more times than expected - returning directly.
    Function call: recover()
         Expected: to be called between 1 and 3 times
           Actual: called 6 times - over-saturated and active

status else
poll loop has been canceled
../tests/test_nvme_mi_recovery.cpp:90: Failure
Value of: testing::Mock::VerifyAndClearExpectations(mctpEp.get())
  Actual: false
Expected: true

[  FAILED  ] NVMeRecovery.optimisationFailure (8348 ms)
[----------] 1 test from NVMeRecovery (8359 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (8415 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] NVMeRecovery.optimisationFailure

 1 FAILED TEST
==197574==
==197574== HEAP SUMMARY:
==197574==     in use at exit: 8 bytes in 1 blocks
==197574==   total heap usage: 7,500 allocs, 7,499 frees, 1,751,392 bytes allocated
==197574==
==197574== 8 bytes in 1 blocks are still reachable in loss record 1 of 1
==197574==    at 0x4840F83: operator new(unsigned long) (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==197574==    by 0xD157EF: testing::internal::GetFailureReporter() (gmock-internal-utils.cc:124)
==197574==    by 0x9C16E3: testing::internal::Expect(bool, char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (gmock-internal-utils.h:259)
==197574==    by 0x9C2303: testing::internal::UntypedFunctionMockerBase::FailureCleanupHandler::~FailureCleanupHandler() (gmock-spec-builders.h:1427)
==197574==    by 0x9D852C: testing::internal::FunctionMocker<void ()>::InvokeWith(std::tuple<>&&) (gmock-spec-builders.h:1898)
==197574==    by 0x9CB22B: testing::internal::FunctionMocker<void ()>::Invoke() (gmock-spec-builders.h:1548)
==197574==    by 0x9C2EF8: MockMctpEndpoint::recover() (test_nvme_mi_recovery.cpp:27)
==197574==    by 0xAD57EB: NVMeMi::recover() (NVMeMi.cpp:192)
==197574==    by 0xAD50B1: NVMeMi::epOptimize()::{lambda(boost::system::error_code)#1}::operator()(boost::system::error_code) const::{lambda(std::error_code const&)#1}::operator()(std::error_code const&) const (NVMeMi.cpp:167)
==197574==    by 0xAFF26F: void std::__invoke_impl<void, NVMeMi::epOptimize()::{lambda(boost::system::error_code)#1}::operator()(boost::system::error_code) const::{lambda(std::error_code const&)#1}&, std::error_code const&>(std::__invoke_other, NVMeMi::epOptimize()::{lambda(boost::system::error_code)#1}::operator()(boost::system::error_code) const::{lambda(std::error_code const&)#1}&, std::error_code const&) (invoke.h:61)
==197574==    by 0xAF3ED2: std::enable_if<is_invocable_r_v<void, NVMeMi::epOptimize()::{lambda(boost::system::error_code)#1}::operator()(boost::system::error_code) const::{lambda(std::error_code const&)#1}&, std::error_code const&>, void>::type std::__invoke_r<void, NVMeMi::epOptimize()::{lambda(boost::system::error_code)#1}::operator()(boost::system::error_code) const::{lambda(std::error_code const&)#1}&, std::error_code const&>(NVMeMi::epOptimize()::{lambda(boost::system::error_code)#1}::operator()(boost::system::error_code) const::{lambda(std::error_code const&)#1}&, std::error_code const&) (invoke.h:111)
==197574==    by 0xAEFE8F: std::_Function_handler<void (std::error_code const&), NVMeMi::epOptimize()::{lambda(boost::system::error_code)#1}::operator()(boost::system::error_code) const::{lambda(std::error_code const&)#1}>::_M_invoke(std::_Any_data const&, std::error_code const&) (std_function.h:290)
==197574==
==197574== LEAK SUMMARY:
==197574==    definitely lost: 0 bytes in 0 blocks
==197574==    indirectly lost: 0 bytes in 0 blocks
==197574==      possibly lost: 0 bytes in 0 blocks
==197574==    still reachable: 8 bytes in 1 blocks
==197574==         suppressed: 0 bytes in 0 blocks
==197574==
==197574== For lists of detected and suppressed errors, rerun with: -s
==197574== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
```

Change-Id: Idf3fb6d167a9e11ea1d554c4ab2f2c50d1489421
Signed-off-by: Andrew Jeffery <[email protected]>
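
For readers unfamiliar with the approach, a rough sketch of the shape such a link-seam test takes. The interface below is simplified and assumed: only recover() appears in the failure output above, and the test body is elided.

```cpp
// Sketch of the general shape of the link-seam test described above; not the
// commit's actual code.
#include <gmock/gmock.h>
#include <gtest/gtest.h>

#include <memory>

struct MctpEndpointIntf
{
    virtual ~MctpEndpointIntf() = default;
    virtual void recover() = 0;
};

struct MockMctpEndpoint : MctpEndpointIntf
{
    MOCK_METHOD(void, recover, (), (override));
};

TEST(NVMeRecoverySketch, optimisationFailure)
{
    auto mctpEp = std::make_shared<MockMctpEndpoint>();

    // With the neutered libnvme link seam returning errors, a failed
    // optimisation should drive only a bounded number of recovery attempts.
    EXPECT_CALL(*mctpEp, recover()).Times(testing::Between(1, 3));

    // ... drive the endpoint optimisation against the mock here ...
}
```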
amboar added a commit that referenced this issue May 17, 2024
To mitigate starving other tasks out of the main thread, implement a
grace period that delays restarting the NVMeMiIntf implementation once
MCTP endpoint recovery completes. Together with reducing libnvme's
default timeout for quirk probing, this should be enough to mitigate the
issue discussed in [1].

[1]: #14

Change-Id: Iae742fd1e67cb84c163840e7b74868bd6256d6f1
Signed-off-by: Andrew Jeffery <[email protected]>
@amboar
Collaborator

amboar commented May 17, 2024

@drakedog2008 sorry for the delay. I've put together a series of fixes that you can find in the dev/issue-14 branch. Here's the series view.

I mentioned above that I wanted to develop an integration test to capture the problem, and then use that to demonstrate that it is resolved. The first part of the series does some leg-work that allows the test cases to run both outside and inside the OpenBMC CI container, which made it easier for me to develop the test. I expect this makes it easier for others to write and debug tests as well.

There are a few necessary patches for the OpenBMC CI container that go with the change above:

https://gerrit.openbmc.org/q/topic:%22test-setups%22+repo:openbmc/openbmc-build-scripts

You will need to include these in your test environment as well if you rely on openbmc-build-scripts.

After dealing with the tests, the second phase of the series performs some refactoring to make what is currently setupMctpDevice() testable and maintainable in the face of adding further functionality. It does this by reworking the implementation into a collection of methods on the NVMeDevice type.

Beyond that the series adds the test that exercises the problem in NVMeMi, and then resolves the problem by inserting a grace period on endpoint recovery. The current implementation sets that grace period at 5 seconds, but I figured we could discuss changes to that here if you think another value would be better suited.
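
As a rough sketch of the shape of that fix (not the series' actual code), the grace period just means the restart is scheduled through a timer on the main io_context rather than posted immediately:

```cpp
// Not the series' code, just the shape of the idea: delay restarting the MI
// connection after MCTP recovery completes so other queued work (D-Bus
// requests, polls of healthy drives) gets a turn on the main thread first.
#include <boost/asio/io_context.hpp>
#include <boost/asio/steady_timer.hpp>

#include <chrono>
#include <functional>
#include <memory>

static constexpr auto recoveryGracePeriod = std::chrono::seconds(5);

void scheduleRestartAfterRecovery(boost::asio::io_context& io,
                                  std::function<void()> restart)
{
    auto timer =
        std::make_shared<boost::asio::steady_timer>(io, recoveryGracePeriod);
    timer->async_wait([timer, restart = std::move(restart)](
                          const boost::system::error_code& ec) {
        if (!ec)
        {
            restart(); // re-establish the NVMe-MI connection after the grace period
        }
    });
}
```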

As for the timeout on the probe IO generated by libnvme, for now I applied the following patch:

diff --git a/src/nvme/mi-mctp.c b/src/nvme/mi-mctp.c
index 0c5972a5677c..0886e67368b8 100644
--- a/src/nvme/mi-mctp.c
+++ b/src/nvme/mi-mctp.c
@@ -501,12 +501,7 @@ nvme_mi_ep_t nvme_mi_open_mctp(nvme_root_t root, unsigned int netid, __u8 eid)
        ep->transport = &nvme_mi_transport_mctp;
        ep->transport_data = mctp;
 
-       /* Assuming an i2c transport at 100kHz, smallest MTU (64+4). Given
-        * a worst-case clock stretch, and largest-sized packets, we can
-        * expect up to 1.6s per command/response pair. Allowing for a
-        * retry or two (handled by lower layers), 5s is a reasonable timeout.
-        */
-       ep->timeout = 5000;
+       ep->timeout = 1000;
 
        nvme_mi_ep_probe(ep);
 

I intend to start a discussion upstream about making the probe process explicit (i.e. adding an API to invoke it). With that we could move the IO onto the worker thread, avoiding the blocking in nvme_mi_open_mctp().
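
In the meantime, a hedged sketch of what moving that blocking call off the main thread could look like. The helper, its signature, and the header arrangement here are assumptions, not nvmesensor code; only nvme_mi_open_mctp() and its parameters come from libnvme itself.

```cpp
// Assumed sketch only: run the blocking nvme_mi_open_mctp() (which currently
// performs the quirk probe inline) on the worker io_context, then hand the
// endpoint back to the main thread. Both io_contexts must outlive the call.
#include <boost/asio/io_context.hpp>
#include <boost/asio/post.hpp>

#include <libnvme-mi.h>

#include <cstdint>
#include <functional>
#include <utility>

void openEndpointAsync(boost::asio::io_context& workerIO,
                       boost::asio::io_context& mainIO, nvme_root_t root,
                       unsigned int netid, uint8_t eid,
                       std::function<void(nvme_mi_ep_t)> onOpened)
{
    boost::asio::post(workerIO, [&mainIO, root, netid, eid,
                                 onOpened = std::move(onOpened)]() mutable {
        // Blocking call, including the internal probe, now off the main thread.
        nvme_mi_ep_t ep = nvme_mi_open_mctp(root, netid, eid);
        boost::asio::post(mainIO, [ep, onOpened = std::move(onOpened)]() {
            onOpened(ep); // null on failure
        });
    });
}
```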
