MCTP recoverability chokes the main thread #14

Open
drakedog2008 opened this issue Apr 26, 2024 · 9 comments

Comments

@drakedog2008

When the MCTP layer of a device is working but its NVMe-MI layer is not, nvmed gets stuck in the MCTP recovery loop and consumes most of the CPU time on the main thread. D-Bus clients then time out when they:

query the health status of the bad device, and
query the health status of the other, healthy devices.

root@mvbce5-nfd11:~# time busctl tree xyz.openbmc_project.NVMe
└─ /xyz
  └─ /xyz/openbmc_project
    ├─ /xyz/openbmc_project/inventory
    │ └─ /xyz/openbmc_project/inventory/system
    │   └─ /xyz/openbmc_project/inventory/system/board
    │     ├─ /xyz/openbmc_project/inventory/system/board/Chassis_1
    │     │ ├─ /xyz/openbmc_project/inventory/system/board/Chassis_1/Chassis_1_NVMe_1
    │     │ ├─ /xyz/openbmc_project/inventory/system/board/Chassis_1/Chassis_1_NVMe_2
    │     │ │ ├─ /xyz/openbmc_project/inventory/system/board/Chassis_1/Chassis_1_NVMe_2/controllers
    │     │ │ │ ├─ /xyz/openbmc_project/inventory/system/board/Chassis_1/Chassis_1_NVMe_2/controllers/0
    │     │ │ │ ├─ /xyz/openbmc_project/inventory/system/board/Chassis_1/Chassis_1_NVMe_2/controllers/1
    │     │ │ │ └─ /xyz/openbmc_project/inventory/system/board/Chassis/Chassis_1_NVMe_2/controllers/2
    │     │ │ └─ /xyz/openbmc_project/inventory/system/board/Chassis_1/Chassis_1_NVMe_2/volumes
    │     │ │   └─ /xyz/openbmc_project/inventory/system/board/Chassis_1/Chassis_1_NVMe_2/volumes/1
 ...
    └─ /xyz/openbmc_project/sensors
      └─ /xyz/openbmc_project/sensors/temperature
        ├─ /xyz/openbmc_project/sensors/temperature/NVMe_2_NVMe_1
...
real	0m19.398s
user	0m0.080s
sys	0m0.021s
@drakedog2008
Author

The nvmed log:

Apr 25 20:41:00 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Available
Apr 25 20:41:06 machineA nvmesensor[16144]: mctp: net 1 eid 18, Admin op:0x06, cdw(1:0x0, 10:0x1, 11:0x0, 12:0x0 13:0x0, 14:0x0, 15:0x0) T:5000ms, elapsed:6003ms
Apr 25 20:41:06 machineA nvmesensor[16144]: Identify Controller failed, no quirks applied
Apr 25 20:41:07 machineA nvmesensor[16144]: Failed reading subsystem info failing: Success
Apr 25 20:41:07 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]Failed setting up MTU for the MCTP endpoint.
Apr 25 20:41:07 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Degraded
Apr 25 20:41:07 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]start MCTP closure
Apr 25 20:41:07 machineA nvmesensor[16144]: [bus: 8, address: 48] end MCTP closure
Apr 25 20:41:07 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Available
Apr 25 20:41:13 machineA nvmesensor[16144]: mctp: net 1 eid 18, Admin op:0x06, cdw(1:0x0, 10:0x1, 11:0x0, 12:0x0 13:0x0, 14:0x0, 15:0x0) T:5000ms, elapsed:6003ms
Apr 25 20:41:13 machineA nvmesensor[16144]: Identify Controller failed, no quirks applied
Apr 25 20:41:13 machineA nvmesensor[16144]: Failed reading subsystem info failing: Success
Apr 25 20:41:13 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]Failed setting up MTU for the MCTP endpoint.
Apr 25 20:41:13 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Degraded
Apr 25 20:41:13 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]start MCTP closure
Apr 25 20:41:13 machineA nvmesensor[16144]: [bus: 8, address: 48] end MCTP closure
Apr 25 20:41:13 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Available
Apr 25 20:41:19 machineA nvmesensor[16144]: mctp: net 1 eid 18, Admin op:0x06, cdw(1:0x0, 10:0x1, 11:0x0, 12:0x0 13:0x0, 14:0x0, 15:0x0) T:5000ms, elapsed:6010ms
Apr 25 20:41:19 machineA nvmesensor[16144]: Identify Controller failed, no quirks applied
Apr 25 20:41:20 machineA nvmesensor[16144]: Failed reading subsystem info failing: Success
Apr 25 20:41:20 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]Failed setting up MTU for the MCTP endpoint.
Apr 25 20:41:20 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Degraded
Apr 25 20:41:20 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]start MCTP closure
Apr 25 20:41:20 machineA nvmesensor[16144]: [bus: 8, address: 48] end MCTP closure
Apr 25 20:41:20 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Available
Apr 25 20:41:26 machineA nvmesensor[16144]: mctp: net 1 eid 18, Admin op:0x06, cdw(1:0x0, 10:0x1, 11:0x0, 12:0x0 13:0x0, 14:0x0, 15:0x0) T:5000ms, elapsed:6007ms
Apr 25 20:41:26 machineA nvmesensor[16144]: Identify Controller failed, no quirks applied
Apr 25 20:41:26 machineA nvmesensor[16144]: Failed reading subsystem info failing: Success
Apr 25 20:41:26 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]Failed setting up MTU for the MCTP endpoint.
Apr 25 20:41:26 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Degraded
Apr 25 20:41:26 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]start MCTP closure
Apr 25 20:41:26 machineA nvmesensor[16144]: [bus: 8, address: 48] end MCTP closure
Apr 25 20:41:26 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Available
Apr 25 20:41:32 machineA nvmesensor[16144]: mctp: net 1 eid 18, Admin op:0x06, cdw(1:0x0, 10:0x1, 11:0x0, 12:0x0 13:0x0, 14:0x0, 15:0x0) T:5000ms, elapsed:6009ms
Apr 25 20:41:32 machineA nvmesensor[16144]: Identify Controller failed, no quirks applied
Apr 25 20:41:33 machineA nvmesensor[16144]: Failed reading subsystem info failing: Success
Apr 25 20:41:33 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]Failed setting up MTU for the MCTP endpoint.
Apr 25 20:41:33 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Degraded
Apr 25 20:41:33 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]start MCTP closure
Apr 25 20:41:33 machineA nvmesensor[16144]: [bus: 8, address: 48] end MCTP closure
Apr 25 20:41:33 machineA nvmesensor[16144]: [network: 1, EID: 18 | bus: 8, address: 48]: Available

@drakedog2008
Author

drakedog2008 commented Apr 26, 2024

Apr 25 20:41:33 machineA nvmesensor[16144]: Failed reading subsystem info failing: Success

This MI command runs on the worker thread, so the async code for the device (on the main thread) yields, and the D-Bus call can be handled in the meantime. The D-Bus call should therefore take at most ~5 seconds.

Why did the D-Bus tree call take ~20 seconds?
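
For reference, the split being described looks roughly like this. This is a minimal standalone sketch, not nvmesensor code: the blocking MI transaction runs on a worker io_context and only its completion is posted back to the main io_context, which should leave the main thread free to service D-Bus in the meantime.

```cpp
// Sketch only (not nvmesensor code): blocking MI I/O on a worker io_context,
// completion posted back to the main io_context, leaving the main thread free
// to service D-Bus while the worker blocks.
#include <boost/asio.hpp>
#include <chrono>
#include <iostream>
#include <thread>

int main()
{
    boost::asio::io_context mainCtx;
    boost::asio::io_context workerCtx;
    auto mainGuard = boost::asio::make_work_guard(mainCtx);
    auto workerGuard = boost::asio::make_work_guard(workerCtx);
    std::thread worker([&] { workerCtx.run(); });

    // Stand-in for the blocking Identify Controller / MI transaction.
    boost::asio::post(workerCtx, [&] {
        std::this_thread::sleep_for(std::chrono::seconds(5)); // blocking I/O
        boost::asio::post(mainCtx, [&] {
            std::cout << "MI command completed, handled on main thread\n";
            mainGuard.reset(); // let the sketch terminate
        });
    });

    // While the worker blocks, the main thread stays responsive; a timer
    // handler stands in for an incoming D-Bus request here.
    boost::asio::steady_timer dbusRequest(mainCtx,
                                          std::chrono::milliseconds(100));
    dbusRequest.async_wait([](const boost::system::error_code&) {
        std::cout << "D-Bus request serviced while the MI command is in flight\n";
    });

    mainCtx.run();
    workerGuard.reset();
    worker.join();
}
```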

@drakedog2008
Author

OK, the ~20 seconds may be because multiple devices are stuck in the recovery loop.

When I reduce the number of broken devices to 1, busctl tree takes in the range of 2 to 5 seconds.

@drakedog2008
Author

So it seems that the MI worker thread returns too quickly, and the gap is not long enough for the main thread to work through all of the async tasks queued on it.

@amboar
Collaborator

amboar commented May 6, 2024

I'm working to understand this. My current approach is to try to capture the problem in an integration test case so we have something that we can reliably reproduce. I may have to change tack if that's not completely feasible, but I want to explore it.

@amboar
Collaborator

amboar commented May 6, 2024

Apr 25 20:41:26 machineA nvmesensor[16144]: mctp: net 1 eid 18, Admin op:0x06, cdw(1:0x0, 10:0x1, 11:0x0, 12:0x0 13:0x0, 14:0x0, 15:0x0) T:5000ms, elapsed:6007ms

What is emitting this log message? And what's its relationship to the main thread?

However, regardless of the answer to that, we probably want to back off and limit the retry attempts for the endpoint optimisation. A tight-ish infinite loop of it failing isn't helpful to anyone.
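
For illustration, a hedged sketch of what a capped exponential back-off around the retry could look like with boost::asio. The class and the `attempt` callback are made-up names, not anything from nvmesensor.

```cpp
// Hypothetical sketch of a capped exponential back-off around the endpoint
// optimisation retry; none of these names exist in nvmesensor.
#include <boost/asio/io_context.hpp>
#include <boost/asio/steady_timer.hpp>

#include <algorithm>
#include <chrono>
#include <functional>

class OptimizeRetry
{
  public:
    OptimizeRetry(boost::asio::io_context& io, std::function<bool()> attempt) :
        timer(io), attempt(std::move(attempt))
    {}

    void start()
    {
        delay = std::chrono::seconds(1);
        schedule();
    }

  private:
    void schedule()
    {
        timer.expires_after(delay);
        timer.async_wait([this](const boost::system::error_code& ec) {
            if (ec)
            {
                return; // timer cancelled, e.g. endpoint removed
            }
            if (attempt())
            {
                return; // optimisation succeeded, stop retrying
            }
            // Back off: double the delay up to a cap so a persistently broken
            // endpoint can't monopolise the main thread with retries.
            delay = std::min(delay * 2, std::chrono::seconds(60));
            schedule();
        });
    }

    boost::asio::steady_timer timer;
    std::function<bool()> attempt;
    std::chrono::seconds delay{1};
};
```

Something like this would be driven with `start()` from wherever the optimisation currently restarts; `attempt` stands in for whatever the optimisation entry point ends up being.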

@drakedog2008
Author

drakedog2008 commented May 6, 2024

I'm working to understand this. My current approach is to try to capture the problem in an integration test case so we have something that we can reliably reproduce. I may have to change tack if that's not completely feasible, but I want to explore it.

Thanks.

Mostly, these software issues are caused by device malfunction or partial functionality, which is not easy to capture in an integration test without error injection.

For this one in particular, the device's MCTP layer is serving but the NVMe layer is dead, so nvmed gets stuck at

https://gbmc.googlesource.com/dbus-sensors/+/refs/heads/google-dev/src/NVMeMi.cpp#192

@drakedog2008
Author

drakedog2008 commented May 6, 2024

What is emitting this log message?

That is printed via a libnvme patch: we log the elapsed time for any libnvme call that exceeds the threshold. It shows that the quirk probe's Identify Controller (id_cntrl) takes ~6 s and times out.
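
The real change is a patch inside libnvme itself (C), but the idea is just an elapsed-time check around each call that logs when a configured threshold is exceeded, much like the `T:5000ms, elapsed:6003ms` lines above. A generic illustration of that idea:

```cpp
// Illustrative only: wrap a call, measure elapsed time, and log when it
// exceeds the given threshold (assumes the wrapped call returns a value).
#include <chrono>
#include <iostream>
#include <utility>

template <typename Fn, typename... Args>
auto timedCall(const char* what, std::chrono::milliseconds threshold, Fn&& fn,
               Args&&... args)
{
    auto start = std::chrono::steady_clock::now();
    auto result = std::forward<Fn>(fn)(std::forward<Args>(args)...);
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start);
    if (elapsed > threshold)
    {
        std::cerr << what << ": T:" << threshold.count()
                  << "ms, elapsed:" << elapsed.count() << "ms\n";
    }
    return result;
}
```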

And what's its relationship to the main thread?

The libnvme quirk probe happens inside libnvme, as part of endpoint (ep) setup. ep setup is supposed to perform only local (i.e. non-I/O) operations, so it runs on the main thread instead of the worker thread.

https://gbmc.googlesource.com/dbus-sensors/+/refs/heads/google-dev/src/NVMeMi.cpp#113

amboar added a commit that referenced this issue May 17, 2024
Exercise the NVMeMi endpoint recovery implementation by exploiting the
link seam for libnvme{,-mi}.

To do this we first reuse the headers provided by libnvme with a
partial-dependency declaration in the test. From there we provide our
own (largely neutered) implementation of the APIs to link against.

With this in place we can trivially recreate the error discussed in [1]
by returning erroneous values from the libnvme APIs. The higher-level
consequence is we drive the NVMeMi implementation through the endpoint
recovery process while trying to optimize the connection. At that point
we can use the usual GMock APIs to define the behaviors we want from
MctpDevice and MctpEndpoint during the recovery process.

[1]: #14

An interesting future direction is to allow mocks to be injected behind
the libnvme{,-mi} link seam. Adding a method to inject mocks enables
describing arbitrary NVMe scenarios in the usual gmock style, which has
the potential to drastically increase test code coverage for nvmesensor.

Returning to the change at hand, note that the test case causes valgrind
to report a leak; however, this appears to be from the gmock error report
itself. Given we're working to resolve the gmock error, the leak will
disappear with it:

```
...
Mock MCTP Device] MCTP connection is not established
error reading ctemp from subsystem, reason:Transport endpoint is not connected
Sensor bar reading error!
../tests/test_nvme_mi_recovery.cpp:63: Failure
Mock function called more times than expected - returning directly.
    Function call: recover()
         Expected: to be called between 1 and 3 times
           Actual: called 5 times - over-saturated and active

[Mock MCTP Device] MCTP connection is not established
error reading ctemp from subsystem, reason:Transport endpoint is not connected
../tests/test_nvme_mi_recovery.cpp:63: Failure
Mock function called more times than expected - returning directly.
    Function call: recover()
         Expected: to be called between 1 and 3 times
           Actual: called 6 times - over-saturated and active

status else
poll loop has been canceled
../tests/test_nvme_mi_recovery.cpp:90: Failure
Value of: testing::Mock::VerifyAndClearExpectations(mctpEp.get())
  Actual: false
Expected: true

[  FAILED  ] NVMeRecovery.optimisationFailure (8348 ms)
[----------] 1 test from NVMeRecovery (8359 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (8415 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] NVMeRecovery.optimisationFailure

 1 FAILED TEST
==197574==
==197574== HEAP SUMMARY:
==197574==     in use at exit: 8 bytes in 1 blocks
==197574==   total heap usage: 7,500 allocs, 7,499 frees, 1,751,392 bytes allocated
==197574==
==197574== 8 bytes in 1 blocks are still reachable in loss record 1 of 1
==197574==    at 0x4840F83: operator new(unsigned long) (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==197574==    by 0xD157EF: testing::internal::GetFailureReporter() (gmock-internal-utils.cc:124)
==197574==    by 0x9C16E3: testing::internal::Expect(bool, char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (gmock-internal-utils.h:259)
==197574==    by 0x9C2303: testing::internal::UntypedFunctionMockerBase::FailureCleanupHandler::~FailureCleanupHandler() (gmock-spec-builders.h:1427)
==197574==    by 0x9D852C: testing::internal::FunctionMocker<void ()>::InvokeWith(std::tuple<>&&) (gmock-spec-builders.h:1898)
==197574==    by 0x9CB22B: testing::internal::FunctionMocker<void ()>::Invoke() (gmock-spec-builders.h:1548)
==197574==    by 0x9C2EF8: MockMctpEndpoint::recover() (test_nvme_mi_recovery.cpp:27)
==197574==    by 0xAD57EB: NVMeMi::recover() (NVMeMi.cpp:192)
==197574==    by 0xAD50B1: NVMeMi::epOptimize()::{lambda(boost::system::error_code)#1}::operator()(boost::system::error_code) const::{lambda(std::error_code const&)#1}::operator()(std::error_code const&) const (NVMeMi.cpp:167)
==197574==    by 0xAFF26F: void std::__invoke_impl<void, NVMeMi::epOptimize()::{lambda(boost::system::error_code)#1}::operator()(boost::system::error_code) const::{lambda(std::error_code const&)#1}&, std::error_code const&>(std::__invoke_other, NVMeMi::epOptimize()::{lambda(boost::system::error_code)#1}::operator()(boost::system::error_code) const::{lambda(std::error_code const&)#1}&, std::error_code const&) (invoke.h:61)
==197574==    by 0xAF3ED2: std::enable_if<is_invocable_r_v<void, NVMeMi::epOptimize()::{lambda(boost::system::error_code)#1}::operator()(boost::system::error_code) const::{lambda(std::error_code const&)#1}&, std::error_code const&>, void>::type std::__invoke_r<void, NVMeMi::epOptimize()::{lambda(boost::system::error_code)#1}::operator()(boost::system::error_code) const::{lambda(std::error_code const&)#1}&, std::error_code const&>(NVMeMi::epOptimize()::{lambda(boost::system::error_code)#1}::operator()(boost::system::error_code) const::{lambda(std::error_code const&)#1}&, std::error_code const&) (invoke.h:111)
==197574==    by 0xAEFE8F: std::_Function_handler<void (std::error_code const&), NVMeMi::epOptimize()::{lambda(boost::system::error_code)#1}::operator()(boost::system::error_code) const::{lambda(std::error_code const&)#1}>::_M_invoke(std::_Any_data const&, std::error_code const&) (std_function.h:290)
==197574==
==197574== LEAK SUMMARY:
==197574==    definitely lost: 0 bytes in 0 blocks
==197574==    indirectly lost: 0 bytes in 0 blocks
==197574==      possibly lost: 0 bytes in 0 blocks
==197574==    still reachable: 8 bytes in 1 blocks
==197574==         suppressed: 0 bytes in 0 blocks
==197574==
==197574== For lists of detected and suppressed errors, rerun with: -s
==197574== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
```

Change-Id: Idf3fb6d167a9e11ea1d554c4ab2f2c50d1489421
Signed-off-by: Andrew Jeffery <[email protected]>
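
For readers unfamiliar with the approach, a rough sketch of the shape such a link-seam test takes. The interface below is simplified and assumed: only recover() appears in the failure output above, and the test body is elided.

```cpp
// Sketch of the general shape of the link-seam test described above; not the
// commit's actual code.
#include <gmock/gmock.h>
#include <gtest/gtest.h>

#include <memory>

struct MctpEndpointIntf
{
    virtual ~MctpEndpointIntf() = default;
    virtual void recover() = 0;
};

struct MockMctpEndpoint : MctpEndpointIntf
{
    MOCK_METHOD(void, recover, (), (override));
};

TEST(NVMeRecoverySketch, optimisationFailure)
{
    auto mctpEp = std::make_shared<MockMctpEndpoint>();

    // With the neutered libnvme link seam returning errors, a failed
    // optimisation should drive only a bounded number of recovery attempts.
    EXPECT_CALL(*mctpEp, recover()).Times(testing::Between(1, 3));

    // ... drive the endpoint optimisation against the mock here ...
}
```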
amboar added a commit that referenced this issue May 17, 2024
To mitigate starving other tasks out of the main thread, implement a
grace period that delays restarting the NVMeMiIntf implementation once
MCTP endpoint recovery completes. Together with reducing libnvme's
default timeout for quirk probing, this should be enough to mitigate the
issue discussed in [1].

[1]: #14

Change-Id: Iae742fd1e67cb84c163840e7b74868bd6256d6f1
Signed-off-by: Andrew Jeffery <[email protected]>
@amboar
Collaborator

amboar commented May 17, 2024

@drakedog2008 sorry for the delay. I've put together a series of fixes that you can find in the dev/issue-14 branch. Here's the series view.

I mentioned above that I wanted to develop an integration test to capture the problem, and then use that to demonstrate that it is resolved. The first part of the series does some leg-work that allows the test cases to run both outside and inside the OpenBMC CI container, which made it easier for me to develop the test. I expect this makes it easier for others to write and debug tests as well.

There are a few necessary patches for the OpenBMC CI container that go with the change above:

https://gerrit.openbmc.org/q/topic:%22test-setups%22+repo:openbmc/openbmc-build-scripts

You will need to include these in your test environment as well if you rely on openbmc-build-scripts.

After dealing with the tests, the second phase of the series performs some refactoring to make what is currently setupMctpDevice() testable and maintainable in the face of adding further functionality. It does this by reworking the implementation into a collection of methods on the NVMeDevice type.

Beyond that the series adds the test that exercises the problem in NVMeMi, and then resolves the problem by inserting a grace period on endpoint recovery. The current implementation sets that grace period at 5 seconds, but I figured we could discuss changes to that here if you think another value would be better suited.
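
As a rough sketch of the shape of that fix (not the series' actual code), the grace period just means the restart is scheduled through a timer on the main io_context rather than posted immediately:

```cpp
// Not the series' code, just the shape of the idea: delay restarting the MI
// connection after MCTP recovery completes so other queued work (D-Bus
// requests, polls of healthy drives) gets a turn on the main thread first.
#include <boost/asio/io_context.hpp>
#include <boost/asio/steady_timer.hpp>

#include <chrono>
#include <functional>
#include <memory>

static constexpr auto recoveryGracePeriod = std::chrono::seconds(5);

void scheduleRestartAfterRecovery(boost::asio::io_context& io,
                                  std::function<void()> restart)
{
    auto timer =
        std::make_shared<boost::asio::steady_timer>(io, recoveryGracePeriod);
    timer->async_wait([timer, restart = std::move(restart)](
                          const boost::system::error_code& ec) {
        if (!ec)
        {
            restart(); // re-establish the NVMe-MI connection after the grace period
        }
    });
}
```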

As for the timeout on the probe IO generated by libnvme, for now I applied the following patch:

diff --git a/src/nvme/mi-mctp.c b/src/nvme/mi-mctp.c
index 0c5972a5677c..0886e67368b8 100644
--- a/src/nvme/mi-mctp.c
+++ b/src/nvme/mi-mctp.c
@@ -501,12 +501,7 @@ nvme_mi_ep_t nvme_mi_open_mctp(nvme_root_t root, unsigned int netid, __u8 eid)
        ep->transport = &nvme_mi_transport_mctp;
        ep->transport_data = mctp;
 
-       /* Assuming an i2c transport at 100kHz, smallest MTU (64+4). Given
-        * a worst-case clock stretch, and largest-sized packets, we can
-        * expect up to 1.6s per command/response pair. Allowing for a
-        * retry or two (handled by lower layers), 5s is a reasonable timeout.
-        */
-       ep->timeout = 5000;
+       ep->timeout = 1000;
 
        nvme_mi_ep_probe(ep);
 

I intend to start a discussion upstream about making the probe process explicit (i.e. adding an API to invoke it). With that we could move the IO onto the worker thread, avoiding the blocking in nvme_mi_open_mctp().
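
In the meantime, a hedged sketch of what moving that blocking call off the main thread could look like. The helper, its signature, and the header arrangement here are assumptions, not nvmesensor code; only nvme_mi_open_mctp() and its parameters come from libnvme itself.

```cpp
// Assumed sketch only: run the blocking nvme_mi_open_mctp() (which currently
// performs the quirk probe inline) on the worker io_context, then hand the
// endpoint back to the main thread. Both io_contexts must outlive the call.
#include <boost/asio/io_context.hpp>
#include <boost/asio/post.hpp>

#include <libnvme-mi.h>

#include <cstdint>
#include <functional>
#include <utility>

void openEndpointAsync(boost::asio::io_context& workerIO,
                       boost::asio::io_context& mainIO, nvme_root_t root,
                       unsigned int netid, uint8_t eid,
                       std::function<void(nvme_mi_ep_t)> onOpened)
{
    boost::asio::post(workerIO, [&mainIO, root, netid, eid,
                                 onOpened = std::move(onOpened)]() mutable {
        // Blocking call, including the internal probe, now off the main thread.
        nvme_mi_ep_t ep = nvme_mi_open_mctp(root, netid, eid);
        boost::asio::post(mainIO, [ep, onOpened = std::move(onOpened)]() {
            onOpened(ep); // null on failure
        });
    });
}
```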
