Possible deadlock #9
Comments
Hello, can you describe what exactly happens and how your program is designed? |
We run an SSD model using edgetpu. A single thread receives images and runs the model. |
@YijinLiu I see, may I ask if this is over usb3 or usb2? |
It's usb3. Thanks. |
Feel free to reopen if this is still an issue. |
Yes, it is. This is awkward. Anyway, I have a bit more information: I turned on logging and got a lot of errors like:
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/usb_driver.cc:468] USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/usb_driver.cc:468] USB transfer error 2 [LibUsbDataInCallback]
And the thread was stuck here:
#0 futex_wait_cancelable (private=, expected=0, futex_word=0xc95660c) at ../sysdeps/nptl/futex-internal.h:183
#1 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0xcf13f90, cond=0xc9565e0) at pthread_cond_wait.c:508
#2 __pthread_cond_wait (cond=0xc9565e0, mutex=0xcf13f90) at pthread_cond_wait.c:638
#3 0x00007ffff7a610fc in std::condition_variable::wait(std::unique_lock&) () from bin/libstdc++.so.6
#4 0x00007fffeb8fca38 in std::_V2::condition_variable_any::wait (this=0xc9565e0, __lock=...) at /usr/include/c++/7/condition_variable:263
#5 0x00007fffeb8f82cc in platforms::darwinn::driver::UsbDriver::WorkerThreadFunc (this=0xc956300) at driver/usb/usb_driver.cc:1320
#6 0x00007fffeb8f9a15 in platforms::darwinn::driver::UsbDriver::::operator()(void) const (__closure=0xce69628) at driver/usb/usb_driver.cc:1599
#7 0x00007fffeb8fe28d in std::__invoke_impl >(std::__invoke_other, platforms::darwinn::driver::UsbDriver:: &&) (__f=...) at /usr/include/c++/7/bits/invoke.h:60
#8 0x00007fffeb8fcbaf in std::__invoke >(platforms::darwinn::driver::UsbDriver:: &&) (__fn=...) at /usr/include/c++/7/bits/invoke.h:95
#9 0x00007fffeb8ffba6 in std::thread::_Invoker > >::_M_invoke<0>(std::_Index_tuple<0>) (this=0xce69628) at /usr/include/c++/7/thread:234
#10 0x00007fffeb8ffb77 in std::thread::_Invoker > >::operator()(void) (this=0xce69628) at /usr/include/c++/7/thread:243
#11 0x00007fffeb8ffb56 in std::thread::_State_impl > > >::_M_run(void) (this=0xce69620) at /usr/include/c++/7/thread:186
#12 0x00007ffff7a66a00 in ?? () from bin/libstdc++.so.6
#13 0x00007ffff796a609 in start_thread (arg=) at pthread_create.c:477
#14 0x00007fffeb723293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 |
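A note for readers decoding the log above. Assuming the number printed after "USB transfer error" is the raw libusb_transfer_status value (the function name ConvertLibUsbTransferStatus suggests so, but the thread does not confirm it), 2 corresponds to LIBUSB_TRANSFER_TIMED_OUT in libusb.h. Here is a minimal sketch of that mapping. The enum is a local mirror of libusb's values so the snippet compiles without libusb, and TransferStatusName is a hypothetical helper, not part of libedgetpu:

```cpp
#include <string>

// Local mirror of libusb's `enum libusb_transfer_status` (see libusb.h).
// Real code should include <libusb.h> and use the library's own constants.
enum class TransferStatus : int {
  kCompleted = 0,
  kError = 1,
  kTimedOut = 2,
  kCancelled = 3,
  kStall = 4,
  kNoDevice = 5,
  kOverflow = 6,
};

// Hypothetical helper: maps a raw status number taken from a log line to
// the name of the matching libusb constant.
std::string TransferStatusName(int raw) {
  switch (static_cast<TransferStatus>(raw)) {
    case TransferStatus::kCompleted: return "LIBUSB_TRANSFER_COMPLETED";
    case TransferStatus::kError:     return "LIBUSB_TRANSFER_ERROR";
    case TransferStatus::kTimedOut:  return "LIBUSB_TRANSFER_TIMED_OUT";
    case TransferStatus::kCancelled: return "LIBUSB_TRANSFER_CANCELLED";
    case TransferStatus::kStall:     return "LIBUSB_TRANSFER_STALL";
    case TransferStatus::kNoDevice:  return "LIBUSB_TRANSFER_NO_DEVICE";
    case TransferStatus::kOverflow:  return "LIBUSB_TRANSFER_OVERFLOW";
  }
  return "UNKNOWN";  // Any value outside the known range.
}
```

Under that assumption, the repeated "error 2" lines would be bulk-in transfers timing out rather than hard device errors.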
Well, even more weird: I am not allowed to reopen it! |
@YijinLiu Issue reopened. Can you please share any steps to reproduce? |
I cannot find a way to repro it. We have a Linux program running detection on security cameras all the time. It may happen once every week. Do you think these "USB transfer error" messages are normal? |
@YijinLiu What is the power source that you are using to boot your machine/platform? |
It's a desktop machine, with CPU i7-10700, running ubuntu 20, with kernel 5.11. We have users with many other configurations that are experiencing the same issue. |
I'll try to run a demo on my machine which has Core i7. Will let you know if I am able to repro this issue at my end. Meanwhile, if you can find some specific steps to repro this then that will be really helpful. |
I think I found a bug in the code. I am not sure whether it's the cause of this deadlock, though. I am testing it now; we will know after a week.
diff --git a/driver/usb/usb_driver.cc b/driver/usb/usb_driver.cc
index 78beda4..1dcac42 100644
--- a/driver/usb/usb_driver.cc
+++ b/driver/usb/usb_driver.cc
@@ -1299,7 +1299,7 @@ void UsbDriver::WorkerThreadFunc() {
 }
 }
- reevaluation_needed = ProcessIo().ValueOrDie();
+ if (ProcessIo().ValueOrDie()) reevaluation_needed = true;
 // TODO: Enter kPaused state when dma_scheduler_.IsEmpty(). Any
 // new task should kick the driver back to kOpen state. Note this is in
@@ -1311,7 +1311,7 @@ void UsbDriver::WorkerThreadFunc() {
 } else {
 StdCondMutexLock queue_lock(&callback_mutex_);
- Lock2 unlock_both(state_lock, queue_lock);
+ Lock2 unlock_both(queue_lock, state_lock);
 if (callback_queue_.empty()) {
 VLOG(10) << StringPrintf("%s waiting on state change", __func__);
 |
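The first hunk of the diff above changes an overwrite into a "sticky" set. The sketch below illustrates why that could matter, assuming reevaluation_needed may already have been set to true earlier in the same loop pass (the diff context does not show this, so treat it as an assumption). Both functions are hypothetical stand-ins, not libedgetpu code:

```cpp
// `earlier_request` models reevaluation_needed having been set earlier in
// the loop pass; `process_io_result` models ProcessIo().ValueOrDie().

// Behavior of the original line: the ProcessIo() result replaces whatever
// the flag held, so an earlier `true` is lost when ProcessIo() returns false.
bool FlagAfterPassOverwrite(bool earlier_request, bool process_io_result) {
  bool reevaluation_needed = earlier_request;
  reevaluation_needed = process_io_result;
  return reevaluation_needed;
}

// Behavior of the patched line: the flag is only ever raised here, never
// lowered, so an earlier `true` survives.
bool FlagAfterPassSticky(bool earlier_request, bool process_io_result) {
  bool reevaluation_needed = earlier_request;
  if (process_io_result) reevaluation_needed = true;
  return reevaluation_needed;
}
```

The two versions differ only when the flag was already true and ProcessIo() reports false; in that case the original code would skip a reevaluation that something earlier had asked for.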
There is another issue, a likely contention:
|
Thank you for sharing these details. Which files were modified to solve the issue? |
Besides the diffs I posted previously, here are two extra changes:
diff --git a/driver/beagle/beagle_usb_driver_provider.cc b/driver/beagle/beagle_usb_driver_provider.cc
index dcfd804..2b82a00 100644
--- a/driver/beagle/beagle_usb_driver_provider.cc
+++ b/driver/beagle/beagle_usb_driver_provider.cc
@@ -105,7 +105,7 @@ ABSL_FLAG(bool, usb_enable_bulk_descriptors_from_device,
 ABSL_FLAG(bool, usb_enable_processing_of_hints,
 GetEnv("USB_ENABLE_PROCESSING_OF_HINTS", true),
 "USB set to true for driver to proactively send data to device.");
-ABSL_FLAG(int, usb_timeout_millis, GetEnv("USB_TIMEOUT_MILLIS", 6000),
+ABSL_FLAG(int, usb_timeout_millis, GetEnv("USB_TIMEOUT_MILLIS", 0),
 "USB timeout in milliseconds");
 ABSL_FLAG(bool, usb_reset_back_to_dfu_mode,
 GetEnv("USB_RESET_BACK_TO_DFU_MODE", false),
@@ -135,7 +135,7 @@ ABSL_FLAG(bool, usb_enable_overlapping_requests,
 "the current one.");
 ABSL_FLAG(bool, usb_enable_overlapping_bulk_in_and_out,
 GetEnv("USB_ENABLE_OVERLAPPING_BULK_IN_AND_OUT", true),
- "Allows bulk-in trasnfer to be submitted before previous bulk-out "
+ "Allows bulk-in transfer to be submitted before previous bulk-out "
 "requests complete.");
 ABSL_FLAG(bool, usb_enable_queued_bulk_in_requests,
 GetEnv("USB_ENABLE_QUEUED_BULK_IN_REQUESTS", true),
diff --git a/driver/usb/usb_driver.h b/driver/usb/usb_driver.h
index 182f27e..d3580a9 100644
--- a/driver/usb/usb_driver.h
+++ b/driver/usb/usb_driver.h
@@ -151,7 +151,7 @@ class UsbDriver : public Driver {
 bool usb_fail_if_slower_than_superspeed{false};
 // General timeout for USB operations in milliseconds.
- int usb_timeout_millis{6000};
+ int usb_timeout_millis{0};
 // If non-empty, the firmware image to use for automatic DFU.
 // This feature is only available when a device factory has been supplied.
 |
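The diff above shows the flag default coming from GetEnv("USB_TIMEOUT_MILLIS", ...). libusb documents a transfer timeout of 0 as unlimited, so the patch would turn the 6 s limit into "wait forever", assuming the driver passes the value to libusb unchanged (not confirmed in this thread). The following is a rough sketch of the environment-with-default pattern. GetEnvInt and WaitsIndefinitely are hypothetical helpers written for illustration; libedgetpu's own GetEnv may behave differently:

```cpp
#include <cstdlib>

// Hypothetical helper: returns the integer value of environment variable
// `name`, or `fallback` when it is unset, empty, or not a plain integer.
int GetEnvInt(const char* name, int fallback) {
  const char* raw = std::getenv(name);
  if (raw == nullptr || *raw == '\0') return fallback;
  char* end = nullptr;
  const long value = std::strtol(raw, &end, 10);
  if (*end != '\0') return fallback;  // Trailing garbage: ignore the value.
  return static_cast<int>(value);
}

// libusb treats a timeout of 0 as "no timeout".
bool WaitsIndefinitely(int timeout_millis) { return timeout_millis == 0; }
```

For example, GetEnvInt("USB_TIMEOUT_MILLIS", 6000) yields 6000 when the variable is unset, which matches the unpatched default shown in the diff.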
@YijinLiu Are these changes made to the generic Linux drivers? |
It's libedgetpu. You can find the filenames in the diff... |
@chao-camect Got it. We are still not able to reproduce this. However, can you create a patch of the complete code change and submit for review? |
To repro it, you'll need to add a random sleep between two calls to the USB Coral. Since the default timeout is 6s, the sleep time needs to be around that to replicate the issue. |
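The repro idea above could be sketched as a small harness like the one below. This is only an illustration: RandomGapNear and RunWithRandomGaps are hypothetical names, run_inference stands for whatever call drives the USB Coral in your program, and the jitter width is an assumption, since the comment only says the sleep should be "around" the 6 s timeout.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <random>
#include <thread>

using Millis = std::chrono::milliseconds;

// Picks a gap uniformly from [timeout - jitter, timeout + jitter].
Millis RandomGapNear(Millis timeout, Millis jitter, std::mt19937& rng) {
  assert(jitter.count() >= 0 && jitter <= timeout);
  std::uniform_int_distribution<long long> dist((timeout - jitter).count(),
                                                (timeout + jitter).count());
  return Millis(dist(rng));
}

// Calls `run_inference` `iterations` times, sleeping for a random gap near
// `timeout` between consecutive calls.
void RunWithRandomGaps(const std::function<void()>& run_inference,
                       int iterations, Millis timeout, Millis jitter,
                       std::mt19937& rng) {
  for (int i = 0; i < iterations; ++i) {
    run_inference();
    if (i + 1 < iterations) {
      std::this_thread::sleep_for(RandomGapNear(timeout, jitter, rng));
    }
  }
}
```

A run against the unpatched default might use Millis(6000) for the timeout and something like Millis(500) for the jitter, so that some gaps land just before the timeout and some just after it.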
Hey, the same issue occasionally happens on my Coral Mini:
I am using Python to run 'edgetpu.run_inference()'. The behavior is pretty inconsistent and seems to be related to the USB power supply. Are there any USB power requirements for the Coral Mini when running the TPU? Any debugging or solution advice would be appreciated, thanks! |
Hi @xxs1989, the board requires a 5V/2A power supply. Please check section 5.1 for more details at: https://coral.ai/static/files/Coral-Dev-Board-Mini-datasheet.pdf. Thanks! |
Hi. I have exactly the same experience. I have a Raspberry Pi 4 and a Raspberry Pi 5, and I am using CodeProject.AI in Docker. Both Raspberry Pis have problems with the USB Coral TPU: both ended up deadlocked after a few hours (or approx. 10k requests). I tried a USB hub and a Y cable, with no success. So I tried compiling libedgetpu with your patches. The first patch (order of locks) did not help. But after I applied the second patch (changing usb_timeout_millis from 6000 to 0), it seems to be OK. Uptime is now almost two days, with more than 600k requests. If you need additional information, I can do some research for you. I am able to build a debug version, attach GDB to it, and send you some info. If you would like extra logging added, I can do that too. I am a developer (C/C++/C#), but I do not understand USB data. UPDATE: 5 days without problems, 1.5M requests. |
My program is experiencing deadlock in libedgetpu occasionally. Following is the stack trace.
I use one thread to run the inference. This only happens about once per week. I haven't found a way to repro it so far.