Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible deadlock #9

Open
YijinLiu opened this issue Dec 16, 2020 · 23 comments
Open

Possible deadlock #9

YijinLiu opened this issue Dec 16, 2020 · 23 comments
Assignees

Comments

@YijinLiu
Copy link

My program is experiencing deadlock in libedgetpu occasionally. Following is the stack trace.

Thread 58 (Thread 0x7fff7d7fa700 (LWP 24554)):
#0  0x00007ffff75ebad3 in futex_wait_cancelable (private=, expected=0, futex_word=0x7fff7d7f9170) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7fff7d7f9120, cond=0x7fff7d7f9148) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x7fff7d7f9148, mutex=0x7fff7d7f9120) at pthread_cond_wait.c:655
#3  0x00007ffff78b48bc in std::condition_variable::wait(std::unique_lock&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007fffea8f9a0e in std::condition_variable::wait >(std::unique_lock &, platforms::darwinn::BlockingCounter::) (this=0x7fff7d7f9148, 
    __lock=..., __p=...) at /usr/include/c++/5/condition_variable:98
#5  0x00007fffea8f99c4 in platforms::darwinn::BlockingCounter::Wait (this=0x7fff7d7f9120) at port/blocking_counter.cc:42
#6  0x00007fffea8ce763 in platforms::darwinn::driver::Driver::Execute (this=0xc1a6700, request=std::shared_ptr (empty) = {...}) at driver/driver.cc:593
#7  0x00007fffea8b9f0a in platforms::darwinn::tflite::EdgeTpuDriverWrapper::InvokeExecutable (this=0xbdf57a0, context=0xc087628, node=0xc13f088) at tflite/edgetpu_context_direct.cc:240
#8  0x00007fffea8ac653 in platforms::darwinn::tflite::(anonymous namespace)::CustomOpInvoke (context=0xc087628, node=0xc13f088) at tflite/custom_op_direct.cc:122
#9  0x00000000068df998 in tflite::impl::Subgraph::Invoke() ()
#10 0x0000000006830e87 in tflite::impl::Interpreter::Invoke() ()

Thread 32 (Thread 0x7fffb4cf5700 (LWP 24515)):
#0  0x00007ffff75ebad3 in futex_wait_cancelable (private=, expected=0, futex_word=0xc1a6a14) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0xc19a4b0, cond=0xc1a69e8) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0xc1a69e8, mutex=0xc19a4b0) at pthread_cond_wait.c:655
#3  0x00007ffff78b48bc in std::condition_variable::wait(std::unique_lock&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007fffea87517f in std::_V2::condition_variable_any::wait (this=0xc1a69e8, __lock=...) at /usr/include/c++/5/condition_variable:250
#5  0x00007fffea871789 in platforms::darwinn::driver::UsbDriver::WorkerThreadFunc (this=0xc1a6700) at driver/usb/usb_driver.cc:1322
#6  0x00007fffea8728c7 in platforms::darwinn::driver::UsbDriver::::operator()(void) const (__closure=0xc1cb358) at driver/usb/usb_driver.cc:1604
#7  0x00007fffea878834 in std::_Bind_simple()>::_M_invoke<>(std::_Index_tuple<>) (this=0xc1cb358) at /usr/include/c++/5/functional:1531
#8  0x00007fffea8787b5 in std::_Bind_simple()>::operator()(void) (this=0xc1cb358) at /usr/include/c++/5/functional:1520
#9  0x00007fffea87875c in std::thread::_Impl()> >::_M_run(void) (this=0xc1cb340) at /usr/include/c++/5/thread:115
#10 0x00007ffff78baa50 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#11 0x00007ffff75e56db in start_thread (arg=0x7fffb4cf5700) at pthread_create.c:463
#12 0x00007fffea0d171f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

I use one thread to run the inference. This only happens about once per week. I haven't find a way to repro it so far.

@Namburger
Copy link
Contributor

Hello, can you describe what exactly happens and how is your program designed?

@YijinLiu
Copy link
Author

YijinLiu commented Dec 16, 2020

We run a SSD model using edgetpu. A single thread receives images and run the model.
Once a while, the thread stuck forever. The initial post has the stack trace. Unfortunately, I have no luck finding a way to repro it so far. Let me know if you need more information.

@Namburger
Copy link
Contributor

@YijinLiu I see, may I ask if this is over usb3 or usb2?

@YijinLiu
Copy link
Author

It's usb3. Thanks.

@manoj7410
Copy link

Feel free to reopen if this is still an issue.

@YijinLiu
Copy link
Author

Yes, it is. This is awkward..
I answered one of your questions and wait for several months. Only to find you closed it w/o giving any reason.
I hope this is not how you guys usually interact with developers.

Anyway, I have bit more information:

I turned on logging and got a lot of errors like:

I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/usb_driver.cc:468] USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/usb_driver.cc:468] USB transfer error 2 [LibUsbDataInCallback]

And the thread was stuck here:

#0  futex_wait_cancelable (private=, expected=0, futex_word=0xc95660c) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0xcf13f90, cond=0xc9565e0) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0xc9565e0, mutex=0xcf13f90) at pthread_cond_wait.c:638
#3  0x00007ffff7a610fc in std::condition_variable::wait(std::unique_lock&) () from bin/libstdc++.so.6
#4  0x00007fffeb8fca38 in std::_V2::condition_variable_any::wait (this=0xc9565e0, __lock=...) at /usr/include/c++/7/condition_variable:263
#5  0x00007fffeb8f82cc in platforms::darwinn::driver::UsbDriver::WorkerThreadFunc (this=0xc956300) at driver/usb/usb_driver.cc:1320
#6  0x00007fffeb8f9a15 in platforms::darwinn::driver::UsbDriver::::operator()(void) const (__closure=0xce69628) at driver/usb/usb_driver.cc:1599
#7  0x00007fffeb8fe28d in std::__invoke_impl >(std::__invoke_other, platforms::darwinn::driver::UsbDriver:: &&) (__f=...) at /usr/include/c++/7/bits/invoke.h:60
#8  0x00007fffeb8fcbaf in std::__invoke >(platforms::darwinn::driver::UsbDriver:: &&) (__fn=...) at /usr/include/c++/7/bits/invoke.h:95
#9  0x00007fffeb8ffba6 in std::thread::_Invoker > >::_M_invoke<0>(std::_Index_tuple<0>) (this=0xce69628) at /usr/include/c++/7/thread:234
#10 0x00007fffeb8ffb77 in std::thread::_Invoker > >::operator()(void) (this=0xce69628) at /usr/include/c++/7/thread:243
#11 0x00007fffeb8ffb56 in std::thread::_State_impl > > >::_M_run(void) (this=0xce69620) at /usr/include/c++/7/thread:186
#12 0x00007ffff7a66a00 in ?? () from bin/libstdc++.so.6
#13 0x00007ffff796a609 in start_thread (arg=) at pthread_create.c:477
#14 0x00007fffeb723293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@YijinLiu
Copy link
Author

Well, even more weird: I am not allowed to reopen it!

@manoj7410 manoj7410 reopened this Oct 12, 2021
@manoj7410
Copy link

@YijinLiu Issue reopened. Can you please share any steps to reproduce ?
Which platform are you working on ?

@YijinLiu
Copy link
Author

I cannot find a way to repro it. We have a Linux program running detection on security cameras all the time. It may happen once every week.

Do you think these "USB transfer error" are normal?

@manoj7410
Copy link

manoj7410 commented Oct 12, 2021

@YijinLiu What is the power source that you are using to boot your machine/platform ?
How many USB Accelerators are connected with the machine ?
Is there any other USB device connected as well ?

@YijinLiu
Copy link
Author

It's a desktop machine, with CPU i7-10700, running ubuntu 20, with kernel 5.11.
I only have one USB accelerator connected to the machine.

We have users with many other configurations that are experiencing the same issue.

@manoj7410
Copy link

I'll try to run a demo on my machine which has Core i7. Will let you know if I am able to repro this issue at my end. Meanwhile, if you can find some specific steps to repro this then that will be really helpful.

@YijinLiu
Copy link
Author

I think I found a bug in the code. I am not sure whether it's the cause for this deadlock though. Testing it now, we will know after a week..

diff --git a/driver/usb/usb_driver.cc b/driver/usb/usb_driver.cc
index 78beda4..1dcac42 100644
--- a/driver/usb/usb_driver.cc
+++ b/driver/usb/usb_driver.cc
@@ -1299,7 +1299,7 @@ void UsbDriver::WorkerThreadFunc() {
         }
       }
 
-      reevaluation_needed = ProcessIo().ValueOrDie();
+      if (ProcessIo().ValueOrDie()) reevaluation_needed = true;
 
       // TODO: Enter kPaused state when dma_scheduler_.IsEmpty(). Any
       // new task should kick the driver back to kOpen state. Note this is in
@@ -1311,7 +1311,7 @@ void UsbDriver::WorkerThreadFunc() {
     } else {
       StdCondMutexLock queue_lock(&callback_mutex_);
 
-      Lock2 unlock_both(state_lock, queue_lock);
+      Lock2 unlock_both(queue_lock, state_lock);
 
       if (callback_queue_.empty()) {
         VLOG(10) << StringPrintf("%s waiting on state change", __func__);

@YijinLiu
Copy link
Author

YijinLiu commented Nov 9, 2021

There is another issue, a likely contention:

  • The worker thread (from driver/usb/usb_driver.cc) installs these event/interrupt/bulk-in readers, which timeout every 6 seconds by default.
  • The client calls UsbDriver::DoSubmit to submit jobs to the device.
  • If the readers timeout the same time UsbDriver::ProcessIo issues bulk-out requests, it may be stuck there for ever.
    I changed usb_timeout_millis to 0 to disable the timeout. It has been running for 10 days without the deadlock I saw before.

@manoj7410
Copy link

Thank you for sharing these details. What all files were modified to solve the issue ?

@chao-camect
Copy link

Besides the diffs I posted previously, here are two extra changes:

diff --git a/driver/beagle/beagle_usb_driver_provider.cc b/driver/beagle/beagle_usb_driver_provider.cc
index dcfd804..2b82a00 100644
--- a/driver/beagle/beagle_usb_driver_provider.cc
+++ b/driver/beagle/beagle_usb_driver_provider.cc
@@ -105,7 +105,7 @@ ABSL_FLAG(bool, usb_enable_bulk_descriptors_from_device,
 ABSL_FLAG(bool, usb_enable_processing_of_hints,
           GetEnv("USB_ENABLE_PROCESSING_OF_HINTS", true),
           "USB set to true for driver to proactively send data to device.");
-ABSL_FLAG(int, usb_timeout_millis, GetEnv("USB_TIMEOUT_MILLIS", 6000),
+ABSL_FLAG(int, usb_timeout_millis, GetEnv("USB_TIMEOUT_MILLIS", 0),
           "USB timeout in milliseconds");
 ABSL_FLAG(bool, usb_reset_back_to_dfu_mode,
           GetEnv("USB_RESET_BACK_TO_DFU_MODE", false),
@@ -135,7 +135,7 @@ ABSL_FLAG(bool, usb_enable_overlapping_requests,
           "the current one.");
 ABSL_FLAG(bool, usb_enable_overlapping_bulk_in_and_out,
           GetEnv("USB_ENABLE_OVERLAPPING_BULK_IN_AND_OUT", true),
-          "Allows bulk-in trasnfer to be submitted before previous bulk-out "
+          "Allows bulk-in transfer to be submitted before previous bulk-out "
           "requests complete.");
 ABSL_FLAG(bool, usb_enable_queued_bulk_in_requests,
           GetEnv("USB_ENABLE_QUEUED_BULK_IN_REQUESTS", true),
diff --git a/driver/usb/usb_driver.h b/driver/usb/usb_driver.h
index 182f27e..d3580a9 100644
--- a/driver/usb/usb_driver.h
+++ b/driver/usb/usb_driver.h
@@ -151,7 +151,7 @@ class UsbDriver : public Driver {
     bool usb_fail_if_slower_than_superspeed{false};
 
     // General timeout for USB operations in milliseconds.
-    int usb_timeout_millis{6000};
+    int usb_timeout_millis{0};
 
     // If non-empty, the firmware image to use for automatic DFU.
     // This feature is only available when a device factory has been supplied.

@manoj7410
Copy link

@YijinLiu Are these changes made to the generic Linux drivers?

@chao-camect
Copy link

It's libedgetpu. You can find the filenames in the diff...

@manoj7410
Copy link

@chao-camect Got it. We are still not able to reproduce this. However, can you create a patch of the complete code change and submit for review?

@chao-camect
Copy link

To repro it, you'll need to add random sleep time between two calls to USB Coral. Since the default timeout is 6s, the sleep time need to be around that to replicate.
I'll find some time to create the PR.

@hjonnala hjonnala assigned hjonnala and unassigned manoj7410 May 3, 2022
@xxs1989
Copy link

xxs1989 commented Sep 28, 2022

Hey, I got the same issue occasional happened on my coral mini as :

I driver/usb/usb_driver.cc:86] lock (does nothing)
I driver/usb/usb_driver.cc:1322] WorkerThreadFunc driver state change detected
I driver/usb/usb_driver.cc:1174] WorkerThreadFunc dispatching 1 callback events in worker thread
I driver/usb/usb_driver.cc:1138] bulk in 8 bytes from buffer index [5]
I driver/dma_chunker.cc:30] Completed 0 bytes; Outstanding 0 bytes; Processing next 8 bytes
I driver/usb/usb_driver.cc:916] [3-3] bulk in for 8 bytes has yielded 8 bytes from index [5]
I driver/usb/usb_driver.cc:1310] WorkerThreadFunc re-evaluation is needed
I driver/usb/usb_driver.cc:1174] WorkerThreadFunc dispatching 0 callback events in worker thread
I driver/usb/usb_driver.cc:1260] WorkerThreadFunc Installing bulk-in reader. buffer index [5]
I driver/usb/local_usb_device.cc:748] AsyncBulkInTransfer
I driver/usb/local_usb_device.cc:761] ASYNC IN 1 begin
I driver/single_queue_dma_scheduler.cc:154] Completing DMA[3]
I driver/usb/usb_driver.cc:550] IO completed
I driver/single_queue_dma_scheduler.cc:154] Completing DMA[4]
I driver/single_tpu_request.cc:410] [313] NotifyCompletion()
I driver/package_registry.cc:658] Returned instruction buffers back to executable reference
I driver/single_tpu_request.cc:478] [313] SetState old=3, new=4.
I driver/single_queue_dma_scheduler.cc:234] Request[313]: Completed
I driver/single_tpu_request.cc:96] [313] Request destroyed.
I driver/usb/usb_driver.cc:550] IO completed
I driver/usb/usb_driver.cc:1317] WorkerThreadFunc waiting on state change
I driver/usb/usb_driver.cc:91] Unlocks both mutex
I driver/usb/local_usb_device.cc:672] ASYNC IN 3 end

I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:642] UnregisterCompletedTransfer
I driver/usb/usb_driver.cc:86] lock (does nothing)
I driver/usb/usb_driver.cc:1322] WorkerThreadFunc driver state change detected
I driver/usb/usb_driver.cc:1174] WorkerThreadFunc dispatching 1 callback events in worker thread

I driver/usb/usb_driver.cc:468] USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/usb_driver.cc:1232] WorkerThreadFunc Re-installing interrupt reader
I driver/usb/local_usb_device.cc:785] AsyncInterruptInTransfer
I driver/usb/local_usb_device.cc:798] ASYNC IN 3 begin
I driver/usb/usb_driver.cc:1317] WorkerThreadFunc waiting on state change
I driver/usb/usb_driver.cc:91] Unlocks both mutex
I driver/usb/local_usb_device.cc:672] ASYNC IN 1 end

I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:642] UnregisterCompletedTransfer
I driver/usb/usb_driver.cc:86] lock (does nothing)
I driver/usb/usb_driver.cc:1322] WorkerThreadFunc driver state change detected
I driver/usb/usb_driver.cc:1174] WorkerThreadFunc dispatching 1 callback events in worker thread
I driver/usb/usb_driver.cc:1260] WorkerThreadFunc Installing bulk-in reader. buffer index [6]

And I am using python to run the 'edgetpu.run_inference()', the behavior is pretty un-consistent, seems have relation with different USB power supply, so is there any USB power requirements for the coral mini running tpu? please give any debug or solution advice, thanks!

@hjonnala
Copy link

Hi @xxs1989 the board requires 5V/2A power supply. Please check the section 5.1 for more details at: https://coral.ai/static/files/Coral-Dev-Board-Mini-datasheet.pdf. Thanks!

@messani
Copy link

messani commented Oct 19, 2024

Hi.

I have exactly same experience. I have Raspberry pi 4 and raspberry pi 5. I am using CodeProject.AI in docker. Both raspberries have problems with USB TPU Coral. Both raspberries were in deadlocked state after few hours (or approx. 10k requests). I tried to use USB hub and Y cable. No success.

So I tried to compile libedgetpu with your patches. First patch (order of locks) had no success. But after that I applied second patch (change usb_timeout_millis from 6000 to 0) it seems to be ok now.

Now uptime is almost two days and there are more than 600k requests.

If you need some additional information, I can you do some research for you. I am able to build debug version and I am able to attach it by GDB and send you some info. If you would like to add extra logging, I can do it for you too.

I am developer (C/C++/CSharp), but I do not understand USB data.

UPDATE: 5 days without problems, 1,5M requests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants