Unregister BLE notifications to improve stability #659

EasilyBoredEngineer · 2025-01-10T05:07:04Z

G'day Syssi,

I've been having a few issues with stability when using your jk ble code.

I just thought I'd pass on the stacktrace, and an AI analysis of it :)

0x4009025d: esp_system_abort at /home/dcrad/.platformio/packages/framework-espidf/components/esp_system/esp_system.c:137
0x40096416: abort at /home/dcrad/.platformio/packages/framework-espidf/components/newlib/abort.c:46
0x401a8157: __cxxabiv1::__terminate(void (*)()) at /builds/idf/crosstool-NG/.build/xtensa-esp32-elf/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:47
0x401a819e: std::terminate() at /builds/idf/crosstool-NG/.build/xtensa-esp32-elf/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:57
0x401a8fd9: __cxa_allocate_exception at /builds/idf/crosstool-NG/.build/xtensa-esp32-elf/src/gcc/libstdc++-v3/libsupc++/eh_alloc.cc:300
0x401a81d0: operator new(unsigned int) at /builds/idf/crosstool-NG/.build/xtensa-esp32-elf/src/gcc/libstdc++-v3/libsupc++/new_op.cc:54
0x400f3245: std::unique_ptr<esphome::Scheduler::SchedulerItem, std::default_delete<esphome::Scheduler::SchedulerItem> > esphome::make_unique<esphome::Scheduler::SchedulerItem>() at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/core/helpers.h:77
0x400f3496: esphome::Scheduler::set_timeout(esphome::Component*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, std::function<void ()>) at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/core/scheduler.cpp:34
0x400f2653: esphome::Component::set_timeout(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, std::function<void ()>&&) at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/core/component.cpp:70
0x400ec911: esphome::sensor::TimeoutFilter::new_value(float) at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/components/sensor/filter.cpp:386
0x400ec9a6: esphome::sensor::Filter::input(float) at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/components/sensor/filter.cpp:15
0x400ed38c: esphome::sensor::Sensor::publish_state(float) at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/components/sensor/sensor.cpp:48
0x400e73f9: esphome::jk_bms_ble::JkBmsBle::publish_state_(esphome::sensor::Sensor*, float) at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/components/jk_bms_ble/jk_bms_ble.cpp:1544
0x400e877a: esphome::jk_bms_ble::JkBmsBle::decode_jk02_cell_info_(std::vector<unsigned char, std::allocator<unsigned char> > const&) at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/components/jk_bms_ble/jk_bms_ble.cpp:520 (discriminator 4)
0x400e907c: esphome::jk_bms_ble::JkBmsBle::decode_(std::vector<unsigned char, std::allocator<unsigned char> > const&) at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/components/jk_bms_ble/jk_bms_ble.cpp:300
0x400e9172: esphome::jk_bms_ble::JkBmsBle::assemble(unsigned char const*, unsigned short) at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/components/jk_bms_ble/jk_bms_ble.cpp:279
0x400e926a: esphome::jk_bms_ble::JkBmsBle::gattc_event_handler(esp_gattc_cb_event_t, unsigned char, esp_ble_gattc_cb_param_t*) at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/components/jk_bms_ble/jk_bms_ble.cpp:228
0x400da825: esphome::ble_client::BLEClient::gattc_event_handler(esp_gattc_cb_event_t, unsigned char, esp_ble_gattc_cb_param_t*) at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/components/ble_client/ble_client.cpp:54 (discriminator 2)
0x401bfdfe: esphome::esp32_ble_tracker::ESP32BLETracker::gattc_event_handler(esp_gattc_cb_event_t, unsigned char, esp_ble_gattc_cb_param_t*) at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/components/esp32_ble_tracker/esp32_ble_tracker.cpp:408 (discriminator 2)
0x400dea35: non-virtual thunk to esphome::esp32_ble_tracker::ESP32BLETracker::gattc_event_handler(esp_gattc_cb_event_t, unsigned char, esp_ble_gattc_cb_param_t*) at ??:?
0x401bfda9: esphome::esp32_ble::ESP32BLE::real_gattc_event_handler_(esp_gattc_cb_event_t, unsigned char, esp_ble_gattc_cb_param_t*) at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/components/esp32_ble/ble.cpp:361 (discriminator 2)
0x400dd65d: esphome::esp32_ble::ESP32BLE::loop() at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/components/esp32_ble/ble.cpp:308
0x401c0699: esphome::Component::call_loop() at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/core/component.cpp:77
0x401c074d: esphome::Component::call() at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/core/component.cpp:104
0x400f12c5: esphome::Application::loop() at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/core/application.cpp:74 (discriminator 2)
0x400f555a: loop() at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/bluebox-common-jk2.yaml:1646
0x400dc97a: esphome::loop_task(void*) at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/components/esp32/core.cpp:69 (discriminator 1)

AI analysis...

Certainly! Let's focus on the stack frames that directly involve the `jk_bms_ble` component:

---

### **Relevant Stack Frames**

#### 1. **`esphome::jk_bms_ble::JkBmsBle::publish_state_`**  
   - **Location:**  
     ```plaintext
     /src/esphome/components/jk_bms_ble/jk_bms_ble.cpp:1544
     ```
   - **Description:**  
     This function is responsible for publishing sensor states to the ESPHome framework. It is called during the BLE data handling pipeline when decoded data is forwarded to associated sensors.

   - **Potential Issue:**  
     If too many notifications are processed in rapid succession, this function might be overwhelmed. This can lead to:
     - Excessive calls to `Sensor::publish_state(float)`, which might trigger `TimeoutFilter` operations unnecessarily.
     - Memory issues due to frequent allocations for sensor updates.

---

#### 2. **`esphome::jk_bms_ble::JkBmsBle::decode_jk02_cell_info_`**  
   - **Location:**  
     ```plaintext
     /src/esphome/components/jk_bms_ble/jk_bms_ble.cpp:520
     ```
   - **Description:**  
     This function handles decoding of cell-specific information from BLE data. It's a critical part of processing incoming BLE notifications.

   - **Potential Issue:**  
     If BLE notifications are arriving too frequently or contain unexpected data, this function might:
     - Process redundant or unnecessary data.
     - Create excessive objects, resulting in memory fragmentation.
     - Trigger cascading updates to sensors.

   - **Debugging Suggestions:**
     - Log the incoming data to identify if redundant or spurious notifications are being processed.
     - Check if all notifications are meaningful for your use case.

---

#### 3. **`esphome::jk_bms_ble::JkBmsBle::decode_`**  
   - **Location:**  
     ```plaintext
     /src/esphome/components/jk_bms_ble/jk_bms_ble.cpp:300
     ```
   - **Description:**  
     This function appears to be a general decoder for BLE data. It likely parses raw BLE packets and routes them to more specific functions like `decode_jk02_cell_info_`.

   - **Potential Issue:**  
     - Inefficient decoding logic could lead to unnecessary memory usage.
     - If unhandled BLE packets are received, they might trigger errors or excessive operations.

   - **Debugging Suggestions:**
     - Ensure the decoder filters out irrelevant packets as early as possible.
     - Log the size and type of incoming BLE packets for analysis.

---

#### 4. **`esphome::jk_bms_ble::JkBmsBle::assemble`**  
   - **Location:**  
     ```plaintext
     /src/esphome/components/jk_bms_ble/jk_bms_ble.cpp:279
     ```
   - **Description:**  
     This function likely reconstructs fragmented BLE data packets into a complete message for processing.

   - **Potential Issue:**  
     - Memory usage might spike if fragmented packets are frequent or if improper bounds checking occurs.
     - Repeated allocations for packet assembly could fragment the heap.

   - **Debugging Suggestions:**
     - Log the buffer size and frequency of assembly operations.
     - Ensure proper bounds checking and avoid unnecessary allocations.

---

#### 5. **`esphome::jk_bms_ble::JkBmsBle::gattc_event_handler`**  
   - **Location:**  
     ```plaintext
     /src/esphome/components/jk_bms_ble/jk_bms_ble.cpp:228
     ```
   - **Description:**  
     This function handles BLE GATTC (Generic Attribute Profile Client) events, such as receiving notifications or responses from BLE devices.

   - **Potential Issue:**  
     - The handler might be overwhelmed by the frequency or size of incoming BLE events.
     - Inefficient handling could lead to delays and cascading memory issues.

   - **Debugging Suggestions:**
     - Log the event type and frequency to detect if notifications are too frequent.
     - Limit or throttle the handling of non-critical events.

---

### **Suggestions for `jk_bms_ble` Component**
1. **Throttle Notifications:**
   - Add logic to discard or debounce redundant notifications.

2. **Optimize Decoding and Publishing:**
   - Ensure only necessary data is processed and published to sensors.

3. **Log Critical Points:**
   - Log details like notification frequency, packet size, and decoding time.

4. **Test with Reduced BLE Traffic:**
   - Temporarily limit the number of connected BLE devices or adjust scanning intervals to reduce incoming traffic.

---


syssi · 2025-01-10T06:26:41Z

Is this a stack trace of an "out of memory (OOM) exception"?

EasilyBoredEngineer · 2025-01-11T02:15:20Z

No, Syssi, it's just a crash trace that refers to out of memory as the reason for the crash.

This may or may not have anything to do with your component itself (I'm no expert on stack trace interpretation), but it is probably to do with the ESPHome BLE implementation, which is very, very memory intensive.

One thing I do know for sure: with 3 BLE clients on my system, if I turn off the jk_bms BLE client, the system is stable. It could just be a straw / camel's back thing.

EasilyBoredEngineer · 2025-01-16T12:10:17Z

@syssi if I throttle to 60s these crashes / memory leaks slow down. Is there any method we can call to adjust throttle in lambda?

also, these frequent crashes have only started in the last two weeks. They were occasional previously, now minute by minute.

syssi · 2025-01-16T12:37:05Z

Please enable the debug component and track the free heap. Does the memory fill up over time? I'm aware of expensive BLE reconnects which cause crashes because this code path requires too much (dynamic) memory.

EasilyBoredEngineer · 2025-01-16T20:20:43Z

Hi @syssi Syssi,

Here are the memory graphs. I'm on a 60 second throttle at the moment and it's been stable for 12 hours. Regarding 'this code path requires too much dynamic memory', do you mean in your codebase or ESPHome's?

Is there any method to dynamically change the throttle in lambda?

syssi · 2025-01-16T21:04:53Z

The underlying BLE stack (ESP-IDF) consumes the memory. I assume it cannot be improved here. Maybe it can be addressed in the ESPHome core, but I don't think so.

EasilyBoredEngineer · 2025-01-16T21:37:35Z

Syssi, I don't know if this is all gobbledygook, but here's some AI output about your BLE connection handling...

The provided code potentially introduces a memory leak during BLE reconnection due to the improper handling of resources when connections are dropped and re-established. To address this issue, consider the following changes:

1. Proper Resource Cleanup on Disconnection

When a BLE device disconnects, ensure that all allocated resources related to the BLE connection, such as registered notifications and dynamic memory, are properly released. For instance:

Unregister notifications when a disconnection occurs.
Clear any dynamically allocated buffers or objects.

Update the ESP_GATTC_DISCONNECT_EVT case:

case ESP_GATTC_DISCONNECT_EVT: {
  this->node_state = espbt::ClientState::IDLE;
  this->status_notification_received_ = false;

  // Cleanup notification registration if necessary
  esp_ble_gattc_unregister_for_notify(this->parent()->get_gattc_if(), this->parent()->get_remote_bda(), this->notify_handle_);

  // Clear the frame buffer to free memory
  this->frame_buffer_.clear();

  break;
}

2. Avoid Repeated Notification Registrations

Repeated calls to esp_ble_gattc_register_for_notify during reconnections without unregistering can lead to leaks. To prevent this, ensure that registration only happens once or that it is properly cleaned up:

Track the registration status and only register if it has not been done.
Unregister notifications during disconnection.
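
The "track the registration status" idea can be sketched on the host without any BLE hardware. This is only an illustration under assumptions: `NotifyGuard` is a hypothetical helper that is not part of esphome-jk-bms or ESP-IDF, and the two callbacks merely stand in for `esp_ble_gattc_register_for_notify` and `esp_ble_gattc_unregister_for_notify`.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical helper, not part of esphome-jk-bms or ESP-IDF.
// The callbacks stand in for the register/unregister calls of the BLE stack
// and return true on success.
class NotifyGuard {
 public:
  NotifyGuard(std::function<bool()> do_register, std::function<bool()> do_unregister)
      : do_register_(std::move(do_register)), do_unregister_(std::move(do_unregister)) {}

  // Call once the characteristic handle is known (e.g. after service discovery).
  void on_connected() {
    if (!registered_ && do_register_())
      registered_ = true;
  }

  // Call from the disconnect event.
  void on_disconnected() {
    if (registered_ && do_unregister_())
      registered_ = false;
  }

  bool registered() const { return registered_; }

 private:
  std::function<bool()> do_register_;
  std::function<bool()> do_unregister_;
  bool registered_{false};
};
```

With this shape, a duplicate "connected" event does not register twice, and every registration is paired with exactly one unregistration across reconnect cycles.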

3. Validate Buffers and Free Unused Memory

In the assemble method, ensure buffers are cleared when a frame is invalid or when reconnections happen:

if (this->frame_buffer_.size() > MAX_RESPONSE_SIZE) {
  ESP_LOGW(TAG, "Frame dropped because of invalid length");
  this->frame_buffer_.clear();
}

Additionally, ensure no dangling references to the buffer persist across connections.
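
A bounded frame buffer along these lines might look like the sketch below. It is a host-side illustration, not the component's `assemble` method: `FrameAssembler` is a hypothetical name, and the two size limits are assumed values chosen only for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed limits, for illustration only; the real values live in the component source.
constexpr std::size_t kMinFrameSize = 300;
constexpr std::size_t kMaxFrameSize = 320;

// Hypothetical helper that collects notification chunks into one frame.
class FrameAssembler {
 public:
  // Appends one chunk. Returns true once at least kMinFrameSize bytes are buffered.
  bool feed(const std::uint8_t *data, std::size_t length) {
    if (buffer_.size() + length > kMaxFrameSize) {
      // The frame would exceed the limit: drop what was collected so far.
      buffer_.clear();
      ++dropped_;
      if (length > kMaxFrameSize)
        return false;  // a single oversized chunk is ignored entirely
    }
    buffer_.insert(buffer_.end(), data, data + length);
    return buffer_.size() >= kMinFrameSize;
  }

  // Call on disconnect. Swapping with an empty vector releases the capacity,
  // whereas clear() only sets the size to zero.
  void reset() { std::vector<std::uint8_t>().swap(buffer_); }

  const std::vector<std::uint8_t> &frame() const { return buffer_; }
  std::size_t dropped() const { return dropped_; }

 private:
  std::vector<std::uint8_t> buffer_;
  std::size_t dropped_{0};
};
```

One detail worth keeping in mind: `std::vector::clear()` leaves the capacity untouched, so it does not by itself return memory to the heap. The `reset()` above uses the swap idiom for the case where releasing the buffer is the actual goal.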

4. Debugging and Logging

Add detailed logging to track BLE connection states, resource usage, and cleanup operations. This can help identify memory leaks during runtime.

5. Use Smart Pointers or Scoped Allocations

If the code uses dynamically allocated resources (e.g., std::vector), ensure they are cleared or use smart pointers where appropriate to manage their lifecycle.

Example Updates to the Code

Below is a streamlined example:

case ESP_GATTC_DISCONNECT_EVT: {
  ESP_LOGI(TAG, "Device disconnected");
  this->node_state = espbt::ClientState::IDLE;
  this->status_notification_received_ = false;

  // Cleanup notifications
  auto status = esp_ble_gattc_unregister_for_notify(this->parent()->get_gattc_if(), 
                                                    this->parent()->get_remote_bda(), 
                                                    this->notify_handle_);
  if (status != ESP_OK) {
    ESP_LOGW(TAG, "Failed to unregister notifications, status=%d", status);
  }

  // Clear buffers
  this->frame_buffer_.clear();

  break;
}

By following these steps, you can mitigate the risk of memory leaks caused by improper cleanup during BLE reconnections. Ensure that all cleanup operations are robust and thoroughly tested under various scenarios, such as rapid connect/disconnect cycles.

syssi · 2025-01-16T22:18:52Z

Good catch! I wasn't aware of the unregister method for BLE notifications. I will give it a try later this week. The other suggestions aren't important IMO.

EasilyBoredEngineer · 2025-01-17T00:31:52Z

thanks mate... ps.. there's an esp-idf nimble implementation that I've seen others use at https://github.com/h2zero/esp-nimble-cpp/tree/master - but I don't even have the foggiest idea where to start to integrate it.

one dude has a go - vgijssel/setup#127

EasilyBoredEngineer · 2025-01-17T00:42:40Z

pps - throttle can be set dynamically in lambda by using id(bms0).set_throttle(number of milliseconds)

EasilyBoredEngineer · 2025-01-17T04:45:39Z

@syssi - here's some other comments, that, again, will need to be filtered. As an aside, since I've implemented variable timing using the set_throttle method, the memory leak seems to have largely become a non-issue. I have times in my implementation where I'm using the cell voltage delta with a PID to let the low cells 'catch up' by keeping the high cell at 3.41v. When I'm doing that, I need a refresh of about 2s to feed the PID, so I'm now using set_throttle to up the refresh rate to 2s at those times and leaving it at 60s the rest of the time. It's heaps more stable.
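
The switching logic described above boils down to picking one of two intervals. A minimal sketch, assuming a hypothetical `pid_active` flag that is true while the PID is balancing cells; the two intervals are the ones mentioned in the comment:

```cpp
#include <cassert>
#include <cstdint>

// Both intervals are taken from the comment above.
constexpr std::uint32_t kFastThrottleMs = 2000;   // while the PID needs fresh data
constexpr std::uint32_t kSlowThrottleMs = 60000;  // the rest of the time

// `pid_active` is a hypothetical flag; how it is derived is up to the YAML.
constexpr std::uint32_t pick_throttle_ms(bool pid_active) {
  return pid_active ? kFastThrottleMs : kSlowThrottleMs;
}
```

In an ESPHome lambda the result would then be handed to the call mentioned earlier, e.g. `id(bms0).set_throttle(pick_throttle_ms(...))`, with `bms0` being whatever id the component has in the YAML.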

In any case, here's the AI thoughts... I asked it whether it thought your ble code could be the cause of the following errors...


[14:04:12][D][esp-idf:000][BTU_TASK]: W (2620933) BT_APPL: gattc_conn_cb: if=4 st=0 id=516 rsn=0x8

[14:04:12][D][esp-idf:000][BTU_TASK]: W (2620936) BT_APPL: gattc_conn_cb: if=5 st=0 id=517 rsn=0x8

[14:04:13][D][esp-idf:000][BTU_TASK]: W (2620939) BT_HCI: hcif disc complete: hdl 0x2, rsn 0x8

[14:04:13][E][ble_client_lambda:131]: Disconnected from ble jk_client


and it said...

This code snippet implements a GATT client event handler for BLE in ESP-IDF. Several potential issues or areas for improvement in this code could lead to the disconnection errors (`rsn=0x8`) you observed. Let’s analyze it step by step:

### Key Observations and Possible Problems

1. **`ESP_GATTC_OPEN_EVT` Handling**
   - The `ESP_GATTC_OPEN_EVT` case is empty. This event indicates that the BLE connection has been established, and it might be a good place to validate the connection, log success, or initiate the next steps.
   - **Suggestion:** Add logging or connection validation here to ensure the connection is stable.

   ```cpp
   case ESP_GATTC_OPEN_EVT: {
       ESP_LOGI(TAG, "Connection opened to device.");
       break;
   }
   ```

2. **Error Handling for `ESP_GATTC_REG_FOR_NOTIFY_EVT`**
   - In the `ESP_GATTC_REG_FOR_NOTIFY_EVT` case, the code assumes the registration for notifications was successful, but it does not handle errors if the registration fails (e.g., the characteristic might not support notifications).
   - **Suggestion:** Validate the registration and handle errors explicitly.

3. **Characteristic Discovery (`ESP_GATTC_SEARCH_CMPL_EVT`)**
   - If the characteristic (`chr`) is not found, the code logs an error but does not actively disconnect from the device. This can leave the connection hanging unnecessarily.
   - **Suggestion:** Actively disconnect or handle the situation where the required characteristic is not found.

   ```cpp
   if (chr == nullptr) {
       ESP_LOGE(TAG, "[%s] No control service found at device, disconnecting.",
                this->parent_->address_str().c_str());
       esp_ble_gattc_close(this->parent()->get_gattc_if(), this->parent()->get_conn_id());
       break;
   }
   ```

4. **Timeouts and Reconnection**
   - The disconnection reason `0x8` (Connection Timeout) might occur if the BLE server does not respond to requests promptly. This can happen if:
     - The server is overloaded or misbehaving.
     - The client sends too many requests too quickly.
   - **Suggestion:** Ensure that requests to the BLE server are appropriately spaced and monitor response times.

5. **Unhandled Notifications**
   - In `ESP_GATTC_NOTIFY_EVT`, notifications are processed only if the handle matches `notify_handle_`. If the server sends notifications for other characteristics, they are ignored silently. This could lead to a timeout if the expected responses are missed.
   - **Suggestion:** Add logging to capture unexpected notifications.

   ```cpp
   if (param->notify.handle != this->notify_handle_) {
       ESP_LOGW(TAG, "Unexpected notification handle: 0x%X", param->notify.handle);
       break;
   }
   ```

6. **Resource Cleanup on Disconnect (`ESP_GATTC_DISCONNECT_EVT`)**
   - The `ESP_GATTC_DISCONNECT_EVT` case handles the state cleanup (`node_state` and `status_notification_received_`), but it does not attempt to reconnect or log detailed reasons for the disconnect.
   - **Suggestion:** Log the reason for the disconnect and consider implementing a reconnection mechanism if appropriate.

   ```cpp
   case ESP_GATTC_DISCONNECT_EVT: {
       ESP_LOGW(TAG, "Disconnected from device. Cleaning up resources.");
       this->node_state = espbt::ClientState::IDLE;
       this->status_notification_received_ = false;
       // Optionally attempt reconnection here
       break;
   }
   ```

### Debugging Tips

1. **Increase Logging Verbosity**
   - Use `ESP_LOGV` or `ESP_LOGVV` to log additional debug information, especially in cases like `ESP_GATTC_OPEN_EVT` and `ESP_GATTC_SEARCH_CMPL_EVT`.
2. **Verify BLE Server Behavior**
   - Ensure the BLE server supports the required services and characteristics.
   - Check if the server has resource constraints (e.g., connection limits).
3. **Monitor Timing**
   - BLE communication relies on strict timing. Use appropriate connection intervals and ensure no operation exceeds its expected duration.
4. **Enable Error Handling**
   - Always check the return values of BLE API calls (e.g., `esp_ble_gattc_register_for_notify`) and handle errors.

### Conclusion

This code is functional but has some gaps in error handling, resource management, and logging. These issues might contribute to the connection timeouts you're experiencing. Implementing the suggestions above should help stabilize the BLE connection and make troubleshooting easier.

syssi · 2025-01-17T07:04:50Z

Just for fun: Ask the AI whether it is important to unregister the notifications and whether leaks really occur here?

IMO the Bluetooth stack cleans up on disconnect as a last resort, which should avoid memory leaks here.

Nevertheless better safe than sorry: #666

EasilyBoredEngineer · 2025-01-17T22:07:25Z

Yes, unregistering notifications is important for maintaining proper BLE resource management and avoiding potential issues such as resource leaks or unexpected behavior. While the ESP-IDF BLE stack handles some cleanup when connections are closed, it's good practice to explicitly unregister notifications when they are no longer needed or when the connection is about to be terminated.

Why Unregistering Notifications is Important

Resource Management:
- Notifications consume resources on both the client and server. Explicitly unregistering them ensures these resources are freed promptly.
- Failure to unregister notifications might lead to stale references in the BLE stack, which can cause memory leaks or other unexpected behavior.
Avoiding Memory Leaks:
- The ESP-IDF BLE stack is designed to clean up most resources on disconnection, but it’s not guaranteed to handle everything perfectly. Explicitly unregistering notifications reduces the risk of orphaned memory allocations or other resource leaks.
Error Prevention:
- If a new connection is established and tries to register notifications for the same characteristic without first unregistering, it could lead to conflicts or errors.
Best Practices:
- Explicit cleanup ensures that your application handles BLE resources responsibly, making it more robust and portable.

How to Unregister Notifications

The ESP-IDF provides the esp_ble_gattc_unregister_for_notify function to unregister notifications. You can call this function in the ESP_GATTC_DISCONNECT_EVT case to ensure cleanup when the connection is terminated.

Example Code:

Here’s how you can modify the ESP_GATTC_DISCONNECT_EVT handler to unregister notifications:

case ESP_GATTC_DISCONNECT_EVT: {
    ESP_LOGW(TAG, "Disconnected from device. Cleaning up resources.");

    // Unregister for notifications
    auto status = esp_ble_gattc_unregister_for_notify(this->parent()->get_gattc_if(),
                                                      this->parent()->get_remote_bda(),
                                                      this->notify_handle_);
    if (status) {
        ESP_LOGW(TAG, "Failed to unregister notifications, status=%d", status);
    } else {
        ESP_LOGI(TAG, "Successfully unregistered notifications.");
    }

    // Clean up internal state
    this->node_state = espbt::ClientState::IDLE;
    this->status_notification_received_ = false;
    break;
}

Are Leaks Likely Without Explicit Unregistration?

While the BLE stack should clean up resources on disconnection, there’s still a risk of memory leaks or orphaned states if notifications are not explicitly unregistered. The risk depends on:

The version of ESP-IDF you're using (newer versions tend to handle cleanup better).
The complexity of your application (e.g., if connections are frequent, lingering resources could accumulate over time).

Best Practice Summary

Always Unregister Notifications:
- Explicitly unregister notifications in the ESP_GATTC_DISCONNECT_EVT handler or when notifications are no longer needed.
Test for Leaks:
- Use tools like the ESP-IDF heap_caps_dump() or enable heap debugging to monitor memory usage and check for leaks.
Clean Up Proactively:
- Even if ESP-IDF handles cleanup during disconnection, taking responsibility for unregistering notifications ensures your application is robust and portable across versions and platforms.

EasilyBoredEngineer · 2025-01-18T00:24:52Z

@syssi early indicators from your change... (assuming I did it right? added - source: github://syssi/esphome-jk-bms@90c7a0f) more free heap than I've ever seen - and it's very stable. I'm running without throttle at the moment to see whether it sticks. I've still had a few disconnects.

EasilyBoredEngineer · 2025-01-18T02:15:53Z

@syssi, also, the jk bms ble disconnects are down to exactly one per hour, on the hour, which I assume is something that the bms itself does. It was previously more regular.

See #659

syssi · 2025-01-18T11:19:16Z

I've merged the change into main. Please give it another try! I did not expect any major impact.

EasilyBoredEngineer · 2025-01-18T22:08:25Z

Well it has, @syssi - 24 hours now running with no throttle, zero crashes.

EasilyBoredEngineer · 2025-01-19T23:49:06Z

@syssi now 48hrs... no throttle, zero crashes.

syssi · 2025-01-20T08:40:18Z

Maybe we should implement a "reconnect counter" for easier performance measurements.

hvdb · 2025-01-20T21:59:18Z

Previously this was not a workable solution for me, it would work for a few hours (if lucky) and then crash and needed to restart and repeat...

Now it works way better, only some occasional resets. But those are probably normal.
I did set the throttle to 30s, but think that can be lower now again.
Very good find I would say, thanks @syssi and @EasilyBoredEngineer

EasilyBoredEngineer · 2025-01-21T00:50:06Z

> Previously this was not a workable solution for me, it would work for a few hours (if lucky) and then crash and needed to restart and repeat...
>
> Now it works way better, only some occasional resets. But those are probably normal. I did set the throttle to 30s, but think that can be lower now again. Very good find I would say, thanks @syssi and @EasilyBoredEngineer

Great!!! Yes on mine the connection resets every hour, exactly 60 minutes - I think it must be something the BMS does. I agree regarding the rest though - super stable now. You don't even need the throttle anymore.

I'm glad that my little finding has been helpful, as @syssi has given me so so so much help over the last year or more.

syssi · 2025-01-21T07:49:23Z

@EasilyBoredEngineer Could you add something like this to your YAML to get a better idea of the number of disconnects per day:

globals:
  - id: total_ble_disconnects
    type: int
    restore_value: false
    initial_value: "0"

sensor:
  - platform: template
    name: "Total BLE disconnects"
    accuracy_decimals: 0
    update_interval: 60s
    lambda: |-
      return id(total_ble_disconnects) * 1.0f;

ble_client:
  - mac_address: ${mac_address}
    id: client0
    on_disconnect:
      then:
        - lambda: |-
            id(total_ble_disconnects) += 1;

A chart showing the value over a few days would be awesome!

EasilyBoredEngineer · 2025-01-21T11:05:50Z

@syssi, I can do, but it will be a few days as I'm currently tuning a PID - pushing shit uphill for a deployment to India - which means lots of deliberate reboots.

What I can say, though, is the graphs would simply show one thing: now the JK disconnects on the hour, every hour. Nothing in my YAML accounts for that. Prior to this change the reboots were random, often many, many times per hour.

It's rock solid now. One an hour. Almost unnoticeable as with 320ms/320ms ble tracker all that happens is you hear the beep, then it's back.

EasilyBoredEngineer · 2025-01-21T13:16:52Z

this likely tells much of the story... far lhs was prior to the changes, middle was after the changes, RHS is me beating the hell out of the ESP as I reboot regularly during pid tuning.

syssi · 2025-01-21T13:35:51Z

I've added the counter here. The setup doesn't get touched / rebooted.

hvdb · 2025-01-21T14:12:33Z

So I see a reset after precisely 5 hours.
With a throttle of 5s
So after 3600 times it reset.

If there is no throttle it will go to 3600 in an hour.
(Given a call every second)

But this is normal behaviour I would say.

I will update to previous version(commit) with the counter to see the difference.

syssi · 2025-01-21T14:42:41Z

Please keep in mind: The throttle mechanism just discards (doesn't process) received frames. The JK-BMS sends/pushes two cell info frames per second (no polling required!). If you aren't interested in such a high update rate, the throttle option can be used to discard most of the data.

I don't expect a connection between the throttle setting and the reconnect frequency.
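
To make the "discard, don't queue" behaviour concrete, here is a small host-side sketch. `ThrottleGate` and `count_processed` are hypothetical names and only approximate what a throttle check of this kind does; the two frames per second (one every 500 ms) come from the description above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a throttle check; not the component's actual code.
class ThrottleGate {
 public:
  explicit ThrottleGate(std::uint32_t throttle_ms) : throttle_ms_(throttle_ms) {}

  // Returns true if the frame arriving at `now_ms` should be processed.
  bool accept(std::uint32_t now_ms) {
    if (has_last_ && now_ms - last_ms_ < throttle_ms_)
      return false;  // the frame is discarded, not queued for later
    last_ms_ = now_ms;
    has_last_ = true;
    return true;
  }

 private:
  std::uint32_t throttle_ms_;
  std::uint32_t last_ms_{0};
  bool has_last_{false};
};

// Counts processed frames when one frame arrives every `period_ms`
// during the first `duration_ms` milliseconds.
inline std::uint32_t count_processed(std::uint32_t throttle_ms, std::uint32_t period_ms,
                                     std::uint32_t duration_ms) {
  ThrottleGate gate(throttle_ms);
  std::uint32_t processed = 0;
  for (std::uint32_t t = 0; t < duration_ms; t += period_ms) {
    if (gate.accept(t))
      ++processed;
  }
  return processed;
}
```

Under these assumptions, one minute of traffic at 500 ms per frame yields 120 received frames; a 5 s throttle processes 12 of them and a 2 s throttle processes 30. The BMS keeps sending at the same rate either way.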

syssi · 2025-01-21T19:23:10Z

My BLE connection is still stable. No disconnect for 5 hours.

EasilyBoredEngineer · 2025-01-21T22:21:42Z

my bms is JK_BD6A17S6P

@syssi that information about the throttle mechanism is interesting. I think the relationship between throttle time and stability was definitely beyond doubt in the previous incarnation of this code. I wonder what the mechanism for that was?

Keep in mind I also have two other ble-clients on this device, all updating every two seconds (they are other esps) and they don't disconnect at all - so occams would say it's likely a jk thing rather than a core esphome thing.

syssi · 2025-01-22T07:08:56Z

In my case it's a:

[08:08:30][I][jk_bms_ble:1258]:   Vendor ID: JK-B2A24S15P
[08:08:30][I][jk_bms_ble:1261]:   Hardware version: 10.XW
[08:08:30][I][jk_bms_ble:1264]:   Software version: 10.07

syssi · 2025-01-22T15:36:35Z

No BLE reconnects for 25 hours.

Kappa71 · 2025-01-23T09:51:10Z

With a PB2A16S20P (firmware 15.38) I get a disconnection exactly every 5 hours (throttle: 5s).

syssi · 2025-01-23T09:57:38Z

@Kappa71 Could you change throttle to 4s or 6s to prove the relation? :-)

syssi · 2025-01-23T09:59:41Z

JK-B2A24S15P: Throttle 10s

syssi · 2025-01-23T10:14:37Z

> So I see a reset after precisely 5 hours.
> With a throttle of 5s
> So after 3600 times it reset.
>
> If there is no throttle it will go to 3600 in an hour.
> (Given a call every second)

I understand your calculation now!

A reconnect occurs every 3600 processed notifications. If throttle is used it takes more time to reach the 3600 processed frames. If no throttle is used the reconnect happens after ~3600 seconds.

I would try to prove this. :-)
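
The hypothesis can be written down as a one-line prediction, which makes it easy to check against other throttle settings. This is only a sketch of the assumption stated above (a reconnect after a fixed number of processed frames, one frame processed per throttle interval); the constant comes from the 5 h at 5 s observation, and the function name is made up for the example.

```cpp
#include <cassert>
#include <cstdint>

// Frame count taken from the calculation quoted above (5 h at a 5 s throttle).
constexpr std::uint32_t kFramesPerReconnect = 3600;

// Predicted seconds between reconnects if one frame is processed
// every `throttle_s` seconds. Valid only if the hypothesis holds.
constexpr std::uint32_t predicted_interval_s(std::uint32_t throttle_s) {
  return kFramesPerReconnect * throttle_s;
}
```

If the hypothesis is right, a throttle of 5 s predicts 18000 s (5 h), 6 s predicts 21600 s (6 h) and 1 s predicts 3600 s (1 h). An interval that stays the same after changing the throttle would speak against it.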

syssi · 2025-01-23T10:16:11Z

Please keep in mind: If no throttle is set in the YAML, the default (2s) is applied:

            cv.Optional(
                CONF_THROTTLE, default="2s"
            ): cv.positive_time_period_milliseconds,

EasilyBoredEngineer · 2025-01-24T07:26:25Z

Yep… glad we’re making progress

Kappa71 · 2025-01-24T07:32:44Z

Throttle = 6s. Still one disconnection every 5 hours.

syssi · 2025-01-24T07:43:14Z

Hmmm... but no disconnects on my side using throttle: 1s.

Kappa71 · 2025-01-24T07:55:07Z

Maybe then it's the PB2A16S20P (inverter BMS) which triggers the disconnection every 5 hours.

EasilyBoredEngineer · 2025-01-29T08:51:40Z

@EasilyBoredEngineer

> Maybe then it's the PB2A16S20P (inverter BMS) which triggers the disconnection every 5 hours.

Hi kappa - alas not in my case as I’m just using the bms - however I can report I’m up with no crashes for 5 days after syssi changes - great work Seb. 👍👍

EasilyBoredEngineer · 2025-02-03T02:36:16Z

Now running this via BLE on a c3 talking to two Victron sccs and running a WireGuard client. Not even raising a sweat. So impressed.

syssi mentioned this issue Jan 17, 2025

Unregister BLE notifications on disconnect #666

Merged

syssi added a commit that referenced this issue Jan 18, 2025

Unregister BLE notifications on disconnect (#666)

ed86a33

See #659

syssi changed the title ~~Syssi as an FYI - Stack Trace.~~ Unregister BLE notifications to improve stability Jan 22, 2025

abc2006 mentioned this issue Jan 25, 2025

Bluetooth-Connection to BMS stops working #671

Open
