Unregister BLE notifications to improve stability #659

Open
EasilyBoredEngineer opened this issue Jan 10, 2025 · 42 comments

Comments

@EasilyBoredEngineer

G'day Syssi,

I've been having a few issues with stability when using your jk ble code.

I just thought I'd pass on the stacktrace, and an AI analysis of it :)

```
0x4009025d: esp_system_abort at /home/dcrad/.platformio/packages/framework-espidf/components/esp_system/esp_system.c:137
0x40096416: abort at /home/dcrad/.platformio/packages/framework-espidf/components/newlib/abort.c:46
0x401a8157: __cxxabiv1::__terminate(void (*)()) at /builds/idf/crosstool-NG/.build/xtensa-esp32-elf/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:47
0x401a819e: std::terminate() at /builds/idf/crosstool-NG/.build/xtensa-esp32-elf/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:57
0x401a8fd9: __cxa_allocate_exception at /builds/idf/crosstool-NG/.build/xtensa-esp32-elf/src/gcc/libstdc++-v3/libsupc++/eh_alloc.cc:300
0x401a81d0: operator new(unsigned int) at /builds/idf/crosstool-NG/.build/xtensa-esp32-elf/src/gcc/libstdc++-v3/libsupc++/new_op.cc:54
0x400f3245: std::unique_ptr<esphome::Scheduler::SchedulerItem, std::default_delete<esphome::Scheduler::SchedulerItem> > esphome::make_unique<esphome::Scheduler::SchedulerItem>() at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/core/helpers.h:77
0x400f3496: esphome::Scheduler::set_timeout(esphome::Component*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, std::function<void ()>) at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/core/scheduler.cpp:34
0x400f2653: esphome::Component::set_timeout(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, std::function<void ()>&&) at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/core/component.cpp:70
0x400ec911: esphome::sensor::TimeoutFilter::new_value(float) at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/components/sensor/filter.cpp:386
0x400ec9a6: esphome::sensor::Filter::input(float) at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/components/sensor/filter.cpp:15
0x400ed38c: esphome::sensor::Sensor::publish_state(float) at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/components/sensor/sensor.cpp:48
0x400e73f9: esphome::jk_bms_ble::JkBmsBle::publish_state_(esphome::sensor::Sensor*, float) at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/components/jk_bms_ble/jk_bms_ble.cpp:1544
0x400e877a: esphome::jk_bms_ble::JkBmsBle::decode_jk02_cell_info_(std::vector<unsigned char, std::allocator<unsigned char> > const&) at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/components/jk_bms_ble/jk_bms_ble.cpp:520 (discriminator 4)
0x400e907c: esphome::jk_bms_ble::JkBmsBle::decode_(std::vector<unsigned char, std::allocator<unsigned char> > const&) at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/components/jk_bms_ble/jk_bms_ble.cpp:300
0x400e9172: esphome::jk_bms_ble::JkBmsBle::assemble(unsigned char const*, unsigned short) at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/components/jk_bms_ble/jk_bms_ble.cpp:279
0x400e926a: esphome::jk_bms_ble::JkBmsBle::gattc_event_handler(esp_gattc_cb_event_t, unsigned char, esp_ble_gattc_cb_param_t*) at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/components/jk_bms_ble/jk_bms_ble.cpp:228
0x400da825: esphome::ble_client::BLEClient::gattc_event_handler(esp_gattc_cb_event_t, unsigned char, esp_ble_gattc_cb_param_t*) at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/components/ble_client/ble_client.cpp:54 (discriminator 2)
0x401bfdfe: esphome::esp32_ble_tracker::ESP32BLETracker::gattc_event_handler(esp_gattc_cb_event_t, unsigned char, esp_ble_gattc_cb_param_t*) at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/components/esp32_ble_tracker/esp32_ble_tracker.cpp:408 (discriminator 2)
0x400dea35: non-virtual thunk to esphome::esp32_ble_tracker::ESP32BLETracker::gattc_event_handler(esp_gattc_cb_event_t, unsigned char, esp_ble_gattc_cb_param_t*) at ??:?
0x401bfda9: esphome::esp32_ble::ESP32BLE::real_gattc_event_handler_(esp_gattc_cb_event_t, unsigned char, esp_ble_gattc_cb_param_t*) at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/components/esp32_ble/ble.cpp:361 (discriminator 2)
0x400dd65d: esphome::esp32_ble::ESP32BLE::loop() at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/components/esp32_ble/ble.cpp:308
0x401c0699: esphome::Component::call_loop() at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/core/component.cpp:77
0x401c074d: esphome::Component::call() at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/core/component.cpp:104
0x400f12c5: esphome::Application::loop() at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/core/application.cpp:74 (discriminator 2)
0x400f555a: loop() at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/bluebox-common-jk2.yaml:1646
0x400dc97a: esphome::loop_task(void*) at /home/dcrad/esphome-projects/.esphome/build/curve-master-3/src/esphome/components/esp32/core.cpp:69 (discriminator 1)```

AI analysis...

Certainly! Let's focus on the stack frames that directly involve the `jk_bms_ble` component:

---

### **Relevant Stack Frames**

#### 1. **`esphome::jk_bms_ble::JkBmsBle::publish_state_`**  
   - **Location:**  
     ```plaintext
     /src/esphome/components/jk_bms_ble/jk_bms_ble.cpp:1544
     ```
   - **Description:**  
     This function is responsible for publishing sensor states to the ESPHome framework. It is called during the BLE data handling pipeline when decoded data is forwarded to associated sensors.

   - **Potential Issue:**  
     If too many notifications are processed in rapid succession, this function might be overwhelmed. This can lead to:
     - Excessive calls to `Sensor::publish_state(float)`, which might trigger `TimeoutFilter` operations unnecessarily.
     - Memory issues due to frequent allocations for sensor updates.

---

#### 2. **`esphome::jk_bms_ble::JkBmsBle::decode_jk02_cell_info_`**  
   - **Location:**  
     ```plaintext
     /src/esphome/components/jk_bms_ble/jk_bms_ble.cpp:520
     ```
   - **Description:**  
     This function handles decoding of cell-specific information from BLE data. It's a critical part of processing incoming BLE notifications.

   - **Potential Issue:**  
     If BLE notifications are arriving too frequently or contain unexpected data, this function might:
     - Process redundant or unnecessary data.
     - Create excessive objects, resulting in memory fragmentation.
     - Trigger cascading updates to sensors.

   - **Debugging Suggestions:**
     - Log the incoming data to identify if redundant or spurious notifications are being processed.
     - Check if all notifications are meaningful for your use case.

---

#### 3. **`esphome::jk_bms_ble::JkBmsBle::decode_`**  
   - **Location:**  
     ```plaintext
     /src/esphome/components/jk_bms_ble/jk_bms_ble.cpp:300
     ```
   - **Description:**  
     This function appears to be a general decoder for BLE data. It likely parses raw BLE packets and routes them to more specific functions like `decode_jk02_cell_info_`.

   - **Potential Issue:**  
     - Inefficient decoding logic could lead to unnecessary memory usage.
     - If unhandled BLE packets are received, they might trigger errors or excessive operations.

   - **Debugging Suggestions:**
     - Ensure the decoder filters out irrelevant packets as early as possible.
     - Log the size and type of incoming BLE packets for analysis.

---

#### 4. **`esphome::jk_bms_ble::JkBmsBle::assemble`**  
   - **Location:**  
     ```plaintext
     /src/esphome/components/jk_bms_ble/jk_bms_ble.cpp:279
     ```
   - **Description:**  
     This function likely reconstructs fragmented BLE data packets into a complete message for processing.

   - **Potential Issue:**  
     - Memory usage might spike if fragmented packets are frequent or if improper bounds checking occurs.
     - Repeated allocations for packet assembly could fragment the heap.

   - **Debugging Suggestions:**
     - Log the buffer size and frequency of assembly operations.
     - Ensure proper bounds checking and avoid unnecessary allocations.

---

#### 5. **`esphome::jk_bms_ble::JkBmsBle::gattc_event_handler`**  
   - **Location:**  
     ```plaintext
     /src/esphome/components/jk_bms_ble/jk_bms_ble.cpp:228
     ```
   - **Description:**  
     This function handles BLE GATTC (Generic Attribute Profile Client) events, such as receiving notifications or responses from BLE devices.

   - **Potential Issue:**  
     - The handler might be overwhelmed by the frequency or size of incoming BLE events.
     - Inefficient handling could lead to delays and cascading memory issues.

   - **Debugging Suggestions:**
     - Log the event type and frequency to detect if notifications are too frequent.
     - Limit or throttle the handling of non-critical events.

---

### **Suggestions for `jk_bms_ble` Component**
1. **Throttle Notifications:**
   - Add logic to discard or debounce redundant notifications.

2. **Optimize Decoding and Publishing:**
   - Ensure only necessary data is processed and published to sensors.

3. **Log Critical Points:**
   - Log details like notification frequency, packet size, and decoding time.

4. **Test with Reduced BLE Traffic:**
   - Temporarily limit the number of connected BLE devices or adjust scanning intervals to reduce incoming traffic.

---
@syssi
Owner

syssi commented Jan 10, 2025

Is this a stack trace of an "out of memory (OOM) exception"?

@EasilyBoredEngineer
Author

No, syssi, it's just a crash trace that refers to out of memory as the reason for the crash.

This may or may not have anything to do with your component itself (I'm no expert on stack trace interpretation), but it is probably to do with the ESPHome BLE implementation, which is very, very memory intensive.

One thing I do know for sure: with 3 BLE clients on my system, if I turn off the jk_bms BLE client, the system is stable. It could just be a straw / camel's back thing.

@EasilyBoredEngineer
Author

@syssi if I throttle to 60s these crashes / memory leaks slow down. Is there any method we can call to adjust throttle in lambda?

also, these frequent crashes have only started in the last two weeks. They were occasional previously, now minute by minute.

@syssi
Owner

syssi commented Jan 16, 2025

Please enable the debug component and track the free heap. Does the memory fill up over time? I'm aware of expensive BLE reconnects which cause crashes because this code path requires too much (dynamic) memory.

@EasilyBoredEngineer
Author

Image
Image

Hi @syssi,

Here are the memory graphs. I'm on a 60 second throttle at the moment and it's been stable for 12 hours. Regarding 'this code path requires too much dynamic memory', do you mean in your codebase or ESPHome's?

Is there any method to dynamically change the throttle in lambda?

@syssi
Owner

syssi commented Jan 16, 2025

The underlying BLE stack (ESP-IDF) consumes the memory. I assume it cannot be improved here. Maybe it can be addressed in the ESPHome core, but I don't think so.

@EasilyBoredEngineer
Author

Syssi, I don't know if this is all gobbledygook, but here is some AI output about your BLE connection handling...


The provided code potentially introduces a memory leak during BLE reconnection due to the improper handling of resources when connections are dropped and re-established. To address this issue, consider the following changes:

1. Proper Resource Cleanup on Disconnection

When a BLE device disconnects, ensure that all allocated resources related to the BLE connection, such as registered notifications and dynamic memory, are properly released. For instance:

  • Unregister notifications when a disconnection occurs.
  • Clear any dynamically allocated buffers or objects.

Update the ESP_GATTC_DISCONNECT_EVT case:

case ESP_GATTC_DISCONNECT_EVT: {
  this->node_state = espbt::ClientState::IDLE;
  this->status_notification_received_ = false;

  // Cleanup notification registration if necessary
  esp_ble_gattc_unregister_for_notify(this->parent()->get_gattc_if(), this->parent()->get_remote_bda(), this->notify_handle_);

  // Clear the frame buffer to free memory
  this->frame_buffer_.clear();

  break;
}

2. Avoid Repeated Notification Registrations

Repeated calls to esp_ble_gattc_register_for_notify during reconnections without unregistering can lead to leaks. To prevent this, ensure that registration only happens once or that it is properly cleaned up:

  • Track the registration status and only register if it has not been done.
  • Unregister notifications during disconnection.
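
The "track the registration status" idea can be sketched without any BLE hardware. This is only an illustration: `NotifyRegistration` is a hypothetical helper, not part of jk_bms_ble or ESP-IDF, and the two callbacks stand in for `esp_ble_gattc_register_for_notify` and `esp_ble_gattc_unregister_for_notify`.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical helper, not part of jk_bms_ble or ESP-IDF: it only tracks
// whether a notification registration is currently active.
class NotifyRegistration {
 public:
  // The two callbacks stand in for esp_ble_gattc_register_for_notify() and
  // esp_ble_gattc_unregister_for_notify(); each returns true on success.
  NotifyRegistration(std::function<bool()> do_register, std::function<bool()> do_unregister)
      : do_register_(std::move(do_register)), do_unregister_(std::move(do_unregister)) {
    assert(this->do_register_ && this->do_unregister_);
  }

  // Call once the characteristic has been discovered after a (re)connect.
  // Registers only if no registration is active.
  void on_connected() {
    if (!this->registered_ && this->do_register_())
      this->registered_ = true;
  }

  // Call from the disconnect event; a no-op if nothing is registered.
  void on_disconnected() {
    if (this->registered_) {
      this->do_unregister_();
      this->registered_ = false;
    }
  }

  bool registered() const { return this->registered_; }

 private:
  std::function<bool()> do_register_;
  std::function<bool()> do_unregister_;
  bool registered_{false};
};
```

With this shape, repeated connect events register once, and every disconnect pairs with exactly one unregister call.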

3. Validate Buffers and Free Unused Memory

In the assemble method, ensure buffers are cleared when a frame is invalid or when reconnections happen:

if (this->frame_buffer_.size() > MAX_RESPONSE_SIZE) {
  ESP_LOGW(TAG, "Frame dropped because of invalid length");
  this->frame_buffer_.clear();
}

Additionally, ensure no dangling references to the buffer persist across connections.
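
A minimal, hardware-free sketch of that buffer handling follows. `FrameAssembler` and the limit `kMaxResponseSize = 320` are assumptions made for illustration; the component's real `MAX_RESPONSE_SIZE` and `assemble` code are not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical limit for illustration only; the component's real
// MAX_RESPONSE_SIZE value is not shown in this thread.
constexpr std::size_t kMaxResponseSize = 320;

// Hypothetical stand-in for the component's frame buffer handling.
class FrameAssembler {
 public:
  // Appends one notification payload. If the buffer has already grown past
  // the limit, the stale bytes are dropped before the new chunk is stored.
  void append(const uint8_t *data, std::size_t length) {
    assert(data != nullptr || length == 0);
    if (this->buffer_.size() > kMaxResponseSize) {
      ++this->dropped_frames_;
      this->buffer_.clear();
    }
    this->buffer_.insert(this->buffer_.end(), data, data + length);
  }

  // Call on disconnect so a partial frame never carries over into the next
  // connection.
  void reset() { this->buffer_.clear(); }

  std::size_t size() const { return this->buffer_.size(); }
  std::size_t dropped_frames() const { return this->dropped_frames_; }

 private:
  std::vector<uint8_t> buffer_;
  std::size_t dropped_frames_{0};
};
```

Note that `std::vector::clear()` empties the vector but keeps its capacity, so this bounds buffer growth rather than returning memory to the heap.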

4. Debugging and Logging

Add detailed logging to track BLE connection states, resource usage, and cleanup operations. This can help identify memory leaks during runtime.

5. Use Smart Pointers or Scoped Allocations

If the code uses dynamically allocated resources (e.g., std::vector), ensure they are cleared or use smart pointers where appropriate to manage their lifecycle.


Example Updates to the Code

Below is a streamlined example:

case ESP_GATTC_DISCONNECT_EVT: {
  ESP_LOGI(TAG, "Device disconnected");
  this->node_state = espbt::ClientState::IDLE;
  this->status_notification_received_ = false;

  // Cleanup notifications
  auto status = esp_ble_gattc_unregister_for_notify(this->parent()->get_gattc_if(), 
                                                    this->parent()->get_remote_bda(), 
                                                    this->notify_handle_);
  if (status != ESP_OK) {
    ESP_LOGW(TAG, "Failed to unregister notifications, status=%d", status);
  }

  // Clear buffers
  this->frame_buffer_.clear();

  break;
}

By following these steps, you can mitigate the risk of memory leaks caused by improper cleanup during BLE reconnections. Ensure that all cleanup operations are robust and thoroughly tested under various scenarios, such as rapid connect/disconnect cycles.

@syssi
Owner

syssi commented Jan 16, 2025

Good catch! I wasn't aware of the unregister method for BLE notifications. I will give it a try later this week. The other suggestions aren't important IMO.

@EasilyBoredEngineer
Author

Thanks mate. PS: there's an ESP-IDF NimBLE implementation that I've seen others use at https://github.com/h2zero/esp-nimble-cpp/tree/master, but I don't have the foggiest idea where to start integrating it.

one dude has a go - vgijssel/setup#127

@EasilyBoredEngineer
Author

pps - throttle can be set dynamically in lambda by using id(bms0).set_throttle(number of milliseconds)

@EasilyBoredEngineer
Author

@syssi - here's some other comments, that, again, will need to be filtered. As an aside, since I've implemented variable timing using the set_throttle method, the memory leak seems to have largely become a non-issue. I have times in my implementation where I'm using the cell voltage delta with a PID to let the low cells 'catch up' by keeping the high cell at 3.41v. When I'm doing that, I need a refresh of about 2s to feed the PID, so I'm now using set_throttle to up the refresh rate to 2s at those times and leaving it at 60s the rest of the time. It's heaps more stable.

In any case, here's the AI thoughts... I asked it whether it thought your ble code could be the cause of the following errors...


```
[14:04:12][D][esp-idf:000][BTU_TASK]: W (2620933) BT_APPL: gattc_conn_cb: if=4 st=0 id=516 rsn=0x8
[14:04:12][D][esp-idf:000][BTU_TASK]: W (2620936) BT_APPL: gattc_conn_cb: if=5 st=0 id=517 rsn=0x8
[14:04:13][D][esp-idf:000][BTU_TASK]: W (2620939) BT_HCI: hcif disc complete: hdl 0x2, rsn 0x8
[14:04:13][E][ble_client_lambda:131]: Disconnected from ble jk_client
```


and it said...

This code snippet implements a GATT client event handler for BLE in ESP-IDF. Several potential issues or areas for improvement in this code could lead to the disconnection errors (`rsn=0x8`) you observed. Let’s analyze it step by step:

### Key Observations and Possible Problems

1. **`ESP_GATTC_OPEN_EVT` Handling**
   - The `ESP_GATTC_OPEN_EVT` case is empty. This event indicates that the BLE connection has been established, and it might be a good place to validate the connection, log success, or initiate the next steps.  
   - **Suggestion:** Add logging or connection validation here to ensure the connection is stable.

   ```cpp
   case ESP_GATTC_OPEN_EVT: {
       ESP_LOGI(TAG, "Connection opened to device.");
       break;
   }
   ```

2. **Error Handling for `ESP_GATTC_REG_FOR_NOTIFY_EVT`**
   - In the `ESP_GATTC_REG_FOR_NOTIFY_EVT` case, the code assumes the registration for notifications was successful, but it does not handle errors if the registration fails (e.g., the characteristic might not support notifications).
   - **Suggestion:** Validate the registration and handle errors explicitly.

3. **Characteristic Discovery (`ESP_GATTC_SEARCH_CMPL_EVT`)**
   - If the characteristic (`chr`) is not found, the code logs an error but does not actively disconnect from the device. This can leave the connection hanging unnecessarily.
   - **Suggestion:** Actively disconnect or handle the situation where the required characteristic is not found.

   ```cpp
   if (chr == nullptr) {
       ESP_LOGE(TAG, "[%s] No control service found at device, disconnecting.",
                this->parent_->address_str().c_str());
       esp_ble_gattc_close(this->parent()->get_gattc_if(), this->parent()->get_conn_id());
       break;
   }
   ```

4. **Timeouts and Reconnection**
   - The disconnection reason `0x8` (Connection Timeout) might occur if the BLE server does not respond to requests promptly. This can happen if:
     - The server is overloaded or misbehaving.
     - The client sends too many requests too quickly.
   - **Suggestion:** Ensure that requests to the BLE server are appropriately spaced and monitor response times.

5. **Unhandled Notifications**
   - In `ESP_GATTC_NOTIFY_EVT`, notifications are processed only if the handle matches `notify_handle_`. If the server sends notifications for other characteristics, they are ignored silently. This could lead to a timeout if the expected responses are missed.
   - **Suggestion:** Add logging to capture unexpected notifications.

   ```cpp
   if (param->notify.handle != this->notify_handle_) {
       ESP_LOGW(TAG, "Unexpected notification handle: 0x%X", param->notify.handle);
       break;
   }
   ```

6. **Resource Cleanup on Disconnect (`ESP_GATTC_DISCONNECT_EVT`)**
   - The `ESP_GATTC_DISCONNECT_EVT` case handles the state cleanup (`node_state` and `status_notification_received_`), but it does not attempt to reconnect or log detailed reasons for the disconnect.
   - **Suggestion:** Log the reason for the disconnect and consider implementing a reconnection mechanism if appropriate.

   ```cpp
   case ESP_GATTC_DISCONNECT_EVT: {
       ESP_LOGW(TAG, "Disconnected from device. Cleaning up resources.");
       this->node_state = espbt::ClientState::IDLE;
       this->status_notification_received_ = false;
       // Optionally attempt reconnection here
       break;
   }
   ```

Debugging Tips

  1. Increase Logging Verbosity

    • Use ESP_LOGV or ESP_LOGVV to log additional debug information, especially in cases like ESP_GATTC_OPEN_EVT and ESP_GATTC_SEARCH_CMPL_EVT.
  2. Verify BLE Server Behavior

    • Ensure the BLE server supports the required services and characteristics.
    • Check if the server has resource constraints (e.g., connection limits).
  3. Monitor Timing

    • BLE communication relies on strict timing. Use appropriate connection intervals and ensure no operation exceeds its expected duration.
  4. Enable Error Handling

    • Always check the return values of BLE API calls (e.g., esp_ble_gattc_register_for_notify) and handle errors.

Conclusion

This code is functional but has some gaps in error handling, resource management, and logging. These issues might contribute to the connection timeouts you're experiencing. Implementing the suggestions above should help stabilize the BLE connection and make troubleshooting easier.

@syssi
Owner

syssi commented Jan 17, 2025

Just for fun: Ask the AI whether it is important to unregister the notifications and whether leaks really occur here?

IMO the Bluetooth stack takes care of this on disconnect as a last resort, so no memory should leak here.

Nevertheless better safe than sorry: #666

@EasilyBoredEngineer
Author

Yes, unregistering notifications is important for maintaining proper BLE resource management and avoiding potential issues such as resource leaks or unexpected behavior. While the ESP-IDF BLE stack handles some cleanup when connections are closed, it's good practice to explicitly unregister notifications when they are no longer needed or when the connection is about to be terminated.

Why Unregistering Notifications is Important

  1. Resource Management:

    • Notifications consume resources on both the client and server. Explicitly unregistering them ensures these resources are freed promptly.
    • Failure to unregister notifications might lead to stale references in the BLE stack, which can cause memory leaks or other unexpected behavior.
  2. Avoiding Memory Leaks:

    • The ESP-IDF BLE stack is designed to clean up most resources on disconnection, but it’s not guaranteed to handle everything perfectly. Explicitly unregistering notifications reduces the risk of orphaned memory allocations or other resource leaks.
  3. Error Prevention:

    • If a new connection is established and tries to register notifications for the same characteristic without first unregistering, it could lead to conflicts or errors.
  4. Best Practices:

    • Explicit cleanup ensures that your application handles BLE resources responsibly, making it more robust and portable.

How to Unregister Notifications

The ESP-IDF provides the esp_ble_gattc_unregister_for_notify function to unregister notifications. You can call this function in the ESP_GATTC_DISCONNECT_EVT case to ensure cleanup when the connection is terminated.

Example Code:

Here’s how you can modify the ESP_GATTC_DISCONNECT_EVT handler to unregister notifications:

case ESP_GATTC_DISCONNECT_EVT: {
    ESP_LOGW(TAG, "Disconnected from device. Cleaning up resources.");

    // Unregister for notifications
    auto status = esp_ble_gattc_unregister_for_notify(this->parent()->get_gattc_if(),
                                                      this->parent()->get_remote_bda(),
                                                      this->notify_handle_);
    if (status) {
        ESP_LOGW(TAG, "Failed to unregister notifications, status=%d", status);
    } else {
        ESP_LOGI(TAG, "Successfully unregistered notifications.");
    }

    // Clean up internal state
    this->node_state = espbt::ClientState::IDLE;
    this->status_notification_received_ = false;
    break;
}

Are Leaks Likely Without Explicit Unregistration?

While the BLE stack should clean up resources on disconnection, there’s still a risk of memory leaks or orphaned states if notifications are not explicitly unregistered. The risk depends on:

  • The version of ESP-IDF you're using (newer versions tend to handle cleanup better).
  • The complexity of your application (e.g., if connections are frequent, lingering resources could accumulate over time).

Best Practice Summary

  1. Always Unregister Notifications:

    • Explicitly unregister notifications in the ESP_GATTC_DISCONNECT_EVT handler or when notifications are no longer needed.
  2. Test for Leaks:

    • Use tools like the ESP-IDF heap_caps_dump() or enable heap debugging to monitor memory usage and check for leaks.
  3. Clean Up Proactively:

    • Even if ESP-IDF handles cleanup during disconnection, taking responsibility for unregistering notifications ensures your application is robust and portable across versions and platforms.

@EasilyBoredEngineer
Author

@syssi early indicators from your change... (assuming I did it right? added - source: github://syssi/esphome-jk-bms@90c7a0f) more free heap than I've ever seen - and it's very stable. I'm running without throttle at the moment to see whether it sticks. I've still had a few disconnects.

@EasilyBoredEngineer
Author

@syssi, also, the jk bms ble disconnects are down to exactly one per hour, on the hour, which I assume is something that the bms itself does. It was previously more regular.

@syssi
Owner

syssi commented Jan 18, 2025

I've merged the change into main. Please give it another try! I did not expect any major impact.

@EasilyBoredEngineer
Author

Well it has, @syssi - 24 hours now running with no throttle, zero crashes.

@EasilyBoredEngineer
Author

@syssi now 48hrs... no throttle, zero crashes.

@syssi
Owner

syssi commented Jan 20, 2025

Maybe we should implement a "reconnect counter" for easier performance measurements.

@hvdb

hvdb commented Jan 20, 2025

Previously this was not a workable solution for me, it would work for a few hours (if lucky) and then crash and needed to restart and repeat...

Now it works way better, only some occasional resets. But those are probably normal.
I did set the throttle to 30s, but think that can be lower now again.
Very good find I would say, thanks @syssi and @EasilyBoredEngineer

@EasilyBoredEngineer
Author

EasilyBoredEngineer commented Jan 21, 2025

> Previously this was not a workable solution for me, it would work for a few hours (if lucky) and then crash and needed to restart and repeat...
>
> Now it works way better, only some occasional resets. But those are probably normal. I did set the throttle to 30s, but think that can be lower now again. Very good find I would say, thanks @syssi and @EasilyBoredEngineer

Great!!! Yes on mine the connection resets every hour, exactly 60 minutes - I think it must be something the BMS does. I agree regarding the rest though - super stable now. You don't even need the throttle anymore.

I'm glad that my little finding has been helpful, as @syssi has given me so so so much help over the last year or more.

@syssi
Owner

syssi commented Jan 21, 2025

@EasilyBoredEngineer Could you add something like this to your YAML to get a better idea of the number of disconnects per day:

globals:
  - id: total_ble_disconnects
    type: int
    restore_value: false
    initial_value: "0"

sensor:
  - platform: template
    name: "Total BLE disconnects"
    accuracy_decimals: 0
    update_interval: 60s
    lambda: |-
      return id(total_ble_disconnects) * 1.0f;

ble_client:
  - mac_address: ${mac_address}
    id: client0
    on_disconnect:
      then:
        - lambda: |-
            id(total_ble_disconnects) += 1;

A chart showing the value over a few days would be awesome!

@EasilyBoredEngineer
Author

@syssi, I can do, but it will be a few days as I'm currently tuning a PID - pushing shit uphill for a deployment to India - which means lots of deliberate reboots.

What I can say, though, is the graphs would simply show one thing - Now, The JK disconnects on the hour, every hour. Nothing in my Yaml accounts for that. Prior to this change the reboots were random, often many many times per hour.

It's rock solid now. One an hour. Almost unnoticeable as with 320ms/320ms ble tracker all that happens is you hear the beep, then it's back.

@EasilyBoredEngineer
Author

EasilyBoredEngineer commented Jan 21, 2025

Image

this likely tells much of the story... far lhs was prior to the changes, middle was after the changes, RHS is me beating the hell out of the ESP as I reboot regularly during pid tuning.

@syssi
Owner

syssi commented Jan 21, 2025

I've added the counter here. The setup doesn't get touched / rebooted.

Image

@hvdb

hvdb commented Jan 21, 2025

So I see a reset after precisely 5hours.
With a throttle of 5s
So after 3600 times it reset.

If there is no throttle it will go to 3600 in an hour.
(Given a call every second)

But this is normal behaviour I would say.

I will update to previous version(commit) with the counter to see the difference.

@syssi
Owner

syssi commented Jan 21, 2025

Please keep in mind: The throttle mechanism just discards (doesn't process) received frames. The JK-BMS sends/pushes two cell info frames per second (no polling required!). If you aren't interested in such a high update rate, the throttle option can be used to discard most of the data.

I don't expect a connection between the throttle setting and the reconnect frequency.
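
As a rough illustration of "throttle just discards frames", here is a standalone sketch. `FrameThrottle` and `count_processed` are hypothetical names, not the component's actual code; the only facts taken from the thread are the push rate of two frames per second and the meaning of the throttle interval.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the throttle described above, not the
// component's actual code: a frame is processed only if at least
// `interval_ms` has passed since the last processed frame.
class FrameThrottle {
 public:
  explicit FrameThrottle(uint32_t interval_ms) : interval_ms_(interval_ms) {}

  // `now_ms` is a millisecond timestamp. Unsigned subtraction keeps the
  // comparison correct when the counter wraps around.
  bool should_process(uint32_t now_ms) {
    if (this->has_last_ && (now_ms - this->last_ms_) < this->interval_ms_)
      return false;  // discard: too soon after the last processed frame
    this->last_ms_ = now_ms;
    this->has_last_ = true;
    return true;
  }

 private:
  uint32_t interval_ms_;
  uint32_t last_ms_{0};
  bool has_last_{false};
};

// Counts how many of the frames pushed every `frame_period_ms` during
// `duration_ms` would be processed rather than discarded.
uint32_t count_processed(FrameThrottle &throttle, uint32_t frame_period_ms, uint32_t duration_ms) {
  assert(frame_period_ms > 0);
  uint32_t processed = 0;
  for (uint32_t t = 0; t < duration_ms; t += frame_period_ms) {
    if (throttle.should_process(t))
      ++processed;
  }
  return processed;
}
```

With frames every 500 ms over 10 s, a 2 s throttle processes 5 of the 20 frames; the other 15 are still received over BLE, just not decoded.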

@syssi
Owner

syssi commented Jan 21, 2025

My BLE connection is still stable. No disconnect for 5 hours.

@EasilyBoredEngineer
Author

EasilyBoredEngineer commented Jan 21, 2025

my bms is JK_BD6A17S6P

@syssi that information about the throttle mechanism is interesting. I think the relationship between throttle time and stability was definitely beyond doubt in the previous incarnation of this code. I wonder what the mechanism for that was?

Keep in mind I also have two other ble-clients on this device, all updating every two seconds (they are other esps) and they don't disconnect at all - so occams would say it's likely a jk thing rather than a core esphome thing.

@syssi
Owner

syssi commented Jan 22, 2025

In my case it's a:

[08:08:30][I][jk_bms_ble:1258]:   Vendor ID: JK-B2A24S15P
[08:08:30][I][jk_bms_ble:1261]:   Hardware version: 10.XW
[08:08:30][I][jk_bms_ble:1264]:   Software version: 10.07

@syssi changed the title from "Syssi as an FYI - Stack Trace." to "Unregister BLE notifications to improve stability" on Jan 22, 2025
@syssi
Owner

syssi commented Jan 22, 2025

No BLE reconnects for 25 hours.

@Kappa71

Kappa71 commented Jan 23, 2025

With a PB2A16S20P (firmware 15.38) I get a disconnection exactly every 5 hours (throttle: 5s).

Image

@syssi
Owner

syssi commented Jan 23, 2025

@Kappa71 Could you change throttle to 4s or 6s to prove the relation? :-)

@syssi
Owner

syssi commented Jan 23, 2025

JK-B2A24S15P: Throttle 10s
Image

@syssi
Owner

syssi commented Jan 23, 2025

> So I see a reset after precisely 5hours.
> With a throttle of 5s
> So after 3600 times it reset.
>
> If there is no throttle it will go to 3600 in an hour.
> (Given a call every second)

I understand your calculation now!

A reconnect occurs every 3600 processed notifications. If throttle is used it takes more time to reach the 3600 processed frames. If no throttle is used the reconnect happens after ~3600 seconds.

I would try to prove this. :-)
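
The arithmetic behind that hypothesis, as a tiny sketch. The constant 3600 comes from the calculation quoted above; the function name is made up for illustration, and the hypothesis itself is unproven at this point in the thread.

```cpp
#include <cassert>
#include <cstdint>

// Frames processed between two reconnects, per the hypothesis above.
constexpr uint32_t kFramesPerReconnect = 3600;

// Expected seconds between reconnects if one frame is processed every
// `throttle_s` seconds and a reconnect follows every 3600 processed frames.
uint32_t expected_reconnect_interval_s(uint32_t throttle_s) {
  assert(throttle_s > 0);
  return kFramesPerReconnect * throttle_s;
}
```

For a 5 s throttle this gives 18000 s, i.e. the 5 hours reported. If the hypothesis holds, a 4 s throttle should give 4 hours and a 6 s throttle 6 hours, which is what the 4 s / 6 s test requested earlier would check.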

@syssi
Owner

syssi commented Jan 23, 2025

Please keep in mind: If no throttle is used at the YAML the default (2s) is applied:

            cv.Optional(
                CONF_THROTTLE, default="2s"
            ): cv.positive_time_period_milliseconds,

@EasilyBoredEngineer
Author

Image

Yep… glad we’re making progress

@Kappa71

Kappa71 commented Jan 24, 2025

Throttle = 6s. Still one disconnection every 5 hours.

Image

@syssi
Owner

syssi commented Jan 24, 2025

Hmmm... but no disconnects on my side using throttle: 1s.

@Kappa71

Kappa71 commented Jan 24, 2025

Maybe then it's the PB2A16S20P (inverter BMS) which triggers the disconnection every 5hours.

@EasilyBoredEngineer
Author

> Maybe then it's the PB2A16S20P (inverter BMS) which triggers the disconnection every 5hours.

Hi kappa - alas not in my case as I’m just using the bms - however I can report I’m up with no crashes for 5 days after syssi changes - great work Seb. 👍👍

@EasilyBoredEngineer
Author

Now running this via BLE on a c3 talking to two Victron sccs and running a WireGuard client. Not even raising a sweat. So impressed.
