Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exception and connection drop during SAS Token renewal. Reconnect attempts fail until IoT Edge Module exited and restarted #1202

Open
gbe-tuv opened this issue Jan 20, 2025 · 10 comments
Labels

Comments

@gbe-tuv
Copy link

gbe-tuv commented Jan 20, 2025

Context

  • OS and version used:
    Operating System: Debian GNU/Linux 11 (bullseye)
    Kernel: Linux 6.1.28
  • Iot Edge version used:
    IoT Edge 1.5.13
  • Python version:
    Python 3.9.2
  • pip version:
    pip 24.3.1
  • list of installed packages:
    avro 1.12.0
    azure-core 1.32.0
    azure-iot-device 2.14.0
    azure-storage-blob 12.24.0
    beautifulsoup4 4.12.3
    certifi 2024.12.14
    charset-normalizer 3.4.1
    cryptography 3.3.2
    deprecation 2.1.0
    idna 3.10
    isodate 0.7.2
    janus 2.0.0
    mccmnc 3.3
    packaging 24.2
    paho-mqtt 1.6.1
    pip 24.3.1
    PyGObject 3.38.0
    PySocks 1.7.1
    pyzmq 26.2.0
    requests 2.32.3
    requests-unixsocket2 0.4.2
    setuptools 52.0.0
    six 1.16.0
    soupsieve 2.6
    tqdm 4.67.1
    typing_extensions 4.12.2
    urllib3 2.3.0
    websockets 14.1
    wheel 0.34.2

Description of the issue

Hello,

We have a custom Iot Edge module that periodically sends messages to Iot Hub.

Establishing connection and sending messages works for about 24 hours of container runtime. After that timeframe connection drops and an exception occurs during renewal of SAS token:

File "/usr/local/lib/python3.9/dist-packages/azure/iot/device/common/mqtt_transport.py", line 396, in connect
    rc = self._mqtt_client.connect(
  File "/usr/local/lib/python3.9/dist-packages/paho/mqtt/client.py", line 914, in connect
    return self.reconnect()
  File "/usr/local/lib/python3.9/dist-packages/paho/mqtt/client.py", line 1044, in reconnect
    sock = self._create_socket_connection()
  File "/usr/local/lib/python3.9/dist-packages/paho/mqtt/client.py", line 3685, in _create_socket_connection
    return socket.create_connection(addr, timeout=self._connect_timeout, source_address=source)
  File "/usr/lib/python3.9/socket.py", line 822, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "/usr/lib/python3.9/socket.py", line 953, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/azure/iot/device/common/pipeline/pipeline_stages_mqtt.py", line 206, in _run_op
    self.transport.connect(password=password)
  File "/usr/local/lib/python3.9/dist-packages/azure/iot/device/common/mqtt_transport.py", line 419, in connect
    raise exceptions.ConnectionFailedError() from e
azure.iot.device.common.transport_exceptions.ConnectionFailedError

After this exception occurs, reconnection attempts fail with the same exception until the module is shutdown and restarted.

SAS token renewal is successful without errors in the first ~24 hours of continer runtime. A noticable difference is that in the logs where the error occurs the line Forcing paho disconnect to prevent it from automatically reconnecting is present in contrast to the cases where token renewal is successful.
See logs below showing an example of successful and failed SAS token renewal.

Shutting down and restarting the module every 24 hours leads to data loss which is not acceptable in our application. We are under time pressure due to customer demands, so any help would be highly appreciated.

Code sample exhibiting the issue

Shortened to code passages relevant for connection and message sending.

def create_client(client):
    attempt_count = 0
    max_attempts = 50

    while attempt_count < max_attempts:
        try:
            if client:
                close_client(client)  # Ensure any existing client is closed before retrying

            logger.info(f"Trying to establish IoT Edge connection (Attempt {attempt_count + 1}/{max_attempts})")

            client = IoTHubModuleClient.create_from_edge_environment(
                auto_connect=True, connection_retry=True, connection_retry_interval=10
            )
            
            logger.info("Client successfully created and handlers attached.")
            break

        except Exception as e:
            logger.error(f"Failed to create IoT Hub client or attach handlers: {e}")
            if client:
                close_client(client)  # Cleanup on failure

            client = None
            attempt_count += 1
            if attempt_count < max_attempts:
                time.sleep(3)
            else:
                logger.error(f"Failed to create client after {max_attempts} attempts. Killing module")
                sys.exit(1)

def close_client(client):
    if client:
        try:
            logger.info("Closing IoT Hub client.")
            client.disconnect()
            client.shutdown()
            client = None
            logger.info("IoT Hub client closed successfully.")
        except Exception as e:
            logger.error(f"Error while closing IoT Hub client: {e}")
        finally:
            time.sleep(2)
                
def send_to_iot_hub(message_bytes, msg_index, client):
    try:
        logger.debug("Sending message to IoT Hub...")
        if client is None:
            logger.debug("iot_client_manager.client == None, try to create new.")
            close_client(client)
            create_client(client)
        client.send_message_to_output(message_bytes, "output2")

    except (
        azure.iot.device.exceptions.ConnectionFailedError,
        azure.iot.device.exceptions.ConnectionDroppedError,
        azure.iot.device.exceptions.OperationTimeout,
        azure.iot.device.exceptions.NoConnectionError,
        azure.iot.device.exceptions.ClientError
    ) as e:
        logger.error(f"An error occurred while sending message: {e}")
        # Attempt recconection
        close_client(client)
        create_client(client)

    except Exception as e:
        root_logger.error(f"Exception in send_message: {e}")
        return False

    else:
        root_logger.debug(f"Message {msg_index} sent to IoT Hub successfully.")
    return True
                
def run():
    client = None
    create_client(client)
    index = 0

    while True:
        logger.debug(f"Processing hitfile and trying to send to IoT Hub: {index}")
        success = send_to_iot_hub(message_bytes, msg_index, client)
        if not success:
            time.sleep(1)  # Wait and retry sending the message
        time.sleep(0.05)
                         
if __name__ == "__main__":
    run()

Console log of the issue

Logs from a failed SAStoken renewal (without reconnection attemps):
(See full logs with reconnection attemps here: hermes_failed_renew_sas_token.log)

2025-01-18 18:22:43,183 - azure.iot.device.common.pipeline.pipeline_thread - DEBUG - Starting renew_token in pipeline thread
2025-01-18 18:22:43,212 - azure.iot.device.common.pipeline.pipeline_stages_base - INFO - SasTokenStage: Renewing SAS Token...
2025-01-18 18:22:47,002 - root - INFO - Processing hitfile and trying to send to IoT Hub: 34544
2025-01-18 18:22:47,279 - urllib3.connectionpool - DEBUG - http://localhost:None "POST /modules/Hermes/genid/638724464602815315/sign?api-version=2022-08-03 HTTP/1.1" 200 57
2025-01-18 18:22:47,400 - azure.iot.device.common.pipeline.pipeline_stages_base - INFO - SasTokenStage: Starting reauthorization process for new SAS token
2025-01-18 18:22:47,409 - root - DEBUG - Sending message to IoT Hub...
2025-01-18 18:22:47,416 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - ConnectionStateStage(ReauthorizeConnectionOperation): State changes CONNECTED -> REAUTHORIZING. Sending op down.
2025-01-18 18:22:47,438 - azure.iot.device.iothub.sync_clients - INFO - Sending message to output:output2...
2025-01-18 18:22:47,468 - azure.iot.device.iothub.pipeline.mqtt_pipeline - DEBUG - Starting SendOutputMessageOperation on the pipeline
2025-01-18 18:22:47,501 - azure.iot.device.common.pipeline.pipeline_stages_mqtt - DEBUG - MQTTTransportStage(ReauthorizeConnectionOperation): reauthorizing. Will issue disconnect and then a connect
2025-01-18 18:22:47,531 - azure.iot.device.common.pipeline.pipeline_thread - DEBUG - Starting run_op in pipeline thread
2025-01-18 18:22:47,562 - azure.iot.device.common.pipeline.pipeline_stages_mqtt - DEBUG - MQTTTransportStage(DisconnectOperation): disconnecting
2025-01-18 18:22:47,578 - azure.iot.device.common.mqtt_transport - INFO - disconnecting MQTT client
2025-01-18 18:22:47,593 - paho - DEBUG - Sending DISCONNECT
2025-01-18 18:22:47,619 - azure.iot.device.common.mqtt_transport - INFO - disconnected with result code: 0
2025-01-18 18:22:47,663 - azure.iot.device.common.pipeline.pipeline_thread - DEBUG - Starting _on_mqtt_disconnected in pipeline thread
2025-01-18 18:22:47,692 - azure.iot.device.common.mqtt_transport - DEBUG - _mqtt_client.disconnect returned rc=0
2025-01-18 18:22:47,723 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - Cancelling SAS Token update alarm
2025-01-18 18:22:47,755 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - SasTokenStage: Scheduling automatic SAS Token renewal at epoch time: 1737228043
2025-01-18 18:22:47,796 - azure.iot.device.common.pipeline.pipeline_ops_base - DEBUG - SendOutputMessageOperation: creating worker op of type MQTTPublishOperation
2025-01-18 18:22:47,853 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - AutoConnectStage(MQTTPublishOperation): Op needs connection.  Queueing this op and starting a ConnectionOperation
2025-01-18 18:22:47,916 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - AutoConnectStage(MQTTPublishOperation): calling down with Connect operation
2025-01-18 18:22:47,946 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - ConnectionStateStage(ConnectOperation): State is ConnectionState.REAUTHORIZING - waiting for in-progress operation to finish
2025-01-18 18:22:47,983 - azure.iot.device.common.pipeline.pipeline_stages_mqtt - INFO - MQTTTransportStage: _on_mqtt_disconnect called
2025-01-18 18:22:48,020 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - ConnectionStateStage(DisconnectedEvent): Not attempting to reconnect (Reauthorization in progress)
2025-01-18 18:22:48,051 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - PipelineRootStage: DisconnectedEvent received. Calling on_disconnected_handler
2025-01-18 18:22:48,070 - azure.iot.device.common.pipeline.pipeline_thread - DEBUG - Starting _on_disconnected in callback thread
2025-01-18 18:22:48,107 - azure.iot.device.common.pipeline.pipeline_stages_mqtt - DEBUG - MQTTTransportStage: Expected disconnect - completing pending disconnect op
2025-01-18 18:22:48,109 - azure.iot.device.iothub.abstract_clients - INFO - Connection State - Disconnected
2025-01-18 18:22:48,133 - azure.iot.device.common.pipeline.pipeline_ops_base - DEBUG - DisconnectOperation: completing without error
2025-01-18 18:22:48,150 - azure.iot.device.iothub.abstract_clients - INFO - Cleared all pending method requests due to disconnect
2025-01-18 18:22:48,221 - azure.iot.device.common.pipeline.pipeline_ops_base - DEBUG - ReauthorizeConnectionOperation: creating worker op of type ConnectOperation
2025-01-18 18:22:48,239 - azure.iot.device.common.pipeline.pipeline_stages_mqtt - DEBUG - MQTTTransportStage(ConnectOperation): connecting
2025-01-18 18:22:48,247 - azure.iot.device.common.pipeline.pipeline_stages_mqtt - DEBUG - MQTTTransportStage(ConnectOperation): Starting watchdog
2025-01-18 18:22:48,256 - azure.iot.device.common.mqtt_transport - DEBUG - connecting to mqtt broker
2025-01-18 18:22:48,282 - azure.iot.device.common.mqtt_transport - INFO - Connect using port 8883 (TCP)
2025-01-18 18:22:48,523 - azure.iot.device.common.mqtt_transport - INFO - Forcing paho disconnect to prevent it from automatically reconnecting
2025-01-18 18:22:48,556 - azure.iot.device.common.mqtt_transport - DEBUG - Done forcing paho disconnect
2025-01-18 18:22:48,580 - azure.iot.device.common.pipeline.pipeline_stages_mqtt - INFO - transport.connect raised error
2025-01-18 18:22:49,088 - azure.iot.device.common.pipeline.pipeline_stages_mqtt - INFO - Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/azure/iot/device/common/mqtt_transport.py", line 396, in connect
    rc = self._mqtt_client.connect(
  File "/usr/local/lib/python3.9/dist-packages/paho/mqtt/client.py", line 914, in connect
    return self.reconnect()
  File "/usr/local/lib/python3.9/dist-packages/paho/mqtt/client.py", line 1044, in reconnect
    sock = self._create_socket_connection()
  File "/usr/local/lib/python3.9/dist-packages/paho/mqtt/client.py", line 3685, in _create_socket_connection
    return socket.create_connection(addr, timeout=self._connect_timeout, source_address=source)
  File "/usr/lib/python3.9/socket.py", line 822, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "/usr/lib/python3.9/socket.py", line 953, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/azure/iot/device/common/pipeline/pipeline_stages_mqtt.py", line 206, in _run_op
    self.transport.connect(password=password)
  File "/usr/local/lib/python3.9/dist-packages/azure/iot/device/common/mqtt_transport.py", line 419, in connect
    raise exceptions.ConnectionFailedError() from e
azure.iot.device.common.transport_exceptions.ConnectionFailedError

2025-01-18 18:22:49,125 - azure.iot.device.common.pipeline.pipeline_stages_mqtt - DEBUG - MQTTTransportStage(ConnectOperation): cancelling watchdog
2025-01-18 18:22:49,164 - azure.iot.device.common.pipeline.pipeline_ops_base - DEBUG - ConnectOperation: completing with error 
2025-01-18 18:22:49,187 - azure.iot.device.common.pipeline.pipeline_ops_base - DEBUG - ReauthorizeConnectionOperation: Worker op (ConnectOperation) has been completed
2025-01-18 18:22:49,205 - azure.iot.device.common.pipeline.pipeline_ops_base - DEBUG - ReauthorizeConnectionOperation: completing with error 
2025-01-18 18:22:49,223 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - ConnectionStateStage(ReauthorizeConnectionOperation): failed, state change ConnectionState.REAUTHORIZING -> DISCONNECTED
2025-01-18 18:22:49,245 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - ConnectionStateStage: Resolving next waiting op: ConnectOperation
2025-01-18 18:22:49,275 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - ConnectionStateStage(ConnectOperation): State changes DISCONNECTED -> CONNECTING. Sending op down
2025-01-18 18:22:49,295 - azure.iot.device.common.pipeline.pipeline_stages_mqtt - DEBUG - MQTTTransportStage(ConnectOperation): connecting
2025-01-18 18:22:49,317 - azure.iot.device.common.pipeline.pipeline_stages_mqtt - DEBUG - MQTTTransportStage(ConnectOperation): Starting watchdog
2025-01-18 18:22:49,348 - azure.iot.device.common.mqtt_transport - DEBUG - connecting to mqtt broker
...

Logs from a successful SAS token renewal:

2025-01-18 19:34:22,221 - azure.iot.device.common.pipeline.pipeline_thread - DEBUG - Starting renew_token in pipeline thread
2025-01-18 19:34:22,269 - azure.iot.device.common.pipeline.pipeline_stages_base - INFO - SasTokenStage: Renewing SAS Token...
2025-01-18 19:34:27,963 - root - INFO - Processing hitfile and trying to send to IoT Hub: 34974
2025-01-18 19:34:28,274 - urllib3.connectionpool - DEBUG - http://localhost:None "POST /modules/Hermes/genid/638724464602815315/sign?api-version=2022-08-03 HTTP/1.1" 200 57
2025-01-18 19:34:28,429 - root - DEBUG - Sending message to IoT Hub...
2025-01-18 19:34:28,460 - azure.iot.device.common.pipeline.pipeline_stages_base - INFO - SasTokenStage: Starting reauthorization process for new SAS token
2025-01-18 19:34:28,472 - azure.iot.device.iothub.sync_clients - INFO - Sending message to output:output2...
2025-01-18 19:34:28,513 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - ConnectionStateStage(ReauthorizeConnectionOperation): State changes CONNECTED -> REAUTHORIZING. Sending op down.
2025-01-18 19:34:28,565 - azure.iot.device.iothub.pipeline.mqtt_pipeline - DEBUG - Starting SendOutputMessageOperation on the pipeline
2025-01-18 19:34:28,591 - azure.iot.device.common.pipeline.pipeline_thread - DEBUG - Starting run_op in pipeline thread
2025-01-18 19:34:28,622 - azure.iot.device.common.pipeline.pipeline_stages_mqtt - DEBUG - MQTTTransportStage(ReauthorizeConnectionOperation): reauthorizing. Will issue disconnect and then a connect
2025-01-18 19:34:28,663 - azure.iot.device.common.pipeline.pipeline_stages_mqtt - DEBUG - MQTTTransportStage(DisconnectOperation): disconnecting
2025-01-18 19:34:28,671 - azure.iot.device.common.mqtt_transport - INFO - disconnecting MQTT client
2025-01-18 19:34:28,686 - paho - DEBUG - Sending DISCONNECT
2025-01-18 19:34:28,716 - azure.iot.device.common.mqtt_transport - INFO - disconnected with result code: 0
2025-01-18 19:34:28,771 - azure.iot.device.common.pipeline.pipeline_thread - DEBUG - Starting _on_mqtt_disconnected in pipeline thread
2025-01-18 19:34:28,842 - azure.iot.device.common.mqtt_transport - DEBUG - _mqtt_client.disconnect returned rc=0
2025-01-18 19:34:28,867 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - Cancelling SAS Token update alarm
2025-01-18 19:34:28,884 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - SasTokenStage: Scheduling automatic SAS Token renewal at epoch time: 1737232342
2025-01-18 19:34:28,905 - azure.iot.device.common.pipeline.pipeline_ops_base - DEBUG - SendOutputMessageOperation: creating worker op of type MQTTPublishOperation
2025-01-18 19:34:28,930 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - AutoConnectStage(MQTTPublishOperation): Op needs connection.  Queueing this op and starting a ConnectionOperation
2025-01-18 19:34:28,952 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - AutoConnectStage(MQTTPublishOperation): calling down with Connect operation
2025-01-18 19:34:28,975 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - ConnectionStateStage(ConnectOperation): State is ConnectionState.REAUTHORIZING - waiting for in-progress operation to finish
2025-01-18 19:34:28,999 - azure.iot.device.common.pipeline.pipeline_stages_mqtt - INFO - MQTTTransportStage: _on_mqtt_disconnect called
2025-01-18 19:34:29,039 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - ConnectionStateStage(DisconnectedEvent): Not attempting to reconnect (Reauthorization in progress)
2025-01-18 19:34:29,090 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - PipelineRootStage: DisconnectedEvent received. Calling on_disconnected_handler
2025-01-18 19:34:29,139 - azure.iot.device.common.pipeline.pipeline_thread - DEBUG - Starting _on_disconnected in callback thread
2025-01-18 19:34:29,165 - azure.iot.device.common.pipeline.pipeline_stages_mqtt - DEBUG - MQTTTransportStage: Expected disconnect - completing pending disconnect op
2025-01-18 19:34:29,166 - azure.iot.device.iothub.abstract_clients - INFO - Connection State - Disconnected
2025-01-18 19:34:29,238 - azure.iot.device.common.pipeline.pipeline_ops_base - DEBUG - DisconnectOperation: completing without error
2025-01-18 19:34:29,283 - azure.iot.device.iothub.abstract_clients - INFO - Cleared all pending method requests due to disconnect
2025-01-18 19:34:29,317 - azure.iot.device.common.pipeline.pipeline_ops_base - DEBUG - ReauthorizeConnectionOperation: creating worker op of type ConnectOperation
2025-01-18 19:34:29,341 - azure.iot.device.common.pipeline.pipeline_stages_mqtt - DEBUG - MQTTTransportStage(ConnectOperation): connecting
2025-01-18 19:34:29,351 - azure.iot.device.common.pipeline.pipeline_stages_mqtt - DEBUG - MQTTTransportStage(ConnectOperation): Starting watchdog
2025-01-18 19:34:29,377 - azure.iot.device.common.mqtt_transport - DEBUG - connecting to mqtt broker
2025-01-18 19:34:29,418 - azure.iot.device.common.mqtt_transport - INFO - Connect using port 8883 (TCP)
2025-01-18 19:34:37,560 - paho - DEBUG - Sending CONNECT (u1, p1, wr0, wq0, wf0, c0, k60) client_id=b'240100102/Hermes'
2025-01-18 19:34:37,665 - azure.iot.device.common.mqtt_transport - DEBUG - _mqtt_client.connect returned rc=0
2025-01-18 19:34:38,887 - paho - DEBUG - Received CONNACK (1, 0)
2025-01-18 19:34:38,985 - azure.iot.device.common.mqtt_transport - INFO - connected with result code: 0
2025-01-18 19:34:39,010 - azure.iot.device.common.pipeline.pipeline_thread - DEBUG - Starting _on_mqtt_connected in pipeline thread
2025-01-18 19:34:39,075 - azure.iot.device.common.pipeline.pipeline_stages_mqtt - INFO - _on_mqtt_connected called
2025-01-18 19:34:39,102 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - ConnectionStateStage(ConnectedEvent): State changes REAUTHORIZING -> CONNECTED. Connection re-established after re-authentication
2025-01-18 19:34:39,112 - azure.iot.device.iothub.pipeline.pipeline_stages_iothub - INFO - EnsureDesiredPropertiesStage: Reconnected.  Getting twin
2025-01-18 19:34:39,143 - azure.iot.device.iothub.pipeline.pipeline_stages_iothub - INFO - EnsureDesiredPropertiesStage: sending twin GET to ensure freshness
2025-01-18 19:34:39,231 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - CoordinateRequestAndResponseStage(RequestAndResponseOperation): adding request 061ed482-68fd-4003-8d32-3bb1c0ef5fc3 to pending list
2025-01-18 19:34:39,253 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - CoordinateRequestAndResponseStage(RequestAndResponseOperation): Sending twin request to GET resource /
2025-01-18 19:34:39,289 - azure.iot.device.common.pipeline.pipeline_ops_base - DEBUG - RequestOperation: creating worker op of type MQTTPublishOperation
2025-01-18 19:34:39,316 - azure.iot.device.common.pipeline.pipeline_stages_mqtt - DEBUG - MQTTTransportStage(MQTTPublishOperation): publishing on $iothub/twin/GET/?$rid=061ed482-68fd-4003-8d32-3bb1c0ef5fc3
2025-01-18 19:34:39,354 - azure.iot.device.common.mqtt_transport - INFO - publishing on $iothub/twin/GET/?$rid=061ed482-68fd-4003-8d32-3bb1c0ef5fc3
2025-01-18 19:34:39,397 - paho - DEBUG - Sending PUBLISH (d0, q1, r0, m396), 'b'$iothub/twin/GET/?$rid=061ed482-68fd-4003-8d32-3bb1c0ef5fc3'', ... (1 bytes)
2025-01-18 19:34:39,444 - azure.iot.device.common.mqtt_transport - DEBUG - _mqtt_client.publish returned rc=0
2025-01-18 19:34:39,463 - azure.iot.device.common.mqtt_transport - DEBUG - Waiting for response on MID: 396
2025-01-18 19:34:39,477 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - PipelineRootStage: ConnectedEvent received. Calling on_connected_handler
2025-01-18 19:34:39,488 - azure.iot.device.common.pipeline.pipeline_thread - DEBUG - Starting _on_connected in callback thread
2025-01-18 19:34:39,523 - azure.iot.device.common.pipeline.pipeline_stages_mqtt - DEBUG - MQTTTransportStage: completing connect op
2025-01-18 19:34:39,525 - azure.iot.device.iothub.abstract_clients - INFO - Connection State - Connected
2025-01-18 19:34:39,561 - azure.iot.device.common.pipeline.pipeline_stages_mqtt - DEBUG - MQTTTransportStage(ConnectOperation): cancelling watchdog
2025-01-18 19:34:39,611 - azure.iot.device.common.pipeline.pipeline_ops_base - DEBUG - ConnectOperation: completing without error
2025-01-18 19:34:39,663 - azure.iot.device.common.pipeline.pipeline_ops_base - DEBUG - ReauthorizeConnectionOperation: Worker op (ConnectOperation) has been completed
2025-01-18 19:34:39,707 - azure.iot.device.common.pipeline.pipeline_ops_base - DEBUG - ReauthorizeConnectionOperation: completing without error
2025-01-18 19:34:39,740 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - ConnectionStateStage: Resolving next waiting op: ConnectOperation
2025-01-18 19:34:39,782 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - ConnectionStateStage(ConnectOperation): State is already CONNECTED. Completing operation
2025-01-18 19:34:39,808 - azure.iot.device.common.pipeline.pipeline_ops_base - DEBUG - ConnectOperation: completing without error
2025-01-18 19:34:39,828 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - AutoConnectStage(MQTTPublishOperation): connection is complete.  Running op that triggered connection.
2025-01-18 19:34:39,858 - azure.iot.device.common.pipeline.pipeline_stages_mqtt - DEBUG - MQTTTransportStage(MQTTPublishOperation): publishing on devices/240100102/modules/Hermes/messages/events/%24.on=output2
2025-01-18 19:34:39,888 - azure.iot.device.common.mqtt_transport - INFO - publishing on devices/240100102/modules/Hermes/messages/events/%24.on=output2
2025-01-18 19:34:39,909 - paho - DEBUG - Sending PUBLISH (d0, q1, r0, m397), 'b'devices/240100102/modules/Hermes/messages/events/%24.on=output2'', ... (1340 bytes)
2025-01-18 19:34:40,001 - azure.iot.device.common.mqtt_transport - DEBUG - _mqtt_client.publish returned rc=0
2025-01-18 19:34:40,061 - azure.iot.device.common.mqtt_transport - DEBUG - Waiting for response on MID: 397
2025-01-18 19:34:40,092 - azure.iot.device.common.pipeline.pipeline_stages_base - INFO - SasTokenStage: Connection reauthorization successful
2025-01-18 19:35:03,510 - paho - DEBUG - Received PUBLISH (d0, q1, r0, m1), '$iothub/twin/res/200/?$rid=061ed482-68fd-4003-8d32-3bb1c0ef5fc3', ...  (400 bytes)
2025-01-18 19:35:03,621 - azure.iot.device.common.mqtt_transport - INFO - message received on $iothub/twin/res/200/?$rid=061ed482-68fd-4003-8d32-3bb1c0ef5fc3
2025-01-18 19:35:03,645 - azure.iot.device.common.pipeline.pipeline_thread - DEBUG - Starting _on_mqtt_message_received in pipeline thread
2025-01-18 19:35:03,710 - paho - DEBUG - Sending PUBACK (Mid: 1)
2025-01-18 19:35:03,711 - azure.iot.device.common.pipeline.pipeline_stages_mqtt - DEBUG - MQTTTransportStage: message received on topic $iothub/twin/res/200/?$rid=061ed482-68fd-4003-8d32-3bb1c0ef5fc3
2025-01-18 19:35:03,792 - paho - DEBUG - Received PUBACK (Mid: 396)
2025-01-18 19:35:03,872 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - CoordinateRequestAndResponseStage(ResponseEvent): Handling event with request_id 061ed482-68fd-4003-8d32-3bb1c0ef5fc3
2025-01-18 19:35:03,960 - azure.iot.device.common.mqtt_transport - INFO - payload published for 396
2025-01-18 19:35:04,017 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - CoordinateRequestAndResponseStage(RequestAndResponseOperation): Completing twin request to GET resource / with status 200
2025-01-18 19:35:04,071 - azure.iot.device.common.pipeline.pipeline_ops_base - DEBUG - RequestAndResponseOperation: completing without error
2025-01-18 19:35:04,108 - azure.iot.device.common.mqtt_transport - DEBUG - Response received for recognized MID: 396 - triggering callback
2025-01-18 19:35:04,118 - azure.iot.device.common.pipeline.pipeline_thread - DEBUG - Starting on_complete in pipeline thread
2025-01-18 19:35:04,249 - azure.iot.device.iothub.pipeline.pipeline_stages_iothub - DEBUG - TwinRequestResponseStage(RequestAndResponseOperation): Got response for GetTwinOperation
2025-01-18 19:35:04,359 - azure.iot.device.common.pipeline.pipeline_ops_base - DEBUG - GetTwinOperation: completing without error
2025-01-18 19:35:04,475 - azure.iot.device.iothub.pipeline.pipeline_stages_iothub - DEBUG - <azure.iot.device.iothub.pipeline.pipeline_stages_iothub.EnsureDesiredPropertiesStage object at 0xb6177b98> Twin GET response received.  Checking versions
2025-01-18 19:35:04,523 - azure.iot.device.iothub.pipeline.pipeline_stages_iothub - DEBUG - EnsureDesiredPropertiesStage: old version = 18, new version = 18
2025-01-18 19:35:04,553 - azure.iot.device.common.pipeline.pipeline_stages_mqtt - DEBUG - MQTTTransportStage(MQTTPublishOperation): PUBACK received. completing op.
2025-01-18 19:35:04,584 - azure.iot.device.common.pipeline.pipeline_ops_base - DEBUG - MQTTPublishOperation: completing without error
2025-01-18 19:35:04,629 - azure.iot.device.common.pipeline.pipeline_ops_base - DEBUG - RequestOperation: Worker op (MQTTPublishOperation) has been completed
2025-01-18 19:35:04,681 - paho - DEBUG - Received PUBACK (Mid: 397)
2025-01-18 19:35:04,716 - azure.iot.device.common.pipeline.pipeline_ops_base - DEBUG - RequestOperation: completing without error
2025-01-18 19:35:04,749 - azure.iot.device.common.mqtt_transport - INFO - payload published for 397
2025-01-18 19:35:04,781 - azure.iot.device.common.mqtt_transport - DEBUG - Response received for recognized MID: 397 - triggering callback
2025-01-18 19:35:04,815 - azure.iot.device.common.pipeline.pipeline_stages_base - DEBUG - CoordinateRequestAndResponseStage(RequestAndResponseOperation): Finished sending twin request to GET resource /
2025-01-18 19:35:04,846 - azure.iot.device.common.pipeline.pipeline_thread - DEBUG - Starting on_complete in pipeline thread
2025-01-18 19:35:04,930 - azure.iot.device.common.pipeline.pipeline_stages_mqtt - DEBUG - MQTTTransportStage(MQTTPublishOperation): PUBACK received. completing op.
2025-01-18 19:35:04,989 - azure.iot.device.common.pipeline.pipeline_ops_base - DEBUG - MQTTPublishOperation: completing without error
2025-01-18 19:35:05,025 - azure.iot.device.common.pipeline.pipeline_ops_base - DEBUG - SendOutputMessageOperation: Worker op (MQTTPublishOperation) has been completed
2025-01-18 19:35:05,051 - azure.iot.device.common.pipeline.pipeline_ops_base - DEBUG - SendOutputMessageOperation: completing without error
2025-01-18 19:35:05,076 - azure.iot.device.common.pipeline.pipeline_thread - DEBUG - Starting on_complete in callback thread
2025-01-18 19:35:05,093 - azure.iot.device.common.evented_callback - DEBUG - Callback completed with result None
2025-01-18 19:35:05,114 - azure.iot.device.iothub.sync_clients - INFO - Successfully sent message to output: output2
@jelalanne
Copy link

Hello @gbe-tuv ,

I'm reporting it here for anyone that would come by, as you mentionned the issue I created, here is my workaround.
I added a retry process in the create_client function, it's better than nothing, and it works all the time (no error for the past few months).

retries = 8
for attempt in range(retries):
    try:
        log("Trying to connect to IoT Edge environment")
        client = IoTHubModuleClient.create_from_edge_environment()
        client.on_message_received = receive_message_handler
        client.on_twin_desired_properties_patch_received = receive_twin_patch_handler
        client.on_method_request_received = method_request_handler
        break
    except Exception as ex:
        log(f"ERROR: Attempt {attempt + 1} failed: {ex}")
        if attempt < retries - 1:
            log("New try in {0} seconds.".format(2 ** attempt))
            time.sleep(2 ** attempt)  # Exponential delay
        else:
            log("Maximum number of attempts reached. Raising exception to higher level.")
            # Cleanup if failure occurs
            GPIO.cleanup()
            raise ex

return client

It rarely goes to 5th try

@gbe-tuv
Copy link
Author

gbe-tuv commented Jan 21, 2025

Hello @jelalanne ,

Thank you for your response.
We have a similar mechanism for multiple attempts when creating the client in create_client(), however this unfortunately doesn't solve our problem.
When starting the IoT Edge module create_client() works after a few attempts but when experiencing the error shown in the logs during SAS token renewal, we shutdown the client and try creating a new one which fails for 50+ attempts. Reconnection is only possible after exiting the module and starting it again.

@jelalanne
Copy link

jelalanne commented Jan 21, 2025 via email

@gbe-tuv
Copy link
Author

gbe-tuv commented Jan 21, 2025

Thats a good point, currently the wait-time between attempts is only 3s. I'll try using an exponential increasing delay like you suggested.
Exiting and restarting the module leads to data loss as it sends messages every few seconds, so this is not really an option unfortunately.

@jelalanne
Copy link

Keep me updated on how it went ;)

That should not be an issue if you manage your data properly. For example, an option would be to execute a function that locally saves the data in a temporary file whenever connection is lost, then you exit the module, and you add a process at the beginning that checks if something exists in these temporary files, to send them if so. Of course, I know nothing about your context, so I may be wrong.

@gbe-tuv
Copy link
Author

gbe-tuv commented Jan 22, 2025

We found a different potential cause that looks very promising. There was a problem with DNS resolution inside the docker container running the module after 24 hours. Root of the problem was most likely a wrong configuration that binds /etc/resolv.conf from the host.

We are currently testing if the fix is working and will follow up with an update of the results.

@IcorreaX
Copy link

Hello!!

I'm having a really similar issue (we don´t have it running inside a docker container) so we don't relate this to our DNS configuration nor resolv.conf file.

We have the "socket.gaierror [Errno -3] Temporary failure in name resolution" exception raising from time to time when it is renewing the SAS token. We are connected to an IoT Hub instance and some times it does the SAS token renewal flawlessly, but sometimes (let's say 1 out of 15 renewals) it raises that socket.gaierror. It has the same behaviour that you @gbe-tuv described; when the Forcing paho disconnect to prevent it from automatically reconnecting message appears then the renewal fails, otherwise it succeds.

We assumed that, if it is indeed a temporary failure, then capturing that exception and apply some retry mechanism will fix the issue, but the thing is that when this exception raises, it never stops raising unless we reboot the device or at least restart the service in our linux system.

Here are the logs for both, failure:

DEBUG - Main - 2025-01-17 10:07:30,632 - MQTTTransportStage(ConnectOperation): connecting
DEBUG - Main - 2025-01-17 10:07:30,632 - MQTTTransportStage(ConnectOperation): Starting watchdog
DEBUG - Main - 2025-01-17 10:07:30,633 - connecting to mqtt broker
INFO - Main - 2025-01-17 10:07:30,633 - Connect using port 8883 (TCP)
INFO - Main - 2025-01-17 10:07:30,635 - Forcing paho disconnect to prevent it from automatically reconnecting
DEBUG - Main - 2025-01-17 10:07:30,635 - Done forcing paho disconnect
INFO - Main - 2025-01-17 10:07:30,636 - transport.connect raised error
INFO - Main - 2025-01-17 10:07:30,638 - Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/azure/iot/device/common/mqtt_transport.py", line 396, in connect
    host=self._hostname, port=8883, keepalive=self._keep_alive
  File "/usr/local/lib/python3.7/dist-packages/paho/mqtt/client.py", line 914, in connect
    return self.reconnect()
  File "/usr/local/lib/python3.7/dist-packages/paho/mqtt/client.py", line 1044, in reconnect
    sock = self._create_socket_connection()
  File "/usr/local/lib/python3.7/dist-packages/paho/mqtt/client.py", line 3685, in _create_socket_connection
    return socket.create_connection(addr, timeout=self._connect_timeout, source_address=source)
  File "/usr/lib/python3.7/socket.py", line 707, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "/usr/lib/python3.7/socket.py", line 748, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution

and successfull case:

DEBUG - Main - 2024-12-06 01:04:23,888 - MQTTTransportStage(ConnectOperation): connecting
DEBUG - Main - 2024-12-06 01:04:23,888 - MQTTTransportStage(ConnectOperation): Starting watchdog
INFO - Main - 2024-12-06 01:04:23,889 - Connection State - Disconnected
INFO - Main - 2024-12-06 01:04:23,890 - Cleared all pending method requests due to disconnect
DEBUG - Main - 2024-12-06 01:04:23,891 - connecting to mqtt broker
INFO - Main - 2024-12-06 01:04:23,892 - Connect using port 8883 (TCP)
DEBUG - Main - 2024-12-06 01:04:24,280 - _mqtt_client.connect returned rc=0
INFO - Main - 2024-12-06 01:04:24,404 - connected with result code: 0
DEBUG - Main - 2024-12-06 01:04:24,405 - Starting _on_mqtt_connected in pipeline thread
INFO - Main - 2024-12-06 01:04:24,406 - _on_mqtt_connected called
DEBUG - Main - 2024-12-06 01:04:24,406 - ConnectionStateStage(ConnectedEvent): State changes REAUTHORIZING -> CONNECTED. Connection re-established after re-authentication
DEBUG - Main - 2024-12-06 01:04:24,407 - PipelineRootStage: ConnectedEvent received. Calling on_connected_handler
DEBUG - Main - 2024-12-06 01:04:24,407 - Starting _on_connected in callback thread
DEBUG - Main - 2024-12-06 01:04:24,407 - MQTTTransportStage: completing connect op
DEBUG - Main - 2024-12-06 01:04:24,408 - MQTTTransportStage(ConnectOperation): cancelling watchdog
DEBUG - Main - 2024-12-06 01:04:24,408 - ConnectOperation: completing without error
DEBUG - Main - 2024-12-06 01:04:24,408 - ReauthorizeConnectionOperation: Worker op (ConnectOperation) has been completed
DEBUG - Main - 2024-12-06 01:04:24,408 - ReauthorizeConnectionOperation: completing without error
INFO - Main - 2024-12-06 01:04:24,409 - SasTokenStage: Connection reauthorization successful
INFO - Main - 2024-12-06 01:04:24,409 - Connection State - Connected

Obviously we think that this is a problem with how the SDK handles that exception underneath and we don´t know how to treat it. Thanks in advance!

@jelalanne
Copy link

Hello @IcorreaX How can you not be using containers with IoT Edge ? Be aware that there is a good chance your issue/solution might not be the same as ours if you're not using IoT Edge

@gbe-tuv
Copy link
Author

gbe-tuv commented Jan 23, 2025

Removing the configuration that binds /etc/resolv.conf from the host seems to have solved the problem. No disconnect after 24 hours of runtime on two test devices. So it looks like it was a DNS problem and no bug in the SDK.
I will close the ticket after some more test time to confirm there are no disconnects anymore.

@IcorreaX I suspect you are using IoTHubDeviceClient without IoT Edge instead of IoTHubModuleClient like I and @jelalanne are doing. Maybe check if DNS resolution is working when the problem occurs, as this could still lead to the exception even without using a docker container. Otherwise you seem to have a different problem than what we observed.

@jelalanne
Copy link

@gbe-tuv excellent news :) If your tests are good, could you give more details about what you've done for future visitors please ? For people who are not very used to DNS things.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants