Exception and connection drop during SAS Token renewal. Reconnect attempts fail until IoT Edge Module exited and restarted #1202
Comments
Hello @gbe-tuv, I'm posting this here for anyone who comes by, since you mentioned the issue I created. Here is my workaround:
retries = 8
for attempt in range(retries):
    try:
        log("Trying to connect to IoT Edge environment")
        client = IoTHubModuleClient.create_from_edge_environment()
        client.on_message_received = receive_message_handler
        client.on_twin_desired_properties_patch_received = receive_twin_patch_handler
        client.on_method_request_received = method_request_handler
        break
    except Exception as ex:
        log(f"ERROR: Attempt {attempt + 1} failed: {ex}")
        if attempt < retries - 1:
            log("New try in {0} seconds.".format(2 ** attempt))
            time.sleep(2 ** attempt)  # Exponential delay
        else:
            log("Maximum number of attempts reached. Raising exception to higher level.")
            # Cleanup if failure occurs
            GPIO.cleanup()
            raise ex
return client
It rarely gets to the 5th try.
Hello @jelalanne, thank you for your response.
Then let it exit peacefully and restart automatically. Is there an issue with that? How much time do you leave between retries? When I used a short retry period it didn't work for me either, but an exponentially increasing delay worked, with many seconds between attempts.
That's a good point; currently the wait time between attempts is only 3 s. I'll try an exponentially increasing delay as you suggested.
Keep me updated on how it goes ;) That should not be an issue if you manage your data properly. For example, one option would be to run a function that saves the data locally in a temporary file whenever the connection is lost, then exit the module, and add a step at startup that checks whether anything exists in these temporary files and sends it if so. Of course, I know nothing about your context, so I may be wrong.
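The buffering idea above could look roughly like the sketch below. It is only an illustration under assumptions: the function names `buffer_message` and `flush_buffer`, the JSON-lines file format, and the `send` callback are all hypothetical and not part of the SDK or of anyone's actual module.

```python
import json
import os
import tempfile


def buffer_message(path, payload):
    """Append one JSON-serialisable payload to the local buffer file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(payload) + "\n")


def flush_buffer(path, send):
    """Pass every buffered payload to `send`, then delete the buffer file.

    Returns the number of payloads sent. If `send` raises, the file is
    left in place so the next start can try again.
    """
    if not os.path.exists(path):
        return 0
    with open(path, encoding="utf-8") as f:
        payloads = [json.loads(line) for line in f if line.strip()]
    for payload in payloads:
        send(payload)
    os.remove(path)
    return len(payloads)


if __name__ == "__main__":
    # Hypothetical usage: buffer two readings, then replay them at startup.
    buffer_path = os.path.join(tempfile.mkdtemp(), "pending.jsonl")
    buffer_message(buffer_path, {"temperature": 21.5})
    buffer_message(buffer_path, {"temperature": 21.7})
    replayed = []
    flush_buffer(buffer_path, replayed.append)
    print(replayed)
```

In a real module, `send` would wrap whatever call sends a message to IoT Hub. Note that if `send` fails partway through, the payloads already sent are sent again on the next flush, so the receiving side would have to tolerate duplicates.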
We found a different potential cause that looks very promising. After 24 hours there was a problem with DNS resolution inside the Docker container running the module. The root of the problem was most likely a wrong configuration that binds /etc/resolv.conf from the host. We are currently testing whether the fix works and will follow up with the results.
Hello! I'm having a very similar issue, but we are not running inside a Docker container, so we don't attribute it to our DNS configuration or the resolv.conf file. We get the "socket.gaierror: [Errno -3] Temporary failure in name resolution" exception from time to time while the SAS token is being renewed. We are connected to an IoT Hub instance; most of the time the SAS token renewal works flawlessly, but sometimes (say 1 out of 15 renewals) it raises that socket.gaierror. It shows the same behaviour that you, @gbe-tuv, described; when the
We assumed that, if it is indeed a temporary failure, catching the exception and applying a retry mechanism would fix the issue. However, once this exception has been raised, it keeps being raised until we reboot the device or at least restart the service on our Linux system. Here are the logs for both cases. Failure:
DEBUG - Main - 2025-01-17 10:07:30,632 - MQTTTransportStage(ConnectOperation): connecting
DEBUG - Main - 2025-01-17 10:07:30,632 - MQTTTransportStage(ConnectOperation): Starting watchdog
DEBUG - Main - 2025-01-17 10:07:30,633 - connecting to mqtt broker
INFO - Main - 2025-01-17 10:07:30,633 - Connect using port 8883 (TCP)
INFO - Main - 2025-01-17 10:07:30,635 - Forcing paho disconnect to prevent it from automatically reconnecting
DEBUG - Main - 2025-01-17 10:07:30,635 - Done forcing paho disconnect
INFO - Main - 2025-01-17 10:07:30,636 - transport.connect raised error
INFO - Main - 2025-01-17 10:07:30,638 - Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/azure/iot/device/common/mqtt_transport.py", line 396, in connect
host=self._hostname, port=8883, keepalive=self._keep_alive
File "/usr/local/lib/python3.7/dist-packages/paho/mqtt/client.py", line 914, in connect
return self.reconnect()
File "/usr/local/lib/python3.7/dist-packages/paho/mqtt/client.py", line 1044, in reconnect
sock = self._create_socket_connection()
File "/usr/local/lib/python3.7/dist-packages/paho/mqtt/client.py", line 3685, in _create_socket_connection
return socket.create_connection(addr, timeout=self._connect_timeout, source_address=source)
File "/usr/lib/python3.7/socket.py", line 707, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
File "/usr/lib/python3.7/socket.py", line 748, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution
And the successful case:
DEBUG - Main - 2024-12-06 01:04:23,888 - MQTTTransportStage(ConnectOperation): connecting
DEBUG - Main - 2024-12-06 01:04:23,888 - MQTTTransportStage(ConnectOperation): Starting watchdog
INFO - Main - 2024-12-06 01:04:23,889 - Connection State - Disconnected
INFO - Main - 2024-12-06 01:04:23,890 - Cleared all pending method requests due to disconnect
DEBUG - Main - 2024-12-06 01:04:23,891 - connecting to mqtt broker
INFO - Main - 2024-12-06 01:04:23,892 - Connect using port 8883 (TCP)
DEBUG - Main - 2024-12-06 01:04:24,280 - _mqtt_client.connect returned rc=0
INFO - Main - 2024-12-06 01:04:24,404 - connected with result code: 0
DEBUG - Main - 2024-12-06 01:04:24,405 - Starting _on_mqtt_connected in pipeline thread
INFO - Main - 2024-12-06 01:04:24,406 - _on_mqtt_connected called
DEBUG - Main - 2024-12-06 01:04:24,406 - ConnectionStateStage(ConnectedEvent): State changes REAUTHORIZING -> CONNECTED. Connection re-established after re-authentication
DEBUG - Main - 2024-12-06 01:04:24,407 - PipelineRootStage: ConnectedEvent received. Calling on_connected_handler
DEBUG - Main - 2024-12-06 01:04:24,407 - Starting _on_connected in callback thread
DEBUG - Main - 2024-12-06 01:04:24,407 - MQTTTransportStage: completing connect op
DEBUG - Main - 2024-12-06 01:04:24,408 - MQTTTransportStage(ConnectOperation): cancelling watchdog
DEBUG - Main - 2024-12-06 01:04:24,408 - ConnectOperation: completing without error
DEBUG - Main - 2024-12-06 01:04:24,408 - ReauthorizeConnectionOperation: Worker op (ConnectOperation) has been completed
DEBUG - Main - 2024-12-06 01:04:24,408 - ReauthorizeConnectionOperation: completing without error
INFO - Main - 2024-12-06 01:04:24,409 - SasTokenStage: Connection reauthorization successful
INFO - Main - 2024-12-06 01:04:24,409 - Connection State - Connected
We think this is a problem with how the SDK handles that exception internally, and we don't know how to deal with it. Thanks in advance!
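One way to narrow this down is to check, from inside the affected process, whether name resolution itself is failing, independently of the SDK. The traceback above ends in `socket.getaddrinfo`, so the sketch below calls the same standard-library function directly. The helper name `can_resolve` and its `resolver` parameter are hypothetical and exist only for illustration.

```python
import socket


def can_resolve(hostname, port=8883, resolver=socket.getaddrinfo):
    """Return (True, addresses) if hostname resolves, else (False, error text).

    `resolver` defaults to socket.getaddrinfo and is only a parameter so
    the function can be exercised without network access.
    """
    try:
        infos = resolver(hostname, port, 0, socket.SOCK_STREAM)
    except socket.gaierror as err:
        return False, str(err)
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the resolved IP address.
    addresses = sorted({info[4][0] for info in infos})
    return True, addresses


# Hypothetical usage, with a placeholder hostname:
# ok, detail = can_resolve("my-hub.azure-devices.net")
```

If this check also fails once the error has started and keeps failing until the service is restarted, that would point at resolver state or configuration on the device rather than at the SDK's retry logic.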
Hello @IcorreaX, how can you not be using containers with IoT Edge? Be aware that there is a good chance your issue and its solution are not the same as ours if you're not using IoT Edge.
Removing the configuration that binds /etc/resolv.conf from the host seems to have solved the problem. No disconnect after 24 hours of runtime on two test devices. So it looks like it was a DNS problem and not a bug in the SDK. @IcorreaX I suspect you are using
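The thread does not say what exactly was wrong with the bound resolv.conf. As one possible check only, the sketch below lists the nameservers in resolv.conf-style text and flags loopback addresses, which generally point at a resolver on the host and are typically not reachable from inside a container's own network namespace. The function names are hypothetical, and this failure mode is an assumption, not a confirmed diagnosis of this issue.

```python
import ipaddress


def nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines ('#' or ';').
        if not stripped or stripped[0] in "#;":
            continue
        fields = stripped.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def loopback_nameservers(resolv_conf_text):
    """Return only the nameservers that are loopback addresses."""
    flagged = []
    for server in nameservers(resolv_conf_text):
        try:
            # Drop any IPv6 zone suffix such as '%eth0' before parsing.
            address = ipaddress.ip_address(server.split("%", 1)[0])
        except ValueError:
            continue
        if address.is_loopback:
            flagged.append(server)
    return flagged


# Hypothetical usage inside the container:
# with open("/etc/resolv.conf", encoding="utf-8") as f:
#     print(loopback_nameservers(f.read()))
```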
@gbe-tuv excellent news :) If your tests go well, could you please give more details about what you did, for future visitors who are not very familiar with DNS?
Context
Operating System: Debian GNU/Linux 11 (bullseye)
Kernel: Linux 6.1.28
IoT Edge 1.5.13
Python 3.9.2
pip 24.3.1
avro 1.12.0
azure-core 1.32.0
azure-iot-device 2.14.0
azure-storage-blob 12.24.0
beautifulsoup4 4.12.3
certifi 2024.12.14
charset-normalizer 3.4.1
cryptography 3.3.2
deprecation 2.1.0
idna 3.10
isodate 0.7.2
janus 2.0.0
mccmnc 3.3
packaging 24.2
paho-mqtt 1.6.1
pip 24.3.1
PyGObject 3.38.0
PySocks 1.7.1
pyzmq 26.2.0
requests 2.32.3
requests-unixsocket2 0.4.2
setuptools 52.0.0
six 1.16.0
soupsieve 2.6
tqdm 4.67.1
typing_extensions 4.12.2
urllib3 2.3.0
websockets 14.1
wheel 0.34.2
Description of the issue
Hello,
We have a custom IoT Edge module that periodically sends messages to IoT Hub.
Establishing a connection and sending messages works for about 24 hours of container runtime. After that, the connection drops and an exception occurs during renewal of the SAS token:
After this exception occurs, reconnection attempts fail with the same exception until the module is shut down and restarted.
SAS token renewal succeeds without errors during the first ~24 hours of container runtime. A noticeable difference is that, in the logs where the error occurs, the line
Forcing paho disconnect to prevent it from automatically reconnecting
is present, whereas it is absent when token renewal succeeds. See the logs below for an example of a successful and a failed SAS token renewal.
Shutting down and restarting the module every 24 hours leads to data loss, which is not acceptable in our application. We are under time pressure due to customer demands, so any help would be highly appreciated.
Code sample exhibiting the issue
Shortened to code passages relevant for connection and message sending.
Console log of the issue
Logs from a failed SAS token renewal (without reconnection attempts):
(See full logs with reconnection attempts here: hermes_failed_renew_sas_token.log)
Logs from a successful SAS token renewal: