IoT connection problems
Incident Report for Poly Cloud Services
Postmortem

Microsoft confirmed that the Azure IoT disconnect incident began on 12/19 and was caused by a faulty backend node. When the node was reset on 12/20, devices disconnected and then immediately tried to reconnect to Azure's Device Provisioning Service (DPS). The resulting surge of connection attempts triggered DPS message throttling, which prevented the devices from attaching to the Poly Lens service.

To mitigate the issue, Microsoft increased our DPS message quota on 12/21 in the hope that it would allow our devices to reconnect. We initially observed an improvement in the connection rate, but it then regressed. The IoT team determined that the problem was occurring after the DPS connection, where some IoT API responses were being throttled. The throttling prevented devices from completing their registration check: when a device checked whether it had registered successfully, the API call would time out, and the device would restart the registration process from the beginning.
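
The retry loop described above can be illustrated with a minimal Python sketch. It is not the Poly Lens device client. The function name, the StatusTimeout exception, and the check_status callable are hypothetical, and the status strings ("assigning", "assigned", "failed") are assumed to follow the names Azure DPS uses for registration state. The point it shows is that a timed-out status check is retried against the same in-flight registration instead of starting a new one.

```python
import time
from typing import Callable


class StatusTimeout(Exception):
    """Hypothetical error: the status API timed out or was throttled."""


def wait_for_registration(
    check_status: Callable[[], str],
    max_polls: int = 10,
    poll_interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll an in-flight registration until it is assigned or has failed.

    A timed-out check says nothing about the registration itself, so the
    existing operation is kept and only the status check is repeated.
    """
    for _ in range(max_polls):
        try:
            status = check_status()
        except StatusTimeout:
            # Do not restart registration here; that would add load to a
            # service that is already throttling.
            sleep(poll_interval)
            continue
        if status == "assigned":
            return True
        if status == "failed":
            return False
        # Any other status (for example "assigning") means: keep waiting.
        sleep(poll_interval)
    return False
```

With a client that restarts registration on every timeout, each throttled status check turns into a fresh registration request, which is the feedback loop the paragraph above describes.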

Today, the IoT team reduced the DPS registration quota and increased the IoT GetOperations API quota. This allowed the full device registration operation to complete on DPS, which in turn allowed our devices to connect to the service.

Before and throughout this issue, the Poly Lens engineering team worked closely with the Microsoft IoT team to identify potential improvements in our IoT client. We’ve already implemented an improved DPS registration process, and we expect it to be available in device software releases in early 2023.
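
This report does not describe what the improved registration process changes. One common way for a client to avoid a reconnect surge of the kind seen on 12/20 is exponential backoff with jitter, sketched below in Python. This is an illustration under that assumption only, not the actual Poly Lens change. The function name and the base and cap values are hypothetical.

```python
import random
from typing import Callable


def reconnect_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 300.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Return a "full jitter" backoff delay in seconds for a retry attempt.

    The ceiling doubles with each attempt up to `cap`, and the delay is
    drawn uniformly below that ceiling, so devices that disconnected at
    the same moment do not all retry at the same moment.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling
```

A device would sleep for reconnect_delay(attempt) seconds before each registration retry and reset attempt to zero after a successful connection.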

Posted Dec 22, 2022 - 19:27 MST

Resolved
This incident has been resolved.
Posted Dec 22, 2022 - 19:15 MST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Dec 22, 2022 - 14:00 MST
Investigating
We are currently working with our IoT provider to troubleshoot increased message throttling and IoT device disconnects in our service. Impact: Devices in Poly Lens may incorrectly appear offline, and some services that require an IoT connection may experience intermittent performance issues.
Posted Dec 20, 2022 - 10:22 MST
This incident affected: Poly Lens.