Service Incident - March 16, 2023

09:47 CET: Conversational functionality appears to be impacted on a number of bots. We are investigating the issue.

10:20 CET: Since approximately 09:57, bots appear to be back online. We are still investigating the root cause.

Root cause: Around 09:27 CET today, bots gradually started becoming unresponsive. Our logs show the Conversation Manager service repeatedly failing to connect to our Redis cache, which left bots unable to respond. We have not yet determined why the Redis cache was unreachable at that time. The telemetry dashboard of our Redis resource shows two logged errors, but these log entries contain no further details.

At around 10:00 CET, we finished restarting the pods that run our Conversation Manager service. After the restart, the service was able to connect to the Redis cache again and bots resumed functioning properly.

To prevent this issue from occurring again, we have enabled additional logging on our managed Redis resources to give us more visibility into these types of errors. We have also planned work on logic that will take down any pod that experiences repeated connection errors against our Redis cache. Our cluster will then replace it with a new pod that is able to reconnect to the Redis cache, which should greatly reduce downtime if the issue resurfaces. We will continue to investigate why the Redis cache was unreachable and address the underlying cause to prevent a recurrence.
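As an illustration only, the planned logic for taking down a pod after repeated Redis connection errors could look roughly like the sketch below. It is not our actual implementation: the class name `RedisHealthTracker`, the threshold of 5, and the `ping` callable are hypothetical, and the sketch catches Python's built-in `ConnectionError` and `TimeoutError`, whereas a real Redis client library typically raises its own exception types that would need to be passed in instead.

```python
from typing import Callable, Tuple, Type


class RedisHealthTracker:
    """Hypothetical helper: counts consecutive failed connectivity checks."""

    def __init__(
        self,
        max_consecutive_failures: int = 5,  # assumed threshold, not a real setting
        errors: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    ) -> None:
        self.max_consecutive_failures = max_consecutive_failures
        self.errors = errors
        self.consecutive_failures = 0

    def record(self, ping: Callable[[], object]) -> bool:
        """Run one connectivity check and return True if it succeeded."""
        try:
            ping()
        except self.errors:
            # Count the failure; the streak keeps growing until a check succeeds.
            self.consecutive_failures += 1
            return False
        # Any successful check resets the streak.
        self.consecutive_failures = 0
        return True

    @property
    def should_restart(self) -> bool:
        """True once the failure streak reaches the configured threshold."""
        return self.consecutive_failures >= self.max_consecutive_failures
```

One possible way to use such a tracker, assuming the cluster restarts pods whose health check fails, is to have the pod's health endpoint report unhealthy (or have the process exit) once `should_restart` becomes true, so that the cluster replaces the pod with a fresh one.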