Transient netsplit in us-east-2 leading to protracted netsplit for a few channels

Incident Report for Ably

Resolved

At 03:32 UTC on the 12th, the us-east-2 (Ohio) data center was partially partitioned (netsplit) from the rest of the cluster. The netsplit was healed by 03:40. However, a very small proportion of channels did not self-heal, and remained partitioned between us-east-2 and at least one other site in the cluster (mostly either eu-central-1 or eu-west-1) until we took manual corrective action to fix them, which was completed in the production cluster at 12:22 UTC.

Our existing quality-of-service alerting infrastructure should have lead to us being paged at the time, however, due to a bug in the definition for this particular alert, this did not happen.

Reliable delivery of messages is the thing we prioritise above all else at Ably, and so even though the number of affected channels was small, we are taking this extremely seriously, and (other than fixing the bug in the self-healing function) we will be prioritising improvements to our alerting framework to ensure that no future similar issue can be allowed to last this long. We are deeply sorry to those customers who had affected channels.
Posted Aug 12, 2024 - 14:08 UTC
This incident affected: Ably Pub/Sub Channels (US East (Ohio)).