Timeouts and High Latencies Across All Regions

Incident Report for Ably

Postmortem

Summary of the incident affecting the Ably production service on 1 September 2023 - investigation and conclusions

Overview

There was an incident affecting Ably production services on 1 September 2023, which impacted a significant fraction of connections to primary and fallback endpoints, for a total of 12 minutes from 1519 to 1531 (all times UTC). The problem arose as a result of a Distributed Denial of Service (DDoS) attack targeting the Ably default endpoint, which temporarily overloaded various elements of Ably's networking and request handling infrastructure.

At the current stage of the investigation we have a thorough understanding of the nature of the attack, its impact on the service, and the effectiveness of our mitigations at the time. We are still in the process of planning and implementing a number of further mitigations to ensure that we can better handle a range of attack scenarios should they arise in the future. However, we are deliberately limiting the detail we disclose about the nature of this specific attack and our available mitigations.

Background

Ably operates a number of production clusters for its service around the globe. There is a main production cluster that services the majority of customer accounts, plus a small number of dedicated clusters for specific accounts. Each of the clusters has a presence in multiple regions in a globally federated system.

Access network, client connections and routing

In each region, the frontend layer of the messaging system, which terminates subscriber connections and handles REST requests, consists of a group of endpoint instances behind a mesh of routers/reverse proxies. These routers are responsible for distributing connections and requests among the available endpoint instances, and for implementing endpoint affinity for certain classes of requests; the routers themselves sit behind AWS NLB instances in each region.

Clients typically connect to the Ably service using the generic API hostnames (i.e. `rest.ably.io` and `realtime.ably.io`), which resolve to a CloudFront distribution. This in turn uses latency-based DNS resolution to route each request to the nearest available and healthy NLB (usually the geographically nearest datacenter). Route53 health checking routes requests to more remote service endpoints if the nearest is detected to be unhealthy. Certain Ably customers have dedicated endpoints for connection, each with its own dedicated CloudFront distribution; in the majority of cases (i.e. all except those customers using dedicated clusters), these also route to the same multi-tenanted production cluster.
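The routing behaviour described above can be illustrated with a minimal sketch. The function, region names, and latency figures below are all hypothetical, purely for illustration; they are not Ably's actual routing implementation, which is handled by CloudFront and Route53 rather than application code.

```python
# Illustrative sketch only: names and health data are hypothetical, not
# Ably's actual routing logic (which is implemented in CloudFront/Route53).

def pick_endpoint(candidates, healthy):
    """Pick the lowest-latency endpoint that passes its health check,
    mirroring latency-based DNS resolution with health-check failover."""
    usable = [c for c in candidates if healthy.get(c["region"], False)]
    if not usable:
        raise RuntimeError("no healthy endpoints available")
    return min(usable, key=lambda c: c["latency_ms"])["region"]

candidates = [
    {"region": "eu-west-1",      "latency_ms": 12},
    {"region": "us-east-1",      "latency_ms": 85},
    {"region": "ap-southeast-1", "latency_ms": 190},
]

# Nearest region healthy: the client is routed there.
print(pick_endpoint(candidates,
      {"eu-west-1": True, "us-east-1": True, "ap-southeast-1": True}))   # eu-west-1

# Nearest region unhealthy: traffic fails over to the next-nearest healthy NLB.
print(pick_endpoint(candidates,
      {"eu-west-1": False, "us-east-1": True, "ap-southeast-1": True}))  # us-east-1
```

The key property this models is that failover is automatic and per-client: an unhealthy region simply drops out of the candidate set, and resolution falls through to the next-lowest-latency healthy endpoint.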

System scaling

During normal operations, each of the principal roles that make up a cluster - including frontends and routers as mentioned above - scale autonomously in response to load variations. This scaling takes place independently in each of the regions that the cluster is operating in.

Autoscaling is configured using specific scaling thresholds and step scaling parameters to achieve a balance between operating efficiency, instantaneous capacity margin, and rapidity of scaling in response to load variations. In the case of a large and sudden load increase, for example, a capacity margin is available to absorb some load increase whilst additional capacity comes online. The rate at which new capacity can become effective is constrained by:

  • the time taken for new instances to be launched and initialised;

  • the absolute rate at which new instances are added.

These factors mean that there are limits to the rate at which capacity is added in response to load spikes. When there is an abnormally severe load spike, there can be a period of time in which the cluster in a region is under-provisioned, which results in some fraction of requests being denied until the required capacity is in place.
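The interaction of these two constraints can be made concrete with a simple model. All numbers below (step size, evaluation interval, boot time) are invented for illustration and are not Ably's real autoscaling parameters.

```python
# Hypothetical capacity-ramp model. The parameters are illustrative only,
# not Ably's actual scaling configuration.

def time_to_reach(target, current, step, launch_secs, interval_secs):
    """Return seconds until `target` instances are in service, when at most
    `step` instances can be requested per scaling interval and each batch
    takes `launch_secs` to boot and initialise."""
    t = 0
    while current < target:
        current += step        # one scaling step fires
        t += interval_secs     # wait for the next evaluation cycle
    return t + launch_secs     # the final batch still has to come online

# e.g. doubling from 20 to 40 instances, adding at most 5 instances per
# 60-second scaling cycle, with a 120-second boot/initialisation time:
print(time_to_reach(40, 20, 5, 120, 60))  # 360 seconds under-provisioned
```

Even in this simplified model, a sudden doubling of load leaves the region under-provisioned for several minutes, which is why some fraction of requests is denied until the required capacity is in place.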

We can also manually trigger explicit scaling in order to increase available capacity at a rate beyond that which happens autonomously; this manual control is available as an intervention to handle certain disruption situations.
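Conceptually, the manual intervention amounts to overriding the autoscaler's desired capacity, clamped to a configured maximum. The sketch below uses a simple in-memory stand-in for a scaling group; a real operation would go through the cloud provider's API (for example, setting an auto scaling group's desired capacity), not code like this.

```python
# Sketch of an emergency "double capacity" intervention. The dict is a
# hypothetical stand-in for an auto scaling group; real interventions use
# the cloud provider's API.

def double_capacity(group):
    """Double desired capacity, clamped to the group's configured maximum."""
    group["desired"] = min(group["desired"] * 2, group["max"])
    return group["desired"]

asg = {"desired": 24, "max": 200}
print(double_capacity(asg))  # 48 - first manual doubling
print(double_capacity(asg))  # 96 - a second doubling, if still insufficient
```

Because each doubling requests a large block of capacity in one step, this ramps faster than threshold-driven autoscaling, at the cost of requiring an operator's judgement about how much headroom is needed.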

As it happened

In the description below all times are in UTC, on 2023-09-01.

  • At 1519 we were alerted by our endpoint monitoring that the realtime endpoint was unhealthy, followed by a large number of other service failure indications, affecting all regions, over the next two minutes.
  • It was immediately evident that frontend and router instances were overloaded, although the cause had not yet been identified. Autoscaling had already been triggered.
  • At 1524, explicit manual scaling was triggered to double capacity in eu-central-1, beyond what autoscaling had already added; the same intervention followed immediately in all other regions.
  • At 1529, with the added capacity still insufficient to handle the external load, a further doubling of capacity was manually triggered in all regions.
  • At 1531, as all requested additional capacity came online, service was restored to normal.
  • The attack subsided some minutes later.
  • At 1550 a second attack was mounted, but this was absorbed without impact.

Impact

During the incident there was an impact both to new requests and connection attempts, and to operations on existing connections.

The success rate for new connection attempts dropped to a low of 71% at 1523, then steadily recovered to 100% by 1531.

Operations on existing connections (creating new attachments, and publishing on existing attachments) were impacted, but less severely. New attachment success rates dropped to a low of 97.5% at 1525, and publish success rates (again on existing connections) dropped to a low of 98.5% at 1519.

Customers using the default endpoint, or any dedicated endpoint that resolves to the main production cluster, were affected.

Customers using a dedicated cluster were not affected.

The investigation has identified a number of issues relating to the DDoS resilience of the Ably service. We will not publicly share many details of the attack or the remediations, except to say that this was a volumetric Layer 7 attack from a broad range of IPs. We have identified, and are implementing, a range of measures to ensure that we are better able to withstand this kind of attack in the future.

Conclusion

The service issues were the direct result of a DDoS attack, which temporarily overloaded Ably's available capacity and degraded the service available to legitimate traffic. We have reviewed our response to this event and identified a number of mitigations that we expect will reduce or eliminate the impact of similar events in the future.

We take service continuity very seriously, and this incident represents another opportunity to learn and to improve the level of service we are committed to providing for our customers. We are sorry for the disruption experienced by some customers, and are committed to identifying and implementing remedial actions to address the risks from such attacks as far as possible in the future.

Posted Jul 09, 2024 - 10:32 UTC

Resolved

Timeouts and high latencies experienced across all regions between 15:18 UTC and 15:31 UTC.

The cause of the issue was transient load causing temporary resource starvation. We are working on mitigating this going forward.

Posted Sep 01, 2023 - 15:18 UTC