Incident log archive

All times are shown in UTC

January 2021

19th January 2021 05:00:00 PM

Delays in push message delivery and reactor queue delivery

Due to unexpectedly high load, mobile push messages (that is, to APNS/FCM) may have experienced delivery delays in the last hour or so.

Additionally, between about 17:50 and 17:55 UTC, consumers of some Reactor queues may have experienced delays.

The core service (normal pub/sub etc) was unaffected.

Everything should now be back to normal.


Resolved in about 1 hour

December 2020

9th December 2020 09:31:00 AM

Reactor queue admin operations (creation, listing, deletion) are intermittently failing

We are looking into Reactor queue management operations on the website (creation, listing, deletion) failing intermittently since around 09:30 UTC this morning. As a result, listing your queues may incorrectly show that you have no queues.

Non-management operations (pushing into queues, consuming from queues) are unaffected.

Update at 10:38: All queue management operations are back to normal. We apologise for the inconvenience.


Resolved in about 1 hour
4th December 2020 12:10:00 PM

Some Reactor queues briefly unavailable to consumers

Between about 12:10 and 12:15 UTC, a subset of Reactor queues was unavailable to consumers because one RabbitMQ server became unavailable. We use mirrored queues with a replication factor of 2, and only one node was affected, so no messages were lost. However, consumers whose queues had their primaries on the affected node may have been unable to consume for a few minutes; such consumers would have been rejected with the error `home node '[email protected].io' of durable queue ':' in vhost '/shared' is down or inaccessible`.
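Consumers can guard against this kind of transient node failure by retrying with exponential backoff when a consume attempt is rejected. A minimal sketch, assuming a generic AMQP client: the `consume` callable and the use of `ConnectionError` are hypothetical stand-ins for whatever your client library provides.

```python
import time


def consume_with_retry(consume, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `consume` (a zero-argument callable that starts consuming),
    retrying with exponential backoff if it raises ConnectionError.

    `consume` stands in for whatever your AMQP client uses to open a
    consumer; a real client would raise its own error type when the
    queue's home node is down or inaccessible.
    """
    for attempt in range(max_attempts):
        try:
            return consume()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # back off 1s, 2s, 4s, ... between attempts
            sleep(base_delay * (2 ** attempt))
```

With a queue outage of a few minutes, as above, a consumer using a loop like this would have resumed automatically once the affected node recovered.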


Resolved in 5 minutes

October 2020

9th October 2020 02:38:25 PM

Potential pre-emptive action for AWS US-East capacity problems

AWS have been reporting issues with new instances coming online in the US East region.

This has had no impact on our service in this region; however, we are actively monitoring it. If the situation changes such that we are not confident there is enough capacity to service the traffic, we will likely route traffic away from US East until AWS EC2 instances are stable in that region.

9th Oct 08:14 PM

Amazon reported that they have now resolved the issues in US East 1, and we have resumed normal operations again in all regions.

The update from AWS on the root cause of this problem is as follows:

Starting at 9:37 PM PDT on October 8th, we experienced increased network connectivity issues for a subset of instances within a single Availability Zone (use1-az2) in the US-EAST-1 Region. This was caused by a single cell within the subsystem responsible for the updating VPC network configuration and mappings experiencing elevated failures. These elevated failures caused network configuration and mappings to be delayed or to fail for new instance launches and attachments of ENIs within the affected cell. The issue has also caused connectivity issues between an affected instance in the affected Availability Zone(use1-az2) and newly launched instances within other Availability Zones in the US-EAST-1 Region, since updated VPC network configuration and mappings were not able to be updated within the affected Availability Zone(use1-az2). The root cause of the issue was addressed and at 10:20 AM PDT on October 9th, we began to see recovery for the affected instances. By 11:10 AM PDT, all affected instances had fully recovered. The issue has been resolved and the service is operating normally


Resolved in about 7 hours

September 2020

1st September 2020 09:18:00 AM

Publishing issues in us-west-1

A small fraction of publishes failed between 09:18 and 09:36 UTC with the error "Service Unavailable (server at capacity limit)" due to resource issues in the us-west-1 region. We identified the issue and increased the resources available to accommodate a spike in load on the system.


Resolved in 18 minutes

August 2020

19th August 2020 11:37:00 PM

Disruption in us-west-1

Networking issues in AWS us-west-1 between 23:37 and 00:13 UTC led to elevated error rates and channel continuity losses in us-west-1.

Customer-visible effects will have consisted of:
- a small proportion of connections connected to the us-west-1 region will have timed out or been disconnected, and may then have reconnected to another region and failed to resume their connection, thereby experiencing a continuity loss on their channels
- a small proportion of channels that were active in the us-west-1 region will have migrated to other instances as the affected instances (which appear to have had all networking cut for an extended period) were detected as unhealthy and removed from the cluster. During this period, publishes to those channels may have been refused or timed out, and channels which lost continuity due to the disruption will have notified attached clients of the continuity loss (in the form of an 'update' event; see https://www.ably.io/documentation/realtime/channels#listening-state).

We are in the process of a larger piece of work that will further decouple channel regions from each other, such that when networking issues affect a single region, activity in other regions will be completely unaffected. Unfortunately, this is not yet complete, and in the current architecture clients in other regions attached to channels active in us-west-1 may also have experienced continuity losses.
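Clients can detect such continuity losses by listening for the channel 'update' event and checking whether the attachment was resumed; when it was not, any local state derived from channel messages should be rebuilt. A minimal, client-agnostic sketch: the `ChannelStateChange` shape mirrors the event described above, but the `resync` hook is a hypothetical example of application-level recovery.

```python
from dataclasses import dataclass


@dataclass
class ChannelStateChange:
    """Mirrors the state-change object delivered with a channel
    'update' event: `resumed` is False when message continuity
    was lost across a reconnection."""
    current: str
    previous: str
    resumed: bool


def on_update(change, resync):
    """Handle a channel 'update' event: if continuity was lost
    (resumed is False), invoke the caller-supplied resync hook to
    rebuild local state (e.g. by re-fetching via channel history).
    Returns True when a resync was triggered."""
    if not change.resumed:
        resync()
        return True
    return False
```

A handler like this keeps application state consistent across the rare occasions when a connection resumes in a different region and cannot recover the message backlog.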


Resolved in 36 minutes

July 2020

28th July 2020 08:25:00 AM

Sporadic high latencies in us-east-1

We are investigating intermittent high latencies in REST requests to us-east-1. Other datacenters are unaffected.

28th Jul 09:20 AM

We have temporarily redirected traffic away from us-east-1 to other datacenters

28th Jul 09:35 AM

The issue is due to AWS issues with resolving DNS from within us-east-1: https://status.aws.amazon.com/ . We are leaving traffic redirected away from us-east-1 until that is resolved. (Any connections that were already connected to us-east-1 will remain for now.)

28th Jul 11:41 AM

Following AWS reporting the DNS resolution latency issues as fixed on their end, which we have confirmed, we are now re-enabling traffic to us-east-1.

The total customer-visible effect should have been minimal: occasional latency spikes for requests to us-east-1 between the start of the AWS issue and when we redirected traffic away from that region at 08:15 UTC, and fractionally higher latencies for customers near us-east-1 who were redirected to us-west-1 for the duration.


Resolved in about 3 hours
22nd July 2020 07:08:19 AM

Database errors across all regions

We are seeing very high load in the database layer that is affecting all regions and are currently investigating.

22nd Jul 07:55 AM

Within 20 minutes of active management the load on our global database has returned to normal levels.

Our initial investigation indicates that customers using our Push registration and delivery APIs were most affected during this period.

We are investigating the root cause of this issue now and will continue to post updates as we know more.

We apologise for any inconvenience this may have caused.


Resolved in 21 minutes
19th July 2020 12:41:20 PM

Scheduled website database maintenance

We're performing scheduled Redis and PostgreSQL database maintenance. Customer dashboards and the website will be unavailable for a few minutes during the maintenance window, and notifications might be delayed by a few minutes. The realtime systems will not be affected by the migration.


Resolved in about 2 hours

June 2020

19th June 2020 06:24:26 PM

Support ticketing site maintenance

We are currently migrating our ticketing system and FAQ site support.ably.io to support.ably.com.

During this migration, there will be some disruption for some customers.

We expect this to be completed within 30 minutes.

If you have any issues, please contact us via live chat on www.ably.io

19th Jun 06:57 PM

DNS migration is now complete with 3rd party provider Freshdesk.


Resolved in 33 minutes

May 2020

6th May 2020 10:24:00 AM

Push notifications processing stalled

The processing of push notifications is stalled. We are currently investigating the cause.

We will post an update here in 15 minutes, or as soon as there is more information.

6th May 11:32 AM

A fix for this problem is being deployed now and we are monitoring the situation.

6th May 11:34 AM

The service is back to normal.


Resolved in about 1 hour

April 2020

23rd April 2020 10:00:53 PM

Alert emails and other website notifications stalled

A website problem has been causing the sending of various automated emails, including limit notifications and welcome emails, to be stalled. The majority of emails arising from 09:22 UTC on Thursday 23 April were stalled and backlogged until the service was unblocked at 12:05 UTC on Monday 27 April. All backlogged emails were eventually sent by 18:00 UTC.

The system is now operating normally.

An investigation is continuing into the circumstances that led to the problem, and to the extended time for resolution. This incident will be updated in due course with more information.

7th May 11:23 AM

Our engineering and ops teams have completed the post mortem of this incident and summarised all actions we have taken to ensure we can avoid any future disruption to our service.

See https://gist.github.com/paddybyers/c27d302524caa8e46f41e9ba19fdcf2e


Resolved in 4 days

March 2020

16th March 2020 07:54:24 PM

Website intermittently available

Our website (www.ably.io) is experiencing availability issues due to an issue with our hosting provider, Heroku. The realtime service is unaffected and remains fully operational in all regions.

16th Mar 08:01 PM

Heroku seems to be having ongoing issues; there's no explanation in https://status.heroku.com/incidents/1973. We will continue to monitor the situation.


Resolved in 5 minutes
16th March 2020 04:13:00 PM

Website intermittently available

Our website (www.ably.io) is experiencing availability issues due to an issue with our hosting provider, Heroku.

The realtime service is unaffected and remains fully operational in all regions.


Resolved in 12 minutes
10th March 2020 10:27:00 PM

Website stats timeouts

We are currently experiencing timeouts from the website for some async operations.

- Stats in dashboards
- Blog feeds in the navigation
- Some queries for keys, queues, rules in the dashboards

We are investigating the root cause, but are rolling back to a previous version now.

10th Mar 10:39 PM

A downstream provider of our database and web services performed maintenance on our database earlier today, which required a change in the endpoint used by all web services. Unfortunately the update was only made to one of the two web services required, which caused the async operations to fail during this period.

The issue is now fully resolved, and we'll be investigating why this update was only applied to one of the two web services.


Resolved in 10 minutes