
Incident log archive

All times are shown in UTC

June 2020

19th June 2020 06:24:26 PM

Support ticketing site maintenance

We are currently migrating our ticketing system and FAQ site support.ably.io to support.ably.com.

During this migration, some customers will experience disruption.

We expect this to be completed within 30 minutes.

If you have any issues, please contact us via live chat on www.ably.io

19th Jun 06:57 PM

The DNS migration with our third-party provider, Freshdesk, is now complete.

Resolved

in 33 minutes

May 2020

6th May 2020 10:24:00 AM

Push notifications processing stalled

The processing of push notifications is stalled. We are currently investigating the cause.

We will make an update here in 15 minutes, or as soon as there is more information.

6th May 11:32 AM

A fix for this problem is being deployed now and we are monitoring the situation.

6th May 11:34 AM

The service is back to normal.

Resolved

in about 1 hour

April 2020

23rd April 2020 10:00:53 PM

Alert emails and other website notifications stalled

A website problem stalled the sending of various automated emails, including limit notifications and welcome emails. The majority of emails arising from 0922 UTC on Thursday 23 April were stalled and backlogged until the service was unblocked at 1205 UTC on Monday 27 April. All backlogged emails were eventually sent by 1800 UTC.

The system is now operating normally.

An investigation is continuing into the circumstances that led to the problem, and to the extended time for resolution. This incident will be updated in due course with more information.

7th May 11:23 AM

Our engineering and ops teams have completed the post mortem of this incident and summarised all actions we have taken to ensure we can avoid any future disruption to our service.

See https://gist.github.com/paddybyers/c27d302524caa8e46f41e9ba19fdcf2e

Resolved

in 4 days

March 2020

16th March 2020 07:54:24 PM

Website intermittently available

Our website (www.ably.io) is experiencing availability issues due to an issue with our hosting provider, Heroku. The realtime service is unaffected and remains fully operational in all regions.

16th Mar 08:01 PM

Heroku seems to be having ongoing issues; there's no explanation in https://status.heroku.com/incidents/1973. We will continue to monitor the situation.

Resolved

in 5 minutes
16th March 2020 04:13:00 PM

Website intermittently available

Our website (www.ably.io) is experiencing availability issues due to an issue with our hosting provider, Heroku.

The realtime service is unaffected and remains fully operational in all regions.

Resolved

in 12 minutes
10th March 2020 10:27:00 PM

Website stats timeouts

We are currently experiencing timeouts from the website for some async operations.

- Stats in dashboards
- Blog feeds in the navigation
- Some queries for keys, queues, rules in the dashboards

We are investigating the root cause, but rolling back to a previous version now.

10th Mar 10:39 PM

A downstream provider of our database and web services performed maintenance on our database earlier today, which required a change in the endpoint used by all web services. Unfortunately the update was only made to one of the two web services required, which caused the async operations to fail during this period.

The issue is now fully resolved, and we'll be investigating why this update was only applied to one of the two web services.

Resolved

in 10 minutes

February 2020

29th February 2020 09:23:00 PM

Performance issues in all regions due to database layer issues

We are experiencing elevated error rates and latencies in all regions, due to continued intermittent performance issues we're experiencing with our database layer.

29th Feb 09:36 PM

As with yesterday's incident, this one resolved itself after 9 minutes. We continue to investigate as a top priority.

1st Mar 11:31 PM

We have now identified the root cause of the recent latency issues in the global persistence layer, and have rolled out updates that have ensured latencies remain consistently low.

The primary cause of the problem was an inadequate rate limiter in one area of our system, which allowed our persistence layer to be overloaded and thus impact the global service latencies for operations that rely on the persistence layer (primarily history, push registrations, and persisted tokens).
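For readers unfamiliar with the term, a rate limiter caps how fast callers can push requests into a downstream system so that a burst cannot overwhelm it. The sketch below is purely illustrative and assumes a simple token-bucket design; the class name, parameters, and limits are ours and do not describe Ably's internal implementation.

```typescript
// Illustrative token-bucket rate limiter guarding a persistence layer.
// All names and numbers here are assumptions for the sketch, not Ably's internals.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private readonly capacity: number,        // maximum burst size
    private readonly refillPerSecond: number  // sustained rate allowed
  ) {
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  // Returns true if the request may proceed; false means reject (or queue)
  // instead of passing the load through to the database.
  allowRequest(): boolean {
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Example: allow bursts of 100 history queries, sustained at 50 per second.
const historyLimiter = new TokenBucket(100, 50);
if (!historyLimiter.allowRequest()) {
  throw new Error("Rate limit exceeded; rejecting early to protect the persistence layer");
}
```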

A full post mortem will follow soon.

10th Mar 08:23 PM

Our engineering and ops team have completed the post mortem of this incident and summarised all actions we have taken to ensure we can avoid any future disruption to our service.

See https://gist.github.com/pauln-ably/03098db1095f4ef61aac801ae987dac2

Resolved

in 11 minutes
28th February 2020 09:23:49 PM

Performance issues in all regions due to database layer issues

We are experiencing elevated error rates and latencies in all regions, due to continued intermittent performance issues we're experiencing with our database layer.

The issue resolved within 5 minutes.

We're continuing our investigation into the root cause of these intermittent performance issues.

29th Feb 12:43 AM

Following more than 36 hours of intermittent short and significant increases in latencies in our persistence layer, the engineering team have been investigating the root cause to understand why only a small percentage of shards are affected during this time.

We have made significant progress in identifying potential root causes; in the meantime, we have also been addressing the issues by adding capacity and upgrading the entire cluster.

The persistence cluster is now operating with approximately 3x more capacity than it had 24 hours ago and is now upgraded to the latest stable versions.

We'll continue to investigate what triggered these increases in latencies; however, we are optimistic that the increased capacity will now offer stability and predictable performance moving forward.

A full post mortem will be published soon.

10th Mar 08:25 PM

Please see https://status.ably.io/incidents/695 for the post mortem of this disruption.

Resolved

in 8 minutes
28th February 2020 10:44:00 AM

Performance issues in all regions due to database layer issues

We are experiencing elevated error rates and latencies in all regions. Investigating.

28th Feb 12:10 PM

Latencies are back to normal as of 11:21 UTC

10th Mar 08:24 PM

Please see https://status.ably.io/incidents/695 for the post mortem of this disruption.

Resolved

in 37 minutes
28th February 2020 06:57:22 AM

Performance issues in all regions due to database layer issues

We are seeing increased latencies and error rates in all regions due to database issues

28th Feb 07:16 AM

Error rates and latencies are back to normal. We are continuing to investigate the root cause.

28th Feb 08:51 AM

Service has continued with no further issues.

10th Mar 08:24 PM

Please see https://status.ably.io/incidents/695 for the post mortem of this disruption.

Resolved

in 13 minutes
27th February 2020 09:23:00 PM

Performance issues in all regions due to database layer issues

We are investigating performance issues in all regions due to an issue with our database layer (Cassandra)

27th Feb 10:41 PM

We had elevated Cassandra latencies for 9 minutes between 21:23 and 21:32 UTC. This is essentially the same issue as was happening earlier today; we are still investigating the root cause.

10th Mar 08:24 PM

Please see https://status.ably.io/incidents/695 for the post mortem of this disruption.

Resolved

in 9 minutes
27th February 2020 11:05:42 AM

Performance issues in all regions due to database layer issues

We are investigating performance issues in all regions due to an issue with our database layer (Cassandra)

27th Feb 11:27 AM

Error rates have dropped back to normal. We are continuing to investigate.

27th Feb 01:11 PM

Error rates are back to normal. A small segment of the keyspace was unable to achieve quorum for a two-hour period; sufficient replicas are now back online to achieve quorum for the entire keyspace, and several more instances are in the process of being brought online. We will review our global replication strategy for this persistence layer as part of a post-mortem.
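As background on the quorum wording above: in a Cassandra-style system, quorum reads and writes require a majority of the replicas for a token range to respond. A minimal sketch of that arithmetic follows; the replication factor of 3 used below is an assumption for illustration, not a statement of Ably's configuration.

```typescript
// Quorum for a replicated keyspace segment: a majority of replicas must respond.
function quorum(replicationFactor: number): number {
  return Math.floor(replicationFactor / 2) + 1;
}

// With an (assumed) replication factor of 3, quorum is 2, so if 2 of the 3
// replicas for a token range are offline, quorum operations fail for keys in
// that range even though the rest of the keyspace remains healthy.
console.log(quorum(3)); // 2
console.log(quorum(5)); // 3
```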

10th Mar 08:24 PM

Please see https://status.ably.io/incidents/695 for the post mortem of this disruption.

Resolved

in about 2 hours

December 2019

3rd December 2019 04:30:00 PM

Minor transient disruption to channel lifecycle webhooks over the next day or two

Customers using channel lifecycle webhooks may experience some brief transient disruption (which in some cases may very briefly include duplicate or lost channel lifecycle webhooks) at some point over the next day or two, while we transition channel lifecycle webhooks over to a new architecture (message rules on the channel lifecycle metachannel). The result will be more dependable channel lifecycle webhooks, as they will now get the reliability benefits of running on top of Ably's robust, globally distributed channels, rather than (as they were previously) having all lifecycle events for an app funnelled through a single point.
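For context, the channel lifecycle metachannel mentioned above is an ordinary Ably channel that carries lifecycle events, so rules (including webhooks) can attach to it like any other channel. The sketch below shows how such events can be consumed with ably-js; the metachannel name and message shape are based on Ably's metachannel documentation as we understand it and should be treated as assumptions rather than part of this notice.

```typescript
// Sketch: consuming channel lifecycle events from the metachannel with ably-js.
// The channel name "[meta]channel.lifecycle" and the event names are assumptions
// based on Ably's metachannel docs, not details stated in this incident update.
import * as Ably from "ably";

const realtime = new Ably.Realtime("YOUR_ABLY_API_KEY"); // placeholder key

const lifecycle = realtime.channels.get("[meta]channel.lifecycle");

lifecycle.subscribe((message) => {
  // message.name is the lifecycle event, e.g. "channel.opened" or "channel.closed"
  console.log(message.name, message.data);
});
```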

Resolved

in 2 days

September 2019

30th September 2019 05:40:00 AM

Capacity issues in ap-southeast-1 (Singapore) region

Since 0540 UTC today, the cluster in the ap-southeast-1 region has been unable to obtain sufficient capacity to meet demand. As a result, connections in the region are experiencing slightly higher latencies.

Until more capacity is available, we are diverting traffic to ap-southeast-2 (Sydney).

30th Sep 03:46 PM

AWS capacity has now come online in the Singapore region (ap-southeast-1). All traffic is being routed back to this region now.

Resolved

in about 10 hours
25th September 2019 11:54:00 AM

Elevated rate of 5xx errors in us-east-1

We had a higher than normal level of 5xx errors from our routing layer in us-east-1 between 11:54 and 13:17 UTC. We believe we have identified the issue, have instituted a workaround, and are working on a fix. Service should be generally unaffected as rejected requests will have been rerouted to other regions by our client library fallback functionality.
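The fallback functionality referred to above is the client-side pattern of retrying a failed request against alternative endpoints when one region's routing layer rejects it. The sketch below illustrates that pattern in general terms; the host names and retry policy are placeholders, not Ably's actual fallback host list or client library behaviour.

```typescript
// Generic fallback sketch: retry a REST request against alternative hosts when
// the primary returns a 5xx. Host names below are placeholders, not Ably's.
const HOSTS = [
  "rest.primary.example.com",
  "fallback-a.example.com",
  "fallback-b.example.com",
];

async function requestWithFallback(path: string): Promise<Response> {
  let lastError: unknown = new Error("no hosts configured");
  for (const host of HOSTS) {
    try {
      const res = await fetch(`https://${host}${path}`);
      if (res.status < 500) {
        return res; // success or a non-retriable client error
      }
      lastError = new Error(`HTTP ${res.status} from ${host}`);
    } catch (err) {
      lastError = err; // network failure: try the next host
    }
  }
  throw lastError;
}

// Usage: a request rejected with a 5xx in one region is transparently retried
// against the remaining hosts.
// requestWithFallback("/time").then((res) => console.log(res.status));
```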

Resolved

in about 1 hour