2022-07-18 incident update

Kitson Kelly, Ryan Dahl


Starting on July 18th at 01:20 UTC, all Deno Deploy services (including subhosting) experienced a service disruption that lasted until approximately 02:25 UTC (65 minutes). During this period users could not access projects hosted on Deno Deploy, access Deno web properties such as deno.com, or download modules hosted on deno.land. Additionally, subhosting customers were not able to run workloads on Deno Deploy during this time.

We have concluded that this outage was caused by a failure of the service discovery system used by Deno Deploy. This post details what exactly happened, and what we are doing to prevent this in the future.

All services are now operating normally again. No data was lost. We take outages like these seriously and sincerely apologize for the disruption.

Impact

For a period of 65 minutes, users were unable to access most Deno web properties and deployments on Deno Deploy, which included deno.com and deno.land. All Deno Deploy subhosted deployments were inaccessible.

Timeline of events

On July 18th at around 00:51 UTC, our etcd cluster began logging that it had run out of space for its database. etcd provides service discovery for Deploy, allowing services to coordinate and discover each other.

At about 01:20 UTC, we received internal alerts that Deploy-based deployments and web properties were unavailable.

At about 01:25 UTC, the outage was confirmed by the on-call engineer and a high priority incident was raised. By 01:30 UTC, we had several people investigating and had identified that the outage was caused by the etcd cluster. Soon it was confirmed that the etcd cluster was offline because of its inability to grow its internal database storage. Rebuilding the cluster was identified as the best option to restore service.

At about 02:00 UTC, an attempt was made to rebuild the cluster, during which several challenges were encountered. The cluster was successfully rebuilt and services started to come back online at about 02:25 UTC.

Root cause

The root cause was that the etcd instances hit their storage quota limit. The VMs they ran on had ample disk space available - the problem was that etcd has a 2GB default storage quota limit.
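As an illustration of how this kind of limit can be watched, here is a minimal sketch of a quota check. It assumes the database size has already been obtained from etcd (for example, from the DB size reported by `etcdctl endpoint status`). The function name, the report shape, and the 70% and 90% thresholds are hypothetical and are not part of Deno Deploy's tooling. The quota itself can be raised with etcd's `--quota-backend-bytes` flag.

```typescript
// Hypothetical sketch: names and thresholds are illustrative assumptions,
// not Deno Deploy's actual monitoring code.

// etcd's default backend storage quota of 2GB, expressed in bytes.
const DEFAULT_ETCD_QUOTA_BYTES = 2 * 1024 ** 3;

type QuotaStatus = "ok" | "warn" | "critical";

interface QuotaReport {
  usedFraction: number; // database size divided by the quota
  status: QuotaStatus;
}

// Classifies how close an etcd database is to its storage quota.
function checkQuota(
  dbSizeBytes: number,
  quotaBytes: number = DEFAULT_ETCD_QUOTA_BYTES,
  warnAt = 0.7, // assumed warning threshold
  criticalAt = 0.9, // assumed paging threshold
): QuotaReport {
  if (quotaBytes <= 0) {
    throw new RangeError("quotaBytes must be positive");
  }
  const usedFraction = dbSizeBytes / quotaBytes;
  let status: QuotaStatus = "ok";
  if (usedFraction >= criticalAt) {
    status = "critical";
  } else if (usedFraction >= warnAt) {
    status = "warn";
  }
  return { usedFraction, status };
}
```

A check like this compares against the quota rather than free disk space, which is the distinction that mattered here: the VMs had room, but the database did not.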

Etcd is used by Deploy for service discovery. After a period of time without etcd being available, internal Deploy load balancers no longer know where to send requests for isolate execution.
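The delay between etcd failing and requests failing can be pictured with a small sketch of a discovery cache whose entries go stale when they are not refreshed. This is purely illustrative and assumes a simple time-to-live model; the class name, method names, and TTL are hypothetical, and the actual Deploy load balancers may work quite differently.

```typescript
// Hypothetical sketch of a time-limited discovery cache; not the actual
// Deno Deploy load balancer logic.
class DiscoveryCache {
  private readonly ttlMs: number;
  private backends: string[] = [];
  private lastRefreshMs = -Infinity; // nothing known until the first refresh

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Called whenever the discovery store answers successfully.
  refresh(backends: string[], nowMs: number): void {
    this.backends = [...backends];
    this.lastRefreshMs = nowMs;
  }

  // Returns the cached backends, or an empty list once the data is
  // older than the TTL (the discovery store has been silent too long).
  lookup(nowMs: number): string[] {
    const age = nowMs - this.lastRefreshMs;
    return age <= this.ttlMs ? [...this.backends] : [];
  }
}
```

Under this model, a load balancer keeps routing for as long as its last successful refresh is within the TTL, and then has nowhere to send requests.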

What's next?

This incident, coupled with other recent incidents, has highlighted that our service discovery system is both critical infrastructure and prone to failure. We have already begun work on a more robust service discovery solution to replace the current system.

We encountered several hurdles in the restoration of service, including secrets management and co-dependency of services that should not be co-dependent. We will be addressing these problems in the coming days.

We have also been maturing our operational procedures and runbooks, as well as adopting better internal tooling for managing incidents. This incident highlighted some gaps in those procedures and documentation; closing them would have made escalation and resolution of the incident go better.

We take outages like this very seriously - this is why Deno Deploy is still in beta. We hope to announce General Availability in the coming months, dependent in part on fixing these critical gaps.