2022-07-18 incident update
Starting on July 18th at 01:20 UTC, all Deno Deploy services (including
subhosting) experienced a service disruption until approximately 02:25 UTC (65
minutes). During this period users could not access projects hosted on Deno
Deploy, access Deno web properties such as deno.com, or download modules hosted
on deno.land. Additionally, subhosting customers were not able to run
workloads on Deno Deploy during this time.
We have concluded that this outage was caused by a failure of the service discovery system used by Deno Deploy. This post details what exactly happened, and what we are doing to prevent this in the future.
All services are now operating normally again. No data was lost. We take outages like these seriously and sincerely apologize for the disruption.
For a period of 65 minutes, users were unable to access most Deno web properties
and deployments on Deno Deploy. Deno Deploy subhosted deployments were also
inaccessible.
Timeline of events
On July 18th at around 00:51 UTC, our etcd cluster started logging errors
indicating that it was out of space for its database. etcd is used for service
discovery for Deploy, allowing services to coordinate and discover each other.
At about 01:20 UTC, we received internal alerts that Deploy-based deployments and web properties were unavailable.
At about 01:25 UTC, the outage was confirmed by the on-call engineer and a high priority incident was raised. By 01:30 UTC, we had several people investigating and had identified that the outage was caused by the etcd cluster. Soon it was confirmed that the etcd cluster was offline because of its inability to grow its internal database storage. Rebuilding the cluster was identified as the best option to restore service.
At about 02:00 UTC, an attempt was made to rebuild the cluster, during which several challenges were encountered. The cluster was successfully rebuilt and services started to come back online at about 02:25 UTC.
The root cause was that the etcd instances hit their storage quota limit. The VMs they ran on had ample disk space available - the problem was that etcd has a 2GB default storage quota limit.
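To illustrate the kind of check that would flag this condition before the quota is reached, here is a minimal sketch that classifies a database size against a storage quota. It is not taken from Deno Deploy's tooling: the `checkQuota` helper, the `QuotaReport` shape, and the 70% and 90% thresholds are all assumptions made for illustration, and the database size would have to come from whatever metrics the cluster already exposes.

```typescript
// Hypothetical helper, for illustration only: not part of etcd or Deno Deploy.
// 2 GiB, assumed here to correspond to the 2GB default quota mentioned above.
const DEFAULT_QUOTA_BYTES = 2 * 1024 ** 3;

type QuotaStatus = "ok" | "warning" | "critical";

interface QuotaReport {
  usedFraction: number; // database size divided by quota
  status: QuotaStatus;
}

// Classify how close a database is to its storage quota.
// The warnAt and criticalAt thresholds are arbitrary example values.
function checkQuota(
  dbSizeBytes: number,
  quotaBytes: number = DEFAULT_QUOTA_BYTES,
  warnAt: number = 0.7,
  criticalAt: number = 0.9,
): QuotaReport {
  if (quotaBytes <= 0) {
    throw new RangeError("quotaBytes must be positive");
  }
  const usedFraction = dbSizeBytes / quotaBytes;
  let status: QuotaStatus = "ok";
  if (usedFraction >= criticalAt) {
    status = "critical";
  } else if (usedFraction >= warnAt) {
    status = "warning";
  }
  return { usedFraction, status };
}
```

A check like this only looks at the database quota, not at free disk space on the VM, which is the distinction that mattered in this incident.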
Etcd is used by Deploy for service discovery. After a period of time without etcd being available, internal Deploy load balancers no longer know where they can send requests for isolate execution.
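The failure mode described above can be sketched with a small cache of discovered backends that expires when it is not refreshed. This is only an illustration of the general pattern, not Deno Deploy's actual load balancer code: the `DiscoveryCache` class, its TTL, and the backend names are hypothetical.

```typescript
// Hypothetical sketch, for illustration only: a load balancer's view of
// backends learned from a service discovery system.
class DiscoveryCache {
  private backends: string[] = [];
  private lastRefreshMs: number = -Infinity;
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `now` is injectable so the expiry behavior can be exercised without waiting.
  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Called whenever service discovery successfully returns a backend list.
  refresh(backends: string[]): void {
    this.backends = [...backends];
    this.lastRefreshMs = this.now();
  }

  // Returns the known backends, or an empty list once the cached data is
  // older than the TTL, that is, once discovery has been down for too long.
  lookup(): string[] {
    const ageMs = this.now() - this.lastRefreshMs;
    return ageMs <= this.ttlMs ? [...this.backends] : [];
  }
}
```

With this design, a short discovery outage is invisible to callers, while a longer one leaves `lookup()` returning no backends, so requests have nowhere to go.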
This incident, coupled with other recent incidents, has highlighted that our service discovery system is both critical infrastructure and prone to failure. We have already begun work on a more robust service discovery solution to replace the current system.
We encountered several hurdles in the restoration of service, including secrets management and co-dependency of services that should not be co-dependent. We will be addressing these problems in the coming days.
We have also been maturing our operational procedures and runbooks, as well as adopting better internal tooling for managing incidents. This incident highlighted some gaps in those procedures and documentation; closing them would have made escalation and resolution of the incident smoother.
We take outages like this very seriously - this is why Deno Deploy is still in beta. We hope to announce General Availability in the coming months, dependent in part on fixing these critical gaps.