May 30 incident update
On Monday, May 30th at 22:12 UTC, several services provided by the Deno company experienced a 70-minute service disruption. During this time, projects hosted on Deno Deploy, including many Deno Land web properties, were not responding in some regions; additionally, attempts to deploy new workers would fail.
We have concluded that this outage was caused by our service discovery database (etcd) running out of memory. This post details what exactly happened, how we recovered the systems, and what we are doing to prevent this in the future.
All services are now operating normally again. No data was lost. We take outages like these seriously and sincerely apologize for the disruption.
Impact
During the 70-minute window, the Deno Deploy management dashboard at https://dash.deno.com/ was unavailable. Additionally, requests to Deno Deploy projects, including deno.com, deno.land, doc.deno.land, lint.deno.land, deno.news, and examples.deno.land, would fail in some regions.
Subhosting traffic was not impacted, other than some increased latency around the end of the incident.
Timeline of events
At 22:12 UTC, a load test across the entire Deno Deploy network is started. One minute later, at 22:13 UTC, the load test completes successfully.
At 22:15 UTC our uptime monitoring detects that some Deno Deploy hosted properties are unavailable and triggers an alarm.
At 22:25 UTC it is discovered that the etcd cluster has disintegrated. Multiple instances are stuck in a reboot loop: shortly after startup, the etcd daemon uses up all the RAM available on the VM and is subsequently terminated by the OOM killer.
Between 22:50 and 23:01 UTC, the etcd cluster is recreated using a different VM instance size with more memory available to it.
At 23:25 UTC all services in our network have registered themselves with the new etcd cluster and the system is operating normally again.
Root cause
During the load test, the Deno Deploy infrastructure scaled up automatically to handle the increased load. When the load test was over, the infrastructure scaled back down. This triggered a flurry of updates to the etcd cluster, which functions as our service discovery database.
While processing these updates, the etcd daemons used more memory than was available on the VMs they were running on. This caused the Linux OOM killer to terminate them. Because multiple etcd instances went down at the same time, quorum was lost and the service discovery database became unavailable.
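For context, etcd (like other Raft-based stores) can only make progress while a majority of its members are healthy. The short sketch below illustrates that quorum arithmetic; the cluster sizes shown are illustrative examples, not our actual topology.

```ts
// Quorum arithmetic for a Raft-based cluster such as etcd.
// A cluster of n members needs floor(n/2) + 1 of them up to accept writes,
// so it tolerates at most n - quorum(n) simultaneous failures.
// The sizes below are illustrative, not our actual cluster topology.
function quorum(members: number): number {
  return Math.floor(members / 2) + 1;
}

function faultTolerance(members: number): number {
  return members - quorum(members);
}

for (const n of [3, 5]) {
  console.log(
    `${n} members: quorum is ${quorum(n)}, tolerates ${faultTolerance(n)} failure(s)`,
  );
}
```

With any of these sizes, losing more than the tolerated number of members at once, as happened when several etcd daemons were OOM-killed simultaneously, leaves the cluster unable to serve requests until enough members rejoin.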
What’s next?
The etcd cluster now uses a different VM instance size with more memory available to it.
We are adding additional etcd health checks, and we will set up alerts that notify us when etcd memory usage approaches the memory available on its VM.
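As a rough illustration of the kind of alert we have in mind, the sketch below polls etcd's built-in Prometheus metrics endpoint and reports when resident memory crosses a threshold. The endpoint address, memory limit, threshold, and notifyOnCall hook are assumptions made for the example, not our production monitoring setup.

```ts
// Minimal etcd memory-watchdog sketch (illustrative only).
// etcd exposes Prometheus metrics over HTTP, including the standard Go metric
// process_resident_memory_bytes. The address, limit, and threshold below are
// assumptions for this example.
const METRICS_URL = "http://127.0.0.1:2379/metrics"; // hypothetical etcd endpoint
const VM_MEMORY_BYTES = 8 * 1024 ** 3; // hypothetical 8 GiB VM
const ALERT_THRESHOLD = 0.8; // alert at 80% of available memory

// Hypothetical alerting hook; in practice this would page the on-call engineer.
function notifyOnCall(message: string): void {
  console.error(`[ALERT] ${message}`);
}

async function checkEtcdMemory(): Promise<void> {
  const res = await fetch(METRICS_URL);
  const body = await res.text();

  // Prometheus text format: "process_resident_memory_bytes 1.234e+08"
  const match = body.match(/^process_resident_memory_bytes\s+(\S+)$/m);
  if (!match) {
    notifyOnCall("could not read etcd memory metric");
    return;
  }

  const usage = Number(match[1]) / VM_MEMORY_BYTES;
  if (usage > ALERT_THRESHOLD) {
    notifyOnCall(`etcd resident memory at ${(usage * 100).toFixed(1)}% of VM memory`);
  }
}

// Check once a minute; a failed scrape is itself worth alerting on.
setInterval(() => {
  checkEtcdMemory().catch((err) => notifyOnCall(`metrics check failed: ${err}`));
}, 60_000);
```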