July 13th Utah Outage Update
Starting on Wednesday, July 13th at approximately 18:00 UTC, several services provided by the Deno company experienced a regional service disruption in Utah (us-west3) lasting just over 24 hours. During this time, users in the Utah region experienced failures accessing projects hosted on Deno Deploy, including many Deno web properties.
We have concluded that this outage was caused by our load balancing service. This post details what exactly happened, and what we are doing to prevent this in the future.
We suffered a second, shorter outage on July 15th between 15:30 UTC and 16:00 UTC, after attempting to bring the us-west3 region back online. The newly created load balancer worked correctly for a few hours, but eventually encountered the same problem. During this time, projects hosted on Deno Deploy, including many Deno web properties, were unavailable in this region.
All services are now operating normally again. No data was lost. We take outages like these seriously and sincerely apologize for the disruption.
Impact

For a period of around 24 hours, some users in the us-west3 region were unable to access dash.deno.com and Deno Deploy projects, including deno.com and deno.land. This region is based in Utah and serves the surrounding area, including some neighboring states. During the secondary incident the impact was the same, but for a much shorter period: less than 30 minutes.
Only the us-west3 region was disrupted. All other regions operated normally throughout the incident. Additionally, subhosting was not affected.
Timeline of events
On July 13th, at around 18:45 UTC we started to receive reports of an outage from a small number of users. We investigated the status of our services, but were unable to confirm any of the reports. All of our status monitoring and tests reported that everything was operating normally.
Over the course of the outage, we continued to monitor our service status, and worked with some of the affected users to narrow down the source of the problem.
On July 14th, at 19:14 UTC we were able to identify that the problem was within our us-west3 region, which we then took offline, directing traffic to other nearby regions instead.
On July 15th, at 11:30 UTC we attempted to bring the region back online, with a new load balancer instance.
On July 15th, at 16:00 UTC the load balancer entered the same faulted state as before, and we again disabled the region. The region will remain disabled until our monitoring has improved and the issue has been permanently fixed.
Root cause

The different layers of load balancers and backend services use etcd to perform service discovery. Services announce themselves to the etcd cluster when their availability state changes. Other services (like these load balancers) retrieve the list of healthy backend services from etcd. They also watch the list of announced services to be informed when a given service becomes available or goes offline. The load balancers then use this list to decide which backend services to route traffic to.
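As a rough sketch, the announce/watch pattern can be modeled like this in TypeScript. This is an illustrative in-memory model, not our actual implementation; the class and method names are invented for this example:

```typescript
// Simplified model of service discovery: backends announce themselves,
// and a load balancer watches to keep a local list of healthy backends.
type Listener = (healthy: string[]) => void;

class ServiceRegistry {
  private healthy = new Set<string>();
  private listeners: Listener[] = [];

  // A backend announces itself when it becomes available.
  announce(addr: string) {
    this.healthy.add(addr);
    this.notify();
  }

  // ...and withdraws when it goes offline.
  withdraw(addr: string) {
    this.healthy.delete(addr);
    this.notify();
  }

  // A load balancer watches the registry so its backend list stays fresh.
  watch(listener: Listener) {
    this.listeners.push(listener);
    listener([...this.healthy]);
  }

  private notify() {
    const snapshot = [...this.healthy];
    for (const listener of this.listeners) listener(snapshot);
  }
}

// A load balancer keeps a local copy of the healthy backend list.
const registry = new ServiceRegistry();
let backends: string[] = [];
registry.watch((list) => { backends = list; });

registry.announce("10.0.0.1:8080");
registry.announce("10.0.0.2:8080");
registry.withdraw("10.0.0.1:8080");
console.log(backends); // ["10.0.0.2:8080"]
```

In the real system, the registry role is played by an etcd cluster and the "watch" is an etcd watch stream over the network, which is exactly the link that failed in this incident.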
During this outage, one of the regional load balancers severed its connection to etcd without the application noticing. We expect this connection to be severed routinely due to network faults, timeouts, and the like. As such, the application is programmed to reconnect automatically, and to shut itself down if reconnection fails.
Without this connection, the load balancer was unable to receive updates from etcd. This resulted in it "desynchronizing" from the rest of the system, and not knowing of any healthy backends to direct traffic to.
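A minimal sketch of the intended reconnect behavior might look like the following. The function names, retry count, and backoff policy here are hypothetical; the point is that exhausting reconnection attempts should fail loudly rather than leave the balancer silently desynchronized:

```typescript
// Hypothetical reconnect loop: keep the etcd watch alive, retry with
// backoff on failure, and shut down if retries are exhausted.
const MAX_RETRIES = 5;

async function watchWithReconnect(
  connect: () => Promise<void>, // resolves/rejects when the watch session ends
  shutdown: () => void,
) {
  for (let attempt = 0; attempt < MAX_RETRIES; attempt++) {
    try {
      await connect();
      attempt = -1; // successful session: reset the retry counter
    } catch {
      // Connection failed; back off briefly before retrying.
      await new Promise<void>((r) => setTimeout(r, 100 * 2 ** attempt));
    }
  }
  // All retries exhausted: take ourselves offline instead of serving
  // traffic with a stale view of the world.
  shutdown();
}
```

The failure in this incident was the case this loop cannot catch: the connection died in a way the application never observed, so neither the reconnect nor the shutdown path was ever triggered.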
A load balancer having no healthy backend services is not terribly uncommon, and the load balancer knows how to deal with it: if there are no healthy backends, it un-advertises itself from the network to prevent requests from ending up at this "dead end".
Due to a design oversight in the load balancer, it did not un-advertise itself in this specific scenario. As a result, the load balancer continued to receive traffic from upstream load balancers without having anywhere to direct it, and that traffic was dropped entirely.
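The oversight can be illustrated with a hypothetical version of the advertising decision. The state names below are invented for this sketch; the corrected logic treats a desynchronized balancer the same as one with an empty backend list:

```typescript
// Illustrative sketch of the "should this balancer advertise itself?"
// decision. The original code only withdrew on an empty healthy list;
// the desynchronized case must withdraw as well.
type State =
  | { kind: "synced"; healthyBackends: string[] }
  | { kind: "desynced" }; // etcd watch lost; backend list is stale/unknown

function shouldAdvertise(state: State): boolean {
  if (state.kind === "desynced") {
    // The fix: a desynchronized balancer withdraws itself, so upstream
    // load balancers stop routing traffic to a dead end.
    return false;
  }
  return state.healthyBackends.length > 0;
}

console.log(shouldAdvertise({ kind: "synced", healthyBackends: ["10.0.0.2:8080"] })); // true
console.log(shouldAdvertise({ kind: "synced", healthyBackends: [] })); // false
console.log(shouldAdvertise({ kind: "desynced" })); // false
```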
Remediation

This incident has made it clear that a few blind spots exist within our monitoring systems. We are looking into ways to reduce these, and to improve our monitoring capabilities within individual regions.
Additionally, it was difficult for us to narrow down which region was causing the outage, because the load balancer that failed sits very early in the network stack (it is a TCP load balancer). It does not record any diagnostics about dropped connections, nor does it have a return channel for diagnostic information (unlike HTTP load balancers, which can return a response header). We are working to improve the monitoring situation here.
We are also currently working on some architectural changes in the load balancing system to prevent this class of failures entirely.