On Thursday at 3:11 UTC, several services provided by the Deno company experienced a 35-minute disruption. During this time, projects hosted on Deno Deploy and the deno.land website were not responding. We have concluded that this outage was the result of an unexpected database maintenance event. This post details what exactly happened, how we recovered the systems, and what we are doing to prevent this in the future.
All services are now operating normally again. No data was lost. We take outages like these seriously and sincerely apologize for the disruption.
Timeline of events
At 3:11 UTC a primary Postgres database hosted on Google Cloud Platform started an unexpected maintenance event.
At 3:13 UTC an automated alarm fired because a request to deno.land/std had failed.
At 3:32 UTC the database maintenance finished, but ended in an unknown failure state.
At 3:48 UTC the database server restarted automatically due to the earlier failure, and the alarm was cleared.
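The alarm at 3:13 UTC came from an automated check that a request to deno.land/std succeeded. A minimal sketch of such a probe is shown below; this is illustrative only and not our actual monitoring system (the `probe` function and its timeout value are assumptions):

```typescript
// Hedged sketch of an HTTP health probe: report failure if the
// endpoint does not return a successful response within a timeout.
// Not Deno Deploy's real monitoring code; for illustration only.
async function probe(url: string, timeoutMs = 5000): Promise<boolean> {
  const ctrl = new AbortController();
  const timer = setTimeout(() => ctrl.abort(), timeoutMs);
  try {
    const res = await fetch(url, { signal: ctrl.signal });
    return res.ok; // true only for 2xx responses
  } catch {
    return false; // network error or timeout counts as a failed probe
  } finally {
    clearTimeout(timer);
  }
}
```

A real system would run a probe like this on a schedule and page an on-call engineer after consecutive failures, which matches the two-minute gap between the maintenance event starting and the alarm firing.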
Deno Deploy has a primary Postgres database hosted on Google Cloud Platform, which stores the data Deno Deploy needs to run its services. This database is set up for high availability and can fail over to standby replicas when the primary fails.
At 3:11 UTC, Google Cloud Platform started an unexpected maintenance event on the primary Postgres database. During maintenance events, failover is not possible. This meant that the primary database was unavailable for reads and writes, causing Deno Deploy services that access this database to fail.
After the maintenance event ended, the database server restarted automatically, and all services that were using the primary database were able to recover.
During the 35-minute window, requests to Deno Deploy projects failed, including requests to deno.land/x and deno.land/std. Deno programs trying to download modules from /x or /std experienced failures. The Deno Deploy management dashboard at https://dash.deno.com/ and the Deno Deploy GitHub integration were also unavailable.
We are in contact with GCP to determine the root cause of this maintenance event, and why the database server took so long to recover. As a temporary measure, we have halted scheduled database maintenance until we have determined a permanent solution to prevent outages like these. Additionally, we are investigating ways to make our services more resilient to primary database failures.
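One common resilience technique for transient database outages is retrying failed queries with exponential backoff, so brief unavailability does not immediately surface as user-facing errors. The sketch below shows the idea under stated assumptions: `withRetry` and the simulated flaky query are hypothetical, not code from Deno Deploy.

```typescript
// Hedged sketch: retry a failing async operation with exponential
// backoff. Illustrative only; not Deno Deploy's actual implementation.
async function withRetry<T>(
  op: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await op();
    } catch (err) {
      lastErr = err;
      // Back off exponentially: 100 ms, 200 ms, 400 ms, ...
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
    }
  }
  throw lastErr;
}

// Simulated database query that fails twice before succeeding,
// as it might during a brief primary outage.
let calls = 0;
const flakyQuery = async (): Promise<string> => {
  calls++;
  if (calls < 3) throw new Error("connection refused");
  return "row";
};

const result = await withRetry(flakyQuery);
```

Retries only paper over short blips; for a 35-minute outage like this one, the longer-term fixes are reducing the blast radius of the primary database and making more request paths servable from caches or replicas.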