May 23rd, 2023 Deno Deploy Postmortem

On May 23rd, 2023, starting at 20:42 UTC, we experienced an unexpected outage across all Deno Deploy services hosted on GCP, including deno.com and deno.land. The services were inaccessible for approximately 45 minutes because a logging roll-out pushed CPU load beyond our clusters' capacity.

Providing a stable and robust platform to our users is our top priority. We deeply regret this incident and sincerely apologize for any disruption caused. This report provides an overview of the event, the cause of the outage, and the measures we plan to take to prevent similar incidents in the future.

Impact

For approximately 45 minutes, users could not access key Deno web properties, including deno.com and deno.land. All deployments hosted on Deno Deploy were affected.

Timeline of Events

All times are in UTC on May 23rd, 2023.

20:34 - Initiation of a logging update to our production clusters.
20:42 - First alerts triggered indicating system failures.
20:45 - Team member reports unavailability of deno.com.
20:47 - A status update was promptly published.
20:51 - Rollback procedures were put in motion.
21:04 - The system started showing signs of recovery as alerts subsided.
21:18 - The majority of our systems were recovered.
21:27 - Full recovery of all system alerts.
21:43 - Incident officially marked as resolved.

We estimate a downtime of approximately 45 minutes, from the first alerts at 20:42 until full recovery at 21:27.

Root cause

The outage was triggered by a logging update to our production clusters. This update introduced a new service that increased our CPU load beyond the configured maximum capacity. As a result, our isolate hypervisors could not be scheduled, and deployments failed.
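As a rough illustration of this failure mode, the sketch below models a cluster with a fixed CPU budget and a simple first-fit scheduler. It is not the actual scheduler: the `schedule` function, the workload names, and all CPU figures are hypothetical and chosen only to show how one additional service can leave too little CPU for a larger workload.

```typescript
// Hypothetical model: CPU is expressed in millicores, all numbers are made up.
interface Workload {
  name: string;
  cpuMillicores: number;
}

interface ScheduleResult {
  scheduled: string[];
  unschedulable: string[];
}

// Places workloads in order; a workload that does not fit in the
// remaining CPU budget is reported as unschedulable.
function schedule(capacityMillicores: number, workloads: Workload[]): ScheduleResult {
  let remaining = capacityMillicores;
  const result: ScheduleResult = { scheduled: [], unschedulable: [] };
  for (const w of workloads) {
    if (w.cpuMillicores <= remaining) {
      remaining -= w.cpuMillicores;
      result.scheduled.push(w.name);
    } else {
      result.unschedulable.push(w.name);
    }
  }
  return result;
}

const hypervisor: Workload = { name: "isolate-hypervisor", cpuMillicores: 3500 };
const loggingService: Workload = { name: "logging-service", cpuMillicores: 1000 };

// Before the update: the hypervisor fits within the 4000m budget.
const beforeUpdate = schedule(4000, [hypervisor]);

// After the update: the new service takes 1000m, leaving only 3000m,
// so the hypervisor can no longer be placed.
const afterUpdate = schedule(4000, [loggingService, hypervisor]);

console.log(beforeUpdate.unschedulable); // []
console.log(afterUpdate.unschedulable); // ["isolate-hypervisor"]
```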

During testing in a staging environment, this change was inadvertently bundled with another update, which masked the CPU limit problems and prevented pre-deployment detection.

What’s next?

In light of this incident and other recent, less critical issues, we recognize that our current “one-shot” deployment method is insufficient. We are planning to implement canary-style deployments, which will let us deploy changes to canary regions first before a general roll-out. This applies specifically to significant cluster-level changes that require lengthy rollout and rollback procedures.
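The general shape of a canary-style roll-out can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of our planned tooling: the `RolloutHooks` interface, the `canaryRollout` function, and the region names are all hypothetical, and the hooks are synchronous only to keep the example short (real deploy and health-check calls would be asynchronous).

```typescript
type Region = string;

// Hypothetical hooks; a real system would call deployment and monitoring APIs.
interface RolloutHooks {
  deploy(region: Region): void;
  isHealthy(region: Region): boolean;
  rollback(region: Region): void;
}

interface RolloutResult {
  ok: boolean;
  deployedRegions: Region[];
}

// Deploys to canary regions first. If any canary is unhealthy, rolls back
// every region touched so far and never reaches the remaining regions.
function canaryRollout(canaries: Region[], rest: Region[], hooks: RolloutHooks): RolloutResult {
  const touched: Region[] = [];
  for (const region of canaries) {
    hooks.deploy(region);
    touched.push(region);
    if (!hooks.isHealthy(region)) {
      for (const r of touched) hooks.rollback(r);
      return { ok: false, deployedRegions: [] };
    }
  }
  for (const region of rest) {
    hooks.deploy(region);
    touched.push(region);
  }
  return { ok: true, deployedRegions: touched };
}

// Usage: simulate a change that fails in the canary region.
const rolloutLog: string[] = [];
const rolloutResult = canaryRollout(["canary-region"], ["region-a", "region-b"], {
  deploy: (r) => { rolloutLog.push(`deploy ${r}`); },
  isHealthy: (r) => r !== "canary-region",
  rollback: (r) => { rolloutLog.push(`rollback ${r}`); },
});

console.log(rolloutResult.ok); // false
console.log(rolloutLog); // ["deploy canary-region", "rollback canary-region"]
```

In this sketch a faulty change is contained to the canary region, so the rollback touches one region instead of every cluster.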

Moving forward, we will enforce a policy of testing one change at a time in our staging environment to prevent masking potential issues.

Lastly, we are working on establishing clearer error budgets and internal service level objectives (SLOs) to guide our engineering teams' decisions about potentially risky changes.
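To show how an error budget relates to an outage like this one, here is a small worked example. The 99.9% availability target and the 30-day window are assumptions made purely for illustration, not our actual SLOs; `errorBudgetMinutes` is a hypothetical helper.

```typescript
// Returns the allowed downtime, in minutes, for a given availability
// target (e.g. 0.999) over a window measured in days.
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * (1 - sloTarget);
}

// Assumed target: 99.9% over 30 days, which allows about 43.2 minutes of downtime.
const budgetMinutes = errorBudgetMinutes(0.999, 30);

// The outage described in this report lasted approximately 45 minutes.
const outageMinutes = 45;

// A negative value means the budget for the window is exhausted.
const remainingMinutes = budgetMinutes - outageMinutes;

console.log(budgetMinutes.toFixed(1)); // "43.2"
console.log(remainingMinutes < 0); // true
```

Under these assumed numbers, a single 45-minute outage would use up the entire monthly budget, which is the kind of signal that would argue for holding back further risky changes.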

Have questions, suggestions or other thoughts? Feel free to drop us a line.