Postmortem: Cloud Runs Incident QUOTA_EXCEEDED (4:40 AM - 7:20 AM) PT

Summary :rotating_light:

On January 21st, between 4:40 AM PT and 5:20 AM PT, our users experienced an issue where all runs were getting “QUOTA_EXCEEDED” errors.

Impact:

  • All flows that ran between 4:40 AM PT and 5:20 AM PT failed.
  • The flows that failed were also turned off, and stayed off until 7:20 AM PT, when they were turned back on.

How it Happened

We are currently in the middle of rolling out a new billing system, and a hotfix for an issue one user was facing was rushed and deployed.

How We Solved It

We noticed the mistake and rolled back the change 20 minutes later.
The issue was then escalated, and our engineering team turned all the affected flows back on. Everything is now back to normal.

What we learned:

  • Add more tests to our test suite for billing.
  • Runs that get “QUOTA_EXCEEDED” errors should have their status set to “Waiting” instead of “Error” for a certain amount of time (e.g. 3 days).
  • Stop turning off flows when a run has a “QUOTA_EXCEEDED” error.
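
To make the last two points concrete, here is a rough sketch of how that handling could look. It is not our actual implementation: the names `RunStatus`, `RunOutcome`, `resolve_run_outcome` and `QUOTA_GRACE_PERIOD` are made up for illustration, and the 3-day window is just the example value from the list above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional

# Hypothetical grace period; 3 days is only the example value from the post.
QUOTA_GRACE_PERIOD = timedelta(days=3)


class RunStatus(Enum):
    SUCCEEDED = "Succeeded"
    WAITING = "Waiting"
    ERROR = "Error"


@dataclass
class RunOutcome:
    status: RunStatus
    retry_until: Optional[datetime]  # deadline while the run is kept waiting
    disable_flow: bool               # whether the flow should be turned off


def resolve_run_outcome(
    error_code: Optional[str],
    first_failed_at: datetime,
    now: datetime,
) -> RunOutcome:
    """Decide the run status and whether to turn off the flow (illustrative only)."""
    if error_code is None:
        return RunOutcome(RunStatus.SUCCEEDED, None, False)

    if error_code == "QUOTA_EXCEEDED":
        deadline = first_failed_at + QUOTA_GRACE_PERIOD
        if now < deadline:
            # Inside the grace period: keep the run waiting, leave the flow on.
            return RunOutcome(RunStatus.WAITING, deadline, False)
        # Grace period is over: mark the run as failed, but still leave the flow on.
        return RunOutcome(RunStatus.ERROR, None, False)

    # Other errors: the post does not describe a policy for them, so this
    # sketch simply reports an error and leaves the flow on.
    return RunOutcome(RunStatus.ERROR, None, False)
```

With something like this, a quota error would leave the run in “Waiting” with a retry deadline and the flow would stay on, so a mistake on our side would not require turning flows back on by hand.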

We sincerely apologize to all of our users for the issues this caused. We have learnt a great deal from them, and we are always striving to exceed the high expectations our great community has for us. We will improve and make the system more reliable.

Best Regards,
Abdul.

Awesome post. Thanks for the transparency. Love AP!

Thank you for the transparency and the explanation of the root cause analysis.
