Summary (TL;DR)
On November 25th, between 3:50 AM PT and 4:40 AM PT, the website experienced repeated outages, making it inaccessible and causing webhooks to fail.
Issue Timeline
All timestamps are in Pacific Time (PT):
- 3:50 AM: Cloudflare sent an alert.
- 4:00 AM: On-call engineers were notified and began investigating.
- The database was suspected to be the cause, so it was scaled up.
- Web servers showed normal CPU and RAM usage.
- A spike of concurrent flow activity occurred after the initial failure, causing load to pile up.
- All queues were paused manually to reduce stress on HTTP servers.
- Queues were gradually re-enabled.
- By 4:30 AM, the issue was resolved, but monitoring continued.
- For several hours after the initial recovery, partial failures continued (one server would fail, but others took over); fortunately, there was no significant user impact.
Investigation
In recent weeks, usage on Activepieces has grown rapidly, leading to partial failures where one server would crash and restart while others handled the load.
Today’s Issue
The root cause was tricky to identify because machine metrics (CPU, RAM) appeared healthy:
- Node.js Memory Limits: Node.js has a default heap allocation limit (4-8 GB), while our servers have 64-128 GB of available RAM.
- During sudden spikes with many files being sent, those files were temporarily buffered in RAM for upload, pushing the process past the memory limit and causing crashes.
- We’ve updated NODE_OPTIONS to override the default memory settings.
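For illustration, a minimal sketch of raising the heap ceiling via NODE_OPTIONS (the 32 GB value here is an assumption, not our exact setting; it should match the RAM actually available to each container):

```sh
# .env or docker-compose environment — illustrative value only.
# --max-old-space-size is in megabytes; 32768 MB = 32 GB.
NODE_OPTIONS=--max-old-space-size=32768
```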
Two other root causes were identified this week (no user impact):
- Background job failures:
  - Jobs that delete old logs were failing because of a foreign key relationship with another table containing millions of records.
  - The referencing column was not indexed, forcing a sequential scan and spiking database CPU (see the index sketch after this list).
  - This issue was fixed last Friday.
- Lack of error handling in certain WebSocket functions:
  - Specific actions, like testing steps or using AI (via WebSocket), triggered uncaught errors that crashed servers (see the handler sketch after this list).
  - Auto-restart policies recovered the servers quickly, and the underlying issue was also fixed last Friday.
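As a rough sketch of the index fix (table and column names are hypothetical, not the actual Activepieces schema), adding an index on the referencing foreign-key column lets the cleanup job avoid a sequential scan:

```sql
-- Hypothetical names for illustration only.
-- CONCURRENTLY avoids blocking writes while the index is built (PostgreSQL);
-- note it cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_flow_run_log_file_id
    ON flow_run (log_file_id);
```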
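The WebSocket fix is conceptually a guard around each event handler so one failing action cannot take down the whole process. A minimal sketch, assuming a socket.io-style server; the event and handler names are made up for illustration:

```typescript
import { Server, Socket } from 'socket.io'

const io = new Server(3001)

// Wrap a handler so an uncaught error is logged and reported back to the
// client instead of crashing the server process.
const safeHandler = <T>(socket: Socket, handler: (data: T) => Promise<void>) => {
    return async (data: T) => {
        try {
            await handler(data)
        } catch (error) {
            console.error('websocket handler failed', error)
            socket.emit('operation-error', { message: 'Something went wrong, please retry.' })
        }
    }
}

io.on('connection', (socket) => {
    // 'test-step' is a hypothetical event name used for illustration.
    socket.on('test-step', safeHandler(socket, async (request) => {
        // ... run the step test and emit the result back ...
    }))
})
```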
Lessons Learned / Actions Taken
- Regularly review and optimize database indexes for high-usage tables by analyzing the 10 slowest queries (see the example query after this list).
- Implement high-level error handling in WebSockets to prevent crashes from propagating.
- Invest in better observability tools to gain deeper insights into platform behavior (community thread on stats feature).
- Update the engine to bypass HTTP servers for file uploads by using pre-signed S3 URLs so files go directly to storage (see the sketch after this list).
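For the slow-query review, one way to surface the top 10 offenders, assuming PostgreSQL with the pg_stat_statements extension enabled (column names follow PostgreSQL 13+):

```sql
-- Top 10 statements by average execution time.
SELECT query,
       calls,
       mean_exec_time,
       total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```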
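And a rough sketch of the planned direct-upload path using pre-signed S3 URLs (the bucket name, key layout, and expiry below are placeholders, not the final design):

```typescript
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3'
import { getSignedUrl } from '@aws-sdk/s3-request-presigner'

const s3 = new S3Client({ region: 'us-east-1' })

// Instead of streaming file bytes through the HTTP servers (and their RAM),
// hand the client a short-lived URL and let it upload straight to S3.
export const createUploadUrl = async (fileId: string): Promise<string> => {
    const command = new PutObjectCommand({
        Bucket: 'activepieces-files',   // placeholder bucket name
        Key: `uploads/${fileId}`,       // placeholder key layout
    })
    // URL is valid for 15 minutes; the servers never touch the file contents.
    return getSignedUrl(s3, command, { expiresIn: 900 })
}
```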
Conclusion
We understand how critical Activepieces is to our users and sincerely apologize for any inconvenience caused.
Activepieces will continue to grow stronger and more resilient with every challenge we face. Thank you for your patience and trust!