[10-09-2025] Cloud Infrastructure Issues

:rotating_light: Dear Community,

Over the past week, you may have noticed a drop in the overall reliability of our cloud services. Some of the issues you might have experienced include:

  • :globe_with_meridians: Website going down at the beginning of each hour for about a minute
  • :pause_button: Flows getting stuck in “Paused” state
  • :memo: Missing logs in rare runs
  • :turtle: Slowness in publishing or testing
  • :envelope_with_arrow: Rare webhook loss if a worker crashes (e.g., memory issue on noisy flows)

:mag_right: Root Cause

Activepieces has been growing rapidly in the number of executions happening in the cloud. Today, we handle hundreds of millions of executions — polling new data, executing flows, processing webhooks, and handling third-party events. This scale introduces corner cases and puts extra stress on our app servers.

One example: as the number of self-hosted instances grew, they all started hitting the piece sync endpoint in the cloud at the beginning of each hour. Even though this endpoint is fully optimized and served from memory, the huge spike still caused downtime: we were sustaining over 2,000 sync requests per second for around a minute before the server crashed.
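
For anyone curious what a fix for this kind of thundering herd can look like: a common mitigation is per-instance jitter, so syncs spread across the hour instead of all landing at minute zero. Here is a minimal TypeScript sketch of the idea (illustrative only, not the exact change we shipped; the names are not Activepieces APIs):

```ts
// Illustrative sketch: spread hourly piece syncs across the hour
// instead of firing them all at minute zero. Hypothetical names,
// not Activepieces APIs.
const HOUR_MS = 60 * 60 * 1000;

function scheduleHourlySyncWithJitter(sync: () => Promise<void>): void {
  // Pick a random offset within the hour, fixed per instance, so a
  // fleet of instances spreads its requests instead of spiking at :00.
  const jitterMs = Math.floor(Math.random() * HOUR_MS);

  setTimeout(() => {
    void sync();                             // first sync at the offset
    setInterval(() => void sync(), HOUR_MS); // then once per hour
  }, jitterMs);
}
```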

Meanwhile, our team has been constantly firefighting issues while also delivering new features. This delayed deeper infrastructure improvements until outages began to significantly impact the product.


:zap: Our Response

Our team has been working around the clock to stabilize the situation. Here’s what we’ve already delivered:

  • :arrows_counterclockwise: Scaled up infrastructure: We doubled the number of our worker servers.
  • :gear: Optimized app servers: Introduced a process manager so Node.js can fully utilize all CPU cores — making each server up to 4x more efficient on the cloud (first sketch below).
  • :rocket: Rebuilt workers: In record time, we rewrote the worker system to be lighter and more efficient. The app now pushes jobs directly to workers with fewer network calls, instead of workers constantly polling the app servers for updates — reducing failures and preventing jobs from being retried unnecessarily (second sketch below).
  • :lady_beetle: Fixed rare pause-flow bug: In some cases, when a flow’s pause spanned two different hours, the log link to S3 would break because the S3 path was recalculated from the current hour each time it was needed. This issue has been resolved (third sketch below).
  • :stopwatch: Fixed rare short-delay bug: When a delay in a flow was too short, the “resume job” could trigger instantly, before the first job had finished. This caused the resume job to fail because the original job had not yet completed. We’ve fixed this edge case (fourth sketch below).
  • :hammer_and_wrench: Refactoring & small fixes: Many tiny improvements and refactors across the system to make it more stable and reliable.
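
On the process-manager change: conceptually it is the same idea as Node’s built-in cluster module, which forks one process per CPU core so a single event loop is no longer the bottleneck. A minimal sketch of that idea (not our exact configuration; we use a process manager in production):

```ts
// Conceptual sketch: one Node.js process per CPU core. This uses
// Node's built-in cluster module to show the idea; our production
// setup relies on a process manager instead.
import cluster from 'node:cluster';
import { availableParallelism } from 'node:os';
import http from 'node:http';

if (cluster.isPrimary) {
  // Fork one worker per core so the whole machine serves traffic.
  for (let i = 0; i < availableParallelism(); i++) {
    cluster.fork();
  }
  // Replace any worker that crashes.
  cluster.on('exit', () => cluster.fork());
} else {
  // Each forked process runs its own server on a shared port.
  http.createServer((_req, res) => res.end('ok')).listen(3000);
}
```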
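
On the rebuilt workers: the core change is moving from pull to push. Instead of every worker repeatedly asking the app “any new jobs?”, the app delivers each job to a connected worker the moment it is queued. A toy dispatcher showing the shape of the push model (all names are illustrative, not our real worker protocol):

```ts
// Toy push-model dispatcher. Hypothetical types and names,
// not the actual Activepieces worker protocol.
type Job = { id: string; payload: unknown };

class JobDispatcher {
  private workers: Array<(job: Job) => void> = [];
  private next = 0;

  // A worker registers a delivery callback once, when it connects.
  registerWorker(deliver: (job: Job) => void): void {
    this.workers.push(deliver);
  }

  // One outbound call per job (round-robin across workers), versus
  // N workers each polling the app every few seconds.
  dispatch(job: Job): void {
    if (this.workers.length === 0) throw new Error('no workers connected');
    this.workers[this.next++ % this.workers.length](job);
  }
}
```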
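
On the pause-flow bug: the essence of the fix is to derive the S3 log path once, when the run pauses, and persist it, rather than recomputing it from the current hour on every access. A simplified sketch with hypothetical names (the real key layout differs):

```ts
// Simplified sketch of the pause-flow log bug. Hypothetical names;
// the real S3 key layout differs.
function logKeyFor(runId: string, at: Date): string {
  const hourBucket = at.toISOString().slice(0, 13); // e.g. "2025-09-10T14"
  return `logs/${hourBucket}/${runId}.json`;
}

// Buggy pattern: the key is recomputed on every access, so a run that
// pauses at 13:59 and resumes at 14:01 points at two different keys.
const brokenKey = (runId: string) => logKeyFor(runId, new Date());

// Fixed pattern: compute the key once at pause time and store it
// alongside the run, so every later access uses the same key.
const pausedRun = { id: 'run-123', logKey: logKeyFor('run-123', new Date()) };
```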
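
On the short-delay bug: the race happens when the resume job fires before the original job has finished persisting its paused state. One illustrative way to close that window is to enqueue the resume job only after the paused state is committed (not necessarily the exact fix we shipped; names are hypothetical):

```ts
// Illustrative ordering fix for the short-delay race. Hypothetical
// function parameters; not necessarily the exact change we shipped.
type RunState = 'RUNNING' | 'PAUSED' | 'DONE';

async function pauseAndScheduleResume(
  persistState: (state: RunState) => Promise<void>,
  enqueueResume: (delayMs: number) => Promise<void>,
  delayMs: number,
): Promise<void> {
  // Commit the paused state FIRST, then schedule the resume job, so
  // even a near-zero delay cannot observe the run mid-flight.
  await persistState('PAUSED');
  await enqueueResume(delayMs);
}
```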

:zap: Impacted Flows

We ran a full infrastructure scan to restore missing logs and resume paused flows. This process took over 24 hours due to the large amount of data in S3 and the database, but everything is now back online and all paused flows have resumed.


:crystal_ball: Future

We know these issues have been frustrating, and we truly appreciate your patience as we work to improve. Here’s what’s coming next:

  • :woman_technologist: Dedicated Reliability Team: We’ve created a new sub-team focused solely on infrastructure and reliability. This team will triple in size this year — we just need a little time to hire the right people.
  • :test_tube: Stronger Testing: We’re adding more automated tests to prevent regressions and ensure fixes stay solid. To speed this up, we’ll invite the community to help through paid open-source bounties for writing end-to-end (E2E) tests for Activepieces.

:heart: We deeply appreciate you staying with us through this journey — thank you for your trust, patience, and understanding!


Thanks @abuaboud!

Glad to see AP is growing and things are working as normal again.


Would providing self-hosters with the ability to turn off auto-sync and disable specific pieces reduce the overhead on your systems?