Hi Everyone,
Impact
From around 3 AM UTC until 11 AM on the 5th of March, webhook triggers were not starting flows.
How it Happened
As many of you noticed, the webhook triggers were auto-upgraded from something that was hardcoded to a piece, like other pieces. We did that because pieces have versioning, which means we can start introducing many of the features requested for the webhook piece without breaking old behavior.
In the migration, we supplied the relative version ~0.0.1 instead of the absolute version 0.0.1. This worked well in draft mode, which is what our end-to-end test covers. However, it broke the published flow versions without our knowledge.
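To make the difference between the two version strings concrete, here is a minimal sketch assuming npm-style semver semantics, where ~0.0.1 is a range (at least 0.0.1 and below 0.1.0) and 0.0.1 is one pinned version. The helper names (`parseVersion`, `isExactVersion`, `satisfies`) are hypothetical and are not taken from our codebase; how the platform itself resolves versions is not shown here.

```typescript
// Hypothetical helpers illustrating npm-style tilde ranges vs. pinned versions.
interface Version {
  major: number;
  minor: number;
  patch: number;
}

const EXACT = /^(\d+)\.(\d+)\.(\d+)$/;

// True when the specifier is a single pinned version such as "0.0.1".
function isExactVersion(spec: string): boolean {
  return EXACT.test(spec);
}

function parseVersion(v: string): Version {
  const m = EXACT.exec(v);
  if (!m) {
    throw new Error(`not an exact x.y.z version: ${v}`);
  }
  return { major: Number(m[1]), minor: Number(m[2]), patch: Number(m[3]) };
}

function compareVersions(a: Version, b: Version): number {
  if (a.major !== b.major) return a.major - b.major;
  if (a.minor !== b.minor) return a.minor - b.minor;
  return a.patch - b.patch;
}

// Exact specifiers match one version; "~x.y.z" accepts >= x.y.z and < x.(y+1).0.
function satisfies(candidate: string, spec: string): boolean {
  const c = parseVersion(candidate);
  if (isExactVersion(spec)) {
    return compareVersions(c, parseVersion(spec)) === 0;
  }
  if (spec.startsWith("~")) {
    const base = parseVersion(spec.slice(1));
    const upper: Version = { major: base.major, minor: base.minor + 1, patch: 0 };
    return compareVersions(c, base) >= 0 && compareVersions(c, upper) < 0;
  }
  throw new Error(`unsupported specifier: ${spec}`);
}
```

With these helpers, `satisfies("0.0.2", "~0.0.1")` is true while `satisfies("0.0.2", "0.0.1")` is false, and `isExactVersion("~0.0.1")` is false. A check like `isExactVersion` in a migration is one possible way to reject a range where a pinned version is expected.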
We had done this migration a while ago with a schedule, and it went successfully. Because of that, and since there were only minor changes in the code, we were confident that this migration would be smooth.
After releasing, we checked all monitoring and tests, and they were fine: the end-to-end test creates a new flow, and the issue affected only older published flows, not new ones. The queue monitoring was working as expected.
Solving
Once the issues were posted in the cloud at MAJOR bug with Webhook module - #5 by abuaboud and noticed by one of the team, they were escalated quickly and fixed within the next 20 minutes.
We migrated everything again to 0.0.1 this morning, and everything is back to normal.
What we learned:
- Double down on the idea that locked flow versions shouldn’t be edited, and that when they must be, it should only be done gradually.
- We had a blind spot in the code area between receiving webhooks and adding them to the queues. We need to set up some visibility for us or the user on events that are being processed.
- Data migration should be done during the day when most of the team is active.
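As a rough illustration of the visibility mentioned in the second point, the sketch below counts webhooks received versus webhooks added to the queue per flow, so a gap between the two becomes visible. It is a minimal in-memory example under assumptions: the `WebhookFunnel` class and its method names are hypothetical, and a real setup would more likely report to a metrics or logging system.

```typescript
// Hypothetical in-memory tracker; not taken from the actual codebase.
class WebhookFunnel {
  private received = new Map<string, number>();
  private enqueued = new Map<string, number>();

  // Call when a webhook request arrives for a flow.
  recordReceived(flowId: string): void {
    this.received.set(flowId, (this.received.get(flowId) ?? 0) + 1);
  }

  // Call once the corresponding job has been added to the queue.
  recordEnqueued(flowId: string): void {
    this.enqueued.set(flowId, (this.enqueued.get(flowId) ?? 0) + 1);
  }

  // Flows where some received webhooks never reached the queue,
  // mapped to the number of missing events.
  dropped(): Map<string, number> {
    const result = new Map<string, number>();
    this.received.forEach((count, flowId) => {
      const missing = count - (this.enqueued.get(flowId) ?? 0);
      if (missing > 0) {
        result.set(flowId, missing);
      }
    });
    return result;
  }
}
```

A periodic check that alerts whenever `dropped()` is non-empty would flag the kind of gap we had here, where the queue itself looked healthy because nothing was reaching it.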
I apologize for these issues. We have learned from them and we appreciate the trust you have in us to make the system reliable. We will improve.
Regards, Mo.