Beginning at 06:45 PST on 2018-01-29, the Flow API experienced severe performance degradation due to insufficient resources on the backing database. This turned into a cascading failure: as requests timed out, clients retried them, multiplying load on the already-stressed database.
Our monitoring and alerting notified our team of the issue at 06:52 PST.
This cascading failure resulted in a total of 12 minutes of downtime, occurring between 07:02 and 07:08 PST and again between 07:11 and 07:17 PST.
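A common mitigation for this retry-amplification pattern is exponential backoff with jitter, so that timed-out clients spread their retries out rather than hammering a recovering service in lockstep. This is a minimal illustrative sketch, not a description of how Flow API clients actually behave:

```python
import random

def backoff_delays(base=0.5, cap=30.0, attempts=6):
    """Exponential backoff with full jitter: each retry waits a random
    amount of time up to base * 2**attempt (capped), so synchronized
    clients desynchronize instead of retrying at the same instant."""
    return [random.uniform(0, min(cap, base * 2 ** a)) for a in range(attempts)]

# Without jitter, every timed-out client retries at the same moment,
# compounding load on an already-stressed backend.
print(backoff_delays())
```

The cap matters as much as the jitter: it bounds how long a client can back off, keeping recovery latency predictable once the backend is healthy again.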
Our team quickly identified the root cause as a change made during the maintenance window on 2018-01-26: a critical piece of configuration was missed, leaving only half of our database resources in effect.
At 07:14 PST, our team applied the missing configuration. By 07:30 PST, new requests began making their way through and the backpressure started to subside. By 08:00 PST, performance had returned to acceptable levels.
Following this incident, we have resolved to strengthen our policies and procedures around maintenance events, requiring strict validation of all changes before a maintenance event is concluded. Additionally, work is already underway to put infrastructure changes under code-review scrutiny days or weeks before they are applied, giving us greater confidence and the ability to test changes before a maintenance window even begins.
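The kind of end-of-maintenance validation described above can be as simple as an automated check that the live topology matches the maintenance plan. The function and node names below are hypothetical, purely to illustrate the idea:

```python
def validate_capacity(expected_nodes, active_nodes):
    """Fail loudly if the set of active database nodes does not include
    everything the maintenance plan says should be in service.
    (Hypothetical check; node names are illustrative only.)"""
    missing = set(expected_nodes) - set(active_nodes)
    if missing:
        raise RuntimeError(f"maintenance incomplete: {sorted(missing)} not active")
    return True
```

Running a check like this before declaring the window closed would have caught a configuration that left half the fleet out of service.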