Degraded API Performance
Incident Report for Flow
Postmortem

At 2018-01-29 06:45PST, the Flow API began to experience severe performance degradation due to insufficient resources on the backing database. This caused a cascading failure as requests timed out and were retried, thereby exponentially increasing load on an already-stressed database.
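To illustrate how retries amplify load, here is a minimal sketch rather than a model of Flow's actual clients. It assumes every failed attempt is retried immediately, up to a fixed limit. The function name, the request rate, and the failure rates are hypothetical and were not measured during this incident.

```python
def offered_load(base_rps: float, failure_rate: float, max_retries: int) -> float:
    """Estimate requests per second reaching the database when failures are retried.

    Assumption (hypothetical): every failed attempt is retried immediately,
    up to max_retries times, and each attempt fails independently with
    probability failure_rate.
    """
    # Attempt k is only made if the previous k attempts all failed,
    # so it contributes base_rps * failure_rate**k to the total.
    return sum(base_rps * failure_rate ** k for k in range(max_retries + 1))


# Healthy database: no failures, so no extra load from retries.
healthy = offered_load(base_rps=100, failure_rate=0.0, max_retries=3)

# Saturated database: every attempt times out, so each request is sent 4 times.
saturated = offered_load(base_rps=100, failure_rate=1.0, max_retries=3)
```

In this sketch, the same 100 requests per second from users turn into 400 requests per second at the database once every attempt times out. The extra load in turn causes more timeouts, which is the feedback loop that makes such a failure cascade.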

Our monitoring and alerting notified our team of the issue at 06:52PST.

This cascading failure resulted in a total of 12 minutes of downtime, split across two windows: 07:02PST - 07:08PST and 07:11PST - 07:17PST.

The root cause of this incident was quickly identified by our team as being related to a change made during the maintenance window on 2018-01-26. A critical piece of configuration was missed, which meant only half our database resources were put into effect.

At 07:14PST, our team applied the missing configuration. By 07:30PST, new requests began making their way through, and the backpressure began to subside. By 08:00PST, performance had returned to an acceptable mark.

Following this incident, we have resolved to strengthen our policies and procedures surrounding maintenance events, ensuring strict validation of all changes before the conclusion of the event. Additionally, work is already underway that will allow us to put infrastructure changes under code-review scrutiny days or weeks prior to applying them, giving us greater confidence and the ability to test changes before a maintenance window even begins.
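As a rough sketch of what validating changes before closing a maintenance window could look like, the check below compares the intended configuration with what is actually in effect and reports anything that was not applied. It is an illustration only. The function and the setting names (`instance_count`, `pool_size`) are hypothetical and do not describe Flow's real tooling or configuration.

```python
_MISSING = object()  # sentinel so a missing key never compares equal to a real value


def find_unapplied_settings(expected: dict, applied: dict) -> dict:
    """Return the expected settings that are absent from, or differ in, the applied config."""
    return {
        key: want
        for key, want in expected.items()
        if applied.get(key, _MISSING) != want
    }


# Hypothetical example: the change plan calls for two database instances,
# but only one is in effect after the maintenance window.
expected = {"instance_count": 2, "pool_size": 100}
applied = {"instance_count": 1, "pool_size": 100}

unapplied = find_unapplied_settings(expected, applied)
if unapplied:
    print("Maintenance is not complete; unapplied settings:", unapplied)
```

A check of this kind would be run as the last step of the maintenance event, so that the window only closes once the result is empty.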

Posted Jan 29, 2018 - 14:27 PST

Resolved
Performance has returned to normal levels. We apologize for the interruption in your morning; we are performing a full investigation and will publish a post-mortem report soon.
Posted Jan 29, 2018 - 08:07 PST
Monitoring
A mitigation has been deployed and load pressure is decreasing. We are continuing to monitor closely until performance returns to acceptable levels.
Posted Jan 29, 2018 - 07:32 PST
Identified
Monday morning load has surfaced a misconfiguration which has had a severe performance impact on the Flow API. Our Ops team has been notified and is working to mitigate the issue. This will cause all Flow apps (Desktop, Web, iOS, and Android) to behave slowly or fail to load.
Posted Jan 29, 2018 - 07:29 PST