What happened and why
Earlier today our EU1 infrastructure suffered a total outage lasting 50 minutes. The root cause was an AWS EFS filesystem whose burst credit balance had reached zero and which was too small for its baseline throughput to sustain our workload.
Because the files we process vary so much in size, from kilobyte documents to gigabyte archives, we use Amazon’s Elastic File System (EFS) as an encrypted buffer for content while it is being processed. This had worked well for us thus far, but today we encountered a failure mode that we were not expecting. In essence, we did not have the right monitoring in place to identify that our usage pattern in that region was depleting burst credits faster than they could be refilled. Once the credits ran out, the throughput available without burst capacity (what AWS calls baseline aggregate throughput) was not enough to sustain our content analysis service in that region, causing it to fail. You can learn more about these EFS metrics here.
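To make the failure mode above concrete, here is a rough back-of-the-envelope sketch of how a small filesystem can run out of burst credits. It is not our production monitoring. The function names are hypothetical, and the baseline rate of 50 KiB/s per GiB stored is an assumption based on AWS's published figures for bursting mode, so check the current EFS documentation before relying on it.

```python
from typing import Optional

# Assumption: in bursting mode, baseline throughput (and credit accrual)
# scales with stored data at roughly 50 KiB/s per GiB. Verify against AWS docs.
BASELINE_BYTES_PER_SEC_PER_GIB = 50 * 1024


def baseline_throughput(stored_gib: float) -> float:
    """Baseline throughput in bytes/s for a filesystem holding stored_gib GiB."""
    return stored_gib * BASELINE_BYTES_PER_SEC_PER_GIB


def seconds_until_depleted(
    credit_balance_bytes: float,
    stored_gib: float,
    usage_bytes_per_sec: float,
) -> Optional[float]:
    """Estimate how long the burst credit balance lasts at a sustained usage rate.

    Returns None when usage is at or below baseline, since credits are then
    not being drained on net.
    """
    net_drain = usage_bytes_per_sec - baseline_throughput(stored_gib)
    if net_drain <= 0:
        return None
    return credit_balance_bytes / net_drain
```

In practice the credit balance would come from the CloudWatch metric `BurstCreditBalance` in the `AWS/EFS` namespace. Alerting when the estimate drops below a chosen threshold is one way to catch this pattern before throughput falls back to baseline.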
After our on-call engineer was paged and the issue triaged, we contacted AWS support and proceeded to replace the problematic EFS filesystem.
What we are doing to prevent this from happening again
Onwards,
The scanii.com engineering team