Degraded performance in EU1 region

Incident Report for Scanii.com

Postmortem

What happened and why

Earlier today our EU1 infrastructure suffered a total outage lasting 50 minutes. The root cause was an AWS EFS filesystem whose burst credit balance had reached 0 and which was too small for its baseline throughput alone to sustain our workload.

Because the files we process vary so much in size, from kilobyte documents to gigabyte archive files, we use Amazon’s Elastic File System (EFS) as an encrypted buffer for content while it is being processed. This had worked well for us until today, when we hit a failure mode we were not expecting. In essence, we did not have the right monitoring in place to identify that our usage pattern in that region was depleting burst credits faster than they could be refilled. On top of that, the throughput available without burst capacity (what AWS calls baseline aggregate throughput) was not enough in that region to sustain our content analysis service, causing it to fail. The AWS EFS documentation describes these metrics in more detail.
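
To make the failure mode concrete, here is a minimal sketch of the burst credit arithmetic. It assumes a simplified model in which credits are earned at the baseline rate and spent at the actual throughput rate. The function name and all numbers are hypothetical illustrations, not figures from our EU1 filesystem.

```python
from typing import Optional

MIB = 1024 * 1024  # bytes in one mebibyte


def seconds_until_burst_depleted(
    credit_balance_bytes: float,
    baseline_bytes_per_s: float,
    usage_bytes_per_s: float,
) -> Optional[float]:
    """Estimate how long a burst credit balance lasts under steady load.

    Simplified model (an assumption): credits accrue at the baseline rate
    and are consumed at the actual throughput rate, so the balance only
    shrinks while usage exceeds baseline.
    Returns None when the balance is not shrinking.
    """
    net_drain = usage_bytes_per_s - baseline_bytes_per_s
    if net_drain <= 0:
        return None  # usage at or below baseline: credits hold or refill
    return credit_balance_bytes / net_drain


# Hypothetical example: 36,000 MiB of credits, 1 MiB/s baseline, 11 MiB/s usage.
eta = seconds_until_burst_depleted(36_000 * MIB, 1 * MIB, 11 * MIB)
print(eta)
```

Once the balance reaches zero in this model, throughput is capped at the baseline rate, which is the point at which our content analysis service could no longer keep up.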

After our on-call engineer was paged and the issue triaged, we contacted AWS support and proceeded to replace the problematic EFS filesystem.

What we are doing to prevent this from happening again

We are putting in place new monitoring to identify when an EFS filesystem is rapidly depleting burst credits
We are auditing and tuning the baseline aggregate throughput for all regions
We will investigate adding the ability for our global API endpoint, api.scanii.com, to automatically re-route traffic when it detects a failing region
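
As an illustration of the first item, the check below sketches one way to flag a rapidly depleting balance. It assumes we already have a series of (timestamp, balance) samples, for example from the CloudWatch `BurstCreditBalance` metric that EFS publishes. Fetching the samples is left out, and the function names and the six-hour horizon are hypothetical choices, not our production configuration.

```python
from typing import Optional, Sequence, Tuple

# (unix timestamp in seconds, burst credit balance in bytes)
Sample = Tuple[float, float]


def projected_seconds_to_zero(samples: Sequence[Sample]) -> Optional[float]:
    """Project when the balance reaches zero using the first and last sample.

    Returns None when there is too little data or the balance is not falling.
    """
    if len(samples) < 2:
        return None
    (t0, b0), (t1, b1) = samples[0], samples[-1]
    if t1 <= t0:
        return None  # samples must span a positive time window
    slope = (b1 - b0) / (t1 - t0)  # bytes of credit gained per second
    if slope >= 0:
        return None  # balance is flat or refilling
    return b1 / -slope


def should_alert(samples: Sequence[Sample], horizon_s: float = 6 * 3600) -> bool:
    """Alert when the projected depletion falls within the horizon."""
    eta = projected_seconds_to_zero(samples)
    return eta is not None and eta <= horizon_s
```

Alerting on the projected time to depletion, rather than on a fixed balance threshold, is a design choice in this sketch: it reacts to how fast credits are being spent, which is the signal we were missing.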

Onwards,

The scanii.com engineering team

Posted Jul 08, 2020 - 09:51 EDT

Resolved

This incident has been resolved.

Posted Jul 08, 2020 - 08:10 EDT

Monitoring

We have a fix in place and we're now monitoring the recovery.

Posted Jul 08, 2020 - 08:01 EDT

Update

An AWS service we depend upon is having issues; we're in contact with AWS to resolve it.

Posted Jul 08, 2020 - 07:19 EDT

Identified

The issue has been identified and a fix is being implemented.

Posted Jul 08, 2020 - 07:12 EDT

This incident affected: API Endpoints (api-eu1.scanii.com).