Postmortem on 2022-03-29 Outage
Last night (EDT), Scanii.com suffered an outage of our US1 region that took down api-us1.scanii.com, api.scanii.com and our management UI at https://www.scanii.com. The outage lasted 1 hour and 13 minutes.
The outage was caused by a mistake in a Terraform infrastructure-as-code file that caused our services in the AWS us-east-1 region to be replaced unnecessarily. Unfortunately, once they were destroyed, recreating these services failed because of an inconsistency between Terraform’s view of the infrastructure and how the infrastructure actually existed in that region.
Once the cause of the failed redeploy was identified, the team was able to correct the issue, complete the redeploy and bring our services back online.
What we’re doing to prevent this from happening again
We’re revamping our process for Terraform changes to minimize reliance on an engineer spotting a potentially dangerous resource destruction. That approach was effective when we had hundreds of resources, but it is impractical at the size of our infrastructure today, with thousands of resources across multiple regions and 2 cloud providers.
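To illustrate the kind of automated check that can replace a manual scan for destructive changes, here is a minimal sketch in Python. It is not our actual tooling. It assumes the plan has been exported with `terraform show -json` and loaded into a dictionary, and the function name `find_destructive_changes` is hypothetical. It relies on the `resource_changes` list in Terraform's JSON plan output, where each entry carries an `address` and a `change.actions` list.

```python
from typing import List, Tuple


def find_destructive_changes(plan: dict) -> List[Tuple[str, str]]:
    """Return (address, kind) for every resource the plan would destroy or replace.

    `plan` is the parsed JSON produced by `terraform show -json <planfile>`.
    A replacement shows up as both "delete" and "create" in the actions list;
    a plain destroy shows up as "delete" alone.
    """
    flagged = []
    for resource_change in plan.get("resource_changes", []):
        actions = resource_change.get("change", {}).get("actions", [])
        if "delete" in actions:
            kind = "replace" if "create" in actions else "destroy"
            flagged.append((resource_change["address"], kind))
    return flagged
```

A pipeline step could load the exported plan with `json.load`, call this function, and stop the apply for explicit sign-off whenever the returned list is not empty, so that a replacement in a plan with thousands of resources cannot go unnoticed.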
As always, we apologize to all of our customers for the trouble this incident may have caused. If you have any questions or concerns, please don’t hesitate to reach out to us at firstname.lastname@example.org