Between 2025-05-14 14:00 UTC and 2025-05-15 10:00 UTC, Scanii’s administrative portal (www.scanii.com) experienced an outage lasting approximately 20 hours. The downtime was caused by a misconfiguration in our Terraform infrastructure code, which unintentionally triggered the replacement of critical services in the AWS us-east-1 region.
This disruption should have been promptly caught by our monitoring systems. However, two separate issues prevented timely detection and alerting, compounding the impact of the incident.
1 Terraform Misconfiguration:
A change in our infrastructure-as-code repository inadvertently marked several key services for replacement. This change was not caught during our usual review processes.
2 Monitoring Failures:
Azure Monitor: A prior configuration update — made in response to the retirement of an Azure availability feature — changed the threshold logic for incident creation. This update set an unattainable condition, effectively disabling alerts from Azure's monitoring system.
CloudWatch Integration: An unrelated change to improve internal alerting (e.g., backup system monitoring) broke the CloudWatch → PagerDuty integration, preventing critical alarms from notifying our on-call team.
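To illustrate the Azure Monitor failure mode described above, the following is a minimal sketch of how an unattainable alert condition could be detected mechanically. It is not our actual tooling: the `ThresholdRule` structure, its fields, and the rule names are hypothetical, and the sketch assumes each rule compares a metric with a known minimum and maximum against a fixed threshold.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ThresholdRule:
    """Hypothetical alert rule: fires when `metric <operator> threshold`."""
    name: str
    operator: str      # one of ">", ">=", "<", "<="
    threshold: float
    min_value: float   # smallest value the metric can ever report
    max_value: float   # largest value the metric can ever report


def is_attainable(rule: ThresholdRule) -> bool:
    """Return True if some value in [min_value, max_value] can fire the rule."""
    if rule.operator == ">":
        return rule.max_value > rule.threshold
    if rule.operator == ">=":
        return rule.max_value >= rule.threshold
    if rule.operator == "<":
        return rule.min_value < rule.threshold
    if rule.operator == "<=":
        return rule.min_value <= rule.threshold
    raise ValueError(f"unsupported operator: {rule.operator!r}")


def unattainable_rules(rules: Iterable[ThresholdRule]) -> List[str]:
    """Return the names of rules that can never fire."""
    return [rule.name for rule in rules if not is_attainable(rule)]
```

For example, if a probe runs from 5 locations, a rule that requires 6 or more failed locations can never fire, so a check like this would list it as unattainable.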
To prevent recurrence, we are taking the following actions:
1 Pre-Production Alerting Tests:
We will implement automated, routine alert testing in our pre-production environment to validate that alert paths are functional.
2 Monitoring Infrastructure Audit:
We are conducting a full audit of all alerting and monitoring infrastructure-as-code configurations to identify and resolve similar misconfigurations.
3 Terraform Change Safeguards (planned):
We are evaluating stricter CI/CD policies and static analysis tools to detect unintended resource replacements before deployment.
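As an illustration of the kind of check under evaluation, and not a description of the tooling we will adopt, the sketch below scans a Terraform plan exported as JSON and lists every resource that would be replaced. It relies on Terraform's JSON plan representation, in which each entry of `resource_changes` carries a `change.actions` list and a replacement appears as a combination of `delete` and `create`. The function name and the idea of failing the build on any match are assumptions made for this example.

```python
import json
from typing import List


def find_replacements(plan: dict) -> List[str]:
    """Return addresses of resources the plan would destroy and recreate."""
    replaced = []
    for resource in plan.get("resource_changes", []):
        actions = resource.get("change", {}).get("actions", [])
        # A replacement is reported as ["delete", "create"] or
        # ["create", "delete"], depending on the create-before-destroy setting.
        if "create" in actions and "delete" in actions:
            replaced.append(resource["address"])
    return replaced


def load_plan(path: str) -> dict:
    """Load a plan file produced by `terraform show -json`."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

In a CI job, the plan could be exported with `terraform plan -out=tfplan` followed by `terraform show -json tfplan > plan.json`, and the pipeline could stop for explicit approval whenever `find_replacements(load_plan("plan.json"))` returns a non-empty list.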
We take this incident seriously and are committed to learning from it. We sincerely apologize for the disruption and thank you for your patience. If you have any questions or concerns, please reach out to us at support@uvasoftware.com.