Management portal unavailable

Incident Report for Scanii.com

Postmortem

Summary

Between 2025-05-14 14:00 UTC and 2025-05-15 10:00 UTC, Scanii’s administrative portal (www.scanii.com) experienced an outage lasting approximately 18 hours. The downtime was caused by a misconfiguration in our Terraform infrastructure code, which unintentionally triggered the replacement of critical services in the AWS us-east-1 region.

This disruption should have been promptly caught by our monitoring systems. However, two separate issues prevented timely detection and alerting, compounding the impact of the incident.

Root Cause

1. Terraform Misconfiguration:
A change in our infrastructure-as-code repository inadvertently marked several key services for replacement. This change was not caught during our usual review processes.

2. Monitoring Failures:

Azure Monitor: A prior configuration update — made in response to the retirement of an Azure availability feature — changed the threshold logic for incident creation. This update set an unattainable condition, effectively disabling alerts from Azure's monitoring system.
CloudWatch Integration: An unrelated change to improve internal alerting (e.g., backup system monitoring) broke the CloudWatch → PagerDuty integration, preventing critical alarms from notifying our on-call team.
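
The Azure Monitor failure described above belongs to a general class of error: an alert condition that no reported metric value can ever satisfy. The following is a minimal sketch of a check for that class, assuming each rule can be reduced to a comparison operator, a threshold, and the range of values the metric can take. The `AlertRule` type, its fields, and the example rule names are hypothetical and do not reflect Azure Monitor's actual rule schema or the specific condition involved in this incident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical, simplified alert rule (not a real Azure Monitor schema)."""
    name: str
    operator: str      # one of "<", "<=", ">", ">="
    threshold: float
    metric_min: float  # lowest value the metric can report
    metric_max: float  # highest value the metric can report


def can_fire(rule: AlertRule) -> bool:
    """Return True if some value in the metric's range satisfies the condition."""
    if rule.operator == "<":
        return rule.metric_min < rule.threshold
    if rule.operator == "<=":
        return rule.metric_min <= rule.threshold
    if rule.operator == ">":
        return rule.metric_max > rule.threshold
    if rule.operator == ">=":
        return rule.metric_max >= rule.threshold
    raise ValueError(f"unsupported operator: {rule.operator!r}")


def unattainable(rules: list[AlertRule]) -> list[str]:
    """Return the names of rules whose condition can never be met."""
    return [rule.name for rule in rules if not can_fire(rule)]


# Illustrative rules for an availability metric reported as 0-100 percent.
example_rules = [
    AlertRule("availability-below-90", "<", 90.0, 0.0, 100.0),
    AlertRule("availability-below-0", "<", 0.0, 0.0, 100.0),   # can never fire
    AlertRule("availability-above-100", ">", 100.0, 0.0, 100.0),  # can never fire
]
print(unattainable(example_rules))
```

A check like this only catches conditions that are impossible by construction. It does not confirm that a reachable alert is actually delivered to the on-call team, which is why end-to-end alert testing is still needed.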

Impact

Portal access was unavailable for ~18 hours.
No customer API traffic was affected.

Remediation and Next Steps

To prevent recurrence, we are taking the following actions:

1. Pre-Production Alerting Tests:
We will implement automated, routine alert testing in our pre-production environment to validate that alert paths are functional.

2. Monitoring Infrastructure Audit:
We are conducting a full audit of all alerting and monitoring infrastructure-as-code configurations to identify and resolve similar misconfigurations.

3. Terraform Change Safeguards (planned):
We are evaluating stricter CI/CD policies and static analysis tools to detect unintended resource replacements before deployment.
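
As an illustration of the kind of safeguard under evaluation, the sketch below scans a Terraform plan for resources that would be destroyed and recreated. It assumes the plan has been exported to JSON with `terraform show -json`, where each entry in `resource_changes` carries an `address` and a `change.actions` list, and a replacement appears as both `"delete"` and `"create"`. The resource addresses in the sample are hypothetical, and this is not a description of the tooling we will ultimately adopt.

```python
import json


def find_replacements(plan: dict) -> list[str]:
    """Return addresses of resources the plan would destroy and recreate."""
    replaced = []
    for resource_change in plan.get("resource_changes", []):
        actions = resource_change.get("change", {}).get("actions", [])
        # A replacement lists both actions, in either order.
        if "delete" in actions and "create" in actions:
            replaced.append(resource_change.get("address", "<unknown>"))
    return replaced


# Hypothetical excerpt of `terraform show -json` output, for illustration only.
sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "aws_security_group.portal",
     "change": {"actions": ["update"]}},
    {"address": "aws_instance.portal",
     "change": {"actions": ["delete", "create"]}},
    {"address": "aws_s3_bucket.assets",
     "change": {"actions": ["no-op"]}}
  ]
}
""")

flagged = find_replacements(sample_plan)
for address in flagged:
    print(f"REPLACEMENT: {address}")
```

In a CI pipeline, a step could fail the build whenever the returned list is non-empty, so that any replacement requires explicit sign-off before deployment.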

Closing Note

We take this incident seriously and are committed to learning from it. We sincerely apologize for the disruption and thank you for your patience. If you have any questions or concerns, please reach out to us at support@uvasoftware.com.

Posted May 16, 2025 - 09:02 EDT

Resolved

This incident has been resolved.

Posted May 15, 2025 - 07:30 EDT

Update

It looks like we're back up. We'll provide a postmortem as soon as we can.

Posted May 15, 2025 - 07:29 EDT

Investigating

We are investigating issues affecting users' access to our web portal at www.scanii.com.

Posted May 14, 2025 - 12:00 EDT

This incident affected: Administrative (Management Portal (scanii.com)).