Uva Software deploys happen via one of two modes. The first is a GitHub Actions workflow that progressively rolls out across our different environments and regions; before each new “stage” is deployed, we verify that the previous stage is working properly. The second is an infrastructure change, made via Terraform, that modifies the state of our components and forces them to be redeployed. Infrastructure changes are significantly less common, but they do happen.
Starting at around 12:50 AM (ET) on Feb 4th, a production deploy caused all of our API regions to become unresponsive, triggering our load balancers to return 500-level errors to HTTP calls. This deploy was triggered by Terraform as part of a scheduled production rollout of an upgrade to our web portal component and was not expected to trigger a redeploy of our global API endpoints. That expectation was a big mistake and showed our lack of understanding of Terraform’s idiosyncrasies.
Because Terraform changes aren’t part of our regular deployment pipeline (and thus cannot know the right deployment version for each environment/region combination), Terraform defaults to a “latest” tag. That default was considered “safe” and had worked reasonably well for quite a few years, but on this day “latest” for our API endpoints pointed to a large change still being tested in our staging environment and incompatible with our production infrastructure. Hence the outage.
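To illustrate the failure mode in miniature (this is a sketch, not our actual tooling — all function names, components, and versions below are hypothetical): pipeline deploys pin an exact version per environment, while a Terraform-driven redeploy has no pin and silently falls back to “latest”.

```python
def resolve_image_tag(component: str, environment: str, pinned_versions: dict) -> str:
    """Return the deployment tag for a component/environment pair.

    A pipeline-driven deploy knows the exact version for each
    environment; a Terraform-driven redeploy does not, so it falls
    back to "latest" -- which is only safe as long as "latest" is
    always production-compatible.
    """
    return pinned_versions.get((component, environment), "latest")

# A pipeline deploy pins the tag explicitly (illustrative version):
pinned = {("api", "production"): "v2.14.3"}
print(resolve_image_tag("api", "production", pinned))  # v2.14.3

# A Terraform-triggered redeploy has no pin and gets "latest",
# whatever "latest" happens to point at that day:
print(resolve_image_tag("api", "production", {}))      # latest
```

The fix direction is clear from the sketch: make the fallback an error (or pin Terraform-managed components to explicit versions) rather than defaulting to a mutable tag.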
Our teams were notified via PagerDuty at 12:53 AM, quickly identified that our API endpoints had been recently deployed (and were failing to come up), and initiated a rollback to the previous version. We did this on a region-by-region basis (from largest to smallest), and at around 1:30 AM all API endpoints were recovered.
Now, about all those false positive alerts.
While all that was going on, a legitimate change was going out to our web portal aimed at addressing latency in results showing up for customers, caused by a large influx of usage from one of our larger customers (in essence, a backlog of messages in SQS waiting to be post-processed). The goal of this change was to increase concurrency, driving CPU usage higher to trigger auto scaling and better keep up with post-processing demand. The change was successful, and results now show up in the portal faster than ever. But it also included an unrelated upgrade to a JSON serialization library that changed the text representation of an empty array from “” to “[]”. This apparently harmless change was enough to trick logic in our web portal into marking results without findings as results with findings, causing all the false positive email notifications.
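The portal’s actual code isn’t shown here, but a truthiness check on the serialized string — rather than on the parsed value — would fail in exactly this way once an empty array starts serializing as a non-empty string:

```python
import json

def has_findings(findings_json: str) -> bool:
    # Old library: an empty result serialized to "" (falsy), so this
    # string-truthiness check worked by accident.
    return bool(findings_json)

print(has_findings(""))    # False -- old representation of "no findings"
print(has_findings("[]"))  # True  -- new representation, wrongly flagged

def has_findings_fixed(findings_json: str) -> bool:
    # Parse instead of string-checking: both "" and "[]" mean no findings.
    return bool(findings_json) and bool(json.loads(findings_json))

print(has_findings_fixed("[]"))                    # False
print(has_findings_fixed('[{"type": "malware"}]'))  # True
```

The lesson generalizes: treat serialized payloads as data to parse, never as strings to compare, so a representation change in a dependency can’t flip business logic.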
To make matters worse, an ill-timed bug introduced with that release prevented users from changing their notification preferences. To be clear, this was a problem with our management portal only (www.scanii.com) and did not impact any of our content processing endpoints.
Unfortunately, the number of emails sent was not something we had active monitoring on, and the problem was not identified until later in the day, when we began reviewing the support tickets that had come in. At about 8 AM ET we did another deploy temporarily pausing all email notifications pending a proper fix to the portal. That fix went out early today (Feb 5th) after comprehensive testing.
As always, we apologize to all of our customers for the trouble this incident may have caused. If you have any questions or concerns, please don’t hesitate to reach out to us at email@example.com.