During the week of 9/8/20 through 9/11/20, Scanii API services in the US region suffered 3 major outages. Here’s a description of what happened and what we’re doing to prevent this issue from happening again.
Starting on Tuesday 9/8 around 10AM EDT (GMT-4), our monitoring systems started triggering alerts that our API was getting slower (response times were increasing). This is not uncommon when there is a spike in the amount of content customers are submitting to our engine for analysis, but it’s usually quickly resolved by our AWS-based system automatically scaling capacity up to meet demand. We use target tracking scaling policies, which automatically ensure that we have enough servers to keep the average CPU usage around 50% (as of this writing). Sadly, in this case the latency did not subside after the capacity increase, and what had started as latency became full-on API call failures returning 500-level HTTP errors.
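To make the idea of target tracking concrete, here is a minimal sketch of the proportional rule such a policy roughly follows. It is an illustration only: the function name, the `max_servers` cap, and the formula itself are simplifying assumptions, not AWS's actual algorithm and not our production configuration.

```python
import math


def desired_capacity(current_servers: int,
                     avg_cpu_percent: float,
                     target_cpu_percent: float = 50.0,
                     max_servers: int = 1000) -> int:
    """Estimate how many servers keep average CPU near the target.

    Hypothetical helper: scales the fleet in proportion to how far the
    observed average CPU is from the target, then clamps the result.
    """
    if current_servers < 1:
        raise ValueError("current_servers must be at least 1")
    if target_cpu_percent <= 0:
        raise ValueError("target_cpu_percent must be positive")

    # If CPU is at 90% with a 50% target, we need 90/50 = 1.8x the servers.
    wanted = math.ceil(current_servers * avg_cpu_percent / target_cpu_percent)

    # Never scale below one server or above the configured ceiling.
    return max(1, min(wanted, max_servers))
```

With 10 servers averaging 90% CPU and a 50% target, this rule asks for 18 servers. A rule like this only helps when the failures are actually caused by a lack of capacity.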
Our initial theory was that this was related to a new large customer that had started sending us traffic that same day and whose usage pattern was very spiky: no API calls for a while, then millions of calls in a short period of, let’s say, 5 minutes. Essentially, the traffic was spiking faster than our capacity management logic could keep up. We proceeded to manually increase capacity to accommodate their peak need until we could work out a more permanent solution - and that was Tuesday.
Most of Wednesday was quiet as we developed a plan to bring capacity back down to acceptable levels. Keep in mind that at this time we were operating at 20x our compute capacity baseline just to be confident that we could take that large spike in traffic without disrupting other customers.
Our plan to get back to normal essentially involved improving our capacity management logic to trigger capacity events earlier in a spike, coupled with reducing how long it took for newly provisioned capacity to come fully online.
So, at the end of the day, we rolled out a new set of improvements and things stayed quiet - but not for long.
Thursday came around and, as soon as US-based load started climbing (around 8AM EDT), we started alerting again. This time, however, capacity did not seem to be the culprit: our recent changes were doing their job but, sadly, we were failing requests again.
In a nutshell, this was a rough day: we had to bump capacity back up to the exorbitant 20x level to bring error rates back to normal, and start over looking for a new root cause for our issues.
Our next theory centered on expensive API calls, such as the lookup of a processing result, which had skyrocketed with the elevated error rates we had seen the previous day.
That happened because quite a few customers poll for results (instead of using our callback mechanism) and, with all the outages we had, there were a lot of missing results that customers kept polling for indefinitely. As we had done on the previous days, we rolled out improvements for this problem at the end of the day so they could be monitored under real load the following day.
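For customers who prefer polling to callbacks, a bounded loop with backoff avoids the "poll forever" pattern described above. The sketch below is a generic illustration rather than part of any Scanii client library: `fetch` is a hypothetical caller-supplied function that returns the result once it is available and `None` otherwise, and all of the default timings are assumptions.

```python
import time
from typing import Callable, Optional


def poll_for_result(fetch: Callable[[], Optional[dict]],
                    max_wait_seconds: float = 300.0,
                    initial_delay: float = 1.0,
                    max_delay: float = 30.0,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic
                    ) -> Optional[dict]:
    """Poll until a result arrives or the overall deadline passes.

    Returns the result, or None if it never showed up in time, so the
    caller can give up (or resubmit) instead of polling indefinitely.
    """
    deadline = clock() + max_wait_seconds
    delay = initial_delay
    while True:
        result = fetch()
        if result is not None:
            return result

        remaining = deadline - clock()
        if remaining <= 0:
            return None  # Give up: the result is probably missing.

        # Wait, but never past the deadline, then double the delay
        # (capped) so a missing result costs fewer and fewer lookups.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay)
```

The `sleep` and `clock` parameters exist only so the loop can be tested without real waiting. In normal use the defaults are fine.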
All right, now it’s Friday (9/11) and, to be totally honest, we weren’t super confident that our “cascading effect” theory (so named because it postulated that the problems on Thursday were caused by the super high error rates on Tuesday and Wednesday) was it - but it was the best we had at the moment.
Sadly, error rates climbed again, and we were quicker to increase capacity in order to minimize the impact on our customers. This time, however, as we felt that we were running out of options, we decided to save a couple of servers demonstrating high error rates for some deeper analysis.
And there it was, in one of these failed servers, the root cause for all of our problems.
A plain old software bug that could only be triggered under very specific high-load situations and that would, essentially, cause a deadlock in our engine, which bubbled up to customers in the form of a gateway timeout.
Nothing like a bug you understand, can reproduce in staging and fix - and we quickly rolled out a permanent fix to production.
Here’s our plan to prevent this type of issue from happening again:
Since December of 2010, Scanii has provided a high-quality service that companies across the globe have grown to depend upon, but this week we clearly did not live up to that responsibility and we’re deeply sorry for it.
For all of our customers that were impacted by this series of incidents, we plan to go well beyond our SLA requirements and issue a full refund for the month of September to anyone who requests it via an email to email@example.com - we’re sorry and we mean it.
If you have any questions, please don’t hesitate to get in touch with our support team. We’ll answer any and all questions you may have.