On 5 September 2019, we became aware of increased CPU usage on our test agents across all SpeedCurve regions. Unfortunately, the increased CPU usage affected metrics for almost all of the tests that were run between 3 and 11 September 2019. CPU-based metrics like TTI & scripting time were the most heavily affected, but in many cases time-based metrics like start render & Speed Index were also affected.
We know that dramatic changes in metrics like this can be frustrating, especially when you aren't sure what caused the change. We now know that the root cause of this incident was an update to the Linux kernel on the servers that run our test agents. The unusually long duration of the incident was due to a combination of insufficient monitoring, a complex tech stack, and a slow debugging feedback loop.
Let's dive straight into a timeline of events. All times are in UTC.
An update to the Linux kernel was installed on our test agents. All of our test agents run Ubuntu 18.04 LTS and are configured to run a software update when they first boot, as well as every 24 hours. For this reason, there would have been a mixture of "good" and "bad" test agents for several hours after this point.
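To illustrate (this is the stock Ubuntu mechanism, not a copy of our exact configuration): on Ubuntu 18.04, daily automatic updates are typically driven by the unattended-upgrades package via a couple of APT settings.

```
# /etc/apt/apt.conf.d/20auto-upgrades -- typical Ubuntu defaults
APT::Periodic::Update-Package-Lists "1";   # refresh package lists daily
APT::Periodic::Unattended-Upgrade "1";     # install available upgrades daily
```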
Our internal monitoring alerted us to an increase in CPU metrics. At this point there was still a mixture of "good" and "bad" test agents, so the data that triggered the alert looked like a handful of anomalous tests rather than a genuine issue. For this reason, the alert was dismissed on the assumption that a genuine issue would trigger further alerts.
Our internal monitoring alerted us to another increase in CPU metrics, this time for a third-party script (Google Analytics). This prompted a short investigation, but we concluded that the alert was caused by a change in Google Analytics rather than an issue with the test agents.
Our internal monitoring alerted us again to an increase in CPU metrics. This time the alert was seen and taken more seriously, because it appeared to be more widespread than a single third party. Investigation into the issue began in earnest at this point.
We received the first report from a SpeedCurve user about degraded performance.
More members of the SpeedCurve team joined the discussion to speculate about possible causes. The tech stack for our test agents has several layers, from our own agent code and the browsers it drives down through the operating system (Ubuntu Linux) to the underlying EC2 instances.
Our goal at this point was to rule out as many layers as possible so that we could focus the investigation.
More internal monitoring alerted us to the fact that this issue was much more widespread than we initially thought. We began to speculate that there could be an EC2 issue, but this was ruled out, since an EC2 problem would be unlikely to affect multiple independent regions at the same time.
By this point we had ruled out all layers except for Linux and EC2. We believed the most likely cause was a software package upgrade, and began a binary search to identify the package.
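As a rough sketch of that approach (the package names and the check below are stand-ins; on the real agents each check meant provisioning a server with a subset of upgrades applied and measuring CPU), a binary search over an ordered list of package upgrades looks like this:

```python
def find_bad_upgrade(packages, reproduces_issue):
    """Bisect an ordered list of package upgrades to find the one that
    introduces a regression. `reproduces_issue(applied)` reports whether
    an agent with only those upgrades applied shows the high CPU usage.
    Assumes a single culprit package."""
    lo, hi = 0, len(packages)  # invariant: culprit index is in [lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if reproduces_issue(packages[:mid]):
            hi = mid  # culprit is among the first `mid` upgrades
        else:
            lo = mid  # culprit is in the later half
    return packages[lo]

# Hypothetical example: pretend the kernel image upgrade is the culprit.
upgrades = ["libssl1.1", "ca-certificates", "linux-image-generic", "curl"]
print(find_bad_upgrade(upgrades, lambda applied: "linux-image-generic" in applied))
```

Each probe halves the number of suspect packages, so even a long upgrade list needs only a handful of (slow) provision-and-measure cycles.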
No further investigation was performed over the weekend.
After a few false positives in identifying the package, we switched to a much more thorough debugging method: upgrading packages one by one, rebooting the server, and creating an AMI snapshot at each step.
After a flood of support tickets from SpeedCurve users, we agreed that this issue was widespread enough to justify creating an incident on our status page.
We identified an update to the Linux kernel as the root cause. This was unexpected, and sparked some serious discussion about whether automated software updates were appropriate for our test agents.
The SpeedCurve team agreed to roll back the Linux kernel to a known-good version and disable automatic software updates.
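We haven't reproduced our exact changes here, but on Ubuntu 18.04 this roughly amounts to zeroing out the APT periodic settings and holding the kernel packages at the known-good version (e.g. with `apt-mark hold`):

```
# /etc/apt/apt.conf.d/20auto-upgrades -- automatic updates disabled
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Unattended-Upgrade "0";
```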
We began preparing patched test agent images for all of our test regions.
All regions except for London had been switched to the patched test agents. The London region seemed to be experiencing issues and we were unable to copy images to it.
The London region was switched to the patched test agents. The incident was marked as resolved on our status page.
This was SpeedCurve's most widespread and longest-running incident. Many factors contributed, but the biggest were insufficient monitoring, a complex tech stack, and a slow debugging feedback loop.
The major change we're making after this incident is switching from automated software updates to periodic, curated updates. This benefits both us and our users: updates can be vetted before they reach the test agents, and any resulting changes to metrics will happen at known, predictable points in time.
We will also continue to improve our internal monitoring.
This was a frustrating incident for SpeedCurve users and for the SpeedCurve team. We're really sorry for the inconvenience that it caused. On the bright side, we learned a lot and we're looking forward to improving our processes so that incidents like this don't happen again. Thanks so much for helping us to improve SpeedCurve!