On 5 September 2019, we became aware of increased CPU usage on our test agents across all SpeedCurve regions. Unfortunately, the increased CPU usage affected metrics for almost all of the tests that were run between 3 and 11 September 2019. CPU-based metrics like TTI & scripting time were the most heavily affected, but in many cases time-based metrics like start render & Speed Index were also affected.
We know that dramatic changes in metrics like this can be frustrating, especially when you aren't sure what caused the change. We now know that the root cause of this incident was an update to the Linux kernel on the servers that run our test agents. The unusually long duration of the incident was due to a combination of insufficient monitoring, a complex tech stack, and a slow debugging feedback loop.
Let's dive straight into a timeline of events. All times are in UTC.
An update to the Linux kernel was installed on our test agents. All of our test agents run Ubuntu 18.04 LTS and are configured to run a software update when they first boot, as well as every 24 hours. For this reason, there would have been a mixture of "good" and "bad" test agents for several hours after this point.
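To illustrate (this is the stock Ubuntu mechanism, not a copy of our exact configuration): on Ubuntu 18.04, daily automatic updates are typically driven by the unattended-upgrades package via a couple of APT settings.

```
# /etc/apt/apt.conf.d/20auto-upgrades -- typical Ubuntu defaults
APT::Periodic::Update-Package-Lists "1";   # refresh package lists daily
APT::Periodic::Unattended-Upgrade "1";     # install available upgrades daily
```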
Our internal monitoring alerted us to an increase in CPU metrics. At this point there was still a mixture of "good" and "bad" test agents, so the data that triggered the alert looked like a handful of anomalous tests rather than a genuine issue. For this reason, the alert was dismissed on the assumption that a genuine issue would trigger further alerts.
Our internal monitoring alerted us to another increase in CPU metrics, this time for a third-party script (Google Analytics). This prompted a short investigation, but we concluded that the alert was caused by a change in Google Analytics rather than an issue with the test agents.
Our internal monitoring alerted us again to an increase in CPU metrics. This time the alert was seen and taken more seriously, because it appeared to be more widespread than a single third party. Investigation into the issue began in earnest at this point.
We received the first report from a SpeedCurve user about degraded performance.
More members of the SpeedCurve team joined the discussion to speculate about possible causes. The tech stack for our test agents has several layers, from our own agent code and the browsers it drives down through the operating system (Ubuntu Linux) to the underlying EC2 instances.
Our goal at this point was to rule out as many layers as possible so that we could focus the investigation.
More internal monitoring alerted us to the fact that this issue was much more widespread than we initially thought. We began to speculate that there could be an EC2 issue, but this was ruled out, since an EC2 problem would be unlikely to affect multiple independent regions at the same time.
By this point we had ruled out all layers except for Linux and EC2. We believed the most likely cause was a software package upgrade, and began a binary search to identify the package.
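As a rough sketch of that approach (the package names and the check below are stand-ins; on the real agents each check meant provisioning a server with a subset of upgrades applied and measuring CPU), a binary search over an ordered list of package upgrades looks like this:

```python
def find_bad_upgrade(packages, reproduces_issue):
    """Bisect an ordered list of package upgrades to find the one that
    introduces a regression. `reproduces_issue(applied)` reports whether
    an agent with only those upgrades applied shows the high CPU usage.
    Assumes a single culprit package."""
    lo, hi = 0, len(packages)  # invariant: culprit index is in [lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if reproduces_issue(packages[:mid]):
            hi = mid  # culprit is among the first `mid` upgrades
        else:
            lo = mid  # culprit is in the later half
    return packages[lo]

# Hypothetical example: pretend the kernel image upgrade is the culprit.
upgrades = ["libssl1.1", "ca-certificates", "linux-image-generic", "curl"]
print(find_bad_upgrade(upgrades, lambda applied: "linux-image-generic" in applied))
```

Each probe halves the number of suspect packages, so even a long upgrade list needs only a handful of (slow) provision-and-measure cycles.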
No further investigation was performed over the weekend.
After a few false positives in identifying the package, we switched to a much more thorough debugging method: upgrading packages one by one, rebooting the server, and creating an AMI snapshot at each step.
After a flood of support tickets from SpeedCurve users, we agreed that this issue was widespread enough to justify creating an incident on our status page.
We identified an update to the Linux kernel as the root cause. This was unexpected, and sparked some serious discussion about whether automated software updates were appropriate for our test agents.
The SpeedCurve team agreed to roll back the Linux kernel to a known-good version and disable automatic software updates.
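We haven't reproduced our exact changes here, but on Ubuntu 18.04 this roughly amounts to zeroing out the APT periodic settings and holding the kernel packages at the known-good version (e.g. with `apt-mark hold`):

```
# /etc/apt/apt.conf.d/20auto-upgrades -- automatic updates disabled
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Unattended-Upgrade "0";
```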
We began preparing patched test agent images for all of our test regions.
All regions except for London had been switched to the patched test agents. The London region seemed to be experiencing issues and we were unable to copy images to it.
The London region was switched to the patched test agents. The incident was marked as resolved on our status page.
This was SpeedCurve's most widespread and longest-running incident. Many factors contributed, but the biggest were insufficient monitoring, a complex tech stack, and a slow debugging feedback loop.
The major change we're making after this incident is switching from automated software updates to periodic, curated updates. This benefits both us and our users: updates can be vetted before they reach the test agents, and any resulting changes to metrics will happen at known, predictable points in time.
We will also continue to improve our internal monitoring.
This was a frustrating incident for SpeedCurve users and for the SpeedCurve team. We're really sorry for the inconvenience that it caused. On the bright side, we learned a lot and we're looking forward to improving our processes so that incidents like this don't happen again. Thanks so much for helping us to improve SpeedCurve!