In the early days of 2018, the engineering team at the mobile services company Branch noticed slowdowns and errors with its Amazon Web Services cloud servers. An unexpected round of AWS server reboots in December had already struck Ian Chan, Branch’s director of engineering, as odd. But the server slowdowns a few weeks later presented a more pressing concern.
“We had six engineers crammed in a small war room all staring at charts, deploy logs, revision histories, and latency graphs looking for the cause,” Chan says. “We spent a few days eliminating possibilities one after another, but were unable to find a root cause. We were seemingly chasing a non-existent bug in our system.”
The team kept Branch’s services operational by reworking some of their architecture, and purchasing more server capacity from AWS to stabilize the workloads. “At some point someone floated the hypothesis that it was an underlying performance issue due to the Spectre and Meltdown patches being applied by AWS,” Chan says. “The mystery reboots from just a few weeks earlier suddenly made sense.”
Branch’s struggles turn out not to be unique. Last week’s public revelation that most mainstream computing processors could be manipulated to leak data between programs led to a frenzy of patches and confusion. Even before Meltdown and Spectre were officially revealed, there had been hints that the fix could significantly degrade performance. And while system administrators, internet infrastructure providers, and cybersecurity managers now largely agree they’ve dodged the early worst-case scenarios, the patches have still taken a tangible toll.
Taking Your Medicine
The Meltdown and Spectre vulnerabilities exist because for years chipmakers have taken steps to prioritize performance and speed that, as a side effect, turned out to impact security. By reining in some of these data fast tracks, the fixes slow down certain types of operations, particularly for programs that require a lot of requests to the kernel, an operating system’s most fundamental and secretive inner sanctum.
Early testing and benchmarking of the Meltdown and Spectre fixes indicated that their impact could be severe. Even just the complexity of applying and managing the patches (particularly for Spectre, which is more a class of vulnerability than a specific bug) has created a real strain on the industry. Lots of vulnerabilities require large-scale patches. But Meltdown and Spectre are unique in that they involve overhauls of standard operating system software as well as rarer updates to the firmware and microcode that coordinate and control hardware.
“I remember first looking at it and thinking ‘oh, shit,’” says John Michener, the chief scientist at the security consulting firm Casaba Security, which has helped retail vendors with Meltdown and Spectre remediation. “We’ll see Spectre-related bugs for the next five years. But in general this type of thing has happened before. We may see a marginal impact and take a bit of a hit, but the newer processors don’t have a huge loss. Older processors have more of an impact.”
Dampening the potentially crippling performance issues has required a massive, coordinated effort behind the scenes. Some companies, including the open source enterprise IT services group Red Hat, had advance notice about Meltdown and Spectre before the public disclosure, getting a head start on the patching process.
“There certainly is a performance impact, but what we had to do is kind of use the big hammer initially to mitigate, and then we can go back to iterate and refine,” says Red Hat chief ARM architect Jon Masters. “There’s potential for improving these fixes.”
That’s not to say everything’s fine and rosy. While Intel and other processor manufacturers initially worked to downplay potential performance problems from the patches, the industry immediately started feeling ripple effects.
In a Tuesday update, for example, Microsoft said that consumer devices with processors from 2015 or earlier running Windows 7, 8, and 10 would be more likely to exhibit slowdowns. The company added that, “Windows Server on any silicon, especially in any IO-intensive application, shows a more significant performance impact when you enable the mitigations.”
This means that millions of Windows PCs and servers around the world, even those that are just a few years old, could get noticeably more sluggish, as much as 20 percent slower in some cases. Intel also published benchmark and user data on Wednesday, which similarly shows deeper losses for older generations of silicon.
Those losses will hit consumers hard. Large-scale organizations have minimized problems by testing patches in advance, and adding other efficiencies to offset losses, but individuals are pretty much stuck with the solutions tech companies provide. On Tuesday, for example, Microsoft paused distribution of its Meltdown and Spectre patches for certain AMD processors after the update bricked some machines. Microsoft claims that its patches were flawed because of inaccuracies in AMD’s chip documentation. On Thursday, Intel also admitted that its Meltdown and Spectre patches for older Broadwell and Haswell processors are causing more random reboots than usual. The chipmaker may push another patch to deal with the glitch.
And that’s before you even get to performance dips that stem from third-party service providers, like cloud platforms.
The video game maker Epic Games, for example, recently detailed patch-related performance declines in the popular battle royale game Fortnite. “All of our cloud services are affected by updates required to mitigate the Meltdown vulnerability,” Epic Games wrote last week. “We heavily rely on cloud services to run our back-end and we may experience further service issues due to ongoing updates.”
Fortnite players have experienced problems with log-ins, slowdowns, and downtime, which is not ideal for a competitive gaming environment. The problems have persisted since Epic Games initially outlined them last week. The company tells WIRED that it is still working with its cloud providers on a total resolution.
Industrial control systems and critical infrastructure have so far avoided Meltdown and Spectre slowdowns by not yet deploying fixes. That’s typical of these sectors, given the importance of understanding how patches will impact systems before they’re deployed. If something went wrong it could go really wrong.
“We definitely don’t see anyone in critical infrastructure patching on the fly,” says Jonathan Pollet, the founder of Red Tiger Security, which consults on cybersecurity issues for heavy industrial clients like power plants and natural gas utilities.
In working with the Meltdown and Spectre patches so far, Pollet notes that industrial systems generally have low processing and bandwidth requirements anyway, meaning less potential for performance degradation. The bigger complication will be identifying all of the vulnerable devices and making sure patches reach them eventually.
“When there’s a vulnerability at the chip level our customers are struggling with figuring out which of their components out in the field or in plants and factories actually have this particular bug, because they’re not really tracking their supply chain and inventory down to the chip level,” Pollet says. “So it took a few days for some of our clients to figure out where they actually had infrastructure that required an update.”
That type of time investment applies to internet infrastructure as well, one sector where lack of protection against data exposure vulnerabilities like Meltdown and Spectre could pose a real and large-scale security risk long-term.
“The thing that’s unusual about this bug is the scope of it,” says John Graham-Cumming, chief technology officer of the content delivery and internet infrastructure company Cloudflare. “It affects pretty much all computers, it’s a very high percentage, and the problem is that people really find ways to exploit these security problems over time. So you’ve got to patch, there’s no way to get away from that, you’ve got to roll it out everywhere.”
Google has been refining a mitigation approach called Retpoline, which the company released last week to help manage performance issues in cloud platforms and other massive enterprise systems. And Amazon Web Services told WIRED in a statement Thursday that, “There have been isolated cases where a specific workload needed attention after patching. Our engineers have helped customers optimize their applications and in almost every case, prevent significant changes to their costs.”
For its part, Cloudflare, which claims to manage almost 10 percent of internet requests worldwide, says that in the end it managed the performance issues with the Meltdown and Spectre patches by putting extensive resources into testing the fixes before pushing them out. “You’re suddenly in an emergency situation where there’s kind of a fog of war,” Graham-Cumming says. “We sell performance, so if it was going to slow us down that would have a very big impact on our business.”
And though installing the Meltdown and Spectre patches has been an enormous effort and caused real grief, many in the industry remain upbeat about the challenge. Even after all of its struggles and the money it had to spend to handle the problem, Branch says it sympathizes with AWS and everyone else working to deploy the patches. In fact, AWS pushed out yet another refinement on Friday to improve performance right as this story went live.
“We’re still investigating the longer term impact on our system,” Branch’s Chan says. “Despite the performance impact, AWS was protecting its customers. They did the right thing.”