Within the first few days of 2018, security researchers at Google, Graz University of Technology, and several other organizations shocked the world with the disclosure of multiple vulnerabilities found in most modern processors. The IT industry in particular was sent into a panic, as we learned that the Meltdown and Spectre flaws plague just about every single computer chip developed in the past two decades.
Then, there were reports that the quickly released software fixes for Meltdown and Spectre result in massive performance degradation. Early assessments suggest that once patched, some machines take a performance hit as high as 30%. According to Intel, similar tests conducted since the flaws became public indicate a 2% to 14% decrease in performance. Regardless, both numbers are concerning and carry major implications for the data center.
Many data center owners and operators are in a tough position, wondering just how bad the impact will be and what it will mean for facility utilization and capacity over the longer term.
For the data center market, it’s important to recognize that the flaws’ impact on capacity and demand will be far less than it would be if capacity were more homogeneous or if there were less over-provisioning in individual facilities. Uptime Institute survey responses, along with a variety of other sources, suggest that most servers operate at less than 25% of capacity most of the time, and often much lower. While this appears to be a sign of inefficiency, in many cases it is a deliberate strategy, especially when it comes to dealing with demand peaks.
As a matter of fact, when 451 Research asked operators how they currently deal with variable resource requirements due to randomness, time of day, and/or seasonal demand, 61% of respondents said they “overprovisioned.” While this over-provisioning might help to cushion the impact of the processor vulnerabilities, it won’t solve the issue.
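To see why over-provisioning cushions the blow without removing it, consider a rough back-of-envelope model. The Python sketch below plugs in the utilization and performance-loss figures cited above purely as illustrative assumptions, not measured data: a server running at 25% of capacity keeps substantial headroom even after a 30% hit, while a busier machine loses much of its margin for demand peaks.

```python
# Back-of-envelope model of how a patch-induced throughput loss eats into
# an over-provisioned server's headroom. All figures are illustrative
# assumptions taken from the ranges cited above, not measurements.

def effective_headroom(utilization, performance_loss):
    """Spare capacity left once the patched server's ceiling drops."""
    effective_capacity = 1.0 - performance_loss   # capacity after the patch, as a fraction of original
    return effective_capacity - utilization       # what remains for demand peaks

for loss in (0.02, 0.14, 0.30):        # Intel's low/high estimates and the early 30% reports
    for util in (0.25, 0.50):          # typical low utilization vs. a busier machine
        print(f"utilization {util:.0%}, performance loss {loss:.0%}: "
              f"headroom {effective_headroom(util, loss):.0%}")
```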
In certain situations, even performance hits at the lower end of that range could be expensive, forcing a thorough review of capacity or performance. As things stand, fixing the vulnerabilities with a software patch degrades performance, while new hardware without the flaws might still be a year or two away. Many operators could be forced into buying new servers simply because they are more efficient than the systems they currently have, even though the new machines remain vulnerable.
One global financial services company told Uptime Institute that its projected capacity requirement has “ballooned,” while a cloud provider has said it wants compensation from its server manufacturer to cover expected increased costs.
The overall impact of Meltdown and Spectre on the data center will depend on a variety of factors. Since the chip flaws were first revealed, analysts have suggested that certain workloads, such as those with heavy input/output requirements, are likely to suffer the worst performance issues. Cloud providers, heavy users of virtualization, and transaction processing applications might be affected most, since their systems are the most likely to operate at higher utilization.
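One reason I/O-intensive workloads are expected to fare worst is that the main Meltdown mitigation, kernel page-table isolation, adds cost to every transition between user space and the kernel, and I/O-heavy code crosses that boundary constantly. The sketch below is a hypothetical micro-benchmark in Python, assuming a Unix-like host (it reads from /dev/zero): it times a syscall-heavy loop against a pure compute loop, and run before and after patching, the syscall-heavy loop would be expected to show the larger slowdown.

```python
# Hypothetical before/after micro-benchmark: syscall-heavy work (many tiny
# reads) versus pure computation. Kernel page-table isolation adds overhead
# on each user/kernel transition, so the first loop is the one most exposed
# to the patches. Iteration counts and /dev/zero are illustrative choices.
import os
import time

def time_syscalls(n=200_000):
    fd = os.open("/dev/zero", os.O_RDONLY)
    start = time.perf_counter()
    for _ in range(n):
        os.read(fd, 1)            # each read is a user/kernel transition
    elapsed = time.perf_counter() - start
    os.close(fd)
    return elapsed

def time_compute(n=200_000):
    start = time.perf_counter()
    total = 0
    for i in range(n):
        total += i * i            # stays entirely in user space
    return time.perf_counter() - start

print(f"syscall loop: {time_syscalls():.3f}s, compute loop: {time_compute():.3f}s")
```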
Enterprises will struggle to pass on any costs and will have to pay for the infrastructure needed to address any IT capacity increases, while cloud providers, who may be forced to provision fewer virtual servers per underlying processor, can pass on or absorb the extra costs. For colocation facilities, however, there is limited downside, since their customers, both enterprises and cloud service providers, may need more space or power.
Unfortunately, a long-term resolution will require changes to microprocessor designs, which are likely years away. In the meantime, data center operators will consider a range of options, including using cloud services, consolidating, retrofitting, expanding existing sites, and using colocation or hosted services.
The impact of Meltdown and Spectre is likely to lead to a lot of haggling about costs and compensation, especially where there is a clear performance shortfall and new hardware is required. New systems will not be vulnerable, although some redesigns could reduce expected performance gains. The extent of the impact will likely depend on who foots the bill for the servers and supporting infrastructure, where the operator sits in the ecosystem, and whether it is easy to pass on costs.
Andy Lawrence, founding member and executive director of Uptime Institute Research, has built his career focusing on innovative new solutions, emerging technologies and opportunities found at the intersection of IT and infrastructure.