The original promise of server virtualization was simple and offered instant gratification: Say goodbye to racks of underutilized server hardware, and instead use software to pack dozens of virtual machines efficiently onto a few physical servers. The consolidation is supposed to produce savings on hardware (capex) as well as reduced spending on upkeep, data center space, and energy (opex).
Unfortunately, virtualization adopters quickly discovered that it is entirely too easy to provision virtual servers, and that those servers consume resources at a far greater pace than before. That's great for agility and flexibility but potentially damaging for budgets. Unnecessary overallocation of resources is costly and leads to more spending on additional capacity.
IT departments often don't have a good idea of their virtual footprint at any given moment. Understanding how many virtual machines are active, which ones have too many resources, or which ones are not used at all can be a complex exercise. IT operators without the proper tools or information are left unprepared to deal with the situation.
With application owners and line-of-business managers jumping into the cloud game at increasing speed, VM overallocation has worsened. These individuals often make decisions about compute, storage, and memory without understanding application requirements (e.g., the resources needed to support an estimated number of users). As a result, they tend to buy more than they need to avoid the horror of running out of resources.
Another problem is that monitoring and reporting tools haven't kept pace with virtualization and cloud computing. Without feedback that meters how much capacity has been provisioned, operations teams may not realize when limits are being exceeded.
For example, a policy that limits only the memory and CPU an individual can provision per virtual machine, without limiting the total number of virtual machines, doesn't account for the wide variance in aggregate resource allocation that can occur. The cost, obviously, varies dramatically, and there is no accounting for actual resource availability.
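To make that gap concrete, here is a minimal Python sketch of such a policy check; the per-VM limits, request sizes, and hourly rates are invented for illustration and don't reflect any real quota system or price list:

```python
# Hypothetical sketch: a per-VM quota alone does not bound aggregate allocation or cost.
from dataclasses import dataclass

@dataclass
class VMRequest:
    name: str
    vcpus: int
    memory_gb: int

PER_VM_LIMIT = {"vcpus": 8, "memory_gb": 32}   # assumed policy: per-VM cap only
HOURLY_RATE = {"vcpu": 0.05, "gb": 0.01}       # assumed unit costs, for illustration

def within_policy(req: VMRequest) -> bool:
    """The policy checks each VM in isolation -- never the total."""
    return req.vcpus <= PER_VM_LIMIT["vcpus"] and req.memory_gb <= PER_VM_LIMIT["memory_gb"]

# Fifty maximum-size VMs, every one of them "compliant."
requests = [VMRequest(f"vm-{i}", 8, 32) for i in range(50)]
assert all(within_policy(r) for r in requests)

total_vcpus = sum(r.vcpus for r in requests)
total_mem = sum(r.memory_gb for r in requests)
hourly_cost = total_vcpus * HOURLY_RATE["vcpu"] + total_mem * HOURLY_RATE["gb"]
print(f"{total_vcpus} vCPUs, {total_mem} GB RAM, ~${hourly_cost:.2f}/hour")
# Nothing in the policy stops the aggregate from exceeding what the cluster actually has.
```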
Of course, there's the option to reclaim capacity from underutilized VMs, sometimes called "zombie" VMs, though that's a risky proposition. Some reclamation actions require downtime or complex procedures to ensure there is no service interruption. In those scenarios, organizations need adequate storage and networking capacity to bring the resources back up without issue, and it can be tricky to calculate these dependencies properly.
To avoid excess capacity in virtual environments, an organization needs to collect enough data across components, in small enough slices, and retain it long enough to provide a highly accurate view of past performance. IT needs to see the detailed peaks and valleys of resource consumption in order to make the right calls on provisioning; that requires data sets updated in seconds, not days. Granular, high-frequency, historical data is the best way to predict the future performance needs of your applications.
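As a rough illustration of what "small enough slices" can look like, the sketch below samples host CPU and memory every 20 seconds and appends the raw readings to a file. It assumes the third-party psutil package is installed, and it is a starting point rather than a production collector:

```python
# Minimal high-frequency sampler sketch; the 20-second interval and CSV retention
# are illustrative choices, not a prescribed architecture.
import csv
import time

import psutil  # third-party: pip install psutil

INTERVAL_SECONDS = 20  # the "gold standard" granularity discussed below

def sample_forever(path: str = "samples.csv") -> None:
    """Append timestamped CPU/memory readings until interrupted (Ctrl-C)."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            writer.writerow([
                time.time(),
                psutil.cpu_percent(interval=None),   # % CPU since last call (first reading may be 0.0)
                psutil.virtual_memory().percent,     # % memory in use
            ])
            f.flush()
            time.sleep(INTERVAL_SECONDS)

if __name__ == "__main__":
    sample_forever()
```

Retaining the raw rows, rather than only hourly roll-ups, is what preserves the peaks and valleys needed for capacity decisions.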
With the right tools, policies, and measurements, it's possible to stay on top of even the largest virtual environment. Look no further than Netflix, Facebook, and Google, all known for highly efficient, distributed, and agile web-based operations.
Visibility: All roads lead from here. A company should strive for detailed historical views into performance, resource allocation, and status information from all applications and infrastructure devices. Modern infrastructure management and IT operations tools and cloud services can collect the data, mash it up, and provide high-level dashboards that indicate trends and issues.
Data: In IT operations, the more data there is to analyze, the better. The current gold standard is 20-second intervals. Let's say a website peaks at 28,000 transactions during high-load periods. If the observation interval were hourly, the averaged data would indicate a value closer to 10,000 transactions. If the operations team makes infrastructure decisions based on that average, rather than the peak, it is inviting disaster with end users and customers. Businesses can't afford to lose disgruntled users who won't put up with a sluggish site.
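The arithmetic is easy to demonstrate. The synthetic series below is invented to roughly match the numbers above: a short burst of 28,000 transactions inside an hour whose roll-up averages out near 10,000:

```python
# Illustrative sketch of how coarse observation intervals hide peaks.
# The per-20-second transaction counts are synthetic, chosen only for this example.
import statistics

# One hour of 20-second samples (180 total): a steady baseline plus a five-minute burst.
samples = [9_000] * 165 + [28_000] * 15

peak = max(samples)
hourly_average = statistics.mean(samples)

print(f"peak seen at 20-second granularity: {peak:,.0f} transactions")
print(f"value an hourly roll-up reports:    {hourly_average:,.0f} transactions")
# Sizing for ~10,600 instead of 28,000 means the burst overwhelms the infrastructure.
```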
Metrics: Users expect applications to run perfectly whenever they need them. That means measuring more things more often to know exactly what is needed and when. CPU, memory, and storage usage may have been enough in the past, but they no longer tell the whole story.
It's important to look not only at the basic virtual machine metrics, but also at metrics that report on how the broader infrastructure is performing as a system. For example, transaction time, memory ballooning, CPU contention, memory swapping, storage IOPS, and network throughput and latency should all be combined to present performance and resource consumption in the context of the applications being delivered.
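One way to picture that combination is a simple per-application snapshot evaluated against thresholds. The field names and threshold values in this sketch are assumptions for illustration, not metrics or limits from any particular vendor:

```python
# Hypothetical sketch of combining VM and infrastructure metrics in application context.
from dataclasses import dataclass

@dataclass
class AppSnapshot:
    app: str
    transaction_time_ms: float   # end-user view of performance
    cpu_ready_pct: float         # CPU contention: time spent waiting for a physical core
    ballooned_mb: int            # memory reclaimed from the guest by the hypervisor
    swapped_mb: int              # memory swapped out by the hypervisor
    storage_iops: int
    net_latency_ms: float

def flag_issues(s: AppSnapshot) -> list[str]:
    """Assumed thresholds, for illustration only."""
    issues = []
    if s.transaction_time_ms > 500:
        issues.append("slow transactions")
    if s.cpu_ready_pct > 5:
        issues.append("CPU contention")
    if s.ballooned_mb > 0 or s.swapped_mb > 0:
        issues.append("memory pressure")
    if s.net_latency_ms > 50:
        issues.append("network latency")
    return issues

snap = AppSnapshot("checkout", 720.0, 8.2, 512, 128, 4200, 12.0)
print(snap.app, flag_issues(snap))  # ['slow transactions', 'CPU contention', 'memory pressure']
```

The point is not the specific thresholds but the framing: each signal is interpreted per application, so a capacity decision is tied to what users of that application actually experience.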