Workload behavior can significantly affect the appropriate design and optimization of your data center. It's important to understand various workload characteristics, how they change over time, and how they impact application performance. In this series, I've identified the top six things you should know about your virtualized workloads.
In Part 1 and Part 2, I looked at behaviors of virtual machines, the characteristics of the workloads they generate, and how specific metrics can accurately represent changing environments. In this article, I'll wrap up the discussion by looking at two ways to better understand your VMs and the infrastructure in which they live.
Location is everything
It's quite common to take performance metrics at face value. Most people do not question whether the right data is being measured, or whether it is being measured in the right way. Fewer still ask whether these metrics are presented in order of importance, or simply in the order that was easiest to collect and display.
But if data is collected in the wrong location, it can be deeply misleading, and misleading data produces incorrect conclusions. If you've ever sat behind home plate at a baseball game, you know what I mean. From that seat, every routine fly ball looks like a home run. The data you have is incomplete, and the location from which it was collected creates a false perception that leads to a faulty conclusion.
The same is true in your data center. Let's consider a common scenario where the analytics of a storage system show great performance in the storage array, yet VMs are still performing poorly. This is the result of measuring from the wrong location. The analytics may accurately show the latency of the components inside the array, but they do not account for latency introduced elsewhere in the storage stack (i.e., across the network and up into the VM). So while the data might technically be accurate for the limited viewpoint of the storage device, it leads to an incorrect conclusion. What really matters is the performance as seen by the VM, not by the storage system. This is why it simply does not make sense to perform VM analytics from within the storage device.
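To make the point concrete, here is a minimal sketch with made-up, illustrative numbers (not benchmarks) showing how latency measured inside the array can understate what the VM actually experiences once fabric, host queuing, and virtualization overhead are added:

```python
# Illustrative only: every number below is hypothetical.
# Latency contributed by each layer of the storage stack, in milliseconds.
stack_latency_ms = {
    "array internal (what array analytics report)": 0.8,
    "storage network / fabric": 0.6,
    "host HBA and kernel queuing": 1.5,
    "virtualization layer (vSCSI)": 0.4,
}

array_view = stack_latency_ms["array internal (what array analytics report)"]
vm_view = sum(stack_latency_ms.values())

print(f"Latency the array reports : {array_view:.1f} ms")
print(f"Latency the VM experiences: {vm_view:.1f} ms")
# The array's 0.8 ms looks excellent, yet the guest sees ~3.3 ms --
# the same I/O, measured from a different location.
```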
Measuring data inside the VM can be equally challenging. The method most operating systems use to collect data assumes the OS is the sole owner of the underlying resources, when in reality it may be time-slicing CPU clock cycles with other VMs. While the VM is the end "consumer" of resources, it does not know it is virtualized, and it cannot see performance bottlenecks introduced in the virtualization layer or in any of the physical components of the stack beneath it. In addition, metrics pulled from inside the guest OS lack consistency, because different operating systems measure things in different ways. For example, the disk latency reported by Windows "Perfmon" is often not measured in the same way, or at the same frequency, as the equivalent figures from Linux tools such as "iostat."
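As a rough illustration of the scheduling problem, consider one 20-second sampling window with assumed (not measured) numbers. Time a vCPU spends "ready" -- wanting to run but waiting for a physical core -- is invisible to the guest:

```python
# Hypothetical numbers for one 20-second (20,000 ms) real-time sample.
interval_ms = 20_000
cpu_run_ms = 12_000    # time the vCPU actually executed on a physical core
cpu_ready_ms = 4_000   # time the vCPU waited for a core (invisible to the guest)

guest_view = cpu_run_ms / interval_ms * 100   # roughly what the guest OS perceives
ready_pct = cpu_ready_ms / interval_ms * 100  # only the hypervisor can see this

print(f"Guest-reported CPU utilization: ~{guest_view:.0f}%")
print(f"CPU ready (scheduling wait):     {ready_pct:.0f}%")
# The guest looks comfortably below saturation, yet 20% of the interval was
# spent waiting for a physical core -- contention the guest cannot measure.
```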
The best answer to the above challenges is to pull data from within the hypervisor kernel. This ensures information is collected in a uniform, consistent way, and that the resulting data is meaningful and comparable across applications and storage hardware platforms. Only the hypervisor kernel is positioned to measure data in a way that accounts for every element of the virtualization stack.
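In a VMware environment, one hedged sketch of what pulling hypervisor-level counters can look like is querying vCenter's performance manager with pyVmomi. The hostname, credentials, and VM name below are placeholders, and this is an illustrative sketch rather than a complete tool:

```python
# Sketch: read a hypervisor-level counter (cpu.ready.summation) for one VM.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()             # lab use only
si = SmartConnect(host="vcenter.example.com",      # placeholder vCenter
                  user="administrator@vsphere.local",
                  pwd="***", sslContext=ctx)
try:
    content = si.RetrieveContent()
    perf = content.perfManager

    # Build a "group.name.rollup" -> counter-ID map.
    counters = {f"{c.groupInfo.key}.{c.nameInfo.key}.{c.rollupType}": c.key
                for c in perf.perfCounter}

    vm = content.searchIndex.FindByDnsName(None, "app-vm-01", True)  # placeholder VM
    metric = vim.PerformanceManager.MetricId(
        counterId=counters["cpu.ready.summation"], instance="")
    spec = vim.PerformanceManager.QuerySpec(
        entity=vm, metricId=[metric], intervalId=20, maxSample=15)   # real-time, 20 s

    for series in perf.QueryPerf(querySpec=[spec])[0].value:
        # cpu.ready is reported in ms per 20,000 ms sample, per vCPU instance.
        for ms in series.value:
            print(f"CPU ready: {ms} ms of 20,000 ms ({ms / 200:.1f}%)")
finally:
    Disconnect(si)
```

The point is not this specific snippet, but that the hypervisor's view already includes scheduling waits, kernel queuing, and device latency that neither the guest OS nor the storage array can see on its own.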
The big picture
There is no shortage of metrics in a modern virtualized environment. However, an abundance of metrics does not necessarily lead to an accurate, holistic understanding of your data center environment. Disparate metrics coming from a variety of sources may or may not agree with one another. As a result, users are forced to spend time reconciling what these metrics mean and how they impact one another.
Some tools attempt to distill this plethora of data into a few metrics that help provide insight into performance or configuration issues. Examples include CPU utilization, queue depths, storage latency, and storage IOPS. However, it is quite common to misinterpret these metrics when they are viewed in isolation.
Just as an out-of-tune cello can affect the sound of an entire symphony, a noisy VM can adversely impact your entire environment. But the wrong analytics tools can make this issue hard to identify and correct. For example, a VM generating heavy I/O will often show lower CPU activity, because its virtual CPUs sit idle waiting on storage. That is the exact opposite of what most tools look for, which is why viewing data in isolation is problematic.
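To illustrate why correlating metrics matters, here is a small, hypothetical sketch that flags the I/O-heavy suspect even though its CPU looks quiet. All VM names and samples are invented:

```python
# Hypothetical per-VM samples on one shared datastore (invented numbers).
# A tool watching CPU alone would look straight past vm-batch-03.
vms = {
    #              CPU %   IOPS    latency the VM observes (ms)
    "vm-web-01":   (72,     400,   14.0),
    "vm-db-02":    (65,     900,   15.5),
    "vm-batch-03": (18,   9_500,   16.2),  # quiet CPU, dominating the storage queue
}

total_iops = sum(iops for _, iops, _ in vms.values())
for name, (cpu, iops, lat) in vms.items():
    share = iops / total_iops * 100
    flag = "  <-- likely noisy neighbor" if share > 50 and cpu < 30 else ""
    print(f"{name}: CPU {cpu:3d}%, {iops:5d} IOPS ({share:4.1f}% of datastore){flag}")
```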
The weight of impact also varies between metrics. VMs consuming large amounts of CPU generally affect only the other VMs on the same host. The impact of storage-based noisy neighbors, however, goes well beyond a single host and can affect every VM and host that uses the shared storage system.
A holistic understanding of the data center is not unlike one of those mosaic puzzles made up of smaller pictures. Each individual data point matters on its own, but the greater value lies in how it contributes to the broader understanding of the environment. You need to be able to start from a broad, big-picture view of the environment and drill down to the root cause of an issue, or start at the level of an underperforming VM and see how and why it is being impacted by others.
Many trends in the data center focus on operational simplicity. That is a worthy goal, but it often runs counter to the other challenges data center architects and application administrators face: optimization and troubleshooting. Obscuring meaningful metrics does not make problems go away; it simply makes your job that much harder. Take a data-driven approach to determine whether your data center is running well.