Software-defined data centers (SDDCs) represent a major shift in the way IT professionals think about their data centers. Not long ago, legacy data centers relied mostly on hardware: vendors built point products for individual customers, and managing the resulting myriad of devices required a great deal of proprietary hardware. Hardware products were relatively inflexible, and the software that could have made them flexible and adaptable played a supporting role at best.
SDDCs, on the other hand, present a software-centric, policy-managed solution that avoids many of the downsides of legacy data centers without losing sight of the hardware the system still needs in order to run. Most hardware within an SDDC looks quite different from hardware in traditional environments because it is typically commodity. If an SDDC does contain proprietary hardware, the software leverages it to carry out important functions. In the world of hyperconvergence, this kind of hardware essentially becomes part of the data center’s standard operations. Because the hardware is identical (rather than unique to each device), it scales well as new appliances are added to the data center.
That said, hardware within an SDDC does not always operate as intended; hardware components sometimes have bugs that can only be detected through use. In a hyperconverged environment, the software layer is built with the understanding that hardware can, and ultimately will, fail. The software-based architecture and policies are designed to anticipate and handle any hardware failures that occur.
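As a rough illustration of that design principle, the Python sketch below (with hypothetical node names and a simulated failure) writes each object to multiple nodes so that losing any single disk or node does not lose data; production SDDC software applies the same idea through replication or erasure-coding policies.

```python
# Toy sketch of "design for failure": write every object to several nodes so a
# single hardware failure cannot lose data. Node names and the simulated
# failure below are hypothetical.

REPLICATION_FACTOR = 3

class Node:
    def __init__(self, name: str):
        self.name = name
        self.alive = True
        self.store: dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> bool:
        if not self.alive:
            return False          # simulated hardware failure
        self.store[key] = value
        return True

def replicated_put(nodes: list[Node], key: str, value: bytes) -> int:
    """Write to up to REPLICATION_FACTOR healthy nodes; return copies stored."""
    copies = 0
    for node in nodes:
        if copies == REPLICATION_FACTOR:
            break
        if node.put(key, value):
            copies += 1
    return copies

if __name__ == "__main__":
    cluster = [Node(f"node-{i}") for i in range(5)]
    cluster[1].alive = False      # one node has failed; the write still succeeds
    stored = replicated_put(cluster, "vm-disk-block-42", b"\x00" * 4096)
    print(f"stored {stored} copies despite a failed node")
```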
Having failsafes built into the software is not a reason, however, to neglect thorough testing of each piece of hardware to ensure that it meets quality specifications before it is deployed. Testing requires a significant commitment of technical expertise, time, and infrastructure, but it is an absolutely necessary process, one that can surface “unexpected behavior”: a mistake or bug that could severely impact an organization’s entire data center.
Mistakes do happen. For example, a disk controller could silently throw bits into the trash without ever writing them to the storage media. That’s an accident. But if the engineering team is testing disk controllers and catches this “unexpected behavior” before deployment, that’s not an accident; it is the payoff of a deliberate, comprehensive platform test phase, which lets the team identify problems before they turn into disasters.
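A platform test for exactly that failure mode can be as simple as writing known data, forcing it to media, and reading it back to verify. The sketch below is a minimal, hypothetical Python version of such a write/read/verify pass; a real burn-in would target a raw test device with O_DIRECT (or drop the page cache) so reads come from the media rather than the OS cache, and would run far longer.

```python
import hashlib
import os

# Minimal write/read/verify pass: write deterministic pseudo-random blocks,
# flush them to media, then read them back and compare checksums. Any mismatch
# means data was silently dropped or corrupted between the controller and the
# media. Path, block size, and span are placeholder values.
BLOCK_SIZE = 1 << 20          # 1 MiB blocks
BLOCK_COUNT = 256             # 256 MiB test span; scale up for real burn-in

def verify_device(path: str, seed: bytes = b"qa-pass-1") -> list[int]:
    """Return the indices of blocks that failed verification."""
    expected = []
    with open(path, "wb") as dev:
        for i in range(BLOCK_COUNT):
            # Derive a unique, reproducible block so reads can be checked
            # without holding all of the test data in memory.
            digest = hashlib.sha256(seed + i.to_bytes(8, "big")).digest()
            block = (digest * (BLOCK_SIZE // len(digest) + 1))[:BLOCK_SIZE]
            expected.append(hashlib.sha256(block).hexdigest())
            dev.write(block)
        dev.flush()
        os.fsync(dev.fileno())    # force the controller to commit to media

    failures = []
    with open(path, "rb") as dev:
        for i in range(BLOCK_COUNT):
            block = dev.read(BLOCK_SIZE)
            if hashlib.sha256(block).hexdigest() != expected[i]:
                failures.append(i)
    return failures

if __name__ == "__main__":
    bad = verify_device("/tmp/qa_scratch.bin")
    print("verification failures:", bad if bad else "none")
```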
In addition to preventing problems before they happen, it's important to understand the various ways hardware might fail so that the software layer can be programmed to respond appropriately. It is not always obvious at the application layer that hardware caused an issue, because this class of problem is difficult to root-cause after the fact. If a virtual machine will not boot, the fault could originate in the guest file system, the guest drivers, the hypervisor, the storage system, the drive controller, the HDD, or the SSD. Without proper guidance, the software layer cannot perform its failsafe duties.
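One way to give the software layer that guidance is to encode the stack as an ordered list of health checks and walk it top-down until a check fails. The Python sketch below is hypothetical; the layer names mirror the list above, and the check functions stand in for real probes such as parsing hypervisor logs, reading SMART data, or querying controller error counters.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical triage helper: walk the stack top-down and report the first
# layer whose health check fails.

@dataclass
class Layer:
    name: str
    check: Callable[[], bool]   # returns True when the layer looks healthy

def triage(layers: list[Layer]) -> str:
    for layer in layers:
        if not layer.check():
            return f"first failing layer: {layer.name}"
    return "all layers healthy; suspect the application itself"

if __name__ == "__main__":
    stack = [
        Layer("guest file system", lambda: True),
        Layer("guest drivers",     lambda: True),
        Layer("hypervisor",        lambda: True),
        Layer("storage system",    lambda: True),
        Layer("drive controller",  lambda: False),  # simulated fault
        Layer("HDD/SSD media",     lambda: True),
    ]
    print(triage(stack))
```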
While it may sound disconcerting that hardware defects are extremely common, IT professionals know that hardware failures are not a matter of if but of when. By implementing a comprehensive hardware QA program, larger problems can be avoided, and the software can be taught to handle the problems that do arise more intelligently. This is why testing is not optional; it is a requirement in any data center.