As flash technology has matured, our understanding of how to speed up storage has evolved immensely, but slowly. It took the industry a while to realize that we'd gone from balanced systems where 1 Gigabit Ethernet networks were fast enough for hard-drive storage to an environment where the array controllers, RAID cards, and networks were all way too slow to keep up with solid-state drives, each delivering 40K IOPS or more. Vendors have tried to keep RAID arrays relevant with hybrid arrays that replaced some HDDs with SSDs, but it's become clear that we need to go in a different direction.
In this blog, I'll focus on solutions that avoid the controller/network bottleneck. This involves storage appliances with perhaps 10 or 12 drives each, looking suspiciously like standard COTS servers in most cases. Even so, 12 SATA SSDs, the slowest class of SSD, can generate 500K to 1 million IOPS while delivering 4 gigabytes per second or more in streaming mode. Using high-end drives can take that to a stratospheric 5 million IOPS and 12+ GBps.
Clearly, the prevalent networking solution, 10 GbE, is stretched way beyond its limits. The result is a hunt for economical and rapidly available solutions to tackle the new bottleneck and boost performance. The industry has reacted in a truly spectacular way to this challenge, and the next few years will bring network solutions that ease the storage bottleneck created by SSDs.
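To put rough numbers on that mismatch, here is a minimal back-of-the-envelope sketch in Python. It assumes ideal links with no protocol overhead, and the function name `links_needed` is purely illustrative; the throughput figures are the ones quoted above.

```python
import math

def links_needed(throughput_gbytes_per_s: float, link_gbits_per_s: float) -> int:
    """Links required to carry a streaming load, ignoring protocol overhead."""
    required_gbits = throughput_gbytes_per_s * 8  # convert bytes to bits
    return math.ceil(required_gbits / link_gbits_per_s)

# Streaming figures quoted above: 4 GBps (12 SATA SSDs), 12 GBps (high-end drives).
for label, gbytes in [("12 SATA SSDs", 4.0), ("12 high-end SSDs", 12.0)]:
    count = links_needed(gbytes, 10.0)
    print(f"{label}: {gbytes * 8:.0f} Gbps needs {count} x 10 GbE links")
```

Under these idealized assumptions, the SATA appliance already needs four 10 GbE links and the high-end appliance needs ten, and real-world overhead only makes that worse.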
Let’s look at what is happening in today’s networks.
10 GbE
10 Gigabit Ethernet has now become the mainstream Ethernet solution for the data center. We have server motherboards with two 10 GbE local area network connections on board in the CPU chipset. It’s certainly a great improvement over 1 GbE, but virtualized servers and cloud systems still suffer from a network choke-point and we have to prepare for the onslaught of containers, which will multiply instance counts per server.
40 GbE
10 GbE still leaves storage appliances severely throttled, so the technology has evolved to 40 GbE, which aggregates four 10 GbE connections into a quad link and has helped to stave off the problem. These quads stripe data blocks over all four connections, so they reduce block transfer time somewhat. Most appliances, at least in the performance rather than bulk storage/archiving market, use four of these connections or more to move data around.
Topologically, 40 GbE is deployed as a backbone technology, usually between storage and top-of-rack switches, which convert it to 10 GbE for linking to servers.
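The effect of striping on block transfer time can be sketched with simple arithmetic. This is an idealized model, not a measurement: it assumes the block splits evenly across lanes and ignores framing, switching latency and software overhead, which is why the real-world gain is only "somewhat". The 64 KiB block size and the function name are hypothetical choices for illustration.

```python
def transfer_time_us(block_bytes: int, lane_gbits_per_s: float, lanes: int = 1) -> float:
    """Ideal time in microseconds to move one block striped evenly across lanes."""
    bits_per_lane = block_bytes * 8 / lanes
    return bits_per_lane / (lane_gbits_per_s * 1e9) * 1e6

block = 64 * 1024  # hypothetical 64 KiB block
single = transfer_time_us(block, 10.0, lanes=1)  # one 10 GbE connection
quad = transfer_time_us(block, 10.0, lanes=4)    # 40 GbE as four striped lanes

print(f"1 x 10 GbE: {single:.1f} us, 4 x 10 GbE: {quad:.1f} us")
```

In this model the wire time drops by a factor of four, but fixed per-transfer costs are untouched, so small blocks benefit far less than large ones.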
Fibre Channel
Fans of traditional SANs are now deploying 16-gigabit-per-second Fibre Channel, and there are plans to extend to faster speeds. Ethernet link speeds have jumped ahead of FC, however, while the shrinking of the external array market is casting a cloud over the FC approach.
InfiniBand
InfiniBand uses the same core electronics as Ethernet, and in fact Mellanox now sells a chip that can handle either protocol. IB has a niche in low-latency applications and high-performance computing, but the addition of RDMA to Ethernet is challenging that position.
RDMA
Remote Direct Memory Access is a method where data is transferred directly between the main memories of two systems. It reduces operating system overhead enormously and is the fastest way to move data to and from networked storage. Both InfiniBand and Ethernet support high-speed links with RDMA.
25 GbE/100 GbE
Reacting to real customer needs, especially from the massive cloud storage providers, the IEEE fast-tracked a 25 GbE standard while development ran in parallel. Electronic design of the connections is the difficult development task here, but the need to accelerate all fast interfaces that use the same electronics, from PCIe to InfiniBand to Ethernet, compressed the development cycle to a couple of years.
25 GbE and 100 GbE are already shipping and should ramp up quickly for clouds and virtual clusters.
Looking ahead, here are some networking advances expected in the relatively near future.
50 GbE/200 GbE
The next step, to 50 GbE and 200 GbE, is already in development. We can expect delivery of samples in 2019. The bad news, of course, is that servers will have terabyte-per-second DRAM and large core counts, so the “beast” remains insatiable!
8-link solutions
Wider connections are another way to speed up network backbones. Several vendors are working on versions with 8 or even 12 aggregated connections. That takes 50 GbE links to 400 GbE and 600 GbE, respectively.
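The lane arithmetic behind these speed grades is easy to tabulate. The sketch below simply multiplies per-lane rate by lane count, using the combinations mentioned in this article; the helper name `aggregate_gbe` is hypothetical, and the raw product ignores encoding and protocol overhead.

```python
def aggregate_gbe(lane_gbits_per_s: int, lanes: int) -> int:
    """Nominal speed of a link built from identical aggregated lanes."""
    return lane_gbits_per_s * lanes

# (per-lane rate in Gbps, lane count) pairs taken from the text
combos = [(10, 4), (25, 4), (50, 4), (50, 8), (50, 12)]
for lane, count in combos:
    print(f"{count} x {lane} GbE lanes = {aggregate_gbe(lane, count)} GbE")
```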
100 GbE links
The next step in evolution will be the 100-gigabit single connection. This requires some strenuous engineering to achieve such speed over meaningful distances and is essentially the "Holy Grail of Networking." It's expected sometime after 2020.
This technology will give the industry its first economical terabit link.
NVMe over Fabrics
NVMe over Fabrics (NoF) uses the new Non-Volatile Memory Express (NVMe) protocol to implement RDMA over a variety of connections, including Ethernet, InfiniBand and PCIe, the last being a new way to build ultra-fast small clusters. The Fibre Channel industry is also looking to replace its SCSI protocol with NVMe and RDMA.
NVMe goes a step further than simple RDMA by consolidating system interrupts and using advanced rotating queue techniques to reduce overhead even further.
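To make the queueing idea concrete, here is a minimal sketch of a circular (ring) queue in Python, which is my reading of "rotating queue" here. It is an illustration only, not the NVMe specification's actual data layout or doorbell mechanism: the class name `RingQueue` and its methods are hypothetical. The `drain` method stands in for a handler that services many completions in one pass rather than taking one interrupt per I/O.

```python
class RingQueue:
    """Fixed-size circular queue with head/tail indices (one slot kept empty)."""

    def __init__(self, depth: int):
        self.slots = [None] * depth
        self.head = 0  # next entry the consumer will take
        self.tail = 0  # next free slot for the producer

    def is_empty(self) -> bool:
        return self.head == self.tail

    def is_full(self) -> bool:
        return (self.tail + 1) % len(self.slots) == self.head

    def submit(self, entry) -> bool:
        """Producer side: add an entry, or report that the queue is full."""
        if self.is_full():
            return False
        self.slots[self.tail] = entry
        self.tail = (self.tail + 1) % len(self.slots)  # wrap around
        return True

    def drain(self) -> list:
        """Consumer side: take every pending entry in a single pass."""
        taken = []
        while not self.is_empty():
            taken.append(self.slots[self.head])
            self.head = (self.head + 1) % len(self.slots)
        return taken


q = RingQueue(depth=4)  # holds at most 3 entries, since one slot stays empty
accepted = [q.submit(f"read-{i}") for i in range(5)]
completed = q.drain()
print(accepted, completed)
```

Because the indices wrap around, the same fixed block of memory is reused indefinitely, and producer and consumer only need to exchange updated head and tail positions.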
It is still anyone’s guess as to whether Ethernet takes most of the LAN and storage network business, pushing Fibre Channel and InfiniBand into small niches. The ubiquity of Ethernet as a cloud network will put tremendous pressure on the industry to unify on it as a single fabric, however, saving administration costs and reducing complexity, while likely being substantially cheaper due to huge volume and broader competition.
NoF over PCIe is a dark horse. Its lack of infrastructure and of clear differentiation from Ethernet NoF could determine its success or failure. Most likely, NoF will displace InfiniBand.