Some astounding products were demonstrated at the recent Flash Memory Summit in Santa Clara, Calif. Seagate announced a 60 TB solid-state drive in a 3.5-inch form factor, while Toshiba and Samsung unveiled plans for 100 TB 2.5-inch solid-state drives. In addition to these high-capacity SSDs, vendors also released several new super-fast SSDs, including a Seagate drive promising 10 gigabytes per second, well ahead of anything shipping today.
The bottom line is that the hard-disk drive can now be definitively declared obsolete, with flash chip supply, and hence SSD prices, governing the rate at which hard-disk drives are replaced on vendors' shelves.
At FMS, there was a lot of discussion about which interface should be used in these fast and high-capacity SSDs. Traditional SAS and SATA are too slow for the fast drives, due to their reliance on the three-decade-old SCSI IO stack. The industry is currently embracing NVMe over PCIe for the fast drives. This supports RDMA access, relieving the host system of most of the overhead associated with moving data to storage.
The problem with this model is that in both hyperconverged servers and storage appliances, data comes in over PCIe from the SSDs and is then re-transmitted over the LAN, which consumes roughly another 50 to 100 percent of the drive-access bandwidth on PCIe. With PCIe lane counts typically limited to 48 or fewer, the configuration of a storage appliance or hyperconverged box is quite constrained, to perhaps four to six drives per appliance or server.
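To make the arithmetic concrete, here is a rough lane-budget sketch in Python; the lane reservations and per-lane throughput are illustrative assumptions, not vendor specifications.

# Back-of-the-envelope PCIe lane budget for a storage appliance or
# hyperconverged node. All figures here are illustrative assumptions.

GBPS_PER_LANE = 1.0        # roughly 1 GB/s usable per PCIe 3.0 lane
TOTAL_LANES = 48           # the typical upper bound cited above
LANES_PER_DRIVE = 4        # common x4 NVMe attachment
MISC_LANES = 8             # boot device, management, other peripherals

def drives_supported(nic_lanes):
    """Drives that fit after reserving lanes for the NIC and other devices."""
    free_lanes = TOTAL_LANES - nic_lanes - MISC_LANES
    return free_lanes // LANES_PER_DRIVE

# Because every byte served to the cluster crosses PCIe twice (drive to
# host, then host to NIC), the NIC reservation has to scale with the
# drive bandwidth, which quickly squeezes the drive count.
for nic_lanes in (8, 16, 24):
    n = drives_supported(nic_lanes)
    print(f"{nic_lanes:2d} NIC lanes -> room for {n} x4 NVMe drives "
          f"({n * LANES_PER_DRIVE * GBPS_PER_LANE:.0f} GB/s of raw drive bandwidth)")

Size the NIC reservation to keep pace with the drives and the count lands right in that four-to-six range.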
Naturally, effective bandwidth, as opposed to raw bandwidth, is a serious issue for system performance. Alternatives have been proposed to overcome the limitations of the PCIe approach. One solution is to interpose a PCIe switch between the drives and the server motherboard, then connect all the servers using PCIe. This works fine, but there is little infrastructure built around PCIe as a fabric, and many open questions about scalability beyond a small cluster.
InfiniBand, used heavily in high-performance computing because of its proven RDMA track record, is another viable option. NVMe over IB is one of the fabric proposals being considered under the NVMe over Fabrics initiative at SNIA. This would allow drives to connect directly to a sizable cluster of servers, with the local server hosting the drives being, in effect, just another IB connection. This shared-drive topology would meet the goal of not interposing the hosting server in the network transfer path.
The key in this structure is RDMA. Ethernet is now going toe-to-toe with IB on speeds and feeds for RDMA, and there is a consensus that RDMA over Ethernet will gain a sizable lead among RDMA stacks over the next half-decade. This makes NVMe over Ethernet very attractive as a connection scheme for fast drives, since cloud builders end up with a single fabric to manage, a real positive as we move to automated, software-defined networks.
Whether the fabric is IB or Ethernet, we end up with fast SSDs connected directly to the cloud fabric, which is great for performance in the shared-storage models of hyperconverged systems and modern storage appliances.
One argument voiced at FMS was that this type of architecture would slow down local access compared with native PCIe. That really isn't the case, since RDMA operates much like the DMA already used for a server's PCIe-connected peripherals. There is some overhead associated with the protocol, but any attempt to extend PCIe into a large cluster fabric runs into the same issue.
To confuse things a bit, Intel has touted Omni-Path as a connection scheme for its 3D XPoint drives. This is an Intel-proprietary 100 Gbps quad-lane link, possibly based on "silicon photonics." Intel is pushing it as a cluster fabric, so it would meet the needs of the shared-drive topologies. However, Omni-Path requires substantial host involvement in transfers, which slows real-world performance and eats up a lot of compute power. It also lacks an infrastructure ecosystem. Both issues could be resolved, given time, but the effective incumbency of Ethernet makes winning the cluster-fabric stakes a real challenge for Intel.
To complicate the story further, we have those huge-capacity drives to optimize. There's really no reason they have to be SAS/SATA, or even NVMe. While it is feasible to make these drives run at gigabyte-per-second speeds, they will likely use QLC flash when they come to market in the first half of 2017, with much less performance than the flash used in fast drives. This will, however, move price per terabyte into the bulk-storage HDD range, assuming foundry growth arrives on schedule and the product is not exclusive to OEMs.
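The density side of that price argument is simple arithmetic: QLC stores four bits per cell against TLC's three. A minimal sketch, with a purely hypothetical TLC cost per terabyte, shows how the gain flows through.

# QLC density advantage; the dollar figure is a hypothetical placeholder,
# not a market price.

TLC_BITS_PER_CELL = 3
QLC_BITS_PER_CELL = 4

density_gain = QLC_BITS_PER_CELL / TLC_BITS_PER_CELL - 1
print(f"Extra capacity per die from QLC: {density_gain:.0%}")   # roughly 33%

tlc_cost_per_tb = 90.0                      # assumed $/TB for TLC flash
qlc_cost_per_tb = tlc_cost_per_tb / (1 + density_gain)
print(f"Implied QLC cost: ${qlc_cost_per_tb:.0f}/TB vs ${tlc_cost_per_tb:.0f}/TB for TLC")

The sketch captures only the bits-per-cell effect; closing the rest of the gap with bulk HDDs depends on the foundry growth mentioned above.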
The focus on Ethernet fabrics suggests that moving from SAS/SATA to an Ethernet interface for these bulk SSDs would unify the IO fabric structure. Ethernet for bulk SSDs doesn't have to be RDMA-based. In fact, WD Labs recently demonstrated a 504-drive Ceph cluster in which Ceph was installed directly on Ethernet drives with dual 10 GbE interfaces. This type of architecture is very flexible: it could run Scality's Ring software, OpenStack Swift or Caringo's Swarm, and it fits software-defined storage models handily.
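For a sense of scale, the raw numbers from that demonstration are easy to work out; the sketch below uses only the drive count and port speeds quoted above, plus an assumed 3x replication factor.

# Aggregate-bandwidth sketch for the 504-drive Ethernet-drive Ceph
# cluster described above. The replication factor is an assumption.

DRIVES = 504
PORTS_PER_DRIVE = 2          # dual 10 GbE interfaces per drive
GBITS_PER_PORT = 10

raw_gbits = DRIVES * PORTS_PER_DRIVE * GBITS_PER_PORT
raw_gbytes = raw_gbits / 8

# With 3x replication, each client write lands on three drives, so the
# usable write bandwidth is roughly a third of the raw network figure.
REPLICATION = 3
usable_write_gbytes = raw_gbytes / REPLICATION

print(f"Raw network bandwidth into the drives: {raw_gbytes:.0f} GB/s")
print(f"Usable write bandwidth at {REPLICATION}x replication: ~{usable_write_gbytes:.0f} GB/s")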
It's possible that even the slow bulk SSDs will unify on NVMe over Ethernet, but NVMe may be artificially limited to "enterprise" drives as a way to command premium pricing. We'll see if one of the new SSD vendors breaks ranks with the old-guard HDD suppliers on this issue.