As a reformed network geek who has turned to the dark side to follow the storage market, I've been especially intrigued by how data center Ethernet has evolved into something fundamentally different not only from the 10-Mbps shared-media Ethernet of old but, more significantly, from the direction campus Ethernet has taken. As we bring DCB, TRILL-like Layer 2 multipathing and the like into the data center network, I'd like to propose that many of us can eliminate the expensive modular switch that has typically served as the network core.
We have traditionally built networks around a pair of massively redundant core switches, with access and aggregation layers of switches funneling traffic up to the core. This design was driven by the limitations of previous generations of smaller switches, the spanning tree protocol and yesterday's traffic patterns.
Just a few years ago, much of the traffic in an enterprise was users running two-tier client/server applications and accessing file services. This resulted in large volumes of so-called north-south traffic between user PCs, external systems and servers. Today, much of the traffic is server-to-server, or east-west, as Web and application servers access databases. Virtualization adds to the east-west traffic pattern with live migration traffic and virtual desktops.
If we're building a network for east-west traffic, why are we connecting ToR (top of rack) switches through oversubscribed uplinks to the core? If you have 100 to 300 or so servers with dual 10-Gbps connections, you could just build a full mesh of 48-, 60- or 96-port ToR switches.
Let's take a full mesh of eight 60-port switches with 20-Gbps interconnects as an example. Each switch would use 14 10-Gbps ports for connections to its seven peers, providing 140 Gbps of fabric bandwidth. That would leave 46 ports on each switch for server and storage connections, which would easily support 150 or so servers with dual connections and plenty of storage connections to boot. Each server would be no more than two switch hops from any other, and the interswitch links would be only about 3-to-1 oversubscribed.
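For those who like to check the arithmetic, here's a minimal back-of-the-envelope sketch of that mesh. The figures (8 switches, 60 ports, 2 x 10-Gbps links per pair of peers) are the assumptions from the example above, not anything a vendor publishes.

```python
# Port and oversubscription math for a full mesh of ToR switches.
# Assumptions from the example: 8 switches, 60 ports each, all 10 Gbps,
# and each pair of switches joined by 2 x 10-Gbps links (20 Gbps).

switches = 8
ports_per_switch = 60
port_gbps = 10
links_per_peer = 2                                   # 2 x 10 Gbps = 20 Gbps interconnect

mesh_ports = (switches - 1) * links_per_peer         # 14 ports per switch to peers
fabric_gbps = mesh_ports * port_gbps                 # 140 Gbps of fabric bandwidth
edge_ports = ports_per_switch - mesh_ports           # 46 ports left for servers/storage
oversubscription = (edge_ports * port_gbps) / fabric_gbps

print(f"Mesh ports per switch: {mesh_ports}")
print(f"Fabric bandwidth per switch: {fabric_gbps} Gbps")
print(f"Edge ports: {edge_ports} per switch, {edge_ports * switches} across the fabric")
print(f"Oversubscription: {oversubscription:.1f}:1")  # roughly 3.3-to-1
```

The 368 edge ports across the fabric are what let it carry 150 dual-connected servers (300 ports) with room left over for storage.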
A more conventional design would uplink all the ToR switches to a pair of core switches. If we use 40-Gbps uplinks, we boost the oversubscription level from 3-to-1 to 6.5-to-1 (52-to-8) and add another switch hop to the data path between any two servers connected to different switches. Using dedicated storage switches, also connected to the core, would further stress the uplinks. We might save a few bucks on ToR switches, but we'd have to spend several times that much on line cards for the core, to say nothing of the cost of a pair of Nexus 7000-class core switches.
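The 6.5-to-1 figure falls out of the same kind of arithmetic. This sketch assumes each 60-port ToR gets two 40-Gbps uplinks (one to each core), which consume the equivalent of eight 10-Gbps ports of capacity; your uplink count may differ.

```python
# Same 60-port ToR switch, but uplinked to a pair of core switches.
# Assumption: two 40-Gbps uplinks per ToR, one to each core switch.

ports_per_switch = 60
port_gbps = 10
uplink_gbps = 2 * 40                                  # 80 Gbps toward the core

uplink_port_equiv = uplink_gbps // port_gbps          # the "8" in 52-to-8
edge_ports = ports_per_switch - uplink_port_equiv     # the "52" in 52-to-8
oversubscription = (edge_ports * port_gbps) / uplink_gbps

print(f"Edge ports: {edge_ports}, uplink capacity: {uplink_gbps} Gbps")
print(f"Oversubscription: {oversubscription:.1f}:1")  # 6.5-to-1
```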
As you add Layer 2 multipath-capable ToR switches, consider using the fabric itself as the core. It could save you more than a few dollars while providing reliability and performance comparable to a more conventional design. Of course, full meshes can only scale so large. Each switch added to the mesh requires more interswitch link ports, so you can reach the point where adding another switch actually reduces the number of usable ports in the fabric. So if you're building a network for 1,000 servers, those core switches with high port counts might pay off.
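To see where that crossover sits, here's a rough sketch under the same assumptions as before (60-port switches, two 10-Gbps ports to each peer); with other port counts or interconnect widths the peak moves.

```python
# Where does a full mesh stop paying off? Usable (non-mesh) ports across
# the whole fabric as the switch count grows, assuming 60-port switches
# and 2 ports (20 Gbps) to each peer.

ports_per_switch = 60
links_per_peer = 2

def usable_ports(n_switches: int) -> int:
    """Server/storage ports left across the fabric with n_switches in the mesh."""
    mesh_ports = (n_switches - 1) * links_per_peer
    return n_switches * max(ports_per_switch - mesh_ports, 0)

for n in range(2, 25):
    print(n, usable_ports(n))

# Under these assumptions, usable ports peak around 15-16 switches
# (roughly 480 ports) and then decline as interswitch links eat the
# fabric -- about the point where a big core switch starts to make sense.
```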
So would you build a data center network without a core? Comments solicited.
Disclaimer: Brocade, which makes switches that could be used to build such a mesh network, is a client of DeepStorage.net.