The world of today's data center is focused on moving bits back and forth as quickly as possible without interruption. We've taken every measure to ensure that data flows as freely as it can across the 10, 40 or even 100 Gbit links connecting clusters of virtual servers. We want to crunch, analyze and query data instantly. All of this works as well as it can until we start looking under the hood. There we discover the root of all evil: spanning tree.
802.1D spanning tree (STP) and its progeny are elegant solutions for a bygone age. Radia Perlman created STP at a time when most people had never heard of an Ethernet bridge. It prevented people from creating ignorance-based disasters. As time marched on, we created new additions to STP to decrease convergence time, improve recovery from catastrophic configuration events and even take into account changes in technology.
However, at its core, STP is still based on the idea that there should be only one active path to the root bridge. By blocking every redundant link, STP effectively reduces the usable bandwidth between any two devices in a data center to a single Layer 2 path.
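To make the cost of that design concrete, here is a minimal Python sketch that counts how many links a loop-free tree leaves idle in a small fabric. Everything in it is an assumption for illustration: the two-spine, four-leaf topology, the switch names, and the use of a plain breadth-first search in place of real STP root election and port-cost rules. It is not how any vendor implements STP; it only shows the arithmetic of blocking redundant links.

```python
from collections import deque

def build_leaf_spine(spines, leaves):
    """Return an adjacency dict where every leaf connects to every spine."""
    adj = {node: set() for node in spines + leaves}
    for leaf in leaves:
        for spine in spines:
            adj[leaf].add(spine)
            adj[spine].add(leaf)
    return adj

def spanning_tree_links(adj, root):
    """BFS from the root, keeping only the first link that reaches each switch.

    Simplification: real STP picks links by bridge ID and path cost.
    """
    active = set()
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in sorted(adj[node]):
            if neighbor not in seen:
                seen.add(neighbor)
                active.add(frozenset((node, neighbor)))
                queue.append(neighbor)
    return active

def count_shortest_paths(adj, src, dst):
    """Count equal-cost shortest paths between two switches (hop-count metric)."""
    dist = {src: 0}
    paths = {src: 1}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                paths[neighbor] = paths[node]
                queue.append(neighbor)
            elif dist[neighbor] == dist[node] + 1:
                paths[neighbor] += paths[node]
    return paths.get(dst, 0)

# Hypothetical fabric: 2 spines, 4 leaves, 8 physical links in total.
spines = ["spine1", "spine2"]
leaves = ["leaf1", "leaf2", "leaf3", "leaf4"]
fabric = build_leaf_spine(spines, leaves)

all_links = {frozenset((a, b)) for a in fabric for b in fabric[a]}
active = spanning_tree_links(fabric, root="spine1")
blocked = all_links - active

print(f"physical links: {len(all_links)}")  # 8
print(f"forwarding:     {len(active)}")     # 5
print(f"blocked:        {len(blocked)}")    # 3
print("equal-cost paths leaf1 -> leaf2:",
      count_shortest_paths(fabric, "leaf1", "leaf2"))  # 2
```

In this toy fabric, three of the eight links carry no traffic once the tree is built, and leaf-to-leaf traffic is pinned to one spine even though two equal-cost paths physically exist. The ratio gets worse as spines are added, since a tree over N switches can never use more than N - 1 links.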
Perlman has gone on to create the true successor to her original design. With the help of some brilliant people inside the IETF, she created RFC 6325 -- the base specification for Transparent Interconnection of Lots of Links (TRILL). TRILL allows for much-needed advances like multipathing and rapid failover when a link fails. And TRILL can do it without any human interaction. So why aren't we using TRILL?
Network Computing contributor Ethan Banks recently posted a possible answer on Twitter: "If we want STP to die, replacement protocols can't continue to be license-only features." Ignoring the issues with hardware refreshes and suitability of TRILL to non-data center environments for the moment, I think Ethan has hit upon a very critical point in the development of TRILL and its vendor-specific kin.
Cisco has a TRILL-based solution in FabricPath. Brocade uses TRILL as the basis for VCS. Other vendors are developing products in much the same way: Use TRILL as a baseline to create the Layer 2 routing bridge (RBridge) underlay network, build proprietary functionality on top of it, then charge a license fee for the whole kit. Essentially, make people pay big bucks to get rid of spanning tree.
Ask anyone and they'll tell you STP is evil. Perlman resolved to replace it after reading the story of a hospital network core melting down due to STP misconfiguration and impacting caregivers. Every network engineer has a story of forgetting to configure it on a link and watching the ensuing bridging loop destroy a network. If STP is so bad, why are we being charged to replace it?
I understand there are significant development costs involved in creating FabricPath, VCS and products just like them. Vendors want to ensure that their development work is adequately compensated. But if I have the choice of spending thousands of dollars and hundreds of hours rearchitecting my network or just enabling MSTP or RSTP and calling it a day, I'll pick the cheaper of two evils every time.
I've brought this up before with people who make these decisions. Why not reduce or outright eliminate the licensing requirement for TRILL or TRILL-like features to hasten the demise of STP? The response was nothing if not predictable: "We have demo licenses to allow customers to try these features out and see how they can benefit." Anyone ever redesigned an entire network for the sake of enabling a time-limited demo feature?
Spanning tree reached the end of its useful life many years ago. Newer, better protocols have been developed, and hardware has been created that can run both STP and TRILL without the need to change hardware modules on the fly.
The only thing that is still stuck in the past is the desire to license these features and derive revenue from them. If DEC had charged extra for STP in the beginning, I'm sure no one would have purchased it. It's only when a technology is included for everyone at no additional cost that engineers and architects begin to test it and find a way to use it to replace existing designs.
SDN and the coming wave of overlays and underlays won't change the need to ensure your network is built on a solid foundation. Tunnels fall apart when they are built on quicksand. Building your data center on the quagmire that is STP is going to result in pain and expensive reconfiguration down the road. I think we need to take this changing of the guard as a sign that we need to replace STP in our data centers as well.
What do you think? Is it past time to replace STP? Will you pay to replace it? Or should vendors make the superior protocols freely available to hasten the demise of STP and the rise of TRILL?