Expanding on its DIY approach to networking, Facebook today said its engineers built a platform for network troubleshooting. The system, dubbed NetNORAD, was designed to overcome problems the company was having with traditional troubleshooting methods.
In a detailed blog post, Facebook Network Engineer Petr Lapukhov describes how NetNORAD works and how it came about. Keeping the company’s massive network up and running is a top priority, but common troubleshooting methods were taking too long, he wrote.
“The ultimate goal is to detect network interruptions and automatically mitigate them within seconds. In contrast, a human-driven investigation may take multiple minutes, if not hours,” he wrote. “Some of these issues can be detected using traditional network monitoring, usually by querying the device counters via SNMP or retrieving information via device CLI. Often, this takes time on the order of minutes to produce a robust signal and inform the operator or trigger an automated remediation response.”
In addition, Facebook engineers often encountered “gray failures,” where the problem isn’t detectable by traditional metrics or a device can’t report its own malfunctioning, he said. These issues led Facebook to build NetNORAD, which Lapukhov described as a system that treats the network like a “black box” and troubleshoots network problems “independently of device polling.”
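The "black box" idea can be illustrated with a small sketch: one endpoint sends probes, another echoes them, and the sender judges network health from packet loss and round-trip time rather than from device counters. This is not Facebook's code. The choice of UDP over loopback, the probe format, and the names `run_responder`, `probe`, and `demo` are all assumptions made for illustration.

```python
import socket
import threading
import time


def run_responder(sock, stop):
    """Echo every datagram back to its sender until `stop` is set."""
    sock.settimeout(0.1)  # wake up regularly to check the stop flag
    while not stop.is_set():
        try:
            data, addr = sock.recvfrom(2048)
        except socket.timeout:
            continue
        sock.sendto(data, addr)


def probe(target, count=10, timeout=0.2):
    """Send `count` UDP probes to `target`.

    Returns (loss_ratio, rtts), where rtts holds the round-trip time in
    seconds of every probe that was echoed back in time.
    """
    rtts = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        for seq in range(count):
            payload = seq.to_bytes(4, "big")  # hypothetical probe format
            sent = time.monotonic()
            s.sendto(payload, target)
            try:
                data, _ = s.recvfrom(2048)
            except (socket.timeout, ConnectionError):
                continue  # no echo in time: count the probe as lost
            if data == payload:
                rtts.append(time.monotonic() - sent)
    loss = 1 - len(rtts) / count
    return loss, rtts


def demo():
    """Run a responder on loopback and probe it five times."""
    responder = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    responder.bind(("127.0.0.1", 0))  # let the OS pick a free port
    stop = threading.Event()
    worker = threading.Thread(
        target=run_responder, args=(responder, stop), daemon=True
    )
    worker.start()
    try:
        return probe(responder.getsockname(), count=5)
    finally:
        stop.set()
        worker.join()
        responder.close()


if __name__ == "__main__":
    loss, rtts = demo()
    print(f"loss={loss:.0%}, probes answered={len(rtts)}")
```

Neither endpoint asks a switch or router for anything, so a device that is dropping packets without reporting it still shows up as loss. A real deployment would need many probing hosts and a way to aggregate their results, which this sketch leaves out.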
Creating their own systems is becoming a habit for Facebook’s network engineers. In 2014, the company unveiled Wedge, an open, top-of-rack switch running a Linux-based operating system. Last year, Facebook introduced 6-pack, which it described as the first open hardware modular switch. 6-pack builds on Wedge to form the core of Facebook’s data center fabric. Engineers have said the custom equipment provides Facebook with more flexibility than traditional networking gear.
While Facebook contributed its Wedge and 6-pack designs to the Open Compute Project, which the company launched in 2011 to share data center designs, it’s open-sourcing components of NetNORAD outside of OCP. The pieces (the pinger and responder, and the fbtracert utility) are available on GitHub.
“We are open-sourcing some key components of the NetNORAD system to promote the concepts of end-to-end fault detection and to help network engineers around the world operate their networks. … While this does not constitute a complete fault detection system, we hope you can use these components as a starting point, building upon them with your own code and other open source products for data analysis,” Lapukhov wrote.