Modern IT infrastructures are stretching legacy monitoring tools to the breaking point. The unprecedented "bigness" of today's networks -- in numbers and complexity -- creates many blind spots, knowledge gaps and delays.
This lack of visibility makes it hard for even seasoned network operations teams to see emerging problems, find root causes and fix them quickly. It's also a prime cause of the intense stress under which these teams labor.
This basic issue -- "How do we monitor something this big?" -- makes other long-standing questions like "Where's the problem?" and "What changed?" even more urgent and difficult to answer. The bottom line is this: If your tools can't handle the scale of today's networks, you can't confidently answer these questions.
The very public and painful LTE outages that occurred from 2011 into early 2014 show the limits of legacy approaches, especially when you consider that today's LTE networks have all the hallmarks of 21st century IT infrastructure: lots of hardware and software components; numerous vendors and suppliers; a proliferation of interfaces, protocols and data formats, both structured and unstructured; fast-growing, unpredictable traffic and transactions; vast end-user populations; and the need to work with older network elements.
With the earlier outages, one problem was especially common. Even when performance metrics and statistics were collected locally, the data often couldn't be made widely available fast enough, didn't present a comprehensive view, or lacked accuracy and granularity -- sometimes all three. Network operations staff were left behind by the speed at which problems developed and the inability of legacy monitoring and reporting tools to keep up.
When this happens, customer complaints become the alerting system. Engineers operate on rumor because reports take too long to run or the data for the affected system is not available. And the angst-ridden "war room" scene becomes the problem isolation system. Neither scenario is viable for today's enterprises and service providers.
The new "big"
In the 1980s and 1990s, online transaction processing systems were the epitome of big. That's no longer the case. Just think of what's involved in keeping Google's search, Amazon's one-click purchasing, NASDAQ's trading, or Steam's online gameplay and social networking available, reliable and fast for millions of users.
There are several factors that characterize today's big IT infrastructures:
1. Sheer numbers. Servers, switches, routers and other gear can add up to tens of thousands or even hundreds of thousands of connected devices. One major high-tech consumer electronics company has over 600,000 servers. Rarely is all of this equipment and software supplied by one vendor, so there are usually dozens of suppliers to deal with as well.
2. Complex business services. Services like personal storage and backup, online shopping, and financial management require groups of software applications that span a geographically diverse, multivendor hardware infrastructure. Keeping these services alive and healthy means being able to trace dependencies end-to-end and aggregate all relevant data on a single screen.
3. Non-traditional infrastructure functions. Increasingly, IT groups want to monitor everything from CPU and exhaust temperatures to fan speeds, power consumption and blown fuses. If a CPU fan starts slowing down undetected, the system runs hotter and hotter until it locks up without warning. Other examples come from the rapid evolution of mobile infrastructure: new network functions and signaling are emerging quickly, and service providers are scrambling to gain visibility, since these platforms are now the foundation of their revenue-generating capability.
4. More than SNMP. To get an accurate picture of a complex network, monitoring platforms must pull data through numerous other interfaces, data formats and protocols. These include NetFlow, Cisco IP SLA, DNS, NBAR, JMX, WMI and the Intelligent Platform Management Interface (IPMI). There's also growing demand for log data in order to correlate it with performance anomalies. A recent study from the SANS Institute found that 35% of survey respondents now log from 51 GB to 1 TB of data daily, and 8% log over 1 TB. Using only log data, one customer learned that a network service had crashed and restarted itself.
There is also a proliferation of structured and programmatic data. Many up-and-coming technologies such as OpenStack rely almost exclusively on APIs to provide and share performance data. Other systems, especially telecom-grade element management systems (EMS), make data available via CSV, XML, JSON and other types of flat files. Putting this data together and providing a uniform reporting platform reduces mean time to repair and the total cost of ownership of the entire infrastructure (a minimal normalization sketch follows this list).
5. The client device explosion. Bring your own device (BYOD), whether for corporate end users or mobile customers, creates a much larger and more varied population of devices, usage and traffic, straining enterprise WiFi, public hotspot networks and backend network services. The Hotspot 2.0 specification and its implementation in next-generation hotspots, which aim to create a "cellular-like" WiFi experience, will only add to that traffic and the demand on network services.
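To make the "more than SNMP" point concrete, here is a minimal sketch, in Python, of what normalizing a few of these sources into uniform records might look like. The field names, element names and hard-coded counter and sensor values are hypothetical stand-ins; a real collector would poll them over SNMP, IPMI or a REST API rather than embedding them in the script.

```python
# Minimal sketch: normalizing heterogeneous performance data into one record shape.
# All field names, elements and sample values here are hypothetical.
import csv
import io
import json
import time

def make_record(source, element, metric, value, ts=None):
    """One uniform record shape, whatever the original data source or format."""
    return {
        "source": source,
        "element": element,
        "metric": metric,
        "value": float(value),
        "timestamp": ts or time.time(),
    }

records = []

# 1. JSON from a REST API (e.g. an OpenStack-style endpoint).
api_payload = json.loads('{"host": "compute-01", "cpu_util_pct": 87.5}')
records.append(make_record("api", api_payload["host"], "cpu_util_pct",
                           api_payload["cpu_util_pct"]))

# 2. A CSV flat file exported by an element management system (EMS).
ems_csv = io.StringIO("element,metric,value\ncell-site-42,rrc_setup_failures,17\n")
for row in csv.DictReader(ems_csv):
    records.append(make_record("ems_csv", row["element"], row["metric"], row["value"]))

# 3. An SNMP counter and an IPMI sensor reading (values hard-coded here;
#    a real collector would poll them with SNMP/IPMI libraries).
records.append(make_record("snmp", "edge-router-07", "ifInOctets", 913_442_118))
records.append(make_record("ipmi", "blade-113", "fan2_rpm", 4100))

for r in records:
    print(r)
```

Once every source lands in the same record shape, a single reporting and alerting pipeline can sit on top of all of them, which is the point of the uniform platform described above.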
Specialty monitoring vendors and equipment vendors provide tools and purpose-built EMSs for today's performance monitoring infrastructures. Both types can be good at what they do. But it's tough to get them to work together, and especially to work together fast enough to create a single, end-to-end view. IT groups often have to stitch together a patchwork of a dozen tools or more.
These elaborate monitoring systems are riddled with blind spots and create a swivel-chair approach, as analysts must monitor dozens of disparate screens. In at least one instance, a single over-utilized CPU on one network device took down a national LTE service because the CPU utilization data couldn't be collected fast enough from that device.
Painful limitations
Often, conventional tools simply bog down and operations teams don't want to use them; polling 30,000 interfaces seems endless. Furthermore, many of these monitoring tools use a centralized, one-size-fits-all database architecture. At some point, as data volumes grow along with the size and complexity of the infrastructure and the need for real-time information, that architecture inevitably becomes a bottleneck.
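A quick back-of-the-envelope calculation shows why "polling 30,000 interfaces seems endless" when a tool walks them serially, and why collection has to be parallelized. The 50 ms per-poll latency and 200-worker figures below are illustrative assumptions, not measurements.

```python
# Why serial polling of 30,000 interfaces "seems endless" (illustrative numbers).
INTERFACES = 30_000
POLL_LATENCY_S = 0.050     # assumed round-trip time for one interface poll
WORKERS = 200              # assumed number of concurrent pollers

serial_cycle = INTERFACES * POLL_LATENCY_S
concurrent_cycle = (INTERFACES / WORKERS) * POLL_LATENCY_S

print(f"serial polling cycle:     {serial_cycle / 60:.1f} minutes")   # ~25 minutes
print(f"concurrent polling cycle: {concurrent_cycle:.1f} seconds")    # ~7.5 seconds
```

Even with parallel collection, every sample still has to be written and queried somewhere, which is exactly where a centralized, one-size-fits-all database becomes the next bottleneck.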
There are other limitations. You may only get a coarse sampling of data. Your data may be outdated because the system performs batch rather than stream processing. You may only monitor a subset of the elements because the system can't be extended through configuration or APIs, and with the next upgrade a long way off, you're left hoping that if there is a problem, it's in the subset you can see. You have to reboot your monitoring system twice a week.
Current IT trends -- virtualization, cloud-based services and functions, and the Internet of Things -- all point to ever-growing size and complexity. Software-defined networks will make fast, frequent and automatic infrastructure changes a feature instead of a problem, spinning resources up or down on demand.
Scalable monitoring architectures
Today's IT groups have a clear view of what's needed to monitor big infrastructures. And they're demanding the following:
- Real-time data from all elements (hardware and software)
- Tools that collect and process vastly larger amounts of performance data and can scale as those volumes grow
- The ability to determine normal performance behavior for each element
- Fast, accurate, reliable alerts when that behavior changes (a minimal baselining sketch follows this list)
- Clear, fast visibility into those changes and their causes
- Closer, faster, more effective support from vendors
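As a minimal illustration of the baselining and alerting demands above (not any vendor's algorithm), the sketch below learns "normal" for a single metric with an exponentially weighted moving average and flags samples that stray too far from it. The smoothing factor, deviation band and sample values are assumptions chosen for illustration.

```python
# Minimal sketch of baselining and change detection with an exponentially
# weighted moving average (EWMA). The smoothing factor, deviation band and
# sample values are illustrative assumptions, not tuned production settings.
class Baseline:
    def __init__(self, alpha=0.1, band=4.0):
        self.alpha = alpha      # smoothing factor for the running mean and variance
        self.band = band        # alert when a sample is this many deviations out
        self.mean = None
        self.var = 0.0

    def update(self, value):
        """Feed one sample; return True if it deviates from learned behavior."""
        if self.mean is None:           # first sample seeds the baseline
            self.mean = value
            return False
        deviation = value - self.mean
        alert = self.var > 0 and abs(deviation) > self.band * self.var ** 0.5
        # Update the running estimates of "normal" for this element/metric.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return alert

# Usage: one Baseline per element/metric pair, fed by the collection pipeline.
cpu = Baseline()
for sample in [22, 25, 24, 23, 26, 24, 25, 78]:   # sudden jump at the end
    if cpu.update(sample):
        print(f"alert: CPU utilization {sample}% deviates from baseline")
```

In practice one such baseline would exist per element and metric, fed continuously by the collection pipeline, so alerts reflect each element's own learned behavior rather than a one-size-fits-all threshold.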
The good news is that changes in technology, architecture, and vendor practices can now satisfy at least some of these demands.
IT groups have known for years what's needed to dynamically scale their IT infrastructure: distributed computing, distributed data collection and storage, and parallel processing. Yet relatively few monitoring tools are designed from the ground up to leverage these capabilities. The growth in complexity and scale we are witnessing today is bringing many of the problems of legacy systems to a head. A scalable architecture is essential to handle the vastly larger volume of statistics and metrics needed for accurate, effective, fast and reliable performance monitoring.
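As a rough sketch of the parallel-collection idea (illustrative only, with placeholder device names and a simulated poll), the pattern looks something like this: fan the polling work out across a pool of workers and aggregate results as they arrive. A production system would distribute the same pattern across collector machines near the monitored elements, not just across threads.

```python
# Illustrative sketch of parallel collection: fan polling work out across
# workers and aggregate results centrally. Device names and the poll() body
# are placeholders; a real collector would speak SNMP, IPMI, APIs and so on.
from concurrent.futures import ThreadPoolExecutor, as_completed
import random
import time

DEVICES = [f"device-{n:05d}" for n in range(1_000)]   # stand-in inventory

def poll(device):
    """Pretend to poll one device and return a metric sample."""
    time.sleep(0.01)                                   # simulated network latency
    return device, {"cpu_util_pct": random.uniform(5, 95), "ts": time.time()}

def collect(devices, workers=100):
    """Poll all devices in parallel and yield results as they arrive."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(poll, d) for d in devices]
        for future in as_completed(futures):
            yield future.result()

if __name__ == "__main__":
    start = time.time()
    samples = dict(collect(DEVICES))
    print(f"collected {len(samples)} samples in {time.time() - start:.1f}s")
```

The same fan-out and aggregate shape repeats at every tier of a scalable design: collectors close to the devices, regional aggregation, and a horizontally scalable store behind the reporting layer.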
So, how do we monitor something this big? Clearly, we need better tools.