Data volumes continue to grow across the globe, driven by accelerating adoption of technologies like Big Data, IoT, and artificial intelligence. Naturally, this growth is accompanied by an increase in data centers, including traditional, cloud, and edge data centers. And as modern data center networks become more complex, supporting the data-intensive new applications associated with these technologies, they are beginning to produce massive amounts of telemetry data.
Network telemetry data is information providing basic metrics on how a network is performing. This information is raw and very granular. For example, it tells you the throughput or latency of application flows and packets throughout the network. Given how busy hyperscale and service provider networks get, this detailed data can accumulate to huge volumes in a short time. Network operators generally regard these massive quantities of telemetry data as a burden and essentially worthless. After all, they've got sophisticated proprietary solutions to help manage their networks. Why even bother combing through a seemingly incomprehensible mess of raw telemetry data?
But this data holds tremendous value that most organizations haven’t considered. By performing machine learning on telemetry data, these organizations can unlock hidden insights that help improve performance and bolster security. In short, with machine learning, network operators can find order in all the chaos.
To begin, you need uncompromised visibility into the network you wish to monitor. Traffic probes, traffic sampling, and polling of statistics and counters generally do not give you enough information to observe and detect the issues that can affect these cloud-scale network environments. It would be like running through a dark and dangerous forest at night with only a small flashlight: chances are high that you will lose your way or miss an obstacle. Thankfully, though, emerging technologies like In-band Network Telemetry (INT) solve this problem, giving network operators the ground truth about their network that they need.
Next, you need to know your network data. This means processing and organizing different types of telemetry data, such as path and latency information, hop-by-hop delay, jitter, and packet loss rate, so that you can identify interesting data and detect anomalies and events.
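As a rough illustration of this organizing step, the sketch below reduces per-packet reports for one flow to end-to-end latency, jitter, and loss rate. The `PacketReport` record and its field names are hypothetical and do not reflect any particular telemetry format, and jitter is taken here as the mean absolute difference between consecutive packet latencies, which is only one of several common definitions.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List


@dataclass
class PacketReport:
    """Hypothetical per-packet telemetry record (illustrative field names)."""
    flow_id: str
    seq: int
    hop_delays_us: List[float] = field(default_factory=list)  # delay added at each hop
    dropped: bool = False


def summarize_flow(reports):
    """Reduce per-packet reports for one flow to a few summary metrics."""
    if not reports:
        raise ValueError("need at least one report")
    delivered = [r for r in reports if not r.dropped]
    # End-to-end latency of a packet is the sum of its hop-by-hop delays.
    latencies = [sum(r.hop_delays_us) for r in delivered]
    # Jitter: mean absolute change in latency between consecutive packets.
    diffs = [abs(b - a) for a, b in zip(latencies, latencies[1:])]
    return {
        "packets": len(reports),
        "loss_rate": (len(reports) - len(delivered)) / len(reports),
        "mean_latency_us": mean(latencies) if latencies else None,
        "jitter_us": mean(diffs) if diffs else 0.0,
    }
```

Summaries like this, computed per flow or per path, are the kind of ordered records the later steps can work from.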
Now that you have an ordered dataset, machine learning techniques can help determine a baseline of how your network is currently performing, providing a comprehensive view of latency, bandwidth, packet drops, and other metrics. This sort of baselining makes it clear where a network is running smoothly and where it’s struggling. From this viewpoint, operators can identify the problem areas where they need to focus. Again, these insights are possible because telemetry data is so rich, providing specific information on each packet in a network.
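A baseline can start out as something much simpler than a trained model. The sketch below is a plain statistical stand-in: it groups hop-delay samples by switch and records the mean and standard deviation for each. The `(switch_id, hop_delay_us)` input shape is an assumption made for illustration only.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_baseline(samples):
    """Build a per-switch delay baseline.

    samples: iterable of (switch_id, hop_delay_us) pairs (assumed input shape).
    Returns {switch_id: {"mean": ..., "std": ..., "n": ...}}.
    """
    by_switch = defaultdict(list)
    for switch_id, delay in samples:
        by_switch[switch_id].append(delay)
    return {
        switch_id: {"mean": mean(delays), "std": pstdev(delays), "n": len(delays)}
        for switch_id, delays in by_switch.items()
    }
```

Comparing the per-switch entries side by side gives a first view of where the network is running smoothly and where it is struggling. A production system would likely also account for time of day and traffic mix.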
Machine learning correlates phenomena between latency, paths, switches, routers, events, etc. This intelligence uncovers things that were previously invisible. For example, it may tell you that network events X and Y are closely related and that when one is seen, the other is likely to be observed too. It can also tell you that when you make a network change, you get a particular behavior. This makes it possible to detect and eliminate bottlenecks. Maybe there's a network policy causing a packet drop or a slowdown. Maybe there's a problem with the application deployment or the network provisioning that needs to be remedied by the application or the network administrators. Machine learning provides the clues that point you toward an answer.
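To make the idea of correlated events concrete, here is a minimal sketch that computes the Pearson correlation coefficient between two equally sampled telemetry series, for example queue depth on a switch and latency of a flow crossing it. Pearson correlation is used here only as a simple example of a correlation measure, not as the method any particular product relies on, and it captures linear relationships only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant series carries no information about co-movement.
        return 0.0
    return cov / sqrt(var_x * var_y)
```

A value near 1 or -1 suggests that the two series move together and that the pair deserves a closer look. It does not by itself show that one causes the other.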
This leads to another major advantage of this approach: predictive analytics. As your machine learning models learn the correlations and patterns in present-day traffic, they also gain the ability to predict future behavior. Network operators come to understand how each action they take correlates to certain behaviors, down to the packet level. Armed with this knowledge, they can anticipate and prevent network outages, delays in the forwarding plane, and app slowdowns.
In addition to improving performance in the present and future, machine learning on telemetry data also boosts network security. After detailed baselining, anomalous behavior becomes much easier to spot. Anomalies often just signify poor performance somewhere in the network, but sometimes they indicate a breach. Once identified, anomalous behavior can be investigated further, potentially allowing network operators to expose significant security vulnerabilities.
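One simple way to spot anomalous behavior against a baseline is a z-score test, sketched below. The threshold of 3 standard deviations is a common rule of thumb rather than a value taken from this article, and a real deployment would likely use a more robust model. Flagged samples are candidates for investigation, not confirmed breaches.

```python
from statistics import mean, pstdev


def find_anomalies(baseline_values, new_values, z_threshold=3.0):
    """Return (index, value) pairs in new_values that deviate from the baseline.

    A value is flagged when it lies more than z_threshold standard
    deviations from the baseline mean.
    """
    mu = mean(baseline_values)
    sigma = pstdev(baseline_values)
    if sigma == 0:
        # Perfectly flat baseline: any different value counts as a deviation.
        return [(i, v) for i, v in enumerate(new_values) if v != mu]
    return [
        (i, v) for i, v in enumerate(new_values)
        if abs(v - mu) / sigma > z_threshold
    ]
```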
To gain accurate network insights from machine learning, an organization needs a massive volume of telemetry data to draw from. Getting the right quantity of data is easy; getting the right quality of data is the tricky part. To train the ML models correctly, so that they can properly baseline networks, find correlations, and predict future outcomes, telemetry data must include flow reports, congestion reports, and drop reports.
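A basic data-quality gate can check that all three report types are present before any training run. In the sketch below, the dictionary shape with a `"type"` key and the type labels `"flow"`, `"congestion"`, and `"drop"` are assumptions for illustration and do not represent a standard schema.

```python
from collections import Counter

# Report types named above; the string labels themselves are hypothetical.
REQUIRED_REPORT_TYPES = {"flow", "congestion", "drop"}


def missing_report_types(reports, min_count=1):
    """Return the required report types with fewer than min_count reports."""
    counts = Counter(r["type"] for r in reports)
    return sorted(t for t in REQUIRED_REPORT_TYPES if counts[t] < min_count)
```

An empty result means every required type is represented at least `min_count` times. Raising `min_count` gives a crude check that no single report type is too sparse to learn from.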
In an era when networks are becoming more scattered and chaotic, using machine learning on telemetry data is a way to make the network seem whole again. This method provides critical new insights and establishes a holistic way to manage the network rather than troubleshooting individual components. Advanced technologies like 5G are only going to put more pressure on network performance. The best way to meet this rising challenge is to take a proactive, AI-based approach. For growing networks, telemetry data may seem extraneous, but when properly harnessed, that data provides solutions to the network's biggest problems.