One of the holy grails of AI in NetDevOps is the ability to analyze large amounts of log data and make recommendations. It has been a long-established goal to integrate network telemetry with AI to create a comprehensive NetDevOps workflow. Enterprise and commercial organizations have been trying to solve this problem in many ways over the years.
The traditional way to accomplish this is to create simple matching conditions that trigger when a log line matches a known pattern. This method relies mostly on regex matching of log files, linking specific log messages to specific triggers. It is conventional log parsing and should not be confused with AI, because it is limited to patterns that historical logs have already produced.
A system using this conditional approach to log alerting is hamstrung by history. Alerts can only be triggered by log messages that have previously been linked to problems. Additional intelligence can be layered on by correlating logs with environmental data (MAC address tables, route tables, and so on), but that doesn't add true AI functionality to the system. It only increases the complexity of the conditional checking.
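As a rough illustration of this conditional approach (the patterns and actions below are hypothetical, not taken from any particular tool), it boils down to a table of regexes paired with canned responses:

```python
import re
from typing import Optional

# Hypothetical rule table: each entry pairs a regex for a known "bad"
# log pattern with a canned alert action.
RULES = [
    (re.compile(r"%BGP.*neighbor (\S+) [Dd]own"), "Open ticket: BGP neighbor down"),
    (re.compile(r"link state changed to DOWN", re.IGNORECASE), "Page on-call: interface down"),
]

def check_log_line(line: str) -> Optional[str]:
    """Return an alert action if the line matches a known pattern."""
    for pattern, action in RULES:
        if pattern.search(line):
            return action
    # Anything the rule authors never anticipated falls through silently.
    return None

print(check_log_line("Jan 10 12:01:33 leaf01 bgpd: %BGP-5-ADJCHANGE: neighbor 10.0.0.2 Down"))
```

That silent fall-through for messages no one anticipated is precisely where the approach breaks down.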
There are two reasons this method isn’t scalable:
- Administrators must curate the entire list of logs to create actions
- The system can only react to logs it has seen before
This is where an AI-trained model can improve the functionality. A neural network can be trained on log messages to create a model capable of interpreting any kind of log data input. The neural network is not applying conditionals to build a large decision tree. Instead, it learns the general structure and content of log messages so that it can react to messages it has never seen before. The trained model can then be queried to provide recommendations or advanced labeling.
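As a minimal sketch of what querying such a model could look like (the checkpoint name below is a hypothetical fine-tuned model, used purely for illustration), a token-classification pipeline from the Hugging Face transformers library can label the fields of a raw log line:

```python
from transformers import pipeline

# Hypothetical checkpoint: a BERT-style model fine-tuned on network log data.
log_labeler = pipeline("token-classification",
                       model="my-org/log-field-labeler",  # placeholder, not a real model
                       aggregation_strategy="simple")

raw_log = "Jan 10 12:01:33 leaf01 bgpd: %BGP-5-ADJCHANGE: neighbor 10.0.0.2 Down"

# The model assigns a label (hostname, process, neighbor IP, event, ...) to each span,
# even for messages it never saw verbatim during training.
for entity in log_labeler(raw_log):
    print(entity["entity_group"], "->", entity["word"])
```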
The challenge with networking log data is the wide variety of formats and structures it can take, so the AI system must be flexible enough to learn across those variations. This is where the cyBERT model excels.
CyBERT is based on BERT, a natural language processing (NLP) model. Many log messages generated by network nodes bear a resemblance to natural language, so sending these logs through BERT was a natural starting point. Our team took this model and augmented it specifically for log messages, and cyBERT was the result.
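To see that resemblance concretely, here is a small stand-alone illustration (not the cyBERT pipeline itself) that runs a syslog-style message through the stock BERT WordPiece tokenizer:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

log_line = "Interface swp1 changed state to down, neighbor 10.0.0.2 unreachable"

# A syslog-style message tokenizes much like an English sentence, which is
# what makes a BERT-style model a reasonable starting point for log data.
print(tokenizer.tokenize(log_line))
```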
CyBERT still has learning challenges, as any AI does. Though it is far more effective than traditional conditional alerting, the recommendations cyBERT can make are still limited by the training data provided. The wider the breadth of logs, the better the model cyBERT can build, allowing it to react more effectively to logs it has never seen before.
Combining cyBERT with a simulation platform creates an interesting synergy. Simulation platforms are used in NetDevOps to validate configurations before deploying them during a change window. The Air Infrastructure Simulation Platform is used here to validate changes for the SONiC and Cumulus Linux network operating systems (NOS) prior to deployment. Beyond simulating and validating proposed configurations, the same platform can be used to generate data for training.
For training cyBERT with log data, there are two places where that information can originate:
- Real-world logs from production networks
- Simulated environments with controlled inputs
The first method, as stated above, is limited by what a network has historically experienced. The second method is much more interesting, because you can create data that emulates less common behavior.
Let's draw a parallel to another common type of machine learning: autonomous driving. A good autonomous driving AI must be able to drive in all types of weather and traffic patterns. But since no one can control the weather, you may not get enough real-world data on snow, heavy rain, or other inclement conditions to train your AI to drive safely in them. To solve this, we created DRIVE Sim, a virtualized world where AI can be trained to drive in controlled weather environments. The AI system is connected to DRIVE Sim instead of real-world cameras, and it learns in a simulation with real-world physics and weather effects.
Similarly, building a strong AI that can assist network operators in diagnosing and resolving issues requires a simulated environment that can keep generating a wide breadth of training data. The most severe network security issues are also some of the rarest. Using the Air Infrastructure Simulation Platform, we can generate multiple variations of log messages from controlled inputs. This gives us structured data as the input and produces a more targeted model trained on specific events.
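A simplified, hypothetical sketch of that idea follows. Here, templates stand in for logs that would really be captured from simulated devices, and the controlled inputs (interfaces, peers, event types) are varied programmatically so every generated message carries a known label:

```python
import itertools
import json
import random

# Hypothetical controlled inputs for a fault-injection scenario.
INTERFACES = ["swp1", "swp2", "swp49", "swp50"]
PEERS = ["10.0.0.2", "10.0.0.6", "10.0.1.2"]
EVENTS = ["bgp_neighbor_down", "link_flap"]

# Templates stand in for logs captured from simulated nodes.
TEMPLATES = {
    "bgp_neighbor_down": "bgpd: %BGP-5-ADJCHANGE: neighbor {peer} Down on {intf}",
    "link_flap": "kernel: {intf}: link state changed to DOWN",
}

def generate_samples(n=1000, seed=0):
    """Yield (log message, label) pairs with known, controlled ground truth."""
    rng = random.Random(seed)
    combos = list(itertools.product(EVENTS, INTERFACES, PEERS))
    for _ in range(n):
        event, intf, peer = rng.choice(combos)
        yield {"text": TEMPLATES[event].format(peer=peer, intf=intf), "label": event}

if __name__ == "__main__":
    with open("training_logs.jsonl", "w") as f:
        for sample in generate_samples():
            f.write(json.dumps(sample) + "\n")
```

Because the inputs are controlled, the ground-truth label for every generated message is known, which is what makes the resulting dataset useful for supervised training.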
And because the simulation platform can run continuously, a steady barrage of misconfigurations, structured cyber attacks, and infrastructure anomalies can keep generating log data, allowing the model to keep training on new events, ever learning and advancing its knowledge.
Leveraging AI in NetDevOps is more than taking a trained neural network and asking it to perform network operations. Good AI systems are integrated into networking workflows through collaboration: applying NetDevOps principles in new ways to train AI models on effective data is how the holy grail of AI in NetDevOps can be achieved.
Rama Darbha is Director of Solutions Architecture at NVIDIA.