More and more, I'm seeing networking and IT teams latching onto log data as a way to proactively monitor their complex digital networks and infrastructures. These groups are operationalizing log data for advanced warning of potential application or service disruption, as opposed to after-the-fact forensic analysis. The main reason that continues to pop up -- they are tired of having to know exactly what problems to look for in order to frame search queries.
I recently sat down with Eric Sharpsten, CTO of a large government contracting group, to discuss his thoughts on integrating log data and performance metrics.
"If I can get [log] data and marry it up with other data that drives the performance metrics, that's really powerful," Sharpsten said. He noted one of the biggest advances in log data use is the ability to get performance information quickly, allowing a swift recovery process after an outage or event.
Shifting views
Leveraging log data has received mixed reviews in the past, mainly because of the specialized nature of system logs, which were "owned" by separate groups, like security or administration. This isolation made it difficult to quickly access different data sets and correlate various types of log data with each other and with higher-level performance metrics.
These days, the use of log data is yielding better results for networking and IT groups for a couple of reasons. First, there's a shift toward retaining a targeted subset of log data long term, rather than indiscriminately storing all of it all the time. Second, networking and IT staff are recognizing the value of log data as a tool for network troubleshooting, proactive monitoring and planning.
In fact, a recent Enterprise Management Associates report highlighted this fundamental shift. According to the firm's survey, 47% of participants said log data is one of the most important tools in network management.
The obstacles
The popularity of log tools is a testament to their value. But search remains an after-the-fact process, used most often in troubleshooting to find a misconfiguration, a protocol mismatch or some other failure that has already occurred. Splunk, for instance, is primarily a forensics tool designed to search, monitor and analyze machine data after a performance event has already disrupted application or service delivery.
Perhaps that is why just over 50% of EMA's survey respondents said that "knowing what to look for" has hindered their decision to rely on log data. Another 36% said they found "writing new scripts or filters to find what is important" most daunting.
The reality is that tools, technologies and thinking based on legacy models still breed hesitation and delay when it comes to putting log data to use. Three things are needed to ensure log analytics are viable and valuable:
- Networking and IT staff need to understand normal log behavior by establishing a baseline.
- They need the ability to generate near real-time alerts when abnormalities appear (see the sketch after this list).
- And they need to correlate real-time network and infrastructure performance metrics, flows and logs at scale.
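To make the first two requirements concrete, here is a minimal Python sketch of baselining and abnormality alerting. It assumes log events have already been parsed into per-minute error counts per host; the window size, threshold and alert handler are illustrative assumptions, not any particular product's behavior.

```python
from collections import defaultdict, deque
from statistics import mean, stdev

# Minimal sketch: baseline per-host error rates from logs and alert on
# deviations. WINDOW and THRESHOLD_SIGMA are illustrative values only.
WINDOW = 60            # number of per-minute samples kept as the baseline
THRESHOLD_SIGMA = 3.0  # alert when a sample deviates this far from normal

baselines = defaultdict(lambda: deque(maxlen=WINDOW))  # host -> recent counts

def alert(host, count, baseline):
    # In practice this would feed a paging or ticketing system.
    print(f"ALERT: {host} logged {count} errors/min (baseline ~{baseline:.1f})")

def check_sample(host, error_count):
    """Compare the latest per-minute error count against the host's baseline."""
    history = baselines[host]
    if len(history) >= 10:  # wait for enough history before alerting
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and (error_count - mu) / sigma > THRESHOLD_SIGMA:
            alert(host, error_count, mu)
    history.append(error_count)

# Example: feed per-minute counts as they arrive from the log pipeline.
for count in [2, 3, 2, 4, 3, 2, 3, 2, 3, 2, 40]:
    check_sample("core-sw-2", count)
```

The point of the sketch is the workflow, not the math: the baseline is learned from the data itself, so nobody has to know in advance what problem to search for.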
New technologies can solidify log data's leap from reactive forensics to proactive monitoring. Advanced indexing, metadata tagging, searching, analytics and reporting dramatically simplify the work of navigating a wealth of current data in historical context. Formerly isolated event data can be used as a real-time operations resource to uncover the causes of problems and track how device changes affect digital network and infrastructure performance.
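As a rough illustration of that last point, the following Python sketch correlates device-change events pulled from logs with a latency metric to flag changes that immediately precede a performance spike. The data structures, timestamps and thresholds here are assumptions for the sake of the example, not a specific tool's API.

```python
import datetime as dt

# Hypothetical inputs: change events parsed from device logs, and latency
# samples from a performance monitor.
changes = [  # (timestamp, device, description)
    (dt.datetime(2015, 6, 1, 10, 5), "edge-rtr-1", "ACL update"),
]
latency_ms = [  # (timestamp, value)
    (dt.datetime(2015, 6, 1, 10, 0), 12.0),
    (dt.datetime(2015, 6, 1, 10, 10), 48.0),
]

def changes_before_spike(spike_threshold=30.0, lookback=dt.timedelta(minutes=15)):
    """Return change events that occurred shortly before a latency spike."""
    suspects = []
    for ts, value in latency_ms:
        if value < spike_threshold:
            continue
        for change_ts, device, desc in changes:
            if ts - lookback <= change_ts <= ts:
                suspects.append((device, desc, change_ts, value))
    return suspects

print(changes_before_spike())
```

Even a simple time-window join like this turns two formerly siloed data sets, change logs and performance metrics, into a single answer to the question operations teams actually ask: what changed right before things got slow?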