Traditional monitoring solutions collect data from endpoints and parse it into a series of metrics. The result is then compared against policies and thresholds to determine the health of applications and systems in the environment. A monitored system is assigned an unhealthy status when its metrics breach a configured policy, which raises an alert or triggers actions to remediate the condition.
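As a rough illustration of the threshold model described above, here is a minimal Python sketch. The metric names, threshold values, and the Threshold, evaluate and health names are all made up for the example and do not come from any particular monitoring product.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str      # hypothetical metric name, e.g. "cpu_percent"
    maximum: float   # value above which the policy counts as breached


def evaluate(metrics: dict[str, float], policy: list[Threshold]) -> list[str]:
    """Return one alert message per threshold breached by the collected metrics."""
    alerts = []
    for rule in policy:
        value = metrics.get(rule.metric)
        if value is not None and value > rule.maximum:
            alerts.append(f"{rule.metric}={value} exceeds {rule.maximum}")
    return alerts


def health(metrics: dict[str, float], policy: list[Threshold]) -> str:
    """A system is unhealthy as soon as any threshold is breached."""
    return "unhealthy" if evaluate(metrics, policy) else "healthy"


# Illustrative values only.
policy = [Threshold("cpu_percent", 90.0), Threshold("error_rate", 0.05)]
sample = {"cpu_percent": 97.5, "error_rate": 0.01}
print(health(sample, policy), evaluate(sample, policy))
```

Note that the result only says which metric crossed which line. Nothing in it explains why the metric moved.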
This approach is fine for monitoring and troubleshooting traditional system architectures and monolithic applications. However, conventional monitoring solutions perform poorly with cloud-native applications and distributed infrastructure solutions.
Conventional monitoring solutions have two critical weaknesses with newer applications and infrastructure solutions. The first is that an alert shows that something is wrong with an object, which is fine while the object still exists. However, the alarm is tied to the object, so if the object no longer exists, neither does the alarm.
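The first weakness can be shown with a toy model. The Inventory class below is hypothetical and only assumes that alarms are stored on the monitored object, so that removing the object (for example a short-lived container) removes its alarm history too.

```python
class Inventory:
    """Hypothetical monitoring inventory where alarms live on the monitored object."""

    def __init__(self):
        self._alarms = {}  # object id -> list of alarm messages

    def register(self, object_id):
        self._alarms.setdefault(object_id, [])

    def raise_alarm(self, object_id, message):
        self._alarms[object_id].append(message)

    def remove(self, object_id):
        # Removing the object also discards every alarm attached to it.
        self._alarms.pop(object_id, None)

    def open_alarms(self):
        return [(obj, msg) for obj, msgs in self._alarms.items() for msg in msgs]


inventory = Inventory()
inventory.register("container-a")
inventory.raise_alarm("container-a", "memory limit reached")
before = inventory.open_alarms()

inventory.remove("container-a")   # the container is replaced or scaled in
after = inventory.open_alarms()
print(before, after)
```

After the object is removed there is nothing left to investigate, even though the underlying problem may reappear on its replacement.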
The other weakness is the inability to provide end-to-end visibility of the processes that caused the alarm to trigger. This information is still gathered by manually checking logs and attempting to create a timeline of events.
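The manual timeline exercise usually amounts to something like the following sketch: pull lines from each system's log and sort them by timestamp. The log lines, source names, and the "<ISO timestamp> <message>" layout are invented for the example. Real logs rarely share a format this neatly.

```python
from datetime import datetime


def build_timeline(sources):
    """Merge log lines from several sources into one list ordered by timestamp.

    `sources` maps a source name to lines shaped like "<ISO timestamp> <message>".
    """
    entries = []
    for name, lines in sources.items():
        for line in lines:
            stamp, _, message = line.partition(" ")
            entries.append((datetime.fromisoformat(stamp), name, message))
    return sorted(entries)


# Made-up sample lines, deliberately listed out of order.
logs = {
    "autoscaler": ["2024-05-01T10:02:10 scale-out failed"],
    "web": ["2024-05-01T10:01:30 request rate rising on one page"],
    "backend": ["2024-05-01T10:01:55 queue depth above normal"],
}

for when, source, message in build_timeline(logs):
    print(when.isoformat(), source, message)
```

Sorting gives an order of events but no relationships between them. Deciding which line caused which is still left to the person reading the output.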
Enter observability platforms
Observability platforms aim to provide operators with an end-to-end view of processes that are related to system degradation, establishing a level of cause and effect instead of just effect. Observability platforms work by determining relationships between objects and actions performed by the objects. Additionally, a deeper set of data is gathered to perform analytics.
Monitoring an application with an observability platform can tell you that a specific service is not running as expected because there has been an increase in load and the autoscaling function has an error. A traditional monitoring solution could alert you that the autoscaling function is generating errors, but not necessarily that the trigger was an increase in traffic to a specific page on the public-facing website. Additionally, the observability platform identifies a particular function on the page to be the cause.
All the information needed to link a specific function to a higher demand for a backend service is already available to you. However, making those links and getting to that level of detail is time-consuming, and if the alert is picked up by an infrastructure team instead of the application owners, the link may be missed.
Observability platforms analyse all the available data to present a clear and concise series of related events, showing both cause and effect and decreasing the time to resolution.
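To make the idea of a series of related events more concrete, here is a small sketch that walks from a symptom back to its root cause. It assumes each event already carries a link to the event that led to it. The Event fields, identifiers, and messages are hypothetical, and a real platform would have to infer these links rather than receive them ready-made.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str
    source: str
    message: str
    parent_id: str | None = None  # event that led to this one, if known


def causal_chain(events: list[Event], symptom_id: str) -> list[Event]:
    """Follow parent links from the symptom back to the root, returned root first."""
    by_id = {event.event_id: event for event in events}
    chain = []
    seen = set()  # guards against accidental cycles in the links
    current = by_id.get(symptom_id)
    while current is not None and current.event_id not in seen:
        seen.add(current.event_id)
        chain.append(current)
        current = by_id.get(current.parent_id) if current.parent_id else None
    return list(reversed(chain))


# Illustrative events modelled on the scenario in the text.
events = [
    Event("e4", "service", "not running as expected", parent_id="e3"),
    Event("e2", "backend-service", "load increase", parent_id="e1"),
    Event("e1", "web-page", "page function drives extra backend requests"),
    Event("e3", "autoscaler", "scale-out failed with an error", parent_id="e2"),
]

for event in causal_chain(events, "e4"):
    print(event.source, "->", event.message)
```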
Similar scenarios apply to infrastructure and infrastructure services. Reconfiguring stateful firewall rules can trigger session states to be reevaluated, which increases the firewall's resource demand. Depending on the firewall, there can be a significant difference in impact between a single substantial rule reconfiguration and many small rule updates.
Consider a workflow that updates firewall rules. It has been changed to take in an array of rules to be modified and to apply each change sequentially instead of creating a single large request. The development and test firewalls only carry minimal traffic and experience no negative impact. The production firewall suffers a significant performance impact when the updated workflow executes.
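The difference between the two update styles can be sketched with a toy model. The ToyFirewall class below is an assumption made purely for illustration: it treats every applied request as one reevaluation of all active sessions. Real firewalls differ, and the session counts are invented.

```python
class ToyFirewall:
    """Toy model: every applied request re-evaluates all active sessions once."""

    def __init__(self, active_sessions):
        self.active_sessions = active_sessions
        self.reevaluations = 0
        self.rules = {}

    def apply(self, changes):
        self.rules.update(changes)
        self.reevaluations += self.active_sessions


def update_batched(firewall, changes):
    # One large request carrying every rule change.
    firewall.apply(changes)


def update_sequential(firewall, changes):
    # One request per rule, as in the modified workflow.
    for name, rule in changes.items():
        firewall.apply({name: rule})


changes = {f"rule-{i}": "allow" for i in range(50)}

for label, sessions in (("test", 10), ("production", 200_000)):
    batched, sequential = ToyFirewall(sessions), ToyFirewall(sessions)
    update_batched(batched, changes)
    update_sequential(sequential, changes)
    print(label, batched.reevaluations, sequential.reevaluations)
```

Under this assumption the extra work scales with both the number of rules and the number of sessions, which is why a lightly loaded test firewall would show nothing unusual.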
In many environments, the alert generated by the above scenario would go to the networking team, who should be able to see that the rules were updated but may not have visibility into the workflow execution. An observability platform can create a link between the workflow and the resulting impact, connecting cause and effect and pointing to the appropriate response. Both the automation and network teams receive notifications.
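One simple way to picture that link is to match the alert against workflow executions that touched the same target shortly beforehand. The sketch below is an assumption about how such a correlation could work, not a description of any product. The field names, team names, timestamps, and five-minute window are invented.

```python
from datetime import datetime, timedelta


def correlate(alert, executions, window=timedelta(minutes=5)):
    """Return executions on the alert's target that started within `window` before the alert."""
    return [
        run for run in executions
        if run["target"] == alert["target"]
        and timedelta(0) <= alert["time"] - run["started"] <= window
    ]


def recipients(alert, matches):
    """Notify the alert's owning team plus the owners of every linked execution."""
    teams = {alert["owner"]}
    teams.update(run["owner"] for run in matches)
    return sorted(teams)


# Illustrative data only.
alert = {"target": "fw-prod", "owner": "network", "time": datetime(2024, 5, 1, 10, 3)}
executions = [
    {"workflow": "update-rules", "target": "fw-prod", "owner": "automation",
     "started": datetime(2024, 5, 1, 10, 1)},
    {"workflow": "update-rules", "target": "fw-test", "owner": "automation",
     "started": datetime(2024, 5, 1, 10, 2)},
]

matches = correlate(alert, executions)
print([run["workflow"] for run in matches], recipients(alert, matches))
```

Matching on time and target alone is a crude heuristic. It is enough to show why both teams end up on the notification.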
Moving forward
By now I have painted a cheerful picture of what observability platforms can offer. However, this is the real world, and nothing is as simple as it seems.
Performing deep analytics depends on the quality of the data received, and on all systems and services in scope being configured to send logging data to the observability platform. Inconsistencies in log data between systems can prevent automatic relationship detection, in which case relationships may need to be configured manually.
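As an example of the kind of inconsistency involved, two systems may record the same fact under different field names. The sketch below maps a few aliases onto one canonical name. The alias table and field names are hypothetical, and in practice this mapping is part of the manual configuration work mentioned above.

```python
# Hypothetical aliases seen across different systems, mapped to one canonical name.
FIELD_ALIASES = {
    "host": "host", "hostname": "host", "src_host": "host",
    "ts": "timestamp", "time": "timestamp", "timestamp": "timestamp",
    "request_id": "correlation_id", "trace": "correlation_id",
    "correlation_id": "correlation_id",
}


def normalise(record):
    """Rename known aliases to canonical names; keep unknown fields unchanged."""
    return {FIELD_ALIASES.get(key, key): value for key, value in record.items()}


firewall_log = {"hostname": "fw01", "ts": "2024-05-01T10:03:00", "trace": "run-42"}
workflow_log = {"host": "fw01", "time": "2024-05-01T10:01:00", "request_id": "run-42"}

a, b = normalise(firewall_log), normalise(workflow_log)
# Once the field names agree, the two records can be matched on shared values.
print(a["host"] == b["host"], a["correlation_id"] == b["correlation_id"])
```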
Some observability platforms provide the capability for application developers to embed integrations directly into code, providing low-level data. Ensuring a consistent log structure throughout the project enhances the platform's ability to determine relationships automatically.
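A consistent log structure can be as simple as every component emitting the same set of fields. The sketch below uses only Python's standard logging and json modules. The field names (service, correlation_id) and the logger name are assumptions for illustration, not a format required by any observability platform.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render every record with the same small set of fields."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload, sort_keys=True)


# Written to an in-memory buffer here; a real setup would ship this elsewhere.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("rule_workflow")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# The `extra` mapping attaches the shared fields to this record.
logger.info("applying firewall rule",
            extra={"service": "rule-workflow", "correlation_id": "run-42"})
print(stream.getvalue(), end="")
```

If every service in the project fills in the same correlation field, linking a workflow run to its downstream effects becomes a lookup instead of guesswork.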
Observability platforms improve troubleshooting by providing the level of end-to-end visibility required for modern distributed systems. Additionally, correlating the cause and effect of events enhances a team’s ability to determine why a fault occurred.