The increasing adoption of microservices as a foundational application architecture coupled with the portability of containers across data center, cloud, and edge is driving a movement to democratize data (telemetry). One result is the rapid and robust move toward OpenTelemetry as the de facto standard in generating and ingesting telemetry across infrastructure, platforms, and applications (microservices).
A variety of surveys, research, and anecdotal evidence indicate that microservices are growing as a primary architecture within organizations. A report from JRebel found that nearly half (49%) of Java developers were using “microservices as the architecture for their main application(s).” Our own research shows a slight year-over-year increase in the share of microservices (container-native) architectures in the enterprise app portfolio. While microservices haven't taken over the enterprise app portfolio, they are growing in use.
So are the challenges in operating them.
Understanding what’s going on in an environment as complex as a container cluster, where workloads are in constant motion and must be continuously monitored, is a significant challenge. Our research found that the top insight missing from existing monitoring and visibility solutions is root-cause analysis of performance problems.
It’s not that operations and developers don’t know there’s a problem; they don’t have the information necessary to figure out what’s causing it.
An IBM report this year similarly noted this frustration with microservices, finding that “51% of respondents had difficulty predicting performance in production environments.” The aforementioned JRebel research found that 40% of respondents had trouble with monitoring in container environments, specifically:
- 14% struggle with troubleshooting inter-service performance issues
- 14% are challenged by scaling and monitoring in production
- 12% have trouble understanding the performance of the distributed system
These difficulties arise from the need to stitch together multiple performance agents and libraries to gain the visibility needed to know there's a problem. The proliferation of agents and libraries is not just a management nightmare; it also makes it difficult to apply consistent processes, ensure compliance, and ultimately provide the information needed to find and fix the cause of performance problems.
The reality is that we have enough data. And the truth is that value is not found in collecting that data or even alerting on binary metrics like "up/down" or "fast/slow." Value is derived from analysis and insights that help operators and developers quickly – and accurately – identify the root cause of a problem and address it.
This is a significant driver behind the shift toward OpenTelemetry from just about every observability platform and solution provider. In 2019, the Cloud Native Computing Foundation (CNCF) founded the OpenTelemetry project by merging two vendor-neutral open source telemetry projects – OpenTracing and OpenCensus – to help reduce the effort required to keep up with changes and keep track of what’s going on in modern app environments.
In February 2021, the OpenTelemetry Specification v1.0.0 was released. While OpenTelemetry encompasses metrics, logs, and traces, the OpenTelemetry Specification v1.0.0 release is focused on distributed tracing. Distributed tracing is considered the most efficient way to track microservice interactions in the highly dynamic environment of a container cluster. By normalizing – standardizing – on OpenTelemetry, everyone can instrument, generate, and collect telemetry without the extensive processing and transformation of data generated by today’s monitoring menagerie of agents.
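To make the idea of distributed tracing a little more concrete, here is a minimal sketch in plain Python of how a trace context might travel from one microservice to the next. It does not use the OpenTelemetry SDK; it only mimics the W3C Trace Context `traceparent` header format (`version-traceid-spanid-flags`) that OpenTelemetry propagates by default. The helper names `new_traceparent`, `parse_traceparent`, and `child_traceparent` are hypothetical and exist only for this illustration.

```python
import secrets
from typing import NamedTuple


class TraceContext(NamedTuple):
    trace_id: str   # 32 hex chars, shared by every span in one request
    span_id: str    # 16 hex chars, unique to a single operation
    sampled: bool   # whether the trace is flagged for recording


def new_traceparent() -> str:
    """Start a new trace at the first service to see the request (hypothetical helper)."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    return f"00-{trace_id}-{span_id}-01"


def parse_traceparent(header: str) -> TraceContext:
    """Split a traceparent header into its four dash-separated fields."""
    parts = header.split("-")
    if len(parts) != 4:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, span_id, flags = parts
    if version != "00" or len(trace_id) != 32 or len(span_id) != 16 or len(flags) != 2:
        raise ValueError(f"malformed traceparent: {header!r}")
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(incoming: str) -> str:
    """Keep the caller's trace id but mint a new span id for the downstream call."""
    ctx = parse_traceparent(incoming)
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{secrets.token_hex(8)}-{flags}"


if __name__ == "__main__":
    edge = new_traceparent()              # header created by the edge service
    downstream = child_traceparent(edge)  # header sent on to the next service
    print(edge)
    print(downstream)
```

Because every hop reuses the same trace id while generating its own span id, a back-end can reassemble the full path of a request even as containers come and go.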
OpenTelemetry supports exporting telemetry to a variety of open-source and commercial back-ends, which analyze it and generate the insights operators and developers need, such as the root cause of performance degradation or more accurate predictions about how microservices will perform in production.
Will broad adoption of OpenTelemetry fix the microservices mess? No, not on its own. But OpenTelemetry is a step forward – the foundation if you will – to the solutions and services that will fix the mess. Full-stack support for OpenTelemetry would make it possible to discern whether a slow user experience is due to network latency, platform problems, or poor application design.
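As a rough illustration of the kind of analysis full-stack telemetry could enable, the sketch below takes the spans of a single trace and computes each span's self time (its duration minus the time spent in its direct children), then sums that by layer. The `Span` record, the `layer` labels, and the timings are all invented for this example; real spans would come from an OpenTelemetry back-end, and the calculation assumes a span's children do not overlap in time.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]
    layer: str        # hypothetical label: "network", "platform", or "application"
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def self_time_by_layer(spans: List[Span]) -> Dict[str, float]:
    """Sum each span's self time (duration minus direct children) per layer.

    Assumes children of a span run one after another, not concurrently.
    """
    child_time: Dict[str, float] = {}
    for span in spans:
        if span.parent_id is not None:
            child_time[span.parent_id] = child_time.get(span.parent_id, 0.0) + span.duration_ms

    totals: Dict[str, float] = {}
    for span in spans:
        self_ms = max(span.duration_ms - child_time.get(span.span_id, 0.0), 0.0)
        totals[span.layer] = totals.get(span.layer, 0.0) + self_ms
    return totals


def slowest_layer(spans: List[Span]) -> str:
    """Return the layer that accounts for the most self time in the trace."""
    totals = self_time_by_layer(spans)
    return max(totals, key=lambda layer: totals[layer])


# Invented trace: one request handled by an app, with a network hop and a platform call.
EXAMPLE_TRACE = [
    Span("a", None, "application", 0.0, 500.0),
    Span("b", "a", "network", 10.0, 60.0),
    Span("c", "a", "platform", 60.0, 400.0),
    Span("d", "c", "application", 80.0, 120.0),
]

if __name__ == "__main__":
    print(self_time_by_layer(EXAMPLE_TRACE))
    print(slowest_layer(EXAMPLE_TRACE))
```

In this made-up trace the platform layer accounts for most of the elapsed time, which is the sort of distinction that is hard to draw when each layer reports through a different agent.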
Much of the mess associated with microservices comes down to simply understanding what’s going on and what’s going wrong. OpenTelemetry combined with analysis that produces actionable insights will go a long way toward making microservices more manageable.