How to Avoid Microservice Observability Traps

Author: Eric D. Schabell (LinkedIn)
Bio: Eric is Chronosphere’s Director of Evangelism. He’s renowned in the development community as a speaker, lecturer, author and baseball expert. His current role allows him to help the world understand the challenges they are facing with cloud native observability. He brings a unique perspective to the stage with a professional life dedicated to sharing his deep expertise in open source technologies and organizations.

Introduction

As organizations place big bets on cloud native development and infrastructure, the life of developers has jumped into hyperdrive. Where they once released new features on a semi-annual basis, developers now deploy to production daily or even hourly. They have moved from large-scale monolithic applications and tightly coupled services to loosely coupled microservices with hard-to-trace interactions.

Traditionally, monitoring was used to keep an eye on infrastructure that was far less dynamic than today’s complex and difficult-to-maintain environments. With auto-scaling infrastructure based on containers and the Kubernetes platform, developers and engineers are challenged by declarative instructions that synchronize their microservices code to an infrastructure end state. This new cloud native observability is changing the way organizations that wish to succeed in challenging economic climates approach their cloud environments. Dynamic infrastructure and service creation based on usage demands lead to an explosion of microservice observability data that challenges even the best of us.

To meet these new microservice observability challenges, here are the most common traps that you can proactively avoid.

Preoccupation with Observability Pillars

Any time there is a transition between generations in technology, there is a strong instinct to take the thing that has worked so far and try to drag it into the present day. This often takes the form of focusing on the features that worked well in the past instead of on overall business outcomes.

With the transition to observability for microservices, it’s been no different. Many have tried to focus the discussion around three pillars used to tackle these challenges: metrics, tracing, and logs. This discussion struggles to address the sheer volume of data generated by applying the old application performance monitoring model to microservices. It ignores the complex integrations needed to monitor across massively scaled infrastructure in the cloud native world. It just focuses on three simple items in the technology realm without a thought to delivering on your organization’s microservice promises.

We all want better business outcomes for our organizations, such as faster remediation of problems, easier problem detection, greater revenue generation, happier customers, and engineering teams that can remain focused on delivering more business value. The problem with the three pillars is that you are talking about lower level tooling and not about solutions. Imagine talking about all the great and expensive tools you can buy to work on the house you just bought while, in the background, the house is sagging on a bad foundation, bugs are eating through the walls, and sparks fly out of the wiring when you flip a light switch. You need to use the data you have about the house to remediate the urgent problems: spray for termites, replace the sagging foundation floor beams, and replace the old faulty wiring, rather than buy a fancy new hand plane. In the same way, a solution-oriented approach focuses on maintaining a functional, agile, and cost-effective infrastructure instead of focusing on tools and features.

Your microservice observability needs are much better served by an approach designed for better business outcomes and for fulfilling your promises to customers, one that focuses on three phases of observability. The first phase is knowing that a problem is happening as fast as possible, which might even lead to fixing it immediately. If not, the second phase is triaging based on specific information related to the problem, which quickly leads to a fix. Finally, you want a deep understanding of the issue you just encountered to ensure it never happens again.

None of these phases require you to focus on data types or specific technology details. They do need you to have an observability solution in place that can provide sharply focused insights and put enough information at your fingertips for you to make informed decisions quickly.

Failure to Embrace Open Source Standards

Ask any architect about building for the long term and the universal answer will be to look for open standards when considering adding any new components or systems to your infrastructure. They search for answers to questions like: does the candidate component under consideration adhere to some defined open standard? Does it at least conform to using open standards?

When an open standard exists, or in some early cases an open consensus where everyone centers around a technology or protocol, it ensures you always have an exit strategy. An exit strategy means having an easy way out of any technology choice you make, so you can swap it out in the future.

An example of one such standard is the Open Container Initiative (OCI) for container tooling in a cloud native environment. When ensuring your organization’s architecture uses such a standard, all components and systems interacting with your containers become replaceable by any future choices you might make as long as they follow the same standard. This creates choice, and choice is a good thing!

As you approach your microservice observability solution, there are many open source projects to help you tackle the initial tasks. Many are closely associated with the Cloud Native Computing Foundation (CNCF) as projects and promote open standards where possible. Some of them have even become an unofficial open standard by their default usage in observability solutions.

Prometheus

Prometheus is a graduated project under the CNCF umbrella, which is defined as “…considered stable and used in production.” It’s listed as a monitoring system and time series database, but the project site itself advertises that it is used to power your metrics and alerting with the leading open source monitoring solution.

What does Prometheus do for you? It provides a flexible data model in which each time series, a sequence of data points indexed in time order, is identified by a metric name and optional key-value labels. Time series are stored in memory and on local disk in an efficient format. Scaling is done by functional sharding, which splits data across servers, and by federation. The metrics data is queried with a powerful query language called PromQL. Alerts for your systems are set up using this query language, and notifications are handled by the provided Alertmanager.
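To make the data model concrete, here is a minimal Python sketch of time series identified by a metric name plus labels, with samples kept in time order. It is an illustration only and not how Prometheus is implemented: the class name `TinySeriesStore`, the metric `http_requests_total`, the label values, and the threshold check standing in for an alert rule are all assumptions made for this example.

```python
from bisect import insort
from collections import defaultdict


class TinySeriesStore:
    """Toy in-memory store: one list of (timestamp, value) samples per series."""

    def __init__(self):
        # Key: (metric name, sorted label pairs). Value: samples in time order.
        self._series = defaultdict(list)

    @staticmethod
    def _key(name, labels):
        # Sorting the label pairs makes the identity independent of dict order.
        return (name, tuple(sorted(labels.items())))

    def append(self, name, labels, timestamp, value):
        # insort keeps samples ordered by timestamp even if they arrive late.
        insort(self._series[self._key(name, labels)], (timestamp, value))

    def latest(self, name, labels):
        samples = self._series.get(self._key(name, labels))
        return samples[-1][1] if samples else None

    def series_count(self):
        return len(self._series)


# Hypothetical usage: two label sets under one metric name give two series.
store = TinySeriesStore()
store.append("http_requests_total", {"service": "cart", "code": "500"}, 1700000060, 7.0)
store.append("http_requests_total", {"service": "cart", "code": "500"}, 1700000000, 3.0)
store.append("http_requests_total", {"service": "cart", "code": "200"}, 1700000000, 120.0)

errors = store.latest("http_requests_total", {"code": "500", "service": "cart"})
should_alert = errors is not None and errors > 5  # hypothetical alert threshold
```

Every distinct combination of label values creates a new series, which is one reason dynamically scaled services can multiply the amount of data you store.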

There are multiple modes provided for visualizing the collected data: a built-in expression browser, integration with Grafana dashboards, and a console templating language. There are also many client libraries available to help you easily instrument existing services in your architecture. If you want to import existing third-party data into Prometheus, there are many integrations available for you to leverage. Each server runs independently and relies only on local storage, which makes it an easy and reliable starting point out of the box. It’s written in Go, and all binaries are statically linked for easy deployment and performance.
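Instrumented services typically expose their metrics as plain text for Prometheus to scrape. The sketch below renders a single counter in that text exposition format using only the Python standard library. In practice you would normally use an official client library instead; the function name `render_counter` and the sample metric are assumptions for illustration, and label-value escaping is deliberately left out.

```python
def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    samples: list of (labels dict, value) pairs.
    Note: this sketch does not escape special characters in label values.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric for a service handling HTTP requests.
output = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [({"method": "get", "code": "200"}, 1027)],
)
print(output)
```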

OpenTelemetry

Another up-and-coming project is called OpenTelemetry (OTEL). Found in the incubating section of the CNCF site, it’s a very fast growing project with a focus on “high-quality, ubiquitous, and portable telemetry to enable effective observability.”

This project helps you generate telemetry data from your applications and services, then forward it in what is now considered a standard form, called the OTEL Protocol (OTLP), to a variety of monitoring tools. To generate the telemetry data you first have to instrument your code, but OTEL makes this very easy with automatic instrumentation available for many existing languages.
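To give a feel for what instrumentation records, here is a standard-library-only Python sketch of nested trace spans. It is not the OpenTelemetry API, and real SDKs handle context propagation and export very differently. The `span` helper, the `finished_spans` list standing in for an exporter, and the `checkout` function are all hypothetical.

```python
import time
import uuid
from contextlib import contextmanager

finished_spans = []  # stand-in for an exporter that would ship spans elsewhere
_active = []         # stack of currently open spans (single-threaded sketch)


@contextmanager
def span(name):
    """Record one unit of work, linked to its parent span if there is one."""
    parent = _active[-1] if _active else None
    record = {
        "name": name,
        # Child spans share the trace id of their parent.
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent["span_id"] if parent else None,
        "start": time.time(),
    }
    _active.append(record)
    try:
        yield record
    finally:
        _active.pop()
        record["end"] = time.time()
        finished_spans.append(record)


def checkout():
    # Hypothetical service call with one nested operation.
    with span("checkout"):
        with span("charge-card"):
            time.sleep(0.01)


checkout()
```

The shared trace id and the parent link are what let a tracing backend reassemble one request as it crosses several microservices.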

Jaeger

Before OTEL was on the scene, the CNCF project Jaeger provided a distributed tracing platform that has targeted the cloud native microservice industry.

“Jaeger is open source, end-to-end distributed tracing. Monitor and troubleshoot transactions in complex distributed systems.”

While this project is fully matured, it was built around an older protocol. It recently retired its classic client libraries and advises users to migrate to its native support for the OTEL Protocol standard.

Reliance on Consumption Based Models

Understanding the payment model for your microservice observability is something that many organizations fail to fully wrap their heads around. For many years, before containers and cloud native environments became the new normal, users paid for their observability data with a consumption based model. All the data you collected, or ingested, was saved for possible later use in ad hoc queries, dashboards, or alerts. This was a good model when you knew the size of your infrastructure, the usage you could expect, and everything remained within expectations.

Enter the world of cloud native with Kubernetes, containers, and microservices all running in a dynamic and automatically scalable environment. This has led to organizations still tied to consumption based models ending up with bills that boggle the mind. As Martin Mao pinpoints, “It’s remarkable how common this situation is, where an organization is paying more for their observability data (typically metrics, logs, traces, and sometimes events), than they do for their production infrastructure.”

These higher costs would not be a hot topic of conversation except for the fact that they are not leading to better outcomes. Martin continues, “If these organizations could draw a straight line from more data to better outcomes — higher levels of availability, happier customers, faster remediation, more revenue — this tradeoff might make sense.”

But they don’t make sense when there is no value.

What these organizations need is an observability solution that puts them in charge of detecting which ingested data is valuable to them, aggregating away the data that has little or no value, and saving only the data they find valuable. Your solution needs some sort of control plane that puts knobs and dials at your fingertips, displays all the ingested data that is not used in any queries, alerts, or dashboards, and allows you to control which data is saved.
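The two ideas in this paragraph, finding ingested data that nothing uses and aggregating away low-value detail, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions and not any vendor’s control plane: the function names, metric names, label names, and sample counts are all hypothetical.

```python
def find_unused_metrics(ingested, queries, alerts, dashboards):
    """Return ingested metrics that nothing references, largest volume first.

    ingested: dict of metric name -> number of samples ingested.
    queries, alerts, dashboards: sets of metric names each one references.
    """
    referenced = set(queries) | set(alerts) | set(dashboards)
    unused = {name: count for name, count in ingested.items() if name not in referenced}
    return sorted(unused.items(), key=lambda item: item[1], reverse=True)


def aggregate_away(samples, drop_label):
    """Sum values across one label to reduce the number of stored series."""
    totals = {}
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k != drop_label))
        totals[key] = totals.get(key, 0) + value
    return totals


# Hypothetical inventory of what is ingested versus what is actually used.
ingested = {
    "http_requests_total": 120_000,
    "jvm_gc_pause_seconds": 45_000,
    "debug_cache_probe": 300_000,
}
unused = find_unused_metrics(
    ingested,
    queries={"http_requests_total"},
    alerts={"http_requests_total"},
    dashboards={"jvm_gc_pause_seconds"},
)

# Dropping the per-pod label collapses two series into one per service.
per_service = aggregate_away(
    [({"service": "cart", "pod": "cart-1"}, 3), ({"service": "cart", "pod": "cart-2"}, 4)],
    drop_label="pod",
)
```

In this made-up inventory the largest metric by volume is the one nobody queries, which is the kind of finding that makes a consumption based bill hard to justify.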

This model puts you back in control of your data and observability costs.

In an effort to cut costs, DevOps organizations might turn to Do-It-Yourself (DIY) observability solutions. This is not necessarily a bad thing when you first start out, as small-scale observability solutions are manageable. The troubles come when you scale up and find that your DevOps teams are losing more and more resources to maintaining the microservice observability solution instead of delivering new microservices. There is a definite tipping point between the initial DIY observability cost savings and the scaled-out Frankenstein monster of observability infrastructure that adds to your team’s costs, increases complexity, and steals away headcount.

Conclusion

The challenges of cloud native and microservices have become a real concern for organizations as they transition to the cloud. They are finding out that traditional monitoring solutions are not effective or efficient enough for their new dynamic microservices observability needs.

Organizations commonly fail to avoid three traps. The first is getting sucked into the preoccupation with observability pillars, a tendency to focus on technology and not business outcomes. The second is a failure to embrace open source standards. Remember, whether you have an exit strategy is the first question you should always ask. Finally, a reliance on traditional consumption based models leads to the painful realization that it’s not the way to frame today’s microservice observability solutions in the cloud. With great observability comes great power, but it requires avoiding these three traps.
