Day in and day out, we’ve been hearing the same thing from SREs across domains: our telemetry volumes are out of control, our Datadog/Splunk/New Relic bill is becoming a line item that gets brought up in board meetings, and we’re spending so much time managing the tooling that we barely have time to do actual reliability work anymore.
And AI coding assistants have taken the issue to another level. The pace of change has massively increased. Small feature changes often trigger a cascade of follow-on changes and tweaks. New services spin up at breakneck speed, each with its full-stack complement of Kubernetes clusters, databases, nodes, containers, and applications… and all their observability. This rapid pace of change isn't just growing data volumes exponentially; it's also burying SREs under an enormous backlog of rules and configurations needed to manage that data.
Your Vendors Profit From This Problem
Vendors have a vested interest in not solving this problem. Their business models depend on all this data. Expecting them to solve your data volume problem is like asking your cigarette company to help you quit smoking.
So what do people usually try? Some variation of "just reduce the data." In practice, that means convincing developers to instrument less, sample more aggressively, or think harder up front about what data to send in the first place. Incentives are misaligned here too: developers want all the data in case they need it for troubleshooting. No one wants to wake up at 2am and find that the one clue to resolving a problem has been sampled away.
Process the Stream, Not the Source
What if, instead of fighting over what gets emitted, there were a way to dynamically decide what data passes through depending on context? Data gets created however developers want. An intelligent processing engine dynamically figures out what to aggregate, what to pass through, and what to divert to low-cost storage. Only the useful output hits your expensive observability platform.
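To make the three-way decision concrete, here is a minimal sketch of a per-record router. The `Record` fields, the `novelty` score, and the thresholds are all illustrative assumptions, not a real system's API:

```python
from dataclasses import dataclass

@dataclass
class Record:
    severity: str
    novelty: float  # hypothetical score: 0.0 = seen constantly, 1.0 = never seen before

def route(rec: Record) -> str:
    """Decide where a telemetry record goes, based on context."""
    if rec.severity in ("error", "critical") or rec.novelty > 0.8:
        return "platform"    # pass through to the expensive observability backend
    if rec.novelty < 0.2:
        return "aggregate"   # collapse into a count plus a representative sample
    return "datalake"        # divert raw data to low-cost storage

print(route(Record("info", 0.05)))   # aggregate
print(route(Record("error", 0.50)))  # platform
```

In a real engine the novelty score would come from a learned model over recent traffic rather than a hand-set threshold, but the routing shape is the same.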
Most telemetry data is collected purely as insurance and is extremely redundant. In our tests, roughly 95% of log messages are repeats with varying parameters. An ML model that identifies those patterns and aggregates each one down to a count plus a representative sample can save enormous costs. Meanwhile, the raw data lands in a low-cost data lake, so all of it remains available if it's needed during an incident.
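A simplified, rule-based stand-in for that pattern recognition: mask the variable tokens in each log line (numbers, hex IDs, quoted values) so repeated messages collapse onto a shared template, then keep a count and one raw sample per template. The masking rules here are illustrative; a production system would learn patterns rather than hard-code them:

```python
import re
from collections import defaultdict

# Order matters: mask hex IDs before bare numbers.
MASKS = [
    (re.compile(r"0x[0-9a-fA-F]+"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
    (re.compile(r'"[^"]*"'), "<STR>"),
]

def template_of(message: str) -> str:
    """Replace variable tokens so repeated messages share one template."""
    for pattern, token in MASKS:
        message = pattern.sub(token, message)
    return message

def aggregate(log_lines):
    """Collapse a batch of log lines into {template: {count, sample}}."""
    groups = defaultdict(lambda: {"count": 0, "sample": None})
    for line in log_lines:
        group = groups[template_of(line)]
        group["count"] += 1
        if group["sample"] is None:
            group["sample"] = line  # keep one representative raw line
    return dict(groups)

logs = [
    'user 1042 logged in from "10.0.0.7"',
    'user 998 logged in from "10.0.0.9"',
    "cache miss for key 0xdeadbeef",
    'user 7 logged in from "10.0.0.4"',
]
for template, group in aggregate(logs).items():
    print(group["count"], template)
```

Four lines become two templates; only the counts, templates, and one sample each need to reach the expensive platform, while the raw lines go to the data lake.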
From Observability Engineering to Reliability Engineering
Today, observability at scale is a configuration management problem. Every new service requires new manual configuration. As data changes and new patterns emerge, new rules need to be written and old ones modified. The platform team toils away fine-tuning configurations, alerts, and rules.
If we can reduce data volumes, extract signal from noise, and understand what's truly unique at scale, we can leverage LLMs to make semantic sense of the data… in real time.
But it goes further than that. What if we could just express intent at that point? “This service is tier-1 critical, full fidelity telemetry, 90-day retention.” “That one’s a batch job, just tell me if it fails.” “Flag anything that looks like PII in the payment service logs.” Let the system work out the routing rules and sampling rates and retention policies to make that happen.
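A hypothetical sketch of how those intent statements might compile down to concrete pipeline policy. The intent fields, policy keys, and service names are all invented for illustration; in the scenario above an LLM would produce the intent dictionary from natural language:

```python
# Declared intents, one per service (illustrative, not a real schema).
INTENTS = {
    "checkout-api": {"tier": 1},           # "tier-1 critical, full fidelity, 90-day retention"
    "nightly-report": {"kind": "batch"},   # "just tell me if it fails"
}

def compile_policy(intent: dict) -> dict:
    """Turn a high-level intent into routing, sampling, and retention rules."""
    if intent.get("tier") == 1:
        return {"sampling": 1.0, "retention_days": 90, "route": "platform"}
    if intent.get("kind") == "batch":
        # Drop everything except failure signals.
        return {"sampling": 0.0, "alert_on": ["job_failed"], "route": "archive"}
    # Default: light sampling, short retention, cheap storage.
    return {"sampling": 0.1, "retention_days": 14, "route": "archive"}

policies = {service: compile_policy(intent) for service, intent in INTENTS.items()}
print(policies["checkout-api"]["retention_days"])  # 90
```

The point is the division of labor: engineers state the "what" per service, and the system derives the "how" and keeps it current as services change.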
The technology stack for this basically exists today. Stream processing is battle-tested, ML pattern recognition works at scale without constant tuning, and LLM-powered interfaces are surprisingly good at turning "I want X" into actual pipeline config. Organizations putting the pieces together are seeing detection times drop from hours to minutes, and observability costs that grow sublinearly with scale rather than exponentially.
The rapid pace of change in code today, combined with scale, has engineers continuously toiling over observability. The next-generation reliability platform needs to eliminate that toil and let engineers write and deploy reliability applications rather than observability rules. That is the only way to keep reliability up in the face of the massive shift in software delivery we're seeing today.