Day in and day out, we’ve been hearing the same thing from SREs across domains: our telemetry volumes are out of control, our Datadog/Splunk/New Relic bill is becoming a line item that gets brought up in board meetings, and we’re spending so much time managing the tooling that we barely have time to do actual reliability work anymore.
And AI coding assistants have taken the issue to another level. The pace of change has massively increased. Small feature changes often trigger a cascade of follow-on changes and tweaks. New services spin up at breakneck speed, each with its full-stack complement of Kubernetes clusters, databases, nodes, containers, and applications… and all their observability. This rapid pace of change isn't just growing data volumes exponentially; it's also burying SREs under an enormous backlog of rules and configurations needed to manage that data.
Your Vendors Profit From This Problem
Vendors have a vested interest in not solving this problem. Their business models depend on all this data. Expecting them to solve your data volume problem is like asking your cigarette company to help you quit smoking.
So what do people usually try? Some variation of "just reduce the data." In practice, that means convincing developers to instrument less, sample more aggressively, or think harder up front about what data to send in the first place. Incentives are misaligned here too: developers want all the data in case they need it for troubleshooting. No one wants to wake up at 2am and find that the one clue to resolving a problem has been sampled away.
Process the Stream, Not the Source
What if, instead of fighting over what gets emitted, there were a way to dynamically decide what data passes through depending on context? Data gets created however developers want. An intelligent processing engine dynamically figures out what to aggregate, what to pass through, and what to divert to low-cost storage. Only the useful output hits your expensive observability platform.
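To make the three-way decision concrete, here is a minimal sketch of a per-record router. The `Record` fields, the `novelty` score, and the thresholds are all illustrative assumptions, not a real system's API:

```python
from dataclasses import dataclass

@dataclass
class Record:
    severity: str
    novelty: float  # hypothetical score: 0.0 = seen constantly, 1.0 = never seen before

def route(rec: Record) -> str:
    """Decide where a telemetry record goes, based on context."""
    if rec.severity in ("error", "critical") or rec.novelty > 0.8:
        return "platform"    # pass through to the expensive observability backend
    if rec.novelty < 0.2:
        return "aggregate"   # collapse into a count plus a representative sample
    return "datalake"        # divert raw data to low-cost storage

print(route(Record("info", 0.05)))   # aggregate
print(route(Record("error", 0.50)))  # platform
```

In a real engine the novelty score would come from a learned model over recent traffic rather than a hand-set threshold, but the routing shape is the same.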
Most telemetry data is collected purely as insurance and is extremely redundant. In our tests, roughly 95% of log messages are repeats with varying parameters. An ML model that identifies those patterns and aggregates each one down to a count plus a representative sample can save enormous costs. Meanwhile, the raw data lands in a low-cost data lake, so all of it remains available if it's needed during an incident.
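A simplified, rule-based stand-in for that pattern recognition: mask the variable tokens in each log line (numbers, hex IDs, quoted values) so repeated messages collapse onto a shared template, then keep a count and one raw sample per template. The masking rules here are illustrative; a production system would learn patterns rather than hard-code them:

```python
import re
from collections import defaultdict

# Order matters: mask hex IDs before bare numbers.
MASKS = [
    (re.compile(r"0x[0-9a-fA-F]+"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
    (re.compile(r'"[^"]*"'), "<STR>"),
]

def template_of(message: str) -> str:
    """Replace variable tokens so repeated messages share one template."""
    for pattern, token in MASKS:
        message = pattern.sub(token, message)
    return message

def aggregate(log_lines):
    """Collapse a batch of log lines into {template: {count, sample}}."""
    groups = defaultdict(lambda: {"count": 0, "sample": None})
    for line in log_lines:
        group = groups[template_of(line)]
        group["count"] += 1
        if group["sample"] is None:
            group["sample"] = line  # keep one representative raw line
    return dict(groups)

logs = [
    'user 1042 logged in from "10.0.0.7"',
    'user 998 logged in from "10.0.0.9"',
    "cache miss for key 0xdeadbeef",
    'user 7 logged in from "10.0.0.4"',
]
for template, group in aggregate(logs).items():
    print(group["count"], template)
```

Four lines become two templates; only the counts, templates, and one sample each need to reach the expensive platform, while the raw lines go to the data lake.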
From Observability Engineering to Reliability Engineering
Today, observability at scale is a configuration management problem. Every new service requires new manual configuration. As data changes and new patterns emerge, new rules need to be written and old ones modified. The platform team toils away fine-tuning configurations, alerts, and rules.
If we can reduce data volumes, extract signal from noise, and understand what's truly unique at scale, we can leverage LLMs to make semantic sense of the data… in real time.
But it goes further than that. What if we could just express intent at that point? “This service is tier-1 critical, full fidelity telemetry, 90-day retention.” “That one’s a batch job, just tell me if it fails.” “Flag anything that looks like PII in the payment service logs.” Let the system work out the routing rules and sampling rates and retention policies to make that happen.
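A hypothetical sketch of how those intent statements might compile down to concrete pipeline policy. The intent fields, policy keys, and service names are all invented for illustration; in the scenario above an LLM would produce the intent dictionary from natural language:

```python
# Declared intents, one per service (illustrative, not a real schema).
INTENTS = {
    "checkout-api": {"tier": 1},           # "tier-1 critical, full fidelity, 90-day retention"
    "nightly-report": {"kind": "batch"},   # "just tell me if it fails"
}

def compile_policy(intent: dict) -> dict:
    """Turn a high-level intent into routing, sampling, and retention rules."""
    if intent.get("tier") == 1:
        return {"sampling": 1.0, "retention_days": 90, "route": "platform"}
    if intent.get("kind") == "batch":
        # Drop everything except failure signals.
        return {"sampling": 0.0, "alert_on": ["job_failed"], "route": "archive"}
    # Default: light sampling, short retention, cheap storage.
    return {"sampling": 0.1, "retention_days": 14, "route": "archive"}

policies = {service: compile_policy(intent) for service, intent in INTENTS.items()}
print(policies["checkout-api"]["retention_days"])  # 90
```

The point is the division of labor: engineers state the "what" per service, and the system derives the "how" and keeps it current as services change.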
The technology stack for this basically exists today. Stream processing is battle-tested, ML pattern recognition works at scale without constant tuning, and LLM-powered interfaces are surprisingly good at turning "I want X" into actual pipeline config. Organizations putting the pieces together are seeing detection times drop from hours to minutes, and observability costs that grow sublinearly with scale rather than exponentially.
The rapid pace of change in code today, combined with scale, has engineers continuously toiling over observability. The next-generation reliability platform needs to eliminate that toil and let engineers write and deploy reliability applications rather than observability rules. That is the only way to keep reliability up in the face of the massive shift in software delivery we're seeing today.