Cloud Native AI Infrastructure Open Source

Why OpenTelemetry Is Now the Foundation for AI and Cloud Observability | Chris Aniszczyk, CNCF | TFiR

OpenTelemetry graduates at CNCF. Chris Aniszczyk explains what it means for AI agents, GPU-native clouds, and vendor-neutral observability.

By Monika Chauhan 11 minutes ago

0

Engineering teams instrumenting AI agents, inference workloads, and GPU-native infrastructure are running into observability gaps that traditional tooling was never designed to close. Distributed tracing standards that work for microservices do not automatically extend to ephemeral agents, model metadata, or the unique telemetry demands of GPU-first cloud providers. Without a vendor-neutral, broadly adopted specification, teams risk fragmentation at exactly the layer where visibility matters most.

In this interview on TFiR, Chris Aniszczyk, CTO at CNCF, covers the graduation of OpenTelemetry, the project’s evolution from a merger of two competing standards into the de facto observability foundation for cloud native and AI infrastructure, and what comes next for agentic and GPU-native workloads.

Guest: Chris Aniszczyk, CTO at CNCF
Show: TFiR

Here is what every platform engineer, SRE, and cloud architect needs to know.

Technical Deep Dive

Q: How did OpenTelemetry originate and what problem was it created to solve?

Chris Aniszczyk, CTO at CNCF, explains that OpenTelemetry emerged from two overlapping projects: OpenTracing, which focused on distributed tracing, and OpenCensus, which came out of Google and focused primarily on logs and metrics. Both communities were pursuing similar goals along different paths, causing confusion for end users who were often attempting to run both simultaneously. CNCF brokered a meeting at the Linux Foundation headquarters in San Francisco, seating both communities at the same table, and that meeting became the genesis of OpenTelemetry. OpenTracing was eventually deprecated and archived, and OpenCensus merged into the unified project.

“Our members and our end users were like, ‘This is just not useful. All y’all need to figure something out, combine it into one thing, and make logs, metrics, and traces first-class citizens.’” — Chris Aniszczyk, CTO, CNCF

Q: What does CNCF graduation actually mean for a project like OpenTelemetry?

Graduation is a formal stamp of approval from the CNCF Technical Oversight Committee, a body of technical experts drawn from across the industry. The TOC evaluates whether a project is widely used, vendor neutral, has contributors from multiple companies, has passed an independent security audit, and has demonstrated the ability to respond to security issues. Aniszczyk describes it as a market signal to end users and vendors that the project is not going away. Only a few dozen projects in the CNCF portfolio have reached this bar, including Kubernetes, Prometheus, Envoy, and Helm.

“To end users and other vendors it’s like, this is not going away anytime soon. If you haven’t looked at it already, you should definitely take a look.” — Chris Aniszczyk, CTO, CNCF

Q: Is OpenTelemetry now the Kubernetes of observability?

Aniszczyk states directly that OTel is the Kubernetes of the observability world. Every major traditional observability vendor, including Splunk, Datadog, Grafana, and Honeycomb, supports OTel by default. Every major programming language has an OTel SDK. Amazon recently announced that CloudWatch supports OTel natively. In the CNCF open source project velocity report, OTel ranks second only to Kubernetes for contribution velocity, and it ranks in the top 20 to 30 open source projects globally by contributions.

“It is the Kubernetes of the observability world. You look at all the traditional observability vendors, your Splunks, Datadogs, newer age ones, Grafanas, Honeycombs, they all support OTel by default.” — Chris Aniszczyk, CTO, CNCF

Q: How does OpenTelemetry give engineering teams vendor optionality?

Aniszczyk explains that instrumenting applications with OTel allows teams to route telemetry data to different vendors for different purposes, using Datadog for one use case and Grafana for another, without re-instrumenting the application. He compares this to what Kubernetes did for cloud portability: it does not eliminate migration work entirely, but it significantly reduces it. For organizations running legacy or homegrown observability stacks, OTel provides a standardized output format that can modernize existing systems and create a pathway away from proprietary tooling.

“It gives them a strong optionality and choice. Basically what Kubernetes did, hey, you could run on Google Cloud or you could run on your own private cloud, that level of choice is now offered for the observability part of your stack.” — Chris Aniszczyk, CTO, CNCF

Q: What are the four pillars of OpenTelemetry and when was the fourth added?

OpenTelemetry originally defined three pillars of observability: metrics, logs, and traces. The project has since added a fourth pillar called profiling, which covers CPU snapshot data and related runtime performance signals. Aniszczyk uses this as an example of how OTel has continued to evolve beyond its original scope as the community has stretched it to cover new requirements. The addition of profiling reflects the project’s responsiveness to practitioner needs rather than a fixed specification locked at founding.

“OTel added a fourth pillar called profiling, which is kind of like CPU snapshot data. So even OTel has evolved and grown over time.” — Chris Aniszczyk, CTO, CNCF

Q: How is OpenTelemetry evolving to support AI and agentic workloads?

Aniszczyk says AI workloads do not require a new observability pillar, but they do require new metadata extensions. Agents produce logs, require metrics, and need traceability across API calls, database interactions, and model invocations, the same fundamentals as microservices. What is new is the need to capture model identity, prompt data, and inference-specific context. Two active efforts are extending OTel to address this: one called OpenLLMetry and another called OpenInference, which extends OTel to support inference-based workloads. Aniszczyk expects this work to be pushed upstream into the core OTel specification over the following six to twelve months.

“Over the next six to 12 months you’re going to see more of that work pushed upstream and working with the OTel process and community to support these workloads.” — Chris Aniszczyk, CTO, CNCF

Q: What specific observability challenges do AI agents introduce that traditional tooling does not handle?

Aniszczyk identifies a distinct risk with AI agents that does not exist with conventional microservices: agents can take destructive actions, including deleting their own logs and traces. This means that passive observability, where telemetry is expected to persist and be queryable after the fact, cannot be assumed. For cloud providers and enterprises allowing agents to run on shared infrastructure, this demands sandboxing in combination with observability to detect anomalous behavior and support rollback before data is lost. The ephemeral and potentially self-modifying nature of agents requires observability systems to capture telemetry in real time rather than relying on post-hoc log retrieval.

“Agents are a little bit smarter because they could actually do ephemeral destruction. They could even delete their logs. So you need solid sandboxing and observability to make sure that if that agent does something naughty, you could actually at least know something’s happening and roll back.” — Chris Aniszczyk, CTO, CNCF

Q: Why do GPU-first cloud providers have an observability gap and how will OTel address it?

Aniszczyk describes a category of providers he calls neo-clouds or AI-native clouds, citing CoreWeave, Lambda, and Nebius as examples, that have built sophisticated GPU infrastructure but whose observability tooling does not meet the standards expected in cloud native environments. These providers focused on delivering GPU access at scale and did not build observability stacks to the same maturity level as hyperscalers or cloud native platforms. Aniszczyk expects OTel to evolve specifically to address these GPU-first use cases, extending instrumentation coverage to the infrastructure layer where AI training and inference workloads execute.

“The observability for these things is not what we would traditionally expect in the cloud native world. So I think you’ll see OTel evolve and support those use cases significantly.” — Chris Aniszczyk, CTO, CNCF

Q: How is AI being used within the OpenTelemetry and broader CNCF project ecosystem?

Aniszczyk confirms that AI coding assistants, including GitHub Copilot and cloud-based AI tools, are in use across CNCF projects. Many third-party AI tooling vendors provide access to open source maintainers, making adoption de facto across the portfolio. He flags one area requiring improvement: AI-generated security reports. Automated tools are producing a high volume of vulnerability reports, but some lack the context to evaluate whether a reported issue falls within the actual threat model of the project. CNCF is working to get better at filtering these AI-generated reports against project-specific threat models before they consume maintainer time.

“There’s been a lot of security reports because it’s easy for any developer to go find a security issue in a project, but maybe what they’ve issued is not a security issue because it wasn’t smart enough to look at the threat model.” — Chris Aniszczyk, CTO, CNCF

Q: What is the significance of OpenTelemetry reaching graduation in roughly seven years?

Aniszczyk frames the achievement in terms of competitive industry dynamics: building a shared standard adopted by vendors that compete at the billion and trillion dollar business level is exceptionally rare. He places OTel alongside Linux, Kubernetes, and PyTorch as examples of technology that achieved this level of cross-competitor participation. The seven-year timeline from the initial brokered meeting to graduation, combined with ranking second in the world for open source contribution velocity, marks OTel as a genuinely durable foundation. Aniszczyk states that CNCF will run special celebrations throughout the year to mark the milestone.

“In the software industry it is non-trivial to go build something used by a lot of different vendors that compete at the billion dollar, even trillion dollar business level. OTel did that essentially in about seven years, which is incredible.” — Chris Aniszczyk, CTO, CNCF

Q: How does the Linux Foundation approach staying ahead of technology cycles rather than becoming anchored to past successes?

Aniszczyk draws on his 25 years in open source, including 15 years at the Eclipse Foundation, to explain the pattern. Technologies like Eclipse and OpenStack each went through a major cycle of dominance followed by shift. His observation is that open source is positive sum for technology, but marketing attention and organizational investment are effectively zero sum, meaning communities and companies redirect focus as hype cycles move. The Linux Foundation’s response, illustrated by initiatives like the Agent AI Foundation, is to move toward where activity is heading rather than defending existing territory. Aniszczyk notes that all AI workloads still run on Linux and Kubernetes, which means the foundation’s core infrastructure remains foundational regardless of what runs on top.

“Open source is positive sum, but people’s marketing dollars and attention is kind of zero sum. Some companies will love Kubernetes and KubeCon, but maybe AI is the thing right now, so time and attention tends to move around.” — Chris Aniszczyk, CTO, CNCF

Resources & Documentation

OpenTelemetry at CNCF, official CNCF project page for OpenTelemetry including governance and graduation details
OpenTelemetry, official project site with SDKs, specifications, and documentation across all supported languages
CNCF Open Source Project Velocity Report, annual contribution velocity data referenced by Aniszczyk
OpenTelemetry Specification on GitHub, the core specification repository covering metrics, logs, traces, and profiling

***

👇 Click to Read Full Raw Transcript

Swapnil Bhartiya: Hi, this is Swapnil Bhartiya and we are here at Open Source Summit at Minneapolis and we have with us once again after a long time Chris Aniszczyk, CTO at CNCF. Chris, it’s great to have you back on the show.

Chris Aniszczyk: Awesome. Glad to be here in sunny Minneapolis.

Swapnil Bhartiya: But the most exciting thing is the graduation of OpenTelemetry. I remember my discussion with Morgan McLean and of course Ben Sigelman. I think it was back in 2018 in Denmark and Copenhagen at KubeCon when those two projects, of course OpenTracing and OpenCensus came together to create and back then they were talking about how there was a lot of overlap happening, how a lot of confusion was happening. So bringing this together is good for everybody. Now we are here, the project is graduating and all the discussion that I have in terms of observability and especially in the AI space, observability is becoming a very critical thing. OpenTelemetry keeps popping up. So it is becoming very, very exciting. I want to hear from you if you recall any early meetings where you guys were trying to bring these projects together. What were the struggles, or what was something that was either easy or difficult.

Chris Aniszczyk: I mean, if going back to the early days of CNCF with OpenTracing coming in as one of our earlier projects and OpenCensus eventually being born out of Google, those projects took different paths. OpenTracing, obviously hence its name, was very much focused on tracing. OpenCensus was very much focused on kind of the logs metrics problem. They were exploring tracing but there was overlap, but they were a little bit distinct. Sometimes people were using both and trying to get both to work. The two communities in the beginning were not as friendly to each other because I think what they were trying to do is eventually have their stake in the ground and say hey, ours is the best way. But the reality is our members and our end users were like this is just not useful. All y’all need to figure something out, like combine into one thing. And both have logs, metrics and traces as like first class citizens for telemetry data and have some de facto standards built around this. So when our members were complaining and our users were like, well what is CNCF good at, we’re good at neutrality and bringing people together and seeing what shakes out and allowing innovation to happen. So we brokered one of those early meetings. I still remember this day in the old Linux Foundation headquarters in San Francisco in kind of a U shaped table. We had OpenCensus on one side, OpenTracing on one side, and CNCF staff and TOC members in the middle trying to broker conversations of like, what is it going to take for you to work together? Where’s the overlap? What’s not working? Why do you think it’s not working? Eventually that meeting was the genesis of OpenTelemetry becoming a thing and OpenTracing eventually being deprecated and archived and OpenCensus merging into OTel. And if we look back, that small meeting, where about maybe 15 or so people were at, basically seven years later, we’re here today and OTel is supported by every major observability vendor. All the hyperscalers pretty much have some form of OpenTelemetry support. And with the rise of AI workloads and so on, there’s now work underway to bring potentially LLM monitoring or agent monitoring, ensuring the OTel specs support that use case. And also I even forgot, when OTel started it was always like three pillars of observability: metrics, logs, traces. Well, OTel added a fourth pillar called profiling, which is kind of like CPU snapshot data and so on. So even OTel has evolved and grown over time. But the important thing in my opinion is every major vendor, observability, hyperscaler is involved in some way. And it’s basically just like Kubernetes is at this point, where all the major vendors and players are involved and are working together and committed to making Kubernetes work, but also OpenTelemetry work for observability.

Swapnil Bhartiya: Excellent. Thanks for sharing that initial journey as well. Now I want to talk about telemetry and evolution. A lot of things I also want to talk about. I want to get this elephant out of the room. What does graduation mean for a project like OpenTelemetry which is already being used in production?

Chris Aniszczyk: Yeah, I mean, so graduation in CNCF could kind of be thought of as, we have about a couple dozen projects that have hit that bar. And really what it means is CNCF has a technical board called the TOC and these are technical experts from the industry in a variety of different domains. And they basically do a stamp of approval that this project is widely used, it is vendor neutral, it has folks involved from different companies. It has an independent security audit, proves that it can respond to security issues. And basically it’s their stamp that this stuff has the base criteria of what a successful long term sustainable open source project looks like. Basically to end users and other vendors it’s like, this is not going away anytime soon. And so we have projects like Kubernetes, Prometheus, Envoy, Helm, all widely used mature projects. And OTel is under that umbrella. So to me it’s more of a market signal to the rest of the world that this is for sure not going away. You should definitely, if you haven’t looked at it already for whatever reason, you should definitely go take a look at it. So it’s not necessarily reflective of how mature and how widely adopted the project is. OTel’s already been fairly widely adopted. It is the governance and the structures of how the open source project runs and is developed is really what it’s about. The sustainability, and that’s like a lot of different aspects. Like is it just one company doing all the work? Has there been security audits that have proved that the project could be versatile and respond to security? So those are the kind of things that the TOC looks for.

Swapnil Bhartiya: You mentioned Kubernetes and if you look at OTel, I talk to, of course, open source teams a lot. If you look at this graduation, would you consider that when it comes to observability, OTel has kind of hit the moment of Kubernetes or Linux kernel, that it is not just technology but typically the foundation of observability?

Chris Aniszczyk: It is the Kubernetes of the observability world. You look at all the traditional observability vendors, your Splunks, Datadogs, your newer age ones, Grafanas, Honeycombs, they all support OTel by default. Almost every major programming language has SDKs for OTel. The big hyperscalers have it. Amazon even last month recently announced that CloudWatch supports OTel natively now. So these are all big signals that it’s everywhere. And I talk to a lot of our end users in CNCF and they actually love OpenTelemetry because basically it allows them to do two things. One, it gives them a little bit more choice of how to choose vendors potentially. Like if internally we are instrumenting our applications using OTel, then maybe we could use Datadog for this, Grafana for this. It gives them a strong optionality and choice, which CNCF is all about. Basically what Kubernetes did, like hey you could run on Google Cloud or you could run on your own private cloud, that level of choice is now offered for the observability part of your stack. And is it super easy to move between things? No, not necessarily. There’s always a bit of work, just like Kubernetes, you can’t magically move to different clouds. You got to do a little bit of work but it’s significantly easier. The other thing I learned from some of our end users is a lot of people, especially with older stacks, a lot of regulated industries that have been around for a while, they have homegrown solutions. They built their own kind of observability and they have now modified those to go emit OTel related data. So that gives them a pathway to eventually move off of those homegrown solutions. So yeah, to me is the Kubernetes of the observability world and it’s reflective in the data, not only adoption but you look at the commits, the contributions. We have this open source project velocity report that we produce in CNCF. OTel is literally number two behind Kubernetes for contribution velocity. So it’s not just people using it and vendors supporting it. A lot of people are showing up. And OTel ranks as like a top 20 or 30 open source project worldwide in terms of contributions. It’s number two in CNCF but even worldwide it’s huge.

Swapnil Bhartiya: In all the interviews I have had the opportunity to do, OpenTelemetry keeps coming up. It doesn’t matter what the company is doing. Now you earlier mentioned initially there were three pillars, then a fourth pillar was added. Now AI is there, an AI vertical is there. And we hear that observability is very, very important here. How is OpenTelemetry either evolving or will evolve for the AI vertical?

Chris Aniszczyk: Yeah, I don’t think AI workloads necessitate a new pillar, it’s like a new use case. Because I think at the end of the day you’ll have things like agents, they obviously are going to produce a lot of logs, you’re going to need metrics to figure out what the agent is doing, you’re going to have to have traceability. So like, my agent kicked off an API call, hit a database, just like you need for traditional microservices. You need all those same things. There needs to be some improvements and maybe modifications to OTel to support extra metadata around like which model was used, which prompt, and other things that are not fully supported yet. And there actually are efforts out there that are either being done by the community or startups. There’s a couple of efforts. One of them is called OpenLLMetry, kind of like a fun spin on LLM. Another one is called OpenInference, which is extending OTel to support inference based workloads. This mostly is metadata around models and all that traceability. So that work is already being done. I think over the next six to 12 months you’re going to see more of that work pushed upstream and working with the OTel process and community to support these workloads. And any healthy project over time evolves with the community that shows up. Just like Linux was never meant to go into phones or into space, but here we are today. Kubernetes was never meant to go into edge devices or space either. And here we are today. OTel is being stretched by its community and I think once you have enough critical mass of support across vendors and users wanting to see it there, things naturally evolve to support.

Swapnil Bhartiya: Since you are name dropping, like Zephyr, that is celebrated 10 years and that is almost everywhere. So this technology, open source technology, I was talking to Hilary yesterday also, we don’t realize it today but when you look back at it, 10 or 15 years from now, the whole Linux Foundation has played a transfer. No one has ever done something like that ever. So whatever Jim has achieved is incredible.

Chris Aniszczyk: Yeah, LF is a very fascinating place. When I joined a little over 10 years ago to help start CNCF and OCI, I think the LF was like 30 people. Maybe now we’re around 400. And this foundation as a service thing was fairly new and it’s just like people want a safe home to collaborate on software without any worries and some basic infrastructure to ensure that things are fair, they don’t have to worry about any IP issues and the LF has been really good at figuring that out. Linux obviously helped pave the way but other stuff has come along.

Swapnil Bhartiya: I have been tracking the Linux Foundation even before its current form. But the beauty of the Linux Foundation is that you folks also sometimes stay ahead of the curve, like the Agent AI Foundation. That is also critical because a lot of foundations get stuck. They are comfortable with what they are doing, but you folks keep moving where the puck is going to be, not where it is.

Chris Aniszczyk: Yeah, I mean there are some of us who have been working in open source for a long time. It is 25 years for me. And I spent probably 15 years working on Eclipse technology and the Eclipse Foundation, which is one of the early kind of corporate open source foundations parallel to the Linux Foundation. And there was a time where the whole world was using Eclipse and then things went away. And our OpenStack friends had a similar cycle. So some of us who have worked here are very familiar with innovation cycles. And it’s not that open source is positive sum, but people’s marketing dollars and attention are kind of zero sum. So some companies will be like, we love Kubernetes and we love KubeCon. But maybe AI is the thing right now and so time and attention tends to move around. Everything technology is a hype cycle.

Swapnil Bhartiya: But it doesn’t matter. They will need the support and that’s when they look and that’s where the Linux Foundation is really there for that.

Chris Aniszczyk: We’re very good at that. And what’s funny in my world and CNCF is everyone is focused on AI and agents. And I say, all this stuff needs to run somewhere. It’s all running on Linux and Kubernetes.

Swapnil Bhartiya: One more thing I want to ask you, which is more or less, we were talking about AI leveraging OpenTelemetry. If you look at observability, AI is very good at analyzing these things. How is AI being used within the OpenTelemetry space? What is the scope of AI within the OpenTelemetry community and project?

Chris Aniszczyk: It depends what you mean by that. Almost every CNCF project has access to coding assistant tools, whether it’s Copilot or cloud AI. AI is being used all over the place across the board. I would have to look in depth at OTel specifically to see what they’re using but I guarantee you they’re playing with Copilot and cloud AI. We have a lot of third party tools that are provided to open source maintainers on behalf of companies out there. So it’s de facto all over CNCF. One area where we are looking for probably more assistance is that in the last few months there have been a lot of security reports because it’s easy for any developer to go find a security issue in a project and it will try to come up with something. But maybe what they’ve issued is not a security issue because the tool was not smart enough to look at the threat model to understand that this is out of scope. So we need to get better at that and we will. But for now I think OTel and many CNCF projects are really good at adopting AI.

Swapnil Bhartiya: Last question, we will wrap this up. How do you further see OTel evolve?

Chris Aniszczyk: So it will continue to evolve to where the community is stretching it. And right now I see it being stretched in two ways. One, the previous point we mentioned around supporting agentic workloads, how to properly instrument and trace the full life cycle of an agent from birth to it doing a bunch of things to it maybe eventually ending its life, and having all that fully traced. That is not an easy problem. So OTel will eventually go support that. The other case is I have recently been talking to a lot of neo cloud vendors or AI native clouds, like your Nebius, CoreWeaves, Lambdas of the world. Those GPU first vendors need a lot more observability in their lives. They have built amazing sets of infrastructure to provide GPU access to many folks. But the observability for these things is not what we would traditionally expect in the cloud native world. So I think you’ll see OTel evolve and support those use cases significantly. So both on the AI native clouds or neo clouds, whatever you want to call them, GPU first providers, and then also the agent layer world. Because an agent is basically a fancy microservice in some ways. I look at it because I’m a cloud native person. It comes, it disappears, it goes away. Agents are a little bit smarter because they could actually do ephemeral destruction. They could even delete their logs, they could remove their traces. So you have to be careful. And imagine if you’re a cloud provider or a company that allows people to run their own agents on your infrastructure, you definitely need some solid sandboxing and observability to make sure that if that agent does something naughty, you could actually at least know something is happening and roll back.

Swapnil Bhartiya: What is your graduation celebration message to all the contributors, thousands of contributors who have put their sweat into this?

Chris Aniszczyk: In the software industry it is very non-trivial to go build something used by a lot of different vendors that compete at the billion dollar, even trillion dollar business level. That doesn’t happen often in industry. You have things like Linux, Kubernetes, PyTorch and you have everyone involved. And OTel did that essentially in about seven years, which is incredible. So everyone involved should be extremely proud. They have basically built a technology that is going to last a long time and will be the foundation of not only what we did in cloud native, but also for the AI world. So that is something super to be proud of. It’s not easy to do. Most people never get an opportunity to have that level of impact. And we are going to be doing some special things throughout this year to celebrate OTel’s graduation because it is truly at the level of what I consider a Linux or Kubernetes in terms of industry impact and the amount of companies involved.

Swapnil Bhartiya: And sometimes these technologies evolve to become, they will actually outlive us.

Chris Aniszczyk: Yeah, that is a dream for many people. We want something that kind of lives forever in the corpus of software.

Swapnil Bhartiya: Chris, once again, thank you so much for sitting down with me and talking about OTel. I look forward to chatting with you again. Thank you.

Chris Aniszczyk: Anytime.

You may also like

How Self-Improving AI Works Without Human Intervention | Kunal Bhatia, Hexo Labs | TFiR

By Monika Chauhan23 hours ago

AI Infrastructure

Why HA Health Checks Fail as Clusters Grow | Trey Isaac, SIOS Technology | TFiR

By Monika Chauhan24 hours ago

Cloud Native

Why AI Agents Fail in Production and What the Meta Harness Actually Fixes | Amit Naik, CData | TFiR

By Monika Chauhan1 day ago

AI Infrastructure

85% of Domains Are Failing DNS Security Controls: Akamai’s Steve Winterfeld on the Hidden Threat | TFiR

By Monika Chauhan2 days ago

Why Cloud Development Feedback Loops Fail and How to Fix Them | Waldemar Hummer, LocalStack | TFiR

By Monika Chauhan2 days ago

AI Infrastructure

How Kubernetes 1.36 Handles GPU Scheduling, DRA, and Kubelet Security | Ryota Sawada, Kubernetes | TFiR

By Monika Chauhan2 days ago

AI Infrastructure