The value of profiling for your business

Author: Morgan McLean, Senior Director, Product Management at Splunk, a Cisco Company

Bio: Morgan McLean is a Senior Director of Product Management at Splunk, a Cisco Company. He is also a co-founder of OpenCensus and OpenTelemetry, the latter now the second-largest CNCF project behind only Kubernetes.

Improving the performance and reliability of complex, distributed systems, and the pace at which they can be improved, is a constant challenge for developers, SREs and DevOps teams. In today’s fast-paced environment, where technologies change daily and a slow page load can be almost as detrimental to customer experience as a full site outage, all businesses need a clear picture of how their systems are behaving. This is where observability comes in.

Observability is the practice of collecting and analyzing telemetry data from applications to reduce downtime and improve digital resiliency. An observability practice gives a company full insight into application issues and visibility into what is causing them. Faster insight and diagnosis of root causes mean faster resolutions with minimal customer impact, so that company’s engineering teams can spend more time building and releasing new features instead of toiling against outages and poor performance.

Building a leading observability practice through OpenTelemetry

How does one go about building a leading observability practice? It starts with OpenTelemetry, an open-source, vendor-agnostic framework that provides a standardized way of instrumenting systems to collect the telemetry data needed for complete end-to-end observability. OpenTelemetry enables organizations to own their data and send it wherever and whenever they want, so they can pivot and evolve their observability practice as business needs change and grow.

If you’re already practicing observability, chances are you’re working with tools like infrastructure monitoring and application performance monitoring (APM). With the development of distributed systems and cloud environments, these types of solutions have come a long way. Organizations now have more insight into their systems than ever before, and most problems are faster and easier to solve.

But some problems remain hard to solve. Root causes of issues around CPU, memory usage and general resource exhaustion can be tedious to diagnose, especially if they occur in code shared across multiple services. For example, in a banking application like Credit Karma that tracks customer financial transactions, it might be easy to see resource spikes and the impact of slow load times for recent transactions in an observability platform. It is much more difficult to figure out why these issues are happening. So how do organizations get insight into the “why” for optimal performance and customer experience? The answer is profiling.

Profiling to the rescue

Profiling is a capability that monitors the compute resources used by application code. It helps SREs, developers and DevOps teams troubleshoot faster by providing insight into issues like CPU exhaustion, which can signal that downtime is on the horizon. Profiling gives direct visibility into the exact functions in an application’s code that might be consuming too many resources, so teams can quickly identify why problems are occurring. In our Credit Karma example, if customers are experiencing slow transaction load times and development teams are seeing increased resource usage, profiling lets those teams pinpoint the specific lines of code responsible for exhausting resources and fix them.
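To make the idea of function-level visibility concrete, here is a minimal sketch using Python’s standard-library cProfile module. It is not how any particular observability product works, and the function names (load_recent_transactions, render_page) are hypothetical stand-ins for a slow transaction-loading path.

```python
import cProfile
import pstats


def load_recent_transactions(n):
    """Hypothetical hot path: a deliberately quadratic duplicate check."""
    seen = []
    for i in range(n):
        if i not in seen:  # linear scan of a list inside a loop
            seen.append(i)
    return seen


def render_page(n):
    """Hypothetical caller that does little work of its own."""
    return len(load_recent_transactions(n))


def profile_call(func, *args):
    """Run func under cProfile and return (result, collected stats)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    return result, pstats.Stats(profiler)


def hottest_function(stats):
    """Return the name of the function with the largest own time."""
    # stats.stats maps (file, line, name) -> (cc, nc, tottime, cumtime, callers)
    key = max(stats.stats, key=lambda k: stats.stats[k][2])
    return key[2]


if __name__ == "__main__":
    result, stats = profile_call(render_page, 2000)
    stats.sort_stats("tottime").print_stats(5)
    print("hottest:", hottest_function(stats))
```

Sorting by own time (tottime) rather than cumulative time points at the function doing the expensive work itself, not just the callers that wait on it.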

Working with profiling used to be a heavy, expensive and extremely manual lift. It carried significant performance overhead and was difficult to make work on real-world data. Now the adoption process is much closer to set-it-and-forget-it: profiling can be configured to run in the background so engineering teams can see right into the lines of code impacting service behavior. Profiling data can then be used to optimize application performance, proactively detect issues with resource consumption and ultimately improve the overall customer experience.
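Background profilers typically keep overhead low by sampling: instead of tracing every call, they periodically record what the application is executing. The sketch below illustrates that idea with a background thread, assuming CPython (it relies on sys._current_frames, a CPython-specific function). The class and function names are hypothetical, and production profilers are far more sophisticated.

```python
import collections
import sys
import threading
import time


class SamplingProfiler:
    """Periodically records which function a target thread is executing."""

    def __init__(self, target_thread_id, interval=0.005):
        self.target_thread_id = target_thread_id
        self.interval = interval  # seconds between samples
        self.samples = collections.Counter()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            # Look up the innermost Python frame of the target thread.
            frame = sys._current_frames().get(self.target_thread_id)
            if frame is not None:
                self.samples[frame.f_code.co_name] += 1
            self._stop.wait(self.interval)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()


def busy_work(duration):
    """Hypothetical CPU-bound function standing in for a hot code path."""
    deadline = time.monotonic() + duration
    total = 0
    while time.monotonic() < deadline:
        total += 1
    return total


def profile_in_background(func, *args):
    """Sample the calling thread while func runs; return the sample counts."""
    profiler = SamplingProfiler(threading.get_ident())
    profiler.start()
    try:
        func(*args)
    finally:
        profiler.stop()
    return profiler.samples


if __name__ == "__main__":
    for name, count in profile_in_background(busy_work, 0.5).most_common(3):
        print(name, count)
```

The sampling interval is the main trade-off: a shorter interval gives finer detail at the cost of more overhead, which is why the application code itself needs no changes to be observed this way.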

OpenTelemetry profiling

While profiling is available in some backend observability platforms, it’s often a niche feature that is poorly integrated into the main offering or only available as a standalone product. Adding profiles as a first-class signal to OpenTelemetry will make profiling available to everyone, as OpenTelemetry has already done with distributed tracing. Code-level profiling data will live alongside existing traces, metrics and logs, making it easier to understand the relationship between that telemetry and the corresponding lines of code for faster issue detection and resolution.

Earlier this year, the OpenTelemetry project hinted at a GA release of its profiling signal, and there’s anticipation to see a stable 1.0 release in the wild soon. Contributions such as Elastic’s donation of its eBPF-based continuous profiling agent and Splunk’s donation of its .NET-based profiler are helping to establish profiling as a core OpenTelemetry signal. Additions like these, along with a ton of other amazing work going into profiling support, will make it possible for SREs, developers and DevOps teams to move easily between telemetry data and the correlated profiling data for deeper code-level insight. That means faster issue resolution, optimized application performance and improved customer experience.

Conclusion

The greater visibility that profiling provides into system behavior gives organizations actionable insights to optimize performance. It makes the lives of development teams easier by speeding up troubleshooting and helping keep systems performant and reliable. With support for profiling as a first-class citizen in OpenTelemetry coming soon, it will be exciting to see how this further improves application performance, troubleshooting and overall customer experience.

At Splunk, a Cisco Company, we’re convinced that open standards like OpenTelemetry are the future. We are dedicated to furthering the advancement and adoption of projects that prioritize data ownership. To read more about OpenTelemetry’s profiling support, check out the “OpenTelemetry announces support for profiling” post. If you’re interested in experimenting with AlwaysOn Profiling for Splunk APM, head over to the Observability docs. You can also visit us in person at Booth #D5 at KubeCon + CloudNativeCon North America in Salt Lake City, Utah, on November 12-15, 2024.
