Prometheus is an open-source monitoring solution for distributed applications and containerized microservices. It is also the de facto standard for monitoring metrics in Kubernetes, where components emit metrics in Prometheus format.
As your company scales Kubernetes into production, setting appropriate limits is essential to avoid costly problems such as unbounded metric growth. Establishing these limits makes Prometheus sample ingestion more predictable and stable. This article introduces the available limits and explains how they protect your system when applied correctly.
Prometheus metric labels
Labels are key-value pairs associated with metrics to provide additional context. Labels allow for granular filtering, dynamic aggregation, and a richer context on metrics. However, adding labels without careful consideration can negatively impact Prometheus’ performance. Key points to remember before creating labels include:
- Each label generates a new time series, which consumes memory and increases storage requirements.
- High cardinality in labels can significantly slow down query execution.
To identify existing metrics with the highest cardinality, use the following command:
topk(10, count by (__name__, job)({__name__=~".+"}))
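Once a suspect metric surfaces, you can drill into which of its labels drives the cardinality by counting the distinct values of a single label (the metric and label names below are hypothetical):

count(count by (pod) (http_requests_total))

The inner count by (pod) produces one result per distinct pod value, and the outer count tallies how many there are.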
Limits to prevent label overload
If your company is trying to gain more insight from metrics, a common strategy is to add labels. For example, an engineering team might add a metric with 20 labels. Even if each label is low cardinality, with only two values on average, the combined cardinality can reach 2^20, or 1,048,576, time series for a single metric. Multiply that across many such metrics and you will see OOM crashes, disks filling up, and overall system instability.
To ensure teams cannot have this scale of impact, you can configure limits using the Prometheus operator in Kubernetes. Configure labelLimit and enforcedLabelLimit to set a soft and a hard limit on labels. The labelLimit field acts as the default soft limit, and individual service monitors can override it with their own labelLimit value. The enforcedLabelLimit, set on the Prometheus resource itself, is a hard cap on the number of labels any metric can carry.
When a target's metrics go over the set limits, the scrape fails and its samples are dropped. A metric named prometheus_target_scrape_pool_exceeded_label_limits_total is incremented when samples are dropped.
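As a sketch of how these fields fit together, assuming a prometheus-operator deployment (the resource names and values below are illustrative), a team-level soft limit lives on the ServiceMonitor while the cluster-wide hard cap lives on the Prometheus resource:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app              # hypothetical application
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
    - port: metrics
  labelLimit: 30                 # soft limit: scrapes fail if a metric carries more than 30 labels

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
spec:
  enforcedLabelLimit: 40         # hard cap applied to every scrape configuration

Teams can raise the soft limit in their own ServiceMonitor, but never above the enforced cap.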
Limits that block lengthy labels
As companies grow, they usually attempt to improve monitoring. With growth comes more related metrics, pushing teams to use more specific label names to differentiate the data. Lengthy label names and lengthy values can also cause OOM crashes, filled disk space, and system instability. This occurs because label names and values are stored with every series in the time series database, so longer strings increase memory and index size.
Configure limits using the spec fields labelNameLengthLimit and enforcedLabelNameLengthLimit for soft and hard limits on label names, and labelValueLengthLimit and enforcedLabelValueLengthLimit for soft and hard limits on value length. Also, note the value-length limits affect metric names, since a metric's name is stored as a value under the __name__ label.
When metrics exceed the set limits, target samples are dropped. No metric indicates a limit overage for name or value lengths, but a log message will be present and can be used to raise an alert.
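A minimal sketch of these length limits, again assuming a prometheus-operator deployment (all names and values below are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app                     # hypothetical application
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
    - port: metrics
  labelNameLengthLimit: 60              # soft limit, characters per label name
  labelValueLengthLimit: 120            # soft limit, characters per label value

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
spec:
  enforcedLabelNameLengthLimit: 80      # hard cap on label name length
  enforcedLabelValueLengthLimit: 200    # hard cap on value length; also bounds metric names via __name__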
Putting a cap on your sample size
Each label value also adds a time series. For example, a label with ten values produces ten time series. It is critical to design labels and values with low cardinality and never use labels for unique identifiers such as email addresses or a userId. Histograms multiply this further: each label combination produces one series per bucket plus the _sum and _count series, roughly 12 with a typical bucket configuration. Our ten-value label then yields a cardinality of about 120, and a thousand-value label about 12,000, when attached to a histogram.
Configure limits using sampleLimit and enforcedSampleLimit to set soft and hard limits. Setting both is helpful: teams normally stay under the lower, soft limit, and in warranted cases can raise it for their own targets, but the enforced limit can never be crossed.
When a target exceeds the set limits, the scrape fails and its samples are dropped. The prometheus_target_scrapes_exceeded_sample_limit_total metric is incremented when this happens, and a log message is also present that can be used to raise an alert.
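A brief sketch of the sample limits, with illustrative resource names and values, assuming a prometheus-operator deployment:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app            # hypothetical application
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
    - port: metrics
  sampleLimit: 10000           # soft limit: a scrape fails if it returns more than 10,000 samples

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
spec:
  enforcedSampleLimit: 50000   # hard cap no ServiceMonitor can exceed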
Predicting ingestion to prevent target overload
Once limits are in place, companies can predict the ingestion size of their Prometheus instance more accurately. For example, say you know roughly how many series each target exposes and have a RAM budget for the instance. You can then cap the number of targets a Prometheus instance scrapes so ingestion stays at or below that budget.
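As a rough worked example (every number here is an illustrative assumption, not a benchmark): suppose each target exposes about 5,000 series and you budget roughly 8 KiB of RAM per active series.

5,000 series/target x 8 KiB/series  ≈ 40 MiB of RAM per target
40 GiB RAM budget / 40 MiB/target   ≈ 1,000 targets

Setting the target limit near 1,000 would then keep ingestion within the budget; substitute your own measured per-series and per-target figures.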
The targetLimit configuration limits the number of scraped targets accepted. When a job discovers more than the allowed number of targets, its target samples are dropped.
When target samples are dropped, a metric named prometheus_target_scrape_pool_exceeded_target_limit_total is incremented. A clear log message will also appear, and the failure is visible on the targets page of the Prometheus UI.
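A minimal sketch of the target limit on a ServiceMonitor, with illustrative names and values, assuming a prometheus-operator deployment:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app       # hypothetical application
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
    - port: metrics
  targetLimit: 100        # scrapes for this job fail if discovery returns more than 100 targets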
Summary
We have discussed several limits that can be used to ensure your Prometheus instance remains stable and uses a predictable amount of memory and storage. Label limits prevent the excessive time series creation that causes system instability and increases Prometheus costs. The target limit then ensures ingestion size is predictable and can be handled by your instance setup.
To learn more about Kubernetes and the cloud native ecosystem, join us at KubeCon + CloudNativeCon North America, in Salt Lake City, Utah, on November 12-15, 2024.