Prometheus is an open-source monitoring solution for distributed applications and containerized microservices. It is also the de facto standard for monitoring metrics in Kubernetes, where components emit metrics in Prometheus format.
As your company scales Kubernetes into production, setting appropriate limits is essential to avoid costly problems such as unbounded metric growth. Establishing these limits makes Prometheus sample ingestion more predictable and stable. This article introduces the available limits and explains how they protect your system when applied correctly.
Prometheus metric labels
Labels are key-value pairs associated with metrics to provide additional context. Labels allow for granular filtering, dynamic aggregation, and a richer context on metrics. However, adding labels without careful consideration can negatively impact Prometheus’ performance. Key points to remember before creating labels include:
- Each label generates a new time series, which consumes memory and increases storage requirements.
- High cardinality in labels can significantly slow down query execution.
To identify existing metrics with the highest cardinality, use the following command:
topk(10, count by (__name__, job)({__name__=~".+"}))
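Once a suspect metric surfaces, you can drill into which of its labels drives the cardinality by counting the distinct values of a single label (the metric and label names below are hypothetical):

count(count by (pod) (http_requests_total))

The inner count by (pod) produces one result per distinct pod value, and the outer count tallies how many there are.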
Limits to prevent label overload
If your company is trying to gain more insight from metrics, a common strategy is to add labels. For example, an engineering team might add a metric with 20 labels. Even if each label is low cardinality, with only two values on average, the combined cardinality can reach 2^20, or 1,048,576, time series for a single metric. Multiply that across many such metrics and you will see OOM crashes, disks filling up, and overall system instability.
To ensure teams cannot have this scale of impact, you can configure limits using the Prometheus operator in Kubernetes. Configure labelLimit and enforcedLabelLimit to set a soft and a hard limit on labels. The labelLimit field acts as the default soft limit, and individual service monitors can override it with their own labelLimit value. The enforcedLabelLimit, set on the Prometheus resource itself, is a hard cap on the number of labels any metric can carry.
When a target's metrics go over the set limits, the scrape fails and its samples are dropped. A metric named prometheus_target_scrape_pool_exceeded_label_limits_total is incremented when samples are dropped.
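As a sketch of how these fields fit together, assuming a prometheus-operator deployment (the resource names and values below are illustrative), a team-level soft limit lives on the ServiceMonitor while the cluster-wide hard cap lives on the Prometheus resource:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app              # hypothetical application
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
    - port: metrics
  labelLimit: 30                 # soft limit: scrapes fail if a metric carries more than 30 labels

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
spec:
  enforcedLabelLimit: 40         # hard cap applied to every scrape configuration

Teams can raise the soft limit in their own ServiceMonitor, but never above the enforced cap.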
Limits that block lengthy labels
As companies grow, they usually attempt to improve monitoring. With growth comes more related metrics, pushing teams to use more specific label names to differentiate the data. Lengthy label names and lengthy values can also cause OOM crashes, filled disk space, and system instability. This occurs because label names and values are stored with every series in the time series database, so longer strings increase memory and index size.
Configure limits using the spec fields labelNameLengthLimit and enforcedLabelNameLengthLimit for soft and hard limits on label names, and labelValueLengthLimit and enforcedLabelValueLengthLimit for soft and hard limits on value length. Also, note the value-length limits affect metric names, since a metric's name is stored as a value under the __name__ label.
When metrics exceed the set limits, target samples are dropped. No metric indicates a limit overage for name or value lengths, but a log message will be present and can be used to raise an alert.
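A minimal sketch of these length limits, again assuming a prometheus-operator deployment (all names and values below are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app                     # hypothetical application
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
    - port: metrics
  labelNameLengthLimit: 60              # soft limit, characters per label name
  labelValueLengthLimit: 120            # soft limit, characters per label value

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
spec:
  enforcedLabelNameLengthLimit: 80      # hard cap on label name length
  enforcedLabelValueLengthLimit: 200    # hard cap on value length; also bounds metric names via __name__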
Putting a cap on your sample size
Each label value also adds a time series. For example, a label with ten values produces ten time series. It is critical to design labels and values with low cardinality and never use labels for unique identifiers such as email addresses or a userId. Histograms multiply this further: each label combination produces one series per bucket plus the _sum and _count series, roughly 12 with a typical bucket configuration. Our ten-value label then yields a cardinality of about 120, and a thousand-value label about 12,000, when attached to a histogram.
Configure limits using sampleLimit and enforcedSampleLimit to set soft and hard limits. Setting both is helpful: teams normally stay under the lower, soft limit, and in warranted cases can raise it for their own targets, but the enforced limit can never be crossed.
When a target exceeds the set limits, the scrape fails and its samples are dropped. The prometheus_target_scrapes_exceeded_sample_limit_total metric is incremented when this happens, and a log message is also present that can be used to raise an alert.
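A brief sketch of the sample limits, with illustrative resource names and values, assuming a prometheus-operator deployment:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app            # hypothetical application
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
    - port: metrics
  sampleLimit: 10000           # soft limit: a scrape fails if it returns more than 10,000 samples

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
spec:
  enforcedSampleLimit: 50000   # hard cap no ServiceMonitor can exceed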
Predicting ingestion to prevent target overload
Once limits are in place, companies can predict the ingestion size of their Prometheus instance more accurately. For example, say you know roughly how many series each target exposes and have a RAM budget for the instance. You can then cap the number of targets a Prometheus instance scrapes so ingestion stays at or below that budget.
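As a rough worked example (every number here is an illustrative assumption, not a benchmark): suppose each target exposes about 5,000 series and you budget roughly 8 KiB of RAM per active series.

5,000 series/target x 8 KiB/series  ≈ 40 MiB of RAM per target
40 GiB RAM budget / 40 MiB/target   ≈ 1,000 targets

Setting the target limit near 1,000 would then keep ingestion within the budget; substitute your own measured per-series and per-target figures.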
The targetLimit configuration limits the number of scraped targets accepted. When a job discovers more than the allowed number of targets, its target samples are dropped.
When target samples are dropped, a metric named prometheus_target_scrape_pool_exceeded_target_limit_total is incremented. A clear log message will also appear, and the failure is visible on the targets page of the Prometheus UI.
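A minimal sketch of the target limit on a ServiceMonitor, with illustrative names and values, assuming a prometheus-operator deployment:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app       # hypothetical application
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
    - port: metrics
  targetLimit: 100        # scrapes for this job fail if discovery returns more than 100 targets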
Summary
We have discussed several limits that can be used to ensure your Prometheus instance remains stable and uses a predictable amount of memory and storage. Label limits prevent the excessive time series creation that causes system instability and increases Prometheus costs. The target limit then ensures ingestion size is predictable and can be handled by your instance setup.
To learn more about Kubernetes and the cloud native ecosystem, join us at KubeCon + CloudNativeCon North America, in Salt Lake City, Utah, on November 12-15, 2024.