As GPU workloads grow in scale and diversity, efficient resource utilization has become a primary concern for platform teams. Modern GPUs deliver immense compute capacity, but many workloads do not require exclusive access to an entire device. Training jobs often fluctuate in activity, batch workloads run intermittently, and inference services may use only a fraction of available compute. These patterns have driven the adoption of hardware-level GPU partitioning as a practical and widely supported solution.
Both NVIDIA and AMD provide native GPU partitioning capabilities designed to safely share a single physical GPU across multiple workloads. NVIDIA offers Multi-Instance GPU (MIG) on its data-center accelerators such as the A100 and H100, while AMD provides compute partitioning modes such as CPX on the Instinct MI300 series. These technologies form the foundation for efficient GPU sharing in Kubernetes environments.
Hardware Partitioning as a First-Class Capability
GPU partitioning allows a single physical accelerator to be divided into multiple isolated compute slices. Each slice has guaranteed access to a portion of compute and memory resources and is exposed to Kubernetes as a schedulable device. From the scheduler’s perspective, these partitions behave like independent GPUs. Workloads request them using standard resource requests, and device plugins ensure that each pod is bound to a specific partition with hardware-enforced isolation.
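As a sketch of what this looks like in practice: on a cluster running the NVIDIA GPU Operator with the mixed MIG strategy, a pod requests a specific partition profile through an extended resource name. The exact resource name depends on the GPU model and the configured MIG geometry, and the container image here is only an illustrative choice:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-example
spec:
  restartPolicy: Never
  containers:
  - name: worker
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative image
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        # One 1g.10gb MIG slice on an H100; an A100 exposes e.g. mig-1g.5gb.
        nvidia.com/mig-1g.10gb: 1
```

The pod never sees the rest of the physical GPU; the device plugin binds it to exactly one hardware-isolated slice.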
This model delivers immediate benefits. Partitioning increases workload density, reduces wasted capacity from over-allocation, and enables multiple teams or services to share expensive GPU resources safely. For clusters running mixed workloads, partitioning is often the most effective way to improve baseline utilization without sacrificing predictability or performance isolation.
Tradeoffs of Static Allocation
While GPU partitioning significantly improves resource efficiency, it relies on a static allocation model. Once a partition is assigned to a workload, it remains allocated for the lifetime of the pod. Kubernetes does not natively consider how actively the workload is using the assigned GPU resources. A partition that is lightly used or temporarily idle is treated the same as one running at full capacity.
This behavior is not a flaw in partitioning itself, but rather a characteristic of Kubernetes scheduling. The scheduler operates on allocation state, not real-time or historical utilization. As a result, clusters can still experience scheduling pressure even when meaningful GPU capacity exists in the form of idle partitions.
Utilization-Aware Scheduling as a Complement
Utilization-aware scheduling builds on top of GPU partitioning by incorporating GPU activity signals into scheduling decisions. Instead of relying solely on resource requests, the scheduler evaluates how GPU partitions are actually being used over time. Metrics collected from vendor tooling and exported through systems such as Prometheus provide a reliable view of sustained utilization patterns.
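For instance, with NVIDIA's DCGM exporter feeding Prometheus, a sustained-utilization signal per MIG instance can be derived from a query along these lines (metric and label names vary by exporter version and configuration, so treat this as a sketch):

```promql
# Average graphics-engine activity per GPU and MIG instance over 30 minutes.
avg by (gpu, GPU_I_ID) (avg_over_time(DCGM_FI_PROF_GR_ENGINE_ACTIVE[30m]))
```

Averaging over a window rather than sampling instantaneous utilization avoids reclaiming partitions from workloads that are merely between bursts.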
When a new workload cannot be scheduled through normal placement, utilization-aware logic can identify lower-priority workloads that have held GPU partitions while remaining underutilized for a prolonged period. In such cases, those partitions can be reclaimed and reassigned. This approach allows the cluster to adapt dynamically to changing workload behavior while respecting priority and policy boundaries.
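The reclamation decision itself reduces to a policy check over recent utilization samples. A minimal sketch of that check, where the names, thresholds, and data shapes are illustrative assumptions rather than the API of any particular scheduler plugin:

```python
from dataclasses import dataclass

@dataclass
class PartitionUsage:
    """Observed state of one GPU partition and the pod bound to it."""
    pod: str
    priority: int            # pod priority (e.g. from its PriorityClass)
    samples: list[float]     # utilization samples (0-100) over the grace window

def reclaim_candidates(partitions, pending_priority, util_threshold=10.0):
    """Return pods whose partitions may be reclaimed for a pending workload.

    A partition qualifies only if its owner has lower priority than the
    pending workload AND its utilization stayed under the threshold for
    the entire observation window.
    """
    candidates = [
        p for p in partitions
        if p.priority < pending_priority          # respect priority boundaries
        and p.samples
        and max(p.samples) < util_threshold       # sustained, not momentary, idleness
    ]
    # Reclaim from the lowest-priority owners first.
    candidates.sort(key=lambda p: p.priority)
    return [p.pod for p in candidates]

# Example: one busy pod, one idle low-priority pod, one idle high-priority pod.
usage = [
    PartitionUsage("train-a", priority=100, samples=[85.0, 92.0, 78.0]),
    PartitionUsage("batch-b", priority=10,  samples=[2.0, 0.0, 4.5]),
    PartitionUsage("infer-c", priority=500, samples=[1.0, 0.5, 0.0]),
]
print(reclaim_candidates(usage, pending_priority=200))  # ['batch-b']
```

Only `batch-b` qualifies: `train-a` is busy, and `infer-c` outranks the pending workload even though it is idle.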
Benefits of Combining Both Approaches
GPU partitioning and utilization-aware scheduling address different dimensions of efficiency. Partitioning improves spatial utilization by allowing multiple workloads to share a single GPU concurrently. Utilization-aware scheduling improves temporal utilization by ensuring that allocated partitions continue to deliver value over time.
Together, they enable higher average utilization, faster scheduling for critical workloads, and better alignment between GPU allocation and actual demand. Importantly, this combination does not compromise isolation. Partitions remain fixed, exclusive, and hardware-enforced. Scheduling intelligence simply determines which workload should own a partition at a given point in time.
Practical Considerations
Adopting this combined approach requires thoughtful policy design. Workload priorities, grace periods, and utilization thresholds must be tuned to reflect real usage patterns. Observability is essential, both for debugging scheduling decisions and for building trust among users whose workloads may be preempted. When applied carefully, these mechanisms allow platform teams to maximize the value of GPU investments without introducing instability.
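Concretely, such a policy might be expressed as configuration for a utilization-aware scheduler plugin. The fields below are purely illustrative, since no standard schema exists for this today; they simply name the knobs that typically need tuning:

```yaml
# Hypothetical policy for a utilization-aware reclamation plugin.
reclamationPolicy:
  utilizationThreshold: 10    # percent; below this a partition counts as underutilized
  observationWindow: 30m      # utilization must stay low for this long
  gracePeriod: 5m             # warning time before an owning pod is evicted
  minPriorityGap: 100         # pending pod must outrank the owner by this margin
  exemptNamespaces:           # workloads that are never preempted
    - kube-system
```

Conservative defaults (a long observation window, a generous grace period) are usually the right starting point; they trade some efficiency for predictability while users build trust in the mechanism.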
Conclusion
GPU partitioning is a powerful and necessary capability for modern accelerator clusters. AMD CPX and NVIDIA MIG provide the hardware foundation required to safely share high-performance GPUs. By complementing these technologies with utilization-aware scheduling, organizations can further improve efficiency, reduce idle capacity, and support diverse workloads on shared infrastructure. Rather than competing approaches, partitioning and scheduling intelligence work best when applied together as part of a cohesive GPU strategy.
Sources:
- https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html
- https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/gpu-partitioning/mi300x/quick-start-guide.html
- https://www.cncf.io/blog/2026/01/20/reclaiming-underutilized-gpus-in-kubernetes-using-scheduler-plugins/