Unveiling Kubernetes Autoscaling: Enhancing 3 Methods for Optimal Performance

This post covers best practices for Kubernetes autoscaling and three approaches to it: the Horizontal Pod Autoscaler (HPA), the Vertical Pod Autoscaler (VPA), and the Cluster Autoscaler. Each approach has its own benefits and can be tuned to the demands of different workloads and use cases. You will also examine NetApp’s method of Kubernetes autoscaling with Spot instances, which uses lower-cost capacity to make your Kubernetes infrastructure more efficient. By the end of this guide you should have practical strategies for setting up autoscaling in your own Kubernetes infrastructure.

What is Kubernetes Autoscaling?

Kubernetes autoscaling is a feature that enables a cluster to dynamically adjust its resources in response to changes in demand. This capability allows the cluster to scale up by adding nodes or allocating more resources to pods when demand increases, and scale down by removing nodes or reducing pod resources when demand decreases. By doing so, Kubernetes autoscaling helps optimize resource utilization, minimize costs, and enhance overall performance.

There are three typical approaches to application scalability in Kubernetes environments:

  • Horizontal Pod Autoscaler (HPA): The HPA automatically adjusts the number of pod replicas based on observed CPU or memory use, so your application has enough instances to handle varying workloads.
  • Vertical Pod Autoscaler (VPA): The VPA automatically adjusts the CPU and memory requests of each pod based on how that pod actually uses its resources. This improves application performance and optimizes resource allocation without manual intervention.
  • Cluster Autoscaler: The Cluster Autoscaler automatically adjusts the number of nodes to match demand. It adds nodes when pods are pending because of insufficient resources and removes nodes when they are underutilized.

1. Horizontal Pod Autoscaler (HPA)

The Horizontal Pod Autoscaler (HPA) feature in Kubernetes automates the process of adjusting the number of pod replicas based on changes in application usage levels. This dynamic scaling capability allows Kubernetes to efficiently manage workload scaling in response to fluctuations in demand.

HPA is versatile and can be applied to both stateless applications and stateful workloads, making it a valuable tool for various use cases. It operates as a control loop within the Kubernetes controller manager. The loop runs every 15 seconds by default, and the interval can be changed with the --horizontal-pod-autoscaler-sync-period flag.

During each loop iteration, the controller manager evaluates the actual resource utilization against the metrics specified for each HPA. These metrics are obtained from either the custom metrics API or the resource metrics API, depending on whether auto-scaling is configured based on custom metrics or pod resources like CPU utilization.
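
The comparison made in each loop iteration can be sketched with the replica formula given in the Kubernetes HPA documentation: desired replicas = ceil(current replicas × current metric / target metric). The function name and sample numbers below are illustrative assumptions. The 10% tolerance mirrors the documented default, and the sketch leaves out the minimum/maximum replica bounds and stabilization behavior that a real HPA also applies.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Replica count suggested by the documented HPA formula.

    desired = ceil(current_replicas * current_value / target_value)
    No change is suggested while the ratio stays inside the tolerance band.
    """
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# 4 pods averaging 90% CPU against a 60% target: scale out.
print(desired_replicas(4, 90, 60))
# 4 pods averaging 63% CPU against a 60% target: inside the tolerance band.
print(desired_replicas(4, 63, 60))
```

With these inputs the first call suggests 6 replicas and the second leaves the count at 4.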

Horizontal Pod Autoscaler (HPA) utilizes different types of metrics to determine auto-scaling behavior:

  • Resource Metrics: For resource metrics you can specify either a target utilization (a percentage of the requested resource) or a fixed raw target value. These metrics usually cover the pods’ CPU and memory usage.
  • Custom Metrics: For custom metrics, only raw values are supported, and it’s not possible to define a target utilization. Custom metrics are user-defined metrics specific to the application or workload being monitored.
  • Object Metrics and External Metrics: Scaling based on object metrics or external metrics involves using a single metric obtained from the object. This metric is then compared to the target value to calculate a utilization ratio, which determines the scaling action. Examples of object metrics might include the number of pending requests in a queue or the length of a message queue.
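
As an illustration of how these metric types can appear in a configuration, the sketch below builds an autoscaling/v2 HorizontalPodAutoscaler manifest as a Python dictionary and prints it as JSON, which kubectl accepts as well as YAML. The Deployment name "web", the custom metric name "requests_per_second", and every numeric target are hypothetical values chosen for the example.

```python
import json


def build_hpa(name, deployment, min_replicas, max_replicas):
    """Return an autoscaling/v2 HPA manifest with one resource and one custom metric."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [
                # Resource metric: target expressed as utilization of the request.
                {
                    "type": "Resource",
                    "resource": {
                        "name": "cpu",
                        "target": {"type": "Utilization", "averageUtilization": 60},
                    },
                },
                # Custom per-pod metric: only a raw value target is possible.
                {
                    "type": "Pods",
                    "pods": {
                        "metric": {"name": "requests_per_second"},  # hypothetical metric
                        "target": {"type": "AverageValue", "averageValue": "100"},
                    },
                },
            ],
        },
    }


hpa = build_hpa("web-hpa", "web", 2, 10)
print(json.dumps(hpa, indent=2))
```

Saving the printed JSON to a file and applying it with kubectl would be one way to try it, assuming the custom metric is actually served by a metrics adapter in the cluster.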


Limitations

  • Avoid combining HPA with VPA on CPU or memory: It is recommended not to use the Horizontal Pod Autoscaler (HPA) alongside the Vertical Pod Autoscaler (VPA) when both act on CPU or memory metrics. Running both mechanisms on the same metrics can produce conflicting scaling decisions and undesirable pod resource allocation.
  • Configuration restrictions with Deployments: When a Deployment manages your pods, configure HPA on the Deployment itself. You cannot configure HPA directly on a ReplicaSet or ReplicationController that belongs to the Deployment. This restriction ensures that scaling decisions apply consistently to the entire Deployment rather than to individual ReplicaSets or controllers.

Best Practices

  • Ensure all pods have resource requests configured: When making scaling decisions, the Horizontal Pod Autoscaler (HPA) uses the observed CPU utilization of the pods managed by a Kubernetes controller. Utilization is calculated as a percentage of the resource requests specified by the individual pods. For the calculation to be accurate, configure resource request values for every container in your pods.
  • Prefer custom metrics over external metrics when possible: While both external and custom metrics APIs provide data for scaling decisions, using an external metrics API poses a security risk as it can grant access to a wide range of metrics. In contrast, the custom metrics API presents a lower risk if compromised, as it only exposes specific metrics relevant to your application. Therefore, whenever feasible, prioritize utilizing a custom metrics API for more secure and targeted scaling decisions.
  • Use HPA together with Cluster Autoscaler: Combining the Horizontal Pod Autoscaler (HPA) with the Cluster Autoscaler coordinates the scaling of pods and cluster nodes. When scaling up is required, the Cluster Autoscaler can add eligible nodes to accommodate the increased workload. When scaling down, it can identify and remove unneeded nodes to conserve resources.
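
To see why the first practice matters, the sketch below models utilization as total usage divided by total requests across pods. It is a simplified model, and the field names (usage_millicores, request_millicores) are made up for the example. When any pod lacks a request, the percentage cannot be computed, which is why missing requests prevent the HPA from acting on that metric.

```python
def average_cpu_utilization(pods):
    """Return CPU utilization as a percentage of requests, or None if undefined.

    Each pod is a dict with 'usage_millicores' and 'request_millicores';
    a missing or zero request makes the result undefined.
    """
    total_usage = 0
    total_request = 0
    for pod in pods:
        request = pod.get("request_millicores")
        if not request:
            return None  # no request to compare against
        total_usage += pod["usage_millicores"]
        total_request += request
    if total_request == 0:
        return None  # empty pod list
    return 100.0 * total_usage / total_request


with_requests = [
    {"usage_millicores": 300, "request_millicores": 500},
    {"usage_millicores": 450, "request_millicores": 500},
]
without_requests = with_requests + [{"usage_millicores": 200}]

print(average_cpu_utilization(with_requests))
print(average_cpu_utilization(without_requests))
```

The first call yields 75.0; the second yields None because one pod has no request.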

2. Vertical Pod Autoscaler (VPA)

The Vertical Pod Autoscaler (VPA) automates the adjustment of CPU and memory requests for pods based on live usage data. By adjusting container resource requests, VPA keeps actual resource utilization in line with the memory and CPU resources that are available.

In practice, the usage of most containers stays closer to their initial resource requests than to their upper limits. This can lead the default scheduler to overcommit memory and CPU reservations on a node. To address this, VPA raises and lowers the resource requests of pod containers so that they match actual usage and the resources that are available.

Certain workloads may experience short periods of high resource utilization. Setting higher requests by default would waste resources during periods of low utilization and limit the nodes capable of running these workloads. While the Horizontal Pod Autoscaler (HPA) may help in some cases, certain applications do not easily support distributing load across multiple instances.

A VPA deployment has three components. The recommender tracks resource usage and determines the target values. The updater evicts pods whose resource requests need to be updated. Finally, the VPA admission controller uses a mutating admission webhook to rewrite the pod resource requests when the pod is created. Together, these components keep pod resource allocations based on real usage data.
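
The division of labor between the recommender and the updater can be illustrated with a toy model. This is a deliberately simplified stand-in, not VPA's actual algorithm: the nearest-rank percentile, the 15% safety margin, the 10% deviation threshold, and both function names are assumptions made for the example.

```python
import math


def recommend_request(samples, percentile=90, margin=0.15):
    """Recommender stand-in: cover the given percentile of usage, plus a margin."""
    if not samples:
        raise ValueError("at least one usage sample is required")
    ordered = sorted(samples)
    # Nearest-rank percentile: smallest value covering `percentile`% of samples.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1] * (1 + margin)


def needs_update(current_request, recommended, threshold=0.1):
    """Updater stand-in: flag a pod whose request is too far from the recommendation."""
    return abs(current_request - recommended) / recommended > threshold


# Hypothetical CPU usage samples in millicores, including one short spike.
cpu_samples = [100, 120, 110, 300, 130, 125, 115, 140, 135, 128]
target = recommend_request(cpu_samples)

print(round(target))
print(needs_update(current_request=100, recommended=target))
print(needs_update(current_request=160, recommended=target))
```

In this model a pod requesting 100 millicores would be flagged for eviction and recreated with the new request, while a pod requesting 160 millicores would be left alone.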


Limitations

  • Updating running pods is still considered experimental in the Vertical Pod Autoscaler (VPA), and its performance in large clusters has not been thoroughly tested.
  • VPA responds to most out-of-memory events, but not to all of them.
  • The behavior of multiple VPA resources targeting the same pod is not defined.
  • When updating pod resources, VPA may recreate pods, possibly on different nodes, which restarts all running containers in those pods.

Best Practices

The following are two recommended methods for utilizing Vertical Pod Autoscaler effectively:

  • Avoid using HPA and VPA in tandem: The Horizontal Pod Autoscaler (HPA) and VPA conflict when both act on CPU or memory. Do not use both for the same set of pods unless you configure the HPA to use custom or external metrics.
  • Use VPA together with Cluster Autoscaler: VPA may occasionally recommend resource request values that exceed the available resources. This can create resource pressure and leave pods in a pending state. The Cluster Autoscaler can address this by provisioning new nodes in response to pending pods, so that enough resources are available to meet workload demands.

3. Cluster Autoscaler

The cluster autoscaler automatically adjusts the number of nodes in a cluster based on the resources requested by all pods. It continuously performs two main tasks: looking for unschedulable pods, and checking whether the currently deployed pods could be consolidated onto a smaller number of nodes.

Generally, pods are unschedulable either because of a lack of memory or CPU resources, or because of constraints placed on the pod by its affinity rules, nodeSelector labels, or tolerations. To resolve this, the autoscaler checks its managed node pools to see whether adding a node would allow the unschedulable pods to be scheduled. If so, and if the pool can grow, it adds a new node to the pool.
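
The scale-up decision described above can be sketched as a simple fit check. The data layout (free_cpu, free_mem, labels, node_selector) and the function names are hypothetical, and the sketch considers only resources and nodeSelector labels, not affinity rules or tolerations.

```python
def find_fit(pod, nodes):
    """Return the name of the first node that can host the pod, or None."""
    selector = pod.get("node_selector", {})
    for node in nodes:
        labels_ok = all(node["labels"].get(k) == v for k, v in selector.items())
        has_room = node["free_cpu"] >= pod["cpu"] and node["free_mem"] >= pod["mem"]
        if labels_ok and has_room:
            return node["name"]
    return None


def should_scale_up(pending_pods, nodes, node_template):
    """True if a pending pod fits no existing node but would fit a fresh node."""
    for pod in pending_pods:
        fits_existing = find_fit(pod, nodes) is not None
        fits_new_node = find_fit(pod, [node_template]) is not None
        if not fits_existing and fits_new_node:
            return True
    return False


existing_nodes = [
    {"name": "node-a", "labels": {"disk": "hdd"}, "free_cpu": 500, "free_mem": 1024},
]
# Shape of a node the managed pool could add (illustrative values).
pool_template = {"name": "new-node", "labels": {"disk": "ssd"}, "free_cpu": 4000, "free_mem": 8192}
pending = [{"name": "db-0", "cpu": 1000, "mem": 2048, "node_selector": {"disk": "ssd"}}]

print(should_scale_up(pending, existing_nodes, pool_template))
```

Here the pending pod fits no existing node but would fit a node built from the template, so the check returns True. A pod too large for the template would return False, since adding a node would not help it.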

The autoscaler also examines the nodes in each managed pool to find pods that could be rescheduled onto other available nodes in the cluster. If rescheduling is feasible, it evicts those pods and then removes the node. To keep the transition smooth, the autoscaler takes pod priority and PodDisruptionBudgets into account during this process.
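
A rough way to picture the scale-down check is a first-fit placement simulation: a node is a removal candidate only if every pod on it fits somewhere else. The sketch below is an assumption-laden simplification with hypothetical field names, and it ignores pod priority and PodDisruptionBudgets, which the real autoscaler also considers.

```python
def can_remove_node(candidate, other_nodes):
    """Return True if all pods on `candidate` fit on `other_nodes` (first fit)."""
    # Work on a copy of the free capacity so the inputs are not modified.
    free = {
        node["name"]: {"cpu": node["free_cpu"], "mem": node["free_mem"]}
        for node in other_nodes
    }
    # Place the largest pods first; they are the hardest to fit.
    pods = sorted(candidate["pods"], key=lambda p: (p["cpu"], p["mem"]), reverse=True)
    for pod in pods:
        for capacity in free.values():
            if capacity["cpu"] >= pod["cpu"] and capacity["mem"] >= pod["mem"]:
                capacity["cpu"] -= pod["cpu"]
                capacity["mem"] -= pod["mem"]
                break
        else:
            return False  # this pod has nowhere to go
    return True


underused = {
    "name": "node-c",
    "pods": [{"cpu": 200, "mem": 256}, {"cpu": 300, "mem": 512}],
}
others = [
    {"name": "node-a", "free_cpu": 400, "free_mem": 512},
    {"name": "node-b", "free_cpu": 250, "free_mem": 512},
]

print(can_remove_node(underused, others))
```

With the sample data both pods find a new home, so the node could be drained. Shrinking the free CPU on node-b to 150 would make the check fail.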

During scale-down, the cluster autoscaler allows a 10-minute graceful termination period before forcing the termination of a node. This gives the pods on the node enough time to be rescheduled onto other nodes and reduces disruption.


Limitations

  • The cluster autoscaler is supported only on specific managed Kubernetes platforms. If your platform is not supported, you may need to install it manually.
  • The cluster autoscaler does not work with local PersistentVolumes. If you use local SSDs and have pods that require ephemeral storage, you cannot scale up a node group whose size is 0.

Best Practices

The following are two recommended methods for utilizing Cluster Autoscaler effectively:

  • Provide enough resources for the Cluster Autoscaler pod: Set the resource requests of the Cluster Autoscaler pod to at least one CPU. This ensures that the node hosting the Cluster Autoscaler pod has enough resources for it to run efficiently. With insufficient resources, the Cluster Autoscaler could stop working.
  • Define resource requests for each pod: This is necessary in order to allow the Cluster Autoscaler to operate as intended. These requests are used by the Cluster Autoscaler to make judgments about node utilization and pod status. Inaccurate resource request specifications have the potential to interfere with Cluster Autoscaler computations and performance.
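
One lightweight way to enforce the second practice is to check pod specs for missing requests before they are deployed. The sketch below walks a pod spec written as a Python dictionary; the helper name and the sample spec are illustrative, while the containers[].resources.requests layout follows the standard pod spec structure.

```python
def containers_missing_requests(pod_spec, required=("cpu", "memory")):
    """Return (container name, missing resources) pairs for incomplete containers."""
    missing = []
    for container in pod_spec.get("containers", []):
        requests = container.get("resources", {}).get("requests", {})
        absent = [name for name in required if name not in requests]
        if absent:
            missing.append((container["name"], absent))
    return missing


sample_spec = {
    "containers": [
        {
            "name": "app",
            "resources": {"requests": {"cpu": "250m", "memory": "256Mi"}},
        },
        {"name": "sidecar"},  # no resources section at all
    ]
}

print(containers_missing_requests(sample_spec))
```

For the sample spec, the check reports that the sidecar container has neither a CPU nor a memory request.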

Additional Kubernetes Scaling Methods

You have additional options when it comes to scaling workloads with Kubernetes. Here are two such techniques:

Short note about Kubernetes Autoscaling

  • Various tools and mechanisms are available for scaling applications and provisioning resources in Kubernetes. Kubernetes offers native solutions like horizontal and vertical pod autoscaling (HPA and VPA) for scaling at the application level. However, Kubernetes does not itself scale the underlying infrastructure.
  • Kubernetes offers capabilities that streamline the orchestration of containerized applications, largely accomplished through automated provisioning processes. With Kubernetes Deployment, you can automate pod behavior, eliminating the need for manual management of the application lifecycle. Deployments allow you to define and automate the desired behavior efficiently.
  • A Kubernetes StatefulSet is a workload API resource object specifically designed to assist in the management of stateful applications. Within Kubernetes, various built-in workload resources serve specific purposes, with StatefulSets tailored for efficient handling of stateful applications.
  • A ReplicaSet (RS) is a Kubernetes object designed to maintain a consistent and reliable set of running pods for a particular workload. Configured with a defined number of identical pods, the ReplicaSet automatically adjusts by creating new pods if any are evicted or fail, ensuring the desired level of replication is maintained.
  • A DaemonSet is a Kubernetes feature that enables the deployment of a specific pod on all cluster nodes that meet defined criteria. As new nodes are added to the cluster, the pod is automatically deployed to them, and if a node is removed, the corresponding pod is also removed. When a DaemonSet is deleted, Kubernetes removes all pods associated with it.
  • A Kubernetes Deployment empowers you to define and manage pods and ReplicaSets in a declarative manner. By specifying a desired state, the Deployment Controller actively monitors the current state of resources and ensures that the deployed pods align with the desired state. This functionality is pivotal in Kubernetes autoscaling, as it enables the system to automatically adjust the number of pods based on demand.