
I'm using AWS EKS with t3.medium instances, so each node has 2 vCPU (2000m of CPU) and 4 GB of RAM.

I'm running 6 different apps on the cluster with these CPU request definitions:

name   request  replicas  total-cpu
app#1  300m     x2        600m
app#2  100m     x4        400m
app#3  150m     x1        150m
app#4  300m     x1        300m
app#5  100m     x1        100m
app#6  150m     x1        150m

With basic math, all of the apps together request 1700m of CPU. I also have an HPA with a 60% CPU target for app#1 and app#2. So I was expecting to have just one node, or maybe two (because of kube-system pods), but the cluster is always running with 3 nodes. It looks like I've misunderstood how autoscaling works.
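
The per-pod CPU requests can be double-checked straight from the pod specs (assuming everything runs in the default namespace):

$ kubectl get pods -n default -o custom-columns='NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu'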

$ kubectl top nodes
NAME                                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-*.eu-central-1.compute.internal   221m         11%    631Mi           18%
ip-*.eu-central-1.compute.internal   197m         10%    718Mi           21%
ip-*.eu-central-1.compute.internal   307m         15%    801Mi           23%

As you can see, the nodes are only using 10-15% of their capacity. How can I optimize node scaling? What is the reason for having 3 nodes?

$ kubectl get hpa
NAME                       REFERENCE                             TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
app#1   Deployment/easyinventory-deployment   37%/60%   1         5         3          5d16h
app#2   Deployment/poolinventory-deployment   64%/60%   1         5         4          4d10h
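
For reference, HPAs with these targets are roughly equivalent to the following (the HPA names in the output above are anonymized, and the real objects may have been created from YAML instead):

$ kubectl autoscale deployment easyinventory-deployment --cpu-percent=60 --min=1 --max=5
$ kubectl autoscale deployment poolinventory-deployment --cpu-percent=60 --min=1 --max=5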

UPDATE #1

I have PodDisruptionBudgets for the kube-system pods:

kubectl create poddisruptionbudget pdb-event --namespace=kube-system --selector k8s-app=event-exporter --max-unavailable 1 
kubectl create poddisruptionbudget pdb-fluentd --namespace=kube-system --selector k8s-app=fluentd-gcp-scaler --max-unavailable 1 
kubectl create poddisruptionbudget pdb-heapster --namespace=kube-system --selector k8s-app=heapster --max-unavailable 1 
kubectl create poddisruptionbudget pdb-dns --namespace=kube-system --selector k8s-app=kube-dns --max-unavailable 1 
kubectl create poddisruptionbudget pdb-dnsauto --namespace=kube-system --selector k8s-app=kube-dns-autoscaler --max-unavailable 1 
kubectl create poddisruptionbudget pdb-glbc --namespace=kube-system --selector k8s-app=glbc --max-unavailable 1 
kubectl create poddisruptionbudget pdb-metadata --namespace=kube-system --selector app=metadata-agent-cluster-level --max-unavailable 1 
kubectl create poddisruptionbudget pdb-kubeproxy --namespace=kube-system --selector component=kube-proxy --max-unavailable 1 
kubectl create poddisruptionbudget pdb-metrics --namespace=kube-system --selector k8s-app=metrics-server --max-unavailable 1
#source: https://gist.github.com/kenthua/fc06c6ea52a25a51bc07e70c8f781f8f
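
The budgets can be verified with:

$ kubectl get pdb --namespace=kube-system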

UPDATE #2

I figured out that the 3rd node is not always live; Kubernetes scales down to 2 nodes, but after a few minutes scales up to 3 nodes again, then back down to 2, over and over. Here is the relevant part of kubectl describe nodes:

# Node 1
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         1010m (52%)   1300m (67%)
  memory                      3040Mi (90%)  3940Mi (117%)
  ephemeral-storage           0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
# Node 2
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         1060m (54%)   1850m (95%)
  memory                      3300Mi (98%)  4200Mi (125%)
  ephemeral-storage           0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
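
To see which pods account for those memory requests, the requests can be listed per pod and filtered per node (the node name below is a placeholder):

$ kubectl get pods --all-namespaces -o custom-columns='NS:.metadata.namespace,POD:.metadata.name,MEM_REQ:.spec.containers[*].resources.requests.memory'
$ kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name> -o wide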

UPDATE #3

I0608 11:03:21.965642       1 static_autoscaler.go:192] Starting main loop
I0608 11:03:21.965976       1 utils.go:590] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0608 11:03:21.965996       1 filter_out_schedulable.go:65] Filtering out schedulables
I0608 11:03:21.966120       1 filter_out_schedulable.go:130] 0 other pods marked as unschedulable can be scheduled.
I0608 11:03:21.966164       1 filter_out_schedulable.go:130] 0 other pods marked as unschedulable can be scheduled.
I0608 11:03:21.966175       1 filter_out_schedulable.go:90] No schedulable pods
I0608 11:03:21.966202       1 static_autoscaler.go:334] No unschedulable pods
I0608 11:03:21.966257       1 static_autoscaler.go:381] Calculating unneeded nodes
I0608 11:03:21.966336       1 scale_down.go:437] Scale-down calculation: ignoring 1 nodes unremovable in the last 5m0s
I0608 11:03:21.966359       1 scale_down.go:468] Node ip-*-93.eu-central-1.compute.internal - memory utilization 0.909449
I0608 11:03:21.966411       1 scale_down.go:472] Node ip-*-93.eu-central-1.compute.internal is not suitable for removal - memory utilization too big (0.909449)
I0608 11:03:21.966460       1 scale_down.go:468] Node ip-*-115.eu-central-1.compute.internal - memory utilization 0.987231
I0608 11:03:21.966469       1 scale_down.go:472] Node ip-*-115.eu-central-1.compute.internal is not suitable for removal - memory utilization too big (0.987231)
I0608 11:03:21.966551       1 static_autoscaler.go:440] Scale down status: unneededOnly=false lastScaleUpTime=2020-06-08 09:14:54.619088707 +0000 UTC m=+143849.361988520 lastScaleDownDeleteTime=2020-06-06 17:18:02.104469988 +0000 UTC m=+36.847369765 lastScaleDownFailTime=2020-06-06 17:18:02.104470075 +0000 UTC m=+36.847369849 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false
I0608 11:03:21.966578       1 static_autoscaler.go:453] Starting scale down
I0608 11:03:21.966667       1 scale_down.go:785] No candidates for scale down
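
As far as I understand, the "memory utilization too big" lines come from the scale-down utilization check: the cluster autoscaler only considers a node for removal when its requested memory (and CPU) divided by allocatable is below --scale-down-utilization-threshold, which defaults to 0.5. The flag could be adjusted on the autoscaler Deployment, roughly like this (the 0.9 value is only an example, and raising it cannot help if the evicted pods would not fit anywhere else anyway):

$ kubectl -n kube-system edit deployment cluster-autoscaler
# then add to the container's command/args something like:
#   - --scale-down-utilization-threshold=0.9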

UPDATE #4

According to the autoscaler logs, it was ignoring ip-*-145.eu-central-1.compute.internal for scale-down for some reason. I wondered what would happen, so I terminated the instance directly from the EC2 console, and these lines appeared in the autoscaler logs:

I0608 11:10:43.747445       1 scale_down.go:517] Finding additional 1 candidates for scale down.
I0608 11:10:43.747477       1 cluster.go:93] Fast evaluation: ip-*-145.eu-central-1.compute.internal for removal
I0608 11:10:43.747540       1 cluster.go:248] Evaluation ip-*-115.eu-central-1.compute.internal for default/app2-848db65964-9nr2m -> PodFitsResources predicate mismatch, reason: Insufficient memory,
I0608 11:10:43.747549       1 cluster.go:248] Evaluation ip-*-93.eu-central-1.compute.internal for default/app2-848db65964-9nr2m -> PodFitsResources predicate mismatch, reason: Insufficient memory,
I0608 11:10:43.747557       1 cluster.go:129] Fast evaluation: node ip-*-145.eu-central-1.compute.internal is not suitable for removal: failed to find place for default/app2-848db65964-9nr2m
I0608 11:10:43.747569       1 scale_down.go:554] 1 nodes found to be unremovable in simulation, will re-check them at 2020-06-08 11:15:43.746773707 +0000 UTC m=+151098.489673532
I0608 11:10:43.747596       1 static_autoscaler.go:440] Scale down status: unneededOnly=false lastScaleUpTime=2020-06-08 09:14:54.619088707 +0000 UTC m=+143849.361988520 lastScaleDownDeleteTime=2020-06-06 17:18:02.104469988 +0000 UTC m=+36.847369765 lastScaleDownFailTime=2020-06-06 17:18:02.104470075 +0000 UTC m=+36.847369849 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false

As far as I can see, the node is not being scaled down because there is no other node that can fit "app2". But the app's memory request is 700Mi, and at the moment the other nodes seem to have enough memory for app2:

$ kubectl top nodes
NAME                                          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-0-93.eu-central-1.compute.internal    386m         20%    920Mi           27%
ip-10-0-1-115.eu-central-1.compute.internal   298m         15%    794Mi           23%

I still have no idea why the autoscaler is not moving app2 to one of the other available nodes and scaling down ip-*-145.
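
Note that kubectl top shows actual usage, not what the scheduler reserves; what matters is allocatable memory minus the summed requests shown by kubectl describe nodes above. Allocatable per node can be checked with:

$ kubectl get nodes -o custom-columns='NAME:.metadata.name,ALLOC_CPU:.status.allocatable.cpu,ALLOC_MEM:.status.allocatable.memory'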

Eray
  • Can you do `kubectl describe nodes` to see each node's resource requests? k8s has limits and requests for resource mgmt, and requests may reserve some resources even if they aren't actually used. – Ken Chen Jun 08 '20 at 09:10
  • Hello @KenChen, I just shared the results as a 2nd update to my question, but I'm not sure what I should take away from them. I thought the problem was with my CPU request definitions, but it looks like the problem is memory. Did I understand that correctly? – Eray Jun 08 '20 at 09:18
  • According to `kubectl top nodes` memory usage is fine but according to `kubectl describe nodes` I am overcommitting – Eray Jun 08 '20 at 09:26
  • What are you using to scale your nodes? Autoscaling groups? Cluster-Autoscaler? – Blokje5 Jun 08 '20 at 09:30
  • @Blokje5, Cluster Autoscaler (following this guide: https://docs.aws.amazon.com/eks/latest/userguide/cluster-autoscaler.html) . But to be honest, I thought cluster autoscaler is working with ASG. – Eray Jun 08 '20 at 09:33
  • Overcommitting a resource is fine; the limit is the **most** a pod can use. Following your logs, my guess is that the memory requests go beyond the capacity of 2 nodes. However, not sure why it's scaling down to 2, though. – Ken Chen Jun 08 '20 at 09:39
  • No, the cluster-autoscaler adds nodes based on the schedulability of pods. So if a pod is unschedulable (e.g. its resource requests cannot be satisfied), a new node will be added. It is mostly based on how the Kubernetes scheduler schedules pods: if one of your pods is not schedulable on any of your nodes, a new node will be added. – Blokje5 Jun 08 '20 at 09:39
  • @Blokje5 I just checked the Auto Scaling Groups in EC2; there is an ASG (created with my EKS node group) with min=1, max=3, desired=3. Min and max are as I defined them, that's fine, but I don't know why desired increased to 3; I'm sure I set it to 1, not 3. Probably the cluster autoscaler is increasing the desired instance count to 3 because of unschedulable pods, as you said. If so, why is Kubernetes failing to schedule the pods, since the total pod requests are below the nodes' capacity as I explained in my question? – Eray Jun 08 '20 at 09:51
  • Just checked the autoscaler logs; it looks like it's trying to scale down every minute but getting `is not suitable for removal - memory utilization too big`. Update #3 added to the question with the detailed log. – Eray Jun 08 '20 at 10:00
  • Node ip-*145.eu-central-1.compute.internal was scaled down due to low utilisation. You probably need to check autoscaler settings. – Ken Chen Jun 08 '20 at 10:09
  • Actually, I don't have any specific autoscaler configuration; I just deployed it as suggested in the official AWS documentation, hoping it would create new nodes for me when needed (when no resources are left for new pod replicas). So I don't know which autoscaler settings you mean. Yes, it's scaling down, but the problem is that it scales up again after 3-5 minutes even with low CPU/memory usage on the existing nodes. – Eray Jun 08 '20 at 10:51
  • Made some updates on "Update #3" and "Update #4". – Eray Jun 08 '20 at 11:26

1 Answer


See the Kubernetes documentation section "How Pods with resource requests are scheduled".

A request is the amount guaranteed to the container, so the scheduler will not place a pod on a node that does not have enough unreserved capacity. In your case, the two remaining nodes have already committed almost all of their allocatable memory to requests (0.91 and 0.99 in the autoscaler logs), so ip-*-145 cannot be scaled down; otherwise app2 would have nowhere to go.
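
Rough numbers from the describe output in UPDATE #2 (the exact allocatable memory depends on the AMI and reservations, so treat this as an estimate):

Node 1: 3040Mi requested at ~90%  ->  ~3380Mi allocatable  ->  ~340Mi left for requests
Node 2: 3300Mi requested at ~98%  ->  ~3370Mi allocatable  ->  ~70Mi left for requests
app2 requests 700Mi, so it fits on neither node; draining ip-*-145 would leave it Pending.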

Ken Chen