Description
This is anything but nice. It completely breaks eksctl anywhere operations just because of a valid URL change.
Context
- Setup is Bare Metal in a test environment that probably gets wiped 10 times per day, if not more.
- You created a new EKS-A cluster with 3 control plane nodes and 1 worker node.
hardware.csv
hostname,mac,ip_address,netmask,gateway,nameservers,labels,disk
cplane-0,XX:XX:XX:XX:XX:01,10.162.10.131,255.255.255.240,10.162.10.129,8.8.8.8|8.8.4.4,type=cp,/dev/sda
cplane-1,XX:XX:XX:XX:XX:02,10.162.10.132,255.255.255.240,10.162.10.129,8.8.8.8|8.8.4.4,type=cp,/dev/sda
cplane-2,XX:XX:XX:XX:XX:03,10.162.10.133,255.255.255.240,10.162.10.129,8.8.8.8|8.8.4.4,type=cp,/dev/sda
worker-0,XX:XX:XX:XX:XX:04,10.162.10.134,255.255.255.240,10.162.10.129,8.8.8.8|8.8.4.4,type=worker,/dev/sda
worker-1,XX:XX:XX:XX:XX:05,10.162.10.135,255.255.255.240,10.162.10.129,8.8.8.8|8.8.4.4,type=worker,/dev/sda
worker-2,XX:XX:XX:XX:XX:06,10.162.10.136,255.255.255.240,10.162.10.129,8.8.8.8|8.8.4.4,type=worker,/dev/sda
Cluster creation command
eksctl anywhere create cluster \
--hardware-csv hardware.csv \
-f eksa-mgmt-cluster.yaml \
--no-timeouts
The cluster is up (with 1 worker node):
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
cplane-0 Ready control-plane 19m v1.31.9-eks-ca3410b
cplane-1 Ready control-plane 15m v1.31.9-eks-ca3410b
cplane-2 Ready control-plane 9m53s v1.31.9-eks-ca3410b
worker-0 Ready <none> 3m46s v1.31.9-eks-ca3410b
You now want to add more nodes
It doesn't really matter if you created the cluster with 1, 2 or 20 nodes — the idea is that you want to scale up the workers. You still have 2 "spares" in hardware.csv so you only have to increase the count.
Oh, and by the way, you re-organized the S3 bucket that serves the images, so the image now sits in a path that includes the EKS Anywhere bundle version.
--- orig.yaml 2025-07-13 10:29:49.840935788 +0000
+++ new.yaml 2025-07-13 10:30:22.217130563 +0000
@@ -28,7 +28,7 @@
managementCluster:
name: eks-a
workerNodeGroupConfigurations:
- - count: 1
+ - count: 3
machineGroupRef:
kind: TinkerbellMachineConfig
name: eks-a
@@ -41,7 +41,7 @@
metadata:
name: eks-a
spec:
- osImageURL: "https://BUCKET_NAME.s3.eu-central-1.amazonaws.com/ubuntu-2204-kube-1-31.gz"
+ osImageURL: "https://BUCKET_NAME.s3.eu-central-1.amazonaws.com/v0.22.6/ubuntu-2204-kube-1-31.gz"
tinkerbellIP: "10.162.10.141"
---
Start the cluster upgrade to add more nodes
eksctl anywhere upgrade cluster \
--hardware-csv hardware.csv \
-f cluster.yaml \
--no-timeouts
Preflight check passes nicely:
Performing setup and validations
✅ Tinkerbell provider validation
✅ SSH Keys present
✅ Validate OS is compatible with registry mirror configuration
✅ Validate certificate for registry mirror
✅ Control plane ready
✅ Worker nodes ready
✅ Nodes ready
✅ Cluster CRDs ready
✅ Cluster object present on workload cluster
✅ Upgrade cluster kubernetes version increment
✅ Upgrade cluster worker node group kubernetes version increment
✅ Validate authentication for git provider
✅ Validate immutable fields
✅ Validate cluster's eksaVersion matches EKS-Anywhere Version
✅ Validate eksa controller is not paused
✅ Validate extended kubernetes version support is supported
✅ Validate pod disruption budgets
✅ Validate eksaVersion skew is one minor version
Ensuring etcd CAPI providers exist on management cluster before upgrade
Pausing GitOps cluster resources reconcile
Upgrading core components
Well, that's about it. With --no-timeouts, the process quickly starts throwing this error endlessly, in a loop. The message makes absolutely no sense: it looks like it wants to start a rolling upgrade just because the image path changed, even though the version did not.
"Cluster generation and observedGeneration","Generation":2,"ObservedGeneration":2
"Error happened during retry","error":"cluster has an error: hardware validation failure: for node rollout, minimum hardware count not met for selector '{\"type\":\"cp\"}': have 0, require 1","retries":138
"Sleeping before next retry","time":"1s"
osImageURL documentation states:
The osImageURL must contain the Cluster.Spec.KubernetesVersion or Cluster.Spec.WorkerNodeGroupConfiguration[].KubernetesVersion version (in case of modular upgrade).
For example, if the Kubernetes version is 1.31, the osImageURL name should include 1.31, 1_31, 1-31 or 131.
So, why is it trying to upgrade just based on a simple path change in the image URL? The required part (1-31) is still there and it does not differ from kubernetesVersion.
The fix?
Obvious: put back the old image URL and start over... oh, wait! You need to stop the existing process. Ok, CTRL+C it, change the image to the old path and start again.
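For reference, the revert and retry is nothing fancy (the sed one-liner is just an illustration, assuming the spec lives in cluster.yaml as in the upgrade command above; editing the file by hand works just as well):

# Put osImageURL back to the original path
sed -i 's|/v0.22.6/ubuntu-2204-kube-1-31.gz|/ubuntu-2204-kube-1-31.gz|' cluster.yaml
# Re-run the same upgrade
eksctl anywhere upgrade cluster \
  --hardware-csv hardware.csv \
  -f cluster.yaml \
  --no-timeouts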
Check the emojis again:
Performing setup and validations
✅ Tinkerbell provider validation
✅ SSH Keys present
✅ Validate OS is compatible with registry mirror configuration
✅ Validate certificate for registry mirror
✅ Control plane ready
❌ Validation failed {"validation": "worker nodes ready", "error": "machine deployment is in ScalingUp phase", "remediation": "ensure machine deployments for cluster eks-a are ready"}
✅ Nodes ready
✅ Cluster CRDs ready
✅ Cluster object present on workload cluster
✅ Upgrade cluster kubernetes version increment
✅ Upgrade cluster worker node group kubernetes version increment
✅ Validate authentication for git provider
✅ Validate immutable fields
✅ Validate cluster's eksaVersion matches EKS-Anywhere Version
✅ Validate eksa controller is not paused
✅ Validate extended kubernetes version support is supported
✅ Validate pod disruption budgets
✅ Validate eksaVersion skew is one minor version
Error: failed to upgrade cluster: validations failed: machine deployment is in ScalingUp phase
Great. Now I can't add nodes or perform any eksctl anywhere operation whatsoever.
As you can see, it is not enough to just go back to the old image to fix this; it requires some careful fiddling with the cluster CRs.
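To see what the preflight is complaining about (just looking at the stuck state, not fixing anything, and assuming the CAPI objects sit in the eksa-system namespace):

# Confirm the worker MachineDeployment is stuck in the ScalingUp phase
kubectl get machinedeployments -n eksa-system
kubectl get machines -n eksa-system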
I am not going to detail this here because people might think it could apply in other situations and they could potentially break a production cluster.
Please address this.
Later edit: I think it should actually throw a clearer error, since scaling up and upgrading are not supported at the same time.
Environment:
- EKS Anywhere Release
$ eksctl anywhere version
Version: v0.22.6
Release Manifest URL: https://anywhere-assets.eks.amazonaws.com/releases/eks-a/manifest.yaml
Bundle Manifest URL: https://anywhere-assets.eks.amazonaws.com/releases/bundles/100/manifest.yaml