
eksctl anywhere operations rendered unusable after osImageURL change #9899

@swapzero

Description


This is anything but nice. It completely breaks eksctl anywhere operations just because of a valid URL change.


Context

  • Setup is Bare Metal in a test environment that probably gets wiped 10 times per day, if not more.
  • You created a new EKS-A cluster with 3 control plane nodes and 1 worker node.
hardware.csv
hostname,mac,ip_address,netmask,gateway,nameservers,labels,disk
cplane-0,XX:XX:XX:XX:XX:01,10.162.10.131,255.255.255.240,10.162.10.129,8.8.8.8|8.8.4.4,type=cp,/dev/sda
cplane-1,XX:XX:XX:XX:XX:02,10.162.10.132,255.255.255.240,10.162.10.129,8.8.8.8|8.8.4.4,type=cp,/dev/sda
cplane-2,XX:XX:XX:XX:XX:03,10.162.10.133,255.255.255.240,10.162.10.129,8.8.8.8|8.8.4.4,type=cp,/dev/sda
worker-0,XX:XX:XX:XX:XX:04,10.162.10.134,255.255.255.240,10.162.10.129,8.8.8.8|8.8.4.4,type=worker,/dev/sda
worker-1,XX:XX:XX:XX:XX:05,10.162.10.135,255.255.255.240,10.162.10.129,8.8.8.8|8.8.4.4,type=worker,/dev/sda
worker-2,XX:XX:XX:XX:XX:06,10.162.10.136,255.255.255.240,10.162.10.129,8.8.8.8|8.8.4.4,type=worker,/dev/sda

Cluster creation command

eksctl anywhere create cluster \
  --hardware-csv hardware.csv \
  -f eksa-mgmt-cluster.yaml \
  --no-timeouts

The cluster is up (with 1 worker node):

$ kubectl get nodes
NAME       STATUS   ROLES           AGE     VERSION
cplane-0   Ready    control-plane   19m     v1.31.9-eks-ca3410b
cplane-1   Ready    control-plane   15m     v1.31.9-eks-ca3410b
cplane-2   Ready    control-plane   9m53s   v1.31.9-eks-ca3410b
worker-0   Ready    <none>          3m46s   v1.31.9-eks-ca3410b

You now want to add more nodes

It doesn't really matter if you created the cluster with 1, 2 or 20 nodes — the idea is that you want to scale up the workers. You still have 2 "spares" in hardware.csv so you only have to increase the count.

Oh, and by the way, you re-organized the S3 bucket that serves the images, so the image now lives under a path that includes the EKS Anywhere bundle version:

--- orig.yaml	2025-07-13 10:29:49.840935788 +0000
+++ new.yaml	2025-07-13 10:30:22.217130563 +0000
@@ -28,7 +28,7 @@
   managementCluster:
     name: eks-a
   workerNodeGroupConfigurations:
-  - count: 1
+  - count: 3
     machineGroupRef:
       kind: TinkerbellMachineConfig
       name: eks-a
@@ -41,7 +41,7 @@
 metadata:
   name: eks-a
 spec:
-  osImageURL: "https://BUCKET_NAME.s3.eu-central-1.amazonaws.com/ubuntu-2204-kube-1-31.gz"
+  osImageURL: "https://BUCKET_NAME.s3.eu-central-1.amazonaws.com/v0.22.6/ubuntu-2204-kube-1-31.gz"
   tinkerbellIP: "10.162.10.141"

 ---

Start the cluster upgrade to add more nodes

eksctl anywhere upgrade cluster \
  --hardware-csv hardware.csv \
  -f cluster.yaml \
  --no-timeouts

Preflight check passes nicely

Performing setup and validations
✅ Tinkerbell provider validation
✅ SSH Keys present
✅ Validate OS is compatible with registry mirror configuration
✅ Validate certificate for registry mirror
✅ Control plane ready
✅ Worker nodes ready
✅ Nodes ready
✅ Cluster CRDs ready
✅ Cluster object present on workload cluster
✅ Upgrade cluster kubernetes version increment
✅ Upgrade cluster worker node group kubernetes version increment
✅ Validate authentication for git provider
✅ Validate immutable fields
✅ Validate cluster's eksaVersion matches EKS-Anywhere Version
✅ Validate eksa controller is not paused
✅ Validate extended kubernetes version support is supported
✅ Validate pod disruption budgets
✅ Validate eksaVersion skew is one minor version
Ensuring etcd CAPI providers exist on management cluster before upgrade
Pausing GitOps cluster resources reconcile
Upgrading core components

Well, that's about all. With --no-timeouts, the process quickly starts throwing this error endlessly, in a loop. The message makes absolutely no sense: it looks like it wants to start a rolling upgrade just because the image path changed, not the version.

"Cluster generation and observedGeneration","Generation":2,"ObservedGeneration":2
"Error happened during retry","error":"cluster has an error: hardware validation failure: for node rollout, minimum hardware count not met for selector '{\"type\":\"cp\"}': have 0, require 1","retries":138
"Sleeping before next retry","time":"1s"

The osImageURL documentation states:

The osImageURL must contain the Cluster.Spec.KubernetesVersion or Cluster.Spec.WorkerNodeGroupConfiguration[].KubernetesVersion version (in case of modular upgrade). 
For example, if the Kubernetes version is 1.31, the osImageURL name should include 1.31, 1_31, 1-31 or 131.

So why is it trying to upgrade based on a simple path change in the image URL? The required part (1-31) is still there and still matches kubernetesVersion.
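
My guess at the mechanism: as far as I can tell, the osImageURL ends up in the generated machine templates, so any change to it, even a pure path change, produces a new template revision and therefore a rolling replacement of the nodes. If you want to confirm this on your management cluster, something like the following should show it (again assuming the eksa-system namespace):

# A new TinkerbellMachineTemplate appears after the osImageURL edit, and the
# control plane / machine deployments get re-pointed at it, triggering the rollout
kubectl get tinkerbellmachinetemplates -n eksa-system
kubectl get kubeadmcontrolplane,machinedeployments -n eksa-system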


The fix?

Obvious: put back the old image URL and start over... oh, wait! You need to stop the existing process. Ok, CTRL+C it, change the image to the old path and start again.


Check the emojis again:

Performing setup and validations
✅ Tinkerbell provider validation
✅ SSH Keys present
✅ Validate OS is compatible with registry mirror configuration
✅ Validate certificate for registry mirror
✅ Control plane ready
❌ Validation failed    {"validation": "worker nodes ready", "error": "machine deployment is in ScalingUp phase", "remediation": "ensure machine deployments for cluster eks-a are ready"}
✅ Nodes ready
✅ Cluster CRDs ready
✅ Cluster object present on workload cluster
✅ Upgrade cluster kubernetes version increment
✅ Upgrade cluster worker node group kubernetes version increment
✅ Validate authentication for git provider
✅ Validate immutable fields
✅ Validate cluster's eksaVersion matches EKS-Anywhere Version
✅ Validate eksa controller is not paused
✅ Validate extended kubernetes version support is supported
✅ Validate pod disruption budgets
✅ Validate eksaVersion skew is one minor version
Error: failed to upgrade cluster: validations failed: machine deployment is in ScalingUp phase

Great. Now I can't add nodes or perform any eksctl anywhere operation whatsoever.
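
For what it's worth, the "ScalingUp phase" seems to come from the worker MachineDeployment that is still waiting for hardware it will never get, so every subsequent preflight run trips over it. Something like this should show the stuck state (same eksa-system assumption as above):

# The worker MachineDeployment stays in phase ScalingUp and never converges,
# which is what the preflight validation fails on for every later operation
kubectl get machinedeployments -n eksa-system
kubectl describe machinedeployments -n eksa-system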

As you can see, simply reverting to the old image is not enough to fix this; it requires some careful fiddling with the cluster CRs.
I am not going to detail that here, because people might think it applies in other situations and could end up breaking a production cluster.


Please address this.
Later edit: I think it should actually throw a clearer error, since a scale-up and an upgrade are not supported at the same time.


Environment:

  • EKS Anywhere Release
$ eksctl anywhere version
Version: v0.22.6
Release Manifest URL: https://anywhere-assets.eks.amazonaws.com/releases/eks-a/manifest.yaml
Bundle Manifest URL: https://anywhere-assets.eks.amazonaws.com/releases/bundles/100/manifest.yaml
