
eksctl anywhere operations rendered unusable after osImageURL change #9899

@swapzero

Description


This is anything but nice. It completely breaks eksctl anywhere operations just because of a valid URL change.


Context

  • Setup is Bare Metal in a test environment that probably gets wiped 10 times per day, if not more.
  • You created a new EKS-A cluster with 3 control plane nodes and 1 worker node.
hardware.csv
hostname,mac,ip_address,netmask,gateway,nameservers,labels,disk
cplane-0,XX:XX:XX:XX:XX:01,10.162.10.131,255.255.255.240,10.162.10.129,8.8.8.8|8.8.4.4,type=cp,/dev/sda
cplane-1,XX:XX:XX:XX:XX:02,10.162.10.132,255.255.255.240,10.162.10.129,8.8.8.8|8.8.4.4,type=cp,/dev/sda
cplane-2,XX:XX:XX:XX:XX:03,10.162.10.133,255.255.255.240,10.162.10.129,8.8.8.8|8.8.4.4,type=cp,/dev/sda
worker-0,XX:XX:XX:XX:XX:04,10.162.10.134,255.255.255.240,10.162.10.129,8.8.8.8|8.8.4.4,type=worker,/dev/sda
worker-1,XX:XX:XX:XX:XX:05,10.162.10.135,255.255.255.240,10.162.10.129,8.8.8.8|8.8.4.4,type=worker,/dev/sda
worker-2,XX:XX:XX:XX:XX:06,10.162.10.136,255.255.255.240,10.162.10.129,8.8.8.8|8.8.4.4,type=worker,/dev/sda

Cluster creation command

eksctl anywhere create cluster \
  --hardware-csv hardware.csv \
  -f eksa-mgmt-cluster.yaml \
  --no-timeouts

The cluster is up (with 1 worker node):

$ kubectl get nodes
NAME       STATUS   ROLES           AGE     VERSION
cplane-0   Ready    control-plane   19m     v1.31.9-eks-ca3410b
cplane-1   Ready    control-plane   15m     v1.31.9-eks-ca3410b
cplane-2   Ready    control-plane   9m53s   v1.31.9-eks-ca3410b
worker-0   Ready    <none>          3m46s   v1.31.9-eks-ca3410b

You now want to add more nodes

It doesn't really matter if you created the cluster with 1, 2 or 20 nodes — the idea is that you want to scale up the workers. You still have 2 "spares" in hardware.csv so you only have to increase the count.

Oh, and by the way, you re-organized the S3 bucket that serves the images, so the image now lives under a path that includes the EKS Anywhere bundle version:

--- orig.yaml	2025-07-13 10:29:49.840935788 +0000
+++ new.yaml	2025-07-13 10:30:22.217130563 +0000
@@ -28,7 +28,7 @@
   managementCluster:
     name: eks-a
   workerNodeGroupConfigurations:
-  - count: 1
+  - count: 3
     machineGroupRef:
       kind: TinkerbellMachineConfig
       name: eks-a
@@ -41,7 +41,7 @@
 metadata:
   name: eks-a
 spec:
-  osImageURL: "https://BUCKET_NAME.s3.eu-central-1.amazonaws.com/ubuntu-2204-kube-1-31.gz"
+  osImageURL: "https://BUCKET_NAME.s3.eu-central-1.amazonaws.com/v0.22.6/ubuntu-2204-kube-1-31.gz"
   tinkerbellIP: "10.162.10.141"

 ---

Start the cluster upgrade to add more nodes

eksctl anywhere upgrade cluster \
  --hardware-csv hardware.csv \
  -f cluster.yaml \
  --no-timeouts

Preflight check passes nicely

Performing setup and validations
✅ Tinkerbell provider validation
✅ SSH Keys present
✅ Validate OS is compatible with registry mirror configuration
✅ Validate certificate for registry mirror
✅ Control plane ready
✅ Worker nodes ready
✅ Nodes ready
✅ Cluster CRDs ready
✅ Cluster object present on workload cluster
✅ Upgrade cluster kubernetes version increment
✅ Upgrade cluster worker node group kubernetes version increment
✅ Validate authentication for git provider
✅ Validate immutable fields
✅ Validate cluster's eksaVersion matches EKS-Anywhere Version
✅ Validate eksa controller is not paused
✅ Validate extended kubernetes version support is supported
✅ Validate pod disruption budgets
✅ Validate eksaVersion skew is one minor version
Ensuring etcd CAPI providers exist on management cluster before upgrade
Pausing GitOps cluster resources reconcile
Upgrading core components

Well, that's about all. With --no-timeouts, the process quickly starts throwing this error endlessly, in a loop. The message makes absolutely no sense: it looks like it wants to start a rolling upgrade just because the image path changed, not the version.

"Cluster generation and observedGeneration","Generation":2,"ObservedGeneration":2
"Error happened during retry","error":"cluster has an error: hardware validation failure: for node rollout, minimum hardware count not met for selector '{\"type\":\"cp\"}': have 0, require 1","retries":138
"Sleeping before next retry","time":"1s"

The osImageURL documentation states:

The osImageURL must contain the Cluster.Spec.KubernetesVersion or Cluster.Spec.WorkerNodeGroupConfiguration[].KubernetesVersion version (in case of modular upgrade). 
For example, if the Kubernetes version is 1.31, the osImageURL name should include 1.31, 1_31, 1-31 or 131.

So why is it trying to upgrade based on a simple path change in the image URL? The required part (1-31) is still there and still matches kubernetesVersion.
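
My guess at the mechanism: as far as I can tell, the osImageURL ends up in the generated machine templates, so any change to it, even a pure path change, produces a new template revision and therefore a rolling replacement of the nodes. If you want to confirm this on your management cluster, something like the following should show it (again assuming the eksa-system namespace):

# A new TinkerbellMachineTemplate appears after the osImageURL edit, and the
# control plane / machine deployments get re-pointed at it, triggering the rollout
kubectl get tinkerbellmachinetemplates -n eksa-system
kubectl get kubeadmcontrolplane,machinedeployments -n eksa-system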


The fix?

Obvious: put back the old image URL and start over... oh, wait! You need to stop the existing process. Ok, CTRL+C it, change the image to the old path and start again.


Check the emojis again:

Performing setup and validations
✅ Tinkerbell provider validation
✅ SSH Keys present
✅ Validate OS is compatible with registry mirror configuration
✅ Validate certificate for registry mirror
✅ Control plane ready
❌ Validation failed    {"validation": "worker nodes ready", "error": "machine deployment is in ScalingUp phase", "remediation": "ensure machine deployments for cluster eks-a are ready"}
✅ Nodes ready
✅ Cluster CRDs ready
✅ Cluster object present on workload cluster
✅ Upgrade cluster kubernetes version increment
✅ Upgrade cluster worker node group kubernetes version increment
✅ Validate authentication for git provider
✅ Validate immutable fields
✅ Validate cluster's eksaVersion matches EKS-Anywhere Version
✅ Validate eksa controller is not paused
✅ Validate extended kubernetes version support is supported
✅ Validate pod disruption budgets
✅ Validate eksaVersion skew is one minor version
Error: failed to upgrade cluster: validations failed: machine deployment is in ScalingUp phase

Great. Now I can't add nodes or perform any eksctl anywhere operation whatsoever.
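
For what it's worth, the "ScalingUp phase" seems to come from the worker MachineDeployment that is still waiting for hardware it will never get, so every subsequent preflight run trips over it. Something like this should show the stuck state (same eksa-system assumption as above):

# The worker MachineDeployment stays in phase ScalingUp and never converges,
# which is what the preflight validation fails on for every later operation
kubectl get machinedeployments -n eksa-system
kubectl describe machinedeployments -n eksa-system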

As you can see, simply reverting to the old image is not enough to fix this; it requires some careful fiddling with the cluster CRs.
I am not going to detail that here, because people might think it applies in other situations and could end up breaking a production cluster.


Please address this.
Later edit: I think it should actually throw a clearer error, since a scale-up and an upgrade are not supported at the same time.


Environment:

  • EKS Anywhere Release
$ eksctl anywhere version
Version: v0.22.6
Release Manifest URL: https://anywhere-assets.eks.amazonaws.com/releases/eks-a/manifest.yaml
Bundle Manifest URL: https://anywhere-assets.eks.amazonaws.com/releases/bundles/100/manifest.yaml
