🐛 makes the replica count configurable and updates upgrade-e2e test cases #2371

Helm template for the catalogd Deployment:

```diff
@@ -12,11 +12,11 @@ metadata:
   namespace: {{ .Values.namespaces.olmv1.name }}
 spec:
   minReadySeconds: 5
-  replicas: 1
+  replicas: {{ .Values.options.catalogd.deployment.replicas }}
   strategy:
     type: RollingUpdate
     rollingUpdate:
-      maxSurge: 1 # Allow temporary 2 pods (1 + 1) for zero-downtime updates
+      maxSurge: 1 # Allow temporary extra pod (e.g., 3 pods with 2 replicas) for zero-downtime updates
       maxUnavailable: 0 # Never allow pods to be unavailable during updates
   selector:
     matchLabels:
```

Helm template for the operator-controller Deployment:

```diff
@@ -11,11 +11,11 @@ metadata:
   name: operator-controller-controller-manager
   namespace: {{ .Values.namespaces.olmv1.name }}
 spec:
-  replicas: 1
+  replicas: {{ .Values.options.operatorController.deployment.replicas }}
   strategy:
     type: RollingUpdate
     rollingUpdate:
-      maxSurge: 1 # Allow temporary 2 pods (1 + 1) for zero-downtime updates
+      maxSurge: 1 # Allow temporary extra pod (e.g., 3 pods with 2 replicas) for zero-downtime updates
       maxUnavailable: 0 # Never allow pods to be unavailable during updates
   selector:
     matchLabels:
```

Helm chart default values:

```diff
@@ -8,6 +8,7 @@ options:
     enabled: true
     deployment:
       image: quay.io/operator-framework/operator-controller:devel
+      replicas: 2
```

**Contributor:** I think the default should remain `1`.

**Contributor:** I generally agree with @pedjak here; we should default to 1 upstream, but allow the configuration to be updated downstream. However, because we are allowing changes here, we will need to ensure that the e2e tests can still be run both upstream and downstream.

**Contributor:** Thinking about it more, maybe we want to use 1 for the default and put 2 in the experimental configuration? That way, we can test both values.

**Member (Author):** Community users will also encounter a similar upgrade error (OCPBUGS-62517) if we disable the PDB by default.

**Member (Author):** In my opinion, we should maintain consistency between the upstream and downstream projects. This is also one of the core principles of open source: fostering alignment, collaboration, and transparency across the entire ecosystem. When upstream and downstream diverge without clear justification, it creates unnecessary complexity, increases maintenance overhead, and makes it harder for users and contributors to understand or adopt the project. By keeping the two sides consistent, we ensure that new features, bug fixes, and improvements flow smoothly in both directions. It also strengthens long-term sustainability, because the community can rely on a unified architecture and shared expectations. Ultimately, this consistency reinforces the value of open source itself: working together toward a common, coherent, and maintainable solution.

**Contributor:** As usual, it depends. IMHO, this is a downstream bug, because the downstream has certain expectations about component availability during upgrades. If the OLM deployment has just a single pod, it is perfectly normal for the deployment to become briefly unavailable during an upgrade, and many upstream users might be fine with that, especially because OLM is a controller that performs reconciliations asynchronously (i.e., eventually). Any ClusterExtensions created or modified during an OLM upgrade are not lost and will still be reconciled, so no service interruption happens at all. Having said that, it is also perfectly fine to add knobs upstream so that the number of pods and PDBs can be configured for users with stronger requirements, but we should not enforce them, especially if that means that running OLM suddenly requires twice as many resources.

**Contributor:** I would still want this to be 1 for the standard setup (and tests). We can set this to 2 in the experimental configuration.

```diff
       extraArguments: []
     features:
       enabled: []
@@ -19,6 +20,7 @@ options:
     enabled: true
     deployment:
       image: quay.io/operator-framework/catalogd:devel
+      replicas: 2
```

**Contributor:** Same as above.

```diff
       extraArguments: []
     features:
       enabled: []
```

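For anyone who wants to exercise the new knob locally, a minimal values override is enough. This is only a sketch: the `replicas` key paths come from the templates in this diff, the surrounding structure of the values file is inferred from the context lines above, and the numbers simply illustrate the split discussed in the thread (1 as the standard default, 2 where availability during upgrades matters):

```yaml
# Hypothetical override file (e.g. passed to helm with -f/--values); only the
# `replicas` keys are introduced by this PR, everything else is assumed context.
options:
  operatorController:
    deployment:
      replicas: 1   # suggested standard/upstream default from the review thread
  catalogd:
    deployment:
      replicas: 2   # higher count for setups with stronger availability requirements
```

Equivalent overrides could also be passed on the command line, e.g. `--set options.catalogd.deployment.replicas=2`.
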
Generated manifest (regenerated from the chart; see the review note below):

```diff
@@ -2107,11 +2107,11 @@ metadata:
   namespace: olmv1-system
 spec:
   minReadySeconds: 5
-  replicas: 1
+  replicas: 2
```

**Contributor:** Same as above.

**Contributor:** This is a generated file; the change comes from the Helm values change above.

```diff
   strategy:
     type: RollingUpdate
     rollingUpdate:
-      maxSurge: 1 # Allow temporary 2 pods (1 + 1) for zero-downtime updates
+      maxSurge: 1 # Allow temporary extra pod (e.g., 3 pods with 2 replicas) for zero-downtime updates
       maxUnavailable: 0 # Never allow pods to be unavailable during updates
   selector:
     matchLabels:
```

```diff
@@ -2258,11 +2258,11 @@ metadata:
   name: operator-controller-controller-manager
   namespace: olmv1-system
 spec:
-  replicas: 1
+  replicas: 2
```

**Contributor:** Same comment as earlier.

```diff
   strategy:
     type: RollingUpdate
     rollingUpdate:
-      maxSurge: 1 # Allow temporary 2 pods (1 + 1) for zero-downtime updates
+      maxSurge: 1 # Allow temporary extra pod (e.g., 3 pods with 2 replicas) for zero-downtime updates
       maxUnavailable: 0 # Never allow pods to be unavailable during updates
   selector:
     matchLabels:
```

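The review thread above also brings up PodDisruptionBudgets as an optional availability knob rather than something to enforce by default. As a rough sketch only (this resource is not part of the PR; the name and selector labels are assumptions and would need to match the actual Deployment labels), such a PDB for the operator-controller Deployment rendered above could look like:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: operator-controller-controller-manager   # assumed name, mirroring the Deployment above
  namespace: olmv1-system
spec:
  minAvailable: 1   # only meaningful once replicas > 1
  selector:
    matchLabels:
      control-plane: operator-controller-controller-manager   # assumed label; verify against the Deployment
```

Whether something like this ships by default or stays an opt-in knob is exactly the trade-off debated in the thread above.
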
E2E test changes (Go):

```diff
@@ -27,8 +27,11 @@ import (
 )

 const (
-    artifactName = "operator-controller-e2e"
-    pollDuration = time.Minute
+    artifactName = "operator-controller-e2e"
+    // pollDuration is set to 3 minutes to account for leader election time in multi-replica deployments.
```

**Contributor:** Our e2e tests do not need to run in an HA setup now; having just a single replica should be fine. If additional HA tests are needed, they should be provided downstream.

**Contributor:** However, the upstream e2e tests do run downstream as well, and we need to ensure they can work there too.

**Contributor:** Sure, but if I am not mistaken, OCPBUGS-62517 reports a downstream test failing, not an upstream one?

**Contributor:** Yes, but it's all the same downstream (OCP) configuration. We can't change the OCP configuration just to run our upstream e2e tests. Our upstream e2e tests are expected to run mostly unmodified on OCP. The upstream e2e will encounter 2 replicas when run on OCP (because that's the OCP configuration), but 1 replica upstream (because that's the upstream configuration). And because we'd also allow users to run multiple replicas upstream, our upstream e2e should accommodate that.

**Contributor:** Agreed that our e2e tests should be agnostic to the underlying setup, but our upstream e2e CI jobs should run against an OLM deployment with a single replica (as they have until now).

```diff
+    // In the worst case (previous leader crashed), leader election can take up to 163 seconds
+    // (LeaseDuration: 137s + RetryPeriod: 26s). Adding buffer for reconciliation time.
+    pollDuration = 3 * time.Minute
     pollInterval = time.Second
     testCatalogRefEnvVar = "CATALOG_IMG"
     testCatalogName = "test-catalog"
```

```diff
@@ -169,18 +172,19 @@ location = "docker-registry.operator-controller-e2e.svc.cluster.local:5000"`,
         require.NoError(ct, c.Get(context.Background(), types.NamespacedName{Name: clusterExtension.Name}, clusterExtension))
     }, 2*time.Minute, pollInterval)

-    // Give the check 2 minutes instead of the typical 1 for the pod's
-    // files to update from the configmap change.
+    // Give the check extra time for the pod's files to update from the configmap change.
     // The theoretical max time is the kubelet sync period of 1 minute +
-    // ConfigMap cache TTL of 1 minute = 2 minutes
+    // ConfigMap cache TTL of 1 minute = 2 minutes.
+    // With multi-replica deployments, add leader election time (up to 163s in worst case).
+    // Total: 2 min (ConfigMap) + 2.7 min (leader election) + buffer = 5 minutes
     t.Log("By eventually reporting progressing as True")
     require.EventuallyWithT(t, func(ct *assert.CollectT) {
         require.NoError(ct, c.Get(context.Background(), types.NamespacedName{Name: clusterExtension.Name}, clusterExtension))
         cond := apimeta.FindStatusCondition(clusterExtension.Status.Conditions, ocv1.TypeProgressing)
         require.NotNil(ct, cond)
         require.Equal(ct, metav1.ConditionTrue, cond.Status)
         require.Equal(ct, ocv1.ReasonSucceeded, cond.Reason)
-    }, 2*time.Minute, pollInterval)
+    }, 5*time.Minute, pollInterval)

     t.Log("By eventually installing the package successfully")
     require.EventuallyWithT(t, func(ct *assert.CollectT) {
```

```diff
@@ -655,6 +659,8 @@ func TestClusterExtensionRecoversFromNoNamespaceWhenFailureFixed(t *testing.T) {
     // backoff of this eventually check we MUST ensure we do not touch the ClusterExtension
     // after creating int the Namespace and ServiceAccount.
     t.Log("By eventually installing the package successfully")
+    // Use 5 minutes for recovery tests to account for exponential backoff after repeated failures
+    // plus leader election time (up to 163s in worst case)
     require.EventuallyWithT(t, func(ct *assert.CollectT) {
         require.NoError(ct, c.Get(context.Background(), types.NamespacedName{Name: clusterExtension.Name}, clusterExtension))
         cond := apimeta.FindStatusCondition(clusterExtension.Status.Conditions, ocv1.TypeInstalled)
@@ -663,7 +669,7 @@ func TestClusterExtensionRecoversFromNoNamespaceWhenFailureFixed(t *testing.T) {
         require.Equal(ct, ocv1.ReasonSucceeded, cond.Reason)
         require.Contains(ct, cond.Message, "Installed bundle")
         require.NotEmpty(ct, clusterExtension.Status.Install)
-    }, pollDuration, pollInterval)
+    }, 5*time.Minute, pollInterval)

     t.Log("By eventually reporting Progressing == True with Reason Success")
     require.EventuallyWithT(t, func(ct *assert.CollectT) {
```

```diff
@@ -777,6 +783,8 @@ func TestClusterExtensionRecoversFromExistingDeploymentWhenFailureFixed(t *testing.T) {
     // backoff of this eventually check we MUST ensure we do not touch the ClusterExtension
     // after deleting the Deployment.
     t.Log("By eventually installing the package successfully")
+    // Use 5 minutes for recovery tests to account for exponential backoff after repeated failures
+    // plus leader election time (up to 163s in worst case)
     require.EventuallyWithT(t, func(ct *assert.CollectT) {
         require.NoError(ct, c.Get(context.Background(), types.NamespacedName{Name: clusterExtension.Name}, clusterExtension))
         cond := apimeta.FindStatusCondition(clusterExtension.Status.Conditions, ocv1.TypeInstalled)
@@ -785,7 +793,7 @@ func TestClusterExtensionRecoversFromExistingDeploymentWhenFailureFixed(t *testing.T) {
         require.Equal(ct, ocv1.ReasonSucceeded, cond.Reason)
         require.Contains(ct, cond.Message, "Installed bundle")
         require.NotEmpty(ct, clusterExtension.Status.Install)
-    }, pollDuration, pollInterval)
+    }, 5*time.Minute, pollInterval)

     t.Log("By eventually reporting Progressing == True with Reason Success")
     require.EventuallyWithT(t, func(ct *assert.CollectT) {
```

**Reviewer:** Are these changes really needed in this PR, i.e. can we do them separately?

**Author:** I guess not. I updated it since it failed during the test. You can check the test history.