4 changes: 2 additions & 2 deletions Makefile
@@ -484,8 +484,8 @@ run-experimental: run-internal #HELP Build the operator-controller then deploy i
CATD_NAMESPACE := olmv1-system
.PHONY: wait
wait:
kubectl wait --for=condition=Available --namespace=$(CATD_NAMESPACE) deployment/catalogd-controller-manager --timeout=60s
kubectl wait --for=condition=Ready --namespace=$(CATD_NAMESPACE) certificate/catalogd-service-cert # Avoid upgrade test flakes when reissuing cert
kubectl wait --for=condition=Available --namespace=$(CATD_NAMESPACE) deployment/catalogd-controller-manager --timeout=3m
kubectl wait --for=condition=Ready --namespace=$(CATD_NAMESPACE) certificate/catalogd-service-cert --timeout=3m # Avoid upgrade test flakes when reissuing cert

.PHONY: docker-build
docker-build: build-linux #EXHELP Build docker image for operator-controller and catalog with GOOS=linux and local GOARCH.
4 changes: 2 additions & 2 deletions hack/test/pre-upgrade-setup.sh
@@ -155,5 +155,5 @@ spec:
version: 1.0.0
EOF

kubectl wait --for=condition=Serving --timeout=60s ClusterCatalog $TEST_CLUSTER_CATALOG_NAME
kubectl wait --for=condition=Installed --timeout=60s ClusterExtension $TEST_CLUSTER_EXTENSION_NAME
kubectl wait --for=condition=Serving --timeout=5m ClusterCatalog $TEST_CLUSTER_CATALOG_NAME
kubectl wait --for=condition=Installed --timeout=5m ClusterExtension $TEST_CLUSTER_EXTENSION_NAME
Comment on lines +158 to +159
Contributor:

Are these changes really needed in this PR, i.e. can we do them separately?

Member Author:

I guess no. I updated it since it failed during the test. You can check the test history.

@@ -12,11 +12,11 @@ metadata:
namespace: {{ .Values.namespaces.olmv1.name }}
spec:
minReadySeconds: 5
replicas: 1
replicas: {{ .Values.options.catalogd.deployment.replicas }}
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1 # Allow temporary 2 pods (1 + 1) for zero-downtime updates
maxSurge: 1 # Allow temporary extra pod (e.g., 3 pods with 2 replicas) for zero-downtime updates
Contributor:

From an operational point of view this is a no-op change - please revert it.

Contributor:

Updating the comment makes sense, especially since the number of pods is based on deployment configuration. But I'd suggest something else.

Member Author:

The comment is necessary so that others can understand the reason.

maxUnavailable: 0 # Never allow pods to be unavailable during updates
selector:
matchLabels:
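The rolling-update comments above boil down to simple arithmetic: with `maxUnavailable: 0`, the Deployment never drops below its desired replica count, and `maxSurge: 1` allows one extra pod during a rollout. A minimal Go sketch of that math (the `rolloutPodBounds` helper is hypothetical, for illustration only - it is not part of the chart or the controllers):

```go
package main

import "fmt"

// rolloutPodBounds returns the maximum number of pods that may exist during a
// rolling update and the minimum number that must remain available, given the
// Deployment's replicas, maxSurge, and maxUnavailable settings.
func rolloutPodBounds(replicas, maxSurge, maxUnavailable int) (maxPods, minAvailable int) {
	return replicas + maxSurge, replicas - maxUnavailable
}

func main() {
	maxPods, minAvail := rolloutPodBounds(2, 1, 0)
	fmt.Println(maxPods, minAvail) // 3 2: up to 3 pods, never fewer than 2 available
	maxPods, minAvail = rolloutPodBounds(1, 1, 0)
	fmt.Println(maxPods, minAvail) // 2 1: the old single-replica case the original comment described
}
```

This is why the comment text had to change: the "(1 + 1)" wording was only correct for the old hard-coded `replicas: 1`.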
@@ -11,11 +11,11 @@ metadata:
name: operator-controller-controller-manager
namespace: {{ .Values.namespaces.olmv1.name }}
spec:
replicas: 1
replicas: {{ .Values.options.operatorController.deployment.replicas }}
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1 # Allow temporary 2 pods (1 + 1) for zero-downtime updates
maxSurge: 1 # Allow temporary extra pod (e.g., 3 pods with 2 replicas) for zero-downtime updates
Contributor:

same as above.

maxUnavailable: 0 # Never allow pods to be unavailable during updates
selector:
matchLabels:
2 changes: 2 additions & 0 deletions helm/olmv1/values.yaml
@@ -8,6 +8,7 @@ options:
enabled: true
deployment:
image: quay.io/operator-framework/operator-controller:devel
replicas: 2
Contributor (@pedjak, Dec 4, 2025):

I think the default should remain 1, and podDisruptionBudget.enabled should default to false as well. >1 is required for HA setups. We should not implicitly enforce after an upgrade that OLM suddenly consumes more cluster resources (i.e. pods). Such things should not be hidden from users.

Contributor:

I generally agree with @pedjak here, we should be defaulting to 1 upstream, but allow the configuration to be updated downstream. However, because we are allowing changes here, we will need to ensure that the e2e tests can still be run upstream and downstream.

Contributor:

Now that I think about it, maybe we want to use 1 as the default and put 2 in experimental? That way, we can test both values.

Member Author (@jianzhangbjz, Dec 5, 2025):

Community users will also encounter a similar upgrade error (OCPBUGS-62517) if we disable the PDB by default.

Member Author:

In my opinion, we should maintain consistency between the upstream and downstream projects. This is also one of the core principles of open source: fostering alignment, collaboration, and transparency across the entire ecosystem. When upstream and downstream diverge without clear justification, it creates unnecessary complexity, increases maintenance overhead, and makes it harder for users and contributors to understand or adopt the project.

By keeping the two sides consistent, we ensure that new features, bug fixes, and improvements flow smoothly in both directions. It also strengthens long-term sustainability, because the community can rely on a unified architecture and shared expectations. Ultimately, this consistency reinforces the value of open source itself—working together toward a common, coherent, and maintainable solution.

Contributor:

The community user will also encounter a similar upgrade error

As usual, it depends. IMHO, this is a downstream bug, because the downstream has certain expectations about component availability during upgrade. If the OLM deployment has just a single pod, it is perfectly normal for the deployment to become briefly unavailable during an upgrade, and many upstream users might be OK with that, especially because OLM is a controller that performs reconciliations asynchronously (i.e., eventually). All ClusterExtensions created or modified during an OLM upgrade are not lost and will be reconciled - hence no service interruption happens at all.

Having said that, it is also perfectly fine to add knobs upstream so that the number of pods and PDBs can be configured for users with stronger requirements, but we should not enforce them, especially if that means running OLM suddenly requires twice as many resources.

Contributor:

I would still want this to be 1 for the standard setup (and tests).

We can set this to 2 in experimental.yaml to also test the multi-replica e2e without forcing the default user to have 2 replicas.

extraArguments: []
features:
enabled: []
@@ -19,6 +20,7 @@ options:
enabled: true
deployment:
image: quay.io/operator-framework/catalogd:devel
replicas: 2
Contributor:

same as above.

extraArguments: []
features:
enabled: []
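For users who agree with the reviewers and want to keep the previous single-pod footprint, the new knob can be overridden back to 1 via a Helm values file. A hypothetical override sketch - the field paths are taken from the `.Values.options.*.deployment.replicas` references in this diff, and the file name is illustrative:

```yaml
# values-single-replica.yaml (hypothetical): pin both components back to one pod
options:
  operatorController:
    deployment:
      replicas: 1
  catalogd:
    deployment:
      replicas: 1
```

This keeps the multi-replica capability available without forcing every installation to consume the extra pods by default.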
8 changes: 4 additions & 4 deletions manifests/experimental-e2e.yaml
@@ -2107,11 +2107,11 @@ metadata:
namespace: olmv1-system
spec:
minReadySeconds: 5
replicas: 1
replicas: 2
Contributor:

same as above.

Contributor:

This is a generated file; the change comes from values.yaml.

strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1 # Allow temporary 2 pods (1 + 1) for zero-downtime updates
maxSurge: 1 # Allow temporary extra pod (e.g., 3 pods with 2 replicas) for zero-downtime updates
Contributor:

same comment as earlier.

maxUnavailable: 0 # Never allow pods to be unavailable during updates
selector:
matchLabels:
@@ -2258,11 +2258,11 @@ metadata:
name: operator-controller-controller-manager
namespace: olmv1-system
spec:
replicas: 1
replicas: 2
Contributor:

same comment as earlier.

strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1 # Allow temporary 2 pods (1 + 1) for zero-downtime updates
maxSurge: 1 # Allow temporary extra pod (e.g., 3 pods with 2 replicas) for zero-downtime updates
maxUnavailable: 0 # Never allow pods to be unavailable during updates
selector:
matchLabels:
8 changes: 4 additions & 4 deletions manifests/experimental.yaml
@@ -2032,11 +2032,11 @@ metadata:
namespace: olmv1-system
spec:
minReadySeconds: 5
replicas: 1
replicas: 2
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1 # Allow temporary 2 pods (1 + 1) for zero-downtime updates
maxSurge: 1 # Allow temporary extra pod (e.g., 3 pods with 2 replicas) for zero-downtime updates
maxUnavailable: 0 # Never allow pods to be unavailable during updates
selector:
matchLabels:
@@ -2170,11 +2170,11 @@ metadata:
name: operator-controller-controller-manager
namespace: olmv1-system
spec:
replicas: 1
replicas: 2
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1 # Allow temporary 2 pods (1 + 1) for zero-downtime updates
maxSurge: 1 # Allow temporary extra pod (e.g., 3 pods with 2 replicas) for zero-downtime updates
maxUnavailable: 0 # Never allow pods to be unavailable during updates
selector:
matchLabels:
8 changes: 4 additions & 4 deletions manifests/standard-e2e.yaml
@@ -1795,11 +1795,11 @@ metadata:
namespace: olmv1-system
spec:
minReadySeconds: 5
replicas: 1
replicas: 2
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1 # Allow temporary 2 pods (1 + 1) for zero-downtime updates
maxSurge: 1 # Allow temporary extra pod (e.g., 3 pods with 2 replicas) for zero-downtime updates
maxUnavailable: 0 # Never allow pods to be unavailable during updates
selector:
matchLabels:
@@ -1945,11 +1945,11 @@ metadata:
name: operator-controller-controller-manager
namespace: olmv1-system
spec:
replicas: 1
replicas: 2
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1 # Allow temporary 2 pods (1 + 1) for zero-downtime updates
maxSurge: 1 # Allow temporary extra pod (e.g., 3 pods with 2 replicas) for zero-downtime updates
maxUnavailable: 0 # Never allow pods to be unavailable during updates
selector:
matchLabels:
8 changes: 4 additions & 4 deletions manifests/standard.yaml
@@ -1720,11 +1720,11 @@ metadata:
namespace: olmv1-system
spec:
minReadySeconds: 5
replicas: 1
replicas: 2
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1 # Allow temporary 2 pods (1 + 1) for zero-downtime updates
maxSurge: 1 # Allow temporary extra pod (e.g., 3 pods with 2 replicas) for zero-downtime updates
maxUnavailable: 0 # Never allow pods to be unavailable during updates
selector:
matchLabels:
@@ -1857,11 +1857,11 @@ metadata:
name: operator-controller-controller-manager
namespace: olmv1-system
spec:
replicas: 1
replicas: 2
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1 # Allow temporary 2 pods (1 + 1) for zero-downtime updates
maxSurge: 1 # Allow temporary extra pod (e.g., 3 pods with 2 replicas) for zero-downtime updates
maxUnavailable: 0 # Never allow pods to be unavailable during updates
selector:
matchLabels:
24 changes: 16 additions & 8 deletions test/e2e/cluster_extension_install_test.go
@@ -27,8 +27,11 @@ import (
)

const (
artifactName = "operator-controller-e2e"
pollDuration = time.Minute
artifactName = "operator-controller-e2e"
// pollDuration is set to 3 minutes to account for leader election time in multi-replica deployments.
Contributor:

Our e2e tests do not need to run in an HA setup now; having just a single replica should be fine. If additional HA tests are needed, they should be provided downstream.

Contributor:

However, the upstream e2e tests do run downstream as well, and we need to ensure they can work there too.

Contributor (@pedjak, Dec 5, 2025):

Sure, but if I am not mistaken, OCPBUGS-62517 reports a downstream test failing, not an upstream one?

Contributor (@tmshort, Dec 5, 2025):

Yes, but it's all the same downstream (OCP) configuration. We can't change the OCP configuration just to run our upstream e2e's. Our upstream e2e's are expected to be able to run mostly unmodified on OCP.

The upstream e2e will encounter 2 replicas when run in OCP (because that's the OCP configuration), but will encounter 1 replica upstream (because that's the upstream configuration).

And because we'd also allow the user to support multiple replicas upstream, our upstream e2e should accommodate that.

Contributor:

our upstream e2e should accommodate that.

Agree that our e2e tests should be agnostic to the underlying setup, but our upstream e2e CI jobs should run against an OLM deployment with a single replica (as they have until now).

// In the worst case (previous leader crashed), leader election can take up to 163 seconds
// (LeaseDuration: 137s + RetryPeriod: 26s). Adding buffer for reconciliation time.
pollDuration = 3 * time.Minute
pollInterval = time.Second
testCatalogRefEnvVar = "CATALOG_IMG"
testCatalogName = "test-catalog"
@@ -169,18 +172,19 @@ location = "docker-registry.operator-controller-e2e.svc.cluster.local:5000"`,
require.NoError(ct, c.Get(context.Background(), types.NamespacedName{Name: clusterExtension.Name}, clusterExtension))
}, 2*time.Minute, pollInterval)

// Give the check 2 minutes instead of the typical 1 for the pod's
// files to update from the configmap change.
// Give the check extra time for the pod's files to update from the configmap change.
// The theoretical max time is the kubelet sync period of 1 minute +
// ConfigMap cache TTL of 1 minute = 2 minutes
// ConfigMap cache TTL of 1 minute = 2 minutes.
// With multi-replica deployments, add leader election time (up to 163s in worst case).
// Total: 2 min (ConfigMap) + 2.7 min (leader election) + buffer = 5 minutes
t.Log("By eventually reporting progressing as True")
require.EventuallyWithT(t, func(ct *assert.CollectT) {
require.NoError(ct, c.Get(context.Background(), types.NamespacedName{Name: clusterExtension.Name}, clusterExtension))
cond := apimeta.FindStatusCondition(clusterExtension.Status.Conditions, ocv1.TypeProgressing)
require.NotNil(ct, cond)
require.Equal(ct, metav1.ConditionTrue, cond.Status)
require.Equal(ct, ocv1.ReasonSucceeded, cond.Reason)
}, 2*time.Minute, pollInterval)
}, 5*time.Minute, pollInterval)

t.Log("By eventually installing the package successfully")
require.EventuallyWithT(t, func(ct *assert.CollectT) {
@@ -655,6 +659,8 @@ func TestClusterExtensionRecoversFromNoNamespaceWhenFailureFixed(t *testing.T) {
// backoff of this eventually check we MUST ensure we do not touch the ClusterExtension
// after creating the Namespace and ServiceAccount.
t.Log("By eventually installing the package successfully")
// Use 5 minutes for recovery tests to account for exponential backoff after repeated failures
// plus leader election time (up to 163s in worst case)
require.EventuallyWithT(t, func(ct *assert.CollectT) {
require.NoError(ct, c.Get(context.Background(), types.NamespacedName{Name: clusterExtension.Name}, clusterExtension))
cond := apimeta.FindStatusCondition(clusterExtension.Status.Conditions, ocv1.TypeInstalled)
@@ -663,7 +669,7 @@
require.Equal(ct, ocv1.ReasonSucceeded, cond.Reason)
require.Contains(ct, cond.Message, "Installed bundle")
require.NotEmpty(ct, clusterExtension.Status.Install)
}, pollDuration, pollInterval)
}, 5*time.Minute, pollInterval)

t.Log("By eventually reporting Progressing == True with Reason Success")
require.EventuallyWithT(t, func(ct *assert.CollectT) {
@@ -777,6 +783,8 @@ func TestClusterExtensionRecoversFromExistingDeploymentWhenFailureFixed(t *testi
// backoff of this eventually check we MUST ensure we do not touch the ClusterExtension
// after deleting the Deployment.
t.Log("By eventually installing the package successfully")
// Use 5 minutes for recovery tests to account for exponential backoff after repeated failures
// plus leader election time (up to 163s in worst case)
require.EventuallyWithT(t, func(ct *assert.CollectT) {
require.NoError(ct, c.Get(context.Background(), types.NamespacedName{Name: clusterExtension.Name}, clusterExtension))
cond := apimeta.FindStatusCondition(clusterExtension.Status.Conditions, ocv1.TypeInstalled)
@@ -785,7 +793,7 @@ func TestClusterExtensionRecoversFromExistingDeploymentWhenFailureFixed(t *testi
require.Equal(ct, ocv1.ReasonSucceeded, cond.Reason)
require.Contains(ct, cond.Message, "Installed bundle")
require.NotEmpty(ct, clusterExtension.Status.Install)
}, pollDuration, pollInterval)
}, 5*time.Minute, pollInterval)

t.Log("By eventually reporting Progressing == True with Reason Success")
require.EventuallyWithT(t, func(ct *assert.CollectT) {
5 changes: 4 additions & 1 deletion test/helpers/helpers.go
@@ -34,7 +34,10 @@ var (
)

const (
pollDuration = time.Minute
// pollDuration is set to 3 minutes to account for leader election time in multi-replica deployments.
// In the worst case (previous leader crashed), leader election can take up to 163 seconds
// (LeaseDuration: 137s + RetryPeriod: 26s). Adding buffer for reconciliation time.
pollDuration = 3 * time.Minute
pollInterval = time.Second
testCatalogName = "test-catalog"
testCatalogRefEnvVar = "CATALOG_IMG"