
Conversation

@jianzhangbjz
Member

@jianzhangbjz jianzhangbjz commented Dec 4, 2025

Description

Make deployments HA-ready with configurable replica count

Problem

The current deployment configuration creates a deadlock during node maintenance operations:

  • Both operator-controller-controller-manager and catalogd-controller-manager have 1 replica (hardcoded)
  • Both have PodDisruptionBudgets with minAvailable: 1
  • When the single pod runs on a node being drained, eviction violates the PDB
  • This prevents node drain operations from completing

Impact: Node maintenance operations (upgrades, scaling, etc.) fail with:
Cannot evict pod as it would violate the pod's disruption budget.

Root cause: With only 1 replica and minAvailable: 1, there's no way to safely evict the pod while maintaining availability.
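
For illustration, a minimal sketch of the two objects that conflict (resource names taken from the bullets above; the label selector is an assumption, the actual manifests may differ):

# Illustrative sketch only; the label selector below is assumed
apiVersion: apps/v1
kind: Deployment
metadata:
  name: operator-controller-controller-manager
spec:
  replicas: 1                    # hardcoded single replica
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: operator-controller-controller-manager
spec:
  minAvailable: 1                # evicting the only pod can never satisfy this
  selector:
    matchLabels:
      control-plane: operator-controller-controller-manager   # assumed label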

Solution

This PR makes the replica count configurable and sets a sensible default of 2 replicas:

  1. Added replicas configuration to Helm values:

    • options.operatorController.deployment.replicas: 2
    • options.catalogd.deployment.replicas: 2
  2. Updated Helm templates to use the configurable value instead of the hardcoded 1 (see the sketch below this list)

  3. Regenerated all manifests with the new defaults
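
A rough sketch of what item 2 refers to, using the value path from item 1 (the exact template file and surrounding fields are assumed):

# Hypothetical excerpt from the operator-controller Deployment Helm template
spec:
  replicas: {{ .Values.options.operatorController.deployment.replicas }}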

Benefits

  • Enables safe node maintenance: With 2 replicas, one pod can be evicted while maintaining minAvailable: 1
  • Zero-downtime updates: RollingUpdate with 2 replicas ensures continuous service availability
  • Improved fault tolerance: System continues operating if one pod fails
  • Production-ready defaults: Aligns with HA best practices for control plane components
  • Flexible configuration: Replica count can be customized via Helm values

Compatibility

Existing deployments: When this change is applied, deployments will scale from 1 to 2 replicas:

  • RollingUpdate strategy ensures zero downtime during scale-up
  • maxSurge: 1 allows temporary 3rd pod during transition
  • maxUnavailable: 0 prevents service interruption
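
For reference, the strategy stanza these bullets describe looks roughly like this in the Deployment spec (a sketch; exact field placement assumed):

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1        # a temporary extra pod is allowed during the 1 -> 2 scale-up
    maxUnavailable: 0  # no existing pod is removed before its replacement is Ready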

Custom deployments: Users can override the default by setting:

options:
  operatorController:
    deployment:
      replicas: <custom-value>
  catalogd:
    deployment:
      replicas: <custom-value>

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

Assisted-By: Claude Code

@jianzhangbjz jianzhangbjz requested a review from a team as a code owner December 4, 2025 03:09
@netlify

netlify bot commented Dec 4, 2025

Deploy Preview for olmv1 ready!

  • 🔨 Latest commit: 7e6ea57
  • 🔍 Latest deploy log: https://app.netlify.com/projects/olmv1/deploys/693257db4619af0008eef753
  • 😎 Deploy Preview: https://deploy-preview-2371--olmv1.netlify.app

@jianzhangbjz
Member Author

/retest

@jianzhangbjz
Member Author

Encountered this error again:

=== RUN   TestClusterExtensionInstallRegistry/no_registry_configuration_necessary
    cluster_extension_install_test.go:57: When a cluster extension is installed from a catalog
    cluster_extension_install_test.go:58: When the extension bundle format is registry+v1
    helpers.go:263: Ensuring ClusterCatalog has Status.Condition of Progressing with a status == True and reason == Succeeded
    helpers.go:273: Checking that catalog has the expected metadata label
    helpers.go:278: Ensuring ClusterCatalog has Status.Condition of Type = Serving with status == True
    cluster_extension_install_test.go:79: It resolves the specified package with correct bundle path
    cluster_extension_install_test.go:80: By creating the ClusterExtension resource
    cluster_extension_install_test.go:83: By eventually reporting a successful resolution and bundle path
    cluster_extension_install_test.go:88: By eventually reporting progressing as True
    cluster_extension_install_test.go:89: 
        	Error Trace:	/home/runner/work/operator-controller/operator-controller/test/e2e/cluster_extension_install_test.go:94
        	            				/opt/hostedtoolcache/go/1.24.6/x64/src/runtime/asm_amd64.s:1700
        	Error:      	Not equal: 
        	            	expected: "Succeeded"
        	            	actual  : "Retrying"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-Succeeded
        	            	+Retrying
    cluster_extension_install_test.go:89: 
        	Error Trace:	/home/runner/work/operator-controller/operator-controller/test/e2e/cluster_extension_install_test.go:89
        	Error:      	Condition never satisfied
        	Test:       	TestClusterExtensionInstallRegistry/no_registry_configuration_necessary
Failed to list cluster extensions: no matches for kind "ClusterExtensionRevision" in version "olm.operatorframework.io/v1"
    helpers.go:321: By deleting ClusterCatalog "test-catalog-jnwbqjlw"
    helpers.go:330: By deleting ClusterExtension "clusterextension-fqz4fls8"
    helpers.go:294: By waiting for CustomResourceDefinitions of "clusterextension-fqz4fls8" to be deleted
    helpers.go:302: By waiting for ClusterRoleBindings of "clusterextension-fqz4fls8" to be deleted
    helpers.go:310: By waiting for ClusterRoles of "clusterextension-fqz4fls8" to be deleted
    helpers.go:340: By deleting ServiceAccount "clusterextension-fqz4fls8"
    helpers.go:349: By deleting Namespace "clusterextension-fqz4fls8"

@tmshort
Contributor

tmshort commented Dec 4, 2025

Changing the number of replicas directly impacts the expectations of our e2es. Perhaps default it to 1 or update the tests with some additional logic.

Allowing us to configure the number of replicas and other values will give us better downstream control.

@jianzhangbjz jianzhangbjz force-pushed the pdb2 branch 3 times, most recently from fa1b4f7 to 89b96d0 Compare December 4, 2025 07:23
@jianzhangbjz jianzhangbjz changed the title 🐛 makes the replica count **configurable** and sets a sensible default of **2 replicas** 🐛 makes the replica count configurable and update upgrade-e2e test cases Dec 4, 2025
@jianzhangbjz jianzhangbjz force-pushed the pdb2 branch 5 times, most recently from b03ff44 to 5403227 Compare December 4, 2025 09:13
@codecov

codecov bot commented Dec 4, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.90%. Comparing base (c85bc55) to head (7e6ea57).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2371      +/-   ##
==========================================
+ Coverage   70.61%   74.90%   +4.28%     
==========================================
  Files          95       95              
  Lines        7375     7336      -39     
==========================================
+ Hits         5208     5495     +287     
+ Misses       1728     1408     -320     
+ Partials      439      433       -6     
Flag               Coverage Δ
e2e                45.14% <ø> (+0.47%) ⬆️
experimental-e2e   49.59% <ø> (+35.24%) ⬆️
unit               58.81% <ø> (+0.36%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@jianzhangbjz jianzhangbjz force-pushed the pdb2 branch 4 times, most recently from da978dd to 1ba8c7e Compare December 4, 2025 10:56
@jianzhangbjz
Member Author

It failed again; see more at https://github.com/operator-framework/operator-controller/actions/runs/19926553979/job/57127694168?pr=2371

./testdata/build-test-registry.sh operator-controller-e2e docker-registry localhost/e2e-test-registry:devel
namespace/operator-controller-e2e created
certificate.cert-manager.io/operator-controller-e2e-registry created
deployment.apps/docker-registry created
service/docker-registry created
error: timed out waiting for the condition on deployments/docker-registry
make: *** [Makefile:267: image-registry] Error 1

@jianzhangbjz jianzhangbjz force-pushed the pdb2 branch 2 times, most recently from 6d955b8 to b7d4b4d Compare December 4, 2025 11:38
@jianzhangbjz
Member Author

Hi @tmshort, I have updated all related e2e test cases. Could you take a look? Thanks!

Contributor

@pedjak pedjak left a comment

Our upstream e2e tests do not need an HA setup to run; please keep them as lean as they were before this PR.

Comment on lines +158 to +159
kubectl wait --for=condition=Serving --timeout=5m ClusterCatalog $TEST_CLUSTER_CATALOG_NAME
kubectl wait --for=condition=Installed --timeout=5m ClusterExtension $TEST_CLUSTER_EXTENSION_NAME
Contributor

are these changes really needed in this PR, i.e. can we do them separately?

Member Author

I guess not. I updated it because it failed during the test; you can check the test history.

  type: RollingUpdate
  rollingUpdate:
-   maxSurge: 1 # Allow temporary 2 pods (1 + 1) for zero-downtime updates
+   maxSurge: 1 # Allow temporary extra pod (e.g., 3 pods with 2 replicas) for zero-downtime updates
Contributor

From an operational point of view this is a no-op change - please revert it.

Contributor

Updating the comment makes sense, especially since the number of pods is based on deployment configuration. But I'd suggest something else.

Member Author

The comment is necessary so that others can understand the reason.

  type: RollingUpdate
  rollingUpdate:
-   maxSurge: 1 # Allow temporary 2 pods (1 + 1) for zero-downtime updates
+   maxSurge: 1 # Allow temporary extra pod (e.g., 3 pods with 2 replicas) for zero-downtime updates
Contributor

same as above.

  enabled: true
  deployment:
    image: quay.io/operator-framework/operator-controller:devel
+   replicas: 2
Contributor

@pedjak pedjak Dec 4, 2025

I think the default should remain 1, and podDisruptionBudget.enabled should default to false as well. A replica count >1 is only required for HA setups. We should not implicitly force OLM, after an upgrade, to suddenly consume more cluster resources (i.e., more pods). Such a change should not be hidden from users.
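
For illustration, the suggested upstream defaults might look like this in values.yaml (a sketch; key paths follow the ones already used in this PR, exact layout assumed):

options:
  operatorController:
    deployment:
      replicas: 1            # keep a single replica upstream
    podDisruptionBudget:
      enabled: false         # PDB only makes sense with more than one replica
  catalogd:
    deployment:
      replicas: 1
    podDisruptionBudget:
      enabled: false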

Contributor

I generally agree with @pedjak here: we should default to 1 upstream, but allow the configuration to be updated downstream. However, because we are allowing changes here, we will need to ensure that the e2e tests can still be run upstream and downstream.

Contributor

Thinking about it now, maybe we want to use 1 for the default and put 2 in experimental? That way, we can test both values.

Member Author

@jianzhangbjz jianzhangbjz Dec 5, 2025

Community users will also encounter a similar upgrade error (OCPBUGS-62517) if we disable the PDB by default.

Member Author

In my opinion, we should maintain consistency between the upstream and downstream projects. This is also one of the core principles of open source: fostering alignment, collaboration, and transparency across the entire ecosystem. When upstream and downstream diverge without clear justification, it creates unnecessary complexity, increases maintenance overhead, and makes it harder for users and contributors to understand or adopt the project.

By keeping the two sides consistent, we ensure that new features, bug fixes, and improvements flow smoothly in both directions. It also strengthens long-term sustainability, because the community can rely on a unified architecture and shared expectations. Ultimately, this consistency reinforces the value of open source itself—working together toward a common, coherent, and maintainable solution.

Contributor

The community user will also encounter a similar upgrade error

As usual, it depends. IMHO, this is a downstream bug, because the downstream has certain expectations about component availability during upgrade. If the OLM deployment has just a single pod, it is perfectly normal that the deployment becomes briefly unavailable during an upgrade, and many upstream users might be ok with that, especially because OLM is a controller that performs reconciliations asynchronously (i.e. eventually). All ClusterExtensions that get created/modified during an OLM upgrade are not lost and will still be reconciled - hence no service interruption happens at all.

Having said that, it is also perfectly fine to add knobs upstream so that the number of pods and PDBs can be configured/added for users with stronger requirements, but we should not enforce them, especially if that means running OLM suddenly requires twice as many resources.

Contributor

I would still want this to be 1 for the standard setup (and tests).

We can set this to 2 in experimental.yaml to also test the multi-replica e2e without forcing the default user to have 2 replicas.

  enabled: true
  deployment:
    image: quay.io/operator-framework/catalogd:devel
+   replicas: 2
Contributor

same as above.

  type: RollingUpdate
  rollingUpdate:
-   maxSurge: 1 # Allow temporary 2 pods (1 + 1) for zero-downtime updates
+   maxSurge: 1 # Allow temporary extra pod (e.g., 3 pods with 2 replicas) for zero-downtime updates
Contributor

same comment as earlier.

  spec:
    minReadySeconds: 5
-   replicas: 1
+   replicas: 2
Contributor

same as above.

Contributor

This is a generated file; the change comes from values.yaml.

  namespace: olmv1-system
  spec:
-   replicas: 1
+   replicas: 2
Contributor

same comment as earlier.

  artifactName = "operator-controller-e2e"
- pollDuration = time.Minute
+ // pollDuration is set to 3 minutes to account for leader election time in multi-replica deployments.
Contributor

Our e2e tests do not need to run in an HA setup now; having just a single replica should be fine. If additional HA tests are needed, they should be provided downstream.

Contributor

However, the upstream e2e's do run downstream as well, and we need to ensure they work there too.

Contributor

@pedjak pedjak Dec 5, 2025

Sure, but if I am not mistaken, OCPBUGS-62517 reports a downstream test failing, not an upstream one?

Contributor

@tmshort tmshort Dec 5, 2025

Yes, but it's all the same downstream (OCP) configuration. We can't change the OCP configuration just to run our upstream e2e's. Our upstream e2e's are expected to be able to run mostly unmodified on OCP.

The upstream e2e will encounter 2 replicas when run in OCP (because that's the OCP configuration), but will encounter 1 replica upstream (because that's the upstream configuration).

And because we'd also allow users to configure multiple replicas upstream, our upstream e2e should accommodate that.

Contributor

our upstream e2e should accommodate that.

Agree that our e2e tests should be agnostic to the underlying setup, but our upstream e2e CI jobs should run against an OLM deployment with a single replica (as they have until now).

@tmshort
Contributor

tmshort commented Dec 4, 2025

It would be good to run this downstream as well.

@pedjak
Contributor

pedjak commented Dec 4, 2025

It would be good to run this downstream as well.

Sure, and so far we have been running e2e tests with a single replica downstream; is that no longer possible?

@tmshort
Contributor

tmshort commented Dec 4, 2025

It would be good to run this downstream as well.

Sure, and so far we have been running e2e tests with a single replica downstream; is that no longer possible?

If we have to run two replicas downstream to handle the upgrade bug, then we'll be running the tests against two replicas.

I'm trying this out now, we'll see how it goes.

@tmshort
Contributor

tmshort commented Dec 4, 2025

I just noticed we have {{ .Values.options.operatorController.podDisruptionBudget.maxUnavailable }} in the template, but we don't set a default in values.yaml.
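
For example, a hypothetical values.yaml addition that would back that template reference with a default (the key path comes from the template expression; the value itself is an assumption):

options:
  operatorController:
    podDisruptionBudget:
      maxUnavailable: 1   # assumed default; currently not set in values.yaml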

@tmshort
Contributor

tmshort commented Dec 4, 2025

Can you also update the commit description?

@pedjak
Contributor

pedjak commented Dec 4, 2025

If we have to run two replicas downstream to handle the upgrade bug, then we'll be running the tests against two replicas.

If I understand correctly, the test reported as failing in OCPBUGS-62517 is not part of the upstream e2e suite.

@tmshort
Contributor

tmshort commented Dec 4, 2025

If we have to run two replicas downstream to handle the upgrade bug, then we'll be running the tests against two replicas.

If I understand correctly, the test reported as failing in OCPBUGS-62517 is not part of the upstream e2e suite.

Correct. But to fix OCPBUGS-62517, we will need to have two replicas on OCP. Thus, when we run our upstream e2e's on OCP, they will encounter two replicas, and so the upstream e2e's will need to handle more than one replica, even if we default upstream to 1, which I also recommend.

My downstream testing seems to be passing once I updated the downstream Helm config to 2 replicas and re-enabled the PDB.

@openshift-ci

openshift-ci bot commented Dec 5, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from pedjak. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…er-manager.yml

Co-authored-by: Todd Short <tmshort@users.noreply.github.com>
@jianzhangbjz
Member Author

To address:

=== RUN   TestClusterExtensionVersionUpdate
    single_namespace_support_test.go:361: When a cluster extension is installed from a catalog
    single_namespace_support_test.go:362: When resolving upgrade edges
    helpers.go:266: Ensuring ClusterCatalog has Status.Condition of Progressing with a status == True and reason == Succeeded
    helpers.go:276: Checking that catalog has the expected metadata label
    helpers.go:281: Ensuring ClusterCatalog has Status.Condition of Type = Serving with status == True
    single_namespace_support_test.go:368: By creating an ClusterExtension at a specified version
    single_namespace_support_test.go:383: By eventually reporting a successful resolution
    single_namespace_support_test.go:392: It allows to upgrade the ClusterExtension to a non-successor version
    single_namespace_support_test.go:393: By forcing update of ClusterExtension resource to a non-successor version
    single_namespace_support_test.go:398: By eventually reporting a satisfiable resolution
    single_namespace_support_test.go:406: We should have two ClusterExtensionRevision resources
panic: test timed out after 10m0s
	running tests:
		TestClusterExtensionVersionUpdate (10s)

goroutine 2466 [running]:
testing.(*M).startAlarm.func1()

@jianzhangbjz jianzhangbjz requested a review from pedjak December 5, 2025 06:48
@tmshort
Contributor

tmshort commented Dec 5, 2025

I'm going to walk back my comment on putting replicas: 2 into experimental. I agree with @pedjak that we should have 1, but support 2. What I really want is to have our e2e tests be able to test both the 1 replica and 2 replica scenarios.

EDIT: This can easily be done by making experimental-e2e.yaml have two replicas - which won't impact the released YAML files. Create a new values file that only specifies 2 replicas, and add it to the list in the Makefile here:

$(EXPERIMENTAL_E2E_MANIFEST) ?= helm/cert-manager.yaml helm/experimental.yaml helm/e2e.yaml
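
A minimal sketch of such a values file, assuming a hypothetical name like helm/ha-e2e.yaml; it would then be appended to the manifest list in the Makefile line above:

# helm/ha-e2e.yaml (hypothetical file): overrides only the replica counts
options:
  operatorController:
    deployment:
      replicas: 2
  catalogd:
    deployment:
      replicas: 2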
