@hors (Collaborator) commented Sep 12, 2025

K8SPSMDB-1466

CHANGE DESCRIPTION

Description

This PR fixes the issue where the Percona Server MongoDB Operator would crash when encountering errors during Multi-Cluster Services (MCS) discovery, specifically the "stale GroupVersion discovery" error.

Problem

The operator was crashing with the "stale GroupVersion discovery" error.

This occurred during MCS registration when dc.ServerPreferredResources() failed, causing the entire operator pod to crash and restart.

Solution

  • Graceful Error Handling: Modified the Register() function in pkg/mcs/register.go to handle any discovery error gracefully
  • Mark MCS Unavailable: Instead of crashing, the operator now marks MCS as unavailable and continues normal operation
  • Comprehensive Testing: Added unit tests to ensure MCS functionality works correctly
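
The graceful-degradation pattern described above can be sketched as follows. This is a minimal illustration, not the code in pkg/mcs/register.go: the groupVersionLister interface and fakeLister type are hypothetical stand-ins for the discovery client (the real code calls dc.ServerPreferredResources()), and the group name multicluster.x-k8s.io is assumed for the example.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// mcsGroup is an assumed MCS API group name, used only for illustration.
const mcsGroup = "multicluster.x-k8s.io"

// errStaleDiscovery imitates the discovery failure named in this PR.
var errStaleDiscovery = errors.New("stale GroupVersion discovery")

// groupVersionLister is a hypothetical stand-in for the discovery client.
type groupVersionLister interface {
	PreferredGroupVersions() ([]string, error)
}

// fakeLister returns canned data so the sketch runs without a cluster.
type fakeLister struct {
	groupVersions []string
	err           error
}

func (f fakeLister) PreferredGroupVersions() ([]string, error) {
	return f.groupVersions, f.err
}

// available records whether the MCS API was found at registration time.
var available bool

// Register probes for the MCS API group. A discovery failure is not fatal:
// MCS is optional, so it is marked unavailable and nil is returned.
func Register(dc groupVersionLister) error {
	available = false
	gvs, err := dc.PreferredGroupVersions()
	if err != nil {
		return nil
	}
	for _, gv := range gvs {
		if strings.HasPrefix(gv, mcsGroup+"/") {
			available = true
			break
		}
	}
	return nil
}

// IsAvailable reports the result of the last Register call.
func IsAvailable() bool { return available }

func main() {
	// A discovery failure no longer aborts startup.
	err := Register(fakeLister{err: errStaleDiscovery})
	fmt.Println(err == nil, IsAvailable())

	// When the MCS group is served, availability flips to true.
	_ = Register(fakeLister{groupVersions: []string{"apps/v1", mcsGroup + "/v1alpha1"}})
	fmt.Println(IsAvailable())
}
```

Returning nil on a discovery error is what keeps startup going; callers are then expected to check IsAvailable() before doing any MCS-specific work.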

New Unit Tests in pkg/mcs/register_test.go

  • TestIsAvailable(): Tests MCS availability status
  • TestMCSSchemeGroupVersion(): Tests scheme group version initialization
  • TestServiceExport(): Tests ServiceExport object creation
  • TestServiceExportList(): Tests ServiceExportList object creation

Benefits

  • Prevents Operator Crashes: Operator no longer crashes on MCS discovery errors
  • Graceful Degradation: MCS functionality is marked as unavailable but operator continues working
  • Improved Reliability: Operator is more resilient to external service issues
  • Comprehensive Testing: MCS functionality is now properly tested
  • Backward Compatible: No breaking changes to existing functionality

Testing

  • Unit tests pass for MCS functionality
  • Operator starts successfully without MCS available
  • Operator continues normal operation when MCS discovery fails
  • No regression in existing functionality

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Code quality improvement
  • Test coverage improvement

Checklist

  • Code follows the project's coding standards
  • Self-review of the code has been performed
  • Code has been commented, particularly in hard-to-understand areas
  • Unit tests have been added/updated
  • No new warnings or errors introduced
  • Changes are backward compatible

Related Issues

Fixes the operator crash that occurs when MCS discovery fails with the "stale GroupVersion discovery" error.

Additional Notes

This fix ensures that MCS (Multi-Cluster Services) is treated as an optional feature. When MCS discovery fails for any reason, the operator gracefully marks MCS as unavailable and continues with normal MongoDB cluster operations. This makes the operator more resilient and prevents unnecessary pod restarts.

@pull-request-size pull-request-size bot added the size/M 30-99 lines label Sep 12, 2025
@hors hors marked this pull request as ready for review September 16, 2025 10:26
Comment on lines 38 to 41
// MCS is optional functionality - if discovery fails for any reason,
// mark it as unavailable and continue without crashing the operator
available = false
return nil
Contributor commented:

I think we need to log the error and inform users that MCS is not available.

@hors hors requested a review from egegunes November 14, 2025 10:30
@JNKPercona (Collaborator) commented:

Test Name Result Time
arbiter passed 00:11:14
balancer passed 00:17:47
cross-site-sharded passed 00:18:08
custom-replset-name passed 00:10:11
custom-tls passed 00:13:48
custom-users-roles passed 00:10:27
custom-users-roles-sharded passed 00:11:28
data-at-rest-encryption passed 00:12:31
data-sharded passed 00:22:28
demand-backup failure 00:06:03
demand-backup-eks-credentials-irsa passed 00:00:06
demand-backup-fs passed 00:22:53
demand-backup-if-unhealthy passed 00:07:51
demand-backup-incremental failure 00:05:16
demand-backup-incremental-sharded failure 00:07:50
demand-backup-physical-parallel failure 00:06:43
demand-backup-physical-aws failure 00:05:35
demand-backup-physical-azure passed 00:12:18
demand-backup-physical-gcp-s3 passed 00:11:55
demand-backup-physical-gcp-native failure 00:58:22
demand-backup-physical-minio passed 00:20:15
demand-backup-physical-sharded-parallel failure 00:08:53
demand-backup-physical-sharded-aws failure 00:07:09
demand-backup-physical-sharded-azure passed 00:17:38
demand-backup-physical-sharded-gcp-native failure 01:02:40
demand-backup-physical-sharded-minio passed 00:17:35
demand-backup-sharded failure 00:13:24
expose-sharded passed 00:33:37
finalizer passed 00:10:03
ignore-labels-annotations passed 00:07:49
init-deploy passed 00:13:07
ldap passed 00:08:47
ldap-tls passed 00:12:35
limits passed 00:06:11
liveness passed 00:08:35
mongod-major-upgrade passed 00:13:08
mongod-major-upgrade-sharded passed 00:20:59
monitoring-2-0 passed 00:24:48
monitoring-pmm3 passed 00:28:48
multi-cluster-service passed 00:13:57
multi-storage passed 00:18:25
non-voting-and-hidden passed 00:16:58
one-pod passed 00:08:04
operator-self-healing-chaos passed 00:12:24
pitr passed 00:31:43
pitr-physical passed 01:02:11
pitr-sharded passed 00:20:23
pitr-to-new-cluster passed 00:24:31
pitr-physical-backup-source passed 00:53:48
preinit-updates passed 00:04:53
pvc-resize passed 00:12:11
recover-no-primary passed 00:27:08
replset-overrides passed 00:16:27
rs-shard-migration passed 00:13:46
scaling passed 00:11:07
scheduled-backup failure 00:06:51
security-context passed 00:07:19
self-healing-chaos passed 00:14:58
service-per-pod passed 00:18:42
serviceless-external-nodes passed 00:07:34
smart-update passed 00:08:08
split-horizon passed 00:07:44
stable-resource-version passed 00:05:04
storage passed 00:07:30
tls-issue-cert-manager passed 00:29:13
upgrade passed 00:09:09
upgrade-consistency passed 00:06:36
upgrade-consistency-sharded-tls passed 00:51:48
upgrade-sharded passed 00:19:38
upgrade-partial-backup passed 00:14:27
users passed 00:17:17
version-service passed 00:25:31
Summary Value
Tests Run 72/72
Job Duration 04:00:46
Total Test Time 20:14:58

commit: cbb0b70
image: perconalab/percona-server-mongodb-operator:PR-2044-cbb0b708
