K8SPSMDB-1466 improve MCS error handling to prevent operator crashes #2044

hors · 2025-09-12T17:30:53Z

CHANGE DESCRIPTION

Description

This PR fixes the issue where the Percona Server MongoDB Operator would crash when encountering errors during Multi-Cluster Services (MCS) discovery, specifically the "stale GroupVersion discovery" error.

Problem

The operator was crashing with the "stale GroupVersion discovery" error.

This occurred in the MCS registration process when `dc.ServerPreferredResources()` failed, causing the entire operator pod to crash and restart.

Solution

- Graceful Error Handling: Modified the `Register()` function in `pkg/mcs/register.go` to handle any discovery error gracefully
- Mark MCS Unavailable: Instead of crashing, the operator now marks MCS as unavailable and continues normal operation
- Comprehensive Testing: Added unit tests to ensure MCS functionality works correctly
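
A minimal sketch of the behaviour described above, not the actual `pkg/mcs/register.go` code. The `discoverer` interface and `fakeDiscoverer` type are hypothetical stand-ins for the Kubernetes discovery client so the example needs no cluster, and `mcsGroupVersion` is only an example value:

```go
package main

import "errors"

// mcsGroupVersion is an example value for the upstream MCS API; the
// group/versions the operator actually checks are not shown in this PR.
const mcsGroupVersion = "multicluster.x-k8s.io/v1alpha1"

// errStaleDiscovery imitates the failure named in the description.
var errStaleDiscovery = errors.New("stale GroupVersion discovery")

// discoverer is a hypothetical stand-in for the discovery client.
type discoverer interface {
	PreferredGroupVersions() ([]string, error)
}

// fakeDiscoverer returns canned results so the sketch runs without a cluster.
type fakeDiscoverer struct {
	groupVersions []string
	err           error
}

func (f fakeDiscoverer) PreferredGroupVersions() ([]string, error) {
	return f.groupVersions, f.err
}

// available mirrors the package-level flag set in the diff.
var available bool

// register never propagates a discovery error: MCS is optional, so a
// failed lookup only marks the feature as unavailable.
func register(d discoverer) error {
	available = false
	gvs, err := d.PreferredGroupVersions()
	if err != nil {
		return nil
	}
	for _, gv := range gvs {
		if gv == mcsGroupVersion {
			available = true
		}
	}
	return nil
}

// isAvailable reports whether the last registration found the MCS API.
func isAvailable() bool { return available }
```

Calling `register(fakeDiscoverer{err: errStaleDiscovery})` returns `nil` and leaves `isAvailable()` reporting `false`, so startup can proceed.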

New Unit Tests in `pkg/mcs/register_test.go`

- `TestIsAvailable()`: Tests MCS availability status
- `TestMCSSchemeGroupVersion()`: Tests scheme group version initialization
- `TestServiceExport()`: Tests ServiceExport object creation
- `TestServiceExportList()`: Tests ServiceExportList object creation

Benefits

✅ Prevents Operator Crashes: Operator no longer crashes on MCS discovery errors
✅ Graceful Degradation: MCS functionality is marked as unavailable but operator continues working
✅ Improved Reliability: Operator is more resilient to external service issues
✅ Comprehensive Testing: MCS functionality is now properly tested
✅ Backward Compatible: No breaking changes to existing functionality

Testing

Unit tests pass for MCS functionality
Operator starts successfully without MCS available
Operator continues normal operation when MCS discovery fails
No regression in existing functionality

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Code quality improvement
Test coverage improvement

Checklist

Code follows the project's coding standards
Self-review of the code has been performed
Code has been commented, particularly in hard-to-understand areas
Unit tests have been added/updated
No new warnings or errors introduced
Changes are backward compatible

Related Issues

Fixes the operator crash issue when MCS discovery fails with "stale GroupVersion discovery" error.

Additional Notes

This fix ensures that MCS (Multi-Cluster Services) is treated as an optional feature. When MCS discovery fails for any reason, the operator gracefully marks MCS as unavailable and continues with normal MongoDB cluster operations. This makes the operator more resilient and prevents unnecessary pod restarts.
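
To illustrate what treating MCS as optional could mean for callers, here is a hedged sketch. The function `servicesToExport` and its parameters are hypothetical and do not come from the operator's code base; the point is only that export logic is skipped, rather than failing, when MCS is unavailable:

```go
package main

// servicesToExport returns the service names that should get a
// ServiceExport. It yields nothing when the user has not enabled MCS or
// when discovery marked MCS as unavailable, so reconciliation continues
// with ordinary in-cluster services only.
func servicesToExport(enabled, available bool, services []string) []string {
	if !enabled || !available {
		return nil
	}
	// Copy the slice so callers cannot mutate the input through the result.
	return append([]string(nil), services...)
}
```

With `available` set to `false`, the function returns an empty result for any input, which models the "continue with normal MongoDB cluster operations" path.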

egegunes · 2025-09-22T07:44:02Z

pkg/mcs/register.go

+		// MCS is optional functionality - if discovery fails for any reason,
+		// mark it as unavailable and continue without crashing the operator
+		available = false
+		return nil


I think we need to log the error and inform users that MCS is not available.
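
One way the suggestion above might look, as a sketch only. It uses the standard library `log` package for illustration; the operator's real logger, the helper name `handleDiscoveryError`, and the message text are all assumptions:

```go
package main

import (
	"bytes"
	"errors"
	"log"
)

// mcsAvailable stands in for the package-level flag from the diff.
var mcsAvailable = true

// handleDiscoveryError is a hypothetical helper: it records why discovery
// failed and tells the user that MCS is off, instead of failing silently.
func handleDiscoveryError(logger *log.Logger, err error) {
	mcsAvailable = false
	logger.Printf("MCS discovery failed, multi-cluster services are unavailable: %v", err)
}

// demoLogLine runs the helper against an in-memory logger and returns
// the text that would have been written to the operator log.
func demoLogLine() string {
	var buf bytes.Buffer
	handleDiscoveryError(log.New(&buf, "", 0), errors.New("stale GroupVersion discovery"))
	return buf.String()
}
```

The logged line keeps the underlying error text, so someone reading the operator log can tell why MCS was disabled.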

JNKPercona · 2025-11-14T14:56:21Z

Test Name	Result	Time
arbiter	passed	00:11:14
balancer	passed	00:17:47
cross-site-sharded	passed	00:18:08
custom-replset-name	passed	00:10:11
custom-tls	passed	00:13:48
custom-users-roles	passed	00:10:27
custom-users-roles-sharded	passed	00:11:28
data-at-rest-encryption	passed	00:12:31
data-sharded	passed	00:22:28
demand-backup	failure	00:06:03
demand-backup-eks-credentials-irsa	passed	00:00:06
demand-backup-fs	passed	00:22:53
demand-backup-if-unhealthy	passed	00:07:51
demand-backup-incremental	failure	00:05:16
demand-backup-incremental-sharded	failure	00:07:50
demand-backup-physical-parallel	failure	00:06:43
demand-backup-physical-aws	failure	00:05:35
demand-backup-physical-azure	passed	00:12:18
demand-backup-physical-gcp-s3	passed	00:11:55
demand-backup-physical-gcp-native	failure	00:58:22
demand-backup-physical-minio	passed	00:20:15
demand-backup-physical-sharded-parallel	failure	00:08:53
demand-backup-physical-sharded-aws	failure	00:07:09
demand-backup-physical-sharded-azure	passed	00:17:38
demand-backup-physical-sharded-gcp-native	failure	01:02:40
demand-backup-physical-sharded-minio	passed	00:17:35
demand-backup-sharded	failure	00:13:24
expose-sharded	passed	00:33:37
finalizer	passed	00:10:03
ignore-labels-annotations	passed	00:07:49
init-deploy	passed	00:13:07
ldap	passed	00:08:47
ldap-tls	passed	00:12:35
limits	passed	00:06:11
liveness	passed	00:08:35
mongod-major-upgrade	passed	00:13:08
mongod-major-upgrade-sharded	passed	00:20:59
monitoring-2-0	passed	00:24:48
monitoring-pmm3	passed	00:28:48
multi-cluster-service	passed	00:13:57
multi-storage	passed	00:18:25
non-voting-and-hidden	passed	00:16:58
one-pod	passed	00:08:04
operator-self-healing-chaos	passed	00:12:24
pitr	passed	00:31:43
pitr-physical	passed	01:02:11
pitr-sharded	passed	00:20:23
pitr-to-new-cluster	passed	00:24:31
pitr-physical-backup-source	passed	00:53:48
preinit-updates	passed	00:04:53
pvc-resize	passed	00:12:11
recover-no-primary	passed	00:27:08
replset-overrides	passed	00:16:27
rs-shard-migration	passed	00:13:46
scaling	passed	00:11:07
scheduled-backup	failure	00:06:51
security-context	passed	00:07:19
self-healing-chaos	passed	00:14:58
service-per-pod	passed	00:18:42
serviceless-external-nodes	passed	00:07:34
smart-update	passed	00:08:08
split-horizon	passed	00:07:44
stable-resource-version	passed	00:05:04
storage	passed	00:07:30
tls-issue-cert-manager	passed	00:29:13
upgrade	passed	00:09:09
upgrade-consistency	passed	00:06:36
upgrade-consistency-sharded-tls	passed	00:51:48
upgrade-sharded	passed	00:19:38
upgrade-partial-backup	passed	00:14:27
users	passed	00:17:17
version-service	passed	00:25:31

Summary	Value
Tests Run	72/72
Job Duration	04:00:46
Total Test Time	20:14:58

commit: cbb0b70
image: perconalab/percona-server-mongodb-operator:PR-2044-cbb0b708

pull-request-size bot added the size/M 30-99 lines label Sep 12, 2025

K8SPSMDB-1466 improve MCS error handling to prevent operator crashes

e77ba0f

hors force-pushed the K8SPSMDB-1466 branch from 9800cf2 to e77ba0f Compare September 12, 2025 17:46

Merge branch 'main' into K8SPSMDB-1466

1a2e24e

hors marked this pull request as ready for review September 16, 2025 10:26

hors requested review from egegunes, gkech, nmarukovich and pooknull as code owners September 16, 2025 10:26

egegunes requested changes Sep 22, 2025

add more logs :)

fa83617

hors requested a review from egegunes November 14, 2025 10:30

Merge branch 'main' into K8SPSMDB-1466

cbb0b70
