Fix ClusterInfo type ResourceExport recreation bug #4412

luolanzone · 2022-11-24T12:39:23Z

After Gateway HA is enabled, the ClusterInfo type of ResourceExport is recreated whenever the active Gateway changes. However, creating the new ClusterInfo ResourceExport may fail when the leader controller is slow and the existing ResourceExport has not been deleted in time.
The root cause is that the Gateway creation event updates the existing ClusterInfo ResourceExport instead of creating a new one, because the controller can still get the existing object from the cache while its DeletionTimestamp is non-zero. After the update, Kubernetes quickly removes the ResourceExport because it is already marked for deletion. From the Gateway controller's perspective, the Gateway change has been reflected in the ResourceExport, but the object is in fact deleted shortly afterwards.
Fix the issue by retrying when the existing ResourceExport's DeletionTimestamp is not zero.

Signed-off-by: Lan Luo <luola@vmware.com>

luolanzone · 2022-11-24T12:40:16Z

/test-multicluster-e2e

codecov · 2022-11-24T13:00:04Z

Codecov Report

Merging #4412 (407ecda) into main (b977b1d) will decrease coverage by 0.02%.
The diff coverage is 40.00%.

@@            Coverage Diff             @@
##             main    #4412      +/-   ##
==========================================
- Coverage   65.56%   65.54%   -0.03%     
==========================================
  Files         400      412      +12     
  Lines       56847    56941      +94     
==========================================
+ Hits        37270    37320      +50     
- Misses      16882    16941      +59     
+ Partials     2695     2680      -15

Flag	Coverage Δ
e2e-tests	`61.57% <40.00%> (?)`
integration-tests	`34.54% <0.00%> (+0.04%)`	⬆️
kind-e2e-tests	`47.45% <ø> (-0.67%)`	⬇️
unit-tests	`49.81% <40.00%> (+<0.01%)`	⬆️

Impacted Files	Coverage Δ
...ter/controllers/multicluster/gateway_controller.go	`84.09% <40.00%> (+14.23%)`	⬆️
pkg/agent/cniserver/ipam/antrea_ipam.go	`52.81% <0.00%> (-22.95%)`	⬇️
pkg/agent/controller/networkpolicy/reject.go	`63.05% <0.00%> (-12.81%)`	⬇️
pkg/agent/openflow/multicast.go	`19.88% <0.00%> (-8.25%)`	⬇️
pkg/agent/types/networkpolicy.go	`89.74% <0.00%> (-4.86%)`	⬇️
pkg/agent/cniserver/server.go	`74.94% <0.00%> (-4.05%)`	⬇️
pkg/controller/networkpolicy/store/addressgroup.go	`88.37% <0.00%> (-3.49%)`	⬇️
...agent/flowexporter/connections/deny_connections.go	`87.09% <0.00%> (-3.23%)`	⬇️
pkg/agent/openflow/service.go	`88.54% <0.00%> (-3.13%)`	⬇️
pkg/agent/openflow/pipeline.go	`82.11% <0.00%> (-2.76%)`	⬇️
... and 54 more

multicluster/controllers/multicluster/gateway_controller_test.go

jianjuns · 2022-11-26T01:00:22Z

multicluster/controllers/multicluster/gateway_controller.go

@@ -113,6 +113,11 @@ func (r *GatewayReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ct
 			}
 			return nil
 		}
+
+		if !existingResExport.DeletionTimestamp.IsZero() {
+			return fmt.Errorf("existing ResourceExport is being deleted, retry later")


Question: if we do not add this check, will updateResourceExport() fail? If so, what is the difference between returning an error here and letting updateResourceExport() fail and return an error at line 122?

Have you checked how long it takes until the new active GW is finally recreated when the race condition happens and we have to retry here?

updateResourceExport() will not fail, because updates are allowed while DeletionTimestamp is set. The updated object is then removed by K8s because its DeletionTimestamp is not zero.
The new active Gateway is created immediately when the old Gateway is removed or goes down. The corresponding ClusterInfo ResourceExport will also be created almost immediately, because it usually takes only one retry unless the leader controller is down.

In this case, could there be another race condition where the check passes here, but DeletionTimestamp is set before or after the update is done?

No, the Gateway controller has only one goroutine handling Gateway changes, since Gateway events rarely happen.

But you get the ResourceExport from the client cache, right? Could the state be stale?

When the Gateway changes, should we create the ResourceExport rather than update it?

Disabling the client cache won't resolve it, as the object in the apiserver may be updated after the API is called.
This situation is very common, and a typical way to resolve it is: use the object in the cache without worrying about whether it is stale; update the object with the resourceVersion taken from the cache (the clientset does this automatically as long as the object has a resourceVersion); if the object is stale, the apiserver rejects the update request with an update conflict error, and the client should then retrieve the latest version directly from the apiserver and retry. Example: https://github.com/antrea-io/antrea/blob/main/pkg/agent/controller/egress/egress_controller.go#L561

But I wonder, even if it updates the ResourceExport successfully, could the other slow leader delete it because it takes the object for the one it is meant to delete? If so, the deletion logic should check whether the object matches and set resourceVersion when sending the delete request.
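The update-with-resourceVersion pattern described in the comment above can be sketched with a small in-memory simulation. This is only an illustration under stated assumptions: `Store`, `Object`, `ErrConflict`, and `UpdateWithRetry` are hypothetical names standing in for the API server, a ResourceExport, and the update conflict error. They are not Antrea or client-go APIs. Real controllers would more likely rely on the clientset and a helper such as client-go's `retry.RetryOnConflict`.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConflict is a hypothetical stand-in for the API server's update conflict error.
var ErrConflict = errors.New("conflict: resourceVersion is stale")

// Object is a hypothetical, stripped-down stand-in for a ResourceExport.
type Object struct {
	ResourceVersion int
	Spec            string
}

// Store simulates the API server, which holds the authoritative copy.
type Store struct {
	current Object
}

// Get returns the latest object directly from the simulated server.
func (s *Store) Get() Object { return s.current }

// Update succeeds only if the caller's resourceVersion matches the stored one.
func (s *Store) Update(obj Object) (Object, error) {
	if obj.ResourceVersion != s.current.ResourceVersion {
		return Object{}, ErrConflict
	}
	obj.ResourceVersion++
	s.current = obj
	return obj, nil
}

// UpdateWithRetry tries the (possibly stale) cached copy first. On a conflict
// it re-reads the latest object from the store and retries, up to maxAttempts.
// It returns the updated object and the number of attempts made.
func UpdateWithRetry(s *Store, cached Object, newSpec string, maxAttempts int) (Object, int, error) {
	obj := cached
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		obj.Spec = newSpec
		updated, err := s.Update(obj)
		if err == nil {
			return updated, attempt, nil
		}
		if !errors.Is(err, ErrConflict) {
			return Object{}, attempt, err
		}
		// The cached copy was stale: fetch the latest version and try again.
		obj = s.Get()
	}
	return Object{}, maxAttempts, ErrConflict
}

func main() {
	store := &Store{current: Object{ResourceVersion: 2, Spec: "gateway-a"}}
	stale := Object{ResourceVersion: 1, Spec: "gateway-a"} // cache lags behind the server
	updated, attempts, err := UpdateWithRetry(store, stale, "gateway-b", 3)
	fmt.Println(updated.Spec, updated.ResourceVersion, attempts, err)
}
```

The point of the design is that the cache never has to be trusted: a stale read costs one rejected request and one extra round trip, not a lost update.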

@tnqn thanks for the suggestion. I am not sure I understood your last question about another slow leader. There is only one leader controller in a ClusterSet; each member cluster creates its own ClusterInfo ResourceExport, which maps 1-1 to a ResourceImport. The leader controller deletes the corresponding ClusterInfo ResourceExport only when the member cluster submits a deletion request. But I think we can set resourceVersion on the delete request, since it should have no side effects.
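For illustration, the idea of setting resourceVersion on a delete request can be sketched as a conditional delete against an in-memory store. Everything here is hypothetical (`Store`, `Export`, the object name, the error values). In Kubernetes the equivalent mechanism is a precondition carried in the delete options, and this sketch only models its effect.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	// ErrNotFound and ErrPrecondition are hypothetical stand-ins for API errors.
	ErrNotFound     = errors.New("not found")
	ErrPrecondition = errors.New("precondition failed: resourceVersion mismatch")
)

// Export is a hypothetical, minimal stand-in for a ResourceExport.
type Export struct {
	Name            string
	ResourceVersion string
}

// Store simulates the API server's view of ResourceExports by name.
type Store struct {
	items map[string]Export
}

// Delete removes the named object. When expectedRV is non-empty, the delete
// is rejected unless the stored resourceVersion matches, so a caller that
// acts on an outdated view cannot remove a newer object.
func (s *Store) Delete(name, expectedRV string) error {
	cur, ok := s.items[name]
	if !ok {
		return ErrNotFound
	}
	if expectedRV != "" && cur.ResourceVersion != expectedRV {
		return ErrPrecondition
	}
	delete(s.items, name)
	return nil
}

func main() {
	s := &Store{items: map[string]Export{
		"member-a-clusterinfo": {Name: "member-a-clusterinfo", ResourceVersion: "7"},
	}}
	fmt.Println(s.Delete("member-a-clusterinfo", "5")) // outdated view: rejected
	fmt.Println(s.Delete("member-a-clusterinfo", "7")) // matching view: deleted
}
```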

@jianjuns I have refined the code to create the ResourceExport when DeletionTimestamp != 0. After checking the existing code, I found that it already sets resourceVersion when updating an existing ResourceExport, so the update should fail on a conflict and be retried. To avoid acting on a stale cache, I removed the spec comparison step and let it update unconditionally. For the deletion, I feel it's OK to omit resourceVersion, since the Gateway webhook guarantees that only one Gateway is allowed in a member cluster, so all events should happen in sequence.

Do we still keep the resourceVersion in the update or not? If yes, maybe add a comment about that before the update call.

@tnqn : the controller has only a single worker, so there is no synchronization issue or reordering.

@jianjuns yes, updateResourceExport reuses the existing ResourceExport's resourceVersion for the update. I have added a comment for this.
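The revised approach discussed in this thread (create when there is no usable existing object, otherwise update while carrying the existing resourceVersion) can be summarized as a small decision function. This is a reading of the review comments, not the merged code: `ExportMeta`, `Action`, `DecideAction`, and `MarkTerminating` are hypothetical names, and the real controller works with Kubernetes object metadata rather than this simplified struct.

```go
package main

import (
	"fmt"
	"time"
)

// Action names the operation the reconciler would attempt next.
type Action string

const (
	ActionCreate Action = "create"
	ActionUpdate Action = "update"
)

// ExportMeta is a hypothetical struct mirroring only the metadata fields
// that the decision depends on.
type ExportMeta struct {
	ResourceVersion   string
	DeletionTimestamp *time.Time
}

// MarkTerminating returns a copy of m with a deletion timestamp set,
// simulating an object whose deletion has been requested.
func MarkTerminating(m ExportMeta) ExportMeta {
	now := time.Now()
	m.DeletionTimestamp = &now
	return m
}

// DecideAction returns ActionCreate when no object exists or the existing
// one is being deleted, and ActionUpdate otherwise. An update is expected
// to carry existing.ResourceVersion so a stale copy is rejected.
func DecideAction(existing *ExportMeta) Action {
	if existing == nil {
		return ActionCreate
	}
	if existing.DeletionTimestamp != nil && !existing.DeletionTimestamp.IsZero() {
		return ActionCreate
	}
	return ActionUpdate
}

func main() {
	live := ExportMeta{ResourceVersion: "12"}
	terminating := MarkTerminating(live)

	fmt.Println(DecideAction(nil))
	fmt.Println(DecideAction(&terminating))
	fmt.Println(DecideAction(&live))
}
```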

luolanzone · 2022-12-05T03:07:34Z

/test-multicluster-e2e

jianjuns · 2022-12-05T17:18:42Z

/skip-all

luolanzone added area/multi-cluster Issues or PRs related to multi cluster. action/backport Indicates a PR that requires backports. labels Nov 24, 2022

luolanzone requested review from jianjuns and tnqn November 25, 2022 07:41

jianjuns reviewed Nov 26, 2022

luolanzone force-pushed the fix-clusterinfo-recreate-bug branch 3 times, most recently from 950ad3e to 9ffb265 Compare December 2, 2022 07:19

jianjuns previously approved these changes Dec 2, 2022

luolanzone dismissed jianjuns’s stale review via 519b8a5 December 5, 2022 01:25

luolanzone force-pushed the fix-clusterinfo-recreate-bug branch from 9ffb265 to 519b8a5 Compare December 5, 2022 01:25

jianjuns reviewed Dec 5, 2022

luolanzone force-pushed the fix-clusterinfo-recreate-bug branch from 519b8a5 to 407ecda Compare December 5, 2022 02:00

jianjuns approved these changes Dec 5, 2022

tnqn added this to the Antrea v1.10 release milestone Dec 5, 2022

jianjuns merged commit 330a7de into antrea-io:main Dec 5, 2022

luolanzone mentioned this pull request Dec 6, 2022

Automated cherry pick of #4412: Fix ClusterInfo type ResourceExport recreation bug #4442

Merged

luolanzone deleted the fix-clusterinfo-recreate-bug branch December 8, 2022 14:15
