Avoid ServiceCIDR flapping on agent start #5017

Merged 1 commit on May 23, 2023

@tnqn tnqn commented May 22, 2023

The previous implementation always generated intermediate values for ServiceCIDR on agent start. This could interrupt Service traffic, and it made cleaning up stale routes difficult, because a value calculated at a single point in time may not reliably identify all stale routes.

This commit waits for the Service Informer to sync first, then calculates the ServiceCIDR from all Services. In most cases the Service route will then not change, which avoids the issues above.

It also fixes an issue where stale routes on Linux were not cleaned up correctly because of an incorrect check.

@tnqn tnqn requested a review from hongliangl May 22, 2023 08:46
@tnqn tnqn added this to the Antrea v1.12 release milestone May 22, 2023
@tnqn tnqn added the action/release-note label (indicates a PR that should be included in release notes) May 22, 2023
pkg/agent/servicecidr/discoverer.go (outdated)
return
}
svcs, _ := d.serviceLister.List(labels.Everything())
d.updateServiceCIDR(svcs...)
Contributor

@hongliangl hongliangl May 22, 2023

Here all existing Services are listed and a Service CIDR is calculated from them. How can we then clean up the stale Service routes? Previously, we used the first ClusterIP to collect all routes whose destination includes that ClusterIP.

Member Author

The previous cleanup logic is not reliable: there is no guarantee that the first ClusterIP is covered by the previous routes. Please check the latest code to see how stale routes are collected.

@tnqn tnqn force-pushed the stabilize-service-cidr branch 3 times, most recently from 569025d to 00ec216 May 22, 2023 11:27
Signed-off-by: Quan Tian <qtian@vmware.com>

tnqn commented May 22, 2023

/test-all
/test-ipv6-all
/test-ipv6-only-all

@tnqn tnqn requested a review from antoninbas May 22, 2023 12:07

func (d *Discoverer) updateServiceCIDR(svcs ...*corev1.Service) {
Contributor

It's not clear to me that the current implementation of updateServiceCIDR will always be correct if multiple workers / goroutines call it simultaneously.
That being said, it should not matter given that you start a single worker goroutine in Run.

Member Author

It may not work if it's called simultaneously. There is no need to have multiple workers for this task, as in most cases it just checks whether an IP is in a subnet (CIDR expansion can only happen a few times).


tnqn commented May 23, 2023

@hongliangl do you have other comments for this one?

@tnqn tnqn merged commit ae7f370 into antrea-io:main May 23, 2023
@tnqn tnqn deleted the stabilize-service-cidr branch May 23, 2023 03:31
@tnqn tnqn mentioned this pull request May 23, 2023
ceclinux pushed a commit to ceclinux/antrea that referenced this pull request Jun 5, 2023