
Improve Egress IP scheduling #4627

Merged · tnqn merged 1 commit into main on Feb 24, 2023

Conversation

@tnqn (Member) commented Feb 15, 2023

PR #4593 introduced maxEgressIPsPerNode to limit the number of Egress IPs that can be assigned to a Node. However, it used the EgressInformer cache to check whether a Node can accommodate new Egress IPs and did the calculation for different Egresses concurrently, which may cause inconsistent schedule results among agents. For instance:

When Nodes' capacity is 1 and two Egresses, e1 and e2, are created concurrently, different agents may process them in different orders, with different contexts:

  • agent a1 may process Egress e1 first and assign it to Node n1; it then processes Egress e2 and thinks it should be assigned to Node n2 by agent a2 because n1 is out of space.
  • agent a2 may process Egresses e1 and e2 faster, before either status is updated in the Egress API, and would think both Egresses should be assigned to Node n1 by agent a1.

As a result, Egress e2 will be left unassigned.

To fix the problem, the Egress IP scheduling should be deterministic across agents and time. This patch adds an egressIPScheduler, which takes the specs of Egresses and ExternalIPPools and the state of the memberlist cluster as inputs, and generates scheduling results deterministically.

According to the benchmark test, scheduling 1,000 Egresses among 1,000 Nodes once takes less than 3ms.
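
A minimal Go sketch of the idea, with illustrative names only (the actual implementation lives in pkg/agent/controller/egress/ip_scheduler.go and differs in detail): Egresses are sorted by creation timestamp, then each one takes the first Node in its deterministic Node ordering that still has spare capacity.

// Minimal sketch of deterministic Egress IP scheduling; all names here are
// illustrative assumptions, not Antrea's actual code.
package main

import (
	"fmt"
	"sort"
	"time"
)

type egress struct {
	name      string
	createdAt time.Time
}

// schedule assigns each Egress to the first Node in its Node ordering that
// still has spare capacity. Because the input is sorted by creation
// timestamp (with name as tie-breaker), every agent that sees the same set
// of Egresses computes the same result, regardless of event delivery order.
func schedule(egresses []egress, nodesFor func(egressName string) []string, capacity map[string]int) map[string]string {
	sort.Slice(egresses, func(i, j int) bool {
		if !egresses[i].createdAt.Equal(egresses[j].createdAt) {
			return egresses[i].createdAt.Before(egresses[j].createdAt)
		}
		return egresses[i].name < egresses[j].name
	})
	assigned := map[string]int{}
	result := map[string]string{}
	for _, e := range egresses {
		// nodesFor stands in for the consistent-hash ordering of the Nodes
		// selected by the Egress's ExternalIPPool.
		for _, node := range nodesFor(e.name) {
			if assigned[node] < capacity[node] {
				result[e.name] = node
				assigned[node]++
				break
			}
		}
	}
	return result
}

func main() {
	now := time.Now()
	nodesFor := func(string) []string { return []string{"n1", "n2"} }
	capacity := map[string]int{"n1": 1, "n2": 1}
	// e2 is listed first, but sorting by creation timestamp ensures e1 is
	// scheduled first on every agent: e1 -> n1, e2 overflows to n2.
	es := []egress{{"e2", now.Add(time.Second)}, {"e1", now}}
	fmt.Println(schedule(es, nodesFor, capacity)) // map[e1:n1 e2:n2]
}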

The PR also includes the following improvement:

A global max-egress-ips may not work well when the cluster consists of Nodes of different instance types. The PR adds support for a per-Node max-egress-ips annotation, with which Nodes can be configured with different capacities via their annotations. This also makes it possible to adjust a Node's capacity dynamically at runtime and to configure Node capacity post-deployment.
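
As a hedged sketch of how such an annotation could be consumed (the annotation key, package, and function name below are assumptions for illustration, not necessarily Antrea's exact ones):

// Illustrative per-Node capacity lookup with a fallback to the default.
package agent

import (
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

func maxEgressIPsForNode(node *corev1.Node, defaultMax int) int {
	v, ok := node.Annotations["node.antrea.io/max-egress-ips"] // assumed key
	if !ok {
		return defaultMax
	}
	n, err := strconv.Atoi(v)
	if err != nil || n < 0 {
		// Ignore invalid annotation values and fall back to the default.
		return defaultMax
	}
	return n
}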

@tnqn (Member, Author) commented Feb 15, 2023

/test-e2e

codecov bot commented Feb 15, 2023

Codecov Report

Merging #4627 (f721f61) into main (aafea18) will increase coverage by 0.98%.
The diff coverage is 81.81%.

❗ Current head f721f61 differs from pull request most recent head 6128e99. Consider uploading reports for the commit 6128e99 to get more accurate results


@@            Coverage Diff             @@
##             main    #4627      +/-   ##
==========================================
+ Coverage   68.65%   69.64%   +0.98%     
==========================================
  Files         402      403       +1     
  Lines       59570    58608     -962     
==========================================
- Hits        40900    40819      -81     
+ Misses      15847    14991     -856     
+ Partials     2823     2798      -25     
Flag                Coverage Δ                      Carryforward flag
e2e-tests           38.63% <69.47%> (+0.25%) ⬆️
integration-tests   34.35% <ø> (-0.05%) ⬇️          Carriedforward from 6128e99
kind-e2e-tests      46.64% <69.47%> (+7.09%) ⬆️     Carriedforward from 6128e99
unit-tests          59.90% <80.41%> (+0.10%) ⬆️     Carriedforward from 6128e99

*This pull request uses carry forward flags.

Impacted Files Coverage Δ
cmd/antrea-agent/agent.go 0.00% <0.00%> (ø)
pkg/agent/controller/egress/egress_controller.go 75.45% <74.41%> (-9.79%) ⬇️
pkg/agent/controller/egress/ip_scheduler.go 83.33% <83.33%> (ø)
pkg/agent/memberlist/cluster.go 78.50% <100.00%> (-0.84%) ⬇️
...gent/controller/noderoute/node_route_controller.go 62.56% <0.00%> (-5.72%) ⬇️
pkg/agent/wireguard/client_linux.go 77.07% <0.00%> (-4.46%) ⬇️
pkg/agent/flowexporter/exporter/exporter.go 70.96% <0.00%> (-4.04%) ⬇️
pkg/apiserver/storage/ram/watch.go 90.66% <0.00%> (-2.67%) ⬇️
...gent/controller/networkpolicy/status_controller.go 79.16% <0.00%> (-2.50%) ⬇️
... and 55 more

@tnqn tnqn force-pushed the fix-max-egress-ips branch 2 times, most recently from 45f6830 to a7597c4 Compare February 15, 2023 11:03
@tnqn tnqn added this to the Antrea v1.11 release milestone Feb 15, 2023
@tnqn tnqn marked this pull request as ready for review February 15, 2023 12:01
@tnqn tnqn force-pushed the fix-max-egress-ips branch 2 times, most recently from f7e8713 to 9e8b2b3 Compare February 15, 2023 14:05
newResults := map[string]*scheduleResult{}
nodeToIPs := map[string]sets.String{}
egresses, _ := s.egressLister.List(labels.Everything())
// Sort Egresses by creation timestamp to make the result deterministic and prioritize objects created earlier
Contributor:

Question: it is still possible that different agents get different lists for a period, right? In that case, could different agents decide on different IP assignments? Will that be corrected when all agents converge? Does it mean we may re-assign IPs?

@tnqn (Member, Author) replied Feb 16, 2023:

Yes, different agents may get different lists at a given moment. But the difference is likely to be the Egresses created most recently, because they have been sent to some agents but not yet to others.
As we sort Egresses by creation timestamp, Egresses created earlier will be prioritized and assigned to Nodes first, and their results won't be affected by the Egresses created most recently. For instance, say agent a1 receives Egresses {e1, e2, e3} and agent a2 receives Egresses {e1, e2}:

  • their schedule decisions about e1 and e2 will be the same;
  • if a1 schedules e3 to itself, it will configure e3 on its own interface; otherwise it does nothing;
  • when a2 receives e3, it should get the same result as a1, and configure e3 to its own interface or do nothing.

During the process no IP is re-assigned.

There could also be other cases causing different agents to get different lists, e.g. Egress delete/update events. However, one Egress's assignment can affect others only when Node capacity is reached. If the capacity is sufficient, each Egress is scheduled independently, and the consistent hash should guarantee that the Egresses are distributed evenly. So in most cases, when the number of Egresses is not greater than Nodes * maxEgressIPsPerNode, there should be no IP re-assigning.

In all cases, all agents get the same scheduling results and the correct IP assignment once their caches converge.
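
For illustration, a simplified rendezvous-hash stand-in for the per-Egress Node ordering (Antrea's real ordering comes from its memberlist-based consistent hash; everything below is an assumption-laden sketch) shows why each Egress's placement is independent of the others when capacity is sufficient: the ordering depends only on the Egress name and the Node set, never on other Egresses.

// Simplified stand-in for a consistent-hash Node ordering. Every agent
// derives the same ordering from the same Egress name and Node set.
package agent

import (
	"hash/fnv"
	"sort"
)

func nodeOrderFor(egressName string, nodes []string) []string {
	score := func(node string) uint32 {
		h := fnv.New32a()
		h.Write([]byte(egressName + "/" + node))
		return h.Sum32()
	}
	ordered := append([]string(nil), nodes...)
	// Highest score first; adding or removing an unrelated Egress cannot
	// change this list, only Node membership changes can.
	sort.Slice(ordered, func(i, j int) bool { return score(ordered[i]) > score(ordered[j]) })
	return ordered
}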

Contributor:

My read is that it is still possible for two agents to decide different IP-Node assignments when they get different Egress lists at Egress update/delete events? E.g. one gets {e1, e3}, one gets {e1, e2, e3}.

Good to add comments to describe the scenarios.

@tnqn (Member, Author):

Added comments to this function, PTAL

@tnqn tnqn force-pushed the fix-max-egress-ips branch 3 times, most recently from 6825c45 to 0a56063 Compare February 16, 2023 13:07
}

// addEgress processes Egress ADD events.
func (c *egressIPScheduler) addEgress(obj interface{}) {
Contributor:

Nit: receiver names are inconsistent. It would be better to rename all receivers to 's'?

@tnqn (Member, Author):

fixed, thanks

jianjuns previously approved these changes Feb 17, 2023

@jianjuns (Contributor) left a comment:

LGTM

@tnqn tnqn force-pushed the fix-max-egress-ips branch 2 times, most recently from 814727a to 7337199 Compare February 23, 2023 15:01
@tnqn tnqn changed the title Fix Egress IP scheduling Improve Egress IP scheduling Feb 23, 2023
@jianjuns (Contributor) left a comment:

In the commit message:

the cluster consists different instance types of Nodes

consists -> consists of

@@ -24,6 +24,9 @@ const (
// NodeWireGuardPublicAnnotationKey represents the key of the Node's WireGuard public key in the Annotations of the Node.
NodeWireGuardPublicAnnotationKey string = "node.antrea.io/wireguard-public-key"

// NodeMaxEgressIPsAnnotationKey represents the key of the maximum number of Egress IPs in the Annotations of the Node.
Contributor:

How about "the key of maximum Egress IP number"?

@tnqn (Member, Author):

updated

s.nodeToMaxEgressIPsMutex.Lock()
defer s.nodeToMaxEgressIPsMutex.Unlock()

oldMaxEgressIPs, exists := s.nodeToMaxEgressIPs[nodeName]
Contributor:

If it does not exist, should we return false if the value equals the global value?

@tnqn (Member, Author):

Yes, and I skipped inserting it into the cache to avoid an extra check when deleting the cached value (as there is no need to reschedule when deleting the same value).
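
A sketch of that behavior (type and names illustrative, not the exact ip_scheduler.go code): the update reports whether the Node's effective capacity actually changed, and a value equal to the default is never cached.

// Illustrative sketch: report whether a Node's effective capacity changed,
// so the caller knows whether rescheduling is needed.
package agent

import "sync"

type scheduler struct {
	mu                  sync.Mutex
	defaultMaxEgressIPs int
	nodeToMaxEgressIPs  map[string]int
}

func (s *scheduler) updateMaxEgressIPs(nodeName string, newMax int) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	old, exists := s.nodeToMaxEgressIPs[nodeName]
	if !exists {
		// A value equal to the default changes nothing; skipping the insert
		// also means a later delete of the same value needs no extra check.
		if newMax == s.defaultMaxEgressIPs {
			return false
		}
	} else if old == newMax {
		return false
	}
	s.nodeToMaxEgressIPs[nodeName] = newMax
	return true
}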

@@ -67,18 +71,23 @@ type egressIPScheduler struct {
// eventHandlers is the registered callbacks.
eventHandlers []scheduleEventHandler

// The maximum number of Egress IPs a Node can accommodate.
// The global maximum number of Egress IPs a Node can accommodate.
Contributor:

global -> default?

@tnqn (Member, Author):

updated

}
}
if s.deleteMaxEgressIPsByNode(node.Name) {
s.queue.Add(workItem)
Contributor:

If we need to trigger rescheduling at Node deletion, shouldn't that have been needed even before this commit, when there was no per-Node annotation?

@tnqn (Member, Author):

Removed it; there is no need to trigger it here because, if the Node is selected by any pool, ClusterEventHandler will trigger rescheduling.

PR antrea-io#4593 introduced maxEgressIPsPerNode to limit the number of Egress IPs
that can be assigned to a Node. However, it used the EgressInformer cache
to check whether a Node can accommodate new Egress IPs and did the
calculation for different Egresses concurrently, which may cause
inconsistent schedule results among agents. For instance:

When Nodes' capacity is 1 and two Egresses, e1 and e2, are created
concurrently, different agents may process them in different orders, with
different contexts:

- agent a1 may process Egress e1 first and assign it to Node n1; it then
  processes Egress e2 and thinks it should be assigned to Node n2 by agent
  a2 because n1 is out of space.
- agent a2 may process Egress e1 and e2 faster, before any of their
  status is updated in Egress API, and would think both Egresses should
  be assigned to Node n1 by agent a1.

As a result, Egress e2 will be left unassigned.

To fix the problem, the Egress IP scheduling should be deterministic
across agents and time. This patch adds an egressIPScheduler, which
takes the specs of Egresses and ExternalIPPools and the state of the
memberlist cluster as inputs, and generates scheduling results
deterministically.

According to the benchmark test, scheduling 1,000 Egresses among 1,000
Nodes once takes less than 3ms.

The PR also adds support for a per-Node max-egress-ips annotation, with
which Nodes can be configured with different capacities via their
annotations. It also makes dynamically adjusting a Node's capacity at
runtime and configuring Node capacity post-deployment possible.

Signed-off-by: Quan Tian <qtian@vmware.com>
@tnqn (Member, Author) commented Feb 24, 2023

/test-all

@tnqn (Member, Author) commented Feb 24, 2023

/skip-e2e which failed on a known flaky case

@tnqn tnqn merged commit d5dd02e into antrea-io:main Feb 24, 2023
@tnqn tnqn deleted the fix-max-egress-ips branch February 24, 2023 07:16
@tnqn tnqn added the kind/bug Categorizes issue or PR as related to a bug. label Jan 23, 2024