
Fix race conditions in NetworkPolicyController #4028

Merged 1 commit into antrea-io:main on Aug 16, 2022

Conversation

@tnqn (Member) commented Jul 18, 2022

There were a few race conditions in NetworkPolicyController:

  • An AppliedToGroup or AddressGroup in use may be removed if a situation
    like this happens:
  1. addANP creates a group for ANP A;
  2. addNetworkPolicy reuses the group for KNP B and is about to create an
     internal NetworkPolicy;
  3. deleteANP deletes the group for ANP A because at that moment no other
     internal NetworkPolicies are using the group;
  4. addNetworkPolicy commits the internal NetworkPolicy for KNP B to
     storage, but the group no longer exists.
  • An Antrea-native NetworkPolicy may become out-of-date if a situation
    like this happens:
  1. An ACNP event is received; updateCNP calculates the new internal
     NetworkPolicy for the ACNP and is about to commit it to storage;
  2. A ClusterGroup event triggers an update of the ACNP via
     triggerCNPUpdates;
  3. triggerCNPUpdates calls reprocessCNP, which updates the internal
     NetworkPolicy for the ACNP and commits it to storage;
  4. updateCNP from the first step commits its internal NetworkPolicy to
     storage, which overrides the update from the ClusterGroup event.

The second one caused flakes in the test case
"TestGroupNoK8sNP/Case=ACNPNestedClusterGroup".

To resolve the race conditions completely and make NetworkPolicy
handling less error-prone, this patch refactors NetworkPolicyController:

  • Event handlers no longer update the storage of internal NetworkPolicies
    directly; they only trigger a resync of the affected policies, which
    ensures that at most one worker handles a given internal NetworkPolicy
    at any moment (see the sketch after this description).
  • Updating an internal NetworkPolicy and creating or deleting its
    AddressGroups and AppliedToGroups is now atomic.

Duplicate code and tests are deleted with the refactoring.

Signed-off-by: Quan Tian qtian@vmware.com

Fixes #4127
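
For illustration, the first bullet above follows the standard Kubernetes controller pattern: event handlers only enqueue a policy key, and a worker pool performs the full sync, with the workqueue deduplicating keys and guaranteeing that a key is processed by at most one worker at a time. Below is a minimal sketch of that pattern using client-go's workqueue; the types and helper bodies are illustrative, not the actual Antrea code.

package sketch

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

// PolicyKey identifies an internal NetworkPolicy. It must be a comparable
// value type so the workqueue can deduplicate items (see the review thread
// below about value vs. pointer keys).
type PolicyKey struct {
	Type, Namespace, Name string
}

// controller is an illustrative skeleton; the queue would be created with
// workqueue.New().
type controller struct {
	queue workqueue.Interface
}

// Event handlers only enqueue a key; they never write to the stores.
// Duplicate keys are collapsed while the item is pending.
func (c *controller) onPolicyEvent(key PolicyKey) {
	c.queue.Add(key)
}

// worker drains the queue. The workqueue guarantees that a key is handled by
// at most one worker at a time, so syncInternalNetworkPolicy is the single
// writer for that policy.
func (c *controller) worker() {
	for {
		item, shutdown := c.queue.Get()
		if shutdown {
			return
		}
		c.syncInternalNetworkPolicy(item.(PolicyKey))
		c.queue.Done(item)
	}
}

func (c *controller) syncInternalNetworkPolicy(key PolicyKey) {
	// Placeholder for the real work: calculate the policy's spec and span,
	// then commit it together with its groups atomically.
	fmt.Println("syncing", key)
}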

codecov bot commented Jul 18, 2022

Codecov Report

Merging #4028 (d483bf7) into main (cab72fc) will increase coverage by 1.87%.
The diff coverage is 92.47%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4028      +/-   ##
==========================================
+ Coverage   63.65%   65.52%   +1.87%     
==========================================
  Files         300      304       +4     
  Lines       46567    46604      +37     
==========================================
+ Hits        29640    30538     +898     
+ Misses      14596    13658     -938     
- Partials     2331     2408      +77     
Flag Coverage Δ
integration-tests 34.98% <ø> (+0.02%) ⬆️
kind-e2e-tests 48.76% <50.75%> (+2.29%) ⬆️
unit-tests 44.38% <90.35%> (+0.42%) ⬆️
Impacted Files Coverage Δ
...kg/controller/networkpolicy/store/networkpolicy.go 81.60% <ø> (+4.52%) ⬆️
pkg/controller/networkpolicy/group.go 41.66% <50.00%> (+6.80%) ⬆️
pkg/controller/networkpolicy/clustergroup.go 75.55% <66.66%> (-0.69%) ⬇️
...g/controller/networkpolicy/clusternetworkpolicy.go 75.11% <86.66%> (+7.02%) ⬆️
...kg/controller/networkpolicy/antreanetworkpolicy.go 74.54% <91.22%> (+2.70%) ⬆️
...ntroller/networkpolicy/networkpolicy_controller.go 82.58% <97.26%> (+3.85%) ⬆️
pkg/controller/networkpolicy/crd_utils.go 89.71% <100.00%> (-1.33%) ⬇️
pkg/controller/types/networkpolicy.go 100.00% <100.00%> (ø)
pkg/ipam/poolallocator/allocator.go 49.76% <0.00%> (-5.96%) ⬇️
... and 69 more

@tnqn tnqn force-pushed the fix-acnp branch 2 times, most recently from 6855b40 to 30ff4d7 on July 18, 2022 16:32
@tnqn (Member Author) commented Jul 18, 2022

/test-all

@tnqn (Member Author) commented Jul 18, 2022

/test-all

@tnqn tnqn marked this pull request as ready for review July 19, 2022 02:13
@antoninbas (Contributor) left a comment

I am not super familiar with this code at this stage, but I took a stab at a review

@@ -24,19 +24,22 @@ import (
antreatypes "antrea.io/antrea/pkg/controller/types"
)

func getAntreaNetworkPolicyReference(anp *crdv1alpha1.NetworkPolicy) controlplane.NetworkPolicyReference {
Contributor

I'm curious about why we are not returning a pointer here (*controlplane.NetworkPolicyReference)?
Ultimately we use it as key for internalNetworkPolicyQueue. I don't know if this causes unnecessary copies, or if on the contrary less indirection is better.

Member Author

internalNetworkPolicyQueue must use a value instead of a pointer as the key, otherwise the same NetworkPolicy would not be treated as the same item because the pointers may differ. But other places can use pointers to save some copying overhead. I have updated them.
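
A toy example of the difference (illustrative, not Antrea code): pointers to equal structs are distinct map keys, so a queue keyed by pointers would not collapse duplicate events.

package main

import "fmt"

type networkPolicyReference struct {
	Type, Namespace, Name string
}

func main() {
	a := &networkPolicyReference{"K8sNetworkPolicy", "default", "np-a"}
	b := &networkPolicyReference{"K8sNetworkPolicy", "default", "np-a"}

	byPointer := map[*networkPolicyReference]bool{a: true, b: true}
	byValue := map[networkPolicyReference]bool{*a: true, *b: true}

	fmt.Println(len(byPointer)) // 2: the same policy would be queued twice
	fmt.Println(len(byValue))   // 1: duplicates collapse, as the queue requires
}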

@@ -132,7 +88,7 @@ func (n *NetworkPolicyController) processAntreaNetworkPolicy(np *crdv1alpha1.Net
 	// Create AppliedToGroup for each AppliedTo present in AntreaNetworkPolicy spec.
 	for _, at := range np.Spec.AppliedTo {
 		appliedToGroupNamesSet.Insert(n.createAppliedToGroup(
-			np.Namespace, at.PodSelector, at.NamespaceSelector, at.ExternalEntitySelector))
+			networkPolicyRef, np.Namespace, at.PodSelector, at.NamespaceSelector, at.ExternalEntitySelector))
Contributor

nit: maybe we could remove the second parameter of createAppliedToGroup if the namespace name can be easily retrieved from networkPolicyRef?

Member Author

Made some changes to how references are tracked. Now only the policy UID is required for the reference, so the namespace is still passed.

Member Author

Changed to ensure atomicity when updating the internal NetworkPolicy and creating or deleting AddressGroups and AppliedToGroups; no extra argument is needed.

Comment on lines 269 to 271
	_, exists, _ := c.appliedToGroupStore.Get(cg)
	if exists {
		c.enqueueAppliedToGroup(cg)
Contributor

is it possible for the AppliedToGroup to be deleted between the call to get and the call to enqueue? I assume it is possible and that it's not an issue, but a comment to that effect would be nice.

Member Author

added

Comment on lines 922 to 1016
key, _ := npc.internalNetworkPolicyQueue.Get()
npc.internalNetworkPolicyQueue.Done(key)
Contributor

maybe for the clarity of the test, we should check that the keys match {getAntreaClusterNetworkPolicyReference(cnp1), getAntreaClusterNetworkPolicyReference(cnp2)}?

Member Author

added

// cnp2 is added after the ClusterGroup.
npc.addCNP(cnp2)

// cnp1 is added before the ClusterGroup. The rule's From should be empty as the ClusterGroup hasn't been synced,
Contributor

nit: I find the "added" terminology a bit confusing here, since we don't call the add handler anymore. Maybe just "synced"?

Member Author

fixed

_, npc := newController()
cnp := getCNP()
newCNP := cnp.DeepCopy()
newCNP.Spec.Priority = float64(100)
Contributor

could you add a comment explaining why we set the priority to this value here?

Member Author

added

_, npc := newController()
anp := getANP()
newANP := anp.DeepCopy()
newANP.Spec.Priority = float64(1)
Contributor

could you add a comment explaining why we set the priority to this value here?

Member Author

added

@@ -428,7 +428,7 @@ func TestToAntreaPeerForCRD(t *testing.T) {
 	_, npc := newController()
 	npc.addClusterGroup(&cgA)
 	npc.cgStore.Add(&cgA)
-	actualPeer := npc.toAntreaPeerForCRD(tt.inPeers, testCNPObj, tt.direction, tt.namedPortExists)
+	actualPeer := npc.toAntreaPeerForCRD(getAntreaClusterNetworkPolicyReference(testCNPObj), tt.inPeers, testCNPObj, tt.direction, tt.namedPortExists)
Contributor

I just thought about this: could we use getACNPReference / getANPReference as function names? It gets pretty long to read and we already use the ACNP / ANP abbreviations in other function names

Member Author

updated

// addressGroupReferences tracks the reference count of policies for AddressGroups.
addressGroupReferences map[string]networkPolicyReferences
// appliedToGroupMutex prevents race conditions between multiple internalNetworkPolicyWorkers when they create or
// delete same AppliedToGroups and ensures atomicity of updating appliedToGroupStore and appliedToGroupReferences.
Contributor

delete the same

Member Author

fixed

// delete same AppliedToGroups and ensures atomicity of updating appliedToGroupStore and appliedToGroupReferences.
appliedToGroupMutex sync.Mutex
// addressGroupMutex prevents race conditions between multiple internalNetworkPolicyWorkers when they create or
// delete same AddressGroups and ensures atomicity of updating addressGroupStore and addressGroupReferences.
Contributor

delete the same

there is also an extra whitespace

Member Author

fixed

@tnqn tnqn added this to the Antrea v1.8 release milestone Jul 28, 2022
@tnqn (Member Author) commented Aug 8, 2022

/skip-integration tested manually

@qiyueyao (Contributor) commented Aug 9, 2022

Regarding PR #4047: it updated the documentation in the enableLogging section to say users should avoid annotation updates, to prevent a NetworkPolicy from working on a stale state. That note should be deleted after this PR fixes the race conditions. Thanks!

@tnqn (Member Author) commented Aug 11, 2022

/test-all

	for _, p := range parentGroupObjs {
		parentGrp := p.(*antreatypes.Group)
		c.enqueueInternalGroup(parentGrp.SourceReference.ToGroupName())
	}
}

// triggerDerivedGroupUpdates triggers processing of AppliedToGroup and AddressGroup derived from the provided group.
func (c *NetworkPolicyController) triggerDerivedGroupUpdates(grp string) {
Member Author

This replaces the previous group enqueue operations in reprocessCNP, which blindly enqueued all AppliedToGroups and AddressGroups.

@tnqn (Member Author) commented Aug 14, 2022

/test-all

@tnqn tnqn force-pushed the fix-acnp branch 3 times, most recently from 80326a0 to 4844c6e on August 15, 2022 05:27
@tnqn (Member Author) commented Aug 15, 2022

/test-all
/test-ipv6-all
/test-ipv6-only-all
/test-windows-all

@tnqn tnqn marked this pull request as ready for review August 15, 2022 05:27
@tnqn (Member Author) commented Aug 15, 2022

/test-all

antoninbas previously approved these changes Aug 15, 2022

@antoninbas (Contributor) left a comment

LGTM, I think @GraysonWu should take another look as well

Comment on lines +870 to +871
// It must use value instead of pointer as the key, otherwise the same NetworkPolicies will not be treated as same
// item because the pointers may be different.
Contributor

thanks for adding this comment

// in case of ADD event or modified and store the updated instance, in case
// of an UPDATE event.
func (n *NetworkPolicyController) processAntreaNetworkPolicy(np *crdv1alpha1.NetworkPolicy) *antreatypes.NetworkPolicy {
// instance to the caller wherein.
Contributor

remove "wherein", which is no longer valid

Member Author

removed

GraysonWu previously approved these changes Aug 16, 2022

@GraysonWu (Contributor) left a comment

LGTM. Just curious: is there a way to "test" a race condition?

@tnqn (Member Author) commented Aug 16, 2022

LGTM. Just curious: is there a way to "test" a race condition?

Good question. I added a unit test, TestSyncInternalNetworkPolicyConcurrently, to verify that one of the race conditions is resolved. Without the lock n.internalNetworkPolicyMutex.Lock(), the test had a chance to fail:

# go test -run TestSyncInternalNetworkPolicyConcurrently  -count 500 -race ./pkg/controller/networkpolicy/
--- FAIL: TestSyncInternalNetworkPolicyConcurrently (0.01s)
    networkpolicy_controller_test.go:3155:
                Error Trace:    networkpolicy_controller_test.go:3155
                                                        networkpolicy_controller_test.go:3130
                Error:          "[]" should have 1 item(s), but has 0
                Test:           TestSyncInternalNetworkPolicyConcurrently
    networkpolicy_controller_test.go:3158:
                Error Trace:    networkpolicy_controller_test.go:3158
                                                        networkpolicy_controller_test.go:3130
                Error:          Should be true
                Test:           TestSyncInternalNetworkPolicyConcurrently
FAIL
FAIL    antrea.io/antrea/pkg/controller/networkpolicy   5.405s
FAIL

Although before the refactoring the groups were created and deleted not in syncInternalNetworkPolicy but in addNetworkPolicy, addANP, addCNP, updateNetworkPolicy, etc., those handlers could execute concurrently because they handle events of different resources, so the same race condition as above could happen.
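
For reference, a self-contained toy version of this kind of test is sketched below. It models the group reference counting rather than the actual controller; removing the mutex gives `go test -race` a chance to fail in the same way.

package refcount

import (
	"fmt"
	"sync"
	"testing"
)

// groupStore is a toy reference-counted group store: a group is deleted once
// no policy references it, so the count updates must be atomic.
type groupStore struct {
	mu   sync.Mutex // removing this serialization lets -race flag the bug
	refs map[string]int
}

func (s *groupStore) addRef(group string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.refs[group]++
}

func (s *groupStore) delRef(group string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.refs[group]--; s.refs[group] <= 0 {
		delete(s.refs, group) // orphaned: no policy uses the group anymore
	}
}

func TestConcurrentGroupRefs(t *testing.T) {
	s := &groupStore{refs: map[string]int{}}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			g := fmt.Sprintf("group-%d", i%10) // 10 groups shared by many "policies"
			s.addRef(g)
			s.delRef(g)
		}(i)
	}
	wg.Wait()
	if len(s.refs) != 0 {
		t.Fatalf("expected no groups left, got %v", s.refs)
	}
}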

@tnqn (Member Author) commented Aug 16, 2022

/skip-all the latest change only added a unit test and updated a comment.

@tnqn (Member Author) commented Aug 16, 2022

@antoninbas @GraysonWu could you give another approval?

return "", nil
// Internal Group is not found, which means the corresponding namespaced group is either not created yet or not
// processed yet. Once the internal Group is created and processed, the sync worker for internal group will
// re-enqueue the ClusterNetworkPolicy processing which will trigger the creation of AppliedToGroup.
Contributor

Suggested change:
-// re-enqueue the ClusterNetworkPolicy processing which will trigger the creation of AppliedToGroup.
+// re-enqueue the AntreaNetworkPolicy processing which will trigger the creation of AppliedToGroup.

Member Author

Thanks for pointing it out, will fix it in another PR.

@tnqn (Member Author) commented Aug 16, 2022

/skip-all

@tnqn tnqn merged commit d1c6a43 into antrea-io:main Aug 16, 2022
@tnqn tnqn deleted the fix-acnp branch August 16, 2022 06:43
@jianjuns (Contributor) commented

@tnqn Sorry for the late review. I added several questions about locking in the comments, and have a general question: why did we earlier choose to process related NPs in event handlers directly? Any side effects (performance, etc.) with the new approach of enqueuing NPs?

@tnqn (Member Author) commented Aug 17, 2022

@tnqn Sorry for the late review. I added several questions about locking in the comments

@jianjuns I can't find your questions about locking in the comments. Have you published them?

and have a general question: why did we earlier choose to process related NPs in event handlers directly?

It's because only K8s NetworkPolicy was supported when the workflow was designed; it worked fine, and it was more efficient to calculate the spec in event handlers and the span in workers separately. But after Antrea-native policies and the group features were added, the workflow lost its advantage and became complex and error-prone:

When only K8s NetworkPolicy was supported:

  1. The specs of InternalNetworkPolicy, AddressGroup, and AppliedToGroup had a single source of truth: K8s NetworkPolicy. Only K8s NetworkPolicy creation, update, and deletion events could trigger their creation and deletion, and their spec updates.
  2. Event handlers of a single kind of resource (e.g. addNetworkPolicy, updateNetworkPolicy, deleteNetworkPolicy) are executed sequentially, so creating and deleting InternalNetworkPolicies, AddressGroups, and AppliedToGroups didn't need to worry about race conditions.

The above two facts changed after NetworkPolicyController started to support Antrea-native policies and groups:

  1. The specs of InternalNetworkPolicy, AddressGroup, and AppliedToGroup now have multiple sources of truth: K8s NetworkPolicy, Antrea ClusterNetworkPolicy, Antrea NetworkPolicy, Group, ClusterGroup, and the Namespace's annotation (for determining the Audit Logging property of a K8s NetworkPolicy). All these resources' events can trigger the creation, deletion, and spec update of InternalNetworkPolicy, AddressGroup, and AppliedToGroup. That's why there are functions like triggerANPUpdates, triggerCNPUpdates, triggerParentGroupSync, etc.
  2. Event handlers of different kinds of resources can execute concurrently, so race conditions between them need to be taken into consideration. Also, as ClusterGroup, Group, and even a Namespace's annotation can affect the spec of an InternalNetworkPolicy, the workers that sync ClusterGroup and Group and the event handlers of Namespace can update the InternalNetworkPolicy, which can also conflict with the event handlers of the NetworkPolicies.

Mutexes need to be used very carefully to avoid race conditions. For example, even just the internal NetworkPolicy may be written by the following goroutines:

  1. K8s NetworkPolicy event handler executor
  2. Antrea ClusterNetworkPolicy event handler executor
  3. Antrea NetworkPolicy event handler executor
  4. Namespace event handler executor
  5. multiple InternalNetworkPolicy workers
  6. multiple InternalGroup workers

But even if mutexes are used properly, there would be a bottleneck for all the above executors: only one of them could execute the code block that may create/update/delete InternalNetworkPolicy, AddressGroup, and AppliedToGroup. The previous code's critical section was not big enough to avoid the race conditions, hence the errors. To fix it the old way, more code would need to be executed with the lock held. And the code was quite redundant, which the size of the change shows: the refactoring added 1,504 lines of code and deleted 2,735, while the unit test coverage of this package still improved from 58.2% to 63.3%.

I just created issue #4127 to describe the race conditions and why a refactoring is necessary for record.

Any side effects (performance, etc.) with the new approach of enqueuing NPs?

In theory, syncInternalNetworkPolicy's overhead is greater than before because it calculates both spec and span, while it only calculated span before. But concurrency is improved: previously spec calculation was executed sequentially, and the use of internalNetworkPolicyMutex also made span calculation sequential even though there were multiple workers. Now spec and span calculation execute concurrently and can scale out by increasing the number of workers; only the storage update is serialized. Calculating the spec should not cost much, as it's a simple translation, unlike the span calculation, which may depend on the number of Nodes, so repeating the spec calculation in syncInternalNetworkPolicy should be fine. Even if it adds some overhead to each round of processing, it should still be better than serializing the work across the above 6 executors.

@jianjuns (Contributor) left a comment

Thanks Quan for the detailed explanation. It all makes sense to me.

Yes, it seems I forgot to publish the previous comments.

n.internalNetworkPolicyMutex.Lock()
defer n.internalNetworkPolicyMutex.Unlock()

if !oldInternalPolicyExists {
Contributor

Question - does NP store update need to be protected by the mutex too? Why?

Member Author

Yes, they need to be protected by the same mutex because the NP store has indexes of AddressGroups and AppliedToGroups, which are used to determine whether an AddressGroup/AppliedToGroup is orphaned or not. For example, the following code means the AppliedToGroup is no longer used and can be deleted:

// List the internal NetworkPolicies still referencing this AppliedToGroup.
objs, _ := n.internalNetworkPolicyStore.GetByIndex(store.AppliedToGroupIndex, atgName)
if len(objs) == 0 { // no references left: the group is orphaned and can be deleted

We must ensure that the AddressGroup/AppliedToGroup data (stored in their own stores) and their index (stored in the NP store) are updated atomically; otherwise the first race condition the PR description describes could happen:

  1. addANP creates a group for ANP A;
  2. addNetworkPolicy reuses the group for KNP B, is going to create an internal NetworkPolicy;
  3. deleteANP deletes the group for ANP A because at that moment no other internal NetworkPolicies are using the group;
  4. addNetworkPolicy commits the internal NetworkPolicy for KNP B to storage, but the group no longer exists.
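
To make the invariant concrete, here is a self-contained toy model (the types and names are illustrative, not the actual Antrea stores) of why the group data and the policy store's group index must change under one lock:

package main

import (
	"fmt"
	"sync"
)

// Toy model: policies reference groups, and a group with no referring policy
// is deleted. Both maps must change atomically under one mutex; otherwise a
// policy could be committed while the group it references is being deleted.
type stores struct {
	mu            sync.Mutex
	groups        map[string]bool     // stands in for the AppliedToGroup store
	policiesByGrp map[string][]string // stands in for the NP store's group index
}

func (s *stores) addPolicy(policy, group string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.groups[group] = true // create (or reuse) the group...
	// ...and record the reference in the same critical section.
	s.policiesByGrp[group] = append(s.policiesByGrp[group], policy)
}

func (s *stores) deletePolicy(policy, group string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	refs := s.policiesByGrp[group][:0]
	for _, p := range s.policiesByGrp[group] {
		if p != policy {
			refs = append(refs, p)
		}
	}
	s.policiesByGrp[group] = refs
	if len(refs) == 0 {
		delete(s.groups, group) // orphaned: no policy references it anymore
	}
}

func main() {
	s := &stores{groups: map[string]bool{}, policiesByGrp: map[string][]string{}}
	s.addPolicy("anp-a", "g1") // like addANP creating the group for ANP A
	s.addPolicy("knp-b", "g1") // like addNetworkPolicy reusing it for KNP B
	s.deletePolicy("anp-a", "g1")
	fmt.Println(s.groups["g1"]) // true: g1 survives because knp-b still uses it
}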

Contributor

Ok. Got it.

case controlplane.AntreaClusterNetworkPolicy:
	cnp, err := n.cnpLister.Get(key.Name)
	if err != nil {
		n.deleteInternalNetworkPolicy(internalNetworkPolicyName)
Contributor

deleteInternalNetworkPolicy() can delete groups too, but I did not see that it is protected by the mutex?

Member Author

It's protected by the mutex:

func (n *NetworkPolicyController) deleteInternalNetworkPolicy(name string) {
	n.internalNetworkPolicyMutex.Lock()
	defer n.internalNetworkPolicyMutex.Unlock()

@jianjuns (Contributor) Aug 17, 2022

Hmm... not sure why I missed these lines.

@qiyueyao qiyueyao added the action/backport Indicates a PR that requires backports. label Sep 3, 2022
heanlan pushed a commit to heanlan/antrea that referenced this pull request Mar 29, 2023
Labels
action/backport Indicates a PR that requires backports.

Successfully merging this pull request may close these issues.

Race conditions in NetworkPolicyController caused NetworkPolicy realization issues