Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add agent networkpolicy metrics #770

Merged
merged 17 commits into from
Jun 8, 2020

Conversation

yktsubo
Copy link
Contributor

@yktsubo yktsubo commented May 30, 2020

Add agent metrics of total number of ingress/egress network policy rules installed on the node.

  • antrea_agent_ingress_networkpolicy_rule
    The number of ingress networkpolicy rules installed on the node
  • antrea_agent_egress_networkpolicy_rule
    The number of egress networkpolicy rules installed on the node

This PR is a part of #713.

@antrea-bot
Copy link
Collaborator

Thanks for your PR.
Unit tests and code linters are run automatically every time the PR is updated.
E2e, conformance and network policy tests can only be triggered by a member of the vmware-tanzu organization. Regular contributors to the project should join the org.

The following commands are available:

  • /test-e2e: to trigger e2e tests.
  • /skip-e2e: to skip e2e tests.
  • /test-conformance: to trigger conformance tests.
  • /skip-conformance: to skip conformance tests.
  • /test-networkpolicy: to trigger networkpolicy tests.
  • /skip-networkpolicy: to skip networkpolicy tests.
  • /test-windows-conformance: to trigger windows conformance tests.
  • /skip-windows-conformance: to skip windows conformance tests.
  • /test-all: to trigger all tests.
  • /skip-all: to skip all tests.

These commands can only be run by members of the vmware-tanzu organization.

)

var (
IngressNetworkPolicyCount = metrics.NewGauge(&metrics.GaugeOpts{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to exclude these metrics from pkg/agent/metrics/prometheus.go? Or at least from pkg/agent/metrics?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with Kobi. These should be part of pkg/agent/metrics IMO.
May be a separate directory like pkg/agent/metrics/networkpolicy will be good.
In future, we can possibly add a way to enable/disable specific metrics in antrea agent. What do you think @ksamoray ?

Copy link
Contributor Author

@yktsubo yktsubo Jun 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason is to avoid import cycle.
pkg/agent/metrics/prometheus.go use pkg/agent/openflow package
I tried that pkg/agent//openflow/network_policy.go use metrics/prometheus.go, which caused the loop.
What was on my mind was

  1. To implement antrea_agent_ovs_flow_table wihtout using openflow package in prometheus.go
  2. Use another package not to have import cycle.

I think 1st one is better considering the future use?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do the following to resolve the dependency? Have separate directories for definition and callback (if needed) for metrics, and just do initialization in prometheus.go
pkg/agent/metrics/ovs --for pod count (ovs container interface count) and ovs flow table count
pkg/agent/metrics/networkpolicy---for network policy related metrics
pkg/agent/metrics/prometheus.go---for initializing/registration of the metrics. In future, we can selectively register the metrics.

Better to get opinion of @ksamoray too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your comment. Let me check.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing that importing pkg/agent/openflow wasn't a good practice on my side.
If possible, I think that it's better to avoid a case where some of the metric implementations import pkg/agent/metrics while others are imported by.
Having said that, it could be painful to reimplement antrea_agent_ovs_flow_table while walking around openflow package. Therefore I think that for now you're solution is valid and so is the proposal by @srikartati
Unless someone has a different proposal - e.g maintaining flow counts as counter metrics instead of using a callback. My openflow package it too limited to conclude this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for registering metrics selectively, that would require a whole lot of configuration flags. Maybe a few individual metrics which their collection is costly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ksamoray I guess it depends on how we group/classify various metrics and interface we provide to enable those groups. Moreover, muxing of metrics is to provide flexibility to the user who is interested in subset of metrics rather than complexity in tracking metrics.

@yktsubo You probably already saw PR#781. I tried to remove the dependency with other packages. You can now define the new metrics in metrics package. Also, ok with not having separate package for network policy metrics in the current patch. Grouping/classification may probably need more discussion and work.

@@ -36,6 +36,8 @@ const monitoringNamespace string = "monitoring"
var antreaAgentMetrics = []string{
"antrea_agent_local_pod_count",
"antrea_agent_ovs_flow_table",
"antrea_agent_ingress_networkpolicy_rule",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had that list sorted, IDK if that has any importance though :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the comment, I'll change based on your suggestion.

Copy link
Member

@srikartati srikartati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this.
Wondering how you were able to verify these metrics through e2e test? As in, were mock network policies and rules are added and verified?

)

var (
IngressNetworkPolicyCount = metrics.NewGauge(&metrics.GaugeOpts{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with Kobi. These should be part of pkg/agent/metrics IMO.
May be a separate directory like pkg/agent/metrics/networkpolicy will be good.
In future, we can possibly add a way to enable/disable specific metrics in antrea agent. What do you think @ksamoray ?

@@ -738,6 +739,14 @@ func (c *client) InstallPolicyRuleFlows(ruleID uint32, rule *types.PolicyRule, n
if err := c.applyConjunctiveMatchFlows(ctxChanges); err != nil {
return err
}

// Count up antrea_agent_ingress_networkpolicy_rule or antrea_agent_egress_networkpolicy_rule
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not installOFRule function in pkg/agent/controller/networkpolicy/reconciler.go?

This function could be used to track number of network policies in addition to the current metric, number of network policy rules.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the suggestion, it sounds great. I'll take a look into it.

pkg/agent/metricsstore/metric_cache.go Outdated Show resolved Hide resolved
@@ -944,6 +953,21 @@ func (c *client) UninstallPolicyRuleFlows(ruleID uint32) error {
return err
}

isFound := false
for _, ctxChange := range ctxChanges {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ctxChanges seem to be some what complicated.

Is it possible to change following function in reconciler.go to accept Direction or pointer to rule as installOFRule?
func (r *reconciler) uninstallOFRule(ofID uint32)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your suggestion, let me take a look into it.

@yktsubo yktsubo force-pushed the add_agent_networkpolicy_metrics branch from bd4b623 to 35651c8 Compare June 4, 2020 15:53
Copy link
Member

@srikartati srikartati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any ideas on testing these metrics if they are tracked correctly?
I see e2e test do not have a way for accuracy of metrics. Looks like it is just tracking the list of metrics.
May be we could create some pods and then use createNetworkPolicy/deleteNetworkpolicy and check for the value of metrics in prometheus e2e test. What do you think?

@@ -368,6 +369,14 @@ func (r *reconciler) installOFRule(ofRule *types.PolicyRule, npName, npNamespace
r.idAllocator.release(ofID)
return 0, fmt.Errorf("error installing ofRule %v: %v", ofID, err)
}

// Count up antrea_agent_ingress_networkpolicy_rule_count or antrea_agent_egress_networkpolicy_rule_count
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at it more closely, installOFRule is also invoked for update of network policy rules (through "update" function). We may double count the rules. Please look at "add" function to get the correct count.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, Actually I wanted to confirm the point whether we have a case to increment or decrement counters in update.

@@ -398,11 +407,19 @@ func (r *reconciler) updateOFRule(ofID uint32, addedFrom []types.Address, addedT
return nil
}

func (r *reconciler) uninstallOFRule(ofID uint32) error {
func (r *reconciler) uninstallOFRule(ofID uint32, flowDirection v1beta1.Direction) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here as well, please look at "Forget" function that calls uninstallOFRule.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll work on it.

@ksamoray
Copy link
Contributor

ksamoray commented Jun 4, 2020

@srikartati for some of the metrics we can validate with the numbers in the controller, agent CRDs.
For others, not sure really what we can compare with.
Either way, this is not within the scope of this change, and could be handled as a new issue.

@srikartati
Copy link
Member

@srikartati for some of the metrics we can validate with the numbers in the controller, agent CRDs.
For others, not sure really what we can compare with.
Either way, this is not within the scope of this change, and could be handled as a new issue.

We are adding these metrics by hooking the tracking code in the existing code. IMO testing is necessary to get the code right and will be easily maintainable if there any future changes in existing code. Therefore, I do not completely agree that this is out of scope in this PR.
However, I see your point as tests are not there for other metrics, we could consider this as a separate issue. We can check in the code now and rely on manual testing.

As pointed out in my previous comment, testSetup in Prometheus e2e test can be extended to deploy/delete pods, create/delete network policies etc. At least metrics such as network policy, pod count can be covered. In addition, we may also antctl (tests/e2e/antctl_test) to get the expected values from agent/controller to compare with metrics gathered by API.

@srikartati
Copy link
Member

However, I see your point as tests are not there for other metrics, we could consider this as a separate issue. We can check in the code now and rely on manual testing.

Added a new issue for improving testing: #799
Thanks.

@yktsubo
Copy link
Contributor Author

yktsubo commented Jun 4, 2020

@ksamoray @srikartati
Thank you for the comment about testing.
Since we have a new issue, I'll let it go at this.

Yuki Tsuboi added 17 commits June 7, 2020 22:59
Add ingress and egress networkpolicy metrics

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>
Add networkpolicy metrics in agent

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>
Update register function based on kubernetes library upgrade

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>
Update metric name

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>
Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>
Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>
Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>
Use reconcier to implement networkpolicy metrics

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>
Change based on rebase

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>
Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>
Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>
Delete unncessary lines

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>
Apply gofmt to pkg/agent/metrics/prometheus.go

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>
Use add and forget to update agent networkpolicy metrics

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>
Delete unnecessary blanks

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>
Fix metrics name

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>
Add direction in unit test

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>
@yktsubo yktsubo force-pushed the add_agent_networkpolicy_metrics branch from 5ed689e to 80b3c83 Compare June 7, 2020 14:52
Copy link
Member

@srikartati srikartati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing the comments. Were you able to do any manual testing?

@yktsubo
Copy link
Contributor Author

yktsubo commented Jun 8, 2020

Hi @srikartati
Yes, I tested create new netpol(ingress/egress) and delete them in my lab.
I confirmed that metrics increased and decreased as we expect.

If we you need me to put some testing code in networkpolicy unit test, I can think about it.

@srikartati
Copy link
Member

Hi @srikartati
Yes, I tested create new netpol(ingress/egress) and delete them in my lab.
I confirmed that metrics increased and decreased as we expect.

If we you need me to put some testing code in networkpolicy unit test, I can think about it.

As discussed before, we can take this up in follow up patch standalone or other metrics: #799

LGTM. It'd be better to get approval from @ksamoray

Copy link
Contributor

@ksamoray ksamoray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ksamoray
Copy link
Contributor

ksamoray commented Jun 8, 2020

/test-all

@srikartati srikartati merged commit 037dfd9 into antrea-io:master Jun 8, 2020
@yktsubo yktsubo deleted the add_agent_networkpolicy_metrics branch June 9, 2020 01:30
GraysonWu pushed a commit to GraysonWu/antrea that referenced this pull request Sep 22, 2020
* Add ingress and egress networkpolicy metrics

Add ingress and egress networkpolicy metrics

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>

* Add networkpolicy metrics in agent

Add networkpolicy metrics in agent

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>

* Update register function based on kubernetes library upgrade

Update register function based on kubernetes library upgrade

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>

* Update metric name

Update metric name

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>

* Delete unnecessary blanks

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>

* Add metrics to testcase

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>

* Gofmt for agent prometheus.go and metric_cache.go

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>

* Use reconcier to implement networkpolicy metrics

Use reconcier to implement networkpolicy metrics

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>

* Change based on rebase

Change based on rebase

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>

* Delete unncessary package from go.sum

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>

* Fix mistake by rebase

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>

* Delete unncessary lines

Delete unncessary lines

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>

* Apply gofmt to pkg/agent/metrics/prometheus.go

Apply gofmt to pkg/agent/metrics/prometheus.go

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>

* Use add and forget to update agent networkpolicy metrics

Use add and forget to update agent networkpolicy metrics

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>

* Delete unnecessary blanks

Delete unnecessary blanks

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>

* Fix metrics name

Fix metrics name

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>

* Add direction in unit test

Add direction in unit test

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants