Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate disabling cgo for all Antrea binaries #5724

Closed
antoninbas opened this issue Nov 17, 2023 · 4 comments · Fixed by #5988
Closed

Investigate disabling cgo for all Antrea binaries #5724

antoninbas opened this issue Nov 17, 2023 · 4 comments · Fixed by #5988
Assignees

Comments

@antoninbas
Copy link
Contributor

This is a follow-up to #5722

To avoid similar issues in the future (e.g., with the antrea/antrea-ubuntu image), we should investigate disabling cgo (CGO_ENABLED=0) for all Antrea binaries (antrea-agent / antrea-controller), and not just for the antctl binaries.

Rather than blindly setting CGO_ENABLED=0, let's think of the possible impact. We should look at impact on binary size, and identify whether there will be any impact on functionality (this is unlikely).

@antoninbas
Copy link
Contributor Author

I suggest looking into this once #5691 has been addressed.

@antoninbas
Copy link
Contributor Author

antoninbas commented Feb 14, 2024

Findings

Impact on binary size

Current build

Agent:

-rwxr-xr-x 1 root root 83185878 Feb 13 23:15 antctl
-rwxr-xr-x 1 root root 98601481 Feb 13 23:18 antrea-agent
-rwxr-xr-x 1 root root 27040411 Feb 13 23:18 antrea-cni

Controller:

-rwxr-xr-x 1 root root 83185878 Feb 13 23:15 antctl
-rwxr-xr-x 1 root root 90393889 Feb 13 23:19 antrea-controller

With cgo disabled

Agent:

-rwxr-xr-x 1 root root 83185878 Feb 13 23:41 antctl
-rwxr-xr-x 1 root root 98474250 Feb 13 23:42 antrea-agent
-rwxr-xr-x 1 root root 27040411 Feb 13 23:42 antrea-cni

Controller:

-rwxr-xr-x 1 root root 83185878 Feb 13 23:41 antctl
-rwxr-xr-x 1 root root 90265322 Feb 13 23:42 antrea-controller

Findings

Binary size is (very) slightly smaller when disabling cgo: 98.60MB -> 98.47MB for antrea-agent, 90.39MB -> 90.27MB for antrea-controller.

Impact on build time

We evaluate the impact of running the default make command, which builds the antrea-agent-ubuntu and antrea-controller-ubuntu container images (go build runs inside container). In both cases, we use an empty build cache but we pull the base images ahead of time.

Current build

9m49s

With cgo disabled

6m42s

Findings

Build time is reduced by around 30%. This is because the Go cache can be reused more effectively. If we look at these 2 commands:

# Disable CGO for antctl in case it is copied outside of the container image. It
# also reduces the size of the binary and aligns with how we distribute antctl
# in release assets.
RUN --mount=type=cache,target=/go/pkg/mod/ \
--mount=type=cache,target=/root/.cache/go-build/ \
CGO_ENABLED=0 make antctl-linux && mv bin/antctl-linux bin/antctl
RUN --mount=type=cache,target=/go/pkg/mod/ \
--mount=type=cache,target=/root/.cache/go-build/ \
make antrea-agent antrea-cni

When running the second command, the cache is mostly invalidated, because the value of CG_ENABLED is different between the 2 go build commands.

Functionality

Usage of cgo in the Go standard library is mostly limited to the net and os/user packages. Antrea has no use for the os/user package, but we do rely on the net package. Within the net package, cgo seems to be used for DNS lookups. From https://pkg.go.dev/net#hdr-Name_Resolution:

On Unix systems, the resolver has two options for resolving names. It can use a pure Go resolver that sends DNS requests directly to the servers listed in /etc/resolv.conf, or it can use a cgo-based resolver that calls C library routines such as getaddrinfo and getnameinfo.

By default the pure Go resolver is used, because a blocked DNS request consumes only a goroutine, while a blocked C call consumes an operating system thread. When cgo is available, the cgo-based resolver is used instead under a variety of conditions: on systems that do not let programs make direct DNS requests (OS X), when the LOCALDOMAIN environment variable is present (even if empty), when the RES_OPTIONS or HOSTALIASES environment variable is non-empty, when the ASR_CONFIG environment variable is non-empty (OpenBSD only), when /etc/resolv.conf or /etc/nsswitch.conf specify the use of features that the Go resolver does not implement, and when the name being looked up ends in .local or is an mDNS name.

Shared system libraries when cgo enabled ``` root@ba2971b2126c:/# ldd `which antrea-agent` linux-vdso.so.1 (0x00007fff5a526000) libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0ce320e000) /lib64/ld-linux-x86-64.so.2 (0x00007f0ce343c000) ```
``` libc names root@ba2971b2126c:/# nm `which antrea-agent` | grep " U " U __errno_location U abort U fprintf U fputc U free U freeaddrinfo U fwrite U gai_strerror U getaddrinfo U getgrgid_r U getgrnam_r U getnameinfo U getpwnam_r U getpwuid_r U malloc U mmap U munmap U nanosleep U pthread_attr_destroy U pthread_attr_getstack U pthread_attr_getstacksize U pthread_attr_init U pthread_cond_broadcast U pthread_cond_wait U pthread_create U pthread_detach U pthread_getattr_np U pthread_key_create U pthread_mutex_lock U pthread_mutex_unlock U pthread_self U pthread_setspecific U pthread_sigmask U res_search U setenv U sigaction U sigaddset U sigemptyset U sigfillset U sigismember U stderr U strerror U sysconf U unsetenv U vfprintf ```

Note that on Windows, the net and os/user packages have never used cgo.

One thing to look out for is name resolution for names ending in .local. This is a pretty common case in K8s, as cluster.local is a default domain name used for DNS resolution within the cluster. However, we need to make the following key observation: since #4548, we are no longer using the ClusterFirstWithHostNet DNSPolicy in the antrea-agent Pod, and hence it is no longer possible to resolve cluster names (both the antrea-agent and antrea-controller Pods are hostNetwork Pods).

In theory, it is possible to have cluster.local names in FQDN policies. These are resolved by Antrea by crafting and sending DNS requests directly to the coreDNS Service. If we cannot identify the coreDNS service, we fall back to the local resolver and then we cannot resolve these names any more because of the DNSPolicy (it doesn't depend whether the pure Go resolver or the system resolver is used in this case).
See:

if dnsServerOverride != "" {
klog.InfoS("DNS server override provided by user", "dnsServer", dnsServerOverride)
controller.dnsServerAddr = dnsServerOverride
} else {
host, port := os.Getenv(kubeDNSServiceHost), os.Getenv(kubeDNSServicePort)
if host == "" || port == "" {
klog.InfoS("Unable to derive DNS server from the kube-dns Service, will fall back to local resolver")
controller.dnsServerAddr = ""
} else {
controller.dnsServerAddr = net.JoinHostPort(host, port)
klog.InfoS("Using kube-dns Service for DNS requests", "dnsServer", controller.dnsServerAddr)
}
}

// makeDNSRequest makes a proactive query for a FQDN to the coreDNS service.
func (f *fqdnController) makeDNSRequest(ctx context.Context, fqdn string) error {
if f.dnsServerAddr == "" {
klog.V(2).InfoS("No DNS server configured, falling back to local resolver")
return f.lookupIP(ctx, fqdn)
}
klog.V(2).InfoS("Making DNS request", "fqdn", fqdn, "dnsServer", f.dnsServerAddr)
dnsClient := dns.Client{SingleInflight: true}

Use of net.LookupIP in Antrea codebase ``` abasSMD6R:antrea abas$ git grep -i "lookupIP" pkg/agent/controller/networkpolicy/fqdn.go:func (f *fqdnController) lookupIP(ctx context.Context, fqdn string) error { pkg/agent/controller/networkpolicy/fqdn.go: if ips, err := resolver.LookupIP(ctx, "ip4", fqdn); err == nil { pkg/agent/controller/networkpolicy/fqdn.go: if ips, err := resolver.LookupIP(ctx, "ip6", fqdn); err == nil { pkg/agent/controller/networkpolicy/fqdn.go: return f.lookupIP(ctx, fqdn) pkg/agent/controller/networkpolicy/fqdn_test.go:func TestLookupIPFallback(t *testing.T) { pkg/agent/controller/networkpolicy/fqdn_test.go: err := f.lookupIP(ctx, "www.google.com") third_party/ipam/nodeipam/node_ipam_controller.go: lookupIP func(host string) ([]net.IP, error) third_party/ipam/nodeipam/node_ipam_controller.go: lookupIP: net.LookupIP, third_party/proxy/util/utils.go: LookupIPAddr(ctx context.Context, host string) ([]net.IPAddr, error) ``` The references in `third_party/` don't seem to be actually used / relevant.

Conclusion

It seems that we can disable cgo without risk (forcing the pure Go resolver to always be used should not impact either the Antrea Agent or the Antrea Controller), and with a small benefit when it comes to container image build time.

@antoninbas
Copy link
Contributor Author

Note that while by default (when cgo is available), go defaults to the cgo-based resolver (which uses libc) to resolve names ending with .local, it seems that when it comes to K8s cluster local names, they can still be resolved by the pure Go resolver.
So the special case seems really specific to mDNS, and there is actually a proposed CL to change the default behavior (https://go-review.googlesource.com/c/go/+/540555), so that the .local suffix by itself is not enough to automatically select the cgo-based resolver.

I did some experiments with a program which simply calls net.LookupIP on a user-provided URL:

root@antrea-toolbox-vlllv:/go# export GODEBUG=netdns=4
root@antrea-toolbox-vlllv:/go# ./main www.google.com
go package net: confVal.netCgo = false  netGo = false
go package net: dynamic selection of DNS resolver
go package net: hostLookupOrder(www.google.com) = files,dns
142.251.46.164
root@antrea-toolbox-vlllv:/go# ./main nginx-test.default.svc
go package net: confVal.netCgo = false  netGo = false
go package net: dynamic selection of DNS resolver
go package net: hostLookupOrder(nginx-test.default.svc) = files,dns
10.96.140.88
root@antrea-toolbox-vlllv:/go# ./main nginx-test.default.svc.cluster.local
go package net: confVal.netCgo = false  netGo = false
go package net: dynamic selection of DNS resolver
go package net: hostLookupOrder(nginx-test.default.svc.cluster.local) = cgo
10.96.140.88
root@antrea-toolbox-vlllv:/go# export GODEBUG=netdns=go+4
root@antrea-toolbox-vlllv:/go# ./main nginx-test.default.svc.cluster.local
go package net: confVal.netCgo = false  netGo = true
go package net: GODEBUG setting forcing use of Go's resolver
go package net: hostLookupOrder(nginx-test.default.svc.cluster.local) = files,dns
10.96.140.88
root@antrea-toolbox-vlllv:/go#

So it seems that the choice of resolver does not really impact resolution of K8s cluster local names.

@antoninbas antoninbas added this to the Antrea v2.0 release milestone Feb 14, 2024
@antoninbas
Copy link
Contributor Author

I'm also planning to disable cgo for the flow-aggregator, based on the observations above (FA typically needs to resolve cluster local names to connect to collectors).

Disabling cgo has no noticeable impact on binary size / build time.

antoninbas added a commit to antoninbas/antrea that referenced this issue Feb 14, 2024
Instead of selectively disabling cgo for some binaries (e.g., release
assets), we now unconditionally disable cgo for all binaries, even those
that only run inside the container image for which they were built
(e.g., antrea-controller). After some analysis, there seems to be no
downside in doing this. We also get some benefits such as reduced build
time for the default make command.

Fixes antrea-io#5724

Signed-off-by: Antonin Bas <antonin.bas@broadcom.com>
antoninbas added a commit to antoninbas/antrea that referenced this issue Feb 14, 2024
Instead of selectively disabling cgo for some binaries (e.g., release
assets), we now unconditionally disable cgo for all binaries, even those
that only run inside the container image for which they were built
(e.g., antrea-controller). After some analysis, there seems to be no
downside in doing this. We also get some benefits such as reduced build
time for the default make command.

Fixes antrea-io#5724

Signed-off-by: Antonin Bas <antonin.bas@broadcom.com>
antoninbas added a commit that referenced this issue Feb 21, 2024
* Disable cgo for all Antrea binaries

Instead of selectively disabling cgo for some binaries (e.g., release
assets), we now unconditionally disable cgo for all binaries, even those
that only run inside the container image for which they were built
(e.g., antrea-controller). After some analysis, there seems to be no
downside in doing this. We also get some benefits such as reduced build
time for the default make command.

Fixes #5724

* Revert "Add git to antrea-build image for UBI build (#5727)"

This reverts commit 2f8441b.

* Revert "Fix antrea-ubi image build (#5723)"

This reverts commit 2afab06.

---------

Signed-off-by: Antonin Bas <antonin.bas@broadcom.com>
luolanzone pushed a commit to luolanzone/antrea that referenced this issue Mar 20, 2024
* Disable cgo for all Antrea binaries

Instead of selectively disabling cgo for some binaries (e.g., release
assets), we now unconditionally disable cgo for all binaries, even those
that only run inside the container image for which they were built
(e.g., antrea-controller). After some analysis, there seems to be no
downside in doing this. We also get some benefits such as reduced build
time for the default make command.

Fixes antrea-io#5724

* Revert "Add git to antrea-build image for UBI build (antrea-io#5727)"

This reverts commit 2f8441b.

* Revert "Fix antrea-ubi image build (antrea-io#5723)"

This reverts commit 2afab06.

---------

Signed-off-by: Antonin Bas <antonin.bas@broadcom.com>
tnqn pushed a commit that referenced this issue Mar 21, 2024
* Disable cgo for all Antrea binaries

Instead of selectively disabling cgo for some binaries (e.g., release
assets), we now unconditionally disable cgo for all binaries, even those
that only run inside the container image for which they were built
(e.g., antrea-controller). After some analysis, there seems to be no
downside in doing this. We also get some benefits such as reduced build
time for the default make command.

Fixes #5724

* Revert "Add git to antrea-build image for UBI build (#5727)"

This reverts commit 2f8441b.

* Revert "Fix antrea-ubi image build (#5723)"

This reverts commit 2afab06.

---------

Signed-off-by: Antonin Bas <antonin.bas@broadcom.com>
luolanzone pushed a commit to luolanzone/antrea that referenced this issue Mar 22, 2024
* Disable cgo for all Antrea binaries

Instead of selectively disabling cgo for some binaries (e.g., release
assets), we now unconditionally disable cgo for all binaries, even those
that only run inside the container image for which they were built
(e.g., antrea-controller). After some analysis, there seems to be no
downside in doing this. We also get some benefits such as reduced build
time for the default make command.

Fixes antrea-io#5724

* Revert "Add git to antrea-build image for UBI build (antrea-io#5727)"

This reverts commit 2f8441b.

* Revert "Fix antrea-ubi image build (antrea-io#5723)"

This reverts commit 2afab06.

---------

Signed-off-by: Antonin Bas <antonin.bas@broadcom.com>
tnqn pushed a commit that referenced this issue Mar 25, 2024
* Disable cgo for all Antrea binaries

Instead of selectively disabling cgo for some binaries (e.g., release
assets), we now unconditionally disable cgo for all binaries, even those
that only run inside the container image for which they were built
(e.g., antrea-controller). After some analysis, there seems to be no
downside in doing this. We also get some benefits such as reduced build
time for the default make command.

Fixes #5724

* Revert "Add git to antrea-build image for UBI build (#5727)"

This reverts commit 2f8441b.

* Revert "Fix antrea-ubi image build (#5723)"

This reverts commit 2afab06.

---------

Signed-off-by: Antonin Bas <antonin.bas@broadcom.com>
luolanzone pushed a commit to luolanzone/antrea that referenced this issue Mar 26, 2024
* Disable cgo for all Antrea binaries

Instead of selectively disabling cgo for some binaries (e.g., release
assets), we now unconditionally disable cgo for all binaries, even those
that only run inside the container image for which they were built
(e.g., antrea-controller). After some analysis, there seems to be no
downside in doing this. We also get some benefits such as reduced build
time for the default make command.

Fixes antrea-io#5724

* Revert "Add git to antrea-build image for UBI build (antrea-io#5727)"

This reverts commit 2f8441b.

* Revert "Fix antrea-ubi image build (antrea-io#5723)"

This reverts commit 2afab06.

---------

Signed-off-by: Antonin Bas <antonin.bas@broadcom.com>
tnqn pushed a commit that referenced this issue Mar 27, 2024
* Disable cgo for all Antrea binaries

Instead of selectively disabling cgo for some binaries (e.g., release
assets), we now unconditionally disable cgo for all binaries, even those
that only run inside the container image for which they were built
(e.g., antrea-controller). After some analysis, there seems to be no
downside in doing this. We also get some benefits such as reduced build
time for the default make command.

Fixes #5724

* Revert "Add git to antrea-build image for UBI build (#5727)"

This reverts commit 2f8441b.

* Revert "Fix antrea-ubi image build (#5723)"

This reverts commit 2afab06.

---------

Signed-off-by: Antonin Bas <antonin.bas@broadcom.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant