Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1.12 Upgrade #5062

Closed
sba30 opened this issue May 31, 2023 · 3 comments · Fixed by #5154
Closed

1.12 Upgrade #5062

sba30 opened this issue May 31, 2023 · 3 comments · Fixed by #5154
Labels
kind/bug Categorizes issue or PR as related to a bug. reported-by/end-user Issues reported by end users.

Comments

@sba30
Copy link

sba30 commented May 31, 2023

Describe the bug

To Reproduce
Upgrade from 1.8 to 1.12

Expected
Upgrade to be successful and pods running.

Actual behavior
Agent Logs flooded with this errot
E0530 15:53:46.969895 1 ofSwitch.go:267] Received OpenFlow1.5 error: unknown error with type 14, code 1, vendor 0 on message OFPT_EXPERIMENTER
I0530 15:53:46.980139 1 fqdn.go:756] "Received a fragmented DNS response, partially unpacking it" lengthField=26529 actualLength=1396
I0530 15:53:46.980159 1 fqdn.go:758] "Unable to unpack the DNS response partially, skipping it" err="dns: bad rdata"
E0530 15:53:46.980372 1 ofSwitch.go:267] Received OpenFlow1.5 error: unknown error with type 14, code 1, vendor 0 on message OFPT_EXPERIMENTER
I0530 15:53:46.989486 1 fqdn.go:756] "Received a fragmented DNS response, partially unpacking it" lengthField=43364 actualLength=1396
I0530 15:53:46.989513 1 fqdn.go:758] "Unable to unpack the DNS response partially, skipping it" err="dns: bad rdata"
E0530 15:53:46.989734 1 ofSwitch.go:267] Received OpenFlow1.5 error: unknown error with type 14, code 1, vendor 0 on message OFPT_EXPERIMENTER
I0530 15:53:47.000031 1 fqdn.go:756] "Received a fragmented DNS response, partially unpacking it" lengthField=14966 actualLength=1396
I0530 15:53:47.000060 1 fqdn.go:758] "Unable to unpack the DNS response partially, skipping it" err="dns: bad rdata"
E0530 15:53:47.000267 1 ofSwitch.go:267] Received OpenFlow1.5 error: unknown error with type 14, code 1, vendor 0 on message OFPT_EXPERIMENTER
I0530 15:53:47.010512 1 fqdn.go:756] "Received a fragmented DNS response, partially unpacking it" lengthField=5891 actualLength=22
I0530 15:53:47.010551 1 fqdn.go:758] "Unable to unpack the DNS response partially, skipping it" err="dns: buffer size too small"
E0530 15:53:47.010772 1 ofSwitch.go:267] Received OpenFlow1.5 error: unknown error with type 14, code 1, vendor 0 on message OFPT_EXPERIMENTER
I0530 15:53:47.019994 1 fqdn.go:756] "Received a fragmented DNS response, partially unpacking it" lengthField=28525 actualLength=1396
I0530 15:53:47.020027 1 fqdn.go:758] "Unable to unpack the DNS response partially, skipping it" err="dns: bad rdata"
I0530 15:53:47.030227 1 fqdn.go:756] "Received a fragmented DNS response, partially unpacking it" lengthField=25398 actualLength=1396
I0530 15:53:47.030247 1 fqdn.go:758] "Unable to unpack the DNS response partially, skipping it" err="dns: bad rdata"
I0530 15:53:47.039496 1 fqdn.go:756] "Received a fragmented DNS response, partially unpacking it" lengthField=11572 actualLength=1396
I0530 15:53:47.039522 1 fqdn.go:758] "Unable to unpack the DNS response partially, skipping it" err="dns: bad rdata"
I0530 15:53:47.049930 1 fqdn.go:756] "Received a fragmented DNS response, partially unpacking it" lengthField=12465 actualLength=1396
I0530 15:53:47.049956 1 fqdn.go:758] "Unable to unpack the DNS response partially, skipping it" err="dns: bad rdata"
I0530 15:53:47.059833 1 fqdn.go:756] "Received a fragmented DNS response, partially unpacking it" lengthField=26990 actualLength=1158
I0530 15:53:47.059850 1 fqdn.go:758] "Unable to unpack the DNS response partially, skipping it" err="dns: bad rdata"
E0530 15:53:47.060144 1 ofSwitch.go:267] Received OpenFlow1.5 error: unknown error with type 14, code 1, vendor 0 on message OFPT_EXPERIMENTER
I0530 15:53:47.070512 1 fqdn.go:756] "Received a fragmented DNS response, partially unpacking it" lengthField=43713 actualLength=1190
I0530 15:53:47.070533 1 fqdn.go:758] "Unable to unpack the DNS response partially, skipping it" err="dns: buffer size too small"
E0530 15:53:47.070773 1 ofSwitch.go:267] Received OpenFlow1.5 error: unknown error with type 14, code 1, vendor 0 on message OFPT_EXPERIMENTER
I0530 15:53:47.080502 1 fqdn.go:756] "Received a fragmented DNS response, partially unpacking it" lengthField=5891 actualLength=445
I0530 15:53:47.080532 1 fqdn.go:758] "Unable to unpack the DNS response partially, skipping it" err="dns: buffer size too small"
I0530 15:53:47.082991 1 traceflow_controller.go:188] Shutting down AntreaAgentTraceflowController
I0530 15:53:47.083009 1 agent.go:867] Stopping Antrea agent
I0530 15:53:47.083021 1 controller.go:188] Shutting down ExternalIPPoolController
I0530 15:53:47.083031 1 node_route_controller.go:365] Shutting down AntreaAgentNodeRouteController
I0530 15:53:47.083036 1 requestheader_controller.go:183] Shutting down RequestHeaderAuthRequestController
I0530 15:53:47.083053 1 ip_scheduler.go:241] "Shutting down Egress IP scheduler"
I0530 15:53:47.083023 1 configmap_cafile_content.go:223] "Shutting down controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0530 15:53:47.083055 1 cluster.go:352] "Shutting down" controllerName="MemberListCluster"
I0530 15:53:47.083163 1 secure_serving.go:255] Stopped listening on [::]:10350
I0530 15:53:47.083058 1 server.go:701] Shutting down CNI server
I0530 15:53:47.083064 1 discoverer.go:105] Stopping ServiceCIDRDiscoverer
I0530 15:53:47.083001 1 status_controller.go:217] Shutting down NetworkPolicy StatusController
I0530 15:53:47.083091 1 object_count_tracker.go:151] "StorageObjectCountTracker pruner is exiting"
I0530 15:53:47.083093 1 channel.go:87] "Stopping SubscribableChannel" name="PodUpdate"
I0530 15:53:47.083010 1 egress_controller.go:330] Shutting down AntreaAgentEgressController
I0530 15:53:47.083105 1 configmap_cafile_content.go:223] "Shutting down controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0530 15:53:47.083104 1 configmap_cafile_content.go:223] "Shutting down controller" name="antrea-ca::kube-system::antrea-ca::ca.crt"
I0530 15:53:47.083119 1 tlsconfig.go:255] "Shutting down DynamicServingCertificateController"

Versions:

Additional context

@sba30 sba30 added the kind/bug Categorizes issue or PR as related to a bug. label May 31, 2023
@wenyingd
Copy link
Contributor

@sba30 Thanks for reporting the issue.

The error message Received OpenFlow1.5 error: unknown error with type 14, code 1, vendor 0 on message OFPT_EXPERIMENTER in ofnet represents an Error message sent from OVS for type: OFPET_BAD_PROPERTY, code: OFPBPC_BAD_LEN for an openflow15.Type_Experimenter message.

I guess it is related with some issue when decoding PakcetIn2 message with a fragmented DNS packet. Could you help provided the OVS logs?

@tnqn
Copy link
Member

tnqn commented May 31, 2023

Can you also share the full antrea-agent log (starting from begining)? Even if antrea-agent receives fragmented DNS packet, it should be running. It seems to be killed by kubelet because of failed liveness probe, likely due to the "ovs" check:

ovsConnCheck := healthz.NamedCheck("ovs", func(_ *http.Request) error {
if aq.GetOpenflowClient().IsConnected() {
return nil
}
return fmt.Errorf("disconnected from OFSwitch")
})

The early logs may help us understand why the ovs check is not ready.

@wenyingd
Copy link
Contributor

wenyingd commented Jun 6, 2023

Having offline sync with @sba30 , these OVS error logs are printed on the issued setup,

2023-05-30T08:18:39.275Z|00048|connmgr|INFO|br-int<->unix#1: sending OFPBPC_BAD_LEN error reply to NXT_RESUME message
2023-05-30T08:18:39.405Z|00049|connmgr|INFO|br-int<->unix#1: sending OFPBPC_BAD_LEN error reply to NXT_RESUME message
2023-05-30T08:18:39.535Z|00050|connmgr|INFO|br-int<->unix#1: sending OFPBPC_BAD_LEN error reply to NXT_RESUME message
2023-05-30T08:18:39.665Z|00051|connmgr|INFO|br-int<->unix#1: sending OFPBPC_BAD_LEN error reply to NXT_RESUME message
2023-05-30T08:18:39.995Z|00052|ofp_packet|WARN|unknown continuation property type 7
2023-05-30T08:18:40.415Z|00053|ofp_packet|WARN|unknown continuation property type for exp_id 0x2320, exp_type 983032
2023-05-30T08:18:40.996Z|00054|connmgr|INFO|br-int<->unix#1: 486 flow_mods in the 6 s starting 10 s ago (485 adds, 1 deletes)
2023-05-30T08:18:41.045Z|00055|ofp_packet|WARN|unknown continuation property type 0
2023-05-30T08:18:44.978Z|00056|ofp_packet|WARN|unknown continuation property type 7

The errors showed that some issue exists when encoding the OpenFlow resume message (used in packetIn2 mechnism to continue the packet processing pipeline ). After OVS send the OpenFlow error message to antrea-agent, antrea-agent (ofnet) would disconnecct the previous connection from OVS and initiate a new connection. When health check happens at this time, it may think antrea-agent as unhealthy, and then restart the antrea-agent Pod.

Except for the resume message encoding error, I also observed that the flow entry for packet_in TCP DNS response is not installed as expected, the failed flow would send all TCP packets marked with flags "+ack+psh" to antrea-agent, which may indirectly lead to this failure. I use another issue #5077 to track it.

@tnqn tnqn added the reported-by/end-user Issues reported by end users. label Sep 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. reported-by/end-user Issues reported by end users.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants