pso-csi-node pods restart frequently #168

Open
rpatil72 opened this issue May 8, 2021 · 17 comments

Comments

rpatil72 commented May 8, 2021

$ kubectl get pods -o wide -n storage
NAME                                         READY   STATUS    RESTARTS   AGE     IP             NODE            NOMINATED NODE   READINESS GATES
pso-csi-controller-0                         6/6     Running   6          6h1m    172.20.2.53    init22b-8       <none>           <none>
pso-csi-node-24qd8                           2/3     Running   53         6h1m    10.64.112.52   k8s-worker-02   <none>           <none>
pso-csi-node-mfjg7                           3/3     Running   0          6h1m    10.64.230.41   init22b-8       <none>           <none>
pso-csi-node-ncqmn                           3/3     Running   0          6h1m    10.64.230.40   init22b-7       <none>           <none>
pso-csi-node-p6b97                           2/3     Running   56         6h1m    10.64.112.51   k8s-worker-01   <none>           <none>
pso-db-0-0                                   1/1     Running   0          6h1m    172.20.1.47    init22b-7       <none>           <none>
pso-db-1-0                                   1/1     Running   0          6h1m    172.20.2.54    init22b-8       <none>           <none>
pso-db-2-0                                   1/1     Running   0          6h1m    172.20.3.4     k8s-worker-01   <none>           <none>
pso-db-3-0                                   1/1     Running   0          6h1m    172.20.4.5     k8s-worker-02   <none>           <none>
pso-db-4-0                                   1/1     Running   0          6h1m    172.20.1.48    init22b-7       <none>           <none>
pso-db-cockroach-operator-5f494b6b5b-wpskw   1/1     Running   0          6h1m    172.20.1.45    init22b-7       <none>           <none>
pso-db-deployer-5df48fb79b-5hj82             1/1     Running   0          6h1m    172.20.1.46    init22b-7       <none>           <none>
pso-explorer-84d6cf69f9-lgqjc                2/2     Running   0          5h35m   172.20.2.57    init22b-8       <none>           <none>
psox-agent-2bj4j                             1/1     Running   0          5h51m   172.20.3.5     k8s-worker-01   <none>           <none>
psox-agent-2vx4x                             1/1     Running   0          5h51m   172.20.4.7     k8s-worker-02   <none>           <none>
psox-agent-pc5bh                             1/1     Running   0          5h51m   172.20.2.56    init22b-8       <none>           <none>
psox-agent-z2hnz                             1/1     Running   0          5h51m   172.20.1.49    init22b-7       <none>           <none>
psox-agent-zqlnl                             1/1     Running   1          5h51m   172.20.0.9     k8s-master-01   <none>           <none>

node-driver-registrar logs:

W0508 00:42:08.409027   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:42:18.409174   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:42:28.409061   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:42:38.409045   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:42:48.409081   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:42:58.408966   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:43:08.408997   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:43:18.409026   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:43:28.409013   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
W0508 00:43:38.409098   56628 connection.go:170] Still connecting to unix:///csi/csi.sock
I0508 00:43:42.144113   56628 node_register.go:51] Starting Registration Server at: /registration/pure-csi-reg.sock
I0508 00:43:42.144325   56628 node_register.go:60] Registration Server started at: /registration/pure-csi-reg.sock
I0508 00:43:43.563343   56628 main.go:77] Received GetInfo call: &InfoRequest{}
I0508 00:43:43.584165   56628 main.go:87] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
E0508 00:49:40.845200   56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
E0508 01:00:54.861780   56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
E0508 01:05:12.763918   56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
E0508 01:14:36.757668   56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
E0508 01:33:18.765391   56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
E0508 02:28:59.840505   56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
E0508 02:46:42.755703   56628 connection.go:129] Lost connection to unix:///csi/csi.sock.
E0508 02:57:45.872492   56628 connection.go:129] Lost connection to unix:///csi/csi.sock.

pso-csi-container logs:

Original NodeID: k8s-worker-01
processIPHost
time="2021-05-08T06:50:10Z" level=error msg="Failed to retrieve configmap from cluster." configmap="&ConfigMap{ObjectMeta:{      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []},Data:map[string]string{},BinaryData:map[string][]byte{},}" configmap_name=pure-csi-container-configmap error="an empty namespace may not be set when a resource name is provided"
time="2021-05-08T06:50:11Z" level=info msg="Starting server" debug=false endpoint="unix:///csi/csi.sock" nodeID=k8s-worker-01 span="{}"
time="2021-05-08T06:50:11Z" level=info msg="Created listener" addr=//csi/csi.sock protocol=unix span="{}"
time="2021-05-08T06:50:11Z" level=info msg="Listening for connections" address=//csi/csi.sock network=unix span="{}"
time="2021-05-08T06:50:11Z" level=info msg="Parsing ISCSI target ranges for CIDR filter." ISCSITargetRanges=
time="2021-05-08T06:50:11Z" level=info msg="CIDR Filters constructed." CIDRs="[]"
(/go/src/pso.purestorage.com/k8s/vendor/pso.purestorage.com/nsm/pkg/datastores/services/cockroach/collection.go:166)
[2021-05-08 06:50:12]  sql: no rows in result set 
time="2021-05-08T06:51:00Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
time="2021-05-08T06:51:00Z" level=error msg="failed to queue request" error="rpc error: code = Aborted desc = requests pending" req="(*csi.ProbeRequest) " span="{}"
time="2021-05-08T06:51:00Z" level=error msg="GRPC handler error" error="rpc error: code = Aborted desc = requests pending" method=/csi.v1.Identity/Probe request= span="{}"
time="2021-05-08T06:51:02Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
time="2021-05-08T06:51:02Z" level=error msg="failed to queue request" error="rpc error: code = Aborted desc = requests pending" req="(*csi.ProbeRequest) " span="{}"
time="2021-05-08T06:51:02Z" level=error msg="GRPC handler error" error="rpc error: code = Aborted desc = requests pending" method=/csi.v1.Identity/Probe request= span="{}"
time="2021-05-08T06:51:04Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
time="2021-05-08T06:51:04Z" level=error msg="failed to queue request" error="rpc error: code = Aborted desc = requests pending" req="(*csi.ProbeRequest) " span="{}"
time="2021-05-08T06:51:04Z" level=error msg="GRPC handler error" error="rpc error: code = Aborted desc = requests pending" method=/csi.v1.Identity/Probe request= span="{}"
time="2021-05-08T06:51:06Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
time="2021-05-08T06:51:06Z" level=error msg="failed to queue request" error="rpc error: code = Aborted desc = requests pending" req="(*csi.ProbeRequest) " span="{}"
time="2021-05-08T06:51:06Z" level=error msg="GRPC handler error" error="rpc error: code = Aborted desc = requests pending" method=/csi.v1.Identity/Probe request= span="{}"
time="2021-05-08T06:51:08Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
time="2021-05-08T06:51:08Z" level=error msg="failed to queue request" error="rpc error: code = Aborted desc = requests pending" req="(*csi.ProbeRequest) " span="{}"
time="2021-05-08T06:51:08Z" level=error msg="GRPC handler error" error="rpc error: code = Aborted desc = requests pending" method=/csi.v1.Identity/Probe request= span="{}"
time="2021-05-08T06:51:10Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
time="2021-05-08T06:51:10Z" level=error msg="failed to queue request" error="rpc error: code = Aborted desc = requests pending" req="(*csi.ProbeRequest) " span="{}"
time="2021-05-08T06:51:10Z" level=error msg="GRPC handler error" error="rpc error: code = Aborted desc = requests pending" method=/csi.v1.Identity/Probe request= span="{}"
time="2021-05-08T06:51:12Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
dsupure (Contributor) commented May 9, 2021

Hi, @rpatil72. Can you turn on the debug flag in values.yaml, run the script here to collect the logs, and send them to pso-support@purestorage.com?

Thanks.
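
For reference, a minimal sketch of what enabling debug via Helm could look like (release name, namespace, repo alias, and the exact value key are assumptions; confirm the key against the chart's values.yaml):

# Sketch only: names and the "app.debug" key are assumptions, check values.yaml.
helm upgrade pure-pso pure/pure-pso -n <pso-namespace> --reuse-values --set app.debug=true

# Confirm the pods restart and pick up the new setting:
kubectl get pods -n <pso-namespace> -w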

rpatil72 (Author) commented May 9, 2021

Thanks @dsupure for the quick response.
I have enabled debug in values.yaml and sent the collected logs to that email address.

dsupure (Contributor) commented May 12, 2021

Thank you @rpatil72. We got the logs and our support team is debugging them. We will contact you once we find the issue. Thanks for your patience.

zatricky commented Jun 10, 2021

Hi, @dsupure @rpatil72. Was a solution found for this problem, or do we need to contact support directly?

dsupure (Contributor) commented Jun 11, 2021

Hi @zatricky:
My team is investigating it. We will give you an update tomorrow. Thanks for your patience.

pure-jliao (Contributor) commented:

Hello @rpatil72 @zatricky, after investigation we probably need more info from you:

  • Did you start with PSO 5 or 6?
  • Did a fresh install of PSO 6 work fine before? If yes, what is the environment setup (e.g. Kubernetes and OS versions)?
  • Though it restarts frequently, is it functioning or not?
  • Did those pods restart before or after they were ready?

From the latest log I got from the email:

NAME                                         READY   STATUS    RESTARTS   AGE     IP             NODE            NOMINATED NODE   READINESS GATES
pso-csi-controller-0                         5/6     Running   0          6m26s   172.20.2.58    init22b-8       <none>           <none>
pso-csi-node-dndw2                           2/3     Running   1          6m35s   10.64.112.52   k8s-worker-02   <none>           <none>
pso-csi-node-hvg85                           2/3     Running   0          6m26s   10.64.230.41   init22b-8       <none>           <none>
pso-csi-node-p2xjb                           3/3     Running   0          6m35s   10.64.230.40   init22b-7       <none>           <none>
pso-csi-node-xx56s                           2/3     Running   1          6m35s   10.64.112.51   k8s-worker-01   <none>           <none>
pso-db-0-0                                   1/1     Running   0          2d17h   172.20.1.47    init22b-7       <none>           <none>
pso-db-1-0                                   1/1     Running   0          2d17h   172.20.2.54    init22b-8       <none>           <none>
pso-db-2-0                                   1/1     Running   0          2d17h   172.20.3.4     k8s-worker-01   <none>           <none>
pso-db-3-0                                   1/1     Running   0          2d17h   172.20.4.5     k8s-worker-02   <none>           <none>
pso-db-4-0                                   1/1     Running   0          2d17h   172.20.1.48    init22b-7       <none>           <none>
  • The PSO pods started 6 minutes ago while the db pods started 2 days ago. Did you just delete those pods, or was something else done?
  • From this log we didn't find anything useful. If you can reach the state shown in the first post on this page (controller ready, only 2 node pods restarting), could you run the log collection script again and send us the logs? 6 minutes is probably too short to capture them; sometimes it can take longer.
  • Meanwhile, did you try a helm uninstall and then reinstall of PSO 6 (as sketched below)?
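
For reference, a minimal sketch of that uninstall/reinstall cycle (release name, namespace, and repo alias are assumptions; adjust to your install and your own values.yaml):

# Sketch only: names are assumptions, adjust to your install.
helm uninstall pure-pso -n <pso-namespace>
kubectl get pods -n <pso-namespace> -w     # wait until all PSO pods have terminated

helm install pure-pso pure/pure-pso -n <pso-namespace> -f values.yaml
kubectl get pods -n <pso-namespace> -w     # watch the controller and node pods come up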

zatricky commented:

Hi, @pure-jliao. Our org has sent in a support request. Note that I see similar errors to those @rpatil72 submitted. In our case the original cause was a hypervisor cluster failure affecting many unrelated hosts, including an entire Kubernetes cluster making use of the storage. We had already recovered all core services. I was hoping a resolution had been found for @rpatil72 so that we would at least know where to look next before submitting that support request (which has, of course, already been sent by now).

pure-jliao (Contributor) commented:

Thanks @zatricky. FYI, we did find something wrong on our side related to the error log line "Failed to retrieve configmap from cluster", but it should not be the root cause of the problem stated here and is somewhat misleading, which is why we asked for more info in my previous post.

dsupure (Contributor) commented Jun 15, 2021

@rpatil72: can you help answer the questions from @pure-jliao? Thanks.

rpatil72 (Author) commented:

@pure-jliao please find the replies below:

  • Did you start with PSO 5 or 6?
    Started with PSO 6.

  • Did a fresh install of PSO 6 work fine before? If yes, what is the environment setup (e.g. Kubernetes and OS versions)?
    This was a new setup with PSO 6; k8s 1.21 on Ubuntu 20.04.2 LTS / 5.4.0-72-generic, CRI is docker://20.10.2.

  • Though it restarts frequently, is it functioning or not?
    It was not functioning.

  • Did those pods restart before or after they were ready?
    They never became ready.

  • The PSO pods started 6 minutes ago while the db pods started 2 days ago. Did you just delete those pods, or was something else done?
    I tried to redeploy a couple of times.

  • From this log we didn't find anything useful. If you can reach the state shown in the first post on this page (controller ready, only 2 node pods restarting), could you run the log collection script again and send us the logs?
    I can try this, but I have switched to
    https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner/blob/master/charts/nfs-subdir-external-provisioner/README.md
    and it works fine in the same environment.

  • Meanwhile, did you try a helm uninstall and then reinstall of PSO 6?
    Yes, but it did not help.

pure-jliao (Contributor) commented:

@rpatil72 thanks for replying. One thing to note: right now we haven't tested PSO 6 on k8s 1.21 and it is not officially supported yet. We're not 100% sure the problem is caused by the newer Kubernetes version, though, so if possible could you try a fresh install on k8s 1.20? Meanwhile, in your first post only 2 node pods are restarting and all the other PSO pods are healthy; if you can reach that state again, could you try to grab the logs once more? It can be a little hard to get logs when the pods keep crashing (see the sketch below). Also, could you try running PSO without the explorer to rule out other effects?

If it still cannot be resolved we could schedule a call. cc: @dsupure @dingyin
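
A minimal sketch of pulling logs and restart reasons from one of the crash-looping node pods (pod name and namespace taken from the first post; container names from the log headings there):

# Logs from the previous (crashed) container instances:
kubectl logs -n storage pso-csi-node-p6b97 -c pso-csi-container --previous > pso-csi-container.prev.log
kubectl logs -n storage pso-csi-node-p6b97 -c node-driver-registrar --previous > node-driver-registrar.prev.log

# Restart reason, exit codes, and probe failures:
kubectl describe pod -n storage pso-csi-node-p6b97 > pso-csi-node.describe.txt
kubectl get events -n storage --field-selector involvedObject.name=pso-csi-node-p6b97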

smrutimandal commented Jun 29, 2021

Facing the same issue.

bash-3.2$ helm list -n pure-pso
NAME    	NAMESPACE	REVISION	UPDATED                              	STATUS  	CHART         	APP VERSION
pure-pso	pure-pso 	1       	2021-06-30 01:00:21.279272 +0200 CEST	deployed	pure-pso-6.1.1	6.1.1
bash-3.2$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:11:29Z", GoVersion:"go1.16.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:25:06Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
bash-3.2$

Any updates?

pure-jliao (Contributor) commented:

@smrutimandal sorry, we haven't narrowed down the problem here; there could be a lot of things causing PSO pod failures. Please collect the logs and contact the support team for further investigation.

smrutimandal commented:

Hi,

Thanks for getting back to me. I am in the process of getting my support contract in order. In the meantime, I found that the following feature-gate defaults have changed between 1.20 and 1.21.

Kube API Server v1.20:
CSIServiceAccountToken=true|false (ALPHA - default=false)
CSIStorageCapacity=true|false (ALPHA - default=false)

Kube API Server v1.21:
CSIServiceAccountToken=true|false (BETA - default=true)
CSIStorageCapacity=true|false (BETA - default=true)

Could these changes have any impact on the liveness probes?
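
For reference, a rough sketch of how one could check the current gates and pin them back to the 1.20 defaults on a kubeadm-style control plane to test this theory (manifest path assumed; editing the static pod manifest restarts the API server):

# Check which feature gates the API server was started with (kubeadm-style path assumed):
grep feature-gates /etc/kubernetes/manifests/kube-apiserver.yaml

# To test with the 1.20 defaults, the flag would look like this (diagnostic only):
#   --feature-gates=CSIServiceAccountToken=false,CSIStorageCapacity=false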

Any help is greatly appreciated.

Thanks

dsupure (Contributor) commented Jul 14, 2021

@smrutimandal: As @pure-jliao mentioned, there are many possibilities and it is hard to troubleshoot your problem via GitHub. Please contact Pure support and we will create a support case for you. We will follow up there. Thank you.

smrutimandal commented:

Hi @dsupure,

I understand that. If compatibility really is the issue, though, the https://github.com/purestorage/pso-csi#environments-supported section, which states support for v1.17+, is a bit confusing.

I will get in touch with support.

Thanks

mscbpi commented Jan 26, 2022

1.21 is supposed to be supported now; do you have any update? I get the same liveness probe errors since the Kubernetes upgrade (1.19 is EoL), though the setup seems to work OK (no restarts so far, to be monitored, and volumes get attached):


time="2022-01-26T23:08:16Z" level=error msg="waiters found" bucket= req="(*csi.ProbeRequest) " span="{}" waiters=1
time="2022-01-26T23:08:16Z" level=error msg="failed to queue request" error="rpc error: code = Aborted desc = requests pending" req="(*csi.ProbeRequest) " span="{}"
time="2022-01-26T23:08:16Z" level=error msg="GRPC handler error" error="rpc error: code = Aborted desc = requests pending" method=/csi.v1.Identity/Probe request= span="{}"
