Failure in k8s-operator endpoint-template (PostStart hook) #7396
This is misleading. The volume of logs generated by kuttl from the unstable k8s tests hides the true problem in the main test suite. Taken from the raw output:

The events in the kuttl output don't show the root cause of pod-0 not being scheduled.

The kubectl get pod output shows the following problem:

For whatever reason, the postStart script failed.
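One way to surface such hook failures from the cluster itself is sketched below (standard kubectl; the pod name and namespace are taken from this test run, and `FailedPostStartHook` is the kubelet's event reason for a failed PostStart hook):

```bash
# Show the pod's event stream, which includes FailedPostStartHook events.
kubectl describe pod endpoint-template-0 --namespace kuttl-test-composed-jackass

# Or filter cluster events directly for PostStart hook failures.
kubectl get events --namespace kuttl-test-composed-jackass \
  --field-selector reason=FailedPostStartHook
```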
Second time, the same problem: https://buildkite.com/redpanda/redpanda/builds/18895#01849d6c-9736-4275-b872-dd1116132d33

Third time, the same problem: https://buildkite.com/redpanda/redpanda/builds/18875#01849b55-2a33-426c-9267-ae7b4410d220. From this test run, one pod named
Due to the failing PostStart hook, it seems that one of the Redpanda brokers is in a degraded state and the Admin API is not responsive. The Pod log collector does not work if a pod is in `PodInitializing` state. To find more evidence of what is happening, a kind export logs command was added; its output should be collected by the Buildkite agent and served for later investigation. REF redpanda-data#7396
k8s: Gather k8s events from failed test
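For reference, a minimal sketch of the added log export (`kind export logs` is a standard kind subcommand; the output directory and cluster name here are placeholders, not the actual CI values):

```bash
# Dump logs from every kind node (kubelet, containerd, pod logs) into a
# directory that the Buildkite agent can pick up as an artifact.
kind export logs ./kind-logs --name kind
```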
The script that tries to remove maintenance mode:

```bash
#!/usr/bin/env bash
set -e

# Poll the Admin API until this broker reports its node_id.
until NODE_ID=$(curl --silent --fail http://${POD_NAME}.endpoint-template.kuttl-test-composed-jackass.svc.cluster.local:9644/v1/node_config | grep -o '\"node_id\":[^,}]*' | grep -o '[^: ]*$'); do
  sleep 0.5
done

echo "Clearing maintenance mode on node ${NODE_ID}"

# Retry the DELETE until the Admin API answers 200 (cleared) or 400
# (rejected, e.g. maintenance mode was never enabled on this broker).
until [ "${status:-}" = "200" ] || [ "${status:-}" = "400" ]; do
  status=$(curl -X DELETE --silent -o /dev/null -w "%{http_code}" http://${POD_NAME}.endpoint-template.kuttl-test-composed-jackass.svc.cluster.local:9644/v1/brokers/${NODE_ID}/maintenance)
  sleep 0.5
done
```
Redpanda config:

```yaml
redpanda:
  data_directory: /var/lib/redpanda/data
  empty_seed_starts_cluster: false
  seed_servers:
    - host:
        address: endpoint-template-0.endpoint-template.kuttl-test-composed-jackass.svc.cluster.local.
        port: 33145
    - host:
        address: endpoint-template-1.endpoint-template.kuttl-test-composed-jackass.svc.cluster.local.
        port: 33145
    - host:
        address: endpoint-template-2.endpoint-template.kuttl-test-composed-jackass.svc.cluster.local.
        port: 33145
  rpc_server:
    address: 0.0.0.0
    port: 33145
  kafka_api:
    - address: 0.0.0.0
      port: 9092
      name: kafka
      authentication_method: none
    - address: 0.0.0.0
      port: 9093
      name: kafka-external
      authentication_method: none
  kafka_api_tls:
    - name: kafka-external
      key_file: /etc/tls/certs/tls.key
      cert_file: /etc/tls/certs/tls.crt
      truststore_file: /etc/tls/certs/ca/ca.crt
      enabled: true
      require_client_auth: true
  admin:
    - address: 0.0.0.0
      port: 9644
      name: admin
  advertised_rpc_api:
    address: endpoint-template-0.endpoint-template.kuttl-test-composed-jackass.svc.cluster.local.
    port: 33145
  advertised_kafka_api:
    - address: endpoint-template-0.endpoint-template.kuttl-test-composed-jackass.svc.cluster.local.
      port: 9092
      name: kafka
    - address: 0-5c2a3f8f4c-kafka.example.com
      port: 30804
      name: kafka-external
  developer_mode: true
  auto_create_topics_enabled: true
  cloud_storage_segment_max_upload_interval_sec: 1800
  default_topic_replications: 3
  enable_rack_awareness: true
  fetch_reads_debounce_timeout: 10
  group_topic_partitions: 3
  id_allocator_replication: 3
  log_segment_size: 536870912
  storage_min_free_bytes: 10485760
  topic_partitions_per_shard: 1000
  transaction_coordinator_replication: 3
rpk:
  tune_network: true
  tune_disk_scheduler: true
  tune_disk_nomerges: true
  tune_disk_write_cache: true
  tune_disk_irq: true
  tune_cpu: true
  tune_aio_events: true
  tune_clocksource: true
  tune_swappiness: true
  coredump_dir: /var/lib/redpanda/coredump
  tune_ballast_file: true
  overprovisioned: true
pandaproxy:
  pandaproxy_api:
    - address: 0.0.0.0
      port: 8082
      name: proxy
    - address: 0.0.0.0
      port: 8083
      name: proxy-external
  pandaproxy_api_tls:
    - name: proxy-external
      key_file: /etc/tls/certs/pandaproxy/tls.key
      cert_file: /etc/tls/certs/pandaproxy/tls.crt
      truststore_file: /etc/tls/certs/pandaproxy/ca/ca.crt
      enabled: true
      require_client_auth: true
  advertised_pandaproxy_api:
    - address: endpoint-template-0.endpoint-template.kuttl-test-composed-jackass.svc.cluster.local.
      port: 8082
      name: proxy
    - address: 0-5c2a3f8f4c-pandaproxy.example.com
      port: 32576
      name: proxy-external
  pandaproxy_client:
    brokers:
      - address: endpoint-template-0.endpoint-template.kuttl-test-composed-jackass.svc.cluster.local.
        port: 9092
      - address: endpoint-template-1.endpoint-template.kuttl-test-composed-jackass.svc.cluster.local.
        port: 9092
      - address: endpoint-template-2.endpoint-template.kuttl-test-composed-jackass.svc.cluster.local.
        port: 9092
schema_registry: {}
```
From Redpanda's perspective, a couple of observations:

- Just curious if that aligns with what we think the test is doing; is it adding a brand-new node to a cluster? For a brand-new node, is it possible to skip disabling maintenance mode? More broadly, if the issue is that we can't send any RPCs while initializing, that seems problematic, and maybe we need to expose a different readiness endpoint for these hooks to use.
- It looks like the node is having trouble sending RPCs to any of the seed nodes. Is that something we should expect when initializing a pod?
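If a dedicated readiness endpoint existed, the PostStart hook could gate on it before issuing any maintenance-mode calls. A minimal sketch, assuming a hypothetical Admin API path `/v1/status/ready` and a `NAMESPACE` variable (neither is confirmed by this thread):

```bash
#!/usr/bin/env bash
# Sketch: block until the broker reports ready, then proceed.
# The /v1/status/ready path and NAMESPACE variable are assumptions.
ADMIN_URL="http://${POD_NAME}.endpoint-template.${NAMESPACE}.svc.cluster.local:9644"
until curl --silent --fail "${ADMIN_URL}/v1/status/ready" >/dev/null; do
  sleep 0.5
done
echo "Broker ready; safe to issue maintenance-mode requests"
```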
Once more in a feature branch: https://buildkite.com/redpanda/redpanda/builds/18931#0184a033-7ea9-4375-a819-226a353770d7
Another thing I'm confused about is why this happens only sometimes, if the issue truly is that DNS setup cannot proceed until the node exposes its admin endpoint.
Just for posterity,
Thanks @andrwng, I will drill down into the DNS (CoreDNS) pod to see if it was causing the problem.
This test is related to setting up the external advertised Kafka API DNS using a template mechanism. It happens that this particular test creates 3 Redpanda clusters, while the others create only 1.
It's not that easy. We would need to implement some hacks around when a node is considered formed and when it is not. If we cannot use the Admin API reliably, we might end up with some kind of workaround.
We are not using a readiness probe to take the Redpanda broker out of maintenance mode; it's a PostStart hook: https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#container-hooks
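For context, this is roughly how such a hook hangs off a container spec (the field names are standard Kubernetes; the image tag and script path are placeholders, not the operator's actual values):

```yaml
# Sketch of a PostStart lifecycle hook; it runs right after the container
# starts, with no ordering guarantee relative to the entrypoint.
containers:
  - name: redpanda
    image: redpanda:placeholder   # placeholder image
    lifecycle:
      postStart:
        exec:
          command: ["/bin/bash", "-c", "/var/lifecycle/postStart.sh"]
```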
The DNS containers don't have any useful logs. I don't want to blame the kind cluster or the Buildkite environment, but I have run out of options. Maybe we have too many kind nodes, but that's a stretch of a hypothesis.
We need to collect previous logs from pods. All we can see is that pod/endpoint-template-1's initContainer was running at the time we collected logs.
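kubectl supports this directly via the `--previous` flag; a sketch, assuming the container is named `redpanda` (the pod and namespace names come from this test run):

```bash
# Fetch logs from the prior instance of the container, which survives
# a container restart but not a pod deletion.
kubectl logs pod/endpoint-template-1 --namespace kuttl-test-composed-jackass \
  --container redpanda --previous
```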
Can that be done automatically and added to the Buildkite artifacts so that we can access them?
That will be the next step when someone is next working on this, yes.
Based on 1 week of failure reports, there were no failures in endpoint-template. I will close this issue; if it reappears I will re-open.
https://buildkite.com/redpanda/redpanda/builds/18840#01849373-8425-495a-95e6-7c462a832601