Some post-upgrade tests do not run health check to skip tests #10537

Open

petr-balogh opened this issue Sep 19, 2024 · 0 comments

Labels: team/ecosystem Ecosystem team related issues/PRs

petr-balogh (Member) commented:
e.g.:
test_noobaa_service_mon_after_ocs_upgrade

Run: ocs-ci results for OCS4-16-Downstream-OCP4-17-AZURE-IPI-3AZ-RHCOS-3M-3W-upgrade-ocp (BUILD ID: 4.16.2-1 RUN ID: 1725347953)

We see these tests failed:

Failures:
Brown squad
tests/functional/z_cluster/cluster_expansion/test_add_capacity.py::TestAddCapacityPreUpgrade::test_add_capacity_pre_upgrade
tests/functional/upgrade/test_configuration.py::test_crush_map_unchanged
tests/functional/upgrade/test_resources.py::test_storage_pods_running
tests/functional/upgrade/test_resources.py::test_pod_log_after_upgrade
Purple squad
tests/functional/upgrade/test_upgrade_ocp.py::TestUpgradeOCP::test_upgrade_ocp
tests/test_failure_propagator.py::TestFailurePropagator::test_report_skip_triggering_test
Magenta squad
tests/functional/upgrade/test_monitoring_after_ocp_upgrade.py::test_monitoring_after_ocp_upgrade
Orange squad
tests/cross_functional/scale/upgrade/test_upgrade_with_scaled_obc.py::test_scale_obc_post_upgrade
tests/cross_functional/scale/upgrade/test_upgrade_with_scaled_pvcs_pods.py::test_scale_pvcs_pods_post_upgrade
Red squad
tests/functional/object/mcg/test_default_backingstore_override.py::TestDefaultBackingstoreOverride::test_default_backingstore_override_post_upgrade
tests/functional/upgrade/test_resources.py::test_noobaa_service_mon_after_ocs_upgrade

We should run the Ceph health check for all upgrade-related tests:
https://github.com/red-hat-storage/ocs-ci/blob/master/tests/conftest.py#L1742

https://github.com/red-hat-storage/ocs-ci/blob/master/ocs_ci/framework/pytest_customization/marks.py#L137
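For reference, a minimal sketch of how a marker-gated health check could be wired into conftest.py is shown below. The marker names, the `upgrade_ceph_health_gate` fixture, and the exact `ceph_health_check` signature are assumptions for illustration only, not the actual ocs-ci implementation linked above:

```python
# Illustrative sketch only -- not the actual ocs-ci conftest/marks code.
# `ceph_health_check` is assumed to behave like the helper referenced above
# (raises when the cluster is unhealthy); marker and fixture names are
# hypothetical.
import pytest

from ocs_ci.utility.utils import ceph_health_check  # assumed helper

UPGRADE_MARKERS = {"pre_upgrade", "post_upgrade", "ocp_upgrade", "ocs_upgrade"}


def pytest_collection_modifyitems(items):
    """Attach the health gate to every test carrying an upgrade-related marker."""
    for item in items:
        if UPGRADE_MARKERS & {m.name for m in item.iter_markers()}:
            # Prepend so the check runs before the test's own setup fixtures.
            item.fixturenames.insert(0, "upgrade_ceph_health_gate")


@pytest.fixture
def upgrade_ceph_health_gate():
    """Skip the test during setup if the Ceph cluster is not healthy."""
    try:
        ceph_health_check(tries=3, delay=30)
    except Exception as err:
        pytest.skip(f"Ceph cluster health is not OK, skipping test: {err}")
```

Whatever the real mechanism is (marker decorator vs. autouse fixture), the important part is that the skip happens in the setup phase, before the test body runs.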

So it looks like we have a bug or issue, as the check is not running in the setup phase of some test cases:

2024-09-04 01:15:55  tests/functional/upgrade/test_resources.py::test_noobaa_service_mon_after_ocs_upgrade 
2024-09-04 01:15:55  -------------------------------- live log setup --------------------------------
2024-09-04 01:15:55  19:15:45 - MainThread - ocs_ci.utility.utils - INFO  - testrun_name: OCS4-16-Downstream-OCP4-17-AZURE-IPI-3AZ-RHCOS-3M-3W-upgrade-ocp
2024-09-04 01:15:55  19:15:45 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: ['oc', 'login', '-u', 'kubeadmin', '-p', '*****']
2024-09-04 01:15:55  19:15:45 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig -n openshift-monitoring whoami --show-token
2024-09-04 01:15:55  19:15:46 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig -n openshift-monitoring get Route prometheus-k8s -n openshift-monitoring -o yaml
2024-09-04 01:15:55  19:15:46 - MainThread - ocs_ci.framework.pytest_customization.reports - INFO  - duration reported by tests/functional/upgrade/test_resources.py::test_noobaa_service_mon_after_ocs_upgrade immediately after test execution: 1.3
2024-09-04 01:15:55  -------------------------------- live log call ---------------------------------
2024-09-04 01:15:55  19:15:46 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig -n openshift-storage get  csv -n openshift-storage -o yaml
2024-09-04 01:15:55  19:15:47 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig -n openshift-storage get servicemonitors  -n openshift-storage -o yaml
2024-09-04 01:15:55  19:15:48 - MainThread - tests.functional.upgrade.test_resources - INFO  - noobaa-service-monitor does not exist
2024-09-04 01:15:55  19:15:48 - MainThread - ocs_ci.framework.pytest_customization.reports - INFO  - duration reported by tests/functional/upgrade/test_resources.py::test_noobaa_service_mon_after_ocs_upgrade immediately after test execution: 1.76
2024-09-04 01:15:55  PASSED
2024-09-04 01:15:55  ------------------------------ live log teardown -------------------------------
2024-09-04 01:15:55  19:15:48 - MainThread - tests.conftest - WARNING  - During test were raised new alerts
2024-09-04 01:15:55  19:15:48 - MainThread - tests.conftest - WARNING  - [{'labels': {'alertname': 'PodSecurityViolation', 'namespace': 'openshift-kube-apiserver', 'policy_level': 'restricted', 'severity': 'info'}, 'annotations': {'description': 'A workload (pod, deployment, daemonset, ...) was created somewhere in the cluster but it did not match the PodSecurity "restricted" profile defined by its namespace either via the cluster-wide configuration (which triggers on a "restricted" profile violations) or by the namespace local Pod Security labels. Refer to Kubernetes documentation on Pod Security Admission to learn more about these violations.', 'summary': "One or more workloads users created in the cluster don't match their Pod Security profile"}, 'state': 'firing', 'activeAt': '2024-09-03T11:48:05.877119205Z', 'value': '2.103802490580051e+01'}, {'labels': {'alertname': 'PodDisruptionBudgetAtLimit', 'namespace': 'openshift-storage', 'poddisruptionbudget': 'rook-ceph-mon-pdb', 'severity': 'warning'}, 'annotations': {'description': 'The pod disruption budget is at the minimum disruptions allowed level. The number of current healthy pods is equal to the desired healthy pods.', 'runbook_url': 'https://github.com/openshift/runbooks/blob/master/alerts/cluster-kube-controller-manager-operator/PodDisruptionBudgetAtLimit.md', 'summary': 'The pod disruption budget is preventing further disruption to pods.'}, 'state': 'pending', 'activeAt': '2024-09-03T23:11:29.251667276Z', 'value': '2e+00'}, {'labels': {'alertname': 'PodDisruptionBudgetAtLimit', 'namespace': 'openshift-storage', 'poddisruptionbudget': 'rook-ceph-osd-zone-eastus-1', 'severity': 'warning'}, 'annotations': {'description': 'The pod disruption budget is at the minimum disruptions allowed level. The number of current healthy pods is equal to the desired healthy pods.', 'runbook_url': 'https://github.com/openshift/runbooks/blob/master/alerts/cluster-kube-controller-manager-operator/PodDisruptionBudgetAtLimit.md', 'summary': 'The pod disruption budget is preventing further disruption to pods.'}, 'state': 'firing', 'activeAt': '2024-09-03T11:46:29.254763422Z', 'value': '2e+00'}, {'labels': {'alertname': 'PodDisruptionBudgetAtLimit', 'namespace': 'openshift-storage', 'poddisruptionbudget': 'rook-ceph-osd-zone-eastus-3', 'severity': 'warning'}, 'annotations': {'description': 'The pod disruption budget is at the minimum disruptions allowed level. 
The number of current healthy pods is equal to the desired healthy pods.', 'runbook_url': 'https://github.com/openshift/runbooks/blob/master/alerts/cluster-kube-controller-manager-operator/PodDisruptionBudgetAtLimit.md', 'summary': 'The pod disruption budget is preventing further disruption to pods.'}, 'state': 'firing', 'activeAt': '2024-09-03T11:46:29.254763422Z', 'value': '2e+00'}, {'labels': {'alertname': 'PrometheusDuplicateTimestamps', 'container': 'kube-rbac-proxy', 'endpoint': 'metrics', 'instance': '10.129.2.43:9092', 'job': 'prometheus-k8s', 'namespace': 'openshift-monitoring', 'pod': 'prometheus-k8s-1', 'service': 'prometheus-k8s', 'severity': 'warning'}, 'annotations': {'description': 'Prometheus openshift-monitoring/prometheus-k8s-1 is dropping 98.97 samples/s with different values but duplicated timestamp.', 'runbook_url': 'https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/PrometheusDuplicateTimestamps.md', 'summary': 'Prometheus is dropping samples with duplicate timestamps.'}, 'state': 'firing', 'activeAt': '2024-09-03T11:43:40.388498204Z', 'value': '9.896666666666667e+01'}, {'labels': {'alertname': 'PrometheusDuplicateTimestamps', 'container': 'kube-rbac-proxy', 'endpoint': 'metrics', 'instance': '10.128.2.14:9092', 'job': 'prometheus-k8s', 'namespace': 'openshift-monitoring', 'pod': 'prometheus-k8s-0', 'service': 'prometheus-k8s', 'severity': 'warning'}, 'annotations': {'description': 'Prometheus openshift-monitoring/prometheus-k8s-0 is dropping 98.97 samples/s with different values but duplicated timestamp.', 'runbook_url': 'https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/PrometheusDuplicateTimestamps.md', 'summary': 'Prometheus is dropping samples with duplicate timestamps.'}, 'state': 'firing', 'activeAt': '2024-09-03T11:48:40.377336908Z', 'value': '9.89666666666667e+01'}, {'labels': {'alertname': 'ClusterNotUpgradeable', 'condition': 'Upgradeable', 'endpoint': 'metrics', 'name': 'version', 'namespace': 'openshift-cluster-version', 'severity': 'info'}, 'annotations': {'description': "In most cases, you will still be able to apply patch releases. Reason DegradedPool. For more information refer to 'oc adm upgrade' or https://console-openshift-console.apps.j-299zi3c33-uo.azure.qe.rh-ocs.com/settings/cluster/.", 'summary': 'One or more cluster operators have been blocking minor version cluster upgrades for at least an hour.'}, 'state': 'firing', 'activeAt': '2024-09-03T10:30:59Z', 'value': '0e+00'}, {'labels': {'alertname': 'TargetDown', 'job': 'ocs-metrics-exporter', 'namespace': 'openshift-storage', 'service': 'ocs-metrics-exporter', 'severity': 'warning'}, 'annotations': {'description': '50% of the ocs-metrics-exporter/ocs-metrics-exporter targets in openshift-storage namespace have been unreachable for more than 15 minutes. This may be a symptom of network connectivity issues, down nodes, or failures within these components. 
Assess the health of the infrastructure and nodes running these targets and then contact support.', 'runbook_url': 'https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/TargetDown.md', 'summary': 'Some targets were not reachable from the monitoring server for an extended period of time.'}, 'state': 'pending', 'activeAt': '2024-09-03T23:13:45.801917775Z', 'value': '5e+01'}, {'labels': {'alertname': 'CephClusterWarningState', 'container': 'mgr', 'endpoint': 'http-metrics', 'instance': '10.129.2.47:9283', 'job': 'rook-ceph-mgr', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-59d9b8cb7c-mb2jw', 'service': 'rook-ceph-mgr', 'severity': 'warning'}, 'annotations': {'description': 'Storage cluster is in warning state for more than 15m.', 'message': 'Storage cluster is in degraded state', 'runbook_url': 'https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephClusterWarningState.md', 'severity_level': 'warning', 'storage_type': 'ceph'}, 'state': 'firing', 'activeAt': '2024-09-03T12:15:59.795888155Z', 'value': '1e+00'}, {'labels': {'alertname': 'CephMonQuorumAtRisk', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'severity': 'critical'}, 'annotations': {'description': 'Storage cluster quorum is low. Contact Support.', 'message': 'Storage quorum at risk', 'runbook_url': 'https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMonQuorumAtRisk.md', 'severity_level': 'error', 'storage_type': 'ceph'}, 'state': 'firing', 'activeAt': '2024-09-03T11:49:46.660878994Z', 'value': '2e+00'}, {'labels': {'alertname': 'KubeDeploymentReplicasMismatch', 'container': 'kube-rbac-proxy-main', 'deployment': 'rook-ceph-mon-a', 'endpoint': 'https-main', 'job': 'kube-state-metrics', 'namespace': 'openshift-storage', 'service': 'kube-state-metrics', 'severity': 'warning'}, 'annotations': {'description': 'Deployment openshift-storage/rook-ceph-mon-a has not matched the expected number of replicas for longer than 15 minutes. This indicates that cluster infrastructure is unable to start or restart the necessary components. This most often occurs when one or more nodes are down or partioned from the cluster, or a fault occurs on the node that prevents the workload from starting. In rare cases this may indicate a new version of a cluster component cannot start due to a bug or configuration error. Assess the pods for this deployment to verify they are running on healthy nodes and then contact support.', 'runbook_url': 'https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/KubeDeploymentReplicasMismatch.md', 'summary': 'Deployment has not matched the expected number of replicas'}, 'state': 'pending', 'activeAt': '2024-09-03T23:15:30.244714658Z', 'value': '1e+00'}, {'labels': {'alertname': 'KubeDeploymentReplicasMismatch', 'container': 'kube-rbac-proxy-main', 'deployment': 'rook-ceph-crashcollector-d66507c920c0f24e37de2ecf0e2ace0d', 'endpoint': 'https-main', 'job': 'kube-state-metrics', 'namespace': 'openshift-storage', 'service': 'kube-state-metrics', 'severity': 'warning'}, 'annotations': {'description': 'Deployment openshift-storage/rook-ceph-crashcollector-d66507c920c0f24e37de2ecf0e2ace0d has not matched the expected number of replicas for longer than 15 minutes. This indicates that cluster infrastructure is unable to start or restart the necessary components. 
This most often occurs when one or more nodes are down or partioned from the cluster, or a fault occurs on the node that prevents the workload from starting. In rare cases this may indicate a new version of a cluster component cannot start due to a bug or configuration error. Assess the pods for this deployment to verify they are running on healthy nodes and then contact support.', 'runbook_url': 'https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/KubeDeploymentReplicasMismatch.md', 'summary': 'Deployment has not matched the expected number of replicas'}, 'state': 'firing', 'activeAt': '2024-09-03T11:49:30.244714658Z', 'value': '1e+00'}, {'labels': {'alertname': 'KubeDeploymentReplicasMismatch', 'container': 'kube-rbac-proxy-main', 'deployment': 'rook-ceph-exporter-j-299zi3c33-uo-dxg78-worker-eastus3-pwx92', 'endpoint': 'https-main', 'job': 'kube-state-metrics', 'namespace': 'openshift-storage', 'service': 'kube-state-metrics', 'severity': 'warning'}, 'annotations': {'description': 'Deployment openshift-storage/rook-ceph-exporter-j-299zi3c33-uo-dxg78-worker-eastus3-pwx92 has not matched the expected number of replicas for longer than 15 minutes. This indicates that cluster infrastructure is unable to start or restart the necessary components. This most often occurs when one or more nodes are down or partioned from the cluster, or a fault occurs on the node that prevents the workload from starting. In rare cases this may indicate a new version of a cluster component cannot start due to a bug or configuration error. Assess the pods for this deployment to verify they are running on healthy nodes and then contact support.', 'runbook_url': 'https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/KubeDeploymentReplicasMismatch.md', 'summary': 'Deployment has not matched the expected number of replicas'}, 'state': 'firing', 'activeAt': '2024-09-03T11:49:30.244714658Z', 'value': '1e+00'}, {'labels': {'alertname': 'KubePodNotScheduled', 'container': 'kube-rbac-proxy-main', 'endpoint': 'https-main', 'job': 'kube-state-metrics', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mon-a-5bdfbb987d-w6gp7', 'service': 'kube-state-metrics', 'severity': 'warning', 'uid': '54507170-a39a-4fa2-8076-0bfd7322441b'}, 'annotations': {'description': 'Pod openshift-storage/rook-ceph-mon-a-5bdfbb987d-w6gp7 cannot be scheduled for more than 30 minutes.\nCheck the details of the pod with the following command:\noc describe -n openshift-storage pod rook-ceph-mon-a-5bdfbb987d-w6gp7', 'summary': 'Pod cannot be scheduled.'}, 'state': 'pending', 'activeAt': '2024-09-03T23:11:30.244714658Z', 'value': '1e+00'}, {'labels': {'alertname': 'KubePodNotScheduled', 'container': 'kube-rbac-proxy-main', 'endpoint': 'https-main', 'job': 'kube-state-metrics', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-crashcollector-d66507c920c0f24e37de2ecf0e2ace0d-f4ztp', 'service': 'kube-state-metrics', 'severity': 'warning', 'uid': 'b292040d-975a-48f8-8d70-c43cbacad853'}, 'annotations': {'description': 'Pod openshift-storage/rook-ceph-crashcollector-d66507c920c0f24e37de2ecf0e2ace0d-f4ztp cannot be scheduled for more than 30 minutes.\nCheck the details of the pod with the following command:\noc describe -n openshift-storage pod rook-ceph-crashcollector-d66507c920c0f24e37de2ecf0e2ace0d-f4ztp', 'summary': 'Pod cannot be scheduled.'}, 'state': 'firing', 'activeAt': '2024-09-03T11:49:30.244714658Z', 'value': '1e+00'}, {'labels': {'alertname': 'KubePodNotScheduled', 'container': 
'kube-rbac-proxy-main', 'endpoint': 'https-main', 'job': 'kube-state-metrics', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-exporter-j-299zi3c33-uo-dxg78-worker-eastus3-pwx25p4k', 'service': 'kube-state-metrics', 'severity': 'warning', 'uid': 'a03c9a4f-86c3-4ecd-b6df-b3a27a966fef'}, 'annotations': {'description': 'Pod openshift-storage/rook-ceph-exporter-j-299zi3c33-uo-dxg78-worker-eastus3-pwx25p4k cannot be scheduled for more than 30 minutes.\nCheck the details of the pod with the following command:\noc describe -n openshift-storage pod rook-ceph-exporter-j-299zi3c33-uo-dxg78-worker-eastus3-pwx25p4k', 'summary': 'Pod cannot be scheduled.'}, 'state': 'firing', 'activeAt': '2024-09-03T11:49:30.244714658Z', 'value': '1e+00'}, {'labels': {'alertname': 'CannotRetrieveUpdates', 'namespace': 'openshift-cluster-version', 'severity': 'warning'}, 'annotations': {'description': 'Failure to retrieve updates means that cluster administrators will need to monitor for available updates on their own or risk falling behind on security or other bugfixes. If the failure is expected, you can clear spec.channel in the ClusterVersion object to tell the cluster-version operator to not retrieve updates. Failure reason VersionNotFound . For more information refer to `oc get clusterversion/version -o=jsonpath="{.status.conditions[?(.type==\'RetrievedUpdates\')]}{\'\\n\'}"` or https://console-openshift-console.apps.j-299zi3c33-uo.azure.qe.rh-ocs.com/settings/cluster/.', 'summary': 'Cluster version operator has not retrieved updates in 12h 44m 40s.'}, 'state': 'firing', 'activeAt': '2024-09-03T11:48:04.791380633Z', 'value': '4.5880790999889374e+04'}, {'labels': {'alertname': 'Watchdog', 'namespace': 'openshift-monitoring', 'severity': 'none'}, 'annotations': {'description': 'This is an alert meant to ensure that the entire alerting pipeline is functional.\nThis alert is always firing, therefore it should always be firing in Alertmanager\nand always fire against a receiver. There are integrations with various notification\nmechanisms that send a notification when this alert is not firing. For example the\n"DeadMansSnitch" integration in PagerDuty.\n', 'summary': 'An alert that should always be firing to certify that Alertmanager is working properly.'}, 'state': 'firing', 'activeAt': '2024-09-03T11:47:43.408522273Z', 'value': '1e+00'}]
2024-09-04 01:15:55  19:15:48 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig -n openshift-storage get Pod  -n openshift-storage --selector=app=rook-ceph-tools -o yaml
2024-09-04 01:15:55  19:15:48 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig -n openshift-storage get Pod  -n openshift-storage --selector=app=rook-ceph-tools -o yaml
2024-09-04 01:15:55  19:15:48 - MainThread - ocs_ci.ocs.resources.pod - INFO  - These are the ceph tool box pods: ['rook-ceph-tools-6ff667d74-x4p2l']
2024-09-04 01:15:55  19:15:48 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig -n openshift-storage get Pod rook-ceph-tools-6ff667d74-x4p2l -n openshift-storage
2024-09-04 01:15:55  19:15:49 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig -n openshift-storage get Pod  -n openshift-storage -o yaml
2024-09-04 01:15:55  19:15:52 - MainThread - ocs_ci.ocs.resources.pod - INFO  - Pod name: rook-ceph-tools-6ff667d74-x4p2l
2024-09-04 01:15:55  19:15:52 - MainThread - ocs_ci.ocs.resources.pod - INFO  - Pod status: Running
2024-09-04 01:15:55  19:15:52 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage rsh rook-ceph-tools-6ff667d74-x4p2l ceph health
2024-09-04 01:15:55  19:15:53 - MainThread - ocs_ci.utility.retry - WARNING  - Ceph cluster health is not OK. Health: HEALTH_WARN 1/3 mons down, quorum b,c
2024-09-04 01:15:55  , Retrying in 30 seconds...

I see that the health check doesn't run at the start as it is supposed to, but only in teardown.
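For comparison, a yield-style fixture runs its pre-`yield` body in setup and the code after `yield` in teardown; a hedged sketch of a check covering both phases could look like this (helper names assumed, not the current ocs-ci code):

```python
# Illustrative sketch, assuming the same ceph_health_check helper as above.
import pytest

from ocs_ci.utility.utils import ceph_health_check  # assumed helper


@pytest.fixture
def ceph_health_gate():
    # Setup phase: skip early if the cluster is already unhealthy,
    # so a broken cluster does not show up as a test failure.
    try:
        ceph_health_check(tries=3, delay=30)
    except Exception as err:
        pytest.skip(f"Ceph cluster health is not OK before the test: {err}")

    yield

    # Teardown phase: surface problems introduced by the test itself.
    ceph_health_check(tries=10, delay=30)
```

In the log above only the teardown half is visible (the `ceph health` rsh call after PASSED), which matches the symptom that the setup-phase skip never fires.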
