From 7abd18938b70fa7a4ebc8b6566d5d9c03abfcfa3 Mon Sep 17 00:00:00 2001 From: Lenny Chen Date: Mon, 8 Apr 2024 17:14:48 -0700 Subject: [PATCH 1/6] docs: add rook-ceph known issue --- docs/docs-content/integrations/rook-ceph.md | 83 +++++++++++++++++++++ docs/docs-content/release-notes.md | 16 +++- 2 files changed, 95 insertions(+), 4 deletions(-) diff --git a/docs/docs-content/integrations/rook-ceph.md b/docs/docs-content/integrations/rook-ceph.md index 130840ea75..b4ea6ce3aa 100644 --- a/docs/docs-content/integrations/rook-ceph.md +++ b/docs/docs-content/integrations/rook-ceph.md @@ -121,6 +121,13 @@ clusters. 4. Use the password you receive in the output with the username `admin` to log in to the Ceph Dashboard. +### Known Issues + +- If a cluster experiences network issues, it's possible for the file mount to become unavailable. This a known issue + disclosed in the [Rook GitHub repository](https://github.com/rook/rook/issues/13818). Refer to the + [Troubleshooting section](#file-mount-becomes-unavailable-after-cluster-experiences-network-issues) for a workaround + if you observe this issue in your cluster. + @@ -322,6 +329,82 @@ improvements. +## Troubleshooting + +### File Mount Becomes Unavailable after Cluster Experiences Network Issues + +A known issue exists with Rook-Ceph that if your cluster experiences network issues, file mount becomes unavailable even +after the network is restored. This is currently an open issue with Rook. If you run into this issue, follow these steps +for a workaround. + +#### Debug Steps + +1. One way to debug is to reboot the node that is experiencing the issues. If you are unable to reboot the node, or if + rebooting the node does not fix the issue, continue to the following steps. + +2. Connect to your cluster via the command-line. For more information, refer to + [Access Cluster with CLI](/docs/docs-content/clusters/cluster-management/palette-webctl.md). + +3. Issue the following command to identify Persistent Volume Claims (PVC) from Ceph File System (FS): + + ```shell + kubectl get pvc -all | grep "cephFS" + ``` + +4. Scale down all workloads, including pods, deployments, and StatefulSets using the PVC to zero. Ensure that all + workloads must be scaled down. Even if one pod remains that uses the PVC, this workaround wil not work. + + + + + + To scale down a pod, delete it. + + ```shell + kubectl delete pods pod-name + ``` + + + + + + To scale down a StatefulSet, use the following command. Replace `statefulset-name` with the name of the StatefulSet. + + ```shell + kubectl scale statefulset statefulset-name --replicas=0 + ``` + + + + + + To scale down a deployment, use the following command. Replace `deployment-name` with the name of the deployment. + + ```shell + kubectl scale deployment deployment-name --replicas=0 + ``` + + + + + + :::tip + + If you do not know which workloads use the PVC, you can start by getting a list of all pods that are using PVCs and + their PVC names with the following command. + + ```shell + kubectl get pods --all-namespaces -o=json | jq -c '.items[] | {name: .metadata.name, namespace: .metadata.namespace, claimName: .spec | select( has ("volumes") ).volumes[] | select( has ("persistentVolumeClaim") ).persistentVolumeClaim.claimName }' + ``` + + You can then find workloads that are associated with the pods and scale them down to zero. + + ::: + +5. Once all the workloads are scaled down, this will trigger a unmount and fresh mount of cephFS volumes. + +6. Scale the workloads back to their original state. 
+ ## Terraform ```tf diff --git a/docs/docs-content/release-notes.md b/docs/docs-content/release-notes.md index 252c45118b..3d639cd23f 100644 --- a/docs/docs-content/release-notes.md +++ b/docs/docs-content/release-notes.md @@ -86,10 +86,10 @@ the following sections for a complete list of features, improvements, and known through Palette CLI will be eligible for a cluster profile update. We recommend you review the [Upgrade a PCG](./clusters/pcg/manage-pcg/pcg-upgrade.md) guide to learn more about updating a PCG. -- Self-hosted Palette instances now use Kubernetes version 1.27.11. This new version of Kubernetes will cause node repave - events during the upgrade process. If you have multiple self-hosted Palette instances in a VMware environment, take a - moment and review the [Known Issues](#known-issues) section below for potential issues that may arise during the - upgrade process. +- Self-hosted Palette instances now use Kubernetes version 1.27.11. This new version of Kubernetes will cause node + repave events during the upgrade process. If you have multiple self-hosted Palette instances in a VMware environment, + take a moment and review the [Known Issues](#known-issues) section below for potential issues that may arise during + the upgrade process. #### Known Issues @@ -169,6 +169,14 @@ the following sections for a complete list of features, improvements, and known [Harbor Edge](./integrations/harbor-edge.md#enable-image-download-from-outside-of-harbor) reference page to learn more about the feature. +#### Known issues + +- If a cluster that uses the Rook-Ceph pack experiences network issues, it's possible for the file mount to become + unavailable. This a known issue disclosed in the [Rook GitHub repository](https://github.com/rook/rook/issues/13818). + To resolve this issue, refer to + [Rook-Ceph](./integrations/rook-ceph.md#file-mount-becomes-unavailable-after-cluster-experiences-network-issues) pack + documentation. + ### Virtual Machine Orchestrator (VMO) #### Improvements From 6ce438dd597030295c76e190506c3cc891896fb2 Mon Sep 17 00:00:00 2001 From: Lenny Chen Date: Tue, 9 Apr 2024 15:37:46 -0700 Subject: [PATCH 2/6] docs: fix vale issues --- docs/docs-content/integrations/rook-ceph.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/docs/docs-content/integrations/rook-ceph.md b/docs/docs-content/integrations/rook-ceph.md index b4ea6ce3aa..d4487549db 100644 --- a/docs/docs-content/integrations/rook-ceph.md +++ b/docs/docs-content/integrations/rook-ceph.md @@ -333,9 +333,8 @@ improvements. ### File Mount Becomes Unavailable after Cluster Experiences Network Issues -A known issue exists with Rook-Ceph that if your cluster experiences network issues, file mount becomes unavailable even -after the network is restored. This is currently an open issue with Rook. If you run into this issue, follow these steps -for a workaround. +A known issue exists with Rook-Ceph that if your cluster experiences network issues, file mount becomes unavailable and +remains unavailable even after the network is restored. #### Debug Steps @@ -345,14 +344,13 @@ for a workaround. 2. Connect to your cluster via the command-line. For more information, refer to [Access Cluster with CLI](/docs/docs-content/clusters/cluster-management/palette-webctl.md). -3. Issue the following command to identify Persistent Volume Claims (PVC) from Ceph File System (FS): +3. Issue the following command to identify Persistent Volume Claims (PVC) from Ceph File System (FS). 
   ```shell
-   kubectl get pvc -all | grep "cephFS"
+   kubectl get pvc --all-namespaces | grep "cephFS"
   ```

-4. Scale down all workloads, including pods, deployments, and StatefulSets using the PVC to zero. Ensure that all
-   workloads must be scaled down. Even if one pod remains that uses the PVC, this workaround wil not work.
+4. Scale down all workloads, including pods, deployments, and StatefulSets using the PVC to zero.

@@ -394,14 +392,16 @@ for a workaround.
   their PVC names with the following command.

   ```shell
-   kubectl get pods --all-namespaces -o=json | jq -c '.items[] | {name: .metadata.name, namespace: .metadata.namespace, claimName: .spec | select( has ("volumes") ).volumes[] | select( has ("persistentVolumeClaim") ).persistentVolumeClaim.claimName }'
+   kubectl get pods --all-namespaces --output=json | jq '.items[] | {name: .metadata.name, namespace: .metadata.namespace, claimName: .spec | select( has ("volumes") ).volumes[] | select( has ("persistentVolumeClaim") ).persistentVolumeClaim.claimName }'
   ```

   You can then find workloads that are associated with the pods and scale them down to zero.

   :::

-5. Once all the workloads are scaled down, this will trigger a unmount and fresh mount of cephFS volumes.
+5. Once all the workloads are scaled down, this will trigger a unmount and fresh mount of cephFS volumes. Ensure that
+   all workloads are scaled down to zero. Even if one pod remains that uses the PVC, the unmount will not happen and the
+   issue will not be resolved.

 6. Scale the workloads back to their original state.

From 0e5dc9403425fc7d699e967d279ad0967e6ea1c7 Mon Sep 17 00:00:00 2001
From: Lenny Chen
Date: Tue, 9 Apr 2024 16:10:53 -0700
Subject: [PATCH 3/6] minor edit

---
 docs/docs-content/integrations/rook-ceph.md | 21 +++++++++++++++++++--
 docs/docs-content/release-notes.md          |  4 ++--
 2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/docs/docs-content/integrations/rook-ceph.md b/docs/docs-content/integrations/rook-ceph.md
index d4487549db..2c6e07f694 100644
--- a/docs/docs-content/integrations/rook-ceph.md
+++ b/docs/docs-content/integrations/rook-ceph.md
@@ -123,8 +123,9 @@ clusters.

 ### Known Issues

-- If a cluster experiences network issues, it's possible for the file mount to become unavailable. This a known issue
-  disclosed in the [Rook GitHub repository](https://github.com/rook/rook/issues/13818). Refer to the
+- If a cluster experiences network issues, it's possible for the file mount to become unavailable and it will remain
+  unavailable even after network is restored. This a known issue disclosed in the
+  [Rook GitHub repository](https://github.com/rook/rook/issues/13818). Refer to the
   [Troubleshooting section](#file-mount-becomes-unavailable-after-cluster-experiences-network-issues) for a workaround
   if you observe this issue in your cluster.

@@ -223,6 +224,14 @@ clusters.

 4. Use the password you receive in the output with the username `admin` to log in to the Ceph Dashboard.

+### Known Issues
+
+- If a cluster experiences network issues, it's possible for the file mount to become unavailable and it will remain
+  unavailable even after network is restored. This a known issue disclosed in the
+  [Rook GitHub repository](https://github.com/rook/rook/issues/13818). Refer to the
+  [Troubleshooting section](#file-mount-becomes-unavailable-after-cluster-experiences-network-issues) for a workaround
+  if you observe this issue in your cluster.
+

@@ -318,6 +327,14 @@ clusters.

 4. Use the password you receive in the output with the username `admin` to log in to the Ceph Dashboard.

+### Known Issues
+
+- If a cluster experiences network issues, it's possible for the file mount to become unavailable and it will remain
+  unavailable even after network is restored. This a known issue disclosed in the
+  [Rook GitHub repository](https://github.com/rook/rook/issues/13818). Refer to the
+  [Troubleshooting section](#file-mount-becomes-unavailable-after-cluster-experiences-network-issues) for a workaround
+  if you observe this issue in your cluster.
+

diff --git a/docs/docs-content/release-notes.md b/docs/docs-content/release-notes.md
index 3d639cd23f..b0d8257300 100644
--- a/docs/docs-content/release-notes.md
+++ b/docs/docs-content/release-notes.md
@@ -172,8 +172,8 @@ the following sections for a complete list of features, improvements, and known
 #### Known issues

 - If a cluster that uses the Rook-Ceph pack experiences network issues, it's possible for the file mount to become
-  unavailable. This a known issue disclosed in the [Rook GitHub repository](https://github.com/rook/rook/issues/13818).
-  To resolve this issue, refer to
+  unavailable and will remain unavailable even after the network is restored. This is a known issue disclosed in the
+  [Rook GitHub repository](https://github.com/rook/rook/issues/13818). To resolve this issue, refer to
   [Rook-Ceph](./integrations/rook-ceph.md#file-mount-becomes-unavailable-after-cluster-experiences-network-issues) pack
   documentation.

From 239879603c518672078c663f55788568ec8840b5 Mon Sep 17 00:00:00 2001
From: Lenny Chen <55669665+lennessyy@users.noreply.github.com>
Date: Wed, 10 Apr 2024 13:18:21 -0700
Subject: [PATCH 4/6] Apply suggestions from code review

Co-authored-by: Karl Cardenas
---
 docs/docs-content/integrations/rook-ceph.md | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/docs/docs-content/integrations/rook-ceph.md b/docs/docs-content/integrations/rook-ceph.md
index 2c6e07f694..7120cc655d 100644
--- a/docs/docs-content/integrations/rook-ceph.md
+++ b/docs/docs-content/integrations/rook-ceph.md
@@ -123,8 +123,8 @@ clusters.

 ### Known Issues

-- If a cluster experiences network issues, it's possible for the file mount to become unavailable and it will remain
-  unavailable even after network is restored. This a known issue disclosed in the
+- If a cluster experiences network issues, it's possible for the file mount to become unavailable and remain
+  unavailable even after the network is restored. This a known issue disclosed in the
   [Rook GitHub repository](https://github.com/rook/rook/issues/13818). Refer to the
   [Troubleshooting section](#file-mount-becomes-unavailable-after-cluster-experiences-network-issues) for a workaround
   if you observe this issue in your cluster.
@@ -226,8 +226,8 @@ clusters.

 ### Known Issues

-- If a cluster experiences network issues, it's possible for the file mount to become unavailable and it will remain
-  unavailable even after network is restored. This a known issue disclosed in the
+- If a cluster experiences network issues, it's possible for the file mount to become unavailable and remain
+  unavailable even after the network is restored. This a known issue disclosed in the
   [Rook GitHub repository](https://github.com/rook/rook/issues/13818). Refer to the
   [Troubleshooting section](#file-mount-becomes-unavailable-after-cluster-experiences-network-issues) for a workaround
   if you observe this issue in your cluster.
@@ -329,8 +329,8 @@ clusters.
 ### Known Issues

-- If a cluster experiences network issues, it's possible for the file mount to become unavailable and it will remain
-  unavailable even after network is restored. This a known issue disclosed in the
+- If a cluster experiences network issues, it's possible for the file mount to become unavailable and remain
+  unavailable even after the network is restored. This a known issue disclosed in the
   [Rook GitHub repository](https://github.com/rook/rook/issues/13818). Refer to the
   [Troubleshooting section](#file-mount-becomes-unavailable-after-cluster-experiences-network-issues) for a workaround
   if you observe this issue in your cluster.
@@ -350,8 +350,7 @@ improvements.

 ### File Mount Becomes Unavailable after Cluster Experiences Network Issues

-A known issue exists with Rook-Ceph that if your cluster experiences network issues, file mount becomes unavailable and
-remains unavailable even after the network is restored.
+A known issue exists with Rook-Ceph where file mounts become unavailable and remain unavailable even after network issues are resolved.

 #### Debug Steps

@@ -416,7 +415,7 @@ remains unavailable even after the network is restored.

 :::

-5. Once all the workloads are scaled down, this will trigger a unmount and fresh mount of cephFS volumes. Ensure that
+5. Once all the workloads are scaled down, all existing volume mounts will be unmounted, followed by fresh new mounts of cephFS volumes. Ensure that
    all workloads are scaled down to zero. Even if one pod remains that uses the PVC, the unmount will not happen and the
    issue will not be resolved.

 6. Scale the workloads back to their original state.

From 2a43f5d6de487cbf471c2acec434e8ce229ed50c Mon Sep 17 00:00:00 2001
From: Lenny Chen
Date: Thu, 11 Apr 2024 11:51:33 -0700
Subject: [PATCH 5/6] implement suggestions

---
 docs/docs-content/integrations/rook-ceph.md | 46 +++++++--------------
 1 file changed, 16 insertions(+), 30 deletions(-)

diff --git a/docs/docs-content/integrations/rook-ceph.md b/docs/docs-content/integrations/rook-ceph.md
index 7120cc655d..5bcb0a39c6 100644
--- a/docs/docs-content/integrations/rook-ceph.md
+++ b/docs/docs-content/integrations/rook-ceph.md
@@ -123,8 +123,8 @@ clusters.

 ### Known Issues

-- If a cluster experiences network issues, it's possible for the file mount to become unavailable and remain
-  unavailable even after the network is restored. This a known issue disclosed in the
+- If a cluster experiences network issues, it's possible for the file mount to become unavailable and remain unavailable
+  even after the network is restored. This is a known issue disclosed in the
   [Rook GitHub repository](https://github.com/rook/rook/issues/13818). Refer to the
   [Troubleshooting section](#file-mount-becomes-unavailable-after-cluster-experiences-network-issues) for a workaround
   if you observe this issue in your cluster.
@@ -226,8 +226,8 @@ clusters.

 ### Known Issues

-- If a cluster experiences network issues, it's possible for the file mount to become unavailable and remain
-  unavailable even after the network is restored. This a known issue disclosed in the
+- If a cluster experiences network issues, it's possible for the file mount to become unavailable and remain unavailable
+  even after the network is restored. This is a known issue disclosed in the
   [Rook GitHub repository](https://github.com/rook/rook/issues/13818). Refer to the
   [Troubleshooting section](#file-mount-becomes-unavailable-after-cluster-experiences-network-issues) for a workaround
   if you observe this issue in your cluster.
@@ -329,8 +329,8 @@ clusters.
### Known Issues -- If a cluster experiences network issues, it's possible for the file mount to become unavailable and remain - unavailable even after the network is restored. This a known issue disclosed in the +- If a cluster experiences network issues, it's possible for the file mount to become unavailable and remain unavailable + even after the network is restored. This a known issue disclosed in the [Rook GitHub repository](https://github.com/rook/rook/issues/13818). Refer to the [Troubleshooting section](#file-mount-becomes-unavailable-after-cluster-experiences-network-issues) for a workaround if you observe this issue in your cluster. @@ -350,7 +350,8 @@ improvements. ### File Mount Becomes Unavailable after Cluster Experiences Network Issues -A known issue exists with Rook-Ceph where file mounts become unavailable and remain unavailable even after network issues are resolved. +A known issue exists with Rook-Ceph where file mounts become unavailable and remain unavailable even after network +issues are resolved. #### Debug Steps @@ -368,40 +369,25 @@ A known issue exists with Rook-Ceph where file mounts become unavailable and rem 4. Scale down all workloads, including pods, deployments, and StatefulSets using the PVC to zero. - - - - - To scale down a pod, delete it. + To scale down a deployment, use the following command. Replace `deployment-name` with the name of the deployment. ```shell - kubectl delete pods pod-name + kubectl scale deployment deployment-name --replicas=0 ``` - - - - To scale down a StatefulSet, use the following command. Replace `statefulset-name` with the name of the StatefulSet. ```shell kubectl scale statefulset statefulset-name --replicas=0 ``` - - - - - To scale down a deployment, use the following command. Replace `deployment-name` with the name of the deployment. + To scale down a pod, delete it. Make sure you delete the deployments and StatefulSets first. If a pod belongs to a + StatefulSet or a deployment, it will simply be recreated. ```shell - kubectl scale deployment deployment-name --replicas=0 + kubectl delete pods pod-name ``` - - - - :::tip If you do not know which workloads use the PVC, you can start by getting a list of all pods that are using PVCs and @@ -415,9 +401,9 @@ A known issue exists with Rook-Ceph where file mounts become unavailable and rem ::: -5. Once all the workloads are scaled down, all existing volume mounts will be unmounted, followed by fresh new mounts of cephFS volumes. Ensure that - all workloads are scaled down to zero. Even if one pod remains that uses the PVC, the unmount will not happen and the - issue will not be resolved. +5. Once all the workloads are scaled down, all existing volume mounts will be unmounted, followed by fresh new mounts of + cephFS volumes. Ensure that all workloads are scaled down to zero. Even if one pod remains that uses the PVC, the + unmount will not happen and the issue will not be resolved. 6. Scale the workloads back to their original state. 
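As an illustration of what scaling the workloads back to their original state can look like, the following sketch assumes `deployment-name` is a placeholder for one of your workloads and that its replica count was recorded before it was scaled down; adjust the names and counts for your cluster.

```shell
# Record the current replica count before scaling the workload down to zero.
kubectl get deployment deployment-name --output=jsonpath='{.spec.replicas}'

# Scale the workload down so its CephFS volume can be unmounted.
kubectl scale deployment deployment-name --replicas=0

# Once every workload that uses the PVC is at zero and the volumes have been
# remounted, restore the workload to the replica count recorded earlier
# (the value 2 is only an example).
kubectl scale deployment deployment-name --replicas=2
```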
From 4b40768cd5b83276b87951b826ab22e538568b50 Mon Sep 17 00:00:00 2001 From: Lenny Chen Date: Thu, 11 Apr 2024 11:52:07 -0700 Subject: [PATCH 6/6] add meta description --- docs/docs-content/integrations/rook-ceph.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/docs-content/integrations/rook-ceph.md b/docs/docs-content/integrations/rook-ceph.md index 5bcb0a39c6..c38b44da4a 100644 --- a/docs/docs-content/integrations/rook-ceph.md +++ b/docs/docs-content/integrations/rook-ceph.md @@ -1,7 +1,9 @@ --- sidebar_label: "rook-ceph" title: "Rook Ceph" -description: "Rook Ceph storage pack in Spectro Cloud" +description: "Rook is an open-source cloud-native storage orchestrator that provides the platform, framework, and support for Ceph +storage to natively integrate with cloud-native environments. Ceph is a distributed storage system that provides file, +block, and object storage and is deployed in large-scale production clusters. This page talks about how to use the Rook Ceph storage pack in Spectro Cloud" hide_table_of_contents: true type: "integration" category: ["storage", "amd64"]
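After the workloads from the troubleshooting steps above are scaled back up, a quick way to confirm that the CephFS volumes are healthy again is to check the PVCs and, if the Rook toolbox is deployed, the overall Ceph status. This is only a sketch: the `rook-ceph` namespace and the `rook-ceph-tools` deployment name are the Rook defaults and may differ in your cluster.

```shell
# Confirm the CephFS-backed PVCs report a Bound status.
kubectl get pvc --all-namespaces

# Check overall Ceph health from the Rook toolbox, if it is installed.
kubectl --namespace rook-ceph exec deploy/rook-ceph-tools -- ceph status
```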