docs: add rook-ceph known issue #2601

Merged · 6 commits · Apr 11, 2024
89 changes: 88 additions & 1 deletion docs/docs-content/integrations/rook-ceph.md
@@ -1,7 +1,9 @@
---
sidebar_label: "rook-ceph"
title: "Rook Ceph"
description: "Rook Ceph storage pack in Spectro Cloud"
description: "Rook is an open-source cloud-native storage orchestrator that provides the platform, framework, and support for Ceph
storage to natively integrate with cloud-native environments. Ceph is a distributed storage system that provides file,
block, and object storage and is deployed in large-scale production clusters. This page describes how to use the Rook Ceph storage pack in Spectro Cloud"
hide_table_of_contents: true
type: "integration"
category: ["storage", "amd64"]
@@ -121,6 +123,14 @@ clusters.

4. Use the password you receive in the output with the username `admin` to log in to the Ceph Dashboard.
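   If you need to retrieve the dashboard credentials again later, a command along the following lines typically works. It is a minimal sketch that assumes Rook's default `rook-ceph` namespace and the default `rook-ceph-dashboard-password` secret name; adjust both if your installation differs.

   ```shell
   # Decode the auto-generated admin password from the default dashboard secret.
   kubectl --namespace rook-ceph get secret rook-ceph-dashboard-password \
     --output jsonpath="{['data']['password']}" | base64 --decode && echo
   ```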

### Known Issues

- If a cluster experiences network issues, it's possible for the file mount to become unavailable and remain unavailable
even after the network is restored. This is a known issue disclosed in the
[Rook GitHub repository](https://github.com/rook/rook/issues/13818). Refer to the
[Troubleshooting section](#file-mount-becomes-unavailable-after-cluster-experiences-network-issues) for a workaround
if you observe this issue in your cluster.

</TabItem>

<TabItem label="1.11.x" value="1.11.x">
@@ -216,6 +226,14 @@ clusters.

4. Use the password you receive in the output with the username `admin` to log in to the Ceph Dashboard.

### Known Issues

- If a cluster experiences network issues, it's possible for the file mount to become unavailable and remain unavailable
even after the network is restored. This is a known issue disclosed in the
[Rook GitHub repository](https://github.com/rook/rook/issues/13818). Refer to the
[Troubleshooting section](#file-mount-becomes-unavailable-after-cluster-experiences-network-issues) for a workaround
if you observe this issue in your cluster.

</TabItem>

<TabItem label="1.10.x" value="1.10.x">
@@ -311,6 +329,14 @@ clusters.

4. Use the password you receive in the output with the username `admin` to log in to the Ceph Dashboard.

### Known Issues

- If a cluster experiences network issues, it's possible for the file mount to become unavailable and remain unavailable
even after the network is restored. This is a known issue disclosed in the
[Rook GitHub repository](https://github.com/rook/rook/issues/13818). Refer to the
[Troubleshooting section](#file-mount-becomes-unavailable-after-cluster-experiences-network-issues) for a workaround
if you observe this issue in your cluster.

</TabItem>

<TabItem label="Deprecated" value="Deprecated">
@@ -322,6 +348,67 @@ improvements.

</Tabs>

## Troubleshooting

### File Mount Becomes Unavailable after Cluster Experiences Network Issues

A known issue exists with Rook-Ceph where file mounts become unavailable and remain unavailable even after network
issues are resolved.

#### Debug Steps

1. One way to resolve the issue is to reboot the node that is experiencing problems. If you are unable to reboot the
   node, or if rebooting does not fix the issue, continue with the following steps.
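
   If you do reboot, it is good practice to drain the node first so workloads are rescheduled cleanly, and to uncordon
   it afterward. A minimal sketch, using a placeholder `node-name`:

   ```shell
   # Cordon the node and evict its workloads before rebooting.
   kubectl drain node-name --ignore-daemonsets --delete-emptydir-data

   # After the node is back online, allow scheduling on it again.
   kubectl uncordon node-name
   ```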

2. Connect to your cluster via the command line. For more information, refer to
   [Access Cluster with CLI](../clusters/cluster-management/palette-webctl.md).
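
   Once your kubeconfig is set up, you can confirm that you are connected to the intended cluster before making
   changes:

   ```shell
   # Confirm the active context and that the cluster responds.
   kubectl config current-context
   kubectl get nodes --output wide
   ```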

3. Issue the following command to identify the Persistent Volume Claims (PVCs) backed by the Ceph File System (CephFS).

```shell
kubectl get pvc --all-namespaces | grep "cephFS"
```
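
   If the `grep` pattern does not return anything, note that the exact storage class name for CephFS-backed volumes
   can vary between installations (for example, `ceph-filesystem`). As an alternative, you can list every PVC with its
   storage class so the affected claims stand out:

   ```shell
   # List each PVC with its namespace and storage class.
   kubectl get pvc --all-namespaces \
     --output custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,STORAGECLASS:.spec.storageClassName'
   ```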

4. Scale down all workloads that use the PVC, including pods, deployments, and StatefulSets, to zero.

To scale down a deployment, use the following command. Replace `deployment-name` with the name of the deployment.

```shell
kubectl scale deployment deployment-name --replicas=0
```

To scale down a StatefulSet, use the following command. Replace `statefulset-name` with the name of the StatefulSet.

```shell
kubectl scale statefulset statefulset-name --replicas=0
```

To scale down a standalone pod, delete it. Make sure you scale down the deployments and StatefulSets first. If a pod
belongs to a StatefulSet or a deployment, it will simply be recreated after deletion.

```shell
kubectl delete pods pod-name
```

:::tip

If you do not know which workloads use the PVC, you can start by listing all pods that use PVCs, along with their PVC
names, using the following command.

```shell
kubectl get pods --all-namespaces --output=json | jq '.items[] | {name: .metadata.name, namespace: .metadata.namespace, claimName: .spec | select( has ("volumes") ).volumes[] | select( has ("persistentVolumeClaim") ).persistentVolumeClaim.claimName }'
```

You can then find workloads that are associated with the pods and scale them down to zero.

:::

5. Once all the workloads are scaled down, all existing volume mounts are unmounted, followed by fresh mounts of the
   CephFS volumes. Ensure that all workloads are scaled down to zero. If even one pod that uses the PVC remains, the
   unmount will not happen and the issue will not be resolved.
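
   One way to verify that nothing still holds the volume is to describe the PVC and confirm that no pods are listed as
   using it. The `pvc-name` and `namespace-name` values below are placeholders:

   ```shell
   # Once every consumer is scaled down, the PVC should report no pods using it.
   kubectl describe pvc pvc-name --namespace namespace-name
   ```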

6. Scale the workloads back to their original state.
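
   One way to do this is with the same `kubectl scale` commands used earlier, setting `--replicas` back to the value
   each workload had before the scale-down (it helps to note those counts before step 4). The workload names and the
   count of `3` below are placeholders:

   ```shell
   # Restore each workload to the replica count it had before the scale-down.
   kubectl scale deployment deployment-name --replicas=3
   kubectl scale statefulset statefulset-name --replicas=3
   ```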

## Terraform

```tf
16 changes: 12 additions & 4 deletions docs/docs-content/release-notes.md
@@ -86,10 +86,10 @@ the following sections for a complete list of features, improvements, and known
through Palette CLI will be eligible for a cluster profile update. We recommend you review the
[Upgrade a PCG](./clusters/pcg/manage-pcg/pcg-upgrade.md) guide to learn more about updating a PCG.

- Self-hosted Palette instances now use Kubernetes version 1.27.11. This new version of Kubernetes will cause node repave
events during the upgrade process. If you have multiple self-hosted Palette instances in a VMware environment, take a
moment and review the [Known Issues](#known-issues) section below for potential issues that may arise during the
upgrade process.
- Self-hosted Palette instances now use Kubernetes version 1.27.11. This new version of Kubernetes will cause node
repave events during the upgrade process. If you have multiple self-hosted Palette instances in a VMware environment,
take a moment and review the [Known Issues](#known-issues) section below for potential issues that may arise during
the upgrade process.

#### Known Issues

@@ -169,6 +169,14 @@ the following sections for a complete list of features, improvements, and known
[Harbor Edge](./integrations/harbor-edge.md#enable-image-download-from-outside-of-harbor) reference page to learn more
about the feature.

#### Known Issues

- If a cluster that uses the Rook-Ceph pack experiences network issues, it's possible for the file mount to become
  unavailable and remain unavailable even after the network is restored. This is a known issue disclosed in the
  [Rook GitHub repository](https://github.com/rook/rook/issues/13818). To resolve this issue, refer to the
  [Rook-Ceph](./integrations/rook-ceph.md#file-mount-becomes-unavailable-after-cluster-experiences-network-issues) pack
  documentation.

### Virtual Machine Orchestrator (VMO)

#### Improvements