
Need pvc namespace passed to CSI driver #170

Closed
lpabon opened this issue Nov 21, 2018 · 16 comments · Fixed by #274

Comments

@lpabon
Member

lpabon commented Nov 21, 2018

The in-tree Portworx Kubernetes driver currently relies on determining the namespace of the PVC during creation.

There are two issues with the current implementation, neither of which passes this information during creation of a CSI volume:

1: On mounts, the sidecar containers can pass the pod.Namespace (or, more likely, the PVC name and namespace) to the CSI driver. We would like to have the same information on creation.

and

2: Secrets from the PVC's namespace are passed for every call except creation.

We need 1 to work, and we would also really like 2, so that obtaining secrets is more efficient.
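
For context, a minimal Go sketch of what the driver side could do if the provisioner forwarded PVC metadata as CreateVolume parameters. The parameter keys below are illustrative assumptions, not keys defined by the CSI spec:

package main

import "fmt"

// Illustrative parameter keys. These are assumptions about how a provisioner
// could forward PVC metadata to CreateVolume; the CSI spec itself does not
// define them.
const (
	pvcNameKey      = "csi.storage.k8s.io/pvc/name"
	pvcNamespaceKey = "csi.storage.k8s.io/pvc/namespace"
)

// pvcInfoFromParams extracts the PVC name and namespace from CreateVolume
// parameters, if the provisioner passed them along.
func pvcInfoFromParams(params map[string]string) (name, namespace string, ok bool) {
	name, nameOK := params[pvcNameKey]
	namespace, nsOK := params[pvcNamespaceKey]
	return name, namespace, nameOK && nsOK
}

func main() {
	params := map[string]string{
		pvcNameKey:      "data-cassandra-0",
		pvcNamespaceKey: "tenant-a",
	}
	if name, ns, ok := pvcInfoFromParams(params); ok {
		fmt.Printf("provisioning volume for PVC %s/%s\n", ns, name)
	}
}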

@saad-ali
Member

The request is to:

  1. Ability to set namespace for secret during provisioning
    • Need to discuss security implications with @liggitt
  2. Be able to pass pvc info (namespace, name, id) to provisioning call, like we do for mount.
    • What are the use cases?

@liggitt
Contributor

liggitt commented Nov 29, 2018

Did #69 not address using secrets in other namespaces? (documented at https://kubernetes-csi.github.io/docs/secrets-and-credentials.html)

provisionerCredentials, err := getCredentials(p.client, provisionerSecretRef)
if err != nil {
	return nil, err
}
req.Secrets = provisionerCredentials
// Resolve controller publish, node stage, node publish secret references
controllerPublishSecretRef, err := getSecretReference(controllerPublishSecretNameKey, controllerPublishSecretNamespaceKey, options.Parameters, pvName, options.PVC)
if err != nil {
	return nil, err
}
nodeStageSecretRef, err := getSecretReference(nodeStageSecretNameKey, nodeStageSecretNamespaceKey, options.Parameters, pvName, options.PVC)
if err != nil {
	return nil, err
}
nodePublishSecretRef, err := getSecretReference(nodePublishSecretNameKey, nodePublishSecretNamespaceKey, options.Parameters, pvName, options.PVC)
if err != nil {
	return nil, err
}
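
The getSecretReference call above resolves template variables such as ${pvc.namespace} in StorageClass secret parameters. As a rough illustration only (the real provisioner restricts which variables each secret type may use, per the documentation linked above), the substitution amounts to something like:

package main

import (
	"fmt"
	"strings"
)

// resolveTemplate is a rough sketch of the substitution that getSecretReference
// performs on StorageClass secret parameters. The real provisioner validates
// which variables are allowed for each secret type; this only illustrates the idea.
func resolveTemplate(tmpl, pvName, pvcName, pvcNamespace string) string {
	r := strings.NewReplacer(
		"${pv.name}", pvName,
		"${pvc.name}", pvcName,
		"${pvc.namespace}", pvcNamespace,
	)
	return r.Replace(tmpl)
}

func main() {
	// A StorageClass parameter such as "${pvc.namespace}" resolves to the
	// namespace of the PVC being provisioned.
	fmt.Println(resolveTemplate("${pvc.namespace}", "pvc-123", "data", "tenant-a"))
}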

@liggitt
Contributor

liggitt commented Nov 29, 2018

The reason secrets in the PVC namespace are not made available for paired create/delete operations is that the PVC and its namespace may not exist at deletion time.

@lpabon
Member Author

lpabon commented Dec 5, 2018

@liggitt we only need it in create. It must exist at create time, right? Or am I wrong?

@harsh-px

harsh-px commented Dec 8, 2018

Here are some use cases that we, at Portworx, have seen over the years.

  1. Generally, the storage driver needs to know the orchestrator namespace when provisioning the volume so that it can restrict the volume bits at the same level of isolation. In a multi-tenant cluster, Kubernetes relies on namespaces to provide complete isolation between users. This isolation needs to go right down to the storage layer.
  2. If the namespace is provided, the storage driver provisioning the volume can query additional metadata for the PVC using the API server (see the sketch after this list). For example, the storage driver can get the labels applied to the PVC spec and use those to perform storage-level affinities and anti-affinities, e.g. do not co-locate any volumes that have the label app=cassandra. We have seen significant use cases for this, where users want to influence storage replica placement just like pod affinity and anti-affinity affect scheduling placement. With Portworx 2.1, we are introducing volume placement strategies which will be centered around PVC labels. Examples below:
  3. Annotations on the PVCs can be used to instruct the storage driver on custom per-PVC behaviors (during creation) which may not be apt at a StorageClass level. While the goal is for the StorageClass and the PVC spec to provide all the fields sufficient for the creation of a volume, there will always be a use case where the storage driver wants to expose functionality that is not part of the official spec yet. In the past, we have used annotations to
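
A rough sketch of the label lookup in use case 2, assuming the driver is handed the PVC namespace and name at create time (which is what this issue asks for) and holds a client-go clientset with read access to PVCs (recent client-go Get signature):

package driver

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// pvcLabels fetches the labels on a PVC so a driver could honor label-based
// placement rules (for example, "do not co-locate volumes labeled app=cassandra").
// It assumes the driver already knows the PVC namespace and name, which is
// exactly the information this issue asks to have passed at create time.
func pvcLabels(ctx context.Context, cs kubernetes.Interface, namespace, name string) (map[string]string, error) {
	pvc, err := cs.CoreV1().PersistentVolumeClaims(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	return pvc.Labels, nil
}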

@harsh-px

@saad-ali I've added some use cases for this request ^^

@lingxiankong

Also, we (Catalyst Cloud) have some similar requirements; I'm not sure whether they should be merged with this one. Besides the PV name/namespace, we also need the reclaim policy of the PV passed to the CSI driver (which in turn passes it to the volume properties in the storage backend), so that as a public cloud provider we know whether we should delete the volumes in the backend when the cloud user deletes the k8s cluster.

So, could we change the issue title to something like "Need pvc/pv information passed to CSI driver"? Then we could discuss what kind of information should be passed.

@lingxiankong

For the reclaim policy, to deal with the fact that the policy can be changed at any time during a PV's lifetime, I think we need the external-provisioner to call something like an 'updateVolume' method on the CSI driver, which is not in the CSI spec yet.

@msau42
Collaborator

msau42 commented Feb 11, 2019

Potentially related: #213

@saad-ali
Member

saad-ali commented Feb 21, 2019

OK, there are two asks in this issue:

  1. Ability to reference a secret from the PVC's namespace.
  2. Be able to pass PVC info (namespace, name, id) to the provisioning call, like we do for mount.
    • I'm reading @harsh-px's comments and will respond shortly.

@lingxiankong

@saad-ali thanks for responding

@saad-ali
Member

  1. Generally, the storage driver needs to know the orchestrator namespace when provisioning the volume so that it can restrict the volume bits at the same level of isolation. In a multi-tenant cluster, Kubernetes relies on namespaces to provide complete isolation between users. This isolation needs to go right down to the storage layer.

So you record some metadata when a volume is provisioned to say it belongs to namespace foo? And then when a volume is mounted, you verify that the pod is in namespace foo? If so, why? Kubernetes does this enforcement already.

  2. If the namespace is provided, the storage driver provisioning the volume can query additional metadata for the PVC using the API server. For example, the storage driver can get the labels applied to the PVC spec and use those to perform storage-level affinities and anti-affinities, e.g. do not co-locate any volumes that have the label app=cassandra. We have seen significant use cases for this, where users want to influence storage replica placement just like pod affinity and anti-affinity affect scheduling placement. With Portworx 2.1, we are introducing volume placement strategies which will be centered around PVC labels. Examples below:

This is a great use case. But making a CSI driver reach into Kubernetes to figure this out on its own is a hack.

Kubernetes and CSI already support topology where a volume is only accessible by certain nodes in a cluster. However, Kubernetes and CSI don't provide a way to handle the case where a volume is equally accessible by all nodes but has some internal storage system topology that can influence application performance.

Rather than poking a hole in the API to make the hack easier, I would strongly suggest working with the community to come up with a generic way to influence storage-specific topology. A good place to start is the long-standing CSI issue already opened for this: container-storage-interface/spec#44.
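
For reference, a rough Go sketch of the node-accessibility topology that CreateVolume already carries today via the CSI spec's Go bindings; the storage-internal placement hints discussed above have no counterpart in this structure, which is the gap raised in the linked issue:

package driver

import (
	csi "github.com/container-storage-interface/spec/lib/go/csi"
)

// preferredSegments reads the node-accessibility topology that the CO already
// passes to CreateVolume today (for example zone/region segments). It is only
// a reference sketch: the storage-internal placement hints discussed above
// have no counterpart in this structure.
func preferredSegments(req *csi.CreateVolumeRequest) []map[string]string {
	var segments []map[string]string
	if tr := req.GetAccessibilityRequirements(); tr != nil {
		for _, t := range tr.GetPreferred() {
			segments = append(segments, t.GetSegments())
		}
	}
	return segments
}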

  3. Annotations on the PVCs can be used to instruct the storage driver on custom per-PVC behaviors (during creation) which may not be apt at a StorageClass level. While the goal is for the StorageClass and the PVC spec to provide all the fields sufficient for the creation of a volume, there will always be a use case where the storage driver wants to expose functionality that is not part of the official spec yet. In the past, we have used annotations to

Annotations on PVCs MUST NOT be passed to CSI drivers. The Kubernetes PVC object is intended for application portability. When we start leaking cluster/implementation-specific details into it, we are violating that principle. And explicitly passing PVC annotations to CSI drivers encourages that pattern.

Let's discuss the specific use cases you have in mind, and see if we can come up with better solutions for each of those use cases (for example, the use case you pointed out above) rather than opening up a hole in the API.

Also, we (Catalyst Cloud) have some similar requirements; I'm not sure whether they should be merged with this one. Besides the PV name/namespace, we also need the reclaim policy of the PV passed to the CSI driver (which in turn passes it to the volume properties in the storage backend), so that as a public cloud provider we know whether we should delete the volumes in the backend when the cloud user deletes the k8s cluster.

For the reclaim policy, to deal with the fact that the policy can be changed at any time during a PV's lifetime, I think we need the external-provisioner to call something like an 'updateVolume' method on the CSI driver, which is not in the CSI spec yet.

The Kubernetes cluster does leak resources on deletion today. That is a problem, but fixing it at the storage system layer is a hack. Cleaning up cluster resources (and ensuring PVC reclaim policy) on cluster deletion is the responsibility of Kubernetes or the Kubernetes deployment system. Please open an issue on https://github.com/kubernetes/kubernetes/issues to do the right thing at those layers.

@lingxiankong

The Kubernetes cluster does leak resources on deletion today. That is a problem, but fixing it at the storage system layer is a hack. Cleaning up cluster resources (and ensuring PVC reclaim policy) on cluster deletion is the responsibility of Kubernetes or the Kubernetes deployment system. Please open an issue on https://github.com/kubernetes/kubernetes/issues to do the right thing at those layers.

Hi @saad-ali, thanks for your reply. I'm confused about something.

Cleaning up cluster resources (and ensuring PVC reclaim policy) on cluster deletion is the responsibility of Kubernetes or the Kubernetes deployment system

I agree that resource cleanup is the responsibility of the Kubernetes deployment system, which in this case is the public cloud provider and not Kubernetes itself, because when the end user deletes a Kubernetes cluster, it is impossible for the cloud system to log into the Kubernetes cluster and delete everything automatically.

Usually, the cloud system relies on resource metadata/description/tags to identify which resources belong to the Kubernetes cluster, and that metadata/description/tags are set when the resource is created. I have some examples:

  • The load balancer created in the OpenStack cloud for a Kubernetes Service of type LoadBalancer has the cluster information in its description, see here.
  • The volume created in the cloud (by the in-tree PV controller) for the PV has PV name/namespace information in its properties, see here. However, when switching to kubernetes-csi, there is only a volName param left, indicating the PV's name. IMHO, we lost feature parity in the migration from the in-tree PV controller to CSI.

@kerneltime

kerneltime commented Apr 19, 2019

We have a use case for managing thousands of datasets that in turn can have an arbitrary number of versions.
These datasets are created and consumed within Kubernetes and have usage outside of Kubernetes as well.
The mapping of which dataset a volume refers to is currently done via the StorageClass. When creating a volume (filesystem), there is no ability to address a specific version.
Currently we use a fork to test things out.
The snapshots model also runs into similar friction, from what I can tell.
The only solution currently seems to be for a pod to directly address the volume (ephemeral volume). The creation of a new version that is written will have to be implied based on the spec passed in, versus being able to make any kind of central decision for it.

@msau42
Collaborator

msau42 commented Apr 19, 2019

@kerneltime for your use case, is your dataset read-only? Can users create a PVC based on some dataset, modify it, and persist it separately from another PVC based off of the same dataset? I'm trying to understand whether CSI ephemeral volumes can suit your use case better than PVs.

@kerneltime

There are two kinds: read-only and new datasets. If developers modify a dataset, it becomes a new version.
Ephemeral volumes should work, and for scaling the number of volumes (millions over a few months for the same etcd process) they might be the only scalable way.
