
Support sharding the CSI volumes OMAP, to overcome large (>200k) volume counts per pool/cephfs instance #818

Closed
ShyamsundarR opened this issue Feb 4, 2020 · 3 comments
Labels
wontfix This will not be worked on

Comments

@ShyamsundarR
Contributor

When Ceph CSI started maintaining its journal in RADOS, one of the concerns raised [1] was around the scale limits of a single RADOS object (which, as of this writing, is ~200k keys per object).

The CSI volumes directory is a single RADOS object per pool (the csi.volumes.<InstanceID> object) that contains one key per currently in-use volume. At the keys-per-object scale limit, meaning ~200k images per pool (or per CephFS instance, as we use the CephFS metadata pool to store the same), we would need to shard this object to overcome the RADOS per-object key limits.
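For illustration, a minimal sketch of how one key per volume ends up in that single per-pool directory object, using go-ceph; the object name "csi.volumes.default", the "csi.volume." key prefix, the pool name, and the helper function are assumptions for this example, not the actual ceph-csi journal code:

```go
package main

import (
	"fmt"

	"github.com/ceph/go-ceph/rados"
)

// registerVolume adds one omap key per in-use volume to a single RADOS
// object; every volume in the pool lands in this one object, which is what
// eventually runs into the ~200k keys-per-object limit.
func registerVolume(ioctx *rados.IOContext, reqName, volUUID string) error {
	return ioctx.SetOmap("csi.volumes.default", map[string][]byte{
		"csi.volume." + reqName: []byte(volUUID),
	})
}

func main() {
	conn, err := rados.NewConn()
	if err != nil {
		panic(err)
	}
	if err := conn.ReadDefaultConfigFile(); err != nil {
		panic(err)
	}
	if err := conn.Connect(); err != nil {
		panic(err)
	}
	defer conn.Shutdown()

	// Pool name is an assumption for the sketch.
	ioctx, err := conn.OpenIOContext("replicapool")
	if err != nil {
		panic(err)
	}
	defer ioctx.Destroy()

	if err := registerVolume(ioctx, "pvc-1234", "5678-uuid"); err != nil {
		panic(err)
	}
	fmt.Println("volume registered in the CSI volumes directory object")
}
```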

This is a future concern rather than a medium-term problem, but noting it down so it is not missed as a reference when work is taken up on it.

Also noting this down due to the discussion around key counts in this PR with @dillaman.

[1] Older discussion references on sharding the CSI directory object:

@ShyamsundarR
Contributor Author

For the snapshot OMap directory, as there can be multiple snapshots per volume, we may reach the RADOS per-object key limit sooner.

There was a fleeting desire to create a snapshot OMap directory named per parent-image UUID, but that would not catch snapshot name collisions across different parent UUIDs, which the CSI plugins require.

For the snapshot OMap as well, we would need some form of sharding; @dillaman had some further thoughts on this, which I am capturing verbatim below:

RGW "solved" this issue in the past by first using sharding across a
fixed number of objects (i.e. hash the name and pick the destination
omap index object by modulo the number of objects). The downside to
that approach was that it required the user to pick the expected
maximum number of objects prior to establishing the cluster (or do an
offline reshard). Since RGW needs to always be up and it can expect to
continuously grow, they then switched to dynamic sharding [1] to
permit growth into the hundreds of millions of indexed objects.

I realistically would never expect the CSI to need the extra
complexity of something like dynamic sharding. However, you could
implement a backwards compatible fixed sharding scheme in the future
where the CSI driver sets the shard object upper limit (power of two)
and it recursively searches for a hit in decreasing shard objects
upper limits until it finds a hit. If and when it finds a hit, it
should move it to the correct shard so that future accesses don't need
to perform the search and it helps to reduce pressure on the "older"
objects.

[1] https://docs.ceph.com/docs/mimic/radosgw/dynamicresharding/
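To make the fixed-sharding idea above concrete, here is a hedged Go sketch of a backwards-compatible scheme: keys hash to one of a power-of-two number of shard objects, and lookups fall back through decreasing shard counts down to the legacy unsharded object. The function names, the FNV hash choice, and the `exists` lookup callback are hypothetical, not part of ceph-csi:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardObject returns the directory object name for a key given the
// configured shard count; numShards == 1 maps to the legacy unsharded object.
func shardObject(base, key string, numShards uint32) string {
	if numShards <= 1 {
		return base // e.g. "csi.volumes.default"
	}
	h := fnv.New32a()
	h.Write([]byte(key))
	return fmt.Sprintf("%s.%d", base, h.Sum32()%numShards)
}

// lookupKey searches starting at the configured (power of two) shard count
// and halves it on each miss, so entries written before a reshard are still
// found. exists is a stand-in for an omap key lookup against RADOS.
func lookupKey(base, key string, maxShards uint32,
	exists func(object, key string) bool) (string, bool) {
	for n := maxShards; n >= 1; n /= 2 {
		obj := shardObject(base, key, n)
		if exists(obj, key) {
			// A real implementation would migrate the key to
			// shardObject(base, key, maxShards) here, so future
			// lookups skip this fallback search and pressure on
			// the "older" objects is reduced.
			return obj, true
		}
	}
	return "", false
}

func main() {
	// Toy in-memory stand-in for the RADOS omap directory objects.
	store := map[string]map[string]bool{
		"csi.volumes.default": {"csi.volume.pvc-old": true}, // pre-sharding entry
	}
	exists := func(object, key string) bool { return store[object][key] }

	if obj, ok := lookupKey("csi.volumes.default", "csi.volume.pvc-old", 8, exists); ok {
		fmt.Println("found in", obj)
	}
}
```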

@stale

stale bot commented Oct 4, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix (This will not be worked on) label on Oct 4, 2020
@stale

stale bot commented Oct 12, 2020

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

stale bot closed this as completed on Oct 12, 2020