
Support sharding the CSI volumes OMAP, to overcome large (>200k) volume counts per pool/cephfs instance #818

Closed
ShyamsundarR opened this issue Feb 4, 2020 · 3 comments
Labels
wontfix This will not be worked on

Comments

@ShyamsundarR
Contributor

When Ceph CSI started maintaining its journal in RADOS, one of the concerns raised [1] was around the scale limits of a single RADOS object (which, as of this writing, is ~200k keys per object).

The CSI volumes directory is a single RADOS object per pool (the csi.volumes.<InstanceID> object) that contains one key per currently in-use volume. At the keys-per-object scale limit, meaning ~200k images per pool (or per CephFS instance, as we use the CephFS metadata pool to store the same), we would need to shard this object to overcome the RADOS per-object key limits.
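For illustration, a minimal sketch of how one key per volume ends up in that single per-pool directory object, using go-ceph; the object name "csi.volumes.default", the "csi.volume." key prefix, the pool name, and the helper function are assumptions for this example, not the actual ceph-csi journal code:

```go
package main

import (
	"fmt"

	"github.com/ceph/go-ceph/rados"
)

// registerVolume adds one omap key per in-use volume to a single RADOS
// object; every volume in the pool lands in this one object, which is what
// eventually runs into the ~200k keys-per-object limit.
func registerVolume(ioctx *rados.IOContext, reqName, volUUID string) error {
	return ioctx.SetOmap("csi.volumes.default", map[string][]byte{
		"csi.volume." + reqName: []byte(volUUID),
	})
}

func main() {
	conn, err := rados.NewConn()
	if err != nil {
		panic(err)
	}
	if err := conn.ReadDefaultConfigFile(); err != nil {
		panic(err)
	}
	if err := conn.Connect(); err != nil {
		panic(err)
	}
	defer conn.Shutdown()

	// Pool name is an assumption for the sketch.
	ioctx, err := conn.OpenIOContext("replicapool")
	if err != nil {
		panic(err)
	}
	defer ioctx.Destroy()

	if err := registerVolume(ioctx, "pvc-1234", "5678-uuid"); err != nil {
		panic(err)
	}
	fmt.Println("volume registered in the CSI volumes directory object")
}
```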

This is a future concern rather than a medium-term problem, but noting it down so it is not missed as a reference when work is taken up on it.

Also noting this down due to the discussion around key counts in this PR with @dillaman.

[1] Older discussion references on sharding the CSI directory object:

@ShyamsundarR
Contributor Author

For the snapshot OMap directory, as there can be multiple snapshots per volume, we may reach the RADOS per-object key limit sooner.

There was a fleeting desire to create a snapshot OMap directory named per parent-image UUID, but that would not catch snapshot name collisions across different parent UUIDs, which the CSI plugins require.

For the snapshot OMap as well, we would need some form of sharding; @dillaman had some further thoughts on this, which I am capturing verbatim below:

RGW "solved" this issue in the past by first using sharding across a
fixed number of objects (i.e. hash the name and pick the destination
omap index object by modulo the number of objects). The downside to
that approach was that it required the user to pick the expected
maximum number of objects prior to establishing the cluster (or do an
offline reshard). Since RGW needs to always be up and it can expect to
continuously grow, they then switched to dynamic sharding [1] to
permit growth into the hundreds of millions of indexed objects.

I realistically would never expect the CSI to need the extra
complexity of something like dynamic sharding. However, you could
implement a backwards compatible fixed sharding scheme in the future
where the CSI driver sets the shard object upper limit (power of two)
and it recursively searches for a hit in decreasing shard objects
upper limits until it finds a hit. If and when it finds a hit, it
should move it to the correct shard so that future accesses don't need
to perform the search and it helps to reduce pressure on the "older"
objects.

[1] https://docs.ceph.com/docs/mimic/radosgw/dynamicresharding/
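To make the fixed-sharding idea above concrete, here is a hedged Go sketch of a backwards-compatible scheme: keys hash to one of a power-of-two number of shard objects, and lookups fall back through decreasing shard counts down to the legacy unsharded object. The function names, the FNV hash choice, and the `exists` lookup callback are hypothetical, not part of ceph-csi:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardObject returns the directory object name for a key given the
// configured shard count; numShards == 1 maps to the legacy unsharded object.
func shardObject(base, key string, numShards uint32) string {
	if numShards <= 1 {
		return base // e.g. "csi.volumes.default"
	}
	h := fnv.New32a()
	h.Write([]byte(key))
	return fmt.Sprintf("%s.%d", base, h.Sum32()%numShards)
}

// lookupKey searches starting at the configured (power of two) shard count
// and halves it on each miss, so entries written before a reshard are still
// found. exists is a stand-in for an omap key lookup against RADOS.
func lookupKey(base, key string, maxShards uint32,
	exists func(object, key string) bool) (string, bool) {
	for n := maxShards; n >= 1; n /= 2 {
		obj := shardObject(base, key, n)
		if exists(obj, key) {
			// A real implementation would migrate the key to
			// shardObject(base, key, maxShards) here, so future
			// lookups skip this fallback search and pressure on
			// the "older" objects is reduced.
			return obj, true
		}
	}
	return "", false
}

func main() {
	// Toy in-memory stand-in for the RADOS omap directory objects.
	store := map[string]map[string]bool{
		"csi.volumes.default": {"csi.volume.pvc-old": true}, // pre-sharding entry
	}
	exists := func(object, key string) bool { return store[object][key] }

	if obj, ok := lookupKey("csi.volumes.default", "csi.volume.pvc-old", 8, exists); ok {
		fmt.Println("found in", obj)
	}
}
```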

@stale

stale bot commented Oct 4, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix (This will not be worked on) label on Oct 4, 2020
@stale

stale bot commented Oct 12, 2020

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

stale bot closed this as completed on Oct 12, 2020