Add disk usage summary report interface #9672
Conversation
Looks good. Only nits and suggestions.
```cpp
void segment::clear_cached_disk_usage() {
    // invalidate the index's cached size along with the sizes cached
    // directly on the segment for the data file and compaction index
    _idx.clear_cached_disk_usage();
    _data_disk_usage_size.reset();
    _compaction_index_size.reset();
}
```
nit: it's a bit odd that the normal index caches its own size, but the compaction index size is cached by the segment. We don't have a higher level abstraction for the compacted index, so I don't see another way of doing it.
Right. I tried very hard (and succeeded) in not refactoring a bunch of stuff to make this cleaner. It is indeed awkward to have the reader and appender, and then the same path in the segment/reader/appender, etc...
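For reference, the caching pattern being discussed can be sketched as follows. This is a minimal illustration with a hypothetical `cached_file_size` type, not the actual segment/index code from this PR:

```cpp
#include <cstdint>
#include <filesystem>
#include <optional>
#include <utility>

// Minimal sketch of the size-caching pattern (hypothetical type).
// The size is queried from the filesystem on first use and cached;
// writers that change the file must call invalidate() so the next
// query hits the filesystem again.
class cached_file_size {
public:
    explicit cached_file_size(std::filesystem::path path)
      : _path(std::move(path)) {}

    // Return the on-disk size, querying the OS only on a cache miss.
    std::uintmax_t get() {
        if (!_size) {
            _size = std::filesystem::file_size(_path);
        }
        return *_size;
    }

    void invalidate() { _size.reset(); }

private:
    std::filesystem::path _path;
    std::optional<std::uintmax_t> _size;
};
```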
src/v/storage/disk_log_impl.cc (outdated)
```diff
@@ -1701,15 +1701,10 @@ log make_disk_backed_log(

 ss::future<reclaim_size_limits>
 disk_log_impl::estimate_reclaim_size(compaction_config cfg) {
```
> However, even when retention is disabled we may still want to delete data that has been uploaded to cloud storage.

Just making sure I understand the behavior before this: did we previously not locally GC for compacted topics? Or did we just not estimate the size, and therefore not prioritize running GC code as highly as we should have been?
Before, there was no estimation. Now we can estimate, but we aren't yet using that in the scheduling decision. Also, generally, this work is tackling retention policies, so it is orthogonal to compaction. We aren't estimating for compaction.
Got it, thanks for clarifying
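To make the intended use concrete, here is a sketch of how a future low-disk scheduler might rank partitions by these estimates. This is hypothetical (the PR explicitly does not yet feed the estimates into scheduling), and the names below are illustrative:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: how dry-run reclaim estimates could eventually
// feed a scheduling decision. This PR only produces the estimates.
struct partition_estimate {
    std::string ntp;       // partition identifier
    size_t reclaimable{0}; // dry-run retention estimate, in bytes
};

// Pick the partitions with the most reclaimable data until the
// target number of bytes to free is covered.
std::vector<std::string>
pick_gc_candidates(std::vector<partition_estimate> est, size_t target_bytes) {
    std::sort(est.begin(), est.end(), [](const auto& a, const auto& b) {
        return a.reclaimable > b.reclaimable;
    });
    std::vector<std::string> out;
    size_t freed = 0;
    for (const auto& e : est) {
        if (freed >= target_bytes) {
            break;
        }
        out.push_back(e.ntp);
        freed += e.reclaimable;
    }
    return out;
}
```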
The intention behind caching the size of removed persistent state for GC'd log segments was reporting: if we held a segment reference, it may still have been a GC'd tombstone, and in that case reporting its size as 0 was awkward. However, this interface is now going to be used primarily for determining GC schedules, and in this new role it makes sense to report the most up-to-date information.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Prior to this commit, estimation of disk sizes involved reporting user-level data: for example, the number of bytes written into the segment file. This is not particularly accurate because it doesn't take into account the physical allocation of space by the underlying file system. For example, a file of a few bytes on a file system without optimizations such as storing small file data in inode blocks might still require a minimum allocation of 4K. This can have a dramatic impact on workloads that produce a large number of files on disk, leading to under-reporting of disk usage.

To make accounting more accurate, we instead query the OS for file size. It's important to use this information appropriately: it doesn't necessarily reflect user-level data sizes for retention. Rather, it allows us to make better decisions about space usage when scheduling data removal.

The impact of querying the OS is expected to be small because queries far outnumber the modifications that affect size, so caching is feasible. Generally only compacted segments and the active segment for a partition will see operations that invalidate the cache. However, these are also the segments that we would skip when applying retention policy in a particular round of data reclamation.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
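The distinction this commit relies on can be seen with plain POSIX `stat(2)` (an illustration, not code from this PR): `st_size` is the logical byte count, while `st_blocks` counts 512-byte units of physical allocation, so a tiny file can occupy a full filesystem block.

```cpp
#include <cstdint>
#include <cstdio>
#include <sys/stat.h>

// Print the logical size (bytes written) and the physical size
// (space actually allocated by the filesystem) of a file.
int main(int argc, char** argv) {
    if (argc != 2) {
        std::fprintf(stderr, "usage: %s <path>\n", argv[0]);
        return 1;
    }
    struct stat st {};
    if (::stat(argv[1], &st) != 0) {
        std::perror("stat");
        return 1;
    }
    // st_blocks is always in 512-byte units, independent of the
    // filesystem block size.
    auto logical = static_cast<unsigned long long>(st.st_size);
    auto physical = static_cast<unsigned long long>(st.st_blocks) * 512ULL;
    std::printf("logical=%llu bytes physical=%llu bytes\n", logical, physical);
    return 0;
}
```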
It is more natural to report cumulative sizes for categories of retention where possible rather than non-overlapping sizes.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Prior to this commit, the GC size estimate would report that no data was reclaimable for non-collectible partitions. However, even when retention is disabled we may still want to delete data that has been uploaded to cloud storage. This commit fixes the accounting by reporting such data as collectible, even though it is not collectible under the normal retention policy.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
This commit renames the `estimate_reclaim_size` method to the more generically named `disk_usage`, which returns both the estimated reclaim size as before and on-disk usage information for the entire partition. The two are combined because reporting all of the information together makes for a more efficient implementation, and because the two pieces of information will be used together as part of disk space management.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
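As a rough sketch of the shape such a combined report could take (the structs and field names below are illustrative guesses, not the actual definitions from this PR):

```cpp
#include <cstddef>

// Hypothetical sketch of the combined report returned by disk_usage();
// the real structs in the PR may differ in names and fields.
struct usage {
    size_t data{0};        // segment data files
    size_t index{0};       // offset/time index files
    size_t compaction{0};  // compaction index files
    size_t total() const { return data + index + compaction; }
};

struct reclaim_size_limits {
    // reclaimable under the configured retention policy (dry run)
    size_t retention{0};
    // reclaimable overall, e.g. including data already uploaded to
    // cloud storage on partitions where retention is disabled
    size_t available{0};
};

struct usage_report {
    usage disk;                  // on-disk footprint, by category
    reclaim_size_limits reclaim; // estimated reclaimable sizes
};
```

Combining both halves in one struct matches the rationale above: a single traversal of the log can collect everything the space management caller needs.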
In a previous commit the `gc_estimate` interface was introduced, which was effectively a dry run of retention policy designed to allow future disk space management functionality to reason about which partitions had the most data available for removal from local disk in support of low disk space scenarios.

The flip side of knowing how much data can be reclaimed is knowing how much data is being used. For example, future disk space management functionality will need to understand how much space is attributable to raft vs other sub-systems (e.g. in single-disk setups).

To that end this PR introduces `disk_log_impl::disk_usage()`, which replaces `gc_estimate` and returns a report covering both disk usage broken out into categories (e.g. data, index) and the previous dry-run retention calculation. They are combined because (1) it is more efficient to collect these statistics at once and (2) the only current user will need both pieces of information.

Backports Required
Release Notes