
Add disk usage summary report interface #9672

Merged: 6 commits merged into redpanda-data:dev on Mar 29, 2023

Conversation

dotnwat (Member) commented Mar 28, 2023

A previous commit introduced the gc_estimate interface, effectively a dry run of retention policy. It was designed to let future disk space management functionality reason about which partitions have the most data available for removal from local disk in low disk space scenarios.

The flip side of knowing how much data can be reclaimed is knowing how much data is being used. For example, future disk space management functionality will need to understand how much space is attributable to Raft versus other sub-systems (e.g. in single-disk setups).

To that end this PR introduces disk_log_impl::disk_usage(), which replaces gc_estimate and returns a report covering both disk usage broken out into categories (e.g. data, index) and the previous dry-run retention calculation. They are combined because (1) it is more efficient to collect these statistics at once and (2) the only current user will need both pieces of information.

ss::future<usage_report> disk_log_impl::disk_usage(compaction_config cfg);

/*
 * disk usage report
 *
 * usage: disk usage summary for log.
 * reclaim: disk usage reclaim summary for log.
 */
struct usage_report {
    usage usage;
    reclaim_size_limits reclaim;
};
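
For illustration only, here is a minimal sketch of how a caller might consume the combined report. The namespace qualification and the `total` member on `reclaim` are assumptions for the example, not the actual fields introduced by this PR.

#include <cstddef>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
// plus the codebase headers for disk_log_impl / compaction_config (paths assumed)

// Hypothetical helper: could applying retention right now free enough space?
ss::future<bool> could_free(storage::disk_log_impl& log,
                            storage::compaction_config cfg,
                            size_t needed_bytes) {
    auto report = co_await log.disk_usage(cfg);
    // report.usage   -> space the partition occupies on disk, by category
    // report.reclaim -> dry-run retention estimate of what could be freed
    co_return report.reclaim.total >= needed_bytes; // `total` is an assumed field name
}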

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.1.x
  • v22.3.x
  • v22.2.x

Release Notes

  • none

dotnwat marked this pull request as ready for review March 28, 2023 04:48
dotnwat self-assigned this Mar 28, 2023
VladLazar (Contributor) previously approved these changes Mar 29, 2023 and left a comment:

Looks good. Only nits and suggestions.

Comment on lines +177 to +181
void segment::clear_cached_disk_usage() {
_idx.clear_cached_disk_usage();
_data_disk_usage_size.reset();
_compaction_index_size.reset();
}
Contributor:
nit: it's a bit odd that the normal index caches its own size, but the compaction index size is cached by the segment. We don't have a higher level abstraction for the compacted index, so I don't see another way of doing it.

dotnwat (Member Author):

Right. I tried very hard (and succeeded) to avoid refactoring a bunch of stuff to make this cleaner. It is indeed awkward to have the reader and appender and then the same path in the segment/reader/appender, etc.
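
As an aside, here is a small standalone sketch of the caching pattern the snippet above hints at: lazily memoize the queried size and reset it when the file changes. This is illustrative only, not the actual segment code.

#include <cstdint>
#include <optional>
#include <utility>

// Illustrative cache: the size is queried lazily and memoized; operations
// that change the on-disk footprint call invalidate() to drop the value.
class cached_disk_size {
public:
    template <typename QueryFn>
    uint64_t get(QueryFn&& query_os) {
        if (!_size) {
            // expensive OS query happens only on a cache miss
            _size = std::forward<QueryFn>(query_os)();
        }
        return *_size;
    }

    void invalidate() { _size.reset(); }

private:
    std::optional<uint64_t> _size;
};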

andrwng previously approved these changes Mar 29, 2023
@@ -1701,15 +1701,10 @@ log make_disk_backed_log(

ss::future<reclaim_size_limits>
disk_log_impl::estimate_reclaim_size(compaction_config cfg) {
Contributor (quoting the commit message):

"However, even when retention is disabled we may still want to delete data that has been uploaded to cloud storage."

Just making sure I understand the behavior before this: did we previously not locally GC for compacted topics? Or did we just not estimate the size, and therefore not prioritize running GC as highly as we should have been?

dotnwat (Member Author):

Before, there was no estimation. Now we can estimate, but we aren't yet using that in the scheduling decision. Also, generally, this work is tackling retention policies, so it is orthogonal to compaction. We aren't estimating for compaction.

Contributor:

Got it, thanks for clarifying.

The intention behind caching the size of removed persistent state for
GC'd log segments was reporting: if we held a segment reference it may
still have been a GC'd tombstone, and in that case reporting its size
as 0 was awkward.

However, this interface is now going to be primarily used for
determining GC schedules. In this new role it makes sense to report the
most up-to-date information.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Prior to this commit estimation of disk sizes involved reporting
user-level data: for example the number of bytes written into the
segment file. This is not particularly accurate because it doesn't
take into account the physical allocation of space by the underlying
file system.

For example, in a file system without optimizations such as storing
small-file data in inode blocks, a file of only a few bytes might still
require a minimum allocation of 4K. For workloads that produce a large
number of files on disk this can have a dramatic impact, and the
user-level accounting under-reports actual disk usage.

To make accounting more accurate we instead query the OS for the
physical file size. It's important to use this information
appropriately: it doesn't necessarily reflect user-level data sizes for
retention. Rather, it allows us to make better decisions about space
usage when scheduling data removal.

The impact of querying the OS is expected to be small because size
queries far outnumber the modifications that affect size, so caching is
feasible. Generally only segments being compacted and the active
segment for a partition see operations that invalidate the cache, and
these are also the segments that a given round of data reclamation
would skip when applying retention policy.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
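
To make the logical-vs-physical distinction from the commit message above concrete, here is a small standalone sketch (not Redpanda code) that compares the two numbers for a file using POSIX stat():

#include <cstdio>
#include <sys/stat.h>

int main(int argc, char** argv) {
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s <path>\n", argv[0]);
        return 1;
    }
    struct stat st {};
    if (::stat(argv[1], &st) != 0) {
        std::perror("stat");
        return 1;
    }
    // st_size: logical file length (roughly "bytes written" at user level).
    // st_blocks * 512: space the file system actually allocated, which is
    // what matters when reasoning about free disk space.
    std::printf("logical size:   %lld bytes\n", static_cast<long long>(st.st_size));
    std::printf("allocated size: %lld bytes\n", static_cast<long long>(st.st_blocks) * 512);
    return 0;
}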
It is more natural to report cumulative sizes for categories of
retention where possible rather than non-overlapping sizes.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Prior to this commit the GC estimate would report that no data was
reclaimable for non-collectible partitions. However, even when retention
is disabled we may still want to delete data that has been uploaded to
cloud storage. This commit fixes that accounting by reporting data that
is reclaimable for other reasons (such as having been uploaded to cloud
storage), even when normal retention policy would not collect it.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
This commit renames the `estimate_reclaim_size` method to the more
generically named `disk_usage`, which returns information about both
the estimated reclaim size as before and on-disk usage information for
the entire partition.

The two are combined because reporting all of the information together
makes for a more efficient implementation. Further, the two pieces of
information will be used together as part of disk space management.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
dotnwat merged commit c3cca09 into redpanda-data:dev Mar 29, 2023