
Regular for-next build test #1157


Open: wants to merge 10,000 commits into base branch build
Conversation

@kdave (Member) commented Feb 21, 2024

Keep this open, the build tests are hosted on github CI.

@kdave changed the title from "Post rc5 build test" to "Regular for-next build test" on Feb 22, 2024
@kdave force-pushed the for-next branch 6 times, most recently from 2d4aefb to c9e380a on February 28, 2024 14:37
@kdave force-pushed the for-next branch 6 times, most recently from c56343b to 1cab137 on March 5, 2024 17:23
@kdave force-pushed the for-next branch 2 times, most recently from 6613f3c to b30a0ce on March 15, 2024 01:05
@kdave force-pushed the for-next branch 6 times, most recently from d205ebd to c0bd9d9 on March 25, 2024 17:48
@kdave force-pushed the for-next branch 4 times, most recently from 15022b1 to c22750c on March 28, 2024 02:04
@kdave force-pushed the for-next branch 3 times, most recently from 28d9855 to e18d8ce on April 4, 2024 19:30
fdmanana and others added 13 commits July 8, 2025 15:58
There's no point in checking at iterate_inodes_from_logical() if the path
has search_commit_root set: the only caller never sets search_commit_root
to true, and it doesn't make sense for it ever to be true for the current
use case (the logical_to_ino ioctl). So stop checking for that, and since
the only caller allocates the path just for it to be used by
iterate_inodes_from_logical(), move the path allocation into that function.
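
A minimal sketch of the result, assuming simplified parameters (the real
function takes more context than shown here):

  static int iterate_inodes_from_logical(u64 logical,
                                         struct btrfs_fs_info *fs_info,
                                         void *ctx)
  {
          struct btrfs_path *path;
          int ret = 0;

          /* The path now lives entirely inside this function. */
          path = btrfs_alloc_path();
          if (!path)
                  return -ENOMEM;

          /* ... resolve the logical address and iterate the inodes ... */

          btrfs_free_path(path);
          return ret;
  }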

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…refs

There's no point in converting the return values from strcmp(), as all we
need is that it returns a negative value if the first argument is less
than the second, a positive value if it's greater, and 0 if they are
equal. We have no need for -1 instead of any other negative value, and no
need for 1 instead of any other positive value - that's all rb_find()
needs, and nowhere else do we need specific negative and positive values.

So remove the intermediate local variable and checks, and return the
result from strcmp() directly.
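
A minimal sketch of the comparator shape (struct and field names are
illustrative, not the exact btrfs symbols):

  static int prop_cmp(const void *key, const struct rb_node *node)
  {
          const struct prop_entry *entry = rb_entry(node, struct prop_entry,
                                                    node);

          /* rb_find() only cares about the sign, so pass it through. */
          return strcmp(key, entry->name);
  }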

This also reduces the module's text size.

Before:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  1888116	 161347	  16136	2065599	 1f84bf	fs/btrfs/btrfs.ko

After:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  1888052	 161347	  16136	2065535	 1f847f	fs/btrfs/btrfs.ko

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
So far we've been deriving the buffer tree index using the sector size.
But each extent buffer covers multiple sectors, which makes the buffer
tree rather sparse.

For example, the typical and quite common configuration uses a sector
size of 4KiB and a node size of 16KiB. In this case the buffer tree uses
at most 25% of its slots; in other words, at least 75% of the tree slots
are wasted, as they are never used.

We can achieve significant memory savings on the required tree nodes by
indexing the tree using the node size instead. As a result far fewer
slots are wasted and the tree can now use up to 100% of its slots.

Note: This works even with unaligned tree blocks as we can still get a
      unique index by doing eb->start >> nodesize_shift.
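
An illustrative before/after of the index derivation (field names are
assumptions, following the note above):

  /* Before: one slot per sector, so a 16KiB eb leaves 3 of every
   * 4 slots permanently unused with 4KiB sectors. */
  index = eb->start >> fs_info->sectorsize_bits;

  /* After: one slot per node; consecutive ebs get consecutive slots,
   * and unaligned tree blocks still map to unique indexes. */
  index = eb->start >> fs_info->nodesize_shift;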

Gathering some stats from running a fio write test shows a bit of
variance.  The values presented in the table below are medians from 5
test runs.  The numbers are:

  - number of allocated ebs in the tree
  - number of leaf tree nodes
  - highest index in the tree (radix tree width):

  ebs / leaves / index |   bare for-next    |      with fix
  ---------------------+--------------------+-------------------
  post mount           |   16 /  11 / 10e5c |   16 /  10 / 4240
  post test            | 5810 / 891 / 11cfc | 4420 / 252 / 473a
  post rm              |  574 / 300 / 10ef0 |  540 / 163 / 46e9

In this case (10GiB filesystem) the height of the tree is still 3 levels,
but the 4x width reduction is clearly visible as expected. And since the
tree is denser we can see a 54-72% reduction in leaf nodes. That's very
close to ideal with this test, meaning the tree gets really dense with
this kind of workload.

Also, the fio results show no performance change.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The RCU protected string is only used for a device name, and RCU is used
so we can print the name and eventually synchronize against the rare
device rename in device_list_add().

We don't need the whole API just for that. Open code all the helpers and
access to the string itself.

Notable change is in device_list_add() when the device name is changed,
which is the only place that can actually happen at the same time as
message prints using the device name under RCU read lock.

Previously there was kfree_rcu() which used the embedded rcu_head to
delay freeing the object depending on the RCU mechanism. Now there's
kfree_rcu_mightsleep() which does not need the rcu_head and waits for
the grace period.

Sleeping is safe in this context, and as this is a rare event that runs
under the device_list_mutex it won't interfere with the rest.

Straightforward changes:

- rcu_string_strdup -> kstrdup
- rcu_str_deref -> rcu_dereference
- drop ->str from safe contexts
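
A sketch of the rename path after the conversion, following the list
above (simplified, error handling omitted; the exact code lives in
device_list_add()):

  name = kstrdup(path, GFP_NOFS);
  old_name = rcu_dereference_protected(device->name,
                  lockdep_is_held(&fs_devices->device_list_mutex));
  rcu_assign_pointer(device->name, name);
  /* No embedded rcu_head needed; this waits for the grace period
   * and may sleep, which is safe under device_list_mutex here. */
  kfree_rcu_mightsleep(old_name);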

Historical notes:

Introduced in 606686e ("Btrfs: use rcu to protect device->name")
with a vague reference of the potential problem described in
https://lore.kernel.org/all/20120531155304.GF11775@ZenIV.linux.org.uk/ .

The RCU protection looks like the easiest and most lightweight way of
protecting the rare event of device rename racing device_list_add()
with a random printk() that uses the device name.

Alternatives: a spin lock would require protecting the printk anyway; a
fixed buffer for the name could end up wrong in case the new name is
overwritten while being printed; an array switching pointers and
cleaning them up eventually resembles RCU too much.

The cleanups up to this patch reduce the special casing of RCU to the
minimum, so that only the name needs rcu_dereference(), which can be
further cleaned up to use btrfs_dev_name().

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The only user of the RCU string API (the device name) has been
converted, so we can remove the API.

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As we can have a non-contiguous range in eb->folios, any item can
straddle two folios and we need to check whether it can be read in one
go or in two parts. For that there's a check which is not implemented
in the simplest way:

  offset in folio + size <= folio size

With a simple expression transformation:

  oil + size <= unit_size
        size <= unit_size - oil
    sizeof() <= part

this can be simplified, reusing existing run-time or compile-time
constants.

Add a likely() annotation to this expression, as this is the fast path
and the compiler sometimes reorders it after the followup block with the
memcpy (observed in practice with other simplifications).
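
An illustrative shape of the check in a get/set helper; the local names
'oil' (offset in folio), 'part', 'kaddr' and 'leres' are assumptions,
not the exact btrfs code:

  const unsigned long oil = get_eb_offset_in_folio(eb, member_offset);
  const int part = unit_size - oil;

  if (likely(sizeof(u32) <= part)) {
          /* Fast path: the whole value is inside one folio. */
          memcpy(&leres, kaddr + oil, sizeof(u32));
  } else {
          /* Slow path: the read is split across two folios. */
  }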

Overall effect on stack consumption:

  btrfs_get_8                                        -8 (80 -> 72)
  btrfs_set_8                                        -8 (88 -> 80)

And .ko size (due to optimizations making use of the direct constants):

     text    data     bss     dec     hex filename
  1456601  115665   16088 1588354  183c82 pre/btrfs.ko
  1456093  115665   16088 1587846  183a86 post/btrfs.ko

  DELTA: -508

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Now unit_size is used only once, so use it directly in the 'part'
calculation. Don't cache sizeof(type) in a variable. While this is a
compile-time constant, forcing the type 'int' generates worse code as it
leads to an additional conversion from a 32 to a 64 bit type on x86_64.

The sizeof() is used only a few times and it does not make the code that
much harder to read, so use it directly and let the compiler utilize the
immediate constants in the context it needs. The .ko code size slightly
increases (+50) but further patches will reduce that again.
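
A minimal illustration of the difference (variable names are
hypothetical):

  /* Before: the cached int forces a 32 to 64 bit extension when the
   * value is later used in 64-bit pointer arithmetic on x86_64. */
  const int size = sizeof(u32);
  memcpy(dest, kaddr + oil, size);

  /* After: the compiler keeps sizeof() as an immediate constant in
   * whatever width the context needs. */
  memcpy(dest, kaddr + oil, sizeof(u32));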

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a check in each set/get helper that the requested range is
within the extent buffer bounds, and if it's not then it's reported.
This was in an ASSERT statement, so with CONFIG_BTRFS_ASSERT this
crashes right away, while on other configs it's only reported but the
out-of-bounds read is done anyway. There are currently no known reports
of this particular condition failing.

There are some drawbacks though: the behaviour depends on whether the
assertions are compiled in, and inlining report_setget_bounds() into
each helper has a less visible cost.

As the bounds check is expected to succeed almost always it's ok to
inline it, but make the report a function and move it out of the helper
completely (__cold puts it in a different section). This also skips
reading/writing the requested range in case the check fails.
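
A sketch of the resulting shape (the function name matches the stack
listing below; the message and exact arguments are assumptions):

  static void __cold report_setget_bounds(const struct extent_buffer *eb,
                                          unsigned long off, int size)
  {
          btrfs_warn(eb->fs_info,
                     "bad eb accessor: start %llu len %u offset %lu size %d",
                     eb->start, eb->len, off, size);
  }

  /* In each helper: check inline, report out of line, skip the access. */
  if (unlikely(member_offset + size > eb->len)) {
          report_setget_bounds(eb, member_offset, size);
          return 0;
  }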

This improves stack usage significantly:

  btrfs_get_16                                         -48 (80 -> 32)
  btrfs_get_32                                         -48 (80 -> 32)
  btrfs_get_64                                         -48 (80 -> 32)
  btrfs_get_8                                          -48 (72 -> 24)
  btrfs_set_16                                         -56 (88 -> 32)
  btrfs_set_32                                         -56 (88 -> 32)
  btrfs_set_64                                         -56 (88 -> 32)
  btrfs_set_8                                          -48 (80 -> 32)

  NEW (48):
	  report_setget_bounds                                     48
  LOST/NEW DELTA:      +48
  PRE/POST DELTA:     -360

Same as .ko size:

     text    data     bss     dec     hex filename
  1456079  115665   16088 1587832  183a78 pre/btrfs.ko
  1454951  115665   16088 1586704  183610 post/btrfs.ko

  DELTA: -1128

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Reading/writing 1 byte (u8) is a special case compared to the others, as
it's always contained in the folio we find, so the split memcpy will
never be needed. Turn it into a compile-time check so that the memcpy
part can be optimized out.
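
Illustrative shape of the condition in the type-templated helper; for u8
the first operand is a compile-time constant true, so the split branch
below is provably dead and dropped by the compiler (a sketch, not the
exact code):

  if (sizeof(type) == 1 || likely(sizeof(type) <= part)) {
          /* Contained in one folio; the only possibility for u8. */
  } else {
          /* Split across two folios; dead code when type is u8. */
  }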

The stack usage is reduced:

  btrfs_set_8                                         -16 (32 -> 16)
  btrfs_get_8                                         -16 (24 -> 8)

Code size reduction:

     text    data     bss     dec     hex filename
  1454951  115665   16088 1586704  183610 pre/btrfs.ko
  1454691  115665   16088 1586444  18350c post/btrfs.ko

  DELTA: -260

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Reading/writing 2 bytes (u16) may need 2 folios to be written to; each
time it's just one byte, so using memcpy for that is overkill. Add a
branch for the split case so that memcpy is now used only for u32 and
u64. Another side effect is that the u16 types now don't need additional
stack space, everything fits in registers.
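
A sketch of the split u16 store, one byte per folio instead of a memcpy
(local names like 'kaddr' and 'lebytes' are illustrative):

  if (likely(sizeof(u16) <= part)) {
          memcpy(kaddr + oil, &leres, sizeof(u16));
  } else {
          /* Last byte of this folio, first byte of the next one. */
          kaddr[oil] = lebytes[0];
          *(u8 *)folio_address(eb->folios[idx + 1]) = lebytes[1];
  }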

Stack usage is reduced:

  btrfs_get_16                                           -8 (32 -> 24)
  btrfs_set_16                                          -16 (32 -> 16)

Code size reduction:

     text    data     bss     dec     hex filename
  1454691  115665   16088 1586444  18350c pre/btrfs.ko
  1454459  115665   16088 1586212  183424 post/btrfs.ko

  DELTA: -232

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
The target address for the read/write can be simplified, as it's the
same expression for the first folio. This improves the generated code
as the folio address does not have to be cached on the stack.

Stack usage reduction:

  btrfs_set_32                                           -8 (32 -> 24)
  btrfs_set_64                                           -8 (32 -> 24)
  btrfs_get_16                                           -8 (24 -> 16)

Code size reduction:

     text    data     bss     dec     hex filename
  1454459  115665   16088 1586212  183424 pre/btrfs.ko
  1454279  115665   16088 1586032  183370 post/btrfs.ko

  DELTA: -180

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
The case of reading the bytes from 2 folios needs two memcpy()s; the
compiler does not emit calls but two inline loops.

Factoring out the code brings some improvement (stack, code) and in the
future will allow an optimized implementation as well. (The analogous
version with two destinations is not done, as it increases stack usage,
but can be done if needed.)

The address of the second folio is reordered before the first memcpy,
which leads to an optimization reusing the vmemmap_base and
page_offset_base (implementing folio_address()).
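
A sketch of the factored-out helper as described above (the name and
exact qualifiers are assumptions):

  /* Copy 'len' bytes into 'dest' from a source range that starts in
   * one folio ('part' bytes available) and continues in the next. */
  static inline void memcpy_split_src(char *dest, const char *src1,
                                      const char *src2, const size_t part,
                                      const size_t len)
  {
          memcpy(dest, src1, part);
          memcpy(dest + part, src2, len - part);
  }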

Stack usage reduction:

  btrfs_get_32                                           -8 (32 -> 24)
  btrfs_get_64                                           -8 (32 -> 24)

Code size reduction:

     text    data     bss     dec     hex filename
  1454279  115665   16088 1586032  183370 pre/btrfs.ko
  1454229  115665   16088 1585982  18333e post/btrfs.ko

  DELTA: -50

As this is the last patch in this series, here's the overall diff
starting and including commit "btrfs: accessors: simplify folio bounds
checks":

Stack:

  btrfs_set_16                                          -72 (88 -> 16)
  btrfs_get_32                                          -56 (80 -> 24)
  btrfs_set_8                                           -72 (88 -> 16)
  btrfs_set_64                                          -64 (88 -> 24)
  btrfs_get_8                                           -72 (80 -> 8)
  btrfs_get_16                                          -64 (80 -> 16)
  btrfs_set_32                                          -64 (88 -> 24)
  btrfs_get_64                                          -56 (80 -> 24)

  NEW (48):
	  report_setget_bounds                           48
  LOST/NEW DELTA:      +48
  PRE/POST DELTA:     -472

Code:

     text    data     bss     dec     hex filename
  1456601  115665   16088 1588354  183c82 pre/btrfs.ko
  1454229  115665   16088 1585982  18333e post/btrfs.ko

  DELTA: -2372

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
There used to be 'oip', short for "offset in page", which got changed
during the conversion to folios. The name is a bit confusing, so rename
it.

Signed-off-by: David Sterba <dsterba@suse.com>
There are two cases open coding the clear and wake up pattern, so use
the helper instead.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
fdmanana and others added 8 commits July 11, 2025 11:55
If we attempt a mmap write into a NOCOW file or a prealloc extent when
there is no more available data space (or unallocated space to allocate a
new data block group) and we can do a NOCOW write (there are no reflinks
for the target extent or snapshots), we always fail with -ENOSPC, unlike
the regular buffered write and direct IO paths, where we check whether we
can do a NOCOW write in case we can't reserve data space.

Simple reproducer:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdi
  MNT=/mnt/sdi

  umount $DEV &> /dev/null
  mkfs.btrfs -f -b $((512 * 1024 * 1024)) $DEV
  mount $DEV $MNT

  touch $MNT/foobar
  # Make it a NOCOW file.
  chattr +C $MNT/foobar

  # Add initial data to file.
  xfs_io -c "pwrite -S 0xab 0 1M" $MNT/foobar

  # Fill all the remaining data space and unallocated space with data.
  dd if=/dev/zero of=$MNT/filler bs=4K &> /dev/null

  # Overwrite the file with a mmap write. Should succeed.
  xfs_io -c "mmap -w 0 1M"        \
         -c "mwrite -S 0xcd 0 1M" \
         -c "munmap"              \
         $MNT/foobar

  # Unmount, mount again and verify the new data was persisted.
  umount $MNT
  mount $DEV $MNT

  od -A d -t x1 $MNT/foobar

  umount $MNT

Running this:

  $ ./test.sh
  (...)
  wrote 1048576/1048576 bytes at offset 0
  1 MiB, 256 ops; 0.0008 sec (1.188 GiB/sec and 311435.5231 ops/sec)
  ./test.sh: line 24: 234865 Bus error               xfs_io -c "mmap -w 0 1M" -c "mwrite -S 0xcd 0 1M" -c "munmap" $MNT/foobar
  0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
  *
  1048576

Fix this by not failing in case we can't allocate data space but we can
NOCOW into the target extent, reserving only metadata space in this case.
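
A hypothetical shape of the fallback in the page_mkwrite path (function
arguments are simplified and the surrounding details are assumptions):

  ret = btrfs_check_data_free_space(inode, &data_reserved, start, len,
                                    false);
  if (ret < 0) {
          size_t nocow_bytes = len;

          /* No data space left: fall back to NOCOW if the target
           * extent allows it, and reserve only metadata space. */
          if (btrfs_check_nocow_lock(inode, start, &nocow_bytes, false) > 0)
                  ret = 0;
  }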

After this change the test passes:

  $ ./test.sh
  (...)
  wrote 1048576/1048576 bytes at offset 0
  1 MiB, 256 ops; 0.0007 sec (1.262 GiB/sec and 330749.3540 ops/sec)
  0000000 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
  *
  1048576

A test case for fstests will be added soon.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…rite()

We have the inode's io_tree already stored in a local variable, so use it
instead of grabbing it again in the call to btrfs_clear_extent_bit().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Most of the time we want to use the btrfs_inode, so change the local inode
variable to be a btrfs_inode instead of a VFS inode, reducing verbosity
by eliminating a lot of BTRFS_I() calls.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The documentation for the @nowait parameter is missing, so add it.
The @nowait parameter was added in commit 80f9d24 ("btrfs: make
btrfs_check_nocow_lock nowait compatible"), which forgot to update the
function comment.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We call btrfs_check_nocow_lock() to see if we can NOCOW a block sized
range, but we don't check later if we can NOCOW the whole range.
It's unexpected to be able to NOCOW a range smaller than the blocksize,
so add an assertion to check that the NOCOW range matches the blocksize.
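
An illustrative form of the check (variable names are hypothetical):

  size_t nocow_bytes = blocksize;

  if (btrfs_check_nocow_lock(inode, pos, &nocow_bytes, nowait) > 0) {
          /* We asked for one block; anything smaller is unexpected. */
          ASSERT(nocow_bytes == blocksize);
  }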

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_check_nocow_lock() stops at the first extent it finds,
and that extent may be smaller than the target range we want to NOCOW
into. But we can have multiple consecutive extents which we can NOCOW
into, so by stopping at the first one we find we just make the caller do
more work by splitting the write into multiple ones, or, in the case of
mmap writes with large folios, we fail with -ENOSPC in case the folio's
range is covered by more than one extent (the fallback to NOCOW for mmap
writes in case there's no available data space to reserve/allocate was
recently added by the patch "btrfs: fix -ENOSPC mmap write failure on
NOCOW files/extents").

Improve on this by checking for multiple consecutive extents.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove the log message printed before reclaiming a chunk in
btrfs_reclaim_bgs_work(). Especially with automatic block-group
reclaim, these messages spam the kernel log.

Note there is also a tracepoint for the same condition to ease debugging.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When BTRFS is doing automatic block-group reclaim, it spams the kernel
log with messages a lot.

Add a 'verbose' parameter to btrfs_relocate_chunk() and
btrfs_relocate_block_group() to control the verbosity of these log
messages. This way the old behaviour of printing log messages on a
user-space initiated balance operation is kept, while the excessive log
spamming due to auto reclaim is mitigated.
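
A sketch of the resulting call sites (the prototype is simplified and
the arguments are assumptions):

  int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset,
                           bool verbose);

  /* User-initiated balance: keep the messages. */
  btrfs_relocate_chunk(fs_info, found_key.offset, true);

  /* Auto reclaim worker: silence them. */
  btrfs_relocate_chunk(fs_info, bg->start, false);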

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
fdmanana and others added 5 commits July 11, 2025 12:52
…ck()

Set the EXTENT_NORESERVE bit in the io tree before unlocking the range,
so that we can use the cached state and speed up the operation, since
the unlock operation releases the cached state.
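
The ordering, illustrated (arguments simplified; btrfs_set_extent_bit()
is named in a later commit message, the unlock helper name here is an
assumption):

  /* Set the bit while the cached state is still usable ... */
  btrfs_set_extent_bit(io_tree, start, end, EXTENT_NORESERVE,
                       &cached_state);
  /* ... because the unlock releases the cached state. */
  btrfs_unlock_extent(io_tree, start, end, &cached_state);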

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
We have a cached extent state record from the previous extent locking,
so we can use it when setting EXTENT_NORESERVE in the range, allowing
the operation to be faster if the extent io tree is relatively large.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
It's just a simple wrapper around btrfs_clear_extent_bit() that passes
NULL for its last argument (a cached extent state record), plus there is
no counterpart - we have a btrfs_set_extent_bit() but we do not have a
btrfs_set_extent_bits() (plural version). So just remove it and make all
callers use btrfs_clear_extent_bit() directly.
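
The mechanical change at each call site (the bit name is illustrative):

  /* Before, via the wrapper: */
  btrfs_clear_extent_bits(io_tree, start, end, EXTENT_NORESERVE);

  /* After, passing the NULL cached state explicitly: */
  btrfs_clear_extent_bit(io_tree, start, end, EXTENT_NORESERVE, NULL);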

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
There are some reports of the "unable to find chunk map for logical
2147483648 length 16384" error message appearing in dmesg. This means
some IOs are occurring after a block group is removed.

When a metadata tree node is cleaned on a zoned setup, we keep that node
dirty and write it out anyway, so as not to create a write hole. However,
this can make a block group's used bytes == 0 while there is still a
dirty region left.

Such an unused block group is moved into the unused_bg list and processed
for removal. When the removal succeeds, the block group is removed from
the transaction->dirty_bgs list, so the unused dirty nodes in the block
group are not written at transaction commit time. They will be written at
some later time, e.g. on sync or umount, and cause the "unable to find
chunk map" errors.

This can happen relatively easily on SMR, whose zone size is 256MB.
However, calling do_zone_finish() on such a block group returns -EAGAIN
and keeps the block group intact, which is why the issue was hidden until
now.

Fixes: afba2bc ("btrfs: zoned: implement active zone tracking")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
btrfs_zone_finish() can fail for several reasons. If it fails with
-EAGAIN, we need to try again later, so put the block group on the retry
list properly.

Failing to do so will keep the removable block group intact until remount
and can cause unnecessary ENOSPC.

Fixes: 74e91b1 ("btrfs: zoned: zone finish unused block group")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>