btrfs: immediately drop extent maps after failed COW write

fdmanana · kdave · commit 2827e29266b4 · 2024-05-16T16:50:51.000+02:00
If a write path in COW mode fails, either before submitting a bio for the
new extents or an actual IO error happens, we can end up allowing a fast
fsync to log file extent items that point to unwritten extents.

This is because the ordered extent completion for a failed write is
executed in a work queue. This means that once the write path unlocks the
inode, a fast fsync can come and log the extent maps created by the write
attempt before the work queue completes the ordered extent.

For example consider a direct IO write, in COW mode, that fails at
btrfs_dio_submit_io() because btrfs_extract_ordered_extent() returned an
error:

1) We call btrfs_finish_ordered_extent() with the 'uptodate' parameter
   set to false, meaning an error happened;

2) That results in marking the ordered extent with the BTRFS_ORDERED_IOERR
   flag;

3) btrfs_finish_ordered_extent() queues the completion of the ordered
   extent - so that btrfs_finish_one_ordered() will be executed later in
   a work queue. That function will drop extents maps in the range when
   it's executed, since the extent maps point to unwritten locations
   (signaled by the BTRFS_ORDERED_IOERR flag);

4) After calling btrfs_finish_ordered_extent() we keep going down the
   write path and unlock the inode;

5) After that a fast fsync starts and locks the inode;

6) Before the work queue executes btrfs_finish_one_ordered(), the fsync
   task sees the extent maps that point to the unwritten locations and
   logs file extent items based on them - it does not know they are
   unwritten, and the fast fsync path does not wait for ordered extents
   to complete in order to reduce latency.

So to fix this make btrfs_finish_ordered_extent() drop the extent maps
in the range if an error happened for a COW write.

Note that this issues of using extent maps that point to unwritten
locations can not happen for reads, because in read paths we start by
locking the extent range and wait for any ordered extents in the range
to complete before looking for extent maps.

Reviewed-by: Qu Wenruo &lt;wqu@suse.com&gt;
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
@@ -388,6 +388,33 @@ bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
 	ret = can_finish_ordered_extent(ordered, page, file_offset, len, uptodate);
 	spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
 
+	/*
+	 * If this is a COW write it means we created new extent maps for the
+	 * range and they point to an unwritten location if we got an error
+	 * either before submitting a bio or during IO.
+	 *
+	 * We have marked the ordered extent with BTRFS_ORDERED_IOERR, and we
+	 * are queuing its completion below. During completion, at
+	 * btrfs_finish_one_ordered(), we will drop the extent maps for the
+	 * unwritten extents.
+	 *
+	 * However because completion runs in a work queue we can end up
+	 * unlocking the inode before the ordered extent is completed.
+	 *
+	 * That means that a fast fsync can happen before the work queue
+	 * executes the completion of the ordered extent, and in that case
+	 * the fsync will use the extent maps that point to unwritten extents,
+	 * resulting in logging file extent items that point to unwritten
+	 * locations. Unlike read paths, a fast fsync doesn't wait for ordered
+	 * extent completion before proceeding (intentional to reduce latency).
+	 *
+	 * To be safe drop the new extent maps in the range (if are doing COW)
+	 * right here before we unlock the inode and allow a fsync to run.
+	 */
+	if (!uptodate && !test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags))
+		btrfs_drop_extent_map_range(inode, file_offset,
+					    file_offset + len - 1, false);
+
 	if (ret)
 		btrfs_queue_ordered_fn(ordered);
 	return ret;