summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)Author
9 daysbtrfs: unfold transaction abort at btrfs_copy_root()Filipe Manana
Instead of having a common btrfs_abort_transaction() call for when any of the two btrfs_inc_ref() calls fail, move the btrfs_abort_transaction() to happen immediately after each one of the calls, so that when analyzing a stack trace with a transaction abort we know which call failed. Reviewed-by: Daniel Vacek <neelx@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
9 daysbtrfs: move transaction aborts to the error site in add_block_group_free_space()David Sterba
Transaction aborts should be done next to the place the error happens, which was not done in add_block_group_free_space(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
9 daysbtrfs: move transaction aborts to the error site in ↵David Sterba
remove_block_group_free_space() Transaction aborts should be done next to the place the error happens, which was not done in remove_block_group_free_space(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
9 daysbtrfs: simplify error detection flow during log replayFilipe Manana
We have this fuzzy logic at btrfs_recover_log_trees() where we don't abort the transaction and exit immediately after each function call that returned an error, and instead have if-then-else logic or check if the previous function call returned success before calling the next function. Make the flow more straightforward by immediately aborting the transaction and exiting after each function call failure. This also allows to avoid two consecutive if statements that test the same conditions: if (!ret && wc.stage == LOG_WALK_REPLAY_ALL) { (...) } Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
9 daysbtrfs: remove redundant path release when replaying a log treeFilipe Manana
There's no need to call btrfs_release_path() before calling btrfs_init_root_free_objectid() as we have released the path already at the top of the loop and the previous call to fixup_inode_link_counts() also releases the path. So remove it to simplify the code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
9 daysbtrfs: abort transaction during log replay if walk_log_tree() failedFilipe Manana
If we failed walking a log tree during replay, we have a missing transaction abort to prevent committing a transaction where we didn't fully replay all the changes from a log tree and therefore can leave the respective subvolume tree in some inconsistent state. So add the missing transaction abort. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
9 daysbtrfs: unfold transaction aborts when replaying log treesFilipe Manana
We have a single line doing a transaction abort in case either we got an error from btrfs_get_fs_root() different from -ENOENT or we got an error from btrfs_pin_extent_for_log_replay(), making it hard to figure out which function call failed when looking at a transaction abort massages and stack trace in dmesg. Change this to have an explicit transaction abort for each one of the two cases. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
9 daysbtrfs: make btrfs_should_periodic_reclaim() staticJohannes Thumshirn
btrfs_should_periodic_reclaim() is not used outside of space-info.c so make it static and remove the prototype from space-info.h. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
9 daysbtrfs: zoned: use filesystem size not disk size for reclaim decisionJohannes Thumshirn
When deciding if a zoned filesystem is reaching the threshold to reclaim data block groups, look at the size of the filesystem not to potentially total available size of all drives in the filesystem. Especially if a filesystem was created with mkfs' -b option, constraining it to only a portion of the block device, the numbers won't match and potentially garbage collection is kicking in too late. Fixes: 3687fcb0752a ("btrfs: zoned: make auto-reclaim less aggressive") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
9 daysbtrfs: unfold transaction abort at clone_copy_inline_extent()Filipe Manana
We have a common error path where we abort the transaction, but like this in case we get a transaction abort stack trace we don't know exactly which previous function call failed. Instead abort the transaction after any function call that returns an error, so that we can easily identify which function failed. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
9 daysbtrfs: remove pointless 'out' label from clone_finish_inode_update()Filipe Manana
The label is only used once and we can instead return directly where it's used, besides the fact that all we do under the label is to return the value of 'ret'. So get rid of the label and return directly. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: David Sterba <dsterba@suse.com>
9 daysbtrfs: unfold transaction abort at walk_up_proc()Filipe Manana
Instead of having a common btrfs_abort_transaction() call for when any of the two btrfs_dec_ref() calls fail, move the btrfs_abort_transaction() to happen immediately after each one of the calls, so that when analysing a stack trace with a transaction abort we know which call failed. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: David Sterba <dsterba@suse.com>
9 daysbtrfs: unfold transaction abort at __btrfs_inc_extent_ref()Filipe Manana
Instead of having a common btrfs_abort_transaction() call for when either insert_tree_block_ref() failed or when insert_extent_data_ref() failed, move the btrfs_abort_transaction() to happen immediately after each one of those calls, so that when analysing a stack trace with a transaction abort we know which call failed. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: David Sterba <dsterba@suse.com>
9 daysbtrfs: unfold transaction aborts at btrfs_create_new_inode()Filipe Manana
Instead of having a common btrfs_abort_transaction() call for when either btrfs_orphan_add() failed or when btrfs_add_link() failed, move the btrfs_abort_transaction() to happen immediately after each one of those calls, so that when analysing a stack trace with a transaction abort we know which call failed. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs/ioctl: store btrfs_uring_encoded_data in io_btrfs_cmdCaleb Sander Mateos
btrfs is the only user of struct io_uring_cmd_data and its op_data field. Switch its ->uring_cmd() implementations to store the struct btrfs_uring_encoded_data * in the struct io_btrfs_cmd, overlayed with io_uring_cmd's pdu field. This avoids having to touch another cache line to access the struct btrfs_uring_encoded_data *, and allows op_data and struct io_uring_cmd_data to be removed. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Acked-by: David Sterba <dsterba@suse.com> Link: https://lore.kernel.org/r/20250708202212.2851548-4-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-04tree-wide: s/struct fileattr/struct file_kattr/gChristian Brauner
Now that we expose struct file_attr as our uapi struct rename all the internal struct to struct file_kattr to clearly communicate that it is a kernel internal struct. This is similar to struct mount_{k}attr and others. Link: https://lore.kernel.org/20250703-restlaufzeit-baurecht-9ed44552b481@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-03Merge tag 'for-6.16-rc4-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - tree-log fixes: - fixes of log tracking of directories and subvolumes - fix iteration and error handling of inode references during log replay - fix free space tree rebuild (reported by syzbot) * tag 'for-6.16-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: use btrfs_record_snapshot_destroy() during rmdir btrfs: propagate last_unlink_trans earlier when doing a rmdir btrfs: record new subvolume in parent dir earlier to avoid dir logging races btrfs: fix inode lookup error handling during log replay btrfs: fix iteration of extrefs during log replay btrfs: fix missing error handling when searching for inode refs during log replay btrfs: fix failure to rebuild free space tree using multiple transactions
2025-06-30btrfs: stop parsing crc32c driver nameEric Biggers
To determine whether the crc32c implementation is "fast", use crc32_optimizations() instead of parsing the crypto_shash driver name. This keeps the code working as intended after the driver name is changed by the next commit. Acked-by: David Sterba <dsterba@suse.com> Link: https://lore.kernel.org/r/20250613183753.31864-2-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-06-27btrfs: use btrfs_record_snapshot_destroy() during rmdirFilipe Manana
We are setting the parent directory's last_unlink_trans directly which may result in a concurrent task starting to log the directory not see the update and therefore can log the directory after we removed a child directory which had a snapshot within instead of falling back to a transaction commit. Replaying such a log tree would result in a mount failure since we can't currently delete snapshots (and subvolumes) during log replay. This is the type of failure described in commit 1ec9a1ae1e30 ("Btrfs: fix unreplayable log after snapshot delete + parent dir fsync"). Fix this by using btrfs_record_snapshot_destroy() which updates the last_unlink_trans field while holding the inode's log_mutex lock. Fixes: 44f714dae50a ("Btrfs: improve performance on fsync against new inode after rename/unlink") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-27btrfs: propagate last_unlink_trans earlier when doing a rmdirFilipe Manana
In case the removed directory had a snapshot that was deleted, we are propagating its inode's last_unlink_trans to the parent directory after we removed the entry from the parent directory. This leaves a small race window where someone can log the parent directory after we removed the entry and before we updated last_unlink_trans, and as a result if we ever try to replay such a log tree, we will fail since we will attempt to remove a snapshot during log replay, which is currently not possible and results in the log replay (and mount) to fail. This is the type of failure described in commit 1ec9a1ae1e30 ("Btrfs: fix unreplayable log after snapshot delete + parent dir fsync"). So fix this by propagating the last_unlink_trans to the parent directory before we remove the entry from it. Fixes: 44f714dae50a ("Btrfs: improve performance on fsync against new inode after rename/unlink") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-27btrfs: record new subvolume in parent dir earlier to avoid dir logging racesFilipe Manana
Instead of recording that a new subvolume was created in a directory after we add the entry do the directory, record it before adding the entry. This is to avoid races where after creating the entry and before recording the new subvolume in the directory (the call to btrfs_record_new_subvolume()), another task logs the directory, so we end up with a log tree where we logged a directory that has an entry pointing to a root that was not yet committed, resulting in an invalid entry if the log is persisted and replayed later due to a power failure or crash. Also state this requirement in the function comment for btrfs_record_new_subvolume(), similar to what we do for the btrfs_record_unlink_dir() and btrfs_record_snapshot_destroy(). Fixes: 45c4102f0d82 ("btrfs: avoid transaction commit on any fsync after subvolume creation") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-27btrfs: fix inode lookup error handling during log replayFilipe Manana
When replaying log trees we use read_one_inode() to get an inode, which is just a wrapper around btrfs_iget_logging(), which in turn is a wrapper for btrfs_iget(). But read_one_inode() always returns NULL for any error that btrfs_iget_logging() / btrfs_iget() may return and this is a problem because: 1) In many callers of read_one_inode() we convert the NULL into -EIO, which is not accurate since btrfs_iget() may return -ENOMEM and -ENOENT for example, besides -EIO and other errors. So during log replay we may end up reporting a false -EIO, which is confusing since we may not have had any IO error at all; 2) When replaying directory deletes, at replay_dir_deletes(), we assume the NULL returned from read_one_inode() means that the inode doesn't exist and then proceed as if no error had happened. This is wrong because unless btrfs_iget() returned ERR_PTR(-ENOENT), we had an actual error and the target inode may exist in the target subvolume root - this may later result in the log replay code failing at a later stage (if we are "lucky") or succeed but leaving some inconsistency in the filesystem. So fix this by not ignoring errors from btrfs_iget_logging() and as a consequence remove the read_one_inode() wrapper and just use btrfs_iget_logging() directly. Also since btrfs_iget_logging() is supposed to be called only against subvolume roots, just like read_one_inode() which had a comment about it, add an assertion to btrfs_iget_logging() to check that the target root corresponds to a subvolume root. Fixes: 5d4f98a28c7d ("Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-27btrfs: fix iteration of extrefs during log replayFilipe Manana
At __inode_add_ref() when processing extrefs, if we jump into the next label we have an undefined value of victim_name.len, since we haven't initialized it before we did the goto. This results in an invalid memory access in the next iteration of the loop since victim_name.len was not initialized to the length of the name of the current extref. Fix this by initializing victim_name.len with the current extref's name length. Fixes: e43eec81c516 ("btrfs: use struct qstr instead of name and namelen pairs") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-27btrfs: fix missing error handling when searching for inode refs during log ↵Filipe Manana
replay During log replay, at __add_inode_ref(), when we are searching for inode ref keys we totally ignore if btrfs_search_slot() returns an error. This may make a log replay succeed when there was an actual error and leave some metadata inconsistency in a subvolume tree. Fix this by checking if an error was returned from btrfs_search_slot() and if so, return it to the caller. Fixes: e02119d5a7b4 ("Btrfs: Add a write ahead tree log to optimize synchronous operations") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-27btrfs: fix failure to rebuild free space tree using multiple transactionsFilipe Manana
If we are rebuilding a free space tree, while modifying the free space tree we may need to allocate a new metadata block group. If we end up using multiple transactions for the rebuild, when we call btrfs_end_transaction() we enter btrfs_create_pending_block_groups() which calls add_block_group_free_space() to add items to the free space tree for the block group. Then later during the free space tree rebuild, at btrfs_rebuild_free_space_tree(), we may find such new block groups and call populate_free_space_tree() for them, which fails with -EEXIST because there are already items in the free space tree. Then we abort the transaction with -EEXIST at btrfs_rebuild_free_space_tree(). Notice that we say "may find" the new block groups because a new block group may be inserted in the block groups rbtree, which is being iterated by the rebuild process, before or after the current node where the rebuild process is currently at. Syzbot recently reported such case which produces a trace like the following: ------------[ cut here ]------------ BTRFS: Transaction aborted (error -17) WARNING: CPU: 1 PID: 7626 at fs/btrfs/free-space-tree.c:1341 btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341 Modules linked in: CPU: 1 UID: 0 PID: 7626 Comm: syz.2.25 Not tainted 6.15.0-rc7-syzkaller-00085-gd7fa1af5b33e-dirty #0 PREEMPT Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341 lr : btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341 sp : ffff80009c4f7740 x29: ffff80009c4f77b0 x28: ffff0000d4c3f400 x27: 0000000000000000 x26: dfff800000000000 x25: ffff70001389eee8 x24: 0000000000000003 x23: 1fffe000182b6e7b x22: 0000000000000000 x21: ffff0000c15b73d8 x20: 00000000ffffffef x19: ffff0000c15b7378 x18: 1fffe0003386f276 x17: ffff80008f31e000 x16: ffff80008adbe98c x15: 0000000000000001 x14: 1fffe0001b281550 x13: 0000000000000000 x12: 0000000000000000 x11: ffff60001b281551 x10: 0000000000000003 x9 : 1c8922000a902c00 x8 : 1c8922000a902c00 x7 : ffff800080485878 x6 : 0000000000000000 x5 : 0000000000000001 x4 : 0000000000000001 x3 : ffff80008047843c x2 : 0000000000000001 x1 : ffff80008b3ebc40 x0 : 0000000000000001 Call trace: btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341 (P) btrfs_start_pre_rw_mount+0xa78/0xe10 fs/btrfs/disk-io.c:3074 btrfs_remount_rw fs/btrfs/super.c:1319 [inline] btrfs_reconfigure+0x828/0x2418 fs/btrfs/super.c:1543 reconfigure_super+0x1d4/0x6f0 fs/super.c:1083 do_remount fs/namespace.c:3365 [inline] path_mount+0xb34/0xde0 fs/namespace.c:4200 do_mount fs/namespace.c:4221 [inline] __do_sys_mount fs/namespace.c:4432 [inline] __se_sys_mount fs/namespace.c:4409 [inline] __arm64_sys_mount+0x3e8/0x468 fs/namespace.c:4409 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151 el0_svc+0x58/0x17c arch/arm64/kernel/entry-common.c:767 el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786 el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600 irq event stamp: 330 hardirqs last enabled at (329): [<ffff80008048590c>] raw_spin_rq_unlock_irq kernel/sched/sched.h:1525 [inline] hardirqs last enabled at (329): [<ffff80008048590c>] finish_lock_switch+0xb0/0x1c0 kernel/sched/core.c:5130 hardirqs last disabled at (330): [<ffff80008adb9e60>] el1_dbg+0x24/0x80 arch/arm64/kernel/entry-common.c:511 softirqs last enabled at (10): [<ffff8000801fbf10>] local_bh_enable+0x10/0x34 include/linux/bottom_half.h:32 softirqs last disabled at (8): [<ffff8000801fbedc>] local_bh_disable+0x10/0x34 include/linux/bottom_half.h:19 ---[ end trace 0000000000000000 ]--- Fix this by flagging new block groups which had their free space tree entries already added and then skip them in the rebuild process. Also, since the rebuild may be triggered when doing a remount, make sure that when we clear an existing free space tree that we clear such flag from every existing block group, otherwise we would skip those block groups during the rebuild. Reported-by: syzbot+d0014fb0fc39c5487ae5@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/68460a54.050a0220.daf97.0af5.GAE@google.com/ Fixes: 882af9f13e83 ("btrfs: handle free space tree rebuild in multiple transactions") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-23Merge tag 'for-6.16-rc3-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Fixes: - fix invalid inode pointer dereferences during log replay - fix a race between renames and directory logging - fix shutting down delayed iput worker - fix device byte accounting when dropping chunk - in zoned mode, fix offset calculations for DUP profile when conventional and sequential zones are used together Regression fixes: - fix possible double unlock of extent buffer tree (xarray conversion) - in zoned mode, fix extent buffer refcount when writing out extents (xarray conversion) Error handling fixes and updates: - handle unexpected extent type when replaying log - check and warn if there are remaining delayed inodes when putting a root - fix assertion when building free space tree - handle csum tree error with mount option 'rescue=ibadroot' Other: - error message updates: add prefix to all scrub related messages, include other information in messages" * tag 'for-6.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: fix alloc_offset calculation for partly conventional block groups btrfs: handle csum tree error with rescue=ibadroots correctly btrfs: fix race between async reclaim worker and close_ctree() btrfs: fix assertion when building free space tree btrfs: don't silently ignore unexpected extent type when replaying log btrfs: fix invalid inode pointer dereferences during log replay btrfs: fix double unlock of buffer_tree xarray when releasing subpage eb btrfs: update superblock's device bytes_used when dropping chunk btrfs: fix a race between renames and directory logging btrfs: scrub: add prefix for the error messages btrfs: warn if leaking delayed_nodes in btrfs_put_root() btrfs: fix delayed ref refcount leak in debug assertion btrfs: include root in error message when unlinking inode btrfs: don't drop a reference if btrfs_check_write_meta_pointer() fails
2025-06-19btrfs: zoned: fix alloc_offset calculation for partly conventional block groupsJohannes Thumshirn
When one of two zones composing a DUP block group is a conventional zone, we have the zone_info[i]->alloc_offset = WP_CONVENTIONAL. That will, of course, not match the write pointer of the other zone, and fails that block group. This commit solves that issue by properly recovering the emulated write pointer from the last allocated extent. The offset for the SINGLE, DUP, and RAID1 are straight-forward: it is same as the end of last allocated extent. The RAID0 and RAID10 are a bit tricky that we need to do the math of striping. This is the kernel equivalent of Naohiro's user-space commit: "btrfs-progs: zoned: fix alloc_offset calculation for partly conventional block groups". Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19btrfs: handle csum tree error with rescue=ibadroots correctlyQu Wenruo
[BUG] There is syzbot based reproducer that can crash the kernel, with the following call trace: (With some debug output added) DEBUG: rescue=ibadroots parsed BTRFS: device fsid 14d642db-7b15-43e4-81e6-4b8fac6a25f8 devid 1 transid 8 /dev/loop0 (7:0) scanned by repro (1010) BTRFS info (device loop0): first mount of filesystem 14d642db-7b15-43e4-81e6-4b8fac6a25f8 BTRFS info (device loop0): using blake2b (blake2b-256-generic) checksum algorithm BTRFS info (device loop0): using free-space-tree BTRFS warning (device loop0): checksum verify failed on logical 5312512 mirror 1 wanted 0xb043382657aede36608fd3386d6b001692ff406164733d94e2d9a180412c6003 found 0x810ceb2bacb7f0f9eb2bf3b2b15c02af867cb35ad450898169f3b1f0bd818651 level 0 DEBUG: read tree root path failed for tree csum, ret=-5 BTRFS warning (device loop0): checksum verify failed on logical 5328896 mirror 1 wanted 0x51be4e8b303da58e6340226815b70e3a93592dac3f30dd510c7517454de8567a found 0x51be4e8b303da58e634022a315b70e3a93592dac3f30dd510c7517454de8567a level 0 BTRFS warning (device loop0): checksum verify failed on logical 5292032 mirror 1 wanted 0x1924ccd683be9efc2fa98582ef58760e3848e9043db8649ee382681e220cdee4 found 0x0cb6184f6e8799d9f8cb335dccd1d1832da1071d12290dab3b85b587ecacca6e level 0 process 'repro' launched './file2' with NULL argv: empty string added DEBUG: no csum root, idatacsums=0 ibadroots=134217728 Oops: general protection fault, probably for non-canonical address 0xdffffc0000000041: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000208-0x000000000000020f] CPU: 5 UID: 0 PID: 1010 Comm: repro Tainted: G OE 6.15.0-custom+ #249 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022 RIP: 0010:btrfs_lookup_csum+0x93/0x3d0 [btrfs] Call Trace: <TASK> btrfs_lookup_bio_sums+0x47a/0xdf0 [btrfs] btrfs_submit_bbio+0x43e/0x1a80 [btrfs] submit_one_bio+0xde/0x160 [btrfs] btrfs_readahead+0x498/0x6a0 [btrfs] read_pages+0x1c3/0xb20 page_cache_ra_order+0x4b5/0xc20 filemap_get_pages+0x2d3/0x19e0 filemap_read+0x314/0xde0 __kernel_read+0x35b/0x900 bprm_execve+0x62e/0x1140 do_execveat_common.isra.0+0x3fc/0x520 __x64_sys_execveat+0xdc/0x130 do_syscall_64+0x54/0x1d0 entry_SYSCALL_64_after_hwframe+0x76/0x7e ---[ end trace 0000000000000000 ]--- [CAUSE] Firstly the fs has a corrupted csum tree root, thus to mount the fs we have to go "ro,rescue=ibadroots" mount option. Normally with that mount option, a bad csum tree root should set BTRFS_FS_STATE_NO_DATA_CSUMS flag, so that any future data read will ignore csum search. But in this particular case, we have the following call trace that caused NULL csum root, but not setting BTRFS_FS_STATE_NO_DATA_CSUMS: load_global_roots_objectid(): ret = btrfs_search_slot(); /* Succeeded */ btrfs_item_key_to_cpu() found = true; /* We found the root item for csum tree. */ root = read_tree_root_path(); if (IS_ERR(root)) { if (!btrfs_test_opt(fs_info, IGNOREBADROOTS)) /* * Since we have rescue=ibadroots mount option, * @ret is still 0. */ break; if (!found || ret) { /* @found is true, @ret is 0, error handling for csum * tree is skipped. */ } This means we completely skipped to set BTRFS_FS_STATE_NO_DATA_CSUMS if the csum tree is corrupted, which results unexpected later csum lookup. [FIX] If read_tree_root_path() failed, always populate @ret to the error number. As at the end of the function, we need @ret to determine if we need to do the extra error handling for csum tree. Fixes: abed4aaae4f7 ("btrfs: track the csum, extent, and free space trees in a rb tree") Reported-by: Zhiyu Zhang <zhiyuzhang999@gmail.com> Reported-by: Longxing Li <coregee2000@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19btrfs: fix race between async reclaim worker and close_ctree()Filipe Manana
Syzbot reported an assertion failure due to an attempt to add a delayed iput after we have set BTRFS_FS_STATE_NO_DELAYED_IPUT in the fs_info state: WARNING: CPU: 0 PID: 65 at fs/btrfs/inode.c:3420 btrfs_add_delayed_iput+0x2f8/0x370 fs/btrfs/inode.c:3420 Modules linked in: CPU: 0 UID: 0 PID: 65 Comm: kworker/u8:4 Not tainted 6.15.0-next-20250530-syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 Workqueue: btrfs-endio-write btrfs_work_helper RIP: 0010:btrfs_add_delayed_iput+0x2f8/0x370 fs/btrfs/inode.c:3420 Code: 4e ad 5d (...) RSP: 0018:ffffc9000213f780 EFLAGS: 00010293 RAX: ffffffff83c635b7 RBX: ffff888058920000 RCX: ffff88801c769e00 RDX: 0000000000000000 RSI: 0000000000000100 RDI: 0000000000000000 RBP: 0000000000000001 R08: ffff888058921b67 R09: 1ffff1100b12436c R10: dffffc0000000000 R11: ffffed100b12436d R12: 0000000000000001 R13: dffffc0000000000 R14: ffff88807d748000 R15: 0000000000000100 FS: 0000000000000000(0000) GS:ffff888125c53000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00002000000bd038 CR3: 000000006a142000 CR4: 00000000003526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> btrfs_put_ordered_extent+0x19f/0x470 fs/btrfs/ordered-data.c:635 btrfs_finish_one_ordered+0x11d8/0x1b10 fs/btrfs/inode.c:3312 btrfs_work_helper+0x399/0xc20 fs/btrfs/async-thread.c:312 process_one_work kernel/workqueue.c:3238 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402 kthread+0x70e/0x8a0 kernel/kthread.c:464 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> This can happen due to a race with the async reclaim worker like this: 1) The async metadata reclaim worker enters shrink_delalloc(), which calls btrfs_start_delalloc_roots() with an nr_pages argument that has a value less than LONG_MAX, and that in turn enters start_delalloc_inodes(), which sets the local variable 'full_flush' to false because wbc->nr_to_write is less than LONG_MAX; 2) There it finds inode X in a root's delalloc list, grabs a reference for inode X (with igrab()), and triggers writeback for it with filemap_fdatawrite_wbc(), which creates an ordered extent for inode X; 3) The unmount sequence starts from another task, we enter close_ctree() and we flush the workqueue fs_info->endio_write_workers, which waits for the ordered extent for inode X to complete and when dropping the last reference of the ordered extent, with btrfs_put_ordered_extent(), when we call btrfs_add_delayed_iput() we don't add the inode to the list of delayed iputs because it has a refcount of 2, so we decrement it to 1 and return; 4) Shortly after at close_ctree() we call btrfs_run_delayed_iputs() which runs all delayed iputs, and then we set BTRFS_FS_STATE_NO_DELAYED_IPUT in the fs_info state; 5) The async reclaim worker, after calling filemap_fdatawrite_wbc(), now calls btrfs_add_delayed_iput() for inode X and there we trigger an assertion failure since the fs_info state has the flag BTRFS_FS_STATE_NO_DELAYED_IPUT set. Fix this by setting BTRFS_FS_STATE_NO_DELAYED_IPUT only after we wait for the async reclaim workers to finish, after we call cancel_work_sync() for them at close_ctree(), and by running delayed iputs after wait for the reclaim workers to finish and before setting the bit. This race was recently introduced by commit 19e60b2a95f5 ("btrfs: add extra warning if delayed iput is added when it's not allowed"). Without the new validation at btrfs_add_delayed_iput(), this described scenario was safe because close_ctree() later calls btrfs_commit_super(). That will run any final delayed iputs added by reclaim workers in the window between the btrfs_run_delayed_iputs() and the the reclaim workers being shut down. Reported-by: syzbot+0ed30ad435bf6f5b7a42@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/6840481c.a00a0220.d4325.000c.GAE@google.com/T/#u Fixes: 19e60b2a95f5 ("btrfs: add extra warning if delayed iput is added when it's not allowed") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19btrfs: fix assertion when building free space treeFilipe Manana
When building the free space tree with the block group tree feature enabled, we can hit an assertion failure like this: BTRFS info (device loop0 state M): rebuilding free space tree assertion failed: ret == 0, in fs/btrfs/free-space-tree.c:1102 ------------[ cut here ]------------ kernel BUG at fs/btrfs/free-space-tree.c:1102! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP Modules linked in: CPU: 1 UID: 0 PID: 6592 Comm: syz-executor322 Not tainted 6.15.0-rc7-syzkaller-gd7fa1af5b33e #0 PREEMPT Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102 lr : populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102 sp : ffff8000a4ce7600 x29: ffff8000a4ce76e0 x28: ffff0000c9bc6000 x27: ffff0000ddfff3d8 x26: ffff0000ddfff378 x25: dfff800000000000 x24: 0000000000000001 x23: ffff8000a4ce7660 x22: ffff70001499cecc x21: ffff0000e1d8c160 x20: ffff0000e1cb7800 x19: ffff0000e1d8c0b0 x18: 00000000ffffffff x17: ffff800092f39000 x16: ffff80008ad27e48 x15: ffff700011e740c0 x14: 1ffff00011e740c0 x13: 0000000000000004 x12: ffffffffffffffff x11: ffff700011e740c0 x10: 0000000000ff0100 x9 : 94ef24f55d2dbc00 x8 : 94ef24f55d2dbc00 x7 : 0000000000000001 x6 : 0000000000000001 x5 : ffff8000a4ce6f98 x4 : ffff80008f415ba0 x3 : ffff800080548ef0 x2 : 0000000000000000 x1 : 0000000100000000 x0 : 000000000000003e Call trace: populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102 (P) btrfs_rebuild_free_space_tree+0x14c/0x54c fs/btrfs/free-space-tree.c:1337 btrfs_start_pre_rw_mount+0xa78/0xe10 fs/btrfs/disk-io.c:3074 btrfs_remount_rw fs/btrfs/super.c:1319 [inline] btrfs_reconfigure+0x828/0x2418 fs/btrfs/super.c:1543 reconfigure_super+0x1d4/0x6f0 fs/super.c:1083 do_remount fs/namespace.c:3365 [inline] path_mount+0xb34/0xde0 fs/namespace.c:4200 do_mount fs/namespace.c:4221 [inline] __do_sys_mount fs/namespace.c:4432 [inline] __se_sys_mount fs/namespace.c:4409 [inline] __arm64_sys_mount+0x3e8/0x468 fs/namespace.c:4409 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151 el0_svc+0x58/0x17c arch/arm64/kernel/entry-common.c:767 el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786 el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600 Code: f0047182 91178042 528089c3 9771d47b (d4210000) ---[ end trace 0000000000000000 ]--- This happens because we are processing an empty block group, which has no extents allocated from it, there are no items for this block group, including the block group item since block group items are stored in a dedicated tree when using the block group tree feature. It also means this is the block group with the highest start offset, so there are no higher keys in the extent root, hence btrfs_search_slot_for_read() returns 1 (no higher key found). Fix this by asserting 'ret' is 0 only if the block group tree feature is not enabled, in which case we should find a block group item for the block group since it's stored in the extent root and block group item keys are greater than extent item keys (the value for BTRFS_BLOCK_GROUP_ITEM_KEY is 192 and for BTRFS_EXTENT_ITEM_KEY and BTRFS_METADATA_ITEM_KEY the values are 168 and 169 respectively). In case 'ret' is 1, we just need to add a record to the free space tree which spans the whole block group, and we can achieve this by making 'ret == 0' as the while loop's condition. Reported-by: syzbot+36fae25c35159a763a2a@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/6841dca8.a00a0220.d4325.0020.GAE@google.com/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19btrfs: don't silently ignore unexpected extent type when replaying logFilipe Manana
If there's an unexpected (invalid) extent type, we just silently ignore it. This means a corruption or some bug somewhere, so instead return -EUCLEAN to the caller, making log replay fail, and print an error message with relevant information. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19btrfs: fix invalid inode pointer dereferences during log replayFilipe Manana
In a few places where we call read_one_inode(), if we get a NULL pointer we end up jumping into an error path, or fallthrough in case of __add_inode_ref(), where we then do something like this: iput(&inode->vfs_inode); which results in an invalid inode pointer that triggers an invalid memory access, resulting in a crash. Fix this by making sure we don't do such dereferences. Fixes: b4c50cbb01a1 ("btrfs: return a btrfs_inode from read_one_inode()") CC: stable@vger.kernel.org # 6.15+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19btrfs: fix double unlock of buffer_tree xarray when releasing subpage ebFilipe Manana
If we break out of the loop because an extent buffer doesn't have the bit EXTENT_BUFFER_TREE_REF set, we end up unlocking the xarray twice, once before we tested for the bit and break out of the loop, and once again after the loop. Fix this by testing the bit and exiting before unlocking the xarray. The time spent testing the bit is negligible and it's not worth trying to do that outside the critical section delimited by the xarray lock due to the code complexity required to avoid it (like using a local boolean variable to track whether the xarray is locked or not). The xarray unlock only needs to be done before calling release_extent_buffer(), as that needs to lock the xarray (through xa_cmpxchg_irq()) and does a more significant amount of work. Fixes: 19d7f65f032f ("btrfs: convert the buffer_radix to an xarray") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/linux-btrfs/aDRNDU0GM1_D4Xnw@stanley.mountain/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19btrfs: update superblock's device bytes_used when dropping chunkMark Harmstone
Each superblock contains a copy of the device item for that device. In a transaction which drops a chunk but doesn't create any new ones, we were correctly updating the device item in the chunk tree but not copying over the new bytes_used value to the superblock. This can be seen by doing the following: # dd if=/dev/zero of=test bs=4096 count=2621440 # mkfs.btrfs test # mount test /root/temp # cd /root/temp # for i in {00..10}; do dd if=/dev/zero of=$i bs=4096 count=32768; done # sync # rm * # sync # btrfs balance start -dusage=0 . # sync # cd # umount /root/temp # btrfs check test For btrfs-check to detect this, you will also need my patch at https://github.com/kdave/btrfs-progs/pull/991. Change btrfs_remove_dev_extents() so that it adds the devices to the fs_info->post_commit_list if they're not there already. This causes btrfs_commit_device_sizes() to be called, which updates the bytes_used value in the superblock. Fixes: bbbf7243d62d ("btrfs: combine device update operations during transaction commit") CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <maharmstone@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19btrfs: fix a race between renames and directory loggingFilipe Manana
We have a race between a rename and directory inode logging that if it happens and we crash/power fail before the rename completes, the next time the filesystem is mounted, the log replay code will end up deleting the file that was being renamed. This is best explained following a step by step analysis of an interleaving of steps that lead into this situation. Consider the initial conditions: 1) We are at transaction N; 2) We have directories A and B created in a past transaction (< N); 3) We have inode X corresponding to a file that has 2 hardlinks, one in directory A and the other in directory B, so we'll name them as "A/foo_link1" and "B/foo_link2". Both hard links were persisted in a past transaction (< N); 4) We have inode Y corresponding to a file that as a single hard link and is located in directory A, we'll name it as "A/bar". This file was also persisted in a past transaction (< N). The steps leading to a file loss are the following and for all of them we are under transaction N: 1) Link "A/foo_link1" is removed, so inode's X last_unlink_trans field is updated to N, through btrfs_unlink() -> btrfs_record_unlink_dir(); 2) Task A starts a rename for inode Y, with the goal of renaming from "A/bar" to "A/baz", so we enter btrfs_rename(); 3) Task A inserts the new BTRFS_INODE_REF_KEY for inode Y by calling btrfs_insert_inode_ref(); 4) Because the rename happens in the same directory, we don't set the last_unlink_trans field of directoty A's inode to the current transaction id, that is, we don't cal btrfs_record_unlink_dir(); 5) Task A then removes the entries from directory A (BTRFS_DIR_ITEM_KEY and BTRFS_DIR_INDEX_KEY items) when calling __btrfs_unlink_inode() (actually the dir index item is added as a delayed item, but the effect is the same); 6) Now before task A adds the new entry "A/baz" to directory A by calling btrfs_add_link(), another task, task B is logging inode X; 7) Task B starts a fsync of inode X and after logging inode X, at btrfs_log_inode_parent() it calls btrfs_log_all_parents(), since inode X has a last_unlink_trans value of N, set at in step 1; 8) At btrfs_log_all_parents() we search for all parent directories of inode X using the commit root, so we find directories A and B and log them. Bu when logging direct A, we don't have a dir index item for inode Y anymore, neither the old name "A/bar" nor for the new name "A/baz" since the rename has deleted the old name but has not yet inserted the new name - task A hasn't called yet btrfs_add_link() to do that. Note that logging directory A doesn't fallback to a transaction commit because its last_unlink_trans has a lower value than the current transaction's id (see step 4); 9) Task B finishes logging directories A and B and gets back to btrfs_sync_file() where it calls btrfs_sync_log() to persist the log tree; 10) Task B successfully persisted the log tree, btrfs_sync_log() completed with success, and a power failure happened. We have a log tree without any directory entry for inode Y, so the log replay code deletes the entry for inode Y, name "A/bar", from the subvolume tree since it doesn't exist in the log tree and the log tree is authorative for its index (we logged a BTRFS_DIR_LOG_INDEX_KEY item that covers the index range for the dentry that corresponds to "A/bar"). Since there's no other hard link for inode Y and the log replay code deletes the name "A/bar", the file is lost. The issue wouldn't happen if task B synced the log only after task A called btrfs_log_new_name(), which would update the log with the new name for inode Y ("A/bar"). Fix this by pinning the log root during renames before removing the old directory entry, and unpinning after btrfs_log_new_name() is called. Fixes: 259c4b96d78d ("btrfs: stop doing unnecessary log updates during a rename") CC: stable@vger.kernel.org # 5.18+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19btrfs: scrub: add prefix for the error messagesAnand Jain
Add a "scrub: " prefix to all messages logged by scrub so that it's easy to filter them from dmesg for analysis. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19btrfs: warn if leaking delayed_nodes in btrfs_put_root()Leo Martins
Add a warning for leaked delayed_nodes when putting a root. We currently do this for inodes, but not delayed_nodes. Signed-off-by: Leo Martins <loemra.dev@gmail.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> [ Remove the changelog from the commit message. ] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19btrfs: fix delayed ref refcount leak in debug assertionLeo Martins
If the delayed_root is not empty we are increasing the number of references to a delayed_node without decreasing it, causing a leak. Fix by decrementing the delayed_node reference count. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Leo Martins <loemra.dev@gmail.com> Reviewed-by: Qu Wenruo <wqu@suse.com> [ Remove the changelog from the commit message. ] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19btrfs: include root in error message when unlinking inodeFilipe Manana
To help debugging include the root number in the error message, and since this is a critical error that implies a metadata inconsistency and results in a transaction abort change the log message level from "info" to "critical", which is a much better fit. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19fs: replace mmap hook with .mmap_prepare for simple mappingsLorenzo Stoakes
Since commit c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file callback"), the f_op->mmap() hook has been deprecated in favour of f_op->mmap_prepare(). This callback is invoked in the mmap() logic far earlier, so error handling can be performed more safely without complicated and bug-prone state unwinding required should an error arise. This hook also avoids passing a pointer to a not-yet-correctly-established VMA avoiding any issues with referencing this data structure. It rather provides a pointer to the new struct vm_area_desc descriptor type which contains all required state and allows easy setting of required parameters without any consideration needing to be paid to locking or reference counts. Note that nested filesystems like overlayfs are compatible with an .mmap_prepare() callback since commit bb666b7c2707 ("mm: add mmap_prepare() compatibility layer for nested file systems"). In this patch we apply this change to file systems with relatively simple mmap() hook logic - exfat, ceph, f2fs, bcachefs, zonefs, btrfs, ocfs2, orangefs, nilfs2, romfs, ramfs and aio. Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/f528ac4f35b9378931bd800920fee53fc0c5c74d.1750099179.git.lorenzo.stoakes@oracle.com Acked-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-10new helper: set_default_d_op()Al Viro
... to be used instead of manually assigning to ->s_d_op. All in-tree filesystem converted (and field itself is renamed, so any out-of-tree ones in need of conversion will be caught by compiler). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-28Merge tag 'for-6.16-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "A fixup to the xarray conversion sent in the main 6.16 batch. It was not included because it would cause rebase/refresh of like 80 patches, right before sending the early pull request last week. It's fixing a bug when zoned mode is enabled on btrfs so it's not affecting most people" * tag 'for-6.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: don't drop a reference if btrfs_check_write_meta_pointer() fails
2025-05-27btrfs: don't drop a reference if btrfs_check_write_meta_pointer() failsJosef Bacik
In the zoned mode there's a bug in the extent buffer tree conversion to xarray. The reference for eb is dropped and code continues but the references get dropped by releasing the batch. Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reported-by: kernel test robot <oliver.sang@intel.com> Link: https://lore.kernel.org/linux-btrfs/202505191521.435b97ac-lkp@intel.com/ Fixes: 19d7f65f032f ("btrfs: convert the buffer_radix to an xarray") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-27btrfs: don't drop a reference if btrfs_check_write_meta_pointer() failsJosef Bacik
In the zoned mode there's a bug in the extent buffer tree conversion to xarray. The reference for eb is dropped and code continues but the references get dropped by releasing the batch. Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Fixes: 19d7f65f032f ("btrfs: convert the buffer_radix to an xarray") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-26Merge tag 'for-6.16-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "Apart from numerous cleanups, there are some performance improvements and one minor mount option update. There's one more radix-tree conversion (one remaining), and continued work towards enabling large folios (almost finished). Performance: - extent buffer conversion to xarray gains throughput and runtime improvements on metadata heavy operations doing writeback (sample test shows +50% throughput, -33% runtime) - extent io tree cleanups lead to performance improvements by avoiding unnecessary searches or repeated searches - more efficient extent unpinning when committing transaction (estimated run time improvement 3-5%) User visible changes: - remove standalone mount option 'nologreplay', deprecated in 5.9, replacement is 'rescue=nologreplay' - in scrub, update reporting, add back device stats message after detected errors (accidentally removed during recent refactoring) Core: - convert extent buffer radix tree to xarray - in subpage mode, move block perfect compression out of experimental build - in zoned mode, introduce sub block groups to allow managing special block groups, like the one for relocation or tree-log, to handle some corner cases of ENOSPC - in scrub, simplify bitmaps for block tracking status - continued preparations for large folios: - remove assertions for folio order 0 - add support where missing: compression, buffered write, defrag, hole punching, subpage, send - fix fsync of files with no hard links not persisting deletion - reject tree blocks which are not nodesize aligned, a precaution from 4.9 times - move transaction abort calls closer to the error sites - remove usage of some struct bio_vec internals - simplifications in extent map - extent IO cleanups and optimizations - error handling improvements - enhanced ASSERT() macro with optional format strings - cleanups: - remove unused code - naming unifications, dropped __, added prefix - merge similar functions - use common helpers for various data structures" * tag 'for-6.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (198 commits) btrfs: move misplaced comment of btrfs_path::keep_locks btrfs: remove standalone "nologreplay" mount option btrfs: use a single variable to track return value at btrfs_page_mkwrite() btrfs: don't return VM_FAULT_SIGBUS on failure to set delalloc for mmap write btrfs: simplify early error checking in btrfs_page_mkwrite() btrfs: pass true to btrfs_delalloc_release_space() at btrfs_page_mkwrite() btrfs: fix wrong start offset for delalloc space release during mmap write btrfs: fix harmless race getting delayed ref head count when running delayed refs btrfs: log error codes during failures when writing super blocks btrfs: simplify error return logic when getting folio at prepare_one_folio() btrfs: return real error from __filemap_get_folio() calls btrfs: remove superfluous return value check at btrfs_dio_iomap_begin() btrfs: fix invalid data space release when truncating block in NOCOW mode btrfs: update Kconfig option descriptions btrfs: update list of features built under experimental config btrfs: send: remove btrfs_debug() calls btrfs: use boolean for delalloc argument to btrfs_free_reserved_extent() btrfs: use boolean for delalloc argument to btrfs_free_reserved_bytes() btrfs: fold error checks when allocating ordered extent and update comments btrfs: check we grabbed inode reference when allocating an ordered extent ...
2025-05-26Merge tag 'for-6.16/block-20250523' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - ublk updates: - Add support for updating the size of a ublk instance - Zero-copy improvements - Auto-registering of buffers for zero-copy - Series simplifying and improving GET_DATA and request lookup - Series adding quiesce support - Lots of selftests additions - Various cleanups - NVMe updates via Christoph: - add per-node DMA pools and use them for PRP/SGL allocations (Caleb Sander Mateos, Keith Busch) - nvme-fcloop refcounting fixes (Daniel Wagner) - support delayed removal of the multipath node and optionally support the multipath node for private namespaces (Nilay Shroff) - support shared CQs in the PCI endpoint target code (Wilfred Mallawa) - support admin-queue only authentication (Hannes Reinecke) - use the crc32c library instead of the crypto API (Eric Biggers) - misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes Reinecke, Leon Romanovsky, Gustavo A. R. Silva) - MD updates via Yu: - Fix that normal IO can be starved by sync IO, found by mkfs on newly created large raid5, with some clean up patches for bdev inflight counters - Clean up brd, getting rid of atomic kmaps and bvec poking - Add loop driver specifically for zoned IO testing - Eliminate blk-rq-qos calls with a static key, if not enabled - Improve hctx locking for when a plug has IO for multiple queues pending - Remove block layer bouncing support, which in turn means we can remove the per-node bounce stat as well - Improve blk-throttle support - Improve delay support for blk-throttle - Improve brd discard support - Unify IO scheduler switching. This should also fix a bunch of lockdep warnings we've been seeing, after enabling lockdep support for queue freezing/unfreezeing - Add support for block write streams via FDP (flexible data placement) on NVMe - Add a bunch of block helpers, facilitating the removal of a bunch of duplicated boilerplate code - Remove obsolete BLK_MQ pci and virtio Kconfig options - Add atomic/untorn write support to blktrace - Various little cleanups and fixes * tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits) selftests: ublk: add test for UBLK_F_QUIESCE ublk: add feature UBLK_F_QUIESCE selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE traceevent/block: Add REQ_ATOMIC flag to block trace events ublk: run auto buf unregisgering in same io_ring_ctx with registering io_uring: add helper io_uring_cmd_ctx_handle() ublk: remove io argument from ublk_auto_buf_reg_fallback() ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch() selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK selftests: ublk: support UBLK_F_AUTO_BUF_REG ublk: support UBLK_AUTO_BUF_REG_FALLBACK ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG ublk: prepare for supporting to register request buffer automatically ublk: convert to refcount_t selftests: ublk: make IO & device removal test more stressful nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk nvme: introduce multipath_always_on module param nvme-multipath: introduce delayed removal of the multipath head node nvme-pci: derive and better document max segments limits nvme-pci: use struct_size for allocation struct nvme_dev ...
2025-05-26Merge tag 'vfs-6.16-rc1.async.dir' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs directory lookup updates from Christian Brauner: "This contains cleanups for the lookup_one*() family of helpers. We expose a set of functions with names containing "lookup_one_len" and others without the "_len". This difference has nothing to do with "len". It's rater a historical accident that can be confusing. The functions without "_len" take a "mnt_idmap" pointer. This is found in the "vfsmount" and that is an important question when choosing which to use: do you have a vfsmount, or are you "inside" the filesystem. A related question is "is permission checking relevant here?". nfsd and cachefiles *do* have a vfsmount but *don't* use the non-_len functions. They pass nop_mnt_idmap and refuse to work on filesystems which have any other idmap. This work changes nfsd and cachefile to use the lookup_one family of functions and to explictily pass &nop_mnt_idmap which is consistent with all other vfs interfaces used where &nop_mnt_idmap is explicitly passed. The remaining uses of the "_one" functions do not require permission checks so these are renamed to be "_noperm" and the permission checking is removed. This series also changes these lookup function to take a qstr instead of separate name and len. In many cases this simplifies the call" * tag 'vfs-6.16-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: VFS: change lookup_one_common and lookup_noperm_common to take a qstr Use try_lookup_noperm() instead of d_hash_and_lookup() outside of VFS VFS: rename lookup_one_len family to lookup_noperm and remove permission check cachefiles: Use lookup_one() rather than lookup_one_len() nfsd: Use lookup_one() rather than lookup_one_len() VFS: improve interface for lookup_one functions
2025-05-17btrfs: move misplaced comment of btrfs_path::keep_locksSun YangKai
Commit 925baeddc5b0 ("Btrfs: Start btree concurrency work.") added the comment for the field keep_locks. This got moved later but without the comment, so move it to the right place and fix the comment style. Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-16btrfs: remove standalone "nologreplay" mount optionQu Wenruo
Standalone "nologreplay" mount option has been marked deprecated since commit 74ef00185eb8 ("btrfs: introduce "rescue=" mount option"), which dates back to v5.9 (2020). Furthermore there is no other filesystem with the same named mount option, so this one is btrfs specific and we will not hit the same problem when removing "norecovery" mount option. So let's remove the standalone "nologreplay" mount option. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: use a single variable to track return value at btrfs_page_mkwrite()Filipe Manana
We have two variables to track return values, ret and ret2, with types vm_fault_t (an unsigned int type) and int, which makes it a bit confusing and harder to keep track. So use a single variable, of type int, and under the 'out' label return vmf_error(ret) in case ret contains an error, otherwise return VM_FAULT_NOPAGE. This is equivalent to what we had before and it's simpler. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>