linux/linux-stable.git - Linux kernel stable tree

Age	Commit message (Collapse)	Author
8 days	btrfs: unfold transaction aborts when writing dirty block groups	Filipe Manana
	We have a single transaction abort call that can be due to an error from one of two calls to update_block_group_item(). Unfold the transaction abort calls so that if they happen we know which update_block_group_item() call failed. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 days	btrfs: zoned: requeue to unused block group list if zone finish failed	Naohiro Aota
	btrfs_zone_finish() can fail for several reason. If it is -EAGAIN, we need to try it again later. So, put the block group to the retry list properly. Failing to do so will keep the removable block group intact until remount and can causes unnecessary ENOSPC. Fixes: 74e91b12b115 ("btrfs: zoned: zone finish unused block group") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 days	btrfs: zoned: do not remove unwritten non-data block group	Naohiro Aota
	There are some reports of "unable to find chunk map for logical 2147483648 length 16384" error message appears in dmesg. This means some IOs are occurring after a block group is removed. When a metadata tree node is cleaned on a zoned setup, we keep that node still dirty and write it out not to create a write hole. However, this can make a block group's used bytes == 0 while there is a dirty region left. Such an unused block group is moved into the unused_bg list and processed for removal. When the removal succeeds, the block group is removed from the transaction->dirty_bgs list, so the unused dirty nodes in the block group are not sent at the transaction commit time. It will be written at some later time e.g, sync or umount, and causes "unable to find chunk map" errors. This can happen relatively easy on SMR whose zone size is 256MB. However, calling do_zone_finish() on such block group returns -EAGAIN and keep that block group intact, which is why the issue is hidden until now. Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 days	btrfs: remove btrfs_clear_extent_bits()	Filipe Manana
	It's just a simple wrapper around btrfs_clear_extent_bit() that passes a NULL for its last argument (a cached extent state record), plus there is not counter part - we have a btrfs_set_extent_bit() but we do not have a btrfs_set_extent_bits() (plural version). So just remove it and make all callers use btrfs_clear_extent_bit() directly. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 days	btrfs: don't print relocation messages from auto reclaim	Johannes Thumshirn
	When BTRFS is doing automatic block-group reclaim, it is spamming the kernel log messages a lot. Add a 'verbose' parameter to btrfs_relocate_chunk() and btrfs_relocate_block_group() to control the verbosity of these log message. This way the old behaviour of printing log messages on a user-space initiated balance operation can be kept while excessive log spamming due to auto reclaim is mitigated. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 days	btrfs: remove redundant auto reclaim log message	Johannes Thumshirn
	Remove the log message before reclaiming a chunk in btrfs_reclaim_bgs_work(). Especially with automatic block-group reclaiming these messages spam the kernel log. Note there is also a tracepoint for the same condition to ease debugging. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 days	btrfs: change dump_block_groups() in btrfs_dump_space_info() from int to bool	Johannes Thumshirn
	btrfs_dump_space_info()'s parameter dump_block_groups is used as a boolean although it is defined as an integer. Change it from int to bool. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 days	btrfs: add btrfs prefix to free space tree exported functions	Filipe Manana
	A few of the free space tree exported functions have a 'btrfs_' prefix in their name, but most don't. Not only is this inconsistent, the preferred style is to have such a prefix, to avoid potential collisions in the future with other kernel code and offer a somewhat better readibility by making it obvious in calls sites that we are calling btrfs specific code. So add the 'btrfs_' prefix to all free space tree functions that are missing it. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15	btrfs: use boolean for delalloc argument to btrfs_free_reserved_bytes()	Filipe Manana
	We are using an integer for the 'delalloc' argument but all we need is a boolean, so switch the type to 'bool' and rename the parameter to 'is_delalloc' to better match the fact that it's a boolean. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15	btrfs: simplify getting and extracting previous transaction at ↵	Filipe Manana
	clean_pinned_extents() Instead of detecting if there is a previous transaction by comparing the current transaction's list prev member to the head of the transaction list (fs_info->trans_list), use the list_is_first() helper which contains that logic and the naming makes sense since a new transaction is always added to the end of the list fs_info->trans_list with list_add_tail(). We are also extracting the previous transaction with list_last_entry() against the transaction, which is correct but confusing because that function is usually meant to be used against a pointer to the start of a list and not a member of a list. It is easier to reason by either calling list_first_entry() against the list fs_info->trans_list, since we can never have more than two transactions in the list, or by calling list_prev_entry() against the transaction. So change that to use the later method. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15	btrfs: tweak extent/chunk allocation for space_info sub-space	Naohiro Aota
	Make the extent allocator and the chunk allocator aware of the sub-space. It now uses BTRFS_SUB_GROUP_DATA_RELOC sub-space for data relocation block group, and uses BTRFS_SUB_GROUP_TREELOG for metadata tree-log block group. And, it needs to check the space_info is the right one when a block group candidate is given. Also, new block group should now belong to the specified one. Now that, block_group->space_info is always set before btrfs_add_bg_to_space_info(), we no longer need to "find" the space_info. So, rename the variable name to address that as well. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15	btrfs: introduce btrfs_space_info sub-group	Naohiro Aota
	Current code assumes we have only one space_info for each block group type (DATA, METADATA, and SYSTEM). We sometime need multiple space infos to manage special block groups. One example is handling the data relocation block group for the zoned mode. That block group is dedicated for writing relocated data and we cannot allocate any regular extent from that block group, which is implemented in the zoned extent allocator. This block group still belongs to the normal data space_info. So, when all the normal data block groups are full and there is some free space in the dedicated block group, the space_info looks to have some free space, while it cannot allocate normal extent anymore. That results in a strange ENOSPC error. We need to have a space_info for the relocation data block group to represent the situation properly. Adds a basic infrastructure for having a "sub-group" of a space_info: creation and removing. A sub-group space_info belongs to one of the primary space_infos and has the same flags as its parent. This commit first introduces the relocation data sub-space_info, and the next commit will introduce tree-log sub-space_info. In the future, it could be useful to implement tiered storage for btrfs e.g. by implementing a sub-group space_info for block groups resides on a fast storage. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15	btrfs: add space_info parameter for block group creation	Naohiro Aota
	Add struct btrfs_space_info parameter to btrfs_make_block_group(), its related functions and related struct. Passed space_info will have a new block group. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15	btrfs: add space_info argument to btrfs_chunk_alloc()	Naohiro Aota
	Take a btrfs_space_info argument in btrfs_chunk_alloc(). New block group will belong to that space_info. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15	btrfs: factor out check_removing_space_info() from btrfs_free_block_groups()	Naohiro Aota
	Factor out check_removing_space_info() from btrfs_free_block_groups(). It sanity checks a to-be-removed space_info. There is no functional change. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15	btrfs: trivial conversion to return bool instead of int	David Sterba
	Old code has a lot of int for bool return values, bool is recommended and done in new code. Convert the trivial cases that do simple 0/false and 1/true. Functions comment are updated if needed. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15	btrfs: use list_first_entry() everywhere	David Sterba
	Using the helper makes it a bit more clear that we're accessing the first list entry. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15	btrfs: rename the functions to search for bits in extent ranges	Filipe Manana
	These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. So add a 'btrfs_' prefix to their name to make it clear they are from btrfs. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15	btrfs: rename set_extent_bit() to include a btrfs prefix	Filipe Manana
	This is an exported function so it should have a 'btrfs_' prefix by convention, to make it clear it's btrfs specific and to avoid collisions with functions from elsewhere in the kernel. So rename it to btrfs_set_extent_bit(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15	btrfs: rename the functions to clear bits for an extent range	Filipe Manana
	These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. One of them has a double underscore prefix which is also discouraged. So remove double underscore prefix where applicable and add a 'btrfs_' prefix to their name to make it clear they are from btrfs. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15	btrfs: do more trivial BTRFS_PATH_AUTO_FREE conversions	David Sterba
	The most trivial pattern for the auto freeing when the variable is declared with the macro and the final btrfs_free_path() is removed. There are almost none goto -> return conversions and there's no other function cleanup. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15	btrfs: remove EXTENT_UPTODATE io tree flag	Filipe Manana
	The EXTENT_UPTODATE io tree flag is now used only to mark ranges in the fs_info->excluded_extents as used by super blocks and not available for extent allocation (to prevent adding those ranges as free space in the in memory space caches). As we can use any flag for that purpose, and we are using EXTENT_DIRTY for the pinned extents io tree for example, remove the EXTENT_UPTODATE flag and use instead EXTENT_DIRTY for the excluded extents io tree. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15	btrfs: stop searching for EXTENT_DIRTY bit in the excluded extents io tree	Filipe Manana
	At btrfs_add_new_free_space() we keep searching for ranges in the excluded extents io tree that have the EXTENT_DIRTY bit set, however we never ever set that bit for ranges in that tree. That is a leftover from when that function used the global freed extents trees (fs_info->freed_extents[2]), where we used both the EXTENT_DIRTY and EXTENT_UPTODATE bits, but those trees are gone with commit fe119a6eeb67 ("btrfs: switch to per-transaction pinned extents"), which introduced the fs_info->excluded_extents io tree, where only EXTENT_UPTODATE is set. So remove the EXTENT_DIRTY bit search at btrfs_add_new_free_space(). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18	btrfs: codify pattern for adding block_group to bg_list	Boris Burkov
	Similar to mark_bg_unused() and mark_bg_to_reclaim(), we have a few places that use bg_list with refcounting, mostly for retrying failures to reclaim/delete unused. These have custom logic for handling locking and refcounting the bg_list properly, but they actually all want to do the same thing, so pull that logic out into a helper. Unfortunately, mark_bg_unused() does still need the NEW flag to avoid prematurely marking stuff unused (even if refcount is fine, we don't want to mess with bg creation), so it cannot use the new helper. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18	btrfs: explicitly ref count block_group on new_bgs list	Boris Burkov
	All other users of the bg_list list_head increment the refcount when adding to a list and decrement it when deleting from the list. Just for the sake of uniformity and to try to avoid refcounting bugs, do it for this list as well. This does not fix any known ref-counting bug, as the reference belongs to a single task (trans_handle is not shared and this represents trans_handle->new_bgs linkage) and will not lose its original refcount while that thread is running. And BLOCK_GROUP_FLAG_NEW protects against ref-counting errors "moving" the block group to the unused list without taking a ref. With that said, I still believe it is simpler to just hold the extra ref count for this list user as well. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18	btrfs: fix block group refcount race in btrfs_create_pending_block_groups()	Boris Burkov
	Block group creation is done in two phases, which results in a slightly unintuitive property: a block group can be allocated/deallocated from after btrfs_make_block_group() adds it to the space_info with btrfs_add_bg_to_space_info(), but before creation is completely completed in btrfs_create_pending_block_groups(). As a result, it is possible for a block group to go unused and have 'btrfs_mark_bg_unused' called on it concurrently with 'btrfs_create_pending_block_groups'. This causes a number of issues, which were fixed with the block group flag 'BLOCK_GROUP_FLAG_NEW'. However, this fix is not quite complete. Since it does not use the unused_bg_lock, it is possible for the following race to occur: btrfs_create_pending_block_groups btrfs_mark_bg_unused if list_empty // false list_del_init clear_bit else if (test_bit) // true list_move_tail And we get into the exact same broken ref count and invalid new_bgs state for transaction cleanup that BLOCK_GROUP_FLAG_NEW was designed to prevent. The broken refcount aspect will result in a warning like: [1272.943527] refcount_t: underflow; use-after-free. [1272.943967] WARNING: CPU: 1 PID: 61 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110 [1272.944731] Modules linked in: btrfs virtio_net xor zstd_compress raid6_pq null_blk [last unloaded: btrfs] [1272.945550] CPU: 1 UID: 0 PID: 61 Comm: kworker/u32:1 Kdump: loaded Tainted: G W 6.14.0-rc5+ #108 [1272.946368] Tainted: [W]=WARN [1272.946585] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 [1272.947273] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs] [1272.947788] RIP: 0010:refcount_warn_saturate+0xba/0x110 [1272.949532] RSP: 0018:ffffbf1200247df0 EFLAGS: 00010282 [1272.949901] RAX: 0000000000000000 RBX: ffffa14b00e3f800 RCX: 0000000000000000 [1272.950437] RDX: 0000000000000000 RSI: ffffbf1200247c78 RDI: 00000000ffffdfff [1272.950986] RBP: ffffa14b00dc2860 R08: 00000000ffffdfff R09: ffffffff90526268 [1272.951512] R10: ffffffff904762c0 R11: 0000000063666572 R12: ffffa14b00dc28c0 [1272.952024] R13: 0000000000000000 R14: ffffa14b00dc2868 R15: 000001285dcd12c0 [1272.952850] FS: 0000000000000000(0000) GS:ffffa14d33c40000(0000) knlGS:0000000000000000 [1272.953458] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1272.953931] CR2: 00007f838cbda000 CR3: 000000010104e000 CR4: 00000000000006f0 [1272.954474] Call Trace: [1272.954655] <TASK> [1272.954812] ? refcount_warn_saturate+0xba/0x110 [1272.955173] ? __warn.cold+0x93/0xd7 [1272.955487] ? refcount_warn_saturate+0xba/0x110 [1272.955816] ? report_bug+0xe7/0x120 [1272.956103] ? handle_bug+0x53/0x90 [1272.956424] ? exc_invalid_op+0x13/0x60 [1272.956700] ? asm_exc_invalid_op+0x16/0x20 [1272.957011] ? refcount_warn_saturate+0xba/0x110 [1272.957399] btrfs_discard_cancel_work.cold+0x26/0x2b [btrfs] [1272.957853] btrfs_put_block_group.cold+0x5d/0x8e [btrfs] [1272.958289] btrfs_discard_workfn+0x194/0x380 [btrfs] [1272.958729] process_one_work+0x130/0x290 [1272.959026] worker_thread+0x2ea/0x420 [1272.959335] ? __pfx_worker_thread+0x10/0x10 [1272.959644] kthread+0xd7/0x1c0 [1272.959872] ? __pfx_kthread+0x10/0x10 [1272.960172] ret_from_fork+0x30/0x50 [1272.960474] ? __pfx_kthread+0x10/0x10 [1272.960745] ret_from_fork_asm+0x1a/0x30 [1272.961035] </TASK> [1272.961238] ---[ end trace 0000000000000000 ]--- Though we have seen them in the async discard workfn as well. It is most likely to happen after a relocation finishes which cancels discard, tears down the block group, etc. Fix this fully by taking the lock around the list_del_init + clear_bit so that the two are done atomically. Fixes: 0657b20c5a76 ("btrfs: fix use-after-free of new block group that became unused") Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18	btrfs: remove unnecessary fs_info argument from btrfs_add_block_group_cache()	Filipe Manana
	The fs_info can be taken from the given block group, so there is no need to pass it as an argument. Also rename the local variable from 'info' to 'fs_info' which is more widely used, more clear and to be more consistent. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18	btrfs: do trivial BTRFS_PATH_AUTO_FREE conversions	David Sterba
	The most trivial pattern for the auto freeing when the variable is declared with the macro and the final btrfs_free_path() is removed. There are almost none goto -> return conversions and there's no other function cleanup. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18	btrfs: fix reclaimed bytes accounting after automatic block group reclaim	Filipe Manana
	We are considering the used bytes counter of a block group as the amount to update the space info's reclaim bytes counter after relocating the block group, but this value alone is often not enough. This is because we may have a reserved extent (or more) and in that case its size is reflected in the reserved counter of the block group - the size of the extent is only transferred from the reserved counter to the used counter of the block group when the delayed ref for the extent is run - typically when committing the transaction (or when flushing delayed refs due to ENOSPC on space reservation). Such call chain for data extents is: btrfs_run_delayed_refs_for_head() run_one_delayed_ref() run_delayed_data_ref() alloc_reserved_file_extent() alloc_reserved_extent() btrfs_update_block_group() -> transfers the extent size from the reserved counter to the used counter For metadata extents: btrfs_run_delayed_refs_for_head() run_one_delayed_ref() run_delayed_tree_ref() alloc_reserved_tree_block() alloc_reserved_extent() btrfs_update_block_group() -> transfers the extent size from the reserved counter to the used counter Since relocation flushes delalloc, waits for ordered extent completion and commits the current transaction before doing the actual relocation work, the correct amount of reclaimed space is therefore the sum of the "used" and "reserved" counters of the block group before we call btrfs_relocate_chunk() at btrfs_reclaim_bgs_work(). So fix this by taking the "reserved" counter into consideration. Fixes: 243192b67649 ("btrfs: report reclaim stats in sysfs") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18	btrfs: get used bytes while holding lock at btrfs_reclaim_bgs_work()	Filipe Manana
	At btrfs_reclaim_bgs_work(), we are grabbing twice the used bytes counter of the block group while not holding the block group's spinlock. This can result in races, reported by KCSAN and similar tools, since a concurrent task can be updating that counter while at btrfs_update_block_group(). So avoid these races by grabbing the counter in a critical section delimited by the block group's spinlock after setting the block group to RO mode. This also avoids using two different values of the counter in case it changes in between each read. This silences KCSAN and is required for the next patch in the series too. Fixes: 243192b67649 ("btrfs: report reclaim stats in sysfs") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18	btrfs: get zone unusable bytes while holding lock at btrfs_reclaim_bgs_work()	Filipe Manana
	At btrfs_reclaim_bgs_work(), we are grabbing a block group's zone unusable bytes while not under the protection of the block group's spinlock, so this can trigger race reports from KCSAN (or similar tools) since that field is typically updated while holding the lock, such as at __btrfs_add_free_space_zoned() for example. Fix this by grabbing the zone unusable bytes while we are still in the critical section holding the block group's spinlock, which is right above where we are currently grabbing it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18	btrfs: unify ordering of btrfs_key initializations	David Sterba
	The btrfs_key is defined as objectid/type/offset and the keys are also printed like that. For better readability, update all key initializations to match this order. Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13	btrfs: block-group: remove unnecessary calls to btrfs_mark_buffer_dirty()	Filipe Manana
	We have several places explicitly calling btrfs_mark_buffer_dirty() but that is not necessarily since the target leaf came from a path that was obtained for a btree search function that modifies the btree, something like btrfs_insert_empty_item() or anything else that ends up calling btrfs_search_slot() with a value of 1 for its 'cow' argument. These just make the code more verbose, confusing and add a little extra overhead and well as increase the module's text size, so remove them. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13	btrfs: update btrfs_add_block_group_cache() to use rb helper	Roger L. Beckermeyer III
	Update fs/btrfs/block-group.c to use rb_find_add_cached(). Suggested-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Roger L. Beckermeyer III <beckerlee3@gmail.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13	btrfs: drop fs_info argument from btrfs_update_space_info_*()	Naohiro Aota
	Since commit e1e577aafe41 ("btrfs: store fs_info in space_info"), we have the fs_info in a space_info. So, we can drop fs_info argument from btrfs_update_space_info_*. There is no behavior change. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11	btrfs: correct typos in multiple comments across various files	Shen Lichuan
	Fix some confusing spelling errors that were currently identified, the details are as follows: block-group.c: 2800: uncompressible ==> incompressible extent-tree.c: 3131: EXTEMT ==> EXTENT extent_io.c: 3124: utlizing ==> utilizing extent_map.c: 1323: ealier ==> earlier extent_map.c: 1325: possiblity ==> possibility fiemap.c: 189: emmitted ==> emitted fiemap.c: 197: emmitted ==> emitted fiemap.c: 203: emmitted ==> emitted transaction.h: 36: trasaction ==> transaction volumes.c: 5312: filesysmte ==> filesystem zoned.c: 1977: trasnsaction ==> transaction Signed-off-by: Shen Lichuan <shenlichuan@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-17	btrfs: zoned: fix zone unusable accounting for freed reserved extent	Naohiro Aota
	When btrfs reserves an extent and does not use it (e.g, by an error), it calls btrfs_free_reserved_extent() to free the reserved extent. In the process, it calls btrfs_add_free_space() and then it accounts the region bytes as block_group->zone_unusable. However, it leaves the space_info->bytes_zone_unusable side not updated. As a result, ENOSPC can happen while a space_info reservation succeeded. The reservation is fine because the freed region is not added in space_info->bytes_zone_unusable, leaving that space as "free". OTOH, corresponding block group counts it as zone_unusable and its allocation pointer is not rewound, we cannot allocate an extent from that block group. That will also negate space_info's async/sync reclaim process, and cause an ENOSPC error from the extent allocation process. Fix that by returning the space to space_info->bytes_zone_unusable. Ideally, since a bio is not submitted for this reserved region, we should return the space to free space and rewind the allocation pointer. But, it needs rework on extent allocation handling, so let it work in this way for now. Fixes: 169e0da91a21 ("btrfs: zoned: track unusable bytes for zones") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10	btrfs: constify more pointer parameters	David Sterba
	Continue adding const to parameters. This is for clarity and minor addition to safety. There are some minor effects, in the assembly code and .ko measured on release config. Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-29	btrfs: zoned: fix zone_unusable accounting on making block group read-write ↵	Naohiro Aota
	again When btrfs makes a block group read-only, it adds all free regions in the block group to space_info->bytes_readonly. That free space excludes reserved and pinned regions. OTOH, when btrfs makes the block group read-write again, it moves all the unused regions into the block group's zone_unusable. That unused region includes reserved and pinned regions. As a result, it counts too much zone_unusable bytes. Fortunately (or unfortunately), having erroneous zone_unusable does not affect the calculation of space_info->bytes_readonly, because free space (num_bytes in btrfs_dec_block_group_ro) calculation is done based on the erroneous zone_unusable and it reduces the num_bytes just to cancel the error. This behavior can be easily discovered by adding a WARN_ON to check e.g, "bg->pinned > 0" in btrfs_dec_block_group_ro(), and running fstests test case like btrfs/282. Fix it by properly considering pinned and reserved in btrfs_dec_block_group_ro(). Also, add a WARN_ON and introduce btrfs_space_info_update_bytes_zone_unusable() to catch a similar mistake. Fixes: 169e0da91a21 ("btrfs: zoned: track unusable bytes for zones") CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11	btrfs: switch btrfs_block_group::inode to struct btrfs_inode	David Sterba
	The structure is internal so we should use struct btrfs_inode for that. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11	btrfs: prevent pathological periodic reclaim loops	Boris Burkov
	Periodic reclaim runs the risk of getting stuck in a state where it keeps reclaiming the same block group over and over. This can happen if 1. reclaiming that block_group fails 2. reclaiming that block_group fails to move any extents into existing block_groups and just allocates a fresh chunk and moves everything. Currently, 1. is a very tight loop inside the reclaim worker. That is critical for edge triggered reclaim or else we risk forgetting about a reclaimable group. On the other hand, with level triggered reclaim we can break out of that loop and get it later. With that fixed, 2. applies to both failures and "successes" with no progress. If we have done a periodic reclaim on a space_info and nothing has changed in that space_info, there is not much point to trying again, so don't, until enough space gets free, which we capture with a heuristic of needing to net free 1 chunk. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11	btrfs: periodic block_group reclaim	Boris Burkov
	We currently employ a edge-triggered block group reclaim strategy which marks block groups for reclaim as they free down past a threshold. With a dynamic threshold, this is worse than doing it in a level-triggered fashion periodically. That is because the reclaim itself happens periodically, so the threshold at that point in time is what really matters, not the threshold at freeing time. If we mark the reclaim in a big pass, then sort by usage and do reclaim, we also benefit from a negative feedback loop preventing unnecessary reclaims as we crunch through the "best" candidates. Since this is quite a different model, it requires some additional support. The edge triggered reclaim has a good heuristic for not reclaiming fresh block groups, so we need to replace that with a typical GC sweep mark which skips block groups that have seen an allocation since the last sweep. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11	btrfs: dynamic block_group reclaim threshold	Boris Burkov
	We can currently recover allocated block_groups by: - explicitly starting balance operations - "auto reclaim" via bg_reclaim_threshold The latter works by checking against a fixed threshold on frees. If we pass from above the threshold to below, relocation triggers and the block group will get reclaimed by the cleaner thread (assuming it is still eligible) Picking a threshold is challenging. Too high, and you end up trying to reclaim very full block_groups which is quite costly, and you don't do reclaim on block_groups that don't get quite THAT full, but could still be quite fragmented and stranding a lot of space. Too low, and you similarly miss out on reclaim even if you badly need it to avoid running out of unallocated space, if you have heavily fragmented block groups living above the threshold. No matter the threshold, it suffers from a workload that happens to bounce around that threshold, which can introduce arbitrary amounts of reclaim waste. To improve this situation, introduce a dynamic threshold. The basic idea behind this threshold is that it should be very lax when there is plenty of unallocated space, and increasingly aggressive as we approach zero unallocated space. To that end, it sets a target for unallocated space (10 chunks) and then linearly increases the threshold as the amount of space short of the target we are increases. The formula is: (target - unalloc) / target I tested this by running it on three interesting workloads: 1. bounce allocations around X% full. 2. fill up all the way and introduce full fragmentation. 3. write in a fragmented way until the filesystem is just about full. 1. and 2. attack the weaknesses of a fixed threshold; fixed either works perfectly or fully falls apart, depending on the threshold. Dynamic always handles these cases well. 3. attacks dynamic by checking whether it is too zealous to reclaim in conditions with low unallocated and low unused. It tends to claw back 1GiB of unallocated fairly aggressively, but not much more. Early versions of dynamic threshold struggled on this test. Additional work could be done to intelligently ratchet up the urgency of reclaim in very low unallocated conditions. Existing mechanisms are already useless in that case anyway. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11	btrfs: report reclaim stats in sysfs	Boris Burkov
	When evaluating various reclaim strategies/thresholds against each other, it is useful to collect data about the amount of reclaim happening. Expose a count, error count, and byte count via sysfs per space_info. Note that this is only for automatic reclaim, not manually invoked balances or other codepaths that use "relocate_block_group" Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11	btrfs: move btrfs_block_group_root() to block-group.c	Anand Jain
	The function btrfs_block_group_root() is declared in disk-io.c; however, all its callers are in block-group.c. Move it to the latter file and declare it static. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-01	btrfs: fix adding block group to a reclaim list and the unused list during ↵	Naohiro Aota
	reclaim There is a potential parallel list adding for retrying in btrfs_reclaim_bgs_work and adding to the unused list. Since the block group is removed from the reclaim list and it is on a relocation work, it can be added into the unused list in parallel. When that happens, adding it to the reclaim list will corrupt the list head and trigger list corruption like below. Fix it by taking fs_info->unused_bgs_lock. [177.504][T2585409] BTRFS error (device nullb1): error relocating ch= unk 2415919104 [177.514][T2585409] list_del corruption. next->prev should be ff1100= 0344b119c0, but was ff11000377e87c70. (next=3Dff110002390cd9c0) [177.529][T2585409] ------------[ cut here ]------------ [177.537][T2585409] kernel BUG at lib/list_debug.c:65! [177.545][T2585409] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI [177.555][T2585409] CPU: 9 PID: 2585409 Comm: kworker/u128:2 Tainted: G W 6.10.0-rc5-kts #1 [177.568][T2585409] Hardware name: Supermicro SYS-520P-WTR/X12SPW-TF, BIOS 1.2 02/14/2022 [177.579][T2585409] Workqueue: events_unbound btrfs_reclaim_bgs_work[btrfs] [177.589][T2585409] RIP: 0010:__list_del_entry_valid_or_report.cold+0x70/0x72 [177.624][T2585409] RSP: 0018:ff11000377e87a70 EFLAGS: 00010286 [177.633][T2585409] RAX: 000000000000006d RBX: ff11000344b119c0 RCX:0000000000000000 [177.644][T2585409] RDX: 000000000000006d RSI: 0000000000000008 RDI:ffe21c006efd0f40 [177.655][T2585409] RBP: ff110002e0509f78 R08: 0000000000000001 R09:ffe21c006efd0f08 [177.665][T2585409] R10: ff11000377e87847 R11: 0000000000000000 R12:ff110002390cd9c0 [177.676][T2585409] R13: ff11000344b119c0 R14: ff110002e0508000 R15:dffffc0000000000 [177.687][T2585409] FS: 0000000000000000(0000) GS:ff11000fec880000(0000) knlGS:0000000000000000 [177.700][T2585409] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [177.709][T2585409] CR2: 00007f06bc7b1978 CR3: 0000001021e86005 CR4:0000000000771ef0 [177.720][T2585409] DR0: 0000000000000000 DR1: 0000000000000000 DR2:0000000000000000 [177.731][T2585409] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:0000000000000400 [177.742][T2585409] PKRU: 55555554 [177.748][T2585409] Call Trace: [177.753][T2585409] <TASK> [177.759][T2585409] ? __die_body.cold+0x19/0x27 [177.766][T2585409] ? die+0x2e/0x50 [177.772][T2585409] ? do_trap+0x1ea/0x2d0 [177.779][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72 [177.788][T2585409] ? do_error_trap+0xa3/0x160 [177.795][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72 [177.805][T2585409] ? handle_invalid_op+0x2c/0x40 [177.812][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72 [177.820][T2585409] ? exc_invalid_op+0x2d/0x40 [177.827][T2585409] ? asm_exc_invalid_op+0x1a/0x20 [177.834][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72 [177.843][T2585409] btrfs_delete_unused_bgs+0x3d9/0x14c0 [btrfs] There is a similar retry_list code in btrfs_delete_unused_bgs(), but it is safe, AFAICS. Since the block group was in the unused list, the used bytes should be 0 when it was added to the unused list. Then, it checks block_group->{used,reserved,pinned} are still 0 under the block_group->lock. So, they should be still eligible for the unused list, not the reclaim list. The reason it is safe there it's because because we're holding space_info->groups_sem in write mode. That means no other task can allocate from the block group, so while we are at deleted_unused_bgs() it's not possible for other tasks to allocate and deallocate extents from the block group, so it can't be added to the unused list or the reclaim list by anyone else. The bug can be reproduced by btrfs/166 after a few rounds. In practice this can be hit when relocation cannot find more chunk space and ends with ENOSPC. Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Suggested-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com> Fixes: 4eb4e85c4f81 ("btrfs: retry block group reclaim without infinite loop") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-06-13	btrfs: retry block group reclaim without infinite loop	Boris Burkov
	If inc_block_group_ro systematically fails (e.g. due to ETXTBUSY from swap) or btrfs_relocate_chunk systematically fails (from lack of space), then this worker becomes an infinite loop. At the very least, this strands the cleaner thread, but can also result in hung tasks/RCU stalls on PREEMPT_NONE kernels and if the reclaim_bgs_lock mutex is not contended. I believe the best long term fix is to manage reclaim via work queue, where we queue up a relocation on the triggering condition and re-queue on failure. In the meantime, this is an easy fix to apply to avoid the immediate pain. Fixes: 7e2718099438 ("btrfs: reinsert BGs failed to reclaim") CC: stable@vger.kernel.org # 6.6+ Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26	btrfs: zoned: don't skip block groups with 100% zone unusable	Johannes Thumshirn
	Commit f4a9f219411f ("btrfs: do not delete unused block group if it may be used soon") changed the behaviour of deleting unused block-groups on zoned filesystems. Starting with this commit, we're using btrfs_space_info_used() to calculate the number of used bytes in a space_info. But btrfs_space_info_used() also accounts btrfs_space_info::bytes_zone_unusable as used bytes. So if a block group is 100% zone_unusable it is skipped from the deletion step. In order not to skip fully zone_unusable block-groups, also check if the block-group has bytes left that can be used on a zoned filesystem. Fixes: f4a9f219411f ("btrfs: do not delete unused block group if it may be used soon") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-05	btrfs: mark btrfs_put_caching_control() static	Lijuan Li
	btrfs_put_caching_control() is only used in block-group.c, so mark it static. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Lijuan Li <lilijuan@iscas.ac.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04	btrfs: handle block group lookup error when it's being removed	David Sterba
	The unlikely case of lookup error in btrfs_remove_block_group() can be handled properly, in its caller this would lead to a transaction abort. We can't do anything else, a block group must have been loaded first. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>