summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)Author
2025-01-13btrfs: free-space-tree: remove unnecessary calls to btrfs_mark_buffer_dirty()Filipe Manana
We have several places explicitly calling btrfs_mark_buffer_dirty() but that is not necessarily since the target leaf came from a path that was obtained for a btree search function that modifies the btree, something ike btrfs_insert_empty_item() or anything else that ends up calling btrfs_search_slot() with a value of 1 for its 'cow' argument. These just make the code more verbose, confusing and add a little extra overhead and well as increase the module's text size, so remove them. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: tree-log: remove unnecessary calls to btrfs_mark_buffer_dirty()Filipe Manana
We have several places explicitly calling btrfs_mark_buffer_dirty() but that is not necessarily since the target leaf came from a path that was obtained for a btree search function that modifies the btree, something like btrfs_insert_empty_item() or anything else that ends up calling btrfs_search_slot() with a value of 1 for its 'cow' argument. These just make the code more verbose, confusing and add a little extra overhead and well as increase the module's text size, so remove them. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: uncollapse transaction aborts during renamesFilipe Manana
During renames we are grouping transaction aborts that can be due to a failure of one of several function calls. While this makes the code less verbose, it makes it harder to debug as we end up not knowing from which function call we got an error. So change this to trigger a transaction abort after each function call failure, so that when we get a transaction abort message we know exactly which function call failed, helping us to debug issues. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: validate system chunk array at btrfs_validate_super()Qu Wenruo
Currently btrfs_validate_super() only does a very basic check on the array chunk size (not too large than the available space, but not too small to contain no chunk). The more comprehensive checks (the regular chunk checks and size check inside the system chunk array) are all done inside btrfs_read_sys_array(). It's not a big deal, but it also means we do not do any validation on the system chunk array at super block writeback time either. Do the following modification to centralize the system chunk array checks into btrfs_validate_super(): - Make chunk_err() helper accept stack chunk pointer If @leaf parameter is NULL, then the @chunk pointer will be a pointer to the chunk item, other than the offset inside the leaf. And since @leaf can be NULL, add a new @fs_info parameter for that case. - Make btrfs_check_chunk_valid() handle stack chunk pointer The same as chunk_err(), a new @fs_info parameter, and if @leaf is NULL, then @chunk will be a pointer to a stack chunk. If @chunk is NULL, then all needed btrfs_chunk members will be read using the stack helper instead of the leaf helper. This means we need to read out all the needed member at the beginning of the function. Furthermore, at super block read time, fs_info->sectorsize is not yet initialized, we need one extra @sectorsize parameter to grab the correct sectorsize. - Introduce a helper validate_sys_chunk_array() * Validate the disk key. * Validate the size before we access the full chunk items. * Do the full chunk item validation. - Call validate_sys_chunk_array() at btrfs_validate_super() - Simplify the checks inside btrfs_read_sys_array() Now the checks will be converted to an ASSERT(). - Simplify the checks inside read_one_chunk() Now that all chunk items inside system chunk array and chunk tree are verified, there is no need to verify them again inside read_one_chunk(). This change has the following advantages: - More comprehensive checks at write time And unlike the sys_chunk_array read routine, this time we do not need to allocate a dummy extent buffer to do the check. All the checks done here require no new memory allocation. - Slightly improved readability when iterating the system chunk array Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: update tree_insert() to use rb helpersRoger L. Beckermeyer III
Update tree_insert() to use rb_find_add_cached(). add cmp_refs_node in rb_find_add_cached() to compare. Since we're here, also make comp_data_refs() and comp_refs() accept both parameters as const. Signed-off-by: Roger L. Beckermeyer III <beckerlee3@gmail.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: update btrfs_add_chunk_map() to use rb helpersRoger L. Beckermeyer III
Update btrfs_add_chunk_map() to use rb_find_add_cached(). Signed-off-by: Roger L. Beckermeyer III <beckerlee3@gmail.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: update __btrfs_add_delayed_item() to use rb helperRoger L. Beckermeyer III
Update __btrfs_add_delayed_item() to use rb_find_add_cached(). Signed-off-by: Roger L. Beckermeyer III <beckerlee3@gmail.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: update prelim_ref_insert() to use rb helpersRoger L. Beckermeyer III
Update prelim_ref_insert() to use rb_find_add_cached(). There is a special change that the existing prelim_ref_compare() is called with the first parameter as the existing ref in the rbtree. But the newer rb_find_add_cached() expects the cmp() function to have the first parameter as the to-be-added node, thus the new helper prelim_ref_rb_add_cmp() need to adapt this new order. Signed-off-by: Roger L. Beckermeyer III <beckerlee3@gmail.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: update btrfs_add_block_group_cache() to use rb helperRoger L. Beckermeyer III
Update fs/btrfs/block-group.c to use rb_find_add_cached(). Suggested-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Roger L. Beckermeyer III <beckerlee3@gmail.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: don't include linux/rwlock_types.h directlyWolfram Sang
The header clearly states that it does not want to be included directly, only via linux/spinlock_types.h. Drop this as we can simply use the spinlock.h which is already included. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: handle free space tree rebuild in multiple transactionsQu Wenruo
During free space tree rebuild, we're holding a transaction handle for the whole rebuild process. This can lead to blocked task warning, e.g. btrfs-transaction kthread (which is already created before btrfs_start_pre_rw_mount()) can be waked up to join and commit the current transaction. But the free space tree rebuild process may need to go through thousands block groups, this will block btrfs-transaction kthread for a long time. Fix the problem by calling btrfs_should_end_transaction() after each block group, so that we won't hold the transaction handle too long. And since the free-space-tree rebuild can be split into multiple transactions, we need to consider the safety when the rebuild process is interrupted. Thankfully since we only set the FREE_SPACE_TREE compat_ro flag without FREE_SPACE_TREE_VALID flag, even if the rebuild is interrupted, on the next RW mount, we will still go rebuild the free space tree, by deleting any items we have and re-starting the rebuild from scratch. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: use uuid_is_null() to verify if an uuid is emptyFilipe Manana
At btrfs_is_empty_uuid() we have our custom code to check if an uuid is empty, however there a kernel uuid library that has a function named uuid_is_null() which does the same and probably more efficient. So change btrfs_is_empty_uuid() to use uuid_is_null(), which is almost a directly replacement, it just wraps the necessary casting since our uuid types are u8 arrays while the uuid kernel library uses the uuid_t type, which is just a typedef of an u8 array of 16 elements as well. Also since the function is now to trivial, make it a static inline function in fs.h. Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: remove pointless comment from ctree.hFilipe Manana
It's pointless to have a comment above the prototype declarations of btrfs_ctree_init() and btrfs_ctree_exit() mentioning that they are declared in ctree.c. This is from the old days when ctree.h was used to place anything that didn't fit in any other file. So remove it. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: move extent-tree function declarations out of ctree.hFilipe Manana
We have 3 functions that have their prototypes declared in ctree.h but they are defined at extent-tree.c and they are unrelated to the btree data structure. Move the prototypes out of ctree.h and into extent-tree.h. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: move btrfs_alloc_write_mask() into fs.hFilipe Manana
Currently btrfs_alloc_write_mask() is defined in ctree.h but it's not related at all to the btree data structure, so move it into fs.h. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: move BTRFS_BYTES_TO_BLKS() into fs.hFilipe Manana
Currently BTRFS_BYTES_TO_BLKS() is defined in ctree.h but it's not related at all to the btree data structure, so move it into fs.h. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: move the folio ordered helpers from ctree.h into fs.hFilipe Manana
The folio ordered helper macros are defined at ctree.h but this is not the best place since ctree.{h,c} is all about the btree data structure implementation and not a generic module. So move these macros into the fs.h header. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: move btrfs_is_empty_uuid() from ioctl.c into fs.cFilipe Manana
It's a generic helper not specific to ioctls and used in several places, so move it out from ioctl.c and into fs.c. While at it change its return type from int to bool and declare the loop variable in the loop itself. This also slightly reduces the module's size. Before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1781492 161037 16920 1959449 1de619 fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1781340 161037 16920 1959297 1de581 fs/btrfs/btrfs.ko Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: move the exclusive operation functions into fs.cFilipe Manana
The declarations for the exclusive operation functions are located at fs.h but their definitions are in ioctl.c, which doesn't make much sense since (most of them) are used in several files other than ioctl.c. Since they are used in several files and they are generic enough, move them out of ioctl.c and into fs.c, even the ones that are currently only used at ioctl.c, for the sake of having them all in the same C file. This also reduces the module's size. Before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1782094 161045 16920 1960059 1de87b fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1781492 161037 16920 1959449 1de619 fs/btrfs/btrfs.ko Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: move csum related functions from ctree.c into fs.cFilipe Manana
The ctree module is about the implementation of the btree data structure and not a place holder for generic filesystem things like the csum algorithm details. Move the functions related to the csum algorithm details away from ctree.c and into fs.c, which is a far better place for them. Also fix missing punctuation in comments and change one multiline comment to a single line comment since everything fits in under 80 characters. For some reason this also slightly reduces the module's size. Before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1782126 161045 16920 1960091 1de89b fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1782094 161045 16920 1960059 1de87b fs/btrfs/btrfs.ko Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: move abort_should_print_stack() to transaction.hFilipe Manana
The function abort_should_print_stack() is declared in transaction.h but its definition is in ctree.c, which doesn't make sense since ctree.c is the btree implementation and the function is related to the transaction code. Move its definition into transaction.h as an inline function since it's a very short and trivial function, and also add the 'btrfs_' prefix into its name. This change also reduces the module size. Before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1783148 161137 16920 1961205 1decf5 fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1782126 161045 16920 1960091 1de89b fs/btrfs/btrfs.ko Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: pass btrfs_io_geometry to is_single_device_ioJohannes Thumshirn
Now that we have the stripe tree decision saved in struct btrfs_io_geometry we can pass it into is_single_device_io() and get rid of another call to btrfs_need_raid_stripe_tree_update(). Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: cache RAID stripe tree decision in btrfs_io_contextJohannes Thumshirn
Cache the decision if a particular I/O needs to update RAID stripe tree entries in struct btrfs_io_context. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: cache stripe tree usage in struct btrfs_io_geometryJohannes Thumshirn
Cache the return of btrfs_need_stripe_tree_update() in struct btrfs_io_geometry starting from btrfs_map_block(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: add assertions and comment about path expectations to ↵Filipe Manana
btrfs_cross_ref_exist() We should always call check_delayed_ref() with a path having a locked leaf from the extent tree where either the extent item is located or where it should be located in case it doesn't exist yet (when there's a pending unflushed delayed ref to do it), as we need to lock any existing delayed ref head while holding such leaf locked in order to avoid races with flushing delayed references, which could make us think an extent is not shared when it really is. So add some assertions and a comment about such expectations to btrfs_cross_ref_exist(), which is the only caller of check_delayed_ref(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: add function comment for check_committed_ref()Filipe Manana
There are some not immediately obvious details about the operation of check_committed_ref(), namely that when it returns 0 it must return with the path having a locked leaf from the extent tree that contains the extent's extent item, so that we can later check for delayed refs when calling check_delayed_ref() in a way that doesn't race with a task running delayed references. For similar reasons, it must also return with a locked leaf when the extent item is not found, and that leaf is where the extent item should be located, because we may have delayed references that are going to create the extent item. Also document that the function can return false positives in order to not be too slow, and that the most important is to not return false negatives. So add a function comment to check_committed_ref(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: simplify arguments for btrfs_cross_ref_exist()Filipe Manana
Instead of passing a root and an objectid which matches an inode number, pass the inode instead, since the root is always the root associated to the inode and the objectid is the number of that inode. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: simplify return logic at check_committed_ref()Filipe Manana
Instead of setting the value to return in a local variable 'ret' and then jumping into a label named 'out' that does nothing but return that value, simplify everything by getting rid of the label and directly returning a value. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: avoid redundant call to get inline ref type at check_committed_ref()Filipe Manana
At check_committed_ref() we are calling btrfs_get_extent_inline_ref_type() twice, once before we check if have an inline extent owner ref (for simple qgroups) and then once again sometime after that check. This second call is redundant when we have simple quotas disabled or we found an inline ref that is not of the owner ref type. So avoid this second call unless we have simple quotas enabled and found an owner ref, saving a function call that does inline ref validation again. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: remove the snapshot check from check_committed_ref()Filipe Manana
At check_committed_ref() we have this check to see if the data extent was created in a generation lower than or equals to the generation where the last snapshot for the root was created, and if so we return immediately with 1, since it's very likely the extent is shared, referenced by other root. The only call chain for check_committed_ref() is the following: can_nocow_file_extent() btrfs_cross_ref_exist() check_committed_ref() And we already do that snapshot check at can_nocow_file_extent(), before we call btrfs_cross_ref_exist(). This makes the check done at check_committed_ref() redundant, so remove it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: remove no longer needed strict argument from can_nocow_extent()Filipe Manana
All callers of can_nocow_extent() now pass a value of false for its 'strict' argument, making it redundant. So remove the argument from can_nocow_extent() as well as can_nocow_file_extent(), btrfs_cross_ref_exist() and check_committed_ref(), because this argument was used just to influence the behavior of check_committed_ref(). Also remove the 'strict' field from struct can_nocow_file_extent_args, which is now always false as well, as its value is taken from the argument to can_nocow_extent(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: remove unused variable length in btrfs_insert_one_raid_extent()Johannes Thumshirn
Remove the variable length in btrfs_insert_one_raid_extent() as it is unused. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: output the reason for open_ctree() failureQu Wenruo
There is a recent ML report that mounting a large fs backed by hardware RAID56 controller (with one device missing) took too much time, and systemd seems to kill the mount attempt. In that case, the only error message is: BTRFS error (device sdj): open_ctree failed There is no reason on why the failure happened, making it very hard to understand the reason. At least output the error number (in the particular case it should be -EINTR) to provide some clue. Link: https://lore.kernel.org/linux-btrfs/9b9c4d2810abcca2f9f76e32220ed9a90febb235.camel@scientia.org/ Reported-by: Christoph Anton Mitterer <calestyo@scientia.org> Cc: stable@vger.kernel.org Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: open-code btrfs_copy_from_user()Qu Wenruo
The function btrfs_copy_from_user() handles the folio dirtying for buffered write. The original design is to allow that function to handle multiple folios, but since commit c87c299776e4 ("btrfs: make buffered write to copy one page a time") there is no need to support multiple folios. So here open-code btrfs_copy_from_user() to copy_folio_from_iter_atomic() and flush_dcache_folio() calls. The short-copy check and revert are still kept as-is. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: improve the warning and error message for btrfs_remove_qgroup()Qu Wenruo
[WARNING] There are several warnings about the recently introduced qgroup auto-removal that it triggers WARN_ON() for the non-zero rfer/excl numbers, e.g: ------------[ cut here ]------------ WARNING: CPU: 67 PID: 2882 at fs/btrfs/qgroup.c:1854 btrfs_remove_qgroup+0x3df/0x450 CPU: 67 UID: 0 PID: 2882 Comm: btrfs-cleaner Kdump: loaded Not tainted 6.11.6-300.fc41.x86_64 #1 RIP: 0010:btrfs_remove_qgroup+0x3df/0x450 Call Trace: <TASK> btrfs_qgroup_cleanup_dropped_subvolume+0x97/0xc0 btrfs_drop_snapshot+0x44e/0xa80 btrfs_clean_one_deleted_snapshot+0xc3/0x110 cleaner_kthread+0xd8/0x130 kthread+0xd2/0x100 ret_from_fork+0x34/0x50 ret_from_fork_asm+0x1a/0x30 </TASK> ---[ end trace 0000000000000000 ]--- BTRFS warning (device sda): to be deleted qgroup 0/319 has non-zero numbers, rfer 258478080 rfer_cmpr 258478080 excl 0 excl_cmpr 0 [CAUSE] Although the root cause is still unclear, as if qgroup is consistent a fully dropped subvolume (with extra transaction committed) should lead to all zero numbers for the qgroup. My current guess is the subvolume drop triggered the new subtree drop threshold thus marked qgroup inconsistent, then rescan cleared it but some corner case is not properly handled during subvolume dropping. But at least for this particular case, since it's only the rfer/excl not properly reset to 0, and qgroup is already marked inconsistent, there is nothing to be worried for the end users. The user space tool utilizing qgroup would queue a rescan to handle everything, so the kernel wanring is a little overkilled. [ENHANCEMENT] Enhance the warning inside btrfs_remove_qgroup() by: - Only do WARN() if CONFIG_BTRFS_DEBUG is enabled As explained the kernel can handle inconsistent qgroups by simply do a rescan, there is nothing to bother the end users. - Treat the reserved space leak the same as non-zero numbers By outputting the values and trigger a WARN() if it's a debug build. So far I haven't experienced any case related to reserved space so I hope we will never need to bother them. Fixes: 839d6ea4f86d ("btrfs: automatically remove the subvolume qgroup") Link: https://github.com/kdave/btrfs-progs/issues/922 Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: remove detached list from struct btrfs_backref_cacheJosef Bacik
We don't ever look at this list, remove it. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: remove the ->lowest and ->leaves members from struct btrfs_backref_nodeJosef Bacik
Before we were keeping all of our nodes on various lists in order to make sure everything got cleaned up correctly. We used node->lowest to indicate that node->lower was linked into the cache->leaves list. Now that we do cleanup based on the rb-tree both the list and the flag are useless, so delete them both. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: simplify btrfs_backref_release_cache()Josef Bacik
We rely on finding all our nodes on the various lists in the backref cache, when they are all also in the rbtree. Instead just search through the rbtree and free everything. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: do not handle non-shareable roots in backref cacheJosef Bacik
Now that we handle relocation for non-shareable roots without using the backref cache, remove the ->cowonly field from the backref nodes and update the handling to throw an error. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: don't build backref tree for COW-only blocksJosef Bacik
We already determine the owner for any blocks we find when we're relocating, and for COW-only blocks (and the data reloc tree) we COW down to the block and call it good enough. However we still build a whole backref tree for them, even though we're not going to use it, and then just don't put these blocks in the cache. Rework the code to check if the block belongs to a COW-only root or the data reloc root, and then just cow down to the block, skipping the backref cache generation. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: remove clone_backref_node() from relocationJosef Bacik
Since we no longer maintain backref cache across transactions, and this is only called when we're creating the reloc root for a newly created snapshot in the transaction critical section, we will end up doing a bunch of work that will just get thrown away when we start the transaction in the relocation loop. Delete this code as it no longer does anything for us. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: simplify loop in select_reloc_root()Josef Bacik
We have this setup as a loop, but in reality we will never walk back up the backref tree, if we do then it's a bug. Get rid of the loop and handle the case where we have node->new_bytenr set at all. Previous check was only if node->new_bytenr != root->node->start, but if it did then we would hit the WARN_ON() and walk back up the tree. Instead we want to just return error if ->new_bytenr is set, and then do the normal updating of the node for the reloc root and carry on. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: add a comment for new_bytenr in backref_cache_nodeJosef Bacik
Add a comment for this field so we know what it is used for. Previously we used it to update the backref cache, so people may mistakenly think it is useless, but in fact exists to make sure the backref cache makes sense. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: remove the changed list for backref cacheJosef Bacik
Now that we're not updating the backref cache when we switch transids we can remove the changed list. We're going to keep the new_bytenr field because it serves as a good sanity check for the backref cache and relocation, and can prevent us from making extent tree corruption worse. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: convert BUG_ON in btrfs_reloc_cow_block() to proper error handlingJosef Bacik
This BUG_ON is meant to catch backref cache problems, but these can arise from either bugs in the backref cache or corruption in the extent tree. Fix it to be a proper error. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: fix data race when accessing the inode's disk_i_size at ↵Hao-ran Zheng
btrfs_drop_extents() A data race occurs when the function `insert_ordered_extent_file_extent()` and the function `btrfs_inode_safe_disk_i_size_write()` are executed concurrently. The function `insert_ordered_extent_file_extent()` is not locked when reading inode->disk_i_size, causing `btrfs_inode_safe_disk_i_size_write()` to cause data competition when writing inode->disk_i_size, thus affecting the value of `modify_tree`. The specific call stack that appears during testing is as follows: ============DATA_RACE============ btrfs_drop_extents+0x89a/0xa060 [btrfs] insert_reserved_file_extent+0xb54/0x2960 [btrfs] insert_ordered_extent_file_extent+0xff5/0x1760 [btrfs] btrfs_finish_one_ordered+0x1b85/0x36a0 [btrfs] btrfs_finish_ordered_io+0x37/0x60 [btrfs] finish_ordered_fn+0x3e/0x50 [btrfs] btrfs_work_helper+0x9c9/0x27a0 [btrfs] process_scheduled_works+0x716/0xf10 worker_thread+0xb6a/0x1190 kthread+0x292/0x330 ret_from_fork+0x4d/0x80 ret_from_fork_asm+0x1a/0x30 ============OTHER_INFO============ btrfs_inode_safe_disk_i_size_write+0x4ec/0x600 [btrfs] btrfs_finish_one_ordered+0x24c7/0x36a0 [btrfs] btrfs_finish_ordered_io+0x37/0x60 [btrfs] finish_ordered_fn+0x3e/0x50 [btrfs] btrfs_work_helper+0x9c9/0x27a0 [btrfs] process_scheduled_works+0x716/0xf10 worker_thread+0xb6a/0x1190 kthread+0x292/0x330 ret_from_fork+0x4d/0x80 ret_from_fork_asm+0x1a/0x30 ================================= The main purpose of the check of the inode's disk_i_size is to avoid taking write locks on a btree path when we have a write at or beyond EOF, since in these cases we don't expect to find extent items in the root to drop. However if we end up taking write locks due to a data race on disk_i_size, everything is still correct, we only add extra lock contention on the tree in case there's concurrency from other tasks. If the race causes us to not take write locks when we actually need them, then everything is functionally correct as well, since if we find out we have extent items to drop and we took read locks (modify_tree set to 0), we release the path and retry again with write locks. Since this data race does not affect the correctness of the function, it is a harmless data race, use data_race() to check inode->disk_i_size. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Hao-ran Zheng <zhenghaoran154@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: don't BUG_ON() in btrfs_drop_extents()Johannes Thumshirn
btrfs_drop_extents() calls BUG_ON() in case the counter of to be deleted extents is greater than 0. But all of these code paths can handle errors, so there's no need to crash the kernel. Instead WARN() that the condition has been met and gracefully bail out. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: zoned: reclaim unused zone by zone resettingNaohiro Aota
On the zoned mode, once used and freed region is still not reusable after the freeing. The underlying zone needs to be reset before reusing. Btrfs resets a zone when it removes a block group, and then new block group is allocated on the zones to reuse the zones. But, it is sometime too late to catch up with a write side. This commit introduces a new space-info reclaim method ZONE_RESET. That will pick a block group from the unused list and reset its zone to reuse the zone_unusable space. It is faster than removing the block group and re-creating a new block group on the same zones. For the first implementation, the ZONE_RESET is only applied to a block group whose region is fully zone_unusable. Reclaiming partial zone_unusable block group could be implemented later. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: drop fs_info argument from btrfs_update_space_info_*()Naohiro Aota
Since commit e1e577aafe41 ("btrfs: store fs_info in space_info"), we have the fs_info in a space_info. So, we can drop fs_info argument from btrfs_update_space_info_*. There is no behavior change. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: factor out btrfs_return_free_space()Naohiro Aota
Factor out a part of unpin_extent_range() that returns space back to the space info, prioritizing global block reserve. Also, move the "len" variable into the loop to clarify we don't need to carry it beyond an iteration. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>