summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)Author
2023-08-21btrfs: reduce debug spam from submit_compressed_extentsChristoph Hellwig
Move the printk that is supposed to help to debug failures in submit_one_async_extent into submit_one_async_extent and make it coniditonal on actually having an error condition instead of spamming the log unconditionally. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: remove end_extent_writepageChristoph Hellwig
end_extent_writepage is a small helper that combines a call to btrfs_mark_ordered_io_finished with conditional error-only calls to btrfs_page_clear_uptodate and mapping_set_error with a somewhat unfortunate calling convention that passes and inclusive end instead of the len expected by the underlying functions. Remove end_extent_writepage and open code it in the 4 callers. Out of those two already are error-only and thus don't need the extra conditional, and one already has the mapping_set_error, so a duplicate call can be avoided. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: remove btrfs_writepage_endio_finish_orderedChristoph Hellwig
btrfs_writepage_endio_finish_ordered is a small wrapper around btrfs_mark_ordered_io_finished that just changs the argument passing slightly, and adds a tracepoint. Move the tracpoint to btrfs_mark_ordered_io_finished, which means it now also covers the error handling in btrfs_cleanup_ordered_extent and switch all callers to just call btrfs_mark_ordered_io_finished directly. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: split page locking out of __process_pages_contigChristoph Hellwig
There is a lot of complexity in __process_pages_contig to deal with the PAGE_LOCK case that can return an error unlike all the other actions. Open code the page iteration for page locking in lock_delalloc_pages and remove all the now unused code from __process_pages_contig. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: don't create inline extents in fallback_to_cowChristoph Hellwig
For NOCOW files, run_delalloc_nocow can still fall back to COW allocations when required and calls to fallback_to_cow helper for that. For such an allocation we can have multiple ordered_extents for existing extents that NOCOW overwrites and new allocations that fallback_to_cow creates. If one of the new extents is an inline extent, the writepages could would have to avoid normal page writeback for them as indicated by the page_started return argument, which run_delalloc_nocow can't return. Fix this by never creating inline extents from fallback_to_cow. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: pass a flags argument to cow_file_rangeChristoph Hellwig
The int used as bool unlock is not a very good way to describe the behavior, and the next patch will have to add another behavior modifier. We'll do that by two bool parameters instead of adding bit flags. Now specifies that the pages should always be kept locked. This is the inverse of the old unlock argument. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> [ switch flags to bool ] Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: fix start transaction qgroup rsv double freeBoris Burkov
btrfs_start_transaction reserves metadata space of the PERTRANS type before it identifies a transaction to start/join. This allows flushing when reserving that space without a deadlock. However, it results in a race which temporarily breaks qgroup rsv accounting. T1 T2 start_transaction do_stuff start_transaction qgroup_reserve_meta_pertrans commit_transaction qgroup_free_meta_all_pertrans hit an error starting txn goto reserve_fail qgroup_free_meta_pertrans (already freed!) The basic issue is that there is nothing preventing another commit from committing before start_transaction finishes (in fact sometimes we intentionally wait for it) so any error path that frees the reserve is at risk of this race. While this exact space was getting freed anyway, and it's not a huge deal to double free it (just a warning, the free code catches this), it can result in incorrectly freeing some other pertrans reservation in this same reservation, which could then lead to spuriously granting reservations we might not have the space for. Therefore, I do believe it is worth fixing. To fix it, use the existing prealloc->pertrans conversion mechanism. When we first reserve the space, we reserve prealloc space and only when we are sure we have a transaction do we convert it to pertrans. This way any racing commits do not blow away our reservation, but we still get a pertrans reservation that is freed when _this_ transaction gets committed. This issue can be reproduced by running generic/269 with either qgroups or squotas enabled via mkfs on the scratch device. Reviewed-by: Josef Bacik <josef@toxicpanda.com> CC: stable@vger.kernel.org # 5.10+ Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: free qgroup rsv on io failureBoris Burkov
If we do a write whose bio suffers an error, we will never reclaim the qgroup reserved space for it. We allocate the space in the write_iter codepath, then release the reservation as we allocate the ordered extent, but we only create a delayed ref if the ordered extent finishes. If it has an error, we simply leak the rsv. This is apparent in running any error injecting (dmerror) fstests like btrfs/146 or btrfs/160. Such tests fail due to dmesg on umount complaining about the leaked qgroup data space. When we clean up other aspects of space on failed ordered_extents, also free the qgroup rsv. Reviewed-by: Josef Bacik <josef@toxicpanda.com> CC: stable@vger.kernel.org # 5.10+ Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: remove duplicate free_async_extent_pages() on reservation errorGoldwyn Rodrigues
While performing compressed writes, if the extent reservation fails, the async extent pages are first freed in the error check for return value ret, and then again at out_free label. Remove the first call to free_async_extent_pages(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: move eb subpage preallocation out of the loopQu Wenruo
Initially we preallocate btrfs_subpage structure in the main loop of alloc_extent_buffer(). But later commit fbca46eb46ec ("btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routine") has made sure we only go subpage routine if our nodesize is smaller than PAGE_SIZE. This means for that case, we only need to allocate the subpage structure once anyway. So this patch would make the preallocation out of the main loop. This would slightly reduce the workload when we hold the page lock, and make code a little easier to read. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: simplify the no-bioc fast path condition in btrfs_map_blockChristoph Hellwig
nr_alloc_stripes can't be one if we are writing to a replacement device, as it is incremented for that case right above. Remove the duplicate checks. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: scrub: remove unused btrfs_path in scrub_simple_mirror()Qu Wenruo
The @path in scrub_simple_mirror() is no longer utilized after commit e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure"). Before that commit, we call find_first_extent_item() directly, which needs a path and that path can be reused. But after that switch commit, the extent search is done inside queue_scrub_stripe(), which will no longer accept a path from outside. So the @path variable can be safely removed. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ remove the stale comment ] Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: use folio_next_index() helper in extent_write_cache_pagesMinjie Du
Simplify code pattern of 'folio->index + folio_nr_pages(folio)' by using the existing helper folio_next_index(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Minjie Du <duminjie@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: use helper sizeof_field in struct accessorsDavid Sterba
There's a helper for obtaining size of a struct member, we can use it instead of open coding the pointer magic. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: deprecate integrity checker featureQu Wenruo
The integrity checker feature needs to be enabled at compile time (BTRFS_FS_CHECK_INTEGRITY) and then enabled by mount options check_int*. Although it provides some unique features which can not be provided by any other sanity checks like tree-checker, it does not only have high CPU and memory overhead, but is also a maintenance burden. For example, it's the only caller of btrfs_map_block() with @need_raid_map = 0. Considering most btrfs developers are not even testing this feature, I'm here to propose deprecation of this feature. For now only warning messages will be printed, the feature itself would still work. Removal time has been set to 6.7 release. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: move btrfs_free_excluded_extents() into block-group.cFilipe Manana
The function btrfs_free_excluded_extents() is only used by block-group.c, so move it into block-group.c and make it static. Also removed unnecessary variables that are used only once. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: open code trivial btrfs_add_excluded_extent()Filipe Manana
The code for btrfs_add_excluded_extent() is trivial, it's just a set_extent_bit() call. However it's defined in extent-tree.c but it is only used (twice) in block-group.c. So open code it in block-group.c, reducing the need to export a trivial function. Also since the only caller btrfs_add_excluded_extent() is prepared to deal with errors, stop ignoring errors from the set_extent_bit() call. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: make find_first_extent_bit() return a booleanFilipe Manana
Currently find_first_extent_bit() returns a 0 if it found a range in the given io tree and 1 if it didn't find any. There's no need to return any errors, so make the return value a boolean and invert the logic to make more sense: return true if it found a range and false if it didn't find any range. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: make btrfs_destroy_pinned_extent() return voidFilipe Manana
Currently btrfs_destroy_pinned_extent() is always returning 0 no matter what and its caller ignores its return value (as well everything up in the call chain). This is because this is called in the transaction abort path, where we can't even deal with any errors since we are in a critical situation already and cleanup of resources is done in a best effort fashion. So make btrfs_destroy_pinned_extent() return void. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: make btrfs_destroy_marked_extents() return voidFilipe Manana
Currently btrfs_destroy_marked_extents() is returning the value of the last call to find_first_extent_bit(), which returns a value of 1 meaning no more ranges found the dirty pages io tree. This value is useless to the single caller of btrfs_destroy_marked_extents(), which ignores any return value from btrfs_destroy_marked_extents(). This is because it's only used in the transaction abort path, where we can't even deal with any errors since we are in a critical situation already and cleanup of resources is done in a best effort fashion. So make btrfs_destroy_marked_extents() return void. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: rename add_new_free_space() to btrfs_add_new_free_space()Filipe Manana
Since add_new_free_space() is exported, used outside block-group.c, rename it to include the 'btrfs_' prefix. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: update documentation for add_new_free_space()Filipe Manana
The documentation for add_new_free_space() is stale and no longer correct: 1) It's no longer used only when caching a block group. It's also called when creating a block group (btrfs_make_block_group()), when reading a block group at mount time (read_one_block_group()) and when reading the free space tree for a block group (typically the first time we attempt to allocate from the block group); 2) It has nothing to do with pinned extents. It only deals with the excluded extents io tree, which is used to track the locations of super blocks in order to make sure we never add the location of a super block to the free space cache of a block group. So update the documention and also add a description of the arguments and return values. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: tracepoints: simplify raid56 eventsQu Wenruo
After commit 6bfd0133bee2 ("btrfs: raid56: switch scrub path to use a single function"), the raid56 implementation no longer uses different endio functions for RMW/recover/scrub. All read operations end in submit_read_wait_bio_list(), while all write operations end in submit_write_bios(). This means quite some trace events are out-of-date and no longer utilized. This patch would unify the trace events into just two: - trace_raid56_read() Replaces trace_raid56_read_partial(), trace_raid56_scrub_read() and trace_raid56_scrub_read_recover(). - trace_raid56_write() Replaces trace_raid56_write_stripe() and trace_raid56_scrub_write_stripe(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: sysfs: show if ACL support has been compiled inAnand Jain
ACL support depends on the compile-time configuration option CONFIG_BTRFS_FS_POSIX_ACL. Prior to mounting a btrfs filesystem, it is not possible to determine whether ACL support has been compiled in. To address this, add a sysfs interface, /sys/fs/btrfs/features/acl, and check for ACL support in the system's btrfs. To determine ACL support: Return 0 indicates ACL is not supported: $ cat /sys/fs/btrfs/features/acl 0 Return 1 indicates ACL is supported: $ cat /sys/fs/btrfs/features/acl 1 IMO, this is a better approach, so that we also know if kernel is older. On an older kernel $ ls /sys/fs/btrfs/features/acl ls: cannot access '/sys/fs/btrfs/features/acl': No such file or directory mount a btrfs filesystem $ cat /proc/self/mounts | grep btrfs | grep -q noacl $ echo $? 0 Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: raid56: remove unused BTRFS_RBIO_REBUILD_MISSINGQu Wenruo
Commit aca43fe839e4 ("btrfs: remove unused raid56 functions which were dedicated for scrub") removed the special handling of RAID56 scrub for missing device. As scrub goes full mirror_num based recovery, that means if it hits a missing device in RAID56, it would just try the next mirror, which would go through the BTRFS_RBIO_READ_REBUILD operation. This means there is no longer any use of BTRFS_RBIO_REBUILD_MISSING operation and we can safely remove it. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: add comments for btrfs_map_block()Qu Wenruo
The function btrfs_map_block() is a critical part of the btrfs storage layer, which handles mapping of logical ranges to physical ranges. Thus it's better to have some basic explanation, especially on the following points: - Segment split by various boundaries As a continuous logical range may be split into different segments, due to various factors like zones and RAID0/5/6/10 boundaries. - The meaning of @mirror_num - The possible single stripe optimization - One deprecated parameter @need_raid_map Just explicitly mark it deprecated so we're aware of the problem. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: remove redundant initialization of variables in log_new_ancestorsColin Ian King
The variables leaf and slot are initialized when declared but the values assigned to them are never read as they are being re-assigned later on. The initializations are redundant and can be removed. Cleans up clang scan build warnings: fs/btrfs/tree-log.c:6797:25: warning: Value stored to 'leaf' during its initialization is never read [deadcode.DeadStores] fs/btrfs/tree-log.c:6798:7: warning: Value stored to 'slot' during its initialization is never read [deadcode.DeadStores] It's been there since b8aa330d2acb ("Btrfs: improve performance on fsync of files with multiple hardlinks") without any usage so it's safe to be removed. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: scrub: remove redundant division of stripe_nrColin Ian King
Variable stripe_nr is being divided by map->num_stripes however the result is never read. The division and assignment are redundant and can be removed. Cleans up clang scan build warning: fs/btrfs/scrub.c:1264:3: warning: Value stored to 'stripe_nr' is never read [deadcode.DeadStores] The code is a leftover from 6ded22c1bfe6 ("btrfs: reduce div64 calls by limiting the number of stripes of a chunk to u32") that converted div64 to normal division, it's the same but previous version did not trigger a warning. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: zoned: use vcalloc instead of for vzalloc in btrfs_get_dev_zone_infoJulia Lawall
Use vcalloc that checks potential multiplication overflows. The changes were done using Coccinelle semantic patch. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-19Merge tag 'for-6.5-rc6-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix infinite loop in readdir(), could happen in a big directory when files get renamed during enumeration - fix extent map handling of skipped pinned ranges - fix a corner case when handling ordered extent length - fix a potential crash when balance cancel races with pause - verify correct uuid when starting scrub or device replace * tag 'for-6.5-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix incorrect splitting in btrfs_drop_extent_map_range btrfs: fix BUG_ON condition in btrfs_cancel_balance btrfs: only subtract from len_to_oe_boundary when it is tracking an extent btrfs: fix replace/scrub failure with metadata_uuid btrfs: fix infinite directory reads
2023-08-18btrfs: fix incorrect splitting in btrfs_drop_extent_map_rangeJosef Bacik
In production we were seeing a variety of WARN_ON()'s in the extent_map code, specifically in btrfs_drop_extent_map_range() when we have to call add_extent_mapping() for our second split. Consider the following extent map layout PINNED [0 16K) [32K, 48K) and then we call btrfs_drop_extent_map_range for [0, 36K), with skip_pinned == true. The initial loop will have start = 0 end = 36K len = 36K we will find the [0, 16k) extent, but since we are pinned we will skip it, which has this code start = em_end; if (end != (u64)-1) len = start + len - em_end; em_end here is 16K, so now the values are start = 16K len = 16K + 36K - 16K = 36K len should instead be 20K. This is a problem when we find the next extent at [32K, 48K), we need to split this extent to leave [36K, 48k), however the code for the split looks like this split->start = start + len; split->len = em_end - (start + len); In this case we have em_end = 48K split->start = 16K + 36K // this should be 16K + 20K split->len = 48K - (16K + 36K) // this overflows as 16K + 36K is 52K and now we have an invalid extent_map in the tree that potentially overlaps other entries in the extent map. Even in the non-overlapping case we will have split->start set improperly, which will cause problems with any block related calculations. We don't actually need len in this loop, we can simply use end as our end point, and only adjust start up when we find a pinned extent we need to skip. Adjust the logic to do this, which keeps us from inserting an invalid extent map. We only skip_pinned in the relocation case, so this is relatively rare, except in the case where you are running relocation a lot, which can happen with auto relocation on. Fixes: 55ef68990029 ("Btrfs: Fix btrfs_drop_extent_cache for skip pinned case") CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-17btrfs: fix BUG_ON condition in btrfs_cancel_balancexiaoshoukui
Pausing and canceling balance can race to interrupt balance lead to BUG_ON panic in btrfs_cancel_balance. The BUG_ON condition in btrfs_cancel_balance does not take this race scenario into account. However, the race condition has no other side effects. We can fix that. Reproducing it with panic trace like this: kernel BUG at fs/btrfs/volumes.c:4618! RIP: 0010:btrfs_cancel_balance+0x5cf/0x6a0 Call Trace: <TASK> ? do_nanosleep+0x60/0x120 ? hrtimer_nanosleep+0xb7/0x1a0 ? sched_core_clone_cookie+0x70/0x70 btrfs_ioctl_balance_ctl+0x55/0x70 btrfs_ioctl+0xa46/0xd20 __x64_sys_ioctl+0x7d/0xa0 do_syscall_64+0x38/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Race scenario as follows: > mutex_unlock(&fs_info->balance_mutex); > -------------------- > .......issue pause and cancel req in another thread > -------------------- > ret = __btrfs_balance(fs_info); > > mutex_lock(&fs_info->balance_mutex); > if (ret == -ECANCELED && atomic_read(&fs_info->balance_pause_req)) { > btrfs_info(fs_info, "balance: paused"); > btrfs_exclop_balance(fs_info, BTRFS_EXCLOP_BALANCE_PAUSED); > } CC: stable@vger.kernel.org # 4.19+ Signed-off-by: xiaoshoukui <xiaoshoukui@ruijie.com.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-17btrfs: only subtract from len_to_oe_boundary when it is tracking an extentChris Mason
bio_ctrl->len_to_oe_boundary is used to make sure we stay inside a zone as we submit bios for writes. Every time we add a page to the bio, we decrement those bytes from len_to_oe_boundary, and then we submit the bio if we happen to hit zero. Most of the time, len_to_oe_boundary gets set to U32_MAX. submit_extent_page() adds pages into our bio, and the size of the bio ends up limited by: - Are we contiguous on disk? - Does bio_add_page() allow us to stuff more in? - is len_to_oe_boundary > 0? The len_to_oe_boundary math starts with U32_MAX, which isn't page or sector aligned, and subtracts from it until it hits zero. In the non-zoned case, the last IO we submit before we hit zero is going to be unaligned, triggering BUGs. This is hard to trigger because bio_add_page() isn't going to make a bio of U32_MAX size unless you give it a perfect set of pages and fully contiguous extents on disk. We can hit it pretty reliably while making large swapfiles during provisioning because the machine is freshly booted, mostly idle, and the disk is freshly formatted. It's also possible to trigger with reads when read_ahead_kb is set to 4GB. The code has been clean up and shifted around a few times, but this flaw has been lurking since the counter was added. I think the commit 24e6c8082208 ("btrfs: simplify main loop in submit_extent_page") ended up exposing the bug. The fix used here is to skip doing math on len_to_oe_boundary unless we've changed it from the default U32_MAX value. bio_add_page() is the real limit we want, and there's no reason to do extra math when block layer is doing it for us. Sample reproducer, note you'll need to change the path to the bdi and device: SUBVOL=/btrfs/swapvol SWAPFILE=$SUBVOL/swapfile SZMB=8192 mkfs.btrfs -f /dev/vdb mount /dev/vdb /btrfs btrfs subvol create $SUBVOL chattr +C $SUBVOL dd if=/dev/zero of=$SWAPFILE bs=1M count=$SZMB sync echo 4 > /proc/sys/vm/drop_caches echo 4194304 > /sys/class/bdi/btrfs-2/read_ahead_kb while true; do echo 1 > /proc/sys/vm/drop_caches echo 1 > /proc/sys/vm/drop_caches dd of=/dev/zero if=$SWAPFILE bs=4096M count=2 iflag=fullblock done Fixes: 24e6c8082208 ("btrfs: simplify main loop in submit_extent_page") CC: stable@vger.kernel.org # 6.4+ Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-17btrfs: fix replace/scrub failure with metadata_uuidAnand Jain
Fstests with POST_MKFS_CMD="btrfstune -m" (as in the mailing list) reported a few of the test cases failing. The failure scenario can be summarized and simplified as follows: $ mkfs.btrfs -fq -draid1 -mraid1 /dev/sdb1 /dev/sdb2 :0 $ btrfstune -m /dev/sdb1 :0 $ wipefs -a /dev/sdb1 :0 $ mount -o degraded /dev/sdb2 /btrfs :0 $ btrfs replace start -B -f -r 1 /dev/sdb1 /btrfs :1 STDERR: ERROR: ioctl(DEV_REPLACE_START) failed on "/btrfs": Input/output error [11290.583502] BTRFS warning (device sdb2): tree block 22036480 mirror 2 has bad fsid, has 99835c32-49f0-4668-9e66-dc277a96b4a6 want da40350c-33ac-4872-92a8-4948ed8c04d0 [11290.586580] BTRFS error (device sdb2): unable to fix up (regular) error at logical 22020096 on dev /dev/sdb8 physical 1048576 As above, the replace is failing because we are verifying the header with fs_devices::fsid instead of fs_devices::metadata_uuid, despite the metadata_uuid actually being present. To fix this, use fs_devices::metadata_uuid. We copy fsid into fs_devices::metadata_uuid if there is no metadata_uuid, so its fine. Fixes: a3ddbaebc7c9 ("btrfs: scrub: introduce a helper to verify one metadata block") CC: stable@vger.kernel.org # 6.4+ Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-14btrfs: fix infinite directory readsFilipe Manana
The readdir implementation currently processes always up to the last index it finds. This however can result in an infinite loop if the directory has a large number of entries such that they won't all fit in the given buffer passed to the readdir callback, that is, dir_emit() returns a non-zero value. Because in that case readdir() will be called again and if in the meanwhile new directory entries were added and we still can't put all the remaining entries in the buffer, we keep repeating this over and over. The following C program and test script reproduce the problem: $ cat /mnt/readdir_prog.c #include <sys/types.h> #include <dirent.h> #include <stdio.h> int main(int argc, char *argv[]) { DIR *dir = opendir("."); struct dirent *dd; while ((dd = readdir(dir))) { printf("%s\n", dd->d_name); rename(dd->d_name, "TEMPFILE"); rename("TEMPFILE", dd->d_name); } closedir(dir); } $ gcc -o /mnt/readdir_prog /mnt/readdir_prog.c $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV &> /dev/null #mkfs.xfs -f $DEV &> /dev/null #mkfs.ext4 -F $DEV &> /dev/null mount $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= 2000; i++)); do echo -n > $MNT/testdir/file_$i done cd $MNT/testdir /mnt/readdir_prog cd /mnt umount $MNT This behaviour is surprising to applications and it's unlike ext4, xfs, tmpfs, vfat and other filesystems, which always finish. In this case where new entries were added due to renames, some file names may be reported more than once, but this varies according to each filesystem - for example ext4 never reported the same file more than once while xfs reports the first 13 file names twice. So change our readdir implementation to track the last index number when opendir() is called and then make readdir() never process beyond that index number. This gives the same behaviour as ext4. Reported-by: Rob Landley <rob@landley.net> Link: https://lore.kernel.org/linux-btrfs/2c8c55ec-04c6-e0dc-9c5c-8c7924778c35@landley.net/ Link: https://bugzilla.kernel.org/show_bug.cgi?id=217681 CC: stable@vger.kernel.org # 6.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-12Merge tag 'for-6.5-rc5-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "More fixes, some of them going back to older releases and there are fixes for hangs in stress tests regarding space caching: - fixes and progress tracking for hangs in free space caching, found by test generic/475 - writeback fixes, write pages in integrity mode and skip writing pages that have been written meanwhile - properly clear end of extent range after an error - relocation fixes: - fix race betwen qgroup tree creation and relocation - detect and report invalid reloc roots" * tag 'for-6.5-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: set cache_block_group_error if we find an error btrfs: reject invalid reloc tree root keys with stack dump btrfs: exit gracefully if reloc roots don't match btrfs: avoid race between qgroup tree creation and relocation btrfs: properly clear end of the unreserved range in cow_file_range btrfs: don't wait for writeback on clean pages in extent_write_cache_pages btrfs: don't stop integrity writeback too early btrfs: wait for actual caching progress during allocation
2023-08-11btrfs: convert to multigrain timestampsJeff Layton
Enable multigrain timestamps, which should ensure that there is an apparent change to the timestamp whenever it has been written after being actively observed via getattr. Beyond enabling the FS_MGTIME flag, this patch eliminates update_time_for_write, which goes to great pains to avoid in-memory stores. Just have it overwrite the timestamps unconditionally. Signed-off-by: Jeff Layton <jlayton@kernel.org> Acked-by: David Sterba <dsterba@suse.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230807-mgctime-v7-13-d1dec143a704@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-11fs: drop the timespec64 argument from update_timeJeff Layton
Now that all of the update_time operations are prepared for it, we can drop the timespec64 argument from the update_time operation. Do that and remove it from some associated functions like inode_update_time and inode_needs_update_time. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230807-mgctime-v7-8-d1dec143a704@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-10btrfs: set cache_block_group_error if we find an errorJosef Bacik
We set cache_block_group_error if btrfs_cache_block_group() returns an error, this is because we could end up not finding space to allocate and mistakenly return -ENOSPC, and which could then abort the transaction with the incorrect errno, and in the case of ENOSPC result in a WARN_ON() that will trip up tests like generic/475. However there's the case where multiple threads can be racing, one thread gets the proper error, and the other thread doesn't actually call btrfs_cache_block_group(), it instead sees ->cached == BTRFS_CACHE_ERROR. Again the result is the same, we fail to allocate our space and return -ENOSPC. Instead we need to set cache_block_group_error to -EIO in this case to make sure that if we do not make our allocation we get the appropriate error returned back to the caller. CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-10btrfs: reject invalid reloc tree root keys with stack dumpQu Wenruo
[BUG] Syzbot reported a crash that an ASSERT() got triggered inside prepare_to_merge(). That ASSERT() makes sure the reloc tree is properly pointed back by its subvolume tree. [CAUSE] After more debugging output, it turns out we had an invalid reloc tree: BTRFS error (device loop1): reloc tree mismatch, root 8 has no reloc root, expect reloc root key (-8, 132, 8) gen 17 Note the above root key is (TREE_RELOC_OBJECTID, ROOT_ITEM, QUOTA_TREE_OBJECTID), meaning it's a reloc tree for quota tree. But reloc trees can only exist for subvolumes, as for non-subvolume trees, we just COW the involved tree block, no need to create a reloc tree since those tree blocks won't be shared with other trees. Only subvolumes tree can share tree blocks with other trees (thus they have BTRFS_ROOT_SHAREABLE flag). Thus this new debug output proves my previous assumption that corrupted on-disk data can trigger that ASSERT(). [FIX] Besides the dedicated fix and the graceful exit, also let tree-checker to check such root keys, to make sure reloc trees can only exist for subvolumes. CC: stable@vger.kernel.org # 5.15+ Reported-by: syzbot+ae97a827ae1c3336bbb4@syzkaller.appspotmail.com Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-10btrfs: exit gracefully if reloc roots don't matchQu Wenruo
[BUG] Syzbot reported a crash that an ASSERT() got triggered inside prepare_to_merge(). [CAUSE] The root cause of the triggered ASSERT() is we can have a race between quota tree creation and relocation. This leads us to create a duplicated quota tree in the btrfs_read_fs_root() path, and since it's treated as fs tree, it would have ROOT_SHAREABLE flag, causing us to create a reloc tree for it. The bug itself is fixed by a dedicated patch for it, but this already taught us the ASSERT() is not something straightforward for developers. [ENHANCEMENT] Instead of using an ASSERT(), let's handle it gracefully and output extra info about the mismatch reloc roots to help debug. Also with the above ASSERT() removed, we can trigger ASSERT(0)s inside merge_reloc_roots() later. Also replace those ASSERT(0)s with WARN_ON()s. CC: stable@vger.kernel.org # 5.15+ Reported-by: syzbot+ae97a827ae1c3336bbb4@syzkaller.appspotmail.com Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-10btrfs: avoid race between qgroup tree creation and relocationQu Wenruo
[BUG] Syzbot reported a weird ASSERT() triggered inside prepare_to_merge(). assertion failed: root->reloc_root == reloc_root, in fs/btrfs/relocation.c:1919 ------------[ cut here ]------------ kernel BUG at fs/btrfs/relocation.c:1919! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 9904 Comm: syz-executor.3 Not tainted 6.4.0-syzkaller-08881-g533925cb7604 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023 RIP: 0010:prepare_to_merge+0xbb2/0xc40 fs/btrfs/relocation.c:1919 Code: fe e9 f5 (...) RSP: 0018:ffffc9000325f760 EFLAGS: 00010246 RAX: 000000000000004f RBX: ffff888075644030 RCX: 1481ccc522da5800 RDX: ffffc90005c09000 RSI: 00000000000364ca RDI: 00000000000364cb RBP: ffffc9000325f870 R08: ffffffff816f33ac R09: 1ffff9200064bea0 R10: dffffc0000000000 R11: fffff5200064bea1 R12: ffff888075644000 R13: ffff88803b166000 R14: ffff88803b166560 R15: ffff88803b166558 FS: 00007f4e305fd700(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000056080679c000 CR3: 00000000193ad000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> relocate_block_group+0xa5d/0xcd0 fs/btrfs/relocation.c:3749 btrfs_relocate_block_group+0x7ab/0xd70 fs/btrfs/relocation.c:4087 btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3283 __btrfs_balance+0x1b06/0x2690 fs/btrfs/volumes.c:4018 btrfs_balance+0xbdb/0x1120 fs/btrfs/volumes.c:4402 btrfs_ioctl_balance+0x496/0x7c0 fs/btrfs/ioctl.c:3604 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f4e2f88c389 [CAUSE] With extra debugging, the offending reloc_root is for quota tree (rootid 8). Normally we should not use the reloc tree for quota root at all, as reloc trees are only for subvolume trees. But there is a race between quota enabling and relocation, this happens after commit 85724171b302 ("btrfs: fix the btrfs_get_global_root return value"). Before that commit, for quota and free space tree, we exit immediately if we cannot grab it from fs_info. But now we would try to read it from disk, just as if they are fs trees, this sets ROOT_SHAREABLE flags in such race: Thread A | Thread B ---------------------------------+------------------------------ btrfs_quota_enable() | | | btrfs_get_root_ref() | | |- btrfs_get_global_root() | | | Returned NULL | | |- btrfs_lookup_fs_root() | | | Returned NULL |- btrfs_create_tree() | | | Now quota root item is | | | inserted | |- btrfs_read_tree_root() | | | Got the newly inserted quota root | | |- btrfs_init_fs_root() | | | Set ROOT_SHAREABLE flag [FIX] Get back to the old behavior by returning PTR_ERR(-ENOENT) if the target objectid is not a subvolume tree or data reloc tree. Reported-and-tested-by: syzbot+ae97a827ae1c3336bbb4@syzkaller.appspotmail.com Fixes: 85724171b302 ("btrfs: fix the btrfs_get_global_root return value") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-10btrfs: properly clear end of the unreserved range in cow_file_rangeChristoph Hellwig
When the call to btrfs_reloc_clone_csums in cow_file_range returns an error, we jump to the out_unlock label with the extent_reserved variable set to false. The cleanup at the label will then call extent_clear_unlock_delalloc on the range from start to end. But we've already added cur_alloc_size to start before the jump, so there might no range be left from the newly incremented start to end. Move the check for 'start < end' so that it is reached by also for the !extent_reserved case. CC: stable@vger.kernel.org # 6.1+ Fixes: a315e68f6e8b ("Btrfs: fix invalid attempt to free reserved space on failure to cow range") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-10btrfs: don't wait for writeback on clean pages in extent_write_cache_pagesChristoph Hellwig
__extent_writepage could have started on more pages than the one it was called for. This happens regularly for zoned file systems, and in theory could happen for compressed I/O if the worker thread was executed very quickly. For such pages extent_write_cache_pages waits for writeback to complete before moving on to the next page, which is highly inefficient as it blocks the flusher thread. Port over the PageDirty check that was added to write_cache_pages in commit 515f4a037fb ("mm: write_cache_pages optimise page cleaning") to fix this. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-10btrfs: don't stop integrity writeback too earlyChristoph Hellwig
extent_write_cache_pages stops writing pages as soon as nr_to_write hits zero. That is the right thing for opportunistic writeback, but incorrect for data integrity writeback, which needs to ensure that no dirty pages are left in the range. Thus only stop the writeback for WB_SYNC_NONE if nr_to_write hits 0. This is a port of write_cache_pages changes in commit 05fe478dd04e ("mm: write_cache_pages integrity fix"). Note that I've only trigger the problem with other changes to the btrfs writeback code, but this condition seems worthwhile fixing anyway. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> [ updated comment ] Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-10btrfs: wait for actual caching progress during allocationJosef Bacik
Recently we've been having mysterious hangs while running generic/475 on the CI system. This turned out to be something like this: Task 1 dmsetup suspend --nolockfs -> __dm_suspend -> dm_wait_for_completion -> dm_wait_for_bios_completion -> Unable to complete because of IO's on a plug in Task 2 Task 2 wb_workfn -> wb_writeback -> blk_start_plug -> writeback_sb_inodes -> Infinite loop unable to make an allocation Task 3 cache_block_group ->read_extent_buffer_pages ->Waiting for IO to complete that can't be submitted because Task 1 suspended the DM device The problem here is that we need Task 2 to be scheduled completely for the blk plug to flush. Normally this would happen, we normally wait for the block group caching to finish (Task 3), and this schedule would result in the block plug flushing. However if there's enough free space available from the current caching to satisfy the allocation we won't actually wait for the caching to complete. This check however just checks that we have enough space, not that we can make the allocation. In this particular case we were trying to allocate 9MiB, and we had 10MiB of free space, but we didn't have 9MiB of contiguous space to allocate, and thus the allocation failed and we looped. We specifically don't cycle through the FFE loop until we stop finding cached block groups because we don't want to allocate new block groups just because we're caching, so we short circuit the normal loop once we hit LOOP_CACHING_WAIT and we found a caching block group. This is normally fine, except in this particular case where the caching thread can't make progress because the DM device has been suspended. Fix this by not only waiting for free space to >= the amount of space we want to allocate, but also that we make some progress in caching from the time we start waiting. This will keep us from busy looping when the caching is taking a while but still theoretically has enough space for us to allocate from, and fixes this particular case by forcing us to actually sleep and wait for forward progress, which will flush the plug. With this fix we're no longer hanging with generic/475. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-09btrfs: have it use inode_update_timestampsJeff Layton
In later patches, we're going to drop the "now" argument from the update_time operation. Have btrfs_update_time use the new inode_update_timestamps helper to fetch a new timestamp and update it properly. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230807-mgctime-v7-4-d1dec143a704@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-09fs: pass the request_mask to generic_fillattrJeff Layton
generic_fillattr just fills in the entire stat struct indiscriminately today, copying data from the inode. There is at least one attribute (STATX_CHANGE_COOKIE) that can have side effects when it is reported, and we're looking at adding more with the addition of multigrain timestamps. Add a request_mask argument to generic_fillattr and have most callers just pass in the value that is passed to getattr. Have other callers (e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of STATX_CHANGE_COOKIE into generic_fillattr. Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: "Paulo Alcantara (SUSE)" <pc@manguebit.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jeff Layton <jlayton@kernel.org> Message-Id: <20230807-mgctime-v7-2-d1dec143a704@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-27Merge tag 'for-6.5-rc3-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix accounting of global block reserve size when block group tree is enabled - the async discard has been enabled in 6.2 unconditionally, but for zoned mode it does not make that much sense to do it asynchronously as the zones are reset as needed - error handling and proper error value propagation fixes * tag 'for-6.5-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: check for commit error at btrfs_attach_transaction_barrier() btrfs: check if the transaction was aborted at btrfs_wait_for_commit() btrfs: remove BUG_ON()'s in add_new_free_space() btrfs: account block group tree when calculating global reserve size btrfs: zoned: do not enable async discard
2023-07-26btrfs: check for commit error at btrfs_attach_transaction_barrier()Filipe Manana
btrfs_attach_transaction_barrier() is used to get a handle pointing to the current running transaction if the transaction has not started its commit yet (its state is < TRANS_STATE_COMMIT_START). If the transaction commit has started, then we wait for the transaction to commit and finish before returning - however we completely ignore if the transaction was aborted due to some error during its commit, we simply return ERR_PT(-ENOENT), which makes the caller assume everything is fine and no errors happened. This could make an fsync return success (0) to user space when in fact we had a transaction abort and the target inode changes were therefore not persisted. Fix this by checking for the return value from btrfs_wait_for_commit(), and if it returned an error, return it back to the caller. Fixes: d4edf39bd5db ("Btrfs: fix uncompleted transaction") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>