path: root/fs/btrfs/extent_io.c

2025-05-27  btrfs: don't drop a reference if btrfs_check_write_meta_pointer() fails  (Josef Bacik)

In zoned mode there's a bug in the extent buffer tree conversion to
xarray: when btrfs_check_write_meta_pointer() fails, the reference for
the eb is dropped, but the code continues and the same reference is
dropped again when the batch is released.

Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/linux-btrfs/202505191521.435b97ac-lkp@intel.com/
Fixes: 19d7f65f032f ("btrfs: convert the buffer_radix to an xarray")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-05-15  btrfs: fix broken drop_caches on extent buffer folios  (Boris Burkov)

The (correct) commit e41c81d0d30e ("mm/truncate: Replace page_mapped()
call in invalidate_inode_page()") replaced the page_mapped(page) check
with a refcount check. However, this refcount check does not work as
expected with drop_caches for btrfs's metadata pages.

Btrfs has a per-sb metadata inode with cached pages, and when not in
active use by btrfs, they have a refcount of 3: one from the initial
call to alloc_pages(), one (nr_pages == 1) from filemap_add_folio(),
and one from folio_attach_private(). We would expect such pages to get
dropped by drop_caches. However, drop_caches calls into
mapping_evict_folio() via mapping_try_invalidate(), which gets a
reference on the folio with find_lock_entries(). As a result, these
pages have a refcount of 4 and fail this check.

For what it's worth, such pages do get reclaimed under memory pressure,
so I would say that while this behavior is surprising, it is not really
dangerously broken.

When I asked the mm folks about the expected refcount in this case, I
was told that the correct thing to do is to donate the refcount from
the original allocation to the page cache after inserting it.

Therefore, attempt to fix this by adding a folio_put() to the critical
spot in alloc_extent_buffer() where we are sure that we have really
allocated and attached new pages. We must also adjust
folio_detach_private() to properly handle being the last reference to
the folio and not do a use-after-free after the detach.

extent_buffers allocated by clone_extent_buffer() and
alloc_dummy_extent_buffer() are unmapped, so this transfer of ownership
from allocation to insertion in the mapping does not apply to them.
However, we can still folio_put() them safely once they are finished
being allocated and have called folio_attach_private().

Finally, removing the generic folio_put() for the allocation from
btrfs_detach_extent_buffer_folios() means we need to be careful to do
the appropriate folio_put() in the allocation failure paths in
alloc_extent_buffer(), clone_extent_buffer() and
alloc_dummy_extent_buffer().

Link: https://lore.kernel.org/linux-mm/ZrwhTXKzgDnCK76Z@casper.infradead.org/
Tested-by: Klara Modin <klarasmodin@gmail.com>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>

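For illustration, a minimal sketch of the reference donation described
above (not the exact btrfs code; locking and eb handling are trimmed,
and attach_eb_folio_example() is a made-up name):

    #include <linux/pagemap.h>

    static int attach_eb_folio_example(struct address_space *mapping,
                                       pgoff_t index, void *eb)
    {
            struct folio *folio = folio_alloc(GFP_NOFS, 0); /* refcount == 1 */
            int ret;

            if (!folio)
                    return -ENOMEM;
            ret = filemap_add_folio(mapping, folio, index, GFP_NOFS);
            if (ret) {
                    folio_put(folio);
                    return ret;
            }
            folio_attach_private(folio, eb);  /* takes its own reference */
            /* Donate the allocation reference to the page cache. */
            folio_put(folio);
            return 0;
    }
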
2025-05-15  btrfs: get rid of goto in alloc_test_extent_buffer()  (Daniel Vacek)

The `free_eb` label is used only once. Simplify by moving the code in
place.

Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-05-15  btrfs: use buffer xarray for extent buffer writeback operations  (Josef Bacik)

Currently we have this ugly back and forth with the btree writeback
where we find the folio, find the eb associated with that folio, and
then attempt to write it back. This results in two different paths for
subpage ebs and >= page size ebs.

Clean this up by adding our own infrastructure around looking up tagged
ebs and writing the ebs out directly. This allows us to unify the
subpage and >= pagesize IO paths, resulting in a much cleaner writeback
path for extent buffers.

I ran this through fsperf on a VM with 8 CPUs and 16GiB of RAM. I used
smallfiles100k, but reduced the file count to 1k to make it run faster.
The results are as follows, with the statistically significant
improvements marked with *; there were no regressions. fsperf was run
with -n 10 for both runs, so the baseline is the average of 10 runs and
the test is the average of 10 runs.

smallfiles100k results

  metric               baseline     current      stdev        diff
  ==========================================================================
  avg_commit_ms        68.58        58.44        3.35         -14.79% *
  commits              270.60       254.70       16.24        -5.88%
  dev_read_iops        48           48           0            0.00%
  dev_read_kbytes      1044         1044         0            0.00%
  dev_write_iops       866117.90    850028.10    14292.20     -1.86%
  dev_write_kbytes     10939976.40  10605701.20  351330.32    -3.06%
  elapsed              49.30        33           1.64         -33.06% *
  end_state_mount_ns   41251498.80  35773220.70  2531205.32   -13.28% *
  end_state_umount_ns  1.90e+09     1.50e+09     14186226.85  -21.38% *
  max_commit_ms        139          111.60       9.72         -19.71% *
  sys_cpu              4.90         3.86         0.88         -21.29%
  write_bw_bytes       42935768.20  64318451.10  1609415.05   49.80% *
  write_clat_ns_mean   366431.69    243202.60    14161.98     -33.63% *
  write_clat_ns_p50    49203.20     20992        264.40       -57.34% *
  write_clat_ns_p99    827392       653721.60    65904.74     -20.99% *
  write_io_kbytes      2035940      2035940      0            0.00%
  write_iops           10482.37     15702.75     392.92       49.80% *
  write_lat_ns_max     1.01e+08     90516129     3910102.06   -10.29% *
  write_lat_ns_mean    366556.19    243308.48    14154.51     -33.62% *

As you can see we get about a 33% decrease in runtime, with a 50%
throughput increase, which is pretty significant.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>

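A minimal sketch of the tagged-lookup pattern this describes, using the
generic xarray API (the btrfs mark names, locking and the actual IO
submission are omitted; writeback_tagged_ebs() is illustrative):

    #include <linux/xarray.h>

    struct extent_buffer;

    static void writeback_tagged_ebs(struct xarray *buffer_tree)
    {
            struct extent_buffer *eb;
            unsigned long index;

            /* Visit only the entries carrying the DIRTY mark. */
            xa_for_each_marked(buffer_tree, index, eb, XA_MARK_0) {
                    /* Move the entry from DIRTY to WRITEBACK... */
                    xa_clear_mark(buffer_tree, index, XA_MARK_0);
                    xa_set_mark(buffer_tree, index, XA_MARK_1);
                    /* ...then submit the eb for IO (omitted). */
            }
    }
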
2025-05-15  btrfs: set DIRTY and WRITEBACK tags on the buffer_tree  (Josef Bacik)

In preparation for changing how we do writeout of extent buffers, start
tagging the extent buffer xarray with DIRTY and WRITEBACK to make it
easier to find extent buffers that are in either state.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-05-15  btrfs: convert the buffer_radix to an xarray  (Josef Bacik)

In order to fully utilize xarray tagging to improve writeback we need
to convert the buffer_radix to a proper xarray. This conversion is
relatively straightforward as the radix code uses the xarray
underneath. Using xarray directly allows for quite a lot less code.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

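For reference, the basic lookup/insert/remove calls after such a
conversion look roughly like this (a sketch; buffer_tree_example() and
the error values returned are illustrative):

    #include <linux/xarray.h>

    struct extent_buffer;

    static int buffer_tree_example(struct xarray *buffer_tree,
                                   unsigned long index,
                                   struct extent_buffer *eb)
    {
            int ret;

            /* Insert; fails with -EBUSY if the index is occupied. */
            ret = xa_insert(buffer_tree, index, eb, GFP_NOFS);
            if (ret)
                    return ret;

            /* Lookup. */
            if (xa_load(buffer_tree, index) != eb)
                    return -EIO;

            /* Remove. */
            xa_erase(buffer_tree, index);
            return 0;
    }
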
2025-05-15  btrfs: subpage: reject tree blocks which are not nodesize aligned  (Qu Wenruo)

When btrfs subpage support (fs block < page size) was introduced, a
subpage filesystem would only reject tree blocks which cross page
boundaries. This used to be a compromise to simplify the tree block
handling while still allowing subpage cases to read some old converted
filesystems which did not have proper chunk alignment.

But in practice, suppose we have the following unaligned tree block on
a 64K page sized system:

  0        32K   44K        60K  64K
  |              |//////////|     |

Although btrfs has no problem reading the tree block at [44K, 60K), if
the extent allocator is allocating another tree block, it may choose
the range [60K, 74K), as the extent allocator has no awareness whether
it's a subpage metadata request or not.

Then we'd get -EINVAL from the following sequence:

  btrfs_alloc_tree_block()
  |- btrfs_reserve_extent()
  |    Which returned range [60K, 74K)
  |- btrfs_init_new_buffer()
     |- btrfs_find_create_tree_block()
        |- alloc_extent_buffer()
           |- check_eb_alignment()
                Which returned -EINVAL, because the range crosses the
                page boundary.

This situation will not fix itself and will mostly leave the fs
read-only.

Thankfully we didn't really get such reports in the real world, because
the original unaligned tree blocks were only caused by older
btrfs-convert, before the btrfs-convert rework done in v4.6, where a
converted btrfs filesystem can have metadata block groups aligned to
neither the nodesize nor the stripe size (64K). After btrfs-progs v4.6,
all chunks allocated are stripe (64K) aligned, thus no more such
problems.

Considering how old the fix is (v4.6 was released almost 10 years ago)
and that subpage support for btrfs was introduced in v5.15, it should
be safe to reject those unaligned tree blocks.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

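A hedged sketch of the tightened check (the function name and error
message here are illustrative; the real check lives in the eb
allocation path):

    /* Reject tree blocks whose start is not nodesize aligned. */
    static int check_eb_alignment_example(struct btrfs_fs_info *fs_info,
                                          u64 start)
    {
            if (!IS_ALIGNED(start, fs_info->nodesize)) {
                    btrfs_err(fs_info,
                              "tree block %llu is not nodesize (%u) aligned",
                              start, fs_info->nodesize);
                    return -EINVAL;
            }
            return 0;
    }
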
2025-05-15  btrfs: move folio initialization to one place in attach_eb_folio_to_filemap()  (Daniel Vacek)

This is just a trivial change. The code looks a bit more readable this
way, IMO. Move the initialization of existing_folio to the beginning of
the retry loop so it's set to NULL in one place.

Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-05-15  btrfs: convert ASSERT(0) with handled errors to DEBUG_WARN()  (David Sterba)

The use of ASSERT(0) may be useful in some cases, but it's more like a
notice for developers. Assertions can be compiled in independently, so
convert it to a debugging helper. The difference is that it's just a
warning and will not end up in BUG(). The converted cases are in
connection with proper error handling.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>

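The shape of such a helper is roughly the following (an assumption
about the definition, not a quote of it; only the gating on
CONFIG_BTRFS_DEBUG and the warning-not-BUG behavior are implied by the
commit):

    #ifdef CONFIG_BTRFS_DEBUG
    #define DEBUG_WARN(args...)  WARN(1, "btrfs debug warning: " args)
    #else
    #define DEBUG_WARN(args...)  do {} while (0)
    #endif
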
2025-05-15  btrfs: convert WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG)) to DEBUG_WARN  (David Sterba)

Use the conditional warning instead of typing out the whole condition.
An optional message is printed where it seems clear what the problem
could be. The conversion is left out in btree_csum_one_bio() because of
the additional condition there.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-05-15  btrfs: track the next file offset in struct btrfs_bio_ctrl  (Christoph Hellwig)

The bio implementation is not something we should really mess around
with, and we shouldn't recalculate the pos from the folio over and
over. Instead just track the end of the current bio in logical file
offsets in the btrfs_bio_ctrl, which is much simpler and easier to
read.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

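Schematically (the field name is assumed from the subject line; the
rest of the struct is elided):

    struct btrfs_bio_ctrl {
            struct btrfs_bio *bbio;
            /* ... */
            /* Logical file offset at which the current bio ends. */
            u64 next_file_offset;
    };

    /* A new range is contiguous iff file_offset == ctrl->next_file_offset. */
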
2025-05-15  btrfs: remove the alignment checks in end_bbio_data_read()  (Christoph Hellwig)

end_bbio_data_read() checks that each iterated folio fragment is
aligned and justifies that with block drivers advancing the bio. But
block drivers only advance bi_iter, while end_bbio_data_read() uses
bio_for_each_folio_all() to iterate the immutable bi_io_vec array that
can't be changed by drivers at all.

Furthermore btrfs has already done the alignment check of the file
offset inside submit_one_sector(), and the size is fixed to the fs
block size, so there is no need to redo the alignment check inside the
endio function.

So just remove the unnecessary alignment check along with the incorrect
comment.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-05-15  btrfs: rename remaining exported extent map functions  (Filipe Manana)

Rename all the exported functions from extent_map.h that don't have a
'btrfs_' prefix in their names, so that they are consistent with all
the other functions, to make it clear they are btrfs specific functions
and to avoid potential name collisions in the future with functions
defined elsewhere in the kernel.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-05-15  btrfs: rename functions to allocate and free extent maps  (Filipe Manana)

These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such a prefix, making it clear they are from btrfs and avoiding
potential collisions in the future with functions defined elsewhere
outside btrfs.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-05-15  btrfs: rename extent map functions to get block start, end and check if in tree  (Filipe Manana)

These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such a prefix, making it clear they are from btrfs and avoiding
potential collisions in the future with functions defined elsewhere
outside btrfs.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-05-15  btrfs: rename exported extent map compression functions  (Filipe Manana)

These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such a prefix, making it clear they are from btrfs and avoiding
potential collisions in the future with functions defined elsewhere
outside btrfs.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-05-15  btrfs: rename free_extent_state() to include a btrfs prefix  (Filipe Manana)

This is an exported function so it should have a 'btrfs_' prefix by
convention, to make it clear it's btrfs specific and to avoid
collisions with functions from elsewhere in the kernel. Rename the
function to add a 'btrfs_' prefix to it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-05-15  btrfs: rename the functions to count, test and get bit ranges in io trees  (Filipe Manana)

These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel. So add a
'btrfs_' prefix to their names to make it clear they are from btrfs.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-05-15  btrfs: rename the functions to clear bits for an extent range  (Filipe Manana)

These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel. One of them has
a double underscore prefix, which is also discouraged. So remove the
double underscore prefix where applicable and add a 'btrfs_' prefix to
their names to make it clear they are from btrfs.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-05-15  btrfs: add btrfs prefix to main lock, try lock and unlock extent functions  (Filipe Manana)

These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel. So add a prefix
to their names.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-05-15  btrfs: use folio_contains() for EOF detection  (Qu Wenruo)

Currently we use the following pattern to detect if the folio contains
the end of a file:

  if (folio->index == end_index)
          folio_zero_range();

But that only works if the folio is page sized. For the following case
it will not work, and will leave the range beyond EOF uninitialized.
The page size is 4K, and the fs block size is also 4K:

  16K          20K          24K
   |           |      |     |
                      EOF at 22K

And we have a large folio sized 8K at file offset 16K. In that case,
the old "folio->index == end_index" check will not work, thus the range
[22K, 24K) will not be zeroed out.

Fix the following call sites which use the above pattern:

- add_ra_bio_pages()
- extent_writepage()

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

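A minimal sketch of the large-folio-safe pattern (the function name is
made up; the offsets follow the example above):

    /* Zero the part of a folio that lies beyond EOF. */
    static void zero_beyond_eof_example(struct folio *folio, loff_t isize)
    {
            pgoff_t end_index = isize >> PAGE_SHIFT;

            /* Correct for any folio order, unlike folio->index == end_index. */
            if (folio_contains(folio, end_index)) {
                    size_t zero_start = offset_in_folio(folio, isize);

                    folio_zero_range(folio, zero_start,
                                     folio_size(folio) - zero_start);
            }
    }
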
2025-05-15  btrfs: remove unnecessary early exits in delalloc folio lock and unlock  (Qu Wenruo)

Inside the functions unlock_delalloc_folio() and lock_delalloc_folios()
we have the following early exits:

  if (index == locked_folio->index && end_index == index)
          return;

This allows us to exit early if the range is inside the same locked
folio. However the current check relies on page sized folios: if we
have a large folio that contains @index but does not start at @index,
then the early exit will no longer trigger.

Furthermore, without the above early check the existing code handles it
well, as both __process_folios_contig() and lock_delalloc_folios() will
skip any folio page lock/unlock if it's on the locked folio.

Here we remove the early exits and let the existing code handle the
same-index case, to make the code a little simpler.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-05-15  btrfs: prepare end_bbio_data_write() for large data folios  (Qu Wenruo)

The function is doing an ASSERT() checking the folio order, but all
later functions are handling large folios properly, thus we can safely
remove that ASSERT().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-05-15  btrfs: use clear_extent_bit() at try_release_extent_state()  (Filipe Manana)

Instead of using __clear_extent_bit() we can use clear_extent_bit(),
since we pass a NULL value for the changeset argument.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-05-15  btrfs: pass a pointer to get_range_bits() to cache first search result  (Filipe Manana)

Allow get_range_bits() to take an extent state pointer-to-pointer
argument to cache the first extent state record in the target range, so
that a caller can use it for subsequent operations without doing a full
tree search. Currently the only user is try_release_extent_state(),
which then does a call to __clear_extent_bit() that can use such a
cached state record.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-05-15  btrfs: allow folios to be released while ordered extent is finishing  (Filipe Manana)

When the release_folio callback (from struct address_space_operations)
is invoked we don't allow the folio to be released if its range is
currently locked in the inode's io_tree, as it may indicate the folio
may be needed by the task that locked the range.

However if the range is locked because an ordered extent is finishing,
then we can safely allow the folio to be released, because ordered
extent completion doesn't need to use the folio at all.

When we are under memory pressure, the kernel starts writeback of dirty
pages (folios) with the goal of releasing the pages from the page cache
after writeback completes. However this often is not possible on btrfs
because:

* Once the writeback completes we queue the ordered extent completion;

* Once the ordered extent completion starts, we lock the range in the
  inode's io_tree (at btrfs_finish_one_ordered());

* If the release_folio callback is called while the folio's range is
  locked in the inode's io_tree, we don't allow the folio to be
  released, so the kernel has to try to release memory elsewhere, which
  may result in triggering more writeback or releasing other pages from
  the page cache which may be more useful to have around for
  applications.

In contrast, when the release_folio callback is invoked after writeback
finishes and before ordered extent completion starts or locks the
range, we allow the folio to be released, as well as when the
release_folio callback is invoked after ordered extent completion
unlocks the range.

Improve on this by detecting if the range is locked for ordered extent
completion and, if it is, allow the folio to be released. This
detection is achieved by adding a new extent flag in the io_tree that
is set when the range is locked during ordered extent completion.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

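Schematically, the release check becomes something like the following
(a sketch: the flag name EXTENT_FINISHING_ORDERED, the helper names and
their signatures are assumptions, not quoted from the commit):

    /* Can release_folio drop this range despite it being locked? */
    static bool range_release_ok_example(struct extent_io_tree *tree,
                                         u64 start, u64 end)
    {
            /* Not locked at all: releasing is fine. */
            if (!test_range_bit_exists(tree, start, end, EXTENT_LOCKED))
                    return true;

            /* Locked only for ordered extent completion: also fine. */
            return test_range_bit(tree, start, end,
                                  EXTENT_LOCKED | EXTENT_FINISHING_ORDERED,
                                  NULL);
    }
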
2025-05-15  btrfs: update comment for try_release_extent_state()  (Filipe Manana)

Drop the reference to pages from the comment since the function is
fully folio aware and works regardless of how many pages are in the
folio. Also while at it, capitalize the first word and make it more
explicit that release_folio is a callback from struct
address_space_operations.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-05-15  btrfs: remove unused flag EXTENT_BUFFER_IN_TREE  (Daniel Vacek)

This flag is set after inserting the eb into the buffer tree and
cleared on its removal. It was added in commit 34b41acec1ccc0 ("Btrfs:
use a bit to track if we're in the radix tree"), and commit
faa2dbf004e89e ("Btrfs: add sanity tests for new qgroup accounting
code") wanted to make use of it. Both are 10+ years old, so we can
remove the flag.

Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-05-15  btrfs: remove unused flag EXTENT_BUFFER_READ_ERR  (Daniel Vacek)

This flag was added by commit 656f30dba7ab ("Btrfs: be aware of btree
inode write errors to avoid fs corruption") but it stopped being used
after commit 046b562b20a5 ("btrfs: use a separate end_io handler for
read_extent_buffer").

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-05-02  btrfs: open code folio_index() in btree_clear_folio_dirty_tag()  (Kairui Song)

The folio_index() helper is only needed for mixed usage of page cache
and swap cache; for pure page cache usage, the caller can just use
folio->index instead. It can't be a swap cache folio here: a swap
mapping may only call into the fs through 'swap_rw', and btrfs does not
use that method for swap.

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-04-23  btrfs: adjust subpage bit start based on sectorsize  (Josef Bacik)

When running machines with a 64k page size and a 16k nodesize we
started seeing tree log corruption in production. This turned out to be
because we were not writing out dirty blocks sometimes, so this in fact
affects all metadata writes.

When writing out a subpage EB we scan the subpage bitmap for a dirty
range. If the range isn't dirty we do bit_start++; to move onto the
next bit. The problem is the bitmap is based on the number of sectors
that an EB has. So in this case, we have a 64k pagesize, a 16k
nodesize, but a 4k sectorsize. This means our bitmap is 4 bits for
every node. With a 64k page size we end up with 4 nodes per page.

To make this easier, this is how everything looks:

  [0        16k       32k       48k      ]  logical address
  [0        4         8         12       ]  radix tree offset
  [                64k page              ]  folio
  [ 16k eb ][ 16k eb ][ 16k eb ][ 16k eb ]  extent buffers
  [ | | | |  | | | |  | | | |  | | | |   ]  bitmap

Now we do all of our addressing based on fs_info->sectorsize_bits, so
as you can see above, the eb->start of 16k turns into radix entry 4.

When we find a dirty range for our eb, we correctly do
bit_start += sectors_per_node, because if we start at bit 0, the next
bit for the next eb is 4, to correspond to eb->start of 16k. However if
our range is clean, we will do bit_start++, which will now put us
offset from our radix tree entries.

In our case, assume that the first time we check the bitmap the block
is not dirty; we increment bit_start so now it == 1, and then we loop
around and check again. This time it is dirty, and we go to find that
start using the following equation:

  start = folio_start + bit_start * fs_info->sectorsize;

so in the case above, eb->start 0 is now dirty, and we calculate start
as

  0 + 1 * fs_info->sectorsize = 4096
  4096 >> 12 = 1

Now we're looking up the radix tree for 1, and we won't find an eb.
What's worse is now we're using bit_start == 1, so we do
bit_start += sectors_per_node, which is now 5. If that eb is dirty we
will run into the same thing: we will look at an offset that is not
populated in the radix tree, and now we're skipping the writeout of
dirty extent buffers.

The best fix for this is to not use sectorsize_bits to address nodes,
but that's a larger change. Since this is a fs corruption problem, fix
it simply by always using sectors_per_node to increment the start bit.

Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>

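A sketch of the corrected scan loop (simplified; submit_eb_at() is a
made-up helper standing in for the real lookup-and-submit code):

    static void scan_dirty_ebs_example(struct btrfs_fs_info *fs_info,
                                       unsigned long *dirty_bitmap,
                                       u64 folio_start, unsigned int nbits)
    {
            const unsigned int sectors_per_node =
                    fs_info->nodesize >> fs_info->sectorsize_bits;
            unsigned int bit_start = 0;

            while (bit_start < nbits) {
                    u64 start = folio_start +
                            ((u64)bit_start << fs_info->sectorsize_bits);

                    if (!test_bit(bit_start, dirty_bitmap)) {
                            /*
                             * Was bit_start++, which desynced bit_start
                             * from the eb->start based radix offsets.
                             */
                            bit_start += sectors_per_node;
                            continue;
                    }
                    submit_eb_at(start);
                    bit_start += sectors_per_node;
            }
    }
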
2025-03-18  btrfs: prepare extent_io.c for future large folio support  (Qu Wenruo)

When we're handling folios from filemap, we can no longer assume all
folios are page sized. Thus for call sites assuming the folio is page
sized, change the PAGE_SIZE usage to folio_size() instead.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

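The typical before/after shape of such a conversion (variable names are
illustrative):

    /* Before: assumes a page sized folio. */
    u64 range_end = folio_pos(folio) + PAGE_SIZE;

    /* After: correct for any folio order. */
    u64 range_end = folio_pos(folio) + folio_size(folio);
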
2025-03-18  btrfs: add a size parameter to btrfs_alloc_subpage()  (Qu Wenruo)

Since we can no longer assume page sized folios for data filemap
folios, allow btrfs_alloc_subpage() to accept a new parameter, @fsize,
indicating the folio size.

This doesn't follow the regular behavior of passing a folio directly,
because this function is shared by both data and metadata folios, and
for metadata folios we have an extra allocation policy ensuring no
large folios are larger than the nodesize (unless page sized).

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-03-18  btrfs: subpage: make btrfs_is_subpage() check against a folio  (Qu Wenruo)

To support large data folios, we can no longer assume every filemap
folio is page sized. So the btrfs_is_subpage() check must be done
against a folio. Thankfully for metadata folios we have full control
and ensure a large folio will not be larger than the nodesize, so
btrfs_meta_is_subpage() doesn't need this change.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-03-18  btrfs: avoid linker error in btrfs_find_create_tree_block()  (Mark Harmstone)

The inline function btrfs_is_testing() is hardcoded to return 0 if
CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not set. Currently we're relying on
the compiler optimizing out the call to alloc_test_extent_buffer() in
btrfs_find_create_tree_block(), as it's not been defined (it's behind
an #ifdef).

Add a stub version of alloc_test_extent_buffer() to avoid linker errors
on non-standard optimization levels. This problem was seen with GCC 14
at -O0, which helps to see symbols that would otherwise be optimized
out.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

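The stub follows the usual pattern for test-only code (a sketch; the
actual stub body may differ):

    #if !defined(CONFIG_BTRFS_FS_RUN_SANITY_TESTS)
    struct extent_buffer *alloc_test_extent_buffer(struct btrfs_fs_info *fs_info,
                                                   u64 start)
    {
            /* Unreachable outside the sanity tests. */
            return NULL;
    }
    #endif
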
2025-03-18  btrfs: reject out-of-band dirty folios during writeback  (Qu Wenruo)

[OUT-OF-BAND DIRTY FOLIOS]
An out-of-band folio means the folio is marked dirty without notifying
the filesystem. This can lead to various problems, not limited to:

- No folio::private to track per block status
- No proper space reserved for such a dirty folio

[HISTORY IN BTRFS]
This used to be a problem related to get_user_page(), but with the
introduction of pin_user_pages*() we should no longer hit such cases.

In btrfs, we have a long history of catching such out-of-band dirty
folios by:

- Marking the folio ordered during delayed allocation
- Checking the folio ordered flag during writeback

If the folio has no ordered flag, it means it didn't go through delayed
allocation, thus it's definitely an out-of-band one. If we get one, we
go through COW fixup, which re-dirties the folio with proper handling
in another workqueue.

[PROBLEMS OF COW-FIXUP]
Such a workaround is a blockage for us to migrate to iomap (it requires
extra flags to trace whether a folio is dirtied by the fs or not), and
I'd argue it's not data checksum safe either: if a folio can be marked
dirty without informing the fs, the content can also change at any
time.

But with the introduction of pin_user_pages*() during the v5.8 merge
window, such out-of-band dirty folios should be treated as a bug. Ext4
has treated such cases by warning and erroring out even before
pin_user_pages*().

Furthermore, there is already proof that such folio ordered flag
tracking can be screwed up by incorrect error handling; check the
commit messages of the following commits:

  06f364284794 ("btrfs: do proper folio cleanup when cow_file_range() failed")
  c2b47df81c8e ("btrfs: do proper folio cleanup when run_delalloc_nocow() failed")

[FIXES]
Unlike btrfs, ext4 and xfs (iomap) never bother handling such
out-of-band dirty folios:

- Ext4 just warns and errors out
- Iomap always follows the folio/block dirty flags

And there is nothing really COW specific; xfs supports COW too.

Here we take one step towards ext4 by warning and erroring out. But
since the COW fixup thing was introduced from the beginning, we keep
the old behavior for non-experimental builds, and only do the new
warning for experimental builds, until we're 100% sure and can remove
COW fixup.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-03-18  btrfs: allow buffered write to avoid full page read if it's block aligned  (Qu Wenruo)

[BUG]
Since the support of block size (sector size) < page size for btrfs,
test case generic/563 fails with a 4K block size and a 64K page size:

  --- tests/generic/563.out  2024-04-25 18:13:45.178550333 +0930
  +++ /home/adam/xfstests-dev/results//generic/563.out.bad  2024-09-30 09:09:16.155312379 +0930
  @@ -3,7 +3,8 @@
   read is in range
   write is in range
   write -> read/write
  -read is in range
  +read has value of 8388608
  +read is NOT in range -33792 .. 33792
   write is in range
  ...

[CAUSE]
The test case creates an 8MiB file, then does a buffered write into the
8MiB using a 4K block size, to overwrite the whole file.

On 4K page sized systems, since the write range covers the full block
and page, btrfs will not bother reading the page, just like what XFS
and EXT4 do.

But on 64K page sized systems, although the 4K sized write is still
block aligned, it's not page aligned anymore, thus btrfs will read the
full page, which will be accounted by cgroup and fail the test, as the
test case itself expects such 4K block aligned writes not to trigger
any reads.

Such expected behavior is an optimization to reduce folio reads when
possible, and unfortunately btrfs does not implement such an
optimization.

[FIX]
To skip the full page read, we need to do the following modifications:

- Do not trigger the full page read as long as the buffered write is
  block aligned. This is pretty simple by modifying the check inside
  prepare_uptodate_page().

- Skip already uptodate blocks during the full page read. Otherwise we
  can get the following data corruption:

    0       32K       64K
    |///////|         |

  where the file range [0, 32K) is dirtied by the buffered write and
  the remaining range [32K, 64K) is not. When reading the full page,
  since [0, 32K) is only dirtied but not written back, there is no data
  extent map for it, only a hole covering [0, 64K). If we continue
  reading the full page range [0, 64K), the dirtied range will be
  filled with 0 (since there is only a hole covering the whole range).
  This causes the dirtied range to get lost.

With this optimization, btrfs can pass generic/563 even if the page
size is larger than the fs block size.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

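A sketch of the first part of the fix, the block-alignment check that
lets the write path skip the read (the function name and structure are
illustrative, not the exact prepare_uptodate_page() code):

    /* Buffered write covers whole blocks: no need to read the folio. */
    static bool write_covers_whole_blocks(struct btrfs_fs_info *fs_info,
                                          loff_t pos, size_t len)
    {
            return IS_ALIGNED(pos, fs_info->sectorsize) &&
                   IS_ALIGNED(pos + len, fs_info->sectorsize);
    }
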
2025-03-18  btrfs: make btrfs_do_readpage() to do block-by-block read  (Qu Wenruo)

Currently if btrfs has its block size (the older sector size) smaller
than the page size, btrfs_do_readpage() will handle the range extent by
extent. This is good for performance as it doesn't need to re-lookup
the same extent map again and again (although get_extent_map() already
does an extra cached em check, so the optimization is not that
obvious).

This is totally fine and is a valid optimization, but it has an
assumption that there is no partially uptodate range in the page.

Meanwhile there is an incoming feature requiring btrfs to skip the full
page read if a buffered write range covers a full block but not a full
page. In that case we can have a page that is partially uptodate, and
the current per-extent lookup cannot handle such a case.

So here we change btrfs_do_readpage() to do a block-by-block read. This
simplifies the following things:

- Remove the need for the @iosize variable, because we just use
  sectorsize as our increment.

- Remove @pg_offset, and calculate it inside the loop when needed; it's
  just offset_in_folio().

- Use a for() loop instead of a while() loop.

This will slightly reduce the read performance for subpage cases, but
for the future where we need to skip already uptodate blocks, it should
still be worth it. For block size == page size, this brings no
performance change.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

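The resulting loop shape, sketched (block_is_uptodate() is an
illustrative stand-in; the extent map lookup and bio handling are
omitted):

    /* Read the folio one fs block at a time. */
    for (u64 cur = folio_pos(folio);
         cur < folio_pos(folio) + folio_size(folio);
         cur += fs_info->sectorsize) {
            /* pg_offset is just offset_in_folio(folio, cur) when needed. */
            if (block_is_uptodate(folio, cur))
                    continue;
            /*
             * Look up the extent map for [cur, cur + sectorsize) and
             * submit the read for this block (omitted).
             */
    }
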
2025-03-18  btrfs: introduce a read path dedicated extent lock helper  (Qu Wenruo)

Currently we're using btrfs_lock_and_flush_ordered_range() for both
btrfs_read_folio() and btrfs_readahead(), but it has one critical
problem for future subpage optimizations: it will call
btrfs_start_ordered_extent() to write back the involved folios.

But remember we're calling btrfs_lock_and_flush_ordered_range() at read
paths, meaning the folio is already locked by the read path. If we
really trigger writeback for those already locked folios, this will
lead to a deadlock, as writeback cannot get the folio lock.

Such a deadlock is currently prevented by the fact that btrfs always
keeps a dirty folio also uptodate, by either dirtying all blocks of the
folio, or by reading the whole folio before dirtying.

To prepare for the incoming patch which allows btrfs to skip the full
folio read if the buffered write is block aligned, we have to start by
solving the possible deadlock first.

Instead of blindly calling btrfs_start_ordered_extent(), introduce a
new helper which is smarter in the following ways:

- Only wait for and flush the ordered extent if:

  * The folio doesn't even have the private bit set
  * Part of the blocks of the ordered extent are not uptodate

  This can happen when the folio writeback finished, then the folio got
  invalidated, but the OE has not yet finished. There are a lot of
  reasons that a folio can get invalidated, from memory pressure to
  direct IO (which invalidates all folios of the range). We have to
  wait for the ordered extent, as it may contain to-be-inserted data
  checksums; without waiting, our read can fail due to the missing
  checksums. But either way, the OE should not need any extra flush
  inside the locked folio range.

- Skip the ordered extent completely if:

  * All the blocks are dirty

    This happens when the OE creation is caused by a folio writeback
    whose file offset is before our folio. E.g. 16K page size and 4K
    block size:

      0       8K      16K     24K     32K
      |//////////////||///////|        |

    The writeback of folio 0 created an OE for the range [0, 24K), but
    since folio 16K is not fully uptodate, a read is triggered for
    folio 16K. The writeback will never happen (we're holding the folio
    lock for read), nor will the OE finish. Thus we must skip the
    range.

  * All the blocks are uptodate

    This happens when the writeback finished, but the OE has not yet
    finished. Since the blocks are already uptodate, we can skip the OE
    range.

The new helper lock_extents_for_read() will do a loop for the target
range by:

1) Lock the full range

2) If there is no ordered extent in the remaining range, exit

3) If there is an ordered extent that we can skip, skip to the end of
   the OE and continue checking. We do not trigger writeback nor wait
   for the OE.

4) If there is an ordered extent that we cannot skip, unlock the whole
   extent range and start the ordered extent.

Also update btrfs_start_ordered_extent() to add two more parameters,
@nowriteback_start and @nowriteback_len, to prevent triggering a flush
for a certain range. This will allow us to handle the following case
properly in the future. 16K page size, 4K btrfs block size:

  0    4K   8K   12K  16K  20K  24K  28K  32K
  |///////////////////|//////////|    |    |
  |<---------- OE 2 ------------>|    |<-->|
                                       OE 1

The folio has been written back before, thus we have an OE at
[28K, 32K). Although OE 1 finished its IO, the OE is not yet removed
from the IO tree. The folio got invalidated after writeback completed
and before the ordered extent finished.

And the [16K, 24K) range is dirty and uptodate, caused by a block
aligned buffered write (and future enhancements allowing btrfs to skip
the full folio read for such a case).

But writeback for folio 0 has begun, thus it generated OE 2, covering
the range [0, 24K).

Since the full folio 16K is not uptodate, if we want to read the folio,
the existing btrfs_lock_and_flush_ordered_range() will deadlock:

  btrfs_read_folio()
  | Folio 16K is already locked
  |- btrfs_lock_and_flush_ordered_range()
     |- btrfs_start_ordered_extent() for range [16K, 24K)
        |- filemap_fdatawrite_range() for range [16K, 24K)
           |- extent_write_cache_pages()
              folio_lock() on folio 16K, deadlock.

But now we will have the following sequence:

  btrfs_read_folio()
  | Folio 16K is already locked
  |- lock_extents_for_read()
     |- can_skip_ordered_extent() for range [16K, 24K)
     |  Returned true, the range [16K, 24K) will be skipped.
     |- can_skip_ordered_extent() for range [28K, 32K)
     |  Returned false.
     |- btrfs_start_ordered_extent() for range [28K, 32K),
        with [16K, 32K) as the no-writeback range

No writeback for folio 16K will be triggered. And there will be no more
possible deadlock on the same folio.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-03-18  btrfs: use num_extent_folios() in for loop bounds  (David Sterba)

As the helper num_extent_folios() is now __pure, we can use it in a for
loop without storing its value in a variable explicitly; the compiler
will do this for us.

The effect on btrfs.ko is -200 bytes and there are stack space savings
too:

  btrfs_clone_extent_buffer            -8 (32 -> 24)
  btrfs_clear_buffer_dirty             -8 (48 -> 40)
  clear_extent_buffer_uptodate         -8 (40 -> 32)
  set_extent_buffer_dirty              -8 (32 -> 24)
  write_one_eb                         -8 (88 -> 80)
  set_extent_buffer_uptodate           -8 (40 -> 32)
  read_extent_buffer_pages_nowait     -16 (64 -> 48)
  find_extent_buffer                   -8 (32 -> 24)

Signed-off-by: David Sterba <dsterba@suse.com>

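I.e. loops can now call the helper directly in the condition
(process_folio() is illustrative):

    /* Before: cached manually because purity wasn't declared. */
    int num_folios = num_extent_folios(eb);
    for (int i = 0; i < num_folios; i++)
            process_folio(eb, i);

    /* After: __pure lets the compiler hoist the call itself. */
    for (int i = 0; i < num_extent_folios(eb); i++)
            process_folio(eb, i);
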
2025-03-18  btrfs: simplify parameters of metadata folio helpers  (David Sterba)

Unlike the folio helpers for data, the ones for metadata always take
the extent buffer start and length, so they can be simplified to take
the eb only. The fs_info can be obtained from the eb too, so it can be
dropped as a parameter. The helpers were added in the patch "btrfs: use
metadata specific helpers to simplify extent buffer helpers".

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-03-18  btrfs: merge alloc_dummy_extent_buffer() helpers  (David Sterba)

After the previous patch removing nodesize from the parameters,
__alloc_dummy_extent_buffer() and alloc_dummy_extent_buffer() are
identical so we can drop one.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-03-18  btrfs: don't pass nodesize to __alloc_extent_buffer()  (David Sterba)

All callers pass a valid fs_info so we can read the nodesize from that
instead of passing it as a parameter.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-03-18  btrfs: require strict data/metadata split for subpage checks  (Qu Wenruo)

Since we have btrfs_meta_is_subpage(), we should make btrfs_is_subpage()
data inode specific. This change involves:

- Simplify btrfs_is_subpage()

  Now we only need to do a very simple sectorsize check against
  PAGE_SIZE. And since the function is pretty simple now, just make it
  an inline function (see the sketch below).

- Add an extra ASSERT() to make sure btrfs_is_subpage() is only called
  on a data inode mapping

- Migrate btree_csum_one_bio() to use btrfs_meta_folio_*() helpers
- Migrate alloc_extent_buffer() to use btrfs_meta_folio_*() helpers
- Migrate end_bbio_meta_write() to use btrfs_meta_folio_*() helpers

  Otherwise we would trigger the ASSERT() by calling btrfs_folio_*() on
  metadata folios.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

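A sketch of the simplified check (the ASSERT() detail is an assumption;
only the sectorsize vs PAGE_SIZE comparison is implied by the commit):

    static inline bool btrfs_is_subpage(const struct btrfs_fs_info *fs_info,
                                        struct address_space *mapping)
    {
            /* Only valid for data inode mappings, not the btree inode. */
            ASSERT(mapping && mapping->host != fs_info->btree_inode);
            return fs_info->sectorsize < PAGE_SIZE;
    }
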
2025-03-18  btrfs: simplify subpage handling of read_extent_buffer_pages_nowait()  (Qu Wenruo)

Use a shared bio_add_folio_nofail() call with a calculated
range_start/range_len, so that no explicit subpage routine is needed.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-03-18  btrfs: simplify subpage handling of write_one_eb()  (Qu Wenruo)

Currently inside write_one_eb() we have two different ways of handling
subpage and regular metadata. The differences are:

- Extra offset/length calculation when adding the folio range to the
  bio for subpage cases
- Only decrease wbc->nr_to_write if the whole page is no longer dirty,
  for subpage cases
- Use subpage helpers for subpage cases

Merge the two ways into a shared one:

- Always calculate the to-be-queued range, so that bio_add_folio() can
  use the same calculated length and offset for both cases.

- Use the btrfs_meta_folio_clear_dirty() and
  btrfs_meta_folio_set_writeback() helpers, which cover both cases.

- Only decrease wbc->nr_to_write if the folio is no longer dirty.
  Since we have the folio locked, no one else can modify the folio
  dirty flags (set_extent_buffer_dirty() will also lock the folio for
  subpage cases). Thus after our btrfs_meta_folio_clear_dirty() call,
  if the whole folio is no longer dirty, we're submitting the last
  dirty eb of the folio and can decrease wbc->nr_to_write properly.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-03-18  btrfs: simplify subpage handling of btrfs_clear_buffer_dirty()  (Qu Wenruo)

The function btrfs_clear_buffer_dirty() is called on dirty extent
buffers that will not be written back. The function will call
btree_clear_folio_dirty() to clear the folio dirty flag and also clear
the PAGECACHE_TAG_DIRTY tag. And we split the subpage and regular
handling, as for subpage cases we should only clear PAGECACHE_TAG_DIRTY
if the last dirty extent buffer in the page is cleared.

So here we can simplify the function by:

- Using the newly introduced btrfs_meta_folio_clear_and_test_dirty()
  helper, which returns true if we cleared the folio dirty flag.
  With that we can use the same helper for both subpage and regular
  cases.

- Renaming btree_clear_folio_dirty() to btree_clear_folio_dirty_tag(),
  as we move the folio dirty clearing into btrfs_clear_buffer_dirty().

- Calling btrfs_meta_folio_clear_and_test_dirty() to clear the dirty
  flags for both regular and subpage metadata cases

- Only calling btree_clear_folio_dirty_tag() when the folio is no
  longer dirty

- Updating the comment inside set_extent_buffer_dirty(), as there is
  no separate clear_subpage_extent_buffer_dirty() anymore

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-03-18  btrfs: use metadata specific helpers to simplify extent buffer helpers  (Qu Wenruo)

The following functions do metadata specific checks:

- set_extent_buffer_uptodate()
- clear_extent_buffer_uptodate()

The reason why we do not use the btrfs_folio_*() helpers in them is
that btrfs_is_subpage() cannot handle a dummy extent buffer if
nodesize >= PAGE_SIZE but block size < PAGE_SIZE. In that case, we do
not need to attach extra bitmaps to the extent buffer folio. But since
dummy extent buffer folios are not attached to the btree inode,
btrfs_is_subpage() will return true, causing problems.

And the following is using btrfs_folio_*() helpers for metadata, where
in theory we should use metadata specific checks:

- set_extent_buffer_dirty()

This is not causing problems because a dummy extent buffer should never
be marked dirty.

To make the code simpler, introduce btrfs_meta_folio_*() helpers to do
the metadata specific handling, so that we do not need to open-code
such checks in the involved functions above.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-03-18  btrfs: make subpage attach and detach handle metadata properly  (Qu Wenruo)

Currently subpage attach/detach does not do a proper dummy extent
buffer subpage check, as btrfs_is_subpage() is not reliable for dummy
extent buffer folios.

Since we have a metadata specific check now, use that for
btrfs_attach_subpage() first. Then enhance btrfs_detach_subpage() to
accept a type parameter, so that we can do extra checks for dummy
extent buffers properly.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2025-03-18  btrfs: factor out metadata subpage detection into a dedicated helper  (Qu Wenruo)

Currently we have only one btrfs_is_subpage() to cover both data and
metadata. But there is a special case for metadata: a dummy extent
buffer with sector size < PAGE_SIZE and node size >= PAGE_SIZE.

In such a case, btrfs_is_subpage() will return true for the extent
buffer folio. But that is not correct, and that's exactly why we have
some open-coded checks in functions like set_extent_buffer_uptodate()
and clear_extent_buffer_uptodate().

Just extract the metadata specific checks into a helper, and replace
those call sites.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>