In zoned mode there's a bug in the extent buffer tree conversion to
xarray: the reference for the eb is dropped and the code continues, but
the reference is dropped again when the batch is released, leading to a
double put.
Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/linux-btrfs/202505191521.435b97ac-lkp@intel.com/
Fixes: 19d7f65f032f ("btrfs: convert the buffer_radix to an xarray")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The (correct) commit e41c81d0d30e ("mm/truncate: Replace page_mapped()
call in invalidate_inode_page()") replaced the page_mapped(page) check
with a refcount check. However, this refcount check does not work as
expected with drop_caches for btrfs's metadata pages.
Btrfs has a per-sb metadata inode with cached pages, and when not in
active use by btrfs, they have a refcount of 3. One from the initial
call to alloc_pages(), one (nr_pages == 1) from filemap_add_folio(), and
one from folio_attach_private(). We would expect such pages to get dropped
by drop_caches. However, drop_caches calls into mapping_evict_folio() via
mapping_try_invalidate() which gets a reference on the folio with
find_lock_entries(). As a result, these pages have a refcount of 4, and
fail this check.
For what it's worth, such pages do get reclaimed under memory pressure,
so I would say that while this behavior is surprising, it is not really
dangerously broken.
When I asked the mm folks about the expected refcount in this case, I
was told that the correct thing to do is to donate the refcount from the
original allocation to the page cache after inserting it.
Therefore, attempt to fix this by adding a folio_put() at the critical
spot in alloc_extent_buffer() where we are sure that we have really
allocated and attached new pages. We must also adjust the code around
folio_detach_private() to properly handle being the last reference to
the folio and avoid a use-after-free after folio_detach_private().
extent_buffers allocated by clone_extent_buffer() and
alloc_dummy_extent_buffer() are unmapped, so this transfer of ownership
from allocation to insertion in the mapping does not apply to them.
However, we can still folio_put() them safely once they are finished
being allocated and have called folio_attach_private().
Finally, removing the generic folio_put() for the allocation from
btrfs_detach_extent_buffer_folios() means we need to be careful to do
the appropriate folio_put() in the allocation failure paths of
alloc_extent_buffer(), clone_extent_buffer() and
alloc_dummy_extent_buffer().
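A minimal sketch of the resulting ownership transfer, with the
surrounding btrfs details omitted and the flow simplified (this
illustrates the pattern the mm folks described, not the exact btrfs
code):

	folio = filemap_alloc_folio(GFP_NOFS, 0); /* refcount 1 (allocation) */
	if (!folio)
		return ERR_PTR(-ENOMEM);
	ret = filemap_add_folio(mapping, folio, index, GFP_NOFS);
	if (ret) {
		/* Not in the mapping yet, we still own the allocation ref. */
		folio_put(folio);
		return ERR_PTR(ret);
	}
	folio_attach_private(folio, eb);	/* takes its own reference */
	/* Donate the allocation reference to the page cache. */
	folio_put(folio);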
Link: https://lore.kernel.org/linux-mm/ZrwhTXKzgDnCK76Z@casper.infradead.org/
Tested-by: Klara Modin <klarasmodin@gmail.com>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The `free_eb` label is used only once. Simplify by moving the code in place.
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently we have this ugly back and forth with the btree writeback
where we find the folio, find the eb associated with that folio, and
then attempt to write it back. This results in two different paths for
subpage ebs and >= page size ebs.
Clean this up by adding our own infrastructure around looking up tagged
ebs and writing the ebs out directly. This allows us to unify the
subpage and >= pagesize IO paths, resulting in a much cleaner writeback
path for extent buffers.
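As a rough sketch, the eb-first writeback can walk the tagged xarray
directly; the buffer_tree field name and write_one_eb() call below are
assumptions for illustration, the iteration itself is the generic
xarray API:

	struct extent_buffer *eb;
	unsigned long index;

	/* Visit only dirty ebs instead of scanning folios for them. */
	xa_for_each_marked(&fs_info->buffer_tree, index, eb,
			   PAGECACHE_TAG_DIRTY) {
		if (!atomic_inc_not_zero(&eb->refs))
			continue;
		write_one_eb(eb, wbc);		/* per-eb writeout */
		free_extent_buffer(eb);
	}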
I ran this through fsperf on a VM with 8 CPUs and 16GiB of RAM. I used
smallfiles100k, but reduced the file count to 1k to make it run faster.
The results are as follows, with the statistically significant
improvements marked with *; there were no regressions. fsperf was run
with -n 10 for both runs, so the baseline and the test are each the
average of 10 runs.
smallfiles100k results
metric baseline current stdev diff
================================================================================
avg_commit_ms 68.58 58.44 3.35 -14.79% *
commits 270.60 254.70 16.24 -5.88%
dev_read_iops 48 48 0 0.00%
dev_read_kbytes 1044 1044 0 0.00%
dev_write_iops 866117.90 850028.10 14292.20 -1.86%
dev_write_kbytes 10939976.40 10605701.20 351330.32 -3.06%
elapsed 49.30 33 1.64 -33.06% *
end_state_mount_ns 41251498.80 35773220.70 2531205.32 -13.28% *
end_state_umount_ns 1.90e+09 1.50e+09 14186226.85 -21.38% *
max_commit_ms 139 111.60 9.72 -19.71% *
sys_cpu 4.90 3.86 0.88 -21.29%
write_bw_bytes 42935768.20 64318451.10 1609415.05 49.80% *
write_clat_ns_mean 366431.69 243202.60 14161.98 -33.63% *
write_clat_ns_p50 49203.20 20992 264.40 -57.34% *
write_clat_ns_p99 827392 653721.60 65904.74 -20.99% *
write_io_kbytes 2035940 2035940 0 0.00%
write_iops 10482.37 15702.75 392.92 49.80% *
write_lat_ns_max 1.01e+08 90516129 3910102.06 -10.29% *
write_lat_ns_mean 366556.19 243308.48 14154.51 -33.62% *
As you can see we get about a 33% decrease in runtime and a 50%
increase in throughput, which is pretty significant.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In preparation for changing how we do writeout of extent buffers, start
tagging the extent buffer xarray with DIRTY and WRITEBACK to make it
easier to find extent buffers that are in either state.
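For illustration, the tags can mirror the eb state transitions with the
plain xarray mark API; the buffer_tree field name, the index derivation
and the reuse of the page cache tag constants are assumptions here:

	unsigned long index = eb->start >> fs_info->sectorsize_bits;

	/* eb dirtied: reflect it in the xarray. */
	xa_set_mark(&fs_info->buffer_tree, index, PAGECACHE_TAG_DIRTY);

	/* Writeback starts: move the eb from the dirty to writeback tag. */
	xa_clear_mark(&fs_info->buffer_tree, index, PAGECACHE_TAG_DIRTY);
	xa_set_mark(&fs_info->buffer_tree, index, PAGECACHE_TAG_WRITEBACK);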
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In order to fully utilize xarray tagging to improve writeback we need to
convert the buffer_radix to a proper xarray. This conversion is
relatively straightforward as the radix tree code uses the xarray
underneath. Using the xarray directly allows for quite a lot less code.
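For example, the radix tree lookup/insert pair maps onto the xarray API
roughly like this (a sketch, assuming the converted field is called
buffer_tree):

	/* Lookup, replacing radix_tree_lookup(): */
	eb = xa_load(&fs_info->buffer_tree, index);

	/* Insert, detecting a racing insertion in one call: */
	existing = xa_cmpxchg(&fs_info->buffer_tree, index, NULL, eb,
			      GFP_NOFS);
	if (xa_is_err(existing))
		return ERR_PTR(xa_err(existing));
	if (existing) {
		/* Someone beat us to it, use the existing eb. */
	}

No preloading is needed either, which is part of why the code shrinks.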
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When btrfs subpage support (fs block size < page size) was introduced, a
subpage filesystem would only reject tree blocks which cross page
boundaries.
This used to be a compromise to simplify the tree block handling while
still allowing subpage cases to read some old converted filesystems
which did not have proper chunk alignment.
But in practice, suppose we have the following unaligned tree block on a
64K page sized system:
0 32K 44K 60K 64K
| |///////////////| |
Although btrfs has no problem reading the tree block at [44K, 60K), if
the extent allocator is allocating another tree block, it may choose the
range [60K, 74K), as the extent allocator has no awareness of whether
it's a subpage metadata request or not.
Then we'd get -EINVAL from the following sequence:
btrfs_alloc_tree_block()
|- btrfs_reserve_extent()
| Which returned range [60K, 74K)
|- btrfs_init_new_buffer()
|- btrfs_find_create_tree_block()
|- alloc_extent_buffer()
|- check_eb_alignment()
Which returned -EINVAL, because the range crosses page
boundary.
This situation will not fix itself and will most likely leave the fs
read-only.
Thankfully we didn't really get such reports in the real world because:
- The original unaligned tree blocks are only caused by older
  btrfs-convert
  Before the btrfs-convert rework done in v4.6, a converted btrfs
  filesystem could have metadata block groups aligned to neither
  nodesize nor stripe size (64K).
  After btrfs-progs v4.6, all allocated chunks are stripe (64K)
  aligned, thus no more such problem.
Considering how old the fix is (v4.6 was released almost 10 years ago)
and that subpage support for btrfs was introduced in v5.15, it should
be safe to reject those unaligned tree blocks.
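A sketch of what the stricter check in check_eb_alignment() can look
like; the exact condition and message are assumptions, the point is
that the unaligned case becomes a hard error:

	static int check_eb_alignment(struct btrfs_fs_info *fs_info, u64 start)
	{
		/*
		 * Previously only tree blocks crossing a page boundary
		 * were rejected; now reject anything not nodesize
		 * aligned, as only pre-v4.6 btrfs-convert could create
		 * such blocks.
		 */
		if (fs_info->nodesize < PAGE_SIZE &&
		    !IS_ALIGNED(start, fs_info->nodesize)) {
			btrfs_err(fs_info,
				  "tree block %llu is not nodesize aligned",
				  start);
			return -EINVAL;
		}
		return 0;
	}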
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is just a trivial change. The code looks a bit more readable this way, IMO.
Move the initialization of existing_folio to the beginning of the retry
loop so it's set to NULL in one place.
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The use of ASSERT(0) is maybe useful for some cases, but it's more like
a notice for developers. Assertions can be compiled in independently,
so convert it to a debugging helper.
The difference is that it's just a warning and will not end up in BUG().
The converted cases are in connection with proper error handling.
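One possible shape for such a helper (the actual name and definition in
btrfs may differ; this is a sketch of the idea):

	#ifdef CONFIG_BTRFS_DEBUG
	/* Print a warning with a backtrace, but never end up in BUG(). */
	#define DEBUG_WARN(args...)	WARN(1, args)
	#else
	#define DEBUG_WARN(args...)	do {} while (0)
	#endif

	/* Before: */
	ASSERT(0);
	/* After: */
	DEBUG_WARN("unexpected error %d", ret);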
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Use the conditional warning instead of typing out the whole condition.
An optional message is printed where it seems clear what the problem
could be.
Conversion is left out in btree_csum_one_bio() because of the additional
condition.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The bio implementation is not something we should really mess around
with, and we shouldn't recalculate the pos from the folio over and over.
Instead just track the end of the current bio in logical file offsets
in the btrfs_bio_ctrl, which is much simpler and easier to read.
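A sketch of the idea, assuming the new member is called
next_file_offset (simplified; the real code differs in details):

	struct btrfs_bio_ctrl {
		struct btrfs_bio *bbio;
		/* Logical file offset right after the last range added. */
		u64 next_file_offset;
	};

	/* The new range is contiguous only if it starts where the bio
	 * currently ends; no folio math needed. */
	if (bio_ctrl->bbio && file_offset != bio_ctrl->next_file_offset)
		submit_one_bio(bio_ctrl);
	/* ... add the folio range to the bio ... */
	bio_ctrl->next_file_offset = file_offset + len;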
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
end_bbio_data_read() checks that each iterated folio fragment is aligned
and justifies that with block drivers advancing the bio. But block
drivers only advance bi_iter, while end_bbio_data_read() uses
bio_for_each_folio_all() to iterate the immutable bi_io_vec array,
which can't be changed by drivers at all.
Furthermore btrfs has already done the alignment check of the file
offset inside submit_one_sector(), and the size is fixed to the fs
block size, so there is no need to redo the alignment check inside the
endio function.
So just remove the unnecessary alignment check along with the incorrect
comment.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Rename all the exported functions from extent_map.h that don't have a
'btrfs_' prefix in their names, so that they are consistent with all the
other functions, to make it clear they are btrfs specific functions and
to avoid potential name collisions in the future with functions defined
elsewhere in the kernel.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such prefix, making it clear they are from btrfs and avoiding potential
collisions in the future with functions defined elsewhere outside btrfs.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such prefix, making it clear they are from btrfs and avoiding potential
collisions in the future with functions defined elsewhere outside btrfs.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such prefix, making it clear they are from btrfs and avoiding potential
collisions in the future with functions defined elsewhere outside btrfs.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is an exported function so it should have a 'btrfs_' prefix by
convention, to make it clear it's btrfs specific and to avoid collisions
with functions from elsewhere in the kernel.
Rename the function to add 'btrfs_' prefix to it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel.
So add a 'btrfs_' prefix to their names to make it clear they are from
btrfs.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel. One of them has a
double underscore prefix which is also discouraged.
So remove double underscore prefix where applicable and add a 'btrfs_'
prefix to their name to make it clear they are from btrfs.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel. So add a prefix to
their name.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently we use the following pattern to detect if the folio contains
the end of a file:
if (folio->index == end_index)
folio_zero_range();
But that only works if the folio is page sized.
For the following case, it will not work and leave the range beyond EOF
uninitialized:
The page size is 4K, and the fs block size is also 4K.
    16K          20K       24K
    |            |    |    |
                      |
                      EOF at 22K
And we have a large folio sized 8K at file offset 16K.
In that case, the old "folio->index == end_index" check will not work,
thus the range [22K, 24K) will not be zeroed out.
Fix the following call sites which use the above pattern:
- add_ra_bio_pages()
- extent_writepage()
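The folio-size-aware replacement looks roughly like this (a sketch,
assuming the folio is already known to intersect EOF):

	u64 isize = i_size_read(inode);

	if (folio_pos(folio) + folio_size(folio) > isize) {
		size_t zero_start = offset_in_folio(folio, isize);

		/* Zero everything in the folio past EOF, whatever the
		 * folio size is. */
		folio_zero_range(folio, zero_start,
				 folio_size(folio) - zero_start);
	}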
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Inside functions unlock_delalloc_folio() and lock_delalloc_folios(), we
have the following early exits:
if (index == locked_folio->index && end_index == index)
return;
This allows us to exit early if the range is inside the same locked
folio.
However the current check relies on page sized folios; if we have a
large folio that contains @index but does not start at @index, then the
early exit will no longer trigger.
Furthermore, without the above early check the existing code handles
this case well, as both __process_folios_contig() and
lock_delalloc_folios() will skip any folio lock/unlock if it's on the
locked folio.
Here we remove the early exits and let the existing code handle the
same index case, to make the code a little simpler.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function is doing an ASSERT() checking the folio order, but all the
functions called later handle large folios properly, thus we can safely
remove that ASSERT().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of using __clear_extent_bit() we can use clear_extent_bit() since
we pass a NULL value for the changeset argument.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Allow get_range_bits() to take an extent state pointer to pointer argument
so that we can cache the first extent state record in the target range, so
that a caller can use it for subsequent operations without doing a full
tree search. Currently the only user is try_release_extent_state(), which
then does a call to __clear_extent_bit() which can use such a cached state
record.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When the release_folio callback (from struct address_space_operations) is
invoked we don't allow the folio to be released if its range is currently
locked in the inode's io_tree, as it may indicate the folio may be needed
by the task that locked the range.
However if the range is locked because an ordered extent is finishing,
then we can safely allow the folio to be released because ordered extent
completion doesn't need to use the folio at all.
When we are under memory pressure, the kernel starts writeback of dirty
pages (folios) with the goal of releasing the pages from the page cache
after writeback completes. However, this often is not possible on btrfs
because:
* Once the writeback completes we queue the ordered extent completion;
* Once the ordered extent completion starts, we lock the range in the
inode's io_tree (at btrfs_finish_one_ordered());
* If the release_folio callback is called while the folio's range is
locked in the inode's io_tree, we don't allow the folio to be
released, so the kernel has to try to release memory elsewhere,
which may result in triggering more writeback or releasing other
pages from the page cache which may be more useful to have around
for applications.
In contrast, when the release_folio callback is invoked after writeback
finishes and before ordered extent completion starts or locks the range,
we allow the folio to be released, as well as when the release_folio
callback is invoked after ordered extent completion unlocks the range.
Improve on this by detecting if the range is locked for ordered extent
completion and if it is, allow the folio to be released. This detection
is achieved by adding a new extent flag in the io_tree that is set when
the range is locked during ordered extent completion.
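A sketch of the two sides of that detection; the flag and helper names
here are assumptions for illustration only:

	/* In btrfs_finish_one_ordered(): lock the range and record why
	 * (assumes a lock helper variant that sets extra bits). */
	lock_extent_bits(&inode->io_tree, start, end,
			 EXTENT_FINISHING_ORDERED, &cached);

	/* In try_release_extent_state(): a locked range does not block
	 * the release if it is only locked for OE completion. */
	if ((range_bits & EXTENT_LOCKED) &&
	    (range_bits & EXTENT_FINISHING_ORDERED))
		ret = 1;	/* allow release_folio to succeed */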
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Drop reference to pages from the comment since the function is fully folio
aware and works regardless of how many pages are in the folio. Also while
at it, capitalize the first word and make it more explicit that
release_folio is a callback from struct address_space_operations.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This flag is set after inserting the eb into the buffer tree and cleared
on its removal. It was added in commit 34b41acec1ccc0 ("Btrfs: use a
bit to track if we're in the radix tree"), and commit faa2dbf004e89e
("Btrfs: add sanity tests for new qgroup accounting code") wanted to
make use of it. Both are 10+ years old, so we can remove the flag.
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This flag was added by commit 656f30dba7ab ("Btrfs: be aware of btree
inode write errors to avoid fs corruption") but it stopped being used
after commit 046b562b20a5 ("btrfs: use a separate end_io handler for
read_extent_buffer").
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The folio_index() helper is only needed for mixed usage of page cache
and swap cache, for pure page cache usage, the caller can just use
folio->index instead.
It can't be a swap cache folio here. Swap mapping may only call into fs
through 'swap_rw' but btrfs does not use that method for swap.
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When running machines with 64k page size and a 16k nodesize we started
seeing tree log corruption in production. This turned out to be because
we were not writing out dirty blocks sometimes, so this in fact affects
all metadata writes.
When writing out a subpage EB we scan the subpage bitmap for a dirty
range. If the range isn't dirty we do
bit_start++;
to move onto the next bit. The problem is the bitmap is based on the
number of sectors that an EB has. So in this case, we have a 64k
pagesize, 16k nodesize, but a 4k sectorsize. This means our bitmap is 4
bits for every node. With a 64k page size we end up with 4 nodes per
page.
To make this easier, this is how everything looks:

  [0        16k       32k       48k      ]  logical address
  [0        4         8         12       ]  radix tree offset
  [               64k page                ]  folio
  [ 16k eb ][ 16k eb ][ 16k eb ][ 16k eb ]  extent buffers
  [ | | | | | | | | | | | | | | | | ]       bitmap
Now we use all of our addressing based on fs_info->sectorsize_bits, so
as you can see above, our 16k eb->start turns into radix entry 4.
When we find a dirty range for our eb, we correctly do bit_start +=
sectors_per_node, because if we start at bit 0, the next bit for the
next eb is 4, corresponding to eb->start 16k.
However if our range is clean, we will do bit_start++, which puts us
out of sync with our radix tree entries.
In our case, assume that the first time we check the bitmap the block is
not dirty, we increment bit_start so now it == 1, and then we loop
around and check again. This time it is dirty, and we go to find that
start using the following equation
start = folio_start + bit_start * fs_info->sectorsize;
so in the case above, eb->start 0 is now dirty, and we calculate start
as
0 + 1 * fs_info->sectorsize = 4096
4096 >> 12 = 1
Now we're looking up the radix tree for 1, and we won't find an eb.
What's worse is now we're using bit_start == 1, so we do bit_start +=
sectors_per_node, which is now 5. If that eb is dirty we will run into
the same thing, we will look at an offset that is not populated in the
radix tree, and now we're skipping the writeout of dirty extent buffers.
The best fix for this is to not use sectorsize_bits to address nodes,
but that's a larger change. Since this is a fs corruption problem, fix
it simply by always using sectors_per_node to increment the start bit.
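A sketch of the corrected scan (simplified; the dirty test helper here
is hypothetical):

	unsigned int sectors_per_node = fs_info->nodesize >>
					fs_info->sectorsize_bits;
	unsigned int bit_start = 0;

	while (bit_start < blocks_per_folio) {
		u64 start = folio_start + bit_start * fs_info->sectorsize;

		if (!test_range_dirty(folio, bit_start)) {
			/* Was bit_start++, which desynced the bit index
			 * from the eb addressing. */
			bit_start += sectors_per_node;
			continue;
		}
		/* ... look up and submit the eb at @start ... */
		bit_start += sectors_per_node;
	}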
Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When we're handling folios from filemap, we can no longer assume all
folios are page sized.
Thus for call sites assuming the folio is page sized, change the
PAGE_SIZE usage to folio_size() instead.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since we can no longer assume page sized folios for data filemap folios,
allow btrfs_alloc_subpage() to accept a new parameter, @fsize,
indicating the folio size.
This doesn't follow the regular behavior of passing a folio directly,
because this function is shared by both data and metadata folios, and
for metadata folios we have an extra allocation policy that ensures no
large folio is larger than nodesize (unless it's page sized).
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
To support large data folios, we can no longer assume every filemap
folio is page sized.
So the btrfs_is_subpage() check must be done against a folio.
Thankfully for metadata folios we have full control and ensure a large
folio will not be larger than nodesize, so btrfs_meta_is_subpage()
doesn't need this change.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The inline function btrfs_is_testing() is hardcoded to return 0 if
CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not set. Currently we're relying on
the compiler optimizing out the call to alloc_test_extent_buffer() in
btrfs_find_create_tree_block(), as it hasn't been defined (it's behind
an #ifdef).
Add a stub version of alloc_test_extent_buffer() to avoid linker errors
on non-standard optimization levels. This problem was seen with GCC 14
at -O0, which also helps to see symbols that would otherwise be
optimized out.
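The stub can be as small as this (the exact body is an assumption; the
signature matches the call site in btrfs_find_create_tree_block()):

	#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
	struct extent_buffer *alloc_test_extent_buffer(
			struct btrfs_fs_info *fs_info, u64 start);
	#else
	static inline struct extent_buffer *alloc_test_extent_buffer(
			struct btrfs_fs_info *fs_info, u64 start)
	{
		/* Unreachable: btrfs_is_testing() is hardcoded to false. */
		return NULL;
	}
	#endif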
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[OUT-OF-BAND DIRTY FOLIOS]
An out-of-band dirty folio means the folio is marked dirty without
notifying the filesystem.
This can lead to various problems, not limited to:
- No folio::private to track per block status
- No proper space reserved for such a dirty folio
[HISTORY IN BTRFS]
This used to be a problem related to get_user_pages(), but with the
introduction of pin_user_pages*(), we should no longer hit such cases.
In btrfs, we have a long history of catching such out-of-band dirty
folios by:
- Mark the folio ordered during delayed allocation
- Check the folio ordered flag during writeback
If the folio has no ordered flag, it means it doesn't go through
delayed allocation, thus it's definitely an out-of-band
one.
If we got one, we go through COW fixup, which will re-dirty the folio
with proper handling in another workqueue.
[PROBLEMS OF COW-FIXUP]
Such a workaround is a blocker for migrating to iomap (it would require
extra flags to track whether a folio is dirtied by the fs or not), and
I'd argue it's not data checksum safe, since if a folio can be marked
dirty without informing the fs, the content can also change at any time.
But with the introduction of pin_user_pages*() in the v5.8 merge
window, such out-of-band dirty folios should be treated as a bug.
Ext4 has treated such cases by warning and erroring out even before
pin_user_pages*().
Furthermore, there is already proof that such folio ordered flag
tracking can be screwed up by incorrect error handling; check the
commit messages of the following commits:
06f364284794 ("btrfs: do proper folio cleanup when cow_file_range() failed")
c2b47df81c8e ("btrfs: do proper folio cleanup when run_delalloc_nocow() failed")
[FIXES]
Unlike btrfs, ext4 and xfs (iomap) never bother handling such
out-of-band dirty folios:
- Ext4 just warns and errors out
- Iomap always follows the folio/block dirty flags
And there is nothing really COW specific here; xfs supports COW too.
Here we take one step towards ext4 by warning and erroring out. But
since the COW fixup has been there from the beginning, keep the old
behavior for non-experimental builds, and only do the new warning for
experimental builds until we're 100% sure and can remove the COW fixup.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
Since the support of block size (sector size) < page size for btrfs,
test case generic/563 fails with 4K block size and 64K page size:
--- tests/generic/563.out 2024-04-25 18:13:45.178550333 +0930
+++ /home/adam/xfstests-dev/results//generic/563.out.bad 2024-09-30 09:09:16.155312379 +0930
@@ -3,7 +3,8 @@
read is in range
write is in range
write -> read/write
-read is in range
+read has value of 8388608
+read is NOT in range -33792 .. 33792
write is in range
...
[CAUSE]
The test case creates an 8MiB file, then does buffered writes over the
whole 8MiB using a 4K block size.
On 4K page sized systems, since the write range covers a full block and
page, btrfs will not bother reading the page, just like what XFS and
EXT4 do.
But on 64K page sized systems, although the 4K sized write is still
block aligned, it's not page aligned anymore, thus btrfs will read the
full page, which will be accounted by the cgroup and fail the test, as
the test case itself expects such 4K block aligned writes not to
trigger any read.
Such expected behavior is an optimization to reduce folio reads when
possible, and unfortunately btrfs does not implement that optimization.
[FIX]
To skip the full page read, we need to do the following modification:
- Do not trigger full page read as long as the buffered write is block
aligned
This is pretty simple by modifying the check inside
prepare_uptodate_page().
- Skip already uptodate blocks during full page read
Otherwise we can hit the following data corruption:
0 32K 64K
|///////| |
Where the file range [0, 32K) is dirtied by buffered write, the
remaining range [32K, 64K) is not.
When reading the full page, since [0, 32K) is only dirtied but not
written back, there is no data extent map for it, just a hole covering
[0, 64K).
If we continue reading the full page range [0, 64K), the dirtied range
will be filled with 0 (since there is only a hole covering the whole
range).
This causes the dirtied range to get lost.
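The first half of the fix boils down to a block-granularity check; a
sketch with a hypothetical helper name:

	static bool can_skip_folio_read(struct btrfs_fs_info *fs_info,
					loff_t pos, size_t len)
	{
		/*
		 * A buffered write covering whole fs blocks does not
		 * need the surrounding folio contents; page alignment
		 * no longer matters.
		 */
		return IS_ALIGNED(pos, fs_info->sectorsize) &&
		       IS_ALIGNED(pos + len, fs_info->sectorsize);
	}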
With this optimization, btrfs can pass generic/563 even if the page size
is larger than fs block size.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently if btrfs has its block size (the older term is sector size)
smaller than the page size, btrfs_do_readpage() will handle the range
extent by extent. This is good for performance as it doesn't need to
re-lookup the same extent map again and again.
(Although get_extent_map() already does an extra cached em check, thus
the optimization is not that obvious.)
This is totally fine and is a valid optimization, but it has an
assumption that there is no partial uptodate range in the page.
Meanwhile there is an incoming feature, requiring btrfs to skip the full
page read if a buffered write range covers a full block but not a full
page.
In that case, we can have a page that is partially uptodate, and the
current per-extent lookup cannot handle such case.
So here we change btrfs_do_readpage() to do block-by-block reads; this
simplifies the following things:
- Remove the need for the @iosize variable
  Because we just use sectorsize as our increment.
- Remove @pg_offset, and calculate it inside the loop when needed
  It's just offset_in_folio().
- Use a for() loop instead of a while() loop
This will slightly reduce the read performance for subpage cases, but
for the future where we need to skip already uptodate blocks, it should
still be worth it.
For block size == page size, this brings no performance change.
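The resulting loop shape, sketched with a hypothetical per-block helper:

	u64 cur;

	/* One iteration per fs block instead of per extent map range. */
	for (cur = start; cur < end; cur += fs_info->sectorsize) {
		/* @pg_offset is now derived on demand. */
		size_t pg_offset = offset_in_folio(folio, cur);
		int ret;

		ret = read_one_block(inode, folio, cur, pg_offset);
		if (ret < 0)
			return ret;
	}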
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently we're using btrfs_lock_and_flush_ordered_range() for both
btrfs_read_folio() and btrfs_readahead(), but it has one critical
problem for future subpage optimizations:
- It will call btrfs_start_ordered_extent() to write back the involved
folios
But remember we're calling btrfs_lock_and_flush_ordered_range() in the
read paths, meaning the folio is already locked by the read path.
If we really trigger writeback for those already locked folios, this
will lead to a deadlock, as writeback cannot get the folio lock.
Such a deadlock is currently prevented only by the fact that btrfs
always keeps a dirty folio uptodate, by either dirtying all blocks of
the folio, or by reading the whole folio before dirtying.
To prepare for the incoming patch which allows btrfs to skip full folio
read if the buffered write is block aligned, we have to start by solving
the possible deadlock first.
Instead of blindly calling btrfs_start_ordered_extent(), introduce a
new helper, which is smarter in the following ways:
- Only wait for and flush the ordered extent if
  * The folio doesn't even have the private bit set, or
  * Part of the blocks of the ordered extent are not uptodate
    This can happen when:
    * The folio writeback finished, then got invalidated.
      There are a lot of reasons that a folio can get invalidated,
      from memory pressure to direct IO (which invalidates all folios
      of the range), but the OE has not yet finished.
  We have to wait for the ordered extent, as the OE may contain
  to-be-inserted data checksums.
  Without waiting, our read can fail due to the missing checksums.
  But either way, the OE should not need any extra flush inside the
  locked folio range.
- Skip the ordered extent completely if
* All the blocks are dirty
This happens when OE creation is caused by a folio writeback whose
file offset is before our folio.
E.g. 16K page size and 4K block size
0 8K 16K 24K 32K
|//////////////||///////| |
The writeback of folio 0 created an OE for range [0, 24K), but since
folio 16K is not fully uptodate, a read is triggered for folio 16K.
The writeback will never happen (we're holding the folio lock for
read), nor will the OE finish.
Thus we must skip the range.
* All the blocks are uptodate
  This happens when the writeback finished, but the OE has not yet
  finished.
  Since the blocks are already uptodate, we can skip the OE range.
The new helper lock_extents_for_read() will do a loop for the target
range by:
1) Lock the full range
2) If there is no ordered extent in the remaining range, exit
3) If there is an ordered extent that we can skip
Skip to the end of the OE, and continue checking
We do not trigger writeback nor wait for the OE.
4) If there is an ordered extent that we cannot skip
Unlock the whole extent range and start the ordered extent.
And also update btrfs_start_ordered_extent() to add two more parameters:
@nowriteback_start and @nowriteback_len, to prevent triggering flush for
a certain range.
This will allow us to handle the following case properly in the future:
16K page size, 4K btrfs block size:
0 4K 8K 12K 16K 20K 24K 28K 32K
|/////////////////////////////||////////////////| | |
|<-------------------- OE 2 ------------------->| |< OE 1 >|
The folio has been written back before, thus we have an OE at
[28K, 32K).
Although the OE 1 finished its IO, the OE is not yet removed from IO
tree.
The folio got invalidated after writeback completed and before the
ordered extent finished.
And [16K, 24K) range is dirty and uptodate, caused by a block aligned
buffered write (and future enhancements allowing btrfs to skip full
folio read for such case).
But writeback for folio 0 has begun, thus it generated OE 2, covering
range [0, 24K).
Since the full folio 16K is not uptodate, if we want to read the folio,
the existing btrfs_lock_and_flush_ordered_range() will deadlock:
btrfs_read_folio()
| Folio 16K is already locked
|- btrfs_lock_and_flush_ordered_range()
|- btrfs_start_ordered_extent() for range [16K, 24K)
|- filemap_fdatawrite_range() for range [16K, 24K)
|- extent_write_cache_pages()
folio_lock() on folio 16K, deadlock.
But now we will have the following sequence:
btrfs_read_folio()
| Folio 16K is already locked
|- lock_extents_for_read()
|- can_skip_ordered_extent() for range [16K, 24K)
| Returned true, the range [16K, 24K) will be skipped.
|- can_skip_ordered_extent() for range [28K, 32K)
| Returned false.
|- btrfs_start_ordered_extent() for range [28K, 32K) with
[16K, 32K) as no writeback range
No writeback for folio 16K will be triggered.
And there will be no more possible deadlock on the same folio.
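Putting the pieces together, the helper's loop can be sketched like
this (names taken from the description above, details simplified; the
exact signatures are assumptions):

	u64 cur = start;

again:
	/* 1) Lock the full range. */
	lock_extent(&inode->io_tree, start, end, cached);
	while (cur < end) {
		struct btrfs_ordered_extent *ordered;

		ordered = btrfs_lookup_ordered_range(inode, cur,
						     end - cur + 1);
		if (!ordered)
			return;		/* 2) nothing left to check */
		if (can_skip_ordered_extent(inode, ordered, start, end)) {
			/* 3) All blocks dirty or uptodate, skip the OE. */
			cur = ordered->file_offset + ordered->num_bytes;
			btrfs_put_ordered_extent(ordered);
			continue;
		}
		/* 4) Cannot skip: unlock, flush with our locked folios
		 * excluded from writeback, then retry. */
		unlock_extent(&inode->io_tree, start, end, cached);
		btrfs_start_ordered_extent(ordered, start,
					   end + 1 - start);
		btrfs_put_ordered_extent(ordered);
		goto again;
	}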
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
As the helper num_extent_folios() is now __pure, we can use it in a for
loop without storing its value in a variable explicitly; the compiler
will do this for us.
The effect on btrfs.ko is -200 bytes and there are stack space savings
too:
btrfs_clone_extent_buffer -8 (32 -> 24)
btrfs_clear_buffer_dirty -8 (48 -> 40)
clear_extent_buffer_uptodate -8 (40 -> 32)
set_extent_buffer_dirty -8 (32 -> 24)
write_one_eb -8 (88 -> 80)
set_extent_buffer_uptodate -8 (40 -> 32)
read_extent_buffer_pages_nowait -16 (64 -> 48)
find_extent_buffer -8 (32 -> 24)
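As a small illustration of what __pure buys (process_folio() here is a
stand-in for whatever the loop body does):

	/* Before: the result had to be cached in a local explicitly. */
	int num_folios = num_extent_folios(eb);
	for (int i = 0; i < num_folios; i++)
		process_folio(eb->folios[i]);

	/* After: __pure tells the compiler the result cannot change
	 * between iterations, so it hoists the call itself. */
	for (int i = 0; i < num_extent_folios(eb); i++)
		process_folio(eb->folios[i]);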
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Unlike the folio helpers for data, the ones for metadata always take the
extent buffer start and length, so they can be simplified to take the
eb only. The fs_info can be obtained from the eb too, so it can be
dropped as a parameter.
The helpers were added in the patch "btrfs: use metadata specific
helpers to simplify extent buffer helpers".
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
After the previous patch removing nodesize from the parameters,
__alloc_dummy_extent_buffer() and alloc_dummy_extent_buffer() are
identical, so we can drop one.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
All callers pass a valid fs_info, so we can read the nodesize from that
instead of passing it as a parameter.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since we have btrfs_meta_is_subpage(), we should make btrfs_is_subpage()
data inode specific.
This change involves:
- Simplify btrfs_is_subpage()
Now we only need to do a very simple sectorsize check against
PAGE_SIZE.
And since the function is pretty simple now, just make it an inline
function.
- Add an extra ASSERT() to make sure btrfs_is_subpage() is only called
on data inode mapping
- Migrate btree_csum_one_bio() to use btrfs_meta_folio_*() helpers
- Migrate alloc_extent_buffer() to use btrfs_meta_folio_*() helpers
- Migrate end_bbio_meta_write() to use btrfs_meta_folio_*() helpers
Otherwise we would trigger the ASSERT() due to calling btrfs_folio_*()
helpers on metadata folios.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Use a shared bio_add_folio_nofail() with calculated
range_start/range_len, so no explicit subpage routine is needed anymore.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently inside write_one_eb() we have two different ways of handling
subpage and regular metadata.
The differences are:
- Extra offset/length calculation when adding the folio range to bio for
subpage cases
- Only decrease wbc->nr_to_write if the whole page is no longer dirty
for subpage cases
- Use subpage helper for subpage cases
Merge the two ways into a shared one:
- Always calculate the to-be-queued range
  So that bio_add_folio() can use the same calculated length and offset
  for both cases.
- Use btrfs_meta_folio_clear_dirty() and
btrfs_meta_folio_set_writeback() helpers
This will cover both cases.
- Only decrease wbc->nr_to_write if the folio is no longer dirty
Since we have the folio locked, no one else can modify the folio dirty
flags (set_extent_buffer_dirty() will also lock the folio for subpage
cases).
Thus after our btrfs_meta_folio_clear_dirty() call, if the whole folio
is no longer dirty, we're submitting the last dirty eb of the folio,
and can decrease wbc->nr_to_write properly.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function btrfs_clear_buffer_dirty() is called on a dirty extent
buffer that will not be written back.
The function will call btree_clear_folio_dirty() to clear the folio
dirty flag and also clear the PAGECACHE_TAG_DIRTY tag.
And we split the subpage and regular handling, as for subpage cases we
should only clear PAGECACHE_TAG_DIRTY if the last dirty extent buffer
in the page is cleared.
So here we can simplify the function by:
- Use the newly introduced btrfs_meta_folio_clear_and_test_dirty() helper
The helper will return true if we cleared the folio dirty flag.
With that we can use the same helper for both subpage and regular
cases.
- Rename btree_clear_folio_dirty() to btree_clear_folio_dirty_tag()
  As we move the folio dirty clearing into btrfs_clear_buffer_dirty().
- Call btrfs_meta_folio_clear_and_test_dirty() to clear the dirty flags
for both regular and subpage metadata cases
- Only call btree_clear_folio_dirty_tag() when the folio is no longer
dirty
- Update the comment inside set_extent_buffer_dirty()
As there is no separate clear_subpage_extent_buffer_dirty() anymore.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The following functions are doing metadata specific checks:
- set_extent_buffer_uptodate()
- clear_extent_buffer_uptodate()
The reason why we do not use the btrfs_folio_*() helpers for those
functions is that btrfs_is_subpage() cannot handle dummy extent buffers
if nodesize >= PAGE_SIZE but block size < PAGE_SIZE.
In that case, we do not need to attach extra bitmaps to the extent
buffer folio. But since dummy extent buffer folios are not attached to
the btree inode, btrfs_is_subpage() will return true, causing problems.
And the following are using btrfs_folio_*() helpers for metadata, but
in theory we should use metadata specific checks:
- set_extent_buffer_dirty()
This is not causing problems because a dummy extent buffer should never
be marked dirty.
To make the code simpler, introduce btrfs_meta_folio_*() helpers to do
the metadata specific handling, so that we do not have to open-code
such checks in the functions involved above.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently subpage attach/detach is not doing proper dummy extent buffer
subpage check, as btrfs_is_subpage() is not reliable for dummy extent
buffer folios.
Since we have a metadata specific check now, use that for
btrfs_attach_subpage() first.
Then enhance btrfs_detach_subpage() to accept a type parameter, so that
we can do extra checks for dummy extent buffers properly.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently we have only one btrfs_is_subpage() to cover both data and
metadata.
But there is a special case for metadata:
- A dummy extent buffer, with sector size < PAGE_SIZE and node size >=
  PAGE_SIZE
  In such a case, btrfs_is_subpage() will return true for the extent
  buffer folio.
But that is not correct, and that's exactly why we have some open-coded
checks for functions like set_extent_buffer_uptodate() and
clear_extent_buffer_uptodate().
Just extract the metadata specific checks into a helper, and replace
those call sites.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|