summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-07-25btrfs: increase direct io read size limit to 256 sectorsChristoph Hellwig
Btrfs currently limits direct I/O reads to a single sector, which goes back to commit c329861da406 ("Btrfs: don't allocate a separate csums array for direct reads") from Josef. That commit changes the direct I/O code to ".. use the private part of the io_tree for our csums.", but ten years later that isn't how checksums for direct reads work, instead they use a csums allocation on a per-btrfs_dio_private basis (which have their own performance problem for small I/O, but that will be addressed later). There is no fundamental limit in btrfs itself to limit the I/O size except for the size of the checksum array that scales linearly with the number of sectors in an I/O. Pick a somewhat arbitrary limit of 256 limits, which matches what the buffered reads typically see as the upper limit as the limit for direct I/O as well. This significantly improves direct read performance. For example a fio run doing 1 MiB aio reads with a queue depth of 1 roughly triples the throughput: Baseline: READ: bw=65.3MiB/s (68.5MB/s), 65.3MiB/s-65.3MiB/s (68.5MB/s-68.5MB/s), io=19.1GiB (20.6GB), run=300013-300013msec With this patch: READ: bw=196MiB/s (206MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s), io=57.5GiB (61.7GB), run=300006-300006msc Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: raid56: don't trust any cached sector in __raid56_parity_recover()Qu Wenruo
[BUG] There is a small workload which will always fail with recent kernel: (A simplified version from btrfs/125 test case) mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3 mount $dev1 $mnt xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1 sync umount $mnt btrfs dev scan -u $dev3 mount -o degraded $dev1 $mnt xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2 umount $mnt btrfs dev scan mount $dev1 $mnt btrfs balance start --full-balance $mnt umount $mnt The failure is always failed to read some tree blocks: BTRFS info (device dm-4): relocating block group 217710592 flags data|raid5 BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7 BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7 ... [CAUSE] With the recently added debug output, we can see all RAID56 operations related to full stripe 38928384: 56.1183: raid56_read_partial: full_stripe=38928384 devid=2 type=DATA1 offset=0 opf=0x0 physical=9502720 len=65536 56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=16384 opf=0x0 physical=9519104 len=16384 56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x0 physical=9551872 len=16384 56.1187: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=0 opf=0x1 physical=9502720 len=16384 56.1188: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=32768 opf=0x1 physical=9535488 len=16384 56.1188: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=0 opf=0x1 physical=30474240 len=16384 56.1189: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=32768 opf=0x1 physical=30507008 len=16384 56.1218: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x1 physical=9551872 len=16384 56.1219: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=49152 opf=0x1 physical=30523392 len=16384 56.2721: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2 56.2723: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2 56.2724: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2 Before we enter raid56_parity_recover(), we have triggered some metadata write for the full stripe 38928384, this leads to us to read all the sectors from disk. Furthermore, btrfs raid56 write will cache its calculated P/Q sectors to avoid unnecessary read. This means, for that full stripe, after any partial write, we will have stale data, along with P/Q calculated using that stale data. Thankfully due to patch "btrfs: only write the sectors in the vertical stripe which has data stripes" we haven't submitted all the corrupted P/Q to disk. When we really need to recover certain range, aka in raid56_parity_recover(), we will use the cached rbio, along with its cached sectors (the full stripe is all cached). This explains why we have no event raid56_scrub_read_recover() triggered. Since we have the cached P/Q which is calculated using the stale data, the recovered one will just be stale. In our particular test case, it will always return the same incorrect metadata, thus causing the same error message "parent transid verify failed on 39010304 wanted 9 found 7" again and again. [BTRFS DESTRUCTIVE RMW PROBLEM] Test case btrfs/125 (and above workload) always has its trouble with the destructive read-modify-write (RMW) cycle: 0 32K 64K Data1: | Good | Good | Data2: | Bad | Bad | Parity: | Good | Good | In above case, if we trigger any write into Data1, we will use the bad data in Data2 to re-generate parity, killing the only chance to recovery Data2, thus Data2 is lost forever. This destructive RMW cycle is not specific to btrfs RAID56, but there are some btrfs specific behaviors making the case even worse: - Btrfs will cache sectors for unrelated vertical stripes. In above example, if we're only writing into 0~32K range, btrfs will still read data range (32K ~ 64K) of Data1, and (64K~128K) of Data2. This behavior is to cache sectors for later update. Incidentally commit d4e28d9b5f04 ("btrfs: raid56: make steal_rbio() subpage compatible") has a bug which makes RAID56 to never trust the cached sectors, thus slightly improve the situation for recovery. Unfortunately, follow up fix "btrfs: update stripe_sectors::uptodate in steal_rbio" will revert the behavior back to the old one. - Btrfs raid56 partial write will update all P/Q sectors and cache them This means, even if data at (64K ~ 96K) of Data2 is free space, and only (96K ~ 128K) of Data2 is really stale data. And we write into that (96K ~ 128K), we will update all the parity sectors for the full stripe. This unnecessary behavior will completely kill the chance of recovery. Thankfully, an unrelated optimization "btrfs: only write the sectors in the vertical stripe which has data stripes" will prevent submitting the write bio for untouched vertical sectors. That optimization will keep the on-disk P/Q untouched for a chance for later recovery. [FIX] Although we have no good way to completely fix the destructive RMW (unless we go full scrub for each partial write), we can still limit the damage. With patch "btrfs: only write the sectors in the vertical stripe which has data stripes" now we won't really submit the P/Q of unrelated vertical stripes, so the on-disk P/Q should still be fine. Now we really need to do is just drop all the cached sectors when doing recovery. By this, we have a chance to read the original P/Q from disk, and have a chance to recover the stale data, while still keep the cache to speed up regular write path. In fact, just dropping all the cache for recovery path is good enough to allow the test case btrfs/125 along with the small script to pass reliably. The lack of metadata write after the degraded mount, and forced metadata COW is saving us this time. So this patch will fix the behavior by not trust any cache in __raid56_parity_recover(), to solve the problem while still keep the cache useful. But please note that this test pass DOES NOT mean we have solved the destructive RMW problem, we just do better damage control a little better. Related patches: - btrfs: only write the sectors in the vertical stripe - d4e28d9b5f04 ("btrfs: raid56: make steal_rbio() subpage compatible") - btrfs: update stripe_sectors::uptodate in steal_rbio Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: remove the finish_func argument to btrfs_mark_ordered_io_finishedChristoph Hellwig
finish_func is always set to finish_ordered_fn, so remove it and also the now pointless and somewhat confusingly named __endio_write_update_ordered wrapper. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: batch up release of reserved metadata for delayed items used for deletionNikolay Borisov
With Filipe's recent rework of the delayed inode code one aspect which isn't batched is the release of the reserved metadata of delayed inode's delete items. With this patch on top of Filipe's rework and running the same test as provided in the description of a patch titled "btrfs: improve batch deletion of delayed dir index items" I observe the following change of the number of calls to btrfs_block_rsv_release: Before this change: - block_rsv_release: 1004 - btrfs_delete_delayed_items_total_time: 14602 - delete_batches: 505 After: - block_rsv_release: 510 - btrfs_delete_delayed_items_total_time: 13643 - delete_batches: 507 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: warn about dev extents that are inside the reserved rangeQu Wenruo
Btrfs on-disk format has reserved the first 1MiB for the primary super block (at 64KiB offset) and bootloaders may also use this space. This behavior is only introduced since v4.1 btrfs-progs release, although kernel can ensure we never touch the reserved range of super blocks, it's better to inform the end users, and a balance will resolve the problem. Signed-off-by: Qu Wenruo <wqu@suse.com> [ update changelog and message ] Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: use named constant for reserved device spaceQu Wenruo
There's a reserved space on each device of size 1MiB that can be used by bootloaders or to avoid accidental overwrite. Use a symbolic constant with the explaining comment instead of hard coding the value and multiple comments. Note: since btrfs-progs v4.1, mkfs.btrfs will reserve the first 1MiB for the primary super block (at offset 64KiB), until then the range could have been used by mistake. Kernel has been always respecting the 1MiB range for writes. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: remove unused typedefs get_extent_t and btrfs_work_func_tDavid Sterba
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: sink iterator parameter to btrfs_ioctl_logical_to_inoDavid Sterba
There's only one function we pass to iterate_inodes_from_logical as iterator, so we can drop the indirection and call it directly, after moving the function to backref.c Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: simplify parameters of backref iteratorsDavid Sterba
The inode reference iterator interface takes parameters that are derived from the context parameter, but as it's a void* type the values are passed individually. Change the ctx type to inode_fs_path as it's the only thing we pass and drop any parameters that are derived from that. Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: call inode_to_path directly and drop indirectionDavid Sterba
The functions for iterating inode reference take a function parameter but there's only one value, inode_to_path(). Remove the indirection and call the function. As paths_from_inode would become just an alias for iterate_irefs(), merge the two into one function. Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: use ncopies from btrfs_raid_array in btrfs_num_copies()Qu Wenruo
For all non-RAID56 profiles, we can use btrfs_raid_array[].ncopies directly, only for RAID5 and RAID6 we need some extra handling as there's no table value for that. For RAID10 there's a change from sub_stripes to ncopies. The values are the same but semantically we want to use number of copies, as this is what btrfs_num_copies does. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: use btrfs_raid_array to calculate number of parity stripesQu Wenruo
Use the raid table instead of hard coded values and rename the helper as it is exported. This could make later extension on RAID56 based profiles easier. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: use btrfs_chunk_max_errors() to replace tolerance calculationQu Wenruo
In __btrfs_map_block() we have an assignment to @max_errors using nr_parity_stripes(). Although it works for RAID56 it's confusing. Replace it with btrfs_chunk_max_errors(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: remove parameter dev_extent_len from scrub_stripe()Qu Wenruo
For scrub_stripe() we can easily calculate the dev extent length as we have the full info of the chunk. Thus there is no need to pass @dev_extent_len from the caller, and we introduce a helper, btrfs_calc_stripe_length(), to do the calculation from extent_map structure. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: unify tree search helper returning prev and next nodesDavid Sterba
Simplify helper to return only next and prev pointers, we don't need all the node/parent/prev/next pointers of __etree_search as there are now other specialized helpers. Rename parameters so they follow the naming. Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: make tree search for insert more generic and use it for tree_searchDavid Sterba
With a slight extension of tree_search_for_insert (fill the return node and parent return parameters) we can avoid calling __etree_search from tree_search, that could be removed eventually in followup patches. Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: open code inexact rbtree search in tree_searchDavid Sterba
The call chain from tree_search tree_search_for_insert __etree_search can be open coded and allow further simplifications, here we need a tree search with fallback to the next node in case it's not found. This is represented as __etree_search parameters next_ret=valid, prev_ret=NULL. Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: remove node and parent parameters from insert_stateDavid Sterba
There's no caller left that would pass valid pointers to insert_state so we can drop them. Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: add fast path for extent_state insertionDavid Sterba
In two cases the exact location where to insert the extent state is known at the call time so we don't need to pass it to insert_state that takes the fast path. Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: pass bits by value not by pointer for extent_state helpersDavid Sterba
The bits are passed to all extent state helpers for no apparent reason, the value only read and never updated so remove the indirection and pass it directly. Also unify the type to u32 where needed. Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: lift start and end parameters to callers of insert_stateDavid Sterba
Let callers of insert_state to set up the extent state to allow further simplifications of the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: open code rbtree search in insert_stateDavid Sterba
The rbtree search is a known pattern and can be open coded, allowing to remove the tree_insert and further cleanups. Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: open code rbtree search in split_stateDavid Sterba
Preparatory work to remove tree_insert from extent_io.c, the rbtree search loop is a known and simple so it can be open coded. Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: raid56: avoid double for loop inside raid56_parity_scrub_stripe()Qu Wenruo
Originally it's iterating all the sectors which has dbitmap sector for the vertical stripe. It can be easily converted to sector bytenr iteration with an test_bit() call. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: raid56: avoid double for loop inside raid56_rmw_stripe()Qu Wenruo
This function doesn't even utilize full stripe skip, just iterate all the data sectors is definitely enough. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: raid56: avoid double for loop inside alloc_rbio_essential_pages()Qu Wenruo
The double loop is just checking if the page for the vertical stripe is allocated. We can easily convert it to single loop and get rid of @stripe variable. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: raid56: avoid double for loop inside __raid56_parity_recover()Qu Wenruo
The double for loop can be easily converted to single for loop as we're really iterating the sectors in their bytenr order. The only exception is the full stripe skip, however that can also easily be done inside the loop. Add an ASSERT() along with a comment for that specific case. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: raid56: avoid double for loop inside finish_rmw()Qu Wenruo
We can easily calculate the stripe number and sector number inside the stripe. Thus there is not much need for a double for loop. For the only case we want to skip the whole stripe, we can manually increase @total_sector_nr. This is not a recommended behavior, thus every time the iterator gets modified there will be a comment along with an ASSERT() for it. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: tree-log: make the return value for log syncing consistentJosef Bacik
Currently we will return 1 or -EAGAIN if we decide we need to commit the transaction rather than sync the log. In practice this doesn't really matter, we interpret any !0 and !BTRFS_NO_LOG_SYNC as needing to commit the transaction. However this makes it hard to figure out what the correct thing to do is. Fix this up by defining BTRFS_LOG_FORCE_COMMIT and using this in all the places where we want to force the transaction to be committed. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: add tracepoints for ordered extentsJohannes Thumshirn
When debugging a reference counting issue with ordered extents, I've found we're lacking a lot of tracepoint coverage in the ordered extent code. Close these gaps by adding tracepoints after every refcount_inc() in the ordered extent code. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: sysfs: advertise zoned support among featuresDavid Sterba
We've hidden the zoned support in sysfs under debug config for the first releases but now the stability is reasonable, though not all features have been implemented. Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: split discard handling out of btrfs_map_blockChristoph Hellwig
Mapping block for discard doesn't really share any code with the regular block mapping case. Split it out into an entirely separate helper that just returns an array of btrfs_discard_stripe structures and the number of stripes. This removes the need for the length field in the btrfs_io_context structure, so remove tht. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: stop looking at btrfs_bio->iter in index_one_bioChristoph Hellwig
All the bios that index_one_bio operates on are the bios submitted by the upper layer. These are never resubmitted to an actual device by the raid56 code, and thus the iter never changes from the initial state. Thus we can always just use bi_iter directly as it will be the same as the saved copy. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: reject log replay if there is unsupported RO compat flagQu Wenruo
[BUG] If we have a btrfs image with dirty log, along with an unsupported RO compatible flag: log_root 30474240 ... compat_flags 0x0 compat_ro_flags 0x40000003 ( FREE_SPACE_TREE | FREE_SPACE_TREE_VALID | unknown flag: 0x40000000 ) Then even if we can only mount it RO, we will still cause metadata update for log replay: BTRFS info (device dm-1): flagging fs with big metadata feature BTRFS info (device dm-1): using free space tree BTRFS info (device dm-1): has skinny extents BTRFS info (device dm-1): start tree-log replay This is definitely against RO compact flag requirement. [CAUSE] RO compact flag only forces us to do RO mount, but we will still do log replay for plain RO mount. Thus this will result us to do log replay and update metadata. This can be very problematic for new RO compat flag, for example older kernel can not understand v2 cache, and if we allow metadata update on RO mount and invalidate/corrupt v2 cache. [FIX] Just reject the mount unless rescue=nologreplay is provided: BTRFS error (device dm-1): cannot replay dirty log with unsupport optional features (0x40000000), try rescue=nologreplay instead We don't want to set rescue=nologreply directly, as this would make the end user to read the old data, and cause confusion. Since the such case is really rare, we're mostly fine to just reject the mount with an error message, which also includes the proper workaround. CC: stable@vger.kernel.org #4.9+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: make btrfs_super_block::log_root_transid deprecatedQu Wenruo
When using "btrfs inspect-internal dump-super" to inspect an fs with dirty log, it always shows the log_root_transid as 0: log_root 30474240 log_root_transid 0 <<< log_root_level 0 It turns out that, btrfs_super_block::log_root_transid is never really utilized (even no read for it). This can date back to the introduction of btrfs into upstream kernel. In fact, when reading log tree root, we always use btrfs_super_block::generation + 1 as the expected generation. So here we're completely safe to mark this member deprecated. In theory we can easily reuse this member for other purposes, but to be extra safe, here we follow the leafsize way, by adding "__unused_" for log_root_transid. And we can safely remove the accessors, since there is no such callers from the very beginning. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: pass the btrfs_bio_ctrl to submit_one_bioChristoph Hellwig
submit_one_bio always works on the bio and compression flags from a btrfs_bio_ctrl structure. Pass the explicitly and clean up the calling conventions by handling a NULL bio in submit_one_bio, and using the btrfs_bio_ctrl to pass the mirror number as well. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: merge end_write_bio and flush_write_bioChristoph Hellwig
Merge end_write_bio and flush_write_bio into a single submit_write_bio helper, that either submits the bio or ends it if a negative errno was passed in. This consolidates a lot of duplicated checks in the callers. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: don't use bio->bi_private to pass the inode to submit_one_bioChristoph Hellwig
submit_one_bio is only used for page cache I/O, so the inode can be trivially derived from the first page in the bio. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: remove redundant check in up check_setget_boundsDavid Sterba
There are two separate checks in the bounds checker, the first one being a special case of the second. As this function is performance critical due to checking access to any eb member, reducing the size can slightly improve performance. On a release build on x86_64 the helper is completely inlined so the function call overhead is also gone. There was a report of 5% performance drop on metadata heavy workload, that disappeared after disabling asserts. The most significant part of that is the bounds checker. https://lore.kernel.org/linux-btrfs/20200724164147.39925-1-josef@toxicpanda.com/ After the analysis, the optimized code removes the worst overhead which is the function call and the performance was restored. https://lore.kernel.org/linux-btrfs/20200730110943.GE3703@twin.jikos.cz/ 1. baseline, asserts on, setget check on run time: 46s run time with perf: 48s 2. asserts on, comment out setget check run time: 44s run time with perf: 47s So this is confirms the 5% difference 3. asserts on, optimized seget check run time: 44s run time with perf: 47s The optimizations are reducing the number of ifs to 1 and inlining the hot path. Low-level stuff, gets the performance back. Patch below. 4. asserts off, no setget check run time: 44s run time with perf: 45s This verifies that asserts other than the setget check have negligible impact on performance and it's not harmful to keep them on. Analysis where the performance is lost: * check_setget_bounds is short function, but it's still a function call, changing the flow of instructions and given how many times it's called the overhead adds up * there are two conditions, one to check if the range is completely outside (member_offset > eb->len) or partially inside (member_offset + size > eb->len) Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: replace kmap() with kmap_local_page() in lzo.cFabio M. De Francesco
The use of kmap() is being deprecated in favor of kmap_local_page() where it is feasible. With kmap_local_page(), the mapping is per thread, CPU local and not globally visible. Therefore, use kmap_local_page() / kunmap_local() in lzo.c wherever the mappings are per thread and not globally visible. Tested on QEMU + KVM 32 bits VM with 4GB of RAM and HIGHMEM64G enabled. Suggested-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: replace kmap() with kmap_local_page() in inode.cFabio M. De Francesco
The use of kmap() is being deprecated in favor of kmap_local_page() where it is feasible. With kmap_local_page(), the mapping is per thread, CPU local and not globally visible. Therefore, use kmap_local_page() / kunmap_local() in inode.c wherever the mappings are per thread and not globally visible. Tested on QEMU + KVM 32 bits VM with 4GB of RAM and HIGHMEM64G enabled. Suggested-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: do not allocate a btrfs_bio for low-level biosChristoph Hellwig
The bios submitted from btrfs_map_bio don't really interact with the rest of btrfs and the only btrfs_bio member actually used in the low-level bios is the pointer to the btrfs_io_context used for endio handler. Use a union in struct btrfs_io_stripe that allows the endio handler to find the btrfs_io_context and remove the spurious ->device assignment so that a plain fs_bio_set bio can be used for the low-level bios allocated inside btrfs_map_bio. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: factor stripe submission logic out of btrfs_map_bioChristoph Hellwig
Move all per-stripe handling into submit_stripe_bio and use a label to cleanup instead of duplicating the logic. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: remove btrfs_end_io_wqChristoph Hellwig
All reads bio that go through btrfs_map_bio need to be completed in user context. And read I/Os are the most common and timing critical in almost any file system workloads. Embed a work_struct into struct btrfs_bio and use it to complete all read bios submitted through btrfs_map, using the REQ_META flag to decide which workqueue they are placed on. This removes the need for a separate 128 byte allocation (typically rounded up to 192 bytes by slab) for all reads with a size increase of 24 bytes for struct btrfs_bio. Future patches will reorganize struct btrfs_bio to make use of this extra space for writes as well. (All sizes are based a on typical 64-bit non-debug build) Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: centralize setting REQ_METAChristoph Hellwig
Set REQ_META in btrfs_submit_metadata_bio instead of the various callers. We'll start relying on this flag inside of btrfs in a bit, and this ensures it is always set correctly. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: don't use btrfs_bio_wq_end_io for compressed writesChristoph Hellwig
Compressed write bio completion is the only user of btrfs_bio_wq_end_io for writes, and the use of btrfs_bio_wq_end_io is a little suboptimal here as we only real need user context for the final completion of a compressed_bio structure, and not every single bio completion. Add a work_struct to struct compressed_bio instead and use that to call finish_compressed_bio_write. This allows to remove all handling of write bios in the btrfs_bio_wq_end_io infrastructure. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: don't double-defer bio completions for compressed readsChristoph Hellwig
The bio completion handler of the bio used for the compressed data is already run in a workqueue using btrfs_bio_wq_end_io, so don't schedule the completion of the original bio to the same workqueue again but just execute it directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: defer I/O completion based on the btrfs_raid_bioChristoph Hellwig
Instead of attaching an extra allocation an indirect call to each low-level bio issued by the RAID code, add a work_struct to struct btrfs_raid_bio and only defer the per-rbio completion action. The per-bio action for all the I/Os are trivial and can be safely done from interrupt context. As a nice side effect this also allows sharing the boilerplate code for the per-bio completions Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: split btrfs_submit_data_bio to read and write partsChristoph Hellwig
Split btrfs_submit_data_bio into one helper for reads and one for writes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: simplify code flow in btrfs_submit_dio_bioChristoph Hellwig
There is no exit block and cleanup and the function is reasonably short so we can use inline return and not the goto. This makes the function more straight forward. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>