summaryrefslogtreecommitdiff
path: root/fs/btrfs/extent_io.c
AgeCommit message (Collapse)Author
2025-03-18btrfs: remove btrfs_fs_info::sectors_per_pageQu Wenruo
For the future large folio support, our filemap can have folios with different sizes, thus we can no longer rely on a fixed blocks_per_page value. To prepare for that future, here we do: - Remove btrfs_fs_info::sectors_per_page - Introduce a helper, btrfs_blocks_per_folio() Which uses the folio size to calculate the number of blocks for each folio. - Migrate the existing btrfs_fs_info::sectors_per_page to use that helper There are some exceptions: * Metadata nodesize < page size support In the future, even if we support large folios, we will only allocate a folio that matches our nodesize. Thus we won't have a folio covering multiple metadata unless nodesize < page size. * Existing subpage bitmap dump We use a single unsigned long to store the bitmap. That means until we change the bitmap dumping code, our upper limit for folio size will only be 256K (4K block size, 64 bit unsigned long). * btrfs_is_subpage() check This will be migrated into a future patch. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: avoid assigning twice to block_start at btrfs_do_readpage()Filipe Manana
At btrfs_do_readpage() if we get an extent map for a prealloc extent we end up assigning twice to the 'block_start' variable, first the value returned by extent_map_block_start() and then EXTENT_MAP_HOLE. This is pointless so make it more clear by using an if-else statement and doing only one assignment. Also, while at it, move the declaration of 'block_start' into the while loop's scope, since it's not used outside of it and the related 'disk_bytenr' is also declared in this scope. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-17btrfs: remove duplicated metadata folio flag update in end_bbio_meta_read()Qu Wenruo
In that function we call set_extent_buffer_uptodate() or clear_extent_buffer_uptodate(), which will already update the uptodate flag for all the involved extent buffer folios. Thus there is no need to update the folio uptodate flags again. Just remove the open-coded part. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-17btrfs: update some folio related commentsMatthew Wilcox (Oracle)
Remove references to the page lock and page->mapping. Also btrfs folios can never be swizzled into swap (mentioned in extent_write_cache_pages()). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-02-13Merge tag 'for-6.14-rc2-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix stale page cache after race between readahead and direct IO write - fix hole expansion when writing at an offset beyond EOF, the range will not be zeroed - use proper way to calculate offsets in folio ranges * tag 'for-6.14-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix hole expansion when writing at an offset beyond EOF btrfs: fix stale page cache after race between readahead and direct IO write btrfs: fix two misuses of folio_shift()
2025-02-11btrfs: fix stale page cache after race between readahead and direct IO writeFilipe Manana
After commit ac325fc2aad5 ("btrfs: do not hold the extent lock for entire read") we can now trigger a race between a task doing a direct IO write and readahead. When this race is triggered it results in tasks getting stale data when they attempt do a buffered read (including the task that did the direct IO write). This race can be sporadically triggered with test case generic/418, failing like this: $ ./check generic/418 FSTYP -- btrfs PLATFORM -- Linux/x86_64 debian0 6.13.0-rc7-btrfs-next-185+ #17 SMP PREEMPT_DYNAMIC Mon Feb 3 12:28:46 WET 2025 MKFS_OPTIONS -- /dev/sdc MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1 generic/418 14s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad) --- tests/generic/418.out 2020-06-10 19:29:03.850519863 +0100 +++ /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad 2025-02-03 15:42:36.974609476 +0000 @@ -1,2 +1,5 @@ QA output created by 418 +cmpbuf: offset 0: Expected: 0x1, got 0x0 +[6:0] FAIL - comparison failed, offset 24576 +diotest -wp -b 4096 -n 8 -i 4 failed at loop 3 Silence is golden ... (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/418.out /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad' to see the entire diff) Ran: generic/418 Failures: generic/418 Failed 1 of 1 tests The race happens like this: 1) A file has a prealloc extent for the range [16K, 28K); 2) Task A starts a direct IO write against file range [24K, 28K). At the start of the direct IO write it invalidates the page cache at __iomap_dio_rw() with kiocb_invalidate_pages() for the 4K page at file offset 24K; 3) Task A enters btrfs_dio_iomap_begin() and locks the extent range [24K, 28K); 4) Task B starts a readahead for file range [16K, 28K), entering btrfs_readahead(). First it attempts to read the page at offset 16K by entering btrfs_do_readpage(), where it calls get_extent_map(), locks the range [16K, 20K) and gets the extent map for the range [16K, 28K), caching it into the 'em_cached' variable declared in the local stack of btrfs_readahead(), and then unlocks the range [16K, 20K). Since the extent map has the prealloc flag, at btrfs_do_readpage() we zero out the page's content and don't submit any bio to read the page from the extent. Then it attempts to read the page at offset 20K entering btrfs_do_readpage() where we reuse the previously cached extent map (decided by get_extent_map()) since it spans the page's range and it's still in the inode's extent map tree. Just like for the previous page, we zero out the page's content since the extent map has the prealloc flag set. Then it attempts to read the page at offset 24K entering btrfs_do_readpage() where we reuse the previously cached extent map (decided by get_extent_map()) since it spans the page's range and it's still in the inode's extent map tree. Just like for the previous pages, we zero out the page's content since the extent map has the prealloc flag set. Note that we didn't lock the extent range [24K, 28K), so we didn't synchronize with the ongoing direct IO write being performed by task A; 5) Task A enters btrfs_create_dio_extent() and creates an ordered extent for the range [24K, 28K), with the flags BTRFS_ORDERED_DIRECT and BTRFS_ORDERED_PREALLOC set; 6) Task A unlocks the range [24K, 28K) at btrfs_dio_iomap_begin(); 7) The ordered extent enters btrfs_finish_one_ordered() and locks the range [24K, 28K); 8) Task A enters fs/iomap/direct-io.c:iomap_dio_complete() and it tries to invalidate the page at offset 24K by calling kiocb_invalidate_post_direct_write(), resulting in a call chain that ends up at btrfs_release_folio(). The btrfs_release_folio() call ends up returning false because the range for the page at file offset 24K is currently locked by the task doing the ordered extent completion in the previous step (7), so we have: btrfs_release_folio() -> __btrfs_release_folio() -> try_release_extent_mapping() -> try_release_extent_state() This last function checking that the range is locked and returning false and propagating it up to btrfs_release_folio(). So this results in a failure to invalidate the page and kiocb_invalidate_post_direct_write() triggers this message logged in dmesg: Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O! After this we leave the page cache with stale data for the file range [24K, 28K), filled with zeroes instead of the data written by direct IO write (all bytes with a 0x01 value), so any task attempting to read with buffered IO, including the task that did the direct IO write, will get all bytes in the range with a 0x00 value instead of the written data. Fix this by locking the range, with btrfs_lock_and_flush_ordered_range(), at the two callers of btrfs_do_readpage() instead of doing it at get_extent_map(), just like we did before commit ac325fc2aad5 ("btrfs: do not hold the extent lock for entire read"), and unlocking the range after all the calls to btrfs_do_readpage(). This way we never reuse a cached extent map without flushing any pending ordered extents from a concurrent direct IO write. Fixes: ac325fc2aad5 ("btrfs: do not hold the extent lock for entire read") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-02-07btrfs: fix two misuses of folio_shift()Matthew Wilcox (Oracle)
It is meaningless to shift a byte count by folio_shift(). The folio index is in units of PAGE_SIZE, not folio_size(). We can use folio_contains() to make this work for arbitrary-order folios, so remove the assertion that the folios are of order 0. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-26Merge tag 'mm-stable-2025-01-26-14-59' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "The various patchsets are summarized below. Plus of course many indivudual patches which are described in their changelogs. - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the page allocator so we end up with the ability to allocate and free zero-refcount pages. So that callers (ie, slab) can avoid a refcount inc & dec - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use large folios other than PMD-sized ones - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and fixes for this small built-in kernel selftest - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of the mapletree code - "mm: fix format issues and param types" from Keren Sun implements a few minor code cleanups - "simplify split calculation" from Wei Yang provides a few fixes and a test for the mapletree code - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes continues the work of moving vma-related code into the (relatively) new mm/vma.c - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David Hildenbrand cleans up and rationalizes handling of gfp flags in the page allocator - "readahead: Reintroduce fix for improper RA window sizing" from Jan Kara is a second attempt at fixing a readahead window sizing issue. It should reduce the amount of unnecessary reading - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng addresses an issue where "huge" amounts of pte pagetables are accumulated: https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/ Qi's series addresses this windup by synchronously freeing PTE memory within the context of madvise(MADV_DONTNEED) - "selftest/mm: Remove warnings found by adding compiler flags" from Muhammad Usama Anjum fixes some build warnings in the selftests code when optional compiler warnings are enabled - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David Hildenbrand tightens the allocator's observance of __GFP_HARDWALL - "pkeys kselftests improvements" from Kevin Brodsky implements various fixes and cleanups in the MM selftests code, mainly pertaining to the pkeys tests - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to estimate application working set size - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn provides some cleanups to memcg's hugetlb charging logic - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based kernel build was demonstrated - "zram: split page type read/write handling" from Sergey Senozhatsky has several fixes and cleaups for zram in the area of zram_write_page(). A watchdog softlockup warning was eliminated - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky cleans up the pagetable destructor implementations. A rare use-after-free race is fixed - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes simplifies and cleans up the debugging code in the VMA merging logic - "Account page tables at all levels" from Kevin Brodsky cleans up and regularizes the pagetable ctor/dtor handling. This results in improvements in accounting accuracy - "mm/damon: replace most damon_callback usages in sysfs with new core functions" from SeongJae Park cleans up and generalizes DAMON's sysfs file interface logic - "mm/damon: enable page level properties based monitoring" from SeongJae Park increases the amount of information which is presented in response to DAMOS actions - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes DAMON's long-deprecated debugfs interfaces. Thus the migration to sysfs is completed - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter Xu cleans up and generalizes the hugetlb reservation accounting - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino removes a never-used feature of the alloc_pages_bulk() interface - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park extends DAMOS filters to support not only exclusion (rejecting), but also inclusion (allowing) behavior - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi introduces a new memory descriptor for zswap.zpool that currently overlaps with struct page for now. This is part of the effort to reduce the size of struct page and to enable dynamic allocation of memory descriptors - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and simplifies the swap allocator locking. A speedup of 400% was demonstrated for one workload. As was a 35% reduction for kernel build time with swap-on-zram - "mm: update mips to use do_mmap(), make mmap_region() internal" from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that mmap_region() can be made MM-internal - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU regressions and otherwise improves MGLRU performance - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park updates DAMON documentation - "Cleanup for memfd_create()" from Isaac Manjarres does that thing - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand provides various cleanups in the areas of hugetlb folios, THP folios and migration - "Uncached buffered IO" from Jens Axboe implements the new RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache reading and writing. To permite userspace to address issues with massive buildup of useless pagecache when reading/writing fast devices - "selftests/mm: virtual_address_range: Reduce memory" from Thomas Weißschuh fixes and optimizes some of the MM selftests" * tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits) mm/compaction: fix UBSAN shift-out-of-bounds warning s390/mm: add missing ctor/dtor on page table upgrade kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags() tools: add VM_WARN_ON_VMG definition mm/damon/core: use str_high_low() helper in damos_wmark_wait_us() seqlock: add missing parameter documentation for raw_seqcount_try_begin() mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh mm/page_alloc: remove the incorrect and misleading comment zram: remove zcomp_stream_put() from write_incompressible_page() mm: separate move/undo parts from migrate_pages_batch() mm/kfence: use str_write_read() helper in get_access_type() selftests/mm/mkdirty: fix memory leak in test_uffdio_copy() kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags() selftests/mm: virtual_address_range: avoid reading from VM_IO mappings selftests/mm: vm_util: split up /proc/self/smaps parsing selftests/mm: virtual_address_range: unmap chunks after validation selftests/mm: virtual_address_range: mmap() without PROT_WRITE selftests/memfd/memfd_test: fix possible NULL pointer dereference mm: add FGP_DONTCACHE folio creation flag mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue ...
2025-01-25mm: alloc_pages_bulk: rename APILuiz Capitulino
The previous commit removed the page_list argument from alloc_pages_bulk_noprof() along with the alloc_pages_bulk_list() function. Now that only the *_array() flavour of the API remains, we can do the following renaming (along with the _noprof() ones): alloc_pages_bulk_array -> alloc_pages_bulk alloc_pages_bulk_array_mempolicy -> alloc_pages_bulk_mempolicy alloc_pages_bulk_array_node -> alloc_pages_bulk_node Link: https://lkml.kernel.org/r/275a3bbc0be20fbe9002297d60045e67ab3d4ada.1734991165.git.luizcap@redhat.com Signed-off-by: Luiz Capitulino <luizcap@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13btrfs: add extra error messages for delalloc range related errorsQu Wenruo
All the error handling bugs I hit so far are all -ENOSPC from either: - cow_file_range() - run_delalloc_nocow() - submit_uncompressed_range() Previously when those functions failed, there was no error message at all, making the debugging much harder. So here we introduce extra error messages for: - cow_file_range() - run_delalloc_nocow() - submit_uncompressed_range() - writepage_delalloc() when btrfs_run_delalloc_range() failed - extent_writepage() when extent_writepage_io() failed One example of the new debug error messages is the following one: run fstests generic/750 at 2024-12-08 12:41:41 BTRFS: device fsid 461b25f5-e240-4543-8deb-e7c2bd01a6d3 devid 1 transid 8 /dev/mapper/test-scratch1 (253:4) scanned by mount (2436600) BTRFS info (device dm-4): first mount of filesystem 461b25f5-e240-4543-8deb-e7c2bd01a6d3 BTRFS info (device dm-4): using crc32c (crc32c-arm64) checksum algorithm BTRFS info (device dm-4): forcing free space tree for sector size 4096 with page size 65536 BTRFS info (device dm-4): using free-space-tree BTRFS warning (device dm-4): read-write for sector size 4096 with page size 65536 is experimental BTRFS info (device dm-4): checking UUID tree BTRFS error (device dm-4): cow_file_range failed, root=363 inode=412 start=503808 len=98304: -28 BTRFS error (device dm-4): run_delalloc_nocow failed, root=363 inode=412 start=503808 len=98304: -28 BTRFS error (device dm-4): failed to run delalloc range, root=363 ino=412 folio=458752 submit_bitmap=11-15 start=503808 len=98304: -28 Which shows an error from cow_file_range() which is called inside a nocow write attempt, along with the extra bitmap from writepage_delalloc(). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: fix double accounting race when extent_writepage_io() failedQu Wenruo
[BUG] If submit_one_sector() failed inside extent_writepage_io() for sector size < page size cases (e.g. 4K sector size and 64K page size), then we can hit double ordered extent accounting error. This should be very rare, as submit_one_sector() only fails when we failed to grab the extent map, and such extent map should exist inside the memory and has been pinned. [CAUSE] For example we have the following folio layout: 0 4K 32K 48K 60K 64K |//| |//////| |///| Where |///| is the dirty range we need to writeback. The 3 different dirty ranges are submitted for regular COW. Now we hit the following sequence: - submit_one_sector() returned 0 for [0, 4K) - submit_one_sector() returned 0 for [32K, 48K) - submit_one_sector() returned error for [60K, 64K) - btrfs_mark_ordered_io_finished() called for the whole folio This will mark the following ranges as finished: * [0, 4K) * [32K, 48K) Both ranges have their IO already submitted, this cleanup will lead to double accounting. * [60K, 64K) That's the correct cleanup. The only good news is, this error is only theoretical, as the target extent map is always pinned, thus we should directly grab it from memory, other than reading it from the disk. [FIX] Instead of calling btrfs_mark_ordered_io_finished() for the whole folio range, which can touch ranges we should not touch, instead move the error handling inside extent_writepage_io(). So that we can cleanup exact sectors that ought to be submitted but failed. This provides much more accurate cleanup, avoiding the double accounting. CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: fix double accounting race when btrfs_run_delalloc_range() failedQu Wenruo
[BUG] When running btrfs with block size (4K) smaller than page size (64K, aarch64), there is a very high chance to crash the kernel at generic/750, with the following messages: (before the call traces, there are 3 extra debug messages added) BTRFS warning (device dm-3): read-write for sector size 4096 with page size 65536 is experimental BTRFS info (device dm-3): checking UUID tree hrtimer: interrupt took 5451385 ns BTRFS error (device dm-3): cow_file_range failed, root=4957 inode=257 start=1605632 len=69632: -28 BTRFS error (device dm-3): run_delalloc_nocow failed, root=4957 inode=257 start=1605632 len=69632: -28 BTRFS error (device dm-3): failed to run delalloc range, root=4957 ino=257 folio=1572864 submit_bitmap=8-15 start=1605632 len=69632: -28 ------------[ cut here ]------------ WARNING: CPU: 2 PID: 3020984 at ordered-data.c:360 can_finish_ordered_extent+0x370/0x3b8 [btrfs] CPU: 2 UID: 0 PID: 3020984 Comm: kworker/u24:1 Tainted: G OE 6.13.0-rc1-custom+ #89 Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022 Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs] pc : can_finish_ordered_extent+0x370/0x3b8 [btrfs] lr : can_finish_ordered_extent+0x1ec/0x3b8 [btrfs] Call trace: can_finish_ordered_extent+0x370/0x3b8 [btrfs] (P) can_finish_ordered_extent+0x1ec/0x3b8 [btrfs] (L) btrfs_mark_ordered_io_finished+0x130/0x2b8 [btrfs] extent_writepage+0x10c/0x3b8 [btrfs] extent_write_cache_pages+0x21c/0x4e8 [btrfs] btrfs_writepages+0x94/0x160 [btrfs] do_writepages+0x74/0x190 filemap_fdatawrite_wbc+0x74/0xa0 start_delalloc_inodes+0x17c/0x3b0 [btrfs] btrfs_start_delalloc_roots+0x17c/0x288 [btrfs] shrink_delalloc+0x11c/0x280 [btrfs] flush_space+0x288/0x328 [btrfs] btrfs_async_reclaim_data_space+0x180/0x228 [btrfs] process_one_work+0x228/0x680 worker_thread+0x1bc/0x360 kthread+0x100/0x118 ret_from_fork+0x10/0x20 ---[ end trace 0000000000000000 ]--- BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1605632 OE len=16384 to_dec=16384 left=0 BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1622016 OE len=12288 to_dec=12288 left=0 Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008 BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1634304 OE len=8192 to_dec=4096 left=0 CPU: 1 UID: 0 PID: 3286940 Comm: kworker/u24:3 Tainted: G W OE 6.13.0-rc1-custom+ #89 Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022 Workqueue: btrfs_work_helper [btrfs] (btrfs-endio-write) pstate: 404000c5 (nZcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : process_one_work+0x110/0x680 lr : worker_thread+0x1bc/0x360 Call trace: process_one_work+0x110/0x680 (P) worker_thread+0x1bc/0x360 (L) worker_thread+0x1bc/0x360 kthread+0x100/0x118 ret_from_fork+0x10/0x20 Code: f84086a1 f9000fe1 53041c21 b9003361 (f9400661) ---[ end trace 0000000000000000 ]--- Kernel panic - not syncing: Oops: Fatal exception SMP: stopping secondary CPUs SMP: failed to stop secondary CPUs 2-3 Dumping ftrace buffer: (ftrace buffer empty) Kernel Offset: 0x275bb9540000 from 0xffff800080000000 PHYS_OFFSET: 0xffff8fbba0000000 CPU features: 0x100,00000070,00801250,8201720b [CAUSE] The above warning is triggered immediately after the delalloc range failure, this happens in the following sequence: - Range [1568K, 1636K) is dirty 1536K 1568K 1600K 1636K 1664K | |/////////|////////| | Where 1536K, 1600K and 1664K are page boundaries (64K page size) - Enter extent_writepage() for page 1536K - Enter run_delalloc_nocow() with locked page 1536K and range [1568K, 1636K) This is due to the inode having preallocated extents. - Enter cow_file_range() with locked page 1536K and range [1568K, 1636K) - btrfs_reserve_extent() only reserved two extents The main loop of cow_file_range() only reserved two data extents, Now we have: 1536K 1568K 1600K 1636K 1664K | |<-->|<--->|/|///////| | 1584K 1596K Range [1568K, 1596K) has an ordered extent reserved. - btrfs_reserve_extent() failed inside cow_file_range() for file offset 1596K This is already a bug in our space reservation code, but for now let's focus on the error handling path. Now cow_file_range() returned -ENOSPC. - btrfs_run_delalloc_range() do error cleanup <<< ROOT CAUSE Call btrfs_cleanup_ordered_extents() with locked folio 1536K and range [1568K, 1636K) Function btrfs_cleanup_ordered_extents() normally needs to skip the ranges inside the folio, as it will normally be cleaned up by extent_writepage(). Such split error handling is already problematic in the first place. What's worse is the folio range skipping itself, which is not taking subpage cases into consideration at all, it will only skip the range if the page start >= the range start. In our case, the page start < the range start, since for subpage cases we can have delalloc ranges inside the folio but not covering the folio. So it doesn't skip the page range at all. This means all the ordered extents, both [1568K, 1584K) and [1584K, 1596K) will be marked as IOERR. And these two ordered extents have no more pending ios, they are marked finished, and *QUEUED* to be deleted from the io tree. - extent_writepage() do error cleanup Call btrfs_mark_ordered_io_finished() for the range [1536K, 1600K). Although ranges [1568K, 1584K) and [1584K, 1596K) are finished, the deletion from io tree is async, it may or may not happen at this time. If the ranges have not yet been removed, we will do double cleaning on those ranges, triggering the above ordered extent warnings. In theory there are other bugs, like the cleanup in extent_writepage() can cause double accounting on ranges that are submitted asynchronously (compression for example). But that's much harder to trigger because normally we do not mix regular and compression delalloc ranges. [FIX] The folio range split is already buggy and not subpage compatible, it was introduced a long time ago where subpage support was not even considered. So instead of splitting the ordered extents cleanup into the folio range and out of folio range, do all the cleanup inside writepage_delalloc(). - Pass @NULL as locked_folio for btrfs_cleanup_ordered_extents() in btrfs_run_delalloc_range() - Skip the btrfs_cleanup_ordered_extents() if writepage_delalloc() failed So all ordered extents are only cleaned up by btrfs_run_delalloc_range(). - Handle the ranges that already have ordered extents allocated If part of the folio already has ordered extent allocated, and btrfs_run_delalloc_range() failed, we also need to cleanup that range. Now we have a concentrated error handling for ordered extents during btrfs_run_delalloc_range(). Fixes: d1051d6ebf8e ("btrfs: Fix error handling in btrfs_cleanup_ordered_extents") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: remove redundant variables from __process_folios_contig() and ↵David Sterba
lock_delalloc_folios() Same pattern in both functions, we really only use index, start_index is redundant. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: split waiting from read_extent_buffer_pages(), drop parameter waitDavid Sterba
There are only 2 WAIT_* values left for wait parameter, we can encode this to the function name if the waiting functionality is split. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: unwrap folio locking helpersDavid Sterba
Another conversion to folio API, use the folio locking directly instead of back and forth page <-> folio conversions. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: change return type to bool type of check_eb_alignment()David Sterba
The check function pattern is supposed to return true/false, currently there's only one error code. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: switch grab_extent_buffer() to foliosDavid Sterba
Use the folio API, remove page_folio/folio_page conversions. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: rename btrfs_release_extent_buffer_pages() to mention foliosDavid Sterba
Continue page to folio updates, sync what the function does with it's name. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: open code __free_extent_buffer()David Sterba
Using the kmem cache freeing directly is clear enough, we don't need to wrap it. All the users are in the same file. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: drop one time used local variable in end_bbio_meta_write()David Sterba
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: make wait_on_extent_buffer_writeback() static inlineDavid Sterba
The simple helper can be inlined, no need for the separate function. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: use btrfs_inode in extent_writepage()David Sterba
As extent_writepage() is internal helper we should use our inode type, so change it from struct inode. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: rename __get_extent_map() and pass btrfs_inodeDavid Sterba
The double underscore naming scheme does not apply here, there's only only get_extent_map(). As the definition is changed also pass the struct btrfs_inode. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: open code set_page_extent_mapped()David Sterba
The function set_page_extent_mapped() is now a simple wrapper so use the folio helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: rename __unlock_for_delalloc() and drop underscoresDavid Sterba
Drop the leading underscores in '__unlock_for_delalloc()' and rename it to 'unlock_delalloc_folio()'. This also ensures naming parity with 'lock_delalloc_folios()'. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: pass write-hint for buffered IOJing Xia
Commit 449813515d3e ("block, fs: Restore the per-bio/request data lifetime fields") restored write-hint support in btrfs. But that is applicable only for direct IO. This patch supports passing write-hint for buffered IO from btrfs file system to block layer by filling bi_write_hint of struct bio in alloc_new_bio(). There's an ongoing discussion which devices can use that, https://lore.kernel.org/all/20240910150200.6589-6-joshi.k@samsung.com, in SCSI there's support using sd_group_number() and sd_setup_rw32_cmnd(). The hint goes from the application directly to the block device so it's up to the application to set up everything properly to utilize the different hint classes. Link: https://lore.kernel.org/all/20240910150200.6589-6-joshi.k@samsung.com Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jing Xia <j.xia@samsung.com> [ Add more context and use case. ] Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: use PTR_ERR() instead of PTR_ERR_OR_ZERO() for btrfs_get_extent()Qu Wenruo
The function btrfs_get_extent() will only return an PTR_ERR() or a valid extent map pointer. It will not return NULL. Thus the usage of PTR_ERR_OR_ZERO() inside submit_one_sector() is not needed, use plain PTR_ERR() instead, and that is the only usage of PTR_ERR_OR_ZERO() after btrfs_get_extent(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-18Merge tag 'for-6.13-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "Changes outside of btrfs: add io_uring command flag to track a dying task (the rest will go via the block git tree). User visible changes: - wire encoded read (ioctl) to io_uring commands, this can be used on itself, in the future this will allow 'send' to be asynchronous. As a consequence, the encoded read ioctl can also work in non-blocking mode - new ioctl to wait for cleaned subvolumes, no need to use the generic and root-only SEARCH_TREE ioctl, will be used by "btrfs subvol sync" - recognize different paths/symlinks for the same devices and don't report them during rescanning, this can be observed with LVM or DM - seeding device use case change, the sprout device (the one capturing new writes) will not clear the read-only status of the super block; this prevents accumulating space from deleted snapshots Performance improvements: - reduce lock contention when traversing extent buffers - reduce extent tree lock contention when searching for inline backref - switch from rb-trees to xarray for delayed ref tracking, improvements due to better cache locality, branching factors and more compact data structures - enable extent map shrinker again (prevent memory exhaustion under some types of IO load), reworked to run in a single worker thread (there used to be problems causing long stalls under memory pressure) Core changes: - raid-stripe-tree feature updates: - make device replace and scrub work - implement partial deletion of stripe extents - new selftests - split the config option BTRFS_DEBUG and add EXPERIMENTAL for features that are experimental or with known problems so we don't misuse debugging config for that - subpage mode updates (sector < page): - update compression implementations - update writepage, writeback - continued folio API conversions: - buffered writes - make buffered write copy one page at a time, preparatory work for future integration with large folios, may cause performance drop - proper locking of root item regarding starting send - error handling improvements - code cleanups and refactoring: - dead code removal - unused parameter reduction - lockdep assertions" * tag 'for-6.13-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (119 commits) btrfs: send: check for read-only send root under critical section btrfs: send: check for dead send root under critical section btrfs: remove check for NULL fs_info at btrfs_folio_end_lock_bitmap() btrfs: fix warning on PTR_ERR() against NULL device at btrfs_control_ioctl() btrfs: fix a typo in btrfs_use_zone_append btrfs: avoid superfluous calls to free_extent_map() in btrfs_encoded_read() btrfs: simplify logic to decrement snapshot counter at btrfs_mksnapshot() btrfs: remove hole from struct btrfs_delayed_node btrfs: update stale comment for struct btrfs_delayed_ref_node::add_list btrfs: add new ioctl to wait for cleaned subvolumes btrfs: simplify range tracking in cow_file_range() btrfs: remove conditional path allocation in btrfs_read_locked_inode() btrfs: push cleanup into btrfs_read_locked_inode() io_uring/cmd: let cmds to know about dying task btrfs: add struct io_btrfs_cmd as type for io_uring_cmd_to_pdu() btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl) btrfs: move priv off stack in btrfs_encoded_read_regular_fill_pages() btrfs: don't sleep in btrfs_encoded_read() if IOCB_NOWAIT is set btrfs: change btrfs_encoded_read() so that reading of extent is done by caller btrfs: remove pointless iocb::ki_pos addition in btrfs_encoded_read() ...
2024-11-18Merge tag 'vfs-6.13.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Fixup and improve NLM and kNFSD file lock callbacks Last year both GFS2 and OCFS2 had some work done to make their locking more robust when exported over NFS. Unfortunately, part of that work caused both NLM (for NFS v3 exports) and kNFSD (for NFSv4.1+ exports) to no longer send lock notifications to clients This in itself is not a huge problem because most NFS clients will still poll the server in order to acquire a conflicted lock It's important for NLM and kNFSD that they do not block their kernel threads inside filesystem's file_lock implementations because that can produce deadlocks. We used to make sure of this by only trusting that posix_lock_file() can correctly handle blocking lock calls asynchronously, so the lock managers would only setup their file_lock requests for async callbacks if the filesystem did not define its own lock() file operation However, when GFS2 and OCFS2 grew the capability to correctly handle blocking lock requests asynchronously, they started signalling this behavior with EXPORT_OP_ASYNC_LOCK, and the check for also trusting posix_lock_file() was inadvertently dropped, so now most filesystems no longer produce lock notifications when exported over NFS Fix this by using an fop_flag which greatly simplifies the problem and grooms the way for future uses by both filesystems and lock managers alike - Add a sysctl to delete the dentry when a file is removed instead of making it a negative dentry Commit 681ce8623567 ("vfs: Delete the associated dentry when deleting a file") introduced an unconditional deletion of the associated dentry when a file is removed. However, this led to performance regressions in specific benchmarks, such as ilebench.sum_operations/s, prompting a revert in commit 4a4be1ad3a6e ("Revert "vfs: Delete the associated dentry when deleting a file""). This reintroduces the concept conditionally through a sysctl - Expand the statmount() system call: * Report the filesystem subtype in a new fs_subtype field to e.g., report fuse filesystem subtypes * Report the superblock source in a new sb_source field * Add a new way to return filesystem specific mount options in an option array that returns filesystem specific mount options separated by zero bytes and unescaped. This allows caller's to retrieve filesystem specific mount options and immediately pass them to e.g., fsconfig() without having to unescape or split them * Report security (LSM) specific mount options in a separate security option array. We don't lump them together with filesystem specific mount options as security mount options are generic and most users aren't interested in them The format is the same as for the filesystem specific mount option array - Support relative paths in fsconfig()'s FSCONFIG_SET_STRING command - Optimize acl_permission_check() to avoid costly {g,u}id ownership checks if possible - Use smp_mb__after_spinlock() to avoid full smp_mb() in evict() - Add synchronous wakeup support for ep_poll_callback. Currently, epoll only uses wake_up() to wake up task. But sometimes there are epoll users which want to use the synchronous wakeup flag to give a hint to the scheduler, e.g., the Android binder driver. So add a wake_up_sync() define, and use wake_up_sync() when sync is true in ep_poll_callback() Fixes: - Fix kernel documentation for inode_insert5() and iget5_locked() - Annotate racy epoll check on file->f_ep - Make F_DUPFD_QUERY associative - Avoid filename buffer overrun in initramfs - Don't let statmount() return empty strings - Add a cond_resched() to dump_user_range() to avoid hogging the CPU - Don't query the device logical blocksize multiple times for hfsplus - Make filemap_read() check that the offset is positive or zero Cleanups: - Various typo fixes - Cleanup wbc_attach_fdatawrite_inode() - Add __releases annotation to wbc_attach_and_unlock_inode() - Add hugetlbfs tracepoints - Fix various vfs kernel doc parameters - Remove obsolete TODO comment from io_cancel() - Convert wbc_account_cgroup_owner() to take a folio - Fix comments for BANDWITH_INTERVAL and wb_domain_writeout_add() - Reorder struct posix_acl to save 8 bytes - Annotate struct posix_acl with __counted_by() - Replace one-element array with flexible array member in freevxfs - Use idiomatic atomic64_inc_return() in alloc_mnt_ns()" * tag 'vfs-6.13.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits) statmount: retrieve security mount options vfs: make evict() use smp_mb__after_spinlock instead of smp_mb statmount: add flag to retrieve unescaped options fs: add the ability for statmount() to report the sb_source writeback: wbc_attach_fdatawrite_inode out of line writeback: add a __releases annoation to wbc_attach_and_unlock_inode fs: add the ability for statmount() to report the fs_subtype fs: don't let statmount return empty strings fs:aio: Remove TODO comment suggesting hash or array usage in io_cancel() hfsplus: don't query the device logical block size multiple times freevxfs: Replace one-element array with flexible array member fs: optimize acl_permission_check() initramfs: avoid filename buffer overrun fs/writeback: convert wbc_account_cgroup_owner to take a folio acl: Annotate struct posix_acl with __counted_by() acl: Realign struct posix_acl to save 8 bytes epoll: Add synchronous wakeup support for ep_poll_callback coredump: add cond_resched() to dump_user_range mm/page-writeback.c: Fix comment of wb_domain_writeout_add() mm/page-writeback.c: Update comment for BANDWIDTH_INTERVAL ...
2024-11-11btrfs: use filemap_get_folio() helperAnand Jain
When fgp_flags and gfp_flags are zero, use filemap_get_folio(A, B) instead of __filemap_get_folio(A, B, 0, 0)—no need for the extra arguments 0, 0. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: rename btrfs_folio_(set|start|end)_writer_lock()Qu Wenruo
Since there is no user of reader locks, rename the writer locks into a more generic name, by removing the "_writer" part from the name. And also rename btrfs_subpage::writer into btrfs_subpage::locked. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: unify to use writer locks for subpage lockingQu Wenruo
Since commit d7172f52e993 ("btrfs: use per-buffer locking for extent_buffer reading"), metadata read no longer relies on the subpage reader locking. This means we do not need to maintain a different metadata/data split for locking, so we can convert the existing reader lock users by: - add_ra_bio_pages() Convert to btrfs_folio_set_writer_lock() - end_folio_read() Convert to btrfs_folio_end_writer_lock() - begin_folio_read() Convert to btrfs_folio_set_writer_lock() - folio_range_has_eb() Remove the subpage->readers checks, since it is always 0. - Remove btrfs_subpage_start_reader() and btrfs_subpage_end_reader() Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: remove redundant initializations for struct btrfs_tree_parent_checkFilipe Manana
It's pointless to initialize the has_first_key field of the stack local btrfs_tree_parent_check structure at btrfs_tree_parent_check() and at btrfs_qgroup_trace_subtree() since all fields not explicitly initialized are zeroed out. In the case of the first function it's a bit odd because we are assigning 0 and the field is of type bool, however not incorrect since a 0 is converted to false. Just remove the explicit initializations due to their redundancy. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: drop unused parameter fs_info from folio_range_has_eb()David Sterba
The parameter was added in 8ff8466d29efc2 ("btrfs: support subpage for extent buffer page release") for page but hasn't been used since, so we can drop it. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: drop unused parameter mask from try_release_extent_state()David Sterba
The mask parameter used for allocations got unified to GFP_NOFS and removed from relevant functions in 1d1268004430 ("btrfs: drop gfp from parameter extent state helpers"). Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: remove btrfs_set_range_writeback()Qu Wenruo
The function btrfs_set_range_writeback() was originally a callback for metadata and data, to mark a range with writeback flag. Then it was converted into a common function call for both metadata and data. From the very beginning, the function had been only called on a full page, later converted to handle range inside a page. But it never needed to handle multiple pages, and since commit 8189197425e7 ("btrfs: refactor __extent_writepage_io() to do sector-by-sector submission") the function was only called on a sector-by-sector basis. This makes the function unnecessary, and can be converted to a simple btrfs_folio_set_writeback() call instead. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: correct typos in multiple comments across various filesShen Lichuan
Fix some confusing spelling errors that were currently identified, the details are as follows: block-group.c: 2800: uncompressible ==> incompressible extent-tree.c: 3131: EXTEMT ==> EXTENT extent_io.c: 3124: utlizing ==> utilizing extent_map.c: 1323: ealier ==> earlier extent_map.c: 1325: possiblity ==> possibility fiemap.c: 189: emmitted ==> emitted fiemap.c: 197: emmitted ==> emitted fiemap.c: 203: emmitted ==> emitted transaction.h: 36: trasaction ==> transaction volumes.c: 5312: filesysmte ==> filesystem zoned.c: 1977: trasnsaction ==> transaction Signed-off-by: Shen Lichuan <shenlichuan@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: mark all dirty sectors as locked inside writepage_delalloc()Qu Wenruo
Currently we only mark sectors as locked if there is a *NEW* delalloc range for it. But NEW delalloc range is not the same as dirty sectors we want to submit, e.g: 0 32K 64K 96K 128K | |////////||///////| |////| 120K For above 64K page size case, writepage_delalloc() for page 0 will find and lock the delalloc range [32K, 96K), which is beyond the page boundary. Then when writepage_delalloc() is called for the page 64K, since [64K, 96K) is already locked, only [120K, 128K) will be locked. This means, although range [64K, 96K) is dirty and will be submitted later by extent_writepage_io(), it will not be marked as locked. This is fine for now, as we call btrfs_folio_end_writer_lock_bitmap() to free every non-compressed sector, and compression is only allowed for full page range. But this is not safe for future sector perfect compression support, as this can lead to double folio unlock: Thread A | Thread B ---------------------------------------+-------------------------------- | submit_one_async_extent() | |- extent_clear_unlock_delalloc() extent_writepage() | |- btrfs_folio_end_writer_lock() |- btrfs_folio_end_writer_lock_bitmap()| |- btrfs_subpage_end_and_test_writer() | | | |- atomic_sub_and_test() | | | /* Now the atomic value is 0 */ |- if (atomic_read() == 0) | | |- folio_unlock() | |- folio_unlock() The root cause is the above range [64K, 96K) is dirtied and should also be locked but it isn't. So to make everything more consistent and prepare for the incoming sector perfect compression, mark all dirty sectors as locked. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: move the delalloc range bitmap search into extent_io.cQu Wenruo
Currently for subpage (sector size < page size) cases, we reuse subpage locked bitmap to find out all delalloc ranges we have locked, and run all those found ranges. However such reuse is not perfect, e.g.: 0 32K 64K 96K 128K | |////////||///////| |////| 120K For above range, writepage_delalloc() for page 0 will handle the range [32K, 96k), note delalloc range can be beyond the page boundary. But writepage_delalloc() for page 64K will only handle range [120K, 128K), as the previous run on page 0 has already handled range [64K, 96K). Meanwhile for the writeback we should expect range [64K, 96K) to also be locked, this leads to the mismatch from locked bitmap and delalloc range. This is not causing problems yet, but it's still an inconsistent behavior. So instead of relying on the subpage locked bitmap, move the delalloc range search using local @delalloc_bitmap, so that we can remove the existing btrfs_folio_find_writer_locked(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: do not assume the full page range is not dirty in extent_writepage_io()Qu Wenruo
The function extent_writepage_io() will submit the dirty sectors inside the page for the write. But recently to co-operate with the incoming subpage compression enhancement, a new bitmap is introduced to btrfs_bio_ctrl::submit_bitmap, to only avoid a subset of the dirty range. This is because we can have the following cases with 64K page size: 0 16K 32K 48K 64K | |/////////| |/| 52K For range [16K, 32K), we queue the dirty range for compression, which is ran in a delayed workqueue. Then for range [48K, 52K), we go through the regular submission path. In that case, our btrfs_bio_ctrl::submit_bitmap will exclude the range [16K, 32K). The dirty flags for the range [16K, 32K) is only cleared when the compression is done, by the extent_clear_unlock_delalloc() call inside submit_one_async_extent(). This patch fix the false alert by removing the btrfs_folio_assert_not_dirty() check, since it's no longer correct for subpage compression cases. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: wait for writeback if sector size is smaller than page sizeQu Wenruo
[PROBLEM] If sector perfect compression is enabled for sector size < page size case, the following case can lead dirty ranges not being written back: 0 32K 64K 96K 128K | |///////||//////| |/| 124K In above example, the page size is 64K, and we need to write back above two pages. - Submit for page 0 (main thread) We found delalloc range [32K, 96K), which can be compressed. So we queue an async range for [32K, 96K). This means, the page unlock/clearing dirty/setting writeback will all happen in a workqueue context. - The compression is done, and compressed range is submitted (workqueue) Since the compression is done in asynchronously, the compression can be done before the main thread to submit for page 64K. Now the whole range [32K, 96K), involving two pages, will be marked writeback. - Submit for page 64K (main thread) extent_write_cache_pages() got its wbc->sync_mode is WB_SYNC_NONE, so it skips the writeback wait. And unlock the page and exit. This means the dirty range [124K, 128K) will never be submitted, until next writeback happens for page 64K. This will never happen for previous kernels because: - For sector size == page size case Since one page is one sector, if a page is marked writeback it will not have dirty flags. So this corner case will never hit. - For sector size < page size case We never do subpage compression, a range can only be submitted for compression if the range is fully page aligned. This change makes the subpage behavior mostly the same as non-subpage cases. [ENHANCEMENT] Instead of relying WB_SYNC_NONE check only, if it's a subpage case, then always wait for writeback flags. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-28fs/writeback: convert wbc_account_cgroup_owner to take a folioPankaj Raghav
Most of the callers of wbc_account_cgroup_owner() are converting a folio to page before calling the function. wbc_account_cgroup_owner() is converting the page back to a folio to call mem_cgroup_css_from_folio(). Convert wbc_account_cgroup_owner() to take a folio instead of a page, and convert all callers to pass a folio directly except f2fs. Convert the page to folio for all the callers from f2fs as they were the only callers calling wbc_account_cgroup_owner() with a page. As f2fs is already in the process of converting to folios, these call sites might also soon be calling wbc_account_cgroup_owner() with a folio directly in the future. No functional changes. Only compile tested. Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20240926140121.203821-1-kernel@pankajraghav.com Acked-by: David Sterba <dsterba@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-22btrfs: fix the delalloc range locking if sector size < page sizeQu Wenruo
Inside lock_delalloc_folios(), there are several problems related to sector size < page size handling: - Set the writer locks without checking if the folio is still valid We call btrfs_folio_start_writer_lock() just like it's folio_lock(). But since the folio may not even be the folio of the current mapping, we can easily screw up the folio->private. - The range is not clamped inside the page This means we can over write other bitmaps if the start/len is not properly handled, and trigger the btrfs_subpage_assert(). - @processed_end is always rounded up to page end If the delalloc range is not page aligned, and we need to retry (returning -EAGAIN), then we will unlock to the page end. Thankfully this is not a huge problem, as now btrfs_folio_end_writer_lock() can handle range larger than the locked range, and only unlock what is already locked. Fix all these problems by: - Lock and check the folio first, then call btrfs_folio_set_writer_lock() So that if we got a folio not belonging to the inode, we won't touch folio->private. - Properly truncate the range inside the page - Update @processed_end to the locked range end Fixes: 1e1de38792e0 ("btrfs: make process_one_page() to handle subpage locking") CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-09btrfs: fix clear_dirty and writeback ordering in submit_one_sector()Naohiro Aota
This commit is a replay of commit 6252690f7e1b ("btrfs: fix invalid mapping of extent xarray state"). We need to call btrfs_folio_clear_dirty() before btrfs_set_range_writeback(), so that xarray DIRTY tag is cleared. With a refactoring commit 8189197425e7 ("btrfs: refactor __extent_writepage_io() to do sector-by-sector submission"), it screwed up and the order is reversed and causing the same hang. Fix the ordering now in submit_one_sector(). Fixes: 8189197425e7 ("btrfs: refactor __extent_writepage_io() to do sector-by-sector submission") Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: only unlock the to-be-submitted ranges inside a folioQu Wenruo
[SUBPAGE COMPRESSION LIMITS] Currently inside writepage_delalloc(), if a delalloc range is going to be submitted asynchronously (inline or compression, the page dirty/writeback/unlock are all handled in at different time, not at the submission time), then we return 1 and extent_writepage() will skip the submission. This is fine if every sector matches page size, but if a sector is smaller than page size (aka, subpage case), then it can be very problematic, for example for the following 64K page: 0 16K 32K 48K 64K |/| |///////| |/| | | 4K 52K Where |/| is the dirty range we need to submit. In the above case, we need the following different handling for the 3 ranges: - [0, 4K) needs to be submitted for regular write A single sector cannot be compressed. - [16K, 32K) needs to be submitted for compressed write - [48K, 52K) needs to be submitted for regular write. Above, if we try to submit [16K, 32K) for compressed write, we will return 1 and immediately, and without submitting the remaining [48K, 52K) range. Furthermore, since extent_writepage() will exit without unlocking any sectors, the submitted range [0, 4K) will not have sector unlocked. That's the reason why for now subpage is only allowed for full page range. [ENHANCEMENT] - Introduce a submission bitmap at btrfs_bio_ctrl::submit_bitmap This records which sectors will be submitted by extent_writepage_io(). This allows us to track which sectors needs to be submitted thus later to be properly unlocked. For asynchronously submitted range (inline/compression), the corresponding bits will be cleared from that bitmap. - Only return 1 if no sector needs to be submitted in writepage_delalloc() - Only submit sectors marked by submission bitmap inside extent_writepage_io() So we won't touch the asynchronously submitted part. - Introduce btrfs_folio_end_writer_lock_bitmap() helper This will only unlock the involved sectors specified by @bitmap parameter, to avoid touching the range asynchronously submitted. Please note that, since subpage compression is still limited to page aligned range, this change is only a preparation for future sector perfect compression support for subpage. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: merge btrfs_folio_unlock_writer() into btrfs_folio_end_writer_lock()Qu Wenruo
The function btrfs_folio_unlock_writer() is already calling btrfs_folio_end_writer_lock() to do the heavy lifting work, the only missing 0 writer check. Thus there is no need to keep two different functions, move the 0 writer check into btrfs_folio_end_writer_lock(), and remove btrfs_folio_unlock_writer(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: remove btrfs_folio_end_all_writers()Qu Wenruo
The function btrfs_folio_end_all_writers() is only utilized in extent_writepage() as a way to unlock all subpage range (for both successful submission and error handling). Meanwhile we have a similar function, btrfs_folio_end_writer_lock(). The difference is, btrfs_folio_end_writer_lock() expects a range that is a subset of the already locked range. This limit on btrfs_folio_end_writer_lock() is a little overkilled, preventing it from being utilized for error paths. So here we enhance btrfs_folio_end_writer_lock() to accept a superset of the locked range, and only end the locked subset. This means we can replace btrfs_folio_end_all_writers() with btrfs_folio_end_writer_lock() instead. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: convert try_release_extent_mapping() to take a folioLi Zetao
The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. And page_to_inode() can be replaced with folio_to_inode() now. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: convert try_release_extent_state() to take a folioLi Zetao
The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. Moreover, use folio_pos() instead of page_offset(), which is more consistent with folio usage. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: convert submit_eb_page() to take a folioLi Zetao
The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>