path: root/fs/xfs/xfs_buf.c
Age    Commit message    Author
2025-07-08xfs: remove the bt_bdev_file buftarg fieldChristoph Hellwig
And use bt_file for both bdev and shmem backed buftargs. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08xfs: rename the bt_bdev_* buftarg fieldsChristoph Hellwig
The extra bdev_ is weird, so drop it. Also improve the comment to make it clear these are the hardware limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08xfs: remove the call to sync_blockdev in xfs_configure_buftargChristoph Hellwig
This extra call is not needed as xfs_alloc_buftarg already calls sync_blockdev. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-06-27xfs: avoid dquot buffer pin deadlockDave Chinner
On shutdown when quotas are enabled, the shutdown can deadlock trying to unpin the dquot buffer buf_log_item like so:

[ 3319.483590] task:kworker/20:0H state:D stack:14360 pid:1962230 tgid:1962230 ppid:2 task_flags:0x4208060 flags:0x00004000
[ 3319.493966] Workqueue: xfs-log/dm-6 xlog_ioend_work
[ 3319.498458] Call Trace:
[ 3319.500800] <TASK>
[ 3319.502809] __schedule+0x699/0xb70
[ 3319.512672] schedule+0x64/0xd0
[ 3319.515573] schedule_timeout+0x30/0xf0
[ 3319.528125] __down_common+0xc3/0x200
[ 3319.531488] __down+0x1d/0x30
[ 3319.534186] down+0x48/0x50
[ 3319.540501] xfs_buf_lock+0x3d/0xe0
[ 3319.543609] xfs_buf_item_unpin+0x85/0x1b0
[ 3319.547248] xlog_cil_committed+0x289/0x570
[ 3319.571411] xlog_cil_process_committed+0x6d/0x90
[ 3319.575590] xlog_state_shutdown_callbacks+0x52/0x110
[ 3319.580017] xlog_force_shutdown+0x169/0x1a0
[ 3319.583780] xlog_ioend_work+0x7c/0xb0
[ 3319.587049] process_scheduled_works+0x1d6/0x400
[ 3319.591127] worker_thread+0x202/0x2e0
[ 3319.594452] kthread+0x20c/0x240

The CIL push has seen the deadlock, so it has aborted the push and is running CIL checkpoint completion to abort all the items in the checkpoint. This calls ->iop_unpin(remove = true) to clean up the log items in the checkpoint. When a buffer log item is unpinned like this, it needs to lock the buffer to run io completion to correctly fail the buffer and run all the required completions to fail attached log items as well. In this case, the attempt to lock the buffer on unpin is hanging because the buffer is already locked.

I suspected a leaked XFS_BLI_HOLD state because of XFS_BLI_STALE handling changes I was testing, so I went looking for pin events on HOLD buffers and unpin events on locked buffers. That isolated this one buffer with these two events:

xfs_buf_item_pin: dev 251:6 daddr 0xa910 bbcount 0x2 hold 2 pincount 0 lock 0 flags DONE|KMEM recur 0 refcount 1 bliflags HOLD|DIRTY|LOGGED liflags DIRTY
....
xfs_buf_item_unpin: dev 251:6 daddr 0xa910 bbcount 0x2 hold 4 pincount 1 lock 0 flags DONE|KMEM recur 0 refcount 1 bliflags DIRTY liflags ABORTED

Firstly, bbcount = 0x2, which means it is not a single sector structure. That rules out every xfs_trans_bhold() case except one: dquot buffers.
Then hung task dumping gave this trace:

[ 3197.312078] task:fsync-tester state:D stack:12080 pid:2051125 tgid:2051125 ppid:1643233 task_flags:0x400000 flags:0x00004002
[ 3197.323007] Call Trace:
[ 3197.325581] <TASK>
[ 3197.327727] __schedule+0x699/0xb70
[ 3197.334582] schedule+0x64/0xd0
[ 3197.337672] schedule_timeout+0x30/0xf0
[ 3197.350139] wait_for_completion+0xbd/0x180
[ 3197.354235] __flush_workqueue+0xef/0x4e0
[ 3197.362229] xlog_cil_force_seq+0xa0/0x300
[ 3197.374447] xfs_log_force+0x77/0x230
[ 3197.378015] xfs_qm_dqunpin_wait+0x49/0xf0
[ 3197.382010] xfs_qm_dqflush+0x55/0x460
[ 3197.385663] xfs_qm_dquot_isolate+0x29e/0x4d0
[ 3197.389977] __list_lru_walk_one+0x141/0x220
[ 3197.398867] list_lru_walk_one+0x10/0x20
[ 3197.402713] xfs_qm_shrink_scan+0x6a/0x100
[ 3197.406699] do_shrink_slab+0x18a/0x350
[ 3197.410512] shrink_slab+0xf7/0x430
[ 3197.413967] drop_slab+0x97/0xf0
[ 3197.417121] drop_caches_sysctl_handler+0x59/0xc0
[ 3197.421654] proc_sys_call_handler+0x18b/0x280
[ 3197.426050] proc_sys_write+0x13/0x20
[ 3197.429750] vfs_write+0x2b8/0x3e0
[ 3197.438532] ksys_write+0x7e/0xf0
[ 3197.441742] __x64_sys_write+0x1b/0x30
[ 3197.445363] x64_sys_call+0x2c72/0x2f60
[ 3197.449044] do_syscall_64+0x6c/0x140
[ 3197.456341] entry_SYSCALL_64_after_hwframe+0x76/0x7e

Yup, another test run by check-parallel is running drop_caches concurrently and the dquot shrinker for the hung filesystem is running. That's trying to flush a dirty dquot from reclaim context, and it is waiting on a log force to complete. xfs_qm_dqflush is called with the dquot buffer held locked, and so we've called xfs_log_force() with that buffer locked. Now the log force is waiting for a workqueue flush to complete, and that workqueue flush is waiting on CIL checkpoint processing to finish.

The CIL checkpoint processing is aborting all the log items it has, and that requires locking aborted buffers to cancel them. Now, normally this isn't a problem if we are issuing a log force to unpin an object, because the ->iop_unpin() method wakes pin waiters first. That results in the pin waiter finishing off whatever it was doing, dropping the lock and then xfs_buf_item_unpin() can lock the buffer and fail it. However, xfs_qm_dqflush() is waiting on the -dquot- unpin event, not the dquot buffer unpin event, and so it never gets woken and so does not drop the buffer lock.

Inodes do not have this problem, as they can only be written from one spot (->iop_push) whilst dquots can be written from multiple places (memory reclaim, ->iop_push, xfs_qm_dqpurge, and quotacheck).

The reason that the dquot buffer has an attached buffer log item is that it has been recently allocated. Initialisation of the dquot buffer logs the buffer directly, thereby pinning it in memory. We then modify the dquot in a separate operation, and have memory reclaim racing with a shutdown and we trigger this deadlock. check-parallel reproduces this reliably on 1kB FSB filesystems with quota enabled because it does all of these things concurrently without having to explicitly write tests to exercise these corner case conditions.

xfs_qm_dquot_logitem_push() doesn't have this deadlock because it checks if the dquot is pinned before locking the dquot buffer and skips it if it is pinned. This means the xfs_qm_dqunpin_wait() log force in xfs_qm_dqflush() never triggers and we unlock the buffer safely, allowing a concurrent shutdown to fail the buffer appropriately.
xfs_qm_dqpurge() could have this problem as it is called from quotacheck and we might have allocated dquot buffers when recording the quota updates. This can be fixed by calling xfs_qm_dqunpin_wait() before we lock the dquot buffer. Because we hold the dquot locked, nothing will be able to add to the pin count between the unpin_wait and the dqflush callout, so this now makes xfs_qm_dqpurge() safe against this race.

xfs_qm_dquot_isolate() can also be fixed this same way but, quite frankly, we shouldn't be doing IO in memory reclaim context. If the dquot is pinned or dirty, simply rotate it and let memory reclaim come back to it later, same as we do for inodes. This then gets rid of the nasty issue in xfs_qm_flush_one() where quotacheck writeback races with memory reclaim flushing the dquots. We can lift xfs_qm_dqunpin_wait() up into this code and get rid of the "can't get the dqflush lock" buffer write that was used to cycle the dqflush lock and enable the dquot to be flushed again, instead checking if the dquot is pinned and returning -EAGAIN so that the dquot walk will revisit the dquot again later.

Finally, with xfs_qm_dqunpin_wait() lifted into all the callers, we can remove it from the xfs_qm_dqflush() code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
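A rough sketch of the reclaim-side change described above, assuming the usual list_lru isolate callback shape and the dquot helper names used in the text (illustrative only, not the actual patch):

    /* In the dquot shrinker isolate callback: never issue IO from
     * memory reclaim. If the dquot is pinned or dirty, rotate it on
     * the LRU and let reclaim revisit it later, as is done for inodes.
     */
    if (!xfs_dqlock_nowait(dqp))
        return LRU_SKIP;
    if (atomic_read(&dqp->q_pincount) > 0 || XFS_DQ_IS_DIRTY(dqp)) {
        xfs_dqunlock(dqp);
        return LRU_ROTATE;
    }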
2025-05-26Merge tag 'xfs-merge-6.16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs updates from Carlos Maiolino: - Atomic writes for XFS - Remove experimental warnings for pNFS, scrub and parent pointers * tag 'xfs-merge-6.16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (26 commits) xfs: add inode to zone caching for data placement xfs: free the item in xfs_mru_cache_insert on failure xfs: remove the EXPERIMENTAL warning for pNFS xfs: remove some EXPERIMENTAL warnings xfs: Remove deprecated xfs_bufd sysctl parameters xfs: stop using set_blocksize xfs: allow sysadmins to specify a maximum atomic write limit at mount time xfs: update atomic write limits xfs: add xfs_calc_atomic_write_unit_max() xfs: add xfs_file_dio_write_atomic() xfs: commit CoW-based atomic writes atomically xfs: add large atomic writes checks in xfs_direct_write_iomap_begin() xfs: add xfs_atomic_write_cow_iomap_begin() xfs: refine atomic write size check in xfs_file_write_iter() xfs: refactor xfs_reflink_end_cow_extent() xfs: allow block allocator to take an alignment hint xfs: ignore HW which cannot atomic write a single block xfs: add helpers to compute transaction reservation for finishing intent items xfs: add helpers to compute log item overhead xfs: separate out setting buftarg atomic writes limits ...
2025-05-14Merge branch 'atomic_writes-6.16' into xfs-6.16-mergeCarlos Maiolino
Required update due to conflict with patch: xfs: stop using set_blocksize Conflicts: fs/xfs/xfs_buf.c Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14xfs: stop using set_blocksizeDarrick J. Wong
XFS has its own buffer cache for metadata that uses submit_bio, which means that it no longer uses the block device pagecache for anything. Create a more lightweight helper that runs the blocksize checks and flushes dirty data and use that instead. No more truncating the pagecache because XFS does not use it or care about it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
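A minimal sketch of what such a lightweight helper could look like, with a hypothetical name and keeping only the behaviour described above (block size validation plus a dirty-data flush, no pagecache truncation):

    static int example_flush_and_check_blocksize(struct block_device *bdev,
                                                 unsigned int size)
    {
        if (size < bdev_logical_block_size(bdev) || size > PAGE_SIZE ||
            !is_power_of_2(size))
            return -EINVAL;
        /* flush dirty pagecache data, but do not invalidate or truncate it */
        return sync_blockdev(bdev);
    }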
2025-05-07xfs: ignore HW which cannot atomic write a single blockDarrick J. Wong
Currently only HW which can write at least 1x block is supported. For supporting atomic writes > 1x block, a CoW-based method will also be used and this will not be restricted to using HW which can write >= 1x block. However, for deciding if HW-based atomic writes can be used, we need to start adding checks for write length < HW min, which complicates the code. Indeed, a statx field similar to unit_max_opt should also be added for this minimum, which is undesirable. HW which can only write > 1x blocks would be uncommon and quite weird, so let's just not support it. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
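A hedged sketch of the resulting check; the bdev_atomic_write_unit_*_bytes() helper names are taken from the block layer's atomic write support and the exact comparison may differ in the real patch:

    /* Only advertise HW atomic writes when the device can atomically
     * write exactly one filesystem block; otherwise rely on the
     * CoW-based path alone.
     */
    if (bdev_atomic_write_unit_min_bytes(bdev) > mp->m_sb.sb_blocksize ||
        bdev_atomic_write_unit_max_bytes(bdev) < mp->m_sb.sb_blocksize)
        return false;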
2025-05-07xfs: separate out setting buftarg atomic writes limitsDarrick J. Wong
Separate out setting buftarg atomic writes limits into a dedicated function, xfs_configure_buftarg_atomic_writes(), to keep the specific functionality self-contained. For naming consistency, rename xfs_setsize_buftarg() -> xfs_configure_buftarg(). Signed-off-by: Darrick J. Wong <djwong@kernel.org> [jpg: separate out from patch "xfs: ignore HW which ..."] Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2025-05-07xfs: only call xfs_setsize_buftarg once per buffer targetDarrick J. Wong
It's silly to call xfs_setsize_buftarg from xfs_alloc_buftarg with the block device LBA size because we don't need to ask the block layer to validate a geometry number that it provided us. Instead, set the preliminary bt_meta_sector* fields to the LBA size in preparation for reading the primary super. However, we still want to flush and invalidate the pagecache for all three block devices before we start reading metadata from those devices, so call sync_blockdev() per bdev in xfs_alloc_buftarg(). This will enable a subsequent patch to validate hw atomic write geometry against the filesystem geometry. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> [jpg: call sync_blockdev() from xfs_alloc_buftarg()] Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2025-05-07xfs: simplify xfs_buf_submit_bioChristoph Hellwig
Convert the __bio_add_page(..., virt_to_page(), ...) pattern to the bio_add_virt_nofail helper implementing it and use bio_add_vmalloc to insulate xfs from the details of adding vmalloc memory to a bio. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20250507120451.4000627-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
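A hedged sketch of the pattern, with the helper names taken from the text above and their exact signatures assumed:

    /* Pick the bio add helper based on how the buffer memory was
     * allocated instead of open coding virt_to_page() arithmetic.
     */
    if (is_vmalloc_addr(bp->b_addr))
        bio_add_vmalloc(bio, bp->b_addr, size);
    else
        bio_add_virt_nofail(bio, bp->b_addr, size);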
2025-04-28xfs: stop using set_blocksizeDarrick J. Wong
XFS has its own buffer cache for metadata that uses submit_bio, which means that it no longer uses the block device pagecache for anything. Create a more lightweight helper that runs the blocksize checks and flushes dirty data and use that instead. No more truncating the pagecache because XFS does not use it or care about it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-04-14xfs: mark xfs_buf_free as might_sleep()Christoph Hellwig
xfs_buf_free can call vunmap, which can sleep. The vunmap path is an unlikely one, so add might_sleep to ensure calling xfs_buf_free from atomic context gets caught more easily. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
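The annotation itself is a one-liner at the top of the function; a sketch of the intent, where the vfree call merely stands in for whatever the unmap path actually does:

    void xfs_buf_free(struct xfs_buf *bp)       /* sketch only */
    {
        /* vunmap/vfree can sleep, so catch atomic-context callers even
         * when this particular buffer does not take the vmalloc path */
        might_sleep();

        if (is_vmalloc_addr(bp->b_addr))
            vfree(bp->b_addr);
    }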
2025-03-18xfs: remove the flags argument to xfs_buf_get_uncachedChristoph Hellwig
No caller passes flags to xfs_buf_get_uncached, which makes sense given that the flags apply to behavior not used for uncached buffers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-18xfs: remove the flags argument to xfs_buf_read_uncachedChristoph Hellwig
No caller passes flags to xfs_buf_read_uncached, which makes sense given that the flags apply to behavior not used for uncached buffers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-18xfs: remove xfs_buf_free_mapsChristoph Hellwig
xfs_buf_free_maps only has a single caller, so open code it there. Stop zeroing the b_maps pointer as the buffer is freed in the next line. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-18xfs: remove xfs_buf_get_mapsChristoph Hellwig
xfs_buf_get_maps has a single caller, and can just be open coded there. When doing that, stop handling the allocation failure as we always pass __GFP_NOFAIL to the slab allocator, and use the proper kcalloc helper for array allocations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
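A sketch of the open-coded allocation described above (local variable names and GFP flags are assumptions for illustration):

    /* __GFP_NOFAIL means this cannot return NULL, so no error handling
     * is needed; kcalloc is the idiomatic zeroed-array helper.
     */
    bp->b_maps = kcalloc(nmaps, sizeof(struct xfs_buf_map),
                         GFP_KERNEL | __GFP_NOFAIL);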
2025-03-18xfs: call xfs_buf_alloc_backing_mem from _xfs_buf_allocChristoph Hellwig
We never allocate a buffer without backing memory. Simplify the call chain by calling xfs_buf_alloc_backing_mem from _xfs_buf_alloc. To avoid a forward declaration, move _xfs_buf_alloc down a bit in the file. Also drop the pointless _-prefix from _xfs_buf_alloc. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10xfs: trace what memory backs a bufferChristoph Hellwig
Add three trace points for the different backing memory allocators for buffers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10xfs: cleanup mapping tmpfs folios into the buffer cacheChristoph Hellwig
Directly assign b_addr based on the tmpfs folios without a detour through pages, reuse the folio_put path used for non-tmpfs buffers and replace all references to pages in comments with folios. Partially based on a patch from Dave Chinner <dchinner@redhat.com>. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10xfs: use vmalloc instead of vm_map_area for buffer backing memoryChristoph Hellwig
The fallback buffer allocation path currently open codes a suboptimal version of vmalloc to allocate pages that are then mapped into vmalloc space. Switch to using vmalloc instead, which uses all the optimizations in the common vmalloc code, and removes the need to track the backing pages in the xfs_buf structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
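A minimal sketch of the fallback path after the conversion, assuming the gfp mask variable from the surrounding code:

    /* One vmalloc call replaces the open-coded page allocation loop
     * plus the vmap step; there is no page array left to track.
     */
    bp->b_addr = __vmalloc(BBTOB(bp->b_length), gfp_mask);
    if (!bp->b_addr)
        return -ENOMEM;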
2025-03-10xfs: kill XBF_UNMAPPEDChristoph Hellwig
Unmapped buffer access is a pain, so kill it. The switch to large folios means we rarely pay a vmap penalty for large buffers, so this functionality is largely unnecessary now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10xfs: convert buffer cache to use high order foliosChristoph Hellwig
Now that we have the buffer cache using the folio API, we can extend the use of folios to allocate high order folios for multi-page buffers rather than an array of single pages that are then vmapped into a contiguous range. This creates a new type of single folio buffers that can have arbitrary order in addition to the existing multi-folio buffers made up of many single page folios that get vmapped. The single folio is for now stashed into the existing b_pages array, but that will go away entirely later in the series and remove the temporary page vs folio typing issues that only work because the two structures currently can be used largely interchangeably. The code that allocates buffers will optimistically attempt a high order folio allocation as a fast path if the buffer size is a power of two and thus fits into a folio. If this high order allocation fails, then we fall back to the existing multi-folio allocation code. This now forms the slow allocation path, and hopefully will be largely unused in normal conditions except for buffers with sizes that are not a power of two like larger remote xattrs. This should improve performance of large buffer operations (e.g. large directory block sizes) as we should now mostly avoid the expense of vmapping large buffers (and the vmap lock contention that can occur) as well as avoid the runtime pressure that frequently accessing kernel vmapped pages puts on the TLBs. Based on a patch from Dave Chinner <dchinner@redhat.com>, but mutilated beyond recognition. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
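A hedged sketch of the optimistic fast path, with the fallback helper name made up purely for illustration:

    /* Try a single high order folio first if the buffer size is a
     * power of two; fall back to the multi-folio/vmap path when it
     * is not, or when the large allocation fails.
     */
    size_t size = BBTOB(bp->b_length);

    if (is_power_of_2(size)) {
        struct folio *folio = folio_alloc(gfp_mask, get_order(size));

        if (folio) {
            bp->b_addr = folio_address(folio);
            return 0;
        }
    }
    return example_alloc_backing_pages(bp, gfp_mask);   /* hypothetical fallback */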
2025-03-10xfs: remove the kmalloc to page allocator fallbackChristoph Hellwig
Since commit 59bb47985c1d ("mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)"), kmalloc and friends guarantee that power of two sized allocations are naturally aligned. Limit our use of kmalloc for buffers to these power of two sizes and remove the fallback to the page allocator for this case, but keep a check in addition to trusting the slab allocator to get the alignment right. Also refactor the kmalloc path to reuse various calculations for the size and gfp flags. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
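A sketch of the power-of-two kmalloc path with the extra alignment check mentioned above (size bounds and flags are assumptions):

    if (is_power_of_2(size) && size >= SECTOR_SIZE && size < PAGE_SIZE) {
        bp->b_addr = kmalloc(size, gfp_mask);
        if (!bp->b_addr)
            return -ENOMEM;
        /* slab guarantees natural alignment for power-of-two sizes,
         * but keep a cheap sanity check rather than trusting it blindly */
        if (WARN_ON_ONCE(!IS_ALIGNED((unsigned long)bp->b_addr, size))) {
            kfree(bp->b_addr);
            return -ENOMEM;
        }
        return 0;
    }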
2025-03-10xfs: refactor backing memory allocations for buffersChristoph Hellwig
Lift handling of shmem and slab backed buffers into xfs_buf_alloc_pages and rename the result to xfs_buf_alloc_backing_mem. This shares more code and ensures uncached buffers can also use slab, which slightly reduces the memory usage of growfs on 512 byte sector size file systems, but more importantly means the allocation invariants are the same for cached and uncached buffers. Document these new invariants with a big fat comment mostly stolen from a patch by Dave Chinner. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10xfs: remove xfs_buf_is_vmappedChristoph Hellwig
No need to look at the page count if we can simply call is_vmalloc_addr on bp->b_addr. This prepares for eventually removing the b_page_count field. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10xfs: remove xfs_buf.b_offsetChristoph Hellwig
b_offset is only set for slab backed buffers and always set to offset_in_page(bp->b_addr), which can be done just as easily in the only user of b_offset. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10xfs: add a fast path to xfs_buf_zero when b_addr is setChristoph Hellwig
No need to walk the page list if bp->b_addr is valid. That also means b_offset doesn't need to be taken into account in the unmapped loop as b_offset is only set for kmem backed buffers which are always mapped. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
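The fast path amounts to a direct memset when the buffer is mapped; a sketch using the xfs_buf_zero() parameters:

    /* Mapped buffers can be zeroed in one go; only unmapped buffers
     * still need the per-page walk below.
     */
    if (bp->b_addr) {
        memset(bp->b_addr + boff, 0, bsize);
        return;
    }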
2025-02-25xfs: remove the XBF_STALE check from xfs_buf_rele_cachedChristoph Hellwig
xfs_buf_stale already sets b_lru_ref to 0, and thus prevents the buffer from moving to the LRU. Remove the duplicate check. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-02-25xfs: remove most in-flight buffer accountingChristoph Hellwig
The buffer cache keeps a bt_io_count per-CPU counter to track all in-flight I/O, which is used to ensure no I/O is in flight when unmounting the file system. For most I/O we already keep track of inflight I/O at higher levels:

- for synchronous I/O (xfs_buf_read/xfs_bwrite/xfs_buf_delwri_submit), the caller has a reference and waits for I/O completions using xfs_buf_iowait
- for xfs_buf_delwri_submit_nowait the only caller (AIL writeback) tracks the log items attached to the buffer

This leaves only xfs_buf_readahead_map as a submitter of asynchronous I/O that is not tracked by anything else. Replace the bt_io_count per-cpu counter with a more specific bt_readahead_count counter that only tracks readahead I/O. This allows simply incrementing it when submitting readahead I/O and decrementing it when the readahead completes, and thus simplifies xfs_buf_rele and removes the need for the XBF_NO_IOACCT flag and the XFS_BSTATE_IN_FLIGHT buffer state. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
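A sketch of the simpler accounting, assuming bt_readahead_count is a percpu counter hanging off the buftarg:

    /* Bracket only readahead I/O with the dedicated counter; all other
     * I/O is already tracked by its waiting caller or the AIL.
     */
    percpu_counter_inc(&bp->b_target->bt_readahead_count);  /* at readahead submit */
    percpu_counter_dec(&bp->b_target->bt_readahead_count);  /* at I/O completion */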
2025-02-25xfs: decouple buffer readahead from the normal buffer read pathChristoph Hellwig
xfs_buf_readahead_map is the only caller of xfs_buf_read_map and thus _xfs_buf_read that is not synchronous. Split it from xfs_buf_read_map so that the asynchronous path is self-contained and the now purely synchronous xfs_buf_read_map / _xfs_buf_read implementation can be simplified. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-02-25xfs: reduce context switches for synchronous buffered I/OChristoph Hellwig
Currently all metadata I/O completions happen in the m_buf_workqueue workqueue. But for synchronous I/O (i.e. all buffer reads) there is no need for that, as there always is a caller in process context that is waiting for the I/O. Factor out the guts of xfs_buf_ioend into a separate helper and call it from xfs_buf_iowait to avoid an extra context switch to the workqueue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-02-03Merge tag 'xfs-fixes-6.14-rc2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs bug fixes from Carlos Maiolino: "A few fixes for XFS, but the most notable one is: - xfs: remove xfs_buf_cache.bc_lock which has been hit by different persons including syzbot" * tag 'xfs-fixes-6.14-rc2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: remove xfs_buf_cache.bc_lock xfs: Add error handling for xfs_reflink_cancel_cow_range xfs: Propagate errors from xfs_reflink_cancel_cow_range in xfs_dax_write_iomap_end xfs: don't call remap_verify_area with sb write protection held xfs: remove an out of data comment in _xfs_buf_alloc xfs: fix the entry condition of exact EOF block allocation optimization
2025-01-28xfs: remove xfs_buf_cache.bc_lockChristoph Hellwig
xfs_buf_cache.bc_lock serializes adding buffers to and removing them from the hashtable. But as the rhashtable code already uses fine grained internal locking for inserts and removals the extra protection isn't actually required. It also happens to fix a lock order inversion vs b_lock added by the recent lookup race fix. Fixes: ee10f6fcdb96 ("xfs: fix buffer lookup vs release race") Reported-by: Lai, Yi <yi1.lai@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-26Merge tag 'mm-stable-2025-01-26-14-59' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "The various patchsets are summarized below. Plus of course many indivudual patches which are described in their changelogs. - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the page allocator so we end up with the ability to allocate and free zero-refcount pages. So that callers (ie, slab) can avoid a refcount inc & dec - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use large folios other than PMD-sized ones - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and fixes for this small built-in kernel selftest - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of the mapletree code - "mm: fix format issues and param types" from Keren Sun implements a few minor code cleanups - "simplify split calculation" from Wei Yang provides a few fixes and a test for the mapletree code - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes continues the work of moving vma-related code into the (relatively) new mm/vma.c - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David Hildenbrand cleans up and rationalizes handling of gfp flags in the page allocator - "readahead: Reintroduce fix for improper RA window sizing" from Jan Kara is a second attempt at fixing a readahead window sizing issue. It should reduce the amount of unnecessary reading - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng addresses an issue where "huge" amounts of pte pagetables are accumulated: https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/ Qi's series addresses this windup by synchronously freeing PTE memory within the context of madvise(MADV_DONTNEED) - "selftest/mm: Remove warnings found by adding compiler flags" from Muhammad Usama Anjum fixes some build warnings in the selftests code when optional compiler warnings are enabled - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David Hildenbrand tightens the allocator's observance of __GFP_HARDWALL - "pkeys kselftests improvements" from Kevin Brodsky implements various fixes and cleanups in the MM selftests code, mainly pertaining to the pkeys tests - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to estimate application working set size - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn provides some cleanups to memcg's hugetlb charging logic - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based kernel build was demonstrated - "zram: split page type read/write handling" from Sergey Senozhatsky has several fixes and cleaups for zram in the area of zram_write_page(). A watchdog softlockup warning was eliminated - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky cleans up the pagetable destructor implementations. A rare use-after-free race is fixed - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes simplifies and cleans up the debugging code in the VMA merging logic - "Account page tables at all levels" from Kevin Brodsky cleans up and regularizes the pagetable ctor/dtor handling. 
This results in improvements in accounting accuracy - "mm/damon: replace most damon_callback usages in sysfs with new core functions" from SeongJae Park cleans up and generalizes DAMON's sysfs file interface logic - "mm/damon: enable page level properties based monitoring" from SeongJae Park increases the amount of information which is presented in response to DAMOS actions - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes DAMON's long-deprecated debugfs interfaces. Thus the migration to sysfs is completed - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter Xu cleans up and generalizes the hugetlb reservation accounting - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino removes a never-used feature of the alloc_pages_bulk() interface - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park extends DAMOS filters to support not only exclusion (rejecting), but also inclusion (allowing) behavior - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi introduces a new memory descriptor for zswap.zpool that currently overlaps with struct page for now. This is part of the effort to reduce the size of struct page and to enable dynamic allocation of memory descriptors - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and simplifies the swap allocator locking. A speedup of 400% was demonstrated for one workload. As was a 35% reduction for kernel build time with swap-on-zram - "mm: update mips to use do_mmap(), make mmap_region() internal" from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that mmap_region() can be made MM-internal - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU regressions and otherwise improves MGLRU performance - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park updates DAMON documentation - "Cleanup for memfd_create()" from Isaac Manjarres does that thing - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand provides various cleanups in the areas of hugetlb folios, THP folios and migration - "Uncached buffered IO" from Jens Axboe implements the new RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache reading and writing. 
To permite userspace to address issues with massive buildup of useless pagecache when reading/writing fast devices - "selftests/mm: virtual_address_range: Reduce memory" from Thomas Weißschuh fixes and optimizes some of the MM selftests" * tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits) mm/compaction: fix UBSAN shift-out-of-bounds warning s390/mm: add missing ctor/dtor on page table upgrade kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags() tools: add VM_WARN_ON_VMG definition mm/damon/core: use str_high_low() helper in damos_wmark_wait_us() seqlock: add missing parameter documentation for raw_seqcount_try_begin() mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh mm/page_alloc: remove the incorrect and misleading comment zram: remove zcomp_stream_put() from write_incompressible_page() mm: separate move/undo parts from migrate_pages_batch() mm/kfence: use str_write_read() helper in get_access_type() selftests/mm/mkdirty: fix memory leak in test_uffdio_copy() kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags() selftests/mm: virtual_address_range: avoid reading from VM_IO mappings selftests/mm: vm_util: split up /proc/self/smaps parsing selftests/mm: virtual_address_range: unmap chunks after validation selftests/mm: virtual_address_range: mmap() without PROT_WRITE selftests/memfd/memfd_test: fix possible NULL pointer dereference mm: add FGP_DONTCACHE folio creation flag mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue ...
2025-01-25mm: alloc_pages_bulk: rename APILuiz Capitulino
The previous commit removed the page_list argument from alloc_pages_bulk_noprof() along with the alloc_pages_bulk_list() function. Now that only the *_array() flavour of the API remains, we can do the following renaming (along with the _noprof() ones): alloc_pages_bulk_array -> alloc_pages_bulk alloc_pages_bulk_array_mempolicy -> alloc_pages_bulk_mempolicy alloc_pages_bulk_array_node -> alloc_pages_bulk_node Link: https://lkml.kernel.org/r/275a3bbc0be20fbe9002297d60045e67ab3d4ada.1734991165.git.luizcap@redhat.com Signed-off-by: Luiz Capitulino <luizcap@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-24xfs: remove an out of data comment in _xfs_buf_allocChristoph Hellwig
There hasn't been anything like an io_length for a long time. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-16xfs: fix buffer lookup vs release raceChristoph Hellwig
Since commit 298f34224506 ("xfs: lockless buffer lookup") the buffer lookup fastpath is done without a hash-wide lock (then pag_buf_lock, now bc_lock) and only under RCU protection. But this means that nothing serializes lookups against the temporary 0 reference count for buffers that are added to the LRU after dropping the last regular reference, and a concurrent lookup would fail to find them. Fix this by doing all b_hold modifications under b_lock. We're already doing this for release so this "only" roughly doubles the b_lock round trips. We'll later look into the lockref infrastructure to optimize the number of lock round trips again. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
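A hedged sketch of the lookup side, assuming b_hold remains an atomic counter whose transitions are now serialized by b_lock:

    /* Take the hold under b_lock so a concurrent release that briefly
     * drops the count to zero cannot hide the buffer from this lookup.
     */
    spin_lock(&bp->b_lock);
    if (!atomic_read(&bp->b_hold)) {
        spin_unlock(&bp->b_lock);
        return NULL;            /* lost the race with the final release */
    }
    atomic_inc(&bp->b_hold);
    spin_unlock(&bp->b_lock);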
2025-01-16xfs: check for dead buffers in xfs_buf_find_insertChristoph Hellwig
Commit 32dd4f9c506b ("xfs: remove a superflous hash lookup when inserting new buffers") converted xfs_buf_find_insert to use rhashtable_lookup_get_insert_fast and thus an operation that returns the existing buffer when an insert would duplicate the hash key. But this code path misses the check for a buffer with a reference count of zero, which could lead to reusing an about to be freed buffer. Fix this by using the same atomic_inc_not_zero pattern as xfs_buf_insert. Fixes: 32dd4f9c506b ("xfs: remove a superflous hash lookup when inserting new buffers") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: stable@vger.kernel.org # v6.0 Signed-off-by: Carlos Maiolino <cem@kernel.org>
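A sketch of the pattern named above, applied to the existing buffer returned by the racing insert (the retry label is assumed for illustration):

    /* A zero hold count means the existing buffer is about to be
     * freed; do not reuse it, redo the lookup/insert instead.
     */
    if (!atomic_inc_not_zero(&bp->b_hold))
        goto retry;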
2025-01-14xfs: add a b_iodone callback to struct xfs_bufChristoph Hellwig
Stop open coding the log item completions and instead add a callback back into the submitter. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
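A sketch of the hook, with the member as named in the subject line and its exact placement assumed:

    /* struct xfs_buf gains:  void (*b_iodone)(struct xfs_buf *bp);
     * I/O completion then calls back into the submitter instead of
     * open coding the log item completion handling.
     */
    if (bp->b_iodone)
        bp->b_iodone(bp);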
2025-01-14xfs: move b_li_list based retry handling to common codeChristoph Hellwig
The dquot and inode versions are very similar, which is expected given the overall b_li_list logic. The differences are that the inode version also clears the XFS_LI_FLUSHING flag, which is defined in common code but only ever set by the inode item, and that the dquot version takes the ail_lock over the list iteration. While this seems sensible given that additions and removals from b_li_list are protected by the ail_lock, log items are only added before buffer submission, and are only removed when completing the buffer, so nothing can change the list when retrying a buffer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: always complete the buffer inline in xfs_buf_submitChristoph Hellwig
xfs_buf_submit now only completes a buffer on error, or for in-memory buftargs. There is no point in using a workqueue for the latter as the completion will just wake up the caller. Optimize this case by avoiding the workqueue roundtrip. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: remove the extra buffer reference in xfs_buf_submitChristoph Hellwig
Nothing touches the buffer after it has been submitted now, so the need for the extra transient reference went away as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: move invalidate_kernel_vmap_range to xfs_buf_ioendChristoph Hellwig
Invalidating cache lines can be fairly expensive, so don't do it in interrupt context. Note that in practice very few setups will actually do anything here as virtually indexed caches are rather uncommon, but we might as well move the call to the proper place while touching this area. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: simplify buffer I/O submissionChristoph Hellwig
The code in _xfs_buf_ioapply is unnecessarily complicated because it doesn't take advantage of modern bio features. Simplify it by making use of bio splitting and chaining, that is, build a single bio for the pages in the buffer using a simple loop, and then split that bio on the map boundaries for discontiguous multi-FSB buffers and chain the split bios to the main one so that there is only a single I/O completion. This not only simplifies the code to build the buffer, but also removes the need for the b_io_remaining field, as buffer ownership is granted to the bio on submission of the final bio with no chance for a completion before that, as well as the b_io_error field, which is now superfluous because there is always exactly one completion. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
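A hedged sketch of the per-map splitting step using the generic block layer helpers; the map length and start sector variables are placeholders and the real loop structure may differ:

    /* Split the single built-up bio at each map boundary and chain the
     * split-off bio to the remainder so only one completion runs.
     */
    struct bio *split = bio_split(bio, map_len_sectors, GFP_NOFS, &fs_bio_set);

    split->bi_iter.bi_sector = map_start_sector;
    bio_chain(split, bio);
    submit_bio(split);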
2025-01-14xfs: move in-memory buftarg handling out of _xfs_buf_ioapplyChristoph Hellwig
No I/O to apply for in-memory buffers, so skip the function call entirely. Clean up the b_io_error initialization logic to allow for this. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: move write verification out of _xfs_buf_ioapplyChristoph Hellwig
Split the write verification logic out of _xfs_buf_ioapply into a new xfs_buf_verify_write helper called by xfs_buf_submit given that it isn't about applying the I/O and doesn't really fit in with the rest of _xfs_buf_ioapply. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: remove xfs_buf_delwri_submit_buffersChristoph Hellwig
xfs_buf_delwri_submit_buffers has two callers for synchronous and asynchronous writes that share very little logic. Split out a helper for the shared per-buffer loop and otherwise open code the submission in the two callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: simplify xfs_buf_delwri_pushbufChristoph Hellwig
xfs_buf_delwri_pushbuf synchronously writes a buffer that is on a delwri list already. Instead of doing a complicated dance with the delwri and wait list, just leave them alone and open code the actual buffer write. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-14xfs: move xfs_buf_iowait out of (__)xfs_buf_submitChristoph Hellwig
There is no good reason to pass a bool argument to wait for a buffer when the callers that want that can easily just wait themselves. This means the wait moves out of the extra hold of the buffer, but as the callers of synchronous buffer I/O need to hold a reference anyway that is perfectly fine. Because all async buffer submitters ignore the error return value, and the synchronous ones catch the error condition through b_error and xfs_buf_iowait this also means the new xfs_buf_submit doesn't have to return an error code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
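A sketch of the resulting caller-side pattern for synchronous I/O:

    /* The caller already holds a buffer reference, so it can simply
     * submit and then wait; errors are picked up from b_error via
     * xfs_buf_iowait() rather than from xfs_buf_submit()'s return value.
     */
    xfs_buf_submit(bp);
    error = xfs_buf_iowait(bp);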