summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-03-03xfs: don't call xfs_can_free_eofblocks from ->release for zoned inodesChristoph Hellwig
Zoned file systems require out of place writes and thus can't support post-EOF speculative preallocations. Avoid the pointless ilock critical section to find out that none can be freed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: disable FITRIM for zoned RT devicesChristoph Hellwig
The zoned allocator unconditionally issues zone resets or discards after emptying an entire zone, so supporting FITRIM for a zoned RT device is not useful. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: disable sb_frextents for zoned file systemsChristoph Hellwig
Zoned file systems not only don't use the global frextents counter, but for them the in-memory percpu counter also includes reservations taken before even allocating delalloc extent records, so it will never match the per-zone used information. Disable all updates and verification of the sb counter for zoned file systems as it isn't useful for them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: export zoned geometry via XFS_FSOP_GEOMChristoph Hellwig
Export the zoned geometry information so that userspace can query it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: allow internal RT devices for zoned modeChristoph Hellwig
Allow creating an RT subvolume on the same device as the main data device. This is mostly used for SMR HDDs where the conventional zones are used for the data device and the sequential write required zones for the zoned RT section. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: define the zoned on-disk formatChristoph Hellwig
Zone file systems reuse the basic RT group enabled XFS file system structure to support a mode where each RT group is always written from start to end and then reset for reuse (after moving out any remaining data). There are few minor but important changes, which are indicated by a new incompat flag: 1) there are no bitmap and summary inodes, thus the /rtgroups/{rgno}.{bitmap,summary} metadir files do not exist and the sb_rbmblocks superblock field must be cleared to zero. 2) there is a new superblock field that specifies the start of an internal RT section. This allows supporting SMR HDDs that have random writable space at the beginning which is used for the XFS data device (which really is the metadata device for this configuration), directly followed by a RT device on the same block device. While something similar could be achieved using dm-linear just having a single device directly consumed by XFS makes handling the file systems a lot easier. 3) Another superblock field that tracks the amount of reserved space (or overprovisioning) that is never used for user capacity, but allows GC to run more smoothly. 4) an overlay of the cowextsize field for the rtrmap inode so that we can persistently track the total amount of rtblocks currently used in a RT group. There is no data structure other than the rmap that tracks used space in an RT group, and this counter is used to decide when a RT group has been entirely emptied, and to select one that is relatively empty if garbage collection needs to be performed. While this counter could be tracked entirely in memory and rebuilt from the rmap at mount time, that would lead to very long mount times with the large number of RT groups implied by the number of hardware zones especially on SMR hard drives with 256MB zone sizes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: add a xfs_rtrmap_highest_rgbno helperChristoph Hellwig
Add a helper to find the last offset mapped in the rtrmap. This will be used by the zoned code to find out where to start writing again on conventional devices without hardware zone support. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: support XFS_BMAPI_REMAP in xfs_bmap_del_extent_delayChristoph Hellwig
The zone allocator wants to be able to remove a delalloc mapping in the COW fork while keeping the block reservation. To support that pass the flags argument down to xfs_bmap_del_extent_delay and support the XFS_BMAPI_REMAP flag to keep the reservation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: refine the unaligned check for always COW inodes in xfs_file_dio_writeChristoph Hellwig
For always COW inodes we also must check the alignment of each individual iovec segment, as they could end up with different I/Os due to the way bio_iov_iter_get_pages works, and we'd then overwrite an already written block. The existing always_cow sysctl based code doesn't catch this because nothing enforces that blocks aren't rewritten, but for zoned XFS on sequential write required zones this is a hard error. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: skip always_cow inodes in xfs_reflink_trim_around_sharedChristoph Hellwig
xfs_reflink_trim_around_shared tries to find shared blocks in the refcount btree. Always_cow inodes don't have that tree, so don't bother. For the existing always_cow code this is a minor optimization. For the upcoming zoned code that can do COW without the rtreflink code it avoids triggering a NULL pointer dereference. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: move xfs_bmapi_reserve_delalloc to xfs_iomap.cChristoph Hellwig
Delalloc reservations are not supported in userspace, and thus it doesn't make sense to share this helper with xfsprogs.c. Move it to xfs_iomap.c toward the two callers. Note that there rest of the delalloc handling should probably eventually also move out of xfs_bmap.c, but that will require a bit more surgery. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: add a rtg_blocks helperChristoph Hellwig
Shortcut dereferencing the xg_block_count field in the generic group structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: factor out a xfs_rt_check_size helperChristoph Hellwig
Add a helper to check that the last block of a RT device is readable to share the code between mount and growfs. This also adds the mount time overflow check to growfs and improves the error messages. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: reduce metafile reservationsChristoph Hellwig
There is no point in reserving more space than actually available on the data device for the worst case scenario that is unlikely to happen. Reserve at most 1/4th of the data device blocks, which is still a heuristic. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: make metabtree reservations globalChristoph Hellwig
Currently each metabtree inode has it's own space reservation to ensure it can be expanded to the maximum size, mirroring what is done for the AG-based btrees. But unlike the AG-based btrees the metabtree inodes aren't restricted to allocate from a single AG but can use free space form the entire file system. And unlike AG-based btrees where the required reservation shrinks with the available free space due to this, the metabtree reservations for the rtrmap and rtfreflink trees are not bound in any way by the data device free space as they track RT extent allocations. This is not very efficient as it requires a large number of blocks to be set aside that can't be used at all by other btrees. Switch to a model that uses a global pool instead in preparation for reducing the amount of reserved space, which now also removes the overloading of the i_nblocks field for metabtree inodes, which would create problems if metabtree inodes ever had a big enough xattr fork to require xattr blocks outside the inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: fixup the metabtree reservation in xrep_reap_metadir_fsblocksChristoph Hellwig
All callers of xrep_reap_metadir_fsblocks need to fix up the metabtree reservation, otherwise they'd leave the reservations in an incoherent state. Move the call to xrep_reset_metafile_resv into xrep_reap_metadir_fsblocks so it always is taken care of, and remove now superfluous helper functions in the callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: trace in-memory freecounter reservationsChristoph Hellwig
Add two tracepoints when the freecounter dips into the reserved pool and when it is entirely out of space. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: support reserved blocks for the rt extent counterChristoph Hellwig
The zoned space allocator will need reserved RT extents for garbage collection and zeroing of partial blocks. Move the resblks related fields into the freecounter array so that they can be used for all counters. Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: generalize the freespace and reserved blocks handlingChristoph Hellwig
xfs_{add,dec}_freecounter already handles the block and RT extent percpu counters, but it currently hardcodes the passed in counter. Add a freecounter abstraction that uses an enum to designate the counter and add wrappers that hide the actual percpu_counters. This will allow expanding the reserved block handling to the RT extent counter in the next step, and also prepares for adding yet another such counter that can share the code. Both these additions will be needed for the zoned allocator. Also switch the flooring of the frextents counter to 0 in statfs for the rthinherit case to a manual min_t call to match the handling of the fdblocks counter for normal file systems. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: reflow xfs_dec_freecounterChristoph Hellwig
Let the successful allocation be the main path through the function with exception handling in branches to make the code easier to follow. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03Merge branch 'vfs-6.15.iomap' of ↵Carlos Maiolino
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs into xfs-6.15-merge
2025-02-27Merge branch 'vfs-6.15.shared.iomap' of ↵Christian Brauner
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull in the VFS changes shared with xfs for making RWF_DONTCACHE work with buffered writes for xfs. Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-27Merge patch series "iomap: make buffered writes work with RWF_DONTCACHE"Christian Brauner
Support buffered writes with RWF_DONTCACHE. * patches from https://lore.kernel.org/r/20250204184047.356762-2-axboe@kernel.dk: xfs: flag as supporting FOP_DONTCACHE iomap: make buffered writes work with RWF_DONTCACHE Link: https://lore.kernel.org/r/20250204184047.356762-2-axboe@kernel.dk Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-27xfs: flag as supporting FOP_DONTCACHEJens Axboe
Read side was already fully supported, and with the write side appropriately punted to the worker queue, all that's needed now is setting FOP_DONTCACHE in the file_operations structure to enable full support for read and write uncached IO. This provides similar benefits to using RWF_DONTCACHE with reads. Testing buffered writes on 32 files: writing bs 65536, uncached 0 1s: 196035MB/sec 2s: 132308MB/sec 3s: 132438MB/sec 4s: 116528MB/sec 5s: 103898MB/sec 6s: 108893MB/sec 7s: 99678MB/sec 8s: 106545MB/sec 9s: 106826MB/sec 10s: 101544MB/sec 11s: 111044MB/sec 12s: 124257MB/sec 13s: 116031MB/sec 14s: 114540MB/sec 15s: 115011MB/sec 16s: 115260MB/sec 17s: 116068MB/sec 18s: 116096MB/sec where it's quite obvious where the page cache filled, and performance dropped from to about half of where it started, settling in at around 115GB/sec. Meanwhile, 32 kswapds were running full steam trying to reclaim pages. Running the same test with uncached buffered writes: writing bs 65536, uncached 1 1s: 198974MB/sec 2s: 189618MB/sec 3s: 193601MB/sec 4s: 188582MB/sec 5s: 193487MB/sec 6s: 188341MB/sec 7s: 194325MB/sec 8s: 188114MB/sec 9s: 192740MB/sec 10s: 189206MB/sec 11s: 193442MB/sec 12s: 189659MB/sec 13s: 191732MB/sec 14s: 190701MB/sec 15s: 191789MB/sec 16s: 191259MB/sec 17s: 190613MB/sec 18s: 191951MB/sec and the behavior is fully predictable, performing the same throughout even after the page cache would otherwise have fully filled with dirty data. It's also about 65% faster, and using half the CPU of the system compared to the normal buffered write. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20250204184047.356762-3-axboe@kernel.dk Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-27iomap: make buffered writes work with RWF_DONTCACHEJens Axboe
Add iomap buffered write support for RWF_DONTCACHE. If RWF_DONTCACHE is set for a write, mark the folios being written as uncached. Then writeback completion will drop the pages. The write_iter handler simply kicks off writeback for the pages, and writeback completion will take care of the rest. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20250204184047.356762-2-axboe@kernel.dk Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-27xfs: flag as supporting FOP_DONTCACHEJens Axboe
Read side was already fully supported, and with the write side appropriately punted to the worker queue, all that's needed now is setting FOP_DONTCACHE in the file_operations structure to enable full support for read and write uncached IO. This provides similar benefits to using RWF_DONTCACHE with reads. Testing buffered writes on 32 files: writing bs 65536, uncached 0 1s: 196035MB/sec 2s: 132308MB/sec 3s: 132438MB/sec 4s: 116528MB/sec 5s: 103898MB/sec 6s: 108893MB/sec 7s: 99678MB/sec 8s: 106545MB/sec 9s: 106826MB/sec 10s: 101544MB/sec 11s: 111044MB/sec 12s: 124257MB/sec 13s: 116031MB/sec 14s: 114540MB/sec 15s: 115011MB/sec 16s: 115260MB/sec 17s: 116068MB/sec 18s: 116096MB/sec where it's quite obvious where the page cache filled, and performance dropped from to about half of where it started, settling in at around 115GB/sec. Meanwhile, 32 kswapds were running full steam trying to reclaim pages. Running the same test with uncached buffered writes: writing bs 65536, uncached 1 1s: 198974MB/sec 2s: 189618MB/sec 3s: 193601MB/sec 4s: 188582MB/sec 5s: 193487MB/sec 6s: 188341MB/sec 7s: 194325MB/sec 8s: 188114MB/sec 9s: 192740MB/sec 10s: 189206MB/sec 11s: 193442MB/sec 12s: 189659MB/sec 13s: 191732MB/sec 14s: 190701MB/sec 15s: 191789MB/sec 16s: 191259MB/sec 17s: 190613MB/sec 18s: 191951MB/sec and the behavior is fully predictable, performing the same throughout even after the page cache would otherwise have fully filled with dirty data. It's also about 65% faster, and using half the CPU of the system compared to the normal buffered write. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20250204184047.356762-3-axboe@kernel.dk Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-27iomap: make buffered writes work with RWF_DONTCACHEJens Axboe
Add iomap buffered write support for RWF_DONTCACHE. If RWF_DONTCACHE is set for a write, mark the folios being written as uncached. Then writeback completion will drop the pages. The write_iter handler simply kicks off writeback for the pages, and writeback completion will take care of the rest. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20250204184047.356762-2-axboe@kernel.dk Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-26Merge patch series "iomap: incremental advance conversion -- phase 2"Christian Brauner
Brian Foster <bfoster@redhat.com> says: Here's phase 2 of the incremental iter advance conversions. This updates all remaining iomap operations to advance the iter within the operation and thus removes the need to advance from the core iomap iterator. Once all operations are switched over, the core advance code is removed and the processed field is renamed to reflect that it is now a pure status code. For context, this was first introduced in a previous series [1] that focused mainly on the core mechanism and iomap buffered write. This is because original impetus was to facilitate a folio batch mechanism where a filesystem can optionally provide a batch of folios to process for a given mapping (i.e. zero range of an unwritten mapping with dirty folios in pagecache). That is still WIP, but the broader point is that this was originally intended as an optional mode until consensus that fell out of discussion was that it would be preferable to convert over everything. This presumably facilitates some other future work and simplifies semantics in the core iteration code. Patches 1-3 convert over iomap buffered read, direct I/O and various other remaining ops (swap, etc.). Patches 4-9 convert over the various DAX iomap operations. Finally, patches 10-12 introduce some cleanups now that all iomap operations have updated iteration semantics. * patches from https://lore.kernel.org/r/20250224144757.237706-1-bfoster@redhat.com: iomap: introduce a full map advance helper iomap: rename iomap_iter processed field to status iomap: remove unnecessary advance from iomap_iter() dax: advance the iomap_iter on pte and pmd faults dax: advance the iomap_iter on dedupe range dax: advance the iomap_iter on unshare range dax: advance the iomap_iter on zero range dax: push advance down into dax_iomap_iter() for read and write dax: advance the iomap_iter in the read/write path iomap: convert misc simple ops to incremental advance iomap: advance the iter on direct I/O iomap: advance the iter directly on buffered read Link: https://lore.kernel.org/r/20250224144757.237706-1-bfoster@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-26iomap: introduce a full map advance helperBrian Foster
Various iomap_iter_advance() calls advance by the full mapping length and thus have no need for the current length input or post-advance remaining length output from the standard advance function. Add an iomap_iter_advance_full() helper to clean up these cases. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/r/20250224144757.237706-13-bfoster@redhat.com Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-26iomap: rename iomap_iter processed field to statusBrian Foster
The iter.processed field name is no longer appropriate now that iomap operations do not return the number of bytes processed. Rename the field to iter.status to reflect that a success or error code is expected. Also change the type to int as there is no longer a need for an s64. This reduces the size of iomap_iter by 8 bytes due to a combination of smaller type and reduction in structure padding. While here, fix up the return types of various _iter() helpers to reflect the type change. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/r/20250224144757.237706-12-bfoster@redhat.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-26iomap: remove unnecessary advance from iomap_iter()Brian Foster
At this point, all iomap operations have been updated to advance the iomap_iter directly before returning to iomap_iter(). Therefore, the complexity of handling both the old and new semantics is no longer required and can be removed from iomap_iter(). Update iomap_iter() to expect success or failure status in iter.processed. As a precaution and developer hint to prevent inadvertent use of old semantics, warn on a positive return code and fail the operation. Remove the unnecessary advance and simplify the termination logic. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/r/20250224144757.237706-11-bfoster@redhat.com Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-26dax: advance the iomap_iter on pte and pmd faultsBrian Foster
Advance the iomap_iter on PTE and PMD faults. Each of these operations assign a hardcoded size to iter.processed. Replace those with an advance and status return. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/r/20250224144757.237706-10-bfoster@redhat.com Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-26dax: advance the iomap_iter on dedupe rangeBrian Foster
Advance the iter on successful dedupe. Dedupe range uses two iters and iterates so long as both have outstanding work, so correspondingly this needs to advance both on each iteration. Since dax_range_compare_iter() now returns status instead of a byte count, update the variable name in the caller as well. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/r/20250224144757.237706-9-bfoster@redhat.com Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-26dax: advance the iomap_iter on unshare rangeBrian Foster
Advance the iter and return 0 or an error code for success or failure. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/r/20250224144757.237706-8-bfoster@redhat.com Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-26dax: advance the iomap_iter on zero rangeBrian Foster
Update the DAX zero range iomap iter handler to advance the iter directly. Advance by the full length in the hole/unwritten case, or otherwise advance incrementally in the zeroing loop. In either case, return 0 or an error code for success or failure. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/r/20250224144757.237706-7-bfoster@redhat.com Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-26dax: push advance down into dax_iomap_iter() for read and writeBrian Foster
DAX read and write currently advances the iter after the dax_iomap_iter() returns the number of bytes processed rather than internally within the iter handler itself, as most other iomap operations do. Push the advance down into dax_iomap_iter() and update the function to return op status instead of bytes processed. dax_iomap_iter() shortcuts reads from a hole or unwritten mapping by directly zeroing the iov_iter, so advance the iomap_iter similarly in that case. The DAX processing loop can operate on a range slightly different than defined by the iomap_iter depending on circumstances. For example, a read may be truncated by inode size, a read or write range can be increased due to page alignment, etc. Therefore, this patch aims to retain as much of the existing logic as possible. The loop control logic remains pos based, but is sampled from the iomap_iter on each iteration after the advance instead of being updated manually. Similarly, length is updated based on the output of the advance instead of being updated manually. The advance itself is based on the number of bytes transferred, which was previously used to update the local copies of pos and length. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/r/20250224144757.237706-6-bfoster@redhat.com Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-26dax: advance the iomap_iter in the read/write pathBrian Foster
DAX reads and writes flow through dax_iomap_iter(), which has one or more subtleties in terms of how it processes a range vs. what is specified in the iomap_iter. To keep things simple and remove the dependency on iomap_iter() advances, convert a positive return from dax_iomap_iter() to the new advance and status return semantics. The advance can be pushed further down in future patches. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/r/20250224144757.237706-5-bfoster@redhat.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-26iomap: convert misc simple ops to incremental advanceBrian Foster
Update several of the remaining iomap operations to advance the iter directly rather than via return value. This includes page faults, fiemap, seek data/hole and swapfile activation. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/r/20250224144757.237706-4-bfoster@redhat.com Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-26iomap: advance the iter on direct I/OBrian Foster
Update iomap direct I/O to advance the iter directly rather than via iter.processed. Update each mapping type helper to advance based on the amount of data processed and return success or failure. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/r/20250224144757.237706-3-bfoster@redhat.com Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-26iomap: advance the iter directly on buffered readBrian Foster
iomap buffered read advances the iter via iter.processed. To continue separating iter advance from return status, update iomap_readpage_iter() to advance the iter instead of returning the number of bytes processed. In turn, drop the offset parameter and sample the updated iter->pos at the start of the function. Update the callers to loop based on remaining length in the current iteration instead of number of bytes processed. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/r/20250224144757.237706-2-bfoster@redhat.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-25xfs: remove the XBF_STALE check from xfs_buf_rele_cachedChristoph Hellwig
xfs_buf_stale already set b_lru_ref to 0, and thus prevents the buffer from moving to the LRU. Remove the duplicate check. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-02-25xfs: remove most in-flight buffer accountingChristoph Hellwig
The buffer cache keeps a bt_io_count per-CPU counter to track all in-flight I/O, which is used to ensure no I/O is in flight when unmounting the file system. For most I/O we already keep track of inflight I/O at higher levels: - for synchronous I/O (xfs_buf_read/xfs_bwrite/xfs_buf_delwri_submit), the caller has a reference and waits for I/O completions using xfs_buf_iowait - for xfs_buf_delwri_submit_nowait the only caller (AIL writeback) tracks the log items that the buffer attached to This only leaves only xfs_buf_readahead_map as a submitter of asynchronous I/O that is not tracked by anything else. Replace the bt_io_count per-cpu counter with a more specific bt_readahead_count counter only tracking readahead I/O. This allows to simply increment it when submitting readahead I/O and decrementing it when it completed, and thus simplify xfs_buf_rele and remove the needed for the XBF_NO_IOACCT flags and the XFS_BSTATE_IN_FLIGHT buffer state. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-02-25xfs: decouple buffer readahead from the normal buffer read pathChristoph Hellwig
xfs_buf_readahead_map is the only caller of xfs_buf_read_map and thus _xfs_buf_read that is not synchronous. Split it from xfs_buf_read_map so that the asynchronous path is self-contained and the now purely synchronous xfs_buf_read_map / _xfs_buf_read implementation can be simplified. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-02-25xfs: reduce context switches for synchronous buffered I/OChristoph Hellwig
Currently all metadata I/O completions happen in the m_buf_workqueue workqueue. But for synchronous I/O (i.e. all buffer reads) there is no need for that, as there always is a called in process context that is waiting for the I/O. Factor out the guts of xfs_buf_ioend into a separate helper and call it from xfs_buf_iowait to avoid a double an extra context switch to the workqueue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-02-14xfs: flush inodegc before swaponChristoph Hellwig
Fix the brand new xfstest that tries to swapon on a recently unshared file and use the chance to document the other bit of magic in this function. The big comment is taken from a mailinglist post by Dave Chinner. Fixes: 5e672cd69f0a53 ("xfs: introduce xfs_inodegc_push()") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-02-14xfs: rename xfs_iomap_swapfile_activate to xfs_vm_swap_activateChristoph Hellwig
Match the method name and the naming convention or address_space operations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-02-14xfs: Do not allow norecovery mount with quotacheckCarlos Maiolino
Mounting a filesystem that requires quota state changing will generate a transaction. We already check for a read-only device; we should do that for norecovery too. A quotacheck on a norecovery mount, and with the right log size, will cause the mount process to hang on: [<0>] xlog_grant_head_wait+0x5d/0x2a0 [xfs] [<0>] xlog_grant_head_check+0x112/0x180 [xfs] [<0>] xfs_log_reserve+0xe3/0x260 [xfs] [<0>] xfs_trans_reserve+0x179/0x250 [xfs] [<0>] xfs_trans_alloc+0x101/0x260 [xfs] [<0>] xfs_sync_sb+0x3f/0x80 [xfs] [<0>] xfs_qm_mount_quotas+0xe3/0x2f0 [xfs] [<0>] xfs_mountfs+0x7ad/0xc20 [xfs] [<0>] xfs_fs_fill_super+0x762/0xa50 [xfs] [<0>] get_tree_bdev_flags+0x131/0x1d0 [<0>] vfs_get_tree+0x26/0xd0 [<0>] vfs_cmd_create+0x59/0xe0 [<0>] __do_sys_fsconfig+0x4e3/0x6b0 [<0>] do_syscall_64+0x82/0x160 [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e This is caused by a transaction running with bogus initialized head/tail I initially hit this while running generic/050, with random log sizes, but I managed to reproduce it reliably here with the steps below: mkfs.xfs -f -lsize=1025M -f -b size=4096 -m crc=1,reflink=1,rmapbt=1, -i sparse=1 /dev/vdb2 > /dev/null mount -o usrquota,grpquota,prjquota /dev/vdb2 /mnt xfs_io -x -c 'shutdown -f' /mnt umount /mnt mount -o ro,norecovery,usrquota,grpquota,prjquota /dev/vdb2 /mnt Last mount hangs up As we add yet another validation if quota state is changing, this also add a new helper named xfs_qm_validate_state_change(), factoring the quota state changes out of xfs_qm_newmount() to reduce cluttering within it. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-02-14xfs: do not check NEEDSREPAIR if ro,norecovery mount.Lukas Herbolt
If there is corrutpion on the filesystem andxfs_repair fails to repair it. The last resort of getting the data is to use norecovery,ro mount. But if the NEEDSREPAIR is set the filesystem cannot be mounted. The flag must be cleared out manually using xfs_db, to get access to what left over of the corrupted fs. Signed-off-by: Lukas Herbolt <lukas@herbolt.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-02-14xfs: fix data fork format filtering during inode repairDarrick J. Wong
Coverity noticed that xrep_dinode_bad_metabt_fork never runs because XFS_DINODE_FMT_META_BTREE is always filtered out in the mode selection switch of xrep_dinode_check_dfork. Metadata btrees are allowed only in the data forks of regular files, so add this case explicitly. I guess this got fubard during a refactoring prior to 6.13 and I didn't notice until now. :/ Coverity-id: 1617714 Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-02-14xfs: fix online repair probing when CONFIG_XFS_ONLINE_REPAIR=nDarrick J. Wong
I received a report from the release engineering side of the house that xfs_scrub without the -n flag (aka fix it mode) would try to fix a broken filesystem even on a kernel that doesn't have online repair built into it: # xfs_scrub -dTvn /mnt/test EXPERIMENTAL xfs_scrub program in use! Use at your own risk! Phase 1: Find filesystem geometry. /mnt/test: using 1 threads to scrub. Phase 1: Memory used: 132k/0k (108k/25k), time: 0.00/ 0.00/ 0.00s <snip> Phase 4: Repair filesystem. <snip> Info: /mnt/test/some/victimdir directory entries: Attempting repair. (repair.c line 351) Corruption: /mnt/test/some/victimdir directory entries: Repair unsuccessful; offline repair required. (repair.c line 204) Source: https://blogs.oracle.com/linux/post/xfs-online-filesystem-repair It is strange that xfs_scrub doesn't refuse to run, because the kernel is supposed to return EOPNOTSUPP if we actually needed to run a repair, and xfs_io's repair subcommand will perror that. And yet: # xfs_io -x -c 'repair probe' /mnt/test # The first problem is commit dcb660f9222fd9 (4.15) which should have had xchk_probe set the CORRUPT OFLAG so that any of the repair machinery will get called at all. It turns out that some refactoring that happened in the 6.6-6.8 era broke the operation of this corner case. What we *really* want to happen is that all the predicates that would steer xfs_scrub_metadata() towards calling xrep_attempt() should function the same way that they do when repair is compiled in; and then xrep_attempt gets to return the fatal EOPNOTSUPP error code that causes the probe to fail. Instead, commit 8336a64eb75cba (6.6) started the failwhale swimming by hoisting OFLAG checking logic into a helper whose non-repair stub always returns false, causing scrub to return "repair not needed" when in fact the repair is not supported. Prior to that commit, the oflag checking that was open-coded in scrub.c worked correctly. Similarly, in commit 4bdfd7d15747b1 (6.8) we hoisted the IFLAG_REPAIR and ALREADY_FIXED logic into a helper whose non-repair stub always returns false, so we never enter the if test body that would have called xrep_attempt, let alone fail to decode the OFLAGs correctly. The final insult (yes, we're doing The Naked Gun now) is commit 48a72f60861f79 (6.8) in which we hoisted the "are we going to try a repair?" predicate into yet another function with a non-repair stub always returns false. Fix xchk_probe to trigger xrep_probe if repair is enabled, or return EOPNOTSUPP directly if it is not. For all the other scrub types, we need to fix the header predicates so that the ->repair functions (which are all xrep_notsupported) get called to return EOPNOTSUPP. Commit 48a72 is tagged here because the scrub code prior to LTS 6.12 are incomplete and not worth patching. Reported-by: David Flynn <david.flynn@oracle.com> Cc: <stable@vger.kernel.org> # v6.8 Fixes: 8336a64eb75c ("xfs: don't complain about unfixed metadata when repairs were injected") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>