summaryrefslogtreecommitdiff
path: root/fs/xfs/libxfs/xfs_sb.c
AgeCommit message (Collapse)Author
2025-03-12xfs: Remove duplicate xfs_rtbitmap.h headerJiapeng Chong
./fs/xfs/libxfs/xfs_sb.c: xfs_rtbitmap.h is included more than once. Fixes: 2167eaabe2fa ("xfs: define the zoned on-disk format") Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=19446 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-03xfs: support zone gapsChristoph Hellwig
Zoned devices can have gaps beyond the usable capacity of a zone and the end in the LBA/daddr address space. In other words, the hardware equivalent to the RT groups already takes care of the power of 2 alignment for us. In this case the sparse FSB/RTB address space maps 1:1 to the device address space. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: disable sb_frextents for zoned file systemsChristoph Hellwig
Zoned file systems not only don't use the global frextents counter, but for them the in-memory percpu counter also includes reservations taken before even allocating delalloc extent records, so it will never match the per-zone used information. Disable all updates and verification of the sb counter for zoned file systems as it isn't useful for them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: export zoned geometry via XFS_FSOP_GEOMChristoph Hellwig
Export the zoned geometry information so that userspace can query it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: allow internal RT devices for zoned modeChristoph Hellwig
Allow creating an RT subvolume on the same device as the main data device. This is mostly used for SMR HDDs where the conventional zones are used for the data device and the sequential write required zones for the zoned RT section. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: define the zoned on-disk formatChristoph Hellwig
Zone file systems reuse the basic RT group enabled XFS file system structure to support a mode where each RT group is always written from start to end and then reset for reuse (after moving out any remaining data). There are few minor but important changes, which are indicated by a new incompat flag: 1) there are no bitmap and summary inodes, thus the /rtgroups/{rgno}.{bitmap,summary} metadir files do not exist and the sb_rbmblocks superblock field must be cleared to zero. 2) there is a new superblock field that specifies the start of an internal RT section. This allows supporting SMR HDDs that have random writable space at the beginning which is used for the XFS data device (which really is the metadata device for this configuration), directly followed by a RT device on the same block device. While something similar could be achieved using dm-linear just having a single device directly consumed by XFS makes handling the file systems a lot easier. 3) Another superblock field that tracks the amount of reserved space (or overprovisioning) that is never used for user capacity, but allows GC to run more smoothly. 4) an overlay of the cowextsize field for the rtrmap inode so that we can persistently track the total amount of rtblocks currently used in a RT group. There is no data structure other than the rmap that tracks used space in an RT group, and this counter is used to decide when a RT group has been entirely emptied, and to select one that is relatively empty if garbage collection needs to be performed. While this counter could be tracked entirely in memory and rebuilt from the rmap at mount time, that would lead to very long mount times with the large number of RT groups implied by the number of hardware zones especially on SMR hard drives with 256MB zone sizes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: generalize the freespace and reserved blocks handlingChristoph Hellwig
xfs_{add,dec}_freecounter already handles the block and RT extent percpu counters, but it currently hardcodes the passed in counter. Add a freecounter abstraction that uses an enum to designate the counter and add wrappers that hide the actual percpu_counters. This will allow expanding the reserved block handling to the RT extent counter in the next step, and also prepares for adding yet another such counter that can share the code. Both these additions will be needed for the zoned allocator. Also switch the flooring of the frextents counter to 0 in statfs for the rthinherit case to a manual min_t call to match the handling of the fdblocks counter for normal file systems. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2024-12-23xfs: introduce realtime refcount btree ondisk definitionsDarrick J. Wong
Add the ondisk structure definitions for realtime refcount btrees. The realtime refcount btree will be rooted from a hidden inode so it needs to have a separate btree block magic and pointer format. Next, add everything needed to read, write and manipulate refcount btree blocks. This prepares the way for connecting the btree operations implementation, though the changes to actually root the rtrefcount btree in an inode come later. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-23xfs: introduce realtime rmap btree ondisk definitionsDarrick J. Wong
Add the ondisk structure definitions for realtime rmap btrees. The realtime rmap btree will be rooted from a hidden inode so it needs to have a separate btree block magic and pointer format. Next, add everything needed to read, write and manipulate rmap btree blocks. This prepares the way for connecting the btree operations implementation, though embedding the rtrmap btree root in the inode comes later in the series. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: fix sb_spino_align checks for large fsblock sizesDarrick J. Wong
For a sparse inodes filesystem, mkfs.xfs computes the values of sb_spino_align and sb_inoalignmt with the following code: int cluster_size = XFS_INODE_BIG_CLUSTER_SIZE; if (cfg->sb_feat.crcs_enabled) cluster_size *= cfg->inodesize / XFS_DINODE_MIN_SIZE; sbp->sb_spino_align = cluster_size >> cfg->blocklog; sbp->sb_inoalignmt = XFS_INODES_PER_CHUNK * cfg->inodesize >> cfg->blocklog; On a V5 filesystem with 64k fsblocks and 512 byte inodes, this results in cluster_size = 8192 * (512 / 256) = 16384. As a result, sb_spino_align and sb_inoalignmt are both set to zero. Unfortunately, this trips the new sb_spino_align check that was just added to xfs_validate_sb_common, and the mkfs fails: # mkfs.xfs -f -b size=64k, /dev/sda meta-data=/dev/sda isize=512 agcount=4, agsize=81136 blks = sectsz=512 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=1, rmapbt=1 = reflink=1 bigtime=1 inobtcount=1 nrext64=1 = exchange=0 metadir=0 data = bsize=65536 blocks=324544, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=65536 ascii-ci=0, ftype=1, parent=0 log =internal log bsize=65536 blocks=5006, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =none extsz=65536 blocks=0, rtextents=0 = rgcount=0 rgsize=0 extents Discarding blocks...Sparse inode alignment (0) is invalid. Metadata corruption detected at 0x560ac5a80bbe, xfs_sb block 0x0/0x200 libxfs_bwrite: write verifier failed on xfs_sb bno 0x0/0x1 mkfs.xfs: Releasing dirty buffer to free list! found dirty buffer (bulk) on free list! Sparse inode alignment (0) is invalid. Metadata corruption detected at 0x560ac5a80bbe, xfs_sb block 0x0/0x200 libxfs_bwrite: write verifier failed on xfs_sb bno 0x0/0x1 mkfs.xfs: writing AG headers failed, err=22 Prior to commit 59e43f5479cce1 this all worked fine, even if "sparse" inodes are somewhat meaningless when everything fits in a single fsblock. Adjust the checks to handle existing filesystems. Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: 59e43f5479cce1 ("xfs: sb_spino_align is not verified") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-22xfs: remove unknown compat feature check in superblock write validationLong Li
Compat features are new features that older kernels can safely ignore, allowing read-write mounts without issues. The current sb write validation implementation returns -EFSCORRUPTED for unknown compat features, preventing filesystem write operations and contradicting the feature's definition. Additionally, if the mounted image is unclean, the log recovery may need to write to the superblock. Returning an error for unknown compat features during sb write validation can cause mount failures. Although XFS currently does not use compat feature flags, this issue affects current kernels' ability to mount images that may use compat feature flags in the future. Since superblock read validation already warns about unknown compat features, it's unnecessary to repeat this warning during write validation. Therefore, the relevant code in write validation is being removed. Fixes: 9e037cb7972f ("xfs: check for unknown v5 feature bits in superblock write verifier") Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-11-12Merge tag 'metadir-quotas-6.13_2024-11-05' of ↵Carlos Maiolino
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge xfs: persist quota options with metadir [v5.5 07/10] Store the quota files in the metadata directory tree instead of the superblock. Since we're introducing a new incompat feature flag, let's also make the mount process bring up quotas in whatever state they were when the filesystem was last unmounted, instead of requiring sysadmins to remember that themselves. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-12Merge tag 'realtime-groups-6.13_2024-11-05' of ↵Carlos Maiolino
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge xfs: shard the realtime section [v5.5 06/10] Right now, the realtime section uses a single pair of metadata inodes to store the free space information. This presents a scalability problem since every thread trying to allocate or free rt extents have to lock these files. Solve this problem by sharding the realtime section into separate realtime allocation groups. While we're at it, define a superblock to be stamped into the start of the rt section. This enables utilities such as blkid to identify block devices containing realtime sections, and avoids the situation where anything written into block 0 of the realtime extent can be misinterpreted as file data. The best advantage for rtgroups will become evident later when we get to adding rmap and reflink to the realtime volume, since the geometry constraints are the same for rt groups and AGs. Hence we can reuse all that code directly. This is a very large patchset, but it catches us up with 20 years of technical debt that have accumulated. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-12Merge tag 'incore-rtgroups-6.13_2024-11-05' of ↵Carlos Maiolino
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge xfs: create incore rt allocation groups [v5.5 04/10] Add in-memory data structures for sharding the realtime volume into independent allocation groups. For existing filesystems, the entire rt volume is modelled as having a single large group, with (potentially) a number of rt extents exceeding 2^32 blocks, though these are not likely to exist because the codebase has been a bit broken for decades. The next series fills in the ondisk format and other supporting structures. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-12Merge tag 'metadata-directory-tree-6.13_2024-11-05' of ↵Carlos Maiolino
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge xfs: metadata inode directory trees [v5.5 03/10] This series delivers a new feature -- metadata inode directories. This is a separate directory tree (rooted in the superblock) that contains only inodes that contain filesystem metadata. Different metadata objects can be looked up with regular paths. Start by creating xfs_imeta{dir,file}* functions to mediate access to the metadata directory tree. By the end of this mega series, all existing metadata inodes (rt+quota) will use this directory tree instead of the superblock. Next, define the metadir on-disk format, which consists of marking inodes with a new iflag that says they're metadata. This prevents bulkstat and friends from ever getting their hands on fs metadata files. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-12Merge tag 'generic-groups-6.13_2024-11-05' of ↵Carlos Maiolino
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge xfs: create a generic allocation group structure [v5.5 02/10] Soon we'll be sharding the realtime volume into separate allocation groups. These rt groups will /mostly/ behave the same as the ones on the data device, but since rt groups don't have quite the same set of struct fields as perags, let's hoist the parts that will be shared by both into a common xfs_group object. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05xfs: use metadir for quota inodesDarrick J. Wong
Store the quota inodes in the /quota metadata directory if metadir is enabled. This enables us to stop using the sb_[ugp]uotino fields in the superblock. From this point on, all metadata files will be children of the metadata directory tree root. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: make xfs_rtblock_t a segmented address like xfs_fsblock_tDarrick J. Wong
Now that we've finished adding allocation groups to the realtime volume, let's make the file block mapping address (xfs_rtblock_t) a segmented value just like we do on the data device. This means that group number and block number conversions can be done with shifting and masking instead of integer division. While in theory we could continue caching the rgno shift value in m_rgblklog, the fact that we now always use the shift value means that we have an opportunity to increase the redundancy of the rt geometry by storing it in the ondisk superblock and adding more sb verifier code. Extend the sueprblock to store the rgblklog value. Now that we have segmented addresses, set the correct values in m_groups[XG_TYPE_RTG] so that the xfs_group helpers work correctly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: mask off the rtbitmap and summary inodes when metadir in useDarrick J. Wong
Set the rtbitmap and summary file inumbers to NULLFSINO in the superblock and make sure they're zeroed whenever we write the superblock to disk, to mimic mkfs behavior. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: add block headers to realtime bitmap and summary blocksDarrick J. Wong
Upgrade rtbitmap and rtsummary blocks to have self describing metadata like most every other thing in XFS. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: add frextents to the lazysbcounters when rtgroups enabledDarrick J. Wong
Make the free rt extent count a part of the lazy sb counters when the realtime groups feature is enabled. This is possible because the patch to recompute frextents from the rtbitmap during log recovery predates the code adding rtgroup support, hence we know that the value will always be correct during runtime. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: export realtime group geometry via XFS_FSOP_GEOMDarrick J. Wong
Export the realtime geometry information so that userspace can query it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: update realtime super every time we update the primary fs superDarrick J. Wong
Every time we update parts of the primary filesystem superblock that are echoed in the rt superblock, we must update the rt super. Avoid changing the log to support logging to the rt device by using ordered buffers. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: define the format of rt groupsDarrick J. Wong
Define the ondisk format of realtime group metadata, and a superblock for realtime volumes. rt supers are conditionally enabled by a predicate function so that they can be disabled if we ever implement zoned storage support for the realtime volume. For rt group enabled file systems there is a separate bitmap and summary file for each group and thus the number of bitmap and summary blocks needs to be calculated differently. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: create incore realtime group structuresDarrick J. Wong
Create an incore object that will contain information about a realtime allocation group. This will eventually enable us to shard the realtime section in a similar manner to how we shard the data section, but for now just a single object for the entire RT subvolume is created. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: advertise metadata directory featureDarrick J. Wong
Advertise the existence of the metadata directory feature; this will be used by scrub to decide if it needs to scan the metadir too. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: define the on-disk format for the metadir featureDarrick J. Wong
Define the on-disk layout and feature flags for the metadata inode directory feature. Add a xfs_sb_version_hasmetadir for benefit of xfs_repair, which needs to know where the new end of the superblock lies. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: add group based bno conversion helpersChristoph Hellwig
Add/move the blocks, blklog and blkmask fields to the generic groups structure so that code can work with AGs and RTGs by just using the right index into the array. Then, add convenience helpers to convert block numbers based on the generic group. This will allow writing code that doesn't care if it is used on AGs or the upcoming realtime groups. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05xfs: switch perag iteration from the for_each macros to a while based iteratorChristoph Hellwig
The current for_each_perag* macros are a bit annoying in that they require the caller to both provide an object and an index iterator, and also somewhat obsfucate the underlying control flow mechanism. Switch to open coded while loops using new xfs_perag_next{,_from,_range} helpers that return the next pag structure to iterate on based on the previous one or NULL for the loop start. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05xfs: factor out a generic xfs_group structureChristoph Hellwig
Split the lookup and refcount handling of struct xfs_perag into an embedded xfs_group structure that can be reused for the upcoming realtime groups. It will be extended with more features later. Note that he xg_type field will only need a single bit even with realtime group support. For now it fills a hole, but it might be worth to fold it into another field if we can use this space better. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-05xfs: sb_spino_align is not verifiedDave Chinner
It's just read in from the superblock and used without doing any validity checks at all on the value. Fixes: fb4f2b4e5a82 ("xfs: add sparse inode chunk alignment superblock field") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-09-01xfs: standardize the btree maxrecs function parametersDarrick J. Wong
Standardize the parameters in xfs_{alloc,bm,ino,rmap,refcount}bt_maxrecs so that we have consistent calling conventions. This doesn't affect the kernel that much, but enables us to clean up userspace a bit. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-09-01xfs: ensure rtx mask/shift are correct after growfsChristoph Hellwig
When growfs sets an extent size, it doesn't updated the m_rtxblklog and m_rtxblkmask values, which could lead to incorrect usage of them if they were set before and can't be used for the new extent size. Add a xfs_mount_sb_set_rextsize helper that updates the two fields, and also use it when calculating the new RT geometry instead of disabling the optimization there. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-09-01xfs: factor out a xfs_validate_rt_geometry helperChristoph Hellwig
Split the RT geometry validation in the early mount code into a helper than can be reused by repair (from which this code was apparently originally stolen anyway). Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: u64 return value for calc_rbmblocks] Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-09-01xfs: remove xfs_validate_rtextentsChristoph Hellwig
Replace xfs_validate_rtextents with an open coded check for 0 rtextents. The name for the function implies it does a lot more than a zero check, which is more obvious when open coded. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-06-10xfs: make sure sb_fdblocks is non-negativeWengang Wang
A user with a completely full filesystem experienced an unexpected shutdown when the filesystem tried to write the superblock during runtime. kernel shows the following dmesg: [ 8.176281] XFS (dm-4): Metadata corruption detected at xfs_sb_write_verify+0x60/0x120 [xfs], xfs_sb block 0x0 [ 8.177417] XFS (dm-4): Unmount and run xfs_repair [ 8.178016] XFS (dm-4): First 128 bytes of corrupted metadata buffer: [ 8.178703] 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 01 90 00 00 XFSB............ [ 8.179487] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [ 8.180312] 00000020: cf 12 dc 89 ca 26 45 29 92 e6 e3 8d 3b b8 a2 c3 .....&E)....;... [ 8.181150] 00000030: 00 00 00 00 01 00 00 06 00 00 00 00 00 00 00 80 ................ [ 8.182003] 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82 ................ [ 8.182004] 00000050: 00 00 00 01 00 64 00 00 00 00 00 04 00 00 00 00 .....d.......... [ 8.182004] 00000060: 00 00 64 00 b4 a5 02 00 02 00 00 08 00 00 00 00 ..d............. [ 8.182005] 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 17 00 00 19 ................ [ 8.182008] XFS (dm-4): Corruption of in-memory data detected. Shutting down filesystem [ 8.182010] XFS (dm-4): Please unmount the filesystem and rectify the problem(s) When xfs_log_sb writes super block to disk, b_fdblocks is fetched from m_fdblocks without any lock. As m_fdblocks can experience a positive -> negative -> positive changing when the FS reaches fullness (see xfs_mod_fdblocks). So there is a chance that sb_fdblocks is negative, and because sb_fdblocks is type of unsigned long long, it reads super big. And sb_fdblocks being bigger than sb_dblocks is a problem during log recovery, xfs_validate_sb_write() complains. Fix: As sb_fdblocks will be re-calculated during mount when lazysbcount is enabled, We just need to make xfs_validate_sb_write() happy -- make sure sb_fdblocks is not nenative. This patch also takes care of other percpu counters in xfs_log_sb. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-23xfs: add a incompat feature bit for parent pointersAllison Henderson
Create an incompat feature bit and a fs geometry flag so that we can enable the feature in the ondisk superblock and advertise its existence to userspace. Signed-off-by: Mark Tinguely <mark.tinguely@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-04-15xfs: create a incompat flag for atomic file mapping exchangesDarrick J. Wong
Create a incompat flag so that we only attempt to process file mapping exchange log items if the filesystem supports it, and a geometry flag to advertise support if it's present. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-03-25xfs: allow sunit mount option to repair bad primary sb stripe valuesDave Chinner
If a filesystem has a busted stripe alignment configuration on disk (e.g. because broken RAID firmware told mkfs that swidth was smaller than sunit), then the filesystem will refuse to mount due to the stripe validation failing. This failure is triggering during distro upgrades from old kernels lacking this check to newer kernels with this check, and currently the only way to fix it is with offline xfs_db surgery. This runtime validity checking occurs when we read the superblock for the first time and causes the mount to fail immediately. This prevents the rewrite of stripe unit/width via mount options that occurs later in the mount process. Hence there is no way to recover this situation without resorting to offline xfs_db rewrite of the values. However, we parse the mount options long before we read the superblock, and we know if the mount has been asked to re-write the stripe alignment configuration when we are reading the superblock and verifying it for the first time. Hence we can conditionally ignore stripe verification failures if the mount options specified will correct the issue. We validate that the new stripe unit/width are valid before we overwrite the superblock values, so we can ignore the invalid config at verification and fail the mount later if the new values are not valid. This, at least, gives users the chance of correcting the issue after a kernel upgrade without having to resort to xfs-db hacks. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-22xfs: report ag header corruption errors to the health tracking systemDarrick J. Wong
Whenever we encounter a corrupt AG header, we should report that to the health monitoring system for later reporting. Buffer readers that don't respond to corruption events with a _mark_sick call can be detected with the following script: #!/bin/bash # Detect missing calls to xfs_*_mark_sick filter=cat tty -s && filter=less git grep -A10 -E '( = xfs_trans_read_buf| = xfs_buf_read\()' fs/xfs/*.[ch] fs/xfs/libxfs/*.[ch] | awk ' BEGIN { ignore = 0; lineno = 0; delete lines; } { if ($0 == "--") { if (!ignore) { for (i = 0; i < lineno; i++) { print(lines[i]); } printf("--\n"); } delete lines; lineno = 0; ignore = 0; } else if ($0 ~ /mark_sick/) { ignore = 1; } else { lines[lineno++] = $0; } } ' | $filter Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-01-30xfs: remove conditional building of rt geometry validator functionsDarrick J. Wong
I mistakenly turned off CONFIG_XFS_RT in the Kconfig file for arm64 variant of the djwong-wtf git branch. Unfortunately, it took me a good hour to figure out that RT wasn't built because this is what got printed to dmesg: XFS (sda2): realtime geometry sanity check failed XFS (sda2): Metadata corruption detected at xfs_sb_read_verify+0x170/0x190 [xfs], xfs_sb block 0x0 Whereas I would have expected: XFS (sda2): Not built with CONFIG_XFS_RT XFS (sda2): RT mount failed The root cause of these problems is the conditional compilation of the new functions xfs_validate_rtextents and xfs_compute_rextslog that I introduced in the two commits listed below. The !RT versions of these functions return false and 0, respectively, which causes primary superblock validation to fail, which explains the first message. Move the two functions to other parts of libxfs that are not conditionally defined by CONFIG_XFS_RT and remove the broken stubs so that validation works again. Fixes: e14293803f4e ("xfs: don't allow overly small or large realtime volumes") Fixes: a6a38f309afc ("xfs: make rextslog computation consistent with mkfs") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2023-12-06xfs: don't allow overly small or large realtime volumesDarrick J. Wong
Don't allow realtime volumes that are less than one rt extent long. This has been broken across 4 LTS kernels with nobody noticing, so let's just disable it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-12-06xfs: make rextslog computation consistent with mkfsDarrick J. Wong
There's a weird discrepancy in xfsprogs dating back to the creation of the Linux port -- if there are zero rt extents, mkfs will set sb_rextents and sb_rextslog both to zero: sbp->sb_rextslog = (uint8_t)(rtextents ? libxfs_highbit32((unsigned int)rtextents) : 0); However, that's not the check that xfs_repair uses for nonzero rtblocks: if (sb->sb_rextslog != libxfs_highbit32((unsigned int)sb->sb_rextents)) The difference here is that xfs_highbit32 returns -1 if its argument is zero. Unfortunately, this means that in the weird corner case of a realtime volume shorter than 1 rt extent, xfs_repair will immediately flag a freshly formatted filesystem as corrupt. Because mkfs has been writing ondisk artifacts like this for decades, we have to accept that as "correct". TBH, zero rextslog for zero rtextents makes more sense to me anyway. Regrettably, the superblock verifier checks created in commit copied xfs_repair even though mkfs has been writing out such filesystems for ages. Fix the superblock verifier to accept what mkfs spits out; the userspace version of this patch will have to fix xfs_repair as well. Note that the new helper leaves the zeroday bug where the upper 32 bits of sb_rextents is ripped off and fed to highbit32. This leads to a seriously undersized rt summary file, which immediately breaks mkfs: $ hugedisk.sh foo /dev/sdc $(( 0x100000080 * 4096))B $ /sbin/mkfs.xfs -f /dev/sda -m rmapbt=0,reflink=0 -r rtdev=/dev/mapper/foo meta-data=/dev/sda isize=512 agcount=4, agsize=1298176 blks = sectsz=512 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=1, rmapbt=0 = reflink=0 bigtime=1 inobtcount=1 nrext64=1 data = bsize=4096 blocks=5192704, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0, ftype=1 log =internal log bsize=4096 blocks=16384, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =/dev/mapper/foo extsz=4096 blocks=4294967424, rtextents=4294967424 Discarding blocks...Done. mkfs.xfs: Error initializing the realtime space [117 - Structure needs cleaning] The next patch will drop support for rt volumes with fewer than 1 or more than 2^32-1 rt extents, since they've clearly been broken forever. Fixes: f8e566c0f5e1f ("xfs: validate the realtime geometry in xfs_validate_sb_common") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-17xfs: use shifting and masking when converting rt extents, if possibleDarrick J. Wong
Avoid the costs of integer division (32-bit and 64-bit) if the realtime extent size is a power of two. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-09-12xfs: fix log recovery when unknown rocompat bits are setDarrick J. Wong
Log recovery has always run on read only mounts, even where the primary superblock advertises unknown rocompat bits. Due to a misunderstanding between Eric and Darrick back in 2018, we accidentally changed the superblock write verifier to shutdown the fs over that exact scenario. As a result, the log cleaning that occurs at the end of the mounting process fails if there are unknown rocompat bits set. As we now allow writing of the superblock if there are unknown rocompat bits set on a RO mount, we no longer want to turn off RO state to allow log recovery to succeed on a RO mount. Hence we also remove all the (now unnecessary) RO state toggling from the log recovery path. Fixes: 9e037cb7972f ("xfs: check for unknown v5 feature bits in superblock write verifier" Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-06-29xfs: journal geometry is not properly bounds checkedDave Chinner
If the journal geometry results in a sector or log stripe unit validation problem, it indicates that we cannot set the log up to safely write to the the journal. In these cases, we must abort the mount because the corruption needs external intervention to resolve. Similarly, a journal that is too large cannot be written to safely, either, so we shouldn't allow those geometries to mount, either. If the log is too small, we risk having transaction reservations overruning the available log space and the system hanging waiting for space it can never provide. This is purely a runtime hang issue, not a corruption issue as per the first cases listed above. We abort mounts of the log is too small for V5 filesystems, but we must allow v4 filesystems to mount because, historically, there was no log size validity checking and so some systems may still be out there with undersized logs. The problem is that on V4 filesystems, when we discover a log geometry problem, we skip all the remaining checks and then allow the log to continue mounting. This mean that if one of the log size checks fails, we skip the log stripe unit check. i.e. we allow the mount because a "non-fatal" geometry is violated, and then fail to check the hard fail geometries that should fail the mount. Move all these fatal checks to the superblock verifier, and add a new check for the two log sector size geometry variables having the same values. This will prevent any attempt to mount a log that has invalid or inconsistent geometries long before we attempt to mount the log. However, for the minimum log size checks, we can only do that once we've setup up the log and calculated all the iclog sizes and roundoffs. Hence this needs to remain in the log mount code after the log has been initialised. It is also the only case where we should allow a v4 filesystem to continue running, so leave that handling in place, too. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-04-12xfs: don't consider future format versions validDave Chinner
In commit fe08cc504448 we reworked the valid superblock version checks. If it is a V5 filesystem, it is always valid, then we checked if the version was less than V4 (reject) and then checked feature fields in the V4 flags to determine if it was valid. What we missed was that if the version is not V4 at this point, we shoudl reject the fs. i.e. the check current treats V6+ filesystems as if it was a v4 filesystem. Fix this. cc: stable@vger.kernel.org Fixes: fe08cc504448 ("xfs: open code sb verifier feature checks") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-02-13xfs: convert xfs_ialloc_next_ag() to an atomicDave Chinner
This is currently a spinlock lock protected rotor which can be implemented with a single atomic operation. Change it to be more efficient and get rid of the m_agirotor_lock. Noticed while converting the inode allocation AG selection loop to active perag references. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-11-16xfs: fix sb write verify for lazysbcountLong Li
When lazysbcount is enabled, fsstress and loop mount/unmount test report the following problems: XFS (loop0): SB summary counter sanity check failed XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x13b/0x460, xfs_sb block 0x0 XFS (loop0): Unmount and run xfs_repair XFS (loop0): First 128 bytes of corrupted metadata buffer: 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00 XFSB.........(.. 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a i.|._.D..t....4Z 00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80 ..... .......... 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82 ................ 00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00 ................ 00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00 ................ 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19 ................ XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply +0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580). Shutting down filesystem. XFS (loop0): Please unmount the filesystem and rectify the problem(s) XFS (loop0): log mount/recovery failed: error -117 XFS (loop0): log mount failed This corruption will shutdown the file system and the file system will no longer be mountable. The following script can reproduce the problem, but it may take a long time. #!/bin/bash device=/dev/sda testdir=/mnt/test round=0 function fail() { echo "$*" exit 1 } mkdir -p $testdir while [ $round -lt 10000 ] do echo "******* round $round ********" mkfs.xfs -f $device mount $device $testdir || fail "mount failed!" fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null & sleep 4 killall -w fsstress umount $testdir xfs_repair -e $device > /dev/null if [ $? -eq 2 ];then echo "ERR CODE 2: Dirty log exception during repair." exit 1 fi round=$(($round+1)) done With lazysbcount is enabled, There is no additional lock protection for reading m_ifree and m_icount in xfs_log_sb(), if other cpu modifies the m_ifree, this will make the m_ifree greater than m_icount. For example, consider the following sequence and ifreedelta is postive: CPU0 CPU1 xfs_log_sb xfs_trans_unreserve_and_mod_sb ---------- ------------------------------ percpu_counter_sum(&mp->m_icount) percpu_counter_add_batch(&mp->m_icount, idelta, XFS_ICOUNT_BATCH) percpu_counter_add(&mp->m_ifree, ifreedelta); percpu_counter_sum(&mp->m_ifree) After this, incorrect inode count (sb_ifree > sb_icount) will be writen to the log. In the subsequent writing of sb, incorrect inode count (sb_ifree > sb_icount) will fail to pass the boundary check in xfs_validate_sb_write() that cause the file system shutdown. When lazysbcount is enabled, we don't need to guarantee that Lazy sb counters are completely correct, but we do need to guarantee that sb_ifree <= sb_icount. On the other hand, the constraint that m_ifree <= m_icount must be satisfied any time that there /cannot/ be other threads allocating or freeing inode chunks. If the constraint is violated under these circumstances, sb_i{count,free} (the ondisk superblock inode counters) maybe incorrect and need to be marked sick at unmount, the count will be rebuilt on the next mount. Fixes: 8756a5af1819 ("libxfs: add more bounds checking to sb sanity checks") Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-05-04xfs: validate v5 feature fieldsDave Chinner
We don't check that the v4 feature flags taht v5 requires to be set are actually set anywhere. Do this check when we see that the filesystem is a v5 filesystem. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>