summaryrefslogtreecommitdiff
path: root/fs/xfs
AgeCommit message (Collapse)Author
2024-02-21xfs: use shmem_get_folio in in xfile_loadChristoph Hellwig
Switch to using shmem_get_folio in xfile_load instead of using shmem_read_mapping_page_gfp. This gets us support for large folios and also optimized reading from unallocated space, as shmem_get_folio with SGP_READ won't allocate a page for them just to zero the content. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-21xfs: use shmem_get_folio in xfile_obj_storeChristoph Hellwig
Switch to using shmem_get_folio and manually dirtying the page instead of abusing aops->write_begin and aops->write_end in xfile_get_page. This simplifies the code by not doing indirect calls of not actually exported interfaces that don't really fit the use case very well, and happens to get us large folio support for free. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-21xfs: don't allow highmem pages in xfile mappingsChristoph Hellwig
XFS is generally used on 64-bit, non-highmem platforms and xfile mappings are accessed all the time. Reduce our pain by not allowing any highmem mappings in the xfile page cache and remove all the kmap calls for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-21xfs: don't try to handle non-update pages in xfile_obj_loadChristoph Hellwig
shmem_read_mapping_page_gfp always returns an uptodate page or an ERR_PTR. Remove the code that tries to handle a non-uptodate page. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-21xfs: remove the xfile_pread/pwrite APIsChristoph Hellwig
All current and pending xfile users use the xfile_obj_load and xfile_obj_store API, so make those the actual implementation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-21xfs: remove xfile_statChristoph Hellwig
vfs_getattr is needed to query inode attributes for unknown underlying file systems. But shmemfs is well known for users of shmem_file_setup and shmem_read_mapping_page_gfp that rely on it not needing specific inode revalidation and having a normal mapping. Remove the detour through the getattr method and an extra wrapper, and just read the inode size and i_bytes directly in the scrub tracing code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-21xfs: don't modify file and inode flags for shmem filesChristoph Hellwig
shmem_file_setup is explicitly intended for a file that can be fully read and written by kernel users without restrictions. Don't poke into internals to change random flags in the file or inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-21xfs: use shmem_kernel_file_setup in xfile_createChristoph Hellwig
shmem_kernel_file_setup is equivalent to shmem_file_setup except that it already sets the S_PRIVATE flag. Use it instead of open coding the logic. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-21xfs: shmem_file_setup can't return NULLChristoph Hellwig
shmem_file_setup always returns a struct file pointer or an ERR_PTR, so remove the code to check for a NULL return. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-21xfs: use VM_NORESERVE in xfile_createChristoph Hellwig
xfile_create creates a (potentially large) sparse file. Pass VM_NORESERVE to shmem_file_setup to not account for the entire file size at file creation time. Reported-by: Hugh Dickins <hughd@google.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-20xfs: use kvfree in xfs_ioc_getfsmap()Dave Chinner
Another incorrect conversion to kfree() instead of kvfree(). Fixes: 49292576136f ("xfs: convert kmem_free() for kvmalloc users to kvfree()") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-20xfs: use kvfree() in xfs_ioc_attr_list()Dave Chinner
Wrongly converted from kmem_free() to kfree(). Reported-by: Matthew Wilcox <willy@infradead.org> Fixes: 49292576136f ("xfs: convert kmem_free() for kvmalloc users to kvfree()") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-19xfs: Remove mrlock wrapperMatthew Wilcox (Oracle)
mrlock was an rwsem wrapper that also recorded whether the lock was held for read or write. Now that we can ask the generic code whether the lock is held for read or write, we can remove this wrapper and use an rwsem directly. As the comment says, we can't use lockdep to assert that the ILOCK is held for write, because we might be in a workqueue, and we aren't able to tell lockdep that we do in fact own the lock. Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-19xfs: Replace xfs_isilocked with xfs_assert_ilockedMatthew Wilcox (Oracle)
To use the new rwsem_assert_held()/rwsem_assert_held_write(), we can't use the existing ASSERT macro. Add a new xfs_assert_ilocked() and convert all the callers. Fix an apparent bug in xfs_isilocked(): If the caller specifies XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL, xfs_assert_ilocked() will check both the IOLOCK and the ILOCK are held for write. xfs_isilocked() only checked that the ILOCK was held for write. xfs_assert_ilocked() is always on, even if DEBUG or XFS_WARN aren't defined. It's a cheap check, so I don't think it's worth defining it away. Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-19xfs: use kvfree for buf in xfs_ioc_getbmapChristoph Hellwig
Without this the kernel crashes in kfree for files with a sufficiently large number of extents. Fixes: d4c75a1b40cd ("xfs: convert remaining kmem_free() to kfree()") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-17xfs: ensure submit buffers on LSN boundaries in error handlersLong Li
While performing the IO fault injection test, I caught the following data corruption report: XFS (dm-0): Internal error ltbno + ltlen > bno at line 1957 of file fs/xfs/libxfs/xfs_alloc.c. Caller xfs_free_ag_extent+0x79c/0x1130 CPU: 3 PID: 33 Comm: kworker/3:0 Not tainted 6.5.0-rc7-next-20230825-00001-g7f8666926889 #214 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 Workqueue: xfs-inodegc/dm-0 xfs_inodegc_worker Call Trace: <TASK> dump_stack_lvl+0x50/0x70 xfs_corruption_error+0x134/0x150 xfs_free_ag_extent+0x7d3/0x1130 __xfs_free_extent+0x201/0x3c0 xfs_trans_free_extent+0x29b/0xa10 xfs_extent_free_finish_item+0x2a/0xb0 xfs_defer_finish_noroll+0x8d1/0x1b40 xfs_defer_finish+0x21/0x200 xfs_itruncate_extents_flags+0x1cb/0x650 xfs_free_eofblocks+0x18f/0x250 xfs_inactive+0x485/0x570 xfs_inodegc_worker+0x207/0x530 process_scheduled_works+0x24a/0xe10 worker_thread+0x5ac/0xc60 kthread+0x2cd/0x3c0 ret_from_fork+0x4a/0x80 ret_from_fork_asm+0x11/0x20 </TASK> XFS (dm-0): Corruption detected. Unmount and run xfs_repair After analyzing the disk image, it was found that the corruption was triggered by the fact that extent was recorded in both inode datafork and AGF btree blocks. After a long time of reproduction and analysis, we found that the reason of free sapce btree corruption was that the AGF btree was not recovered correctly. Consider the following situation, Checkpoint A and Checkpoint B are in the same record and share the same start LSN1, buf items of same object (AGF btree block) is included in both Checkpoint A and Checkpoint B. If the buf item in Checkpoint A has been recovered and updates metadata LSN permanently, then the buf item in Checkpoint B cannot be recovered, because log recovery skips items with a metadata LSN >= the current LSN of the recovery item. If there is still an inode item in Checkpoint B that records the Extent X, the Extent X will be recorded in both inode datafork and AGF btree block after Checkpoint B is recovered. Such transaction can be seen when allocing enxtent for inode bmap, it record both the addition of extent to the inode extent list and the removing extent from the AGF. |------------Record (LSN1)------------------|---Record (LSN2)---| |-------Checkpoint A----------|----------Checkpoint B-----------| | Buf Item(Extent X) | Buf Item / Inode item(Extent X) | | Extent X is freed | Extent X is allocated | After commit 12818d24db8a ("xfs: rework log recovery to submit buffers on LSN boundaries") was introduced, we submit buffers on lsn boundaries during log recovery. The above problem can be avoided under normal paths, but it's not guaranteed under abnormal paths. Consider the following process, if an error was encountered after recover buf item in Checkpoint A and before recover buf item in Checkpoint B, buffers that have been added to the buffer_list will still be submitted, this violates the submits rule on lsn boundaries. So buf item in Checkpoint B cannot be recovered on the next mount due to current lsn of transaction equal to metadata lsn on disk. The detailed process of the problem is as follows. First Mount: xlog_do_recovery_pass error = xlog_recover_process xlog_recover_process_data xlog_recover_process_ophdr xlog_recovery_process_trans ... /* recover buf item in Checkpoint A */ xlog_recover_buf_commit_pass2 xlog_recover_do_reg_buffer /* add buffer of agf btree block to buffer_list */ xfs_buf_delwri_queue(bp, buffer_list) ... ==> Encounter read IO error and return /* submit buffers regardless of error */ if (!list_empty(&buffer_list)) xfs_buf_delwri_submit(&buffer_list); <buf items of agf btree block in Checkpoint A recovery success> Second Mount: xlog_do_recovery_pass error = xlog_recover_process xlog_recover_process_data xlog_recover_process_ophdr xlog_recovery_process_trans ... /* recover buf item in Checkpoint B */ xlog_recover_buf_commit_pass2 /* buffer of agf btree block wouldn't added to buffer_list due to lsn equal to current_lsn */ if (XFS_LSN_CMP(lsn, current_lsn) >= 0) goto out_release <buf items of agf btree block in Checkpoint B wouldn't recovery> In order to make sure that submits buffers on lsn boundaries in the abnormal paths, we need to check error status before submit buffers that have been added from the last record processed. If error status exist, buffers in the bufffer_list should not be writen to disk. Canceling the buffers in the buffer_list directly isn't correct, unlike any other place where write list was canceled, these buffers has been initialized by xfs_buf_item_init() during recovery and held by buf item, buf items will not be released in xfs_buf_delwri_cancel(), it's not easy to solve. If the filesystem has been shut down, then delwri list submission will error out all buffers on the list via IO submission/completion and do all the correct cleanup automatically. So shutting down the filesystem could prevents buffers in the bufffer_list from being written to disk. Fixes: 50d5c8d8e938 ("xfs: check LSN ordering for v5 superblocks during recovery") Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-17xfs: remove duplicate ifdefsShrikanth Hegde
when a ifdef is used in the below manner, second one could be considered as duplicate. ifdef DEFINE_A ...code block... ifdef DEFINE_A ...code block... endif ...code block... endif In the xfs code two such patterns were seen. Hence removing these ifdefs. No functional change is intended here. It only aims to improve code readability. Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-17xfs: disable sparse inode chunk alignment check when there is no alignmentDarrick J. Wong
While testing a 64k-blocksize filesystem, I noticed that xfs/709 fails to rebuild the inode btree with a bunch of "Corruption remains" messages. It turns out that when the inode chunk size is smaller than a single filesystem block, no block alignments constraints are necessary for inode chunk allocations, and sb_spino_align is zero. Hence we can skip the check. Fixes: dbfbf3bdf639 ("xfs: repair inode btrees") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13xfs: use xfs_defer_alloc a bit moreDave Chinner
Noticed by inspection, simple factoring allows the same allocation routine to be used for both transaction and recovery contexts. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13xfs: clean up remaining GFP_NOFS usersDave Chinner
These few remaining GFP_NOFS callers do not need to use GFP_NOFS at all. They are only called from a non-transactional context or cannot be accessed from memory reclaim due to other constraints. Hence they can just use GFP_KERNEL. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13xfs: place the CIL under nofs allocation contextDave Chinner
This is core code that needs to run in low memory conditions and can be triggered from memory reclaim. While it runs in a workqueue, it really shouldn't be recursing back into the filesystem during any memory allocation it needs to function. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13xfs: place intent recovery under NOFS allocation contextDave Chinner
When recovery starts processing intents, all of the initial intent allocations are done outside of transaction contexts. That means they need to specifically use GFP_NOFS as we do not want memory reclaim to attempt to run direct reclaim of filesystem objects while we have lots of objects added into deferred operations. Rather than use GFP_NOFS for these specific allocations, just place the entire intent recovery process under NOFS context and we can then just use GFP_KERNEL for these allocations. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13xfs: use GFP_KERNEL in pure transaction contextsDave Chinner
When running in a transaction context, memory allocations are scoped to GFP_NOFS. Hence we don't need to use GFP_NOFS contexts in pure transaction context allocations - GFP_KERNEL will automatically get converted to GFP_NOFS as appropriate. Go through the code and convert all the obvious GFP_NOFS allocations in transaction context to use GFP_KERNEL. This further reduces the explicit use of GFP_NOFS in XFS. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13xfs: use __GFP_NOLOCKDEP instead of GFP_NOFSDave Chinner
In the past we've had problems with lockdep false positives stemming from inode locking occurring in memory reclaim contexts (e.g. from superblock shrinkers). Lockdep doesn't know that inodes access from above memory reclaim cannot be accessed from below memory reclaim (and vice versa) but there has never been a good solution to solving this problem with lockdep annotations. This situation isn't unique to inode locks - buffers are also locked above and below memory reclaim, and we have to maintain lock ordering for them - and against inodes - appropriately. IOWs, the same code paths and locks are taken both above and below memory reclaim and so we always need to make sure the lock orders are consistent. We are spared the lockdep problems this might cause by the fact that semaphores and bit locks aren't covered by lockdep. In general, this sort of lockdep false positive detection is cause by code that runs GFP_KERNEL memory allocation with an actively referenced inode locked. When it is run from a transaction, memory allocation is automatically GFP_NOFS, so we don't have reclaim recursion issues. So in the places where we do memory allocation with inodes locked outside of a transaction, we have explicitly set them to use GFP_NOFS allocations to prevent lockdep false positives from being reported if the allocation dips into direct memory reclaim. More recently, __GFP_NOLOCKDEP was added to the memory allocation flags to tell lockdep not to track that particular allocation for the purposes of reclaim recursion detection. This is a much better way of preventing false positives - it allows us to use GFP_KERNEL context outside of transactions, and allows direct memory reclaim to proceed normally without throwing out false positive deadlock warnings. The obvious places that lock inodes and do memory allocation are the lookup paths and inode extent list initialisation. These occur in non-transactional GFP_KERNEL contexts, and so can run direct reclaim and lock inodes. This patch makes a first path through all the explicit GFP_NOFS allocations in XFS and converts the obvious ones to GFP_KERNEL | __GFP_NOLOCKDEP as a first step towards removing explicit GFP_NOFS allocations from the XFS code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13xfs: use an empty transaction for fstrimDave Chinner
We currently use a btree walk in the fstrim code. This requires a btree cursor and btree cursors are only used inside transactions except for the fstrim code. This means that all the btree operations that allocate memory operate in both GFP_KERNEL and GFP_NOFS contexts. This causes problems with lockdep being unable to determine the difference between objects that are safe to lock both above and below memory reclaim. Free space btree buffers are definitely locked both above and below reclaim and that means we have to mark all btree infrastructure allocations with GFP_NOFS to avoid potential lockdep false positives. If we wrap this btree walk in an empty cursor, all btree walks are now done under transaction context and so all allocations inherit GFP_NOFS context from the tranaction. This enables us to move all the btree allocations to GFP_KERNEL context and hence help remove the explicit use of GFP_NOFS in XFS. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13xfs: convert remaining kmem_free() to kfree()Dave Chinner
The remaining callers of kmem_free() are freeing heap memory, so we can convert them directly to kfree() and get rid of kmem_free() altogether. This conversion was done with: $ for f in `git grep -l kmem_free fs/xfs`; do > sed -i s/kmem_free/kfree/ $f > done $ Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13xfs: convert kmem_free() for kvmalloc users to kvfree()Dave Chinner
Start getting rid of kmem_free() by converting all the cases where memory can come from vmalloc interfaces to calling kvfree() directly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13xfs: move kmem_to_page()Dave Chinner
Move it to the general xfs linux wrapper header file so we can prepare to remove kmem.h Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13xfs: convert kmem_alloc() to kmalloc()Dave Chinner
kmem_alloc() is just a thin wrapper around kmalloc() these days. Convert everything to use kmalloc() so we can get rid of the wrapper. Note: the transaction region allocation in xlog_add_to_transaction() can be a high order allocation. Converting it to use kmalloc(__GFP_NOFAIL) results in warnings in the page allocation code being triggered because the mm subsystem does not want us to use __GFP_NOFAIL with high order allocations like we've been doing with the kmem_alloc() wrapper for a couple of decades. Hence this specific case gets converted to xlog_kvmalloc() rather than kmalloc() to avoid this issue. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-13xfs: convert kmem_zalloc() to kzalloc()Dave Chinner
There's no reason to keep the kmem_zalloc() around anymore, it's just a thin wrapper around kmalloc(), so lets get rid of it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-12xfs: add support for FS_IOC_GETFSSYSFSPATHKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Link: https://lore.kernel.org/r/20240207025624.1019754-7-kent.overstreet@linux.dev Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-08fs: super_set_uuid()Kent Overstreet
Some weird old filesytems have UUID-like things that we wish to expose as UUIDs, but are smaller; add a length field so that the new FS_IOC_(GET|SET)UUID ioctls can handle them in generic code. And add a helper super_set_uuid(), for setting nonstandard length uuids. Helper is now required for the new FS_IOC_GETUUID ioctl; if super_set_uuid() hasn't been called, the ioctl won't be supported. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Link: https://lore.kernel.org/r/20240207025624.1019754-2-kent.overstreet@linux.dev Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-01iomap: pass the length of the dirty region to ->map_blocksChristoph Hellwig
Let the file system know how much dirty data exists at the passed in offset. This allows file systems to allocate the right amount of space that actually is written back if they can't eagerly convert (e.g. because they don't support unwritten extents). Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231207072710.176093-15-hch@lst.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-01iomap: don't chain biosChristoph Hellwig
Back in the days when a single bio could only be filled to the hardware limits, and we scheduled a work item for each bio completion, chaining multiple bios for a single ioend made a lot of sense to reduce the number of completions. But these days bios can be filled until we reach the number of vectors or total size limit, which means we can always fit at least 1 megabyte worth of data in the worst case, but usually a lot more due to large folios. The only thing bio chaining is buying us now is to reduce the size of the allocation from an ioend with an embedded bio into a plain bio, which is a 52 bytes differences on 64-bit systems. This is not worth the added complexity, so remove the bio chaining and only use the bio embedded into the ioend. This will help to simplify further changes to the iomap writeback code. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231207072710.176093-10-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-01-30xfs: remove conditional building of rt geometry validator functionsDarrick J. Wong
I mistakenly turned off CONFIG_XFS_RT in the Kconfig file for arm64 variant of the djwong-wtf git branch. Unfortunately, it took me a good hour to figure out that RT wasn't built because this is what got printed to dmesg: XFS (sda2): realtime geometry sanity check failed XFS (sda2): Metadata corruption detected at xfs_sb_read_verify+0x170/0x190 [xfs], xfs_sb block 0x0 Whereas I would have expected: XFS (sda2): Not built with CONFIG_XFS_RT XFS (sda2): RT mount failed The root cause of these problems is the conditional compilation of the new functions xfs_validate_rtextents and xfs_compute_rextslog that I introduced in the two commits listed below. The !RT versions of these functions return false and 0, respectively, which causes primary superblock validation to fail, which explains the first message. Move the two functions to other parts of libxfs that are not conditionally defined by CONFIG_XFS_RT and remove the broken stubs so that validation works again. Fixes: e14293803f4e ("xfs: don't allow overly small or large realtime volumes") Fixes: a6a38f309afc ("xfs: make rextslog computation consistent with mkfs") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-01-29xfs: reset XFS_ATTR_INCOMPLETE filter on node removalAndrey Albershteyn
In XFS_DAS_NODE_REMOVE_ATTR case, xfs_attr_mode_remove_attr() sets filter to XFS_ATTR_INCOMPLETE. The filter is then reset in xfs_attr_complete_op() if XFS_DA_OP_REPLACE operation is performed. The filter is not reset though if XFS just removes the attribute (args->value == NULL) with xfs_attr_defer_remove(). attr code goes to XFS_DAS_DONE state. Fix this by always resetting XFS_ATTR_INCOMPLETE filter. The replace operation already resets this filter in anyway and others are completed at this step hence don't need it. Fixes: fdaf1bb3cafc ("xfs: ATTR_REPLACE algorithm with LARP enabled needs rework") Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-01-22xfs: read only mounts with fsopen mount API are bustedDave Chinner
Recently xfs/513 started failing on my test machines testing "-o ro,norecovery" mount options. This was being emitted in dmesg: [ 9906.932724] XFS (pmem0): no-recovery mounts must be read-only. Turns out, readonly mounts with the fsopen()/fsconfig() mount API have been busted since day zero. It's only taken 5 years for debian unstable to start using this "new" mount API, and shortly after this I noticed xfs/513 had started to fail as per above. The syscall trace is: fsopen("xfs", FSOPEN_CLOEXEC) = 3 mount_setattr(-1, NULL, 0, NULL, 0) = -1 EINVAL (Invalid argument) ..... fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/pmem0", 0) = 0 fsconfig(3, FSCONFIG_SET_FLAG, "ro", NULL, 0) = 0 fsconfig(3, FSCONFIG_SET_FLAG, "norecovery", NULL, 0) = 0 fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = -1 EINVAL (Invalid argument) close(3) = 0 Showing that the actual mount instantiation (FSCONFIG_CMD_CREATE) is what threw out the error. During mount instantiation, we call xfs_fs_validate_params() which does: /* No recovery flag requires a read-only mount */ if (xfs_has_norecovery(mp) && !xfs_is_readonly(mp)) { xfs_warn(mp, "no-recovery mounts must be read-only."); return -EINVAL; } and xfs_is_readonly() checks internal mount flags for read only state. This state is set in xfs_init_fs_context() from the context superblock flag state: /* * Copy binary VFS mount flags we are interested in. */ if (fc->sb_flags & SB_RDONLY) set_bit(XFS_OPSTATE_READONLY, &mp->m_opstate); With the old mount API, all of the VFS specific superblock flags had already been parsed and set before xfs_init_fs_context() is called, so this all works fine. However, in the brave new fsopen/fsconfig world, xfs_init_fs_context() is called from fsopen() context, before any VFS superblock have been set or parsed. Hence if we use fsopen(), the internal XFS readonly state is *never set*. Hence anything that depends on xfs_is_readonly() actually returning true for read only mounts is broken if fsopen() has been used to mount the filesystem. Fix this by moving this internal state initialisation to xfs_fs_fill_super() before we attempt to validate the parameters that have been set prior to the FSCONFIG_CMD_CREATE call being made. Signed-off-by: Dave Chinner <dchinner@redhat.com> Fixes: 73e5fff98b64 ("xfs: switch to use the new mount-api") cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-01-19Merge tag 'xfs-6.8-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fix from Chandan Babu: - Fix per-inode space accounting bug * tag 'xfs-6.8-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix backwards logic in xfs_bmap_alloc_account
2024-01-11xfs: fix backwards logic in xfs_bmap_alloc_accountDarrick J. Wong
We're only allocating from the realtime device if the inode is marked for realtime and we're /not/ allocating into the attr fork. Fixes: 58643460546d ("xfs: also use xfs_bmap_btalloc_accounting for RT allocations") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-01-10Merge tag 'sysctl-6.8-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull sysctl updates from Luis Chamberlain: "To help make the move of sysctls out of kernel/sysctl.c not incur a size penalty sysctl has been changed to allow us to not require the sentinel, the final empty element on the sysctl array. Joel Granados has been doing all this work. In the v6.6 kernel we got the major infrastructure changes required to support this. For v6.7 we had all arch/ and drivers/ modified to remove the sentinel. For v6.8-rc1 we get a few more updates for fs/ directory only. The kernel/ directory is left but we'll save that for v6.9-rc1 as those patches are still being reviewed. After that we then can expect also the removal of the no longer needed check for procname == NULL. Let us recap the purpose of this work: - this helps reduce the overall build time size of the kernel and run time memory consumed by the kernel by about ~64 bytes per array - the extra 64-byte penalty is no longer inncurred now when we move sysctls out from kernel/sysctl.c to their own files Thomas Weißschuh also sent a few cleanups, for v6.9-rc1 we expect to see further work by Thomas Weißschuh with the constificatin of the struct ctl_table. Due to Joel Granados's work, and to help bring in new blood, I have suggested for him to become a maintainer and he's accepted. So for v6.9-rc1 I look forward to seeing him sent you a pull request for further sysctl changes. This also removes Iurii Zaikin as a maintainer as he has moved on to other projects and has had no time to help at all" * tag 'sysctl-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: sysctl: remove struct ctl_path sysctl: delete unused define SYSCTL_PERM_EMPTY_DIR coda: Remove the now superfluous sentinel elements from ctl_table array sysctl: Remove the now superfluous sentinel elements from ctl_table array fs: Remove the now superfluous sentinel elements from ctl_table array cachefiles: Remove the now superfluous sentinel element from ctl_table array sysclt: Clarify the results of selftest run sysctl: Add a selftest for handling empty dirs sysctl: Fix out of bounds access for empty sysctl registers MAINTAINERS: Add Joel Granados as co-maintainer for proc sysctl MAINTAINERS: remove Iurii Zaikin from proc sysctl
2024-01-10Merge tag 'xfs-6.8-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs updates from Chandan Babu: "New features/functionality: - Online repair: - Reserve disk space for online repairs - Fix misinteraction between the AIL and btree bulkloader because of which the bulk load fails to queue a buffer for writeback if it happens to be on the AIL list - Prevent transaction reservation overflows when reaping blocks during online repair - Whenever possible, bulkloader now copies multiple records into a block - Support repairing of 1. Per-AG free space, inode and refcount btrees 2. Ondisk inodes 3. File data and attribute fork mappings - Verify the contents of 1. Inode and data fork of realtime bitmap file 2. Quota files - Introduce MF_MEM_PRE_REMOVE. This will be used to notify tasks about a pmem device being removed Bug fixes: - Fix memory leak of recovered attri intent items - Fix UAF during log intent recovery - Fix realtime geometry integer overflows - Prevent scrub from live locking in xchk_iget - Prevent fs shutdown when removing files during low free disk space - Prevent transaction reservation overflow when extending an RT device - Prevent incorrect warning from being printed when extending a filesystem - Fix an off-by-one error in xreap_agextent_binval - Serialize access to perag radix tree during deletion operation - Fix perag memory leak during growfs - Allow allocation of minlen realtime extent when the maximum sized realtime free extent is minlen in size Cleanups: - Remove duplicate boilerplate code spread across functionality associated with different log items - Cleanup resblks interfaces - Pass defer ops pointer to defer helpers instead of an enum - Initialize di_crc in xfs_log_dinode to prevent KMSAN warnings - Use static_assert() instead of BUILD_BUG_ON_MSG() to validate size of structures and structure member offsets. This is done in order to be able to share the code with userspace - Move XFS documentation under a new directory specific to XFS - Do not invoke deferred ops' ->create_done callback if the deferred operation does not have an intent item associated with it - Remove duplicate inclusion of header files from scrub/health.c - Refactor Realtime code - Cleanup attr code" * tag 'xfs-6.8-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (123 commits) xfs: use the op name in trace_xlog_intent_recovery_failed xfs: fix a use after free in xfs_defer_finish_recovery xfs: turn the XFS_DA_OP_REPLACE checks in xfs_attr_shortform_addname into asserts xfs: remove xfs_attr_sf_hdr_t xfs: remove struct xfs_attr_shortform xfs: use xfs_attr_sf_findname in xfs_attr_shortform_getvalue xfs: remove xfs_attr_shortform_lookup xfs: simplify xfs_attr_sf_findname xfs: move the xfs_attr_sf_lookup tracepoint xfs: return if_data from xfs_idata_realloc xfs: make if_data a void pointer xfs: fold xfs_rtallocate_extent into xfs_bmap_rtalloc xfs: simplify and optimize the RT allocation fallback cascade xfs: reorder the minlen and prod calculations in xfs_bmap_rtalloc xfs: remove XFS_RTMIN/XFS_RTMAX xfs: remove rt-wrappers from xfs_format.h xfs: factor out a xfs_rtalloc_sumlevel helper xfs: tidy up xfs_rtallocate_extent_exact xfs: merge the calls to xfs_rtallocate_range in xfs_rtallocate_block xfs: reflow the tail end of xfs_rtallocate_extent_block ...
2024-01-09Merge tag 'mm-stable-2024-01-08-15-31' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Peng Zhang has done some mapletree maintainance work in the series 'maple_tree: add mt_free_one() and mt_attr() helpers' 'Some cleanups of maple tree' - In the series 'mm: use memmap_on_memory semantics for dax/kmem' Vishal Verma has altered the interworking between memory-hotplug and dax/kmem so that newly added 'device memory' can more easily have its memmap placed within that newly added memory. - Matthew Wilcox continues folio-related work (including a few fixes) in the patch series 'Add folio_zero_tail() and folio_fill_tail()' 'Make folio_start_writeback return void' 'Fix fault handler's handling of poisoned tail pages' 'Convert aops->error_remove_page to ->error_remove_folio' 'Finish two folio conversions' 'More swap folio conversions' - Kefeng Wang has also contributed folio-related work in the series 'mm: cleanup and use more folio in page fault' - Jim Cromie has improved the kmemleak reporting output in the series 'tweak kmemleak report format'. - In the series 'stackdepot: allow evicting stack traces' Andrey Konovalov to permits clients (in this case KASAN) to cause eviction of no longer needed stack traces. - Charan Teja Kalla has fixed some accounting issues in the page allocator's atomic reserve calculations in the series 'mm: page_alloc: fixes for high atomic reserve caluculations'. - Dmitry Rokosov has added to the samples/ dorectory some sample code for a userspace memcg event listener application. See the series 'samples: introduce cgroup events listeners'. - Some mapletree maintanance work from Liam Howlett in the series 'maple_tree: iterator state changes'. - Nhat Pham has improved zswap's approach to writeback in the series 'workload-specific and memory pressure-driven zswap writeback'. - DAMON/DAMOS feature and maintenance work from SeongJae Park in the series 'mm/damon: let users feed and tame/auto-tune DAMOS' 'selftests/damon: add Python-written DAMON functionality tests' 'mm/damon: misc updates for 6.8' - Yosry Ahmed has improved memcg's stats flushing in the series 'mm: memcg: subtree stats flushing and thresholds'. - In the series 'Multi-size THP for anonymous memory' Ryan Roberts has added a runtime opt-in feature to transparent hugepages which improves performance by allocating larger chunks of memory during anonymous page faults. - Matthew Wilcox has also contributed some cleanup and maintenance work against eh buffer_head code int he series 'More buffer_head cleanups'. - Suren Baghdasaryan has done work on Andrea Arcangeli's series 'userfaultfd move option'. UFFDIO_MOVE permits userspace heap compaction algorithms to move userspace's pages around rather than UFFDIO_COPY'a alloc/copy/free. - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm: Add ksm advisor'. This is a governor which tunes KSM's scanning aggressiveness in response to userspace's current needs. - Chengming Zhou has optimized zswap's temporary working memory use in the series 'mm/zswap: dstmem reuse optimizations and cleanups'. - Matthew Wilcox has performed some maintenance work on the writeback code, both code and within filesystems. The series is 'Clean up the writeback paths'. - Andrey Konovalov has optimized KASAN's handling of alloc and free stack traces for secondary-level allocators, in the series 'kasan: save mempool stack traces'. - Andrey also performed some KASAN maintenance work in the series 'kasan: assorted clean-ups'. - David Hildenbrand has gone to town on the rmap code. Cleanups, more pte batching, folio conversions and more. See the series 'mm/rmap: interface overhaul'. - Kinsey Ho has contributed some maintenance work on the MGLRU code in the series 'mm/mglru: Kconfig cleanup'. - Matthew Wilcox has contributed lruvec page accounting code cleanups in the series 'Remove some lruvec page accounting functions'" * tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits) mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER mm, treewide: introduce NR_PAGE_ORDERS selftests/mm: add separate UFFDIO_MOVE test for PMD splitting selftests/mm: skip test if application doesn't has root privileges selftests/mm: conform test to TAP format output selftests: mm: hugepage-mmap: conform to TAP format output selftests/mm: gup_test: conform test to TAP format output mm/selftests: hugepage-mremap: conform test to TAP format output mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large mm/memcontrol: remove __mod_lruvec_page_state() mm/khugepaged: use a folio more in collapse_file() slub: use a folio in __kmalloc_large_node slub: use folio APIs in free_large_kmalloc() slub: use alloc_pages_node() in alloc_slab_page() mm: remove inc/dec lruvec page state functions mm: ratelimit stat flush from workingset shrinker kasan: stop leaking stack trace handles mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE mm/mglru: add dummy pmd_dirty() ...
2024-01-08Merge tag 'vfs-6.8.super' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs super updates from Christian Brauner: "This contains the super work for this cycle including the long-awaited series by Jan to make it possible to prevent writing to mounted block devices: - Writing to mounted devices is dangerous and can lead to filesystem corruption as well as crashes. Furthermore syzbot comes with more and more involved examples how to corrupt block device under a mounted filesystem leading to kernel crashes and reports we can do nothing about. Add tracking of writers to each block device and a kernel cmdline argument which controls whether other writeable opens to block devices open with BLK_OPEN_RESTRICT_WRITES flag are allowed. Note that this effectively only prevents modification of the particular block device's page cache by other writers. The actual device content can still be modified by other means - e.g. by issuing direct scsi commands, by doing writes through devices lower in the storage stack (e.g. in case loop devices, DM, or MD are involved) etc. But blocking direct modifications of the block device page cache is enough to give filesystems a chance to perform data validation when loading data from the underlying storage and thus prevent kernel crashes. Syzbot can use this cmdline argument option to avoid uninteresting crashes. Also users whose userspace setup does not need writing to mounted block devices can set this option for hardening. We expect that this will be interesting to quite a few workloads. Btrfs is currently opted out of this because they still haven't merged patches we require for this to work from three kernel releases ago. - Reimplement block device freezing and thawing as holder operations on the block device. This allows us to extend block device freezing to all devices associated with a superblock and not just the main device. It also allows us to remove get_active_super() and thus another function that scans the global list of superblocks. Freezing via additional block devices only works if the filesystem chooses to use @fs_holder_ops for these additional devices as well. That currently only includes ext4 and xfs. Earlier releases switched get_tree_bdev() and mount_bdev() to use @fs_holder_ops. The remaining nilfs2 open-coded version of mount_bdev() has been converted to rely on @fs_holder_ops as well. So block device freezing for the main block device will continue to work as before. There should be no regressions in functionality. The only special case is btrfs where block device freezing for the main block device never worked because sb->s_bdev isn't set. Block device freezing for btrfs can be fixed once they can switch to @fs_holder_ops but that can happen whenever they're ready" * tag 'vfs-6.8.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits) block: Fix a memory leak in bdev_open_by_dev() super: don't bother with WARN_ON_ONCE() super: massage wait event mechanism ext4: Block writes to journal device xfs: Block writes to log device fs: Block writes to mounted block devices btrfs: Do not restrict writes to btrfs devices block: Add config option to not allow writing to mounted devices block: Remove blkdev_get_by_*() functions bcachefs: Convert to bdev_open_by_path() fs: handle freezing from multiple devices fs: remove dead check nilfs2: simplify device handling fs: streamline thaw_super_locked ext4: simplify device handling xfs: simplify device handling fs: simplify setup_bdev_super() calls blkdev: comment fs_holder_ops porting: document block device freeze and thaw changes fs: remove unused helper ...
2023-12-29xfs: use the op name in trace_xlog_intent_recovery_failedChristoph Hellwig
Instead of tracing the address of the recovery handler, use the name in the defer op, similar to other defer ops related tracepoints. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2023-12-29xfs: fix a use after free in xfs_defer_finish_recoveryChristoph Hellwig
dfp will be freed by ->recover_work and thus the tracepoint in case of an error can lead to a use after free. Store the defer ops in a local variable to avoid that. Fixes: 7f2f7531e0d4 ("xfs: store an ops pointer in struct xfs_defer_pending") Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2023-12-29xfs: turn the XFS_DA_OP_REPLACE checks in xfs_attr_shortform_addname into ↵Christoph Hellwig
asserts Since commit deed9512872d ("xfs: Check for -ENOATTR or -EEXIST"), the high-level attr code does a lookup for any attr we're trying to set, and does the checks to handle the create vs replace cases, which thus never hit the low-level attr code. Turn the checks in xfs_attr_shortform_addname as they must never trip. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2023-12-29xfs: remove xfs_attr_sf_hdr_tChristoph Hellwig
Remove the last two users of the typedef. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2023-12-29xfs: remove struct xfs_attr_shortformChristoph Hellwig
sparse complains about struct xfs_attr_shortform because it embeds a structure with a variable sized array in a variable sized array. Given that xfs_attr_shortform is not a very useful structure, and the dir2 equivalent has been removed a long time ago, remove it as well. Provide a xfs_attr_sf_firstentry helper that returns the first xfs_attr_sf_entry behind a xfs_attr_sf_hdr to replace the structure dereference. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2023-12-29xfs: use xfs_attr_sf_findname in xfs_attr_shortform_getvalueChristoph Hellwig
xfs_attr_shortform_getvalue duplicates the logic in xfs_attr_sf_findname. Use the helper instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2023-12-29xfs: remove xfs_attr_shortform_lookupChristoph Hellwig
xfs_attr_shortform_lookup is only used by xfs_attr_shortform_addname, which is much better served by calling xfs_attr_sf_findname. Switch it over and remove xfs_attr_shortform_lookup. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>