path: root/fs
Age | Commit message | Author
2025-07-18xfs: don't allocate the xfs_extent_busy structure for zoned RTGsChristoph Hellwig
Busy extent tracking is primarily used to ensure that freed blocks are not reused for data allocations before the transaction that deleted them has been committed to stable storage, and secondarily to drive online discard. None of the use cases applies to zoned RTGs, as the zoned allocator can't overwrite blocks before resetting the zone, which already flushes out all transactions touching the RTGs. So the busy extent tracking is not needed for zoned RTGs, and also not called for zoned RTGs. But somehow the code to skip allocating and freeing the structure got lost during the zoned XFS upstreaming process. This not only causes these structures to be unnecessarily allocated, but can also lead to memory leaks as the xg_busy_extents pointer in the xfs_group structure is overlaid with the pointer for the linked list of to-be-reset zones. Stop allocating and freeing the structure to not pointlessly allocate memory which is then leaked when the zone is reset. Fixes: 080d01c41d44 ("xfs: implement zoned garbage collection") Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: <stable@vger.kernel.org> # v6.15 [cem: Fix type and add stable tag] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-18efivarfs: Fix memory leak of efivarfs_fs_info in fs_context error pathsBreno Leitao
When processing mount options, efivarfs allocates efivarfs_fs_info (sfi) early in fs_context initialization. However, sfi is associated with the superblock and typically freed when the superblock is destroyed. If the fs_context is released (final put) before fill_super is called—such as on error paths or during reconfiguration—the sfi structure would leak, as ownership never transfers to the superblock. Implement the .free callback in efivarfs_context_ops to ensure any allocated sfi is properly freed if the fs_context is torn down before fill_super, preventing this memory leak. Suggested-by: James Bottomley <James.Bottomley@HansenPartnership.com> Fixes: 5329aa5101f73c ("efivarfs: Add uid/gid mount options") Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2025-07-18ovl: rename ovl_cleanup_unlocked() to ovl_cleanup()NeilBrown
The only remaining user of ovl_cleanup() is ovl_cleanup_locked(), so we no longer need both. This patch renames ovl_cleanup() to ovl_cleanup_locked() and makes it static. ovl_cleanup_unlocked() is renamed to ovl_cleanup(). Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/20250716004725.1206467-22-neil@brown.name Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-18ovl: change ovl_create_real() to receive dentry parentNeilBrown
Instead of passing an inode *dir, pass a dentry *parent. This makes the calling slightly cleaner. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/20250716004725.1206467-21-neil@brown.name Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-18ovl: narrow locking in ovl_check_rename_whiteout()NeilBrown
ovl_check_rename_whiteout() now only holds the directory lock when needed, and takes it again if necessary. This makes way for future changes where locks are taken on individual dentries rather than the whole directory. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/20250716004725.1206467-20-neil@brown.name Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-18ovl: narrow locking in ovl_whiteout()NeilBrown
ovl_whiteout() relies on the workdir i_rwsem to provide exclusive access to ofs->whiteout which it manipulates. Rather than depending on this, add a new mutex, "whiteout_lock" to explicitly provide the required locking. Use guard(mutex) for this so that we can return without needing to explicitly unlock. Then take the lock on workdir only when needed - to lookup the temp name and to do the whiteout or link. Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/20250716004725.1206467-19-neil@brown.name Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-18ovl: change ovl_cleanup_and_whiteout() to take rename lock as neededNeilBrown
Rather than locking the directory(s) before calling ovl_cleanup_and_whiteout(), change it (and ovl_whiteout()) to do the locking, so the locking can be fine grained as will be needed for proposed locking changes. Sometimes this is called to whiteout something in the index dir, in which case only that dir must be locked. In one case it is called on something in an upperdir, so two directories must be locked. We use ovl_lock_rename_workdir() for this and remove the restriction that upperdir cannot be indexdir - because now sometimes it is. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/20250716004725.1206467-18-neil@brown.name Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-18ovl: narrow locking on ovl_remove_and_whiteout()NeilBrown
This code performs a lookup_upper, creates a whiteout object, and renames the whiteout over the result of the lookup. The create and the rename must be locked separately for proposed directory locking changes. This patch takes a first step of moving the lookup out of the locked region. A subsequent patch will separate the create from the rename. Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/20250716004725.1206467-17-neil@brown.name Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-18ovl: change ovl_workdir_cleanup() to take dir lock as needed.NeilBrown
Rather than calling ovl_workdir_cleanup() with the dir already locked, change it to take the dir lock only when needed. Also change ovl_workdir_cleanup() to take a dentry for the parent rather than an inode. Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/20250716004725.1206467-16-neil@brown.name Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-18ovl: narrow locking in ovl_workdir_cleanup_recurse()NeilBrown
Only take the dir lock when needed, rather than for the whole loop. Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/20250716004725.1206467-15-neil@brown.name Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-18ovl: narrow locking in ovl_indexdir_cleanup()NeilBrown
Instead of taking the directory lock for the whole cleanup, only take it when needed. Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/20250716004725.1206467-14-neil@brown.name Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-18ovl: narrow locking in ovl_workdir_create()NeilBrown
In ovl_workdir_create() don't hold the dir lock for the whole time, but only take it when needed. It now gets taken separately for ovl_workdir_cleanup(). A subsequent patch will move the locking into that function. Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/20250716004725.1206467-13-neil@brown.name Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-18ovl: narrow locking in ovl_cleanup_index()NeilBrown
ovl_cleanup_index() takes a lock on the directory and then does a lookup and possibly one of two different cleanups. This patch narrows the locking to use the _unlocked() versions of the lookup and one cleanup, and just takes the lock for the other cleanup. A subsequent patch will take the lock into the cleanup. Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/20250716004725.1206467-12-neil@brown.name Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-18ovl: narrow locking in ovl_cleanup_whiteouts()NeilBrown
Rather than lock the directory for the whole operation, use ovl_lookup_upper_unlocked() and ovl_cleanup_unlocked() to take the lock only when needed. This makes way for future changes where locks are taken on individual dentries rather than the whole directory. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/20250716004725.1206467-11-neil@brown.name Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-18ovl: narrow locking in ovl_rename()NeilBrown
Drop the rename lock immediately after the rename, and use ovl_cleanup_unlocked() for cleanup. This makes way for future changes where locks are taken on individual dentries rather than the whole directory. Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/20250716004725.1206467-10-neil@brown.name Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-18ovl: simplify gotos in ovl_rename()NeilBrown
Rather than having three separate goto labels: out_unlock, out_dput_old, and out_dput, make use of the fact that dput() happily accepts a NULL pointer to reduce this to just one goto label: out_unlock. olddentry and newdentry are initialised to NULL and only set once a valid dentry is found. They are then dput() late in the function. Suggested-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/20250716004725.1206467-9-neil@brown.name Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
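The single-label cleanup pattern described above can be sketched in userspace C. This is only an illustration under stated assumptions: free() stands in for dput() (both accept NULL as a no-op), and demo_two_buffers() is a hypothetical function, not code from ovl_rename().

```c
#include <stdlib.h>

/*
 * Hypothetical example: both pointers start out NULL and are only set
 * once a resource has actually been obtained. Because free(NULL) is a
 * no-op (as the commit message says dput(NULL) is), a single label can
 * release everything regardless of how far the function got.
 */
int demo_two_buffers(size_t n_old, size_t n_new)
{
	char *oldbuf = NULL, *newbuf = NULL;
	int err = -1;

	if (n_old == 0)
		goto out;
	oldbuf = malloc(n_old);
	if (!oldbuf)
		goto out;

	if (n_new == 0)
		goto out;
	newbuf = malloc(n_new);
	if (!newbuf)
		goto out;

	err = 0;
out:
	/* Safe even when one or both pointers are still NULL. */
	free(newbuf);
	free(oldbuf);
	return err;
}
```

Every early exit jumps to the same label, so adding a new failure point does not require choosing among several labels.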
2025-07-18ovl: narrow locking in ovl_create_over_whiteout()NeilBrown
Unlock the parents immediately after the rename, and use ovl_cleanup_unlocked() for cleanup, which takes a separate lock. This makes way for future changes where locks are taken on individual dentries rather than the whole directory. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/20250716004725.1206467-8-neil@brown.name Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-18ovl: narrow locking in ovl_clear_empty()NeilBrown
Drop the locks immediately after rename, and use a separate lock for cleanup. This makes way for future changes where locks are taken on individual dentries rather than the whole directory. Note that ovl_cleanup_whiteouts() operates on "upper", a child of "upperdir" and does not require upperdir or workdir to be locked. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/20250716004725.1206467-7-neil@brown.name Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-18ovl: narrow locking in ovl_create_upper()NeilBrown
Drop the directory lock immediately after the ovl_create_real() call and take a separate lock later for cleanup in ovl_cleanup_unlocked() - if needed. This makes way for future changes where locks are taken on individual dentries rather than the whole directory. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/20250716004725.1206467-6-neil@brown.name Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-18ovl: narrow the locked region in ovl_copy_up_workdir()NeilBrown
In ovl_copy_up_workdir() unlock immediately after the rename. There is nothing else in the function that needs the lock. Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/20250716004725.1206467-5-neil@brown.name Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-18ovl: Call ovl_create_temp() without lock held.NeilBrown
ovl currently locks a directory or two and then performs multiple actions in one or both directories. This is incompatible with proposed changes which will lock just the dentry of objects being acted on. This patch moves calls to ovl_create_temp() out of the locked regions and has it take and release the relevant lock itself. The lock that was taken before this function was called is now taken after. This means that any code between where the lock was taken and ovl_create_temp() is now unlocked. This necessitates the use of ovl_cleanup_unlocked() and the creation of ovl_lookup_upper_unlocked(). These will be used more widely in future patches. Now that the file is created before the lock is taken for rename, we need to ensure the parent wasn't changed before the lock was gained. ovl_lock_rename_workdir() is changed to optionally receive the dentries that will be involved in the rename. If either is present but has the wrong parent, an error is returned. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/20250716004725.1206467-4-neil@brown.name Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-18ovl: change ovl_create_index() to take dir locksNeilBrown
ovl_copy_up_workdir() currently takes a rename lock on two directories, then uses the lock to create a file in one directory, perform a rename, and possibly unlink the file for cleanup. This is incompatible with proposed changes which will lock just the dentry of objects being acted on. This patch moves the call to ovl_create_index() earlier in ovl_copy_up_workdir() to before the lock is taken. ovl_create_index() then takes the required lock only when needed. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/20250716004725.1206467-3-neil@brown.name Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-18ovl: simplify an error path in ovl_copy_up_workdir()NeilBrown
If ovl_copy_up_data() fails the error is not immediately handled but the code continues on to call ovl_start_write() and lock_rename(), presumably because both of these locks are needed for the cleanup. Only then (if the lock was successful) is the error checked. This makes the code a little hard to follow and could be fragile. This patch changes to handle the error after the ovl_start_write() (which cannot fail, so there aren't multiple errors to deal with). A new ovl_cleanup_unlocked() is created which takes the required directory lock. This will be used extensively in later patches. In general we need to check the parent is still correct after taking the lock (as ovl_copy_up_workdir() does after a successful lock_rename()) so that is included in ovl_cleanup_unlocked() using new ovl_parent_lock() and ovl_parent_unlock() calls (it is planned to move this API into VFS code eventually, though in a slightly different form). Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/20250716004725.1206467-2-neil@brown.name Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-18ovl: support layers on case-folding capable filesystemsAmir Goldstein
Case folding is often applied to subtrees and not on an entire filesystem. Disallowing layers from filesystems that support case folding is overly limiting. Replace the rule that case-folding capable filesystems are not allowed as layers with a rule that case folded directories are not allowed in a merged directory stack. Should case folding be enabled on an underlying directory while overlayfs is mounted the outcome is generally undefined. Specifically in ovl_lookup(), we check the base underlying directory and fail with -ESTALE and write a warning to kmsg if case folding is enabled on an underlying directory. Suggested-by: Kent Overstreet <kent.overstreet@linux.dev> Link: https://lore.kernel.org/linux-fsdevel/20250520051600.1903319-1-kent.overstreet@linux.dev/ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/20250602171702.1941891-1-amir73il@gmail.com Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-18ovl: remove unneeded non-const conversionAmir Goldstein
file_user_path() now takes a const file ptr. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/20250607115304.2521155-3-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-18fs: constify file ptr in backing_file accessor helpersAmir Goldstein
Add internal helper backing_file_set_user_path() for the only two cases that need to modify backing_file fields. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/20250607115304.2521155-2-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-17ext4: refactor the inline directory conversion and new directory codepathsTheodore Ts'o
There was a lot of common code in the codepaths used to convert an inline directory and to create a new directory. To address this, rename ext4_init_dot_dotdot() to ext4_init_dirblock() and then move common code into that function. This reduces the lines of code count in fs/ext4/inline.c and fs/ext4/namei.c, as well as reducing the size of their object files. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://patch.msgid.link/20250712181249.434530-3-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-17ext4: use memcpy() instead of strcpy()Theodore Ts'o
The strcpy() function is considered dangerous and eeeevil by people who are using sophisticated code analysis tools such as "grep". This is true even when a quick inspection would show that the source is a constant string ("." or "..") and the destination is a fixed array which is guaranteed to have enough space. Make the "grep" code analysis tool happy by using memcpy() instead of strcpy(). :-) Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://patch.msgid.link/20250712181249.434530-2-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-17ext4: replace strcmp with direct comparison for '.' and '..'Theodore Ts'o
In a discussion over a proposed patch, "ext4: replace strcpy() with '.' assignment"[1], I had asserted that directory entries in ext4 were not NUL terminated, and hence it was safe to replace strcpy() with a direct assignment. As it turns out, this was incorrect. It's true for all directory entries *except* for '.' and '..' where the kernel was using strcmp() and where e2fsck actually checks and offers to fix things if '.' and '..' are not NUL terminated. [1] https://lore.kernel.org/r/202505191316.JJMnPobO-lkp@intel.com We can't change this without breaking old kernel versions, but in the spirit of "be liberal in what you receive", use direct comparison of de->name_len and de->name[0,1] instead of strcmp(). This has the side benefit of reducing the compiled text size by 96 bytes on x86_64. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://patch.msgid.link/20250712181249.434530-1-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu>
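A minimal sketch of the direct comparison described above, assuming a simplified directory entry. struct demo_dirent and the demo_is_dot()/demo_is_dotdot() helpers are hypothetical names for illustration, not the ext4 definitions; the point is only that name_len plus one or two byte compares does not depend on NUL termination.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an on-disk directory entry. */
struct demo_dirent {
	uint8_t name_len;   /* length of the name in bytes */
	char    name[255];  /* not necessarily NUL terminated */
};

/* True if the entry is "." judged by length and first byte only. */
bool demo_is_dot(const struct demo_dirent *de)
{
	assert(de != NULL);
	return de->name_len == 1 && de->name[0] == '.';
}

/* True if the entry is ".." judged by length and first two bytes only. */
bool demo_is_dotdot(const struct demo_dirent *de)
{
	assert(de != NULL);
	return de->name_len == 2 && de->name[0] == '.' && de->name[1] == '.';
}
```

Unlike strcmp(), these checks never read past name_len bytes, so an entry whose name is followed by arbitrary bytes is still classified correctly.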
2025-07-17ext4: Make sure BH_New bit is cleared in ->write_end handlerJan Kara
Currently we clear BH_New bit in case of error and also in the standard ext4_write_end() handler (in block_commit_write()). However ext4_journalled_write_end() misses this clearing and thus we are leaving stale BH_New bits behind. Generally ext4_block_write_begin() clears these bits before any harm can be done but in case blocksize < pagesize and we hit some error when processing a page with these stale bits, we'll try to zero buffers with these stale BH_New bits and jbd2 will complain (as buffers were not prepared for writing in this transaction). Fix the problem by clearing BH_New bits in ext4_journalled_write_end() and WARN if ext4_block_write_begin() sees stale BH_New bits. Reported-by: Baolin Liu <liubaolin12138@163.com> Reported-by: Zhi Long <longzhi@sangfor.com.cn> Fixes: 3910b513fcdf ("ext4: persist the new uptodate buffers in ext4_journalled_zero_new_buffers") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250709084831.23876-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-17ext4: fix inode use after free in ext4_end_io_rsv_work()Baokun Li
In ext4_io_end_defer_completion(), check if io_end->list_vec is empty to avoid adding an io_end that requires no conversion to the i_rsv_conversion_list, which in turn prevents starting an unnecessary worker. An ext4_emergency_state() check is also added to avoid attempting to abort the journal in an emergency state. Additionally, ext4_put_io_end_defer() is refactored to call ext4_io_end_defer_completion() directly instead of being open-coded. This also prevents starting an unnecessary worker when EXT4_IO_END_FAILED is set but data_err=abort is not enabled. This ensures that the check in ext4_put_io_end_defer() is consistent with the check in ext4_end_bio(). Otherwise, we might add an io_end to the i_rsv_conversion_list and then call ext4_finish_bio(), after which the inode could be freed before ext4_end_io_rsv_work() is called, triggering a use-after-free issue. Fixes: ce51afb8cc5e ("ext4: abort journal on data writeback failure if in data_err=abort mode") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250708111504.3208660-1-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-17binfmt_elf: remove the 4k limitation of program header sizeYin Fengwei
We have assembly code generated by a script. GCC successfully compiles it. However, the kernel cannot load it on an ARM64 platform with a 4K page size. In contrast, the same ELF file loads correctly on the same platform with a 64K page size. The root cause is the Linux kernel's ELF_MIN_ALIGN limitation on the program headers of ELF files. The ELF file contains 78 program headers (the script inserts many holes when generating the assembly code). On ARM64 with a 4K page size, the ELF_MIN_ALIGN enforces a maximum of 74 program headers, causing the ELF file to fail. However, with a 64K page size, the ELF_MIN_ALIGN is relaxed to over 1,184 program headers, allowing the file to run correctly. Cook kindly identified[1] that this limitation was introduced in Linux-0.99.15f without an explanation for its purpose. The ELF specification does not impose such a restriction on program headers. Remove the ELF_MIN_ALIGN limitation on program headers to align with the ELF spec. After removing the ELF_MIN_ALIGN limitation, the 64K size limitation still exists, which should be sufficient. Suggested-by: Kees Cook <kees@kernel.org> Link: https://lore.kernel.org/linux-mm/202506270854.A729825@keescook/ [1] Signed-off-by: Yin Fengwei <fengwei_yin@linux.alibaba.com> Link: https://lore.kernel.org/r/20250717110108.55586-1-fengwei_yin@linux.alibaba.com Signed-off-by: Kees Cook <kees@kernel.org>
2025-07-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.16-rc7). Conflicts: Documentation/netlink/specs/ovpn.yaml 880d43ca9aa4 ("netlink: specs: clean up spaces in brackets") af52020fc599 ("ovpn: reject unexpected netlink attributes") drivers/net/phy/phy_device.c a44312d58e78 ("net: phy: Don't register LEDs for genphy") f0f2b992d818 ("net: phy: Don't register LEDs for genphy") https://lore.kernel.org/20250710114926.7ec3a64f@kernel.org drivers/net/wireless/intel/iwlwifi/fw/regulatory.c drivers/net/wireless/intel/iwlwifi/mld/regulatory.c 5fde0fcbd760 ("wifi: iwlwifi: mask reserved bits in chan_state_active_bitmap") ea045a0de3b9 ("wifi: iwlwifi: add support for accepting raw DSM tables by firmware") net/ipv6/mcast.c ae3264a25a46 ("ipv6: mcast: Delay put pmc->idev in mld_del_delrec()") a8594c956cc9 ("ipv6: mcast: Avoid a duplicate pointer check in mld_del_delrec()") https://lore.kernel.org/8cc52891-3653-4b03-a45e-05464fe495cf@kernel.org No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17ext4: Refactor breaking condition for xattr_find_entry()I Hsin Cheng
Refactor the condition for breaking the loop within xattr_find_entry(). Eliminate the usage of "<=" and take condition shortcut when "!cmp" is true. Originally, the condition was "(cmp <= 0 && (sorted || cmp == 0))", which means after it knows "cmp <= 0" is true, it has to check the value of "sorted" and "cmp". The checking of "cmp" here would be redundant since it has already checked it. Observing from the logic, when "cmp == 0" the branch is going to be true, no need to check "cmp == 0" again, so we only need to take shortcut when "cmp == 0", on the other hand, we'll check "sorted" when "cmp < 0". The refactor can shrink the generated code size by 44 bytes. Numerous instructions can be saved, which should also benefit execution efficiency. $ ./scripts/bloat-o-meter vmlinux_old vmlinux_new add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-44 (-44) Function old new delta xattr_find_entry 300 256 -44 Total: Before=22989434, After=22989390, chg -0.00% The test is done on kernel version 6.16 with x86_64 defconfig and gcc 13.3.0. Signed-off-by: I Hsin Cheng <richard120310@gmail.com> Link: https://patch.msgid.link/20250708020013.175728-1-richard120310@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
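The equivalence argued above can be checked with a small standalone sketch. The original condition is quoted from the message; the refactored form shown here is an assumption reconstructed from the description (shortcut on "!cmp", consult "sorted" only when "cmp < 0"), and all demo_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Original form, as quoted in the commit message. */
bool demo_break_old(int cmp, bool sorted)
{
	return cmp <= 0 && (sorted || cmp == 0);
}

/* Assumed refactored form: cmp == 0 short-circuits, sorted matters only for cmp < 0. */
bool demo_break_new(int cmp, bool sorted)
{
	return !cmp || (cmp < 0 && sorted);
}

/* Compare both forms over negative, zero and positive cmp values. */
void demo_check_equivalence(void)
{
	for (int cmp = -3; cmp <= 3; cmp++) {
		assert(demo_break_old(cmp, true) == demo_break_new(cmp, true));
		assert(demo_break_old(cmp, false) == demo_break_new(cmp, false));
	}
}
```

Only the sign of cmp matters to either expression, so checking one negative, one zero and one positive value per setting of sorted already covers every case.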
2025-07-17ilog2: add max_pow_of_two_factor()John Garry
Relocate the function max_pow_of_two_factor() to common ilog2.h from the xfs code, as it will be used elsewhere. Also simplify the function, as advised by Mikulas Patocka. Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20250711105258.3135198-2-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
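For readers unfamiliar with the helper, the following sketch shows one common way to compute the largest power of two that divides a number. It is an illustration only: the demo_ functions are hypothetical, and the version that actually landed in ilog2.h may differ in form.

```c
#include <assert.h>

/* Straightforward loop: double the candidate factor until its bit is set in nr. */
unsigned int demo_max_pow2_loop(unsigned int nr)
{
	unsigned int f = 1;

	if (nr == 0)
		return 0;
	while ((nr & f) == 0)
		f <<= 1;
	return f;
}

/* Bit trick: nr & -nr isolates the lowest set bit of an unsigned value. */
unsigned int demo_max_pow2_bits(unsigned int nr)
{
	return nr & -nr;
}

/* Cross-check the two variants over a small range. */
void demo_check_max_pow2(void)
{
	for (unsigned int nr = 0; nr < 1024; nr++)
		assert(demo_max_pow2_loop(nr) == demo_max_pow2_bits(nr));
}
```

For example, 12 is binary 1100, so both variants return 4; an odd number returns 1.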
2025-07-17fuse: refactor writeback to use iomap_writepage_ctx inodeJoanne Koong
struct iomap_writepage_ctx includes a pointer to the file inode. In writeback, use that instead of also passing the inode into fuse_fill_wb_data. No functional changes. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://lore.kernel.org/20250715202122.2282532-6-joannelkoong@gmail.com Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-17fuse: hook into iomap for invalidating and checking partial uptodatenessJoanne Koong
Hook into iomap_invalidate_folio() so that if the entire folio is being invalidated during truncation, the dirty state is cleared and the folio doesn't get written back. The folio's corresponding ifs struct will get freed as well. Hook into iomap_is_partially_uptodate() since iomap tracks uptodateness granularly when it does buffered writes. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://lore.kernel.org/20250715202122.2282532-5-joannelkoong@gmail.com Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-17fuse: use iomap for folio launderingJoanne Koong
Use iomap for folio laundering, which will do granular dirty writeback when laundering a large folio. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://lore.kernel.org/20250715202122.2282532-4-joannelkoong@gmail.com Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-17fuse: use iomap for writebackJoanne Koong
Use iomap for dirty folio writeback in ->writepages(). This allows for granular dirty writeback of large folios. Only the dirty portions of the large folio will be written instead of having to write out the entire folio. For example if there is a 1 MB large folio and only 2 bytes in it are dirty, only the page for those dirty bytes will be written out. .dirty_folio needs to be set to iomap_dirty_folio so that the bitmap iomap uses for dirty tracking correctly reflects dirty regions that need to be written back. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://lore.kernel.org/20250715202122.2282532-3-joannelkoong@gmail.com Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-17fuse: use iomap for buffered writesJoanne Koong
Have buffered writes go through iomap. This has two advantages: granular large folio synchronous reads, and granular large folio dirty tracking. If for example there is a 1 MB large folio and a write issued at pos 1 to pos 1 MB - 2, only the head and tail pages will need to be read in and marked uptodate instead of the entire folio needing to be read in. Non-relevant trailing pages are also skipped (e.g. if for a 1 MB large folio a write is issued at pos 1 to 4099, only the first two pages are read in and the ones after that are skipped). iomap also has granular dirty tracking. This is useful in that when it comes to writeback time, only the dirty portions of the large folio will be written instead of having to write out the entire folio. For example if there is a 1 MB large folio and only 2 bytes in it are dirty, only the page for those dirty bytes gets written out. Please note that granular writeback is only done once fuse also uses iomap in writeback (separate commit). .release_folio needs to be set to iomap_release_folio so that any allocated iomap ifs structs get freed. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://lore.kernel.org/20250715202122.2282532-2-joannelkoong@gmail.com Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
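The read-side arithmetic in the examples above can be sketched as follows. This is a simplified model, not iomap code: it assumes 4 KiB pages, that no page in the folio is already uptodate, and that the write range is given as inclusive byte offsets; DEMO_PAGE_SIZE and demo_pages_to_read() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u  /* assumed page size for this sketch */

/*
 * Number of pages that have to be read in before writing bytes
 * [first, last] (inclusive): only pages the write covers partially.
 * Fully overwritten pages and pages outside the range are skipped.
 */
size_t demo_pages_to_read(size_t first, size_t last)
{
	size_t first_page, last_page;
	int head_partial, tail_partial;

	assert(first <= last);
	first_page = first / DEMO_PAGE_SIZE;
	last_page = last / DEMO_PAGE_SIZE;
	head_partial = (first % DEMO_PAGE_SIZE) != 0;
	tail_partial = ((last + 1) % DEMO_PAGE_SIZE) != 0;

	/* A write confined to one page needs at most that one page. */
	if (first_page == last_page)
		return (head_partial || tail_partial) ? 1 : 0;
	return (size_t)head_partial + (size_t)tail_partial;
}
```

Under these assumptions, a write from pos 1 to pos 1 MB - 2 touches 256 pages but needs only the head and tail pages read, and a write from pos 1 to 4099 needs only the first two pages.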
2025-07-16bcachefs: Fix bch2_maybe_casefold() when CONFIG_UTF8=nKent Overstreet
maybe_casefold() shouldn't have been nooped, just bch2_casefold(). Fixes: 94426e4201fb ("bcachefs: opts.casefold_disabled") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-16bcachefs: Fix build when CONFIG_UNICODE=nKent Overstreet
94426e4201fb, which added the killswitch for casefolding, accidentally removed some of the ifdefs we need to avoid build errors. It appears we need better build testing for different configurations; it took two weeks for the robots to catch this one. Fixes: 94426e4201fb ("bcachefs: opts.casefold_disabled") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-16bcachefs: Fix reference to invalid bucket in copygcKent Overstreet
Use bch2_dev_bucket_tryget() instead of bch2_dev_tryget() before checking the bucket bitmap. Reported-by: syzbot+3168625f36f4a539237e@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-16bcachefs: Don't build aux search tree when still repairing nodeKent Overstreet
bch2_btree_node_drop_keys_outside_node() will (re)build aux search trees, because it's also called by topology repair. bch2_btree_node_read_done() was calling it before validating individual keys; invalid ones have to be dropped. If we call drop_keys_outside_node() first, then bch2_bset_build_aux_tree() doesn't run because the node already has an aux search tree - which was invalidated by the repair. Reported-by: syzbot+c5e7a66b3b23ae65d44f@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-16bcachefs: Tweak threshold for allocator triggering discardsKent Overstreet
The allocator path has a "if we're really low on free buckets, check if we should issue discards" - tweak this to also trigger discards if more than 1/128th of the device is in need_discard state. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-16bcachefs: Fix triggering of discard by the journal pathKent Overstreet
It becomes possible to do discards after a journal flush, which naturally the journal code is responsible for. A prior refactoring seems to have broken this - which went unnoticed because the foreground allocator path can also trigger discards. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-16gfs2: No more self recoveryAndreas Gruenbacher
When a node withdraws and it turns out that it is the only node that has the filesystem mounted, gfs2 currently tries to replay the local journal to bring the filesystem back into a consistent state. Not only is that a very bad idea, it has also never worked because gfs2_recover_func() will refuse to do anything during a withdraw. However, before even getting to this point, gfs2_recover_func() dereferences sdp->sd_jdesc->jd_inode. This was a use-after-free before commit 04133b607a78 ("gfs2: Prevent double iput for journal on error") and is a NULL pointer dereference since then. Simply get rid of self recovery to fix that. Fixes: 601ef0d52e96 ("gfs2: Force withdraw to replay journals and wait for it to finish") Reported-by: Chunjie Zhu <chunjie.zhu@cloud.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-07-16gfs2: Validate i_depth for exhash directoriesAndrew Price
A fuzzer test introduced corruption that ends up with a depth of 0 in dir_e_read(), causing an undefined shift by 32 at: index = hash >> (32 - dip->i_depth); As calculated in an open-coded way in dir_make_exhash(), the minimum depth for an exhash directory is ilog2(sdp->sd_hash_ptrs) and 0 is invalid as sdp->sd_hash_ptrs is fixed as sdp->bsize / 16 at mount time. So we can avoid the undefined behaviour by checking for depth values lower than the minimum in gfs2_dinode_in(). Values greater than the maximum are already being checked for there. Also switch the calculation in dir_make_exhash() to use ilog2() to clarify how the depth is calculated. Tested with the syzkaller repro.c and xfstests '-g quick'. Reported-by: syzbot+4708579bb230a0582a57@syzkaller.appspotmail.com Signed-off-by: Andrew Price <anprice@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
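A minimal sketch of the depth check described above, under stated assumptions: demo_ilog2() is a userspace stand-in for the kernel's ilog2(), the maximum depth is passed in as a parameter rather than taken from gfs2, and all demo_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* floor(log2(v)) for v > 0; userspace stand-in for the kernel's ilog2(). */
unsigned int demo_ilog2(uint32_t v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Accept only depths between ilog2(hash_ptrs) and max_depth, inclusive. */
bool demo_depth_valid(uint32_t depth, uint32_t hash_ptrs, uint32_t max_depth)
{
	return depth >= demo_ilog2(hash_ptrs) && depth <= max_depth;
}

/*
 * Index computation quoted in the commit message. A depth of 0 would
 * shift a 32-bit value by 32, which is undefined in C, hence the assert.
 */
uint32_t demo_hash_index(uint32_t hash, uint32_t depth)
{
	assert(depth >= 1 && depth <= 32);
	return hash >> (32 - depth);
}
```

With a 4096-byte block, hash_ptrs is 4096 / 16 = 256 and the minimum depth is 8, so a corrupted depth of 0 is rejected before the shift is ever evaluated.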
2025-07-16ext4: support uncached buffered I/OTaotao Chen
Set FOP_DONTCACHE in ext4_file_operations to declare support for uncached buffered I/O. To handle this flag, update ext4_write_begin() and ext4_da_write_begin() to use write_begin_get_folio(), which encapsulates FGP_DONTCACHE logic based on iocb->ki_flags. Part of a series refactoring address_space_operations write_begin and write_end callbacks to use struct kiocb for passing write context and flags. Signed-off-by: Taotao Chen <chentaotao@didiglobal.com> Link: https://lore.kernel.org/20250716093559.217344-6-chentaotao@didiglobal.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-16fs: change write_begin/write_end interface to take struct kiocb *Taotao Chen
Change the address_space_operations callbacks write_begin() and write_end() to take struct kiocb * as the first argument instead of struct file *. Update all affected function prototypes, implementations, call sites, and related documentation across VFS, filesystems, and block layer. Part of a series refactoring address_space_operations write_begin and write_end callbacks to use struct kiocb for passing write context and flags. Signed-off-by: Taotao Chen <chentaotao@didiglobal.com> Link: https://lore.kernel.org/20250716093559.217344-4-chentaotao@didiglobal.com Signed-off-by: Christian Brauner <brauner@kernel.org>