summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2025-06-16bcachefs: fsck: Improve check_key_has_inode()Kent Overstreet
Print out more info when we find a key (extent, dirent, xattr) for a missing inode - was there a good inode in an older snapshot, full(ish) list of keys for that missing inode, so we can make better decisions on how to repair. If it looks like it should've been deleted, autofix it. If we ever hit the non-autofix cases, we'll want to write more repair code (possibly reconstituting the inode). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: don't return fsck_fix for unfixable node errors in __btree_errBharadwaj Raju
After cd3cdb1ef706 ("Single err message for btree node reads"), all errors caused __btree_err to return -BCH_ERR_fsck_fix no matter what the actual error type was if the recovery pass was scanning for btree nodes. This lead to the code continuing despite things like bad node formats when they earlier would have caused a jump to fsck_err, because btree_err only jumps when the return from __btree_err does not match fsck_fix. Ultimately this lead to undefined behavior by attempting to unpack a key based on an invalid format. Make only errors of type -BCH_ERR_btree_node_read_err_fixable cause __btree_err to return -BCH_ERR_fsck_fix when scanning for btree nodes. Reported-by: syzbot+cfd994b9cdf00446fd54@syzkaller.appspotmail.com Fixes: cd3cdb1ef706 ("bcachefs: Single err message for btree node reads") Signed-off-by: Bharadwaj Raju <bharadwaj.raju777@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: Fix pool->alloc NULL pointer dereferenceAlan Huang
btree_interior_update_pool has not been initialized before the filesystem becomes read-write, thus mempool_alloc in bch2_btree_update_start will trigger pool->alloc NULL pointer dereference in mempool_alloc_noprof Reported-by: syzbot+2f3859bd28f20fa682e6@syzkaller.appspotmail.com Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: Move bset size check before csum checkAlan Huang
In syzbot's crash, the bset's u64s is larger than the btree node. Reported-by: syzbot+bfaeaa8e26281970158d@syzkaller.appspotmail.com Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: mark more errors autofixKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: Kill unused tracepointsKent Overstreet
Dead code cleanup. Link: https://lore.kernel.org/linux-bcachefs/20250612224059.39fddd07@batman.local.home/ Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: opts.journal_rewindKent Overstreet
Add a mount option for rewinding the journal, bringing the entire filesystem to where it was at a previous point in time. This is for extreme disaster recovery scenarios - it's not intended as an undelete operation. The option takes a journal sequence number; the desired sequence number can be determined with 'bcachefs list_journal' Caveats: - The 'journal_transaction_names' option must have been enabled (it's on by default). The option controls emitting of extra debug info in the journal, so we can see what individual transactions were doing; It also enables journalling of keys being overwritten, which is what we rely on here. - A full fsck run will be automatically triggered since alloc info will be inconsistent. Only leaf node updates to non-alloc btrees are rewound, since rewinding interior btree updates isn't possible or desirable. - We can't do anything about data that was deleted and overwritten. Lots of metadata updates after the point in time we're rewinding to shouldn't cause a problem, since we segragate data and metadata allocations (this is in order to make repair by btree node scan practical on larger filesystems; there's a small 64-bit per device bitmap in the superblock of device ranges with btree nodes, and we try to keep this small). However, having discards enabled will cause problems, since buckets are discarded as soon as they become empty (this is why we don't implement fstrim: we don't need it). Hopefully, this feature will be a one-off thing that's never used again: this was implemented for recovering from the "vfs i_nlink 0 -> subvol deletion" bug, and that bug was unusually disastrous and additional safeguards have since been implemented. But if it does turn out that we need this more in the future, I'll have to implement an option so that empty buckets aren't discarded immediately - lagging by perhaps 1% of device capacity. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16simple_recursive_removal(): saner interaction with fsnotifyAl Viro
Make it match the real unlink(2)/rmdir(2) - notify *after* the operation. And use fsnotify_delete() instead of messing with fsnotify_unlink()/fsnotify_rmdir(). Currently the only caller that cares is the one in debugfs, and there the order matching the normal syscalls makes more sense; it'll get more serious for users introduced later in the series. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-16x86,fs/resctrl: Remove inappropriate references to cacheinfo in the resctrl ↵Qinyun Tan
subsystem In the resctrl subsystem's Sub-NUMA Cluster (SNC) mode, the rdt_mon_domain structure representing a NUMA node relies on the cacheinfo interface (rdt_mon_domain::ci) to store L3 cache information (e.g., shared_cpu_map) for monitoring. The L3 cache information of a SNC NUMA node determines which domains are summed for the "top level" L3-scoped events. rdt_mon_domain::ci is initialized using the first online CPU of a NUMA node. When this CPU goes offline, its shared_cpu_map is cleared to contain only the offline CPU itself. Subsequently, attempting to read counters via smp_call_on_cpu(offline_cpu) fails (and error ignored), returning zero values for "top-level events" without any error indication. Replace the cacheinfo references in struct rdt_mon_domain and struct rmid_read with the cacheinfo ID (a unique identifier for the L3 cache). rdt_domain_hdr::cpu_mask contains the online CPUs associated with that domain. When reading "top-level events", select a CPU from rdt_domain_hdr::cpu_mask and utilize its L3 shared_cpu_map to determine valid CPUs for reading RMID counter via the MSR interface. Considering all CPUs associated with the L3 cache improves the chances of picking a housekeeping CPU on which the counter reading work can be queued, avoiding an unnecessary IPI. Fixes: 328ea68874642 ("x86/resctrl: Prepare for new Sub-NUMA Cluster (SNC) monitor files") Signed-off-by: Qinyun Tan <qinyuntan@linux.alibaba.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/20250530182053.37502-2-qinyuntan@linux.alibaba.com
2025-06-16Merge tag 'vfs-6.16-rc3.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix a regression in overlayfs caused by reworking the lookup_one*() set of helpers - Make sure that the name of the dentry is printed in overlayfs' mkdir() helper - Add missing iocb values to TRACE_IOCB_STRINGS define - Unlock the superblock during iterate_supers_type(). This was an accidental internal api change - Drop a misleading assert in file_seek_cur_needs_f_lock() helper - Never refuse to return PIDFD_GET_INGO when parent pid is zero That can trivially happen in container scenarios where the parent process might be located in an ancestor pid namespace - Don't revalidate in try_lookup_noperm() as that causes regression for filesystems such as cifs - Fix simple_xattr_list() and reset the err variable after security_inode_listsecurity() got called so as not to confuse userspace about the length of the xattr * tag 'vfs-6.16-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: drop assert in file_seek_cur_needs_f_lock fs: unlock the superblock during iterate_supers_type ovl: fix debug print in case of mkdir error VFS: change try_lookup_noperm() to skip revalidation fs: add missing values to TRACE_IOCB_STRINGS fs/xattr.c: fix simple_xattr_list() ovl: fix regression caused by lookup helpers API changes pidfs: never refuse ppid == 0 in PIDFD_GET_INFO
2025-06-16coredump: move core_pipe_count to global variableChristian Brauner
The pipe coredump counter is a static local variable instead of a global variable like all of the rest. Move it to a global variable so it's all consistent. Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-12-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-16coredump: prepare to simplify exit pathsChristian Brauner
The exit path is currently entangled with core pipe limit accounting which is really unpleasant. Use a local variable in struct core_name that remembers whether the count was incremented and if so to clean decrement in once the coredump is done. Assert that this only happens for pipes. Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-11-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-16coredump: split file coredumping into coredump_file()Christian Brauner
* Move that whole mess into a separate helper instead of having all that hanging around in vfs_coredump() directly. * Stop using that need_suid_safe variable and add an inline helper that clearly communicates what's going on everywhere consistently. The mm flag snapshot is stable and can't change so nothing's gained with that boolean. * Only setup cprm->file once everything else succeeded, using RAII for the coredump file before. That allows to don't care to what goto label we jump in vfs_coredump(). Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-10-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-16coredump: rename do_coredump() to vfs_coredump()Christian Brauner
Align the naming with the rest of our helpers exposed outside of core vfs. Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-9-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-16coredump: validate socket path in coredump_parse()Christian Brauner
properly again. Someone might have modified the buffer concurrently. Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-7-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-16coredump: don't allow ".." in coredump socket pathChristian Brauner
There's no point in allowing to walk upwards for the coredump socket. We already force userspace to give use a sane path, no symlinks, no magiclinks, and also block "..". Use an absolute path without any shenanigans. Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-6-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-16coredump: validate that path doesn't exceed UNIX_PATH_MAXChristian Brauner
so we don't pointlessly accepts things that go over the limit. Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-4-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-16coredump: fix socket path validationChristian Brauner
Make sure that we keep it extensible and well-formed. Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-3-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-16coredump: make coredump_parse() return boolChristian Brauner
There's no point in returning negative error values. They will never be seen by anyone. Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-2-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-16coredump: rename format_corename()Christian Brauner
It's not really about the name anymore. It parses very distinct information. Reflect that in the name. Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-1-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-16VFS: change old_dir and new_dir in struct renamedata to dentrysNeilBrown
all users of 'struct renamedata' have the dentry for the old and new directories, and often have no use for the inode except to store it in the renamedata. This patch changes struct renamedata to hold the dentry, rather than the inode, for the old and new directories, and changes callers to match. The names are also changed from a _dir suffix to _parent. This is consistent with other usage in namei.c and elsewhere. This results in the removal of several local variables and several dereferences of ->d_inode at the cost of adding ->d_inode dereferences to vfs_rename(). Acked-by: Miklos Szeredi <miklos@szeredi.hu> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/174977089072.608730.4244531834577097454@noble.neil.brown.name Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-16proc_fd_getattr(): don't bother with S_ISDIR() checkAl Viro
that thing is callable only as ->i_op->getattr() instance and only for directory inodes (/proc/*/fd and /proc/*/task/*/fd) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/20250615003321.GC3011112@ZenIV Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-16don't duplicate vfs_open() in kernel_file_open()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/20250615003216.GB3011112@ZenIV Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-16xfs: actually use the xfs_growfs_check_rtgeom tracepointDarrick J. Wong
We created a new tracepoint but forgot to put it in. Fix that. Cc: rostedt@goodmis.org Cc: stable@vger.kernel.org # v6.14 Fixes: 59a57acbce282d ("xfs: check that the rtrmapbt maxlevels doesn't increase when growing fs") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reported-by: Steven Rostedt <rostedt@goodmis.org> Closes: https://lore.kernel.org/all/20250612131021.114e6ec8@batman.local.home/ Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-06-16xfs: Improve error handling in xfs_mru_cache_create()Markus Elfring
Simplify error handling in this function implementation. * Delete unnecessary pointer checks and variable assignments. * Omit a redundant function call. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-06-16xfs: move xfs_submit_zoned_bio a bitChristoph Hellwig
Commit f3e2e53823b9 ("xfs: add inode to zone caching for data placement") add the new code right between xfs_submit_zoned_bio and xfs_zone_alloc_and_submit which implement the main zoned write path. Move xfs_submit_zoned_bio down to keep it together again. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-06-16xfs: use xfs_readonly_buftarg in xfs_remount_rwChristoph Hellwig
Use xfs_readonly_buftarg instead of open coding it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-06-16xfs: remove NULL pointer checks in xfs_mru_cache_insertChristoph Hellwig
Remove the check for a NULL mru or mru->list in xfs_mru_cache_insert as this API misused lead to a direct NULL pointer dereference on first use and is not user triggerable. As a smatch run by Dan points out with the recent cleanup it would otherwise try to free the object we just determined to be NULL for this impossible to reach case. Fixes: 70b95cb86513 ("xfs: free the item in xfs_mru_cache_insert on failure") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-06-16xfs: check for shutdown before going to sleep in xfs_select_zoneChristoph Hellwig
Ensure the file system hasn't been shut down before waiting for a free zone to become available, because that won't happen on a shut down file system. Without this processes can occasionally get stuck in the allocator wait loop when racing with a file system shutdown. This sporadically happens when running generic/388 or generic/475. Fixes: 4e4d52075577 ("xfs: add the zoned space allocator") Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-06-16fs: drop assert in file_seek_cur_needs_f_lockLuis Henriques
The assert in function file_seek_cur_needs_f_lock() can be triggered very easily because there are many users of vfs_llseek() (such as overlayfs) that do their custom locking around llseek instead of relying on fdget_pos(). Just drop the overzealous assertion. Fixes: da06e3c51794 ("fs: don't needlessly acquire f_lock") Suggested-by: Jan Kara <jack@suse.cz> Suggested-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Luis Henriques <luis@igalia.com> Link: https://lore.kernel.org/20250613101111.17716-1-luis@igalia.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-15bcachefs: fsck: fix extent past end of inode repairKent Overstreet
Fix the case where we're deleting in a different snapshot and need to emit a whiteout - that requires a regular BTREE_ITER_filter_snapshots iterator. Also, only delete the part of the extent that extents past i_size. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: fsck: fix add_inode()Kent Overstreet
the inode btree uses the offset field for the inum, not the inode field. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: Fix snapshot_key_missing_inode_snapshot repairKent Overstreet
When the inode was a whiteout, we were inserting a new whiteout at the wrong (old) snapshot. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: Fix "now allowing incompatible features" messageKent Overstreet
Check against version_incompat_allowed, not version_incompat. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: pass last_seq into fs_journal_start()Kent Overstreet
Prep work for journal rewind, where the seq we're replaying from may be different than the last journal entry's last_seq. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: better __bch2_snapshot_is_ancestor() assertKent Overstreet
Previously, we weren't checking the result of the skiplist walk, just the is_ancestor bitmap. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: btree_iter: fix updates, journal overlayKent Overstreet
We need to start searching from search_key - _not_ path->pos, which will point to the key we found in the btree Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: Fix bch2_journal_keys_peek_prev_min()Kent Overstreet
this code is rarely invoked, so - we had a few bugs left from basing it off of bch2_journal_keys_peek_max()... Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: Delay calculation of trans->journal_u64sAlan Huang
When there is commit error that need split btree leaf, fsck might change the value of trans->journal_entries.u64s, when retry commit, the value of trans->journal_u64s would be incorrect, which will lead to trans->journal_res.u64s underflow, and then out of bounds write will occur: [ 464.496970][T11969] Call trace: [ 464.496973][T11969] show_stack+0x3c/0x88 (C) [ 464.496995][T11969] dump_stack_lvl+0xf8/0x178 [ 464.497014][T11969] dump_stack+0x20/0x30 [ 464.497031][T11969] __bch2_trans_log_str+0x344/0x350 [ 464.497048][T11969] bch2_trans_log_str+0x3c/0x60 [ 464.497065][T11969] __bch2_fsck_err+0x11bc/0x1390 [ 464.497083][T11969] bch2_check_discard_freespace_key+0xad4/0x10d0 [ 464.497100][T11969] bch2_bucket_alloc_freelist+0x99c/0x1130 [ 464.497117][T11969] bch2_bucket_alloc_trans+0x79c/0xcb8 [ 464.497133][T11969] bch2_bucket_alloc_set_trans+0x378/0xc20 [ 464.497151][T11969] __open_bucket_add_buckets+0x7fc/0x1c00 [ 464.497168][T11969] open_bucket_add_buckets+0x184/0x3a8 [ 464.497185][T11969] bch2_alloc_sectors_start_trans+0xa04/0x1da0 [ 464.497203][T11969] bch2_btree_reserve_get+0x6e0/0xef0 [ 464.497220][T11969] bch2_btree_update_start+0x1618/0x2600 [ 464.497239][T11969] bch2_btree_split_leaf+0xcc/0x730 [ 464.497258][T11969] bch2_trans_commit_error+0x22c/0xc30 [ 464.497276][T11969] __bch2_trans_commit+0x207c/0x4e30 [ 464.497292][T11969] bch2_journal_replay+0x9e0/0x1420 [ 464.497305][T11969] __bch2_run_recovery_passes+0x458/0xf98 [ 464.497318][T11969] bch2_run_recovery_passes+0x280/0x478 [ 464.497331][T11969] bch2_fs_recovery+0x24f0/0x3a28 [ 464.497344][T11969] bch2_fs_start+0xb80/0x1248 [ 464.497358][T11969] bch2_fs_get_tree+0xe94/0x1708 [ 464.497377][T11969] vfs_get_tree+0x84/0x2d0 Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: Add missing EBUG_ONAlan Huang
Just like the EBUG_ON in bch2_journal_add_entry(). Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: Fix alloc_req use after freeAlan Huang
Now the alloc_req is allocated from the bump allocator, if there is reallocation, the memory of alloc_req would be frees, fix by delaying the reallocation to transaction restart, it has to restart anyway. Reported-by: syzbot+2887a13a5c387e616a68@syzkaller.appspotmail.com Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: Don't allocate new memory when mempool is exhaustedAlan Huang
Allocating new memory when mempool is exhausted is too complicated, just return ENOMEM is fine. memcpy is not needed, since there might be pointers point to the old memory, that's the bug. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: btree iter tracepointsKent Overstreet
We've been seeing some livelock-ish behavior in the index update part of the main write path, and while we've got low level btree path tracepoints, we've been lacking high level btree iterator tracepoints. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: trace_extent_trim_atomicKent Overstreet
Add a tracepoint for when we insert only part of an extent, due to too many overwrites. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-14Merge tag 'v6.16-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull smb client fixes from Steve French: - SMB3.1.1 POSIX extensions fix for char remapping - Fix for repeated directory listings when directory leases enabled - deferred close handle reuse fix * tag 'v6.16-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb: improve directory cache reuse for readdir operations smb: client: fix perf regression with deferred closes smb: client: disable path remapping with POSIX extensions
2025-06-13ext2: Handle fiemap on empty files to prevent EINVALWei Gao
Previously, ext2_fiemap would unconditionally apply "len = min_t(u64, len, i_size_read(inode));", When inode->i_size was 0 (for an empty file), this would reduce the requested len to 0. Passing len = 0 to iomap_fiemap could then result in an -EINVAL error, even for valid queries on empty files. Link: https://github.com/linux-test-project/ltp/issues/1246 Signed-off-by: Wei Gao <wegao@suse.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250613152402.3432135-1-wegao@suse.com
2025-06-13zonefs: use ZONEFS_SUPER_SIZE instead of PAGE_SIZEJohannes Thumshirn
Use ZONEFS_SUPER_SIZE constant instead of PAGE_SIZE allocating memory for reading the super block in zonefs_read_super(). While PAGE_SIZE technically isn't incorrect as Linux doesn't support pages smaller than 4k ZONEFS_SUPER_SIZE is semantically more correct. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
2025-06-12NFSD: Avoid corruption of a referring call listChuck Lever
The new code neglects to remove a freshly-allocated RCL from the callback's referring call list when no matching referring call is found. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202505171002.cE46sdj5-lkp@intel.com/ Fixes: 4f3c8d8c9e10 ("NFSD: Implement CB_SEQUENCE referring call lists") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-06-12smb: improve directory cache reuse for readdir operationsBharath SM
Currently, cached directory contents were not reused across subsequent 'ls' operations because the cache validity check relied on comparing the ctx pointer, which changes with each readdir invocation. As a result, the cached dir entries was not marked as valid and the cache was not utilized for subsequent 'ls' operations. This change uses the file pointer, which remains consistent across all readdir calls for a given directory instance, to associate and validate the cache. As a result, cached directory contents can now be correctly reused, improving performance for repeated directory listings. Performance gains with local windows SMB server: Without the patch and default actimeo=1: 1000 directory enumeration operations on dir with 10k files took 135.0s With this patch and actimeo=0: 1000 directory enumeration operations on dir with 10k files took just 5.1s Signed-off-by: Bharath SM <bharathsm@microsoft.com> Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2025-06-12smb: client: fix perf regression with deferred closesPaulo Alcantara
Customer reported that one of their applications started failing to open files with STATUS_INSUFFICIENT_RESOURCES due to NetApp server hitting the maximum number of opens to same file that it would allow for a single client connection. It turned out the client was failing to reuse open handles with deferred closes because matching ->f_flags directly without masking off O_CREAT|O_EXCL|O_TRUNC bits first broke the comparision and then client ended up with thousands of deferred closes to same file. Those bits are already satisfied on the original open, so no need to check them against existing open handles. Reproducer: #include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #include <fcntl.h> #include <pthread.h> #define NR_THREADS 4 #define NR_ITERATIONS 2500 #define TEST_FILE "/mnt/1/test/dir/foo" static char buf[64]; static void *worker(void *arg) { int i, j; int fd; for (i = 0; i < NR_ITERATIONS; i++) { fd = open(TEST_FILE, O_WRONLY|O_CREAT|O_APPEND, 0666); for (j = 0; j < 16; j++) write(fd, buf, sizeof(buf)); close(fd); } } int main(int argc, char *argv[]) { pthread_t t[NR_THREADS]; int fd; int i; fd = open(TEST_FILE, O_WRONLY|O_CREAT|O_TRUNC, 0666); close(fd); memset(buf, 'a', sizeof(buf)); for (i = 0; i < NR_THREADS; i++) pthread_create(&t[i], NULL, worker, NULL); for (i = 0; i < NR_THREADS; i++) pthread_join(t[i], NULL); return 0; } Before patch: $ mount.cifs //srv/share /mnt/1 -o ... $ mkdir -p /mnt/1/test/dir $ gcc repro.c && ./a.out ... number of opens: 1391 After patch: $ mount.cifs //srv/share /mnt/1 -o ... $ mkdir -p /mnt/1/test/dir $ gcc repro.c && ./a.out ... number of opens: 1 Cc: linux-cifs@vger.kernel.org Cc: David Howells <dhowells@redhat.com> Cc: Jay Shin <jaeshin@redhat.com> Cc: Pierguido Lambri <plambri@redhat.com> Fixes: b8ea3b1ff544 ("smb: enable reuse of deferred file handles for write operations") Acked-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: Steve French <stfrench@microsoft.com>