summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2025-05-29smb: client: Store original IO parameters and prevent zero IO sizesWang Zhaolong
[ Upstream commit 287906b20035a04a234d1a3c64f760a5678387be ] During mount option processing and negotiation with the server, the original user-specified rsize/wsize values were being modified directly. This makes it impossible to recover these values after a connection reset, leading to potential degraded performance after reconnection. The other problem is that When negotiating read and write sizes, there are cases where the negotiated values might calculate to zero, especially during reconnection when server->max_read or server->max_write might be reset. In general, these values come from the negotiation response. According to MS-SMB2 specification, these values should be at least 65536 bytes. This patch improves IO parameter handling: 1. Adds vol_rsize and vol_wsize fields to store the original user-specified values separately from the negotiated values 2. Uses got_rsize/got_wsize flags to determine if values were user-specified rather than checking for non-zero values, which is more reliable 3. Adds a prevent_zero_iosize() helper function to ensure IO sizes are never negotiated down to zero, which could happen in edge cases like when server->max_read/write is zero The changes make the CIFS client more resilient to unusual server responses and reconnection scenarios, preventing potential failures when IO sizes are calculated to be zero. Signed-off-by: Wang Zhaolong <wangzhaolong1@huawei.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29cifs: Fix negotiate retry functionalityPali Rohár
[ Upstream commit e94e882a6d69525c07589222cf3a6ff57ad12b5b ] SMB negotiate retry functionality in cifs_negotiate() is currently broken and does not work when doing socket reconnect. Caller of this function, which is cifs_negotiate_protocol() requires that tcpStatus after successful execution of negotiate callback stay in CifsInNegotiate. But if the CIFSSMBNegotiate() called from cifs_negotiate() fails due to connection issues then tcpStatus is changed as so repeated CIFSSMBNegotiate() call does not help. Fix this problem by moving retrying code from negotiate callback (which is either cifs_negotiate() or smb2_negotiate()) to cifs_negotiate_protocol() which is caller of those callbacks. This allows to properly handle and implement correct transistions between tcpStatus states as function cifs_negotiate_protocol() already handles it. With this change, cifs_negotiate_protocol() now handles also -EAGAIN error set by the RFC1002_NEGATIVE_SESSION_RESPONSE processing after reconnecting with NetBIOS session. Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29cifs: Fix querying and creating MF symlinks over SMB1Pali Rohár
[ Upstream commit 4236ac9fe5b8b42756070d4abfb76fed718e87c2 ] Old SMB1 servers without CAP_NT_SMBS do not support CIFS_open() function and instead SMBLegacyOpen() needs to be used. This logic is already handled in cifs_open_file() function, which is server->ops->open callback function. So for querying and creating MF symlinks use open callback function instead of CIFS_open() function directly. This change fixes querying and creating new MF symlinks on Windows 98. Currently cifs_query_mf_symlink() is not able to detect MF symlink and cifs_create_mf_symlink() is failing with EIO error. Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29cifs: Add fallback for SMB2 CREATE without FILE_READ_ATTRIBUTESPali Rohár
[ Upstream commit e255612b5ed9f179abe8196df7c2ba09dd227900 ] Some operations, like WRITE, does not require FILE_READ_ATTRIBUTES access. So when FILE_READ_ATTRIBUTES is not explicitly requested for smb2_open_file() then first try to do SMB2 CREATE with FILE_READ_ATTRIBUTES access (like it was before) and then fallback to SMB2 CREATE without FILE_READ_ATTRIBUTES access (less common case). This change allows to complete WRITE operation to a file when it does not grant FILE_READ_ATTRIBUTES permission and its parent directory does not grant READ_DATA permission (parent directory READ_DATA is implicit grant of child FILE_READ_ATTRIBUTES permission). Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29fs/ext4: use sleeping version of sb_find_get_block()Davidlohr Bueso
[ Upstream commit 6e8f57fd09c9fb569d10b2ccc3878155b702591a ] Enable ext4_free_blocks() to use it, which has a cond_resched to begin with. Convert to the new nonatomic flavor to benefit from potential performance benefits and adapt in the future vs migration such that semantics are kept. Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Link: https://kdevops.org/ext4/v6.15-rc2.html # [0] Link: https://lore.kernel.org/all/aAAEvcrmREWa1SKF@bombadil.infradead.org/ # [1] Link: https://lore.kernel.org/20250418015921.132400-7-dave@stgolabs.net Tested-by: kdevops@lists.linux.dev Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29fs/jbd2: use sleeping version of __find_get_block()Davidlohr Bueso
[ Upstream commit f76d4c28a46a9260d85e00dafc8f46d369365d33 ] Convert to the new nonatomic flavor to benefit from potential performance benefits and adapt in the future vs migration such that semantics are kept. - jbd2_journal_revoke(): can sleep (has might_sleep() in the beginning) - jbd2_journal_cancel_revoke(): only used from do_get_write_access() and do_get_create_access() which do sleep. So can sleep. - jbd2_clear_buffer_revoked_flags() - only called from journal commit code which sleeps. So can sleep. Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Link: https://kdevops.org/ext4/v6.15-rc2.html # [0] Link: https://lore.kernel.org/all/aAAEvcrmREWa1SKF@bombadil.infradead.org/ # [1] Link: https://lore.kernel.org/20250418015921.132400-6-dave@stgolabs.net Tested-by: kdevops@lists.linux.dev Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29fs/ocfs2: use sleeping version of __find_get_block()Davidlohr Bueso
[ Upstream commit a0b5ff07491010789fcb012bc8f9dad9d26f9a8b ] This is a path that allows for blocking as it does IO. Convert to the new nonatomic flavor to benefit from potential performance benefits and adapt in the future vs migration such that semantics are kept. Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Link: https://kdevops.org/ext4/v6.15-rc2.html # [0] Link: https://lore.kernel.org/all/aAAEvcrmREWa1SKF@bombadil.infradead.org/ # [1] Link: https://lore.kernel.org/20250418015921.132400-5-dave@stgolabs.net Tested-by: kdevops@lists.linux.dev Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29fs/buffer: use sleeping version of __find_get_block()Davidlohr Bueso
[ Upstream commit 5b67d43976828dea2394eae2556b369bb7a61f64 ] Convert to the new nonatomic flavor to benefit from potential performance benefits and adapt in the future vs migration such that semantics are kept. Convert write_boundary_block() which already takes the buffer lock as well as bdev_getblk() depending on the respective gpf flags. There are no changes in semantics. Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Link: https://kdevops.org/ext4/v6.15-rc2.html # [0] Link: https://lore.kernel.org/all/aAAEvcrmREWa1SKF@bombadil.infradead.org/ # [1] Link: https://lore.kernel.org/20250418015921.132400-4-dave@stgolabs.net Tested-by: kdevops@lists.linux.dev # [0] [1] Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29fs/buffer: introduce sleeping flavors for pagecache lookupsDavidlohr Bueso
[ Upstream commit 2814a7d3d2ff5d2cdd22936f641f758fdb971fa0 ] Add __find_get_block_nonatomic() and sb_find_get_block_nonatomic() calls for which users will be converted where safe. These versions will take the folio lock instead of the mapping's private_lock. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Link: https://kdevops.org/ext4/v6.15-rc2.html # [0] Link: https://lore.kernel.org/all/aAAEvcrmREWa1SKF@bombadil.infradead.org/ # [1] Link: https://lore.kernel.org/20250418015921.132400-3-dave@stgolabs.net Tested-by: kdevops@lists.linux.dev Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29fs/buffer: split locking for pagecache lookupsDavidlohr Bueso
[ Upstream commit 7ffe3de53a885dbb5836541c2178bd07d1bad7df ] Callers of __find_get_block() may or may not allow for blocking semantics, and is currently assumed that it will not. Layout two paths based on this. The the private_lock scheme will continued to be used for atomic contexts. Otherwise take the folio lock instead, which protects the buffers, such as vs migration and try_to_free_buffers(). Per the "hack idea", the latter can alleviate contention on the private_lock for bdev mappings. For reasons of determinism and avoid making bugs hard to reproduce, the trylocking is not attempted. No change in semantics. All lookup users still take the spinlock. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Link: https://kdevops.org/ext4/v6.15-rc2.html # [0] Link: https://lore.kernel.org/all/aAAEvcrmREWa1SKF@bombadil.infradead.org/ # [1] Link: https://lore.kernel.org/20250418015921.132400-2-dave@stgolabs.net Tested-by: kdevops@lists.linux.dev Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-22CIFS: New mount option for cifs.upcall namespace resolutionRitvik Budhiraja
commit db363b0a1d9e6b9dc556296f1b1007aeb496a8cf upstream. In the current implementation, the SMB filesystem on a mount point can trigger upcalls from the kernel to the userspace to enable certain functionalities like spnego, dns_resolution, amongst others. These upcalls usually either happen in the context of the mount or in the context of an application/user. The upcall handler for cifs, cifs.upcall already has existing code which switches the namespaces to the caller's namespace before handling the upcall. This behaviour is expected for scenarios like multiuser mounts, but might not cover all single user scenario with services such as Kubernetes, where the mount can happen from different locations such as on the host, from an app container, or a driver pod which does the mount on behalf of a different pod. This patch introduces a new mount option called upcall_target, to customise the upcall behaviour. upcall_target can take 'mount' and 'app' as possible values. This aids use cases like Kubernetes where the mount happens on behalf of the application in another container altogether. Having this new mount option allows the mount command to specify where the upcall should happen: 'mount' for resolving the upcall to the host namespace, and 'app' for resolving the upcall to the ns of the calling thread. This will enable both the scenarios where the Kerberos credentials can be found on the application namespace or the host namespace to which just the mount operation is "delegated". Reviewed-by: Shyam Prasad <shyam.prasad@microsoft.com> Reviewed-by: Bharath S M <bharathsm@microsoft.com> Reviewed-by: Ronnie Sahlberg <ronniesahlberg@gmail.com> Signed-off-by: Ritvik Budhiraja <rbudhiraja@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Cc: Salvatore Bonaccorso <carnil@debian.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-22smb: client: fix memory leak during error handling for POSIX mkdirJethro Donaldson
commit 1fe4a44b7fa3955bcb7b4067c07b778fe90d8ee7 upstream. The response buffer for the CREATE request handled by smb311_posix_mkdir() is leaked on the error path (goto err_free_rsp_buf) because the structure pointer *rsp passed to free_rsp_buf() is not assigned until *after* the error condition is checked. As *rsp is initialised to NULL, free_rsp_buf() becomes a no-op and the leak is instead reported by __kmem_cache_shutdown() upon subsequent rmmod of cifs.ko if (and only if) the error path has been hit. Pass rsp_iov.iov_base to free_rsp_buf() instead, similar to the code in other functions in smb2pdu.c for which *rsp is assigned late. Cc: stable@vger.kernel.org Signed-off-by: Jethro Donaldson <devel@jro.nz> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-22btrfs: add back warning for mount option commit values exceeding 300Kyoji Ogasawara
commit 4ce2affc6ef9f84b4aebbf18bd5c57397b6024eb upstream. The Btrfs documentation states that if the commit value is greater than 300 a warning should be issued. The warning was accidentally lost in the new mount API update. Fixes: 6941823cc878 ("btrfs: remove old mount API code") CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Kyoji Ogasawara <sawara04.o@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-22btrfs: fix folio leak in submit_one_async_extent()Boris Burkov
commit a0fd1c6098633f9a95fc2f636383546c82b704c3 upstream. If btrfs_reserve_extent() fails while submitting an async_extent for a compressed write, then we fail to call free_async_extent_pages() on the async_extent and leak its folios. A likely cause for such a failure would be btrfs_reserve_extent() failing to find a large enough contiguous free extent for the compressed extent. I was able to reproduce this by: 1. mount with compress-force=zstd:3 2. fallocating most of a filesystem to a big file 3. fragmenting the remaining free space 4. trying to copy in a file which zstd would generate large compressed extents for (vmlinux worked well for this) Step 4. hits the memory leak and can be repeated ad nauseam to eventually exhaust the system memory. Fix this by detecting the case where we fallback to uncompressed submission for a compressed async_extent and ensuring that we call free_async_extent_pages(). Fixes: 131a821a243f ("btrfs: fallback if compressed IO fails for ENOSPC") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Co-developed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-22btrfs: fix discard worker infinite loop after disabling discardFilipe Manana
commit 54db6d1bdd71fa90172a2a6aca3308bbf7fa7eb5 upstream. If the discard worker is running and there's currently only one block group, that block group is a data block group, it's in the unused block groups discard list and is being used (it got an extent allocated from it after becoming unused), the worker can end up in an infinite loop if a transaction abort happens or the async discard is disabled (during remount or unmount for example). This happens like this: 1) Task A, the discard worker, is at peek_discard_list() and find_next_block_group() returns block group X; 2) Block group X is in the unused block groups discard list (its discard index is BTRFS_DISCARD_INDEX_UNUSED) since at some point in the past it become an unused block group and was added to that list, but then later it got an extent allocated from it, so its ->used counter is not zero anymore; 3) The current transaction is aborted by task B and we end up at __btrfs_handle_fs_error() in the transaction abort path, where we call btrfs_discard_stop(), which clears BTRFS_FS_DISCARD_RUNNING from fs_info, and then at __btrfs_handle_fs_error() we set the fs to RO mode (setting SB_RDONLY in the super block's s_flags field); 4) Task A calls __add_to_discard_list() with the goal of moving the block group from the unused block groups discard list into another discard list, but at __add_to_discard_list() we end up doing nothing because btrfs_run_discard_work() returns false, since the super block has SB_RDONLY set in its flags and BTRFS_FS_DISCARD_RUNNING is not set anymore in fs_info->flags. So block group X remains in the unused block groups discard list; 5) Task A then does a goto into the 'again' label, calls find_next_block_group() again we gets block group X again. Then it repeats the previous steps over and over since there are not other block groups in the discard lists and block group X is never moved out of the unused block groups discard list since btrfs_run_discard_work() keeps returning false and therefore __add_to_discard_list() doesn't move block group X out of that discard list. When this happens we can get a soft lockup report like this: [71.957] watchdog: BUG: soft lockup - CPU#0 stuck for 27s! [kworker/u4:3:97] [71.957] Modules linked in: xfs af_packet rfkill (...) [71.957] CPU: 0 UID: 0 PID: 97 Comm: kworker/u4:3 Tainted: G W 6.14.2-1-default #1 openSUSE Tumbleweed 968795ef2b1407352128b466fe887416c33af6fa [71.957] Tainted: [W]=WARN [71.957] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014 [71.957] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs] [71.957] RIP: 0010:btrfs_discard_workfn+0xc4/0x400 [btrfs] [71.957] Code: c1 01 48 83 (...) [71.957] RSP: 0018:ffffafaec03efe08 EFLAGS: 00000246 [71.957] RAX: ffff897045500000 RBX: ffff8970413ed8d0 RCX: 0000000000000000 [71.957] RDX: 0000000000000001 RSI: ffff8970413ed8d0 RDI: 0000000a8f1272ad [71.957] RBP: 0000000a9d61c60e R08: ffff897045500140 R09: 8080808080808080 [71.957] R10: ffff897040276800 R11: fefefefefefefeff R12: ffff8970413ed860 [71.957] R13: ffff897045500000 R14: ffff8970413ed868 R15: 0000000000000000 [71.957] FS: 0000000000000000(0000) GS:ffff89707bc00000(0000) knlGS:0000000000000000 [71.957] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [71.957] CR2: 00005605bcc8d2f0 CR3: 000000010376a001 CR4: 0000000000770ef0 [71.957] PKRU: 55555554 [71.957] Call Trace: [71.957] <TASK> [71.957] process_one_work+0x17e/0x330 [71.957] worker_thread+0x2ce/0x3f0 [71.957] ? __pfx_worker_thread+0x10/0x10 [71.957] kthread+0xef/0x220 [71.957] ? __pfx_kthread+0x10/0x10 [71.957] ret_from_fork+0x34/0x50 [71.957] ? __pfx_kthread+0x10/0x10 [71.957] ret_from_fork_asm+0x1a/0x30 [71.957] </TASK> [71.957] Kernel panic - not syncing: softlockup: hung tasks [71.987] CPU: 0 UID: 0 PID: 97 Comm: kworker/u4:3 Tainted: G W L 6.14.2-1-default #1 openSUSE Tumbleweed 968795ef2b1407352128b466fe887416c33af6fa [71.989] Tainted: [W]=WARN, [L]=SOFTLOCKUP [71.989] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014 [71.991] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs] [71.992] Call Trace: [71.993] <IRQ> [71.994] dump_stack_lvl+0x5a/0x80 [71.994] panic+0x10b/0x2da [71.995] watchdog_timer_fn.cold+0x9a/0xa1 [71.996] ? __pfx_watchdog_timer_fn+0x10/0x10 [71.997] __hrtimer_run_queues+0x132/0x2a0 [71.997] hrtimer_interrupt+0xff/0x230 [71.998] __sysvec_apic_timer_interrupt+0x55/0x100 [71.999] sysvec_apic_timer_interrupt+0x6c/0x90 [72.000] </IRQ> [72.000] <TASK> [72.001] asm_sysvec_apic_timer_interrupt+0x1a/0x20 [72.002] RIP: 0010:btrfs_discard_workfn+0xc4/0x400 [btrfs] [72.002] Code: c1 01 48 83 (...) [72.005] RSP: 0018:ffffafaec03efe08 EFLAGS: 00000246 [72.006] RAX: ffff897045500000 RBX: ffff8970413ed8d0 RCX: 0000000000000000 [72.006] RDX: 0000000000000001 RSI: ffff8970413ed8d0 RDI: 0000000a8f1272ad [72.007] RBP: 0000000a9d61c60e R08: ffff897045500140 R09: 8080808080808080 [72.008] R10: ffff897040276800 R11: fefefefefefefeff R12: ffff8970413ed860 [72.009] R13: ffff897045500000 R14: ffff8970413ed868 R15: 0000000000000000 [72.010] ? btrfs_discard_workfn+0x51/0x400 [btrfs 23b01089228eb964071fb7ca156eee8cd3bf996f] [72.011] process_one_work+0x17e/0x330 [72.012] worker_thread+0x2ce/0x3f0 [72.013] ? __pfx_worker_thread+0x10/0x10 [72.014] kthread+0xef/0x220 [72.014] ? __pfx_kthread+0x10/0x10 [72.015] ret_from_fork+0x34/0x50 [72.015] ? __pfx_kthread+0x10/0x10 [72.016] ret_from_fork_asm+0x1a/0x30 [72.017] </TASK> [72.017] Kernel Offset: 0x15000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) [72.019] Rebooting in 90 seconds.. So fix this by making sure we move a block group out of the unused block groups discard list when calling __add_to_discard_list(). Fixes: 2bee7eb8bb81 ("btrfs: discard one region at a time in async discard") Link: https://bugzilla.suse.com/show_bug.cgi?id=1242012 CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-22udf: Make sure i_lenExtents is uptodate on inode evictionJan Kara
commit 55dd5b4db3bf04cf077a8d1712f6295d4517c337 upstream. UDF maintains total length of all extents in i_lenExtents. Generally we keep extent lengths (and thus i_lenExtents) block aligned because it makes the file appending logic simpler. However the standard mandates that the inode size must match the length of all extents and thus we trim the last extent when closing the file. To catch possible bugs we also verify that i_lenExtents matches i_size when evicting inode from memory. Commit b405c1e58b73 ("udf: refactor udf_next_aext() to handle error") however broke the code updating i_lenExtents and thus udf_evict_inode() ended up spewing lots of errors about incorrectly sized extents although the extents were actually sized properly. Fix the updating of i_lenExtents to silence the errors. Fixes: b405c1e58b73 ("udf: refactor udf_next_aext() to handle error") CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-22NFSv4/pnfs: Reset the layout state after a layoutreturnTrond Myklebust
[ Upstream commit 6d6d7f91cc8c111d40416ac9240a3bb9396c5235 ] If there are still layout segments in the layout plh_return_lsegs list after a layout return, we should be resetting the state to ensure they eventually get returned as well. Fixes: 68f744797edd ("pNFS: Do not free layout segments that are marked for return") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-22nfs: handle failure of nfs_get_lock_context in unlock pathLi Lingfeng
[ Upstream commit c457dc1ec770a22636b473ce5d35614adfe97636 ] When memory is insufficient, the allocation of nfs_lock_context in nfs_get_lock_context() fails and returns -ENOMEM. If we mistakenly treat an nfs4_unlockdata structure (whose l_ctx member has been set to -ENOMEM) as valid and proceed to execute rpc_run_task(), this will trigger a NULL pointer dereference in nfs4_locku_prepare. For example: BUG: kernel NULL pointer dereference, address: 000000000000000c PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP PTI CPU: 15 UID: 0 PID: 12 Comm: kworker/u64:0 Not tainted 6.15.0-rc2-dirty #60 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 Workqueue: rpciod rpc_async_schedule RIP: 0010:nfs4_locku_prepare+0x35/0xc2 Code: 89 f2 48 89 fd 48 c7 c7 68 69 ef b5 53 48 8b 8e 90 00 00 00 48 89 f3 RSP: 0018:ffffbbafc006bdb8 EFLAGS: 00010246 RAX: 000000000000004b RBX: ffff9b964fc1fa00 RCX: 0000000000000000 RDX: 0000000000000000 RSI: fffffffffffffff4 RDI: ffff9ba53fddbf40 RBP: ffff9ba539934000 R08: 0000000000000000 R09: ffffbbafc006bc38 R10: ffffffffb6b689c8 R11: 0000000000000003 R12: ffff9ba539934030 R13: 0000000000000001 R14: 0000000004248060 R15: ffffffffb56d1c30 FS: 0000000000000000(0000) GS:ffff9ba5881f0000(0000) knlGS:00000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000000000c CR3: 000000093f244000 CR4: 00000000000006f0 Call Trace: <TASK> __rpc_execute+0xbc/0x480 rpc_async_schedule+0x2f/0x40 process_one_work+0x232/0x5d0 worker_thread+0x1da/0x3d0 ? __pfx_worker_thread+0x10/0x10 kthread+0x10d/0x240 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x34/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Modules linked in: CR2: 000000000000000c ---[ end trace 0000000000000000 ]--- Free the allocated nfs4_unlockdata when nfs_get_lock_context() fails and return NULL to terminate subsequent rpc_run_task, preventing NULL pointer dereference. Fixes: f30cb757f680 ("NFS: Always wait for I/O completion before unlock") Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20250417072508.3850532-1-lilingfeng3@huawei.com Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-22binfmt_elf: Move brk for static PIE even if ASLR disabledKees Cook
[ Upstream commit 11854fe263eb1b9a8efa33b0c087add7719ea9b4 ] In commit bbdc6076d2e5 ("binfmt_elf: move brk out of mmap when doing direct loader exec"), the brk was moved out of the mmap region when loading static PIE binaries (ET_DYN without INTERP). The common case for these binaries was testing new ELF loaders, so the brk needed to be away from mmap to avoid colliding with stack, future mmaps (of the loader-loaded binary), etc. But this was only done when ASLR was enabled, in an attempt to minimize changes to memory layouts. After adding support to respect alignment requirements for static PIE binaries in commit 3545deff0ec7 ("binfmt_elf: Honor PT_LOAD alignment for static PIE"), it became possible to have a large gap after the final PT_LOAD segment and the top of the mmap region. This means that future mmap allocations might go after the last PT_LOAD segment (where brk might be if ASLR was disabled) instead of before them (where they traditionally ended up). On arm64, running with ASLR disabled, Ubuntu 22.04's "ldconfig" binary, a static PIE, has alignment requirements that leaves a gap large enough after the last PT_LOAD segment to fit the vdso and vvar, but still leave enough space for the brk (which immediately follows the last PT_LOAD segment) to be allocated by the binary. fffff7f20000-fffff7fde000 r-xp 00000000 fe:02 8110426 /sbin/ldconfig.real fffff7fee000-fffff7ff5000 rw-p 000be000 fe:02 8110426 /sbin/ldconfig.real fffff7ff5000-fffff7ffa000 rw-p 00000000 00:00 0 ***[brk will go here at fffff7ffa000]*** fffff7ffc000-fffff7ffe000 r--p 00000000 00:00 0 [vvar] fffff7ffe000-fffff8000000 r-xp 00000000 00:00 0 [vdso] fffffffdf000-1000000000000 rw-p 00000000 00:00 0 [stack] After commit 0b3bc3354eb9 ("arm64: vdso: Switch to generic storage implementation"), the arm64 vvar grew slightly, and suddenly the brk collided with the allocation. fffff7f20000-fffff7fde000 r-xp 00000000 fe:02 8110426 /sbin/ldconfig.real fffff7fee000-fffff7ff5000 rw-p 000be000 fe:02 8110426 /sbin/ldconfig.real fffff7ff5000-fffff7ffa000 rw-p 00000000 00:00 0 ***[oops, no room any more, vvar is at fffff7ffa000!]*** fffff7ffa000-fffff7ffe000 r--p 00000000 00:00 0 [vvar] fffff7ffe000-fffff8000000 r-xp 00000000 00:00 0 [vdso] fffffffdf000-1000000000000 rw-p 00000000 00:00 0 [stack] The solution is to unconditionally move the brk out of the mmap region for static PIE binaries. Whether ASLR is enabled or not does not change if there may be future mmap allocation collisions with a growing brk region. Update memory layout comments (with kernel-doc headings), consolidate the setting of mm->brk to later (it isn't needed early), move static PIE brk out of mmap unconditionally, and make sure brk(2) knows to base brk position off of mm->start_brk not mm->end_data no matter what the cause of moving it is (via current->brk_randomized). For the CONFIG_COMPAT_BRK case, though, leave the logic unchanged, as we can never safely move the brk. These systems, however, are not using specially aligned static PIE binaries. Reported-by: Ryan Roberts <ryan.roberts@arm.com> Closes: https://lore.kernel.org/lkml/f93db308-4a0e-4806-9faf-98f890f5a5e6@arm.com/ Fixes: bbdc6076d2e5 ("binfmt_elf: move brk out of mmap when doing direct loader exec") Link: https://lore.kernel.org/r/20250425224502.work.520-kees@kernel.org Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-22fs/xattr.c: fix simple_xattr_list to always include security.* xattrsStephen Smalley
[ Upstream commit 8b0ba61df5a1c44e2b3cf683831a4fc5e24ea99d ] The vfs has long had a fallback to obtain the security.* xattrs from the LSM when the filesystem does not implement its own listxattr, but shmem/tmpfs and kernfs later gained their own xattr handlers to support other xattrs. Unfortunately, as a side effect, tmpfs and kernfs-based filesystems like sysfs no longer return the synthetic security.* xattr names via listxattr unless they are explicitly set by userspace or initially set upon inode creation after policy load. coreutils has recently switched from unconditionally invoking getxattr for security.* for ls -Z via libselinux to only doing so if listxattr returns the xattr name, breaking ls -Z of such inodes. Before: $ getfattr -m.* /run/initramfs <no output> $ getfattr -m.* /sys/kernel/fscaps <no output> $ setfattr -n user.foo /run/initramfs $ getfattr -m.* /run/initramfs user.foo After: $ getfattr -m.* /run/initramfs security.selinux $ getfattr -m.* /sys/kernel/fscaps security.selinux $ setfattr -n user.foo /run/initramfs $ getfattr -m.* /run/initramfs security.selinux user.foo Link: https://lore.kernel.org/selinux/CAFqZXNtF8wDyQajPCdGn=iOawX4y77ph0EcfcqcUUj+T87FKyA@mail.gmail.com/ Link: https://lore.kernel.org/selinux/20250423175728.3185-2-stephen.smalley.work@gmail.com/ Signed-off-by: Stephen Smalley <stephen.smalley.work@gmail.com> Link: https://lore.kernel.org/20250424152822.2719-1-stephen.smalley.work@gmail.com Fixes: b09e0fa4b4ea66266058ee ("tmpfs: implement generic xattr support") Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-18do_umount(): add missing barrier before refcount checks in sync caseAl Viro
[ Upstream commit 65781e19dcfcb4aed1167d87a3ffcc2a0c071d47 ] do_umount() analogue of the race fixed in 119e1ef80ecf "fix __legitimize_mnt()/mntput() race". Here we want to make sure that if __legitimize_mnt() doesn't notice our lock_mount_hash(), we will notice their refcount increment. Harder to hit than mntput_no_expire() one, fortunately, and consequences are milder (sync umount acting like umount -l on a rare race with RCU pathwalk hitting at just the wrong time instead of use-after-free galore mntput_no_expire() counterpart used to be hit). Still a bug... Fixes: 48a066e72d97 ("RCU'd vfsmounts") Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-18ocfs2: stop quota recovery before disabling quotasJan Kara
commit fcaf3b2683b05a9684acdebda706a12025a6927a upstream. Currently quota recovery is synchronized with unmount using sb->s_umount semaphore. That is however prone to deadlocks because flush_workqueue(osb->ocfs2_wq) called from umount code can wait for quota recovery to complete while ocfs2_finish_quota_recovery() waits for sb->s_umount semaphore. Grabbing of sb->s_umount semaphore in ocfs2_finish_quota_recovery() is only needed to protect that function from disabling of quotas from ocfs2_dismount_volume(). Handle this problem by disabling quota recovery early during unmount in ocfs2_dismount_volume() instead so that we can drop acquisition of sb->s_umount from ocfs2_finish_quota_recovery(). Link: https://lkml.kernel.org/r/20250424134515.18933-6-jack@suse.cz Fixes: 5f530de63cfc ("ocfs2: Use s_umount for quota recovery protection") Signed-off-by: Jan Kara <jack@suse.cz> Reported-by: Shichangkuo <shi.changkuo@h3c.com> Reported-by: Murad Masimov <m.masimov@mt-integration.ru> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Tested-by: Heming Zhao <heming.zhao@suse.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-18ocfs2: implement handshaking with ocfs2 recovery threadJan Kara
commit 8f947e0fd595951460f5a6e1ac29baa82fa02eab upstream. We will need ocfs2 recovery thread to acknowledge transitions of recovery_state when disabling particular types of recovery. This is similar to what currently happens when disabling recovery completely, just more general. Implement the handshake and use it for exit from recovery. Link: https://lkml.kernel.org/r/20250424134515.18933-5-jack@suse.cz Fixes: 5f530de63cfc ("ocfs2: Use s_umount for quota recovery protection") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Tested-by: Heming Zhao <heming.zhao@suse.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Murad Masimov <m.masimov@mt-integration.ru> Cc: Shichangkuo <shi.changkuo@h3c.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-18ocfs2: switch osb->disable_recovery to enumJan Kara
commit c0fb83088f0cc4ee4706e0495ee8b06f49daa716 upstream. Patch series "ocfs2: Fix deadlocks in quota recovery", v3. This implements another approach to fixing quota recovery deadlocks. We avoid grabbing sb->s_umount semaphore from ocfs2_finish_quota_recovery() and instead stop quota recovery early in ocfs2_dismount_volume(). This patch (of 3): We will need more recovery states than just pure enable / disable to fix deadlocks with quota recovery. Switch osb->disable_recovery to enum. Link: https://lkml.kernel.org/r/20250424134301.1392-1-jack@suse.cz Link: https://lkml.kernel.org/r/20250424134515.18933-4-jack@suse.cz Fixes: 5f530de63cfc ("ocfs2: Use s_umount for quota recovery protection") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Tested-by: Heming Zhao <heming.zhao@suse.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: Murad Masimov <m.masimov@mt-integration.ru> Cc: Shichangkuo <shi.changkuo@h3c.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-18ocfs2: fix the issue with discontiguous allocation in the global_bitmapHeming Zhao
commit bd1261b16d9131d79723d982d54295e7f309797a upstream. commit 4eb7b93e0310 ("ocfs2: improve write IO performance when fragmentation is high") introduced another regression. The following ocfs2-test case can trigger this issue: > discontig_runner.sh => activate_discontig_bg.sh => resv_unwritten: > ${RESV_UNWRITTEN_BIN} -f ${WORK_PLACE}/large_testfile -s 0 -l \ > $((${FILE_MAJOR_SIZE_M}*1024*1024)) In my env, test disk size (by "fdisk -l <dev>"): > 53687091200 bytes, 104857600 sectors. Above command is: > /usr/local/ocfs2-test/bin/resv_unwritten -f \ > /mnt/ocfs2/ocfs2-activate-discontig-bg-dir/large_testfile -s 0 -l \ > 53187969024 Error log: > [*] Reserve 50724M space for a LARGE file, reserve 200M space for future test. > ioctl error 28: "No space left on device" > resv allocation failed Unknown error -1 > reserve unwritten region from 0 to 53187969024. Call flow: __ocfs2_change_file_space //by ioctl OCFS2_IOC_RESVSP64 ocfs2_allocate_unwritten_extents //start:0 len:53187969024 while() + ocfs2_get_clusters //cpos:0, alloc_size:1623168 (cluster number) + ocfs2_extend_allocation + ocfs2_lock_allocators | + choose OCFS2_AC_USE_MAIN & ocfs2_cluster_group_search | + ocfs2_add_inode_data ocfs2_add_clusters_in_btree __ocfs2_claim_clusters ocfs2_claim_suballoc_bits + During the allocation of the final part of the large file (after ~47GB), no chain had the required contiguous bits_wanted. Consequently, the allocation failed. How to fix: When OCFS2 is encountering fragmented allocation, the file system should stop attempting bits_wanted contiguous allocation and instead provide the largest available contiguous free bits from the cluster groups. Link: https://lkml.kernel.org/r/20250414060125.19938-2-heming.zhao@suse.com Fixes: 4eb7b93e0310 ("ocfs2: improve write IO performance when fragmentation is high") Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reported-by: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-18smb: client: Avoid race in open_cached_dir with lease breaksPaul Aurich
commit 3ca02e63edccb78ef3659bebc68579c7224a6ca2 upstream. A pre-existing valid cfid returned from find_or_create_cached_dir might race with a lease break, meaning open_cached_dir doesn't consider it valid, and thinks it's newly-constructed. This leaks a dentry reference if the allocation occurs before the queued lease break work runs. Avoid the race by extending holding the cfid_list_lock across find_or_create_cached_dir and when the result is checked. Cc: stable@vger.kernel.org Reviewed-by: Henrique Carvalho <henrique.carvalho@suse.com> Signed-off-by: Paul Aurich <paul@darkrain42.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-18mm/userfaultfd: fix uninitialized output field for -EAGAIN racePeter Xu
commit 95567729173e62e0e60a1f8ad9eb2e1320a8ccac upstream. While discussing some userfaultfd relevant issues recently, Andrea noticed a potential ABI breakage with -EAGAIN on almost all userfaultfd ioctl()s. Quote from Andrea, explaining how -EAGAIN was processed, and how this should fix it (taking example of UFFDIO_COPY ioctl): The "mmap_changing" and "stale pmd" conditions are already reported as -EAGAIN written in the copy field, this does not change it. This change removes the subnormal case that left copy.copy uninitialized and required apps to explicitly set the copy field to get deterministic behavior (which is a requirement contrary to the documentation in both the manpage and source code). In turn there's no alteration to backwards compatibility as result of this change because userland will find the copy field consistently set to -EAGAIN, and not anymore sometime -EAGAIN and sometime uninitialized. Even then the change only can make a difference to non cooperative users of userfaultfd, so when UFFD_FEATURE_EVENT_* is enabled, which is not true for the vast majority of apps using userfaultfd or this unintended uninitialized field may have been noticed sooner. Meanwhile, since this bug existed for years, it also almost affects all ioctl()s that was introduced later. Besides UFFDIO_ZEROPAGE, these also get affected in the same way: - UFFDIO_CONTINUE - UFFDIO_POISON - UFFDIO_MOVE This patch should have fixed all of them. Link: https://lkml.kernel.org/r/20250424215729.194656-2-peterx@redhat.com Fixes: df2cc96e7701 ("userfaultfd: prevent non-cooperative events vs mcopy_atomic races") Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl") Fixes: fc71884a5f59 ("mm: userfaultfd: add new UFFDIO_POISON ioctl") Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Suggested-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-18erofs: ensure the extra temporary copy is valid for shortened bvecsGao Xiang
[ Upstream commit 35076d2223c731f7be75af61e67f90807384d030 ] When compressed data deduplication is enabled, multiple logical extents may reference the same compressed physical cluster. The previous commit 94c43de73521 ("erofs: fix wrong primary bvec selection on deduplicated extents") already avoids using shortened bvecs. However, in such cases, the extra temporary buffers also need to be preserved for later use in z_erofs_fill_other_copies() to to prevent data corruption. IOWs, extra temporary buffers have to be retained not only due to varying start relative offsets (`pageofs_out`, as indicated by `pcl->multibases`) but also because of shortened bvecs. android.hardware.graphics.composer@2.1.so : 270696 bytes 0: 0.. 204185 | 204185 : 628019200.. 628084736 | 65536 -> 1: 204185.. 225536 | 21351 : 544063488.. 544129024 | 65536 2: 225536.. 270696 | 45160 : 0.. 0 | 0 com.android.vndk.v28.apex : 93814897 bytes ... 364: 53869896..54095257 | 225361 : 543997952.. 544063488 | 65536 -> 365: 54095257..54309344 | 214087 : 544063488.. 544129024 | 65536 366: 54309344..54514557 | 205213 : 544129024.. 544194560 | 65536 ... Both 204185 and 54095257 have the same start relative offset of 3481, but the logical page 55 of `android.hardware.graphics.composer@2.1.so` ranges from 225280 to 229632, forming a shortened bvec [225280, 225536) that cannot be used for decompressing the range from 54095257 to 54309344 of `com.android.vndk.v28.apex`. Since `pcl->multibases` is already meaningless, just mark `be->keepxcpy` on demand for simplicity. Again, this issue can only lead to data corruption if `-Ededupe` is on. Fixes: 94c43de73521 ("erofs: fix wrong primary bvec selection on deduplicated extents") Reviewed-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250506101850.191506-1-hsiangkao@linux.alibaba.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-18ksmbd: fix memory leak in parse_lease_state()Wang Zhaolong
[ Upstream commit eb4447bcce915b43b691123118893fca4f372a8f ] The previous patch that added bounds check for create lease context introduced a memory leak. When the bounds check fails, the function returns NULL without freeing the previously allocated lease_ctx_info structure. This patch fixes the issue by adding kfree(lreq) before returning NULL in both boundary check cases. Fixes: bab703ed8472 ("ksmbd: add bounds check for create lease context") Signed-off-by: Wang Zhaolong <wangzhaolong1@huawei.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-18ksmbd: Fix UAF in __close_file_table_idsSean Heelan
commit 36991c1ccde2d5a521577c448ffe07fcccfe104d upstream. A use-after-free is possible if one thread destroys the file via __ksmbd_close_fd while another thread holds a reference to it. The existing checks on fp->refcount are not sufficient to prevent this. The fix takes ft->lock around the section which removes the file from the file table. This prevents two threads acquiring the same file pointer via __close_file_table_ids, as well as the other functions which retrieve a file from the IDR and which already use this same lock. Cc: stable@vger.kernel.org Signed-off-by: Sean Heelan <seanheelan@gmail.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-18ksmbd: prevent out-of-bounds stream writes by validating *posNorbert Szetei
commit 0ca6df4f40cf4c32487944aaf48319cb6c25accc upstream. ksmbd_vfs_stream_write() did not validate whether the write offset (*pos) was within the bounds of the existing stream data length (v_len). If *pos was greater than or equal to v_len, this could lead to an out-of-bounds memory write. This patch adds a check to ensure *pos is less than v_len before proceeding. If the condition fails, -EINVAL is returned. Cc: stable@vger.kernel.org Signed-off-by: Norbert Szetei <norbert@doyensec.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-18ksmbd: prevent rename with empty stringNamjae Jeon
commit 53e3e5babc0963a92d856a5ec0ce92c59f54bc12 upstream. Client can send empty newname string to ksmbd server. It will cause a kernel oops from d_alloc. This patch return the error when attempting to rename a file or directory with an empty new name string. Cc: stable@vger.kernel.org Reported-by: Norbert Szetei <norbert@doyensec.com> Tested-by: Norbert Szetei <norbert@doyensec.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-18Revert "btrfs: canonicalize the device path before adding it"Qu Wenruo
commit 8fb1dcbbcc1ffe6ed7cf3f0f96d2737491dd1fbf upstream. This reverts commit 7e06de7c83a746e58d4701e013182af133395188. Commit 7e06de7c83a7 ("btrfs: canonicalize the device path before adding it") tries to make btrfs to use "/dev/mapper/*" name first, then any filename inside "/dev/" as the device path. This is mostly fine when there is only the root namespace involved, but when multiple namespace are involved, things can easily go wrong for the d_path() usage. As d_path() returns a file path that is namespace dependent, the resulted string may not make any sense in another namespace. Furthermore, the "/dev/" prefix checks itself is not reliable, one can still make a valid initramfs without devtmpfs, and fill all needed device nodes manually. Overall the userspace has all its might to pass whatever device path for mount, and we are not going to win the war trying to cover every corner case. So just revert that commit, and do no extra d_path() based file path sanity check. CC: stable@vger.kernel.org # 6.12+ Link: https://lore.kernel.org/linux-fsdevel/20250115185608.GA2223535@zen.localdomain/ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-18fs/erofs/fileio: call erofs_onlinefolio_split() after bio_add_folio()Max Kellermann
commit bbfe756dc3062c1e934f06e5ba39c239aa953b92 upstream. If bio_add_folio() fails (because it is full), erofs_fileio_scan_folio() needs to submit the I/O request via erofs_fileio_rq_submit() and allocate a new I/O request with an empty `struct bio`. Then it retries the bio_add_folio() call. However, at this point, erofs_onlinefolio_split() has already been called which increments `folio->private`; the retry will call erofs_onlinefolio_split() again, but there will never be a matching erofs_onlinefolio_end() call. This leaves the folio locked forever and all waiters will be stuck in folio_wait_bit_common(). This bug has been added by commit ce63cb62d794 ("erofs: support unencoded inodes for fileio"), but was practically unreachable because there was room for 256 folios in the `struct bio` - until commit 9f74ae8c9ac9 ("erofs: shorten bvecs[] for file-backed mounts") which reduced the array capacity to 16 folios. It was now trivial to trigger the bug by manually invoking readahead from userspace, e.g.: posix_fadvise(fd, 0, st.st_size, POSIX_FADV_WILLNEED); This should be fixed by invoking erofs_onlinefolio_split() only after bio_add_folio() has succeeded. This is safe: asynchronous completions invoking erofs_onlinefolio_end() will not unlock the folio because erofs_fileio_scan_folio() is still holding a reference to be released by erofs_onlinefolio_end() at the end. Fixes: ce63cb62d794 ("erofs: support unencoded inodes for fileio") Fixes: 9f74ae8c9ac9 ("erofs: shorten bvecs[] for file-backed mounts") Cc: stable@vger.kernel.org Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Reviewed-by: Gao Xiang <xiang@kernel.org> Tested-by: Hongbo Li <lihongbo22@huawei.com> Link: https://lore.kernel.org/r/20250428230933.3422273-1-max.kellermann@ionos.com Signed-off-by: Gao Xiang <xiang@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-09bcachefs: Remove incorrect __counted_by annotationAlan Huang
commit 6846100b00d97d3d6f05766ae86a0d821d849e78 upstream. This actually reverts 86e92eeeb237 ("bcachefs: Annotate struct bch_xattr with __counted_by()"). After the x_name, there is a value. According to the disscussion[1], __counted_by assumes that the flexible array member contains exactly the amount of elements that are specified. Now there are users came across a false positive detection of an out of bounds write caused by the __counted_by here[2], so revert that. [1] https://lore.kernel.org/lkml/Zv8VDKWN1GzLRT-_@archlinux/T/#m0ce9541c5070146320efd4f928cc1ff8de69e9b2 [2] https://privatebin.net/?a0d4e97d590d71e1#9bLmp2Kb5NU6X6cZEucchDcu88HzUQwHUah8okKPReEt Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-09smb: client: fix zero length for mkdir POSIX create contextJethro Donaldson
commit 74c72419ec8da5cbc9c49410d3c44bb954538bdd upstream. SMB create requests issued via smb311_posix_mkdir() have an incorrect length of zero bytes for the POSIX create context data. ksmbd server rejects such requests and logs "cli req too short" causing mkdir to fail with "invalid argument" on the client side. It also causes subsequent rmmod to crash in cifs_destroy_request_bufs() Inspection of packets sent by cifs.ko using wireshark show valid data for the SMB2_POSIX_CREATE_CONTEXT is appended with the correct offset, but with an incorrect length of zero bytes. Fails with ksmbd+cifs.ko only as Windows server/client does not use POSIX extensions. Fix smb311_posix_mkdir() to set req->CreateContextsLength as part of appending the POSIX creation context to the request. Signed-off-by: Jethro Donaldson <devel@jro.nz> Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Reviewed-by: Namjae Jeon <linkinjeon@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-09ksmbd: fix use-after-free in session logoffSean Heelan
commit 2fc9feff45d92a92cd5f96487655d5be23fb7e2b upstream. The sess->user object can currently be in use by another thread, for example if another connection has sent a session setup request to bind to the session being free'd. The handler for that connection could be in the smb2_sess_setup function which makes use of sess->user. Cc: stable@vger.kernel.org Signed-off-by: Sean Heelan <seanheelan@gmail.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-09ksmbd: fix use-after-free in kerberos authenticationSean Heelan
commit e86e9134e1d1c90a960dd57f59ce574d27b9a124 upstream. Setting sess->user = NULL was introduced to fix the dangling pointer created by ksmbd_free_user. However, it is possible another thread could be operating on the session and make use of sess->user after it has been passed to ksmbd_free_user but before sess->user is set to NULL. Cc: stable@vger.kernel.org Signed-off-by: Sean Heelan <seanheelan@gmail.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-09ksmbd: fix use-after-free in ksmbd_session_rpc_openNamjae Jeon
commit a1f46c99d9ea411f9bf30025b912d881d36fc709 upstream. A UAF issue can occur due to a race condition between ksmbd_session_rpc_open() and __session_rpc_close(). Add rpc_lock to the session to protect it. Cc: stable@vger.kernel.org Reported-by: Norbert Szetei <norbert@doyensec.com> Tested-by: Norbert Szetei <norbert@doyensec.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-09btrfs: fix COW handling in run_delalloc_nocow()Dave Chen
commit be3f1938d3e6ea8186f0de3dd95245dda4f22c1e upstream. In run_delalloc_nocow(), when the found btrfs_key's offset > cur_offset, it indicates a gap between the current processing region and the next file extent. The original code would directly jump to the "must_cow" label, which increments the slot and forces a fallback to COW. This behavior might skip an extent item and result in an overestimated COW fallback range. This patch modifies the logic so that when a gap is detected: - If no COW range is already being recorded (cow_start is unset), cow_start is set to cur_offset. - cur_offset is then advanced to the beginning of the next extent. - Instead of jumping to "must_cow", control flows directly to "next_slot" so that the same extent item can be reexamined properly. The change ensures that we accurately account for the extent gap and avoid accidentally extending the range that needs to fallback to COW. CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Dave Chen <davechen@synology.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-09btrfs: adjust subpage bit start based on sectorsizeJosef Bacik
commit e08e49d986f82c30f42ad0ed43ebbede1e1e3739 upstream. When running machines with 64k page size and a 16k nodesize we started seeing tree log corruption in production. This turned out to be because we were not writing out dirty blocks sometimes, so this in fact affects all metadata writes. When writing out a subpage EB we scan the subpage bitmap for a dirty range. If the range isn't dirty we do bit_start++; to move onto the next bit. The problem is the bitmap is based on the number of sectors that an EB has. So in this case, we have a 64k pagesize, 16k nodesize, but a 4k sectorsize. This means our bitmap is 4 bits for every node. With a 64k page size we end up with 4 nodes per page. To make this easier this is how everything looks [0 16k 32k 48k ] logical address [0 4 8 12 ] radix tree offset [ 64k page ] folio [ 16k eb ][ 16k eb ][ 16k eb ][ 16k eb ] extent buffers [ | | | | | | | | | | | | | | | | ] bitmap Now we use all of our addressing based on fs_info->sectorsize_bits, so as you can see the above our 16k eb->start turns into radix entry 4. When we find a dirty range for our eb, we correctly do bit_start += sectors_per_node, because if we start at bit 0, the next bit for the next eb is 4, to correspond to eb->start 16k. However if our range is clean, we will do bit_start++, which will now put us offset from our radix tree entries. In our case, assume that the first time we check the bitmap the block is not dirty, we increment bit_start so now it == 1, and then we loop around and check again. This time it is dirty, and we go to find that start using the following equation start = folio_start + bit_start * fs_info->sectorsize; so in the case above, eb->start 0 is now dirty, and we calculate start as 0 + 1 * fs_info->sectorsize = 4096 4096 >> 12 = 1 Now we're looking up the radix tree for 1, and we won't find an eb. What's worse is now we're using bit_start == 1, so we do bit_start += sectors_per_node, which is now 5. If that eb is dirty we will run into the same thing, we will look at an offset that is not populated in the radix tree, and now we're skipping the writeout of dirty extent buffers. The best fix for this is to not use sectorsize_bits to address nodes, but that's a larger change. Since this is a fs corruption problem fix it simply by always using sectors_per_node to increment the start bit. Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-02ext4: goto right label 'out_mmap_sem' in ext4_setattr()Baokun Li
commit 7e91ae31e2d264155dfd102101afc2de7bd74a64 upstream. Otherwise, if ext4_inode_attach_jinode() fails, a hung task will happen because filemap_invalidate_unlock() isn't called to unlock mapping->invalidate_lock. Like this: EXT4-fs error (device sda) in ext4_setattr:5557: Out of memory INFO: task fsstress:374 blocked for more than 122 seconds. Not tainted 6.14.0-rc1-next-20250206-xfstests-dirty #726 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:fsstress state:D stack:0 pid:374 tgid:374 ppid:373 task_flags:0x440140 flags:0x00000000 Call Trace: <TASK> __schedule+0x2c9/0x7f0 schedule+0x27/0xa0 schedule_preempt_disabled+0x15/0x30 rwsem_down_read_slowpath+0x278/0x4c0 down_read+0x59/0xb0 page_cache_ra_unbounded+0x65/0x1b0 filemap_get_pages+0x124/0x3e0 filemap_read+0x114/0x3d0 vfs_read+0x297/0x360 ksys_read+0x6c/0xe0 do_syscall_64+0x4b/0x110 entry_SYSCALL_64_after_hwframe+0x76/0x7e Fixes: c7fc0366c656 ("ext4: partial zero eof block on unaligned inode size extension") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Brian Foster <bfoster@redhat.com> Link: https://patch.msgid.link/20250213112247.3168709-1-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Ricardo Cañuelo Navarro <rcn@igalia.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-02xfs: flush inodegc before swaponChristoph Hellwig
Commit 2d873efd174bae9005776937d5ac6a96050266db upstream Fix the brand new xfstest that tries to swapon on a recently unshared file and use the chance to document the other bit of magic in this function. The big comment is taken from a mailinglist post by Dave Chinner. Fixes: 5e672cd69f0a53 ("xfs: introduce xfs_inodegc_push()") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-02xfs: rename xfs_iomap_swapfile_activate to xfs_vm_swap_activateChristoph Hellwig
Commit 3cd6a8056f5a2e794c42fc2114ee2611e358b357 upstream Match the method name and the naming convention or address_space operations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-02xfs: Do not allow norecovery mount with quotacheckCarlos Maiolino
Commit 9f0902091c332b2665951cfb970f60ae7cbdc0f3 upstream Mounting a filesystem that requires quota state changing will generate a transaction. We already check for a read-only device; we should do that for norecovery too. A quotacheck on a norecovery mount, and with the right log size, will cause the mount process to hang on: [<0>] xlog_grant_head_wait+0x5d/0x2a0 [xfs] [<0>] xlog_grant_head_check+0x112/0x180 [xfs] [<0>] xfs_log_reserve+0xe3/0x260 [xfs] [<0>] xfs_trans_reserve+0x179/0x250 [xfs] [<0>] xfs_trans_alloc+0x101/0x260 [xfs] [<0>] xfs_sync_sb+0x3f/0x80 [xfs] [<0>] xfs_qm_mount_quotas+0xe3/0x2f0 [xfs] [<0>] xfs_mountfs+0x7ad/0xc20 [xfs] [<0>] xfs_fs_fill_super+0x762/0xa50 [xfs] [<0>] get_tree_bdev_flags+0x131/0x1d0 [<0>] vfs_get_tree+0x26/0xd0 [<0>] vfs_cmd_create+0x59/0xe0 [<0>] __do_sys_fsconfig+0x4e3/0x6b0 [<0>] do_syscall_64+0x82/0x160 [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e This is caused by a transaction running with bogus initialized head/tail I initially hit this while running generic/050, with random log sizes, but I managed to reproduce it reliably here with the steps below: mkfs.xfs -f -lsize=1025M -f -b size=4096 -m crc=1,reflink=1,rmapbt=1, -i sparse=1 /dev/vdb2 > /dev/null mount -o usrquota,grpquota,prjquota /dev/vdb2 /mnt xfs_io -x -c 'shutdown -f' /mnt umount /mnt mount -o ro,norecovery,usrquota,grpquota,prjquota /dev/vdb2 /mnt Last mount hangs up As we add yet another validation if quota state is changing, this also add a new helper named xfs_qm_validate_state_change(), factoring the quota state changes out of xfs_qm_newmount() to reduce cluttering within it. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-02xfs: do not check NEEDSREPAIR if ro,norecovery mount.Lukas Herbolt
Commit 9e00163c31676c6b43d2334fdf5b406232f42dee upstream If there is corrutpion on the filesystem andxfs_repair fails to repair it. The last resort of getting the data is to use norecovery,ro mount. But if the NEEDSREPAIR is set the filesystem cannot be mounted. The flag must be cleared out manually using xfs_db, to get access to what left over of the corrupted fs. Signed-off-by: Lukas Herbolt <lukas@herbolt.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-02ksmbd: fix WARNING "do not call blocking ops when !TASK_RUNNING"Namjae Jeon
[ Upstream commit 1df0d4c616138784e033ad337961b6e1a6bcd999 ] wait_event_timeout() will set the state of the current task to TASK_UNINTERRUPTIBLE, before doing the condition check. This means that ksmbd_durable_scavenger_alive() will try to acquire the mutex while already in a sleeping state. The scheduler warns us by giving the following warning: do not call blocking ops when !TASK_RUNNING; state=2 set at [<0000000061515a6f>] prepare_to_wait_event+0x9f/0x6c0 WARNING: CPU: 2 PID: 4147 at kernel/sched/core.c:10099 __might_sleep+0x12f/0x160 mutex lock is not needed in ksmbd_durable_scavenger_alive(). Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-02iomap: skip unnecessary ifs_block_is_uptodate checkGou Hao
[ Upstream commit 8e3c15ee0d292c413c66fe10201d1b035a0bea72 ] In iomap_adjust_read_range, i is either the first !uptodate block, or it is past last for the second loop looking for trailing uptodate blocks. Assuming there's no overflow (there's no combination of huge folios and tiny blksize) then yeah, there is no point in retesting that the same block pointed to by i is uptodate since we hold the folio lock so nobody else could have set it uptodate. Signed-off-by: Gou Hao <gouhao@uniontech.com> Link: https://lore.kernel.org/20250410071236.16017-1-gouhao@uniontech.com Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-02netfs: Only create /proc/fs/netfs with CONFIG_PROC_FSSong Liu
[ Upstream commit 40cb48eba3b4b79e110c1a35d33a48cac54507a2 ] When testing a special config: CONFIG_NETFS_SUPPORTS=y CONFIG_PROC_FS=n The system crashes with something like: [ 3.766197] ------------[ cut here ]------------ [ 3.766484] kernel BUG at mm/mempool.c:560! [ 3.766789] Oops: invalid opcode: 0000 [#1] SMP NOPTI [ 3.767123] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Tainted: G W [ 3.767777] Tainted: [W]=WARN [ 3.767968] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), [ 3.768523] RIP: 0010:mempool_alloc_slab.cold+0x17/0x19 [ 3.768847] Code: 50 fe ff 58 5b 5d 41 5c 41 5d 41 5e 41 5f e9 93 95 13 00 [ 3.769977] RSP: 0018:ffffc90000013998 EFLAGS: 00010286 [ 3.770315] RAX: 000000000000002f RBX: ffff888100ba8640 RCX: 0000000000000000 [ 3.770749] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 00000000ffffffff [ 3.771217] RBP: 0000000000092880 R08: 0000000000000000 R09: ffffc90000013828 [ 3.771664] R10: 0000000000000001 R11: 00000000ffffffea R12: 0000000000092cc0 [ 3.772117] R13: 0000000000000400 R14: ffff8881004b1620 R15: ffffea0004ef7e40 [ 3.772554] FS: 0000000000000000(0000) GS:ffff8881b5f3c000(0000) knlGS:0000000000000000 [ 3.773061] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3.773443] CR2: ffffffff830901b4 CR3: 0000000004296001 CR4: 0000000000770ef0 [ 3.773884] PKRU: 55555554 [ 3.774058] Call Trace: [ 3.774232] <TASK> [ 3.774371] mempool_alloc_noprof+0x6a/0x190 [ 3.774649] ? _printk+0x57/0x80 [ 3.774862] netfs_alloc_request+0x85/0x2ce [ 3.775147] netfs_readahead+0x28/0x170 [ 3.775395] read_pages+0x6c/0x350 [ 3.775623] ? srso_alias_return_thunk+0x5/0xfbef5 [ 3.775928] page_cache_ra_unbounded+0x1bd/0x2a0 [ 3.776247] filemap_get_pages+0x139/0x970 [ 3.776510] ? srso_alias_return_thunk+0x5/0xfbef5 [ 3.776820] filemap_read+0xf9/0x580 [ 3.777054] ? srso_alias_return_thunk+0x5/0xfbef5 [ 3.777368] ? srso_alias_return_thunk+0x5/0xfbef5 [ 3.777674] ? find_held_lock+0x32/0x90 [ 3.777929] ? netfs_start_io_read+0x19/0x70 [ 3.778221] ? netfs_start_io_read+0x19/0x70 [ 3.778489] ? srso_alias_return_thunk+0x5/0xfbef5 [ 3.778800] ? lock_acquired+0x1e6/0x450 [ 3.779054] ? srso_alias_return_thunk+0x5/0xfbef5 [ 3.779379] netfs_buffered_read_iter+0x57/0x80 [ 3.779670] __kernel_read+0x158/0x2c0 [ 3.779927] bprm_execve+0x300/0x7a0 [ 3.780185] kernel_execve+0x10c/0x140 [ 3.780423] ? __pfx_kernel_init+0x10/0x10 [ 3.780690] kernel_init+0xd5/0x150 [ 3.780910] ret_from_fork+0x2d/0x50 [ 3.781156] ? __pfx_kernel_init+0x10/0x10 [ 3.781414] ret_from_fork_asm+0x1a/0x30 [ 3.781677] </TASK> [ 3.781823] Modules linked in: [ 3.782065] ---[ end trace 0000000000000000 ]--- This is caused by the following error path in netfs_init(): if (!proc_mkdir("fs/netfs", NULL)) goto error_proc; Fix this by adding ifdef in netfs_main(), so that /proc/fs/netfs is only created with CONFIG_PROC_FS. Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/20250409170015.2651829-1-song@kernel.org Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-02ext4: make block validity check resistent to sb bh corruptionOjaswin Mujoo
[ Upstream commit ccad447a3d331a239477c281533bacb585b54a98 ] Block validity checks need to be skipped in case they are called for journal blocks since they are part of system's protected zone. Currently, this is done by checking inode->ino against sbi->s_es->s_journal_inum, which is a direct read from the ext4 sb buffer head. If someone modifies this underneath us then the s_journal_inum field might get corrupted. To prevent against this, change the check to directly compare the inode with journal->j_inode. **Slight change in behavior**: During journal init path, check_block_validity etc might be called for journal inode when sbi->s_journal is not set yet. In this case we now proceed with ext4_inode_block_valid() instead of returning early. Since systems zones have not been set yet, it is okay to proceed so we can perform basic checks on the blocks. Suggested-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/0c06bc9ebfcd6ccfed84a36e79147bf45ff5adc1.1743142920.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>