summaryrefslogtreecommitdiff
path: root/fs/nfsd
AgeCommit message (Collapse)Author
2025-06-27nfsd: use threads array as-is in netlink interfaceJeff Layton
commit 8ea688a3372e8369dc04395b39b4e71a6d91d4d5 upstream. The old nfsdfs interface for starting a server with multiple pools handles the special case of a single entry array passed down from userland by distributing the threads over every NUMA node. The netlink control interface however constructs an array of length nfsd_nrpools() and fills any unprovided slots with 0's. This behavior defeats the special casing that the old interface relies on. Change nfsd_nl_threads_set_doit() to pass down the array from userland as-is. Fixes: 7f5c330b2620 ("nfsd: allow passing in array of thread counts via netlink") Cc: stable@vger.kernel.org Reported-by: Mike Snitzer <snitzer@kernel.org> Closes: https://lore.kernel.org/linux-nfs/aDC-ftnzhJAlwqwh@kernel.org/ Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-27nfsd: Initialize ssc before laundromat_work to prevent NULL dereferenceLi Lingfeng
commit b31da62889e6d610114d81dc7a6edbcaa503fcf8 upstream. In nfs4_state_start_net(), laundromat_work may access nfsd_ssc through nfs4_laundromat -> nfsd4_ssc_expire_umount. If nfsd_ssc isn't initialized, this can cause NULL pointer dereference. Normally the delayed start of laundromat_work allows sufficient time for nfsd_ssc initialization to complete. However, when the kernel waits too long for userspace responses (e.g. in nfs4_state_start_net -> nfsd4_end_grace -> nfsd4_record_grace_done -> nfsd4_cld_grace_done -> cld_pipe_upcall -> __cld_pipe_upcall -> wait_for_completion path), the delayed work may start before nfsd_ssc initialization finishes. Fix this by moving nfsd_ssc initialization before starting laundromat_work. Fixes: f4e44b393389 ("NFSD: delay unmount source's export after inter-server copy completed.") Cc: stable@vger.kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-27nfsd: nfsd4_spo_must_allow() must check this is a v4 compound requestNeilBrown
commit 1244f0b2c3cecd3f349a877006e67c9492b41807 upstream. If the request being processed is not a v4 compound request, then examining the cstate can have undefined results. This patch adds a check that the rpc procedure being executed (rq_procinfo) is the NFSPROC4_COMPOUND procedure. Reported-by: Olga Kornievskaia <okorniev@redhat.com> Cc: stable@vger.kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-27NFSD: Implement FATTR4_CLONE_BLKSIZE attributeChuck Lever
commit d6ca7d2643eebe09cf46840bdc7d68b6e07aba77 upstream. RFC 7862 states that if an NFS server implements a CLONE operation, it MUST also implement FATTR4_CLONE_BLKSIZE. NFSD implements CLONE, but does not implement FATTR4_CLONE_BLKSIZE. Note that in Section 12.2, RFC 7862 claims that FATTR4_CLONE_BLKSIZE is RECOMMENDED, not REQUIRED. Likely this is because a minor version is not permitted to add a REQUIRED attribute. Confusing. We assume this attribute reports a block size as a count of bytes, as RFC 7862 does not specify a unit. Reported-by: Roland Mainz <roland.mainz@nrubsig.org> Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Roland Mainz <roland.mainz@nrubsig.org> Cc: stable@vger.kernel.org # v6.7+ Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-27NFSD: fix race between nfsd registration and exports_procManinder Singh
commit f7fb730cac9aafda8b9813b55d04e28a9664d17c upstream. As of now nfsd calls create_proc_exports_entry() at start of init_nfsd and cleanup by remove_proc_entry() at last of exit_nfsd. Which causes kernel OOPs if there is race between below 2 operations: (i) exportfs -r (ii) mount -t nfsd none /proc/fs/nfsd for 5.4 kernel ARM64: CPU 1: el1_irq+0xbc/0x180 arch_counter_get_cntvct+0x14/0x18 running_clock+0xc/0x18 preempt_count_add+0x88/0x110 prep_new_page+0xb0/0x220 get_page_from_freelist+0x2d8/0x1778 __alloc_pages_nodemask+0x15c/0xef0 __vmalloc_node_range+0x28c/0x478 __vmalloc_node_flags_caller+0x8c/0xb0 kvmalloc_node+0x88/0xe0 nfsd_init_net+0x6c/0x108 [nfsd] ops_init+0x44/0x170 register_pernet_operations+0x114/0x270 register_pernet_subsys+0x34/0x50 init_nfsd+0xa8/0x718 [nfsd] do_one_initcall+0x54/0x2e0 CPU 2 : Unable to handle kernel NULL pointer dereference at virtual address 0000000000000010 PC is at : exports_net_open+0x50/0x68 [nfsd] Call trace: exports_net_open+0x50/0x68 [nfsd] exports_proc_open+0x2c/0x38 [nfsd] proc_reg_open+0xb8/0x198 do_dentry_open+0x1c4/0x418 vfs_open+0x38/0x48 path_openat+0x28c/0xf18 do_filp_open+0x70/0xe8 do_sys_open+0x154/0x248 Sometimes it crashes at exports_net_open() and sometimes cache_seq_next_rcu(). and same is happening on latest 6.14 kernel as well: [ 0.000000] Linux version 6.14.0-rc5-next-20250304-dirty ... [ 285.455918] Unable to handle kernel paging request at virtual address 00001f4800001f48 ... [ 285.464902] pc : cache_seq_next_rcu+0x78/0xa4 ... [ 285.469695] Call trace: [ 285.470083] cache_seq_next_rcu+0x78/0xa4 (P) [ 285.470488] seq_read+0xe0/0x11c [ 285.470675] proc_reg_read+0x9c/0xf0 [ 285.470874] vfs_read+0xc4/0x2fc [ 285.471057] ksys_read+0x6c/0xf4 [ 285.471231] __arm64_sys_read+0x1c/0x28 [ 285.471428] invoke_syscall+0x44/0x100 [ 285.471633] el0_svc_common.constprop.0+0x40/0xe0 [ 285.471870] do_el0_svc_compat+0x1c/0x34 [ 285.472073] el0_svc_compat+0x2c/0x80 [ 285.472265] el0t_32_sync_handler+0x90/0x140 [ 285.472473] el0t_32_sync+0x19c/0x1a0 [ 285.472887] Code: f9400885 93407c23 937d7c27 11000421 (f86378a3) [ 285.473422] ---[ end trace 0000000000000000 ]--- It reproduced simply with below script: while [ 1 ] do /exportfs -r done & while [ 1 ] do insmod /nfsd.ko mount -t nfsd none /proc/fs/nfsd umount /proc/fs/nfsd rmmod nfsd done & So exporting interfaces to user space shall be done at last and cleanup at first place. With change there is no Kernel OOPs. Co-developed-by: Shubham Rana <s9.rana@samsung.com> Signed-off-by: Shubham Rana <s9.rana@samsung.com> Signed-off-by: Maninder Singh <maninder1.s@samsung.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-27NFSD: unregister filesystem in case genl_register_family() failsManinder Singh
commit ff12eb379554eea7932ad6caea55e3091701cce4 upstream. With rpc_status netlink support, unregister of register_filesystem() was missed in case of genl_register_family() fails. Correcting it by making new label. Fixes: bd9d6a3efa97 ("NFSD: add rpc_status netlink support") Cc: stable@vger.kernel.org Signed-off-by: Maninder Singh <maninder1.s@samsung.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-25nfsd: decrease sc_count directly if fail to queue dl_recallLi Lingfeng
[ Upstream commit a1d14d931bf700c1025db8c46d6731aa5cf440f9 ] A deadlock warning occurred when invoking nfs4_put_stid following a failed dl_recall queue operation: T1 T2 nfs4_laundromat nfs4_get_client_reaplist nfs4_anylock_blockers __break_lease spin_lock // ctx->flc_lock spin_lock // clp->cl_lock nfs4_lockowner_has_blockers locks_owner_has_blockers spin_lock // flctx->flc_lock nfsd_break_deleg_cb nfsd_break_one_deleg nfs4_put_stid refcount_dec_and_lock spin_lock // clp->cl_lock When a file is opened, an nfs4_delegation is allocated with sc_count initialized to 1, and the file_lease holds a reference to the delegation. The file_lease is then associated with the file through kernel_setlease. The disassociation is performed in nfsd4_delegreturn via the following call chain: nfsd4_delegreturn --> destroy_delegation --> destroy_unhashed_deleg --> nfs4_unlock_deleg_lease --> kernel_setlease --> generic_delete_lease The corresponding sc_count reference will be released after this disassociation. Since nfsd_break_one_deleg executes while holding the flc_lock, the disassociation process becomes blocked when attempting to acquire flc_lock in generic_delete_lease. This means: 1) sc_count in nfsd_break_one_deleg will not be decremented to 0; 2) The nfs4_put_stid called by nfsd_break_one_deleg will not attempt to acquire cl_lock; 3) Consequently, no deadlock condition is created. Given that sc_count in nfsd_break_one_deleg remains non-zero, we can safely perform refcount_dec on sc_count directly. This approach effectively avoids triggering deadlock warnings. Fixes: 230ca758453c ("nfsd: put dl_stid if fail to queue dl_recall") Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-25nfs: add missing selections of CONFIG_CRC32Eric Biggers
[ Upstream commit cd35b6cb46649750b7dbd0df0e2d767415d8917b ] nfs.ko, nfsd.ko, and lockd.ko all use crc32_le(), which is available only when CONFIG_CRC32 is enabled. But the only NFS kconfig option that selected CONFIG_CRC32 was CONFIG_NFS_DEBUG, which is client-specific and did not actually guard the use of crc32_le() even on the client. The code worked around this bug by only actually calling crc32_le() when CONFIG_CRC32 is built-in, instead hard-coding '0' in other cases. This avoided randconfig build errors, and in real kernels the fallback code was unlikely to be reached since CONFIG_CRC32 is 'default y'. But, this really needs to just be done properly, especially now that I'm planning to update CONFIG_CRC32 to not be 'default y'. Therefore, make CONFIG_NFS_FS, CONFIG_NFSD, and CONFIG_LOCKD select CONFIG_CRC32. Then remove the fallback code that becomes unnecessary, as well as the selection of CONFIG_CRC32 from CONFIG_NFS_DEBUG. Fixes: 1264a2f053a3 ("NFS: refactor code for calculating the crc32 hash of a filehandle") Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Anna Schumaker <anna.schumaker@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-20nfsd: don't ignore the return code of svc_proc_register()Jeff Layton
commit 930b64ca0c511521f0abdd1d57ce52b2a6e3476b upstream. Currently, nfsd_proc_stat_init() ignores the return value of svc_proc_register(). If the procfile creation fails, then the kernel will WARN when it tries to remove the entry later. Fix nfsd_proc_stat_init() to return the same type of pointer as svc_proc_register(), and fix up nfsd_net_init() to check that and fail the nfsd_net construction if it occurs. svc_proc_register() can fail if the dentry can't be allocated, or if an identical dentry already exists. The second case is pretty unlikely in the nfsd_net construction codepath, so if this happens, return -ENOMEM. Reported-by: syzbot+e34ad04f27991521104c@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-nfs/67a47501.050a0220.19061f.05f9.GAE@google.com/ Cc: stable@vger.kernel.org # v6.9 Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-20NFSD: Fix CB_GETATTR status fixChuck Lever
commit 4990d098433db18c854e75fb0f90d941eb7d479e upstream. Jeff says: Now that I look, 1b3e26a5ccbf is wrong. The patch on the ml was correct, but the one that got committed is different. It should be: status = decode_cb_op_status(xdr, OP_CB_GETATTR, &cb->cb_status); if (unlikely(status || cb->cb_status)) If "status" is non-zero, decoding failed (usu. BADXDR), but we also want to bail out and not decode the rest of the call if the decoded cb_status is non-zero. That's not happening here, cb_seq_status has already been checked and is non-zero, so this ends up trying to decode the rest of the CB_GETATTR reply when it doesn't exist. Reported-by: Jeff Layton <jlayton@kernel.org> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219737 Fixes: 1b3e26a5ccbf ("NFSD: fix decoding in nfs4_xdr_dec_cb_getattr") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-20NFSD: fix decoding in nfs4_xdr_dec_cb_getattrOlga Kornievskaia
commit 1b3e26a5ccbfc2f85bda1930cc278e313165e353 upstream. If a client were to send an error to a CB_GETATTR call, the code erronously continues to try decode past the error code. It ends up returning BAD_XDR error to the rpc layer and then in turn trigger a WARN_ONCE in nfsd4_cb_done() function. Fixes: 6487a13b5c6b ("NFSD: add support for CB_GETATTR callback") Signed-off-by: Olga Kornievskaia <okorniev@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-10NFSD: Skip sending CB_RECALL_ANY when the backchannel isn't upChuck Lever
commit 8a388c1fabeb6606e16467b23242416c0dbeffad upstream. NFSD sends CB_RECALL_ANY to clients when the server is low on memory or that client has a large number of delegations outstanding. We've seen cases where NFSD attempts to send CB_RECALL_ANY requests to disconnected clients, and gets confused. These calls never go anywhere if a backchannel transport to the target client isn't available. Before the server can send any backchannel operation, the client has to connect first and then do a BIND_CONN_TO_SESSION. This patch doesn't address the root cause of the confusion, but there's no need to queue up these optional operations if they can't go anywhere. Fixes: 44df6f439a17 ("NFSD: add delegation reaper to react to low memory condition") Reviewed-by: Jeff Layton <jlayton@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-10NFSD: Never return NFS4ERR_FILE_OPEN when removing a directoryChuck Lever
commit 370345b4bd184a49ac68d6591801e5e3605b355a upstream. RFC 8881 Section 18.25.4 paragraph 5 tells us that the server should return NFS4ERR_FILE_OPEN only if the target object is an opened file. This suggests that returning this status when removing a directory will confuse NFS clients. This is a version-specific issue; nfsd_proc_remove/rmdir() and nfsd3_proc_remove/rmdir() already return nfserr_access as appropriate. Unfortunately there is no quick way for nfsd4_remove() to determine whether the target object is a file or not, so the check is done in in nfsd_unlink() for now. Reported-by: Trond Myklebust <trondmy@hammerspace.com> Fixes: 466e16f0920f ("nfsd: check for EBUSY from vfs_rmdir/vfs_unink.") Reviewed-by: Jeff Layton <jlayton@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-10NFSD: nfsd_unlink() clobbers non-zero status returned from fh_fill_pre_attrs()Chuck Lever
commit d7d8e3169b56e7696559a2427c922c0d55debcec upstream. If fh_fill_pre_attrs() returns a non-zero status, the error flow takes it through out_unlock, which then overwrites the returned status code with err = nfserrno(host_err); Fixes: a332018a91c4 ("nfsd: handle failure to collect pre/post-op attrs more sanely") Reviewed-by: Jeff Layton <jlayton@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-10nfsd: fix management of listener transportsOlga Kornievskaia
commit d093c90892607be505e801469d6674459e69ab89 upstream. Currently, when no active threads are running, a root user using nfsdctl command can try to remove a particular listener from the list of previously added ones, then start the server by increasing the number of threads, it leads to the following problem: [ 158.835354] refcount_t: addition on 0; use-after-free. [ 158.835603] WARNING: CPU: 2 PID: 9145 at lib/refcount.c:25 refcount_warn_saturate+0x160/0x1a0 [ 158.836017] Modules linked in: rpcrdma rdma_cm iw_cm ib_cm ib_core nfsd auth_rpcgss nfs_acl lockd grace overlay isofs uinput snd_seq_dummy snd_hrtimer nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables qrtr sunrpc vfat fat uvcvideo videobuf2_vmalloc videobuf2_memops uvc videobuf2_v4l2 videodev videobuf2_common snd_hda_codec_generic mc e1000e snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore sg loop dm_multipath dm_mod nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vmw_vmci vsock xfs libcrc32c crct10dif_ce ghash_ce vmwgfx sha2_ce sha256_arm64 sr_mod sha1_ce cdrom nvme drm_client_lib drm_ttm_helper ttm nvme_core drm_kms_helper nvme_auth drm fuse [ 158.840093] CPU: 2 UID: 0 PID: 9145 Comm: nfsd Kdump: loaded Tainted: G B W 6.13.0-rc6+ #7 [ 158.840624] Tainted: [B]=BAD_PAGE, [W]=WARN [ 158.840802] Hardware name: VMware, Inc. VMware20,1/VBSA, BIOS VMW201.00V.24006586.BA64.2406042154 06/04/2024 [ 158.841220] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) [ 158.841563] pc : refcount_warn_saturate+0x160/0x1a0 [ 158.841780] lr : refcount_warn_saturate+0x160/0x1a0 [ 158.842000] sp : ffff800089be7d80 [ 158.842147] x29: ffff800089be7d80 x28: ffff00008e68c148 x27: ffff00008e68c148 [ 158.842492] x26: ffff0002e3b5c000 x25: ffff600011cd1829 x24: ffff00008653c010 [ 158.842832] x23: ffff00008653c000 x22: 1fffe00011cd1829 x21: ffff00008653c028 [ 158.843175] x20: 0000000000000002 x19: ffff00008653c010 x18: 0000000000000000 [ 158.843505] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 [ 158.843836] x14: 0000000000000000 x13: 0000000000000001 x12: ffff600050a26493 [ 158.844143] x11: 1fffe00050a26492 x10: ffff600050a26492 x9 : dfff800000000000 [ 158.844475] x8 : 00009fffaf5d9b6e x7 : ffff000285132493 x6 : 0000000000000001 [ 158.844823] x5 : ffff000285132490 x4 : ffff600050a26493 x3 : ffff8000805e72bc [ 158.845174] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000098588000 [ 158.845528] Call trace: [ 158.845658] refcount_warn_saturate+0x160/0x1a0 (P) [ 158.845894] svc_recv+0x58c/0x680 [sunrpc] [ 158.846183] nfsd+0x1fc/0x348 [nfsd] [ 158.846390] kthread+0x274/0x2f8 [ 158.846546] ret_from_fork+0x10/0x20 [ 158.846714] ---[ end trace 0000000000000000 ]--- nfsd_nl_listener_set_doit() would manipulate the list of transports of server's sv_permsocks and close the specified listener but the other list of transports (server's sp_xprts list) would not be changed leading to the problem above. Instead, determined if the nfsdctl is trying to remove a listener, in which case, delete all the existing listener transports and re-create all-but-the-removed ones. Fixes: 16a471177496 ("NFSD: add listener-{set,get} netlink command") Signed-off-by: Olga Kornievskaia <okorniev@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-10nfsd: put dl_stid if fail to queue dl_recallLi Lingfeng
commit 230ca758453c63bd38e4d9f4a21db698f7abada8 upstream. Before calling nfsd4_run_cb to queue dl_recall to the callback_wq, we increment the reference count of dl_stid. We expect that after the corresponding work_struct is processed, the reference count of dl_stid will be decremented through the callback function nfsd4_cb_recall_release. However, if the call to nfsd4_run_cb fails, the incremented reference count of dl_stid will not be decremented correspondingly, leading to the following nfs4_stid leak: unreferenced object 0xffff88812067b578 (size 344): comm "nfsd", pid 2761, jiffies 4295044002 (age 5541.241s) hex dump (first 32 bytes): 01 00 00 00 6b 6b 6b 6b b8 02 c0 e2 81 88 ff ff ....kkkk........ 00 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 ad 4e ad de .kkkkkkk.....N.. backtrace: kmem_cache_alloc+0x4b9/0x700 nfsd4_process_open1+0x34/0x300 nfsd4_open+0x2d1/0x9d0 nfsd4_proc_compound+0x7a2/0xe30 nfsd_dispatch+0x241/0x3e0 svc_process_common+0x5d3/0xcc0 svc_process+0x2a3/0x320 nfsd+0x180/0x2e0 kthread+0x199/0x1d0 ret_from_fork+0x30/0x50 ret_from_fork_asm+0x1b/0x30 unreferenced object 0xffff8881499f4d28 (size 368): comm "nfsd", pid 2761, jiffies 4295044005 (age 5541.239s) hex dump (first 32 bytes): 01 00 00 00 00 00 00 00 30 4d 9f 49 81 88 ff ff ........0M.I.... 30 4d 9f 49 81 88 ff ff 20 00 00 00 01 00 00 00 0M.I.... ....... backtrace: kmem_cache_alloc+0x4b9/0x700 nfs4_alloc_stid+0x29/0x210 alloc_init_deleg+0x92/0x2e0 nfs4_set_delegation+0x284/0xc00 nfs4_open_delegation+0x216/0x3f0 nfsd4_process_open2+0x2b3/0xee0 nfsd4_open+0x770/0x9d0 nfsd4_proc_compound+0x7a2/0xe30 nfsd_dispatch+0x241/0x3e0 svc_process_common+0x5d3/0xcc0 svc_process+0x2a3/0x320 nfsd+0x180/0x2e0 kthread+0x199/0x1d0 ret_from_fork+0x30/0x50 ret_from_fork_asm+0x1b/0x30 Fix it by checking the result of nfsd4_run_cb and call nfs4_put_stid if fail to queue dl_recall. Cc: stable@vger.kernel.org Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-10nfsd: allow SC_STATUS_FREEABLE when searching via nfs4_lookup_stateid()Jeff Layton
commit d1bc15b147d35b4cb7ca99a9a7d79d41ca342c13 upstream. The pynfs DELEG8 test fails when run against nfsd. It acquires a delegation and then lets the lease time out. It then tries to use the deleg stateid and expects to see NFS4ERR_DELEG_REVOKED, but it gets bad NFS4ERR_BAD_STATEID instead. When a delegation is revoked, it's initially marked with SC_STATUS_REVOKED, or SC_STATUS_ADMIN_REVOKED and later, it's marked with the SC_STATUS_FREEABLE flag, which denotes that it is waiting for s FREE_STATEID call. nfs4_lookup_stateid() accepts a statusmask that includes the status flags that a found stateid is allowed to have. Currently, that mask never includes SC_STATUS_FREEABLE, which means that revoked delegations are (almost) never found. Add SC_STATUS_FREEABLE to the always-allowed status flags, and remove it from nfsd4_delegreturn() since it's now always implied. Fixes: 8dd91e8d31fe ("nfsd: fix race between laundromat and free_stateid") Cc: stable@vger.kernel.org Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-07nfsd: fix legacy client tracking initializationScott Mayhew
commit de71d4e211eddb670b285a0ea477a299601ce1ca upstream. Get rid of the nfsd4_legacy_tracking_ops->init() call in check_for_legacy_methods(). That will be handled in the caller (nfsd4_client_tracking_init()). Otherwise, we'll wind up calling nfsd4_legacy_tracking_ops->init() twice, and the second time we'll trigger the BUG_ON() in nfsd4_init_recdir(). Fixes: 74fd48739d04 ("nfsd: new Kconfig option for legacy client tracking") Reported-by: Jur van der Burg <jur@avtware.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219580 Signed-off-by: Scott Mayhew <smayhew@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Salvatore Bonaccorso <carnil@debian.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-21nfsd: validate the nfsd_serv pointer before calling svc_wake_upJeff Layton
commit b9382e29ca538b879645899ce45d652a304e2ed2 upstream. nfsd_file_dispose_list_delayed can be called from the filecache laundrette, which is shut down after the nfsd threads are shut down and the nfsd_serv pointer is cleared. If nn->nfsd_serv is NULL then there are no threads to wake. Ensure that the nn->nfsd_serv pointer is non-NULL before calling svc_wake_up in nfsd_file_dispose_list_delayed. This is safe since the svc_serv is not freed until after the filecache laundrette is cancelled. Reported-by: Salvatore Bonaccorso <carnil@debian.org> Closes: https://bugs.debian.org/1093734 Fixes: ffb402596147 ("nfsd: Don't leave work of closing files to a work queue") Cc: stable@vger.kernel.org Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-21NFSD: fix hang in nfsd4_shutdown_callbackDai Ngo
commit 036ac2778f7b28885814c6fbc07e156ad1624d03 upstream. If nfs4_client is in courtesy state then there is no point to send the callback. This causes nfsd4_shutdown_callback to hang since cl_cb_inflight is not 0. This hang lasts about 15 minutes until TCP notifies NFSD that the connection was dropped. This patch modifies nfsd4_run_cb_work to skip the RPC call if nfs4_client is in courtesy state. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Fixes: 66af25799940 ("NFSD: add courteous server support for thread with only delegation") Cc: stable@vger.kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-21nfsd: clear acl_access/acl_default after releasing themLi Lingfeng
commit 7faf14a7b0366f153284db0ad3347c457ea70136 upstream. If getting acl_default fails, acl_access and acl_default will be released simultaneously. However, acl_access will still retain a pointer pointing to the released posix_acl, which will trigger a WARNING in nfs3svc_release_getacl like this: ------------[ cut here ]------------ refcount_t: underflow; use-after-free. WARNING: CPU: 26 PID: 3199 at lib/refcount.c:28 refcount_warn_saturate+0xb5/0x170 Modules linked in: CPU: 26 UID: 0 PID: 3199 Comm: nfsd Not tainted 6.12.0-rc6-00079-g04ae226af01f-dirty #8 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014 RIP: 0010:refcount_warn_saturate+0xb5/0x170 Code: cc cc 0f b6 1d b3 20 a5 03 80 fb 01 0f 87 65 48 d8 00 83 e3 01 75 e4 48 c7 c7 c0 3b 9b 85 c6 05 97 20 a5 03 01 e8 fb 3e 30 ff <0f> 0b eb cd 0f b6 1d 8a3 RSP: 0018:ffffc90008637cd8 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff83904fde RDX: dffffc0000000000 RSI: 0000000000000008 RDI: ffff88871ed36380 RBP: ffff888158beeb40 R08: 0000000000000001 R09: fffff520010c6f56 R10: ffffc90008637ab7 R11: 0000000000000001 R12: 0000000000000001 R13: ffff888140e77400 R14: ffff888140e77408 R15: ffffffff858b42c0 FS: 0000000000000000(0000) GS:ffff88871ed00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000562384d32158 CR3: 000000055cc6a000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? refcount_warn_saturate+0xb5/0x170 ? __warn+0xa5/0x140 ? refcount_warn_saturate+0xb5/0x170 ? report_bug+0x1b1/0x1e0 ? handle_bug+0x53/0xa0 ? exc_invalid_op+0x17/0x40 ? asm_exc_invalid_op+0x1a/0x20 ? tick_nohz_tick_stopped+0x1e/0x40 ? refcount_warn_saturate+0xb5/0x170 ? refcount_warn_saturate+0xb5/0x170 nfs3svc_release_getacl+0xc9/0xe0 svc_process_common+0x5db/0xb60 ? __pfx_svc_process_common+0x10/0x10 ? __rcu_read_unlock+0x69/0xa0 ? __pfx_nfsd_dispatch+0x10/0x10 ? svc_xprt_received+0xa1/0x120 ? xdr_init_decode+0x11d/0x190 svc_process+0x2a7/0x330 svc_handle_xprt+0x69d/0x940 svc_recv+0x180/0x2d0 nfsd+0x168/0x200 ? __pfx_nfsd+0x10/0x10 kthread+0x1a2/0x1e0 ? kthread+0xf4/0x1e0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x34/0x60 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Kernel panic - not syncing: kernel: panic_on_warn set ... Clear acl_access/acl_default after posix_acl_release is called to prevent UAF from being triggered. Fixes: a257cdd0e217 ("[PATCH] NFSD: Add server support for NFSv3 ACLs.") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20241107014705.2509463-1-lilingfeng@huaweicloud.com/ Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Rick Macklem <rmacklem@uoguelph.ca> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-17NFSD: Encode COMPOUND operation status on page boundariesChuck Lever
commit ef3675b45bcb6c17cabbbde620c6cea52ffb21ac upstream. J. David reports an odd corruption of a READDIR reply sent to a FreeBSD client. xdr_reserve_space() has to do a special trick when the @nbytes value requests more space than there is in the current page of the XDR buffer. In that case, xdr_reserve_space() returns a pointer to the start of the next page, and then the next call to xdr_reserve_space() invokes __xdr_commit_encode() to copy enough of the data item back into the previous page to make that data item contiguous across the page boundary. But we need to be careful in the case where buffer space is reserved early for a data item whose value will be inserted into the buffer later. One such caller, nfsd4_encode_operation(), reserves 8 bytes in the encoding buffer for each COMPOUND operation. However, a READDIR result can sometimes encode file names so that there are only 4 bytes left at the end of the current XDR buffer page (though plenty of pages are left to handle the remaining encoding tasks). If a COMPOUND operation follows the READDIR result (say, a GETATTR), then nfsd4_encode_operation() will reserve 8 bytes for the op number (9) and the op status (usually NFS4_OK). In this weird case, xdr_reserve_space() returns a pointer to byte zero of the next buffer page, as it assumes the data item will be copied back into place (in the previous page) on the next call to xdr_reserve_space(). nfsd4_encode_operation() writes the op num into the buffer, then saves the next 4-byte location for the op's status code. The next xdr_reserve_space() call is part of GETATTR encoding, so the op num gets copied back into the previous page, but the saved location for the op status continues to point to the wrong spot in the current XDR buffer page because __xdr_commit_encode() moved that data item. After GETATTR encoding is complete, nfsd4_encode_operation() writes the op status over the first XDR data item in the GETATTR result. The NFS4_OK status code (0) makes it look like there are zero items in the GETATTR's attribute bitmask. The patch description of commit 2825a7f90753 ("nfsd4: allow encoding across page boundaries") [2014] remarks that NFSD "can't handle a new operation starting close to the end of a page." This bug appears to be one reason for that remark. Reported-by: J David <j.david.lists@gmail.com> Closes: https://lore.kernel.org/linux-nfs/3998d739-c042-46b4-8166-dbd6c5f0e804@oracle.com/T/#t Tested-by: Rick Macklem <rmacklem@uoguelph.ca> Reviewed-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-01-02nfsd: restore callback functionality for NFSv4.0NeilBrown
[ Upstream commit 7917f01a286ce01e9c085e24468421f596ee1a0c ] A recent patch inadvertently broke callbacks for NFSv4.0. In the 4.0 case we do not expect a session to be found but still need to call setup_callback_client() which will not try to dereference it. This patch moves the check for failure to find a session into the 4.1+ branch of setup_callback_client() Fixes: 1e02c641c3a4 ("NFSD: Prevent NULL dereference in nfsd4_process_cb_update()") Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-01-02nfsd: Revert "nfsd: release svc_expkey/svc_export with rcu_work"Yang Erkun
[ Upstream commit 69d803c40edeaf94089fbc8751c9b746cdc35044 ] This reverts commit f8c989a0c89a75d30f899a7cabdc14d72522bb8d. Before this commit, svc_export_put or expkey_put will call path_put with sync mode. After this commit, path_put will be called with async mode. And this can lead the unexpected results show as follow. mkfs.xfs -f /dev/sda echo "/ *(rw,no_root_squash,fsid=0)" > /etc/exports echo "/mnt *(rw,no_root_squash,fsid=1)" >> /etc/exports exportfs -ra service nfs-server start mount -t nfs -o vers=4.0 127.0.0.1:/mnt /mnt1 mount /dev/sda /mnt/sda touch /mnt1/sda/file exportfs -r umount /mnt/sda # failed unexcepted The touch will finally call nfsd_cross_mnt, add refcount to mount, and then add cache_head. Before this commit, exportfs -r will call cache_flush to cleanup all cache_head, and path_put in svc_export_put/expkey_put will be finished with sync mode. So, the latter umount will always success. However, after this commit, path_put will be called with async mode, the latter umount may failed, and if we add some delay, umount will success too. Personally I think this bug and should be fixed. We first revert before bugfix patch, and then fix the original bug with a different way. Fixes: f8c989a0c89a ("nfsd: release svc_expkey/svc_export with rcu_work") Signed-off-by: Yang Erkun <yangerkun@huawei.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-09nfsd: fix nfs4_openowner leak when concurrent nfsd4_open occurYang Erkun
commit 98100e88dd8865999dc6379a3356cd799795fe7b upstream. The action force umount(umount -f) will attempt to kill all rpc_task even umount operation may ultimately fail if some files remain open. Consequently, if an action attempts to open a file, it can potentially send two rpc_task to nfs server. NFS CLIENT thread1 thread2 open("file") ... nfs4_do_open _nfs4_do_open _nfs4_open_and_get_state _nfs4_proc_open nfs4_run_open_task /* rpc_task1 */ rpc_run_task rpc_wait_for_completion_task umount -f nfs_umount_begin rpc_killall_tasks rpc_signal_task rpc_task1 been wakeup and return -512 _nfs4_do_open // while loop ... nfs4_run_open_task /* rpc_task2 */ rpc_run_task rpc_wait_for_completion_task While processing an open request, nfsd will first attempt to find or allocate an nfs4_openowner. If it finds an nfs4_openowner that is not marked as NFS4_OO_CONFIRMED, this nfs4_openowner will released. Since two rpc_task can attempt to open the same file simultaneously from the client to server, and because two instances of nfsd can run concurrently, this situation can lead to lots of memory leak. Additionally, when we echo 0 to /proc/fs/nfsd/threads, warning will be triggered. NFS SERVER nfsd1 nfsd2 echo 0 > /proc/fs/nfsd/threads nfsd4_open nfsd4_process_open1 find_or_alloc_open_stateowner // alloc oo1, stateid1 nfsd4_open nfsd4_process_open1 find_or_alloc_open_stateowner // find oo1, without NFS4_OO_CONFIRMED release_openowner unhash_openowner_locked list_del_init(&oo->oo_perclient) // cannot find this oo // from client, LEAK!!! alloc_stateowner // alloc oo2 nfsd4_process_open2 init_open_stateid // associate oo1 // with stateid1, stateid1 LEAK!!! nfs4_get_vfs_file // alloc nfsd_file1 and nfsd_file_mark1 // all LEAK!!! nfsd4_process_open2 ... write_threads ... nfsd_destroy_serv nfsd_shutdown_net nfs4_state_shutdown_net nfs4_state_destroy_net destroy_client __destroy_client // won't find oo1!!! nfsd_shutdown_generic nfsd_file_cache_shutdown kmem_cache_destroy for nfsd_file_slab and nfsd_file_mark_slab // bark since nfsd_file1 // and nfsd_file_mark1 // still alive ======================================================================= BUG nfsd_file (Not tainted): Objects remaining in nfsd_file on __kmem_cache_shutdown() ----------------------------------------------------------------------- Slab 0xffd4000004438a80 objects=34 used=1 fp=0xff11000110e2ad28 flags=0x17ffffc0000240(workingset|head|node=0|zone=2|lastcpupid=0x1fffff) CPU: 4 UID: 0 PID: 757 Comm: sh Not tainted 6.12.0-rc6+ #19 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x53/0x70 slab_err+0xb0/0xf0 __kmem_cache_shutdown+0x15c/0x310 kmem_cache_destroy+0x66/0x160 nfsd_file_cache_shutdown+0xac/0x210 [nfsd] nfsd_destroy_serv+0x251/0x2a0 [nfsd] nfsd_svc+0x125/0x1e0 [nfsd] write_threads+0x16a/0x2a0 [nfsd] nfsctl_transaction_write+0x74/0xa0 [nfsd] vfs_write+0x1ae/0x6d0 ksys_write+0xc1/0x160 do_syscall_64+0x5f/0x170 entry_SYSCALL_64_after_hwframe+0x76/0x7e Disabling lock debugging due to kernel taint Object 0xff11000110e2ac38 @offset=3128 Allocated in nfsd_file_do_acquire+0x20f/0xa30 [nfsd] age=1635 cpu=3 pid=800 nfsd_file_do_acquire+0x20f/0xa30 [nfsd] nfsd_file_acquire_opened+0x5f/0x90 [nfsd] nfs4_get_vfs_file+0x4c9/0x570 [nfsd] nfsd4_process_open2+0x713/0x1070 [nfsd] nfsd4_open+0x74b/0x8b0 [nfsd] nfsd4_proc_compound+0x70b/0xc20 [nfsd] nfsd_dispatch+0x1b4/0x3a0 [nfsd] svc_process_common+0x5b8/0xc50 [sunrpc] svc_process+0x2ab/0x3b0 [sunrpc] svc_handle_xprt+0x681/0xa20 [sunrpc] nfsd+0x183/0x220 [nfsd] kthread+0x199/0x1e0 ret_from_fork+0x31/0x60 ret_from_fork_asm+0x1a/0x30 Add nfs4_openowner_unhashed to help found unhashed nfs4_openowner, and break nfsd4_open process to fix this problem. Cc: stable@vger.kernel.org # v5.4+ Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Yang Erkun <yangerkun@huawei.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-09nfsd: make sure exp active before svc_export_showYang Erkun
commit be8f982c369c965faffa198b46060f8853e0f1f0 upstream. The function `e_show` was called with protection from RCU. This only ensures that `exp` will not be freed. Therefore, the reference count for `exp` can drop to zero, which will trigger a refcount use-after-free warning when `exp_get` is called. To resolve this issue, use `cache_get_rcu` to ensure that `exp` remains active. ------------[ cut here ]------------ refcount_t: addition on 0; use-after-free. WARNING: CPU: 3 PID: 819 at lib/refcount.c:25 refcount_warn_saturate+0xb1/0x120 CPU: 3 UID: 0 PID: 819 Comm: cat Not tainted 6.12.0-rc3+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014 RIP: 0010:refcount_warn_saturate+0xb1/0x120 ... Call Trace: <TASK> e_show+0x20b/0x230 [nfsd] seq_read_iter+0x589/0x770 seq_read+0x1e5/0x270 vfs_read+0x125/0x530 ksys_read+0xc1/0x160 do_syscall_64+0x5f/0x170 entry_SYSCALL_64_after_hwframe+0x76/0x7e Fixes: bf18f163e89c ("NFSD: Using exp_get for export getting") Cc: stable@vger.kernel.org # 4.20+ Signed-off-by: Yang Erkun <yangerkun@huawei.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-05NFSD: Prevent a potential integer overflowChuck Lever
commit 7f33b92e5b18e904a481e6e208486da43e4dc841 upstream. If the tag length is >= U32_MAX - 3 then the "length + 4" addition can result in an integer overflow. Address this by splitting the decoding into several steps so that decode_cb_compound4res() does not have to perform arithmetic on the unsafe length value. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Cc: stable@vger.kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-05nfs_common: must not hold RCU while calling nfsd_file_put_localMike Snitzer
[ Upstream commit c840b8e1f039e90f97ca55525667eb961422f86c ] Move holding the RCU from nfs_to_nfsd_file_put_local to nfs_to_nfsd_net_put. It is the call to nfs_to->nfsd_serv_put that requires the RCU anyway (the puts for nfsd_file and netns were combined to avoid an extra indirect reference but that micro-optimization isn't possible now). This fixes xfstests generic/013 and it triggering: "Voluntary context switch within RCU read-side critical section!" [ 143.545738] Call Trace: [ 143.546206] <TASK> [ 143.546625] ? show_regs+0x6d/0x80 [ 143.547267] ? __warn+0x91/0x140 [ 143.547951] ? rcu_note_context_switch+0x496/0x5d0 [ 143.548856] ? report_bug+0x193/0x1a0 [ 143.549557] ? handle_bug+0x63/0xa0 [ 143.550214] ? exc_invalid_op+0x1d/0x80 [ 143.550938] ? asm_exc_invalid_op+0x1f/0x30 [ 143.551736] ? rcu_note_context_switch+0x496/0x5d0 [ 143.552634] ? wakeup_preempt+0x62/0x70 [ 143.553358] __schedule+0xaa/0x1380 [ 143.554025] ? _raw_spin_unlock_irqrestore+0x12/0x40 [ 143.554958] ? try_to_wake_up+0x1fe/0x6b0 [ 143.555715] ? wake_up_process+0x19/0x20 [ 143.556452] schedule+0x2e/0x120 [ 143.557066] schedule_preempt_disabled+0x19/0x30 [ 143.557933] rwsem_down_read_slowpath+0x24d/0x4a0 [ 143.558818] ? xfs_efi_item_format+0x50/0xc0 [xfs] [ 143.559894] down_read+0x4e/0xb0 [ 143.560519] xlog_cil_commit+0x1b2/0xbc0 [xfs] [ 143.561460] ? _raw_spin_unlock+0x12/0x30 [ 143.562212] ? xfs_inode_item_precommit+0xc7/0x220 [xfs] [ 143.563309] ? xfs_trans_run_precommits+0x69/0xd0 [xfs] [ 143.564394] __xfs_trans_commit+0xb5/0x330 [xfs] [ 143.565367] xfs_trans_roll+0x48/0xc0 [xfs] [ 143.566262] xfs_defer_trans_roll+0x57/0x100 [xfs] [ 143.567278] xfs_defer_finish_noroll+0x27a/0x490 [xfs] [ 143.568342] xfs_defer_finish+0x1a/0x80 [xfs] [ 143.569267] xfs_bunmapi_range+0x4d/0xb0 [xfs] [ 143.570208] xfs_itruncate_extents_flags+0x13d/0x230 [xfs] [ 143.571353] xfs_free_eofblocks+0x12e/0x190 [xfs] [ 143.572359] xfs_file_release+0x12d/0x140 [xfs] [ 143.573324] __fput+0xe8/0x2d0 [ 143.573922] __fput_sync+0x1d/0x30 [ 143.574574] nfsd_filp_close+0x33/0x60 [nfsd] [ 143.575430] nfsd_file_free+0x96/0x150 [nfsd] [ 143.576274] nfsd_file_put+0xf7/0x1a0 [nfsd] [ 143.577104] nfsd_file_put_local+0x18/0x30 [nfsd] [ 143.578070] nfs_close_local_fh+0x101/0x110 [nfs_localio] [ 143.579079] __put_nfs_open_context+0xc9/0x180 [nfs] [ 143.580031] nfs_file_clear_open_context+0x4a/0x60 [nfs] [ 143.581038] nfs_file_release+0x3e/0x60 [nfs] [ 143.581879] __fput+0xe8/0x2d0 [ 143.582464] __fput_sync+0x1d/0x30 [ 143.583108] __x64_sys_close+0x41/0x80 [ 143.583823] x64_sys_call+0x189a/0x20d0 [ 143.584552] do_syscall_64+0x64/0x170 [ 143.585240] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 143.586185] RIP: 0033:0x7f3c5153efd7 Fixes: 65f2a5c36635 ("nfs_common: fix race in NFS calls to nfsd_file_put_local() and nfsd_serv_put()") Signed-off-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-05NFSD: Fix nfsd4_shutdown_copy()Chuck Lever
[ Upstream commit 62a8642ba00aa8ceb0a02ade942f5ec52e877c95 ] nfsd4_shutdown_copy() is just this: while ((copy = nfsd4_get_copy(clp)) != NULL) nfsd4_stop_copy(copy); nfsd4_get_copy() bumps @copy's reference count, preventing nfsd4_stop_copy() from releasing @copy. A while loop like this usually works by removing the first element of the list, but neither nfsd4_get_copy() nor nfsd4_stop_copy() alters the async_copies list. Best I can tell, then, is that nfsd4_shutdown_copy() continues to loop until other threads manage to remove all the items from this list. The spinning loop blocks shutdown until these items are gone. Possibly the reason we haven't seen this issue in the field is because client_has_state() prevents __destroy_client() from calling nfsd4_shutdown_copy() if there are any items on this list. In a subsequent patch I plan to remove that restriction. Fixes: e0639dc5805a ("NFSD introduce async copy feature") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-05nfsd: release svc_expkey/svc_export with rcu_workYang Erkun
[ Upstream commit f8c989a0c89a75d30f899a7cabdc14d72522bb8d ] The last reference for `cache_head` can be reduced to zero in `c_show` and `e_show`(using `rcu_read_lock` and `rcu_read_unlock`). Consequently, `svc_export_put` and `expkey_put` will be invoked, leading to two issues: 1. The `svc_export_put` will directly free ex_uuid. However, `e_show`/`c_show` will access `ex_uuid` after `cache_put`, which can trigger a use-after-free issue, shown below. ================================================================== BUG: KASAN: slab-use-after-free in svc_export_show+0x362/0x430 [nfsd] Read of size 1 at addr ff11000010fdc120 by task cat/870 CPU: 1 UID: 0 PID: 870 Comm: cat Not tainted 6.12.0-rc3+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x53/0x70 print_address_description.constprop.0+0x2c/0x3a0 print_report+0xb9/0x280 kasan_report+0xae/0xe0 svc_export_show+0x362/0x430 [nfsd] c_show+0x161/0x390 [sunrpc] seq_read_iter+0x589/0x770 seq_read+0x1e5/0x270 proc_reg_read+0xe1/0x140 vfs_read+0x125/0x530 ksys_read+0xc1/0x160 do_syscall_64+0x5f/0x170 entry_SYSCALL_64_after_hwframe+0x76/0x7e Allocated by task 830: kasan_save_stack+0x20/0x40 kasan_save_track+0x14/0x30 __kasan_kmalloc+0x8f/0xa0 __kmalloc_node_track_caller_noprof+0x1bc/0x400 kmemdup_noprof+0x22/0x50 svc_export_parse+0x8a9/0xb80 [nfsd] cache_do_downcall+0x71/0xa0 [sunrpc] cache_write_procfs+0x8e/0xd0 [sunrpc] proc_reg_write+0xe1/0x140 vfs_write+0x1a5/0x6d0 ksys_write+0xc1/0x160 do_syscall_64+0x5f/0x170 entry_SYSCALL_64_after_hwframe+0x76/0x7e Freed by task 868: kasan_save_stack+0x20/0x40 kasan_save_track+0x14/0x30 kasan_save_free_info+0x3b/0x60 __kasan_slab_free+0x37/0x50 kfree+0xf3/0x3e0 svc_export_put+0x87/0xb0 [nfsd] cache_purge+0x17f/0x1f0 [sunrpc] nfsd_destroy_serv+0x226/0x2d0 [nfsd] nfsd_svc+0x125/0x1e0 [nfsd] write_threads+0x16a/0x2a0 [nfsd] nfsctl_transaction_write+0x74/0xa0 [nfsd] vfs_write+0x1a5/0x6d0 ksys_write+0xc1/0x160 do_syscall_64+0x5f/0x170 entry_SYSCALL_64_after_hwframe+0x76/0x7e 2. We cannot sleep while using `rcu_read_lock`/`rcu_read_unlock`. However, `svc_export_put`/`expkey_put` will call path_put, which subsequently triggers a sleeping operation due to the following `dput`. ============================= WARNING: suspicious RCU usage 5.10.0-dirty #141 Not tainted ----------------------------- ... Call Trace: dump_stack+0x9a/0xd0 ___might_sleep+0x231/0x240 dput+0x39/0x600 path_put+0x1b/0x30 svc_export_put+0x17/0x80 e_show+0x1c9/0x200 seq_read_iter+0x63f/0x7c0 seq_read+0x226/0x2d0 vfs_read+0x113/0x2c0 ksys_read+0xc9/0x170 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x67/0xd1 Fix these issues by using `rcu_work` to help release `svc_expkey`/`svc_export`. This approach allows for an asynchronous context to invoke `path_put` and also facilitates the freeing of `uuid/exp/key` after an RCU grace period. Fixes: 9ceddd9da134 ("knfsd: Allow lockless lookups of the exports") Signed-off-by: Yang Erkun <yangerkun@huawei.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-05NFSD: Cap the number of bytes copied by nfs4_reset_recoverydir()Chuck Lever
[ Upstream commit f64ea4af43161bb86ffc77e6aeb5bcf5c3229df0 ] It's only current caller already length-checks the string, but let's be safe. Fixes: 0964a3d3f1aa ("[PATCH] knfsd: nfsd4 reboot dirname fix") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-05NFSD: Prevent NULL dereference in nfsd4_process_cb_update()Chuck Lever
[ Upstream commit 1e02c641c3a43c88cecc08402000418e15578d38 ] @ses is initialized to NULL. If __nfsd4_find_backchannel() finds no available backchannel session, setup_callback_client() will try to dereference @ses and segfault. Fixes: dcbeaa68dbbd ("nfsd4: allow backchannel recovery") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-05nfsd: drop inode parameter from nfsd4_change_attribute()Jeff Layton
[ Upstream commit f67eef8da0e8c54709fefdecd16ad8d70f0c9d20 ] The inode that nfs4_open_delegation() passes to this function is wrong, which throws off the result. The inode will end up getting a directory-style change attr instead of a regular-file-style one. Fix up nfs4_delegation_stat() to fetch STATX_MODE, and then drop the inode parameter from nfsd4_change_attribute(), since it's no longer needed. Fixes: c5967721e106 ("NFSD: handle GETATTR conflict with write delegation") Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-09Merge tag 'nfsd-6.12-4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fix from Chuck Lever: - Fix a v6.12-rc regression when exporting ext4 filesystems with NFSD * tag 'nfsd-6.12-4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: NFSD: Fix READDIR on NFSv3 mounts of ext4 exports
2024-11-07NFSD: Fix READDIR on NFSv3 mounts of ext4 exportsChuck Lever
I noticed that recently, simple operations like "make" started failing on NFSv3 mounts of ext4 exports. Network capture shows that READDIRPLUS operated correctly but READDIR failed with NFS3ERR_INVAL. The vfs_llseek() call returned EINVAL when it is passed a non-zero starting directory cookie. I bisected to commit c689bdd3bffa ("nfsd: further centralize protocol version checks."). Turns out that nfsd3_proc_readdir() does not call fh_verify() before it calls nfsd_readdir(), so the new fhp->fh_64bit_cookies boolean is not set properly. This leaves the NFSD_MAY_64BIT_COOKIE unset when the directory is opened. For ext4, this causes the wrong "max file size" value to be used when sanity checking the incoming directory cookie (which is a seek offset value). The fhp->fh_64bit_cookies boolean is /always/ properly initialized after nfsd_open() returns. There doesn't seem to be a reason for the generic NFSD open helper to handle the f_mode fix-up for directories, so just move that to the one caller that tries to open an S_IFDIR with NFSD_MAY_64BIT_COOKIE. Suggested-by: NeilBrown <neilb@suse.de> Fixes: c689bdd3bffa ("nfsd: further centralize protocol version checks.") Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-02Merge tag 'nfsd-6.12-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Fix two async COPY bugs found during NFS bake-a-thon - Fix an svcrdma memory leak * tag 'nfsd-6.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: rpcrdma: Always release the rpcrdma_device's xa_array NFSD: Never decrement pending_async_copies on error NFSD: Initialize struct nfsd4_copy earlier
2024-10-30NFSD: Never decrement pending_async_copies on errorChuck Lever
The error flow in nfsd4_copy() calls cleanup_async_copy(), which already decrements nn->pending_async_copies. Reported-by: Olga Kornievskaia <okorniev@redhat.com> Fixes: aadc3bbea163 ("NFSD: Limit the number of concurrent async COPY operations") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-10-29NFSD: Initialize struct nfsd4_copy earlierChuck Lever
Ensure the refcount and async_copies fields are initialized early. cleanup_async_copy() will reference these fields if an error occurs in nfsd4_copy(). If they are not correctly initialized, at the very least, a refcount underflow occurs. Reported-by: Olga Kornievskaia <okorniev@redhat.com> Fixes: aadc3bbea163 ("NFSD: Limit the number of concurrent async COPY operations") Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Olga Kornievskaia <okorniev@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-10-25Merge tag 'nfsd-6.12-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Fix a couple of use-after-free bugs * tag 'nfsd-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: nfsd: cancel nfsd_shrinker_work using sync mode in nfs4_state_shutdown_net nfsd: fix race between laundromat and free_stateid
2024-10-21nfsd: cancel nfsd_shrinker_work using sync mode in nfs4_state_shutdown_netYang Erkun
In the normal case, when we excute `echo 0 > /proc/fs/nfsd/threads`, the function `nfs4_state_destroy_net` in `nfs4_state_shutdown_net` will release all resources related to the hashed `nfs4_client`. If the `nfsd_client_shrinker` is running concurrently, the `expire_client` function will first unhash this client and then destroy it. This can lead to the following warning. Additionally, numerous use-after-free errors may occur as well. nfsd_client_shrinker echo 0 > /proc/fs/nfsd/threads expire_client nfsd_shutdown_net unhash_client ... nfs4_state_shutdown_net /* won't wait shrinker exit */ /* cancel_work(&nn->nfsd_shrinker_work) * nfsd_file for this /* won't destroy unhashed client1 */ * client1 still alive nfs4_state_destroy_net */ nfsd_file_cache_shutdown /* trigger warning */ kmem_cache_destroy(nfsd_file_slab) kmem_cache_destroy(nfsd_file_mark_slab) /* release nfsd_file and mark */ __destroy_client ==================================================================== BUG nfsd_file (Not tainted): Objects remaining in nfsd_file on __kmem_cache_shutdown() -------------------------------------------------------------------- CPU: 4 UID: 0 PID: 764 Comm: sh Not tainted 6.12.0-rc3+ #1 dump_stack_lvl+0x53/0x70 slab_err+0xb0/0xf0 __kmem_cache_shutdown+0x15c/0x310 kmem_cache_destroy+0x66/0x160 nfsd_file_cache_shutdown+0xac/0x210 [nfsd] nfsd_destroy_serv+0x251/0x2a0 [nfsd] nfsd_svc+0x125/0x1e0 [nfsd] write_threads+0x16a/0x2a0 [nfsd] nfsctl_transaction_write+0x74/0xa0 [nfsd] vfs_write+0x1a5/0x6d0 ksys_write+0xc1/0x160 do_syscall_64+0x5f/0x170 entry_SYSCALL_64_after_hwframe+0x76/0x7e ==================================================================== BUG nfsd_file_mark (Tainted: G B W ): Objects remaining nfsd_file_mark on __kmem_cache_shutdown() -------------------------------------------------------------------- dump_stack_lvl+0x53/0x70 slab_err+0xb0/0xf0 __kmem_cache_shutdown+0x15c/0x310 kmem_cache_destroy+0x66/0x160 nfsd_file_cache_shutdown+0xc8/0x210 [nfsd] nfsd_destroy_serv+0x251/0x2a0 [nfsd] nfsd_svc+0x125/0x1e0 [nfsd] write_threads+0x16a/0x2a0 [nfsd] nfsctl_transaction_write+0x74/0xa0 [nfsd] vfs_write+0x1a5/0x6d0 ksys_write+0xc1/0x160 do_syscall_64+0x5f/0x170 entry_SYSCALL_64_after_hwframe+0x76/0x7e To resolve this issue, cancel `nfsd_shrinker_work` using synchronous mode in nfs4_state_shutdown_net. Fixes: 7c24fa225081 ("NFSD: replace delayed_work with work_struct for nfsd_client_shrinker") Signed-off-by: Yang Erkun <yangerkun@huaweicloud.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-10-18nfsd: fix race between laundromat and free_stateidOlga Kornievskaia
There is a race between laundromat handling of revoked delegations and a client sending free_stateid operation. Laundromat thread finds that delegation has expired and needs to be revoked so it marks the delegation stid revoked and it puts it on a reaper list but then it unlock the state lock and the actual delegation revocation happens without the lock. Once the stid is marked revoked a racing free_stateid processing thread does the following (1) it calls list_del_init() which removes it from the reaper list and (2) frees the delegation stid structure. The laundromat thread ends up not calling the revoke_delegation() function for this particular delegation but that means it will no release the lock lease that exists on the file. Now, a new open for this file comes in and ends up finding that lease list isn't empty and calls nfsd_breaker_owns_lease() which ends up trying to derefence a freed delegation stateid. Leading to the followint use-after-free KASAN warning: kernel: ================================================================== kernel: BUG: KASAN: slab-use-after-free in nfsd_breaker_owns_lease+0x140/0x160 [nfsd] kernel: Read of size 8 at addr ffff0000e73cd0c8 by task nfsd/6205 kernel: kernel: CPU: 2 UID: 0 PID: 6205 Comm: nfsd Kdump: loaded Not tainted 6.11.0-rc7+ #9 kernel: Hardware name: Apple Inc. Apple Virtualization Generic Platform, BIOS 2069.0.0.0.0 08/03/2024 kernel: Call trace: kernel: dump_backtrace+0x98/0x120 kernel: show_stack+0x1c/0x30 kernel: dump_stack_lvl+0x80/0xe8 kernel: print_address_description.constprop.0+0x84/0x390 kernel: print_report+0xa4/0x268 kernel: kasan_report+0xb4/0xf8 kernel: __asan_report_load8_noabort+0x1c/0x28 kernel: nfsd_breaker_owns_lease+0x140/0x160 [nfsd] kernel: nfsd_file_do_acquire+0xb3c/0x11d0 [nfsd] kernel: nfsd_file_acquire_opened+0x84/0x110 [nfsd] kernel: nfs4_get_vfs_file+0x634/0x958 [nfsd] kernel: nfsd4_process_open2+0xa40/0x1a40 [nfsd] kernel: nfsd4_open+0xa08/0xe80 [nfsd] kernel: nfsd4_proc_compound+0xb8c/0x2130 [nfsd] kernel: nfsd_dispatch+0x22c/0x718 [nfsd] kernel: svc_process_common+0x8e8/0x1960 [sunrpc] kernel: svc_process+0x3d4/0x7e0 [sunrpc] kernel: svc_handle_xprt+0x828/0xe10 [sunrpc] kernel: svc_recv+0x2cc/0x6a8 [sunrpc] kernel: nfsd+0x270/0x400 [nfsd] kernel: kthread+0x288/0x310 kernel: ret_from_fork+0x10/0x20 This patch proposes a fixed that's based on adding 2 new additional stid's sc_status values that help coordinate between the laundromat and other operations (nfsd4_free_stateid() and nfsd4_delegreturn()). First to make sure, that once the stid is marked revoked, it is not removed by the nfsd4_free_stateid(), the laundromat take a reference on the stateid. Then, coordinating whether the stid has been put on the cl_revoked list or we are processing FREE_STATEID and need to make sure to remove it from the list, each check that state and act accordingly. If laundromat has added to the cl_revoke list before the arrival of FREE_STATEID, then nfsd4_free_stateid() knows to remove it from the list. If nfsd4_free_stateid() finds that operations arrived before laundromat has placed it on cl_revoke list, it marks the state freed and then laundromat will no longer add it to the list. Also, for nfsd4_delegreturn() when looking for the specified stid, we need to access stid that are marked removed or freeable, it means the laundromat has started processing it but hasn't finished and this delegreturn needs to return nfserr_deleg_revoked and not nfserr_bad_stateid. The latter will not trigger a FREE_STATEID and the lack of it will leave this stid on the cl_revoked list indefinitely. Fixes: 2d4a532d385f ("nfsd: ensure that clp->cl_revoked list is protected by clp->cl_lock") CC: stable@vger.kernel.org Signed-off-by: Olga Kornievskaia <okorniev@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-10-11Merge tag 'nfs-for-6.12-2' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS client fixes from Anna Schumaker: "Localio Bugfixes: - remove duplicated include in localio.c - fix race in NFS calls to nfsd_file_put_local() and nfsd_serv_put() - fix Kconfig for NFS_COMMON_LOCALIO_SUPPORT - fix nfsd_file tracepoints to handle NULL rqstp pointers Other Bugfixes: - fix program selection loop in svc_process_common - fix integer overflow in decode_rc_list() - prevent NULL-pointer dereference in nfs42_complete_copies() - fix CB_RECALL performance issues when using a large number of delegations" * tag 'nfs-for-6.12-2' of git://git.linux-nfs.org/projects/anna/linux-nfs: NFS: remove revoked delegation from server's delegation list nfsd/localio: fix nfsd_file tracepoints to handle NULL rqstp nfs_common: fix Kconfig for NFS_COMMON_LOCALIO_SUPPORT nfs_common: fix race in NFS calls to nfsd_file_put_local() and nfsd_serv_put() NFSv4: Prevent NULL-pointer dereference in nfs42_complete_copies() SUNRPC: Fix integer overflow in decode_rc_list() sunrpc: fix prog selection loop in svc_process_common nfs: Remove duplicated include in localio.c
2024-10-10Merge tag 'nfsd-6.12-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Fix NFSD bring-up / shutdown - Fix a UAF when releasing a stateid * tag 'nfsd-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: nfsd: fix possible badness in FREE_STATEID nfsd: nfsd_destroy_serv() must call svc_destroy() even if nfsd_startup_net() failed NFSD: Mark filecache "down" if init fails
2024-10-05nfsd: fix possible badness in FREE_STATEIDOlga Kornievskaia
When multiple FREE_STATEIDs are sent for the same delegation stateid, it can lead to a possible either use-after-free or counter refcount underflow errors. In nfsd4_free_stateid() under the client lock we find a delegation stateid, however the code drops the lock before calling nfs4_put_stid(), that allows another FREE_STATE to find the stateid again. The first one will proceed to then free the stateid which leads to either use-after-free or decrementing already zeroed counter. Fixes: 3f29cc82a84c ("nfsd: split sc_status out of sc_type") Signed-off-by: Olga Kornievskaia <okorniev@redhat.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-10-04nfsd/localio: fix nfsd_file tracepoints to handle NULL rqstpMike Snitzer
Otherwise nfsd_file_acquire, nfsd_file_insert_err, and nfsd_file_cons_err will hit a NULL pointer when they are enabled and LOCALIO used. Example trace output (note xid is 0x0 and LOCALIO flag set): nfsd_file_acquire: xid=0x0 inode=0000000069a1b2e7 may_flags=WRITE|LOCALIO ref=1 nf_flags=HASHED|GC nf_may=WRITE nf_file=0000000070123234 status=0 Fixes: c63f0e48febf ("nfsd: add nfsd_file_acquire_local()") Signed-off-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-10-03nfs_common: fix race in NFS calls to nfsd_file_put_local() and nfsd_serv_put()Mike Snitzer
Add nfs_to_nfsd_file_put_local() interface to fix race with nfsd module unload. Similarly, use RCU around nfs_open_local_fh()'s error path call to nfs_to->nfsd_serv_put(). Holding RCU ensures that NFS will safely _call and return_ from its nfs_to calls into the NFSD functions nfsd_file_put_local() and nfsd_serv_put(). Otherwise, if RCU isn't used then there is a narrow window when NFS's reference for the nfsd_file and nfsd_serv are dropped and the NFSD module could be unloaded, which could result in a crash from the return instruction for either nfs_to->nfsd_file_put_local() or nfs_to->nfsd_serv_put(). Reported-by: NeilBrown <neilb@suse.de> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-10-02inotify: Fix possible deadlock in fsnotify_destroy_markLizhi Xu
[Syzbot reported] WARNING: possible circular locking dependency detected 6.11.0-rc4-syzkaller-00019-gb311c1b497e5 #0 Not tainted ------------------------------------------------------ kswapd0/78 is trying to acquire lock: ffff88801b8d8930 (&group->mark_mutex){+.+.}-{3:3}, at: fsnotify_group_lock include/linux/fsnotify_backend.h:270 [inline] ffff88801b8d8930 (&group->mark_mutex){+.+.}-{3:3}, at: fsnotify_destroy_mark+0x38/0x3c0 fs/notify/mark.c:578 but task is already holding lock: ffffffff8ea2fd60 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat mm/vmscan.c:6841 [inline] ffffffff8ea2fd60 (fs_reclaim){+.+.}-{0:0}, at: kswapd+0xbb4/0x35a0 mm/vmscan.c:7223 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (fs_reclaim){+.+.}-{0:0}: ... kmem_cache_alloc_noprof+0x3d/0x2a0 mm/slub.c:4044 inotify_new_watch fs/notify/inotify/inotify_user.c:599 [inline] inotify_update_watch fs/notify/inotify/inotify_user.c:647 [inline] __do_sys_inotify_add_watch fs/notify/inotify/inotify_user.c:786 [inline] __se_sys_inotify_add_watch+0x72e/0x1070 fs/notify/inotify/inotify_user.c:729 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f -> #0 (&group->mark_mutex){+.+.}-{3:3}: ... __mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752 fsnotify_group_lock include/linux/fsnotify_backend.h:270 [inline] fsnotify_destroy_mark+0x38/0x3c0 fs/notify/mark.c:578 fsnotify_destroy_marks+0x14a/0x660 fs/notify/mark.c:934 fsnotify_inoderemove include/linux/fsnotify.h:264 [inline] dentry_unlink_inode+0x2e0/0x430 fs/dcache.c:403 __dentry_kill+0x20d/0x630 fs/dcache.c:610 shrink_kill+0xa9/0x2c0 fs/dcache.c:1055 shrink_dentry_list+0x2c0/0x5b0 fs/dcache.c:1082 prune_dcache_sb+0x10f/0x180 fs/dcache.c:1163 super_cache_scan+0x34f/0x4b0 fs/super.c:221 do_shrink_slab+0x701/0x1160 mm/shrinker.c:435 shrink_slab+0x1093/0x14d0 mm/shrinker.c:662 shrink_one+0x43b/0x850 mm/vmscan.c:4815 shrink_many mm/vmscan.c:4876 [inline] lru_gen_shrink_node mm/vmscan.c:4954 [inline] shrink_node+0x3799/0x3de0 mm/vmscan.c:5934 kswapd_shrink_node mm/vmscan.c:6762 [inline] balance_pgdat mm/vmscan.c:6954 [inline] kswapd+0x1bcd/0x35a0 mm/vmscan.c:7223 [Analysis] The problem is that inotify_new_watch() is using GFP_KERNEL to allocate new watches under group->mark_mutex, however if dentry reclaim races with unlinking of an inode, it can end up dropping the last dentry reference for an unlinked inode resulting in removal of fsnotify mark from reclaim context which wants to acquire group->mark_mutex as well. This scenario shows that all notification groups are in principle prone to this kind of a deadlock (previously, we considered only fanotify and dnotify to be problematic for other reasons) so make sure all allocations under group->mark_mutex happen with GFP_NOFS. Reported-and-tested-by: syzbot+c679f13773f295d2da53@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=c679f13773f295d2da53 Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240927143642.2369508-1-lizhi.xu@windriver.com
2024-09-23nfsd: implement server support for NFS_LOCALIO_PROGRAMMike Snitzer
The LOCALIO auxiliary RPC protocol consists of a single "UUID_IS_LOCAL" RPC method that allows the Linux NFS client to verify the local Linux NFS server can see the nonce (single-use UUID) the client generated and made available in nfs_common. The server expects this protocol to use the same transport as NFS and NFSACL for its RPCs. This protocol isn't part of an IETF standard, nor does it need to be considering it is Linux-to-Linux auxiliary RPC protocol that amounts to an implementation detail. The UUID_IS_LOCAL method encodes the client generated uuid_t in terms of the fixed UUID_SIZE (16 bytes). The fixed size opaque encode and decode XDR methods are used instead of the less efficient variable sized methods. The RPC program number for the NFS_LOCALIO_PROGRAM is 400122 (as assigned by IANA, see https://www.iana.org/assignments/rpc-program-numbers/ ): Linux Kernel Organization 400122 nfslocalio Signed-off-by: Mike Snitzer <snitzer@kernel.org> [neilb: factored out and simplified single localio protocol] Co-developed-by: NeilBrown <neilb@suse.de> Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-09-23nfsd: add LOCALIO supportWeston Andros Adamson
Add server support for bypassing NFS for localhost reads, writes, and commits. This is only useful when both the client and server are running on the same host. If nfsd_open_local_fh() fails then the NFS client will both retry and fallback to normal network-based read, write and commit operations if localio is no longer supported. Care is taken to ensure the same NFS security mechanisms are used (authentication, etc) regardless of whether localio or regular NFS access is used. The auth_domain established as part of the traditional NFS client access to the NFS server is also used for localio. Store auth_domain for localio in nfsd_uuid_t and transfer it to the client if it is local to the server. Relative to containers, localio gives the client access to the network namespace the server has. This is required to allow the client to access the server's per-namespace nfsd_net struct. This commit also introduces the use of NFSD's percpu_ref to interlock nfsd_destroy_serv and nfsd_open_local_fh, to ensure nn->nfsd_serv is not destroyed while in use by nfsd_open_local_fh and other LOCALIO client code. CONFIG_NFS_LOCALIO enables NFS server support for LOCALIO. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Co-developed-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Co-developed-by: NeilBrown <neilb@suse.de> Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-09-23nfs_common: prepare for the NFS client to use nfsd_file for LOCALIOMike Snitzer
The next commit will introduce nfsd_open_local_fh() which returns an nfsd_file structure. This commit exposes LOCALIO's required NFSD symbols to the NFS client: - Make nfsd_open_local_fh() symbol and other required NFSD symbols available to NFS in a global 'nfs_to' nfsd_localio_operations struct (global access suggested by Trond, nfsd_localio_operations suggested by NeilBrown). The next commit will also introduce nfsd_localio_ops_init() that init_nfsd() will call to initialize 'nfs_to'. - Introduce nfsd_file_file() that provides access to nfsd_file's backing file. Keeps nfsd_file structure opaque to NFS client (as suggested by Jeff Layton). - Introduce nfsd_file_put_local() that will put the reference to the nfsd_file's associated nn->nfsd_serv and then put the reference to the nfsd_file (as suggested by NeilBrown). Suggested-by: Trond Myklebust <trond.myklebust@hammerspace.com> # nfs_to Suggested-by: NeilBrown <neilb@suse.de> # nfsd_localio_operations Suggested-by: Jeff Layton <jlayton@kernel.org> # nfsd_file_file Signed-off-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>