summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2024-08-03kernfs: Convert kernfs_path_from_node_locked() from strlcpy() to strscpy()Kees Cook
[ Upstream commit ff6d413b0b59466e5acf2e42f294b1842ae130a1 ] One of the last remaining users of strlcpy() in the kernel is kernfs_path_from_node_locked(), which passes back the problematic "length we _would_ have copied" return value to indicate truncation. Convert the chain of all callers to use the negative return value (some of which already doing this explicitly). All callers were already also checking for negative return values, so the risk to missed checks looks very low. In this analysis, it was found that cgroup1_release_agent() actually didn't handle the "too large" condition, so this is technically also a bug fix. :) Here's the chain of callers, and resolution identifying each one as now handling the correct return value: kernfs_path_from_node_locked() kernfs_path_from_node() pr_cont_kernfs_path() returns void kernfs_path() sysfs_warn_dup() return value ignored cgroup_path() blkg_path() bfq_bic_update_cgroup() return value ignored TRACE_IOCG_PATH() return value ignored TRACE_CGROUP_PATH() return value ignored perf_event_cgroup() return value ignored task_group_path() return value ignored damon_sysfs_memcg_path_eq() return value ignored get_mm_memcg_path() return value ignored lru_gen_seq_show() return value ignored cgroup_path_from_kernfs_id() return value ignored cgroup_show_path() already converted "too large" error to negative value cgroup_path_ns_locked() cgroup_path_ns() bpf_iter_cgroup_show_fdinfo() return value ignored cgroup1_release_agent() wasn't checking "too large" error proc_cgroup_show() already converted "too large" to negative value Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Waiman Long <longman@redhat.com> Cc: <cgroups@vger.kernel.org> Co-developed-by: Azeem Shaikh <azeemshaikh38@gmail.com> Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com> Link: https://lore.kernel.org/r/20231116192127.1558276-3-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20231212211741.164376-3-keescook@chromium.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Stable-dep-of: 1be59c97c83c ("cgroup/cpuset: Prevent UAF in proc_cpuset_show()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-03hfsplus: fix to avoid false alarm of circular lockingChao Yu
[ Upstream commit be4edd1642ee205ed7bbf66edc0453b1be1fb8d7 ] Syzbot report potential ABBA deadlock as below: loop0: detected capacity change from 0 to 1024 ====================================================== WARNING: possible circular locking dependency detected 6.9.0-syzkaller-10323-g8f6a15f095a6 #0 Not tainted ------------------------------------------------------ syz-executor171/5344 is trying to acquire lock: ffff88807cb980b0 (&tree->tree_lock){+.+.}-{3:3}, at: hfsplus_file_truncate+0x811/0xb50 fs/hfsplus/extents.c:595 but task is already holding lock: ffff88807a930108 (&HFSPLUS_I(inode)->extents_lock){+.+.}-{3:3}, at: hfsplus_file_truncate+0x2da/0xb50 fs/hfsplus/extents.c:576 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&HFSPLUS_I(inode)->extents_lock){+.+.}-{3:3}: lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754 __mutex_lock_common kernel/locking/mutex.c:608 [inline] __mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752 hfsplus_file_extend+0x21b/0x1b70 fs/hfsplus/extents.c:457 hfsplus_bmap_reserve+0x105/0x4e0 fs/hfsplus/btree.c:358 hfsplus_rename_cat+0x1d0/0x1050 fs/hfsplus/catalog.c:456 hfsplus_rename+0x12e/0x1c0 fs/hfsplus/dir.c:552 vfs_rename+0xbdb/0xf00 fs/namei.c:4887 do_renameat2+0xd94/0x13f0 fs/namei.c:5044 __do_sys_rename fs/namei.c:5091 [inline] __se_sys_rename fs/namei.c:5089 [inline] __x64_sys_rename+0x86/0xa0 fs/namei.c:5089 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f -> #0 (&tree->tree_lock){+.+.}-{3:3}: check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain+0x18cb/0x58e0 kernel/locking/lockdep.c:3869 __lock_acquire+0x1346/0x1fd0 kernel/locking/lockdep.c:5137 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754 __mutex_lock_common kernel/locking/mutex.c:608 [inline] __mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752 hfsplus_file_truncate+0x811/0xb50 fs/hfsplus/extents.c:595 hfsplus_setattr+0x1ce/0x280 fs/hfsplus/inode.c:265 notify_change+0xb9d/0xe70 fs/attr.c:497 do_truncate+0x220/0x310 fs/open.c:65 handle_truncate fs/namei.c:3308 [inline] do_open fs/namei.c:3654 [inline] path_openat+0x2a3d/0x3280 fs/namei.c:3807 do_filp_open+0x235/0x490 fs/namei.c:3834 do_sys_openat2+0x13e/0x1d0 fs/open.c:1406 do_sys_open fs/open.c:1421 [inline] __do_sys_creat fs/open.c:1497 [inline] __se_sys_creat fs/open.c:1491 [inline] __x64_sys_creat+0x123/0x170 fs/open.c:1491 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&HFSPLUS_I(inode)->extents_lock); lock(&tree->tree_lock); lock(&HFSPLUS_I(inode)->extents_lock); lock(&tree->tree_lock); This is a false alarm as tree_lock mutex are different, one is from sbi->cat_tree, and another is from sbi->ext_tree: Thread A Thread B - hfsplus_rename - hfsplus_rename_cat - hfs_find_init - mutext_lock(cat_tree->tree_lock) - hfsplus_setattr - hfsplus_file_truncate - mutex_lock(hip->extents_lock) - hfs_find_init - mutext_lock(ext_tree->tree_lock) - hfs_bmap_reserve - hfsplus_file_extend - mutex_lock(hip->extents_lock) So, let's call mutex_lock_nested for tree_lock mutex lock, and pass correct lock class for it. Fixes: 31651c607151 ("hfsplus: avoid deadlock on file truncation") Reported-by: syzbot+6030b3b1b9bf70e538c4@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-fsdevel/000000000000e37a4005ef129563@google.com Cc: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com> Signed-off-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20240607142304.455441-1-chao@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-27filelock: Fix fcntl/close race recovery compat pathJann Horn
commit f8138f2ad2f745b9a1c696a05b749eabe44337ea upstream. When I wrote commit 3cad1bc01041 ("filelock: Remove locks reliably when fcntl/close race is detected"), I missed that there are two copies of the code I was patching: The normal version, and the version for 64-bit offsets on 32-bit kernels. Thanks to Greg KH for stumbling over this while doing the stable backport... Apply exactly the same fix to the compat path for 32-bit kernels. Fixes: c293621bbf67 ("[PATCH] stale POSIX lock handling") Cc: stable@kernel.org Link: https://bugs.chromium.org/p/project-zero/issues/detail?id=2563 Signed-off-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/r/20240723-fs-lock-recover-compatfix-v1-1-148096719529@google.com Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-27fs/ntfs3: Validate ff offsetlei lu
commit 50c47879650b4c97836a0086632b3a2e300b0f06 upstream. This adds sanity checks for ff offset. There is a check on rt->first_free at first, but walking through by ff without any check. If the second ff is a large offset. We may encounter an out-of-bound read. Signed-off-by: lei lu <llfamsec@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-27fs/ntfs3: Add a check for attr_names and oatblKonstantin Komarov
commit 702d4930eb06dcfda85a2fa67e8a1a27bfa2a845 upstream. Added out-of-bound checking for *ane (ATTR_NAME_ENTRY). Reported-by: lei lu <llfamsec@gmail.com> Fixes: 865e7a7700d93 ("fs/ntfs3: Reduce stack usage") Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-27jfs: don't walk off the end of ealistlei lu
commit d0fa70aca54c8643248e89061da23752506ec0d4 upstream. Add a check before visiting the members of ea to make sure each ea stays within the ealist. Signed-off-by: lei lu <llfamsec@gmail.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-27ocfs2: add bounds checking to ocfs2_check_dir_entry()lei lu
commit 255547c6bb8940a97eea94ef9d464ea5967763fb upstream. This adds sanity checks for ocfs2_dir_entry to make sure all members of ocfs2_dir_entry don't stray beyond valid memory region. Link: https://lkml.kernel.org/r/20240626104433.163270-1-llfamsec@gmail.com Signed-off-by: lei lu <llfamsec@gmail.com> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-25cachefiles: fix slab-use-after-free in cachefiles_withdraw_cookie()Baokun Li
[ Upstream commit 5d8f805789072ea7fd39504694b7bd17e5f751c4 ] We got the following issue in our fault injection stress test: ================================================================== BUG: KASAN: slab-use-after-free in cachefiles_withdraw_cookie+0x4d9/0x600 Read of size 8 at addr ffff888118efc000 by task kworker/u78:0/109 CPU: 13 PID: 109 Comm: kworker/u78:0 Not tainted 6.8.0-dirty #566 Call Trace: <TASK> kasan_report+0x93/0xc0 cachefiles_withdraw_cookie+0x4d9/0x600 fscache_cookie_state_machine+0x5c8/0x1230 fscache_cookie_worker+0x91/0x1c0 process_one_work+0x7fa/0x1800 [...] Allocated by task 117: kmalloc_trace+0x1b3/0x3c0 cachefiles_acquire_volume+0xf3/0x9c0 fscache_create_volume_work+0x97/0x150 process_one_work+0x7fa/0x1800 [...] Freed by task 120301: kfree+0xf1/0x2c0 cachefiles_withdraw_cache+0x3fa/0x920 cachefiles_put_unbind_pincount+0x1f6/0x250 cachefiles_daemon_release+0x13b/0x290 __fput+0x204/0xa00 task_work_run+0x139/0x230 do_exit+0x87a/0x29b0 [...] ================================================================== Following is the process that triggers the issue: p1 | p2 ------------------------------------------------------------ fscache_begin_lookup fscache_begin_volume_access fscache_cache_is_live(fscache_cache) cachefiles_daemon_release cachefiles_put_unbind_pincount cachefiles_daemon_unbind cachefiles_withdraw_cache fscache_withdraw_cache fscache_set_cache_state(cache, FSCACHE_CACHE_IS_WITHDRAWN); cachefiles_withdraw_objects(cache) fscache_wait_for_objects(fscache) atomic_read(&fscache_cache->object_count) == 0 fscache_perform_lookup cachefiles_lookup_cookie cachefiles_alloc_object refcount_set(&object->ref, 1); object->volume = volume fscache_count_object(vcookie->cache); atomic_inc(&fscache_cache->object_count) cachefiles_withdraw_volumes cachefiles_withdraw_volume fscache_withdraw_volume __cachefiles_free_volume kfree(cachefiles_volume) fscache_cookie_state_machine cachefiles_withdraw_cookie cache = object->volume->cache; // cachefiles_volume UAF !!! After setting FSCACHE_CACHE_IS_WITHDRAWN, wait for all the cookie lookups to complete first, and then wait for fscache_cache->object_count == 0 to avoid the cookie exiting after the volume has been freed and triggering the above issue. Therefore call fscache_withdraw_volume() before calling cachefiles_withdraw_objects(). This way, after setting FSCACHE_CACHE_IS_WITHDRAWN, only the following two cases will occur: 1) fscache_begin_lookup fails in fscache_begin_volume_access(). 2) fscache_withdraw_volume() will ensure that fscache_count_object() has been executed before calling fscache_wait_for_objects(). Fixes: fe2140e2f57f ("cachefiles: Implement volume support") Suggested-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240628062930.2467993-4-libaokun@huaweicloud.com Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-25cachefiles: fix slab-use-after-free in fscache_withdraw_volume()Baokun Li
[ Upstream commit 522018a0de6b6fcce60c04f86dfc5f0e4b6a1b36 ] We got the following issue in our fault injection stress test: ================================================================== BUG: KASAN: slab-use-after-free in fscache_withdraw_volume+0x2e1/0x370 Read of size 4 at addr ffff88810680be08 by task ondemand-04-dae/5798 CPU: 0 PID: 5798 Comm: ondemand-04-dae Not tainted 6.8.0-dirty #565 Call Trace: kasan_check_range+0xf6/0x1b0 fscache_withdraw_volume+0x2e1/0x370 cachefiles_withdraw_volume+0x31/0x50 cachefiles_withdraw_cache+0x3ad/0x900 cachefiles_put_unbind_pincount+0x1f6/0x250 cachefiles_daemon_release+0x13b/0x290 __fput+0x204/0xa00 task_work_run+0x139/0x230 Allocated by task 5820: __kmalloc+0x1df/0x4b0 fscache_alloc_volume+0x70/0x600 __fscache_acquire_volume+0x1c/0x610 erofs_fscache_register_volume+0x96/0x1a0 erofs_fscache_register_fs+0x49a/0x690 erofs_fc_fill_super+0x6c0/0xcc0 vfs_get_super+0xa9/0x140 vfs_get_tree+0x8e/0x300 do_new_mount+0x28c/0x580 [...] Freed by task 5820: kfree+0xf1/0x2c0 fscache_put_volume.part.0+0x5cb/0x9e0 erofs_fscache_unregister_fs+0x157/0x1b0 erofs_kill_sb+0xd9/0x1c0 deactivate_locked_super+0xa3/0x100 vfs_get_super+0x105/0x140 vfs_get_tree+0x8e/0x300 do_new_mount+0x28c/0x580 [...] ================================================================== Following is the process that triggers the issue: mount failed | daemon exit ------------------------------------------------------------ deactivate_locked_super cachefiles_daemon_release erofs_kill_sb erofs_fscache_unregister_fs fscache_relinquish_volume __fscache_relinquish_volume fscache_put_volume(fscache_volume, fscache_volume_put_relinquish) zero = __refcount_dec_and_test(&fscache_volume->ref, &ref); cachefiles_put_unbind_pincount cachefiles_daemon_unbind cachefiles_withdraw_cache cachefiles_withdraw_volumes list_del_init(&volume->cache_link) fscache_free_volume(fscache_volume) cache->ops->free_volume cachefiles_free_volume list_del_init(&cachefiles_volume->cache_link); kfree(fscache_volume) cachefiles_withdraw_volume fscache_withdraw_volume fscache_volume->n_accesses // fscache_volume UAF !!! The fscache_volume in cache->volumes must not have been freed yet, but its reference count may be 0. So use the new fscache_try_get_volume() helper function try to get its reference count. If the reference count of fscache_volume is 0, fscache_put_volume() is freeing it, so wait for it to be removed from cache->volumes. If its reference count is not 0, call cachefiles_withdraw_volume() with reference count protection to avoid the above issue. Fixes: fe2140e2f57f ("cachefiles: Implement volume support") Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240628062930.2467993-3-libaokun@huaweicloud.com Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-25netfs, fscache: export fscache_put_volume() and add fscache_try_get_volume()Baokun Li
[ Upstream commit 85b08b31a22b481ec6528130daf94eee4452e23f ] Export fscache_put_volume() and add fscache_try_get_volume() helper function to allow cachefiles to get/put fscache_volume via linux/fscache-cache.h. Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240628062930.2467993-2-libaokun@huaweicloud.com Signed-off-by: Christian Brauner <brauner@kernel.org> Stable-dep-of: 522018a0de6b ("cachefiles: fix slab-use-after-free in fscache_withdraw_volume()") Stable-dep-of: 5d8f80578907 ("cachefiles: fix slab-use-after-free in cachefiles_withdraw_cookie()") Signed-off-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-25mm: page_ref: remove folio_try_get_rcu()Yang Shi
commit fa2690af573dfefb47ba6eef888797a64b6b5f3c upstream. The below bug was reported on a non-SMP kernel: [ 275.267158][ T4335] ------------[ cut here ]------------ [ 275.267949][ T4335] kernel BUG at include/linux/page_ref.h:275! [ 275.268526][ T4335] invalid opcode: 0000 [#1] KASAN PTI [ 275.269001][ T4335] CPU: 0 PID: 4335 Comm: trinity-c3 Not tainted 6.7.0-rc4-00061-gefa7df3e3bb5 #1 [ 275.269787][ T4335] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 275.270679][ T4335] RIP: 0010:try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3)) [ 275.272813][ T4335] RSP: 0018:ffffc90005dcf650 EFLAGS: 00010202 [ 275.273346][ T4335] RAX: 0000000000000246 RBX: ffffea00066e0000 RCX: 0000000000000000 [ 275.274032][ T4335] RDX: fffff94000cdc007 RSI: 0000000000000004 RDI: ffffea00066e0034 [ 275.274719][ T4335] RBP: ffffea00066e0000 R08: 0000000000000000 R09: fffff94000cdc006 [ 275.275404][ T4335] R10: ffffea00066e0037 R11: 0000000000000000 R12: 0000000000000136 [ 275.276106][ T4335] R13: ffffea00066e0034 R14: dffffc0000000000 R15: ffffea00066e0008 [ 275.276790][ T4335] FS: 00007fa2f9b61740(0000) GS:ffffffff89d0d000(0000) knlGS:0000000000000000 [ 275.277570][ T4335] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 275.278143][ T4335] CR2: 00007fa2f6c00000 CR3: 0000000134b04000 CR4: 00000000000406f0 [ 275.278833][ T4335] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 275.279521][ T4335] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 275.280201][ T4335] Call Trace: [ 275.280499][ T4335] <TASK> [ 275.280751][ T4335] ? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447) [ 275.281087][ T4335] ? do_trap (arch/x86/kernel/traps.c:112 arch/x86/kernel/traps.c:153) [ 275.281463][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3)) [ 275.281884][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3)) [ 275.282300][ T4335] ? do_error_trap (arch/x86/kernel/traps.c:174) [ 275.282711][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3)) [ 275.283129][ T4335] ? handle_invalid_op (arch/x86/kernel/traps.c:212) [ 275.283561][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3)) [ 275.283990][ T4335] ? exc_invalid_op (arch/x86/kernel/traps.c:264) [ 275.284415][ T4335] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568) [ 275.284859][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3)) [ 275.285278][ T4335] try_grab_folio (mm/gup.c:148) [ 275.285684][ T4335] __get_user_pages (mm/gup.c:1297 (discriminator 1)) [ 275.286111][ T4335] ? __pfx___get_user_pages (mm/gup.c:1188) [ 275.286579][ T4335] ? __pfx_validate_chain (kernel/locking/lockdep.c:3825) [ 275.287034][ T4335] ? mark_lock (kernel/locking/lockdep.c:4656 (discriminator 1)) [ 275.287416][ T4335] __gup_longterm_locked (mm/gup.c:1509 mm/gup.c:2209) [ 275.288192][ T4335] ? __pfx___gup_longterm_locked (mm/gup.c:2204) [ 275.288697][ T4335] ? __pfx_lock_acquire (kernel/locking/lockdep.c:5722) [ 275.289135][ T4335] ? __pfx___might_resched (kernel/sched/core.c:10106) [ 275.289595][ T4335] pin_user_pages_remote (mm/gup.c:3350) [ 275.290041][ T4335] ? __pfx_pin_user_pages_remote (mm/gup.c:3350) [ 275.290545][ T4335] ? find_held_lock (kernel/locking/lockdep.c:5244 (discriminator 1)) [ 275.290961][ T4335] ? mm_access (kernel/fork.c:1573) [ 275.291353][ T4335] process_vm_rw_single_vec+0x142/0x360 [ 275.291900][ T4335] ? __pfx_process_vm_rw_single_vec+0x10/0x10 [ 275.292471][ T4335] ? mm_access (kernel/fork.c:1573) [ 275.292859][ T4335] process_vm_rw_core+0x272/0x4e0 [ 275.293384][ T4335] ? hlock_class (arch/x86/include/asm/bitops.h:227 arch/x86/include/asm/bitops.h:239 include/asm-generic/bitops/instrumented-non-atomic.h:142 kernel/locking/lockdep.c:228) [ 275.293780][ T4335] ? __pfx_process_vm_rw_core+0x10/0x10 [ 275.294350][ T4335] process_vm_rw (mm/process_vm_access.c:284) [ 275.294748][ T4335] ? __pfx_process_vm_rw (mm/process_vm_access.c:259) [ 275.295197][ T4335] ? __task_pid_nr_ns (include/linux/rcupdate.h:306 (discriminator 1) include/linux/rcupdate.h:780 (discriminator 1) kernel/pid.c:504 (discriminator 1)) [ 275.295634][ T4335] __x64_sys_process_vm_readv (mm/process_vm_access.c:291) [ 275.296139][ T4335] ? syscall_enter_from_user_mode (kernel/entry/common.c:94 kernel/entry/common.c:112) [ 275.296642][ T4335] do_syscall_64 (arch/x86/entry/common.c:51 (discriminator 1) arch/x86/entry/common.c:82 (discriminator 1)) [ 275.297032][ T4335] ? __task_pid_nr_ns (include/linux/rcupdate.h:306 (discriminator 1) include/linux/rcupdate.h:780 (discriminator 1) kernel/pid.c:504 (discriminator 1)) [ 275.297470][ T4335] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4300 kernel/locking/lockdep.c:4359) [ 275.297988][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97) [ 275.298389][ T4335] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4300 kernel/locking/lockdep.c:4359) [ 275.298906][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97) [ 275.299304][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97) [ 275.299703][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97) [ 275.300115][ T4335] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129) This BUG is the VM_BUG_ON(!in_atomic() && !irqs_disabled()) assertion in folio_ref_try_add_rcu() for non-SMP kernel. The process_vm_readv() calls GUP to pin the THP. An optimization for pinning THP instroduced by commit 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"") calls try_grab_folio() to pin the THP, but try_grab_folio() is supposed to be called in atomic context for non-SMP kernel, for example, irq disabled or preemption disabled, due to the optimization introduced by commit e286781d5f2e ("mm: speculative page references"). The commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries") is not actually the root cause although it was bisected to. It just makes the problem exposed more likely. The follow up discussion suggested the optimization for non-SMP kernel may be out-dated and not worth it anymore [1]. So removing the optimization to silence the BUG. However calling try_grab_folio() in GUP slow path actually is unnecessary, so the following patch will clean this up. [1] https://lore.kernel.org/linux-mm/821cf1d6-92b9-4ac4-bacc-d8f2364ac14f@paulmck-laptop/ Link: https://lkml.kernel.org/r/20240625205350.1777481-1-yang@os.amperecomputing.com Fixes: 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"") Signed-off-by: Yang Shi <yang@os.amperecomputing.com> Reported-by: kernel test robot <oliver.sang@intel.com> Tested-by: Oliver Sang <oliver.sang@intel.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Vivek Kasireddy <vivek.kasireddy@intel.com> Cc: <stable@vger.kernel.org> [6.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-25cifs: fix noisy message on copy_file_rangeSteve French
commit ae4ccca47195332c69176b8615c5ee17efd30c46 upstream. There are common cases where copy_file_range can noisily log "source and target of copy not on same server" e.g. the mv command across mounts to two different server's shares. Change this to informational rather than logging as an error. A followon patch will add dynamic trace points e.g. for cifs_file_copychunk_range Cc: stable@vger.kernel.org Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-25hfsplus: fix uninit-value in copy_nameEdward Adam Davis
[ Upstream commit 0570730c16307a72f8241df12363f76600baf57d ] [syzbot reported] BUG: KMSAN: uninit-value in sized_strscpy+0xc4/0x160 sized_strscpy+0xc4/0x160 copy_name+0x2af/0x320 fs/hfsplus/xattr.c:411 hfsplus_listxattr+0x11e9/0x1a50 fs/hfsplus/xattr.c:750 vfs_listxattr fs/xattr.c:493 [inline] listxattr+0x1f3/0x6b0 fs/xattr.c:840 path_listxattr fs/xattr.c:864 [inline] __do_sys_listxattr fs/xattr.c:876 [inline] __se_sys_listxattr fs/xattr.c:873 [inline] __x64_sys_listxattr+0x16b/0x2f0 fs/xattr.c:873 x64_sys_call+0x2ba0/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:195 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Uninit was created at: slab_post_alloc_hook mm/slub.c:3877 [inline] slab_alloc_node mm/slub.c:3918 [inline] kmalloc_trace+0x57b/0xbe0 mm/slub.c:4065 kmalloc include/linux/slab.h:628 [inline] hfsplus_listxattr+0x4cc/0x1a50 fs/hfsplus/xattr.c:699 vfs_listxattr fs/xattr.c:493 [inline] listxattr+0x1f3/0x6b0 fs/xattr.c:840 path_listxattr fs/xattr.c:864 [inline] __do_sys_listxattr fs/xattr.c:876 [inline] __se_sys_listxattr fs/xattr.c:873 [inline] __x64_sys_listxattr+0x16b/0x2f0 fs/xattr.c:873 x64_sys_call+0x2ba0/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:195 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f [Fix] When allocating memory to strbuf, initialize memory to 0. Reported-and-tested-by: syzbot+efde959319469ff8d4d7@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> Link: https://lore.kernel.org/r/tencent_8BBB6433BC9E1C1B7B4BDF1BF52574BA8808@qq.com Reported-and-tested-by: syzbot+01ade747b16e9c8030e0@syzkaller.appspotmail.com Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-25ksmbd: return FILE_DEVICE_DISK instead of super magicNamjae Jeon
[ Upstream commit 25a6e135569b3901452e4863c94560df7c11c492 ] MS-SMB2 specification describes setting ->DeviceType to FILE_DEVICE_DISK or FILE_DEVICE_CD_ROM. Set FILE_DEVICE_DISK instead of super magic in FS_DEVICE_INFORMATION. And Set FILE_READ_ONLY_DEVICE for read-only share. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-25fs: better handle deep ancestor chains in is_subdir()Christian Brauner
[ Upstream commit 391b59b045004d5b985d033263ccba3e941a7740 ] Jan reported that 'cd ..' may take a long time in deep directory hierarchies under a bind-mount. If concurrent renames happen it is possible to livelock in is_subdir() because it will keep retrying. Change is_subdir() from simply retrying over and over to retry once and then acquire the rename lock to handle deep ancestor chains better. The list of alternatives to this approach were less then pleasant. Change the scope of rcu lock to cover the whole walk while at it. A big thanks to Jan and Linus. Both Jan and Linus had proposed effectively the same thing just that one version ended up being slightly more elegant. Reported-by: Jan Kara <jack@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-25erofs: ensure m_llen is reset to 0 if metadata is invalidGao Xiang
[ Upstream commit 9b32b063be1001e322c5f6e01f2a649636947851 ] Sometimes, the on-disk metadata might be invalid due to user interrupts, storage failures, or other unknown causes. In that case, z_erofs_map_blocks_iter() may still return a valid m_llen while other fields remain invalid (e.g., m_plen can be 0). Due to the return value of z_erofs_scan_folio() in some path will be ignored on purpose, the following z_erofs_scan_folio() could then use the invalid value by accident. Let's reset m_llen to 0 to prevent this. Link: https://lore.kernel.org/r/20240629185743.2819229-1-hsiangkao@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-25btrfs: qgroup: fix quota root leak after quota disable failureFilipe Manana
[ Upstream commit a7e4c6a3031c74078dba7fa36239d0f4fe476c53 ] If during the quota disable we fail when cleaning the quota tree or when deleting the root from the root tree, we jump to the 'out' label without ever dropping the reference on the quota root, resulting in a leak of the root since fs_info->quota_root is no longer pointing to the root (we have set it to NULL just before those steps). Fix this by always doing a btrfs_put_root() call under the 'out' label. This is a problem that exists since qgroups were first added in 2012 by commit bed92eae26cc ("Btrfs: qgroup implementation and prototypes"), but back then we missed a kfree on the quota root and free_extent_buffer() calls on its root and commit root nodes, since back then roots were not yet reference counted. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-25iomap: Fix iomap_adjust_read_range for plen calculationRitesh Harjani (IBM)
[ Upstream commit f5ceb1bbc98c69536d4673a97315e8427e67de1b ] If the extent spans the block that contains i_size, we need to handle both halves separately so that we properly zero data in the page cache for blocks that are entirely outside of i_size. But this is needed only when i_size is within the current folio under processing. "orig_pos + length > isize" can be true for all folios if the mapped extent length is greater than the folio size. That is making plen to break for every folio instead of only the last folio. So use orig_plen for checking if "orig_pos + orig_plen > isize". Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/a32e5f9a4fcfdb99077300c4020ed7ae61d6e0f9.1715067055.git.ritesh.list@gmail.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> cc: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-25fs/file: fix the check in find_next_fd()Yuntao Wang
[ Upstream commit ed8c7fbdfe117abbef81f65428ba263118ef298a ] The maximum possible return value of find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) is maxbit. This return value, multiplied by BITS_PER_LONG, gives the value of bitbit, which can never be greater than maxfd, it can only be equal to maxfd at most, so the following check 'if (bitbit > maxfd)' will never be true. Moreover, when bitbit equals maxfd, it indicates that there are no unused fds, and the function can directly return. Fix this check. Signed-off-by: Yuntao Wang <yuntao.wang@linux.dev> Link: https://lore.kernel.org/r/20240529160656.209352-1-yuntao.wang@linux.dev Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-25cachefiles: make on-demand read killableBaokun Li
[ Upstream commit bc9dde6155464e906e630a0a5c17a4cab241ffbb ] Replacing wait_for_completion() with wait_for_completion_killable() in cachefiles_ondemand_send_req() allows us to kill processes that might trigger a hunk_task if the daemon is abnormal. But now only CACHEFILES_OP_READ is killable, because OP_CLOSE and OP_OPEN is initiated from kworker context and the signal is prohibited in these kworker. Note that when the req in xas changes, i.e. xas_load(&xas) != req, it means that a process will complete the current request soon, so wait again for the request to be completed. In addition, add the cachefiles_ondemand_finish_req() helper function to simplify the code. Suggested-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240522114308.2402121-13-libaokun@huaweicloud.com Acked-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-25cachefiles: Set object to close if ondemand_id < 0 in copenZizhi Wo
[ Upstream commit 4f8703fb3482f92edcfd31661857b16fec89c2c0 ] If copen is maliciously called in the user mode, it may delete the request corresponding to the random id. And the request may have not been read yet. Note that when the object is set to reopen, the open request will be done with the still reopen state in above case. As a result, the request corresponding to this object is always skipped in select_req function, so the read request is never completed and blocks other process. Fix this issue by simply set object to close if its id < 0 in copen. Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240522114308.2402121-11-libaokun@huaweicloud.com Acked-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-25cachefiles: add consistency check for copen/creadBaokun Li
[ Upstream commit a26dc49df37e996876f50a0210039b2d211fdd6f ] This prevents malicious processes from completing random copen/cread requests and crashing the system. Added checks are listed below: * Generic, copen can only complete open requests, and cread can only complete read requests. * For copen, ondemand_id must not be 0, because this indicates that the request has not been read by the daemon. * For cread, the object corresponding to fd and req should be the same. Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240522114308.2402121-7-libaokun@huaweicloud.com Acked-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-25nfs: don't invalidate dentries on transient errorsScott Mayhew
[ Upstream commit 0c8c7c559740d2d8b66048162af6c4dba8f0c88c ] This is a slight variation on a patch previously proposed by Neil Brown that never got merged. Prior to commit 5ceb9d7fdaaf ("NFS: Refactor nfs_lookup_revalidate()"), any error from nfs_lookup_verify_inode() other than -ESTALE would result in nfs_lookup_revalidate() returning that error (-ESTALE is mapped to zero). Since that commit, all errors result in nfs_lookup_revalidate() returning zero, resulting in dentries being invalidated where they previously were not (particularly in the case of -ERESTARTSYS). Fix it by passing the actual error code to nfs_lookup_revalidate_done(), and leaving the decision on whether to map the error code to zero or one to nfs_lookup_revalidate_done(). A simple reproducer is to run the following python code in a subdirectory of an NFS mount (not in the root of the NFS mount): ---8<--- import os import multiprocessing import time if __name__=="__main__": multiprocessing.set_start_method("spawn") count = 0 while True: try: os.getcwd() pool = multiprocessing.Pool(10) pool.close() pool.terminate() count += 1 except Exception as e: print(f"Failed after {count} iterations") print(e) break ---8<--- Prior to commit 5ceb9d7fdaaf, the above code would run indefinitely. After commit 5ceb9d7fdaaf, it fails almost immediately with -ENOENT. Signed-off-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-25nfs: Avoid flushing many pages with NFS_FILE_SYNCJan Kara
[ Upstream commit a527c3ba41c4c61e2069bfce4091e5515f06a8dd ] When we are doing WB_SYNC_ALL writeback, nfs submits write requests with NFS_FILE_SYNC flag to the server (which then generally treats it as an O_SYNC write). This helps to reduce latency for single requests but when submitting more requests, additional fsyncs on the server side hurt latency. NFS generally avoids this additional overhead by not setting NFS_FILE_SYNC if desc->pg_moreio is set. However this logic doesn't always work. When we do random 4k writes to a huge file and then call fsync(2), each page writeback is going to be sent with NFS_FILE_SYNC because after preparing one page for writeback, we start writing back next, nfs_do_writepage() will call nfs_pageio_cond_complete() which finds the page is not contiguous with previously prepared IO and submits is *without* setting desc->pg_moreio. Hence NFS_FILE_SYNC is used resulting in poor performance. Fix the problem by setting desc->pg_moreio in nfs_pageio_cond_complete() before submitting outstanding IO. This improves throughput of fsync-after-random-writes on my test SSD from ~70MB/s to ~250MB/s. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-25nfs: propagate readlink errors in nfs_symlink_fillerSagi Grimberg
[ Upstream commit 134d0b3f2440cdddd12fc3444c9c0f62331ce6fc ] There is an inherent race where a symlink file may have been overriden (by a different client) between lookup and readlink, resulting in a spurious EIO error returned to userspace. Fix this by propagating back ESTALE errors such that the vfs will retry the lookup/get_link (similar to nfs4_file_open) at least once. Cc: Dan Aloni <dan.aloni@vastdata.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-25NFSv4: Fix memory leak in nfs4_set_security_labelDmitry Mastykin
[ Upstream commit aad11473f8f4be3df86461081ce35ec5b145ba68 ] We leak nfs_fattr and nfs4_label every time we set a security xattr. Signed-off-by: Dmitry Mastykin <mastichi@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-25filelock: Remove locks reliably when fcntl/close race is detectedJann Horn
commit 3cad1bc010416c6dd780643476bc59ed742436b9 upstream. When fcntl_setlk() races with close(), it removes the created lock with do_lock_file_wait(). However, LSMs can allow the first do_lock_file_wait() that created the lock while denying the second do_lock_file_wait() that tries to remove the lock. In theory (but AFAIK not in practice), posix_lock_file() could also fail to remove a lock due to GFP_KERNEL allocation failure (when splitting a range in the middle). After the bug has been triggered, use-after-free reads will occur in lock_get_status() when userspace reads /proc/locks. This can likely be used to read arbitrary kernel memory, but can't corrupt kernel memory. This only affects systems with SELinux / Smack / AppArmor / BPF-LSM in enforcing mode and only works from some security contexts. Fix it by calling locks_remove_posix() instead, which is designed to reliably get rid of POSIX locks associated with the given file and files_struct and is also used by filp_flush(). Fixes: c293621bbf67 ("[PATCH] stale POSIX lock handling") Cc: stable@kernel.org Link: https://bugs.chromium.org/p/project-zero/issues/detail?id=2563 Signed-off-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/r/20240702-fs-lock-recover-2-v1-1-edd456f63789@google.com Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> [stable fixup: ->c.flc_type was ->fl_type in older kernels] Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-18btrfs: tree-checker: add type and sequence check for inline backrefsQu Wenruo
commit 1645c283a87c61f84b2bffd81f50724df959b11a upstream. [BUG] There is a bug report that ntfs2btrfs had a bug that it can lead to transaction abort and the filesystem flips to read-only. [CAUSE] For inline backref items, kernel has a strict requirement for their ordered, they must follow the following rules: - All btrfs_extent_inline_ref::type should be in an ascending order - Within the same type, the items should follow a descending order by their sequence number For EXTENT_DATA_REF type, the sequence number is result from hash_extent_data_ref(). For other types, their sequence numbers are btrfs_extent_inline_ref::offset. Thus if there is any code not following above rules, the resulted inline backrefs can prevent the kernel to locate the needed inline backref and lead to transaction abort. [FIX] Ntrfs2btrfs has already fixed the problem, and btrfs-progs has added the ability to detect such problems. For kernel, let's be more noisy and be more specific about the order, so that the next time kernel hits such problem we would reject it in the first place, without leading to transaction abort. Link: https://github.com/kdave/btrfs-progs/pull/622 Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> [ Fix a conflict due to header cleanup. ] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-18ext4: avoid ptr null pointer dereferenceBaokun Li
When commit 13df4d44a3aa ("ext4: fix slab-out-of-bounds in ext4_mb_find_good_group_avg_frag_lists()") was backported to stable, the commit f536808adcc3 ("ext4: refactor out ext4_generic_attr_store()") that uniformly determines if the ptr is null is not merged in, so it needs to be judged whether ptr is null or not in each case of the switch, otherwise null pointer dereferencing may occur. Fixes: 677ff4589f15 ("ext4: fix slab-out-of-bounds in ext4_mb_find_good_group_avg_frag_lists()") Signed-off-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-18nilfs2: fix kernel bug on rename operation of broken directoryRyusuke Konishi
commit a9e1ddc09ca55746079cc479aa3eb6411f0d99d4 upstream. Syzbot reported that in rename directory operation on broken directory on nilfs2, __block_write_begin_int() called to prepare block write may fail BUG_ON check for access exceeding the folio/page size. This is because nilfs_dotdot(), which gets parent directory reference entry ("..") of the directory to be moved or renamed, does not check consistency enough, and may return location exceeding folio/page size for broken directories. Fix this issue by checking required directory entries ("." and "..") in the first chunk of the directory in nilfs_dotdot(). Link: https://lkml.kernel.org/r/20240628165107.9006-1-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+d3abed1ad3d367fa2627@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d3abed1ad3d367fa2627 Fixes: 2ba466d74ed7 ("nilfs2: directory entry operations") Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-18Fix userfaultfd_api to return EINVAL as expectedAudra Mitchell
commit 1723f04caacb32cadc4e063725d836a0c4450694 upstream. Currently if we request a feature that is not set in the Kernel config we fail silently and return all the available features. However, the man page indicates we should return an EINVAL. We need to fix this issue since we can end up with a Kernel warning should a program request the feature UFFD_FEATURE_WP_UNPOPULATED on a kernel with the config not set with this feature. [ 200.812896] WARNING: CPU: 91 PID: 13634 at mm/memory.c:1660 zap_pte_range+0x43d/0x660 [ 200.820738] Modules linked in: [ 200.869387] CPU: 91 PID: 13634 Comm: userfaultfd Kdump: loaded Not tainted 6.9.0-rc5+ #8 [ 200.877477] Hardware name: Dell Inc. PowerEdge R6525/0N7YGH, BIOS 2.7.3 03/30/2022 [ 200.885052] RIP: 0010:zap_pte_range+0x43d/0x660 Link: https://lkml.kernel.org/r/20240626130513.120193-1-audra@redhat.com Fixes: e06f1e1dd499 ("userfaultfd: wp: enabled write protection in userfaultfd API") Signed-off-by: Audra Mitchell <audra@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Shaohua Li <shli@fb.com> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-18ksmbd: discard write access to the directory openHobin Woo
commit e2e33caa5dc2eae7bddf88b22ce11ec3d760e5cd upstream. may_open() does not allow a directory to be opened with the write access. However, some writing flags set by client result in adding write access on server, making ksmbd incompatible with FUSE file system. Simply, let's discard the write access when opening a directory. list_add corruption. next is NULL. ------------[ cut here ]------------ kernel BUG at lib/list_debug.c:26! pc : __list_add_valid+0x88/0xbc lr : __list_add_valid+0x88/0xbc Call trace: __list_add_valid+0x88/0xbc fuse_finish_open+0x11c/0x170 fuse_open_common+0x284/0x5e8 fuse_dir_open+0x14/0x24 do_dentry_open+0x2a4/0x4e0 dentry_open+0x50/0x80 smb2_open+0xbe4/0x15a4 handle_ksmbd_work+0x478/0x5ec process_one_work+0x1b4/0x448 worker_thread+0x25c/0x430 kthread+0x104/0x1d4 ret_from_fork+0x10/0x20 Cc: stable@vger.kernel.org Signed-off-by: Yoonho Shin <yoonho.shin@samsung.com> Signed-off-by: Hobin Woo <hobin.woo@samsung.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-18cifs: fix setting SecurityFlags to trueSteve French
commit d2346e2836318a227057ed41061114cbebee5d2a upstream. If you try to set /proc/fs/cifs/SecurityFlags to 1 it will set them to CIFSSEC_MUST_NTLMV2 which no longer is relevant (the less secure ones like lanman have been removed from cifs.ko) and is also missing some flags (like for signing and encryption) and can even cause mount to fail, so change this to set it to Kerberos in this case. Also change the description of the SecurityFlags to remove mention of flags which are no longer supported. Cc: stable@vger.kernel.org Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-18vfs: don't mod negative dentry count when on shrinker listBrian Foster
[ Upstream commit aabfe57ebaa75841db47ea59091ec3c5a06d2f52 ] The nr_dentry_negative counter is intended to only account negative dentries that are present on the superblock LRU. Therefore, the LRU add, remove and isolate helpers modify the counter based on whether the dentry is negative, but the shrinker list related helpers do not modify the counter, and the paths that change a dentry between positive and negative only do so if DCACHE_LRU_LIST is set. The problem with this is that a dentry on a shrinker list still has DCACHE_LRU_LIST set to indicate ->d_lru is in use. The additional DCACHE_SHRINK_LIST flag denotes whether the dentry is on LRU or a shrink related list. Therefore if a relevant operation (i.e. unlink) occurs while a dentry is present on a shrinker list, and the associated codepath only checks for DCACHE_LRU_LIST, then it is technically possible to modify the negative dentry count for a dentry that is off the LRU. Since the shrinker list related helpers do not modify the negative dentry count (because non-LRU dentries should not be included in the count) when the dentry is ultimately removed from the shrinker list, this can cause the negative dentry count to become permanently inaccurate. This problem can be reproduced via a heavy file create/unlink vs. drop_caches workload. On an 80xcpu system, I start 80 tasks each running a 1k file create/delete loop, and one task spinning on drop_caches. After 10 minutes or so of runtime, the idle/clean cache negative dentry count increases from somewhere in the range of 5-10 entries to several hundred (and increasingly grows beyond nr_dentry_unused). Tweak the logic in the paths that turn a dentry negative or positive to filter out the case where the dentry is present on a shrink related list. This allows the above workload to maintain an accurate negative dentry count. Fixes: af0c9af1b3f6 ("fs/dcache: Track & report number of negative dentries") Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/r/20240703121301.247680-1-bfoster@redhat.com Acked-by: Ian Kent <ikent@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-18fs/dcache: Re-use value stored to dentry->d_flags instead of re-readinglinke li
[ Upstream commit 8bfb40be31ddea0cb4664b352e1797cfe6c91976 ] Currently, the __d_clear_type_and_inode() writes the value flags to dentry->d_flags, then immediately re-reads it in order to use it in a if statement. This re-read is useless because no other update to dentry->d_flags can occur at this point. This commit therefore re-use flags in the if statement instead of re-reading dentry->d_flags. Signed-off-by: linke li <lilinke99@qq.com> Link: https://lore.kernel.org/r/tencent_5E187BD0A61BA28605E85405F15228254D0A@qq.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org> Stable-dep-of: aabfe57ebaa7 ("vfs: don't mod negative dentry count when on shrinker list") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-18filelock: fix potential use-after-free in posix_lock_inodeJeff Layton
[ Upstream commit 1b3ec4f7c03d4b07bad70697d7e2f4088d2cfe92 ] Light Hsieh reported a KASAN UAF warning in trace_posix_lock_inode(). The request pointer had been changed earlier to point to a lock entry that was added to the inode's list. However, before the tracepoint could fire, another task raced in and freed that lock. Fix this by moving the tracepoint inside the spinlock, which should ensure that this doesn't happen. Fixes: 74f6f5912693 ("locks: fix KASAN: use-after-free in trace_event_raw_event_filelock_lock") Link: https://lore.kernel.org/linux-fsdevel/724ffb0a2962e912ea62bb0515deadf39c325112.camel@kernel.org/ Reported-by: Light Hsieh (謝明燈) <Light.Hsieh@mediatek.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20240702-filelock-6-10-v1-1-96e766aadc98@kernel.org Reviewed-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-18cachefiles: add missing lock protection when pollingJingbo Xu
[ Upstream commit cf5bb09e742a9cf6349127e868329a8f69b7a014 ] Add missing lock protection in poll routine when iterating xarray, otherwise: Even with RCU read lock held, only the slot of the radix tree is ensured to be pinned there, while the data structure (e.g. struct cachefiles_req) stored in the slot has no such guarantee. The poll routine will iterate the radix tree and dereference cachefiles_req accordingly. Thus RCU read lock is not adequate in this case and spinlock is needed here. Fixes: b817e22b2e91 ("cachefiles: narrow the scope of triggering EPOLLIN events in ondemand mode") Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240628062930.2467993-10-libaokun@huaweicloud.com Acked-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-18cachefiles: cyclic allocation of msg_id to avoid reuseBaokun Li
[ Upstream commit 19f4f399091478c95947f6bd7ad61622300c30d9 ] Reusing the msg_id after a maliciously completed reopen request may cause a read request to remain unprocessed and result in a hung, as shown below: t1 | t2 | t3 ------------------------------------------------- cachefiles_ondemand_select_req cachefiles_ondemand_object_is_close(A) cachefiles_ondemand_set_object_reopening(A) queue_work(fscache_object_wq, &info->work) ondemand_object_worker cachefiles_ondemand_init_object(A) cachefiles_ondemand_send_req(OPEN) // get msg_id 6 wait_for_completion(&req_A->done) cachefiles_ondemand_daemon_read // read msg_id 6 req_A cachefiles_ondemand_get_fd copy_to_user // Malicious completion msg_id 6 copen 6,-1 cachefiles_ondemand_copen complete(&req_A->done) // will not set the object to close // because ondemand_id && fd is valid. // ondemand_object_worker() is done // but the object is still reopening. // new open req_B cachefiles_ondemand_init_object(B) cachefiles_ondemand_send_req(OPEN) // reuse msg_id 6 process_open_req copen 6,A.size // The expected failed copen was executed successfully Expect copen to fail, and when it does, it closes fd, which sets the object to close, and then close triggers reopen again. However, due to msg_id reuse resulting in a successful copen, the anonymous fd is not closed until the daemon exits. Therefore read requests waiting for reopen to complete may trigger hung task. To avoid this issue, allocate the msg_id cyclically to avoid reusing the msg_id for a very short duration of time. Fixes: c8383054506c ("cachefiles: notify the user daemon when looking up cookie") Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240628062930.2467993-9-libaokun@huaweicloud.com Acked-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-18cachefiles: wait for ondemand_object_worker to finish when dropping objectHou Tao
[ Upstream commit 12e009d60852f7bce0afc373ca0b320f14150418 ] When queuing ondemand_object_worker() to re-open the object, cachefiles_object is not pinned. The cachefiles_object may be freed when the pending read request is completed intentionally and the related erofs is umounted. If ondemand_object_worker() runs after the object is freed, it will incur use-after-free problem as shown below. process A processs B process C process D cachefiles_ondemand_send_req() // send a read req X // wait for its completion // close ondemand fd cachefiles_ondemand_fd_release() // set object as CLOSE cachefiles_ondemand_daemon_read() // set object as REOPENING queue_work(fscache_wq, &info->ondemand_work) // close /dev/cachefiles cachefiles_daemon_release cachefiles_flush_reqs complete(&req->done) // read req X is completed // umount the erofs fs cachefiles_put_object() // object will be freed cachefiles_ondemand_deinit_obj_info() kmem_cache_free(object) // both info and object are freed ondemand_object_worker() When dropping an object, it is no longer necessary to reopen the object, so use cancel_work_sync() to cancel or wait for ondemand_object_worker() to finish. Fixes: 0a7e54c1959c ("cachefiles: resend an open request if the read request's object is closed") Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240628062930.2467993-8-libaokun@huaweicloud.com Acked-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-18cachefiles: cancel all requests for the object that is being droppedBaokun Li
[ Upstream commit 751f524635a4f076117d714705eeddadaf6748ee ] Because after an object is dropped, requests for that object are useless, cancel them to avoid causing other problems. This prepares for the later addition of cancel_work_sync(). After the reopen requests is generated, cancel it to avoid cancel_work_sync() blocking by waiting for daemon to complete the reopen requests. Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240628062930.2467993-7-libaokun@huaweicloud.com Acked-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Stable-dep-of: 12e009d60852 ("cachefiles: wait for ondemand_object_worker to finish when dropping object") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-18cachefiles: stop sending new request when dropping objectBaokun Li
[ Upstream commit b2415d1f4566b6939acacc69637eaa57815829c1 ] Added CACHEFILES_ONDEMAND_OBJSTATE_DROPPING indicates that the cachefiles object is being dropped, and is set after the close request for the dropped object completes, and no new requests are allowed to be sent after this state. This prepares for the later addition of cancel_work_sync(). It prevents leftover reopen requests from being sent, to avoid processing unnecessary requests and to avoid cancel_work_sync() blocking by waiting for daemon to complete the reopen requests. Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240628062930.2467993-6-libaokun@huaweicloud.com Acked-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Stable-dep-of: 12e009d60852 ("cachefiles: wait for ondemand_object_worker to finish when dropping object") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-18cachefiles: narrow the scope of triggering EPOLLIN events in ondemand modeJia Zhu
[ Upstream commit b817e22b2e91257ace32a6768c3c003faeaa1c5c ] Don't trigger EPOLLIN when there are only reopening read requests in xarray. Suggested-by: Xin Yin <yinxin.x@bytedance.com> Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com> Link: https://lore.kernel.org/r/20231120041422.75170-5-zhujia.zj@bytedance.com Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Stable-dep-of: 12e009d60852 ("cachefiles: wait for ondemand_object_worker to finish when dropping object") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-18cachefiles: propagate errors from vfs_getxattr() to avoid infinite loopBaokun Li
[ Upstream commit 0ece614a52bc9d219b839a6a29282b30d10e0c48 ] In cachefiles_check_volume_xattr(), the error returned by vfs_getxattr() is not passed to ret, so it ends up returning -ESTALE, which leads to an endless loop as follows: cachefiles_acquire_volume retry: ret = cachefiles_check_volume_xattr ret = -ESTALE xlen = vfs_getxattr // return -EIO // The ret is not updated when xlen < 0, so -ESTALE is returned. return ret // Supposed to jump out of the loop at this judgement. if (ret != -ESTALE) goto error_dir; cachefiles_bury_object // EIO causes rename failure goto retry; Hence propagate the error returned by vfs_getxattr() to avoid the above issue. Do the same in cachefiles_check_auxdata(). Fixes: 32e150037dce ("fscache, cachefiles: Store the volume coherency data") Fixes: 72b957856b0c ("cachefiles: Implement metadata/coherency data storage in xattrs") Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240628062930.2467993-5-libaokun@huaweicloud.com Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-11nilfs2: fix incorrect inode allocation from reserved inodesRyusuke Konishi
commit 93aef9eda1cea9e84ab2453fcceb8addad0e46f1 upstream. If the bitmap block that manages the inode allocation status is corrupted, nilfs_ifile_create_inode() may allocate a new inode from the reserved inode area where it should not be allocated. Previous fix commit d325dc6eb763 ("nilfs2: fix use-after-free bug of struct nilfs_root"), fixed the problem that reserved inodes with inode numbers less than NILFS_USER_INO (=11) were incorrectly reallocated due to bitmap corruption, but since the start number of non-reserved inodes is read from the super block and may change, in which case inode allocation may occur from the extended reserved inode area. If that happens, access to that inode will cause an IO error, causing the file system to degrade to an error state. Fix this potential issue by adding a wraparound option to the common metadata object allocation routine and by modifying nilfs_ifile_create_inode() to disable the option so that it only allocates inodes with inode numbers greater than or equal to the inode number read in "nilfs->ns_first_ino", regardless of the bitmap status of reserved inodes. Link: https://lkml.kernel.org/r/20240623051135.4180-4-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-11fs/ntfs3: Mark volume as dirty if xattr is brokenKonstantin Komarov
[ Upstream commit 24f6f5020b0b2c89c2cba5ec224547be95f753ee ] Mark a volume as corrupted if the name length exceeds the space occupied by ea. Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-11f2fs: Add inline to f2fs_build_fault_attr() stubNathan Chancellor
commit 0d8968287a1cf7b03d07387dc871de3861b9f6b9 upstream. When building without CONFIG_F2FS_FAULT_INJECTION, there is a warning from each file that includes f2fs.h because the stub for f2fs_build_fault_attr() is missing inline: In file included from fs/f2fs/segment.c:21: fs/f2fs/f2fs.h:4605:12: warning: 'f2fs_build_fault_attr' defined but not used [-Wunused-function] 4605 | static int f2fs_build_fault_attr(struct f2fs_sb_info *sbi, unsigned long rate, | ^~~~~~~~~~~~~~~~~~~~~ Add the missing inline to resolve all of the warnings for this configuration. Fixes: 4ed886b187f4 ("f2fs: check validation of fault attrs in f2fs_build_fault_attr()") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-11btrfs: fix adding block group to a reclaim list and the unused list during ↵Naohiro Aota
reclaim commit 48f091fd50b2eb33ae5eaea9ed3c4f81603acf38 upstream. There is a potential parallel list adding for retrying in btrfs_reclaim_bgs_work and adding to the unused list. Since the block group is removed from the reclaim list and it is on a relocation work, it can be added into the unused list in parallel. When that happens, adding it to the reclaim list will corrupt the list head and trigger list corruption like below. Fix it by taking fs_info->unused_bgs_lock. [177.504][T2585409] BTRFS error (device nullb1): error relocating ch= unk 2415919104 [177.514][T2585409] list_del corruption. next->prev should be ff1100= 0344b119c0, but was ff11000377e87c70. (next=3Dff110002390cd9c0) [177.529][T2585409] ------------[ cut here ]------------ [177.537][T2585409] kernel BUG at lib/list_debug.c:65! [177.545][T2585409] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI [177.555][T2585409] CPU: 9 PID: 2585409 Comm: kworker/u128:2 Tainted: G W 6.10.0-rc5-kts #1 [177.568][T2585409] Hardware name: Supermicro SYS-520P-WTR/X12SPW-TF, BIOS 1.2 02/14/2022 [177.579][T2585409] Workqueue: events_unbound btrfs_reclaim_bgs_work[btrfs] [177.589][T2585409] RIP: 0010:__list_del_entry_valid_or_report.cold+0x70/0x72 [177.624][T2585409] RSP: 0018:ff11000377e87a70 EFLAGS: 00010286 [177.633][T2585409] RAX: 000000000000006d RBX: ff11000344b119c0 RCX:0000000000000000 [177.644][T2585409] RDX: 000000000000006d RSI: 0000000000000008 RDI:ffe21c006efd0f40 [177.655][T2585409] RBP: ff110002e0509f78 R08: 0000000000000001 R09:ffe21c006efd0f08 [177.665][T2585409] R10: ff11000377e87847 R11: 0000000000000000 R12:ff110002390cd9c0 [177.676][T2585409] R13: ff11000344b119c0 R14: ff110002e0508000 R15:dffffc0000000000 [177.687][T2585409] FS: 0000000000000000(0000) GS:ff11000fec880000(0000) knlGS:0000000000000000 [177.700][T2585409] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [177.709][T2585409] CR2: 00007f06bc7b1978 CR3: 0000001021e86005 CR4:0000000000771ef0 [177.720][T2585409] DR0: 0000000000000000 DR1: 0000000000000000 DR2:0000000000000000 [177.731][T2585409] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:0000000000000400 [177.742][T2585409] PKRU: 55555554 [177.748][T2585409] Call Trace: [177.753][T2585409] <TASK> [177.759][T2585409] ? __die_body.cold+0x19/0x27 [177.766][T2585409] ? die+0x2e/0x50 [177.772][T2585409] ? do_trap+0x1ea/0x2d0 [177.779][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72 [177.788][T2585409] ? do_error_trap+0xa3/0x160 [177.795][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72 [177.805][T2585409] ? handle_invalid_op+0x2c/0x40 [177.812][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72 [177.820][T2585409] ? exc_invalid_op+0x2d/0x40 [177.827][T2585409] ? asm_exc_invalid_op+0x1a/0x20 [177.834][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72 [177.843][T2585409] btrfs_delete_unused_bgs+0x3d9/0x14c0 [btrfs] There is a similar retry_list code in btrfs_delete_unused_bgs(), but it is safe, AFAICS. Since the block group was in the unused list, the used bytes should be 0 when it was added to the unused list. Then, it checks block_group->{used,reserved,pinned} are still 0 under the block_group->lock. So, they should be still eligible for the unused list, not the reclaim list. The reason it is safe there it's because because we're holding space_info->groups_sem in write mode. That means no other task can allocate from the block group, so while we are at deleted_unused_bgs() it's not possible for other tasks to allocate and deallocate extents from the block group, so it can't be added to the unused list or the reclaim list by anyone else. The bug can be reproduced by btrfs/166 after a few rounds. In practice this can be hit when relocation cannot find more chunk space and ends with ENOSPC. Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Suggested-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com> Fixes: 4eb4e85c4f81 ("btrfs: retry block group reclaim without infinite loop") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-11nilfs2: add missing check for inode numbers on directory entriesRyusuke Konishi
commit bb76c6c274683c8570ad788f79d4b875bde0e458 upstream. Syzbot reported that mounting and unmounting a specific pattern of corrupted nilfs2 filesystem images causes a use-after-free of metadata file inodes, which triggers a kernel bug in lru_add_fn(). As Jan Kara pointed out, this is because the link count of a metadata file gets corrupted to 0, and nilfs_evict_inode(), which is called from iput(), tries to delete that inode (ifile inode in this case). The inconsistency occurs because directories containing the inode numbers of these metadata files that should not be visible in the namespace are read without checking. Fix this issue by treating the inode numbers of these internal files as errors in the sanity check helper when reading directory folios/pages. Also thanks to Hillf Danton and Matthew Wilcox for their initial mm-layer analysis. Link: https://lkml.kernel.org/r/20240623051135.4180-3-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+d79afb004be235636ee8@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d79afb004be235636ee8 Reported-by: Jan Kara <jack@suse.cz> Closes: https://lkml.kernel.org/r/20240617075758.wewhukbrjod5fp5o@quack3 Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-11nilfs2: fix inode number range checksRyusuke Konishi
commit e2fec219a36e0993642844be0f345513507031f4 upstream. Patch series "nilfs2: fix potential issues related to reserved inodes". This series fixes one use-after-free issue reported by syzbot, caused by nilfs2's internal inode being exposed in the namespace on a corrupted filesystem, and a couple of flaws that cause problems if the starting number of non-reserved inodes written in the on-disk super block is intentionally (or corruptly) changed from its default value. This patch (of 3): In the current implementation of nilfs2, "nilfs->ns_first_ino", which gives the first non-reserved inode number, is read from the superblock, but its lower limit is not checked. As a result, if a number that overlaps with the inode number range of reserved inodes such as the root directory or metadata files is set in the super block parameter, the inode number test macros (NILFS_MDT_INODE and NILFS_VALID_INODE) will not function properly. In addition, these test macros use left bit-shift calculations using with the inode number as the shift count via the BIT macro, but the result of a shift calculation that exceeds the bit width of an integer is undefined in the C specification, so if "ns_first_ino" is set to a large value other than the default value NILFS_USER_INO (=11), the macros may potentially malfunction depending on the environment. Fix these issues by checking the lower bound of "nilfs->ns_first_ino" and by preventing bit shifts equal to or greater than the NILFS_USER_INO constant in the inode number test macros. Also, change the type of "ns_first_ino" from signed integer to unsigned integer to avoid the need for type casting in comparisons such as the lower bound check introduced this time. Link: https://lkml.kernel.org/r/20240623051135.4180-1-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/20240623051135.4180-2-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-11btrfs: scrub: initialize ret in scrub_simple_mirror() to fix compilation warningLu Yao
[ Upstream commit b4e585fffc1cf877112ed231a91f089e85688c2a ] The following error message is displayed: ../fs/btrfs/scrub.c:2152:9: error: ‘ret’ may be used uninitialized in this function [-Werror=maybe-uninitialized]" Compiler version: gcc version: (Debian 10.2.1-6) 10.2.1 20210110 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Lu Yao <yaolu@kylinos.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>