summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2022-08-17erofs: avoid consecutive detection for Highmem memoryGao Xiang
[ Upstream commit 448b5a1548d87c246c3d0c3df8480d3c6eb6c11a ] Currently, vmap()s are avoided if physical addresses are consecutive for decompressed buffers. I observed that is very common for 4KiB pclusters since the numbers of decompressed pages are almost 2 or 3. However, such detection doesn't work for Highmem pages on 32-bit machines, let's fix it now. Reported-by: Liu Jinbao <liujinbao1@xiaomi.com> Fixes: 7fc45dbc938a ("staging: erofs: introduce generic decompression backend") Link: https://lore.kernel.org/r/20220708101001.21242-1-hsiangkao@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17erofs: wake up all waiters after z_erofs_lzma_head readyYuwen Chen
[ Upstream commit 2df7c4bd7c1d2bc5ece5e9ed19dbd386810c2a65 ] When the user mounts the erofs second times, the decompression thread may hung. The problem happens due to a sequence of steps like the following: 1) Task A called z_erofs_load_lzma_config which obtain all of the node from the z_erofs_lzma_head. 2) At this time, task B called the z_erofs_lzma_decompress and wanted to get a node. But the z_erofs_lzma_head was empty, the Task B had to sleep. 3) Task A release nodes and push nodes into the z_erofs_lzma_head. But task B was still sleeping. One example report when the hung happens: task:kworker/u3:1 state:D stack:14384 pid: 86 ppid: 2 flags:0x00004000 Workqueue: erofs_unzipd z_erofs_decompressqueue_work Call Trace: <TASK> __schedule+0x281/0x760 schedule+0x49/0xb0 z_erofs_lzma_decompress+0x4bc/0x580 ? cpu_core_flags+0x10/0x10 z_erofs_decompress_pcluster+0x49b/0xba0 ? __update_load_avg_se+0x2b0/0x330 ? __update_load_avg_se+0x2b0/0x330 ? update_load_avg+0x5f/0x690 ? update_load_avg+0x5f/0x690 ? set_next_entity+0xbd/0x110 ? _raw_spin_unlock+0xd/0x20 z_erofs_decompress_queue.isra.0+0x2e/0x50 z_erofs_decompressqueue_work+0x30/0x60 process_one_work+0x1d3/0x3a0 worker_thread+0x45/0x3a0 ? process_one_work+0x3a0/0x3a0 kthread+0xe2/0x110 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x22/0x30 </TASK> Signed-off-by: Yuwen Chen <chenyuwen1@meizu.com> Fixes: 622ceaddb764 ("erofs: lzma compression support") Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220626224041.4288-1-chenyuwen1@meizu.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17ext2: Add more validity checks for inode countsJan Kara
[ Upstream commit fa78f336937240d1bc598db817d638086060e7e9 ] Add checks verifying number of inodes stored in the superblock matches the number computed from number of inodes per group. Also verify we have at least one block worth of inodes per group. This prevents crashes on corrupted filesystems. Reported-by: syzbot+d273f7d7f58afd93be48@syzkaller.appspotmail.com Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17epoll: autoremove wakers even more aggressivelyBenjamin Segall
commit a16ceb13961068f7209e34d7984f8e42d2c06159 upstream. If a process is killed or otherwise exits while having active network connections and many threads waiting on epoll_wait, the threads will all be woken immediately, but not removed from ep->wq. Then when network traffic scans ep->wq in wake_up, every wakeup attempt will fail, and will not remove the entries from the list. This means that the cost of the wakeup attempt is far higher than usual, does not decrease, and this also competes with the dying threads trying to actually make progress and remove themselves from the wq. Handle this by removing visited epoll wq entries unconditionally, rather than only when the wakeup succeeds - the structure of ep_poll means that the only potential loss is the timed_out->eavail heuristic, which now can race and result in a redundant ep_send_events attempt. (But only when incoming data and a timeout actually race, not on every timeout) Shakeel added: : We are seeing this issue in production with real workloads and it has : caused hard lockups. Particularly network heavy workloads with a lot : of threads in epoll_wait() can easily trigger this issue if they get : killed (oom-killed in our case). Link: https://lkml.kernel.org/r/xm26fsjotqda.fsf@google.com Signed-off-by: Ben Segall <bsegall@google.com> Tested-by: Shakeel Butt <shakeelb@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Roman Penyaev <rpenyaev@suse.de> Cc: Jason Baron <jbaron@akamai.com> Cc: Khazhismel Kumykov <khazhy@google.com> Cc: Heiher <r@hev.cc> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-17mbcache: add functions to delete entry if unusedJan Kara
commit 3dc96bba65f53daa217f0a8f43edad145286a8f5 upstream. Add function mb_cache_entry_delete_or_get() to delete mbcache entry if it is unused and also add a function to wait for entry to become unused - mb_cache_entry_wait_unused(). We do not share code between the two deleting function as one of them will go away soon. CC: stable@vger.kernel.org Fixes: 82939d7999df ("ext4: convert to mbcache2") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220712105436.32204-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-17mbcache: don't reclaim used entriesJan Kara
commit 58318914186c157477b978b1739dfe2f1b9dc0fe upstream. Do not reclaim entries that are currently used by somebody from a shrinker. Firstly, these entries are likely useful. Secondly, we will need to keep such entries to protect pending increment of xattr block refcount. CC: stable@vger.kernel.org Fixes: 82939d7999df ("ext4: convert to mbcache2") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220712105436.32204-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-17fuse: fix deadlock between atomic O_TRUNC and page invalidationMiklos Szeredi
commit 2fdbb8dd01556e1501132b5ad3826e8f71e24a8b upstream. fuse_finish_open() will be called with FUSE_NOWRITE set in case of atomic O_TRUNC open(), so commit 76224355db75 ("fuse: truncate pagecache on atomic_o_trunc") replaced invalidate_inode_pages2() by truncate_pagecache() in such a case to avoid the A-A deadlock. However, we found another A-B-B-A deadlock related to the case above, which will cause the xfstests generic/464 testcase hung in our virtio-fs test environment. For example, consider two processes concurrently open one same file, one with O_TRUNC and another without O_TRUNC. The deadlock case is described below, if open(O_TRUNC) is already set_nowrite(acquired A), and is trying to lock a page (acquiring B), open() could have held the page lock (acquired B), and waiting on the page writeback (acquiring A). This would lead to deadlocks. open(O_TRUNC) ---------------------------------------------------------------- fuse_open_common inode_lock [C acquire] fuse_set_nowrite [A acquire] fuse_finish_open truncate_pagecache lock_page [B acquire] truncate_inode_page unlock_page [B release] fuse_release_nowrite [A release] inode_unlock [C release] ---------------------------------------------------------------- open() ---------------------------------------------------------------- fuse_open_common fuse_finish_open invalidate_inode_pages2 lock_page [B acquire] fuse_launder_page fuse_wait_on_page_writeback [A acquire & release] unlock_page [B release] ---------------------------------------------------------------- Besides this case, all calls of invalidate_inode_pages2() and invalidate_inode_pages2_range() in fuse code also can deadlock with open(O_TRUNC). Fix by moving the truncate_pagecache() call outside the nowrite protected region. The nowrite protection is only for delayed writeback (writeback_cache) case, where inode lock does not protect against truncation racing with writes on the server. Write syscalls racing with page cache truncation still get the inode lock protection. This patch also changes the order of filemap_invalidate_lock() vs. fuse_set_nowrite() in fuse_open_common(). This new order matches the order found in fuse_file_fallocate() and fuse_do_setattr(). Reported-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com> Tested-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com> Fixes: e4648309b85a ("fuse: truncate pending writes on O_TRUNC") Cc: <stable@vger.kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-17fuse: write inode in fuse_release()Miklos Szeredi
commit 035ff33cf4db101250fb980a3941bf078f37a544 upstream. A race between write(2) and close(2) allows pages to be dirtied after fuse_flush -> write_inode_now(). If these pages are not flushed from fuse_release(), then there might not be a writable open file later. So any remaining dirty pages must be written back before the file is released. This is a partial revert of the blamed commit. Reported-by: syzbot+6e1efbd8efaaa6860e91@syzkaller.appspotmail.com Fixes: 36ea23374d1f ("fuse: write inode in fuse_vma_close() instead of fuse_release()") Cc: <stable@vger.kernel.org> # v5.16 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-17fuse: ioctl: translate ENOSYSMiklos Szeredi
commit 02c0cab8e7345b06f1c0838df444e2902e4138d3 upstream. Overlayfs may fail to complete updates when a filesystem lacks fileattr/xattr syscall support and responds with an ENOSYS error code, resulting in an unexpected "Function not implemented" error. This bug may occur with FUSE filesystems, such as davfs2. Steps to reproduce: # install davfs2, e.g., apk add davfs2 mkdir /test mkdir /test/lower /test/upper /test/work /test/mnt yes '' | mount -t davfs -o ro http://some-web-dav-server/path \ /test/lower mount -t overlay -o upperdir=/test/upper,lowerdir=/test/lower \ -o workdir=/test/work overlay /test/mnt # when "some-file" exists in the lowerdir, this fails with "Function # not implemented", with dmesg showing "overlayfs: failed to retrieve # lower fileattr (/some-file, err=-38)" touch /test/mnt/some-file The underlying cause of this regresion is actually in FUSE, which fails to translate the ENOSYS error code returned by userspace filesystem (which means that the ioctl operation is not supported) to ENOTTY. Reported-by: Christian Kohlschütter <christian@kohlschutter.com> Fixes: 72db82115d2b ("ovl: copy up sync/noatime fileattr flags") Fixes: 59efec7b9039 ("fuse: implement ioctl support") Cc: <stable@vger.kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-17fuse: limit nsecMiklos Szeredi
commit 47912eaa061a6a81e4aa790591a1874c650733c0 upstream. Limit nanoseconds to 0..999999999. Fixes: d8a5ba45457e ("[PATCH] FUSE - core") Cc: <stable@vger.kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-17ksmbd: fix heap-based overflow in set_ntacl_dacl()Namjae Jeon
commit 8f0541186e9ad1b62accc9519cc2b7a7240272a7 upstream. The testcase use SMB2_SET_INFO_HE command to set a malformed file attribute under the label `security.NTACL`. SMB2_QUERY_INFO_HE command in testcase trigger the following overflow. [ 4712.003781] ================================================================== [ 4712.003790] BUG: KASAN: slab-out-of-bounds in build_sec_desc+0x842/0x1dd0 [ksmbd] [ 4712.003807] Write of size 1060 at addr ffff88801e34c068 by task kworker/0:0/4190 [ 4712.003813] CPU: 0 PID: 4190 Comm: kworker/0:0 Not tainted 5.19.0-rc5 #1 [ 4712.003850] Workqueue: ksmbd-io handle_ksmbd_work [ksmbd] [ 4712.003867] Call Trace: [ 4712.003870] <TASK> [ 4712.003873] dump_stack_lvl+0x49/0x5f [ 4712.003935] print_report.cold+0x5e/0x5cf [ 4712.003972] ? ksmbd_vfs_get_sd_xattr+0x16d/0x500 [ksmbd] [ 4712.003984] ? cmp_map_id+0x200/0x200 [ 4712.003988] ? build_sec_desc+0x842/0x1dd0 [ksmbd] [ 4712.004000] kasan_report+0xaa/0x120 [ 4712.004045] ? build_sec_desc+0x842/0x1dd0 [ksmbd] [ 4712.004056] kasan_check_range+0x100/0x1e0 [ 4712.004060] memcpy+0x3c/0x60 [ 4712.004064] build_sec_desc+0x842/0x1dd0 [ksmbd] [ 4712.004076] ? parse_sec_desc+0x580/0x580 [ksmbd] [ 4712.004088] ? ksmbd_acls_fattr+0x281/0x410 [ksmbd] [ 4712.004099] smb2_query_info+0xa8f/0x6110 [ksmbd] [ 4712.004111] ? psi_group_change+0x856/0xd70 [ 4712.004148] ? update_load_avg+0x1c3/0x1af0 [ 4712.004152] ? asym_cpu_capacity_scan+0x5d0/0x5d0 [ 4712.004157] ? xas_load+0x23/0x300 [ 4712.004162] ? smb2_query_dir+0x1530/0x1530 [ksmbd] [ 4712.004173] ? _raw_spin_lock_bh+0xe0/0xe0 [ 4712.004179] handle_ksmbd_work+0x30e/0x1020 [ksmbd] [ 4712.004192] process_one_work+0x778/0x11c0 [ 4712.004227] ? _raw_spin_lock_irq+0x8e/0xe0 [ 4712.004231] worker_thread+0x544/0x1180 [ 4712.004234] ? __cpuidle_text_end+0x4/0x4 [ 4712.004239] kthread+0x282/0x320 [ 4712.004243] ? process_one_work+0x11c0/0x11c0 [ 4712.004246] ? kthread_complete_and_exit+0x30/0x30 [ 4712.004282] ret_from_fork+0x1f/0x30 This patch add the buffer validation for security descriptor that is stored by malformed SMB2_SET_INFO_HE command. and allocate large response buffer about SMB2_O_INFO_SECURITY file info class. Fixes: e2f34481b24d ("cifsd: add server-side procedures for SMB3") Cc: stable@vger.kernel.org Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-17771 Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-17ksmbd: fix use-after-free bug in smb2_tree_disconectNamjae Jeon
commit cf6531d98190fa2cf92a6d8bbc8af0a4740a223c upstream. smb2_tree_disconnect() freed the struct ksmbd_tree_connect, but it left the dangling pointer. It can be accessed again under compound requests. This bug can lead an oops looking something link: [ 1685.468014 ] BUG: KASAN: use-after-free in ksmbd_tree_conn_disconnect+0x131/0x160 [ksmbd] [ 1685.468068 ] Read of size 4 at addr ffff888102172180 by task kworker/1:2/4807 ... [ 1685.468130 ] Call Trace: [ 1685.468132 ] <TASK> [ 1685.468135 ] dump_stack_lvl+0x49/0x5f [ 1685.468141 ] print_report.cold+0x5e/0x5cf [ 1685.468145 ] ? ksmbd_tree_conn_disconnect+0x131/0x160 [ksmbd] [ 1685.468157 ] kasan_report+0xaa/0x120 [ 1685.468194 ] ? ksmbd_tree_conn_disconnect+0x131/0x160 [ksmbd] [ 1685.468206 ] __asan_report_load4_noabort+0x14/0x20 [ 1685.468210 ] ksmbd_tree_conn_disconnect+0x131/0x160 [ksmbd] [ 1685.468222 ] smb2_tree_disconnect+0x175/0x250 [ksmbd] [ 1685.468235 ] handle_ksmbd_work+0x30e/0x1020 [ksmbd] [ 1685.468247 ] process_one_work+0x778/0x11c0 [ 1685.468251 ] ? _raw_spin_lock_irq+0x8e/0xe0 [ 1685.468289 ] worker_thread+0x544/0x1180 [ 1685.468293 ] ? __cpuidle_text_end+0x4/0x4 [ 1685.468297 ] kthread+0x282/0x320 [ 1685.468301 ] ? process_one_work+0x11c0/0x11c0 [ 1685.468305 ] ? kthread_complete_and_exit+0x30/0x30 [ 1685.468309 ] ret_from_fork+0x1f/0x30 Fixes: e2f34481b24d ("cifsd: add server-side procedures for SMB3") Cc: stable@vger.kernel.org Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-17816 Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-17ksmbd: prevent out of bound read for SMB2_WRITEHyunchul Lee
commit ac60778b87e45576d7bfdbd6f53df902654e6f09 upstream. OOB read memory can be written to a file, if DataOffset is 0 and Length is too large in SMB2_WRITE request of compound request. To prevent this, when checking the length of the data area of SMB2_WRITE in smb2_get_data_area_len(), let the minimum of DataOffset be the size of SMB2 header + the size of SMB2_WRITE header. This bug can lead an oops looking something like: [ 798.008715] BUG: KASAN: slab-out-of-bounds in copy_page_from_iter_atomic+0xd3d/0x14b0 [ 798.008724] Read of size 252 at addr ffff88800f863e90 by task kworker/0:2/2859 ... [ 798.008754] Call Trace: [ 798.008756] <TASK> [ 798.008759] dump_stack_lvl+0x49/0x5f [ 798.008764] print_report.cold+0x5e/0x5cf [ 798.008768] ? __filemap_get_folio+0x285/0x6d0 [ 798.008774] ? copy_page_from_iter_atomic+0xd3d/0x14b0 [ 798.008777] kasan_report+0xaa/0x120 [ 798.008781] ? copy_page_from_iter_atomic+0xd3d/0x14b0 [ 798.008784] kasan_check_range+0x100/0x1e0 [ 798.008788] memcpy+0x24/0x60 [ 798.008792] copy_page_from_iter_atomic+0xd3d/0x14b0 [ 798.008795] ? pagecache_get_page+0x53/0x160 [ 798.008799] ? iov_iter_get_pages_alloc+0x1590/0x1590 [ 798.008803] ? ext4_write_begin+0xfc0/0xfc0 [ 798.008807] ? current_time+0x72/0x210 [ 798.008811] generic_perform_write+0x2c8/0x530 [ 798.008816] ? filemap_fdatawrite_wbc+0x180/0x180 [ 798.008820] ? down_write+0xb4/0x120 [ 798.008824] ? down_write_killable+0x130/0x130 [ 798.008829] ext4_buffered_write_iter+0x137/0x2c0 [ 798.008833] ext4_file_write_iter+0x40b/0x1490 [ 798.008837] ? __fsnotify_parent+0x275/0xb20 [ 798.008842] ? __fsnotify_update_child_dentry_flags+0x2c0/0x2c0 [ 798.008846] ? ext4_buffered_write_iter+0x2c0/0x2c0 [ 798.008851] __kernel_write+0x3a1/0xa70 [ 798.008855] ? __x64_sys_preadv2+0x160/0x160 [ 798.008860] ? security_file_permission+0x4a/0xa0 [ 798.008865] kernel_write+0xbb/0x360 [ 798.008869] ksmbd_vfs_write+0x27e/0xb90 [ksmbd] [ 798.008881] ? ksmbd_vfs_read+0x830/0x830 [ksmbd] [ 798.008892] ? _raw_read_unlock+0x2a/0x50 [ 798.008896] smb2_write+0xb45/0x14e0 [ksmbd] [ 798.008909] ? __kasan_check_write+0x14/0x20 [ 798.008912] ? _raw_spin_lock_bh+0xd0/0xe0 [ 798.008916] ? smb2_read+0x15e0/0x15e0 [ksmbd] [ 798.008927] ? memcpy+0x4e/0x60 [ 798.008931] ? _raw_spin_unlock+0x19/0x30 [ 798.008934] ? ksmbd_smb2_check_message+0x16af/0x2350 [ksmbd] [ 798.008946] ? _raw_spin_lock_bh+0xe0/0xe0 [ 798.008950] handle_ksmbd_work+0x30e/0x1020 [ksmbd] [ 798.008962] process_one_work+0x778/0x11c0 [ 798.008966] ? _raw_spin_lock_irq+0x8e/0xe0 [ 798.008970] worker_thread+0x544/0x1180 [ 798.008973] ? __cpuidle_text_end+0x4/0x4 [ 798.008977] kthread+0x282/0x320 [ 798.008982] ? process_one_work+0x11c0/0x11c0 [ 798.008985] ? kthread_complete_and_exit+0x30/0x30 [ 798.008989] ret_from_fork+0x1f/0x30 [ 798.008995] </TASK> Fixes: e2f34481b24d ("cifsd: add server-side procedures for SMB3") Cc: stable@vger.kernel.org Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-17817 Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-17ksmbd: prevent out of bound read for SMB2_TREE_CONNNECTHyunchul Lee
commit 824d4f64c20093275f72fc8101394d75ff6a249e upstream. if Status is not 0 and PathLength is long, smb_strndup_from_utf16 could make out of bound read in smb2_tree_connnect. This bug can lead an oops looking something like: [ 1553.882047] BUG: KASAN: slab-out-of-bounds in smb_strndup_from_utf16+0x469/0x4c0 [ksmbd] [ 1553.882064] Read of size 2 at addr ffff88802c4eda04 by task kworker/0:2/42805 ... [ 1553.882095] Call Trace: [ 1553.882098] <TASK> [ 1553.882101] dump_stack_lvl+0x49/0x5f [ 1553.882107] print_report.cold+0x5e/0x5cf [ 1553.882112] ? smb_strndup_from_utf16+0x469/0x4c0 [ksmbd] [ 1553.882122] kasan_report+0xaa/0x120 [ 1553.882128] ? smb_strndup_from_utf16+0x469/0x4c0 [ksmbd] [ 1553.882139] __asan_report_load_n_noabort+0xf/0x20 [ 1553.882143] smb_strndup_from_utf16+0x469/0x4c0 [ksmbd] [ 1553.882155] ? smb_strtoUTF16+0x3b0/0x3b0 [ksmbd] [ 1553.882166] ? __kmalloc_node+0x185/0x430 [ 1553.882171] smb2_tree_connect+0x140/0xab0 [ksmbd] [ 1553.882185] handle_ksmbd_work+0x30e/0x1020 [ksmbd] [ 1553.882197] process_one_work+0x778/0x11c0 [ 1553.882201] ? _raw_spin_lock_irq+0x8e/0xe0 [ 1553.882206] worker_thread+0x544/0x1180 [ 1553.882209] ? __cpuidle_text_end+0x4/0x4 [ 1553.882214] kthread+0x282/0x320 [ 1553.882218] ? process_one_work+0x11c0/0x11c0 [ 1553.882221] ? kthread_complete_and_exit+0x30/0x30 [ 1553.882225] ret_from_fork+0x1f/0x30 [ 1553.882231] </TASK> There is no need to check error request validation in server. This check allow invalid requests not to validate message. Fixes: e2f34481b24d ("cifsd: add server-side procedures for SMB3") Cc: stable@vger.kernel.org Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-17818 Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-17ksmbd: fix memory leak in smb2_handle_negotiateNamjae Jeon
commit aa7253c2393f6dcd6a1468b0792f6da76edad917 upstream. The allocated memory didn't free under an error path in smb2_handle_negotiate(). Fixes: e2f34481b24d ("cifsd: add server-side procedures for SMB3") Cc: stable@vger.kernel.org Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-17815 Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-17btrfs: reject log replay if there is unsupported RO compat flagQu Wenruo
commit dc4d31684974d140250f3ee612c3f0cab13b3146 upstream. [BUG] If we have a btrfs image with dirty log, along with an unsupported RO compatible flag: log_root 30474240 ... compat_flags 0x0 compat_ro_flags 0x40000003 ( FREE_SPACE_TREE | FREE_SPACE_TREE_VALID | unknown flag: 0x40000000 ) Then even if we can only mount it RO, we will still cause metadata update for log replay: BTRFS info (device dm-1): flagging fs with big metadata feature BTRFS info (device dm-1): using free space tree BTRFS info (device dm-1): has skinny extents BTRFS info (device dm-1): start tree-log replay This is definitely against RO compact flag requirement. [CAUSE] RO compact flag only forces us to do RO mount, but we will still do log replay for plain RO mount. Thus this will result us to do log replay and update metadata. This can be very problematic for new RO compat flag, for example older kernel can not understand v2 cache, and if we allow metadata update on RO mount and invalidate/corrupt v2 cache. [FIX] Just reject the mount unless rescue=nologreplay is provided: BTRFS error (device dm-1): cannot replay dirty log with unsupport optional features (0x40000000), try rescue=nologreplay instead We don't want to set rescue=nologreply directly, as this would make the end user to read the old data, and cause confusion. Since the such case is really rare, we're mostly fine to just reject the mount with an error message, which also includes the proper workaround. CC: stable@vger.kernel.org #4.9+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-17ovl: drop WARN_ON() dentry is NULL in ovl_encode_fh()Jiachen Zhang
commit dd524b7f317de8d31d638cbfdc7be4cf9b770e42 upstream. Some code paths cannot guarantee the inode have any dentry alias. So WARN_ON() all !dentry may flood the kernel logs. For example, when an overlayfs inode is watched by inotifywait (1), and someone is trying to read the /proc/$(pidof inotifywait)/fdinfo/INOTIFY_FD, at that time if the dentry has been reclaimed by kernel (such as echo 2 > /proc/sys/vm/drop_caches), there will be a WARN_ON(). The printed call stack would be like: ? show_mark_fhandle+0xf0/0xf0 show_mark_fhandle+0x4a/0xf0 ? show_mark_fhandle+0xf0/0xf0 ? seq_vprintf+0x30/0x50 ? seq_printf+0x53/0x70 ? show_mark_fhandle+0xf0/0xf0 inotify_fdinfo+0x70/0x90 show_fdinfo.isra.4+0x53/0x70 seq_show+0x130/0x170 seq_read+0x153/0x440 vfs_read+0x94/0x150 ksys_read+0x5f/0xe0 do_syscall_64+0x59/0x1e0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 So let's drop WARN_ON() to avoid kernel log flooding. Reported-by: Hongbo Yin <yinhongbo@bytedance.com> Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com> Signed-off-by: Tianci Zhang <zhangtianci.1997@bytedance.com> Fixes: 8ed5eec9d6c4 ("ovl: encode pure upper file handles") Cc: <stable@vger.kernel.org> # v4.16 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-17fs: Add missing umask strip in vfs_tmpfileYang Xu
commit ac6800e279a22b28f4fc21439843025a0d5bf03e upstream. All creation paths except for O_TMPFILE handle umask in the vfs directly if the filesystem doesn't support or enable POSIX ACLs. If the filesystem does then umask handling is deferred until posix_acl_create(). Because, O_TMPFILE misses umask handling in the vfs it will not honor umask settings. Fix this by adding the missing umask handling. Link: https://lore.kernel.org/r/1657779088-2242-2-git-send-email-xuyang2018.jy@fujitsu.com Fixes: 60545d0d4610 ("[O_TMPFILE] it's still short a few helpers, but infrastructure should be OK now...") Cc: <stable@vger.kernel.org> # 4.19+ Reported-by: Christian Brauner (Microsoft) <brauner@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org> Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-17vfs: Check the truncate maximum size in inode_newsize_ok()David Howells
commit e2ebff9c57fe4eb104ce4768f6ebcccf76bef849 upstream. If something manages to set the maximum file size to MAX_OFFSET+1, this can cause the xfs and ext4 filesystems at least to become corrupt. Ordinarily, the kernel protects against userspace trying this by checking the value early in the truncate() and ftruncate() system calls calls - but there are at least two places that this check is bypassed: (1) Cachefiles will round up the EOF of the backing file to DIO block size so as to allow DIO on the final block - but this might push the offset negative. It then calls notify_change(), but this inadvertently bypasses the checking. This can be triggered if someone puts an 8EiB-1 file on a server for someone else to try and access by, say, nfs. (2) ksmbd doesn't check the value it is given in set_end_of_file_info() and then calls vfs_truncate() directly - which also bypasses the check. In both cases, it is potentially possible for a network filesystem to cause a disk filesystem to be corrupted: cachefiles in the client's cache filesystem; ksmbd in the server's filesystem. nfsd is okay as it checks the value, but we can then remove this check too. Fix this by adding a check to inode_newsize_ok(), as called from setattr_prepare(), thereby catching the issue as filesystems set up to perform the truncate with minimal opportunity for bypassing the new check. Fixes: 1f08c925e7a3 ("cachefiles: Implement backing file wrangling") Fixes: f44158485826 ("cifsd: add file operations") Signed-off-by: David Howells <dhowells@redhat.com> Reported-by: Jeff Layton <jlayton@kernel.org> Tested-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Namjae Jeon <linkinjeon@kernel.org> Cc: stable@kernel.org Acked-by: Alexander Viro <viro@zeniv.linux.org.uk> cc: Steve French <sfrench@samba.org> cc: Hyunchul Lee <hyc.lee@gmail.com> cc: Chuck Lever <chuck.lever@oracle.com> cc: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-17lockd: detect and reject lock arguments that overflowJeff Layton
commit 6930bcbfb6ceda63e298c6af6d733ecdf6bd4cde upstream. lockd doesn't currently vet the start and length in nlm4 requests like it should, and can end up generating lock requests with arguments that overflow when passed to the filesystem. The NLM4 protocol uses unsigned 64-bit arguments for both start and length, whereas struct file_lock tracks the start and end as loff_t values. By the time we get around to calling nlm4svc_retrieve_args, we've lost the information that would allow us to determine if there was an overflow. Start tracking the actual start and len for NLM4 requests in the nlm_lock. In nlm4svc_retrieve_args, vet these values to ensure they won't cause an overflow, and return NLM4_FBIG if they do. Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=392 Reported-by: Jan Kasiak <j.kasiak@gmail.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: <stable@vger.kernel.org> # 5.14+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-17nfsd: eliminate the NFSD_FILE_BREAK_* flagsJeff Layton
commit 23ba98de6dcec665e15c0ca19244379bb0d30932 upstream. We had a report from the spring Bake-a-thon of data corruption in some nfstest_interop tests. Looking at the traces showed the NFS server allowing a v3 WRITE to proceed while a read delegation was still outstanding. Currently, we only set NFSD_FILE_BREAK_* flags if NFSD_MAY_NOT_BREAK_LEASE was set when we call nfsd_file_alloc. NFSD_MAY_NOT_BREAK_LEASE was intended to be set when finding files for COMMIT ops, where we need a writeable filehandle but don't need to break read leases. It doesn't make any sense to consult that flag when allocating a file since the file may be used on subsequent calls where we do want to break the lease (and the usage of it here seems to be reverse from what it should be anyway). Also, after calling nfsd_open_break_lease, we don't want to clear the BREAK_* bits. A lease could end up being set on it later (more than once) and we need to be able to break those leases as well. This means that the NFSD_FILE_BREAK_* flags now just mirror NFSD_MAY_{READ,WRITE} flags, so there's no need for them at all. Just drop those flags and unconditionally call nfsd_open_break_lease every time. Reported-by: Olga Kornieskaia <kolga@netapp.com> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2107360 Fixes: 65294c1f2c5e (nfsd: add a new struct file caching facility to nfsd) Cc: <stable@vger.kernel.org> # 5.4.x : bb283ca18d1e NFSD: Clean up the show_nf_flags() macro Cc: <stable@vger.kernel.org> # 5.4.x Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-17pNFS/flexfiles: Report RDMA connection errors to the serverTrond Myklebust
commit 7836d75467e9d214bdf5c693b32721de729a6e38 upstream. The RPC/RDMA driver will return -EPROTO and -ENODEV as connection errors under certain circumstances. Make sure that we handle them and report them to the server. If not, we can end up cycling forever in a LAYOUTGET/LAYOUTRETURN loop. Fixes: a12f996d3413 ("NFSv4/pNFS: Use connections to a DS that are all of the same protocol family") Cc: stable@vger.kernel.org # 5.11.x Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-17Revert "pNFS: nfs3_set_ds_client should set NFS_CS_NOPING"Trond Myklebust
commit 9597152d98840c2517230740952df97cfcc07e2f upstream. This reverts commit c6eb58435b98bd843d3179664a0195ff25adb2c3. If a transport is down, then we want to fail over to other transports if they are listed in the GETDEVICEINFO reply. Fixes: c6eb58435b98 ("pNFS: nfs3_set_ds_client should set NFS_CS_NOPING") Cc: stable@vger.kernel.org # 5.11.x Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-07-26Merge tag 'mm-hotfixes-stable-2022-07-26' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "Thirteen hotfixes. Eight are cc:stable and the remainder are for post-5.18 issues or are too minor to warrant backporting" * tag 'mm-hotfixes-stable-2022-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mailmap: update Gao Xiang's email addresses userfaultfd: provide properly masked address for huge-pages Revert "ocfs2: mount shared volume without ha stack" hugetlb: fix memoryleak in hugetlb_mcopy_atomic_pte fs: sendfile handles O_NONBLOCK of out_fd ntfs: fix use-after-free in ntfs_ucsncmp() secretmem: fix unhandled fault in truncate mm/hugetlb: separate path for hwpoison entry in copy_hugetlb_page_range() mm: fix missing wake-up event for FSDAX pages mm: fix page leak with multiple threads mapping the same page mailmap: update Seth Forshee's email address tmpfs: fix the issue that the mount and remount results are inconsistent. mm: kfence: apply kmemleak_ignore_phys on early allocated pool
2022-07-26userfaultfd: provide properly masked address for huge-pagesNadav Amit
Commit 824ddc601adc ("userfaultfd: provide unmasked address on page-fault") was introduced to fix an old bug, in which the offset in the address of a page-fault was masked. Concerns were raised - although were never backed by actual code - that some userspace code might break because the bug has been around for quite a while. To address these concerns a new flag was introduced, and only when this flag is set by the user, userfaultfd provides the exact address of the page-fault. The commit however had a bug, and if the flag is unset, the offset was always masked based on a base-page granularity. Yet, for huge-pages, the behavior prior to the commit was that the address is masked to the huge-page granulrity. While there are no reports on real breakage, fix this issue. If the flag is unset, use the address with the masking that was done before. Link: https://lkml.kernel.org/r/20220711165906.2682-1-namit@vmware.com Fixes: 824ddc601adc ("userfaultfd: provide unmasked address on page-fault") Signed-off-by: Nadav Amit <namit@vmware.com> Reported-by: James Houghton <jthoughton@google.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: James Houghton <jthoughton@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-22Merge tag 'io_uring-5.19-2022-07-21' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring fixes from Jens Axboe: "Fix for a bad kfree() introduced in this cycle, and a quick fix for disabling buffer recycling for IORING_OP_READV. The latter will get reworked for 5.20, but it gets the job done for 5.19" * tag 'io_uring-5.19-2022-07-21' of git://git.kernel.dk/linux-block: io_uring: do not recycle buffer in READV io_uring: fix free of unallocated buffer list
2022-07-21io_uring: do not recycle buffer in READVDylan Yudaken
READV cannot recycle buffers as it would lose some of the data required to reimport that buffer. Reported-by: Ammar Faizi <ammarfaizi2@gnuweeb.org> Fixes: b66e65f41426 ("io_uring: never call io_buffer_select() for a buffer re-select") Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220721131325.624788-1-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-21io_uring: fix free of unallocated buffer listDylan Yudaken
in the error path of io_register_pbuf_ring, only free bl if it was allocated. Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com> Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers") Signed-off-by: Dylan Yudaken <dylany@fb.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/all/CANX2M5bXKw1NaHdHNVqssUUaBCs8aBpmzRNVEYEvV0n44P7ioA@mail.gmail.com/ Link: https://lore.kernel.org/all/CANX2M5YiZBXU3L6iwnaLs-HHJXRvrxM8mhPDiMDF9Y9sAvOHUA@mail.gmail.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-18Revert "ocfs2: mount shared volume without ha stack"Junxiao Bi
This reverts commit 912f655d78c5d4ad05eac287f23a435924df7144. This commit introduced a regression that can cause mount hung. The changes in __ocfs2_find_empty_slot causes that any node with none-zero node number can grab the slot that was already taken by node 0, so node 1 will access the same journal with node 0, when it try to grab journal cluster lock, it will hung because it was already acquired by node 0. It's very easy to reproduce this, in one cluster, mount node 0 first, then node 1, you will see the following call trace from node 1. [13148.735424] INFO: task mount.ocfs2:53045 blocked for more than 122 seconds. [13148.739691] Not tainted 5.15.0-2148.0.4.el8uek.mountracev2.x86_64 #2 [13148.742560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [13148.745846] task:mount.ocfs2 state:D stack: 0 pid:53045 ppid: 53044 flags:0x00004000 [13148.749354] Call Trace: [13148.750718] <TASK> [13148.752019] ? usleep_range+0x90/0x89 [13148.753882] __schedule+0x210/0x567 [13148.755684] schedule+0x44/0xa8 [13148.757270] schedule_timeout+0x106/0x13c [13148.759273] ? __prepare_to_swait+0x53/0x78 [13148.761218] __wait_for_common+0xae/0x163 [13148.763144] __ocfs2_cluster_lock.constprop.0+0x1d6/0x870 [ocfs2] [13148.765780] ? ocfs2_inode_lock_full_nested+0x18d/0x398 [ocfs2] [13148.768312] ocfs2_inode_lock_full_nested+0x18d/0x398 [ocfs2] [13148.770968] ocfs2_journal_init+0x91/0x340 [ocfs2] [13148.773202] ocfs2_check_volume+0x39/0x461 [ocfs2] [13148.775401] ? iput+0x69/0xba [13148.777047] ocfs2_mount_volume.isra.0.cold+0x40/0x1f5 [ocfs2] [13148.779646] ocfs2_fill_super+0x54b/0x853 [ocfs2] [13148.781756] mount_bdev+0x190/0x1b7 [13148.783443] ? ocfs2_remount+0x440/0x440 [ocfs2] [13148.785634] legacy_get_tree+0x27/0x48 [13148.787466] vfs_get_tree+0x25/0xd0 [13148.789270] do_new_mount+0x18c/0x2d9 [13148.791046] __x64_sys_mount+0x10e/0x142 [13148.792911] do_syscall_64+0x3b/0x89 [13148.794667] entry_SYSCALL_64_after_hwframe+0x170/0x0 [13148.797051] RIP: 0033:0x7f2309f6e26e [13148.798784] RSP: 002b:00007ffdcee7d408 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 [13148.801974] RAX: ffffffffffffffda RBX: 00007ffdcee7d4a0 RCX: 00007f2309f6e26e [13148.804815] RDX: 0000559aa762a8ae RSI: 0000559aa939d340 RDI: 0000559aa93a22b0 [13148.807719] RBP: 00007ffdcee7d5b0 R08: 0000559aa93a2290 R09: 00007f230a0b4820 [13148.810659] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdcee7d420 [13148.813609] R13: 0000000000000000 R14: 0000559aa939f000 R15: 0000000000000000 [13148.816564] </TASK> To fix it, we can just fix __ocfs2_find_empty_slot. But original commit introduced the feature to mount ocfs2 locally even it is cluster based, that is a very dangerous, it can easily cause serious data corruption, there is no way to stop other nodes mounting the fs and corrupting it. Setup ha or other cluster-aware stack is just the cost that we have to take for avoiding corruption, otherwise we have to do it in kernel. Link: https://lkml.kernel.org/r/20220603222801.42488-1-junxiao.bi@oracle.com Fixes: 912f655d78c5("ocfs2: mount shared volume without ha stack") Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <heming.zhao@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-18fs: sendfile handles O_NONBLOCK of out_fdAndrei Vagin
sendfile has to return EAGAIN if out_fd is nonblocking and the write into it would block. Here is a small reproducer for the problem: #define _GNU_SOURCE /* See feature_test_macros(7) */ #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <errno.h> #include <sys/stat.h> #include <sys/types.h> #include <sys/sendfile.h> #define FILE_SIZE (1UL << 30) int main(int argc, char **argv) { int p[2], fd; if (pipe2(p, O_NONBLOCK)) return 1; fd = open(argv[1], O_RDWR | O_TMPFILE, 0666); if (fd < 0) return 1; ftruncate(fd, FILE_SIZE); if (sendfile(p[1], fd, 0, FILE_SIZE) == -1) { fprintf(stderr, "FAIL\n"); } if (sendfile(p[1], fd, 0, FILE_SIZE) != -1 || errno != EAGAIN) { fprintf(stderr, "FAIL\n"); } return 0; } It worked before b964bf53e540, it is stuck after b964bf53e540, and it works again with this fix. This regression occurred because do_splice_direct() calls pipe_write that handles O_NONBLOCK. Here is a trace log from the reproducer: 1) | __x64_sys_sendfile64() { 1) | do_sendfile() { 1) | __fdget() 1) | rw_verify_area() 1) | __fdget() 1) | rw_verify_area() 1) | do_splice_direct() { 1) | rw_verify_area() 1) | splice_direct_to_actor() { 1) | do_splice_to() { 1) | rw_verify_area() 1) | generic_file_splice_read() 1) + 74.153 us | } 1) | direct_splice_actor() { 1) | iter_file_splice_write() { 1) | __kmalloc() 1) 0.148 us | pipe_lock(); 1) 0.153 us | splice_from_pipe_next.part.0(); 1) 0.162 us | page_cache_pipe_buf_confirm(); ... 16 times 1) 0.159 us | page_cache_pipe_buf_confirm(); 1) | vfs_iter_write() { 1) | do_iter_write() { 1) | rw_verify_area() 1) | do_iter_readv_writev() { 1) | pipe_write() { 1) | mutex_lock() 1) 0.153 us | mutex_unlock(); 1) 1.368 us | } 1) 1.686 us | } 1) 5.798 us | } 1) 6.084 us | } 1) 0.174 us | kfree(); 1) 0.152 us | pipe_unlock(); 1) + 14.461 us | } 1) + 14.783 us | } 1) 0.164 us | page_cache_pipe_buf_release(); ... 16 times 1) 0.161 us | page_cache_pipe_buf_release(); 1) | touch_atime() 1) + 95.854 us | } 1) + 99.784 us | } 1) ! 107.393 us | } 1) ! 107.699 us | } Link: https://lkml.kernel.org/r/20220415005015.525191-1-avagin@gmail.com Fixes: b964bf53e540 ("teach sendfile(2) to handle send-to-pipe directly") Signed-off-by: Andrei Vagin <avagin@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-18ntfs: fix use-after-free in ntfs_ucsncmp()ChenXiaoSong
Syzkaller reported use-after-free bug as follows: ================================================================== BUG: KASAN: use-after-free in ntfs_ucsncmp+0x123/0x130 Read of size 2 at addr ffff8880751acee8 by task a.out/879 CPU: 7 PID: 879 Comm: a.out Not tainted 5.19.0-rc4-next-20220630-00001-gcc5218c8bd2c-dirty #7 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x1c0/0x2b0 print_address_description.constprop.0.cold+0xd4/0x484 print_report.cold+0x55/0x232 kasan_report+0xbf/0xf0 ntfs_ucsncmp+0x123/0x130 ntfs_are_names_equal.cold+0x2b/0x41 ntfs_attr_find+0x43b/0xb90 ntfs_attr_lookup+0x16d/0x1e0 ntfs_read_locked_attr_inode+0x4aa/0x2360 ntfs_attr_iget+0x1af/0x220 ntfs_read_locked_inode+0x246c/0x5120 ntfs_iget+0x132/0x180 load_system_files+0x1cc6/0x3480 ntfs_fill_super+0xa66/0x1cf0 mount_bdev+0x38d/0x460 legacy_get_tree+0x10d/0x220 vfs_get_tree+0x93/0x300 do_new_mount+0x2da/0x6d0 path_mount+0x496/0x19d0 __x64_sys_mount+0x284/0x300 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7f3f2118d9ea Code: 48 8b 0d a9 f4 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 76 f4 0b 00 f7 d8 64 89 01 48 RSP: 002b:00007ffc269deac8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3f2118d9ea RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffc269dec00 RBP: 00007ffc269dec80 R08: 00007ffc269deb00 R09: 00007ffc269dec44 R10: 0000000000000000 R11: 0000000000000202 R12: 000055f81ab1d220 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK> The buggy address belongs to the physical page: page:0000000085430378 refcount:1 mapcount:1 mapping:0000000000000000 index:0x555c6a81d pfn:0x751ac memcg:ffff888101f7e180 anon flags: 0xfffffc00a0014(uptodate|lru|mappedtodisk|swapbacked|node=0|zone=1|lastcpupid=0x1fffff) raw: 000fffffc00a0014 ffffea0001bf2988 ffffea0001de2448 ffff88801712e201 raw: 0000000555c6a81d 0000000000000000 0000000100000000 ffff888101f7e180 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8880751acd80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff8880751ace00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffff8880751ace80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ ffff8880751acf00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff8880751acf80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ================================================================== The reason is that struct ATTR_RECORD->name_offset is 6485, end address of name string is out of bounds. Fix this by adding sanity check on end address of attribute name string. [akpm@linux-foundation.org: coding-style cleanups] [chenxiaosong2@huawei.com: cleanup suggested by Hawkins Jiawei] Link: https://lkml.kernel.org/r/20220709064511.3304299-1-chenxiaosong2@huawei.com Link: https://lkml.kernel.org/r/20220707105329.4020708-1-chenxiaosong2@huawei.com Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com> Signed-off-by: Hawkins Jiawei <yin31149@gmail.com> Cc: Anton Altaparmakov <anton@tuxera.com> Cc: ChenXiaoSong <chenxiaosong2@huawei.com> Cc: Yongqiang Liu <liuyongqiang13@huawei.com> Cc: Zhang Yi <yi.zhang@huawei.com> Cc: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-16Merge tag 'for-5.19-rc7-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs reverts from David Sterba: "Due to a recent report [1] we need to revert the radix tree to xarray conversion patches. There's a problem with sleeping under spinlock, when xa_insert could allocate memory under pressure. We use GFP_NOFS so this is a real problem that we unfortunately did not discover during review. I'm sorry to do such change at rc6 time but the revert is IMO the safer option, there are patches to use mutex instead of the spin locks but that would need more testing. The revert branch has been tested on a few setups, all seem ok. The conversion to xarray will be revisited in the future" Link: https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ [1] * tag 'for-5.19-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Revert "btrfs: turn delayed_nodes_tree into an XArray" Revert "btrfs: turn name_cache radix tree into XArray in send_ctx" Revert "btrfs: turn fs_info member buffer_radix into XArray" Revert "btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray"
2022-07-15Merge tag 'ceph-for-5.19-rc7' of https://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fix from Ilya Dryomov: "A folio locking fixup that Xiubo and David cooperated on, marked for stable. Most of it is in netfs but I picked it up into ceph tree on agreement with David" * tag 'ceph-for-5.19-rc7' of https://github.com/ceph/ceph-client: netfs: do not unlock and put the folio twice
2022-07-15Revert "btrfs: turn delayed_nodes_tree into an XArray"David Sterba
This reverts commit 253bf57555e451dec5a7f09dc95d380ce8b10e5b. Revert the xarray conversion, there's a problem with potential sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS allocation. The radix tree used the preloading mechanism to avoid sleeping but this is not available in xarray. Conversion from spin lock to mutex is possible but at time of rc6 is riskier than a clean revert. [1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-15Revert "btrfs: turn name_cache radix tree into XArray in send_ctx"David Sterba
This reverts commit 4076942021fe14efecae33bf98566df6dd5ae6f7. Revert the xarray conversion, there's a problem with potential sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS allocation. The radix tree used the preloading mechanism to avoid sleeping but this is not available in xarray. Conversion from spin lock to mutex is possible but at time of rc6 is riskier than a clean revert. [1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-15Revert "btrfs: turn fs_info member buffer_radix into XArray"David Sterba
This reverts commit 8ee922689d67b7cfa6acbe2aa1ee76ac72e6fc8a. Revert the xarray conversion, there's a problem with potential sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS allocation. The radix tree used the preloading mechanism to avoid sleeping but this is not available in xarray. Conversion from spin lock to mutex is possible but at time of rc6 is riskier than a clean revert. [1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-15Revert "btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray"David Sterba
This reverts commit 48b36a602a335c184505346b5b37077840660634. Revert the xarray conversion, there's a problem with potential sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS allocation. The radix tree used the preloading mechanism to avoid sleeping but this is not available in xarray. Conversion from spin lock to mutex is possible but at time of rc6 is riskier than a clean revert. [1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-14Revert "vf/remap: return the amount of bytes actually deduplicated"Linus Torvalds
This reverts commit 4a57a8400075bc5287c5c877702c68aeae2a033d. Dave Chinner reports: "As I suspected would occur, this change causes test failures. e.g generic/517 in fstests fails with: generic/517 1s ... - output mismatch [..] -deduped 131172/131172 bytes at offset 65536 +deduped 131072/131172 bytes at offset 65536" can you please revert this commit for the 5.19 series to give us more time to investigate and consider the impact of the the API change on userspace applications before we commit to changing the API" That changed return value seems to reflect reality, but with the fstest change, let's revert for now. Requested-by: Dave Chinner <david@fromorbit.com> Link: https://lore.kernel.org/all/20220714223238.GH3600936@dread.disaster.area/ Cc: Ansgar Lößer <ansgar.loesser@tu-darmstadt.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-07-14Merge tag '5.19-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull cifs fixes from Steve French: "Three smb3 client fixes: - two multichannel fixes: fix a potential deadlock freeing a channel, and fix a race condition on failed creation of a new channel - mount failure fix: work around a server bug in some common older Samba servers by avoiding padding at the end of the negotiate protocol request" * tag '5.19-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb3: workaround negprot bug in some Samba servers cifs: remove unnecessary locking of chan_lock while freeing session cifs: fix race condition with delayed threads
2022-07-14Merge tag 'nfsd-5.19-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: "Notable regression fixes: - Enable SETATTR(time_create) to fix regression with Mac OS clients - Fix a lockd crasher and broken NLM UNLCK behavior" * tag 'nfsd-5.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: lockd: fix nlm_close_files lockd: set fl_owner when unlocking files NFSD: Decode NFSv4 birth time attribute
2022-07-14netfs: do not unlock and put the folio twiceXiubo Li
check_write_begin() will unlock and put the folio when return non-zero. So we should avoid unlocking and putting it twice in netfs layer. Change the way ->check_write_begin() works in the following two ways: (1) Pass it a pointer to the folio pointer, allowing it to unlock and put the folio prior to doing the stuff it wants to do, provided it clears the folio pointer. (2) Change the return values such that 0 with folio pointer set means continue, 0 with folio pointer cleared means re-get and all error codes indicating an error (no special treatment for -EAGAIN). [ bagasdotme: use Sphinx code text syntax for *foliop pointer ] Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/56423 Link: https://lore.kernel.org/r/cf169f43-8ee7-8697-25da-0204d1b4343e@redhat.com Co-developed-by: David Howells <dhowells@redhat.com> Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-07-13smb3: workaround negprot bug in some Samba serversSteve French
Mount can now fail to older Samba servers due to a server bug handling padding at the end of the last negotiate context (negotiate contexts typically are rounded up to 8 bytes by adding padding if needed). This server bug can be avoided by switching the order of negotiate contexts, placing a negotiate context at the end that does not require padding (prior to the recent netname context fix this was the case on the client). Fixes: 73130a7b1ac9 ("smb3: fix empty netname context on secondary channels") Reported-by: Julian Sikorski <belegdol@gmail.com> Tested-by: Julian Sikorski <belegdol+github@gmail.com> Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-07-13vf/remap: return the amount of bytes actually deduplicatedAnsgar Lößer
When using the FIDEDUPRANGE ioctl, in case of success the requested size is returned. In some cases this might not be the actual amount of bytes deduplicated. This change modifies vfs_dedupe_file_range() to report the actual amount of bytes deduplicated, instead of the requested amount. Link: https://lore.kernel.org/linux-fsdevel/5548ef63-62f9-4f46-5793-03165ceccacc@tu-darmstadt.de/ Reported-by: Ansgar Lößer <ansgar.loesser@kom.tu-darmstadt.de> Reported-by: Max Schlecht <max.schlecht@informatik.hu-berlin.de> Reported-by: Björn Scheuermann <scheuermann@kom.tu-darmstadt.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Darrick J Wong <djwong@kernel.org> Signed-off-by: Ansgar Lößer <ansgar.loesser@kom.tu-darmstadt.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-07-13fs/remap: constrain dedupe of EOF blocksDave Chinner
If dedupe of an EOF block is not constrainted to match against only other EOF blocks with the same EOF offset into the block, it can match against any other block that has the same matching initial bytes in it, even if the bytes beyond EOF in the source file do not match. Fix this by constraining the EOF block matching to only match against other EOF blocks that have identical EOF offsets and data. This allows "whole file dedupe" to continue to work without allowing eof blocks to randomly match against partial full blocks with the same data. Reported-by: Ansgar Lößer <ansgar.loesser@tu-darmstadt.de> Fixes: 1383a7ed6749 ("vfs: check file ranges before cloning files") Link: https://lore.kernel.org/linux-fsdevel/a7c93559-4ba1-df2f-7a85-55a143696405@tu-darmstadt.de/ Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-07-12Merge tag 'ovl-fixes-5.19-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs fix from Miklos Szeredi: "Add a temporary fix for posix acls on idmapped mounts introduced in this cycle. A proper fix will be added in the next cycle" * tag 'ovl-fixes-5.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: turn off SB_POSIXACL with idmapped layers temporarily
2022-07-12cifs: remove unnecessary locking of chan_lock while freeing sessionShyam Prasad N
In cifs_put_smb_ses, when we're freeing the last ref count to the session, we need to free up each channel. At this point, it is unnecessary to take chan_lock, since we have the last reference to the ses. Picking up this lock also introduced a deadlock because it calls cifs_put_tcp_ses, which locks cifs_tcp_ses_lock. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Acked-by: Enzo Matsumiya <ematsumiya@suse.de> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-07-12cifs: fix race condition with delayed threadsShyam Prasad N
On failure to create a new channel, first cancel the delayed threads, which could try to search for this channel, and not find it. The other option was to put the tcp session for the channel first, before decrementing chan_count. But that would leave a reference to the tcp session, when it has been freed already. So going with the former option and cancelling the delayed works first, before rolling back the channel. Fixes: aa45dadd34e4 ("cifs: change iface_list from array to sorted linked list") Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Acked-by: Enzo Matsumiya <ematsumiya@suse.de> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-07-11Merge tag 'for-5.19-rc6-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A more fixes that seem to me to be important enough to get merged before release: - in zoned mode, fix leak of a structure when reading zone info, this happens on normal path so this can be significant - in zoned mode, revert an optimization added in 5.19-rc1 to finish a zone when the capacity is full, but this is not reliable in all cases - try to avoid short reads for compressed data or inline files when it's a NOWAIT read, applications should handle that but there are two, qemu and mariadb, that are affected" * tag 'for-5.19-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: drop optimization of zone finish btrfs: zoned: fix a leaked bioc in read_zone_info btrfs: return -EAGAIN for NOWAIT dio reads/writes on compressed and inline extents
2022-07-11lockd: fix nlm_close_filesJeff Layton
This loop condition tries a bit too hard to be clever. Just test for the two indices we care about explicitly. Cc: J. Bruce Fields <bfields@fieldses.org> Fixes: 7f024fcd5c97 ("Keep read and write fds with each nlm_file") Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-11Merge tag 'mm-hotfixes-stable-2022-07-11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "Mainly MM fixes. About half for issues which were introduced after 5.18 and the remainder for longer-term issues" * tag 'mm-hotfixes-stable-2022-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: split huge PUD on wp_huge_pud fallback nilfs2: fix incorrect masking of permission flags for symlinks mm/rmap: fix dereferencing invalid subpage pointer in try_to_migrate_one() riscv/mm: fix build error while PAGE_TABLE_CHECK enabled without MMU Documentation: highmem: use literal block for code example in highmem.h comment mm: sparsemem: fix missing higher order allocation splitting mm/damon: use set_huge_pte_at() to make huge pte old sh: convert nommu io{re,un}map() to static inline functions mm: userfaultfd: fix UFFDIO_CONTINUE on fallocated shmem pages