path: root/kernel
Age  Commit message  Author
2024-07-15  Merge tag 'for-6.11/io_uring-20240714' of git://git.kernel.dk/linux  (Linus Torvalds)
Pull io_uring updates from Jens Axboe: "Here are the io_uring updates queued up for 6.11. Nothing major this time around, various minor improvements and cleanups/fixes. This contains: - Add bind/listen opcodes. Main motivation is to support direct descriptors, to avoid needing a regular fd just for doing these two operations (Gabriel) - Probe fixes (Gabriel) - Treat io-wq work flags as atomics. Not fixing a real issue, but may as well and it silences a KCSAN warning (me) - Cleanup of rsrc __set_current_state() usage (me) - Add 64-bit for {m,f}advise operations (me) - Improve performance of data ring messages (me) - Fix for ring message overflow posting (Pavel) - Fix for freezer interaction with TWA_NOTIFY_SIGNAL. Not strictly an io_uring thing, but since TWA_NOTIFY_SIGNAL was originally added for faster task_work signaling for io_uring, bundling it with this pull (Pavel) - Add Pavel as a co-maintainer - Various cleanups (me, Thorsten)" * tag 'for-6.11/io_uring-20240714' of git://git.kernel.dk/linux: (28 commits) io_uring/net: check socket is valid in io_bind()/io_listen() kernel: rerun task_work while freezing in get_signal() io_uring/io-wq: limit retrying worker initialisation io_uring/napi: Remove unnecessary s64 cast io_uring/net: cleanup io_recv_finish() bundle handling io_uring/msg_ring: fix overflow posting MAINTAINERS: change Pavel Begunkov from io_uring reviewer to maintainer io_uring/msg_ring: use kmem_cache_free() to free request io_uring/msg_ring: check for dead submitter task io_uring/msg_ring: add an alloc cache for io_kiocb entries io_uring/msg_ring: improve handling of target CQE posting io_uring: add io_add_aux_cqe() helper io_uring: add remote task_work execution helper io_uring/msg_ring: tighten requirement for remote posting io_uring: Allocate only necessary memory in io_probe io_uring: Fix probe of disabled operations io_uring: Introduce IORING_OP_LISTEN io_uring: Introduce IORING_OP_BIND net: Split a __sys_listen helper for io_uring net: Split a __sys_bind helper for io_uring ...
2024-07-15  trace/pid_list: Change gfp flags in pid_list_fill_irq()  (levi.yun)
pid_list_fill_irq() runs via irq_work. When CONFIG_PREEMPT_RT is disabled, it runs in irq context, so it must not sleep while allocating memory. Change the gfp flags from GFP_KERNEL to GFP_NOWAIT to prevent sleeping in irq_work. This change wouldn't impact functionality in practice because the worst-case allocation size is 2K. Cc: stable@goodmis.org Fixes: 8d6e90983ade2 ("tracing: Create a sparse bitmask for pid filtering") Link: https://lore.kernel.org/20240704150226.1359936-1-yeoreum.yun@arm.com Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: levi.yun <yeoreum.yun@arm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
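A minimal sketch of the kind of change described above, assuming a hypothetical irq_work callback that refills a scratch buffer; the function and variable names are illustrative, not the actual pid_list code:

    #include <linux/irq_work.h>
    #include <linux/slab.h>

    static void *scratch;   /* consumed elsewhere in this sketch */

    static void refill_scratch_irq(struct irq_work *iwork)
    {
        if (scratch)
            return;
        /* Hard-irq context: the allocation must not sleep, so GFP_NOWAIT
         * (fail fast) replaces GFP_KERNEL (may sleep and reclaim). */
        scratch = kmalloc(PAGE_SIZE, GFP_NOWAIT | __GFP_NOWARN);
    }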
2024-07-15  Merge tag 'vfs-6.11.misc'  (Linus Torvalds), pulled from:
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Support passing NULL along AT_EMPTY_PATH for statx(). NULL paths with any flag value other than AT_EMPTY_PATH go the usual route and end up with -EFAULT to retain compatibility (Rust is abusing calls of the sort to detect availability of statx) This avoids path lookup code, lockref management, memory allocation and in case of NULL path userspace memory access (which can be quite expensive with SMAP on x86_64) - Don't block i_writecount during exec. Remove the deny_write_access() mechanism for executables - Relax open_by_handle_at() permissions in specific cases where we can prove that the caller had sufficient privileges to open a file - Switch timespec64 fields in struct inode to discrete integers freeing up 4 bytes Fixes: - Fix false positive circular locking warning in hfsplus - Initialize hfs_inode_info after hfs_alloc_inode() in hfs - Avoid accidental overflows in vfs_fallocate() - Don't interrupt fallocate with EINTR in tmpfs to avoid constantly restarting shmem_fallocate() - Add missing quote in comment in fs/readdir Cleanups: - Don't assign and test in an if statement in mqueue. Move the assignment out of the if statement - Reflow the logic in may_create_in_sticky() - Remove the usage of the deprecated ida_simple_xx() API from procfs - Reject FSCONFIG_CMD_CREATE_EXCL requets that depend on the new mount api early - Rename variables in copy_tree() to make it easier to understand - Replace WARN(down_read_trylock, ...) abuse with proper asserts in various places in the VFS - Get rid of user_path_at_empty() and drop the empty argument from getname_flags() - Check for error while copying and no path in one branch in getname_flags() - Avoid redundant smp_mb() for THP handling in do_dentry_open() - Rename parent_ino to d_parent_ino and make it use RCU - Remove unused header include in fs/readdir - Export in_group_capable() helper and switch f2fs and fuse over to it instead of open-coding the logic in both places" * tag 'vfs-6.11.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits) ipc: mqueue: remove assignment from IS_ERR argument vfs: rename parent_ino to d_parent_ino and make it use RCU vfs: support statx(..., NULL, AT_EMPTY_PATH, ...) stat: use vfs_empty_path() helper fs: new helper vfs_empty_path() fs: reflow may_create_in_sticky() vfs: remove redundant smp_mb for thp handling in do_dentry_open fuse: Use in_group_or_capable() helper f2fs: Use in_group_or_capable() helper fs: Export in_group_or_capable() vfs: reorder checks in may_create_in_sticky hfs: fix to initialize fields of hfs_inode_info after hfs_alloc_inode() proc: Remove usage of the deprecated ida_simple_xx() API hfsplus: fix to avoid false alarm of circular locking Improve readability of copy_tree vfs: shave a branch in getname_flags vfs: retire user_path_at_empty and drop empty arg from getname_flags vfs: stop using user_path_at_empty in do_readlinkat tmpfs: don't interrupt fallocate with EINTR fs: don't block i_writecount during exec ...
2024-07-16  kbuild: remove PROVIDE() for kallsyms symbols  (Masahiro Yamada)
This reimplements commit 951bcae6c5a0 ("kallsyms: Avoid weak references for kallsyms symbols") because I am not a big fan of PROVIDE(). As an alternative solution, this commit prepends one more kallsyms step.
    KSYMS   .tmp_vmlinux.kallsyms0.S    # added
    AS      .tmp_vmlinux.kallsyms0.o    # added
    LD      .tmp_vmlinux.btf
    BTF     .btf.vmlinux.bin.o
    LD      .tmp_vmlinux.kallsyms1
    NM      .tmp_vmlinux.kallsyms1.syms
    KSYMS   .tmp_vmlinux.kallsyms1.S
    AS      .tmp_vmlinux.kallsyms1.o
    LD      .tmp_vmlinux.kallsyms2
    NM      .tmp_vmlinux.kallsyms2.syms
    KSYMS   .tmp_vmlinux.kallsyms2.S
    AS      .tmp_vmlinux.kallsyms2.o
    LD      vmlinux
Step 0 takes /dev/null as input, and generates .tmp_vmlinux.kallsyms0.o, which has a valid kallsyms format with the empty symbol list, and can be linked to vmlinux. Since it is really small, the added compile-time cost is negligible. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
2024-07-15  Merge branch 'slab/for-6.11/buckets' into slab/for-next  (Vlastimil Babka)
Merge all the slab patches previously collected on top of v6.10-rc1, over cleanups/fixes that had to be based on rc6.
2024-07-14  workqueue: Always queue work items to the newest PWQ for order workqueues  (Lai Jiangshan)
To ensure non-reentrancy, __queue_work() attempts to enqueue a work item to the pool of the currently executing worker. This is not only unnecessary for an ordered workqueue, where order inherently implies non-reentrancy, but it could also disrupt the sequence if the item is not enqueued on the newest PWQ. Just queue it to the newest PWQ and let order management guarantee non-reentrancy. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Fixes: 4c065dbce1e8 ("workqueue: Enable unbound cpumask update on ordered workqueues") Cc: stable@vger.kernel.org # v6.9+ Signed-off-by: Tejun Heo <tj@kernel.org> (cherry picked from commit 74347be3edfd11277799242766edf844c43dd5d3)
2024-07-14  Merge branch 'for-6.10-fixes' into for-6.11  (Tejun Heo)
2024-07-14  lockdep: lockdep_set_notrack_class()  (Kent Overstreet)
Add a new helper to disable lockdep tracking entirely for a given class. This is needed for bcachefs, which takes too many btree node locks for lockdep to track. Instead, we have a single lockdep_map for "btree_trans has any btree nodes locked", which makes more sense given that we have centralized lock management and a cycle detector. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
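A hedged sketch of how a caller might use the new helper; it is assumed here to work like lockdep_set_novalidate_class(), taking a pointer to the lock whose class should no longer be tracked, and the lock name is purely illustrative:

    #include <linux/lockdep.h>
    #include <linux/mutex.h>

    static DEFINE_MUTEX(btree_node_lock);   /* illustrative lock */

    static void setup_untracked_lock(void)
    {
        /* Tell lockdep to stop tracking acquisitions of this class entirely;
         * the subsystem is expected to provide its own cycle detection. */
        lockdep_set_notrack_class(&btree_node_lock);
    }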
2024-07-14  Merge tag 'sched_urgent_for_v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull scheduler fixes from Borislav Petkov:
 - Fix a performance regression when measuring the CPU time of a thread (clock_gettime(CLOCK_THREAD_CPUTIME_ID,...)) due to the addition of PSI IRQ time accounting in the hotpath
 - Fix a task_struct leak caused by failing to decrement the refcount when the task is enqueued before the timer that is supposed to do so expires
 - Revert an attempt to expedite detaching of movable tasks, as finding those could become very costly. Turns out the original issue wasn't even hit by anyone
* tag 'sched_urgent_for_v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Move psi_account_irqtime() out of update_rq_clock_task() hotpath
  sched/deadline: Fix task_struct reference leak
  Revert "sched/fair: Make sure to try to detach at least one movable task"
2024-07-13  Merge tag 'timers-v6.11-rc1' of https://git.linaro.org/people/daniel.lezcano/linux into timers/core  (Thomas Gleixner)
Pull clocksource/event driver updates from Daniel Lezcano:
 - Remove unnecessary local variable initialization on the ARM arch timer and the ARM global timer, as the variables are initialized in the code path right after anyway (Li kunyu)
 - Fix a race condition in the interrupt handler leading to a deadlock on the SH CMT driver. Note that this fix was not tested on the platform using this timer, but the fix seems reasonable enough to be picked confidently (Niklas Söderlund)
 - Increase the rating of the gic-timer and use the configured-width clocksource register on the MIPS architecture (Jiaxun Yang)
 - Add the DT bindings for the TMU on the Renesas platforms (Geert Uytterhoeven)
 - Add the DT bindings for the SOPHGO SG2002 clint on RiscV (Thomas Bonnefille)
 - Add the rtl-otto timer driver along with the DT bindings for the Realtek platform (Chris Packham)
Link: https://lore.kernel.org/all/91cd05de-4c5d-4242-a381-3b8a4fe6a2a2@linaro.org
2024-07-13  swiotlb: fix kernel-doc description for swiotlb_del_transient  (Yang Li)
Describe the pool argument in the kernel-doc comment for swiotlb_del_transient. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-07-12  Merge tag 'for-netdev'  (Jakub Kicinski), pulled from:
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-07-12 We've added 23 non-merge commits during the last 3 day(s) which contain a total of 18 files changed, 234 insertions(+), 243 deletions(-). The main changes are: 1) Improve BPF verifier by utilizing overflow.h helpers to check for overflows, from Shung-Hsi Yu. 2) Fix NULL pointer dereference in resolve_prog_type() for BPF_PROG_TYPE_EXT when attr->attach_prog_fd was not specified, from Tengda Wu. 3) Fix arm64 BPF JIT when generating code for BPF trampolines with BPF_TRAMP_F_CALL_ORIG which corrupted upper address bits, from Puranjay Mohan. 4) Remove test_run callback from lwt_seg6local_prog_ops which never worked in the first place and caused syzbot reports, from Sebastian Andrzej Siewior. 5) Relax BPF verifier to accept non-zero offset on KF_TRUSTED_ARGS/ /KF_RCU-typed BPF kfuncs, from Matt Bobrowski. 6) Fix a long standing bug in libbpf with regards to handling of BPF skeleton's forward and backward compatibility, from Andrii Nakryiko. 7) Annotate btf_{seq,snprintf}_show functions with __printf, from Alan Maguire. 8) BPF selftest improvements to reuse common network helpers in sk_lookup test and dropping the open-coded inetaddr_len() and make_socket() ones, from Geliang Tang. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (23 commits) selftests/bpf: Test for null-pointer-deref bugfix in resolve_prog_type() bpf: Fix null pointer dereference in resolve_prog_type() for BPF_PROG_TYPE_EXT selftests/bpf: DENYLIST.aarch64: Skip fexit_sleep again bpf: use check_sub_overflow() to check for subtraction overflows bpf: use check_add_overflow() to check for addition overflows bpf: fix overflow check in adjust_jmp_off() bpf: Eliminate remaining "make W=1" warnings in kernel/bpf/btf.o bpf: annotate BTF show functions with __printf bpf, arm64: Fix trampoline for BPF_TRAMP_F_CALL_ORIG selftests/bpf: Close obj in error path in xdp_adjust_tail selftests/bpf: Null checks for links in bpf_tcp_ca selftests/bpf: Use connect_fd_to_fd in sk_lookup selftests/bpf: Use start_server_addr in sk_lookup selftests/bpf: Use start_server_str in sk_lookup selftests/bpf: Close fd in error path in drop_on_reuseport selftests/bpf: Add ASSERT_OK_FD macro selftests/bpf: Add backlog for network_helper_opts selftests/bpf: fix compilation failure when CONFIG_NF_FLOW_TABLE=m bpf: Remove tst_run from lwt_seg6local_prog_ops. bpf: relax zero fixed offset constraint on KF_TRUSTED_ARGS/KF_RCU ... ==================== Link: https://patch.msgid.link/20240712212448.5378-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-12  tsacct: replace strncpy() with strscpy()  (Kees Cook)
Replace the deprecated[1] use of strncpy() in bacct_add_tsk(). Since this is UAPI, include trailing padding in the copy. Link: https://github.com/KSPP/linux/issues/90 [1] Link: https://lkml.kernel.org/r/20240711171308.work.995-kees@kernel.org Signed-off-by: Kees Cook <kees@kernel.org> Cc: "Dr. Thomas Orgis" <thomas.orgis@uni-hamburg.de> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Ismael Luceno <ismael@iodev.co.uk> Cc: Peng Liu <liupeng256@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
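A hedged sketch of the replacement pattern described above; the struct and field names are illustrative stand-ins rather than the exact tsacct code. strscpy_pad() both NUL-terminates and zero-fills the remainder of the destination, which matters when the buffer is copied out as UAPI:

    #include <linux/string.h>

    struct acct_record {            /* illustrative stand-in for the UAPI struct */
        char comm[16];
    };

    static void fill_comm(struct acct_record *rec, const char *task_comm)
    {
        /* Before (deprecated): strncpy(rec->comm, task_comm, sizeof(rec->comm));
         * After: bounded copy, guaranteed NUL termination, and the unused tail
         * of the buffer is zero-padded so no stale bytes leak to userspace. */
        strscpy_pad(rec->comm, task_comm, sizeof(rec->comm));
    }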
2024-07-12  mm: define __pte_leaf_size() to also take a PMD entry  (Christophe Leroy)
On powerpc 8xx, when a page is 8M in size, the information is in the PMD entry. So allow architectures to provide __pte_leaf_size() instead of pte_leaf_size() and provide the PMD entry to that function. When __pte_leaf_size() is not defined, define it as pte_leaf_size() so that architectures not interested in the PMD argument are not impacted. Only define a default pte_leaf_size() when __pte_leaf_size() is not defined, to make sure nobody adds new calls to pte_leaf_size() in the core. Link: https://lkml.kernel.org/r/c7c008f0a314bf8029ad7288fdc908db1ec7e449.1719928057.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
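A sketch of the fallback pattern described above, as it could appear in a generic header; the exact guard structure in the actual patch may differ slightly:

    /* Architectures that need the PMD entry (e.g. for 8M pages encoded at the
     * PMD level) define __pte_leaf_size(pmd, pte) themselves. Everyone else
     * falls back to the PTE-only helper. */
    #ifndef __pte_leaf_size
    #ifndef pte_leaf_size
    #define pte_leaf_size(pte)              PAGE_SIZE
    #endif
    #define __pte_leaf_size(pmd, pte)       pte_leaf_size(pte)
    #endif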
2024-07-12  cgroup/misc: Introduce misc.events.local  (Xiu Jianfeng)
Currently the event counting provided by misc.events is hierarchical; that is not practical if the user is only concerned with events of a specific cgroup. Therefore, introduce misc.events.local, which collects events specific to the given cgroup. This is analogous to memory.events.local and pids.events.local. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-12  bpf: use check_sub_overflow() to check for subtraction overflows  (Shung-Hsi Yu)
Similar to previous patch that drops signed_add*_overflows() and uses (compiler) builtin-based check_add_overflow(), do the same for signed_sub*_overflows() and replace them with the generic check_sub_overflow() to make future refactoring easier and have the checks implemented more efficiently. Unsigned overflow check for subtraction does not use helpers and are simple enough already, so they're left untouched. After the change GCC 13.3.0 generates cleaner assembly on x86_64: if (check_sub_overflow(*dst_smin, src_reg->smax_value, dst_smin) || 139bf: mov 0x28(%r12),%rax 139c4: mov %edx,0x54(%r12) 139c9: sub %r11,%rax 139cc: mov %rax,0x28(%r12) 139d1: jo 14627 <adjust_reg_min_max_vals+0x1237> check_sub_overflow(*dst_smax, src_reg->smin_value, dst_smax)) { 139d7: mov 0x30(%r12),%rax 139dc: sub %r9,%rax 139df: mov %rax,0x30(%r12) if (check_sub_overflow(*dst_smin, src_reg->smax_value, dst_smin) || 139e4: jo 14627 <adjust_reg_min_max_vals+0x1237> ... *dst_smin = S64_MIN; 14627: movabs $0x8000000000000000,%rax 14631: mov %rax,0x28(%r12) *dst_smax = S64_MAX; 14636: sub $0x1,%rax 1463a: mov %rax,0x30(%r12) Before the change it gives: if (signed_sub_overflows(dst_reg->smin_value, smax_val) || 13a50: mov 0x28(%r12),%rdi 13a55: mov %edx,0x54(%r12) dst_reg->smax_value = S64_MAX; 13a5a: movabs $0x7fffffffffffffff,%rdx 13a64: mov %eax,0x50(%r12) dst_reg->smin_value = S64_MIN; 13a69: movabs $0x8000000000000000,%rax s64 res = (s64)((u64)a - (u64)b); 13a73: mov %rdi,%rsi 13a76: sub %rcx,%rsi if (b < 0) 13a79: test %rcx,%rcx 13a7c: js 145ea <adjust_reg_min_max_vals+0x119a> if (signed_sub_overflows(dst_reg->smin_value, smax_val) || 13a82: cmp %rsi,%rdi 13a85: jl 13ac7 <adjust_reg_min_max_vals+0x677> signed_sub_overflows(dst_reg->smax_value, smin_val)) { 13a87: mov 0x30(%r12),%r8 s64 res = (s64)((u64)a - (u64)b); 13a8c: mov %r8,%rax 13a8f: sub %r9,%rax return res > a; 13a92: cmp %rax,%r8 13a95: setl %sil if (b < 0) 13a99: test %r9,%r9 13a9c: js 147d1 <adjust_reg_min_max_vals+0x1381> dst_reg->smax_value = S64_MAX; 13aa2: movabs $0x7fffffffffffffff,%rdx dst_reg->smin_value = S64_MIN; 13aac: movabs $0x8000000000000000,%rax if (signed_sub_overflows(dst_reg->smin_value, smax_val) || 13ab6: test %sil,%sil 13ab9: jne 13ac7 <adjust_reg_min_max_vals+0x677> dst_reg->smin_value -= smax_val; 13abb: mov %rdi,%rax dst_reg->smax_value -= smin_val; 13abe: mov %r8,%rdx dst_reg->smin_value -= smax_val; 13ac1: sub %rcx,%rax dst_reg->smax_value -= smin_val; 13ac4: sub %r9,%rdx 13ac7: mov %rax,0x28(%r12) ... 13ad1: mov %rdx,0x30(%r12) ... if (signed_sub_overflows(dst_reg->smin_value, smax_val) || 145ea: cmp %rsi,%rdi 145ed: jg 13ac7 <adjust_reg_min_max_vals+0x677> 145f3: jmp 13a87 <adjust_reg_min_max_vals+0x637> Suggested-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20240712080127.136608-4-shung-hsi.yu@suse.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
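A hedged sketch of the pattern this conversion produces, using the generic helpers from <linux/overflow.h>; the function and variable names are illustrative, not the exact verifier code:

    #include <linux/limits.h>
    #include <linux/overflow.h>
    #include <linux/types.h>

    /* Clamp the tracked signed range to [S64_MIN, S64_MAX] when the bounds
     * arithmetic overflows, mirroring what the verifier does for scalar ops. */
    static void sub_ranges(s64 *dst_smin, s64 *dst_smax, s64 src_smin, s64 src_smax)
    {
        if (check_sub_overflow(*dst_smin, src_smax, dst_smin) ||
            check_sub_overflow(*dst_smax, src_smin, dst_smax)) {
            *dst_smin = S64_MIN;
            *dst_smax = S64_MAX;
        }
    }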
2024-07-12  bpf: use check_add_overflow() to check for addition overflows  (Shung-Hsi Yu)
signed_add*_overflows() was added back when there was no overflow-check helper. With the introduction of such helpers in commit f0907827a8a91 ("compiler.h: enable builtin overflow checkers and add fallback code"), we can drop signed_add*_overflows() in kernel/bpf/verifier.c and use the generic check_add_overflow() instead. This will make future refactoring easier, and takes advantage of compiler-emitted hardware instructions that efficiently implement these checks. After the change GCC 13.3.0 generates cleaner assembly on x86_64: err = adjust_scalar_min_max_vals(env, insn, dst_reg, *src_reg); 13625: mov 0x28(%rbx),%r9 /* r9 = src_reg->smin_value */ 13629: mov 0x30(%rbx),%rcx /* rcx = src_reg->smax_value */ ... if (check_add_overflow(*dst_smin, src_reg->smin_value, dst_smin) || 141c1: mov %r9,%rax 141c4: add 0x28(%r12),%rax 141c9: mov %rax,0x28(%r12) 141ce: jo 146e4 <adjust_reg_min_max_vals+0x1294> check_add_overflow(*dst_smax, src_reg->smax_value, dst_smax)) { 141d4: add 0x30(%r12),%rcx 141d9: mov %rcx,0x30(%r12) if (check_add_overflow(*dst_smin, src_reg->smin_value, dst_smin) || 141de: jo 146e4 <adjust_reg_min_max_vals+0x1294> ... *dst_smin = S64_MIN; 146e4: movabs $0x8000000000000000,%rax 146ee: mov %rax,0x28(%r12) *dst_smax = S64_MAX; 146f3: sub $0x1,%rax 146f7: mov %rax,0x30(%r12) Before the change it gives: s64 smin_val = src_reg->smin_value; 675: mov 0x28(%rsi),%r8 s64 smax_val = src_reg->smax_value; u64 umin_val = src_reg->umin_value; u64 umax_val = src_reg->umax_value; 679: mov %rdi,%rax /* rax = dst_reg */ if (signed_add_overflows(dst_reg->smin_value, smin_val) || 67c: mov 0x28(%rdi),%rdi /* rdi = dst_reg->smin_value */ u64 umin_val = src_reg->umin_value; 680: mov 0x38(%rsi),%rdx u64 umax_val = src_reg->umax_value; 684: mov 0x40(%rsi),%rcx s64 res = (s64)((u64)a + (u64)b); 688: lea (%r8,%rdi,1),%r9 /* r9 = dst_reg->smin_value + src_reg->smin_value */ return res < a; 68c: cmp %r9,%rdi 68f: setg %r10b /* r10b = (dst_reg->smin_value + src_reg->smin_value) > dst_reg->smin_value */ if (b < 0) 693: test %r8,%r8 696: js 72b <scalar_min_max_add+0xbb> signed_add_overflows(dst_reg->smax_value, smax_val)) { dst_reg->smin_value = S64_MIN; dst_reg->smax_value = S64_MAX; 69c: movabs $0x7fffffffffffffff,%rdi s64 smax_val = src_reg->smax_value; 6a6: mov 0x30(%rsi),%r8 dst_reg->smin_value = S64_MIN; 6aa: 00 00 00 movabs $0x8000000000000000,%rsi if (signed_add_overflows(dst_reg->smin_value, smin_val) || 6b4: test %r10b,%r10b /* (dst_reg->smin_value + src_reg->smin_value) > dst_reg->smin_value ? goto 6cb */ 6b7: jne 6cb <scalar_min_max_add+0x5b> signed_add_overflows(dst_reg->smax_value, smax_val)) { 6b9: mov 0x30(%rax),%r10 /* r10 = dst_reg->smax_value */ s64 res = (s64)((u64)a + (u64)b); 6bd: lea (%r10,%r8,1),%r11 /* r11 = dst_reg->smax_value + src_reg->smax_value */ if (b < 0) 6c1: test %r8,%r8 6c4: js 71e <scalar_min_max_add+0xae> if (signed_add_overflows(dst_reg->smin_value, smin_val) || 6c6: cmp %r11,%r10 /* (dst_reg->smax_value + src_reg->smax_value) <= dst_reg->smax_value ? goto 723 */ 6c9: jle 723 <scalar_min_max_add+0xb3> } else { dst_reg->smin_value += smin_val; dst_reg->smax_value += smax_val; } 6cb: mov %rsi,0x28(%rax) ... 6d5: mov %rdi,0x30(%rax) ... 
if (signed_add_overflows(dst_reg->smin_value, smin_val) || 71e: cmp %r11,%r10 721: jl 6cb <scalar_min_max_add+0x5b> dst_reg->smin_value += smin_val; 723: mov %r9,%rsi dst_reg->smax_value += smax_val; 726: mov %r11,%rdi 729: jmp 6cb <scalar_min_max_add+0x5b> return res > a; 72b: cmp %r9,%rdi 72e: setl %r10b 732: jmp 69c <scalar_min_max_add+0x2c> 737: nopw 0x0(%rax,%rax,1) Note: unlike adjust_ptr_min_max_vals() and scalar*_min_max_add(), it is necessary to introduce intermediate variable in adjust_jmp_off() to keep the functional behavior unchanged. Without an intermediate variable imm/off will be altered even on overflow. Suggested-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20240712080127.136608-3-shung-hsi.yu@suse.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-07-12  bpf: fix overflow check in adjust_jmp_off()  (Shung-Hsi Yu)
adjust_jmp_off() incorrectly used the insn->imm field for all overflow checks, which is wrong because that should only be done for the BPF_JMP32 | BPF_JA case, not for the general jump instruction case. Fix it by using insn->off for the overflow check in the general case. Fixes: 5337ac4c9b80 ("bpf: Fix the corner case with may_goto and jump to the 1st insn.") Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20240712080127.136608-2-shung-hsi.yu@suse.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-07-12  bpf: Eliminate remaining "make W=1" warnings in kernel/bpf/btf.o  (Alan Maguire)
As reported by Mirsad [1] we still see format warnings in kernel/bpf/btf.o at W=1 warning level: CC kernel/bpf/btf.o ./kernel/bpf/btf.c: In function ‘btf_type_seq_show_flags’: ./kernel/bpf/btf.c:7553:21: warning: assignment left-hand side might be a candidate for a format attribute [-Wsuggest-attribute=format] 7553 | sseq.showfn = btf_seq_show; | ^ ./kernel/bpf/btf.c: In function ‘btf_type_snprintf_show’: ./kernel/bpf/btf.c:7604:31: warning: assignment left-hand side might be a candidate for a format attribute [-Wsuggest-attribute=format] 7604 | ssnprintf.show.showfn = btf_snprintf_show; | ^ Combined with CONFIG_WERROR=y these can halt the build. The fix (annotating the structure field with __printf()) suggested by Mirsad resolves these. Apologies I missed this last time. No other W=1 warnings were observed in kernel/bpf after this fix. [1] https://lore.kernel.org/bpf/92c9d047-f058-400c-9c7d-81d4dc1ef71b@gmail.com/ Fixes: b3470da314fd ("bpf: annotate BTF show functions with __printf") Reported-by: Mirsad Todorovac <mtodorovac69@gmail.com> Suggested-by: Mirsad Todorovac <mtodorovac69@gmail.com> Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20240712092859.1390960-1-alan.maguire@oracle.com
2024-07-11  workqueue: Rename wq_update_pod() to unbound_wq_update_pwq()  (Lai Jiangshan)
What wq_update_pod() does is just to update the pwq of the specific cpu. Rename it and update the comments. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-11  workqueue: Remove the arguments @hotplug_cpu and @online from wq_update_pod()  (Lai Jiangshan)
The arguments @hotplug_cpu and @online are not used in wq_update_pod() since the functions called by wq_update_pod() don't need them. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-11  workqueue: Remove the argument @cpu_going_down from wq_calc_pod_cpumask()  (Lai Jiangshan)
wq_calc_pod_cpumask() uses wq_online_cpumask, which excludes the cpu going down, so the argument cpu_going_down is unused and can be removed. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-11  workqueue: Remove the unneeded cpumask empty check in wq_calc_pod_cpumask()  (Lai Jiangshan)
The cpumask empty check in wq_calc_pod_cpumask() has long been useless. It works purely as documentation, stating that the cpumask cannot be empty after the function returns. Now the code above it is even more explicit that the cpumask is not empty, so the documentation-only empty check can be removed. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-11  workqueue: Remove cpus_read_lock() from apply_wqattrs_lock()  (Lai Jiangshan)
1726a1713590 ("workqueue: Put PWQ allocation and WQ enlistment in the same lock C.S.") led to the following possible deadlock:
    WARNING: possible recursive locking detected
    6.10.0-rc5-00004-g1d4c6111406c #1 Not tainted
    --------------------------------------------
    swapper/0/1 is trying to acquire lock:
    c27760f4 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue (kernel/workqueue.c:5152 kernel/workqueue.c:5730)
    but task is already holding lock:
    c27760f4 (cpu_hotplug_lock){++++}-{0:0}, at: padata_alloc (kernel/padata.c:1007)
    ...
    stack backtrace:
    ...
    cpus_read_lock (include/linux/percpu-rwsem.h:53 kernel/cpu.c:488)
    alloc_workqueue (kernel/workqueue.c:5152 kernel/workqueue.c:5730)
    padata_alloc (kernel/padata.c:1007 (discriminator 1))
    pcrypt_init_padata (crypto/pcrypt.c:327 (discriminator 1))
    pcrypt_init (crypto/pcrypt.c:353)
    do_one_initcall (init/main.c:1267)
    do_initcalls (init/main.c:1328 (discriminator 1) init/main.c:1345 (discriminator 1))
    kernel_init_freeable (init/main.c:1364)
    kernel_init (init/main.c:1469)
    ret_from_fork (arch/x86/kernel/process.c:153)
    ret_from_fork_asm (arch/x86/entry/entry_32.S:737)
    entry_INT80_32 (arch/x86/entry/entry_32.S:944)
This is caused by pcrypt allocating a workqueue while holding cpus_read_lock(), so the workqueue code can't take it again, as that can lead to deadlocks if a down_write starts after the first down_read. The pwq creations and installations have been reworked based on wq_online_cpumask rather than cpu_online_mask, making cpus_read_lock() unneeded during wqattrs changes. Fix the deadlock by removing cpus_read_lock() from apply_wqattrs_lock(). tj: Updated changelog. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Fixes: 1726a1713590 ("workqueue: Put PWQ allocation and WQ enlistment in the same lock C.S.") Link: http://lkml.kernel.org/r/202407081521.83b627c1-lkp@intel.com Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-11  workqueue: Simplify wq_calc_pod_cpumask() with wq_online_cpumask  (Lai Jiangshan)
Avoid relying on cpu_online_mask for wqattrs changes so that cpus_read_lock() can be removed from apply_wqattrs_lock(). And with wq_online_cpumask, attrs->__pod_cpumask no longer needs to be reused as temporary storage to calculate whether the pod has any online CPUs @attrs wants, since @cpu_going_down is not in the wq_online_cpumask. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-11  workqueue: Add wq_online_cpumask  (Lai Jiangshan)
The new wq_online_cpumask mirrors the cpu_online_mask except during hotplugging; specifically, it differs between the hotplugging stages of workqueue_offline_cpu() and workqueue_online_cpu(), during which the transitioning CPU is not represented in the mask. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-11  bpf: annotate BTF show functions with __printf  (Alan Maguire)
-Werror=suggest-attribute=format warns about two functions in kernel/bpf/btf.c [1]; add __printf() annotations to silence these warnings since for CONFIG_WERROR=y they will trigger build failures. [1] https://lore.kernel.org/bpf/a8b20c72-6631-4404-9e1f-0410642d7d20@gmail.com/ Fixes: 31d0bc81637d ("bpf: Move to generic BTF show support, apply it to seq files/strings") Reported-by: Mirsad Todorovac <mtodorovac69@gmail.com> Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Tested-by: Mirsad Todorovac <mtodorovac69@yahoo.com> Link: https://lore.kernel.org/r/20240711182321.963667-1-alan.maguire@oracle.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
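A hedged sketch of what this kind of __printf() annotation fix looks like in general; the struct below is illustrative and not the actual btf_show definition:

    #include <linux/kernel.h>

    /* Annotating the function-pointer field tells the compiler the callback
     * takes a printf-style format string, so -Wsuggest-attribute=format (and
     * format-string checking) is satisfied at every assignment and call site. */
    struct show_ctx {
        __printf(2, 3) void (*showfn)(struct show_ctx *ctx, const char *fmt, ...);
    };

    static __printf(2, 3) void my_show(struct show_ctx *ctx, const char *fmt, ...)
    {
        /* ... vsnprintf() into a buffer, omitted in this sketch ... */
    }

    static void install_show(struct show_ctx *ctx)
    {
        ctx->showfn = my_show; /* no format-attribute warning anymore */
    }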
2024-07-11  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)
Cross-merge networking fixes after downstream PR. Conflicts: net/sched/act_ct.c 26488172b029 ("net/sched: Fix UAF when resolving a clash") 3abbd7ed8b76 ("act_ct: prepare for stolen verdict coming from conntrack and nat engine") No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-11  tick/broadcast: Make takeover of broadcast hrtimer reliable  (Yu Liao)
Running the LTP hotplug stress test on a aarch64 machine results in rcu_sched stall warnings when the broadcast hrtimer was owned by the un-plugged CPU. The issue is the following: CPU1 (owns the broadcast hrtimer) CPU2 tick_broadcast_enter() // shutdown local timer device broadcast_shutdown_local() ... tick_broadcast_exit() clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT) // timer device is not programmed cpumask_set_cpu(cpu, tick_broadcast_force_mask) initiates offlining of CPU1 take_cpu_down() /* * CPU1 shuts down and does not * send broadcast IPI anymore */ takedown_cpu() hotplug_cpu__broadcast_tick_pull() // move broadcast hrtimer to this CPU clockevents_program_event() bc_set_next() hrtimer_start() /* * timer device is not programmed * because only the first expiring * timer will trigger clockevent * device reprogramming */ What happens is that CPU2 exits broadcast mode with force bit set, then the local timer device is not reprogrammed and CPU2 expects to receive the expired event by the broadcast IPI. But this does not happen because CPU1 is offlined by CPU2. CPU switches the clockevent device to ONESHOT state, but does not reprogram the device. The subsequent reprogramming of the hrtimer broadcast device does not program the clockevent device of CPU2 either because the pending expiry time is already in the past and the CPU expects the event to be delivered. As a consequence all CPUs which wait for a broadcast event to be delivered are stuck forever. Fix this issue by reprogramming the local timer device if the broadcast force bit of the CPU is set so that the broadcast hrtimer is delivered. [ tglx: Massage comment and change log. Add Fixes tag ] Fixes: 989dcb645ca7 ("tick: Handle broadcast wakeup of multiple cpus") Signed-off-by: Yu Liao <liaoyu15@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240711124843.64167-1-liaoyu15@huawei.com
2024-07-11  x86/xen: remove deprecated xen_nopvspin boot parameter  (Juergen Gross)
The xen_nopvspin boot parameter has been deprecated since 2019; nopvspin can be used instead. Remove the xen_nopvspin boot parameter and replace the xen_pvspin variable use cases with nopvspin. This requires moving the nopvspin variable out of the .initdata section, as it needs to be accessed for cpuhotplug, too. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Message-ID: <20240710110139.22300-1-jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2024-07-11  Merge branch 'sched/urgent' into sched/core, to pick up fixes and refresh the branch  (Ingo Molnar)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-07-11  kernel: rerun task_work while freezing in get_signal()  (Pavel Begunkov)
io_uring can asynchronously add a task_work while the task is being frozen. TIF_NOTIFY_SIGNAL will prevent the task from sleeping in do_freezer_trap(), and since get_signal()'s relock loop doesn't retry task_work, the task will spin there, unable to sleep, until the freezing is cancelled / the task is killed / etc. Run task_works in the freezer path. Keep the patch small and simple so it can be easily backported, but we might need to do some cleaning up afterwards and look at whether there are other places with similar problems. Cc: stable@vger.kernel.org Link: https://github.com/systemd/systemd/issues/33626 Fixes: 12db8b690010c ("entry: Add support for TIF_NOTIFY_SIGNAL") Reported-by: Julian Orth <ju.orth@gmail.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/89ed3a52933370deaaf61a0a620a6ac91f1e754d.1720634146.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-10  bpf: Defer work in bpf_timer_cancel_and_free  (Kumar Kartikeya Dwivedi)
Currently, the same case as in the previous patch (two timer callbacks trying to cancel each other) can be invoked through bpf_map_update_elem as well, or more precisely, when freeing map elements containing timers. Since this relies on hrtimer_cancel as well, it is prone to the same deadlock situation as the previous patch. It would be sufficient to use hrtimer_try_to_cancel to fix this problem, as the timer cannot be enqueued after async_cancel_and_free. Once async_cancel_and_free has been done, the timer must be reinitialized before it can be armed again. The callback running in parallel trying to arm the timer will fail, and freeing bpf_hrtimer without waiting is sufficient (given kfree_rcu), and bpf_timer_cb will return HRTIMER_NORESTART, preventing the timer from being rearmed again. However, there exists a UAF scenario where the callback arms the timer before entering this function and cancellation fails (because a timer callback is invoking this routine, or the target timer callback is running concurrently). In such a case, if the timer expiration is significantly far in the future, the RCU grace period may expire before it, freeing the bpf_hrtimer state and along with it the struct hrtimer that is enqueued. Hence, it is clear cancellation needs to occur after async_cancel_and_free, and yet it cannot be done inline due to deadlock issues. We thus modify bpf_timer_cancel_and_free to defer work to the global workqueue, adding a work_struct alongside rcu_head (both used at _different_ points of time, so they can share space). Update existing code comments to reflect the new state of affairs. Fixes: b00628b1c7d5 ("bpf: Introduce bpf timers.") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20240709185440.1104957-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-07-10  bpf: Fail bpf_timer_cancel when callback is being cancelled  (Kumar Kartikeya Dwivedi)
Given a schedule: timer1 cb timer2 cb bpf_timer_cancel(timer2); bpf_timer_cancel(timer1); Both bpf_timer_cancel calls would wait for the other callback to finish executing, introducing a lockup. Add an atomic_t count named 'cancelling' in bpf_hrtimer. This keeps track of all in-flight cancellation requests for a given BPF timer. Whenever cancelling a BPF timer, we must check if we have outstanding cancellation requests, and if so, we must fail the operation with an error (-EDEADLK) since cancellation is synchronous and waits for the callback to finish executing. This implies that we can enter a deadlock situation involving two or more timer callbacks executing in parallel and attempting to cancel one another. Note that we avoid incrementing the cancelling counter for the target timer (the one being cancelled) if bpf_timer_cancel is not invoked from a callback, to avoid spurious errors. The whole point of detecting cur->cancelling and returning -EDEADLK is to not enter a busy wait loop (which may or may not lead to a lockup). This does not apply in case the caller is in a non-callback context, the other side can continue to cancel as it sees fit without running into errors. Background on prior attempts: Earlier versions of this patch used a bool 'cancelling' bit and used the following pattern under timer->lock to publish cancellation status. lock(t->lock); t->cancelling = true; mb(); if (cur->cancelling) return -EDEADLK; unlock(t->lock); hrtimer_cancel(t->timer); t->cancelling = false; The store outside the critical section could overwrite a parallel requests t->cancelling assignment to true, to ensure the parallely executing callback observes its cancellation status. It would be necessary to clear this cancelling bit once hrtimer_cancel is done, but lack of serialization introduced races. Another option was explored where bpf_timer_start would clear the bit when (re)starting the timer under timer->lock. This would ensure serialized access to the cancelling bit, but may allow it to be cleared before in-flight hrtimer_cancel has finished executing, such that lockups can occur again. Thus, we choose an atomic counter to keep track of all outstanding cancellation requests and use it to prevent lockups in case callbacks attempt to cancel each other while executing in parallel. Reported-by: Dohyun Kim <dohyunkim@google.com> Reported-by: Neel Natu <neelnatu@google.com> Fixes: b00628b1c7d5 ("bpf: Introduce bpf timers.") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20240709185440.1104957-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
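A hedged sketch of the cancellation-tracking scheme described above; the structure and function are simplified stand-ins, not the actual kernel/bpf/helpers.c code:

    #include <linux/atomic.h>
    #include <linux/errno.h>
    #include <linux/hrtimer.h>
    #include <linux/types.h>

    struct timer_state {                /* simplified stand-in for bpf_hrtimer */
        struct hrtimer timer;
        atomic_t cancelling;            /* in-flight cancellation requests */
    };

    /* @cur is the timer whose callback is currently executing (NULL when the
     * cancel does not come from a timer callback); @target is being cancelled. */
    static int cancel_timer(struct timer_state *cur, struct timer_state *target)
    {
        bool inc = false;
        int ret;

        if (cur) {
            /* Record the in-flight cancellation on the target so that its
             * callback, if it tries to cancel us in turn, sees it and bails. */
            atomic_inc(&target->cancelling);
            inc = true;
            /* The real code also orders this increment against the read below.
             * If someone is already cancelling *us*, waiting for the target's
             * callback to finish could mean waiting on ourselves: fail fast. */
            if (atomic_read(&cur->cancelling)) {
                ret = -EDEADLK;
                goto out;
            }
        }

        ret = hrtimer_cancel(&target->timer);
    out:
        if (inc)
            atomic_dec(&target->cancelling);
        return ret;
    }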
2024-07-10  bpf: fix order of args in call to bpf_map_kvcalloc  (Mohammad Shehar Yaar Tausif)
The original function call passed the size of smap->bucket before the number of buckets, which raises the 'calloc-transposed-args' error on compilation. Vlastimil Babka added: The order of parameters can be traced back all the way to 6ac99e8f23d4 ("bpf: Introduce bpf sk local storage") across several refactorings, and that's why the commit is used as a Fixes: tag. In v6.10-rc1, a different commit 2c321f3f70bc ("mm: change inlined allocation helpers to account at the call site") however exposed the order of args in a way that gcc-14 has enough visibility to start warning about it, because (in the !CONFIG_MEMCG case) bpf_map_kvcalloc is then a macro alias for kvcalloc instead of a static inline wrapper. To sum up, the warning happens when the following conditions are all met:
 - gcc-14 is used (didn't see it with gcc-13)
 - commit 2c321f3f70bc is present
 - CONFIG_MEMCG is not enabled in .config
 - CONFIG_WERROR turns this from a compiler warning to error
Fixes: 6ac99e8f23d4 ("bpf: Introduce bpf sk local storage") Reviewed-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Christian Kujau <lists@nerdbynature.de> Signed-off-by: Mohammad Shehar Yaar Tausif <sheharyaar48@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/20240710100521.15061-2-vbabka@suse.cz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
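As a rough illustration of the calloc-style argument convention the fix restores, using plain kvcalloc() with illustrative types (bpf_map_kvcalloc() follows the same count-then-size order):

    #include <linux/slab.h>

    struct bucket {
        void *head;
    };

    static struct bucket *alloc_buckets(size_t nbuckets)
    {
        /* calloc-style allocators take (count, element size), in that order;
         * transposing the two is what gcc-14 reports as
         * -Wcalloc-transposed-args. */
        return kvcalloc(nbuckets, sizeof(struct bucket), GFP_KERNEL);
    }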
2024-07-10  smp: Add missing destroy_work_on_stack() call in smp_call_on_cpu()  (Zqiang)
For CONFIG_DEBUG_OBJECTS_WORK=y kernels sscs.work defined by INIT_WORK_ONSTACK() is initialized by debug_object_init_on_stack() for the debug check in __init_work() to work correctly. But this lacks the counterpart to remove the tracked object from debug objects again, which will cause a debug object warning once the stack is freed. Add the missing destroy_work_on_stack() invocation to cure that. [ tglx: Massaged changelog ] Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20240704065213.13559-1-qiang.zhang1211@gmail.com
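A hedged sketch of the pairing the fix restores, using a generic on-stack work item rather than the actual smp_call_on_cpu() code; with CONFIG_DEBUG_OBJECTS_WORK=y, every INIT_WORK_ONSTACK() needs a matching destroy_work_on_stack() before the stack frame goes away:

    #include <linux/completion.h>
    #include <linux/workqueue.h>

    struct stack_work {
        struct work_struct work;
        struct completion done;
    };

    static void stack_work_fn(struct work_struct *work)
    {
        struct stack_work *sw = container_of(work, struct stack_work, work);

        complete(&sw->done);
    }

    static void run_work_on_stack(void)
    {
        struct stack_work sw;

        INIT_WORK_ONSTACK(&sw.work, stack_work_fn);
        init_completion(&sw.done);
        schedule_work(&sw.work);
        wait_for_completion(&sw.done);
        /* Without this, debug objects keeps tracking &sw.work after the
         * function returns and warns once the stack memory is reused. */
        destroy_work_on_stack(&sw.work);
    }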
2024-07-10  kernel/fork.c: put set_max_threads()/task_struct_whitelist() in __init section  (Wei Yang)
The functions set_max_threads() and task_struct_whitelist() are only used by fork_init() during bootup. Let's add __init tag to them. Link: https://lkml.kernel.org/r/20240701013410.17260-2-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10  kernel/fork.c: get totalram_pages from memblock to calculate max_threads  (Wei Yang)
Since we plan to move the accounting into __free_pages_core(), totalram_pages may not represent the total usable pages on system at this point when defer_init is enabled. Instead we can get the total usable pages from memblock directly. Link: https://lkml.kernel.org/r/20240701013410.17260-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10  mm: remove CONFIG_MEMCG_KMEM  (Johannes Weiner)
CONFIG_MEMCG_KMEM used to be a user-visible option for whether slab tracking is enabled. It has been default-enabled and equivalent to CONFIG_MEMCG for almost a decade. We've only grown more kernel memory accounting sites since, and there is no imaginable cgroup usecase going forward that wants to track user pages but not the multitude of user-drivable kernel allocations. Link: https://lkml.kernel.org/r/20240701153148.452230-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10  clone3: drop __ARCH_WANT_SYS_CLONE3 macro  (Arnd Bergmann)
When clone3() was introduced, it was not obvious how each architecture deals with setting up the stack and keeping the register contents in a fork()-like system call, so this was left for the architecture maintainers to implement, with __ARCH_WANT_SYS_CLONE3 defined by those that already implement it. Five years later, we still have a few architectures left that are missing clone3(), and the macro keeps getting in the way as it's fundamentally different from all the other __ARCH_WANT_SYS_* macros that are meant to provide backwards-compatibility with applications using older syscalls that are no longer provided by default. Address this by reversing the polarity of the macro, adding an __ARCH_BROKEN_SYS_CLONE3 macro to all architectures that don't already provide the syscall, and remove __ARCH_WANT_SYS_CLONE3 from all the other ones. Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
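A rough sketch of the macro polarity change described above; the WIRE_UP_CLONE3_* names are placeholders for the declarations the real headers guard, not actual kernel macros:

    /* Old, opt-in scheme: only architectures defining the macro got clone3(). */
    #ifdef __ARCH_WANT_SYS_CLONE3
    # define WIRE_UP_CLONE3_OLD 1
    #endif

    /* New, opt-out scheme: every architecture gets clone3() unless it marks
     * itself as not supporting it yet. */
    #ifndef __ARCH_BROKEN_SYS_CLONE3
    # define WIRE_UP_CLONE3_NEW 1
    #endif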
2024-07-10  swiotlb: reduce swiotlb pool lookups  (Michael Kelley)
With CONFIG_SWIOTLB_DYNAMIC enabled, each round-trip map/unmap pair in the swiotlb results in 6 calls to swiotlb_find_pool(). In multiple places, the pool is found and used in one function, and then must be found again in the next function that is called because only the tlb_addr is passed as an argument. These are the six call sites: dma_direct_map_page: 1. swiotlb_map -> swiotlb_tbl_map_single -> swiotlb_bounce dma_direct_unmap_page: 2. dma_direct_sync_single_for_cpu -> is_swiotlb_buffer 3. dma_direct_sync_single_for_cpu -> swiotlb_sync_single_for_cpu -> swiotlb_bounce 4. is_swiotlb_buffer 5. swiotlb_tbl_unmap_single -> swiotlb_del_transient 6. swiotlb_tbl_unmap_single -> swiotlb_release_slots Reduce the number of calls by finding the pool at a higher level, and passing it as an argument instead of searching again. A key change is for is_swiotlb_buffer() to return a pool pointer instead of a boolean, and then pass this pool pointer to subsequent swiotlb functions. There are 9 occurrences of is_swiotlb_buffer() used to test if a buffer is a swiotlb buffer before calling a swiotlb function. To reduce code duplication in getting the pool pointer and passing it as an argument, introduce inline wrappers for this pattern. The generated code is essentially unchanged. Since is_swiotlb_buffer() no longer returns a boolean, rename some functions to reflect the change: * swiotlb_find_pool() becomes __swiotlb_find_pool() * is_swiotlb_buffer() becomes swiotlb_find_pool() * is_xen_swiotlb_buffer() becomes xen_swiotlb_find_pool() With these changes, a round-trip map/unmap pair requires only 2 pool lookups (listed using the new names and wrappers): dma_direct_unmap_page: 1. dma_direct_sync_single_for_cpu -> swiotlb_find_pool 2. swiotlb_tbl_unmap_single -> swiotlb_find_pool These changes come from noticing the inefficiencies in a code review, not from performance measurements. With CONFIG_SWIOTLB_DYNAMIC, __swiotlb_find_pool() is not trivial, and it uses an RCU read lock, so avoiding the redundant calls helps performance in a hot path. When CONFIG_SWIOTLB_DYNAMIC is *not* set, the code size reduction is minimal and the perf benefits are likely negligible, but no harm is done. No functional change is intended. Signed-off-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Petr Tesarik <petr@tesarici.cz> Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-07-09  bpf: relax zero fixed offset constraint on KF_TRUSTED_ARGS/KF_RCU  (Matt Bobrowski)
Currently, BPF kfuncs which accept trusted pointer arguments i.e. those flagged as KF_TRUSTED_ARGS, KF_RCU, or KF_RELEASE, all require an original/unmodified trusted pointer argument to be supplied to them. By original/unmodified, it means that the backing register holding the trusted pointer argument that is to be supplied to the BPF kfunc must have its fixed offset set to zero, or else the BPF verifier will outright reject the BPF program load. However, this zero fixed offset constraint that is currently enforced by the BPF verifier onto BPF kfuncs specifically flagged to accept KF_TRUSTED_ARGS or KF_RCU trusted pointer arguments is rather unnecessary, and can limit their usability in practice. Specifically, it completely eliminates the possibility of constructing a derived trusted pointer from an original trusted pointer. To put it simply, a derived pointer is a pointer which points to one of the nested member fields of the object being pointed to by the original trusted pointer. This patch relaxes the zero fixed offset constraint that is enforced upon BPF kfuncs which specifically accept KF_TRUSTED_ARGS, or KF_RCU arguments. Although, the zero fixed offset constraint technically also applies to BPF kfuncs accepting KF_RELEASE arguments, relaxing this constraint for such BPF kfuncs has subtle and unwanted side-effects. This was discovered by experimenting a little further with an initial version of this patch series [0]. The primary issue with relaxing the zero fixed offset constraint on BPF kfuncs accepting KF_RELEASE arguments is that it'd would open up the opportunity for BPF programs to supply both trusted pointers and derived trusted pointers to them. For KF_RELEASE BPF kfuncs specifically, this could be problematic as resources associated with the backing pointer could be released by the backing BPF kfunc and cause instabilities for the rest of the kernel. With this new fixed offset semantic in-place for BPF kfuncs accepting KF_TRUSTED_ARGS and KF_RCU arguments, we now have more flexibility when it comes to the BPF kfuncs that we're able to introduce moving forward. Early discussions covering the possibility of relaxing the zero fixed offset constraint can be found using the link below. This will provide more context on where all this has stemmed from [1]. Notably, pre-existing tests have been updated such that they provide coverage for the updated zero fixed offset functionality. Specifically, the nested offset test was converted from a negative to positive test as it was already designed to assert zero fixed offset semantics of a KF_TRUSTED_ARGS BPF kfunc. [0] https://lore.kernel.org/bpf/ZnA9ndnXKtHOuYMe@google.com/ [1] https://lore.kernel.org/bpf/ZhkbrM55MKQ0KeIV@google.com/ Signed-off-by: Matt Bobrowski <mattbobrowski@google.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20240709210939.1544011-1-mattbobrowski@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
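A hedged illustration of what a "derived trusted pointer" looks like from a BPF program's point of view; the kfunc name below is hypothetical, and the only point is that &p->se carries a non-zero fixed offset from the trusted task pointer:

    // SPDX-License-Identifier: GPL-2.0
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    char LICENSE[] SEC("license") = "GPL";

    /* Hypothetical kfunc standing in for any KF_TRUSTED_ARGS/KF_RCU kfunc;
     * it is not a real kernel symbol. */
    extern void hypothetical_kfunc_taking_se(struct sched_entity *se) __ksym;

    SEC("tp_btf/sched_process_exec")
    int BPF_PROG(on_exec, struct task_struct *p)
    {
        /* &p->se is a derived trusted pointer: it points at a member embedded
         * inside the trusted task_struct, i.e. it has a non-zero fixed offset.
         * With the relaxed check the verifier can now accept it. */
        hypothetical_kfunc_taking_se(&p->se);
        return 0;
    }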
2024-07-10  tracing/kprobes: Fix build error when find_module() is not available  (Masami Hiramatsu (Google))
The kernel test robot reported that find_module() is not available if CONFIG_MODULES=n. Fix this error by hiding find_module() inside #ifdef CONFIG_MODULES together with the related rcu locks, as try_module_get_by_name() does. Link: https://lore.kernel.org/all/172056819167.201571.250053007194508038.stgit@devnote2/ Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202407070744.RcLkn8sq-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202407070917.VVUCBlaS-lkp@intel.com/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-07-09  Merge tag 'for-netdev'  (Paolo Abeni), pulled from:
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-07-08 The following pull-request contains BPF updates for your *net-next* tree. We've added 102 non-merge commits during the last 28 day(s) which contain a total of 127 files changed, 4606 insertions(+), 980 deletions(-). The main changes are: 1) Support resilient split BTF which cuts down on duplication and makes BTF as compact as possible wrt BTF from modules, from Alan Maguire & Eduard Zingerman. 2) Add support for dumping kfunc prototypes from BTF which enables both detecting as well as dumping compilable prototypes for kfuncs, from Daniel Xu. 3) Batch of s390x BPF JIT improvements to add support for BPF arena and to implement support for BPF exceptions, from Ilya Leoshkevich. 4) Batch of riscv64 BPF JIT improvements in particular to add 12-argument support for BPF trampolines and to utilize bpf_prog_pack for the latter, from Pu Lehui. 5) Extend BPF test infrastructure to add a CHECKSUM_COMPLETE validation option for skbs and add coverage along with it, from Vadim Fedorenko. 6) Inline bpf_get_current_task/_btf() helpers in the arm64 BPF JIT which gives a small 1% performance improvement in micro-benchmarks, from Puranjay Mohan. 7) Extend the BPF verifier to track the delta between linked registers in order to better deal with recent LLVM code optimizations, from Alexei Starovoitov. 8) Fix bpf_wq_set_callback_impl() kfunc signature where the third argument should have been a pointer to the map value, from Benjamin Tissoires. 9) Extend BPF selftests to add regular expression support for test output matching and adjust some of the selftest when compiled under gcc, from Cupertino Miranda. 10) Simplify task_file_seq_get_next() and remove an unnecessary loop which always iterates exactly once anyway, from Dan Carpenter. 11) Add the capability to offload the netfilter flowtable in XDP layer through kfuncs, from Florian Westphal & Lorenzo Bianconi. 12) Various cleanups in networking helpers in BPF selftests to shave off a few lines of open-coded functions on client/server handling, from Geliang Tang. 13) Properly propagate prog->aux->tail_call_reachable out of BPF verifier, so that x86 JIT does not need to implement detection, from Leon Hwang. 14) Fix BPF verifier to add a missing check_func_arg_reg_off() to prevent an out-of-bounds memory access for dynpointers, from Matt Bobrowski. 15) Fix bpf_session_cookie() kfunc to return __u64 instead of long pointer as it might lead to problems on 32-bit archs, from Jiri Olsa. 16) Enhance traffic validation and dynamic batch size support in xsk selftests, from Tushar Vyavahare. 
bpf-next-for-netdev * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (102 commits) selftests/bpf: DENYLIST.aarch64: Remove fexit_sleep selftests/bpf: amend for wrong bpf_wq_set_callback_impl signature bpf: helpers: fix bpf_wq_set_callback_impl signature libbpf: Add NULL checks to bpf_object__{prev_map,next_map} selftests/bpf: Remove exceptions tests from DENYLIST.s390x s390/bpf: Implement exceptions s390/bpf: Change seen_reg to a mask bpf: Remove unnecessary loop in task_file_seq_get_next() riscv, bpf: Optimize stack usage of trampoline bpf, devmap: Add .map_alloc_check selftests/bpf: Remove arena tests from DENYLIST.s390x selftests/bpf: Add UAF tests for arena atomics selftests/bpf: Introduce __arena_global s390/bpf: Support arena atomics s390/bpf: Enable arena s390/bpf: Support address space cast instruction s390/bpf: Support BPF_PROBE_MEM32 s390/bpf: Land on the next JITed instruction after exception s390/bpf: Introduce pre- and post- probe functions s390/bpf: Get rid of get_probe_mem_regno() ... ==================== Link: https://patch.msgid.link/20240708221438.10974-1-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-09  perf: Split __perf_pending_irq() out of perf_pending_irq()  (Sebastian Andrzej Siewior)
perf_pending_irq() invokes perf_event_wakeup() and __perf_pending_irq(). The former is in charge of waking any tasks which wait to be woken up, while the latter disables perf events. perf_pending_irq() is an irq_work, but on PREEMPT_RT its callback is invoked in thread context. This is needed because all the waking functions (wake_up_all(), kill_fasync()) acquire sleeping locks which must not be used with disabled interrupts. Disabling events, as done by __perf_pending_irq(), expects a hardirq context and disabled interrupts. This requirement is not fulfilled on PREEMPT_RT. Split the functionality based on perf_event::pending_disable into an irq_work named `pending_disable_irq' and invoke it in hardirq context on PREEMPT_RT. Rename the split out callback to perf_pending_disable(). Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marco Elver <elver@google.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/20240704170424.1466941-8-bigeasy@linutronix.de
2024-07-09  perf: Don't disable preemption in perf_pending_task().  (Sebastian Andrzej Siewior)
perf_pending_task() is invoked in task context and disables preemption because perf_swevent_get_recursion_context() used to access per-CPU variables. The other reason is to create an RCU read section while accessing the perf_event. The recursion counter is no longer a per-CPU counter, so disabling preemption is no longer required. The RCU section is needed and must be created explicitly. Replace the preemption-disable section with an explicit RCU read section. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marco Elver <elver@google.com> Link: https://lore.kernel.org/r/20240704170424.1466941-7-bigeasy@linutronix.de
2024-07-09  perf: Move swevent_htable::recursion into task_struct.  (Sebastian Andrzej Siewior)
The swevent_htable::recursion counter is used to avoid creating an swevent while an event is processed to avoid recursion. The counter is per-CPU and preemption must be disabled to have a stable counter. perf_pending_task() disables preemption to access the counter and then signal. This is problematic on PREEMPT_RT because sending a signal uses a spinlock_t which must not be acquired in atomic on PREEMPT_RT because it becomes a sleeping lock. The atomic context can be avoided by moving the counter into the task_struct. There is a 4 byte hole between futex_state (usually always on) and the following perf pointer (perf_event_ctxp). After the recursion lost some weight it fits perfectly. Move swevent_htable::recursion into task_struct. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marco Elver <elver@google.com> Link: https://lore.kernel.org/r/20240704170424.1466941-6-bigeasy@linutronix.de
2024-07-09  perf: Shrink the size of the recursion counter.  (Sebastian Andrzej Siewior)
There are four recursion counters, one for each context. The type of each counter is `int', but the counter is used as a `bool' since it is only incremented if zero. The main goal here is to shrink the whole struct into a 32-bit int which can later be added to task_struct in an existing hole. Reduce the type of the recursion counter to an unsigned char and keep the increment/decrement operations. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marco Elver <elver@google.com> Link: https://lore.kernel.org/r/20240704170424.1466941-5-bigeasy@linutronix.de
2024-07-09  perf: Enqueue SIGTRAP always via task_work.  (Sebastian Andrzej Siewior)
A signal is delivered by raising irq_work(), which works from any context including NMI. irq_work() can be delayed if the architecture does not provide an interrupt vector. In order not to lose a signal, the signal is injected via task_work during event_sched_out(). Instead of going via irq_work, the signal can be added directly via task_work. The signal is sent to current and can be enqueued on its return path to userland. Queue the signal via task_work and consider possible NMI context. Remove perf_event::pending_sigtrap and use perf_event::pending_work instead. Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marco Elver <elver@google.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/20240704170424.1466941-4-bigeasy@linutronix.de
2024-07-09  task_work: Add TWA_NMI_CURRENT as an additional notify mode.  (Sebastian Andrzej Siewior)
Adding task_work from NMI context requires the following:
 - The kasan_record_aux_stack() is not NMI safe and must be avoided.
 - Using TWA_RESUME is NMI safe. If the NMI occurs while the CPU is in userland then it will continue in userland and not invoke the `work' callback.
Add TWA_NMI_CURRENT as an additional notify mode. In this mode, skip kasan and use irq_work in hardirq mode for the needed interrupt. Set TIF_NOTIFY_RESUME within the irq_work callback, because the k[ac]san instrumentation in test_and_set_bit() does not look NMI safe in case of a report. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240704170424.1466941-3-bigeasy@linutronix.de
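A hedged sketch of how the new notify mode would be used from an NMI handler; the work item and function names are illustrative, and error handling is reduced to a warning:

    #include <linux/sched.h>
    #include <linux/task_work.h>

    static void deferred_fn(struct callback_head *head)
    {
        /* Runs in task context once 'current' heads back to user space,
         * where taking sleeping locks and sending signals is fine. */
    }

    static struct callback_head deferred_work;  /* illustrative one-shot item */

    /* Called from an NMI handler: defer the heavy lifting to task context. */
    static void defer_from_nmi(void)
    {
        init_task_work(&deferred_work, deferred_fn);
        /* TWA_NMI_CURRENT: NMI-safe queueing for the current task only; an
         * irq_work raised in hardirq mode makes sure the task actually passes
         * through the return-to-user path soon after the NMI. */
        WARN_ON_ONCE(task_work_add(current, &deferred_work, TWA_NMI_CURRENT));
    }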