path: root/kernel
Age | Commit message | Author
2023-05-23 | tracing: Rename stacktrace field to common_stacktrace | Steven Rostedt (Google)
The histogram and synthetic events can use a pseudo event called "stacktrace" that will create a stacktrace at the time of the event and use it just like it was a normal field. We have other pseudo events such as "common_cpu" and "common_timestamp". To stay consistent with that, convert "stacktrace" to "common_stacktrace". As this was used in older kernels, to keep backward compatibility, this will act just like "common_cpu" did with "cpu". That is, "cpu" will be the same as "common_cpu" unless the event has a "cpu" field. In which case, the event's field is used. The same is true with "stacktrace". Also update the documentation to reflect this change. Link: https://lore.kernel.org/linux-trace-kernel/20230523230913.6860e28d@rorschach.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-05-23 | tracing/histograms: Allow variables to have some modifiers | Steven Rostedt (Google)
Modifiers are used to change the behavior of keys. For instance, they can be grouped into buckets, converted to syscall names (from the syscall identifier), show task->comm of the current pid, be an array of longs that represent a stacktrace, and more. It was found that nothing stopped a value from taking a modifier, even though values are simple counters. If this happened, it would call code that was not expecting a modifier and crash the kernel. This was fixed by having the __create_val_field() function test if a modifier was present and fail if one was. This fixed the crash. Now there's a problem with variables. Variables are used to pass fields from one event to another. Variables are allowed to have some modifiers, as the processing may need to happen at the time of the event (like stacktraces and comm names of the current pid). The issue is that it too uses __create_val_field(). Now that this fails on modifiers, variables can no longer use them (this is a regression). As not all modifiers are for variables, have them use a separate check. Link: https://lore.kernel.org/linux-trace-kernel/20230523221108.064a5d82@rorschach.local.home Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Fixes: e0213434fe3e4 ("tracing: Do not let histogram values have some modifiers") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-05-23 | tracing/user_events: Document user_event_mm one-shot list usage | Beau Belgrave
During 6.4 development it became clear that the one-shot list used by the user_event_mm's next field was confusing to others. It is not clear how this list is protected or what the next field usage is for unless you are familiar with the code. Add comments into the user_event_mm struct indicating lock requirement and usage. Also document how and why this approach was used via comments in both user_event_enabler_update() and user_event_mm_get_all() and the rules to properly use it. Link: https://lkml.kernel.org/r/20230519230741.669-5-beaub@linux.microsoft.com Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wicngggxVpbnrYHjRTwGE0WYscPRM+L2HO2BF8ia1EXgQ@mail.gmail.com/ Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-05-23 | tracing/user_events: Rename link fields for clarity | Beau Belgrave
Currently most list_head fields of various structs within user_events are simply named link. This causes folks to keep additional context in their head when working with the code, which can be confusing. Instead of using link, describe what the actual link is, for example: list_del_rcu(&mm->link); Changes into: list_del_rcu(&mm->mms_link); The reader now is given a hint the link is to the mms global list instead of having to remember or spot check within the code. Link: https://lkml.kernel.org/r/20230519230741.669-4-beaub@linux.microsoft.com Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wicngggxVpbnrYHjRTwGE0WYscPRM+L2HO2BF8ia1EXgQ@mail.gmail.com/ Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-05-23 | tracing/user_events: Remove RCU lock while pinning pages | Linus Torvalds
pin_user_pages_remote() can reschedule which means we cannot hold any RCU lock while using it. Now that enablers are not exposed out to the tracing register callbacks during fork(), there is clearly no need to require the RCU lock as event_mutex is enough to protect changes. Remove unneeded RCU usages when pinning pages and walking enablers with event_mutex held. Cleanup a misleading "safe" list walk that is not needed. During fork() duplication, remove unneeded RCU list add, since the list is not exposed yet. Link: https://lkml.kernel.org/r/20230519230741.669-3-beaub@linux.microsoft.com Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wiiBfT4zNS29jA0XEsy8EmbqTH1hAPdRJCDAJMD8Gxt5A@mail.gmail.com/ Fixes: 7235759084a4 ("tracing/user_events: Use remote writes for event enablement") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [ change log written by Beau Belgrave ] Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-05-23 | tracing/user_events: Split up mm alloc and attach | Linus Torvalds
When a new mm is being created in a fork() path it currently is allocated and then attached in one go. This leaves the mm exposed out to the tracing register callbacks while any parent enabler locations are copied in. This should not happen. Split up mm alloc and attach as unique operations. When duplicating enablers, first alloc, then duplicate, and only upon success, attach. This prevents any timing window outside of the event_reg mutex for enablement walking. This allows for dropping RCU requirement for enablement walking in later patches. Link: https://lkml.kernel.org/r/20230519230741.669-2-beaub@linux.microsoft.com Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=whTBvXJuoi_kACo3qi5WZUmRrhyA-_=rRFsycTytmB6qw@mail.gmail.com/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [ change log written by Beau Belgrave ] Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-05-23 | bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands | Andrii Nakryiko
Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall forces users to specify pinning location as a string-based absolute or relative (to current working directory) path. This has various implications related to security (e.g., symlink-based attacks), forces BPF FS to be exposed in the file system, which can cause races with other applications. One of the feedbacks we got from folks working with containers heavily was that inability to use purely FD-based location specification was an unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET commands. This patch closes this oversight, adding path_fd field to BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by *at() syscalls for dirfd + pathname combinations. This now allows interesting possibilities like working with detached BPF FS mount (e.g., to perform multiple pinnings without running a risk of someone interfering with them), and generally making pinning/getting more secure and not prone to any races and/or security attacks. This is demonstrated by a selftest added in subsequent patch that takes advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate creating detached BPF FS mount, pinning, and then getting BPF map out of it, all while never exposing this private instance of BPF FS to outside worlds. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/bpf/20230523170013.728457-4-andrii@kernel.org
2023-05-23 | cpu/hotplug: Fix off by one in cpuhp_bringup_mask() | Thomas Gleixner
cpuhp_bringup_mask() iterates over a cpumask and starts all present CPUs up to a caller provided upper limit. The limit variable is decremented and checked for 0 before invoking cpu_up(), which is obviously off by one and prevents the bringup of the last CPU when the limit is equal to the number of present CPUs. Move the decrement and check after the cpu_up() invocation. Fixes: 18415f33e2ac ("cpu/hotplug: Allow "parallel" bringup up to CPUHP_BP_KICK_AP_STATE") Reported-by: Mark Brown <broonie@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/87wn10ufj9.ffs@tglx
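The ordering problem described above can be sketched in user space. This is only an illustration, not the kernel code: `sketch_cpu_up()` and both loop helpers are hypothetical names, and the "CPUs" are just a counter. It shows why checking the limit before the bringup loses the last CPU when the limit equals the number of present CPUs.

```c
#include <assert.h>

/* Hypothetical stand-in for cpu_up(): only counts the bringup. */
static void sketch_cpu_up(unsigned int *started)
{
	(*started)++;
}

/* Old ordering: decrement and test the limit before bringing the CPU up. */
static unsigned int bringup_check_first(unsigned int ncpus, unsigned int limit)
{
	unsigned int started = 0;

	assert(limit > 0);
	for (unsigned int cpu = 0; cpu < ncpus; cpu++) {
		if (!--limit)
			break;		/* fires one iteration too early */
		sketch_cpu_up(&started);
	}
	return started;
}

/* Fixed ordering: bring the CPU up, then decrement and test the limit. */
static unsigned int bringup_check_last(unsigned int ncpus, unsigned int limit)
{
	unsigned int started = 0;

	assert(limit > 0);
	for (unsigned int cpu = 0; cpu < ncpus; cpu++) {
		sketch_cpu_up(&started);
		if (!--limit)
			break;
	}
	return started;
}
```

With four present CPUs and a limit of four, the first helper starts three CPUs and the second starts all four.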
2023-05-23 | tracing/timerlat: Always wakeup the timerlat thread | Daniel Bristot de Oliveira
While testing rtla timerlat auto analysis, I reached a condition where the interface was not receiving tracing data. I was able to manually reproduce the problem with these steps:

  # echo 0 > tracing_on                 # disable trace
  # echo 1 > osnoise/stop_tracing_us    # stop trace if timerlat irq > 1 us
  # echo timerlat > current_tracer      # enable timerlat tracer
  # sleep 1                             # wait... that is the time when rtla
                                        # apply configs like prio or cgroup
  # echo 1 > tracing_on                 # start tracing
  # cat trace
  # tracer: timerlat
  #
  #                                _-----=> irqs-off
  #                               / _----=> need-resched
  #                              | / _---=> hardirq/softirq
  #                              || / _--=> preempt-depth
  #                              ||| / _-=> migrate-disable
  #                              |||| /     delay
  #                              |||||            ACTIVATION
  #           TASK-PID      CPU# |||||   TIMESTAMP   ID            CONTEXT                 LATENCY
  #              | |         |   |||||      |         |                  |                       |
        NOTHING!

Then, trying to enable tracing again with echo 1 > tracing_on resulted in no change: the trace was still not tracing.

This problem happens because the timerlat IRQ hits the stop tracing condition while tracing is off, and does not wake up the timerlat thread, so the timerlat threads are kept sleeping forever, resulting in no trace, even after re-enabling the tracer.

Avoid this condition by always waking up the threads, even after stopping tracing, allowing the tracer to return to its normal operation after a new tracing on.

Link: https://lore.kernel.org/linux-trace-kernel/1ed8f830638b20a39d535d27d908e319a9a3c4e2.1683822622.git.bristot@kernel.org
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: stable@vger.kernel.org
Fixes: a955d7eac177 ("trace: Add timerlat tracer")
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-05-23 | bpf: Validate BPF object in BPF_OBJ_PIN before calling LSM | Andrii Nakryiko
Do a sanity check whether provided file-to-be-pinned is actually a BPF object (prog, map, btf) before calling security_path_mknod LSM hook. If it's not, LSM hook doesn't have to be triggered, as the operation has no chance of succeeding anyways. Suggested-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/bpf/20230522232917.2454595-2-andrii@kernel.org
2023-05-23 | tracing/user_events: Use long vs int for atomic bit ops | Beau Belgrave
Each event stores an int to track which bit to set/clear when enablement changes. On big endian 64-bit configurations, it's possible this could cause memory corruption when it's used for atomic bit operations. Use unsigned long for enablement values to ensure any possible corruption cannot occur. Downcast to int after mask for the bit target. Link: https://lore.kernel.org/all/6f758683-4e5e-41c3-9b05-9efc703e827c@kili.mountain/ Link: https://lore.kernel.org/linux-trace-kernel/20230505205855.6407-1-beaub@linux.microsoft.com Fixes: dcb8177c1395 ("tracing/user_events: Add ioctl for disabling addresses") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-05-22 | cgroup: always put cset in cgroup_css_set_put_fork | John Sperbeck
A successful call to cgroup_css_set_fork() will always have taken a ref on kargs->cset (regardless of CLONE_INTO_CGROUP), so always do a corresponding put in cgroup_css_set_put_fork().

Without this, a cset and its contained css structures will be leaked for some fork failures. The following script reproduces the leak for a fork failure due to exceeding pids.max in the pids controller. A similar thing can happen if we jump to the bad_fork_cancel_cgroup label in copy_process().

  [ -z "$1" ] && echo "Usage $0 pids-root" && exit 1
  PID_ROOT=$1
  CGROUP=$PID_ROOT/foo

  [ -e $CGROUP ] && rmdir $CGROUP
  mkdir $CGROUP
  echo 5 > $CGROUP/pids.max
  echo $$ > $CGROUP/cgroup.procs

  fork_bomb()
  {
      set -e
      for i in $(seq 10); do
          /bin/sleep 3600 &
      done
  }

  (fork_bomb) &
  wait
  echo $$ > $PID_ROOT/cgroup.procs
  kill $(cat $CGROUP/cgroup.procs)
  rmdir $CGROUP

Fixes: ef2c41cf38a7 ("clone3: allow spawning processes into cgroups")
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: John Sperbeck <jsperbeck@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-22 | module: Fix use-after-free bug in read_file_mod_stats() | Harshit Mogalapalli
Smatch warns: kernel/module/stats.c:394 read_file_mod_stats() warn: passing freed memory 'buf' We are passing 'buf' to simple_read_from_buffer() after freeing it. Fix this by changing the order of 'simple_read_from_buffer' and 'kfree'. Fixes: df3e764d8e5c ("module: add debug stats to help identify memory pressure") Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-05-22 | cgroup: Replace all non-returning strlcpy with strscpy | Azeem Shaikh
strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). No return values were used, so direct replacement is safe. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [2] https://github.com/KSPP/linux/issues/89 Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tejun Heo <tj@kernel.org>
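To make the difference concrete, here is a user-space sketch of the bounded-copy behaviour the commit relies on. `sketch_strscpy()` is a hypothetical name and a simplified byte-at-a-time model, not the kernel implementation: it never reads more than `size` bytes from the source, always NUL-terminates a non-empty destination, and reports truncation with -E2BIG instead of returning the source length.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>	/* ssize_t */

/*
 * Copy at most `size` bytes (including the terminating NUL) from src to dst.
 * Returns the number of characters copied, or -E2BIG if src did not fit.
 */
static ssize_t sketch_strscpy(char *dst, const char *src, size_t size)
{
	size_t n = 0;

	assert(dst && src);
	while (n < size) {
		char c = src[n];	/* never reads beyond src[size - 1] */

		dst[n] = c;
		if (c == '\0')
			return (ssize_t)n;
		n++;
	}
	/* Ran out of room without seeing a NUL: force termination. */
	if (n)
		dst[n - 1] = '\0';
	return -E2BIG;
}
```

A caller that ignored the strlcpy() return value, as in the replaced call sites, needs no other change; a caller that wants to detect truncation can test for a negative return.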
2023-05-22 | cgroup: fix missing cpus_read_{lock,unlock}() in cgroup_transfer_tasks() | Qi Zheng
The commit 4f7e7236435c ("cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock") fixed the deadlock between cgroup_threadgroup_rwsem and cpus_read_lock() by introducing cgroup_attach_{lock,unlock}() and removing cpus_read_{lock,unlock}() from cpuset_attach(). But cgroup_transfer_tasks() was missed and not handled, which will cause the following warning:

 WARNING: CPU: 0 PID: 589 at kernel/cpu.c:526 lockdep_assert_cpus_held+0x32/0x40
 CPU: 0 PID: 589 Comm: kworker/1:4 Not tainted 6.4.0-rc2-next-20230517 #50
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
 Workqueue: events cpuset_hotplug_workfn
 RIP: 0010:lockdep_assert_cpus_held+0x32/0x40
 <...>
 Call Trace:
  <TASK>
  cpuset_attach+0x40/0x240
  cgroup_migrate_execute+0x452/0x5e0
  ? _raw_spin_unlock_irq+0x28/0x40
  cgroup_transfer_tasks+0x1f3/0x360
  ? find_held_lock+0x32/0x90
  ? cpuset_hotplug_workfn+0xc81/0xed0
  cpuset_hotplug_workfn+0xcb1/0xed0
  ? process_one_work+0x248/0x5b0
  process_one_work+0x2b9/0x5b0
  worker_thread+0x56/0x3b0
  ? process_one_work+0x5b0/0x5b0
  kthread+0xf1/0x120
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x1f/0x30
  </TASK>

So just use the cgroup_attach_{lock,unlock}() helper to fix it.

Reported-by: Zhao Gongyi <zhaogongyi@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Fixes: 05c7b7a92cc8 ("cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug")
Cc: stable@vger.kernel.org # v5.17+
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-22 | capability: fix kernel-doc warnings in capability.c | Gaosheng Cui
Fix all kernel-doc warnings in capability.c: kernel/capability.c:477: warning: Function parameter or member 'idmap' not described in 'privileged_wrt_inode_uidgid' kernel/capability.c:493: warning: Function parameter or member 'idmap' not described in 'capable_wrt_inode_uidgid' Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2023-05-22 | bpf: fix a memory leak in the LRU and LRU_PERCPU hash maps | Anton Protopopov
The LRU and LRU_PERCPU maps allocate a new element on update before locking the target hash table bucket. Right after that the maps try to lock the bucket. If this fails, then maps return -EBUSY to the caller without releasing the allocated element. This makes the element untracked: it doesn't belong to either of free lists, and it doesn't belong to the hash table, so can't be re-used; this eventually leads to the permanent -ENOMEM on LRU map updates, which is unexpected. Fix this by returning the element to the local free list if bucket locking fails. Fixes: 20b6cc34ea74 ("bpf: Avoid hashtab deadlock with map_locked") Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Link: https://lore.kernel.org/r/20230522154558.2166815-1-aspsk@isovalent.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
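The leak pattern can be modelled with a toy free list. Everything here is hypothetical (the `sketch_*` names, the `bucket_busy` flag standing in for a failed bucket lock, the `return_on_busy` switch selecting old versus fixed behaviour); it is a sketch of the described control flow under those assumptions, not the BPF hashtab code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_elem {
	struct sketch_elem *next;
};

struct sketch_map {
	struct sketch_elem *free_list;	/* elements available for updates */
	bool bucket_busy;		/* stand-in for a failed bucket lock */
};

static void sketch_push_free(struct sketch_map *m, struct sketch_elem *e)
{
	assert(m && e);
	e->next = m->free_list;
	m->free_list = e;
}

static struct sketch_elem *sketch_pop_free(struct sketch_map *m)
{
	struct sketch_elem *e = m->free_list;

	if (e)
		m->free_list = e->next;
	return e;
}

static size_t sketch_free_count(const struct sketch_map *m)
{
	size_t n = 0;

	for (const struct sketch_elem *e = m->free_list; e; e = e->next)
		n++;
	return n;
}

/* The element is allocated first, then the bucket lock is attempted. */
static int sketch_update(struct sketch_map *m, bool return_on_busy)
{
	struct sketch_elem *e = sketch_pop_free(m);

	if (!e)
		return -ENOMEM;
	if (m->bucket_busy) {
		/* Fixed behaviour: hand the element back before bailing out. */
		if (return_on_busy)
			sketch_push_free(m, e);
		return -EBUSY;
	}
	/* A real map would link e into the hash table here. */
	return 0;
}
```

With `return_on_busy` false, every -EBUSY consumes one element until updates start failing with -ENOMEM; with it true, the free list stays intact.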
2023-05-20 | cgroup/cpuset: remove unneeded header files | Miaohe Lin
Remove some unnecessary header files. No functional change intended. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-20 | sched/psi: Avoid resetting the min update period when it is unnecessary | Yang Yang
Psi_group's poll_min_period is determined by the minimum window size of psi_trigger when creating new triggers. While destroying a psi_trigger, there is no need to reset poll_min_period if the psi_trigger being destroyed did not have the minimum window size, since in this condition poll_min_period will remain the same as before. Signed-off-by: Yang Yang <yang.yang29@zte.com.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Suren Baghdasaryan <surenb@google.com> Link: https://lkml.kernel.org/r/20230514163338.834345-1-surenb@google.com
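The shortcut can be sketched as follows. The helper names and the `SKETCH_UPDATES_PER_WINDOW` divisor are assumptions made for the illustration; the point is only that the remaining triggers need rescanning when the destroyed trigger was the one defining the current minimum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed for the sketch: how many polls happen per trigger window. */
#define SKETCH_UPDATES_PER_WINDOW 10

static uint64_t sketch_period(uint64_t win_size)
{
	return win_size / SKETCH_UPDATES_PER_WINDOW;
}

/*
 * Return the minimum poll period after a trigger with window
 * `destroyed_win` is removed. `remaining_wins` holds the windows of the
 * triggers that are left; `*rescanned` reports whether a scan was needed.
 */
static uint64_t sketch_min_period_after_destroy(uint64_t cur_min,
						uint64_t destroyed_win,
						const uint64_t *remaining_wins,
						size_t n, bool *rescanned)
{
	uint64_t period = UINT64_MAX;

	assert(rescanned);
	*rescanned = false;

	/* The destroyed trigger did not define the minimum: nothing changes. */
	if (sketch_period(destroyed_win) != cur_min)
		return cur_min;

	*rescanned = true;
	for (size_t i = 0; i < n; i++) {
		uint64_t p = sketch_period(remaining_wins[i]);

		if (p < period)
			period = p;
	}
	return period;
}
```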
2023-05-19 | bpf: Add kfunc filter function to 'struct btf_kfunc_id_set' | Aditi Ghag
This commit adds the ability to filter kfuncs to certain BPF program types. This is required to limit bpf_sock_destroy kfunc implemented in follow-up commits to programs with attach type 'BPF_TRACE_ITER'. The commit adds a callback filter to 'struct btf_kfunc_id_set'. The filter has access to the `bpf_prog` construct including its properties such as `expected_attached_type`. Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com> Link: https://lore.kernel.org/r/20230519225157.760788-7-aditi.ghag@isovalent.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-05-19 | bpf: Show target_{obj,btf}_id in tracing link fdinfo | Yafang Shao
The target_btf_id can help us understand which kernel function is linked by a tracing prog. The target_btf_id and target_obj_id have already been exposed to userspace, so we just need to show them.

The result is as follows:

$ cat /proc/10673/fdinfo/10
pos:    0
flags:  02000000
mnt_id: 15
ino:    2094
link_type:      tracing
link_id:        2
prog_tag:       a04f5eef06a7f555
prog_id:        13
attach_type:    24
target_obj_id:  1
target_btf_id:  13964

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230517103126.68372-2-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-19 | bpf: Fix mask generation for 32-bit narrow loads of 64-bit fields | Will Deacon
A narrow load from a 64-bit context field results in a 64-bit load followed potentially by a 64-bit right-shift and then a bitwise AND operation to extract the relevant data.

In the case of a 32-bit access, an immediate mask of 0xffffffff is used to construct a 64-bit BPF_AND operation which then sign-extends the mask value and effectively acts as a glorified no-op. For example:

0:	61 10 00 00 00 00 00 00	r0 = *(u32 *)(r1 + 0)

results in the following code generation for a 64-bit field:

	ldr	x7, [x7]	// 64-bit load
	mov	x10, #0xffffffffffffffff
	and	x7, x7, x10

Fix the mask generation so that narrow loads always perform a 32-bit AND operation:

	ldr	x7, [x7]	// 64-bit load
	mov	w10, #0xffffffff
	and	w7, w7, w10

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Krzesimir Nowak <krzesimir@kinvolk.io>
Cc: Andrey Ignatov <rdna@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Fixes: 31fd85816dbe ("bpf: permits narrower load from bpf program context fields")
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20230518102528.1341-1-will@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
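The sign-extension effect can be reproduced in plain C. The helpers below are hypothetical models of a 64-bit AND with a 32-bit immediate and of a 32-bit AND whose result is zero-extended; they are a sketch of the arithmetic under those assumptions, not verifier or JIT code.

```c
#include <assert.h>
#include <stdint.h>

/* Mask for a narrow load of `size` bytes, expressed as a 32-bit immediate. */
static int32_t narrow_mask_imm(unsigned int size)
{
	uint32_t bits;

	assert(size == 1 || size == 2 || size == 4);
	bits = (uint32_t)((1ULL << (size * 8)) - 1);
	/* For size == 4 the pattern 0xffffffff becomes -1 (GCC/Clang wrap). */
	return (int32_t)bits;
}

/* 64-bit AND: the 32-bit immediate is sign-extended to 64 bits first. */
static uint64_t and64_imm(uint64_t reg, int32_t imm)
{
	return reg & (uint64_t)(int64_t)imm;
}

/* 32-bit AND: works on the low word and zero-extends the result. */
static uint64_t and32_imm(uint64_t reg, int32_t imm)
{
	return (uint64_t)((uint32_t)reg & (uint32_t)imm);
}
```

For a 4-byte access the 64-bit form leaves the register unchanged, while the 32-bit form keeps only the low word; for 1- and 2-byte masks both forms agree because the immediate is positive.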
2023-05-19 | lockdep: Add lock_set_cmp_fn() annotation | Kent Overstreet
This implements a new interface to lockdep, lock_set_cmp_fn(), for defining a custom ordering when taking multiple locks of the same class.

This is an alternative to subclasses, but can not fully replace them since subclasses allow lock hierarchies with other classes intertwined, while this relies on pure class nesting.

Specifically, if A is our nesting class then:

  A/0 <- B <- A/1

Would be a valid lock order with subclasses (each subclass really is a full class from the validation PoV) but not with this annotation, which requires all nesting to be consecutive.

Example output:

| ============================================
| WARNING: possible recursive locking detected
| 6.2.0-rc8-00003-g7d81e591ca6a-dirty #15 Not tainted
| --------------------------------------------
| kworker/14:3/938 is trying to acquire lock:
| ffff8880143218c8 (&b->lock l=0 0:2803368){++++}-{3:3}, at: bch_btree_node_get.part.0+0x81/0x2b0
|
| but task is already holding lock:
| ffff8880143de8c8 (&b->lock l=1 1048575:9223372036854775807){++++}-{3:3}, at: __bch_btree_map_nodes+0xea/0x1e0
| and the lock comparison function returns 1:
|
| other info that might help us debug this:
|  Possible unsafe locking scenario:
|
|        CPU0
|        ----
|   lock(&b->lock l=1 1048575:9223372036854775807);
|   lock(&b->lock l=0 0:2803368);
|
|  *** DEADLOCK ***
|
|  May be due to missing lock nesting notation
|
| 3 locks held by kworker/14:3/938:
|  #0: ffff888005ea9d38 ((wq_completion)bcache){+.+.}-{0:0}, at: process_one_work+0x1ec/0x530
|  #1: ffff8880098c3e70 ((work_completion)(&cl->work)#3){+.+.}-{0:0}, at: process_one_work+0x1ec/0x530
|  #2: ffff8880143de8c8 (&b->lock l=1 1048575:9223372036854775807){++++}-{3:3}, at: __bch_btree_map_nodes+0xea/0x1e0

[peterz: extended changelog]
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230509195847.1745548-1-kent.overstreet@linux.dev
2023-05-18 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Jakub Kicinski
Conflicts: drivers/net/ethernet/freescale/fec_main.c 6ead9c98cafc ("net: fec: remove the xdp_return_frame when lack of tx BDs") 144470c88c5d ("net: fec: using the standard return codes when xdp xmit errors") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-18 | x86/hibernate: Declare global functions in suspend.h | Arnd Bergmann
Three functions that are defined in x86 specific code to override generic __weak implementations cause a warning because of a missing prototype: arch/x86/power/cpu.c:298:5: error: no previous prototype for 'hibernate_resume_nonboot_cpu_disable' [-Werror=missing-prototypes] arch/x86/power/hibernate.c:129:5: error: no previous prototype for 'arch_hibernation_header_restore' [-Werror=missing-prototypes] arch/x86/power/hibernate.c:91:5: error: no previous prototype for 'arch_hibernation_header_save' [-Werror=missing-prototypes] Move the declarations into a global header so it can be included by any file defining one of these. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/all/20230516193549.544673-14-arnd%40kernel.org
2023-05-18 | Merge tag 'probes-fixes-v6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace | Linus Torvalds
Pull probes fixes from Masami Hiramatsu:

 - Initialize 'ret' local variables on fprobe_handler() to fix the smatch warning. With this, fprobe function exit handler is not working randomly.
 - Fix to use preempt_enable/disable_notrace for rethook handler to prevent recursive call of fprobe exit handler (which is based on rethook)
 - Fix recursive call issue on fprobe_kprobe_handler()
 - Fix to detect recursive call on fprobe_exit_handler()
 - Fix to make all arch-dependent rethook code notrace (the arch-independent code is already notrace)

* tag 'probes-fixes-v6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  rethook, fprobe: do not trace rethook related functions
  fprobe: add recursion detection in fprobe_exit_handler
  fprobe: make fprobe_kprobe_handler recursion free
  rethook: use preempt_{disable, enable}_notrace in rethook_trampoline_handler
  tracing: fprobe: Initialize ret valiable to fix smatch error
2023-05-17 | workqueue: Track and monitor per-workqueue CPU time usage | Tejun Heo
Now that wq_worker_tick() is there, we can easily track the rough CPU time consumption of each workqueue by charging the whole tick whenever a tick hits an active workqueue. While not super accurate, it provides reasonable visibility into the workqueues that consume a lot of CPU cycles. wq_monitor.py is updated to report the per-workqueue CPU times. v2: wq_monitor.py was using "cputime" as the key when outputting in json format. Use "cpu_time" instead for consistency with other fields. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-17 | workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism | Tejun Heo
Workqueue now automatically marks per-cpu work items that hog CPU for too long as CPU_INTENSIVE, which excludes them from concurrency management and prevents stalling other concurrency-managed work items. If a work function keeps running over the threshold, it likely needs to be switched to use an unbound workqueue. This patch adds a debug mechanism which tracks the work functions which trigger the automatic CPU_INTENSIVE mechanism and reports them using pr_warn() with exponential backoff. v3: Documentation update. v2: Drop bouncing to kthread_worker for printing messages. It was to avoid introducing circular locking dependency through printk but not effective as it still had pool lock -> wci_lock -> printk -> pool lock loop. Let's just print directly using printk_deferred(). Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org>
2023-05-17 | workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE | Tejun Heo
If a per-cpu work item hogs the CPU, it can prevent other work items from starting through concurrency management. A per-cpu workqueue which intends to host such CPU-hogging work items can choose to not participate in concurrency management by setting %WQ_CPU_INTENSIVE; however, this can be error-prone and difficult to debug when missed.

This patch adds an automatic CPU usage based detection. If a concurrency-managed work item consumes more CPU time than the threshold (10ms by default) continuously without intervening sleeps, wq_worker_tick() which is called from scheduler_tick() will detect the condition and automatically mark it CPU_INTENSIVE.

The mechanism isn't foolproof:

* Detection depends on tick hitting the work item. Getting preempted at the right timings may allow a violating work item to evade detection at least temporarily.
* nohz_full CPUs may not be running ticks and thus can fail detection.
* Even when detection is working, the 10ms detection delays can add up if many CPU-hogging work items are queued at the same time.

However, in the vast majority of cases, this should be able to detect violations reliably and provide reasonable protection with a small increase in code complexity.

If some work items trigger this condition repeatedly, the bigger problem likely is the CPU being saturated with such per-cpu work items and the solution would be making them UNBOUND. The next patch will add a debug mechanism to help spot such cases.

v4: Documentation for workqueue.cpu_intensive_thresh_us added to kernel-parameters.txt.
v3: Switch to use wq_worker_tick() instead of hooking into preemptions as suggested by Peter.
v2: Lai pointed out that wq_worker_stopping() also needs to be called from preemption and rtlock paths and an earlier patch was updated accordingly. This patch adds a comment describing the risk of infinite recursions and how they're avoided.
Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com>
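The tick-based detection described in this commit can be sketched in user space. The struct and the `sketch_*` helpers are hypothetical; the sketch only models the stated rule (CPU time consumed continuously since the last start or wake-up, compared against a microsecond threshold at each tick) and ignores locking and the other worker states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct sketch_worker {
	uint64_t runtime_ns;	/* total CPU time consumed so far */
	uint64_t current_at_ns;	/* runtime when the current stretch began */
	bool cpu_intensive;	/* set once the work item is flagged */
};

/* A work item starts executing: begin measuring from here. */
static void sketch_work_start(struct sketch_worker *w)
{
	w->current_at_ns = w->runtime_ns;
	w->cpu_intensive = false;
}

/* The work item burns CPU for `ns` nanoseconds. */
static void sketch_run(struct sketch_worker *w, uint64_t ns)
{
	w->runtime_ns += ns;
}

/* Waking from a sleep restarts the measurement ("intervening sleeps"). */
static void sketch_wake_from_sleep(struct sketch_worker *w)
{
	w->current_at_ns = w->runtime_ns;
}

/* Called on each tick; returns true when this tick flags the work item. */
static bool sketch_tick(struct sketch_worker *w, uint64_t thresh_us)
{
	assert(w && thresh_us > 0);
	if (w->cpu_intensive)
		return false;
	if (w->runtime_ns - w->current_at_ns < thresh_us * 1000)
		return false;
	w->cpu_intensive = true;
	return true;
}
```

With the default 10ms threshold (10000 us), two 6ms bursts separated by a sleep are not flagged, while 11ms of uninterrupted running is.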
2023-05-17 | workqueue: Improve locking rule description for worker fields | Tejun Heo
* Some worker fields are modified only by the worker itself while holding pool->lock thus making them safe to read from self, IRQ context if the CPU is running the worker or while holding pool->lock. Add 'K' locking rule for them. * worker->sleeping is currently marked "None" which isn't very descriptive. It's used only by the worker itself. Add 'S' locking rule for it. A future patch will depend on the 'K' rule to access worker->current_* from the scheduler ticks. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-17 | workqueue: Move worker_set/clr_flags() upwards | Tejun Heo
They are going to be used in wq_worker_stopping(). Move them upwards. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com>
2023-05-17 | workqueue: Re-order struct worker fields | Tejun Heo
struct worker was laid out with the intent that all fields that are modified for each work item execution are in the first cacheline. However, this hasn't been true for a while with the addition of ->last_func. Let's just collect hot fields together at the top. Move ->sleeping in the hole after ->current_color and move ->last_func right below. While at it, drop the cacheline comment which isn't useful anymore. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com>
2023-05-17 | workqueue: Add pwq->stats[] and a monitoring script | Tejun Heo
Currently, the only way to peer into workqueue operations is through tracing. While possible, it isn't easy or convenient to monitor per-workqueue behaviors over time this way. Let's add pwq->stats[] that track relevant events and a drgn monitoring script - tools/workqueue/wq_monitor.py. It's arguable whether this needs to be configurable. However, it currently only has several counters and the runtime overhead shouldn't be noticeable given that they're on pwq's which are per-cpu on per-cpu workqueues and per-numa-node on unbound ones. Let's keep it simple for the time being. v2: Patch reordered to earlier with fewer fields. Field will be added back gradually. Help message improved. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com>
2023-05-18 | fprobe: add recursion detection in fprobe_exit_handler | Ze Gao
fprobe_handler and fprobe_kprobe_handler have guarded ftrace recursion detection but fprobe_exit_handler has not, which possibly introduces recursive calls if the fprobe exit callback calls any traceable functions. Checking in fprobe_handler or fprobe_kprobe_handler is not enough and misses this case. So add a recursion free guard the same way as fprobe_handler. Since the ftrace recursion check does not employ ip(s), here use entry_ip and entry_parent_ip the same as fprobe_handler. Link: https://lore.kernel.org/all/20230517034510.15639-4-zegao@tencent.com/ Fixes: 5b0ab78998e3 ("fprobe: Add exit_handler support") Signed-off-by: Ze Gao <zegao@tencent.com> Cc: stable@vger.kernel.org Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-05-18fprobe: make fprobe_kprobe_handler recursion freeZe Gao
The current implementation calls kprobe related functions before doing the ftrace recursion check in fprobe_kprobe_handler, which opens the door to a kernel crash due to stack recursion if preempt_count_{add, sub} is traceable in kprobe_busy_{begin, end}. Things go like this without this patch, quoted from Steven: " fprobe_kprobe_handler() { kprobe_busy_begin() { preempt_disable() { preempt_count_add() { <-- trace fprobe_kprobe_handler() { [ wash, rinse, repeat, CRASH!!! ] " By refactoring the common part out of fprobe_kprobe_handler and fprobe_handler and calling ftrace recursion detection at the very beginning, the whole fprobe_kprobe_handler is free from recursion. [ Fix the indentation of __fprobe_handler() parameters. ] Link: https://lore.kernel.org/all/20230517034510.15639-3-zegao@tencent.com/ Fixes: ab51e15d535e ("fprobe: Introduce FPROBE_FL_KPROBE_SHARED flag for fprobe") Signed-off-by: Ze Gao <zegao@tencent.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
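The ordering described above, taking the recursion check before calling anything that may itself be traced, can be modelled in user space. This is a minimal sketch under stated assumptions: `in_handler`, `traced_helper()`, `handler()` and `run_once()` are hypothetical names for illustration, not the kernel's ftrace recursion API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
static bool in_handler;   /* stands in for the per-context recursion bit */
static int handler_runs;  /* how many times the handler body executed */
static int rejected;      /* how many re-entries the guard turned away */

static void handler(void);

/* Models a traceable function such as preempt_count_add():
 * calling it fires the handler again. */
static void traced_helper(void)
{
    handler();
}

static void handler(void)
{
    /* Guard first, before touching anything that may be traced. */
    if (in_handler) {
        rejected++;
        return;
    }
    in_handler = true;

    handler_runs++;
    traced_helper();      /* would recurse without bound if unguarded */

    in_handler = false;
}

/* Fires the handler once and returns how often its body ran. */
int run_once(void)
{
    handler_runs = 0;
    rejected = 0;
    handler();
    assert(!in_handler);  /* the guard is released on the way out */
    return handler_runs;
}
```

If the guard were taken only after `traced_helper()` had been called, the model would recurse until the stack overflowed, which is the failure the quoted trace describes.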
2023-05-18rethook: use preempt_{disable, enable}_notrace in rethook_trampoline_handlerZe Gao
This patch replaces preempt_{disable, enable} with its corresponding notrace version in rethook_trampoline_handler so no worries about stack recursion or overflow introduced by preempt_count_{add, sub} under fprobe + rethook context. Link: https://lore.kernel.org/all/20230517034510.15639-2-zegao@tencent.com/ Fixes: 54ecbe6f1ed5 ("rethook: Add a generic return hook") Signed-off-by: Ze Gao <zegao@tencent.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-05-17audit: avoid missing-prototype warningsArnd Bergmann
Building with 'make W=1' reveals two function definitions without a previous prototype in the audit code: lib/compat_audit.c:32:5: error: no previous prototype for 'audit_classify_compat_syscall' [-Werror=missing-prototypes] kernel/audit.c:1813:14: error: no previous prototype for 'audit_serial' [-Werror=missing-prototypes] The first one needs a declaration from linux/audit.h but cannot include that header without causing conflicting (compat) syscall number definitions, so move it into linux/audit_arch.h. The second one is declared conditionally based on CONFIG_AUDITSYSCALL but needed as a local function even when that option is disabled, so move the declaration out of the #ifdef block. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Paul Moore <paul@paul-moore.com>
2023-05-17tracing: fprobe: Initialize ret variable to fix smatch errorMasami Hiramatsu (Google)
The commit 39d954200bf6 ("fprobe: Skip exit_handler if entry_handler returns !0") introduced a hidden dependency on the 'ret' local variable in fprobe_handler(). Smatch warns that `ret` can be accessed without initialization. kernel/trace/fprobe.c:59 fprobe_handler() error: uninitialized symbol 'ret'. kernel/trace/fprobe.c 49 fpr->entry_ip = ip; 50 if (fp->entry_data_size) 51 entry_data = fpr->data; 52 } 53 54 if (fp->entry_handler) 55 ret = fp->entry_handler(fp, ip, ftrace_get_regs(fregs), entry_data); ret is only initialized if there is an ->entry_handler 56 57 /* If entry_handler returns !0, nmissed is not counted. */ 58 if (rh) { rh is only true if there is an ->exit_handler. Presumably if you have an ->exit_handler that means you also have a ->entry_handler but Smatch is not smart enough to figure it out. --> 59 if (ret) ^^^ Warning here. 60 rethook_recycle(rh); 61 else 62 rethook_hook(rh, ftrace_get_regs(fregs), true); 63 } 64 out: 65 ftrace_test_recursion_unlock(bit); 66 } Link: https://lore.kernel.org/all/168100731160.79534.374827110083836722.stgit@devnote2/ Reported-by: Dan Carpenter <error27@gmail.com> Link: https://lore.kernel.org/all/85429a5c-a4b9-499e-b6c0-cbd313291c49@kili.mountain Fixes: 39d954200bf6 ("fprobe: Skip exit_handler if entry_handler returns !0") Acked-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
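The shape of the fix named in the subject line, giving `ret` a defined value before the conditional assignment, can be shown with a small stand-alone model. This is a sketch only: `struct probe`, `dispatch()` and the `ACT_*` values are hypothetical stand-ins, not the fprobe structures or the actual patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the fprobe structures. */
typedef int (*entry_fn)(void);

struct probe {
    entry_fn entry_handler;  /* may be NULL */
    int has_exit;            /* models "rh" being non-NULL */
};

enum action { ACT_NONE, ACT_RECYCLE, ACT_HOOK };

static enum action dispatch(const struct probe *p)
{
    int ret = 0;  /* initialized, so the test below is always defined */

    if (p->entry_handler)
        ret = p->entry_handler();

    if (p->has_exit)
        return ret ? ACT_RECYCLE : ACT_HOOK;
    return ACT_NONE;
}

static int entry_skip(void)
{
    return 1;     /* non-zero: ask for the exit side to be skipped */
}

/* An exit handler without an entry handler is the path the warning is about. */
int demo_exit_only(void)
{
    struct probe p = { NULL, 1 };

    assert(dispatch(&p) == ACT_HOOK);
    p.entry_handler = entry_skip;
    assert(dispatch(&p) == ACT_RECYCLE);
    return 0;
}
```

With `ret` left uninitialized, the exit-only case would read an indeterminate value, which is exactly what a static checker cannot rule out on its own.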
2023-05-16bpf: drop unnecessary user-triggerable WARN_ONCE in verifier logAndrii Nakryiko
It's trivial for a user to trigger the "verifier log line truncated" warning, as the verifier has a fixed-sized buffer of 1024 bytes (as of now), and there are at least two pieces of user-provided information that can be output through this buffer, and both can be arbitrarily sized by the user: - BTF names; - BTF.ext source code lines strings. Verifier log buffer should be properly sized for typical verifier state output. But it's sort-of expected that this buffer won't be long enough in some circumstances. So let's drop the check. In any case code will work correctly, at worst truncating a part of a single line output. Reported-by: syzbot+8b2a08dfbd25fd933d75@syzkaller.appspotmail.com Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230516180409.3549088-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
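Why truncation is harmless can be seen with the standard C `vsnprintf()`, which never writes past the buffer and reports the length the full string would have had. The sketch below is a user-space illustration with a hypothetical helper `format_line()` and a deliberately tiny `LOG_LINE_SIZE`; it is not the verifier's logging code.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define LOG_LINE_SIZE 16  /* tiny on purpose; the message cites 1024 bytes */

/* Hypothetical helper: format one line into a fixed buffer.
 * Returns true when the output had to be cut short. */
static bool format_line(char *out, const char *fmt, ...)
{
    va_list args;
    int n;

    va_start(args, fmt);
    n = vsnprintf(out, LOG_LINE_SIZE, fmt, args);
    va_end(args);

    /* vsnprintf() returns the length the untruncated string would have. */
    return n >= LOG_LINE_SIZE;
}

/* A long user-supplied string is cut, but the buffer stays well formed. */
int demo_truncation(void)
{
    char line[LOG_LINE_SIZE];
    bool cut = format_line(line, "name=%s", "a_very_long_user_supplied_name");

    assert(cut);
    assert(strlen(line) == LOG_LINE_SIZE - 1);
    return 0;
}
```

Because the caller decides what to do with the "cut" result, treating it as a normal condition rather than a warning costs nothing beyond a shortened line.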
2023-05-16Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-05-16 We've added 57 non-merge commits during the last 19 day(s) which contain a total of 63 files changed, 3293 insertions(+), 690 deletions(-). The main changes are: 1) Add precision propagation to verifier for subprogs and callbacks, from Andrii Nakryiko. 2) Improve BPF's {g,s}etsockopt() handling with wrong option lengths, from Stanislav Fomichev. 3) Utilize pahole v1.25 for the kernel's BTF generation to filter out inconsistent function prototypes, from Alan Maguire. 4) Various dyn-pointer verifier improvements to relax restrictions, from Daniel Rosenberg. 5) Add a new bpf_task_under_cgroup() kfunc for designated task, from Feng Zhou. 6) Unblock tests for arm64 BPF CI after ftrace supporting direct call, from Florent Revest. 7) Add XDP hint kfunc metadata for RX hash/timestamp for igc, from Jesper Dangaard Brouer. 8) Add several new dyn-pointer kfuncs to ease their usability, from Joanne Koong. 9) Add in-depth LRU internals description and dot function graph, from Joe Stringer. 10) Fix KCSAN report on bpf_lru_list when accessing node->ref, from Martin KaFai Lau. 11) Only dump unprivileged_bpf_disabled log warning upon write, from Kui-Feng Lee. 12) Extend test_progs to directly passing allow/denylist file, from Stephen Veiss. 13) Fix BPF trampoline memleak upon failure attaching to fentry, from Yafang Shao. 14) Fix emitting struct bpf_tcp_sock type in vmlinux BTF, from Yonghong Song. 
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (57 commits) bpf: Fix memleak due to fentry attach failure bpf: Remove bpf trampoline selector bpf, arm64: Support struct arguments in the BPF trampoline bpftool: JIT limited misreported as negative value on aarch64 bpf: fix calculation of subseq_idx during precision backtracking bpf: Remove anonymous union in bpf_kfunc_call_arg_meta bpf: Document EFAULT changes for sockopt selftests/bpf: Correctly handle optlen > 4096 selftests/bpf: Update EFAULT {g,s}etsockopt selftests bpf: Don't EFAULT for {g,s}setsockopt with wrong optlen libbpf: fix offsetof() and container_of() to work with CO-RE bpf: Address KCSAN report on bpf_lru_list bpf: Add --skip_encoding_btf_inconsistent_proto, --btf_gen_optimized to pahole flags for v1.25 selftests/bpf: Accept mem from dynptr in helper funcs bpf: verifier: Accept dynptr mem as mem in helpers selftests/bpf: Check overflow in optional buffer selftests/bpf: Test allowing NULL buffer in dynptr slice bpf: Allow NULL buffers in bpf_dynptr_slice(_rw) selftests/bpf: Add testcase for bpf_task_under_cgroup bpf: Add bpf_task_under_cgroup() kfunc ... ==================== Link: https://lore.kernel.org/r/20230515225603.27027-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-15bpf: Fix memleak due to fentry attach failureYafang Shao
If it fails to attach fentry, the allocated bpf trampoline image will be left in the system. That can be verified by checking /proc/kallsyms. This memleak can be verified by a simple bpf program as follows: SEC("fentry/trap_init") int fentry_run() { return 0; } It will fail to attach trap_init because this function is freed after kernel init, and then we can find the trampoline image is left in the system by checking /proc/kallsyms. $ tail /proc/kallsyms ffffffffc0613000 t bpf_trampoline_6442453466_1 [bpf] ffffffffc06c3000 t bpf_trampoline_6442453466_1 [bpf] $ bpftool btf dump file /sys/kernel/btf/vmlinux | grep "FUNC 'trap_init'" [2522] FUNC 'trap_init' type_id=119 linkage=static $ echo $((6442453466 & 0x7fffffff)) 2522 Note that there are two leftover bpf trampoline images; that is because libbpf will fall back to raw tracepoint if -EINVAL is returned. Fixes: e21aa341785c ("bpf: Fix fexit trampoline.") Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <song@kernel.org> Cc: Jiri Olsa <olsajiri@gmail.com> Link: https://lore.kernel.org/bpf/20230515130849.57502-2-laoar.shao@gmail.com
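The shell arithmetic in the message above, masking the number in the ksym name to recover the BTF id, can be written as a small C helper. The mask and the sample value are taken from the message; the helper name `ksym_key_to_btf_id()` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Mask taken from the shell arithmetic in the commit message. */
#define KSYM_BTF_ID_MASK UINT64_C(0x7fffffff)

/* Hypothetical helper: extract the BTF id from the number embedded in a
 * bpf_trampoline_<number>_<n> symbol name. */
static uint32_t ksym_key_to_btf_id(uint64_t key)
{
    return (uint32_t)(key & KSYM_BTF_ID_MASK);
}

/* 6442453466 is 0x1800009da; the low 31 bits are 0x9da, i.e. 2522. */
int demo_btf_id(void)
{
    assert(ksym_key_to_btf_id(UINT64_C(6442453466)) == 2522);
    return 0;
}
```

The result matches the `[2522] FUNC 'trap_init'` entry printed by bpftool, which is how the message ties the leaked image back to the failed attach target.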
2023-05-15bpf: Remove bpf trampoline selectorYafang Shao
After commit e21aa341785c ("bpf: Fix fexit trampoline."), the selector is only used to indicate how many times the bpf trampoline image has been updated, and it is displayed in the trampoline ksym name. After the trampoline is freed, the selector will start from 0 again. So the selector is a useless value to the user. We can remove it. If the user wants to check whether the bpf trampoline image has been updated or not, the user can compare the address. Each time the trampoline image is updated, the address will change consequently. Jiri also pointed out another issue that perf is still using the old name "bpf_trampoline_%lu", so this change can fix the issue in perf. Fixes: e21aa341785c ("bpf: Fix fexit trampoline.") Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <song@kernel.org> Cc: Jiri Olsa <olsajiri@gmail.com> Link: https://lore.kernel.org/bpf/ZFvOOlrmHiY9AgXE@krava Link: https://lore.kernel.org/bpf/20230515130849.57502-3-laoar.shao@gmail.com
2023-05-15bpf: fix calculation of subseq_idx during precision backtrackingAndrii Nakryiko
Subsequent instruction index (subseq_idx) is an index of an instruction that was verified/executed by verifier after the currently processed instruction. It is maintained during precision backtracking processing and is used to detect various subprog calling conditions. This patch fixes the bug with incorrectly resetting subseq_idx to -1 when going from child state to parent state during backtracking. If we don't maintain correct subseq_idx we can misidentify subprog calls leading to precision tracking bugs. One such case was triggered by test_global_funcs/global_func9 test where global subprog call happened to be the very last instruction in parent state, leading to subseq_idx==-1, triggering WARN_ONCE: [ 36.045754] verifier backtracking bug [ 36.045764] WARNING: CPU: 13 PID: 2073 at kernel/bpf/verifier.c:3503 __mark_chain_precision+0xcc6/0xde0 [ 36.046819] Modules linked in: aesni_intel(E) crypto_simd(E) cryptd(E) kvm_intel(E) kvm(E) irqbypass(E) i2c_piix4(E) serio_raw(E) i2c_core(E) crc32c_intel) [ 36.048040] CPU: 13 PID: 2073 Comm: test_progs Tainted: G W OE 6.3.0-07976-g4d585f48ee6b-dirty #972 [ 36.048783] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [ 36.049648] RIP: 0010:__mark_chain_precision+0xcc6/0xde0 [ 36.050038] Code: 3d 82 c6 05 bb 35 32 02 01 e8 66 21 ec ff 0f 0b b8 f2 ff ff ff e9 30 f5 ff ff 48 c7 c7 f3 61 3d 82 4c 89 0c 24 e8 4a 21 ec ff <0f> 0b 4c0 With the fix precision tracking across multiple states works correctly now: mark_precise: frame0: last_idx 45 first_idx 38 subseq_idx -1 mark_precise: frame0: regs=r8 stack= before 44: (61) r7 = *(u32 *)(r10 -4) mark_precise: frame0: regs=r8 stack= before 43: (85) call pc+41 mark_precise: frame0: regs=r8 stack= before 42: (07) r1 += -48 mark_precise: frame0: regs=r8 stack= before 41: (bf) r1 = r10 mark_precise: frame0: regs=r8 stack= before 40: (63) *(u32 *)(r10 -48) = r1 mark_precise: frame0: regs=r8 stack= before 39: (b4) w1 = 0 
mark_precise: frame0: regs=r8 stack= before 38: (85) call pc+38 mark_precise: frame0: parent state regs=r8 stack=: R0_w=scalar() R1_w=map_value(off=4,ks=4,vs=8,imm=0) R6=1 R7_w=scalar() R8_r=P0 R10=fpm mark_precise: frame0: last_idx 36 first_idx 28 subseq_idx 38 mark_precise: frame0: regs=r8 stack= before 36: (18) r1 = 0xffff888104f2ed14 mark_precise: frame0: regs=r8 stack= before 35: (85) call pc+33 mark_precise: frame0: regs=r8 stack= before 33: (18) r1 = 0xffff888104f2ed10 mark_precise: frame0: regs=r8 stack= before 32: (85) call pc+36 mark_precise: frame0: regs=r8 stack= before 31: (07) r1 += -4 mark_precise: frame0: regs=r8 stack= before 30: (bf) r1 = r10 mark_precise: frame0: regs=r8 stack= before 29: (63) *(u32 *)(r10 -4) = r7 mark_precise: frame0: regs=r8 stack= before 28: (4c) w7 |= w0 mark_precise: frame0: parent state regs=r8 stack=: R0_rw=scalar() R6=1 R7_rw=scalar() R8_rw=P0 R10=fp0 fp-48_r=mmmmmmmm mark_precise: frame0: last_idx 27 first_idx 16 subseq_idx 28 mark_precise: frame0: regs=r8 stack= before 27: (85) call pc+31 mark_precise: frame0: regs=r8 stack= before 26: (b7) r1 = 0 mark_precise: frame0: regs=r8 stack= before 25: (b7) r8 = 0 Note how subseq_idx starts out as -1, then is preserved as 38 and then 28 as we go up the parent state chain. Reported-by: Alexei Starovoitov <ast@kernel.org> Fixes: fde2a3882bd0 ("bpf: support precision propagation in the presence of subprogs") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230515180710.1535018-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-15bpf: Remove anonymous union in bpf_kfunc_call_arg_metaDave Marchevsky
For kfuncs like bpf_obj_drop and bpf_refcount_acquire - which take user-defined types as input - the verifier needs to track the specific type passed in when checking a particular kfunc call. This requires tracking (btf, btf_id) tuple. In commit 7c50b1cb76ac ("bpf: Add bpf_refcount_acquire kfunc") I added an anonymous union with inner structs named after the specific kfuncs tracking this information, with the goal of making it more obvious which kfunc this data was being tracked / expected to be tracked on behalf of. In a recent series adding a new user of this tuple, Alexei mentioned that he didn't like this union usage as it doesn't really help with readability or bug-proofing ([0]). In an offline convo we agreed to have the tuple be fields (arg_btf, arg_btf_id), with comments in bpf_kfunc_call_arg_meta definition enumerating the uses of the fields by kfunc-specific handling logic. Such a pattern is used by struct bpf_reg_state without trouble. Accordingly, this patch removes the anonymous union in favor of arg_btf and arg_btf_id fields and comment enumerating their current uses. The patch also removes struct btf_and_id, which was only being used by the removed union's inner structs. This is a mechanical change, existing linked_list and rbtree tests will validate that correct (btf, btf_id) are being passed. [0]: https://lore.kernel.org/bpf/20230505021707.vlyiwy57vwxglbka@dhcp-172-26-102-232.dhcp.thefacebook.com Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230510213047.1633612-1-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-15bpf: netdev: init the offload table earlierJakub Kicinski
Some netdevices may get unregistered before late_initcall(), we have to move the hashtable init earlier. Fixes: f1fc43d03946 ("bpf: Move offload initialization into late_initcall") Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217399 Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230505215836.491485-1-kuba@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-15cpu/hotplug: Allow "parallel" bringup up to CPUHP_BP_KICK_AP_STATEThomas Gleixner
There is often significant latency in the early stages of CPU bringup, and time is wasted by waking each CPU (e.g. with SIPI/INIT/INIT on x86) and then waiting for it to respond before moving on to the next. Allow a platform to enable parallel setup which brings all to be onlined CPUs up to the CPUHP_BP_KICK_AP state. While this state advancement on the control CPU (BP) is single-threaded the important part is the last state CPUHP_BP_KICK_AP which wakes the to be onlined CPUs up. This allows the CPUs to run up to the first synchronization point cpuhp_ap_sync_alive() where they wait for the control CPU to release them one by one for the full onlining procedure. This parallelism depends on the CPU hotplug core sync mechanism which ensures that the parallel brought up CPUs wait for release before touching any state which would make the CPU visible to anything outside the hotplug control mechanism. To handle the SMT constraints of X86 correctly the bringup happens in two iterations when CONFIG_HOTPLUG_SMT is enabled. The control CPU brings up the primary SMT threads of each core first, which can load the microcode without the need to rendezvous with the thread siblings. Once that's completed it brings up the secondary SMT threads. Co-developed-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205257.240231377@linutronix.de
2023-05-15cpu/hotplug: Provide a split up CPUHP_BRINGUP mechanismThomas Gleixner
The bring up logic of a to be onlined CPU consists of several parts, which are considered to be a single hotplug state: 1) Control CPU issues the wake-up 2) To be onlined CPU starts up, does the minimal initialization, reports to be alive and waits for release into the complete bring-up. 3) Control CPU waits for the alive report and releases the upcoming CPU for the complete bring-up. Allow to split this into two states: 1) Control CPU issues the wake-up After that the to be onlined CPU starts up, does the minimal initialization, reports to be alive and waits for release into the full bring-up. As this can run after the control CPU dropped the hotplug locks the code which is executed on the AP before it reports alive has to be carefully audited to not violate any of the hotplug constraints, especially not modifying any of the various cpumasks. This is really only meant to avoid waiting for the AP to react on the wake-up. Of course an architecture can move strict CPU related setup functionality, e.g. microcode loading, with care before the synchronization point to save further pointless waiting time. 2) Control CPU waits for the alive report and releases the upcoming CPU for the complete bring-up. This allows that the two states can be split up to run all to be onlined CPUs up to state #1 on the control CPU and then at a later point run state #2. This spares some of the latencies of the full serialized per CPU bringup by avoiding the per CPU wakeup/wait serialization. The assumption is that the first AP already waits when the last AP has been woken up. This obviously depends on the hardware latencies and depending on the timings this might still not completely eliminate all wait scenarios. This split is just a preparatory step for enabling the parallel bringup later. The boot time bringup is still fully serialized. It has a separate config switch so that architectures which want to support parallel bringup can test the split of the CPUHP_BRINGUP step separately. 
To enable this the architecture must support the CPU hotplug core sync mechanism and has to be audited that there are no implicit hotplug state dependencies which require a fully serialized bringup. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205257.080801387@linutronix.de
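The latency argument for the split can be illustrated with a toy timing model. Everything here is an assumption made for illustration: the helper names, the three cost parameters (`kick`, `wake`, `release`, in arbitrary time units) and the idea that each AP reports alive a fixed `wake` delay after its wake-up was issued. It is not derived from the hotplug code.

```c
#include <assert.h>

/* Toy model, hypothetical parameters (arbitrary time units):
 *   kick    - control CPU time to issue one wake-up
 *   wake    - delay until the woken CPU reports alive
 *   release - control CPU time to complete one CPU's bring-up
 */

/* Fully serialized: wake, wait and release one CPU at a time. */
static long serialized_total(int n, long kick, long wake, long release)
{
    return (long)n * (kick + wake + release);
}

/* Split: issue all wake-ups first (state #1), then release the CPUs
 * one by one (state #2), waiting only if a CPU is not alive yet. */
static long split_total(int n, long kick, long wake, long release)
{
    long t = (long)n * kick;

    for (int i = 0; i < n; i++) {
        long alive = (long)(i + 1) * kick + wake;  /* AP i reports alive */

        if (t < alive)
            t = alive;   /* control CPU still has to wait */
        t += release;
    }
    return t;
}

/* With 4 CPUs, kick=1, wake=10, release=2 the wait is paid roughly once. */
int demo_split(void)
{
    assert(serialized_total(4, 1, 10, 2) == 52);
    assert(split_total(4, 1, 10, 2) == 19);
    return 0;
}
```

In this model the saving disappears when `wake` is zero and grows with the wake-up latency, which is in line with the message's remark that the benefit depends on hardware latencies and may not eliminate every wait.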
2023-05-15cpu/hotplug: Reset task stack state in _cpu_up()David Woodhouse
Commit dce1ca0525bf ("sched/scs: Reset task stack state in bringup_cpu()") ensured that the shadow call stack and KASAN poisoning were removed from a CPU's stack each time that CPU is brought up, not just once. This is not incorrect. However, with parallel bringup the idle thread setup will happen at a different step. As a consequence the cleanup in bringup_cpu() would be too late. Move the SCS/KASAN cleanup to the generic _cpu_up() function instead, which already ensures that the new CPU's stack is available, purely to allow for early failure. This occurs when the CPU to be brought up is in the CPUHP_OFFLINE state, which should correctly do the cleanup any time the CPU has been taken down to the point where such is needed. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205257.027075560@linutronix.de
2023-05-15cpu/hotplug: Remove unused state functionsThomas Gleixner
All users converted to the hotplug core mechanism. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205256.972894276@linutronix.de
2023-05-15cpu/hotplug: Remove cpu_report_state() and related unused cruftThomas Gleixner
No more users. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205256.582584351@linutronix.de