path: root/kernel
Age  Commit message  Author
2023-01-04  kcsan: Instrument memcpy/memset/memmove with newer Clang  (Marco Elver)
commit 7c201739beef1a586d806463f1465429cdce34c5 upstream. With Clang version 16+, -fsanitize=thread will turn memcpy/memset/memmove calls in instrumented functions into __tsan_memcpy/__tsan_memset/__tsan_memmove calls respectively. Add these functions to the core KCSAN runtime, so that we (a) catch data races with mem* functions, and (b) won't run into linker errors with such newer compilers. Cc: stable@vger.kernel.org # v5.10+ Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
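For readers unfamiliar with the KCSAN runtime, a hedged sketch of what such an interceptor looks like (not the verbatim upstream code; check_access() stands in for the internal checker in kernel/kcsan/core.c and its argument list is simplified):

	void *__tsan_memcpy(void *dst, const void *src, size_t len)
	{
		check_access(dst, len, KCSAN_ACCESS_WRITE);	/* destination is written */
		check_access(src, len, 0);			/* source is only read */
		return memcpy(dst, src, len);
	}

__tsan_memset and __tsan_memmove follow the same pattern: report racing accesses on the touched ranges, then defer to the real mem* implementation so behaviour is otherwise unchanged.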
2023-01-04  rtmutex: Add acquire semantics for rtmutex lock acquisition slow path  (Mel Gorman)
commit 1c0908d8e441631f5b8ba433523cf39339ee2ba0 upstream. Jan Kara reported the following bug triggering on 6.0.5-rt14 running dbench on XFS on arm64. kernel BUG at fs/inode.c:625! Internal error: Oops - BUG: 0 [#1] PREEMPT_RT SMP CPU: 11 PID: 6611 Comm: dbench Tainted: G E 6.0.0-rt14-rt+ #1 pc : clear_inode+0xa0/0xc0 lr : clear_inode+0x38/0xc0 Call trace: clear_inode+0xa0/0xc0 evict+0x160/0x180 iput+0x154/0x240 do_unlinkat+0x184/0x300 __arm64_sys_unlinkat+0x48/0xc0 el0_svc_common.constprop.4+0xe4/0x2c0 do_el0_svc+0xac/0x100 el0_svc+0x78/0x200 el0t_64_sync_handler+0x9c/0xc0 el0t_64_sync+0x19c/0x1a0 It also affects 6.1-rc7-rt5 and affects a preempt-rt fork of 5.14 so this is likely a bug that existed forever and only became visible when ARM support was added to preempt-rt. The same problem does not occur on x86-64 and he also reported that converting sb->s_inode_wblist_lock to raw_spinlock_t makes the problem disappear indicating that the RT spinlock variant is the problem. Which in turn means that RT mutexes on ARM64 and any other weakly ordered architecture are affected by this independent of RT. Will Deacon observed: "I'd be more inclined to be suspicious of the slowpath tbh, as we need to make sure that we have acquire semantics on all paths where the lock can be taken. Looking at the rtmutex code, this really isn't obvious to me -- for example, try_to_take_rt_mutex() appears to be able to return via the 'takeit' label without acquire semantics and it looks like we might be relying on the caller's subsequent _unlock_ of the wait_lock for ordering, but that will give us release semantics which aren't correct." Sebastian Andrzej Siewior prototyped a fix that does work based on that comment but it was a little bit overkill and added some fences that should not be necessary. The lock owner is updated with an IRQ-safe raw spinlock held, but the spin_unlock does not provide acquire semantics which are needed when acquiring a mutex. Adds the necessary acquire semantics for lock owner updates in the slow path acquisition and the waiter bit logic. It successfully completed 10 iterations of the dbench workload while the vanilla kernel fails on the first iteration. [ bigeasy@linutronix.de: Initial prototype fix ] Fixes: 700318d1d7b38 ("locking/rtmutex: Use acquire/release semantics") Fixes: 23f78d4a03c5 ("[PATCH] pi-futex: rt mutex core") Reported-by: Jan Kara <jack@suse.cz> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20221202100223.6mevpbl7i6x5udfd@techsingularity.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-04  futex: Fix futex_waitv() hrtimer debug object leak on kcalloc error  (Mathieu Desnoyers)
commit 94cd8fa09f5f1ebdd4e90964b08b7f2cc4b36c43 upstream. In a scenario where kcalloc() fails to allocate memory, the futex_waitv system call immediately returns -ENOMEM without invoking destroy_hrtimer_on_stack(). When CONFIG_DEBUG_OBJECTS_TIMERS=y, this results in leaking a timer debug object. Fixes: bf69bad38cf6 ("futex: Implement sys_futex_waitv()") Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Cc: stable@vger.kernel.org Cc: stable@vger.kernel.org # v5.16+ Link: https://lore.kernel.org/r/20221214222008.200393-1-mathieu.desnoyers@efficios.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
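A minimal sketch of the corrected unwind order, with names taken from the commit message (simplified, not the verbatim upstream diff):

	futexv = kcalloc(nr_futexes, sizeof(*futexv), GFP_KERNEL);
	if (!futexv) {
		ret = -ENOMEM;
		goto destroy_timer;	/* was: return -ENOMEM, skipping the
					 * teardown below and leaking the
					 * hrtimer debug object */
	}
	/* ... wait on the futex vector, set ret ... */
destroy_timer:
	if (timeout)
		destroy_hrtimer_on_stack(&to.timer);
	return ret;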
2023-01-04  kprobes: kretprobe events missing on 2-core KVM guest  (wuqiang)
[ Upstream commit 3b7ddab8a19aefc768f345fd3782af35b4a68d9b ] The default value of maxactive is set to num_possible_cpus() for non-preemptible systems. On a 2-core system, only 2 kretprobe instances are allocated by default, so the 2 instances for an execve kretprobe are very likely to be used up by a pipelined command. Here's the testcase: a shell script was added to crontab, and the content of the script is: #!/bin/sh do_something_magic `tr -dc a-z < /dev/urandom | head -c 10` cron will trigger a series of program executions (4 times every hour). Event loss would then normally be noticed after 3-4 hours of testing. The issue is caused by a burst of execve requests. The best number of kretprobe instances differs case by case and should be the user's duty to determine, but num_possible_cpus() as the default value is inadequate, especially for systems with a small number of CPUs. This patch enables the preemption logic by default, thus raising the minimum maxactive to 10 for non-preemptible systems. Link: https://lore.kernel.org/all/20221110081502.492289-1-wuqiang.matt@bytedance.com/ Signed-off-by: wuqiang <wuqiang.matt@bytedance.com> Reviewed-by: Solar Designer <solar@openwall.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
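A minimal sketch of the resulting default in register_kretprobe(), reconstructed from the commit text rather than the verbatim upstream hunk (the old non-preemptible branch that hard-coded num_possible_cpus() is assumed removed):

	/* Pre-allocate memory for max kretprobe instances */
	if (rp->maxactive <= 0)
		rp->maxactive = max_t(unsigned int, 10, 2 * num_possible_cpus());

With this shape a 2-core box gets 10 instances instead of 2, which absorbs the execve bursts described above.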
2022-12-31  gcov: add support for checksum field  (Rickard x Andersson)
commit e96b95c2b7a63a454b6498e2df67aac14d046d13 upstream. In GCC version 12.1 a checksum field was added. This patch fixes a kernel crash occurring during boot when using gcov-kernel with GCC version 12.2. The crash occurred on a system running on i.MX6SX. Link: https://lkml.kernel.org/r/20221220102318.3418501-1-rickaran@axis.com Fixes: 977ef30a7d88 ("gcov: support GCC 12.1 and newer compilers") Signed-off-by: Rickard x Andersson <rickaran@axis.com> Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com> Tested-by: Peter Oberparleiter <oberpar@linux.ibm.com> Reviewed-by: Martin Liska <mliska@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-31  bpf: Prevent decl_tag from being referenced in func_proto arg  (Stanislav Fomichev)
[ Upstream commit f17472d4599697d701aa239b4c475a506bccfd19 ] Syzkaller managed to hit another decl_tag issue: btf_func_proto_check kernel/bpf/btf.c:4506 [inline] btf_check_all_types kernel/bpf/btf.c:4734 [inline] btf_parse_type_sec+0x1175/0x1980 kernel/bpf/btf.c:4763 btf_parse kernel/bpf/btf.c:5042 [inline] btf_new_fd+0x65a/0xb00 kernel/bpf/btf.c:6709 bpf_btf_load+0x6f/0x90 kernel/bpf/syscall.c:4342 __sys_bpf+0x50a/0x6c0 kernel/bpf/syscall.c:5034 __do_sys_bpf kernel/bpf/syscall.c:5093 [inline] __se_sys_bpf kernel/bpf/syscall.c:5091 [inline] __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5091 do_syscall_64+0x54/0x70 arch/x86/entry/common.c:48 This seems similar to commit ea68376c8bed ("bpf: prevent decl_tag from being referenced in func_proto") but for the argument. Reported-by: syzbot+8dd0551dda6020944c5d@syzkaller.appspotmail.com Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20221123035422.872531-2-sdf@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31  acct: fix potential integer overflow in encode_comp_t()  (Zheng Yejian)
[ Upstream commit c5f31c655bcc01b6da53b836ac951c1556245305 ] The integer overflow is described by the following code: > 317 static comp_t encode_comp_t(u64 value) > 318 { > 319 int exp, rnd; ...... > 341 exp <<= MANTSIZE; > 342 exp += value; > 343 return exp; > 344 } Currently comp_t is defined as type '__u16', but the variable 'exp' is of type 'int', so an overflow happens when the value returned at line 343 is greater than 65535. Link: https://lkml.kernel.org/r/20210515140631.369106-3-zhengyejian1@huawei.com Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zhang Jinhao <zhangjinhao2@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
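Because the arithmetic ports directly to user space, the truncation is easy to reproduce outside the kernel. A self-contained demo (constants from kernel/acct.c; the 'fixed' branch mirrors the saturating clamp this patch describes):

	#include <stdint.h>
	#include <stdio.h>

	typedef uint16_t comp_t;

	#define MANTSIZE 13
	#define EXPSIZE   3
	#define MAXFRACT ((1 << MANTSIZE) - 1)

	/* User-space re-creation of encode_comp_t(), reconstructed around
	 * the snippet quoted above. */
	static comp_t encode(uint64_t value, int fixed)
	{
		int exp = 0, rnd = 0;

		while (value > MAXFRACT) {
			rnd = value & (1 << (EXPSIZE - 1));	/* round up? */
			value >>= EXPSIZE;	/* base-8 exponent == 3-bit shift */
			exp++;
		}
		if (rnd && value != MAXFRACT) {
			value++;
			if (value > MAXFRACT) {
				value >>= EXPSIZE;
				exp++;
			}
		}
		/* the bug: exp is an int, the return type holds only 16 bits */
		if (fixed && exp > ((comp_t)~0U >> MANTSIZE))
			return (comp_t)~0U;	/* saturate instead of truncating */
		exp <<= MANTSIZE;
		exp += value;
		return exp;
	}

	int main(void)
	{
		printf("buggy: %u, fixed: %u\n",
		       (unsigned)encode(UINT64_MAX, 0),
		       (unsigned)encode(UINT64_MAX, 1));
		return 0;
	}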
2022-12-31  rcu: Fix __this_cpu_read() lockdep warning in rcu_force_quiescent_state()  (Zqiang)
[ Upstream commit ceb1c8c9b8aa9199da46a0f29d2d5f08d9b44c15 ] Running rcutorture with non-zero fqs_duration module parameter in a kernel built with CONFIG_PREEMPTION=y results in the following splat: BUG: using __this_cpu_read() in preemptible [00000000] code: rcu_torture_fqs/398 caller is __this_cpu_preempt_check+0x13/0x20 CPU: 3 PID: 398 Comm: rcu_torture_fqs Not tainted 6.0.0-rc1-yoctodev-standard+ Call Trace: <TASK> dump_stack_lvl+0x5b/0x86 dump_stack+0x10/0x16 check_preemption_disabled+0xe5/0xf0 __this_cpu_preempt_check+0x13/0x20 rcu_force_quiescent_state.part.0+0x1c/0x170 rcu_force_quiescent_state+0x1e/0x30 rcu_torture_fqs+0xca/0x160 ? rcu_torture_boost+0x430/0x430 kthread+0x192/0x1d0 ? kthread_complete_and_exit+0x30/0x30 ret_from_fork+0x22/0x30 </TASK> The problem is that rcu_force_quiescent_state() uses __this_cpu_read() in preemptible code instead of the proper raw_cpu_read(). This commit therefore changes __this_cpu_read() to raw_cpu_read(). Signed-off-by: Zqiang <qiang1.zhang@intel.com> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
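The whole fix is a one-line substitution; a reconstructed shape of the hunk in rcu_force_quiescent_state() (not guaranteed verbatim):

	-	rnp = __this_cpu_read(rcu_data.mynode);	/* trips the sanity check when preemptible */
	+	rnp = raw_cpu_read(rcu_data.mynode);	/* same racy read, check suppressed */

raw_cpu_read() is the variant intended for cases like this, where a racy per-CPU read is deliberate and any CPU's rcu_data is an acceptable starting point.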
2022-12-31  bpf: prevent leak of lsm program after failed attach  (Milan Landaverde)
[ Upstream commit e89f3edffb860a0f54a9ed16deadb7a4a1fa3862 ] In [0], we added the ability to bpf_prog_attach LSM programs to cgroups, but in our validation to make sure the prog is meant to be attached to BPF_LSM_CGROUP, we return too early if the check fails. This results in lack of decrementing prog's refcnt (through bpf_prog_put) leaving the LSM program alive past the point of the expected lifecycle. This fix allows for the decrement to take place. [0] https://lore.kernel.org/all/20220628174314.1216643-4-sdf@google.com/ Fixes: 69fd337a975c ("bpf: per-cgroup lsm flavor") Signed-off-by: Milan Landaverde <milan@mdaverde.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20221213175714.31963-1-milan@mdaverde.com Signed-off-by: Sasha Levin <sashal@kernel.org>
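A sketch of the corrected validation, reconstructed from the commit text (function and surrounding context simplified):

	if (ptype == BPF_PROG_TYPE_LSM &&
	    prog->expected_attach_type != BPF_LSM_CGROUP) {
		bpf_prog_put(prog);	/* the previously missing decrement */
		return -EINVAL;
	}

The reference was taken earlier by bpf_prog_get_type(); every early exit after that point must pair it with bpf_prog_put().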
2022-12-31  kprobes: Fix check for probe enabled in kill_kprobe()  (Li Huafei)
[ Upstream commit 0c76ef3f26d5ef2ac2c21b47e7620cff35809fbb ] In kill_kprobe(), the check whether disarm_kprobe_ftrace() needs to be called always fails. This is because we set the KPROBE_FLAG_GONE flag for the kprobe beforehand, so that "!kprobe_disabled(p)" is always false. The disarm_kprobe_ftrace() call was introduced by commit 0cb2f1372baa ("kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler") to fix the NULL pointer dereference problem; if the probe is enabled and we do not disarm it, that problem still exists. Fix it by putting the probe-enabled check before setting the KPROBE_FLAG_GONE flag. Link: https://lore.kernel.org/all/20221126114316.201857-1-lihuafei1@huawei.com/ Fixes: 3031313eb3d54 ("kprobes: Fix to check probe enabled before disarm_kprobe_ftrace()") Signed-off-by: Li Huafei <lihuafei1@huawei.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31  tracing/hist: Fix issue of losing command info in error_log  (Zheng Yejian)
[ Upstream commit 608c6ed3337850c767ab0dd6c583477922233e29 ] When some constructed invalid 'trigger' command is input, the command info in 'error_log' is lost [1]. The root cause is that there is a path where event_hist_trigger_parse() is recursively called once and 'last_cmd', which saves the original command, is cleared, so later calls of hist_err() no longer record the original command info: event_hist_trigger_parse() { last_cmd_set() // <1> 'last_cmd' saves the original command here at first create_actions() { onmatch_create() { action_create() { trace_action_create() { trace_action_create_field_var() { create_field_var_hist() { event_hist_trigger_parse() { // <2> recursively called once hist_err_clear() // <3> 'last_cmd' is cleared here } hist_err() // <4> can no longer find the original command!!! Since 'glob' is an empty string when running into the recursive call, we can check for that and bypass the call of hist_err_clear() to solve it. [1] # cd /sys/kernel/tracing # echo "my_synth_event int v1; int v2; int v3;" >> synthetic_events # echo 'hist:keys=pid' >> events/sched/sched_waking/trigger # echo "hist:keys=next_pid:onmatch(sched.sched_waking).my_synth_event(\ pid,pid1)" >> events/sched/sched_switch/trigger # cat error_log [ 8.405018] hist:sched:sched_switch: error: Couldn't find synthetic event Command: hist:keys=next_pid:onmatch(sched.sched_waking).my_synth_event(pid,pid1) ^ [ 8.816902] hist:sched:sched_switch: error: Couldn't find field Command: hist:keys=next_pid:onmatch(sched.sched_waking).my_synth_event(pid,pid1) ^ [ 8.816902] hist:sched:sched_switch: error: Couldn't parse field variable Command: hist:keys=next_pid:onmatch(sched.sched_waking).my_synth_event(pid,pid1) ^ [ 8.999880] : error: Couldn't find field Command: ^ [ 8.999880] : error: Couldn't parse field variable Command: ^ [ 8.999880] : error: Couldn't find field Command: ^ [ 8.999880] : error: Couldn't create histogram for field Command: ^ Link: https://lore.kernel.org/linux-trace-kernel/20221207135326.3483216-1-zhengyejian1@huawei.com Cc: <mhiramat@kernel.org> Cc: <zanussi@kernel.org> Fixes: f404da6e1d46 ("tracing: Add 'last error' error facility for hist triggers") Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31  tracing/user_events: Fix call print_fmt leak  (Beau Belgrave)
[ Upstream commit 4bded7af8b9af6e97514b0521004f90267905aef ] If user_event_trace_register() fails within user_event_parse() the call's print_fmt member is not freed. Add kfree call to fix this. Link: https://lkml.kernel.org/r/20221123183248.554-1-beaub@linux.microsoft.com Fixes: aa3b2b4c6692 ("user_events: Add print_fmt generation support for basic types") Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31  padata: Fix list iterator in padata_do_serial()  (Daniel Jordan)
[ Upstream commit 57ddfecc72a6c9941d159543e1c0c0a74fe9afdd ] list_for_each_entry_reverse() assumes that the iterated list is nonempty and that every list_head is embedded in the same type, but its use in padata_do_serial() breaks both rules. This doesn't cause any issues now because padata_priv and padata_list happen to have their list fields at the same offset, but we really shouldn't be relying on that. Fixes: bfde23ce200e ("padata: unbind parallel jobs from specific CPUs") Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
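A hedged sketch of the safer iteration (close to, but not guaranteed to match, the upstream hunk): walk raw list_heads so the loop terminates at the list head even when the list is empty, and only convert entries known to be padata_priv:

	struct list_head *pos;
	struct padata_priv *cur;

	list_for_each_prev(pos, &reorder->list) {
		cur = list_entry(pos, struct padata_priv, list);
		if (cur->seq_nr < padata->seq_nr)
			break;
	}
	list_add(&padata->list, pos);	/* empty list: pos stays at
					 * &reorder->list and the entry is
					 * simply inserted first */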
2022-12-31  padata: Always leave BHs disabled when running ->parallel()  (Daniel Jordan)
[ Upstream commit 34c3a47d20ae55b3600fed733bf96eafe9c500d5 ] A deadlock can happen when an overloaded system runs ->parallel() in the context of the current task: padata_do_parallel ->parallel() pcrypt_aead_enc/dec padata_do_serial spin_lock(&reorder->lock) // BHs still enabled <interrupt> ... __do_softirq ... padata_do_serial spin_lock(&reorder->lock) It's a bug for BHs to be on in _do_serial as Steffen points out, so ensure they're off in the "current task" case like they are in padata_parallel_worker to avoid this situation. Reported-by: syzbot+bc05445bc14148d51915@syzkaller.appspotmail.com Fixes: 4611ce224688 ("padata: allocate work structures for parallel jobs from a pool") Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
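A sketch of the described fix in padata_do_parallel()'s fallback path (reconstructed; the surrounding work-allocation logic is elided):

	} else {
		/* Maximum works limit exceeded, run in the current task,
		 * but with BHs off to match padata_parallel_worker(), so
		 * padata_do_serial()'s reorder->lock cannot deadlock
		 * against softirq on this CPU. */
		local_bh_disable();
		padata->parallel(padata);
		local_bh_enable();
	}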
2022-12-31  bpf: Do not zero-extend kfunc return values  (Björn Töpel)
[ Upstream commit d35af0a7feb077c43ff0233bba5a8c6e75b73e35 ] In BPF all global functions, and BPF helpers return a 64-bit value. For kfunc calls, this is not the case, and they can return e.g. 32-bit values. The return register R0 for kfuncs calls can therefore be marked as subreg_def != DEF_NOT_SUBREG. In general, if a register is marked with subreg_def != DEF_NOT_SUBREG, some archs (where bpf_jit_needs_zext() returns true) require the verifier to insert explicit zero-extension instructions. For kfuncs calls, however, the caller should do sign/zero extension for return values. In other words, the compiler is responsible to insert proper instructions, not the verifier. An example, provided by Yonghong Song: $ cat t.c extern unsigned foo(void); unsigned bar1(void) { return foo(); } unsigned bar2(void) { if (foo()) return 10; else return 20; } $ clang -target bpf -mcpu=v3 -O2 -c t.c && llvm-objdump -d t.o t.o: file format elf64-bpf Disassembly of section .text: 0000000000000000 <bar1>: 0: 85 10 00 00 ff ff ff ff call -0x1 1: 95 00 00 00 00 00 00 00 exit 0000000000000010 <bar2>: 2: 85 10 00 00 ff ff ff ff call -0x1 3: bc 01 00 00 00 00 00 00 w1 = w0 4: b4 00 00 00 14 00 00 00 w0 = 0x14 5: 16 01 01 00 00 00 00 00 if w1 == 0x0 goto +0x1 <LBB1_2> 6: b4 00 00 00 0a 00 00 00 w0 = 0xa 0000000000000038 <LBB1_2>: 7: 95 00 00 00 00 00 00 00 exit If the return value of 'foo()' is used in the BPF program, the proper zero-extension will be done. Currently, the verifier correctly marks, say, a 32-bit return value as subreg_def != DEF_NOT_SUBREG, but will fail performing the actual zero-extension, due to a verifier bug in opt_subreg_zext_lo32_rnd_hi32(). load_reg is not properly set to R0, and the following path will be taken: if (WARN_ON(load_reg == -1)) { verbose(env, "verifier bug. zext_dst is set, but no reg is defined\n"); return -EFAULT; } A longer discussion from v1 can be found in the link below. Correct the verifier by avoiding doing explicit zero-extension of R0 for kfunc calls. Note that R0 will still be marked as a sub-register for return values smaller than 64-bit. Fixes: 83a2881903f3 ("bpf: Account for BPF_FETCH in insn_has_def32()") Link: https://lore.kernel.org/bpf/20221202103620.1915679-1-bjorn@kernel.org/ Suggested-by: Yonghong Song <yhs@meta.com> Signed-off-by: Björn Töpel <bjorn@rivosinc.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221207103540.396496-1-bjorn@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31  blktrace: Fix output non-blktrace event when blk_classic option enabled  (Yang Jihong)
[ Upstream commit f596da3efaf4130ff61cd029558845808df9bf99 ] When the blk_classic option is enabled, non-blktrace events must be filtered out. Otherwise, events of other types are output in the blktrace classic format, which is unexpected. The problem can be triggered in the following ways: # echo 1 > /sys/kernel/debug/tracing/options/blk_classic # echo 1 > /sys/kernel/debug/tracing/events/enable # echo blk > /sys/kernel/debug/tracing/current_tracer # cat /sys/kernel/debug/tracing/trace_pipe Fixes: c71a89615411 ("blktrace: add ftrace plugin") Signed-off-by: Yang Jihong <yangjihong1@huawei.com> Link: https://lore.kernel.org/r/20221122040410.85113-1-yangjihong1@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31  module: Fix NULL vs IS_ERR checking for module_get_next_page  (Miaoqian Lin)
[ Upstream commit 45af1d7aae7d5520d2858f8517a1342646f015db ] The module_get_next_page() function returns error pointers on error instead of NULL. Use IS_ERR() to check the return value to fix this. Fixes: b1ae6dc41eaa ("module: add in-kernel support for decompressing") Signed-off-by: Miaoqian Lin <linmq006@gmail.com> Reviewed-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
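The pattern is easy to get wrong because NULL and ERR_PTR() are different failure conventions. A self-contained user-space illustration (IS_ERR reimplemented with the kernel's MAX_ERRNO trick; get_page_or_err() is a made-up stand-in for module_get_next_page()):

	#include <stdio.h>

	#define MAX_ERRNO 4095
	#define IS_ERR(p) ((unsigned long)(p) >= (unsigned long)-MAX_ERRNO)

	/* Stand-in with module_get_next_page()-like semantics: returns an
	 * ERR_PTR()-style encoded errno on failure, never NULL. */
	static void *get_page_or_err(int fail)
	{
		return fail ? (void *)(long)-12 /* ERR_PTR(-ENOMEM) */
			    : (void *)"page";
	}

	int main(void)
	{
		void *p = get_page_or_err(1);

		if (!p)			/* the buggy check: never fires */
			puts("NULL check fired");
		if (IS_ERR(p))		/* the check the fix switches to */
			puts("IS_ERR check fired");
		return 0;
	}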
2022-12-31  bpf: propagate precision across all frames, not just the last one  (Andrii Nakryiko)
[ Upstream commit 529409ea92d590659be487ba0839710329bd8074 ] When equivalent completed state is found and it has additional precision restrictions, BPF verifier propagates precision to currently-being-verified state chain (i.e., including parent states) so that if some of the states in the chain are not yet completed, necessary precision restrictions are enforced. Unfortunately, right now this happens only for the last frame (deepest active subprogram's frame), not all the frames. This can lead to incorrect matching of states due to missing precision marker. Currently this doesn't seem possible as BPF verifier forces everything to precise when validated BPF program has any subprograms. But with the next patch lifting this restriction, this becomes problematic. In fact, without this fix, we'll start getting failure in one of the existing test_verifier test cases: #906/p precise: cross frame pruning FAIL Unexpected success to load! verification time 48 usec stack depth 0+0 processed 26 insns (limit 1000000) max_states_per_insn 3 total_states 17 peak_states 17 mark_read 8 This patch adds precision propagation across all frames. Fixes: a3ce685dd01a ("bpf: fix precision tracking") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20221104163649.121784-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31  bpf: propagate precision in ALU/ALU64 operations  (Andrii Nakryiko)
[ Upstream commit a3b666bfa9c9edc05bca62a87abafe0936bd7f97 ] When processing ALU/ALU64 operations (apart from BPF_MOV, which is handled correctly already; and BPF_NEG and BPF_END are special and don't have source register), if destination register is already marked precise, this causes problem with potentially missing precision tracking for the source register. E.g., when we have r1 >>= r5 and r1 is marked precise, but r5 isn't, this will lead to r5 staying as imprecise. This is due to the precision backtracking logic stopping early when it sees r1 is already marked precise. If r1 wasn't precise, we'd keep backtracking and would add r5 to the set of registers that need to be marked precise. So there is a discrepancy here which can lead to invalid and incompatible states matched due to lack of precision marking on r5. If r1 wasn't precise, precision backtracking would correctly mark both r1 and r5 as precise. This is simple to fix, though. During the forward instruction simulation pass, for arithmetic operations of `scalar <op>= scalar` form (where <op> is ALU or ALU64 operations), if destination register is already precise, mark source register as precise. This applies only when both involved registers are SCALARs. `ptr += scalar` and `scalar += ptr` cases are already handled correctly. This does have (negative) effect on some selftest programs and few Cilium programs. ~/baseline-tmp-results.csv are veristat results with this patch, while ~/baseline-results.csv is without it. See post scriptum for instructions on how to make Cilium programs testable with veristat. Correctness has a price. $ ./veristat -C -e file,prog,insns,states ~/baseline-results.csv ~/baseline-tmp-results.csv | grep -v '+0' File Program Total insns (A) Total insns (B) Total insns (DIFF) Total states (A) Total states (B) Total states (DIFF) ----------------------- -------------------- --------------- --------------- ------------------ ---------------- ---------------- ------------------- bpf_cubic.bpf.linked1.o bpf_cubic_cong_avoid 997 1700 +703 (+70.51%) 62 90 +28 (+45.16%) test_l4lb.bpf.linked1.o balancer_ingress 4559 5469 +910 (+19.96%) 118 126 +8 (+6.78%) ----------------------- -------------------- --------------- --------------- ------------------ ---------------- ---------------- ------------------- $ ./veristat -C -e file,prog,verdict,insns,states ~/baseline-results-cilium.csv ~/baseline-tmp-results-cilium.csv | grep -v '+0' File Program Total insns (A) Total insns (B) Total insns (DIFF) Total states (A) Total states (B) Total states (DIFF) ------------- ------------------------------ --------------- --------------- ------------------ ---------------- ---------------- ------------------- bpf_host.o tail_nodeport_nat_ingress_ipv6 4448 5261 +813 (+18.28%) 234 247 +13 (+5.56%) bpf_host.o tail_nodeport_nat_ipv6_egress 3396 3446 +50 (+1.47%) 201 203 +2 (+1.00%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 4448 5261 +813 (+18.28%) 234 247 +13 (+5.56%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 4448 5261 +813 (+18.28%) 234 247 +13 (+5.56%) bpf_xdp.o tail_lb_ipv4 71736 73442 +1706 (+2.38%) 4295 4370 +75 (+1.75%) ------------- ------------------------------ --------------- --------------- ------------------ ---------------- ---------------- ------------------- P.S. To make Cilium ([0]) programs libbpf-compatible and thus veristat-loadable, apply changes from topmost commit in [1], which does minimal changes to Cilium source code, mostly around SEC() annotations and BPF map definitions. 
[0] https://github.com/cilium/cilium/ [1] https://github.com/anakryiko/cilium/commits/libbpf-friendliness Fixes: b5dc0163d8fd ("bpf: precise scalar_value tracking") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20221104163649.121784-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31  bpf: Fix slot type check in check_stack_write_var_off  (Kumar Kartikeya Dwivedi)
[ Upstream commit f5e477a861e4a20d8a1c5f7a245f3a3c3c376b03 ] For the case where allow_ptr_leaks is false, code is checking whether slot type is STACK_INVALID and STACK_SPILL and rejecting other cases. This is a consequence of incorrectly checking for register type instead of the slot type (NOT_INIT and SCALAR_VALUE respectively). Fix the check. Fixes: 01f810ace9ed ("bpf: Allow variable-offset stack access") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221103191013.1236066-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31  bpf: Clobber stack slot when writing over spilled PTR_TO_BTF_ID  (Kumar Kartikeya Dwivedi)
[ Upstream commit 261f4664caffdeb9dff4e83ee3c0334b1c3a552f ] When support was added for spilled PTR_TO_BTF_ID to be accessed by helper memory access, the stack slot was not overwritten to STACK_MISC (and that too is only safe when env->allow_ptr_leaks is true). This means that helpers who take ARG_PTR_TO_MEM and write to it may essentially overwrite the value while the verifier continues to track the slot for spilled register. This can cause issues when PTR_TO_BTF_ID is spilled to stack, and then overwritten by helper write access, which can then be passed to BPF helpers or kfuncs. Handle this by falling back to the case introduced in a later commit, which will also handle PTR_TO_BTF_ID along with other pointer types, i.e. cd17d38f8b28 ("bpf: Permits pointers on stack for helper calls"). Finally, include a comment on why REG_LIVE_WRITTEN is not being set when clobber is set to true. In short, the reason is that while when clobber is unset, we know that we won't be writing, when it is true, we *may* write to any of the stack slots in that range. It may be a partial or complete write, to just one or many stack slots. We cannot be sure, hence to be conservative, we leave things as is and never set REG_LIVE_WRITTEN for any stack slot. However, clobber still needs to reset them to STACK_MISC assuming writes happened. However read marks still need to be propagated upwards from liveness point of view, as parent stack slot's contents may still continue to matter to child states. Cc: Yonghong Song <yhs@meta.com> Fixes: 1d68f22b3d53 ("bpf: Handle spilled PTR_TO_BTF_ID properly when checking stack_boundary") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221103191013.1236066-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31  relay: fix type mismatch when allocating memory in relay_create_buf()  (Gavrilov Ilia)
[ Upstream commit 4d8586e04602fe42f0a782d2005956f8b6302678 ] The 'padding' field of the 'rchan_buf' structure is an array of 'size_t' elements, but the memory is allocated for an array of 'size_t *' elements. Found by Linux Verification Center (linuxtesting.org) with SVACE. Link: https://lkml.kernel.org/r/20221129092002.3538384-1-Ilia.Gavrilov@infotecs.ru Fixes: b86ff981a825 ("[PATCH] relay: migrate from relayfs to a generic relay API") Signed-off-by: Ilia.Gavrilov <Ilia.Gavrilov@infotecs.ru> Cc: Colin Ian King <colin.i.king@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: wuchi <wuchi.zero@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
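The robust idiom that prevents this class of mismatch is to size the allocation off the object being assigned rather than repeating a type name; a sketch of the corrected call (allocator and count argument reconstructed from relay_create_buf(), not guaranteed verbatim):

	buf->padding = kmalloc_array(chan->n_subbufs, sizeof(*buf->padding),
				     GFP_KERNEL);

sizeof(*buf->padding) tracks the element type automatically, so the code stays correct even if 'padding' ever changes type. On common targets sizeof(size_t) happens to equal sizeof(size_t *), which is why the bug was benign in practice and only a static analyzer caught it.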
2022-12-31  cpu/hotplug: Do not bail-out in DYING/STARTING sections  (Vincent Donnefort)
[ Upstream commit 6f855b39e4602b6b42a8e5cbcfefb8a1b8b5f0be ] The DYING/STARTING callbacks are not expected to fail. However, as reported by Derek, buggy drivers such as tboot are still free to return errors within those sections, which halts the hot(un)plug and leaves the CPU in an unrecoverable state. As there is no rollback possible, only log the failures and proceed with the following steps. This restores the hotplug behaviour prior to commit 453e41085183 ("cpu/hotplug: Add cpuhp_invoke_callback_range()") Fixes: 453e41085183 ("cpu/hotplug: Add cpuhp_invoke_callback_range()") Reported-by: Derek Dolney <z23@posteo.net> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Derek Dolney <z23@posteo.net> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=215867 Link: https://lore.kernel.org/r/20220927101259.1149636-1-vdonnefort@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31  cpu/hotplug: Make target_store() a nop when target == state  (Phil Auld)
[ Upstream commit 64ea6e44f85b9b75925ebe1ba0e6e8430cc4e06f ] Writing the current state back in hotplug/target calls cpu_down() which will set cpu dying even when it isn't and then nothing will ever clear it. A stress test that reads values and writes them back for all cpu device files in sysfs will trigger the BUG() in select_fallback_rq once all cpus are marked as dying. kernel/cpu.c::target_store() ... if (st->state < target) ret = cpu_up(dev->id, target); else ret = cpu_down(dev->id, target); cpu_down() -> cpu_set_state() bool bringup = st->state < target; ... if (cpu_dying(cpu) != !bringup) set_cpu_dying(cpu, !bringup); Fix this by letting state==target fall through in the target_store() conditional. Also make sure st->target == target in that case. Fixes: 757c989b9994 ("cpu/hotplug: Make target state writeable") Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20221117162329.3164999-2-pauld@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
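A sketch of the fixed conditional, following the commit's description (close to the upstream hunk but not guaranteed verbatim):

	if (st->state < target)
		ret = cpu_up(dev->id, target);
	else if (st->state > target)
		ret = cpu_down(dev->id, target);
	else if (WARN_ON(st->target != target))	/* state == target: nop */
		st->target = target;

Letting the state == target case fall through means a read-modify-write of hotplug/target no longer flips the cpu_dying bit on a CPU that is not actually changing state.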
2022-12-31  futex: Resend potentially swallowed owner death notification  (Alexey Izbyshev)
[ Upstream commit 90d758896787048fa3d4209309d4800f3920e66f ] Commit ca16d5bee598 ("futex: Prevent robust futex exit race") addressed two cases when tasks waiting on a robust non-PI futex remained blocked despite the futex not being owned anymore: * if the owner died after writing zero to the futex word, but before waking up a waiter * if a task waiting on the futex was woken up, but died before updating the futex word (effectively swallowing the notification without acting on it) In the second case, the task could be woken up either by the previous owner (after the futex word was reset to zero) or by the kernel (after the OWNER_DIED bit was set and the TID part of the futex word was reset to zero) if the previous owner died without resetting the futex. Because the referenced commit wakes up a potential waiter only if the whole futex word is zero, the latter subcase remains unaddressed. Fix this by looking only at the TID part of the futex when deciding whether a wake up is needed. Fixes: ca16d5bee598 ("futex: Prevent robust futex exit race") Signed-off-by: Alexey Izbyshev <izbyshev@ispras.ru> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20221111215439.248185-1-izbyshev@ispras.ru Signed-off-by: Sasha Levin <sashal@kernel.org>
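The decision boils down to masking before comparing. A self-contained user-space illustration using the real constants from <linux/futex.h> (should_wake() is a made-up helper name):

	#include <linux/futex.h>
	#include <stdint.h>
	#include <stdio.h>

	/* Wake a waiter when the TID part of the futex word is zero,
	 * regardless of the OWNER_DIED/WAITERS bits, per the fix above. */
	static int should_wake(uint32_t uval)
	{
		return (uval & FUTEX_TID_MASK) == 0;	/* old buggy check: uval == 0 */
	}

	int main(void)
	{
		printf("%d\n", should_wake(0));			/* 1: owner reset the word */
		printf("%d\n", should_wake(FUTEX_OWNER_DIED));	/* 1: kernel reset the TID,
								 * missed by 'uval == 0' */
		printf("%d\n", should_wake(1234));		/* 0: still owned by TID 1234 */
		return 0;
	}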
2022-12-31  genirq/irqdesc: Don't try to remove non-existing sysfs files  (Yang Yingliang)
[ Upstream commit 9049e1ca41983ab773d7ea244bee86d7835ec9f5 ] Fault injection tests trigger warnings like this: kernfs: can not remove 'chip_name', no directory WARNING: CPU: 0 PID: 253 at fs/kernfs/dir.c:1616 kernfs_remove_by_name_ns+0xce/0xe0 RIP: 0010:kernfs_remove_by_name_ns+0xce/0xe0 Call Trace: <TASK> remove_files.isra.1+0x3f/0xb0 sysfs_remove_group+0x68/0xe0 sysfs_remove_groups+0x41/0x70 __kobject_del+0x45/0xc0 kobject_del+0x29/0x40 free_desc+0x42/0x70 irq_free_descs+0x5e/0x90 The reason is that the interrupt descriptor sysfs handling does not roll back on a failing kobject_add() during allocation. If the descriptor is freed later on, kobject_del() is invoked on a kobject that was never added, resulting in the above warnings. A proper rollback in case of a kobject_add() failure would be the straightforward solution. But this is not possible due to the way the interrupt descriptor sysfs handling works. Interrupt descriptors are allocated before sysfs becomes available, so the sysfs files for the early allocated descriptors are added later in the boot process. At that point nothing useful can be done about a failing kobject_add(). For consistency, the interrupt descriptor allocation always treats kobject_add() failures as non-critical and just emits a warning. To solve this problem, keep track in the interrupt descriptor whether kobject_add() was successful or not and make the invocation of kobject_del() conditional on that. [ tglx: Massage changelog, comments and use a state bit. ] Fixes: ecb3f394c5db ("genirq: Expose interrupt information through sysfs") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20221128151612.1786122-1-yangyingliang@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31  perf: Fix possible memleak in pmu_dev_alloc()  (Chen Zhongjin)
[ Upstream commit e8d7a90c08ce963c592fb49845f2ccc606a2ac21 ] In pmu_dev_alloc(), when dev_set_name() fails, it goes to free_dev and calls put_device(pmu->dev) to release it. However, pmu->dev->release is assigned after this point, which causes the warning and memleak below. Call dev_set_name() after pmu->dev->release = pmu_dev_release to fix it. Device '(null)' does not have a release() function... WARNING: CPU: 2 PID: 441 at drivers/base/core.c:2332 device_release+0x1b9/0x240 ... Call Trace: <TASK> kobject_put+0x17f/0x460 put_device+0x20/0x30 pmu_dev_alloc+0x152/0x400 perf_pmu_register+0x96b/0xee0 ... kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak) unreferenced object 0xffff888014759000 (size 2048): comm "modprobe", pid 441, jiffies 4294931444 (age 38.332s) backtrace: [<0000000005aed3b4>] kmalloc_trace+0x27/0x110 [<000000006b38f9b8>] pmu_dev_alloc+0x50/0x400 [<00000000735f17be>] perf_pmu_register+0x96b/0xee0 [<00000000e38477f1>] 0xffffffffc0ad8603 [<000000004e162216>] do_one_initcall+0xd0/0x4e0 ... Fixes: abe43400579d ("perf: Sysfs enumeration") Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221111103653.91058-1-chenzhongjin@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
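A sketch of the reordering the commit describes (context simplified from pmu_dev_alloc(); exact surrounding calls reconstructed, not guaranteed verbatim):

	device_initialize(pmu->dev);

	pmu->dev->release = pmu_dev_release;	/* set before anything can fail */

	ret = dev_set_name(pmu->dev, "%s", pmu->name);
	if (ret)
		goto free_dev;			/* put_device() can now free cleanly */

Once the release callback is installed before the first fallible call, the error path's put_device() both silences the warning and actually frees the allocation.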
2022-12-31  PM: hibernate: Fix mistake in kerneldoc comment  (xiongxin)
[ Upstream commit 6e5d7300cbe7c3541bc31f16db3e9266e6027b4b ] The actual maximum image size formula in hibernate_preallocate_memory() is as follows: max_size = (count - (size + PAGES_FOR_IO)) / 2 - 2 * DIV_ROUND_UP(reserved_size, PAGE_SIZE); but the one in the kerneldoc comment of the function is different and incorrect. Fixes: ddeb64870810 ("PM / Hibernate: Add sysfs knob to control size of memory for drivers") Signed-off-by: xiongxin <xiongxin@kylinos.cn> [ rjw: Subject and changelog rewrite ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31  sched/psi: Fix possible missing or delayed pending event  (Hao Lee)
[ Upstream commit e38f89af6a13e895805febd3a329a13ab7e66fa4 ] When a pending event exists and growth is less than the threshold, the current logic is to skip this trigger without generating event. However, from e6df4ead85d9 ("psi: fix possible trigger missing in the window"), our purpose is to generate event as long as pending event exists and the rate meets the limit, no matter what growth is. This patch handles this case properly. Fixes: e6df4ead85d9 ("psi: fix possible trigger missing in the window") Signed-off-by: Hao Lee <haolee.swjtu@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Suren Baghdasaryan <surenb@google.com> Link: https://lore.kernel.org/r/20220919072356.GA29069@haolee.io Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31  sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition  (Qais Yousef)
[ Upstream commit d81304bc6193554014d4372a01debdf65e1e9a4d ] If the utilization of the woken-up task is 0, we skip the energy calculation because it has no impact. But if the task is boosted (uclamp_min != 0), it will have an impact on task placement and frequency selection. Only skip if the util is truly 0 after applying uclamp values. Change uclamp_task_cpu()'s signature to avoid unnecessary additional calls to uclamp_eff_get(); feec() is the only user now. Fixes: 732cd75b8c920 ("sched/fair: Select an energy-efficient CPU on task wake-up") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220804143609.515789-8-qais.yousef@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31  sched/uclamp: Make cpu_overutilized() use util_fits_cpu()  (Qais Yousef)
[ Upstream commit c56ab1b3506ba0e7a872509964b100912bde165d ] So that it is now uclamp aware. This fixes a major problem of busy tasks capped with UCLAMP_MAX keeping the system in overutilized state which disables EAS and leads to wasting energy in the long run. Without this patch running a busy background activity like JIT compilation on Pixel 6 causes the system to be in overutilized state 74.5% of the time. With this patch this goes down to 9.79%. It also fixes another problem when long running tasks that have their UCLAMP_MIN changed while running such that they need to upmigrate to honour the new UCLAMP_MIN value. The upmigration doesn't get triggered because overutilized state never gets set in this state, hence misfit migration never happens at tick in this case until the task wakes up again. Fixes: af24bde8df202 ("sched/uclamp: Add uclamp support to energy_compute()") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220804143609.515789-7-qais.yousef@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31  sched/uclamp: Make asym_fits_capacity() use util_fits_cpu()  (Qais Yousef)
[ Upstream commit a2e7f03ed28fce26c78b985f87913b6ce3accf9d ] Use the new util_fits_cpu() to ensure migration margin and capacity pressure are taken into account correctly when uclamp is being used otherwise we will fail to consider CPUs as fitting in scenarios where they should. s/asym_fits_capacity/asym_fits_cpu/ to better reflect what it does now. Fixes: b4c9c9f15649 ("sched/fair: Prefer prev cpu in asymmetric wakeup path") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220804143609.515789-6-qais.yousef@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31  sched/core: Introduce sched_asym_cpucap_active()  (Dietmar Eggemann)
[ Upstream commit 740cf8a760b73e8375bfb4bedcbe9746183350f9 ] Create an inline helper for conditional code to be only executed on asymmetric CPU capacity systems. This makes these (currently ~10 and future) conditions a lot more readable. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20220729111305.1275158-2-dietmar.eggemann@arm.com Stable-dep-of: a2e7f03ed28f ("sched/uclamp: Make asym_fits_capacity() use util_fits_cpu()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31  sched/uclamp: Make select_idle_capacity() use util_fits_cpu()  (Qais Yousef)
[ Upstream commit b759caa1d9f667b94727b2ad12589cbc4ce13a82 ] Use the new util_fits_cpu() to ensure migration margin and capacity pressure are taken into account correctly when uclamp is being used otherwise we will fail to consider CPUs as fitting in scenarios where they should. Fixes: b4c9c9f15649 ("sched/fair: Prefer prev cpu in asymmetric wakeup path") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220804143609.515789-5-qais.yousef@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31  sched/uclamp: Fix fits_capacity() check in feec()  (Qais Yousef)
[ Upstream commit 244226035a1f9b2b6c326e55ae5188fab4f428cb ] As reported by Yun Hsiang [1], if a task has its uclamp_min >= 0.8 * 1024, it'll always pick the previous CPU because fits_capacity() will always return false in this case. The new util_fits_cpu() logic should handle this correctly for us beside more corner cases where similar failures could occur, like when using UCLAMP_MAX. We open code uclamp_rq_util_with() except for the clamp() part, util_fits_cpu() needs the 'raw' values to be passed to it. Also introduce uclamp_rq_{set, get}() shorthand accessors to get uclamp value for the rq. Makes the code more readable and ensures the right rules (use READ_ONCE/WRITE_ONCE) are respected transparently. [1] https://lists.linaro.org/pipermail/eas-dev/2020-July/001488.html Fixes: 1d42509e475c ("sched/fair: Make EAS wakeup placement consider uclamp restrictions") Reported-by: Yun Hsiang <hsiang023167@gmail.com> Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220804143609.515789-4-qais.yousef@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
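The accessors the commit introduces are thin READ_ONCE()/WRITE_ONCE() wrappers; a sketch of their likely shape (field names assumed from the uclamp code, not guaranteed verbatim):

	static inline unsigned int uclamp_rq_get(struct rq *rq,
						 enum uclamp_id clamp_id)
	{
		return READ_ONCE(rq->uclamp[clamp_id].value);
	}

	static inline void uclamp_rq_set(struct rq *rq, enum uclamp_id clamp_id,
					 unsigned int value)
	{
		WRITE_ONCE(rq->uclamp[clamp_id].value, value);
	}

Funneling every access through one pair of helpers makes the READ_ONCE/WRITE_ONCE rule impossible to forget at call sites.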
2022-12-31  sched/uclamp: Make task_fits_capacity() use util_fits_cpu()  (Qais Yousef)
[ Upstream commit b48e16a69792b5dc4a09d6807369d11b2970cc36 ] So that the new uclamp rules in regard to migration margin and capacity pressure are taken into account correctly. Fixes: a7008c07a568 ("sched/fair: Make task_fits_capacity() consider uclamp restrictions") Co-developed-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220804143609.515789-3-qais.yousef@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31  sched/uclamp: Fix relationship between uclamp and migration margin  (Qais Yousef)
[ Upstream commit 48d5e9daa8b767e75ed9421665b037a49ce4bc04 ] fits_capacity() verifies that a util is within a 20% margin of the capacity of a CPU, which is an attempt to speed up upmigration. But when uclamp is used, this 20% margin is problematic because, for example, if a task is boosted to 1024, then it will not fit on any CPU according to the fits_capacity() logic. Or if a task is boosted to capacity_orig_of(medium_cpu), it will end up on a big CPU instead of on the desired medium CPU. Similar corner cases exist for uclamp and the usage of capacity_of(). The slightest irq pressure on the biggest CPU, for example, will make a 1024-boosted task look like it can't fit. What we really want is for uclamp comparisons to ignore the migration margin and capacity pressure, yet retain them when checking the _actual_ util signal. For example, task p: p->util_avg = 300 p->uclamp[UCLAMP_MIN] = 1024 will fit a big CPU. But p->util_avg = 900 p->uclamp[UCLAMP_MIN] = 1024 will not; this should trigger the overutilized state because the big CPU is now *actually* being saturated. Similar reasoning applies to capping tasks with UCLAMP_MAX. For example: p->util_avg = 1024 p->uclamp[UCLAMP_MAX] = capacity_orig_of(medium_cpu) should fit the task on medium CPUs without triggering the overutilized state. Inlined comments expand more on the desired behavior in more scenarios. Introduce a new util_fits_cpu() function which encapsulates the new logic. The new function is not used anywhere yet, but will be used to update various users of fits_capacity() in later patches. Fixes: af24bde8df202 ("sched/uclamp: Add uclamp support to energy_compute()") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220804143609.515789-2-qais.yousef@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31  seccomp: Move copy_seccomp() to no failure path.  (Kuniyuki Iwashima)
[ Upstream commit a1140cb215fa13dcec06d12ba0c3ee105633b7c4 ] Our syzbot instance reported memory leaks in do_seccomp() [0], similar to the report [1]. It shows that we miss freeing struct seccomp_filter and some objects included in it. We can reproduce the issue with the program below [2] which calls one seccomp() and two clone() syscalls. The first clone()d child exits earlier than its parent and sends a signal to kill it during the second clone(), more precisely before the fatal_signal_pending() test in copy_process(). When the parent receives the signal, it has to destroy the embryonic process and return -EINTR to user space. In the failure path, we have to call seccomp_filter_release() to decrement the filter's refcount. Initially, we called it in free_task() called from the failure path, but the commit 3a15fb6ed92c ("seccomp: release filter after task is fully dead") moved it to release_task() to notify user space as early as possible that the filter is no longer used. To keep the change and current seccomp refcount semantics, let's move copy_seccomp() just after the signal check and add a WARN_ON_ONCE() in free_task() for future debugging. [0]: unreferenced object 0xffff8880063add00 (size 256): comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.914s) hex dump (first 32 bytes): 01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................ ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ backtrace: do_seccomp (./include/linux/slab.h:600 ./include/linux/slab.h:733 kernel/seccomp.c:666 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) unreferenced object 0xffffc90000035000 (size 4096): comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.915s) hex dump (first 32 bytes): 01 00 00 00 00 00 00 00 00 00 00 00 05 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: __vmalloc_node_range (mm/vmalloc.c:3226) __vmalloc_node (mm/vmalloc.c:3261 (discriminator 4)) bpf_prog_alloc_no_stats (kernel/bpf/core.c:91) bpf_prog_alloc (kernel/bpf/core.c:129) bpf_prog_create_from_user (net/core/filter.c:1414) do_seccomp (kernel/seccomp.c:671 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) unreferenced object 0xffff888003fa1000 (size 1024): comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.915s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: bpf_prog_alloc_no_stats (./include/linux/slab.h:600 ./include/linux/slab.h:733 kernel/bpf/core.c:95) bpf_prog_alloc (kernel/bpf/core.c:129) bpf_prog_create_from_user (net/core/filter.c:1414) do_seccomp (kernel/seccomp.c:671 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) unreferenced object 0xffff888006360240 (size 16): comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.915s) hex dump (first 16 bytes): 01 00 37 00 76 65 72 6c e0 83 01 06 80 88 ff ff ..7.verl........ 
backtrace: bpf_prog_store_orig_filter (net/core/filter.c:1137) bpf_prog_create_from_user (net/core/filter.c:1428) do_seccomp (kernel/seccomp.c:671 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) unreferenced object 0xffff8880060183e0 (size 8): comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.915s) hex dump (first 8 bytes): 06 00 00 00 00 00 ff 7f ........ backtrace: kmemdup (mm/util.c:129) bpf_prog_store_orig_filter (net/core/filter.c:1144) bpf_prog_create_from_user (net/core/filter.c:1428) do_seccomp (kernel/seccomp.c:671 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) [1]: https://syzkaller.appspot.com/bug?id=2809bb0ac77ad9aa3f4afe42d6a610aba594a987 [2]: #define _GNU_SOURCE #include <sched.h> #include <signal.h> #include <unistd.h> #include <sys/syscall.h> #include <linux/filter.h> #include <linux/seccomp.h> void main(void) { struct sock_filter filter[] = { BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW), }; struct sock_fprog fprog = { .len = sizeof(filter) / sizeof(filter[0]), .filter = filter, }; long i, pid; syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER, 0, &fprog); for (i = 0; i < 2; i++) { pid = syscall(__NR_clone, CLONE_NEWNET | SIGKILL, NULL, NULL, 0); if (pid == 0) return; } } Fixes: 3a15fb6ed92c ("seccomp: release filter after task is fully dead") Reported-by: syzbot+ab17848fe269b573eb71@syzkaller.appspotmail.com Reported-by: Ayushman Dutta <ayudutta@amazon.com> Suggested-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220823154532.82913-1-kuniyu@amazon.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-21  bpf: Take module reference on kprobe_multi link  (Jiri Olsa)
commit e22061b2d3095c12f90336479f24bf5eeb70e1bd upstream. Currently we allow to create kprobe multi link on function from kernel module, but we don't take the module reference to ensure it's not unloaded while we are tracing it. The multi kprobe link is based on fprobe/ftrace layer which takes different approach and releases ftrace hooks when module is unloaded even if there's tracer registered on top of it. Adding code that gathers all the related modules for the link and takes their references before it's attached. All kernel module references are released after link is unregistered. Note that we do it the same way already for trampoline probes (but for single address). Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20221025134148.3300700-5-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-21  bpf: Rename __bpf_kprobe_multi_cookie_cmp to bpf_kprobe_multi_addrs_cmp  (Jiri Olsa)
commit 1a1b0716d36d21f8448bd7d3f1c0ade7230bb294 upstream. Renaming __bpf_kprobe_multi_cookie_cmp to bpf_kprobe_multi_addrs_cmp, because it's more suitable to current and upcoming code. Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20221025134148.3300700-4-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-21  ftrace: Add support to resolve module symbols in ftrace_lookup_symbols  (Jiri Olsa)
commit 3640bf8584f4ab0f5eed6285f09213954acd8b62 upstream. Currently ftrace_lookup_symbols iterates only over core symbols; add a module_kallsyms_on_each_symbol call to check module symbols as well. Also remove the 'args.found == args.cnt' condition, because it's already checked in the kallsyms_callback function, and remove the 'err < 0' check, because neither *kallsyms_on_each_symbol function returns an error. Reported-by: Martynas Pumputis <m@lambda.lt> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20221025134148.3300700-3-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-21  kallsyms: Make module_kallsyms_on_each_symbol generally available  (Jiri Olsa)
commit 73feb8d5fa3b755bb51077c0aabfb6aa556fd498 upstream. Making module_kallsyms_on_each_symbol generally available, so it can be used outside CONFIG_LIVEPATCH option in following changes. Rather than adding another ifdef option let's make the function generally available (when CONFIG_KALLSYMS and CONFIG_MODULES options are defined). Cc: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20221025134148.3300700-2-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-19  perf: Fix perf_pending_task() UaF  (Peter Zijlstra)
[ Upstream commit 517e6a301f34613bff24a8e35b5455884f2d83d8 ] Per syzbot it is possible for perf_pending_task() to run after the event is free()'d. There are two related but distinct cases: - the task_work was already queued before destroying the event; - destroying the event itself queues the task_work. The first cannot be solved using task_work_cancel() since perf_release() itself might be called from a task_work (____fput), which means the current->task_works list is already empty and task_work_cancel() won't be able to find the perf_pending_task() entry. The simplest alternative is extending the perf_event lifetime to cover the task_work. The second is just silly, queueing a task_work while you know the event is going away makes no sense and is easily avoided by re-arranging how the event is marked STATE_DEAD and ensuring it goes through STATE_OFF on the way down. Reported-by: syzbot+9228d6098455bb209ec8@syzkaller.appspotmail.com Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marco Elver <elver@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-14  memcg: fix possible use-after-free in memcg_write_event_control()  (Tejun Heo)
commit 4a7ba45b1a435e7097ca0f79a847d0949d0eb088 upstream. memcg_write_event_control() accesses the dentry->d_name of the specified control fd to route the write call. As a cgroup interface file can't be renamed, it's safe to access d_name as long as the specified file is a regular cgroup file. Also, as these cgroup interface files can't be removed before the directory, it's safe to access the parent too. Prior to 347c4a874710 ("memcg: remove cgroup_event->cft"), there was a call to __file_cft() which verified that the specified file is a regular cgroupfs file before further accesses. The cftype pointer returned from __file_cft() was no longer necessary and the commit inadvertently dropped the file type check with it, allowing any file to slip through. With the invariants broken, the d_name and parent accesses can now race against renames and removals of arbitrary files and cause use-after-frees. Fix the bug by resurrecting the file type check in __file_cft(). Now that cgroupfs is implemented through kernfs, checking the file operations needs to go through a layer of indirection. Instead, let's check the superblock and dentry type. Link: https://lkml.kernel.org/r/Y5FRm/cfcKPGzWwl@slm.duckdns.org Fixes: 347c4a874710 ("memcg: remove cgroup_event->cft") Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jann Horn <jannh@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> [3.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-08proc: proc_skip_spaces() shouldn't think it is working on C stringsLinus Torvalds
commit bce9332220bd677d83b19d21502776ad555a0e73 upstream. proc_skip_spaces() seems to think it is working on C strings, and ends up being just a wrapper around skip_spaces() with a really odd calling convention. Instead of basing it on skip_spaces(), it should have looked more like proc_skip_char(), which really is the exact same function (except it skips a particular character, rather than whitespace). So use that as inspiration, odd coding and all. Now the calling convention actually makes sense and works for the intended purpose. Reported-and-tested-by: Kyle Zeng <zengyhkyle@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
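[Editor's note: a rough illustration of the idea, not the upstream patch; the function name and return convention here are assumptions.]

 /* Skip whitespace within an explicit length instead of trusting a NUL
  * terminator, mirroring proc_skip_char()'s style.
  */
 static size_t skip_spaces_bounded(char **buf, size_t size)
 {
         size_t skipped = 0;

         while (skipped < size && isspace((unsigned char)(*buf)[skipped]))
                 skipped++;
         *buf += skipped;
         return skipped;         /* caller shrinks its remaining size */
 }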
2022-12-08proc: avoid integer type confusion in get_proc_longLinus Torvalds
commit e6cfaf34be9fcd1a8285a294e18986bfc41a409c upstream. proc_get_long() is passed a size_t, but then assigns it to an 'int' variable for the length. Let's not do that, even if our IO paths are limited to MAX_RW_COUNT (exactly because of these kinds of type errors). So do the proper test in the right type. Reported-by: Kyle Zeng <zengyhkyle@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
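[Editor's note: a sketch of the class of fix, with illustrative names and an illustrative bound; the point is to do the bounds test in size_t rather than after narrowing to int.]

 static int get_long_bounded(const char **buf, size_t *size, unsigned long *val)
 {
         size_t len = *size;     /* was effectively: int len = *size; -- truncates */

         if (len > 24)           /* bound compared in the right type */
                 len = 24;
         /* ... parse at most len bytes of *buf into *val,
          *     then advance *buf and shrink *size ... */
         return 0;
 }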
2022-12-08tracing: Free buffers when a used dynamic event is removedSteven Rostedt (Google)
commit 4313e5a613049dfc1819a6dfb5f94cf2caff9452 upstream.

After 65536 dynamic events have been added and removed, the "type" field of the event then uses the first type number that is available (not currently used by other events). A type number is the identifier of a binary blob in the tracing ring buffer (known as an event), mapping it to the logic that can parse the binary blob.

The issue is that if a dynamic event (like a kprobe event) is traced and is in the ring buffer, and that event is then removed (because it is dynamic, which means it can be created and destroyed), and another dynamic event is created that has the same type number, the new event's logic for parsing the binary blob will be used on the old event's data.

To show how this can be an issue, the following can crash the kernel:

 # cd /sys/kernel/tracing
 # for i in `seq 65536`; do
     echo 'p:kprobes/foo do_sys_openat2 $arg1:u32' > kprobe_events
   done

For every iteration of the above, writing to kprobe_events removes the old event and creates a new one (with the same format) and increases the type number to the next available one, until the type number goes over 65535, which is the maximum for the 16-bit type field. After it reaches that number, the logic to allocate a new number simply looks for the next available number. When a dynamic event is removed, its number becomes available to be reused by the next dynamic event created. That is, once the above reaches the max number, the number assigned to the event in that loop will remain the same.

That means deleting one dynamic event and creating another will reuse the previous event's type number. This is where bad things can happen. After the above loop finishes, the kprobes/foo event reads the do_sys_openat2 function call's first parameter as an integer:

 # echo 1 > kprobes/foo/enable
 # cat /etc/passwd > /dev/null
 # cat trace
 cat-2211 [005] .... 2007.849603: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
 cat-2211 [005] .... 2007.849620: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
 cat-2211 [005] .... 2007.849838: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
 cat-2211 [005] .... 2007.849880: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
 # echo 0 > kprobes/foo/enable

Now if we delete the kprobe and create a new one that reads a string:

 # echo 'p:kprobes/foo do_sys_openat2 +0($arg2):string' > kprobe_events

we can see in the trace:

 # cat trace
 sendmail-1942 [002] ..... 530.136320: foo: (do_sys_openat2+0x0/0x240) arg1=
 cat-2046 [004] ..... 530.930817: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
 cat-2046 [004] ..... 530.930961: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
 cat-2046 [004] ..... 530.934278: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
 cat-2046 [004] ..... 530.934563: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
 bash-1515 [007] ..... 534.299093: foo: (do_sys_openat2+0x0/0x240) arg1="kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk���������@��4Z����;Y�����U

And dmesg has:

 ==================================================================
 BUG: KASAN: use-after-free in string+0xd4/0x1c0
 Read of size 1 at addr ffff88805fdbbfa0 by task cat/2049

 CPU: 0 PID: 2049 Comm: cat Not tainted 6.1.0-rc6-test+ #641
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
 Call Trace:
  <TASK>
  dump_stack_lvl+0x5b/0x77
  print_report+0x17f/0x47b
  kasan_report+0xad/0x130
  string+0xd4/0x1c0
  vsnprintf+0x500/0x840
  seq_buf_vprintf+0x62/0xc0
  trace_seq_printf+0x10e/0x1e0
  print_type_string+0x90/0xa0
  print_kprobe_event+0x16b/0x290
  print_trace_line+0x451/0x8e0
  s_show+0x72/0x1f0
  seq_read_iter+0x58e/0x750
  seq_read+0x115/0x160
  vfs_read+0x11d/0x460
  ksys_read+0xa9/0x130
  do_syscall_64+0x3a/0x90
  entry_SYSCALL_64_after_hwframe+0x63/0xcd
 RIP: 0033:0x7fc2e972ade2
 Code: c0 e9 b2 fe ff ff 50 48 8d 3d b2 3f 0a 00 e8 05 f0 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
 RSP: 002b:00007ffc64e687c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
 RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fc2e972ade2
 RDX: 0000000000020000 RSI: 00007fc2e980d000 RDI: 0000000000000003
 RBP: 00007fc2e980d000 R08: 00007fc2e980c010 R09: 0000000000000000
 R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000020f00
 R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
  </TASK>

 The buggy address belongs to the physical page:
 page:ffffea00017f6ec0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x5fdbb
 flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
 raw: 000fffffc0000000 0000000000000000 ffffea00017f6ec8 0000000000000000
 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
 page dumped because: kasan: bad access detected

 Memory state around the buggy address:
  ffff88805fdbbe80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  ffff88805fdbbf00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 >ffff88805fdbbf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                    ^
  ffff88805fdbc000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  ffff88805fdbc080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ==================================================================

This was found when Zheng Yejian sent a patch to convert the event type number assignment to use IDA, which gives the next available number, and this bug showed up in the fuzz testing by Yujie Liu and the kernel test robot. After further analysis, I found that this behavior is the same as when the event type numbers go past the 16-bit max (and the above shows that).

Modules have a similar issue, which is dealt with by setting a "WAS_ENABLED" flag when a module event is enabled; when the module is freed, if any of its events were enabled, the ring buffers that held those events are cleared, to prevent reading stale events. The same can be done for dynamic events: if any dynamic event that is being removed was enabled, then make sure the buffers it was enabled in are now cleared.

Link: https://lkml.kernel.org/r/20221123171434.545706e3@gandalf.local.home Link: https://lore.kernel.org/all/20221110020319.1259291-1-zhengyejian1@huawei.com/ Cc: stable@vger.kernel.org Cc: Andrew Morton <akpm@linux-foundation.org> Depends-on: e18eb8783ec49 ("tracing: Add tracing_reset_all_online_cpus_unlocked() function") Depends-on: 5448d44c38557 ("tracing: Add unified dynamic event framework") Depends-on: 6212dd29683ee ("tracing/kprobes: Use dyn_event framework for kprobe events") Depends-on: 065e63f951432 ("tracing: Only have rmmod clear buffers that its events were active in") Depends-on: 575380da8b469 ("tracing: Only clear trace buffer on module unload if event was traced") Fixes: 77b44d1b7c283 ("tracing/kprobes: Rename Kprobe-tracer to kprobe-event") Reported-by: Zheng Yejian <zhengyejian1@huawei.com> Reported-by: Yujie Liu <yujie.liu@intel.com> Reported-by: kernel test robot <yujie.liu@intel.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
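[Editor's note: a condensed sketch of the approach described above. The flag and free helper are hypothetical names; the real change threads this through the dyn_event framework, and tracing_reset_all_online_cpus() is the existing reset helper the Depends-on tags refer to.]

 /* When tearing down dynamic events, reset the ring buffers if any
  * event being removed was ever enabled, so no stale binary blobs
  * survive with a reusable type number.
  */
 static void release_dyn_events(struct list_head *dyn_events)
 {
         struct dyn_event *ev, *tmp;
         bool was_enabled = false;

         list_for_each_entry_safe(ev, tmp, dyn_events, list) {
                 if (ev->flags & DYN_EVENT_WAS_ENABLED)  /* hypothetical flag */
                         was_enabled = true;
                 free_dyn_event(ev);                     /* hypothetical helper */
         }
         if (was_enabled)
                 tracing_reset_all_online_cpus();        /* clear stale blobs */
 }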
2022-12-08tracing: Fix race where histograms can be called before the eventSteven Rostedt (Google)
commit ef38c79a522b660f7f71d45dad2d6244bc741841 upstream. commit 94eedf3dded5 ("tracing: Fix race where eprobes can be called before the event") fixed an issue where if an event is soft disabled, and the trigger is being added, there's a small window where the event sees that there's a trigger but does not see that it requires reading the event yet, and then calls the trigger with record == NULL. This could be solved by adding memory barriers in the hot path, or by making sure that all the triggers requiring a record check for NULL. The latter was chosen. Commit 94eedf3dded5 set the eprobe trigger handler to check for NULL, but the same needs to be done with histograms. Link: https://lore.kernel.org/linux-trace-kernel/20221118211809.701d40c0f8a757b0df3c025a@kernel.org/ Link: https://lore.kernel.org/linux-trace-kernel/20221123164323.03450c3a@gandalf.local.home Cc: Tom Zanussi <zanussi@kernel.org> Cc: stable@vger.kernel.org Fixes: 7491e2c442781 ("tracing: Add a probe that attaches to trace events") Reported-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
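[Editor's note: a sketch of the chosen guard in the rough shape of a histogram trigger handler; the signature is approximated from the tracing code, not copied from the patch.]

 static void hist_trigger_sketch(struct event_trigger_data *data,
                                 struct trace_buffer *buffer,
                                 void *rec, struct ring_buffer_event *rbe)
 {
         if (!rec)       /* event not fully set up yet; nothing to record */
                 return;

         /* ... accumulate rec's fields into the histogram ... */
 }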
2022-12-08tracing/osnoise: Fix duration typeDaniel Bristot de Oliveira
commit 022632f6c43a86f2135642dccd5686de318e861d upstream. The duration type is a 64-bit long value, not an int. This was causing some long noise samples to report wrong values. Change the duration to a 64-bit value. Link: https://lkml.kernel.org/r/a93d8a8378c7973e9c609de05826533c9e977939.1668692096.git.bristot@kernel.org Cc: stable@vger.kernel.org Cc: Daniel Bristot de Oliveira <bristot@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Fixes: bce29ac9ce0b ("trace: Add osnoise tracer") Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
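[Editor's note: a sketch of the class of bug, not the osnoise code itself: storing a nanosecond duration in an int truncates long noise windows, so the full width must be kept end to end.]

 s64 sample_duration(u64 start_ns)
 {
         u64 now = ktime_get_ns();

         return (s64)(now - start_ns);   /* was effectively: int duration = ... */
 }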
2022-12-08bpf: Do not copy spin lock field from user in bpf_selem_allocXu Kuohai
[ Upstream commit 836e49e103dfeeff670c934b7d563cbd982fce87 ] The bpf_selem_alloc function is used by the inode_storage, sk_storage and task_storage maps to set the map value. For these map types there may be a spin lock in the map value, so if we use memcpy to copy the whole map value from user, the spin lock field may be initialized incorrectly. Since the spin lock field is already zeroed by kzalloc, fix it by calling copy_map_value instead of memcpy, which skips copying the spin lock field. Fixes: 6ac99e8f23d4 ("bpf: Introduce bpf sk local storage") Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Link: https://lore.kernel.org/r/20221114134720.1057939-2-xukuohai@huawei.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
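[Editor's note: a sketch of what copy_map_value does conceptually -- copy the value in two pieces so the spin-lock word (already zeroed by kzalloc) is left untouched. The function name and offset handling here are illustrative, not the bpf helper itself.]

 static void copy_value_skip_lock(void *dst, const void *src,
                                  u32 value_size, u32 lock_off)
 {
         u32 lock_sz = sizeof(struct bpf_spin_lock);

         memcpy(dst, src, lock_off);                     /* bytes before the lock */
         memcpy(dst + lock_off + lock_sz,
                src + lock_off + lock_sz,
                value_size - lock_off - lock_sz);        /* bytes after the lock */
 }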