|
commit 5ffb537e416ee22dbfb3d552102e50da33fec7f6 upstream.
Add two tests:
- one test has 'rX <op> r10' where rX is not r10, and
- another test has 'rX <op> rY' where rX and rY are not r10
but there is an early insn 'rX = r10'.
Without the previous verifier change, both tests will fail.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250524041340.4046304-1-yonghong.song@linux.dev
[ shung-hsi.yu: contains additional hunks for kernel/bpf/verifier.c that
should be part of the previous patch in the series, commit
e2d2115e56c4 "bpf: Do not include stack ptr register in precision
backtracking bookkeeping", which has already been incorporated. ]
Link: https://lore.kernel.org/all/9b41f9f5-396f-47e0-9a12-46c52087df6c@linux.dev/
Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 91a229bb7ba86b2592c3f18c54b7b2c5e6fe0f95 upstream.
A warning is raised when __request_region() detects a conflict with a
resource whose resource.desc is IORES_DESC_DEVICE_PRIVATE_MEMORY.
But this warning is only valid for iomem_resources.
The hmem device resource uses resource.desc as the numa node id, which can
cause spurious warnings.
This warning appeared on a machine with multiple cxl memory expanders.
One of the NUMA node IDs is 6, which is the same as the value of
IORES_DESC_DEVICE_PRIVATE_MEMORY.
In this environment it was just a spurious warning, but when I saw it I
suspected a real problem, so it is better to fix it.
Fix this by restricting the warning to iomem_resource only.
Also add a missing newline to the warning message.
Link: https://lkml.kernel.org/r/20250719112604.25500-1-akinobu.mita@gmail.com
Fixes: 7dab174e2e27 ("dax/hmem: Move hmem device registration to dax_hmem.ko")
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 67c632b4a7fbd6b76a08b86f4950f0f84de93439 upstream.
Most drivers only populate the fields cycles and cs_id of system_counterval
in their get_time_fn() callback for get_device_system_crosststamp(), unless
they explicitly provide nanosecond values.
When the use_nsecs field was added to struct system_counterval, most
drivers did not care. Clock sources other than CSID_GENERIC could then get
converted in convert_base_to_cs() based on an uninitialized use_nsecs field,
which usually results in -EINVAL during the following range check.
Pass in a fully zero initialized system_counterval_t to cure that.
Fixes: 6b2e29977518 ("timekeeping: Provide infrastructure for converting to/from a base clock")
Signed-off-by: Markus Blöchl <markus@blochl.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250720-timekeeping_uninit_crossts-v2-1-f513c885b7c2@blochl.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9beb8c5e77dc10e3889ff5f967eeffba78617a88 upstream.
Commit cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check if not
frozen") modified the cgroup_freezing() logic to verify that the FROZEN
flag is not set, affecting the return value of the freezing() function,
in order to address a warning in __thaw_task.
A race condition exists that may allow tasks to escape being frozen. The
following scenario demonstrates this issue:
  CPU 0 (get_signal path)                      CPU 1 (freezer.state reader)
  try_to_freeze                                read freezer.state
  __refrigerator                               freezer_read
                                                 update_if_frozen
  WRITE_ONCE(current->__state, TASK_FROZEN);
  ...
                                                 /* Task is now marked frozen */
                                                 /* frozen(task) == true */
                                                 /* Assuming other tasks are frozen */
                                                 freezer->state |= CGROUP_FROZEN;
  /* freezing(current) returns false */
  /* because cgroup is frozen (not freezing) */
  break out
  __set_current_state(TASK_RUNNING);
  /* Bug: Task resumes running when it should remain frozen */
The existing !frozen(p) check in __thaw_task makes the
WARN_ON_ONCE(freezing(p)) warning redundant. Removing this warning enables
reverting commit cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check if
not frozen") to resolve the issue.
This patch removes the warning from __thaw_task. A subsequent patch will
revert commit cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check if
not frozen") to complete the fix.
Reported-by: Zhong Jiawei<zhongjiawei1@huawei.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 36569780b0d64de283f9d6c2195fd1a43e221ee8 upstream.
The commit e6fe3f422be1 ("sched: Make multiple runqueue task counters
32-bit") changed nr_uninterruptible to an unsigned int. But the
nr_uninterruptible values for each of the CPU runqueues can grow to
large numbers, sometimes exceeding INT_MAX. This is valid, if, over
time, a large number of tasks are migrated off of one CPU after going
into an uninterruptible state. Only the sum of all nr_uninterruptible
values across all CPUs yields the correct result, as explained in a
comment in kernel/sched/loadavg.c.
Change the type of nr_uninterruptible back to unsigned long to prevent
overflows, and thus the miscalculation of load average.
Fixes: e6fe3f422be1 ("sched: Make multiple runqueue task counters 32-bit")
Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250709173328.606794-1-aruna.ramakrishna@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 14a67b42cb6f3ab66f41603c062c5056d32ea7dd ]
This reverts commit cff5f49d433fcd0063c8be7dd08fa5bf190c6c37.
Commit cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check if not
frozen") modified the cgroup_freezing() logic to verify that the FROZEN
flag is not set, affecting the return value of the freezing() function,
in order to address a warning in __thaw_task.
A race condition exists that may allow tasks to escape being frozen. The
following scenario demonstrates this issue:
  CPU 0 (get_signal path)                      CPU 1 (freezer.state reader)
  try_to_freeze                                read freezer.state
  __refrigerator                               freezer_read
                                                 update_if_frozen
  WRITE_ONCE(current->__state, TASK_FROZEN);
  ...
                                                 /* Task is now marked frozen */
                                                 /* frozen(task) == true */
                                                 /* Assuming other tasks are frozen */
                                                 freezer->state |= CGROUP_FROZEN;
  /* freezing(current) returns false */
  /* because cgroup is frozen (not freezing) */
  break out
  __set_current_state(TASK_RUNNING);
  /* Bug: Task resumes running when it should remain frozen */
The existing !frozen(p) check in __thaw_task makes the
WARN_ON_ONCE(freezing(p)) warning redundant. Removing this warning enables
reverting the commit cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check
if not frozen") to resolve the issue.
The warning has been removed in the previous patch. This patch reverts
commit cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check if not
frozen") to complete the fix.
Fixes: cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check if not frozen")
Reported-by: Zhong Jiawei<zhongjiawei1@huawei.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f8242745871f81a3ac37f9f51853d12854fd0b58 ]
static const char fmt[] = "%p%";
bpf_trace_printk(fmt, sizeof(fmt));
The above BPF program isn't rejected and causes a kernel warning at
runtime:
Please remove unsupported %\x00 in format string
WARNING: CPU: 1 PID: 7244 at lib/vsprintf.c:2680 format_decode+0x49c/0x5d0
This happens because bpf_bprintf_prepare skips over the second %,
detected as punctuation, while processing %p. This patch fixes it by
not skipping over punctuation. %\x00 is then processed in the next
iteration and rejected.
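As an illustration only, here is a minimal userspace sketch of the parsing issue (this is not the kernel's bpf_bprintf_prepare() code; the scan() helper and its skip_after_p flag are invented for the example). With the old behavior the trailing '%' is swallowed while handling "%p"; without the skip it is seen, and rejected, on the next iteration:

  #include <ctype.h>
  #include <stdio.h>

  /* Return 0 if the format is accepted, -1 if an unsupported specifier is
   * found. skip_after_p mimics the old "skip punctuation after %p" logic. */
  static int scan(const char *fmt, int skip_after_p)
  {
          for (int i = 0; fmt[i]; i++) {
                  if (fmt[i] != '%')
                          continue;
                  if (fmt[i + 1] == 'p') {
                          i++;                            /* consume 'p' */
                          if (skip_after_p && fmt[i + 1])
                                  i++;                    /* old: also skip the next char */
                          continue;
                  }
                  if (!isalnum((unsigned char)fmt[i + 1]))
                          return -1;                      /* e.g. "%\0": reject */
          }
          return 0;
  }

  int main(void)
  {
          static const char fmt[] = "%p%";

          printf("old: %d (bug: accepted)\n", scan(fmt, 1));
          printf("new: %d (rejected)\n", scan(fmt, 0));
          return 0;
  }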
Reported-by: syzbot+e2c932aec5c8a6e1d31c@syzkaller.appspotmail.com
Fixes: 48cac3f4a96d ("bpf: Implement formatted output helpers with bstr_printf")
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/a0e06cc479faec9e802ae51ba5d66420523251ee.1751395489.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 85a3bce695b361d85fc528e6fbb33e4c8089c806 upstream.
We have observed kernel panics when using timerlat with stack saving,
with the following dmesg output:
memcpy: detected buffer overflow: 88 byte write of buffer size 0
WARNING: CPU: 2 PID: 8153 at lib/string_helpers.c:1032 __fortify_report+0x55/0xa0
CPU: 2 UID: 0 PID: 8153 Comm: timerlatu/2 Kdump: loaded Not tainted 6.15.3-200.fc42.x86_64 #1 PREEMPT(lazy)
Call Trace:
<TASK>
? trace_buffer_lock_reserve+0x2a/0x60
__fortify_panic+0xd/0xf
__timerlat_dump_stack.cold+0xd/0xd
timerlat_dump_stack.part.0+0x47/0x80
timerlat_fd_read+0x36d/0x390
vfs_read+0xe2/0x390
? syscall_exit_to_user_mode+0x1d5/0x210
ksys_read+0x73/0xe0
do_syscall_64+0x7b/0x160
? exc_page_fault+0x7e/0x1a0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
__timerlat_dump_stack() constructs the ftrace stack entry like this:
struct stack_entry *entry;
...
memcpy(&entry->caller, fstack->calls, size);
entry->size = fstack->nr_entries;
Since commit e7186af7fb26 ("tracing: Add back FORTIFY_SOURCE logic to
kernel_stack event structure"), struct stack_entry marks its caller
field with __counted_by(size). At the time of the memcpy, entry->size
contains garbage from the ringbuffer, which under some circumstances is
zero, triggering a kernel panic by buffer overflow.
Populate the size field before the memcpy so that the out-of-bounds
check knows the correct size. This is analogous to
__ftrace_trace_stack().
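In other words, the fix reorders the two statements quoted above so the bound is known before the copy (a sketch of the intended order, not the full upstream hunk):

  /* Set the element count first so the __counted_by(size) fortify check
   * on entry->caller sees the real bound during the copy. */
  entry->size = fstack->nr_entries;
  memcpy(&entry->caller, fstack->calls, size);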
Cc: stable@vger.kernel.org
Cc: John Kacur <jkacur@redhat.com>
Cc: Luis Goncalves <lgoncalv@redhat.com>
Cc: Attila Fazekas <afazekas@redhat.com>
Link: https://lore.kernel.org/20250716143601.7313-1-tglozar@redhat.com
Fixes: e7186af7fb26 ("tracing: Add back FORTIFY_SOURCE logic to kernel_stack event structure")
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b5e8acc14dcb314a9b61ff19dcd9fdd0d88f70df upstream.
When a module is loaded, it adds trace events defined by the module. It
may also need to modify the modules trace printk formats to replace enum
names with their values.
If two modules are loaded at the same time, the adding of the event to the
ftrace_events list can corrupt the walking of the list in the code that is
modifying the printk format strings and crash the kernel.
The addition of the event should take the trace_event_sem for write while
it adds the new event.
Also add a lockdep_assert_held() on that semaphore in
__trace_add_event_dirs() as it iterates the list.
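A hedged sketch of the locking described above (context heavily simplified; the list-add site and assertion placement follow the commit text, not the exact upstream hunks):

  /* Adding to ftrace_events must not race with readers walking the list. */
  down_write(&trace_event_sem);
  list_add(&call->list, &ftrace_events);
  up_write(&trace_event_sem);

  /* ... and in __trace_add_event_dirs(), document the requirement: */
  lockdep_assert_held(&trace_event_sem);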
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/20250718223158.799bfc0c@batman.local.home
Reported-by: Fusheng Huang(黄富生) <Fusheng.Huang@luxshare-ict.com>
Closes: https://lore.kernel.org/all/20250717105007.46ccd18f@batman.local.home/
Fixes: 110bf2b764eb6 ("tracing: add protection around module events unload")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1ed171a3afe81531b3ace96bd151a372dda3ee25 upstream.
After a recent change in clang to strengthen uninitialized warnings [1],
it points out that in one of the error paths in parse_btf_arg(), params
is used uninitialized:
kernel/trace/trace_probe.c:660:19: warning: variable 'params' is uninitialized when used here [-Wuninitialized]
660 | return PTR_ERR(params);
| ^~~~~~
Match many other NO_BTF_ENTRY error cases and return -ENOENT, clearing
up the warning.
Link: https://lore.kernel.org/all/20250715-trace_probe-fix-const-uninit-warning-v1-1-98960f91dd04@kernel.org/
Cc: stable@vger.kernel.org
Closes: https://github.com/ClangBuiltLinux/linux/issues/2110
Fixes: d157d7694460 ("tracing/probes: Support BTF field access from $retval")
Link: https://github.com/llvm/llvm-project/commit/2464313eef01c5b1edf0eccf57a32cdee01472c7 [1]
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit fd881d0a085fc54354414aed990ccf05f282ba53 upstream.
The rseq_cs field is documented as being set to 0 by user-space prior to
registration, however this is not currently enforced by the kernel. This
can result in a segfault on return to user-space if the value stored in
the rseq_cs field doesn't point to a valid struct rseq_cs.
The correct solution to this would be to fail the rseq registration when
the rseq_cs field is non-zero. However, some older versions of glibc
will reuse the rseq area of previous threads without clearing the
rseq_cs field and will also terminate the process if the rseq
registration fails in a secondary thread. This wasn't caught in testing
because in this case the leftover rseq_cs does point to a valid struct
rseq_cs.
What we can do is clear the rseq_cs field on registration when it's
non-zero which will prevent segfaults on registration and won't break
the glibc versions that reuse rseq areas on thread creation.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250306211223.109455-1-mjeanson@efficios.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit d4adf1c9ee7722545450608bcb095fb31512f0c6 ]
BPF_MAP_TYPE_LRU_HASH can recycle most recent elements well before the
map is full, due to percpu reservations and force shrink before
neighbor stealing. Once a CPU is unable to borrow from the global map,
it will once steal one elem from a neighbor and after that each time
flush this one element to the global list and immediately recycle it.
Batch value LOCAL_FREE_TARGET (128) will exhaust a 10K element map
with 79 CPUs. CPU 79 will observe this behavior even while its
neighbors hold 78 * 127 + 1 * 15 == 9921 free elements (99%).
CPUs need not be active concurrently. The issue can appear with
affinity migration, e.g., irqbalance. Each CPU can reserve and then
hold onto its 128 elements indefinitely.
Avoid global list exhaustion by limiting aggregate percpu caches to
half of map size, by adjusting LOCAL_FREE_TARGET based on cpu count.
This change has no effect on sufficiently large tables.
Similar to LOCAL_NR_SCANS and lru->nr_scans, introduce a map variable
lru->free_target. The extra field fits in a hole in struct bpf_lru.
The cacheline is already warm where read in the hot path. The field is
only accessed with the lru lock held.
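A self-contained sketch of the sizing rule described above (compute_free_target() and the exact rounding are illustrative, not the upstream code):

  #include <stdio.h>

  #define LOCAL_FREE_TARGET 128

  /* Cap the aggregate of per-cpu caches at half of the map size. */
  static unsigned int compute_free_target(unsigned int map_size,
                                          unsigned int nr_cpus)
  {
          unsigned int target = map_size / (2 * nr_cpus);

          if (target > LOCAL_FREE_TARGET)
                  target = LOCAL_FREE_TARGET;     /* large maps: unchanged */
          if (target < 1)
                  target = 1;
          return target;
  }

  int main(void)
  {
          /* The 10K-element, 79-CPU example from the text above. */
          printf("free_target = %u (was %u)\n",
                 compute_free_target(10000, 79), LOCAL_FREE_TARGET);
          return 0;
  }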
Tested-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://lore.kernel.org/r/20250618215803.3587312-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit fc975cfb36393db1db517fbbe366e550bcdcff14 ]
In our testing with 6.12 based kernel on a big.LITTLE system, we were
seeing instances of RT tasks being blocked from running on the LITTLE
cpus for multiple seconds of time, apparently by the dl_server. This
far exceeds the default configured 50ms per second runtime.
This is due to the fair dl_server runtime calculation being scaled
for frequency & capacity of the cpu.
Consider the following case under a Big.LITTLE architecture:
Assume the runtime is 50,000,000 ns, and the frequency/capacity
scale-invariance factors are defined as below:
  Frequency scale-invariance: 100
  Capacity scale-invariance: 50
First, by frequency scale-invariance,
the runtime is scaled to 50,000,000 * 100 >> 10 = 4,882,812.
Then, by capacity scale-invariance,
it is further scaled to 4,882,812 * 50 >> 10 = 238,418.
So it ends up being scaled to 238,418 ns.
This smaller "accounted runtime" value is what ends up being
subtracted against the fair-server's runtime for the current period.
Thus after 50ms of real time, we've only accounted ~238us against the
fair server's runtime. This 209:1 ratio in this example means that on
the smaller cpu the fair server is allowed to continue running,
blocking RT tasks, for over 10 seconds before it exhausts its supposed
50ms of runtime. And on other hardware configurations it can be even
worse.
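The numbers above can be reproduced with a small standalone check, assuming the usual (value * factor) >> SCHED_CAPACITY_SHIFT convention with a shift of 10:

  #include <stdio.h>

  int main(void)
  {
          unsigned long long runtime = 50000000ULL;  /* 50 ms in ns */
          unsigned long long freq_scale = 100;       /* example values */
          unsigned long long cap_scale = 50;

          unsigned long long scaled = (runtime * freq_scale) >> 10;

          scaled = (scaled * cap_scale) >> 10;
          printf("accounted: %llu ns, ratio %llu:1\n",
                 scaled, runtime / scaled);          /* 238418 ns, 209:1 */
          return 0;
  }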
For the fair deadline_server, to prevent realtime tasks from being
unexpectedly delayed, we really do want to use fixed time, and not
scaled time for smaller capacity/frequency cpus. So remove the scaling
from the fair server's accounting to fix this.
Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: John Stultz <jstultz@google.com>
Signed-off-by: kuyo chang <kuyo.chang@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: John Stultz <jstultz@google.com>
Tested-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/20250702021440.2594736-1-kuyo.chang@mediatek.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit ba677dbe77af5ffe6204e0f3f547f3ba059c6302 ]
Jann reports that uprobes can be used destructively when used in the
middle of an instruction. The kernel only verifies there is a valid
instruction at the requested offset, but due to variable instruction
length cannot determine if this is an instruction as seen by the
intended execution stream.
Additionally, Mark Rutland notes that on architectures that mix data
in the text segment (like arm64), a similar things can be done if the
data word is 'mistaken' for an instruction.
As such, require CAP_SYS_ADMIN for uprobes.
Fixes: c9e0924e5c2b ("perf/core: open access to probes for CAP_PERFMON privileged process")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/CAG48ez1n4520sq0XrWYDHKiKxE_+WCfAK+qt9qkY4ZiBGmL-5g@mail.gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 009836b4fa52f92cba33618e773b1094affa8cd2 ]
On Mon, Jun 02, 2025 at 03:22:13PM +0800, Kuyo Chang wrote:
> So, the potential race scenario is:
>
>   CPU0                                      CPU1
>   // doing migrate_swap(cpu0/cpu1)
>   stop_two_cpus()
>                                             ...
>                                             // doing _cpu_down()
>                                             sched_cpu_deactivate()
>                                               set_cpu_active(cpu, false);
>                                               balance_push_set(cpu, true);
>   cpu_stop_queue_two_works
>     __cpu_stop_queue_work(stopper1,...);
>     __cpu_stop_queue_work(stopper2,..);
>   stop_cpus_in_progress -> true
>   preempt_enable();
>                                             ...
>                                             1st balance_push
>                                               stop_one_cpu_nowait
>                                                 cpu_stop_queue_work
>                                                   __cpu_stop_queue_work
>                                                     list_add_tail -> 1st add push_work
>                                                     wake_up_q(&wakeq); -> "wakeq is empty.
>                                                       This implies that the stopper is at wakeq@migrate_swap."
>   preempt_disable
>   wake_up_q(&wakeq);
>     wake_up_process // wakeup migrate/0
>       try_to_wake_up
>         ttwu_queue
>           ttwu_queue_cond ->meet below case
>             if (cpu == smp_processor_id())
>               return false;
>         ttwu_do_activate
>         //migrate/0 wakeup done
>     wake_up_process // wakeup migrate/1
>       try_to_wake_up
>         ttwu_queue
>           ttwu_queue_cond
>         ttwu_queue_wakelist
>           __ttwu_queue_wakelist
>             __smp_call_single_queue
>   preempt_enable();
>
>                                             2nd balance_push
>                                               stop_one_cpu_nowait
>                                                 cpu_stop_queue_work
>                                                   __cpu_stop_queue_work
>                                                     list_add_tail -> 2nd add push_work, so the double list add is detected
>                                             ...
>                                             ...
>                                             cpu1 get ipi, do sched_ttwu_pending, wakeup migrate/1
>
So this balance_push() is part of schedule(), and schedule() is supposed
to switch to stopper task, but because of this race condition, stopper
task is stuck in WAKING state and not actually visible to be picked.
Therefore CPU1 can do another schedule() and end up doing another
balance_push() even though the last one hasn't been done yet.
This is a confluence of fail, where both wake_q and ttwu_wakelist can
cause crucial wakeups to be delayed, resulting in the malfunction of
balance_push.
Since there is only a single stopper thread to be woken, the wake_q
doesn't really add anything here, and can be removed in favour of
direct wakeups of the stopper thread.
Then add a clause to ttwu_queue_cond() to ensure the stopper threads
are never queued / delayed.
Of all 3 moving parts, the last addition was the balance_push()
machinery, so pick that as the point the bug was introduced.
Fixes: 2558aacff858 ("sched/hotplug: Ensure only per-cpu kthreads run during hotplug")
Reported-by: Kuyo Chang <kuyo.chang@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Kuyo Chang <kuyo.chang@mediatek.com>
Link: https://lkml.kernel.org/r/20250605100009.GO39944@noisy.programming.kicks-ass.net
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 7b4c5a37544ba22c6ebe72c0d4ea56c953459fa5 ]
Commit 3172fb986666 ("perf/core: Fix WARN in perf_cgroup_switch()") tried to
fix a concurrency problem between perf_cgroup_switch and
perf_cgroup_event_disable, but it did not move the WARN_ON_ONCE into the
lock-protected region, so the warning can still be triggered.
Fixes: 3172fb986666 ("perf/core: Fix WARN in perf_cgroup_switch()")
Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250626135403.2454105-1-luogengkun@huaweicloud.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 33b6a1f155d627f5bd80c7485c598ce45428f74f ]
Currently the call_rcu() API does not check whether a callback
pointer is NULL. If NULL is passed, rcu_core() will try to invoke
it, resulting in NULL pointer dereference and a kernel crash.
To prevent this and improve debuggability, this patch adds a check
for NULL and emits a kernel stack trace to help identify a faulty
caller.
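A minimal sketch of the guard, assuming it sits at the top of call_rcu() (the exact message and return handling may differ upstream):

  /* Refuse a NULL callback and leave a stack trace behind so the faulty
   * caller can be identified, instead of crashing later in rcu_core(). */
  if (WARN_ONCE(!func, "%s(): NULL callback pointer\n", __func__))
          return;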
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 8a2277a3c9e4cc5398f80821afe7ecbe9bdf2819 ]
Initialize `ops` member's pointers properly by using kzalloc() instead of
kmalloc() when allocating the simulation work context. Otherwise the
pointers contain random content leading to invalid dereferencing.
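A minimal sketch of the change (the work_ctx variable name is chosen for illustration):

  /* Zero-initialize the simulation work context so the ops callback
   * pointers start out NULL rather than holding random heap content. */
  work_ctx = kzalloc(sizeof(*work_ctx), GFP_KERNEL);      /* was kmalloc() */
  if (!work_ctx)
          return -ENOMEM;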
Signed-off-by: Gyeyoung Baek <gye976@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250612124827.63259-1-gye976@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c50784e99f0e7199cdb12dbddf02229b102744ef ]
Otherwise, tg->scx.weight can go out of sync while scx_cgroup is not enabled
and ops.cgroup_init() may be called with a stale weight value.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 819513666966 ("sched_ext: Add cgroup support")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e2d2115e56c4a02377189bfc3a9a7933552a7b0f ]
Yi Lai reported an issue ([1]) where the following warning appears
in kernel dmesg:
[ 60.643604] verifier backtracking bug
[ 60.643635] WARNING: CPU: 10 PID: 2315 at kernel/bpf/verifier.c:4302 __mark_chain_precision+0x3a6c/0x3e10
[ 60.648428] Modules linked in: bpf_testmod(OE)
[ 60.650471] CPU: 10 UID: 0 PID: 2315 Comm: test_progs Tainted: G OE 6.15.0-rc4-gef11287f8289-dirty #327 PREEMPT(full)
[ 60.654385] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 60.656682] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 60.660475] RIP: 0010:__mark_chain_precision+0x3a6c/0x3e10
[ 60.662814] Code: 5a 30 84 89 ea e8 c4 d9 01 00 80 3d 3e 7d d8 04 00 0f 85 60 fa ff ff c6 05 31 7d d8 04
01 48 c7 c7 00 58 30 84 e8 c4 06 a5 ff <0f> 0b e9 46 fa ff ff 48 ...
[ 60.668720] RSP: 0018:ffff888116cc7298 EFLAGS: 00010246
[ 60.671075] RAX: 54d70e82dfd31900 RBX: ffff888115b65e20 RCX: 0000000000000000
[ 60.673659] RDX: 0000000000000001 RSI: 0000000000000004 RDI: 00000000ffffffff
[ 60.676241] RBP: 0000000000000400 R08: ffff8881f6f23bd3 R09: 1ffff1103ede477a
[ 60.678787] R10: dffffc0000000000 R11: ffffed103ede477b R12: ffff888115b60ae8
[ 60.681420] R13: 1ffff11022b6cbc4 R14: 00000000fffffff2 R15: 0000000000000001
[ 60.684030] FS: 00007fc2aedd80c0(0000) GS:ffff88826fa8a000(0000) knlGS:0000000000000000
[ 60.686837] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 60.689027] CR2: 000056325369e000 CR3: 000000011088b002 CR4: 0000000000370ef0
[ 60.691623] Call Trace:
[ 60.692821] <TASK>
[ 60.693960] ? __pfx_verbose+0x10/0x10
[ 60.695656] ? __pfx_disasm_kfunc_name+0x10/0x10
[ 60.697495] check_cond_jmp_op+0x16f7/0x39b0
[ 60.699237] do_check+0x58fa/0xab10
...
Further analysis shows the warning is at line 4302 as below:
4294 /* static subprog call instruction, which
4295 * means that we are exiting current subprog,
4296 * so only r1-r5 could be still requested as
4297 * precise, r0 and r6-r10 or any stack slot in
4298 * the current frame should be zero by now
4299 */
4300 if (bt_reg_mask(bt) & ~BPF_REGMASK_ARGS) {
4301 verbose(env, "BUG regs %x\n", bt_reg_mask(bt));
4302 WARN_ONCE(1, "verifier backtracking bug");
4303 return -EFAULT;
4304 }
With the below test (also in the next patch):
  __used __naked static void __bpf_jmp_r10(void)
  {
          asm volatile (
          "r2 = 2314885393468386424 ll;"
          "goto +0;"
          "if r2 <= r10 goto +3;"
          "if r1 >= -1835016 goto +0;"
          "if r2 <= 8 goto +0;"
          "if r3 <= 0 goto +0;"
          "exit;"
          ::: __clobber_all);
  }

  SEC("?raw_tp")
  __naked void bpf_jmp_r10(void)
  {
          asm volatile (
          "r3 = 0 ll;"
          "call __bpf_jmp_r10;"
          "r0 = 0;"
          "exit;"
          ::: __clobber_all);
  }
The following is the verifier failure log:
0: (18) r3 = 0x0 ; R3_w=0
2: (85) call pc+2
caller:
R10=fp0
callee:
frame1: R1=ctx() R3_w=0 R10=fp0
5: frame1: R1=ctx() R3_w=0 R10=fp0
; asm volatile (" \ @ verifier_precision.c:184
5: (18) r2 = 0x20202000256c6c78 ; frame1: R2_w=0x20202000256c6c78
7: (05) goto pc+0
8: (bd) if r2 <= r10 goto pc+3 ; frame1: R2_w=0x20202000256c6c78 R10=fp0
9: (35) if r1 >= 0xffe3fff8 goto pc+0 ; frame1: R1=ctx()
10: (b5) if r2 <= 0x8 goto pc+0
mark_precise: frame1: last_idx 10 first_idx 0 subseq_idx -1
mark_precise: frame1: regs=r2 stack= before 9: (35) if r1 >= 0xffe3fff8 goto pc+0
mark_precise: frame1: regs=r2 stack= before 8: (bd) if r2 <= r10 goto pc+3
mark_precise: frame1: regs=r2,r10 stack= before 7: (05) goto pc+0
mark_precise: frame1: regs=r2,r10 stack= before 5: (18) r2 = 0x20202000256c6c78
mark_precise: frame1: regs=r10 stack= before 2: (85) call pc+2
BUG regs 400
The main failure reason is due to r10 in precision backtracking bookkeeping.
Actually r10 is always precise and there is no need to add it for the precision
backtracking bookkeeping.
One way to fix the issue is to prevent bt_set_reg() if any src/dst reg is
r10. Andrii suggested going with the push_insn_history() approach to avoid
explicitly checking r10 in backtrack_insn().
This patch adds push_insn_history() support for cond_jmp operations like
'rX <op> rY'. In check_cond_jmp_op(), if any of rX or rY is a stack pointer,
push_insn_history() will record that information, and later backtrack_insn()
will do bt_set_reg() properly for those register(s).
[1] https://lore.kernel.org/bpf/Z%2F8q3xzpU59CIYQE@ly-workstation/
Reported-by: Yi Lai <yi1.lai@linux.intel.com>
Fixes: 407958a0e980 ("bpf: encapsulate precision backtracking bookkeeping")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250524041335.4046126-1-yonghong.song@linux.dev
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 96a30e469ca1d2b8cc7811b40911f8614b558241 ]
Instead of allocating and copying instruction history each time we
enqueue child verifier state, switch to a model where we use one common
dynamically sized array of instruction history entries across all states.
The key observation for proving this is correct is that instruction
history is only relevant while state is active, which means it either is
a current state (and thus we are actively modifying instruction history
and no other state can interfere with us) or we are checkpointed state
with some children still active (either enqueued or being current).
In the latter case our portion of instruction history is finalized and
won't change or grow, so as long as we keep it immutable until the state
is finalized, we are good.
Now, when state is finalized and is put into state hash for potentially
future pruning lookups, instruction history is not used anymore. This is
because instruction history is only used by precision marking logic, and
we never modify precision markings for finalized states.
So, instead of each state having its own small instruction history, we
keep a global dynamically-sized instruction history, where each state in
current DFS path from root to active state remembers its portion of
instruction history. Current state can append to this history, but
cannot modify any of its parent histories.
Async callback state enqueueing, while logically detached from parent
state, still is part of verification backtracking tree, so has to follow
the same schema as normal state checkpoints.
Because the insn_hist array can be grown through realloc, states don't
keep pointers, they instead maintain two indices, [start, end), into
global instruction history array. End is exclusive index, so
`start == end` means there is no relevant instruction history.
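A hedged illustration of this layout (the struct and field names below are invented for the sketch, not the upstream definitions):

  #include <stdint.h>

  struct insn_hist_entry {
          uint32_t idx;           /* instruction index */
          uint32_t flags;         /* per-instruction metadata */
  };

  struct verifier_env_sketch {
          struct insn_hist_entry *insn_hist;  /* shared array, grows via realloc() */
          uint32_t insn_hist_cap;
  };

  struct verifier_state_sketch {
          uint32_t insn_hist_start;   /* inclusive */
          uint32_t insn_hist_end;     /* exclusive: start == end means empty */
  };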
This eliminates a lot of allocations and minimizes overall memory usage.
For instance, running a worst-case test from [0] (but without the
heuristics-based fix [1]), it took 12.5 minutes until we get -ENOMEM.
With the changes in this patch the whole test succeeds in 10 minutes
(very slow, so heuristics from [1] is important, of course).
To further validate correctness, veristat-based comparison was performed for
Meta production BPF objects and BPF selftests objects. In both cases there
were no differences *at all* in terms of verdict or instruction and state
counts, providing a good confidence in the change.
Having this low-memory-overhead solution of keeping dynamic
per-instruction history cheaply opens up some new possibilities, like
keeping extra information for literally every single validated
instruction. This will be used for simplifying precision backpropagation
logic in follow up patches.
[0] https://lore.kernel.org/bpf/20241029172641.1042523-2-eddyz87@gmail.com/
[1] https://lore.kernel.org/bpf/20241029172641.1042523-1-eddyz87@gmail.com/
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20241115001303.277272-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Stable-dep-of: e2d2115e56c4 ("bpf: Do not include stack ptr register in precision backtracking bookkeeping")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit aa3ee4f0b7541382c9f6f43f7408d73a5d4f4042 ]
The delayed dequeue feature keeps a sleeping task enqueued until its
lag has elapsed. As a result, it also stays visible in rq->nr_running.
So in wake_affine_idle(), we should use the real number of running tasks
in the rq to check whether we should place the wake-up task on the
current cpu.
Also add a helper function to return the number of delayed tasks.
Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Reviewed-and-tested-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250303105241.17251-2-xuewen.yan@unisoc.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c2a295bffeaf9461ecba76dc9e4780c898c94f03 ]
With the delayed dequeue feature, a sleeping sched_entity remains queued in
the rq until its lag has elapsed. As a result, it also stays visible
in the statistics that are used to balance the system, in particular
the field cfs.h_nr_queued when the sched_entity is associated with a task.
Create a new h_nr_runnable that tracks only queued and runnable tasks.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-5-vincent.guittot@linaro.org
Stable-dep-of: aa3ee4f0b754 ("sched/fair: Fixup wake_up_sync() vs DELAYED_DEQUEUE")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 7b8a702d943827130cc00ae36075eff5500f86f1 ]
With the delayed dequeue feature, a sleeping sched_entity remains queued
in the rq until its lag has elapsed, but it can't run.
Rename h_nr_running to h_nr_queued to reflect this new behavior.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-4-vincent.guittot@linaro.org
Stable-dep-of: aa3ee4f0b754 ("sched/fair: Fixup wake_up_sync() vs DELAYED_DEQUEUE")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 3172fb986666dfb71bf483b6d3539e1e587fa197 ]
There may be concurrency between perf_cgroup_switch and
perf_cgroup_event_disable. Consider the following scenario: after a new
perf cgroup event is created on CPU0, the new event may not trigger
a reprogramming, causing ctx->is_active to be 0. In this case, when CPU1
disables this perf event, it executes __perf_remove_from_context->
list_del_event->perf_cgroup_event_disable on CPU1, which causes a race
with perf_cgroup_switch running on CPU0.
The following describes the details of this concurrency scenario:
  CPU0                                        CPU1
  perf_cgroup_switch:
    ...
    # cpuctx->cgrp is not NULL here
    if (READ_ONCE(cpuctx->cgrp) == NULL)
        return;
                                              perf_remove_from_context:
                                                ...
                                                raw_spin_lock_irq(&ctx->lock);
                                                ...
                                                # ctx->is_active == 0 because reprogramming is
                                                # not triggered, so CPU1 can do
                                                # __perf_remove_from_context for CPU0
                                                __perf_remove_from_context:
                                                  perf_cgroup_event_disable:
                                                    ...
                                                    if (--ctx->nr_cgroups)
                                                      ...
    # this warning will happen because CPU1 changed
    # ctx.nr_cgroups to 0.
    WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
[peterz: use guard instead of goto unlock]
Fixes: db4a835601b7 ("perf/core: Set cgroup in CPU contexts for new cgroup events")
Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250604033924.3914647-3-luogengkun@huaweicloud.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 61988e36dc5457cdff7ae7927e8d9ad1419ee998 ]
While chasing down a missing perf_cgroup_event_disable() elsewhere,
Leo Yan found that both perf_put_aux_event() and
perf_remove_sibling_event() were also missing one.
Specifically, the rule is that events that switch to OFF,ERROR need to
call perf_cgroup_event_disable().
Unify the disable paths to ensure this.
Fixes: ab43762ef010 ("perf: Allow normal events to output AUX data")
Fixes: 9f0c4fa111dc ("perf/core: Add a new PERF_EV_CAP_SIBLING event capability")
Reported-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250605123343.GD35970@noisy.programming.kicks-ass.net
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 4f6fc782128355931527cefe3eb45338abd8ab39 ]
Baisheng Gao reported an ARM64 crash, which Mark decoded as being a
synchronous external abort -- most likely due to trying to access
MMIO in bad ways.
The crash further shows perf trying to do a user stack sample while in
exit_mmap()'s tlb_finish_mmu() -- i.e. while tearing down the address
space it is trying to access.
It turns out that we stop perf after we tear down the userspace mm; a
recipe for disaster, since perf likes to access userspace for
various reasons.
Flip this order by moving up where we stop perf in do_exit().
Additionally, harden PERF_SAMPLE_CALLCHAIN and PERF_SAMPLE_STACK_USER
to abort when the current task does not have an mm (exit_mm() makes
sure to set current->mm = NULL; before commencing with the actual
teardown). Such that CPU wide events don't trip on this same problem.
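A minimal sketch of that hardening, assuming a check early in the user stack/callchain sampling paths (placement and exact form are illustrative):

  /* Late in do_exit(), exit_mm() has already set current->mm = NULL, so a
   * CPU-wide event sampling this task must not touch user memory. */
  if (!current->mm)
          return 0;   /* skip PERF_SAMPLE_CALLCHAIN / PERF_SAMPLE_STACK_USER */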
Fixes: c5ebcedb566e ("perf: Add ability to attach user stack dump to sample")
Reported-by: Baisheng Gao <baisheng.gao@unisoc.com>
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250605110815.GQ39944@noisy.programming.kicks-ass.net
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 261dce3d64021e7ec828a17b4975ce9182e54ceb ]
Now when isolcpus is enabled via the cmdline, wq_isolated_cpumask does
not include these isolated CPUs, even though wq_unbound_cpumask has already
excluded them. It is only when we successfully configure an isolated cpuset
partition that wq_isolated_cpumask gets overwritten by
workqueue_unbound_exclude_cpumask(), including both the cmdline-specified
isolated CPUs and the isolated CPUs within the cpuset partitions.
Fix this issue by initializing wq_isolated_cpumask properly in
workqueue_init_early().
Fixes: fe28f631fa94 ("workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask")
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 33796b91871ad4010c8188372dd1faf97cf0f1c0 upstream.
During task_group creation, sched_create_group() calls
scx_group_set_weight() with CGROUP_WEIGHT_DFL to initialize the sched_ext
portion. This is premature and ends up calling ops.cgroup_set_weight() with
an incorrect @cgrp before ops.cgroup_init() is called.
sched_create_group() should just initialize SCX related fields in the new
task_group. Fix it by factoring out scx_tg_init() from sched_init() and
making sched_create_group() call that function instead of
scx_group_set_weight().
v2: Retain CONFIG_EXT_GROUP_SCHED ifdef in sched_init() as removing it leads
to build failures on !CONFIG_GROUP_SCHED configs.
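A hedged sketch of the factored-out initializer (the field name follows the commit text and may not match every kernel version exactly):

  /* Initialize only the sched_ext portion of a freshly created task_group;
   * ops.cgroup_set_weight() is no longer invoked from sched_create_group(). */
  void scx_tg_init(struct task_group *tg)
  {
          tg->scx_weight = CGROUP_WEIGHT_DFL;
  }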
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 819513666966 ("sched_ext: Add cgroup support")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 53ebef53a657d7957d35dc2b953db64f1bb28065 ]
The calculation of the index used to access the mask field in 'struct
bpf_raw_tp_null_args' is done with 'int' type, which could overflow when
the tracepoint being attached has more than 8 arguments.
While none of the tracepoints mentioned in raw_tp_null_args[] currently
have more than 8 arguments, there do exist tracepoints that had more
than 8 arguments (e.g. iocost_iocg_forgive_debt), so use the correct
type for calculation and avoid Smatch static checker warning.
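A self-contained illustration of the overflow class being fixed (the 4-bits-per-argument mask layout follows the commit description; the code is not the kernel's):

  #include <stdint.h>
  #include <stdio.h>

  /* Check whether argument 'arg' is marked in a 4-bits-per-arg mask. */
  static int arg_marked(uint64_t mask, unsigned int arg)
  {
          /* Broken pattern: (0xf << (arg * 4)) shifts an int, so the shift
           * count reaches 32 for arg >= 8 and the behavior is undefined. */
          /* return mask & (0xf << (arg * 4)); */

          /* Fixed pattern: do the shift in 64 bits. */
          return ((mask >> ((uint64_t)arg * 4)) & 0xf) ? 1 : 0;
  }

  int main(void)
  {
          uint64_t mask = 0xfULL << 36;   /* only argument 9 is marked */

          printf("arg 9 marked: %d\n", arg_marked(mask, 9));
          return 0;
  }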
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20250418074946.35569-1-shung-hsi.yu@suse.com
Closes: https://lore.kernel.org/r/843a3b94-d53d-42db-93d4-be10a4090146@stanley.mountain/
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 94bde253d3ae5d8a01cb958663b12daef1d06574 ]
There is currently some confusion in the s390x JIT regarding whether
orig_call can be NULL and what that means. Originally the NULL value
was used to distinguish the struct_ops case, but this was superseded by
BPF_TRAMP_F_INDIRECT (see commit 0c970ed2f87c ("s390/bpf: Fix indirect
trampoline generation").
The remaining reason to have this check is that NULL can actually be
passed to the arch_bpf_trampoline_size() call - but not to the
respective arch_prepare_bpf_trampoline()! call - by
bpf_struct_ops_prepare_trampoline().
Remove this asymmetry by passing stub_func to both functions, so that
JITs may rely on orig_call never being NULL.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20250512221911.61314-2-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d4965578267e2e81f67c86e2608481e77e9c8569 ]
The bpf_map_lookup_percpu_elem() helper is also available to sleepable bpf
programs. When BPF JIT is disabled or on a 32-bit host,
bpf_map_lookup_percpu_elem() will not be inlined. Using it in a
sleepable bpf program will trigger the warning in
bpf_map_lookup_percpu_elem(), because the bpf program only holds the
rcu_read_lock_trace lock. Therefore, add the missed check.
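A hedged sketch of the added check (the exact assertion in the helper may differ slightly):

  /* Sleepable BPF programs hold the RCU-tasks-trace read lock rather than
   * plain RCU, so accept that reader state too instead of warning. */
  WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held() &&
               !rcu_read_lock_trace_held());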
Reported-by: syzbot+dce5aae19ae4d6399986@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/000000000000176a130617420310@google.com/
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250526062534.1105938-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 08d7becc1a6b8c936e25d827becabfe3bff72a36 ]
Right now, if the clocksource watchdog detects a clocksource skew, it might
perform a per CPU check, for example in the TSC case on x86. In other
words: supposing TSC is detected as unstable by the clocksource watchdog
running at CPU1, as part of marking TSC unstable the kernel will also run a
check of TSC readings on some CPUs to be sure it is synced between them
all.
But that check happens only on some CPUs, not all of them; this choice is
based on the parameter "verify_n_cpus" and in some random cpumask
calculation. So, the watchdog runs such per CPU checks on up to
"verify_n_cpus" random CPUs among all online CPUs, with the risk of
repeating CPUs (that aren't double checked) in the cpumask random
calculation.
But if "verify_n_cpus" > num_online_cpus(), it should skip the random
calculation and just go ahead and check the clocksource sync between
all online CPUs, without the risk of skipping some CPUs due to
duplicity in the random cpumask calculation.
Tests on a 4 CPU laptop with TSC skew detected led to some cases of the per
CPU verification skipping some CPUs even with verify_n_cpus=8, due to the
duplicity in random cpumask generation. Skipping the randomization when the
number of online CPUs is smaller than verify_n_cpus solves that.
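A hedged sketch of the resulting selection logic (presumably in the CPU-choosing helper; variable names are illustrative):

  /* If the requested sample size already covers every online CPU, skip the
   * random sampling (and its chance of duplicate picks) and check them all. */
  if (verify_n_cpus >= num_online_cpus()) {
          cpumask_copy(&cpus_chosen, cpu_online_mask);
          cpumask_clear_cpu(smp_processor_id(), &cpus_chosen);
          return;
  }
  /* Otherwise fall back to picking verify_n_cpus random online CPUs. */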
Suggested-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/all/20250323173857.372390-1-gpiccoli@igalia.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit f914b52c379c12288b7623bb814d0508dbe7481d upstream.
The following issue happens with a buggy module:
BUG: unable to handle page fault for address: ffffffffc05d0218
PGD 1bd66f067 P4D 1bd66f067 PUD 1bd671067 PMD 101808067 PTE 0
Oops: Oops: 0000 [#1] SMP KASAN PTI
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
RIP: 0010:sized_strscpy+0x81/0x2f0
RSP: 0018:ffff88812d76fa08 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffffffc0601010 RCX: dffffc0000000000
RDX: 0000000000000038 RSI: dffffc0000000000 RDI: ffff88812608da2d
RBP: 8080808080808080 R08: ffff88812608da2d R09: ffff88812608da68
R10: ffff88812608d82d R11: ffff88812608d810 R12: 0000000000000038
R13: ffff88812608da2d R14: ffffffffc05d0218 R15: fefefefefefefeff
FS: 00007fef552de740(0000) GS:ffff8884251c7000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffc05d0218 CR3: 00000001146f0000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
ftrace_mod_get_kallsym+0x1ac/0x590
update_iter_mod+0x239/0x5b0
s_next+0x5b/0xa0
seq_read_iter+0x8c9/0x1070
seq_read+0x249/0x3b0
proc_reg_read+0x1b0/0x280
vfs_read+0x17f/0x920
ksys_read+0xf3/0x1c0
do_syscall_64+0x5f/0x2e0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
The above issue may happen as follows:
(1) Add a kprobe tracepoint;
(2) insmod test.ko;
(3) The module triggers ftrace being disabled;
(4) rmmod test.ko;
(5) cat /proc/kallsyms; --> Will trigger a UAF as test.ko has already been removed;
ftrace_mod_get_kallsym()
...
strscpy(module_name, mod_map->mod->name, MODULE_NAME_LEN);
...
The problem is when a module triggers an issue with ftrace and
sets ftrace_disable. The ftrace_disable is set when an anomaly is
discovered and to prevent any more damage, ftrace stops all text
modification. The issue that happened was that the ftrace_disable stops
more than just the text modification.
When a module is loaded, its init functions can also be traced. Because
kallsyms deletes the init functions after a module has loaded, ftrace
saves them when the module is loaded and function tracing is enabled. This
allows the output of the function trace to show the init function names
instead of just their raw memory addresses.
When a module is removed, ftrace_release_mod() is called, and if
ftrace_disable is set, it just returns without doing anything more. The
problem here is that it leaves the mod_list still around and if kallsyms
is called, it will call into this code and access the module memory that
has already been freed as it will return:
strscpy(module_name, mod_map->mod->name, MODULE_NAME_LEN);
Where the "mod" no longer exists and triggers a UAF bug.
Link: https://lore.kernel.org/all/20250523135452.626d8dcd@gandalf.local.home/
Cc: stable@vger.kernel.org
Fixes: aba4b5c22cba ("ftrace: Save module init functions kallsyms symbols for tracing")
Link: https://lore.kernel.org/20250529111955.2349189-2-yebin@huaweicloud.com
Signed-off-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7123dbbef88cfd9f09e8a7899b0911834600cfa3 upstream.
When updating `watchdog_thresh`, there is a race condition between writing
the new `watchdog_thresh` value and stopping the old watchdog timer. If
the old timer triggers during this window, it may falsely detect a
softlockup due to the old interval and the new `watchdog_thresh` value
being used. The problem can be described as follow:
  # We assume the previous watchdog_thresh is 60, so the watchdog timer is
  # coming every 24s.
  echo 10 > /proc/sys/kernel/watchdog_thresh (User space)
    |
    +------>+ update watchdog_thresh (We are in kernel now)
             |
             |  # using old interval and new `watchdog_thresh`
             +------>+ watchdog hrtimer (irq context: detect softlockup)
                      |
                      |
                      +-------+
                              |
                              |
                              + softlockup_stop_all
To fix this problem, introduce a shadow variable for `watchdog_thresh`.
The update to the actual `watchdog_thresh` is delayed until after the old
timer is stopped, preventing false positives.
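A hedged sketch of the shadow-variable scheme (names follow the description; the real kernel/watchdog.c code paths are more involved):

  static int watchdog_thresh_next;        /* written by the sysctl handler */
  int watchdog_thresh = 10;               /* read by watchdog_timer_fn() */

  static void softlockup_reconfigure(void)
  {
          softlockup_stop_all();                   /* old-interval timers are gone */
          watchdog_thresh = watchdog_thresh_next;  /* now safe to publish */
          softlockup_start_all();
  }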
The following testcase may help to understand this problem.
---------------------------------------------
echo RT_RUNTIME_SHARE > /sys/kernel/debug/sched/features
echo -1 > /proc/sys/kernel/sched_rt_runtime_us
echo 0 > /sys/kernel/debug/sched/fair_server/cpu3/runtime
echo 60 > /proc/sys/kernel/watchdog_thresh
taskset -c 3 chrt -r 99 /bin/bash -c "while true;do true; done" &
echo 10 > /proc/sys/kernel/watchdog_thresh &
---------------------------------------------
The test case above first removes the throttling restrictions for
real-time tasks. It then sets watchdog_thresh to 60 and executes a
real-time task, a simple while(1) loop, on cpu3. Consequently, the final
command gets blocked because the presence of this real-time thread
prevents kworker:3 from being selected by the scheduler. This eventually
triggers a softlockup detection on cpu3 due to watchdog_timer_fn operating
with inconsistent values: it uses both the old interval and the updated
watchdog_thresh simultaneously.
[nysal@linux.ibm.com: fix the SOFTLOCKUP_DETECTOR=n case]
Link: https://lkml.kernel.org/r/20250502111120.282690-1-nysal@linux.ibm.com
Link: https://lkml.kernel.org/r/20250421035021.3507649-1-luogengkun@huaweicloud.com
Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
Signed-off-by: Nysal Jan K.A. <nysal@linux.ibm.com>
Cc: Doug Anderson <dianders@chromium.org>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: "Nysal Jan K.A." <nysal@linux.ibm.com>
Cc: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 37fb58a7273726e59f9429c89ade5116083a213d upstream.
An issue was found:
# cd /sys/fs/cgroup/freezer/
# mkdir test
# echo FROZEN > test/freezer.state
# cat test/freezer.state
FROZEN
# sleep 1000 &
[1] 863
# echo 863 > test/cgroup.procs
# cat test/freezer.state
FREEZING
When tasks are migrated to a frozen cgroup, the freezer fails to
immediately freeze the tasks, causing the cgroup to remain in the
"FREEZING" state.
The freeze_task() function is called before clearing the CGROUP_FROZEN
flag. This causes the freezing() check to incorrectly return false,
preventing __freeze_task() from being invoked for the migrated task.
To fix this issue, clear the CGROUP_FROZEN state before calling
freeze_task().
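A hedged sketch of the reordered attach path (simplified: the real code also walks up the ancestors when clearing CGROUP_FROZEN):

  if (!(freezer->state & CGROUP_FREEZING)) {
          __thaw_task(task);
  } else {
          /* Clear FROZEN first so freezing() still returns true for the
           * migrated task and __freeze_task() actually runs. */
          freezer->state &= ~CGROUP_FROZEN;
          freeze_task(task);
  }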
Fixes: f5d39b020809 ("freezer,sched: Rewrite core freezer logic")
Cc: stable@vger.kernel.org # v6.1+
Reported-by: Zhong Jiawei <zhongjiawei1@huawei.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c98cc9797b7009308fff73d41bc1d08642dab77a upstream.
Running a modified trace-cmd record --nosplice where it does a mmap of the
ring buffer when '--nosplice' is set, caused the following lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
6.15.0-rc7-test-00002-gfb7d03d8a82f #551 Not tainted
------------------------------------------------------
trace-cmd/1113 is trying to acquire lock:
ffff888100062888 (&buffer->mutex){+.+.}-{4:4}, at: ring_buffer_map+0x11c/0xe70
but task is already holding lock:
ffff888100a5f9f8 (&cpu_buffer->mapping_lock){+.+.}-{4:4}, at: ring_buffer_map+0xcf/0xe70
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #5 (&cpu_buffer->mapping_lock){+.+.}-{4:4}:
__mutex_lock+0x192/0x18c0
ring_buffer_map+0xcf/0xe70
tracing_buffers_mmap+0x1c4/0x3b0
__mmap_region+0xd8d/0x1f70
do_mmap+0x9d7/0x1010
vm_mmap_pgoff+0x20b/0x390
ksys_mmap_pgoff+0x2e9/0x440
do_syscall_64+0x79/0x1c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #4 (&mm->mmap_lock){++++}-{4:4}:
__might_fault+0xa5/0x110
_copy_to_user+0x22/0x80
_perf_ioctl+0x61b/0x1b70
perf_ioctl+0x62/0x90
__x64_sys_ioctl+0x134/0x190
do_syscall_64+0x79/0x1c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #3 (&cpuctx_mutex){+.+.}-{4:4}:
__mutex_lock+0x192/0x18c0
perf_event_init_cpu+0x325/0x7c0
perf_event_init+0x52a/0x5b0
start_kernel+0x263/0x3e0
x86_64_start_reservations+0x24/0x30
x86_64_start_kernel+0x95/0xa0
common_startup_64+0x13e/0x141
-> #2 (pmus_lock){+.+.}-{4:4}:
__mutex_lock+0x192/0x18c0
perf_event_init_cpu+0xb7/0x7c0
cpuhp_invoke_callback+0x2c0/0x1030
__cpuhp_invoke_callback_range+0xbf/0x1f0
_cpu_up+0x2e7/0x690
cpu_up+0x117/0x170
cpuhp_bringup_mask+0xd5/0x120
bringup_nonboot_cpus+0x13d/0x170
smp_init+0x2b/0xf0
kernel_init_freeable+0x441/0x6d0
kernel_init+0x1e/0x160
ret_from_fork+0x34/0x70
ret_from_fork_asm+0x1a/0x30
-> #1 (cpu_hotplug_lock){++++}-{0:0}:
cpus_read_lock+0x2a/0xd0
ring_buffer_resize+0x610/0x14e0
__tracing_resize_ring_buffer.part.0+0x42/0x120
tracing_set_tracer+0x7bd/0xa80
tracing_set_trace_write+0x132/0x1e0
vfs_write+0x21c/0xe80
ksys_write+0xf9/0x1c0
do_syscall_64+0x79/0x1c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #0 (&buffer->mutex){+.+.}-{4:4}:
__lock_acquire+0x1405/0x2210
lock_acquire+0x174/0x310
__mutex_lock+0x192/0x18c0
ring_buffer_map+0x11c/0xe70
tracing_buffers_mmap+0x1c4/0x3b0
__mmap_region+0xd8d/0x1f70
do_mmap+0x9d7/0x1010
vm_mmap_pgoff+0x20b/0x390
ksys_mmap_pgoff+0x2e9/0x440
do_syscall_64+0x79/0x1c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
other info that might help us debug this:
Chain exists of:
&buffer->mutex --> &mm->mmap_lock --> &cpu_buffer->mapping_lock
Possible unsafe locking scenario:
         CPU0                          CPU1
         ----                          ----
    lock(&cpu_buffer->mapping_lock);
                                       lock(&mm->mmap_lock);
                                       lock(&cpu_buffer->mapping_lock);
    lock(&buffer->mutex);
*** DEADLOCK ***
2 locks held by trace-cmd/1113:
#0: ffff888106b847e0 (&mm->mmap_lock){++++}-{4:4}, at: vm_mmap_pgoff+0x192/0x390
#1: ffff888100a5f9f8 (&cpu_buffer->mapping_lock){+.+.}-{4:4}, at: ring_buffer_map+0xcf/0xe70
stack backtrace:
CPU: 5 UID: 0 PID: 1113 Comm: trace-cmd Not tainted 6.15.0-rc7-test-00002-gfb7d03d8a82f #551 PREEMPT
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x6e/0xa0
print_circular_bug.cold+0x178/0x1be
check_noncircular+0x146/0x160
__lock_acquire+0x1405/0x2210
lock_acquire+0x174/0x310
? ring_buffer_map+0x11c/0xe70
? ring_buffer_map+0x11c/0xe70
? __mutex_lock+0x169/0x18c0
__mutex_lock+0x192/0x18c0
? ring_buffer_map+0x11c/0xe70
? ring_buffer_map+0x11c/0xe70
? function_trace_call+0x296/0x370
? __pfx___mutex_lock+0x10/0x10
? __pfx_function_trace_call+0x10/0x10
? __pfx___mutex_lock+0x10/0x10
? _raw_spin_unlock+0x2d/0x50
? ring_buffer_map+0x11c/0xe70
? ring_buffer_map+0x11c/0xe70
? __mutex_lock+0x5/0x18c0
ring_buffer_map+0x11c/0xe70
? do_raw_spin_lock+0x12d/0x270
? find_held_lock+0x2b/0x80
? _raw_spin_unlock+0x2d/0x50
? rcu_is_watching+0x15/0xb0
? _raw_spin_unlock+0x2d/0x50
? trace_preempt_on+0xd0/0x110
tracing_buffers_mmap+0x1c4/0x3b0
__mmap_region+0xd8d/0x1f70
? ring_buffer_lock_reserve+0x99/0xff0
? __pfx___mmap_region+0x10/0x10
? ring_buffer_lock_reserve+0x99/0xff0
? __pfx_ring_buffer_lock_reserve+0x10/0x10
? __pfx_ring_buffer_lock_reserve+0x10/0x10
? bpf_lsm_mmap_addr+0x4/0x10
? security_mmap_addr+0x46/0xd0
? lock_is_held_type+0xd9/0x130
do_mmap+0x9d7/0x1010
? 0xffffffffc0370095
? __pfx_do_mmap+0x10/0x10
vm_mmap_pgoff+0x20b/0x390
? __pfx_vm_mmap_pgoff+0x10/0x10
? 0xffffffffc0370095
ksys_mmap_pgoff+0x2e9/0x440
do_syscall_64+0x79/0x1c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fb0963a7de2
Code: 00 00 00 0f 1f 44 00 00 41 f7 c1 ff 0f 00 00 75 27 55 89 cd 53 48 89 fb 48 85 ff 74 3b 41 89 ea 48 89 df b8 09 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 76 5b 5d c3 0f 1f 00 48 8b 05 e1 9f 0d 00 64
RSP: 002b:00007ffdcc8fb878 EFLAGS: 00000246 ORIG_RAX: 0000000000000009
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb0963a7de2
RDX: 0000000000000001 RSI: 0000000000001000 RDI: 0000000000000000
RBP: 0000000000000001 R08: 0000000000000006 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffdcc8fbe68 R14: 00007fb096628000 R15: 00005633e01a5c90
</TASK>
The issue is that cpus_read_lock() is taken within buffer->mutex. The
memory mapped pages are taken with the mmap_lock held. The buffer->mutex
is taken within the cpu_buffer->mapping_lock. There's quite a chain with
all these locks, where the deadlock can be fixed by moving the
cpus_read_lock() outside the taking of the buffer->mutex.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250527105820.0f45d045@gandalf.local.home
Fixes: 117c39200d9d7 ("ring-buffer: Introducing ring-buffer mapping functions")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 40ee2afafc1d9fe3aa44a6fbe440d78a5c96a72e upstream.
Enlarge the critical section in ring_buffer_subbuf_order_set() to
ensure that error handling takes place with per-buffer mutex held,
thus preventing list corruption and other concurrency-related issues.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Link: https://lore.kernel.org/20250606112242.1510605-1-dmantipov@yandex.ru
Reported-by: syzbot+05d673e83ec640f0ced9@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=05d673e83ec640f0ced9
Fixes: f9b94daa542a8 ("ring-buffer: Set new size of the ring buffer sub page")
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4fc78a7c9ca994e1da5d3940704d4e8f0ea8c5e4 upstream.
When reading a memory mapped buffer the reader page is just swapped out
with the last page written in the write buffer. If the reader page is the
same as the commit buffer (the buffer that is currently being written to)
it was assumed that it should never have missed events. If it does, it
triggers a WARN_ON_ONCE().
But there just happens to be one scenario where this can legitimately
happen. That is on a commit_overrun. A commit overrun is when an interrupt
preempts an event being written to the buffer and then the interrupt adds
so many new events that it fills and wraps the buffer back to the commit.
Any new events would then be dropped and be reported as "missed_events".
In this case, the next page to read is the commit buffer and after the
swap of the reader page, the reader page will be the commit buffer, but
this time there will be missed events and this triggers the following
warning:
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1127 at kernel/trace/ring_buffer.c:7357 ring_buffer_map_get_reader+0x49a/0x780
Modules linked in: kvm_intel kvm irqbypass
CPU: 2 UID: 0 PID: 1127 Comm: trace-cmd Not tainted 6.15.0-rc7-test-00004-g478bc2824b45-dirty #564 PREEMPT
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:ring_buffer_map_get_reader+0x49a/0x780
Code: 00 00 00 48 89 fe 48 c1 ee 03 80 3c 2e 00 0f 85 ec 01 00 00 4d 3b a6 a8 00 00 00 0f 85 8a fd ff ff 48 85 c0 0f 84 55 fe ff ff <0f> 0b e9 4e fe ff ff be 08 00 00 00 4c 89 54 24 58 48 89 54 24 50
RSP: 0018:ffff888121787dc0 EFLAGS: 00010002
RAX: 00000000000006a2 RBX: ffff888100062800 RCX: ffffffff8190cb49
RDX: ffff888126934c00 RSI: 1ffff11020200a15 RDI: ffff8881010050a8
RBP: dffffc0000000000 R08: 0000000000000000 R09: ffffed1024d26982
R10: ffff888126934c17 R11: ffff8881010050a8 R12: ffff888126934c00
R13: ffff8881010050b8 R14: ffff888101005000 R15: ffff888126930008
FS: 00007f95c8cd7540(0000) GS:ffff8882b576e000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f95c8de4dc0 CR3: 0000000128452002 CR4: 0000000000172ef0
Call Trace:
<TASK>
? __pfx_ring_buffer_map_get_reader+0x10/0x10
tracing_buffers_ioctl+0x283/0x370
__x64_sys_ioctl+0x134/0x190
do_syscall_64+0x79/0x1c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f95c8de48db
Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
RSP: 002b:00007ffe037ba110 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffe037bb2b0 RCX: 00007f95c8de48db
RDX: 0000000000000000 RSI: 0000000000005220 RDI: 0000000000000006
RBP: 00007ffe037ba180 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffe037bb6f8 R14: 00007f95c9065000 R15: 00005575c7492c90
</TASK>
irq event stamp: 5080
hardirqs last enabled at (5079): [<ffffffff83e0adb0>] _raw_spin_unlock_irqrestore+0x50/0x70
hardirqs last disabled at (5080): [<ffffffff83e0aa83>] _raw_spin_lock_irqsave+0x63/0x70
softirqs last enabled at (4182): [<ffffffff81516122>] handle_softirqs+0x552/0x710
softirqs last disabled at (4159): [<ffffffff815163f7>] __irq_exit_rcu+0x107/0x210
---[ end trace 0000000000000000 ]---
The above was triggered by running on a kernel with both lockdep and KASAN
as well as kmemleak enabled and executing the following command:
# perf record -o perf-test.dat -a -- trace-cmd record --nosplice -e all -p function hackbench 50
With perf injecting a lot of interrupts and trace-cmd enabling all
events as well as function tracing, with lockdep, KASAN and kmemleak
enabled, an interrupt preempting an event being written could add enough
events to wrap the buffer. trace-cmd was modified to have --nosplice use
mmap instead of reading the buffer.
To differentiate this case from the normal case, where only one page has
been written to and the swap of the reader page received that one page
(which is the commit page), check whether the tail page is on the reader
page. The difference between the commit page and the tail page is that
the tail page is where new writes go, while the commit page holds the
first write that has not been committed yet. When an interrupt preempts
the write of an event and fills the buffer, it moves the tail page but
not the commit page.
Have the warning trigger only if the tail page is also on the reader page,
and also print out the number of events dropped by a commit overrun, since
that count cannot yet be safely added to the reader page; this still lets
the reader see that events were dropped.
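A rough sketch of the tightened condition, assuming simplified field names
on the per-CPU buffer and showing the "print" as a plain pr_info() (not the
literal upstream code):
  if (missed_events) {
          /* Only a reader page that is also the tail page is unexpected;
           * reader page == commit page with missed events can simply be
           * a commit overrun.
           */
          if (cpu_buffer->reader_page == cpu_buffer->tail_page)
                  WARN_ON_ONCE(1);
          else
                  pr_info("missed %lu events due to commit overrun\n",
                          missed_events);
  }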
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250528121555.2066527e@gandalf.local.home
Fixes: fe832be05a8ee ("ring-buffer: Have mmapped ring buffer keep track of missed events")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
posix_cpu_timer_del()
commit f90fff1e152dedf52b932240ebbd670d83330eca upstream.
If an exiting non-autoreaping task has already passed exit_notify() and
calls handle_posix_cpu_timers() from IRQ, it can be reaped by its parent
or debugger right after unlock_task_sighand().
If a concurrent posix_cpu_timer_del() runs at that moment, it won't be
able to detect timer->it.cpu.firing != 0: cpu_timer_task_rcu() and/or
lock_task_sighand() will fail.
Add the tsk->exit_state check into run_posix_cpu_timers() to fix this.
This fix is not needed if CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y, because
exit_task_work() is called before exit_notify(). But the check still
makes sense, since task_work_add(&tsk->posix_cputimers_work.work) will
fail anyway in this case.
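A minimal sketch of where the check lands (tsk is current in this path;
simplified, not the exact upstream hunk):
  /* In run_posix_cpu_timers(), before firing any timers: a task that has
   * already passed exit_notify() may be reaped right after
   * unlock_task_sighand(), so bail out early.
   */
  if (tsk->exit_state)
          return;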
Cc: stable@vger.kernel.org
Reported-by: Benoît Sevens <bsevens@google.com>
Fixes: 0bdd2ed4138e ("sched: run_posix_cpu_timers: Don't check ->exit_state, use lock_task_sighand()")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 7ed9138a72829d2035ecbd8dbd35b1bc3c137c40 ]
Ravi reported that the bpf_perf_link_attach() usage of
perf_event_set_bpf_prog() is not serialized by ctx->mutex, unlike the
PERF_EVENT_IOC_SET_BPF case.
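A sketch of the intent, assuming the attach path simply wraps the call in
the event context mutex (the upstream change may structure the locking
differently):
  /* serialize with the ioctl path, which already holds ctx->mutex */
  mutex_lock(&event->ctx->mutex);
  err = perf_event_set_bpf_prog(event, prog, bpf_cookie);
  mutex_unlock(&event->ctx->mutex);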
Reported-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lkml.kernel.org/r/20250307193305.486326750@infradead.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 86bc9c742426a16b52a10ef61f5b721aecca2344 ]
syzkaller reported an issue:
WARNING: CPU: 3 PID: 217 at kernel/bpf/core.c:2357 __bpf_prog_ret0_warn+0xa/0x20 kernel/bpf/core.c:2357
Modules linked in:
CPU: 3 UID: 0 PID: 217 Comm: kworker/u32:6 Not tainted 6.15.0-rc4-syzkaller-00040-g8bac8898fe39
RIP: 0010:__bpf_prog_ret0_warn+0xa/0x20 kernel/bpf/core.c:2357
Call Trace:
<TASK>
bpf_dispatcher_nop_func include/linux/bpf.h:1316 [inline]
__bpf_prog_run include/linux/filter.h:718 [inline]
bpf_prog_run include/linux/filter.h:725 [inline]
cls_bpf_classify+0x74a/0x1110 net/sched/cls_bpf.c:105
...
When creating a bpf program, 'fp->jit_requested' depends on bpf_jit_enable.
This issue is triggered because CONFIG_BPF_JIT_ALWAYS_ON is not set
while bpf_jit_enable is set to 1, causing the arch to attempt to JIT the
prog, but the JIT fails due to FAULT_INJECTION. As a result, the program
is incorrectly treated as valid; when the program runs it calls
`__bpf_prog_ret0_warn` and triggers the WARN_ON_ONCE(1).
Reported-by: syzbot+0903f6d7f285e41cdf10@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/6816e34e.a70a0220.254cdc.002c.GAE@google.com
Fixes: fa9dd599b4da ("bpf: get rid of pure_initcall dependency to enable jits")
Signed-off-by: KaFai Wan <mannkafai@gmail.com>
Link: https://lore.kernel.org/r/20250526133358.2594176-1-mannkafai@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
multi-uprobe attach logic"
[ Upstream commit 4e2e6841ff761cc15a54e8bebcf35d7325ec78a2 ]
This reverts commit 4a8f635a6054.
Although get_pid_task() internally already calls rcu_read_lock() and
rcu_read_unlock(), the find_vpid() call was not protected.
The documentation for find_vpid() clearly states:
"Must be called with the tasklist_lock or rcu_read_lock() held."
Add proper rcu_read_lock/unlock() to protect the find_vpid().
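A minimal sketch of the restored protection (variable names assumed):
  struct task_struct *task;

  rcu_read_lock();
  /* find_vpid() must run under rcu_read_lock() or tasklist_lock */
  task = get_pid_task(find_vpid(pid), PIDTYPE_PID);
  rcu_read_unlock();
  if (!task)
          return -ESRCH;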
Fixes: 4a8f635a6054 ("bpf: remove unnecessary rcu_read_{lock,unlock}() in multi-uprobe attach logic")
Reported-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Di Shen <di.shen@unisoc.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20250520054943.5002-1-xuewen.yan@unisoc.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 3880cdbed1c4607e378f58fa924c5d6df900d1d3 ]
syzkaller reported an issue:
WARNING: CPU: 3 PID: 5971 at kernel/trace/bpf_trace.c:1861 get_bpf_raw_tp_regs+0xa4/0x100 kernel/trace/bpf_trace.c:1861
Modules linked in:
CPU: 3 UID: 0 PID: 5971 Comm: syz-executor205 Not tainted 6.15.0-rc5-syzkaller-00038-g707df3375124 #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:get_bpf_raw_tp_regs+0xa4/0x100 kernel/trace/bpf_trace.c:1861
RSP: 0018:ffffc90003636fa8 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000003 RCX: ffffffff81c6bc4c
RDX: ffff888032efc880 RSI: ffffffff81c6bc83 RDI: 0000000000000005
RBP: ffff88806a730860 R08: 0000000000000005 R09: 0000000000000003
R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000004
R13: 0000000000000001 R14: ffffc90003637008 R15: 0000000000000900
FS: 0000000000000000(0000) GS:ffff8880d6cdf000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7baee09130 CR3: 0000000029f5a000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
____bpf_get_stack_raw_tp kernel/trace/bpf_trace.c:1934 [inline]
bpf_get_stack_raw_tp+0x24/0x160 kernel/trace/bpf_trace.c:1931
bpf_prog_ec3b2eefa702d8d3+0x43/0x47
bpf_dispatcher_nop_func include/linux/bpf.h:1316 [inline]
__bpf_prog_run include/linux/filter.h:718 [inline]
bpf_prog_run include/linux/filter.h:725 [inline]
__bpf_trace_run kernel/trace/bpf_trace.c:2363 [inline]
bpf_trace_run3+0x23f/0x5a0 kernel/trace/bpf_trace.c:2405
__bpf_trace_mmap_lock_acquire_returned+0xfc/0x140 include/trace/events/mmap_lock.h:47
__traceiter_mmap_lock_acquire_returned+0x79/0xc0 include/trace/events/mmap_lock.h:47
__do_trace_mmap_lock_acquire_returned include/trace/events/mmap_lock.h:47 [inline]
trace_mmap_lock_acquire_returned include/trace/events/mmap_lock.h:47 [inline]
__mmap_lock_do_trace_acquire_returned+0x138/0x1f0 mm/mmap_lock.c:35
__mmap_lock_trace_acquire_returned include/linux/mmap_lock.h:36 [inline]
mmap_read_trylock include/linux/mmap_lock.h:204 [inline]
stack_map_get_build_id_offset+0x535/0x6f0 kernel/bpf/stackmap.c:157
__bpf_get_stack+0x307/0xa10 kernel/bpf/stackmap.c:483
____bpf_get_stack kernel/bpf/stackmap.c:499 [inline]
bpf_get_stack+0x32/0x40 kernel/bpf/stackmap.c:496
____bpf_get_stack_raw_tp kernel/trace/bpf_trace.c:1941 [inline]
bpf_get_stack_raw_tp+0x124/0x160 kernel/trace/bpf_trace.c:1931
bpf_prog_ec3b2eefa702d8d3+0x43/0x47
Tracepoints like trace_mmap_lock_acquire_returned may cause nested calls,
as the corner case above shows; this will be resolved with a more general
method in the future. As a result, the WARN_ON_ONCE will be triggered. As
Alexei suggested, remove the WARN_ON_ONCE as a first step.
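Schematically, and assuming the nesting guard in get_bpf_raw_tp_regs()
has the usual shape, the bail-out stays but the warning goes away:
  /* before */
  if (WARN_ON_ONCE(nest_level > ARRAY_SIZE(tp_regs->regs))) {
          this_cpu_dec(bpf_raw_tp_nest_level);
          return ERR_PTR(-EBUSY);
  }

  /* after: same -EBUSY bail-out, without warning on legitimate nesting */
  if (nest_level > ARRAY_SIZE(tp_regs->regs)) {
          this_cpu_dec(bpf_raw_tp_nest_level);
          return ERR_PTR(-EBUSY);
  }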
Fixes: 9594dc3c7e71 ("bpf: fix nested bpf tracepoints with per-cpu data")
Reported-by: syzbot+45b0c89a0fc7ae8dbadc@syzkaller.appspotmail.com
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250513042747.757042-1-chen.dylane@linux.dev
Closes: https://lore.kernel.org/bpf/8bc2554d-1052-4922-8832-e0078a033e1d@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c5dd28e7fb4f63475b50df4f58311df92939d011 ]
According to the trigger_data_alloc() documentation, trigger_data_free()
should be used to free an event_trigger_data object. This fixes a mismatch
introduced when kzalloc() was replaced with trigger_data_alloc() without
updating the corresponding deallocation calls.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://lore.kernel.org/20250507145455.944453325@goodmis.org
Link: https://lore.kernel.org/20250318112737.4174-1-linmq006@gmail.com
Fixes: e1f187d09e11 ("tracing: Have existing event_command.parse() implementations use helpers")
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
[ SDR: Changed event_trigger_alloc/free() to trigger_data_alloc/free() ]
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f2947c4b7d0f235621c5daf78aecfbd6e22c05e5 ]
The function event_trigger_alloc() creates an event_trigger_data
descriptor and states that it needs to be freed via event_trigger_free().
This is incorrect; it needs to be freed by trigger_data_free(), as
event_trigger_free() adds ref counting.
Rename event_trigger_alloc() to trigger_data_alloc() and state that it
needs to be freed via trigger_data_free(). The previous naming convention
was introducing bugs.
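A sketch of the intended pairing after the rename; the argument list is
assumed here, not taken from the real prototype:
  struct event_trigger_data *data;

  data = trigger_data_alloc(cmd_ops, cmd, param, file);   /* args assumed */
  if (!data)
          return -ENOMEM;
  /* ... if a later step fails before the trigger is registered ... */
  trigger_data_free(data);   /* pairs with trigger_data_alloc() */
  /* event_trigger_free() is only for data that has taken a reference */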
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://lore.kernel.org/20250507145455.776436410@goodmis.org
Fixes: 86599dbe2c527 ("tracing: Add helper functions to simplify event_command.parse() callback handling")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 7ab0fc61ce73040f89b12d76a8279995ec283541 ]
The histogram trigger has three somewhat large arrays on the kernel stack:
unsigned long entries[HIST_STACKTRACE_DEPTH];
u64 var_ref_vals[TRACING_MAP_VARS_MAX];
char compound_key[HIST_KEY_SIZE_MAX];
Checking the stack frame size of event_hist_trigger(), it currently
uses 816 bytes due to these variables!
Instead, allocate a per CPU structure that holds these arrays for each
context level (normal, softirq, irq and NMI). That is, each CPU will have
4 of these structures. This will be allocated when the first histogram
trigger is enabled and freed when the last is disabled. When the
histogram callback triggers, it will request this structure. The request
will disable preemption, get the per CPU structure at the index of the
per CPU variable, and increment that variable.
The callback will use the arrays in this structure to perform its work and
then release the structure. That in turn will simply decrement the per CPU
index and enable preemption.
Moving the variables from the kernel stack to the per CPU structure brings
the stack frame of event_hist_trigger() down to just 112 bytes.
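A condensed sketch of the scheme, with all names illustrative (the upstream
structure is allocated when the first trigger is enabled rather than being
statically defined as shown here):
  struct hist_scratch {
          unsigned long entries[HIST_STACKTRACE_DEPTH];
          u64 var_ref_vals[TRACING_MAP_VARS_MAX];
          char compound_key[HIST_KEY_SIZE_MAX];
  };

  /* one slot per context level: normal, softirq, irq, NMI */
  struct hist_scratch_cpu {
          struct hist_scratch slot[4];
          int nesting;
  };

  /* claim a slot in the callback */
  preempt_disable();
  cpu_scratch = this_cpu_ptr(hist_scratch_pcpu);   /* per-CPU pointer, illustrative */
  scratch = &cpu_scratch->slot[cpu_scratch->nesting++];
  /* ... use scratch->entries / var_ref_vals / compound_key ... */
  /* release the slot */
  cpu_scratch->nesting--;
  preempt_enable();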
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://lore.kernel.org/20250407123851.74ea8d58@gandalf.local.home
Fixes: 067fe038e70f6 ("tracing: Add variable reference handling to hist triggers")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 714070c4cb7a10ff57450a618a936775f3036245 ]
In the current implementation, if the program is dev-bound to a specific
device, it is not possible to perform XDP_REDIRECT into a DEVMAP or CPUMAP
even if the program is running in the driver NAPI context and is not
attached to any map entry. This seems to contradict the explanation
available in the bpf_prog_map_compatible() routine.
Fix the issue by introducing the __bpf_prog_map_compatible() utility
routine in order to avoid the bpf_prog_is_dev_bound() check when running
bpf_check_tail_call() at program load time (bpf_prog_select_runtime()).
Continue to forbid attaching a dev-bound program to XDP maps
(BPF_MAP_TYPE_PROG_ARRAY, BPF_MAP_TYPE_DEVMAP and BPF_MAP_TYPE_CPUMAP).
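A simplified sketch of the split described above (bodies elided; only the
shape is shown, not the upstream implementation):
  static bool __bpf_prog_map_compatible(struct bpf_map *map,
                                        const struct bpf_prog *fp)
  {
          /* common type/JIT compatibility checks live here */
          return true;
  }

  bool bpf_prog_map_compatible(struct bpf_map *map, const struct bpf_prog *fp)
  {
          /* the map attach path keeps refusing dev-bound programs */
          if (bpf_prog_is_dev_bound(fp->aux))
                  return false;
          return __bpf_prog_map_compatible(map, fp);
  }

  /* bpf_check_tail_call() at load time calls __bpf_prog_map_compatible()
   * directly, so a dev-bound program is no longer rejected there.
   */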
Fixes: 3d76a4d3d4e59 ("bpf: XDP metadata RX kfuncs")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 243911982aa9faf4361aa952f879331ad66933fe ]
link_create.flags is currently not used for multi-kprobes, so return
-EINVAL if it is set, the same as for other attach APIs.
We allow target_fd, on the other hand, to have an arbitrary value for
multi-kprobe, as there are existing users (libbpf) relying on this.
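A minimal sketch of the added validation at the start of the multi-kprobe
attach path (assuming the usual union bpf_attr *attr argument):
  if (attr->link_create.flags)
          return -EINVAL;   /* flags are unused here; reject non-zero values */
  /* attr->link_create.target_fd is intentionally left unvalidated */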
Fixes: 0dcac2725406 ("bpf: Add multi kprobe link")
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250407035752.1108927-1-chen.dylane@linux.dev
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1b17d4525bca3916644c41e01522df8fa0f8b90b ]
Commit cdb8c100d8a4 ("include/linux/suspend.h: Only show pm_pr_dbg
messages at suspend/resume") caused PM debug messages to only be
printed during system-wide suspend and resume in progress, but it
forgot about hibernation.
Address this by adding a check for hibernation in progress to
pm_debug_messages_should_print().
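A sketch of the resulting predicate; the hibernation test is shown with a
placeholder helper name, and the suspend-state part reflects the check
introduced by cdb8c100d8a4:
  bool pm_debug_messages_should_print(void)
  {
          return pm_debug_messages_on &&
                 (hibernation_in_progress() ||   /* placeholder name */
                  pm_suspend_target_state != PM_SUSPEND_ON);
  }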
Fixes: cdb8c100d8a4 ("include/linux/suspend.h: Only show pm_pr_dbg messages at suspend/resume")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://patch.msgid.link/4998903.GXAFRqVoOG@rjwysocki.net
Signed-off-by: Sasha Levin <sashal@kernel.org>
|