Age | Commit message (Collapse) | Author |
|
commit 5ffb537e416ee22dbfb3d552102e50da33fec7f6 upstream.
Add two tests:
- one test has 'rX <op> r10' where rX is not r10, and
- another test has 'rX <op> rY' where rX and rY are not r10
but there is an early insn 'rX = r10'.
Without previous verifier change, both tests will fail.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250524041340.4046304-1-yonghong.song@linux.dev
[ shung-hsi.yu: contains additional hunks for kernel/bpf/verifier.c that
should be part of the previous patch in the series, commit
e2d2115e56c4 "bpf: Do not include stack ptr register in precision
backtracking bookkeeping", which already incorporated. ]
Link: https://lore.kernel.org/all/9b41f9f5-396f-47e0-9a12-46c52087df6c@linux.dev/
Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 91a229bb7ba86b2592c3f18c54b7b2c5e6fe0f95 upstream.
A warning is raised when __request_region() detects a conflict with a
resource whose resource.desc is IORES_DESC_DEVICE_PRIVATE_MEMORY.
But this warning is only valid for iomem_resources.
The hmem device resource uses resource.desc as the numa node id, which can
cause spurious warnings.
This warning appeared on a machine with multiple cxl memory expanders.
One of the NUMA node id is 6, which is the same as the value of
IORES_DESC_DEVICE_PRIVATE_MEMORY.
In this environment it was just a spurious warning, but when I saw the
warning I suspected a real problem so it's better to fix it.
This change fixes this by restricting the warning to only iomem_resource.
This also adds a missing new line to the warning message.
Link: https://lkml.kernel.org/r/20250719112604.25500-1-akinobu.mita@gmail.com
Fixes: 7dab174e2e27 ("dax/hmem: Move hmem device registration to dax_hmem.ko")
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
drivers
commit 67c632b4a7fbd6b76a08b86f4950f0f84de93439 upstream.
Most drivers only populate the fields cycles and cs_id of system_counterval
in their get_time_fn() callback for get_device_system_crosststamp(), unless
they explicitly provide nanosecond values.
When the use_nsecs field was added to struct system_counterval, most
drivers did not care. Clock sources other than CSID_GENERIC could then get
converted in convert_base_to_cs() based on an uninitialized use_nsecs field,
which usually results in -EINVAL during the following range check.
Pass in a fully zero initialized system_counterval_t to cure that.
Fixes: 6b2e29977518 ("timekeeping: Provide infrastructure for converting to/from a base clock")
Signed-off-by: Markus Blöchl <markus@blochl.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250720-timekeeping_uninit_crossts-v2-1-f513c885b7c2@blochl.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e14fd98c6d66cb76694b12c05768e4f9e8c95664 upstream.
Avoid invoking update_locked_rq() when the runqueue (rq) pointer is NULL
in the SCX_CALL_OP and SCX_CALL_OP_RET macros.
Previously, calling update_locked_rq(NULL) with preemption enabled could
trigger the following warning:
BUG: using __this_cpu_write() in preemptible [00000000]
This happens because __this_cpu_write() is unsafe to use in preemptible
context.
rq is NULL when an ops invoked from an unlocked context. In such cases, we
don't need to store any rq, since the value should already be NULL
(unlocked). Ensure that update_locked_rq() is only called when rq is
non-NULL, preventing calling __this_cpu_write() on preemptible context.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Fixes: 18853ba782bef ("sched_ext: Track currently locked rq")
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v6.15
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9beb8c5e77dc10e3889ff5f967eeffba78617a88 upstream.
Commit cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check if not
frozen") modified the cgroup_freezing() logic to verify that the FROZEN
flag is not set, affecting the return value of the freezing() function,
in order to address a warning in __thaw_task.
A race condition exists that may allow tasks to escape being frozen. The
following scenario demonstrates this issue:
CPU 0 (get_signal path) CPU 1 (freezer.state reader)
try_to_freeze read freezer.state
__refrigerator freezer_read
update_if_frozen
WRITE_ONCE(current->__state, TASK_FROZEN);
...
/* Task is now marked frozen */
/* frozen(task) == true */
/* Assuming other tasks are frozen */
freezer->state |= CGROUP_FROZEN;
/* freezing(current) returns false */
/* because cgroup is frozen (not freezing) */
break out
__set_current_state(TASK_RUNNING);
/* Bug: Task resumes running when it should remain frozen */
The existing !frozen(p) check in __thaw_task makes the
WARN_ON_ONCE(freezing(p)) warning redundant. Removing this warning enables
reverting commit cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check if
not frozen") to resolve the issue.
This patch removes the warning from __thaw_task. A subsequent patch will
revert commit cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check if
not frozen") to complete the fix.
Reported-by: Zhong Jiawei<zhongjiawei1@huawei.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 36569780b0d64de283f9d6c2195fd1a43e221ee8 upstream.
The commit e6fe3f422be1 ("sched: Make multiple runqueue task counters
32-bit") changed nr_uninterruptible to an unsigned int. But the
nr_uninterruptible values for each of the CPU runqueues can grow to
large numbers, sometimes exceeding INT_MAX. This is valid, if, over
time, a large number of tasks are migrated off of one CPU after going
into an uninterruptible state. Only the sum of all nr_interruptible
values across all CPUs yields the correct result, as explained in a
comment in kernel/sched/loadavg.c.
Change the type of nr_uninterruptible back to unsigned long to prevent
overflows, and thus the miscalculation of load average.
Fixes: e6fe3f422be1 ("sched: Make multiple runqueue task counters 32-bit")
Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250709173328.606794-1-aruna.ramakrishna@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 14a67b42cb6f3ab66f41603c062c5056d32ea7dd ]
This reverts commit cff5f49d433fcd0063c8be7dd08fa5bf190c6c37.
Commit cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check if not
frozen") modified the cgroup_freezing() logic to verify that the FROZEN
flag is not set, affecting the return value of the freezing() function,
in order to address a warning in __thaw_task.
A race condition exists that may allow tasks to escape being frozen. The
following scenario demonstrates this issue:
CPU 0 (get_signal path) CPU 1 (freezer.state reader)
try_to_freeze read freezer.state
__refrigerator freezer_read
update_if_frozen
WRITE_ONCE(current->__state, TASK_FROZEN);
...
/* Task is now marked frozen */
/* frozen(task) == true */
/* Assuming other tasks are frozen */
freezer->state |= CGROUP_FROZEN;
/* freezing(current) returns false */
/* because cgroup is frozen (not freezing) */
break out
__set_current_state(TASK_RUNNING);
/* Bug: Task resumes running when it should remain frozen */
The existing !frozen(p) check in __thaw_task makes the
WARN_ON_ONCE(freezing(p)) warning redundant. Removing this warning enables
reverting the commit cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check
if not frozen") to resolve the issue.
The warning has been removed in the previous patch. This patch revert the
commit cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check if not
frozen") to complete the fix.
Fixes: cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check if not frozen")
Reported-by: Zhong Jiawei<zhongjiawei1@huawei.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f8242745871f81a3ac37f9f51853d12854fd0b58 ]
static const char fmt[] = "%p%";
bpf_trace_printk(fmt, sizeof(fmt));
The above BPF program isn't rejected and causes a kernel warning at
runtime:
Please remove unsupported %\x00 in format string
WARNING: CPU: 1 PID: 7244 at lib/vsprintf.c:2680 format_decode+0x49c/0x5d0
This happens because bpf_bprintf_prepare skips over the second %,
detected as punctuation, while processing %p. This patch fixes it by
not skipping over punctuation. %\x00 is then processed in the next
iteration and rejected.
Reported-by: syzbot+e2c932aec5c8a6e1d31c@syzkaller.appspotmail.com
Fixes: 48cac3f4a96d ("bpf: Implement formatted output helpers with bstr_printf")
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/a0e06cc479faec9e802ae51ba5d66420523251ee.1751395489.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 85a3bce695b361d85fc528e6fbb33e4c8089c806 upstream.
We have observed kernel panics when using timerlat with stack saving,
with the following dmesg output:
memcpy: detected buffer overflow: 88 byte write of buffer size 0
WARNING: CPU: 2 PID: 8153 at lib/string_helpers.c:1032 __fortify_report+0x55/0xa0
CPU: 2 UID: 0 PID: 8153 Comm: timerlatu/2 Kdump: loaded Not tainted 6.15.3-200.fc42.x86_64 #1 PREEMPT(lazy)
Call Trace:
<TASK>
? trace_buffer_lock_reserve+0x2a/0x60
__fortify_panic+0xd/0xf
__timerlat_dump_stack.cold+0xd/0xd
timerlat_dump_stack.part.0+0x47/0x80
timerlat_fd_read+0x36d/0x390
vfs_read+0xe2/0x390
? syscall_exit_to_user_mode+0x1d5/0x210
ksys_read+0x73/0xe0
do_syscall_64+0x7b/0x160
? exc_page_fault+0x7e/0x1a0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
__timerlat_dump_stack() constructs the ftrace stack entry like this:
struct stack_entry *entry;
...
memcpy(&entry->caller, fstack->calls, size);
entry->size = fstack->nr_entries;
Since commit e7186af7fb26 ("tracing: Add back FORTIFY_SOURCE logic to
kernel_stack event structure"), struct stack_entry marks its caller
field with __counted_by(size). At the time of the memcpy, entry->size
contains garbage from the ringbuffer, which under some circumstances is
zero, triggering a kernel panic by buffer overflow.
Populate the size field before the memcpy so that the out-of-bounds
check knows the correct size. This is analogous to
__ftrace_trace_stack().
Cc: stable@vger.kernel.org
Cc: John Kacur <jkacur@redhat.com>
Cc: Luis Goncalves <lgoncalv@redhat.com>
Cc: Attila Fazekas <afazekas@redhat.com>
Link: https://lore.kernel.org/20250716143601.7313-1-tglozar@redhat.com
Fixes: e7186af7fb26 ("tracing: Add back FORTIFY_SOURCE logic to kernel_stack event structure")
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b5e8acc14dcb314a9b61ff19dcd9fdd0d88f70df upstream.
When a module is loaded, it adds trace events defined by the module. It
may also need to modify the modules trace printk formats to replace enum
names with their values.
If two modules are loaded at the same time, the adding of the event to the
ftrace_events list can corrupt the walking of the list in the code that is
modifying the printk format strings and crash the kernel.
The addition of the event should take the trace_event_sem for write while
it adds the new event.
Also add a lockdep_assert_held() on that semaphore in
__trace_add_event_dirs() as it iterates the list.
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/20250718223158.799bfc0c@batman.local.home
Reported-by: Fusheng Huang(黄富生) <Fusheng.Huang@luxshare-ict.com>
Closes: https://lore.kernel.org/all/20250717105007.46ccd18f@batman.local.home/
Fixes: 110bf2b764eb6 ("tracing: add protection around module events unload")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1ed171a3afe81531b3ace96bd151a372dda3ee25 upstream.
After a recent change in clang to strengthen uninitialized warnings [1],
it points out that in one of the error paths in parse_btf_arg(), params
is used uninitialized:
kernel/trace/trace_probe.c:660:19: warning: variable 'params' is uninitialized when used here [-Wuninitialized]
660 | return PTR_ERR(params);
| ^~~~~~
Match many other NO_BTF_ENTRY error cases and return -ENOENT, clearing
up the warning.
Link: https://lore.kernel.org/all/20250715-trace_probe-fix-const-uninit-warning-v1-1-98960f91dd04@kernel.org/
Cc: stable@vger.kernel.org
Closes: https://github.com/ClangBuiltLinux/linux/issues/2110
Fixes: d157d7694460 ("tracing/probes: Support BTF field access from $retval")
Link: https://github.com/llvm/llvm-project/commit/2464313eef01c5b1edf0eccf57a32cdee01472c7 [1]
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit d4adf1c9ee7722545450608bcb095fb31512f0c6 ]
BPF_MAP_TYPE_LRU_HASH can recycle most recent elements well before the
map is full, due to percpu reservations and force shrink before
neighbor stealing. Once a CPU is unable to borrow from the global map,
it will once steal one elem from a neighbor and after that each time
flush this one element to the global list and immediately recycle it.
Batch value LOCAL_FREE_TARGET (128) will exhaust a 10K element map
with 79 CPUs. CPU 79 will observe this behavior even while its
neighbors hold 78 * 127 + 1 * 15 == 9921 free elements (99%).
CPUs need not be active concurrently. The issue can appear with
affinity migration, e.g., irqbalance. Each CPU can reserve and then
hold onto its 128 elements indefinitely.
Avoid global list exhaustion by limiting aggregate percpu caches to
half of map size, by adjusting LOCAL_FREE_TARGET based on cpu count.
This change has no effect on sufficiently large tables.
Similar to LOCAL_NR_SCANS and lru->nr_scans, introduce a map variable
lru->free_target. The extra field fits in a hole in struct bpf_lru.
The cacheline is already warm where read in the hot path. The field is
only accessed with the lru lock held.
Tested-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://lore.kernel.org/r/20250618215803.3587312-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 3ebb1b6522392f64902b4e96954e35927354aa27 upstream.
Zero is a valid value for "preempt_dynamic_mode", namely
"preempt_dynamic_none".
Fix the off-by-one in preempt_model_str(), so that "preempty_dynamic_none"
is correctly formatted as PREEMPT(none) instead of PREEMPT(undef).
Fixes: 8bdc5daaa01e ("sched: Add a generic function to return the preemption string")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20250626-preempt-str-none-v2-1-526213b70a89@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit ca3881f6fd8e9b6eb2d51e8718d07d3b8029d886 ]
The function move_module() uses the variable t to track how many memory
types it has allocated and consequently how many should be freed if an
error occurs.
The variable is initially set to 0 and is updated when a call to
module_memory_alloc() fails. However, move_module() can fail for other
reasons as well, in which case t remains set to 0 and no memory is freed.
Fix the problem by initializing t to MOD_MEM_NUM_TYPES. Additionally, make
the deallocation loop more robust by not relying on the mod_mem_type_t enum
having a signed integer as its underlying type.
Fixes: c7ee8aebf6c0 ("module: add stop-grap sanity check on module memcpy()")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Link: https://lore.kernel.org/r/20250618122730.51324-2-petr.pavlu@suse.com
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
Message-ID: <20250618122730.51324-2-petr.pavlu@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit fc975cfb36393db1db517fbbe366e550bcdcff14 ]
In our testing with 6.12 based kernel on a big.LITTLE system, we were
seeing instances of RT tasks being blocked from running on the LITTLE
cpus for multiple seconds of time, apparently by the dl_server. This
far exceeds the default configured 50ms per second runtime.
This is due to the fair dl_server runtime calculation being scaled
for frequency & capacity of the cpu.
Consider the following case under a Big.LITTLE architecture:
Assume the runtime is: 50,000,000 ns, and Frequency/capacity
scale-invariance defined as below:
Frequency scale-invariance: 100
Capacity scale-invariance: 50
First by Frequency scale-invariance,
the runtime is scaled to 50,000,000 * 100 >> 10 = 4,882,812
Then by capacity scale-invariance,
it is further scaled to 4,882,812 * 50 >> 10 = 238,418.
So it will scaled to 238,418 ns.
This smaller "accounted runtime" value is what ends up being
subtracted against the fair-server's runtime for the current period.
Thus after 50ms of real time, we've only accounted ~238us against the
fair servers runtime. This 209:1 ratio in this example means that on
the smaller cpu the fair server is allowed to continue running,
blocking RT tasks, for over 10 seconds before it exhausts its supposed
50ms of runtime. And on other hardware configurations it can be even
worse.
For the fair deadline_server, to prevent realtime tasks from being
unexpectedly delayed, we really do want to use fixed time, and not
scaled time for smaller capacity/frequency cpus. So remove the scaling
from the fair server's accounting to fix this.
Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: John Stultz <jstultz@google.com>
Signed-off-by: kuyo chang <kuyo.chang@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: John Stultz <jstultz@google.com>
Tested-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/20250702021440.2594736-1-kuyo.chang@mediatek.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit ba677dbe77af5ffe6204e0f3f547f3ba059c6302 ]
Jann reports that uprobes can be used destructively when used in the
middle of an instruction. The kernel only verifies there is a valid
instruction at the requested offset, but due to variable instruction
length cannot determine if this is an instruction as seen by the
intended execution stream.
Additionally, Mark Rutland notes that on architectures that mix data
in the text segment (like arm64), a similar things can be done if the
data word is 'mistaken' for an instruction.
As such, require CAP_SYS_ADMIN for uprobes.
Fixes: c9e0924e5c2b ("perf/core: open access to probes for CAP_PERFMON privileged process")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/CAG48ez1n4520sq0XrWYDHKiKxE_+WCfAK+qt9qkY4ZiBGmL-5g@mail.gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 009836b4fa52f92cba33618e773b1094affa8cd2 ]
On Mon, Jun 02, 2025 at 03:22:13PM +0800, Kuyo Chang wrote:
> So, the potential race scenario is:
>
> CPU0 CPU1
> // doing migrate_swap(cpu0/cpu1)
> stop_two_cpus()
> ...
> // doing _cpu_down()
> sched_cpu_deactivate()
> set_cpu_active(cpu, false);
> balance_push_set(cpu, true);
> cpu_stop_queue_two_works
> __cpu_stop_queue_work(stopper1,...);
> __cpu_stop_queue_work(stopper2,..);
> stop_cpus_in_progress -> true
> preempt_enable();
> ...
> 1st balance_push
> stop_one_cpu_nowait
> cpu_stop_queue_work
> __cpu_stop_queue_work
> list_add_tail -> 1st add push_work
> wake_up_q(&wakeq); -> "wakeq is empty.
> This implies that the stopper is at wakeq@migrate_swap."
> preempt_disable
> wake_up_q(&wakeq);
> wake_up_process // wakeup migrate/0
> try_to_wake_up
> ttwu_queue
> ttwu_queue_cond ->meet below case
> if (cpu == smp_processor_id())
> return false;
> ttwu_do_activate
> //migrate/0 wakeup done
> wake_up_process // wakeup migrate/1
> try_to_wake_up
> ttwu_queue
> ttwu_queue_cond
> ttwu_queue_wakelist
> __ttwu_queue_wakelist
> __smp_call_single_queue
> preempt_enable();
>
> 2nd balance_push
> stop_one_cpu_nowait
> cpu_stop_queue_work
> __cpu_stop_queue_work
> list_add_tail -> 2nd add push_work, so the double list add is detected
> ...
> ...
> cpu1 get ipi, do sched_ttwu_pending, wakeup migrate/1
>
So this balance_push() is part of schedule(), and schedule() is supposed
to switch to stopper task, but because of this race condition, stopper
task is stuck in WAKING state and not actually visible to be picked.
Therefore CPU1 can do another schedule() and end up doing another
balance_push() even though the last one hasn't been done yet.
This is a confluence of fail, where both wake_q and ttwu_wakelist can
cause crucial wakeups to be delayed, resulting in the malfunction of
balance_push.
Since there is only a single stopper thread to be woken, the wake_q
doesn't really add anything here, and can be removed in favour of
direct wakeups of the stopper thread.
Then add a clause to ttwu_queue_cond() to ensure the stopper threads
are never queued / delayed.
Of all 3 moving parts, the last addition was the balance_push()
machinery, so pick that as the point the bug was introduced.
Fixes: 2558aacff858 ("sched/hotplug: Ensure only per-cpu kthreads run during hotplug")
Reported-by: Kuyo Chang <kuyo.chang@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Kuyo Chang <kuyo.chang@mediatek.com>
Link: https://lkml.kernel.org/r/20250605100009.GO39944@noisy.programming.kicks-ass.net
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 7b4c5a37544ba22c6ebe72c0d4ea56c953459fa5 ]
commit 3172fb986666 ("perf/core: Fix WARN in perf_cgroup_switch()") try to
fix a concurrency problem between perf_cgroup_switch and
perf_cgroup_event_disable. But it does not to move the WARN_ON_ONCE into
lock-protected region, so the warning is still be triggered.
Fixes: 3172fb986666 ("perf/core: Fix WARN in perf_cgroup_switch()")
Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250626135403.2454105-1-luogengkun@huaweicloud.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 33b6a1f155d627f5bd80c7485c598ce45428f74f ]
Currently the call_rcu() API does not check whether a callback
pointer is NULL. If NULL is passed, rcu_core() will try to invoke
it, resulting in NULL pointer dereference and a kernel crash.
To prevent this and improve debuggability, this patch adds a check
for NULL and emits a kernel stack trace to help identify a faulty
caller.
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 8a2277a3c9e4cc5398f80821afe7ecbe9bdf2819 ]
Initialize `ops` member's pointers properly by using kzalloc() instead of
kmalloc() when allocating the simulation work context. Otherwise the
pointers contain random content leading to invalid dereferencing.
Signed-off-by: Gyeyoung Baek <gye976@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250612124827.63259-1-gye976@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c50784e99f0e7199cdb12dbddf02229b102744ef ]
Otherwise, tg->scx.weight can go out of sync while scx_cgroup is not enabled
and ops.cgroup_init() may be called with a stale weight value.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 819513666966 ("sched_ext: Add cgroup support")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 3172fb986666dfb71bf483b6d3539e1e587fa197 ]
There may be concurrency between perf_cgroup_switch and
perf_cgroup_event_disable. Consider the following scenario: after a new
perf cgroup event is created on CPU0, the new event may not trigger
a reprogramming, causing ctx->is_active to be 0. In this case, when CPU1
disables this perf event, it executes __perf_remove_from_context->
list _del_event->perf_cgroup_event_disable on CPU1, which causes a race
with perf_cgroup_switch running on CPU0.
The following describes the details of this concurrency scenario:
CPU0 CPU1
perf_cgroup_switch:
...
# cpuctx->cgrp is not NULL here
if (READ_ONCE(cpuctx->cgrp) == NULL)
return;
perf_remove_from_context:
...
raw_spin_lock_irq(&ctx->lock);
...
# ctx->is_active == 0 because reprogramm is not
# tigger, so CPU1 can do __perf_remove_from_context
# for CPU0
__perf_remove_from_context:
perf_cgroup_event_disable:
...
if (--ctx->nr_cgroups)
...
# this warning will happened because CPU1 changed
# ctx.nr_cgroups to 0.
WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
[peterz: use guard instead of goto unlock]
Fixes: db4a835601b7 ("perf/core: Set cgroup in CPU contexts for new cgroup events")
Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250604033924.3914647-3-luogengkun@huaweicloud.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 61988e36dc5457cdff7ae7927e8d9ad1419ee998 ]
While chasing down a missing perf_cgroup_event_disable() elsewhere,
Leo Yan found that both perf_put_aux_event() and
perf_remove_sibling_event() were also missing one.
Specifically, the rule is that events that switch to OFF,ERROR need to
call perf_cgroup_event_disable().
Unify the disable paths to ensure this.
Fixes: ab43762ef010 ("perf: Allow normal events to output AUX data")
Fixes: 9f0c4fa111dc ("perf/core: Add a new PERF_EV_CAP_SIBLING event capability")
Reported-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250605123343.GD35970@noisy.programming.kicks-ass.net
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 4f6fc782128355931527cefe3eb45338abd8ab39 ]
Baisheng Gao reported an ARM64 crash, which Mark decoded as being a
synchronous external abort -- most likely due to trying to access
MMIO in bad ways.
The crash further shows perf trying to do a user stack sample while in
exit_mmap()'s tlb_finish_mmu() -- i.e. while tearing down the address
space it is trying to access.
It turns out that we stop perf after we tear down the userspace mm; a
receipie for disaster, since perf likes to access userspace for
various reasons.
Flip this order by moving up where we stop perf in do_exit().
Additionally, harden PERF_SAMPLE_CALLCHAIN and PERF_SAMPLE_STACK_USER
to abort when the current task does not have an mm (exit_mm() makes
sure to set current->mm = NULL; before commencing with the actual
teardown). Such that CPU wide events don't trip on this same problem.
Fixes: c5ebcedb566e ("perf: Add ability to attach user stack dump to sample")
Reported-by: Baisheng Gao <baisheng.gao@unisoc.com>
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250605110815.GQ39944@noisy.programming.kicks-ass.net
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 8a157d8a00e815cab4432653cb50c9cedbbb4931 upstream.
The variable "head" is allocated and initialized as a list before
allocating the first "item" for the list. If the allocation of "item"
fails, it frees "head" and then jumps to the label "free_now" which will
process head and free it.
This will cause a UAF of "head", and it doesn't need to free it before
jumping to the "free_now" label as that code will free it.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250610093348.33c5643a@gandalf.local.home
Fixes: a9d0aab5eb33 ("tracing: Fix regression of filter waiting a long time on RCU synchronization")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202506070424.lCiNreTI-lkp@intel.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 261dce3d64021e7ec828a17b4975ce9182e54ceb ]
Now when isolcpus is enabled via the cmdline, wq_isolated_cpumask does
not include these isolated CPUs, even wq_unbound_cpumask has already
excluded them. It is only when we successfully configure an isolate cpuset
partition that wq_isolated_cpumask gets overwritten by
workqueue_unbound_exclude_cpumask(), including both the cmdline-specified
isolated CPUs and the isolated CPUs within the cpuset partitions.
Fix this issue by initializing wq_isolated_cpumask properly in
workqueue_init_early().
Fixes: fe28f631fa94 ("workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask")
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
sched_create_group()
commit 33796b91871ad4010c8188372dd1faf97cf0f1c0 upstream.
During task_group creation, sched_create_group() calls
scx_group_set_weight() with CGROUP_WEIGHT_DFL to initialize the sched_ext
portion. This is premature and ends up calling ops.cgroup_set_weight() with
an incorrect @cgrp before ops.cgroup_init() is called.
sched_create_group() should just initialize SCX related fields in the new
task_group. Fix it by factoring out scx_tg_init() from sched_init() and
making sched_create_group() call that function instead of
scx_group_set_weight().
v2: Retain CONFIG_EXT_GROUP_SCHED ifdef in sched_init() as removing it leads
to build failures on !CONFIG_GROUP_SCHED configs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 819513666966 ("sched_ext: Add cgroup support")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 327e28664307d49ce3fa71ba30dcc0007c270974 upstream.
When setting the funcgraph-args option when function graph tracer is net
enabled, it incorrectly enables it. Worse, it unregisters itself when it
was never registered. Then when it gets enabled again, it will register
itself a second time causing a WARNing.
~# echo 1 > /sys/kernel/tracing/options/funcgraph-args
~# head -20 /sys/kernel/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 813/26317372 #P:8
#
# _-----=> irqs-off/BH-disabled
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> migrate-disable
# |||| / delay
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
<idle>-0 [007] d..4. 358.966010: 7) 1.692 us | fetch_next_timer_interrupt(basej=4294981640, basem=357956000000, base_local=0xffff88823c3ae040, base_global=0xffff88823c3af300, tevt=0xffff888100e47cb8);
<idle>-0 [007] d..4. 358.966012: 7) | tmigr_cpu_deactivate(nextexp=357988000000) {
<idle>-0 [007] d..4. 358.966013: 7) | _raw_spin_lock(lock=0xffff88823c3b2320) {
<idle>-0 [007] d..4. 358.966014: 7) 0.981 us | preempt_count_add(val=1);
<idle>-0 [007] d..5. 358.966017: 7) 1.058 us | do_raw_spin_lock(lock=0xffff88823c3b2320);
<idle>-0 [007] d..4. 358.966019: 7) 5.824 us | }
<idle>-0 [007] d..5. 358.966021: 7) | tmigr_inactive_up(group=0xffff888100cb9000, child=0x0, data=0xffff888100e47bc0) {
<idle>-0 [007] d..5. 358.966022: 7) | tmigr_update_events(group=0xffff888100cb9000, child=0x0, data=0xffff888100e47bc0) {
Notice the "tracer: nop" at the top there. The current tracer is the "nop"
tracer, but the content is obviously the function graph tracer.
Enabling function graph tracing will cause it to register again and
trigger a warning in the accounting:
~# echo function_graph > /sys/kernel/tracing/current_tracer
-bash: echo: write error: Device or resource busy
With the dmesg of:
------------[ cut here ]------------
WARNING: CPU: 7 PID: 1095 at kernel/trace/ftrace.c:3509 ftrace_startup_subops+0xc1e/0x1000
Modules linked in: kvm_intel kvm irqbypass
CPU: 7 UID: 0 PID: 1095 Comm: bash Not tainted 6.16.0-rc2-test-00006-gea03de4105d3 #24 PREEMPT
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:ftrace_startup_subops+0xc1e/0x1000
Code: 48 b8 22 01 00 00 00 00 ad de 49 89 84 24 88 01 00 00 8b 44 24 08 89 04 24 e9 c3 f7 ff ff c7 04 24 ed ff ff ff e9 b7 f7 ff ff <0f> 0b c7 04 24 f0 ff ff ff e9 a9 f7 ff ff c7 04 24 f4 ff ff ff e9
RSP: 0018:ffff888133cff948 EFLAGS: 00010202
RAX: 0000000000000001 RBX: 1ffff1102679ff31 RCX: 0000000000000000
RDX: 1ffffffff0b27a60 RSI: ffffffff8593d2f0 RDI: ffffffff85941140
RBP: 00000000000c2041 R08: ffffffffffffffff R09: ffffed1020240221
R10: ffff88810120110f R11: ffffed1020240214 R12: ffffffff8593d2f0
R13: ffffffff8593d300 R14: ffffffff85941140 R15: ffffffff85631100
FS: 00007f7ec6f28740(0000) GS:ffff8882b5251000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7ec6f181c0 CR3: 000000012f1d0005 CR4: 0000000000172ef0
Call Trace:
<TASK>
? __pfx_ftrace_startup_subops+0x10/0x10
? find_held_lock+0x2b/0x80
? ftrace_stub_direct_tramp+0x10/0x10
? ftrace_stub_direct_tramp+0x10/0x10
? trace_preempt_on+0xd0/0x110
? __pfx_trace_graph_entry_args+0x10/0x10
register_ftrace_graph+0x4d2/0x1020
? tracing_reset_online_cpus+0x14b/0x1e0
? __pfx_register_ftrace_graph+0x10/0x10
? ring_buffer_record_enable+0x16/0x20
? tracing_reset_online_cpus+0x153/0x1e0
? __pfx_tracing_reset_online_cpus+0x10/0x10
? __pfx_trace_graph_return+0x10/0x10
graph_trace_init+0xfd/0x160
tracing_set_tracer+0x500/0xa80
? __pfx_tracing_set_tracer+0x10/0x10
? lock_release+0x181/0x2d0
? _copy_from_user+0x26/0xa0
tracing_set_trace_write+0x132/0x1e0
? __pfx_tracing_set_trace_write+0x10/0x10
? ftrace_graph_func+0xcc/0x140
? ftrace_stub_direct_tramp+0x10/0x10
? ftrace_stub_direct_tramp+0x10/0x10
? ftrace_stub_direct_tramp+0x10/0x10
vfs_write+0x1d0/0xe90
? __pfx_vfs_write+0x10/0x10
Have the setting of the funcgraph-args check if function_graph tracer is
the current tracer of the instance, and if not, do nothing, as there's
nothing to do (the option is checked when function_graph tracing starts).
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/20250618073801.057ea636@gandalf.local.home
Fixes: c7a60a733c373 ("ftrace: Have funcgraph-args take affect during tracing")
Closes: https://lore.kernel.org/all/4ab1a7bdd0174ab09c7b0d68cdbff9a4@huawei.com/
Reported-by: Changbin Du <changbin.du@huawei.com>
Tested-by: Changbin Du <changbin.du@huawei.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a9d0aab5eb33a44792a66b7af13ff50d7b3e7022 upstream.
When faultable trace events were added, a trace event may no longer use
normal RCU to synchronize but instead used synchronize_rcu_tasks_trace().
This synchronization takes a much longer time to synchronize.
The filter logic would free the filters by calling
tracepoint_synchronize_unregister() after it unhooked the filter strings
and before freeing them. With this function now calling
synchronize_rcu_tasks_trace() this increased the time to free a filter
tremendously. On a PREEMPT_RT system, it was even more noticeable.
# time trace-cmd record -p function sleep 1
[..]
real 2m29.052s
user 0m0.244s
sys 0m20.136s
As trace-cmd would clear out all the filters before recording, it could
take up to 2 minutes to do a recording of "sleep 1".
To find out where the issues was:
~# trace-cmd sqlhist -e -n sched_stack select start.prev_state as state, end.next_comm as comm, TIMESTAMP_DELTA_USECS as delta, start.STACKTRACE as stack from sched_switch as start join sched_switch as end on start.prev_pid = end.next_pid
Which will produce the following commands (and -e will also execute them):
echo 's:sched_stack s64 state; char comm[16]; u64 delta; unsigned long stack[];' >> /sys/kernel/tracing/dynamic_events
echo 'hist:keys=prev_pid:__arg_18057_2=prev_state,__arg_18057_4=common_timestamp.usecs,__arg_18057_7=common_stacktrace' >> /sys/kernel/tracing/events/sched/sched_switch/trigger
echo 'hist:keys=next_pid:__state_18057_1=$__arg_18057_2,__comm_18057_3=next_comm,__delta_18057_5=common_timestamp.usecs-$__arg_18057_4,__stack_18057_6=$__arg_18057_7:onmatch(sched.sched_switch).trace(sched_stack,$__state_18057_1,$__comm_18057_3,$__delta_18057_5,$__stack_18057_6)' >> /sys/kernel/tracing/events/sched/sched_switch/trigger
The above creates a synthetic event that creates a stack trace when a task
schedules out and records it with the time it scheduled back in. Basically
the time a task is off the CPU. It also records the state of the task when
it left the CPU (running, blocked, sleeping, etc). It also saves the comm
of the task as "comm" (needed for the next command).
~# echo 'hist:keys=state,stack.stacktrace:vals=delta:sort=state,delta if comm == "trace-cmd" && state & 3' > /sys/kernel/tracing/events/synthetic/sched_stack/trigger
The above creates a histogram with buckets per state, per stack, and the
value of the total time it was off the CPU for that stack trace. It filters
on tasks with "comm == trace-cmd" and only the sleeping and blocked states
(1 - sleeping, 2 - blocked).
~# trace-cmd record -p function sleep 1
~# cat /sys/kernel/tracing/events/synthetic/sched_stack/hist | tail -18
{ state: 2, stack.stacktrace __schedule+0x1545/0x3700
schedule+0xe2/0x390
schedule_timeout+0x175/0x200
wait_for_completion_state+0x294/0x440
__wait_rcu_gp+0x247/0x4f0
synchronize_rcu_tasks_generic+0x151/0x230
apply_subsystem_event_filter+0xa2b/0x1300
subsystem_filter_write+0x67/0xc0
vfs_write+0x1e2/0xeb0
ksys_write+0xff/0x1d0
do_syscall_64+0x7b/0x420
entry_SYSCALL_64_after_hwframe+0x76/0x7e
} hitcount: 237 delta: 99756288 <<--------------- Delta is 99 seconds!
Totals:
Hits: 525
Entries: 21
Dropped: 0
This shows that this particular trace waited for 99 seconds on
synchronize_rcu_tasks() in apply_subsystem_event_filter().
In fact, there's a lot of places in the filter code that spends a lot of
time waiting for synchronize_rcu_tasks_trace() in order to free the
filters.
Add helper functions that will use call_rcu*() variants to asynchronously
free the filters. This brings the timings back to normal:
# time trace-cmd record -p function sleep 1
[..]
real 0m14.681s
user 0m0.335s
sys 0m28.616s
And the histogram also shows this:
~# cat /sys/kernel/tracing/events/synthetic/sched_stack/hist | tail -21
{ state: 2, stack.stacktrace __schedule+0x1545/0x3700
schedule+0xe2/0x390
schedule_timeout+0x175/0x200
wait_for_completion_state+0x294/0x440
__wait_rcu_gp+0x247/0x4f0
synchronize_rcu_normal+0x3db/0x5c0
tracing_reset_online_cpus+0x8f/0x1e0
tracing_open+0x335/0x440
do_dentry_open+0x4c6/0x17a0
vfs_open+0x82/0x360
path_openat+0x1a36/0x2990
do_filp_open+0x1c5/0x420
do_sys_openat2+0xed/0x180
__x64_sys_openat+0x108/0x1d0
do_syscall_64+0x7b/0x420
} hitcount: 2 delta: 77044
Totals:
Hits: 55
Entries: 28
Dropped: 0
Where the total waiting time of synchronize_rcu_tasks_trace() is 77
milliseconds.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Andreas Ziegler <ziegler.andreas@siemens.com>
Cc: Felix MOESSBAUER <felix.moessbauer@siemens.com>
Link: https://lore.kernel.org/20250606201936.1e3d09a9@batman.local.home
Reported-by: "Flot, Julien" <julien.flot@siemens.com>
Tested-by: Julien Flot <julien.flot@siemens.com>
Fixes: a363d27cdbc2 ("tracing: Allow system call tracepoints to handle page faults")
Closes: https://lore.kernel.org/all/240017f656631c7dd4017aa93d91f41f653788ea.camel@siemens.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 53ebef53a657d7957d35dc2b953db64f1bb28065 ]
The calculation of the index used to access the mask field in 'struct
bpf_raw_tp_null_args' is done with 'int' type, which could overflow when
the tracepoint being attached has more than 8 arguments.
While none of the tracepoints mentioned in raw_tp_null_args[] currently
have more than 8 arguments, there do exist tracepoints that had more
than 8 arguments (e.g. iocost_iocg_forgive_debt), so use the correct
type for calculation and avoid Smatch static checker warning.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20250418074946.35569-1-shung-hsi.yu@suse.com
Closes: https://lore.kernel.org/r/843a3b94-d53d-42db-93d4-be10a4090146@stanley.mountain/
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 00d872dd541cdf22230510201a1baf58f0147db9 ]
The trace_adjust_address() will take a given address and examine the
persistent ring buffer to see if the address matches a module that is
listed there. If it does not, it will just adjust the value to the core
kernel delta. But if the address was for something that was not part of
the core kernel text or data it should not be adjusted.
Check the result of the adjustment and only return the adjustment if it
lands in the current kernel text or data. If not, return the original
address.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250506102300.0ba2f9e0@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 94bde253d3ae5d8a01cb958663b12daef1d06574 ]
There is currently some confusion in the s390x JIT regarding whether
orig_call can be NULL and what that means. Originally the NULL value
was used to distinguish the struct_ops case, but this was superseded by
BPF_TRAMP_F_INDIRECT (see commit 0c970ed2f87c ("s390/bpf: Fix indirect
trampoline generation").
The remaining reason to have this check is that NULL can actually be
passed to the arch_bpf_trampoline_size() call - but not to the
respective arch_prepare_bpf_trampoline()! call - by
bpf_struct_ops_prepare_trampoline().
Remove this asymmetry by passing stub_func to both functions, so that
JITs may rely on orig_call never being NULL.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20250512221911.61314-2-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d4965578267e2e81f67c86e2608481e77e9c8569 ]
bpf_map_lookup_percpu_elem() helper is also available for sleepable bpf
program. When BPF JIT is disabled or under 32-bit host,
bpf_map_lookup_percpu_elem() will not be inlined. Using it in a
sleepable bpf program will trigger the warning in
bpf_map_lookup_percpu_elem(), because the bpf program only holds
rcu_read_lock_trace lock. Therefore, add the missed check.
Reported-by: syzbot+dce5aae19ae4d6399986@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/000000000000176a130617420310@google.com/
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250526062534.1105938-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 70e1683ca3a6474360af1d3a020a9a98c8492cc0 ]
Fixed a race condition in incrementing wq->stats[PWQ_STAT_COMPLETED] by
moving the operation under pool->lock.
Reported-by: syzbot+01affb1491750534256d@syzkaller.appspotmail.com
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 08d7becc1a6b8c936e25d827becabfe3bff72a36 ]
Right now, if the clocksource watchdog detects a clocksource skew, it might
perform a per CPU check, for example in the TSC case on x86. In other
words: supposing TSC is detected as unstable by the clocksource watchdog
running at CPU1, as part of marking TSC unstable the kernel will also run a
check of TSC readings on some CPUs to be sure it is synced between them
all.
But that check happens only on some CPUs, not all of them; this choice is
based on the parameter "verify_n_cpus" and in some random cpumask
calculation. So, the watchdog runs such per CPU checks on up to
"verify_n_cpus" random CPUs among all online CPUs, with the risk of
repeating CPUs (that aren't double checked) in the cpumask random
calculation.
But if "verify_n_cpus" > num_online_cpus(), it should skip the random
calculation and just go ahead and check the clocksource sync between
all online CPUs, without the risk of skipping some CPUs due to
duplicity in the random cpumask calculation.
Tests in a 4 CPU laptop with TSC skew detected led to some cases of the per
CPU verification skipping some CPU even with verify_n_cpus=8, due to the
duplicity on random cpumask generation. Skipping the randomization when the
number of online CPUs is smaller than verify_n_cpus, solves that.
Suggested-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/all/20250323173857.372390-1-gpiccoli@igalia.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit f914b52c379c12288b7623bb814d0508dbe7481d upstream.
The following issue happens with a buggy module:
BUG: unable to handle page fault for address: ffffffffc05d0218
PGD 1bd66f067 P4D 1bd66f067 PUD 1bd671067 PMD 101808067 PTE 0
Oops: Oops: 0000 [#1] SMP KASAN PTI
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
RIP: 0010:sized_strscpy+0x81/0x2f0
RSP: 0018:ffff88812d76fa08 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffffffc0601010 RCX: dffffc0000000000
RDX: 0000000000000038 RSI: dffffc0000000000 RDI: ffff88812608da2d
RBP: 8080808080808080 R08: ffff88812608da2d R09: ffff88812608da68
R10: ffff88812608d82d R11: ffff88812608d810 R12: 0000000000000038
R13: ffff88812608da2d R14: ffffffffc05d0218 R15: fefefefefefefeff
FS: 00007fef552de740(0000) GS:ffff8884251c7000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffc05d0218 CR3: 00000001146f0000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
ftrace_mod_get_kallsym+0x1ac/0x590
update_iter_mod+0x239/0x5b0
s_next+0x5b/0xa0
seq_read_iter+0x8c9/0x1070
seq_read+0x249/0x3b0
proc_reg_read+0x1b0/0x280
vfs_read+0x17f/0x920
ksys_read+0xf3/0x1c0
do_syscall_64+0x5f/0x2e0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
The above issue may happen as follows:
(1) Add kprobe tracepoint;
(2) insmod test.ko;
(3) Module triggers ftrace disabled;
(4) rmmod test.ko;
(5) cat /proc/kallsyms; --> Will trigger UAF as test.ko already removed;
ftrace_mod_get_kallsym()
...
strscpy(module_name, mod_map->mod->name, MODULE_NAME_LEN);
...
The problem is when a module triggers an issue with ftrace and
sets ftrace_disable. The ftrace_disable is set when an anomaly is
discovered and to prevent any more damage, ftrace stops all text
modification. The issue that happened was that the ftrace_disable stops
more than just the text modification.
When a module is loaded, its init functions can also be traced. Because
kallsyms deletes the init functions after a module has loaded, ftrace
saves them when the module is loaded and function tracing is enabled. This
allows the output of the function trace to show the init function names
instead of just their raw memory addresses.
When a module is removed, ftrace_release_mod() is called, and if
ftrace_disable is set, it just returns without doing anything more. The
problem here is that it leaves the mod_list still around and if kallsyms
is called, it will call into this code and access the module memory that
has already been freed as it will return:
strscpy(module_name, mod_map->mod->name, MODULE_NAME_LEN);
Where the "mod" no longer exists and triggers a UAF bug.
Link: https://lore.kernel.org/all/20250523135452.626d8dcd@gandalf.local.home/
Cc: stable@vger.kernel.org
Fixes: aba4b5c22cba ("ftrace: Save module init functions kallsyms symbols for tracing")
Link: https://lore.kernel.org/20250529111955.2349189-2-yebin@huaweicloud.com
Signed-off-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7123dbbef88cfd9f09e8a7899b0911834600cfa3 upstream.
When updating `watchdog_thresh`, there is a race condition between writing
the new `watchdog_thresh` value and stopping the old watchdog timer. If
the old timer triggers during this window, it may falsely detect a
softlockup due to the old interval and the new `watchdog_thresh` value
being used. The problem can be described as follow:
# We asuume previous watchdog_thresh is 60, so the watchdog timer is
# coming every 24s.
echo 10 > /proc/sys/kernel/watchdog_thresh (User space)
|
+------>+ update watchdog_thresh (We are in kernel now)
|
| # using old interval and new `watchdog_thresh`
+------>+ watchdog hrtimer (irq context: detect softlockup)
|
|
+-------+
|
|
+ softlockup_stop_all
To fix this problem, introduce a shadow variable for `watchdog_thresh`.
The update to the actual `watchdog_thresh` is delayed until after the old
timer is stopped, preventing false positives.
The following testcase may help to understand this problem.
---------------------------------------------
echo RT_RUNTIME_SHARE > /sys/kernel/debug/sched/features
echo -1 > /proc/sys/kernel/sched_rt_runtime_us
echo 0 > /sys/kernel/debug/sched/fair_server/cpu3/runtime
echo 60 > /proc/sys/kernel/watchdog_thresh
taskset -c 3 chrt -r 99 /bin/bash -c "while true;do true; done" &
echo 10 > /proc/sys/kernel/watchdog_thresh &
---------------------------------------------
The test case above first removes the throttling restrictions for
real-time tasks. It then sets watchdog_thresh to 60 and executes a
real-time task ,a simple while(1) loop, on cpu3. Consequently, the final
command gets blocked because the presence of this real-time thread
prevents kworker:3 from being selected by the scheduler. This eventually
triggers a softlockup detection on cpu3 due to watchdog_timer_fn operating
with inconsistent variable - using both the old interval and the updated
watchdog_thresh simultaneously.
[nysal@linux.ibm.com: fix the SOFTLOCKUP_DETECTOR=n case]
Link: https://lkml.kernel.org/r/20250502111120.282690-1-nysal@linux.ibm.com
Link: https://lkml.kernel.org/r/20250421035021.3507649-1-luogengkun@huaweicloud.com
Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
Signed-off-by: Nysal Jan K.A. <nysal@linux.ibm.com>
Cc: Doug Anderson <dianders@chromium.org>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: "Nysal Jan K.A." <nysal@linux.ibm.com>
Cc: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 37fb58a7273726e59f9429c89ade5116083a213d upstream.
An issue was found:
# cd /sys/fs/cgroup/freezer/
# mkdir test
# echo FROZEN > test/freezer.state
# cat test/freezer.state
FROZEN
# sleep 1000 &
[1] 863
# echo 863 > test/cgroup.procs
# cat test/freezer.state
FREEZING
When tasks are migrated to a frozen cgroup, the freezer fails to
immediately freeze the tasks, causing the cgroup to remain in the
"FREEZING".
The freeze_task() function is called before clearing the CGROUP_FROZEN
flag. This causes the freezing() check to incorrectly return false,
preventing __freeze_task() from being invoked for the migrated task.
To fix this issue, clear the CGROUP_FROZEN state before calling
freeze_task().
Fixes: f5d39b020809 ("freezer,sched: Rewrite core freezer logic")
Cc: stable@vger.kernel.org # v6.1+
Reported-by: Zhong Jiawei <zhongjiawei1@huawei.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c70fc32f44431bb30f9025ce753ba8be25acbba3 upstream.
Mike reports that commit 6d71a9c61604 ("sched/fair: Fix EEVDF entity
placement bug causing scheduling lag") relies on commit 4423af84b297
("sched/fair: optimize the PLACE_LAG when se->vlag is zero") to not
trip a WARN in place_entity().
What happens is that the lag of the very last entity is 0 per
definition -- the average of one element matches the value of that
element. Therefore place_entity() will match the condition skipping
the lag adjustment:
if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {
Without the 'se->vlag' condition -- it will attempt to adjust the zero
lag even though we're inserting into an empty tree.
Notably, we should have failed the 'cfs_rq->nr_queued' condition, but
don't because they didn't get updated.
Additionally, move update_load_add() after placement() as is
consistent with other place_entity() users -- this change is
non-functional, place_entity() does not use cfs_rq->load.
Fixes: 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/c216eb4ef0e0e0029c600aefc69d56681cee5581.camel@gmx.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 690e47d1403e90b7f2366f03b52ed3304194c793 upstream.
Overview
========
When a CPU chooses to call push_rt_task and picks a task to push to
another CPU's runqueue then it will call find_lock_lowest_rq method
which would take a double lock on both CPUs' runqueues. If one of the
locks aren't readily available, it may lead to dropping the current
runqueue lock and reacquiring both the locks at once. During this window
it is possible that the task is already migrated and is running on some
other CPU. These cases are already handled. However, if the task is
migrated and has already been executed and another CPU is now trying to
wake it up (ttwu) such that it is queued again on the runqeue
(on_rq is 1) and also if the task was run by the same CPU, then the
current checks will pass even though the task was migrated out and is no
longer in the pushable tasks list.
Crashes
=======
This bug resulted in quite a few flavors of crashes triggering kernel
panics with various crash signatures such as assert failures, page
faults, null pointer dereferences, and queue corruption errors all
coming from scheduler itself.
Some of the crashes:
-> kernel BUG at kernel/sched/rt.c:1616! BUG_ON(idx >= MAX_RT_PRIO)
Call Trace:
? __die_body+0x1a/0x60
? die+0x2a/0x50
? do_trap+0x85/0x100
? pick_next_task_rt+0x6e/0x1d0
? do_error_trap+0x64/0xa0
? pick_next_task_rt+0x6e/0x1d0
? exc_invalid_op+0x4c/0x60
? pick_next_task_rt+0x6e/0x1d0
? asm_exc_invalid_op+0x12/0x20
? pick_next_task_rt+0x6e/0x1d0
__schedule+0x5cb/0x790
? update_ts_time_stats+0x55/0x70
schedule_idle+0x1e/0x40
do_idle+0x15e/0x200
cpu_startup_entry+0x19/0x20
start_secondary+0x117/0x160
secondary_startup_64_no_verify+0xb0/0xbb
-> BUG: kernel NULL pointer dereference, address: 00000000000000c0
Call Trace:
? __die_body+0x1a/0x60
? no_context+0x183/0x350
? __warn+0x8a/0xe0
? exc_page_fault+0x3d6/0x520
? asm_exc_page_fault+0x1e/0x30
? pick_next_task_rt+0xb5/0x1d0
? pick_next_task_rt+0x8c/0x1d0
__schedule+0x583/0x7e0
? update_ts_time_stats+0x55/0x70
schedule_idle+0x1e/0x40
do_idle+0x15e/0x200
cpu_startup_entry+0x19/0x20
start_secondary+0x117/0x160
secondary_startup_64_no_verify+0xb0/0xbb
-> BUG: unable to handle page fault for address: ffff9464daea5900
kernel BUG at kernel/sched/rt.c:1861! BUG_ON(rq->cpu != task_cpu(p))
-> kernel BUG at kernel/sched/rt.c:1055! BUG_ON(!rq->nr_running)
Call Trace:
? __die_body+0x1a/0x60
? die+0x2a/0x50
? do_trap+0x85/0x100
? dequeue_top_rt_rq+0xa2/0xb0
? do_error_trap+0x64/0xa0
? dequeue_top_rt_rq+0xa2/0xb0
? exc_invalid_op+0x4c/0x60
? dequeue_top_rt_rq+0xa2/0xb0
? asm_exc_invalid_op+0x12/0x20
? dequeue_top_rt_rq+0xa2/0xb0
dequeue_rt_entity+0x1f/0x70
dequeue_task_rt+0x2d/0x70
__schedule+0x1a8/0x7e0
? blk_finish_plug+0x25/0x40
schedule+0x3c/0xb0
futex_wait_queue_me+0xb6/0x120
futex_wait+0xd9/0x240
do_futex+0x344/0xa90
? get_mm_exe_file+0x30/0x60
? audit_exe_compare+0x58/0x70
? audit_filter_rules.constprop.26+0x65e/0x1220
__x64_sys_futex+0x148/0x1f0
do_syscall_64+0x30/0x80
entry_SYSCALL_64_after_hwframe+0x62/0xc7
-> BUG: unable to handle page fault for address: ffff8cf3608bc2c0
Call Trace:
? __die_body+0x1a/0x60
? no_context+0x183/0x350
? spurious_kernel_fault+0x171/0x1c0
? exc_page_fault+0x3b6/0x520
? plist_check_list+0x15/0x40
? plist_check_list+0x2e/0x40
? asm_exc_page_fault+0x1e/0x30
? _cond_resched+0x15/0x30
? futex_wait_queue_me+0xc8/0x120
? futex_wait+0xd9/0x240
? try_to_wake_up+0x1b8/0x490
? futex_wake+0x78/0x160
? do_futex+0xcd/0xa90
? plist_check_list+0x15/0x40
? plist_check_list+0x2e/0x40
? plist_del+0x6a/0xd0
? plist_check_list+0x15/0x40
? plist_check_list+0x2e/0x40
? dequeue_pushable_task+0x20/0x70
? __schedule+0x382/0x7e0
? asm_sysvec_reschedule_ipi+0xa/0x20
? schedule+0x3c/0xb0
? exit_to_user_mode_prepare+0x9e/0x150
? irqentry_exit_to_user_mode+0x5/0x30
? asm_sysvec_reschedule_ipi+0x12/0x20
Above are some of the common examples of the crashes that were observed
due to this issue.
Details
=======
Let's look at the following scenario to understand this race.
1) CPU A enters push_rt_task
a) CPU A has chosen next_task = task p.
b) CPU A calls find_lock_lowest_rq(Task p, CPU Z’s rq).
c) CPU A identifies CPU X as a destination CPU (X < Z).
d) CPU A enters double_lock_balance(CPU Z’s rq, CPU X’s rq).
e) Since X is lower than Z, CPU A unlocks CPU Z’s rq. Someone else has
locked CPU X’s rq, and thus, CPU A must wait.
2) At CPU Z
a) Previous task has completed execution and thus, CPU Z enters
schedule, locks its own rq after CPU A releases it.
b) CPU Z dequeues previous task and begins executing task p.
c) CPU Z unlocks its rq.
d) Task p yields the CPU (ex. by doing IO or waiting to acquire a
lock) which triggers the schedule function on CPU Z.
e) CPU Z enters schedule again, locks its own rq, and dequeues task p.
f) As part of dequeue, it sets p.on_rq = 0 and unlocks its rq.
3) At CPU B
a) CPU B enters try_to_wake_up with input task p.
b) Since CPU Z dequeued task p, p.on_rq = 0, and CPU B updates
B.state = WAKING.
c) CPU B via select_task_rq determines CPU Y as the target CPU.
4) The race
a) CPU A acquires CPU X’s lock and relocks CPU Z.
b) CPU A reads task p.cpu = Z and incorrectly concludes task p is
still on CPU Z.
c) CPU A failed to notice task p had been dequeued from CPU Z while
CPU A was waiting for locks in double_lock_balance. If CPU A knew
that task p had been dequeued, it would return NULL forcing
push_rt_task to give up the task p's migration.
d) CPU B updates task p.cpu = Y and calls ttwu_queue.
e) CPU B locks Ys rq. CPU B enqueues task p onto Y and sets task
p.on_rq = 1.
f) CPU B unlocks CPU Y, triggering memory synchronization.
g) CPU A reads task p.on_rq = 1, cementing its assumption that task p
has not migrated.
h) CPU A decides to migrate p to CPU X.
This leads to A dequeuing p from Y's queue and various crashes down the
line.
Solution
========
The solution here is fairly simple. After obtaining the lock (at 4a),
the check is enhanced to make sure that the task is still at the head of
the pushable tasks list. If not, then it is anyway not suitable for
being pushed out.
Testing
=======
The fix is tested on a cluster of 3 nodes, where the panics due to this
are hit every couple of days. A fix similar to this was deployed on such
cluster and was stable for more than 30 days.
Co-developed-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Co-developed-by: Gauri Patwardhan <gauri.patwardhan@nutanix.com>
Signed-off-by: Gauri Patwardhan <gauri.patwardhan@nutanix.com>
Co-developed-by: Rahul Chunduru <rahul.chunduru@nutanix.com>
Signed-off-by: Rahul Chunduru <rahul.chunduru@nutanix.com>
Signed-off-by: Harshit Agarwal <harshit@nutanix.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: "Steven Rostedt (Google)" <rostedt@goodmis.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Tested-by: Will Ton <william.ton@nutanix.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250225180553.167995-1-harshit@nutanix.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 044d2aee6c575231ed4a24fb3d119bad0937488b upstream.
Failures inside codetag_load_module() are currently ignored. As a result
an error there would not cause a module load failure and freeing of the
associated resources. Correct this behavior by propagating the error code
to the caller and handling possible errors. With this change, error to
allocate percpu counters, which happens at this stage, will not be ignored
and will cause a module load failure and freeing of resources. With this
change we also do not need to disable memory allocation profiling when
this error happens, instead we fail to load the module.
Link: https://lkml.kernel.org/r/20250521160602.1940771-1-surenb@google.com
Fixes: 10075262888b ("alloc_tag: allocate percpu counters for module tags dynamically")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: Casey Chen <cachen@purestorage.com>
Closes: https://lore.kernel.org/all/20250520231620.15259-1-cachen@purestorage.com/
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: David Wang <00107082@163.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c98cc9797b7009308fff73d41bc1d08642dab77a upstream.
Running a modified trace-cmd record --nosplice where it does a mmap of the
ring buffer when '--nosplice' is set, caused the following lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
6.15.0-rc7-test-00002-gfb7d03d8a82f #551 Not tainted
------------------------------------------------------
trace-cmd/1113 is trying to acquire lock:
ffff888100062888 (&buffer->mutex){+.+.}-{4:4}, at: ring_buffer_map+0x11c/0xe70
but task is already holding lock:
ffff888100a5f9f8 (&cpu_buffer->mapping_lock){+.+.}-{4:4}, at: ring_buffer_map+0xcf/0xe70
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #5 (&cpu_buffer->mapping_lock){+.+.}-{4:4}:
__mutex_lock+0x192/0x18c0
ring_buffer_map+0xcf/0xe70
tracing_buffers_mmap+0x1c4/0x3b0
__mmap_region+0xd8d/0x1f70
do_mmap+0x9d7/0x1010
vm_mmap_pgoff+0x20b/0x390
ksys_mmap_pgoff+0x2e9/0x440
do_syscall_64+0x79/0x1c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #4 (&mm->mmap_lock){++++}-{4:4}:
__might_fault+0xa5/0x110
_copy_to_user+0x22/0x80
_perf_ioctl+0x61b/0x1b70
perf_ioctl+0x62/0x90
__x64_sys_ioctl+0x134/0x190
do_syscall_64+0x79/0x1c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #3 (&cpuctx_mutex){+.+.}-{4:4}:
__mutex_lock+0x192/0x18c0
perf_event_init_cpu+0x325/0x7c0
perf_event_init+0x52a/0x5b0
start_kernel+0x263/0x3e0
x86_64_start_reservations+0x24/0x30
x86_64_start_kernel+0x95/0xa0
common_startup_64+0x13e/0x141
-> #2 (pmus_lock){+.+.}-{4:4}:
__mutex_lock+0x192/0x18c0
perf_event_init_cpu+0xb7/0x7c0
cpuhp_invoke_callback+0x2c0/0x1030
__cpuhp_invoke_callback_range+0xbf/0x1f0
_cpu_up+0x2e7/0x690
cpu_up+0x117/0x170
cpuhp_bringup_mask+0xd5/0x120
bringup_nonboot_cpus+0x13d/0x170
smp_init+0x2b/0xf0
kernel_init_freeable+0x441/0x6d0
kernel_init+0x1e/0x160
ret_from_fork+0x34/0x70
ret_from_fork_asm+0x1a/0x30
-> #1 (cpu_hotplug_lock){++++}-{0:0}:
cpus_read_lock+0x2a/0xd0
ring_buffer_resize+0x610/0x14e0
__tracing_resize_ring_buffer.part.0+0x42/0x120
tracing_set_tracer+0x7bd/0xa80
tracing_set_trace_write+0x132/0x1e0
vfs_write+0x21c/0xe80
ksys_write+0xf9/0x1c0
do_syscall_64+0x79/0x1c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #0 (&buffer->mutex){+.+.}-{4:4}:
__lock_acquire+0x1405/0x2210
lock_acquire+0x174/0x310
__mutex_lock+0x192/0x18c0
ring_buffer_map+0x11c/0xe70
tracing_buffers_mmap+0x1c4/0x3b0
__mmap_region+0xd8d/0x1f70
do_mmap+0x9d7/0x1010
vm_mmap_pgoff+0x20b/0x390
ksys_mmap_pgoff+0x2e9/0x440
do_syscall_64+0x79/0x1c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
other info that might help us debug this:
Chain exists of:
&buffer->mutex --> &mm->mmap_lock --> &cpu_buffer->mapping_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&cpu_buffer->mapping_lock);
lock(&mm->mmap_lock);
lock(&cpu_buffer->mapping_lock);
lock(&buffer->mutex);
*** DEADLOCK ***
2 locks held by trace-cmd/1113:
#0: ffff888106b847e0 (&mm->mmap_lock){++++}-{4:4}, at: vm_mmap_pgoff+0x192/0x390
#1: ffff888100a5f9f8 (&cpu_buffer->mapping_lock){+.+.}-{4:4}, at: ring_buffer_map+0xcf/0xe70
stack backtrace:
CPU: 5 UID: 0 PID: 1113 Comm: trace-cmd Not tainted 6.15.0-rc7-test-00002-gfb7d03d8a82f #551 PREEMPT
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x6e/0xa0
print_circular_bug.cold+0x178/0x1be
check_noncircular+0x146/0x160
__lock_acquire+0x1405/0x2210
lock_acquire+0x174/0x310
? ring_buffer_map+0x11c/0xe70
? ring_buffer_map+0x11c/0xe70
? __mutex_lock+0x169/0x18c0
__mutex_lock+0x192/0x18c0
? ring_buffer_map+0x11c/0xe70
? ring_buffer_map+0x11c/0xe70
? function_trace_call+0x296/0x370
? __pfx___mutex_lock+0x10/0x10
? __pfx_function_trace_call+0x10/0x10
? __pfx___mutex_lock+0x10/0x10
? _raw_spin_unlock+0x2d/0x50
? ring_buffer_map+0x11c/0xe70
? ring_buffer_map+0x11c/0xe70
? __mutex_lock+0x5/0x18c0
ring_buffer_map+0x11c/0xe70
? do_raw_spin_lock+0x12d/0x270
? find_held_lock+0x2b/0x80
? _raw_spin_unlock+0x2d/0x50
? rcu_is_watching+0x15/0xb0
? _raw_spin_unlock+0x2d/0x50
? trace_preempt_on+0xd0/0x110
tracing_buffers_mmap+0x1c4/0x3b0
__mmap_region+0xd8d/0x1f70
? ring_buffer_lock_reserve+0x99/0xff0
? __pfx___mmap_region+0x10/0x10
? ring_buffer_lock_reserve+0x99/0xff0
? __pfx_ring_buffer_lock_reserve+0x10/0x10
? __pfx_ring_buffer_lock_reserve+0x10/0x10
? bpf_lsm_mmap_addr+0x4/0x10
? security_mmap_addr+0x46/0xd0
? lock_is_held_type+0xd9/0x130
do_mmap+0x9d7/0x1010
? 0xffffffffc0370095
? __pfx_do_mmap+0x10/0x10
vm_mmap_pgoff+0x20b/0x390
? __pfx_vm_mmap_pgoff+0x10/0x10
? 0xffffffffc0370095
ksys_mmap_pgoff+0x2e9/0x440
do_syscall_64+0x79/0x1c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fb0963a7de2
Code: 00 00 00 0f 1f 44 00 00 41 f7 c1 ff 0f 00 00 75 27 55 89 cd 53 48 89 fb 48 85 ff 74 3b 41 89 ea 48 89 df b8 09 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 76 5b 5d c3 0f 1f 00 48 8b 05 e1 9f 0d 00 64
RSP: 002b:00007ffdcc8fb878 EFLAGS: 00000246 ORIG_RAX: 0000000000000009
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb0963a7de2
RDX: 0000000000000001 RSI: 0000000000001000 RDI: 0000000000000000
RBP: 0000000000000001 R08: 0000000000000006 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffdcc8fbe68 R14: 00007fb096628000 R15: 00005633e01a5c90
</TASK>
The issue is that cpus_read_lock() is taken within buffer->mutex. The
memory mapped pages are taken with the mmap_lock held. The buffer->mutex
is taken within the cpu_buffer->mapping_lock. There's quite a chain with
all these locks, where the deadlock can be fixed by moving the
cpus_read_lock() outside the taking of the buffer->mutex.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250527105820.0f45d045@gandalf.local.home
Fixes: 117c39200d9d7 ("ring-buffer: Introducing ring-buffer mapping functions")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 40ee2afafc1d9fe3aa44a6fbe440d78a5c96a72e upstream.
Enlarge the critical section in ring_buffer_subbuf_order_set() to
ensure that error handling takes place with per-buffer mutex held,
thus preventing list corruption and other concurrency-related issues.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Link: https://lore.kernel.org/20250606112242.1510605-1-dmantipov@yandex.ru
Reported-by: syzbot+05d673e83ec640f0ced9@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=05d673e83ec640f0ced9
Fixes: f9b94daa542a8 ("ring-buffer: Set new size of the ring buffer sub page")
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4fc78a7c9ca994e1da5d3940704d4e8f0ea8c5e4 upstream.
When reading a memory mapped buffer the reader page is just swapped out
with the last page written in the write buffer. If the reader page is the
same as the commit buffer (the buffer that is currently being written to)
it was assumed that it should never have missed events. If it does, it
triggers a WARN_ON_ONCE().
But there just happens to be one scenario where this can legitimately
happen. That is on a commit_overrun. A commit overrun is when an interrupt
preempts an event being written to the buffer and then the interrupt adds
so many new events that it fills and wraps the buffer back to the commit.
Any new events would then be dropped and be reported as "missed_events".
In this case, the next page to read is the commit buffer and after the
swap of the reader page, the reader page will be the commit buffer, but
this time there will be missed events and this triggers the following
warning:
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1127 at kernel/trace/ring_buffer.c:7357 ring_buffer_map_get_reader+0x49a/0x780
Modules linked in: kvm_intel kvm irqbypass
CPU: 2 UID: 0 PID: 1127 Comm: trace-cmd Not tainted 6.15.0-rc7-test-00004-g478bc2824b45-dirty #564 PREEMPT
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:ring_buffer_map_get_reader+0x49a/0x780
Code: 00 00 00 48 89 fe 48 c1 ee 03 80 3c 2e 00 0f 85 ec 01 00 00 4d 3b a6 a8 00 00 00 0f 85 8a fd ff ff 48 85 c0 0f 84 55 fe ff ff <0f> 0b e9 4e fe ff ff be 08 00 00 00 4c 89 54 24 58 48 89 54 24 50
RSP: 0018:ffff888121787dc0 EFLAGS: 00010002
RAX: 00000000000006a2 RBX: ffff888100062800 RCX: ffffffff8190cb49
RDX: ffff888126934c00 RSI: 1ffff11020200a15 RDI: ffff8881010050a8
RBP: dffffc0000000000 R08: 0000000000000000 R09: ffffed1024d26982
R10: ffff888126934c17 R11: ffff8881010050a8 R12: ffff888126934c00
R13: ffff8881010050b8 R14: ffff888101005000 R15: ffff888126930008
FS: 00007f95c8cd7540(0000) GS:ffff8882b576e000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f95c8de4dc0 CR3: 0000000128452002 CR4: 0000000000172ef0
Call Trace:
<TASK>
? __pfx_ring_buffer_map_get_reader+0x10/0x10
tracing_buffers_ioctl+0x283/0x370
__x64_sys_ioctl+0x134/0x190
do_syscall_64+0x79/0x1c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f95c8de48db
Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
RSP: 002b:00007ffe037ba110 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffe037bb2b0 RCX: 00007f95c8de48db
RDX: 0000000000000000 RSI: 0000000000005220 RDI: 0000000000000006
RBP: 00007ffe037ba180 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffe037bb6f8 R14: 00007f95c9065000 R15: 00005575c7492c90
</TASK>
irq event stamp: 5080
hardirqs last enabled at (5079): [<ffffffff83e0adb0>] _raw_spin_unlock_irqrestore+0x50/0x70
hardirqs last disabled at (5080): [<ffffffff83e0aa83>] _raw_spin_lock_irqsave+0x63/0x70
softirqs last enabled at (4182): [<ffffffff81516122>] handle_softirqs+0x552/0x710
softirqs last disabled at (4159): [<ffffffff815163f7>] __irq_exit_rcu+0x107/0x210
---[ end trace 0000000000000000 ]---
The above was triggered by running on a kernel with both lockdep and KASAN
as well as kmemleak enabled and executing the following command:
# perf record -o perf-test.dat -a -- trace-cmd record --nosplice -e all -p function hackbench 50
With perf interjecting a lot of interrupts and trace-cmd enabling all
events as well as function tracing, with lockdep, KASAN and kmemleak
enabled, it could cause an interrupt preempting an event being written to
add enough events to wrap the buffer. trace-cmd was modified to have
--nosplice use mmap instead of reading the buffer.
The way to differentiate this case from the normal case of there only
being one page written to where the swap of the reader page received that
one page (which is the commit page), check if the tail page is on the
reader page. The difference between the commit page and the tail page is
that the tail page is where new writes go to, and the commit page holds
the first write that hasn't been committed yet. In the case of an
interrupt preempting the write of an event and filling the buffer, it
would move the tail page but not the commit page.
Have the warning only trigger if the tail page is also on the reader page,
and also print out the number of events dropped by a commit overrun as
that can not yet be safely added to the page so that the reader can see
there were events dropped.
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250528121555.2066527e@gandalf.local.home
Fixes: fe832be05a8ee ("ring-buffer: Have mmapped ring buffer keep track of missed events")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
posix_cpu_timer_del()
commit f90fff1e152dedf52b932240ebbd670d83330eca upstream.
If an exiting non-autoreaping task has already passed exit_notify() and
calls handle_posix_cpu_timers() from IRQ, it can be reaped by its parent
or debugger right after unlock_task_sighand().
If a concurrent posix_cpu_timer_del() runs at that moment, it won't be
able to detect timer->it.cpu.firing != 0: cpu_timer_task_rcu() and/or
lock_task_sighand() will fail.
Add the tsk->exit_state check into run_posix_cpu_timers() to fix this.
This fix is not needed if CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y, because
exit_task_work() is called before exit_notify(). But the check still
makes sense, task_work_add(&tsk->posix_cputimers_work.work) will fail
anyway in this case.
Cc: stable@vger.kernel.org
Reported-by: Benoît Sevens <bsevens@google.com>
Fixes: 0bdd2ed4138e ("sched: run_posix_cpu_timers: Don't check ->exit_state, use lock_task_sighand()")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 7ed9138a72829d2035ecbd8dbd35b1bc3c137c40 ]
Ravi reported that the bpf_perf_link_attach() usage of
perf_event_set_bpf_prog() is not serialized by ctx->mutex, unlike the
PERF_EVENT_IOC_SET_BPF case.
Reported-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lkml.kernel.org/r/20250307193305.486326750@infradead.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 9960be72a54cf0e4d47abdd4cacd1278835a3bb4 ]
In the idle CPU selection logic, attempting cross-node searches adds
unnecessary complexity when CONFIG_NUMA is disabled.
Since there's no meaningful concept of nodes in this case, simplify the
logic by restricting the idle CPU search to the current node only.
Fixes: 48849271e6611 ("sched_ext: idle: Per-node idle cpumasks")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e2d2115e56c4a02377189bfc3a9a7933552a7b0f ]
Yi Lai reported an issue ([1]) where the following warning appears
in kernel dmesg:
[ 60.643604] verifier backtracking bug
[ 60.643635] WARNING: CPU: 10 PID: 2315 at kernel/bpf/verifier.c:4302 __mark_chain_precision+0x3a6c/0x3e10
[ 60.648428] Modules linked in: bpf_testmod(OE)
[ 60.650471] CPU: 10 UID: 0 PID: 2315 Comm: test_progs Tainted: G OE 6.15.0-rc4-gef11287f8289-dirty #327 PREEMPT(full)
[ 60.654385] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 60.656682] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 60.660475] RIP: 0010:__mark_chain_precision+0x3a6c/0x3e10
[ 60.662814] Code: 5a 30 84 89 ea e8 c4 d9 01 00 80 3d 3e 7d d8 04 00 0f 85 60 fa ff ff c6 05 31 7d d8 04
01 48 c7 c7 00 58 30 84 e8 c4 06 a5 ff <0f> 0b e9 46 fa ff ff 48 ...
[ 60.668720] RSP: 0018:ffff888116cc7298 EFLAGS: 00010246
[ 60.671075] RAX: 54d70e82dfd31900 RBX: ffff888115b65e20 RCX: 0000000000000000
[ 60.673659] RDX: 0000000000000001 RSI: 0000000000000004 RDI: 00000000ffffffff
[ 60.676241] RBP: 0000000000000400 R08: ffff8881f6f23bd3 R09: 1ffff1103ede477a
[ 60.678787] R10: dffffc0000000000 R11: ffffed103ede477b R12: ffff888115b60ae8
[ 60.681420] R13: 1ffff11022b6cbc4 R14: 00000000fffffff2 R15: 0000000000000001
[ 60.684030] FS: 00007fc2aedd80c0(0000) GS:ffff88826fa8a000(0000) knlGS:0000000000000000
[ 60.686837] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 60.689027] CR2: 000056325369e000 CR3: 000000011088b002 CR4: 0000000000370ef0
[ 60.691623] Call Trace:
[ 60.692821] <TASK>
[ 60.693960] ? __pfx_verbose+0x10/0x10
[ 60.695656] ? __pfx_disasm_kfunc_name+0x10/0x10
[ 60.697495] check_cond_jmp_op+0x16f7/0x39b0
[ 60.699237] do_check+0x58fa/0xab10
...
Further analysis shows the warning is at line 4302 as below:
4294 /* static subprog call instruction, which
4295 * means that we are exiting current subprog,
4296 * so only r1-r5 could be still requested as
4297 * precise, r0 and r6-r10 or any stack slot in
4298 * the current frame should be zero by now
4299 */
4300 if (bt_reg_mask(bt) & ~BPF_REGMASK_ARGS) {
4301 verbose(env, "BUG regs %x\n", bt_reg_mask(bt));
4302 WARN_ONCE(1, "verifier backtracking bug");
4303 return -EFAULT;
4304 }
With the below test (also in the next patch):
__used __naked static void __bpf_jmp_r10(void)
{
asm volatile (
"r2 = 2314885393468386424 ll;"
"goto +0;"
"if r2 <= r10 goto +3;"
"if r1 >= -1835016 goto +0;"
"if r2 <= 8 goto +0;"
"if r3 <= 0 goto +0;"
"exit;"
::: __clobber_all);
}
SEC("?raw_tp")
__naked void bpf_jmp_r10(void)
{
asm volatile (
"r3 = 0 ll;"
"call __bpf_jmp_r10;"
"r0 = 0;"
"exit;"
::: __clobber_all);
}
The following is the verifier failure log:
0: (18) r3 = 0x0 ; R3_w=0
2: (85) call pc+2
caller:
R10=fp0
callee:
frame1: R1=ctx() R3_w=0 R10=fp0
5: frame1: R1=ctx() R3_w=0 R10=fp0
; asm volatile (" \ @ verifier_precision.c:184
5: (18) r2 = 0x20202000256c6c78 ; frame1: R2_w=0x20202000256c6c78
7: (05) goto pc+0
8: (bd) if r2 <= r10 goto pc+3 ; frame1: R2_w=0x20202000256c6c78 R10=fp0
9: (35) if r1 >= 0xffe3fff8 goto pc+0 ; frame1: R1=ctx()
10: (b5) if r2 <= 0x8 goto pc+0
mark_precise: frame1: last_idx 10 first_idx 0 subseq_idx -1
mark_precise: frame1: regs=r2 stack= before 9: (35) if r1 >= 0xffe3fff8 goto pc+0
mark_precise: frame1: regs=r2 stack= before 8: (bd) if r2 <= r10 goto pc+3
mark_precise: frame1: regs=r2,r10 stack= before 7: (05) goto pc+0
mark_precise: frame1: regs=r2,r10 stack= before 5: (18) r2 = 0x20202000256c6c78
mark_precise: frame1: regs=r10 stack= before 2: (85) call pc+2
BUG regs 400
The main failure reason is due to r10 in precision backtracking bookkeeping.
Actually r10 is always precise and there is no need to add it for the precision
backtracking bookkeeping.
One way to fix the issue is to prevent bt_set_reg() if any src/dst reg is
r10. Andrii suggested to go with push_insn_history() approach to avoid
explicitly checking r10 in backtrack_insn().
This patch added push_insn_history() support for cond_jmp like 'rX <op> rY'
operations. In check_cond_jmp_op(), if any of rX or rY is a stack pointer,
push_insn_history() will record such information, and later backtrack_insn()
will do bt_set_reg() properly for those register(s).
[1] https://lore.kernel.org/bpf/Z%2F8q3xzpU59CIYQE@ly-workstation/
Reported by: Yi Lai <yi1.lai@linux.intel.com>
Fixes: 407958a0e980 ("bpf: encapsulate precision backtracking bookkeeping")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250524041335.4046126-1-yonghong.song@linux.dev
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 86bc9c742426a16b52a10ef61f5b721aecca2344 ]
syzkaller reported an issue:
WARNING: CPU: 3 PID: 217 at kernel/bpf/core.c:2357 __bpf_prog_ret0_warn+0xa/0x20 kernel/bpf/core.c:2357
Modules linked in:
CPU: 3 UID: 0 PID: 217 Comm: kworker/u32:6 Not tainted 6.15.0-rc4-syzkaller-00040-g8bac8898fe39
RIP: 0010:__bpf_prog_ret0_warn+0xa/0x20 kernel/bpf/core.c:2357
Call Trace:
<TASK>
bpf_dispatcher_nop_func include/linux/bpf.h:1316 [inline]
__bpf_prog_run include/linux/filter.h:718 [inline]
bpf_prog_run include/linux/filter.h:725 [inline]
cls_bpf_classify+0x74a/0x1110 net/sched/cls_bpf.c:105
...
When creating bpf program, 'fp->jit_requested' depends on bpf_jit_enable.
This issue is triggered because of CONFIG_BPF_JIT_ALWAYS_ON is not set
and bpf_jit_enable is set to 1, causing the arch to attempt JIT the prog,
but jit failed due to FAULT_INJECTION. As a result, incorrectly
treats the program as valid, when the program runs it calls
`__bpf_prog_ret0_warn` and triggers the WARN_ON_ONCE(1).
Reported-by: syzbot+0903f6d7f285e41cdf10@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/6816e34e.a70a0220.254cdc.002c.GAE@google.com
Fixes: fa9dd599b4da ("bpf: get rid of pure_initcall dependency to enable jits")
Signed-off-by: KaFai Wan <mannkafai@gmail.com>
Link: https://lore.kernel.org/r/20250526133358.2594176-1-mannkafai@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
multi-uprobe attach logic"
[ Upstream commit 4e2e6841ff761cc15a54e8bebcf35d7325ec78a2 ]
This reverts commit 4a8f635a6054.
Althought get_pid_task() internally already calls rcu_read_lock() and
rcu_read_unlock(), the find_vpid() was not.
The documentation for find_vpid() clearly states:
"Must be called with the tasklist_lock or rcu_read_lock() held."
Add proper rcu_read_lock/unlock() to protect the find_vpid().
Fixes: 4a8f635a6054 ("bpf: remove unnecessary rcu_read_{lock,unlock}() in multi-uprobe attach logic")
Reported-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Di Shen <di.shen@unisoc.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20250520054943.5002-1-xuewen.yan@unisoc.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|