path: root/kernel
2025-07-08  module: Fix memory deallocation on error path in move_module()  (Petr Pavlu)
The function move_module() uses the variable t to track how many memory types it has allocated and consequently how many should be freed if an error occurs. The variable is initially set to 0 and is updated when a call to module_memory_alloc() fails. However, move_module() can fail for other reasons as well, in which case t remains set to 0 and no memory is freed. Fix the problem by initializing t to MOD_MEM_NUM_TYPES. Additionally, make the deallocation loop more robust by not relying on the mod_mem_type_t enum having a signed integer as its underlying type. Fixes: c7ee8aebf6c0 ("module: add stop-grap sanity check on module memcpy()") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Link: https://lore.kernel.org/r/20250618122730.51324-2-petr.pavlu@suse.com Signed-off-by: Daniel Gomez <da.gomez@samsung.com> Message-ID: <20250618122730.51324-2-petr.pavlu@suse.com>
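A minimal sketch of the corrected error handling, per the description above (simplified; the exact identifiers and surrounding code in kernel/module/main.c may differ):

	int t = MOD_MEM_NUM_TYPES;	/* assume every type was allocated */
	int ret = 0, i;

	for_each_mod_mem_type(type) {
		ret = module_memory_alloc(mod, type);
		if (ret) {
			t = type;	/* only types before t were allocated */
			goto out_err;
		}
	}
	/* ... later failure paths jump to out_err with t still equal to
	 * MOD_MEM_NUM_TYPES, so everything allocated above is freed ... */
out_err:
	/* walk down with a plain signed int rather than the enum, whose
	 * underlying type may be unsigned */
	for (i = t - 1; i >= 0; i--)
		module_memory_free(mod, i);
	return ret;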
2025-07-08  rcu/nocb: Fix possible invalid rdp->nocb_cb_kthread pointer access  (Zqiang)
During the CPU-online preparation stage, the CPU's rdp->nocb_cb_kthread is created if it does not already exist. There is a situation where creation of the rdp's rcuop kthread fails: the CPU's rdp is then de-offloaded and its rdp->nocb_cb_kthread pointer is never assigned, while the rdp's ->nocb_gp_rdp and ->rdp_gp->nocb_gp_kthread remain valid. A subsequent re-offload operation on this offline CPU then passes the conditional check, and kthread_unpark() accesses the invalid rdp->nocb_cb_kthread pointer. This commit therefore uses rdp->nocb_gp_kthread instead of rdp_gp->nocb_gp_kthread for the safety check. Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-07-08  rcu/exp: Warn on QS requested on dying CPU  (Frederic Weisbecker)
It is not possible to send an IPI to a dying CPU that has passed the CPUHP_TEARDOWN_CPU stage. Remaining unhandled IPIs are handled later at the CPUHP_AP_SMPCFD_DYING stage by stop machine. This is the last opportunity for the RCU exp handler to request an expedited quiescent state. And the upcoming final context switch between stop machine and idle must have reported the requested quiescent state. Therefore, it should not be possible to observe a pending requested expedited quiescent state when RCU finally stops watching the outgoing CPU. Once IPIs aren't possible anymore, the QS for the target CPU will be reported on its behalf by the RCU exp kworker. Provide an assertion to verify those expectations. Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-07-08  rcu/exp: Remove needless CPU up quiescent state report  (Frederic Weisbecker)
A CPU coming online checks for an ongoing grace period and reports a quiescent state accordingly if needed. This special treatment, which shortcuts the expedited IPI, originated as an optimization in the following commit: 338b0f760e84 ("rcu: Better hotplug handling for synchronize_sched_expedited()") The point is to avoid an IPI while waiting for a CPU to become online or failing to become offline. However this is pointless and even error prone for several reasons:
* If the CPU has been seen offline in the first round scanning offline and idle CPUs, no IPI is even tried and the quiescent state is reported on behalf of the CPU.
* This means that if the IPI fails, the CPU just became offline. So it is unlikely to become online right away, unless the CPU hotplug operation failed and rolled back, which is a rare event that can wait a jiffy for a new IPI to be issued.
* Even then, the "optimization" for a failing CPU hotplug-down applies only to !PREEMPT_RCU.
* This forcibly reports a quiescent state even if ->cpu_no_qs.b.exp is not set. As a result it can race with remote QS reports on the same rdp. Fortunately it happens to be OK, but an accident is waiting to happen.
For all those reasons, remove this optimization, which doesn't look worth keeping around. Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-07-08  rcu/exp: Remove confusing needless full barrier on task unblock  (Frederic Weisbecker)
A full memory barrier in the RCU-PREEMPT task unblock path is advertised to order the context switch (or rather the accesses prior to rcu_read_unlock()) with the expedited grace period fastpath. However the grace period cannot complete without rcu_report_exp_rnp() being called with the rnp node locked. This reports the quiescent state in a fully ordered fashion against the updater's accesses thanks to:
1) The READ-SIDE smp_mb__after_unlock_lock() barrier across nodes locking while propagating QS up to the root.
2) The UPDATE-SIDE smp_mb__after_unlock_lock() barrier while holding the root rnp to wait/check for the GP completion.
3) The (perhaps redundant given steps 1) and 2)) smp_mb() in rcu_seq_end() before the grace period completes.
This makes the explicit barrier in this place superfluous. Therefore remove it, as it is confusing. Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-07-08  fold fs_struct->{lock,seq} into a seqlock  (Al Viro)
The combination of spinlock_t lock and seqcount_spinlock_t seq in struct fs_struct is an open-coded seqlock_t (see linux/seqlock_types.h). Combine and switch to equivalent seqlock_t primitives. AFAICS, that does end up with the same sequence of underlying operations in all cases. While we are at it, get_fs_pwd() is open-coded verbatim in get_path_from_fd(); rather than applying conversion to it, replace with the call of get_fs_pwd() there. Not worth splitting the commit for that, IMO... A bit of historical background - conversion of seqlock_t to use of seqcount_spinlock_t happened several months after the same had been done to struct fs_struct; switching fs_struct to seqlock_t could've been done immediately after that, but it looks like nobody had gotten around to that until now. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/20250702053437.GC1880847@ZenIV Acked-by: Ahmed S. Darwish <darwi@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
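In effect, the conversion looks like this (a sketch; the surrounding fields are elided):

	/* before: an open-coded seqlock_t */
	struct fs_struct {
		spinlock_t		lock;
		seqcount_spinlock_t	seq;
		/* ... users, umask, root, pwd ... */
	};

	/* after: the combined primitive from linux/seqlock_types.h */
	struct fs_struct {
		seqlock_t		seq;
		/* ... users, umask, root, pwd ... */
	};

Writers switch from taking the spinlock and bumping the seqcount by hand to write_seqlock()/write_sequnlock(), while readers keep the usual read_seqbegin()/read_seqretry() retry loop.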
2025-07-07  bpf: Clean code with bpf_copy_to_user()  (Tao Chen)
No logic change; use bpf_copy_to_user() to clean up the code. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250703163700.677628-1-chen.dylane@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-07  bpf: Fix aux usage after do_check_insn()  (Luis Gerhorst)
We must terminate the speculative analysis if the just-analyzed insn had nospec_result set. Using cur_aux() here is wrong because insn_idx might have been incremented by do_check_insn(). Therefore, introduce and use an insn_aux variable. Also change cur_aux(env)->nospec in case do_check_insn() ever manages to increment insn_idx but still fails. Change the warning to check the insn class (which prevents it from triggering for ldimm64, for which nospec_result would not be problematic) and use verifier_bug_if(). In line with Eduard's suggestion, do not introduce prev_aux() because that requires one to understand that after the do_check_insn() call, what was current became previous. This would at least require a comment. Fixes: d6f1c85f2253 ("bpf: Fall back to nospec for Spectre v1") Reported-by: Paul Chaignon <paul.chaignon@gmail.com> Reported-by: Eduard Zingerman <eddyz87@gmail.com> Reported-by: syzbot+dc27c5fb8388e38d2d37@syzkaller.appspotmail.com Link: https://lore.kernel.org/bpf/685b3c1b.050a0220.2303ee.0010.GAE@google.com/ Link: https://lore.kernel.org/bpf/4266fd5de04092aa4971cbef14f1b4b96961f432.camel@gmail.com/ Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250705190908.1756862-2-luis.gerhorst@fau.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
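The shape of the fix, per the description (heavily simplified; the real verifier code and call signatures differ):

	struct bpf_insn_aux_data *insn_aux = cur_aux(env);	/* saved first */

	err = do_check_insn(env, &do_print_state);
	/* do_check_insn() may have advanced env->insn_idx, so cur_aux(env)
	 * no longer refers to the instruction just analyzed */
	if (state->speculative && insn_aux->nospec_result)
		goto process_bpf_exit;	/* terminate the speculative path */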
2025-07-07  bpf: Fix improper int-to-ptr cast in dump_stack_cb  (Kumar Kartikeya Dwivedi)
On 32-bit platforms, we'll try to convert a u64 directly to a pointer type which is 32-bit, which causes the compiler to complain about cast from an integer of a different size to a pointer type. Cast to long before casting to the pointer type to match the pointer width. Reported-by: kernelci.org bot <bot@kernelci.org> Reported-by: Randy Dunlap <rdunlap@infradead.org> Fixes: d7c431cafcb4 ("bpf: Add dump_stack() analogue to print to BPF stderr") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20250705053035.3020320-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
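The pattern in question, as a generic illustration (variable names are hypothetical):

	u64 cookie = get_entry();	/* 64-bit value from the stack walk */

	/* (void *)cookie warns on 32-bit targets: "cast to pointer from
	 * integer of different size". Casting through long first matches
	 * the native pointer width on both 32-bit and 64-bit. */
	struct bpf_prog *prog = (struct bpf_prog *)(long)cookie;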
2025-07-07  bpf: Fix bounds for bpf_prog_get_file_line linfo loop  (Kumar Kartikeya Dwivedi)
We may overrun the bounds because linfo and jited_linfo are already advanced to prog->aux->linfo_idx, hence we must only iterate the remaining elements until we reach prog->aux->nr_linfo. Adjust the nr_linfo calculation to fix this. Reported in [0]. [0]: https://lore.kernel.org/bpf/f3527af3b0620ce36e299e97e7532d2555018de2.camel@gmail.com Reported-by: Eduard Zingerman <eddyz87@gmail.com> Fixes: 0e521efaf363 ("bpf: Add function to extract program source info") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250705053035.3020320-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
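A sketch of the corrected bound (using the field names from the description; the surrounding search loop is elided):

	/* linfo/jited_linfo were advanced to linfo_idx up front ... */
	linfo = &prog->aux->linfo[prog->aux->linfo_idx];
	/* ... so only the remaining records may be scanned */
	nr_linfo = prog->aux->nr_linfo - prog->aux->linfo_idx;

	for (i = 0; i < nr_linfo; i++) {
		/* match the program counter against jited_linfo[i] */
	}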
2025-07-07  bpf: support for void/primitive __arg_untrusted global func params  (Eduard Zingerman)
Allow specifying __arg_untrusted for void */char */int */long * parameters. Treat such parameters as PTR_TO_MEM|MEM_RDONLY|PTR_UNTRUSTED of size zero. Intended usage is as follows:

	int memcmp(char *a __arg_untrusted, char *b __arg_untrusted, size_t n)
	{
		bpf_for(i, 0, n) {
			if (a[i] - b[i])	/* load at any offset is allowed */
				return a[i] - b[i];
		}
		return 0;
	}

Allocate register id for ARG_PTR_TO_MEM parameters only when PTR_MAYBE_NULL is set. Register id for PTR_TO_MEM is used only to propagate non-null status after conditionals. Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250704230354.1323244-8-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-07  bpf: attribute __arg_untrusted for global function parameters  (Eduard Zingerman)
Add support for PTR_TO_BTF_ID | PTR_UNTRUSTED global function parameters. Any value is allowed to be passed to such parameters, as these are read-only and probe-read instructions protect against invalid memory access. Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250704230354.1323244-5-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-07  bpf: rdonly_untrusted_mem for btf id walk pointer leafs  (Eduard Zingerman)
When processing a load from a PTR_TO_BTF_ID, the verifier calculates the type of the loaded structure field based on the load offset. For example, given the following types:

	struct foo {
		struct foo *a;
		int *b;
	} *p;

The verifier would calculate the type of `p->a` as a pointer to `struct foo`. However, the type of `p->b` is currently calculated as a SCALAR_VALUE. This commit updates the logic for processing PTR_TO_BTF_ID to instead calculate the type of p->b as PTR_TO_MEM|MEM_RDONLY|PTR_UNTRUSTED. This change allows further dereferencing of such pointers (using probe memory instructions). Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250704230354.1323244-3-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-07  bpf: make mark_btf_ld_reg return error for unexpected reg types  (Eduard Zingerman)
Non-functional change: mark_btf_ld_reg() expects the 'reg_type' parameter to be either SCALAR_VALUE or PTR_TO_BTF_ID. The next commit expands this set, so update this function to fail if an unexpected type is passed. Also update callers to propagate the error. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250704230354.1323244-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-07  refscale: Check that nreaders and loops multiplication doesn't overflow  (Artem Sadovnikov)
The nreaders and loops variables are exposed as module parameters, which, in certain combinations, can lead to multiplication overflow. Besides, the loops parameter is defined as long, while throughout the code it is used as an int, which can cause truncation on 64-bit kernels and zeroes appearing where they shouldn't. Since the code uses the result of the multiplication as an int anyway, it only makes sense to turn loops into an int. A multiplication overflow check is also added, since the product of two very large numbers is possible. Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: 653ed64b01dc ("refperf: Add a test to measure performance of read-side synchronization") Signed-off-by: Artem Sadovnikov <a.sadovnikov@ispras.ru> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
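A sketch of such a check using check_mul_overflow() from <linux/overflow.h> (the exact form in the patch may differ):

	int total;

	/* loops is now an int, matching how the code actually uses it */
	if (check_mul_overflow(nreaders, loops, &total)) {
		pr_err("refscale: nreaders * loops overflows int\n");
		return -EINVAL;
	}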
2025-07-07  rcu/nocb: Dump gp state even if rdp gp itself is not offloaded  (Frederic Weisbecker)
When a stall is detected, the state of each NOCB CPU is dumped along with the state of each NOCB group. The latter part however is incidentally ignored if the NOCB group leader happens not to be offloaded itself. Fix this to make sure the related precious information isn't lost in a stall report. Reported-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-07-06  Merge tag 'sched_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull scheduler fixes from Borislav Petkov:
 - Fix the calculation of the deadline server task's runtime as this mishap was preventing realtime tasks from running
 - Avoid a race condition during migrate-swapping two tasks
 - Fix the string reported for the "none" dynamic preemption option
* tag 'sched_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/deadline: Fix dl_server runtime calculation formula
  sched/core: Fix migrate_swap() vs. hotplug
  sched: Fix preemption string of preempt_dynamic_none
2025-07-06  Merge tag 'perf_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull perf fixes from Borislav Petkov:
 - Revert uprobes to using CAP_SYS_ADMIN again as currently they can destructively modify kernel code from an unprivileged process
 - Move a warning to where it belongs
* tag 'perf_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Revert to requiring CAP_SYS_ADMIN for uprobes
  perf/core: Fix the WARN_ON_ONCE is out of lock protected region
2025-07-06  smp: Wait only if work was enqueued  (Rik van Riel)
Whenever work is enqueued for a remote CPU, smp_call_function_many_cond() may need to wait for that work to be completed. However, if no work is enqueued for a remote CPU because the condition func() evaluated to false for all CPUs, there is no need to wait. Set run_remote only if work was enqueued on remote CPUs. Document the difference between "work enqueued" and "CPU needs to be woken up". Suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Link: https://lore.kernel.org/all/20250703203019.11331ac3@fangorn
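In broad strokes, the idea looks like this (smp_call_function_many_cond() reduced to the relevant logic):

	bool run_remote = false;

	for_each_cpu(cpu, mask) {
		if (cond_func && !cond_func(cpu, info))
			continue;	/* nothing enqueued for this CPU */

		run_remote = true;	/* work enqueued; the CPU may also
					 * need an IPI to be woken up */
		/* ... queue the csd, send an IPI unless the CPU polls ... */
	}

	/* only wait if at least one remote CPU actually got work */
	if (run_remote && wait) {
		/* ... wait for each queued csd to be released ... */
	}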
2025-07-04  Merge branch 'pm-sleep'  (Rafael J. Wysocki)
Merge fixes related to system sleep for 6.16-rc5:
 - Fix typo in the ABI documentation (Sumanth Gavini).
 - Allow swap to be used a bit longer during system suspend and hibernation to avoid suspend failures under memory pressure (Mario Limonciello).
* pm-sleep:
  PM: sleep: docs: Replace "diasble" with "disable"
  PM: Restrict swap use to later in the suspend sequence
2025-07-04  lib/crypto: sha256: Make library API use strongly-typed contexts  (Eric Biggers)
Currently the SHA-224 and SHA-256 library functions can be mixed arbitrarily, even in ways that are incorrect, for example using sha224_init() and sha256_final(). This is because they operate on the same structure, sha256_state. Introduce stronger typing, as I did for SHA-384 and SHA-512. Also as I did for SHA-384 and SHA-512, use the names *_ctx instead of *_state. The *_ctx names have the following small benefits:
 - They're shorter.
 - They avoid an ambiguity with the compression function state.
 - They're consistent with the well-known OpenSSL API.
 - Users usually name the variable 'sctx' anyway, which suggests that *_ctx would be the more natural name for the actual struct.
Therefore: update the SHA-224 and SHA-256 APIs, implementation, and calling code accordingly. In the new structs, also strongly-type the compression function state. Acked-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250630160645.3198-7-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>
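Typical usage after the change (a sketch following the series' naming conventions; the exact helper names are assumed):

	struct sha256_ctx ctx;	/* distinct type from struct sha224_ctx */
	u8 digest[SHA256_DIGEST_SIZE];

	sha256_init(&ctx);
	sha256_update(&ctx, data, data_len);
	sha256_final(&ctx, digest);

	/* sha224_final(&ctx, digest) now fails to compile, since it
	 * takes a struct sha224_ctx *, catching mixed usage statically */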
2025-07-04  watchdog/perf: Provide function for adjusting the event period  (Yicong Yang)
Architectures using perf events for hard lockup detection need to convert the watchdog_thresh to the event's period. Some architectures, for example arm64, perform this conversion using the CPU's maximum frequency, which is acquired via cpufreq. However, by the time the lockup detector is initialized, the cpufreq driver may not be initialized yet, so the watchdog is launched with an inaccurate period. Provide a function hardlockup_detector_perf_adjust_period() to allow adjusting the event period. Architectures can then update the period with a more accurate value once cpufreq is initialized. Reviewed-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Link: https://lore.kernel.org/r/20250701110214.27242-2-yangyicong@huawei.com Signed-off-by: Will Deacon <will@kernel.org>
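A sketch of how an architecture might use it once cpufreq is up (the function signature and the cycles = frequency * threshold calculation are assumptions based on the description):

	/* called from the arch's cpufreq notifier once the real maximum
	 * frequency is known */
	static void retune_hardlockup_period(u64 max_freq_hz)
	{
		/* event period in CPU cycles over watchdog_thresh seconds */
		u64 period = max_freq_hz * watchdog_thresh;

		hardlockup_detector_perf_adjust_period(period);
	}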
2025-07-04  sched/deadline: Fix dl_server runtime calculation formula  (kuyo chang)
In our testing with a 6.12 based kernel on a big.LITTLE system, we were seeing instances of RT tasks being blocked from running on the LITTLE cpus for multiple seconds of time, apparently by the dl_server. This far exceeds the default configured 50ms per second runtime. This is due to the fair dl_server runtime calculation being scaled for frequency & capacity of the cpu. Consider the following case under a big.LITTLE architecture. Assume the runtime is 50,000,000 ns, and the frequency/capacity scale-invariance is defined as below:

	Frequency scale-invariance: 100
	Capacity scale-invariance: 50

First, by frequency scale-invariance, the runtime is scaled to 50,000,000 * 100 >> 10 = 4,882,812. Then, by capacity scale-invariance, it is further scaled to 4,882,812 * 50 >> 10 = 238,418. So it is scaled down to 238,418 ns. This smaller "accounted runtime" value is what ends up being subtracted against the fair server's runtime for the current period. Thus after 50ms of real time, we've only accounted ~238us against the fair server's runtime. The 209:1 ratio in this example means that on the smaller cpu the fair server is allowed to continue running, blocking RT tasks, for over 10 seconds before it exhausts its supposed 50ms of runtime. And on other hardware configurations it can be even worse. For the fair deadline server, to prevent realtime tasks from being unexpectedly delayed, we really do want to use fixed time, not scaled time, for smaller capacity/frequency cpus. So remove the scaling from the fair server's accounting to fix this. Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server") Suggested-by: Peter Zijlstra <peterz@infradead.org> Suggested-by: John Stultz <jstultz@google.com> Signed-off-by: kuyo chang <kuyo.chang@mediatek.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: John Stultz <jstultz@google.com> Tested-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/r/20250702021440.2594736-1-kuyo.chang@mediatek.com
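The resulting accounting looks roughly like this (a simplified sketch of the runtime update path, not the literal patch):

	if (dl_server(dl_se))
		/* fair server: account fixed wall-clock time, so the
		 * 50ms-per-second budget means 50ms of real time */
		scaled_delta_exec = delta_exec;
	else
		/* regular deadline tasks keep frequency/capacity scaling */
		scaled_delta_exec = dl_scaled_delta_exec(rq, dl_se, delta_exec);

	dl_se->runtime -= scaled_delta_exec;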
2025-07-04  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Paolo Abeni)
Cross-merge networking fixes after downstream PR (net-6.16-rc5). No conflicts. No adjacent changes. Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03  bpf: Avoid putting struct bpf_scc_callchain variables on the stack  (Yonghong Song)
Add a 'struct bpf_scc_callchain callchain_buf' field in bpf_verifier_env. This way, the previous bpf_scc_callchain local variables can be replaced by taking the address of env->callchain_buf. This reduces stack usage and fixes the following error: kernel/bpf/verifier.c:19921:12: error: stack frame size (1368) exceeds limit (1280) in 'do_check' [-Werror,-Wframe-larger-than] Reported-by: Arnd Bergmann <arnd@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250703141117.1485108-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-03  bpf: Reduce stack frame size by using env->insn_buf for bpf insns  (Yonghong Song)
Arnd Bergmann reported an issue ([1]) where clang compiler (less than llvm18) may trigger an error where the stack frame size exceeds the limit. I can reproduce the error like below: kernel/bpf/verifier.c:24491:5: error: stack frame size (2552) exceeds limit (1280) in 'bpf_check' [-Werror,-Wframe-larger-than] kernel/bpf/verifier.c:19921:12: error: stack frame size (1368) exceeds limit (1280) in 'do_check' [-Werror,-Wframe-larger-than] Use env->insn_buf for bpf insns instead of putting these insns on the stack. This can resolve the above 'bpf_check' error. The 'do_check' error will be resolved in the next patch. [1] https://lore.kernel.org/bpf/20250620113846.3950478-1-arnd@kernel.org/ Reported-by: Arnd Bergmann <arnd@kernel.org> Tested-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250703141111.1484521-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-03  bpf: Simplify assignment to struct bpf_insn pointer in do_misc_fixups()  (Yonghong Song)
In verifier.c, the following code pattern (in two places)

	struct bpf_insn *patch = &insn_buf[0];

can be simplified to

	struct bpf_insn *patch = insn_buf;

which is easier to understand. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250703141106.1483216-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-03  bpf: Avoid warning on unexpected map for tail call  (Paul Chaignon)
Before handling the tail call in record_func_key(), we check that the map is of the expected type and log a verifier error if it isn't. Such an error however doesn't indicate anything wrong with the verifier. The check for map<>func compatibility is done after record_func_key(), by check_map_func_compatibility(). Therefore, this patch logs the error as a typical reject instead of a verifier error. Fixes: d2e4c1e6c294 ("bpf: Constant map key tracking for prog array pokes") Fixes: 0df1a55afa83 ("bpf: Warn on internal verifier errors") Reported-by: syzbot+efb099d5833bca355e51@syzkaller.appspotmail.com Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/1f395b74e73022e47e04a31735f258babf305420.1751578055.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-03  bpf: Report rqspinlock deadlocks/timeout to BPF stderr  (Kumar Kartikeya Dwivedi)
Begin reporting rqspinlock deadlocks and timeout to BPF program's stderr. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250703204818.925464-9-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-03  bpf: Report may_goto timeout to BPF stderr  (Kumar Kartikeya Dwivedi)
Begin reporting may_goto timeouts to BPF program's stderr stream. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250703204818.925464-8-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-03  bpf: Add dump_stack() analogue to print to BPF stderr  (Kumar Kartikeya Dwivedi)
Introduce a kernel function which is the analogue of dump_stack(), printing some useful information along with the stack trace. This is not exposed to BPF programs yet, but can be made available in the future. When we have a program counter for a BPF program in the stack trace, additionally output the filename and line number to make the trace helpful. The rest of the trace can be passed into ./decode_stacktrace.sh to obtain the line numbers for kernel symbols. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250703204818.925464-7-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-03  bpf: Add function to find program from stack trace  (Kumar Kartikeya Dwivedi)
In preparation for figuring out the closest program that led to the current point in the kernel, implement a function that scans through the stack trace and finds out the closest BPF program when walking down the stack trace. Special care needs to be taken to skip over kernel and BPF subprog frames. We basically scan until we find a BPF main prog frame. The assumption is that if a program calls into us transitively, we'll hit it along the way. If not, we end up returning NULL. Contextually the function will be used in places where we know the program may have called into us. Due to reliance on arch_bpf_stack_walk(), this function only works on x86 with CONFIG_UNWINDER_ORC, arm64, and s390. Remove the warning from arch_bpf_stack_walk as well since we call it outside bpf_throw() context. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250703204818.925464-6-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-03  bpf: Ensure RCU lock is held around bpf_prog_ksym_find  (Kumar Kartikeya Dwivedi)
Add a warning to ensure RCU lock is held around tree lookup, and then fix one of the invocations in bpf_stack_walker. The program has an active stack frame and won't disappear. Use the opportunity to remove unneeded invocation of is_bpf_text_address. Fixes: f18b03fabaa9 ("bpf: Implement BPF exceptions") Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250703204818.925464-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-03  bpf: Add function to extract program source info  (Kumar Kartikeya Dwivedi)
Prepare a function for use in future patches that can extract the file info, line info, and the source line number for a given BPF program, provided its program counter. Only the basename of the file path is provided, given it can be excessively long in some cases. This will be used in later patches to print source info to the BPF stream. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250703204818.925464-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-03  bpf: Introduce BPF standard streams  (Kumar Kartikeya Dwivedi)
Add support for a stream API to the kernel and expose related kfuncs to BPF programs. Two streams are exposed, BPF_STDOUT and BPF_STDERR. These can be used for printing messages that can be consumed from user space, thus it's similar in spirit to the existing trace_pipe interface. The kernel will use the BPF_STDERR stream to notify the program of any errors encountered at runtime. BPF programs themselves may use both streams for writing debug messages. BPF library-like code may use BPF_STDERR to print warnings or errors on misuse at runtime. The implementation of a stream is as follows. Every time a message is emitted from the kernel (directly, or through a BPF program), a record is allocated by bump allocating from a per-cpu region backed by a page obtained using alloc_pages_nolock(). This ensures that we can allocate memory from any context. The eventual plan is to discard this scheme in favor of Alexei's kmalloc_nolock() [0]. This record is then locklessly inserted into a list (llist_add()) so that the printing side doesn't require holding any locks, and works in any context. Each stream has a maximum capacity of 4MB of text, and each printed message is accounted against this limit. Messages from a program are emitted using the bpf_stream_vprintk kfunc, which takes a stream_id argument in addition to working otherwise similarly to bpf_trace_vprintk. The bprintf buffer helpers are extracted out to be reused for printing the string into them before copying it into the stream, so that we can (with the defined max limit) format a string and know its true length before performing allocations of the stream element. For consuming elements from a stream, we expose a bpf(2) syscall command named BPF_PROG_STREAM_READ_BY_FD, which allows reading data from the stream of a given prog_fd into a user space buffer. The main logic is implemented in bpf_stream_read(). The log messages are queued in bpf_stream::log by the bpf_stream_vprintk kfunc, and then pulled and ordered correctly in the stream backlog. For this purpose, we hold a lock around bpf_stream_backlog_peek(), as llist_del_first() (if we maintained a second lockless list for the backlog) wouldn't be safe from multiple threads anyway. Then, if we fail to find something in the backlog log, we splice out everything from the lockless log, and place it in the backlog log, and then return the head of the backlog. Once the full length of the element is consumed, we will pop it and free it. The lockless list bpf_stream::log is a LIFO stack. Elements obtained using a llist_del_all() operation are in LIFO order, thus would break the chronological ordering if printed directly. Hence, this batch of messages is first reversed. Then, it is stashed into a separate list in the stream, i.e. the backlog_log. The head of this list is the actual message that should always be returned to the caller. All of this is done in bpf_stream_backlog_fill(). From the kernel side, the writing into the stream will be a bit more involved than the typical printk. First, the kernel typically may print a collection of messages into the stream, and parallel writers into the stream may suffer from interleaving of messages. To ensure each group of messages is visible atomically, we can lift the advantage of using a lockless list for pushing in messages. To enable this, we add a bpf_stream_stage() macro, and require kernel users to use bpf_stream_printk statements for the passed expression to write into the stream.
Underneath the macro, we have a message staging API, where a bpf_stream_stage object on the stack accumulates the messages being printed into a local llist_head, and then a commit operation splices the whole batch into the stream's lockless log list. This is especially pertinent for rqspinlock deadlock messages printed to program streams. After this change, we see each deadlock invocation as a non-interleaving contiguous message without any confusion on the reader's part, improving their user experience in debugging the fault. While programs cannot benefit from this staged stream writing API, they could just as well hold an rqspinlock around their print statements to serialize messages, hence this is kept kernel-internal for now. Overall, this infrastructure provides NMI-safe any context printing of messages to two dedicated streams. Later patches will add support for printing splats in case of BPF arena page faults, rqspinlock deadlocks, and cond_break timeouts, and integration of this facility into bpftool for dumping messages to user space. [0]: https://lore.kernel.org/bpf/20250501032718.65476-1-alexei.starovoitov@gmail.com Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250703204818.925464-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
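The reordering step at the heart of bpf_stream_backlog_fill() amounts to the following (a sketch using the kernel's llist API; surrounding locking elided):

	struct llist_node *batch = llist_del_all(&stream->log);

	/* llist_add() builds a LIFO stack, so the batch comes out
	 * newest-first; reverse it to restore chronological order
	 * before stashing it into the backlog list */
	batch = llist_reverse_order(batch);
	/* ... splice 'batch' onto stream->backlog_log and return its head ... */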
2025-07-03  bpf: Refactor bprintf buffer support  (Kumar Kartikeya Dwivedi)
Refactor code to be able to get and put bprintf buffers and use bpf_printf_prepare independently. This will be used in the next patch to implement BPF streams support, particularly as a staging buffer for strings that need to be formatted and then allocated and pushed into a stream. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250703204818.925464-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-03  bpf: Add show_fdinfo for kprobe_multi  (Tao Chen)
Show kprobe_multi link info with fdinfo, as follows:

	link_type:	kprobe_multi
	link_id:	1
	prog_tag:	a69740b9746f7da8
	prog_id:	21
	kprobe_cnt:	8
	missed:		0
	cookie	func
	1	bpf_fentry_test1+0x0/0x20
	7	bpf_fentry_test2+0x0/0x20
	2	bpf_fentry_test3+0x0/0x20
	3	bpf_fentry_test4+0x0/0x20
	4	bpf_fentry_test5+0x0/0x20
	5	bpf_fentry_test6+0x0/0x20
	6	bpf_fentry_test7+0x0/0x20
	8	bpf_fentry_test8+0x0/0x10

Signed-off-by: Tao Chen <chen.dylane@linux.dev> Link: https://lore.kernel.org/r/20250702153958.639852-3-chen.dylane@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-03  bpf: Add show_fdinfo for uprobe_multi  (Tao Chen)
Show uprobe_multi link info with fdinfo, as follows:

	link_type:	uprobe_multi
	link_id:	9
	prog_tag:	e729f789e34a8eca
	prog_id:	39
	uprobe_cnt:	3
	pid:	0
	path:	/home/dylane/bpf/tools/testing/selftests/bpf/test_progs
	cookie	offset	ref_ctr_offset
	3	0xa69f13	0x0
	1	0xa69f1e	0x0
	2	0xa69f29	0x0

Signed-off-by: Tao Chen <chen.dylane@linux.dev> Link: https://lore.kernel.org/r/20250702153958.639852-2-chen.dylane@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-03  bpf: Show precise link_type for {uprobe,kprobe}_multi fdinfo  (Tao Chen)
Alexei suggested that 'link_type' can be more precise and differentiated for humans in fdinfo. In fact BPF_LINK_TYPE_KPROBE_MULTI includes the kretprobe_multi type, and the same applies to BPF_LINK_TYPE_UPROBE_MULTI, so we can show it more concretely:

	link_type:	kprobe_multi
	link_id:	1
	prog_tag:	d2b307e915f0dd37
	...
	link_type:	kretprobe_multi
	link_id:	2
	prog_tag:	ab9ea0545870781d
	...
	link_type:	uprobe_multi
	link_id:	9
	prog_tag:	e729f789e34a8eca
	...
	link_type:	uretprobe_multi
	link_id:	10
	prog_tag:	7db356c03e61a4d4

Co-developed-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Tao Chen <chen.dylane@linux.dev> Link: https://lore.kernel.org/r/20250702153958.639852-1-chen.dylane@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-03  bpf: Add bpf_dynptr_memset() kfunc  (Ihor Solodrai)
Currently there is no straightforward way to fill dynptr memory with a value (most commonly zero). One can do it with bpf_dynptr_write(), but a temporary buffer is necessary for that. Implement bpf_dynptr_memset() - an analogue of memset() from libc. Signed-off-by: Ihor Solodrai <isolodrai@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250702210309.3115903-2-isolodrai@meta.com
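Expected usage from a BPF program (a sketch; the argument order (dynptr, offset, size, value) is inferred from the memset() analogy and may differ from the final kfunc signature):

	struct bpf_dynptr ptr;

	if (bpf_ringbuf_reserve_dynptr(&ringbuf, 64, 0, &ptr)) {
		bpf_ringbuf_discard_dynptr(&ptr, 0);
		return 0;
	}
	/* zero all 64 bytes in place; previously this needed a
	 * temporary buffer plus bpf_dynptr_write() */
	bpf_dynptr_memset(&ptr, 0, 64, 0);
	bpf_ringbuf_submit_dynptr(&ptr, 0);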
2025-07-03  PM: sleep: console: Fix the black screen issue  (tuhaowen)
When the computer enters a sleep state without a monitor connected, the system switches the console to the virtual terminal tty63 (SUSPEND_CONSOLE). If a monitor is subsequently connected before waking up, the system skips the required VT restoration process during wake-up, leaving the console on tty63 instead of switching back to tty1. To fix this issue, a global flag vt_switch_done is introduced to record whether the system has successfully switched to the suspend console via vt_move_to_console() during suspend. If the switch was completed, vt_switch_done is set to 1. Later, during resume, this flag is checked to ensure that the original console is restored properly by calling vt_move_to_console(orig_fgconsole, 0). This prevents scenarios where the resume logic skips console restoration due to incorrect detection of the console state, especially when a monitor is reconnected before waking up. Signed-off-by: tuhaowen <tuhaowen@uniontech.com> Link: https://patch.msgid.link/20250611032345.29962-1-tuhaowen@uniontech.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
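In outline (a sketch; vt_switch_done and the vt_move_to_console() calls follow the description above, the rest is simplified):

	static bool vt_switch_done;

	/* suspend path */
	orig_fgconsole = vt_move_to_console(SUSPEND_CONSOLE, 1);
	if (orig_fgconsole >= 0)
		vt_switch_done = true;	/* the VT switch really happened */

	/* resume path: restore iff we switched on the way down, instead
	 * of guessing from the current (possibly changed) console state */
	if (vt_switch_done) {
		vt_move_to_console(orig_fgconsole, 0);
		vt_switch_done = false;
	}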
2025-07-03  irqdomain: Add device pointer to irq_domain_info and msi_domain_info  (Thomas Gleixner)
Add device pointer to irq_domain_info and msi_domain_info, so that the device can be specified at domain creation time. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/943e52403b20cf13c320d55bd4446b4562466aab.1750860131.git.namcao@linutronix.de
2025-07-03  Merge tag 'ktime-get-clock-ts64-for-ptp' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Paolo Abeni)
Base implementation for PTP with a temporary CLOCK_AUX* workaround to allow integration of depending changes into the networking tree. Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03  Merge tag 'ktime-get-clock-ts64-for-ptp' into timers/ptp  (Thomas Gleixner)
Pull the base implementation of ktime_get_clock_ts64() for PTP, which contains a temporary CLOCK_AUX* workaround. That was created to allow integration of depending changes into the networking tree. The workaround is going to be removed in a subsequent change in the timekeeping tree. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2025-07-03  timekeeping: Provide ktime_get_clock_ts64()  (Thomas Gleixner)
PTP implements an inline switch case for taking timestamps from various POSIX clock IDs, which already consumes quite some text space. Expanding it for auxiliary clocks really becomes too big for inlining. Provide an out-of-line version. The function invalidates the timestamp in case the clock is invalid. The invalidation allows implementing a validation check without the need to propagate a return value through deep existing call chains. Due to merge logistics this temporarily defines CLOCK_AUX[_LAST] if undefined, so that the plain branch, which does not contain any of the core timekeeper changes, can be pulled into the networking tree as a prerequisite for the PTP side changes. These temporary defines are removed after that branch is merged into the tip::timers/ptp branch. That way the result in -next or upstream in the next merge window has zero dependencies. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250701132628.357686408@linutronix.de
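Callers then reduce to something like this (a sketch; the timespec64_valid() check stands in for whatever invalid-timestamp encoding the helper actually uses):

	struct timespec64 ts;

	/* replaces the inlined switch over CLOCK_MONOTONIC, CLOCK_TAI,
	 * CLOCK_AUX*, etc. */
	ktime_get_clock_ts64(clkid, &ts);
	if (!timespec64_valid(&ts))
		return -EINVAL;		/* clkid not supported */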
2025-07-03  perf: Revert to requiring CAP_SYS_ADMIN for uprobes  (Peter Zijlstra)
Jann reports that uprobes can be used destructively when used in the middle of an instruction. The kernel only verifies there is a valid instruction at the requested offset, but due to variable instruction length cannot determine if this is an instruction as seen by the intended execution stream. Additionally, Mark Rutland notes that on architectures that mix data into the text segment (like arm64), a similar thing can be done if the data word is 'mistaken' for an instruction. As such, require CAP_SYS_ADMIN for uprobes. Fixes: c9e0924e5c2b ("perf/core: open access to probes for CAP_PERFMON privileged process") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/CAG48ez1n4520sq0XrWYDHKiKxE_+WCfAK+qt9qkY4ZiBGmL-5g@mail.gmail.com
2025-07-02  bpf: Avoid warning on multiple referenced args in call  (Paul Chaignon)
The description of full helper calls in syzkaller [1] and the addition of kernel warnings in commit 0df1a55afa83 ("bpf: Warn on internal verifier errors") allowed syzbot to reach a verifier state that was thought to indicate a verifier bug [2]:

	12: (85) call bpf_tcp_raw_gen_syncookie_ipv4#204
	verifier bug: more than one arg with ref_obj_id R2 2 2

This error can be reproduced with the program from the previous commit:

	0: (b7) r2 = 20
	1: (b7) r3 = 0
	2: (18) r1 = 0xffff92cee3cbc600
	4: (85) call bpf_ringbuf_reserve#131
	5: (55) if r0 == 0x0 goto pc+3
	6: (bf) r1 = r0
	7: (bf) r2 = r0
	8: (85) call bpf_tcp_raw_gen_syncookie_ipv4#204
	9: (95) exit

bpf_tcp_raw_gen_syncookie_ipv4 expects R1 and R2 to be ARG_PTR_TO_FIXED_SIZE_MEM (with a size of at least sizeof(struct iphdr) for R1). R0 is a ring buffer payload of 20B and therefore matches this requirement. The verifier reaches the check on ref_obj_id while verifying R2 and rejects the program because the helper isn't supposed to take two referenced arguments. This case is a legitimate rejection and doesn't indicate a kernel bug, so we shouldn't log it as such and shouldn't emit a kernel warning. Link: https://github.com/google/syzkaller/pull/4313 [1] Link: https://lore.kernel.org/all/686491d6.a70a0220.3b7e22.20ea.GAE@google.com/T/ [2] Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it") Fixes: 0df1a55afa83 ("bpf: Warn on internal verifier errors") Reported-by: syzbot+69014a227f8edad4d8c6@syzkaller.appspotmail.com Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/cd09afbfd7bef10bbc432d72693f78ffdc1e8ee5.1751463262.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2025-07-02  bpf: avoid jump misprediction for PTR_TO_MEM | PTR_UNTRUSTED  (Eduard Zingerman)
Commit f2362a57aeff ("bpf: allow void* cast using bpf_rdonly_cast()") added a notion of PTR_TO_MEM | MEM_RDONLY | PTR_UNTRUSTED type. This simultaneously introduced a bug in jump prediction logic for situations like below:

	p = bpf_rdonly_cast(..., 0);
	if (p)
		a();
	else
		b();

Here verifier would wrongly predict that else branch cannot be taken. This happens because:
- Function reg_not_null() assumes that PTR_TO_MEM w/o PTR_MAYBE_NULL flag cannot be zero.
- The call to bpf_rdonly_cast() returns a rdonly_untrusted_mem value w/o PTR_MAYBE_NULL flag.

Tracking of PTR_MAYBE_NULL flag for untrusted PTR_TO_MEM does not make sense, as the main purpose of the flag is to catch null pointer access errors. Such errors are not possible on load of PTR_UNTRUSTED values and verifier makes sure that PTR_UNTRUSTED can't be passed to helpers or kfuncs. Hence, modify reg_not_null() to assume that nullness of untrusted PTR_TO_MEM is not known. Fixes: f2362a57aeff ("bpf: allow void* cast using bpf_rdonly_cast()") Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250702073620.897517-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2025-07-02  smp: Defer check for local execution in smp_call_function_many_cond()  (Yury Norov [NVIDIA])
Defer check for local execution to the actual place where it is needed, which removes the extra local variable. Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250623000010.10124-5-yury.norov@gmail.com
2025-07-02  bpf: Mark cgroup_subsys_state->cgroup RCU safe  (Song Liu)
Mark struct cgroup_subsys_state->cgroup as safe under RCU read lock. This will enable accessing css->cgroup from a bpf css iterator. Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/20250623063854.1896364-4-song@kernel.org Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>