path: root/kernel
Age | Commit message | Author
2025-06-25rcutorture: Make rcutorture_one_extend_check() account for hard IRQsPaul E. McKenney
This commit retrospectively prepares for testing of RCU readers invoked from hardware interrupt handlers (for example, HRTIMER_MODE_HARD hrtimer handlers) in kernels built with CONFIG_RCU_TORTURE_TEST_CHK_RDR_STATE=y, which is rarely used but sometimes extremely useful. This preparation involves taking early exits if in_hardirq(), and, while we are in the area, a very early exit if in_nmi(). This means that a number of insoftirq parameters are no longer needed, but that is the subject of a later commit. Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202505140917.8ee62cc6-lkp@intel.com Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-06-25rcutorture: Start rcu_torture_writer() after rcu_torture_reader()Paul E. McKenney
Testing of rcutorture's SRCU-P scenario on a large arm64 system resulted in rcu_torture_writer() forward-progress failures, but these same tests passed on x86. After some off-list discussion of possible memory-ordering causes for these failures, Boqun showed that these were in fact due to reordering, but by the scheduler, not by the memory system. On x86, rcu_torture_writer() would have run quickly enough that by the time the rcu_torture_updown() kthread started, the rcu_torture_current variable would already be initialized, thus avoiding a bug in which a NULL value would cause rcu_torture_updown() to do an extra call to srcu_up_read_fast(). This commit therefore moves creation of the rcu_torture_writer() kthread after that of the rcu_torture_reader() kthreads. This results in deterministic failures on x86. What about the double-srcu_up_read_fast() bug? Boqun has the fix. But let's also fix the test while we are at it! Reported-by: Joel Fernandes <joelagnelf@nvidia.com> Reported-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-06-25rcutorture: Print only one rtort_pipe_count splatPaul E. McKenney
The rcu_torture_writer() function scans the memory blocks after a stutter (or forced idle) interval, complaining about any that have not passed through ten grace periods since the start of the stutter interval. But one splat suffices, so this commit stops at the first splat. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-06-25rcu: Robustify rcu_is_cpu_rrupt_from_idle()Frederic Weisbecker
RCU relies on the context tracking nesting counter in order to determine if it is running in extended quiescent state. However, the context tracking nesting counter is not completely synchronized with the actual context tracking state:

 * The nesting counter is set to 1 or incremented further _after_ the actual state is set to RCU watching.

 * The nesting counter is set to 0 or decremented further _before_ the actual state is set to RCU not watching.

Therefore it is safe to assume that if ct_nesting() > 0, RCU is watching. But if ct_nesting() <= 0, RCU is not watching except for tiny windows. This hasn't been a problem so far because rcu_is_cpu_rrupt_from_idle() has only been called from interrupts. However, the code is confusing and abuses the role of the context tracking nesting counter while there are more accurate indicators available. Clarify and robustify accordingly. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-06-25rcu/exp: Protect against early QS reportFrederic Weisbecker
When a grace period is started, the ->expmask of each node is set up from sync_exp_reset_tree(). Then later on each leaf node also initializes its ->exp_tasks pointer. This means that the initialization of the quiescent state of a node and the initialization of its blocking tasks happen with an unlocked node gap in-between. It happens to be fine because nothing is expected to report an exp quiescent state within this gap, since no IPIs have been issued yet and every rdp's ->cpu_no_qs.b.exp should be false. However if it were to happen by accident, the quiescent state could be reported and propagated while ignoring tasks that blocked _before_ the start of the grace period. Prevent such trouble from happening in the future and initialize both the quiescent states mask to report and the blocked tasks head from the same node locked block. If a task blocks within an RCU read side critical section before sync_exp_reset_tree() is called and is then unblocked between sync_exp_reset_tree() and __sync_rcu_exp_select_node_cpus(), the QS won't be reported because no RCU exp IPI had been issued to request it through the setting of srdp->cpu_no_qs.b.exp. Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-06-24bpf, verifier: Improve precision for BPF_ADD and BPF_SUBHarishankar Vishwanathan
This patch improves the precision of the scalar(32)_min_max_add and scalar(32)_min_max_sub functions, which update the u(32)_min/u(32)_max ranges for the BPF_ADD and BPF_SUB instructions. We discovered this more precise operator using a technique we are developing for automatically synthesizing functions for updating tnums and ranges.

According to the BPF ISA [1], "Underflow and overflow are allowed during arithmetic operations, meaning the 64-bit or 32-bit value will wrap". Our patch leverages the wrap-around semantics of unsigned overflow and underflow to improve precision.

Below is an example of our patch for scalar_min_max_add; the idea is analogous for all four functions. There are three cases to consider when adding two u64 ranges [dst_umin, dst_umax] and [src_umin, src_umax]. Consider a value x in the range [dst_umin, dst_umax] and another value y in the range [src_umin, src_umax].

(a) No overflow: No addition x + y overflows. This occurs when even the largest possible sum does not overflow, i.e., dst_umax + src_umax <= U64_MAX.

(b) Partial overflow: Some additions x + y overflow. This occurs when the largest possible sum overflows (dst_umax + src_umax > U64_MAX), but the smallest possible sum does not overflow (dst_umin + src_umin <= U64_MAX).

(c) Full overflow: All additions x + y overflow. This occurs when both the smallest possible sum and the largest possible sum overflow, i.e., both (dst_umin + src_umin) and (dst_umax + src_umax) are > U64_MAX.

The current implementation conservatively sets the output bounds to unbounded, i.e., [umin=0, umax=U64_MAX], whenever there is *any* possibility of overflow, i.e., in cases (b) and (c). Otherwise it computes tight bounds as [dst_umin + src_umin, dst_umax + src_umax]:

    if (check_add_overflow(*dst_umin, src_reg->umin_value, dst_umin) ||
        check_add_overflow(*dst_umax, src_reg->umax_value, dst_umax)) {
            *dst_umin = 0;
            *dst_umax = U64_MAX;
    }

Our synthesis-based technique discovered a more precise operator. Particularly, in case (c), all possible additions x + y overflow and wrap around according to eBPF semantics, and the computation of the output range as [dst_umin + src_umin, dst_umax + src_umax] continues to work. Only in case (b) do we need to set the output bounds to unbounded, i.e., [0, U64_MAX]. Case (b) can be checked by seeing if the minimum possible sum does *not* overflow and the maximum possible sum *does* overflow, and when that happens, we set the output to unbounded:

    min_overflow = check_add_overflow(*dst_umin, src_reg->umin_value, dst_umin);
    max_overflow = check_add_overflow(*dst_umax, src_reg->umax_value, dst_umax);

    if (!min_overflow && max_overflow) {
            *dst_umin = 0;
            *dst_umax = U64_MAX;
    }

Below is an example eBPF program and the corresponding log from the verifier. The current implementation of scalar_min_max_add() sets r3's bounds to [0, U64_MAX] at instruction 5: (0f) r3 += r3, due to conservative overflow handling.

    0: R1=ctx() R10=fp0
    0: (b7) r4 = 0                  ; R4_w=0
    1: (87) r4 = -r4                ; R4_w=scalar()
    2: (18) r3 = 0xa000000000000000 ; R3_w=0xa000000000000000
    4: (4f) r3 |= r4                ; R3_w=scalar(smin=0xa000000000000000,smax=-1,umin=0xa000000000000000,var_off=(0xa000000000000000; 0x5fffffffffffffff)) R4_w=scalar()
    5: (0f) r3 += r3                ; R3_w=scalar()
    6: (b7) r0 = 1                  ; R0_w=1
    7: (95) exit

With our patch, r3's bounds after instruction 5 are set to a much more precise [0x4000000000000000, 0xfffffffffffffffe].

    ...
    5: (0f) r3 += r3                ; R3_w=scalar(umin=0x4000000000000000,umax=0xfffffffffffffffe)
    6: (b7) r0 = 1                  ; R0_w=1
    7: (95) exit

The logic for scalar32_min_max_add is analogous. For the scalar(32)_min_max_sub functions, the reasoning is similar but applied to detecting underflow instead of overflow. We verified the correctness of the new implementations using Agni [3,4]. We since also discovered that a similar technique has been used to calculate output ranges for unsigned interval addition and subtraction in Hacker's Delight [2].

[1] https://docs.kernel.org/bpf/standardization/instruction-set.html
[2] Hacker's Delight Ch.4-2, Propagating Bounds through Add's and Subtract's
[3] https://github.com/bpfverif/agni
[4] https://people.cs.rutgers.edu/~sn349/papers/sas24-preprint.pdf

Co-developed-by: Matan Shachnai <m.shachnai@rutgers.edu> Signed-off-by: Matan Shachnai <m.shachnai@rutgers.edu> Co-developed-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Signed-off-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Co-developed-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250623040359.343235-2-harishankar.vishwanathan@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
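[Editor's note] As a quick sanity check of the case analysis above, here is a small stand-alone C sketch. It is not the verifier code; it only assumes the wrap-around semantics quoted from the BPF ISA, and it reproduces the r3 += r3 example from the log:

    /* Stand-alone sketch (not the kernel's verifier code) of the u64 range
     * addition described above. Case (b) forces unbounded output; cases (a)
     * and (c) keep the wrapped endpoints as tight bounds. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    static void range_add(uint64_t dst_umin, uint64_t dst_umax,
                          uint64_t src_umin, uint64_t src_umax)
    {
            uint64_t umin, umax;
            bool min_overflow = __builtin_add_overflow(dst_umin, src_umin, &umin);
            bool max_overflow = __builtin_add_overflow(dst_umax, src_umax, &umax);

            /* Case (b): only some sums wrap around, so nothing useful is known. */
            if (!min_overflow && max_overflow) {
                    umin = 0;
                    umax = UINT64_MAX;
            }
            printf("[%#llx, %#llx]\n",
                   (unsigned long long)umin, (unsigned long long)umax);
    }

    int main(void)
    {
            /* r3 += r3 with r3 in [0xa000000000000000, U64_MAX]: case (c),
             * both endpoint sums wrap, so the wrapped endpoints still bound
             * every possible result. Prints [0x4000000000000000, 0xfffffffffffffffe]. */
            range_add(0xa000000000000000ULL, UINT64_MAX,
                      0xa000000000000000ULL, UINT64_MAX);
            return 0;
    }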
2025-06-24sched_ext, rcu: Eject BPF scheduler on RCU CPU stall panicDavid Dai
For systems that use a sched_ext scheduler and have panic_on_rcu_stall enabled, try kicking out the current scheduler before issuing a panic. While there are numerous reasons for RCU CPU stalls that are not directly attributed to the scheduler, deferring the panic gives sched_ext an opportunity to provide additional debug info when ejecting the current scheduler. Also, handling the event more gracefully allows us to potentially recover the system instead of incurring additional downtime. Suggested-by: Tejun Heo <tj@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: David Dai <david.dai@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-24kheaders: double-quote variables to satisfy shellcheckMasahiro Yamada
Fix the following:

    In kernel/gen_kheaders.sh line 48:
    -I $XZ -cf $tarfile -C "${tmpdir}/" . > /dev/null
       ^-^ SC2086 (info): Double quote to prevent globbing and word splitting.
               ^------^ SC2086 (info): Double quote to prevent globbing and word splitting.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2025-06-24kheaders: rebuild kheaders_data.tar.xz when KBUILD_BUILD_TIMESTAMP is changedMasahiro Yamada
This problem is similar to commit 7f8256ae0efb ("initramfs: Encode dependency on KBUILD_BUILD_TIMESTAMP"): kernel/gen_kheaders.sh has an internal dependency on KBUILD_BUILD_TIMESTAMP that is not exposed to make, so changing KBUILD_BUILD_TIMESTAMP will not trigger a rebuild of the archive. Move $(KBUILD_BUILD_TIMESTAMP) to the Makefile so that it is recorded in the *.cmd file. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2025-06-24kheaders: rebuild kheaders_data.tar.xz when a file is modified within a minuteMasahiro Yamada
When a header file is changed, kernel/gen_kheaders.sh may fail to update kernel/kheaders_data.tar.xz.

[steps to reproduce]

[1] Build kernel/kheaders_data.tar.xz

    $ make -j$(nproc) kernel/kheaders.o
      DESCEND objtool
      INSTALL libsubcmd_headers
      CALL    scripts/checksyscalls.sh
      CHK     kernel/kheaders_data.tar.xz
      GEN     kernel/kheaders_data.tar.xz
      CC      kernel/kheaders.o

[2] Modify a header without changing the file size

    $ sed -i s/0xdeadbeef/0xfeedbeef/ include/linux/elfnote.h

[3] Rebuild kernel/kheaders_data.tar.xz

    $ make -j$(nproc) kernel/kheaders.o
      DESCEND objtool
      INSTALL libsubcmd_headers
      CALL    scripts/checksyscalls.sh
      CHK     kernel/kheaders_data.tar.xz

kernel/kheaders_data.tar.xz is not updated if steps [1] - [3] are run within the same minute.

The headers_md5 variable stores the MD5 hash of the 'ls -l' output for all header files. This hash value is used to determine whether kheaders_data.tar.xz needs to be rebuilt. However, 'ls -l' prints the modification times with minute-level granularity. If a file is modified within the same minute and its size remains the same, the MD5 hash does not change.

To reliably detect file modifications, this commit rewrites kernel/gen_kheaders.sh to output header dependencies to kernel/.kheaders_data.tar.xz.cmd. Then, Make compares the timestamps and reruns kernel/gen_kheaders.sh when necessary. This is the standard mechanism used by Make and Kbuild.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2025-06-23bpf: Specify access type of bpf_sysctl_get_name argsJerome Marchand
The second argument of the bpf_sysctl_get_name() helper is a pointer to a buffer that is being written to. However, that isn't specified in the prototype. Until commit 37cce22dbd51a ("bpf: verifier: Refactor helper access type tracking"), all helper accesses were considered as a possible write access by the verifier, so no big harm was done. However, since then, the verifier might make wrong assumptions about the content of that address, which might lead it to make faulty optimizations (such as removing code that was wrongly labeled dead). This is what happens in the test_sysctl selftest for the tests related to sysctl_get_name. Add the MEM_WRITE flag to the second argument of bpf_sysctl_get_name(). Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250619140603.148942-2-jmarchan@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-23kernel/sched/ext.c: fix typo "occured" -> "occurred" in commentsKe Ma
Fixes a minor spelling mistake in two comment lines Signed-off-by: Ke Ma <makebit1999@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-23workqueue: Remove unused work_on_cpu_safeDr. David Alan Gilbert
The last use of the work_on_cpu_safe() macro was removed recently by commit 9cda46babdfe ("crypto: n2 - remove Niagara2 SPU driver"). Remove it, and the work_on_cpu_safe_key() function it calls. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-23replace collect_mounts()/drop_collected_mounts() with a safer variantAl Viro
collect_mounts() has several problems - one can't iterate over the results directly, so it has to be done with a callback passed to iterate_mounts(); it has an oopsable race with d_invalidate(); it creates temporary clones of mounts invisibly for sync umount (IOW, you can have non-lazy umount succeed leaving the filesystem not mounted anywhere and yet still busy).

A saner approach is to give the caller an array of struct path that would pin every mount in a subtree, without cloning any mounts.

* collect_mounts()/drop_collected_mounts()/iterate_mounts() is gone

* collect_paths(where, preallocated, size) gives either ERR_PTR(-E...) or a pointer to an array of struct path, one for each chunk of tree visible under 'where' (i.e. the first element is a copy of where, followed by (mount, root) for everything mounted under it - the same set collect_mounts() would give). Unlike collect_mounts(), the mounts are *not* cloned - we just get pinning references to the roots of subtrees in the caller's namespace. The array is terminated by a {NULL, NULL} struct path. If it fits into the preallocated array (on-stack, normally), that's where it goes; otherwise it's allocated by kmalloc_array(). Passing 0 as size means that 'preallocated' is ignored (and expected to be NULL).

* drop_collected_paths(paths, preallocated) is given the array returned by an earlier call of collect_paths() and the preallocated array passed to that call. All mount/dentry references are dropped and the array is kfree'd if it's not equal to 'preallocated'.

* instead of iterate_mounts(), users should just iterate over the array of struct path - nothing exotic is needed for that. Existing users (all in audit_tree.c) are converted.

[folded a fix for braino reported by Venkat Rao Bagalkote <venkat88@linux.ibm.com>]

Fixes: 80b5dce8c59b0 ("vfs: Add a function to lazily unmount all mounts from any dentry") Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
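[Editor's note] To illustrate the calling convention described above, here is a hedged, kernel-style usage sketch. It follows only what the commit text states; visit() is a hypothetical per-mount callback standing in for whatever the caller does with each mount:

    /* Sketch only: walk every mount visible under 'where' with the new API. */
    struct path prealloc[16];   /* typically on-stack, as described above */
    struct path *paths, *p;

    paths = collect_paths(&where, prealloc, ARRAY_SIZE(prealloc));
    if (IS_ERR(paths))
            return PTR_ERR(paths);

    /* The array is terminated by a {NULL, NULL} struct path. */
    for (p = paths; p->mnt; p++)
            visit(p);           /* hypothetical callback, not part of the API */

    /* Drops all mount/dentry references; frees the array unless it is
     * the preallocated one. */
    drop_collected_paths(paths, prealloc);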
2025-06-23sched/wait: Add a waitqueue helper for fully exclusive priority waitersSean Christopherson
Add a waitqueue helper to add a priority waiter that requires exclusive wakeups, i.e. that requires that it be the _only_ priority waiter. The API will be used by KVM to ensure that at most one of KVM's irqfds is bound to a single eventfd (across the entire kernel). Open code the helper instead of using __add_wait_queue() so that the common path doesn't need to "handle" impossible failures. Cc: K Prateek Nayak <kprateek.nayak@amd.com> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250522235223.3178519-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23sched/wait: Drop WQ_FLAG_EXCLUSIVE from add_wait_queue_priority()Sean Christopherson
Drop the setting of WQ_FLAG_EXCLUSIVE from add_wait_queue_priority() and instead have callers manually add the flag prior to adding their structure to the queue. Blindly setting WQ_FLAG_EXCLUSIVE is flawed, as the nature of exclusive, priority waiters means that only the first waiter added will ever receive notifications. Pushing the flawed behavior to callers will allow fixing the problem one hypervisor at a time (KVM added the flawed API, and then KVM's code was copy+pasted nearly verbatim by Xen and Hyper-V), and will also allow for adding an API that provides true exclusivity, i.e. that guarantees at most one priority waiter is in the queue. Opportunistically add a comment in Hyper-V to call out the mess. Xen privcmd's irqfd_wakeup() doesn't actually operate in exclusive mode, i.e. can be "fixed" simply by dropping WQ_FLAG_EXCLUSIVE. And KVM is primed to switch to the aforementioned fully exclusive API, i.e. won't be carrying the flawed code for long. No functional change intended. Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250522235223.3178519-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
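[Editor's note] A minimal sketch of the caller-side pattern this change requires; the flag that add_wait_queue_priority() used to set blindly is now set explicitly by the caller. The wakeup function name is hypothetical:

    /* Sketch: queue a priority waiter that opts into exclusive wakeups.
     * wq_head is the wait_queue_head_t * being waited on. */
    wait_queue_entry_t wait;

    init_waitqueue_func_entry(&wait, my_wakeup_fn); /* my_wakeup_fn: hypothetical */
    wait.flags |= WQ_FLAG_EXCLUSIVE;                /* previously set inside the helper */
    add_wait_queue_priority(wq_head, &wait);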
2025-06-23bpf: Make update_prog_stats() always_inlineMenglong Dong
The function update_prog_stats() will be called in the bpf trampoline. In most cases, it will be optimized by the compiler by making it inline. However, we can't rely on the compiler all the time, so just make it __always_inline to reduce the possible overhead. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Link: https://lore.kernel.org/r/20250621045501.101187-1-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-23Merge tag 'mm-hotfixes-stable-2025-06-22-18-52' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
"20 hotfixes. 7 are cc:stable and the remainder address post-6.15 issues or aren't considered necessary for -stable kernels. Only 4 are for MM.

 - The series `Revert "bcache: update min_heap_callbacks to use default builtin swap"' from Kuan-Wei Chiu backs out the author's recent min_heap changes due to a performance regression. A fix for this regression has been developed but we felt it best to go back to the known-good version to give the new code more bake time.

 - A lot of MAINTAINERS maintenance. I like to get these changes upstreamed promptly because they can't break things and more accurate/complete MAINTAINERS info hopefully improves the speed and accuracy of our responses to submitters and reporters"

* tag 'mm-hotfixes-stable-2025-06-22-18-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  MAINTAINERS: add additional mmap-related files to mmap section
  MAINTAINERS: add memfd, shmem quota files to shmem section
  MAINTAINERS: add stray rmap file to mm rmap section
  MAINTAINERS: add hugetlb_cgroup.c to hugetlb section
  MAINTAINERS: add further init files to mm init block
  MAINTAINERS: update maintainers for HugeTLB
  maple_tree: fix MA_STATE_PREALLOC flag in mas_preallocate()
  MAINTAINERS: add missing test files to mm gup section
  MAINTAINERS: add missing mm/workingset.c file to mm reclaim section
  selftests/mm: skip uprobe vma merge test if uprobes are not enabled
  bcache: remove unnecessary select MIN_HEAP
  Revert "bcache: remove heap-related macros and switch to generic min_heap"
  Revert "bcache: update min_heap_callbacks to use default builtin swap"
  selftests/mm: add configs to fix testcase failure
  kho: initialize tail pages for higher order folios properly
  MAINTAINERS: add linux-mm@ list to Kexec Handover
  mm: userfaultfd: fix race of userfaultfd_move and swap cache
  mm/gup: revert "mm: gup: fix infinite loop within __get_longterm_locked"
  selftests/mm: increase timeout from 180 to 900 seconds
  mm/shmem, swap: fix softlockup with mTHP swapin
2025-06-23bpf: Mark cgroup_subsys_state->cgroup RCU safeSong Liu
Mark struct cgroup_subsys_state->cgroup as safe under RCU read lock. This will enable accessing css->cgroup from a bpf css iterator. Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/20250623063854.1896364-4-song@kernel.org Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23bpf: Introduce bpf_cgroup_read_xattr to read xattr of cgroup's nodeSong Liu
BPF programs, such as LSM and sched_ext, would benefit from tags on cgroups. One common practice to apply such tags is to set xattrs on cgroupfs folders. Introduce kfunc bpf_cgroup_read_xattr, which allows reading a cgroup's xattrs. Note that we already have bpf_get_[file|dentry]_xattr. However, these two APIs are not ideal for reading cgroupfs xattrs, because: 1) These two APIs only work in sleepable contexts; 2) There is no kfunc that matches the current cgroup to a cgroupfs dentry. bpf_cgroup_read_xattr is generic and can be useful for many program types. It is also safe, because it requires a trusted or RCU-protected argument (KF_RCU). Therefore, we make it available to all program types. Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/20250623063854.1896364-3-song@kernel.org Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23Merge 6.16-rc3 into driver-core-nextGreg Kroah-Hartman
We need the driver-core fixes that are in 6.16-rc3 in here as well to build on top of. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-22Merge tag 'irq_urgent_for_v6.16_rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fixes from Borislav Petkov:

 - Fix missing prototypes warnings

 - Properly initialize work context when allocating it

 - Remove a method tracking when managed interrupts are suspended during hotplug, in favor of the code using an IRQ disable depth tracking now, and have interrupts get properly enabled again on restore

 - Make sure multiple CPUs getting hotplugged don't cause wrong tracking of the managed IRQ disable depth

* tag 'irq_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/ath79-misc: Fix missing prototypes warnings
  genirq/irq_sim: Initialize work context pointers properly
  genirq/cpuhotplug: Restore affinity even for suspended IRQ
  genirq/cpuhotplug: Rebalance managed interrupts across multi-CPU hotplug
2025-06-22Merge tag 'perf_urgent_for_v6.16_rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Borislav Petkov:

 - Avoid a crash on a heterogeneous machine where not all cores support the same hw events features

 - Avoid a deadlock when throttling events

 - Document the perf event states more

 - Make sure a number of perf paths switching off or rescheduling events call perf_cgroup_event_disable()

 - Make sure perf does task sampling before its userspace mapping is torn down, and not after

* tag 'perf_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel: Fix crash in icl_update_topdown_event()
  perf: Fix the throttle error of some clock events
  perf: Add comment to enum perf_event_state
  perf/core: Fix WARN in perf_cgroup_switch()
  perf: Fix dangling cgroup pointer in cpuctx
  perf: Fix cgroup state vs ERROR
  perf: Fix sample vs do_exit()
2025-06-22Merge tag 'locking_urgent_for_v6.16_rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fixes from Borislav Petkov:

 - Make sure the switch to the global hash is requested always under a lock so that two threads requesting that simultaneously cannot get to inconsistent state

 - Reject negative NUMA nodes earlier in the futex NUMA interface handling code

 - Selftests fixes

* tag 'locking_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Verify under the lock if hash can be replaced
  futex: Handle invalid node numbers supplied by user
  selftests/futex: Set the home_node in futex_numa_mpol
  selftests/futex: getopt() requires int as return value.
2025-06-20sched_ext: Add support for cgroup bandwidth control interfaceTejun Heo
- Add CONFIG_GROUP_SCHED_BANDWIDTH which is selected by both CONFIG_CFS_BANDWIDTH and EXT_GROUP_SCHED.

- Put bandwidth control interface files for both cgroup v1 and v2 under CONFIG_GROUP_SCHED_BANDWIDTH.

- Update tg_bandwidth() to fetch configuration parameters from fair if CONFIG_CFS_BANDWIDTH, SCX otherwise.

- Update tg_set_bandwidth() to update the parameters for both fair and SCX.

- Add bandwidth control parameters to struct scx_cgroup_init_args.

- Add sched_ext_ops.cgroup_set_bandwidth() which is invoked on bandwidth control parameter updates.

- Update scx_qmap and the maximal selftest to test the new feature.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-20sched_ext, sched/core: Factor out struct scx_task_groupTejun Heo
More sched_ext fields will be added to struct task_group. In preparation, factor out sched_ext fields into struct scx_task_group to reduce clutter in the common header. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-20sched_ext: Merge branch 'for-6.16-fixes' into for-6.17Tejun Heo
Pull sched_ext/for-6.16-fixes to receive:

  c50784e99f0e ("sched_ext: Make scx_group_set_weight() always update tg->scx.weight")
  33796b91871a ("sched_ext, sched/core: Don't call scx_group_set_weight() prematurely from sched_create_group()")

which are needed to implement CPU bandwidth control interface support.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-20sched_ext: Merge branch 'sched/core' of ↵Tejun Heo
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-6.17

Pull tip/sched/core to receive the following commits:

  d403a3689af5 ("sched/fair: Move max_cfs_quota_period decl and default_cfs_period() def from fair.c to sched.h")
  de4c80c6963e ("sched/core: Relocate tg_get_cfs_*() and cpu_cfs_*_read_*()")
  43e33f53e256 ("sched/core: Reorganize cgroup bandwidth control interface file reads")
  5bc34be478d0 ("sched/core: Reorganize cgroup bandwidth control interface file writes")

These will be used to implement CPU bandwidth interface support in sched_ext.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-20rcu: Return early if callback is not specifiedUladzislau Rezki (Sony)
Currently the call_rcu() API does not check whether a callback pointer is NULL. If NULL is passed, rcu_core() will try to invoke it, resulting in NULL pointer dereference and a kernel crash. To prevent this and improve debuggability, this patch adds a check for NULL and emits a kernel stack trace to help identify a faulty caller. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
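[Editor's note] One plausible shape of the added check, sketched here under the assumption that a WARN-style helper is used to emit the stack trace (the commit message does not spell out the exact mechanism):

    /* Sketch, not the exact upstream hunk. */
    void call_rcu(struct rcu_head *head, rcu_callback_t func)
    {
            /* Reject a NULL callback early and dump a stack trace that
             * identifies the faulty caller, instead of letting rcu_core()
             * dereference NULL much later. */
            if (WARN_ON_ONCE(!func))
                    return;

            /* ... existing path that queues the callback ... */
    }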
2025-06-19kho: initialize tail pages for higher order folios properlyPratyush Yadav
Currently, when restoring higher order folios, kho_restore_folio() only calls prep_compound_page() on all the pages. That is not enough to properly initialize the folios. The managed page count does not get updated, the reserved flag does not get dropped, and page count does not get initialized properly.

Restoring a higher order folio with it results in the following BUG with CONFIG_DEBUG_VM when attempting to free the folio:

  BUG: Bad page state in process test  pfn:104e2b
  page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffffffffffffffff pfn:0x104e2b
  flags: 0x2fffff80000000(node=0|zone=2|lastcpupid=0x1fffff)
  raw: 002fffff80000000 0000000000000000 00000000ffffffff 0000000000000000
  raw: ffffffffffffffff 0000000000000000 00000001ffffffff 0000000000000000
  page dumped because: nonzero _refcount
  [...]
  Call Trace:
  <TASK>
  dump_stack_lvl+0x4b/0x70
  bad_page.cold+0x97/0xb2
  __free_frozen_pages+0x616/0x850
  [...]

Combine the path for 0-order and higher order folios, initialize the tail pages with a count of zero, and call adjust_managed_page_count() to account for all the pages instead of just missing them.

In addition, since all the KHO-preserved pages get marked with MEMBLOCK_RSRV_NOINIT by deserialize_bitmap(), the reserved flag is not actually set (as can also be seen from the flags of the dumped page in the logs above). So drop the ClearPageReserved() calls.

[ptyadav@amazon.de: declare i in the loop instead of at the top] Link: https://lkml.kernel.org/r/20250613125916.39272-1-pratyush@kernel.org Link: https://lkml.kernel.org/r/20250605171143.76963-1-pratyush@kernel.org Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation") Signed-off-by: Pratyush Yadav <ptyadav@amazon.de> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Changyuan Lyu <changyuanl@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-19ntp: Use ktime_get_ntp_seconds()Thomas Gleixner
Use ktime_get_ntp_seconds() to prepare for auxiliary clocks so that the readout becomes per timekeeper. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250519083026.472512636@linutronix.de
2025-06-19pidfs: persist informationChristian Brauner
Persist exit and coredump information independent of whether anyone currently holds a pidfd for the struct pid.

The current scheme allocates pidfs dentries on-demand repeatedly. This scheme is reaching its limits as it makes it impossible to pin information that needs to be available after the task has exited or coredumped and that should not be lost simply because the pidfd got closed temporarily. The next opener should still see the stashed information. This is also a prerequisite for supporting extended attributes on pidfds to allow attaching meta information to them.

If someone opens a pidfd for a struct pid, a pidfs dentry is allocated and stashed in pid->stashed. Once the last pidfd for the struct pid is closed, the pidfs dentry is released and removed from pid->stashed. So if 10 callers create a pidfs dentry for the same struct pid sequentially, i.e., each closing the pidfd before the next one creates a new one, then a new pidfs dentry is allocated every time.

Because multiple tasks acquiring and releasing a pidfd for the same struct pid can race with one another, a task may still find a valid pidfs entry from the previous task in pid->stashed and reuse it. Or it might find a dead dentry in there, fail to reuse it, and so stash a new pidfs dentry. Multiple tasks may race to stash a new pidfs dentry but only one will succeed; the others will put their dentry.

The current scheme aims to ensure that a pidfs dentry for a struct pid can only be created if the task is still alive or if a pidfs dentry already existed before the task was reaped and so exit information has been stashed in the pidfs inode. That's great except that it's buggy. If a pidfs dentry is stashed in pid->stashed after pidfs_exit() but before __unhash_process() is called, we will return a pidfd for a reaped task without exit information being available. The pidfds_pid_valid() check does not guard against this race as it doesn't sync at all with pidfs_exit(). The pid_has_task() check might be successful simply because we're before __unhash_process() but after pidfs_exit().

Introduce a new scheme where the lifetime of information associated with a pidfs entry (coredump and exit information) isn't bound to the lifetime of the pidfs inode but to the struct pid itself. The first time a pidfs dentry is allocated for a struct pid, a struct pidfs_attr will be allocated which will be used to store exit and coredump information. If all pidfds for the pidfs dentry are closed, the dentry and inode can be cleaned up but the struct pidfs_attr will stick until the struct pid itself is freed. This will ensure minimal memory usage while persisting relevant information.

The new scheme has various advantages. First, it allows closing the race where we end up handing out a pidfd for a reaped task for which no exit information is available. Second, it minimizes memory usage. Third, it allows removing complex lifetime tracking via dentries when registering a struct pid with pidfs. There's no need to get or put a reference. Instead, the lifetime of exit and coredump information associated with a struct pid is bound to the lifetime of struct pid itself.

Link: https://lore.kernel.org/20250618-work-pidfs-persistent-v2-5-98f3456fd552@kernel.org Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-19timekeeping: Provide ktime_get_ntp_seconds()Thomas Gleixner
ntp_adjtimex() requires access to the actual time keeper per timekeeper ID. Provide an interface. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250519083026.411809421@linutronix.de
2025-06-19timekeeping: Introduce auxiliary timekeepersAnna-Maria Behnsen
Provide timekeepers for auxiliary clocks and initialize them during boot. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083026.350061049@linutronix.de
2025-06-19timekeeping: Add clock_valid flag to timekeeperThomas Gleixner
In preparation for supporting independent auxiliary timekeepers, add a clock valid field and set it to true for the system timekeeper. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083026.287145536@linutronix.de
2025-06-19timekeeping: Prepare timekeeping_update_from_shadow()Thomas Gleixner
Don't invoke the VDSO and paravirt updates when utilized for auxiliary clocks. This is a temporary workaround until the VDSO and paravirt interfaces have been worked out. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250519083026.223876435@linutronix.de
2025-06-19timekeeping: Make __timekeeping_advance() reusableAnna-Maria Behnsen
In __timekeeping_advance() the pointer to struct tk_data is hardcoded by the use of &tk_core. As long as there is only a single timekeeper (tk_core), this is not a problem. But when __timekeeping_advance() will be reused for per auxiliary timekeepers, __timekeeping_advance() needs to be generalized. Add a pointer to struct tk_data as function argument of __timekeeping_advance() and adapt all call sites. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083026.160967312@linutronix.de
2025-06-19ntp: Rename __do_adjtimex() to ntp_adjtimex()Thomas Gleixner
Clean up the name space. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083026.095637820@linutronix.de
2025-06-19ntp: Add timekeeper ID arguments to public functionsThomas Gleixner
In preparation for supporting auxiliary POSIX clocks, add a timekeeper ID to the relevant functions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083026.032425931@linutronix.de
2025-06-19ntp: Add support for auxiliary timekeepersThomas Gleixner
If auxiliary clocks are enabled, provide an array of NTP data so that the auxiliary timekeepers can be steered independently of the core timekeeper. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083025.969000914@linutronix.de
2025-06-19time: Introduce auxiliary POSIX clocksAnna-Maria Behnsen
To support auxiliary timekeeping and the related user space interfaces, it's required to define a clock ID range for them. Reserve 8 auxiliary clock IDs after the regular timekeeping clock ID space. This is the maximum number of auxiliary clocks the kernel can support. The actual number of supported clocks depends obviously on the presence of related devices and might be constrained by the available VDSO space. Add the corresponding timekeeper IDs as well. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083025.905800695@linutronix.de
2025-06-19timekeeping: Introduce timekeeper IDAnna-Maria Behnsen
As long as there is only a single timekeeper, there is no need to clarify which timekeeper is used. But with the upcoming reuse of the timekeeper infrastructure for auxiliary clock timekeepers, an ID is required to differentiate. Introduce an enum for timekeeper IDs, introduce a field in struct tk_data to store this timekeeper ID and also add initialization. The id struct field is added at the end of the second cacheline, as there is a 4 byte hole anyway. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083025.842476378@linutronix.de
2025-06-19timekeeping: Avoid double notification in do_adjtimex()Thomas Gleixner
Consolidate do_adjtimex() so that it does not notify about clock changes twice. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250519083025.779267274@linutronix.de
2025-06-19timekeeping: Cleanup kernel doc of __ktime_get_real_seconds()Thomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083025.715836017@linutronix.de
2025-06-19timekeeping: Remove hardcoded access to tk_coreThomas Gleixner
This was overlooked in the initial conversion. Use the provided pointer to access the shadow timekeeper. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083025.652611452@linutronix.de
2025-06-18bpf: Adjust free target to avoid global starvation of LRU mapWillem de Bruijn
BPF_MAP_TYPE_LRU_HASH can recycle most recent elements well before the map is full, due to percpu reservations and force shrink before neighbor stealing. Once a CPU is unable to borrow from the global map, it will steal one element from a neighbor once, and after that it will each time flush this one element to the global list and immediately recycle it.

Batch value LOCAL_FREE_TARGET (128) will exhaust a 10K element map with 79 CPUs. CPU 79 will observe this behavior even while its neighbors hold 78 * 127 + 1 * 15 == 9921 free elements (99%). CPUs need not be active concurrently. The issue can appear with affinity migration, e.g., irqbalance. Each CPU can reserve and then hold onto its 128 elements indefinitely.

Avoid global list exhaustion by limiting aggregate percpu caches to half of the map size, by adjusting LOCAL_FREE_TARGET based on cpu count. This change has no effect on sufficiently large tables.

Similar to LOCAL_NR_SCANS and lru->nr_scans, introduce a map variable lru->free_target. The extra field fits in a hole in struct bpf_lru. The cacheline is already warm where read in the hot path. The field is only accessed with the lru lock held.

Tested-by: Anton Protopopov <a.s.protopopov@gmail.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://lore.kernel.org/r/20250618215803.3587312-1-willemdebruijn.kernel@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
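[Editor's note] The arithmetic can be checked with a small stand-alone sketch. The exact clamping formula is not spelled out in the commit message, so the derivation below (cap the per-CPU target so that all CPUs together hold at most half the map) is an assumption rather than the upstream code:

    #include <stdio.h>

    #define LOCAL_FREE_TARGET 128u

    /* Assumed derivation: split half of the map evenly across CPUs, but
     * never exceed the historical LOCAL_FREE_TARGET batch size. */
    static unsigned int free_target(unsigned int map_size, unsigned int nr_cpus)
    {
            unsigned int target = map_size / (2 * nr_cpus);

            if (target < 1)
                    target = 1;
            if (target > LOCAL_FREE_TARGET)
                    target = LOCAL_FREE_TARGET;   /* large maps are unaffected */
            return target;
    }

    int main(void)
    {
            /* The 10K-element map on 79 CPUs from the example above:
             * prints target=63 aggregate=4977 (<= 5000, half the map). */
            unsigned int t = free_target(10000, 79);

            printf("target=%u aggregate=%u\n", t, 79 * t);
            return 0;
    }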
2025-06-18Merge tag 'cgroup-for-6.16-rc2-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fix from Tejun Heo:

 - In cgroup1 freezer, a task migrating into a frozen cgroup might not get frozen immediately due to the wrong operation order. Fix it.

* tag 'cgroup-for-6.16-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup,freezer: fix incomplete freezing when attaching tasks
2025-06-18Merge tag 'wq-for-6.16-rc2-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue fix from Tejun Heo:

 - Fix missed early init of wq_isolated_cpumask

* tag 'wq-for-6.16-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Initialize wq_isolated_cpumask in workqueue_init_early()
2025-06-18Merge tag 'sched_ext-for-6.16-rc2-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

 - Fix a couple bugs in cgroup cpu.weight support

 - Add the new sched-ext@lists.linux.dev to MAINTAINERS

* tag 'sched_ext-for-6.16-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext, sched/core: Don't call scx_group_set_weight() prematurely from sched_create_group()
  sched_ext: Make scx_group_set_weight() always update tg->scx.weight
  sched_ext: Update mailing list entry in MAINTAINERS
2025-06-18cgroup,freezer: fix incomplete freezing when attaching tasksChen Ridong
An issue was found:

  # cd /sys/fs/cgroup/freezer/
  # mkdir test
  # echo FROZEN > test/freezer.state
  # cat test/freezer.state
  FROZEN
  # sleep 1000 &
  [1] 863
  # echo 863 > test/cgroup.procs
  # cat test/freezer.state
  FREEZING

When tasks are migrated to a frozen cgroup, the freezer fails to immediately freeze the tasks, causing the cgroup to remain in the "FREEZING" state.

The freeze_task() function is called before clearing the CGROUP_FROZEN flag. This causes the freezing() check to incorrectly return false, preventing __freeze_task() from being invoked for the migrated task.

To fix this issue, clear the CGROUP_FROZEN state before calling freeze_task().

Fixes: f5d39b020809 ("freezer,sched: Rewrite core freezer logic") Cc: stable@vger.kernel.org # v6.1+ Reported-by: Zhong Jiawei <zhongjiawei1@huawei.com> Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>