summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2024-03-28bpf: Mitigate latency spikes associated with freeing non-preallocated htabYafang Shao
Following the recent upgrade of one of our BPF programs, we encountered significant latency spikes affecting other applications running on the same host. After thorough investigation, we identified that these spikes were primarily caused by the prolonged duration required to free a non-preallocated htab with approximately 2 million keys. Notably, our kernel configuration lacks the presence of CONFIG_PREEMPT. In scenarios where kernel execution extends excessively, other threads might be starved of CPU time, resulting in latency issues across the system. To mitigate this, we've adopted a proactive approach by incorporating cond_resched() calls within the kernel code. This ensures that during lengthy kernel operations, the scheduler is invoked periodically to provide opportunities for other threads to execute. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240327032022.78391-1-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-28bpf: add bpf_modify_return_test_tp() kfunc triggering tracepointAndrii Nakryiko
Add a simple bpf_modify_return_test_tp() kfunc, available to all program types, that is useful for various testing and benchmarking scenarios, as it allows to trigger most tracing BPF program types from BPF side, allowing to do complex testing and benchmarking scenarios. It is also attachable to for fmod_ret programs, making it a good and simple way to trigger fmod_ret program under test/benchmark. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240326162151.3981687-6-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-28bpf,arena: Use helper sizeof_field in struct accessorsHaiyue Wang
Use the well defined helper sizeof_field() to calculate the size of a struct member, instead of doing custom calculations. Signed-off-by: Haiyue Wang <haiyue.wang@intel.com> Link: https://lore.kernel.org/r/20240327065334.8140-1-haiyue.wang@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-28bpf: improve error message for unsupported helperMykyta Yatsenko
BPF verifier emits "unknown func" message when given BPF program type does not support BPF helper. This message may be confusing for users, as important context that helper is unknown only to current program type is not provided. This patch changes message to "program of this type cannot use helper " and aligns dependent code in libbpf and tests. Any suggestions on improving/changing this message are welcome. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Quentin Monnet <qmo@kernel.org> Link: https://lore.kernel.org/r/20240325152210.377548-1-yatsenko@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts, or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-28Merge tag 'net-6.9-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from bpf, WiFi and netfilter. Current release - regressions: - ipv6: fix address dump when IPv6 is disabled on an interface Current release - new code bugs: - bpf: temporarily disable atomic operations in BPF arena - nexthop: fix uninitialized variable in nla_put_nh_group_stats() Previous releases - regressions: - bpf: protect against int overflow for stack access size - hsr: fix the promiscuous mode in offload mode - wifi: don't always use FW dump trig - tls: adjust recv return with async crypto and failed copy to userspace - tcp: properly terminate timers for kernel sockets - ice: fix memory corruption bug with suspend and rebuild - at803x: fix kernel panic with at8031_probe - qeth: handle deferred cc1 Previous releases - always broken: - bpf: fix bug in BPF_LDX_MEMSX - netfilter: reject table flag and netdev basechain updates - inet_defrag: prevent sk release while still in use - wifi: pick the version of SESSION_PROTECTION_NOTIF - wwan: t7xx: split 64bit accesses to fix alignment issues - mlxbf_gige: call request_irq() after NAPI initialized - hns3: fix kernel crash when devlink reload during pf initialization" * tag 'net-6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits) inet: inet_defrag: prevent sk release while still in use Octeontx2-af: fix pause frame configuration in GMP mode net: lan743x: Add set RFE read fifo threshold for PCI1x1x chips net: bcmasp: Remove phy_{suspend/resume} net: bcmasp: Bring up unimac after PHY link up net: phy: qcom: at803x: fix kernel panic with at8031_probe netfilter: arptables: Select NETFILTER_FAMILY_ARP when building arp_tables.c netfilter: nf_tables: skip netdev hook unregistration if table is dormant netfilter: nf_tables: reject table flag and netdev basechain updates netfilter: nf_tables: reject destroy command to remove basechain hooks bpf: update BPF LSM designated reviewer list bpf: Protect against int overflow for stack access size bpf: Check bloom filter map value size bpf: fix warning for crash_kexec selftests: netdevsim: set test timeout to 10 minutes net: wan: framer: Add missing static inline qualifiers mlxbf_gige: call request_irq() after NAPI initialized tls: get psock ref after taking rxlock to avoid leak selftests: tls: add test with a partially invalid iov tls: adjust recv return with async crypto and failed copy to userspace ...
2024-03-28sched/fair: Rename set_rd_overutilized_status() to set_rd_overutilized()Ingo Molnar
The _status() postfix has no real meaning, simplify the naming and harmonize it with set_rd_overloaded(). Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Qais Yousef <qyousef@layalina.io> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com
2024-03-28sched/fair: Rename SG_OVERLOAD to SG_OVERLOADEDIngo Molnar
Follow the rename of the root_domain::overloaded flag. Note that this also matches the SG_OVERUTILIZED flag better. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Qais Yousef <qyousef@layalina.io> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com
2024-03-28sched/fair: Rename {set|get}_rd_overload() to {set|get}_rd_overloaded()Ingo Molnar
Follow the rename of the root_domain::overloaded flag. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Qais Yousef <qyousef@layalina.io> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com
2024-03-28sched/fair: Rename root_domain::overload to ::overloadedIngo Molnar
It is silly to use an ambiguous noun instead of a clear adjective when naming such a flag ... Note how root_domain::overutilized already used a proper adjective. rd->overloaded is now set to 1 when the root domain is overloaded. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Qais Yousef <qyousef@layalina.io> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com
2024-03-28sched/fair: Use helper functions to access root_domain::overloadShrikanth Hegde
Introduce two helper functions to access & set the root_domain::overload flag: get_rd_overload() set_rd_overload() To make sure code is always following READ_ONCE()/WRITE_ONCE() access methods. No change in functionality intended. [ mingo: Renamed the accessors to get_/set_rd_overload(), tidied up the changelog. ] Suggested-by: Qais Yousef <qyousef@layalina.io> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240325054505.201995-3-sshegde@linux.ibm.com
2024-03-28sched/fair: Check root_domain::overload value before updateShrikanth Hegde
The root_domain::overload flag is 1 when there's any rq in the root domain that has 2 or more running tasks. (Ie. it's overloaded.) The root_domain structure itself is a global structure per cpuset island. The ::overload flag is maintained the following way: - Set when adding a second task to the runqueue. - It is cleared in update_sd_lb_stats() during load balance, if none of the rqs have 2 or more running tasks. This flag is used during newidle balance to see if its worth doing a full load balance pass, which can be an expensive operation. If it is set, then newidle balance will try to aggressively pull a task. Since commit: 630246a06ae2 ("sched/fair: Clean-up update_sg_lb_stats parameters") ::overload is being written unconditionally, even if it has the same value. The change in value of this depends on the workload, but on typical workloads, it doesn't change all that often: a system is either dominantly overloaded for substantial amounts of time, or not. Extra writes to this semi-global structure cause unnecessary overhead, extra bus traffic, etc. - so avoid it as much as possible. Perf probe stats show that it's worth making this change (numbers are with patch applied): 1M probe:sched_balance_newidle_L38 139 probe:update_sd_lb_stats_L53 <====== 1->0 writes 129K probe:add_nr_running_L12 74 probe:add_nr_running_L13 <====== 0->1 writes 54K probe:update_sd_lb_stats_L50 <====== reads These numbers prove that actual change in the ::overload value is (much) less frequent: L50 is much larger at ~54,000 accesses vs L53+L13 of 139+74. [ mingo: Rewrote the changelog. ] Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20240325054505.201995-2-sshegde@linux.ibm.com
2024-03-28sched/fair: Combine EAS check with root_domain::overutilized accessShrikanth Hegde
Access to root_domainoverutilized is always used with sched_energy_enabled in the pattern: if (sched_energy_enabled && !overutilized) do something So modify the helper function to utilize this pattern. This is more readable code as it would say, do something when root domain is not overutilized. This function always return true when EAS is disabled. No change in functionality intended. Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240326152616.380999-1-sshegde@linux.ibm.com
2024-03-27Merge tag 'mm-hotfixes-stable-2024-03-27-11-25' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "Various hotfixes. About half are cc:stable and the remainder address post-6.8 issues or aren't considered suitable for backporting. zswap figures prominently in the post-6.8 issues - folloup against the large amount of changes we have just made to that code. Apart from that, all over the map" * tag 'mm-hotfixes-stable-2024-03-27-11-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits) crash: use macro to add crashk_res into iomem early for specific arch mm: zswap: fix data loss on SWP_SYNCHRONOUS_IO devices selftests/mm: fix ARM related issue with fork after pthread_create hexagon: vmlinux.lds.S: handle attributes section userfaultfd: fix deadlock warning when locking src and dst VMAs tmpfs: fix race on handling dquot rbtree selftests/mm: sigbus-wp test requires UFFD_FEATURE_WP_HUGETLBFS_SHMEM mm: zswap: fix writeback shinker GFP_NOIO/GFP_NOFS recursion ARM: prctl: reject PR_SET_MDWE on pre-ARMv6 prctl: generalize PR_SET_MDWE support check to be per-arch MAINTAINERS: remove incorrect M: tag for dm-devel@lists.linux.dev mm: zswap: fix kernel BUG in sg_init_one selftests: mm: restore settings from only parent process tools/Makefile: remove cgroup target mm: cachestat: fix two shmem bugs mm: increase folio batch size mm,page_owner: fix recursion mailmap: update entry for Leonard Crestez init: open /initrd.image with O_LARGEFILE selftests/mm: Fix build with _FORTIFY_SOURCE ...
2024-03-27Merge tag 'probes-fixes-v6.9-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fixlet from Masami Hiramatsu: - tracing/probes: initialize a 'val' local variable with zero. This variable is read by FETCH_OP_ST_EDATA in a loop, and is initialized by FETCH_OP_ARG in the same loop. Since this initialization is not obvious, smatch warns about it. Explicitly initializing 'val' with zero fixes this warning. * tag 'probes-fixes-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: probes: Fix to zero initialize a local variable
2024-03-27bpf: Protect against int overflow for stack access sizeAndrei Matei
This patch re-introduces protection against the size of access to stack memory being negative; the access size can appear negative as a result of overflowing its signed int representation. This should not actually happen, as there are other protections along the way, but we should protect against it anyway. One code path was missing such protections (fixed in the previous patch in the series), causing out-of-bounds array accesses in check_stack_range_initialized(). This patch causes the verification of a program with such a non-sensical access size to fail. This check used to exist in a more indirect way, but was inadvertendly removed in a833a17aeac7. Fixes: a833a17aeac7 ("bpf: Fix verification of indirect var-off stack access") Reported-by: syzbot+33f4297b5f927648741a@syzkaller.appspotmail.com Reported-by: syzbot+aafd0513053a1cbf52ef@syzkaller.appspotmail.com Closes: https://lore.kernel.org/bpf/CAADnVQLORV5PT0iTAhRER+iLBTkByCYNBYyvBSgjN1T31K+gOw@mail.gmail.com/ Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Andrei Matei <andreimatei1@gmail.com> Link: https://lore.kernel.org/r/20240327024245.318299-3-andreimatei1@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-27bpf: Check bloom filter map value sizeAndrei Matei
This patch adds a missing check to bloom filter creating, rejecting values above KMALLOC_MAX_SIZE. This brings the bloom map in line with many other map types. The lack of this protection can cause kernel crashes for value sizes that overflow int's. Such a crash was caught by syzkaller. The next patch adds more guard-rails at a lower level. Signed-off-by: Andrei Matei <andreimatei1@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240327024245.318299-2-andreimatei1@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-27Fix memory leak in posix_clock_open()Linus Torvalds
If the clk ops.open() function returns an error, we don't release the pccontext we allocated for this clock. Re-organize the code slightly to make it all more obvious. Reported-by: Rohit Keshri <rkeshri@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Fixes: 60c6946675fc ("posix-clock: introduce posix_clock_context concept") Cc: Jakub Kicinski <kuba@kernel.org> Cc: David S. Miller <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linuxfoundation.org>
2024-03-27bpf: fix warning for crash_kexecHari Bathini
With [1], crash dump specific code is moved out of CONFIG_KEXEC_CORE and placed under CONFIG_CRASH_DUMP, where it is more appropriate. And since CONFIG_KEXEC & !CONFIG_CRASH_DUMP build option is supported with that, it led to the below warning: "WARN: resolve_btfids: unresolved symbol crash_kexec" Fix it by using the appropriate #ifdef. [1] https://lore.kernel.org/all/20240124051254.67105-1-bhe@redhat.com/ Acked-by: Baoquan He <bhe@redhat.com> Fixes: 02aff8480533 ("crash: split crash dumping code out from kexec_core.c") Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Link: https://lore.kernel.org/r/20240319080152.36987-1-hbathini@linux.ibm.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-27Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-03-25 We've added 38 non-merge commits during the last 13 day(s) which contain a total of 50 files changed, 867 insertions(+), 274 deletions(-). The main changes are: 1) Add the ability to specify and retrieve BPF cookie also for raw tracepoint programs in order to ease migration from classic to raw tracepoints, from Andrii Nakryiko. 2) Allow the use of bpf_get_{ns_,}current_pid_tgid() helper for all program types and add additional BPF selftests, from Yonghong Song. 3) Several improvements to bpftool and its build, for example, enabling libbpf logs when loading pid_iter in debug mode, from Quentin Monnet. 4) Check the return code of all BPF-related set_memory_*() functions during load and bail out in case they fail, from Christophe Leroy. 5) Avoid a goto in regs_refine_cond_op() such that the verifier can be better integrated into Agni tool which doesn't support backedges yet, from Harishankar Vishwanathan. 6) Add a small BPF trie perf improvement by always inlining longest_prefix_match, from Jesper Dangaard Brouer. 7) Small BPF selftest refactor in bpf_tcp_ca.c to utilize start_server() helper instead of open-coding it, from Geliang Tang. 8) Improve test_tc_tunnel.sh BPF selftest to prevent client connect before the server bind, from Alessandro Carminati. 9) Fix BPF selftest benchmark for older glibc and use syscall(SYS_gettid) instead of gettid(), from Alan Maguire. 10) Implement a backward-compatible method for struct_ops types with additional fields which are not present in older kernels, from Kui-Feng Lee. 11) Add a small helper to check if an instruction is addr_space_cast from as(0) to as(1) and utilize it in x86-64 JIT, from Puranjay Mohan. 12) Small cleanup to remove unnecessary error check in bpf_struct_ops_map_update_elem, from Martin KaFai Lau. 13) Improvements to libbpf fd validity checks for BPF map/programs, from Mykyta Yatsenko. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (38 commits) selftests/bpf: Fix flaky test btf_map_in_map/lookup_update bpf: implement insn_is_cast_user() helper for JITs bpf: Avoid get_kernel_nofault() to fetch kprobe entry IP selftests/bpf: Use start_server in bpf_tcp_ca bpf: Sync uapi bpf.h to tools directory libbpf: Add new sec_def "sk_skb/verdict" selftests/bpf: Mark uprobe trigger functions with nocf_check attribute selftests/bpf: Use syscall(SYS_gettid) instead of gettid() wrapper in bench bpf-next: Avoid goto in regs_refine_cond_op() bpftool: Clean up HOST_CFLAGS, HOST_LDFLAGS for bootstrap bpftool selftests/bpf: scale benchmark counting by using per-CPU counters bpftool: Remove unnecessary source files from bootstrap version bpftool: Enable libbpf logs when loading pid_iter in debug mode selftests/bpf: add raw_tp/tp_btf BPF cookie subtests libbpf: add support for BPF cookie for raw_tp/tp_btf programs bpf: support BPF cookie in raw tracepoint (raw_tp, tp_btf) programs bpf: pass whole link instead of prog when triggering raw tracepoint bpf: flatten bpf_probe_register call chain selftests/bpf: Prevent client connect before server bind in test_tc_tunnel.sh selftests/bpf: Add a sk_msg prog bpf_get_ns_current_pid_tgid() test ... ==================== Link: https://lore.kernel.org/r/20240325233940.7154-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-26sched/fair: Simplify the continue_balancing logic in sched_balance_newidle()Shrikanth Hegde
newidle(CPU_NEWLY_IDLE) balancing doesn't stop the load-balancing if the continue_balancing flag is reset, but the other two balancing (IDLE, BUSY) cases do that. newidle balance stops the load balancing if rq has a task or there is wakeup pending. The same checks are present in should_we_balance for newidle. Hence use the return value and simplify continue_balancing mechanism for newidle. Update the comment surrounding it as well. No change in functionality intended. Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20240325153926.274284-1-sshegde@linux.ibm.com
2024-03-26crash: use macro to add crashk_res into iomem early for specific archBaoquan He
There are regression reports[1][2] that crashkernel region on x86_64 can't be added into iomem tree sometime. This causes the later failure of kdump loading. This happened after commit 4a693ce65b18 ("kdump: defer the insertion of crashkernel resources") was merged. Even though, these reported issues are proved to be related to other component, they are just exposed after above commmit applied, I still would like to keep crashk_res and crashk_low_res being added into iomem early as before because the early adding has been always there on x86_64 and working very well. For safety of kdump, Let's change it back. Here, add a macro HAVE_ARCH_ADD_CRASH_RES_TO_IOMEM_EARLY to limit that only ARCH defining the macro can have the early adding crashk_res/_low_res into iomem. Then define HAVE_ARCH_ADD_CRASH_RES_TO_IOMEM_EARLY on x86 to enable it. Note: In reserve_crashkernel_low(), there's a remnant of crashk_low_res handling which was mistakenly added back in commit 85fcde402db1 ("kexec: split crashkernel reservation code out from crash_core.c"). [1] [PATCH V2] x86/kexec: do not update E820 kexec table for setup_data https://lore.kernel.org/all/Zfv8iCL6CT2JqLIC@darkstar.users.ipa.redhat.com/T/#u [2] Question about Address Range Validation in Crash Kernel Allocation https://lore.kernel.org/all/4eeac1f733584855965a2ea62fa4da58@huawei.com/T/#u Link: https://lkml.kernel.org/r/ZgDYemRQ2jxjLkq+@MiWiFi-R3L-srv Fixes: 4a693ce65b18 ("kdump: defer the insertion of crashkernel resources") Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Huacai Chen <chenhuacai@loongson.cn> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Bohac <jbohac@suse.cz> Cc: Li Huafei <lihuafei1@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-26prctl: generalize PR_SET_MDWE support check to be per-archZev Weiss
Patch series "ARM: prctl: Reject PR_SET_MDWE where not supported". I noticed after a recent kernel update that my ARM926 system started segfaulting on any execve() after calling prctl(PR_SET_MDWE). After some investigation it appears that ARMv5 is incapable of providing the appropriate protections for MDWE, since any readable memory is also implicitly executable. The prctl_set_mdwe() function already had some special-case logic added disabling it on PARISC (commit 793838138c15, "prctl: Disable prctl(PR_SET_MDWE) on parisc"); this patch series (1) generalizes that check to use an arch_*() function, and (2) adds a corresponding override for ARM to disable MDWE on pre-ARMv6 CPUs. With the series applied, prctl(PR_SET_MDWE) is rejected on ARMv5 and subsequent execve() calls (as well as mmap(PROT_READ|PROT_WRITE)) can succeed instead of unconditionally failing; on ARMv6 the prctl works as it did previously. [0] https://lore.kernel.org/all/2023112456-linked-nape-bf19@gregkh/ This patch (of 2): There exist systems other than PARISC where MDWE may not be feasible to support; rather than cluttering up the generic code with additional arch-specific logic let's add a generic function for checking MDWE support and allow each arch to override it as needed. Link: https://lkml.kernel.org/r/20240227013546.15769-4-zev@bewilderbeest.net Link: https://lkml.kernel.org/r/20240227013546.15769-5-zev@bewilderbeest.net Signed-off-by: Zev Weiss <zev@bewilderbeest.net> Acked-by: Helge Deller <deller@gmx.de> [parisc] Cc: Borislav Petkov <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: Florent Revest <revest@chromium.org> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Kees Cook <keescook@chromium.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ondrej Mosnacek <omosnace@redhat.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Russell King (Oracle) <linux@armlinux.org.uk> Cc: Sam James <sam@gentoo.org> Cc: Stefan Roesch <shr@devkernel.io> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: <stable@vger.kernel.org> [6.3+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-26Merge tag 'printk-for-6.9-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk fix from Petr Mladek: - Prevent scheduling in an atomic context when printk() takes over the console flushing duty * tag 'printk-for-6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: printk: Update @console_may_schedule in console_trylock_spinning()
2024-03-26Merge tag 'for-netdev' of ↵Paolo Abeni
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2024-03-25 The following pull-request contains BPF updates for your *net* tree. We've added 17 non-merge commits during the last 12 day(s) which contain a total of 19 files changed, 184 insertions(+), 61 deletions(-). The main changes are: 1) Fix an arm64 BPF JIT bug in BPF_LDX_MEMSX implementation's offset handling found via test_bpf module, from Puranjay Mohan. 2) Various fixups to the BPF arena code in particular in the BPF verifier and around BPF selftests to match latest corresponding LLVM implementation, from Puranjay Mohan and Alexei Starovoitov. 3) Fix xsk to not assume that metadata is always requested in TX completion, from Stanislav Fomichev. 4) Fix riscv BPF JIT's kfunc parameter incompatibility between BPF and the riscv ABI which requires sign-extension on int/uint, from Pu Lehui. 5) Fix s390x BPF JIT's bpf_plt pointer arithmetic which triggered a crash when testing struct_ops, from Ilya Leoshkevich. 6) Fix libbpf's arena mmap handling which had incorrect u64-to-pointer cast on 32-bit architectures, from Andrii Nakryiko. 7) Fix libbpf to define MFD_CLOEXEC when not available, from Arnaldo Carvalho de Melo. 8) Fix arm64 BPF JIT implementation for 32bit unconditional bswap which resulted in an incorrect swap as indicated by test_bpf, from Artem Savkov. 9) Fix BPF man page build script to use silent mode, from Hangbin Liu. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: riscv, bpf: Fix kfunc parameters incompatibility between bpf and riscv abi bpf: verifier: reject addr_space_cast insn without arena selftests/bpf: verifier_arena: fix mmap address for arm64 bpf: verifier: fix addr_space_cast from as(1) to as(0) libbpf: Define MFD_CLOEXEC if not available arm64: bpf: fix 32bit unconditional bswap bpf, arm64: fix bug in BPF_LDX_MEMSX libbpf: fix u64-to-pointer cast on 32-bit arches s390/bpf: Fix bpf_plt pointer arithmetic xsk: Don't assume metadata is always requested in TX completion selftests/bpf: Add arena test case for 4Gbyte corner case selftests/bpf: Remove hard coded PAGE_SIZE macro. libbpf, selftests/bpf: Adjust libbpf, bpftool, selftests to match LLVM bpf: Clarify bpf_arena comments. MAINTAINERS: Update email address for Quentin Monnet scripts/bpf_doc: Use silent mode when exec make cmd bpf: Temporarily disable atomic operations in BPF arena ==================== Link: https://lore.kernel.org/r/20240325213520.26688-1-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-26sched/fair: Introduce is_rd_overutilized() helper function to access ↵Shrikanth Hegde
root_domain::overutilized The root_domain::overutilized field is READ_ONCE() accessed in multiple places, which could be simplified with a helper function. This might also make it more apparent that it needs to be used only in case of EAS. No change in functionality intended. Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240307085725.444486-3-sshegde@linux.ibm.com
2024-03-26sched/fair: Add EAS checks before updating root_domain::overutilizedShrikanth Hegde
root_domain::overutilized is only used for EAS(energy aware scheduler) to decide whether to do load balance or not. It is not used if EAS not possible. Currently enqueue_task_fair and task_tick_fair accesses, sometime updates this field. In update_sd_lb_stats it is updated often. This causes cache contention due to true sharing and burns a lot of cycles. ::overload and ::overutilized are part of the same cacheline. Updating it often invalidates the cacheline. That causes access to ::overload to slow down due to false sharing. Hence add EAS check before accessing/updating this field. EAS check is optimized at compile time or it is a static branch. Hence it shouldn't cost much. With the patch, both enqueue_task_fair and newidle_balance don't show up as hot routines in perf profile. 6.8-rc4: 7.18% swapper [kernel.vmlinux] [k] enqueue_task_fair 6.78% s [kernel.vmlinux] [k] newidle_balance +patch: 0.14% swapper [kernel.vmlinux] [k] enqueue_task_fair 0.00% swapper [kernel.vmlinux] [k] newidle_balance While at it: trace_sched_overutilized_tp expect that second argument to be bool. So do a int to bool conversion for that. Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator") Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Srikar Dronamraju <srikar@linux.ibm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240307085725.444486-2-sshegde@linux.ibm.com
2024-03-25genirq: Introduce IRQF_COND_ONESHOT and use it in pinctrl-amdRafael J. Wysocki
There is a problem when a driver requests a shared interrupt line to run a threaded handler on it without IRQF_ONESHOT set if that flag has been set already for the IRQ in question by somebody else. Namely, the request fails which usually leads to a probe failure even though the driver might have worked just fine with IRQF_ONESHOT, but it does not want to use it by default. Currently, the only way to handle this is to try to request the IRQ without IRQF_ONESHOT, but with IRQF_PROBE_SHARED set and if this fails, try again with IRQF_ONESHOT set. However, this is a bit cumbersome and not very clean. When commit 7a36b901a6eb ("ACPI: OSL: Use a threaded interrupt handler for SCI") switched the ACPI subsystem over to using a threaded interrupt handler for the SCI, it had to use IRQF_ONESHOT for it because that's required due to the way the SCI handler works (it needs to walk all of the enabled GPEs before the interrupt line can be unmasked). The SCI interrupt line is not shared with other users very often due to the SCI handling overhead, but on sone systems it is shared and when the other user of it attempts to install a threaded handler, a flags mismatch related to IRQF_ONESHOT may occur. As it turned out, that happened to the pinctrl-amd driver and so commit 4451e8e8415e ("pinctrl: amd: Add IRQF_ONESHOT to the interrupt request") attempted to address the issue by adding IRQF_ONESHOT to the interrupt flags in that driver, but this is now causing an IRQF_ONESHOT-related mismatch to occur on another system which cannot boot as a result of it. Clearly, pinctrl-amd can work with IRQF_ONESHOT if need be, but it should not set that flag by default, so it needs a way to indicate that to the interrupt subsystem. To that end, introdcuce a new interrupt flag, IRQF_COND_ONESHOT, which will only have effect when the IRQ line is shared and IRQF_ONESHOT has been set for it already, in which case it will be promoted to the latter. This is sufficient for drivers sharing the interrupt line with the SCI as it is requested by the ACPI subsystem before any drivers are probed, so they will always see IRQF_ONESHOT set for the interrupt in question. Fixes: 4451e8e8415e ("pinctrl: amd: Add IRQF_ONESHOT to the interrupt request") Reported-by: Francisco Ayala Le Brun <francisco@videowindow.eu> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Cc: 6.8+ <stable@vger.kernel.org> # 6.8+ Closes: https://lore.kernel.org/lkml/CAN-StX1HqWqi+YW=t+V52-38Mfp5fAz7YHx4aH-CQjgyNiKx3g@mail.gmail.com/ Link: https://lore.kernel.org/r/12417336.O9o76ZdvQC@kreacher
2024-03-25workqueue: Cleanup subsys attribute registrationDan Williams
While reviewing users of subsys_virtual_register() I noticed that wq_sysfs_init() ignores the @groups argument. This looks like a historical artifact as the original wq_subsys only had one attribute to register. On the way to building up an @groups argument to pass to subsys_virtual_register() a few more cleanups fell out: * Use DEVICE_ATTR_RO() and DEVICE_ATTR_RW() for cpumask_{isolated,requested} and cpumask respectively. Rename the @show and @store methods accordingly. * Co-locate the attribute definition with the methods. This required moving wq_unbound_cpumask_show down next to wq_unbound_cpumask_store (renamed to cpumask_show() and cpumask_store()) * Use ATTRIBUTE_GROUPS() to skip some boilerplate declarations Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-03-25workqueue: Use list_last_entry() to get the last idle workerLai Jiangshan
It is clearer than open code. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-03-25workqueue: Move attrs->cpumask out of worker_pool's properties when ↵Lai Jiangshan
attrs->affn_strict Allow more pools can be shared when attrs->affn_strict. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-03-25workqueue: Use INIT_WORK_ONSTACK in workqueue_softirq_dead()Lai Jiangshan
dead_work is a stack variable. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-03-25Merge tag 'v6.9-p2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fixes from Herbert Xu: "This fixes a regression that broke iwd as well as a divide by zero in iaa" * tag 'v6.9-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: iaa - Fix nr_cpus < nr_iaa case Revert "crypto: pkcs7 - remove sha1 support"
2024-03-25workqueue: Allow cancel_work_sync() and disable_work() from atomic contexts ↵Tejun Heo
on BH work items Now that work_grab_pending() can always grab the PENDING bit without sleeping, the only thing that prevents allowing cancel_work_sync() of a BH work item from an atomic context is the flushing of the in-flight instance. When we're flushing a BH work item for cancel_work_sync(), we know that the work item is not queued and must be executing in a BH context, which means that it's safe to busy-wait for its completion from a non-hardirq atomic context. This patch updates __flush_work() so that it busy-waits when flushing a BH work item for cancel_work_sync(). might_sleep() is pushed from start_flush_work() to its callers - when operating on a BH work item, __cancel_work_sync() now enforces !in_hardirq() instead of might_sleep(). This allows cancel_work_sync() and disable_work() to be called from non-hardirq atomic contexts on BH work items. v3: In __flush_work(), test WORK_OFFQ_BH to tell whether a work item being canceled can be busy waited instead of making start_flush_work() return the pool. (Lai) v2: Lai pointed out that __flush_work() was accessing pool->flags outside the RCU critical section protecting the pool pointer. Fix it by testing and remembering the result inside the RCU critical section. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-03-25workqueue: Remember whether a work item was on a BH workqueueTejun Heo
Add an off-queue flag, WORK_OFFQ_BH, that indicates whether the last workqueue the work item was on was a BH one. This will be used to test whether a work item is BH in cancel_sync path to implement atomic cancel_sync'ing for BH work items. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-03-25workqueue: Remove WORK_OFFQ_CANCELINGTejun Heo
cancel[_delayed]_work_sync() guarantees that it can shut down self-requeueing work items. To achieve that, it grabs and then holds WORK_STRUCT_PENDING bit set while flushing the currently executing instance. As the PENDING bit is set, all queueing attempts including the self-requeueing ones fail and once the currently executing instance is flushed, the work item should be idle as long as someone else isn't actively queueing it. This means that the cancel_work_sync path may hold the PENDING bit set while flushing the target work item. This isn't a problem for the queueing path - it can just fail which is the desired effect. It doesn't affect flush. It doesn't matter to cancel_work either as it can just report that the work item has successfully canceled. However, if there's another cancel_work_sync attempt on the work item, it can't simply fail or report success and that would breach the guarantee that it should provide. cancel_work_sync has to wait for and grab that PENDING bit and go through the motions. WORK_OFFQ_CANCELING and wq_cancel_waitq are what implement this cancel_work_sync to cancel_work_sync wait mechanism. When a work item is being canceled, WORK_OFFQ_CANCELING is also set on it and other cancel_work_sync attempts wait on the bit to be cleared using the wait queue. While this works, it's an isolated wart which doesn't jive with the rest of flush and cancel mechanisms and forces enable_work() and disable_work() to require a sleepable context, which hampers their usability. Now that a work item can be disabled, we can use that to block queueing while cancel_work_sync is in progress. Instead of holding PENDING the bit, it can temporarily disable the work item, flush and then re-enable it as that'd achieve the same end result of blocking queueings while canceling and thus enable canceling of self-requeueing work items. - WORK_OFFQ_CANCELING and the surrounding mechanims are removed. - work_grab_pending() is now simpler, no longer has to wait for a blocking operation and thus can be called from any context. - With work_grab_pending() simplified, no need to use try_to_grab_pending() directly. All users are converted to use work_grab_pending(). - __cancel_work_sync() is updated to __cancel_work() with WORK_CANCEL_DISABLE to cancel and plug racing queueing attempts. It then flushes and re-enables the work item if necessary. - These changes allow disable_work() and enable_work() to be called from any context. v2: Lai pointed out that mod_delayed_work_on() needs to check the disable count before queueing the delayed work item. Added clear_pending_if_disabled() call. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-03-25workqueue: Implement disable/enable for (delayed) work itemsTejun Heo
While (delayed) work items could be flushed and canceled, there was no way to prevent them from being queued in the future. While this didn't lead to functional deficiencies, it sometimes required a bit more effort from the workqueue users to e.g. sequence shutdown steps with more care. Workqueue is currently in the process of replacing tasklet which does support disabling and enabling. The feature is used relatively widely to, for example, temporarily suppress main path while a control plane operation (reset or config change) is in progress. To enable easy conversion of tasklet users and as it seems like an inherent useful feature, this patch implements disabling and enabling of work items. - A work item carries 16bit disable count in work->data while not queued. The access to the count is synchronized by the PENDING bit like all other parts of work->data. - If the count is non-zero, the work item cannot be queued. Any attempt to queue the work item fails and returns %false. - disable_work[_sync](), enable_work(), disable_delayed_work[_sync]() and enable_delayed_work() are added. v3: enable_work() was using local_irq_enable() instead of local_irq_restore() to undo IRQ-disable by work_grab_pending(). This is awkward now and will become incorrect as enable_work() will later be used from IRQ context too. (Lai) v2: Lai noticed that queue_work_node() wasn't checking the disable count. Fixed. queue_rcu_work() is updated to trigger warning if the inner work item is disabled. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-03-25workqueue: Preserve OFFQ bits in cancel[_sync] pathsTejun Heo
The cancel[_sync] paths acquire and release WORK_STRUCT_PENDING, and manipulate WORK_OFFQ_CANCELING. However, they assume that all the OFFQ bit values except for the pool ID are statically known and don't preserve them, which is not wrong in the current code as the pool ID and CANCELING are the only information carried. However, the planned disable/enable support will add more fields and need them to be preserved. This patch updates work data handling so that only the bits which need updating are updated. - struct work_offq_data is added along with work_offqd_unpack() and work_offqd_pack_flags() to help manipulating multiple fields contained in work->data. Note that the helpers look a bit silly right now as there isn't that much to pack. The next patch will add more. - mark_work_canceling() which is used only by __cancel_work_sync() is replaced by open-coded usage of work_offq_data and set_work_pool_and_keep_pending() in __cancel_work_sync(). - __cancel_work[_sync]() uses offq_data helpers to preserve other OFFQ bits when clearing WORK_STRUCT_PENDING and WORK_OFFQ_CANCELING at the end. - This removes all users of get_work_pool_id() which is dropped. Note that get_work_pool_id() could handle both WORK_STRUCT_PWQ and !WORK_STRUCT_PWQ cases; however, it was only being called after try_to_grab_pending() succeeded, in which case WORK_STRUCT_PWQ is never set and thus it's safe to use work_offqd_unpack() instead. No behavior changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-03-25bpf: Avoid get_kernel_nofault() to fetch kprobe entry IPAndrii Nakryiko
get_kernel_nofault() (or, rather, underlying copy_from_kernel_nofault()) is not free and it does pop up in performance profiles when kprobes are heavily utilized with CONFIG_X86_KERNEL_IBT=y config. Let's avoid using it if we know that fentry_ip - 4 can't cross page boundary. We do that by masking lowest 12 bits and checking if they are Another benefit (and actually what caused a closer look at this part of code) is that now LBR record is (typically) not wasted on copy_from_kernel_nofault() call and code, which helps tools like retsnoop that grab LBR records from inside BPF code in kretprobes. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/bpf/20240319212013.1046779-1-andrii@kernel.org
2024-03-25sched/fair: Don't double balance_interval for migrate_misfitQais Yousef
It is not necessarily an indication of the system being busy and requires a backoff of the load balancer activities. But pushing it high could mean generally delaying other misfit activities or other type of imbalances. Also don't pollute nr_balance_failed because of misfit failures. The value is used for enabling cache hot migration and in migrate_util/load types. None of which should be impacted (skewed) by misfit failures. Signed-off-by: Qais Yousef <qyousef@layalina.io> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240324004552.999936-5-qyousef@layalina.io
2024-03-25sched/topology: Remove root_domain::max_cpu_capacityQais Yousef
The value is no longer used as we now keep track of max_allowed_capacity for each task instead. Signed-off-by: Qais Yousef <qyousef@layalina.io> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240324004552.999936-4-qyousef@layalina.io
2024-03-25sched/fair: Check if a task has a fitting CPU when updating misfitQais Yousef
If a misfit task is affined to a subset of the possible CPUs, we need to verify that one of these CPUs can fit it. Otherwise the load balancer code will continuously trigger needlessly leading the balance_interval to increase in return and eventually end up with a situation where real imbalances take a long time to address because of this impossible imbalance situation. This can happen in Android world where it's common for background tasks to be restricted to little cores. Similarly if we can't fit the biggest core, triggering misfit is pointless as it is the best we can ever get on this system. To be able to detect that; we use asym_cap_list to iterate through capacities in the system to see if the task is able to run at a higher capacity level based on its p->cpus_ptr. We do that when the affinity change, a fair task is forked, or when a task switched to fair policy. We store the max_allowed_capacity in task_struct to allow for cheap comparison in the fast path. Improve check_misfit_status() function by removing redundant checks. misfit_task_load will be 0 if the task can't move to a bigger CPU. And nohz_balancer_kick() already checks for cpu_check_capacity() before calling check_misfit_status(). Test: ===== Add trace_printk("balance_interval = %lu\n", interval) in get_sd_balance_interval(). run if [ "$MASK" != "0" ]; then adb shell "taskset -a $MASK cat /dev/zero > /dev/null" fi sleep 10 // parse ftrace buffer counting the occurrence of each valaue Where MASK is either: * 0: no busy task running * 1: busy task is pinned to 1 cpu; handled today to not cause misfit * f: busy task pinned to little cores, simulates busy background task, demonstrates the problem to be fixed Results: ======== Note how occurrence of balance_interval = 128 overshoots for MASK = f. BEFORE ------ MASK=0 1 balance_interval = 175 120 balance_interval = 128 846 balance_interval = 64 55 balance_interval = 63 215 balance_interval = 32 2 balance_interval = 31 2 balance_interval = 16 4 balance_interval = 8 1870 balance_interval = 4 65 balance_interval = 2 MASK=1 27 balance_interval = 175 37 balance_interval = 127 840 balance_interval = 64 167 balance_interval = 63 449 balance_interval = 32 84 balance_interval = 31 304 balance_interval = 16 1156 balance_interval = 8 2781 balance_interval = 4 428 balance_interval = 2 MASK=f 1 balance_interval = 175 1328 balance_interval = 128 44 balance_interval = 64 101 balance_interval = 63 25 balance_interval = 32 5 balance_interval = 31 23 balance_interval = 16 23 balance_interval = 8 4306 balance_interval = 4 177 balance_interval = 2 AFTER ----- Note how the high values almost disappear for all MASK values. The system has background tasks that could trigger the problem without simulate it even with MASK=0. MASK=0 103 balance_interval = 63 19 balance_interval = 31 194 balance_interval = 8 4827 balance_interval = 4 179 balance_interval = 2 MASK=1 131 balance_interval = 63 1 balance_interval = 31 87 balance_interval = 8 3600 balance_interval = 4 7 balance_interval = 2 MASK=f 8 balance_interval = 127 182 balance_interval = 63 3 balance_interval = 31 9 balance_interval = 16 415 balance_interval = 8 3415 balance_interval = 4 21 balance_interval = 2 Signed-off-by: Qais Yousef <qyousef@layalina.io> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240324004552.999936-3-qyousef@layalina.io
2024-03-25sched/topology: Export asym_cap_listQais Yousef
So that we can use it to iterate through available capacities in the system. Sort asym_cap_list in descending order as expected users are likely to be interested on the highest capacity first. Make the list RCU protected to allow for cheap access in hot paths. Signed-off-by: Qais Yousef <qyousef@layalina.io> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240324004552.999936-2-qyousef@layalina.io
2024-03-25Merge tag 'v6.9-rc1' into sched/core, to pick up fixes and to refresh the branchIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-03-25tracing: probes: Fix to zero initialize a local variableMasami Hiramatsu (Google)
Fix to initialize 'val' local variable with zero. Dan reported that Smatch static code checker reports an error that a local 'val' variable needs to be initialized. Actually, the 'val' is expected to be initialized by FETCH_OP_ARG in the same loop, but it is not obvious. So initialize it with zero. Link: https://lore.kernel.org/all/171092223833.237219.17304490075697026697.stgit@devnote2/ Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/all/b010488e-68aa-407c-add0-3e059254aaa0@moroto.mountain/ Fixes: 25f00e40ce79 ("tracing/probes: Support $argN in return probe (kprobe and fprobe)") Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-03-24Merge tag 'dma-mapping-6.9-2024-03-24' of ↵Linus Torvalds
git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping fixes from Christoph Hellwig: "This has a set of swiotlb alignment fixes for sometimes very long standing bugs from Will. We've been discussion them for a while and they should be solid now" * tag 'dma-mapping-6.9-2024-03-24' of git://git.infradead.org/users/hch/dma-mapping: swiotlb: Reinstate page-alignment for mappings >= PAGE_SIZE iommu/dma: Force swiotlb_max_mapping_size on an untrusted device swiotlb: Fix alignment checks when both allocation and DMA masks are present swiotlb: Honour dma_alloc_coherent() alignment in swiotlb_alloc() swiotlb: Enforce page alignment in swiotlb_alloc() swiotlb: Fix double-allocation of slots due to broken alignment handling
2024-03-23Merge tag 'timers-urgent-2024-03-23' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: "Two regression fixes for the timer and timer migration code: - Prevent endless timer requeuing which is caused by two CPUs racing out of idle. This happens when the last CPU goes idle and therefore has to ensure to expire the pending global timers and some other CPU come out of idle at the same time and the other CPU wins the race and expires the global queue. This causes the last CPU to chase ghost timers forever and reprogramming it's clockevent device endlessly. Cure this by re-evaluating the wakeup time unconditionally. - The split into local (pinned) and global timers in the timer wheel caused a regression for NOHZ full as it broke the idle tracking of global timers. On NOHZ full this prevents an self IPI being sent which in turn causes the timer to be not programmed and not being expired on time. Restore the idle tracking for the global timer base so that the self IPI condition for NOHZ full is working correctly again" * tag 'timers-urgent-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers: Fix removed self-IPI on global timer's enqueue in nohz_full timers/migration: Fix endless timer requeue after idle interrupts
2024-03-23Merge tag 'core-entry-2024-03-23' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core entry fix from Thomas Gleixner: "A single fix for the generic entry code: The trace_sys_enter() tracepoint can modify the syscall number via kprobes or BPF in pt_regs, but that requires that the syscall number is re-evaluted from pt_regs after the tracepoint. A seccomp fix in that area removed the re-evaluation so the change does not take effect as the code just uses the locally cached number. Restore the original behaviour by re-evaluating the syscall number after the tracepoint" * tag 'core-entry-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: entry: Respect changes to system call number by trace_sys_enter()
2024-03-22bpf: verifier: reject addr_space_cast insn without arenaPuranjay Mohan
The verifier allows using the addr_space_cast instruction in a program that doesn't have an associated arena. This was caught in the form an invalid memory access in do_misc_fixups() when while converting addr_space_cast to a normal 32-bit mov, env->prog->aux->arena was dereferenced to check for BPF_F_NO_USER_CONV flag. Reject programs that include the addr_space_cast instruction but don't have an associated arena. root@rv-tester:~# ./reproducer Unable to handle kernel access to user memory without uaccess routines at virtual address 0000000000000030 Oops [#1] [<ffffffff8017eeaa>] do_misc_fixups+0x43c/0x1168 [<ffffffff801936d6>] bpf_check+0xda8/0x22b6 [<ffffffff80174b32>] bpf_prog_load+0x486/0x8dc [<ffffffff80176566>] __sys_bpf+0xbd8/0x214e [<ffffffff80177d14>] __riscv_sys_bpf+0x22/0x2a [<ffffffff80d2493a>] do_trap_ecall_u+0x102/0x17c [<ffffffff80d3048c>] ret_from_exception+0x0/0x64 Fixes: 6082b6c328b5 ("bpf: Recognize addr_space_cast instruction in the verifier.") Reported-by: xingwei lee <xrivendell7@gmail.com> Reported-by: yue sun <samsun1006219@gmail.com> Closes: https://lore.kernel.org/bpf/CABOYnLz09O1+2gGVJuCxd_24a-7UueXzV-Ff+Fr+h5EKFDiYCQ@mail.gmail.com/ Signed-off-by: Puranjay Mohan <puranjay12@gmail.com> Link: https://lore.kernel.org/r/20240322153518.11555-1-puranjay12@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-22bpf: verifier: fix addr_space_cast from as(1) to as(0)Puranjay Mohan
The verifier currently converts addr_space_cast from as(1) to as(0) that is: BPF_ALU64 | BPF_MOV | BPF_X with off=1 and imm=1 to BPF_ALU | BPF_MOV | BPF_X with imm=1 (32-bit mov) Because of this imm=1, the JITs that have bpf_jit_needs_zext() == true, interpret the converted instruction as BPF_ZEXT_REG(DST) which is a special form of mov32, used for doing explicit zero extension on dst. These JITs will just zero extend the dst reg and will not move the src to dst before the zext. Fix do_misc_fixups() to set imm=0 when converting addr_space_cast to a normal mov32. The JITs that have bpf_jit_needs_zext() == true rely on the verifier to emit zext instructions. Mark dst_reg as subreg when doing cast from as(1) to as(0) so the verifier emits a zext instruction after the mov. Fixes: 6082b6c328b5 ("bpf: Recognize addr_space_cast instruction in the verifier.") Signed-off-by: Puranjay Mohan <puranjay12@gmail.com> Link: https://lore.kernel.org/r/20240321153939.113996-1-puranjay12@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>