2025-01-16Merge branch 'support-eliding-map-lookup-nullness'Alexei Starovoitov
Daniel Xu says: ==================== Support eliding map lookup nullness This patch allows progs to elide a null check on statically known map lookup keys. In other words, if the verifier can statically prove that the lookup will be in-bounds, allow the prog to drop the null check. This is useful for two reasons: 1. Large numbers of nullness checks (especially when they cannot fail) unnecessarily push the prog towards BPF_COMPLEXITY_LIMIT_JMP_SEQ. 2. It forms a tighter contract between programmer and verifier. For (1), bpftrace is starting to make heavier use of percpu scratch maps. As a result, for user scripts with a large number of unrolled loops, we are starting to hit jump complexity verification errors. These percpu lookups cannot fail anyway, as we only use static key values. Eliding nullness probably results in less work for the verifier as well. For (2), percpu scratch maps are often used as a larger stack, as the current stack is limited to 512 bytes. In these situations, it is desirable for the programmer to express: "this lookup should never fail, and if it does, it means I messed up the code". By omitting the null check, the programmer can "ask" the verifier to double check the logic. === Changelog === Changes in v7: * Use more accurate frame number when marking precise * Add test for non-stack key * Test for marking stack slot precise Changes in v6: * Use is_spilled_scalar_reg() helper and remove unnecessary comment * Add back deleted selftest with different helper to dirty dst buffer * Check size of spill is exactly key_size and update selftests * Read slot_type from correct offset into the spi * Rewrite selftests in C where possible * Mark constant map keys as precise Changes in v5: * Dropped all acks * Use s64 instead of long for const_map_key * Ensure stack slot contains spilled reg before accessing spilled_ptr * Ensure spilled reg is a scalar before accessing tnum const value * Fix verifier selftest for 32-bit write to write at 8 byte alignment to ensure spill is tracked * Introduce more precise tracking of helper stack accesses * Do constant map key extraction as part of helper argument processing and then remove duplicated stack checks * Use ret_flag instead of regs[BPF_REG_0].type * Handle STACK_ZERO * Fix bug in bpf_load_hdr_opt() arg annotation Changes in v4: * Only allow for CAP_BPF * Add test for stack growing upwards * Improve comment about stack growing upwards Changes in v3: * Check if stack is (erroneously) growing upwards * Mention in commit message why existing tests needed change Changes in v2: * Added a check for when R2 is not a ptr to stack * Added a check for when stack is uninitialized (no stack slot yet) * Updated existing tests to account for null elision * Added test case for when R2 can be both const and non-const ==================== Link: https://patch.msgid.link/cover.1736886479.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-16bpf: selftests: verifier: Add nullness elision testsDaniel Xu
Test that nullness elision works for common use cases. For example, we want to check that both constant scalar spills and STACK_ZERO work, as well as cases where there are both const and non-const values of R2 leading up to a lookup, and obviously some bounds checks. Particularly tricky are spills either smaller or larger than the key size. For smaller, we need to ensure the verifier doesn't let through a potential read into unchecked bytes. For larger, endianness comes into play, as the native-endian value tracked in the verifier may not be the bytes the kernel would have read out of the key pointer. So check that we disallow both. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/f1dacaa777d4516a5476162e0ea549f7c3354d73.1736886479.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-16bpf: verifier: Support eliding map lookup nullnessDaniel Xu
This commit allows progs to elide a null check on statically known map lookup keys. In other words, if the verifier can statically prove that the lookup will be in-bounds, allow the prog to drop the null check. This is useful for two reasons: 1. Large numbers of nullness checks (especially when they cannot fail) unnecessarily push the prog towards BPF_COMPLEXITY_LIMIT_JMP_SEQ. 2. It forms a tighter contract between programmer and verifier. For (1), bpftrace is starting to make heavier use of percpu scratch maps. As a result, for user scripts with a large number of unrolled loops, we are starting to hit jump complexity verification errors. These percpu lookups cannot fail anyway, as we only use static key values. Eliding nullness probably results in less work for the verifier as well. For (2), percpu scratch maps are often used as a larger stack, as the current stack is limited to 512 bytes. In these situations, it is desirable for the programmer to express: "this lookup should never fail, and if it does, it means I messed up the code". By omitting the null check, the programmer can "ask" the verifier to double check the logic. Tests also have to be updated in sync with these changes, as the verifier is more efficient with this change. Notably, iters.c tests had to be changed to use a map type that still requires null checks, as they exercise verifier tracking logic w.r.t. iterators. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/68f3ea96ff3809a87e502a11a4bd30177fc5823e.1736886479.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
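A hedged sketch of the kind of program this enables (map, struct, and field names below are hypothetical, not taken from the patch): with a statically known, in-bounds key spilled on the stack, the verifier can now accept the dereference without an explicit NULL check.
```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct scratch {
	__u64 counter;
	char buf[4096];
};

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, struct scratch);
} scratch_map SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_openat")
int use_scratch(void *ctx)
{
	const __u32 key = 0;	/* statically known and in-bounds */
	struct scratch *s;

	s = bpf_map_lookup_elem(&scratch_map, &key);
	/* With nullness elision the usual `if (!s) return 0;` can be dropped:
	 * the verifier proves key 0 always exists in a 1-entry percpu array. */
	s->counter++;
	return 0;
}

char _license[] SEC("license") = "GPL";
```
Before this change, the same program would be rejected with a possible-NULL-pointer dereference error.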
2025-01-16bpf: verifier: Refactor helper access type trackingDaniel Xu
Previously, the verifier was treating all PTR_TO_STACK registers passed to a helper call as potentially written to by the helper. However, all calls to check_stack_range_initialized() already have precise access type information available. Rather than treat ACCESS_HELPER as a proxy for BPF_WRITE, pass enum bpf_access_type to check_stack_range_initialized() to more precisely track helper arguments. One benefit of this precision is that registers tracked as valid spills and passed as a read-only helper argument remain tracked after the call, rather than being marked STACK_MISC afterwards. An additional benefit is that the verifier logs are also more precise; for this particular error, users will enjoy a slightly clearer message. See the included selftest updates for examples. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/ff885c0e5859e0cd12077c3148ff0754cad4f7ed.1736886479.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-16bpf: tcp: Mark bpf_load_hdr_opt() arg2 as read-writeDaniel Xu
MEM_WRITE attribute is defined as: "Non-presence of MEM_WRITE means that MEM is only being read". bpf_load_hdr_opt() both reads and writes from its arg2 - void *search_res. This matters a lot for the next commit where we more precisely track stack accesses. Without this annotation, the verifier will make false assumptions about the contents of memory written to by helpers and possibly prune valid branches. Fixes: 6fad274f06f0 ("bpf: Add MEM_WRITE attribute") Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/730e45f8c39be2a5f3d8c4406cceca9d574cbf14.1736886479.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
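For context, a rough usage sketch (the option kind and program name are arbitrary, not from the patch): the buffer passed as arg2 is both read (the option kind to look for) and written (the option found), which is why the read-write annotation matters.
```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("sockops")
int parse_hdr_opt(struct bpf_sock_ops *skops)
{
	__u8 buf[8] = { 0xe7 };	/* kind to search for; helper writes result back */
	long err;

	err = bpf_load_hdr_opt(skops, buf, sizeof(buf), 0);
	if (err > 0) {
		/* buf now holds the located option bytes */
	}
	return 1;
}

char _license[] SEC("license") = "GPL";
```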
2025-01-16bpf: verifier: Add missing newline on verbose() callDaniel Xu
The print was missing a newline. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/59cbe18367b159cd470dc6d5c652524c1dc2b984.1736886479.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-16selftests/bpf: Add distilled BTF test about marking BTF_IS_EMBEDDEDPu Lehui
When redirecting the split BTF to the vmlinux base BTF, we need to mark the distilled base struct/union members of split BTF structs/unions in id_map with BTF_IS_EMBEDDED. This indicates that these types must match both name and size later. So if a needed composite type, which is a member of a composite type in the split BTF, has a different size in the base BTF we wish to relocate with, btf__relocate() should error out. Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250115100241.4171581-4-pulehui@huaweicloud.com
2025-01-16libbpf: Fix incorrect traversal end type ID when marking BTF_IS_EMBEDDEDPu Lehui
When redirecting the split BTF to the vmlinux base BTF, we need to mark the distilled base struct/union members of split BTF structs/unions in id_map with BTF_IS_EMBEDDED. This indicates that these types must match both name and size later. Therefore, we need to traverse the entire split BTF, which involves traversing type IDs from nr_dist_base_types to nr_types. However, the current implementation uses an incorrect traversal end type ID, so let's correct it. Fixes: 19e00c897d50 ("libbpf: Split BTF relocation") Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250115100241.4171581-3-pulehui@huaweicloud.com
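For reference, a hedged sketch of the libbpf API this code path serves (the input file name is a placeholder):
```c
#include <stdio.h>
#include <bpf/btf.h>

int relocate_split_btf(void)
{
	/* Placeholder inputs: a split BTF built against a distilled base,
	 * and the running kernel's vmlinux BTF to relocate onto. */
	struct btf *base, *split;
	int err = -1;

	base = btf__parse("/sys/kernel/btf/vmlinux", NULL);
	split = btf__parse("module_with_distilled_base.btf", NULL);
	if (!base || !split)
		goto out;

	/* Embedded struct/union members (BTF_IS_EMBEDDED) must match the
	 * base BTF by both name and size, otherwise this errors out. */
	err = btf__relocate(split, base);
	if (err)
		fprintf(stderr, "btf__relocate failed: %d\n", err);
out:
	btf__free(split);
	btf__free(base);
	return err;
}
```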
2025-01-16libbpf: Fix return zero when elf_begin failedPu Lehui
The error number from elf_begin() is dropped when it is wrapped by the btf_find_elf_sections() function, so zero is returned on failure. Fixes: c86f180ffc99 ("libbpf: Make btf_parse_elf process .BTF.base transparently") Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250115100241.4171581-2-pulehui@huaweicloud.com
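A minimal sketch of the pattern being fixed (function name and the chosen error code are placeholders, not the actual libbpf code): propagate a real error instead of silently returning zero when elf_begin() fails.
```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <libelf.h>

/* Illustrative only. */
static int open_elf(const char *path, Elf **elf_out)
{
	Elf *elf;
	int fd;

	(void)elf_version(EV_CURRENT);	/* libelf requires this once */
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	elf = elf_begin(fd, ELF_C_READ, NULL);
	if (!elf) {
		fprintf(stderr, "elf_begin failed: %s\n", elf_errmsg(elf_errno()));
		close(fd);
		return -EINVAL;	/* previously the error was dropped and 0 returned */
	}
	*elf_out = elf;
	return 0;
}
```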
2025-01-16selftests/bpf: Fix btf leak on new btf alloc failure in btf_distill testPu Lehui
Fix btf leak on new btf alloc failure in btf_distill test. Fixes: affdeb50616b ("selftests/bpf: Extend distilled BTF tests to cover BTF relocation") Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250115100241.4171581-1-pulehui@huaweicloud.com
2025-01-16veristat: Load struct_ops programs only onceEduard Zingerman
libbpf automatically adjusts autoload for struct_ops programs, see libbpf.c:bpf_object_adjust_struct_ops_autoload. For example, if there is a map: SEC(".struct_ops.link") struct sched_ext_ops ops = { .enqueue = foo, .tick = bar, }; Both 'foo' and 'bar' would be loaded if 'ops' autocreate is true, both 'foo' and 'bar' would be skipped if 'ops' autocreate is false. This means that when veristat processes an object file with 'ops', it would load 4 programs in total: two programs for each 'process_prog' call. The adjustment occurs at object load time, and libbpf remembers the association between 'ops' and 'foo'/'bar' at object open time. The only way to persuade libbpf to load only one of the two is to adjust the map's initial value, such that only one program is referenced. This patch does exactly that, significantly reducing the time to process object files with a big number of struct_ops programs. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250115223835.919989-1-eddyz87@gmail.com
2025-01-16selftests/bpf: Fix undefined UINT_MAX in veristat.cTony Ambardar
Include <limits.h> in 'veristat.c' to provide a UINT_MAX definition and avoid multiple compile errors against mips64el/musl-libc: veristat.c: In function 'max_verifier_log_size': veristat.c:1135:36: error: 'UINT_MAX' undeclared (first use in this function) 1135 | const int SMALL_LOG_SIZE = UINT_MAX >> 8; | ^~~~~~~~ veristat.c:24:1: note: 'UINT_MAX' is defined in header '<limits.h>'; did you forget to '#include <limits.h>'? 23 | #include <math.h> +++ |+#include <limits.h> 24 | Fixes: 1f7c33630724 ("selftests/bpf: Increase verifier log limit in veristat") Signed-off-by: Tony Ambardar <tony.ambardar@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250116075036.3459898-1-tony.ambardar@gmail.com
2025-01-15bpf: Send signals asynchronously if !preemptiblePuranjay Mohan
BPF programs can execute in all kinds of contexts and when a program running in a non-preemptible context uses the bpf_send_signal() kfunc, it will cause issues because this kfunc can sleep. Change `irqs_disabled()` to `!preemptible()`. Reported-by: syzbot+97da3d7e0112d59971de@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67486b09.050a0220.253251.0084.GAE@google.com/ Fixes: 1bc7896e9ef4 ("bpf: Fix deadlock with rq_lock in bpf_send_signal()") Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250115103647.38487-1-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
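A hedged, condition-only sketch of the change (helper names below are hypothetical; this is not the kernel code): defer delivery whenever the context cannot be preempted, not just when IRQs are disabled.
```c
	/* kernel context, illustrative pseudocode only */
	if (!preemptible()) {				/* was: if (irqs_disabled()) */
		queue_send_signal_irq_work(work);	/* hypothetical: defer via irq_work */
		return 0;
	}
	return send_signal_now(work);			/* hypothetical synchronous path */
```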
2025-01-15selftests/bpf: Fix test_xdp_adjust_tail_grow2 selftest on powerpcSaket Kumar Bhaskar
On powerpc cache line size is 128 bytes, so skb_shared_info must be aligned accordingly. Signed-off-by: Saket Kumar Bhaskar <skb99@linux.ibm.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20250110103109.3670793-1-skb99@linux.ibm.com
2025-01-10Merge branch 'selftests-bpf-migrate-test_xdp_redirect-sh-to-test_progs'Martin KaFai Lau
Bastien Curutchet says: ==================== This patch series continues the work to migrate the *.sh tests into prog_tests. test_xdp_redirect.sh tests the XDP redirections done through bpf_redirect(). These XDP redirections are already tested by prog_tests/xdp_do_redirect.c but IMO it doesn't cover the exact same code path because xdp_do_redirect.c uses bpf_prog_test_run_opts() to trigger redirections of 'fake packets' while test_xdp_redirect.sh redirects packets coming from the network. Also, the test_xdp_redirect.sh script tests the redirections with both SKB and DRV modes while xdp_do_redirect.c only tests the DRV mode. The patch series adds two new test cases in prog_tests/xdp_do_redirect.c to replace the test_xdp_redirect.sh script. ==================== Link: https://patch.msgid.link/20250110-xdp_redirect-v2-0-b8f3ae53e894@bootlin.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2025-01-10selftests/bpf: Migrate test_xdp_redirect.c to test_xdp_do_redirect.cBastien Curutchet (eBPF Foundation)
prog_tests/xdp_do_redirect.c is the only user of the BPF programs located in progs/test_xdp_do_redirect.c and progs/test_xdp_redirect.c. There is no need to keep both files with such close names. Move test_xdp_redirect.c contents to test_xdp_do_redirect.c and remove progs/test_xdp_redirect.c Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20250110-xdp_redirect-v2-3-b8f3ae53e894@bootlin.com
2025-01-10selftests/bpf: Migrate test_xdp_redirect.sh to xdp_do_redirect.cBastien Curutchet (eBPF Foundation)
test_xdp_redirect.sh can't be used by the BPF CI. Migrate test_xdp_redirect.sh into a new test case in xdp_do_redirect.c. It uses the same network topology and the same BPF programs located in progs/test_xdp_redirect.c and progs/xdp_dummy.c. Remove test_xdp_redirect.sh and its Makefile entry. Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20250110-xdp_redirect-v2-2-b8f3ae53e894@bootlin.com
2025-01-10selftests/bpf: test_xdp_redirect: Rename BPF sectionsBastien Curutchet (eBPF Foundation)
SEC("redirect_to_111") and SEC("redirect_to_222") can't be loaded by the __load() helper. Rename both sections SEC("xdp") so it can be interpreted by the __load() helper in upcoming patch. Update the test_xdp_redirect.sh to use the program name instead of the section name to load the BPF program. Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet@bootlin.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com> Link: https://patch.msgid.link/20250110-xdp_redirect-v2-1-b8f3ae53e894@bootlin.com
2025-01-10veristat: Document verifier log dumping capabilityDaniel Xu
`-vl2` is a useful combination of flags to dump the entire verification log. This is helpful when making changes to the verifier, as you can see what it thinks of the program one instruction at a time. This was more or less a hidden feature before. Document it so others can discover it. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/d57bbcca81e06ae8dcdadaedb99a48dced67e422.1736466129.git.dxu@dxuuu.xyz
2025-01-10bpftool: Fix control flow graph segfault during edge creationChristoph Werle
If the last instruction of a control flow graph building block is a BPF_CALL, an incorrect edge with e->dst set to NULL is created and results in a segfault during graph output. Ensure that BPF_CALL as last instruction of a building block is handled correctly and only generates a single edge unlike actual BPF_JUMP* instructions. Signed-off-by: Christoph Werle <christoph.werle@longjmp.de> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Quentin Monnet <qmo@kernel.org> Reviewed-by: Quentin Monnet <qmo@kernel.org> Link: https://lore.kernel.org/bpf/20250108220937.1470029-1-christoph.werle@longjmp.de
2025-01-10selftests/bpf: Add a test for kprobe multi with unique_matchYonghong Song
Add a kprobe multi subtest to test kprobe multi unique_match option. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250109174028.3368967-1-yonghong.song@linux.dev
2025-01-10libbpf: Add unique_match option for multi kprobeYonghong Song
Jordan reported an issue in the Meta production environment where the func try_to_wake_up() is renamed to try_to_wake_up.llvm.<hash>() by the clang compiler in lto mode. The original 'kprobe/try_to_wake_up' does not work any more since try_to_wake_up() does not match the actual func name in /proc/kallsyms. There are a couple of ways to resolve this issue. For example, in attach_kprobe(), we could do a lookup in /proc/kallsyms so try_to_wake_up() can be replaced by try_to_wake_up.llvm.<hash>(). Or we can force users to use bpf_program__attach_kprobe() where they need to look up /proc/kallsyms to find out try_to_wake_up.llvm.<hash>(). But these two approaches require extra work by either libbpf or the user. Luckily, as suggested by Andrii, multi kprobe already supports the wildcard ('*') for symbol matching. In the above example, 'try_to_wake_up*' can match try_to_wake_up() or try_to_wake_up.llvm.<hash>(), and this allows the bpf prog to work for different kernels, as some kernels may have try_to_wake_up() and some others may have try_to_wake_up.llvm.<hash>(). The original intention is to kprobe try_to_wake_up() only, so an optional field unique_match is added to struct bpf_kprobe_multi_opts. If the field is set to true, the number of matched functions must be one. Otherwise, the attachment will fail. In the above case, multi kprobe with 'try_to_wake_up*' and unique_match preserves the user functionality. Reported-by: Jordan Rome <linux@jordanrome.com> Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250109174023.3368432-1-yonghong.song@linux.dev
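A hedged usage sketch of the new option (the program variable is a placeholder):
```c
#include <bpf/libbpf.h>

/* Attach a multi-kprobe with a wildcard pattern but require exactly one
 * kallsyms match, so an over-broad glob fails loudly instead of silently
 * attaching to several symbols. */
static struct bpf_link *attach_unique(struct bpf_program *prog)
{
	LIBBPF_OPTS(bpf_kprobe_multi_opts, opts,
		.unique_match = true,	/* new: exactly one matched function */
	);

	return bpf_program__attach_kprobe_multi_opts(prog, "try_to_wake_up*",
						     &opts);
}
```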
2025-01-08Merge branch 'bpf-reduce-the-use-of-migrate_-disable-enable'Alexei Starovoitov
Hou Tao says: ==================== The use of the migrate_{disable|enable} pair in BPF is mainly due to the introduction of the bpf memory allocator and the use of per-CPU data structs in its internal implementation. The caller needs to disable migration before invoking the alloc or free APIs of the bpf memory allocator, and enable migration after the invocation. The main users of the bpf memory allocator are various kinds of bpf maps in which the map values or the special fields in the map values are allocated by using the bpf memory allocator. At present, the running context for a bpf program has already disabled migration explicitly or implicitly, therefore, when these maps are manipulated in a bpf program, it is OK to not invoke the migrate_disable() and migrate_enable() pair. However, it is not always the case when these maps are manipulated through the bpf syscall, therefore many migrate_{disable|enable} pairs are added when the map can either be manipulated by a BPF program or the BPF syscall. The initial idea of reducing the use of migrate_{disable|enable} comes from Alexei [1]. I turned it into a patch set that achieves the goals through the following three methods: 1. Remove the unnecessary migrate_{disable|enable} pair: when the BPF syscall path also disables migration, it is OK to remove the pair. Patch #1~#3 fall into this category, while patch #4~#5 are partially included. 2. Move the migrate_{disable|enable} pair from the inner callee to the outer caller: instead of invoking migrate_disable() in the inner callee, invoke migrate_disable() in the outer caller to simplify reasoning about when migrate_disable() is needed. Patch #4~#5 and patch #6~#19 belong to this category. 3. Add a cant_migrate() check in the inner callee to ensure the guarantee that migration is disabled is not broken. Patch #1~#5, #13, #16~#19 also belong to this category. Please check the individual patches for more details. Comments are always welcome. Change Log: v2: * squash the ->map_free related patches (#10~#12, #15) into one patch * remove unnecessary cant_migrate() checks. v1: https://lore.kernel.org/bpf/20250106081900.1665573-1-houtao@huaweicloud.com ==================== Link: https://patch.msgid.link/20250108010728.207536-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08bpf: Remove migrate_{disable|enable} from bpf_selem_free()Hou Tao
bpf_selem_free() has the following three callers: (1) bpf_local_storage_update It will be invoked through ->map_update_elem syscall or helpers for storage map. Migration has already been disabled in these running contexts. (2) bpf_sk_storage_clone It has already disabled migration before invoking bpf_selem_free(). (3) bpf_selem_free_list bpf_selem_free_list() has three callers: bpf_selem_unlink_storage(), bpf_local_storage_update() and bpf_local_storage_destroy(). The callers of bpf_selem_unlink_storage() includes: storage map ->map_delete_elem syscall, storage map delete helpers and bpf_local_storage_map_free(). These contexts have already disabled migration when invoking bpf_selem_unlink() which invokes bpf_selem_unlink_storage() and bpf_selem_free_list() correspondingly. bpf_local_storage_update() has been analyzed as the first caller above. bpf_local_storage_destroy() is invoked when freeing the local storage for the kernel object. Now cgroup, task, inode and sock storage have already disabled migration before invoking bpf_local_storage_destroy(). After the analyses above, it is safe to remove migrate_{disable|enable} from bpf_selem_free(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-17-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08bpf: Remove migrate_{disable|enable} from bpf_local_storage_free()Hou Tao
bpf_local_storage_free() has three callers: 1) bpf_local_storage_alloc() Its caller must have disabled migration. 2) bpf_local_storage_destroy() Its four callers (bpf_{cgrp|inode|task|sk}_storage_free()) have already invoked migrate_disable() before invoking bpf_local_storage_destroy(). 3) bpf_selem_unlink() Its callers include: cgrp/inode/task/sk storage ->map_delete_elem callbacks, bpf_{cgrp|inode|task|sk}_storage_delete() helpers and bpf_local_storage_map_free(). All of these callers have already disabled migration before invoking bpf_selem_unlink(). Therefore, it is OK to remove migrate_{disable|enable} pair from bpf_local_storage_free(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-16-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08bpf: Remove migrate_{disable|enable} from bpf_local_storage_alloc()Hou Tao
The two callers of bpf_local_storage_alloc() are the same as those of bpf_selem_alloc(): bpf_sk_storage_clone() and bpf_local_storage_update(). The running contexts of these two callers have already disabled migration; therefore, there is no need to add an extra migrate_{disable|enable} pair in bpf_local_storage_alloc(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-15-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08bpf: Remove migrate_{disable|enable} from bpf_selem_alloc()Hou Tao
bpf_selem_alloc() has two callers: (1) bpf_sk_storage_clone_elem() bpf_sk_storage_clone() has already disabled migration before invoking bpf_sk_storage_clone_elem(). (2) bpf_local_storage_update() Its callers include: cgrp/task/inode/sock storage ->map_update_elem() callbacks and bpf_{cgrp|task|inode|sk}_storage_get() helpers. These running contexts have already disabled migration. Therefore, there is no need to add an extra migrate_{disable|enable} pair in bpf_selem_alloc(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-14-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08bpf: Remove migrate_{disable,enable} in bpf_cpumask_release()Hou Tao
When BPF program invokes bpf_cpumask_release(), the migration must have been disabled. When bpf_cpumask_release_dtor() invokes bpf_cpumask_release(), the caller bpf_obj_free_fields() also has disabled migration, therefore, it is OK to remove the unnecessary migrate_{disable|enable} pair in bpf_cpumask_release(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-13-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08bpf: Remove migrate_{disable|enable} in bpf_obj_free_fields()Hou Tao
The callers of bpf_obj_free_fields() have already guaranteed that migration is disabled; therefore, there is no need to invoke the migrate_{disable,enable} pair in bpf_obj_free_fields()'s underlying implementation. This patch removes unnecessary migrate_{disable|enable} pairs from bpf_obj_free_fields() and its callees: bpf_list_head_free() and bpf_rb_root_free(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-12-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08bpf: Disable migration before calling ops->map_free()Hou Tao
The freeing of all map elements may invoke bpf_obj_free_fields() to free the special fields in the map value. Since these special fields may be allocated from bpf memory allocator, migrate_{disable|enable} pairs are necessary for the freeing of these special fields. To simplify reasoning about when migrate_disable() is needed for the freeing of these special fields, let the caller guarantee migration is disabled before invoking bpf_obj_free_fields(). Therefore, disable migration before calling ops->map_free() to simplify the freeing of map values or special fields allocated from bpf memory allocator. After disabling migration in bpf_map_free(), there is no need for additional migrate_{disable|enable} pairs in these ->map_free() callbacks. Remove these redundant invocations. The migrate_{disable|enable} pairs in the underlying implementation of bpf_obj_free_fields() will be removed by the following patch. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-11-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
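A simplified kernel-context sketch of the resulting pattern (headers omitted; not the exact kernel code):
```c
/* The generic free path disables migration once, so individual ->map_free()
 * callbacks and bpf_obj_free_fields() no longer need their own
 * migrate_{disable|enable} pairs. */
static void bpf_map_free_sketch(struct bpf_map *map)
{
	migrate_disable();
	map->ops->map_free(map);	/* may end up in bpf_obj_free_fields() */
	migrate_enable();
}
```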
2025-01-08bpf: Disable migration in bpf_selem_free_rcuHou Tao
bpf_selem_free_rcu() calls bpf_obj_free_fields() to free the special fields in map value (e.g., kptr). Since kptrs may be allocated from bpf memory allocator, migrate_{disable|enable} pairs are necessary for the freeing of these kptrs. To simplify reasoning about when migrate_disable() is needed for the freeing of these dynamically-allocated kptrs, let the caller guarantee migration is disabled before invoking bpf_obj_free_fields(). Therefore, the patch adds a migrate_{disable|enable} pair in bpf_selem_free_rcu(). The migrate_{disable|enable} pairs in the underlying implementation of bpf_obj_free_fields() will be removed by the following patch. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-10-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08bpf: Disable migration when cloning sock storageHou Tao
bpf_sk_storage_clone() will call bpf_selem_free() to free the clone element when the allocation of new sock storage fails. bpf_selem_free() will call check_and_free_fields() to free the special fields in the element. Since the allocated element is not visible to bpf syscall or bpf program when bpf_local_storage_alloc() fails, these special fields in the element must be all zero when invoking bpf_selem_free(). To be uniform with other callers of bpf_selem_free(), disable migration when cloning sock storage. Adding the migrate_{disable|enable} pair also benefits the potential switch from kzalloc to the bpf memory allocator for sock storage. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-9-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08bpf: Disable migration when destroying sock storageHou Tao
When destroying sock storage, it invokes bpf_local_storage_destroy() to remove all storage elements saved in the sock storage. The destroy procedure will call bpf_selem_free() to free the element, and bpf_selem_free() calls bpf_obj_free_fields() to free the special fields in map value (e.g., kptr). Since kptrs may be allocated from bpf memory allocator, migrate_{disable|enable} pairs are necessary for the freeing of these kptrs. To simplify reasoning about when migrate_disable() is needed for the freeing of these dynamically-allocated kptrs, let the caller guarantee migration is disabled before invoking bpf_obj_free_fields(). Therefore, the patch adds a migrate_{disable|enable} pair in bpf_sock_storage_free(). The migrate_{disable|enable} pairs in the underlying implementation of bpf_obj_free_fields() will be removed by the following patch. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-8-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08bpf: Disable migration when destroying inode storageHou Tao
When destroying inode storage, it invokes bpf_local_storage_destroy() to remove all storage elements saved in the inode storage. The destroy procedure will call bpf_selem_free() to free the element, and bpf_selem_free() calls bpf_obj_free_fields() to free the special fields in map value (e.g., kptr). Since kptrs may be allocated from bpf memory allocator, migrate_{disable|enable} pairs are necessary for the freeing of these kptrs. To simplify reasoning about when migrate_disable() is needed for the freeing of these dynamically-allocated kptrs, let the caller guarantee migration is disabled before invoking bpf_obj_free_fields(). Therefore, the patch adds a migrate_{disable|enable} pair in bpf_inode_storage_free(). The migrate_{disable|enable} pairs in the underlying implementation of bpf_obj_free_fields() will be removed by the following patch. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-7-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08bpf: Remove migrate_{disable|enable} from bpf_task_storage_lock helpersHou Tao
Three callers of bpf_task_storage_lock() are ->map_lookup_elem, ->map_update_elem, ->map_delete_elem from bpf syscall. BPF syscall for these three operations of task storage has already disabled migration. Another two callers are bpf_task_storage_get() and bpf_task_storage_delete() helpers which will be used by BPF program. Two callers of bpf_task_storage_trylock() are bpf_task_storage_get() and bpf_task_storage_delete() helpers. The running contexts of these helpers have already disabled migration. Therefore, it is safe to remove migrate_{disable|enable} from task storage lock helpers for these call sites. However, bpf_task_storage_free() also invokes bpf_task_storage_lock() and its running context doesn't disable migration, therefore, add the missing migrate_{disable|enable} pair in bpf_task_storage_free(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-6-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08bpf: Remove migrate_{disable|enable} from bpf_cgrp_storage_lock helpersHou Tao
Three callers of bpf_cgrp_storage_lock() are ->map_lookup_elem, ->map_update_elem, ->map_delete_elem from bpf syscall. BPF syscall for these three operations of cgrp storage has already disabled migration. Two call sites of bpf_cgrp_storage_trylock() are bpf_cgrp_storage_get(), and bpf_cgrp_storage_delete() helpers. The running contexts of these helpers have already disabled migration. Therefore, it is safe to remove migrate_disable() for these callers. However, bpf_cgrp_storage_free() also invokes bpf_cgrp_storage_lock() and its running context doesn't disable migration. Therefore, also add the missing migrate_{disable|enable} pair in bpf_cgrp_storage_free(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08bpf: Remove migrate_{disable|enable} in htab_elem_freeHou Tao
htab_elem_free() has two call sites: delete_all_elements(), which has already disabled migration, and free_htab_elem(), which is invoked by 4 other functions: __htab_map_lookup_and_delete_elem, __htab_map_lookup_and_delete_batch, htab_map_update_elem and htab_map_delete_elem. The BPF syscall has already disabled migration before invoking the ->map_update_elem, ->map_delete_elem, and ->map_lookup_and_delete_elem callbacks for hash map. __htab_map_lookup_and_delete_batch() also disables migration before invoking free_htab_elem(). ->map_update_elem() and ->map_delete_elem() of hash map may be invoked by a BPF program, and the running context of a BPF program has already disabled migration. Therefore, it is safe to remove the migrate_{disable|enable} pair in htab_elem_free(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08bpf: Remove migrate_{disable|enable} in ->map_for_each_callbackHou Tao
A BPF program may call bpf_for_each_map_elem(), which will call the ->map_for_each_callback callback of the related bpf map. Considering the running context of the bpf program has already disabled migration, remove the unnecessary migrate_{disable|enable} pair in the implementations of ->map_for_each_callback. To ensure the guarantee will not be violated later, also add a cant_migrate() check in the implementations. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
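A simplified kernel-context sketch of the assertion pattern (headers omitted; callback body elided):
```c
/* Instead of taking migrate_disable() itself, the callback implementation
 * now asserts the caller's guarantee. */
static long map_for_each_callback_sketch(struct bpf_map *map,
					 bpf_callback_t callback_fn,
					 void *callback_ctx, u64 flags)
{
	cant_migrate();	/* BPF program context has migration disabled already */
	/* ... iterate elements and invoke callback_fn ... */
	return 0;
}
```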
2025-01-08bpf: Remove migrate_{disable|enable} from LPM trieHou Tao
Both bpf programs and the bpf syscall may invoke the ->update or ->delete operation for the LPM trie. For a bpf program, its running context has already disabled migration explicitly (through migrate_disable()) or implicitly (through preempt_disable() or IRQ disabling). For the bpf syscall, migration is disabled through the use of bpf_disable_instrumentation() before invoking the corresponding map operation callback. Therefore, it is safe to remove the migrate_{disable|enable} pair from the LPM trie. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08selftests/bpf: Add kprobe session recursion check testJiri Olsa
Adding kprobe.session probe to bpf_kfunc_common_test that misses bpf program execution due to recursion check and making sure it increases the program missed count properly. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20250106175048.1443905-2-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08bpf: Return error for missed kprobe multi bpf program executionJiri Olsa
When a kprobe multi bpf program can't be executed due to the recursion check, we currently return 0 (success) to the fprobe layer, where it's ignored for standard kprobe multi probes. For a kprobe session, the success return value will make the fprobe layer install the return probe and try to execute it as well. But the return session probe should not get executed, because the entry part did not run. FWIW the return probe bpf program most likely won't get executed, because its recursion check will likely fail as well, but we don't need to run it in the first place; also, we can make this clear and obvious. It also affects missed counts for kprobe session program execution, which are now doubled (an extra count for the not-executed return probe). Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20250106175048.1443905-1-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08bpf: Move out synchronize_rcu_tasks_trace from mutex CSPu Lehui
Commit ef1b808e3b7c ("bpf: Fix UAF via mismatching bpf_prog/attachment RCU flavors") resolved a possible UAF issue in uprobes that attach a non-sleepable bpf prog by explicitly waiting for a tasks-trace-RCU grace period. But, in the current implementation, synchronize_rcu_tasks_trace is included within the mutex critical section, which increases the length of the critical section and may affect performance. So let's move synchronize_rcu_tasks_trace out of the mutex critical section. Signed-off-by: Pu Lehui <pulehui@huawei.com> Reviewed-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20250104013946.1111785-1-pulehui@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08bpf: Fix range_tree_set() error handlingSoma Nakata
range_tree_set() might fail and return -ENOMEM, causing subsequent `bpf_arena_alloc_pages` to fail. Add the error handling. Signed-off-by: Soma Nakata <soma.nakata@somane.sakura.ne.jp> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250106231536.52856-1-soma.nakata@somane.sakura.ne.jp Signed-off-by: Alexei Starovoitov <ast@kernel.org>
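A hedged sketch of the fix pattern (function and variable names are placeholders; not the actual arena code): check the return value and let -ENOMEM propagate instead of assuming success.
```c
/* kernel-context sketch, illustrative only */
static int mark_range_sketch(struct range_tree *rt, u32 start, u32 len)
{
	int err;

	err = range_tree_set(rt, start, len);	/* may return -ENOMEM */
	if (err)
		return err;			/* previously ignored */
	return 0;
}
```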
2025-01-08selftests/bpf: add -std=gnu11 to BPF_CFLAGS and CFLAGSIhor Solodrai
Latest versions of GCC BPF use C23 standard by default. This causes compilation errors in vmlinux.h due to bool types declarations. Add -std=gnu11 to BPF_CFLAGS and CFLAGS. This aligns with the version of the standard used when building the kernel currently [1]. For more details see the discussions at [2] and [3]. [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Makefile#n465 [2] https://lore.kernel.org/bpf/EYcXjcKDCJY7Yb0GGtAAb7nLKPEvrgWdvWpuNzXm2qi6rYMZDixKv5KwfVVMBq17V55xyC-A1wIjrqG3aw-Imqudo9q9X7D7nLU2gWgbN0w=@pm.me/ [3] https://lore.kernel.org/bpf/20250106202715.1232864-1-ihor.solodrai@pm.me/ CC: Jose E. Marchesi <jose.marchesi@oracle.com> Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me> Link: https://lore.kernel.org/r/20250107235813.2964472-1-ihor.solodrai@pm.me Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-06selftests/bpf: Handle prog/attach type comparison in veristatMykyta Yatsenko
Implemented handling of prog type and attach type stats comparison in veristat. To test this change: ``` ./veristat pyperf600.bpf.o -o csv > base1.csv ./veristat pyperf600.bpf.o -o csv > base2.csv ./veristat -C base2.csv base1.csv -o csv ...,raw_tracepoint,raw_tracepoint,MATCH, ...,cgroup_inet_ingress,cgroup_inet_ingress,MATCH ``` Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20250106144321.32337-1-mykyta.yatsenko5@gmail.com
2025-01-06selftests/bpf: add -fno-strict-aliasing to BPF_CFLAGSIhor Solodrai
Following the discussion at [1], set -fno-strict-aliasing flag for all BPF object build rules. Remove now unnecessary <test>-CFLAGS variables. [1] https://lore.kernel.org/bpf/20250106185447.951609-1-ihor.solodrai@pm.me/ CC: Jose E. Marchesi <jose.marchesi@oracle.com> Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250106201728.1219791-1-ihor.solodrai@pm.me Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-06Merge branch 'bpf-allow-bpf_for-bpf_repeat-while-holding-spin'Alexei Starovoitov
Emil Tsalapatis says: ==================== In BPF programs, kfunc calls while holding a lock are not allowed because kfuncs may sleep by default. The exception to this rule are the functions in special_kfunc_list, which are guaranteed to not sleep. The bpf_iter_num_* functions used by the bpf_for and bpf_repeat macros make no function calls themselves, and as such are guaranteed to not sleep. Add them to special_kfunc_list to allow them within BPF spinlock critical sections. Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com> ==================== Link: https://patch.msgid.link/20250104202528.882482-1-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-06selftests/bpf: test bpf_for within spin lock sectionEmil Tsalapatis
Add a selftest to ensure BPF for loops within critical sections are accepted by the verifier. Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250104202528.882482-3-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
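A hedged sketch of the kind of program the test covers, assuming the selftests build environment (vmlinux.h plus the selftests' bpf_experimental.h, which provides bpf_for); the map layout and names are illustrative:
```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include "bpf_experimental.h"

struct val {
	int sum;
	struct bpf_spin_lock lock;
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, int);
	__type(value, struct val);
} vals SEC(".maps");

SEC("tc")
int bpf_for_under_lock(struct __sk_buff *skb)
{
	int key = 0, i;
	struct val *v = bpf_map_lookup_elem(&vals, &key);

	if (!v)
		return 0;
	bpf_spin_lock(&v->lock);
	bpf_for(i, 0, 8)	/* bpf_iter_num_* kfuncs, now allowed here */
		v->sum += i;
	bpf_spin_unlock(&v->lock);
	return 0;
}

char _license[] SEC("license") = "GPL";
```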
2025-01-06bpf: Allow bpf_for/bpf_repeat calls while holding a spinlockEmil Tsalapatis
Add the bpf_iter_num_* kfuncs called by bpf_for in special_kfunc_list, and allow the calls even while holding a spin lock. Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com> Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250104202528.882482-2-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-06bpf/tests: Add 32 bits only long conditional jump testsChristophe Leroy
Commit f1517eb790f9 ("bpf/tests: Expand branch conversion JIT test") introduced "Long conditional jump tests", but because those tests make use of 64-bit DIV and MOD, they don't get jited on powerpc/32, leading to the long conditional jump tests being skipped for an unrelated reason. Add 4 new tests that are restricted to 32-bit ALU so that the jump tests can also be performed on platforms that do not support 64-bit operations. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/609f87a2d84e032c8d9ccb9ba7aebef893698f1e.1736154762.git.christophe.leroy@csgroup.eu