Age  Commit message  Author
2025-05-13  bpf: Fix WARN() in get_bpf_raw_tp_regs  Tao Chen
syzkaller reported an issue:

WARNING: CPU: 3 PID: 5971 at kernel/trace/bpf_trace.c:1861 get_bpf_raw_tp_regs+0xa4/0x100 kernel/trace/bpf_trace.c:1861
Modules linked in:
CPU: 3 UID: 0 PID: 5971 Comm: syz-executor205 Not tainted 6.15.0-rc5-syzkaller-00038-g707df3375124 #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:get_bpf_raw_tp_regs+0xa4/0x100 kernel/trace/bpf_trace.c:1861
RSP: 0018:ffffc90003636fa8 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000003 RCX: ffffffff81c6bc4c
RDX: ffff888032efc880 RSI: ffffffff81c6bc83 RDI: 0000000000000005
RBP: ffff88806a730860 R08: 0000000000000005 R09: 0000000000000003
R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000004
R13: 0000000000000001 R14: ffffc90003637008 R15: 0000000000000900
FS: 0000000000000000(0000) GS:ffff8880d6cdf000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7baee09130 CR3: 0000000029f5a000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ____bpf_get_stack_raw_tp kernel/trace/bpf_trace.c:1934 [inline]
 bpf_get_stack_raw_tp+0x24/0x160 kernel/trace/bpf_trace.c:1931
 bpf_prog_ec3b2eefa702d8d3+0x43/0x47
 bpf_dispatcher_nop_func include/linux/bpf.h:1316 [inline]
 __bpf_prog_run include/linux/filter.h:718 [inline]
 bpf_prog_run include/linux/filter.h:725 [inline]
 __bpf_trace_run kernel/trace/bpf_trace.c:2363 [inline]
 bpf_trace_run3+0x23f/0x5a0 kernel/trace/bpf_trace.c:2405
 __bpf_trace_mmap_lock_acquire_returned+0xfc/0x140 include/trace/events/mmap_lock.h:47
 __traceiter_mmap_lock_acquire_returned+0x79/0xc0 include/trace/events/mmap_lock.h:47
 __do_trace_mmap_lock_acquire_returned include/trace/events/mmap_lock.h:47 [inline]
 trace_mmap_lock_acquire_returned include/trace/events/mmap_lock.h:47 [inline]
 __mmap_lock_do_trace_acquire_returned+0x138/0x1f0 mm/mmap_lock.c:35
 __mmap_lock_trace_acquire_returned include/linux/mmap_lock.h:36 [inline]
 mmap_read_trylock include/linux/mmap_lock.h:204 [inline]
 stack_map_get_build_id_offset+0x535/0x6f0 kernel/bpf/stackmap.c:157
 __bpf_get_stack+0x307/0xa10 kernel/bpf/stackmap.c:483
 ____bpf_get_stack kernel/bpf/stackmap.c:499 [inline]
 bpf_get_stack+0x32/0x40 kernel/bpf/stackmap.c:496
 ____bpf_get_stack_raw_tp kernel/trace/bpf_trace.c:1941 [inline]
 bpf_get_stack_raw_tp+0x124/0x160 kernel/trace/bpf_trace.c:1931
 bpf_prog_ec3b2eefa702d8d3+0x43/0x47

A tracepoint such as trace_mmap_lock_acquire_returned may cause a nested call, as the corner case above shows, and the nesting triggers the WARN_ON_ONCE. This case will be resolved with a more general method in the future. As Alexei suggested, remove the WARN_ON_ONCE first.

Fixes: 9594dc3c7e71 ("bpf: fix nested bpf tracepoints with per-cpu data")
Reported-by: syzbot+45b0c89a0fc7ae8dbadc@syzkaller.appspotmail.com
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250513042747.757042-1-chen.dylane@linux.dev
Closes: https://lore.kernel.org/bpf/8bc2554d-1052-4922-8832-e0078a033e1d@gmail.com
2025-05-13  docs: bpf: Fix bullet point formatting warning  Khaled Elnaggar
Fix indentation for a bullet list item in bpf_iterators.rst. According to reStructuredText rules, bullet list item bodies must be consistently indented relative to the bullet. The indentation of the first line after the bullet determines the alignment for the rest of the item body. Reported by smatch: /linux/Documentation/bpf/bpf_iterators.rst:55: WARNING: Bullet list ends without a blank line; unexpected unindent. [docutils] Fixes: 7220eabff8cb ("bpf, docs: document open-coded BPF iterators") Signed-off-by: Khaled Elnaggar <khaledelnaggarlinux@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250513015901.475207-1-khaledelnaggarlinux@gmail.com
2025-05-12  Merge branch 'introduce-kfuncs-for-memory-reads-into-dynptrs'  Alexei Starovoitov
Mykyta Yatsenko says:

====================
Introduce kfuncs for memory reads into dynptrs

From: Mykyta Yatsenko <yatsenko@meta.com>

This patch adds new kfuncs that enable reading variable-length user or kernel data directly into dynptrs. These kfuncs provide a way to perform dynamically-sized reads while maintaining memory safety. Unlike existing `bpf_probe_read_{user|kernel}` APIs, which are limited to constant-sized reads, these new kfuncs allow for more flexible data access.

v4 -> v5
 * Fix pointers annotations, use __user where necessary, cast where needed

v3 -> v4
 * Added pid filtering in selftests

v2 -> v3
 * Add KF_TRUSTED_ARGS for kfuncs that take pointer to task_struct as an argument
 * Remove checks for non-NULL task, where it was not necessary
 * Added comments on constants used in selftests, etc.

v1 -> v2
 * Renaming helper functions to use "user_str" instead of "user_data_str" suffix
====================

Link: https://patch.msgid.link/20250512205348.191079-1-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-12  selftests/bpf: introduce tests for dynptr copy kfuncs  Mykyta Yatsenko
Introduce selftests verifying the newly added dynptr copy kfuncs, covering dynptrs backed by both contiguous and non-contiguous memory. Disable test_probe_read_user_str_dynptr, which triggers a bug in strncpy_from_user_nofault; a patch fixing that issue is at [1].

[1] https://patchwork.kernel.org/project/linux-mm/patch/20250422131449.57177-1-mykyta.yatsenko5@gmail.com/

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20250512205348.191079-4-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-12  bpf: Implement dynptr copy kfuncs  Mykyta Yatsenko
This patch introduces a new set of kfuncs for working with dynptrs in BPF programs, enabling reading variable-length user or kernel data directly into a dynptr. To ensure memory safety, the verifier allows only constant-sized reads via the existing bpf_probe_read_{user|kernel} etc. helpers; the dynptr-based kfuncs allow dynamically-sized reads without memory-safety shortcomings.

The following kfuncs are introduced:
 * `bpf_probe_read_kernel_dynptr()`: probes kernel-space data into a dynptr
 * `bpf_probe_read_user_dynptr()`: probes user-space data into a dynptr
 * `bpf_probe_read_kernel_str_dynptr()`: probes kernel-space string into a dynptr
 * `bpf_probe_read_user_str_dynptr()`: probes user-space string into a dynptr
 * `bpf_copy_from_user_dynptr()`: sleepable, copies user-space data into a dynptr for the current task
 * `bpf_copy_from_user_str_dynptr()`: sleepable, copies user-space string into a dynptr for the current task
 * `bpf_copy_from_user_task_dynptr()`: sleepable, copies user-space data of the task into a dynptr
 * `bpf_copy_from_user_task_str_dynptr()`: sleepable, copies user-space string of the task into a dynptr

The implementation is built on two generic functions:
 * __bpf_dynptr_copy
 * __bpf_dynptr_copy_str

These functions take function pointers as arguments, enabling the copying of data from various sources, including both kernel and user space. Use __always_inline for the generic functions and callbacks to make sure the compiler doesn't generate indirect calls into the callbacks, which are more expensive, especially on some kernel configurations. Inlining allows the compiler to put direct calls into all the specific callback implementations (copy_user_data_sleepable, copy_user_data_nofault, and so on).

Reviewed-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20250512205348.191079-3-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
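The callback-plus-inlining design described in this commit can be sketched in plain userspace C. This is only an illustration under stated assumptions: `copy_fn_t`, `copy_plain()`, `generic_copy()` and `copy_from_plain_memory()` are hypothetical names, the chunk size of 4 is arbitrary, and `memcpy()` stands in for the kernel's source-specific readers. It is not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: copy len bytes from src to dst, 0 on success. */
typedef int (*copy_fn_t)(void *dst, const void *src, size_t len);

/* Stand-in for a source-specific reader (assumption: plain memcpy). */
static inline __attribute__((always_inline))
int copy_plain(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
    return 0;
}

/*
 * Generic chunked copy parameterised by a callback. Because both this
 * function and the callback are forced inline, a compiler that sees a
 * constant callback argument can emit a direct call instead of an
 * indirect one.
 */
static inline __attribute__((always_inline))
int generic_copy(void *dst, const void *src, size_t len, size_t chunk,
                 copy_fn_t fn)
{
    size_t off = 0;

    assert(chunk > 0);
    while (off < len) {
        size_t n = (len - off < chunk) ? len - off : chunk;
        int err = fn((char *)dst + off, (const char *)src + off, n);

        if (err)
            return err;
        off += n;
    }
    return 0;
}

/* Specific entry point: binds the generic function to one callback. */
int copy_from_plain_memory(void *dst, const void *src, size_t len)
{
    return generic_copy(dst, src, len, 4, copy_plain);
}
```

Each specific entry point passes a compile-time-constant callback, which is what gives the compiler the chance to resolve the call statically.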
2025-05-12  helpers: make few bpf helpers public  Mykyta Yatsenko
Make bpf_dynptr_slice_rdwr, bpf_dynptr_check_off_len and __bpf_dynptr_write available outside of helpers.c by adding their prototypes to include/linux/bpf.h. The bpf_dynptr_check_off_len() implementation is moved to the header and made explicitly inline, as small functions should typically be inlined. These functions are going to be used from bpf_trace.c in the next patch of this series.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20250512205348.191079-2-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-12  libbpf: Use proper errno value in nlattr  Anton Protopopov
Return value of the validate_nla() function can be propagated all the way up to users of libbpf API. In case of error this libbpf version of validate_nla returns -1 which will be seen as -EPERM from user's point of view. Instead, return a more reasonable -EINVAL. Fixes: bbf48c18ee0c ("libbpf: add error reporting in XDP") Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250510182011.2246631-1-a.s.protopopov@gmail.com
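Why a bare -1 reads as -EPERM can be shown with a small sketch. The functions `validate_old()` and `validate_new()` are hypothetical stand-ins, not libbpf code, and the sketch assumes a Linux-style errno table where EPERM is 1.

```c
#include <assert.h>
#include <errno.h>

/* Assumption: Linux-style errno numbering, where EPERM is 1. */
static_assert(EPERM == 1, "sketch assumes EPERM == 1");

/* Hypothetical validator with the old behaviour: bare -1 on failure. */
int validate_old(int ok)
{
    return ok ? 0 : -1;
}

/* Hypothetical validator with the new behaviour: a real errno value. */
int validate_new(int ok)
{
    return ok ? 0 : -EINVAL;
}
```

A caller that interprets negative returns as `-errno` cannot tell the old failure from a genuine permission error, whereas `-EINVAL` is unambiguous.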
2025-05-12  selftests/bpf: Allow skipping docs compilation  Mykyta Yatsenko
Currently rst2man is required to build bpf selftests, as the tool is used by Makefile.docs. rst2man may be missing in some build environments and is not essential for selftests, so it makes sense to allow the user to skip building docs. This patch adds a SKIP_DOCS variable to the bpf selftests Makefile that, when set to 1, skips building docs, for example:

  make -C tools/testing/selftests TARGETS=bpf SKIP_DOCS=1

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250510002450.365613-1-mykyta.yatsenko5@gmail.com
2025-05-12  Merge branch 'fix-verifier-test-failures-in-verbose-mode'  Alexei Starovoitov
Gregory Bell says:

====================
Fix verifier test failures in verbose mode

This patch series fixes two issues that cause false failures in the BPF verifier test suite when run with verbose output (`-v`).

The affected tests fail only when test_verifier is run in verbose mode, which leads to inconsistent results across verbose and non-verbose runs.

Patch 1 addresses an issue where the verbose flag (`-v`) unintentionally overrides the `opts.log_level`, leading to incorrect contents when checking bpf_vlog in tests with `expected_ret == VERBOSE_ACCEPT`. This occurs when running verbose with `-v` but not `-vv`.

Patch 2 increases the size of the `bpf_vlog[]` buffer to prevent truncation of large verifier logs, which was causing failures in several scale and 64-bit immediate tests.

Before patches:
  ./test_verifier | grep FAIL
  Summary: 790 PASSED, 0 SKIPPED, 0 FAILED
  ./test_verifier -v | grep FAIL
  Summary: 782 PASSED, 0 SKIPPED, 8 FAILED
  ./test_verifier -vv | grep FAIL
  Summary: 787 PASSED, 0 SKIPPED, 3 FAILED

After patches:
  ./test_verifier -v | grep FAIL
  Summary: 790 PASSED, 0 SKIPPED, 0 FAILED
  ./test_verifier -vv | grep FAIL
  Summary: 790 PASSED, 0 SKIPPED, 0 FAILED

These fixes improve test reliability and ensure consistent behavior across verbose and non-verbose runs.
====================

Tested-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://patch.msgid.link/cover.1747058195.git.grbell@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-12  selftests/bpf: test_verifier verbose log overflows  Gregory Bell
The following tests fail in verbose mode because bpf_vlog[] overflows:
 - 458/p ld_dw: xor semi-random 64-bit imms, test 5
 - 501/p scale: scale test 1
 - 502/p scale: scale test 2

These tests generate large verifier logs that exceed the current buffer size, causing them to fail to load. Increase the size of the bpf_vlog[] buffer to accommodate larger logs and prevent false failures during test runs with verbose output.

Signed-off-by: Gregory Bell <grbell@redhat.com>
Link: https://lore.kernel.org/r/e49267100f07f099a5877a3a5fc797b702bbaf0c.1747058195.git.grbell@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-12  selftests/bpf: test_verifier verbose causes erroneous failures  Gregory Bell
When running test_verifier with the -v flag and a test with `expected_ret==VERBOSE_ACCEPT`, the opts.log_level is unintentionally overwritten because the verbose flag takes precedence. This leads to a mismatch in the expected and actual contents of bpf_vlog, causing tests to fail incorrectly. Reorder the conditional logic that sets opts.log_level to preserve the expected log level and prevent it from being overridden by -v. Signed-off-by: Gregory Bell <grbell@redhat.com> Link: https://lore.kernel.org/r/182bf00474f817c99f968a9edb119882f62be0f8.1747058195.git.grbell@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-09  bpf, docs: document open-coded BPF iterators  Andrii Nakryiko
Extract BPF open-coded iterators documentation spread out across a few original commit messages ([0], [1]) into a dedicated doc section under Documentation/bpf/bpf_iterators.rst. Also make explicit expectation that BPF iterator program type should be accompanied by a corresponding open-coded BPF iterator implementation, going forward. [0] https://lore.kernel.org/all/20230308184121.1165081-3-andrii@kernel.org/ [1] https://lore.kernel.org/all/20230308184121.1165081-4-andrii@kernel.org/ Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20250509180350.2604946-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-09  Merge branch 'ktls-sockmap-fix-missing-uncharge-operation-and-add-selfttest'  Martin KaFai Lau
Jiayuan Chen says:

====================
ktls, sockmap: Fix missing uncharge operation and add selftest

Cong reported a warning when running ./test_sockmap:
https://lore.kernel.org/bpf/aAmIi0vlycHtbXeb@pop-os.localdomain/T/#t

------------[ cut here ]------------
WARNING: CPU: 1 PID: 40 at net/ipv4/af_inet.c inet_sock_destruct+0x173/0x1d5
Tainted: [W]=WARN
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
Workqueue: events sk_psock_destroy
RIP: 0010:inet_sock_destruct+0x173/0x1d5
RSP: 0018:ffff8880085cfc18 EFLAGS: 00010202
RAX: 1ffff11003dbfc00 RBX: ffff88801edfe3e8 RCX: ffffffff822f5af4
RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff88801edfe16c
RBP: ffff88801edfe184 R08: ffffed1003dbfc31 R09: 0000000000000000
R10: ffffffff822f5ab7 R11: ffff88801edfe187 R12: ffff88801edfdec0
R13: ffff888020376ac0 R14: ffff888020376ac0 R15: ffff888020376a60
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000556365155830 CR3: 000000001d6aa000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
 __sk_destruct+0x46/0x222
 sk_psock_destroy+0x22f/0x242
 process_one_work+0x504/0x8a8
 ? process_one_work+0x39d/0x8a8
 ? __pfx_process_one_work+0x10/0x10
 ? worker_thread+0x44/0x2ae
 ? __list_add_valid_or_report+0x83/0xea
 ? srso_return_thunk+0x5/0x5f
 ? __list_add+0x45/0x52
 process_scheduled_works+0x73/0x82
 worker_thread+0x1ce/0x2ae

When apply_bytes is specified, the msg is divided into multiple segments, each of length 'send'. Each time a segment is sent with tcp_bpf_sendmsg_redir(), sk_msg_return_zero() uncharges 'send' bytes of memory. However, if the first segment fails to send, for example because the peer's buffer is full, the whole msg must be released, and on that path the memory of the subsequent segments was never uncharged. This change makes no significant logical changes; it only fills in the missing uncharge calls.

The issue has always existed, but it was exposed only after the apply_bytes test was added to test_sockmap:
commit 3448ad23b34e ("selftests/bpf: Add apply_bytes test to test_txmsg_redir_wait_sndmem in test_sockmap")
====================

Link: https://patch.msgid.link/20250425060015.6968-1-jiayuan.chen@linux.dev
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2025-05-09  selftests/bpf: Add test to cover sockmap with ktls  Jiayuan Chen
The selftest reproduces an issue where the uncharge operation is missed when freeing a msg, which causes the following warning. We fixed the issue and added this reproducer to the selftests to ensure it does not happen again.

------------[ cut here ]------------
WARNING: CPU: 1 PID: 40 at net/ipv4/af_inet.c inet_sock_destruct+0x173/0x1d5
Tainted: [W]=WARN
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
Workqueue: events sk_psock_destroy
RIP: 0010:inet_sock_destruct+0x173/0x1d5
RSP: 0018:ffff8880085cfc18 EFLAGS: 00010202
RAX: 1ffff11003dbfc00 RBX: ffff88801edfe3e8 RCX: ffffffff822f5af4
RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff88801edfe16c
RBP: ffff88801edfe184 R08: ffffed1003dbfc31 R09: 0000000000000000
R10: ffffffff822f5ab7 R11: ffff88801edfe187 R12: ffff88801edfdec0
R13: ffff888020376ac0 R14: ffff888020376ac0 R15: ffff888020376a60
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000556365155830 CR3: 000000001d6aa000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
 __sk_destruct+0x46/0x222
 sk_psock_destroy+0x22f/0x242
 process_one_work+0x504/0x8a8
 ? process_one_work+0x39d/0x8a8
 ? __pfx_process_one_work+0x10/0x10
 ? worker_thread+0x44/0x2ae
 ? __list_add_valid_or_report+0x83/0xea
 ? srso_return_thunk+0x5/0x5f
 ? __list_add+0x45/0x52
 process_scheduled_works+0x73/0x82
 worker_thread+0x1ce/0x2ae

Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20250425060015.6968-3-jiayuan.chen@linux.dev
2025-05-09  ktls, sockmap: Fix missing uncharge operation  Jiayuan Chen
When apply_bytes is specified, the msg is divided into multiple segments, each of length 'send'. Each time a segment is sent with tcp_bpf_sendmsg_redir(), sk_msg_return_zero() uncharges 'send' bytes of memory. However, if the first segment fails to send, for example because the peer's buffer is full, the whole msg must be released, and on that path the memory of the subsequent segments was never uncharged. This change makes no significant logical changes; it only fills in the missing uncharge calls.

The issue has always existed, but it was exposed only after the apply_bytes test was added to test_sockmap:
commit 3448ad23b34e ("selftests/bpf: Add apply_bytes test to test_txmsg_redir_wait_sndmem in test_sockmap")

Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling")
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Closes: https://lore.kernel.org/bpf/aAmIi0vlycHtbXeb@pop-os.localdomain/T/#t
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Link: https://lore.kernel.org/r/20250425060015.6968-2-jiayuan.chen@linux.dev
2025-05-09  Merge branch 'bpf-retrieve-ref_ctr_offset-from-uprobe-perf-link'  Andrii Nakryiko
Jiri Olsa says: ==================== bpf: Retrieve ref_ctr_offset from uprobe perf link hi, adding ref_ctr_offset retrieval for uprobe perf link info. v2 changes: - display ref_ctr_offset as hex number [Andrii] - added acks thanks, jirka --- ==================== Link: https://patch.msgid.link/20250509153539.779599-1-jolsa@kernel.org Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2025-05-09  bpftool: Display ref_ctr_offset for uprobe link info  Jiri Olsa
Adding support to display ref_ctr_offset in link output, like:

  # bpftool link
  ...
  42: perf_event prog 174
          uprobe /proc/self/exe+0x102f13 cookie 3735928559 ref_ctr_offset 0x303a3fa
          bpf_cookie 3735928559
          pids test_progs(1820)

  # bpftool link -j | jq
  [
    ...
    {
      "id": 42,
      ...
      "ref_ctr_offset": 50500538,
    }
  ]

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250509153539.779599-4-jolsa@kernel.org
2025-05-09  selftests/bpf: Add link info test for ref_ctr_offset retrieval  Jiri Olsa
Adding link info test for ref_ctr_offset retrieval for both uprobe and uretprobe probes. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lore.kernel.org/bpf/20250509153539.779599-3-jolsa@kernel.org
2025-05-09  bpf: Add support to retrieve ref_ctr_offset for uprobe perf link  Jiri Olsa
Adding support to retrieve ref_ctr_offset for uprobe perf link, which got somehow omitted from the initial uprobe link info changes. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lore.kernel.org/bpf/20250509153539.779599-2-jolsa@kernel.org
2025-05-09  scripts/bpf_doc.py: implement json output format  Ihor Solodrai
bpf_doc.py parses the bpf.h header to collect information about various API elements (such as BPF helpers) and then dumps it in one of the supported formats: rst docs and a C header. It's useful for external tools to be able to consume this information in an easy-to-parse format such as JSON. Implement JSON printers and add a --json command line argument.

v3->v4: refactor attrs to only be a helper's field
v2->v3: nit cleanup
v1->v2: add json printer for syscall target

v3: https://lore.kernel.org/bpf/20250507203034.270428-1-isolodrai@meta.com/
v2: https://lore.kernel.org/bpf/20250507182802.3833349-1-isolodrai@meta.com/
v1: https://lore.kernel.org/bpf/20250506000605.497296-1-isolodrai@meta.com/

Signed-off-by: Ihor Solodrai <isolodrai@meta.com>
Tested-by: Quentin Monnet <qmo@kernel.org>
Reviewed-by: Quentin Monnet <qmo@kernel.org>
Link: https://lore.kernel.org/r/20250508203708.2520847-1-isolodrai@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-09  selftests/bpf: Fix caps for __xlated/jited_unpriv  Luis Gerhorst
Currently, __xlated_unpriv and __jited_unpriv do not work because the BPF syscall will overwrite info.jited_prog_len and info.xlated_prog_len with 0 if the process is not bpf_capable(). This bug was not noticed before, because there is no test that actually uses __xlated_unpriv/__jited_unpriv. To resolve this, simply restore the capabilities earlier (but still after loading the program). Adding this here unconditionally is fine because the function first checks that the capabilities were initialized before attempting to restore them. This will be important later when we add tests that check whether a speculation barrier was inserted in the correct location. Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de> Fixes: 9c9f73391310 ("selftests/bpf: allow checking xlated programs in verifier_* tests") Fixes: 7d743e4c759c ("selftests/bpf: __jited test tag to check disassembly after jit") Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Tested-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250501073603.1402960-2-luis.gerhorst@fau.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-09  Merge branch 'bpf-allow-some-trace-helpers-for-all-prog-types'  Andrii Nakryiko
Feng Yang says:

====================
bpf: Allow some trace helpers for all prog types

From: Feng Yang <yangfeng@kylinos.cn>

This series allows some trace helpers for all prog types. If a helper works under NMI and doesn't use any context-dependent things, it should be fine for any program type. The detailed discussion is in [1].

[1] https://lore.kernel.org/all/CAEf4Bza6gK3dsrTosk6k3oZgtHesNDSrDd8sdeQ-GiS6oJixQg@mail.gmail.com/
---
Changes in v3:
- cgroup_current_func_proto clean.
- bpf_scx_get_func_proto clean. Thanks, Andrii Nakryiko.
- Link to v2: https://lore.kernel.org/all/20250427063821.207263-1-yangfeng59949@163.com/

Changes in v2:
- not expose compat probe read APIs to more program types.
- Remove the prog->sleepable check added for copy_from_user, or the summarization_freplace/might_sleep_with_might_sleep test will fail with the error "program of this type cannot use helper bpf_copy_from_user"
- Link to v1: https://lore.kernel.org/all/20250425080032.327477-1-yangfeng59949@163.com/
====================

Link: https://patch.msgid.link/20250506061434.94277-1-yangfeng59949@163.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2025-05-09  sched_ext: Remove bpf_scx_get_func_proto  Feng Yang
task_storage_{get,delete} has been moved to bpf_base_func_proto. Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Signed-off-by: Feng Yang <yangfeng@kylinos.cn> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/bpf/20250506061434.94277-3-yangfeng59949@163.com
2025-05-09  bpf: Allow some trace helpers for all prog types  Feng Yang
If a helper works under NMI and doesn't use any context-dependent things, it should be fine for any program type. The detailed discussion is in [1].

[1] https://lore.kernel.org/all/CAEf4Bza6gK3dsrTosk6k3oZgtHesNDSrDd8sdeQ-GiS6oJixQg@mail.gmail.com/

Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Feng Yang <yangfeng@kylinos.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/bpf/20250506061434.94277-2-yangfeng59949@163.com
2025-05-09  Merge branch 'bpf-riscv64-support-load-acquire-and-store-release-instructions'  Alexei Starovoitov
Peilin Ye says:

====================
bpf, riscv64: Support load-acquire and store-release instructions

Hi all!

Patchset [1] introduced BPF load-acquire (BPF_LOAD_ACQ) and store-release (BPF_STORE_REL) instructions, and added x86-64 and arm64 JIT compiler support. As a follow-up, this v2 patchset supports load-acquire and store-release instructions for the riscv64 JIT compiler, and introduces some related selftests/ changes.

Specifically:

 * PATCH 1 makes insn_def_regno() handle load-acquires properly for bpf_jit_needs_zext() (true for riscv64) architectures
 * PATCH 2, 3 from Andrea Parri add the actual support to the riscv64 JIT compiler
 * PATCH 4 optimizes code emission by skipping redundant zext instructions inserted by the verifier
 * PATCH 5, 6 and 7 are minor selftest/ improvements
 * PATCH 8 enables (non-arena) load-acquire/store-release selftests for riscv64

v1: https://lore.kernel.org/bpf/cover.1745970908.git.yepeilin@google.com/

Changes since v1:

 * add Acked-by:, Reviewed-by: and Tested-by: tags from Lehui and Björn
 * simplify code logic in PATCH 1 (Lehui)
 * in PATCH 3, avoid changing 'return 0;' to 'return ret;' at the end of bpf_jit_emit_insn() (Lehui)

Please refer to individual patches for details. Thanks!

[1] https://lore.kernel.org/all/cover.1741049567.git.yepeilin@google.com/
====================

Link: https://patch.msgid.link/cover.1746588351.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-09  selftests/bpf: Enable non-arena load-acquire/store-release selftests for riscv64  Peilin Ye
For riscv64, enable all BPF_{LOAD_ACQ,STORE_REL} selftests except the arena_atomics/* ones (not guarded behind CAN_USE_LOAD_ACQ_STORE_REL), since arena access is not yet supported. Acked-by: Björn Töpel <bjorn@kernel.org> Reviewed-by: Pu Lehui <pulehui@huawei.com> Tested-by: Björn Töpel <bjorn@rivosinc.com> # QEMU/RVA23 Signed-off-by: Peilin Ye <yepeilin@google.com> Link: https://lore.kernel.org/r/9d878fa99a72626208a8eed3c04c4140caf77fda.1746588351.git.yepeilin@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-09  selftests/bpf: Verify zero-extension behavior in load-acquire tests  Peilin Ye
Verify that 8-, 16- and 32-bit load-acquires are zero-extending by using immediate values with their highest bit set. Do the same for the 64-bit variant to keep the style consistent. Acked-by: Björn Töpel <bjorn@kernel.org> Reviewed-by: Pu Lehui <pulehui@huawei.com> Tested-by: Björn Töpel <bjorn@rivosinc.com> # QEMU/RVA23 Signed-off-by: Peilin Ye <yepeilin@google.com> Link: https://lore.kernel.org/r/11097fd515f10308b3941469ee4c86cb8872db3f.1746588351.git.yepeilin@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
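Why the tests use values with the highest bit set can be illustrated in userspace C. This is a sketch with hypothetical helper names (`zero_extend8()`, `sign_extend8()`), not the selftest code: only a value with bit 7 set distinguishes zero-extension from sign-extension of an 8-bit load.

```c
#include <assert.h>
#include <stdint.h>

/* Zero-extension: upper 56 bits are cleared. */
uint64_t zero_extend8(uint8_t v)
{
    return (uint64_t)v;
}

/* Sign-extension, computed with well-defined unsigned arithmetic. */
uint64_t sign_extend8(uint8_t v)
{
    return ((uint64_t)v ^ 0x80) - 0x80;
}

/* With the top bit clear both agree; with it set they differ. */
void check_extension_example(void)
{
    assert(zero_extend8(0x7f) == sign_extend8(0x7f));
    assert(zero_extend8(0x80) == 0x80);
    assert(sign_extend8(0x80) == 0xffffffffffffff80ULL);
}
```

A test value such as 0x7f would pass under either behavior, so it could not catch a JIT that sign-extends by mistake.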
2025-05-09  selftests/bpf: Avoid passing out-of-range values to __retval()  Peilin Ye
Currently, we pass 0x1234567890abcdef to __retval() for the following two tests: verifier_load_acquire/load_acquire_64 verifier_store_release/store_release_64 However, the upper 32 bits of that value are being ignored, since __retval() expects an int. Actually, the tests would still pass even if I change '__retval(0x1234567890abcdef)' to e.g. '__retval(0x90abcdef)'. Restructure the tests a bit to test the entire 64-bit values properly. Do the same to their 8-, 16- and 32-bit variants as well to keep the style consistent. Fixes: ff3afe5da998 ("selftests/bpf: Add selftests for load-acquire and store-release instructions") Acked-by: Björn Töpel <bjorn@kernel.org> Reviewed-by: Pu Lehui <pulehui@huawei.com> Tested-by: Björn Töpel <bjorn@rivosinc.com> # QEMU/RVA23 Signed-off-by: Peilin Ye <yepeilin@google.com> Link: https://lore.kernel.org/r/d67f4c6f6ee0d0388cbce1f4892ec4176ee2d604.1746588351.git.yepeilin@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
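The truncation described in this commit can be reproduced with a short sketch. The functions `low32_matches()` and `full64_matches()` are hypothetical models of a 32-bit and a 64-bit comparison; they are not part of the selftest framework.

```c
#include <assert.h>
#include <stdint.h>

/* The low 32 bits of the test constant, checked at compile time. */
static_assert((uint32_t)0x1234567890abcdefULL == 0x90abcdefU,
              "low 32 bits of the 64-bit test constant");

/* Models a check that only sees an int-sized (32-bit) expected value. */
int low32_matches(uint64_t actual, uint64_t expected)
{
    return (uint32_t)actual == (uint32_t)expected;
}

/* Models a check that compares the entire 64-bit value. */
int full64_matches(uint64_t actual, uint64_t expected)
{
    return actual == expected;
}
```

Under the 32-bit model, 0x1234567890abcdef and 0x90abcdef are indistinguishable, which is why the old tests passed with either constant.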
2025-05-09  selftests/bpf: Use CAN_USE_LOAD_ACQ_STORE_REL when appropriate  Peilin Ye
Instead of open-coding the conditions, use '#ifdef CAN_USE_LOAD_ACQ_STORE_REL' to guard the following tests: verifier_precision/bpf_load_acquire verifier_precision/bpf_store_release verifier_store_release/* Note that, for the first two tests in verifier_precision.c, switching to '#ifdef CAN_USE_LOAD_ACQ_STORE_REL' means also checking if '__clang_major__ >= 18', which has already been guaranteed by the outer '#if' check. Acked-by: Björn Töpel <bjorn@kernel.org> Reviewed-by: Pu Lehui <pulehui@huawei.com> Tested-by: Björn Töpel <bjorn@rivosinc.com> # QEMU/RVA23 Signed-off-by: Peilin Ye <yepeilin@google.com> Link: https://lore.kernel.org/r/45d7e025f6e390a8ff36f08fc51e31705ac896bd.1746588351.git.yepeilin@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-09  bpf, riscv64: Skip redundant zext instruction after load-acquire  Peilin Ye
Currently, the verifier inserts a zext instruction right after every 8-, 16- or 32-bit load-acquire, which is already zero-extending. Skip such redundant zext instructions.

While we are here, update that already-obsolete comment about "skip the next instruction" in build_body(). Also change emit_atomic_rmw()'s parameters to keep it consistent with emit_atomic_ld_st().

Note that checking 'insn[1]' relies on 'insn' not being the last instruction, which should have been guaranteed by the verifier; we already use 'insn[1]' elsewhere in the file for similar purposes. Additionally, we don't check whether 'insn[1]' is actually a zext for our load-acquire's dst_reg or for some other register. In other words, here we are relying on the verifier to always insert a redundant zext right after an 8/16/32-bit load-acquire, for its dst_reg.

Acked-by: Björn Töpel <bjorn@kernel.org>
Reviewed-by: Pu Lehui <pulehui@huawei.com>
Tested-by: Björn Töpel <bjorn@rivosinc.com> # QEMU/RVA23
Signed-off-by: Peilin Ye <yepeilin@google.com>
Link: https://lore.kernel.org/r/10e90e0eab042f924d35ad0d1c1f7ca29f673152.1746588351.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-09  bpf, riscv64: Support load-acquire and store-release instructions  Andrea Parri
Support BPF load-acquire (BPF_LOAD_ACQ) and store-release (BPF_STORE_REL) instructions in the riscv64 JIT compiler. For example, consider the following 64-bit load-acquire (assuming little-endian):

  db 10 00 00 00 01 00 00  r1 = load_acquire((u64 *)(r1 + 0x0))
  95 00 00 00 00 00 00 00  exit

  opcode (0xdb): BPF_ATOMIC | BPF_DW | BPF_STX
  imm (0x00000100): BPF_LOAD_ACQ

The JIT compiler will emit an LD instruction followed by a FENCE R,RW instruction for the above, e.g.:

  ld x7,0(x6)
  fence r,rw

Similarly, consider the following 16-bit store-release:

  cb 21 00 00 10 01 00 00  store_release((u16 *)(r1 + 0x0), w2)
  95 00 00 00 00 00 00 00  exit

  opcode (0xcb): BPF_ATOMIC | BPF_H | BPF_STX
  imm (0x00000110): BPF_STORE_REL

A FENCE RW,W instruction followed by an SH instruction will be emitted, e.g.:

  fence rw,w
  sh x2,0(x4)

8-bit and 16-bit load-acquires are zero-extending (cf., LBU, LHU). The verifier always rejects misaligned load-acquires/store-releases (even if BPF_F_ANY_ALIGNMENT is set), so the emitted load and store instructions are guaranteed to be single-copy atomic.

Introduce primitives to emit the relevant (and the most common/used in the kernel) fences, i.e. fences with R -> RW, RW -> W and RW -> RW.

Rename emit_atomic() to emit_atomic_rmw() to make it clear that it only handles RMW atomics, and replace its is64 parameter to allow performing the required checks on the opsize (BPF_SIZE(code)).

Acked-by: Björn Töpel <bjorn@kernel.org>
Tested-by: Björn Töpel <bjorn@rivosinc.com> # QEMU/RVA23
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Co-developed-by: Peilin Ye <yepeilin@google.com>
Signed-off-by: Peilin Ye <yepeilin@google.com>
Reviewed-by: Pu Lehui <pulehui@huawei.com>
Link: https://lore.kernel.org/r/3059c560e537ad43ed19055d2ebbd970c698095a.1746588351.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
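The opcode and imm breakdown given in this commit can be checked by decoding the raw bytes. The sketch below is a userspace illustration: the constants are local copies of the values defined in the kernel's uapi BPF headers, and `insn_opcode()`, `insn_imm()` and `check_commit_examples()` are hypothetical helper names. It decodes only the opcode byte and the little-endian 32-bit imm field, not the register fields.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the uapi constants, so the sketch is self-contained. */
#define BPF_STX       0x03
#define BPF_H         0x08
#define BPF_DW        0x18
#define BPF_ATOMIC    0xc0
#define BPF_LOAD_ACQ  0x100
#define BPF_STORE_REL 0x110

/* Byte 0 of an 8-byte BPF instruction is the opcode. */
uint8_t insn_opcode(const uint8_t insn[8])
{
    return insn[0];
}

/* Bytes 4..7 hold the 32-bit imm field, little-endian as in the example. */
uint32_t insn_imm(const uint8_t insn[8])
{
    return (uint32_t)insn[4] | ((uint32_t)insn[5] << 8) |
           ((uint32_t)insn[6] << 16) | ((uint32_t)insn[7] << 24);
}

/* Decode the two instructions quoted in the commit message. */
void check_commit_examples(void)
{
    const uint8_t ld[8] = { 0xdb, 0x10, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00 };
    const uint8_t st[8] = { 0xcb, 0x21, 0x00, 0x00, 0x10, 0x01, 0x00, 0x00 };

    assert(insn_opcode(ld) == (BPF_ATOMIC | BPF_DW | BPF_STX));
    assert(insn_imm(ld) == BPF_LOAD_ACQ);
    assert(insn_opcode(st) == (BPF_ATOMIC | BPF_H | BPF_STX));
    assert(insn_imm(st) == BPF_STORE_REL);
}
```

The load-acquire and store-release variants share the BPF_ATOMIC | BPF_STX class and mode bits; the imm field is what tells them apart from each other and from read-modify-write atomics.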
2025-05-09bpf, riscv64: Introduce emit_load_*() and emit_store_*()Andrea Parri
We're planning to add support for the load-acquire and store-release BPF instructions. Define emit_load_<size>() and emit_store_<size>() to enable/facilitate the (re)use of their code. Acked-by: Björn Töpel <bjorn@kernel.org> Reviewed-by: Pu Lehui <pulehui@huawei.com> Tested-by: Björn Töpel <bjorn@rivosinc.com> # QEMU/RVA23 Tested-by: Peilin Ye <yepeilin@google.com> Signed-off-by: Andrea Parri <parri.andrea@gmail.com> [yepeilin@google.com: cosmetic change to commit title] Signed-off-by: Peilin Ye <yepeilin@google.com> Link: https://lore.kernel.org/r/fce89473a5748e1631d18a5917d953460d1ae0d0.1746588351.git.yepeilin@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-09bpf/verifier: Handle BPF_LOAD_ACQ instructions in insn_def_regno()Peilin Ye
In preparation for supporting BPF load-acquire and store-release instructions for architectures where bpf_jit_needs_zext() returns true (e.g. riscv64), make insn_def_regno() handle load-acquires properly. Acked-by: Björn Töpel <bjorn@kernel.org> Tested-by: Björn Töpel <bjorn@rivosinc.com> # QEMU/RVA23 Signed-off-by: Peilin Ye <yepeilin@google.com> Reviewed-by: Pu Lehui <pulehui@huawei.com> Link: https://lore.kernel.org/r/09cb2aec979aaed9d16db41f0f5b364de39377c0.1746588351.git.yepeilin@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-08bpftool: Fix cgroup command to only show cgroup bpf programsMartin KaFai Lau
The netkit program is not a cgroup bpf program and should not be shown in the output of the "bpftool cgroup show" command. However, if the netkit device happens to have ifindex 3, the "bpftool cgroup show" command will output the netkit bpf program as well: > ip -d link show dev nk1 3: nk1@if2: ... link/ether ... netkit mode ... > bpftool net show tc: nk1(3) netkit/peer tw_ns_nk2phy prog_id 469447 > bpftool cgroup show /sys/fs/cgroup/... ID AttachType AttachFlags Name ... ... ... 469447 netkit_peer tw_ns_nk2phy The reason is that the target_fd (which is the cgroup_fd here) and the target_ifindex are in a union in the uapi/linux/bpf.h. The bpftool iterates all values in "enum bpf_attach_type" which includes non cgroup attach types like netkit. The cgroup_fd is usually 3 here, so the bug is triggered when the netkit ifindex just happens to be 3 as well. The bpftool's cgroup.c already has a list of cgroup-only attach type defined in "cgroup_attach_types[]". This patch fixes it by iterating over "cgroup_attach_types[]" instead of "__MAX_BPF_ATTACH_TYPE". Cc: Quentin Monnet <qmo@kernel.org> Reported-by: Takshak Chahande <ctakshak@meta.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Quentin Monnet <qmo@kernel.org> Link: https://lore.kernel.org/r/20250507203232.1420762-1-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
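The aliasing behind this bug can be reproduced with a few lines of C. The struct below is a simplified, hypothetical stand-in for the query attribute (it is not the real uapi layout), and `ifindex_seen_by_kernel()` is a name made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in: the fd and the ifindex share the same storage,
 * as the commit message describes for uapi/linux/bpf.h. */
struct query_attr {
	union {
		uint32_t target_fd;
		uint32_t target_ifindex;
	};
	uint32_t attach_type;
};

/* Store a cgroup fd, then read the field back as an ifindex. */
uint32_t ifindex_seen_by_kernel(uint32_t cgroup_fd)
{
	struct query_attr attr = { .target_fd = cgroup_fd };

	assert(sizeof(attr.target_fd) == sizeof(attr.target_ifindex));
	return attr.target_ifindex;
}
```

With a cgroup fd of 3, a query for a netkit attach type reads the same storage as ifindex 3, so only the attach type decides how the value is interpreted. That is why restricting the loop to cgroup attach types avoids the false match.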
2025-05-06bpftool: Fix regression of "bpftool cgroup tree" EINVAL on older kernelsYiFei Zhu
If cgroup_has_attached_progs queries an attach type not supported by the running kernel, due to the kernel being older than the bpftool build, it would encounter an -EINVAL from BPF_PROG_QUERY syscall. Prior to commit 98b303c9bf05 ("bpftool: Query only cgroup-related attach types"), this EINVAL would be ignored by the function, allowing the function to only consider supported attach types. That commit changed the function so that, instead of querying all attach types, only attach types from the array `cgroup_attach_types` are queried. The assumption is that because these are only cgroup attach types, they should all be supported. Unfortunately this assumption may be false when the kernel is older than the bpftool build, where the attach types queried by bpftool are not yet implemented in the kernel. This would result in errors such as: $ bpftool cgroup tree CgroupPath ID AttachType AttachFlags Name Error: can't query bpf programs attached to /sys/fs/cgroup: Invalid argument This patch restores the logic of ignoring EINVAL from prior to that patch. Fixes: 98b303c9bf05 ("bpftool: Query only cgroup-related attach types") Reported-by: Sagarika Sharma <sharmasagarika@google.com> Reported-by: Minh-Anh Nguyen <minhanhdn@google.com> Signed-off-by: YiFei Zhu <zhuyifei@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Quentin Monnet <qmo@kernel.org> Link: https://lore.kernel.org/bpf/20250428211536.1651456-1-zhuyifei@google.com
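The restored behaviour amounts to treating -EINVAL as "this kernel does not know the attach type" rather than as a failure. A minimal sketch of that loop follows. It is not the bpftool code: `query_fn`, `has_attached_progs()` and `fake_old_kernel_query()` are hypothetical names, and the stub stands in for the real BPF_PROG_QUERY call.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical query callback: returns 0 and sets *count on success,
 * or a negative errno on failure. */
typedef int (*query_fn)(int attach_type, int *count);

/* Returns 1 if any attach type has programs, 0 if none do, or a
 * negative errno for errors other than -EINVAL. */
int has_attached_progs(const int *types, size_t n, query_fn query)
{
	assert(query != NULL);
	for (size_t i = 0; i < n; i++) {
		int count = 0;
		int err = query(types[i], &count);

		if (err == -EINVAL)
			continue;	/* attach type unknown to this kernel: skip it */
		if (err < 0)
			return err;
		if (count > 0)
			return 1;
	}
	return 0;
}

/* Hypothetical stub modelling an older kernel: type 1 is unsupported,
 * type 2 has one program attached, everything else has none. */
int fake_old_kernel_query(int attach_type, int *count)
{
	if (attach_type == 1)
		return -EINVAL;
	*count = (attach_type == 2) ? 1 : 0;
	return 0;
}
```

Calling `has_attached_progs()` with types `{0, 1, 2}` and the stub reaches type 2 despite the -EINVAL for type 1, whereas a loop that returned on the first error would stop with "Invalid argument".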
2025-05-06Merge branch 'bpf-support-bpf-rbtree-traversal-and-list-peeking'Alexei Starovoitov
Martin KaFai Lau says: ==================== bpf: Support bpf rbtree traversal and list peeking From: Martin KaFai Lau <martin.lau@kernel.org> The RFC v1 [1] showed a fq qdisc implementation in bpf that is much closer to the kernel sch_fq.c. The fq example and bpf qdisc changes are separated out from this set. This set is to focus on the kfunc and verifier changes that enable the bpf rbtree traversal and list peeking. v2: - Added tests to check that the return value of the bpf_rbtree_{root,left,right} and bpf_list_{front,back} is marked as a non_own_ref node pointer. (Kumar) - Added tests to ensure that the bpf_rbtree_{root,left,right} and bpf_list_{front,back} must be called after holding the spinlock. - Squashed the selftests adjustment to the corresponding verifier changes to avoid bisect failure. (Kumar) - Separated the bpf qdisc specific changes and fq selftest example from this set. [1]: https://lore.kernel.org/bpf/20250418224652.105998-1-martin.lau@linux.dev/ ==================== Link: https://patch.msgid.link/20250506015857.817950-1-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-06selftests/bpf: Add test for bpf_list_{front,back}Martin KaFai Lau
This patch adds the "list_peek" test to use the new bpf_list_{front,back} kfunc. The test_{front,back}* tests ensure that the return value is a non_own_ref node pointer and requires the spinlock to be held. Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> # check non_own_ref marking Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250506015857.817950-9-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-06bpf: Add bpf_list_{front,back} kfuncMartin KaFai Lau
In the kernel fq qdisc implementation, it only needs to look at the fields of the first node in a list but does not always need to remove it from the list. It is more convenient to have a peek kfunc for the list. It works similarly to bpf_rbtree_first(). This patch adds bpf_list_{front,back} kfunc. The verifier is changed such that the kfunc returning "struct bpf_list_node *" will be marked as non-owning. The exception is the KF_ACQUIRE kfunc. The net effect is that only the new bpf_list_{front,back} kfuncs will have their return pointers marked as non-owning. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250506015857.817950-8-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-06bpf: Simplify reg0 marking for the list kfuncs that return a bpf_list_node ↵Martin KaFai Lau
pointer The next patch will add bpf_list_{front,back} kfuncs to peek the head and tail of a list. Both of them will return a 'struct bpf_list_node *'. Following the earlier change for rbtree, this patch checks whether the return btf type is a 'struct bpf_list_node' pointer instead of checking each kfunc individually to decide if mark_reg_graph_node should be called. This will make the bpf_list_{front,back} kfunc addition easier in the later patch. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250506015857.817950-7-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-06selftests/bpf: Add tests for bpf_rbtree_{root,left,right}Martin KaFai Lau
This patch has a much simplified rbtree usage from the kernel sch_fq qdisc. It has a "struct node_data" which can be added to two different rbtrees which are ordered by different keys. The test first populates both rbtrees. It then searches for a lookup_key in the "groot0" rbtree. Once the lookup_key is found, that node's refcount is taken. The node is then removed from the other "groot1" rbtree. While searching for the lookup_key, the test will also try to remove all rbnodes in the path leading to the lookup_key. The test_{root,left,right}_spinlock_true tests ensure that the return value of the bpf_rbtree functions is a non_own_ref node pointer. This is done by forcing a verifier error by calling a helper bpf_jiffies64() while holding the spinlock. The tests then check for the verifier message "call bpf_rbtree...R0=rcu_ptr_or_null_node..." The other test_{root,left,right}_spinlock_false tests ensure that they must be called with spinlock held. Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> # Check non_own_ref marking Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250506015857.817950-6-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-06bpf: Allow refcounted bpf_rb_node used in bpf_rbtree_{remove,left,right}Martin KaFai Lau
The bpf_rbtree_{remove,left,right} requires the root's lock to be held. They also check the node_internal->owner is still owned by that root before proceeding, so it is safe to allow refcounted bpf_rb_node pointer to be used in these kfuncs. In a bpf fq implementation which is much closer to the kernel fq, https://lore.kernel.org/bpf/20250418224652.105998-13-martin.lau@linux.dev/, a networking flow (allocated by bpf_obj_new) can be added to two different rbtrees. There are cases that the flow is searched from one rbtree, held the refcount of the flow, and then removed from another rbtree: struct fq_flow { struct bpf_rb_node fq_node; struct bpf_rb_node rate_node; struct bpf_refcount refcount; unsigned long sk_long; }; int bpf_fq_enqueue(...) { /* ... */ bpf_spin_lock(&root->lock); while (can_loop) { /* ... */ if (!p) break; gc_f = bpf_rb_entry(p, struct fq_flow, fq_node); if (gc_f->sk_long == sk_long) { f = bpf_refcount_acquire(gc_f); break; } /* ... */ } bpf_spin_unlock(&root->lock); if (f) { bpf_spin_lock(&q->lock); bpf_rbtree_remove(&q->delayed, &f->rate_node); bpf_spin_unlock(&q->lock); } } bpf_rbtree_{left,right} do not need this change but are relaxed together with bpf_rbtree_remove instead of adding extra verifier logic to exclude these kfuncs. To avoid bi-sect failure, this patch also changes the selftests together. The "rbtree_api_remove_unadded_node" is not expecting verifier's error. The test now expects bpf_rbtree_remove(&groot, &m->node) to return NULL. The test uses __retval(0) to ensure this NULL return value. Some of the "only take non-owning..." failure messages are changed also. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250506015857.817950-5-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-06bpf: Add bpf_rbtree_{root,left,right} kfuncMartin KaFai Lau
In a bpf fq implementation that is much closer to the kernel fq, it will need to traverse the rbtree: https://lore.kernel.org/bpf/20250418224652.105998-13-martin.lau@linux.dev/ The much simplified logic that uses the bpf_rbtree_{root,left,right} to traverse the rbtree is like: struct fq_flow { struct bpf_rb_node fq_node; struct bpf_rb_node rate_node; struct bpf_refcount refcount; unsigned long sk_long; }; struct fq_flow_root { struct bpf_spin_lock lock; struct bpf_rb_root root __contains(fq_flow, fq_node); }; struct fq_flow *fq_classify(...) { struct bpf_rb_node *tofree[FQ_GC_MAX]; struct fq_flow_root *root; struct fq_flow *gc_f, *f; struct bpf_rb_node *p; int i, fcnt = 0; /* ... */ f = NULL; bpf_spin_lock(&root->lock); p = bpf_rbtree_root(&root->root); while (can_loop) { if (!p) break; gc_f = bpf_rb_entry(p, struct fq_flow, fq_node); if (gc_f->sk_long == sk_long) { f = bpf_refcount_acquire(gc_f); break; } /* To be removed from the rbtree */ if (fcnt < FQ_GC_MAX && fq_gc_candidate(gc_f, jiffies_now)) tofree[fcnt++] = p; if (gc_f->sk_long > sk_long) p = bpf_rbtree_left(&root->root, p); else p = bpf_rbtree_right(&root->root, p); } /* remove from the rbtree */ for (i = 0; i < fcnt; i++) { p = tofree[i]; tofree[i] = bpf_rbtree_remove(&root->root, p); } bpf_spin_unlock(&root->lock); /* bpf_obj_drop the fq_flow(s) that have just been removed * from the rbtree. */ for (i = 0; i < fcnt; i++) { p = tofree[i]; if (p) { gc_f = bpf_rb_entry(p, struct fq_flow, fq_node); bpf_obj_drop(gc_f); } } return f; } The above simplified code needs to traverse the rbtree for two purposes, 1) find the flow with the desired sk_long value 2) while searching for the sk_long, collect flows that are the fq_gc_candidate. They will be removed from the rbtree. This patch adds the bpf_rbtree_{root,left,right} kfunc to enable the rbtree traversal. The returned bpf_rb_node pointer will be a non-owning reference which is the same as the returned pointer of the existing bpf_rbtree_first kfunc. 
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250506015857.817950-4-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-06bpf: Simplify reg0 marking for the rbtree kfuncs that return a bpf_rb_node ↵Martin KaFai Lau
pointer The current rbtree kfunc, bpf_rbtree_{first, remove}, returns the bpf_rb_node pointer. The check_kfunc_call currently checks the kfunc btf_id instead of its return pointer type to decide if it needs to do mark_reg_graph_node(reg0) and ref_set_non_owning(reg0). The later patch will add bpf_rbtree_{root,left,right} that will also return a bpf_rb_node pointer. Instead of adding more kfunc btf_id checks to the "if" case, this patch changes the test to check the kfunc's return type. is_rbtree_node_type() function is added to test if a pointer type is a bpf_rb_node. The callers have already skipped the modifiers of the pointer type. A note on the ref_set_non_owning(), although bpf_rbtree_remove() also returns a bpf_rb_node pointer, the bpf_rbtree_remove() has the KF_ACQUIRE flag. Thus, its reg0 will not become non-owning. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250506015857.817950-3-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-06bpf: Check KF_bpf_rbtree_add_impl for the "case KF_ARG_PTR_TO_RB_NODE"Martin KaFai Lau
In a later patch, two new kfuncs will take the bpf_rb_node pointer arg. struct bpf_rb_node *bpf_rbtree_left(struct bpf_rb_root *root, struct bpf_rb_node *node); struct bpf_rb_node *bpf_rbtree_right(struct bpf_rb_root *root, struct bpf_rb_node *node); In the check_kfunc_call, there is a "case KF_ARG_PTR_TO_RB_NODE" to check if the reg->type should be an allocated pointer or should be a non_owning_ref. The later patch will need to ensure that the bpf_rb_node pointer passing to the new bpf_rbtree_{left,right} must be a non_owning_ref. This should be the same requirement as the existing bpf_rbtree_remove. This patch swaps the current "if else" statement. Instead of checking the bpf_rbtree_remove, it checks the bpf_rbtree_add. Then the new bpf_rbtree_{left,right} will fall into the "else" case to make the later patch simpler. bpf_rbtree_add should be the only one that needs an allocated pointer. This should be a no-op change considering there are only two kfunc(s) taking bpf_rb_node pointer arg, rbtree_add and rbtree_remove. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250506015857.817950-2-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-05libbpf: Improve BTF dedup handling of "identical" BTF typesAndrii Nakryiko
BTF dedup has a strong assumption that the compiler will deduplicate identical types within any given compilation unit (i.e., .c file). This property is used when establishing equivalence of two subgraphs of types. Unfortunately, this property doesn't always hold in practice. We've seen cases of having truly identical structs, unions, array definitions, and, most recently, even pointers to the same type being duplicated within a CU. Previously, we mitigated this on a case-by-case basis, adding a few simple heuristics for validating that two BTF types (having two different type IDs) are structurally the same. But this approach scales poorly, and we can have more weird cases come up in the future. So let's take a half-step back, and implement a bit more generic structural equivalence check, recursively. We still limit it to reasonable depth to avoid long reference loops. Depth-wise limiting of a potentially cyclical graph isn't great, but as mentioned below it doesn't seem to be detrimental performance-wise. We can always improve this in the future with per-type visited markers, if necessary. Performance-wise this doesn't seem to affect vmlinux BTF dedup, which makes sense because this logic kicks in not so frequently and only if we already established a canonical candidate type match, but suddenly find a different (but probably identical) type. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Link: https://lore.kernel.org/r/20250501235231.1339822-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
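The idea of a recursive, depth-limited structural equivalence check can be sketched on a toy type graph. This is not the libbpf implementation: `struct type`, `enum kind`, `MAX_MEMBERS` and `types_identical()` are hypothetical, and treating an exhausted depth budget as "not identical" is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy type graph, far simpler than real BTF. */
enum kind { KIND_INT, KIND_PTR, KIND_STRUCT };

#define MAX_MEMBERS 4

struct type {
	enum kind kind;
	const char *name;
	size_t nr_members;			/* a PTR uses members[0] as its target */
	const struct type *members[MAX_MEMBERS];
};

/* Returns 1 if a and b are structurally the same within 'depth' levels
 * of recursion, 0 otherwise. The depth budget bounds the walk even if
 * the graph contains reference cycles. */
int types_identical(const struct type *a, const struct type *b, int depth)
{
	assert(a != NULL && b != NULL);
	if (a == b)
		return 1;	/* same node: trivially identical */
	if (depth <= 0)
		return 0;	/* budget exhausted: give up conservatively */
	if (a->kind != b->kind || a->nr_members != b->nr_members)
		return 0;
	if (strcmp(a->name, b->name) != 0)
		return 0;
	for (size_t i = 0; i < a->nr_members; i++) {
		if (!types_identical(a->members[i], b->members[i], depth - 1))
			return 0;
	}
	return 1;
}
```

Two distinct pointer nodes that each point at their own copy of `int` compare as identical when the budget is large enough, which is the "duplicated within a CU" situation the commit message describes. Per-type visited markers, mentioned above as a possible future improvement, would replace the depth budget.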
2025-05-05bpf: Replace offsetof() with struct_size()Thorsten Blum
Compared to offsetof(), struct_size() provides additional compile-time checks for structs with flexible arrays (e.g., __must_be_array()). No functional changes intended. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250503151513.343931-2-thorsten.blum@linux.dev
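For readers unfamiliar with the two idioms, the following user-space sketch computes the allocation size of a struct with a flexible array member both ways. It does not use the kernel's struct_size() macro, which additionally performs compile-time type checks and saturates on overflow; `struct item_list` and both helper functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct ending in a flexible array member. */
struct item_list {
	uint32_t cnt;
	uint64_t items[];
};

/* struct_size()-style computation: header size plus count elements.
 * The assert is a crude stand-in for the kernel's overflow handling. */
size_t item_list_size(size_t count)
{
	assert(count <= (SIZE_MAX - sizeof(struct item_list)) / sizeof(uint64_t));
	return sizeof(struct item_list) + count * sizeof(uint64_t);
}

/* offsetof()-style computation: offset of the array plus count elements. */
size_t item_list_size_offsetof(size_t count)
{
	return offsetof(struct item_list, items) + count * sizeof(uint64_t);
}
```

For this particular layout the two functions return the same value, because the flexible array starts exactly at the end of the padded header. For structs with trailing padding before the end of the header the two can differ, so the equivalence has to be checked per struct.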
2025-05-05bpf: Fix uninitialized values in BPF_{CORE,PROBE}_READAnton Protopopov
With the latest LLVM, the bpf selftests build will fail with the following error message: progs/profiler.inc.h:710:31: error: default initialization of an object of type 'typeof ((parent_task)->real_cred->uid.val)' (aka 'const unsigned int') leaves the object uninitialized and is incompatible with C++ [-Werror,-Wdefault-const-init-unsafe] 710 | proc_exec_data->parent_uid = BPF_CORE_READ(parent_task, real_cred, uid.val); | ^ tools/testing/selftests/bpf/tools/include/bpf/bpf_core_read.h:520:35: note: expanded from macro 'BPF_CORE_READ' 520 | ___type((src), a, ##__VA_ARGS__) __r; \ | ^ This happens because BPF_CORE_READ (and other macros) declare the variable __r using the ___type macro which can inherit const modifier from intermediate types. Fix this by using __typeof_unqual__, when supported. (And when it is not supported, the problem shouldn't appear, as older compilers haven't complained.) Fixes: 792001f4f7aa ("libbpf: Add user-space variants of BPF_CORE_READ() family of macros") Fixes: a4b09a9ef945 ("libbpf: Add non-CO-RE variants of BPF_CORE_READ() macro family") Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250502193031.3522715-1-a.s.protopopov@gmail.com
2025-05-05selftests/bpf: Remove sockmap_ktls disconnect_after_delete testIhor Solodrai
"sockmap_ktls disconnect_after_delete" is effectively moot after disconnect has been disabled for TLS [1][2]. Remove the test completely. [1] https://lore.kernel.org/bpf/20250416170246.2438524-1-ihor.solodrai@linux.dev/ [2] https://lore.kernel.org/netdev/20250404180334.3224206-1-kuba@kernel.org/ Signed-off-by: Ihor Solodrai <isolodrai@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250502185221.1556192-1-isolodrai@meta.com
2025-05-01selftests/bpf: Add btf dedup test covering module BTF dedupAlan Maguire
Recently issues were observed with module BTF deduplication failures [1]. Add a dedup selftest that ensures that core kernel types are referenced from split BTF as base BTF types. To do this use bpf_testmod functions which utilize core kernel types, specifically ssize_t bpf_testmod_test_write(struct file *file, struct kobject *kobj, struct bin_attribute *bin_attr, char *buf, loff_t off, size_t len); __bpf_kfunc struct sock *bpf_kfunc_call_test3(struct sock *sk); __bpf_kfunc void bpf_kfunc_call_test_pass_ctx(struct __sk_buff *skb); For each of these ensure that the types they reference - struct file, struct kobject, struct bin_attribute etc - are in base BTF. Note that because bpf_testmod.ko is built with distilled base BTF the associated reference types - i.e. the PTR that points at a "struct file" - will be in split BTF. As a result the test resolves typedef and pointer references and verifies the pointed-at or typedef'ed type is in base BTF. Because we use BTF from /sys/kernel/btf/bpf_testmod relocation has occurred for the referenced types and they will be base - not distilled base - types. For large-scale dedup issues, we see such types appear in split BTF and as a result this test fails. Hence it is proposed as a test which will fail when large-scale dedup issues have occurred. [1] https://lore.kernel.org/dwarves/CAADnVQL+-LiJGXwxD3jEUrOonO-fX0SZC8496dVzUXvfkB7gYQ@mail.gmail.com/ Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20250430134249.2451066-1-alan.maguire@oracle.com
2025-05-01Merge branch 'bpf-allow-xdp_redirect-for-xdp-dev-bound-programs'Martin KaFai Lau
Lorenzo Bianconi says: ==================== bpf: Allow XDP_REDIRECT for XDP dev-bound programs In the current implementation, if the program is dev-bound to a specific device, it will not be possible to perform XDP_REDIRECT into a DEVMAP or CPUMAP even if the program is running in the driver NAPI context. Fix the issue by introducing the __bpf_prog_map_compatible utility routine in order to avoid the bpf_prog_is_dev_bound() check during the XDP program load. Continue to forbid attaching a dev-bound program to XDP maps. ==================== Link: https://patch.msgid.link/20250428-xdp-prog-bound-fix-v3-0-c9e9ba3300c7@kernel.org Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>