summaryrefslogtreecommitdiff
path: root/kernel/bpf
AgeCommit message (Collapse)Author
2023-10-16bpf: Ensure proper register state printing for cond jumpsAndrii Nakryiko
Verifier emits relevant register state involved in any given instruction next to it after `;` to the right, if possible. Or, worst case, on the separate line repeating instruction index. E.g., a nice and simple case would be: 2: (d5) if r0 s<= 0x0 goto pc+1 ; R0_w=0 But if there is some intervening extra output (e.g., precision backtracking log) involved, we are supposed to see the state after the precision backtrack log: 4: (75) if r0 s>= 0x0 goto pc+1 mark_precise: frame0: last_idx 4 first_idx 0 subseq_idx -1 mark_precise: frame0: regs=r0 stack= before 2: (d5) if r0 s<= 0x0 goto pc+1 mark_precise: frame0: regs=r0 stack= before 1: (b7) r0 = 0 6: R0_w=0 First off, note that in `6: R0_w=0` instruction index corresponds to the next instruction, not to the conditional jump instruction itself, which is wrong and we'll get to that. But besides that, the above is a happy case that does work today. Yet, if it so happens that precision backtracking had to traverse some of the parent states, this `6: R0_w=0` state output would be missing. This is due to a quirk of print_verifier_state() routine, which performs mark_verifier_state_clean(env) at the end. This marks all registers as "non-scratched", which means that subsequent logic to print *relevant* registers (that is, "scratched ones") fails and doesn't see anything relevant to print and skips the output altogether. print_verifier_state() is used both to print instruction context, but also to print an **entire** verifier state indiscriminately, e.g., during precision backtracking (and in a few other situations, like during entering or exiting subprogram). Which means if we have to print entire parent state before getting to printing instruction context state, instruction context is marked as clean and is omitted. Long story short, this is definitely not intentional. So we fix this behavior in this patch by teaching print_verifier_state() to clear scratch state only if it was used to print instruction state, not the parent/callback state. This is determined by print_all option, so if it's not set, we don't clear scratch state. This fixes missing instruction state for these cases. As for the mismatched instruction index, we fix that by making sure we call print_insn_state() early inside check_cond_jmp_op() before we adjusted insn_idx based on jump branch taken logic. And with that we get desired correct information: 9: (16) if w4 == 0x1 goto pc+9 mark_precise: frame0: last_idx 9 first_idx 9 subseq_idx -1 mark_precise: frame0: parent state regs=r4 stack=: R2_w=1944 R4_rw=P1 R10=fp0 mark_precise: frame0: last_idx 8 first_idx 0 subseq_idx 9 mark_precise: frame0: regs=r4 stack= before 8: (66) if w4 s> 0x3 goto pc+5 mark_precise: frame0: regs=r4 stack= before 7: (b7) r4 = 1 9: R4=1 Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20231011223728.3188086-6-andrii@kernel.org
2023-10-16bpf: Disambiguate SCALAR register state output in verifier logsAndrii Nakryiko
Currently the way that verifier prints SCALAR_VALUE register state (and PTR_TO_PACKET, which can have var_off and ranges info as well) is very ambiguous. In the name of brevity we are trying to eliminate "unnecessary" output of umin/umax, smin/smax, u32_min/u32_max, and s32_min/s32_max values, if possible. Current rules are that if any of those have their default value (which for mins is the minimal value of its respective types: 0, S32_MIN, or S64_MIN, while for maxs it's U32_MAX, S32_MAX, S64_MAX, or U64_MAX) *OR* if there is another min/max value that as matching value. E.g., if smin=100 and umin=100, we'll emit only umin=10, omitting smin altogether. This approach has a few problems, being both ambiguous and sort-of incorrect in some cases. Ambiguity is due to missing value could be either default value or value of umin/umax or smin/smax. This is especially confusing when we mix signed and unsigned ranges. Quite often, umin=0 and smin=0, and so we'll have only `umin=0` leaving anyone reading verifier log to guess whether smin is actually 0 or it's actually -9223372036854775808 (S64_MIN). And often times it's important to know, especially when debugging tricky issues. "Sort-of incorrectness" comes from mixing negative and positive values. E.g., if umin is some large positive number, it can be equal to smin which is, interpreted as signed value, is actually some negative value. Currently, that smin will be omitted and only umin will be emitted with a large positive value, giving an impression that smin is also positive. Anyway, ambiguity is the biggest issue making it impossible to have an exact understanding of register state, preventing any sort of automated testing of verifier state based on verifier log. This patch is attempting to rectify the situation by removing ambiguity, while minimizing the verboseness of register state output. The rules are straightforward: - if some of the values are missing, then it definitely has a default value. I.e., `umin=0` means that umin is zero, but smin is actually S64_MIN; - all the various boundaries that happen to have the same value are emitted in one equality separated sequence. E.g., if umin and smin are both 100, we'll emit `smin=umin=100`, making this explicit; - we do not mix negative and positive values together, and even if they happen to have the same bit-level value, they will be emitted separately with proper sign. I.e., if both umax and smax happen to be 0xffffffffffffffff, we'll emit them both separately as `smax=-1,umax=18446744073709551615`; - in the name of a bit more uniformity and consistency, {u32,s32}_{min,max} are renamed to {s,u}{min,max}32, which seems to improve readability. The above means that in case of all 4 ranges being, say, [50, 100] range, we'd previously see hugely ambiguous: R1=scalar(umin=50,umax=100) Now, we'll be more explicit: R1=scalar(smin=umin=smin32=umin32=50,smax=umax=smax32=umax32=100) This is slightly more verbose, but distinct from the case when we don't know anything about signed boundaries and 32-bit boundaries, which under new rules will match the old case: R1=scalar(umin=50,umax=100) Also, in the name of simplicity of implementation and consistency, order for {s,u}32_{min,max} are emitted *before* var_off. Previously they were emitted afterwards, for unclear reasons. This patch also includes a few fixes to selftests that expect exact register state to accommodate slight changes to verifier format. You can see that the changes are pretty minimal in common cases. Note, the special case when SCALAR_VALUE register is a known constant isn't changed, we'll emit constant value once, interpreted as signed value. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20231011223728.3188086-5-andrii@kernel.org
2023-10-13bpf: Introduce task_vma open-coded iterator kfuncsDave Marchevsky
This patch adds kfuncs bpf_iter_task_vma_{new,next,destroy} which allow creation and manipulation of struct bpf_iter_task_vma in open-coded iterator style. BPF programs can use these kfuncs directly or through bpf_for_each macro for natural-looking iteration of all task vmas. The implementation borrows heavily from bpf_find_vma helper's locking - differing only in that it holds the mmap_read lock for all iterations while the helper only executes its provided callback on a maximum of 1 vma. Aside from locking, struct vma_iterator and vma_next do all the heavy lifting. A pointer to an inner data struct, struct bpf_iter_task_vma_data, is the only field in struct bpf_iter_task_vma. This is because the inner data struct contains a struct vma_iterator (not ptr), whose size is likely to change under us. If bpf_iter_task_vma_kern contained vma_iterator directly such a change would require change in opaque bpf_iter_task_vma struct's size. So better to allocate vma_iterator using BPF allocator, and since that alloc must already succeed, might as well allocate all iter fields, thereby freezing struct bpf_iter_task_vma size. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231013204426.1074286-4-davemarchevsky@fb.com
2023-10-13bpf: Don't explicitly emit BTF for struct btf_iter_numDave Marchevsky
Commit 6018e1f407cc ("bpf: implement numbers iterator") added the BTF_TYPE_EMIT line that this patch is modifying. The struct btf_iter_num doesn't exist, so only a forward declaration is emitted in BTF: FWD 'btf_iter_num' fwd_kind=struct That commit was probably hoping to ensure that struct bpf_iter_num is emitted in vmlinux BTF. A previous version of this patch changed the line to emit the correct type, but Yonghong confirmed that it would definitely be emitted regardless in [0], so this patch simply removes the line. This isn't marked "Fixes" because the extraneous btf_iter_num FWD wasn't causing any issues that I noticed, aside from mild confusion when I looked through the code. [0]: https://lore.kernel.org/bpf/25d08207-43e6-36a8-5e0f-47a913d4cda5@linux.dev/ Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231013204426.1074286-2-davemarchevsky@fb.com
2023-10-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: kernel/bpf/verifier.c 829955981c55 ("bpf: Fix verifier log for async callback return values") a923819fb2c5 ("bpf: Treat first argument as return value for bpf_throw") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-11bpf: Implement cgroup sockaddr hooks for unix socketsDaan De Meyer
These hooks allows intercepting connect(), getsockname(), getpeername(), sendmsg() and recvmsg() for unix sockets. The unix socket hooks get write access to the address length because the address length is not fixed when dealing with unix sockets and needs to be modified when a unix socket address is modified by the hook. Because abstract socket unix addresses start with a NUL byte, we cannot recalculate the socket address in kernelspace after running the hook by calculating the length of the unix socket path using strlen(). These hooks can be used when users want to multiplex syscall to a single unix socket to multiple different processes behind the scenes by redirecting the connect() and other syscalls to process specific sockets. We do not implement support for intercepting bind() because when using bind() with unix sockets with a pathname address, this creates an inode in the filesystem which must be cleaned up. If we rewrite the address, the user might try to clean up the wrong file, leaking the socket in the filesystem where it is never cleaned up. Until we figure out a solution for this (and a use case for intercepting bind()), we opt to not allow rewriting the sockaddr in bind() calls. We also implement recvmsg() support for connected streams so that after a connect() that is modified by a sockaddr hook, any corresponding recmvsg() on the connected socket can also be modified to make the connected program think it is connected to the "intended" remote. Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com> Link: https://lore.kernel.org/r/20231011185113.140426-5-daan.j.demeyer@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-11bpf: Add bpf_sock_addr_set_sun_path() to allow writing unix sockaddr from bpfDaan De Meyer
As prep for adding unix socket support to the cgroup sockaddr hooks, let's add a kfunc bpf_sock_addr_set_sun_path() that allows modifying a unix sockaddr from bpf. While this is already possible for AF_INET and AF_INET6, we'll need this kfunc when we add unix socket support since modifying the address for those requires modifying both the address and the sockaddr length. Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com> Link: https://lore.kernel.org/r/20231011185113.140426-4-daan.j.demeyer@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-11bpf: Propagate modified uaddrlen from cgroup sockaddr programsDaan De Meyer
As prep for adding unix socket support to the cgroup sockaddr hooks, let's propagate the sockaddr length back to the caller after running a bpf cgroup sockaddr hook program. While not important for AF_INET or AF_INET6, the sockaddr length is important when working with AF_UNIX sockaddrs as the size of the sockaddr cannot be determined just from the address family or the sockaddr's contents. __cgroup_bpf_run_filter_sock_addr() is modified to take the uaddrlen as an input/output argument. After running the program, the modified sockaddr length is stored in the uaddrlen pointer. Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com> Link: https://lore.kernel.org/r/20231011185113.140426-3-daan.j.demeyer@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-09bpf: Fix verifier log for async callback return valuesDavid Vernet
The verifier, as part of check_return_code(), verifies that async callbacks such as from e.g. timers, will return 0. It does this by correctly checking that R0->var_off is in tnum_const(0), which effectively checks that it's in a range of 0. If this condition fails, however, it prints an error message which says that the value should have been in (0x0; 0x1). This results in possibly confusing output such as the following in which an async callback returns 1: At async callback the register R0 has value (0x1; 0x0) should have been in (0x0; 0x1) The fix is easy -- we should just pass the tnum_const(0) as the correct range to verbose_invalid_scalar(), which will then print the following: At async callback the register R0 has value (0x1; 0x0) should have been in (0x0; 0x0) Fixes: bfc6bb74e4f1 ("bpf: Implement verifier support for validation of async callbacks.") Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20231009161414.235829-1-void@manifault.com
2023-10-09bpf: Add ability to pin bpf timer to calling CPUDavid Vernet
BPF supports creating high resolution timers using bpf_timer_* helper functions. Currently, only the BPF_F_TIMER_ABS flag is supported, which specifies that the timeout should be interpreted as absolute time. It would also be useful to be able to pin that timer to a core. For example, if you wanted to make a subset of cores run without timer interrupts, and only have the timer be invoked on a single core. This patch adds support for this with a new BPF_F_TIMER_CPU_PIN flag. When specified, the HRTIMER_MODE_PINNED flag is passed to hrtimer_start(). A subsequent patch will update selftests to validate. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <song@kernel.org> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20231004162339.200702-2-void@manifault.com
2023-10-06bpf: Refuse unused attributes in bpf_prog_{attach,detach}Lorenz Bauer
The recently added tcx attachment extended the BPF UAPI for attaching and detaching by a couple of fields. Those fields are currently only supported for tcx, other types like cgroups and flow dissector silently ignore the new fields except for the new flags. This is problematic once we extend bpf_mprog to older attachment types, since it's hard to figure out whether the syscall really was successful if the kernel silently ignores non-zero values. Explicitly reject non-zero fields relevant to bpf_mprog for attachment types which don't use the latter yet. Fixes: e420bed02507 ("bpf: Add fd-based tcx multi-prog infra with link support") Signed-off-by: Lorenz Bauer <lmb@isovalent.com> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20231006220655.1653-3-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-06bpf: Handle bpf_mprog_query with NULL entryDaniel Borkmann
Improve consistency for bpf_mprog_query() API and let the latter also handle a NULL entry as can be the case for tcx. Instead of returning -ENOENT, we copy a count of 0 and revision of 1 to user space, so that this can be fed into a subsequent bpf_mprog_attach() call as expected_revision. A BPF self- test as part of this series has been added to assert this case. Suggested-by: Lorenz Bauer <lmb@isovalent.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20231006220655.1653-2-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-06bpf: Fix BPF_PROG_QUERY last field checkDaniel Borkmann
While working on the ebpf-go [0] library integration for bpf_mprog and tcx, Lorenz noticed that two subsequent BPF_PROG_QUERY requests currently fail. A typical workflow is to first gather the bpf_mprog count without passing program/ link arrays, followed by the second request which contains the actual array pointers. The initial call populates count and revision fields. The second call gets rejected due to a BPF_PROG_QUERY_LAST_FIELD bug which should point to query.revision instead of query.link_attach_flags since the former is really the last member. It was not noticed in libbpf as bpf_prog_query_opts() always calls bpf(2) with an on-stack bpf_attr that is memset() each time (and therefore query.revision was reset to zero). [0] https://ebpf-go.dev Fixes: e420bed02507 ("bpf: Add fd-based tcx multi-prog infra with link support") Reported-by: Lorenz Bauer <lmb@isovalent.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20231006220655.1653-1-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-06bpf: Annotate struct bpf_stack_map with __counted_byKees Cook
Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). As found with Coccinelle [1], add __counted_by for struct bpf_stack_map. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci [1] Link: https://lore.kernel.org/bpf/20231006201657.work.531-kees@kernel.org
2023-10-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts (or adjacent changes of note). Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-09-30bpf: Use kmalloc_size_roundup() to adjust size_indexHou Tao
Commit d52b59315bf5 ("bpf: Adjust size_index according to the value of KMALLOC_MIN_SIZE") uses KMALLOC_MIN_SIZE to adjust size_index, but as reported by Nathan, the adjustment is not enough, because __kmalloc_minalign() also decides the minimal alignment of slab object as shown in new_kmalloc_cache() and its value may be greater than KMALLOC_MIN_SIZE (e.g., 64 bytes vs 8 bytes under a riscv QEMU VM). Instead of invoking __kmalloc_minalign() in bpf subsystem to find the maximal alignment, just using kmalloc_size_roundup() directly to get the corresponding slab object size for each allocation size. If these two sizes are unmatched, adjust size_index to select a bpf_mem_cache with unit_size equal to the object_size of the underlying slab cache for the allocation size. Fixes: 822fb26bdb55 ("bpf: Add a hint to allocated objects.") Reported-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/bpf/20230914181407.GA1000274@dev-arch.thelio-3990X/ Signed-off-by: Hou Tao <houtao1@huawei.com> Tested-by: Emil Renner Berthing <emil.renner.berthing@canonical.com> Link: https://lore.kernel.org/r/20230928101558.2594068-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-29bpf, mprog: Fix maximum program check on mprog attachmentDaniel Borkmann
After Paul's recent improvement to syzkaller to improve coverage for bpf_mprog and tcx, it hit a splat that the program limit was surpassed. What happened is that the maximum number of progs got added, followed by another prog add request which adds with BPF_F_BEFORE flag relative to the last program in the array. The idx >= bpf_mprog_max() check in bpf_mprog_attach() still passes because the index is below the maximum but the maximum will be surpassed. We need to add a check upfront for insertions to catch this situation. Fixes: 053c8e1f235d ("bpf: Add generic attach/detach/query API for multi-progs") Reported-by: syzbot+baa44e3dbbe48e05c1ad@syzkaller.appspotmail.com Reported-by: syzbot+b97d20ed568ce0951a06@syzkaller.appspotmail.com Reported-by: syzbot+2558ca3567a77b7af4e3@syzkaller.appspotmail.com Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: syzbot+baa44e3dbbe48e05c1ad@syzkaller.appspotmail.com Tested-by: syzbot+b97d20ed568ce0951a06@syzkaller.appspotmail.com Link: https://github.com/google/syzkaller/pull/4207 Link: https://lore.kernel.org/bpf/20230929204121.20305-1-daniel@iogearbox.net
2023-09-25bpf: Add missed value to kprobe perf link infoJiri Olsa
Add missed value to kprobe attached through perf link info to hold the stats of missed kprobe handler execution. The kprobe's missed counter gets incremented when kprobe handler is not executed due to another kprobe running on the same cpu. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230920213145.1941596-4-jolsa@kernel.org
2023-09-21bpf: Disable zero-extension for BPF_MEMSXIlya Leoshkevich
On the architectures that use bpf_jit_needs_zext(), e.g., s390x, the verifier incorrectly inserts a zero-extension after BPF_MEMSX, leading to miscompilations like the one below: 24: 89 1a ff fe 00 00 00 00 "r1 = *(s16 *)(r10 - 2);" # zext_dst set 0x3ff7fdb910e: lgh %r2,-2(%r13,%r0) # load halfword 0x3ff7fdb9114: llgfr %r2,%r2 # wrong! 25: 65 10 00 03 00 00 7f ff if r1 s> 32767 goto +3 <l0_1> # check_cond_jmp_op() Disable such zero-extensions. The JITs need to insert sign-extension themselves, if necessary. Suggested-by: Puranjay Mohan <puranjay12@gmail.com> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Reviewed-by: Puranjay Mohan <puranjay12@gmail.com> Link: https://lore.kernel.org/r/20230919101336.2223655-2-iii@linux.ibm.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni
Cross-merge networking fixes after downstream PR. No conflicts. Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-09-20bpf: unconditionally reset backtrack_state masks on global func exitAndrii Nakryiko
In mark_chain_precision() logic, when we reach the entry to a global func, it is expected that R1-R5 might be still requested to be marked precise. This would correspond to some integer input arguments being tracked as precise. This is all expected and handled as a special case. What's not expected is that we'll leave backtrack_state structure with some register bits set. This is because for subsequent precision propagations backtrack_state is reused without clearing masks, as all code paths are carefully written in a way to leave empty backtrack_state with zeroed out masks, for speed. The fix is trivial, we always clear register bit in the register mask, and then, optionally, set reg->precise if register is SCALAR_VALUE type. Reported-by: Chris Mason <clm@meta.com> Fixes: be2ef8161572 ("bpf: allow precision tracking for programs with subprogs") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230918210110.2241458-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-19bpf: Remove unused variables.Alexei Starovoitov
Remove unused prev_offset, min_size, krec_size variables. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202309190634.fL17FWoT-lkp@intel.com/ Fixes: aaa619ebccb2 ("bpf: Refactor check_btf_func and split into two phases") Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-19bpf: Fix bpf_throw warning on 32-bit archKumar Kartikeya Dwivedi
On 32-bit architectures, the pointer width is 32-bit, while we try to cast from a u64 down to it, the compiler complains on mismatch in integer size. Fix this by first casting to long which should match the pointer width on targets supported by Linux. Fixes: ec5290a178b7 ("bpf: Prevent KASAN false positive with bpf_throw") Reported-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net> Link: https://lore.kernel.org/r/20230918155233.297024-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-17Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Alexei Starovoitov says: ==================== The following pull-request contains BPF updates for your *net-next* tree. We've added 73 non-merge commits during the last 9 day(s) which contain a total of 79 files changed, 5275 insertions(+), 600 deletions(-). The main changes are: 1) Basic BTF validation in libbpf, from Andrii Nakryiko. 2) bpf_assert(), bpf_throw(), exceptions in bpf progs, from Kumar Kartikeya Dwivedi. 3) next_thread cleanups, from Oleg Nesterov. 4) Add mcpu=v4 support to arm32, from Puranjay Mohan. 5) Add support for __percpu pointers in bpf progs, from Yonghong Song. 6) Fix bpf tailcall interaction with bpf trampoline, from Leon Hwang. 7) Raise irq_work in bpf_mem_alloc while irqs are disabled to improve refill probabablity, from Hou Tao. Please consider pulling these changes from: git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git Thanks a lot! Also thanks to reporters, reviewers and testers of commits in this pull-request: Alan Maguire, Andrey Konovalov, Dave Marchevsky, "Eric W. Biederman", Jiri Olsa, Maciej Fijalkowski, Quentin Monnet, Russell King (Oracle), Song Liu, Stanislav Fomichev, Yonghong Song ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-16bpf: Fix kfunc callback register type handlingKumar Kartikeya Dwivedi
The kfunc code to handle KF_ARG_PTR_TO_CALLBACK does not check the reg type before using reg->subprogno. This can accidently permit invalid pointers from being passed into callback helpers (e.g. silently from different paths). Likewise, reg->subprogno from the per-register type union may not be meaningful either. We need to reject any other type except PTR_TO_FUNC. Acked-by: Dave Marchevsky <davemarchevsky@fb.com> Fixes: 5d92ddc3de1b ("bpf: Add callback validation to kfunc verifier logic") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-14-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16bpf: Disallow fentry/fexit/freplace for exception callbacksKumar Kartikeya Dwivedi
During testing, it was discovered that extensions to exception callbacks had no checks, upon running a testcase, the kernel ended up running off the end of a program having final call as bpf_throw, and hitting int3 instructions. The reason is that while the default exception callback would have reset the stack frame to return back to the main program's caller, the replacing extension program will simply return back to bpf_throw, which will instead return back to the program and the program will continue execution, now in an undefined state where anything could happen. The way to support extensions to an exception callback would be to mark the BPF_PROG_TYPE_EXT main subprog as an exception_cb, and prevent it from calling bpf_throw. This would make the JIT produce a prologue that restores saved registers and reset the stack frame. But let's not do that until there is a concrete use case for this, and simply disallow this for now. Similar issues will exist for fentry and fexit cases, where trampoline saves data on the stack when invoking exception callback, which however will then end up resetting the stack frame, and on return, the fexit program will never will invoked as the return address points to the main program's caller in the kernel. Instead of additional complexity and back and forth between the two stacks to enable such a use case, simply forbid it. One key point here to note is that currently X86_TAIL_CALL_OFFSET didn't require any modifications, even though we emit instructions before the corresponding endbr64 instruction. This is because we ensure that a main subprog never serves as an exception callback, and therefore the exception callback (which will be a global subprog) can never serve as the tail call target, eliminating any discrepancies. However, once we support a BPF_PROG_TYPE_EXT to also act as an exception callback, it will end up requiring change to the tail call offset to account for the extra instructions. For simplicitly, tail calls could be disabled for such targets. Noting the above, it appears better to wait for a concrete use case before choosing to permit extension programs to replace exception callbacks. As a precaution, we disable fentry and fexit for exception callbacks as well. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-13-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16bpf: Detect IP == ksym.end as part of BPF programKumar Kartikeya Dwivedi
Now that bpf_throw kfunc is the first such call instruction that has noreturn semantics within the verifier, this also kicks in dead code elimination in unprecedented ways. For one, any instruction following a bpf_throw call will never be marked as seen. Moreover, if a callchain ends up throwing, any instructions after the call instruction to the eventually throwing subprog in callers will also never be marked as seen. The tempting way to fix this would be to emit extra 'int3' instructions which bump the jited_len of a program, and ensure that during runtime when a program throws, we can discover its boundaries even if the call instruction to bpf_throw (or to subprogs that always throw) is emitted as the final instruction in the program. An example of such a program would be this: do_something(): ... r0 = 0 exit foo(): r1 = 0 call bpf_throw r0 = 0 exit bar(cond): if r1 != 0 goto pc+2 call do_something exit call foo r0 = 0 // Never seen by verifier exit // main(ctx): r1 = ... call bar r0 = 0 exit Here, if we do end up throwing, the stacktrace would be the following: bpf_throw foo bar main In bar, the final instruction emitted will be the call to foo, as such, the return address will be the subsequent instruction (which the JIT emits as int3 on x86). This will end up lying outside the jited_len of the program, thus, when unwinding, we will fail to discover the return address as belonging to any program and end up in a panic due to the unreliable stack unwinding of BPF programs that we never expect. To remedy this case, make bpf_prog_ksym_find treat IP == ksym.end as part of the BPF program, so that is_bpf_text_address returns true when such a case occurs, and we are able to unwind reliably when the final instruction ends up being a call instruction. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-12-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16bpf: Prevent KASAN false positive with bpf_throwKumar Kartikeya Dwivedi
The KASAN stack instrumentation when CONFIG_KASAN_STACK is true poisons the stack of a function when it is entered and unpoisons it when leaving. However, in the case of bpf_throw, we will never return as we switch our stack frame to the BPF exception callback. Later, this discrepancy will lead to confusing KASAN splats when kernel resumes execution on return from the BPF program. Fix this by unpoisoning everything below the stack pointer of the BPF program, which should cover the range that would not be unpoisoned. An example splat is below: BUG: KASAN: stack-out-of-bounds in stack_trace_consume_entry+0x14e/0x170 Write of size 8 at addr ffffc900013af958 by task test_progs/227 CPU: 0 PID: 227 Comm: test_progs Not tainted 6.5.0-rc2-g43f1c6c9052a-dirty #26 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-2.fc39 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x4a/0x80 print_report+0xcf/0x670 ? arch_stack_walk+0x79/0x100 kasan_report+0xda/0x110 ? stack_trace_consume_entry+0x14e/0x170 ? stack_trace_consume_entry+0x14e/0x170 ? __pfx_stack_trace_consume_entry+0x10/0x10 stack_trace_consume_entry+0x14e/0x170 ? __sys_bpf+0xf2e/0x41b0 arch_stack_walk+0x8b/0x100 ? __sys_bpf+0xf2e/0x41b0 ? bpf_prog_test_run_skb+0x341/0x1c70 ? bpf_prog_test_run_skb+0x341/0x1c70 stack_trace_save+0x9b/0xd0 ? __pfx_stack_trace_save+0x10/0x10 ? __kasan_slab_free+0x109/0x180 ? bpf_prog_test_run_skb+0x341/0x1c70 ? __sys_bpf+0xf2e/0x41b0 ? __x64_sys_bpf+0x78/0xc0 ? do_syscall_64+0x3c/0x90 ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8 kasan_save_stack+0x33/0x60 ? kasan_save_stack+0x33/0x60 ? kasan_set_track+0x25/0x30 ? kasan_save_free_info+0x2b/0x50 ? __kasan_slab_free+0x109/0x180 ? kmem_cache_free+0x191/0x460 ? bpf_prog_test_run_skb+0x341/0x1c70 kasan_set_track+0x25/0x30 kasan_save_free_info+0x2b/0x50 __kasan_slab_free+0x109/0x180 kmem_cache_free+0x191/0x460 bpf_prog_test_run_skb+0x341/0x1c70 ? __pfx_bpf_prog_test_run_skb+0x10/0x10 ? __fget_light+0x51/0x220 __sys_bpf+0xf2e/0x41b0 ? __might_fault+0xa2/0x170 ? __pfx___sys_bpf+0x10/0x10 ? lock_release+0x1de/0x620 ? __might_fault+0xcd/0x170 ? __pfx_lock_release+0x10/0x10 ? __pfx_blkcg_maybe_throttle_current+0x10/0x10 __x64_sys_bpf+0x78/0xc0 ? syscall_enter_from_user_mode+0x20/0x50 do_syscall_64+0x3c/0x90 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 RIP: 0033:0x7f0fbb38880d Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f3 45 12 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe13907de8 EFLAGS: 00000206 ORIG_RAX: 0000000000000141 RAX: ffffffffffffffda RBX: 00007ffe13908708 RCX: 00007f0fbb38880d RDX: 0000000000000050 RSI: 00007ffe13907e20 RDI: 000000000000000a RBP: 00007ffe13907e00 R08: 0000000000000000 R09: 00007ffe13907e20 R10: 0000000000000064 R11: 0000000000000206 R12: 0000000000000003 R13: 0000000000000000 R14: 00007f0fbb532000 R15: 0000000000cfbd90 </TASK> The buggy address belongs to stack of task test_progs/227 KASAN internal error: frame info validation failed; invalid marker: 0 The buggy address belongs to the virtual mapping at [ffffc900013a8000, ffffc900013b1000) created by: kernel_clone+0xcd/0x600 The buggy address belongs to the physical page: page:00000000b70f4332 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11418f flags: 0x2fffe0000000000(node=0|zone=2|lastcpupid=0x7fff) page_type: 0xffffffff() raw: 02fffe0000000000 0000000000000000 dead000000000122 0000000000000000 raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffffc900013af800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffffc900013af880: 00 00 00 f1 f1 f1 f1 00 00 00 f3 f3 f3 f3 f3 00 >ffffc900013af900: 00 00 00 00 00 00 00 00 00 00 00 f1 00 00 00 00 ^ ffffc900013af980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffffc900013afa00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ================================================================== Disabling lock debugging due to kernel taint Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrey Konovalov <andreyknvl@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-11-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16bpf: Treat first argument as return value for bpf_throwKumar Kartikeya Dwivedi
In case of the default exception callback, change the behavior of bpf_throw, where the passed cookie value is no longer ignored, but is instead the return value of the default exception callback. As such, we need to place restrictions on the value being passed into bpf_throw in such a case, only allowing those permitted by the check_return_code function. Thus, bpf_throw can now control the return value of the program from each call site without having the user install a custom exception callback just to override the return value when an exception is thrown. We also modify the hidden subprog instructions to now move BPF_REG_1 to BPF_REG_0, so as to set the return value before exit in the default callback. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-9-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16bpf: Perform CFG walk for exception callbackKumar Kartikeya Dwivedi
Since exception callbacks are not referenced using bpf_pseudo_func and bpf_pseudo_call instructions, check_cfg traversal will never explore instructions of the exception callback. Even after adding the subprog, the program will then fail with a 'unreachable insn' error. We thus need to begin walking from the start of the exception callback again in check_cfg after a complete CFG traversal finishes, so as to explore the CFG rooted at the exception callback. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-8-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16bpf: Add support for custom exception callbacksKumar Kartikeya Dwivedi
By default, the subprog generated by the verifier to handle a thrown exception hardcodes a return value of 0. To allow user-defined logic and modification of the return value when an exception is thrown, introduce the 'exception_callback:' declaration tag, which marks a callback as the default exception handler for the program. The format of the declaration tag is 'exception_callback:<value>', where <value> is the name of the exception callback. Each main program can be tagged using this BTF declaratiion tag to associate it with an exception callback. In case the tag is absent, the default callback is used. As such, the exception callback cannot be modified at runtime, only set during verification. Allowing modification of the callback for the current program execution at runtime leads to issues when the programs begin to nest, as any per-CPU state maintaing this information will have to be saved and restored. We don't want it to stay in bpf_prog_aux as this takes a global effect for all programs. An alternative solution is spilling the callback pointer at a known location on the program stack on entry, and then passing this location to bpf_throw as a parameter. However, since exceptions are geared more towards a use case where they are ideally never invoked, optimizing for this use case and adding to the complexity has diminishing returns. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-7-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16bpf: Refactor check_btf_func and split into two phasesKumar Kartikeya Dwivedi
This patch splits the check_btf_info's check_btf_func check into two separate phases. The first phase sets up the BTF and prepares func_info, but does not perform any validation of required invariants for subprogs just yet. This is left to the second phase, which happens where check_btf_info executes currently, and performs the line_info and CO-RE relocation. The reason to perform this split is to obtain the userspace supplied func_info information before we perform the add_subprog call, where we would now require finding and adding subprogs that may not have a bpf_pseudo_call or bpf_pseudo_func instruction in the program. We require this as we want to enable userspace to supply exception callbacks that can override the default hidden subprogram generated by the verifier (which performs a hardcoded action). In such a case, the exception callback may never be referenced in an instruction, but will still be suitably annotated (by way of BTF declaration tags). For finding this exception callback, we would require the program's BTF information, and the supplied func_info information which maps BTF type IDs to subprograms. Since the exception callback won't actually be referenced through instructions, later checks in check_cfg and do_check_subprogs will not verify the subprog. This means that add_subprog needs to add them in the add_subprog_and_kfunc phase before we move forward, which is why the BTF and func_info are required at that point. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-6-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16bpf: Implement BPF exceptionsKumar Kartikeya Dwivedi
This patch implements BPF exceptions, and introduces a bpf_throw kfunc to allow programs to throw exceptions during their execution at runtime. A bpf_throw invocation is treated as an immediate termination of the program, returning back to its caller within the kernel, unwinding all stack frames. This allows the program to simplify its implementation, by testing for runtime conditions which the verifier has no visibility into, and assert that they are true. In case they are not, the program can simply throw an exception from the other branch. BPF exceptions are explicitly *NOT* an unlikely slowpath error handling primitive, and this objective has guided design choices of the implementation of the them within the kernel (with the bulk of the cost for unwinding the stack offloaded to the bpf_throw kfunc). The implementation of this mechanism requires use of add_hidden_subprog mechanism introduced in the previous patch, which generates a couple of instructions to move R1 to R0 and exit. The JIT then rewrites the prologue of this subprog to take the stack pointer and frame pointer as inputs and reset the stack frame, popping all callee-saved registers saved by the main subprog. The bpf_throw function then walks the stack at runtime, and invokes this exception subprog with the stack and frame pointers as parameters. Reviewers must take note that currently the main program is made to save all callee-saved registers on x86_64 during entry into the program. This is because we must do an equivalent of a lightweight context switch when unwinding the stack, therefore we need the callee-saved registers of the caller of the BPF program to be able to return with a sane state. Note that we have to additionally handle r12, even though it is not used by the program, because when throwing the exception the program makes an entry into the kernel which could clobber r12 after saving it on the stack. To be able to preserve the value we received on program entry, we push r12 and restore it from the generated subprogram when unwinding the stack. For now, bpf_throw invocation fails when lingering resources or locks exist in that path of the program. In a future followup, bpf_throw will be extended to perform frame-by-frame unwinding to release lingering resources for each stack frame, removing this limitation. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16bpf: Implement support for adding hidden subprogsKumar Kartikeya Dwivedi
Introduce support in the verifier for generating a subprogram and include it as part of a BPF program dynamically after the do_check phase is complete. The first user will be the next patch which generates default exception callbacks if none are set for the program. The phase of invocation will be do_misc_fixups. Note that this is an internal verifier function, and should be used with instruction blocks which uphold the invariants stated in check_subprogs. Since these subprogs are always appended to the end of the instruction sequence of the program, it becomes relatively inexpensive to do the related adjustments to the subprog_info of the program. Only the fake exit subprogram is shifted forward, making room for our new subprog. This is useful to insert a new subprogram, get it JITed, and obtain its function pointer. The next patch will use this functionality to insert a default exception callback which will be invoked after unwinding the stack. Note that these added subprograms are invisible to userspace, and never reported in BPF_OBJ_GET_INFO_BY_ID etc. For now, only a single subprogram is supported, but more can be easily supported in the future. To this end, two function counts are introduced now, the existing func_cnt, and real_func_cnt, the latter including hidden programs. This allows us to conver the JIT code to use the real_func_cnt for management of resources while syscall path continues working with existing func_cnt. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16arch/x86: Implement arch_bpf_stack_walkKumar Kartikeya Dwivedi
The plumbing for offline unwinding when we throw an exception in programs would require walking the stack, hence introduce a new arch_bpf_stack_walk function. This is provided when the JIT supports exceptions, i.e. bpf_jit_supports_exceptions is true. The arch-specific code is really minimal, hence it should be straightforward to extend this support to other architectures as well, as it reuses the logic of arch_stack_walk, but allowing access to unwind_state data. Once the stack pointer and frame pointer are known for the main subprog during the unwinding, we know the stack layout and location of any callee-saved registers which must be restored before we return back to the kernel. This handling will be added in the subsequent patches. Note that while we primarily unwind through BPF frames, which are effectively CONFIG_UNWINDER_FRAME_POINTER, we still need one of this or CONFIG_UNWINDER_ORC to be able to unwind through the bpf_throw frame from which we begin walking the stack. We also require both sp and bp (stack and frame pointers) from the unwind_state structure, which are only available when one of these two options are enabled. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Alexei Starovoitov says: ==================== The following pull-request contains BPF updates for your *net* tree. We've added 21 non-merge commits during the last 8 day(s) which contain a total of 21 files changed, 450 insertions(+), 36 deletions(-). The main changes are: 1) Adjust bpf_mem_alloc buckets to match ksize(), from Hou Tao. 2) Check whether override is allowed in kprobe mult, from Jiri Olsa. 3) Fix btf_id symbol generation with ld.lld, from Jiri and Nick. 4) Fix potential deadlock when using queue and stack maps from NMI, from Toke Høiland-Jørgensen. Please consider pulling these changes from: git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git Thanks a lot! Also thanks to reporters, reviewers and testers of commits in this pull-request: Alan Maguire, Biju Das, Björn Töpel, Dan Carpenter, Daniel Borkmann, Eduard Zingerman, Hsin-Wei Hung, Marcus Seyfarth, Nathan Chancellor, Satya Durga Srinivasu Prabhala, Song Liu, Stephen Rothwell ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-15bpf: Allow to use kfunc XDP hints and frags togetherLarysa Zaremba
There is no fundamental reason, why multi-buffer XDP and XDP kfunc RX hints cannot coexist in a single program. Allow those features to be used together by modifying the flags condition for dev-bound-only programs, segments are still prohibited for fully offloaded programs, hence additional check. Suggested-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/CAKH8qBuzgtJj=OKMdsxEkyML36VsAuZpcrsXcyqjdKXSJCBq=Q@mail.gmail.com/ Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230915083914.65538-1-larysa.zaremba@intel.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-09-15bpf: expose information about supported xdp metadata kfuncStanislav Fomichev
Add new xdp-rx-metadata-features member to netdev netlink which exports a bitmask of supported kfuncs. Most of the patch is autogenerated (headers), the only relevant part is netdev.yaml and the changes in netdev-genl.c to marshal into netlink. Example output on veth: $ ip link add veth0 type veth peer name veth1 # ifndex == 12 $ ./tools/net/ynl/samples/netdev 12 Select ifc ($ifindex; or 0 = dump; or -2 ntf check): 12 veth1[12] xdp-features (23): basic redirect rx-sg xdp-rx-metadata-features (3): timestamp hash xdp-zc-max-segs=0 Cc: netdev@vger.kernel.org Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230913171350.369987-3-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-09-15bpf: make it easier to add new metadata kfuncStanislav Fomichev
No functional changes. Instead of having hand-crafted code in bpf_dev_bound_resolve_kfunc, move kfunc <> xmo handler relationship into XDP_METADATA_KFUNC_xxx. This way, any time new kfunc is added, we don't have to touch bpf_dev_bound_resolve_kfunc. Also document XDP_METADATA_KFUNC_xxx arguments since we now have more than two and it might be confusing what is what. Cc: netdev@vger.kernel.org Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230913171350.369987-2-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-09-15bpf: Skip unit_size checking for global per-cpu allocatorHou Tao
For global per-cpu allocator, the size of free object in free list doesn't match with unit_size and now there is no way to get the size of per-cpu pointer saved in free object, so just skip the checking. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/bpf/20230913133436.0eeec4cb@canb.auug.org.au/ Signed-off-by: Hou Tao <houtao1@huawei.com> Tested-by: Biju Das <biju.das.jz@bp.renesas.com> Link: https://lore.kernel.org/r/20230913135943.3137292-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-14bpf: Charge modmem for struct_ops trampolineSong Liu
Current code charges modmem for regular trampoline, but not for struct_ops trampoline. Add bpf_jit_[charge|uncharge]_modmem() to struct_ops so the trampoline is charged in both cases. Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230914222542.2986059-1-song@kernel.org Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-09-12bpf, cgroup: fix multiple kernel-doc warningsRandy Dunlap
Fix missing or extra function parameter kernel-doc warnings in cgroup.c: kernel/bpf/cgroup.c:1359: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_skb' kernel/bpf/cgroup.c:1359: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_skb' kernel/bpf/cgroup.c:1439: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_sk' kernel/bpf/cgroup.c:1439: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_sk' kernel/bpf/cgroup.c:1467: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_sock_addr' kernel/bpf/cgroup.c:1467: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_sock_addr' kernel/bpf/cgroup.c:1512: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_sock_ops' kernel/bpf/cgroup.c:1512: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_sock_ops' kernel/bpf/cgroup.c:1685: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_sysctl' kernel/bpf/cgroup.c:1685: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_sysctl' kernel/bpf/cgroup.c:795: warning: Excess function parameter 'type' description in '__cgroup_bpf_replace' kernel/bpf/cgroup.c:795: warning: Function parameter or member 'new_prog' not described in '__cgroup_bpf_replace' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf@vger.kernel.org Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230912060812.1715-1-rdunlap@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-12bpf: Fix a erroneous check after snprintf()Christophe JAILLET
snprintf() does not return negative error code on error, it returns the number of characters which *would* be generated for the given input. Fix the error handling check. Fixes: 57539b1c0ac2 ("bpf: Enable annotating trusted nested pointers") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://lore.kernel.org/r/393bdebc87b22563c08ace094defa7160eb7a6c0.1694190795.git.christophe.jaillet@wanadoo.fr Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-12bpf, x64: Fix tailcall infinite loopLeon Hwang
From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT"), the tailcall on x64 works better than before. From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms for x64 JIT"), tailcall is able to run in BPF subprograms on x64. From commit 5b92a28aae4dd0f8 ("bpf: Support attaching tracing BPF program to other BPF programs"), BPF program is able to trace other BPF programs. How about combining them all together? 1. FENTRY/FEXIT on a BPF subprogram. 2. A tailcall runs in the BPF subprogram. 3. The tailcall calls the subprogram's caller. As a result, a tailcall infinite loop comes up. And the loop would halt the machine. As we know, in tail call context, the tail_call_cnt propagates by stack and rax register between BPF subprograms. So do in trampolines. Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT") Fixes: e411901c0b77 ("bpf: allow for tailcalls in BPF subprograms for x64 JIT") Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Leon Hwang <hffilwlqm@gmail.com> Link: https://lore.kernel.org/r/20230912150442.2009-3-hffilwlqm@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-11bpf: Avoid dummy bpf_offload_netdev in __bpf_prog_dev_bound_initEduard Zingerman
Fix for a bug observable under the following sequence of events: 1. Create a network device that does not support XDP offload. 2. Load a device bound XDP program with BPF_F_XDP_DEV_BOUND_ONLY flag (such programs are not offloaded). 3. Load a device bound XDP program with zero flags (such programs are offloaded). At step (2) __bpf_prog_dev_bound_init() associates with device (1) a dummy bpf_offload_netdev struct with .offdev field set to NULL. At step (3) __bpf_prog_dev_bound_init() would reuse dummy struct allocated at step (2). However, downstream usage of the bpf_offload_netdev assumes that .offdev field can't be NULL, e.g. in bpf_prog_offload_verifier_prep(). Adjust __bpf_prog_dev_bound_init() to require bpf_offload_netdev with non-NULL .offdev for offloaded BPF programs. Fixes: 2b3486bc2d23 ("bpf: Introduce device-bound XDP programs") Reported-by: syzbot+291100dcb32190ec02a8@syzkaller.appspotmail.com Closes: https://lore.kernel.org/bpf/000000000000d97f3c060479c4f8@google.com/ Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20230912005539.2248244-2-eddyz87@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-09-11bpf: Avoid deadlock when using queue and stack maps from NMIToke Høiland-Jørgensen
Sysbot discovered that the queue and stack maps can deadlock if they are being used from a BPF program that can be called from NMI context (such as one that is attached to a perf HW counter event). To fix this, add an in_nmi() check and use raw_spin_trylock() in NMI context, erroring out if grabbing the lock fails. Fixes: f1a2e44a3aec ("bpf: add queue and stack maps") Reported-by: Hsin-Wei Hung <hsinweih@uci.edu> Tested-by: Hsin-Wei Hung <hsinweih@uci.edu> Co-developed-by: Hsin-Wei Hung <hsinweih@uci.edu> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20230911132815.717240-1-toke@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-11bpf: Ensure unit_size is matched with slab cache object sizeHou Tao
Add extra check in bpf_mem_alloc_init() to ensure the unit_size of bpf_mem_cache is matched with the object_size of underlying slab cache. If these two sizes are unmatched, print a warning once and return -EINVAL in bpf_mem_alloc_init(), so the mismatch can be found early and the potential issue can be prevented. Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230908133923.2675053-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-11bpf: Don't prefill for unused bpf_mem_cacheHou Tao
When the unit_size of a bpf_mem_cache is unmatched with the object_size of the underlying slab cache, the bpf_mem_cache will not be used, and the allocation will be redirected to a bpf_mem_cache with a bigger unit_size instead, so there is no need to prefill for these unused bpf_mem_caches. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230908133923.2675053-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-11bpf: Adjust size_index according to the value of KMALLOC_MIN_SIZEHou Tao
The following warning was reported when running "./test_progs -a link_api -a linked_list" on a RISC-V QEMU VM: ------------[ cut here ]------------ WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342 bpf_mem_refill Modules linked in: bpf_testmod(OE) CPU: 3 PID: 261 Comm: test_progs- ... 6.5.0-rc5-01743-gdcb152bb8328 #2 Hardware name: riscv-virtio,qemu (DT) epc : bpf_mem_refill+0x1fc/0x206 ra : irq_work_single+0x68/0x70 epc : ffffffff801b1bc4 ra : ffffffff8015fe84 sp : ff2000000001be20 gp : ffffffff82d26138 tp : ff6000008477a800 t0 : 0000000000046600 t1 : ffffffff812b6ddc t2 : 0000000000000000 s0 : ff2000000001be70 s1 : ff5ffffffffe8998 a0 : ff5ffffffffe8998 a1 : ff600003fef4b000 a2 : 000000000000003f a3 : ffffffff80008250 a4 : 0000000000000060 a5 : 0000000000000080 a6 : 0000000000000000 a7 : 0000000000735049 s2 : ff5ffffffffe8998 s3 : 0000000000000022 s4 : 0000000000001000 s5 : 0000000000000007 s6 : ff5ffffffffe8570 s7 : ffffffff82d6bd30 s8 : 000000000000003f s9 : ffffffff82d2c5e8 s10: 000000000000ffff s11: ffffffff82d2c5d8 t3 : ffffffff81ea8f28 t4 : 0000000000000000 t5 : ff6000008fd28278 t6 : 0000000000040000 [<ffffffff801b1bc4>] bpf_mem_refill+0x1fc/0x206 [<ffffffff8015fe84>] irq_work_single+0x68/0x70 [<ffffffff8015feb4>] irq_work_run_list+0x28/0x36 [<ffffffff8015fefa>] irq_work_run+0x38/0x66 [<ffffffff8000828a>] handle_IPI+0x3a/0xb4 [<ffffffff800a5c3a>] handle_percpu_devid_irq+0xa4/0x1f8 [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36 [<ffffffff800ae570>] ipi_mux_process+0xac/0xfa [<ffffffff8000a8ea>] sbi_ipi_handle+0x2e/0x88 [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36 [<ffffffff807ee70e>] riscv_intc_irq+0x36/0x4e [<ffffffff812b5d3a>] handle_riscv_irq+0x54/0x86 [<ffffffff812b6904>] do_irq+0x66/0x98 ---[ end trace 0000000000000000 ]--- The warning is due to WARN_ON_ONCE(tgt->unit_size != c->unit_size) in free_bulk(). The direct reason is that a object is allocated and freed by bpf_mem_caches with different unit_size. The root cause is that KMALLOC_MIN_SIZE is 64 and there is no 96-bytes slab cache in the specific VM. When linked_list test allocates a 72-bytes object through bpf_obj_new(), bpf_global_ma will allocate it from a bpf_mem_cache with 96-bytes unit_size, but this bpf_mem_cache is backed by 128-bytes slab cache. When the object is freed, bpf_mem_free() uses ksize() to choose the corresponding bpf_mem_cache. Because the object is allocated from 128-bytes slab cache, ksize() returns 128, bpf_mem_free() chooses a 128-bytes bpf_mem_cache to free the object and triggers the warning. A similar warning will also be reported when using CONFIG_SLAB instead of CONFIG_SLUB in a x86-64 kernel. Because CONFIG_SLUB defines KMALLOC_MIN_SIZE as 8 but CONFIG_SLAB defines KMALLOC_MIN_SIZE as 32. An alternative fix is to use kmalloc_size_round() in bpf_mem_alloc() to choose a bpf_mem_cache which has the same unit_size with the backing slab cache, but it may introduce performance degradation, so fix the warning by adjusting the indexes in size_index according to the value of KMALLOC_MIN_SIZE just like setup_kmalloc_cache_index_table() does. Fixes: 822fb26bdb55 ("bpf: Add a hint to allocated objects.") Reported-by: Björn Töpel <bjorn@kernel.org> Closes: https://lore.kernel.org/bpf/87jztjmmy4.fsf@all.your.base.are.belong.to.us Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230908133923.2675053-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-09Merge tag 'riscv-for-linus-6.6-mw2-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull more RISC-V updates from Palmer Dabbelt: - The kernel now dynamically probes for misaligned access speed, as opposed to relying on a table of known implementations. - Support for non-coherent devices on systems using the Andes AX45MP core, including the RZ/Five SoCs. - Support for the V extension in ptrace(), again. - Support for KASLR. - Support for the BPF prog pack allocator in RISC-V. - A handful of bug fixes and cleanups. * tag 'riscv-for-linus-6.6-mw2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (25 commits) soc: renesas: Kconfig: For ARCH_R9A07G043 select the required configs if dependencies are met riscv: Kconfig.errata: Add dependency for RISCV_SBI in ERRATA_ANDES config riscv: Kconfig.errata: Drop dependency for MMU in ERRATA_ANDES_CMO config riscv: Kconfig: Select DMA_DIRECT_REMAP only if MMU is enabled bpf, riscv: use prog pack allocator in the BPF JIT riscv: implement a memset like function for text riscv: extend patch_text_nosync() for multiple pages bpf: make bpf_prog_pack allocator portable riscv: libstub: Implement KASLR by using generic functions libstub: Fix compilation warning for rv32 arm64: libstub: Move KASLR handling functions to kaslr.c riscv: Dump out kernel offset information on panic riscv: Introduce virtual kernel mapping KASLR RISC-V: Add ptrace support for vectors soc: renesas: Kconfig: Select the required configs for RZ/Five SoC cache: Add L2 cache management for Andes AX45MP RISC-V core dt-bindings: cache: andestech,ax45mp-cache: Add DT binding documentation for L2 cache controller riscv: mm: dma-noncoherent: nonstandard cache operations support riscv: errata: Add Andes alternative ports riscv: asm: vendorid_list: Add Andes Technology to the vendors list ...