summaryrefslogtreecommitdiff
path: root/kernel/bpf
AgeCommit message (Collapse)Author
2023-01-06bpf: Skip invalid kfunc call in backtrack_insnHao Sun
The verifier skips invalid kfunc call in check_kfunc_call(), which would be captured in fixup_kfunc_call() if such insn is not eliminated by dead code elimination. However, this can lead to the following warning in backtrack_insn(), also see [1]: ------------[ cut here ]------------ verifier backtracking bug WARNING: CPU: 6 PID: 8646 at kernel/bpf/verifier.c:2756 backtrack_insn kernel/bpf/verifier.c:2756 __mark_chain_precision kernel/bpf/verifier.c:3065 mark_chain_precision kernel/bpf/verifier.c:3165 adjust_reg_min_max_vals kernel/bpf/verifier.c:10715 check_alu_op kernel/bpf/verifier.c:10928 do_check kernel/bpf/verifier.c:13821 [inline] do_check_common kernel/bpf/verifier.c:16289 [...] So make backtracking conservative with this by returning ENOTSUPP. [1] https://lore.kernel.org/bpf/CACkBjsaXNceR8ZjkLG=dT3P=4A8SBsg0Z5h5PWLryF5=ghKq=g@mail.gmail.com/ Reported-by: syzbot+4da3ff23081bafe74fc2@syzkaller.appspotmail.com Signed-off-by: Hao Sun <sunhao.th@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20230104014709.9375-1-sunhao.th@gmail.com
2023-01-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-04Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== bpf-next 2023-01-04 We've added 45 non-merge commits during the last 21 day(s) which contain a total of 50 files changed, 1454 insertions(+), 375 deletions(-). The main changes are: 1) Fixes, improvements and refactoring of parts of BPF verifier's state equivalence checks, from Andrii Nakryiko. 2) Fix a few corner cases in libbpf's BTF-to-C converter in particular around padding handling and enums, also from Andrii Nakryiko. 3) Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better support decap on GRE tunnel devices not operating in collect metadata, from Christian Ehrig. 4) Improve x86 JIT's codegen for PROBE_MEM runtime error checks, from Dave Marchevsky. 5) Remove the need for trace_printk_lock for bpf_trace_printk and bpf_trace_vprintk helpers, from Jiri Olsa. 6) Add proper documentation for BPF_MAP_TYPE_SOCK{MAP,HASH} maps, from Maryam Tahhan. 7) Improvements in libbpf's btf_parse_elf error handling, from Changbin Du. 8) Bigger batch of improvements to BPF tracing code samples, from Daniel T. Lee. 9) Add LoongArch support to libbpf's bpf_tracing helper header, from Hengqi Chen. 10) Fix a libbpf compiler warning in perf_event_open_probe on arm32, from Khem Raj. 11) Optimize bpf_local_storage_elem by removing 56 bytes of padding, from Martin KaFai Lau. 12) Use pkg-config to locate libelf for resolve_btfids build, from Shen Jiamin. 13) Various libbpf improvements around API documentation and errno handling, from Xin Liu. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (45 commits) libbpf: Return -ENODATA for missing btf section libbpf: Add LoongArch support to bpf_tracing.h libbpf: Restore errno after pr_warn. libbpf: Added the description of some API functions libbpf: Fix invalid return address register in s390 samples/bpf: Use BPF_KSYSCALL macro in syscall tracing programs samples/bpf: Fix tracex2 by using BPF_KSYSCALL macro samples/bpf: Change _kern suffix to .bpf with syscall tracing program samples/bpf: Use vmlinux.h instead of implicit headers in syscall tracing program samples/bpf: Use kyscall instead of kprobe in syscall tracing program bpf: rename list_head -> graph_root in field info types libbpf: fix errno is overwritten after being closed. bpf: fix regs_exact() logic in regsafe() to remap IDs correctly bpf: perform byte-by-byte comparison only when necessary in regsafe() bpf: reject non-exact register type matches in regsafe() bpf: generalize MAYBE_NULL vs non-MAYBE_NULL rule bpf: reorganize struct bpf_reg_state fields bpf: teach refsafe() to take into account ID remapping bpf: Remove unused field initialization in bpf's ctl_table selftests/bpf: Add jit probe_mem corner case tests to s390x denylist ... ==================== Link: https://lore.kernel.org/r/20230105000926.31350-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-28bpf: rename list_head -> graph_root in field info typesDave Marchevsky
Many of the structs recently added to track field info for linked-list head are useful as-is for rbtree root. So let's do a mechanical renaming of list_head-related types and fields: include/linux/bpf.h: struct btf_field_list_head -> struct btf_field_graph_root list_head -> graph_root in struct btf_field union kernel/bpf/btf.c: list_head -> graph_root in struct btf_field_info This is a nonfunctional change, functionality to actually use these fields for rbtree will be added in further patches. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20221217082506.1570898-5-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-28bpf: Always use maximal size for copy_array()Kees Cook
Instead of counting on prior allocations to have sized allocations to the next kmalloc bucket size, always perform a krealloc that is at least ksize(dst) in size (which is a no-op), so the size can be correctly tracked by all the various allocation size trackers (KASAN, __alloc_size, etc). Reported-by: Hyunwoo Kim <v4bel@theori.io> Link: https://lore.kernel.org/bpf/20221223094551.GA1439509@ubuntu Fixes: ceb35b666d42 ("bpf/verifier: Use kmalloc_size_roundup() to match ksize() usage") Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Song Liu <song@kernel.org> Cc: Yonghong Song <yhs@fb.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Stanislav Fomichev <sdf@google.com> Cc: Hao Luo <haoluo@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: bpf@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221223182836.never.866-kees@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-28bpf: keep a reference to the mm, in case the task is dead.Kui-Feng Lee
Fix the system crash that happens when a task iterator travel through vma of tasks. In task iterators, we used to access mm by following the pointer on the task_struct; however, the death of a task will clear the pointer, even though we still hold the task_struct. That can cause an unexpected crash for a null pointer when an iterator is visiting a task that dies during the visit. Keeping a reference of mm on the iterator ensures we always have a valid pointer to mm. Co-developed-by: Song Liu <song@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com> Reported-by: Nathan Slingerland <slinger@meta.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221216221855.4122288-2-kuifeng@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-28bpf: Fix panic due to wrong pageattr of im->imageChuang Wang
In the scenario where livepatch and kretfunc coexist, the pageattr of im->image is rox after arch_prepare_bpf_trampoline in bpf_trampoline_update, and then modify_fentry or register_fentry returns -EAGAIN from bpf_tramp_ftrace_ops_func, the BPF_TRAMP_F_ORIG_STACK flag will be configured, and arch_prepare_bpf_trampoline will be re-executed. At this time, because the pageattr of im->image is rox, arch_prepare_bpf_trampoline will read and write im->image, which causes a fault. as follows: insmod livepatch-sample.ko # samples/livepatch/livepatch-sample.c bpftrace -e 'kretfunc:cmdline_proc_show {}' BUG: unable to handle page fault for address: ffffffffa0206000 PGD 322d067 P4D 322d067 PUD 322e063 PMD 1297e067 PTE d428061 Oops: 0003 [#1] PREEMPT SMP PTI CPU: 2 PID: 270 Comm: bpftrace Tainted: G E K 6.1.0 #5 RIP: 0010:arch_prepare_bpf_trampoline+0xed/0x8c0 RSP: 0018:ffffc90001083ad8 EFLAGS: 00010202 RAX: ffffffffa0206000 RBX: 0000000000000020 RCX: 0000000000000000 RDX: ffffffffa0206001 RSI: ffffffffa0206000 RDI: 0000000000000030 RBP: ffffc90001083b70 R08: 0000000000000066 R09: ffff88800f51b400 R10: 000000002e72c6e5 R11: 00000000d0a15080 R12: ffff8880110a68c8 R13: 0000000000000000 R14: ffff88800f51b400 R15: ffffffff814fec10 FS: 00007f87bc0dc780(0000) GS:ffff88803e600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffa0206000 CR3: 0000000010b70000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> bpf_trampoline_update+0x25a/0x6b0 __bpf_trampoline_link_prog+0x101/0x240 bpf_trampoline_link_prog+0x2d/0x50 bpf_tracing_prog_attach+0x24c/0x530 bpf_raw_tp_link_attach+0x73/0x1d0 __sys_bpf+0x100e/0x2570 __x64_sys_bpf+0x1c/0x30 do_syscall_64+0x5b/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd With this patch, when modify_fentry or register_fentry returns -EAGAIN from bpf_tramp_ftrace_ops_func, the pageattr of im->image will be reset to nx+rw. Cc: stable@vger.kernel.org Fixes: 00963a2e75a8 ("bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch)") Signed-off-by: Chuang Wang <nashuiliang@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20221224133146.780578-1-nashuiliang@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-27bpf: fix regs_exact() logic in regsafe() to remap IDs correctlyAndrii Nakryiko
Comparing IDs exactly between two separate states is not just suboptimal, but also incorrect in some cases. So update regs_exact() check to do byte-by-byte memcmp() only up to id/ref_obj_id. For id and ref_obj_id perform proper check_ids() checks, taking into account idmap. This change makes more states equivalent improving insns and states stats across a bunch of selftest BPF programs: File Program Insns (A) Insns (B) Insns (DIFF) States (A) States (B) States (DIFF) ------------------------------------------- -------------------------------- --------- --------- -------------- ---------- ---------- ------------- cgrp_kfunc_success.bpf.linked1.o test_cgrp_get_release 141 137 -4 (-2.84%) 13 13 +0 (+0.00%) cgrp_kfunc_success.bpf.linked1.o test_cgrp_xchg_release 142 139 -3 (-2.11%) 14 13 -1 (-7.14%) connect6_prog.bpf.linked1.o connect_v6_prog 139 102 -37 (-26.62%) 9 6 -3 (-33.33%) ima.bpf.linked1.o bprm_creds_for_exec 68 61 -7 (-10.29%) 6 5 -1 (-16.67%) linked_list.bpf.linked1.o global_list_in_list 569 499 -70 (-12.30%) 60 52 -8 (-13.33%) linked_list.bpf.linked1.o global_list_push_pop 167 150 -17 (-10.18%) 18 16 -2 (-11.11%) linked_list.bpf.linked1.o global_list_push_pop_multiple 881 815 -66 (-7.49%) 74 63 -11 (-14.86%) linked_list.bpf.linked1.o inner_map_list_in_list 579 534 -45 (-7.77%) 61 55 -6 (-9.84%) linked_list.bpf.linked1.o inner_map_list_push_pop 190 181 -9 (-4.74%) 19 18 -1 (-5.26%) linked_list.bpf.linked1.o inner_map_list_push_pop_multiple 916 850 -66 (-7.21%) 75 64 -11 (-14.67%) linked_list.bpf.linked1.o map_list_in_list 588 525 -63 (-10.71%) 62 55 -7 (-11.29%) linked_list.bpf.linked1.o map_list_push_pop 183 174 -9 (-4.92%) 18 17 -1 (-5.56%) linked_list.bpf.linked1.o map_list_push_pop_multiple 909 843 -66 (-7.26%) 75 64 -11 (-14.67%) map_kptr.bpf.linked1.o test_map_kptr 264 256 -8 (-3.03%) 26 26 +0 (+0.00%) map_kptr.bpf.linked1.o test_map_kptr_ref 95 91 -4 (-4.21%) 9 8 -1 (-11.11%) task_kfunc_success.bpf.linked1.o test_task_xchg_release 139 136 -3 (-2.16%) 14 13 -1 (-7.14%) test_bpf_nf.bpf.linked1.o nf_skb_ct_test 815 509 -306 (-37.55%) 57 30 -27 (-47.37%) test_bpf_nf.bpf.linked1.o nf_xdp_ct_test 815 509 -306 (-37.55%) 57 30 -27 (-47.37%) test_cls_redirect.bpf.linked1.o cls_redirect 78925 78390 -535 (-0.68%) 4782 4704 -78 (-1.63%) test_cls_redirect_subprogs.bpf.linked1.o cls_redirect 64901 63897 -1004 (-1.55%) 4612 4470 -142 (-3.08%) test_sk_lookup.bpf.linked1.o access_ctx_sk 181 95 -86 (-47.51%) 19 10 -9 (-47.37%) test_sk_lookup.bpf.linked1.o ctx_narrow_access 447 437 -10 (-2.24%) 38 37 -1 (-2.63%) test_sk_lookup_kern.bpf.linked1.o sk_lookup_success 148 133 -15 (-10.14%) 14 12 -2 (-14.29%) test_tcp_check_syncookie_kern.bpf.linked1.o check_syncookie_clsact 304 300 -4 (-1.32%) 23 22 -1 (-4.35%) test_tcp_check_syncookie_kern.bpf.linked1.o check_syncookie_xdp 304 300 -4 (-1.32%) 23 22 -1 (-4.35%) test_verify_pkcs7_sig.bpf.linked1.o bpf 87 76 -11 (-12.64%) 7 6 -1 (-14.29%) ------------------------------------------- -------------------------------- --------- --------- -------------- ---------- ---------- ------------- Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20221223054921.958283-7-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-27bpf: perform byte-by-byte comparison only when necessary in regsafe()Andrii Nakryiko
Extract byte-by-byte comparison of bpf_reg_state in regsafe() into a helper function, which makes it more convenient to use it "on demand" only for registers that benefit from such checks, instead of doing it all the time, even if result of such comparison is ignored. Also, remove WARN_ON_ONCE(1)+return false dead code. There is no risk of missing some case as compiler will warn about non-void function not returning value in some branches (and that under assumption that default case is removed in the future). Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20221223054921.958283-6-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-27bpf: reject non-exact register type matches in regsafe()Andrii Nakryiko
Generalize the (somewhat implicit) rule of regsafe(), which states that if register types in old and current states do not match *exactly*, they can't be safely considered equivalent. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20221223054921.958283-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-27bpf: generalize MAYBE_NULL vs non-MAYBE_NULL ruleAndrii Nakryiko
Make generic check to prevent XXX_OR_NULL and XXX register types to be intermixed. While technically in some situations it could be safe, it's impossible to enforce due to the loss of an ID when converting XXX_OR_NULL to its non-NULL variant. So prevent this in general, not just for PTR_TO_MAP_KEY and PTR_TO_MAP_VALUE. PTR_TO_MAP_KEY_OR_NULL and PTR_TO_MAP_VALUE_OR_NULL checks, which were previously special-cased, are simplified to generic check that takes into account range_within() and tnum_in(). This is correct as BPF verifier doesn't allow arithmetic on XXX_OR_NULL register types, so var_off and ranges should stay zero. But even if in the future this restriction is lifted, it's even more important to enforce that var_off and ranges are compatible, otherwise it's possible to construct case where this can be exploited to bypass verifier's memory range safety checks. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20221223054921.958283-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-27bpf: reorganize struct bpf_reg_state fieldsAndrii Nakryiko
Move id and ref_obj_id fields after scalar data section (var_off and ranges). This is necessary to simplify next patch which will change regsafe()'s logic to be safer, as it makes the contents that has to be an exact match (type-specific parts, off, type, and var_off+ranges) a single sequential block of memory, while id and ref_obj_id should always be remapped and thus can't be memcp()'ed. There are few places that assume that var_off is after id/ref_obj_id to clear out id/ref_obj_id with the single memset(0). These are changed to explicitly zero-out id/ref_obj_id fields. Other places are adjusted to preserve exact byte-by-byte comparison behavior. No functional changes. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20221223054921.958283-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-27bpf: teach refsafe() to take into account ID remappingAndrii Nakryiko
states_equal() check performs ID mapping between old and new states to establish a 1-to-1 correspondence between IDs, even if their absolute numberic values across two equivalent states differ. This is important both for correctness and to avoid unnecessary work when two states are equivalent. With recent changes we partially fixed this logic by maintaining ID map across all function frames. This patch also makes refsafe() check take into account (and maintain) ID map, making states_equal() behavior more optimal and correct. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20221223054921.958283-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-24Merge tag 'for-netdev' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== The following pull-request contains BPF updates for your *net* tree. We've added 7 non-merge commits during the last 5 day(s) which contain a total of 11 files changed, 231 insertions(+), 3 deletions(-). The main changes are: 1) Fix a splat in bpf_skb_generic_pop() under CHECKSUM_PARTIAL due to misuse of skb_postpull_rcsum(), from Jakub Kicinski with test case from Martin Lau. 2) Fix BPF verifier's nullness propagation when registers are of type PTR_TO_BTF_ID, from Hao Sun. 3) Fix bpftool build for JIT disassembler under statically built libllvm, from Anton Protopopov. 4) Fix warnings reported by resolve_btfids when building vmlinux with CONFIG_SECURITY_NETWORK disabled, from Hou Tao. 5) Minor fix up for BPF selftest gitignore, from Stanislav Fomichev. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-22bpf: fix nullness propagation for reg to reg comparisonsHao Sun
After befae75856ab, the verifier would propagate null information after JEQ/JNE, e.g., if two pointers, one is maybe_null and the other is not, the former would be marked as non-null in eq path. However, as comment "PTR_TO_BTF_ID points to a kernel struct that does not need to be null checked by the BPF program ... The verifier must keep this in mind and can make no assumptions about null or non-null when doing branch ...". If one pointer is maybe_null and the other is PTR_TO_BTF, the former is incorrectly marked non-null. The following BPF prog can trigger a null-ptr-deref, also see this report for more details[1]: 0: (18) r1 = map_fd ; R1_w=map_ptr(ks=4, vs=4) 2: (79) r6 = *(u64 *)(r1 +8) ; R6_w=bpf_map->inner_map_data ; R6 is PTR_TO_BTF_ID ; equals to null at runtime 3: (bf) r2 = r10 4: (07) r2 += -4 5: (62) *(u32 *)(r2 +0) = 0 6: (85) call bpf_map_lookup_elem#1 ; R0_w=map_value_or_null 7: (1d) if r6 == r0 goto pc+1 8: (95) exit ; from 7 to 9: R0=map_value R6=ptr_bpf_map 9: (61) r0 = *(u32 *)(r0 +0) ; null-ptr-deref 10: (95) exit So, make the verifier propagate nullness information for reg to reg comparisons only if neither reg is PTR_TO_BTF_ID. [1] https://lore.kernel.org/bpf/CACkBjsaFJwjC5oiw-1KXvcazywodwXo4zGYsRHwbr2gSG9WcSw@mail.gmail.com/T/#u Fixes: befae75856ab ("bpf: propagate nullness information for reg to reg comparisons") Signed-off-by: Hao Sun <sunhao.th@gmail.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221222024414.29539-1-sunhao.th@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-12-22bpf: Remove unused field initialization in bpf's ctl_tableRicardo Ribalda
Maxlen is used by standard proc_handlers such as proc_dointvec(), but in this case we have our own proc_handler via bpf_stats_handler(). Therefore, remove the initialization. Signed-off-by: Ricardo Ribalda <ribalda@chromium.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20221221-bpf-syscall-v1-0-9550f5f2c3fc@chromium.org
2022-12-21Merge tag 'net-6.2-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bpf, netfilter and can. Current release - regressions: - bpf: synchronize dispatcher update with bpf_dispatcher_xdp_func - rxrpc: - fix security setting propagation - fix null-deref in rxrpc_unuse_local() - fix switched parameters in peer tracing Current release - new code bugs: - rxrpc: - fix I/O thread startup getting skipped - fix locking issues in rxrpc_put_peer_locked() - fix I/O thread stop - fix uninitialised variable in rxperf server - fix the return value of rxrpc_new_incoming_call() - microchip: vcap: fix initialization of value and mask - nfp: fix unaligned io read of capabilities word Previous releases - regressions: - stop in-kernel socket users from corrupting socket's task_frag - stream: purge sk_error_queue in sk_stream_kill_queues() - openvswitch: fix flow lookup to use unmasked key - dsa: mv88e6xxx: avoid reg_lock deadlock in mv88e6xxx_setup_port() - devlink: - hold region lock when flushing snapshots - protect devlink dump by the instance lock Previous releases - always broken: - bpf: - prevent leak of lsm program after failed attach - resolve fext program type when checking map compatibility - skbuff: account for tail adjustment during pull operations - macsec: fix net device access prior to holding a lock - bonding: switch back when high prio link up - netfilter: flowtable: really fix NAT IPv6 offload - enetc: avoid buffer leaks on xdp_do_redirect() failure - unix: fix race in SOCK_SEQPACKET's unix_dgram_sendmsg() - dsa: microchip: remove IRQF_TRIGGER_FALLING in request_threaded_irq" * tag 'net-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (64 commits) net: fec: check the return value of build_skb() net: simplify sk_page_frag Treewide: Stop corrupting socket's task_frag net: Introduce sk_use_task_frag in struct sock. mctp: Remove device type check at unregister net: dsa: microchip: remove IRQF_TRIGGER_FALLING in request_threaded_irq can: kvaser_usb: hydra: help gcc-13 to figure out cmd_len can: flexcan: avoid unbalanced pm_runtime_enable warning Documentation: devlink: add missing toc entry for etas_es58x devlink doc mctp: serial: Fix starting value for frame check sequence nfp: fix unaligned io read of capabilities word net: stream: purge sk_error_queue in sk_stream_kill_queues() myri10ge: Fix an error handling path in myri10ge_probe() net: microchip: vcap: Fix initialization of value and mask rxrpc: Fix the return value of rxrpc_new_incoming_call() rxrpc: rxperf: Fix uninitialised variable rxrpc: Fix I/O thread stop rxrpc: Fix switched parameters in peer tracing rxrpc: Fix locking issues in rxrpc_put_peer_locked() rxrpc: Fix I/O thread startup getting skipped ...
2022-12-21bpf: Reduce smap->elem_sizeMartin KaFai Lau
'struct bpf_local_storage_elem' has an unused 56 byte padding at the end due to struct's cache-line alignment requirement. This padding space is overlapped by storage value contents, so if we use sizeof() to calculate the total size, we overinflate it by 56 bytes. Use offsetof() instead to calculate more exact memory use. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20221221013036.3427431-1-martin.lau@linux.dev
2022-12-19bpf: Remove trace_printk_lockJiri Olsa
Both bpf_trace_printk and bpf_trace_vprintk helpers use static buffer guarded with trace_printk_lock spin lock. The spin lock contention causes issues with bpf programs attached to contention_begin tracepoint [1][2]. Andrii suggested we could get rid of the contention by using trylock, but we could actually get rid of the spinlock completely by using percpu buffers the same way as for bin_args in bpf_bprintf_prepare function. Adding new return 'buf' argument to struct bpf_bprintf_data and making bpf_bprintf_prepare to return also the buffer for printk helpers. [1] https://lore.kernel.org/bpf/CACkBjsakT_yWxnSWr4r-0TpPvbKm9-OBmVUhJb7hV3hY8fdCkw@mail.gmail.com/ [2] https://lore.kernel.org/bpf/CACkBjsaCsTovQHFfkqJKto6S4Z8d02ud1D7MPESrHa1cVNNTrw@mail.gmail.com/ Reported-by: Hao Sun <sunhao.th@gmail.com> Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20221215214430.1336195-4-jolsa@kernel.org
2022-12-19bpf: Do cleanup in bpf_bprintf_cleanup only when neededJiri Olsa
Currently we always cleanup/decrement bpf_bprintf_nest_level variable in bpf_bprintf_cleanup if it's > 0. There's possible scenario where this could cause a problem, when bpf_bprintf_prepare does not get bin_args buffer (because num_args is 0) and following bpf_bprintf_cleanup call decrements bpf_bprintf_nest_level variable, like: in task context: bpf_bprintf_prepare(num_args != 0) increments 'bpf_bprintf_nest_level = 1' -> first irq : bpf_bprintf_prepare(num_args == 0) bpf_bprintf_cleanup decrements 'bpf_bprintf_nest_level = 0' -> second irq: bpf_bprintf_prepare(num_args != 0) bpf_bprintf_nest_level = 1 gets same buffer as task context above Adding check to bpf_bprintf_cleanup and doing the real cleanup only if we got bin_args data in the first place. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20221215214430.1336195-3-jolsa@kernel.org
2022-12-19bpf: Add struct for bin_args arg in bpf_bprintf_prepareJiri Olsa
Adding struct bpf_bprintf_data to hold bin_args argument for bpf_bprintf_prepare function. We will add another return argument to bpf_bprintf_prepare and pass the struct to bpf_bprintf_cleanup for proper cleanup in following changes. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20221215214430.1336195-2-jolsa@kernel.org
2022-12-19bpf: Define sock security related BTF IDs under CONFIG_SECURITY_NETWORKHou Tao
There are warnings reported from resolve_btfids when building vmlinux with CONFIG_SECURITY_NETWORK disabled: WARN: resolve_btfids: unresolved symbol bpf_lsm_sk_free_security WARN: resolve_btfids: unresolved symbol bpf_lsm_sk_alloc_security So only define BTF IDs for these LSM hooks when CONFIG_SECURITY_NETWORK is enabled. Fixes: c0c852dd1876 ("bpf: Do not mark certain LSM hook arguments as trusted") Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20221217062144.2507222-1-houtao@huaweicloud.com
2022-12-17Merge tag 'x86_mm_for_6.2_v2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Dave Hansen: "New Feature: - Randomize the per-cpu entry areas Cleanups: - Have CR3_ADDR_MASK use PHYSICAL_PAGE_MASK instead of open coding it - Move to "native" set_memory_rox() helper - Clean up pmd_get_atomic() and i386-PAE - Remove some unused page table size macros" * tag 'x86_mm_for_6.2_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits) x86/mm: Ensure forced page table splitting x86/kasan: Populate shadow for shared chunk of the CPU entry area x86/kasan: Add helpers to align shadow addresses up and down x86/kasan: Rename local CPU_ENTRY_AREA variables to shorten names x86/mm: Populate KASAN shadow for entire per-CPU range of CPU entry area x86/mm: Recompute physical address for every page of per-CPU CEA mapping x86/mm: Rename __change_page_attr_set_clr(.checkalias) x86/mm: Inhibit _PAGE_NX changes from cpa_process_alias() x86/mm: Untangle __change_page_attr_set_clr(.checkalias) x86/mm: Add a few comments x86/mm: Fix CR3_ADDR_MASK x86/mm: Remove P*D_PAGE_MASK and P*D_PAGE_SIZE macros mm: Convert __HAVE_ARCH_P..P_GET to the new style mm: Remove pointless barrier() after pmdp_get_lockless() x86/mm/pae: Get rid of set_64bit() x86_64: Remove pointless set_64bit() usage x86/mm/pae: Be consistent with pXXp_get_and_clear() x86/mm/pae: Use WRITE_ONCE() x86/mm/pae: Don't (ab)use atomic64 mm/gup: Fix the lockless PMD access ...
2022-12-15mm: Introduce set_memory_rox()Peter Zijlstra
Because endlessly repeating: set_memory_ro() set_memory_x() is getting tedious. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/Y1jek64pXOsougmz@hirez.programming.kicks-ass.net
2022-12-14bpf: Resolve fext program type when checking map compatibilityToke Høiland-Jørgensen
The bpf_prog_map_compatible() check makes sure that BPF program types are not mixed inside BPF map types that can contain programs (tail call maps, cpumaps and devmaps). It does this by setting the fields of the map->owner struct to the values of the first program being checked against, and rejecting any subsequent programs if the values don't match. One of the values being set in the map owner struct is the program type, and since the code did not resolve the prog type for fext programs, the map owner type would be set to PROG_TYPE_EXT and subsequent loading of programs of the target type into the map would fail. This bug is seen in particular for XDP programs that are loaded as PROG_TYPE_EXT using libxdp; these cannot insert programs into devmaps and cpumaps because the check fails as described above. Fix the bug by resolving the fext program type to its target program type as elsewhere in the verifier. v3: - Add Yonghong's ACK Fixes: f45d5b6ce2e8 ("bpf: generalise tail call map compatibility check") Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20221214230254.790066-1-toke@redhat.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-12-14bpf: Synchronize dispatcher update with bpf_dispatcher_xdp_funcJiri Olsa
Hao Sun reported crash in dispatcher image [1]. Currently we don't have any sync between bpf_dispatcher_update and bpf_dispatcher_xdp_func, so following race is possible: cpu 0: cpu 1: bpf_prog_run_xdp ... bpf_dispatcher_xdp_func in image at offset 0x0 bpf_dispatcher_update update image at offset 0x800 bpf_dispatcher_update update image at offset 0x0 in image at offset 0x0 -> crash Fixing this by synchronizing dispatcher image update (which is done in bpf_dispatcher_update function) with bpf_dispatcher_xdp_func that reads and execute the dispatcher image. Calling synchronize_rcu after updating and installing new image ensures that readers leave old image before it's changed in the next dispatcher update. The update itself is locked with dispatcher's mutex. The bpf_prog_run_xdp is called under local_bh_disable and synchronize_rcu will wait for it to leave [2]. [1] https://lore.kernel.org/bpf/Y5SFho7ZYXr9ifRn@krava/T/#m00c29ece654bc9f332a17df493bbca33e702896c [2] https://lore.kernel.org/bpf/0B62D35A-E695-4B7A-A0D4-774767544C1A@gmail.com/T/#mff43e2c003ae99f4a38f353c7969be4c7162e877 Reported-by: Hao Sun <sunhao.th@gmail.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20221214123542.1389719-1-jolsa@kernel.org Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-12-14bpf: prevent leak of lsm program after failed attachMilan Landaverde
In [0], we added the ability to bpf_prog_attach LSM programs to cgroups, but in our validation to make sure the prog is meant to be attached to BPF_LSM_CGROUP, we return too early if the check fails. This results in lack of decrementing prog's refcnt (through bpf_prog_put) leaving the LSM program alive past the point of the expected lifecycle. This fix allows for the decrement to take place. [0] https://lore.kernel.org/all/20220628174314.1216643-4-sdf@google.com/ Fixes: 69fd337a975c ("bpf: per-cgroup lsm flavor") Signed-off-by: Milan Landaverde <milan@mdaverde.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20221213175714.31963-1-milan@mdaverde.com
2022-12-13Merge tag 'net-next-6.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core: - Allow live renaming when an interface is up - Add retpoline wrappers for tc, improving considerably the performances of complex queue discipline configurations - Add inet drop monitor support - A few GRO performance improvements - Add infrastructure for atomic dev stats, addressing long standing data races - De-duplicate common code between OVS and conntrack offloading infrastructure - A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements - Netfilter: introduce packet parser for tunneled packets - Replace IPVS timer-based estimators with kthreads to scale up the workload with the number of available CPUs - Add the helper support for connection-tracking OVS offload BPF: - Support for user defined BPF objects: the use case is to allocate own objects, build own object hierarchies and use the building blocks to build own data structures flexibly, for example, linked lists in BPF - Make cgroup local storage available to non-cgroup attached BPF programs - Avoid unnecessary deadlock detection and failures wrt BPF task storage helpers - A relevant bunch of BPF verifier fixes and improvements - Veristat tool improvements to support custom filtering, sorting, and replay of results - Add LLVM disassembler as default library for dumping JITed code - Lots of new BPF documentation for various BPF maps - Add bpf_rcu_read_{,un}lock() support for sleepable programs - Add RCU grace period chaining to BPF to wait for the completion of access from both sleepable and non-sleepable BPF programs - Add support storing struct task_struct objects as kptrs in maps - Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer values - Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions Protocols: - TCP: implement Protective Load Balancing across switch links - TCP: allow dynamically disabling TCP-MD5 static key, reverting back to fast[er]-path - UDP: Introduce optional per-netns hash lookup table - IPv6: simplify and cleanup sockets disposal - Netlink: support different type policies for each generic netlink operation - MPTCP: add MSG_FASTOPEN and FastOpen listener side support - MPTCP: add netlink notification support for listener sockets events - SCTP: add VRF support, allowing sctp sockets binding to VRF devices - Add bridging MAC Authentication Bypass (MAB) support - Extensions for Ethernet VPN bridging implementation to better support multicast scenarios - More work for Wi-Fi 7 support, comprising conversion of all the existing drivers to internal TX queue usage - IPSec: introduce a new offload type (packet offload) allowing complete header processing and crypto offloading - IPSec: extended ack support for more descriptive XFRM error reporting - RXRPC: increase SACK table size and move processing into a per-local endpoint kernel thread, reducing considerably the required locking - IEEE 802154: synchronous send frame and extended filtering support, initial support for scanning available 15.4 networks - Tun: bump the link speed from 10Mbps to 10Gbps - Tun/VirtioNet: implement UDP segmentation offload support Driver API: - PHY/SFP: improve power level switching between standard level 1 and the higher power levels - New API for netdev <-> devlink_port linkage - PTP: convert existing drivers to new frequency adjustment implementation - DSA: add support for rx offloading - Autoload DSA tagging driver when dynamically changing protocol - Add new PCP and APPTRUST attributes to Data Center Bridging - Add configuration support for 800Gbps link speed - Add devlink port function attribute to enable/disable RoCE and migratable - Extend devlink-rate to support strict prioriry and weighted fair queuing - Add devlink support to directly reading from region memory - New device tree helper to fetch MAC address from nvmem - New big TCP helper to simplify temporary header stripping New hardware / drivers: - Ethernet: - Marvel Octeon CNF95N and CN10KB Ethernet Switches - Marvel Prestera AC5X Ethernet Switch - WangXun 10 Gigabit NIC - Motorcomm yt8521 Gigabit Ethernet - Microchip ksz9563 Gigabit Ethernet Switch - Microsoft Azure Network Adapter - Linux Automation 10Base-T1L adapter - PHY: - Aquantia AQR112 and AQR412 - Motorcomm YT8531S - PTP: - Orolia ART-CARD - WiFi: - MediaTek Wi-Fi 7 (802.11be) devices - RealTek rtw8821cu, rtw8822bu, rtw8822cu and rtw8723du USB devices - Bluetooth: - Broadcom BCM4377/4378/4387 Bluetooth chipsets - Realtek RTL8852BE and RTL8723DS - Cypress.CYW4373A0 WiFi + Bluetooth combo device Drivers: - CAN: - gs_usb: bus error reporting support - kvaser_usb: listen only and bus error reporting support - Ethernet NICs: - Intel (100G): - extend action skbedit to RX queue mapping - implement devlink-rate support - support direct read from memory - nVidia/Mellanox (mlx5): - SW steering improvements, increasing rules update rate - Support for enhanced events compression - extend H/W offload packet manipulation capabilities - implement IPSec packet offload mode - nVidia/Mellanox (mlx4): - better big TCP support - Netronome Ethernet NICs (nfp): - IPsec offload support - add support for multicast filter - Broadcom: - RSS and PTP support improvements - AMD/SolarFlare: - netlink extened ack improvements - add basic flower matches to offload, and related stats - Virtual NICs: - ibmvnic: introduce affinity hint support - small / embedded: - FreeScale fec: add initial XDP support - Marvel mv643xx_eth: support MII/GMII/RGMII modes for Kirkwood - TI am65-cpsw: add suspend/resume support - Mediatek MT7986: add RX wireless wthernet dispatch support - Realtek 8169: enable GRO software interrupt coalescing per default - Ethernet high-speed switches: - Microchip (sparx5): - add support for Sparx5 TC/flower H/W offload via VCAP - Mellanox mlxsw: - add 802.1X and MAC Authentication Bypass offload support - add ip6gre support - Embedded Ethernet switches: - Mediatek (mtk_eth_soc): - improve PCS implementation, add DSA untag support - enable flow offload support - Renesas: - add rswitch R-Car Gen4 gPTP support - Microchip (lan966x): - add full XDP support - add TC H/W offload via VCAP - enable PTP on bridge interfaces - Microchip (ksz8): - add MTU support for KSZ8 series - Qualcomm 802.11ax WiFi (ath11k): - support configuring channel dwell time during scan - MediaTek WiFi (mt76): - enable Wireless Ethernet Dispatch (WED) offload support - add ack signal support - enable coredump support - remain_on_channel support - Intel WiFi (iwlwifi): - enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities - 320 MHz channels support - RealTek WiFi (rtw89): - new dynamic header firmware format support - wake-over-WLAN support" * tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2002 commits) ipvs: fix type warning in do_div() on 32 bit net: lan966x: Remove a useless test in lan966x_ptp_add_trap() net: ipa: add IPA v4.7 support dt-bindings: net: qcom,ipa: Add SM6350 compatible bnxt: Use generic HBH removal helper in tx path IPv6/GRO: generic helper to remove temporary HBH/jumbo header in driver selftests: forwarding: Add bridge MDB test selftests: forwarding: Rename bridge_mdb test bridge: mcast: Support replacement of MDB port group entries bridge: mcast: Allow user space to specify MDB entry routing protocol bridge: mcast: Allow user space to add (*, G) with a source list and filter mode bridge: mcast: Add support for (*, G) with a source list and filter mode bridge: mcast: Avoid arming group timer when (S, G) corresponds to a source bridge: mcast: Add a flag for user installed source entries bridge: mcast: Expose __br_multicast_del_group_src() bridge: mcast: Expose br_multicast_new_group_src() bridge: mcast: Add a centralized error path bridge: mcast: Place netlink policy before validation functions bridge: mcast: Split (*, G) and (S, G) addition into different functions bridge: mcast: Do not derive entry type from its filter mode ...
2022-12-12Merge tag 'random-6.2-rc1-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/crng/random Pull random number generator updates from Jason Donenfeld: - Replace prandom_u32_max() and various open-coded variants of it, there is now a new family of functions that uses fast rejection sampling to choose properly uniformly random numbers within an interval: get_random_u32_below(ceil) - [0, ceil) get_random_u32_above(floor) - (floor, U32_MAX] get_random_u32_inclusive(floor, ceil) - [floor, ceil] Coccinelle was used to convert all current users of prandom_u32_max(), as well as many open-coded patterns, resulting in improvements throughout the tree. I'll have a "late" 6.1-rc1 pull for you that removes the now unused prandom_u32_max() function, just in case any other trees add a new use case of it that needs to converted. According to linux-next, there may be two trivial cases of prandom_u32_max() reintroductions that are fixable with a 's/.../.../'. So I'll have for you a final conversion patch doing that alongside the removal patch during the second week. This is a treewide change that touches many files throughout. - More consistent use of get_random_canary(). - Updates to comments, documentation, tests, headers, and simplification in configuration. - The arch_get_random*_early() abstraction was only used by arm64 and wasn't entirely useful, so this has been replaced by code that works in all relevant contexts. - The kernel will use and manage random seeds in non-volatile EFI variables, refreshing a variable with a fresh seed when the RNG is initialized. The RNG GUID namespace is then hidden from efivarfs to prevent accidental leakage. These changes are split into random.c infrastructure code used in the EFI subsystem, in this pull request, and related support inside of EFISTUB, in Ard's EFI tree. These are co-dependent for full functionality, but the order of merging doesn't matter. - Part of the infrastructure added for the EFI support is also used for an improvement to the way vsprintf initializes its siphash key, replacing an sleep loop wart. - The hardware RNG framework now always calls its correct random.c input function, add_hwgenerator_randomness(), rather than sometimes going through helpers better suited for other cases. - The add_latent_entropy() function has long been called from the fork handler, but is a no-op when the latent entropy gcc plugin isn't used, which is fine for the purposes of latent entropy. But it was missing out on the cycle counter that was also being mixed in beside the latent entropy variable. So now, if the latent entropy gcc plugin isn't enabled, add_latent_entropy() will expand to a call to add_device_randomness(NULL, 0), which adds a cycle counter, without the absent latent entropy variable. - The RNG is now reseeded from a delayed worker, rather than on demand when used. Always running from a worker allows it to make use of the CPU RNG on platforms like S390x, whose instructions are too slow to do so from interrupts. It also has the effect of adding in new inputs more frequently with more regularity, amounting to a long term transcript of random values. Plus, it helps a bit with the upcoming vDSO implementation (which isn't yet ready for 6.2). - The jitter entropy algorithm now tries to execute on many different CPUs, round-robining, in hopes of hitting even more memory latencies and other unpredictable effects. It also will mix in a cycle counter when the entropy timer fires, in addition to being mixed in from the main loop, to account more explicitly for fluctuations in that timer firing. And the state it touches is now kept within the same cache line, so that it's assured that the different execution contexts will cause latencies. * tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (23 commits) random: include <linux/once.h> in the right header random: align entropy_timer_state to cache line random: mix in cycle counter when jitter timer fires random: spread out jitter callback to different CPUs random: remove extraneous period and add a missing one in comments efi: random: refresh non-volatile random seed when RNG is initialized vsprintf: initialize siphash key using notifier random: add back async readiness notifier random: reseed in delayed work rather than on-demand random: always mix cycle counter in add_latent_entropy() hw_random: use add_hwgenerator_randomness() for early entropy random: modernize documentation comment on get_random_bytes() random: adjust comment to account for removed function random: remove early archrandom abstraction random: use random.trust_{bootloader,cpu} command line option only stackprotector: actually use get_random_canary() stackprotector: move get_random_canary() into stackprotector.h treewide: use get_random_u32_inclusive() when possible treewide: use get_random_u32_{above,below}() instead of manual loop treewide: use get_random_u32_below() instead of deprecated function ...
2022-12-10bpf: use check_ids() for active_lock comparisonEduard Zingerman
An update for verifier.c:states_equal()/regsafe() to use check_ids() for active spin lock comparisons. This fixes the issue reported by Kumar Kartikeya Dwivedi in [1] using technique suggested by Edward Cree. W/o this commit the verifier might be tricked to accept the following program working with a map containing spin locks: 0: r9 = map_lookup_elem(...) ; Returns PTR_TO_MAP_VALUE_OR_NULL id=1. 1: r8 = map_lookup_elem(...) ; Returns PTR_TO_MAP_VALUE_OR_NULL id=2. 2: if r9 == 0 goto exit ; r9 -> PTR_TO_MAP_VALUE. 3: if r8 == 0 goto exit ; r8 -> PTR_TO_MAP_VALUE. 4: r7 = ktime_get_ns() ; Unbound SCALAR_VALUE. 5: r6 = ktime_get_ns() ; Unbound SCALAR_VALUE. 6: bpf_spin_lock(r8) ; active_lock.id == 2. 7: if r6 > r7 goto +1 ; No new information about the state ; is derived from this check, thus ; produced verifier states differ only ; in 'insn_idx'. 8: r9 = r8 ; Optionally make r9.id == r8.id. --- checkpoint --- ; Assume is_state_visisted() creates a ; checkpoint here. 9: bpf_spin_unlock(r9) ; (a,b) active_lock.id == 2. ; (a) r9.id == 2, (b) r9.id == 1. 10: exit(0) Consider two verification paths: (a) 0-10 (b) 0-7,9-10 The path (a) is verified first. If checkpoint is created at (8) the (b) would assume that (8) is safe because regsafe() does not compare register ids for registers of type PTR_TO_MAP_VALUE. [1] https://lore.kernel.org/bpf/20221111202719.982118-1-memxor@gmail.com/ Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Suggested-by: Edward Cree <ecree.xilinx@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20221209135733.28851-6-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-10bpf: states_equal() must build idmap for all function framesEduard Zingerman
verifier.c:states_equal() must maintain register ID mapping across all function frames. Otherwise the following example might be erroneously marked as safe: main: fp[-24] = map_lookup_elem(...) ; frame[0].fp[-24].id == 1 fp[-32] = map_lookup_elem(...) ; frame[0].fp[-32].id == 2 r1 = &fp[-24] r2 = &fp[-32] call foo() r0 = 0 exit foo: 0: r9 = r1 1: r8 = r2 2: r7 = ktime_get_ns() 3: r6 = ktime_get_ns() 4: if (r6 > r7) goto skip_assign 5: r9 = r8 skip_assign: ; <--- checkpoint 6: r9 = *r9 ; (a) frame[1].r9.id == 2 ; (b) frame[1].r9.id == 1 7: if r9 == 0 goto exit: ; mark_ptr_or_null_regs() transfers != 0 info ; for all regs sharing ID: ; (a) r9 != 0 => &frame[0].fp[-32] != 0 ; (b) r9 != 0 => &frame[0].fp[-24] != 0 8: r8 = *r8 ; (a) r8 == &frame[0].fp[-32] ; (b) r8 == &frame[0].fp[-32] 9: r0 = *r8 ; (a) safe ; (b) unsafe exit: 10: exit While processing call to foo() verifier considers the following execution paths: (a) 0-10 (b) 0-4,6-10 (There is also path 0-7,10 but it is not interesting for the issue at hand. (a) is verified first.) Suppose that checkpoint is created at (6) when path (a) is verified, next path (b) is verified and (6) is reached. If states_equal() maintains separate 'idmap' for each frame the mapping at (6) for frame[1] would be empty and regsafe(r9)::check_ids() would add a pair 2->1 and return true, which is an error. If states_equal() maintains single 'idmap' for all frames the mapping at (6) would be { 1->1, 2->2 } and regsafe(r9)::check_ids() would return false when trying to add a pair 2->1. This issue was suggested in the following discussion: https://lore.kernel.org/bpf/CAEf4BzbFB5g4oUfyxk9rHy-PJSLQ3h8q9mV=rVoXfr_JVm8+1Q@mail.gmail.com/ Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20221209135733.28851-4-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-10bpf: regsafe() must not skip check_ids()Eduard Zingerman
The verifier.c:regsafe() has the following shortcut: equal = memcmp(rold, rcur, offsetof(struct bpf_reg_state, parent)) == 0; ... if (equal) return true; Which is executed regardless old register type. This is incorrect for register types that might have an ID checked by check_ids(), namely: - PTR_TO_MAP_KEY - PTR_TO_MAP_VALUE - PTR_TO_PACKET_META - PTR_TO_PACKET The following pattern could be used to exploit this: 0: r9 = map_lookup_elem(...) ; Returns PTR_TO_MAP_VALUE_OR_NULL id=1. 1: r8 = map_lookup_elem(...) ; Returns PTR_TO_MAP_VALUE_OR_NULL id=2. 2: r7 = ktime_get_ns() ; Unbound SCALAR_VALUE. 3: r6 = ktime_get_ns() ; Unbound SCALAR_VALUE. 4: if r6 > r7 goto +1 ; No new information about the state ; is derived from this check, thus ; produced verifier states differ only ; in 'insn_idx'. 5: r9 = r8 ; Optionally make r9.id == r8.id. --- checkpoint --- ; Assume is_state_visisted() creates a ; checkpoint here. 6: if r9 == 0 goto <exit> ; Nullness info is propagated to all ; registers with matching ID. 7: r1 = *(u64 *) r8 ; Not always safe. Verifier first visits path 1-7 where r8 is verified to be not null at (6). Later the jump from 4 to 6 is examined. The checkpoint for (6) looks as follows: R8_rD=map_value_or_null(id=2,off=0,ks=4,vs=8,imm=0) R9_rwD=map_value_or_null(id=2,off=0,ks=4,vs=8,imm=0) R10=fp0 The current state is: R0=... R6=... R7=... fp-8=... R8=map_value_or_null(id=2,off=0,ks=4,vs=8,imm=0) R9=map_value_or_null(id=1,off=0,ks=4,vs=8,imm=0) R10=fp0 Note that R8 states are byte-to-byte identical, so regsafe() would exit early and skip call to check_ids(), thus ID mapping 2->2 will not be added to 'idmap'. Next, states for R9 are compared: these are not identical and check_ids() is executed, but 'idmap' is empty, so check_ids() adds mapping 2->1 to 'idmap' and returns success. This commit pushes the 'equal' down to register types that don't need check_ids(). Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20221209135733.28851-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08bpf: Use memmove for bpf_dynptr_{read,write}Kumar Kartikeya Dwivedi
It may happen that destination buffer memory overlaps with memory dynptr points to. Hence, we must use memmove to correctly copy from dynptr to destination buffer, or source buffer to dynptr. This actually isn't a problem right now, as memcpy implementation falls back to memmove on detecting overlap and warns about it, but we shouldn't be relying on that. Acked-by: Joanne Koong <joannelkoong@gmail.com> Acked-by: David Vernet <void@manifault.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221207204141.308952-7-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08bpf: Move PTR_TO_STACK alignment check to process_dynptr_funcKumar Kartikeya Dwivedi
After previous commit, we are minimizing helper specific assumptions from check_func_arg_reg_off, making it generic, and offloading checks for a specific argument type to their respective functions called after check_func_arg_reg_off has been called. This allows relying on a consistent set of guarantees after that call and then relying on them in code that deals with registers for each argument type later. This is in line with how process_spin_lock, process_timer_func, process_kptr_func check reg->var_off to be constant. The same reasoning is used here to move the alignment check into process_dynptr_func. Note that it also needs to check for constant var_off, and accumulate the constant var_off when computing the spi in get_spi, but that fix will come in later changes. Acked-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221207204141.308952-6-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08bpf: Rework check_func_arg_reg_offKumar Kartikeya Dwivedi
While check_func_arg_reg_off is the place which performs generic checks needed by various candidates of reg->type, there is some handling for special cases, like ARG_PTR_TO_DYNPTR, OBJ_RELEASE, and ARG_PTR_TO_RINGBUF_MEM. This commit aims to streamline these special cases and instead leave other things up to argument type specific code to handle. The function will be restrictive by default, and cover all possible cases when OBJ_RELEASE is set, without having to update the function again (and missing to do that being a bug). This is done primarily for two reasons: associating back reg->type to its argument leaves room for the list getting out of sync when a new reg->type is supported by an arg_type. The other case is ARG_PTR_TO_RINGBUF_MEM. The problem there is something we already handle, whenever a release argument is expected, it should be passed as the pointer that was received from the acquire function. Hence zero fixed and variable offset. There is nothing special about ARG_PTR_TO_RINGBUF_MEM, where technically its target register type PTR_TO_MEM | MEM_RINGBUF can already be passed with non-zero offset to other helper functions, which makes sense. Hence, lift the arg_type_is_release check for reg->off and cover all possible register types, instead of duplicating the same kind of check twice for current OBJ_RELEASE arg_types (alloc_mem and ptr_to_btf_id). For the release argument, arg_type_is_dynptr is the special case, where we go to actual object being freed through the dynptr, so the offset of the pointer still needs to allow fixed and variable offset and process_dynptr_func will verify them later for the release argument case as well. This is not specific to ARG_PTR_TO_DYNPTR though, we will need to make this exception for any future object on the stack that needs to be released. In this sense, PTR_TO_STACK as a candidate for object on stack argument is a special case for release offset checks, and they need to be done by the helper releasing the object on stack. Since the check has been lifted above all register type checks, remove the duplicated check that is being done for PTR_TO_BTF_ID. Acked-by: Joanne Koong <joannelkoong@gmail.com> Acked-by: David Vernet <void@manifault.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221207204141.308952-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08bpf: Rework process_dynptr_funcKumar Kartikeya Dwivedi
Recently, user ringbuf support introduced a PTR_TO_DYNPTR register type for use in callback state, because in case of user ringbuf helpers, there is no dynptr on the stack that is passed into the callback. To reflect such a state, a special register type was created. However, some checks have been bypassed incorrectly during the addition of this feature. First, for arg_type with MEM_UNINIT flag which initialize a dynptr, they must be rejected for such register type. Secondly, in the future, there are plans to add dynptr helpers that operate on the dynptr itself and may change its offset and other properties. In all of these cases, PTR_TO_DYNPTR shouldn't be allowed to be passed to such helpers, however the current code simply returns 0. The rejection for helpers that release the dynptr is already handled. For fixing this, we take a step back and rework existing code in a way that will allow fitting in all classes of helpers and have a coherent model for dealing with the variety of use cases in which dynptr is used. First, for ARG_PTR_TO_DYNPTR, it can either be set alone or together with a DYNPTR_TYPE_* constant that denotes the only type it accepts. Next, helpers which initialize a dynptr use MEM_UNINIT to indicate this fact. To make the distinction clear, use MEM_RDONLY flag to indicate that the helper only operates on the memory pointed to by the dynptr, not the dynptr itself. In C parlance, it would be equivalent to taking the dynptr as a point to const argument. When either of these flags are not present, the helper is allowed to mutate both the dynptr itself and also the memory it points to. Currently, the read only status of the memory is not tracked in the dynptr, but it would be trivial to add this support inside dynptr state of the register. With these changes and renaming PTR_TO_DYNPTR to CONST_PTR_TO_DYNPTR to better reflect its usage, it can no longer be passed to helpers that initialize a dynptr, i.e. bpf_dynptr_from_mem, bpf_ringbuf_reserve_dynptr. A note to reviewers is that in code that does mark_stack_slots_dynptr, and unmark_stack_slots_dynptr, we implicitly rely on the fact that PTR_TO_STACK reg is the only case that can reach that code path, as one cannot pass CONST_PTR_TO_DYNPTR to helpers that don't set MEM_RDONLY. In both cases such helpers won't be setting that flag. The next patch will add a couple of selftest cases to make sure this doesn't break. Fixes: 205715673844 ("bpf: Add bpf_user_ringbuf_drain() helper") Acked-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221207204141.308952-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08bpf: Propagate errors from process_* checks in check_func_argKumar Kartikeya Dwivedi
Currently, we simply ignore the errors in process_spin_lock, process_timer_func, process_kptr_func, process_dynptr_func. Instead, bubble up the error by storing and checking err variable. Acked-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221207204141.308952-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08bpf: Refactor ARG_PTR_TO_DYNPTR checks into process_dynptr_funcKumar Kartikeya Dwivedi
ARG_PTR_TO_DYNPTR is akin to ARG_PTR_TO_TIMER, ARG_PTR_TO_KPTR, where the underlying register type is subjected to more special checks to determine the type of object represented by the pointer and its state consistency. Move dynptr checks to their own 'process_dynptr_func' function so that is consistent and in-line with existing code. This also makes it easier to reuse this code for kfunc handling. Then, reuse this consolidated function in kfunc dynptr handling too. Note that for kfuncs, the arg_type constraint of DYNPTR_TYPE_LOCAL has been lifted. Acked-by: David Vernet <void@manifault.com> Acked-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221207204141.308952-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08bpf: Skip rcu_barrier() if rcu_trace_implies_rcu_gp() is trueHou Tao
If there are pending rcu callback, free_mem_alloc() will use rcu_barrier_tasks_trace() and rcu_barrier() to wait for the pending __free_rcu_tasks_trace() and __free_rcu() callback. If rcu_trace_implies_rcu_gp() is true, there will be no pending __free_rcu(), so it will be OK to skip rcu_barrier() as well. Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20221209010947.3130477-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08bpf: Reuse freed element in free_by_rcu during allocationHou Tao
When there are batched freeing operations on a specific CPU, part of the freed elements ((high_watermark - lower_watermark) / 2 + 1) will be indirectly moved into waiting_for_gp list through free_by_rcu list. After call_rcu_in_progress becomes false again, the remaining elements in free_by_rcu list will be moved to waiting_for_gp list by the next invocation of free_bulk(). However if the expiration of RCU tasks trace grace period is relatively slow, none element in free_by_rcu list will be moved. So instead of invoking __alloc_percpu_gfp() or kmalloc_node() to allocate a new object, in alloc_bulk() just check whether or not there is freed element in free_by_rcu list and reuse it if available. Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20221209010947.3130477-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08bpf: Fix comment error in fixup_kfunc_call functionYang Jihong
insn->imm for kfunc is the relative address of __bpf_call_base, instead of __bpf_base_call, Fix the comment error. Signed-off-by: Yang Jihong <yangjihong1@huawei.com> Link: https://lore.kernel.org/r/20221208013724.257848-1-yangjihong1@huawei.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08bpf: Do not zero-extend kfunc return valuesBjörn Töpel
In BPF all global functions, and BPF helpers return a 64-bit value. For kfunc calls, this is not the case, and they can return e.g. 32-bit values. The return register R0 for kfuncs calls can therefore be marked as subreg_def != DEF_NOT_SUBREG. In general, if a register is marked with subreg_def != DEF_NOT_SUBREG, some archs (where bpf_jit_needs_zext() returns true) require the verifier to insert explicit zero-extension instructions. For kfuncs calls, however, the caller should do sign/zero extension for return values. In other words, the compiler is responsible to insert proper instructions, not the verifier. An example, provided by Yonghong Song: $ cat t.c extern unsigned foo(void); unsigned bar1(void) { return foo(); } unsigned bar2(void) { if (foo()) return 10; else return 20; } $ clang -target bpf -mcpu=v3 -O2 -c t.c && llvm-objdump -d t.o t.o: file format elf64-bpf Disassembly of section .text: 0000000000000000 <bar1>: 0: 85 10 00 00 ff ff ff ff call -0x1 1: 95 00 00 00 00 00 00 00 exit 0000000000000010 <bar2>: 2: 85 10 00 00 ff ff ff ff call -0x1 3: bc 01 00 00 00 00 00 00 w1 = w0 4: b4 00 00 00 14 00 00 00 w0 = 0x14 5: 16 01 01 00 00 00 00 00 if w1 == 0x0 goto +0x1 <LBB1_2> 6: b4 00 00 00 0a 00 00 00 w0 = 0xa 0000000000000038 <LBB1_2>: 7: 95 00 00 00 00 00 00 00 exit If the return value of 'foo()' is used in the BPF program, the proper zero-extension will be done. Currently, the verifier correctly marks, say, a 32-bit return value as subreg_def != DEF_NOT_SUBREG, but will fail performing the actual zero-extension, due to a verifier bug in opt_subreg_zext_lo32_rnd_hi32(). load_reg is not properly set to R0, and the following path will be taken: if (WARN_ON(load_reg == -1)) { verbose(env, "verifier bug. zext_dst is set, but no reg is defined\n"); return -EFAULT; } A longer discussion from v1 can be found in the link below. Correct the verifier by avoiding doing explicit zero-extension of R0 for kfunc calls. Note that R0 will still be marked as a sub-register for return values smaller than 64-bit. Fixes: 83a2881903f3 ("bpf: Account for BPF_FETCH in insn_has_def32()") Link: https://lore.kernel.org/bpf/20221202103620.1915679-1-bjorn@kernel.org/ Suggested-by: Yonghong Song <yhs@meta.com> Signed-off-by: Björn Töpel <bjorn@rivosinc.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221207103540.396496-1-bjorn@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-07bpf/docs: Document struct cgroup * kfuncsDavid Vernet
bpf_cgroup_acquire(), bpf_cgroup_release(), bpf_cgroup_kptr_get(), and bpf_cgroup_ancestor(), are kfuncs that were recently added to kernel/bpf/helpers.c. These are "core" kfuncs in that they're available for use in any tracepoint or struct_ops BPF program. Though they have no ABI stability guarantees, we should still document them. This patch adds a struct cgroup * subsection to the Core kfuncs section which describes each of these kfuncs. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20221207204911.873646-3-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-07bpf/docs: Document struct task_struct * kfuncsDavid Vernet
bpf_task_acquire(), bpf_task_release(), and bpf_task_from_pid() are kfuncs that were recently added to kernel/bpf/helpers.c. These are "core" kfuncs in that they're available for use for any tracepoint or struct_ops BPF program. Though they have no ABI stability guarantees, we should still document them. This patch adds a new Core kfuncs section to the BPF kfuncs doc, and adds entries for all of these task kfuncs. Note that bpf_task_kptr_get() is not documented, as it still returns NULL while we're working to resolve how it can use RCU to ensure struct task_struct * lifetime. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20221207204911.873646-2-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-07bpf: Remove unused insn_cnt argument from visit_[func_call_]insn()Andrii Nakryiko
Number of total instructions in BPF program (including subprogs) can and is accessed from env->prog->len. visit_func_call_insn() doesn't do any checks against insn_cnt anymore, relying on push_insn() to do this check internally. So remove unnecessary insn_cnt input argument from visit_func_call_insn() and visit_insn() functions. Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20221207195534.2866030-1-andrii@kernel.org
2022-12-07Merge "do not rely on ALLOW_ERROR_INJECTION for fmod_ret" into bpf-nextAlexei Starovoitov
Merge commit 5b481acab4ce ("bpf: do not rely on ALLOW_ERROR_INJECTION for fmod_ret") from hid tree into bpf-next. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-07bpf: do not rely on ALLOW_ERROR_INJECTION for fmod_retBenjamin Tissoires
The current way of expressing that a non-bpf kernel component is willing to accept that bpf programs can be attached to it and that they can change the return value is to abuse ALLOW_ERROR_INJECTION. This is debated in the link below, and the result is that it is not a reasonable thing to do. Reuse the kfunc declaration structure to also tag the kernel functions we want to be fmodret. This way we can control from any subsystem which functions are being modified by bpf without touching the verifier. Link: https://lore.kernel.org/all/20221121104403.1545f9b5@gandalf.local.home/ Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20221206145936.922196-2-benjamin.tissoires@redhat.com
2022-12-06bpf: remove unnecessary prune and jump pointsAndrii Nakryiko
Don't mark some instructions as jump points when there are actually no jumps and instructions are just processed sequentially. Such case is handled naturally by precision backtracking logic without the need to update jump history. See get_prev_insn_idx(). It goes back linearly by one instruction, unless current top of jmp_history is pointing to current instruction. In such case we use `st->jmp_history[cnt - 1].prev_idx` to find instruction from which we jumped to the current instruction non-linearly. Also remove both jump and prune point marking for instruction right after unconditional jumps, as program flow can get to the instruction right after unconditional jump instruction only if there is a jump to that instruction from somewhere else in the program. In such case we'll mark such instruction as prune/jump point because it's a destination of a jump. This change has no changes in terms of number of instructions or states processes across Cilium and selftests programs. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20221206233345.438540-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-06bpf: mostly decouple jump history management from is_state_visited()Andrii Nakryiko
Jump history updating and state equivalence checks are conceptually independent, so move push_jmp_history() out of is_state_visited(). Also make a decision whether to perform state equivalence checks or not one layer higher in do_check(), keeping is_state_visited() unconditionally performing state checks. push_jmp_history() should be performed after state checks. There is just one small non-uniformity. When is_state_visited() finds already validated equivalent state, it propagates precision marks to current state's parent chain. For this to work correctly, jump history has to be updated, so is_state_visited() is doing that internally. But if no equivalent verified state is found, jump history has to be updated in a newly cloned child state, so is_jmp_point() + push_jmp_history() is performed after is_state_visited() exited with zero result, which means "proceed with validation". This change has no functional changes. It's not strictly necessary, but feels right to decouple these two processes. Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20221206233345.438540-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-06bpf: decouple prune and jump pointsAndrii Nakryiko
BPF verifier marks some instructions as prune points. Currently these prune points serve two purposes. It's a point where verifier tries to find previously verified state and check current state's equivalence to short circuit verification for current code path. But also currently it's a point where jump history, used for precision backtracking, is updated. This is done so that non-linear flow of execution could be properly backtracked. Such coupling is coincidental and unnecessary. Some prune points are not part of some non-linear jump path, so don't need update of jump history. On the other hand, not all instructions which have to be recorded in jump history necessarily are good prune points. This patch splits prune and jump points into independent flags. Currently all prune points are marked as jump points to minimize amount of changes in this patch, but next patch will perform some optimization of prune vs jmp point placement. No functional changes are intended. Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20221206233345.438540-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>