linux/linux-stable.git - Linux kernel stable tree

Age	Commit message (Collapse)	Author
2023-03-13	selftests/bpf: use canonical ftrace path	Ross Zwisler
	The canonical location for the tracefs filesystem is at /sys/kernel/tracing. But, from Documentation/trace/ftrace.rst: Before 4.1, all ftrace tracing control files were within the debugfs file system, which is typically located at /sys/kernel/debug/tracing. For backward compatibility, when mounting the debugfs file system, the tracefs file system will be automatically mounted at: /sys/kernel/debug/tracing Many tests in the bpf selftest code still refer to this older debugfs path, so let's update them to avoid confusion. Signed-off-by: Ross Zwisler <zwisler@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20230313205628.1058720-3-zwisler@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-13	Merge branch 'bpf: Allow reads from uninit stack'	Alexei Starovoitov
	Merge commit bf9bec4cb3a4 ("Merge branch 'bpf: Allow reads from uninit stack'") from bpf-next to bpf tree to address verification issues in some programs due to stack usage. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10	selftests/bpf: Add local kptr stashing test	Dave Marchevsky
	Add a new selftest, local_kptr_stash, which uses bpf_kptr_xchg to stash a bpf_obj_new-allocated object in a map. Test the following scenarios: * Stash two rb_nodes in an arraymap, don't unstash them, rely on map free to destruct them * Stash two rb_nodes in an arraymap, unstash the second one in a separate program, rely on map free to destruct first Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230310230743.2320707-4-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10	selftests/bpf: Fix progs/test_deny_namespace.c issues.	Alexei Starovoitov
	The following build error can be seen: progs/test_deny_namespace.c:22:19: error: call to undeclared function 'BIT_LL'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration] __u64 cap_mask = BIT_LL(CAP_SYS_ADMIN); The struct kernel_cap_struct no longer exists in the kernel as well. Adjust bpf prog to fix both issues. Fixes: f122a08b197d ("capability: just use a 'u64' instead of a 'u32[2]' array") Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10	selftests/bpf: Fix progs/find_vma_fail1.c build error.	Alexei Starovoitov
	The commit 11e456cae91e ("selftests/bpf: Fix compilation errors: Assign a value to a constant") fixed the issue cleanly in bpf-next. This is an alternative fix in bpf tree to avoid merge conflict between bpf and bpf-next. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10	selftests/bpf: Add local-storage-create benchmark	Martin KaFai Lau
	This patch tests how many kmallocs is needed to create and free a batch of UDP sockets and each socket has a 64bytes bpf storage. It also measures how fast the UDP sockets can be created. The result is from my qemu setup. Before bpf_mem_cache_alloc/free: ./bench -p 1 local-storage-create Setting up benchmark 'local-storage-create'... Benchmark 'local-storage-create' started. Iter 0 ( 73.193us): creates 213.552k/s (213.552k/prod), 3.09 kmallocs/create Iter 1 (-20.724us): creates 211.908k/s (211.908k/prod), 3.09 kmallocs/create Iter 2 ( 9.280us): creates 212.574k/s (212.574k/prod), 3.12 kmallocs/create Iter 3 ( 11.039us): creates 213.209k/s (213.209k/prod), 3.12 kmallocs/create Iter 4 (-11.411us): creates 213.351k/s (213.351k/prod), 3.12 kmallocs/create Iter 5 ( -7.915us): creates 214.754k/s (214.754k/prod), 3.12 kmallocs/create Iter 6 ( 11.317us): creates 210.942k/s (210.942k/prod), 3.12 kmallocs/create Summary: creates 212.789 ± 1.310k/s (212.789k/prod), 3.12 kmallocs/create After bpf_mem_cache_alloc/free: ./bench -p 1 local-storage-create Setting up benchmark 'local-storage-create'... Benchmark 'local-storage-create' started. Iter 0 ( 68.265us): creates 243.984k/s (243.984k/prod), 1.04 kmallocs/create Iter 1 ( 30.357us): creates 238.424k/s (238.424k/prod), 1.04 kmallocs/create Iter 2 (-18.712us): creates 232.963k/s (232.963k/prod), 1.04 kmallocs/create Iter 3 (-15.885us): creates 238.879k/s (238.879k/prod), 1.04 kmallocs/create Iter 4 ( 5.590us): creates 237.490k/s (237.490k/prod), 1.04 kmallocs/create Iter 5 ( 8.577us): creates 237.521k/s (237.521k/prod), 1.04 kmallocs/create Iter 6 ( -6.263us): creates 238.508k/s (238.508k/prod), 1.04 kmallocs/create Summary: creates 237.298 ± 2.198k/s (237.298k/prod), 1.04 kmallocs/create Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-18-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10	selftests/bpf: Check freeing sk->sk_local_storage with ↵	Martin KaFai Lau
	sk_local_storage->smap is NULL This patch tweats the socket_bind bpf prog to test the local_storage->smap == NULL case in the bpf_local_storage_free() code path. The idea is to create the local_storage with the sk_storage_map's selem first. Then add the sk_storage_map2's selem and then delete the earlier sk_storeage_map's selem. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-17-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10	selftests/bpf: fix lots of silly mistakes pointed out by compiler	Andrii Nakryiko
	Once we enable -Wall for BPF sources, compiler will complain about lots of unused variables, variables that are set but never read, etc. Fix all these issues first before enabling -Wall in Makefile. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230309054015.4068562-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10	selftests/bpf: add __sink() macro to fake variable consumption	Andrii Nakryiko
	Add __sink(expr) macro that forces compiler to believe that passed in expression is both read and written. It used a simple embedded asm for this. This is useful in a lot of tests where we assign value to some variable to trigger some action, but later don't read variable, causing compiler to complain (if corresponding compiler warnings are turned on, which we'll do in the next patch). Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230309054015.4068562-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-10	selftests/bpf: prevent unused variable warning in bpf_for()	Andrii Nakryiko
	Add __attribute__((unused)) to inner __p variable inside bpf_for(), bpf_for_each(), and bpf_repeat() macros to avoid compiler warnings about unused variable. Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230309054015.4068562-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-09	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net	Jakub Kicinski
	Documentation/bpf/bpf_devel_QA.rst b7abcd9c656b ("bpf, doc: Link to submitting-patches.rst for general patch submission info") d56b0c461d19 ("bpf, docs: Fix link to netdev-FAQ target") https://lore.kernel.org/all/20230307095812.236eb1be@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-09	selftests/bpf: Workaround verification failure for ↵	Yonghong Song
	fexit_bpf2bpf/func_replace_return_code With latest llvm17, selftest fexit_bpf2bpf/func_replace_return_code has the following verification failure: 0: R1=ctx(off=0,imm=0) R10=fp0 ; int connect_v4_prog(struct bpf_sock_addr ctx) 0: (bf) r7 = r1 ; R1=ctx(off=0,imm=0) R7_w=ctx(off=0,imm=0) 1: (b4) w6 = 0 ; R6_w=0 ; memset(&tuple.ipv4.saddr, 0, sizeof(tuple.ipv4.saddr)); ... ; return do_bind(ctx) ? 1 : 0; 179: (bf) r1 = r7 ; R1=ctx(off=0,imm=0) R7=ctx(off=0,imm=0) 180: (85) call pc+147 Func#3 is global and valid. Skipping. 181: R0_w=scalar() 181: (bc) w6 = w0 ; R0_w=scalar() R6_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) 182: (05) goto pc-129 ; } 54: (bc) w0 = w6 ; R0_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R6_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) 55: (95) exit At program exit the register R0 has value (0x0; 0xffffffff) should have been in (0x0; 0x1) processed 281 insns (limit 1000000) max_states_per_insn 1 total_states 26 peak_states 26 mark_read 13 -- END PROG LOAD LOG -- libbpf: prog 'connect_v4_prog': failed to load: -22 The corresponding source code: __attribute__ ((noinline)) int do_bind(struct bpf_sock_addr ctx) { struct sockaddr_in sa = {}; sa.sin_family = AF_INET; sa.sin_port = bpf_htons(0); sa.sin_addr.s_addr = bpf_htonl(SRC_REWRITE_IP4); if (bpf_bind(ctx, (struct sockaddr )&sa, sizeof(sa)) != 0) return 0; return 1; } ... SEC("cgroup/connect4") int connect_v4_prog(struct bpf_sock_addr ctx) { ... return do_bind(ctx) ? 1 : 0; } Insn 180 is a call to 'do_bind'. The call's return value is also the return value for the program. Since do_bind() returns 0/1, so it is legitimate for compiler to optimize 'return do_bind(ctx) ? 1 : 0' to 'return do_bind(ctx)'. However, such optimization breaks verifier as the return value of 'do_bind()' is marked as any scalar which violates the requirement of prog return value 0/1. There are two ways to fix this problem, (1) changing 'return 1' in do_bind() to e.g. 'return 10' so the compiler has to do 'do_bind(ctx) ? 1 :0', or (2) suggested by Andrii, marking do_bind() with __weak attribute so the compiler cannot make any assumption on do_bind() return value. This patch adopted adding __weak approach which is simpler and more resistant to potential compiler optimizations. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230310012410.2920570-1-yhs@fb.com
2023-03-08	selftests/bpf: implement and test custom testmod_seq iterator	Andrii Nakryiko
	Implement a trivial iterator returning same specified integer value N times as part of bpf_testmod kernel module. Add selftests to validate everything works end to end. We also reuse these tests as "verification-only" tests to validate that kernel prints the state of custom kernel module-defined iterator correctly: fp-16=iter_testmod_seq(ref_id=1,state=drained,depth=0) "testmod_seq" part is an iterator type, and is coming from module's BTF data dynamically at runtime. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230308184121.1165081-9-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-08	selftests/bpf: add number iterator tests	Andrii Nakryiko
	Add number iterator (bpf_iter_num_{new,next,destroy}()) tests, validating the correct handling of various corner and common cases at runtime. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230308184121.1165081-8-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-08	selftests/bpf: add iterators tests	Andrii Nakryiko
	Add various tests for open-coded iterators. Some of them excercise various possible coding patterns in C, some go down to low-level assembly for more control over various conditions, especially invalid ones. We also make use of bpf_for(), bpf_for_each(), bpf_repeat() macros in some of these tests. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230308184121.1165081-7-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-08	selftests/bpf: add bpf_for_each(), bpf_for(), and bpf_repeat() macros	Andrii Nakryiko
	Add bpf_for_each(), bpf_for(), and bpf_repeat() macros that make writing open-coded iterator-based loops much more convenient and natural. These macros utilize cleanup attribute to ensure proper destruction of the iterator and thanks to that manage to provide the ergonomics that is very close to C language's for() construct. Typical loop would look like: int i; int arr[N]; bpf_for(i, 0, N) { /* verifier will know that i >= 0 && i < N, so could be used to * directly access array elements with no extra checks / arr[i] = i; } bpf_repeat() is very similar, but it doesn't expose iteration number and is meant as a simple "repeat action N times" loop: bpf_repeat(N) { / whatever, N times / } Note that `break` and `continue` statements inside the {} block work as expected. bpf_for_each() is a generalization over any kind of BPF open-coded iterator allowing to use for-each-like approach instead of calling low-level bpf_iter_<type>_{new,next,destroy}() APIs explicitly. E.g.: struct cgroup cg; bpf_for_each(cgroup, cg, some, input, args) { /* do something with each cg */ } would call (not-yet-implemented) bpf_iter_cgroup_{new,next,destroy}() functions to form a loop over cgroups, where `some, input, args` are passed verbatim into constructor as bpf_iter_cgroup_new(&it, some, input, args). As a first demonstration, add pyperf variant based on the bpf_for() loop. Also clean up a few tests that either included bpf_misc.h header unnecessarily from the user-space, which is unsupported, or included it before any common types are defined (and thus leading to unnecessary compilation warnings, potentially). Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230308184121.1165081-6-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-06	selftests/bpf: Add test for legacy/perf kprobe/uprobe attach mode	Menglong Dong
	Add the testing for kprobe/uprobe attaching in default, legacy, perf and link mode. And the testing passed: ./test_progs -t attach_probe $5/1 attach_probe/manual-default:OK $5/2 attach_probe/manual-legacy:OK $5/3 attach_probe/manual-perf:OK $5/4 attach_probe/manual-link:OK $5/5 attach_probe/auto:OK $5/6 attach_probe/kprobe-sleepable:OK $5/7 attach_probe/uprobe-lib:OK $5/8 attach_probe/uprobe-sleepable:OK $5/9 attach_probe/uprobe-ref_ctr:OK $5 attach_probe:OK Summary: 1/9 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Biao Jiang <benbjiang@tencent.com> Link: https://lore.kernel.org/bpf/20230306064833.7932-4-imagedong@tencent.com
2023-03-06	selftests/bpf: Split test_attach_probe into multi subtests	Menglong Dong
	In order to adapt to the older kernel, now we split the "attach_probe" testing into multi subtests: manual // manual attach tests for kprobe/uprobe auto // auto-attach tests for kprobe and uprobe kprobe-sleepable // kprobe sleepable test uprobe-lib // uprobe tests for library function by name uprobe-sleepable // uprobe sleepable test uprobe-ref_ctr // uprobe ref_ctr test As sleepable kprobe needs to set BPF_F_SLEEPABLE flag before loading, we need to move it to a stand alone skel file, in case of it is not supported by kernel and make the whole loading fail. Therefore, we can only enable part of the subtests for older kernel. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Biao Jiang <benbjiang@tencent.com> Link: https://lore.kernel.org/bpf/20230306064833.7932-3-imagedong@tencent.com
2023-03-03	bpf: Refactor RCU enforcement in the verifier.	Alexei Starovoitov
	bpf_rcu_read_lock/unlock() are only available in clang compiled kernels. Lack of such key mechanism makes it impossible for sleepable bpf programs to use RCU pointers. Allow bpf_rcu_read_lock/unlock() in GCC compiled kernels (though GCC doesn't support btf_type_tag yet) and allowlist certain field dereferences in important data structures like tast_struct, cgroup, socket that are used by sleepable programs either as RCU pointer or full trusted pointer (which is valid outside of RCU CS). Use BTF_TYPE_SAFE_RCU and BTF_TYPE_SAFE_TRUSTED macros for such tagging. They will be removed once GCC supports btf_type_tag. With that refactor check_ptr_to_btf_access(). Make it strict in enforcing PTR_TRUSTED and PTR_UNTRUSTED while deprecating old PTR_TO_BTF_ID without modifier flags. There is a chance that this strict enforcement might break existing programs (especially on GCC compiled kernels), but this cleanup has to start sooner than later. Note PTR_TO_CTX access still yields old deprecated PTR_TO_BTF_ID. Once it's converted to strict PTR_TRUSTED or PTR_UNTRUSTED the kfuncs and helpers will be able to default to KF_TRUSTED_ARGS. KF_RCU will remain as a weaker version of KF_TRUSTED_ARGS where obj refcnt could be 0. Adjust rcu_read_lock selftest to run on gcc and clang compiled kernels. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/bpf/20230303041446.3630-7-alexei.starovoitov@gmail.com
2023-03-03	selftests/bpf: Tweak cgroup kfunc test.	Alexei Starovoitov
	Adjust cgroup kfunc test to dereference RCU protected cgroup pointer as PTR_TRUSTED and pass into KF_TRUSTED_ARGS kfunc. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/bpf/20230303041446.3630-6-alexei.starovoitov@gmail.com
2023-03-03	selftests/bpf: Add a test case for kptr_rcu.	Alexei Starovoitov
	Tweak existing map_kptr test to check kptr_rcu. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/bpf/20230303041446.3630-5-alexei.starovoitov@gmail.com
2023-03-03	bpf: Introduce kptr_rcu.	Alexei Starovoitov
	The life time of certain kernel structures like 'struct cgroup' is protected by RCU. Hence it's safe to dereference them directly from __kptr tagged pointers in bpf maps. The resulting pointer is MEM_RCU and can be passed to kfuncs that expect KF_RCU. Derefrence of other kptr-s returns PTR_UNTRUSTED. For example: struct map_value { struct cgroup __kptr cgrp; }; SEC("tp_btf/cgroup_mkdir") int BPF_PROG(test_cgrp_get_ancestors, struct cgroup cgrp_arg, const char path) { struct cgroup cg, *cg2; cg = bpf_cgroup_acquire(cgrp_arg); // cg is PTR_TRUSTED and ref_obj_id > 0 bpf_kptr_xchg(&v->cgrp, cg); cg2 = v->cgrp; // This is new feature introduced by this patch. // cg2 is PTR_MAYBE_NULL \| MEM_RCU. // When cg2 != NULL, it's a valid cgroup, but its percpu_ref could be zero if (cg2) bpf_cgroup_ancestor(cg2, level); // safe to do. } Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/bpf/20230303041446.3630-4-alexei.starovoitov@gmail.com
2023-03-03	bpf: Rename __kptr_ref -> __kptr and __kptr -> __kptr_untrusted.	Alexei Starovoitov
	__kptr meant to store PTR_UNTRUSTED kernel pointers inside bpf maps. The concept felt useful, but didn't get much traction, since bpf_rdonly_cast() was added soon after and bpf programs received a simpler way to access PTR_UNTRUSTED kernel pointers without going through restrictive __kptr usage. Rename __kptr_ref -> __kptr and __kptr -> __kptr_untrusted to indicate its intended usage. The main goal of __kptr_untrusted was to read/write such pointers directly while bpf_kptr_xchg was a mechanism to access refcnted kernel pointers. The next patch will allow RCU protected __kptr access with direct read. At that point __kptr_untrusted will be deprecated. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/bpf/20230303041446.3630-2-alexei.starovoitov@gmail.com
2023-03-02	selftests/bpf: Add absolute timer test	Tero Kristo
	Add test for the absolute BPF timer under the existing timer tests. This will run the timer two times with 1us expiration time, and then re-arm the timer at ~35s in the future. At the end, it is verified that the absolute timer expired exactly two times. Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com> Link: https://lore.kernel.org/r/20230302114614.2985072-3-tero.kristo@linux.intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-02	selftests/bpf: Add -Wuninitialized flag to bpf prog flags	Dave Marchevsky
	Per C99 standard [0], Section 6.7.8, Paragraph 10: If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate. And in the same document, in appendix "J.2 Undefined behavior": The behavior is undefined in the following circumstances: [...] The value of an object with automatic storage duration is used while it is indeterminate (6.2.4, 6.7.8, 6.8). This means that use of an uninitialized stack variable is undefined behavior, and therefore that clang can choose to do a variety of scary things, such as not generating bytecode for "bunch of useful code" in the below example: void some_func() { int i; if (!i) return; // bunch of useful code } To add insult to injury, if some_func above is a helper function for some BPF program, clang can choose to not generate an "exit" insn, causing verifier to fail with "last insn is not an exit or jmp". Going from that verification failure to the root cause of uninitialized use is certain to be frustrating. This patch adds -Wuninitialized to the cflags for selftest BPF progs and fixes up existing instances of uninitialized use. [0]: https://www.open-std.org/jtc1/sc22/WG14/www/docs/n1256.pdf Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Cc: David Vernet <void@manifault.com> Cc: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230303005500.1614874-1-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-01	selftests/bpf: Support custom per-test flags and multiple expected messages	Andrii Nakryiko
	Extend __flag attribute by allowing to specify one of the following: * BPF_F_STRICT_ALIGNMENT * BPF_F_ANY_ALIGNMENT * BPF_F_TEST_RND_HI32 * BPF_F_TEST_STATE_FREQ * BPF_F_SLEEPABLE * BPF_F_XDP_HAS_FRAGS * Some numeric value Extend __msg attribute by allowing to specify multiple exepcted messages. All messages are expected to be present in the verifier log in the order of application. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230301175417.3146070-2-eddyz87@gmail.com [ Eduard: added commit message, formatting, comments ]
2023-03-01	selftests/bpf: Add more tests for kptrs in maps	Kumar Kartikeya Dwivedi
	Firstly, ensure programs successfully load when using all of the supported maps. Then, extend existing tests to test more cases at runtime. We are currently testing both the synchronous freeing of items and asynchronous destruction when map is freed, but the code needs to be adjusted a bit to be able to also accomodate support for percpu maps. We now do a delete on the item (and update for array maps which has a similar effect for kptrs) to perform a synchronous free of the kptr, and test destruction both for the synchronous and asynchronous deletion. Next time the program runs, it should observe the refcount as 1 since all existing references should have been released by then. By running the program after both possible paths freeing kptrs, we establish that they correctly release resources. Next, we augment the existing test to also test the same code path shared by all local storage maps using a task local storage map. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230225154010.391965-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-01	selftests/bpf: tests for using dynptrs to parse skb and xdp buffers	Joanne Koong
	Test skb and xdp dynptr functionality in the following ways: 1) progs/test_cls_redirect_dynptr.c * Rewrite "progs/test_cls_redirect.c" test to use dynptrs to parse skb data * This is a great example of how dynptrs can be used to simplify a lot of the parsing logic for non-statically known values. When measuring the user + system time between the original version vs. using dynptrs, and averaging the time for 10 runs (using "time ./test_progs -t cls_redirect"): original version: 0.092 sec with dynptrs: 0.078 sec 2) progs/test_xdp_dynptr.c * Rewrite "progs/test_xdp.c" test to use dynptrs to parse xdp data When measuring the user + system time between the original version vs. using dynptrs, and averaging the time for 10 runs (using "time ./test_progs -t xdp_attach"): original version: 0.118 sec with dynptrs: 0.094 sec 3) progs/test_l4lb_noinline_dynptr.c * Rewrite "progs/test_l4lb_noinline.c" test to use dynptrs to parse skb data When measuring the user + system time between the original version vs. using dynptrs, and averaging the time for 10 runs (using "time ./test_progs -t l4lb_all"): original version: 0.062 sec with dynptrs: 0.081 sec For number of processed verifier instructions: original version: 6268 insns with dynptrs: 2588 insns 4) progs/test_parse_tcp_hdr_opt_dynptr.c * Add sample code for parsing tcp hdr opt lookup using dynptrs. This logic is lifted from a real-world use case of packet parsing in katran [0], a layer 4 load balancer. The original version "progs/test_parse_tcp_hdr_opt.c" (not using dynptrs) is included here as well, for comparison. When measuring the user + system time between the original version vs. using dynptrs, and averaging the time for 10 runs (using "time ./test_progs -t parse_tcp_hdr_opt"): original version: 0.031 sec with dynptrs: 0.045 sec 5) progs/dynptr_success.c * Add test case "test_skb_readonly" for testing attempts at writes on a prog type with read-only skb ctx. * Add "test_dynptr_skb_data" for testing that bpf_dynptr_data isn't supported for skb progs. 6) progs/dynptr_fail.c * Add test cases "skb_invalid_data_slice{1,2,3,4}" and "xdp_invalid_data_slice{1,2}" for testing that helpers that modify the underlying packet buffer automatically invalidate the associated data slice. * Add test cases "skb_invalid_ctx" and "xdp_invalid_ctx" for testing that prog types that do not support bpf_dynptr_from_skb/xdp don't have access to the API. * Add test case "dynptr_slice_var_len{1,2}" for testing that variable-sized len can't be passed in to bpf_dynptr_slice * Add test case "skb_invalid_slice_write" for testing that writes to a read-only data slice are rejected by the verifier. * Add test case "data_slice_out_of_bounds_skb" for testing that writes to an area outside the slice are rejected. * Add test case "invalid_slice_rdwr_rdonly" for testing that prog types that don't allow writes to packet data don't accept any calls to bpf_dynptr_slice_rdwr. [0] https://github.com/facebookincubator/katran/blob/main/katran/lib/bpf/pckt_parsing.h Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230301154953.641654-11-joannelkoong@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-01	capability: just use a 'u64' instead of a 'u32[2]' array	Linus Torvalds
	Back in 2008 we extended the capability bits from 32 to 64, and we did it by extending the single 32-bit capability word from one word to an array of two words. It was then obfuscated by hiding the "2" behind two macro expansions, with the reasoning being that maybe it gets extended further some day. That reasoning may have been valid at the time, but the last thing we want to do is to extend the capability set any more. And the array of values not only causes source code oddities (with loops to deal with it), but also results in worse code generation. It's a lose-lose situation. So just change the 'u32[2]' into a 'u64' and be done with it. We still have to deal with the fact that the user space interface is designed around an array of these 32-bit values, but that was the case before too, since the array layouts were different (ie user space doesn't use an array of 32-bit values for individual capability masks, but an array of 32-bit slices of multiple masks). So that marshalling of data is actually simplified too, even if it does remain somewhat obscure and odd. This was all triggered by my reaction to the new "cap_isidentical()" introduced recently. By just using a saner data structure, it went from unsigned __capi; CAP_FOR_EACH_U32(__capi) { if (a.cap[__capi] != b.cap[__capi]) return false; } return true; to just being return a.val == b.val; instead. Which is rather more obvious both to humans and to compilers. Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: Serge Hallyn <serge@hallyn.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Paul Moore <paul@paul-moore.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-02-27	selftests/bpf: Fix compilation errors: Assign a value to a constant	Rong Tao
	Commit bc292ab00f6c("mm: introduce vma->vm_flags wrapper functions") turns the vm_flags into a const variable. Added bpf_find_vma test in commit f108662b27c9("selftests/bpf: Add tests for bpf_find_vma") to assign values to variables that declare const in find_vma_fail1.c programs, which is an error to the compiler and does not test BPF verifiers. It is better to replace 'const vm_flags_t vm_flags' with 'unsigned long vm_start' for testing. $ make -C tools/testing/selftests/bpf/ -j8 ... progs/find_vma_fail1.c:16:16: error: cannot assign to non-static data member 'vm_flags' with const-qualified type 'const vm_flags_t' (aka 'const unsigned long') vma->vm_flags \|= 0x55; ~~~~~~~~~~~~~ ^ ../tools/testing/selftests/bpf/tools/include/vmlinux.h:1898:20: note: non-static data member 'vm_flags' declared const here const vm_flags_t vm_flags; ~~~~~~~~~~~`~~~~~~^~~~~~~~ Signed-off-by: Rong Tao <rongtao@cestc.cn> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/tencent_CB281722B3C1BD504C16CDE586CACC2BE706@qq.com
2023-02-27	selftests/bpf: Use __NR_prlimit64 instead of __NR_getrlimit in user_ringbuf test	Tiezhu Yang
	After commit 80d7da1cac62 ("asm-generic: Drop getrlimit and setrlimit syscalls from default list"), new architectures won't need to include getrlimit and setrlimit, they are superseded with prlimit64. In order to maintain compatibility for the new architectures, such as LoongArch which does not define __NR_getrlimit, it is better to use __NR_prlimit64 instead of __NR_getrlimit in user_ringbuf test to fix the following build error: TEST-OBJ [test_progs] user_ringbuf.test.o tools/testing/selftests/bpf/prog_tests/user_ringbuf.c: In function 'kick_kernel_cb': tools/testing/selftests/bpf/prog_tests/user_ringbuf.c:593:17: error: '__NR_getrlimit' undeclared (first use in this function) 593 \| syscall(__NR_getrlimit); \| ^~~~~~~~~~~~~~ tools/testing/selftests/bpf/prog_tests/user_ringbuf.c:593:17: note: each undeclared identifier is reported only once for each function it appears in make: *** [Makefile:573: tools/testing/selftests/bpf/user_ringbuf.test.o] Error 1 make: Leaving directory 'tools/testing/selftests/bpf' Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/1677235015-21717-4-git-send-email-yangtiezhu@loongson.cn
2023-02-23	selftests/bpf: Add a test case for bpf_cgroup_from_id()	Tejun Heo
	Add a test case for bpf_cgroup_from_id. Signed-off-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/Y/bBlt+tPozcQgws@slm.duckdns.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-22	selftests/bpf: Fix BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL for empty flow label	Stanislav Fomichev
	Kernel's flow dissector continues to parse the packet when the (optional) IPv6 flow label is empty even when instructed to stop (via BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL). Do the same in our reference BPF reimplementation. Signed-off-by: Stanislav Fomichev <sdf@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20230221180518.2139026-1-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-22	selftests/bpf: Tests for uninitialized stack reads	Eduard Zingerman
	Three testcases to make sure that stack reads from uninitialized locations are accepted by verifier when executed in privileged mode: - read from a fixed offset; - read from a variable offset; - passing a pointer to stack to a helper converts STACK_INVALID to STACK_MISC. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230219200427.606541-3-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-22	bpf: Allow reads from uninit stack	Eduard Zingerman
	This commits updates the following functions to allow reads from uninitialized stack locations when env->allow_uninit_stack option is enabled: - check_stack_read_fixed_off() - check_stack_range_initialized(), called from: - check_stack_read_var_off() - check_helper_mem_access() Such change allows to relax logic in stacksafe() to treat STACK_MISC and STACK_INVALID in a same way and make the following stack slot configurations equivalent: \| Cached state \| Current state \| \| stack slot \| stack slot \| \|------------------+------------------\| \| STACK_INVALID or \| STACK_INVALID or \| \| STACK_MISC \| STACK_SPILL or \| \| \| STACK_MISC or \| \| \| STACK_ZERO or \| \| \| STACK_DYNPTR \| This leads to significant verification speed gains (see below). The idea was suggested by Andrii Nakryiko [1] and initial patch was created by Alexei Starovoitov [2]. Currently the env->allow_uninit_stack is allowed for programs loaded by users with CAP_PERFMON or CAP_SYS_ADMIN capabilities. A number of test cases from verifier/.c were expecting uninitialized stack access to be an error. These test cases were updated to execute in unprivileged mode (thus preserving the tests). The test progs/test_global_func10.c expected "invalid indirect read from stack" error message because of the access to uninitialized memory region. This error is no longer possible in privileged mode. The test is updated to provoke an error "invalid indirect access to stack" because of access to invalid stack address (such error is not verified by progs/test_global_func.c series of tests). The following tests had to be removed because these can't be made unprivileged: - verifier/sock.c: - "sk_storage_get(map, skb->sk, &stack_value, 1): partially init stack_value" BPF_PROG_TYPE_SCHED_CLS programs are not executed in unprivileged mode. - verifier/var_off.c: - "indirect variable-offset stack access, max_off+size > max_initialized" - "indirect variable-offset stack access, uninitialized" These tests verify that access to uninitialized stack values is detected when stack offset is not a constant. However, variable stack access is prohibited in unprivileged mode, thus these tests are no longer valid. * * * Here is veristat log comparing this patch with current master on a set of selftest binaries listed in tools/testing/selftests/bpf/veristat.cfg and cilium BPF binaries (see [3]): $ ./veristat -e file,prog,states -C -f 'states_pct<-30' master.log current.log File Program States (A) States (B) States (DIFF) -------------------------- -------------------------- ---------- ---------- ---------------- bpf_host.o tail_handle_ipv6_from_host 349 244 -105 (-30.09%) bpf_host.o tail_handle_nat_fwd_ipv4 1320 895 -425 (-32.20%) bpf_lxc.o tail_handle_nat_fwd_ipv4 1320 895 -425 (-32.20%) bpf_sock.o cil_sock4_connect 70 48 -22 (-31.43%) bpf_sock.o cil_sock4_sendmsg 68 46 -22 (-32.35%) bpf_xdp.o tail_handle_nat_fwd_ipv4 1554 803 -751 (-48.33%) bpf_xdp.o tail_lb_ipv4 6457 2473 -3984 (-61.70%) bpf_xdp.o tail_lb_ipv6 7249 3908 -3341 (-46.09%) pyperf600_bpf_loop.bpf.o on_event 287 145 -142 (-49.48%) strobemeta.bpf.o on_event 15915 4772 -11143 (-70.02%) strobemeta_nounroll2.bpf.o on_event 17087 3820 -13267 (-77.64%) xdp_synproxy_kern.bpf.o syncookie_tc 21271 6635 -14636 (-68.81%) xdp_synproxy_kern.bpf.o syncookie_xdp 23122 6024 -17098 (-73.95%) -------------------------- -------------------------- ---------- ---------- ---------------- Note: I limited selection by states_pct<-30%. Inspection of differences in pyperf600_bpf_loop behavior shows that the following patch for the test removes almost all differences: - a/tools/testing/selftests/bpf/progs/pyperf.h + b/tools/testing/selftests/bpf/progs/pyperf.h @ -266,8 +266,8 @ int __on_event(struct bpf_raw_tracepoint_args ctx) } if (event->pthread_match \|\| !pidData->use_tls) { - void frame_ptr; - FrameData frame; + void* frame_ptr = 0; + FrameData frame = {}; Symbol sym = {}; int cur_cpu = bpf_get_smp_processor_id(); W/o this patch the difference comes from the following pattern (for different variables): static bool get_frame_data(... FrameData frame ...) { ... bpf_probe_read_user(&frame->f_code, ...); if (!frame->f_code) return false; ... bpf_probe_read_user(&frame->co_name, ...); if (frame->co_name) ...; } int __on_event(struct bpf_raw_tracepoint_args ctx) { FrameData frame; ... get_frame_data(... &frame ...) // indirectly via a bpf_loop & callback ... } SEC("raw_tracepoint/kfree_skb") int on_event(struct bpf_raw_tracepoint_args* ctx) { ... ret \|= __on_event(ctx); ret \|= __on_event(ctx); ... } With regards to value `frame->co_name` the following is important: - Because of the conditional `if (!frame->f_code)` each call to __on_event() produces two states, one with `frame->co_name` marked as STACK_MISC, another with it as is (and marked STACK_INVALID on a first call). - The call to bpf_probe_read_user() does not mark stack slots corresponding to `&frame->co_name` as REG_LIVE_WRITTEN but it marks these slots as BPF_MISC, this happens because of the following loop in the check_helper_call(): for (i = 0; i < meta.access_size; i++) { err = check_mem_access(env, insn_idx, meta.regno, i, BPF_B, BPF_WRITE, -1, false); if (err) return err; } Note the size of the write, it is a one byte write for each byte touched by a helper. The BPF_B write does not lead to write marks for the target stack slot. - Which means that w/o this patch when second __on_event() call is verified `if (frame->co_name)` will propagate read marks first to a stack slot with STACK_MISC marks and second to a stack slot with STACK_INVALID marks and these states would be considered different. [1] https://lore.kernel.org/bpf/CAEf4BzY3e+ZuC6HUa8dCiUovQRg2SzEk7M-dSkqNZyn=xEmnPA@mail.gmail.com/ [2] https://lore.kernel.org/bpf/CAADnVQKs2i1iuZ5SUGuJtxWVfGYR9kDgYKhq3rNV+kBLQCu7rA@mail.gmail.com/ [3] git@github.com:anakryiko/cilium.git Suggested-by: Andrii Nakryiko <andrii@kernel.org> Co-developed-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230219200427.606541-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-21	Merge tag 'net-next-6.3' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - Add dedicated kmem_cache for typical/small skb->head, avoid having to access struct page at kfree time, and improve memory use. - Introduce sysctl to set default RPS configuration for new netdevs. - Define Netlink protocol specification format which can be used to describe messages used by each family and auto-generate parsers. Add tools for generating kernel data structures and uAPI headers. - Expose all net/core sysctls inside netns. - Remove 4s sleep in netpoll if carrier is instantly detected on boot. - Add configurable limit of MDB entries per port, and port-vlan. - Continue populating drop reasons throughout the stack. - Retire a handful of legacy Qdiscs and classifiers. Protocols: - Support IPv4 big TCP (TSO frames larger than 64kB). - Add IP_LOCAL_PORT_RANGE socket option, to control local port range on socket by socket basis. - Track and report in procfs number of MPTCP sockets used. - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path manager. - IPv6: don't check net.ipv6.route.max_size and rely on garbage collection to free memory (similarly to IPv4). - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986). - ICMP: add per-rate limit counters. - Add support for user scanning requests in ieee802154. - Remove static WEP support. - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate reporting. - WiFi 7 EHT channel puncturing support (client & AP). BPF: - Add a rbtree data structure following the "next-gen data structure" precedent set by recently added linked list, that is, by using kfunc + kptr instead of adding a new BPF map type. - Expose XDP hints via kfuncs with initial support for RX hash and timestamp metadata. - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better support decap on GRE tunnel devices not operating in collect metadata. - Improve x86 JIT's codegen for PROBE_MEM runtime error checks. - Remove the need for trace_printk_lock for bpf_trace_printk and bpf_trace_vprintk helpers. - Extend libbpf's bpf_tracing.h support for tracing arguments of kprobes/uprobes and syscall as a special case. - Significantly reduce the search time for module symbols by livepatch and BPF. - Enable cpumasks to be used as kptrs, which is useful for tracing programs tracking which tasks end up running on which CPUs in different time intervals. - Add support for BPF trampoline on s390x and riscv64. - Add capability to export the XDP features supported by the NIC. - Add __bpf_kfunc tag for marking kernel functions as kfuncs. - Add cgroup.memory=nobpf kernel parameter option to disable BPF memory accounting for container environments. Netfilter: - Remove the CLUSTERIP target. It has been marked as obsolete for years, and we still have WARN splats wrt races of the out-of-band /proc interface installed by this target. - Add 'destroy' commands to nf_tables. They are identical to the existing 'delete' commands, but do not return an error if the referenced object (set, chain, rule...) did not exist. Driver API: - Improve cpumask_local_spread() locality to help NICs set the right IRQ affinity on AMD platforms. - Separate C22 and C45 MDIO bus transactions more clearly. - Introduce new DCB table to control DSCP rewrite on egress. - Support configuration of Physical Layer Collision Avoidance (PLCA) Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of shared medium Ethernet. - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing preemption of low priority frames by high priority frames. - Add support for controlling MACSec offload using netlink SET. - Rework devlink instance refcounts to allow registration and de-registration under the instance lock. Split the code into multiple files, drop some of the unnecessarily granular locks and factor out common parts of netlink operation handling. - Add TX frame aggregation parameters (for USB drivers). - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning messages with notifications for debug. - Allow offloading of UDP NEW connections via act_ct. - Add support for per action HW stats in TC. - Support hardware miss to TC action (continue processing in SW from a specific point in the action chain). - Warn if old Wireless Extension user space interface is used with modern cfg80211/mac80211 drivers. Do not support Wireless Extensions for Wi-Fi 7 devices at all. Everyone should switch to using nl80211 interface instead. - Improve the CAN bit timing configuration. Use extack to return error messages directly to user space, update the SJW handling, including the definition of a new default value that will benefit CAN-FD controllers, by increasing their oscillator tolerance. New hardware / drivers: - Ethernet: - nVidia BlueField-3 support (control traffic driver) - Ethernet support for imx93 SoCs - Motorcomm yt8531 gigabit Ethernet PHY - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA) - Microchip LAN8841 PHY (incl. cable diagnostics and PTP) - Amlogic gxl MDIO mux - WiFi: - RealTek RTL8188EU (rtl8xxxu) - Qualcomm Wi-Fi 7 devices (ath12k) - CAN: - Renesas R-Car V4H Drivers: - Bluetooth: - Set Per Platform Antenna Gain (PPAG) for Intel controllers. - Ethernet NICs: - Intel (1G, igc): - support TSN / Qbv / packet scheduling features of i226 model - Intel (100G, ice): - use GNSS subsystem instead of TTY - multi-buffer XDP support - extend support for GPIO pins to E823 devices - nVidia/Mellanox: - update the shared buffer configuration on PFC commands - implement PTP adjphase function for HW offset control - TC support for Geneve and GRE with VF tunnel offload - more efficient crypto key management method - multi-port eswitch support - Netronome/Corigine: - add DCB IEEE support - support IPsec offloading for NFP3800 - Freescale/NXP (enetc): - support XDP_REDIRECT for XDP non-linear buffers - improve reconfig, avoid link flap and waiting for idle - support MAC Merge layer - Other NICs: - sfc/ef100: add basic devlink support for ef100 - ionic: rx_push mode operation (writing descriptors via MMIO) - bnxt: use the auxiliary bus abstraction for RDMA - r8169: disable ASPM and reset bus in case of tx timeout - cpsw: support QSGMII mode for J721e CPSW9G - cpts: support pulse-per-second output - ngbe: add an mdio bus driver - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing - r8152: handle devices with FW with NCM support - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation - virtio-net: support multi buffer XDP - virtio/vsock: replace virtio_vsock_pkt with sk_buff - tsnep: XDP support - Ethernet high-speed switches: - nVidia/Mellanox (mlxsw): - add support for latency TLV (in FW control messages) - Microchip (sparx5): - separate explicit and implicit traffic forwarding rules, make the implicit rules always active - add support for egress DSCP rewrite - IS0 VCAP support (Ingress Classification) - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.) - ES2 VCAP support (Egress Access Control) - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1) - Ethernet embedded switches: - Marvell (mv88e6xxx): - add MAB (port auth) offload support - enable PTP receive for mv88e6390 - NXP (ocelot): - support MAC Merge layer - support for the the vsc7512 internal copper phys - Microchip: - lan9303: convert to PHYLINK - lan966x: support TC flower filter statistics - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x - lan937x: support Credit Based Shaper configuration - ksz9477: support Energy Efficient Ethernet - other: - qca8k: convert to regmap read/write API, use bulk operations - rswitch: Improve TX timestamp accuracy - Intel WiFi (iwlwifi): - EHT (Wi-Fi 7) rate reporting - STEP equalizer support: transfer some STEP (connection to radio on platforms with integrated wifi) related parameters from the BIOS to the firmware. - Qualcomm 802.11ax WiFi (ath11k): - IPQ5018 support - Fine Timing Measurement (FTM) responder role support - channel 177 support - MediaTek WiFi (mt76): - per-PHY LED support - mt7996: EHT (Wi-Fi 7) support - Wireless Ethernet Dispatch (WED) reset support - switch to using page pool allocator - RealTek WiFi (rtw89): - support new version of Bluetooth co-existance - Mobile: - rmnet: support TX aggregation" * tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits) page_pool: add a comment explaining the fragment counter usage net: ethtool: fix __ethtool_dev_mm_supported() implementation ethtool: pse-pd: Fix double word in comments xsk: add linux/vmalloc.h to xsk.c sefltests: netdevsim: wait for devlink instance after netns removal selftest: fib_tests: Always cleanup before exit net/mlx5e: Align IPsec ASO result memory to be as required by hardware net/mlx5e: TC, Set CT miss to the specific ct action instance net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG net/mlx5: Refactor tc miss handling to a single function net/mlx5: Kconfig: Make tc offload depend on tc skb extension net/sched: flower: Support hardware miss to tc action net/sched: flower: Move filter handle initialization earlier net/sched: cls_api: Support hardware miss to tc action net/sched: Rename user cookie and act cookie sfc: fix builds without CONFIG_RTC_LIB sfc: clean up some inconsistent indentings net/mlx4_en: Introduce flexible array to silence overflow warning net: lan966x: Fix possible deadlock inside PTP net/ulp: Remove redundant ->clone() test in inet_clone_ulp(). ...
2023-02-20	Merge tag 'fs.idmapped.v6.3' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull vfs idmapping updates from Christian Brauner: - Last cycle we introduced the dedicated struct mnt_idmap type for mount idmapping and the required infrastucture in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). As promised in last cycle's pull request message this converts everything to rely on struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevant on the mount level. Especially for non-vfs developers without detailed knowledge in this area this was a potential source for bugs. This finishes the conversion. Instead of passing the plain namespace around this updates all places that currently take a pointer to a mnt_userns with a pointer to struct mnt_idmap. Now that the conversion is done all helpers down to the really low-level helpers only accept a struct mnt_idmap argument instead of two namespace arguments. Conflating mount and other idmappings will now cause the compiler to complain loudly thus eliminating the possibility of any bugs. This makes it impossible for filesystem developers to mix up mount and filesystem idmappings as they are two distinct types and require distinct helpers that cannot be used interchangeably. Everything associated with struct mnt_idmap is moved into a single separate file. With that change no code can poke around in struct mnt_idmap. It can only be interacted with through dedicated helpers. That means all filesystems are and all of the vfs is completely oblivious to the actual implementation of idmappings. We are now also able to extend struct mnt_idmap as we see fit. For example, we can decouple it completely from namespaces for users that don't require or don't want to use them at all. We can also extend the concept of idmappings so we can cover filesystem specific requirements. In combination with the vfs{g,u}id_t work we finished in v6.2 this makes this feature substantially more robust and thus difficult to implement wrong by a given filesystem and also protects the vfs. - Enable idmapped mounts for tmpfs and fulfill a longstanding request. A long-standing request from users had been to make it possible to create idmapped mounts for tmpfs. For example, to share the host's tmpfs mount between multiple sandboxes. This is a prerequisite for some advanced Kubernetes cases. Systemd also has a range of use-cases to increase service isolation. And there are more users of this. However, with all of the other work going on this was way down on the priority list but luckily someone other than ourselves picked this up. As usual the patch is tiny as all the infrastructure work had been done multiple kernel releases ago. In addition to all the tests that we already have I requested that Rodrigo add a dedicated tmpfs testsuite for idmapped mounts to xfstests. It is to be included into xfstests during the v6.3 development cycle. This should add a slew of additional tests. * tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (26 commits) shmem: support idmapped mounts for tmpfs fs: move mnt_idmap fs: port vfs{g,u}id helpers to mnt_idmap fs: port fs{g,u}id helpers to mnt_idmap fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap fs: port i_{g,u}id_{needs_}update() to mnt_idmap quota: port to mnt_idmap fs: port privilege checking helpers to mnt_idmap fs: port inode_owner_or_capable() to mnt_idmap fs: port inode_init_owner() to mnt_idmap fs: port acl to mnt_idmap fs: port xattr to mnt_idmap fs: port ->permission() to pass mnt_idmap fs: port ->fileattr_set() to pass mnt_idmap fs: port ->set_acl() to pass mnt_idmap fs: port ->get_acl() to pass mnt_idmap fs: port ->tmpfile() to pass mnt_idmap fs: port ->rename() to pass mnt_idmap fs: port ->mknod() to pass mnt_idmap fs: port ->mkdir() to pass mnt_idmap ...
2023-02-17	selftests/bpf: Add bpf_fib_lookup test	Martin KaFai Lau
	This patch tests the bpf_fib_lookup helper when looking up a neigh in NUD_FAILED and NUD_STALE state. It also adds test for the new BPF_FIB_LOOKUP_SKIP_NEIGH flag. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230217205515.3583372-2-martin.lau@linux.dev
2023-02-17	selftests/bpf: Add global subprog context passing tests	Andrii Nakryiko
	Add tests validating that it's possible to pass context arguments into global subprogs for various types of programs, including a particularly tricky KPROBE programs (which cover kprobes, uprobes, USDTs, a vast and important class of programs). Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20230216045954.3002473-4-andrii@kernel.org
2023-02-17	selftests/bpf: Convert test_global_funcs test to test_loader framework	Andrii Nakryiko
	Convert 17 test_global_funcs subtests into test_loader framework for easier maintenance and more declarative way to define expected failures/successes. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20230216045954.3002473-3-andrii@kernel.org
2023-02-16	Fix typos in selftest/bpf files	Taichi Nishimura
	Run spell checker on files in selftest/bpf and fixed typos. Signed-off-by: Taichi Nishimura <awkrail01@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/bpf/20230216085537.519062-1-awkrail01@gmail.com
2023-02-15	selftest/bpf/benchs: Add benchmark for hashmap lookups	Anton Protopopov
	Add a new benchmark which measures hashmap lookup operations speed. A user can control the following parameters of the benchmark: * key_size (max 1024): the key size to use * max_entries: the hashmap max entries * nr_entries: the number of entries to insert/lookup * nr_loops: the number of loops for the benchmark * map_flags The hashmap flags passed to BPF_MAP_CREATE The BPF program performing the benchmarks calls two nested bpf_loop: bpf_loop(nr_loops/nr_entries) bpf_loop(nr_entries) bpf_map_lookup() So the nr_loops determines the number of actual map lookups. All lookups are successful. Example (the output is generated on a AMD Ryzen 9 3950X machine): for nr_entries in `seq 4096 4096 65536`; do echo -n "$((nr_entries*100/65536))% full: "; sudo ./bench -d2 -a bpf-hashmap-lookup --key_size=4 --nr_entries=$nr_entries --max_entries=65536 --nr_loops=1000000 --map_flags=0x40 \| grep cpu; done 6% full: cpu01: lookup 50.739M ± 0.018M events/sec (approximated from 32 samples of ~19ms) 12% full: cpu01: lookup 47.751M ± 0.015M events/sec (approximated from 32 samples of ~20ms) 18% full: cpu01: lookup 45.153M ± 0.013M events/sec (approximated from 32 samples of ~22ms) 25% full: cpu01: lookup 43.826M ± 0.014M events/sec (approximated from 32 samples of ~22ms) 31% full: cpu01: lookup 41.971M ± 0.012M events/sec (approximated from 32 samples of ~23ms) 37% full: cpu01: lookup 41.034M ± 0.015M events/sec (approximated from 32 samples of ~24ms) 43% full: cpu01: lookup 39.946M ± 0.012M events/sec (approximated from 32 samples of ~25ms) 50% full: cpu01: lookup 38.256M ± 0.014M events/sec (approximated from 32 samples of ~26ms) 56% full: cpu01: lookup 36.580M ± 0.018M events/sec (approximated from 32 samples of ~27ms) 62% full: cpu01: lookup 36.252M ± 0.012M events/sec (approximated from 32 samples of ~27ms) 68% full: cpu01: lookup 35.200M ± 0.012M events/sec (approximated from 32 samples of ~28ms) 75% full: cpu01: lookup 34.061M ± 0.009M events/sec (approximated from 32 samples of ~29ms) 81% full: cpu01: lookup 34.374M ± 0.010M events/sec (approximated from 32 samples of ~29ms) 87% full: cpu01: lookup 33.244M ± 0.011M events/sec (approximated from 32 samples of ~30ms) 93% full: cpu01: lookup 32.182M ± 0.013M events/sec (approximated from 32 samples of ~31ms) 100% full: cpu01: lookup 31.497M ± 0.016M events/sec (approximated from 32 samples of ~31ms) Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230213091519.1202813-8-aspsk@isovalent.com
2023-02-15	selftests/bpf: Add test case for element reuse in htab map	Hou Tao
	The reinitialization of spin-lock in map value after immediate reuse may corrupt lookup with BPF_F_LOCK flag and result in hard lock-up, so add one test case to demonstrate the problem. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230215082132.3856544-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-15	selftests/bpf: Fix map_kptr test.	Alexei Starovoitov
	The compiler is optimizing out majority of unref_ptr read/writes, so the test wasn't testing much. For example, one could delete '__kptr' tag from 'struct prog_test_ref_kfunc __kptr *unref_ptr;' and the test would still "pass". Convert it to volatile stores. Confirmed by comparing bpf asm before/after. Fixes: 2cbc469a6fc3 ("selftests/bpf: Add C tests for kptr") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Stanislav Fomichev <sdf@google.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230214235051.22938-1-alexei.starovoitov@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-02-13	selftests/bpf: Clean up user_ringbuf, cgrp_kfunc, kfunc_dynptr_param tests	Joanne Koong
	Clean up user_ringbuf, cgrp_kfunc, and kfunc_dynptr_param tests to use the generic verification tester for checking verifier rejections. The generic verification tester uses btf_decl_tag-based annotations for verifying that the tests fail with the expected log messages. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Acked-by: David Vernet <void@manifault.com> Reviewed-by: Roberto Sassu <roberto.sassu@huawei.com> Link: https://lore.kernel.org/r/20230214051332.4007131-1-joannelkoong@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-13	selftests/bpf: Add rbtree selftests	Dave Marchevsky
	This patch adds selftests exercising the logic changed/added in the previous patches in the series. A variety of successful and unsuccessful rbtree usages are validated: Success: * Add some nodes, let map_value bpf_rbtree_root destructor clean them up * Add some nodes, remove one using the non-owning ref leftover by successful rbtree_add() call * Add some nodes, remove one using the non-owning ref returned by rbtree_first() call Failure: * BTF where bpf_rb_root owns bpf_list_node should fail to load * BTF where node of type X is added to tree containing nodes of type Y should fail to load * No calling rbtree api functions in 'less' callback for rbtree_add * No releasing lock in 'less' callback for rbtree_add * No removing a node which hasn't been added to any tree * No adding a node which has already been added to a tree * No escaping of non-owning references past their lock's critical section * No escaping of non-owning references past other invalidation points (rbtree_remove) These tests mostly focus on rbtree-specific additions, but some of the failure cases revalidate scenarios common to both linked_list and rbtree which are covered in the former's tests. Better to be a bit redundant in case linked_list and rbtree semantics deviate over time. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230214004017.2534011-8-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-13	bpf: Migrate release_on_unlock logic to non-owning ref semantics	Dave Marchevsky
	This patch introduces non-owning reference semantics to the verifier, specifically linked_list API kfunc handling. release_on_unlock logic for refs is refactored - with small functional changes - to implement these semantics, and bpf_list_push_{front,back} are migrated to use them. When a list node is pushed to a list, the program still has a pointer to the node: n = bpf_obj_new(typeof(n)); bpf_spin_lock(&l); bpf_list_push_back(&l, n); / n still points to the just-added node / bpf_spin_unlock(&l); What the verifier considers n to be after the push, and thus what can be done with n, are changed by this patch. Common properties both before/after this patch: After push, n is only a valid reference to the node until end of critical section * After push, n cannot be pushed to any list * After push, the program can read the node's fields using n Before: * After push, n retains the ref_obj_id which it received on bpf_obj_new, but the associated bpf_reference_state's release_on_unlock field is set to true * release_on_unlock field and associated logic is used to implement "n is only a valid ref until end of critical section" * After push, n cannot be written to, the node must be removed from the list before writing to its fields * After push, n is marked PTR_UNTRUSTED After: * After push, n's ref is released and ref_obj_id set to 0. NON_OWN_REF type flag is added to reg's type, indicating that it's a non-owning reference. * NON_OWN_REF flag and logic is used to implement "n is only a valid ref until end of critical section" * n can be written to (except for special fields e.g. bpf_list_node, timer, ...) Summary of specific implementation changes to achieve the above: * release_on_unlock field, ref_set_release_on_unlock helper, and logic to "release on unlock" based on that field are removed * The anonymous active_lock struct used by bpf_verifier_state is pulled out into a named struct bpf_active_lock. * NON_OWN_REF type flag is introduced along with verifier logic changes to handle non-owning refs * Helpers are added to use NON_OWN_REF flag to implement non-owning ref semantics as described above * invalidate_non_owning_refs - helper to clobber all non-owning refs matching a particular bpf_active_lock identity. Replaces release_on_unlock logic in process_spin_lock. * ref_set_non_owning - set NON_OWN_REF type flag after doing some sanity checking * ref_convert_owning_non_owning - convert owning reference w/ specified ref_obj_id to non-owning references. Set NON_OWN_REF flag for each reg with that ref_obj_id and 0-out its ref_obj_id * Update linked_list selftests to account for minor semantic differences introduced by this patch * Writes to a release_on_unlock node ref are not allowed, while writes to non-owning reference pointees are. As a result the linked_list "write after push" failure tests are no longer scenarios that should fail. * The test##missing_lock##op and test##incorrect_lock##op macro-generated failure tests need to have a valid node argument in order to have the same error output as before. Otherwise verification will fail early and the expected error output won't be seen. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230212092715.1422619-2-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-10	selftests/bpf: Attach to fopen()/fclose() in attach_probe	Ilya Leoshkevich
	malloc() and free() may be completely replaced by sanitizers, use fopen() and fclose() instead. Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230210001210.395194-7-iii@linux.ibm.com
2023-02-10	selftests/bpf: Attach to fopen()/fclose() in uprobe_autoattach	Ilya Leoshkevich
	malloc() and free() may be completely replaced by sanitizers, use fopen() and fclose() instead. Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230210001210.395194-6-iii@linux.ibm.com
2023-02-02	selftests/bpf: introduce XDP compliance test tool	Lorenzo Bianconi
	Introduce xdp_features tool in order to test XDP features supported by the NIC and match them against advertised ones. In order to test supported/advertised XDP features, xdp_features must run on the Device Under Test (DUT) and on a Tester device. xdp_features opens a control TCP channel between DUT and Tester devices to send control commands from Tester to the DUT and a UDP data channel where the Tester sends UDP 'echo' packets and the DUT is expected to reply back with the same packet. DUT installs multiple XDP programs on the NIC to test XDP capabilities and reports back to the Tester some XDP stats. Currently xdp_features supports the following XDP features: - XDP_DROP - XDP_ABORTED - XDP_PASS - XDP_TX - XDP_REDIRECT - XDP_NDO_XMIT Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/7c1af8e7e6ef0614cf32fa9e6bdaa2d8d605f859.1675245258.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>