summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-11-04bpf: Mark raw_tp arguments with PTR_MAYBE_NULLKumar Kartikeya Dwivedi
Arguments to a raw tracepoint are tagged as trusted, which carries the semantics that the pointer will be non-NULL. However, in certain cases, a raw tracepoint argument may end up being NULL. More context about this issue is available in [0]. Thus, there is a discrepancy between the reality, that raw_tp arguments can actually be NULL, and the verifier's knowledge, that they are never NULL, causing explicit NULL checks to be deleted, and accesses to such pointers potentially crashing the kernel. To fix this, mark raw_tp arguments as PTR_MAYBE_NULL, and then special case the dereference and pointer arithmetic to permit it, and allow passing them into helpers/kfuncs; these exceptions are made for raw_tp programs only. Ensure that we don't do this when ref_obj_id > 0, as in that case this is an acquired object and doesn't need such adjustment. The reason we do mask_raw_tp_trusted_reg logic is because other will recheck in places whether the register is a trusted_reg, and then consider our register as untrusted when detecting the presence of the PTR_MAYBE_NULL flag. To allow safe dereference, we enable PROBE_MEM marking when we see loads into trusted pointers with PTR_MAYBE_NULL. While trusted raw_tp arguments can also be passed into helpers or kfuncs where such broken assumption may cause issues, a future patch set will tackle their case separately, as PTR_TO_BTF_ID (without PTR_TRUSTED) can already be passed into helpers and causes similar problems. Thus, they are left alone for now. It is possible that these checks also permit passing non-raw_tp args that are trusted PTR_TO_BTF_ID with null marking. In such a case, allowing dereference when pointer is NULL expands allowed behavior, so won't regress existing programs, and the case of passing these into helpers is the same as above and will be dealt with later. Also update the failure case in tp_btf_nullable selftest to capture the new behavior, as the verifier will no longer cause an error when directly dereference a raw tracepoint argument marked as __nullable. [0]: https://lore.kernel.org/bpf/ZrCZS6nisraEqehw@jlelli-thinkpadt14gen4.remote.csb Reviewed-by: Jiri Olsa <jolsa@kernel.org> Reported-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Fixes: 3f00c5239344 ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241104171959.2938862-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-04bpf: Move btf_type_is_struct_ptr() under CONFIG_BPF_SYSCALLAlistair Francis
The static inline btf_type_is_struct_ptr() function calls btf_type_skip_modifiers() which is guarded by CONFIG_BPF_SYSCALL. btf_type_is_struct_ptr() is also only called by CONFIG_BPF_SYSCALL ifdef code, so let's only expose btf_type_is_struct_ptr() if CONFIG_BPF_SYSCALL is defined. Signed-off-by: Alistair Francis <alistair.francis@wdc.com> Link: https://lore.kernel.org/r/20241104060300.421403-1-alistair.francis@wdc.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-03Merge branch 'fix-resource-leak-checks-for-tail-calls'Alexei Starovoitov
Kumar Kartikeya Dwivedi says: ==================== Fix resource leak checks for tail calls This set contains a fix for detecting unreleased RCU read locks or unfinished preempt_disable sections when performing a tail call. Spin locks are prevented by accident since they don't allow any function calls, including tail calls (modelled as call instruction to a helper), so we ensure they are checked as well, in preparation for relaxing function call restricton for critical sections in the future. Then, in the second patch, all the checks for reference leaks and locks are unified into a single function that can be called from different places. This unification patch is kept separate and placed after the fix to allow independent backport of the fix to older kernels without a depdendency on the clean up. Naturally, this creates a divergence in the disparate error messages, therefore selftests that rely on the exact error strings need to be updated to match the new verifier log message. A selftest is included to ensure no regressions occur wrt this behavior. ==================== Link: https://lore.kernel.org/r/20241103225940.1408302-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-03selftests/bpf: Add tests for tail calls with locks and refsKumar Kartikeya Dwivedi
Add failure tests to ensure bugs don't slip through for tail calls and lingering locks, RCU sections, preemption disabled sections, and references prevent tail calls. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241103225940.1408302-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-03bpf: Unify resource leak checksKumar Kartikeya Dwivedi
There are similar checks for covering locks, references, RCU read sections and preempt_disable sections in 3 places in the verifer, i.e. for tail calls, bpf_ld_[abs, ind], and exit path (for BPF_EXIT and bpf_throw). Unify all of these into a common check_resource_leak function to avoid code duplication. Also update the error strings in selftests to the new ones in the same change to ensure clean bisection. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241103225940.1408302-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-03bpf: Tighten tail call checks for lingering locks, RCU, preempt_disableKumar Kartikeya Dwivedi
There are three situations when a program logically exits and transfers control to the kernel or another program: bpf_throw, BPF_EXIT, and tail calls. The former two check for any lingering locks and references, but tail calls currently do not. Expand the checks to check for spin locks, RCU read sections and preempt disabled sections. Spin locks are indirectly preventing tail calls as function calls are disallowed, but the checks for preemption and RCU are more relaxed, hence ensure tail calls are prevented in their presence. Fixes: 9bb00b2895cb ("bpf: Add kfunc bpf_rcu_read_lock/unlock()") Fixes: fc7566ad0a82 ("bpf: Introduce bpf_preempt_[disable,enable] kfuncs") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241103225940.1408302-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-01selftests/bpf: Disable warnings on unused flags for Clang buildsViktor Malik
There exist compiler flags supported by GCC but not supported by Clang (e.g. -specs=...). Currently, these cannot be passed to BPF selftests builds, even when building with GCC, as some binaries (urandom_read and liburandom_read.so) are always built with Clang and the unsupported flags make the compilation fail (as -Werror is turned on). Add -Wno-unused-command-line-argument to these rules to suppress such errors. This allows to do things like: $ CFLAGS="-specs=/usr/lib/rpm/redhat/redhat-hardened-cc1" \ make -C tools/testing/selftests/bpf Without this patch, the compilation would fail with: [...] clang: error: argument unused during compilation: '-specs=/usr/lib/rpm/redhat/redhat-hardened-cc1' [-Werror,-Wunused-command-line-argument] make: *** [Makefile:273: /bpf-next/tools/testing/selftests/bpf/liburandom_read.so] Error 1 [...] Signed-off-by: Viktor Malik <vmalik@redhat.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/2d349e9d5eb0a79dd9ff94b496769d64e6ff7654.1730449390.git.vmalik@redhat.com
2024-11-01bpftool: Prevent setting duplicate _GNU_SOURCE in MakefileViktor Malik
When building selftests with CFLAGS set via env variable, the value of CFLAGS is propagated into bpftool Makefile (called from selftests Makefile). This makes the compilation fail as _GNU_SOURCE is defined two times - once from selftests Makefile (by including lib.mk) and once from bpftool Makefile (by calling `llvm-config --cflags`): $ CFLAGS="" make -C tools/testing/selftests/bpf [...] CC /bpf-next/tools/testing/selftests/bpf/tools/build/bpftool/btf.o <command-line>: error: "_GNU_SOURCE" redefined [-Werror] <command-line>: note: this is the location of the previous definition cc1: all warnings being treated as errors [...] Filter out -D_GNU_SOURCE from the result of `llvm-config --cflags` in bpftool Makefile to prevent this error. Signed-off-by: Viktor Malik <vmalik@redhat.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Quentin Monnet <qmo@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/acec3108b62d4df1436cda777e58e93e033ac7a7.1730449390.git.vmalik@redhat.com
2024-11-01bpf, bpftool: Fix incorrect disasm pcLeon Hwang
This patch addresses the bpftool issue "Wrong callq address displayed"[0]. The issue stemmed from an incorrect program counter (PC) value used during disassembly with LLVM or libbfd. For LLVM: The PC argument must represent the actual address in the kernel to compute the correct relative address. For libbfd: The relative address can be adjusted by adding func_ksym within the custom info->print_address_func to yield the correct address. Links: [0] https://github.com/libbpf/bpftool/issues/109 Changes: v2 -> v3: * Address comment from Quentin: * Remove the typedef. v1 -> v2: * Fix the broken libbfd disassembler. Fixes: e1947c750ffe ("bpftool: Refactor disassembler for JIT-ed programs") Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Quentin Monnet <qmo@kernel.org> Reviewed-by: Quentin Monnet <qmo@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20241031152844.68817-1-leon.hwang@linux.dev
2024-11-01selftests/bpf: Add a test for open coded kmem_cache iterNamhyung Kim
The new subtest runs with bpf_prog_test_run_opts() as a syscall prog. It iterates the kmem_cache using bpf_for_each loop and count the number of entries. Finally it checks it with the number of entries from the regular iterator. $ ./vmtest.sh -- ./test_progs -t kmem_cache_iter ... #130/1 kmem_cache_iter/check_task_struct:OK #130/2 kmem_cache_iter/check_slabinfo:OK #130/3 kmem_cache_iter/open_coded_iter:OK #130 kmem_cache_iter:OK Summary: 1/3 PASSED, 0 SKIPPED, 0 FAILED Also simplify the code by using attach routine of the skeleton. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20241030222819.1800667-2-namhyung@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-01bpf: Add open coded version of kmem_cache iteratorNamhyung Kim
Add a new open coded iterator for kmem_cache which can be called from a BPF program like below. It doesn't take any argument and traverses all kmem_cache entries. struct kmem_cache *pos; bpf_for_each(kmem_cache, pos) { ... } As it needs to grab slab_mutex, it should be called from sleepable BPF programs only. Also update the existing iterator code to use the open coded version internally as suggested by Andrii. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20241030222819.1800667-1-namhyung@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-30uprobes: SRCU-protect uretprobe lifetime (with timeout)Andrii Nakryiko
Avoid taking refcount on uprobe in prepare_uretprobe(), instead take uretprobe-specific SRCU lock and keep it active as kernel transfers control back to user space. Given we can't rely on user space returning from traced function within reasonable time period, we need to make sure not to keep SRCU lock active for too long, though. To that effect, we employ a timer callback which is meant to terminate SRCU lock region after predefined timeout (currently set to 100ms), and instead transfer underlying struct uprobe's lifetime protection to refcounting. This fallback to less scalable refcounting after 100ms is a fine tradeoff from uretprobe's scalability and performance perspective, because uretprobing *long running* user functions inherently doesn't run into scalability issues (there is just not enough frequency to cause noticeable issues with either performance or scalability). The overall trick is in ensuring synchronization between current thread and timer's callback fired on some other thread. To cope with that with minimal logic complications, we add hprobe wrapper which is used to contain all the synchronization related issues behind a small number of basic helpers: hprobe_expire() for "downgrading" uprobe from SRCU-protected state to refcounted state, and a hprobe_consume() and hprobe_finalize() pair of single-use consuming helpers. Other than that, whatever current thread's logic is there stays the same, as timer thread cannot modify return_instance state (or add new/remove old return_instances). It only takes care of SRCU unlock and uprobe refcounting, which is hidden from the higher-level uretprobe handling logic. We use atomic xchg() in hprobe_consume(), which is called from performance critical handle_uretprobe_chain() function run in the current context. When uncontended, this xchg() doesn't seem to hurt performance as there are no other competing CPUs fighting for the same cache line. We also mark struct return_instance as ____cacheline_aligned to ensure no false sharing can happen. Another technical moment. We need to make sure that the list of return instances can be safely traversed under RCU from timer callback, so we delay return_instance freeing with kfree_rcu() and make sure that list modifications use RCU-aware operations. Also, given SRCU lock survives transition from kernel to user space and back we need to use lower-level __srcu_read_lock() and __srcu_read_unlock() to avoid lockdep complaining. Just to give an impression of a kind of performance improvements this change brings, below are benchmarking results with and without these SRCU changes, assuming other uprobe optimizations (mainly RCU Tasks Trace for entry uprobes, lockless RB-tree lookup, and lockless VMA to uprobe lookup) are left intact: WITHOUT SRCU for uretprobes =========================== uretprobe-nop ( 1 cpus): 2.197 ± 0.002M/s ( 2.197M/s/cpu) uretprobe-nop ( 2 cpus): 3.325 ± 0.001M/s ( 1.662M/s/cpu) uretprobe-nop ( 3 cpus): 4.129 ± 0.002M/s ( 1.376M/s/cpu) uretprobe-nop ( 4 cpus): 6.180 ± 0.003M/s ( 1.545M/s/cpu) uretprobe-nop ( 8 cpus): 7.323 ± 0.005M/s ( 0.915M/s/cpu) uretprobe-nop (16 cpus): 6.943 ± 0.005M/s ( 0.434M/s/cpu) uretprobe-nop (32 cpus): 5.931 ± 0.014M/s ( 0.185M/s/cpu) uretprobe-nop (64 cpus): 5.145 ± 0.003M/s ( 0.080M/s/cpu) uretprobe-nop (80 cpus): 4.925 ± 0.005M/s ( 0.062M/s/cpu) WITH SRCU for uretprobes ======================== uretprobe-nop ( 1 cpus): 1.968 ± 0.001M/s ( 1.968M/s/cpu) uretprobe-nop ( 2 cpus): 3.739 ± 0.003M/s ( 1.869M/s/cpu) uretprobe-nop ( 3 cpus): 5.616 ± 0.003M/s ( 1.872M/s/cpu) uretprobe-nop ( 4 cpus): 7.286 ± 0.002M/s ( 1.822M/s/cpu) uretprobe-nop ( 8 cpus): 13.657 ± 0.007M/s ( 1.707M/s/cpu) uretprobe-nop (32 cpus): 45.305 ± 0.066M/s ( 1.416M/s/cpu) uretprobe-nop (64 cpus): 42.390 ± 0.922M/s ( 0.662M/s/cpu) uretprobe-nop (80 cpus): 47.554 ± 2.411M/s ( 0.594M/s/cpu) Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241024044159.3156646-3-andrii@kernel.org
2024-10-30uprobes: allow put_uprobe() from non-sleepable softirq contextAndrii Nakryiko
Currently put_uprobe() might trigger mutex_lock()/mutex_unlock(), which makes it unsuitable to be called from more restricted context like softirq. Let's make put_uprobe() agnostic to the context in which it is called, and use work queue to defer the mutex-protected clean up steps. RB tree removal step is also moved into work-deferred callback to avoid potential deadlock between softirq-based timer callback, added in the next patch, and the rest of uprobe code. We can rework locking altogher as a follow up, but that's significantly more tricky, so warrants its own patch set. For now, we need to make sure that changes in the next patch that add timer thread work correctly with existing approach, while concentrating on SRCU + timeout logic. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241024044159.3156646-2-andrii@kernel.org
2024-10-30perf/x86/rapl: Clean up cpumask and hotplugKan Liang
The rapl pmu is die scope, which is supported by the generic perf_event subsystem now. Set the scope for the rapl PMU and remove all the cpumask and hotplug codes. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Oliver Sang <oliver.sang@intel.com> Tested-by: Dhananjay Ugwekar <dhananjay.ugwekar@amd.com> Link: https://lore.kernel.org/r/20241010142604.770192-2-kan.liang@linux.intel.com
2024-10-30perf/x86/rapl: Move the pmu allocation out of CPU hotplugKan Liang
There are extra codes in the CPU hotplug function to allocate rapl pmus. The generic PMU hotplug support is hard to be applied. As long as the rapl pmus can be allocated upfront for each die/socket, the code doesn't need to be implemented in the CPU hotplug function. Move the code to the init_rapl_pmus(), and allocate a PMU for each possible die/socket. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Oliver Sang <oliver.sang@intel.com> Link: https://lore.kernel.org/r/20241010142604.770192-1-kan.liang@linux.intel.com
2024-10-29selftests/bpf: drop unnecessary bpf_iter.h type duplicationAndrii Nakryiko
Drop bpf_iter.h header which uses vmlinux.h but re-defines a bunch of iterator structures and some of BPF constants for use in BPF iterator selftests. None of that is necessary when fresh vmlinux.h header is generated for vmlinux image that matches latest selftests. So drop ugly hacks and have a nice plain vmlinux.h usage everywhere. We could do the same with all the kfunc __ksym redefinitions, but that has dependency on very fresh pahole, so I'm not addressing that here. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20241029203919.1948941-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-29libbpf: start v1.6 development cycleAndrii Nakryiko
With libbpf v1.5.0 release out, start v1.6 dev cycle. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20241029184045.581537-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-29docs/bpf: Add description of .BTF.base sectionAlan Maguire
Now that .BTF.base sections are generated for out-of-tree kernel modules (provided pahole supports the "distilled_base" BTF feature), document .BTF.base and its role in supporting resilient split BTF and BTF relocation. Changes since v1: - updated formatting, corrected typo, used BTF ID[s] consistently (Andrii) Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241028091543.2175967-1-alan.maguire@oracle.com
2024-10-29bpf: handle implicit declaration of function gettid in bpf_iter.cJason Xing
As we can see from the title, when I compiled the selftests/bpf, I saw the error: implicit declaration of function ‘gettid’ ; did you mean ‘getgid’? [-Werror=implicit-function-declaration] skel->bss->tid = gettid(); ^~~~~~ getgid Directly call the syscall solves this issue. Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Tested-by: Alan Maguire <alan.maguire@oracle.com> Link: https://lore.kernel.org/r/20241029074627.80289-1-kerneljasonxing@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-10-25bpf, arm64: Remove garbage frame for struct_ops trampolineXu Kuohai
The callsite layout for arm64 fentry is: mov x9, lr nop When a bpf prog is attached, the nop instruction is patched to a call to bpf trampoline: mov x9, lr bl <bpf trampoline> So two return addresses are passed to bpf trampoline: the return address for the traced function/prog, stored in x9, and the return address for the bpf trampoline itself, stored in lr. To obtain a full and accurate call stack, the bpf trampoline constructs two fake function frames using x9 and lr. However, struct_ops progs are invoked directly as function callbacks, meaning that x9 is not set as it is in the fentry callsite. In this case, the frame constructed using x9 is garbage. The following stack trace for struct_ops, captured by perf sampling, illustrates this issue, where tcp_ack+0x404 is a garbage frame: ffffffc0801a04b4 bpf_prog_50992e55a0f655a9_bpf_cubic_cong_avoid+0x98 (bpf_prog_50992e55a0f655a9_bpf_cubic_cong_avoid) ffffffc0801a228c [unknown] ([kernel.kallsyms]) // bpf trampoline ffffffd08d362590 tcp_ack+0x798 ([kernel.kallsyms]) // caller for bpf trampoline ffffffd08d3621fc tcp_ack+0x404 ([kernel.kallsyms]) // garbage frame ffffffd08d36452c tcp_rcv_established+0x4ac ([kernel.kallsyms]) ffffffd08d375c58 tcp_v4_do_rcv+0x1f0 ([kernel.kallsyms]) ffffffd08d378630 tcp_v4_rcv+0xeb8 ([kernel.kallsyms]) To fix it, construct only one frame using lr for struct_ops. The above stack trace also indicates that there is no kernel symbol for struct_ops bpf trampoline. This will be addressed in a follow-up patch. Fixes: efc9909fdce0 ("bpf, arm64: Add bpf trampoline for arm64") Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Acked-by: Puranjay Mohan <puranjay@kernel.org> Tested-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20241025085220.533949-1-xukuohai@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfAlexei Starovoitov
Cross-merge bpf fixes after downstream PR. No conflicts. Adjacent changes in: include/linux/bpf.h include/uapi/linux/bpf.h kernel/bpf/btf.c kernel/bpf/helpers.c kernel/bpf/syscall.c kernel/bpf/verifier.c kernel/trace/bpf_trace.c mm/slab_common.c tools/include/uapi/linux/bpf.h tools/testing/selftests/bpf/Makefile Link: https://lore.kernel.org/all/20241024215724.60017-1-daniel@iogearbox.net/ Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-24Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds
Pull bpf fixes from Daniel Borkmann: - Fix an out-of-bounds read in bpf_link_show_fdinfo for BPF sockmap link file descriptors (Hou Tao) - Fix BPF arm64 JIT's address emission with tag-based KASAN enabled reserving not enough size (Peter Collingbourne) - Fix BPF verifier do_misc_fixups patching for inlining of the bpf_get_branch_snapshot BPF helper (Andrii Nakryiko) - Fix a BPF verifier bug and reject BPF program write attempts into read-only marked BPF maps (Daniel Borkmann) - Fix perf_event_detach_bpf_prog error handling by removing an invalid check which would skip BPF program release (Jiri Olsa) - Fix memory leak when parsing mount options for the BPF filesystem (Hou Tao) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Check validity of link->type in bpf_link_show_fdinfo() bpf: Add the missing BPF_LINK_TYPE invocation for sockmap bpf: fix do_misc_fixups() for bpf_get_branch_snapshot() bpf,perf: Fix perf_event_detach_bpf_prog error handling selftests/bpf: Add test for passing in uninit mtu_len selftests/bpf: Add test for writes to .rodata bpf: Remove MEM_UNINIT from skb/xdp MTU helpers bpf: Fix overloading of MEM_UNINIT's meaning bpf: Add MEM_WRITE attribute bpf: Preserve param->string when parsing mount options bpf, arm64: Fix address emission with tag-based KASAN enabled
2024-10-24Merge tag 'net-6.12-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from netfiler, xfrm and bluetooth. Oddly this includes a fix for a posix clock regression; in our previous PR we included a change there as a pre-requisite for networking one. That fix proved to be buggy and requires the follow-up included here. Thomas suggested we should send it, given we sent the buggy patch. Current release - regressions: - posix-clock: Fix unbalanced locking in pc_clock_settime() - netfilter: fix typo causing some targets not to load on IPv6 Current release - new code bugs: - xfrm: policy: remove last remnants of pernet inexact list Previous releases - regressions: - core: fix races in netdev_tx_sent_queue()/dev_watchdog() - bluetooth: fix UAF on sco_sock_timeout - eth: hv_netvsc: fix VF namespace also in synthetic NIC NETDEV_REGISTER event - eth: usbnet: fix name regression - eth: be2net: fix potential memory leak in be_xmit() - eth: plip: fix transmit path breakage Previous releases - always broken: - sched: deny mismatched skip_sw/skip_hw flags for actions created by classifiers - netfilter: bpf: must hold reference on net namespace - eth: virtio_net: fix integer overflow in stats - eth: bnxt_en: replace ptp_lock with irqsave variant - eth: octeon_ep: add SKB allocation failures handling in __octep_oq_process_rx() Misc: - MAINTAINERS: add Simon as an official reviewer" * tag 'net-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (40 commits) net: dsa: mv88e6xxx: support 4000ps cycle counter period net: dsa: mv88e6xxx: read cycle counter period from hardware net: dsa: mv88e6xxx: group cycle counter coefficients net: usb: qmi_wwan: add Fibocom FG132 0x0112 composition hv_netvsc: Fix VF namespace also in synthetic NIC NETDEV_REGISTER event net: dsa: microchip: disable EEE for KSZ879x/KSZ877x/KSZ876x Bluetooth: ISO: Fix UAF on iso_sock_timeout Bluetooth: SCO: Fix UAF on sco_sock_timeout Bluetooth: hci_core: Disable works on hci_unregister_dev posix-clock: posix-clock: Fix unbalanced locking in pc_clock_settime() r8169: avoid unsolicited interrupts net: sched: use RCU read-side critical section in taprio_dump() net: sched: fix use-after-free in taprio_change() net/sched: act_api: deny mismatched skip_sw/skip_hw flags for actions created by classifiers net: usb: usbnet: fix name regression mlxsw: spectrum_router: fix xa_store() error checking virtio_net: fix integer overflow in stats net: fix races in netdev_tx_sent_queue()/dev_watchdog() net: wwan: fix global oob in wwan_rtnl_policy netfilter: xtables: fix typo causing some targets not to load on IPv6 ...
2024-10-24Merge tag 'hid-for-linus-20241024' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid Pull HID fixes from Jiri Kosina: "Device-specific functionality quirks for Thinkpad X1 Gen3, Logitech Bolt and some Goodix touchpads (Bartłomiej Maryńczak, Hans de Goede and Kenneth Albanowski)" * tag 'hid-for-linus-20241024' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid: HID: lenovo: Add support for Thinkpad X1 Tablet Gen 3 keyboard HID: multitouch: Add quirk for Logitech Bolt receiver w/ Casa touchpad HID: i2c-hid: Delayed i2c resume wakeup for 0x0d42 Goodix touchpad
2024-10-24Merge tag 'loongarch-fixes-6.12-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson Pull LoongArch fixes from Huacai Chen: "Get correct cores_per_package for SMT systems, enable IRQ if do_ale() triggered in irq-enabled context, and fix some bugs about vDSO, memory managenent, hrtimer in KVM, etc" * tag 'loongarch-fixes-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson: LoongArch: KVM: Mark hrtimer to expire in hard interrupt context LoongArch: Make KASAN usable for variable cpu_vabits LoongArch: Set initial pte entry with PAGE_GLOBAL for kernel space LoongArch: Don't crash in stack_top() for tasks without vDSO LoongArch: Set correct size for vDSO code mapping LoongArch: Enable IRQ if do_ale() triggered in irq-enabled context LoongArch: Get correct cores_per_package for SMT systems LoongArch: Use "Exception return address" to comment ERA
2024-10-24Merge tag 'probes-fixes-v6.12-rc4.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fixes from Masami Hiramatsu: - objpool: Fix choosing allocation for percpu slots Fixes to allocate objpool's percpu slots correctly according to the GFP flag. It checks whether "any bit" in GFP_ATOMIC is set to choose the vmalloc source, but it should check "all bits" in GFP_ATOMIC flag is set, because GFP_ATOMIC is a combined flag. - tracing/probes: Fix MAX_TRACE_ARGS limit handling If more than MAX_TRACE_ARGS are passed for creating a probe event, the entries over MAX_TRACE_ARG in trace_arg array are not initialized. Thus if the kernel accesses those entries, it crashes. This rejects creating event if the number of arguments is over MAX_TRACE_ARGS. - tracing: Consider the NUL character when validating the event length A strlen() is used when parsing the event name, and the original code does not consider the terminal null byte. Thus it can pass the name one byte longer than the buffer. This fixes to check it correctly. * tag 'probes-fixes-v6.12-rc4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Consider the NULL character when validating the event length tracing/probes: Fix MAX_TRACE_ARGS limit handling objpool: fix choosing allocation for percpu slots
2024-10-24Merge tag 'for-6.12-rc4-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - mount option fixes: - fix handling of compression mount options on remount - reject rw remount in case there are options that don't work in read-write mode (like rescue options) - fix zone accounting of unusable space - fix in-memory corruption when merging extent maps - fix delalloc range locking for sector < page - use more convenient default value of drop subtree threshold, clean more subvolumes without the fallback to marking quotas inconsistent - fix smatch warning about incorrect value passed to ERR_PTR * tag 'for-6.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix passing 0 to ERR_PTR in btrfs_search_dir_index_item() btrfs: reject ro->rw reconfiguration if there are hard ro requirements btrfs: fix read corruption due to race with extent map merging btrfs: fix the delalloc range locking if sector size < page size btrfs: qgroup: set a more sane default value for subtree drop threshold btrfs: clear force-compress on remount when compress mount option is given btrfs: zoned: fix zone unusable accounting for freed reserved extent
2024-10-24Merge tag 'jfs-6.12-rc5' of github.com:kleikamp/linux-shaggyLinus Torvalds
Pull jfs fix from David Kleikamp: "Fix a regression introduced in 6.12-rc1" * tag 'jfs-6.12-rc5' of github.com:kleikamp/linux-shaggy: jfs: Fix sanity check in dbMount
2024-10-24Merge tag 'bcachefs-2024-10-22' of https://github.com/koverstreet/bcachefsLinus Torvalds
Pull bcachefs fixes from Kent Overstreet: "Lots of hotfixes: - transaction restart injection has been shaking out a few things - fix a data corruption in the buffered write path on -ENOSPC, found by xfstests generic/299 - Some small show_options fixes - Repair mismatches in inode hash type, seed: different snapshot versions of an inode must have the same hash/type seed, used for directory entries and xattrs. We were checking the hash seed, but not the type, and a user contributed a filesystem where the hash type on one inode had somehow been flipped; these fixes allow his filesystem to repair. Additionally, the hash type flip made some directory entries invisible, which were then recreated by userspace; so the hash check code now checks for duplicate non dangling dirents, and renames one of them if necessary. - Don't use wait_event_interruptible() in recovery: this fixes some filesystems failing to mount with -ERESTARTSYS - Workaround for kvmalloc not supporting > INT_MAX allocations, causing an -ENOMEM when allocating the sorted array of journal keys: this allows a 75 TB filesystem to mount - Make sure bch_inode_unpacked.bi_snapshot is set in the old inode compat path: this alllows Marcin's filesystem (in use since before 6.7) to repair and mount" * tag 'bcachefs-2024-10-22' of https://github.com/koverstreet/bcachefs: (26 commits) bcachefs: Set bch_inode_unpacked.bi_snapshot in old inode path bcachefs: Mark more errors as AUTOFIX bcachefs: Workaround for kvmalloc() not supporting > INT_MAX allocations bcachefs: Don't use wait_event_interruptible() in recovery bcachefs: Fix __bch2_fsck_err() warning bcachefs: fsck: Improve hash_check_key() bcachefs: bch2_hash_set_or_get_in_snapshot() bcachefs: Repair mismatches in inode hash seed, type bcachefs: Add hash seed, type to inode_to_text() bcachefs: INODE_STR_HASH() for bch_inode_unpacked bcachefs: Run in-kernel offline fsck without ratelimit errors bcachefs: skip mount option handle for empty string. bcachefs: fix incorrect show_options results bcachefs: Fix data corruption on -ENOSPC in buffered write path bcachefs: bch2_folio_reservation_get_partial() is now better behaved bcachefs: fix disk reservation accounting in bch2_folio_reservation_get() bcachefS: ec: fix data type on stripe deletion bcachefs: Don't use commit_do() unnecessarily bcachefs: handle restarts in bch2_bucket_io_time_reset() bcachefs: fix restart handling in __bch2_resume_logged_op_finsert() ...
2024-10-24Revert "9p: Enable multipage folios"Dominique Martinet
This reverts commit 1325e4a91a405f88f1b18626904d37860a4f9069. using multipage folios apparently break some madvise operations like MADV_PAGEOUT which do not reliably unload the specified page anymore, Revert the patch until that is figured out. Reported-by: Andrii Nakryiko <andrii@kernel.org> Fixes: 1325e4a91a40 ("9p: Enable multipage folios") Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-10-24Merge branch 'share-user-memory-to-bpf-program-through-task-storage-map'Alexei Starovoitov
Martin KaFai Lau says: ==================== Share user memory to BPF program through task storage map From: Martin KaFai Lau <martin.lau@kernel.org> It is the v6 of this series. Starting from v5, it is a continuation work of the RFC v4. Changes in v6: 1. In patch 1, reject t->size == 0 in btf_check_and_fixup_fields. Reject a uptr pointing to an empty struct. A test is added to patch 12 to test this case. 2. In patch 6, when checking if the uptr struct spans across pages, there was an off by one error in calculating the "end" such that the uptr will be rejected by error if the object is located exactly at the end of a page. This is fixed by adding t->size "- 1" to "start". A test is added to patch 9 to test this case. 3. In patch 6, check for PageHighMem(page) and return -EOPNOTSUPP. The 32 bit arch jit is missing other crucial bpf features (e.g. kfunc). Patch 6 commit message has been updated to include this change. 4. The selftests are cleaned up such that "struct user_data *dummy_data" global ptr is used instead of the whole "struct user_data dummy_data" object. Still a hack to avoid generating fwd btf type for the uptr struct but somewhat lighter than a full blown global object. Changes in v5: 1. The original patch 1 and patch 2 are combined. 2. Patch 3, 4, and 5 are new. They get the bpf_local_storage ready to handle the __uptr in the map_value. 3. Patch 6 is mostly new, so I reset the sob. 4. There are some changes in the carry over patch 1 and 2 also. They are mentioned at the individual patch. 5. More tests are added. The following is the original cover letter and the earlier change log. The bpf prog example has been removed. Please find a similar example in the selftests task_ls_uptr.c. ~~~~~~~~ Some of BPF schedulers (sched_ext) need hints from user programs to do a better job. For example, a scheduler can handle a task in a different way if it knows a task is doing GC. So, we need an efficient way to share the information between user programs and BPF programs. Sharing memory between user programs and BPF programs is what this patchset does. == REQUIREMENT == This patchset enables every task in every process to share a small chunk of memory of it's own with a BPF scheduler. So, they can update the hints without expensive overhead of syscalls. It also wants every task sees only the data/memory belong to the task/or the task's process. == DESIGN == This patchset enables BPF prorams to embed __uptr; uptr in the values of task storage maps. A uptr field can only be set by user programs by updating map element value through a syscall. A uptr points to a block of memory allocated by the user program updating the element value. The memory will be pinned to ensure it staying in the core memory and to avoid a page fault when the BPF program accesses it. Please see the selftests task_ls_uptr.c for an example. == MEMORY == In order to use memory efficiently, we don't want to pin a large number of pages. To archieve that, user programs should collect the memory blocks pointed by uptrs together to share memory pages if possible. It avoid the situation that pin one page for each thread in a process. Instead, we can have several threads pointing their uptrs to the same page but with different offsets. Although it is not necessary, avoiding the memory pointed by an uptr crossing the boundary of a page can prevent an additional mapping in the kernel address space. == RESTRICT == The memory pointed by a uptr should reside in one memory page. Crossing multi-pages is not supported at the moment. Only task storage map have been supported at the moment. The values of uptrs can only be updated by user programs through syscalls. bpf_map_lookup_elem() from userspace returns zeroed values for uptrs to prevent leaking information of the kernel. --- Changes from v3: - Merge part 4 and 5 as the new part 4 in order to cease the warning of unused functions from CI. Changes from v1: - Rename BPF_KPTR_USER to BPF_UPTR. - Restrict uptr to one page. - Mark uptr with PTR_TO_MEM | PTR_MAY_BE_NULL and with the size of the target type. - Move uptr away from bpf_obj_memcpy() by introducing bpf_obj_uptrcpy() and copy_map_uptr_locked(). - Remove the BPF_FROM_USER flag. - Align the meory pointed by an uptr in the test case. Remove the uptr of mmapped memory. Kui-Feng Lee (4): bpf: Support __uptr type tag in BTF bpf: Handle BPF_UPTR in verifier libbpf: define __uptr. selftests/bpf: Some basic __uptr tests ==================== Link: https://lore.kernel.org/r/20241023234759.860539-1-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-24selftests/bpf: Create task_local_storage map with invalid uptr's structMartin KaFai Lau
This patch tests the map creation failure when the map_value has unsupported uptr. The three cases are the struct is larger than one page, the struct is empty, and the struct is a kernel struct. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20241023234759.860539-13-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-24selftests/bpf: Add uptr failure verifier testsMartin KaFai Lau
Add verifier tests to ensure invalid uptr usages are rejected. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20241023234759.860539-12-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-24selftests/bpf: Add update_elem failure test for task storage uptrMartin KaFai Lau
This patch test the following failures in syscall update_elem 1. The first update_elem(BPF_F_LOCK) should be EOPNOTSUPP. syscall.c takes care of unpinning the uptr. 2. The second update_elem(BPF_EXIST) fails. syscall.c takes care of unpinning the uptr. 3. The forth update_elem(BPF_NOEXIST) fails. bpf_local_storage_update takes care of unpinning the uptr. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20241023234759.860539-11-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-24selftests/bpf: Test a uptr struct spanning across pages.Martin KaFai Lau
This patch tests the case when uptr has a struct spanning across two pages. It is not supported now and EOPNOTSUPP is expected from the syscall update_elem. It also tests the whole uptr struct located exactly at the end of a page and ensures that this case is accepted by update_elem. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20241023234759.860539-10-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-24selftests/bpf: Some basic __uptr testsKui-Feng Lee
Make sure the memory of uptrs have been mapped to the kernel properly. Also ensure the values of uptrs in the kernel haven't been copied to userspace. It also has the syscall update_elem/delete_elem test to test the pin/unpin code paths. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20241023234759.860539-9-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-24libbpf: define __uptr.Kui-Feng Lee
Make __uptr available to BPF programs to enable them to define uptrs. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20241023234759.860539-8-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-24bpf: Add uptr support in the map_value of the task local storage.Martin KaFai Lau
This patch adds uptr support in the map_value of the task local storage. struct map_value { struct user_data __uptr *uptr; }; struct { __uint(type, BPF_MAP_TYPE_TASK_STORAGE); __uint(map_flags, BPF_F_NO_PREALLOC); __type(key, int); __type(value, struct value_type); } datamap SEC(".maps"); A new bpf_obj_pin_uptrs() is added to pin the user page and also stores the kernel address back to the uptr for the bpf prog to use later. It currently does not support the uptr pointing to a user struct across two pages. It also excludes PageHighMem support to keep it simple. As of now, the 32bit bpf jit is missing other more crucial bpf features. For example, many important bpf features depend on bpf kfunc now but so far only one arch (x86-32) supports it which was added by me as an example when kfunc was first introduced to bpf. The uptr can only be stored to the task local storage by the syscall update_elem. Meaning the uptr will not be considered if it is provided by the bpf prog through bpf_task_storage_get(BPF_LOCAL_STORAGE_GET_F_CREATE). This is enforced by only calling bpf_local_storage_update(swap_uptrs==true) in bpf_pid_task_storage_update_elem. Everywhere else will have swap_uptrs==false. This will pump down to bpf_selem_alloc(swap_uptrs==true). It is the only case that bpf_selem_alloc() will take the uptr value when updating the newly allocated selem. bpf_obj_swap_uptrs() is added to swap the uptr between the SDATA(selem)->data and the user provided map_value in "void *value". bpf_obj_swap_uptrs() makes the SDATA(selem)->data takes the ownership of the uptr and the user space provided map_value will have NULL in the uptr. The bpf_obj_unpin_uptrs() is called after map->ops->map_update_elem() returning error. If the map->ops->map_update_elem has reached a state that the local storage has taken the uptr ownership, the bpf_obj_unpin_uptrs() will be a no op because the uptr is NULL. A "__"bpf_obj_unpin_uptrs is added to make this error path unpin easier such that it does not have to check the map->record is NULL or not. BPF_F_LOCK is not supported when the map_value has uptr. This can be revisited later if there is a use case. A similar swap_uptrs idea can be considered. The final bit is to do unpin_user_page in the bpf_obj_free_fields(). The earlier patch has ensured that the bpf_obj_free_fields() has gone through the rcu gp when needed. Cc: linux-mm@kvack.org Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Link: https://lore.kernel.org/r/20241023234759.860539-7-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-24bpf: Postpone bpf_obj_free_fields to the rcu callbackMartin KaFai Lau
A later patch will enable the uptr usage in the task_local_storage map. This will require the unpin_user_page() to be done after the rcu task trace gp for the cases that the uptr may still be used by a bpf prog. The bpf_obj_free_fields() will be the one doing unpin_user_page(), so this patch is to postpone calling bpf_obj_free_fields() to the rcu callback. The bpf_obj_free_fields() is only required to be done in the rcu callback when bpf->bpf_ma==true and reuse_now==false. bpf->bpf_ma==true case is because uptr will only be enabled in task storage which has already been moved to bpf_mem_alloc. The bpf->bpf_ma==false case can be supported in the future also if there is a need. reuse_now==false when the selem (aka storage) is deleted by bpf prog (bpf_task_storage_delete) or by syscall delete_elem(). In both cases, bpf_obj_free_fields() needs to wait for rcu gp. A few words on reuse_now==true. reuse_now==true when the storage's owner (i.e. the task_struct) is destructing or the map itself is doing map_free(). In both cases, no bpf prog should have a hold on the selem and its uptrs, so there is no need to postpone bpf_obj_free_fields(). reuse_now==true should be the common case for local storage usage where the storage exists throughout the lifetime of its owner (task_struct). The bpf_obj_free_fields() needs to use the map->record. Doing bpf_obj_free_fields() in a rcu callback will require the bpf_local_storage_map_free() to wait for rcu_barrier. An optimization could be only waiting for rcu_barrier when the map has uptr in its map_value. This will require either yet another rcu callback function or adding a bool in the selem to flag if the SDATA(selem)->smap is still valid. This patch chooses to keep it simple and wait for rcu_barrier for maps that use bpf_mem_alloc. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20241023234759.860539-6-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-24bpf: Postpone bpf_selem_free() in bpf_selem_unlink_storage_nolock()Martin KaFai Lau
In a later patch, bpf_selem_free() will call unpin_user_page() through bpf_obj_free_fields(). unpin_user_page() may take spin_lock. However, some bpf_selem_free() call paths have held a raw_spin_lock. Like this: raw_spin_lock_irqsave() bpf_selem_unlink_storage_nolock() bpf_selem_free() unpin_user_page() spin_lock() To avoid spinlock nested in raw_spinlock, bpf_selem_free() should be done after releasing the raw_spinlock. The "bool reuse_now" arg is replaced with "struct hlist_head *free_selem_list" in bpf_selem_unlink_storage_nolock(). The bpf_selem_unlink_storage_nolock() will append the to-be-free selem at the free_selem_list. The caller of bpf_selem_unlink_storage_nolock() will need to call the new bpf_selem_free_list(free_selem_list, reuse_now) to free the selem after releasing the raw_spinlock. Note that the selem->snode cannot be reused for linking to the free_selem_list because the selem->snode is protected by the raw_spinlock that we want to avoid holding. A new "struct hlist_node free_node;" is union-ized with the rcu_head. Only the first one successfully hlist_del_init_rcu(&selem->snode) will be able to use the free_node. After succeeding hlist_del_init_rcu(&selem->snode), the free_node and rcu_head usage is serialized such that they can share the 16 bytes in a union. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20241023234759.860539-5-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-24bpf: Add "bool swap_uptrs" arg to bpf_local_storage_update() and ↵Martin KaFai Lau
bpf_selem_alloc() In a later patch, the task local storage will only accept uptr from the syscall update_elem and will not accept uptr from the bpf prog. The reason is the bpf prog does not have a way to provide a valid user space address. bpf_local_storage_update() and bpf_selem_alloc() are used by both bpf prog bpf_task_storage_get(BPF_LOCAL_STORAGE_GET_F_CREATE) and bpf syscall update_elem. "bool swap_uptrs" arg is added to bpf_local_storage_update() and bpf_selem_alloc() to tell if it is called by the bpf prog or by the bpf syscall. When swap_uptrs==true, it is called by the syscall. The arg is named (swap_)uptrs because the later patch will swap the uptrs between the newly allocated selem and the user space provided map_value. It will make error handling easier in case map->ops->map_update_elem() fails and the caller can decide if it needs to unpin the uptr in the user space provided map_value or the bpf_local_storage_update() has already taken the uptr ownership and will take care of unpinning it also. Only swap_uptrs==false is passed now. The logic to handle the true case will be added in a later patch. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20241023234759.860539-4-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-24bpf: Handle BPF_UPTR in verifierKui-Feng Lee
This patch adds BPF_UPTR support to the verifier. Not that only the map_value will support the "__uptr" type tag. This patch enforces only BPF_LDX is allowed to the value of an uptr. After BPF_LDX, it will mark the dst_reg as PTR_TO_MEM | PTR_MAYBE_NULL with size deduced from the field.kptr.btf_id. This will make the dst_reg pointed memory to be readable and writable as scalar. There is a redundant "val_reg = reg_state(env, value_regno);" statement in the check_map_kptr_access(). This patch takes this chance to remove it also. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20241023234759.860539-3-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-24bpf: Support __uptr type tag in BTFKui-Feng Lee
This patch introduces the "__uptr" type tag to BTF. It is to define a pointer pointing to the user space memory. This patch adds BTF logic to pass the "__uptr" type tag. btf_find_kptr() is reused for the "__uptr" tag. The "__uptr" will only be supported in the map_value of the task storage map. However, btf_parse_struct_meta() also uses btf_find_kptr() but it is not interested in "__uptr". This patch adds a "field_mask" argument to btf_find_kptr() which will return BTF_FIELD_IGNORE if the caller is not interested in a “__uptr” field. btf_parse_kptr() is also reused to parse the uptr. The btf_check_and_fixup_fields() is changed to do extra checks on the uptr to ensure that its struct size is not larger than PAGE_SIZE. It is not clear how a uptr pointing to a CO-RE supported kernel struct will be used, so it is also not allowed now. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20241023234759.860539-2-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-24Merge branch 'add-the-missing-bpf_link_type-invocation-for-sockmap'Andrii Nakryiko
Hou Tao says: ==================== Add the missing BPF_LINK_TYPE invocation for sockmap From: Hou Tao <houtao1@huawei.com> Hi, The tiny patch set fixes the out-of-bound read problem when reading the fdinfo of sock map link fd. And in order to spot such omission early for the newly-added link type in the future, it also checks the validity of the link->type and adds a WARN_ONCE() for missed invocation. Please see individual patches for more details. And comments are always welcome. v3: * patch #2: check and warn the validity of link->type instead of adding a static assertion for bpf_link_type_strs array. v2: http://lore.kernel.org/bpf/d49fa2f4-f743-c763-7579-c3cab4dd88cb@huaweicloud.com ==================== Link: https://lore.kernel.org/r/20241024013558.1135167-1-houtao@huaweicloud.com Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-10-24bpf: Check validity of link->type in bpf_link_show_fdinfo()Hou Tao
If a newly-added link type doesn't invoke BPF_LINK_TYPE(), accessing bpf_link_type_strs[link->type] may result in an out-of-bounds access. To spot such missed invocations early in the future, checking the validity of link->type in bpf_link_show_fdinfo() and emitting a warning when such invocations are missed. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241024013558.1135167-3-houtao@huaweicloud.com
2024-10-24bpf: Add the missing BPF_LINK_TYPE invocation for sockmapHou Tao
There is an out-of-bounds read in bpf_link_show_fdinfo() for the sockmap link fd. Fix it by adding the missing BPF_LINK_TYPE invocation for sockmap link Also add comments for bpf_link_type to prevent missing updates in the future. Fixes: 699c23f02c65 ("bpf: Add bpf_link support for sk_msg and sk_skb progs") Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241024013558.1135167-2-houtao@huaweicloud.com
2024-10-24Merge branch 'net-dsa-mv88e6xxx-fix-mv88e6393x-phc-frequency-on-internal-clock'Paolo Abeni
Shenghao Yang says: ==================== net: dsa: mv88e6xxx: fix MV88E6393X PHC frequency on internal clock The MV88E6393X family of switches can additionally run their cycle counters using a 250MHz internal clock instead of the usual 125MHz external clock [1]. The driver currently assumes all designs utilize that external clock, but MikroTik's RB5009 uses the internal source - causing the PHC to be seen running at 2x real time in userspace, making synchronization with ptp4l impossible. This series adds support for reading off the cycle counter frequency known to the hardware in the TAI_CLOCK_PERIOD register and picking an appropriate set of scaling coefficients instead of using a fixed set for each switch family. Patch 1 groups those cycle counter coefficients into a new structure to make it easier to pass them around. Patch 2 modifies PTP initialization to probe TAI_CLOCK_PERIOD and use an appropriate set of coefficients. Patch 3 adds support for 4000ps cycle counter periods. Changes since v2 [2]: - Patch 1: "net: dsa: mv88e6xxx: group cycle counter coefficients" - Moved declaration of mv88e6xxx_cc_coeffs to avoid moving that in Patch 2. - Patch 2: "net: dsa: mv88e6xxx: read cycle counter period from hardware" - Removed move of mv88e6xxx_cc_coeffs declaration. - Patch 3: "net: dsa: mv88e6xxx: support 4000ps cycle counter periods" - No change. [1] https://lore.kernel.org/netdev/d6622575-bf1b-445a-b08f-2739e3642aae@lunn.ch/ [2] https://lore.kernel.org/netdev/20241006145951.719162-1-me@shenghaoyang.info/ ==================== Link: https://patch.msgid.link/20241020063833.5425-1-me@shenghaoyang.info Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-24net: dsa: mv88e6xxx: support 4000ps cycle counter periodShenghao Yang
The MV88E6393X family of devices can run its cycle counter off an internal 250MHz clock instead of an external 125MHz one. Add support for this cycle counter period by adding another set of coefficients and lowering the periodic cycle counter read interval to compensate for faster overflows at the increased frequency. Otherwise, the PHC runs at 2x real time in userspace and cannot be synchronized. Fixes: de776d0d316f ("net: dsa: mv88e6xxx: add support for mv88e6393x family") Signed-off-by: Shenghao Yang <me@shenghaoyang.info> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-24net: dsa: mv88e6xxx: read cycle counter period from hardwareShenghao Yang
Instead of relying on a fixed mapping of hardware family to cycle counter frequency, pull this information from the MV88E6XXX_TAI_CLOCK_PERIOD register. This lets us support switches whose cycle counter frequencies depend on board design. Fixes: de776d0d316f ("net: dsa: mv88e6xxx: add support for mv88e6393x family") Suggested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Shenghao Yang <me@shenghaoyang.info> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-24net: dsa: mv88e6xxx: group cycle counter coefficientsShenghao Yang
Instead of having them as individual fields in ptp_ops, wrap the coefficients in a separate struct so they can be referenced together. Fixes: de776d0d316f ("net: dsa: mv88e6xxx: add support for mv88e6393x family") Signed-off-by: Shenghao Yang <me@shenghaoyang.info> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Paolo Abeni <pabeni@redhat.com>