Age  Commit message  Author
2022-12-30  net: hns3: refine the handling for VF heartbeat  Jian Shen
Currently, the PF checks whether a VF is alive through the KEEP_ALIVE mailbox sent by the VF. The VF sends the mailbox every 2 seconds. Once the PF has missed the mailbox for more than 8 seconds, it regards the VF as abnormal and stops notifying it of state changes, including link state, VF MAC, and reset, even after it receives the KEEP_ALIVE mailbox again. This is unreasonable. This patch fixes it: the PF records the state changes that need to be reported to the VF while the VF's KEEP_ALIVE mailbox is missing, and notifies the VF once the mailbox is received again. Introduce a new flag, HCLGE_VPORT_STATE_INITED, to distinguish whether the VF driver is loaded. The VF queries these states when initializing, so there is no need to notify it in that case. Fixes: aa5c4f175be6 ("net: hns3: add reset handling for VF when doing PF reset") Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Hao Lan <lanhao@huawei.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
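The record-then-replay behaviour described in this commit can be pictured with a small userspace model. This is only a sketch under stated assumptions: every name below (vf_state, pf_state_changed, the PENDING_* bits) is hypothetical and not taken from the hns3 driver, and the "delivered" field merely stands in for mailbox messages the PF would send.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not the hns3 driver's code. */
enum pending_bits {
	PENDING_LINK  = 1u << 0,
	PENDING_MAC   = 1u << 1,
	PENDING_RESET = 1u << 2,
};

struct vf_state {
	bool inited;        /* VF driver loaded (role of HCLGE_VPORT_STATE_INITED) */
	bool alive;         /* KEEP_ALIVE seen recently */
	uint32_t pending;   /* changes recorded while the VF was silent */
	uint32_t delivered; /* stand-in for notifications actually sent */
};

/* PF side: a state change happened that the VF should learn about. */
static void pf_state_changed(struct vf_state *vf, uint32_t what)
{
	if (!vf->inited)
		return;                /* VF queries state itself when it loads */
	if (vf->alive)
		vf->delivered |= what; /* notify immediately */
	else
		vf->pending |= what;   /* remember for later */
}

/* PF side: no KEEP_ALIVE arrived within the timeout. */
static void pf_keepalive_lost(struct vf_state *vf)
{
	vf->alive = false;
}

/* PF side: a KEEP_ALIVE mailbox arrived; flush what was recorded. */
static void pf_keepalive_received(struct vf_state *vf)
{
	vf->alive = true;
	vf->delivered |= vf->pending;
	vf->pending = 0;
}
```

In this model, a link change that occurs after pf_keepalive_lost() is held in pending and only moves to delivered when pf_keepalive_received() runs; a VF with inited == false records nothing.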
2022-12-30  net: ethernet: freescale: enetc: Drop empty platform remove function  Uwe Kleine-König
A remove callback just returning 0 is equivalent to no remove callback at all. So drop the useless function. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-30  net: ethernet: broadcom: bcm63xx_enet: Drop empty platform remove function  Uwe Kleine-König
A remove callback just returning 0 is equivalent to no remove callback at all. So drop the useless function. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-30  Merge branch 'tcp-bhash2-fixes'  David S. Miller
Kuniyuki Iwashima says: =================== tcp: Fix bhash2 and TIME_WAIT regression. We forgot to add twsk to bhash2. Therefore TIME_WAIT sockets cannot prevent bind() to the same local address and port. Changes: v1: * Patch 1: * Add tw_bind2_node in inet_timewait_sock instead of moving sk_bind2_node from struct sock to struct sock_common. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-30  tcp: Add selftest for bind() and TIME_WAIT.  Kuniyuki Iwashima
bhash2 split the bind() validation logic into wildcard and non-wildcard cases. Let's add a test to catch future regression. Before the previous patch: # ./bind_timewait TAP version 13 1..2 # Starting 2 tests from 3 test cases. # RUN bind_timewait.localhost.1 ... # bind_timewait.c:87:1:Expected ret (0) == -1 (-1) # 1: Test terminated by assertion # FAIL bind_timewait.localhost.1 not ok 1 bind_timewait.localhost.1 # RUN bind_timewait.addrany.1 ... # OK bind_timewait.addrany.1 ok 2 bind_timewait.addrany.1 # FAILED: 1 / 2 tests passed. # Totals: pass:1 fail:1 xfail:0 xpass:0 skip:0 error:0 After: # ./bind_timewait TAP version 13 1..2 # Starting 2 tests from 3 test cases. # RUN bind_timewait.localhost.1 ... # OK bind_timewait.localhost.1 ok 1 bind_timewait.localhost.1 # RUN bind_timewait.addrany.1 ... # OK bind_timewait.addrany.1 ok 2 bind_timewait.addrany.1 # PASSED: 2 / 2 tests passed. # Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0 Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-30  tcp: Add TIME_WAIT sockets in bhash2.  Kuniyuki Iwashima
Jiri Slaby reported regression of bind() with a simple repro. [0] The repro creates a TIME_WAIT socket and tries to bind() a new socket with the same local address and port. Before commit 28044fc1d495 ("net: Add a bhash2 table hashed by port and address"), the bind() failed with -EADDRINUSE, but now it succeeds. The cited commit should have put TIME_WAIT sockets into bhash2; otherwise, inet_bhash2_conflict() misses TIME_WAIT sockets when validating bind() requests if the address is not a wildcard one. The straight option is to move sk_bind2_node from struct sock to struct sock_common to add twsk to bhash2 as implemented as RFC. [1] However, the binary layout change in the struct sock could affect performances moving hot fields on different cachelines. To avoid that, we add another TIME_WAIT list in inet_bind2_bucket and check it while validating bind(). [0]: https://lore.kernel.org/netdev/6b971a4e-c7d8-411e-1f92-fda29b5b2fb9@kernel.org/ [1]: https://lore.kernel.org/netdev/20221221151258.25748-2-kuniyu@amazon.com/ Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address") Reported-by: Jiri Slaby <jirislaby@kernel.org> Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
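As a rough illustration of the approach chosen here (a second, TIME_WAIT-only list in the per-address bucket that the bind() check also consults), consider the toy model below. It is a sketch, not kernel code: the toy_* types and fields are invented for this example, and real conflict checks also weigh socket options and address families, which are ignored here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model, not kernel code: one bucket per (address, port) pair. */
struct toy_sock {
	struct toy_sock *next;
};

struct toy_bind2_bucket {
	uint32_t addr;
	uint16_t port;
	struct toy_sock *owners;    /* full sockets bound to addr:port */
	struct toy_sock *tw_owners; /* TIME_WAIT sockets for addr:port */
};

/* In this simplified model, bind() conflicts if either list is non-empty. */
static bool toy_bind_conflict(const struct toy_bind2_bucket *b,
			      uint32_t addr, uint16_t port)
{
	if (b->addr != addr || b->port != port)
		return false;
	return b->owners != NULL || b->tw_owners != NULL;
}
```

A bucket holding only a TIME_WAIT entry still reports a conflict, which is the case the regression missed. Keeping the extra list in the bucket rather than in struct sock is what avoids the layout change mentioned in the message.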
2022-12-29  libbpf: Restore errno after pr_warn.  Alexei Starovoitov
pr_warn calls into user-provided callback, which can clobber errno, so `errno = saved_errno` should happen after pr_warn. Fixes: 07453245620c ("libbpf: fix errno is overwritten after being closed.") Signed-off-by: Alexei Starovoitov <ast@kernel.org>
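The ordering issue is easy to reproduce outside libbpf. The sketch below assumes a hypothetical log_fn callback type and a deliberately misbehaving logger; neither is a libbpf API. It only shows why the restore has to come after the callback returns.

```c
#include <errno.h>

/* Hypothetical logger type standing in for a user-supplied print callback. */
typedef void (*log_fn)(const char *msg);

/* A callback that clobbers errno, as real logging code easily can. */
static void noisy_logger(const char *msg)
{
	(void)msg;
	errno = EIO;
}

/* Report a failure and return -1 with errno describing the original error. */
static int fail_with(int err, log_fn log)
{
	int saved_errno = err;

	log("operation failed");
	errno = saved_errno; /* restore only after the callback has run */
	return -1;
}
```

Calling fail_with(ENOENT, noisy_logger) leaves errno set to ENOENT. If the assignment were placed before the log() call, the caller would see EIO instead.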
2022-12-29  Merge tag 'block-6.2-2022-12-29' of git://git.kernel.dk/linux  Linus Torvalds
Pull block fixes from Jens Axboe: "Mostly just NVMe, but also a single fixup for BFQ for a regression that happened during the merge window. In detail: - NVMe pull requests via Christoph: - Fix doorbell buffer value endianness (Klaus Jensen) - Fix Linux vs NVMe page size mismatch (Keith Busch) - Fix a potential memory access beyond the allocation limit (Keith Busch) - Fix a multipath vs blktrace NULL pointer dereference (Yanjun Zhang) - Fix various problems in handling the Command Supported and Effects log (Christoph Hellwig) - Don't allow unprivileged passthrough of commands that don't transfer data but modify logical block content (Christoph Hellwig) - Add a features and quirks policy document (Christoph Hellwig) - Fix some really nasty code that was correct but made smatch complain (Sagi Grimberg) - Use-after-free regression in BFQ from this merge window (Yu)" * tag 'block-6.2-2022-12-29' of git://git.kernel.dk/linux: nvme-auth: fix smatch warning complaints nvme: consult the CSE log page for unprivileged passthrough nvme: also return I/O command effects from nvme_command_effects nvmet: don't defer passthrough commands with trivial effects to the workqueue nvmet: set the LBCC bit for commands that modify data nvmet: use NVME_CMD_EFFECTS_CSUPP instead of open coding it nvme: fix the NVME_CMD_EFFECTS_CSE_MASK definition docs, nvme: add a feature and quirk policy document nvme-pci: update sqsize when adjusting the queue depth nvme: fix setting the queue depth in nvme_alloc_io_tag_set block, bfq: fix uaf for bfqq in bfq_exit_icq_bfqq nvme: fix multipath crash caused by flush request when blktrace is enabled nvme-pci: fix page size checks nvme-pci: fix mempool alloc size nvme-pci: fix doorbell buffer value endianness
2022-12-29  Merge tag 'io_uring-6.2-2022-12-29' of git://git.kernel.dk/linux  Linus Torvalds
Pull io_uring fixes from Jens Axboe: - Two fixes for mutex grabbing when the task state is != TASK_RUNNING (me) - Check for invalid opcode in io_uring_register() a bit earlier, to avoid going through the quiesce machinery just to return -EINVAL later in the process (me) - Fix for the uapi io_uring header, skipping including time_types.h when necessary (Stefan) * tag 'io_uring-6.2-2022-12-29' of git://git.kernel.dk/linux: uapi:io_uring.h: allow linux/time_types.h to be skipped io_uring: check for valid register opcode earlier io_uring/cancel: re-grab ctx mutex after finishing wait io_uring: finish waiting before flushing overflow entries
2022-12-29  Merge tag 'linux-kselftest-kunit-fixes-6.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest  Linus Torvalds
Pull KUnit fix from Shuah Khan: - alloc_string_stream_fragment() error path fix to free before returning a failure. * tag 'linux-kselftest-kunit-fixes-6.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: kunit: alloc_string_stream_fragment error handling bug fix
2022-12-29  libbpf: Added the description of some API functions  Xin Liu
Currently, many API functions are not described in the documentation. Add descriptions for the following four API functions: - libbpf_set_print; - bpf_object__open; - bpf_object__load; - bpf_object__close. Signed-off-by: Xin Liu <liuxin350@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20221224112058.12038-1-liuxin350@huawei.com
2022-12-29  Merge branch 'samples/bpf: enhance syscall tracing program'  Andrii Nakryiko
"Daniel T. Lee" says: ==================== Syscall tracing using kprobe is quite unstable. Since it uses the exact name of the kernel function, the program might break when a function is renamed. The problem can also be caused by a change in the arguments of the function to which the kprobe attaches. This patchset enhances the syscall tracing programs as follows. In this patchset, ksyscall is used instead of kprobe. With ksyscall, libbpf detects the appropriate kernel function name (e.g. sys_write -> __s390_sys_write). This eliminates the need to worry about which wrapper function to attach to in order to parse arguments. Since ksyscall also provides a finer-grained way of attaching to system calls, the coarse SYSCALL helper in trace_common.h can be removed. Next, BPF_KSYSCALL is used to reduce the inconvenience of parsing arguments. Because a SYSCALL_WRAPPER function wraps the arguments once, an additional argument-extraction step is required to parse them properly. The BPF_KSYSCALL macro reduces the hassle of parsing arguments from pt_regs. Lastly, vmlinux.h is applied to the syscall tracing programs. This change allows a bpf program to refer to internal structures through a single "vmlinux.h" instead of including each header it references. Additionally, this patchset changes the _kern suffix to .bpf to make use of the new compile rule (CLANG-BPF), which is simpler and neater. Just changing the _kern suffix to .bpf inherits the benefit of the new CLANG-BPF compile target. Also, this patchset adds a dummy gnu/stub.h to the samples/bpf directory. This fixes the compile problem with 'clang -target bpf'. To fix the build error on s390x, this patchset also includes a fix for the invalid libbpf return address register mapping on s390. 
--- Changes in V2: - add gnu/stub.h hack to fix compile error with 'clang -target bpf' Changes in V3: - fix libbpf invalid return address register mapping in s390 ==================== Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2022-12-29  libbpf: Fix invalid return address register in s390  Daniel T. Lee
There is currently an invalid register mapping in the s390 return address register. As the manual[1] states, the return address can be found at r14. In bpf_tracing.h, the s390 registers were named gprs(general purpose registers). This commit fixes the problem by correcting the mistyped mapping. [1]: https://uclibc.org/docs/psABI-s390x.pdf#page=14 Fixes: 3cc31d794097 ("libbpf: Normalize PT_REGS_xxx() macro definitions") Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20221224071527.2292-7-danieltimlee@gmail.com
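To make the register mapping concrete, here is a minimal sketch of an accessor that reads the return address from r14 through a gprs array. The struct and macro names (toy_s390_regs, TOY_PT_REGS_RET) are assumptions made up for this example; the actual definitions in bpf_tracing.h and the kernel headers differ and contain more fields.

```c
#include <stdint.h>

/* Simplified stand-in for the s390 register block; the real layout has
 * more fields. */
struct toy_s390_regs {
	uint64_t gprs[16]; /* general purpose registers r0..r15 */
};

/* Per the s390x ABI document cited in the commit, r14 holds the return
 * address. */
#define TOY_PT_REGS_RET(regs) ((regs)->gprs[14])
```

Because such accessors are plain macros over field names, a misspelled field only surfaces when some program actually expands the macro on that architecture.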
2022-12-29  samples/bpf: Use BPF_KSYSCALL macro in syscall tracing programs  Daniel T. Lee
This commit enhances the syscall tracing programs by using the BPF_KSYSCALL macro to reduce the inconvenience of parsing arguments from pt_regs. With simpler argument extraction, the bpf programs become easier to understand. Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20221224071527.2292-6-danieltimlee@gmail.com
2022-12-29  samples/bpf: Fix tracex2 by using BPF_KSYSCALL macro  Daniel T. Lee
Currently, there is a problem with tracex2: it doesn't print the histogram properly and the results are misleading (all results are reported as 0). The problem is caused by a change in the arguments of the function to which the kprobe attaches. The tracex2 bpf program uses a kprobe (attached to __x64_sys_write) to figure out the size of the write system call. To achieve this, the third argument 'count' must be intact. The following are the prototypes of the sys_write variants (checked with pfunct): ~/git/linux$ pfunct -P fs/read_write.o | grep sys_write ssize_t ksys_write(unsigned int fd, const char * buf, size_t count); long int __x64_sys_write(const struct pt_regs * regs); ... cross compile with s390x ... long int __s390_sys_write(struct pt_regs * regs); Because a SYSCALL_WRAPPER function wraps the arguments once, an additional argument-extraction step is required to parse them properly. #define BPF_KSYSCALL(name, args...) ... snip ... struct pt_regs *regs = LINUX_HAS_SYSCALL_WRAPPER \ ? (struct pt_regs *)PT_REGS_PARM1(ctx) \ : ctx; \ To fix this problem, the BPF_KSYSCALL macro is used. This reduces the hassle of parsing arguments from pt_regs. Since the macro uses the CORE version of argument extraction, it brings additional portability too. Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20221224071527.2292-5-danieltimlee@gmail.com
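The double indirection that makes 'count' read as 0 can be modelled in ordinary userspace C. This is a toy under stated assumptions: toy_regs and the toy_* helpers are invented for illustration and are not the real struct pt_regs or libbpf macros. It mirrors only the selection between "ctx holds the arguments" and "ctx's first argument points at the registers that hold them".

```c
#include <stddef.h>
#include <stdint.h>

/* Toy register block: parm[0..2] stand in for the first three argument
 * registers. Not the real struct pt_regs. */
struct toy_regs {
	uintptr_t parm[3];
};

/* With a syscall wrapper, the probed function receives one argument:
 * a pointer to the register block holding the real syscall arguments. */
static const struct toy_regs *toy_syscall_regs(const struct toy_regs *ctx,
					       int has_wrapper)
{
	return has_wrapper ? (const struct toy_regs *)ctx->parm[0] : ctx;
}

/* Third argument of write(fd, buf, count). */
static size_t toy_write_count(const struct toy_regs *ctx, int has_wrapper)
{
	return (size_t)toy_syscall_regs(ctx, has_wrapper)->parm[2];
}
```

Reading parm[2] directly from the outer block, as a plain kprobe on the wrapper would, yields whatever happens to be in that register rather than 'count'. Following the pointer first recovers the intended value.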
2022-12-29  samples/bpf: Change _kern suffix to .bpf with syscall tracing program  Daniel T. Lee
Currently, the old compile rule (CLANG-bpf) doesn't contain the VMLINUX_H define flag, which is essential for bpf programs that include "vmlinux.h". The old compile rule also doesn't directly specify bpf as the compile target; instead it uses a bunch of extra clang options followed by a long chain of commands (e.g. clang | opt | llvm-dis | llc). The Makefile already has a new compile rule that is simpler and neater, and it also has the -D__VMLINUX_H__ option. Just changing the _kern suffix to .bpf inherits the benefit of the new CLANG-BPF compile target. Also, this commit adds a dummy gnu/stub.h to the samples/bpf directory. As commit 1c2dd16add7e ("selftests/bpf: get rid of -D__x86_64__") noted, compiling with 'clang -target bpf' raises an error with stubs.h unless a workaround (-D__x86_64__) is used. This commit solves the problem by adding a dummy stub.h so that /usr/include/features.h follows the expected path, the same way selftests/bpf dealt with it. Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20221224071527.2292-4-danieltimlee@gmail.com
2022-12-29  samples/bpf: Use vmlinux.h instead of implicit headers in syscall tracing program  Daniel T. Lee
This commit applies vmlinux.h to syscall tracing program. This change allows the bpf program to refer to the internal structure as a single "vmlinux.h" instead of including each header referenced by the bpf program. Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20221224071527.2292-3-danieltimlee@gmail.com
2022-12-29  samples/bpf: Use ksyscall instead of kprobe in syscall tracing program  Daniel T. Lee
Syscall tracing using kprobe is quite unstable. Since it uses the exact name of the kernel function, the program might break when a function is renamed. The problem can also be caused by a change in the arguments of the function to which the kprobe attaches. In this commit, ksyscall is used instead of kprobe. With ksyscall, libbpf detects the appropriate kernel function name (e.g. sys_write -> __s390_sys_write). This eliminates the need to worry about which wrapper function to attach to in order to parse arguments. In addition, since ksyscall provides a finer-grained way of attaching to system calls, the coarse SYSCALL helper in trace_common.h can be removed. Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20221224071527.2292-2-danieltimlee@gmail.com
2022-12-29  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm  Linus Torvalds
Pull kvm fixes from Paolo Bonzini: "Changes that were posted too late for 6.1, or after the release. x86: - several fixes to nested VMX execution controls - fixes and clarification to the documentation for Xen emulation - do not unnecessarily release a pmu event with zero period - MMU fixes - fix Coverity warning in kvm_hv_flush_tlb() selftests: - fixes for the ucall mechanism in selftests - other fixes mostly related to compilation with clang" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (41 commits) KVM: selftests: restore special vmmcall code layout needed by the harness Documentation: kvm: clarify SRCU locking order KVM: x86: fix deadlock for KVM_XEN_EVTCHN_RESET KVM: x86/xen: Documentation updates and clarifications KVM: x86/xen: Add KVM_XEN_INVALID_GPA and KVM_XEN_INVALID_GFN to uapi KVM: x86/xen: Simplify eventfd IOCTLs KVM: x86/xen: Fix SRCU/RCU usage in readers of evtchn_ports KVM: x86/xen: Use kvm_read_guest_virt() instead of open-coding it badly KVM: x86/xen: Fix memory leak in kvm_xen_write_hypercall_page() KVM: Delete extra block of "};" in the KVM API documentation kvm: x86/mmu: Remove duplicated "be split" in spte.h kvm: Remove the unused macro KVM_MMU_READ_{,UN}LOCK() MAINTAINERS: adjust entry after renaming the vmx hyperv files KVM: selftests: Mark correct page as mapped in virt_map() KVM: arm64: selftests: Don't identity map the ucall MMIO hole KVM: selftests: document the default implementation of vm_vaddr_populate_bitmap KVM: selftests: Use magic value to signal ucall_alloc() failure KVM: selftests: Disable "gnu-variable-sized-type-not-at-end" warning KVM: selftests: Include lib.mk before consuming $(CC) KVM: selftests: Explicitly disable builtins for mem*() overrides ...
2022-12-29  Merge tag 'nvme-6.2-2022-12-29' of git://git.infradead.org/nvme into block-6.2  Jens Axboe
Pull NVMe fixes from Christoph: "nvme fixes for Linux 6.2 - fix various problems in handling the Command Supported and Effects log (Christoph Hellwig) - don't allow unprivileged passthrough of commands that don't transfer data but modify logical block content (Christoph Hellwig) - add a features and quirks policy document (Christoph Hellwig) - fix some really nasty code that was correct but made smatch complain (Sagi Grimberg)" * tag 'nvme-6.2-2022-12-29' of git://git.infradead.org/nvme: nvme-auth: fix smatch warning complaints nvme: consult the CSE log page for unprivileged passthrough nvme: also return I/O command effects from nvme_command_effects nvmet: don't defer passthrough commands with trivial effects to the workqueue nvmet: set the LBCC bit for commands that modify data nvmet: use NVME_CMD_EFFECTS_CSUPP instead of open coding it nvme: fix the NVME_CMD_EFFECTS_CSE_MASK definition docs, nvme: add a feature and quirk policy document
2022-12-28  bpf: rename list_head -> graph_root in field info types  Dave Marchevsky
Many of the structs recently added to track field info for linked-list head are useful as-is for rbtree root. So let's do a mechanical renaming of list_head-related types and fields: include/linux/bpf.h: struct btf_field_list_head -> struct btf_field_graph_root list_head -> graph_root in struct btf_field union kernel/bpf/btf.c: list_head -> graph_root in struct btf_field_info This is a nonfunctional change, functionality to actually use these fields for rbtree will be added in further patches. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20221217082506.1570898-5-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-29  kconfig: Add static text for search information in help menu  Bhaskar Chowdhury
Add some static text explaining how to bring up the search dialog box by pressing the forward slash key anywhere in this interface. Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2022-12-28  bpf: Always use maximal size for copy_array()  Kees Cook
Instead of counting on prior allocations to have sized allocations to the next kmalloc bucket size, always perform a krealloc that is at least ksize(dst) in size (which is a no-op), so the size can be correctly tracked by all the various allocation size trackers (KASAN, __alloc_size, etc). Reported-by: Hyunwoo Kim <v4bel@theori.io> Link: https://lore.kernel.org/bpf/20221223094551.GA1439509@ubuntu Fixes: ceb35b666d42 ("bpf/verifier: Use kmalloc_size_roundup() to match ksize() usage") Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Song Liu <song@kernel.org> Cc: Yonghong Song <yhs@fb.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Stanislav Fomichev <sdf@google.com> Cc: Hao Luo <haoluo@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: bpf@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221223182836.never.866-kees@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-28  Merge branch 'bpf: fix the crash caused by task iterators over vma'  Alexei Starovoitov
Kui-Feng Lee says: ==================== This issue is related to task iterators over vma. A system crash can occur when a task iterator travels through vma of tasks as the death of a task will clear the pointer to its mm, even though the task_struct is still held. As a result, an unexpected crash happens due to a null pointer. To address this problem, a reference to mm is kept on the iterator to make sure that the pointer is always valid. This patch set provides a solution for this crash by properly referencing mm on task iterators over vma. The major changes from v1 are: - Fix commit logs of the test case. - Use reverse Christmas tree coding style. - Remove unnecessary error handling for time(). v1: https://lore.kernel.org/bpf/20221216015912.991616-1-kuifeng@meta.com/ ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-28  selftests/bpf: add a test for iter/task_vma for short-lived processes  Kui-Feng Lee
When a task iterator traverses vma(s), task->mm might become invalid in the middle of the traversal, which may cause the kernel to misbehave (e.g., crash). This test case creates iterators repeatedly and forks short-lived processes in the background to detect this bug. The test runs for 3 seconds to get a chance to trigger the issue. Signed-off-by: Kui-Feng Lee <kuifeng@meta.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221216221855.4122288-3-kuifeng@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-28  bpf: keep a reference to the mm, in case the task is dead.  Kui-Feng Lee
Fix the system crash that happens when a task iterator travels through the vma of tasks. In task iterators, we used to access mm by following the pointer on the task_struct; however, the death of a task clears the pointer, even though we still hold the task_struct. That can cause an unexpected crash on a null pointer when an iterator is visiting a task that dies during the visit. Keeping a reference to mm on the iterator ensures we always have a valid pointer to mm. Co-developed-by: Song Liu <song@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com> Reported-by: Nathan Slingerland <slinger@meta.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221216221855.4122288-2-kuifeng@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
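The idea of pinning the mm on the iterator, rather than re-reading it through the task, can be sketched with a toy reference count. Everything here is hypothetical (toy_mm, toy_task, toy_iter and their helpers), single-threaded, and far simpler than the kernel's actual mm reference handling. It only illustrates why the iterator's own pointer stays usable after the task clears its pointer.

```c
#include <stddef.h>

struct toy_mm {
	int users; /* plain counter; the kernel uses atomic refcounts */
};

struct toy_task {
	struct toy_mm *mm; /* cleared when the task exits */
};

struct toy_iter {
	struct toy_task *task;
	struct toy_mm *mm; /* iterator's own reference */
};

/* Take a reference up front instead of dereferencing task->mm later. */
static int toy_iter_begin(struct toy_iter *it, struct toy_task *task)
{
	if (!task->mm)
		return -1; /* task already exited */
	task->mm->users++;
	it->task = task;
	it->mm = task->mm;
	return 0;
}

/* Task exit drops the task's reference and clears its pointer. */
static void toy_task_exit(struct toy_task *task)
{
	struct toy_mm *mm = task->mm;

	task->mm = NULL;
	if (mm)
		mm->users--;
}

/* Drop the iterator's reference when the walk is finished. */
static void toy_iter_end(struct toy_iter *it)
{
	it->mm->users--;
	it->mm = NULL;
}
```

After toy_task_exit() the task's mm pointer is NULL, yet it.mm still points at an object with a non-zero count, so the walk can finish safely before toy_iter_end() releases it.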
2022-12-28  libbpf: fix errno is overwritten after being closed.  Xin Liu
In the ensure_good_fd function, if the fcntl function succeeds but the close function fails, ensure_good_fd returns a normal fd and sets errno, which may cause users to misunderstand. The close failure is not a serious problem, and the correct FD has been handed over to the upper-layer application. Let's restore errno here. Signed-off-by: Xin Liu <liuxin350@huawei.com> Link: https://lore.kernel.org/r/20221223133618.10323-1-liuxin350@huawei.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
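A minimal sketch of the save-and-restore pattern described here, using only POSIX calls. The helper name toy_move_fd_above is hypothetical and this is not libbpf's ensure_good_fd(). It uses F_DUPFD to stay portable; production code often prefers the close-on-exec variant.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Duplicate fd onto a descriptor >= min_fd and close the original.
 * Hypothetical helper loosely modelled on the behaviour described above. */
static int toy_move_fd_above(int fd, int min_fd)
{
	int new_fd, saved_errno;

	new_fd = fcntl(fd, F_DUPFD, min_fd);
	saved_errno = errno;
	close(fd);           /* may fail and overwrite errno */
	errno = saved_errno; /* report only the outcome of fcntl() */
	return new_fd;
}
```

The caller sees errno exactly as fcntl() left it: untouched on success, or the duplication error on failure. A failure of close() on the old descriptor is deliberately not reported.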
2022-12-28  selftests/bpf: Temporarily disable part of btf_dump:var_data test.  Alexei Starovoitov
Commit 7443b296e699 ("x86/percpu: Move cpu_number next to current_task") moved global per_cpu variable 'cpu_number' into pcpu_hot structure. Therefore this part of var_data test is no longer valid. Disable it until better solution is found. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-28  bpf: Fix panic due to wrong pageattr of im->image  Chuang Wang
In the scenario where livepatch and kretfunc coexist, the pageattr of im->image is rox after arch_prepare_bpf_trampoline in bpf_trampoline_update, and then modify_fentry or register_fentry returns -EAGAIN from bpf_tramp_ftrace_ops_func, the BPF_TRAMP_F_ORIG_STACK flag will be configured, and arch_prepare_bpf_trampoline will be re-executed. At this time, because the pageattr of im->image is rox, arch_prepare_bpf_trampoline will read and write im->image, which causes a fault. as follows: insmod livepatch-sample.ko # samples/livepatch/livepatch-sample.c bpftrace -e 'kretfunc:cmdline_proc_show {}' BUG: unable to handle page fault for address: ffffffffa0206000 PGD 322d067 P4D 322d067 PUD 322e063 PMD 1297e067 PTE d428061 Oops: 0003 [#1] PREEMPT SMP PTI CPU: 2 PID: 270 Comm: bpftrace Tainted: G E K 6.1.0 #5 RIP: 0010:arch_prepare_bpf_trampoline+0xed/0x8c0 RSP: 0018:ffffc90001083ad8 EFLAGS: 00010202 RAX: ffffffffa0206000 RBX: 0000000000000020 RCX: 0000000000000000 RDX: ffffffffa0206001 RSI: ffffffffa0206000 RDI: 0000000000000030 RBP: ffffc90001083b70 R08: 0000000000000066 R09: ffff88800f51b400 R10: 000000002e72c6e5 R11: 00000000d0a15080 R12: ffff8880110a68c8 R13: 0000000000000000 R14: ffff88800f51b400 R15: ffffffff814fec10 FS: 00007f87bc0dc780(0000) GS:ffff88803e600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffa0206000 CR3: 0000000010b70000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> bpf_trampoline_update+0x25a/0x6b0 __bpf_trampoline_link_prog+0x101/0x240 bpf_trampoline_link_prog+0x2d/0x50 bpf_tracing_prog_attach+0x24c/0x530 bpf_raw_tp_link_attach+0x73/0x1d0 __sys_bpf+0x100e/0x2570 __x64_sys_bpf+0x1c/0x30 do_syscall_64+0x5b/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd With this patch, when modify_fentry or register_fentry returns -EAGAIN from bpf_tramp_ftrace_ops_func, the pageattr of im->image 
will be reset to nx+rw. Cc: stable@vger.kernel.org Fixes: 00963a2e75a8 ("bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch)") Signed-off-by: Chuang Wang <nashuiliang@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20221224133146.780578-1-nashuiliang@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-28  net/mlx5: Lag, fix failure to cancel delayed bond work  Eli Cohen
Commit 0d4e8ed139d8 ("net/mlx5: Lag, avoid lockdep warnings") accidentally removed a call to cancel delayed bond work thus it may cause queued delay to expire and fall on an already destroyed work queue. Fix by restoring the call cancel_delayed_work_sync() before destroying the workqueue. This prevents call trace such as this: [ 329.230417] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 329.231444] #PF: supervisor write access in kernel mode [ 329.232233] #PF: error_code(0x0002) - not-present page [ 329.233007] PGD 0 P4D 0 [ 329.233476] Oops: 0002 [#1] SMP [ 329.234012] CPU: 5 PID: 145 Comm: kworker/u20:4 Tainted: G OE 6.0.0-rc5_mlnx #1 [ 329.235282] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 329.236868] Workqueue: mlx5_cmd_0000:08:00.1 cmd_work_handler [mlx5_core] [ 329.237886] RIP: 0010:_raw_spin_lock+0xc/0x20 [ 329.238585] Code: f0 0f b1 17 75 02 f3 c3 89 c6 e9 6f 3c 5f ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f 44 00 00 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 02 f3 c3 89 c6 e9 45 3c 5f ff 0f 1f 44 00 00 0f 1f [ 329.241156] RSP: 0018:ffffc900001b0e98 EFLAGS: 00010046 [ 329.241940] RAX: 0000000000000000 RBX: ffffffff82374ae0 RCX: 0000000000000000 [ 329.242954] RDX: 0000000000000001 RSI: 0000000000000014 RDI: 0000000000000000 [ 329.243974] RBP: ffff888106ccf000 R08: ffff8881004000c8 R09: ffff888100400000 [ 329.244990] R10: 0000000000000000 R11: ffffffff826669f8 R12: 0000000000002000 [ 329.246009] R13: 0000000000000005 R14: ffff888100aa7ce0 R15: ffff88852ca80000 [ 329.247030] FS: 0000000000000000(0000) GS:ffff88852ca80000(0000) knlGS:0000000000000000 [ 329.248260] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 329.249111] CR2: 0000000000000000 CR3: 000000016d675001 CR4: 0000000000770ee0 [ 329.250133] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 329.251152] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 329.252176] PKRU: 
55555554 Fixes: 0d4e8ed139d8 ("net/mlx5: Lag, avoid lockdep warnings") Signed-off-by: Eli Cohen <elic@nvidia.com> Reviewed-by: Maor Dickman <maord@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-12-28  net/mlx5e: Set geneve_tlv_option_0_exist when matching on geneve option  Maor Dickman
The cited patch added support for matching on a geneve option by setting the geneve_tlv_option_0_data mask and key, but it didn't set the geneve_tlv_option_0_exist bit, which some HW requires when matching on the geneve_tlv_option_0_data parameter. In some cases this may cause packets to wrongly match rules with a different geneve option. For example, a packet with geneve_tlv_object class=789 and data=456 will wrongly match a rule that matches geneve_tlv_object class=123 and data=456. Fix it by setting the geneve_tlv_option_0_exist bit, when supported by the HW, when matching on the geneve_tlv_option_0_data parameter. Fixes: 9272e3df3023 ("net/mlx5e: Geneve, Add support for encap/decap flows offload") Signed-off-by: Maor Dickman <maord@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-12-28  net/mlx5e: Fix hw mtu initializing at XDP SQ allocation  Adham Faris
The current xdp xmit functions (mlx5e_xmit_xdp_frame_mpwqe or mlx5e_xmit_xdp_frame) validate the xdp packet length by comparing it to the hw mtu (configured at xdp sq allocation) before transmitting it. This check does not account for the ethernet fcs length (calculated and filled in by the nic). Hence, when we try sending packets with length > (hw-mtu - ethernet-fcs-size), the device port drops them and tx_errors_phy is incremented. The desired behavior is for the driver to catch and drop these packets. Fix this in the XDP SQ allocation function (mlx5e_alloc_xdpsq) by subtracting the ethernet FCS size (4 bytes) from the current hw mtu value, since the ethernet FCS is calculated and written to ethernet frames by the nic. Fixes: d8bec2b29a82 ("net/mlx5e: Support bpf_xdp_adjust_head()") Signed-off-by: Adham Faris <afaris@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
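The arithmetic behind the fix is small enough to show directly. The sketch below uses invented names (toy_xdp_max_len, toy_xdp_len_ok) and an example hw mtu value chosen only for illustration; it is not the mlx5e code. It assumes, as the message states, that the NIC appends a 4-byte FCS on its own.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_ETH_FCS_LEN 4u /* Ethernet frame check sequence, in bytes */

/* Largest frame the driver should hand to the NIC, given that the NIC
 * appends the FCS itself. Returns 0 if hw_mtu cannot even hold the FCS. */
static uint32_t toy_xdp_max_len(uint32_t hw_mtu)
{
	return hw_mtu > TOY_ETH_FCS_LEN ? hw_mtu - TOY_ETH_FCS_LEN : 0;
}

/* Driver-side length check applied before queueing a frame. */
static bool toy_xdp_len_ok(uint32_t frame_len, uint32_t hw_mtu)
{
	return frame_len <= toy_xdp_max_len(hw_mtu);
}
```

With an illustrative hw_mtu of 1522, a 1518-byte frame passes and a 1519-byte frame is rejected by the driver instead of being dropped later at the port.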
2022-12-28  net/mlx5e: Always clear dest encap in neigh-update-del  Chris Mi
The cited commit introduced a bug for flows with multiple encapsulations. If one dest encap becomes invalid, the slow path flag is set on the flow. But when other dest encaps become invalid, they are not cleared because the flow already has the slow path flag. When neigh-update-add runs, it will use an invalid encap. Fix it by checking the slow path flag after clearing the dest encap. Fixes: 9a5f9cc794e1 ("net/mlx5e: Fix possible use-after-free deleting fdb rule") Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-12-28net/mlx5e: CT: Fix ct debugfs folder nameChris Mi
sprintf, not sscanf, must be used to build a string. Otherwise dirname is null and neither "ct_nic" nor "ct_fdb" is created. But the distinction is redundant anyway, as the driver could be in switchdev mode but still add nic rules. So use "ct" as the folder name. Fixes: 77422a8f6f61 ("net/mlx5e: CT: Add ct driver counters") Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-12-28net/mlx5e: Fix RX reporter for XSK RQsTariq Toukan
RX reporter mistakenly reads from the regular (inactive) RQ when XSK RQ is active. Fix it here. Fixes: 3db4c85cde7a ("net/mlx5e: xsk: Use queue indices starting from 0 for XSK queues") Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-12-28net/mlx5e: IPoIB, Don't allow CQE compression to be turned on by defaultDragos Tatulea
mlx5e_build_nic_params will turn CQE compression on if the hardware capability is enabled and the slow_pci_heuristic condition is detected. As IPoIB doesn't support CQE compression, make sure to disable the feature in the IPoIB profile init. Please note that the feature is not exposed to the user for IPoIB interfaces, so it can't be subsequently turned on. Fixes: b797a684b0dd ("net/mlx5e: Enable CQE compression when PCI is slower than link") Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-12-28net/mlx5: Fix RoCE setting at HCA levelShay Drory
mlx5 PF can disable RoCE for its VFs and SFs. In such a case RoCE is marked as unsupported on those VFs/SFs. The cited patch added an option to disable (and enable) RoCE at HCA level. However, that commit didn't check whether RoCE is supported on the HCA and allowed the user to try to set RoCE to on. Fix it by checking whether the HCA supports RoCE. Fixes: fbfa97b4d79f ("net/mlx5: Disable roce at HCA level") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-12-28net/mlx5: Avoid recovery in probe flowsShay Drory
Currently, recovery is done without considering whether the device is still in probe flow. This may lead to recovery running before the device has finished probing successfully, e.g. while mlx5_init_one() is running. The recovery flow uses functionality that is loaded only by mlx5_init_one(), and there is no point in running recovery before mlx5_init_one() has finished successfully. Fix it by waiting for probe flow to finish and checking whether the device is probed before trying to perform recovery. Fixes: 51d138c2610a ("net/mlx5: Fix health error state handling") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-12-28net/mlx5: Fix io_eq_size and event_eq_size params validationShay Drory
The io_eq_size and event_eq_size params are of type DEVLINK_PARAM_TYPE_U32, but the validation callback treats them as DEVLINK_PARAM_TYPE_U16. This causes a validation mismatch on big-endian systems, where values in range were rejected while 268500991 was accepted. Fix it by checking the U32 value in the validation callback. Fixes: 0844fa5f7b89 ("net/mlx5: Let user configure io_eq_size param") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
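The big-endian failure mode can be reproduced with a small userspace sketch. Everything here is illustrative: the function names are hypothetical, and the 64..4096 bounds are an assumption made for the example rather than values taken from the commit. The key point is that on a big-endian machine, the u16 member of a union written through its u32 member overlays the two most significant bytes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed limits, for illustration only; the driver's real bounds may differ. */
#define EQ_SIZE_MIN UINT32_C(64)
#define EQ_SIZE_MAX UINT32_C(4096)

/*
 * Emulate what a big-endian CPU sees when a value stored as u32 is read
 * back through a u16 member at the same address: the upper 16 bits.
 */
static inline uint16_t read_as_u16_big_endian(uint32_t v)
{
	return (uint16_t)(v >> 16);
}

/* Buggy validation: range-checks the misread u16 view of the value. */
static inline bool eq_size_valid_buggy(uint32_t v)
{
	uint16_t seen = read_as_u16_big_endian(v);

	return seen >= EQ_SIZE_MIN && seen <= EQ_SIZE_MAX;
}

/* Fixed validation: range-checks the full u32 value. */
static inline bool eq_size_valid_fixed(uint32_t v)
{
	return v >= EQ_SIZE_MIN && v <= EQ_SIZE_MAX;
}
```

Under these assumed bounds, 1024 is misread as 0 and rejected by the buggy check, while 268500991 (0x1000FFFF) is misread as 0x1000 = 4096 and accepted, which matches the behavior the commit describes.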
2022-12-28net/mlx5: Add forgotten cleanup calls into mlx5_init_once() error pathJiri Pirko
There are two cleanup calls missing in the mlx5_init_once() error path. Add them so that the error path flow matches mlx5_cleanup_once(). Fixes: 52ec462eca9b ("net/mlx5: Add reserved-gids support") Fixes: 7c39afb394c7 ("net/mlx5: PTP code migration to driver core section") Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-12-28net/mlx5: E-Switch, properly handle ingress tagged packets on VSTMoshe Shemesh
Fix SRIOV VST mode behavior to insert cvlan when a guest tag is already present in the frame. Previous VST mode behavior was to drop packets or override existing tag, depending on the device version. In this patch we fix this behavior by correctly building the HW steering rule with a push vlan action, or for older devices we ask the FW to stack the vlan when a vlan is already present. Fixes: 07bab9502641 ("net/mlx5: E-Switch, Refactor eswitch ingress acl codes") Fixes: dfcb1ed3c331 ("net/mlx5: E-Switch, Vport ingress/egress ACLs rules for VST mode") Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-12-28nvme-auth: fix smatch warning complaintsSagi Grimberg
When initializing auth context, there may be no secrets passed by the user. Make return code explicit when returning successfully. smatch warnings: drivers/nvme/host/auth.c:950 nvme_auth_init_ctrl() warn: missing error code? 'ret' Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-12-28nvme: consult the CSE log page for unprivileged passthroughChristoph Hellwig
Commands like Write Zeroes can change the contents of a namespace without actually transferring data. To protect against this, check that the Commands Supported and Effects log is supported by the controller for any unprivileged command passthrough, and refuse unprivileged passthrough if the command has any effects that can change data or metadata. Note: While the Commands Supported and Effects log page has only been mandatory since NVMe 2.0, it is widely supported because Windows requires it for any command passthrough from userspace. Fixes: e4fbcf32c860 ("nvme: identify-namespace without CAP_SYS_ADMIN") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
2022-12-28nvme: also return I/O command effects from nvme_command_effectsChristoph Hellwig
To be able to use the Commands Supported and Effects Log for allowing unprivileged passthrough, it needs to be correctly reported for I/O commands as well. Return the I/O command effects from nvme_command_effects, and also add a default list of effects for the NVM command set. For other command sets, the Commands Supported and Effects log is required to be present already. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
2022-12-28nvmet: don't defer passthrough commands with trivial effects to the workqueueChristoph Hellwig
Mask out the "Command Supported" and "Logical Block Content Change" bits and only defer execution of commands that have non-trivial effects to the workqueue for synchronous execution. This allows admin commands to be executed asynchronously on controllers that provide a Commands Supported and Effects log page, and will keep allowing Write commands to be executed asynchronously once command effects on I/O commands are taken into account. Fixes: c1fef73f793b ("nvmet: add passthru code to process commands") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
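The masking step can be sketched as follows. This is a userspace illustration, not the nvmet code: `needs_sync_execution` is a hypothetical name and the macros are local stand-ins. The bit positions follow the NVMe Commands Supported and Effects data structure, where CSUPP is bit 0, LBCC is bit 1, and NCC is bit 2.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the effects bits (positions per the NVMe spec). */
#define CMD_EFFECTS_CSUPP (UINT32_C(1) << 0) /* Command Supported */
#define CMD_EFFECTS_LBCC  (UINT32_C(1) << 1) /* Logical Block Content Change */
#define CMD_EFFECTS_NCC   (UINT32_C(1) << 2) /* Namespace Capability Change */

/*
 * Hypothetical helper: a command only needs deferred, synchronous
 * execution if any effect remains after the two "trivial" bits are
 * masked out.
 */
static inline bool needs_sync_execution(uint32_t effects)
{
	return (effects & ~(CMD_EFFECTS_CSUPP | CMD_EFFECTS_LBCC)) != 0;
}
```

A plain Write reporting CSUPP and LBCC would therefore stay on the asynchronous path, while a command that also reports NCC would be deferred.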
2022-12-28nvmet: set the LBCC bit for commands that modify dataChristoph Hellwig
Write, Write Zeroes, Zone append and a Zone Reset through Zone Management Send modify the logical block content of a namespace, so make sure the LBCC bit is reported for them. Fixes: b5d0b38c0475 ("nvmet: add Command Set Identifier support") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-12-28nvmet: use NVME_CMD_EFFECTS_CSUPP instead of open coding itChristoph Hellwig
Use NVME_CMD_EFFECTS_CSUPP instead of open coding it and assign a single value to multiple array entries instead of repeated assignments. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-12-28nvme: fix the NVME_CMD_EFFECTS_CSE_MASK definitionChristoph Hellwig
3 << 16 does not generate the correct mask for bits 16, 17 and 18. Use the GENMASK macro to generate the correct mask instead. Fixes: 84fef62d135b ("nvme: check admin passthru command effects") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
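The arithmetic behind this fix is easy to check in userspace. The sketch below uses `genmask32`, a hypothetical stand-in for the kernel's GENMASK(h, l) macro (bits l through h set, inclusive); the other function names are also illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for GENMASK(h, l): bits l..h set, inclusive. */
static inline uint32_t genmask32(unsigned int h, unsigned int l)
{
	assert(l <= h && h < 32);
	return (UINT32_MAX >> (31 - h)) & (UINT32_MAX << l);
}

/* Old open-coded mask: 3 is binary 11, so only bits 16 and 17 are set. */
static inline uint32_t cse_mask_old(void)
{
	return UINT32_C(3) << 16;
}

/* Intended mask: the three-bit field occupying bits 16..18. */
static inline uint32_t cse_mask_fixed(void)
{
	return genmask32(18, 16);
}
```

The old expression evaluates to 0x30000 and misses bit 18, whereas the three-bit mask is 0x70000 (equivalently, 7 << 16).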
2022-12-28docs, nvme: add a feature and quirk policy documentChristoph Hellwig
This adds a document about what specification features are supported by the Linux NVMe driver, and what qualifies for a quirk if an implementation has problems following the specification. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Jonathan Corbet <corbet@lwn.net>
2022-12-28ALSA: hda/hdmi: Static PCM mapping again with AMD HDMI codecsTakashi Iwai
The recent code refactoring for HD-audio HDMI codec driver caused a regression on AMD/ATI HDMI codecs; namely, PulseAudio and pipewire don't recognize HDMI outputs any longer while the direct output via ALSA raw access still works. The problem turned out to be that, after the code refactoring, the driver assumes only the dynamic PCM assignment, and when a PCM stream that still isn't assigned to any pin gets opened, the driver tries to assign any free converter to the PCM stream. This behavior is OK for Intel and other codecs, as they have arbitrary connections between pins and converters. OTOH, on AMD chips that have a 1:1 mapping between pins and converters, this may end up blocking the open of the next PCM stream for the pin that is tied to the formerly taken converter. Also, with the code refactoring, more PCM streams are exposed than necessary as we assume all converters can be used, while this isn't true for the AMD case. This may change the PCM stream assignment and confuse users as well. This patch fixes those problems by: - Introducing a flag spec->static_pcm_mapping, and if it's set, the driver applies the static mapping between pins and converters at the probe time - Limiting the number of PCM streams per pin, too; this avoids the superfluous PCM streams Fixes: ef6f5494faf6 ("ALSA: hda/hdmi: Use only dynamic PCM device allocation") Cc: <stable@vger.kernel.org> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216836 Co-developed-by: Jaroslav Kysela <perex@perex.cz> Signed-off-by: Jaroslav Kysela <perex@perex.cz> Link: https://lore.kernel.org/r/20221228125714.16329-1-tiwai@suse.de Signed-off-by: Takashi Iwai <tiwai@suse.de>