summaryrefslogtreecommitdiff
path: root/tools/include/uapi/linux
AgeCommit message (Collapse)Author
2023-12-06bpf: add BPF token support to BPF_MAP_CREATE commandAndrii Nakryiko
Allow providing token_fd for BPF_MAP_CREATE command to allow controlled BPF map creation from unprivileged process through delegated BPF token. Wire through a set of allowed BPF map types to BPF token, derived from BPF FS at BPF token creation time. This, in combination with allowed_cmds allows to create a narrowly-focused BPF token (controlled by privileged agent) with a restrictive set of BPF maps that application can attempt to create. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231130185229.2688956-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06bpf: introduce BPF token objectAndrii Nakryiko
Add new kind of BPF kernel object, BPF token. BPF token is meant to allow delegating privileged BPF functionality, like loading a BPF program or creating a BPF map, from privileged process to a *trusted* unprivileged process, all while having a good amount of control over which privileged operations could be performed using provided BPF token. This is achieved through mounting BPF FS instance with extra delegation mount options, which determine what operations are delegatable, and also constraining it to the owning user namespace (as mentioned in the previous patch). BPF token itself is just a derivative from BPF FS and can be created through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts BPF FS FD, which can be attained through open() API by opening BPF FS mount point. Currently, BPF token "inherits" delegated command, map types, prog type, and attach type bit sets from BPF FS as is. In the future, having an BPF token as a separate object with its own FD, we can allow to further restrict BPF token's allowable set of things either at the creation time or after the fact, allowing the process to guard itself further from unintentionally trying to load undesired kind of BPF programs. But for now we keep things simple and just copy bit sets as is. When BPF token is created from BPF FS mount, we take reference to the BPF super block's owning user namespace, and then use that namespace for checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN} capabilities that are normally only checked against init userns (using capable()), but now we check them using ns_capable() instead (if BPF token is provided). See bpf_token_capable() for details. Such setup means that BPF token in itself is not sufficient to grant BPF functionality. User namespaced process has to *also* have necessary combination of capabilities inside that user namespace. So while previously CAP_BPF was useless when granted within user namespace, now it gains a meaning and allows container managers and sys admins to have a flexible control over which processes can and need to use BPF functionality within the user namespace (i.e., container in practice). And BPF FS delegation mount options and derived BPF tokens serve as a per-container "flag" to grant overall ability to use bpf() (plus further restrict on which parts of bpf() syscalls are treated as namespaced). Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF) within the BPF FS owning user namespace, rounding up the ns_capable() story of BPF token. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231130185229.2688956-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-04netdev-genl: spec: Add PID in netdev netlink YAML specAmritha Nambiar
Add support in netlink spec(netdev.yaml) for PID of the NAPI thread. Add code generated from the spec. Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Link: https://lore.kernel.org/r/170147335301.5260.11872351477120434501.stgit@anambiarhost.jf.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-04netdev-genl: spec: Add irq in netdev netlink YAML specAmritha Nambiar
Add support in netlink spec(netdev.yaml) for interrupt number among the NAPI attributes. Add code generated from the spec. Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Link: https://lore.kernel.org/r/170147334210.5260.18178387869057516983.stgit@anambiarhost.jf.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-04netdev-genl: spec: Extend netdev netlink spec in YAML for NAPIAmritha Nambiar
Add support in netlink spec(netdev.yaml) for napi related information. Add code generated from the spec. Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Link: https://lore.kernel.org/r/170147333119.5260.7050639053080529108.stgit@anambiarhost.jf.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-04netdev-genl: spec: Extend netdev netlink spec in YAML for queueAmritha Nambiar
Add support in netlink spec(netdev.yaml) for queue information. Add code generated from the spec. Note: The "queue-type" attribute takes values 0 and 1 for rx and tx queue type respectively. Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Link: https://lore.kernel.org/r/170147330963.5260.2576294626647300472.stgit@anambiarhost.jf.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-30Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-11-30 We've added 30 non-merge commits during the last 7 day(s) which contain a total of 58 files changed, 1598 insertions(+), 154 deletions(-). The main changes are: 1) Add initial TX metadata implementation for AF_XDP with support in mlx5 and stmmac drivers. Two types of offloads are supported right now, that is, TX timestamp and TX checksum offload, from Stanislav Fomichev with stmmac implementation from Song Yoong Siang. 2) Change BPF verifier logic to validate global subprograms lazily instead of unconditionally before the main program, so they can be guarded using BPF CO-RE techniques, from Andrii Nakryiko. 3) Add BPF link_info support for uprobe multi link along with bpftool integration for the latter, from Jiri Olsa. 4) Use pkg-config in BPF selftests to determine ld flags which is in particular needed for linking statically, from Akihiko Odaki. 5) Fix a few BPF selftest failures to adapt to the upcoming LLVM18, from Yonghong Song. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (30 commits) bpf/tests: Remove duplicate JSGT tests selftests/bpf: Add TX side to xdp_hw_metadata selftests/bpf: Convert xdp_hw_metadata to XDP_USE_NEED_WAKEUP selftests/bpf: Add TX side to xdp_metadata selftests/bpf: Add csum helpers selftests/xsk: Support tx_metadata_len xsk: Add option to calculate TX checksum in SW xsk: Validate xsk_tx_metadata flags xsk: Document tx_metadata_len layout net: stmmac: Add Tx HWTS support to XDP ZC net/mlx5e: Implement AF_XDP TX timestamp and checksum offload tools: ynl: Print xsk-features from the sample xsk: Add TX timestamp and TX checksum offload support xsk: Support tx_metadata_len selftests/bpf: Use pkg-config for libelf selftests/bpf: Override PKG_CONFIG for static builds selftests/bpf: Choose pkg-config for the target bpftool: Add support to display uprobe_multi links selftests/bpf: Add link_info test for uprobe_multi link selftests/bpf: Use bpf_link__destroy in fill_link_info tests ... ==================== Conflicts: Documentation/netlink/specs/netdev.yaml: 839ff60df3ab ("net: page_pool: add nlspec for basic access to page pools") 48eb03dd2630 ("xsk: Add TX timestamp and TX checksum offload support") https://lore.kernel.org/all/20231201094705.1ee3cab8@canb.auug.org.au/ While at it also regen, tree is dirty after: 48eb03dd2630 ("xsk: Add TX timestamp and TX checksum offload support") looks like code wasn't re-rendered after "render-max" was removed. Link: https://lore.kernel.org/r/20231130145708.32573-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-29xsk: Add option to calculate TX checksum in SWStanislav Fomichev
For XDP_COPY mode, add a UMEM option XDP_UMEM_TX_SW_CSUM to call skb_checksum_help in transmit path. Might be useful to debugging issues with real hardware. I also use this mode in the selftests. Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20231127190319.1190813-9-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-29xsk: Add TX timestamp and TX checksum offload supportStanislav Fomichev
This change actually defines the (initial) metadata layout that should be used by AF_XDP userspace (xsk_tx_metadata). The first field is flags which requests appropriate offloads, followed by the offload-specific fields. The supported per-device offloads are exported via netlink (new xsk-flags). The offloads themselves are still implemented in a bit of a framework-y fashion that's left from my initial kfunc attempt. I'm introducing new xsk_tx_metadata_ops which drivers are supposed to implement. The drivers are also supposed to call xsk_tx_metadata_request/xsk_tx_metadata_complete in the right places. Since xsk_tx_metadata_{request,_complete} are static inline, we don't incur any extra overhead doing indirect calls. The benefit of this scheme is as follows: - keeps all metadata layout parsing away from driver code - makes it easy to grep and see which drivers implement what - don't need any extra flags to maintain to keep track of what offloads are implemented; if the callback is implemented - the offload is supported (used by netlink reporting code) Two offloads are defined right now: 1. XDP_TXMD_FLAGS_CHECKSUM: skb-style csum_start+csum_offset 2. XDP_TXMD_FLAGS_TIMESTAMP: writes TX timestamp back into metadata area upon completion (tx_timestamp field) XDP_TXMD_FLAGS_TIMESTAMP is also implemented for XDP_COPY mode: it writes SW timestamp from the skb destructor (note I'm reusing hwtstamps to pass metadata pointer). The struct is forward-compatible and can be extended in the future by appending more fields. Reviewed-by: Song Yoong Siang <yoong.siang.song@intel.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20231127190319.1190813-3-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-29xsk: Support tx_metadata_lenStanislav Fomichev
For zerocopy mode, tx_desc->addr can point to an arbitrary offset and carry some TX metadata in the headroom. For copy mode, there is no way currently to populate skb metadata. Introduce new tx_metadata_len umem config option that indicates how many bytes to treat as metadata. Metadata bytes come prior to tx_desc address (same as in RX case). The size of the metadata has mostly the same constraints as XDP: - less than 256 bytes - 8-byte aligned (compared to 4-byte alignment on xdp, due to 8-byte timestamp in the completion) - non-zero This data is not interpreted in any way right now. Reviewed-by: Song Yoong Siang <yoong.siang.song@intel.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20231127190319.1190813-2-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-28bpf: Add link_info support for uprobe multi linkJiri Olsa
Adding support to get uprobe_link details through bpf_link_info interface. Adding new struct uprobe_multi to struct bpf_link_info to carry the uprobe_multi link details. The uprobe_multi.count is passed from user space to denote size of array fields (offsets/ref_ctr_offsets/cookies). The actual array size is stored back to uprobe_multi.count (allowing user to find out the actual array size) and array fields are populated up to the user passed size. All the non-array fields (path/count/flags/pid) are always set. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20231125193130.834322-4-jolsa@kernel.org
2023-11-28tools: ynl: add sample for getting page-pool informationJakub Kicinski
Regenerate the tools/ code after netdev spec changes. Add sample to query page-pool info in a concise fashion: $ ./page-pool eth0[2] page pools: 10 (zombies: 0) refs: 41984 bytes: 171966464 (refs: 0 bytes: 0) recycling: 90.3% (alloc: 656:397681 recycle: 89652:270201) Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-11-22tools headers UAPI: Update tools's copy of vhost.h headerNamhyung Kim
tldr; Just FYI, I'm carrying this on the perf tools tree. Full explanation: There used to be no copies, with tools/ code using kernel headers directly. From time to time tools/perf/ broke due to legitimate kernel hacking. At some point Linus complained about such direct usage. Then we adopted the current model. The way these headers are used in perf are not restricted to just including them to compile something. There are sometimes used in scripts that convert defines into string tables, etc, so some change may break one of these scripts, or new MSRs may use some different #define pattern, etc. E.g.: $ ls -1 tools/perf/trace/beauty/*.sh | head -5 tools/perf/trace/beauty/arch_errno_names.sh tools/perf/trace/beauty/drm_ioctl.sh tools/perf/trace/beauty/fadvise.sh tools/perf/trace/beauty/fsconfig.sh tools/perf/trace/beauty/fsmount.sh $ $ tools/perf/trace/beauty/fadvise.sh static const char *fadvise_advices[] = { [0] = "NORMAL", [1] = "RANDOM", [2] = "SEQUENTIAL", [3] = "WILLNEED", [4] = "DONTNEED", [5] = "NOREUSE", }; $ The tools/perf/check-headers.sh script, part of the tools/ build process, points out changes in the original files. So its important not to touch the copies in tools/ when doing changes in the original kernel headers, that will be done later, when check-headers.sh inform about the change to the perf tools hackers. Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: kvm@vger.kernel.org Cc: virtualization@lists.linux.dev Cc: netdev@vger.kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20231121225650.390246-5-namhyung@kernel.org
2023-11-22tools headers UAPI: Update tools's copy of mount.h headerNamhyung Kim
tldr; Just FYI, I'm carrying this on the perf tools tree. Full explanation: There used to be no copies, with tools/ code using kernel headers directly. From time to time tools/perf/ broke due to legitimate kernel hacking. At some point Linus complained about such direct usage. Then we adopted the current model. The way these headers are used in perf are not restricted to just including them to compile something. There are sometimes used in scripts that convert defines into string tables, etc, so some change may break one of these scripts, or new MSRs may use some different #define pattern, etc. E.g.: $ ls -1 tools/perf/trace/beauty/*.sh | head -5 tools/perf/trace/beauty/arch_errno_names.sh tools/perf/trace/beauty/drm_ioctl.sh tools/perf/trace/beauty/fadvise.sh tools/perf/trace/beauty/fsconfig.sh tools/perf/trace/beauty/fsmount.sh $ $ tools/perf/trace/beauty/fadvise.sh static const char *fadvise_advices[] = { [0] = "NORMAL", [1] = "RANDOM", [2] = "SEQUENTIAL", [3] = "WILLNEED", [4] = "DONTNEED", [5] = "NOREUSE", }; $ The tools/perf/check-headers.sh script, part of the tools/ build process, points out changes in the original files. So its important not to touch the copies in tools/ when doing changes in the original kernel headers, that will be done later, when check-headers.sh inform about the change to the perf tools hackers. Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20231121225650.390246-4-namhyung@kernel.org
2023-11-22tools headers UAPI: Update tools's copy of kvm.h headerNamhyung Kim
tldr; Just FYI, I'm carrying this on the perf tools tree. Full explanation: There used to be no copies, with tools/ code using kernel headers directly. From time to time tools/perf/ broke due to legitimate kernel hacking. At some point Linus complained about such direct usage. Then we adopted the current model. The way these headers are used in perf are not restricted to just including them to compile something. There are sometimes used in scripts that convert defines into string tables, etc, so some change may break one of these scripts, or new MSRs may use some different #define pattern, etc. E.g.: $ ls -1 tools/perf/trace/beauty/*.sh | head -5 tools/perf/trace/beauty/arch_errno_names.sh tools/perf/trace/beauty/drm_ioctl.sh tools/perf/trace/beauty/fadvise.sh tools/perf/trace/beauty/fsconfig.sh tools/perf/trace/beauty/fsmount.sh $ $ tools/perf/trace/beauty/fadvise.sh static const char *fadvise_advices[] = { [0] = "NORMAL", [1] = "RANDOM", [2] = "SEQUENTIAL", [3] = "WILLNEED", [4] = "DONTNEED", [5] = "NOREUSE", }; $ The tools/perf/check-headers.sh script, part of the tools/ build process, points out changes in the original files. So its important not to touch the copies in tools/ when doing changes in the original kernel headers, that will be done later, when check-headers.sh inform about the change to the perf tools hackers. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: kvm@vger.kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20231121225650.390246-3-namhyung@kernel.org
2023-11-22tools headers UAPI: Update tools's copy of fscrypt.h headerNamhyung Kim
tldr; Just FYI, I'm carrying this on the perf tools tree. Full explanation: There used to be no copies, with tools/ code using kernel headers directly. From time to time tools/perf/ broke due to legitimate kernel hacking. At some point Linus complained about such direct usage. Then we adopted the current model. The way these headers are used in perf are not restricted to just including them to compile something. There are sometimes used in scripts that convert defines into string tables, etc, so some change may break one of these scripts, or new MSRs may use some different #define pattern, etc. E.g.: $ ls -1 tools/perf/trace/beauty/*.sh | head -5 tools/perf/trace/beauty/arch_errno_names.sh tools/perf/trace/beauty/drm_ioctl.sh tools/perf/trace/beauty/fadvise.sh tools/perf/trace/beauty/fsconfig.sh tools/perf/trace/beauty/fsmount.sh $ $ tools/perf/trace/beauty/fadvise.sh static const char *fadvise_advices[] = { [0] = "NORMAL", [1] = "RANDOM", [2] = "SEQUENTIAL", [3] = "WILLNEED", [4] = "DONTNEED", [5] = "NOREUSE", }; $ The tools/perf/check-headers.sh script, part of the tools/ build process, points out changes in the original files. So its important not to touch the copies in tools/ when doing changes in the original kernel headers, that will be done later, when check-headers.sh inform about the change to the perf tools hackers. Cc: Eric Biggers <ebiggers@kernel.org> Cc: "Theodore Y. Ts'o" <tytso@mit.edu> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: linux-fscrypt@vger.kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20231121225650.390246-2-namhyung@kernel.org
2023-11-17bpf: rename BPF_F_TEST_SANITY_STRICT to BPF_F_TEST_REG_INVARIANTSAndrii Nakryiko
Rename verifier internal flag BPF_F_TEST_SANITY_STRICT to more neutral BPF_F_TEST_REG_INVARIANTS. This is a follow up to [0]. A few selftests and veristat need to be adjusted in the same patch as well. [0] https://patchwork.kernel.org/project/netdevbpf/patch/20231112010609.848406-5-andrii@kernel.org/ Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231117171404.225508-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15bpf: add register bounds sanity checks and sanitizationAndrii Nakryiko
Add simple sanity checks that validate well-formed ranges (min <= max) across u64, s64, u32, and s32 ranges. Also for cases when the value is constant (either 64-bit or 32-bit), we validate that ranges and tnums are in agreement. These bounds checks are performed at the end of BPF_ALU/BPF_ALU64 operations, on conditional jumps, and for LDX instructions (where subreg zero/sign extension is probably the most important to check). This covers most of the interesting cases. Also, we validate the sanity of the return register when manually adjusting it for some special helpers. By default, sanity violation will trigger a warning in verifier log and resetting register bounds to "unbounded" ones. But to aid development and debugging, BPF_F_TEST_SANITY_STRICT flag is added, which will trigger hard failure of verification with -EFAULT on register bounds violations. This allows selftests to catch such issues. veristat will also gain a CLI option to enable this behavior. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20231112010609.848406-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-10bpf: Add crosstask check to __bpf_get_stackJordan Rome
Currently get_perf_callchain only supports user stack walking for the current task. Passing the correct *crosstask* param will return 0 frames if the task passed to __bpf_get_stack isn't the current one instead of a single incorrect frame/address. This change passes the correct *crosstask* param but also does a preemptive check in __bpf_get_stack if the task is current and returns -EOPNOTSUPP if it is not. This issue was found using bpf_get_task_stack inside a BPF iterator ("iter/task"), which iterates over all tasks. bpf_get_task_stack works fine for fetching kernel stacks but because get_perf_callchain relies on the caller to know if the requested *task* is the current one (via *crosstask*) it was failing in a confusing way. It might be possible to get user stacks for all tasks utilizing something like access_process_vm but that requires the bpf program calling bpf_get_task_stack to be sleepable and would therefore be a breaking change. Fixes: fa28dcb82a38 ("bpf: Introduce helper bpf_get_task_stack()") Signed-off-by: Jordan Rome <jordalgo@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231108112334.3433136-1-jordalgo@meta.com
2023-11-09bpf: Use named fields for certain bpf uapi structsYonghong Song
Martin and Vadim reported a verifier failure with bpf_dynptr usage. The issue is mentioned but Vadim workarounded the issue with source change ([1]). The below describes what is the issue and why there is a verification failure. int BPF_PROG(skb_crypto_setup) { struct bpf_dynptr algo, key; ... bpf_dynptr_from_mem(..., ..., 0, &algo); ... } The bpf program is using vmlinux.h, so we have the following definition in vmlinux.h: struct bpf_dynptr { long: 64; long: 64; }; Note that in uapi header bpf.h, we have struct bpf_dynptr { long: 64; long: 64; } __attribute__((aligned(8))); So we lost alignment information for struct bpf_dynptr by using vmlinux.h. Let us take a look at a simple program below: $ cat align.c typedef unsigned long long __u64; struct bpf_dynptr_no_align { __u64 :64; __u64 :64; }; struct bpf_dynptr_yes_align { __u64 :64; __u64 :64; } __attribute__((aligned(8))); void bar(void *, void *); int foo() { struct bpf_dynptr_no_align a; struct bpf_dynptr_yes_align b; bar(&a, &b); return 0; } $ clang --target=bpf -O2 -S -emit-llvm align.c Look at the generated IR file align.ll: ... %a = alloca %struct.bpf_dynptr_no_align, align 1 %b = alloca %struct.bpf_dynptr_yes_align, align 8 ... The compiler dictates the alignment for struct bpf_dynptr_no_align is 1 and the alignment for struct bpf_dynptr_yes_align is 8. So theoretically compiler could allocate variable %a with alignment 1 although in reallity the compiler may choose a different alignment by considering other local variables. In [1], the verification failure happens because variable 'algo' is allocated on the stack with alignment 4 (fp-28). But the verifer wants its alignment to be 8. To fix the issue, the RFC patch ([1]) tried to add '__attribute__((aligned(8)))' to struct bpf_dynptr plus other similar structs. Andrii suggested that we could directly modify uapi struct with named fields like struct 'bpf_iter_num': struct bpf_iter_num { /* opaque iterator state; having __u64 here allows to preserve correct * alignment requirements in vmlinux.h, generated from BTF */ __u64 __opaque[1]; } __attribute__((aligned(8))); Indeed, adding named fields for those affected structs in this patch can preserve alignment when bpf program references them in vmlinux.h. With this patch, the verification failure in [1] can also be resolved. [1] https://lore.kernel.org/bpf/1b100f73-7625-4c1f-3ae5-50ecf84d3ff0@linux.dev/ [2] https://lore.kernel.org/bpf/20231103055218.2395034-1-yonghong.song@linux.dev/ Cc: Vadim Fedorenko <vadfed@meta.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231104024900.1539182-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-06tools headers UAPI: Sync include/uapi/linux/perf_event.h header with the kernelKan Liang
Sync the new sample type for the branch counters feature. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Tinghao Zhang <tinghao.zhang@intel.com> Link: https://lore.kernel.org/r/20231025201626.3000228-6-kan.liang@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-11-02Merge tag 'mm-stable-2023-11-01-14-33' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Kemeng Shi has contributed some compation maintenance work in the series 'Fixes and cleanups to compaction' - Joel Fernandes has a patchset ('Optimize mremap during mutual alignment within PMD') which fixes an obscure issue with mremap()'s pagetable handling during a subsequent exec(), based upon an implementation which Linus suggested - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the following patch series: mm/damon: misc fixups for documents, comments and its tracepoint mm/damon: add a tracepoint for damos apply target regions mm/damon: provide pseudo-moving sum based access rate mm/damon: implement DAMOS apply intervals mm/damon/core-test: Fix memory leaks in core-test mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval - In the series 'Do not try to access unaccepted memory' Adrian Hunter provides some fixups for the recently-added 'unaccepted memory' feature. To increase the feature's checking coverage. 'Plug a few gaps where RAM is exposed without checking if it is unaccepted memory' - In the series 'cleanups for lockless slab shrink' Qi Zheng has done some maintenance work which is preparation for the lockless slab shrinking code - Qi Zheng has redone the earlier (and reverted) attempt to make slab shrinking lockless in the series 'use refcount+RCU method to implement lockless slab shrink' - David Hildenbrand contributes some maintenance work for the rmap code in the series 'Anon rmap cleanups' - Kefeng Wang does more folio conversions and some maintenance work in the migration code. Series 'mm: migrate: more folio conversion and unification' - Matthew Wilcox has fixed an issue in the buffer_head code which was causing long stalls under some heavy memory/IO loads. Some cleanups were added on the way. Series 'Add and use bdev_getblk()' - In the series 'Use nth_page() in place of direct struct page manipulation' Zi Yan has fixed a potential issue with the direct manipulation of hugetlb page frames - In the series 'mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO' has improved our handling of gigantic pages in the hugetlb vmmemmep optimizaton code. This provides significant boot time improvements when significant amounts of gigantic pages are in use - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code rationalization and folio conversions in the hugetlb code - Yin Fengwei has improved mlock()'s handling of large folios in the series 'support large folio for mlock' - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has added statistics for memcg v1 users which are available (and useful) under memcg v2 - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable) prctl so that userspace may direct the kernel to not automatically propagate the denial to child processes. The series is named 'MDWE without inheritance' - Kefeng Wang has provided the series 'mm: convert numa balancing functions to use a folio' which does what it says - In the series 'mm/ksm: add fork-exec support for prctl' Stefan Roesch makes is possible for a process to propagate KSM treatment across exec() - Huang Ying has enhanced memory tiering's calculation of memory distances. This is used to permit the dax/kmem driver to use 'high bandwidth memory' in addition to Optane Data Center Persistent Memory Modules (DCPMM). The series is named 'memory tiering: calculate abstract distance based on ACPI HMAT' - In the series 'Smart scanning mode for KSM' Stefan Roesch has optimized KSM by teaching it to retain and use some historical information from previous scans - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the series 'mm: memcg: fix tracking of pending stats updates values' - In the series 'Implement IOCTL to get and optionally clear info about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits us to atomically read-then-clear page softdirty state. This is mainly used by CRIU - Hugh Dickins contributed the series 'shmem,tmpfs: general maintenance', a bunch of relatively minor maintenance tweaks to this code - Matthew Wilcox has increased the use of the VMA lock over file-backed page faults in the series 'Handle more faults under the VMA lock'. Some rationalizations of the fault path became possible as a result - In the series 'mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()' David Hildenbrand has implemented some cleanups and folio conversions - In the series 'various improvements to the GUP interface' Lorenzo Stoakes has simplified and improved the GUP interface with an eye to providing groundwork for future improvements - Andrey Konovalov has sent along the series 'kasan: assorted fixes and improvements' which does those things - Some page allocator maintenance work from Kemeng Shi in the series 'Two minor cleanups to break_down_buddy_pages' - In thes series 'New selftest for mm' Breno Leitao has developed another MM self test which tickles a race we had between madvise() and page faults - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups and an optimization to the core pagecache code - Nhat Pham has added memcg accounting for hugetlb memory in the series 'hugetlb memcg accounting' - Cleanups and rationalizations to the pagemap code from Lorenzo Stoakes, in the series 'Abstract vma_merge() and split_vma()' - Audra Mitchell has fixed issues in the procfs page_owner code's new timestamping feature which was causing some misbehaviours. In the series 'Fix page_owner's use of free timestamps' - Lorenzo Stoakes has fixed the handling of new mappings of sealed files in the series 'permit write-sealed memfd read-only shared mappings' - Mike Kravetz has optimized the hugetlb vmemmap optimization in the series 'Batch hugetlb vmemmap modification operations' - Some buffer_head folio conversions and cleanups from Matthew Wilcox in the series 'Finish the create_empty_buffers() transition' - As a page allocator performance optimization Huang Ying has added automatic tuning to the allocator's per-cpu-pages feature, in the series 'mm: PCP high auto-tuning' - Roman Gushchin has contributed the patchset 'mm: improve performance of accounted kernel memory allocations' which improves their performance by ~30% as measured by a micro-benchmark - folio conversions from Kefeng Wang in the series 'mm: convert page cpupid functions to folios' - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about kmemleak' - Qi Zheng has improved our handling of memoryless nodes by keeping them off the allocation fallback list. This is done in the series 'handle memoryless nodes more appropriately' - khugepaged conversions from Vishal Moola in the series 'Some khugepaged folio conversions'" [ bcachefs conflicts with the dynamically allocated shrinkers have been resolved as per Stephen Rothwell in https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/ with help from Qi Zheng. The clone3 test filtering conflict was half-arsed by yours truly ] * tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits) mm/damon/sysfs: update monitoring target regions for online input commit mm/damon/sysfs: remove requested targets when online-commit inputs selftests: add a sanity check for zswap Documentation: maple_tree: fix word spelling error mm/vmalloc: fix the unchecked dereference warning in vread_iter() zswap: export compression failure stats Documentation: ubsan: drop "the" from article title mempolicy: migration attempt to match interleave nodes mempolicy: mmap_lock is not needed while migrating folios mempolicy: alloc_pages_mpol() for NUMA policy without vma mm: add page_rmappable_folio() wrapper mempolicy: remove confusing MPOL_MF_LAZY dead code mempolicy: mpol_shared_policy_init() without pseudo-vma mempolicy trivia: use pgoff_t in shared mempolicy tree mempolicy trivia: slightly more consistent naming mempolicy trivia: delete those ancient pr_debug()s mempolicy: fix migrate_pages(2) syscall return nr_failed kernfs: drop shared NUMA mempolicy hooks hugetlbfs: drop shared NUMA mempolicy pretence mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets() ...
2023-11-01Merge tag 'for-6.7/io_uring-sockopt-2023-10-30' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring {get,set}sockopt support from Jens Axboe: "This adds support for using getsockopt and setsockopt via io_uring. The main use cases for this is to enable use of direct descriptors, rather than first instantiating a normal file descriptor, doing the option tweaking needed, then turning it into a direct descriptor. With this support, we can avoid needing a regular file descriptor completely. The net and bpf bits have been signed off on their side" * tag 'for-6.7/io_uring-sockopt-2023-10-30' of git://git.kernel.dk/linux: selftests/bpf/sockopt: Add io_uring support io_uring/cmd: Introduce SOCKET_URING_OP_SETSOCKOPT io_uring/cmd: Introduce SOCKET_URING_OP_GETSOCKOPT io_uring/cmd: return -EOPNOTSUPP if net is disabled selftests/net: Extract uring helpers to be reusable tools headers: Grab copy of io_uring.h io_uring/cmd: Pass compat mode in issue_flags net/socket: Break down __sys_getsockopt net/socket: Break down __sys_setsockopt bpf: Add sockptr support for setsockopt bpf: Add sockptr support for getsockopt
2023-10-24tools: Sync if_link uapi headerDaniel Borkmann
Sync if_link uapi header to the latest version as we need the refresher in tooling for netkit device. Given it's been a while since the last sync and the diff is fairly big, it has been done as its own commit. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20231024214904.29825-3-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-24netkit, bpf: Add bpf programmable net deviceDaniel Borkmann
This work adds a new, minimal BPF-programmable device called "netkit" (former PoC code-name "meta") we recently presented at LSF/MM/BPF. The core idea is that BPF programs are executed within the drivers xmit routine and therefore e.g. in case of containers/Pods moving BPF processing closer to the source. One of the goals was that in case of Pod egress traffic, this allows to move BPF programs from hostns tcx ingress into the device itself, providing earlier drop or forward mechanisms, for example, if the BPF program determines that the skb must be sent out of the node, then a redirect to the physical device can take place directly without going through per-CPU backlog queue. This helps to shift processing for such traffic from softirq to process context, leading to better scheduling decisions/performance (see measurements in the slides). In this initial version, the netkit device ships as a pair, but we plan to extend this further so it can also operate in single device mode. The pair comes with a primary and a peer device. Only the primary device, typically residing in hostns, can manage BPF programs for itself and its peer. The peer device is designated for containers/Pods and cannot attach/detach BPF programs. Upon the device creation, the user can set the default policy to 'pass' or 'drop' for the case when no BPF program is attached. Additionally, the device can be operated in L3 (default) or L2 mode. The management of BPF programs is done via bpf_mprog, so that multi-attach is supported right from the beginning with similar API and dependency controls as tcx. For details on the latter see commit 053c8e1f235d ("bpf: Add generic attach/detach/query API for multi-progs"). tc BPF compatibility is provided, so that existing programs can be easily migrated. Going forward, we plan to use netkit devices in Cilium as the main device type for connecting Pods. They will be operated in L3 mode in order to simplify a Pod's neighbor management and the peer will operate in default drop mode, so that no traffic is leaving between the time when a Pod is brought up by the CNI plugin and programs attached by the agent. Additionally, the programs we attach via tcx on the physical devices are using bpf_redirect_peer() for inbound traffic into netkit device, hence the latter is also supporting the ndo_get_peer_dev callback. Similarly, we use bpf_redirect_neigh() for the way out, pushing from netkit peer to phys device directly. Also, BIG TCP is supported on netkit device. For the follow-up work in single device mode, we plan to convert Cilium's cilium_host/_net devices into a single one. An extensive test suite for checking device operations and the BPF program and link management API comes as BPF selftests in this series. Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Stanislav Fomichev <sdf@google.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://github.com/borkmann/iproute2/tree/pr/netkit Link: http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf (24ff.) Link: https://lore.kernel.org/r/20231024214904.29825-2-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-19tools headers: Grab copy of io_uring.hBreno Leitao
This file will be used by mini_uring.h and allow tests to run without the need of installing liburing to run the tests. This is needed to run io_uring tests in BPF, such as (tools/testing/selftests/bpf/prog_tests/sockopt.c). Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20231016134750.1381153-7-leitao@debian.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-18tools headers UAPI: update linux/fs.h with the kernel sourcesMuhammad Usama Anjum
New IOCTL and macros has been added in the kernel sources. Update the tools header file as well. Link: https://lkml.kernel.org/r/20230821141518.870589-5-usama.anjum@collabora.com Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrei Vagin <avagin@gmail.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Gustavo A. R. Silva <gustavoars@kernel.org> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Miroslaw <emmir@google.com> Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Paul Gofman <pgofman@codeweavers.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Yun Zhou <yun.zhou@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-16Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-10-16 We've added 90 non-merge commits during the last 25 day(s) which contain a total of 120 files changed, 3519 insertions(+), 895 deletions(-). The main changes are: 1) Add missed stats for kprobes to retrieve the number of missed kprobe executions and subsequent executions of BPF programs, from Jiri Olsa. 2) Add cgroup BPF sockaddr hooks for unix sockets. The use case is for systemd to reimplement the LogNamespace feature which allows running multiple instances of systemd-journald to process the logs of different services, from Daan De Meyer. 3) Implement BPF CPUv4 support for s390x BPF JIT, from Ilya Leoshkevich. 4) Improve BPF verifier log output for scalar registers to better disambiguate their internal state wrt defaults vs min/max values matching, from Andrii Nakryiko. 5) Extend the BPF fib lookup helpers for IPv4/IPv6 to support retrieving the source IP address with a new BPF_FIB_LOOKUP_SRC flag, from Martynas Pumputis. 6) Add support for open-coded task_vma iterator to help with symbolization for BPF-collected user stacks, from Dave Marchevsky. 7) Add libbpf getters for accessing individual BPF ring buffers which is useful for polling them individually, for example, from Martin Kelly. 8) Extend AF_XDP selftests to validate the SHARED_UMEM feature, from Tushar Vyavahare. 9) Improve BPF selftests cross-building support for riscv arch, from Björn Töpel. 10) Add the ability to pin a BPF timer to the same calling CPU, from David Vernet. 11) Fix libbpf's bpf_tracing.h macros for riscv to use the generic implementation of PT_REGS_SYSCALL_REGS() to access syscall arguments, from Alexandre Ghiti. 12) Extend libbpf to support symbol versioning for uprobes, from Hengqi Chen. 13) Fix bpftool's skeleton code generation to guarantee that ELF data is 8 byte aligned, from Ian Rogers. 14) Inherit system-wide cpu_mitigations_off() setting for Spectre v1/v4 security mitigations in BPF verifier, from Yafang Shao. 15) Annotate struct bpf_stack_map with __counted_by attribute to prepare BPF side for upcoming __counted_by compiler support, from Kees Cook. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (90 commits) bpf: Ensure proper register state printing for cond jumps bpf: Disambiguate SCALAR register state output in verifier logs selftests/bpf: Make align selftests more robust selftests/bpf: Improve missed_kprobe_recursion test robustness selftests/bpf: Improve percpu_alloc test robustness selftests/bpf: Add tests for open-coded task_vma iter bpf: Introduce task_vma open-coded iterator kfuncs selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c bpf: Don't explicitly emit BTF for struct btf_iter_num bpf: Change syscall_nr type to int in struct syscall_tp_t net/bpf: Avoid unused "sin_addr_len" warning when CONFIG_CGROUP_BPF is not set bpf: Avoid unnecessary audit log for CPU security mitigations selftests/bpf: Add tests for cgroup unix socket address hooks selftests/bpf: Make sure mount directory exists documentation/bpf: Document cgroup unix socket address hooks bpftool: Add support for cgroup unix socket address hooks libbpf: Add support for cgroup unix socket address hooks bpf: Implement cgroup sockaddr hooks for unix sockets bpf: Add bpf_sock_addr_set_sun_path() to allow writing unix sockaddr from bpf bpf: Propagate modified uaddrlen from cgroup sockaddr programs ... ==================== Link: https://lore.kernel.org/r/20231016204803.30153-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-11bpf: Implement cgroup sockaddr hooks for unix socketsDaan De Meyer
These hooks allows intercepting connect(), getsockname(), getpeername(), sendmsg() and recvmsg() for unix sockets. The unix socket hooks get write access to the address length because the address length is not fixed when dealing with unix sockets and needs to be modified when a unix socket address is modified by the hook. Because abstract socket unix addresses start with a NUL byte, we cannot recalculate the socket address in kernelspace after running the hook by calculating the length of the unix socket path using strlen(). These hooks can be used when users want to multiplex syscall to a single unix socket to multiple different processes behind the scenes by redirecting the connect() and other syscalls to process specific sockets. We do not implement support for intercepting bind() because when using bind() with unix sockets with a pathname address, this creates an inode in the filesystem which must be cleaned up. If we rewrite the address, the user might try to clean up the wrong file, leaking the socket in the filesystem where it is never cleaned up. Until we figure out a solution for this (and a use case for intercepting bind()), we opt to not allow rewriting the sockaddr in bind() calls. We also implement recvmsg() support for connected streams so that after a connect() that is modified by a sockaddr hook, any corresponding recmvsg() on the connected socket can also be modified to make the connected program think it is connected to the "intended" remote. Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com> Link: https://lore.kernel.org/r/20231011185113.140426-5-daan.j.demeyer@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-09bpf: Derive source IP addr via bpf_*_fib_lookup()Martynas Pumputis
Extend the bpf_fib_lookup() helper by making it to return the source IPv4/IPv6 address if the BPF_FIB_LOOKUP_SRC flag is set. For example, the following snippet can be used to derive the desired source IP address: struct bpf_fib_lookup p = { .ipv4_dst = ip4->daddr }; ret = bpf_skb_fib_lookup(skb, p, sizeof(p), BPF_FIB_LOOKUP_SRC | BPF_FIB_LOOKUP_SKIP_NEIGH); if (ret != BPF_FIB_LKUP_RET_SUCCESS) return TC_ACT_SHOT; /* the p.ipv4_src now contains the source address */ The inability to derive the proper source address may cause malfunctions in BPF-based dataplanes for hosts containing netdevs with more than one routable IP address or for multi-homed hosts. For example, Cilium implements packet masquerading in BPF. If an egressing netdev to which the Cilium's BPF prog is attached has multiple IP addresses, then only one [hardcoded] IP address can be used for masquerading. This breaks connectivity if any other IP address should have been selected instead, for example, when a public and private addresses are attached to the same egress interface. The change was tested with Cilium [1]. Nikolay Aleksandrov helped to figure out the IPv6 addr selection. [1]: https://github.com/cilium/cilium/pull/28283 Signed-off-by: Martynas Pumputis <m@lambda.lt> Link: https://lore.kernel.org/r/20231007081415.33502-2-m@lambda.lt Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-09bpf: Add ability to pin bpf timer to calling CPUDavid Vernet
BPF supports creating high resolution timers using bpf_timer_* helper functions. Currently, only the BPF_F_TIMER_ABS flag is supported, which specifies that the timeout should be interpreted as absolute time. It would also be useful to be able to pin that timer to a core. For example, if you wanted to make a subset of cores run without timer interrupts, and only have the timer be invoked on a single core. This patch adds support for this with a new BPF_F_TIMER_CPU_PIN flag. When specified, the HRTIMER_MODE_PINNED flag is passed to hrtimer_start(). A subsequent patch will update selftests to validate. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <song@kernel.org> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20231004162339.200702-2-void@manifault.com
2023-10-06mm: add a NO_INHERIT flag to the PR_SET_MDWE prctlFlorent Revest
This extends the current PR_SET_MDWE prctl arg with a bit to indicate that the process doesn't want MDWE protection to propagate to children. To implement this no-inherit mode, the tag in current->mm->flags must be absent from MMF_INIT_MASK. This means that the encoding for "MDWE but without inherit" is different in the prctl than in the mm flags. This leads to a bit of bit-mangling in the prctl implementation. Link: https://lkml.kernel.org/r/20230828150858.393570-6-revest@chromium.org Signed-off-by: Florent Revest <revest@chromium.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Alexey Izbyshev <izbyshev@ispras.ru> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ayush Jain <ayush.jain3@amd.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Joey Gouly <joey.gouly@arm.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Szabolcs Nagy <Szabolcs.Nagy@arm.com> Cc: Topi Miettinen <toiwoton@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-06mm: make PR_MDWE_REFUSE_EXEC_GAIN an unsigned longFlorent Revest
Defining a prctl flag as an int is a footgun because on a 64 bit machine and with a variadic implementation of prctl (like in musl and glibc), when used directly as a prctl argument, it can get casted to long with garbage upper bits which would result in unexpected behaviors. This patch changes the constant to an unsigned long to eliminate that possibilities. This does not break UAPI. I think that a stable backport would be "nice to have": to reduce the chances that users build binaries that could end up with garbage bits in their MDWE prctl arguments. We are not aware of anyone having yet encountered this corner case with MDWE prctls but a backport would reduce the likelihood it happens, since this sort of issues has happened with other prctls. But If this is perceived as a backporting burden, I suppose we could also live without a stable backport. Link: https://lkml.kernel.org/r/20230828150858.393570-5-revest@chromium.org Fixes: b507808ebce2 ("mm: implement memory-deny-write-execute as a prctl") Signed-off-by: Florent Revest <revest@chromium.org> Suggested-by: Alexey Izbyshev <izbyshev@ispras.ru> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ayush Jain <ayush.jain3@amd.com> Cc: Greg Thelen <gthelen@google.com> Cc: Joey Gouly <joey.gouly@arm.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Szabolcs Nagy <Szabolcs.Nagy@arm.com> Cc: Topi Miettinen <toiwoton@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts (or adjacent changes of note). Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-09-26Merge tag 'perf-tools-fixes-for-v6.6-1-2023-09-25' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools Pull perf tools fixes from Namhyung Kim: "Build: - Update header files in the tools/**/include directory to sync with the kernel sources as usual. - Remove unused bpf-prologue files. While it's not strictly a fix, but the functionality was removed in this cycle so better to get rid of the code together. - Other minor build fixes. Misc: - Fix uninitialized memory access in PMU parsing code - Fix segfaults on software event" * tag 'perf-tools-fixes-for-v6.6-1-2023-09-25' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: perf jevent: fix core dump on software events on s390 perf pmu: Ensure all alias variables are initialized perf jevents metric: Fix type of strcmp_cpuid_str perf trace: Avoid compile error wrt redefining bool perf bpf-prologue: Remove unused file tools headers UAPI: Update tools's copy of drm.h headers tools arch x86: Sync the msr-index.h copy with the kernel sources perf bench sched-seccomp-notify: Use the tools copy of seccomp.h UAPI tools headers UAPI: Copy seccomp.h to be able to build 'perf bench' in older systems tools headers UAPI: Sync files changed by new fchmodat2 and map_shadow_stack syscalls with the kernel sources perf tools: Update copy of libbpf's hashmap.c
2023-09-25bpf: Add missed value to kprobe perf link infoJiri Olsa
Add missed value to kprobe attached through perf link info to hold the stats of missed kprobe handler execution. The kprobe's missed counter gets incremented when kprobe handler is not executed due to another kprobe running on the same cpu. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230920213145.1941596-4-jolsa@kernel.org
2023-09-25bpf: Add missed value to kprobe_multi link infoJiri Olsa
Add missed value to kprobe_multi link info to hold the stats of missed kprobe_multi probe. The missed counter gets incremented when fprobe fails the recursion check or there's no rethook available for return probe. In either case the attached bpf program is not executed. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Song Liu <song@kernel.org> Reviewed-by: Song Liu <song@kernel.org> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230920213145.1941596-3-jolsa@kernel.org
2023-09-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni
Cross-merge networking fixes after downstream PR. No conflicts. Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-09-15bpf: expose information about supported xdp metadata kfuncStanislav Fomichev
Add new xdp-rx-metadata-features member to netdev netlink which exports a bitmask of supported kfuncs. Most of the patch is autogenerated (headers), the only relevant part is netdev.yaml and the changes in netdev-genl.c to marshal into netlink. Example output on veth: $ ip link add veth0 type veth peer name veth1 # ifndex == 12 $ ./tools/net/ynl/samples/netdev 12 Select ifc ($ifindex; or 0 = dump; or -2 ntf check): 12 veth1[12] xdp-features (23): basic redirect rx-sg xdp-rx-metadata-features (3): timestamp hash xdp-zc-max-segs=0 Cc: netdev@vger.kernel.org Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230913171350.369987-3-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-09-13tools headers UAPI: Copy seccomp.h to be able to build 'perf bench' in older ↵Arnaldo Carvalho de Melo
systems The new 'perf bench' for sched-seccomp-notify uses defines and types not available in older systems where we want to have perf available, so grab a copy of this UAPI from the kernel sources to allow that. This will be checked in the future for drift from the original when we build the perf tool, that will warn when that happens like: make: Entering directory '/var/home/acme/git/perf-tools/tools/perf' BUILD: Doing 'make -j32' parallel build Warning: Kernel ABI header differences: Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andrei Vagin <avagin@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kees Kook <keescook@chromium.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/lkml/ZQGhMXtwX7RvV3ya@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-09-11bpf: Clarify error expectations from bpf_clone_redirectStanislav Fomichev
Commit 151e887d8ff9 ("veth: Fixing transmit return status for dropped packets") exposed the fact that bpf_clone_redirect is capable of returning raw NET_XMIT_XXX return codes. This is in the conflict with its UAPI doc which says the following: "0 on success, or a negative error in case of failure." Update the UAPI to reflect the fact that bpf_clone_redirect can return positive error numbers, but don't explicitly define their meaning. Reported-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230911194731.286342-1-sdf@google.com
2023-09-08bpf: Mark BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE deprecatedYonghong Song
Now 'BPF_MAP_TYPE_CGRP_STORAGE + local percpu ptr' can cover all BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE functionality and more. So mark BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE deprecated. Also make changes in selftests/bpf/test_bpftool_synctypes.py and selftest libbpf_str to fix otherwise test errors. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152837.2003563-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-21bpf: Add pid filter support for uprobe_multi linkJiri Olsa
Adding support to specify pid for uprobe_multi link and the uprobes are created only for task with given pid value. Using the consumer.filter filter callback for that, so the task gets filtered during the uprobe installation. We still need to check the task during runtime in the uprobe handler, because the handler could get executed if there's another system wide consumer on the same uprobe (thanks Oleg for the insight). Cc: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230809083440.3209381-6-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-21bpf: Add cookies support for uprobe_multi linkJiri Olsa
Adding support to specify cookies array for uprobe_multi link. The cookies array share indexes and length with other uprobe_multi arrays (offsets/ref_ctr_offsets). The cookies[i] value defines cookie for i-the uprobe and will be returned by bpf_get_attach_cookie helper when called from ebpf program hooked to that specific uprobe. Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230809083440.3209381-5-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-21bpf: Add multi uprobe linkJiri Olsa
Adding new multi uprobe link that allows to attach bpf program to multiple uprobes. Uprobes to attach are specified via new link_create uprobe_multi union: struct { __aligned_u64 path; __aligned_u64 offsets; __aligned_u64 ref_ctr_offsets; __u32 cnt; __u32 flags; } uprobe_multi; Uprobes are defined for single binary specified in path and multiple calling sites specified in offsets array with optional reference counters specified in ref_ctr_offsets array. All specified arrays have length of 'cnt'. The 'flags' supports single bit for now that marks the uprobe as return probe. Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230809083440.3209381-4-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-21bpf: Switch BPF_F_KPROBE_MULTI_RETURN macro to enumJiri Olsa
Switching BPF_F_KPROBE_MULTI_RETURN macro to anonymous enum, so it'd show up in vmlinux.h. There's not functional change compared to having this as macro. Acked-by: Yafang Shao <laoar.shao@gmail.com> Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230809083440.3209381-2-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-07bpf: Add support for bpf_get_func_ip helper for uprobe programJiri Olsa
Adding support for bpf_get_func_ip helper for uprobe program to return probed address for both uprobe and return uprobe. We discussed this in [1] and agreed that uprobe can have special use of bpf_get_func_ip helper that differs from kprobe. The kprobe bpf_get_func_ip returns: - address of the function if probe is attach on function entry for both kprobe and return kprobe - 0 if the probe is not attach on function entry The uprobe bpf_get_func_ip returns: - address of the probe for both uprobe and return uprobe The reason for this semantic change is that kernel can't really tell if the probe user space address is function entry. The uprobe program is actually kprobe type program attached as uprobe. One of the consequences of this design is that uprobes do not have its own set of helpers, but share them with kprobes. As we need different functionality for bpf_get_func_ip helper for uprobe, I'm adding the bool value to the bpf_trace_run_ctx, so the helper can detect that it's executed in uprobe context and call specific code. The is_uprobe bool is set as true in bpf_prog_run_array_sleepable, which is currently used only for executing bpf programs in uprobe. Renaming bpf_prog_run_array_sleepable to bpf_prog_run_array_uprobe to address that it's only used for uprobes and that it sets the run_ctx.is_uprobe as suggested by Yafang Shao. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Alan Maguire <alan.maguire@oracle.com> [1] https://lore.kernel.org/bpf/CAEf4BzZ=xLVkG5eurEuvLU79wAMtwho7ReR+XJAgwhFF4M-7Cg@mail.gmail.com/ Signed-off-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Viktor Malik <vmalik@redhat.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230807085956.2344866-2-jolsa@kernel.org Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-03Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Martin KaFai Lau says: ==================== pull-request: bpf-next 2023-08-03 We've added 54 non-merge commits during the last 10 day(s) which contain a total of 84 files changed, 4026 insertions(+), 562 deletions(-). The main changes are: 1) Add SO_REUSEPORT support for TC bpf_sk_assign from Lorenz Bauer, Daniel Borkmann 2) Support new insns from cpu v4 from Yonghong Song 3) Non-atomically allocate freelist during prefill from YiFei Zhu 4) Support defragmenting IPv(4|6) packets in BPF from Daniel Xu 5) Add tracepoint to xdp attaching failure from Leon Hwang 6) struct netdev_rx_queue and xdp.h reshuffling to reduce rebuild time from Jakub Kicinski * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (54 commits) net: invert the netdevice.h vs xdp.h dependency net: move struct netdev_rx_queue out of netdevice.h eth: add missing xdp.h includes in drivers selftests/bpf: Add testcase for xdp attaching failure tracepoint bpf, xdp: Add tracepoint to xdp attaching failure selftests/bpf: fix static assert compilation issue for test_cls_*.c bpf: fix bpf_probe_read_kernel prototype mismatch riscv, bpf: Adapt bpf trampoline to optimized riscv ftrace framework libbpf: fix typos in Makefile tracing: bpf: use struct trace_entry in struct syscall_tp_t bpf, devmap: Remove unused dtab field from bpf_dtab_netdev bpf, cpumap: Remove unused cmap field from bpf_cpu_map_entry netfilter: bpf: Only define get_proto_defrag_hook() if necessary bpf: Fix an array-index-out-of-bounds issue in disasm.c net: remove duplicate INDIRECT_CALLABLE_DECLARE of udp[6]_ehashfn docs/bpf: Fix malformed documentation bpf: selftests: Add defrag selftests bpf: selftests: Support custom type and proto for client sockets bpf: selftests: Support not connecting client socket netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link ... ==================== Link: https://lore.kernel.org/r/20230803174845.825419-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-28netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter linkDaniel Xu
This commit adds support for enabling IP defrag using pre-existing netfilter defrag support. Basically all the flag does is bump a refcnt while the link the active. Checks are also added to ensure the prog requesting defrag support is run _after_ netfilter defrag hooks. We also take care to avoid any issues w.r.t. module unloading -- while defrag is active on a link, the module is prevented from unloading. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Reviewed-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/5cff26f97e55161b7d56b09ddcf5f8888a5add1d.1689970773.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28ynl: regenerate all headersStanislav Fomichev
Also add support to pass topdir to ynl-regen.sh (Jakub) and call it from the makefile to update the UAPI headers. Signed-off-by: Stanislav Fomichev <sdf@google.com> Co-developed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230727163001.3952878-4-sdf@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>