Age | Commit message | Author
2022-09-03 | Merge branch 'net_sched-redundant-resource-cleanups' | David S. Miller
Zhengchao Shao says:

====================
net: sched: remove redundant resource cleanup when init() fails

qdisc_create() calls .init() to initialize qdisc. If the initialization fails, qdisc_create() will call .destroy() to release resources.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
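A sketch of the ownership pattern this series relies on (simplified and illustrative; allocate_internal_state() is a hypothetical helper, not the real htb/fq_codel code):

    /* qdisc_create() owns the error path: when ops->init() fails it
     * calls ops->destroy() itself, so init() must not unwind on its
     * own or the resources would be freed twice.
     */
    static int example_qdisc_init(struct Qdisc *sch, struct nlattr *opt,
                                  struct netlink_ext_ack *extack)
    {
            int err;

            err = allocate_internal_state(sch);     /* hypothetical helper */
            if (err)
                    return err;     /* no cleanup here: qdisc_create()
                                     * will invoke ops->destroy() */

            return 0;
    }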
2022-09-03 | net: sched: htb: remove redundant resource cleanup in htb_init() | Zhengchao Shao
If htb_init() fails, qdisc_create() invokes htb_destroy() to clear resources. Therefore, remove redundant resource cleanup in htb_init(). Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-03 | net: sched: fq_codel: remove redundant resource cleanup in fq_codel_init() | Zhengchao Shao
If fq_codel_init() fails, qdisc_create() invokes fq_codel_destroy() to clear resources. Therefore, remove redundant resource cleanup in fq_codel_init(). Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-03 | net: fec: add stop mode support for imx8 platform | Wei Fang
The current driver supports stop mode by calling the machine API. This patch adds devicetree support for setting the GPR register for the stop request. The imx8mq enters/exits stop mode by setting a GPR bit, which can be accessed by the A core. The imx8qm enters/exits stop mode by calling IMX_SC IPC APIs that communicate with the M core IPC service; the M core then sets the related GPR bit.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-03 | r8152: Add MAC passthrough support for Lenovo Travel Hub | André Apitzsch
The Lenovo USB-C Travel Hub supports MAC passthrough. Signed-off-by: André Apitzsch <git@apitzsch.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-03 | net/ipv4: Use __DECLARE_FLEX_ARRAY() helper | Gustavo A. R. Silva
We now have a cleaner way to keep compatibility with user-space (a.k.a. not breaking it) when we need to keep in place a one-element array (for its use in user-space) together with a flexible-array member (for its use in kernel-space) without making it hard to read at the source level. This is through the use of the new __DECLARE_FLEX_ARRAY() helper macro. The size and memory layout of the structure is preserved after the changes. See below.

Before changes:

$ pahole -C ip_msfilter net/ipv4/igmp.o
struct ip_msfilter {
        union {
                struct {
                        __be32 imsf_multiaddr_aux;     /*  0  4 */
                        __be32 imsf_interface_aux;     /*  4  4 */
                        __u32  imsf_fmode_aux;         /*  8  4 */
                        __u32  imsf_numsrc_aux;        /* 12  4 */
                        __be32 imsf_slist[1];          /* 16  4 */
                };                                     /*  0 20 */
                struct {
                        __be32 imsf_multiaddr;         /*  0  4 */
                        __be32 imsf_interface;         /*  4  4 */
                        __u32  imsf_fmode;             /*  8  4 */
                        __u32  imsf_numsrc;            /* 12  4 */
                        __be32 imsf_slist_flex[0];     /* 16  0 */
                };                                     /*  0 16 */
        };                                             /*  0 20 */

        /* size: 20, cachelines: 1, members: 1 */
        /* last cacheline: 20 bytes */
};

After changes:

$ pahole -C ip_msfilter net/ipv4/igmp.o
struct ip_msfilter {
        __be32         imsf_multiaddr;                 /*  0  4 */
        __be32         imsf_interface;                 /*  4  4 */
        __u32          imsf_fmode;                     /*  8  4 */
        __u32          imsf_numsrc;                    /* 12  4 */
        union {
                __be32 imsf_slist[1];                  /* 16  4 */
                struct {
                        struct {
                        } __empty_imsf_slist_flex;     /* 16  0 */
                        __be32 imsf_slist_flex[0];     /* 16  0 */
                };                                     /* 16  0 */
        };                                             /* 16  4 */

        /* size: 20, cachelines: 1, members: 5 */
        /* last cacheline: 20 bytes */
};

In the past, we had to duplicate the whole original structure within a union, and update the names of all the members. Now, we just need to declare the flexible-array member to be used in kernel-space through the __DECLARE_FLEX_ARRAY() helper together with the one-element array, within a union. This makes the source code cleaner and easier to read.

Link: https://github.com/KSPP/linux/issues/193
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
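For reference, a sketch of the converted declaration that produces the "after" layout above (consistent with the pahole output; the helper macro lives in include/uapi/linux/stddef.h):

    struct ip_msfilter {
            __be32  imsf_multiaddr;
            __be32  imsf_interface;
            __u32   imsf_fmode;
            __u32   imsf_numsrc;
            union {
                    __be32  imsf_slist[1];          /* user-space ABI */
                    __DECLARE_FLEX_ARRAY(__be32, imsf_slist_flex);  /* kernel */
            };
    };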
2022-09-02 | net/sched: cls_api: remove redundant 0 check in tcf_qevent_init() | Zhengchao Shao
tcf_qevent_parse_block_index() never returns a zero block_index. Therefore, it is unnecessary to check the value of block_index in tcf_qevent_init(). Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Link: https://lore.kernel.org/r/20220901011617.14105-1-shaozhengchao@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-02 | net: lantiq_etop: Fix return type for implementation of ndo_start_xmit | GUO Zihua
Since Linux now supports CFI, it is a good idea to fix mismatched return types in implementations of hooks. Otherwise these might get caught by CFI and cause a panic. ltq_etop_tx() returns either NETDEV_TX_BUSY or NETDEV_TX_OK, so change the return type to netdev_tx_t directly.

Signed-off-by: GUO Zihua <guozihua@huawei.com>
Link: https://lore.kernel.org/r/20220902081521.59867-1-guozihua@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
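The shape of the fix, sketched (the same one-line signature change applies to the sunplus, xscale, and broadcom patches below):

    /* Before: prototype does not match ndo_start_xmit, which kCFI
     * checks at the indirect call site.
     */
    static int ltq_etop_tx(struct sk_buff *skb, struct net_device *dev);

    /* After: matches the .ndo_start_xmit member of struct net_device_ops. */
    static netdev_tx_t ltq_etop_tx(struct sk_buff *skb, struct net_device *dev);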
2022-09-02 | net: sunplus: Fix return type for implementation of ndo_start_xmit | GUO Zihua
Since Linux now supports CFI, it is a good idea to fix mismatched return types in implementations of hooks. Otherwise these might get caught by CFI and cause a panic. spl2sw_ethernet_start_xmit() returns either NETDEV_TX_BUSY or NETDEV_TX_OK, so change the return type to netdev_tx_t directly.

Signed-off-by: GUO Zihua <guozihua@huawei.com>
Link: https://lore.kernel.org/r/20220902081550.60095-1-guozihua@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-02 | net: xscale: Fix return type for implementation of ndo_start_xmit | GUO Zihua
Since Linux now supports CFI, it is a good idea to fix mismatched return types in implementations of hooks. Otherwise these might get caught by CFI and cause a panic. eth_xmit() returns either NETDEV_TX_BUSY or NETDEV_TX_OK, so change the return type to netdev_tx_t directly.

Signed-off-by: GUO Zihua <guozihua@huawei.com>
Link: https://lore.kernel.org/r/20220902081612.60405-1-guozihua@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-02 | net: broadcom: Fix return type for implementation of ndo_start_xmit | GUO Zihua
Since Linux now supports CFI, it is a good idea to fix mismatched return types in implementations of hooks. Otherwise these might get caught by CFI and cause a panic. bcm4908_enet_start_xmit() returns either NETDEV_TX_BUSY or NETDEV_TX_OK, so change the return type to netdev_tx_t directly.

Signed-off-by: GUO Zihua <guozihua@huawei.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20220902075407.52358-1-guozihua@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-02 | Merge branch 'bpf: net: Remove duplicated code from bpf_getsockopt()' | Alexei Starovoitov
Martin KaFai Lau says:

====================
From: Martin KaFai Lau <martin.lau@kernel.org>

The earlier commits [0] removed duplicated code from bpf_setsockopt(). This series is to remove duplicated code from bpf_getsockopt(). Unlike setsockopt(), which had already been changed to take the sockptr_t argument, the same has not been done to getsockopt(). This is the extra step being done in this series.

[0]: https://lore.kernel.org/all/20220817061704.4174272-1-kafai@fb.com/

v2:
- The previous v2 did not reach the list. It is a resend.
- Add comments noting that bpf_getsockopt() should not free the saved_syn (Stanislav)
- Explicitly null-terminate the tcp-cc name (Stanislav)
====================

Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-02 | selftest/bpf: Add test for bpf_getsockopt() | Martin KaFai Lau
This patch removes __bpf_getsockopt(), which directly read the sk by using PTR_TO_BTF_ID. Instead, the test now directly uses the kernel bpf helper bpf_getsockopt(), which now supports all the required optnames. TCP_SAVE[D]_SYN and TCP_MAXSEG are not tested in a loop for all the hooks and sock_ops's cb: TCP_SAVE[D]_SYN only works on a passive connection, and TCP_MAXSEG only works when it is set by setsockopt() before the connection is established, while the getsockopt() return value can only be tested after the connection is established.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002937.2896904-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-02 | bpf: Change bpf_getsockopt(SOL_IPV6) to reuse do_ipv6_getsockopt() | Martin KaFai Lau
This patch changes bpf_getsockopt(SOL_IPV6) to reuse do_ipv6_getsockopt(). It removes the duplicated code from bpf_getsockopt(SOL_IPV6). This also makes bpf_getsockopt(SOL_IPV6) support the same set of optnames as bpf_setsockopt(SOL_IPV6). In particular, this adds IPV6_AUTOFLOWLABEL support to bpf_getsockopt(SOL_IPV6). ipv6 can be compiled as a module. As other code has solved this with stubs in ipv6_stubs.h, this patch adds do_ipv6_getsockopt to the ipv6_bpf_stub.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002931.2896218-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-02 | bpf: Change bpf_getsockopt(SOL_IP) to reuse do_ip_getsockopt() | Martin KaFai Lau
This patch changes bpf_getsockopt(SOL_IP) to reuse do_ip_getsockopt() and remove the duplicated code. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220902002925.2895416-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-02 | bpf: Change bpf_getsockopt(SOL_TCP) to reuse do_tcp_getsockopt() | Martin KaFai Lau
This patch changes bpf_getsockopt(SOL_TCP) to reuse do_tcp_getsockopt(). It removes the duplicated code from bpf_getsockopt(SOL_TCP). Before this patch, some optnames were available to bpf_setsockopt(SOL_TCP) but missing from bpf_getsockopt(SOL_TCP), for example TCP_NODELAY, TCP_MAXSEG, TCP_KEEPIDLE, TCP_KEEPINTVL, and a few more. It surprises users from time to time. This patch automatically closes this gap without duplicating more code. bpf_getsockopt(TCP_SAVED_SYN) does not free the saved_syn, so it stays in sol_tcp_sockopt(). For a string value like TCP_CONGESTION, bpf expects it to always be null-terminated, so sol_tcp_sockopt() decrements optlen by one before calling do_tcp_getsockopt(), and the 'if (optlen < saved_optlen) memset(..,0,..);' in __bpf_getsockopt() then always provides the null termination.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002918.2894511-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
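A hedged sketch of the null-termination trick described above (condensed and illustrative, not the verbatim sol_tcp_sockopt() source):

    /* In sol_tcp_sockopt(): reserve the last byte so the copied
     * tcp-cc name always ends in '\0'.
     */
    if (optname == TCP_CONGESTION)
            optlen -= 1;

    err = do_tcp_getsockopt(sk, level, optname,
                            KERNEL_SOCKPTR(optval),
                            KERNEL_SOCKPTR(&optlen));

    /* Back in __bpf_getsockopt(): zero the unused tail, which
     * includes the reserved byte, guaranteeing null termination.
     */
    if (optlen < saved_optlen)
            memset(optval + optlen, 0, saved_optlen - optlen);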
2022-09-02 | bpf: Change bpf_getsockopt(SOL_SOCKET) to reuse sk_getsockopt() | Martin KaFai Lau
This patch changes bpf_getsockopt(SOL_SOCKET) to reuse sk_getsockopt(). It removes all duplicated code from bpf_getsockopt(SOL_SOCKET). Before this patch, there were some optnames available to bpf_setsockopt(SOL_SOCKET) but missing in bpf_getsockopt(SOL_SOCKET). It surprises users from time to time. For example, SO_REUSEADDR, SO_KEEPALIVE, SO_RCVLOWAT, and SO_MAX_PACING_RATE. This patch automatically closes this gap without duplicating more code. The only exception is SO_BINDTODEVICE because it needs to acquire a blocking lock. Thus, SO_BINDTODEVICE is not supported. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220902002912.2894040-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-02 | bpf: Embed kernel CONFIG check into the if statement in bpf_getsockopt | Martin KaFai Lau
This patch moves the "#ifdef CONFIG_XXX" check into the "if/else" statement itself. The change is done for the bpf_getsockopt() function only. It will make the later patches easier to follow without the surrounding ifdef macro.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002906.2893572-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
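A sketch of the before/after shape (the handler name sol_ip_sockopt() and the exact condition are illustrative, following the pattern of the earlier bpf_setsockopt() change):

    /* Before: the branch vanishes entirely when CONFIG_INET is off. */
    #ifdef CONFIG_INET
            } else if (level == SOL_IP) {
                    return sol_ip_sockopt(sk, optname, optval, optlen, getopt);
    #endif

    /* After: the config check is part of the condition, so the code is
     * always compiled (and eliminated as dead code when disabled).
     */
            } else if (IS_ENABLED(CONFIG_INET) && level == SOL_IP) {
                    return sol_ip_sockopt(sk, optname, optval, optlen, getopt);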
2022-09-02 | bpf: net: Avoid do_ipv6_getsockopt() taking sk lock when called from bpf | Martin KaFai Lau
Similar to the earlier patch that changed sk_getsockopt() to use sockopt_{lock,release}_sock() so that it can avoid taking the lock when called from bpf, this patch changes do_ipv6_getsockopt() to use sockopt_{lock,release}_sock() so that bpf_getsockopt(SOL_IPV6) can reuse do_ipv6_getsockopt(). Although bpf_getsockopt(SOL_IPV6) currently does not support any optname that requires lock_sock(), using sockopt_{lock,release}_sock() consistently across the *_getsockopt() functions makes it harder to miss the required sockopt_{lock,release}_sock() usage when a future optname is added, e.g. a new optname that requires a lock and is also needed in bpf_getsockopt().

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002859.2893064-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
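The helper pair being adopted here is small; a sketch of its core idea (mirroring the sockopt_lock_sock()/sockopt_release_sock() added to net/core/sock.c by the earlier setsockopt series):

    void sockopt_lock_sock(struct sock *sk)
    {
            /* In bpf context there is no user context to sleep in,
             * and the sk is already safe to access, so skip
             * lock_sock().
             */
            if (has_current_bpf_ctx())
                    return;

            lock_sock(sk);
    }

    void sockopt_release_sock(struct sock *sk)
    {
            if (has_current_bpf_ctx())
                    return;

            release_sock(sk);
    }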
2022-09-02 | bpf: net: Change do_ipv6_getsockopt() to take the sockptr_t argument | Martin KaFai Lau
Similar to the earlier patch that changed sk_getsockopt() to take the sockptr_t argument, this patch changes do_ipv6_getsockopt() to take the sockptr_t argument so that a later patch can make bpf_getsockopt(SOL_IPV6) reuse do_ipv6_getsockopt(). Note the change in ip6_mc_msfget(): this function returns an array of sockaddr_storage in optval and is shared between ipv6_get_msfilter() and compat_ipv6_get_msfilter(). However, the sockaddr_storage is stored at a different offset of the optval because of the difference between group_filter and compat_group_filter. Thus, a new 'ss_offset' argument is added to ip6_mc_msfget().

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002853.2892532-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
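sockptr_t lets one implementation serve both user (syscall) and kernel (bpf) buffers; a hedged sketch of the copy pattern that replaces raw copy_to_user()/put_user() calls (the option value is illustrative):

    int len, val = 0;       /* val would hold the option's result */

    if (copy_from_sockptr(&len, optlen, sizeof(int)))
            return -EFAULT;
    len = min_t(unsigned int, len, sizeof(int));
    if (copy_to_sockptr(optval, &val, len))
            return -EFAULT;
    if (copy_to_sockptr(optlen, &len, sizeof(int)))
            return -EFAULT;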
2022-09-02 | net: Add a len argument to compat_ipv6_get_msfilter() | Martin KaFai Lau
Pass the len to compat_ipv6_get_msfilter() instead of having compat_ipv6_get_msfilter() get it again from optlen. Its counterpart ipv6_get_msfilter() also takes the len from do_ipv6_getsockopt().

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002846.2892091-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-02 | net: Remove unused flags argument from do_ipv6_getsockopt | Martin KaFai Lau
The 'unsigned int flags' argument is always 0, so it can be removed. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220902002840.2891763-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-02 | bpf: net: Avoid do_ip_getsockopt() taking sk lock when called from bpf | Martin KaFai Lau
Similar to the earlier commit that changed sk_setsockopt() to use sockopt_{lock,release}_sock() so that it can avoid taking the lock when called from bpf, this patch changes do_ip_getsockopt() to use sockopt_{lock,release}_sock() so that a later patch can make bpf_getsockopt(SOL_IP) reuse do_ip_getsockopt().

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002834.2891514-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-02 | bpf: net: Change do_ip_getsockopt() to take the sockptr_t argument | Martin KaFai Lau
Similar to the earlier patch that changed sk_getsockopt() to take the sockptr_t argument, this patch changes do_ip_getsockopt() to take the sockptr_t argument so that a later patch can make bpf_getsockopt(SOL_IP) reuse do_ip_getsockopt(). Note the change in ip_mc_gsfget(): this function returns an array of sockaddr_storage in optval and is shared between ip_get_mcast_msfilter() and compat_ip_get_mcast_msfilter(). However, the sockaddr_storage is stored at a different offset of the optval because of the difference between group_filter and compat_group_filter. Thus, a new 'ss_offset' argument is added to ip_mc_gsfget().

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002828.2890585-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-02 | bpf: net: Avoid do_tcp_getsockopt() taking sk lock when called from bpf | Martin KaFai Lau
Similar to the earlier commit that changed sk_setsockopt() to use sockopt_{lock,release}_sock() so that it can avoid taking the lock when called from bpf, this patch changes do_tcp_getsockopt() to use sockopt_{lock,release}_sock() so that a later patch can make bpf_getsockopt(SOL_TCP) reuse do_tcp_getsockopt().

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002821.2889765-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-02 | bpf: net: Change do_tcp_getsockopt() to take the sockptr_t argument | Martin KaFai Lau
Similar to the earlier patch that changed sk_getsockopt() to take the sockptr_t argument, this patch changes do_tcp_getsockopt() to take the sockptr_t argument so that a later patch can make bpf_getsockopt(SOL_TCP) reuse do_tcp_getsockopt().

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002815.2889332-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-02 | bpf: net: Avoid sk_getsockopt() taking sk lock when called from bpf | Martin KaFai Lau
Similar to the earlier commit that changed sk_setsockopt() to use sockopt_{lock,release}_sock() so that it can avoid taking the lock when called from bpf, this patch changes sk_getsockopt() to use sockopt_{lock,release}_sock() so that a later patch can make bpf_getsockopt(SOL_SOCKET) reuse sk_getsockopt(). Only sk_get_filter() requires this change; it is used by the optname SO_GET_FILTER. The sock->ops->getname() implementations are also left unchanged, since bpf does not always have the sk->sk_socket pointer and cannot support SO_PEERNAME.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002809.2888981-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-02 | bpf: net: Change sk_getsockopt() to take the sockptr_t argument | Martin KaFai Lau
This patch changes sk_getsockopt() to take the sockptr_t argument so that it can be used by bpf_getsockopt(SOL_SOCKET) in a later patch. security_socket_getpeersec_stream() is not changed. It stays with the __user ptr (optval.user and optlen.user) to avoid changes to other security hooks. bpf_getsockopt(SOL_SOCKET) also does not support SO_PEERSEC.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002802.2888419-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-02 | net: Change sock_getsockopt() to take the sk ptr instead of the sock ptr | Martin KaFai Lau
A later patch refactors bpf_getsockopt(SOL_SOCKET) with sock_getsockopt() to avoid code duplication and code drift between the two duplicates. The current sock_getsockopt() takes a sock ptr as the argument, and the very first thing this function does is get back the sk ptr with 'sk = sock->sk'. bpf_getsockopt() can be called when the sk does not have the sock ptr created, meaning sk->sk_socket is NULL; for example, when a passive tcp connection has just been established but has not yet been accept()-ed. Thus, it cannot use sock_getsockopt(sk->sk_socket), or else it would pass a NULL ptr. This patch moves the whole sock_getsockopt() implementation to the newly added sk_getsockopt(). The new sk_getsockopt() takes a sk ptr and immediately gets the sock ptr with 'sock = sk->sk_socket'. The existing sock_getsockopt(sock) is changed to call sk_getsockopt(sock->sk). All existing callers have both sock->sk and sk->sk_socket pointers. A later patch will make bpf_getsockopt(SOL_SOCKET) call sk_getsockopt(sk) directly. bpf_getsockopt(SOL_SOCKET) does not use the optnames that require sk->sk_socket, so this is safe.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002756.2887884-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
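A sketch of the resulting split (consistent with the description above; USER_SOCKPTR() wraps a __user pointer into a sockptr_t):

    static int sk_getsockopt(struct sock *sk, int level, int optname,
                             sockptr_t optval, sockptr_t optlen)
    {
            struct socket *sock = sk->sk_socket;

            /* ... the former sock_getsockopt() body ... */
    }

    int sock_getsockopt(struct socket *sock, int level, int optname,
                        char __user *optval, int __user *optlen)
    {
            return sk_getsockopt(sock->sk, level, optname,
                                 USER_SOCKPTR(optval),
                                 USER_SOCKPTR(optlen));
    }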
2022-09-02 | net: ieee802154: Fix compilation error when CONFIG_IEEE802154_NL802154_EXPERIMENTAL is disabled | Gal Pressman
When CONFIG_IEEE802154_NL802154_EXPERIMENTAL is disabled, NL802154_CMD_DEL_SEC_LEVEL is undefined and results in a compilation error:

net/ieee802154/nl802154.c:2503:19: error: 'NL802154_CMD_DEL_SEC_LEVEL' undeclared here (not in a function); did you mean 'NL802154_CMD_SET_CCA_ED_LEVEL'?
 2503 |   .resv_start_op = NL802154_CMD_DEL_SEC_LEVEL + 1,
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~
      |                    NL802154_CMD_SET_CCA_ED_LEVEL

Unhide the experimental commands; having them defined in an enum makes no difference.

Fixes: 9c5d03d36251 ("genetlink: start to validate reserved header bytes")
Signed-off-by: Gal Pressman <gal@nvidia.com>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Tested-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Link: https://lore.kernel.org/r/20220902030620.2737091-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-02 | selftests/xsk: Avoid use-after-free on ctx | Ian Rogers
The put lowers the reference count to 0 and frees ctx; reading it afterwards is invalid. Move the put after the uses, and determine the last use by the reference count being 1.

Fixes: 39e940d4abfa ("selftests/xsk: Destroy BPF resources only when ctx refcount drops to 0")
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220901202645.1463552-1-irogers@google.com
2022-09-02 | selftests/bpf: Store BPF object files with .bpf.o extension | Daniel Müller
BPF object files are, in a way, the final artifact produced as part of the ahead-of-time compilation process. That makes them somewhat special compared to "regular" object files, which are intermediate build artifacts that can typically be removed safely. As such, it can make sense to name them differently to make it easier to spot this difference at a glance. Among others, libbpf-bootstrap [0] has established the extension .bpf.o for BPF object files. It seems reasonable to follow this example and establish the same denomination for selftest build artifacts. To that end, this change adjusts the corresponding part of the build system and the test programs loading BPF object files to work with .bpf.o files.

[0] https://github.com/libbpf/libbpf-bootstrap

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Müller <deso@posteo.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220901222253.1199242-1-deso@posteo.net
2022-09-02 | selftests/xsk: Add support for zero copy testing | Maciej Fijalkowski
Introduce a new mode to xdpxceiver responsible for testing AF_XDP zero-copy support of the driver that serves the underlying physical device. When setting up the test suite, determine whether the driver has ZC support or not by trying to bind an XSK ZC socket to the interface. If that succeeds, interpret it as ZC support being in place and run the softirq and busy poll tests in zero-copy mode. Note that the Rx dropped tests are skipped, since the ZC path does not touch the rx_dropped stat at all.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220901114813.16275-7-maciej.fijalkowski@intel.com
2022-09-02 | selftests/xsk: Make sure single threaded test terminates | Maciej Fijalkowski
For single-threaded poll tests, call pthread_kill() from the main thread so that we are sure the worker thread has finished its job and it is possible to proceed with the next test types from the test suite. It was observed that on some platforms it takes a bit longer for the worker thread to exit, and the next test case then sees the device as busy.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220901114813.16275-6-maciej.fijalkowski@intel.com
2022-09-02 | selftests/xsk: Add support for executing tests on physical device | Maciej Fijalkowski
Currently, the architecture of xdpxceiver is designed strictly for conducting veth-based tests. A veth pair is created together with a network namespace, and one of the veth interfaces is moved to the mentioned netns. Then separate threads for Tx and Rx are spawned which utilize the described setup.

The infrastructure described above cannot be used for testing AF_XDP support on physical devices. That testing will be conducted on a single network interface and the same queue. Xskxceiver needs to be extended to distinguish between veth tests and physical interface tests.

Since the same iface/queue id pair will be used by both Tx and Rx threads for physical device testing, the Tx thread, which happens to run after the Rx thread, is going to create an XSK socket with the shared umem flag. In order to track this setting throughout the lifetime of the spawned threads, introduce a 'shared_umem' boolean variable in struct ifobject and set it to true when xdpxceiver is run against a physical device. In that case, the UMEM size needs to be doubled, so half of it will be used by the Rx thread and the other half by the Tx thread. For two-step based test types, the value of the XSKMAP element under key 0 has to be updated as there is now another socket for the second step. Also, to avoid race conditions when destroying XSK resources, move this activity to the main thread after the spawned Rx and Tx threads have finished their job. This way it is possible to gracefully remove the shared umem without introducing synchronization mechanisms.

To run the xsk selftests suite on a physical device, append "-i $IFACE" when invoking test_xsk.sh. For veth-based tests, simply skip it. When "-i $IFACE" is in place, under the hood test_xsk.sh will use $IFACE for both interfaces supplied to xdpxceiver, which in turn will interpret that this execution of the test suite is for a physical device. Note that currently this makes it possible only to test SKB and DRV mode (in case the underlying device has native XDP support). ZC testing support is added in a later patch.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220901114813.16275-5-maciej.fijalkowski@intel.com
2022-09-02 | selftests/xsk: Increase chars for interface name to 16 | Maciej Fijalkowski
So that "enp240s0f0" or such name can be used against xskxceiver. While at it, also extend character count for netns name. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220901114813.16275-4-maciej.fijalkowski@intel.com
2022-09-02 | selftests/xsk: Introduce default Rx pkt stream | Maciej Fijalkowski
In order to prepare xdpxceiver for physical device testing, let us introduce a default Rx pkt stream. The reason for doing this is that physical device testing will use a UMEM with a doubled size, where half of it will be used by Tx and the other half by Rx. This means that pkt addresses will differ for the Tx and Rx streams. The Rx thread will initialize the xsk_umem_info::base_addr that is added here, so that pkt_set(), when working on the Rx UMEM, will add this offset and the second half of the UMEM space will be used. Note that currently base_addr is 0 on both sides; a future commit will do the mentioned initialization. Previously, veth-based testing worked on separate UMEMs, so a single default stream was fine.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220901114813.16275-3-maciej.fijalkowski@intel.com
2022-09-02 | selftests/xsk: Query for native XDP support | Maciej Fijalkowski
Currently, xdpxceiver assumes that the underlying device supports XDP in native mode; this is fine for now, since tests can run only on a veth pair. A future commit is going to allow running the test suite against physical devices, so let us query whether the device is capable of running XDP programs in native mode. This way xdpxceiver will not try to run TEST_MODE_DRV if the device being tested does not support it.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220901114813.16275-2-maciej.fijalkowski@intel.com
2022-09-02 | selftests/bpf: Amend test_tunnel to exercise BPF_F_TUNINFO_FLAGS | Shmulik Ladkani
Get the tunnel flags in {ipv6}vxlan_get_tunnel_src and ensure they are aligned with tunnel params set at {ipv6}vxlan_set_tunnel_dst. Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220831144010.174110-2-shmulik.ladkani@gmail.com
2022-09-02 | bpf: Support getting tunnel flags | Shmulik Ladkani
Existing 'bpf_skb_get_tunnel_key' extracts various tunnel parameters (id, ttl, tos, local and remote) but does not expose ip_tunnel_info's tun_flags to the BPF program. It makes sense to expose tun_flags to the BPF program. Assume for example multiple GRE tunnels maintained on a single GRE interface in collect_md mode. The program expects origins to initiate over GRE, however different origins use different GRE characteristics (e.g. some prefer to use GRE checksum, some do not; some pass a GRE key, some do not, etc..). A BPF program getting tun_flags can therefore remember the relevant flags (e.g. TUNNEL_CSUM, TUNNEL_SEQ...) for each initiating remote. In the reply path, the program can use 'bpf_skb_set_tunnel_key' in order to correctly reply to the remote, using similar characteristics, based on the stored tunnel flags.

Introduce the BPF_F_TUNINFO_FLAGS flag for bpf_skb_get_tunnel_key. If specified, 'bpf_tunnel_key->tunnel_flags' is set with the tun_flags. Decided to use the existing unused 'tunnel_ext' as the storage for the 'tunnel_flags' in order to avoid changing bpf_tunnel_key's layout.

Also, the following has been considered during the design:

1. Convert the "interesting" internal TUNNEL_xxx flags back to BPF_F_yyy and place them into the new 'tunnel_flags' field. This has 2 drawbacks:
   - The BPF_F_yyy flags are from the *set_tunnel_key* enumeration space, e.g. BPF_F_ZERO_CSUM_TX. It is awkward that it is "returned" into tunnel_flags from a *get_tunnel_key* call.
   - Not all "interesting" TUNNEL_xxx flags can be mapped to existing BPF_F_yyy flags, and it doesn't make sense to create new BPF_F_yyy flags just for purposes of the returned tunnel_flags.

2. Place key.tun_flags into 'tunnel_flags' but mask them, keeping only "interesting" flags. That's ok, but the drawback is that what's "interesting" for my usecase might be limiting for other usecases.

Therefore I decided to expose what's in key.tun_flags *as is*, which seems most flexible. The BPF user can just choose to ignore bits he's not interested in. The TUNNEL_xxx are also UAPI, so no harm exposing them back in the get_tunnel_key call.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220831144010.174110-1-shmulik.ladkani@gmail.com
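A hedged sketch of a tc BPF program using the new flag (section name and policy are illustrative):

    #include <linux/bpf.h>
    #include <linux/if_tunnel.h>        /* TUNNEL_CSUM, TUNNEL_SEQ, ... */
    #include <linux/pkt_cls.h>          /* TC_ACT_OK */
    #include <bpf/bpf_helpers.h>

    SEC("tc")
    int remember_tun_flags(struct __sk_buff *skb)
    {
            struct bpf_tunnel_key key = {};

            if (bpf_skb_get_tunnel_key(skb, &key, sizeof(key),
                                       BPF_F_TUNINFO_FLAGS))
                    return TC_ACT_OK;

            /* key.tunnel_flags now carries the raw TUNNEL_xxx bits,
             * e.g. (key.tunnel_flags & TUNNEL_CSUM) reveals whether
             * this origin used GRE checksums; a real program would
             * store that per-remote for use in the reply path.
             */
            return TC_ACT_OK;
    }

    char _license[] SEC("license") = "GPL";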
2022-09-02 | bpf, tnums: Warn against the usage of tnum_in(tnum_range(), ...) | Shung-Hsi Yu
Commit a657182a5c51 ("bpf: Don't use tnum_range on array range checking for poke descriptors") has shown that using tnum_range() as an argument to tnum_in() can lead to misleading code that looks like a tight bound check when in fact the actual allowed range is much wider. Document such behavior to warn against its usage in general, and suggest scenarios where the result can be trusted.

Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/984b37f9fdf7ac36831d2137415a4a915744c1b6.1661462653.git.daniel@iogearbox.net
Link: https://www.openwall.com/lists/oss-security/2022/08/26/1
Link: https://lore.kernel.org/bpf/20220831031907.16133-3-shung-hsi.yu@suse.com
Link: https://lore.kernel.org/bpf/20220831031907.16133-2-shung-hsi.yu@suse.com
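A self-contained illustration of the widening being warned about, with tnum_range()/tnum_in() re-implemented in userspace following kernel/bpf/tnum.c semantics:

    #include <stdint.h>
    #include <stdio.h>

    struct tnum { uint64_t value; uint64_t mask; };

    static struct tnum tnum_range(uint64_t min, uint64_t max)
    {
            uint64_t chi = min ^ max;
            int bits = chi ? 64 - __builtin_clzll(chi) : 0; /* fls64(chi) */
            uint64_t delta = bits > 63 ? ~0ULL : (1ULL << bits) - 1;

            /* Covers [min, max], but possibly more values as well. */
            return (struct tnum){ .value = min & ~delta, .mask = delta };
    }

    static int tnum_in(struct tnum a, struct tnum b)
    {
            if (b.mask & ~a.mask)
                    return 0;
            b.value &= ~a.mask;
            return a.value == b.value;
    }

    int main(void)
    {
            struct tnum range = tnum_range(0, 10);  /* really {0..15} */
            struct tnum twelve = { .value = 12, .mask = 0 };

            printf("%d\n", tnum_in(range, twelve)); /* prints 1: 12 "fits" */
            return 0;
    }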
2022-09-02 | net: remove netif_tx_napi_add() | Jakub Kicinski
All callers are now gone. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-02 | net: bql: add more documentation | Eric Dumazet
Add some documentation for netdev_tx_sent_queue() and netdev_tx_completed_queue(). Stating that netdev_tx_completed_queue() must be called once per TX completion round is apparently not obvious for everybody.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
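A hedged sketch of the documented usage (driver details elided; reclaim_completed_skb() is a hypothetical helper standing in for the driver's completion-ring walk):

    /* TX path: account each packet as it is queued. */
    netdev_tx_sent_queue(netdev_get_tx_queue(dev, qid), skb->len);

    /* TX completion: accumulate across the whole round, then report
     * exactly once per round.
     */
    unsigned int pkts = 0, bytes = 0;

    while ((skb = reclaim_completed_skb(ring))) {   /* hypothetical */
            pkts++;
            bytes += skb->len;
            dev_consume_skb_any(skb);
    }
    netdev_tx_completed_queue(netdev_get_tx_queue(dev, qid), pkts, bytes);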
2022-09-02 | Merge branch 'net-ipa-transaction-state-IDs' | David S. Miller
Alex Elder says:

====================
net: ipa: use IDs to track transaction state

This series is the first of three groups of changes that simplify the way the IPA driver tracks the state of its transactions.

Each GSI channel has a fixed number of transactions allocated at initialization time. The number allocated matches the number of TREs in the transfer ring associated with the channel. This is because the transfer ring limits the number of transfers that can ever be underway, and in the worst case, each transaction represents a single TRE.

Transactions go through various states during their lifetime. Currently a set of lists keeps track of which transactions are in each state. Initially, all transactions are free. An allocated transaction is placed on the allocated list. Once an allocated transaction is committed, it is moved from the allocated to the committed list. When a committed transaction is sent to hardware (via a doorbell) it is moved to the pending list. When hardware signals that some work has completed, transactions are moved to the completed list. Finally, when a completed transaction is polled it's moved to the polled list before being removed when it becomes free.

Changing a transaction's state thus normally involves manipulating two lists, and to prevent corruption a spinlock is held while the lists are updated.

Transactions move through their states in a well-defined sequence though, and they do so strictly in order. So transaction 0 is always allocated before transaction 1; transaction 0 is always committed before transaction 1; and so on, through completion, polling, and becoming free. Because of this, it's sufficient to just keep track of which transaction is the first in each state. The rest of the transactions in a given state can be derived from the first transaction in an "adjacent" state. As a result, we can track the state of all transactions with a set of indexes, and can update these without the need for a spinlock.

This first group of patches just defines the set of indexes that will be used for this new way of tracking transaction state. Two more groups of patches will follow. I've broken the 17 patches into these three groups to facilitate review.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
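A sketch of where the series is headed (field names are illustrative, not the driver's exact definitions):

    /* Transactions advance strictly in order through
     * free -> allocated -> committed -> pending -> completed -> polled,
     * so recording the first transaction ID in each state describes
     * them all; e.g. the allocated-but-uncommitted transactions are
     * exactly those with IDs in [allocated_id, free_id).
     */
    struct gsi_trans_info {
            u16 free_id;            /* first free (not yet allocated) */
            u16 allocated_id;       /* first allocated, not committed */
            u16 committed_id;       /* first committed, not pending */
            u16 pending_id;         /* first pending, not completed */
            u16 completed_id;       /* first completed, not polled */
            u16 polled_id;          /* first polled, not yet freed */
            struct gsi_trans *trans;        /* transaction array */
    };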
2022-09-02 | net: ipa: track polled transactions with an ID | Alex Elder
Add a transaction ID to track the first element in the transaction array that has been polled. Advance the ID when we are releasing a transaction. Temporarily add warnings that verify that the first polled transaction tracked by the ID matches the first element on the polled list, both when polling and freeing. Remove the temporary warnings added by the previous commit. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-02 | net: ipa: track completed transactions with an ID | Alex Elder
Add a transaction ID field to track the first element in the transaction array that has completed but has not yet been polled. Advance the ID when we are processing a transaction in the NAPI polling loop (where completed transactions become polled). Temporarily add warnings that verify that the first completed transaction tracked by the ID matches the first element on the completed list, both when pending and completing. Remove the temporary warnings added by the previous commit. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-02 | net: ipa: track pending transactions with an ID | Alex Elder
Add a transaction ID field to track the first element in the transaction array that is pending (sent to hardware) but not yet complete. Advance the ID when a completion event for a channel indicates that transactions have completed. Temporarily add warnings that verify that the first pending transaction tracked by the ID matches the first element on the pending list, both when pending and completing, as well as when resetting the channel. Remove the temporary warnings added by the previous commit. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-02 | net: ipa: track committed transactions with an ID | Alex Elder
Add a transaction ID field to track the first element in a channel's transaction array that has been committed, but not yet passed to the hardware. Advance the ID when the hardware is notified via doorbell that TREs from a transaction are ready for consumption. Temporarily add warnings that verify that the first committed transaction tracked by the ID matches the first element on the committed list, both when committing and pending (at doorbell). Remove the temporary warnings added by the previous commit. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-02 | net: ipa: track allocated transactions with an ID | Alex Elder
Transactions for a channel are now managed in an array, with a free transaction ID indicating which is the next one free. Add another transaction ID field to track the first element in the array that has been allocated. Advance it when a transaction is committed (because that is when that transaction leaves allocated state). Temporarily add warnings that verify that the first allocated transaction tracked by the ID matches the first element on the allocated list, both when allocating and committing a transaction. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-02 | net: ipa: use an array for transactions | Alex Elder
Transactions are always allocated one at a time. The maximum number of them we could ever need occurs if each TRE is assigned to a transaction. So a channel requires no more transactions than the number of TREs in its transfer ring. That number is known to be a power-of-2 less than 65536. The transaction pool abstraction is used for other things, but for transactions we can use a simple array of transaction structures, and use a free index to indicate which entry in the array is the next one free for allocation. By having the number of elements in the array be a power-of-2, we can use an ever-incrementing 16-bit free index, and use it modulo the array size. Distinguish a "trans_id" (whose value can exceed the number of entries in the transaction array) from a "trans_index" (which is less than the number of entries). Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
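A sketch of the allocation step this implies (assuming a power-of-two array size, so 65536 % size == 0 and the modulo mapping stays consistent across the 16-bit wraparound; names are illustrative):

    static struct gsi_trans *trans_alloc_sketch(struct gsi_trans_info *info,
                                                u16 count /* power of 2 */)
    {
            /* trans_index = trans_id modulo the array size */
            struct gsi_trans *trans = &info->trans[info->free_id % count];

            info->free_id++;        /* u16 wraps naturally at 65536 */

            return trans;
    }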