summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2023-05-15octeontx2-pf: Refactor schedular queue alloc/free callsHariprasad Kelam
1. Upon txschq free request, the transmit schedular config in hardware is not getting reset. This patch adds necessary changes to do the same. 2. Current implementation calls txschq alloc during interface initialization and in response handler updates the default txschq array. This creates a problem for htb offload where txsch alloc will be called for every tc class. This patch addresses the issue by reading txschq response in mbox caller function instead in the response handler. 3. Current otx2_txschq_stop routine tries to free all txschq nodes allocated to the interface. This creates a problem for htb offload. This patch introduces the otx2_txschq_free_one to free txschq in a given level. Signed-off-by: Hariprasad Kelam <hkelam@marvell.com> Signed-off-by: Naveen Mamindlapalli <naveenm@marvell.com> Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-15octeontx2-pf: qos send queues managementSubbaraya Sundeep
Current implementation is such that the number of Send queues (SQs) are decided on the device probe which is equal to the number of online cpus. These SQs are allocated and deallocated in interface open and c lose calls respectively. This patch defines new APIs for initializing and deinitializing Send queues dynamically and allocates more number of transmit queues for QOS feature. Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: Hariprasad Kelam <hkelam@marvell.com> Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-15octeontx2-pf: Rename tot_tx_queues to non_qos_queuesHariprasad Kelam
current implementation is such that tot_tx_queues contains both xdp queues and normal tx queues. which will be allocated in interface open calls and deallocated on interface down calls respectively. With addition of QOS, where send quees are allocated/deallacated upon user request Qos send queues won't be part of tot_tx_queues. So this patch renames tot_tx_queues to non_qos_queues. Signed-off-by: Hariprasad Kelam <hkelam@marvell.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-15sch_htb: Allow HTB priority parameter in offload modeNaveen Mamindlapalli
The current implementation of HTB offload returns the EINVAL error for unsupported parameters like prio and quantum. This patch removes the error returning checks for 'prio' parameter and populates its value to tc_htb_qopt_offload structure such that driver can use the same. Add prio parameter check in mlx5 driver, as mlx5 devices are not capable of supporting the prio parameter when htb offload is used. Report error if prio parameter is set to a non-default value. Signed-off-by: Naveen Mamindlapalli <naveenm@marvell.com> Co-developed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Hariprasad Kelam <hkelam@marvell.com> Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-15net: Remove low_thresh in ip defragAngus Chen
As low_thresh has no work in fragment reassembles,del it. And Mark it deprecated in sysctl Document. Signed-off-by: Angus Chen <angus.chen@jaguarmicro.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-15Merge tag 'wireless-next-2023-05-12' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Kalle valo says: ==================== wireless-next patches for v6.5 The first pull request for v6.5 and only driver changes this time. rtl8xxxu has been making lots of progress lately and now has AP mode support. Major changes: rtl8xxxu * AP mode support, initially only for rtl8188f rtw89 * provide RSSI, EVN and SNR statistics via debugfs * support U-NII-4 channels on 5 GHz band ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13Merge branch 'bpf: Don't EFAULT for {g,s}setsockopt with wrong optlen'Martin KaFai Lau
Stanislav Fomichev says: ==================== optval larger than PAGE_SIZE leads to EFAULT if the BPF program isn't careful enough. This is often overlooked and might break completely unrelated socket options. Instead of EFAULT, let's ignore BPF program buffer changes. See the first patch for more info. In addition, clearly document this corner case and reset optlen in our selftests (in case somebody copy-pastes from them). v6: - no changes; resending due to screwing up v5 series with the unrelated patch v5: - goto in the selftest (Martin) - set IP_TOS to zero to avoid endianness complications (Martin) v4: - ignore retval as well when optlen > PAGE_SIZE (Martin) v3: - don't hard-code PAGE_SIZE (Martin) - reset orig_optlen in getsockopt when kernel part succeeds (Martin) ==================== Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-05-13bpf: Document EFAULT changes for sockoptStanislav Fomichev
And add examples for how to correctly handle large optlens. This is less relevant now when we don't EFAULT anymore, but that's still the correct thing to do. Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230511170456.1759459-5-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-05-13selftests/bpf: Correctly handle optlen > 4096Stanislav Fomichev
Even though it's not relevant in selftests, the people might still copy-paste from them. So let's take care of optlen > 4096 cases explicitly. Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230511170456.1759459-4-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-05-13selftests/bpf: Update EFAULT {g,s}etsockopt selftestsStanislav Fomichev
Instead of assuming EFAULT, let's assume the BPF program's output is ignored. Remove "getsockopt: deny arbitrary ctx->retval" because it was actually testing optlen. We have separate set of tests for retval. Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230511170456.1759459-3-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-05-13bpf: Don't EFAULT for {g,s}setsockopt with wrong optlenStanislav Fomichev
With the way the hooks implemented right now, we have a special condition: optval larger than PAGE_SIZE will expose only first 4k into BPF; any modifications to the optval are ignored. If the BPF program doesn't handle this condition by resetting optlen to 0, the userspace will get EFAULT. The intention of the EFAULT was to make it apparent to the developers that the program is doing something wrong. However, this inadvertently might affect production workloads with the BPF programs that are not too careful (i.e., returning EFAULT for perfectly valid setsockopt/getsockopt calls). Let's try to minimize the chance of BPF program screwing up userspace by ignoring the output of those BPF programs (instead of returning EFAULT to the userspace). pr_info_once those cases to the dmesg to help with figuring out what's going wrong. Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") Suggested-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230511170456.1759459-2-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-05-13sfc: fix use-after-free in efx_tc_flower_record_encap_match()Edward Cree
When writing error messages to extack for pseudo collisions, we can't use encap->type as encap has already been freed. Fortunately the same value is stored in local variable em_type, so use that instead. Fixes: 3c9561c0a5b9 ("sfc: support TC decap rules matching on enc_ip_tos") Reported-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Edward Cree <ecree.xilinx@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13net: phylink: constify fwnode argumentsRussell King (Oracle)
Both phylink_create() and phylink_fwnode_phy_connect() do not modify the fwnode argument that they are passed, so lets constify these. Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13net: fec: using the standard return codes when xdp xmit errorsShenwei Wang
This patch standardizes the inconsistent return values for unsuccessful XDP transmits by using standardized error codes (-EBUSY or -ENOMEM). Signed-off-by: Shenwei Wang <shenwei.wang@nxp.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13net: macb: Shorten max_tx_len to 4KiB - 56 on mpfsDaire McNamara
On mpfs, with SRAM configured for 4 queues, setting max_tx_len to GEM_TX_MAX_LEN=0x3f0 results multiple AMBA errors. Setting max_tx_len to (4KiB - 56) removes those errors. The details are described in erratum 1686 by Cadence The max jumbo frame size is also reduced for mpfs to (4KiB - 56). Signed-off-by: Daire McNamara <daire.mcnamara@microchip.com> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Claudiu Beznea <claudiu.beznea@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13ping: Convert hlist_nulls to plain hlist.Kuniyuki Iwashima
Since introduced in commit c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind"), ping socket does not use SLAB_TYPESAFE_BY_RCU nor check nulls marker in loops. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13Merge branch 'skb_frag_fill_page_desc'David S. Miller
Yunsheng Lin says: ==================== net: introduce skb_frag_fill_page_desc() Most users use __skb_frag_set_page()/skb_frag_off_set()/ skb_frag_size_set() to fill the page desc for a skb frag. It does not make much sense to calling __skb_frag_set_page() without calling skb_frag_off_set(), as the offset may depend on whether the page is head page or tail page, so add skb_frag_fill_page_desc() to fill the page desc for a skb frag. In the future, we can make sure the page in the frag is head page of compound page or a base page, if not, we may warn about that and convert the tail page to head page and update the offset accordingly, if we see a warning about that, we also fix the caller to fill the head page in the frag. when the fixing is done, we may remove the warning and converting. In this way, we can remove the compound_head() or use page_ref_*() like the below case: https://elixir.bootlin.com/linux/latest/source/net/core/page_pool.c#L881 https://elixir.bootlin.com/linux/latest/source/include/linux/skbuff.h#L3383 It may also convert net stack to use the folio easier. V1: repost with all the ack/review tags included. RFC: remove a local variable as pointed out by Simon. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13net: remove __skb_frag_set_page()Yunsheng Lin
The remaining users calling __skb_frag_set_page() with page being NULL seems to be doing defensive programming, as shinfo->nr_frags is already decremented, so remove them. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13net: introduce and use skb_frag_fill_page_desc()Yunsheng Lin
Most users use __skb_frag_set_page()/skb_frag_off_set()/ skb_frag_size_set() to fill the page desc for a skb frag. Introduce skb_frag_fill_page_desc() to do that. net/bpf/test_run.c does not call skb_frag_off_set() to set the offset, "copy_from_user(page_address(page), ...)" and 'shinfo' being part of the 'data' kzalloced in bpf_test_init() suggest that it is assuming offset to be initialized as zero, so call skb_frag_fill_page_desc() with offset being zero for this case. Also, skb_frag_set_page() is not used anymore, so remove it. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13selftests: net: vxlan: Add tests for vxlan nolocalbypass option.Vladimir Nikishkin
Add test to make sure that the localbypass option is on by default. Add test to change vxlan localbypass to nolocalbypass and check that packets are delivered to userspace. Signed-off-by: Vladimir Nikishkin <vladimir@nikishkin.pw> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13net: vxlan: Add nolocalbypass option to vxlan.Vladimir Nikishkin
If a packet needs to be encapsulated towards a local destination IP, the packet will undergo a "local bypass" and be injected into the Rx path as if it was received by the target VXLAN device without undergoing encapsulation. If such a device does not exist, the packet will be dropped. There are scenarios where we do not want to perform such a bypass, but instead want the packet to be encapsulated and locally received by a user space program for post-processing. To that end, add a new VXLAN device attribute that controls whether a "local bypass" is performed or not. Default to performing a bypass to maintain existing behavior. Signed-off-by: Vladimir Nikishkin <vladimir@nikishkin.pw> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13Merge branch 'broadcom-phy-wol'David S. Miller
Florian Fainelli says: ==================== Support for Wake-on-LAN for Broadcom PHYs This patch series adds support for Wake-on-LAN to the Broadcom PHY driver. Specifically the BCM54210E/B50212E are capable of supporting Wake-on-LAN using an external pin typically wired up to a system's GPIO. These PHY operate a programmable Ethernet MAC destination address comparator which will fire up an interrupt whenever a match is received. Because of that, it was necessary to introduce patch #1 which allows the PHY driver's ->suspend() routine to be called unconditionally. This is necessary in our case because we need a hook point into the device suspend/resume flow to enable the wake-up interrupt as late as possible. Patch #2 adds support for the Broadcom PHY library and driver for Wake-on-LAN proper with the WAKE_UCAST, WAKE_MCAST, WAKE_BCAST, WAKE_MAGIC and WAKE_MAGICSECURE. Note that WAKE_FILTER is supportable, however this will require further discussions and be submitted as a RFC series later on. Patch #3 updates the GENET driver to defer to the PHY for Wake-on-LAN if the PHY supports it, thus allowing the MAC to be powered down to conserve power. Changes in v3: - collected Reviewed-by tags - explicitly use return 0 in bcm54xx_phy_probe() (Paolo) Changes in v2: - introduce PHY_ALWAYS_CALL_SUSPEND and only have the Broadcom PHY driver set this flag to minimize changes to the suspend flow to only drivers that need it - corrected possibly uninitialized variable in bcm54xx_set_wakeup_irq (Simon) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13net: bcmgenet: Add support for PHY-based Wake-on-LANFlorian Fainelli
If available, interrogate the PHY to find out whether we can use it for Wake-on-LAN. This can be a more power efficient way of implementing that feature, especially when the MAC is powered off in low power states. Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13net: phy: broadcom: Add support for Wake-on-LANFlorian Fainelli
Add support for WAKE_UCAST, WAKE_MCAST, WAKE_BCAST, WAKE_MAGIC and WAKE_MAGICSECURE. This is only supported with the BCM54210E and compatible Ethernet PHYs. Using the in-band interrupt or an out of band GPIO interrupts are supported. Broadcom PHYs will generate a Wake-on-LAN level low interrupt on LED4 as soon as one of the supported patterns is being matched. That includes generating such an interrupt even if the PHY is operated during normal modes. If WAKE_UCAST is selected, this could lead to the LED4 interrupt firing up for every packet being received which is absolutely undesirable from a performance point of view. Because the Wake-on-LAN configuration can be set long before the system is actually put to sleep, we cannot have an interrupt service routine to clear on read the interrupt status register and ensure that new packet matches will be detected. It is desirable to enable the Wake-on-LAN interrupt as late as possible during the system suspend process such that we limit the number of interrupts to be handled by the system, but also conversely feed into the Linux's system suspend way of dealing with interrupts in and around the points of no return. Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13net: phy: Allow drivers to always call into ->suspend()Florian Fainelli
A few PHY drivers are currently attempting to not suspend the PHY when Wake-on-LAN is enabled, however that code is not currently executing at all due to an early check in phy_suspend(). This prevents PHY drivers from making an appropriate decisions and put the hardware into a low power state if desired. In order to allow the PHY drivers to opt into getting their ->suspend routine to be called, add a PHY_ALWAYS_CALL_SUSPEND bit which can be set. A boolean that tracks whether the PHY or the attached MAC has Wake-on-LAN enabled is also provided for convenience. If phydev::wol_enabled then the PHY shall not prevent its own Wake-on-LAN detection logic from working and shall not prevent the Ethernet MAC from receiving packets for matching. Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12libbpf: fix offsetof() and container_of() to work with CO-REAndrii Nakryiko
It seems like __builtin_offset() doesn't preserve CO-RE field relocations properly. So if offsetof() macro is defined through __builtin_offset(), CO-RE-enabled BPF code using container_of() will be subtly and silently broken. To avoid this problem, redefine offsetof() and container_of() in the form that works with CO-RE relocations more reliably. Fixes: 5fbc220862fc ("tools/libpf: Add offsetof/container_of macro in bpf_helpers.h") Reported-by: Lennart Poettering <lennart@poettering.net> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20230509065502.2306180-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-12bpf: Address KCSAN report on bpf_lru_listMartin KaFai Lau
KCSAN reported a data-race when accessing node->ref. Although node->ref does not have to be accurate, take this chance to use a more common READ_ONCE() and WRITE_ONCE() pattern instead of data_race(). There is an existing bpf_lru_node_is_ref() and bpf_lru_node_set_ref(). This patch also adds bpf_lru_node_clear_ref() to do the WRITE_ONCE(node->ref, 0) also. ================================================================== BUG: KCSAN: data-race in __bpf_lru_list_rotate / __htab_lru_percpu_map_update_elem write to 0xffff888137038deb of 1 bytes by task 11240 on cpu 1: __bpf_lru_node_move kernel/bpf/bpf_lru_list.c:113 [inline] __bpf_lru_list_rotate_active kernel/bpf/bpf_lru_list.c:149 [inline] __bpf_lru_list_rotate+0x1bf/0x750 kernel/bpf/bpf_lru_list.c:240 bpf_lru_list_pop_free_to_local kernel/bpf/bpf_lru_list.c:329 [inline] bpf_common_lru_pop_free kernel/bpf/bpf_lru_list.c:447 [inline] bpf_lru_pop_free+0x638/0xe20 kernel/bpf/bpf_lru_list.c:499 prealloc_lru_pop kernel/bpf/hashtab.c:290 [inline] __htab_lru_percpu_map_update_elem+0xe7/0x820 kernel/bpf/hashtab.c:1316 bpf_percpu_hash_update+0x5e/0x90 kernel/bpf/hashtab.c:2313 bpf_map_update_value+0x2a9/0x370 kernel/bpf/syscall.c:200 generic_map_update_batch+0x3ae/0x4f0 kernel/bpf/syscall.c:1687 bpf_map_do_batch+0x2d9/0x3d0 kernel/bpf/syscall.c:4534 __sys_bpf+0x338/0x810 __do_sys_bpf kernel/bpf/syscall.c:5096 [inline] __se_sys_bpf kernel/bpf/syscall.c:5094 [inline] __x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5094 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd read to 0xffff888137038deb of 1 bytes by task 11241 on cpu 0: bpf_lru_node_set_ref kernel/bpf/bpf_lru_list.h:70 [inline] __htab_lru_percpu_map_update_elem+0x2f1/0x820 kernel/bpf/hashtab.c:1332 bpf_percpu_hash_update+0x5e/0x90 kernel/bpf/hashtab.c:2313 bpf_map_update_value+0x2a9/0x370 kernel/bpf/syscall.c:200 generic_map_update_batch+0x3ae/0x4f0 kernel/bpf/syscall.c:1687 bpf_map_do_batch+0x2d9/0x3d0 kernel/bpf/syscall.c:4534 __sys_bpf+0x338/0x810 __do_sys_bpf kernel/bpf/syscall.c:5096 [inline] __se_sys_bpf kernel/bpf/syscall.c:5094 [inline] __x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5094 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0x01 -> 0x00 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 11241 Comm: syz-executor.3 Not tainted 6.3.0-rc7-syzkaller-00136-g6a66fdd29ea1 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/30/2023 ================================================================== Reported-by: syzbot+ebe648a84e8784763f82@syzkaller.appspotmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20230511043748.1384166-1-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-12bpf: Add --skip_encoding_btf_inconsistent_proto, --btf_gen_optimized to ↵Alan Maguire
pahole flags for v1.25 v1.25 of pahole supports filtering out functions with multiple inconsistent function prototypes or optimized-out parameters from the BTF representation. These present problems because there is no additional info in BTF saying which inconsistent prototype matches which function instance to help guide attachment, and functions with optimized-out parameters can lead to incorrect assumptions about register contents. So for now, filter out such functions while adding BTF representations for functions that have "."-suffixes (foo.isra.0) but not optimized-out parameters. This patch assumes that below linked changes land in pahole for v1.25. Issues with pahole filtering being too aggressive in removing functions appear to be resolved now, but CI and further testing will confirm. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20230510130241.1696561-1-alan.maguire@oracle.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-12Merge branch 'sfc-decap'David S. Miller
Edward Cree says: ==================== sfc: more flexible encap matches on TC decap rules This series extends the TC offload support on EF100 to support optionally matching on the IP ToS and UDP source port of the outer header in rules performing tunnel decapsulation. Both of these fields allow masked matches if the underlying hardware supports it (current EF100 hardware supports masking on ToS, but only exact-match on source port). Given that the source port is typically populated from a hash of inner header entropy, it's not clear whether filtering on it is useful, but since we can support it we may as well expose the capability. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12sfc: support TC decap rules matching on enc_src_portEdward Cree
Allow efx_tc_encap_match entries to include a udp_sport and a udp_sport_mask. As with enc_ip_tos, use pseudos to enforce that all encap matches within a given <src_ip,dst_ip,udp_dport> tuple have the same udp_sport_mask. Note that since we use a single layer of pseudos for both fields, two matches that differ in (say) udp_sport value aren't permitted to have different ip_tos_mask, even though this would technically be safe. Current userland TC does not support setting enc_src_port; this patch was tested with an iproute2 patched to support it. Signed-off-by: Edward Cree <ecree.xilinx@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12sfc: support TC decap rules matching on enc_ip_tosEdward Cree
Allow efx_tc_encap_match entries to include an ip_tos and ip_tos_mask. To avoid partially-overlapping Outer Rules (which can lead to undefined behaviour in the hardware), store extra "pseudo" entries in our encap_match hashtable, which are used to enforce that all Outer Rule entries within a given <src_ip,dst_ip,udp_dport> tuple (or IPv6 equivalent) have the same ip_tos_mask. The "direct" encap_match entry takes a reference on the "pseudo", allowing it to be destroyed when all "direct" entries using it are removed. efx_tc_em_pseudo_type is an enum rather than just a bool because in future an additional pseudo-type will be added to support Conntrack offload. Signed-off-by: Edward Cree <ecree.xilinx@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12sfc: populate enc_ip_tos matches in MAE outer rulesEdward Cree
Currently tc.c will block them before they get here, but following patch will change that. Use the extack message from efx_mae_check_encap_match_caps() instead of writing a new one, since there's now more being fed in than just an IP version. Signed-off-by: Edward Cree <ecree.xilinx@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12sfc: release encap match in efx_tc_flow_free()Edward Cree
When force-freeing leftover entries from our match_action_ht, call efx_tc_delete_rule(), which releases all the rule's resources, rather than open-coding it. The open-coded version was missing a call to release the rule's encap match (if any). It probably doesn't matter as everything's being torn down anyway, but it's cleaner this way and prevents further error messages potentially being logged by efx_tc_encap_match_free() later on. Move efx_tc_flow_free() further down the file to avoid introducing a forward declaration of efx_tc_delete_rule(). Fixes: 17654d84b47c ("sfc: add offloading of 'foreign' TC (decap) rules") Signed-off-by: Edward Cree <ecree.xilinx@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12net: liquidio: lio_main: Remove unnecessary (void*) conversionswuych
Pointer variables of void * type do not require type cast. Signed-off-by: wuych <yunchuan@nfschina.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12sctp: add bpf_bypass_getsockopt proto callbackAlexander Mikhalitsyn
Implement ->bpf_bypass_getsockopt proto callback and filter out SCTP_SOCKOPT_PEELOFF, SCTP_SOCKOPT_PEELOFF_FLAGS and SCTP_SOCKOPT_CONNECTX3 socket options from running eBPF hook on them. SCTP_SOCKOPT_PEELOFF and SCTP_SOCKOPT_PEELOFF_FLAGS options do fd_install(), and if BPF_CGROUP_RUN_PROG_GETSOCKOPT hook returns an error after success of the original handler sctp_getsockopt(...), userspace will receive an error from getsockopt syscall and will be not aware that fd was successfully installed into a fdtable. As pointed by Marcelo Ricardo Leitner it seems reasonable to skip bpf getsockopt hook for SCTP_SOCKOPT_CONNECTX3 sockopt too. Because internaly, it triggers connect() and if error is masked then userspace will be confused. This patch was born as a result of discussion around a new SCM_PIDFD interface: https://lore.kernel.org/all/20230413133355.350571-3-aleksandr.mikhalitsyn@canonical.com/ Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Christian Brauner <brauner@kernel.org> Cc: Stanislav Fomichev <sdf@google.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Cc: Xin Long <lucien.xin@gmail.com> Cc: linux-sctp@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: netdev@vger.kernel.org Suggested-by: Stanislav Fomichev <sdf@google.com> Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Acked-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12Merge branch 'selftests-fcnal'David S. Miller
Guillaume Nault says: ==================== selftests: fcnal: Test SO_DONTROUTE socket option. The objective is to cover kernel paths that use the RTO_ONLINK flag in .flowi4_tos. This way we'll be able to safely remove this flag in the future by properly setting .flowi4_scope instead. With these selftests in place, we can make sure this won't introduce regressions. For more context, the final objective is to convert .flowi4_tos to dscp_t, to ensure that ECN bits don't influence route and fib-rule lookups (see commit a410a0cf9885 ("ipv6: Define dscp_t and stop taking ECN bits into account in fib6-rules")). These selftests only cover IPv4, as SO_DONTROUTE has no effect on IPv6 sockets. v2: - Use two different nettest options for setting SO_DONTROUTE either on the server or on the client socket. - Use the above feature to run a single 'nettest -B' instance per test (instead of having two nettest processes for server and client). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12selftests: fcnal: Test SO_DONTROUTE on raw and ping sockets.Guillaume Nault
Use ping -r to test the kernel behaviour with raw and ping sockets having the SO_DONTROUTE option. Since ipv4_ping_novrf() is called with different values of net.ipv4.ping_group_range, then it tests both raw and ping sockets (ping uses ping sockets if its user ID belongs to ping_group_range and raw sockets otherwise). With both socket types, sending packets to a neighbour (on link) host, should work. When the host is behind a router, sending should fail. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12selftests: fcnal: Test SO_DONTROUTE on UDP sockets.Guillaume Nault
Use nettest --client-dontroute to test the kernel behaviour with UDP sockets having the SO_DONTROUTE option. Sending packets to a neighbour (on link) host, should work. When the host is behind a router, sending should fail. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12selftests: fcnal: Test SO_DONTROUTE on TCP sockets.Guillaume Nault
Use nettest --{client,server}-dontroute to test the kernel behaviour with TCP sockets having the SO_DONTROUTE option. Sending packets to a neighbour (on link) host, should work. When the host is behind a router, sending should fail. Client and server sockets are tested independently, so that we can cover different TCP kernel paths. SO_DONTROUTE also affects the syncookies path. So ipv4_tcp_dontroute() is made to work with or without syncookies, to cover both paths. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12selftests: Add SO_DONTROUTE option to nettest.Guillaume Nault
Add --client-dontroute and --server-dontroute options to nettest. They allow to set the SO_DONTROUTE option to the client and server sockets respectively. This will be used by the following patches to test the SO_DONTROUTE kernel behaviour with TCP and UDP. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12bonding: Always assign be16 value to vlan_protoSimon Horman
The type of the vlan_proto field is __be16. And most users of the field use it as such. In the case of setting or testing the field for the special VLAN_N_VID value, host byte order is used. Which seems incorrect. It also seems somewhat odd to store a VLAN ID value in a field that is otherwise used to store Ether types. Address this issue by defining BOND_VLAN_PROTO_NONE, a big endian value. 0xffff was chosen somewhat arbitrarily. What is important is that it doesn't overlap with any valid VLAN Ether types. I don't believe the problems described above are a bug because VLAN_N_VID in both little-endian and big-endian byte order does not conflict with any supported VLAN Ether types in big-endian byte order. Reported by sparse as: .../bond_main.c:2857:26: warning: restricted __be16 degrades to integer .../bond_main.c:2863:20: warning: restricted __be16 degrades to integer .../bond_main.c:2939:40: warning: incorrect type in assignment (different base types) .../bond_main.c:2939:40: expected restricted __be16 [usertype] vlan_proto .../bond_main.c:2939:40: got int No functional changes intended. Compile tested only. Signed-off-by: Simon Horman <horms@kernel.org> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12Merge branch 'net-handshake-fixes'David S. Miller
Chuck Lever says: ==================== Bug fixes for net/handshake Please consider these for merge via net-next. Paolo observed that there is a possible leak of sock->file. I haven't looked into that yet, but it seems to be separate from the fixes in this series, so no need to hold these up. Changes since v2: - Address Paolo comment regarding handshake_dup() Changes since v1: - Rework "Fix handshake_dup() ref counting" - Unpin sock->file when a handshake is cancelled ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12net/handshake: Enable the SNI extension to work properlyChuck Lever
Enable the upper layer protocol to specify the SNI peername. This avoids the need for tlshd to use a DNS lookup, which can return a hostname that doesn't match the incoming certificate's SubjectName. Fixes: 2fd5532044a8 ("net/handshake: Add a kernel API for requesting a TLSv1.3 handshake") Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12net/handshake: Unpin sock->file if a handshake is cancelledChuck Lever
If user space never calls DONE, sock->file's reference count remains elevated. Enable sock->file to be freed eventually in this case. Reported-by: Jakub Kacinski <kuba@kernel.org> Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12net/handshake: handshake_genl_notify() shouldn't ignore @flagsChuck Lever
Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests") Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12net/handshake: Fix uninitialized local variableChuck Lever
trace_handshake_cmd_done_err() simply records the pointer in @req, so initializing it to NULL is sufficient and safe. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests") Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12net/handshake: Fix handshake_dup() ref countingChuck Lever
If get_unused_fd_flags() fails, we ended up calling fput(sock->file) twice. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Suggested-by: Paolo Abeni <pabeni@redhat.com> Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12net/handshake: Remove unneeded check from handshake_dup()Chuck Lever
handshake_req_submit() now verifies that the socket has a file. Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests") Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12ipvlan: Remove NULL check before dev_{put, hold}Yang Li
The call netdev_{put, hold} of dev_{put, hold} will check NULL, so there is no need to check before using dev_{put, hold}, remove it to silence the warning: ./drivers/net/ipvlan/ipvlan_core.c:559:3-11: WARNING: NULL check before dev_{put, hold} functions is not needed. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=4930 Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12octeontx2-pf: mcs: Offload extended packet number(XPN) featureSubbaraya Sundeep
The macsec hardware block supports XPN cipher suites also. Hence added changes to offload XPN feature. Changes include configuring SecY policy to XPN cipher suite, Salt and SSCI values. 64 bit packet number is passed instead of 32 bit packet number. Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>