linux/linux-stable.git - Linux kernel stable tree

Age	Commit message	Author
2024-05-02	bnxt_en: Don't support offline self test when RoCE driver is loaded	Kalesh AP
	Offline self test is a very disruptive operation for RoCE and requires all active QPs to be destroyed. With a large number of QPs, it can take a long time to destroy all the QPs and can time out. Do not allow ethtool offline self test if the RoCE driver is registered on the device. Reviewed-by: Selvin Thyparampil Xavier <selvin.xavier@broadcom.com> Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240501003056.100607-3-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02	bnxt_en: share NQ ring sw_stats memory with subrings	Edwin Peer
	On P5_PLUS chips and later, the NQ rings have subrings for RX and TX completions respectively. These subrings are passed to the poll function instead of the base NQ, but each ring carries its own copy of the software ring statistics. For stats to be conveniently accessible in __bnxt_poll_work(), the statistics memory should either be shared between the NQ and its subrings or the subrings need to be included in the ethtool stats aggregation logic. This patch opts for the former, because it's more efficient and less confusing having the software statistics for a ring exist in a single place. Before this patch, the counter will not be displayed if the "wrong" cpr->sw_stats was used to increment a counter. Link: https://lore.kernel.org/netdev/CACKFLikEhVAJA+osD7UjQNotdGte+fth7zOy7yDdLkTyFk9Pyw@mail.gmail.com/ Signed-off-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240501003056.100607-2-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
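The idea of sharing one statistics block between an NQ and its subrings can be sketched in user-space C. This is a minimal model, not the bnxt_en code: `struct sw_stats_model`, `struct cp_ring_model`, `share_stats()`, `poll_count_discard()` and the `rx_oom_discards` counter are all hypothetical names chosen for illustration.

```c
/* Hypothetical, reduced model of a completion ring and its software stats. */
struct sw_stats_model {
	unsigned long rx_oom_discards;
};

struct cp_ring_model {
	/* A pointer rather than an embedded struct, so rings can share one block. */
	struct sw_stats_model *sw_stats;
};

/* Point each subring at the parent NQ's statistics memory. */
static void share_stats(struct cp_ring_model *nq,
			struct cp_ring_model *subrings, int n)
{
	for (int i = 0; i < n; i++)
		subrings[i].sw_stats = nq->sw_stats;
}

/* A poll routine that only sees a subring still updates the shared counters. */
static void poll_count_discard(struct cp_ring_model *cpr)
{
	cpr->sw_stats->rx_oom_discards++;
}
```

With this layout, an increment made through either subring is visible through the NQ, so the aggregation code only has to read one place.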
2024-05-02	Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue	Jakub Kicinski
	Tony Nguyen says:
	====================
	i40e: cleanups & refactors
	Ivan Vecera says:
	This series does the following:
	Patch 1 - Removes write-only flags field from i40e_veb structure and from i40e_veb_setup() parameters
	Patch 2 - Refactors parameter of i40e_notify_client_of_l2_param_changes() and i40e_notify_client_of_netdev_close()
	Patch 3 - Refactors parameter of i40e_detect_recover_hung()
	Patch 4 - Adds helper i40e_pf_get_main_vsi() to get main VSI and uses it in existing code
	Patch 5 - Consolidates checks whether given VSI is the main one
	Patch 6 - Adds helper i40e_pf_get_main_veb() to get main VEB and uses it in existing code
	Patch 7 - Adds helper i40e_vsi_reconfig_tc() to reconfigure TC for particular VSI and uses it to replace existing open-coded pieces
	* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
	  i40e: Add and use helper to reconfigure TC for given VSI
	  i40e: Add helper to access main VEB
	  i40e: Consolidate checks whether given VSI is main
	  i40e: Add helper to access main VSI
	  i40e: Refactor argument of i40e_detect_recover_hung()
	  i40e: Refactor argument of several client notification functions
	  i40e: Remove flags field from i40e_veb
	====================
	Link: https://lore.kernel.org/r/20240430180639.1938515-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02	net/sched: unregister lockdep keys in qdisc_create/qdisc_alloc error path	Davide Caratti
	Naresh and Eric report several errors (corrupted elements in the dynamic key hash list), when running tdc.py or syzbot. The error path of qdisc_alloc() and qdisc_create() frees the qdisc memory, but it forgets to unregister the lockdep key, thus causing use-after-free like the following one:
	==================================================================
	BUG: KASAN: slab-use-after-free in lockdep_register_key+0x5f2/0x700
	Read of size 8 at addr ffff88811236f2a8 by task ip/7925
	CPU: 26 PID: 7925 Comm: ip Kdump: loaded Not tainted 6.9.0-rc2+ #648
	Hardware name: Supermicro SYS-6027R-72RF/X9DRH-7TF/7F/iTF/iF, BIOS 3.0 07/26/2013
	Call Trace:
	 <TASK>
	 dump_stack_lvl+0x7c/0xc0
	 print_report+0xc9/0x610
	 kasan_report+0x89/0xc0
	 lockdep_register_key+0x5f2/0x700
	 qdisc_alloc+0x21d/0xb60
	 qdisc_create_dflt+0x63/0x3c0
	 attach_one_default_qdisc.constprop.37+0x8e/0x170
	 dev_activate+0x4bd/0xc30
	 __dev_open+0x275/0x380
	 __dev_change_flags+0x3f1/0x570
	 dev_change_flags+0x7c/0x160
	 do_setlink+0x1ea1/0x34b0
	 __rtnl_newlink+0x8c9/0x1510
	 rtnl_newlink+0x61/0x90
	 rtnetlink_rcv_msg+0x2f0/0xbc0
	 netlink_rcv_skb+0x120/0x380
	 netlink_unicast+0x420/0x630
	 netlink_sendmsg+0x732/0xbc0
	 __sock_sendmsg+0x1ea/0x280
	 ____sys_sendmsg+0x5a9/0x990
	 ___sys_sendmsg+0xf1/0x180
	 __sys_sendmsg+0xd3/0x180
	 do_syscall_64+0x96/0x180
	 entry_SYSCALL_64_after_hwframe+0x71/0x79
	RIP: 0033:0x7f9503f4fa07
	Code: 0a 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
	RSP: 002b:00007fff6c729068 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
	RAX: ffffffffffffffda RBX: 000000006630c681 RCX: 00007f9503f4fa07
	RDX: 0000000000000000 RSI: 00007fff6c7290d0 RDI: 0000000000000003
	RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000078
	R10: 000000000000009b R11: 0000000000000246 R12: 0000000000000001
	R13: 00007fff6c729180 R14: 0000000000000000 R15: 000055bf67dd9040
	 </TASK>
	Allocated by task 7745:
	 kasan_save_stack+0x1c/0x40
	 kasan_save_track+0x10/0x30
	 __kasan_kmalloc+0x7b/0x90
	 __kmalloc_node+0x1ff/0x460
	 qdisc_alloc+0xae/0xb60
	 qdisc_create+0xdd/0xfb0
	 tc_modify_qdisc+0x37e/0x1960
	 rtnetlink_rcv_msg+0x2f0/0xbc0
	 netlink_rcv_skb+0x120/0x380
	 netlink_unicast+0x420/0x630
	 netlink_sendmsg+0x732/0xbc0
	 __sock_sendmsg+0x1ea/0x280
	 ____sys_sendmsg+0x5a9/0x990
	 ___sys_sendmsg+0xf1/0x180
	 __sys_sendmsg+0xd3/0x180
	 do_syscall_64+0x96/0x180
	 entry_SYSCALL_64_after_hwframe+0x71/0x79
	Freed by task 7745:
	 kasan_save_stack+0x1c/0x40
	 kasan_save_track+0x10/0x30
	 kasan_save_free_info+0x36/0x60
	 __kasan_slab_free+0xfe/0x180
	 kfree+0x113/0x380
	 qdisc_create+0xafb/0xfb0
	 tc_modify_qdisc+0x37e/0x1960
	 rtnetlink_rcv_msg+0x2f0/0xbc0
	 netlink_rcv_skb+0x120/0x380
	 netlink_unicast+0x420/0x630
	 netlink_sendmsg+0x732/0xbc0
	 __sock_sendmsg+0x1ea/0x280
	 ____sys_sendmsg+0x5a9/0x990
	 ___sys_sendmsg+0xf1/0x180
	 __sys_sendmsg+0xd3/0x180
	 do_syscall_64+0x96/0x180
	 entry_SYSCALL_64_after_hwframe+0x71/0x79
	Fix this by ensuring that lockdep_unregister_key() is called before the qdisc struct is freed, also in the error path of qdisc_create() and qdisc_alloc().
	Fixes: af0cb3fa3f9e ("net/sched: fix false lockdep warning on qdisc root lock") Reported-by: Linux Kernel Functional Testing <lkft@linaro.org> Closes: https://lore.kernel.org/netdev/20240429221706.1492418-1-naresh.kamboju@linaro.org/ Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Tested-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/2aa1ca0c0a3aa0acc15925c666c777a4b5de553c.1714496886.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
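The shape of the fix, unregistering a key on every path that frees the object, can be modelled in user-space C. This is a sketch under stated assumptions, not the net/sched code: `registered_keys` stands in for lockdep's table of dynamic keys, and `struct qdisc_model`, `register_key()`, `unregister_key()`, `qdisc_model_create()` and `qdisc_model_destroy()` are hypothetical names.

```c
#include <stdlib.h>

/* Hypothetical model: a counter stands in for lockdep's registered-key table. */
static int registered_keys;

struct qdisc_model {
	int key_registered;
};

static void register_key(struct qdisc_model *q)
{
	q->key_registered = 1;
	registered_keys++;
}

static void unregister_key(struct qdisc_model *q)
{
	if (q->key_registered) {
		q->key_registered = 0;
		registered_keys--;
	}
}

/* Returns NULL on failure; fail_init simulates a later setup step failing. */
static struct qdisc_model *qdisc_model_create(int fail_init)
{
	struct qdisc_model *q = calloc(1, sizeof(*q));

	if (!q)
		return NULL;
	register_key(q);
	if (fail_init)
		goto err_unregister;
	return q;

err_unregister:
	/* Must precede free(); otherwise a stale key would stay registered. */
	unregister_key(q);
	free(q);
	return NULL;
}

static void qdisc_model_destroy(struct qdisc_model *q)
{
	unregister_key(q);
	free(q);
}
```

In this model, a failed create leaves `registered_keys` unchanged, which is the invariant the commit restores for the real error paths.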
2024-05-01	Merge branch 'net-dsa-adjust_link-removal'	Jakub Kicinski
	Florian Fainelli says: ==================== net: dsa: adjust_link removal Now that the last in-tree driver (b53) has been converted to PHYLINK, we can get rid of all of the code that catered to working with drivers implementing only PHYLIB's adjust_link callback. ==================== Link: https://lore.kernel.org/r/20240430164816.2400606-1-florian.fainelli@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01	net: dsa: Remove adjust_link paths	Florian Fainelli
	Now that we no longer have any drivers using PHYLIB's adjust_link callback, remove all paths that made use of adjust_link as well as the associated functions. Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/20240430164816.2400606-3-florian.fainelli@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01	net: dsa: Remove fixed_link_update member	Florian Fainelli
	We have not had a switch driver use a fixed_link_update callback since commit 58d56fcc3964f9be0a9ca42fd126bcd9dc7afc90 ("net: dsa: bcm_sf2: Get rid of PHYLIB functions"), so remove this callback. Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/20240430164816.2400606-2-florian.fainelli@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01	net: ti: icssg_prueth: Add SW TX / RX Coalescing based on hrtimers	MD Danish Anwar
	Add SW IRQ coalescing based on hrtimers for RX and TX data path for ICSSG driver, which can be enabled by ethtool commands:
	- RX coalescing
	  ethtool -C eth1 rx-usecs 50
	- TX coalescing can be enabled per TX queue
	  - by default enables coalescing for TX0
	    ethtool -C eth1 tx-usecs 50
	  - configure TX0
	    ethtool -Q eth0 queue_mask 1 --coalesce tx-usecs 100
	  - configure TX1
	    ethtool -Q eth0 queue_mask 2 --coalesce tx-usecs 100
	  - configure TX0 and TX1
	    ethtool -Q eth0 queue_mask 3 --coalesce tx-usecs 100 --coalesce tx-usecs 100
	Minimum value for both rx-usecs and tx-usecs is 20us. Compared to gro_flush_timeout and napi_defer_hard_irqs this patch allows to enable IRQ coalescing for RX path separately.
	Benchmarking numbers:
	 ===============================================================
	| Method                 | Tput_TX  | CPU_TX | Tput_RX  | CPU_RX |
	| ==============================================================
	| Default Driver           943 Mbps   31%      517 Mbps   38%    |
	| IRQ Coalescing (Patch)   943 Mbps   28%      518 Mbps   25%    |
	 ===============================================================
	Signed-off-by: MD Danish Anwar <danishanwar@ti.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240430120634.1558998-1-danishanwar@ti.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01	Merge branch 'arp-random-clean-up-and-rcu-conversion-for-ioctl-siocgarp'	Jakub Kicinski
	Kuniyuki Iwashima says: ==================== arp: Random clean up and RCU conversion for ioctl(SIOCGARP). arp_ioctl() holds rtnl_lock() regardless of cmd (SIOCDARP, SIOCSARP, and SIOCGARP) to get net_device by __dev_get_by_name() and copy dev->name safely. In the SIOCGARP path, arp_req_get() calls neigh_lookup(), which looks up a neighbour entry under RCU. This series cleans up ioctl() code a bit and extends the RCU section not to take rtnl_lock() and instead use dev_get_by_name_rcu() and netdev_copy_name() for SIOCGARP. v2: https://lore.kernel.org/netdev/20240425170002.68160-1-kuniyu@amazon.com/ v1: https://lore.kernel.org/netdev/20240422194755.4221-1-kuniyu@amazon.com/ ==================== Link: https://lore.kernel.org/r/20240430015813.71143-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01	arp: Convert ioctl(SIOCGARP) to RCU.	Kuniyuki Iwashima
	ioctl(SIOCGARP) holds rtnl_lock() to get netdev by __dev_get_by_name() and copy dev->name safely and calls neigh_lookup() later, which looks up a neighbour entry under RCU. Let's replace __dev_get_by_name() with dev_get_by_name_rcu() and strscpy() with netdev_copy_name() to avoid locking rtnl_lock(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240430015813.71143-8-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01	net: Protect dev->name by seqlock.	Kuniyuki Iwashima
	We will convert ioctl(SIOCGARP) to RCU, and then we need to copy dev->name which is currently protected by rtnl_lock(). This patch does the following:
	1) Add seqlock netdev_rename_lock to protect dev->name
	2) Add netdev_copy_name() that copies dev->name to buffer under netdev_rename_lock
	3) Use netdev_copy_name() in netdev_get_name() and drop devnet_rename_sem
	Suggested-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/netdev/CANn89iJEWs7AYSJqGCUABeVqOCTkErponfZdT5kV-iD=-SajnQ@mail.gmail.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240430015813.71143-7-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01	arp: Get dev after calling arp_req_(delete|set|get)().	Kuniyuki Iwashima
	arp_ioctl() holds rtnl_lock() first regardless of cmd (SIOCDARP, SIOCSARP, and SIOCGARP) to get net_device by __dev_get_by_name() and copy dev->name safely. In the SIOCGARP path, arp_req_get() calls neigh_lookup(), which looks up a neighbour entry under RCU. We will extend the RCU section not to take rtnl_lock() and instead use dev_get_by_name_rcu() for SIOCGARP. As a preparation, let's move __dev_get_by_name() into another function and call it from arp_req_delete(), arp_req_set(), and arp_req_get(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240430015813.71143-6-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01	arp: Remove a nest in arp_req_get().	Kuniyuki Iwashima
	This is a prep patch to make the following changes tidy. No functional change intended. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240430015813.71143-5-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01	arp: Factorise ip_route_output() call in arp_req_set() and arp_req_delete().	Kuniyuki Iwashima
	When ioctl(SIOCDARP/SIOCSARP) is issued for non-proxy entry (no ATF_COM) without arpreq.arp_dev[] set, arp_req_set() and arp_req_delete() look up dev based on IPv4 address by ip_route_output(). Let's factorise the same code as arp_req_dev(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240430015813.71143-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01	arp: Validate netmask earlier for SIOCDARP and SIOCSARP in arp_ioctl().	Kuniyuki Iwashima
	When ioctl(SIOCDARP/SIOCSARP) is issued with ATF_PUBL, r.arp_netmask must be 0.0.0.0 or 255.255.255.255. Currently, the netmask is validated in arp_req_delete_public() or arp_req_set_public() under rtnl_lock(). We have ATF_NETMASK test in arp_ioctl() before holding rtnl_lock(), so let's move the netmask validation there. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240430015813.71143-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
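The netmask rule stated above is simple enough to express as a standalone check. The following is a user-space sketch, not the code in arp_ioctl(): `MODEL_ATF_PUBL` and `validate_public_netmask()` are hypothetical names, and the flag value is an assumption made for illustration (the real ATF_PUBL is defined in `<linux/if_arp.h>`).

```c
#include <errno.h>
#include <stdint.h>

/* Assumed flag value for illustration; see <linux/if_arp.h> for ATF_PUBL. */
#define MODEL_ATF_PUBL 0x08

/*
 * For a published (proxy) entry, accept only 0.0.0.0 or 255.255.255.255.
 * Both values read the same in host and network byte order, so the sketch
 * needs no byte-order conversion.
 */
static int validate_public_netmask(unsigned int flags, uint32_t mask)
{
	if (!(flags & MODEL_ATF_PUBL))
		return 0;
	if (mask != 0 && mask != UINT32_C(0xFFFFFFFF))
		return -EINVAL;
	return 0;
}
```

Doing this check before any lock is taken means an invalid request is rejected without ever contending on rtnl_lock().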
2024-05-01	arp: Move ATF_COM setting in arp_req_set().	Kuniyuki Iwashima
	In arp_req_set(), if ATF_PERM is set in arpreq.arp_flags, ATF_COM is set automatically. The flag will be used later for neigh_update() only when a neighbour entry is found. Let's set ATF_COM just before calling neigh_update(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240430015813.71143-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01	selftests: netfilter: nft_concat_range.sh: reduce debug kernel run time	Florian Westphal
	Even a 1h timeout isn't enough for nft_concat_range.sh to complete on debug kernels. Reduce test complexity and only match on single entry if KSFT_MACHINE_SLOW is set. To spot 'slow' tests, print the subtest duration (in seconds) in addition to the status. Add new nft_concat_range_perf.sh script, not executed via kselftest, to run the performance (pps match rate) tests. Those need about 25m to complete which seems too much to run this via 'make run_tests'. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20240430145810.23447-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01	ipv6: anycast: use call_rcu_hurry() in aca_put()	Eric Dumazet
	This is a followup of commit b5327b9a300e ("ipv6: use call_rcu_hurry() in fib6_info_release()"). I had another pmtu.sh failure, and found another lazy call_rcu() causing this failure. aca_free_rcu() calls fib6_info_release() which releases device references. We must not delay it too much or risk unregister_netdevice/ref_tracker traces because references to netdev are not released in time. This should speed up device/netns dismantles when CONFIG_RCU_LAZY=y. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-30	netpoll: Fix race condition in netpoll_owner_active	Breno Leitao
	KCSAN detected a race condition in netpoll:
	  BUG: KCSAN: data-race in net_rx_action / netpoll_send_skb
	  write (marked) to 0xffff8881164168b0 of 4 bytes by interrupt on cpu 10:
	  net_rx_action (./include/linux/netpoll.h:90 net/core/dev.c:6712 net/core/dev.c:6822)
	  <snip>
	  read to 0xffff8881164168b0 of 4 bytes by task 1 on cpu 2:
	  netpoll_send_skb (net/core/netpoll.c:319 net/core/netpoll.c:345 net/core/netpoll.c:393)
	  netpoll_send_udp (net/core/netpoll.c:?)
	  <snip>
	  value changed: 0x0000000a -> 0xffffffff
	This happens because netpoll_owner_active() needs to check if the current CPU is the owner of the lock, touching napi->poll_owner non-atomically. The ->poll_owner field contains the current CPU holding the lock. Use an atomic read to check if the poll owner is the current CPU.
	Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20240429100437.3487432-1-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
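The difference between a plain read and an atomic read of the owner field can be illustrated with C11 atomics in user space. This is a sketch only: the kernel has its own primitives for marked accesses, and `struct napi_model`, `owner_is()` and `set_owner()` are hypothetical names that model just the one field.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the NAPI context; only the owner field is modelled. */
struct napi_model {
	_Atomic int poll_owner;   /* CPU id holding the poll lock, -1 if none */
};

/* Reader side: one atomic load, so a concurrent store is not a data race. */
static bool owner_is(struct napi_model *napi, int cpu)
{
	return atomic_load_explicit(&napi->poll_owner,
				    memory_order_relaxed) == cpu;
}

/* Writer side: the poller publishes or clears its CPU id atomically. */
static void set_owner(struct napi_model *napi, int cpu)
{
	atomic_store_explicit(&napi->poll_owner, cpu, memory_order_relaxed);
}
```

Relaxed ordering is enough in this model because the reader only compares the value against its own CPU id and does not rely on any other memory being published along with it.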
2024-04-30	net: loopback: Do not allocate lstats explicitly	Breno Leitao
	With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and convert veth & vrf"), stats allocation could be done on net core instead of in this driver. With this new approach, the driver doesn't have to bother with error handling (allocation failure checking, making sure free happens in the right spot, etc). This is core responsibility now. Remove the allocation in the loopback driver and leverage the network core allocation instead. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://lore.kernel.org/r/20240429085559.2841918-1-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30	Merge branch 'dt-bindings-net-snps-dwmac-remove-tx-sched-sp-property'	Jakub Kicinski
	Flavio Suligoi says: ==================== dt-bindings: net: snps, dwmac: remove tx-sched-sp property Strict priority for the tx scheduler is the default in the Linux driver, so the tx-sched-sp property was removed in commit aed6864035b1 ("net: stmmac: platform: Delete a redundant condition branch"). This property is still in use in the following DTs (and it will be removed in a separate patch series):
	- arch/arm64/boot/dts/freescale/imx8mp-beacon-som.dtsi
	- arch/arm64/boot/dts/freescale/imx8mp-evk.dts
	- arch/arm64/boot/dts/freescale/imx8mp-verdin.dtsi
	- arch/arm64/boot/dts/qcom/sa8540p-ride.dts
	- arch/arm64/boot/dts/qcom/sa8775p-ride.dts
	There is no problem if that property is still used in the DTs above, since, as seen above, it is a default property of the driver. ==================== Link: https://lore.kernel.org/r/20240429092654.31390-1-f.suligoi@asem.it Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30	dt-bindings: net: snps, dwmac: remove tx-sched-sp property	Flavio Suligoi
	Strict priority for the tx scheduler is the default in the Linux driver, so the tx-sched-sp property was removed in commit aed6864035b1 ("net: stmmac: platform: Delete a redundant condition branch"). This property is still in use in the following DTs (and it will be removed in a separate patch series):
	- arch/arm64/boot/dts/freescale/imx8mp-beacon-som.dtsi
	- arch/arm64/boot/dts/freescale/imx8mp-evk.dts
	- arch/arm64/boot/dts/freescale/imx8mp-verdin.dtsi
	- arch/arm64/boot/dts/qcom/sa8540p-ride.dts
	- arch/arm64/boot/dts/qcom/sa8775p-ride.dts
	There is no problem if that property is still used in the DTs above, since, as seen above, it is a default property of the driver.
	Signed-off-by: Flavio Suligoi <f.suligoi@asem.it> Acked-by: Krzysztof Kozlowski <krzk@kernel.org> Acked-by: Adam Ford <aford173@gmail.com> Link: https://lore.kernel.org/r/20240429092654.31390-2-f.suligoi@asem.it Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30	Merge branch 'net-three-additions-to-net_hotdata'	Jakub Kicinski
	Eric Dumazet says: ==================== net: three additions to net_hotdata This series moves three fast path sysctls to net_hotdata. To avoid <net/hotdata.h> inclusion from <net/sock.h>, create <net/proto_memory.h> to hold proto memory definitions. ==================== Link: https://lore.kernel.org/r/20240429134025.1233626-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30	net: move sysctl_mem_pcpu_rsv to net_hotdata	Eric Dumazet
	sysctl_mem_pcpu_rsv is used in TCP fast path, move it to net_hotdata for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240429134025.1233626-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30	net: add <net/proto_memory.h>	Eric Dumazet
	Move some proto memory definitions out of <net/sock.h> Very few files need them, and following patch will include <net/hotdata.h> from <net/proto_memory.h> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240429134025.1233626-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30	tcp: move tcp_out_of_memory() to net/ipv4/tcp.c	Eric Dumazet
	tcp_out_of_memory() has a single caller: tcp_check_oom(). Following patch will also make sk_memory_allocated() no longer visible from <net/sock.h> and <net/tcp.h>. Add const qualifier to sock argument of tcp_out_of_memory() and tcp_check_oom(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240429134025.1233626-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30	net: move sysctl_skb_defer_max to net_hotdata	Eric Dumazet
	sysctl_skb_defer_max is used in TCP fast path, move it to net_hotdata. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240429134025.1233626-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30	net: move sysctl_max_skb_frags to net_hotdata	Eric Dumazet
	sysctl_max_skb_frags is used in TCP and MPTCP fast paths, move it to net_hotdata for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240429134025.1233626-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30	inet: introduce dst_rtable() helper	Eric Dumazet
	I added dst_rt6_info() in commit e8dfd42c17fa ("ipv6: introduce dst_rt6_info() helper"). This patch does a similar change for IPv4. Instead of (struct rtable *)dst casts, we can use:
	  #define dst_rtable(_ptr) \
	              container_of_const(_ptr, struct rtable, dst)
	Patch is smaller than IPv6 one, because IPv4 has skb_rtable() helper.
	Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://lore.kernel.org/r/20240429133009.1227754-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
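How a container_of-style helper replaces a raw cast can be shown in user-space C. This is a minimal sketch: the two structs are simplified stand-ins rather than the kernel's definitions, `route_flags()` is a hypothetical accessor, and the macro uses a plain `container_of` instead of the kernel's `container_of_const`, which additionally preserves const-ness.

```c
#include <stddef.h>

/* Simplified stand-ins for illustration only; the kernel structs differ. */
struct dst_entry {
	int obsolete;
};

struct rtable {
	struct dst_entry dst;   /* embedded member */
	int rt_flags;
};

/* Plain container_of: recover the enclosing struct from a member pointer. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Sketch of the helper's shape; the kernel version uses container_of_const(). */
#define dst_rtable(_ptr) container_of(_ptr, struct rtable, dst)

/* Hypothetical accessor showing the helper in place of a raw cast. */
static int route_flags(struct dst_entry *dst)
{
	return dst_rtable(dst)->rt_flags;
}
```

Unlike a cast, the helper names the member it relies on, so it keeps working if `dst` ever stops being the first field of the enclosing struct.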
2024-04-30	i40e: Add and use helper to reconfigure TC for given VSI	Ivan Vecera
	Add helper i40e_vsi_reconfig_tc(vsi) that configures TC for given VSI using previously stored TC bitmap. Effectively replaces open-coded patterns:
	  enabled_tc = vsi->tc_config.enabled_tc;
	  vsi->tc_config.enabled_tc = 0;
	  i40e_vsi_config_tc(vsi, enabled_tc);
	Reviewed-by: Michal Schmidt <mschmidt@redhat.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
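The open-coded pattern quoted above can be wrapped into a helper along these lines. This is a user-space sketch with mock types, not the i40e driver code: `struct vsi`, `vsi_config_tc()`, `vsi_reconfig_tc()` and the `config_calls` counter are hypothetical, and the early return for an unchanged bitmap is an assumption about why the stored value is cleared first.

```c
/* Hypothetical, stripped-down stand-ins for the driver types. */
struct tc_config {
	unsigned char enabled_tc;   /* bitmap of enabled traffic classes */
};

struct vsi {
	struct tc_config tc_config;
	int config_calls;           /* mock bookkeeping, not in the driver */
};

/* Mock configure step: assumed to do nothing when the bitmap is unchanged. */
static int vsi_config_tc(struct vsi *vsi, unsigned char enabled_tc)
{
	if (vsi->tc_config.enabled_tc == enabled_tc)
		return 0;
	vsi->tc_config.enabled_tc = enabled_tc;
	vsi->config_calls++;
	return 0;
}

/* Reapply the stored bitmap by clearing it first, so the configure step runs. */
static int vsi_reconfig_tc(struct vsi *vsi)
{
	unsigned char enabled_tc = vsi->tc_config.enabled_tc;

	vsi->tc_config.enabled_tc = 0;
	return vsi_config_tc(vsi, enabled_tc);
}
```

Calling `vsi_config_tc()` directly with the current bitmap is a no-op in this model, while `vsi_reconfig_tc()` forces one real reconfiguration and leaves the bitmap as it was.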
2024-04-30	i40e: Add helper to access main VEB	Ivan Vecera
	Add a helper to access main VEB: i40e_pf_get_main_veb(pf) replaces 'pf->veb[pf->lan_veb]' Reviewed-by: Michal Schmidt <mschmidt@redhat.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-30	i40e: Consolidate checks whether given VSI is main	Ivan Vecera
	In the driver code there are 3 types of checks whether given VSI is main or not:
	1. vsi->type ==/!= I40E_VSI_MAIN
	2. vsi ==/!= pf->vsi[pf->lan_vsi]
	3. vsi->seid ==/!= pf->vsi[pf->lan_vsi]->seid
	All of them are equivalent and can be consolidated. Convert cases 2 and 3 to case 1.
	Reviewed-by: Michal Schmidt <mschmidt@redhat.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-30	i40e: Add helper to access main VSI	Ivan Vecera
	Add simple helper i40e_pf_get_main_vsi(pf) to access main VSI that replaces pattern 'pf->vsi[pf->lan_vsi]' Reviewed-by: Michal Schmidt <mschmidt@redhat.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-30	i40e: Refactor argument of i40e_detect_recover_hung()	Ivan Vecera
	Commit 07d44190a389 ("i40e/i40evf: Detect and recover hung queue scenario") changed i40e_detect_recover_hung() argument type from i40e_pf* to i40e_vsi* to be shareable by both i40e and i40evf. Because the i40evf does not exist anymore and the function is exclusively used by i40e, we can revert this change. Reviewed-by: Michal Schmidt <mschmidt@redhat.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-30	i40e: Refactor argument of several client notification functions	Ivan Vecera
	Commit 0ef2d5afb12d ("i40e: KISS the client interface") simplified the client interface so in practice it supports only one client per i40e netdev. But we still have 2 notification functions that take as a parameter a pointer to the VSI of the netdevice associated with the client. After the mentioned commit, the only possible and used VSI is the main (LAN) VSI. So refactor these functions so they are called with PF pointer argument and the associated VSI (LAN) is taken inside them. Reviewed-by: Michal Schmidt <mschmidt@redhat.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-30	i40e: Remove flags field from i40e_veb	Ivan Vecera
	The field is initialized always to zero and it is never read. Remove it. Reviewed-by: Michal Schmidt <mschmidt@redhat.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-30	Merge branch 'selftests-net-page_poll-allocation-error-injection'	Jakub Kicinski
	Jakub Kicinski says: ==================== selftests: net: page_poll allocation error injection Add a test for exercising driver memory allocation failure paths. page pool is a bit tricky to inject errors into at the page allocator level because of the bulk alloc and recycling, so add explicit error injection support "in front" of the caches. Add a test to exercise that using only the standard APIs. This is the first useful test for the new tests with an endpoint. There's no point testing netdevsim here, so this is also the first HW-only test in Python. I'm not super happy with the traffic generation using iperf3, my initial approach was to use mausezahn. But it turned out to be 5x slower in terms of PPS. Hopefully this is good enough for now. v1: https://lore.kernel.org/all/20240426232400.624864-1-kuba@kernel.org/ ==================== Link: https://lore.kernel.org/r/20240429144426.743476-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30	selftests: drv-net-hw: add test for memory allocation failures with page pool	Jakub Kicinski
	Bugs in memory allocation failure paths are quite common. Add a test exercising those paths based on qstat and page pool failure hook. Running on bnxt:
	  # ./drivers/net/hw/pp_alloc_fail.py
	  KTAP version 1
	  1..1
	  # ethtool -G change retval: success
	  ok 1 pp_alloc_fail.test_pp_alloc
	  # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
	I initially wrote this test to validate commit be43b7489a3c ("net/mlx5e: RX, Fix page_pool allocation failure recovery for striding rq") but mlx5 still doesn't have qstat. So I ran it on bnxt, and while bnxt survives I found the problem fixed in commit 730117730709 ("eth: bnxt: fix counting packets discarded due to OOM and netpoll").
	Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20240429144426.743476-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30	selftests: drv-net: support generating iperf3 load	Jakub Kicinski
	While we are not very interested in testing performance it's useful to be able to generate a lot of traffic. iperf is the simplest way of getting relatively high PPS. Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20240429144426.743476-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30	selftests: net: py: avoid all ports < 10k	Jakub Kicinski
	When picking TCP ports to use, avoid all below 10k. This should lower the chance of collision or running afoul of whatever random policies may be on the host. Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20240429144426.743476-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30	selftests: net: py: extract tool logic	Jakub Kicinski
	The main use of the ip() wrapper over cmd() is that it can parse JSON. cmd("ip -j link show") will return stdout as a string, and test has to call json.loads(). With ip("link show", json=True) the return value will be already parsed. More tools (ethtool, bpftool etc.) support the --json switch. To avoid having to wrap all of them individually create a tool() helper. Switch from -j to --json (for ethtool). While at it consume the netns attribute at the ip() level. Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20240429144426.743476-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30	selftests: drv-net-hw: support using Python from net hw tests	Jakub Kicinski
	We recently created a separate directory for HW-only tests. Glue in the Python test library there; Python is a bit annoying when it comes to using library code located "lower" in the directory structure. Reuse the Env class, but let tests require non-nsim setup. Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20240429144426.743476-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30	net: page_pool: support error injection	Jakub Kicinski
	Because of caching / recycling using the general page allocation failures to induce errors in page pool allocation is very hard. Add direct error injection support to page_pool_alloc_pages(). Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20240429144426.743476-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30	selftests: netfilter: avoid test timeouts on debug kernels	Florian Westphal
	Jakub reports that some tests fail on netdev CI when executed in a debug kernel. Increase test timeout to 30m, this should hopefully be enough. Also reduce test duration where possible for "slow" machines. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20240429105736.22677-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30	net: sfp-bus: constify link_modes to sfp_select_interface()	Russell King (Oracle)
	sfp_select_interface() does not modify its link_modes argument, so make this a const pointer. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Daniel Machon <daniel.machon@microchip.com> Link: https://lore.kernel.org/r/E1s15s0-00AHyq-8E@rmk-PC.armlinux.org.uk Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-30	net: sfp: allow use 2500base-X for 2500base-T modules	Russell King (Oracle)
	Allow use of 2500base-X interface mode for PHY modules that support 2500base-T. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Daniel Machon <daniel.machon@microchip.com> Link: https://lore.kernel.org/r/E1s15rv-00AHyk-5S@rmk-PC.armlinux.org.uk Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-30	net: phylink: add debug print for empty posssible_interfaces	Russell King (Oracle)
	Add a debugging print in phylink_validate_phy() when we detect that the PHY has not supplied a possible_interfaces bitmap. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Daniel Machon <daniel.machon@microchip.com> Link: https://lore.kernel.org/r/E1s15rq-00AHye-22@rmk-PC.armlinux.org.uk Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-30	net: dsa: realtek: provide own phylink MAC operations	Russell King (Oracle)
	Convert realtek to provide its own phylink MAC operations, thus avoiding the shim layer in DSA's port.c. We need to provide a stub for the mandatory mac_config() method for rtl8366rb. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Link: https://lore.kernel.org/r/E1s11qJ-00AHi0-Kk@rmk-PC.armlinux.org.uk Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-30	net: dsa: mt7530: do not set MT7530_P5_DIS when PHY muxing is being used	Arınç ÜNAL
	DSA initialises the ds->num_ports amount of ports in dsa_switch_touch_ports(). When the PHY muxing feature is in use, port 5 won't be defined in the device tree. Because of this, the type member of the dsa_port structure for this port will be assigned DSA_PORT_TYPE_UNUSED. The dsa_port_setup() function calls ds->ops->port_disable() when the port type is DSA_PORT_TYPE_UNUSED. The MT7530_P5_DIS bit is unset in mt7530_setup() when PHY muxing is being used. mt7530_port_disable(), which is assigned to ds->ops->port_disable(), is called afterwards. Currently, mt7530_port_disable() sets MT7530_P5_DIS, which breaks network connectivity when PHY muxing is being used. Therefore, do not set MT7530_P5_DIS when PHY muxing is being used. Fixes: 377174c5760c ("net: dsa: mt7530: move MT753X_MTRAP operations for MT7530") Reported-by: Daniel Golle <daniel@makrotopia.org> Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20240428-for-netnext-mt7530-do-not-disable-port5-when-phy-muxing-v2-1-bb7c37d293f8@arinc9.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-30	Merge branch 'net-smc-smc-intra-os-shortcut-with-loopback-ism'	Paolo Abeni
	Wen Gu says: ==================== net/smc: SMC intra-OS shortcut with loopback-ism
This patch set acts as the second part of the new version of [1] (for the first part, see [2]). The changes in this version are listed at the end.
- Background
SMC-D is now used on IBM z with the ISM function to optimize the network interconnect for intra-CPC communications. Inspired by this, we try to make SMC-D available on non-s390 architectures through a software-implemented Emulated-ISM device, the loopback-ism device here, to accelerate inter-process or inter-container communication within the same OS instance.
- Design
This patch set includes 3 parts:
 - Patch #1: some preparatory work for loopback-ism.
 - Patch #2-#7: implement the loopback-ism device and adapt SMC-D for it. loopback-ism now serves only SMC and exposes no userspace interfaces.
 - Patch #8-#11: memory copy optimization for the intra-OS scenario.
The loopback-ism device is designed as an ISMv2 device and is not limited to a specific net namespace, so both ends of an inter-process connection (1/1' in the diagram below) or an inter-container connection (2/2' in the diagram below) can find the same available loopback-ism and choose it during the CLC handshake.
 Container 1 (ns1)                              Container 2 (ns2)
 +-----------------------------------------+    +-------------------------+
 | +-------+      +-------+      +-------+ |    |        +-------+        |
 | | App A |      | App B |      | App C | |    |        | App D |<-+     |
 | +-------+      +---^---+      +-------+ |    |        +-------+  |(2') |
 |     |127.0.0.1 (1')|             |192.168.0.11       192.168.0.12|     |
 |  (1)|   +--------+ | +--------+  |(2)   |    | +--------+   +--------+ |
 |     `-->|   lo   |-` |  eth0  |<-`      |    | |   lo   |   |  eth0  | |
 +---------+--|---^-+---+-----|--+---------+    +-+--------+---+-^------+-+
              |   |           |                                  |
 Kernel       |   |           |                                  |
 +----+-------v---+-----------v----------------------------------+---+----+
 |    |                            TCP                               |    |
 |    |                                                              |    |
 |    +--------------------------------------------------------------+    |
 |                                                                        |
 |                           +--------------+                             |
 |                           | smc loopback |                             |
 +---------------------------+--------------+-----------------------------+
The loopback-ism device creates DMBs (shared memory) for each connection peer. Since data transfer occurs within the same kernel, the sndbuf of each peer is only a descriptor that points to the same memory region as the peer DMB, so the data copy from sndbuf to peer DMB can be avoided in the loopback-ism case.
 Container 1 (ns1)                              Container 2 (ns2)
 +-----------------------------------------+    +-------------------------+
 | +-------+                               |    |        +-------+        |
 | | App C |-----+                         |    |        | App D |        |
 | +-------+     |                         |    |        +-^-----+        |
 |               |                         |    |          |              |
 |           (2) |                         |    |     (2') |              |
 |               |                         |    |          |              |
 +---------------|-------------------------+    +----------|--------------+
                 |                                         |
 Kernel          |                                         |
 +---------------|-----------------------------------------|--------------+
 | +--------+ +--v-----+                           +--------+ +--------+  |
 | |dmb_desc| |snd_desc|                           |dmb_desc| |snd_desc|  |
 | +-----|--+ +--|-----+                           +-----|--+ +--------+  |
 | +-----|--+    |                                 +-----|--+             |
 | | DMB C  |    +---------------------------------| DMB D  |             |
 | +--------+                                      +--------+             |
 |                                                                        |
 |                           +--------------+                             |
 |                           | smc loopback |                             |
 +---------------------------+--------------+-----------------------------+
- Benchmark Test
 * Test environments:
   - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
   - SMC sndbuf/DMB size 1MB.
 * Test object:
   - TCP: run on TCP loopback.
   - SMC lo: run on SMC loopback-ism.
1. ipc-benchmark (see [3])
 - ./<foo> -c 1000000 -s 100
                            TCP            SMC-lo
 Message rate (msg/s)       84991          151293(+78.01%)
2. sockperf
 - serv: <smc_run> sockperf sr --tcp
 - clnt: <smc_run> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
                            TCP            SMC-lo
 Bandwidth(MBps)            5033.569       7987.732(+58.69%)
 Latency(us)                5.986          3.398(-43.23%)
3. nginx/wrk
 - serv: <smc_run> nginx
 - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80
                            TCP            SMC-lo
 Requests/s                 187951.76      267107.90(+42.12%)
4. redis-benchmark
 - serv: <smc_run> redis-server
 - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024
                            TCP            SMC-lo
 GET(Requests/s)            86132.64       118133.49(+37.15%)
 SET(Requests/s)            87374.40       122887.86(+40.65%)
Change log:
v7->v6
- Patch #2: minor: remove unnecessary 'return' of inline smc_loopback_exit().
- Patch #10: minor: directly return 0 instead of 'rc' in smcd_cdc_msg_send().
- all: collect the Reviewed-by tags.
v6->RFC v5
Link: https://lore.kernel.org/netdev/20240414040304.54255-1-guwen@linux.alibaba.com/
- Patch #2: make the use of CONFIG_SMC_LO cleaner.
- Patch #5: mark some smcd_ops that loopback-ism doesn't support as optional and check for the support when they are called.
- Patch #7: keep loopback-ism at the beginning of the SMC-D device list.
- Some expression changes in commit logs and comments.
RFC v5->RFC v4:
Link: https://lore.kernel.org/netdev/20240324135522.108564-1-guwen@linux.alibaba.com/
- Patch #2: minor changes in description of config SMC_LO and comments.
- Patch #10: minor changes in comments and if(smc_ism_support_dmb_nocopy()) check in smcd_cdc_msg_send().
- Patch #3: change smc_lo_generate_id() to smc_lo_generate_ids() and SMC_LO_CHID to SMC_LO_RESERVED_CHID.
- Patch #5: memcpy while holding the ldev->dmb_ht_lock.
- Some expression changes in commit logs.
RFC v4->v3:
Link: https://lore.kernel.org/netdev/20240317100545.96663-1-guwen@linux.alibaba.com/
- The merge window of v6.9 is open, so post this series as an RFC.
- Patch #6: since some information fed back by smc_nl_handle_smcd_dev() does not apply to Emulated-ISM (including loopback-ism here), loopback-ism is not exposed through smc netlink for the time being. We may refactor this part when the smc netlink interface is updated.
v3->v2:
Link: https://lore.kernel.org/netdev/20240312142743.41406-1-guwen@linux.alibaba.com/
- Patch #11: use tasklet_schedule(&conn->rx_tsklet) instead of smcd_cdc_rx_handler() to avoid possible recursive locking of conn->send_lock and use {read|write}_lock_bh() to acquire dmb_ht_lock.
v2->v1:
Link: https://lore.kernel.org/netdev/20240307095536.29648-1-guwen@linux.alibaba.com/
- All the patches: changed the term virtual-ISM to Emulated-ISM as defined by SMCv2.1.
- Patch #3: optimized the description of SMC_LO config. Avoid exposing loopback-ism to sysfs and remove all the knobs until a future definition is clear.
- Patch #3: try to make lockdep happy by using read_lock_bh() in smc_lo_move_data().
- Patch #6: use physically contiguous DMB buffers by default.
- Patch #11: enable DMB no-copy for loopback-ism by default and free the DMB in unregister_dmb or detach_dmb when dmb_node->refcnt reaches 0, instead of using wait_event to keep waiting in unregister_dmb.
v1->RFC:
Link: https://lore.kernel.org/netdev/20240111120036.109903-1-guwen@linux.alibaba.com/
- Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics: /sys/devices/virtual/smc/loopback-ism/xfer_bytes
- Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports merging sndbuf with peer DMB.
- Patch #13 & #14: introduce loopback-ism device control of DMB memory type and control of whether to merge sndbuf and DMB. They can be respectively set by:
  /sys/devices/virtual/smc/loopback-ism/dmb_type
  /sys/devices/virtual/smc/loopback-ism/dmb_copy
  The motivation for these two controls is that a performance bottleneck was found when a vzalloced DMB is used and sndbuf is merged with the DMB, on systems with many CPUs and CONFIG_HARDENED_USERCOPY set [4]. The bottleneck is caused by lock contention on vmap_area_lock [5], which is involved in memcpy_from_msg() or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the vmap lock contention [6]. It has significant effects, but using virtual memory still has additional overhead compared to using physical memory. So this new version provides controls of dmb_type and dmb_copy to suit different scenarios.
- Some minor changes and comment improvements.
RFC->old version([1]):
Link: https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/
- Patch #1: improve the loopback-ism dump, it shows as follows now:
  # smcd d
  FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
  0000 0     loopback-ism  ffff   No        0
- Patch #3: introduce the smc_ism_set_v2_capable() helper and set smc_ism_v2_capable when ISMv2 or virtual ISM is registered, regardless of whether there is already a device in smcd device list.
- Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/.
- Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active to activate or deactivate the loopback-ism.
- Patch #9: introduce the statistics of loopback-ism by /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_bytes|dmbs_cnt}.
- Some minor changes and comment improvements.
[1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/
[2] https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/
[3] https://github.com/goldsborough/ipc-bench
[4] https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/
[5] https://lore.kernel.org/all/238e63cd-e0e8-4fbf-852f-bc4d5bc35d5a@linux.alibaba.com/
[6] https://lore.kernel.org/all/20240102184633.748113-1-urezki@gmail.com/
==================== Link: https://lore.kernel.org/r/20240428060738.60843-1-guwen@linux.alibaba.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
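The percentages in the benchmark section of the cover letter above appear to be plain relative changes against the TCP baseline. The short check below recomputes them from the quoted TCP and SMC-lo figures; the function name `relative_change` and the `RESULTS` table are introduced here only for illustration.

```python
def relative_change(baseline, measured):
    """Relative change of `measured` against `baseline`, in percent."""
    return (measured - baseline) / baseline * 100.0


# (TCP, SMC-lo) pairs copied from the benchmark figures in the cover letter.
RESULTS = {
    "ipc-benchmark msg/s": (84991, 151293),
    "sockperf bandwidth MBps": (5033.569, 7987.732),
    "sockperf latency us": (5.986, 3.398),
    "nginx/wrk requests/s": (187951.76, 267107.90),
    "redis GET requests/s": (86132.64, 118133.49),
    "redis SET requests/s": (87374.40, 122887.86),
}

if __name__ == "__main__":
    for name, (tcp, smc_lo) in RESULTS.items():
        print(f"{name}: {relative_change(tcp, smc_lo):+.2f}%")
```

For latency a negative value is the improvement, since lower is better; for the throughput-style metrics a positive value is.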