linux/linux-stable.git - Linux kernel stable tree

Age	Commit message (Collapse)	Author
2024-11-11	Merge branch 'knobs-for-npc-default-rule-counters'	Jakub Kicinski
	Linu Cherian says: ==================== Knobs for NPC default rule counters Patch 1 introduce _rvu_mcam_remove/add_counter_from/to_rule by refactoring existing code Patch 2 adds a devlink param to enable/disable counters for default rules. Once enabled, counters can Patch 3 adds documentation for devlink params v4: https://lore.kernel.org/20241029035739.1981839-1-lcherian@marvell.com ==================== Link: https://patch.msgid.link/20241105125620.2114301-1-lcherian@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11	devlink: Add documentation for OcteonTx2 AF	Linu Cherian
	Add documentation for the following devlink params - npc_mcam_high_zone_percent - npc_def_rule_cntr - nix_maxlf Signed-off-by: Linu Cherian <lcherian@marvell.com> Link: https://patch.msgid.link/20241105125620.2114301-4-lcherian@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11	octeontx2-af: Knobs for NPC default rule counters	Linu Cherian
	Add devlink knobs to enable/disable counters on NPC default rule entries. Sample command to enable default rule counters: devlink dev param set <dev> name npc_def_rule_cntr value true cmode runtime Sample command to read the counter: cat /sys/kernel/debug/cn10k/npc/mcam_rules Signed-off-by: Linu Cherian <lcherian@marvell.com> Link: https://patch.msgid.link/20241105125620.2114301-3-lcherian@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11	octeontx2-af: Refactor few NPC mcam APIs	Linu Cherian
	Introduce lowlevel variant of rvu_mcam_remove/add_counter_from/to_rule for better code reuse, which assumes necessary locks are taken at higher level. These low level functions would be used for implementing default rule counter APIs in the subsequent patch. Signed-off-by: Linu Cherian <lcherian@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241105125620.2114301-2-lcherian@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11	mlx5/core: deduplicate {mlx5_,}eq_update_ci()	Caleb Sander Mateos
	The logic of eq_update_ci() is duplicated in mlx5_eq_update_ci(). The only additional work done by mlx5_eq_update_ci() is to increment eq->cons_index. Call eq_update_ci() from mlx5_eq_update_ci() to avoid the duplication. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107183054.2443218-2-csander@purestorage.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11	mlx5/core: relax memory barrier in eq_update_ci()	Caleb Sander Mateos
	The memory barrier in eq_update_ci() after the doorbell write is a significant hot spot in mlx5_eq_comp_int(). Under heavy TCP load, we see 3% of CPU time spent on the mfence instruction. 98df6d5b877c ("net/mlx5: A write memory barrier is sufficient in EQ ci update") already relaxed the full memory barrier to just a write barrier in mlx5_eq_update_ci(), which duplicates eq_update_ci(). So replace mb() with wmb() in eq_update_ci() too. On strongly ordered architectures, no barrier is actually needed because the MMIO writes to the doorbell register are guaranteed to appear to the device in the order they were made. However, the kernel's ordered MMIO primitive writel() lacks a convenient big-endian interface. Therefore, we opt to stick with __raw_writel() + a barrier. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107183054.2443218-1-csander@purestorage.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11	Merge branch ↵	Jakub Kicinski
	'macsec-inherit-lower-device-s-features-and-tso-limits-when-offloading' Sabrina Dubroca says: ==================== macsec: inherit lower device's features and TSO limits when offloading When macsec is offloaded to a NIC, we can take advantage of some of its features, mainly TSO and checksumming. This increases performance significantly. Some features cannot be inherited, because they require additional ops that aren't provided by the macsec netdevice. We also need to inherit TSO limits from the lower device, like VLAN/macvlan devices do. This series also moves the existing macsec offload selftest to the netdevsim selftests before adding tests for the new features. To allow this new selftest to work, netdevsim's hw_features are expanded. ==================== Link: https://patch.msgid.link/cover.1730929545.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11	selftests: netdevsim: add ethtool features to macsec offload tests	Sabrina Dubroca
	The test verifies that available features aren't changed by toggling offload on the device. Creating a device with offload off and then enabling it later should result in the same features as creating the device with offload enabled directly. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/ba801bd0a75b02de2dddbfc77f9efceb8b3d8a2e.1730929545.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11	selftests: netdevsim: add test toggling macsec offload	Sabrina Dubroca
	The test verifies that toggling offload works (both via rtnetlink and macsec's genetlink APIs). This is only possible when no SA is configured. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/bf8e27ee0d921caa4eb35f1e830eca6d4080ddb2.1730929545.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11	selftests: move macsec offload tests from net/rtnetlink to drivers/net/netdvesim	Sabrina Dubroca
	We're going to expand this test, and macsec offload is only lightly related to rtnetlink. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/a1f92c250cc129b4bb111a206c4b560bab4e24a5.1730929545.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11	macsec: inherit lower device's TSO limits when offloading	Sabrina Dubroca
	If macsec is offloaded, we need to follow the lower device's capabilities, like VLAN devices do. Leave the limits unchanged when the offload is disabled. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/8240c0181e851f169d815f59658a01fb9dfc5073.1730929545.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11	macsec: clean up local variables in macsec_notify	Sabrina Dubroca
	For all events, we need to loop over the list of secys, so let's move the common variables out of the switch/case. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/9b8996af518fbeb3b7d527feb15d5788495e3108.1730929545.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11	macsec: add some of the lower device's features when offloading	Sabrina Dubroca
	This commit extends the set of netdevice features supported by macsec devices when offload is enabled, which increases performance significantly (for a single TCP stream: 17.5Gbps to 38.5Gbps on my test machines). Commit c850240b6c41 ("net: macsec: report real_dev features when HW offloading is enabled") previously attempted something similar, but had to be reverted (commit 8bcd560ae878 ("Revert "net: macsec: report real_dev features when HW offloading is enabled"")) because the set of features it exposed was too large. During initialization, all features are set, and they're then removed via ndo_fix_features (macsec_fix_features). This allows the offloadable features to be automatically enabled if offloading is turned on after device creation. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/8b32c3011d269d6f149724e80c1ffe67c9534067.1730929545.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11	selftests: netdevsim: add a test checking ethtool features	Sabrina Dubroca
	Add a test checking that some features are active by default and changeable. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/fff58fa70f8a300440958b5020f6a4eb2e9dad61.1730929545.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11	netdevsim: add more hw_features	Sabrina Dubroca
	netdevsim currently only set HW_TC in its hw_features, but other features should also be present to better reflect the behavior of real HW. In my macsec offload testing, this ends up as HW_CSUM being missing from hw_features, so it doesn't stick in wanted_features when offload is turned off. Then HW_CSUM (and thus TSO, thanks to netdev_fix_features) is not automatically turned back on when offload is re-enabled. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/b918dc4dd76410a57f7516a855f66b0a2bd58326.1730929545.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11	Merge branch 'replace-page_frag-with-page_frag_cache-part-1'	Jakub Kicinski
	Yunsheng Lin says: ==================== Replace page_frag with page_frag_cache (Part-1) This is part 1 of "Replace page_frag with page_frag_cache", which mainly contain refactoring and optimization for the implementation of page_frag API before the replacing. As the discussion in [1], it would be better to target net-next tree to get more testing as all the callers page_frag API are in networking, and the chance of conflicting with MM tree seems low as implementation of page_frag API seems quite self-contained. After [2], there are still two implementations for page frag: 1. mm/page_alloc.c: net stack seems to be using it in the rx part with 'struct page_frag_cache' and the main API being page_frag_alloc_align(). 2. net/core/sock.c: net stack seems to be using it in the tx part with 'struct page_frag' and the main API being skb_page_frag_refill(). This patchset tries to unfiy the page frag implementation by replacing page_frag with page_frag_cache for sk_page_frag() first. net_high_order_alloc_disable_key for the implementation in net/core/sock.c doesn't seems matter that much now as pcp is also supported for high-order pages: commit 44042b449872 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists") As the related change is mostly related to networking, so targeting the net-next. And will try to replace the rest of page_frag in the follow patchset. After this patchset: 1. Unify the page frag implementation by taking the best out of two the existing implementations: we are able to save some space for the 'page_frag_cache' API user, and avoid 'get_page()' for the old 'page_frag' API user. 2. Future bugfix and performance can be done in one place, hence improving maintainability of page_frag's implementation. Kernel Image changing: Linux Kernel total \| text data bss ------------------------------------------------------ after 45250307 \| 27274279 17209996 766032 before 45254134 \| 27278118 17209984 766032 delta -3827 \| -3839 +12 +0 1. https://lore.kernel.org/all/add10dd4-7f5d-4aa1-aa04-767590f944e0@redhat.com/ 2. https://lore.kernel.org/all/20240228093013.8263-1-linyunsheng@huawei.com/ ==================== Link: https://patch.msgid.link/20241028115343.3405838-1-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11	mm: page_frag: use __alloc_pages() to replace alloc_pages_node()	Yunsheng Lin
	It seems there is about 24Bytes binary size increase for __page_frag_cache_refill() after refactoring in arm64 system with 64K PAGE_SIZE. By doing the gdb disassembling, It seems we can have more than 100Bytes decrease for the binary size by using __alloc_pages() to replace alloc_pages_node(), as there seems to be some unnecessary checking for nid being NUMA_NO_NODE, especially when page_frag is part of the mm system. CC: Andrew Morton <akpm@linux-foundation.org> CC: Linux-MM <linux-mm@kvack.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://patch.msgid.link/20241028115343.3405838-8-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11	mm: page_frag: reuse existing space for 'size' and 'pfmemalloc'	Yunsheng Lin
	Currently there is one 'struct page_frag' for every 'struct sock' and 'struct task_struct', we are about to replace the 'struct page_frag' with 'struct page_frag_cache' for them. Before begin the replacing, we need to ensure the size of 'struct page_frag_cache' is not bigger than the size of 'struct page_frag', as there may be tens of thousands of 'struct sock' and 'struct task_struct' instances in the system. By or'ing the page order & pfmemalloc with lower bits of 'va' instead of using 'u16' or 'u32' for page size and 'u8' for pfmemalloc, we are able to avoid 3 or 5 bytes space waste. And page address & pfmemalloc & order is unchanged for the same page in the same 'page_frag_cache' instance, it makes sense to fit them together. After this patch, the size of 'struct page_frag_cache' should be the same as the size of 'struct page_frag'. CC: Andrew Morton <akpm@linux-foundation.org> CC: Linux-MM <linux-mm@kvack.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://patch.msgid.link/20241028115343.3405838-7-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11	xtensa: remove the get_order() implementation	Yunsheng Lin
	As the get_order() implemented by xtensa supporting 'nsau' instruction seems be the same as the generic implementation in include/asm-generic/getorder.h when size is not a constant value as the generic implementation calling the fls*() is also utilizing the 'nsau' instruction for xtensa. So remove the get_order() implemented by xtensa, as using the generic implementation may enable the compiler to do the computing when size is a constant value instead of runtime computing and enable the using of get_order() in BUILD_BUG_ON() macro in next patch. CC: Andrew Morton <akpm@linux-foundation.org> CC: Linux-MM <linux-mm@kvack.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Acked-by: Max Filippov <jcmvbkbc@gmail.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://patch.msgid.link/20241028115343.3405838-6-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11	mm: page_frag: avoid caller accessing 'page_frag_cache' directly	Yunsheng Lin
	Use appropriate frag_page API instead of caller accessing 'page_frag_cache' directly. CC: Andrew Morton <akpm@linux-foundation.org> CC: Linux-MM <linux-mm@kvack.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Acked-by: Chuck Lever <chuck.lever@oracle.com> Link: https://patch.msgid.link/20241028115343.3405838-5-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11	mm: page_frag: use initial zero offset for page_frag_alloc_align()	Yunsheng Lin
	We are about to use page_frag_alloc_*() API to not just allocate memory for skb->data, but also use them to do the memory allocation for skb frag too. Currently the implementation of page_frag in mm subsystem is running the offset as a countdown rather than count-up value, there may have several advantages to that as mentioned in [1], but it may have some disadvantages, for example, it may disable skb frag coalescing and more correct cache prefetching We have a trade-off to make in order to have a unified implementation and API for page_frag, so use a initial zero offset in this patch, and the following patch will try to make some optimization to avoid the disadvantages as much as possible. 1. https://lore.kernel.org/all/f4abe71b3439b39d17a6fb2d410180f367cadf5c.camel@gmail.com/ CC: Andrew Morton <akpm@linux-foundation.org> CC: Linux-MM <linux-mm@kvack.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://patch.msgid.link/20241028115343.3405838-4-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11	mm: move the page fragment allocator from page_alloc into its own file	Yunsheng Lin
	Inspired by [1], move the page fragment allocator from page_alloc into its own c file and header file, as we are about to make more change for it to replace another page_frag implementation in sock.c As this patchset is going to replace 'struct page_frag' with 'struct page_frag_cache' in sched.h, including page_frag_cache.h in sched.h has a compiler error caused by interdependence between mm_types.h and mm.h for asm-offsets.c, see [2]. So avoid the compiler error by moving 'struct page_frag_cache' to mm_types_task.h as suggested by Alexander, see [3]. 1. https://lore.kernel.org/all/20230411160902.4134381-3-dhowells@redhat.com/ 2. https://lore.kernel.org/all/15623dac-9358-4597-b3ee-3694a5956920@gmail.com/ 3. https://lore.kernel.org/all/CAKgT0UdH1yD=LSCXFJ=YM_aiA4OomD-2wXykO42bizaWMt_HOA@mail.gmail.com/ CC: David Howells <dhowells@redhat.com> CC: Linux-MM <linux-mm@kvack.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://patch.msgid.link/20241028115343.3405838-3-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11	mm: page_frag: add a test module for page_frag	Yunsheng Lin
	The testing is done by ensuring that the fragment allocated from a frag_frag_cache instance is pushed into a ptr_ring instance in a kthread binded to a specified cpu, and a kthread binded to a specified cpu will pop the fragment from the ptr_ring and free the fragment. CC: Andrew Morton <akpm@linux-foundation.org> CC: Linux-MM <linux-mm@kvack.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://patch.msgid.link/20241028115343.3405838-2-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11	net: convert to nla_get_*_default()	Johannes Berg
	Most of the original conversion is from the spatch below, but I edited some and left out other instances that were either buggy after conversion (where default values don't fit into the type) or just looked strange. @@ expression attr, def; expression val; identifier fn =~ "^nla_get_.*"; fresh identifier dfn = fn ## "_default"; @@ ( -if (attr) - val = fn(attr); -else - val = def; +val = dfn(attr, def); \| -if (!attr) - val = def; -else - val = fn(attr); +val = dfn(attr, def); \| -if (!attr) - return def; -return fn(attr); +return dfn(attr, def); \| -attr ? fn(attr) : def +dfn(attr, def) \| -!attr ? def : fn(attr) +dfn(attr, def) ) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org> Link: https://patch.msgid.link/20241108114145.0580b8684e7f.I740beeaa2f70ebfc19bfca1045a24d6151992790@changeid Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11	net: netlink: add nla_get_*_default() accessors	Johannes Berg
	There are quite a number of places that use patterns such as if (attr) val = nla_get_u16(attr); else val = DEFAULT; Add nla_get_u16_default() and friends like that to not have to type this out all the time. Acked-by: Toke Høiland-Jørgensen <toke@kernel.org> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Link: https://patch.msgid.link/20241108114145.acd2aadb03ac.I3df6aac71d38a5baa1c0a03d0c7e82d4395c030e@changeid Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-09	bridge: Allow deleting FDB entries with non-existent VLAN	Ido Schimmel
	It is currently impossible to delete individual FDB entries (as opposed to flushing) that were added with a VLAN that no longer exists: # ip link add name dummy1 up type dummy # ip link add name br1 up type bridge vlan_filtering 1 # ip link set dev dummy1 master br1 # bridge fdb add 00:11:22:33:44:55 dev dummy1 master static vlan 1 # bridge vlan del vid 1 dev dummy1 # bridge fdb get 00:11:22:33:44:55 br br1 vlan 1 00:11:22:33:44:55 dev dummy1 vlan 1 master br1 static # bridge fdb del 00:11:22:33:44:55 dev dummy1 master vlan 1 RTNETLINK answers: Invalid argument # bridge fdb get 00:11:22:33:44:55 br br1 vlan 1 00:11:22:33:44:55 dev dummy1 vlan 1 master br1 static This is in contrast to MDB entries that can be deleted after the VLAN was deleted: # bridge vlan add vid 10 dev dummy1 # bridge mdb add dev br1 port dummy1 grp 239.1.1.1 permanent vid 10 # bridge vlan del vid 10 dev dummy1 # bridge mdb get dev br1 grp 239.1.1.1 vid 10 dev br1 port dummy1 grp 239.1.1.1 permanent vid 10 # bridge mdb del dev br1 port dummy1 grp 239.1.1.1 permanent vid 10 # bridge mdb get dev br1 grp 239.1.1.1 vid 10 Error: bridge: MDB entry not found. Align the two interfaces and allow user space to delete FDB entries that were added with a VLAN that no longer exists: # ip link add name dummy1 up type dummy # ip link add name br1 up type bridge vlan_filtering 1 # ip link set dev dummy1 master br1 # bridge fdb add 00:11:22:33:44:55 dev dummy1 master static vlan 1 # bridge vlan del vid 1 dev dummy1 # bridge fdb get 00:11:22:33:44:55 br br1 vlan 1 00:11:22:33:44:55 dev dummy1 vlan 1 master br1 static # bridge fdb del 00:11:22:33:44:55 dev dummy1 master vlan 1 # bridge fdb get 00:11:22:33:44:55 br br1 vlan 1 Error: Fdb entry not found. Add a selftest to make sure this behavior does not regress: # ./rtnetlink.sh -t kci_test_fdb_del PASS: bridge fdb del Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Andy Roulin <aroulin@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20241105133954.350479-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-09	mlx5/core: Schedule EQ comp tasklet only if necessary	Caleb Sander Mateos
	Currently, the mlx5_eq_comp_int() interrupt handler schedules a tasklet to call mlx5_cq_tasklet_cb() if it processes any completions. For CQs whose completions don't need to be processed in tasklet context, this adds unnecessary overhead. In a heavy TCP workload, we see 4% of CPU time spent on the tasklet_trylock() in tasklet_action_common(), with a smaller amount spent on the atomic operations in tasklet_schedule(), tasklet_clear_sched(), and locking the spinlock in mlx5_cq_tasklet_cb(). TCP completions are handled by mlx5e_completion_event(), which schedules NAPI to poll the queue, so they don't need tasklet processing. Schedule the tasklet in mlx5_add_cq_to_tasklet() instead to avoid this overhead. mlx5_add_cq_to_tasklet() is responsible for enqueuing the CQs to be processed in tasklet context, so it can schedule the tasklet. CQs that need tasklet processing have their interrupt comp handler set to mlx5_add_cq_to_tasklet(), so they will schedule the tasklet. CQs that don't need tasklet processing won't schedule the tasklet. To avoid scheduling the tasklet multiple times during the same interrupt, only schedule the tasklet in mlx5_add_cq_to_tasklet() if the tasklet work queue was empty before the new CQ was pushed to it. The additional branch in mlx5_add_cq_to_tasklet(), called for each EQE, may add a small cost for the userspace Infiniband CQs whose completions are processed in tasklet context. But this seems worth it to avoid the tasklet overhead for CQs that don't need it. Note that the mlx4 driver works the same way: it schedules the tasklet in mlx4_add_cq_to_tasklet() and only if the work queue was empty before. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://patch.msgid.link/20241105204000.1807095-1-csander@purestorage.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-09	Merge branch 'improve-neigh_flush_dev-performance'	Jakub Kicinski
	Gilad Naaman says: ==================== Improve neigh_flush_dev performance This patchsets improves the performance of neigh_flush_dev. Currently, the only way to implement it requires traversing all neighbours known to the kernel, across all network-namespaces. This means that some flows are slowed down as a function of neigh-scale, even if the specific link they're handling has little to no neighbours. In order to solve this, this patchset adds a netdev->neighbours list, as well as making the original linked-list doubly-, so that it is possible to unlink neighbours without traversing the hash-bucket to obtain the previous neighbour. The original use-case we encountered was mass-deletion of links (12K VLANs) while there are 50K ARPs and 50K NDPs in the system; though the slowdowns would also appear when the links are set down. ==================== Link: https://patch.msgid.link/20241107160444.2913124-1-gnaaman@drivenets.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-09	neighbour: Create netdev->neighbour association	Gilad Naaman
	Create a mapping between a netdev and its neighoburs, allowing for much cheaper flushes. Signed-off-by: Gilad Naaman <gnaaman@drivenets.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20241107160444.2913124-7-gnaaman@drivenets.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-09	neighbour: Remove bare neighbour::next pointer	Gilad Naaman
	Remove the now-unused neighbour::next pointer, leaving struct neighbour solely with the hlist_node implementation. Signed-off-by: Gilad Naaman <gnaaman@drivenets.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241107160444.2913124-6-gnaaman@drivenets.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-09	neighbour: Convert iteration to use hlist+macro	Gilad Naaman
	Remove all usage of the bare neighbour::next pointer, replacing them with neighbour::hash and its for_each macro. Signed-off-by: Gilad Naaman <gnaaman@drivenets.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241107160444.2913124-5-gnaaman@drivenets.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-09	neighbour: Convert seq_file functions to use hlist	Gilad Naaman
	Convert seq_file-related neighbour functionality to use neighbour::hash and the related for_each macro. Signed-off-by: Gilad Naaman <gnaaman@drivenets.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241107160444.2913124-4-gnaaman@drivenets.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-09	neighbour: Define neigh_for_each_in_bucket	Gilad Naaman
	Introduce neigh_for_each_in_bucket in neighbour.h, to help iterate over the neighbour table more succinctly. Signed-off-by: Gilad Naaman <gnaaman@drivenets.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241107160444.2913124-3-gnaaman@drivenets.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-09	neighbour: Add hlist_node to struct neighbour	Gilad Naaman
	Add a doubly-linked node to neighbours, so that they can be deleted without iterating the entire bucket they're in. Signed-off-by: Gilad Naaman <gnaaman@drivenets.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241107160444.2913124-2-gnaaman@drivenets.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-09	Merge branch 'r8169-improve-wol-suspend-related-code'	Jakub Kicinski
	Heiner Kallweit says: ==================== r8169: improve wol/suspend-related code This series improves wol/suspend-related code parts. ==================== Link: https://patch.msgid.link/be734d10-37f7-4830-b7c2-367c0a656c08@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-09	r8169: align WAKE_PHY handling with r8125/r8126 vendor drivers	Heiner Kallweit
	Vendor drivers r8125/r8126 apply this additional magic setting when enabling WAKE_PHY, so do the same here. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/51130715-45be-4db5-abb7-05d87e1f5df9@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-09	r8169: improve rtl_set_d3_pll_down	Heiner Kallweit
	Make use of new helper r8169_mod_reg8_cond() and move from a switch() to an if() clause. Benefit is that we don't have to touch this piece of code each time support for a new chip version is added. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/e1ccdb85-a4ed-4800-89c2-89770ff06452@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-09	r8169: improve __rtl8169_set_wol	Heiner Kallweit
	Add helper r8169_mod_reg8_cond() what allows to significantly simplify __rtl8169_set_wol(). Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/697b197a-8eac-40c6-8847-27093cacec36@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-09	tc: fix typo probabilty in tc.yaml doc	Abhinav Saxena
	Fix spelling of "probability" in tc.yaml documentation. This corrects the max-P field description in struct tc_sfq_qopt_v1. Signed-off-by: Abhinav Saxena <xandfury@gmail.com> Link: https://patch.msgid.link/20241108195642.139315-1-xandfury@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-09	mISDN: Fix typos	Andrew Kreimer
	Fix typos: - syncronized -> synchronized - interfacs -> interface - otherwhise -> otherwise - ony -> only - busses -> buses - maxinum -> maximum Via codespell. Reported-by: Simon Horman <horms@kernel.org> Signed-off-by: Andrew Kreimer <algonell@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241106112513.9559-1-algonell@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-09	hv_sock: Initializing vsk->trans to NULL to prevent a dangling pointer	Hyunwoo Kim
	When hvs is released, there is a possibility that vsk->trans may not be initialized to NULL, which could lead to a dangling pointer. This issue is resolved by initializing vsk->trans to NULL. Signed-off-by: Hyunwoo Kim <v4bel@theori.io> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://patch.msgid.link/Zys4hCj61V+mQfX2@v4bel-B760M-AORUS-ELITE-AX Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-09	net: sfc: use ethtool string helpers	Rosen Penev
	The latter is the preferred way to copy ethtool strings. Avoids manually incrementing the pointer. Cleans up the code quite well. Signed-off-by: Rosen Penev <rosenp@gmail.com> Acked-by: Edward Cree <ecree.xilinx@gmail.com> Link: https://patch.msgid.link/20241105231855.235894-1-rosenp@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-09	mptcp: remove the redundant assignment of 'new_ctx->tcp_sock' in ↵	MoYuanhao
	subflow_ulp_clone() The variable has already been assigned in the subflow_create_ctx(), So we don't need to reassign this variable in the subflow_ulp_clone(). Signed-off-by: MoYuanhao <moyuanhao3676@163.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20241106071035.2591-1-moyuanhao3676@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-09	net: mctp: Expose transport binding identifier via IFLA attribute	Khang Nguyen
	MCTP control protocol implementations are transport binding dependent. Endpoint discovery is mandatory based on transport binding. Message timing requirements are specified in each respective transport binding specification. However, we currently have no means to get this information from MCTP links. Add a IFLA_MCTP_PHYS_BINDING netlink link attribute, which represents the transport type using the DMTF DSP0239-defined type numbers, returned as part of RTM_GETLINK data. We get an IFLA_MCTP_PHYS_BINDING attribute for each MCTP link, for example: - 0x00 (unspec) for loopback interface; - 0x01 (SMBus/I2C) for mctpi2c%d interfaces; and - 0x05 (serial) for mctpserial%d interfaces. Signed-off-by: Khang Nguyen <khangng@os.amperecomputing.com> Reviewed-by: Matt Johnston <matt@codeconstruct.com.au> Link: https://patch.msgid.link/20241105071915.821871-1-khangng@os.amperecomputing.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-07	bonding: add ESP offload features when slaves support	Jianbo Liu
	Add NETIF_F_GSO_ESP bit to bond's gso_partial_features if all slaves support it, such that ESP segmentation is handled by hardware if possible. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Boris Pismenny <borisp@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241105192721.584822-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-07	Merge branch 'netlink-specs-add-neigh-and-rule-ynl-specs'	Jakub Kicinski
	Donald Hunter says: ==================== netlink: specs: Add neigh and rule YNL specs Add YNL specs for the FDB neighbour tables and FIB rules from the rtnelink families. Example usage: ./tools/net/ynl/cli.py \ --spec Documentation/netlink/specs/rt_neigh.yaml \ --dump getneigh [{'cacheinfo': {'confirmed': 122664055, 'refcnt': 0, 'updated': 122658055, 'used': 122658055}, 'dst': '0.0.0.0', 'family': 2, 'flags': set(), 'ifindex': 5, 'lladr': '', 'probes': 0, 'state': {'noarp'}, 'type': 'broadcast'}, ...] ./tools/net/ynl/cli.py \ --spec Documentation/netlink/specs/rt_rule.yaml \ --dump getrule --json '{"family": 2}' [{'action': 'to-tbl', 'dst-len': 0, 'family': 2, 'flags': 0, 'protocol': 2, 'src-len': 0, 'suppress-prefixlen': '0xffffffff', 'table': 255, 'tos': 0}, ... ] ==================== Link: https://patch.msgid.link/20241106090718.64713-1-donald.hunter@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-07	netlink: specs: Add a spec for FIB rule management	Donald Hunter
	Add a YNL spec for FIB rules: ./tools/net/ynl/cli.py \ --spec Documentation/netlink/specs/rt_rule.yaml \ --dump getrule --json '{"family": 2}' [{'action': 'to-tbl', 'dst-len': 0, 'family': 2, 'flags': 0, 'protocol': 2, 'src-len': 0, 'suppress-prefixlen': '0xffffffff', 'table': 255, 'tos': 0}, ... ] Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20241106090718.64713-3-donald.hunter@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-07	netlink: specs: Add a spec for neighbor tables in rtnetlink	Donald Hunter
	Add a YNL spec for neighbour tables and neighbour entries in rtnetlink. ./tools/net/ynl/cli.py \ --spec Documentation/netlink/specs/rt_neigh.yaml \ --dump getneigh [{'cacheinfo': {'confirmed': 122664055, 'refcnt': 0, 'updated': 122658055, 'used': 122658055}, 'dst': '0.0.0.0', 'family': 2, 'flags': set(), 'ifindex': 5, 'lladr': '', 'probes': 0, 'state': {'noarp'}, 'type': 'broadcast'}, ...] Acked-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20241106090718.64713-2-donald.hunter@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-07	phonet: do not call synchronize_rcu() from phonet_route_del()	Eric Dumazet
	Calling synchronize_rcu() while holding rcu_read_lock() is not permitted [1] Move the synchronize_rcu() + dev_put() to route_doit(). Alternative would be to not use rcu_read_lock() in route_doit(). [1] WARNING: suspicious RCU usage 6.12.0-rc5-syzkaller-01056-gf07a6e6ceb05 #0 Not tainted ----------------------------- kernel/rcu/tree.c:4092 Illegal synchronize_rcu() in RCU read-side critical section! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by syz-executor427/5840: #0: ffffffff8e937da0 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:337 [inline] #0: ffffffff8e937da0 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:849 [inline] #0: ffffffff8e937da0 (rcu_read_lock){....}-{1:2}, at: route_doit+0x3d6/0x640 net/phonet/pn_netlink.c:264 stack backtrace: CPU: 1 UID: 0 PID: 5840 Comm: syz-executor427 Not tainted 6.12.0-rc5-syzkaller-01056-gf07a6e6ceb05 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 lockdep_rcu_suspicious+0x226/0x340 kernel/locking/lockdep.c:6821 synchronize_rcu+0xea/0x360 kernel/rcu/tree.c:4089 phonet_route_del+0xc6/0x140 net/phonet/pn_dev.c:409 route_doit+0x514/0x640 net/phonet/pn_netlink.c:275 rtnetlink_rcv_msg+0x791/0xcf0 net/core/rtnetlink.c:6790 netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2551 netlink_unicast_kernel net/netlink/af_netlink.c:1331 [inline] netlink_unicast+0x7f6/0x990 net/netlink/af_netlink.c:1357 netlink_sendmsg+0x8e4/0xcb0 net/netlink/af_netlink.c:1901 sock_sendmsg_nosec net/socket.c:729 [inline] __sock_sendmsg+0x221/0x270 net/socket.c:744 sock_write_iter+0x2d7/0x3f0 net/socket.c:1165 new_sync_write fs/read_write.c:590 [inline] vfs_write+0xaeb/0xd30 fs/read_write.c:683 ksys_write+0x183/0x2b0 fs/read_write.c:736 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: 17a1ac0018ae ("phonet: Don't hold RTNL for route_doit().") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Cc: Remi Denis-Courmont <courmisch@gmail.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20241106131818.1240710-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-07	ipv4: Prepare ip_route_output() to future .flowi4_tos conversion.	Guillaume Nault
	Convert the "tos" parameter of ip_route_output() to dscp_t. This way we'll have a dscp_t value directly available when .flowi4_tos will eventually be converted to dscp_t. All ip_route_output() callers but one set this "tos" parameter to 0 and therefore don't need to be adapted to the new prototype. Only br_nf_pre_routing_finish() needs conversion. It can just use ip4h_dscp() to get the DSCP field from the IPv4 header. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/0f10d031dd44c70aae9bc6e19391cb30d5c2fe71.1730928699.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>