summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-02-10mlxsw: Enable Tx checksum offloadIdo Schimmel
The device is able to checksum plain TCP / UDP packets over IPv4 / IPv6 when the 'ipcs' bit in the send descriptor is set. Advertise support for the 'NETIF_F_IP{,6}_CSUM' features in net devices registered by the driver and VLAN uppers and set the 'ipcs' bit when the stack requests Tx checksum offload. Note that the device also calculates the IPv4 checksum, but it first zeroes the current checksum so there should not be any difference compared to the checksum calculated by the kernel. On SN5600 (Spectrum-4) there is about 10% improvement in Tx packet rate with 1400 byte packets when using pktgen. Tested on Spectrum-{1,2,3,4} with all the combinations of IPv4 / IPv6, TCP / UDP, with and without VLAN. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/8dc86c95474ce10572a0fa83b8adb0259558e982.1738950446.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10selftests: drv-net: add helper for path resolutionJakub Kicinski
Refering to C binaries from Python code is going to be a common need. Add a helper to convert from path in relation to the test. Meaning, if the test is in the same directory as the binary, the call would be simply: cfg.rpath("binary"). The helper name "rpath" is not great. I can't think of a better name that would be accurate yet concise. Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/20250207184140.1730466-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10selftests: drv-net: factor out a DrvEnv base classJakub Kicinski
We have separate Env classes for local tests and tests with a remote endpoint. Make it easier to share the code by creating a base class. Make env loading a method of this class. Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/20250207184140.1730466-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10selftests: drv-net: remove an unnecessary libmnl includeJakub Kicinski
ncdevmem doesn't need libmnl, remove the unnecessary include. Since YNL doesn't depend on libmnl either, any more, it's actually possible to build selftests without having libmnl installed. Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20250207183119.1721424-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10Merge branch 'fib-rules-convert-rtm_newrule-and-rtm_delrule-to-per-netns-rtnl'Jakub Kicinski
Kuniyuki Iwashima says: ==================== fib: rules: Convert RTM_NEWRULE and RTM_DELRULE to per-netns RTNL. Patch 1 ~ 2 are small cleanup, and patch 3 ~ 8 make fib_nl_newrule() and fib_nl_delrule() hold per-netns RTNL. v1: https://lore.kernel.org/20250206084629.16602-1-kuniyu@amazon.com ==================== Link: https://patch.msgid.link/20250207072502.87775-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10net: fib_rules: Convert RTM_DELRULE to per-netns RTNL.Kuniyuki Iwashima
fib_nl_delrule() is the doit() handler for RTM_DELRULE but also called from vrf_newlink() in case something fails in vrf_add_fib_rules(). In the latter case, RTNL is already held and the 4th arg is true. Let's hold per-netns RTNL in fib_delrule() if rtnl_held is false. Now we can place ASSERT_RTNL_NET() in call_fib_rule_notifiers(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250207072502.87775-9-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10net: fib_rules: Add error_free label in fib_delrule().Kuniyuki Iwashima
We will hold RTNL just before calling fib_nl2rule_rtnl() in fib_delrule() and release it before kfree(nlrule). Let's add a new rule to make the following change cleaner. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250207072502.87775-8-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10net: fib_rules: Convert RTM_NEWRULE to per-netns RTNL.Kuniyuki Iwashima
fib_nl_newrule() is the doit() handler for RTM_NEWRULE but also called from vrf_newlink(). In the latter case, RTNL is already held and the 4th arg is true. Let's hold per-netns RTNL in fib_newrule() if rtnl_held is false. Note that we call fib_rule_get() before releasing per-netns RTNL to call notify_rule_change() without RTNL and prevent freeing the new rule. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250207072502.87775-7-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10net: fib_rules: Factorise fib_newrule() and fib_delrule().Kuniyuki Iwashima
fib_nl_newrule() / fib_nl_delrule() is the doit() handler for RTM_NEWRULE / RTM_DELRULE but also called from vrf_newlink(). Currently, we hold RTNL on both paths but will not on the former. Also, we set dev_net(dev)->rtnl to skb->sk in vrf_fib_rule() because fib_nl_newrule() / fib_nl_delrule() fetch net as sock_net(skb->sk). Let's Factorise the two functions and pass net and rtnl_held flag. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250207072502.87775-6-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10ip: fib_rules: Fetch net from fib_rule in fib[46]_rule_configure().Kuniyuki Iwashima
The following patch will not set skb->sk from VRF path. Let's fetch net from fib_rule->fr_net instead of sock_net(skb->sk) in fib[46]_rule_configure(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250207072502.87775-5-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10net: fib_rules: Split fib_nl2rule().Kuniyuki Iwashima
We will move RTNL down to fib_nl_newrule() and fib_nl_delrule(). Some operations in fib_nl2rule() require RTNL: fib_default_rule_pref() and __dev_get_by_name(). Let's split the RTNL parts as fib_nl2rule_rtnl(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250207072502.87775-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10net: fib_rules: Pass net to fib_nl2rule() instead of skb.Kuniyuki Iwashima
skb is not used in fib_nl2rule() other than sock_net(skb->sk), which is already available in callers, fib_nl_newrule() and fib_nl_delrule(). Let's pass net directly to fib_nl2rule(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250207072502.87775-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10net: fib_rules: Don't check net in rule_exists() and rule_find().Kuniyuki Iwashima
fib_nl_newrule() / fib_nl_delrule() looks up struct fib_rules_ops in sock_net(skb->sk) and calls rule_exists() / rule_find() respectively. fib_nl_newrule() creates a new rule and links it to the found ops, so struct fib_rule never belongs to a different netns's ops->rules_list. Let's remove redundant netns check in rule_exists() and rule_find(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250207072502.87775-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10Merge branch 'tun-unify-vnet-implementation'Jakub Kicinski
Akihiko Odaki says: ==================== tun: Unify vnet implementation When I implemented virtio's hash-related features to tun/tap [1], I found tun/tap does not fill the entire region reserved for the virtio header, leaving some uninitialized hole in the middle of the buffer after read()/recvmesg(). This series fills the uninitialized hole. More concretely, the num_buffers field will be initialized with 1, and the other fields will be inialized with 0. Setting the num_buffers field to 1 is mandated by virtio 1.0 [2]. The change to virtio header is preceded by another change that refactors tun and tap to unify their virtio-related code. [1]: https://lore.kernel.org/r/20241008-rss-v5-0-f3cf68df005d@daynix.com [2]: https://lore.kernel.org/r/20241227084256-mutt-send-email-mst@kernel.org/ ==================== Link: https://patch.msgid.link/20250207-tun-v6-0-fb49cf8b103e@daynix.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10tap: Use tun's vnet-related codeAkihiko Odaki
tun and tap implements the same vnet-related features so reuse the code. Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250207-tun-v6-7-fb49cf8b103e@daynix.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10tap: Keep hdr_len in tap_get_user()Akihiko Odaki
hdr_len is repeatedly used so keep it in a local variable. Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250207-tun-v6-6-fb49cf8b103e@daynix.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10tun: Extract the vnet handling codeAkihiko Odaki
The vnet handling code will be reused by tap. Functions are renamed to ensure that their names contain "vnet" to clarify that they are part of the decoupled vnet handling code. Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250207-tun-v6-5-fb49cf8b103e@daynix.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10tun: Decouple vnet handlingAkihiko Odaki
Decouple the vnet handling code so that we can reuse it for tap. Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250207-tun-v6-4-fb49cf8b103e@daynix.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10tun: Decouple vnet from tun_structAkihiko Odaki
Decouple vnet-related functions from tun_struct so that we can reuse them for tap in the future. Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250207-tun-v6-3-fb49cf8b103e@daynix.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10tun: Keep hdr_len in tun_get_user()Akihiko Odaki
hdr_len is repeatedly used so keep it in a local variable. Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250207-tun-v6-2-fb49cf8b103e@daynix.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10tun: Refactor CONFIG_TUN_VNET_CROSS_LEAkihiko Odaki
Check IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) to save some lines and make future changes easier. Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250207-tun-v6-1-fb49cf8b103e@daynix.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10Merge branch 'net-xilinx-axienet-enable-adaptive-irq-coalescing-with-dim'Jakub Kicinski
Sean Anderson says: ==================== net: xilinx: axienet: Enable adaptive IRQ coalescing with DIM To improve performance without sacrificing latency under low load, enable DIM. While I appreciate not having to write the library myself, I do think there are many unusual aspects to DIM, as detailed in the last patch. ==================== Link: https://patch.msgid.link/20250206201036.1516800-1-sean.anderson@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10net: xilinx: axienet: Enable adaptive IRQ coalescing with DIMSean Anderson
The default RX IRQ coalescing settings of one IRQ per packet can represent a significant CPU load. However, increasing the coalescing unilaterally can result in undesirable latency under low load. Adaptive IRQ coalescing with DIM offers a way to adjust the coalescing settings based on load. This device only supports "CQE" mode [1], where each packet resets the timer. Therefore, an interrupt is fired either when we receive coalesce_count_rx packets or when the interface is idle for coalesce_usec_rx. With this in mind, consider the following scenarios: Link saturated Here we want to set coalesce_count_rx to a large value, in order to coalesce more packets and reduce CPU load. coalesce_usec_rx should be set to at least the time for one packet. Otherwise the link will be "idle" and we will get an interrupt for each packet anyway. Bursts of packets Each burst should be coalesced into a single interrupt, although it may be prudent to reduce coalesce_count_rx for better latency. coalesce_usec_rx should be set to at least the time for one packet so bursts are coalesced. However, additional time beyond the packet time will just increase latency at the end of a burst. Sporadic packets Due to low load, we can set coalesce_count_rx to 1 in order to reduce latency to the minimum. coalesce_usec_rx does not matter in this case. Based on this analysis, I expected the CQE profiles to look something like usec = 0, pkts = 1 // Low load usec = 16, pkts = 4 usec = 16, pkts = 16 usec = 16, pkts = 64 usec = 16, pkts = 256 // High load Where usec is set to 16 to be a few us greater than the 12.3 us packet time of a 1500 MTU packet at 1 GBit/s. However, the CQE profile is instead usec = 2, pkts = 256 // Low load usec = 8, pkts = 128 usec = 16, pkts = 64 usec = 32, pkts = 64 usec = 64, pkts = 64 // High load I found this very surprising. The number of coalesced packets *decreases* as load increases. But as load increases we have more opportunities to coalesce packets without affecting latency as much. Additionally, the profile *increases* the usec as the load increases. But as load increases, the gaps between packets will tend to become smaller, making it possible to *decrease* usec for better latency at the end of a "burst". I consider the default CQE profile unsuitable for this NIC. Therefore, we use the first profile outlined in this commit instead. coalesce_usec_rx is set to 16 by default, but the user can customize it. This may be necessary if they are using jumbo frames. I think adjusting the profile times based on the link speed/mtu would be good improvement for generic DIM. In addition to the above profile problems, I noticed the following additional issues with DIM while testing: - DIM tends to "wander" when at low load, since the performance gradient is pretty flat. If you only have 10p/ms anyway then adjusting the coalescing settings will not affect throughput very much. - DIM takes a long time to adjust back to low indices when load is decreased following a period of high load. This is because it only re-evaluates its settings once every 64 interrupts. However, at low load 64 interrupts can be several seconds. Finally: performance. This patch increases receive throughput with iperf3 from 840 Mbits/sec to 938 Mbits/sec, decreases interrupts from 69920/sec to 316/sec, and decreases CPU utilization (4x Cortex-A53) from 43% to 9%. [1] Who names this stuff? Signed-off-by: Sean Anderson <sean.anderson@linux.dev> Reviewed by: Shannon Nelson <shannon.nelson@amd.com> Link: https://patch.msgid.link/20250206201036.1516800-5-sean.anderson@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10net: xilinx: axienet: Get coalesce parameters from driver stateSean Anderson
The cr variables now contain the same values as the control registers themselves. Extract/calculate the values from the variables instead of saving the user-specified values. This allows us to remove some bookeeping, and also lets the user know what the actual coalesce settings are. Signed-off-by: Sean Anderson <sean.anderson@linux.dev> Reviewed by: Shannon Nelson <shannon.nelson@amd.com> Link: https://patch.msgid.link/20250206201036.1516800-4-sean.anderson@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10net: xilinx: axienet: Support adjusting coalesce settings while runningSean Anderson
In preparation for adaptive IRQ coalescing, we first need to support adjusting the settings at runtime. The existing code doesn't require any locking because - dma_start is the only function that modifies rx/tx_dma_cr. It is always called with IRQs and NAPI disabled, so nothing else is touching the hardware. - The IRQs don't race with poll, since the latter is a softirq. - The IRQs don't race with dma_stop since they both just clear the control registers. - dma_stop doesn't race with poll since the former is called with NAPI disabled. However, once we introduce another function that modifies rx/tx_dma_cr, we need to have some locking to prevent races. Introduce two locks to protect these variables and their registers. The control register values are now generated where the coalescing settings are set. Converting coalescing settings to control register values may require sleeping because of clk_get_rate. However, the read/modify/write of the control registers themselves can't sleep because it needs to happen in IRQ context. By pre-calculating the control register values, we avoid introducing an additional mutex. Since axienet_dma_start writes the control settings when it runs, we don't bother updating the CR registers when rx/tx_dma_started is false. This prevents any issues from writing to the control registers in the middle of a reset sequence. Signed-off-by: Sean Anderson <sean.anderson@linux.dev> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://patch.msgid.link/20250206201036.1516800-3-sean.anderson@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10net: xilinx: axienet: Combine CR calculationSean Anderson
Combine the common parts of the CR calculations for better code reuse. While we're at it, simplify the code a bit. Signed-off-by: Sean Anderson <sean.anderson@linux.dev> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://patch.msgid.link/20250206201036.1516800-2-sean.anderson@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10Merge tag 'wireless-2025-02-07' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Kalle Valo says: ==================== wireless fixes for v6.14-rc3 We have only one fix for ath12k and one fix for brcmfmac. Also this will be my last pull request as I'm stepping down as wireless driver maintainer. * tag 'wireless-2025-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: MAINTAINERS: wifi: remove Kalle MAINTAINERS: wifi: ath: remove Kalle wifi: brcmfmac: use random seed flag for BCM4355 and BCM4364 firmware wifi: ath12k: fix handling of 6 GHz rules ==================== Link: https://patch.msgid.link/20250207182957.23315C4CED1@smtp.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10Merge branch 'net-second-round-to-use-dev_net_rcu'Jakub Kicinski
Eric Dumazet says: ==================== net: second round to use dev_net_rcu() dev_net(dev) should either be protected by RTNL or RCU. There is no LOCKDEP support yet for this helper. Adding it would trigger too many splats. This second series fixes some of them. ==================== Link: https://patch.msgid.link/20250207135841.1948589-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10ipv6: mcast: extend RCU protection in igmp6_send()Eric Dumazet
igmp6_send() can be called without RTNL or RCU being held. Extend RCU protection so that we can safely fetch the net pointer and avoid a potential UAF. Note that we no longer can use sock_alloc_send_skb() because ipv6.igmp_sk uses GFP_KERNEL allocations which can sleep. Instead use alloc_skb() and charge the net->ipv6.igmp_sk socket under RCU protection. Fixes: b8ad0cbc58f7 ("[NETNS][IPV6] mcast - handle several network namespace") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250207135841.1948589-9-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10ndisc: extend RCU protection in ndisc_send_skb()Eric Dumazet
ndisc_send_skb() can be called without RTNL or RCU held. Acquire rcu_read_lock() earlier, so that we can use dev_net_rcu() and avoid a potential UAF. Fixes: 1762f7e88eb3 ("[NETNS][IPV6] ndisc - make socket control per namespace") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250207135841.1948589-8-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10vrf: use RCU protection in l3mdev_l3_out()Eric Dumazet
l3mdev_l3_out() can be called without RCU being held: raw_sendmsg() ip_push_pending_frames() ip_send_skb() ip_local_out() __ip_local_out() l3mdev_ip_out() Add rcu_read_lock() / rcu_read_unlock() pair to avoid a potential UAF. Fixes: a8e3e1a9f020 ("net: l3mdev: Add hook to output path") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250207135841.1948589-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10openvswitch: use RCU protection in ovs_vport_cmd_fill_info()Eric Dumazet
ovs_vport_cmd_fill_info() can be called without RTNL or RCU. Use RCU protection and dev_net_rcu() to avoid potential UAF. Fixes: 9354d4520342 ("openvswitch: reliable interface indentification in port dumps") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250207135841.1948589-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10arp: use RCU protection in arp_xmit()Eric Dumazet
arp_xmit() can be called without RTNL or RCU protection. Use RCU protection to avoid potential UAF. Fixes: 29a26a568038 ("netfilter: Pass struct net into the netfilter hooks") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250207135841.1948589-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10neighbour: use RCU protection in __neigh_notify()Eric Dumazet
__neigh_notify() can be called without RTNL or RCU protection. Use RCU protection to avoid potential UAF. Fixes: 426b5303eb43 ("[NETNS]: Modify the neighbour table code so it handles multiple network namespaces") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250207135841.1948589-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10ndisc: use RCU protection in ndisc_alloc_skb()Eric Dumazet
ndisc_alloc_skb() can be called without RTNL or RCU being held. Add RCU protection to avoid possible UAF. Fixes: de09334b9326 ("ndisc: Introduce ndisc_alloc_skb() helper.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250207135841.1948589-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10ndisc: ndisc_send_redirect() must use dev_get_by_index_rcu()Eric Dumazet
ndisc_send_redirect() is called under RCU protection, not RTNL. It must use dev_get_by_index_rcu() instead of __dev_get_by_index() Fixes: 2f17becfbea5 ("vrf: check the original netdevice for generating redirect") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stephen Suryaputra <ssuryaextr@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250207135841.1948589-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10net: stmmac: Apply new page pool parameters when SPH is enabledFurong Xu
Commit df542f669307 ("net: stmmac: Switch to zero-copy in non-XDP RX path") makes DMA write received frame into buffer at offset of NET_SKB_PAD and sets page pool parameters to sync from offset of NET_SKB_PAD. But when Header Payload Split is enabled, the header is written at offset of NET_SKB_PAD, while the payload is written at offset of zero. Uncorrect offset parameter for the payload breaks dma coherence [1] since both CPU and DMA touch the page buffer from offset of zero which is not handled by the page pool sync parameter. And in case the DMA cannot split the received frame, for example, a large L2 frame, pp_params.max_len should grow to match the tail of entire frame. [1] https://lore.kernel.org/netdev/d465f277-bac7-439f-be1d-9a47dfe2d951@nvidia.com/ Reported-by: Jon Hunter <jonathanh@nvidia.com> Reported-by: Brad Griffis <bgriffis@nvidia.com> Suggested-by: Ido Schimmel <idosch@idosch.org> Fixes: df542f669307 ("net: stmmac: Switch to zero-copy in non-XDP RX path") Signed-off-by: Furong Xu <0x1207@gmail.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Thierry Reding <treding@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250207085639.13580-1-0x1207@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10r8152: add vendor/device ID pair for Dell Alienware AW1022zAleksander Jan Bajkowski
The Dell AW1022z is an RTL8156B based 2.5G Ethernet controller. Add the vendor and product ID values to the driver. This makes Ethernet work with the adapter. Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl> Link: https://patch.msgid.link/20250206224033.980115-1-olek2@wp.pl Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10Merge branch 'xsk-the-lost-bits-from-chapter-iii'Jakub Kicinski
Alexander Lobakin says: ==================== xsk: the lost bits from Chapter III Before introducing libeth_xdp, we need to add a couple more generic helpers. Notably: * 01: add generic loop unrolling hint helpers; * 04: add helper to get both xdp_desc's DMA address and metadata pointer in one go, saving several cycles and hotpath object code size in drivers (especially when unrolling). Bonus: * 02, 03: convert two drivers which were using custom macros to generic unrolled_count() (trivial, no object code changes). ==================== Link: https://patch.msgid.link/20250206182630.3914318-1-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10xsk: add helper to get &xdp_desc's DMA and meta pointer in one goAlexander Lobakin
Currently, when your driver supports XSk Tx metadata and you want to send an XSk frame, you need to do the following: * call external xsk_buff_raw_get_dma(); * call inline xsk_buff_get_metadata(), which calls external xsk_buff_raw_get_data() and then do some inline checks. This effectively means that the following piece: addr = pool->unaligned ? xp_unaligned_add_offset_to_addr(addr) : addr; is done twice per frame, plus you have 2 external calls per frame, plus this: meta = pool->addrs + addr - pool->tx_metadata_len; if (unlikely(!xsk_buff_valid_tx_metadata(meta))) is always inlined, even if there's no meta or it's invalid. Add xsk_buff_raw_get_ctx() (xp_raw_get_ctx() to be precise) to do that in one go. It returns a small structure with 2 fields: DMA address, filled unconditionally, and metadata pointer, non-NULL only if it's present and valid. The address correction is performed only once and you also have only 1 external call per XSk frame, which does all the calculations and checks outside of your hotpath. You only need to check `if (ctx.meta)` for the metadata presence. To not copy any existing code, derive address correction and getting virtual and DMA address into small helpers. bloat-o-meter reports no object code changes for the existing functionality. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20250206182630.3914318-5-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10ice: use generic unrolled_count() macroAlexander Lobakin
ice, same as i40e, has a custom loop unrolling macros for unrolling Tx descriptors filling on XSk xmit. Replace ice defs with generic unrolled_count(), which is also more convenient as it allows passing defines as its argument, not hardcoded values, while the loop declaration will still be usual for-loop. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://patch.msgid.link/20250206182630.3914318-4-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10i40e: use generic unrolled_count() macroAlexander Lobakin
i40e, as well as ice, has a custom loop unrolling macro for unrolling Tx descriptors filling on XSk xmit. Replace i40e defs with generic unrolled_count(), which is also more convenient as it allows passing defines as its argument, not hardcoded values, while the loop declaration will still be a usual for-loop. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://patch.msgid.link/20250206182630.3914318-3-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10unroll: add generic loop unroll helpersAlexander Lobakin
There are cases when we need to explicitly unroll loops. For example, cache operations, filling DMA descriptors on very high speeds etc. Add compiler-specific attribute macros to give the compiler a hint that we'd like to unroll a loop. Example usage: #define UNROLL_BATCH 8 unrolled_count(UNROLL_BATCH) for (u32 i = 0; i < UNROLL_BATCH; i++) op(priv, i); Note that sometimes the compilers won't unroll loops if they think this would have worse optimization and perf than without unrolling, and that unroll attributes are available only starting GCC 8. For older compiler versions, no hints/attributes will be applied. For better unrolling/parallelization, don't have any variables that interfere between iterations except for the iterator itself. Co-developed-by: Jose E. Marchesi <jose.marchesi@oracle.com> # pragmas Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20250206182630.3914318-2-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10net: phy: dp83td510: introduce LED framework supportOleksij Rempel
Add LED brightness, mode, HW control and polarity functions to enable external LED control in the TI DP83TD510 PHY. Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20250205103846.2273833-1-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10Merge tag 'nfsd-6.14-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: "Fixes for new bugs: - A fix for CB_GETATTR reply decoding was not quite correct - Fix the NFSD connection limiting logic - Fix a bug in the new session table resizing logic Bugs that pre-date v6.14: - Support for courteous clients (5.19) introduced a shutdown hang - Fix a crash in the filecache laundrette (6.9) - Fix a zero-day crash in NFSD's NFSv3 ACL implementation" * tag 'nfsd-6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: NFSD: Fix CB_GETATTR status fix NFSD: fix hang in nfsd4_shutdown_callback nfsd: fix __fh_verify for localio nfsd: fix uninitialised slot info when a request is retried nfsd: validate the nfsd_serv pointer before calling svc_wake_up nfsd: clear acl_access/acl_default after releasing them
2025-02-10e1000e: Fix real-time violations on link upGerhard Engleder
Link down and up triggers update of MTA table. This update executes many PCIe writes and a final flush. Thus, PCIe will be blocked until all writes are flushed. As a result, DMA transfers of other targets suffer from delay in the range of 50us. This results in timing violations on real-time systems during link down and up of e1000e in combination with an Intel i3-2310E Sandy Bridge CPU. The i3-2310E is quite old. Launched 2011 by Intel but still in use as robot controller. The exact root cause of the problem is unclear and this situation won't change as Intel support for this CPU has ended years ago. Our experience is that the number of posted PCIe writes needs to be limited at least for real-time systems. With posted PCIe writes a much higher throughput can be generated than with PCIe reads which cannot be posted. Thus, the load on the interconnect is much higher. Additionally, a PCIe read waits until all posted PCIe writes are done. Therefore, the PCIe read can block the CPU for much more than 10us if a lot of PCIe writes were posted before. Both issues are the reason why we are limiting the number of posted PCIe writes in row in general for our real-time systems, not only for this driver. A flush after a low enough number of posted PCIe writes eliminates the delay but also increases the time needed for MTA table update. The following measurements were done on i3-2310E with e1000e for 128 MTA table entries: Single flush after all writes: 106us Flush after every write: 429us Flush after every 2nd write: 266us Flush after every 4th write: 180us Flush after every 8th write: 141us Flush after every 16th write: 121us A flush after every 8th write delays the link up by 35us and the negative impact to DMA transfers of other targets is still tolerable. Execute a flush after every 8th write. This prevents overloading the interconnect with posted writes. Signed-off-by: Gerhard Engleder <eg@keba.com> Link: https://lore.kernel.org/netdev/f8fe665a-5e6c-4f95-b47a-2f3281aa0e6c@lunn.ch/T/ CC: Vitaly Lifshits <vitaly.lifshits@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Avigail Dahan <avigailx.dahan@intel.com> Reviewed-by: Vitaly Lifshits <vitaly.lifshits@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10igc: Avoid unnecessary link down event in XDP_SETUP_PROG processSong Yoong Siang
The igc_close()/igc_open() functions are too drastic for installing a new XDP prog because they cause undesirable link down event and device reset. To avoid delays in Ethernet traffic, improve the XDP_SETUP_PROG process by using the same sequence as igc_xdp_setup_pool(), which performs only the necessary steps, as follows: 1. stop the traffic and clean buffer 2. stop NAPI 3. install the XDP program 4. resume NAPI 5. allocate buffer and resume the traffic This patch has been tested using the 'ip link set xdpdrv' command to attach a simple XDP prog that always returns XDP_PASS. Before this patch, attaching xdp program will cause ptp4l to lose sync for few seconds, as shown in ptp4l log below: ptp4l[198.082]: rms 4 max 8 freq +906 +/- 2 delay 12 +/- 0 ptp4l[199.082]: rms 3 max 4 freq +906 +/- 3 delay 12 +/- 0 ptp4l[199.536]: port 1 (enp2s0): link down ptp4l[199.536]: port 1 (enp2s0): SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED) ptp4l[199.600]: selected local clock 22abbc.fffe.bb1234 as best master ptp4l[199.600]: port 1 (enp2s0): assuming the grand master role ptp4l[199.600]: port 1 (enp2s0): master state recommended in slave only mode ptp4l[199.600]: port 1 (enp2s0): defaultDS.priority1 probably misconfigured ptp4l[202.266]: port 1 (enp2s0): link up ptp4l[202.300]: port 1 (enp2s0): FAULTY to LISTENING on INIT_COMPLETE ptp4l[205.558]: port 1 (enp2s0): new foreign master 44abbc.fffe.bb2144-1 ptp4l[207.558]: selected best master clock 44abbc.fffe.bb2144 ptp4l[207.559]: port 1 (enp2s0): LISTENING to UNCALIBRATED on RS_SLAVE ptp4l[208.308]: port 1 (enp2s0): UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED ptp4l[208.933]: rms 742 max 1303 freq -195 +/- 682 delay 12 +/- 0 ptp4l[209.933]: rms 178 max 274 freq +387 +/- 243 delay 12 +/- 0 After this patch, attaching xdp program no longer cause ptp4l to lose sync, as shown in ptp4l log below: ptp4l[201.183]: rms 1 max 3 freq +959 +/- 1 delay 8 +/- 0 ptp4l[202.183]: rms 1 max 3 freq +961 +/- 2 delay 8 +/- 0 ptp4l[203.183]: rms 2 max 3 freq +958 +/- 2 delay 8 +/- 0 ptp4l[204.183]: rms 3 max 5 freq +961 +/- 3 delay 8 +/- 0 ptp4l[205.183]: rms 2 max 4 freq +964 +/- 3 delay 8 +/- 0 Besides, before this patch, attaching xdp program will causes flood ping to lose 10 packets, as shown in ping statistics below: --- 169.254.1.2 ping statistics --- 100000 packets transmitted, 99990 received, +6 errors, 0.01% packet loss, time 34001ms rtt min/avg/max/mdev = 0.028/0.301/3104.360/13.838 ms, pipe 10, ipg/ewma 0.340/0.243 ms After this patch, attaching xdp program no longer cause flood ping to loss any packets, as shown in ping statistics below: --- 169.254.1.2 ping statistics --- 100000 packets transmitted, 100000 received, 0% packet loss, time 32326ms rtt min/avg/max/mdev = 0.027/0.231/19.589/0.155 ms, pipe 2, ipg/ewma 0.323/0.322 ms On the other hand, this patch has been tested with tools/testing/selftests/ bpf/xdp_hw_metadata app to make sure AF_XDP zero-copy is working fine with XDP Tx and Rx metadata. Below is the result of last packet after received 10000 UDP packets with interval 1 ms: poll: 1 (0) skip=0 fail=0 redir=10000 xsk_ring_cons__peek: 1 0x55881c7ef7a8: rx_desc[9999]->addr=8f110 addr=8f110 comp_addr=8f110 EoP rx_hash: 0xFB9BB6A3 with RSS type:0x1 HW RX-time: 1733923136269470866 (sec:1733923136.2695) delta to User RX-time sec:0.0000 (43.280 usec) XDP RX-time: 1733923136269482482 (sec:1733923136.2695) delta to User RX-time sec:0.0000 (31.664 usec) No rx_vlan_tci or rx_vlan_proto, err=-95 0x55881c7ef7a8: ping-pong with csum=ab19 (want 315b) csum_start=34 csum_offset=6 0x55881c7ef7a8: complete tx idx=9999 addr=f010 HW TX-complete-time: 1733923136269591637 (sec:1733923136.2696) delta to User TX-complete-time sec:0.0001 (108.571 usec) XDP RX-time: 1733923136269482482 (sec:1733923136.2695) delta to User TX-complete-time sec:0.0002 (217.726 usec) HW RX-time: 1733923136269470866 (sec:1733923136.2695) delta to HW TX-complete-time sec:0.0001 (120.771 usec) 0x55881c7ef7a8: complete rx idx=10127 addr=8f110 Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com> Tested-by: Avigail Dahan <avigailx.dahan@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10ice: refactor ice_fdir_create_dflt_rules() functionMateusz Polchlopek
The Flow Director function ice_fdir_create_dflt_rules() calls few times function ice_create_init_fdir_rule() each time with different enum ice_fltr_ptype parameter. Next step is to return error code if error occurred. Change the code to store all necessary default rules in constant array and call ice_create_init_fdir_rule() in the loop. It makes it easy to extend the list of default rules in the future, without the need of duplicate code more and more. Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10ice: Implement PTP support for E830 devicesMichal Michalik
Add specific functions and definitions for E830 devices to enable PTP support. E830 devices support direct write to GLTSYN_ registers without shadow registers and 64 bit read of PHC time. Enable PTM for E830 device, which is required for cross timestamp and and dependency on PCIE_PTM for ICE_HWTS. Check X86_FEATURE_ART for E830 as it may not be present in the CPU. Cc: Anna-Maria Behnsen <anna-maria@linutronix.de> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Co-developed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Co-developed-by: Milena Olech <milena.olech@intel.com> Signed-off-by: Milena Olech <milena.olech@intel.com> Co-developed-by: Paul Greenwalt <paul.greenwalt@intel.com> Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Signed-off-by: Michal Michalik <michal.michalik@intel.com> Co-developed-by: Karol Kolacinski <karol.kolacinski@intel.com> Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10ice: Refactor ice_ptp_init_tx_*Karol Kolacinski
Unify ice_ptp_init_tx_* functions for most of the MAC types except E82X. This simplifies the code for the future use with new MAC types. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>