Age Commit message Author
2024-10-15Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Paolo Abeni
Daniel Borkmann says:

====================
pull-request: bpf-next 2024-10-14

The following pull-request contains BPF updates for your *net-next* tree.

We've added 21 non-merge commits during the last 18 day(s) which contain a total of 21 files changed, 1185 insertions(+), 127 deletions(-).

The main changes are:

1) Put xsk sockets on a struct diet and add various cleanups. Overall, this helps to bump performance by 12% for some workloads, from Maciej Fijalkowski.

2) Extend BPF selftests to increase coverage of XDP features in combination with BPF cpumap, from Alexis Lothoré (eBPF Foundation).

3) Extend netkit with an option to delegate skb->{mark,priority} scrubbing to its BPF program, from Daniel Borkmann.

4) Make the bpf_get_netns_cookie() helper available also to tc(x) BPF programs, from Mahe Tardy.

5) Extend BPF selftests covering a BPF program setting socket options per MPTCP subflow, from Geliang Tang and Nicolas Rybowski.

bpf-next-for-netdev

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (21 commits)
  xsk: Use xsk_buff_pool directly for cq functions
  xsk: Wrap duplicated code to function
  xsk: Carry a copy of xdp_zc_max_segs within xsk_buff_pool
  xsk: Get rid of xdp_buff_xsk::orig_addr
  xsk: s/free_list_node/list_node/
  xsk: Get rid of xdp_buff_xsk::xskb_list_node
  selftests/bpf: check program redirect in xdp_cpumap_attach
  selftests/bpf: make xdp_cpumap_attach keep redirect prog attached
  selftests/bpf: fix bpf_map_redirect call for cpu map test
  selftests/bpf: add tcx netns cookie tests
  bpf: add get_netns_cookie helper to tc programs
  selftests/bpf: add missing header include for htons
  selftests/bpf: Extend netkit tests to validate skb meta data
  tools: Sync if_link.h uapi tooling header
  netkit: Add add netkit scrub support to rt_link.yaml
  netkit: Simplify netkit mode over to use NLA_POLICY_MAX
  netkit: Add option for scrubbing skb meta data
  bpf: Remove unused macro
  selftests/bpf: Add mptcp subflow subtest
  selftests/bpf: Add getsockopt to inspect mptcp subflow
  ...
====================

Link: https://patch.msgid.link/20241014211110.16562-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-15net: gianfar: Use __be64 * to store pointers to big endian valuesSimon Horman
Timestamp values are read using pointers to 64-bit big endian values. But the type of these pointers is u64 *, host byte order. Use __be64 * instead. Flagged by Sparse: .../gianfar.c:2212:60: warning: cast to restricted __be64 .../gianfar.c:2475:53: warning: cast to restricted __be64 Introduced by commit cc772ab7cdca ("gianfar: Add hardware RX timestamping support"). Compile tested only. No functional change intended. Signed-off-by: Simon Horman <horms@kernel.org> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Link: https://patch.msgid.link/20241011-gianfar-be64-v1-1-a77ebe972176@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
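The distinction between a big-endian value and a host-order integer can be illustrated outside the kernel. The following is a minimal user-space sketch, not the gianfar code: `demo_be64`, `demo_be64_to_cpu()` and `demo_read_rx_timestamp()` are hypothetical names standing in for the kernel's `__be64` type and its conversion helpers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's __be64: eight bytes stored
 * most-significant first. Wrapping them in a struct prevents accidental
 * mixing with host-order integers, which is what Sparse checks for. */
typedef struct { uint8_t bytes[8]; } demo_be64;

/* Convert a big-endian 64-bit value to host byte order. Assembling the
 * result byte by byte works regardless of the host's endianness. */
uint64_t demo_be64_to_cpu(const demo_be64 *p)
{
    uint64_t v = 0;

    assert(p != NULL);
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p->bytes[i];
    return v;
}

/* Read a timestamp that is stored big-endian at the start of a buffer. */
uint64_t demo_read_rx_timestamp(const uint8_t *buf)
{
    demo_be64 raw;

    memcpy(&raw, buf, sizeof(raw));
    return demo_be64_to_cpu(&raw);
}
```

Keeping the raw value in its own type until the explicit conversion is the same discipline that the `__be64 *` annotation enforces at build time.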
2024-10-15rtnl_net_debug: Remove rtnl_net_debug_exit().Kuniyuki Iwashima
kernel test robot reported section mismatch in rtnl_net_debug_exit(). WARNING: modpost: vmlinux: section mismatch in reference: rtnl_net_debug_exit+0x20 (section: .exit.text) -> rtnl_net_debug_net_ops (section: .init.data) rtnl_net_debug_exit() uses rtnl_net_debug_net_ops() that is annotated as __net_initdata, but this file is always built-in. Let's remove rtnl_net_debug_exit(). Fixes: 03fa53485659 ("rtnetlink: Add ASSERT_RTNL_NET() placeholder for netdev notifier.") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202410101854.i0vQCaDz-lkp@intel.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20241010172433.67694-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-15octeontx2-af: Fix potential integer overflows on integer shiftsColin Ian King
The left shift of the 32 bit integer constant 1 is evaluated using 32 bit arithmetic and then assigned to a 64 bit unsigned integer. In the case where the shift is 32 or more this can lead to an overflow. Avoid this by shifting using the BIT_ULL macro instead. Fixes: 019aba04f08c ("octeontx2-af: Modify SMQ flush sequence to drop packets") Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://patch.msgid.link/20241010154519.768785-1-colin.i.king@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
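The shift problem described above can be reproduced in plain C. This is a small sketch under stated assumptions: `DEMO_BIT_ULL` is a local macro written to mirror the idea of the kernel's BIT_ULL(), and `demo_single_bit_mask()` is a hypothetical helper, not code from the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Local illustration of the idea behind BIT_ULL(): the constant 1 is
 * widened to unsigned long long before the shift takes place. */
#define DEMO_BIT_ULL(nr) (1ULL << (nr))

/* Build a 64-bit mask with one bit set. Writing `1 << bit` instead would
 * shift a 32-bit int, which is undefined for bit >= 32 even though the
 * result is assigned to a 64-bit variable afterwards. */
uint64_t demo_single_bit_mask(unsigned int bit)
{
    assert(bit < 64);
    return DEMO_BIT_ULL(bit);
}
```

The width of the result variable does not influence how the shift itself is evaluated, which is why the operand has to be widened first.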
2024-10-15tools: ynl-gen: use names of constants in generated limitsJakub Kicinski
YNL specs can use string expressions for limits, like s32-min or u16-max. We convert all of those into their numeric values when generating the code, which isn't always helpful. Try to retain the string representations in the output. Any sort of calculations still need the integers. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20241010151248.2049755-1-kuba@kernel.org [pabeni@redhat.com: regenerated netdev-genl-gen.c] Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-15net: ethernet: ti: am65-cpsw: Enable USXGMII mode for J7200 CPSW5GSiddharth Vadapalli
TI's J7200 SoC supports USXGMII mode. Add USXGMII mode to the extra_modes member of the J7200 SoC data. Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com> Reviewed-by: Roger Quadros <rogerq@kernel.org> Link: https://patch.msgid.link/20241010150543.2620448-1-s-vadapalli@ti.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-15net: stmmac: dwmac-tegra: Fix link bring-up sequenceParitosh Dixit
The Tegra MGBE driver sometimes fails to initialize, reporting the following error, and as a result, it is unable to acquire an IP address with DHCP:

  tegra-mgbe 6800000.ethernet: timeout waiting for link to become ready

As per the recommendation from the Tegra hardware design team, fix this issue by:
- clearing the PHY_RDY bit before setting the CDR_RESET bit and then setting the PHY_RDY bit before clearing the CDR_RESET bit. This ensures valid data is present at UPHY RX inputs before starting the CDR lock.
- adding the required delays when bringing up the UPHY lane. Note we need to use delays here because there is no alternative, such as polling, for these cases. usleep_range() is used instead of ndelay() because sleeping is preferred over a busy-wait loop.

Without this change we would see link failures on boot sometimes as often as 1 in 5 boots. With this fix we have not observed any failures in over 1000 boots.

Fixes: d8ca113724e7 ("net: stmmac: tegra: Add MGBE support")
Signed-off-by: Paritosh Dixit <paritoshd@nvidia.com>
Link: https://patch.msgid.link/20241010142908.602712-1-paritoshd@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-15net: usb: usbnet: fix race in probe failureOliver Neukum
The same bug as in the disconnect code path also exists in the case of a failure late during the probe process. The flag must also be set. Signed-off-by: Oliver Neukum <oneukum@suse.com> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Link: https://patch.msgid.link/20241010131934.1499695-1-oneukum@suse.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-15net: phy: intel-xway: add support for PHY LEDsDaniel Golle
The intel-xway PHY driver predates the PHY LED framework and currently initializes all LED pins to equal default values. Add PHY LED functions to the driver and don't set default values if LEDs are defined in device tree. According to the datasheets, 3 LEDs are supported on all Intel XWAY PHYs. Signed-off-by: Daniel Golle <daniel@makrotopia.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/81f4717ab9acf38f3239727a4540ae96fd01109b.1728558223.git.daniel@makrotopia.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-15net: phy: mxl-gpy: correctly describe LED polarityDaniel Golle
According to the datasheet covering the LED (0x1b) register:

  0B Active High LEDx pin driven high when activated
  1B Active Low LEDx pin driven low when activated

Make use of the now available 'active-high' property and correctly reflect the polarity setting which was previously inverted.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/180ccafa837f09908b852a8a874a3808c5ecd2d0.1728558223.git.daniel@makrotopia.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-15net: phy: aquantia: correctly describe LED polarity overrideDaniel Golle
Use newly defined 'active-high' property to set the VEND1_GLOBAL_LED_DRIVE_VDD bit and let 'active-low' clear that bit. This reflects the technical reality which was inverted in the previous description in which the 'active-low' property was used to actually set the VEND1_GLOBAL_LED_DRIVE_VDD bit, which means that VDD (ie. supply voltage) of the LED is driven rather than GND. Signed-off-by: Daniel Golle <daniel@makrotopia.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/86a413b4387c42dcb54f587cc2433a06f16aae83.1728558223.git.daniel@makrotopia.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-15net: phy: support 'active-high' property for PHY LEDsDaniel Golle
In addition to 'active-low' and 'inactive-high-impedance' also support 'active-high' property for PHY LED pin configuration. As only either 'active-high' or 'active-low' can be set at the same time, WARN and return an error in case both are set. Signed-off-by: Daniel Golle <daniel@makrotopia.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/91598487773d768f254d5faf06cf65b13e972f0e.1728558223.git.daniel@makrotopia.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
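The mutual-exclusion rule for the two polarity properties can be sketched as a small validation step. This is an illustrative user-space model only: `struct demo_led_props` and `demo_led_check_polarity()` are hypothetical names, and the kernel's WARN is not reproduced here, only the error return.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flags mirroring the three device-tree properties named in
 * the commit message; the field names are illustrative. */
struct demo_led_props {
    bool active_high;
    bool active_low;
    bool inactive_high_impedance;
};

/* Return 0 when the polarity properties are consistent, or -EINVAL when
 * both 'active-high' and 'active-low' are set at the same time. */
int demo_led_check_polarity(const struct demo_led_props *p)
{
    assert(p != NULL);
    if (p->active_high && p->active_low)
        return -EINVAL;
    return 0;
}
```

In this sketch 'inactive-high-impedance' is treated as independent of polarity, so it may be combined with either of the other two flags.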
2024-10-15Merge branch 'make-phy-output-rmii-reference-clock'Paolo Abeni
Wei Fang says: ==================== make PHY output RMII reference clock The TJA11xx PHYs have the capability to provide a 50MHz reference clock in RMII mode and output it on the REF_CLK pin. Therefore, add the new property "nxp,rmii-refclk-out" to support this feature. This property is only available for PHYs which use the nxp-c45-tja11xx driver, such as TJA1103, TJA1104, TJA1120 and TJA1121. ==================== Link: https://patch.msgid.link/20241010061944.266966-1-wei.fang@nxp.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-15net: phy: c45-tja11xx: add support for outputting RMII reference clockWei Fang
TJA11xx PHYs have the capability to output a 50MHz reference clock on the REF_CLK pin in RMII mode, which is called "revRMII" mode in the PHY data sheet. Signed-off-by: Wei Fang <wei.fang@nxp.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-15dt-bindings: net: tja11xx: add "nxp,rmii-refclk-out" propertyWei Fang
Per the RMII specification, the REF_CLK is sourced from MAC to PHY or from an external source. TJA11xx PHYs, however, support outputting a 50MHz RMII reference clock on the REF_CLK pin. Previously, "nxp,rmii-refclk-in" was added to indicate that in RMII mode, if this property is present, REF_CLK is an input to the PHY; otherwise it is an output. This seems inappropriate now: according to the RMII specification, REF_CLK is an input by default, so there is no need for an additional "nxp,rmii-refclk-in" property to declare that REF_CLK is an input. Unfortunately, because the "nxp,rmii-refclk-in" property has existed for a while, and we cannot confirm which DTS use the TJA1100 and TJA1101 PHYs, changing it to switch polarity would cause an ABI break. Fortunately, this property is only valid for TJA1100 and TJA1101. For TJA1103/TJA1104/TJA1120/TJA1121 PHYs, this property is invalid because they use the nxp-c45-tja11xx driver, which is a different driver from the one used by TJA1100/TJA1101. Therefore, for PHYs using the nxp-c45-tja11xx driver, add the "nxp,rmii-refclk-out" property to support outputting the RMII reference clock on the REF_CLK pin. Signed-off-by: Wei Fang <wei.fang@nxp.com> Reviewed-by: Rob Herring (Arm) <robh@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-14selftests: net: move EXTRA_CLEAN of libynl.a into ynl.mkJakub Kicinski
Commit 1fd9e4f25782 ("selftests: make kselftest-clean remove libynl outputs") added EXTRA_CLEAN of YNL generated files to ynl.mk. We already had an EXTRA_CLEAN in the file including the snippet. Consolidate them. Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20241011230311.2529760-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14selftests: net: rebuild YNL if dependencies changedJakub Kicinski
Try to rebuild YNL if either the user added a new family or the specs of the families have changed. Stanislav's ncdevmem causes a false positive build failure in NIPA because libynl.a isn't rebuilt after ethtool is added to YNL_GENS. Note that sha1sum is already used in other parts of the build system. Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20241011230311.2529760-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14net: mtk_eth_soc: use ethtool_putsRosen Penev
Allows simplifying get_strings and avoids manual pointer manipulation. Tested on Belkin RT1800. Signed-off-by: Rosen Penev <rosenp@gmail.com> Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com> Link: https://patch.msgid.link/20241011200225.7403-1-rosenp@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
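What "avoids manual pointer manipulation" means can be shown with a user-space analogue. This is a sketch, not the driver change: `demo_puts()` and `demo_get_strings()` are hypothetical helpers, and the fixed slot size `DEMO_GSTRING_LEN` is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_GSTRING_LEN 32  /* assumed fixed slot size for this sketch */

/* Copy `str` into the current fixed-size slot and advance the cursor to
 * the next slot, so callers never do the pointer arithmetic themselves. */
void demo_puts(unsigned char **data, const char *str)
{
    assert(data != NULL && *data != NULL && str != NULL);
    memset(*data, 0, DEMO_GSTRING_LEN);
    strncpy((char *)*data, str, DEMO_GSTRING_LEN - 1);
    *data += DEMO_GSTRING_LEN;
}

/* Fill a string table with `n` names; returns the number of bytes used. */
size_t demo_get_strings(unsigned char *buf, const char *const *names, size_t n)
{
    unsigned char *cursor = buf;

    for (size_t i = 0; i < n; i++)
        demo_puts(&cursor, names[i]);
    return (size_t)(cursor - buf);
}
```

Because the helper owns the cursor update, the caller's loop body shrinks to a single call per string.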
2024-10-14net: mvneta: use ethtool_putsRosen Penev
Allows simplifying get_strings and avoids manual pointer manipulation. Tested on Turris Omnia. Signed-off-by: Rosen Penev <rosenp@gmail.com> Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com> Link: https://patch.msgid.link/20241011195955.7065-1-rosenp@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14Merge branch 'add-support-for-per-napi-config-via-netlink'Jakub Kicinski
Joe Damato says: ==================== Add support for per-NAPI config via netlink Greetings: Welcome to v6. Minor changes from v5 [1], please see changelog below. There were no explicit comments from reviewers on the call outs in my v5, so I'm retaining them from my previous cover letter just in case :) A few important call outs for reviewers: 1. This revision seems to work (see below for a full walk through). I think this is the behavior we talked about, but please let me know if a use case is missing. 2. Re a previous point made by Stanislav regarding "taking over a NAPI ID" when the channel count changes: mlx5 seems to call napi_disable followed by netif_napi_del for the old queues and then calls napi_enable for the new ones. In this RFC, the NAPI ID generation is deferred to napi_enable. This means we won't end up with two of the same NAPI IDs added to the hash at the same time. Can we assume all drivers will napi_disable the old queues before napi_enable the new ones? - If yes: we might not need to worry about a NAPI ID takeover function. - If no: I'll need to make a change so that the NAPI ID generation is deferred only for drivers which have opted into the config space via calls to netif_napi_add_config 3. I made the decision to remove the WARN_ON_ONCE that (I think?) Jakub previously suggested in alloc_netdev_mqs (WARN_ON_ONCE(txqs != rxqs);) because this was triggering on every kernel boot with my mlx5 NIC. 4. I left the "maxqs = max(txqs, rxqs);" in alloc_netdev_mqs despite thinking this is a bit strange. I think it's strange that we might be short some number of NAPI configs, but it seems like most people are in favor of this approach, so I've left it. I'd appreciate thoughts from reviewers on the above items, if at all possible. Now, on to the implementation. 
Firstly, this implementation moves certain settings to napi_struct so that they are "per-NAPI", while taking care to respect existing sysfs parameters which are interface wide and affect all NAPIs:
- NAPI ID
- gro_flush_timeout
- defer_hard_irqs

Furthermore:
- NAPI ID generation and addition to the hash is now deferred to napi_enable, instead of during netif_napi_add
- NAPIs are removed from the hash during napi_disable, instead of netif_napi_del.
- An array of "struct napi_config" is allocated in net_device.

IMPORTANT: The above changes affect all network drivers.

Optionally, drivers may opt-in to using their config space by calling netif_napi_add_config instead of netif_napi_add. If a driver does this, the NAPI being added is linked with an allocated "struct napi_config" and the per-NAPI settings (including NAPI ID) are persisted even as hardware queues are destroyed and recreated.

To help illustrate how this would end up working, I've added patches for 3 drivers, of which I have access to only 1:
- mlx5 which is the basis of the examples below
- mlx4 which has TX only NAPIs, just to highlight that case. I have only compile tested this patch; I don't have this hardware.
- bnxt which I have only compile tested. I don't have this hardware.

NOTE: I only tested this on mlx5; I have no access to the other hardware for which I provided patches.
Hopefully other folks can help test :)

Here's how it works when I test it on my mlx5 system:

  $ ethtool -l eth4 | grep Combined | tail -1
  Combined: 2

First, output the current NAPI settings:

  $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
      --dump napi-get --json='{"ifindex": 7}'
  [{'defer-hard-irqs': 0, 'gro-flush-timeout': 0, 'id': 345, 'ifindex': 7, 'irq': 527},
   {'defer-hard-irqs': 0, 'gro-flush-timeout': 0, 'id': 344, 'ifindex': 7, 'irq': 327}]

Now, set the global sysfs parameters:

  $ sudo bash -c 'echo 20000 >/sys/class/net/eth4/gro_flush_timeout'
  $ sudo bash -c 'echo 100 >/sys/class/net/eth4/napi_defer_hard_irqs'

Output current NAPI settings again:

  $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
      --dump napi-get --json='{"ifindex": 7}'
  [{'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 345, 'ifindex': 7, 'irq': 527},
   {'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 344, 'ifindex': 7, 'irq': 327}]

Now set NAPI ID 345, via its NAPI ID, to specific values:

  $ sudo ./tools/net/ynl/cli.py \
      --spec Documentation/netlink/specs/netdev.yaml \
      --do napi-set \
      --json='{"id": 345, "defer-hard-irqs": 111, "gro-flush-timeout": 11111}'
  None

Now output current NAPI settings again to ensure only NAPI ID 345 changed:

  $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
      --dump napi-get --json='{"ifindex": 7}'
  [{'defer-hard-irqs': 111, 'gro-flush-timeout': 11111, 'id': 345, 'ifindex': 7, 'irq': 527},
   {'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 344, 'ifindex': 7, 'irq': 327}]

Now, increase gro-flush-timeout only:

  $ sudo ./tools/net/ynl/cli.py \
      --spec Documentation/netlink/specs/netdev.yaml \
      --do napi-set --json='{"id": 345, "gro-flush-timeout": 44444}'
  None

Now output the current NAPI settings once more:

  $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
      --dump napi-get --json='{"ifindex": 7}'
  [{'defer-hard-irqs': 111, 'gro-flush-timeout': 44444, 'id': 345, 'ifindex': 7, 'irq': 527},
   {'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 344, 'ifindex': 7, 'irq': 327}]

Now set NAPI ID 345 to have gro_flush_timeout of 0:

  $ sudo ./tools/net/ynl/cli.py \
      --spec Documentation/netlink/specs/netdev.yaml \
      --do napi-set --json='{"id": 345, "gro-flush-timeout": 0}'
  None

Check that NAPI ID 345 has a value of 0:

  $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
      --dump napi-get --json='{"ifindex": 7}'
  [{'defer-hard-irqs': 111, 'gro-flush-timeout': 0, 'id': 345, 'ifindex': 7, 'irq': 527},
   {'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 344, 'ifindex': 7, 'irq': 327}]

Change the queue count, ensuring that NAPI ID 345 retains its settings:

  $ sudo ethtool -L eth4 combined 4

Check that the new queues have the system wide settings but that NAPI ID 345 remains unchanged:

  $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
      --dump napi-get --json='{"ifindex": 7}'
  [{'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 347, 'ifindex': 7, 'irq': 529},
   {'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 346, 'ifindex': 7, 'irq': 528},
   {'defer-hard-irqs': 111, 'gro-flush-timeout': 0, 'id': 345, 'ifindex': 7, 'irq': 527},
   {'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 344, 'ifindex': 7, 'irq': 327}]

Now reduce the queue count below where NAPI ID 345 is indexed:

  $ sudo ethtool -L eth4 combined 1

Check the output:

  $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
      --dump napi-get --json='{"ifindex": 7}'
  [{'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 344, 'ifindex': 7, 'irq': 327}]

Re-increase the queue count to ensure NAPI ID 345 is re-assigned the same values:

  $ sudo ethtool -L eth4 combined 2
  $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
      --dump napi-get --json='{"ifindex": 7}'
  [{'defer-hard-irqs': 111, 'gro-flush-timeout': 0, 'id': 345, 'ifindex': 7, 'irq': 527},
   {'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 344, 'ifindex': 7, 'irq': 327}]

Create new queues to ensure the sysfs globals are used for the new NAPIs but that NAPI ID 345 is unchanged:

  $ sudo ethtool -L eth4 combined 8
  $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
      --dump napi-get --json='{"ifindex": 7}'
  [...]
   {'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 346, 'ifindex': 7, 'irq': 528},
   {'defer-hard-irqs': 111, 'gro-flush-timeout': 0, 'id': 345, 'ifindex': 7, 'irq': 527},
   {'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 344, 'ifindex': 7, 'irq': 327}]

Last, but not least, let's try writing the sysfs parameters to ensure all NAPIs are rewritten:

  $ sudo bash -c 'echo 33333 >/sys/class/net/eth4/gro_flush_timeout'
  $ sudo bash -c 'echo 222 >/sys/class/net/eth4/napi_defer_hard_irqs'

Check that worked:

  $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
      --dump napi-get --json='{"ifindex": 7}'
  [...]
   {'defer-hard-irqs': 222, 'gro-flush-timeout': 33333, 'id': 346, 'ifindex': 7, 'irq': 528},
   {'defer-hard-irqs': 222, 'gro-flush-timeout': 33333, 'id': 345, 'ifindex': 7, 'irq': 527},
   {'defer-hard-irqs': 222, 'gro-flush-timeout': 33333, 'id': 344, 'ifindex': 7, 'irq': 327}]

[1]: https://lore.kernel.org/20241009005525.13651-1-jdamato@fastly.com

v5: https://lore.kernel.org/20241009005525.13651-1-jdamato@fastly.com
rfcv4: https://lore.kernel.org/lkml/20241001235302.57609-1-jdamato@fastly.com
rfcv3: https://lore.kernel.org/20240912100738.16567-8-jdamato@fastly.com
rfcv2: https://lore.kernel.org/20240908160702.56618-1-jdamato@fastly.com
====================

Link: https://patch.msgid.link/20241011184527.16393-1-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14mlx4: Add support for persistent NAPI config to RX CQsJoe Damato
Use netif_napi_add_config to assign persistent per-NAPI config when initializing RX CQ NAPIs. Presently, struct napi_config only has support for two fields used for RX, so there is no need to support them with TX CQs, yet. Signed-off-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20241011184527.16393-10-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14mlx5: Add support for persistent NAPI configJoe Damato
Use netif_napi_add_config to assign persistent per-NAPI config when initializing NAPIs. Signed-off-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20241011184527.16393-9-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14bnxt: Add support for persistent NAPI configJoe Damato
Use netif_napi_add_config to assign persistent per-NAPI config when initializing NAPIs. Signed-off-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20241011184527.16393-8-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14netdev-genl: Support setting per-NAPI config valuesJoe Damato
Add support to set per-NAPI defer_hard_irqs and gro_flush_timeout. Signed-off-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241011184527.16393-7-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14net: napi: Add napi_configJoe Damato
Add a persistent NAPI config area for NAPI configuration to the core. Drivers opt-in to setting the persistent config for a NAPI by passing an index when calling netif_napi_add_config. napi_config is allocated in alloc_netdev_mqs, freed in free_netdev (after the NAPIs are deleted). Drivers which call netif_napi_add_config will have persistent per-NAPI settings: NAPI IDs, gro_flush_timeout, and defer_hard_irq settings. Per-NAPI settings are saved in napi_disable and restored in napi_enable. Co-developed-by: Martin Karsten <mkarsten@uwaterloo.ca> Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca> Signed-off-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241011184527.16393-6-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
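The save-on-disable and restore-on-enable behaviour described above can be modelled in a few lines of user-space C. This is a simplified sketch under stated assumptions, not the kernel implementation: every `demo_` name is hypothetical, and details such as NAPI ID generation and the hash are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Persistent slot that outlives any individual NAPI instance. */
struct demo_napi_config {
    unsigned long gro_flush_timeout;
    unsigned int defer_hard_irqs;
    unsigned int napi_id;
};

/* Live NAPI instance; `config` is NULL if the driver did not opt in. */
struct demo_napi {
    unsigned long gro_flush_timeout;
    unsigned int defer_hard_irqs;
    unsigned int napi_id;
    struct demo_napi_config *config;
    bool enabled;
};

/* Opt in: link the NAPI to the persistent slot at `index`. */
void demo_napi_add_config(struct demo_napi *n,
                          struct demo_napi_config *slots, size_t index)
{
    assert(n != NULL && slots != NULL);
    n->config = &slots[index];
    n->enabled = false;
}

/* Restore the saved settings when the NAPI is enabled. */
void demo_napi_enable(struct demo_napi *n)
{
    assert(n != NULL);
    if (n->config) {
        n->gro_flush_timeout = n->config->gro_flush_timeout;
        n->defer_hard_irqs = n->config->defer_hard_irqs;
        n->napi_id = n->config->napi_id;
    }
    n->enabled = true;
}

/* Save the current settings when the NAPI is disabled. */
void demo_napi_disable(struct demo_napi *n)
{
    assert(n != NULL);
    if (n->config) {
        n->config->gro_flush_timeout = n->gro_flush_timeout;
        n->config->defer_hard_irqs = n->defer_hard_irqs;
        n->config->napi_id = n->napi_id;
    }
    n->enabled = false;
}
```

In this model a queue that is torn down and later recreated at the same index picks up the values its predecessor had when it was disabled, which is the persistence property the commit describes.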
2024-10-14netdev-genl: Dump gro_flush_timeoutJoe Damato
Support dumping gro_flush_timeout for a NAPI ID. Signed-off-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241011184527.16393-5-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14net: napi: Make gro_flush_timeout per-NAPIJoe Damato
Allow per-NAPI gro_flush_timeout setting. The existing sysfs parameter is respected; writes to sysfs will write to all NAPI structs for the device and the net_device gro_flush_timeout field. Reads from sysfs will read from the net_device field. The ability to set gro_flush_timeout on specific NAPI instances will be added in a later commit, via netdev-genl. Signed-off-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20241011184527.16393-4-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
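The relationship between the device-wide sysfs value and the per-NAPI copies can be sketched as follows. This is a user-space model with hypothetical names (`demo_netdev`, `demo_sysfs_store()`, `demo_sysfs_show()`, `demo_napi_set()`); the per-NAPI setter stands in for the netdev-genl path that the message says arrives in a later commit.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_NAPI 8  /* arbitrary bound for this sketch */

struct demo_napi_inst {
    unsigned long gro_flush_timeout;
};

struct demo_netdev {
    unsigned long gro_flush_timeout;  /* device-wide field, what sysfs reads */
    struct demo_napi_inst napi[DEMO_MAX_NAPI];
    size_t n_napi;
};

/* Device-wide write: update the device field and every NAPI instance. */
void demo_sysfs_store(struct demo_netdev *dev, unsigned long val)
{
    assert(dev != NULL && dev->n_napi <= DEMO_MAX_NAPI);
    dev->gro_flush_timeout = val;
    for (size_t i = 0; i < dev->n_napi; i++)
        dev->napi[i].gro_flush_timeout = val;
}

/* Device-wide read: report the device field only. */
unsigned long demo_sysfs_show(const struct demo_netdev *dev)
{
    assert(dev != NULL);
    return dev->gro_flush_timeout;
}

/* Per-NAPI write: touch a single instance, leave the device field alone. */
void demo_napi_set(struct demo_netdev *dev, size_t idx, unsigned long val)
{
    assert(dev != NULL && idx < dev->n_napi);
    dev->napi[idx].gro_flush_timeout = val;
}
```

One consequence visible in the model is that a sysfs read can differ from an individual NAPI's value after a per-NAPI write, and that a later sysfs write overrides all per-NAPI values again.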
2024-10-14netdev-genl: Dump napi_defer_hard_irqsJoe Damato
Support dumping defer_hard_irqs for a NAPI ID. Signed-off-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20241011184527.16393-3-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14net: napi: Make napi_defer_hard_irqs per-NAPIJoe Damato
Add defer_hard_irqs to napi_struct in preparation for per-NAPI settings. The existing sysfs parameter is respected; writes to sysfs will write to all NAPI structs for the device and the net_device defer_hard_irq field. Reads from sysfs show the net_device field. The ability to set defer_hard_irqs on specific NAPI instances will be added in a later commit, via netdev-genl. Signed-off-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20241011184527.16393-2-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14net: phylink: allow half-duplex modes with RATE_MATCH_PAUSEDaniel Golle
PHYs performing rate-matching using MAC-side flow-control always perform duplex-matching as well in case they are supporting half-duplex modes at all. No longer remove half-duplex modes from their capabilities. Suggested-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Daniel Golle <daniel@makrotopia.org> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/b157c0c289cfba024039a96e635d037f9d946745.1728617993.git.daniel@makrotopia.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14Merge branch 'tcp-add-skb-sk-to-more-control-packets'Jakub Kicinski
Eric Dumazet says: ==================== tcp: add skb->sk to more control packets Currently, TCP can set skb->sk for a variety of transmit packets. However, packets sent on behalf of TIME_WAIT sockets do not have an attached socket. Same issue for RST packets. We want to change this, in order to increase eBPF program capabilities. This is slightly risky, because various layers could be confused by TIME_WAIT sockets showing up in skb->sk. v2: audited all sk_to_full_sk() users and addressed Martin's feedback. ==================== Link: https://patch.msgid.link/20241010174817.1543642-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14ipv4: tcp: give socket pointer to control skbsEric Dumazet
ip_send_unicast_reply() sends orphaned 'control packets'. These are RST packets and also ACK packets sent from TIME_WAIT. Some eBPF programs would prefer to have a meaningful skb->sk pointer as much as possible. This means that TCP can now attach TIME_WAIT sockets to outgoing skbs. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Brian Vazquez <brianvv@google.com> Link: https://patch.msgid.link/20241010174817.1543642-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14ipv6: tcp: give socket pointer to control skbsEric Dumazet
tcp_v6_send_response() sends orphaned 'control packets'. These are RST packets and also ACK packets sent from TIME_WAIT. Some eBPF programs would prefer to have a meaningful skb->sk pointer as much as possible. This means that TCP can now attach TIME_WAIT sockets to outgoing skbs. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Brian Vazquez <brianvv@google.com> Link: https://patch.msgid.link/20241010174817.1543642-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14net: add skb_set_owner_edemux() helperEric Dumazet
This can be used to attach a socket to an skb, taking a reference on sk->sk_refcnt. This helper might be a NOP if sk->sk_refcnt is zero. Use it from tcp_make_synack(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Brian Vazquez <brianvv@google.com> Link: https://patch.msgid.link/20241010174817.1543642-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
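The "take a reference unless the count is already zero" pattern behind this helper can be sketched with C11 atomics. This is a user-space illustration, not the kernel helper: `demo_sock`, `demo_skb`, `demo_ref_inc_not_zero()` and `demo_set_owner()` are hypothetical names, and destructor handling is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_sock { atomic_int refcnt; };
struct demo_skb { struct demo_sock *sk; };

/* Increment the counter only if it is currently non-zero. A zero count
 * means the object is already being torn down and must not be revived. */
bool demo_ref_inc_not_zero(atomic_int *cnt)
{
    int old = atomic_load(cnt);

    while (old != 0) {
        /* On failure `old` is refreshed with the current value. */
        if (atomic_compare_exchange_weak(cnt, &old, old + 1))
            return true;
    }
    return false;
}

/* Attach `sk` to `skb` if a reference could be taken; otherwise this is
 * a no-op and the skb stays without an owner. */
void demo_set_owner(struct demo_skb *skb, struct demo_sock *sk)
{
    assert(skb != NULL && sk != NULL);
    if (demo_ref_inc_not_zero(&sk->refcnt))
        skb->sk = sk;
}
```

Callers therefore have to tolerate the packet ending up with no attached socket, which matches the "might be a NOP" wording in the commit message.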
2024-10-14net_sched: sch_fq: prepare for TIME_WAIT socketsEric Dumazet
The TCP stack is not attaching skbs to TIME_WAIT sockets yet, but we would like to allow this in the future. Add the sk_listener_or_tw() helper to detect the three states that FQ needs to take care of. Like NEW_SYN_RECV sockets, TIME_WAIT sockets are not full sockets and do not contain sk->sk_pacing_status, sk->sk_pacing_rate. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Brian Vazquez <brianvv@google.com> Link: https://patch.msgid.link/20241010174817.1543642-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
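A check for "one of three states" is commonly written as a single bitmask test. The sketch below shows that technique in user-space C; the state names and their numeric values are assumptions made for the example, and `demo_listener_or_tw()` is a hypothetical analogue, not the kernel's helper.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative state numbering; the values are assumptions for this
 * sketch and only need to be distinct and below 32. */
enum demo_state {
    DEMO_ESTABLISHED = 1,
    DEMO_TIME_WAIT = 6,
    DEMO_LISTEN = 10,
    DEMO_NEW_SYN_RECV = 12,
};

/* One bit per state, so several states can be tested with one AND. */
#define DEMO_F(state) (1u << (state))

/* True for the three socket kinds that lack the pacing fields of a full
 * socket in this model: listeners, NEW_SYN_RECV and TIME_WAIT. */
bool demo_listener_or_tw(enum demo_state st)
{
    const unsigned int mask = DEMO_F(DEMO_LISTEN) |
                              DEMO_F(DEMO_NEW_SYN_RECV) |
                              DEMO_F(DEMO_TIME_WAIT);

    assert(st > 0 && st < 32);
    return (DEMO_F(st) & mask) != 0;
}
```

A caller would use such a predicate to skip reads of per-socket pacing fields whenever it returns true.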
2024-10-14net: add TIME_WAIT logic to sk_to_full_sk()Eric Dumazet
TCP will soon attach TIME_WAIT sockets to some ACK and RST. Make sure sk_to_full_sk() detects this and does not return a non full socket. v3: also changed sk_const_to_full_sk() Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Brian Vazquez <brianvv@google.com> Link: https://patch.msgid.link/20241010174817.1543642-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14net/smc: Fix memory leak when using percpu refsKai Shen
This patch adds the missing percpu_ref_exit() calls when releasing percpu refs. Without them, the memory backing the percpu refs is leaked. Fixes: 79a22238b4f2 ("net/smc: Use percpu ref for wr tx reference") Signed-off-by: Kai Shen <KaiShen@linux.alibaba.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Link: https://patch.msgid.link/20241010115624.7769-1-KaiShen@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14tg3: Address byte-order miss-matchesSimon Horman
Address byte-order miss-matches flagged by Sparse. In tg3_load_firmware_cpu() and tg3_get_device_address() this is done using appropriate types to store big endian values. In the cases of tg3_test_nvram(), where buf is an array which contains values of several different types, cast to __le32 before converting values to host byte order.
Reported by Sparse as:
.../tg3.c:3745:34: warning: cast to restricted __be32
.../tg3.c:13096:21: warning: cast to restricted __le32
.../tg3.c:13096:21: warning: cast from restricted __be32
.../tg3.c:13101:21: warning: cast to restricted __le32
.../tg3.c:13101:21: warning: cast from restricted __be32
.../tg3.c:17070:63: warning: incorrect type in argument 3 (different base types)
.../tg3.c:17070:63:    expected restricted __be32 [usertype] *val
.../tg3.c:17070:63:    got unsigned int *
.../tg3.c:17071:63: warning: incorrect type in argument 3 (different base types)
.../tg3.c:17071:63:    expected restricted __be32 [usertype] *val
.../tg3.c:17071:63:    got unsigned int *
Also, address white-space issues on lines modified for the above. And, for consistency, lines adjacent to them. Compile tested only. No functional change intended. Signed-off-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241009-tg3-sparse-v1-1-6af38a7bf4ff@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14Merge branch 'posix-clock-fix-missing-timespec64-check-for-ptp-clock'Jakub Kicinski
Jinjie Ruan says: ==================== posix-clock: Fix missing timespec64 check for PTP clock Check timespec64 in pc_clock_settime() for PTP clocks, as the clock_settime() man page specifies. ==================== Link: https://patch.msgid.link/20241009072302.1754567-1-ruanjinjie@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14net: lan743x: Remove duplicate checkJinjie Ruan
Since timespec64_valid() is now checked in the higher-layer pc_clock_settime(), the duplicate check in lan743x_ptpci_settime64() can be removed. Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Link: https://patch.msgid.link/20241009072302.1754567-3-ruanjinjie@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14posix-clock: Fix missing timespec64 check in pc_clock_settime()Jinjie Ruan
As Andrew pointed out, it makes sense for the PTP core to check the range of the timespec64 struct's tv_sec and tv_nsec before calling ptp->info->settime64(). As the clock_settime() man page says, if tp.tv_sec is negative or tp.tv_nsec is outside the range [0..999,999,999], the call should return EINVAL. This includes dynamic clocks, which handle PTP clocks, and the condition is consistent with timespec64_valid(). As Thomas suggested, timespec64_valid() only checks that the timespec is valid, but does not ensure that the time is in a valid range, so check it up front using timespec64_valid_strict() in pc_clock_settime() and return -EINVAL if it is not valid. Some drivers use tp->tv_sec and tp->tv_nsec directly to write registers without validity checks, assuming that the higher layer has checked them. This is dangerous, and such drivers benefit from this change, for example hclge_ptp_settime(), igb_ptp_settime_i210() and _rcar_gen4_ptp_settime(). Other drivers can remove their own checks. Cc: stable@vger.kernel.org Fixes: 0606f422b453 ("posix clocks: Introduce dynamic clocks") Acked-by: Richard Cochran <richardcochran@gmail.com> Suggested-by: Andrew Lunn <andrew@lunn.ch> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Link: https://patch.msgid.link/20241009072302.1754567-2-ruanjinjie@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
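The difference between a "well-formed" check and a "strict" range check, and where the check sits relative to the driver callback, can be sketched in userspace C. Everything named `toy_` is hypothetical. In particular, the upper bound `TOY_SEC_MAX` is an assumption chosen for the illustration (the largest second count whose nanosecond value still fits in a signed 64-bit integer) and is not taken from the commit.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NSEC_PER_SEC 1000000000LL
/* Assumed upper bound: seconds that still fit in int64_t nanoseconds. */
#define TOY_SEC_MAX (INT64_MAX / TOY_NSEC_PER_SEC)

struct toy_timespec64 {
	int64_t tv_sec;
	int64_t tv_nsec;
};

/* Well-formed: non-negative seconds, nanoseconds in [0, 999999999]. */
static bool toy_timespec64_valid(const struct toy_timespec64 *ts)
{
	return ts->tv_sec >= 0 &&
	       ts->tv_nsec >= 0 && ts->tv_nsec < TOY_NSEC_PER_SEC;
}

/* Strict: well-formed and small enough not to overflow later math. */
static bool toy_timespec64_valid_strict(const struct toy_timespec64 *ts)
{
	return toy_timespec64_valid(ts) && ts->tv_sec < TOY_SEC_MAX;
}

typedef int (*toy_settime_fn)(const struct toy_timespec64 *ts);

/* Core entry point: validate once here so drivers need not. */
static int toy_pc_clock_settime(const struct toy_timespec64 *ts,
				toy_settime_fn settime)
{
	if (!toy_timespec64_valid_strict(ts))
		return -EINVAL;
	return settime(ts);
}

/* Stand-in driver callback that trusts its input. */
static int toy_driver_settime(const struct toy_timespec64 *ts)
{
	(void)ts;
	return 0;
}
```

With the check placed in the core function, a driver callback such as `toy_driver_settime()` is only ever reached with a value that passed validation, which is why per-driver checks become redundant.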
2024-10-14sched_ext: Remove unnecessary cpu_relax()David Vernet
As described in commit b07996c7abac ("sched_ext: Don't hold scx_tasks_lock for too long"), we're doing a cond_resched() every 32 calls to scx_task_iter_next() to avoid RCU and other stalls. That commit also added a cpu_relax() to the codepath where we drop and reacquire the lock, but as Waiman described in [0], cpu_relax() should only be necessary in busy loops to avoid pounding on a cacheline (or to allow a hypertwin to more fully utilize a core). Let's remove the unnecessary cpu_relax(). [0]: https://lore.kernel.org/all/35b3889b-904a-4d26-981f-c8aa1557a7c7@redhat.com/ Cc: Waiman Long <llong@redhat.com> Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Tejun Heo <tj@kernel.org>
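The batching pattern described above, a periodic yield point every 32 iterator calls with no busy-wait hint, can be modelled in plain C. This is only an illustration: `toy_iter`, `toy_iter_breather()` and `toy_iter_next()` are hypothetical, and the breather merely counts how often it runs instead of dropping a lock and rescheduling.

```c
#include <stddef.h>

#define TOY_ITER_BATCH 32

struct toy_iter {
	size_t pos;             /* next index to hand out */
	size_t len;             /* number of items */
	unsigned int cnt;       /* calls since the last breather */
	unsigned int breathers; /* how many times we yielded */
};

/*
 * Stand-in for "unlock, reschedule if needed, relock". Nothing here
 * spins, so no cpu_relax()-style hint is needed.
 */
static void toy_iter_breather(struct toy_iter *it)
{
	it->breathers++;
}

/* Return the next index, or -1 when the iteration is finished. */
static long toy_iter_next(struct toy_iter *it)
{
	if (++it->cnt >= TOY_ITER_BATCH) {
		toy_iter_breather(it);
		it->cnt = 0;
	}
	if (it->pos >= it->len)
		return -1;
	return (long)it->pos++;
}
```

The point of the sketch is that the yield happens once per batch on a path that blocks or reschedules, not inside a tight spin loop, which is the situation where a relax hint would matter.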
2024-10-14ring-buffer: Fix refcount setting of boot mapped buffersSteven Rostedt
A ring buffer which has its buffer mapped at boot up to fixed memory should not be freed. Other buffers can be. The ref counting setup was wrong for both. It gave the non-mapped buffers a ref count of zero, and the boot mapped buffer a ref count of 1. But a normally allocated buffer should have a ref count of 1, so that it can be removed. Keep the ref count of a normal boot buffer at its setup ref count (do not decrement it), and increment the fixed memory boot mapped buffer's ref count. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241011165224.33dd2624@gandalf.local.home Fixes: e645535a954ad ("tracing: Add option to use memmapped memory for trace boot instance") Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-14Merge tag 'f2fs-6.12-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs fix from Jaegeuk Kim: "An urgent fix to resolve DIO read performance regression caused by 'f2fs: fix to avoid racing in between read and OPU dio write'" * tag 'f2fs-6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: f2fs: allow parallel DIO reads
2024-10-14Merge tag 'erofs-for-6.12-rc4-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fixes from Gao Xiang: "The main one fixes a syzbot issue due to the invalid inode type out of file-backed mounts. The others are minor cleanups without actual logic changes. Summary: - Make sure only regular inodes can be used for file-backed mounts - Two minor codebase cleanups" * tag 'erofs-for-6.12-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: get rid of kaddr in `struct z_erofs_maprecorder` erofs: get rid of z_erofs_try_to_claim_pcluster() erofs: ensure regular inodes for file-backed mounts
2024-10-14xsk: Use xsk_buff_pool directly for cq functionsMaciej Fijalkowski
Currently xsk_cq_{reserve_addr,submit,cancel}_locked() take xdp_sock as an input argument, but it is only used for pulling the xsk_buff_pool pointer out of it. Change the mentioned functions to take the pool pointer as an input argument to avoid unnecessary dereferences. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20241007122458.282590-7-maciej.fijalkowski@intel.com
2024-10-14xsk: Wrap duplicated code to functionMaciej Fijalkowski
Both allocation paths have exactly the same code responsible for getting and initializing xskb. Pull it out into a common function. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20241007122458.282590-6-maciej.fijalkowski@intel.com
2024-10-14xsk: Carry a copy of xdp_zc_max_segs within xsk_buff_poolMaciej Fijalkowski
This is so that we avoid dereferencing struct net_device within the hot path. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20241007122458.282590-5-maciej.fijalkowski@intel.com
2024-10-14xsk: Get rid of xdp_buff_xsk::orig_addrMaciej Fijalkowski
Continue the process of dieting xdp_buff_xsk by removing the orig_addr member. It can be calculated from xdp->data_hard_start wherever it was previously used, so it does not have to be carried around in a struct that is used widely in the hot path. The member was used for initializing xdp_buff_xsk::frame_dma during pool setup and as a shortcut in xp_get_handle() to retrieve the address provided to the xsk Rx queue. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20241007122458.282590-4-maciej.fijalkowski@intel.com
2024-10-14xsk: s/free_list_node/list_node/Maciej Fijalkowski
Now that free_list_node's purpose is twofold, make it just a 'list_node'. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20241007122458.282590-3-maciej.fijalkowski@intel.com