summaryrefslogtreecommitdiff
path: root/drivers/net/ethernet/intel
AgeCommit message (Collapse)Author
2025-02-21xfrm: provide common xdo_dev_offload_ok callback implementationLeon Romanovsky
Almost all drivers except bond and nsim had same check if device can perform XFRM offload on that specific packet. The check was that packet doesn't have IPv4 options and IPv6 extensions. In NIC drivers, the IPv4 HELEN comparison was slightly different, but the intent was to check for the same conditions. So let's chose more strict variant as a common base. Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-02-20igc: Add launch time support to XDP ZCSong Yoong Siang
Enable Launch Time Control (LTC) support for XDP zero copy via XDP Tx metadata framework. This patch has been tested with tools/testing/selftests/bpf/xdp_hw_metadata on Intel I225-LM Ethernet controller. Below are the test steps and result. Test 1: Send a single packet with the launch time set to 1 s in the future. Test steps: 1. On the DUT, start the xdp_hw_metadata selftest application: $ sudo ./xdp_hw_metadata enp2s0 -l 1000000000 -L 1 2. On the Link Partner, send a UDP packet with VLAN priority 1 to port 9091 of the DUT. Result: When the launch time is set to 1 s in the future, the delta between the launch time and the transmit hardware timestamp is 0.016 us, as shown in printout of the xdp_hw_metadata application below. 0x562ff5dc8880: rx_desc[4]->addr=84110 addr=84110 comp_addr=84110 EoP rx_hash: 0xE343384 with RSS type:0x1 HW RX-time: 1734578015467548904 (sec:1734578015.4675) delta to User RX-time sec:0.0002 (183.103 usec) XDP RX-time: 1734578015467651698 (sec:1734578015.4677) delta to User RX-time sec:0.0001 (80.309 usec) No rx_vlan_tci or rx_vlan_proto, err=-95 0x562ff5dc8880: ping-pong with csum=561c (want c7dd) csum_start=34 csum_offset=6 HW RX-time: 1734578015467548904 (sec:1734578015.4675) delta to HW Launch-time sec:1.0000 (1000000.000 usec) 0x562ff5dc8880: complete tx idx=4 addr=4018 HW Launch-time: 1734578016467548904 (sec:1734578016.4675) delta to HW TX-complete-time sec:0.0000 (0.016 usec) HW TX-complete-time: 1734578016467548920 (sec:1734578016.4675) delta to User TX-complete-time sec:0.0000 (32.546 usec) XDP RX-time: 1734578015467651698 (sec:1734578015.4677) delta to User TX-complete-time sec:0.9999 (999929.768 usec) HW RX-time: 1734578015467548904 (sec:1734578015.4675) delta to HW TX-complete-time sec:1.0000 (1000000.016 usec) 0x562ff5dc8880: complete rx idx=132 addr=84110 Test 2: Send 1000 packets with a 10 ms interval and the launch time set to 500 us in the future. Test steps: 1. On the DUT, start the xdp_hw_metadata selftest application: $ sudo chrt -f 99 ./xdp_hw_metadata enp2s0 -l 500000 -L 1 > \ /dev/shm/result.log 2. On the Link Partner, send 1000 UDP packets with a 10 ms interval and VLAN priority 1 to port 9091 of the DUT. Result: When the launch time is set to 500 us in the future, the average delta between the launch time and the transmit hardware timestamp is 0.016 us, as shown in the analysis of /dev/shm/result.log below. The XDP launch time works correctly in sending 1000 packets continuously. Min delta: 0.005 us Avr delta: 0.016 us Max delta: 0.031 us Total packets forwarded: 1000 Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://patch.msgid.link/20250216093430.957880-6-yoong.siang.song@intel.com
2025-02-20igc: Refactor empty frame insertion for launch time supportSong Yoong Siang
Refactor the code for inserting an empty frame into a new function igc_insert_empty_frame(). This change extracts the logic for inserting an empty packet from igc_xmit_frame_ring() into a separate function, allowing it to be reused in future implementations, such as the XDP zero copy transmit function. Remove the igc_desc_unused() checking in igc_init_tx_empty_descriptor() because the number of descriptors needed is guaranteed. Ensure that skb allocation and DMA mapping work for the empty frame, before proceeding to fill in igc_tx_buffer info, context descriptor, and data descriptor. Rate limit the error messages for skb allocation and DMA mapping failures. Update the comment to indicate that the 2 descriptors needed by the empty frame are already taken into consideration in igc_xmit_frame_ring(). Handle the case where the insertion of an empty frame fails and explain the reason behind this handling. Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://patch.msgid.link/20250216093430.957880-5-yoong.siang.song@intel.com
2025-02-18igc: Switch to use hrtimer_setup()Nam Cao
hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Patch was created by using Coccinelle. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/baeaabb84aef1803dfae385f6e9614d6fc547667.1738746872.git.namcao@linutronix.de
2025-02-17Merge branch '100GbE' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== ice, iavf: Add support for Rx timestamping Mateusz Polchlopek says: Initially, during VF creation it registers the PTP clock in the system and negotiates with PF it's capabilities. In the meantime the PF enables the Flexible Descriptor for VF. Only this type of descriptor allows to receive Rx timestamps. Enabling virtual clock would be possible, though it would probably perform poorly due to the lack of direct time access. Enable timestamping should be done using userspace tools, e.g. hwstamp_ctl -i $VF -r 14 In order to report the timestamps to userspace, the VF extends timestamp to 40b. To support this feature the flexible descriptors and PTP part in iavf driver have been introduced. * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: iavf: add support for Rx timestamps to hotpath iavf: handle set and get timestamps ops iavf: Implement checking DD desc field iavf: refactor iavf_clean_rx_irq to support legacy and flex descriptors iavf: define Rx descriptors as qwords libeth: move idpf_rx_csum_decoded and idpf_rx_extracted iavf: periodically cache PHC time iavf: add support for indirect access to PHC time iavf: add initial framework for registering PTP clock iavf: negotiate PTP capabilities iavf: add support for negotiating flexible RXDID format virtchnl: add enumeration for the rxdid format ice: support Rx timestamp on flex descriptor virtchnl: add support for enabling PTP on iAVF ==================== Link: https://patch.msgid.link/20250214192739.1175740-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-14ice: Fix signedness bug in ice_init_interrupt_scheme()Dan Carpenter
If pci_alloc_irq_vectors() can't allocate the minimum number of vectors then it returns -ENOSPC so there is no need to check for that in the caller. In fact, because pf->msix.min is an unsigned int, it means that any negative error codes are type promoted to high positive values and treated as success. So here, the "return -ENOMEM;" is unreachable code. Check for negatives instead. Now that we're only dealing with error codes, it's easier to propagate the error code from pci_alloc_irq_vectors() instead of hardcoding -ENOMEM. Fixes: 79d97b8cf9a8 ("ice: remove splitting MSI-X between features") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/b16e4f01-4c85-46e2-b602-fce529293559@stanley.mountain Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-14iavf: add support for Rx timestamps to hotpathJacob Keller
Add support for receive timestamps to the Rx hotpath. This support only works when using the flexible descriptor format, so make sure that we request this format by default if we have receive timestamp support available in the PTP capabilities. In order to report the timestamps to userspace, we need to perform timestamp extension. The Rx descriptor does actually contain the "40 bit" timestamp. However, upper 32 bits which contain nanoseconds are conveniently stored separately in the descriptor. We could extract the 32bits and lower 8 bits, then perform a bitwise OR to calculate the 40bit value. This makes no sense, because the timestamp extension algorithm would simply discard the lower 8 bits anyways. Thus, implement timestamp extension as iavf_ptp_extend_32b_timestamp(), and extract and forward only the 32bits of nominal nanoseconds. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Sunil Goutham <sgoutham@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14iavf: handle set and get timestamps opsJacob Keller
Add handlers for the .ndo_hwtstamp_get and .ndo_hwtstamp_set ops which allow userspace to request timestamp enablement for the device. This support allows standard Linux applications to request the timestamping desired. As with other devices that support timestamping all packets, the driver will upgrade any request for timestamping of a specific type of packet to HWTSTAMP_FILTER_ALL. The current configuration is stored, so that it can be retrieved by calling .ndo_hwtstamp_get The Tx timestamps are not implemented yet so calling set ops for Tx path will end with EOPNOTSUPP error code. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14iavf: Implement checking DD desc fieldMateusz Polchlopek
Rx timestamping introduced in PF driver caused the need of refactoring the VF driver mechanism to check packet fields. The function to check errors in descriptor has been removed and from now only previously set struct fields are being checked. The field DD (descriptor done) needs to be checked at the very beginning, before extracting other fields. Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14iavf: refactor iavf_clean_rx_irq to support legacy and flex descriptorsJacob Keller
Using VIRTCHNL_VF_OFFLOAD_FLEX_DESC, the iAVF driver is capable of negotiating to enable the advanced flexible descriptor layout. Add the flexible NIC layout (RXDID=2) as a member of the Rx descriptor union. Also add bit position definitions for the status and error indications that are needed. The iavf_clean_rx_irq function needs to extract a few fields from the Rx descriptor, including the size, rx_ptype, and vlan_tag. Move the extraction to a separate function that decodes the fields into a structure. This will reduce the burden for handling multiple descriptor types by keeping the relevant extraction logic in one place. To support handling an additional descriptor format with minimal code duplication, refactor Rx checksum handling so that the general logic is separated from the bit calculations. Introduce an iavf_rx_desc_decoded structure which holds the relevant bits decoded from the Rx descriptor. This will enable implementing flexible descriptor handling without duplicating the general logic twice. Introduce an iavf_extract_flex_rx_fields, iavf_flex_rx_hash, and iavf_flex_rx_csum functions which operate on the flexible NIC descriptor format instead of the legacy 32 byte format. Based on the negotiated RXDID, select the correct function for processing the Rx descriptors. With this change, the Rx hot path should be functional when using either the default legacy 32byte format or when we switch to the flexible NIC layout. Modify the Rx hot path to add support for the flexible descriptor format and add request enabling Rx timestamps for all queues. As in ice, make sure we bump the checksum level if the hardware detected a packet type which could have an outer checksum. This is important because hardware only verifies the inner checksum. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14iavf: define Rx descriptors as qwordsMateusz Polchlopek
The union iavf_32byte_rx_desc consists of two unnamed structs defined inside. One of them represents legacy 32 byte descriptor and second the 16 byte descriptor (extended to 32 byte). Each of them consists of bunch of unions, structs and __le fields that represent specific fields in descriptor. This commit changes the representation of iavf_32byte_rx_desc union to store four __le64 fields (qw0, qw1, qw2, qw3) that represent quad-words. Those quad-words will be then accessed by calling leXY_get_bits macros in upcoming commits. Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14libeth: move idpf_rx_csum_decoded and idpf_rx_extractedMateusz Polchlopek
Structs idpf_rx_csum_decoded and idpf_rx_extracted are used both in idpf and iavf Intel drivers. Change the prefix from idpf_* to libeth_* and move mentioned structs to libeth's rx.h header file. Adjust usage in idpf driver. Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14iavf: periodically cache PHC timeJacob Keller
The Rx timestamps reported by hardware may only have 32 bits of storage for nanosecond time. These timestamps cannot be directly reported to the Linux stack, as it expects 64bits of time. To handle this, the timestamps must be extended using an algorithm that calculates the corrected 64bit timestamp by comparison between the PHC time and the timestamp. This algorithm requires the PHC time to be captured within ~2 seconds of when the timestamp was captured. Instead of trying to read the PHC time in the Rx hotpath, the algorithm relies on a cached value that is periodically updated. Keep this cached time up to date by using the PTP .do_aux_work kthread function. The iavf_ptp_do_aux_work will reschedule itself about twice a second, and will check whether or not the cached PTP time needs to be updated. If so, it issues a VIRTCHNL_OP_1588_PTP_GET_TIME to request the time from the PF. The jitter and latency involved with this command aren't important, because the cached time just needs to be kept up to date within about ~2 seconds. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14iavf: add support for indirect access to PHC timeJacob Keller
Implement support for reading the PHC time indirectly via the VIRTCHNL_OP_1588_PTP_GET_TIME operation. Based on some simple tests with ftrace, the latency of the indirect clock access appears to be about ~110 microseconds. This is due to the cost of preparing a message to send over the virtchnl queue. This is expected, due to the increased jitter caused by sending messages over virtchnl. It is not easy to control the precise time that the message is sent by the VF, or the time that the message is responded to by the PF, or the time that the message sent from the PF is received by the VF. For sending the request, note that many PTP related operations will require sending of VIRTCHNL messages. Instead of adding a separate AQ flag and storage for each operation, setup a simple queue mechanism for queuing up virtchnl messages. Each message will be converted to a iavf_ptp_aq_cmd structure which ends with a flexible array member. A single AQ flag is added for processing messages from this queue. In principle this could be extended to handle arbitrary virtchnl messages. For now it is kept to PTP-specific as the need is primarily for handling PTP-related commands. Use this to implement .gettimex64 using the indirect method via the virtchnl command. The response from the PF is processed and stored into the cached_phc_time. A wait queue is used to allow the PTP clock gettime request to sleep until the message is sent from the PF. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14iavf: add initial framework for registering PTP clockJacob Keller
Add the iavf_ptp.c file and fill it in with a skeleton framework to allow registering the PTP clock device. Add implementation of helper functions to check if a PTP capability is supported and handle change in PTP capabilities. Enabling virtual clock would be possible, though it would probably perform poorly due to the lack of direct time access. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Sai Krishna <saikrishnag@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Co-developed-by: Ahmed Zaki <ahmed.zaki@intel.com> Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14iavf: negotiate PTP capabilitiesJacob Keller
Add a new extended capabilities negotiation to exchange information from the PF about what PTP capabilities are supported by this VF. This requires sending a VIRTCHNL_OP_1588_PTP_GET_CAPS message, and waiting for the response from the PF. Handle this early on during the VF initialization. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14iavf: add support for negotiating flexible RXDID formatJacob Keller
Enable support for VIRTCHNL_VF_OFFLOAD_RX_FLEX_DESC, to enable the VF driver the ability to determine what Rx descriptor formats are available. This requires sending an additional message during initialization and reset, the VIRTCHNL_OP_GET_SUPPORTED_RXDIDS. This operation requests the supported Rx descriptor IDs available from the PF. This is treated the same way that VLAN V2 capabilities are handled. Add a new set of extended capability flags, used to process send and receipt of the VIRTCHNL_OP_GET_SUPPORTED_RXDIDS message. This ensures we finish negotiating for the supported descriptor formats prior to beginning configuration of receive queues. This change stores the supported format bitmap into the iavf_adapter structure. Additionally, if VIRTCHNL_VF_OFFLOAD_RX_FLEX_DESC is enabled by the PF, we need to make sure that the Rx queue configuration specifies the format. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14ice: support Rx timestamp on flex descriptorSimei Su
To support Rx timestamp offload, VIRTCHNL_OP_1588_PTP_CAPS is sent by the VF to request PTP capability and responded by the PF what capability is enabled for that VF. Hardware captures timestamps which contain only 32 bits of nominal nanoseconds, as opposed to the 64bit timestamps that the stack expects. To convert 32b to 64b, we need a current PHC time. VIRTCHNL_OP_1588_PTP_GET_TIME is sent by the VF and responded by the PF with the current PHC time. Signed-off-by: Simei Su <simei.su@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.14-rc3). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-12Merge branch '200GbE' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2025-02-11 (idpf, ixgbe, igc) For idpf: Sridhar fixes a couple issues in handling of RSC packets. Josh adds a call to set_real_num_queues() to keep queue count in sync. For ixgbe: Piotr removes missed IS_ERR() removal when ERR_PTR usage was removed. For igc: Zdenek Bouska fixes reporting of Rx timestamp with AF_XDP. Siang sets buffer type on empty frame to ensure proper handling. * '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: igc: Set buffer type for empty frames in igc_init_empty_frame igc: Fix HW RX timestamp when passed by ZC XDP ixgbe: Fix possible skb NULL pointer dereference idpf: call set_real_num_queues in idpf_open idpf: record rx queue in skb for RSC packets idpf: fix handling rsc packet with a single segment ==================== Link: https://patch.msgid.link/20250211214343.4092496-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-11Merge branch '100GbE' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2025-02-10 (ice, igc, e1000e) For ice: Karol, Jake, and Michal add PTP support for E830 devices. Karol refactors and cleans up PTP code. Jake allows for a common cross-timestamp implementation to be shared for all devices and Michal adds E830 support. Mateusz cleans up initial Flow Director rule creation to loop rather than duplicate repeated similar calls. For igc: Siang adjust calls to remove need for close and open calls on loading XDP program. For e1000e: Gerhard Engleder batches register writes for writing multicast table on real-time kernels. * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: e1000e: Fix real-time violations on link up igc: Avoid unnecessary link down event in XDP_SETUP_PROG process ice: refactor ice_fdir_create_dflt_rules() function ice: Implement PTP support for E830 devices ice: Refactor ice_ptp_init_tx_* ice: Add unified ice_capture_crosststamp ice: Process TSYN IRQ in a separate function ice: Use FIELD_PREP for timestamp values ice: Remove unnecessary ice_is_e8xx() functions ice: Don't check device type when checking GNSS presence ==================== Link: https://patch.msgid.link/20250210192352.3799673-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-11iavf: Fix a locking bug in an error pathBart Van Assche
If the netdev lock has been obtained, unlock it before returning. This bug has been detected by the Clang thread-safety analyzer. Fixes: afc664987ab3 ("eth: iavf: extend the netdev_lock usage") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://patch.msgid.link/20250206175114.1974171-28-bvanassche@acm.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-11igc: Set buffer type for empty frames in igc_init_empty_frameSong Yoong Siang
Set the buffer type to IGC_TX_BUFFER_TYPE_SKB for empty frame in the igc_init_empty_frame function. This ensures that the buffer type is correctly identified and handled during Tx ring cleanup. Fixes: db0b124f02ba ("igc: Enhance Qbv scheduling by using first flag bit") Cc: stable@vger.kernel.org # 6.2+ Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-11igc: Fix HW RX timestamp when passed by ZC XDPZdenek Bouska
Fixes HW RX timestamp in the following scenario: - AF_PACKET socket with enabled HW RX timestamps is created - AF_XDP socket with enabled zero copy is created - frame is forwarded to the BPF program, where the timestamp should still be readable (extracted by igc_xdp_rx_timestamp(), kfunc behind bpf_xdp_metadata_rx_timestamp()) - the frame got XDP_PASS from BPF program, redirecting to the stack - AF_PACKET socket receives the frame with HW RX timestamp Moves the skb timestamp setting from igc_dispatch_skb_zc() to igc_construct_skb_zc() so that igc_construct_skb_zc() is similar to igc_construct_skb(). This issue can also be reproduced by running: # tools/testing/selftests/bpf/xdp_hw_metadata enp1s0 When a frame with the wrong port 9092 (instead of 9091) is used: # echo -n xdp | nc -u -q1 192.168.10.9 9092 then the RX timestamp is missing and xdp_hw_metadata prints: skb hwtstamp is not found! With this fix or when copy mode is used: # tools/testing/selftests/bpf/xdp_hw_metadata -c enp1s0 then RX timestamp is found and xdp_hw_metadata prints: found skb hwtstamp = 1736509937.852786132 Fixes: 069b142f5819 ("igc: Add support for PTP .getcyclesx64()") Signed-off-by: Zdenek Bouska <zdenek.bouska@siemens.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Florian Bezdeka <florian.bezdeka@siemens.com> Reviewed-by: Song Yoong Siang <yoong.siang.song@intel.com> Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-11ixgbe: Fix possible skb NULL pointer dereferencePiotr Kwapulinski
The commit c824125cbb18 ("ixgbe: Fix passing 0 to ERR_PTR in ixgbe_run_xdp()") stopped utilizing the ERR-like macros for xdp status encoding. Propagate this logic to the ixgbe_put_rx_buffer(). The commit also relaxed the skb NULL pointer check - caught by Smatch. Restore this check. Fixes: c824125cbb18 ("ixgbe: Fix passing 0 to ERR_PTR in ixgbe_run_xdp()") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/intel-wired-lan/2c7d6c31-192a-4047-bd90-9566d0e14cc0@stanley.mountain/ Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Saritha Sanigani <sarithax.sanigani@intel.com> (A Contingent Worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-11idpf: call set_real_num_queues in idpf_openJoshua Hay
On initial driver load, alloc_etherdev_mqs is called with whatever max queue values are provided by the control plane. However, if the driver is loaded on a system where num_online_cpus() returns less than the max queues, the netdev will think there are more queues than are actually available. Only num_online_cpus() will be allocated, but skb_get_queue_mapping(skb) could possibly return an index beyond the range of allocated queues. Consequently, the packet is silently dropped and it appears as if TX is broken. Set the real number of queues during open so the netdev knows how many queues will be allocated. Fixes: 1c325aac10a8 ("idpf: configure resources for TX queues") Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-11idpf: record rx queue in skb for RSC packetsSridhar Samudrala
Move the call to skb_record_rx_queue in idpf_rx_process_skb_fields() so that RX queue is recorded for RSC packets too. Fixes: 90912f9f4f2d ("idpf: convert header split mode to libeth + napi_build_skb()") Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-11idpf: fix handling rsc packet with a single segmentSridhar Samudrala
Handle rsc packet with a single segment same as a multi segment rsc packet so that CHECKSUM_PARTIAL is set in the skb->ip_summed field. The current code is passing CHECKSUM_NONE resulting in TCP GRO layer doing checksum in SW and hiding the issue. This will fail when using dmabufs as payload buffers as skb frag would be unreadable. Fixes: 3a8845af66ed ("idpf: add RX splitq napi poll support") Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10ice: use generic unrolled_count() macroAlexander Lobakin
ice, same as i40e, has a custom loop unrolling macros for unrolling Tx descriptors filling on XSk xmit. Replace ice defs with generic unrolled_count(), which is also more convenient as it allows passing defines as its argument, not hardcoded values, while the loop declaration will still be usual for-loop. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://patch.msgid.link/20250206182630.3914318-4-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10i40e: use generic unrolled_count() macroAlexander Lobakin
i40e, as well as ice, has a custom loop unrolling macro for unrolling Tx descriptors filling on XSk xmit. Replace i40e defs with generic unrolled_count(), which is also more convenient as it allows passing defines as its argument, not hardcoded values, while the loop declaration will still be a usual for-loop. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://patch.msgid.link/20250206182630.3914318-3-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10e1000e: Fix real-time violations on link upGerhard Engleder
Link down and up triggers update of MTA table. This update executes many PCIe writes and a final flush. Thus, PCIe will be blocked until all writes are flushed. As a result, DMA transfers of other targets suffer from delay in the range of 50us. This results in timing violations on real-time systems during link down and up of e1000e in combination with an Intel i3-2310E Sandy Bridge CPU. The i3-2310E is quite old. Launched 2011 by Intel but still in use as robot controller. The exact root cause of the problem is unclear and this situation won't change as Intel support for this CPU has ended years ago. Our experience is that the number of posted PCIe writes needs to be limited at least for real-time systems. With posted PCIe writes a much higher throughput can be generated than with PCIe reads which cannot be posted. Thus, the load on the interconnect is much higher. Additionally, a PCIe read waits until all posted PCIe writes are done. Therefore, the PCIe read can block the CPU for much more than 10us if a lot of PCIe writes were posted before. Both issues are the reason why we are limiting the number of posted PCIe writes in row in general for our real-time systems, not only for this driver. A flush after a low enough number of posted PCIe writes eliminates the delay but also increases the time needed for MTA table update. The following measurements were done on i3-2310E with e1000e for 128 MTA table entries: Single flush after all writes: 106us Flush after every write: 429us Flush after every 2nd write: 266us Flush after every 4th write: 180us Flush after every 8th write: 141us Flush after every 16th write: 121us A flush after every 8th write delays the link up by 35us and the negative impact to DMA transfers of other targets is still tolerable. Execute a flush after every 8th write. This prevents overloading the interconnect with posted writes. Signed-off-by: Gerhard Engleder <eg@keba.com> Link: https://lore.kernel.org/netdev/f8fe665a-5e6c-4f95-b47a-2f3281aa0e6c@lunn.ch/T/ CC: Vitaly Lifshits <vitaly.lifshits@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Avigail Dahan <avigailx.dahan@intel.com> Reviewed-by: Vitaly Lifshits <vitaly.lifshits@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10igc: Avoid unnecessary link down event in XDP_SETUP_PROG processSong Yoong Siang
The igc_close()/igc_open() functions are too drastic for installing a new XDP prog because they cause undesirable link down event and device reset. To avoid delays in Ethernet traffic, improve the XDP_SETUP_PROG process by using the same sequence as igc_xdp_setup_pool(), which performs only the necessary steps, as follows: 1. stop the traffic and clean buffer 2. stop NAPI 3. install the XDP program 4. resume NAPI 5. allocate buffer and resume the traffic This patch has been tested using the 'ip link set xdpdrv' command to attach a simple XDP prog that always returns XDP_PASS. Before this patch, attaching xdp program will cause ptp4l to lose sync for few seconds, as shown in ptp4l log below: ptp4l[198.082]: rms 4 max 8 freq +906 +/- 2 delay 12 +/- 0 ptp4l[199.082]: rms 3 max 4 freq +906 +/- 3 delay 12 +/- 0 ptp4l[199.536]: port 1 (enp2s0): link down ptp4l[199.536]: port 1 (enp2s0): SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED) ptp4l[199.600]: selected local clock 22abbc.fffe.bb1234 as best master ptp4l[199.600]: port 1 (enp2s0): assuming the grand master role ptp4l[199.600]: port 1 (enp2s0): master state recommended in slave only mode ptp4l[199.600]: port 1 (enp2s0): defaultDS.priority1 probably misconfigured ptp4l[202.266]: port 1 (enp2s0): link up ptp4l[202.300]: port 1 (enp2s0): FAULTY to LISTENING on INIT_COMPLETE ptp4l[205.558]: port 1 (enp2s0): new foreign master 44abbc.fffe.bb2144-1 ptp4l[207.558]: selected best master clock 44abbc.fffe.bb2144 ptp4l[207.559]: port 1 (enp2s0): LISTENING to UNCALIBRATED on RS_SLAVE ptp4l[208.308]: port 1 (enp2s0): UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED ptp4l[208.933]: rms 742 max 1303 freq -195 +/- 682 delay 12 +/- 0 ptp4l[209.933]: rms 178 max 274 freq +387 +/- 243 delay 12 +/- 0 After this patch, attaching xdp program no longer cause ptp4l to lose sync, as shown in ptp4l log below: ptp4l[201.183]: rms 1 max 3 freq +959 +/- 1 delay 8 +/- 0 ptp4l[202.183]: rms 1 max 3 freq +961 +/- 2 delay 8 +/- 0 ptp4l[203.183]: rms 2 max 3 freq +958 +/- 2 delay 8 +/- 0 ptp4l[204.183]: rms 3 max 5 freq +961 +/- 3 delay 8 +/- 0 ptp4l[205.183]: rms 2 max 4 freq +964 +/- 3 delay 8 +/- 0 Besides, before this patch, attaching xdp program will causes flood ping to lose 10 packets, as shown in ping statistics below: --- 169.254.1.2 ping statistics --- 100000 packets transmitted, 99990 received, +6 errors, 0.01% packet loss, time 34001ms rtt min/avg/max/mdev = 0.028/0.301/3104.360/13.838 ms, pipe 10, ipg/ewma 0.340/0.243 ms After this patch, attaching xdp program no longer cause flood ping to loss any packets, as shown in ping statistics below: --- 169.254.1.2 ping statistics --- 100000 packets transmitted, 100000 received, 0% packet loss, time 32326ms rtt min/avg/max/mdev = 0.027/0.231/19.589/0.155 ms, pipe 2, ipg/ewma 0.323/0.322 ms On the other hand, this patch has been tested with tools/testing/selftests/ bpf/xdp_hw_metadata app to make sure AF_XDP zero-copy is working fine with XDP Tx and Rx metadata. Below is the result of last packet after received 10000 UDP packets with interval 1 ms: poll: 1 (0) skip=0 fail=0 redir=10000 xsk_ring_cons__peek: 1 0x55881c7ef7a8: rx_desc[9999]->addr=8f110 addr=8f110 comp_addr=8f110 EoP rx_hash: 0xFB9BB6A3 with RSS type:0x1 HW RX-time: 1733923136269470866 (sec:1733923136.2695) delta to User RX-time sec:0.0000 (43.280 usec) XDP RX-time: 1733923136269482482 (sec:1733923136.2695) delta to User RX-time sec:0.0000 (31.664 usec) No rx_vlan_tci or rx_vlan_proto, err=-95 0x55881c7ef7a8: ping-pong with csum=ab19 (want 315b) csum_start=34 csum_offset=6 0x55881c7ef7a8: complete tx idx=9999 addr=f010 HW TX-complete-time: 1733923136269591637 (sec:1733923136.2696) delta to User TX-complete-time sec:0.0001 (108.571 usec) XDP RX-time: 1733923136269482482 (sec:1733923136.2695) delta to User TX-complete-time sec:0.0002 (217.726 usec) HW RX-time: 1733923136269470866 (sec:1733923136.2695) delta to HW TX-complete-time sec:0.0001 (120.771 usec) 0x55881c7ef7a8: complete rx idx=10127 addr=8f110 Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com> Tested-by: Avigail Dahan <avigailx.dahan@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10ice: refactor ice_fdir_create_dflt_rules() functionMateusz Polchlopek
The Flow Director function ice_fdir_create_dflt_rules() calls few times function ice_create_init_fdir_rule() each time with different enum ice_fltr_ptype parameter. Next step is to return error code if error occurred. Change the code to store all necessary default rules in constant array and call ice_create_init_fdir_rule() in the loop. It makes it easy to extend the list of default rules in the future, without the need of duplicate code more and more. Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10ice: Implement PTP support for E830 devicesMichal Michalik
Add specific functions and definitions for E830 devices to enable PTP support. E830 devices support direct write to GLTSYN_ registers without shadow registers and 64 bit read of PHC time. Enable PTM for E830 device, which is required for cross timestamp and and dependency on PCIE_PTM for ICE_HWTS. Check X86_FEATURE_ART for E830 as it may not be present in the CPU. Cc: Anna-Maria Behnsen <anna-maria@linutronix.de> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Co-developed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Co-developed-by: Milena Olech <milena.olech@intel.com> Signed-off-by: Milena Olech <milena.olech@intel.com> Co-developed-by: Paul Greenwalt <paul.greenwalt@intel.com> Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Signed-off-by: Michal Michalik <michal.michalik@intel.com> Co-developed-by: Karol Kolacinski <karol.kolacinski@intel.com> Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10ice: Refactor ice_ptp_init_tx_*Karol Kolacinski
Unify ice_ptp_init_tx_* functions for most of the MAC types except E82X. This simplifies the code for the future use with new MAC types. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10ice: Add unified ice_capture_crosststampJacob Keller
Devices supported by ice driver use essentially the same logic for performing a crosstimestamp. The only difference is that E830 hardware has different offsets. Instead of having multiple implementations, combine them into a single ice_capture_crosststamp() function. To support both hardware types, the ice_capture_crosststamp function must be able to determine the appropriate registers to access. To handle this, pass a custom context structure instead of the PF pointer. This structure, ice_crosststamp_ctx, contains a pointer to the PF, and a pointer to the device configuration structure. This new structure also will make it easier to implement historic snapshot support in a future commit. The device configuration structure is a static const data which defines the offsets and flags for the various registers. This includes the lock register, the cross timestamp control register, the upper and lower ART system time capture registers, and the upper and lower device time capture registers for each timer index. Use the configuration structure to access all of the registers in ice_capture_crosststamp(). Ensure that we don't over-run the device time array by checking that the timer index is 0 or 1. Previously this was simply assumed, and it would cause the device to read an incorrect and likely garbage register. It does feel like there should be a kernel interface for managing register offsets like this, but the closest thing I saw was <linux/regmap.h> which is interesting but not quite what we're looking for... Use rd32_poll_timeout() to read lock_reg and ctl_reg. Add snapshot system time for historic interpolation. Remove X86_FEATURE_ART and X86_FEATURE_TSC_KNOWN_FREQ from all E82X devices because those are SoCs, which will always have those features. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10ice: Process TSYN IRQ in a separate functionKarol Kolacinski
Simplify TSYN IRQ processing by moving it to a separate function and having appropriate behavior per PHY model, instead of multiple conditions not related to HW, but to specific timestamping modes. When PTP is not enabled in the kernel, don't process timestamps and return IRQ_HANDLED. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10ice: Use FIELD_PREP for timestamp valuesKarol Kolacinski
Instead of using shifts and casts, use FIELD_PREP after reading 40b timestamp values. Rename a couple defines for better clarity and consistency. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10ice: Remove unnecessary ice_is_e8xx() functionsKarol Kolacinski
Remove unnecessary ice_is_e8xx() functions and PHY model. Instead, use MAC type where applicable. Don't check device type in ice_ptp_maybe_trigger_tx_interrupt(), because in reality it depends on the ready bitmap, which only E810 does not have. Call ice_ptp_cfg_phy_interrupt() unconditionally, because all further function calls check the MAC type anyway and this allows simpler code in the future with addition of the new MAC types. Reorder ICE_MAC_* cases in switches in ice_ptp* as in enum ice_mac_type. Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10ice: Don't check device type when checking GNSS presenceKarol Kolacinski
Don't check if the device type is E810T as non-E810T devices can support GNSS too and PCA9575 check is enough to determine if GNSS is present or not. Rename ice_gnss_is_gps_present() to ice_gnss_is_module_present() because GNSS module supports multiple GNSS providers, not only GPS. Move functions related to PCA9575 from ice_ptp_hw.c to ice_common.c to be able to access them when PTP is disabled in the kernel, but GNSS is enabled. Remove logical AND with ICE_AQC_LINK_TOPO_NODE_TYPE_M in ice_get_pca9575_handle(), which has no effect, and reorder device type checks to check the device_id first, then set other variables. Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-06Merge branch 'for-next' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== ice: managing MSI-X in driver Michal Swiatkowski says: It is another try to allow user to manage amount of MSI-X used for each feature in ice. First was via devlink resources API, it wasn't accepted in upstream. Also static MSI-X allocation using devlink resources isn't really user friendly. This try is using more dynamic way. "Dynamic" across whole kernel when platform supports it and "dynamic" across the driver when not. To achieve that reuse global devlink parameter pf_msix_max and pf_msix_min. It fits how ice hardware counts MSI-X. In case of ice amount of MSI-X reported on PCI is a whole MSI-X for the card (with MSI-X for VFs also). Having pf_msix_max allow user to statically set how many MSI-X he wants on PF and how many should be reserved for VFs. pf_msix_min is used to set minimum number of MSI-X with which ice driver should probe correctly. Meaning of this field in case of dynamic vs static allocation: - on system with dynamic MSI-X allocation support * alloc pf_msix_min as static, rest will be allocated dynamically - on system without dynamic MSI-X allocation support * try alloc pf_msix_max as static, minimum acceptable result is pf_msix_min As Jesse and Piotr suggested pf_msix_max and pf_msix_min can (an probably should) be stored in NVM. This patchset isn't implementing that. Dynamic (kernel or driver) way means that splitting MSI-X across the RDMA and eth in case there is a MSI-X shortage isn't correct. Can work when dynamic is only on driver site, but can't when dynamic is on kernel site. Let's remove this code and move to MSI-X allocation feature by feature. If there is no more MSI-X for a feature, a feature is working with less MSI-X or it is turned off. There is a regression here. With MSI-X splitting user can run RDMA and eth even on system with not enough MSI-X. Now only eth will work. RDMA can be turned on by changing number of PF queues (lowering) and reprobe RDMA driver. Example: 72 CPU number, eth, RDMA and flow director (1 MSI-X), 1 MSI-X for OICR on PF, and 1 more for RDMA. Card is using 1 + 72 + 1 + 72 + 1 = 147. We set pf_msix_min = 2, pf_msix_max = 128 OICR: 1 eth: 72 flow director: 1 RDMA: 128 - 74 = 54 We can change number of queues on pf to 36 and do devlink reinit OICR: 1 eth: 36 RDMA: 73 flow director: 1 We can also (implemented in "ice: enable_rdma devlink param") turned RDMA off. OICR: 1 eth: 72 RDMA: 0 (turned off) flow director: 1 After this changes we have a static base vector for SRIOV (SIOV probably in the feature). Last patch from this series is simplifying managing VF MSI-X code based on static vector. Now changing queues using ethtool is also changing MSI-X. If there is enough MSI-X it is always one to one. When there is not enough there will be more queues than MSI-X. * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: ice: init flow director before RDMA ice: simplify VF MSI-X managing ice: enable_rdma devlink param ice: treat dyn_allowed only as suggestion ice, irdma: move interrupts code to irdma ice: get rid of num_lan_msix field ice: remove splitting MSI-X between features ice: devlink PF MSI-X max and min parameter ice: count combined queues using Rx/Tx count ==================== Link: https://patch.msgid.link/20250205185512.895887-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05ice: init flow director before RDMAMichal Swiatkowski
Flow director needs only one MSI-X. Load it before RDMA to save MSI-X for it. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-05ice: simplify VF MSI-X managingMichal Swiatkowski
After implementing pf->msix.max field, base vector for other use cases (like VFs) can be fixed. This simplify code when changing MSI-X amount on particular VF, because there is no need to move a base vector. A fixed base vector allows to reserve vectors from the beginning instead of from the end, which is also simpler in code. Store total and rest value in the same struct as max and min for PF. Move tracking vectors from ice_sriov.c to ice_irq.c as it can be also use for other none PF use cases (SIOV). Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-05ice: enable_rdma devlink paramMichal Swiatkowski
Implement enable_rdma devlink parameter to allow user to turn RDMA feature on and off. It is useful when there is no enough interrupts and user doesn't need RDMA feature. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Jan Sokolowski <jan.sokolowski@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-05ice: treat dyn_allowed only as suggestionMichal Swiatkowski
It can be needed to have some MSI-X allocated as static and rest as dynamic. For example on PF VSI. We want to always have minimum one MSI-X on it, because of that it is allocated as a static one, rest can be dynamic if it is supported. Change the ice_get_irq_res() to allow using static entries if they are free even if caller wants dynamic one. Adjust limit values to the new approach. Min and max in limit means the values that are valid, so decrease max and num_static by one. Set vsi::irq_dyn_alloc if dynamic allocation is supported. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-05ice, irdma: move interrupts code to irdmaMichal Swiatkowski
Move responsibility of MSI-X requesting for RDMA feature from ice driver to irdma driver. It is done to allow simple fallback when there is not enough MSI-X available. Change amount of MSI-X used for control from 4 to 1, as it isn't needed to have more than one MSI-X for this purpose. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-05ice: get rid of num_lan_msix fieldMichal Swiatkowski
Remove the field to allow having more queues than MSI-X on VSI. As default the number will be the same, but if there won't be more MSI-X available VSI can run with at least one MSI-X. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-05ice: remove splitting MSI-X between featuresMichal Swiatkowski
With dynamic approach to alloc MSI-X there is no sense to statically split MSI-X between PF features. Splitting was also calculating needed MSI-X. Move this part to separate function and use as max value. Remove ICE_ESWITCH_MSIX, as there is no need for additional MSI-X for switchdev. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-05ice: devlink PF MSI-X max and min parameterMichal Swiatkowski
Use generic devlink PF MSI-X parameter to allow user to change MSI-X range. Add notes about this parameters into ice devlink documentation. Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-04ice: count combined queues using Rx/Tx countMichal Swiatkowski
Previous implementation assumes that there is 1:1 matching between vectors and queues. It isn't always true. Get minimum value from Rx/Tx queues to determine combined queues number. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>