summaryrefslogtreecommitdiff
path: root/drivers/net/ethernet/intel/iavf
AgeCommit message (Collapse)Author
8 daysiavf: use libie_aq_strMichal Swiatkowski
There is no need to store the err string in hw->err_str. Simplify it and use common helper. hw->err_str is still used for other purpouse. It should be marked that previously for unknown error the numeric value was passed as a string. Now the "LIBIE_AQ_RC_UNKNOWN" is used for such cases. Add libie_aminq module in iavf Kconfig. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
8 daysiavf: use libie adminq descriptorsMichal Swiatkowski
Use libie_aq_desc instead of iavf_aq_desc. Do needed changes to allow clean build Use libie_aq_raw() wherever it can be used. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
9 daysiavf: access ->pp through netmem_desc instead of pageByungchul Park
To eliminate the use of struct page in page pool, the page pool users should use netmem descriptor and APIs instead. Make iavf access ->pp through netmem_desc instead of page. Signed-off-by: Byungchul Park <byungchul@sk.com> Link: https://patch.msgid.link/20250721021835.63939-9-byungchul@sk.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17Merge branch '200GbE' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== libeth: add libeth_xdp helper lib Alexander Lobakin says: Time to add XDP helpers infra to libeth to greatly simplify adding XDP to idpf and iavf, as well as improve and extend XDP in ice and i40e. Any vendor is free to reuse helpers. If this happens, I'm fine with moving the folder of out intel/. The helpers greatly simplify building xdp_buff, running a prog, handling the verdict, implement XDP_TX, .ndo_xdp_xmit, XDP buffer completion. Same applies to XSk (with XSk xmit instead of .ndo_xdp_xmit, plus stuff like XSk wakeup). They are entirely generic with no HW definitions or assumptions. HW-specific stuff like parsing Rx desc / filling Tx desc is passed from the driver as inline callbacks. For now, key assumptions that optimize performance / avoid code bloat, but might not fit every driver in driver/net/: * netmem holding the buffers are always order-0; * driver has separate XDP Tx queues, doesn't use stack queues for that. For best efficiency, you may want to have nr_cpu_ids XDP queues, but less (queue sharing) is also supported; * XDP Tx queues are interrupt-less and use "lazy" cleaning only when there are less than 1/4 free Tx descriptors of the queue size; * main target platforms are 64-bit, although 32-bit is also fully supported, but the code might be not as optimized for them. Library code already supports multi-buffer for all kinds of Tx and both header split and no split for Rx and Tx. Frags can come from devmem/io_uring etc., direct `struct page *` is used only for header buffers for which it's always true. Drivers are free to pass their own Rx hints and XSK xmit hints ops. XDP_TX and ndo_xdp_xmit use onstack bulk for the frames to be sent and send them by batches of 16 buffers. This eats ~280 bytes on the stack, but gives good boosts and allow to greatly optimize the main sending function leaving it without any error/exception paths. XSk xmit fills Tx descriptors in the loop unrolled by 8. This was proven to improve perf on ice and i40e. XDP_TX and ndo_xdp_xmit doesn't use unrolling as I wasn't able to get any improvements in those scenenarios from this, while +1 Kb for their sending functions for nothing doesn't sound reasonable. XSk wakeup, instead of traditionally used "SW interrupts" provided by NICs, uses IPI to schedule NAPI on the CPU corresponding to the given queue pair. It gives better control over CPU distribution and in general performs way better than "SW interrupts", plus allows us to not pass any HW-specific callbacks there. The code is built the way that all callbacks passed from drivers get inlined; in general, most of hotpath gets inlined. Everything slow/exception lands to .c files in the libeth folder, doesn't create copies in the drivers themselves and doesn't overloat hotpath. Sure, inlining means that hotpath will be compiled into every driver that uses the lib, but the core code is written in one place, so no copying of bugs happens. Fixed once -- works everywhere. The last commit might look like sorta hack, but it gives really good boosts and decreases object code size, plus there are checks that all those wider accesses are fully safe, so I don't feel anything bad about it. An example of using libeth_xdp can be found either on my GitHub or on the mailing lists here ("XDP for idpf"). Macros for building driver XDP functions lead to that some implementations (XDP_TX, ndo_xdp_xmit etc.) consist of really only a few lines. * '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: libeth: xdp, xsk: access adjacent u32s as u64 where applicable libeth: xsk: add XSkFQ refill and XSk wakeup helpers libeth: xsk: add XSk Rx processing support libeth: xsk: add XSk xmit functions libeth: xsk: add XSk XDP_TX sending helpers libeth: xdp: add RSS hash hint and XDP features setup helpers libeth: xdp: add templates for building driver-side callbacks libeth: xdp: add XDP prog run and verdict result handling libeth: xdp: add helpers for preparing/processing &libeth_xdp_buff libeth: xdp: add XDPSQ cleanup timers libeth: xdp: add XDPSQ locking helpers libeth: xdp: add XDPSQE completion helpers libeth: xdp: add .ndo_xdp_xmit() helpers libeth: xdp: add XDP_TX buffers sending libeth: support native XDP and register memory model libeth: convert to netmem libeth, libie: clean symbol exports up a little ==================== Link: https://patch.msgid.link/20250616201639.710420-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-16eth: iavf: migrate to new RXFH callbacksJakub Kicinski
Migrate to new callbacks added by commit 9bb00786fc61 ("net: ethtool: add dedicated callbacks for getting and setting rxfh fields"). I'm deleting all the boilerplate kdoc from the affected functions. It is somewhere between pointless and incorrect, just a burden for people refactoring the code. Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Joe Damato <joe@dama.to> Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20250614180907.4167714-8-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-16libeth: convert to netmemAlexander Lobakin
Back when the libeth Rx core was initially written, devmem was a draft and netmem_ref didn't exist in the mainline. Now that it's here, make libeth MP-agnostic before introducing any new code or any new library users. When it's known that the created PP/FQ is for header buffers, use faster "unsafe" underscored netmem <--> virt accessors as netmem_is_net_iov() is always false in that case, but consumes some cycles (bit test + true branch). Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.16-rc2). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-10iavf: fix reset_task for early reset eventAhmed Zaki
If a reset event is received from the PF early in the init cycle, the state machine hangs for about 25 seconds. Reproducer: echo 1 > /sys/class/net/$PF0/device/sriov_numvfs ip link set dev $PF0 vf 0 mac $NEW_MAC The log shows: [792.620416] ice 0000:5e:00.0: Enabling 1 VFs [792.738812] iavf 0000:5e:01.0: enabling device (0000 -> 0002) [792.744182] ice 0000:5e:00.0: Enabling 1 VFs with 17 vectors and 16 queues per VF [792.839964] ice 0000:5e:00.0: Setting MAC 52:54:00:00:00:11 on VF 0. VF driver will be reinitialized [813.389684] iavf 0000:5e:01.0: Failed to communicate with PF; waiting before retry [818.635918] iavf 0000:5e:01.0: Hardware came out of reset. Attempting reinit. [818.766273] iavf 0000:5e:01.0: Multiqueue Enabled: Queue pair count = 16 Fix it by scheduling the reset task and making the reset task capable of resetting early in the init cycle. Fixes: ef8693eb90ae3 ("i40evf: refactor reset handling") Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com> Tested-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-09iavf: convert to NAPI IRQ affinity APIAhmed Zaki
Commit bd7c00605ee0 ("net: move aRFS rmap management and CPU affinity to core") allows the drivers to delegate the IRQ affinity to the NAPI instance. However, the driver needs to use a persistent NAPI config and explicitly set/unset the NAPI<->IRQ association. Convert to the new IRQ affinity API. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-09net: intel: move RSS packet classifier types to libieJacob Keller
The Intel i40e, iavf, and ice drivers all include a definition of the packet classifier filter types used to program RSS hash enable bits. For i40e, these bits are used for both the PF and VF to configure the PFQF_HENA and VFQF_HENA registers. For ice and iAVF, these bits are used to communicate the desired hash enable filter over virtchnl via its struct virtchnl_rss_hashena. The virtchnl.h header makes no mention of where the bit definitions reside. Maintaining a separate copy of these bits across three drivers is cumbersome. Move the definition to libie as a new pctype.h header file. Each driver can include this, and drop its own definition. The ice implementation also defined a ICE_AVF_FLOW_FIELD_INVALID, intending to use this to indicate when there were no hash enable bits set. This is confusing, since the enumeration is using bit positions. A value of 0 *should* indicate the first bit. Instead, rewrite the code that uses ICE_AVF_FLOW_FIELD_INVALID to just check if the avf_hash is zero. From context this should be clear that we're checking if none of the bits are set. The values are kept as bit positions instead of encoding the BIT_ULL directly into their value. While most users will simply use BIT_ULL immediately, i40e uses the macros both with BIT_ULL and test_bit/set_bit calls. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-09net: intel: rename 'hena' to 'hashcfg' for clarityJacob Keller
i40e, ice, and iAVF all use 'hena' as a shorthand for the "hash enable" configuration. This comes originally from the X710 datasheet 'xxQF_HENA' registers. In the context of the registers the meaning is fairly clear. However, on its own, hena is a weird name that can be more difficult to understand. This is especially true in ice. The E810 hardware doesn't even have registers with HENA in the name. Replace the shorthand 'hena' with 'hashcfg'. This makes it clear the variables deal with the Hash configuration, not just a single boolean on/off for all hashing. Do not update the register names. These come directly from the datasheet for X710 and X722, and it is more important that the names can be searched. Suggested-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-03iavf: get rid of the crit lockPrzemek Kitszel
Get rid of the crit lock. That frees us from the error prone logic of try_locks. Thanks to netdev_lock() by Jakub it is now easy, and in most cases we were protected by it already - replace crit lock by netdev lock when it was not the case. Lockdep reports that we should cancel the work under crit_lock [splat1], and that was the scheme we have mostly followed since [1] by Slawomir. But when that is done we still got into deadlocks [splat2]. So instead we should look at the bigger problem, namely "weird locking/scheduling" of the iavf. The first step to fix that is to remove the crit lock. I will followup with a -next series that simplifies scheduling/tasks. Cancel the work without netdev lock (weird unlock+lock scheme), to fix the [splat2] (which would be totally ugly if we would kept the crit lock). Extend protected part of iavf_watchdog_task() to include scheduling more work. Note that the removed comment in iavf_reset_task() was misplaced, it belonged to inside of the removed if condition, so it's gone now. [splat1] - w/o this patch - The deadlock during VF removal: WARNING: possible circular locking dependency detected sh/3825 is trying to acquire lock: ((work_completion)(&(&adapter->watchdog_task)->work)){+.+.}-{0:0}, at: start_flush_work+0x1a1/0x470 but task is already holding lock: (&adapter->crit_lock){+.+.}-{4:4}, at: iavf_remove+0xd1/0x690 [iavf] which lock already depends on the new lock. [splat2] - when cancelling work under crit lock, w/o this series, see [2] for the band aid attempt WARNING: possible circular locking dependency detected sh/3550 is trying to acquire lock: ((wq_completion)iavf){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x26/0x90 but task is already holding lock: (&dev->lock){+.+.}-{4:4}, at: iavf_remove+0xa6/0x6e0 [iavf] which lock already depends on the new lock. [1] fc2e6b3b132a ("iavf: Rework mutexes for better synchronisation") [2] https://github.com/pkitszel/linux/commit/52dddbfc2bb60294083f5711a158a Fixes: d1639a17319b ("iavf: fix a deadlock caused by rtnl and driver's lock circular dependencies") Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-03iavf: sprinkle netdev_assert_locked() annotationsPrzemek Kitszel
Lockdep annotations help in general, but here it is extra good, as next commit will remove crit lock. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-03iavf: extract iavf_watchdog_step() out of iavf_watchdog_task()Przemek Kitszel
Finish up easy refactor of watchdog_task, total for this + prev two commits is: 1 file changed, 47 insertions(+), 82 deletions(-) Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-03iavf: simplify watchdog_task in terms of adminq task schedulingPrzemek Kitszel
Simplify the decision whether to schedule adminq task. The condition is the same, but it is executed in more scenarios. Note that movement of watchdog_done label makes this commit a bit surprising. (Hence not squashing it to anything bigger). Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-03iavf: centralize watchdog requeueing itselfPrzemek Kitszel
Centralize the unlock(critlock); unlock(netdev); queue_delayed_work(watchog_task); pattern to one place. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-03iavf: iavf_suspend(): take RTNL before netdev_lock()Przemek Kitszel
Fix an obvious violation of lock ordering. Jakub's [1] added netdev_lock() call that is wrong ordered wrt RTNL, but the Fixes tag points to crit_lock being wrongly placed (by lockdep standards). Actual reason we got it wrong is dated back to critical section managed by pure flag checks, which is with us since the very beginning. [1] afc664987ab3 ("eth: iavf: extend the netdev_lock usage") Fixes: 5ac49f3c2702 ("iavf: use mutexes for locking of critical sections") Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-08net: move misc netdev_lock flavors to a separate headerJakub Kicinski
Move the more esoteric helpers for netdev instance lock to a dedicated header. This avoids growing netdevice.h to infinity and makes rebuilding the kernel much faster (after touching the header with the helpers). The main netdev_lock() / netdev_unlock() functions are used in static inlines in netdevice.h and will probably be used most commonly, so keep them in netdevice.h. Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250307183006.2312761-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-06net: hold netdev instance lock during nft ndo_setup_tcStanislav Fomichev
Introduce new dev_setup_tc for nft ndo_setup_tc paths. Reviewed-by: Eric Dumazet <edumazet@google.com> Cc: Saeed Mahameed <saeed@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250305163732.2766420-3-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-06net: hold netdev instance lock during ndo_open/ndo_stopStanislav Fomichev
For the drivers that use shaper API, switch to the mode where core stack holds the netdev lock. This affects two drivers: * iavf - already grabs netdev lock in ndo_open/ndo_stop, so mostly remove these * netdevsim - switch to _locked APIs to avoid deadlock iavf_close diff is a bit confusing, the existing call looks like this: iavf_close() { netdev_lock() .. netdev_unlock() wait_event_timeout(down_waitqueue) } I change it to the following: netdev_lock() iavf_close() { .. netdev_unlock() wait_event_timeout(down_waitqueue) netdev_lock() // reusing this lock call } netdev_unlock() Since I'm reusing existing netdev_lock call, so it looks like I only add netdev_unlock. Cc: Saeed Mahameed <saeed@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250305163732.2766420-2-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.14-rc5). Conflicts: drivers/net/ethernet/cadence/macb_main.c fa52f15c745c ("net: cadence: macb: Synchronize stats calculations") 75696dd0fd72 ("net: cadence: macb: Convert to get_stats64") https://lore.kernel.org/20250224125848.68ee63e5@canb.auug.org.au Adjacent changes: drivers/net/ethernet/intel/ice/ice_sriov.c 79990cf5e7ad ("ice: Fix deinitializing VF in error path") a203163274a4 ("ice: simplify VF MSI-X managing") net/ipv4/tcp.c 18912c520674 ("tcp: devmem: don't write truncated dmabuf CMSGs to userspace") 297d389e9e5b ("net: prefix devmem specific helpers") net/mptcp/subflow.c 8668860b0ad3 ("mptcp: reset when MPTCP opts are dropped after join") c3349a22c200 ("mptcp: consolidate subflow cleanup") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-25iavf: fix circular lock dependency with netdev_lockJacob Keller
We have recently seen reports of lockdep circular lock dependency warnings when loading the iAVF driver: [ 1504.790308] ====================================================== [ 1504.790309] WARNING: possible circular locking dependency detected [ 1504.790310] 6.13.0 #net_next_rt.c2933b2befe2.el9 Not tainted [ 1504.790311] ------------------------------------------------------ [ 1504.790312] kworker/u128:0/13566 is trying to acquire lock: [ 1504.790313] ffff97d0e4738f18 (&dev->lock){+.+.}-{4:4}, at: register_netdevice+0x52c/0x710 [ 1504.790320] [ 1504.790320] but task is already holding lock: [ 1504.790321] ffff97d0e47392e8 (&adapter->crit_lock){+.+.}-{4:4}, at: iavf_finish_config+0x37/0x240 [iavf] [ 1504.790330] [ 1504.790330] which lock already depends on the new lock. [ 1504.790330] [ 1504.790330] [ 1504.790330] the existing dependency chain (in reverse order) is: [ 1504.790331] [ 1504.790331] -> #1 (&adapter->crit_lock){+.+.}-{4:4}: [ 1504.790333] __lock_acquire+0x52d/0xbb0 [ 1504.790337] lock_acquire+0xd9/0x330 [ 1504.790338] mutex_lock_nested+0x4b/0xb0 [ 1504.790341] iavf_finish_config+0x37/0x240 [iavf] [ 1504.790347] process_one_work+0x248/0x6d0 [ 1504.790350] worker_thread+0x18d/0x330 [ 1504.790352] kthread+0x10e/0x250 [ 1504.790354] ret_from_fork+0x30/0x50 [ 1504.790357] ret_from_fork_asm+0x1a/0x30 [ 1504.790361] [ 1504.790361] -> #0 (&dev->lock){+.+.}-{4:4}: [ 1504.790364] check_prev_add+0xf1/0xce0 [ 1504.790366] validate_chain+0x46a/0x570 [ 1504.790368] __lock_acquire+0x52d/0xbb0 [ 1504.790370] lock_acquire+0xd9/0x330 [ 1504.790371] mutex_lock_nested+0x4b/0xb0 [ 1504.790372] register_netdevice+0x52c/0x710 [ 1504.790374] iavf_finish_config+0xfa/0x240 [iavf] [ 1504.790379] process_one_work+0x248/0x6d0 [ 1504.790381] worker_thread+0x18d/0x330 [ 1504.790383] kthread+0x10e/0x250 [ 1504.790385] ret_from_fork+0x30/0x50 [ 1504.790387] ret_from_fork_asm+0x1a/0x30 [ 1504.790389] [ 1504.790389] other info that might help us debug this: [ 1504.790389] [ 1504.790389] Possible unsafe locking scenario: [ 1504.790389] [ 1504.790390] CPU0 CPU1 [ 1504.790391] ---- ---- [ 1504.790391] lock(&adapter->crit_lock); [ 1504.790393] lock(&dev->lock); [ 1504.790394] lock(&adapter->crit_lock); [ 1504.790395] lock(&dev->lock); [ 1504.790397] [ 1504.790397] *** DEADLOCK *** This appears to be caused by the change in commit 5fda3f35349b ("net: make netdev_lock() protect netdev->reg_state"), which added a netdev_lock() in register_netdevice. The iAVF driver calls register_netdevice() from iavf_finish_config(), as a final stage of its state machine post-probe. It currently takes the RTNL lock, then the netdev lock, and then the device critical lock. This pattern is used throughout the driver. Thus there is a strong dependency that the crit_lock should not be acquired before the net device lock. The change to register_netdevice creates an ABBA lock order violation because the iAVF driver is holding the crit_lock while calling register_netdevice, which then takes the netdev_lock. It seems likely that future refactors could result in netdev APIs which hold the netdev_lock while calling into the driver. This means that we should not re-order the locks so that netdev_lock is acquired after the device private crit_lock. Instead, notice that we already release the netdev_lock prior to calling the register_netdevice. This flow only happens during the early driver initialization as we transition through the __IAVF_STARTUP, __IAVF_INIT_VERSION_CHECK, __IAVF_INIT_GET_RESOURCES, etc. Analyzing the places where we take crit_lock in the driver there are two sources: a) several of the work queue tasks including adminq_task, watchdog_task, reset_task, and the finish_config task. b) various callbacks which ultimately stem back to .ndo operations or ethtool operations. The latter cannot be triggered until after the netdevice registration is completed successfully. The iAVF driver uses alloc_ordered_workqueue, which is an unbound workqueue that has a max limit of 1, and thus guarantees that only a single work item on the queue is executing at any given time, so none of the other work threads could be executing due to the ordered workqueue guarantees. The iavf_finish_config() function also does not do anything else after register_netdevice, unless it fails. It seems unlikely that the driver private crit_lock is protecting anything that register_netdevice() itself touches. Thus, to fix this ABBA lock violation, lets simply release the adapter->crit_lock as well as netdev_lock prior to calling register_netdevice(). We do still keep holding the RTNL lock as required by the function. If we do fail to register the netdevice, then we re-acquire the adapter critical lock to finish the transition back to __IAVF_INIT_CONFIG_ADAPTER. This ensures every call where both netdev_lock and the adapter->crit_lock are acquired under the same ordering. Fixes: afc664987ab3 ("eth: iavf: extend the netdev_lock usage") Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20250224190647.3601930-5-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-25ethtool: Symmetric OR-XOR RSS hashGal Pressman
Add an additional type of symmetric RSS hash type: OR-XOR. The "Symmetric-OR-XOR" algorithm transforms the input as follows: (SRC_IP | DST_IP, SRC_IP ^ DST_IP, SRC_PORT | DST_PORT, SRC_PORT ^ DST_PORT) Change 'cap_rss_sym_xor_supported' to 'supported_input_xfrm', a bitmap of supported RXH_XFRM_* types. Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Edward Cree <ecree.xilinx@gmail.com> Link: https://patch.msgid.link/20250224174416.499070-2-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-14iavf: add support for Rx timestamps to hotpathJacob Keller
Add support for receive timestamps to the Rx hotpath. This support only works when using the flexible descriptor format, so make sure that we request this format by default if we have receive timestamp support available in the PTP capabilities. In order to report the timestamps to userspace, we need to perform timestamp extension. The Rx descriptor does actually contain the "40 bit" timestamp. However, upper 32 bits which contain nanoseconds are conveniently stored separately in the descriptor. We could extract the 32bits and lower 8 bits, then perform a bitwise OR to calculate the 40bit value. This makes no sense, because the timestamp extension algorithm would simply discard the lower 8 bits anyways. Thus, implement timestamp extension as iavf_ptp_extend_32b_timestamp(), and extract and forward only the 32bits of nominal nanoseconds. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Sunil Goutham <sgoutham@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14iavf: handle set and get timestamps opsJacob Keller
Add handlers for the .ndo_hwtstamp_get and .ndo_hwtstamp_set ops which allow userspace to request timestamp enablement for the device. This support allows standard Linux applications to request the timestamping desired. As with other devices that support timestamping all packets, the driver will upgrade any request for timestamping of a specific type of packet to HWTSTAMP_FILTER_ALL. The current configuration is stored, so that it can be retrieved by calling .ndo_hwtstamp_get The Tx timestamps are not implemented yet so calling set ops for Tx path will end with EOPNOTSUPP error code. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14iavf: Implement checking DD desc fieldMateusz Polchlopek
Rx timestamping introduced in PF driver caused the need of refactoring the VF driver mechanism to check packet fields. The function to check errors in descriptor has been removed and from now only previously set struct fields are being checked. The field DD (descriptor done) needs to be checked at the very beginning, before extracting other fields. Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14iavf: refactor iavf_clean_rx_irq to support legacy and flex descriptorsJacob Keller
Using VIRTCHNL_VF_OFFLOAD_FLEX_DESC, the iAVF driver is capable of negotiating to enable the advanced flexible descriptor layout. Add the flexible NIC layout (RXDID=2) as a member of the Rx descriptor union. Also add bit position definitions for the status and error indications that are needed. The iavf_clean_rx_irq function needs to extract a few fields from the Rx descriptor, including the size, rx_ptype, and vlan_tag. Move the extraction to a separate function that decodes the fields into a structure. This will reduce the burden for handling multiple descriptor types by keeping the relevant extraction logic in one place. To support handling an additional descriptor format with minimal code duplication, refactor Rx checksum handling so that the general logic is separated from the bit calculations. Introduce an iavf_rx_desc_decoded structure which holds the relevant bits decoded from the Rx descriptor. This will enable implementing flexible descriptor handling without duplicating the general logic twice. Introduce an iavf_extract_flex_rx_fields, iavf_flex_rx_hash, and iavf_flex_rx_csum functions which operate on the flexible NIC descriptor format instead of the legacy 32 byte format. Based on the negotiated RXDID, select the correct function for processing the Rx descriptors. With this change, the Rx hot path should be functional when using either the default legacy 32byte format or when we switch to the flexible NIC layout. Modify the Rx hot path to add support for the flexible descriptor format and add request enabling Rx timestamps for all queues. As in ice, make sure we bump the checksum level if the hardware detected a packet type which could have an outer checksum. This is important because hardware only verifies the inner checksum. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14iavf: define Rx descriptors as qwordsMateusz Polchlopek
The union iavf_32byte_rx_desc consists of two unnamed structs defined inside. One of them represents legacy 32 byte descriptor and second the 16 byte descriptor (extended to 32 byte). Each of them consists of bunch of unions, structs and __le fields that represent specific fields in descriptor. This commit changes the representation of iavf_32byte_rx_desc union to store four __le64 fields (qw0, qw1, qw2, qw3) that represent quad-words. Those quad-words will be then accessed by calling leXY_get_bits macros in upcoming commits. Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14iavf: periodically cache PHC timeJacob Keller
The Rx timestamps reported by hardware may only have 32 bits of storage for nanosecond time. These timestamps cannot be directly reported to the Linux stack, as it expects 64bits of time. To handle this, the timestamps must be extended using an algorithm that calculates the corrected 64bit timestamp by comparison between the PHC time and the timestamp. This algorithm requires the PHC time to be captured within ~2 seconds of when the timestamp was captured. Instead of trying to read the PHC time in the Rx hotpath, the algorithm relies on a cached value that is periodically updated. Keep this cached time up to date by using the PTP .do_aux_work kthread function. The iavf_ptp_do_aux_work will reschedule itself about twice a second, and will check whether or not the cached PTP time needs to be updated. If so, it issues a VIRTCHNL_OP_1588_PTP_GET_TIME to request the time from the PF. The jitter and latency involved with this command aren't important, because the cached time just needs to be kept up to date within about ~2 seconds. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14iavf: add support for indirect access to PHC timeJacob Keller
Implement support for reading the PHC time indirectly via the VIRTCHNL_OP_1588_PTP_GET_TIME operation. Based on some simple tests with ftrace, the latency of the indirect clock access appears to be about ~110 microseconds. This is due to the cost of preparing a message to send over the virtchnl queue. This is expected, due to the increased jitter caused by sending messages over virtchnl. It is not easy to control the precise time that the message is sent by the VF, or the time that the message is responded to by the PF, or the time that the message sent from the PF is received by the VF. For sending the request, note that many PTP related operations will require sending of VIRTCHNL messages. Instead of adding a separate AQ flag and storage for each operation, setup a simple queue mechanism for queuing up virtchnl messages. Each message will be converted to a iavf_ptp_aq_cmd structure which ends with a flexible array member. A single AQ flag is added for processing messages from this queue. In principle this could be extended to handle arbitrary virtchnl messages. For now it is kept to PTP-specific as the need is primarily for handling PTP-related commands. Use this to implement .gettimex64 using the indirect method via the virtchnl command. The response from the PF is processed and stored into the cached_phc_time. A wait queue is used to allow the PTP clock gettime request to sleep until the message is sent from the PF. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14iavf: add initial framework for registering PTP clockJacob Keller
Add the iavf_ptp.c file and fill it in with a skeleton framework to allow registering the PTP clock device. Add implementation of helper functions to check if a PTP capability is supported and handle change in PTP capabilities. Enabling virtual clock would be possible, though it would probably perform poorly due to the lack of direct time access. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Sai Krishna <saikrishnag@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Co-developed-by: Ahmed Zaki <ahmed.zaki@intel.com> Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14iavf: negotiate PTP capabilitiesJacob Keller
Add a new extended capabilities negotiation to exchange information from the PF about what PTP capabilities are supported by this VF. This requires sending a VIRTCHNL_OP_1588_PTP_GET_CAPS message, and waiting for the response from the PF. Handle this early on during the VF initialization. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14iavf: add support for negotiating flexible RXDID formatJacob Keller
Enable support for VIRTCHNL_VF_OFFLOAD_RX_FLEX_DESC, to enable the VF driver the ability to determine what Rx descriptor formats are available. This requires sending an additional message during initialization and reset, the VIRTCHNL_OP_GET_SUPPORTED_RXDIDS. This operation requests the supported Rx descriptor IDs available from the PF. This is treated the same way that VLAN V2 capabilities are handled. Add a new set of extended capability flags, used to process send and receipt of the VIRTCHNL_OP_GET_SUPPORTED_RXDIDS message. This ensures we finish negotiating for the supported descriptor formats prior to beginning configuration of receive queues. This change stores the supported format bitmap into the iavf_adapter structure. Additionally, if VIRTCHNL_VF_OFFLOAD_RX_FLEX_DESC is enabled by the PF, we need to make sure that the Rx queue configuration specifies the format. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-11iavf: Fix a locking bug in an error pathBart Van Assche
If the netdev lock has been obtained, unlock it before returning. This bug has been detected by the Clang thread-safety analyzer. Fixes: afc664987ab3 ("eth: iavf: extend the netdev_lock usage") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://patch.msgid.link/20250206175114.1974171-28-bvanassche@acm.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-24iavf: allow changing VLAN state without calling PFMichal Swiatkowski
First case: > ip l a l $VF name vlanx type vlan id 100 > ip l d vlanx > ip l a l $VF name vlanx type vlan id 100 As workqueue can be execute after sometime, there is a window to have call trace like that: - iavf_del_vlan - iavf_add_vlan - iavf_del_vlans (wq) It means that our VLAN 100 will change the state from IAVF_VLAN_ACTIVE to IAVF_VLAN_REMOVE (iavf_del_vlan). After that in iavf_add_vlan state won't be changed because VLAN 100 is on the filter list. The final result is that the VLAN 100 filter isn't added in hardware (no iavf_add_vlans call). To fix that change the state if the filter wasn't removed yet directly to active. It is save as IAVF_VLAN_REMOVE means that virtchnl message wasn't sent yet. Second case: > ip l a l $VF name vlanx type vlan id 100 Any type of VF reset ex. change trust > ip l s $PF vf $VF_NUM trust on > ip l d vlanx > ip l a l $VF name vlanx type vlan id 100 In case of reset iavf driver is responsible for readding all filters that are being used. To do that all VLAN filters state are changed to IAVF_VLAN_ADD. Here is even longer window for changing VLAN state from kernel side, as workqueue isn't called immediately. We can have call trace like that: - changing to IAVF_VLAN_ADD (after reset) - iavf_del_vlan (called from kernel ops) - iavf_del_vlans (wq) Not exsisitng VLAN filters will be removed from hardware. It isn't a bug, ice driver will handle it fine. However, we can have call trace like that: - changing to IAVF_VLAN_ADD (after reset) - iavf_del_vlan (called from kernel ops) - iavf_add_vlan (called from kernel ops) - iavf_del_vlans (wq) With fix for previous case we end up with no VLAN filters in hardware. We have to remove VLAN filters if the state is IAVF_VLAN_ADD and delete VLAN was called. It is save as IAVF_VLAN_ADD means that virtchnl message wasn't sent yet. Fixes: 0c0da0e95105 ("iavf: refactor VLAN filter states") Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-15net: protect NAPI enablement with netdev_lock()Jakub Kicinski
Wrap napi_enable() / napi_disable() with netdev_lock(). Provide the "already locked" flavor of the API. iavf needs the usual adjustment. A number of drivers call napi_enable() under a spin lock, so they have to be modified to take netdev_lock() first, then spin lock then call napi_enable_locked(). Protecting napi_enable() implies that napi->napi_id is protected by netdev_lock(). Acked-by: Francois Romieu <romieu@fr.zoreil.com> # via-velocity Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115035319.559603-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15net: protect netdev->napi_list with netdev_lock()Jakub Kicinski
Hold netdev->lock when NAPIs are getting added or removed. This will allow safe access to NAPI instances of a net_device without rtnl_lock. Create a family of helpers which assume the lock is already taken. Switch iavf to them, as it makes extensive use of netdev->lock, already. Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115035319.559603-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15net: add netdev_lock() / netdev_unlock() helpersJakub Kicinski
Add helpers for locking the netdev instance, use it in drivers and the shaper code. This will make grepping for the lock usage much easier, as we extend the lock to cover more fields. Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://patch.msgid.link/20250115035319.559603-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-13eth: iavf: extend the netdev_lock usageJakub Kicinski
iavf uses the netdev->lock already to protect shapers. In an upcoming series we'll try to protect NAPI instances with netdev->lock. We need to modify the protection a bit. All NAPI related calls in the driver need to be consistently under the lock. This will allow us to easily switch to a "we already hold the lock" NAPI API later. register_netdevice(), OTOH, must not be called under the netdev_lock() as we do not intend to have an "already locked" version of this call. Link: https://patch.msgid.link/20250111071339.3709071-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-02module: Convert symbol namespace to string literalPeter Zijlstra
Clean up the existing export namespace code along the same lines of commit 33def8498fdd ("treewide: Convert macro and uses of __section(foo) to __section("foo")") and for the same reason, it is not desired for the namespace argument to be a macro expansion itself. Scripted using git grep -l -e MODULE_IMPORT_NS -e EXPORT_SYMBOL_NS | while read file; do awk -i inplace ' /^#define EXPORT_SYMBOL_NS/ { gsub(/__stringify\(ns\)/, "ns"); print; next; } /^#define MODULE_IMPORT_NS/ { gsub(/__stringify\(ns\)/, "ns"); print; next; } /MODULE_IMPORT_NS/ { $0 = gensub(/MODULE_IMPORT_NS\(([^)]*)\)/, "MODULE_IMPORT_NS(\"\\1\")", "g"); } /EXPORT_SYMBOL_NS/ { if ($0 ~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+),/) { if ($0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/ && $0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(\)/ && $0 !~ /^my/) { getline line; gsub(/[[:space:]]*\\$/, ""); gsub(/[[:space:]]/, "", line); $0 = $0 " " line; } $0 = gensub(/(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/, "\\1(\\2, \"\\3\")", "g"); } } { print }' $file; done Requested-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://mail.google.com/mail/u/2/#inbox/FMfcgzQXKWgMmjdFwwdsfgxzKpVHWPlc Acked-by: Greg KH <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-10-10Merge branch 'net-introduce-tx-h-w-shaping-api'Jakub Kicinski
Paolo Abeni says: ==================== net: introduce TX H/W shaping API We have a plurality of shaping-related drivers API, but none flexible enough to meet existing demand from vendors[1]. This series introduces new device APIs to configure in a flexible way TX H/W shaping. The new functionalities are exposed via a newly defined generic netlink interface and include introspection capabilities. Some self-tests are included, on top of a dummy netdevsim implementation. Finally a basic implementation for the iavf driver is provided. Some usage examples: * Configure shaping on a given queue: ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/shaper.yaml \ --do set --json '{"ifindex": '$IFINDEX', "shaper": {"handle": {"scope": "queue", "id":'$QUEUEID'}, "bw-max": 2000000}}' * Container B/W sharing The orchestration infrastructure wants to group the container-related queues under a RR scheduling and limit the aggregate bandwidth: ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/shaper.yaml \ --do group --json '{"ifindex": '$IFINDEX', "leaves": [ {"handle": {"scope": "queue", "id":'$QID1'}, "weight": '$W1'}, {"handle": {"scope": "queue", "id":'$QID2'}, "weight": '$W2'}], {"handle": {"scope": "queue", "id":'$QID3'}, "weight": '$W3'}], "handle": {"scope":"node"}, "bw-max": 10000000}' {'ifindex': $IFINDEX, 'handle': {'scope': 'node', 'id': 0}} Q1 \ \ Q2 -- node 0 ------- netdev / (bw-max: 10M) Q3 / * Delegation A containers wants to limit the aggregate B/W bandwidth of 2 of the 3 queues it owns - the starting configuration is the one from the previous point: SPEC=Documentation/netlink/specs/net_shaper.yaml ./tools/net/ynl/cli.py --spec $SPEC \ --do group --json '{"ifindex": '$IFINDEX', "leaves": [ {"handle": {"scope": "queue", "id":'$QID1'}, "weight": '$W1'}, {"handle": {"scope": "queue", "id":'$QID2'}, "weight": '$W2'}], "handle": {"scope": "node"}, "bw-max": 5000000 }' {'ifindex': $IFINDEX, 'handle': {'scope': 'node', 'id': 1}} Q1 -- node 1 --------\ / (bw-max: 5M) \ Q2 / node 0 ------- netdev /(bw-max: 10M) Q3 ------------------/ In a group operation, when prior to the op itself, the leaves have different parents, the user must specify the parent handle for the group. I.e., starting from the previous config: ./tools/net/ynl/cli.py --spec $SPEC \ --do group --json '{"ifindex": '$IFINDEX', "leaves": [ {"handle": {"scope": "queue", "id":'$QID1'}, "weight": '$W1'}, {"handle": {"scope": "queue", "id":'$QID3'}, "weight": '$W3'}], "handle": {"scope": "node"}, "bw-max": 3000000 }' Netlink error: Invalid argument nl_len = 96 (80) nl_flags = 0x300 nl_type = 2 error: -22 extack: {'msg': 'All the leaves shapers must have the same old parent'} ./tools/net/ynl/cli.py --spec $SPEC \ --do group --json '{"ifindex": '$IFINDEX', "leaves": [ {"handle": {"scope": "queue", "id":'$QID1'}, "weight": '$W1'}, {"handle": {"scope": "queue", "id":'$QID3'}, "weight": '$W3'}], "handle": {"scope": "node"}, "parent": {"scope": "node", "id": 1}, "bw-max": 3000000 } {'ifindex': $IFINDEX, 'handle': {'scope': 'node', 'id': 2}} Q1 -- node 2 --- /(bw-max:3M)\ Q3 / \ ---- node 1 \ / (bw-max: 5M)\ Q2 node 0 ------- netdev (bw-max: 10M) * Cleanup: Still starting from config 1To delete a single queue shaper ./tools/net/ynl/cli.py --spec $SPEC --do delete --json \ '{"ifindex": '$IFINDEX', "handle": {"scope": "queue", "id":'$QID3'}}' Q1 -- node 2 --- (bw-max:3M)\ \ ---- node 1 \ / (bw-max: 5M)\ Q2 node 0 ------- netdev (bw-max: 10M) Deleting a node shaper relinks all its leaves to the node's parent: ./tools/net/ynl/cli.py --spec $SPEC --do delete --json \ '{"ifindex": '$IFINDEX', "handle": {"scope": "node", "id":2}}' Q1 ---\ \ node 1----- \ / (bw-max: 5M)\ Q2----/ node 0 ------- netdev (bw-max: 10M) Deleting the last shaper under a node shaper deletes the node, too: ./tools/net/ynl/cli.py --spec $SPEC --do delete --json \ '{"ifindex": '$IFINDEX', "handle": {"scope": "queue", "id":'$QID1'}}' ./tools/net/ynl/cli.py --spec $SPEC --do delete --json \ '{"ifindex": '$IFINDEX', "handle": {"scope": "queue", "id":'$QID2'}}' ./tools/net/ynl/cli.py --spec $SPEC --do get --json \ '{"ifindex": '$IFINDEX', "handle": {"scope": "node", "id": 1}}' Netlink error: No such file or directory nl_len = 44 (28) nl_flags = 0x300 nl_type = 2 error: -2 extack: {'bad-attr': '.handle'} Such delete recurses on parents that are left over with no leaves: ./tools/net/ynl/cli.py --spec $SPEC --do get --json \ '{"ifindex": '$IFINDEX', "handle": {"scope": "node", "id": 0}}' Netlink error: No such file or directory nl_len = 44 (28) nl_flags = 0x300 nl_type = 2 error: -2 extack: {'bad-attr': '.handle'} v8: https://lore.kernel.org/cover.1727704215.git.pabeni@redhat.com v7: https://lore.kernel.org/cover.1725919039.git.pabeni@redhat.com v6: https://lore.kernel.org/cover.1725457317.git.pabeni@redhat.com v5: https://lore.kernel.org/cover.1724944116.git.pabeni@redhat.com v4: https://lore.kernel.org/cover.1724165948.git.pabeni@redhat.com v3: https://lore.kernel.org/cover.1722357745.git.pabeni@redhat.com RFC v2: https://lore.kernel.org/cover.1721851988.git.pabeni@redhat.com RFC v1: https://lore.kernel.org/cover.1719518113.git.pabeni@redhat.com ==================== Link: https://patch.msgid.link/cover.1728460186.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-10iavf: add support to exchange qos capabilitiesSudheer Mogilappagari
During driver initialization VF determines QOS capability is allowed by PF and receives QOS parameters. After which quanta size for queues is configured which is not configurable and is set to 1KB currently. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/72cbeb9c88d40e557053c57d7531c96bed490576.1728460186.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-10iavf: Add net_shaper_ops supportSudheer Mogilappagari
Implement net_shaper_ops support for IAVF. This enables configuration of rate limiting on per queue basis. Customer intends to enforce bandwidth limit on Tx traffic steered to the queue by configuring rate limits on the queue. To set rate limiting for a queue, update shaper object of given queues in driver and send VIRTCHNL_OP_CONFIG_QUEUE_BW to PF to update HW configuration. Deleting shaper configured for queue is nothing but configuring shaper with bw_max 0. The PF restores the default rate limiting config when bw_max is zero. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/5a882cb51998c4c2c3d21fed521498eba1c8f079.1728460186.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-08iavf: Remove unused declarationsYue Haibing
There is no caller and implementation in tree. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-08-13iavf: add support for offloading tc U32 cls filtersAhmed Zaki
Add support for offloading cls U32 filters. Only "skbedit queue_mapping" and "drop" actions are supported. Also, only "ip" and "802_3" tc protocols are allowed. The PF must advertise the VIRTCHNL_VF_OFFLOAD_TC_U32 capability flag. Since the filters will be enabled via the FD stage at the PF, a new type of FDIR filters is added and the existing list and state machine are used. The new filters can be used to configure flow directors based on raw (binary) pattern in the rx packet. Examples: 0. # tc qdisc add dev enp175s0v0 ingress 1. Redirect UDP from src IP 192.168.2.1 to queue 12: # tc filter add dev <dev> protocol ip ingress u32 \ match u32 0x45000000 0xff000000 at 0 \ match u32 0x00110000 0x00ff0000 at 8 \ match u32 0xC0A80201 0xffffffff at 12 \ match u32 0x00000000 0x00000000 at 24 \ action skbedit queue_mapping 12 skip_sw 2. Drop all ICMP: # tc filter add dev <dev> protocol ip ingress u32 \ match u32 0x45000000 0xff000000 at 0 \ match u32 0x00010000 0x00ff0000 at 8 \ match u32 0x00000000 0x00000000 at 24 \ action drop skip_sw 3. Redirect ICMP traffic from MAC 3c:fd:fe:a5:47:e0 to queue 7 (note proto: 802_3): # tc filter add dev <dev> protocol 802_3 ingress u32 \ match u32 0x00003CFD 0x0000ffff at 4 \ match u32 0xFEA547E0 0xffffffff at 8 \ match u32 0x08004500 0xffffff00 at 12 \ match u32 0x00000001 0x000000ff at 20 \ match u32 0x0000 0x0000 at 40 \ action skbedit queue_mapping 7 skip_sw Notes on matches: 1 - All intermediate fields that are needed to parse the correct PTYPE must be provided (in e.g. 3: Ethernet Type 0x0800 in MAC, IP version and IP length: 0x45 and protocol: 0x01 (ICMP)). 2 - The last match must provide an offset that guarantees all required headers are accounted for, even if the last header is not matched. For example, in #2, the last match is 4 bytes at offset 24 starting from IP header, so the total is 14 (MAC) + 24 + 4 = 42, which is the sum of MAC+IP+ICMP headers. Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Reviewed-by: Marcin Szycik <marcin.szycik@linux.intel.com> Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-08-13iavf: refactor add/del FDIR filtersAhmed Zaki
In preparation for a second type of FDIR filters that can be added by tc-u32, move the add/del of the FDIR logic to be entirely contained in iavf_fdir.c. The iavf_find_fdir_fltr_by_loc() is renamed to iavf_find_fdir_fltr() to be more agnostic to the filter ID parameter (for now @loc, which is relevant only to current FDIR filters added via ethtool). The FDIR filter deletion is moved from iavf_del_fdir_ethtool() in ethtool.c to iavf_fdir_del_fltr(). While at it, fix a minor bug where the "fltr" is accessed out of the fdir_fltr_lock spinlock protection. Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Reviewed-by: Marcin Szycik <marcin.szycik@linux.intel.com> Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-07-11net: intel: Remove MODULE_AUTHORsTony Nguyen
We are moving away from the Sourceforge email address. Rather than removing or updating the email for the affected entries, remove the MODULE_AUTHOR altogether as its usage is incorrect [1]. Link: https://lore.kernel.org/netdev/20200626115236.7f36d379@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/ [1] Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Acked-by: Alexander Lobakin <aleksander.lobakin@intel.com> # libeth, libie Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-06-10net: intel: Use *-y instead of *-objs in MakefileAndy Shevchenko
*-objs suffix is reserved rather for (user-space) host programs while usually *-y suffix is used for kernel drivers (although *-objs works for that purpose for now). Let's correct the old usages of *-objs in Makefiles. Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20240607-next-2024-06-03-intel-next-batch-v3-1-d1470cee3347@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-22tracing/treewide: Remove second parameter of __assign_str()Steven Rostedt (Google)
With the rework of how the __string() handles dynamic strings where it saves off the source string in field in the helper structure[1], the assignment of that value to the trace event field is stored in the helper value and does not need to be passed in again. This means that with: __string(field, mystring) Which use to be assigned with __assign_str(field, mystring), no longer needs the second parameter and it is unused. With this, __assign_str() will now only get a single parameter. There's over 700 users of __assign_str() and because coccinelle does not handle the TRACE_EVENT() macro I ended up using the following sed script: git grep -l __assign_str | while read a ; do sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file; mv /tmp/test-file $a; done I then searched for __assign_str() that did not end with ';' as those were multi line assignments that the sed script above would fail to catch. Note, the same updates will need to be done for: __assign_str_len() __assign_rel_str() __assign_rel_str_len() I tested this with both an allmodconfig and an allyesconfig (build only for both). [1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/ Link: https://lore.kernel.org/linux-trace-kernel/20240516133454.681ba6a0@rorschach.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Christian König <christian.koenig@amd.com> for the amdgpu parts. Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> #for Acked-by: Rafael J. Wysocki <rafael@kernel.org> # for thermal Acked-by: Takashi Iwai <tiwai@suse.de> Acked-by: Darrick J. Wong <djwong@kernel.org> # xfs Tested-by: Guenter Roeck <linux@roeck-us.net>
2024-05-08iavf: flower: validate control flagsAsbjørn Sloth Tønnesen
This driver currently doesn't support any control flags. Use flow_rule_has_control_flags() to check for control flags, such as can be set through `tc flower ... ip_flags frag`. In case any control flags are masked, flow_rule_has_control_flags() sets a NL extended error message, and we return -EOPNOTSUPP. Only compile-tested. Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>