Tariq Toukan says:
====================
net/mlx5: HWS, Refactor action STE handling
This patch series by Vlad refactors how action STEs are handled for
hardware steering.
Definitions
-----------
* STE (Steering Table Entry): a building block for steering rules.
Simple rules consist of a single STE that specifies both the match
value and what actions to do. For more complex rules we have one or
more match STEs that point to one or more action STEs. It is these
action STEs which this patch series is primarily concerned with.
* RTC (Rule Table Context): a table that contains STEs. A matcher
currently consists of a match RTC and, if necessary, an action RTC.
This patch series decouples action RTCs from matchers and moves action
RTCs to a central pool.
* Matcher: a logical container for steering rules. While the items above
describe hardware concepts, a matcher is purely a software construct.
Current situation
-----------------
As mentioned above, a matcher currently consists of a match RTC (or
more, in case of complex matchers) and zero or one action RTCs. An
action RTC is only allocated if the matcher contains sufficiently
complicated action templates, or many actions.
When adding a rule, we decide based on its action template whether it
requires action STEs. If yes, we allocate the required number of action
STEs from the matcher's action RTC.
When updating a rule, we need to prevent the rule ever being in an
invalid state. So we need to allocate and write new action STEs first,
then update the match STE to point to them, and finally release the old
action STEs. So there is a state when a rule needs double the action
STEs it normally uses.
Thus, for a given matcher of log_sz=N, log_action_ste_sz=A, the action
RTC log_size is (N + A + 1). We need enough space to hold all the rules'
action STEs, and effectively double that space to account for the not
very common case of rules being updated. We could manage with far fewer
extra action STEs, but RTCs are allocated in powers of two. This results
in effective utilization of action RTCs of 50%, outside rule update
cases.
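As a purely illustrative example of the arithmetic: a matcher with
log_sz=N=20 (about one million rules) whose rules each need two action
STEs (A=1) uses 2^(20+1) = 2M action STEs in steady state. Doubling that
for in-flight updates and rounding up to a power of two gives an action
RTC of log_size N + A + 1 = 22, i.e. 4M STEs, of which at most half are
occupied outside of updates.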
This is further complicated when resizing matchers. To avoid updating
all the rules to point to new match STEs, we keep existing action RTCs
around as resize_data, and only free them when the matcher is freed.
Action STE pool
---------------
This patch series decouples action RTCs from matchers by creating a
per-queue pool. When a rule needs to allocate action STEs it does so
from the pool, creating a new RTC if needed. During update two sets of
action STEs are in use, possibly from different RTCs.
The pool is sharded per-queue to avoid lock contention. Each per-queue
pool consists of 3 elements, corresponding to rx-only, tx-only and
rx-and-tx use cases. The series takes this approach because rules that
are bidirectional require that their action STEs have the same index in
the rx- and tx-RTCs, and using a single RTC would result in
unidirectional rules wasting the STEs for the unused direction.
Pool elements, in turn, consist of a list of RTCs. The driver
progressively allocates larger RTCs as they are needed to amortize the
cost of allocation.
Allocation of elements (STEs) inside RTCs is modelled by an existing
mechanism, somewhat confusingly also known as a pool. The first few
patches in the series refactor this abstraction to simplify it and adapt
it to the new scheme.
Finally, this series implements periodic cleanup of unused action RTCs
as a new feature. Previously, once a matcher allocated an action RTC, it
would only be freed when the matcher is freed. This resulted in a lot of
wasted memory for matchers that had previously grown, but were now
mostly unused.
In contrast, action STE pools record a timestamp of when they were last
used. A cleanup routine periodically checks all pools. If a pool's last
usage was too far in the past, it is destroyed.
Benchmarks
----------
The test module creates a batch of (1 << 18) rules per queue and then
deletes them, in a loop. The rules are complex enough to require two
action STEs per rule.
Each queue is manipulated from a separate kernel workqueue, so there is
a 1:1 correspondence between threads and queues.
There are sleep statements between insert and delete batches so that
memory usage can be evaluated using `free -m`. The numbers below are the
diff between base memory usage (without the mlx5 module inserted) and
peak usage while running a test. The values are rounded to the nearest
hundred megabytes. The `queues` column lists how many queues the test
used.
queues   mem_before   mem_after
     1        1300M        800M
     4        4000M       2300M
     8        7300M       3300M
Across all of the tests, insertion and deletion rates are the same
before and after these patches.
Summary of the patches
----------------------
* Patch 1: Fix matcher action template attach to avoid overrunning the
buffer and correctly report errors.
* Patches 2-7: Clean up the existing pool abstraction: clarify semantics
and use cases, simplify the API and callers.
* Patch 8: Implement the new action STE pool structure.
* Patch 9: Use the action STE pool when manipulating rules.
* Patch 10: Remove action RTC from matcher.
* Patch 11: Add logic to periodically check and free unused action RTCs.
* Patch 12: Export action STE tables in debugfs for our dump tool.
====================
Link: https://patch.msgid.link/1744312662-356571-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Introduce a new type of dump object and dump all action STE tables,
along with information on their RTCs and STEs.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Hamdan Agbariya <hamdani@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/1744312662-356571-13-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Periodically check for unused action STE tables and free their
associated resources. In order to do this safely, add a per-queue lock
to synchronize the garbage collect work with regular operations on
steering rules.
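As a purely illustrative sketch (the example_* names, the locking
granularity and the 10 second period are assumptions, not the actual
mlx5hws code), such a garbage collector could look roughly like:
#include <linux/jiffies.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

#define EXAMPLE_GC_PERIOD	(10 * HZ)

struct example_queue_pool {
	spinlock_t lock;		/* serializes GC against rule operations */
	unsigned long last_used;	/* refreshed (under lock) on each allocation */
	/* ... RTC list and allocator state would live here ... */
};

struct example_ctx {
	/* set up with INIT_DELAYED_WORK(&ctx->gc_dwork, example_pool_gc_work)
	 * and first scheduled with schedule_delayed_work() at init time */
	struct delayed_work gc_dwork;
	struct example_queue_pool *queue_pools;
	int num_queues;
};

static void example_pool_gc_work(struct work_struct *work)
{
	struct example_ctx *ctx =
		container_of(work, struct example_ctx, gc_dwork.work);
	int i;

	for (i = 0; i < ctx->num_queues; i++) {
		struct example_queue_pool *pool = &ctx->queue_pools[i];

		spin_lock(&pool->lock);
		if (time_after(jiffies, pool->last_used + EXAMPLE_GC_PERIOD)) {
			/* free unused action STE tables / RTCs here */
		}
		spin_unlock(&pool->lock);
	}

	schedule_delayed_work(&ctx->gc_dwork, EXAMPLE_GC_PERIOD);
}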
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/1744312662-356571-12-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Remove the matcher action STE implementation now that the code uses
per-queue action STE pools. This also allows simplifying matcher code
because it is now only handling a single type of RTC/STE.
The matcher resize data is also going away. Matchers were saving old
action STE data because the rules still used it, but now that data lives
in the action STE pool and is no longer coupled to a matcher.
Furthermore, matchers no longer need to rehash due to action template
addition. If a new action template needs more action STEs, we simply
update the matcher's num_of_action_stes and future rules will allocate
the correct number. Existing rules are unaffected by such an operation
and can continue to use their existing action STEs.
The range action was using the matcher action STE implementation, but
there was no reason to do this other than the container fitting the
purpose. Extract that information to a separate structure.
Finally, stop dumping per-matcher information about action RTCs,
because they no longer exist. A later patch in this series will add
support for dumping action STE pools.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/1744312662-356571-11-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use the central action STE pool when creating / updating rules.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/1744312662-356571-10-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Implement a per-queue pool of action STEs that match STEs can link to,
regardless of matcher.
The code relies on hints to optimize whether a given rule is added to
rx-only, tx-only or both. Correspondingly, action STEs need to be added
to different RTCs for the ingress and egress paths. For rx-and-tx rules, the
current rule implementation dictates that the offsets for a given rule
must be the same in both RTCs.
To avoid wasting STEs, each per-queue action STE pool holds 3 elements:
rx-only, tx-only, and rx-and-tx, corresponding to the possible values of
the pool optimization enum. The implementation then chooses at rule
creation / update which of these elements to allocate from.
Each element holds multiple action STE tables, which wrap an RTC, an STE
range, the logic to buddy-allocate offsets from the range, and an STC
that allows match STEs to point to this table. When allocating offsets
from an element, we iterate through available action STE tables and, if
needed, create a new table.
Similar to the previous implementation, this iteration does not free any
resources. This is implemented in a subsequent patch.
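To make the shape of this more concrete, here is a purely illustrative
sketch of such a layout; every name below is hypothetical and the real
mlx5hws structures differ:
#include <linux/list.h>
#include <linux/types.h>

enum example_pool_opt {
	EXAMPLE_POOL_RX_ONLY,
	EXAMPLE_POOL_TX_ONLY,
	EXAMPLE_POOL_RX_AND_TX,
	EXAMPLE_POOL_OPT_MAX,
};

/* One table wraps an RTC, an STE range, the buddy allocator for offsets
 * within that range, and an STC so match STEs can point at it. */
struct example_action_ste_table {
	struct list_head list_node;	/* linked into the owning element */
	u32 rtc_rx_id;
	u32 rtc_tx_id;
	u32 stc_id;
	u32 log_num_stes;		/* tables grow progressively larger */
	/* buddy allocator state for offsets within the STE range */
};

/* One element per optimization value; each element grows a list of tables. */
struct example_action_ste_element {
	struct list_head tables;
};

/* One pool per queue, to avoid lock contention between queues. */
struct example_action_ste_pool {
	struct example_action_ste_element elems[EXAMPLE_POOL_OPT_MAX];
	unsigned long last_used;	/* consumed by the cleanup patch later on */
};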
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/1744312662-356571-9-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The optimization to create a size-one STE range for the unused direction
was broken. The hardware prevents us from creating RTCs over unallocated
STE space, so the only reason this has worked so far is because the
optimization was never used.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/1744312662-356571-8-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Future users will need to query whether a pool is empty.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/1744312662-356571-7-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Remove members which are no longer used. In fact, many of the
`struct mlx5hws_pool_chunk` members were never written to beyond being
initialized, yet they were used in various internals.
Also clean up some local variables which made more sense when the API was
thicker.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/1744312662-356571-6-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Refactor the pool implementation to remove unused flags and clarify its
usage. A pool represents a single range of STEs or STCs which are
allocated at pool creation time.
Pools are used under three patterns:
1. STCs are allocated one at a time from a global pool using a bitmap
based implementation.
2. Action STEs are allocated in power-of-two blocks using a buddy
algorithm.
3. Match STEs do not use allocation, since insertion into these tables
is based on hashes or direct addressing. In such cases we use a pool
only to create the STE range.
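As a rough illustration of pattern 1 above (patterns 2 and 3 are driver
specific), a one-at-a-time bitmap allocation might look like the sketch
below; the example_* names are made up and this is not the mlx5hws code:
#include <linux/bitmap.h>
#include <linux/errno.h>

/* Allocate a single element from a range of 'size' slots tracked by a
 * bitmap; returns the offset within the range, or -ENOSPC if full. */
static int example_pool_alloc_one(unsigned long *bitmap, unsigned int size)
{
	unsigned int idx;

	idx = find_first_zero_bit(bitmap, size);
	if (idx >= size)
		return -ENOSPC;

	__set_bit(idx, bitmap);
	return idx;
}

static void example_pool_free_one(unsigned long *bitmap, unsigned int idx)
{
	__clear_bit(idx, bitmap);
}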
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/1744312662-356571-5-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The pool implementation claimed to support multiple resources, but this
does not really make sense in context. Callers always allocate a single
STC or STE chunk of exactly the size provided.
The code that handled multiple resources was unused (and likely buggy)
due to the combination of flags passed by callers.
Simplify the pool by having it handle a single resource. As a result of
this simplification, chunks no longer contain a resource offset (there
is now only one resource per pool), and the get_base_id functions no
longer take a chunk parameter.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/1744312662-356571-4-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Remove the array of elements wrapped in a struct because in reality only
the first element was ever used.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/1744312662-356571-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The procedure of attaching an action template to an existing matcher had
a few issues:
1. Attaching accidentally overran the `at` array in bwc_matcher, which
would result in memory corruption. This bug wasn't triggered, but it
is possible to trigger it by attaching action templates beyond the
initial buffer size of 8. Fix this by converting to a dynamically
sized buffer and reallocating if needed.
2. Similarly, the `at` array inside the native matcher was never
reallocated. Fix this the same as above.
3. The bwc layer treated any error in action template attach as a signal
that the matcher should be rehashed to account for a larger number of
action STEs. In reality, there are other unrelated errors that can
arise and they should be propagated upstack. Fix this by adding a
`need_rehash` output parameter that's orthogonal to error codes.
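A minimal sketch of the pattern described in points 1 and 3 above, using
hypothetical example_* names rather than the real mlx5hws types:
#include <linux/slab.h>
#include <linux/types.h>

struct example_at {
	u8 num_action_stes;
};

struct example_matcher {
	struct example_at **at;		/* dynamically sized, no fixed cap of 8 */
	size_t num_at;
	size_t size_at;
	u8 num_action_stes;
};

static int example_attach_at(struct example_matcher *m, struct example_at *at,
			     bool *need_rehash)
{
	*need_rehash = false;

	if (m->num_at == m->size_at) {
		size_t new_sz = m->size_at ? m->size_at * 2 : 8;
		struct example_at **tmp;

		tmp = krealloc_array(m->at, new_sz, sizeof(*m->at), GFP_KERNEL);
		if (!tmp)
			return -ENOMEM;	/* a plain error, not a rehash request */
		m->at = tmp;
		m->size_at = new_sz;
	}

	m->at[m->num_at++] = at;

	/* Signal a rehash only when the new template needs more action STEs;
	 * unrelated errors above propagate as ordinary error codes. */
	if (at->num_action_stes > m->num_action_stes)
		*need_rehash = true;

	return 0;
}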
Fixes: 2111bb970c78 ("net/mlx5: HWS, added backward-compatible API handling")
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/1744312662-356571-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Implement Enhanced Transmission Selection scheduler (ETS) support for
KSZ88x3 devices, which support two fixed egress scheduling modes:
Strict Priority and Weighted Fair Queuing (WFQ).
Since the switch does not allow remapping priorities to queues or
adjusting weights, this implementation only supports enabling
strict priority mode. If strict mode is not explicitly requested,
the switch falls back to its default WFQ mode.
This patch introduces KSZ88x3-specific handlers for ETS add and
delete operations and uses TXQ Split Control registers to toggle
the WFQ enable bit per queue. Corresponding macros are also added
for register access.
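As a hedged usage example (the port name and the four-queue assumption
below are illustrative, not taken from this patch), strict priority
scheduling could be requested from user space with the ets qdisc:
$ tc qdisc replace dev lan1 root ets bands 4 strict 4
Deleting the qdisc, or not requesting strict mode, would presumably leave
the switch in the default WFQ mode described above.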
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250410124249.2728568-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Russell King says:
====================
net: stmmac: remove unnecessary initialisation of 1µs TIC counter
In commit 8efbdbfa9938 ("net: stmmac: Initialize MAC_ONEUS_TIC_COUNTER
register"), code to initialise the LPI 1us counter in dwmac4's
initialisation was added, making the initialisation in glue drivers
unnecessary. This series cleans up the now redundant initialisation.
====================
Link: https://patch.msgid.link/Z_oe0U5E0i3uZbop@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
GMAC_1US_TIC_COUNTER is no longer used, so remove its definition.
This was duplicated by GMAC4_MAC_ONEUS_TIC_COUNTER further down in the
same file.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1u3Vv0-000E87-DQ@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
plat_dat->eee_usecs_rate is now unused, so remove this member.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1u3Vuv-000E7y-9k@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Remove the write to GMAC_1US_TIC_COUNTER for two reasons:
1. during initialisation or reinitialisation of the DWMAC core, the
core is reset, which sets this register back to its default value.
Writing it prior to stmmac_dvr_probe() has no effect.
2. Since commit 8efbdbfa9938 ("net: stmmac: Initialize
MAC_ONEUS_TIC_COUNTER register"), GMAC4/5 core code will set
this register based on the rate of plat->stmmac_clk. This clock
is fetched by devm_stmmac_probe_config_dt(), and plat->clk_ptp_rate
will be set to its rate provided a "ptp_ref" clock is not provided.
In any case, Marek's commit will set the effectual value of this
register.
Therefore, dwmac-intel-plat.c writing GMAC_1US_TIC_COUNTER serves no
useful purpose and can be removed.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1u3Vuq-000E7s-5Y@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Remove the write to GMAC_1US_TIC_COUNTER for two reasons:
1. during initialisation or reinitialisation of the DWMAC core, the
core is reset, which sets this register back to its default value.
Writing it prior to stmmac_dvr_probe() has no effect.
2. Since commit 8efbdbfa9938 ("net: stmmac: Initialize
MAC_ONEUS_TIC_COUNTER register"), GMAC4/5 core code will set
this register based on the rate of plat->stmmac_clk. This clock
is created by the same code which initialises plat->eee_usecs_rate,
and is created to run at that same rate. Since Marek's
commit, the core code sets this register appropriately using the
rate of this clock.
Therefore, dwmac-intel.c writing GMAC_1US_TIC_COUNTER serves no
useful purpose and can be removed.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1u3Vul-000E7m-1j@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
tegra_eqos_init() initialises the 1US TIC counter for the EEE timers.
However, the DWGMAC core is reset after this write, which clears
this register to its default value.
dwmac4_core_init() then configures this register using the same
clock after the reset - thus that is the write which
ensures the register is correctly configured.
Therefore, tegra_eqos_init() is not required and is removed. This also
means eqos->clk_slave can also be removed.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1u3Vuf-000E7g-U4@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Kuniyuki Iwashima says:
====================
net: Convert ->exit_batch_rtnl() to ->exit_rtnl().
While converting nexthop to per-netns RTNL, there are two blockers
to using rtnl_net_dereference(), flush_all_nexthops() and
__unregister_nexthop_notifier(), both of which are called from
->exit_batch_rtnl().
Instead of spreading __rtnl_net_lock() over each ->exit_batch_rtnl(),
we should convert all ->exit_batch_rtnl() to per-net ->exit_rtnl() and
run it under __rtnl_net_lock(), because none of the ->exit_batch_rtnl()
functions have anything to factor out for batching.
Patches 1 & 2 factorise the rollback of ->init() into a single
function, and patch 3 adds ->exit_rtnl().
Patches 4 ~ 13 convert all ->exit_batch_rtnl() users.
Patch 14 removes ->exit_batch_rtnl().
Later, we can convert pfcp and ppp to use ->exit_rtnl().
v1: https://lore.kernel.org/all/20250410022004.8668-1-kuniyu@amazon.com/
====================
Link: https://patch.msgid.link/20250411205258.63164-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
There are no ->exit_batch_rtnl() users remaining.
Let's remove the hook.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-15-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
geneve_exit_batch_rtnl() iterates the dying netns list and
performs the same operation for each.
Let's use ->exit_rtnl().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-14-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
bareudp_exit_batch_rtnl() iterates the dying netns list and performs the
same operation for each.
Let's use ->exit_rtnl().
While at it, we replace unregister_netdevice_queue() with
bareudp_dellink() for better cleanup. It unlinks the device
from net_generic(net, bareudp_net_id)->bareudp_list, but there
is no real issue as both the dev and the list are freed later.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-13-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
gtp_net_exit_batch_rtnl() iterates the dying netns list
and performs the same operations for each.
Let's use ->exit_rtnl().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-12-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
bond_net_exit_batch_rtnl() iterates the dying netns list and
performs the same operation for each.
Let's use ->exit_rtnl().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-11-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
br_net_exit_batch_rtnl() iterates the dying netns list and
performs the same operation for each.
Let's use ->exit_rtnl().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-10-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
xfrmi_exit_batch_rtnl() iterates the dying netns list and
performs the same operations for each.
Let's use ->exit_rtnl().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-9-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The following functions iterate the dying netns list and perform
the same operations for each netns.
* ip6gre_exit_batch_rtnl()
* ip6_tnl_exit_batch_rtnl()
* vti6_exit_batch_rtnl()
* sit_exit_batch_rtnl()
Let's use ->exit_rtnl().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-8-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
ip_tunnel_delete_nets() iterates the dying netns list and performs the
same operations for each.
Let's export ip_tunnel_destroy() as ip_tunnel_delete_net() and call it
from ->exit_rtnl().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-7-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
vxlan_exit_batch_rtnl() iterates the dying netns list and
performs the same operations for each.
Let's use ->exit_rtnl().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-6-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
nexthop_net_exit_batch_rtnl() iterates the dying netns list and
performs the same operation for each.
Let's use ->exit_rtnl().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
struct pernet_operations provides two batching hooks: ->exit_batch()
and ->exit_batch_rtnl().
The batching variant is beneficial if ->exit() meets any of the
following conditions:
1) ->exit() repeatedly acquires a global lock for each netns
2) ->exit() has a time-consuming operation that can be factored
out (e.g. synchronize_rcu(), smp_mb(), etc)
3) ->exit() does not need to repeat the same iterations for each
netns (e.g. inet_twsk_purge())
Currently, none of the ->exit_batch_rtnl() functions satisfy any of
the above conditions because RTNL is factored out and held by the
caller and all of these functions iterate over the dying netns list.
Also, we want to hold per-netns RTNL there but avoid spreading
__rtnl_net_lock() across multiple locations.
Let's add ->exit_rtnl() hook and run it under __rtnl_net_lock().
The following patches will convert all ->exit_batch_rtnl() users
to ->exit_rtnl().
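For illustration, a converted user could look roughly like the sketch
below; the example_* names are hypothetical, and the hook signature shown
(one struct net plus a device kill list) follows the description in this
series:
#include <linux/netdevice.h>
#include <net/net_namespace.h>
#include <net/netns/generic.h>

static unsigned int example_net_id;

struct example_net {
	struct list_head dev_list;	/* initialised by .init (not shown) */
};

struct example_dev {
	struct net_device *dev;
	struct list_head list;
};

/* Called once per dying netns, with (per-netns) RTNL already held. */
static void example_exit_rtnl(struct net *net, struct list_head *dev_kill_list)
{
	struct example_net *en = net_generic(net, example_net_id);
	struct example_dev *ed, *tmp;

	list_for_each_entry_safe(ed, tmp, &en->dev_list, list)
		unregister_netdevice_queue(ed->dev, dev_kill_list);
}

static struct pernet_operations example_net_ops = {
	.exit_rtnl = example_exit_rtnl,	/* replaces .exit_batch_rtnl */
	.id	   = &example_net_id,
	.size	   = sizeof(struct example_net),
};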
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
If ops_init() fails while loading a module or we unload the
module, free_exit_list() rolls back the changes.
The rollback sequence is the same as ops_undo_list().
The ops is already removed from pernet_list before calling
free_exit_list(). If we link the ops to a temporary list,
we can reuse ops_undo_list().
Let's add a wrapper of ops_undo_list() and use it instead
of free_exit_list().
Now, we have the central place to roll back ops_init().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When we roll back the changes made by struct pernet_operations.init(),
we execute mostly identical sequences in three places.
* setup_net()
* cleanup_net()
* free_exit_list()
The only difference between the first two is which list and RCU helpers
to use.
In setup_net(), an ops could fail on the way, so we need to perform a
reverse walk from its previous ops in pernet_list. OTOH, in cleanup_net(),
we iterate the full list from tail to head.
The former passes the failed ops to list_for_each_entry_continue_reverse().
It's tricky, but we can reuse it for the latter if we pass list_entry() of
the head node.
Also, synchronize_rcu() and synchronize_rcu_expedited() can be easily
switched by an argument.
Let's factorise the rollback part in setup_net() and cleanup_net().
In the next patch, ops_undo_list() will be reused for free_exit_list(),
and then two arguments (ops_list and hold_rtnl) will differ.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit 0e9c127729be ("ethtool: add interface to read Tx hardware
timestamping statistics") added documentation for timestamping
statistics, but added the detailed explanation for this method to
the get_ts_info() rather than get_ts_stats(). Move it to the correct
entry.
Cc: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1u3MTz-000Crx-IW@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Toke Høiland-Jørgensen says:
====================
Fix late DMA unmap crash for page pool
This series fixes the late dma_unmap crash for page pool first reported
by Yonglong Liu in [0]. It is an alternative approach to the one
submitted by Yunsheng Lin, most recently in [1]. The first commit just
wraps some tests in a helper function, in preparation of the main change
in patch 2. See the commit message of patch 2 for the details.
[0] https://lore.kernel.org/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org
[1] https://lore.kernel.org/20250307092356.638242-1-linyunsheng@huawei.com
v8: https://lore.kernel.org/20250407-page-pool-track-dma-v8-0-da9500d4ba21@redhat.com
v7: https://lore.kernel.org/20250404-page-pool-track-dma-v7-0-ad34f069bc18@redhat.com
v6: https://lore.kernel.org/20250401-page-pool-track-dma-v6-0-8b83474870d4@redhat.com
v5: https://lore.kernel.org/20250328-page-pool-track-dma-v5-0-55002af683ad@redhat.com
v4: https://lore.kernel.org/20250327-page-pool-track-dma-v4-0-b380dc6706d0@redhat.com
v3: https://lore.kernel.org/20250326-page-pool-track-dma-v3-0-8e464016e0ac@redhat.com
v2: https://lore.kernel.org/20250325-page-pool-track-dma-v2-0-113ebc1946f3@redhat.com
v1: https://lore.kernel.org/20250314-page-pool-track-dma-v1-0-c212e57a74c2@redhat.com
====================
Link: https://patch.msgid.link/20250409-page-pool-track-dma-v9-0-6a9ef2e0cba8@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When enabling DMA mapping in page_pool, pages are kept DMA mapped until
they are released from the pool, to avoid the overhead of re-mapping the
pages every time they are used. This causes resource leaks and/or
crashes when there are pages still outstanding while the device is torn
down, because page_pool will attempt an unmap through a non-existent DMA
device on the subsequent page return.
To fix this, implement a simple tracking of outstanding DMA-mapped pages
in page pool using an xarray. This was first suggested by Mina[0], and
turns out to be fairly straightforward: we simply store pointers to
pages directly in the xarray with xa_alloc() when they are first DMA
mapped, and remove them from the array on unmap. Then, when a page pool
is torn down, it can simply walk the xarray and unmap all pages still
present there before returning, which also allows us to get rid of the
get/put_device() calls in page_pool. Using xa_cmpxchg(), no additional
synchronisation is needed, as a page will only ever be unmapped once.
To avoid having to walk the entire xarray on unmap to find the page
reference, we stash the ID assigned by xa_alloc() into the page
structure itself, using the upper bits of the pp_magic field. This
requires a couple of defines to avoid conflicting with the
POINTER_POISON_DELTA define, but this is all evaluated at compile-time,
so does not affect run-time performance. The bitmap calculations in this
patch give the following number of bits for different architectures:
- 23 bits on 32-bit architectures
- 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
- 32 bits on other 64-bit architectures
Stashing a value into the unused bits of pp_magic does have the effect
that it can make the value stored there lie outside the unmappable
range (as governed by the mmap_min_addr sysctl), for architectures that
don't define ILLEGAL_POINTER_VALUE. This means that if one of the
pointers that is aliased to the pp_magic field (such as page->lru.next)
is dereferenced while the page is owned by page_pool, that could lead to
a dereference into userspace, which is a security concern. The risk of
this is mitigated by the fact that (a) we always clear pp_magic before
releasing a page from page_pool, and (b) this would need a
use-after-free bug for struct page, which can have many other risks
since page->lru.next is used as a generic list pointer in multiple
places in the kernel. As such, with this patch we take the position that
this risk is negligible in practice. For more discussion, see[1].
Since all the tracking added in this patch is performed on DMA
map/unmap, no additional code is needed in the fast path, meaning the
performance overhead of this tracking is negligible there. A
micro-benchmark shows that the total overhead of the tracking itself is
about 400 ns (39 cycles(tsc) 395.218 ns; sum for both map and unmap[2]).
Since this cost is only paid on DMA map and unmap, it seems like an
acceptable cost to fix the late unmap issue. Further optimisation can
narrow the cases where this cost is paid (for instance by eliding the
tracking when DMA map/unmap is a no-op).
The extra memory needed to track the pages is neatly encapsulated inside
xarray, which uses the 'struct xa_node' structure to track items. This
structure is 576 bytes long, with slots for 64 items, meaning that a
full node incurs only 9 bytes of overhead per slot it tracks (in
practice, it probably won't be this efficient, but in any case it should
be an acceptable overhead).
[0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/
[1] https://lore.kernel.org/r/20250320023202.GA25514@openwall.com
[2] https://lore.kernel.org/r/ae07144c-9295-4c9d-a400-153bb689fe9e@huawei.com
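The core of the tracking could be sketched as follows; this is a
simplified illustration with made-up pp_track_* names rather than the
patch itself, and it leaves out stashing the id in page->pp_magic
described above (here the id is simply passed around):
#include <linux/mm.h>
#include <linux/xarray.h>

struct pp_dma_tracker {
	struct xarray dma_mapped;	/* id -> DMA-mapped struct page */
};

static void pp_tracker_init(struct pp_dma_tracker *t)
{
	xa_init_flags(&t->dma_mapped, XA_FLAGS_ALLOC);
}

/* Called when a page is first DMA mapped; the real patch stores the
 * returned id in spare bits of page->pp_magic so unmap can find it. */
static int pp_track_dma_map(struct pp_dma_tracker *t, struct page *page,
			    u32 *id)
{
	return xa_alloc(&t->dma_mapped, id, page, XA_LIMIT(1, U32_MAX),
			GFP_ATOMIC);
}

/* Called on unmap; xa_cmpxchg() means no extra locking is needed, as a
 * given page is only ever unmapped once. */
static void pp_track_dma_unmap(struct pp_dma_tracker *t, struct page *page,
			       u32 id)
{
	xa_cmpxchg(&t->dma_mapped, id, page, NULL, GFP_ATOMIC);
}

/* On pool teardown, unmap whatever is still outstanding. */
static void pp_track_destroy(struct pp_dma_tracker *t)
{
	struct page *page;
	unsigned long id;

	xa_for_each(&t->dma_mapped, id, page) {
		/* dma_unmap_page_attrs(...) would go here */
		xa_erase(&t->dma_mapped, id);
	}
	xa_destroy(&t->dma_mapped);
}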
Reported-by: Yonglong Liu <liuyonglong@huawei.com>
Closes: https://lore.kernel.org/r/8743264a-9700-4227-a556-5f931c720211@huawei.com
Fixes: ff7d6b27f894 ("page_pool: refurbish version of page_pool code")
Suggested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Qiuling Ren <qren@redhat.com>
Tested-by: Yuying Ma <yuma@redhat.com>
Tested-by: Yonglong Liu <liuyonglong@huawei.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250409-page-pool-track-dma-v9-2-6a9ef2e0cba8@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Since we are about to stash some more information into the pp_magic
field, let's move the magic signature checks into a pair of helper
functions so it can be changed in one place.
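As an illustration of the kind of helper meant here (the helper name and
mask below are assumptions based on the historical open-coded check, not
necessarily the names used by the patch):
#include <linux/mm.h>
#include <linux/poison.h>	/* PP_SIGNATURE */

/* The check historically open-coded in several places ignores the low
 * two bits of pp_magic; wrapping it like this lets the mask change in
 * one spot once extra bits are stashed in pp_magic. */
static inline bool example_page_is_pp(const struct page *page)
{
	return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
}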
Reviewed-by: Mina Almasry <almasrymina@google.com>
Tested-by: Yonglong Liu <liuyonglong@huawei.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250409-page-pool-track-dma-v9-1-6a9ef2e0cba8@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
It's not safe to access nla_len(ovs_key) if the data is smaller than
the netlink header. Check that the attribute is OK first.
Fixes: ccb1352e76cf ("net: Add Open vSwitch kernel components.")
Reported-by: syzbot+b07a9da40df1576b8048@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b07a9da40df1576b8048
Tested-by: syzbot+b07a9da40df1576b8048@syzkaller.appspotmail.com
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Reviewed-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20250412104052.2073688-1-i.maximets@ovn.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2025-04-11 (ice, i40e, ixgbe, igc, e1000e)
For ice:
Mateusz and Larysa add support for LLDP packets to be received on a VF
and transmitted by a VF in switchdev mode. Additional information:
https://lore.kernel.org/intel-wired-lan/20250214085215.2846063-1-larysa.zaremba@intel.com/
Karol adds timesync support for E825C devices using 2xNAC (Network
Acceleration Complex) configuration. 2xNAC mode is the mode in which
the IO die houses two complexes, each with its own PHY
connected to it.
Martyna adds messaging to clarify filter errors when recipe space is
exhausted.
Colin Ian King adds static modifier to a const array to avoid stack
usage.
For i40e:
Kyungwook Boo changes variable declaration types to prevent possible
underflow.
For ixgbe:
Rand Deeb adjusts retry values so that retries are attempted.
For igc:
Rui Salvaterra sets VLAN offloads to be enabled as default.
For e1000e:
Piotr Wejman converts driver to use newer hardware timestamping API.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
net: e1000e: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()
igc: enable HW vlan tag insertion/stripping by default
ixgbe: Fix unreachable retry logic in combined and byte I2C write functions
i40e: fix MMIO write access to an invalid page in i40e_clear_hw
ice: make const read-only array dflt_rules static
ice: improve error message for insufficient filter space
ice: enable timesync operation on 2xNAC E825 devices
ice: refactor ice_sbq_msg_dev enum
ice: remove SW side band access workaround for E825
ice: enable LLDP TX for VFs through tc
ice: support egress drop rules on PF
ice: remove headers argument from ice_tc_count_lkups
ice: receive LLDP on trusted VFs
ice: do not add LLDP-specific filter if not necessary
ice: fix check for existing switch rule
====================
Link: https://patch.msgid.link/20250411204401.3271306-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
igc: Fix PTM timeout
Christopher S M Hall says:
There have been sporadic reports of PTM timeouts using i225/i226 devices
These timeouts have been root caused to:
1) Manipulating the PTM status register while PTM is enabled
and triggered
2) The hardware retrying too quickly when an inappropriate response
is received from the upstream device
The issue can be reproduced with the following:
$ sudo phc2sys -R 1000 -O 0 -i tsn0 -m
Note: 1000 Hz (-R 1000) is unrealistically large, but provides a way to
quickly reproduce the issue.
phc2sys exits with
"ioctl PTP_OFFSET_PRECISE: Connection timed out" when the PTM transaction
fails.
The first patch in this series also resolves an issue reported by Corinna
Vinschen relating to kdump:
This patch also fixes a hang in igc_probe() when loading the igc
driver in the kdump kernel on systems supporting PTM.
The igc driver running in the base kernel enables PTM trigger in
igc_probe(). Therefore the driver is always in PTM trigger mode,
except in brief periods when manually triggering a PTM cycle.
When a crash occurs, the NIC is reset while PTM trigger is enabled.
Due to a hardware problem, the NIC is subsequently in a bad busmaster
state and doesn't handle register reads/writes. When running
igc_probe() in the kdump kernel, the first register access to a NIC
register hangs driver probing and ultimately breaks kdump.
With this patch, igc has PTM trigger disabled most of the time,
and the trigger is only enabled for very brief (10 - 100 us) periods
when manually triggering a PTM cycle. Chances that a crash occurs
during a PTM trigger are not zero, but extremely reduced.
* '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
igc: add lock preventing multiple simultaneous PTM transactions
igc: cleanup PTP module if probe fails
igc: handle the IGC_PTP_ENABLED flag correctly
igc: move ktime snapshot into PTM retry loop
igc: increase wait time before retrying PTM
igc: fix PTM cycle trigger logic
====================
Link: https://patch.msgid.link/20250411162857.2754883-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Siddharth Vadapalli says:
====================
CPSW Bindings for 5000M Fixed-Link
This series adds 5000M as a valid speed for fixed-link mode of operation
and also updates the CPSW bindings to evaluate fixed-link property. This
series is in the context of the following device-tree overlay which
enables USXGMII 5000M Fixed-link mode of operation with CPSW on TI's
J784S4 SoC:
https://github.com/torvalds/linux/blob/v6.15-rc1/arch/arm64/boot/dts/ti/k3-j784s4-evm-usxgmii-exp1-exp2.dtso
====================
Link: https://patch.msgid.link/20250411060917.633769-1-s-vadapalli@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Since the fixed-link (phyless) mode of operation is supported by the
CPSW MAC, include "fixed-link" in the set of properties to be evaluated.
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Link: https://patch.msgid.link/20250411060917.633769-3-s-vadapalli@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
A link speed of 5000 Mbps is a valid speed for a fixed-link mode of
operation. Hence, update the bindings to include the same.
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250411060917.633769-2-s-vadapalli@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Joseph Huang says:
====================
Add support for mdb offload failure notification
Currently the bridge does not provide real-time feedback to user space
on whether or not an attempt to offload an mdb entry was successful.
This patch set adds support to notify user space about failed offload
attempts, and is controlled by a new knob mdb_offload_fail_notification.
A break-down of the patches in the series:
Patch 1 adds offload failed flag to indicate that the offload attempt
has failed. The flag is reflected in netlink mdb entry flags.
Patch 2 adds the new bridge bool option mdb_offload_fail_notification.
Patch 3 notifies user space when the result is known, controlled by
mdb_offload_fail_notification setting.
====================
Link: https://patch.msgid.link/20250411150323.1117797-1-Joseph.Huang@garmin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Notify user space on mdb offload failure if
mdb_offload_fail_notification is enabled.
Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250411150323.1117797-4-Joseph.Huang@garmin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add BR_BOOLOPT_MDB_OFFLOAD_FAIL_NOTIFICATION bool option.
Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250411150323.1117797-3-Joseph.Huang@garmin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add MDB_FLAGS_OFFLOAD_FAILED and MDB_PG_FLAGS_OFFLOAD_FAILED to indicate
that an attempt to offload the MDB entry to switchdev has failed.
Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250411150323.1117797-2-Joseph.Huang@garmin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit 8fa7292fee5c ("treewide: Switch/rename to timer_delete[_sync]()")
switched del_timer_sync to timer_delete_sync, but did not modify the
comment for bnad_dim_timeout(). Now fix it.
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/61DDCE7AB5B6CE82+20250411101736.160981-1-wangyuli@uniontech.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|