|
fib_nl_newrule() / fib_nl_delrule() look up struct fib_rules_ops
in sock_net(skb->sk) and call rule_exists() / rule_find() respectively.
fib_nl_newrule() creates a new rule and links it to the found ops, so
a struct fib_rule never belongs to a different netns's ops->rules_list.
Let's remove the redundant netns check in rule_exists() and rule_find().
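For illustration, the kind of per-rule netns test that becomes redundant
looks like this (a sketch, not the verbatim diff; the other match
conditions are elided):
    /* ops was looked up in sock_net(skb->sk), so every rule on
     * ops->rules_list is already in that netns and this test can
     * never fail:
     */
    list_for_each_entry(rule, &ops->rules_list, list) {
        if (!net_eq(rule->fr_net, net))   /* redundant, removed */
            continue;
        /* ... remaining match conditions ... */
    }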
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250207072502.87775-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Akihiko Odaki says:
====================
tun: Unify vnet implementation
When I implemented virtio's hash-related features for tun/tap [1],
I found that tun/tap does not fill the entire region reserved for the
virtio header, leaving an uninitialized hole in the middle of the buffer
after read()/recvmsg().
This series fills the uninitialized hole. More concretely, the
num_buffers field will be initialized with 1, and the other fields will
be initialized with 0. Setting the num_buffers field to 1 is mandated by
virtio 1.0 [2].
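A sketch of the resulting fill, assuming the mergeable-rxbuf header
layout from <uapi/linux/virtio_net.h>:
    struct virtio_net_hdr_mrg_rxbuf hdr;

    memset(&hdr, 0, sizeof(hdr));    /* zero-fill: no uninitialized hole */
    /* num_buffers = 1 is mandated by virtio 1.0 [2] */
    hdr.num_buffers = __cpu_to_virtio16(true, 1);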
The change to virtio header is preceded by another change that refactors
tun and tap to unify their virtio-related code.
[1]: https://lore.kernel.org/r/20241008-rss-v5-0-f3cf68df005d@daynix.com
[2]: https://lore.kernel.org/r/20241227084256-mutt-send-email-mst@kernel.org/
====================
Link: https://patch.msgid.link/20250207-tun-v6-0-fb49cf8b103e@daynix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
tun and tap implement the same vnet-related features, so reuse the code.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250207-tun-v6-7-fb49cf8b103e@daynix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
hdr_len is used repeatedly, so keep it in a local variable.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250207-tun-v6-6-fb49cf8b103e@daynix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The vnet handling code will be reused by tap.
Functions are renamed to ensure that their names contain "vnet" to
clarify that they are part of the decoupled vnet handling code.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250207-tun-v6-5-fb49cf8b103e@daynix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Decouple the vnet handling code so that we can reuse it for tap.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250207-tun-v6-4-fb49cf8b103e@daynix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Decouple vnet-related functions from tun_struct so that we can reuse
them for tap in the future.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250207-tun-v6-3-fb49cf8b103e@daynix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
hdr_len is used repeatedly, so keep it in a local variable.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250207-tun-v6-2-fb49cf8b103e@daynix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Check IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) to save some lines and make
future changes easier.
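Roughly, the #ifdef pair collapses into a plain C condition; a sketch
(close to, but not necessarily, the exact patched helper):
    static inline bool tun_legacy_is_little_endian(unsigned int flags)
    {
        /* IS_ENABLED() is a compile-time constant, so the whole
         * expression folds away when CONFIG_TUN_VNET_CROSS_LE is off.
         */
        bool be = IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) &&
                  (flags & TUN_VNET_BE);

        return !be && virtio_legacy_is_little_endian();
    }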
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250207-tun-v6-1-fb49cf8b103e@daynix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Sean Anderson says:
====================
net: xilinx: axienet: Enable adaptive IRQ coalescing with DIM
To improve performance without sacrificing latency under low load,
enable DIM. While I appreciate not having to write the library myself, I
do think there are many unusual aspects to DIM, as detailed in the last
patch.
====================
Link: https://patch.msgid.link/20250206201036.1516800-1-sean.anderson@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The default RX IRQ coalescing settings of one IRQ per packet can represent
a significant CPU load. However, increasing the coalescing unilaterally
can result in undesirable latency under low load. Adaptive IRQ
coalescing with DIM offers a way to adjust the coalescing settings based
on load.
This device only supports "CQE" mode [1], where each packet resets the
timer. Therefore, an interrupt is fired either when we receive
coalesce_count_rx packets or when the interface is idle for
coalesce_usec_rx. With this in mind, consider the following scenarios:
Link saturated
Here we want to set coalesce_count_rx to a large value, in order to
coalesce more packets and reduce CPU load. coalesce_usec_rx should
be set to at least the time for one packet. Otherwise the link will
be "idle" and we will get an interrupt for each packet anyway.
Bursts of packets
Each burst should be coalesced into a single interrupt, although it
may be prudent to reduce coalesce_count_rx for better latency.
coalesce_usec_rx should be set to at least the time for one packet
so bursts are coalesced. However, additional time beyond the packet
time will just increase latency at the end of a burst.
Sporadic packets
Due to low load, we can set coalesce_count_rx to 1 in order to
reduce latency to the minimum. coalesce_usec_rx does not matter in
this case.
Based on this analysis, I expected the CQE profiles to look something
like
usec = 0, pkts = 1 // Low load
usec = 16, pkts = 4
usec = 16, pkts = 16
usec = 16, pkts = 64
usec = 16, pkts = 256 // High load
Where usec is set to 16 to be a few us greater than the 12.3 us packet
time of a 1500 MTU packet at 1 GBit/s. However, the CQE profile is
instead
usec = 2, pkts = 256 // Low load
usec = 8, pkts = 128
usec = 16, pkts = 64
usec = 32, pkts = 64
usec = 64, pkts = 64 // High load
I found this very surprising. The number of coalesced packets
*decreases* as load increases. But as load increases we have more
opportunities to coalesce packets without affecting latency as much.
Additionally, the profile *increases* the usec as the load increases.
But as load increases, the gaps between packets will tend to become
smaller, making it possible to *decrease* usec for better latency at the
end of a "burst".
I consider the default CQE profile unsuitable for this NIC. Therefore,
we use the first profile outlined in this commit instead.
coalesce_usec_rx is set to 16 by default, but the user can customize it.
This may be necessary if they are using jumbo frames. I think adjusting
the profile times based on the link speed/MTU would be a good improvement
for generic DIM.
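For reference, the preferred profile outlined above could be expressed as
a struct dim_cq_moder table in the style of <linux/dim.h> (array name
hypothetical; not the in-tree table):
    /* {usec, pkts, comps, cq_period_mode} per struct dim_cq_moder */
    static const struct dim_cq_moder rx_cqe_profile[] = {
        {  0,   1, 0, 0 },    /* low load: minimum latency */
        { 16,   4, 0, 0 },
        { 16,  16, 0, 0 },
        { 16,  64, 0, 0 },
        { 16, 256, 0, 0 },    /* high load: maximum coalescing */
    };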
In addition to the above profile problems, I noticed the following
additional issues with DIM while testing:
- DIM tends to "wander" when at low load, since the performance gradient
is pretty flat. If you only have 10p/ms anyway then adjusting the
coalescing settings will not affect throughput very much.
- DIM takes a long time to adjust back to low indices when load is
decreased following a period of high load. This is because it only
re-evaluates its settings once every 64 interrupts. However, at low
load, 64 interrupts can take several seconds to arrive.
Finally: performance. This patch increases receive throughput with
iperf3 from 840 Mbits/sec to 938 Mbits/sec, decreases interrupts from
69920/sec to 316/sec, and decreases CPU utilization (4x Cortex-A53) from
43% to 9%.
[1] Who names this stuff?
Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
Link: https://patch.msgid.link/20250206201036.1516800-5-sean.anderson@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The cr variables now contain the same values as the control registers
themselves. Extract/calculate the values from the variables instead of
saving the user-specified values. This allows us to remove some
bookkeeping, and also lets the user know what the actual coalesce
settings are.
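Conceptually, with FIELD_GET() from <linux/bitfield.h> (mask names
illustrative, shown schematically):
    /* Recover the user-visible coalesce settings from the cached CR
     * value instead of book-keeping them separately.
     */
    count = FIELD_GET(XAXIDMA_COALESCE_MASK, cr);
    delay = FIELD_GET(XAXIDMA_DELAY_MASK, cr);  /* then scale by clock rate */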
Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
Link: https://patch.msgid.link/20250206201036.1516800-4-sean.anderson@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In preparation for adaptive IRQ coalescing, we first need to support
adjusting the settings at runtime. The existing code doesn't require any
locking because
- dma_start is the only function that modifies rx/tx_dma_cr. It is
always called with IRQs and NAPI disabled, so nothing else is touching
the hardware.
- The IRQs don't race with poll, since the latter is a softirq.
- The IRQs don't race with dma_stop since they both just clear the
control registers.
- dma_stop doesn't race with poll since the former is called with NAPI
disabled.
However, once we introduce another function that modifies rx/tx_dma_cr,
we need to have some locking to prevent races. Introduce two locks to
protect these variables and their registers.
The control register values are now generated where the coalescing
settings are set. Converting coalescing settings to control register
values may require sleeping because of clk_get_rate. However, the
read/modify/write of the control registers themselves can't sleep
because it needs to happen in IRQ context. By pre-calculating the
control register values, we avoid introducing an additional mutex.
Since axienet_dma_start writes the control settings when it runs, we
don't bother updating the CR registers when rx/tx_dma_started is false.
This prevents any issues from writing to the control registers in the
middle of a reset sequence.
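A condensed sketch of the resulting pattern (lock and helper names
assumed for illustration):
    /* Sleepable part: convert coalesce settings to a CR value. */
    u32 cr = calc_rx_cr(lp, count, usec);    /* may call clk_get_rate() */

    /* Atomic part: no sleeping, safe against the IRQ handlers. */
    spin_lock_irq(&lp->rx_cr_lock);
    lp->rx_dma_cr = cr;
    if (lp->rx_dma_started)
        axienet_dma_out32(lp, XAXIDMA_RX_CR_OFFSET, cr);
    spin_unlock_irq(&lp->rx_cr_lock);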
Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
Link: https://patch.msgid.link/20250206201036.1516800-3-sean.anderson@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Combine the common parts of the CR calculations for better code reuse.
While we're at it, simplify the code a bit.
Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
Link: https://patch.msgid.link/20250206201036.1516800-2-sean.anderson@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The Dell AW1022z is an RTL8156B based 2.5G Ethernet controller.
Add the vendor and product ID values to the driver. This makes Ethernet
work with the adapter.
Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Link: https://patch.msgid.link/20250206224033.980115-1-olek2@wp.pl
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Alexander Lobakin says:
====================
xsk: the lost bits from Chapter III
Before introducing libeth_xdp, we need to add a couple more generic
helpers. Notably:
* 01: add generic loop unrolling hint helpers;
* 04: add helper to get both xdp_desc's DMA address and metadata
pointer in one go, saving several cycles and hotpath object
code size in drivers (especially when unrolling).
Bonus:
* 02, 03: convert two drivers which were using custom macros to
generic unrolled_count() (trivial, no object code changes).
====================
Link: https://patch.msgid.link/20250206182630.3914318-1-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, when your driver supports XSk Tx metadata and you want to
send an XSk frame, you need to do the following:
* call external xsk_buff_raw_get_dma();
* call inline xsk_buff_get_metadata(), which calls external
xsk_buff_raw_get_data() and then does some inline checks.
This effectively means that the following piece:
addr = pool->unaligned ? xp_unaligned_add_offset_to_addr(addr) : addr;
is done twice per frame, plus you have 2 external calls per frame, plus
this:
meta = pool->addrs + addr - pool->tx_metadata_len;
if (unlikely(!xsk_buff_valid_tx_metadata(meta)))
is always inlined, even if there's no meta or it's invalid.
Add xsk_buff_raw_get_ctx() (xp_raw_get_ctx() to be precise) to do that
in one go. It returns a small structure with 2 fields: DMA address,
filled unconditionally, and metadata pointer, non-NULL only if it's
present and valid. The address correction is performed only once and
you also have only 1 external call per XSk frame, which does all the
calculations and checks outside of your hotpath. You only need to
check `if (ctx.meta)` for the metadata presence.
To avoid copying any existing code, factor the address correction and
the virtual/DMA address lookup out into small helpers. bloat-o-meter reports no
object code changes for the existing functionality.
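The resulting calling convention looks roughly like this (the ctx field
names are assumed; the descriptor write is schematic):
    ctx = xsk_buff_raw_get_ctx(pool, desc->addr);
    /* ctx.dma is always filled; ctx.meta is NULL if absent/invalid */
    txd->addr = cpu_to_le64(ctx.dma);
    if (ctx.meta)
        xsk_tx_metadata_request(ctx.meta, &dev_tx_md_ops, priv);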
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20250206182630.3914318-5-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
ice, same as i40e, has custom loop unrolling macros for unrolling
Tx descriptor filling on XSk xmit.
Replace the ice defs with the generic unrolled_count(), which is also more
convenient as it allows passing defines as its argument, not hardcoded
values, while the loop declaration will still be a usual for-loop.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20250206182630.3914318-4-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
i40e, as well as ice, has a custom loop unrolling macro for unrolling
Tx descriptor filling on XSk xmit.
Replace i40e defs with generic unrolled_count(), which is also more
convenient as it allows passing defines as its argument, not hardcoded
values, while the loop declaration will still be a usual for-loop.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20250206182630.3914318-3-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
There are cases when we need to explicitly unroll loops. For example,
cache operations, filling DMA descriptors at very high speeds, etc.
Add compiler-specific attribute macros to give the compiler a hint
that we'd like to unroll a loop.
Example usage:
#define UNROLL_BATCH 8
unrolled_count(UNROLL_BATCH)
for (u32 i = 0; i < UNROLL_BATCH; i++)
op(priv, i);
Note that sometimes the compilers won't unroll loops if they think this
would result in worse optimization and performance than without unrolling,
and that the unroll attributes are available only starting with GCC 8. For
older compiler versions, no hints/attributes will be applied.
For better unrolling/parallelization, don't have any variables that
interfere between iterations except for the iterator itself.
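A self-contained sketch of how such a hint can be built on _Pragma (the
in-tree helper may differ in detail):
    #define __UNROLL_STR1(x)  #x
    #define __UNROLL_STR(x)   __UNROLL_STR1(x)

    #if defined(__clang__)
    #define unrolled_count(n) _Pragma(__UNROLL_STR(clang loop unroll_count(n)))
    #elif defined(__GNUC__) && __GNUC__ >= 8
    #define unrolled_count(n) _Pragma(__UNROLL_STR(GCC unroll n))
    #else
    #define unrolled_count(n) /* no hint for older compilers */
    #endif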
Co-developed-by: Jose E. Marchesi <jose.marchesi@oracle.com> # pragmas
Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20250206182630.3914318-2-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add LED brightness, mode, HW control and polarity functions to enable
external LED control in the TI DP83TD510 PHY.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250205103846.2273833-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Link down and up both trigger an update of the MTA table. This update executes many
PCIe writes and a final flush. Thus, PCIe will be blocked until all
writes are flushed. As a result, DMA transfers of other targets suffer
from delay in the range of 50us. This results in timing violations on
real-time systems during link down and up of e1000e in combination with
an Intel i3-2310E Sandy Bridge CPU.
The i3-2310E is quite old. Launched in 2011 by Intel, but still in use as
a robot controller. The exact root cause of the problem is unclear and
this situation won't change, as Intel support for this CPU ended
years ago. Our experience is that the number of posted PCIe writes needs
to be limited at least for real-time systems. With posted PCIe writes a
much higher throughput can be generated than with PCIe reads which
cannot be posted. Thus, the load on the interconnect is much higher.
Additionally, a PCIe read waits until all posted PCIe writes are done.
Therefore, the PCIe read can block the CPU for much more than 10us if a
lot of PCIe writes were posted before. Both issues are the reason why we
are limiting the number of posted PCIe writes in a row in general for our
real-time systems, not only for this driver.
A flush after a low enough number of posted PCIe writes eliminates the
delay but also increases the time needed for MTA table update. The
following measurements were done on i3-2310E with e1000e for 128 MTA
table entries:
Single flush after all writes: 106us
Flush after every write: 429us
Flush after every 2nd write: 266us
Flush after every 4th write: 180us
Flush after every 8th write: 141us
Flush after every 16th write: 121us
A flush after every 8th write delays the link up by 35us and the
negative impact on DMA transfers of other targets is still tolerable.
Execute a flush after every 8th write. This prevents overloading the
interconnect with posted writes.
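Schematically, the MTA update loop becomes (register helpers abbreviated;
a sketch rather than the literal diff):
    for (i = hw->mac.mta_reg_count - 1; i >= 0; i--) {
        E1000_WRITE_REG_ARRAY(hw, E1000_MTA, i, hw->mac.mta_shadow[i]);
        /* Flush every 8th posted write so the interconnect never
         * sees a long burst of posted PCIe writes.
         */
        if (i % 8 == 0)
            e1e_flush();
    }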
Signed-off-by: Gerhard Engleder <eg@keba.com>
Link: https://lore.kernel.org/netdev/f8fe665a-5e6c-4f95-b47a-2f3281aa0e6c@lunn.ch/T/
CC: Vitaly Lifshits <vitaly.lifshits@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Avigail Dahan <avigailx.dahan@intel.com>
Reviewed-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
The igc_close()/igc_open() functions are too drastic for installing a new
XDP prog because they cause an undesirable link down event and device reset.
To avoid delays in Ethernet traffic, improve the XDP_SETUP_PROG process by
using the same sequence as igc_xdp_setup_pool(), which performs only the
necessary steps, as follows:
1. stop the traffic and clean buffer
2. stop NAPI
3. install the XDP program
4. resume NAPI
5. allocate buffer and resume the traffic
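In driver terms, this sequence amounts to roughly the following (helper
names modeled on the existing igc XDP pool setup; a sketch, not the
literal patch):
    if (if_running)
        igc_txrx_ring_disable(adapter, queue);  /* steps 1-2 */

    old_prog = xchg(&adapter->xdp_prog, prog);  /* step 3 */
    if (old_prog)
        bpf_prog_put(old_prog);

    if (if_running)
        igc_txrx_ring_enable(adapter, queue);   /* steps 4-5 */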
This patch has been tested using the 'ip link set xdpdrv' command to attach
a simple XDP prog that always returns XDP_PASS.
Before this patch, attaching an XDP program caused ptp4l to lose sync for a
few seconds, as shown in the ptp4l log below:
ptp4l[198.082]: rms 4 max 8 freq +906 +/- 2 delay 12 +/- 0
ptp4l[199.082]: rms 3 max 4 freq +906 +/- 3 delay 12 +/- 0
ptp4l[199.536]: port 1 (enp2s0): link down
ptp4l[199.536]: port 1 (enp2s0): SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
ptp4l[199.600]: selected local clock 22abbc.fffe.bb1234 as best master
ptp4l[199.600]: port 1 (enp2s0): assuming the grand master role
ptp4l[199.600]: port 1 (enp2s0): master state recommended in slave only mode
ptp4l[199.600]: port 1 (enp2s0): defaultDS.priority1 probably misconfigured
ptp4l[202.266]: port 1 (enp2s0): link up
ptp4l[202.300]: port 1 (enp2s0): FAULTY to LISTENING on INIT_COMPLETE
ptp4l[205.558]: port 1 (enp2s0): new foreign master 44abbc.fffe.bb2144-1
ptp4l[207.558]: selected best master clock 44abbc.fffe.bb2144
ptp4l[207.559]: port 1 (enp2s0): LISTENING to UNCALIBRATED on RS_SLAVE
ptp4l[208.308]: port 1 (enp2s0): UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
ptp4l[208.933]: rms 742 max 1303 freq -195 +/- 682 delay 12 +/- 0
ptp4l[209.933]: rms 178 max 274 freq +387 +/- 243 delay 12 +/- 0
After this patch, attaching an XDP program no longer causes ptp4l to lose
sync, as shown in the ptp4l log below:
ptp4l[201.183]: rms 1 max 3 freq +959 +/- 1 delay 8 +/- 0
ptp4l[202.183]: rms 1 max 3 freq +961 +/- 2 delay 8 +/- 0
ptp4l[203.183]: rms 2 max 3 freq +958 +/- 2 delay 8 +/- 0
ptp4l[204.183]: rms 3 max 5 freq +961 +/- 3 delay 8 +/- 0
ptp4l[205.183]: rms 2 max 4 freq +964 +/- 3 delay 8 +/- 0
Besides, before this patch, attaching an XDP program caused a flood ping to
lose 10 packets, as shown in the ping statistics below:
--- 169.254.1.2 ping statistics ---
100000 packets transmitted, 99990 received, +6 errors, 0.01% packet loss, time 34001ms
rtt min/avg/max/mdev = 0.028/0.301/3104.360/13.838 ms, pipe 10, ipg/ewma 0.340/0.243 ms
After this patch, attaching an XDP program no longer causes a flood ping to
lose any packets, as shown in the ping statistics below:
--- 169.254.1.2 ping statistics ---
100000 packets transmitted, 100000 received, 0% packet loss, time 32326ms
rtt min/avg/max/mdev = 0.027/0.231/19.589/0.155 ms, pipe 2, ipg/ewma 0.323/0.322 ms
On the other hand, this patch has been tested with the
tools/testing/selftests/bpf/xdp_hw_metadata app to make sure AF_XDP
zero-copy is working fine with XDP Tx and Rx metadata. Below is the result
for the last packet after receiving 10000 UDP packets at a 1 ms interval:
poll: 1 (0) skip=0 fail=0 redir=10000
xsk_ring_cons__peek: 1
0x55881c7ef7a8: rx_desc[9999]->addr=8f110 addr=8f110 comp_addr=8f110 EoP
rx_hash: 0xFB9BB6A3 with RSS type:0x1
HW RX-time: 1733923136269470866 (sec:1733923136.2695) delta to User RX-time sec:0.0000 (43.280 usec)
XDP RX-time: 1733923136269482482 (sec:1733923136.2695) delta to User RX-time sec:0.0000 (31.664 usec)
No rx_vlan_tci or rx_vlan_proto, err=-95
0x55881c7ef7a8: ping-pong with csum=ab19 (want 315b) csum_start=34 csum_offset=6
0x55881c7ef7a8: complete tx idx=9999 addr=f010
HW TX-complete-time: 1733923136269591637 (sec:1733923136.2696) delta to User TX-complete-time sec:0.0001 (108.571 usec)
XDP RX-time: 1733923136269482482 (sec:1733923136.2695) delta to User TX-complete-time sec:0.0002 (217.726 usec)
HW RX-time: 1733923136269470866 (sec:1733923136.2695) delta to HW TX-complete-time sec:0.0001 (120.771 usec)
0x55881c7ef7a8: complete rx idx=10127 addr=8f110
Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com>
Tested-by: Avigail Dahan <avigailx.dahan@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
The Flow Director function ice_fdir_create_dflt_rules() calls
ice_create_init_fdir_rule() several times, each time with a different
enum ice_fltr_ptype parameter, and the next step is to return an error
code if an error occurred.
Change the code to store all the necessary default rules in a constant
array and call ice_create_init_fdir_rule() in a loop. This makes it easy
to extend the list of default rules in the future, without the need to
duplicate more and more code.
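The table-driven shape described above looks roughly like this (the
filter types shown are examples):
    static const enum ice_fltr_ptype dflt_rules[] = {
        ICE_FLTR_PTYPE_NONF_IPV4_TCP, ICE_FLTR_PTYPE_NONF_IPV4_UDP,
        ICE_FLTR_PTYPE_NONF_IPV6_TCP, ICE_FLTR_PTYPE_NONF_IPV6_UDP,
    };
    int i, err;

    for (i = 0; i < ARRAY_SIZE(dflt_rules); i++) {
        err = ice_create_init_fdir_rule(pf, dflt_rules[i]);
        if (err)
            return err;    /* propagate the error code */
    }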
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Add specific functions and definitions for E830 devices to enable
PTP support.
E830 devices support direct write to GLTSYN_ registers without shadow
registers and 64 bit read of PHC time.
Enable PTM for E830 devices, which is required for cross timestamping,
and add a dependency on PCIE_PTM for ICE_HWTS.
Check X86_FEATURE_ART for E830, as it may not be present in the CPU.
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Co-developed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Co-developed-by: Milena Olech <milena.olech@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Co-developed-by: Paul Greenwalt <paul.greenwalt@intel.com>
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Signed-off-by: Michal Michalik <michal.michalik@intel.com>
Co-developed-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Unify ice_ptp_init_tx_* functions for most of the MAC types except E82X.
This simplifies the code for future use with new MAC types.
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Devices supported by the ice driver use essentially the same logic for
performing a crosstimestamp. The only difference is that E830 hardware
has different offsets. Instead of having multiple implementations,
combine them into a single ice_capture_crosststamp() function.
To support both hardware types, the ice_capture_crosststamp function
must be able to determine the appropriate registers to access. To handle
this, pass a custom context structure instead of the PF pointer. This
structure, ice_crosststamp_ctx, contains a pointer to the PF, and
a pointer to the device configuration structure. This new structure also
will make it easier to implement historic snapshot support in a future
commit.
The device configuration structure is a static const data which defines
the offsets and flags for the various registers. This includes the lock
register, the cross timestamp control register, the upper and lower ART
system time capture registers, and the upper and lower device time
capture registers for each timer index.
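A sketch of such a configuration structure (field names assumed for
illustration):
    struct ice_crosststamp_cfg {
        u32 lock_reg;        /* HW semaphore register */
        u32 ctl_reg;         /* cross timestamp control register */
        u32 art_time_l;      /* captured ART time, lower 32 bits */
        u32 art_time_h;      /* captured ART time, upper 32 bits */
        u32 dev_time_l[2];   /* captured device time, per timer index */
        u32 dev_time_h[2];
    };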
Use the configuration structure to access all of the registers in
ice_capture_crosststamp(). Ensure that we don't over-run the device time
array by checking that the timer index is 0 or 1. Previously this was
simply assumed, and it would cause the device to read an incorrect and
likely garbage register.
It does feel like there should be a kernel interface for managing
register offsets like this, but the closest thing I saw was
<linux/regmap.h> which is interesting but not quite what we're looking
for...
Use rd32_poll_timeout() to read lock_reg and ctl_reg.
Add a snapshot of the system time for historic interpolation.
Remove X86_FEATURE_ART and X86_FEATURE_TSC_KNOWN_FREQ from all E82X
devices because those are SoCs, which will always have those features.
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Simplify TSYN IRQ processing by moving it to a separate function and
having appropriate behavior per PHY model, instead of multiple
conditions not related to HW, but to specific timestamping modes.
When PTP is not enabled in the kernel, don't process timestamps and
return IRQ_HANDLED.
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Instead of using shifts and casts, use FIELD_PREP after reading 40b
timestamp values.
Rename a couple defines for better clarity and consistency.
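The idea, using <linux/bitfield.h> (mask names illustrative):
    #define TS_LOW_M   GENMASK_ULL(31, 0)   /* low 32 bits of the 40b value */
    #define TS_HIGH_M  GENMASK_ULL(39, 32)  /* high 8 bits */

    /* Assemble the 40-bit timestamp without open-coded shifts/casts. */
    u64 ts = FIELD_PREP(TS_HIGH_M, hi) | FIELD_PREP(TS_LOW_M, lo);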
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Remove unnecessary ice_is_e8xx() functions and PHY model. Instead, use
MAC type where applicable.
Don't check device type in ice_ptp_maybe_trigger_tx_interrupt(), because
in reality it depends on the ready bitmap, which only E810 does not
have.
Call ice_ptp_cfg_phy_interrupt() unconditionally, because all further
function calls check the MAC type anyway, and this allows simpler code
in the future with the addition of new MAC types.
Reorder ICE_MAC_* cases in switches in ice_ptp* as in enum ice_mac_type.
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Don't check if the device type is E810T, as non-E810T devices can support
GNSS too, and the PCA9575 check is enough to determine whether GNSS is
present or not.
Rename ice_gnss_is_gps_present() to ice_gnss_is_module_present()
because GNSS module supports multiple GNSS providers, not only GPS.
Move functions related to PCA9575 from ice_ptp_hw.c to ice_common.c
to be able to access them when PTP is disabled in the kernel, but GNSS
is enabled.
Remove logical AND with ICE_AQC_LINK_TOPO_NODE_TYPE_M in
ice_get_pca9575_handle(), which has no effect, and reorder device type
checks to check the device_id first, then set other variables.
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Jakub Kicinski says:
====================
eth: fbnic: support RSS contexts and ntuple filters
Add support for RSS contexts and ntuple filters in fbnic.
The device has only one context, intended for use by TCP zero-copy Rx.
The first two patches add a check we seem to be missing in the core,
to avoid having to copy it to all drivers.
$ ./drivers/net/hw/rss_ctx.py
KTAP version 1
1..16
ok 1 rss_ctx.test_rss_key_indir
ok 2 rss_ctx.test_rss_queue_reconfigure
ok 3 rss_ctx.test_rss_resize
ok 4 rss_ctx.test_hitless_key_update
ok 5 rss_ctx.test_rss_context
# Failed to create context 2, trying to test what we got
ok 6 rss_ctx.test_rss_context4 # SKIP Tested only 1 contexts, wanted 4
# Increasing queue count 44 -> 66
# Failed to create context 2, trying to test what we got
ok 7 rss_ctx.test_rss_context32 # SKIP Tested only 1 contexts, wanted 32
# Added only 1 out of 3 contexts
ok 8 rss_ctx.test_rss_context_dump
# Driver does not support rss + queue offset
ok 9 rss_ctx.test_rss_context_queue_reconfigure
ok 10 rss_ctx.test_rss_context_overlap
ok 11 rss_ctx.test_rss_context_overlap2 # SKIP Test requires at least 2 contexts, but device only has 1
ok 12 rss_ctx.test_rss_context_out_of_order # SKIP Test requires at least 4 contexts, but device only has 1
# Failed to create context 2, trying to test what we got
ok 13 rss_ctx.test_rss_context4_create_with_cfg # SKIP Tested only 1 contexts, wanted 4
ok 14 rss_ctx.test_flow_add_context_missing
ok 15 rss_ctx.test_delete_rss_context_busy
ok 16 rss_ctx.test_rss_ntuple_addition # SKIP Ntuple filter with RSS and nonzero action not supported
# Totals: pass:10 fail:0 xfail:0 xpass:0 skip:6 error:0
====================
Link: https://patch.msgid.link/20250206235334.1425329-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The device has a handful of relatively small TCAM tables; support
dumping the driver state via debugfs.
# ethtool -N eth0 flow-type tcp6 \
dst-ip 1111::2222 dst-port $((0x1122)) \
src-ip 3333::4444 src-port $((0x3344)) \
action 2
Added rule with ID 47
# cd $dbgfs
# cat ip_src
Idx S TCAM Bitmap V Addr/Mask
------------------------------------
00 1 00020000,00000000 6 33330000000000000000000000004444
00000000000000000000000000000000
...
# cat ip_dst
Idx S TCAM Bitmap V Addr/Mask
------------------------------------
00 1 00020000,00000000 6 11110000000000000000000000002222
00000000000000000000000000000000
...
# cat act_tcam
Idx S Value/Mask RSS Dest
------------------------------------------------------------------------
...
49 1 0000 0000 0000 0000 0000 0000 1122 3344 0000 9c00 0088 000f 00000212
ffff ffff ffff ffff ffff ffff 0000 0000 ffff 23ff ff00
...
The ipo_* tables are for outer IP addresses.
The tce_* table is for directing/stealing traffic to NC-SI.
Signed-off-by: Alexander Duyck <alexanderduyck@meta.com>
Link: https://patch.msgid.link/20250206235334.1425329-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
There's no good API to check how many contexts the device supports.
But the initial tests already sense the context count, so just store
that number and skip tests which we know need more.
Link: https://patch.msgid.link/20250206235334.1425329-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add ethtool -n / -N support. Support only "un-ordered" rule sets
(RX_CLS_LOC_ANY), just for simplicity of the code. It's unclear whether
anyone actually cares about rule ordering.
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://patch.msgid.link/20250206235334.1425329-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
IPv6 addresses are huge, so the device has 4 TCAMs used for narrowing
them down to a smaller key before the main match / action engine.
Add the tables in which we'll keep the IP addresses used by
ethtool n-tuple rules. Add the code for programming them
into the device, and code for allocating and freeing entries.
A bit of copy / paste here, as we need to support IPv4 and IPv6 in the
same tables, and there are four of them. But it makes the code easier
to match up with the device.
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://patch.msgid.link/20250206235334.1425329-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add support for an extra RSS context. The device has a primary
and a secondary context.
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250206235334.1425329-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Check that adding Rx flow steering rules pointing to an RSS
context which does not exist is prevented.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250206235334.1425329-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Since commit 42dc431f5d0e ("ethtool: rss: prevent rss ctx deletion
when in use") we prevent removal of RSS contexts pointed to by
existing flow rules. The core should also prevent creation of rules
which point to an RSS context which doesn't exist in the first place.
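Conceptually, the core-side validation reduces to something like this
(placement and field names are schematic):
    /* Reject flow rules that point at a nonexistent RSS context. */
    if (info->fs.flow_type & FLOW_RSS &&
        !xa_load(&dev->ethtool->rss_ctx, info->rss_context))
        return -EINVAL;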
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250206235334.1425329-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Breno Leitao says:
====================
netconsole: Add support for CPU population
The current implementation of netconsole sends all log messages in
parallel, which can lead to an intermixed and interleaved output on the
receiving side. This makes it challenging to demultiplex the messages
and attribute them to their originating CPUs.
As a result, users and developers often struggle to effectively analyze
and debug the parallel log output received through netconsole.
Example of a message received from production hosts:
------------[ cut here ]------------
------------[ cut here ]------------
refcount_t: saturated; leaking memory.
WARNING: CPU: 2 PID: 1613668 at lib/refcount.c:22 refcount_warn_saturate+0x5e/0xe0
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 26 PID: 4139916 at lib/refcount.c:25 refcount_warn_saturate+0x7d/0xe0
Modules linked in: bpf_preload(E) vhost_net(E) tun(E) vhost(E)
This series of patches introduces a new feature to the netconsole
subsystem that allows the automatic population of the CPU number in the
userdata field for each log message. This enhancement provides several
benefits:
* Improved demultiplexing of parallel log output: When multiple CPUs are
sending messages concurrently, the added CPU number in the userdata
makes it easier to differentiate and attribute the messages to their
originating CPUs.
* Better visibility into message sources: The CPU number information
gives users and developers more insight into which specific CPU a
particular log message came from, which can be valuable for debugging
and analysis.
The changes in this series are as follows, patch by patch:
Patch "consolidate send buffers into netconsole_target struct"
=================================================
Move the static buffers into the netconsole target, away from the static
declarations in send_msg_no_fragmentation() and send_msg_fragmented().
Patch "netconsole: Rename userdata to extradata"
=================================================
Create the concept of extradata, which encompasses userdata and the
upcoming sysdata.
Sysdata is a new concept being added: fields that are populated by the
kernel. At this time only the CPU #, but there is a desire to add the
current task name, kernel release version, etc.
Patch "netconsole: Helper to count number of used entries"
===========================================================
Create a simple helper to count the number of entries in extradata. I am
separating this into a function since it will need to count userdata and
sysdata. For instance, when the user adds an extra userdata entry, we
need to check if there is space, counting the previous data entries
(from userdata and CPU data).
Patch "Introduce configfs helpers for sysdata features"
======================================================
Create the concept of a sysdata feature in the netconsole target, and
create the configfs helpers to enable the bit in nt->sysdata.
Patch "Include sysdata in extradata entry count"
================================================
Add the concept of sysdata when counting the available space in the
buffer. This will protect users from creating new userdata/sysdata if
there is no more space.
Patch "netconsole: add support for sysdata and CPU population"
===============================================================
This is the core patch. Basically, add a new option to enable automatic
CPU number population in the netconsole userdata, and provide a new
"cpu_nr" configfs attribute to control this feature.
Patch "netconsole: selftest: test CPU number auto-population"
=============================================================
Expands the existing netconsole selftest to verify the CPU number
auto-population functionality. Ensures the received netconsole messages
contain the expected "cpu=<CPU>" entry in the message. Tests different
permutations with userdata.
Patch "netconsole: docs: Add documentation for CPU number auto-population"
=============================================================================
Updates the netconsole documentation to explain the new CPU number
auto-population feature. Provides instructions on how to enable and use
the feature.
I believe these changes will be a valuable addition to the netconsole
subsystem, enhancing its usefulness for kernel developers and users.
PS: This patchset is on top of the patch that created
netcons_fragmented_msg selftest:
https://lore.kernel.org/all/20250203-netcons_frag_msgs-v1-1-5bc6bedf2ac0@debian.org/
---
Changes in v5:
- Fixed a kernel doc syntax syntax (Simon)
- Link to v4: https://lore.kernel.org/r/20250204-netcon_cpu-v4-0-9480266ef556@debian.org
Changes in v4:
- Fixed Kernel doc for netconsole_target (Simon)
- Fixed a typo in disable_sysdata_feature (Simon)
- Improved sysdata_cpu_nr_show() to return !! in a bit-wise operation
- Link to v3: https://lore.kernel.org/r/20250124-netcon_cpu-v3-0-12a0d286ba1d@debian.org
Changes in v3:
- Moved the buffer into netconsole_target, avoiding static functions in
the send path (Jakub).
- Fix a documentation error (Randy Dunlap)
- Created a function that handle all the extradata, consolidating it in
a single place (Jakub)
- Split the patch even more, trying to simplify the review.
- Link to v2: https://lore.kernel.org/r/20250115-netcon_cpu-v2-0-95971b44dc56@debian.org
Changes in v2:
- Create the concept of extradata and sysdata. This will make the design
easier to understand, and the code easier to read.
* Basically extradata encompasses userdata and the new sysdata.
Userdata originates from user, and sysdata originates in kernel.
- Improved the test to send from a very specific CPU, which can be
checked to be correct on the other side, as suggested by Jakub.
- Fixed a bug where CPU # was populated at the wrong place
- Link to v1: https://lore.kernel.org/r/20241113-netcon_cpu-v1-0-d187bf7c0321@debian.org
====================
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Update the netconsole documentation to explain the new feature that
allows automatic population of the CPU number.
The key changes include introducing a new section titled "CPU number
auto population in userdata", explaining how to enable the CPU number
auto-population feature by writing to the "populate_cpu_nr" file in the
netconsole configfs hierarchy.
This documentation update ensures users are aware of the new CPU number
auto-population functionality and how to leverage it for better
demultiplexing and visibility of parallel netconsole output.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add a new selftest to verify that the netconsole module correctly
handles CPU runtime data in sysdata. The test validates three scenarios:
1. Basic CPU sysdata functionality - verifies that cpu=X is appended to
messages
2. CPU sysdata with userdata - ensures CPU data works alongside userdata
3. Disabled CPU sysdata - confirms no CPU data is included when disabled
The test uses taskset to control which CPU sends messages and verifies
the reported CPU matches the one used. This helps ensure that netconsole
accurately tracks and reports the originating CPU of messages.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add infrastructure to automatically append kernel-generated data (sysdata)
to netconsole messages. As the first use case, implement CPU number
population, which adds the CPU that sent the message.
This change introduces three distinct data types:
- extradata: The complete set of appended data (sysdata + userdata)
- userdata: User-provided key-value pairs from userspace
- sysdata: Kernel-populated data (e.g. cpu=XX)
The implementation adds a new configfs attribute 'cpu_nr' to control CPU
number population per target. When enabled, each message is tagged with
its originating CPU. The sysdata is dynamically updated at message time
and appended after any existing userdata.
The CPU number is formatted as "cpu=XX" and is added to the extradata
buffer, respecting the existing size limits.
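At message time, the sysdata append reduces to something like this
(buffer and field names assumed):
    /* Rewrite the sysdata suffix on every message; the userdata part
     * of the buffer stays as configfs last wrote it.
     */
    if (nt->sysdata_fields & SYSDATA_CPU_NR)
        sysdata_len = scnprintf(buf + nt->userdata_length, avail,
                                " cpu=%u", raw_smp_processor_id());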
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Modify count_extradata_entries() to include sysdata fields when
calculating the total number of extradata entries. This change ensures
that the sysdata feature, specifically the CPU number field, is
correctly counted against the MAX_EXTRADATA_ITEMS limit.
The modification adds a simple check for the CPU_NR flag in the
sysdata_fields, incrementing the entry count accordingly.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch introduces a bitfield to store sysdata features in the
netconsole_target struct. It also adds configfs helpers to enable
or disable the CPU_NR feature, which populates the CPU number in
sysdata.
The patch provides the necessary infrastructure to set or unset the
CPU_NR feature, but does not modify the message itself.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add a helper function count_extradata_entries() to count the number of
used extradata entries in a netconsole target. This refactors the duplicate
code for counting entries into a single function, which will be reused
by upcoming CPU sysdata changes.
The helper uses list_count_nodes() to count the number of children in
the userdata group configfs hierarchy.
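The helper is essentially a one-liner over configfs, consistent with the
description above:
    static size_t count_extradata_entries(struct netconsole_target *nt)
    {
        /* Each userdata entry is a child of the userdata configfs group */
        return list_count_nodes(&nt->userdata_group.cg_children);
    }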
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Rename "userdata" to "extradata" since this structure will hold both
user and system data in future patches. Keep "userdata" term only for
data that comes from userspace (configfs), while "extradata" encompasses
both userdata and the upcoming sysdata.
These are the rules of the design:
1. extradata_complete will hold userdata and sysdata (coming)
2. sysdata will come after userdata_length
3. the extradata_complete[userdata_length] string will be replaced at every
message
4. userdata is replaced when configfs changes (update_userdata())
5. sysdata is replaced at every message
Example:
extradata_complete = "userkey=uservalue cpu=42"
userdata_length = 17
sysdata_length = 7 (space (" ") is part of sysdata)
Since sysdata is still not available, you will see the following in the
send functions:
extradata_len = nt->userdata_length;
The upcoming patches, which will add support for sysdata, will change
it to:
extradata_len = nt->userdata_length + sysdata_len;
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Move the static buffers from send_msg_no_fragmentation() and
send_msg_fragmented() into the netconsole_target structure. This
simplifies the code by:
- Eliminating redundant static buffers
- Centralizing buffer management in the target structure
- Reducing memory usage by 1KB (one buffer instead of two)
The buffer in netconsole_target is protected by target_list_lock,
maintaining the same synchronization semantics as the original code.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Jakub Kicinski says:
====================
net: improve core queue API handling while device is down
The core netdev_rx_queue_restart() doesn't currently take into account
that the device may be down. The current and proposed queue API
implementations deal with this by rejecting queue API calls while
the device is down. We can do better: in theory we can still allow
devmem binding while the device is down - we shouldn't stop and start
the queues, just try to allocate the memory. The reason we allocate
the memory is that memory provider binding checks if any compatible
page pool has been created (page_pool_check_memory_provider()).
Alternatively we could reject installing MP while the device is down
but the MP assignment survives ifdown (so presumably MP doesn't cease
to exist while down), and in general we allow configuration while down.
Previously I thought we needed this as a fix, but gve rejects page pool
calls while down, and so did Saeed in the patches he posted. So this
series just makes the core act more sensibly but practically should
be a noop for now.
v1: https://lore.kernel.org/20250205190131.564456-1-kuba@kernel.org
====================
Link: https://patch.msgid.link/20250206225638.1387810-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Resetting queues while the device is down should be legal.
Allow it, test it. Ideally we'd test this with a real device
supporting devmem but I don't have access to such devices.
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250206225638.1387810-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|