Age | Commit message (Collapse) | Author |
|
Setup the TXBD to enable TX timestamp if requested. At TX packet DMA
completion, if we requested TX timestamp on that packet, we defer to
.do_aux_work() to obtain the TX timestamp from the firmware before we
free the TX SKB.
v2: Use .do_aux_work() to get the TX timestamp from firmware.
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If the RX packet is timestamped by the hardware, the RX completion
record will contain the lower 32-bit of the timestamp. This needs
to be combined with the upper 16-bit of the periodic timestamp that
we get from the timer. The previous snapshot in ptp->old_timer is
used to make sure that the snapshot is not ahead of the RX timestamp
and we adjust for wrap-around if needed.
v2: Make ptp->old_time read access safe on 32-bit CPUs.
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
From the bnxt_timer(), read the 48-bit hardware running clock
periodically and store it in ptp->current_time. The previous snapshot
of the clock will be stored in ptp->old_time. The old_time snapshot
will be used in the next patches to compute the RX packet timestamps.
v2: Use .do_aux_work() to read the timer periodically.
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add the clock APIs to set/get/adjust the hw clock, and the related
ioctls and ethtool methods.
v2: Propagate error code from ptp_clock_register().
Add spinlock to serialize access to the timecounter. The
timecounter is accessed in process context and the RX datapath.
Read the PHC using direct registers.
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Store PTP hardware info in a structure if hardware and firmware support PTP.
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Adding the PTP related firmware interface is the main change.
There is also a name change for admin_mtu, requiring code fixup.
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch adds support of dumping MAC umv counter in debugfs,
which will be helpful for debugging.
The display style is below:
$ cat umv_info
num_alloc_vport : 2
max_umv_size : 256
wanted_umv_size : 256
priv_umv_size : 85
share_umv_size : 86
vport(0) used_umv_num : 1
vport(1) used_umv_num : 1
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Previously, the flow director counter is not enabled. To improve the
maintainability for chechking whether flow director hit or not, enable
flow director counter for each function, and add debugfs query inerface
to query the counters for each function.
The debugfs command is below:
cat fd_counter
func_id hit_times
pf 0
vf0 0
vf1 0
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler udpates from Ingo Molnar:
- Changes to core scheduling facilities:
- Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables
coordinated scheduling across SMT siblings. This is a much
requested feature for cloud computing platforms, to allow the
flexible utilization of SMT siblings, without exposing untrusted
domains to information leaks & side channels, plus to ensure more
deterministic computing performance on SMT systems used by
heterogenous workloads.
There are new prctls to set core scheduling groups, which allows
more flexible management of workloads that can share siblings.
- Fix task->state access anti-patterns that may result in missed
wakeups and rename it to ->__state in the process to catch new
abuses.
- Load-balancing changes:
- Tweak newidle_balance for fair-sched, to improve 'memcache'-like
workloads.
- "Age" (decay) average idle time, to better track & improve
workloads such as 'tbench'.
- Fix & improve energy-aware (EAS) balancing logic & metrics.
- Fix & improve the uclamp metrics.
- Fix task migration (taskset) corner case on !CONFIG_CPUSET.
- Fix RT and deadline utilization tracking across policy changes
- Introduce a "burstable" CFS controller via cgroups, which allows
bursty CPU-bound workloads to borrow a bit against their future
quota to improve overall latencies & batching. Can be tweaked via
/sys/fs/cgroup/cpu/<X>/cpu.cfs_burst_us.
- Rework assymetric topology/capacity detection & handling.
- Scheduler statistics & tooling:
- Disable delayacct by default, but add a sysctl to enable it at
runtime if tooling needs it. Use static keys and other
optimizations to make it more palatable.
- Use sched_clock() in delayacct, instead of ktime_get_ns().
- Misc cleanups and fixes.
* tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
sched/doc: Update the CPU capacity asymmetry bits
sched/topology: Rework CPU capacity asymmetry detection
sched/core: Introduce SD_ASYM_CPUCAPACITY_FULL sched_domain flag
psi: Fix race between psi_trigger_create/destroy
sched/fair: Introduce the burstable CFS controller
sched/uclamp: Fix uclamp_tg_restrict()
sched/rt: Fix Deadline utilization tracking during policy change
sched/rt: Fix RT utilization tracking during policy change
sched: Change task_struct::state
sched,arch: Remove unused TASK_STATE offsets
sched,timer: Use __set_current_state()
sched: Add get_current_state()
sched,perf,kvm: Fix preemption condition
sched: Introduce task_is_running()
sched: Unbreak wakeups
sched/fair: Age the average idle time
sched/cpufreq: Consider reduced CPU capacity in energy calculation
sched/fair: Take thermal pressure into account while estimating energy
thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure
sched/fair: Return early from update_tg_cfs_load() if delta == 0
...
|
|
Add the xfrm xdo and ipsec_init/cleanup to uplink representor to
support IPsec in SRIOV switchdev mode.
Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Huy Nguyen <huyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Expose ethtool SW counter for the number of kTLS device-offloaded
TX connections that are finished and deleted.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Avoid second traversal on the SF table by recording the first free entry
and using it in case the looked up entry was not found in the table.
Signed-off-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
The max packet size a hairpin queue is able to handle
is determined by the total hairpin buffer size divided
by 4.
Currently the buffer size is set to 32KB which makes
the max packet size to be 8KB and doesn't support
jumbo frames of size 9KB.
This change increases the buffer size to 64KB to increase
the max frame size and support 9KB frames.
Signed-off-by: Ariel Levkovich <lariel@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Add SW steering support for sFlow / flow sampler action.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
When comparing sampler flow destinations,
in fs_core, consider sampler ID as well.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
100GbE Intel Wired LAN Driver Updates 2021-06-25
This series contains updates to ice driver only.
Jesse adds support for tracepoints to aide in debugging.
Maciej adds support for PTP auxiliary pin support.
Victor removes the VSI info from the old aggregator when moving the VSI
to another aggregator.
Tony removes an unnecessary VSI assignment.
Christophe Jaillet fixes a memory leak for failed allocation in
ice_pf_dcb_cfg().
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The newly implemented fwnode_mdbiobus_register turned out to be
problematic - in case the fwnode_/of_/acpi_mdio are built as
modules, a dependency cycle can be observed during the depmod phase of
modules_install, eg.:
depmod: ERROR: Cycle detected: fwnode_mdio -> of_mdio -> fwnode_mdio
depmod: ERROR: Found 2 modules in dependency cycles!
OR:
depmod: ERROR: Cycle detected: acpi_mdio -> fwnode_mdio -> acpi_mdio
depmod: ERROR: Found 2 modules in dependency cycles!
A possible solution could be to rework fwnode_mdiobus_register,
so that to merge the contents of acpi_mdiobus_register and
of_mdiobus_register. However feasible, such change would
be very intrusive and affect huge amount of the of_mdiobus_register
users.
Since there are currently 2 users of ACPI and MDIO
(xgmac_mdio and mvmdio), withdraw the fwnode_mdbiobus_register
and roll back to a simple 'if' condition in affected drivers.
Fixes: 62a6ef6a996f ("net: mdiobus: Introduce fwnode_mdbiobus_register()")
Signed-off-by: Marcin Wojtas <mw@semihalf.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Patch was based on wrong presumption that be_poll can be called only
from bh context. It reintroducing old regression (also reverted) and
causing deadlock when we use netconsole with benet in bonding.
Old revert: commit 072a9c486004 ("netpoll: revert 6bdb7fe3104 and fix
be_poll() instead")
[ 331.269715] bond0: (slave enp0s7f0): Releasing backup interface
[ 331.270121] CPU: 4 PID: 1479 Comm: ifenslave Not tainted 5.13.0-rc7+ #2
[ 331.270122] Call Trace:
[ 331.270122] [c00000001789f200] [c0000000008c505c] dump_stack+0x100/0x174 (unreliable)
[ 331.270124] [c00000001789f240] [c008000001238b9c] be_poll+0x64/0xe90 [be2net]
[ 331.270125] [c00000001789f330] [c000000000d1e6e4] netpoll_poll_dev+0x174/0x3d0
[ 331.270127] [c00000001789f400] [c008000001bc167c] bond_poll_controller+0xb4/0x130 [bonding]
[ 331.270128] [c00000001789f450] [c000000000d1e624] netpoll_poll_dev+0xb4/0x3d0
[ 331.270129] [c00000001789f520] [c000000000d1ed88] netpoll_send_skb+0x448/0x470
[ 331.270130] [c00000001789f5d0] [c0080000011f14f8] write_msg+0x180/0x1b0 [netconsole]
[ 331.270131] [c00000001789f640] [c000000000230c0c] console_unlock+0x54c/0x790
[ 331.270132] [c00000001789f7b0] [c000000000233098] vprintk_emit+0x2d8/0x450
[ 331.270133] [c00000001789f810] [c000000000234758] vprintk+0xc8/0x270
[ 331.270134] [c00000001789f850] [c000000000233c28] printk+0x40/0x54
[ 331.270135] [c00000001789f870] [c000000000ccf908] __netdev_printk+0x150/0x198
[ 331.270136] [c00000001789f910] [c000000000ccfdb4] netdev_info+0x68/0x94
[ 331.270137] [c00000001789f950] [c008000001bcbd70] __bond_release_one+0x188/0x6b0 [bonding]
[ 331.270138] [c00000001789faa0] [c008000001bcc6f4] bond_do_ioctl+0x42c/0x490 [bonding]
[ 331.270139] [c00000001789fb60] [c000000000d0d17c] dev_ifsioc+0x17c/0x400
[ 331.270140] [c00000001789fbc0] [c000000000d0db70] dev_ioctl+0x390/0x890
[ 331.270141] [c00000001789fc10] [c000000000c7c76c] sock_do_ioctl+0xac/0x1b0
[ 331.270142] [c00000001789fc90] [c000000000c7ffac] sock_ioctl+0x31c/0x6e0
[ 331.270143] [c00000001789fd60] [c0000000005b9728] sys_ioctl+0xf8/0x150
[ 331.270145] [c00000001789fdb0] [c0000000000336c0] system_call_exception+0x160/0x2f0
[ 331.270146] [c00000001789fe10] [c00000000000d35c] system_call_common+0xec/0x278
[ 331.270147] --- interrupt: c00 at 0x7fffa6c6ec00
[ 331.270147] NIP: 00007fffa6c6ec00 LR: 0000000105c4185c CTR: 0000000000000000
[ 331.270148] REGS: c00000001789fe80 TRAP: 0c00 Not tainted (5.13.0-rc7+)
[ 331.270148] MSR: 800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 28000428 XER: 00000000
[ 331.270155] IRQMASK: 0
[ 331.270156] GPR00: 0000000000000036 00007fffd494d5b0 00007fffa6d57100 0000000000000003
[ 331.270158] GPR04: 0000000000008991 00007fffd494d6d0 0000000000000008 00007fffd494f28c
[ 331.270161] GPR08: 0000000000000003 0000000000000000 0000000000000000 0000000000000000
[ 331.270164] GPR12: 0000000000000000 00007fffa6dfa220 0000000000000000 0000000000000000
[ 331.270167] GPR16: 0000000105c44880 0000000000000000 0000000105c60088 0000000105c60318
[ 331.270170] GPR20: 0000000105c602c0 0000000105c44560 0000000000000000 0000000000000000
[ 331.270172] GPR24: 00007fffd494dc50 00007fffd494d6a8 0000000105c60008 00007fffd494d6d0
[ 331.270175] GPR28: 00007fffd494f27e 0000000105c6026c 00007fffd494f284 0000000000000000
[ 331.270178] NIP [00007fffa6c6ec00] 0x7fffa6c6ec00
[ 331.270178] LR [0000000105c4185c] 0x105c4185c
[ 331.270179] --- interrupt: c00
This reverts commit d0d006a43e9a7a796f6f178839c92fcc222c564d.
Fixes: d0d006a43e9a7a ("be2net: disable bh with spin_lock in be_process_mcc")
Signed-off-by: Petr Oros <poros@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If this 'kzalloc()' fails we must free some resources as in all the other
error handling paths of this function.
Fixes: 348048e724a0 ("ice: Implement iidc operations")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
ice_get_vf_vsi() is being called twice for the same VSI. Remove the
unnecessary call/assignment.
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
|
|
Remove the VSI info from previous aggregator after moving the VSI to a
new aggregator.
Signed-off-by: Victor Raj <victor.raj@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
The E810 device supports programmable pins for enabling both input and
output events related to the PTP hardware clock. This includes both
output signals with programmable period, as well as timestamping of
events on input pins.
Add support for enabling these using the CONFIG_PTP_1588_CLOCK
interface.
This allows programming the software defined pins to take advantage of
the hardware clock features.
Signed-off-by: Maciej Machnikowski <maciej.machnikowski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Fixes: 893ce44df565 ("gve: Add basic driver framework for Compute Engine Virtual NIC")
Signed-off-by: Bailey Forrest <bcf@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch adds build and driver logic for the "mlxbf_gige"
Ethernet driver from Mellanox Technologies. The second
generation BlueField SoC from Mellanox supports an
out-of-band GigaBit Ethernet management port to the Arm
subsystem. This driver supports TCP/IP network connectivity
for that port, and provides back-end routines to handle
basic ethtool requests.
The driver interfaces to the Gigabit Ethernet block of
BlueField SoC via MMIO accesses to registers, which contain
control information or pointers describing transmit and
receive resources. There is a single transmit queue, and
the port supports transmit ring sizes of 4 to 256 entries.
There is a single receive queue, and the port supports
receive ring sizes of 32 to 32K entries. The transmit and
receive rings are allocated from DMA coherent memory. There
is a 16-bit producer and consumer index per ring to denote
software ownership and hardware ownership, respectively.
The main driver logic such as probe(), remove(), and netdev
ops are in "mlxbf_gige_main.c". Logic in "mlxbf_gige_rx.c"
and "mlxbf_gige_tx.c" handles the packet processing for
receive and transmit respectively.
The logic in "mlxbf_gige_ethtool.c" supports the handling
of some basic ethtool requests: get driver info, get ring
parameters, get registers, and get statistics.
The logic in "mlxbf_gige_mdio.c" is the driver controlling
the Mellanox BlueField hardware that interacts with a PHY
device via MDIO/MDC pins. This driver does the following:
- At driver probe time, it configures several BlueField MDIO
parameters such as sample rate, full drive, voltage and MDC
- It defines functions to read and write MDIO registers and
registers the MDIO bus.
- It defines the phy interrupt handler reporting a
link up/down status change
- This driver's probe is invoked from the main driver logic
while the phy interrupt handler is registered in ndo_open.
Driver limitations
- Only supports 1Gbps speed
- Only supports GMII protocol
- Supports maximum packet size of 2KB
- Does not support scatter-gather buffering
Testing
- Successful build of kernel for ARM64, ARM32, X86_64
- Tested ARM64 build on FastModels & Palladium
- Tested ARM64 build on several Mellanox boards that are built with
the BlueField-2 SoC. The testing includes coverage in the areas
of networking (e.g. ping, iperf, ifconfig, route), file transfers
(e.g. SCP), and various ethtool options relevant to this driver.
Signed-off-by: David Thompson <davthompson@nvidia.com>
Signed-off-by: Asmaa Mnebhi <asmaa@nvidia.com>
Reviewed-by: Liming Sun <limings@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch is modeled after one by Scott Peterson for i40e.
Add tracepoints to the driver, via a new file ice_trace.h and some new
trace calls added in interesting places in the driver. Add some tracing
for DIMLIB to help debug interrupt moderation problems.
Performance should not be affected, and this can be very useful
for debugging and adding new trace events to paths in the future.
Note eBPF programs can attach to these events, as well as perf
can count them since we're attaching to the events subsystem
in the kernel.
Co-developed-by: Ben Shelton <benjamin.h.shelton@intel.com>
Signed-off-by: Ben Shelton <benjamin.h.shelton@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
The Broadcom UniMAC MDIO bus from mdio-bcm-unimac module comes too late.
So, GENET cannot find the ethernet PHY on UniMAC MDIO bus. This leads
GENET fail to attach the PHY as following log:
bcmgenet fd580000.ethernet: GENET 5.0 EPHY: 0x0000
...
could not attach to PHY
bcmgenet fd580000.ethernet eth0: failed to connect to PHY
uart-pl011 fe201000.serial: no DMA platform data
libphy: bcmgenet MII bus: probed
...
unimac-mdio unimac-mdio.-19: Broadcom UniMAC MDIO bus
It is not just coming too late, there is also no way for the module
loader to figure out the dependency between GENET and its MDIO bus
driver unless we provide this MODULE_SOFTDEP hint.
This patch adds the soft dependency to load mdio-bcm-unimac module
before genet module to fix this issue.
Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=213485
Fixes: 9a4e79697009 ("net: bcmgenet: utilize generic Broadcom UniMAC MDIO controller driver")
Signed-off-by: Jian-Hong Pan <jhp@endlessos.org>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Simply get a pointer to the data in the register payload instead of
copying it to a temporary buffer.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
https://patchwork.kernel.org/project/netdevbpf/list/?series=506637&state=*
- Remove unused variable
- Use correct integer type for string formatting.
- Remove `inline` in C files
Fixes: 9c1a59a2f4bc ("gve: DQO: Add ring allocation and initialization")
Fixes: a57e5de476be ("gve: DQO: Add TX path")
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Complete to commit def4ec6dce393e ("e1000e: PCIm function state support")
Check the PCIm state only on CSME systems. There is no point to do this
check on non CSME systems.
This patch fixes a generation a false-positive warning:
"Error in exiting dmoff"
Fixes: def4ec6dce39 ("e1000e: PCIm function state support")
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2021-06-24
This series contains updates to i40e driver only.
Dinghao Liu corrects error handling for failed i40e_vsi_request_irq()
call.
Mateusz allows for disabling of autonegotiation for all BaseT media.
Jesse corrects the multiplier being used on 5Gb speeds for PTP.
Jan adds locking in paths calling i40e_setup_pf_switch() that were
missing it.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The RX queue has an array of `gve_rx_buf_state_dqo` objects. All
allocated pages have an associated buf_state object. When a buffer is
posted on the RX buffer queue, the buffer ID will be the buf_state's
index into the RX queue's array.
On packet reception, the RX queue will have one descriptor for each
buffer associated with a received packet. Each RX descriptor will have
a buffer_id that was posted on the buffer queue.
Notable mentions:
- We use a default buffer size of 2048 bytes. Based on page size, we
may post separate sections of a single page as separate buffers.
- The driver holds an extra reference on pages passed up the receive
path with an skb and keeps these pages on a list. When posting new
buffers to the NIC, we check if any of these pages has only our
reference, or another buffer sized segment of the page has no
references. If so, it is free to reuse. This page recycling approach
is a common netdev optimization that reduces page alloc/free calls.
- Pages in the free list have a page_count bias in order to avoid an
atomic increment of pagecount every time we attempt to reuse a page.
# references = page_count() - bias
- In order to track when a page is safe to reuse, we keep track of the
last offset which had a single SKB reference. When this occurs, it
implies that every single other offset is reusable. Otherwise, we
don't know if offsets can be safely reused.
- We maintain two free lists of pages. List #1 (recycled_buf_states)
contains pages we know can be reused right away. List #2
(used_buf_states) contains pages which cannot be used right away. We
only attempt to get pages from list #2 when list #1 is empty. We only
attempt to use a small fixed number pages from list #2 before giving
up and allocating a new page. Both lists are FIFOs in hope that by the
time we attempt to reuse a page, the references were dropped.
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
TX SKBs will have their buffers DMA mapped with the device. Each buffer
will have at least one TX descriptor associated. Each SKB will also have
a metadata descriptor.
Each TX queue maintains an array of `gve_tx_pending_packet_dqo` objects.
Every TX SKB will have an associated pending_packet object. A TX SKB's
descriptors will use its pending_packet's index as the completion tag,
which will be returned on the TX completion queue.
The device implements a "flow-miss model". Most packets will simply
receive a packet completion. The flow-miss system may choose to process
a packet based on its contents. A TX packet which experiences a flow
miss would receive a miss completion followed by a later reinjection
completion. The miss-completion is received when the packet starts to be
processed by the flow-miss system and the reinjection completion is
received when the flow-miss system completes processing the packet and
sends it on the wire.
Notable mentions:
- Buffers may be freed after receiving the miss-completion, but in order
to avoid packet reordering, we do not complete the SKB until receiving
the reinjection completion.
- The driver must robustly handle the unlikely scenario where a miss
completion does not have an associated reinjection completion. This is
accomplished by maintaining a list of packets which have a pending
reinjection completion. After a short timeout (5 seconds), the
SKB and buffers are released and the pending_packet is moved to a
second list which has a longer timeout (60 seconds), where the
pending_packet will not be reused. When the longer timeout elapses,
the driver may assume the reinjection completion would never be
received and the pending_packet may be reused.
- Completion handling is triggered by an interrupt and is done in the
NAPI poll function. Because the TX path and completion exist in
different threading contexts they maintain their own lists for free
pending_packet objects. The TX path uses a lock-free approach to steal
the list from the completion path.
- Both the TSO context and general context descriptors have metadata
bytes. The device requires that if multiple descriptors contain the
same field, each descriptor must have the same value set for that
field.
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When interrupts are first enabled, we also set the ratelimits, which
will be static for the entire usage of the device.
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Allocate the buffer and completion ring structures. Do not populate the
rings yet. That will happen in the respective rx and tx datapath
follow-on patches
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add napi netdev device registration, interrupt handling and initial tx
and rx polling stubs. The stubs will be filled in follow-on patches.
Also:
- LRO feature advertisement and handling
- Also update ethtool logic
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
DQO queue creation requires additional parameters:
- TX completion/RX buffer queue size
- TX completion/RX buffer queue address
- TX/RX queue size
- RX buffer size
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
- Add new DQO datapath structures:
- `gve_rx_buf_queue_dqo`
- `gve_rx_compl_queue_dqo`
- `gve_rx_buf_state_dqo`
- `gve_tx_desc_dqo`
- `gve_tx_pending_packet_dqo`
- Incorporate these into the existing ring data structures:
- `gve_rx_ring`
- `gve_tx_ring`
Noteworthy mentions:
- `gve_rx_buf_state` represents an RX buffer which was posted to HW.
Each RX queue has an array of these objects and the index into the
array is used as the buffer_id when posted to HW.
- `gve_tx_pending_packet_dqo` is treated similarly for TX queues. The
completion_tag is the index into the array.
- These two structures have links for linked lists which are represented
by 16b indexes into a contiguous array of these structures.
This reduces memory footprint compared to 64b pointers.
- We use unions for the writeable datapath structures to reduce cache
footprint. GQI specific members will renamed like DQO members in a
future patch.
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
General description of rings and descriptors:
TX ring is used for sending TX packet buffers to the NIC. It has the
following descriptors:
- `gve_tx_pkt_desc_dqo` - Data buffer descriptor
- `gve_tx_tso_context_desc_dqo` - TSO context descriptor
- `gve_tx_general_context_desc_dqo` - Generic metadata descriptor
Metadata is a collection of 12 bytes. We define `gve_tx_metadata_dqo`
which represents the logical interpetation of the metadata bytes. It's
helpful to define this structure because the metadata bytes exist in
multiple descriptor types (including `gve_tx_tso_context_desc_dqo`),
and the device requires same field has the same value in all
descriptors.
The TX completion ring is used to receive completions from the NIC.
Having a separate ring allows for completions to be out of order. The
completion descriptor `gve_tx_compl_desc` has several different types,
most important are packet and descriptor completions. Descriptor
completions are used to notify the driver when descriptors sent on the
TX ring are done being consumed. The descriptor completion is only used
to signal that space is cleared in the TX ring. A packet completion will
be received when a packet transmitted on the TX queue is done being
transmitted.
In addition there are "miss" and "reinjection" completions. The device
implements a "flow-miss model". Most packets will simply receive a
packet completion. The flow-miss system may choose to process a packet
based on its contents. A TX packet which experiences a flow miss would
receive a miss completion followed by a later reinjection completion.
The miss-completion is received when the packet starts to be processed
by the flow-miss system and the reinjection completion is received when
the flow-miss system completes processing the packet and sends it on the
wire.
The RX buffer ring is used to send buffers to HW via the
`gve_rx_desc_dqo` descriptor.
Received packets are put into the RX queue by the device, which
populates the `gve_rx_compl_desc_dqo` descriptor. The RX descriptors
refer to buffers posted by the buffer queue. Received buffers may be
returned out of order, such as when HW LRO is enabled.
Important concepts:
- "TX" and "RX buffer" queues, which send descriptors to the device, use
MMIO doorbells to notify the device of new descriptors.
- "RX" and "TX completion" queues, which receive descriptors from the
device, use a "generation bit" to know when a descriptor was populated
by the device. The driver initializes all bits with the "current
generation". The device will populate received descriptors with the
"next generation" which is inverted from the current generation. When
the ring wraps, the current/next generation are swapped.
- It's the driver's responsibility to ensure that the RX and TX
completion queues are not overrun. This can be accomplished by
limiting the number of descriptors posted to HW.
- TX packets have a 16 bit completion_tag and RX buffers have a 16 bit
buffer_id. These will be returned on the TX completion and RX queues
respectively to let the driver know which packet/buffer was completed.
Bitfields are used to describe descriptor fields. This notation is more
concise and readable than shift-and-mask. It is possible because the
driver is restricted to little endian platforms.
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Unlike GQI, DQO RX descriptors do not contain the L3 and L4 type of the
packet. L3 and L4 types are necessary in order to set the hash and csum
on RX SKBs correctly.
DQO RX descriptors instead contain a 10 bit PTYPE index. The PTYPE map
enables the device to tell the driver how to map from PTYPE index to
L3/L4 type.
The device doesn't provide any guarantees about the range of possible
PTYPEs, so we just use a 1024 entry array to implement a fast mapping
structure.
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
- In addition to TX and RX queues, DQO has TX completion and RX buffer
queues.
- TX completions are received when the device has completed sending a
packet on the wire.
- RX buffers are posted on a separate queue form the RX completions.
- DQO descriptor rings are allowed to be smaller than PAGE_SIZE.
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The currently supported queue formats are:
- GQI_RDA - GQI with raw DMA addressing
- GQI_QPL - GQI with queue page list
- DQO_RDA - DQO with raw DMA addressing
The old `gve_priv.raw_addressing` value is only used for GQI_RDA, so we
remove it in favor of just checking against GQI_RDA
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The current model uses an integer ID and a fixed size struct for the
parameters of each device option.
The new model allows the device option structs to grow in size over
time. A driver may assume that changes to device option structs will
always be appended.
New device options will also generally have a
`supported_features_mask` so that the driver knows which fields within a
particular device option are enabled.
`gve_device_option.feat_mask` is changed to `required_features_mask`,
and it is a bitmask which must match the value expected by the driver.
This gives the device the ability to break backwards compatibility with
old drivers for certain features by blocking the old drivers from trying
to use the feature.
We maintain ABI compatibility with the old model for
GVE_DEV_OPT_ID_RAW_ADDRESSING in case a driver is using a device which
does not support the new model.
This patch introduces some new terminology:
RDA - Raw DMA Addressing - Buffers associated with SKBs are directly DMA
mapped and read/updated by the device.
QPL - Queue Page Lists - Driver uses bounce buffers which are DMA mapped
with the device for read/write and data is copied from/to SKBs.
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Using `page_offset` like a boolean means a page may only be split into
two sections. With page sizes larger than 4k, this can be very wasteful.
Future commits in this patchset use `struct gve_rx_slot_page_info` in a
way which supports a fixed buffer size and a variable page size.
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Future use cases will have a different padding value.
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
These functions will be shared by the GQI and DQO variants of the GVNIC
driver as of follow-up patches in this series.
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The key length used to store the macsec key was set to MACSEC_KEYID_LEN
(16), which is an issue as:
- This was never meant to be the key length.
- The key length can be > 16.
Fix this by using MACSEC_MAX_KEY_LEN instead (the max length accepted in
uAPI).
Fixes: 27736563ce32 ("net: atlantic: MACSec egress offload implementation")
Fixes: 9ff40a751a6f ("net: atlantic: MACSec ingress offload implementation")
Reported-by: Lior Nahmanson <liorna@nvidia.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This adds statistic counters for the network interfaces provided
by the driver. It also adds CPU port counters (which are not
exposed by ethtool).
This also adds support for configuring the network interface
parameters via ethtool: speed, duplex, aneg etc.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This configures the Sparx5 calendars according to the bandwidth
requested in the Device Tree nodes.
It also checks if the total requested bandwidth is within the
specs of the detected Sparx5 models limits.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This adds SwitchDev support by hardware offloading the
software bridge.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This adds Sparx5 VLAN support.
Sparx5 has more VLAN features than provided here, but these will be added
in later series. For now we only add the basic L2 features.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|