Age | Commit message (Collapse) | Author |
|
Add test cases to check that unicast ARP/NS packets are replied once, even
if ARP/ND suppression is enabled.
Without the previous patch:
$ ./test_bridge_neigh_suppress.sh
...
Unicast ARP, per-port ARP suppression - VLAN 10
-----------------------------------------------
TEST: "neigh_suppress" is on [ OK ]
TEST: Unicast ARP, suppression on, h1 filter [FAIL]
TEST: Unicast ARP, suppression on, h2 filter [ OK ]
Unicast ARP, per-port ARP suppression - VLAN 20
-----------------------------------------------
TEST: "neigh_suppress" is on [ OK ]
TEST: Unicast ARP, suppression on, h1 filter [FAIL]
TEST: Unicast ARP, suppression on, h2 filter [ OK ]
...
Unicast NS, per-port NS suppression - VLAN 10
---------------------------------------------
TEST: "neigh_suppress" is on [ OK ]
TEST: Unicast NS, suppression on, h1 filter [FAIL]
TEST: Unicast NS, suppression on, h2 filter [ OK ]
Unicast NS, per-port NS suppression - VLAN 20
---------------------------------------------
TEST: "neigh_suppress" is on [ OK ]
TEST: Unicast NS, suppression on, h1 filter [FAIL]
TEST: Unicast NS, suppression on, h2 filter [ OK ]
...
Tests passed: 156
Tests failed: 4
With the previous patch:
$ ./test_bridge_neigh_suppress.sh
...
Unicast ARP, per-port ARP suppression - VLAN 10
-----------------------------------------------
TEST: "neigh_suppress" is on [ OK ]
TEST: Unicast ARP, suppression on, h1 filter [ OK ]
TEST: Unicast ARP, suppression on, h2 filter [ OK ]
Unicast ARP, per-port ARP suppression - VLAN 20
-----------------------------------------------
TEST: "neigh_suppress" is on [ OK ]
TEST: Unicast ARP, suppression on, h1 filter [ OK ]
TEST: Unicast ARP, suppression on, h2 filter [ OK ]
...
Unicast NS, per-port NS suppression - VLAN 10
---------------------------------------------
TEST: "neigh_suppress" is on [ OK ]
TEST: Unicast NS, suppression on, h1 filter [ OK ]
TEST: Unicast NS, suppression on, h2 filter [ OK ]
Unicast NS, per-port NS suppression - VLAN 20
---------------------------------------------
TEST: "neigh_suppress" is on [ OK ]
TEST: Unicast NS, suppression on, h1 filter [ OK ]
TEST: Unicast NS, suppression on, h2 filter [ OK ]
...
Tests passed: 160
Tests failed: 0
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/dc240b9649b31278295189f412223f320432c5f2.1744123493.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When Proxy ARP or ARP/ND suppression are enabled, ARP/NS packets can be
handled by bridge in br_do_proxy_suppress_arp()/br_do_suppress_nd().
For broadcast packets, they are replied by bridge, but later they are not
flooded. Currently, unicast packets are replied by bridge when suppression
is enabled, and they are also forwarded, which results two replicas of
ARP reply/NA - one from the bridge and second from the target.
RFC 1122 describes use case for unicat ARP packets - "unicast poll" -
actively poll the remote host by periodically sending a point-to-point ARP
request to it, and delete the entry if no ARP reply is received from N
successive polls.
The purpose of ARP/ND suppression is to reduce flooding in the broadcast
domain. If a host is sending a unicast ARP/NS, then it means it already
knows the address and the switches probably know it as well and there
will not be any flooding.
In addition, the use case of unicast ARP/NS is to poll a specific host,
so it does not make sense to have the switch answer on behalf of the host.
According to RFC 9161:
"A PE SHOULD reply to broadcast/multicast address resolution messages,
i.e., ARP Requests, ARP probes, NS messages, as well as DAD NS messages.
An ARP probe is an ARP Request constructed with an all-zero sender IP
address that may be used by hosts for IPv4 Address Conflict Detection as
specified in [RFC5227]. A PE SHOULD NOT reply to unicast address resolution
requests (for instance, NUD NS messages)."
Forward such requests and prevent the bridge from replying to them.
Reported-by: Denis Yulevych <denisyu@nvidia.com>
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/6bf745a149ddfe5e6be8da684a63aa574a326f8d.1744123493.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
txq_trans_update() currently uses txq->xmit_lock_owner
to conditionally update txq->trans_start.
For regular devices, txq->xmit_lock_owner is updated
from HARD_TX_LOCK() and HARD_TX_UNLOCK(), and this apparently
causes cpu stalls.
Using dev->lltx, which sits in a read-mostly cache-line,
and already used in HARD_TX_LOCK() and HARD_TX_UNLOCK()
helps cpu prediction.
On an AMD EPYC 7B12 dual socket server, tcp_rr with 128 threads
and 30,000 flows gets a 5 % increase in throughput.
As explained in commit 95ecba62e2fd ("net: fix races in
netdev_tx_sent_queue()/dev_watchdog()") I am planning
to no longer update txq->trans_start in the fast path
in a followup patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250408202742.2145516-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The cn10k_free_matchall_ipolicer() calls the cn10k_map_unmap_rq_policer()
for each queue in a for loop without checking for any errors.
Check the return value of the cn10k_map_unmap_rq_policer() function during
each loop, and report a warning if the function fails.
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250408032602.2909-1-vulab@iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Recent change [0] resulted in a "BUG: using __this_cpu_read() in
preemptible" splat [1]. PREEMPT kernels have additional requirements
on what can and can not run with/without preemption enabled.
Expose those constrains in the debug kernels.
0: https://lore.kernel.org/netdev/20250314120048.12569-2-justin.iurman@uliege.be/
1: https://lore.kernel.org/netdev/20250402094458.006ba2a7@kernel.org/T/#mbf72641e9d7d274daee9003ef5edf6833201f1bc
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250402172305.1775226-1-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The __get_unaligned_cpu32 function is deprecated. So, replace it with
the more generic get_unaligned and just cast the input parameter.
Signed-off-by: Julian Vetter <julian@outer-limits.org>
Link: https://patch.msgid.link/20250408091946.2266271-1-julian@outer-limits.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The __get_unaligned_cpu32 function is deprecated. So, replace it with
the more generic get_unaligned and just cast the input parameter.
Signed-off-by: Julian Vetter <julian@outer-limits.org>
Link: https://patch.msgid.link/20250408091548.2263911-1-julian@outer-limits.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Jakub Kicinski says:
====================
net: depend on instance lock for queue related netlink ops
netdev-genl used to be protected by rtnl_lock. In previous release
we already switched the queue management ops (for Rx zero-copy) to
the instance lock. This series converts other ops to depend on the
instance lock when possible.
Unfortunately queue related state is hard to lock (unlike NAPI)
as the process of switching the number of queues usually involves
a large reconfiguration of the driver. The reconfig process has
historically been under rtnl_lock, but for drivers which opt into
ops locking it is also under the instance lock. Leverage that
and conditionally take rtnl_lock or instance lock depending
on the device capabilities.
v1: https://lore.kernel.org/20250407190117.16528-1-kuba@kernel.org
====================
Link: https://patch.msgid.link/20250408195956.412733-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We mostly needed rtnl_lock in qstat to make sure the queue count
is stable while we work. For "ops locked" drivers the instance
lock protects the queue count, so we don't have to take rtnl_lock.
For currently ops-locked drivers: netdevsim and bnxt need
the protection from netdev going down while we dump, which
instance lock provides. gve doesn't care.
Reviewed-by: Joe Damato <jdamato@fastly.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250408195956.412733-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Explicitly list all the ops structs and what locking they provide.
Use "ops locked" as a term for drivers which have ops called under
the instance lock.
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250408195956.412733-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Writes to XDP features are now protected by netdev->lock.
Other things we report are based on ops which don't change
once device has been registered. It is safe to stop taking
rtnl_lock, and depend on netdev->lock instead.
Reviewed-by: Joe Damato <jdamato@fastly.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250408195956.412733-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Protect xdp_features with netdev->lock. This way pure readers
no longer have to take rtnl_lock to access the field.
This includes calling NETDEV_XDP_FEAT_CHANGE under the lock.
Looks like that's fine for bonding, the only "real" listener,
it's the same as ethtool feature change.
In terms of normal drivers - only GVE need special consideration
(other drivers don't use instance lock or don't support XDP).
It calls xdp_set_features_flag() helper from gve_init_priv() which
in turn is called from gve_reset_recovery() (locked), or prior
to netdev registration. So switch to _locked.
Reviewed-by: Joe Damato <jdamato@fastly.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Acked-by: Harshitha Ramamurthy <hramamurthy@google.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250408195956.412733-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Netdev queue dump accesses: NAPI, memory providers, XSk pointers.
All three are "ops protected" now, switch to the op compat locking.
rtnl lock does not have to be taken for "ops locked" devices.
Reviewed-by: Joe Damato <jdamato@fastly.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250408195956.412733-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add helpers to "lock a netdev in a backward-compatible way",
which for ops-locked netdevs will mean take the instance lock.
For drivers which haven't opted into the ops locking we'll take
rtnl_lock.
The scoped foreach is dropping and re-taking the lock for each
device, even if prev and next are both under rtnl_lock.
I hope that's fine since we expect that netdev nl to be mostly
supported by modern drivers, and modern drivers should also
opt into the instance locking.
Note that these helpers are mostly needed for queue related state,
because drivers modify queue config in their ops in a non-atomic
way. Or differently put, queue changes don't have a clear-cut API
like NAPI configuration. Any state that can should just use the
instance lock directly, not the "compat" hacks.
Reviewed-by: Joe Damato <jdamato@fastly.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250408195956.412733-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Read accesses go via xsk_get_pool_from_qid(), the call coming
from the core and gve look safe (other "ops locked" drivers
don't support XSK).
Write accesses go via xsk_reg_pool_at_qid() and xsk_clear_pool_at_qid().
Former is already under the ops lock, latter is not (both coming from
the workqueue via xp_clear_dev() and NETDEV_UNREGISTER via xsk_notifier()).
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250408195956.412733-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
netdev_get_by_index_lock() performs following steps:
rcu_lock();
dev = lookup(netns, ifindex);
dev_get(dev);
rcu_unlock();
[... lock & validate the dev ...]
return dev
Validation right now only checks if the device is registered but since
the lookup is netns-aware we must also protect against the device
switching netns right after we dropped the RCU lock. Otherwise
the caller in netns1 may get a pointer to a device which has just
switched to netns2.
We can't hold the lock for the entire netns change process (because of
the NETDEV_UNREGISTER notifier), and there's no existing marking to
indicate that the netns is unlisted because of netns move, so add one.
AFAIU none of the existing netdev_get_by_index_lock() callers can
suffer from this problem (NAPI code double checks the netns membership
and other callers are either under rtnl_lock or not ns-sensitive),
so this patch does not have to be treated as a fix.
Reviewed-by: Joe Damato <jdamato@fastly.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250408195956.412733-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
__skb_try_recv_from_queue() deals with a queue, @sk is not used
since commit e427cad6eee4 ("net: datagram: drop 'destructor'
argument from several helpers"). Remove sk from function parameters,
adapt callers.
No functional change intended.
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250407-cleanup-drop-param-sk-v1-1-cd076979afac@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Paolo Abeni says:
====================
udp_tunnel: GRO optimizations
The UDP tunnel GRO stage is source of measurable overhead for workload
based on UDP-encapsulated traffic: each incoming packets requires a full
UDP socket lookup and an indirect call.
In the most common setups a single UDP tunnel device is used. In such
case we can optimize both the lookup and the indirect call.
Patch 1 tracks per netns the active UDP tunnels and replaces the socket
lookup with a single destination port comparison when possible.
Patch 2 tracks the different types of UDP tunnels and replaces the
indirect call with a static one when there is a single UDP tunnel type
active.
I measure ~10% performance improvement in TCP over UDP tunnel stream
tests on top of this series.
v4: https://lore.kernel.org/cover.1741718157.git.pabeni@redhat.com
v3: https://lore.kernel.org/cover.1741632298.git.pabeni@redhat.com
v2: https://lore.kernel.org/cover.1741338765.git.pabeni@redhat.com
v1: https://lore.kernel.org/cover.1741275846.git.pabeni@redhat.com
====================
Link: https://patch.msgid.link/cover.1744040675.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
It's quite common to have a single UDP tunnel type active in the
whole system. In such a case we can replace the indirect call for
the UDP tunnel GRO callback with a static call.
Add the related accounting in the control path and switch to static
call when possible. To keep the code simple use a static array for
the registered tunnel types, and size such array based on the kernel
config.
Note that there are valid kernel configurations leading to
UDP_MAX_TUNNEL_TYPES == 0 even with IS_ENABLED(CONFIG_NET_UDP_TUNNEL),
Explicitly skip the accounting in such a case, to avoid compile warning
when accessing "udp_tunnel_gro_types".
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/53d156cdfddcc9678449e873cc83e68fa1582653.1744040675.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Most UDP tunnels bind a socket to a local port, with ANY address, no
peer and no interface index specified.
Additionally it's quite common to have a single tunnel device per
namespace.
Track in each namespace the UDP tunnel socket respecting the above.
When only a single one is present, store a reference in the netns.
When such reference is not NULL, UDP tunnel GRO lookup just need to
match the incoming packet destination port vs the socket local port.
The tunnel socket never sets the reuse[port] flag[s]. When bound to no
address and interface, no other socket can exist in the same netns
matching the specified local port.
Matching packets with non-local destination addresses will be
aggregated, and eventually segmented as needed - no behavior changes
intended.
Restrict the optimization to kernel sockets only: it covers all the
relevant use-cases, and user-space owned sockets could be disconnected
and rebound after setup_udp_tunnel_sock(), breaking the uniqueness
assumption
Note that the UDP tunnel socket reference is stored into struct
netns_ipv4 for both IPv4 and IPv6 tunnels. That is intentional to keep
all the fastpath-related netns fields in the same struct and allow
cacheline-based optimization. Currently both the IPv4 and IPv6 socket
pointer share the same cacheline as the `udp_table` field.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/41d16bc8d1257d567f9344c445b4ae0b4a91ede4.1744040675.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Recently we had some issues in parallel TDC where some of IFE tests are
failing due to some of IFE's submodules (like act_meta_skbtcindex and
act_meta_skbprio) taking too long to load [1]. To avoid that issue,
pre-load IFE and all its submodules before running any of the tests in
tdc.sh
[1] https://lore.kernel.org/netdev/e909b2a0-244e-4141-9fa9-1b7d96ab7d71@mojatatu.com/T/#u
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20250407215656.2535990-1-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Let's pass the queue index to netif_napi_add_config() to preserve
per-NAPI config.
Test:
Set 100 to defer-hard-irqs (default is 0) and check the value after
link down & up.
$ cat /sys/class/net/enp39s0/napi_defer_hard_irqs
0
$ ./tools/net/ynl/pyynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 2}'
[{'defer-hard-irqs': 0,
'gro-flush-timeout': 0,
'id': 65,
'ifindex': 2,
'irq': 29,
'irq-suspend-timeout': 0}]
$ sudo ./tools/net/ynl/pyynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--do napi-set --json='{"id": 65, "defer-hard-irqs": 100}'
$ sudo ip link set enp39s0 down && sudo ip link set enp39s0 up
Without patch:
$ ./tools/net/ynl/pyynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 2}'
[{'defer-hard-irqs': 0, <------------------- Reset to 0
'gro-flush-timeout': 0,
'id': 66, <------------------------------- New ID
'ifindex': 2,
'irq': 29,
'irq-suspend-timeout': 0}]
With patch:
$ ./tools/net/ynl/pyynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 2}'
[{'defer-hard-irqs': 100, <--------------+-- Preserved
'gro-flush-timeout': 0, |
'id': 65, <----------------------------'
'ifindex': 2,
'irq': 29,
'irq-suspend-timeout': 0}]
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Arthur Kiyanovski <akiyano@amazon.com>
Link: https://patch.msgid.link/20250407164802.25184-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Eric Dumazet says:
====================
rps: misc changes
Minor changes in rps:
skb_flow_limit() is probably unused these days,
and data-races are quite theoretical.
====================
Link: https://patch.msgid.link/20250407163602.170356-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add an rcu_head to sd_flow_limit and rps_sock_flow_table structs
to use the more conventional and predictable k[v]free_rcu().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250407163602.170356-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
softnet_seq_show() reads several fields that might be updated
concurrently. Add READ_ONCE() and WRITE_ONCE() annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250407163602.170356-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
softnet_seq_show() can read fl->count while another cpu
updates this field from skb_flow_limit().
Make this field an 'unsigned int', as its only consumer
only deals with 32 bit.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250407163602.170356-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
As explained in commit f3483c8e1da6 ("net: rfs: hash function change"),
masking low order bits of skb_get_hash(skb) has low entropy.
A NIC with 32 RX queues uses the 5 low order bits of rss key
to select a queue. This means all packets landing to a given
queue share the same 5 low order bits.
Switch to hash_32() to reduce hash collisions.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250407163602.170356-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use SPDX-License-Identifier accross all the files of the xgbe driver to
ensure compliance with Linux kernel standards, thus removing the
boiler-plate template license text.
Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com>
Acked-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250407102913.3063691-1-Raju.Rangoju@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Remove the double negation and simplify the if condition.
No functional changes intended.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250407091442.743478-1-thorsten.blum@linux.dev
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The __get_unaligned_cpu32 function is deprecated. So, replace it with
the more generic get_unaligned and just cast the input parameter.
Signed-off-by: Julian Vetter <julian@outer-limits.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250407083306.1553921-1-julian@outer-limits.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
If the destination buffer has a fixed length, strscpy_pad()
automatically determines its size using sizeof() when the argument is
omitted. This makes the explicit sizeof() calls unnecessary - remove
them.
No functional changes intended.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://patch.msgid.link/20250407082607.741919-2-thorsten.blum@linux.dev
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter.
Current release - regressions:
- four fixes for the netdev per-instance locking
Current release - new code bugs:
- consolidate more code between existing Rx zero-copy and uring so
that the latter doesn't miss / have to duplicate the safety checks
Previous releases - regressions:
- ipv6: fix omitted Netlink attributes when using SKIP_STATS
Previous releases - always broken:
- net: fix geneve_opt length integer overflow
- udp: fix multiple wrap arounds of sk->sk_rmem_alloc when it
approaches INT_MAX
- dsa: mvpp2: add a lock to avoid corruption of the shared TCAM
- dsa: airoha: fix issues with traffic QoS configuration / offload,
and flow table offload
Misc:
- touch up the Netlink YAML specs of old families to make them usable
for user space C codegen"
* tag 'net-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (56 commits)
selftests: net: amt: indicate progress in the stress test
netlink: specs: rt_route: pull the ifa- prefix out of the names
netlink: specs: rt_addr: pull the ifa- prefix out of the names
netlink: specs: rt_addr: fix get multi command name
netlink: specs: rt_addr: fix the spec format / schema failures
net: avoid false positive warnings in __net_mp_close_rxq()
net: move mp dev config validation to __net_mp_open_rxq()
net: ibmveth: make veth_pool_store stop hanging
arcnet: Add NULL check in com20020pci_probe()
ipv6: Do not consider link down nexthops in path selection
ipv6: Start path selection from the first nexthop
usbnet:fix NPE during rx_complete
net: octeontx2: Handle XDP_ABORTED and XDP invalid as XDP_DROP
net: fix geneve_opt length integer overflow
io_uring/zcrx: fix selftests w/ updated netdev Python helpers
selftests: net: use netdevsim in netns test
docs: net: document netdev notifier expectations
net: dummy: request ops lock
netdevsim: add dummy device notifiers
net: rename rtnl_net_debug to lock_debug
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A small collection of fixes that came in during the merge window,
everything is driver specific with nothing standing out particularly"
* tag 'spi-fix-v6.15-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: bcm2835: Restore native CS probing when pinctrl-bcm2835 is absent
spi: bcm2835: Do not call gpiod_put() on invalid descriptor
spi: cadence-qspi: revert "Improve spi memory performance"
spi: cadence: Fix out-of-bounds array access in cdns_mrvl_xspi_setup_clock()
spi: fsl-qspi: use devm function instead of driver remove
spi: SPI_QPIC_SNAND should be tristate and depend on MTD
spi-rockchip: Fix register out of bounds access
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull more SoC driver updates from Arnd Bergmann:
"This is the promised follow-up to the soc drivers branch, adding minor
updates to omap and freescale drivers.
Most notably, Ioana Ciornei takes over maintenance of the DPAA bus
driver used in some NXP (originally Freescale) chips"
* tag 'soc-drivers-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
bus: fsl-mc: Remove deadcode
MAINTAINERS: add the linuppc-dev list to the fsl-mc bus entry
MAINTAINERS: fix nonexistent dtbinding file name
MAINTAINERS: add myself as maintainer for the fsl-mc bus
irqdomain: soc: Switch to irq_find_mapping()
Input: tsc2007 - accept standard properties
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver fixes from Ilpo Järvinen:
- thinkpad_acpi:
- Fix NULL pointer dereferences while probing
- Disable ACPI fan access for T495* and E560
- ISST: Correct command storage data length
* tag 'platform-drivers-x86-v6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
MAINTAINERS: consistently use my dedicated email address
platform/x86: ISST: Correct command storage data length
platform/x86: thinkpad_acpi: disable ACPI fan access for T495* and E560
platform/x86: thinkpad_acpi: Fix NULL pointer dereferences while probing
|
|
Our CI expects output from the test at least once every 10 minutes.
The AMT test when running on debug kernel is just on the edge
of that time for the stress test. Improve the output:
- print the name of the test first, before starting it,
- output a dot every 10% of the way.
Output after:
TEST: amt discovery [ OK ]
TEST: IPv4 amt multicast forwarding [ OK ]
TEST: IPv6 amt multicast forwarding [ OK ]
TEST: IPv4 amt traffic forwarding torture .......... [ OK ]
TEST: IPv6 amt traffic forwarding torture .......... [ OK ]
Reviewed-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250403145636.2891166-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Jakub Kicinski says:
====================
netlink: specs: rt_addr: fix problems revealed by C codegen
I put together basic YNL C support for classic netlink. This revealed
a few problems in the rt_addr spec.
v1: https://lore.kernel.org/20250401012939.2116915-1-kuba@kernel.org
====================
Link: https://patch.msgid.link/20250403013706.2828322-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
YAML specs don't normally include the C prefix name in the name
of the YAML attr. Remove the ifa- prefix from all attributes
in route-attrs and metrics and specify name-prefix instead.
This is a bit risky, hopefully there aren't many users out there.
Fixes: 023289b4f582 ("doc/netlink: Add spec for rt route messages")
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250403013706.2828322-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
YAML specs don't normally include the C prefix name in the name
of the YAML attr. Remove the ifa- prefix from all attributes
in addr-attrs and specify name-prefix instead.
This is a bit risky, hopefully there aren't many users out there.
Fixes: dfb0f7d9d979 ("doc/netlink: Add spec for rt addr messages")
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250403013706.2828322-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Command names should match C defines, codegens may depend on it.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Fixes: 4f280376e531 ("selftests/net: Add selftest for IPv4 RTM_GETMULTICAST support")
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250403013706.2828322-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The spec is mis-formatted, schema validation says:
Failed validating 'type' in schema['properties']['operations']['properties']['list']['items']['properties']['dump']['properties']['request']['properties']['value']:
{'minimum': 0, 'type': 'integer'}
On instance['operations']['list'][3]['dump']['request']['value']:
'58 - ifa-family'
The ifa-family clearly wants to be part of an attribute list.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Yuyang Huang <yuyanghuang@google.com>
Fixes: 4f280376e531 ("selftests/net: Add selftest for IPv4 RTM_GETMULTICAST support")
Link: https://patch.msgid.link/20250403013706.2828322-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Jakub Kicinski says:
====================
net: make memory provider install / close paths more common
We seem to be fixing bugs in config path for devmem which also exist
in the io_uring ZC path. Let's try to make the two paths more common,
otherwise this is bound to keep happening.
Found by code inspection and compile tested only.
v1: https://lore.kernel.org/20250331194201.2026422-1-kuba@kernel.org
====================
Link: https://patch.msgid.link/20250403013405.2827250-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit under Fixes solved the problem of spurious warnings when we
uninstall an MP from a device while its down. The __net_mp_close_rxq()
which is used by io_uring was not fixed. Move the fix over and reuse
__net_mp_close_rxq() in the devmem path.
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Fixes: a70f891e0fa0 ("net: devmem: do not WARN conditionally after netdev_rx_queue_restart()")
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250403013405.2827250-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
devmem code performs a number of safety checks to avoid having
to reimplement all of them in the drivers. Move those to
__net_mp_open_rxq() and reuse that function for binding to make
sure that io_uring ZC also benefits from them.
While at it rename the queue ID variable to rxq_idx in
__net_mp_open_rxq(), we touch most of the relevant lines.
The XArray insertion is reordered after the netdev_rx_queue_restart()
call, otherwise we'd need to duplicate the queue index check
or risk inserting an invalid pointer. The XArray allocation
failures should be extremely rare.
Reviewed-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Fixes: 6e18ed929d3b ("net: add helpers for setting a memory provider on an rx queue")
Link: https://patch.msgid.link/20250403013405.2827250-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
v2:
- Created a single error handling unlock and exit in veth_pool_store
- Greatly expanded commit message with previous explanatory-only text
Summary: Use rtnl_mutex to synchronize veth_pool_store with itself,
ibmveth_close and ibmveth_open, preventing multiple calls in a row to
napi_disable.
Background: Two (or more) threads could call veth_pool_store through
writing to /sys/devices/vio/30000002/pool*/*. You can do this easily
with a little shell script. This causes a hang.
I configured LOCKDEP, compiled ibmveth.c with DEBUG, and built a new
kernel. I ran this test again and saw:
Setting pool0/active to 0
Setting pool1/active to 1
[ 73.911067][ T4365] ibmveth 30000002 eth0: close starting
Setting pool1/active to 1
Setting pool1/active to 0
[ 73.911367][ T4366] ibmveth 30000002 eth0: close starting
[ 73.916056][ T4365] ibmveth 30000002 eth0: close complete
[ 73.916064][ T4365] ibmveth 30000002 eth0: open starting
[ 110.808564][ T712] systemd-journald[712]: Sent WATCHDOG=1 notification.
[ 230.808495][ T712] systemd-journald[712]: Sent WATCHDOG=1 notification.
[ 243.683786][ T123] INFO: task stress.sh:4365 blocked for more than 122 seconds.
[ 243.683827][ T123] Not tainted 6.14.0-01103-g2df0c02dab82-dirty #8
[ 243.683833][ T123] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 243.683838][ T123] task:stress.sh state:D stack:28096 pid:4365 tgid:4365 ppid:4364 task_flags:0x400040 flags:0x00042000
[ 243.683852][ T123] Call Trace:
[ 243.683857][ T123] [c00000000c38f690] [0000000000000001] 0x1 (unreliable)
[ 243.683868][ T123] [c00000000c38f840] [c00000000001f908] __switch_to+0x318/0x4e0
[ 243.683878][ T123] [c00000000c38f8a0] [c000000001549a70] __schedule+0x500/0x12a0
[ 243.683888][ T123] [c00000000c38f9a0] [c00000000154a878] schedule+0x68/0x210
[ 243.683896][ T123] [c00000000c38f9d0] [c00000000154ac80] schedule_preempt_disabled+0x30/0x50
[ 243.683904][ T123] [c00000000c38fa00] [c00000000154dbb0] __mutex_lock+0x730/0x10f0
[ 243.683913][ T123] [c00000000c38fb10] [c000000001154d40] napi_enable+0x30/0x60
[ 243.683921][ T123] [c00000000c38fb40] [c000000000f4ae94] ibmveth_open+0x68/0x5dc
[ 243.683928][ T123] [c00000000c38fbe0] [c000000000f4aa20] veth_pool_store+0x220/0x270
[ 243.683936][ T123] [c00000000c38fc70] [c000000000826278] sysfs_kf_write+0x68/0xb0
[ 243.683944][ T123] [c00000000c38fcb0] [c0000000008240b8] kernfs_fop_write_iter+0x198/0x2d0
[ 243.683951][ T123] [c00000000c38fd00] [c00000000071b9ac] vfs_write+0x34c/0x650
[ 243.683958][ T123] [c00000000c38fdc0] [c00000000071bea8] ksys_write+0x88/0x150
[ 243.683966][ T123] [c00000000c38fe10] [c0000000000317f4] system_call_exception+0x124/0x340
[ 243.683973][ T123] [c00000000c38fe50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
...
[ 243.684087][ T123] Showing all locks held in the system:
[ 243.684095][ T123] 1 lock held by khungtaskd/123:
[ 243.684099][ T123] #0: c00000000278e370 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x50/0x248
[ 243.684114][ T123] 4 locks held by stress.sh/4365:
[ 243.684119][ T123] #0: c00000003a4cd3f8 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x88/0x150
[ 243.684132][ T123] #1: c000000041aea888 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x154/0x2d0
[ 243.684143][ T123] #2: c0000000366fb9a8 (kn->active#64){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x160/0x2d0
[ 243.684155][ T123] #3: c000000035ff4cb8 (&dev->lock){+.+.}-{3:3}, at: napi_enable+0x30/0x60
[ 243.684166][ T123] 5 locks held by stress.sh/4366:
[ 243.684170][ T123] #0: c00000003a4cd3f8 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x88/0x150
[ 243.684183][ T123] #1: c00000000aee2288 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x154/0x2d0
[ 243.684194][ T123] #2: c0000000366f4ba8 (kn->active#64){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x160/0x2d0
[ 243.684205][ T123] #3: c000000035ff4cb8 (&dev->lock){+.+.}-{3:3}, at: napi_disable+0x30/0x60
[ 243.684216][ T123] #4: c0000003ff9bbf18 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0x138/0x12a0
From the ibmveth debug, two threads are calling veth_pool_store, which
calls ibmveth_close and ibmveth_open. Here's the sequence:
T4365 T4366
----------------- ----------------- ---------
veth_pool_store veth_pool_store
ibmveth_close
ibmveth_close
napi_disable
napi_disable
ibmveth_open
napi_enable <- HANG
ibmveth_close calls napi_disable at the top and ibmveth_open calls
napi_enable at the top.
https://docs.kernel.org/networking/napi.html]] says
The control APIs are not idempotent. Control API calls are safe
against concurrent use of datapath APIs but an incorrect sequence of
control API calls may result in crashes, deadlocks, or race
conditions. For example, calling napi_disable() multiple times in a
row will deadlock.
In the normal open and close paths, rtnl_mutex is acquired to prevent
other callers. This is missing from veth_pool_store. Use rtnl_mutex in
veth_pool_store fixes these hangs.
Signed-off-by: Dave Marquardt <davemarq@linux.ibm.com>
Fixes: 860f242eb534 ("[PATCH] ibmveth change buffer pools dynamically")
Reviewed-by: Nick Child <nnac123@linux.ibm.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250402154403.386744-1-davemarq@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
devm_kasprintf() returns NULL when memory allocation fails. Currently,
com20020pci_probe() does not check for this case, which results in a
NULL pointer dereference.
Add NULL check after devm_kasprintf() to prevent this issue and ensure
no resources are left allocated.
Fixes: 6b17a597fc2f ("arcnet: restoring support for multiple Sohard Arcnet cards")
Signed-off-by: Henry Martin <bsdhenrymartin@gmail.com>
Link: https://patch.msgid.link/20250402135036.44697-1-bsdhenrymartin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Ido Schimmel says:
====================
ipv6: Multipath routing fixes
This patchset contains two fixes for IPv6 multipath routing. See the
commit messages for more details.
====================
Link: https://patch.msgid.link/20250402114224.293392-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Nexthops whose link is down are not supposed to be considered during
path selection when the "ignore_routes_with_linkdown" sysctl is set.
This is done by assigning them a negative region boundary.
However, when comparing the computed hash (unsigned) with the region
boundary (signed), the negative region boundary is treated as unsigned,
resulting in incorrect nexthop selection.
Fix by treating the computed hash as signed. Note that the computed hash
is always in range of [0, 2^31 - 1].
Fixes: 3d709f69a3e7 ("ipv6: Use hash-threshold instead of modulo-N")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250402114224.293392-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cited commit transitioned IPv6 path selection to use hash-threshold
instead of modulo-N. With hash-threshold, each nexthop is assigned a
region boundary in the multipath hash function's output space and a
nexthop is chosen if the calculated hash is smaller than the nexthop's
region boundary.
Hash-threshold does not work correctly if path selection does not start
with the first nexthop. For example, if fib6_select_path() is always
passed the last nexthop in the group, then it will always be chosen
because its region boundary covers the entire hash function's output
space.
Fix this by starting the selection process from the first nexthop and do
not consider nexthops for which rt6_score_route() provided a negative
score.
Fixes: 3d709f69a3e7 ("ipv6: Use hash-threshold instead of modulo-N")
Reported-by: Stanislav Fomichev <stfomichev@gmail.com>
Closes: https://lore.kernel.org/netdev/Z9RIyKZDNoka53EO@mini-arch/
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250402114224.293392-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Missing usbnet_going_away Check in Critical Path.
The usb_submit_urb function lacks a usbnet_going_away
validation, whereas __usbnet_queue_skb includes this check.
This inconsistency creates a race condition where:
A URB request may succeed, but the corresponding SKB data
fails to be queued.
Subsequent processes:
(e.g., rx_complete → defer_bh → __skb_unlink(skb, list))
attempt to access skb->next, triggering a NULL pointer
dereference (Kernel Panic).
Fixes: 04e906839a05 ("usbnet: fix cyclical race on disconnect with work queue")
Cc: stable@vger.kernel.org
Signed-off-by: Ying Lu <luying1@xiaomi.com>
Link: https://patch.msgid.link/4c9ef2efaa07eb7f9a5042b74348a67e5a3a7aea.1743584159.git.luying1@xiaomi.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|