summaryrefslogtreecommitdiff
path: root/net/core
AgeCommit message (Collapse)Author
2025-07-07netpoll: factor out IPv6 header setup into push_ipv6() helperBreno Leitao
Move IPv6 header construction from netpoll_send_udp() into a new static helper function, push_ipv6(). This refactoring reduces code duplication and improves readability in netpoll_send_udp(). Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250702-netpoll_untagle_ip-v2-3-13cf3db24e2b@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07netpoll: factor out UDP checksum calculation into helperBreno Leitao
Extract UDP checksum calculation logic from netpoll_send_udp() into a new static helper function netpoll_udp_checksum(). This reduces code duplication and improves readability for both IPv4 and IPv6 cases. No functional change intended. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250702-netpoll_untagle_ip-v2-2-13cf3db24e2b@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07netpoll: Improve code clarity with explicit struct size calculationsBreno Leitao
Replace pointer-dereference sizeof() operations with explicit struct names for improved readability and maintainability. This change: 1. Replaces `sizeof(*udph)` with `sizeof(struct udphdr)` 2. Replaces `sizeof(*ip6h)` with `sizeof(struct ipv6hdr)` 3. Replaces `sizeof(*iph)` with `sizeof(struct iphdr)` This will make it easy to move code in the upcoming patches. No functional changes are introduced by this patch. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250702-netpoll_untagle_ip-v2-1-13cf3db24e2b@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07net: remove RTNL use for /proc/sys/net/core/rps_default_maskEric Dumazet
Use a dedicated mutex instead. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250702061558.1585870-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07page_pool: rename __page_pool_alloc_pages_slow() to ↵Byungchul Park
__page_pool_alloc_netmems_slow() Now that __page_pool_alloc_pages_slow() is for allocating netmem, not struct page, rename it to __page_pool_alloc_netmems_slow() to reflect what it does. Signed-off-by: Byungchul Park <byungchul@sk.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20250702053256.4594-4-byungchul@sk.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07page_pool: rename __page_pool_release_page_dma() to ↵Byungchul Park
__page_pool_release_netmem_dma() Now that __page_pool_release_page_dma() is for releasing netmem, not struct page, rename it to __page_pool_release_netmem_dma() to reflect what it does. Signed-off-by: Byungchul Park <byungchul@sk.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Link: https://patch.msgid.link/20250702053256.4594-3-byungchul@sk.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07page_pool: rename page_pool_return_page() to page_pool_return_netmem()Byungchul Park
Now that page_pool_return_page() is for returning netmem, not struct page, rename it to page_pool_return_netmem() to reflect what it does. Signed-off-by: Byungchul Park <byungchul@sk.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Link: https://patch.msgid.link/20250702053256.4594-2-byungchul@sk.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-04af_unix: enable handing out pidfds for reaped tasks in SCM_PIDFDAlexander Mikhalitsyn
Now everything is ready to pass PIDFD_STALE to pidfd_prepare(). This will allow opening pidfd for reaped tasks. Cc: linux-kernel@vger.kernel.org Cc: netdev@vger.kernel.org Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Cc: Willem de Bruijn <willemb@google.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christian Brauner <brauner@kernel.org> Cc: Kuniyuki Iwashima <kuniyu@google.com> Cc: Lennart Poettering <mzxreary@0pointer.de> Cc: Luca Boccassi <bluca@debian.org> Cc: David Rheinsberg <david@readahead.eu> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Link: https://lore.kernel.org/20250703222314.309967-7-aleksandr.mikhalitsyn@canonical.com Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-04af_unix: stash pidfs dentry when neededAlexander Mikhalitsyn
We need to ensure that pidfs dentry is allocated when we meet any struct pid for the first time. This will allows us to open pidfd even after the task it corresponds to is reaped. Basically, we need to identify all places where we fill skb/scm_cookie with struct pid reference for the first time and call pidfs_register_pid(). Tricky thing here is that we have a few places where this happends depending on what userspace is doing: - [__scm_replace_pid()] explicitly sending an SCM_CREDENTIALS message and specified pid in a numeric format - [unix_maybe_add_creds()] enabled SO_PASSCRED/SO_PASSPIDFD but didn't send SCM_CREDENTIALS explicitly - [scm_send()] force_creds is true. Netlink case, we don't need to touch it. Cc: linux-kernel@vger.kernel.org Cc: netdev@vger.kernel.org Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Cc: Leon Romanovsky <leon@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christian Brauner <brauner@kernel.org> Cc: Kuniyuki Iwashima <kuniyu@google.com> Cc: Lennart Poettering <mzxreary@0pointer.de> Cc: Luca Boccassi <bluca@debian.org> Cc: David Rheinsberg <david@readahead.eu> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Link: https://lore.kernel.org/20250703222314.309967-6-aleksandr.mikhalitsyn@canonical.com Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-04af_unix: introduce and use scm_replace_pid() helperAlexander Mikhalitsyn
Existing logic in __scm_send() related to filling an struct scm_cookie with a proper struct pid reference is already pretty tricky. Let's simplify it a bit by introducing a new helper. This helper will be extended in one of the next patches. Cc: linux-kernel@vger.kernel.org Cc: netdev@vger.kernel.org Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Cc: Willem de Bruijn <willemb@google.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christian Brauner <brauner@kernel.org> Cc: Kuniyuki Iwashima <kuniyu@google.com> Cc: Lennart Poettering <mzxreary@0pointer.de> Cc: Luca Boccassi <bluca@debian.org> Cc: David Rheinsberg <david@readahead.eu> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Link: https://lore.kernel.org/20250703222314.309967-4-aleksandr.mikhalitsyn@canonical.com Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-02net: preserve MSG_ZEROCOPY with forwardingWillem de Bruijn
MSG_ZEROCOPY data must be copied before data is queued to local sockets, to avoid indefinite timeout until memory release. This test is performed by skb_orphan_frags_rx, which is called when looping an egress skb to packet sockets, error queue or ingress path. To preserve zerocopy for skbs that are looped to ingress but are then forwarded to an egress device rather than delivered locally, defer this last check until an skb enters the local IP receive path. This is analogous to existing behavior of skb_clear_delivery_time. Signed-off-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250630194312.1571410-2-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: dst: add four helpers to annotate data-races around dst->devEric Dumazet
dst->dev is read locklessly in many contexts, and written in dst_dev_put(). Fixing all the races is going to need many changes. We probably will have to add full RCU protection. Add three helpers to ease this painful process. static inline struct net_device *dst_dev(const struct dst_entry *dst) { return READ_ONCE(dst->dev); } static inline struct net_device *skb_dst_dev(const struct sk_buff *skb) { return dst_dev(skb_dst(skb)); } static inline struct net *skb_dst_dev_net(const struct sk_buff *skb) { return dev_net(skb_dst_dev(skb)); } static inline struct net *skb_dst_dev_net_rcu(const struct sk_buff *skb) { return dev_net_rcu(skb_dst_dev(skb)); } Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: dst: annotate data-races around dst->outputEric Dumazet
dst_dev_put() can overwrite dst->output while other cpus might read this field (for instance from dst_output()) Add READ_ONCE()/WRITE_ONCE() annotations to suppress potential issues. We will likely need RCU protection in the future. Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: dst: annotate data-races around dst->inputEric Dumazet
dst_dev_put() can overwrite dst->input while other cpus might read this field (for instance from dst_input()) Add READ_ONCE()/WRITE_ONCE() annotations to suppress potential issues. We will likely need full RCU protection later. Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: dst: annotate data-races around dst->lastuseEric Dumazet
(dst_entry)->lastuse is read and written locklessly, add corresponding annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: dst: annotate data-races around dst->obsoleteEric Dumazet
(dst_entry)->obsolete is read locklessly, add corresponding annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02tcp: move tcp_memory_allocated into net_aligned_dataEric Dumazet
____cacheline_aligned_in_smp attribute only makes sure to align a field to a cache line. It does not prevent the linker to use the remaining of the cache line for other variables, causing potential false sharing. Move tcp_memory_allocated into a dedicated cache line. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250630093540.3052835-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: move net_cookie into net_aligned_dataEric Dumazet
Using per-cpu data for net->net_cookie generation is overkill, because even busy hosts do not create hundreds of netns per second. Make sure to put net_cookie in a private cache line to avoid potential false sharing. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250630093540.3052835-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: add struct net_aligned_dataEric Dumazet
This structure will hold networking data that must consume a full cache line to avoid accidental false sharing. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250630093540.3052835-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01net: ieee8021q: fix insufficient table-size assertionRubenKelevra
_Static_assert(ARRAY_SIZE(map) != IEEE8021Q_TT_MAX - 1) rejects only a length of 7 and allows any other mismatch. Replace it with a strict equality test via a helper macro so that every mapping table must have exactly IEEE8021Q_TT_MAX (8) entries. Signed-off-by: RubenKelevra <rubenkelevra@gmail.com> Link: https://patch.msgid.link/20250626205907.1566384-1-rubenkelevra@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-06-30net: net->nsid_lock does not need BH safetyEric Dumazet
At the time of commit bc51dddf98c9 ("netns: avoid disabling irq for netns id") peernet2id() was not yet using RCU. Commit 2dce224f469f ("netns: protect netns ID lookups with RCU") changed peernet2id() to no longer acquire net->nsid_lock (potentially from BH context). We do not need to block soft interrupts when acquiring net->nsid_lock anymore. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Guillaume Nault <gnault@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250627163242.230866-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-30neighbor: Add NTF_EXT_VALIDATED flag for externally validated entriesIdo Schimmel
tl;dr ===== Add a new neighbor flag ("extern_valid") that can be used to indicate to the kernel that a neighbor entry was learned and determined to be valid externally. The kernel will not try to remove or invalidate such an entry, leaving these decisions to the user space control plane. This is needed for EVPN multi-homing where a neighbor entry for a multi-homed host needs to be synced across all the VTEPs among which the host is multi-homed. Background ========== In a typical EVPN multi-homing setup each host is multi-homed using a set of links called ES (Ethernet Segment, i.e., LAG) to multiple leaf switches (VTEPs). VTEPs that are connected to the same ES are called ES peers. When a neighbor entry is learned on a VTEP, it is distributed to both ES peers and remote VTEPs using EVPN MAC/IP advertisement routes. ES peers use the neighbor entry when routing traffic towards the multi-homed host and remote VTEPs use it for ARP/NS suppression. Motivation ========== If the ES link between a host and the VTEP on which the neighbor entry was locally learned goes down, the EVPN MAC/IP advertisement route will be withdrawn and the neighbor entries will be removed from both ES peers and remote VTEPs. Routing towards the multi-homed host and ARP/NS suppression can fail until another ES peer locally learns the neighbor entry and distributes it via an EVPN MAC/IP advertisement route. "draft-rbickhart-evpn-ip-mac-proxy-adv-03" [1] suggests avoiding these intermittent failures by having the ES peers install the neighbor entries as before, but also injecting EVPN MAC/IP advertisement routes with a proxy indication. When the previously mentioned ES link goes down and the original EVPN MAC/IP advertisement route is withdrawn, the ES peers will not withdraw their neighbor entries, but instead start aging timers for the proxy indication. If an ES peer locally learns the neighbor entry (i.e., it becomes "reachable"), it will restart its aging timer for the entry and emit an EVPN MAC/IP advertisement route without a proxy indication. An ES peer will stop its aging timer for the proxy indication if it observes the removal of the proxy indication from at least one of the ES peers advertising the entry. In the event that the aging timer for the proxy indication expired, an ES peer will withdraw its EVPN MAC/IP advertisement route. If the timer expired on all ES peers and they all withdrew their proxy advertisements, the neighbor entry will be completely removed from the EVPN fabric. Implementation ============== In the above scheme, when the control plane (e.g., FRR) advertises a neighbor entry with a proxy indication, it expects the corresponding entry in the data plane (i.e., the kernel) to remain valid and not be removed due to garbage collection or loss of carrier. The control plane also expects the kernel to notify it if the entry was learned locally (i.e., became "reachable") so that it will remove the proxy indication from the EVPN MAC/IP advertisement route. That is why these entries cannot be programmed with dummy states such as "permanent" or "noarp". Instead, add a new neighbor flag ("extern_valid") which indicates that the entry was learned and determined to be valid externally and should not be removed or invalidated by the kernel. The kernel can probe the entry and notify user space when it becomes "reachable" (it is initially installed as "stale"). However, if the kernel does not receive a confirmation, have it return the entry to the "stale" state instead of the "failed" state. In other words, an entry marked with the "extern_valid" flag behaves like any other dynamically learned entry other than the fact that the kernel cannot remove or invalidate it. One can argue that the "extern_valid" flag should not prevent garbage collection and that instead a neighbor entry should be programmed with both the "extern_valid" and "extern_learn" flags. There are two reasons for not doing that: 1. Unclear why a control plane would like to program an entry that the kernel cannot invalidate but can completely remove. 2. The "extern_learn" flag is used by FRR for neighbor entries learned on remote VTEPs (for ARP/NS suppression) whereas here we are concerned with local entries. This distinction is currently irrelevant for the kernel, but might be relevant in the future. Given that the flag only makes sense when the neighbor has a valid state, reject attempts to add a neighbor with an invalid state and with this flag set. For example: # ip neigh add 192.0.2.1 nud none dev br0.10 extern_valid Error: Cannot create externally validated neighbor with an invalid state. # ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid # ip neigh replace 192.0.2.1 nud failed dev br0.10 extern_valid Error: Cannot mark neighbor as externally validated with an invalid state. The above means that a neighbor cannot be created with the "extern_valid" flag and flags such as "use" or "managed" as they result in a neighbor being created with an invalid state ("none") and immediately getting probed: # ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use Error: Cannot create externally validated neighbor with an invalid state. However, these flags can be used together with "extern_valid" after the neighbor was created with a valid state: # ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid # ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use One consequence of preventing the kernel from invalidating a neighbor entry is that by default it will only try to determine reachability using unicast probes. This can be changed using the "mcast_resolicit" sysctl: # sysctl net.ipv4.neigh.br0/10.mcast_resolicit 0 # tcpdump -nn -e -i br0.10 -Q out arp & # ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 # sysctl -wq net.ipv4.neigh.br0/10.mcast_resolicit=3 # ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 iproute2 patches can be found here [2]. [1] https://datatracker.ietf.org/doc/html/draft-rbickhart-evpn-ip-mac-proxy-adv-03 [2] https://github.com/idosch/iproute2/tree/submit/extern_valid_v1 Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://patch.msgid.link/20250626073111.244534-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-26Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2025-06-27 We've added 6 non-merge commits during the last 8 day(s) which contain a total of 6 files changed, 120 insertions(+), 20 deletions(-). The main changes are: 1) Fix RCU usage in task_cls_state() for BPF programs using helpers like bpf_get_cgroup_classid_curr() outside of networking, from Charalampos Mitrodimas. 2) Fix a sockmap race between map_update and a pending workqueue from an earlier map_delete freeing the old psock where both pointed to the same psock->sk, from Jiayuan Chen. 3) Fix a data corruption issue when using bpf_msg_pop_data() in kTLS which failed to recalculate the ciphertext length, also from Jiayuan Chen. 4) Remove xdp_redirect_map{,_err} trace events since they are unused and also hide XDP trace events under CONFIG_BPF_SYSCALL, from Steven Rostedt. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: xdp: tracing: Hide some xdp events under CONFIG_BPF_SYSCALL xdp: Remove unused events xdp_redirect_map and xdp_redirect_map_err net, bpf: Fix RCU usage in task_cls_state() for BPF programs selftests/bpf: Add test to cover ktls with bpf_msg_pop_data bpf, ktls: Fix data corruption when using bpf_msg_pop_data() in ktls bpf, sockmap: Fix psock incorrectly pointing to sk ==================== Link: https://patch.msgid.link/20250626230111.24772-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.16-rc4). Conflicts: Documentation/netlink/specs/mptcp_pm.yaml 9e6dd4c256d0 ("netlink: specs: mptcp: replace underscores with dashes in names") ec362192aa9e ("netlink: specs: fix up indentation errors") https://lore.kernel.org/20250626122205.389c2cd4@canb.auug.org.au Adjacent changes: Documentation/netlink/specs/fou.yaml 791a9ed0a40d ("netlink: specs: fou: replace underscores with dashes in names") 880d43ca9aa4 ("netlink: specs: clean up spaces in brackets") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-26net: selftests: fix TCP packet checksumJakub Kicinski
The length in the pseudo header should be the length of the L3 payload AKA the L4 header+payload. The selftest code builds the packet from the lower layers up, so all the headers are pushed already when it constructs L4. We need to subtract the lower layer headers from skb->len. Fixes: 3e1e58d64c3d ("net: add generic selftest support") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com> Reported-by: Oleksij Rempel <o.rempel@pengutronix.de> Tested-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20250624183258.3377740-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-06-25net: Reoder rxq_idx check in __net_mp_open_rxq()Yue Haibing
array_index_nospec() clamp the rxq_idx within the range of [0, dev->real_num_rx_queues), move the check before it. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250624140159.3929503-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-25neighbour: Remove redundant assignment to errYue Haibing
'err' has been checked against 0 in the if statement. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250624014216.3686659-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23net: make sk->sk_rcvtimeo locklessEric Dumazet
Followup of commit 285975dd6742 ("net: annotate data-races around sk->sk_{rcv|snd}timeo"). Remove lock_sock()/release_sock() from ksmbd_tcp_rcv_timeout() and add READ_ONCE()/WRITE_ONCE() where it is needed. Also SO_RCVTIMEO_OLD and SO_RCVTIMEO_NEW can call sock_set_timeout() without holding the socket lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250620155536.335520-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23net: make sk->sk_sndtimeo locklessEric Dumazet
Followup of commit 285975dd6742 ("net: annotate data-races around sk->sk_{rcv|snd}timeo"). Remove lock_sock()/release_sock() from sock_set_sndtimeo(), and add READ_ONCE()/WRITE_ONCE() where it is needed. Also SO_SNDTIMEO_OLD and SO_SNDTIMEO_NEW can call sock_set_timeout() without holding the socket lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250620155536.335520-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23net: remove sock_i_uid()Eric Dumazet
Difference between sock_i_uid() and sk_uid() is that after sock_orphan(), sock_i_uid() returns GLOBAL_ROOT_UID while sk_uid() returns the last cached sk->sk_uid value. None of sock_i_uid() callers care about this. Use sk_uid() which is much faster and inlined. Note that diag/dump users are calling sock_i_ino() and can not see the full benefit yet. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Lorenzo Colitti <lorenzo@google.com> Reviewed-by: Maciej Żenczykowski <maze@google.com> Link: https://patch.msgid.link/20250620133001.4090592-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23net: netpoll: Initialize UDP checksum field before checksummingBreno Leitao
commit f1fce08e63fe ("netpoll: Eliminate redundant assignment") removed the initialization of the UDP checksum, which was wrong and broke netpoll IPv6 transmission due to bad checksumming. udph->check needs to be set before calling csum_ipv6_magic(). Fixes: f1fce08e63fe ("netpoll: Eliminate redundant assignment") Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250620-netpoll_fix-v1-1-f9f0b82bc059@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19Merge branch ↵Jakub Kicinski
'ref_tracker-add-ability-to-register-a-debugfs-file-for-a-ref_tracker_dir' Jeff Layton says: ==================== ref_tracker: add ability to register a debugfs file for a ref_tracker_dir For those just joining in, this series adds a new top-level "ref_tracker" debugfs directory, and has each ref_tracker_dir register a file in there as part of its initialization. It also adds the ability to register a symlink with a more human-usable name that points to the file, and does some general cleanup of how the ref_tracker object names are handled. v14: https://lore.kernel.org/20250610-reftrack-dbgfs-v14-0-efb532861428@kernel.org v13: https://lore.kernel.org/20250603-reftrack-dbgfs-v13-0-7b2a425019d8@kernel.org v12: https://lore.kernel.org/20250529-reftrack-dbgfs-v12-0-11b93c0c0b6e@kernel.org v11: https://lore.kernel.org/20250528-reftrack-dbgfs-v11-0-94ae0b165841@kernel.org v10: https://lore.kernel.org/20250527-reftrack-dbgfs-v10-0-dc55f7705691@kernel.org v9: https://lore.kernel.org/20250509-reftrack-dbgfs-v9-0-8ab888a4524d@kernel.org v8: https://lore.kernel.org/20250507-reftrack-dbgfs-v8-0-607717d3bb98@kernel.org v7: https://lore.kernel.org/20250505-reftrack-dbgfs-v7-0-f78c5d97bcca@kernel.org v6: https://lore.kernel.org/20250430-reftrack-dbgfs-v6-0-867c29aff03a@kernel.org v5: https://lore.kernel.org/20250428-reftrack-dbgfs-v5-0-1cbbdf2038bd@kernel.org v4: https://lore.kernel.org/20250418-reftrack-dbgfs-v4-0-5ca5c7899544@kernel.org v3: https://lore.kernel.org/20250417-reftrack-dbgfs-v3-0-c3159428c8fb@kernel.org v2: https://lore.kernel.org/20250415-reftrack-dbgfs-v2-0-b18c4abd122f@kernel.org v1: https://lore.kernel.org/20250414-reftrack-dbgfs-v1-0-f03585832203@kernel.org ==================== Link: https://patch.msgid.link/20250618-reftrack-dbgfs-v15-0-24fc37ead144@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19ref_tracker: eliminate the ref_tracker_dir name fieldJeff Layton
Now that we have dentries and the ability to create meaningful symlinks to them, don't keep a name string in each tracker. Switch the output format to print "class@address", and drop the name field. Also, add a kerneldoc header for ref_tracker_dir_init(). Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20250618-reftrack-dbgfs-v15-9-24fc37ead144@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19net: add symlinks to ref_tracker_dir for netnsJeff Layton
After assigning the inode number to the namespace, use it to create a unique name for each netns refcount tracker with the ns.inum and net_cookie values in it, and register a symlink to the debugfs file for it. init_net is registered before the ref_tracker dir is created, so add a late_initcall() to register its files and symlinks. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20250618-reftrack-dbgfs-v15-8-24fc37ead144@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19ref_tracker: add a static classname string to each ref_tracker_dirJeff Layton
A later patch in the series will be adding debugfs files for each ref_tracker that get created in ref_tracker_dir_init(). The format will be "class@%px". The current "name" string can vary between ref_tracker_dir objects of the same type, so it's not suitable for this purpose. Add a new "class" string to the ref_tracker dir that describes the the type of object (sans any individual info for that object). Also, in the i915 driver, gate the creation of debugfs files on whether the dentry pointer is still set to NULL. CI has shown that the ref_tracker_dir can be initialized more than once. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20250618-reftrack-dbgfs-v15-4-24fc37ead144@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19netpoll: Extract IPv6 address retrieval functionBreno Leitao
Extract the IPv6 address retrieval logic from netpoll_setup() into a dedicated helper function netpoll_take_ipv6() to improve code organization and readability. The function handles obtaining the local IPv6 address from the network device, including proper address type matching between local and remote addresses (link-local vs global), and includes appropriate error handling when IPv6 is not supported or no suitable address is available. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250618-netpoll_ip_ref-v1-3-c2ac00fe558f@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19netpoll: extract IPv4 address retrieval into helper functionBreno Leitao
Move the IPv4 address retrieval logic from netpoll_setup() into a separate netpoll_take_ipv4() function to improve code organization and readability. This change consolidates the IPv4-specific logic and error handling into a dedicated function while maintaining the same functionality. No functional changes. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250618-netpoll_ip_ref-v1-2-c2ac00fe558f@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19netpoll: Extract carrier wait functionBreno Leitao
Extract the carrier waiting logic into a dedicated helper function netpoll_wait_carrier() to improve code readability and reduce duplication in netpoll_setup(). Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250618-netpoll_ip_ref-v1-1-c2ac00fe558f@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19neighbour: add support for NUD_PERMANENT proxy entriesNicolas Escande
As discussesd before in [0] proxy entries (which are more configuration than runtime data) should stay when the link (carrier) goes does down. This is what happens for regular neighbour entries. So lets fix this by: - storing in proxy entries the fact that it was added as NUD_PERMANENT - not removing NUD_PERMANENT proxy entries when the carrier goes down (same as how it's done in neigh_flush_dev() for regular neigh entries) [0]: https://lore.kernel.org/netdev/c584ef7e-6897-01f3-5b80-12b53f7b4bf4@kernel.org/ Signed-off-by: Nicolas Escande <nico.escande@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250617141334.3724863-1-nico.escande@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.16-rc3). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18net: remove redundant ASSERT_RTNL() in queue setup functionsStanislav Fomichev
The existing netdev_ops_assert_locked() already asserts that either the RTNL lock or the per-device lock is held, making the explicit ASSERT_RTNL() redundant. Cc: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com> Link: https://patch.msgid.link/20250616162117.287806-5-stfomichev@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18udp_tunnel: remove rtnl_lock dependencyStanislav Fomichev
Drivers that are using ops lock and don't depend on RTNL lock still need to manage it because udp_tunnel's RTNL dependency. Introduce new udp_tunnel_nic_lock and use it instead of rtnl_lock. Drop non-UDP_TUNNEL_NIC_INFO_MAY_SLEEP mode from udp_tunnel infra (udp_tunnel_nic_device_sync_work needs to grab udp_tunnel_nic_lock mutex and might sleep). Cover more places in v4: - netlink - udp_tunnel_notify_add_rx_port (ndo_open) - triggers udp_tunnel_nic_device_sync_work - udp_tunnel_notify_del_rx_port (ndo_stop) - triggers udp_tunnel_nic_device_sync_work - udp_tunnel_get_rx_info (__netdev_update_features) - triggers NETDEV_UDP_TUNNEL_PUSH_INFO - udp_tunnel_drop_rx_info (__netdev_update_features) - triggers NETDEV_UDP_TUNNEL_DROP_INFO - udp_tunnel_nic_reset_ntf (ndo_open) - notifiers - udp_tunnel_nic_netdevice_event, depending on the event: - triggers NETDEV_UDP_TUNNEL_PUSH_INFO - triggers NETDEV_UDP_TUNNEL_DROP_INFO - ethnl_tunnel_info_reply_size - udp_tunnel_nic_set_port_priv (two intel drivers) Cc: Michael Chan <michael.chan@broadcom.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com> Link: https://patch.msgid.link/20250616162117.287806-4-stfomichev@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17net: netmem: fix skb_ensure_writable with unreadable skbsMina Almasry
skb_ensure_writable should succeed when it's trying to write to the header of the unreadable skbs, so it doesn't need an unconditional skb_frags_readable check. The preceding pskb_may_pull() call will succeed if write_len is within the head and fail if we're trying to write to the unreadable payload, so we don't need an additional check. Removing this check restores DSCP functionality with unreadable skbs as it's called from dscp_tg. Cc: willemb@google.com Cc: asml.silence@gmail.com Fixes: 65249feb6b3d ("net: add support for skbs with unreadable frags") Signed-off-by: Mina Almasry <almasrymina@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250615200733.520113-1-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-16netpoll: move netpoll_print_options to netconsoleBreno Leitao
Move netpoll_print_options() from net/core/netpoll.c to drivers/net/netconsole.c and make it static. This function is only used by netconsole, so there's no need to export it or keep it in the public netpoll API. This reduces the netpoll API surface and improves code locality by keeping netconsole-specific functionality within the netconsole driver. Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20250613-rework-v3-4-0752bf2e6912@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-16netpoll: relocate netconsole-specific functions to netconsole moduleBreno Leitao
Move netpoll_parse_ip_addr() and netpoll_parse_options() from the generic netpoll module to the netconsole module where they are actually used. These functions were originally placed in netpoll but are only consumed by netconsole. This refactoring improves code organization by: - Removing unnecessary exported symbols from netpoll - Making netpoll_parse_options() static (no longer needs global visibility) - Reducing coupling between netpoll and netconsole modules The functions remain functionally identical - this is purely a code reorganization to better reflect their actual usage patterns. Here are the changes: 1) Move both functions from netpoll to netconsole 2) Add static to netpoll_parse_options() 3) Removed the EXPORT_SYMBOL() PS: This diff does not change the function format, so, it is easy to review, but, checkpatch will not be happy. A follow-up patch will address the current issues reported by checkpatch. Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20250613-rework-v3-3-0752bf2e6912@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-16netpoll: expose netpoll logging macros in public headerBreno Leitao
Move np_info(), np_err(), and np_notice() macros from internal implementation to the public netpoll header file to make them available for use by netpoll consumers. These logging macros provide consistent formatting for netpoll-related messages by automatically prefixing log output with the netpoll instance name. The goal is to use the exact same format that is being displayed today, instead of creating something netconsole-specific. Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20250613-rework-v3-2-0752bf2e6912@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-16netpoll: remove __netpoll_cleanup from exported APIBreno Leitao
Since commit 97714695ef90 ("net: netconsole: Defer netpoll cleanup to avoid lock release during list traversal"), netconsole no longer uses __netpoll_cleanup(). With no remaining users, remove this function from the exported netpoll API. The function remains available internally within netpoll for use by netpoll_cleanup(). Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20250613-rework-v3-1-0752bf2e6912@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-14net: sysfs: Implement is_visible for phys_(port_id, port_name, switch_id)Yajun Deng
phys_port_id_show, phys_port_name_show and phys_switch_id_show would return -EOPNOTSUPP if the netdev didn't implement the corresponding method. There is no point in creating these files if they are unsupported. Put these attributes in netdev_phys_group and implement the is_visible method. make phys_(port_id, port_name, switch_id) invisible if the netdev dosen't implement the corresponding method. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250612142707.4644-1-yajun.deng@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.16-rc2). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12Merge tag 'net-6.16-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bluetooth and wireless. Current release - regressions: - af_unix: allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD Current release - new code bugs: - eth: airoha: correct enable mask for RX queues 16-31 - veth: prevent NULL pointer dereference in veth_xdp_rcv when peer disappears under traffic - ipv6: move fib6_config_validate() to ip6_route_add(), prevent invalid routes Previous releases - regressions: - phy: phy_caps: don't skip better duplex match on non-exact match - dsa: b53: fix untagged traffic sent via cpu tagged with VID 0 - Revert "wifi: mwifiex: Fix HT40 bandwidth issue.", it caused transient packet loss, exact reason not fully understood, yet Previous releases - always broken: - net: clear the dst when BPF is changing skb protocol (IPv4 <> IPv6) - sched: sfq: fix a potential crash on gso_skb handling - Bluetooth: intel: improve rx buffer posting to avoid causing issues in the firmware - eth: intel: i40e: make reset handling robust against multiple requests - eth: mlx5: ensure FW pages are always allocated on the local NUMA node, even when device is configure to 'serve' another node - wifi: ath12k: fix GCC_GCC_PCIE_HOT_RST definition for WCN7850, prevent kernel crashes - wifi: ath11k: avoid burning CPU in ath11k_debugfs_fw_stats_request() for 3 sec if fw_stats_done is not set" * tag 'net-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (70 commits) selftests: drv-net: rss_ctx: Add test for ntuple rules targeting default RSS context net: ethtool: Don't check if RSS context exists in case of context 0 af_unix: Allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD. ipv6: Move fib6_config_validate() to ip6_route_add(). net: drv: netdevsim: don't napi_complete() from netpoll net/mlx5: HWS, Add error checking to hws_bwc_rule_complex_hash_node_get() veth: prevent NULL pointer dereference in veth_xdp_rcv net_sched: remove qdisc_tree_flush_backlog() net_sched: ets: fix a race in ets_qdisc_change() net_sched: tbf: fix a race in tbf_change() net_sched: red: fix a race in __red_change() net_sched: prio: fix a race in prio_tune() net_sched: sch_sfq: reject invalid perturb period net: phy: phy_caps: Don't skip better duplex macth on non-exact match MAINTAINERS: Update Kuniyuki Iwashima's email address. selftests: net: add test case for NAT46 looping back dst net: clear the dst when changing skb protocol net/mlx5e: Fix number of lanes to UNKNOWN when using data_rate_oper net/mlx5e: Fix leak of Geneve TLV option object net/mlx5: HWS, make sure the uplink is the last destination ...