summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2025-07-14sunrpc: fix handling of unknown auth status codesJeff Layton
In the case of an unknown error code from svc_authenticate or pg_authenticate, return AUTH_ERROR with a status of AUTH_FAILED. Also add the other auth_stat value from RFC 5531, and document all the status codes. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-07-14sunrpc: new tracepoints around svc thread wakeupsJeff Layton
Convert the svc_wake_up tracepoint into svc_pool_thread_event class. Have it also record the pool id, and add new tracepoints for when the thread is already running and for when there are no idle threads. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-07-14sunrpc: unexport csum_partial_copy_to_xdrChristoph Hellwig
csum_partial_copy_to_xdr is only used inside the sunrpc module, so remove the export. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-07-14sunrpc: simplify xdr_partial_copy_from_skbChristoph Hellwig
csum_partial_copy_to_xdr can handle a checksumming and non-checksumming case and implements this using a callback, which leads to a lot of boilerplate code and indirect calls in the fast path. Switch to storing a need_checksum flag in struct xdr_skb_reader instead to remove the indirect call and simplify the code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-07-14sunrpc: simplify xdr_init_encode_pagesChristoph Hellwig
The rqst argument to xdr_init_encode_pages is set to NULL by all callers, and pages is always set to buf->pages. Remove the two arguments and hardcode the assignments. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-07-14lib/crypto: sha1: Rename sha1_init() to sha1_init_raw()Eric Biggers
Rename the existing sha1_init() to sha1_init_raw(), since it conflicts with the upcoming library function. This will later be removed, but this keeps the kernel building for the introduction of the library. Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250712232329.818226-3-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-07-14Revert "netfilter: nf_tables: Add notifications for hook changes"Phil Sutter
This reverts commit 465b9ee0ee7bc268d7f261356afd6c4262e48d82. Such notifications fit better into core or nfnetlink_hook code, following the NFNL_MSG_HOOK_GET message format. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-07-14netfilter: nf_tables: hide clash bit from userspaceFlorian Westphal
Its a kernel implementation detail, at least at this time: We can later decide to revert this patch if there is a compelling reason, but then we should also remove the ifdef that prevents exposure of ip_conntrack_status enum IPS_NAT_CLASH value in the uapi header. Clash entries are not included in dumps (true for both old /proc and ctnetlink) either. So for now exclude the clash bit when dumping. Fixes: 7e5c6aa67e6f ("netfilter: nf_tables: add packets conntrack state to debug trace info") Link: https://lore.kernel.org/netfilter-devel/aGwf3dCggwBlRKKC@strlen.de/ Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-07-14don't bother with path_get()/path_put() in unix_open_file()Al Viro
Once unix_sock ->path is set, we are guaranteed that its ->path will remain unchanged (and pinned) until the socket is closed. OTOH, dentry_open() does not modify the path passed to it. IOW, there's no need to copy unix_sk(sk)->path in unix_open_file() - we can just pass it to dentry_open() and be done with that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/20250712054157.GZ1880847@ZenIV Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-13rpl: Fix use-after-free in rpl_do_srh_inline().Kuniyuki Iwashima
Running lwt_dst_cache_ref_loop.sh in selftest with KASAN triggers the splat below [0]. rpl_do_srh_inline() fetches ipv6_hdr(skb) and accesses it after skb_cow_head(), which is illegal as the header could be freed then. Let's fix it by making oldhdr to a local struct instead of a pointer. [0]: [root@fedora net]# ./lwt_dst_cache_ref_loop.sh ... TEST: rpl (input) [ 57.631529] ================================================================== BUG: KASAN: slab-use-after-free in rpl_do_srh_inline.isra.0 (net/ipv6/rpl_iptunnel.c:174) Read of size 40 at addr ffff888122bf96d8 by task ping6/1543 CPU: 50 UID: 0 PID: 1543 Comm: ping6 Not tainted 6.16.0-rc5-01302-gfadd1e6231b1 #23 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 Call Trace: <IRQ> dump_stack_lvl (lib/dump_stack.c:122) print_report (mm/kasan/report.c:409 mm/kasan/report.c:521) kasan_report (mm/kasan/report.c:221 mm/kasan/report.c:636) kasan_check_range (mm/kasan/generic.c:175 (discriminator 1) mm/kasan/generic.c:189 (discriminator 1)) __asan_memmove (mm/kasan/shadow.c:94 (discriminator 2)) rpl_do_srh_inline.isra.0 (net/ipv6/rpl_iptunnel.c:174) rpl_input (net/ipv6/rpl_iptunnel.c:201 net/ipv6/rpl_iptunnel.c:282) lwtunnel_input (net/core/lwtunnel.c:459) ipv6_rcv (./include/net/dst.h:471 (discriminator 1) ./include/net/dst.h:469 (discriminator 1) net/ipv6/ip6_input.c:79 (discriminator 1) ./include/linux/netfilter.h:317 (discriminator 1) ./include/linux/netfilter.h:311 (discriminator 1) net/ipv6/ip6_input.c:311 (discriminator 1)) __netif_receive_skb_one_core (net/core/dev.c:5967) process_backlog (./include/linux/rcupdate.h:869 net/core/dev.c:6440) __napi_poll.constprop.0 (net/core/dev.c:7452) net_rx_action (net/core/dev.c:7518 net/core/dev.c:7643) handle_softirqs (kernel/softirq.c:579) do_softirq (kernel/softirq.c:480 (discriminator 20)) </IRQ> <TASK> __local_bh_enable_ip (kernel/softirq.c:407) __dev_queue_xmit (net/core/dev.c:4740) ip6_finish_output2 (./include/linux/netdevice.h:3358 ./include/net/neighbour.h:526 ./include/net/neighbour.h:540 net/ipv6/ip6_output.c:141) ip6_finish_output (net/ipv6/ip6_output.c:215 net/ipv6/ip6_output.c:226) ip6_output (./include/linux/netfilter.h:306 net/ipv6/ip6_output.c:248) ip6_send_skb (net/ipv6/ip6_output.c:1983) rawv6_sendmsg (net/ipv6/raw.c:588 net/ipv6/raw.c:918) __sys_sendto (net/socket.c:714 (discriminator 1) net/socket.c:729 (discriminator 1) net/socket.c:2228 (discriminator 1)) __x64_sys_sendto (net/socket.c:2231) do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1)) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) RIP: 0033:0x7f68cffb2a06 Code: 5d e8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 75 19 83 e2 39 83 fa 08 75 11 e8 26 ff ff ff 66 0f 1f 44 00 00 48 8b 45 10 0f 05 <48> 8b 5d f8 c9 c3 0f 1f 40 00 f3 0f 1e fa 55 48 89 e5 48 83 ec 08 RSP: 002b:00007ffefb7c53d0 EFLAGS: 00000202 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 0000564cd69f10a0 RCX: 00007f68cffb2a06 RDX: 0000000000000040 RSI: 0000564cd69f10a4 RDI: 0000000000000003 RBP: 00007ffefb7c53f0 R08: 0000564cd6a032ac R09: 000000000000001c R10: 0000000000000000 R11: 0000000000000202 R12: 0000564cd69f10a4 R13: 0000000000000040 R14: 00007ffefb7c66e0 R15: 0000564cd69f10a0 </TASK> Allocated by task 1543: kasan_save_stack (mm/kasan/common.c:48) kasan_save_track (mm/kasan/common.c:60 (discriminator 1) mm/kasan/common.c:69 (discriminator 1)) __kasan_slab_alloc (mm/kasan/common.c:319 mm/kasan/common.c:345) kmem_cache_alloc_node_noprof (./include/linux/kasan.h:250 mm/slub.c:4148 mm/slub.c:4197 mm/slub.c:4249) kmalloc_reserve (net/core/skbuff.c:581 (discriminator 88)) __alloc_skb (net/core/skbuff.c:669) __ip6_append_data (net/ipv6/ip6_output.c:1672 (discriminator 1)) ip6_append_data (net/ipv6/ip6_output.c:1859) rawv6_sendmsg (net/ipv6/raw.c:911) __sys_sendto (net/socket.c:714 (discriminator 1) net/socket.c:729 (discriminator 1) net/socket.c:2228 (discriminator 1)) __x64_sys_sendto (net/socket.c:2231) do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1)) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) Freed by task 1543: kasan_save_stack (mm/kasan/common.c:48) kasan_save_track (mm/kasan/common.c:60 (discriminator 1) mm/kasan/common.c:69 (discriminator 1)) kasan_save_free_info (mm/kasan/generic.c:579 (discriminator 1)) __kasan_slab_free (mm/kasan/common.c:271) kmem_cache_free (mm/slub.c:4643 (discriminator 3) mm/slub.c:4745 (discriminator 3)) pskb_expand_head (net/core/skbuff.c:2274) rpl_do_srh_inline.isra.0 (net/ipv6/rpl_iptunnel.c:158 (discriminator 1)) rpl_input (net/ipv6/rpl_iptunnel.c:201 net/ipv6/rpl_iptunnel.c:282) lwtunnel_input (net/core/lwtunnel.c:459) ipv6_rcv (./include/net/dst.h:471 (discriminator 1) ./include/net/dst.h:469 (discriminator 1) net/ipv6/ip6_input.c:79 (discriminator 1) ./include/linux/netfilter.h:317 (discriminator 1) ./include/linux/netfilter.h:311 (discriminator 1) net/ipv6/ip6_input.c:311 (discriminator 1)) __netif_receive_skb_one_core (net/core/dev.c:5967) process_backlog (./include/linux/rcupdate.h:869 net/core/dev.c:6440) __napi_poll.constprop.0 (net/core/dev.c:7452) net_rx_action (net/core/dev.c:7518 net/core/dev.c:7643) handle_softirqs (kernel/softirq.c:579) do_softirq (kernel/softirq.c:480 (discriminator 20)) __local_bh_enable_ip (kernel/softirq.c:407) __dev_queue_xmit (net/core/dev.c:4740) ip6_finish_output2 (./include/linux/netdevice.h:3358 ./include/net/neighbour.h:526 ./include/net/neighbour.h:540 net/ipv6/ip6_output.c:141) ip6_finish_output (net/ipv6/ip6_output.c:215 net/ipv6/ip6_output.c:226) ip6_output (./include/linux/netfilter.h:306 net/ipv6/ip6_output.c:248) ip6_send_skb (net/ipv6/ip6_output.c:1983) rawv6_sendmsg (net/ipv6/raw.c:588 net/ipv6/raw.c:918) __sys_sendto (net/socket.c:714 (discriminator 1) net/socket.c:729 (discriminator 1) net/socket.c:2228 (discriminator 1)) __x64_sys_sendto (net/socket.c:2231) do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1)) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) The buggy address belongs to the object at ffff888122bf96c0 which belongs to the cache skbuff_small_head of size 704 The buggy address is located 24 bytes inside of freed 704-byte region [ffff888122bf96c0, ffff888122bf9980) The buggy address belongs to the physical page: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x122bf8 head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 flags: 0x200000000000040(head|node=0|zone=2) page_type: f5(slab) raw: 0200000000000040 ffff888101fc0a00 ffffea000464dc00 0000000000000002 raw: 0000000000000000 0000000080270027 00000000f5000000 0000000000000000 head: 0200000000000040 ffff888101fc0a00 ffffea000464dc00 0000000000000002 head: 0000000000000000 0000000080270027 00000000f5000000 0000000000000000 head: 0200000000000003 ffffea00048afe01 00000000ffffffff 00000000ffffffff head: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888122bf9580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888122bf9600: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc >ffff888122bf9680: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb ^ ffff888122bf9700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888122bf9780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Fixes: a7a29f9c361f8 ("net: ipv6: add rpl sr tunnel") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-07-13af_packet: fix soft lockup issue caused by tpacket_snd()Yun Lu
When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for pending_refcnt to decrement to zero before returning. The pending_refcnt is decremented by 1 when the skb->destructor function is called, indicating that the skb has been successfully sent and needs to be destroyed. If an error occurs during this process, the tpacket_snd() function will exit and return error, but pending_refcnt may not yet have decremented to zero. Assuming the next send operation is executed immediately, but there are no available frames to be sent in tx_ring (i.e., packet_current_frame returns NULL), and skb is also NULL, the function will not execute wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it will enter a do-while loop, waiting for pending_refcnt to be zero. Even if the previous skb has completed transmission, the skb->destructor function can only be invoked in the ksoftirqd thread (assuming NAPI threading is enabled). When both the ksoftirqd thread and the tpacket_snd operation happen to run on the same CPU, and the CPU trapped in the do-while loop without yielding, the ksoftirqd thread will not get scheduled to run. As a result, pending_refcnt will never be reduced to zero, and the do-while loop cannot exit, eventually leading to a CPU soft lockup issue. In fact, skb is true for all but the first iterations of that loop, and as long as pending_refcnt is not zero, even if incremented by a previous call, wait_for_completion_interruptible_timeout() should be executed to yield the CPU, allowing the ksoftirqd thread to be scheduled. Therefore, the execution condition of this function should be modified to check if pending_refcnt is not zero, instead of check skb. - if (need_wait && skb) { + if (need_wait && packet_read_pending(&po->tx_ring)) { As a result, the judgment conditions are duplicated with the end code of the while loop, and packet_read_pending() is a very expensive function. Actually, this loop can only exit when ph is NULL, so the loop condition can be changed to while (1), and in the "ph = NULL" branch, if the subsequent condition of if is not met, the loop can break directly. Now, the loop logic remains the same as origin but is clearer and more obvious. Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET") Cc: stable@kernel.org Suggested-by: LongJun Tang <tanglongjun@kylinos.cn> Signed-off-by: Yun Lu <luyun@kylinos.cn> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-07-13af_packet: fix the SO_SNDTIMEO constraint not effective on tpacked_snd()Yun Lu
Due to the changes in commit 581073f626e3 ("af_packet: do not call packet_read_pending() from tpacket_destruct_skb()"), every time tpacket_destruct_skb() is executed, the skb_completion is marked as completed. When wait_for_completion_interruptible_timeout() returns completed, the pending_refcnt has not yet been reduced to zero. Therefore, when ph is NULL, the wait function may need to be called multiple times until packet_read_pending() finally returns zero. We should call sock_sndtimeo() only once, otherwise the SO_SNDTIMEO constraint could be way off. Fixes: 581073f626e3 ("af_packet: do not call packet_read_pending() from tpacket_destruct_skb()") Cc: stable@kernel.org Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yun Lu <luyun@kylinos.cn> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-07-13net/sched: sch_qfq: Fix race condition on qfq_aggregateXiang Mei
A race condition can occur when 'agg' is modified in qfq_change_agg (called during qfq_enqueue) while other threads access it concurrently. For example, qfq_dump_class may trigger a NULL dereference, and qfq_delete_class may cause a use-after-free. This patch addresses the issue by: 1. Moved qfq_destroy_class into the critical section. 2. Added sch_tree_lock protection to qfq_dump_class and qfq_dump_class_stats. Fixes: 462dbc9101ac ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost") Signed-off-by: Xiang Mei <xmei5@asu.edu> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-07-11Merge tag 'batadv-next-pullrequest-20250710' of ↵Jakub Kicinski
git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== This cleanup patchset includes the following patches: - bump version strings, by Simon Wunderlich - batman-adv: store hard_iface as iflink private data, by Matthias Schiffer * tag 'batadv-next-pullrequest-20250710' of git://git.open-mesh.org/linux-merge: batman-adv: store hard_iface as iflink private data batman-adv: Start new development cycle ==================== Link: https://patch.msgid.link/20250710164501.153872-1-sw@simonwunderlich.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11net_sched: act_skbedit: use RCU in tcf_skbedit_dump()Eric Dumazet
Also storing tcf_action into struct tcf_skbedit_params makes sure there is no discrepancy in tcf_skbedit_act(). Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250709090204.797558-12-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11net_sched: act_police: use RCU in tcf_police_dump()Eric Dumazet
Also storing tcf_action into struct tcf_police_params makes sure there is no discrepancy in tcf_police_act(). Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250709090204.797558-11-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11net_sched: act_pedit: use RCU in tcf_pedit_dump()Eric Dumazet
Also storing tcf_action into struct tcf_pedit_params makes sure there is no discrepancy in tcf_pedit_act(). Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250709090204.797558-10-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11net_sched: act_nat: use RCU in tcf_nat_dump()Eric Dumazet
Also storing tcf_action into struct tcf_nat_params makes sure there is no discrepancy in tcf_nat_act(). Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250709090204.797558-9-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11net_sched: act_mpls: use RCU in tcf_mpls_dump()Eric Dumazet
Also storing tcf_action into struct tcf_mpls_params makes sure there is no discrepancy in tcf_mpls_act(). Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250709090204.797558-8-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11net_sched: act_ctinfo: use RCU in tcf_ctinfo_dump()Eric Dumazet
Also storing tcf_action into struct tcf_ctinfo_params makes sure there is no discrepancy in tcf_ctinfo_act(). Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250709090204.797558-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11net_sched: act_ctinfo: use atomic64_t for three countersEric Dumazet
Commit 21c167aa0ba9 ("net/sched: act_ctinfo: use percpu stats") missed that stats_dscp_set, stats_dscp_error and stats_cpmark_set might be written (and read) locklessly. Use atomic64_t for these three fields, I doubt act_ctinfo is used heavily on big SMP hosts anyway. Fixes: 24ec483cec98 ("net: sched: Introduce act_ctinfo action") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Pedro Tammela <pctammela@mojatatu.com> Link: https://patch.msgid.link/20250709090204.797558-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11net_sched: act_ct: use RCU in tcf_ct_dump()Eric Dumazet
Also storing tcf_action into struct tcf_ct_params makes sure there is no discrepancy in tcf_ct_act(). Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250709090204.797558-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11net_sched: act_csum: use RCU in tcf_csum_dump()Eric Dumazet
Also storing tcf_action into struct tcf_csum_params makes sure there is no discrepancy in tcf_csum_act(). Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250709090204.797558-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11net_sched: act_connmark: use RCU in tcf_connmark_dump()Eric Dumazet
Also storing tcf_action into struct tcf_connmark_parms makes sure there is no discrepancy in tcf_connmark_act(). Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250709090204.797558-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11net/sched: Restrict conditions for adding duplicating netems to qdisc treeWilliam Liu
netem_enqueue's duplication prevention logic breaks when a netem resides in a qdisc tree with other netems - this can lead to a soft lockup and OOM loop in netem_dequeue, as seen in [1]. Ensure that a duplicating netem cannot exist in a tree with other netems. Previous approaches suggested in discussions in chronological order: 1) Track duplication status or ttl in the sk_buff struct. Considered too specific a use case to extend such a struct, though this would be a resilient fix and address other previous and potential future DOS bugs like the one described in loopy fun [2]. 2) Restrict netem_enqueue recursion depth like in act_mirred with a per cpu variable. However, netem_dequeue can call enqueue on its child, and the depth restriction could be bypassed if the child is a netem. 3) Use the same approach as in 2, but add metadata in netem_skb_cb to handle the netem_dequeue case and track a packet's involvement in duplication. This is an overly complex approach, and Jamal notes that the skb cb can be overwritten to circumvent this safeguard. 4) Prevent the addition of a netem to a qdisc tree if its ancestral path contains a netem. However, filters and actions can cause a packet to change paths when re-enqueued to the root from netem duplication, leading us to the current solution: prevent a duplicating netem from inhabiting the same tree as other netems. [1] https://lore.kernel.org/netdev/8DuRWwfqjoRDLDmBMlIfbrsZg9Gx50DHJc1ilxsEBNe2D6NMoigR_eIRIG0LOjMc3r10nUUZtArXx4oZBIdUfZQrwjcQhdinnMis_0G7VEk=@willsroot.io/ [2] https://lwn.net/Articles/719297/ Fixes: 0afb51e72855 ("[PKT_SCHED]: netem: reinsert for duplication") Reported-by: William Liu <will@willsroot.io> Reported-by: Savino Dicanosa <savy@syst3mfailure.io> Signed-off-by: William Liu <will@willsroot.io> Signed-off-by: Savino Dicanosa <savy@syst3mfailure.io> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20250708164141.875402-1-will@willsroot.io Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.16-rc6-2). No conflicts. Adjacent changes: drivers/net/wireless/mediatek/mt76/mt7925/mcu.c c701574c5412 ("wifi: mt76: mt7925: fix invalid array index in ssid assignment during hw scan") b3a431fe2e39 ("wifi: mt76: mt7925: fix off by one in mt7925_mcu_hw_scan()") drivers/net/wireless/mediatek/mt76/mt7996/mac.c 62da647a2b20 ("wifi: mt76: mt7996: Add MLO support to mt7996_tx_check_aggr()") dc66a129adf1 ("wifi: mt76: add a wrapper for wcid access with validation") drivers/net/wireless/mediatek/mt76/mt7996/main.c 3dd6f67c669c ("wifi: mt76: Move RCU section in mt7996_mcu_add_rate_ctrl()") 8989d8e90f5f ("wifi: mt76: mt7996: Do not set wcid.sta to 1 in mt7996_mac_sta_event()") net/mac80211/cfg.c 58fcb1b4287c ("wifi: mac80211: reject VHT opmode for unsupported channel widths") 037dc18ac3fb ("wifi: mac80211: add support for storing station S1G capabilities") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11bpf: Remove attach_type in sockmap_linkTao Chen
Use attach_type in bpf_link, and remove it in sockmap_link. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20250710032038.888700-4-chen.dylane@linux.dev
2025-07-11bpf: Add attach_type field to bpf_linkTao Chen
Attach_type will be set when a link is created by user. It is better to record attach_type in bpf_link generically and have it available universally for all link types. So add the attach_type field in bpf_link and move the sleepable field to avoid unnecessary gap padding. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20250710032038.888700-2-chen.dylane@linux.dev
2025-07-11netlink: make sure we allow at least one dump skbJakub Kicinski
Commit under Fixes tightened up the memory accounting for Netlink sockets. Looks like the accounting is too strict for some existing use cases, Marek reported issues with nl80211 / WiFi iw CLI. To reduce number of iterations Netlink dumps try to allocate messages based on the size of the buffer passed to previous recvmsg() calls. If user space uses a larger buffer in recvmsg() than sk_rcvbuf we will allocate an skb we won't be able to queue. Make sure we always allow at least one skb to be queued. Same workaround is already present in netlink_attachskb(). Alternative would be to cap the allocation size to rcvbuf - rmem_alloc but as I said, the workaround is already present in other places. Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/9794af18-4905-46c6-b12c-365ea2f05858@samsung.com Fixes: ae8f160e7eb2 ("netlink: Fix wraparounds of sk->sk_rmem_alloc.") Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250711001121.3649033-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11netlink: Fix rmem check in netlink_broadcast_deliver().Kuniyuki Iwashima
We need to allow queuing at least one skb even when skb is larger than sk->sk_rcvbuf. The cited commit made a mistake while converting a condition in netlink_broadcast_deliver(). Let's correct the rmem check for the allow-one-skb rule. Fixes: ae8f160e7eb24 ("netlink: Fix wraparounds of sk->sk_rmem_alloc.") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250711053208.2965945-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-10Merge tag 'nf-next-25-07-10' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next (v2) The following series contains an initial small batch of Netfilter updates for net-next: 1) Remove DCCP conntrack support, keep DCCP matches around in order to avoid breakage when loading ruleset, add Kconfig to wrap the code so it can be disabled by distributors. 2) Remove buggy code aiming at shrinking netlink deletion event, then re-add it correctly in another patch. This is to prevent -stable to pick up on a fix that breaks old userspace. From Phil Sutter. 3) Missing WARN_ON_ONCE() to check for lockdep_commit_lock_is_held() to uncover bugs. From Fedor Pchelkin. * tag 'nf-next-25-07-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nf_tables: adjust lockdep assertions handling netfilter: nf_tables: Reintroduce shortened deletion notifications netfilter: nf_tables: Drop dead code from fill_*_info routines netfilter: conntrack: remove DCCP protocol support ==================== Link: https://patch.msgid.link/20250710010706.2861281-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-10gre: Fix IPv6 multicast route creation.Guillaume Nault
Use addrconf_add_dev() instead of ipv6_find_idev() in addrconf_gre_config() so that we don't just get the inet6_dev, but also install the default ff00::/8 multicast route. Before commit 3e6a0243ff00 ("gre: Fix again IPv6 link-local address generation."), the multicast route was created at the end of the function by addrconf_add_mroute(). But this code path is now only taken in one particular case (gre devices not bound to a local IP address and in EUI64 mode). For all other cases, the function exits early and addrconf_add_mroute() is not called anymore. Using addrconf_add_dev() instead of ipv6_find_idev() in addrconf_gre_config(), fixes the problem as it will create the default multicast route for all gre devices. This also brings addrconf_gre_config() a bit closer to the normal netdevice IPv6 configuration code (addrconf_dev_config()). Cc: stable@vger.kernel.org Fixes: 3e6a0243ff00 ("gre: Fix again IPv6 link-local address generation.") Reported-by: Aiden Yang <ling@moedove.com> Closes: https://lore.kernel.org/netdev/CANR=AhRM7YHHXVxJ4DmrTNMeuEOY87K2mLmo9KMed1JMr20p6g@mail.gmail.com/ Reviewed-by: Gary Guo <gary@garyguo.net> Tested-by: Gary Guo <gary@garyguo.net> Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/027a923dcb550ad115e6d93ee8bb7d310378bd01.1752070620.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-10net: appletalk: Fix device refcount leak in atrtr_create()Kito Xu
When updating an existing route entry in atrtr_create(), the old device reference was not being released before assigning the new device, leading to a device refcount leak. Fix this by calling dev_put() to release the old device reference before holding the new one. Fixes: c7f905f0f6d4 ("[ATALK]: Add missing dev_hold() to atrtr_create().") Signed-off-by: Kito Xu <veritas501@foxmail.com> Link: https://patch.msgid.link/tencent_E1A26771CDAB389A0396D1681A90A49E5D09@qq.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-10ethtool: rss: report which fields are configured for hashingJakub Kicinski
Implement ETHTOOL_GRXFH over Netlink. The number of flow types is reasonable (around 20) so report all of them at once for simplicity. Do not maintain the flow ID mapping with ioctl at the uAPI level. This gives us a chance to clean up the confusion that come from RxNFC vs RxFH (flow direction vs hashing) in the ioctl. Try to align with the names used in ethtool CLI, they seem to have stood the test of time just fine. One annoyance is that we still call L4 ports the weird names, but I guess they also apply to IPSec (where they cover the SPI) so it is what it is. $ ynl --family ethtool --dump rss-get { "header": { "dev-index": 1, "dev-name": "enp1s0" }, "hfunc": 1, "hkey": b"...", "indir": [0, 1, ...], "flow-hash": { "ether": {"l2da"}, "ah-esp4": {"ip-src", "ip-dst"}, "ah-esp6": {"ip-src", "ip-dst"}, "ah4": {"ip-src", "ip-dst"}, "ah6": {"ip-src", "ip-dst"}, "esp4": {"ip-src", "ip-dst"}, "esp6": {"ip-src", "ip-dst"}, "ip4": {"ip-src", "ip-dst"}, "ip6": {"ip-src", "ip-dst"}, "sctp4": {"ip-src", "ip-dst"}, "sctp6": {"ip-src", "ip-dst"}, "udp4": {"ip-src", "ip-dst"}, "udp6": {"ip-src", "ip-dst"} "tcp4": {"l4-b-0-1", "l4-b-2-3", "ip-src", "ip-dst"}, "tcp6": {"l4-b-0-1", "l4-b-2-3", "ip-src", "ip-dst"}, }, } Link: https://patch.msgid.link/20250708220640.2738464-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-10ethtool: mark ETHER_FLOW as usable for Rx hashJakub Kicinski
Looks like some drivers (ena, enetc, fbnic.. there's probably more) consider ETHER_FLOW to be legitimate target for flow hashing. I'm not sure how intentional that is from the uAPI perspective vs just an effect of ethtool IOCTL doing minimal input validation. But Netlink will do strict validation, so we need to decide whether we allow this use case or not. I don't see a strong reason against it, and rejecting it would potentially regress a number of drivers. So update the comments and flow_type_hashable(). Link: https://patch.msgid.link/20250708220640.2738464-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-10ethtool: rss: make sure dump takes the rss lockJakub Kicinski
After commit 040cef30b5e6 ("net: ethtool: move get_rxfh callback under the rss_lock") we're expected to take rss_lock around get. Switch dump to using the new prep helper and move the locking into it. Link: https://patch.msgid.link/20250708220640.2738464-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-10Merge tag 'wireless-next-2025-07-10' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Johannes Berg says: ==================== Quite a bit more work, notably: - mt76: firmware recovery improvements, MLO work - iwlwifi: use embedded PNVM in (to be released) FW images to fix compatibility issues - cfg80211/mac80211: extended regulatory info support (6 GHz) - cfg80211: use "faux device" for regulatory * tag 'wireless-next-2025-07-10' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (48 commits) wifi: mac80211: don't complete management TX on SAE commit wifi: cfg80211/mac80211: implement dot11ExtendedRegInfoSupport wifi: mac80211: send extended MLD capa/ops if AP has it wifi: mac80211: copy first_part into HW scan wifi: cfg80211: add a flag for the first part of a scan wifi: mac80211: remove DISALLOW_PUNCTURING_5GHZ code wifi: cfg80211: only verify part of Extended MLD Capabilities wifi: nl80211: make nl80211_check_scan_flags() type safe wifi: cfg80211: hide scan internals wifi: mac80211: fix deactivated link CSA wifi: mac80211: add mandatory bitrate support for 6 GHz wifi: mac80211: remove spurious blank line wifi: mac80211: verify state before connection wifi: mac80211: avoid weird state in error path wifi: iwlwifi: mvm: remove support for iwl_wowlan_info_notif_v4 wifi: iwlwifi: bump minimum API version in BZ wifi: iwlwifi: mvm: remove unneeded argument wifi: iwlwifi: mvm: remove MLO GTK rekey code wifi: iwlwifi: pcie: rename iwl_pci_gen1_2_probe() argument wifi: iwlwifi: match discrete/integrated to fix some names ... ==================== Link: https://patch.msgid.link/20250710123113.24878-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-10Merge tag 'wireless-2025-07-10' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== Quite a number of fixes still: - mt76 (hadn't sent any fixes so far) - RCU - scanning - decapsulation offload - interface combinations - rt2x00: build fix (bad function pointer prototype) - cfg80211: prevent A-MSDU flipping attacks in mesh - zd1211rw: prevent race ending with NULL ptr deref - cfg80211/mac80211: more S1G fixes - mwifiex: avoid WARN on certain RX frames - mac80211: - avoid stack data leak in WARN cases - fix non-transmitted BSSID search (on certain multi-BSSID APs) - always initialize key list so driver iteration won't crash - fix monitor interface in device restart - fix __free() annotation usage * tag 'wireless-2025-07-10' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: (26 commits) wifi: mac80211: add the virtual monitor after reconfig complete wifi: mac80211: always initialize sdata::key_list wifi: mac80211: Fix uninitialized variable with __free() in ieee80211_ml_epcs() wifi: mt76: mt792x: Limit the concurrent STA and SoftAP to operate on the same channel wifi: mt76: mt7925: Fix null-ptr-deref in mt7925_thermal_init() wifi: mt76: fix queue assignment for deauth packets wifi: mt76: add a wrapper for wcid access with validation wifi: mt76: mt7921: prevent decap offload config before STA initialization wifi: mt76: mt7925: prevent NULL pointer dereference in mt7925_sta_set_decap_offload() wifi: mt76: mt7925: fix incorrect scan probe IE handling for hw_scan wifi: mt76: mt7925: fix invalid array index in ssid assignment during hw scan wifi: mt76: mt7925: fix the wrong config for tx interrupt wifi: mt76: Remove RCU section in mt7996_mac_sta_rc_work() wifi: mt76: Move RCU section in mt7996_mcu_add_rate_ctrl() wifi: mt76: Move RCU section in mt7996_mcu_add_rate_ctrl_fixed() wifi: mt76: Move RCU section in mt7996_mcu_set_fixed_field() wifi: mt76: Assume __mt76_connac_mcu_alloc_sta_req runs in atomic context wifi: prevent A-MSDU attacks in mesh networks wifi: rt2x00: fix remove callback type mismatch wifi: mac80211: reject VHT opmode for unsupported channel widths ... ==================== Link: https://patch.msgid.link/20250710122212.24272-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-10net: replace ND_PRINTK with dynamic debugWang Liang
ND_PRINTK with val > 1 only works when the ND_DEBUG was set in compilation phase. Replace it with dynamic debug. Convert ND_PRINTK with val <= 1 to net_{err,warn}_ratelimited, and convert the rest to net_dbg_ratelimited. Suggested-by: Ido Schimmel <idosch@idosch.org> Signed-off-by: Wang Liang <wangliang74@huawei.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250708033342.1627636-1-wangliang74@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.16-rc6). No conflicts. Adjacent changes: Documentation/devicetree/bindings/net/allwinner,sun8i-a83t-emac.yaml 0a12c435a1d6 ("dt-bindings: net: sun8i-emac: Add A100 EMAC compatible") b3603c0466a8 ("dt-bindings: net: sun8i-emac: Rename A523 EMAC0 to GMAC0") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-10net: xsk: introduce XDP_MAX_TX_SKB_BUDGET setsockoptJason Xing
This patch provides a setsockopt method to let applications leverage to adjust how many descs to be handled at most in one send syscall. It mitigates the situation where the default value (32) that is too small leads to higher frequency of triggering send syscall. Considering the prosperity/complexity the applications have, there is no absolutely ideal suggestion fitting all cases. So keep 32 as its default value like before. The patch does the following things: - Add XDP_MAX_TX_SKB_BUDGET socket option. - Set max_tx_budget to 32 by default in the initialization phase as a per-socket granular control. - Set the range of max_tx_budget as [32, xs->tx->nentries]. The idea behind this comes out of real workloads in production. We use a user-level stack with xsk support to accelerate sending packets and minimize triggering syscalls. When the packets are aggregated, it's not hard to hit the upper bound (namely, 32). The moment user-space stack fetches the -EAGAIN error number passed from sendto(), it will loop to try again until all the expected descs from tx ring are sent out to the driver. Enlarging the XDP_MAX_TX_SKB_BUDGET value contributes to less frequency of sendto() and higher throughput/PPS. Here is what I did in production, along with some numbers as follows: For one application I saw lately, I suggested using 128 as max_tx_budget because I saw two limitations without changing any default configuration: 1) XDP_MAX_TX_SKB_BUDGET, 2) socket sndbuf which is 212992 decided by net.core.wmem_default. As to XDP_MAX_TX_SKB_BUDGET, the scenario behind this was I counted how many descs are transmitted to the driver at one time of sendto() based on [1] patch and then I calculated the possibility of hitting the upper bound. Finally I chose 128 as a suitable value because 1) it covers most of the cases, 2) a higher number would not bring evident results. After twisting the parameters, a stable improvement of around 4% for both PPS and throughput and less resources consumption were found to be observed by strace -c -p xxx: 1) %time was decreased by 7.8% 2) error counter was decreased from 18367 to 572 [1]: https://lore.kernel.org/all/20250619093641.70700-1-kerneljasonxing@gmail.com/ Signed-off-by: Jason Xing <kernelxing@tencent.com> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://patch.msgid.link/20250704160138.48677-1-kerneljasonxing@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-10wifi: mac80211: add the virtual monitor after reconfig completeMiri Korenblit
In reconfig we add the virtual monitor in 2 cases: 1. If we are resuming (it was deleted on suspend) 2. If it was added after an error but before the reconfig (due to the last non-monitor interface removal). In the second case, the removal of the non-monitor interface will succeed but the addition of the virtual monitor will fail, so we add it in the reconfig. The problem is that we mislead the driver to think that this is an existing interface that is getting re-added - while it is actually a completely new interface from the drivers' point of view. Some drivers act differently when a interface is re-added. For example, it might not initialize things because they were already initialized. Such drivers will - in this case - be left with a partialy initialized vif. To fix it, add the virtual monitor after reconfig_complete, so the driver will know that this is a completely new interface. Fixes: 3c3e21e7443b ("mac80211: destroy virtual monitor interface across suspend") Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20250709233451.648d39b041e8.I2e37b68375278987e303d6c00cc5f3d8334d2f96@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-10wifi: mac80211: always initialize sdata::key_listMiri Korenblit
This is currently not initialized for a virtual monitor, leading to a NULL pointer dereference when - for example - iterating over all the keys of all the vifs. Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20250709233400.8dcefe578497.I4c90a00ae3256520e063199d7f6f2580d5451acf@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-10net/sched: sch_qfq: Fix null-deref in agg_dequeueXiang Mei
To prevent a potential crash in agg_dequeue (net/sched/sch_qfq.c) when cl->qdisc->ops->peek(cl->qdisc) returns NULL, we check the return value before using it, similar to the existing approach in sch_hfsc.c. To avoid code duplication, the following changes are made: 1. Changed qdisc_warn_nonwc(include/net/pkt_sched.h) into a static inline function. 2. Moved qdisc_peek_len from net/sched/sch_hfsc.c to include/net/pkt_sched.h so that sch_qfq can reuse it. 3. Applied qdisc_peek_len in agg_dequeue to avoid crashing. Signed-off-by: Xiang Mei <xmei5@asu.edu> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Link: https://patch.msgid.link/20250705212143.3982664-1-xmei5@asu.edu Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-09rxrpc: Fix oops due to non-existence of prealloc backlog structDavid Howells
If an AF_RXRPC service socket is opened and bound, but calls are preallocated, then rxrpc_alloc_incoming_call() will oops because the rxrpc_backlog struct doesn't get allocated until the first preallocation is made. Fix this by returning NULL from rxrpc_alloc_incoming_call() if there is no backlog struct. This will cause the incoming call to be aborted. Reported-by: Junvyyang, Tencent Zhuque Lab <zhuque@tencent.com> Suggested-by: Junvyyang, Tencent Zhuque Lab <zhuque@tencent.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: LePremierHomme <kwqcheii@proton.me> cc: Marc Dionne <marc.dionne@auristor.com> cc: Willy Tarreau <w@1wt.eu> cc: Simon Horman <horms@kernel.org> cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/20250708211506.2699012-3-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09rxrpc: Fix bug due to prealloc collisionDavid Howells
When userspace is using AF_RXRPC to provide a server, it has to preallocate incoming calls and assign to them call IDs that will be used to thread related recvmsg() and sendmsg() together. The preallocated call IDs will automatically be attached to calls as they come in until the pool is empty. To the kernel, the call IDs are just arbitrary numbers, but userspace can use the call ID to hold a pointer to prepared structs. In any case, the user isn't permitted to create two calls with the same call ID (call IDs become available again when the call ends) and EBADSLT should result from sendmsg() if an attempt is made to preallocate a call with an in-use call ID. However, the cleanup in the error handling will trigger both assertions in rxrpc_cleanup_call() because the call isn't marked complete and isn't marked as having been released. Fix this by setting the call state in rxrpc_service_prealloc_one() and then marking it as being released before calling the cleanup function. Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests") Reported-by: Junvyyang, Tencent Zhuque Lab <zhuque@tencent.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: LePremierHomme <kwqcheii@proton.me> cc: Marc Dionne <marc.dionne@auristor.com> cc: Simon Horman <horms@kernel.org> cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/20250708211506.2699012-2-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09vsock: Add support for SIOCINQ ioctlXuewei Niu
Add support for SIOCINQ ioctl, indicating the length of bytes unread in the socket. The value is obtained from `vsock_stream_has_data()`. Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Link: https://patch.msgid.link/20250708-siocinq-v6-2-3775f9a9e359@antgroup.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09hv_sock: Return the readable bytes in hvs_stream_has_data()Dexuan Cui
When hv_sock was originally added, __vsock_stream_recvmsg() and vsock_stream_has_data() actually only needed to know whether there is any readable data or not, so hvs_stream_has_data() was written to return 1 or 0 for simplicity. However, now hvs_stream_has_data() should return the readable bytes because vsock_data_ready() -> vsock_stream_has_data() needs to know the actual bytes rather than a boolean value of 1 or 0. The SIOCINQ ioctl support also needs hvs_stream_has_data() to return the readable bytes. Let hvs_stream_has_data() return the readable bytes of the payload in the next host-to-guest VMBus hv_sock packet. Note: there may be multiple incoming hv_sock packets pending in the VMBus channel's ringbuffer, but so far there is not a VMBus API that allows us to know all the readable bytes in total without reading and caching the payload of the multiple packets, so let's just return the readable bytes of the next single packet. In the future, we'll either add a VMBus API that allows us to know the total readable bytes without touching the data in the ringbuffer, or the hv_sock driver needs to understand the VMBus packet format and parse the packets directly. Signed-off-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Wei Liu <wei.liu@kernel.org> Link: https://patch.msgid.link/20250708-siocinq-v6-1-3775f9a9e359@antgroup.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09skbuff: Add MSG_MORE flag to optimize tcp large packet transmissionFeng Yang
When using sockmap for forwarding, the average latency for different packet sizes after sending 10,000 packets is as follows: size old(us) new(us) 512 56 55 1472 58 58 1600 106 81 3000 145 105 5000 182 125 Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Feng Yang <yangfeng@kylinos.cn> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250708054053.39551-1-yangfeng59949@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net: ipconfig: convert timeouts to secs_to_jiffies()Easwar Hariharan
Commit b35108a51cf7 ("jiffies: Define secs_to_jiffies()") introduced secs_to_jiffies(). As the value here is a multiple of 1000, use secs_to_jiffies() instead of msecs_to_jiffies to avoid the multiplication. This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with the following Coccinelle rules: @depends on patch@ expression E; @@ -msecs_to_jiffies(E * 1000) +secs_to_jiffies(E) -msecs_to_jiffies(E * MSEC_PER_SEC) +secs_to_jiffies(E) While here, manually convert a couple timeouts denominated in seconds Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Link: https://patch.msgid.link/20250707-netdev-secs-to-jiffies-part-2-v2-2-b7817036342f@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>