linux/linux-stable.git - Linux kernel stable tree

Age	Commit message (Collapse)	Author
2025-05-13	net: devmem: TCP tx netlink api	Stanislav Fomichev
	Add bind-tx netlink call to attach dmabuf for TX; queue is not required, only ifindex and dmabuf fd for attachment. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250508004830.4100853-4-almasrymina@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-13	net: add get_netmem/put_netmem support	Mina Almasry
	Currently net_iovs support only pp ref counts, and do not support a page ref equivalent. This is fine for the RX path as net_iovs are used exclusively with the pp and only pp refcounting is needed there. The TX path however does not use pp ref counts, thus, support for get_page/put_page equivalent is needed for netmem. Support get_netmem/put_netmem. Check the type of the netmem before passing it to page or net_iov specific code to obtain a page ref equivalent. For dmabuf net_iovs, we obtain a ref on the underlying binding. This ensures the entire binding doesn't disappear until all the net_iovs have been put_netmem'ed. We do not need to track the refcount of individual dmabuf net_iovs as we don't allocate/free them from a pool similar to what the buddy allocator does for pages. This code is written to be extensible by other net_iov implementers. get_netmem/put_netmem will check the type of the netmem and route it to the correct helper: pages -> [get\|put]_page() dmabuf net_iovs -> net_devmem_[get\|put]_net_iov() new net_iovs -> new helpers Signed-off-by: Mina Almasry <almasrymina@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250508004830.4100853-3-almasrymina@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-13	netmem: add niov->type attribute to distinguish different net_iov types	Mina Almasry
	Later patches in the series adds TX net_iovs where there is no pp associated, so we can't rely on niov->pp->mp_ops to tell what is the type of the net_iov. Add a type enum to the net_iov which tells us the net_iov type. Signed-off-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250508004830.4100853-2-almasrymina@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-12	net: Lock lower level devices when updating features	Cosmin Ratiu
	__netdev_update_features() expects the netdevice to be ops-locked, but it gets called recursively on the lower level netdevices to sync their features, and nothing locks those. This commit fixes that, with the assumption that it shouldn't be possible for both higher-level and lover-level netdevices to require the instance lock, because that would lead to lock dependency warnings. Without this, playing with higher level (e.g. vxlan) netdevices on top of netdevices with instance locking enabled can run into issues: WARNING: CPU: 59 PID: 206496 at ./include/net/netdev_lock.h:17 netif_napi_add_weight_locked+0x753/0xa60 [...] Call Trace: <TASK> mlx5e_open_channel+0xc09/0x3740 [mlx5_core] mlx5e_open_channels+0x1f0/0x770 [mlx5_core] mlx5e_safe_switch_params+0x1b5/0x2e0 [mlx5_core] set_feature_lro+0x1c2/0x330 [mlx5_core] mlx5e_handle_feature+0xc8/0x140 [mlx5_core] mlx5e_set_features+0x233/0x2e0 [mlx5_core] __netdev_update_features+0x5be/0x1670 __netdev_update_features+0x71f/0x1670 dev_ethtool+0x21c5/0x4aa0 dev_ioctl+0x438/0xae0 sock_ioctl+0x2ba/0x690 __x64_sys_ioctl+0xa78/0x1700 do_syscall_64+0x6d/0x140 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> Fixes: 7e4d784f5810 ("net: hold netdev instance lock during rtnetlink operations") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250509072850.2002821-1-cratiu@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-09	bpf: Allow some trace helpers for all prog types	Feng Yang
	if it works under NMI and doesn't use any context-dependent things, should be fine for any program type. The detailed discussion is in [1]. [1] https://lore.kernel.org/all/CAEf4Bza6gK3dsrTosk6k3oZgtHesNDSrDd8sdeQ-GiS6oJixQg@mail.gmail.com/ Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Signed-off-by: Feng Yang <yangfeng@kylinos.cn> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/bpf/20250506061434.94277-2-yangfeng59949@163.com
2025-05-08	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net	Jakub Kicinski
	Cross-merge networking fixes after downstream PR (net-6.15-rc6). No conflicts. Adjacent changes: net/core/dev.c: 08e9f2d584c4 ("net: Lock netdevices during dev_shutdown") a82dc19db136 ("net: avoid potential race between netdev_get_by_index_lock() and netns switch") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-08	net: export a helper for adding up queue stats	Jakub Kicinski
	Older drivers and drivers with lower queue counts often have a static array of queues, rather than allocating structs for each queue on demand. Add a helper for adding up qstats from a queue range. Expectation is that driver will pass a queue range [netdev->real_num_x_queues, MAX). It was tempting to always use num_x_queues as the end, but virtio seems to clamp its queue count after allocating the netdev. And this way we can trivaly reuse the helper for [0, real_..). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250507003221.823267-2-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-07	bpf: Scrub packet on bpf_redirect_peer	Paul Chaignon
	When bpf_redirect_peer is used to redirect packets to a device in another network namespace, the skb isn't scrubbed. That can lead skb information from one namespace to be "misused" in another namespace. As one example, this is causing Cilium to drop traffic when using bpf_redirect_peer to redirect packets that just went through IPsec decryption to a container namespace. The following pwru trace shows (1) the packet path from the host's XFRM layer to the container's XFRM layer where it's dropped and (2) the number of active skb extensions at each function. NETNS MARK IFACE TUPLE FUNC 4026533547 d00 eth0 10.244.3.124:35473->10.244.2.158:53 xfrm_rcv_cb .active_extensions = (__u8)2, 4026533547 d00 eth0 10.244.3.124:35473->10.244.2.158:53 xfrm4_rcv_cb .active_extensions = (__u8)2, 4026533547 d00 eth0 10.244.3.124:35473->10.244.2.158:53 gro_cells_receive .active_extensions = (__u8)2, [...] 4026533547 0 eth0 10.244.3.124:35473->10.244.2.158:53 skb_do_redirect .active_extensions = (__u8)2, 4026534999 0 eth0 10.244.3.124:35473->10.244.2.158:53 ip_rcv .active_extensions = (__u8)2, 4026534999 0 eth0 10.244.3.124:35473->10.244.2.158:53 ip_rcv_core .active_extensions = (__u8)2, [...] 4026534999 0 eth0 10.244.3.124:35473->10.244.2.158:53 udp_queue_rcv_one_skb .active_extensions = (__u8)2, 4026534999 0 eth0 10.244.3.124:35473->10.244.2.158:53 __xfrm_policy_check .active_extensions = (__u8)2, 4026534999 0 eth0 10.244.3.124:35473->10.244.2.158:53 __xfrm_decode_session .active_extensions = (__u8)2, 4026534999 0 eth0 10.244.3.124:35473->10.244.2.158:53 security_xfrm_decode_session .active_extensions = (__u8)2, 4026534999 0 eth0 10.244.3.124:35473->10.244.2.158:53 kfree_skb_reason(SKB_DROP_REASON_XFRM_POLICY) .active_extensions = (__u8)2, In this case, there are no XFRM policies in the container's network namespace so the drop is unexpected. When we decrypt the IPsec packet, the XFRM state used for decryption is set in the skb extensions. This information is preserved across the netns switch. When we reach the XFRM policy check in the container's netns, __xfrm_policy_check drops the packet with LINUX_MIB_XFRMINNOPOLS because a (container-side) XFRM policy can't be found that matches the (host-side) XFRM state used for decryption. This patch fixes this by scrubbing the packet when using bpf_redirect_peer, as is done on typical netns switches via veth devices except skb->mark and skb->tstamp are not zeroed. Fixes: 9aa1206e8f482 ("bpf: Add redirect_peer helper") Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/1728ead5e0fe45e7a6542c36bd4e3ca07a73b7d6.1746460653.git.paul.chaignon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-06	net: add missing instance lock to dev_set_promiscuity	Stanislav Fomichev
	Accidentally spotted while trying to understand what else needs to be renamed to netif_ prefix. Most of the calls to dev_set_promiscuity are adjacent to dev_set_allmulti or dev_disable_lro so it should be safe to add the lock. Note that new netif_set_promiscuity is currently unused, the locked paths call __dev_set_promiscuity directly. Fixes: ad7c7b2172c3 ("net: hold netdev instance lock during sysfs operations") Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250506011919.2882313-1-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-06	net: Lock netdevices during dev_shutdown	Cosmin Ratiu
	__qdisc_destroy() calls into various qdiscs .destroy() op, which in turn can call .ndo_setup_tc(), which requires the netdev instance lock. This commit extends the critical section in unregister_netdevice_many_notify() to cover dev_shutdown() (and dev_tcx_uninstall() as a side-effect) and acquires the netdev instance lock in __dev_change_net_namespace() for the other dev_shutdown() call. This should now guarantee that for all qdisc ops, the netdev instance lock is held during .ndo_setup_tc(). Fixes: a0527ee2df3f ("net: hold netdev instance lock during qdisc ndo_setup_tc") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250505194713.1723399-1-cratiu@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-29	xsk: convert xdp_copy_frags_from_zc() to use page_pool_dev_alloc()	Bui Quang Minh
	This commit makes xdp_copy_frags_from_zc() use page allocation API page_pool_dev_alloc() instead of page_pool_dev_alloc_netmem() to avoid possible confusion of the returned value. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com> Link: https://patch.msgid.link/20250426081220.40689-3-minhquangbui99@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-29	xsk: respect the offsets when copying frags	Bui Quang Minh
	In commit 560d958c6c68 ("xsk: add generic XSk &xdp_buff -> skb conversion"), we introduce a helper to convert zerocopy xdp_buff to skb. However, in the frag copy, we mistakenly ignore the frag's offset. This commit adds the missing offset when copying frags in xdp_copy_frags_from_zc(). This function is not used anywhere so no backport is needed. Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com> Link: https://patch.msgid.link/20250426081220.40689-2-minhquangbui99@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-28	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc4	Alexei Starovoitov
	Cross-merge bpf and other fixes after downstream PRs. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-28	net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid	Christian Brauner
	Now that all preconditions are met, allow handing out pidfs for reaped sk->sk_peer_pids. Thanks to Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> for pointing out that we need to limit this to AF_UNIX sockets for now. Link: https://lore.kernel.org/20250425-work-pidfs-net-v2-4-450a19461e75@kernel.org Reviewed-by: David Rheinsberg <david@readahead.eu> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-24	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net	Jakub Kicinski
	Cross-merge networking fixes after downstream PR (net-6.15-rc4). This pull includes wireless and a fix to vxlan which isn't in Linus's tree just yet. The latter creates with a silent conflict / build breakage, so merging it now to avoid causing problems. drivers/net/vxlan/vxlan_vnifilter.c 094adad91310 ("vxlan: Use a single lock to protect the FDB table") 087a9eb9e597 ("vxlan: vnifilter: Fix unlocked deletion of default FDB entry") https://lore.kernel.org/20250423145131.513029-1-idosch@nvidia.com No "normal" conflicts, or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-23	net: Fix wild-memory-access in __register_pernet_operations() when ↵	Kuniyuki Iwashima
	CONFIG_NET_NS=n. kernel test robot reported the splat below. [0] Before commit fed176bf3143 ("net: Add ops_undo_single for module load/unload."), if CONFIG_NET_NS=n, ops was linked to pernet_list only when init_net had not been initialised, and ops was unlinked from pernet_list only under the same condition. Let's say an ops is loaded before the init_net setup but unloaded after that. Then, the ops remains in pernet_list, which seems odd. The cited commit added ops_undo_single(), which calls list_add() for ops to link it to a temporary list, so a minor change was added to __register_pernet_operations() and __unregister_pernet_operations() under CONFIG_NET_NS=n to avoid the pernet_list corruption. However, the corruption must have been left as is. When CONFIG_NET_NS=n, pernet_list was used to keep ops registered before the init_net setup, and after that, pernet_list was not used at all. This was because some ops annotated with __net_initdata are cleared out of memory at some point during boot. Then, such ops is initialised by POISON_FREE_INITMEM (0xcc), resulting in that ops->list.{next,prev} suddenly switches from a valid pointer to a weird value, 0xcccccccccccccccc. To avoid such wild memory access, let's allow the pernet_list corruption for CONFIG_NET_NS=n. [0]: Oops: general protection fault, probably for non-canonical address 0xf999959999999999: 0000 [#1] SMP KASAN NOPTI KASAN: maybe wild-memory-access in range [0xccccccccccccccc8-0xcccccccccccccccf] CPU: 2 UID: 0 PID: 346 Comm: modprobe Not tainted 6.15.0-rc1-00294-ga4cba7e98e35 #85 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:__list_add_valid_or_report (lib/list_debug.c:32) Code: 48 c1 ea 03 80 3c 02 00 0f 85 5a 01 00 00 49 39 74 24 08 0f 85 83 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 1f 01 00 00 4c 39 26 0f 85 ab 00 00 00 4c 39 ee RSP: 0018:ff11000135b87830 EFLAGS: 00010a07 RAX: dffffc0000000000 RBX: ffffffffc02223c0 RCX: ffffffff8406fcc2 RDX: 1999999999999999 RSI: cccccccccccccccc RDI: ffffffffc02223c0 RBP: ffffffff86064e40 R08: 0000000000000001 R09: fffffbfff0a9f5b5 R10: ffffffff854fadaf R11: 676552203a54454e R12: ffffffff86064e40 R13: ffffffffc02223c0 R14: ffffffff86064e48 R15: 0000000000000021 FS: 00007f6fb0d9e1c0(0000) GS:ff11000858ea0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f6fb0eda580 CR3: 0000000122fec005 CR4: 0000000000771ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> register_pernet_operations (./include/linux/list.h:150 (discriminator 5) ./include/linux/list.h:183 (discriminator 5) net/core/net_namespace.c:1315 (discriminator 5) net/core/net_namespace.c:1359 (discriminator 5)) register_pernet_subsys (net/core/net_namespace.c:1401) inet6_init (net/ipv6/af_inet6.c:535) ipv6 do_one_initcall (init/main.c:1257) do_init_module (kernel/module/main.c:2942) load_module (kernel/module/main.c:3409) init_module_from_file (kernel/module/main.c:3599) idempotent_init_module (kernel/module/main.c:3611) __x64_sys_finit_module (./include/linux/file.h:62 ./include/linux/file.h:83 kernel/module/main.c:3634 kernel/module/main.c:3621 kernel/module/main.c:3621) do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) RIP: 0033:0x7f6fb0df7e5d Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 9f 1b 00 f7 d8 64 89 01 48 RSP: 002b:00007fffdc6a8968 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 RAX: ffffffffffffffda RBX: 000055b535721b70 RCX: 00007f6fb0df7e5d RDX: 0000000000000000 RSI: 000055b51e44aa2a RDI: 0000000000000004 RBP: 0000000000040000 R08: 0000000000000000 R09: 000055b535721b30 R10: 0000000000000004 R11: 0000000000000246 R12: 000055b51e44aa2a R13: 000055b535721bf0 R14: 000055b5357220b0 R15: 0000000000000000 </TASK> Modules linked in: ipv6(+) crc_ccitt Fixes: fed176bf3143 ("net: Add ops_undo_single for module load/unload.") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202504181511.1c3f23e4-lkp@intel.com Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418215025.87871-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22	xdp: create locked/unlocked instances of xdp redirect target setters	Joshua Washington
	Commit 03df156dd3a6 ("xdp: double protect netdev->xdp_flags with netdev->lock") introduces the netdev lock to xdp_set_features_flag(). The change includes a _locked version of the method, as it is possible for a driver to have already acquired the netdev lock before calling this helper. However, the same applies to xdp_features_(set\|clear)_redirect_flags(), which ends up calling the unlocked version of xdp_set_features_flags() leading to deadlocks in GVE, which grabs the netdev lock as part of its suspend, reset, and shutdown processes: [ 833.265543] WARNING: possible recursive locking detected [ 833.270949] 6.15.0-rc1 #6 Tainted: G E [ 833.276271] -------------------------------------------- [ 833.281681] systemd-shutdow/1 is trying to acquire lock: [ 833.287090] ffff949d2b148c68 (&dev->lock){+.+.}-{4:4}, at: xdp_set_features_flag+0x29/0x90 [ 833.295470] [ 833.295470] but task is already holding lock: [ 833.301400] ffff949d2b148c68 (&dev->lock){+.+.}-{4:4}, at: gve_shutdown+0x44/0x90 [gve] [ 833.309508] [ 833.309508] other info that might help us debug this: [ 833.316130] Possible unsafe locking scenario: [ 833.316130] [ 833.322142] CPU0 [ 833.324681] ---- [ 833.327220] lock(&dev->lock); [ 833.330455] lock(&dev->lock); [ 833.333689] [ 833.333689] * DEADLOCK * [ 833.333689] [ 833.339701] May be due to missing lock nesting notation [ 833.339701] [ 833.346582] 5 locks held by systemd-shutdow/1: [ 833.351205] #0: ffffffffa9c89130 (system_transition_mutex){+.+.}-{4:4}, at: __se_sys_reboot+0xe6/0x210 [ 833.360695] #1: ffff93b399e5c1b8 (&dev->mutex){....}-{4:4}, at: device_shutdown+0xb4/0x1f0 [ 833.369144] #2: ffff949d19a471b8 (&dev->mutex){....}-{4:4}, at: device_shutdown+0xc2/0x1f0 [ 833.377603] #3: ffffffffa9eca050 (rtnl_mutex){+.+.}-{4:4}, at: gve_shutdown+0x33/0x90 [gve] [ 833.386138] #4: ffff949d2b148c68 (&dev->lock){+.+.}-{4:4}, at: gve_shutdown+0x44/0x90 [gve] Introduce xdp_features_(set\|clear)_redirect_target_locked() versions which assume that the netdev lock has already been acquired before setting the XDP feature flag and update GVE to use the locked version. Fixes: 03df156dd3a6 ("xdp: double protect netdev->xdp_flags with netdev->lock") Tested-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com> Signed-off-by: Joshua Washington <joshwash@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20250422011643.3509287-1-joshwash@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22	net: Drop hold_rtnl arg from ops_undo_list().	Kuniyuki Iwashima
	ops_undo_list() first iterates over ops_list for ->pre_exit(). Let's check if any of the ops has ->exit_rtnl() there and drop the hold_rtnl argument. Note that nexthop uses ->exit_rtnl() and is built-in, so hold_rtnl is always true for setup_net() and cleanup_net() for now. Suggested-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/netdev/20250414170148.21f3523c@kernel.org/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418003259.48017-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22	net: lwtunnel: disable BHs when required	Justin Iurman
	In lwtunnel_{output\|xmit}(), dev_xmit_recursion() may be called in preemptible scope for PREEMPT kernels. This patch disables BHs before calling dev_xmit_recursion(). BHs are re-enabled only at the end, since we must ensure the same CPU is used for both dev_xmit_recursion_inc() and dev_xmit_recursion_dec() (and any other recursion levels in some cases) in order to maintain valid per-cpu counters. Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Closes: https://lore.kernel.org/netdev/CAADnVQJFWn3dBFJtY+ci6oN1pDFL=TzCmNbRgey7MdYxt_AP2g@mail.gmail.com/ Reported-by: Eduard Zingerman <eddyz87@gmail.com> Closes: https://lore.kernel.org/netdev/m2h62qwf34.fsf@gmail.com/ Fixes: 986ffb3a57c5 ("net: lwtunnel: fix recursion loops") Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250416160716.8823-1-justin.iurman@uliege.be Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-22	net: selftests: initialize TCP header and skb payload with zero	Oleksij Rempel
	Zero-initialize TCP header via memset() to avoid garbage values that may affect checksum or behavior during test transmission. Also zero-fill allocated payload and padding regions using memset() after skb_put(), ensuring deterministic content for all outgoing test packets. Fixes: 3e1e58d64c3d ("net: add generic selftest support") Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Cc: stable@vger.kernel.org Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250416160125.2914724-1-o.rempel@pengutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-21	net: Use nlmsg_payload in rtnetlink file	Breno Leitao
	Leverage the new nlmsg_payload() helper to avoid checking for message size and then reading the nlmsg data. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250417-nlmsg_v3-v1-2-9b09d9d7e61d@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-21	net: Use nlmsg_payload in neighbour file	Breno Leitao
	Leverage the new nlmsg_payload() helper to avoid checking for message size and then reading the nlmsg data. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250417-nlmsg_v3-v1-1-9b09d9d7e61d@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-21	net: fix the missing unlock for detached devices	Jakub Kicinski
	The combined condition was left as is when we converted from __dev_get_by_index() to netdev_get_by_index_lock(). There was no need to undo anything with the former, for the latter we need an unlock. Fixes: 1d22d3060b9b ("net: drop rtnl_lock for queue_mgmt operations") Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250418015317.1954107-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-21	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc3	Alexei Starovoitov
	Cross-merge bpf and other fixes after downstream PRs. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-17	netdev: fix the locking for netdev notifications	Jakub Kicinski
	Kuniyuki reports that the assert for netdev lock fires when there are netdev event listeners (otherwise we skip the netlink event generation). Correct the locking when coming from the notifier. The NETDEV_XDP_FEAT_CHANGE notifier is already fully locked, it's the documentation that's incorrect. Fixes: 99e44f39a8f7 ("netdev: depend on netdev->lock for xdp features") Reported-by: syzkaller <syzkaller@googlegroups.com> Reported-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/20250410171019.62128-1-kuniyu@amazon.com Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250416030447.1077551-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-17	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net	Jakub Kicinski
	Cross-merge networking fixes after downstream PR (net-6.15-rc3). No conflicts. Adjacent changes: tools/net/ynl/pyynl/ynl_gen_c.py 4d07bbf2d456 ("tools: ynl-gen: don't declare loop iterator in place") 7e8ba0c7de2b ("tools: ynl: don't use genlmsghdr in classic netlink") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-17	Merge tag 'net-6.15-rc3' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from Bluetooth, CAN and Netfilter. Current release - regressions: - two fixes for the netdev per-instance locking - batman-adv: fix double-hold of meshif when getting enabled Current release - new code bugs: - Bluetooth: increment TX timestamping tskey always for stream sockets - wifi: static analysis and build fixes for the new Intel sub-driver Previous releases - regressions: - net: fib_rules: fix iif / oif matching on L3 master (VRF) device - ipv6: add exception routes to GC list in rt6_insert_exception() - netfilter: conntrack: fix erroneous removal of offload bit - Bluetooth: - fix sending MGMT_EV_DEVICE_FOUND for invalid address - l2cap: process valid commands in too long frame - btnxpuart: Revert baudrate change in nxp_shutdown Previous releases - always broken: - ethtool: fix memory corruption during SFP FW flashing - eth: - hibmcge: fixes for link and MTU handling, pause frames etc - igc: fixes for PTM (PCIe timestamping) - dsa: b53: enable BPDU reception for management port Misc: - fixes for Netlink protocol schemas" * tag 'net-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits) net: ethernet: mtk_eth_soc: revise QDMA packet scheduler settings net: ethernet: mtk_eth_soc: correct the max weight of the queue limit for 100Mbps net: ethernet: mtk_eth_soc: reapply mdc divider on reset net: ti: icss-iep: Fix possible NULL pointer dereference for perout request net: ti: icssg-prueth: Fix possible NULL pointer dereference inside emac_xmit_xdp_frame() net: ti: icssg-prueth: Fix kernel warning while bringing down network interface netfilter: conntrack: fix erronous removal of offload bit net: don't try to ops lock uninitialized devs ptp: ocp: fix start time alignment in ptp_ocp_signal_set net: dsa: avoid refcount warnings when ds->ops->tag_8021q_vlan_del() fails net: dsa: free routing table on probe failure net: dsa: clean up FDB, MDB, VLAN entries on unbind net: dsa: mv88e6xxx: fix -ENOENT when deleting VLANs and MST is unsupported net: dsa: mv88e6xxx: avoid unregistering devlink regions which were never registered net: txgbe: fix memory leak in txgbe_probe() error path net: bridge: switchdev: do not notify new brentries as changed net: b53: enable BPDU reception for management port netlink: specs: rt-neigh: prefix struct nfmsg members with ndm netlink: specs: rt-link: adjust mctp attribute naming netlink: specs: rtnetlink: attribute naming corrections ...
2025-04-17	net: pktgen: fix code style (WARNING: Prefer strscpy over strcpy)	Peter Seiderer
	Fix checkpatch code style warnings: WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88 #1423: FILE: net/core/pktgen.c:1423: + strcpy(pkt_dev->dst_min, buf); WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88 #1444: FILE: net/core/pktgen.c:1444: + strcpy(pkt_dev->dst_max, buf); WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88 #1554: FILE: net/core/pktgen.c:1554: + strcpy(pkt_dev->src_min, buf); WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88 #1575: FILE: net/core/pktgen.c:1575: + strcpy(pkt_dev->src_max, buf); WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88 #3231: FILE: net/core/pktgen.c:3231: + strcpy(pkt_dev->result, "Starting"); WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88 #3235: FILE: net/core/pktgen.c:3235: + strcpy(pkt_dev->result, "Error starting"); WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88 #3849: FILE: net/core/pktgen.c:3849: + strcpy(pkt_dev->odevname, ifname); While at it squash memset/strcpy pattern into single strscpy_pad call. Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250415112916.113455-4-ps.report@gmx.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-17	net: pktgen: fix code style (WARNING: please, no space before tabs)	Peter Seiderer
	Fix checkpatch code style warnings: WARNING: please, no space before tabs #230: FILE: net/core/pktgen.c:230: +#define M_NETIF_RECEIVE ^I1^I/* Inject packets into stack */$ Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250415112916.113455-3-ps.report@gmx.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-17	net: pktgen: fix code style (ERROR: else should follow close brace '}')	Peter Seiderer
	Fix checkpatch code style errors: ERROR: else should follow close brace '}' #1317: FILE: net/core/pktgen.c:1317: + } + else And checkpatch follow up code style check: CHECK: Unbalanced braces around else statement #1316: FILE: net/core/pktgen.c:1316: + } else Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250415112916.113455-2-ps.report@gmx.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-17	skb: implement skb_send_sock_locked_with_flags()	Antonio Quartulli
	When sending an skb over a socket using skb_send_sock_locked(), it is currently not possible to specify any flag to be set in msghdr->msg_flags. However, we may want to pass flags the user may have specified, like MSG_NOSIGNAL. Extend __skb_send_sock() with a new argument 'flags' and add a new interface named skb_send_sock_locked_with_flags(). Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Link: https://patch.msgid.link/20250415-b4-ovpn-v26-12-577f6097b964@openvpn.net Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-17	net, pidfd: report EINVAL for ESRCH	Christian Brauner
	dbus-broker relies -EINVAL being returned to indicate ESRCH in [1]. This causes issues for some workloads as reported in [2]. Paper over it until this is fixed in userspace. Link: https://lore.kernel.org/20250416-gegriffen-tiefbau-70cfecb80ac8@brauner Link: https://github.com/bus1/dbus-broker/blob/5d34d91b138fc802a016aa68c093eb81ea31139c/src/util/sockopt.c#L241 [1] Link: https://lore.kernel.org/20250415223454.GA1852104@ax162 [2] Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-16	net: don't try to ops lock uninitialized devs	Jakub Kicinski
	We need to be careful when operating on dev while in rtnl_create_link(). Some devices (vxlan) initialize netdev_ops in ->newlink, so later on. Avoid using netdev_lock_ops(), the device isn't registered so we cannot legally call its ops or generate any notifications for it. netdev_ops_assert_locked_or_invisible() is safe to use, it checks registration status first. Reported-by: syzbot+de1c7d68a10e3f123bdd@syzkaller.appspotmail.com Fixes: 04efcee6ef8d ("net: hold instance lock during NETDEV_CHANGE") Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250415151552.768373-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15	net: fib_rules: Fix iif / oif matching on L3 master device	Ido Schimmel
	Before commit 40867d74c374 ("net: Add l3mdev index to flow struct and avoid oif reset for port devices") it was possible to use FIB rules to match on a L3 domain. This was done by having a FIB rule match on iif / oif being a L3 master device. It worked because prior to the FIB rule lookup the iif / oif fields in the flow structure were reset to the index of the L3 master device to which the input / output device was enslaved to. The above scheme made it impossible to match on the original input / output device. Therefore, cited commit stopped overwriting the iif / oif fields in the flow structure and instead stored the index of the enslaving L3 master device in a new field ('flowi_l3mdev') in the flow structure. While the change enabled new use cases, it broke the original use case of matching on a L3 domain. Fix this by interpreting the iif / oif matching on a L3 master device as a match against the L3 domain. In other words, if the iif / oif in the FIB rule points to a L3 master device, compare the provided index against 'flowi_l3mdev' rather than 'flowi_{i,o}if'. Before cited commit, a FIB rule that matched on 'iif vrf1' would only match incoming traffic from devices enslaved to 'vrf1'. With the proposed change (i.e., comparing against 'flowi_l3mdev'), the rule would also match traffic originating from a socket bound to 'vrf1'. Avoid that by adding a new flow flag ('FLOWI_FLAG_L3MDEV_OIF') that indicates if the L3 domain was derived from the output interface or the input interface (when not set) and take this flag into account when evaluating the FIB rule against the flow structure. Avoid unnecessary checks in the data path by detecting that a rule matches on a L3 master device when the rule is installed and marking it as such. Tested using the following script [1]. Output before 40867d74c374 (v5.4.291): default dev dummy1 table 100 scope link default dev dummy1 table 200 scope link Output after 40867d74c374: default dev dummy1 table 300 scope link default dev dummy1 table 300 scope link Output with this patch: default dev dummy1 table 100 scope link default dev dummy1 table 200 scope link [1] #!/bin/bash ip link add name vrf1 up type vrf table 10 ip link add name dummy1 up master vrf1 type dummy sysctl -wq net.ipv4.conf.all.forwarding=1 sysctl -wq net.ipv4.conf.all.rp_filter=0 ip route add table 100 default dev dummy1 ip route add table 200 default dev dummy1 ip route add table 300 default dev dummy1 ip rule add prio 0 oif vrf1 table 100 ip rule add prio 1 iif vrf1 table 200 ip rule add prio 2 table 300 ip route get 192.0.2.1 oif dummy1 fibmatch ip route get 192.0.2.1 iif dummy1 from 198.51.100.1 fibmatch Fixes: 40867d74c374 ("net: Add l3mdev index to flow struct and avoid oif reset for port devices") Reported-by: hanhuihui <hanhuihui5@huawei.com> Closes: https://lore.kernel.org/netdev/ec671c4f821a4d63904d0da15d604b75@huawei.com/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250414172022.242991-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15	net: fib_rules: Use nlmsg_payload in fib_{new,del}rule()	Breno Leitao
	Leverage the new nlmsg_payload() helper to avoid checking for message size and then reading the nlmsg data. Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250414-nlmsg-v2-10-3d90cb42c6af@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15	net: fib_rules: Use nlmsg_payload in fib_valid_dumprule_req	Breno Leitao
	Leverage the new nlmsg_payload() helper to avoid checking for message size and then reading the nlmsg data. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250414-nlmsg-v2-9-3d90cb42c6af@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15	rtnetlink: Use nlmsg_payload in valid_fdb_dump_strict	Breno Leitao
	Leverage the new nlmsg_payload() helper to avoid checking for message size and then reading the nlmsg data. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250414-nlmsg-v2-4-3d90cb42c6af@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15	neighbour: Use nlmsg_payload in neigh_valid_get_req	Breno Leitao
	Update neigh_valid_get_req function to utilize the new nlmsg_payload() helper function. This change improves code clarity and safety by ensuring that the Netlink message payload is properly validated before accessing its data. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250414-nlmsg-v2-3-3d90cb42c6af@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15	neighbour: Use nlmsg_payload in neightbl_valid_dump_info	Breno Leitao
	Update neightbl_valid_dump_info function to utilize the new nlmsg_payload() helper function. This change improves code clarity and safety by ensuring that the Netlink message payload is properly validated before accessing its data. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250414-nlmsg-v2-2-3d90cb42c6af@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14	net: Remove ->exit_batch_rtnl().	Kuniyuki Iwashima
	There are no ->exit_batch_rtnl() users remaining. Let's remove the hook. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250411205258.63164-15-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14	net: Add ->exit_rtnl() hook to struct pernet_operations.	Kuniyuki Iwashima
	struct pernet_operations provides two batching hooks; ->exit_batch() and ->exit_batch_rtnl(). The batching variant is beneficial if ->exit() meets any of the following conditions: 1) ->exit() repeatedly acquires a global lock for each netns 2) ->exit() has a time-consuming operation that can be factored out (e.g. synchronize_rcu(), smp_mb(), etc) 3) ->exit() does not need to repeat the same iterations for each netns (e.g. inet_twsk_purge()) Currently, none of the ->exit_batch_rtnl() functions satisfy any of the above conditions because RTNL is factored out and held by the caller and all of these functions iterate over the dying netns list. Also, we want to hold per-netns RTNL there but avoid spreading __rtnl_net_lock() across multiple locations. Let's add ->exit_rtnl() hook and run it under __rtnl_net_lock(). The following patches will convert all ->exit_batch_rtnl() users to ->exit_rtnl(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250411205258.63164-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14	net: Add ops_undo_single for module load/unload.	Kuniyuki Iwashima
	If ops_init() fails while loading a module or we unload the module, free_exit_list() rolls back the changes. The rollback sequence is the same as ops_undo_list(). The ops is already removed from pernet_list before calling free_exit_list(). If we link the ops to a temporary list, we can reuse ops_undo_list(). Let's add a wrapper of ops_undo_list() and use it instead of free_exit_list(). Now, we have the central place to roll back ops_init(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250411205258.63164-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14	net: Factorise setup_net() and cleanup_net().	Kuniyuki Iwashima
	When we roll back the changes made by struct pernet_operations.init(), we execute mostly identical sequences in three places. * setup_net() * cleanup_net() * free_exit_list() The only difference between the first two is which list and RCU helpers to use. In setup_net(), an ops could fail on the way, so we need to perform a reverse walk from its previous ops in pernet_list. OTOH, in cleanup_net(), we iterate the full list from tail to head. The former passes the failed ops to list_for_each_entry_continue_reverse(). It's tricky, but we can reuse it for the latter if we pass list_entry() of the head node. Also, synchronize_rcu() and synchronize_rcu_expedited() can be easily switched by an argument. Let's factorise the rollback part in setup_net() and cleanup_net(). In the next patch, ops_undo_list() will be reused for free_exit_list(), and then two arguments (ops_list and hold_rtnl) will differ. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250411205258.63164-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14	page_pool: Track DMA-mapped pages and unmap them when destroying the pool	Toke Høiland-Jørgensen
	When enabling DMA mapping in page_pool, pages are kept DMA mapped until they are released from the pool, to avoid the overhead of re-mapping the pages every time they are used. This causes resource leaks and/or crashes when there are pages still outstanding while the device is torn down, because page_pool will attempt an unmap through a non-existent DMA device on the subsequent page return. To fix this, implement a simple tracking of outstanding DMA-mapped pages in page pool using an xarray. This was first suggested by Mina[0], and turns out to be fairly straight forward: We simply store pointers to pages directly in the xarray with xa_alloc() when they are first DMA mapped, and remove them from the array on unmap. Then, when a page pool is torn down, it can simply walk the xarray and unmap all pages still present there before returning, which also allows us to get rid of the get/put_device() calls in page_pool. Using xa_cmpxchg(), no additional synchronisation is needed, as a page will only ever be unmapped once. To avoid having to walk the entire xarray on unmap to find the page reference, we stash the ID assigned by xa_alloc() into the page structure itself, using the upper bits of the pp_magic field. This requires a couple of defines to avoid conflicting with the POINTER_POISON_DELTA define, but this is all evaluated at compile-time, so does not affect run-time performance. The bitmap calculations in this patch gives the following number of bits for different architectures: - 23 bits on 32-bit architectures - 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE) - 32 bits on other 64-bit architectures Stashing a value into the unused bits of pp_magic does have the effect that it can make the value stored there lie outside the unmappable range (as governed by the mmap_min_addr sysctl), for architectures that don't define ILLEGAL_POINTER_VALUE. This means that if one of the pointers that is aliased to the pp_magic field (such as page->lru.next) is dereferenced while the page is owned by page_pool, that could lead to a dereference into userspace, which is a security concern. The risk of this is mitigated by the fact that (a) we always clear pp_magic before releasing a page from page_pool, and (b) this would need a use-after-free bug for struct page, which can have many other risks since page->lru.next is used as a generic list pointer in multiple places in the kernel. As such, with this patch we take the position that this risk is negligible in practice. For more discussion, see[1]. Since all the tracking added in this patch is performed on DMA map/unmap, no additional code is needed in the fast path, meaning the performance overhead of this tracking is negligible there. A micro-benchmark shows that the total overhead of the tracking itself is about 400 ns (39 cycles(tsc) 395.218 ns; sum for both map and unmap[2]). Since this cost is only paid on DMA map and unmap, it seems like an acceptable cost to fix the late unmap issue. Further optimisation can narrow the cases where this cost is paid (for instance by eliding the tracking when DMA map/unmap is a no-op). The extra memory needed to track the pages is neatly encapsulated inside xarray, which uses the 'struct xa_node' structure to track items. This structure is 576 bytes long, with slots for 64 items, meaning that a full node occurs only 9 bytes of overhead per slot it tracks (in practice, it probably won't be this efficient, but in any case it should be an acceptable overhead). [0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/ [1] https://lore.kernel.org/r/20250320023202.GA25514@openwall.com [2] https://lore.kernel.org/r/ae07144c-9295-4c9d-a400-153bb689fe9e@huawei.com Reported-by: Yonglong Liu <liuyonglong@huawei.com> Closes: https://lore.kernel.org/r/8743264a-9700-4227-a556-5f931c720211@huawei.com Fixes: ff7d6b27f894 ("page_pool: refurbish version of page_pool code") Suggested-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> Tested-by: Jesper Dangaard Brouer <hawk@kernel.org> Tested-by: Qiuling Ren <qren@redhat.com> Tested-by: Yuying Ma <yuma@redhat.com> Tested-by: Yonglong Liu <liuyonglong@huawei.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20250409-page-pool-track-dma-v9-2-6a9ef2e0cba8@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14	page_pool: Move pp_magic check into helper functions	Toke Høiland-Jørgensen
	Since we are about to stash some more information into the pp_magic field, let's move the magic signature checks into a pair of helper functions so it can be changed in one place. Reviewed-by: Mina Almasry <almasrymina@google.com> Tested-by: Yonglong Liu <liuyonglong@huawei.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20250409-page-pool-track-dma-v9-1-6a9ef2e0cba8@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14	net: pktgen: fix code style (WARNING: quoted string split across lines)	Peter Seiderer
	Fix checkpatch code style warnings: WARNING: quoted string split across lines #480: FILE: net/core/pktgen.c:480: + "Packet Generator for packet performance testing. " + "Version: " VERSION "\n"; WARNING: quoted string split across lines #632: FILE: net/core/pktgen.c:632: + " udp_src_min: %d udp_src_max: %d" + " udp_dst_min: %d udp_dst_max: %d\n", Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14	net: pktgen: fix code style (WARNING: macros should not use a trailing ↵	Peter Seiderer
	semicolon) Fix checkpatch code style warnings: WARNING: macros should not use a trailing semicolon #180: FILE: net/core/pktgen.c:180: +#define func_enter() pr_debug("entering %s\n", __func__); WARNING: macros should not use a trailing semicolon #234: FILE: net/core/pktgen.c:234: +#define if_lock(t) mutex_lock(&(t->if_lock)); CHECK: Unnecessary parentheses around t->if_lock #235: FILE: net/core/pktgen.c:235: +#define if_unlock(t) mutex_unlock(&(t->if_lock)); Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14	net: pktgen: fix code style (WARNING: Missing a blank line after declarations)	Peter Seiderer
	Fix checkpatch code style warnings: WARNING: Missing a blank line after declarations #761: FILE: net/core/pktgen.c:761: + char c; + if (get_user(c, &user_buffer[i])) WARNING: Missing a blank line after declarations #780: FILE: net/core/pktgen.c:780: + char c; + if (get_user(c, &user_buffer[i])) WARNING: Missing a blank line after declarations #806: FILE: net/core/pktgen.c:806: + char c; + if (get_user(c, &user_buffer[i])) WARNING: Missing a blank line after declarations #823: FILE: net/core/pktgen.c:823: + char c; + if (get_user(c, &user_buffer[i])) WARNING: Missing a blank line after declarations #1968: FILE: net/core/pktgen.c:1968: + char f[32]; + memset(f, 0, 32); WARNING: Missing a blank line after declarations #2410: FILE: net/core/pktgen.c:2410: + struct pktgen_net pn = net_generic(dev_net(pkt_dev->odev), pg_net_id); + if (!x) { WARNING: Missing a blank line after declarations #2442: FILE: net/core/pktgen.c:2442: + __u16 t; + if (pkt_dev->flags & F_QUEUE_MAP_RND) { WARNING: Missing a blank line after declarations #2523: FILE: net/core/pktgen.c:2523: + unsigned int i; + for (i = 0; i < pkt_dev->nr_labels; i++) WARNING: Missing a blank line after declarations #2567: FILE: net/core/pktgen.c:2567: + __u32 t; + if (pkt_dev->flags & F_IPSRC_RND) WARNING: Missing a blank line after declarations #2587: FILE: net/core/pktgen.c:2587: + __be32 s; + if (pkt_dev->flags & F_IPDST_RND) { WARNING: Missing a blank line after declarations #2634: FILE: net/core/pktgen.c:2634: + __u32 t; + if (pkt_dev->flags & F_TXSIZE_RND) { WARNING: Missing a blank line after declarations #2736: FILE: net/core/pktgen.c:2736: + int i; + for (i = 0; i < pkt_dev->cflows; i++) { WARNING: Missing a blank line after declarations #2738: FILE: net/core/pktgen.c:2738: + struct xfrm_state x = pkt_dev->flows[i].x; + if (x) { WARNING: Missing a blank line after declarations #2752: FILE: net/core/pktgen.c:2752: + int nhead = 0; + if (x) { WARNING: Missing a blank line after declarations #2795: FILE: net/core/pktgen.c:2795: + unsigned int i; + for (i = 0; i < pkt_dev->nr_labels; i++) WARNING: Missing a blank line after declarations #3480: FILE: net/core/pktgen.c:3480: + ktime_t idle_start = ktime_get(); + schedule(); Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14	net: pktgen: fix code style (WARNING: Block comments)	Peter Seiderer
	Fix checkpatch code style warnings: WARNING: Block comments use a trailing / on a separate line + removal by worker thread / WARNING: Block comments use on subsequent lines + __u8 tos; /* six MSB of (former) IPv4 TOS + are for dscp codepoint / WARNING: Block comments use a trailing / on a separate line + are for dscp codepoint / WARNING: Block comments use on subsequent lines + __u8 traffic_class; /* ditto for the (former) Traffic Class in IPv6 + (see RFC 3260, sec. 4) / WARNING: Block comments use a trailing / on a separate line + (see RFC 3260, sec. 4) / WARNING: Block comments use on subsequent lines + /* = { + 0x00, 0x80, 0xC8, 0x79, 0xB3, 0xCB, WARNING: Block comments use * on subsequent lines + /* Field for thread to receive "posted" events terminate, + stop ifs etc. / WARNING: Block comments use a trailing / on a separate line + stop ifs etc. / WARNING: Block comments should align the on each line + * we go look for it ... +/ WARNING: Block comments use a trailing / on a separate line + * we resolve the dst issue / WARNING: Block comments use a trailing / on a separate line + * with proc_create_data() */ Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14	net: pktgen: fix code style (WARNING: suspect code indent for conditional ↵	Peter Seiderer
	statements) Fix checkpatch code style warnings: WARNING: suspect code indent for conditional statements (8, 17) #2901: FILE: net/core/pktgen.c:2901: + } else { + skb = __netdev_alloc_skb(dev, size, GFP_NOWAIT); Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>