summaryrefslogtreecommitdiff
path: root/net/ipv6
AgeCommit message (Collapse)Author
2024-06-19tcp: use sk_skb_reason_drop to free rx packetsYan Zhai
Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving socket to the tracepoint. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/r/202406011539.jhwBd7DX-lkp@intel.com/ Signed-off-by: Yan Zhai <yan@cloudflare.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-19net: raw: use sk_skb_reason_drop to free rx packetsYan Zhai
Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving socket to the tracepoint. Signed-off-by: Yan Zhai <yan@cloudflare.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-17xfrm6: check ip6_dst_idev() return value in xfrm6_get_saddr()Eric Dumazet
ip6_dst_idev() can return NULL, xfrm6_get_saddr() must act accordingly. syzbot reported: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN PTI KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 1 PID: 12 Comm: kworker/u8:1 Not tainted 6.10.0-rc2-syzkaller-00383-gb8481381d4e2 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024 Workqueue: wg-kex-wg1 wg_packet_handshake_send_worker RIP: 0010:xfrm6_get_saddr+0x93/0x130 net/ipv6/xfrm6_policy.c:64 Code: df 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 97 00 00 00 4c 8b ab d8 00 00 00 48 b8 00 00 00 00 00 fc ff df 4c 89 ea 48 c1 ea 03 <80> 3c 02 00 0f 85 86 00 00 00 4d 8b 6d 00 e8 ca 13 47 01 48 b8 00 RSP: 0018:ffffc90000117378 EFLAGS: 00010246 RAX: dffffc0000000000 RBX: ffff88807b079dc0 RCX: ffffffff89a0d6d7 RDX: 0000000000000000 RSI: ffffffff89a0d6e9 RDI: ffff88807b079e98 RBP: ffff88807ad73248 R08: 0000000000000007 R09: fffffffffffff000 R10: ffff88807b079dc0 R11: 0000000000000007 R12: ffffc90000117480 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff8880b9300000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f4586d00440 CR3: 0000000079042000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> xfrm_get_saddr net/xfrm/xfrm_policy.c:2452 [inline] xfrm_tmpl_resolve_one net/xfrm/xfrm_policy.c:2481 [inline] xfrm_tmpl_resolve+0xa26/0xf10 net/xfrm/xfrm_policy.c:2541 xfrm_resolve_and_create_bundle+0x140/0x2570 net/xfrm/xfrm_policy.c:2835 xfrm_bundle_lookup net/xfrm/xfrm_policy.c:3070 [inline] xfrm_lookup_with_ifid+0x4d1/0x1e60 net/xfrm/xfrm_policy.c:3201 xfrm_lookup net/xfrm/xfrm_policy.c:3298 [inline] xfrm_lookup_route+0x3b/0x200 net/xfrm/xfrm_policy.c:3309 ip6_dst_lookup_flow+0x15c/0x1d0 net/ipv6/ip6_output.c:1256 send6+0x611/0xd20 drivers/net/wireguard/socket.c:139 wg_socket_send_skb_to_peer+0xf9/0x220 drivers/net/wireguard/socket.c:178 wg_socket_send_buffer_to_peer+0x12b/0x190 drivers/net/wireguard/socket.c:200 wg_packet_send_handshake_initiation+0x227/0x360 drivers/net/wireguard/send.c:40 wg_packet_handshake_send_worker+0x1c/0x30 drivers/net/wireguard/send.c:51 process_one_work+0x9fb/0x1b60 kernel/workqueue.c:3231 process_scheduled_works kernel/workqueue.c:3312 [inline] worker_thread+0x6c8/0xf70 kernel/workqueue.c:3393 kthread+0x2c1/0x3a0 kernel/kthread.c:389 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240615154231.234442-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-17ipv6: prevent possible NULL dereference in rt6_probe()Eric Dumazet
syzbot caught a NULL dereference in rt6_probe() [1] Bail out if __in6_dev_get() returns NULL. [1] Oops: general protection fault, probably for non-canonical address 0xdffffc00000000cb: 0000 [#1] PREEMPT SMP KASAN PTI KASAN: null-ptr-deref in range [0x0000000000000658-0x000000000000065f] CPU: 1 PID: 22444 Comm: syz-executor.0 Not tainted 6.10.0-rc2-syzkaller-00383-gb8481381d4e2 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024 RIP: 0010:rt6_probe net/ipv6/route.c:656 [inline] RIP: 0010:find_match+0x8c4/0xf50 net/ipv6/route.c:758 Code: 14 fd f7 48 8b 85 38 ff ff ff 48 c7 45 b0 00 00 00 00 48 8d b8 5c 06 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 19 RSP: 0018:ffffc900034af070 EFLAGS: 00010203 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffc90004521000 RDX: 00000000000000cb RSI: ffffffff8990d0cd RDI: 000000000000065c RBP: ffffc900034af150 R08: 0000000000000005 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000002 R12: 000000000000000a R13: 1ffff92000695e18 R14: ffff8880244a1d20 R15: 0000000000000000 FS: 00007f4844a5a6c0(0000) GS:ffff8880b9300000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b31b27000 CR3: 000000002d42c000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> rt6_nh_find_match+0xfa/0x1a0 net/ipv6/route.c:784 nexthop_for_each_fib6_nh+0x26d/0x4a0 net/ipv4/nexthop.c:1496 __find_rr_leaf+0x6e7/0xe00 net/ipv6/route.c:825 find_rr_leaf net/ipv6/route.c:853 [inline] rt6_select net/ipv6/route.c:897 [inline] fib6_table_lookup+0x57e/0xa30 net/ipv6/route.c:2195 ip6_pol_route+0x1cd/0x1150 net/ipv6/route.c:2231 pol_lookup_func include/net/ip6_fib.h:616 [inline] fib6_rule_lookup+0x386/0x720 net/ipv6/fib6_rules.c:121 ip6_route_output_flags_noref net/ipv6/route.c:2639 [inline] ip6_route_output_flags+0x1d0/0x640 net/ipv6/route.c:2651 ip6_dst_lookup_tail.constprop.0+0x961/0x1760 net/ipv6/ip6_output.c:1147 ip6_dst_lookup_flow+0x99/0x1d0 net/ipv6/ip6_output.c:1250 rawv6_sendmsg+0xdab/0x4340 net/ipv6/raw.c:898 inet_sendmsg+0x119/0x140 net/ipv4/af_inet.c:853 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg net/socket.c:745 [inline] sock_write_iter+0x4b8/0x5c0 net/socket.c:1160 new_sync_write fs/read_write.c:497 [inline] vfs_write+0x6b6/0x1140 fs/read_write.c:590 ksys_write+0x1f8/0x260 fs/read_write.c:643 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: 52e1635631b3 ("[IPV6]: ROUTE: Add router_probe_interval sysctl.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240615151454.166404-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-17ipv6: prevent possible NULL deref in fib6_nh_init()Eric Dumazet
syzbot reminds us that in6_dev_get() can return NULL. fib6_nh_init() ip6_validate_gw( &idev ) ip6_route_check_nh( idev ) *idev = in6_dev_get(dev); // can be NULL Oops: general protection fault, probably for non-canonical address 0xdffffc00000000bc: 0000 [#1] PREEMPT SMP KASAN PTI KASAN: null-ptr-deref in range [0x00000000000005e0-0x00000000000005e7] CPU: 0 PID: 11237 Comm: syz-executor.3 Not tainted 6.10.0-rc2-syzkaller-00249-gbe27b8965297 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/07/2024 RIP: 0010:fib6_nh_init+0x640/0x2160 net/ipv6/route.c:3606 Code: 00 00 fc ff df 4c 8b 64 24 58 48 8b 44 24 28 4c 8b 74 24 30 48 89 c1 48 89 44 24 28 48 8d 98 e0 05 00 00 48 89 d8 48 c1 e8 03 <42> 0f b6 04 38 84 c0 0f 85 b3 17 00 00 8b 1b 31 ff 89 de e8 b8 8b RSP: 0018:ffffc900032775a0 EFLAGS: 00010202 RAX: 00000000000000bc RBX: 00000000000005e0 RCX: 0000000000000000 RDX: 0000000000000010 RSI: ffffc90003277a54 RDI: ffff88802b3a08d8 RBP: ffffc900032778b0 R08: 00000000000002fc R09: 0000000000000000 R10: 00000000000002fc R11: 0000000000000000 R12: ffff88802b3a08b8 R13: 1ffff9200064eec8 R14: ffffc90003277a00 R15: dffffc0000000000 FS: 00007f940feb06c0(0000) GS:ffff8880b9400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 00000000245e8000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ip6_route_info_create+0x99e/0x12b0 net/ipv6/route.c:3809 ip6_route_add+0x28/0x160 net/ipv6/route.c:3853 ipv6_route_ioctl+0x588/0x870 net/ipv6/route.c:4483 inet6_ioctl+0x21a/0x280 net/ipv6/af_inet6.c:579 sock_do_ioctl+0x158/0x460 net/socket.c:1222 sock_ioctl+0x629/0x8e0 net/socket.c:1341 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:907 [inline] __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:893 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f940f07cea9 Fixes: 428604fb118f ("ipv6: do not set routes if disable_ipv6 has been enabled") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Lorenzo Bianconi <lorenzo@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240614082002.26407-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-17xfrm: Log input direction mismatch error in one placeAntony Antony
Previously, the offload data path decrypted the packet before checking the direction, leading to error logging and packet dropping. However, dropped packets wouldn't be visible in tcpdump or audit log. With this fix, the offload path, upon noticing SA direction mismatch, will pass the packet to the stack without decrypting it. The L3 layer will then log the error, audit, and drop ESP without decrypting or decapsulating it. This also ensures that the slow path records the error and audit log, making dropped packets visible in tcpdump. Fixes: 304b44f0d5a4 ("xfrm: Add dir validation to "in" data path lookup") Signed-off-by: Antony Antony <antony.antony@secunet.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-06-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts, no adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-12net/ipv6: Fix the RT cache flush via sysctl using a previous delayPetr Pavlu
The net.ipv6.route.flush system parameter takes a value which specifies a delay used during the flush operation for aging exception routes. The written value is however not used in the currently requested flush and instead utilized only in the next one. A problem is that ipv6_sysctl_rtcache_flush() first reads the old value of net->ipv6.sysctl.flush_delay into a local delay variable and then calls proc_dointvec() which actually updates the sysctl based on the provided input. Fix the problem by switching the order of the two operations. Fixes: 4990509f19e8 ("[NETNS][IPV6]: Make sysctls route per namespace.") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240607112828.30285-1-petr.pavlu@suse.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-12net: ipv4,ipv6: Pass multipath hash computation through a helperPetr Machata
The following patches will add a sysctl to control multipath hash seed. In order to centralize the hash computation, add a helper, fib_multipath_hash_from_keys(), and have all IPv4 and IPv6 route.c invocations of flow_hash_from_keys() go through this helper instead. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240607151357.421181-2-petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-11netfilter: Use flowlabel flow key when re-routing mangled packetsFlorian Westphal
'ip6 dscp set $v' in an nftables outpute route chain has no effect. While nftables does detect the dscp change and calls the reroute hook. But ip6_route_me_harder never sets the dscp/flowlabel: flowlabel/dsfield routing rules are ignored and no reroute takes place. Thanks to Yi Chen for an excellent reproducer script that I used to validate this change. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Yi Chen <yiche@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-06-10tcp: fix race in tcp_v6_syn_recv_sock()Eric Dumazet
tcp_v6_syn_recv_sock() calls ip6_dst_store() before inet_sk(newsk)->pinet6 has been set up. This means ip6_dst_store() writes over the parent (listener) np->dst_cookie. This is racy because multiple threads could share the same parent and their final np->dst_cookie could be wrong. Move ip6_dst_store() call after inet_sk(newsk)->pinet6 has been changed and after the copy of parent ipv6_pinfo. Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: drivers/net/ethernet/pensando/ionic/ionic_txrx.c d9c04209990b ("ionic: Mark error paths in the data path as unlikely") 491aee894a08 ("ionic: fix kernel panic in XDP_TX action") net/ipv6/ip6_fib.c b4cb4a1391dc ("net: use unrcu_pointer() helper") b01e1c030770 ("ipv6: fix possible race in __fib6_drop_pcpu_from()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-06ipv6: fix possible race in __fib6_drop_pcpu_from()Eric Dumazet
syzbot found a race in __fib6_drop_pcpu_from() [1] If compiler reads more than once (*ppcpu_rt), second read could read NULL, if another cpu clears the value in rt6_get_pcpu_route(). Add a READ_ONCE() to prevent this race. Also add rcu_read_lock()/rcu_read_unlock() because we rely on RCU protection while dereferencing pcpu_rt. [1] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000012: 0000 [#1] PREEMPT SMP KASAN PTI KASAN: null-ptr-deref in range [0x0000000000000090-0x0000000000000097] CPU: 0 PID: 7543 Comm: kworker/u8:17 Not tainted 6.10.0-rc1-syzkaller-00013-g2bfcfd584ff5 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024 Workqueue: netns cleanup_net RIP: 0010:__fib6_drop_pcpu_from.part.0+0x10a/0x370 net/ipv6/ip6_fib.c:984 Code: f8 48 c1 e8 03 80 3c 28 00 0f 85 16 02 00 00 4d 8b 3f 4d 85 ff 74 31 e8 74 a7 fa f7 49 8d bf 90 00 00 00 48 89 f8 48 c1 e8 03 <80> 3c 28 00 0f 85 1e 02 00 00 49 8b 87 90 00 00 00 48 8b 0c 24 48 RSP: 0018:ffffc900040df070 EFLAGS: 00010206 RAX: 0000000000000012 RBX: 0000000000000001 RCX: ffffffff89932e16 RDX: ffff888049dd1e00 RSI: ffffffff89932d7c RDI: 0000000000000091 RBP: dffffc0000000000 R08: 0000000000000005 R09: 0000000000000007 R10: 0000000000000001 R11: 0000000000000006 R12: ffff88807fa080b8 R13: fffffbfff1a9a07d R14: ffffed100ff41022 R15: 0000000000000001 FS: 0000000000000000(0000) GS:ffff8880b9200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b32c26000 CR3: 000000005d56e000 CR4: 00000000003526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> __fib6_drop_pcpu_from net/ipv6/ip6_fib.c:966 [inline] fib6_drop_pcpu_from net/ipv6/ip6_fib.c:1027 [inline] fib6_purge_rt+0x7f2/0x9f0 net/ipv6/ip6_fib.c:1038 fib6_del_route net/ipv6/ip6_fib.c:1998 [inline] fib6_del+0xa70/0x17b0 net/ipv6/ip6_fib.c:2043 fib6_clean_node+0x426/0x5b0 net/ipv6/ip6_fib.c:2205 fib6_walk_continue+0x44f/0x8d0 net/ipv6/ip6_fib.c:2127 fib6_walk+0x182/0x370 net/ipv6/ip6_fib.c:2175 fib6_clean_tree+0xd7/0x120 net/ipv6/ip6_fib.c:2255 __fib6_clean_all+0x100/0x2d0 net/ipv6/ip6_fib.c:2271 rt6_sync_down_dev net/ipv6/route.c:4906 [inline] rt6_disable_ip+0x7ed/0xa00 net/ipv6/route.c:4911 addrconf_ifdown.isra.0+0x117/0x1b40 net/ipv6/addrconf.c:3855 addrconf_notify+0x223/0x19e0 net/ipv6/addrconf.c:3778 notifier_call_chain+0xb9/0x410 kernel/notifier.c:93 call_netdevice_notifiers_info+0xbe/0x140 net/core/dev.c:1992 call_netdevice_notifiers_extack net/core/dev.c:2030 [inline] call_netdevice_notifiers net/core/dev.c:2044 [inline] dev_close_many+0x333/0x6a0 net/core/dev.c:1585 unregister_netdevice_many_notify+0x46d/0x19f0 net/core/dev.c:11193 unregister_netdevice_many net/core/dev.c:11276 [inline] default_device_exit_batch+0x85b/0xae0 net/core/dev.c:11759 ops_exit_list+0x128/0x180 net/core/net_namespace.c:178 cleanup_net+0x5b7/0xbf0 net/core/net_namespace.c:640 process_one_work+0x9fb/0x1b60 kernel/workqueue.c:3231 process_scheduled_works kernel/workqueue.c:3312 [inline] worker_thread+0x6c8/0xf70 kernel/workqueue.c:3393 kthread+0x2c1/0x3a0 kernel/kthread.c:389 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 Fixes: d52d3997f843 ("ipv6: Create percpu rt6_info") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20240604193549.981839-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06inet: remove (struct uncached_list)->quarantineEric Dumazet
This list is used to tranfert dst that are handled by rt_flush_dev() and rt6_uncached_list_flush_dev() out of the per-cpu lists. But quarantine list is not used later. If we simply use list_del_init(&rt->dst.rt_uncached), this also removes the dst from per-cpu list. This patch also makes the future calls to rt_del_uncached_list() and rt6_uncached_list_del() faster, because no spinlock acquisition is needed anymore. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240604165150.726382-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06net: use unrcu_pointer() helperEric Dumazet
Toke mentioned unrcu_pointer() existence, allowing to remove some of the ugly casts we have when using xchg() for rcu protected pointers. Also make inet_rcv_compat const. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20240604111603.45871-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-05tcp: annotate data-races around tw->tw_ts_recent and tw->tw_ts_recent_stampEric Dumazet
These fields can be read and written locklessly, add annotations around these minor races. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05net: remove NULL-pointer net parameter in ip_metrics_convertJason Xing
When I was doing some experiments, I found that when using the first parameter, namely, struct net, in ip_metrics_convert() always triggers NULL pointer crash. Then I digged into this part, realizing that we can remove this one due to its uselessness. Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-03ila: block BH in ila_output()Eric Dumazet
As explained in commit 1378817486d6 ("tipc: block BH before using dst_cache"), net/core/dst_cache.c helpers need to be called with BH disabled. ila_output() is called from lwtunnel_output() possibly from process context, and under rcu_read_lock(). We might be interrupted by a softirq, re-enter ila_output() and corrupt dst_cache data structures. Fix the race by using local_bh_disable(). Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240531132636.2637995-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-03ipv6: sr: block BH in seg6_output_core() and seg6_input_core()Eric Dumazet
As explained in commit 1378817486d6 ("tipc: block BH before using dst_cache"), net/core/dst_cache.c helpers need to be called with BH disabled. Disabling preemption in seg6_output_core() is not good enough, because seg6_output_core() is called from process context, lwtunnel_output() only uses rcu_read_lock(). We might be interrupted by a softirq, re-enter seg6_output_core() and corrupt dst_cache data structures. Fix the race by using local_bh_disable() instead of preempt_disable(). Apply a similar change in seg6_input_core(). Fixes: fa79581ea66c ("ipv6: sr: fix several BUGs when preemption is enabled") Fixes: 6c8702c60b88 ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: David Lebrun <dlebrun@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240531132636.2637995-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-03net: ipv6: rpl_iptunnel: block BH in rpl_output() and rpl_input()Eric Dumazet
As explained in commit 1378817486d6 ("tipc: block BH before using dst_cache"), net/core/dst_cache.c helpers need to be called with BH disabled. Disabling preemption in rpl_output() is not good enough, because rpl_output() is called from process context, lwtunnel_output() only uses rcu_read_lock(). We might be interrupted by a softirq, re-enter rpl_output() and corrupt dst_cache data structures. Fix the race by using local_bh_disable() instead of preempt_disable(). Apply a similar change in rpl_input(). Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Alexander Aring <aahringo@redhat.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240531132636.2637995-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-03ipv6: ioam: block BH from ioam6_output()Eric Dumazet
As explained in commit 1378817486d6 ("tipc: block BH before using dst_cache"), net/core/dst_cache.c helpers need to be called with BH disabled. Disabling preemption in ioam6_output() is not good enough, because ioam6_output() is called from process context, lwtunnel_output() only uses rcu_read_lock(). We might be interrupted by a softirq, re-enter ioam6_output() and corrupt dst_cache data structures. Fix the race by using local_bh_disable() instead of preempt_disable(). Fixes: 8cb3bf8bff3c ("ipv6: ioam: Add support for the ip6ip6 encapsulation") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Justin Iurman <justin.iurman@uliege.be> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240531132636.2637995-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/ti/icssg/icssg_classifier.c abd5576b9c57 ("net: ti: icssg-prueth: Add support for ICSSG switch firmware") 56a5cf538c3f ("net: ti: icssg-prueth: Fix start counter for ft1 filter") https://lore.kernel.org/all/20240531123822.3bb7eadf@canb.auug.org.au/ No other adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-30ipv6: sr: restruct ifdefinesHangbin Liu
There are too many ifdef in IPv6 segment routing code that may cause logic problems. like commit 160e9d275218 ("ipv6: sr: fix invalid unregister error path"). To avoid this, the init functions are redefined for both cases. The code could be more clear after all fidefs are removed. Suggested-by: Simon Horman <horms@kernel.org> Suggested-by: David Ahern <dsahern@kernel.org> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240529040908.3472952-1-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-29net: fix __dst_negative_advice() raceEric Dumazet
__dst_negative_advice() does not enforce proper RCU rules when sk->dst_cache must be cleared, leading to possible UAF. RCU rules are that we must first clear sk->sk_dst_cache, then call dst_release(old_dst). Note that sk_dst_reset(sk) is implementing this protocol correctly, while __dst_negative_advice() uses the wrong order. Given that ip6_negative_advice() has special logic against RTF_CACHE, this means each of the three ->negative_advice() existing methods must perform the sk_dst_reset() themselves. Note the check against NULL dst is centralized in __dst_negative_advice(), there is no need to duplicate it in various callbacks. Many thanks to Clement Lecigne for tracking this issue. This old bug became visible after the blamed commit, using UDP sockets. Fixes: a87cb3e48ee8 ("net: Facility to report route quality of connected sockets") Reported-by: Clement Lecigne <clecigne@google.com> Diagnosed-by: Clement Lecigne <clecigne@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <tom@herbertland.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240528114353.1794151-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-29tcp: fix races in tcp_v[46]_err()Eric Dumazet
These functions have races when they: 1) Write sk->sk_err 2) call sk_error_report(sk) 3) call tcp_done(sk) As described in prior patches in this series: An smp_wmb() is missing. We should call tcp_done() before sk_error_report(sk) to have consistent tcp_poll() results on SMP hosts. Use tcp_done_with_error() where we centralized the correct sequence. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Link: https://lore.kernel.org/r/20240528125253.1966136-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-28net/ipv6/ndisc: constify ctl_table arguments of utility functionThomas Weißschuh
The sysctl core is preparing to only expose instances of struct ctl_table as "const". This will also affect the ctl_table argument of sysctl handlers. As the function prototype of all sysctl handlers throughout the tree needs to stay consistent that change will be done in one commit. To reduce the size of that final commit, switch utility functions which are not bound by "typedef proc_handler" to "const struct ctl_table". No functional change. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20240527-sysctl-const-handler-net-v1-4-16523767d0b2@weissschuh.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-28net/ipv6/addrconf: constify ctl_table arguments of utility functionsThomas Weißschuh
The sysctl core is preparing to only expose instances of struct ctl_table as "const". This will also affect the ctl_table argument of sysctl handlers. As the function prototype of all sysctl handlers throughout the tree needs to stay consistent that change will be done in one commit. To reduce the size of that final commit, switch utility functions which are not bound by "typedef proc_handler" to "const struct ctl_table". No functional change. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20240527-sysctl-const-handler-net-v1-3-16523767d0b2@weissschuh.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-28Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-05-28 We've added 23 non-merge commits during the last 11 day(s) which contain a total of 45 files changed, 696 insertions(+), 277 deletions(-). The main changes are: 1) Rename skb's mono_delivery_time to tstamp_type for extensibility and add SKB_CLOCK_TAI type support to bpf_skb_set_tstamp(), from Abhishek Chauhan. 2) Add netfilter CT zone ID and direction to bpf_ct_opts so that arbitrary CT zones can be used from XDP/tc BPF netfilter CT helper functions, from Brad Cowie. 3) Several tweaks to the instruction-set.rst IETF doc to address the Last Call review comments, from Dave Thaler. 4) Small batch of riscv64 BPF JIT optimizations in order to emit more compressed instructions to the JITed image for better icache efficiency, from Xiao Wang. 5) Sort bpftool C dump output from BTF, aiming to simplify vmlinux.h diffing and forcing more natural type definitions ordering, from Mykyta Yatsenko. 6) Use DEV_STATS_INC() macro in BPF redirect helpers to silence a syzbot/KCSAN race report for the tx_errors counter, from Jiang Yunshui. 7) Un-constify bpf_func_info in bpftool to fix compilation with LLVM 17+ which started treating const structs as constants and thus breaking full BTF program name resolution, from Ivan Babrou. 8) Fix up BPF program numbers in test_sockmap selftest in order to reduce some of the test-internal array sizes, from Geliang Tang. 9) Small cleanup in Makefile.btf script to use test-ge check for v1.25-only pahole, from Alan Maguire. 10) Fix bpftool's make dependencies for vmlinux.h in order to avoid needless rebuilds in some corner cases, from Artem Savkov. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (23 commits) bpf, net: Use DEV_STAT_INC() bpf, docs: Fix instruction.rst indentation bpf, docs: Clarify call local offset bpf, docs: Add table captions bpf, docs: clarify sign extension of 64-bit use of 32-bit imm bpf, docs: Use RFC 2119 language for ISA requirements bpf, docs: Move sentence about returning R0 to abi.rst bpf: constify member bpf_sysctl_kern:: Table riscv, bpf: Try RVC for reg move within BPF_CMPXCHG JIT riscv, bpf: Use STACK_ALIGN macro for size rounding up riscv, bpf: Optimize zextw insn with Zba extension selftests/bpf: Handle forwarding of UDP CLOCK_TAI packets net: Add additional bit to support clockid_t timestamp type net: Rename mono_delivery_time to tstamp_type for scalabilty selftests/bpf: Update tests for new ct zone opts for nf_conntrack kfuncs net: netfilter: Make ct zone opts configurable for bpf ct helpers selftests/bpf: Fix prog numbers in test_sockmap bpf: Remove unused variable "prev_state" bpftool: Un-const bpf_func_info to fix it for llvm 17 and newer bpf: Fix order of args in call to bpf_map_kvcalloc ... ==================== Link: https://lore.kernel.org/r/20240528105924.30905-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-27tcp: reduce accepted window in NEW_SYN_RECV stateEric Dumazet
Jason commit made checks against ACK sequence less strict and can be exploited by attackers to establish spoofed flows with less probes. Innocent users might use tcp_rmem[1] == 1,000,000,000, or something more reasonable. An attacker can use a regular TCP connection to learn the server initial tp->rcv_wnd, and use it to optimize the attack. If we make sure that only the announced window (smaller than 65535) is used for ACK validation, we force an attacker to use 65537 packets to complete the 3WHS (assuming server ISN is unknown) Fixes: 378979e94e95 ("tcp: remove 64 KByte limit for initial tp->rcv_wnd value") Link: https://datatracker.ietf.org/meeting/119/materials/slides-119-tcpm-ghost-acks-00 Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://lore.kernel.org/r/20240523130528.60376-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-27net: gro: initialize network_offset in network layerWillem de Bruijn
Syzkaller was able to trigger kernel BUG at net/core/gro.c:424 ! RIP: 0010:gro_pull_from_frag0 net/core/gro.c:424 [inline] RIP: 0010:gro_try_pull_from_frag0 net/core/gro.c:446 [inline] RIP: 0010:dev_gro_receive+0x242f/0x24b0 net/core/gro.c:571 Due to using an incorrect NAPI_GRO_CB(skb)->network_offset. The referenced commit sets this offset to 0 in skb_gro_reset_offset. That matches the expected case in dev_gro_receive: pp = INDIRECT_CALL_INET(ptype->callbacks.gro_receive, ipv6_gro_receive, inet_gro_receive, &gro_list->list, skb); But syzkaller injected an skb with protocol ETH_P_TEB into an ip6gre device (by writing the IP6GRE encapsulated version to a TAP device). The result was a first call to eth_gro_receive, and thus an extra ETH_HLEN in network_offset that should not be there. First issue hit is when computing offset from network header in ipv6_gro_pull_exthdrs. Initialize both offsets in the network layer gro_receive. This pairs with all reads in gro_receive, which use skb_gro_receive_network_offset(). Fixes: 186b1ea73ad8 ("net: gro: use cb instead of skb->network_header") Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Willem de Bruijn <willemb@google.com> CC: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240523141434.1752483-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-23net: Add additional bit to support clockid_t timestamp typeAbhishek Chauhan
tstamp_type is now set based on actual clockid_t compressed into 2 bits. To make the design scalable for future needs this commit bring in the change to extend the tstamp_type:1 to tstamp_type:2 to support other clockid_t timestamp. We now support CLOCK_TAI as part of tstamp_type as part of this commit with existing support CLOCK_MONOTONIC and CLOCK_REALTIME. Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20240509211834.3235191-3-quic_abchauha@quicinc.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-05-23net: Rename mono_delivery_time to tstamp_type for scalabiltyAbhishek Chauhan
mono_delivery_time was added to check if skb->tstamp has delivery time in mono clock base (i.e. EDT) otherwise skb->tstamp has timestamp in ingress and delivery_time at egress. Renaming the bitfield from mono_delivery_time to tstamp_type is for extensibilty for other timestamps such as userspace timestamp (i.e. SO_TXTIME) set via sock opts. As we are renaming the mono_delivery_time to tstamp_type, it makes sense to start assigning tstamp_type based on enum defined in this commit. Earlier we used bool arg flag to check if the tstamp is mono in function skb_set_delivery_time, Now the signature of the functions accepts tstamp_type to distinguish between mono and real time. Also skb_set_delivery_type_by_clockid is a new function which accepts clockid to determine the tstamp_type. In future tstamp_type:1 can be extended to support userspace timestamp by increasing the bitfield. Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20240509211834.3235191-2-quic_abchauha@quicinc.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-05-23net: esp: cleanup esp_output_tail_tcp() in case of unsupported ESPINTCPHagar Hemdan
xmit() functions should consume skb or return error codes in error paths. When the configuration "CONFIG_INET_ESPINTCP" is not set, the implementation of the function "esp_output_tail_tcp" violates this rule. The function frees the skb and returns the error code. This change removes the kfree_skb from both functions, for both esp4 and esp6. WARN_ON is added because esp_output_tail_tcp() should never be called if CONFIG_INET_ESPINTCP is not set. This bug was discovered and resolved using Coverity Static Analysis Security Testing (SAST) by Synopsys, Inc. Fixes: e27cca96cd68 ("xfrm: add espintcp (RFC 8229)") Signed-off-by: Hagar Hemdan <hagarhem@amazon.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-05-21ipv6: sr: fix memleak in seg6_hmac_init_algoHangbin Liu
seg6_hmac_init_algo returns without cleaning up the previous allocations if one fails, so it's going to leak all that memory and the crypto tfms. Update seg6_hmac_exit to only free the memory when allocated, so we can reuse the code directly. Fixes: bf355b8d2c30 ("ipv6: sr: add core files for SR HMAC support") Reported-by: Sabrina Dubroca <sd@queasysnail.net> Closes: https://lore.kernel.org/netdev/Zj3bh-gE7eT6V6aH@hog/ Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://lore.kernel.org/r/20240517005435.2600277-1-liuhangbin@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-20ipv6: sr: fix missing sk_buff release in seg6_input_coreAndrea Mayer
The seg6_input() function is responsible for adding the SRH into a packet, delegating the operation to the seg6_input_core(). This function uses the skb_cow_head() to ensure that there is sufficient headroom in the sk_buff for accommodating the link-layer header. In the event that the skb_cow_header() function fails, the seg6_input_core() catches the error but it does not release the sk_buff, which will result in a memory leak. This issue was introduced in commit af3b5158b89d ("ipv6: sr: fix BUG due to headroom too small after SRH push") and persists even after commit 7a3f5b0de364 ("netfilter: add netfilter hooks to SRv6 data plane"), where the entire seg6_input() code was refactored to deal with netfilter hooks. The proposed patch addresses the identified memory leak by requiring the seg6_input_core() function to release the sk_buff in the event that skb_cow_head() fails. Fixes: af3b5158b89d ("ipv6: sr: fix BUG due to headroom too small after SRH push") Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-16net/ipv6: Fix route deleting failure when metric equals 0xu xin
Problem ========= After commit 67f695134703 ("ipv6: Move setting default metric for routes"), we noticed that the logic of assigning the default value of fc_metirc changed in the ioctl process. That is, when users use ioctl(fd, SIOCADDRT, rt) with a non-zero metric to add a route, then they may fail to delete a route with passing in a metric value of 0 to the kernel by ioctl(fd, SIOCDELRT, rt). But iproute can succeed in deleting it. As a reference, when using iproute tools by netlink to delete routes with a metric parameter equals 0, like the command as follows: ip -6 route del fe80::/64 via fe81::5054:ff:fe11:3451 dev eth0 metric 0 the user can still succeed in deleting the route entry with the smallest metric. Root Reason =========== After commit 67f695134703 ("ipv6: Move setting default metric for routes"), When ioctl() pass in SIOCDELRT with a zero metric, rtmsg_to_fib6_config() will set a defalut value (1024) to cfg->fc_metric in kernel, and in ip6_route_del() and the line 4074 at net/ipv3/route.c, it will check by if (cfg->fc_metric && cfg->fc_metric != rt->fib6_metric) continue; and the condition is true and skip the later procedure (deleting route) because cfg->fc_metric != rt->fib6_metric. But before that commit, cfg->fc_metric is still zero there, so the condition is false and it will do the following procedure (deleting). Solution ======== In order to keep a consistent behaviour across netlink() and ioctl(), we should allow to delete a route with a metric value of 0. So we only do the default setting of fc_metric in route adding. CC: stable@vger.kernel.org # 5.4+ Fixes: 67f695134703 ("ipv6: Move setting default metric for routes") Co-developed-by: Fan Yu <fan.yu9@zte.com.cn> Signed-off-by: Fan Yu <fan.yu9@zte.com.cn> Signed-off-by: xu xin <xu.xin16@zte.com.cn> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240514201102055dD2Ba45qKbLlUMxu_DTHP@zte.com.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Merge in late fixes to prepare for the 6.10 net-next PR. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-13tcp: rstreason: handle timewait cases in the receive pathJason Xing
There are two possible cases where TCP layer can send an RST. Since they happen in the same place, I think using one independent reason is enough to identify this special situation. Signed-off-by: Jason Xing <kernelxing@tencent.com> Link: https://lore.kernel.org/r/20240510122502.27850-5-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-13net: gro: move L3 flush checks to tcp_gro_receive and udp_gro_receive_segmentRichard Gobert
{inet,ipv6}_gro_receive functions perform flush checks (ttl, flags, iph->id, ...) against all packets in a loop. These flush checks are used in all merging UDP and TCP flows. These checks need to be done only once and only against the found p skb, since they only affect flush and not same_flow. This patch leverages correct network header offsets from the cb for both outer and inner network headers - allowing these checks to be done only once, in tcp_gro_receive and udp_gro_receive_segment. As a result, NAPI_GRO_CB(p)->flush is not used at all. In addition, flush_id checks are more declarative and contained in inet_gro_flush, thus removing the need for flush_id in napi_gro_cb. This results in less parsing code for non-loop flush tests for TCP and UDP flows. To make sure results are not within noise range - I've made netfilter drop all TCP packets, and measured CPU performance in GRO (in this case GRO is responsible for about 50% of the CPU utilization). perf top while replaying 64 parallel IP/TCP streams merging in GRO: (gro_receive_network_flush is compiled inline to tcp_gro_receive) net-next: 6.94% [kernel] [k] inet_gro_receive 3.02% [kernel] [k] tcp_gro_receive patch applied: 4.27% [kernel] [k] tcp_gro_receive 4.22% [kernel] [k] inet_gro_receive perf top while replaying 64 parallel IP/IP/TCP streams merging in GRO (same results for any encapsulation, in this case inet_gro_receive is top offender in net-next) net-next: 10.09% [kernel] [k] inet_gro_receive 2.08% [kernel] [k] tcp_gro_receive patch applied: 6.97% [kernel] [k] inet_gro_receive 3.68% [kernel] [k] tcp_gro_receive Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20240509190819.2985-3-richardbgobert@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-13net: gro: use cb instead of skb->network_headerRichard Gobert
This patch converts references of skb->network_header to napi_gro_cb's network_offset and inner_network_offset. Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20240509190819.2985-2-richardbgobert@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-13Merge tag 'nf-next-24-05-12' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: Patch #1 skips transaction if object type provides no .update interface. Patch #2 skips NETDEV_CHANGENAME which is unused. Patch #3 enables conntrack to handle Multicast Router Advertisements and Multicast Router Solicitations from the Multicast Router Discovery protocol (RFC4286) as untracked opposed to invalid packets. From Linus Luessing. Patch #4 updates DCCP conntracker to mark invalid as invalid, instead of dropping them, from Jason Xing. Patch #5 uses NF_DROP instead of -NF_DROP since NF_DROP is 0, also from Jason. Patch #6 removes reference in netfilter's sysctl documentation on pickup entries which were already removed by Florian Westphal. Patch #7 removes check for IPS_OFFLOAD flag to disable early drop which allows to evict entries from the conntrack table, also from Florian. Patches #8 to #16 updates nf_tables pipapo set backend to allocate the datastructure copy on-demand from preparation phase, to better deal with OOM situations where .commit step is too late to fail. Series from Florian Westphal. Patch #17 adds a selftest with packetdrill to cover conntrack TCP state transitions, also from Florian. Patch #18 use GFP_KERNEL to clone elements from control plane to avoid quick atomic reserves exhaustion with large sets, reporter refers to million entries magnitude. * tag 'nf-next-24-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nf_tables: allow clone callbacks to sleep selftests: netfilter: add packetdrill based conntrack tests netfilter: nft_set_pipapo: remove dirty flag netfilter: nft_set_pipapo: move cloning of match info to insert/removal path netfilter: nft_set_pipapo: prepare pipapo_get helper for on-demand clone netfilter: nft_set_pipapo: merge deactivate helper into caller netfilter: nft_set_pipapo: prepare walk function for on-demand clone netfilter: nft_set_pipapo: prepare destroy function for on-demand clone netfilter: nft_set_pipapo: make pipapo_clone helper return NULL netfilter: nft_set_pipapo: move prove_locking helper around netfilter: conntrack: remove flowtable early-drop test netfilter: conntrack: documentation: remove reference to non-existent sysctl netfilter: use NF_DROP instead of -NF_DROP netfilter: conntrack: dccp: try not to drop skb in conntrack netfilter: conntrack: fix ct-state for ICMPv6 Multicast Router Discovery netfilter: nf_tables: remove NETDEV_CHANGENAME from netdev chain event handler netfilter: nf_tables: skip transaction if update object is not implemented ==================== Link: https://lore.kernel.org/r/20240512161436.168973-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-10ipv6: sr: fix invalid unregister error pathHangbin Liu
The error path of seg6_init() is wrong in case CONFIG_IPV6_SEG6_LWTUNNEL is not defined. In that case if seg6_hmac_init() fails, the genl_unregister_family() isn't called. This issue exist since commit 46738b1317e1 ("ipv6: sr: add option to control lwtunnel support"), and commit 5559cea2d5aa ("ipv6: sr: fix possible use-after-free and null-ptr-deref") replaced unregister_pernet_subsys() with genl_unregister_family() in this error path. Fixes: 46738b1317e1 ("ipv6: sr: add option to control lwtunnel support") Reported-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240509131812.1662197-4-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-10ipv6: sr: fix incorrect unregister orderHangbin Liu
Commit 5559cea2d5aa ("ipv6: sr: fix possible use-after-free and null-ptr-deref") changed the register order in seg6_init(). But the unregister order in seg6_exit() is not updated. Fixes: 5559cea2d5aa ("ipv6: sr: fix possible use-after-free and null-ptr-deref") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240509131812.1662197-3-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-10ipv6: sr: add missing seg6_local_exitHangbin Liu
Currently, we only call seg6_local_exit() in seg6_init() if seg6_local_init() failed. But forgot to call it in seg6_exit(). Fixes: d1df6fd8a1d2 ("ipv6: sr: define core operations for seg6local lightweight tunnel") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240509131812.1662197-2-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-10net: ipv6: fix wrong start position when receive hop-by-hop fragmentgaoxingwang
In IPv6, ipv6_rcv_core will parse the hop-by-hop type extension header and increase skb->transport_header by one extension header length. But if there are more other extension headers like fragment header at this time, the skb->transport_header points to the second extension header, not the transport layer header or the first extension header. This will result in the start and nexthdrp variable not pointing to the same position in ipv6frag_thdr_trunced, and ipv6_skip_exthdr returning incorrect offset and frag_off.Sometimes,the length of the last sharded packet is smaller than the calculated incorrect offset, resulting in packet loss. We can use network header to offset and calculate the correct position to solve this problem. Fixes: 9d9e937b1c8b (ipv6/netfilter: Discard first fragment not including all headers) Signed-off-by: Gao Xingwang <gaoxingwang1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-09tcp: get rid of twsk_unique()Eric Dumazet
DCCP is going away soon, and had no twsk_unique() method. We can directly call tcp_twsk_unique() for TCP sockets. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240507164140.940547-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 35d92abfbad8 ("net: hns3: fix kernel crash when devlink reload during initialization") 2a1a1a7b5fd7 ("net: hns3: add command queue trace for hns3") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-08ipv6: prevent NULL dereference in ip6_output()Eric Dumazet
According to syzbot, there is a chance that ip6_dst_idev() returns NULL in ip6_output(). Most places in IPv6 stack deal with a NULL idev just fine, but not here. syzbot reported: general protection fault, probably for non-canonical address 0xdffffc00000000bc: 0000 [#1] PREEMPT SMP KASAN PTI KASAN: null-ptr-deref in range [0x00000000000005e0-0x00000000000005e7] CPU: 0 PID: 9775 Comm: syz-executor.4 Not tainted 6.9.0-rc5-syzkaller-00157-g6a30653b604a #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024 RIP: 0010:ip6_output+0x231/0x3f0 net/ipv6/ip6_output.c:237 Code: 3c 1e 00 49 89 df 74 08 4c 89 ef e8 19 58 db f7 48 8b 44 24 20 49 89 45 00 49 89 c5 48 8d 9d e0 05 00 00 48 89 d8 48 c1 e8 03 <42> 0f b6 04 38 84 c0 4c 8b 74 24 28 0f 85 61 01 00 00 8b 1b 31 ff RSP: 0018:ffffc9000927f0d8 EFLAGS: 00010202 RAX: 00000000000000bc RBX: 00000000000005e0 RCX: 0000000000040000 RDX: ffffc900131f9000 RSI: 0000000000004f47 RDI: 0000000000004f48 RBP: 0000000000000000 R08: ffffffff8a1f0b9a R09: 1ffffffff1f51fad R10: dffffc0000000000 R11: fffffbfff1f51fae R12: ffff8880293ec8c0 R13: ffff88805d7fc000 R14: 1ffff1100527d91a R15: dffffc0000000000 FS: 00007f135c6856c0(0000) GS:ffff8880b9400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000080 CR3: 0000000064096000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> NF_HOOK include/linux/netfilter.h:314 [inline] ip6_xmit+0xefe/0x17f0 net/ipv6/ip6_output.c:358 sctp_v6_xmit+0x9f2/0x13f0 net/sctp/ipv6.c:248 sctp_packet_transmit+0x26ad/0x2ca0 net/sctp/output.c:653 sctp_packet_singleton+0x22c/0x320 net/sctp/outqueue.c:783 sctp_outq_flush_ctrl net/sctp/outqueue.c:914 [inline] sctp_outq_flush+0x6d5/0x3e20 net/sctp/outqueue.c:1212 sctp_side_effects net/sctp/sm_sideeffect.c:1198 [inline] sctp_do_sm+0x59cc/0x60c0 net/sctp/sm_sideeffect.c:1169 sctp_primitive_ASSOCIATE+0x95/0xc0 net/sctp/primitive.c:73 __sctp_connect+0x9cd/0xe30 net/sctp/socket.c:1234 sctp_connect net/sctp/socket.c:4819 [inline] sctp_inet_connect+0x149/0x1f0 net/sctp/socket.c:4834 __sys_connect_file net/socket.c:2048 [inline] __sys_connect+0x2df/0x310 net/socket.c:2065 __do_sys_connect net/socket.c:2075 [inline] __se_sys_connect net/socket.c:2072 [inline] __x64_sys_connect+0x7a/0x90 net/socket.c:2072 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: 778d80be5269 ("ipv6: Add disable_ipv6 sysctl to disable IPv6 operaion on specific interface.") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://lore.kernel.org/r/20240507161842.773961-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-08ipv6: fib6_rules: avoid possible NULL dereference in fib6_rule_action()Eric Dumazet
syzbot is able to trigger the following crash [1], caused by unsafe ip6_dst_idev() use. Indeed ip6_dst_idev() can return NULL, and must always be checked. [1] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN PTI KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 0 PID: 31648 Comm: syz-executor.0 Not tainted 6.9.0-rc4-next-20240417-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024 RIP: 0010:__fib6_rule_action net/ipv6/fib6_rules.c:237 [inline] RIP: 0010:fib6_rule_action+0x241/0x7b0 net/ipv6/fib6_rules.c:267 Code: 02 00 00 49 8d 9f d8 00 00 00 48 89 d8 48 c1 e8 03 42 80 3c 20 00 74 08 48 89 df e8 f9 32 bf f7 48 8b 1b 48 89 d8 48 c1 e8 03 <42> 80 3c 20 00 74 08 48 89 df e8 e0 32 bf f7 4c 8b 03 48 89 ef 4c RSP: 0018:ffffc9000fc1f2f0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 1a772f98c8186700 RDX: 0000000000000003 RSI: ffffffff8bcac4e0 RDI: ffffffff8c1f9760 RBP: ffff8880673fb980 R08: ffffffff8fac15ef R09: 1ffffffff1f582bd R10: dffffc0000000000 R11: fffffbfff1f582be R12: dffffc0000000000 R13: 0000000000000080 R14: ffff888076509000 R15: ffff88807a029a00 FS: 00007f55e82ca6c0(0000) GS:ffff8880b9400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b31d23000 CR3: 0000000022b66000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> fib_rules_lookup+0x62c/0xdb0 net/core/fib_rules.c:317 fib6_rule_lookup+0x1fd/0x790 net/ipv6/fib6_rules.c:108 ip6_route_output_flags_noref net/ipv6/route.c:2637 [inline] ip6_route_output_flags+0x38e/0x610 net/ipv6/route.c:2649 ip6_route_output include/net/ip6_route.h:93 [inline] ip6_dst_lookup_tail+0x189/0x11a0 net/ipv6/ip6_output.c:1120 ip6_dst_lookup_flow+0xb9/0x180 net/ipv6/ip6_output.c:1250 sctp_v6_get_dst+0x792/0x1e20 net/sctp/ipv6.c:326 sctp_transport_route+0x12c/0x2e0 net/sctp/transport.c:455 sctp_assoc_add_peer+0x614/0x15c0 net/sctp/associola.c:662 sctp_connect_new_asoc+0x31d/0x6c0 net/sctp/socket.c:1099 __sctp_connect+0x66d/0xe30 net/sctp/socket.c:1197 sctp_connect net/sctp/socket.c:4819 [inline] sctp_inet_connect+0x149/0x1f0 net/sctp/socket.c:4834 __sys_connect_file net/socket.c:2048 [inline] __sys_connect+0x2df/0x310 net/socket.c:2065 __do_sys_connect net/socket.c:2075 [inline] __se_sys_connect net/socket.c:2072 [inline] __x64_sys_connect+0x7a/0x90 net/socket.c:2072 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: 5e5f3f0f8013 ("[IPV6] ADDRCONF: Convert ipv6_get_saddr() to ipv6_dev_get_saddr().") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240507163145.835254-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-08ipv6: Fix potential uninit-value access in __ip6_make_skb()Shigeru Yoshida
As it was done in commit fc1092f51567 ("ipv4: Fix uninit-value access in __ip_make_skb()") for IPv4, check FLOWI_FLAG_KNOWN_NH on fl6->flowi6_flags instead of testing HDRINCL on the socket to avoid a race condition which causes uninit-value access. Fixes: ea30388baebc ("ipv6: Fix an uninit variable access bug in __ip6_make_skb()") Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>