summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2023-06-05Bluetooth: fix debugfs registrationJohan Hovold
Since commit ec6cef9cd98d ("Bluetooth: Fix SMP channel registration for unconfigured controllers") the debugfs interface for unconfigured controllers will be created when the controller is configured. There is however currently nothing preventing a controller from being configured multiple time (e.g. setting the device address using btmgmt) which results in failed attempts to register the already registered debugfs entries: debugfs: File 'features' in directory 'hci0' already present! debugfs: File 'manufacturer' in directory 'hci0' already present! debugfs: File 'hci_version' in directory 'hci0' already present! ... debugfs: File 'quirk_simultaneous_discovery' in directory 'hci0' already present! Add a controller flag to avoid trying to register the debugfs interface more than once. Fixes: ec6cef9cd98d ("Bluetooth: Fix SMP channel registration for unconfigured controllers") Cc: stable@vger.kernel.org # 4.0 Signed-off-by: Johan Hovold <johan+linaro@kernel.org> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-06-05Bluetooth: hci_sync: add lock to protect HCI_UNREGISTERZhengping Jiang
When the HCI_UNREGISTER flag is set, no jobs should be scheduled. Fix potential race when HCI_UNREGISTER is set after the flag is tested in hci_cmd_sync_queue. Fixes: 0b94f2651f56 ("Bluetooth: hci_sync: Fix queuing commands when HCI_UNREGISTER is set") Signed-off-by: Zhengping Jiang <jiangzp@google.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-06-05Bluetooth: Fix use-after-free in hci_remove_ltk/hci_remove_irkLuiz Augusto von Dentz
Similar to commit 0f7d9b31ce7a ("netfilter: nf_tables: fix use-after-free in nft_set_catchall_destroy()"). We can not access k after kfree_rcu() call. Cc: stable@vger.kernel.org Signed-off-by: Min Li <lm0963hack@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-06-05Bluetooth: ISO: Fix CIG auto-allocation to select configurable CIGPauli Virtanen
Make CIG auto-allocation to select the first CIG_ID that is still configurable. Also use correct CIG_ID range (see Core v5.3 Vol 4 Part E Sec 7.8.97 p.2553). Previously, it would always select CIG_ID 0 regardless of anything, because cis_list with data.cis == 0xff (BT_ISO_QOS_CIS_UNSET) would not count any CIS. Since we are not adding CIS here, use find_cis instead. Fixes: 26afbd826ee3 ("Bluetooth: Add initial implementation of CIS connections") Signed-off-by: Pauli Virtanen <pav@iki.fi> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-06-05Bluetooth: ISO: consider right CIS when removing CIG at cleanupPauli Virtanen
When looking for CIS blocking CIG removal, consider only the CIS with the right CIG ID. Don't try to remove CIG with unset CIG ID. Fixes: 26afbd826ee3 ("Bluetooth: Add initial implementation of CIS connections") Signed-off-by: Pauli Virtanen <pav@iki.fi> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-06-05Merge branch 'drm-i915-use-ref_tracker-library-for-tracking-wakerefs'Jakub Kicinski
Andrzej Hajda says: ==================== drm/i915: use ref_tracker library for tracking wakerefs This is reviewed series of ref_tracker patches, ready to merge via network tree, rebased on net-next/main. i915 patches will be merged later via intel-gfx tree. ==================== Merge on top of an -rc tag in case it's needed in another tree. Link: https://lore.kernel.org/r/20230224-track_gt-v9-0-5b47a33f55d1@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-05lib/ref_tracker: improve printing statsAndrzej Hajda
In case the library is tracking busy subsystem, simply printing stack for every active reference will spam log with long, hard to read, redundant stack traces. To improve readabilty following changes have been made: - reports are printed per stack_handle - log is more compact, - added display name for ref_tracker_dir - it will differentiate multiple subsystems, - stack trace is printed indented, in the same printk call, - info about dropped references is printed as well. Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-05bpf/xdp: optimize bpf_xdp_pointer to avoid reading sinfoJesper Dangaard Brouer
Currently we observed a significant performance degradation in samples/bpf xdp1 and xdp2, due XDP multibuffer "xdp.frags" handling, added in commit 772251742262 ("samples/bpf: fixup some tools to be able to support xdp multibuffer"). This patch reduce the overhead by avoiding to read/load shared_info (sinfo) memory area, when XDP packet don't have any frags. This improves performance because sinfo is located in another cacheline. Function bpf_xdp_pointer() is used by BPF helpers bpf_xdp_load_bytes() and bpf_xdp_store_bytes(). As a help to reviewers, xdp_get_buff_len() can potentially access sinfo, but it uses xdp_buff_has_frags() flags bit check to avoid accessing sinfo in no-frags case. The likely/unlikely instrumentation lays out asm code such that sinfo access isn't interleaved with no-frags case (checked on GCC 12.2.1-4). The generated asm code is more compact towards the no-frags case. The BPF kfunc bpf_dynptr_slice() also use bpf_xdp_pointer(). Thus, it should also take effect for that. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Lorenzo Bianconi <lorenzo@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/168563651438.3436004.17735707525651776648.stgit@firesoul Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-06-05mptcp: update userspace pm infosGeliang Tang
Increase pm subflows counter on both server side and client side when userspace pm creates a new subflow, and decrease the counter when it closes a subflow. Increase add_addr_signaled counter in mptcp_nl_cmd_announce() when the address is announced by userspace PM. This modification is similar to how the in-kernel PM is updating the counter: when additional subflows are created/removed. Fixes: 9ab4807c84a4 ("mptcp: netlink: Add MPTCP_PM_CMD_ANNOUNCE") Fixes: 702c2f646d42 ("mptcp: netlink: allow userspace-driven subflow establishment") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/329 Cc: stable@vger.kernel.org Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-05mptcp: add address into userspace pm listGeliang Tang
Add the address into userspace_pm_local_addr_list when the subflow is created. Make sure it can be found in mptcp_nl_cmd_remove(). And delete it in the new helper mptcp_userspace_pm_delete_local_addr(). By doing this, the "REMOVE" command also works with subflows that have been created via the "SUB_CREATE" command instead of restricting to the addresses that have been announced via the "ANNOUNCE" command. Fixes: d9a4594edabf ("mptcp: netlink: Add MPTCP_PM_CMD_REMOVE") Link: https://github.com/multipath-tcp/mptcp_net-next/issues/379 Cc: stable@vger.kernel.org Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-05mptcp: only send RM_ADDR in nl_cmd_removeGeliang Tang
The specifications from [1] about the "REMOVE" command say: Announce that an address has been lost to the peer It was then only supposed to send a RM_ADDR and not trying to delete associated subflows. A new helper mptcp_pm_remove_addrs() is then introduced to do just that, compared to mptcp_pm_remove_addrs_and_subflows() also removing subflows. To delete a subflow, the userspace daemon can use the "SUB_DESTROY" command, see mptcp_nl_cmd_sf_destroy(). Fixes: d9a4594edabf ("mptcp: netlink: Add MPTCP_PM_CMD_REMOVE") Link: https://github.com/multipath-tcp/mptcp/blob/mptcp_v0.96/include/uapi/linux/mptcp.h [1] Cc: stable@vger.kernel.org Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-05SUNRPC: Use __alloc_bulk_pages() in svc_init_buffer()Chuck Lever
Clean up: Use the bulk page allocator when filling a server thread's buffer page array. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-05SUNRPC: Resupply rq_pages from node-local memoryChuck Lever
svc_init_buffer() is careful to allocate the initial set of server thread buffer pages from memory on the local NUMA node. svc_alloc_arg() should also be that careful. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-05SUNRPC: Trace struct svc_sock lifetime eventsChuck Lever
Capture a timestamp and pointer address during the creation and destruction of struct svc_sock to record its lifetime. This helps to diagnose transport reference counting issues. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-05SUNRPC: Improve observability in svc_tcp_accept()Chuck Lever
The -ENOMEM arm could fire repeatedly if the system runs low on memory, so remove it. Don't bother to trace -EAGAIN error events, since those fire after a listener is created (with no work done) and once again after an accept has been handled successfully (again, with no work done). Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-05SUNRPC: Remove dprintk() in svc_handle_xprt()Chuck Lever
When enabled, this dprintk() fires for every incoming RPC, which is an enormous amount of log traffic. These days, after the first few hundred log messages, the system journald is just going to mute it, along with all other NFSD debug output. Let's rely on trace points for this high-traffic information instead. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-05SUNRPC: Fix an incorrect commentChuck Lever
The correct function name is svc_tcp_listen_data_ready(). Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-05SUNRPC: Fix UAF in svc_tcp_listen_data_ready()Ding Hui
After the listener svc_sock is freed, and before invoking svc_tcp_accept() for the established child sock, there is a window that the newsock retaining a freed listener svc_sock in sk_user_data which cloning from parent. In the race window, if data is received on the newsock, we will observe use-after-free report in svc_tcp_listen_data_ready(). Reproduce by two tasks: 1. while :; do rpc.nfsd 0 ; rpc.nfsd; done 2. while :; do echo "" | ncat -4 127.0.0.1 2049 ; done KASAN report: ================================================================== BUG: KASAN: slab-use-after-free in svc_tcp_listen_data_ready+0x1cf/0x1f0 [sunrpc] Read of size 8 at addr ffff888139d96228 by task nc/102553 CPU: 7 PID: 102553 Comm: nc Not tainted 6.3.0+ #18 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020 Call Trace: <IRQ> dump_stack_lvl+0x33/0x50 print_address_description.constprop.0+0x27/0x310 print_report+0x3e/0x70 kasan_report+0xae/0xe0 svc_tcp_listen_data_ready+0x1cf/0x1f0 [sunrpc] tcp_data_queue+0x9f4/0x20e0 tcp_rcv_established+0x666/0x1f60 tcp_v4_do_rcv+0x51c/0x850 tcp_v4_rcv+0x23fc/0x2e80 ip_protocol_deliver_rcu+0x62/0x300 ip_local_deliver_finish+0x267/0x350 ip_local_deliver+0x18b/0x2d0 ip_rcv+0x2fb/0x370 __netif_receive_skb_one_core+0x166/0x1b0 process_backlog+0x24c/0x5e0 __napi_poll+0xa2/0x500 net_rx_action+0x854/0xc90 __do_softirq+0x1bb/0x5de do_softirq+0xcb/0x100 </IRQ> <TASK> ... </TASK> Allocated by task 102371: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 __kasan_kmalloc+0x7b/0x90 svc_setup_socket+0x52/0x4f0 [sunrpc] svc_addsock+0x20d/0x400 [sunrpc] __write_ports_addfd+0x209/0x390 [nfsd] write_ports+0x239/0x2c0 [nfsd] nfsctl_transaction_write+0xac/0x110 [nfsd] vfs_write+0x1c3/0xae0 ksys_write+0xed/0x1c0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc Freed by task 102551: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 kasan_save_free_info+0x2a/0x50 __kasan_slab_free+0x106/0x190 __kmem_cache_free+0x133/0x270 svc_xprt_free+0x1e2/0x350 [sunrpc] svc_xprt_destroy_all+0x25a/0x440 [sunrpc] nfsd_put+0x125/0x240 [nfsd] nfsd_svc+0x2cb/0x3c0 [nfsd] write_threads+0x1ac/0x2a0 [nfsd] nfsctl_transaction_write+0xac/0x110 [nfsd] vfs_write+0x1c3/0xae0 ksys_write+0xed/0x1c0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc Fix the UAF by simply doing nothing in svc_tcp_listen_data_ready() if state != TCP_LISTEN, that will avoid dereferencing svsk for all child socket. Link: https://lore.kernel.org/lkml/20230507091131.23540-1-dinghui@sangfor.com.cn/ Fixes: fa9251afc33c ("SUNRPC: Call the default socket callbacks instead of open coding") Signed-off-by: Ding Hui <dinghui@sangfor.com.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-05Merge tag 'linux-can-fixes-for-6.4-20230605' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can Marc Kleine-Budde says: ==================== this is a pull request of 3 patches for net/master. All 3 patches target the j1939 stack. The 1st patch is by Oleksij Rempel and fixes the error queue handling for (E)TP sessions that run into timeouts. The last 2 patches are by Fedor Pchelkin and fix a potential use-after-free in j1939_netdev_start() if j1939_can_rx_register() fails. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-05net/sched: fq_pie: ensure reasonable TCA_FQ_PIE_QUANTUM valuesEric Dumazet
We got multiple syzbot reports, all duplicates of the following [1] syzbot managed to install fq_pie with a zero TCA_FQ_PIE_QUANTUM, thus triggering infinite loops. Use limits similar to sch_fq, with commits 3725a269815b ("pkt_sched: fq: avoid hang when quantum 0") and d9e15a273306 ("pkt_sched: fq: do not accept silly TCA_FQ_QUANTUM") [1] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [swapper/0:0] Modules linked in: irq event stamp: 172817 hardirqs last enabled at (172816): [<ffff80001242fde4>] __el1_irq arch/arm64/kernel/entry-common.c:476 [inline] hardirqs last enabled at (172816): [<ffff80001242fde4>] el1_interrupt+0x58/0x68 arch/arm64/kernel/entry-common.c:486 hardirqs last disabled at (172817): [<ffff80001242fdb0>] __el1_irq arch/arm64/kernel/entry-common.c:468 [inline] hardirqs last disabled at (172817): [<ffff80001242fdb0>] el1_interrupt+0x24/0x68 arch/arm64/kernel/entry-common.c:486 softirqs last enabled at (167634): [<ffff800008020c1c>] softirq_handle_end kernel/softirq.c:414 [inline] softirqs last enabled at (167634): [<ffff800008020c1c>] __do_softirq+0xac0/0xd54 kernel/softirq.c:600 softirqs last disabled at (167701): [<ffff80000802a660>] ____do_softirq+0x14/0x20 arch/arm64/kernel/irq.c:80 CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.4.0-rc3-syzkaller-geb0f1697d729 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/28/2023 pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : fq_pie_qdisc_dequeue+0x10c/0x8ac net/sched/sch_fq_pie.c:246 lr : fq_pie_qdisc_dequeue+0xe4/0x8ac net/sched/sch_fq_pie.c:240 sp : ffff800008007210 x29: ffff800008007280 x28: ffff0000c86f7890 x27: ffff0000cb20c2e8 x26: ffff0000cb20c2f0 x25: dfff800000000000 x24: ffff0000cb20c2e0 x23: ffff0000c86f7880 x22: 0000000000000040 x21: 1fffe000190def10 x20: ffff0000cb20c2e0 x19: ffff0000cb20c2e0 x18: ffff800008006e60 x17: 0000000000000000 x16: ffff80000850af6c x15: 0000000000000302 x14: 0000000000000100 x13: 0000000000000000 x12: 0000000000000001 x11: 0000000000000302 x10: 0000000000000100 x9 : 0000000000000000 x8 : 0000000000000000 x7 : ffff80000841c468 x6 : 0000000000000000 x5 : 0000000000000001 x4 : 0000000000000001 x3 : 0000000000000000 x2 : ffff0000cb20c2e0 x1 : ffff0000cb20c2e0 x0 : 0000000000000001 Call trace: fq_pie_qdisc_dequeue+0x10c/0x8ac net/sched/sch_fq_pie.c:246 dequeue_skb net/sched/sch_generic.c:292 [inline] qdisc_restart net/sched/sch_generic.c:397 [inline] __qdisc_run+0x1fc/0x231c net/sched/sch_generic.c:415 __dev_xmit_skb net/core/dev.c:3868 [inline] __dev_queue_xmit+0xc80/0x3318 net/core/dev.c:4210 dev_queue_xmit include/linux/netdevice.h:3085 [inline] neigh_connected_output+0x2f8/0x38c net/core/neighbour.c:1581 neigh_output include/net/neighbour.h:544 [inline] ip6_finish_output2+0xd60/0x1a1c net/ipv6/ip6_output.c:134 __ip6_finish_output net/ipv6/ip6_output.c:195 [inline] ip6_finish_output+0x538/0x8c8 net/ipv6/ip6_output.c:206 NF_HOOK_COND include/linux/netfilter.h:292 [inline] ip6_output+0x270/0x594 net/ipv6/ip6_output.c:227 dst_output include/net/dst.h:458 [inline] NF_HOOK include/linux/netfilter.h:303 [inline] ndisc_send_skb+0xc30/0x1790 net/ipv6/ndisc.c:508 ndisc_send_rs+0x47c/0x5d4 net/ipv6/ndisc.c:718 addrconf_rs_timer+0x300/0x58c net/ipv6/addrconf.c:3936 call_timer_fn+0x19c/0x8cc kernel/time/timer.c:1700 expire_timers kernel/time/timer.c:1751 [inline] __run_timers+0x55c/0x734 kernel/time/timer.c:2022 run_timer_softirq+0x7c/0x114 kernel/time/timer.c:2035 __do_softirq+0x2d0/0xd54 kernel/softirq.c:571 ____do_softirq+0x14/0x20 arch/arm64/kernel/irq.c:80 call_on_irq_stack+0x24/0x4c arch/arm64/kernel/entry.S:882 do_softirq_own_stack+0x20/0x2c arch/arm64/kernel/irq.c:85 invoke_softirq kernel/softirq.c:452 [inline] __irq_exit_rcu+0x28c/0x534 kernel/softirq.c:650 irq_exit_rcu+0x14/0x84 kernel/softirq.c:662 __el1_irq arch/arm64/kernel/entry-common.c:472 [inline] el1_interrupt+0x38/0x68 arch/arm64/kernel/entry-common.c:486 el1h_64_irq_handler+0x18/0x24 arch/arm64/kernel/entry-common.c:491 el1h_64_irq+0x64/0x68 arch/arm64/kernel/entry.S:587 __daif_local_irq_enable arch/arm64/include/asm/irqflags.h:33 [inline] arch_local_irq_enable+0x8/0xc arch/arm64/include/asm/irqflags.h:55 cpuidle_idle_call kernel/sched/idle.c:170 [inline] do_idle+0x1f0/0x4e8 kernel/sched/idle.c:282 cpu_startup_entry+0x24/0x28 kernel/sched/idle.c:379 rest_init+0x2dc/0x2f4 init/main.c:735 start_kernel+0x0/0x55c init/main.c:834 start_kernel+0x3f0/0x55c init/main.c:1088 __primary_switched+0xb8/0xc0 arch/arm64/kernel/head.S:523 Fixes: ec97ecf1ebe4 ("net: sched: add Flow Queue PIE packet scheduler") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-05can: j1939: avoid possible use-after-free when j1939_can_rx_register failsFedor Pchelkin
Syzkaller reports the following failure: BUG: KASAN: use-after-free in kref_put include/linux/kref.h:64 [inline] BUG: KASAN: use-after-free in j1939_priv_put+0x25/0xa0 net/can/j1939/main.c:172 Write of size 4 at addr ffff888141c15058 by task swapper/3/0 CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.10.144-syzkaller #0 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x107/0x167 lib/dump_stack.c:118 print_address_description.constprop.0+0x1c/0x220 mm/kasan/report.c:385 __kasan_report mm/kasan/report.c:545 [inline] kasan_report.cold+0x1f/0x37 mm/kasan/report.c:562 check_memory_region_inline mm/kasan/generic.c:186 [inline] check_memory_region+0x145/0x190 mm/kasan/generic.c:192 instrument_atomic_read_write include/linux/instrumented.h:101 [inline] atomic_fetch_sub_release include/asm-generic/atomic-instrumented.h:220 [inline] __refcount_sub_and_test include/linux/refcount.h:272 [inline] __refcount_dec_and_test include/linux/refcount.h:315 [inline] refcount_dec_and_test include/linux/refcount.h:333 [inline] kref_put include/linux/kref.h:64 [inline] j1939_priv_put+0x25/0xa0 net/can/j1939/main.c:172 j1939_sk_sock_destruct+0x44/0x90 net/can/j1939/socket.c:374 __sk_destruct+0x4e/0x820 net/core/sock.c:1784 rcu_do_batch kernel/rcu/tree.c:2485 [inline] rcu_core+0xb35/0x1a30 kernel/rcu/tree.c:2726 __do_softirq+0x289/0x9a3 kernel/softirq.c:298 asm_call_irq_on_stack+0x12/0x20 </IRQ> __run_on_irqstack arch/x86/include/asm/irq_stack.h:26 [inline] run_on_irqstack_cond arch/x86/include/asm/irq_stack.h:77 [inline] do_softirq_own_stack+0xaa/0xe0 arch/x86/kernel/irq_64.c:77 invoke_softirq kernel/softirq.c:393 [inline] __irq_exit_rcu kernel/softirq.c:423 [inline] irq_exit_rcu+0x136/0x200 kernel/softirq.c:435 sysvec_apic_timer_interrupt+0x4d/0x100 arch/x86/kernel/apic/apic.c:1095 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:635 Allocated by task 1141: kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48 kasan_set_track mm/kasan/common.c:56 [inline] __kasan_kmalloc.constprop.0+0xc9/0xd0 mm/kasan/common.c:461 kmalloc include/linux/slab.h:552 [inline] kzalloc include/linux/slab.h:664 [inline] j1939_priv_create net/can/j1939/main.c:131 [inline] j1939_netdev_start+0x111/0x860 net/can/j1939/main.c:268 j1939_sk_bind+0x8ea/0xd30 net/can/j1939/socket.c:485 __sys_bind+0x1f2/0x260 net/socket.c:1645 __do_sys_bind net/socket.c:1656 [inline] __se_sys_bind net/socket.c:1654 [inline] __x64_sys_bind+0x6f/0xb0 net/socket.c:1654 do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x61/0xc6 Freed by task 1141: kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48 kasan_set_track+0x1c/0x30 mm/kasan/common.c:56 kasan_set_free_info+0x1b/0x30 mm/kasan/generic.c:355 __kasan_slab_free+0x112/0x170 mm/kasan/common.c:422 slab_free_hook mm/slub.c:1542 [inline] slab_free_freelist_hook+0xad/0x190 mm/slub.c:1576 slab_free mm/slub.c:3149 [inline] kfree+0xd9/0x3b0 mm/slub.c:4125 j1939_netdev_start+0x5ee/0x860 net/can/j1939/main.c:300 j1939_sk_bind+0x8ea/0xd30 net/can/j1939/socket.c:485 __sys_bind+0x1f2/0x260 net/socket.c:1645 __do_sys_bind net/socket.c:1656 [inline] __se_sys_bind net/socket.c:1654 [inline] __x64_sys_bind+0x6f/0xb0 net/socket.c:1654 do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x61/0xc6 It can be caused by this scenario: CPU0 CPU1 j1939_sk_bind(socket0, ndev0, ...) j1939_netdev_start() j1939_sk_bind(socket1, ndev0, ...) j1939_netdev_start() mutex_lock(&j1939_netdev_lock) j1939_priv_set(ndev0, priv) mutex_unlock(&j1939_netdev_lock) if (priv_new) kref_get(&priv_new->rx_kref) return priv_new; /* inside j1939_sk_bind() */ jsk->priv = priv j1939_can_rx_register(priv) // fails j1939_priv_set(ndev, NULL) kfree(priv) j1939_sk_sock_destruct() j1939_priv_put() // <- uaf To avoid this, call j1939_can_rx_register() under j1939_netdev_lock so that a concurrent thread cannot process j1939_priv before j1939_can_rx_register() returns. Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol") Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Tested-by: Oleksij Rempel <o.rempel@pengutronix.de> Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://lore.kernel.org/r/20230526171910.227615-3-pchelkin@ispras.ru Cc: stable@vger.kernel.org Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2023-06-05can: j1939: change j1939_netdev_lock type to mutexFedor Pchelkin
It turns out access to j1939_can_rx_register() needs to be serialized, otherwise j1939_priv can be corrupted when parallel threads call j1939_netdev_start() and j1939_can_rx_register() fails. This issue is thoroughly covered in other commit which serializes access to j1939_can_rx_register(). Change j1939_netdev_lock type to mutex so that we do not need to remove GFP_KERNEL from can_rx_register(). j1939_netdev_lock seems to be used in normal contexts where mutex usage is not prohibited. Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol") Suggested-by: Alexey Khoroshilov <khoroshilov@ispras.ru> Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Tested-by: Oleksij Rempel <o.rempel@pengutronix.de> Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://lore.kernel.org/r/20230526171910.227615-2-pchelkin@ispras.ru Cc: stable@vger.kernel.org Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2023-06-05can: j1939: j1939_sk_send_loop_abort(): improved error queue handling in ↵Oleksij Rempel
J1939 Socket This patch addresses an issue within the j1939_sk_send_loop_abort() function in the j1939/socket.c file, specifically in the context of Transport Protocol (TP) sessions. Without this patch, when a TP session is initiated and a Clear To Send (CTS) frame is received from the remote side requesting one data packet, the kernel dispatches the first Data Transport (DT) frame and then waits for the next CTS. If the remote side doesn't respond with another CTS, the kernel aborts due to a timeout. This leads to the user-space receiving an EPOLLERR on the socket, and the socket becomes active. However, when trying to read the error queue from the socket with sock.recvmsg(, , socket.MSG_ERRQUEUE), it returns -EAGAIN, given that the socket is non-blocking. This situation results in an infinite loop: the user-space repeatedly calls epoll(), epoll() returns the socket file descriptor with EPOLLERR, but the socket then blocks on the recv() of ERRQUEUE. This patch introduces an additional check for the J1939_SOCK_ERRQUEUE flag within the j1939_sk_send_loop_abort() function. If the flag is set, it indicates that the application has subscribed to receive error queue messages. In such cases, the kernel can communicate the current transfer state via the error queue. This allows for the function to return early, preventing the unnecessary setting of the socket into an error state, and breaking the infinite loop. It is crucial to note that a socket error is only needed if the application isn't using the error queue, as, without it, the application wouldn't be aware of transfer issues. Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol") Reported-by: David Jander <david@protonic.nl> Tested-by: David Jander <david@protonic.nl> Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://lore.kernel.org/r/20230526081946.715190-1-o.rempel@pengutronix.de Cc: stable@vger.kernel.org Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2023-06-04net: sched: wrap tc_skip_wrapper with CONFIG_RETPOLINEMin-Hua Chen
This patch fixes the following sparse warning: net/sched/sch_api.c:2305:1: sparse: warning: symbol 'tc_skip_wrapper' was not declared. Should it be static? No functional change intended. Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com> Acked-by: Pedro Tammela <pctammela@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-04ipv6: lower "link become ready"'s level messageMatthieu Baerts
This following message is printed in the console each time a network device configured with an IPv6 addresses is ready to be used: ADDRCONF(NETDEV_CHANGE): <iface>: link becomes ready When netns are being extensively used -- e.g. by re-creating netns' with veth to discuss with each others for testing purposes like mptcp_join.sh selftest does -- it generates a lot of messages like that: more than 700 when executing mptcp_join.sh with the latest version. It looks like this message is not that helpful after all: maybe it can be used as a sign to know if there is something wrong, e.g. if a device is being regularly reconfigured by accident? But even then, there are better ways to monitor and diagnose such issues. When looking at commit 3c21edbd1137 ("[IPV6]: Defer IPv6 device initialization until the link becomes ready.") which introduces this new message, it seems it had been added to verify that the new feature was working as expected. It could have then used a lower level than "info" from the beginning but it was fine like that back then: 17 years ago. It seems then OK today to simply lower its level, similar to commit 7c62b8dd5ca8 ("net/ipv6: lower the level of "link is not ready" messages") and as suggested by Mat [1], Stephen and David [2]. Link: https://lore.kernel.org/mptcp/614e76ac-184e-c553-af72-084f792e60b0@kernel.org/T/ [1] Link: https://lore.kernel.org/netdev/68035bad-b53e-91cb-0e4a-007f27d62b05@tessares.net/T/ [2] Suggested-by: Mat Martineau <martineau@kernel.org> Suggested-by: Stephen Hemminger <stephen@networkplumber.org> Suggested-by: David Ahern <dsahern@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-03net/smc: Avoid to access invalid RMBs' MRs in SMCRv1 ADD LINK CONTWen Gu
SMCRv1 has a similar issue to SMCRv2 (see link below) that may access invalid MRs of RMBs when construct LLC ADD LINK CONT messages. BUG: kernel NULL pointer dereference, address: 0000000000000014 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 5 PID: 48 Comm: kworker/5:0 Kdump: loaded Tainted: G W E 6.4.0-rc3+ #49 Workqueue: events smc_llc_add_link_work [smc] RIP: 0010:smc_llc_add_link_cont+0x160/0x270 [smc] RSP: 0018:ffffa737801d3d50 EFLAGS: 00010286 RAX: ffff964f82144000 RBX: ffffa737801d3dd8 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff964f81370c30 RBP: ffffa737801d3dd4 R08: ffff964f81370000 R09: ffffa737801d3db0 R10: 0000000000000001 R11: 0000000000000060 R12: ffff964f82e70000 R13: ffff964f81370c38 R14: ffffa737801d3dd3 R15: 0000000000000001 FS: 0000000000000000(0000) GS:ffff9652bfd40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000014 CR3: 000000008fa20004 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> smc_llc_srv_rkey_exchange+0xa7/0x190 [smc] smc_llc_srv_add_link+0x3ae/0x5a0 [smc] smc_llc_add_link_work+0xb8/0x140 [smc] process_one_work+0x1e5/0x3f0 worker_thread+0x4d/0x2f0 ? __pfx_worker_thread+0x10/0x10 kthread+0xe5/0x120 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2c/0x50 </TASK> When an alernate RNIC is available in system, SMC will try to add a new link based on the RNIC for resilience. All the RMBs in use will be mapped to the new link. Then the RMBs' MRs corresponding to the new link will be filled into LLC messages. For SMCRv1, they are ADD LINK CONT messages. However smc_llc_add_link_cont() may mistakenly access to unused RMBs which haven't been mapped to the new link and have no valid MRs, thus causing a crash. So this patch fixes it. Fixes: 87f88cda2128 ("net/smc: rkey processing for a new link as SMC client") Link: https://lore.kernel.org/r/1685101741-74826-3-git-send-email-guwen@linux.alibaba.com Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-02net/ipv6: convert skip_notify_on_dev_down sysctl to u8Eric Dumazet
Save a bit a space, and could help future sysctls to use the same pattern. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-02Merge tag 'nfsd-6.4-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Two minor bug fixes * tag 'nfsd-6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: nfsd: fix double fget() bug in __write_ports_addfd() nfsd: make a copy of struct iattr before calling notify_change
2023-06-02ipv4: Drop tos parameter from flowi4_update_output()Guillaume Nault
Callers of flowi4_update_output() never try to update ->flowi4_tos: * ip_route_connect() updates ->flowi4_tos with its own current value. * ip_route_newports() has two users: tcp_v4_connect() and dccp_v4_connect. Both initialise fl4 with ip_route_connect(), which in turn sets ->flowi4_tos with RT_TOS(inet_sk(sk)->tos) and ->flowi4_scope based on SOCK_LOCALROUTE. Then ip_route_newports() updates ->flowi4_tos with RT_CONN_FLAGS(sk), which is the same as RT_TOS(inet_sk(sk)->tos), unless SOCK_LOCALROUTE is set on the socket. In that case, the lowest order bit is set to 1, to eventually inform ip_route_output_key_hash() to restrict the scope to RT_SCOPE_LINK. This is equivalent to properly setting ->flowi4_scope as ip_route_connect() did. * ip_vs_xmit.c initialises ->flowi4_tos with memset(0), then calls flowi4_update_output() with tos=0. * sctp_v4_get_dst() uses the same RT_CONN_FLAGS_TOS() when initialising ->flowi4_tos and when calling flowi4_update_output(). In the end, ->flowi4_tos never changes. So let's just drop the tos parameter. This will simplify the conversion of ->flowi4_tos from __u8 to dscp_t. Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-02ip_gre: clean up some inconsistent indentingJiapeng Chong
No functional modification involved. net/ipv4/ip_gre.c:192 ipgre_err() warn: inconsistent indenting. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5375 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-02net/ipv4: ping_group_range: allow GID from 2147483648 to 4294967294Akihiro Suda
With this commit, all the GIDs ("0 4294967294") can be written to the "net.ipv4.ping_group_range" sysctl. Note that 4294967295 (0xffffffff) is an invalid GID (see gid_valid() in include/linux/uidgid.h), and an attempt to register this number will cause -EINVAL. Prior to this commit, only up to GID 2147483647 could be covered. Documentation/networking/ip-sysctl.rst had "0 4294967295" as an example value, but this example was wrong and causing -EINVAL. Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind") Co-developed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-02tls: suppress wakeups unless we have a full recordJakub Kicinski
TLS does not override .poll() so TLS-enabled socket will generate an event whenever data arrives at the TCP socket. This leads to unnecessary wakeups on slow connections. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-01devlink: bring port new reply backJiri Pirko
In the offending fixes commit I mistakenly removed the reply message of the port new command. I was under impression it is a new port notification, partly due to the "notify" in the name of the helper function. Bring the code sending reply with new port message back, this time putting it directly to devlink_nl_cmd_port_new_doit() Fixes: c496daeb8630 ("devlink: remove duplicate port notification") Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230531142025.2605001-1-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: drivers/net/ethernet/sfc/tc.c 622ab656344a ("sfc: fix error unwinds in TC offload") b6583d5e9e94 ("sfc: support TC decap rules matching on enc_src_port") net/mptcp/protocol.c 5b825727d087 ("mptcp: add annotations around msk->subflow accesses") e76c8ef5cc5b ("mptcp: refactor mptcp_stream_accept()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-01ipvs: dynamically limit the connection hash tableJulian Anastasov
As we allow the hash table to be configured to rows above 2^20, we should limit it depending on the available memory to some sane values. Switch to kvmalloc allocation to better select the needed allocation type. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-06-01ipvs: increase ip_vs_conn_tab_bits range for 64BITAbhijeet Rastogi
Current range [8, 20] is set purely due to historical reasons because at the time, ~1M (2^20) was considered sufficient. With this change, 27 is the upper limit for 64-bit, 20 otherwise. Previous change regarding this limit is here. Link: https://lore.kernel.org/all/86eabeb9dd62aebf1e2533926fdd13fed48bab1f.1631289960.git.aclaudi@redhat.com/T/#u Signed-off-by: Abhijeet Rastogi <abhijeet.1989@gmail.com> Acked-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-06-01bpf: Add table ID to bpf_fib_lookup BPF helperLouis DeLosSantos
Add ability to specify routing table ID to the `bpf_fib_lookup` BPF helper. A new field `tbid` is added to `struct bpf_fib_lookup` used as parameters to the `bpf_fib_lookup` BPF helper. When the helper is called with the `BPF_FIB_LOOKUP_DIRECT` and `BPF_FIB_LOOKUP_TBID` flags the `tbid` field in `struct bpf_fib_lookup` will be used as the table ID for the fib lookup. If the `tbid` does not exist the fib lookup will fail with `BPF_FIB_LKUP_RET_NOT_FWDED`. The `tbid` field becomes a union over the vlan related output fields in `struct bpf_fib_lookup` and will be zeroed immediately after usage. This functionality is useful in containerized environments. For instance, if a CNI wants to dictate the next-hop for traffic leaving a container it can create a container-specific routing table and perform a fib lookup against this table in a "host-net-namespace-side" TC program. This functionality also allows `ip rule` like functionality at the TC layer, allowing an eBPF program to pick a routing table based on some aspect of the sk_buff. As a concrete use case, this feature will be used in Cilium's SRv6 L3VPN datapath. When egress traffic leaves a Pod an eBPF program attached by Cilium will determine which VRF the egress traffic should target, and then perform a FIB lookup in a specific table representing this VRF's FIB. Signed-off-by: Louis DeLosSantos <louis.delos.devel@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230505-bpf-add-tbid-fib-lookup-v2-1-0a31c22c748c@gmail.com
2023-06-01mptcp: fix active subflow finalizationPaolo Abeni
Active subflow are inserted into the connection list at creation time. When the MPJ handshake completes successfully, a new subflow creation netlink event is generated correctly, but the current code wrongly avoid initializing a couple of subflow data. The above will cause misbehavior on a few exceptional events: unneeded mptcp-level retransmission on msk-level sequence wrap-around and infinite mapping fallback even when a MPJ socket is present. Address the issue factoring out the needed initialization in a new helper and invoking the latter from __mptcp_finish_join() time for passive subflow and from mptcp_finish_join() for active ones. Fixes: 0530020a7c8f ("mptcp: track and update contiguous data status") Cc: stable@vger.kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-01mptcp: add annotations around sk->sk_shutdown accessesPaolo Abeni
Christoph reported the mptcp variant of a recently addressed plain TCP issue. Similar to commit e14cadfd80d7 ("tcp: add annotations around sk->sk_shutdown accesses") add READ/WRITE ONCE annotations to silence KCSAN reports around lockless sk_shutdown access. Fixes: 71ba088ce0aa ("mptcp: cleanup accept and poll") Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/401 Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-01mptcp: fix data race around msk->first accessPaolo Abeni
The first subflow socket is accessed outside the msk socket lock by mptcp_subflow_fail(), we need to annotate each write access with WRITE_ONCE, but a few spots still lacks it. Fixes: 76a13b315709 ("mptcp: invoke MP_FAIL response when needed") Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-01mptcp: consolidate passive msk socket initializationPaolo Abeni
When the msk socket is cloned at MPC handshake time, a few fields are initialized in a racy way outside mptcp_sk_clone() and the msk socket lock. The above is due historical reasons: before commit a88d0092b24b ("mptcp: simplify subflow_syn_recv_sock()") as the first subflow socket carrying all the needed date was not available yet at msk creation time We can now refactor the code moving the missing initialization bit under the socket lock, removing the init race and avoiding some code duplication. This will also simplify the next patch, as all msk->first write access are now under the msk socket lock. Fixes: 0397c6d85f9c ("mptcp: keep unaccepted MPC subflow into join list") Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-01mptcp: add annotations around msk->subflow accessesPaolo Abeni
The MPTCP can access the first subflow socket in a few spots outside the socket lock scope. That is actually safe, as MPTCP will delete the socket itself only after the msk sock close(). Still the such accesses causes a few KCSAN splats, as reported by Christoph. Silence the harmless warning adding a few annotation around the relevant accesses. Fixes: 71ba088ce0aa ("mptcp: cleanup accept and poll") Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/402 Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-01mptcp: fix connect timeout handlingPaolo Abeni
Ondrej reported a functional issue WRT timeout handling on connect with a nice reproducer. The problem is that the current mptcp connect waits for both the MPTCP socket level timeout, and the first subflow socket timeout. The latter is not influenced/touched by the exposed setsockopt(). Overall the above makes the SO_SNDTIMEO a no-op on connect. Since mptcp_connect is invoked via inet_stream_connect and the latter properly handle the MPTCP level timeout, we can address the issue making the nested subflow level connect always unblocking. This also allow simplifying a bit the code, dropping an ugly hack to handle the fastopen and custom proto_ops connect. The issues predates the blamed commit below, but the current resolution requires the infrastructure introduced there. Fixes: 54f1944ed6d2 ("mptcp: factor out mptcp_connect()") Reported-by: Ondrej Mosnacek <omosnace@redhat.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/399 Cc: stable@vger.kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-01rtnetlink: add the missing IFLA_GRO_ tb check in validate_linkmsgXin Long
This fixes the issue that dev gro_max_size and gso_ipv4_max_size can be set to a huge value: # ip link add dummy1 type dummy # ip link set dummy1 gro_max_size 4294967295 # ip -d link show dummy1 dummy addrgenmode eui64 ... gro_max_size 4294967295 Fixes: 0fe79f28bfaf ("net: allow gro_max_size to exceed 65536") Fixes: 9eefedd58ae1 ("net: add gso_ipv4_max_size and gro_ipv4_max_size per device") Reported-by: Xiumei Mu <xmu@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-01rtnetlink: move IFLA_GSO_ tb check to validate_linkmsgXin Long
These IFLA_GSO_* tb check should also be done for the new created link, otherwise, they can be set to a huge value when creating links: # ip link add dummy1 gso_max_size 4294967295 type dummy # ip -d link show dummy1 dummy addrgenmode eui64 ... gso_max_size 4294967295 Fixes: 46e6b992c250 ("rtnetlink: allow GSO maximums to be set on device creation") Fixes: 9eefedd58ae1 ("net: add gso_ipv4_max_size and gro_ipv4_max_size per device") Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-01rtnetlink: call validate_linkmsg in rtnl_create_linkXin Long
validate_linkmsg() was introduced by commit 1840bb13c22f5b ("[RTNL]: Validate hardware and broadcast address attribute for RTM_NEWLINK") to validate tb[IFLA_ADDRESS/BROADCAST] for existing links. The same check should also be done for newly created links. This patch adds validate_linkmsg() call in rtnl_create_link(), to avoid the invalid address set when creating some devices like: # ip link add dummy0 type dummy # ip link add link dummy0 name mac0 address 01:02 type macsec Fixes: 0e06877c6fdb ("[RTNETLINK]: rtnl_link: allow specifying initial device address") Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-01bpf, sockmap: Avoid potential NULL dereference in sk_psock_verdict_data_ready()Eric Dumazet
syzbot found sk_psock(sk) could return NULL when called from sk_psock_verdict_data_ready(). Just make sure to handle this case. [1] general protection fault, probably for non-canonical address 0xdffffc000000005c: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x00000000000002e0-0x00000000000002e7] CPU: 0 PID: 15 Comm: ksoftirqd/0 Not tainted 6.4.0-rc3-syzkaller-00588-g4781e965e655 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/16/2023 RIP: 0010:sk_psock_verdict_data_ready+0x19f/0x3c0 net/core/skmsg.c:1213 Code: 4c 89 e6 e8 63 70 5e f9 4d 85 e4 75 75 e8 19 74 5e f9 48 8d bb e0 02 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 07 02 00 00 48 89 ef ff 93 e0 02 00 00 e8 29 fd RSP: 0018:ffffc90000147688 EFLAGS: 00010206 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000100 RDX: 000000000000005c RSI: ffffffff8825ceb7 RDI: 00000000000002e0 RBP: ffff888076518c40 R08: 0000000000000007 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000008000 R15: ffff888076518c40 FS: 0000000000000000(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f901375bab0 CR3: 000000004bf26000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> tcp_data_ready+0x10a/0x520 net/ipv4/tcp_input.c:5006 tcp_data_queue+0x25d3/0x4c50 net/ipv4/tcp_input.c:5080 tcp_rcv_established+0x829/0x1f90 net/ipv4/tcp_input.c:6019 tcp_v4_do_rcv+0x65a/0x9c0 net/ipv4/tcp_ipv4.c:1726 tcp_v4_rcv+0x2cbf/0x3340 net/ipv4/tcp_ipv4.c:2148 ip_protocol_deliver_rcu+0x9f/0x480 net/ipv4/ip_input.c:205 ip_local_deliver_finish+0x2ec/0x520 net/ipv4/ip_input.c:233 NF_HOOK include/linux/netfilter.h:303 [inline] NF_HOOK include/linux/netfilter.h:297 [inline] ip_local_deliver+0x1ae/0x200 net/ipv4/ip_input.c:254 dst_input include/net/dst.h:468 [inline] ip_rcv_finish+0x1cf/0x2f0 net/ipv4/ip_input.c:449 NF_HOOK include/linux/netfilter.h:303 [inline] NF_HOOK include/linux/netfilter.h:297 [inline] ip_rcv+0xae/0xd0 net/ipv4/ip_input.c:569 __netif_receive_skb_one_core+0x114/0x180 net/core/dev.c:5491 __netif_receive_skb+0x1f/0x1c0 net/core/dev.c:5605 process_backlog+0x101/0x670 net/core/dev.c:5933 __napi_poll+0xb7/0x6f0 net/core/dev.c:6499 napi_poll net/core/dev.c:6566 [inline] net_rx_action+0x8a9/0xcb0 net/core/dev.c:6699 __do_softirq+0x1d4/0x905 kernel/softirq.c:571 run_ksoftirqd kernel/softirq.c:939 [inline] run_ksoftirqd+0x31/0x60 kernel/softirq.c:931 smpboot_thread_fn+0x659/0x9e0 kernel/smpboot.c:164 kthread+0x344/0x440 kernel/kthread.c:379 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308 </TASK> Fixes: 6df7f764cd3c ("bpf, sockmap: Wake up polling after data copy") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20230530195149.68145-1-edumazet@google.com
2023-06-01tcp: fix mishandling when the sack compression is deferred.fuyuanli
In this patch, we mainly try to handle sending a compressed ack correctly if it's deferred. Here are more details in the old logic: When sack compression is triggered in the tcp_compressed_ack_kick(), if the sock is owned by user, it will set TCP_DELACK_TIMER_DEFERRED and then defer to the release cb phrase. Later once user releases the sock, tcp_delack_timer_handler() should send a ack as expected, which, however, cannot happen due to lack of ICSK_ACK_TIMER flag. Therefore, the receiver would not sent an ack until the sender's retransmission timeout. It definitely increases unnecessary latency. Fixes: 5d9f4262b7ea ("tcp: add SACK compression") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: fuyuanli <fuyuanli@didiglobal.com> Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://lore.kernel.org/netdev/20230529113804.GA20300@didi-ThinkCentre-M920t-N000/ Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230531080150.GA20424@didi-ThinkCentre-M920t-N000 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-01net/sched: flower: fix possible OOB write in fl_set_geneve_opt()Hangyu Hua
If we send two TCA_FLOWER_KEY_ENC_OPTS_GENEVE packets and their total size is 252 bytes(key->enc_opts.len = 252) then key->enc_opts.len = opt->length = data_len / 4 = 0 when the third TCA_FLOWER_KEY_ENC_OPTS_GENEVE packet enters fl_set_geneve_opt. This bypasses the next bounds check and results in an out-of-bounds. Fixes: 0a6e77784f49 ("net/sched: allow flower to match tunnel options") Signed-off-by: Hangyu Hua <hbh25y@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Pieter Jansen van Vuuren <pieter.jansen-van-vuuren@amd.com> Link: https://lore.kernel.org/r/20230531102805.27090-1-hbh25y@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-05-31devlink: make health report on unregistered instance warn just onceJakub Kicinski
Devlink health is involved in error recovery. Machines in bad state tend to be fairly unreliable, and occasionally get stuck in error loops. Even with a reasonable grace period devlink health may get a thousand reports in an hour. In case of reporting on an unregistered devlink instance the subsequent reports don't add much value. Switch to WARN_ON_ONCE() to avoid flooding dmesg and fleet monitoring dashboards. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230531015523.48961-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>