summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2024-10-07net: explicitly clear the sk pointer, when pf->create failsIgnat Korchagin
We have recently noticed the exact same KASAN splat as in commit 6cd4a78d962b ("net: do not leave a dangling sk pointer, when socket creation fails"). The problem is that commit did not fully address the problem, as some pf->create implementations do not use sk_common_release in their error paths. For example, we can use the same reproducer as in the above commit, but changing ping to arping. arping uses AF_PACKET socket and if packet_create fails, it will just sk_free the allocated sk object. While we could chase all the pf->create implementations and make sure they NULL the freed sk object on error from the socket, we can't guarantee future protocols will not make the same mistake. So it is easier to just explicitly NULL the sk pointer upon return from pf->create in __sock_create. We do know that pf->create always releases the allocated sk object on error, so if the pointer is not NULL, it is definitely dangling. Fixes: 6cd4a78d962b ("net: do not leave a dangling sk pointer, when socket creation fails") Signed-off-by: Ignat Korchagin <ignat@cloudflare.com> Cc: stable@vger.kernel.org Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241003170151.69445-1-ignat@cloudflare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-07Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio fixes from Michael Tsirkin: "Several small bugfixes all over the place. Most notably, fixes the vsock allocation with GFP_KERNEL in atomic context, which has been triggering warnings for lots of testers" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: vhost/scsi: null-ptr-dereference in vhost_scsi_get_req() vsock/virtio: use GFP_ATOMIC under RCU read lock virtio_console: fix misc probe bugs virtio_ring: tag event_triggered as racy for KCSAN vdpa/octeon_ep: Fix format specifier for pointers in debug messages
2024-10-07vsock/virtio: use GFP_ATOMIC under RCU read lockMichael S. Tsirkin
virtio_transport_send_pkt in now called on transport fast path, under RCU read lock. In that case, we have a bug: virtio_add_sgs is called with GFP_KERNEL, and might sleep. Pass the gfp flags as an argument, and use GFP_ATOMIC on the fast path. Link: https://lore.kernel.org/all/hfcr2aget2zojmqpr4uhlzvnep4vgskblx5b6xf2ddosbsrke7@nt34bxgp7j2x Fixes: efcd71af38be ("vsock/virtio: avoid queuing packets when intermediate queue is empty") Reported-by: Christian Brauner <brauner@kernel.org> Cc: Stefano Garzarella <sgarzare@redhat.com> Cc: Luigi Leonardi <luigi.leonardi@outlook.com> Message-ID: <3fbfb6e871f625f89eb578c7228e127437b1975a.1727876449.git.mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Luigi Leonardi <luigi.leonardi@outlook.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
2024-10-07xfrm: validate new SA's prefixlen using SA family when sel.family is unsetSabrina Dubroca
This expands the validation introduced in commit 07bf7908950a ("xfrm: Validate address prefix lengths in the xfrm selector.") syzbot created an SA with usersa.sel.family = AF_UNSPEC usersa.sel.prefixlen_s = 128 usersa.family = AF_INET Because of the AF_UNSPEC selector, verify_newsa_info doesn't put limits on prefixlen_{s,d}. But then copy_from_user_state sets x->sel.family to usersa.family (AF_INET). Do the same conversion in verify_newsa_info before validating prefixlen_{s,d}, since that's how prefixlen is going to be used later on. Reported-by: syzbot+cc39f136925517aed571@syzkaller.appspotmail.com Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-10-04net: Fix an unsafe loop on the listAnastasia Kovaleva
The kernel may crash when deleting a genetlink family if there are still listeners for that family: Oops: Kernel access of bad area, sig: 11 [#1] ... NIP [c000000000c080bc] netlink_update_socket_mc+0x3c/0xc0 LR [c000000000c0f764] __netlink_clear_multicast_users+0x74/0xc0 Call Trace: __netlink_clear_multicast_users+0x74/0xc0 genl_unregister_family+0xd4/0x2d0 Change the unsafe loop on the list to a safe one, because inside the loop there is an element removal from this list. Fixes: b8273570f802 ("genetlink: fix netns vs. netlink table locking (2)") Cc: stable@vger.kernel.org Signed-off-by: Anastasia Kovaleva <a.kovaleva@yadro.com> Reviewed-by: Dmitry Bogdanov <d.bogdanov@yadro.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20241003104431.12391-1-a.kovaleva@yadro.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-04Bluetooth: hci_conn: Fix UAF in hci_enhanced_setup_syncLuiz Augusto von Dentz
This checks if the ACL connection remains valid as it could be destroyed while hci_enhanced_setup_sync is pending on cmd_sync leading to the following trace: BUG: KASAN: slab-use-after-free in hci_enhanced_setup_sync+0x91b/0xa60 Read of size 1 at addr ffff888002328ffd by task kworker/u5:2/37 CPU: 0 UID: 0 PID: 37 Comm: kworker/u5:2 Not tainted 6.11.0-rc6-01300-g810be445d8d6 #7099 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014 Workqueue: hci0 hci_cmd_sync_work Call Trace: <TASK> dump_stack_lvl+0x5d/0x80 ? hci_enhanced_setup_sync+0x91b/0xa60 print_report+0x152/0x4c0 ? hci_enhanced_setup_sync+0x91b/0xa60 ? __virt_addr_valid+0x1fa/0x420 ? hci_enhanced_setup_sync+0x91b/0xa60 kasan_report+0xda/0x1b0 ? hci_enhanced_setup_sync+0x91b/0xa60 hci_enhanced_setup_sync+0x91b/0xa60 ? __pfx_hci_enhanced_setup_sync+0x10/0x10 ? __pfx___mutex_lock+0x10/0x10 hci_cmd_sync_work+0x1c2/0x330 process_one_work+0x7d9/0x1360 ? __pfx_lock_acquire+0x10/0x10 ? __pfx_process_one_work+0x10/0x10 ? assign_work+0x167/0x240 worker_thread+0x5b7/0xf60 ? __kthread_parkme+0xac/0x1c0 ? __pfx_worker_thread+0x10/0x10 ? __pfx_worker_thread+0x10/0x10 kthread+0x293/0x360 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2f/0x70 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Allocated by task 34: kasan_save_stack+0x30/0x50 kasan_save_track+0x14/0x30 __kasan_kmalloc+0x8f/0xa0 __hci_conn_add+0x187/0x17d0 hci_connect_sco+0x2e1/0xb90 sco_sock_connect+0x2a2/0xb80 __sys_connect+0x227/0x2a0 __x64_sys_connect+0x6d/0xb0 do_syscall_64+0x71/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e Freed by task 37: kasan_save_stack+0x30/0x50 kasan_save_track+0x14/0x30 kasan_save_free_info+0x3b/0x60 __kasan_slab_free+0x101/0x160 kfree+0xd0/0x250 device_release+0x9a/0x210 kobject_put+0x151/0x280 hci_conn_del+0x448/0xbf0 hci_abort_conn_sync+0x46f/0x980 hci_cmd_sync_work+0x1c2/0x330 process_one_work+0x7d9/0x1360 worker_thread+0x5b7/0xf60 kthread+0x293/0x360 ret_from_fork+0x2f/0x70 ret_from_fork_asm+0x1a/0x30 Cc: stable@vger.kernel.org Fixes: e07a06b4eb41 ("Bluetooth: Convert SCO configure_datapath to hci_sync") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-10-04Bluetooth: RFCOMM: FIX possible deadlock in rfcomm_sk_state_changeLuiz Augusto von Dentz
rfcomm_sk_state_change attempts to use sock_lock so it must never be called with it locked but rfcomm_sock_ioctl always attempt to lock it causing the following trace: ====================================================== WARNING: possible circular locking dependency detected 6.8.0-syzkaller-08951-gfe46a7dd189e #0 Not tainted ------------------------------------------------------ syz-executor386/5093 is trying to acquire lock: ffff88807c396258 (sk_lock-AF_BLUETOOTH-BTPROTO_RFCOMM){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1671 [inline] ffff88807c396258 (sk_lock-AF_BLUETOOTH-BTPROTO_RFCOMM){+.+.}-{0:0}, at: rfcomm_sk_state_change+0x5b/0x310 net/bluetooth/rfcomm/sock.c:73 but task is already holding lock: ffff88807badfd28 (&d->lock){+.+.}-{3:3}, at: __rfcomm_dlc_close+0x226/0x6a0 net/bluetooth/rfcomm/core.c:491 Reported-by: syzbot+d7ce59b06b3eb14fd218@syzkaller.appspotmail.com Tested-by: syzbot+d7ce59b06b3eb14fd218@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d7ce59b06b3eb14fd218 Fixes: 3241ad820dbb ("[Bluetooth] Add timestamp support to L2CAP, RFCOMM and SCO") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-10-04netfilter: br_netfilter: fix panic with metadata_dst skbAndy Roulin
Fix a kernel panic in the br_netfilter module when sending untagged traffic via a VxLAN device. This happens during the check for fragmentation in br_nf_dev_queue_xmit. It is dependent on: 1) the br_netfilter module being loaded; 2) net.bridge.bridge-nf-call-iptables set to 1; 3) a bridge with a VxLAN (single-vxlan-device) netdevice as a bridge port; 4) untagged frames with size higher than the VxLAN MTU forwarded/flooded When forwarding the untagged packet to the VxLAN bridge port, before the netfilter hooks are called, br_handle_egress_vlan_tunnel is called and changes the skb_dst to the tunnel dst. The tunnel_dst is a metadata type of dst, i.e., skb_valid_dst(skb) is false, and metadata->dst.dev is NULL. Then in the br_netfilter hooks, in br_nf_dev_queue_xmit, there's a check for frames that needs to be fragmented: frames with higher MTU than the VxLAN device end up calling br_nf_ip_fragment, which in turns call ip_skb_dst_mtu. The ip_dst_mtu tries to use the skb_dst(skb) as if it was a valid dst with valid dst->dev, thus the crash. This case was never supported in the first place, so drop the packet instead. PING 10.0.0.2 (10.0.0.2) from 0.0.0.0 h1-eth0: 2000(2028) bytes of data. [ 176.291791] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000110 [ 176.292101] Mem abort info: [ 176.292184] ESR = 0x0000000096000004 [ 176.292322] EC = 0x25: DABT (current EL), IL = 32 bits [ 176.292530] SET = 0, FnV = 0 [ 176.292709] EA = 0, S1PTW = 0 [ 176.292862] FSC = 0x04: level 0 translation fault [ 176.293013] Data abort info: [ 176.293104] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 [ 176.293488] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 176.293787] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 176.293995] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000043ef5000 [ 176.294166] [0000000000000110] pgd=0000000000000000, p4d=0000000000000000 [ 176.294827] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP [ 176.295252] Modules linked in: vxlan ip6_udp_tunnel udp_tunnel veth br_netfilter bridge stp llc ipv6 crct10dif_ce [ 176.295923] CPU: 0 PID: 188 Comm: ping Not tainted 6.8.0-rc3-g5b3fbd61b9d1 #2 [ 176.296314] Hardware name: linux,dummy-virt (DT) [ 176.296535] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 176.296808] pc : br_nf_dev_queue_xmit+0x390/0x4ec [br_netfilter] [ 176.297382] lr : br_nf_dev_queue_xmit+0x2ac/0x4ec [br_netfilter] [ 176.297636] sp : ffff800080003630 [ 176.297743] x29: ffff800080003630 x28: 0000000000000008 x27: ffff6828c49ad9f8 [ 176.298093] x26: ffff6828c49ad000 x25: 0000000000000000 x24: 00000000000003e8 [ 176.298430] x23: 0000000000000000 x22: ffff6828c4960b40 x21: ffff6828c3b16d28 [ 176.298652] x20: ffff6828c3167048 x19: ffff6828c3b16d00 x18: 0000000000000014 [ 176.298926] x17: ffffb0476322f000 x16: ffffb7e164023730 x15: 0000000095744632 [ 176.299296] x14: ffff6828c3f1c880 x13: 0000000000000002 x12: ffffb7e137926a70 [ 176.299574] x11: 0000000000000001 x10: ffff6828c3f1c898 x9 : 0000000000000000 [ 176.300049] x8 : ffff6828c49bf070 x7 : 0008460f18d5f20e x6 : f20e0100bebafeca [ 176.300302] x5 : ffff6828c7f918fe x4 : ffff6828c49bf070 x3 : 0000000000000000 [ 176.300586] x2 : 0000000000000000 x1 : ffff6828c3c7ad00 x0 : ffff6828c7f918f0 [ 176.300889] Call trace: [ 176.301123] br_nf_dev_queue_xmit+0x390/0x4ec [br_netfilter] [ 176.301411] br_nf_post_routing+0x2a8/0x3e4 [br_netfilter] [ 176.301703] nf_hook_slow+0x48/0x124 [ 176.302060] br_forward_finish+0xc8/0xe8 [bridge] [ 176.302371] br_nf_hook_thresh+0x124/0x134 [br_netfilter] [ 176.302605] br_nf_forward_finish+0x118/0x22c [br_netfilter] [ 176.302824] br_nf_forward_ip.part.0+0x264/0x290 [br_netfilter] [ 176.303136] br_nf_forward+0x2b8/0x4e0 [br_netfilter] [ 176.303359] nf_hook_slow+0x48/0x124 [ 176.303803] __br_forward+0xc4/0x194 [bridge] [ 176.304013] br_flood+0xd4/0x168 [bridge] [ 176.304300] br_handle_frame_finish+0x1d4/0x5c4 [bridge] [ 176.304536] br_nf_hook_thresh+0x124/0x134 [br_netfilter] [ 176.304978] br_nf_pre_routing_finish+0x29c/0x494 [br_netfilter] [ 176.305188] br_nf_pre_routing+0x250/0x524 [br_netfilter] [ 176.305428] br_handle_frame+0x244/0x3cc [bridge] [ 176.305695] __netif_receive_skb_core.constprop.0+0x33c/0xecc [ 176.306080] __netif_receive_skb_one_core+0x40/0x8c [ 176.306197] __netif_receive_skb+0x18/0x64 [ 176.306369] process_backlog+0x80/0x124 [ 176.306540] __napi_poll+0x38/0x17c [ 176.306636] net_rx_action+0x124/0x26c [ 176.306758] __do_softirq+0x100/0x26c [ 176.307051] ____do_softirq+0x10/0x1c [ 176.307162] call_on_irq_stack+0x24/0x4c [ 176.307289] do_softirq_own_stack+0x1c/0x2c [ 176.307396] do_softirq+0x54/0x6c [ 176.307485] __local_bh_enable_ip+0x8c/0x98 [ 176.307637] __dev_queue_xmit+0x22c/0xd28 [ 176.307775] neigh_resolve_output+0xf4/0x1a0 [ 176.308018] ip_finish_output2+0x1c8/0x628 [ 176.308137] ip_do_fragment+0x5b4/0x658 [ 176.308279] ip_fragment.constprop.0+0x48/0xec [ 176.308420] __ip_finish_output+0xa4/0x254 [ 176.308593] ip_finish_output+0x34/0x130 [ 176.308814] ip_output+0x6c/0x108 [ 176.308929] ip_send_skb+0x50/0xf0 [ 176.309095] ip_push_pending_frames+0x30/0x54 [ 176.309254] raw_sendmsg+0x758/0xaec [ 176.309568] inet_sendmsg+0x44/0x70 [ 176.309667] __sys_sendto+0x110/0x178 [ 176.309758] __arm64_sys_sendto+0x28/0x38 [ 176.309918] invoke_syscall+0x48/0x110 [ 176.310211] el0_svc_common.constprop.0+0x40/0xe0 [ 176.310353] do_el0_svc+0x1c/0x28 [ 176.310434] el0_svc+0x34/0xb4 [ 176.310551] el0t_64_sync_handler+0x120/0x12c [ 176.310690] el0t_64_sync+0x190/0x194 [ 176.311066] Code: f9402e61 79402aa2 927ff821 f9400023 (f9408860) [ 176.315743] ---[ end trace 0000000000000000 ]--- [ 176.316060] Kernel panic - not syncing: Oops: Fatal exception in interrupt [ 176.316371] Kernel Offset: 0x37e0e3000000 from 0xffff800080000000 [ 176.316564] PHYS_OFFSET: 0xffff97d780000000 [ 176.316782] CPU features: 0x0,88000203,3c020000,0100421b [ 176.317210] Memory Limit: none [ 176.317527] ---[ end Kernel panic - not syncing: Oops: Fatal Exception in interrupt ]---\ Fixes: 11538d039ac6 ("bridge: vlan dst_metadata hooks in ingress and egress paths") Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Andy Roulin <aroulin@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20241001154400.22787-2-aroulin@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-03rxrpc: Fix uninitialised variable in rxrpc_send_data()David Howells
Fix the uninitialised txb variable in rxrpc_send_data() by moving the code that loads it above all the jumps to maybe_error, txb being stored back into call->tx_pending right before the normal return. Fixes: b0f571ecd794 ("rxrpc: Fix locking in rxrpc's sendmsg") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lists.infradead.org/pipermail/linux-afs/2024-October/008896.html Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/20241001132702.3122709-3-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-03rxrpc: Fix a race between socket set up and I/O thread creationDavid Howells
In rxrpc_open_socket(), it sets up the socket and then sets up the I/O thread that will handle it. This is a problem, however, as there's a gap between the two phases in which a packet may come into rxrpc_encap_rcv() from the UDP packet but we oops when trying to wake the not-yet created I/O thread. As a quick fix, just make rxrpc_encap_rcv() discard the packet if there's no I/O thread yet. A better, but more intrusive fix would perhaps be to rearrange things such that the socket creation is done by the I/O thread. Fixes: a275da62e8c1 ("rxrpc: Create a per-local endpoint receive queue and I/O thread") Signed-off-by: David Howells <dhowells@redhat.com> cc: yuxuanzhe@outlook.com cc: Marc Dionne <marc.dionne@auristor.com> cc: Simon Horman <horms@kernel.org> cc: linux-afs@lists.infradead.org Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241001132702.3122709-2-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-03tcp: fix TFO SYN_RECV to not zero retrans_stamp with retransmits outNeal Cardwell
Fix tcp_rcv_synrecv_state_fastopen() to not zero retrans_stamp if retransmits are outstanding. tcp_fastopen_synack_timer() sets retrans_stamp, so typically we'll need to zero retrans_stamp here to prevent spurious retransmits_timed_out(). The logic to zero retrans_stamp is from this 2019 commit: commit cd736d8b67fb ("tcp: fix retrans timestamp on passive Fast Open") However, in the corner case where the ACK of our TFO SYNACK carried some SACK blocks that caused us to enter TCP_CA_Recovery then that non-zero retrans_stamp corresponds to the active fast recovery, and we need to leave retrans_stamp with its current non-zero value, for correct ETIMEDOUT and undo behavior. Fixes: cd736d8b67fb ("tcp: fix retrans timestamp on passive Fast Open") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241001200517.2756803-4-ncardwell.sw@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-03tcp: fix tcp_enter_recovery() to zero retrans_stamp when it's safeNeal Cardwell
Fix tcp_enter_recovery() so that if there are no retransmits out then we zero retrans_stamp when entering fast recovery. This is necessary to fix two buggy behaviors. Currently a non-zero retrans_stamp value can persist across multiple back-to-back loss recovery episodes. This is because we generally only clears retrans_stamp if we are completely done with loss recoveries, and get to tcp_try_to_open() and find !tcp_any_retrans_done(sk). This behavior causes two bugs: (1) When a loss recovery episode (CA_Loss or CA_Recovery) is followed immediately by a new CA_Recovery, the retrans_stamp value can persist and can be a time before this new CA_Recovery episode starts. That means that timestamp-based undo will be using the wrong retrans_stamp (a value that is too old) when comparing incoming TS ecr values to retrans_stamp to see if the current fast recovery episode can be undone. (2) If there is a roughly minutes-long sequence of back-to-back fast recovery episodes, one after another (e.g. in a shallow-buffered or policed bottleneck), where each fast recovery successfully makes forward progress and recovers one window of sequence space (but leaves at least one retransmit in flight at the end of the recovery), followed by several RTOs, then the ETIMEDOUT check may be using the wrong retrans_stamp (a value set at the start of the first fast recovery in the sequence). This can cause a very premature ETIMEDOUT, killing the connection prematurely. This commit changes the code to zero retrans_stamp when entering fast recovery, when this is known to be safe (no retransmits are out in the network). That ensures that when starting a fast recovery episode, and it is safe to do so, retrans_stamp is set when we send the fast retransmit packet. That addresses both bug (1) and bug (2) by ensuring that (if no retransmits are out when we start a fast recovery) we use the initial fast retransmit of this fast recovery as the time value for undo and ETIMEDOUT calculations. This makes intuitive sense, since the start of a new fast recovery episode (in a scenario where no lost packets are out in the network) means that the connection has made forward progress since the last RTO or fast recovery, and we should thus "restart the clock" used for both undo and ETIMEDOUT logic. Note that if when we start fast recovery there *are* retransmits out in the network, there can still be undesirable (1)/(2) issues. For example, after this patch we can still have the (1) and (2) problems in cases like this: + round 1: sender sends flight 1 + round 2: sender receives SACKs and enters fast recovery 1, retransmits some packets in flight 1 and then sends some new data as flight 2 + round 3: sender receives some SACKs for flight 2, notes losses, and retransmits some packets to fill the holes in flight 2 + fast recovery has some lost retransmits in flight 1 and continues for one or more rounds sending retransmits for flight 1 and flight 2 + fast recovery 1 completes when snd_una reaches high_seq at end of flight 1 + there are still holes in the SACK scoreboard in flight 2, so we enter fast recovery 2, but some retransmits in the flight 2 sequence range are still in flight (retrans_out > 0), so we can't execute the new retrans_stamp=0 added here to clear retrans_stamp It's not yet clear how to fix these remaining (1)/(2) issues in an efficient way without breaking undo behavior, given that retrans_stamp is currently used for undo and ETIMEDOUT. Perhaps the optimal (but expensive) strategy would be to set retrans_stamp to the timestamp of the earliest outstanding retransmit when entering fast recovery. But at least this commit makes things better. Note that this does not change the semantics of retrans_stamp; it simply makes retrans_stamp accurate in some cases where it was not before: (1) Some loss recovery, followed by an immediate entry into a fast recovery, where there are no retransmits out when entering the fast recovery. (2) When a TFO server has a SYNACK retransmit that sets retrans_stamp, and then the ACK that completes the 3-way handshake has SACK blocks that trigger a fast recovery. In this case when entering fast recovery we want to zero out the retrans_stamp from the TFO SYNACK retransmit, and set the retrans_stamp based on the timestamp of the fast recovery. We introduce a tcp_retrans_stamp_cleanup() helper, because this two-line sequence already appears in 3 places and is about to appear in 2 more as a result of this bug fix patch series. Once this bug fix patches series in the net branch makes it into the net-next branch we'll update the 3 other call sites to use the new helper. This is a long-standing issue. The Fixes tag below is chosen to be the oldest commit at which the patch will apply cleanly, which is from Linux v3.5 in 2012. Fixes: 1fbc340514fc ("tcp: early retransmit: tcp_enter_recovery()") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241001200517.2756803-3-ncardwell.sw@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-03tcp: fix to allow timestamp undo if no retransmits were sentNeal Cardwell
Fix the TCP loss recovery undo logic in tcp_packet_delayed() so that it can trigger undo even if TSQ prevents a fast recovery episode from reaching tcp_retransmit_skb(). Geumhwan Yu <geumhwan.yu@samsung.com> recently reported that after this commit from 2019: commit bc9f38c8328e ("tcp: avoid unconditional congestion window undo on SYN retransmit") ...and before this fix we could have buggy scenarios like the following: + Due to reordering, a TCP connection receives some SACKs and enters a spurious fast recovery. + TSQ prevents all invocations of tcp_retransmit_skb(), because many skbs are queued in lower layers of the sending machine's network stack; thus tp->retrans_stamp remains 0. + The connection receives a TCP timestamp ECR value echoing a timestamp before the fast recovery, indicating that the fast recovery was spurious. + The connection fails to undo the spurious fast recovery because tp->retrans_stamp is 0, and thus tcp_packet_delayed() returns false, due to the new logic in the 2019 commit: commit bc9f38c8328e ("tcp: avoid unconditional congestion window undo on SYN retransmit") This fix tweaks the logic to be more similar to the tcp_packet_delayed() logic before bc9f38c8328e, except that we take care not to be fooled by the FLAG_SYN_ACKED code path zeroing out tp->retrans_stamp (the bug noted and fixed by Yuchung in bc9f38c8328e). Note that this returns the high-level behavior of tcp_packet_delayed() to again match the comment for the function, which says: "Nothing was retransmitted or returned timestamp is less than timestamp of the first retransmission." Note that this comment is in the original 2005-04-16 Linux git commit, so this is evidently long-standing behavior. Fixes: bc9f38c8328e ("tcp: avoid unconditional congestion window undo on SYN retransmit") Reported-by: Geumhwan Yu <geumhwan.yu@samsung.com> Diagnosed-by: Geumhwan Yu <geumhwan.yu@samsung.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241001200517.2756803-2-ncardwell.sw@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-03Merge tag 'net-6.12-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from ieee802154, bluetooth and netfilter. Current release - regressions: - eth: mlx5: fix wrong reserved field in hca_cap_2 in mlx5_ifc - eth: am65-cpsw: fix forever loop in cleanup code Current release - new code bugs: - eth: mlx5: HWS, fixed double-free in error flow of creating SQ Previous releases - regressions: - core: avoid potential underflow in qdisc_pkt_len_init() with UFO - core: test for not too small csum_start in virtio_net_hdr_to_skb() - vrf: revert "vrf: remove unnecessary RCU-bh critical section" - bluetooth: - fix uaf in l2cap_connect - fix possible crash on mgmt_index_removed - dsa: improve shutdown sequence - eth: mlx5e: SHAMPO, fix overflow of hd_per_wq - eth: ip_gre: fix drops of small packets in ipgre_xmit Previous releases - always broken: - core: fix gso_features_check to check for both dev->gso_{ipv4_,}max_size - core: fix tcp fraglist segmentation after pull from frag_list - netfilter: nf_tables: prevent nf_skb_duplicated corruption - sctp: set sk_state back to CLOSED if autobind fails in sctp_listen_start - mac802154: fix potential RCU dereference issue in mac802154_scan_worker - eth: fec: restart PPS after link state change" * tag 'net-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (48 commits) sctp: set sk_state back to CLOSED if autobind fails in sctp_listen_start dt-bindings: net: xlnx,axi-ethernet: Add missing reg minItems doc: net: napi: Update documentation for napi_schedule_irqoff net/ncsi: Disable the ncsi work before freeing the associated structure net: phy: qt2025: Fix warning: unused import DeviceId gso: fix udp gso fraglist segmentation after pull from frag_list bridge: mcast: Fail MDB get request on empty entry vrf: revert "vrf: Remove unnecessary RCU-bh critical section" net: ethernet: ti: am65-cpsw: Fix forever loop in cleanup code net: phy: realtek: Check the index value in led_hw_control_get ppp: do not assume bh is held in ppp_channel_bridge_input() selftests: rds: move include.sh to TEST_FILES net: test for not too small csum_start in virtio_net_hdr_to_skb() net: gso: fix tcp fraglist segmentation after pull from frag_list ipv4: ip_gre: Fix drops of small packets in ipgre_xmit net: stmmac: dwmac4: extend timeout for VLAN Tag register busy bit check net: add more sanity checks to qdisc_pkt_len_init() net: avoid potential underflow in qdisc_pkt_len_init() with UFO net: ethernet: ti: cpsw_ale: Fix warning on some platforms net: microchip: Make FDMA config symbol invisible ...
2024-10-03sctp: set sk_state back to CLOSED if autobind fails in sctp_listen_startXin Long
In sctp_listen_start() invoked by sctp_inet_listen(), it should set the sk_state back to CLOSED if sctp_autobind() fails due to whatever reason. Otherwise, next time when calling sctp_inet_listen(), if sctp_sk(sk)->reuse is already set via setsockopt(SCTP_REUSE_PORT), sctp_sk(sk)->bind_hash will be dereferenced as sk_state is LISTENING, which causes a crash as bind_hash is NULL. KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] RIP: 0010:sctp_inet_listen+0x7f0/0xa20 net/sctp/socket.c:8617 Call Trace: <TASK> __sys_listen_socket net/socket.c:1883 [inline] __sys_listen+0x1b7/0x230 net/socket.c:1894 __do_sys_listen net/socket.c:1902 [inline] Fixes: 5e8f3f703ae4 ("sctp: simplify sctp listening code") Reported-by: syzbot+f4e0f821e3a3b7cee51d@syzkaller.appspotmail.com Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Link: https://patch.msgid.link/a93e655b3c153dc8945d7a812e6d8ab0d52b7aa0.1727729391.git.lucien.xin@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-03Merge tag 'nf-24-10-02' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Fix incorrect documentation in uapi/linux/netfilter/nf_tables.h regarding flowtable hooks, from Phil Sutter. 2) Fix nft_audit.sh selftests with newer nft binaries, due to different (valid) audit output, also from Phil. 3) Disable BH when duplicating packets via nf_dup infrastructure, otherwise race on nf_skb_duplicated for locally generated traffic. From Eric. 4) Missing return in callback of selftest C program, from zhang jiao. netfilter pull request 24-10-02 * tag 'nf-24-10-02' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: selftests: netfilter: Add missing return value netfilter: nf_tables: prevent nf_skb_duplicated corruption selftests: netfilter: Fix nft_audit.sh for newer nft binaries netfilter: uapi: NFTA_FLOWTABLE_HOOK is NLA_NESTED ==================== Link: https://patch.msgid.link/20241002202421.1281311-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-03net/ncsi: Disable the ncsi work before freeing the associated structureEddie James
The work function can run after the ncsi device is freed, resulting in use-after-free bugs or kernel panic. Fixes: 2d283bdd079c ("net/ncsi: Resource management") Signed-off-by: Eddie James <eajames@linux.ibm.com> Link: https://patch.msgid.link/20240925155523.1017097-1-eajames@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-02gso: fix udp gso fraglist segmentation after pull from frag_listWillem de Bruijn
Detect gso fraglist skbs with corrupted geometry (see below) and pass these to skb_segment instead of skb_segment_list, as the first can segment them correctly. Valid SKB_GSO_FRAGLIST skbs - consist of two or more segments - the head_skb holds the protocol headers plus first gso_size - one or more frag_list skbs hold exactly one segment - all but the last must be gso_size Optional datapath hooks such as NAT and BPF (bpf_skb_pull_data) can modify these skbs, breaking these invariants. In extreme cases they pull all data into skb linear. For UDP, this causes a NULL ptr deref in __udpv4_gso_segment_list_csum at udp_hdr(seg->next)->dest. Detect invalid geometry due to pull, by checking head_skb size. Don't just drop, as this may blackhole a destination. Convert to be able to pass to regular skb_segment. Link: https://lore.kernel.org/netdev/20240428142913.18666-1-shiming.cheng@mediatek.com/ Fixes: 9fd1ff5d2ac7 ("udp: Support UDP fraglist GRO/GSO.") Signed-off-by: Willem de Bruijn <willemb@google.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20241001171752.107580-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-02bridge: mcast: Fail MDB get request on empty entryIdo Schimmel
When user space deletes a port from an MDB entry, the port is removed synchronously. If this was the last port in the entry and the entry is not joined by the host itself, then the entry is scheduled for deletion via a timer. The above means that it is possible for the MDB get netlink request to retrieve an empty entry which is scheduled for deletion. This is problematic as after deleting the last port in an entry, user space cannot rely on a non-zero return code from the MDB get request as an indication that the port was successfully removed. Fix by returning an error when the entry's port list is empty and the entry is not joined by the host. Fixes: 68b380a395a7 ("bridge: mcast: Add MDB get support") Reported-by: Jamie Bainbridge <jamie.bainbridge@gmail.com> Closes: https://lore.kernel.org/netdev/c92569919307749f879b9482b0f3e125b7d9d2e3.1726480066.git.jamie.bainbridge@gmail.com/ Tested-by: Jamie Bainbridge <jamie.bainbridge@gmail.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20240929123640.558525-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-02net: gso: fix tcp fraglist segmentation after pull from frag_listFelix Fietkau
Detect tcp gso fraglist skbs with corrupted geometry (see below) and pass these to skb_segment instead of skb_segment_list, as the first can segment them correctly. Valid SKB_GSO_FRAGLIST skbs - consist of two or more segments - the head_skb holds the protocol headers plus first gso_size - one or more frag_list skbs hold exactly one segment - all but the last must be gso_size Optional datapath hooks such as NAT and BPF (bpf_skb_pull_data) can modify these skbs, breaking these invariants. In extreme cases they pull all data into skb linear. For TCP, this causes a NULL ptr deref in __tcpv4_gso_segment_list_csum at tcp_hdr(seg->next). Detect invalid geometry due to pull, by checking head_skb size. Don't just drop, as this may blackhole a destination. Convert to be able to pass to regular skb_segment. Approach and description based on a patch by Willem de Bruijn. Link: https://lore.kernel.org/netdev/20240428142913.18666-1-shiming.cheng@mediatek.com/ Link: https://lore.kernel.org/netdev/20240922150450.3873767-1-willemdebruijn.kernel@gmail.com/ Fixes: bee88cd5bd83 ("net: add support for segmenting TCP fraglist GSO packets") Cc: stable@vger.kernel.org Signed-off-by: Felix Fietkau <nbd@nbd.name> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20240926085315.51524-1-nbd@nbd.name Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-02Merge tag 'for-net-2024-09-27' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Luiz Augusto von Dentz says: ==================== bluetooth pull request for net: - btmrvl: Use IRQF_NO_AUTOEN flag in request_irq() - MGMT: Fix possible crash on mgmt_index_removed - L2CAP: Fix uaf in l2cap_connect - Bluetooth: hci_event: Align BR/EDR JUST_WORKS paring with LE * tag 'for-net-2024-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth: Bluetooth: hci_event: Align BR/EDR JUST_WORKS paring with LE Bluetooth: btmrvl: Use IRQF_NO_AUTOEN flag in request_irq() Bluetooth: L2CAP: Fix uaf in l2cap_connect Bluetooth: MGMT: Fix possible crash on mgmt_index_removed ==================== Link: https://patch.msgid.link/20240927145730.2452175-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-02Merge tag 'ieee802154-for-net-2024-09-27' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/wpan/wpan Stefan Schmidt says: ==================== pull-request: ieee802154 for net 2024-09-27 Jinjie Ruan added the use of IRQF_NO_AUTOEN in the mcr20a driver and fixed and addiotinal build dependency problem while doing so. Jiawei Ye, ensured a correct RCU handling in mac802154_scan_worker. * tag 'ieee802154-for-net-2024-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/wpan/wpan: net: ieee802154: mcr20a: Use IRQF_NO_AUTOEN flag in request_irq() mac802154: Fix potential RCU dereference issue in mac802154_scan_worker ieee802154: Fix build error ==================== Link: https://patch.msgid.link/20240927094351.3865511-1-stefan@datenfreihafen.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-02move asm/unaligned.h to linux/unaligned.hAl Viro
asm/unaligned.h is always an include of asm-generic/unaligned.h; might as well move that thing to linux/unaligned.h and include that - there's nothing arch-specific in that header. auto-generated by the following: for i in `git grep -l -w asm/unaligned.h`; do sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i done for i in `git grep -l -w asm-generic/unaligned.h`; do sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i done git mv include/asm-generic/unaligned.h include/linux/unaligned.h git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
2024-10-01bpf: Make sure internal and UAPI bpf_redirect flags don't overlapToke Høiland-Jørgensen
The bpf_redirect_info is shared between the SKB and XDP redirect paths, and the two paths use the same numeric flag values in the ri->flags field (specifically, BPF_F_BROADCAST == BPF_F_NEXTHOP). This means that if skb bpf_redirect_neigh() is used with a non-NULL params argument and, subsequently, an XDP redirect is performed using the same bpf_redirect_info struct, the XDP path will get confused and end up crashing, which syzbot managed to trigger. With the stack-allocated bpf_redirect_info, the structure is no longer shared between the SKB and XDP paths, so the crash doesn't happen anymore. However, different code paths using identically-numbered flag values in the same struct field still seems like a bit of a mess, so this patch cleans that up by moving the flag definitions together and redefining the three flags in BPF_F_REDIRECT_INTERNAL to not overlap with the flags used for XDP. It also adds a BUILD_BUG_ON() check to make sure the overlap is not re-introduced by mistake. Fixes: e624d4ed4aa8 ("xdp: Extend xdp_redirect_map with broadcast support") Reported-by: syzbot+cca39e6e84a367a7e6f6@syzkaller.appspotmail.com Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Closes: https://syzkaller.appspot.com/bug?extid=cca39e6e84a367a7e6f6 Link: https://lore.kernel.org/bpf/20240920125625.59465-1-toke@redhat.com
2024-10-01ipv4: ip_gre: Fix drops of small packets in ipgre_xmitAnton Danilov
Regression Description: Depending on the options specified for the GRE tunnel device, small packets may be dropped. This occurs because the pskb_network_may_pull function fails due to the packet's insufficient length. For example, if only the okey option is specified for the tunnel device, original (before encapsulation) packets smaller than 28 bytes (including the IPv4 header) will be dropped. This happens because the required length is calculated relative to the network header, not the skb->head. Here is how the required length is computed and checked: * The pull_len variable is set to 28 bytes, consisting of: * IPv4 header: 20 bytes * GRE header with Key field: 8 bytes * The pskb_network_may_pull function adds the network offset, shifting the checkable space further to the beginning of the network header and extending it to the beginning of the packet. As a result, the end of the checkable space occurs beyond the actual end of the packet. Instead of ensuring that 28 bytes are present in skb->head, the function is requesting these 28 bytes starting from the network header. For small packets, this requested length exceeds the actual packet size, causing the check to fail and the packets to be dropped. This issue affects both locally originated and forwarded packets in DMVPN-like setups. How to reproduce (for local originated packets): ip link add dev gre1 type gre ikey 1.9.8.4 okey 1.9.8.4 \ local <your-ip> remote 0.0.0.0 ip link set mtu 1400 dev gre1 ip link set up dev gre1 ip address add 192.168.13.1/24 dev gre1 ip neighbor add 192.168.13.2 lladdr <remote-ip> dev gre1 ping -s 1374 -c 10 192.168.13.2 tcpdump -vni gre1 tcpdump -vni <your-ext-iface> 'ip proto 47' ip -s -s -d link show dev gre1 Solution: Use the pskb_may_pull function instead the pskb_network_may_pull. Fixes: 80d875cfc9d3 ("ipv4: ip_gre: Avoid skb_pull() failure in ipgre_xmit()") Signed-off-by: Anton Danilov <littlesmilingcloud@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20240924235158.106062-1-littlesmilingcloud@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-01net: add more sanity checks to qdisc_pkt_len_init()Eric Dumazet
One path takes care of SKB_GSO_DODGY, assuming skb->len is bigger than hdr_len. virtio_net_hdr_to_skb() does not fully dissect TCP headers, it only make sure it is at least 20 bytes. It is possible for an user to provide a malicious 'GSO' packet, total length of 80 bytes. - 20 bytes of IPv4 header - 60 bytes TCP header - a small gso_size like 8 virtio_net_hdr_to_skb() would declare this packet as a normal GSO packet, because it would see 40 bytes of payload, bigger than gso_size. We need to make detect this case to not underflow qdisc_skb_cb(skb)->pkt_len. Fixes: 1def9238d4aa ("net_sched: more precise pkt_len computation") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-01net: avoid potential underflow in qdisc_pkt_len_init() with UFOEric Dumazet
After commit 7c6d2ecbda83 ("net: be more gentle about silly gso requests coming from user") virtio_net_hdr_to_skb() had sanity check to detect malicious attempts from user space to cook a bad GSO packet. Then commit cf9acc90c80ec ("net: virtio_net_hdr_to_skb: count transport header in UFO") while fixing one issue, allowed user space to cook a GSO packet with the following characteristic : IPv4 SKB_GSO_UDP, gso_size=3, skb->len = 28. When this packet arrives in qdisc_pkt_len_init(), we end up with hdr_len = 28 (IPv4 header + UDP header), matching skb->len Then the following sets gso_segs to 0 : gso_segs = DIV_ROUND_UP(skb->len - hdr_len, shinfo->gso_size); Then later we set qdisc_skb_cb(skb)->pkt_len to back to zero :/ qdisc_skb_cb(skb)->pkt_len += (gso_segs - 1) * hdr_len; This leads to the following crash in fq_codel [1] qdisc_pkt_len_init() is best effort, we only want an estimation of the bytes sent on the wire, not crashing the kernel. This patch is fixing this particular issue, a following one adds more sanity checks for another potential bug. [1] [ 70.724101] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 70.724561] #PF: supervisor read access in kernel mode [ 70.724561] #PF: error_code(0x0000) - not-present page [ 70.724561] PGD 10ac61067 P4D 10ac61067 PUD 107ee2067 PMD 0 [ 70.724561] Oops: Oops: 0000 [#1] SMP NOPTI [ 70.724561] CPU: 11 UID: 0 PID: 2163 Comm: b358537762 Not tainted 6.11.0-virtme #991 [ 70.724561] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 70.724561] RIP: 0010:fq_codel_enqueue (net/sched/sch_fq_codel.c:120 net/sched/sch_fq_codel.c:168 net/sched/sch_fq_codel.c:230) sch_fq_codel [ 70.724561] Code: 24 08 49 c1 e1 06 44 89 7c 24 18 45 31 ed 45 31 c0 31 ff 89 44 24 14 4c 03 8b 90 01 00 00 eb 04 39 ca 73 37 4d 8b 39 83 c7 01 <49> 8b 17 49 89 11 41 8b 57 28 45 8b 5f 34 49 c7 07 00 00 00 00 49 All code ======== 0: 24 08 and $0x8,%al 2: 49 c1 e1 06 shl $0x6,%r9 6: 44 89 7c 24 18 mov %r15d,0x18(%rsp) b: 45 31 ed xor %r13d,%r13d e: 45 31 c0 xor %r8d,%r8d 11: 31 ff xor %edi,%edi 13: 89 44 24 14 mov %eax,0x14(%rsp) 17: 4c 03 8b 90 01 00 00 add 0x190(%rbx),%r9 1e: eb 04 jmp 0x24 20: 39 ca cmp %ecx,%edx 22: 73 37 jae 0x5b 24: 4d 8b 39 mov (%r9),%r15 27: 83 c7 01 add $0x1,%edi 2a:* 49 8b 17 mov (%r15),%rdx <-- trapping instruction 2d: 49 89 11 mov %rdx,(%r9) 30: 41 8b 57 28 mov 0x28(%r15),%edx 34: 45 8b 5f 34 mov 0x34(%r15),%r11d 38: 49 c7 07 00 00 00 00 movq $0x0,(%r15) 3f: 49 rex.WB Code starting with the faulting instruction =========================================== 0: 49 8b 17 mov (%r15),%rdx 3: 49 89 11 mov %rdx,(%r9) 6: 41 8b 57 28 mov 0x28(%r15),%edx a: 45 8b 5f 34 mov 0x34(%r15),%r11d e: 49 c7 07 00 00 00 00 movq $0x0,(%r15) 15: 49 rex.WB [ 70.724561] RSP: 0018:ffff95ae85e6fb90 EFLAGS: 00000202 [ 70.724561] RAX: 0000000002000000 RBX: ffff95ae841de000 RCX: 0000000000000000 [ 70.724561] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000001 [ 70.724561] RBP: ffff95ae85e6fbf8 R08: 0000000000000000 R09: ffff95b710a30000 [ 70.724561] R10: 0000000000000000 R11: bdf289445ce31881 R12: ffff95ae85e6fc58 [ 70.724561] R13: 0000000000000000 R14: 0000000000000040 R15: 0000000000000000 [ 70.724561] FS: 000000002c5c1380(0000) GS:ffff95bd7fcc0000(0000) knlGS:0000000000000000 [ 70.724561] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 70.724561] CR2: 0000000000000000 CR3: 000000010c568000 CR4: 00000000000006f0 [ 70.724561] Call Trace: [ 70.724561] <TASK> [ 70.724561] ? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434) [ 70.724561] ? page_fault_oops (arch/x86/mm/fault.c:715) [ 70.724561] ? exc_page_fault (./arch/x86/include/asm/irqflags.h:26 ./arch/x86/include/asm/irqflags.h:87 ./arch/x86/include/asm/irqflags.h:147 arch/x86/mm/fault.c:1489 arch/x86/mm/fault.c:1539) [ 70.724561] ? asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623) [ 70.724561] ? fq_codel_enqueue (net/sched/sch_fq_codel.c:120 net/sched/sch_fq_codel.c:168 net/sched/sch_fq_codel.c:230) sch_fq_codel [ 70.724561] dev_qdisc_enqueue (net/core/dev.c:3784) [ 70.724561] __dev_queue_xmit (net/core/dev.c:3880 (discriminator 2) net/core/dev.c:4390 (discriminator 2)) [ 70.724561] ? irqentry_enter (kernel/entry/common.c:237) [ 70.724561] ? sysvec_apic_timer_interrupt (./arch/x86/include/asm/hardirq.h:74 (discriminator 2) arch/x86/kernel/apic/apic.c:1043 (discriminator 2) arch/x86/kernel/apic/apic.c:1043 (discriminator 2)) [ 70.724561] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:58 (discriminator 4)) [ 70.724561] ? asm_sysvec_apic_timer_interrupt (./arch/x86/include/asm/idtentry.h:702) [ 70.724561] ? virtio_net_hdr_to_skb.constprop.0 (./include/linux/virtio_net.h:129 (discriminator 1)) [ 70.724561] packet_sendmsg (net/packet/af_packet.c:3145 (discriminator 1) net/packet/af_packet.c:3177 (discriminator 1)) [ 70.724561] ? _raw_spin_lock_bh (./arch/x86/include/asm/atomic.h:107 (discriminator 4) ./include/linux/atomic/atomic-arch-fallback.h:2170 (discriminator 4) ./include/linux/atomic/atomic-instrumented.h:1302 (discriminator 4) ./include/asm-generic/qspinlock.h:111 (discriminator 4) ./include/linux/spinlock.h:187 (discriminator 4) ./include/linux/spinlock_api_smp.h:127 (discriminator 4) kernel/locking/spinlock.c:178 (discriminator 4)) [ 70.724561] ? netdev_name_node_lookup_rcu (net/core/dev.c:325 (discriminator 1)) [ 70.724561] __sys_sendto (net/socket.c:730 (discriminator 1) net/socket.c:745 (discriminator 1) net/socket.c:2210 (discriminator 1)) [ 70.724561] ? __sys_setsockopt (./include/linux/file.h:34 net/socket.c:2355) [ 70.724561] __x64_sys_sendto (net/socket.c:2222 (discriminator 1) net/socket.c:2218 (discriminator 1) net/socket.c:2218 (discriminator 1)) [ 70.724561] do_syscall_64 (arch/x86/entry/common.c:52 (discriminator 1) arch/x86/entry/common.c:83 (discriminator 1)) [ 70.724561] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) [ 70.724561] RIP: 0033:0x41ae09 Fixes: cf9acc90c80ec ("net: virtio_net_hdr_to_skb: count transport header in UFO") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jonathan Davies <jonathan.davies@nutanix.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Jonathan Davies <jonathan.davies@nutanix.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-01net: Fix gso_features_check to check for both dev->gso_{ipv4_,}max_sizeDaniel Borkmann
Commit 24ab059d2ebd ("net: check dev->gso_max_size in gso_features_check()") added a dev->gso_max_size test to gso_features_check() in order to fall back to GSO when needed. This was added as it was noticed that some drivers could misbehave if TSO packets get too big. However, the check doesn't respect dev->gso_ipv4_max_size limit. For instance, a device could be configured with BIG TCP for IPv4, but not IPv6. Therefore, add a netif_get_gso_max_size() equivalent to netif_get_gro_max_size() and use the helper to respect both limits before falling back to GSO engine. Fixes: 24ab059d2ebd ("net: check dev->gso_max_size in gso_features_check()") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20240923212242.15669-2-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-01net: Add netif_get_gro_max_size helper for GRODaniel Borkmann
Add a small netif_get_gro_max_size() helper which returns the maximum IPv4 or IPv6 GRO size of the netdevice. We later add a netif_get_gso_max_size() equivalent as well for GSO, so that these helpers can be used consistently instead of open-coded checks. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20240923212242.15669-1-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-01net: dsa: improve shutdown sequenceVladimir Oltean
Alexander Sverdlin presents 2 problems during shutdown with the lan9303 driver. One is specific to lan9303 and the other just happens to reproduce there. The first problem is that lan9303 is unique among DSA drivers in that it calls dev_get_drvdata() at "arbitrary runtime" (not probe, not shutdown, not remove): phy_state_machine() -> ... -> dsa_user_phy_read() -> ds->ops->phy_read() -> lan9303_phy_read() -> chip->ops->phy_read() -> lan9303_mdio_phy_read() -> dev_get_drvdata() But we never stop the phy_state_machine(), so it may continue to run after dsa_switch_shutdown(). Our common pattern in all DSA drivers is to set drvdata to NULL to suppress the remove() method that may come afterwards. But in this case it will result in an NPD. The second problem is that the way in which we set dp->conduit->dsa_ptr = NULL; is concurrent with receive packet processing. dsa_switch_rcv() checks once whether dev->dsa_ptr is NULL, but afterwards, rather than continuing to use that non-NULL value, dev->dsa_ptr is dereferenced again and again without NULL checks: dsa_conduit_find_user() and many other places. In between dereferences, there is no locking to ensure that what was valid once continues to be valid. Both problems have the common aspect that closing the conduit interface solves them. In the first case, dev_close(conduit) triggers the NETDEV_GOING_DOWN event in dsa_user_netdevice_event() which closes user ports as well. dsa_port_disable_rt() calls phylink_stop(), which synchronously stops the phylink state machine, and ds->ops->phy_read() will thus no longer call into the driver after this point. In the second case, dev_close(conduit) should do this, as per Documentation/networking/driver.rst: | Quiescence | ---------- | | After the ndo_stop routine has been called, the hardware must | not receive or transmit any data. All in flight packets must | be aborted. If necessary, poll or wait for completion of | any reset commands. So it should be sufficient to ensure that later, when we zeroize conduit->dsa_ptr, there will be no concurrent dsa_switch_rcv() call on this conduit. The addition of the netif_device_detach() function is to ensure that ioctls, rtnetlinks and ethtool requests on the user ports no longer propagate down to the driver - we're no longer prepared to handle them. The race condition actually did not exist when commit 0650bf52b31f ("net: dsa: be compatible with masters which unregister on shutdown") first introduced dsa_switch_shutdown(). It was created later, when we stopped unregistering the user interfaces from a bad spot, and we just replaced that sequence with a racy zeroization of conduit->dsa_ptr (one which doesn't ensure that the interfaces aren't up). Reported-by: Alexander Sverdlin <alexander.sverdlin@siemens.com> Closes: https://lore.kernel.org/netdev/2d2e3bba17203c14a5ffdabc174e3b6bbb9ad438.camel@siemens.com/ Closes: https://lore.kernel.org/netdev/c1bf4de54e829111e0e4a70e7bd1cf523c9550ff.camel@siemens.com/ Fixes: ee534378f005 ("net: dsa: fix panic when DSA master device unbinds on shutdown") Reviewed-by: Alexander Sverdlin <alexander.sverdlin@siemens.com> Tested-by: Alexander Sverdlin <alexander.sverdlin@siemens.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/20240913203549.3081071-1-vladimir.oltean@nxp.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-09-30sunrpc: fix prog selection loop in svc_process_commonNeilBrown
If the rq_prog is not in the list of programs, then we use the last program in the list and we don't get the expected rpc_prog_unavail error as the subsequent tests on 'progp' being NULL are ineffective. We should only assign progp when we find the right program, and we should initialize it to NULL Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: 86ab08beb3f0 ("SUNRPC: replace program list with program array") Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-09-28Merge tag 'ceph-for-6.12-rc1' of https://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph updates from Ilya Dryomov: "Three CephFS fixes from Xiubo and Luis and a bunch of assorted cleanups" * tag 'ceph-for-6.12-rc1' of https://github.com/ceph/ceph-client: ceph: remove the incorrect Fw reference check when dirtying pages ceph: Remove empty definition in header file ceph: Fix typo in the comment ceph: fix a memory leak on cap_auths in MDS client ceph: flush all caps releases when syncing the whole filesystem ceph: rename ceph_flush_cap_releases() to ceph_flush_session_cap_releases() libceph: use min() to simplify code in ceph_dns_resolve_name() ceph: Convert to use jiffies macro ceph: Remove unused declarations
2024-09-27[tree-wide] finally take no_llseek outAl Viro
no_llseek had been defined to NULL two years ago, in commit 868941b14441 ("fs: remove no_llseek") To quote that commit, At -rc1 we'll need do a mechanical removal of no_llseek - git grep -l -w no_llseek | grep -v porting.rst | while read i; do sed -i '/\<no_llseek\>/d' $i done would do it. Unfortunately, that hadn't been done. Linus, could you do that now, so that we could finally put that thing to rest? All instances are of the form .llseek = no_llseek, so it's obviously safe. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-09-27Bluetooth: hci_event: Align BR/EDR JUST_WORKS paring with LELuiz Augusto von Dentz
This aligned BR/EDR JUST_WORKS method with LE which since 92516cd97fd4 ("Bluetooth: Always request for user confirmation for Just Works") always request user confirmation with confirm_hint set since the likes of bluetoothd have dedicated policy around JUST_WORKS method (e.g. main.conf:JustWorksRepairing). CVE: CVE-2024-8805 Cc: stable@vger.kernel.org Fixes: ba15a58b179e ("Bluetooth: Fix SSP acceptor just-works confirmation without MITM") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Tested-by: Kiran K <kiran.k@intel.com>
2024-09-27Bluetooth: L2CAP: Fix uaf in l2cap_connectLuiz Augusto von Dentz
[Syzbot reported] BUG: KASAN: slab-use-after-free in l2cap_connect.constprop.0+0x10d8/0x1270 net/bluetooth/l2cap_core.c:3949 Read of size 8 at addr ffff8880241e9800 by task kworker/u9:0/54 CPU: 0 UID: 0 PID: 54 Comm: kworker/u9:0 Not tainted 6.11.0-rc6-syzkaller-00268-g788220eee30d #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/06/2024 Workqueue: hci2 hci_rx_work Call Trace: <TASK> __dump_stack lib/dump_stack.c:93 [inline] dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:119 print_address_description mm/kasan/report.c:377 [inline] print_report+0xc3/0x620 mm/kasan/report.c:488 kasan_report+0xd9/0x110 mm/kasan/report.c:601 l2cap_connect.constprop.0+0x10d8/0x1270 net/bluetooth/l2cap_core.c:3949 l2cap_connect_req net/bluetooth/l2cap_core.c:4080 [inline] l2cap_bredr_sig_cmd net/bluetooth/l2cap_core.c:4772 [inline] l2cap_sig_channel net/bluetooth/l2cap_core.c:5543 [inline] l2cap_recv_frame+0xf0b/0x8eb0 net/bluetooth/l2cap_core.c:6825 l2cap_recv_acldata+0x9b4/0xb70 net/bluetooth/l2cap_core.c:7514 hci_acldata_packet net/bluetooth/hci_core.c:3791 [inline] hci_rx_work+0xaab/0x1610 net/bluetooth/hci_core.c:4028 process_one_work+0x9c5/0x1b40 kernel/workqueue.c:3231 process_scheduled_works kernel/workqueue.c:3312 [inline] worker_thread+0x6c8/0xed0 kernel/workqueue.c:3389 kthread+0x2c1/0x3a0 kernel/kthread.c:389 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 ... Freed by task 5245: kasan_save_stack+0x33/0x60 mm/kasan/common.c:47 kasan_save_track+0x14/0x30 mm/kasan/common.c:68 kasan_save_free_info+0x3b/0x60 mm/kasan/generic.c:579 poison_slab_object+0xf7/0x160 mm/kasan/common.c:240 __kasan_slab_free+0x32/0x50 mm/kasan/common.c:256 kasan_slab_free include/linux/kasan.h:184 [inline] slab_free_hook mm/slub.c:2256 [inline] slab_free mm/slub.c:4477 [inline] kfree+0x12a/0x3b0 mm/slub.c:4598 l2cap_conn_free net/bluetooth/l2cap_core.c:1810 [inline] kref_put include/linux/kref.h:65 [inline] l2cap_conn_put net/bluetooth/l2cap_core.c:1822 [inline] l2cap_conn_del+0x59d/0x730 net/bluetooth/l2cap_core.c:1802 l2cap_connect_cfm+0x9e6/0xf80 net/bluetooth/l2cap_core.c:7241 hci_connect_cfm include/net/bluetooth/hci_core.h:1960 [inline] hci_conn_failed+0x1c3/0x370 net/bluetooth/hci_conn.c:1265 hci_abort_conn_sync+0x75a/0xb50 net/bluetooth/hci_sync.c:5583 abort_conn_sync+0x197/0x360 net/bluetooth/hci_conn.c:2917 hci_cmd_sync_work+0x1a4/0x410 net/bluetooth/hci_sync.c:328 process_one_work+0x9c5/0x1b40 kernel/workqueue.c:3231 process_scheduled_works kernel/workqueue.c:3312 [inline] worker_thread+0x6c8/0xed0 kernel/workqueue.c:3389 kthread+0x2c1/0x3a0 kernel/kthread.c:389 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 Reported-by: syzbot+c12e2f941af1feb5632c@syzkaller.appspotmail.com Tested-by: syzbot+c12e2f941af1feb5632c@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=c12e2f941af1feb5632c Fixes: 7b064edae38d ("Bluetooth: Fix authentication if acl data comes before remote feature evt") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-09-27Bluetooth: MGMT: Fix possible crash on mgmt_index_removedLuiz Augusto von Dentz
If mgmt_index_removed is called while there are commands queued on cmd_sync it could lead to crashes like the bellow trace: 0x0000053D: __list_del_entry_valid_or_report+0x98/0xdc 0x0000053D: mgmt_pending_remove+0x18/0x58 [bluetooth] 0x0000053E: mgmt_remove_adv_monitor_complete+0x80/0x108 [bluetooth] 0x0000053E: hci_cmd_sync_work+0xbc/0x164 [bluetooth] So while handling mgmt_index_removed this attempts to dequeue commands passed as user_data to cmd_sync. Fixes: 7cf5c2978f23 ("Bluetooth: hci_sync: Refactor remove Adv Monitor") Reported-by: jiaymao <quic_jiaymao@quicinc.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-09-27netfilter: nf_tables: prevent nf_skb_duplicated corruptionEric Dumazet
syzbot found that nf_dup_ipv4() or nf_dup_ipv6() could write per-cpu variable nf_skb_duplicated in an unsafe way [1]. Disabling preemption as hinted by the splat is not enough, we have to disable soft interrupts as well. [1] BUG: using __this_cpu_write() in preemptible [00000000] code: syz.4.282/6316 caller is nf_dup_ipv4+0x651/0x8f0 net/ipv4/netfilter/nf_dup_ipv4.c:87 CPU: 0 UID: 0 PID: 6316 Comm: syz.4.282 Not tainted 6.11.0-rc7-syzkaller-00104-g7052622fccb1 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/06/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:93 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:119 check_preemption_disabled+0x10e/0x120 lib/smp_processor_id.c:49 nf_dup_ipv4+0x651/0x8f0 net/ipv4/netfilter/nf_dup_ipv4.c:87 nft_dup_ipv4_eval+0x1db/0x300 net/ipv4/netfilter/nft_dup_ipv4.c:30 expr_call_ops_eval net/netfilter/nf_tables_core.c:240 [inline] nft_do_chain+0x4ad/0x1da0 net/netfilter/nf_tables_core.c:288 nft_do_chain_ipv4+0x202/0x320 net/netfilter/nft_chain_filter.c:23 nf_hook_entry_hookfn include/linux/netfilter.h:154 [inline] nf_hook_slow+0xc3/0x220 net/netfilter/core.c:626 nf_hook+0x2c4/0x450 include/linux/netfilter.h:269 NF_HOOK_COND include/linux/netfilter.h:302 [inline] ip_output+0x185/0x230 net/ipv4/ip_output.c:433 ip_local_out net/ipv4/ip_output.c:129 [inline] ip_send_skb+0x74/0x100 net/ipv4/ip_output.c:1495 udp_send_skb+0xacf/0x1650 net/ipv4/udp.c:981 udp_sendmsg+0x1c21/0x2a60 net/ipv4/udp.c:1269 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x1a6/0x270 net/socket.c:745 ____sys_sendmsg+0x525/0x7d0 net/socket.c:2597 ___sys_sendmsg net/socket.c:2651 [inline] __sys_sendmmsg+0x3b2/0x740 net/socket.c:2737 __do_sys_sendmmsg net/socket.c:2766 [inline] __se_sys_sendmmsg net/socket.c:2763 [inline] __x64_sys_sendmmsg+0xa0/0xb0 net/socket.c:2763 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f4ce4f7def9 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f4ce5d4a038 EFLAGS: 00000246 ORIG_RAX: 0000000000000133 RAX: ffffffffffffffda RBX: 00007f4ce5135f80 RCX: 00007f4ce4f7def9 RDX: 0000000000000001 RSI: 0000000020005d40 RDI: 0000000000000006 RBP: 00007f4ce4ff0b76 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 00007f4ce5135f80 R15: 00007ffd4cbc6d68 </TASK> Fixes: d877f07112f1 ("netfilter: nf_tables: add nft_dup expression") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-26Merge tag 'net-6.12-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from netfilter. It looks like that most people are still traveling: both the ML volume and the processing capacity are low. Previous releases - regressions: - netfilter: - nf_reject_ipv6: fix nf_reject_ip6_tcphdr_put() - nf_tables: keep deleted flowtable hooks until after RCU - tcp: check skb is non-NULL in tcp_rto_delta_us() - phy: aquantia: fix -ETIMEDOUT PHY probe failure when firmware not present - eth: virtio_net: fix mismatched buf address when unmapping for small packets - eth: stmmac: fix zero-division error when disabling tc cbs - eth: bonding: fix unnecessary warnings and logs from bond_xdp_get_xmit_slave() Previous releases - always broken: - netfilter: - fix clash resolution for bidirectional flows - fix allocation with no memcg accounting - eth: r8169: add tally counter fields added with RTL8125 - eth: ravb: fix rx and tx frame size limit" * tag 'net-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (35 commits) selftests: netfilter: Avoid hanging ipvs.sh kselftest: add test for nfqueue induced conntrack race netfilter: nfnetlink_queue: remove old clash resolution logic netfilter: nf_tables: missing objects with no memcg accounting netfilter: nf_tables: use rcu chain hook list iterator from netlink dump path netfilter: ctnetlink: compile ctnetlink_label_size with CONFIG_NF_CONNTRACK_EVENTS netfilter: nf_reject: Fix build warning when CONFIG_BRIDGE_NETFILTER=n netfilter: nf_tables: Keep deleted flowtable hooks until after RCU docs: tproxy: ignore non-transparent sockets in iptables netfilter: ctnetlink: Guard possible unused functions selftests: netfilter: nft_tproxy.sh: add tcp tests selftests: netfilter: add reverse-clash resolution test case netfilter: conntrack: add clash resolution for reverse collisions netfilter: nf_nat: don't try nat source port reallocation for reverse dir clash selftests/net: packetdrill: increase timing tolerance in debug mode usbnet: fix cyclical race on disconnect with work queue net: stmmac: set PP_FLAG_DMA_SYNC_DEV only if XDP is enabled virtio_net: Fix mismatched buf address when unmapping for small packets bonding: Fix unnecessary warnings and logs from bond_xdp_get_xmit_slave() r8169: add missing MODULE_FIRMWARE entry for RTL8126A rev.b ...
2024-09-26Merge tag 'usb-6.12-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB/Thunderbolt updates from Greg KH: "Here is the large set of USB and Thunderbolt changes for 6.12-rc1. Nothing "major" in here, except for a new 9p network gadget that has been worked on for a long time (all of the needed acks are here) Other than that, it's the usual set of: - Thunderbolt / USB4 driver updates and additions for new hardware - dwc3 driver updates and new features added - xhci driver updates - typec driver updates - USB gadget updates and api additions to make some gadgets more configurable by userspace - dwc2 driver updates - usb phy driver updates - usbip feature additions - other minor USB driver updates All of these have been in linux-next for a long time with no reported issues" * tag 'usb-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (145 commits) sub: cdns3: Use predefined PCI vendor ID constant sub: cdns2: Use predefined PCI vendor ID constant USB: misc: yurex: fix race between read and write USB: misc: cypress_cy7c63: check for short transfer USB: appledisplay: close race between probe and completion handler USB: class: CDC-ACM: fix race between get_serial and set_serial usb: r8a66597-hcd: make read-only const arrays static usb: typec: ucsi: Fix busy loop on ASUS VivoBooks usb: dwc3: rtk: Clean up error code in __get_dwc3_maximum_speed() usb: storage: ene_ub6250: Fix right shift warnings usb: roles: Improve the fix for a false positive recursive locking complaint locking/mutex: Introduce mutex_init_with_key() locking/mutex: Define mutex_init() once net/9p/usbg: fix CONFIG_USB_GADGET dependency usb: xhci: fix loss of data on Cadence xHC usb: xHCI: add XHCI_RESET_ON_RESUME quirk for Phytium xHCI host usb: dwc3: imx8mp: disable SS_CON and U3 wakeup for system sleep usb: dwc3: imx8mp: add 2 software managed quirk properties for host mode usb: host: xhci-plat: Parse xhci-missing_cas_quirk and apply quirk usb: misc: onboard_usb_dev: add Microchip usb5744 SMBus programming support ...
2024-09-26Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio updates from Michael Tsirkin: "Several new features here: - virtio-balloon supports new stats - vdpa supports setting mac address - vdpa/mlx5 suspend/resume as well as MKEY ops are now faster - virtio_fs supports new sysfs entries for queue info - virtio/vsock performance has been improved And fixes, cleanups all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (34 commits) vsock/virtio: avoid queuing packets when intermediate queue is empty vsock/virtio: refactor virtio_transport_send_pkt_work fw_cfg: Constify struct kobj_type vdpa/mlx5: Postpone MR deletion vdpa/mlx5: Introduce init/destroy for MR resources vdpa/mlx5: Rename mr_mtx -> lock vdpa/mlx5: Extract mr members in own resource struct vdpa/mlx5: Rename function vdpa/mlx5: Delete direct MKEYs in parallel vdpa/mlx5: Create direct MKEYs in parallel MAINTAINERS: add virtio-vsock driver in the VIRTIO CORE section virtio_fs: add sysfs entries for queue information virtio_fs: introduce virtio_fs_put_locked helper vdpa: Remove unused declarations vdpa/mlx5: Parallelize VQ suspend/resume for CVQ MQ command vdpa/mlx5: Small improvement for change_num_qps() vdpa/mlx5: Keep notifiers during suspend but ignore vdpa/mlx5: Parallelize device resume vdpa/mlx5: Parallelize device suspend vdpa/mlx5: Use async API for vq modify commands ...
2024-09-26Merge tag 'nf-24-09-26' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net v2: with kdoc fixes per Paolo Abeni. The following patchset contains Netfilter fixes for net: Patch #1 and #2 handle an esoteric scenario: Given two tasks sending UDP packets to one another, two packets of the same flow in each direction handled by different CPUs that result in two conntrack objects in NEW state, where reply packet loses race. Then, patch #3 adds a testcase for this scenario. Series from Florian Westphal. 1) NAT engine can falsely detect a port collision if it happens to pick up a reply packet as NEW rather than ESTABLISHED. Add extra code to detect this and suppress port reallocation in this case. 2) To complete the clash resolution in the reply direction, extend conntrack logic to detect clashing conntrack in the reply direction to existing entry. 3) Adds a test case. Then, an assorted list of fixes follow: 4) Add a selftest for tproxy, from Antonio Ojea. 5) Guard ctnetlink_*_size() functions under #if defined(CONFIG_NETFILTER_NETLINK_GLUE_CT) || defined(CONFIG_NF_CONNTRACK_EVENTS) From Andy Shevchenko. 6) Use -m socket --transparent in iptables tproxy documentation. From XIE Zhibang. 7) Call kfree_rcu() when releasing flowtable hooks to address race with netlink dump path, from Phil Sutter. 8) Fix compilation warning in nf_reject with CONFIG_BRIDGE_NETFILTER=n. From Simon Horman. 9) Guard ctnetlink_label_size() under CONFIG_NF_CONNTRACK_EVENTS which is its only user, to address a compilation warning. From Simon Horman. 10) Use rcu-protected list iteration over basechain hooks from netlink dump path. 11) Fix memcg for nf_tables, use GFP_KERNEL_ACCOUNT is not complete. 12) Remove old nfqueue conntrack clash resolution. Instead trying to use same destination address consistently which requires double DNAT, use the existing clash resolution which allows clashing packets go through with different destination. Antonio Ojea originally reported an issue from the postrouting chain, I proposed a fix: https://lore.kernel.org/netfilter-devel/ZuwSwAqKgCB2a51-@calendula/T/ which he reported it did not work for him. 13) Adds a selftest for patch 12. 14) Fixes ipvs.sh selftest. netfilter pull request 24-09-26 * tag 'nf-24-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: selftests: netfilter: Avoid hanging ipvs.sh kselftest: add test for nfqueue induced conntrack race netfilter: nfnetlink_queue: remove old clash resolution logic netfilter: nf_tables: missing objects with no memcg accounting netfilter: nf_tables: use rcu chain hook list iterator from netlink dump path netfilter: ctnetlink: compile ctnetlink_label_size with CONFIG_NF_CONNTRACK_EVENTS netfilter: nf_reject: Fix build warning when CONFIG_BRIDGE_NETFILTER=n netfilter: nf_tables: Keep deleted flowtable hooks until after RCU docs: tproxy: ignore non-transparent sockets in iptables netfilter: ctnetlink: Guard possible unused functions selftests: netfilter: nft_tproxy.sh: add tcp tests selftests: netfilter: add reverse-clash resolution test case netfilter: conntrack: add clash resolution for reverse collisions netfilter: nf_nat: don't try nat source port reallocation for reverse dir clash ==================== Link: https://patch.msgid.link/20240926110717.102194-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-09-26netfilter: nfnetlink_queue: remove old clash resolution logicFlorian Westphal
For historical reasons there are two clash resolution spots in netfilter, one in nfnetlink_queue and one in conntrack core. nfnetlink_queue one was added first: If a colliding entry is found, NAT NAT transformation is reversed by calling nat engine again with altered tuple. See commit 368982cd7d1b ("netfilter: nfnetlink_queue: resolve clash for unconfirmed conntracks") for details. One problem is that nf_reroute() won't take an action if the queueing doesn't occur in the OUTPUT hook, i.e. when queueing in forward or postrouting, packet will be sent via the wrong path. Another problem is that the scenario addressed (2nd UDP packet sent with identical addresses while first packet is still being processed) can also occur without any nfqueue involvement due to threaded resolvers doing A and AAAA requests back-to-back. This lead us to add clash resolution logic to the conntrack core, see commit 6a757c07e51f ("netfilter: conntrack: allow insertion of clashing entries"). Instead of fixing the nfqueue based logic, lets remove it and let conntrack core handle this instead. Retain the ->update hook for sake of nfqueue based conntrack helpers. We could axe this hook completely but we'd have to split confirm and helper logic again, see commit ee04805ff54a ("netfilter: conntrack: make conntrack userspace helpers work again"). This SHOULD NOT be backported to kernels earlier than v5.6; they lack adequate clash resolution handling. Patch was originally written by Pablo Neira Ayuso. Reported-by: Antonio Ojea <aojea@google.com> Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1766 Signed-off-by: Florian Westphal <fw@strlen.de> Tested-by: Antonio Ojea <aojea@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-26netfilter: nf_tables: missing objects with no memcg accountingPablo Neira Ayuso
Several ruleset objects are still not using GFP_KERNEL_ACCOUNT for memory accounting, update them. This includes: - catchall elements - compat match large info area - log prefix - meta secctx - numgen counters - pipapo set backend datastructure - tunnel private objects Fixes: 33758c891479 ("memcg: enable accounting for nft objects") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-26netfilter: nf_tables: use rcu chain hook list iterator from netlink dump pathPablo Neira Ayuso
Lockless iteration over hook list is possible from netlink dump path, use rcu variant to iterate over the hook list as is done with flowtable hooks. Fixes: b9703ed44ffb ("netfilter: nf_tables: support for adding new devices to an existing netdev chain") Reported-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-26netfilter: ctnetlink: compile ctnetlink_label_size with ↵Simon Horman
CONFIG_NF_CONNTRACK_EVENTS Only provide ctnetlink_label_size when it is used, which is when CONFIG_NF_CONNTRACK_EVENTS is configured. Flagged by clang-18 W=1 builds as: .../nf_conntrack_netlink.c:385:19: warning: unused function 'ctnetlink_label_size' [-Wunused-function] 385 | static inline int ctnetlink_label_size(const struct nf_conn *ct) | ^~~~~~~~~~~~~~~~~~~~ The condition on CONFIG_NF_CONNTRACK_LABELS being removed by this patch guards compilation of non-trivial implementations of ctnetlink_dump_labels() and ctnetlink_label_size(). However, this is not necessary as each of these functions will always return 0 if CONFIG_NF_CONNTRACK_LABELS is not defined as each function starts with the equivalent of: struct nf_conn_labels *labels = nf_ct_labels_find(ct); if (!labels) return 0; And nf_ct_labels_find always returns NULL if CONFIG_NF_CONNTRACK_LABELS is not enabled. So I believe that the compiler optimises the code away in such cases anyway. Found by inspection. Compile tested only. Originally splitted in two patches, Pablo Neira Ayuso collapsed them and added Fixes: tag. Fixes: 0ceabd83875b ("netfilter: ctnetlink: deliver labels to userspace") Link: https://lore.kernel.org/netfilter-devel/20240909151712.GZ2097826@kernel.org/ Signed-off-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-26netfilter: nf_reject: Fix build warning when CONFIG_BRIDGE_NETFILTER=nSimon Horman
If CONFIG_BRIDGE_NETFILTER is not enabled, which is the case for x86_64 defconfig, then building nf_reject_ipv4.c and nf_reject_ipv6.c with W=1 using gcc-14 results in the following warnings, which are treated as errors: net/ipv4/netfilter/nf_reject_ipv4.c: In function 'nf_send_reset': net/ipv4/netfilter/nf_reject_ipv4.c:243:23: error: variable 'niph' set but not used [-Werror=unused-but-set-variable] 243 | struct iphdr *niph; | ^~~~ cc1: all warnings being treated as errors net/ipv6/netfilter/nf_reject_ipv6.c: In function 'nf_send_reset6': net/ipv6/netfilter/nf_reject_ipv6.c:286:25: error: variable 'ip6h' set but not used [-Werror=unused-but-set-variable] 286 | struct ipv6hdr *ip6h; | ^~~~ cc1: all warnings being treated as errors Address this by reducing the scope of these local variables to where they are used, which is code only compiled when CONFIG_BRIDGE_NETFILTER enabled. Compile tested and run through netfilter selftests. Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Closes: https://lore.kernel.org/netfilter-devel/20240906145513.567781-1-andriy.shevchenko@linux.intel.com/ Signed-off-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-26netfilter: nf_tables: Keep deleted flowtable hooks until after RCUPhil Sutter
Documentation of list_del_rcu() warns callers to not immediately free the deleted list item. While it seems not necessary to use the RCU-variant of list_del() here in the first place, doing so seems to require calling kfree_rcu() on the deleted item as well. Fixes: 3f0465a9ef02 ("netfilter: nf_tables: dynamically allocate hooks per net_device in flowtables") Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-26netfilter: ctnetlink: Guard possible unused functionsAndy Shevchenko
Some of the functions may be unused (CONFIG_NETFILTER_NETLINK_GLUE_CT=n and CONFIG_NF_CONNTRACK_EVENTS=n), it prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y: net/netfilter/nf_conntrack_netlink.c:657:22: error: unused function 'ctnetlink_acct_size' [-Werror,-Wunused-function] 657 | static inline size_t ctnetlink_acct_size(const struct nf_conn *ct) | ^~~~~~~~~~~~~~~~~~~ net/netfilter/nf_conntrack_netlink.c:667:19: error: unused function 'ctnetlink_secctx_size' [-Werror,-Wunused-function] 667 | static inline int ctnetlink_secctx_size(const struct nf_conn *ct) | ^~~~~~~~~~~~~~~~~~~~~ net/netfilter/nf_conntrack_netlink.c:683:22: error: unused function 'ctnetlink_timestamp_size' [-Werror,-Wunused-function] 683 | static inline size_t ctnetlink_timestamp_size(const struct nf_conn *ct) | ^~~~~~~~~~~~~~~~~~~~~~~~ Fix this by guarding possible unused functions with ifdeffery. See also commit 6863f5643dd7 ("kbuild: allow Clang to find unused static inline functions for W=1 build"). Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-26netfilter: conntrack: add clash resolution for reverse collisionsFlorian Westphal
Given existing entry: ORIGIN: a:b -> c:d REPLY: c:d -> a:b And colliding entry: ORIGIN: c:d -> a:b REPLY: a:b -> c:d The colliding ct (and the associated skb) get dropped on insert. Permit this by checking if the colliding entry matches the reply direction. Happens when both ends send packets at same time, both requests are picked up as NEW, rather than NEW for the 'first' and 'ESTABLISHED' for the second packet. This is an esoteric condition, as ruleset must permit NEW connections in either direction and both peers must already have a bidirectional traffic flow at the time conntrack gets enabled. Allow the 'reverse' skb to pass and assign the existing (clashing) entry. While at it, also drop the extra 'dying' check, this is already tested earlier by the calling function. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-26netfilter: nf_nat: don't try nat source port reallocation for reverse dir clashFlorian Westphal
A conntrack entry can be inserted to the connection tracking table if there is no existing entry with an identical tuple in either direction. Example: INITIATOR -> NAT/PAT -> RESPONDER Initiator passes through NAT/PAT ("us") and SNAT is done (saddr rewrite). Then, later, NAT/PAT machine itself also wants to connect to RESPONDER. This will not work if the SNAT done earlier has same IP:PORT source pair. Conntrack table has: ORIGINAL: $IP_INITATOR:$SPORT -> $IP_RESPONDER:$DPORT REPLY: $IP_RESPONDER:$DPORT -> $IP_NAT:$SPORT and new locally originating connection wants: ORIGINAL: $IP_NAT:$SPORT -> $IP_RESPONDER:$DPORT REPLY: $IP_RESPONDER:$DPORT -> $IP_NAT:$SPORT This is handled by the NAT engine which will do a source port reallocation for the locally originating connection that is colliding with an existing tuple by attempting a source port rewrite. This is done even if this new connection attempt did not go through a masquerade/snat rule. There is a rare race condition with connection-less protocols like UDP, where we do the port reallocation even though its not needed. This happens when new packets from the same, pre-existing flow are received in both directions at the exact same time on different CPUs after the conntrack table was flushed (or conntrack becomes active for first time). With strict ordering/single cpu, the first packet creates new ct entry and second packet is resolved as established reply packet. With parallel processing, both packets are picked up as new and both get their own ct entry. In this case, the 'reply' packet (picked up as ORIGINAL) can be mangled by NAT engine because a port collision is detected. This change isn't enough to prevent a packet drop later during nf_conntrack_confirm(), the existing clash resolution strategy will not detect such reverse clash case. This is resolved by a followup patch. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>