Age | Commit message (Collapse) | Author |
|
The kTLS tx handling code is using a mix of get_page() and
page_ref_inc() APIs to increment the page reference. But on the release
path (mlx5e_ktls_tx_handle_resync_dump_comp()), only put_page() is used.
This is an issue when using pages from large folios: the get_page()
references are stored on the folio page while the page_ref_inc()
references are stored directly in the given page. On release the folio
page will be dereferenced too many times.
This was found while doing kTLS testing with sendfile() + ZC when the
served file was read from NFS on a kernel with NFS large folios support
(commit 49b29a573da8 ("nfs: add support for large folios")).
Fixes: 84d1bb2b139e ("net/mlx5e: kTLS, Limit DUMP wqe size")
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241107183527.676877-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The referenced commits introduced a two-step process for deleting FTEs:
- Lock the FTE, delete it from hardware, set the hardware deletion function
to NULL and unlock the FTE.
- Lock the parent flow group, delete the software copy of the FTE, and
remove it from the xarray.
However, this approach encounters a race condition if a rule with the same
match value is added simultaneously. In this scenario, fs_core may set the
hardware deletion function to NULL prematurely, causing a panic during
subsequent rule deletions.
To prevent this, ensure the active flag of the FTE is checked under a lock,
which will prevent the fs_core layer from attaching a new steering rule to
an FTE that is in the process of deletion.
[ 438.967589] MOSHE: 2496 mlx5_del_flow_rules del_hw_func
[ 438.968205] ------------[ cut here ]------------
[ 438.968654] refcount_t: decrement hit 0; leaking memory.
[ 438.969249] WARNING: CPU: 0 PID: 8957 at lib/refcount.c:31 refcount_warn_saturate+0xfb/0x110
[ 438.970054] Modules linked in: act_mirred cls_flower act_gact sch_ingress openvswitch nsh mlx5_vdpa vringh vhost_iotlb vdpa mlx5_ib mlx5_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm ib_uverbs ib_core zram zsmalloc fuse [last unloaded: cls_flower]
[ 438.973288] CPU: 0 UID: 0 PID: 8957 Comm: tc Not tainted 6.12.0-rc1+ #8
[ 438.973888] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 438.974874] RIP: 0010:refcount_warn_saturate+0xfb/0x110
[ 438.975363] Code: 40 66 3b 82 c6 05 16 e9 4d 01 01 e8 1f 7c a0 ff 0f 0b c3 cc cc cc cc 48 c7 c7 10 66 3b 82 c6 05 fd e8 4d 01 01 e8 05 7c a0 ff <0f> 0b c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 90
[ 438.976947] RSP: 0018:ffff888124a53610 EFLAGS: 00010286
[ 438.977446] RAX: 0000000000000000 RBX: ffff888119d56de0 RCX: 0000000000000000
[ 438.978090] RDX: ffff88852c828700 RSI: ffff88852c81b3c0 RDI: ffff88852c81b3c0
[ 438.978721] RBP: ffff888120fa0e88 R08: 0000000000000000 R09: ffff888124a534b0
[ 438.979353] R10: 0000000000000001 R11: 0000000000000001 R12: ffff888119d56de0
[ 438.979979] R13: ffff888120fa0ec0 R14: ffff888120fa0ee8 R15: ffff888119d56de0
[ 438.980607] FS: 00007fe6dcc0f800(0000) GS:ffff88852c800000(0000) knlGS:0000000000000000
[ 438.983984] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 438.984544] CR2: 00000000004275e0 CR3: 0000000186982001 CR4: 0000000000372eb0
[ 438.985205] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 438.985842] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 438.986507] Call Trace:
[ 438.986799] <TASK>
[ 438.987070] ? __warn+0x7d/0x110
[ 438.987426] ? refcount_warn_saturate+0xfb/0x110
[ 438.987877] ? report_bug+0x17d/0x190
[ 438.988261] ? prb_read_valid+0x17/0x20
[ 438.988659] ? handle_bug+0x53/0x90
[ 438.989054] ? exc_invalid_op+0x14/0x70
[ 438.989458] ? asm_exc_invalid_op+0x16/0x20
[ 438.989883] ? refcount_warn_saturate+0xfb/0x110
[ 438.990348] mlx5_del_flow_rules+0x2f7/0x340 [mlx5_core]
[ 438.990932] __mlx5_eswitch_del_rule+0x49/0x170 [mlx5_core]
[ 438.991519] ? mlx5_lag_is_sriov+0x3c/0x50 [mlx5_core]
[ 438.992054] ? xas_load+0x9/0xb0
[ 438.992407] mlx5e_tc_rule_unoffload+0x45/0xe0 [mlx5_core]
[ 438.993037] mlx5e_tc_del_fdb_flow+0x2a6/0x2e0 [mlx5_core]
[ 438.993623] mlx5e_flow_put+0x29/0x60 [mlx5_core]
[ 438.994161] mlx5e_delete_flower+0x261/0x390 [mlx5_core]
[ 438.994728] tc_setup_cb_destroy+0xb9/0x190
[ 438.995150] fl_hw_destroy_filter+0x94/0xc0 [cls_flower]
[ 438.995650] fl_change+0x11a4/0x13c0 [cls_flower]
[ 438.996105] tc_new_tfilter+0x347/0xbc0
[ 438.996503] ? ___slab_alloc+0x70/0x8c0
[ 438.996929] rtnetlink_rcv_msg+0xf9/0x3e0
[ 438.997339] ? __netlink_sendskb+0x4c/0x70
[ 438.997751] ? netlink_unicast+0x286/0x2d0
[ 438.998171] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 438.998625] netlink_rcv_skb+0x54/0x100
[ 438.999020] netlink_unicast+0x203/0x2d0
[ 438.999421] netlink_sendmsg+0x1e4/0x420
[ 438.999820] __sock_sendmsg+0xa1/0xb0
[ 439.000203] ____sys_sendmsg+0x207/0x2a0
[ 439.000600] ? copy_msghdr_from_user+0x6d/0xa0
[ 439.001072] ___sys_sendmsg+0x80/0xc0
[ 439.001459] ? ___sys_recvmsg+0x8b/0xc0
[ 439.001848] ? generic_update_time+0x4d/0x60
[ 439.002282] __sys_sendmsg+0x51/0x90
[ 439.002658] do_syscall_64+0x50/0x110
[ 439.003040] entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fixes: 718ce4d601db ("net/mlx5: Consolidate update FTE for all removal changes")
Fixes: cefc23554fc2 ("net/mlx5: Fix FTE cleanup")
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241107183527.676877-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The number of PCI vectors allocated by the platform (which may be fewer
than requested) is currently not honored when creating the SF pool;
only the PCI MSI-X capability is considered.
As a result, when a platform allocates fewer vectors
(in non-dynamic mode) than requested, the PF and SF pools end up
with an invalid vector range.
This causes incorrect SF vector accounting, which leads to the
following call trace when an invalid IRQ vector is allocated.
This issue is resolved by ensuring that the platform's vector
limit is respected for both the SF and PF pools.
Workqueue: mlx5_vhca_event0 mlx5_sf_dev_add_active_work [mlx5_core]
RIP: 0010:pci_irq_vector+0x23/0x80
RSP: 0018:ffffabd5cebd7248 EFLAGS: 00010246
RAX: ffff980880e7f308 RBX: ffff9808932fb880 RCX: 0000000000000001
RDX: 00000000000001ff RSI: 0000000000000200 RDI: ffff980880e7f308
RBP: 0000000000000200 R08: 0000000000000010 R09: ffff97a9116f0860
R10: 0000000000000002 R11: 0000000000000228 R12: ffff980897cd0160
R13: 0000000000000000 R14: ffff97a920fec0c0 R15: ffffabd5cebd72d0
FS: 0000000000000000(0000) GS:ffff97c7ff9c0000(0000) knlGS:0000000000000000
? rescuer_thread+0x350/0x350
kthread+0x11b/0x140
? __kthread_bind_mask+0x60/0x60
ret_from_fork+0x22/0x30
mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22
mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22
mlx5_core.sf mlx5_core.sf.6: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 enhanced)
mlx5_core.sf mlx5_core.sf.7: firmware version: 32.43.356
mlx5_core.sf mlx5_core.sf.6 enpa1s0f0s4: renamed from eth0
mlx5_core.sf mlx5_core.sf.7: Rate limit: 127 rates are supported, range: 0Mbps to 195312Mbps
mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22
mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22
mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22
Fixes: 3354822cde5a ("net/mlx5: Use dynamic msix vectors allocation")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Amir Tzin <amirtz@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241107183527.676877-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
IB representors depend on ETH representors, so the IB representors
should not exist without the ETH ones. When unloading the ETH
representors, the corresponding IB representors should be also
unloaded.
The commit 8d159eb2117b ("RDMA/mlx5: Use IB set_netdev and get_netdev functions")
introduced the use of the ib_device_set_netdev API in IB
repsresentors. ib_device_set_netdev() increments the refcount of
the representor's netdev when loading an IB representor and
decrements it when unloading.
Without the unloading of the IB representor, the refcount of the
representor's netdev remains greater than 0, preventing it from
being unregistered.
The patch uncovered an underlying bug where the eth representor is
unloaded, without unloading the IB representor.
This issue happened when using multiport E-switch and rebooting,
causing the shutdown to hang when unloading the ETH representor
because the refcount of the representor's netdevice was greater than 0.
Call trace:
unregister_netdevice: waiting for eth3 to become free. Usage count = 2
ref_tracker: eth%d@00000000661d60f7 has 1/1 users at
ib_device_set_netdev+0x160/0x2d0 [ib_core]
mlx5_ib_vport_rep_load+0x104/0x3f0 [mlx5_ib]
mlx5_eswitch_reload_ib_reps+0xfc/0x110 [mlx5_core]
mlx5_mpesw_work+0x236/0x330 [mlx5_core]
process_one_work+0x169/0x320
worker_thread+0x288/0x3a0
kthread+0xb8/0xe0
ret_from_fork+0x2d/0x50
ret_from_fork_asm+0x11/0x20
Fixes: 8d159eb2117b ("RDMA/mlx5: Use IB set_netdev and get_netdev functions")
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241107183527.676877-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Paolo Abeni says:
====================
mptcp: fix a couple of races
The first patch addresses a division by zero issue reported by Eric,
the second one solves a similar issue found by code inspection while
investigating the former.
====================
Link: https://patch.msgid.link/cover.1731060874.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Additional active subflows - i.e. created by the in kernel path
manager - are included into the subflow list before starting the
3whs.
A racing recvmsg() spooling data received on an already established
subflow would unconditionally call tcp_cleanup_rbuf() on all the
current subflows, potentially hitting a divide by zero error on
the newly created ones.
Explicitly check that the subflow is in a suitable state before
invoking tcp_cleanup_rbuf().
Fixes: c76c6956566f ("mptcp: call tcp_cleanup_rbuf on subflows")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/02374660836e1b52afc91966b7535c8c5f7bafb0.1731060874.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Eric reported a division by zero splat in the MPTCP protocol:
Oops: divide error: 0000 [#1] PREEMPT SMP KASAN PTI
CPU: 1 UID: 0 PID: 6094 Comm: syz-executor317 Not tainted
6.12.0-rc5-syzkaller-00291-g05b92660cdfe #0
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 09/13/2024
RIP: 0010:__tcp_select_window+0x5b4/0x1310 net/ipv4/tcp_output.c:3163
Code: f6 44 01 e3 89 df e8 9b 75 09 f8 44 39 f3 0f 8d 11 ff ff ff e8
0d 74 09 f8 45 89 f4 e9 04 ff ff ff e8 00 74 09 f8 44 89 f0 99 <f7> 7c
24 14 41 29 d6 45 89 f4 e9 ec fe ff ff e8 e8 73 09 f8 48 89
RSP: 0018:ffffc900041f7930 EFLAGS: 00010293
RAX: 0000000000017e67 RBX: 0000000000017e67 RCX: ffffffff8983314b
RDX: 0000000000000000 RSI: ffffffff898331b0 RDI: 0000000000000004
RBP: 00000000005d6000 R08: 0000000000000004 R09: 0000000000017e67
R10: 0000000000003e80 R11: 0000000000000000 R12: 0000000000003e80
R13: ffff888031d9b440 R14: 0000000000017e67 R15: 00000000002eb000
FS: 00007feb5d7f16c0(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007feb5d8adbb8 CR3: 0000000074e4c000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
__tcp_cleanup_rbuf+0x3e7/0x4b0 net/ipv4/tcp.c:1493
mptcp_rcv_space_adjust net/mptcp/protocol.c:2085 [inline]
mptcp_recvmsg+0x2156/0x2600 net/mptcp/protocol.c:2289
inet_recvmsg+0x469/0x6a0 net/ipv4/af_inet.c:885
sock_recvmsg_nosec net/socket.c:1051 [inline]
sock_recvmsg+0x1b2/0x250 net/socket.c:1073
__sys_recvfrom+0x1a5/0x2e0 net/socket.c:2265
__do_sys_recvfrom net/socket.c:2283 [inline]
__se_sys_recvfrom net/socket.c:2279 [inline]
__x64_sys_recvfrom+0xe0/0x1c0 net/socket.c:2279
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7feb5d857559
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 51 18 00 00 90 48 89 f8 48
89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007feb5d7f1208 EFLAGS: 00000246 ORIG_RAX: 000000000000002d
RAX: ffffffffffffffda RBX: 00007feb5d8e1318 RCX: 00007feb5d857559
RDX: 000000800000000e RSI: 0000000000000000 RDI: 0000000000000003
RBP: 00007feb5d8e1310 R08: 0000000000000000 R09: ffffffff81000000
R10: 0000000000000100 R11: 0000000000000246 R12: 00007feb5d8e131c
R13: 00007feb5d8ae074 R14: 000000800000000e R15: 00000000fffffdef
and provided a nice reproducer.
The root cause is the current bad handling of racing disconnect.
After the blamed commit below, sk_wait_data() can return (with
error) with the underlying socket disconnected and a zero rcv_mss.
Catch the error and return without performing any additional
operations on the current socket.
Reported-by: Eric Dumazet <edumazet@google.com>
Fixes: 419ce133ab92 ("tcp: allow again tcp_disconnect() when threads are waiting")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/8c82ecf71662ecbc47bf390f9905de70884c9f2d.1731060874.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Document new behavior when the number of frags passed is too big.
Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20241107210331.3044434-2-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Exit early if we're freeing more than 1024 frags, to prevent
looping too long.
Also minor code cleanups:
- Flip checks to reduce indentation.
- Use sizeof(*tokens) everywhere for consistentcy.
Cc: Yi Lai <yi1.lai@linux.intel.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241107210331.3044434-1-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Syzkaller reported this warning:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 16 at net/ipv4/af_inet.c:156 inet_sock_destruct+0x1c5/0x1e0
Modules linked in:
CPU: 0 UID: 0 PID: 16 Comm: ksoftirqd/0 Not tainted 6.12.0-rc5 #26
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:inet_sock_destruct+0x1c5/0x1e0
Code: 24 12 4c 89 e2 5b 48 c7 c7 98 ec bb 82 41 5c e9 d1 18 17 ff 4c 89 e6 5b 48 c7 c7 d0 ec bb 82 41 5c e9 bf 18 17 ff 0f 0b eb 83 <0f> 0b eb 97 0f 0b eb 87 0f 0b e9 68 ff ff ff 66 66 2e 0f 1f 84 00
RSP: 0018:ffffc9000008bd90 EFLAGS: 00010206
RAX: 0000000000000300 RBX: ffff88810b172a90 RCX: 0000000000000007
RDX: 0000000000000002 RSI: 0000000000000300 RDI: ffff88810b172a00
RBP: ffff88810b172a00 R08: ffff888104273c00 R09: 0000000000100007
R10: 0000000000020000 R11: 0000000000000006 R12: ffff88810b172a00
R13: 0000000000000004 R14: 0000000000000000 R15: ffff888237c31f78
FS: 0000000000000000(0000) GS:ffff888237c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffc63fecac8 CR3: 000000000342e000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
? __warn+0x88/0x130
? inet_sock_destruct+0x1c5/0x1e0
? report_bug+0x18e/0x1a0
? handle_bug+0x53/0x90
? exc_invalid_op+0x18/0x70
? asm_exc_invalid_op+0x1a/0x20
? inet_sock_destruct+0x1c5/0x1e0
__sk_destruct+0x2a/0x200
rcu_do_batch+0x1aa/0x530
? rcu_do_batch+0x13b/0x530
rcu_core+0x159/0x2f0
handle_softirqs+0xd3/0x2b0
? __pfx_smpboot_thread_fn+0x10/0x10
run_ksoftirqd+0x25/0x30
smpboot_thread_fn+0xdd/0x1d0
kthread+0xd3/0x100
? __pfx_kthread+0x10/0x10
ret_from_fork+0x34/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
---[ end trace 0000000000000000 ]---
Its possible that two threads call tcp_v6_do_rcv()/sk_forward_alloc_add()
concurrently when sk->sk_state == TCP_LISTEN with sk->sk_lock unlocked,
which triggers a data-race around sk->sk_forward_alloc:
tcp_v6_rcv
tcp_v6_do_rcv
skb_clone_and_charge_r
sk_rmem_schedule
__sk_mem_schedule
sk_forward_alloc_add()
skb_set_owner_r
sk_mem_charge
sk_forward_alloc_add()
__kfree_skb
skb_release_all
skb_release_head_state
sock_rfree
sk_mem_uncharge
sk_forward_alloc_add()
sk_mem_reclaim
// set local var reclaimable
__sk_mem_reclaim
sk_forward_alloc_add()
In this syzkaller testcase, two threads call
tcp_v6_do_rcv() with skb->truesize=768, the sk_forward_alloc changes like
this:
(cpu 1) | (cpu 2) | sk_forward_alloc
... | ... | 0
__sk_mem_schedule() | | +4096 = 4096
| __sk_mem_schedule() | +4096 = 8192
sk_mem_charge() | | -768 = 7424
| sk_mem_charge() | -768 = 6656
... | ... |
sk_mem_uncharge() | | +768 = 7424
reclaimable=7424 | |
| sk_mem_uncharge() | +768 = 8192
| reclaimable=8192 |
__sk_mem_reclaim() | | -4096 = 4096
| __sk_mem_reclaim() | -8192 = -4096 != 0
The skb_clone_and_charge_r() should not be called in tcp_v6_do_rcv() when
sk->sk_state is TCP_LISTEN, it happens later in tcp_v6_syn_recv_sock().
Fix the same issue in dccp_v6_do_rcv().
Suggested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Link: https://patch.msgid.link/20241107023405.889239-1-wangliang74@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit 55d42a0c3f9c ("selftests: net: add a test for closing
a netlink socket ith dump in progress") added a new test
but did not add it to gitignore.
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20241108004731.2979878-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The tx_bytes should consider the actual size of the Ethernet frames
without the SPI encapsulation. But we still need to take care of
Ethernet padding.
Fixes: 2f207cbf0dd4 ("net: vertexcom: Add MSE102x SPI support")
Signed-off-by: Stefan Wahren <wahrenst@gmx.net>
Link: https://patch.msgid.link/20241108114343.6174-3-wahrenst@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
A lockdep report [1] with CONFIG_PROVE_RCU_LIST=y hints
that sctp_v6_available() is calling dev_get_by_index_rcu()
and ipv6_chk_addr() without holding rcu.
[1]
=============================
WARNING: suspicious RCU usage
6.12.0-rc5-virtme #1216 Tainted: G W
-----------------------------
net/core/dev.c:876 RCU-list traversed in non-reader section!!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by sctp_hello/31495:
#0: ffff9f1ebbdb7418 (sk_lock-AF_INET6){+.+.}-{0:0}, at: sctp_bind (./arch/x86/include/asm/jump_label.h:27 net/sctp/socket.c:315) sctp
stack backtrace:
CPU: 7 UID: 0 PID: 31495 Comm: sctp_hello Tainted: G W 6.12.0-rc5-virtme #1216
Tainted: [W]=WARN
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl (lib/dump_stack.c:123)
lockdep_rcu_suspicious (kernel/locking/lockdep.c:6822)
dev_get_by_index_rcu (net/core/dev.c:876 (discriminator 7))
sctp_v6_available (net/sctp/ipv6.c:701) sctp
sctp_do_bind (net/sctp/socket.c:400 (discriminator 1)) sctp
sctp_bind (net/sctp/socket.c:320) sctp
inet6_bind_sk (net/ipv6/af_inet6.c:465)
? security_socket_bind (security/security.c:4581 (discriminator 1))
__sys_bind (net/socket.c:1848 net/socket.c:1869)
? do_user_addr_fault (./include/linux/rcupdate.h:347 ./include/linux/rcupdate.h:880 ./include/linux/mm.h:729 arch/x86/mm/fault.c:1340)
? do_user_addr_fault (./arch/x86/include/asm/preempt.h:84 (discriminator 13) ./include/linux/rcupdate.h:98 (discriminator 13) ./include/linux/rcupdate.h:882 (discriminator 13) ./include/linux/mm.h:729 (discriminator 13) arch/x86/mm/fault.c:1340 (discriminator 13))
__x64_sys_bind (net/socket.c:1877 (discriminator 1) net/socket.c:1875 (discriminator 1) net/socket.c:1875 (discriminator 1))
do_syscall_64 (arch/x86/entry/common.c:52 (discriminator 1) arch/x86/entry/common.c:83 (discriminator 1))
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
RIP: 0033:0x7f59b934a1e7
Code: 44 00 00 48 8b 15 39 8c 0c 00 f7 d8 64 89 02 b8 ff ff ff ff eb bd 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 b8 31 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 09 8c 0c 00 f7 d8 64 89 01 48
All code
========
0: 44 00 00 add %r8b,(%rax)
3: 48 8b 15 39 8c 0c 00 mov 0xc8c39(%rip),%rdx # 0xc8c43
a: f7 d8 neg %eax
c: 64 89 02 mov %eax,%fs:(%rdx)
f: b8 ff ff ff ff mov $0xffffffff,%eax
14: eb bd jmp 0xffffffffffffffd3
16: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
1d: 00 00 00
20: 0f 1f 00 nopl (%rax)
23: b8 31 00 00 00 mov $0x31,%eax
28: 0f 05 syscall
2a:* 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax <-- trapping instruction
30: 73 01 jae 0x33
32: c3 ret
33: 48 8b 0d 09 8c 0c 00 mov 0xc8c09(%rip),%rcx # 0xc8c43
3a: f7 d8 neg %eax
3c: 64 89 01 mov %eax,%fs:(%rcx)
3f: 48 rex.W
Code starting with the faulting instruction
===========================================
0: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax
6: 73 01 jae 0x9
8: c3 ret
9: 48 8b 0d 09 8c 0c 00 mov 0xc8c09(%rip),%rcx # 0xc8c19
10: f7 d8 neg %eax
12: 64 89 01 mov %eax,%fs:(%rcx)
15: 48 rex.W
RSP: 002b:00007ffe2d0ad398 EFLAGS: 00000202 ORIG_RAX: 0000000000000031
RAX: ffffffffffffffda RBX: 00007ffe2d0ad3d0 RCX: 00007f59b934a1e7
RDX: 000000000000001c RSI: 00007ffe2d0ad3d0 RDI: 0000000000000005
RBP: 0000000000000005 R08: 1999999999999999 R09: 0000000000000000
R10: 00007f59b9253298 R11: 0000000000000202 R12: 00007ffe2d0ada61
R13: 0000000000000000 R14: 0000562926516dd8 R15: 00007f59b9479000
</TASK>
Fixes: 6fe1e52490a9 ("sctp: check ipv6 addr with sk_bound_dev if set")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20241107192021.2579789-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Close a socket with dump in progress. We need a dump which generates
enough info not to fit into a single skb. Policy dump fits the bill.
Use the trick discovered by syzbot for keeping a ref on the socket
longer than just close, with mqueue.
TAP version 13
1..3
# Starting 3 tests from 1 test cases.
# RUN global.test_sanity ...
# OK global.test_sanity
ok 1 global.test_sanity
# RUN global.close_in_progress ...
# OK global.close_in_progress
ok 2 global.close_in_progress
# RUN global.close_with_ref ...
# OK global.close_with_ref
ok 3 global.close_with_ref
# PASSED: 3 / 3 tests passed.
# Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0
Note that this test is not expected to fail but rather crash
the kernel if we get the cleanup wrong.
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241106015235.2458807-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Netlink supports iterative dumping of data. It provides the families
the following ops:
- start - (optional) kicks off the dumping process
- dump - actual dump helper, keeps getting called until it returns 0
- done - (optional) pairs with .start, can be used for cleanup
The whole process is asynchronous and the repeated calls to .dump
don't actually happen in a tight loop, but rather are triggered
in response to recvmsg() on the socket.
This gives the user full control over the dump, but also means that
the user can close the socket without getting to the end of the dump.
To make sure .start is always paired with .done we check if there
is an ongoing dump before freeing the socket, and if so call .done.
The complication is that sockets can get freed from BH and .done
is allowed to sleep. So we use a workqueue to defer the call, when
needed.
Unfortunately this does not work correctly. What we defer is not
the cleanup but rather releasing a reference on the socket.
We have no guarantee that we own the last reference, if someone
else holds the socket they may release it in BH and we're back
to square one.
The whole dance, however, appears to be unnecessary. Only the user
can interact with dumps, so we can clean up when socket is closed.
And close always happens in process context. Some async code may
still access the socket after close, queue notification skbs to it etc.
but no dumps can start, end or otherwise make progress.
Delete the workqueue and flush the dump state directly from the release
handler. Note that further cleanup is possible in -next, for instance
we now always call .done before releasing the main module reference,
so dump doesn't have to take a reference of its own.
Reported-by: syzkaller <syzkaller@googlegroups.com>
Fixes: ed5d7788a934 ("netlink: Do not schedule work from sk_destruct")
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241106015235.2458807-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from can and netfilter.
Things are slowing down quite a bit, mostly driver fixes here. No
known ongoing investigations.
Current release - new code bugs:
- eth: ti: am65-cpsw:
- fix multi queue Rx on J7
- fix warning in am65_cpsw_nuss_remove_rx_chns()
Previous releases - regressions:
- mptcp: do not require admin perm to list endpoints, got missed in a
refactoring
- mptcp: use sock_kfree_s instead of kfree
Previous releases - always broken:
- sctp: properly validate chunk size in sctp_sf_ootb() fix OOB access
- virtio_net: make RSS interact properly with queue number
- can: mcp251xfd: mcp251xfd_get_tef_len(): fix length calculation
- can: mcp251xfd: mcp251xfd_ring_alloc(): fix coalescing
configuration when switching CAN modes
Misc:
- revert earlier hns3 fixes, they were ignoring IOMMU abstractions
and need to be reworked
- can: {cc770,sja1000}_isa: allow building on x86_64"
* tag 'net-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (42 commits)
drivers: net: ionic: add missed debugfs cleanup to ionic_probe() error path
net/smc: do not leave a dangling sk pointer in __smc_create()
rxrpc: Fix missing locking causing hanging calls
net/smc: Fix lookup of netdev by using ib_device_get_netdev()
net: arc: rockchip: fix emac mdio node support
net: arc: fix the device for dma_map_single/dma_unmap_single
virtio_net: Update rss when set queue
virtio_net: Sync rss config to device when virtnet_probe
virtio_net: Add hash_key_length check
virtio_net: Support dynamic rss indirection table size
netfilter: nf_tables: wait for rcu grace period on net_device removal
net: stmmac: Fix unbalanced IRQ wake disable warning on single irq case
net: vertexcom: mse102x: Fix possible double free of TX skb
mptcp: use sock_kfree_s instead of kfree
mptcp: no admin perm to list endpoints
net: phy: ti: add PHY_RST_AFTER_CLK_EN flag
net: ethernet: ti: am65-cpsw: fix warning in am65_cpsw_nuss_remove_rx_chns()
net: ethernet: ti: am65-cpsw: Fix multi queue Rx on J7
net: hns3: fix kernel crash when uninstalling driver
Revert "Merge branch 'there-are-some-bugfix-for-the-hns3-ethernet-driver'"
...
|
|
The ionic_setup_one() creates a debugfs entry for ionic upon
successful execution. However, the ionic_probe() does not
release the dentry before returning, resulting in a memory
leak.
To fix this bug, we add the ionic_debugfs_del_dev() to release
the resources in a timely manner before returning.
Fixes: 0de38d9f1dba ("ionic: extract common bits from ionic_probe")
Signed-off-by: Wentao Liang <Wentao_liang_g@163.com>
Acked-by: Shannon Nelson <shannon.nelson@amd.com>
Link: https://patch.msgid.link/20241107021756.1677-1-liangwentao@iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Thanks to commit 4bbd360a5084 ("socket: Print pf->create() when
it does not clear sock->sk on failure."), syzbot found an issue with AF_SMC:
smc_create must clear sock->sk on failure, family: 43, type: 1, protocol: 0
WARNING: CPU: 0 PID: 5827 at net/socket.c:1565 __sock_create+0x96f/0xa30 net/socket.c:1563
Modules linked in:
CPU: 0 UID: 0 PID: 5827 Comm: syz-executor259 Not tainted 6.12.0-rc6-next-20241106-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
RIP: 0010:__sock_create+0x96f/0xa30 net/socket.c:1563
Code: 03 00 74 08 4c 89 e7 e8 4f 3b 85 f8 49 8b 34 24 48 c7 c7 40 89 0c 8d 8b 54 24 04 8b 4c 24 0c 44 8b 44 24 08 e8 32 78 db f7 90 <0f> 0b 90 90 e9 d3 fd ff ff 89 e9 80 e1 07 fe c1 38 c1 0f 8c ee f7
RSP: 0018:ffffc90003e4fda0 EFLAGS: 00010246
RAX: 099c6f938c7f4700 RBX: 1ffffffff1a595fd RCX: ffff888034823c00
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 00000000ffffffe9 R08: ffffffff81567052 R09: 1ffff920007c9f50
R10: dffffc0000000000 R11: fffff520007c9f51 R12: ffffffff8d2cafe8
R13: 1ffffffff1a595fe R14: ffffffff9a789c40 R15: ffff8880764298c0
FS: 000055557b518380(0000) GS:ffff8880b8600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fa62ff43225 CR3: 0000000031628000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
sock_create net/socket.c:1616 [inline]
__sys_socket_create net/socket.c:1653 [inline]
__sys_socket+0x150/0x3c0 net/socket.c:1700
__do_sys_socket net/socket.c:1714 [inline]
__se_sys_socket net/socket.c:1712 [inline]
For reference, see commit 2d859aff775d ("Merge branch
'do-not-leave-dangling-sk-pointers-in-pf-create-functions'")
Fixes: d25a92ccae6b ("net/smc: Introduce IPPROTO_SMC")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ignat Korchagin <ignat@cloudflare.com>
Cc: D. Wythe <alibuda@linux.alibaba.com>
Cc: Dust Li <dust.li@linux.alibaba.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Link: https://patch.msgid.link/20241106221922.1544045-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
If a call gets aborted (e.g. because kafs saw a signal) between it being
queued for connection and the I/O thread picking up the call, the abort
will be prioritised over the connection and it will be removed from
local->new_client_calls by rxrpc_disconnect_client_call() without a lock
being held. This may cause other calls on the list to disappear if a race
occurs.
Fix this by taking the client_call_lock when removing a call from whatever
list its ->wait_link happens to be on.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Fixes: 9d35d880e0e4 ("rxrpc: Move client call connection to the I/O thread")
Link: https://patch.msgid.link/726660.1730898202@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The SMC-R variant of the SMC protocol used direct call to function
ib_device_ops.get_netdev() to lookup netdev. As we used mlx5 device
driver to run SMC-R, it failed to find a device, because in mlx5_ib the
internal net device management for retrieving net devices was replaced
by a common interface ib_device_get_netdev() in commit 8d159eb2117b
("RDMA/mlx5: Use IB set_netdev and get_netdev functions").
Since such direct accesses to the internal net device management is not
recommended at all, update the SMC-R code to use proper API
ib_device_get_netdev().
Fixes: 54903572c23c ("net/smc: allow pnetid-less configuration")
Reported-by: Aswin K <aswin@linux.ibm.com>
Reviewed-by: Gerd Bayer <gbayer@linux.ibm.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://patch.msgid.link/20241106082612.57803-1-wenjia@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ukleinek/linux
Pull pwm fix from Uwe Kleine-König:
"Fix period setting in imx-tpm driver and a maintainer update
Erik Schumacher found and fixed a problem in the calculation of the
PWM period setting yielding too long periods. Trevor Gamblin - who
already cared about mainlining the pwm-axi-pwmgen driver - stepped
forward as an additional reviewer.
Thanks to Erik and Trevor"
* tag 'pwm/for-6.12-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ukleinek/linux:
MAINTAINERS: add self as reviewer for AXI PWM GENERATOR
pwm: imx-tpm: Use correct MODULO value for EPWM mode
|
|
seq_printf is costy, on a system with n CPUs, reading /proc/softirqs
would yield 10*n decimal values, and the extra cost parsing format string
grows linearly with number of cpus. Replace seq_printf with
seq_put_decimal_ull_width have significant performance improvement.
On an 8CPUs system, reading /proc/softirqs show ~40% performance
gain with this patch.
Signed-off-by: David Wang <00107082@163.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fix for net
The following series contains a Netfilter fix:
1) Wait for rcu grace period after netdevice removal is reported via event.
* tag 'nf-24-11-07' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: wait for rcu grace period on net_device removal
====================
Link: https://patch.msgid.link/20241107113212.116634-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Andy Yan says:
====================
Fix the arc emac driver
The arc emac driver was broken for a long time,
The first broken happens when a dma releated fix introduced in Linux 5.10.
The second broken happens when a emac device tree node restyle introduced
in Linux 6.1.
These two patches are try to make the arc emac work again.
Changes in v2:
- Add cover letter.
- Add fix tag.
- Add more detail explaination.
====================
Link: https://patch.msgid.link/20241104130147.440125-1-andyshrk@163.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The binding emac_rockchip.txt is converted to YAML.
Changed against the original binding is an added MDIO subnode.
This make the driver failed to find the PHY, and given the 'mdio
has invalid PHY address' it is probably looking in the wrong node.
Fix emac_mdio.c so that it can handle both old and new
device trees.
Fixes: 1dabb74971b3 ("ARM: dts: rockchip: restyle emac nodes")
Signed-off-by: Johan Jonker <jbx6244@gmail.com>
Tested-by: Andy Yan <andyshrk@163.com>
Link: https://lore.kernel.org/r/20220603163539.537-3-jbx6244@gmail.com
Signed-off-by: Andy Yan <andy.yan@rock-chips.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The ndev->dev and pdev->dev aren't the same device, use ndev->dev.parent
which has dma_mask, ndev->dev.parent is just pdev->dev.
Or it would cause the following issue:
[ 39.933526] ------------[ cut here ]------------
[ 39.938414] WARNING: CPU: 1 PID: 501 at kernel/dma/mapping.c:149 dma_map_page_attrs+0x90/0x1f8
Fixes: f959dcd6ddfd ("dma-direct: Fix potential NULL pointer dereference")
Signed-off-by: David Wu <david.wu@rock-chips.com>
Signed-off-by: Johan Jonker <jbx6244@gmail.com>
Signed-off-by: Andy Yan <andy.yan@rock-chips.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Philo Lu says:
====================
virtio_net: Make RSS interact properly with queue number
With this patch set, RSS updates with queue_pairs changing:
- When virtnet_probe, init default rss and commit
- When queue_pairs changes _without_ user rss configuration, update rss
with the new queue number
- When queue_pairs changes _with_ user rss configuration, keep rss as user
configured
Patch 1 and 2 fix possible out of bound errors for indir_table and key.
Patch 3 and 4 add RSS update in probe() and set_queues().
====================
Link: https://patch.msgid.link/20241104085706.13872-1-lulie@linux.alibaba.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
RSS configuration should be updated with queue number. In particular, it
should be updated when (1) rss enabled and (2) default rss configuration
is used without user modification.
During rss command processing, device updates queue_pairs using
rss.max_tx_vq. That is, the device updates queue_pairs together with
rss, so we can skip the sperate queue_pairs update
(VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET below) and return directly.
Also remove the `vi->has_rss ?` check when setting vi->rss.max_tx_vq,
because this is not used in the other hash_report case.
Fixes: c7114b1249fa ("drivers/net/virtio_net: Added basic RSS support.")
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
During virtnet_probe, default rss configuration is initialized, but was
not committed to the device. This patch fix this by sending rss command
after device ready in virtnet_probe. Otherwise, the actual rss
configuration used by device can be different with that read by user
from driver, which may confuse the user.
If the command committing fails, driver rss will be disabled.
Fixes: c7114b1249fa ("drivers/net/virtio_net: Added basic RSS support.")
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Joe Damato <jdamato@fastly.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Add hash_key_length check in virtnet_probe() to avoid possible out of
bound errors when setting/reading the hash key.
Fixes: c7114b1249fa ("drivers/net/virtio_net: Added basic RSS support.")
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Joe Damato <jdamato@fastly.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
When reading/writing virtio_net_ctrl_rss, we get the indirection table
size from vi->rss_indir_table_size, which is initialized in
virtnet_probe(). However, the actual size of indirection_table was set
as VIRTIO_NET_RSS_MAX_TABLE_LEN=128. This collision may cause issues if
the vi->rss_indir_table_size exceeds 128.
This patch instead uses dynamic indirection table, allocated with
vi->rss after vi->rss_indir_table_size initialized. And free it in
virtnet_remove().
In virtnet_commit_rss_command(), sgs for rss is initialized differently
with hash_report. So indirection_table is not used if !vi->has_rss, and
then we don't need to alloc indirection_table for hash_report only uses.
Fixes: c7114b1249fa ("drivers/net/virtio_net: Added basic RSS support.")
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Joe Damato <jdamato@fastly.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
8c873e219970 ("netfilter: core: free hooks with call_rcu") removed
synchronize_net() call when unregistering basechain hook, however,
net_device removal event handler for the NFPROTO_NETDEV was not updated
to wait for RCU grace period.
Note that 835b803377f5 ("netfilter: nf_tables_netdev: unregister hooks
on net_device removal") does not remove basechain rules on device
removal, I was hinted to remove rules on net_device removal later, see
5ebe0b0eec9d ("netfilter: nf_tables: destroy basechain and rules on
netdevice removal").
Although NETDEV_UNREGISTER event is guaranteed to be handled after
synchronize_net() call, this path needs to wait for rcu grace period via
rcu callback to release basechain hooks if netns is alive because an
ongoing netlink dump could be in progress (sockets hold a reference on
the netns).
Note that nf_tables_pre_exit_net() unregisters and releases basechain
hooks but it is possible to see NETDEV_UNREGISTER at a later stage in
the netns exit path, eg. veth peer device in another netns:
cleanup_net()
default_device_exit_batch()
unregister_netdevice_many_notify()
notifier_call_chain()
nf_tables_netdev_event()
__nft_release_basechain()
In this particular case, same rule of thumb applies: if netns is alive,
then wait for rcu grace period because netlink dump in the other netns
could be in progress. Otherwise, if the other netns is going away then
no netlink dump can be in progress and basechain hooks can be released
inmediately.
While at it, turn WARN_ON() into WARN_ON_ONCE() for the basechain
validation, which should not ever happen.
Fixes: 835b803377f5 ("netfilter: nf_tables_netdev: unregister hooks on net_device removal")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Commit a23aa0404218 ("net: stmmac: ethtool: Fixed calltrace caused by
unbalanced disable_irq_wake calls") introduced checks to prevent
unbalanced enable and disable IRQ wake calls. However it only
initialized the auxiliary variable on one of the paths,
stmmac_request_irq_multi_msi(), missing the other,
stmmac_request_irq_single().
Add the same initialization on stmmac_request_irq_single() to prevent
"Unbalanced IRQ <x> wake disable" warnings from being printed the first
time disable_irq_wake() is called on platforms that run on that code
path.
Fixes: a23aa0404218 ("net: stmmac: ethtool: Fixed calltrace caused by unbalanced disable_irq_wake calls")
Signed-off-by: Nícolas F. R. A. Prado <nfraprado@collabora.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241101-stmmac-unbalanced-wake-single-fix-v1-1-5952524c97f0@collabora.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The scope of the TX skb is wider than just mse102x_tx_frame_spi(),
so in case the TX skb room needs to be expanded, we should free the
the temporary skb instead of the original skb. Otherwise the original
TX skb pointer would be freed again in mse102x_tx_work(), which leads
to crashes:
Internal error: Oops: 0000000096000004 [#2] PREEMPT SMP
CPU: 0 PID: 712 Comm: kworker/0:1 Tainted: G D 6.6.23
Hardware name: chargebyte Charge SOM DC-ONE (DT)
Workqueue: events mse102x_tx_work [mse102x]
pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : skb_release_data+0xb8/0x1d8
lr : skb_release_data+0x1ac/0x1d8
sp : ffff8000819a3cc0
x29: ffff8000819a3cc0 x28: ffff0000046daa60 x27: ffff0000057f2dc0
x26: ffff000005386c00 x25: 0000000000000002 x24: 00000000ffffffff
x23: 0000000000000000 x22: 0000000000000001 x21: ffff0000057f2e50
x20: 0000000000000006 x19: 0000000000000000 x18: ffff00003fdacfcc
x17: e69ad452d0c49def x16: 84a005feff870102 x15: 0000000000000000
x14: 000000000000024a x13: 0000000000000002 x12: 0000000000000000
x11: 0000000000000400 x10: 0000000000000930 x9 : ffff00003fd913e8
x8 : fffffc00001bc008
x7 : 0000000000000000 x6 : 0000000000000008
x5 : ffff00003fd91340 x4 : 0000000000000000 x3 : 0000000000000009
x2 : 00000000fffffffe x1 : 0000000000000000 x0 : 0000000000000000
Call trace:
skb_release_data+0xb8/0x1d8
kfree_skb_reason+0x48/0xb0
mse102x_tx_work+0x164/0x35c [mse102x]
process_one_work+0x138/0x260
worker_thread+0x32c/0x438
kthread+0x118/0x11c
ret_from_fork+0x10/0x20
Code: aa1303e0 97fffab6 72001c1f 54000141 (f9400660)
Cc: stable@vger.kernel.org
Fixes: 2f207cbf0dd4 ("net: vertexcom: Add MSE102x SPI support")
Signed-off-by: Stefan Wahren <wahrenst@gmx.net>
Link: https://patch.msgid.link/20241105163101.33216-1-wahrenst@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Pull NFS client fixes from Anna Schumaker:
"These are mostly fixes that came up during the nfs bakeathon the other
week.
Stable Fixes:
- Fix KMSAN warning in decode_getfattr_attrs()
Other Bugfixes:
- Handle -ENOTCONN in xs_tcp_setup_socked()
- NFSv3: only use NFS timeout for MOUNT when protocols are compatible
- Fix attribute delegation behavior on exclusive create and a/mtime
changes
- Fix localio to cope with racing nfs_local_probe()
- Avoid i_lock contention in fs_clear_invalid_mapping()"
* tag 'nfs-for-6.12-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
nfs: avoid i_lock contention in nfs_clear_invalid_mapping
nfs_common: fix localio to cope with racing nfs_local_probe()
NFS: Further fixes to attribute delegation a/mtime changes
NFS: Fix attribute delegation behaviour on exclusive create
nfs: Fix KMSAN warning in decode_getfattr_attrs()
NFSv3: only use NFS timeout for MOUNT when protocols are compatible
sunrpc: handle -ENOTCONN in xs_tcp_setup_socket()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd
Pull keys fixes from Jarkko Sakkinen:
"A couple of fixes for keys and trusted keys"
* tag 'keys-next-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd:
KEYS: trusted: dcp: fix NULL dereference in AEAD crypto operation
security/keys: fix slab-out-of-bounds in key_task_permission
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracefs fixes from Steven Rostedt:
"Fix tracefs mount options.
Commit 78ff64081949 ("vfs: Convert tracefs to use the new mount API")
broke the gid setting when set by fstab or other mount utility. It is
ignored when it is set. Fix the code so that it recognises the option
again and will honor the settings on mount at boot up.
Update the internal documentation and create a selftest to make sure
it doesn't break again in the future"
* tag 'tracefs-v6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/selftests: Add tracefs mount options test
tracing: Document tracefs gid mount option
tracing: Fix tracefs mount options
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver fixes from Hans de Goede:
- AMD PMF: Add new hardware id
- AMD PMC: Fix crash when loaded with enable_stb=1 on devices without STB
- Dell: Add Alienware hwid for Alienware systems with Dell WMI interface
- thinkpad_acpi: Quirk to fix wrong fan speed readings on L480
- New hotkey mappings for Dell and Lenovo laptops
* tag 'platform-drivers-x86-v6.12-4' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/x86: thinkpad_acpi: Fix for ThinkPad's with ECFW showing incorrect fan speed
platform/x86: ideapad-laptop: add missing Ideapad Pro 5 fn keys
platform/x86: dell-wmi-base: Handle META key Lock/Unlock events
platform/x86: dell-smbios-base: Extends support to Alienware products
platform/x86/amd/pmc: Detect when STB is not available
platform/x86/amd/pmf: Add SMU metrics table support for 1Ah family 60h model
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mikulas Patocka:
- fix memory safety bugs in dm-cache
- fix restart/panic logic in dm-verity
- fix 32-bit unsigned integer overflow in dm-unstriped
- fix a device mapper crash if blk_alloc_disk fails
* tag 'for-6.12/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm cache: fix potential out-of-bounds access on the first resume
dm cache: optimize dirty bit checking with find_next_bit when resizing
dm cache: fix out-of-bounds access to the dirty bitset when resizing
dm cache: fix flushing uninitialized delayed_work on cache_ctr error
dm cache: correct the number of origin blocks to match the target length
dm-verity: don't crash if panic_on_corruption is not selected
dm-unstriped: cast an operand to sector_t to prevent potential uint32_t overflow
dm: fix a crash if blk_alloc_disk fails
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid
Pull HID fix from Jiri Kosina:
- report buffer sanitization fix for HID core (Jiri Kosina)
* tag 'hid-for-linus-20241105' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
HID: core: zero-initialize the report buffer
|
|
fan speed
Fix for Thinkpad's with ECFW showing incorrect fan speed. Some models use
decimal instead of hexadecimal for the speed stored in the EC registers.
For example the rpm register will have 0x4200 instead of 0x1068, here
the actual RPM is "4200" in decimal.
Add a quirk to handle this.
Signed-off-by: Vishnu Sankar <vishnuocv@gmail.com>
Suggested-by: Mark Pearson <mpearson-lenovo@squebb.ca>
Link: https://lore.kernel.org/r/20241105235505.8493-1-vishnuocv@gmail.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2024-11-04 (ice, idpf, i40e, e1000e)
For ice:
Marcin adjusts ordering of calls in ice_eswitch_detach() to resolve a
use after free issue.
Mateusz corrects variable type for Flow Director queue to fix issues
related to drop actions.
For idpf:
Pavan resolves issues related to reset on idpf; avoiding use of freed
vport and correctly unrolling the mailbox task.
For i40e:
Aleksandr fixes a race condition involving addition and deletion of VF
MAC filters.
For e1000e:
Vitaly reverts workaround for Meteor Lake causing regressions in power
management flows.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
e1000e: Remove Meteor Lake SMBUS workarounds
i40e: fix race condition by adding filter's intermediate sync state
idpf: fix idpf_vc_core_init error path
idpf: avoid vport access in idpf_get_link_ksettings
ice: change q_index variable type to s16 to store -1 value
ice: Fix use after free during unload with ports in bridge
====================
Link: https://patch.msgid.link/20241104223639.2801097-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Matthieu Baerts says:
====================
mptcp: pm: fix wrong perm and sock kfree
Two small fixes related to the MPTCP path-manager:
- Patch 1: remove an accidental restriction to admin users to list MPTCP
endpoints. A regression from v6.7.
- Patch 2: correctly use sock_kfree_s() instead of kfree() in the
userspace PM. A fix for another fix introduced in v6.4 and
backportable up to v5.19.
====================
Link: https://patch.msgid.link/20241104-net-mptcp-misc-6-12-v1-0-c13f2ff1656f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The local address entries on userspace_pm_local_addr_list are allocated
by sock_kmalloc().
It's then required to use sock_kfree_s() instead of kfree() to free
these entries in order to adjust the allocated size on the sk side.
Fixes: 24430f8bf516 ("mptcp: add address into userspace pm list")
Cc: stable@vger.kernel.org
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241104-net-mptcp-misc-6-12-v1-2-c13f2ff1656f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
During the switch to YNL, the command to list all endpoints has been
accidentally restricted to users with admin permissions.
It looks like there are no reasons to have this restriction which makes
it harder for a user to quickly check if the endpoint list has been
correctly populated by an automated tool. Best to go back to the
previous behaviour then.
mptcp_pm_gen.c has been modified using ynl-gen-c.py:
$ ./tools/net/ynl/ynl-gen-c.py --mode kernel \
--spec Documentation/netlink/specs/mptcp_pm.yaml --source \
-o net/mptcp/mptcp_pm_gen.c
The header file doesn't need to be regenerated.
Fixes: 1d0507f46843 ("net: mptcp: convert netlink from small_ops to ops")
Cc: stable@vger.kernel.org
Reviewed-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241104-net-mptcp-misc-6-12-v1-1-c13f2ff1656f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
DP83848 datasheet (section 4.7.2) indicates that the reset pin should be
toggled after the clocks are running. Add the PHY_RST_AFTER_CLK_EN to
make sure that this indication is respected.
In my experience not having this flag enabled would lead to, on some
boots, the wrong MII mode being selected if the PHY was initialized on
the bootloader and was receiving data during Linux boot.
Signed-off-by: Diogo Silva <diogompaissilva@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Fixes: 34e45ad9378c ("net: phy: dp83848: Add TI DP83848 Ethernet PHY")
Link: https://patch.msgid.link/20241102151504.811306-1-paissilva@ld-100007.ds1.internal
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Roger Quadros says:
====================
net: ethernet: ti: am65-cpsw: Fixes to multi queue RX feature
On J7 platforms, setting up multiple RX flows was failing
as the RX free descriptor ring 0 is shared among all flows
and we did not allocate enough elements in the RX free descriptor
ring 0 to accommodate for all RX flows. Patch 1 fixes this.
The second patch fixes a warning if there was any error in
am65_cpsw_nuss_init_rx_chns() and am65_cpsw_nuss_cleanup_rx_chns()
was called after that.
Signed-off-by: Roger Quadros <rogerq@kernel.org>
====================
Link: https://patch.msgid.link/20241101-am65-cpsw-multi-rx-j7-fix-v3-0-338fdd6a55da@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
flow->irq is initialized to 0 which is a valid IRQ. Set it to -EINVAL
in error path of am65_cpsw_nuss_init_rx_chns() so we do not try
to free an unallocated IRQ in am65_cpsw_nuss_remove_rx_chns().
If user tried to change number of RX queues and am65_cpsw_nuss_init_rx_chns()
failed due to any reason, the warning will happen if user tries to change
the number of RX queues after the error condition.
root@am62xx-evm:~# ethtool -L eth0 rx 3
[ 40.385293] am65-cpsw-nuss 8000000.ethernet: set new flow-id-base 19
[ 40.393211] am65-cpsw-nuss 8000000.ethernet: Failed to init rx flow2
netlink error: Invalid argument
root@am62xx-evm:~# ethtool -L eth0 rx 2
[ 82.306427] ------------[ cut here ]------------
[ 82.311075] WARNING: CPU: 0 PID: 378 at kernel/irq/devres.c:144 devm_free_irq+0x84/0x90
[ 82.469770] Call trace:
[ 82.472208] devm_free_irq+0x84/0x90
[ 82.475777] am65_cpsw_nuss_remove_rx_chns+0x6c/0xac [ti_am65_cpsw_nuss]
[ 82.482487] am65_cpsw_nuss_update_tx_rx_chns+0x2c/0x9c [ti_am65_cpsw_nuss]
[ 82.489442] am65_cpsw_set_channels+0x30/0x4c [ti_am65_cpsw_nuss]
[ 82.495531] ethnl_set_channels+0x224/0x2dc
[ 82.499713] ethnl_default_set_doit+0xb8/0x1b8
[ 82.504149] genl_family_rcv_msg_doit+0xc0/0x124
[ 82.508757] genl_rcv_msg+0x1f0/0x284
[ 82.512409] netlink_rcv_skb+0x58/0x130
[ 82.516239] genl_rcv+0x38/0x50
[ 82.519374] netlink_unicast+0x1d0/0x2b0
[ 82.523289] netlink_sendmsg+0x180/0x3c4
[ 82.527205] __sys_sendto+0xe4/0x158
[ 82.530779] __arm64_sys_sendto+0x28/0x38
[ 82.534782] invoke_syscall+0x44/0x100
[ 82.538526] el0_svc_common.constprop.0+0xc0/0xe0
[ 82.543221] do_el0_svc+0x1c/0x28
[ 82.546528] el0_svc+0x28/0x98
[ 82.549578] el0t_64_sync_handler+0xc0/0xc4
[ 82.553752] el0t_64_sync+0x190/0x194
[ 82.557407] ---[ end trace 0000000000000000 ]---
Fixes: da70d184a8c3 ("net: ethernet: ti: am65-cpsw: Introduce multi queue Rx")
Signed-off-by: Roger Quadros <rogerq@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
On J7 platforms, setting up multiple RX flows was failing
as the RX free descriptor ring 0 is shared among all flows
and we did not allocate enough elements in the RX free descriptor
ring 0 to accommodate for all RX flows.
This issue is not present on AM62 as separate pair of
rings are used for free and completion rings for each flow.
Fix this by allocating enough elements for RX free descriptor
ring 0.
However, we can no longer rely on desc_idx (descriptor based
offsets) to identify the pages in the respective flows as
free descriptor ring includes elements for all flows.
To solve this, introduce a new swdata data structure to store
flow_id and page. This can be used to identify which flow (page_pool)
and page the descriptor belonged to when popped out of the
RX rings.
Fixes: da70d184a8c3 ("net: ethernet: ti: am65-cpsw: Introduce multi queue Rx")
Signed-off-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
When the driver is uninstalled and the VF is disabled concurrently, a
kernel crash occurs. The reason is that the two actions call function
pci_disable_sriov(). The num_VFs is checked to determine whether to
release the corresponding resources. During the second calling, num_VFs
is not 0 and the resource release function is called. However, the
corresponding resource has been released during the first invoking.
Therefore, the problem occurs:
[15277.839633][T50670] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
...
[15278.131557][T50670] Call trace:
[15278.134686][T50670] klist_put+0x28/0x12c
[15278.138682][T50670] klist_del+0x14/0x20
[15278.142592][T50670] device_del+0xbc/0x3c0
[15278.146676][T50670] pci_remove_bus_device+0x84/0x120
[15278.151714][T50670] pci_stop_and_remove_bus_device+0x6c/0x80
[15278.157447][T50670] pci_iov_remove_virtfn+0xb4/0x12c
[15278.162485][T50670] sriov_disable+0x50/0x11c
[15278.166829][T50670] pci_disable_sriov+0x24/0x30
[15278.171433][T50670] hnae3_unregister_ae_algo_prepare+0x60/0x90 [hnae3]
[15278.178039][T50670] hclge_exit+0x28/0xd0 [hclge]
[15278.182730][T50670] __se_sys_delete_module.isra.0+0x164/0x230
[15278.188550][T50670] __arm64_sys_delete_module+0x1c/0x30
[15278.193848][T50670] invoke_syscall+0x50/0x11c
[15278.198278][T50670] el0_svc_common.constprop.0+0x158/0x164
[15278.203837][T50670] do_el0_svc+0x34/0xcc
[15278.207834][T50670] el0_svc+0x20/0x30
For details, see the following figure.
rmmod hclge disable VFs
----------------------------------------------------
hclge_exit() sriov_numvfs_store()
... device_lock()
pci_disable_sriov() hns3_pci_sriov_configure()
pci_disable_sriov()
sriov_disable()
sriov_disable() if !num_VFs :
if !num_VFs : return;
return; sriov_del_vfs()
sriov_del_vfs() ...
... klist_put()
klist_put() ...
... num_VFs = 0;
num_VFs = 0; device_unlock();
In this patch, when driver is removing, we get the device_lock()
to protect num_VFs, just like sriov_numvfs_store().
Fixes: 0dd8a25f355b ("net: hns3: disable sriov before unload hclge layer")
Signed-off-by: Peiyang Wang <wangpeiyang1@huawei.com>
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241101091507.3644584-1-shaojijie@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|