summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2025-07-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni
Cross-merge networking fixes after downstream PR (net-6.16-rc5). No conflicts. No adjacent changes. Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03Merge tag 'net-6.16-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from Bluetooth. Current release - new code bugs: - eth: - txgbe: fix the issue of TX failure - ngbe: specify IRQ vector when the number of VFs is 7 Previous releases - regressions: - sched: always pass notifications when child class becomes empty - ipv4: fix stat increase when udp early demux drops the packet - bluetooth: prevent unintended pause by checking if advertising is active - virtio: fix error reporting in virtqueue_resize - eth: - virtio-net: - ensure the received length does not exceed allocated size - fix the xsk frame's length check - lan78xx: fix WARN in __netif_napi_del_locked on disconnect Previous releases - always broken: - bluetooth: mesh: check instances prior disabling advertising - eth: - idpf: convert control queue mutex to a spinlock - dpaa2: fix xdp_rxq_info leak - amd-xgbe: align CL37 AN sequence as per databook" * tag 'net-6.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (38 commits) vsock/vmci: Clear the vmci transport packet properly when initializing it dt-bindings: net: sophgo,sg2044-dwmac: Drop status from the example net: ngbe: specify IRQ vector when the number of VFs is 7 net: wangxun: revert the adjustment of the IRQ vector sequence net: txgbe: request MISC IRQ in ndo_open virtio_net: Enforce minimum TX ring size for reliability virtio_net: Cleanup '2+MAX_SKB_FRAGS' virtio_ring: Fix error reporting in virtqueue_resize virtio-net: xsk: rx: fix the frame's length check virtio-net: use the check_mergeable_len helper virtio-net: remove redundant truesize check with PAGE_SIZE virtio-net: ensure the received length does not exceed allocated size net: ipv4: fix stat increase when udp early demux drops the packet net: libwx: fix the incorrect display of the queue number amd-xgbe: do not double read link status net/sched: Always pass notifications when child class becomes empty nui: Fix dma_mapping_error() check rose: fix dangling neighbour pointers in rose_rt_device_down() enic: fix incorrect MTU comparison in enic_change_mtu() amd-xgbe: align CL37 AN sequence as per databook ...
2025-07-03Bluetooth: hci_event: Fix not marking Broadcast Sink BIS as connectedLuiz Augusto von Dentz
Upon receiving HCI_EVT_LE_BIG_SYNC_ESTABLISHED with status 0x00 (success) the corresponding BIS hci_conn state shall be set to BT_CONNECTED otherwise they will be left with BT_OPEN which is invalid at that point, also create the debugfs and sysfs entries following the same logic as the likes of Broadcast Source BIS and CIS connections. Fixes: f777d8827817 ("Bluetooth: ISO: Notify user space about failed bis connections") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-07-03Bluetooth: hci_sync: Fix attempting to send HCI_Disconnect to BIS handleLuiz Augusto von Dentz
BIS/PA connections do have their own cleanup proceedure which are performed by hci_conn_cleanup/bis_cleanup. Fixes: 23205562ffc8 ("Bluetooth: separate CIS_LINK and BIS_LINK link types") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-07-03Bluetooth: hci_sync: Fix not disabling advertising instanceLuiz Augusto von Dentz
As the code comments on hci_setup_ext_adv_instance_sync suggests the advertising instance needs to be disabled in order to update its parameters, but it was wrongly checking that !adv->pending. Fixes: cba6b758711c ("Bluetooth: hci_sync: Make use of hci_cmd_sync_queue set 2") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-07-03ipv6: Cleanup fib6_drop_pcpu_from()Yue Haibing
Since commit 0e2338749192 ("ipv6: fix races in ip6_dst_destroy()"), 'table' is unused in __fib6_drop_pcpu_from(), no need pass it from fib6_drop_pcpu_from(). Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250701041235.1333687-1-yuehaibing@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03netfilter: conntrack: remove DCCP protocol supportPablo Neira Ayuso
The DCCP socket family has now been removed from this tree, see: 8bb3212be4b4 ("Merge branch 'net-retire-dccp-socket'") Remove connection tracking and NAT support for this protocol, this should not pose a problem because no DCCP traffic is expected to be seen on the wire. As for the code for matching on dccp header for iptables and nftables, mark it as deprecated and keep it in place. Ruleset restoration is an atomic operation. Without dccp matching support, an astray match on dccp could break this operation leaving your computer with no policy in place, so let's follow a more conservative approach for matches. Add CONFIG_NFT_EXTHDR_DCCP which is set to 'n' by default to deprecate dccp extension support. Similarly, label CONFIG_NETFILTER_XT_MATCH_DCCP as deprecated too and also set it to 'n' by default. Code to match on DCCP protocol from ebtables also remains in place, this is just a few checks on IPPROTO_DCCP from _check() path which is exercised when ruleset is loaded. There is another use of IPPROTO_DCCP from the _check() path in the iptables multiport match. Another check for IPPROTO_DCCP from the packet in the reject target is also removed. So let's schedule removal of the dccp matching for a second stage, this should not interfer with the dccp retirement since this is only matching on the dccp header. Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-07-03vsock/vmci: Clear the vmci transport packet properly when initializing itHarshaVardhana S A
In vmci_transport_packet_init memset the vmci_transport_packet before populating the fields to avoid any uninitialised data being left in the structure. Cc: Bryan Tan <bryan-bt.tan@broadcom.com> Cc: Vishnu Dasa <vishnu.dasa@broadcom.com> Cc: Broadcom internal kernel review list Cc: Stefano Garzarella <sgarzare@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Cc: virtualization@lists.linux.dev Cc: netdev@vger.kernel.org Cc: stable <stable@kernel.org> Signed-off-by: HarshaVardhana S A <harshavardhana.sa@broadcom.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Fixes: d021c344051a ("VSOCK: Introduce VM Sockets") Acked-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20250701122254.2397440-1-gregkh@linuxfoundation.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-02rpc_create_client_dir(): return 0 or -E...Al Viro
Callers couldn't care less which dentry did we get - anything valid is treated as success. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-07-02rpc_create_client_dir(): don't bother with rpc_populate()Al Viro
not for a single file... Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-07-02rpc_new_dir(): the last argument is always NULLAl Viro
All callers except the one in rpc_populate() pass explicit NULL there; rpc_populate() passes its last argument ('private') instead, but in the only call of rpc_populate() that creates any subdirectories (when creating fixed subdirectories of root) private itself is NULL. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-07-02rpc_pipe: expand the calls of rpc_mkdir_populate()Al Viro
... and get rid of convoluted callbacks. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-07-02rpc_gssd_dummy_populate(): don't bother with rpc_populate()Al Viro
Just have it create gssd (in root), clntXX in gssd, then info and gssd in clntXX - all with explicit rpc_new_dir()/rpc_new_file()/rpc_mkpipe_dentry(). Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-07-02rpc_mkpipe_dentry(): switch to simple_start_creating()Al Viro
... and make sure we set the fs-private part of inode up before attaching it to dentry. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-07-02rpc_pipe: saner primitive for creating regular filesAl Viro
rpc_new_file(); similar to rpc_new_dir(), except that here we pass file_operations as well. Callers don't care about dentry, just return 0 or -E... Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-07-02rpc_pipe: saner primitive for creating subdirectoriesAl Viro
All users of __rpc_mkdir() have the same form - start_creating(), followed, in case of success, by __rpc_mkdir() and unlocking parent. Combine that into a single helper, expanding __rpc_mkdir() into it, along with the call of __rpc_create_common() in it. Don't mess with d_drop() + d_add() - just d_instantiate() and be done with that. The reason __rpc_create_common() goes for that dance is that dentry it gets might or might not be hashed; here we know it's hashed. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-07-02rpc_pipe: don't overdo directory lockingAl Viro
Don't try to hold directories locked more than VFS requires; lock just before getting a child to be made positive (using simple_start_creating()) and unlock as soon as the child is created. There's no benefit in keeping the parent locked while populating the child - it won't stop dcache lookups anyway. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-07-02rpc_mkpipe_dentry(): saner calling conventionsAl Viro
Instead of returning a dentry or ERR_PTR(-E...), return 0 and store dentry into pipe->dentry on success and return -E... on failure. Callers are happier that way... NOTE: dummy rpc_pipe is getting ->dentry set; we never access that, since we 1) never call rpc_unlink() for it (dentry is taken out by ->kill_sb()) 2) never call rpc_queue_upcall() for it (writing to that sucker fails; no downcalls are ever submitted, so no replies are going to arrive) IOW, having that ->dentry set (and left dangling) is harmless, if ugly; cleaner solution will take more massage. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-07-02rpc_unlink(): saner calling conventionsAl Viro
1) pass it pipe instead of pipe->dentry 2) zero pipe->dentry afterwards 3) it always returns 0; why bother? Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-07-02rpc_populate(): lift cleanup into callersAl Viro
rpc_populate() is called either from fill_super (where we don't need to remove any files on failure - rpc_kill_sb() will take them all out anyway) or from rpc_mkdir_populate(), where we need to remove the directory we'd been trying to populate along with whatever we'd put into it before we failed. Simpler to combine that into simple_recursive_removal() there. Note that rpc_pipe is overlocking directories quite a bit - locked parent is no obstacle to finding a child in dcache, so keeping it locked won't prevent userland observing a partially built subtree. All we need is to follow minimal VFS requirements; it's not as if clients used directory locking for exclusion - tree changes are serialized, but that's done on ->pipefs_sb_lock. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-07-02rpc_unlink(): use simple_recursive_removal()Al Viro
note that the callback of simple_recursive_removal() is called with the parent locked; the victim isn't locked by the caller. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-07-02rpc_{rmdir_,}depopulate(): use simple_recursive_removal() insteadAl Viro
no need to give an exact list of objects to be remove when it's simply every child of the victim directory. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-07-02rpc_pipe: clean failure exits in fill_superAl Viro
->kill_sb() will be called immediately after a failure return anyway, so we don't need to bother with notifier chain and rpc_gssd_dummy_depopulate(). What's more, rpc_gssd_dummy_populate() failure exits do not need to bother with __rpc_depopulate() - anything added to the tree will be taken out by ->kill_sb(). Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-07-02net: ipv6: Fix spelling mistakeChenguang Zhao
change 'Maximium' to 'Maximum' Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Paul Moore <paul@paul-moore.com> Link: https://patch.msgid.link/20250702055820.112190-1-zhaochenguang@kylinos.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02devlink: Extend devlink rate API with traffic classes bandwidth managementCarolina Jubran
Introduce support for specifying relative bandwidth shares between traffic classes (TC) in the devlink-rate API. This new option allows users to allocate bandwidth across multiple traffic classes in a single command. This feature provides a more granular control over traffic management, especially for scenarios requiring Enhanced Transmission Selection. Users can now define a relative bandwidth share for each traffic class. For example, assigning share values of 20 to TC0 (TCP/UDP) and 80 to TC5 (RoCE) will result in TC0 receiving 20% and TC5 receiving 80% of the total bandwidth. The actual percentage each class receives depends on the ratio of its share value to the sum of all shares. Example: DEV=pci/0000:08:00.0 $ devlink port function rate add $DEV/vfs_group tx_share 10Gbit \ tx_max 50Gbit tc-bw 0:20 1:0 2:0 3:0 4:0 5:80 6:0 7:0 $ devlink port function rate set $DEV/vfs_group \ tc-bw 0:20 1:0 2:0 3:0 4:0 5:20 6:60 7:0 Example usage with ynl: ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \ --do rate-set --json '{ "bus-name": "pci", "dev-name": "0000:08:00.0", "port-index": 1, "rate-tc-bws": [ {"rate-tc-index": 0, "rate-tc-bw": 50}, {"rate-tc-index": 1, "rate-tc-bw": 50}, {"rate-tc-index": 2, "rate-tc-bw": 0}, {"rate-tc-index": 3, "rate-tc-bw": 0}, {"rate-tc-index": 4, "rate-tc-bw": 0}, {"rate-tc-index": 5, "rate-tc-bw": 0}, {"rate-tc-index": 6, "rate-tc-bw": 0}, {"rate-tc-index": 7, "rate-tc-bw": 0} ] }' ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \ --do rate-get --json '{ "bus-name": "pci", "dev-name": "0000:08:00.0", "port-index": 1 }' output for rate-get: {'bus-name': 'pci', 'dev-name': '0000:08:00.0', 'port-index': 1, 'rate-tc-bws': [{'rate-tc-bw': 50, 'rate-tc-index': 0}, {'rate-tc-bw': 50, 'rate-tc-index': 1}, {'rate-tc-bw': 0, 'rate-tc-index': 2}, {'rate-tc-bw': 0, 'rate-tc-index': 3}, {'rate-tc-bw': 0, 'rate-tc-index': 4}, {'rate-tc-bw': 0, 'rate-tc-index': 5}, {'rate-tc-bw': 0, 'rate-tc-index': 6}, {'rate-tc-bw': 0, 'rate-tc-index': 7}], 'rate-tx-max': 0, 'rate-tx-priority': 0, 'rate-tx-share': 0, 'rate-tx-weight': 0, 'rate-type': 'leaf'} Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-3-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: preserve MSG_ZEROCOPY with forwardingWillem de Bruijn
MSG_ZEROCOPY data must be copied before data is queued to local sockets, to avoid indefinite timeout until memory release. This test is performed by skb_orphan_frags_rx, which is called when looping an egress skb to packet sockets, error queue or ingress path. To preserve zerocopy for skbs that are looped to ingress but are then forwarded to an egress device rather than delivered locally, defer this last check until an skb enters the local IP receive path. This is analogous to existing behavior of skb_clear_delivery_time. Signed-off-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250630194312.1571410-2-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: ipv4: fix stat increase when udp early demux drops the packetAntoine Tenart
udp_v4_early_demux now returns drop reasons as it either returns 0 or ip_mc_validate_source, which returns itself a drop reason. However its use was not converted in ip_rcv_finish_core and the drop reason is ignored, leading to potentially skipping increasing LINUX_MIB_IPRPFILTER if the drop reason is SKB_DROP_REASON_IP_RPFILTER. This is a fix and we're not converting udp_v4_early_demux to explicitly return a drop reason to ease backports; this can be done as a follow-up. Fixes: d46f827016d8 ("net: ip: make ip_mc_validate_source() return drop reason") Cc: Menglong Dong <menglong8.dong@gmail.com> Reported-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250701074935.144134-1-atenart@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net/sched: Always pass notifications when child class becomes emptyLion Ackermann
Certain classful qdiscs may invoke their classes' dequeue handler on an enqueue operation. This may unexpectedly empty the child qdisc and thus make an in-flight class passive via qlen_notify(). Most qdiscs do not expect such behaviour at this point in time and may re-activate the class eventually anyways which will lead to a use-after-free. The referenced fix commit attempted to fix this behavior for the HFSC case by moving the backlog accounting around, though this turned out to be incomplete since the parent's parent may run into the issue too. The following reproducer demonstrates this use-after-free: tc qdisc add dev lo root handle 1: drr tc filter add dev lo parent 1: basic classid 1:1 tc class add dev lo parent 1: classid 1:1 drr tc qdisc add dev lo parent 1:1 handle 2: hfsc def 1 tc class add dev lo parent 2: classid 2:1 hfsc rt m1 8 d 1 m2 0 tc qdisc add dev lo parent 2:1 handle 3: netem tc qdisc add dev lo parent 3:1 handle 4: blackhole echo 1 | socat -u STDIN UDP4-DATAGRAM:127.0.0.1:8888 tc class delete dev lo classid 1:1 echo 1 | socat -u STDIN UDP4-DATAGRAM:127.0.0.1:8888 Since backlog accounting issues leading to a use-after-frees on stale class pointers is a recurring pattern at this point, this patch takes a different approach. Instead of trying to fix the accounting, the patch ensures that qdisc_tree_reduce_backlog always calls qlen_notify when the child qdisc is empty. This solves the problem because deletion of qdiscs always involves a call to qdisc_reset() and / or qdisc_purge_queue() which ultimately resets its qlen to 0 thus causing the following qdisc_tree_reduce_backlog() to report to the parent. Note that this may call qlen_notify on passive classes multiple times. This is not a problem after the recent patch series that made all the classful qdiscs qlen_notify() handlers idempotent. Fixes: 3f981138109f ("sch_hfsc: Fix qlen accounting bug when using peek in hfsc_enqueue()") Signed-off-by: Lion Ackermann <nnamrec@gmail.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/d912cbd7-193b-4269-9857-525bee8bbb6a@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02ipv6: ip6_mc_input() and ip6_mr_input() cleanupsEric Dumazet
Both functions are always called under RCU. We remove the extra rcu_read_lock()/rcu_read_unlock(). We use skb_dst_dev_net_rcu() instead of skb_dst_dev_net(). We use dev_net_rcu() instead of dev_net(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-11-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02ipv6: adopt skb_dst_dev() and skb_dst_dev_net[_rcu]() helpersEric Dumazet
Use the new helpers as a step to deal with potential dst->dev races. v2: fix typo in ipv6_rthdr_rcv() (kernel test robot <lkp@intel.com>) Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-10-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02ipv6: adopt dst_dev() helperEric Dumazet
Use the new helper as a step to deal with potential dst->dev races. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-9-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02ipv4: adopt dst_dev, skb_dst_dev and skb_dst_dev_net[_rcu]Eric Dumazet
Use the new helpers as a first step to deal with potential dst->dev races. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-8-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: dst: add four helpers to annotate data-races around dst->devEric Dumazet
dst->dev is read locklessly in many contexts, and written in dst_dev_put(). Fixing all the races is going to need many changes. We probably will have to add full RCU protection. Add three helpers to ease this painful process. static inline struct net_device *dst_dev(const struct dst_entry *dst) { return READ_ONCE(dst->dev); } static inline struct net_device *skb_dst_dev(const struct sk_buff *skb) { return dst_dev(skb_dst(skb)); } static inline struct net *skb_dst_dev_net(const struct sk_buff *skb) { return dev_net(skb_dst_dev(skb)); } static inline struct net *skb_dst_dev_net_rcu(const struct sk_buff *skb) { return dev_net_rcu(skb_dst_dev(skb)); } Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: dst: annotate data-races around dst->outputEric Dumazet
dst_dev_put() can overwrite dst->output while other cpus might read this field (for instance from dst_output()) Add READ_ONCE()/WRITE_ONCE() annotations to suppress potential issues. We will likely need RCU protection in the future. Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: dst: annotate data-races around dst->inputEric Dumazet
dst_dev_put() can overwrite dst->input while other cpus might read this field (for instance from dst_input()) Add READ_ONCE()/WRITE_ONCE() annotations to suppress potential issues. We will likely need full RCU protection later. Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: dst: annotate data-races around dst->lastuseEric Dumazet
(dst_entry)->lastuse is read and written locklessly, add corresponding annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: dst: annotate data-races around dst->expiresEric Dumazet
(dst_entry)->expires is read and written locklessly, add corresponding annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: dst: annotate data-races around dst->obsoleteEric Dumazet
(dst_entry)->obsolete is read locklessly, add corresponding annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02udp: move udp_memory_allocated into net_aligned_dataEric Dumazet
____cacheline_aligned_in_smp attribute only makes sure to align a field to a cache line. It does not prevent the linker to use the remaining of the cache line for other variables, causing potential false sharing. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250630093540.3052835-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02tcp: move tcp_memory_allocated into net_aligned_dataEric Dumazet
____cacheline_aligned_in_smp attribute only makes sure to align a field to a cache line. It does not prevent the linker to use the remaining of the cache line for other variables, causing potential false sharing. Move tcp_memory_allocated into a dedicated cache line. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250630093540.3052835-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: move net_cookie into net_aligned_dataEric Dumazet
Using per-cpu data for net->net_cookie generation is overkill, because even busy hosts do not create hundreds of netns per second. Make sure to put net_cookie in a private cache line to avoid potential false sharing. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250630093540.3052835-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: add struct net_aligned_dataEric Dumazet
This structure will hold networking data that must consume a full cache line to avoid accidental false sharing. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250630093540.3052835-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02xfrm: Duplicate SPI HandlingAakash Kumar S
The issue originates when Strongswan initiates an XFRM_MSG_ALLOCSPI Netlink message, which triggers the kernel function xfrm_alloc_spi(). This function is expected to ensure uniqueness of the Security Parameter Index (SPI) for inbound Security Associations (SAs). However, it can return success even when the requested SPI is already in use, leading to duplicate SPIs assigned to multiple inbound SAs, differentiated only by their destination addresses. This behavior causes inconsistencies during SPI lookups for inbound packets. Since the lookup may return an arbitrary SA among those with the same SPI, packet processing can fail, resulting in packet drops. According to RFC 4301 section 4.4.2 , for inbound processing a unicast SA is uniquely identified by the SPI and optionally protocol. Reproducing the Issue Reliably: To consistently reproduce the problem, restrict the available SPI range in charon.conf : spi_min = 0x10000000 spi_max = 0x10000002 This limits the system to only 2 usable SPI values. Next, create more than 2 Child SA. each using unique pair of src/dst address. As soon as the 3rd Child SA is initiated, it will be assigned a duplicate SPI, since the SPI pool is already exhausted. With a narrow SPI range, the issue is consistently reproducible. With a broader/default range, it becomes rare and unpredictable. Current implementation: xfrm_spi_hash() lookup function computes hash using daddr, proto, and family. So if two SAs have the same SPI but different destination addresses, then they will: a. Hash into different buckets b. Be stored in different linked lists (byspi + h) c. Not be seen in the same hlist_for_each_entry_rcu() iteration. As a result, the lookup will result in NULL and kernel allows that Duplicate SPI Proposed Change: xfrm_state_lookup_spi_proto() does a truly global search - across all states, regardless of hash bucket and matches SPI and proto. Signed-off-by: Aakash Kumar S <saakashkumar@marvell.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-07-02xfrm: ipcomp: adjust transport header after decompressingFernando Fernandez Mancera
The skb transport header pointer needs to be adjusted by network header pointer plus the size of the ipcomp header. This shows up when running traffic over ipcomp using transport mode. After being reinjected, packets are dropped because the header isn't adjusted properly and some checks can be triggered. E.g the skb is mistakenly considered as IP fragmented packet and later dropped. kworker/30:1-mm 443 [030] 102.055250: skb:kfree_skb:skbaddr=0xffff8f104aa3ce00 rx_sk=( ffffffff8419f1f4 sk_skb_reason_drop+0x94 ([kernel.kallsyms]) ffffffff8419f1f4 sk_skb_reason_drop+0x94 ([kernel.kallsyms]) ffffffff84281420 ip_defrag+0x4b0 ([kernel.kallsyms]) ffffffff8428006e ip_local_deliver+0x4e ([kernel.kallsyms]) ffffffff8432afb1 xfrm_trans_reinject+0xe1 ([kernel.kallsyms]) ffffffff83758230 process_one_work+0x190 ([kernel.kallsyms]) ffffffff83758f37 worker_thread+0x2d7 ([kernel.kallsyms]) ffffffff83761cc9 kthread+0xf9 ([kernel.kallsyms]) ffffffff836c3437 ret_from_fork+0x197 ([kernel.kallsyms]) ffffffff836718da ret_from_fork_asm+0x1a ([kernel.kallsyms]) Fixes: eb2953d26971 ("xfrm: ipcomp: Use crypto_acomp interface") Link: https://bugzilla.suse.com/1244532 Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-07-02xfrm: Set transport header to fix UDP GRO handlingTobias Brunner
The referenced commit replaced a call to __xfrm4|6_udp_encap_rcv() with a custom check for non-ESP markers. But what the called function also did was setting the transport header to the ESP header. The function that follows, esp4|6_gro_receive(), relies on that being set when it calls xfrm_parse_spi(). We have to set the full offset as the skb's head was not moved yet so adding just the UDP header length won't work. Fixes: e3fd05777685 ("xfrm: Fix UDP GRO handling for some corner cases") Signed-off-by: Tobias Brunner <tobias@strongswan.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-07-01seg6: fix lenghts typo in a commentAndrea Mayer
Fix a typo: lenghts -> length The typo has been identified using codespell, and the tool currently does not report any additional issues in comments. Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250629171226.4988-2-andrea.mayer@uniroma2.it Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01rose: fix dangling neighbour pointers in rose_rt_device_down()Kohei Enju
There are two bugs in rose_rt_device_down() that can cause use-after-free: 1. The loop bound `t->count` is modified within the loop, which can cause the loop to terminate early and miss some entries. 2. When removing an entry from the neighbour array, the subsequent entries are moved up to fill the gap, but the loop index `i` is still incremented, causing the next entry to be skipped. For example, if a node has three neighbours (A, A, B) with count=3 and A is being removed, the second A is not checked. i=0: (A, A, B) -> (A, B) with count=2 ^ checked i=1: (A, B) -> (A, B) with count=2 ^ checked (B, not A!) i=2: (doesn't occur because i < count is false) This leaves the second A in the array with count=2, but the rose_neigh structure has been freed. Code that accesses these entries assumes that the first `count` entries are valid pointers, causing a use-after-free when it accesses the dangling pointer. Fix both issues by iterating over the array in reverse order with a fixed loop bound. This ensures that all entries are examined and that the removal of an entry doesn't affect subsequent iterations. Reported-by: syzbot+e04e2c007ba2c80476cb@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=e04e2c007ba2c80476cb Tested-by: syzbot+e04e2c007ba2c80476cb@syzkaller.appspotmail.com Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kohei Enju <enjuk@amazon.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250629030833.6680-1-enjuk@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01ip6_tunnel: enable to change proto of fb tunnelsNicolas Dichtel
This is possible via the ioctl API: > ip -6 tunnel change ip6tnl0 mode any Let's align the netlink API: > ip link set ip6tnl0 type ip6tnl mode any Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://patch.msgid.link/20250630145602.1027220-1-nicolas.dichtel@6wind.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01net: ethtool: fix leaking netdev ref if ethnl_default_parse() failedJakub Kicinski
Ido spotted that I made a mistake in commit under Fixes, ethnl_default_parse() may acquire a dev reference even when it returns an error. This may have been driven by the code structure in dumps (which unconditionally release dev before handling errors), but it's too much of a trap. Functions should undo what they did before returning an error, rather than expecting caller to clean up. Rather than fixing ethnl_default_set_doit() directly make ethnl_default_parse() clean up errors. Reported-by: Ido Schimmel <idosch@idosch.org> Link: https://lore.kernel.org/aGEPszpq9eojNF4Y@shredder Fixes: 963781bdfe20 ("net: ethtool: call .parse_request for SET handlers") Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250630154053.1074664-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01Merge tag 'nfs-for-6.16-2' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS client fixes from Anna Schumaker: - Fix loop in GSS sequence number cache - Clean up /proc/net/rpc/nfs if nfs_fs_proc_net_init() fails - Fix a race to wake on NFS_LAYOUT_DRAIN - Fix handling of NFS level errors in I/O * tag 'nfs-for-6.16-2' of git://git.linux-nfs.org/projects/anna/linux-nfs: NFSv4/flexfiles: Fix handling of NFS level errors in I/O NFSv4/pNFS: Fix a race to wake on NFS_LAYOUT_DRAIN nfs: Clean up /proc/net/rpc/nfs when nfs_fs_proc_net_init() fails. sunrpc: fix loop in gss seqno cache