summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2024-06-19net: hsr: cosmetic: Remove extra white spaceLukasz Majewski
This change just removes extra (i.e. not needed) white space in prp_drop_frame() function. No functional changes. Signed-off-by: Lukasz Majewski <lukma@denx.de> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://lore.kernel.org/r/20240618125817.1111070-1-lukma@denx.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-19af_packet: use sk_skb_reason_drop to free rx packetsYan Zhai
Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving socket to the tracepoint. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/r/202406011859.Aacus8GV-lkp@intel.com/ Signed-off-by: Yan Zhai <yan@cloudflare.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-19udp: use sk_skb_reason_drop to free rx packetsYan Zhai
Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving socket to the tracepoint. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/r/202406011751.NpVN0sSk-lkp@intel.com/ Signed-off-by: Yan Zhai <yan@cloudflare.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-19tcp: use sk_skb_reason_drop to free rx packetsYan Zhai
Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving socket to the tracepoint. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/r/202406011539.jhwBd7DX-lkp@intel.com/ Signed-off-by: Yan Zhai <yan@cloudflare.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-19net: raw: use sk_skb_reason_drop to free rx packetsYan Zhai
Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving socket to the tracepoint. Signed-off-by: Yan Zhai <yan@cloudflare.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-19ping: use sk_skb_reason_drop to free rx packetsYan Zhai
Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving socket to the tracepoint. Signed-off-by: Yan Zhai <yan@cloudflare.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-19net: introduce sk_skb_reason_drop functionYan Zhai
Long used destructors kfree_skb and kfree_skb_reason do not pass receiving socket to packet drop tracepoints trace_kfree_skb. This makes it hard to track packet drops of a certain netns (container) or a socket (user application). The naming of these destructors are also not consistent with most sk/skb operating functions, i.e. functions named "sk_xxx" or "skb_xxx". Introduce a new functions sk_skb_reason_drop as drop-in replacement for kfree_skb_reason on local receiving path. Callers can now pass receiving sockets to the tracepoints. kfree_skb and kfree_skb_reason are still usable but they are now just inline helpers that call sk_skb_reason_drop. Note it is not feasible to do the same to consume_skb. Packets not dropped can flow through multiple receive handlers, and have multiple receiving sockets. Leave it untouched for now. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yan Zhai <yan@cloudflare.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-19net: add rx_sk to trace_kfree_skbYan Zhai
skb does not include enough information to find out receiving sockets/services and netns/containers on packet drops. In theory skb->dev tells about netns, but it can get cleared/reused, e.g. by TCP stack for OOO packet lookup. Similarly, skb->sk often identifies a local sender, and tells nothing about a receiver. Allow passing an extra receiving socket to the tracepoint to improve the visibility on receiving drops. Signed-off-by: Yan Zhai <yan@cloudflare.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-19rds:Simplify the allocation of slab cachesHongfu Li
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. Signed-off-by: Hongfu Li <lihongfu@kylinos.cn> Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-17net: Move dev_set_hwtstamp_phylib to net/core/dev.hKory Maincent
This declaration was added to the header to be called from ethtool. ethtool is separated from core for code organization but it is not really a separate entity, it controls very core things. As ethtool is an internal stuff it is not wise to have it in netdevice.h. Move the declaration to net/core/dev.h instead. Remove the EXPORT_SYMBOL_GPL call as ethtool can not be built as a module. Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Link: https://lore.kernel.org/r/20240612-feature_ptp_netnext-v15-2-b2a086257b63@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-17fou: remove warn in gue_gro_receive on unsupported protocolWillem de Bruijn
Drop the WARN_ON_ONCE inn gue_gro_receive if the encapsulated type is not known or does not have a GRO handler. Such a packet is easily constructed. Syzbot generates them and sets off this warning. Remove the warning as it is expected and not actionable. The warning was previously reduced from WARN_ON to WARN_ON_ONCE in commit 270136613bf7 ("fou: Do WARN_ON_ONCE in gue_gro_receive for bad proto callbacks"). Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240614122552.1649044-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-17net/smc: Introduce IPPROTO_SMCD. Wythe
This patch allows to create smc socket via AF_INET, similar to the following code, /* create v4 smc sock */ v4 = socket(AF_INET, SOCK_STREAM, IPPROTO_SMC); /* create v6 smc sock */ v6 = socket(AF_INET6, SOCK_STREAM, IPPROTO_SMC); There are several reasons why we believe it is appropriate here: 1. For smc sockets, it actually use IPv4 (AF-INET) or IPv6 (AF-INET6) address. There is no AF_SMC address at all. 2. Create smc socket in the AF_INET(6) path, which allows us to reuse the infrastructure of AF_INET(6) path, such as common ebpf hooks. Otherwise, smc have to implement it again in AF_SMC path. Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Tested-by: Niklas Schnelle <schnelle@linux.ibm.com> Tested-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-17net/smc: expose smc proto operationsD. Wythe
Externalize smc proto operations (smc_xxx) to allow access from files other than af_smc.c This is in preparation for the subsequent implementation of the AF_INET version of SMC. Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Tested-by: Niklas Schnelle <schnelle@linux.ibm.com> Tested-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-17net/smc: refactoring initialization of smc sockD. Wythe
This patch aims to isolate the shared components of SMC socket allocation by introducing smc_sk_init() for sock initialization and __smc_create_clcsk() for the initialization of clcsock. This is in preparation for the subsequent implementation of the AF_INET version of SMC. Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Tested-by: Niklas Schnelle <schnelle@linux.ibm.com> Tested-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-14net: micro-optimize skb_datagram_iterSagi Grimberg
We only use the mapping in a single context in a short and contained scope, so kmap_local_page is sufficient and cheaper. This will also allow skb_datagram_iter to be called from softirq context. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Link: https://lore.kernel.org/r/20240613113504.1079860-1-sagi@grimberg.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-14atm: clean up a put_user() callsDan Carpenter
Unlike copy_from_user(), put_user() and get_user() return -EFAULT on error. Use the error code directly instead of setting it. Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/04a018e8-7433-4f67-8ddd-9357a0114f87@moroto.mountain Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-14net: qrtr: ns: Ignore ENODEV failures in nsChris Lew
Ignore the ENODEV failures returned by kernel_sendmsg(). These errors indicate that either the local port has been closed or the remote has gone down. Neither of these scenarios are fatal and will eventually be handled through packets that are later queued on the control port. Signed-off-by: Chris Lew <quic_clew@quicinc.com> Signed-off-by: Sarannya Sasikumar <quic_sarannya@quicinc.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240612063156.1377210-1-quic_sarannya@quicinc.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-14net: hsr: Send supervisory frames to HSR network with ProxyNodeTable dataLukasz Majewski
This patch provides support for sending supervision HSR frames with MAC addresses stored in ProxyNodeTable when RedBox (i.e. HSR-SAN) is enabled. Supervision frames with RedBox MAC address (appended as second TLV) are only send for ProxyNodeTable nodes. This patch series shall be tested with hsr_redbox.sh script. Signed-off-by: Lukasz Majewski <lukma@denx.de> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts, no adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-13Merge tag 'net-6.10-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bluetooth and netfilter. Slim pickings this time, probably a combination of summer, DevConf.cz, and the end of first half of the year at corporations. Current release - regressions: - Revert "igc: fix a log entry using uninitialized netdev", it traded lack of netdev name in a printk() for a crash Previous releases - regressions: - Bluetooth: L2CAP: fix rejecting L2CAP_CONN_PARAM_UPDATE_REQ - geneve: fix incorrectly setting lengths of inner headers in the skb, confusing the drivers and causing mangled packets - sched: initialize noop_qdisc owner to avoid false-positive recursion detection (recursing on CPU 0), which bubbles up to user space as a sendmsg() error, while noop_qdisc should silently drop - netdevsim: fix backwards compatibility in nsim_get_iflink() Previous releases - always broken: - netfilter: ipset: fix race between namespace cleanup and gc in the list:set type" * tag 'net-6.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (35 commits) bnxt_en: Adjust logging of firmware messages in case of released token in __hwrm_send() af_unix: Read with MSG_PEEK loops if the first unread byte is OOB bnxt_en: Cap the size of HWRM_PORT_PHY_QCFG forwarded response gve: Clear napi->skb before dev_kfree_skb_any() ionic: fix use after netif_napi_del() Revert "igc: fix a log entry using uninitialized netdev" net: bridge: mst: fix suspicious rcu usage in br_mst_set_state net: bridge: mst: pass vlan group directly to br_mst_vlan_set_state net/ipv6: Fix the RT cache flush via sysctl using a previous delay net: stmmac: replace priv->speed with the portTransmitRate from the tc-cbs parameters gve: ignore nonrelevant GSO type bits when processing TSO headers net: pse-pd: Use EOPNOTSUPP error code instead of ENOTSUPP netfilter: Use flowlabel flow key when re-routing mangled packets netfilter: ipset: Fix race between namespace cleanup and gc in the list:set type netfilter: nft_inner: validate mandatory meta and payload tcp: use signed arithmetic in tcp_rtx_probe0_timed_out() mailmap: map Geliang's new email address mptcp: pm: update add_addr counters after connect mptcp: pm: inc RmAddr MIB counter once per RM_ADDR ID mptcp: ensure snd_una is properly initialized on connect ...
2024-06-13Merge tag 'nfs-for-6.10-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client fixes from Trond Myklebust: "Bugfixes: - NFSv4.2: Fix a memory leak in nfs4_set_security_label - NFSv2/v3: abort nfs_atomic_open_v23 if the name is too long. - NFS: Add appropriate memory barriers to the sillyrename code - Propagate readlink errors in nfs_symlink_filler - NFS: don't invalidate dentries on transient errors - NFS: fix unnecessary synchronous writes in random write workloads - NFSv4.1: enforce rootpath check when deciding whether or not to trunk Other: - Change email address for Trond Myklebust due to email server concerns" * tag 'nfs-for-6.10-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS: add barriers when testing for NFS_FSDATA_BLOCKED SUNRPC: return proper error from gss_wrap_req_priv NFSv4.1 enforce rootpath check in fs_location query NFS: abort nfs_atomic_open_v23 if name is too long. nfs: don't invalidate dentries on transient errors nfs: Avoid flushing many pages with NFS_FILE_SYNC nfs: propagate readlink errors in nfs_symlink_filler MAINTAINERS: Change email address for Trond Myklebust NFSv4: Fix memory leak in nfs4_set_security_label
2024-06-13af_unix: Read with MSG_PEEK loops if the first unread byte is OOBRao Shoaib
Read with MSG_PEEK flag loops if the first byte to read is an OOB byte. commit 22dd70eb2c3d ("af_unix: Don't peek OOB data without MSG_OOB.") addresses the loop issue but does not address the issue that no data beyond OOB byte can be read. >>> from socket import * >>> c1, c2 = socketpair(AF_UNIX, SOCK_STREAM) >>> c1.send(b'a', MSG_OOB) 1 >>> c1.send(b'b') 1 >>> c2.recv(1, MSG_PEEK | MSG_DONTWAIT) b'b' >>> from socket import * >>> c1, c2 = socketpair(AF_UNIX, SOCK_STREAM) >>> c2.setsockopt(SOL_SOCKET, SO_OOBINLINE, 1) >>> c1.send(b'a', MSG_OOB) 1 >>> c1.send(b'b') 1 >>> c2.recv(1, MSG_PEEK | MSG_DONTWAIT) b'a' >>> c2.recv(1, MSG_PEEK | MSG_DONTWAIT) b'a' >>> c2.recv(1, MSG_DONTWAIT) b'a' >>> c2.recv(1, MSG_PEEK | MSG_DONTWAIT) b'b' >>> Fixes: 314001f0bf92 ("af_unix: Add OOB support") Signed-off-by: Rao Shoaib <Rao.Shoaib@oracle.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240611084639.2248934-1-Rao.Shoaib@oracle.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-12net: bridge: mst: fix suspicious rcu usage in br_mst_set_stateNikolay Aleksandrov
I converted br_mst_set_state to RCU to avoid a vlan use-after-free but forgot to change the vlan group dereference helper. Switch to vlan group RCU deref helper to fix the suspicious rcu usage warning. Fixes: 3a7c1661ae13 ("net: bridge: mst: fix vlan use-after-free") Reported-by: syzbot+9bbe2de1bc9d470eb5fe@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=9bbe2de1bc9d470eb5fe Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20240609103654.914987-3-razor@blackwall.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-12net: bridge: mst: pass vlan group directly to br_mst_vlan_set_stateNikolay Aleksandrov
Pass the already obtained vlan group pointer to br_mst_vlan_set_state() instead of dereferencing it again. Each caller has already correctly dereferenced it for their context. This change is required for the following suspicious RCU dereference fix. No functional changes intended. Fixes: 3a7c1661ae13 ("net: bridge: mst: fix vlan use-after-free") Reported-by: syzbot+9bbe2de1bc9d470eb5fe@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=9bbe2de1bc9d470eb5fe Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20240609103654.914987-2-razor@blackwall.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-12net/ipv6: Fix the RT cache flush via sysctl using a previous delayPetr Pavlu
The net.ipv6.route.flush system parameter takes a value which specifies a delay used during the flush operation for aging exception routes. The written value is however not used in the currently requested flush and instead utilized only in the next one. A problem is that ipv6_sysctl_rtcache_flush() first reads the old value of net->ipv6.sysctl.flush_delay into a local delay variable and then calls proc_dointvec() which actually updates the sysctl based on the provided input. Fix the problem by switching the order of the two operations. Fixes: 4990509f19e8 ("[NETNS][IPV6]: Make sysctls route per namespace.") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240607112828.30285-1-petr.pavlu@suse.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-12net: ipv4: Add a sysctl to set multipath hash seedPetr Machata
When calculating hashes for the purpose of multipath forwarding, both IPv4 and IPv6 code currently fall back on flow_hash_from_keys(). That uses a randomly-generated seed. That's a fine choice by default, but unfortunately some deployments may need a tighter control over the seed used. In this patch, make the seed configurable by adding a new sysctl key, net.ipv4.fib_multipath_hash_seed to control the seed. This seed is used specifically for multipath forwarding and not for the other concerns that flow_hash_from_keys() is used for, such as queue selection. Expose the knob as sysctl because other such settings, such as headers to hash, are also handled that way. Like those, the multipath hash seed is a per-netns variable. Despite being placed in the net.ipv4 namespace, the multipath seed sysctl is used for both IPv4 and IPv6, similarly to e.g. a number of TCP variables. The seed used by flow_hash_from_keys() is a 128-bit quantity. However it seems that usually the seed is a much more modest value. 32 bits seem typical (Cisco, Cumulus), some systems go even lower. For that reason, and to decouple the user interface from implementation details, go with a 32-bit quantity, which is then quadruplicated to form the siphash key. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240607151357.421181-3-petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-12net: ipv4,ipv6: Pass multipath hash computation through a helperPetr Machata
The following patches will add a sysctl to control multipath hash seed. In order to centralize the hash computation, add a helper, fib_multipath_hash_from_keys(), and have all IPv4 and IPv6 route.c invocations of flow_hash_from_keys() go through this helper instead. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240607151357.421181-2-petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-12Merge tag 'nf-24-06-11' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: Patch #1 fixes insufficient sanitization of netlink attributes for the inner expression which can trigger nul-pointer dereference, from Davide Ornaghi. Patch #2 address a report that there is a race condition between namespace cleanup and the garbage collection of the list:set type. This patch resolves this issue with other minor issues as well, from Jozsef Kadlecsik. Patch #3 ip6_route_me_harder() ignores flowlabel/dsfield when ip dscp has been mangled, this unbreaks ip6 dscp set $v, from Florian Westphal. All of these patches address issues that are present in several releases. * tag 'nf-24-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: Use flowlabel flow key when re-routing mangled packets netfilter: ipset: Fix race between namespace cleanup and gc in the list:set type netfilter: nft_inner: validate mandatory meta and payload ==================== Link: https://lore.kernel.org/r/20240611220323.413713-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-12net: add and use __skb_get_hash_symmetric_netFlorian Westphal
Similar to previous patch: apply same logic for __skb_get_hash_symmetric and let callers pass the netns to the dissector core. Existing function is turned into a wrapper to avoid adjusting all callers, nft_hash.c uses new function. Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240608221057.16070-3-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-12net: add and use skb_get_hash_netFlorian Westphal
Years ago flow dissector gained ability to delegate flow dissection to a bpf program, scoped per netns. Unfortunately, skb_get_hash() only gets an sk_buff argument instead of both net+skb. This means the flow dissector needs to obtain the netns pointer from somewhere else. The netns is derived from skb->dev, and if that is not available, from skb->sk. If neither is set, we hit a (benign) WARN_ON_ONCE(). Trying both dev and sk covers most cases, but not all, as recently reported by Christoph Paasch. In case of nf-generated tcp reset, both sk and dev are NULL: WARNING: .. net/core/flow_dissector.c:1104 skb_flow_dissect_flow_keys include/linux/skbuff.h:1536 [inline] skb_get_hash include/linux/skbuff.h:1578 [inline] nft_trace_init+0x7d/0x120 net/netfilter/nf_tables_trace.c:320 nft_do_chain+0xb26/0xb90 net/netfilter/nf_tables_core.c:268 nft_do_chain_ipv4+0x7a/0xa0 net/netfilter/nft_chain_filter.c:23 nf_hook_slow+0x57/0x160 net/netfilter/core.c:626 __ip_local_out+0x21d/0x260 net/ipv4/ip_output.c:118 ip_local_out+0x26/0x1e0 net/ipv4/ip_output.c:127 nf_send_reset+0x58c/0x700 net/ipv4/netfilter/nf_reject_ipv4.c:308 nft_reject_ipv4_eval+0x53/0x90 net/ipv4/netfilter/nft_reject_ipv4.c:30 [..] syzkaller did something like this: table inet filter { chain input { type filter hook input priority filter; policy accept; meta nftrace set 1 tcp dport 42 reject with tcp reset } chain output { type filter hook output priority filter; policy accept; # empty chain is enough } } ... then sends a tcp packet to port 42. Initial attempt to simply set skb->dev from nf_reject_ipv4 doesn't cover all cases: skbs generated via ipv4 igmp_send_report trigger similar splat. Moreover, Pablo Neira found that nft_hash.c uses __skb_get_hash_symmetric() which would trigger same warn splat for such skbs. Lets allow callers to pass the current netns explicitly. The nf_trace infrastructure is adjusted to use the new helper. __skb_get_hash_symmetric is handled in the next patch. Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/494 Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240608221057.16070-2-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-12net/tcp: Remove tcp_hash_fail()Dmitry Safonov
Now there are tracepoints, that cover all functionality of tcp_hash_fail(), but also wire up missing places They are also faster, can be disabled and provide filtering. This potentially may create a regression if a userspace depends on dmesg logs. Fingers crossed, let's see if anyone complains in reality. Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-12net/tcp: Add tcp-md5 and tcp-ao tracepointsDmitry Safonov
Instead of forcing userspace to parse dmesg (that's what currently is happening, at least in codebase of my current company), provide a better way, that can be enabled/disabled in runtime. Currently, there are already tcp events, add hashing related ones there, too. Rasdaemon currently exercises net_dev_xmit_timeout, devlink_health_report, but it'll be trivial to teach it to deal with failed hashes. Otherwise, BGP may trace/log them itself. Especially exciting for possible investigations is key rotation (RNext_key requests). Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-12net/tcp: Move tcp_inbound_hash() from headersDmitry Safonov
Two reasons: 1. It's grown up enough 2. In order to not do header spaghetti by including <trace/events/tcp.h>, which is necessary for TCP tracepoints. While at it, unexport and make static tcp_inbound_ao_hash(). Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-12net/tcp: Add a helper tcp_ao_hdr_maclen()Dmitry Safonov
It's going to be used more in TCP-AO tracepoints. Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-12net/tcp: Use static_branch_tcp_{md5,ao} to drop ifdefsDmitry Safonov
It's possible to clean-up some ifdefs by hiding that tcp_{md5,ao}_needed static branch is defined and compiled only under related configs, since commit 4c8530dc7d7d ("net/tcp: Only produce AO/MD5 logs if there are any keys"). Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-11Merge tag 'for-net-2024-06-10' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Luiz Augusto von Dentz says: ==================== bluetooth pull request for net: - hci_sync: fix not using correct handle - L2CAP: fix rejecting L2CAP_CONN_PARAM_UPDATE_REQ - L2CAP: fix connection setup in l2cap_connect * tag 'for-net-2024-06-10' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth: Bluetooth: fix connection setup in l2cap_connect Bluetooth: L2CAP: Fix rejecting L2CAP_CONN_PARAM_UPDATE_REQ Bluetooth: hci_sync: Fix not using correct handle ==================== Link: https://lore.kernel.org/r/20240610135803.920662-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-11net: core: Implement dstats-type stats collectionsJeremy Kerr
We currently have dev_get_tstats64() for collecting per-cpu stats of type pcpu_sw_netstats ("tstats"). However, tstats doesn't allow for accounting tx/rx drops. We do have a stats variant that does have stats for dropped packets: struct pcpu_dstats, but there are no core helpers for using those stats. The VRF driver uses dstats, by providing its own collation/fetch functions to do so. This change adds a common implementation for dstats-type collection, used when pcpu_stat_type == NETDEV_PCPU_STAT_DSTAT. This is based on the VRF driver's existing stats collator (plus the unused tx_drops stat from there). We will switch the VRF driver to use this in the next change. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240607-dstats-v3-2-cc781fe116f7@codeconstruct.com.au Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-11ip_tunnel: Move stats allocation to coreBreno Leitao
With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and convert veth & vrf"), stats allocation could be done on net core instead of this driver. With this new approach, the driver doesn't have to bother with error handling (allocation failure checking, making sure free happens in the right spot, etc). This is core responsibility now. Move ip_tunnel driver to leverage the core allocation. All the ip_tunnel_init() users call ip_tunnel_init() as part of their .ndo_init callback. The .ndo_init callback is called before the stats allocation in netdev_register(), thus, the allocation will happen before the netdev is visible. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240607084420.3932875-1-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-11net: dsa: Fix typo in NET_DSA_TAG_RTL4_A KconfigChris Packham
Fix a minor typo in the help text for the NET_DSA_TAG_RTL4_A config option. Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Luiz Angelo Daros de Luca <luizluca@gmail.com> Link: https://lore.kernel.org/r/20240607020843.1380735-1-chris.packham@alliedtelesis.co.nz Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-11netfilter: Use flowlabel flow key when re-routing mangled packetsFlorian Westphal
'ip6 dscp set $v' in an nftables outpute route chain has no effect. While nftables does detect the dscp change and calls the reroute hook. But ip6_route_me_harder never sets the dscp/flowlabel: flowlabel/dsfield routing rules are ignored and no reroute takes place. Thanks to Yi Chen for an excellent reproducer script that I used to validate this change. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Yi Chen <yiche@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-06-11netfilter: ipset: Fix race between namespace cleanup and gc in the list:set typeJozsef Kadlecsik
Lion Ackermann reported that there is a race condition between namespace cleanup in ipset and the garbage collection of the list:set type. The namespace cleanup can destroy the list:set type of sets while the gc of the set type is waiting to run in rcu cleanup. The latter uses data from the destroyed set which thus leads use after free. The patch contains the following parts: - When destroying all sets, first remove the garbage collectors, then wait if needed and then destroy the sets. - Fix the badly ordered "wait then remove gc" for the destroy a single set case. - Fix the missing rcu locking in the list:set type in the userspace test case. - Use proper RCU list handlings in the list:set type. The patch depends on c1193d9bbbd3 (netfilter: ipset: Add list flush to cancel_gc). Fixes: 97f7cf1cd80e (netfilter: ipset: fix performance regression in swap operation) Reported-by: Lion Ackermann <nnamrec@gmail.com> Tested-by: Lion Ackermann <nnamrec@gmail.com> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-06-11netfilter: nft_inner: validate mandatory meta and payloadDavide Ornaghi
Check for mandatory netlink attributes in payload and meta expression when used embedded from the inner expression, otherwise NULL pointer dereference is possible from userspace. Fixes: a150d122b6bd ("netfilter: nft_meta: add inner match support") Fixes: 3a07327d10a0 ("netfilter: nft_inner: support for inner tunnel header matching") Signed-off-by: Davide Ornaghi <d.ornaghi97@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-06-10tcp: use signed arithmetic in tcp_rtx_probe0_timed_out()Eric Dumazet
Due to timer wheel implementation, a timer will usually fire after its schedule. For instance, for HZ=1000, a timeout between 512ms and 4s has a granularity of 64ms. For this range of values, the extra delay could be up to 63ms. For TCP, this means that tp->rcv_tstamp may be after inet_csk(sk)->icsk_timeout whenever the timer interrupt finally triggers, if one packet came during the extra delay. We need to make sure tcp_rtx_probe0_timed_out() handles this case. Fixes: e89688e3e978 ("net: tcp: fix unexcepted socket die when snd_wnd is 0") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Menglong Dong <imagedong@tencent.com> Acked-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://lore.kernel.org/r/20240607125652.1472540-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-10mptcp: pm: update add_addr counters after connectYonglongLi
The creation of new subflows can fail for different reasons. If no subflow have been created using the received ADD_ADDR, the related counters should not be updated, otherwise they will never be decremented for events related to this ID later on. For the moment, the number of accepted ADD_ADDR is only decremented upon the reception of a related RM_ADDR, and only if the remote address ID is currently being used by at least one subflow. In other words, if no subflow can be created with the received address, the counter will not be decremented. In this case, it is then important not to increment pm.add_addr_accepted counter, and not to modify pm.accept_addr bit. Note that this patch does not modify the behaviour in case of failures later on, e.g. if the MP Join is dropped or rejected. The "remove invalid addresses" MP Join subtest has been modified to validate this case. The broadcast IP address is added before the "valid" address that will be used to successfully create a subflow, and the limit is decreased by one: without this patch, it was not possible to create the last subflow, because: - the broadcast address would have been accepted even if it was not usable: the creation of a subflow to this address results in an error, - the limit of 2 accepted ADD_ADDR would have then been reached. Fixes: 01cacb00b35c ("mptcp: add netlink-based PM") Cc: stable@vger.kernel.org Co-developed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: YonglongLi <liyonglong@chinatelecom.cn> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://lore.kernel.org/r/20240607-upstream-net-20240607-misc-fixes-v1-3-1ab9ddfa3d00@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-10mptcp: pm: inc RmAddr MIB counter once per RM_ADDR IDYonglongLi
The RmAddr MIB counter is supposed to be incremented once when a valid RM_ADDR has been received. Before this patch, it could have been incremented as many times as the number of subflows connected to the linked address ID, so it could have been 0, 1 or more than 1. The "RmSubflow" is incremented after a local operation. In this case, it is normal to tied it with the number of subflows that have been actually removed. The "remove invalid addresses" MP Join subtest has been modified to validate this case. A broadcast IP address is now used instead: the client will not be able to create a subflow to this address. The consequence is that when receiving the RM_ADDR with the ID attached to this broadcast IP address, no subflow linked to this ID will be found. Fixes: 7a7e52e38a40 ("mptcp: add RM_ADDR related mibs") Cc: stable@vger.kernel.org Co-developed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: YonglongLi <liyonglong@chinatelecom.cn> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://lore.kernel.org/r/20240607-upstream-net-20240607-misc-fixes-v1-2-1ab9ddfa3d00@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-10mptcp: ensure snd_una is properly initialized on connectPaolo Abeni
This is strictly related to commit fb7a0d334894 ("mptcp: ensure snd_nxt is properly initialized on connect"). It turns out that syzkaller can trigger the retransmit after fallback and before processing any other incoming packet - so that snd_una is still left uninitialized. Address the issue explicitly initializing snd_una together with snd_nxt and write_seq. Suggested-by: Mat Martineau <martineau@kernel.org> Fixes: 8fd738049ac3 ("mptcp: fallback in case of simultaneous connect") Cc: stable@vger.kernel.org Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/485 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://lore.kernel.org/r/20240607-upstream-net-20240607-misc-fixes-v1-1-1ab9ddfa3d00@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-10net/sched: initialize noop_qdisc ownerJohannes Berg
When the noop_qdisc owner isn't initialized, then it will be 0, so packets will erroneously be regarded as having been subject to recursion as long as only CPU 0 queues them. For non-SMP, that's all packets, of course. This causes a change in what's reported to userspace, normally noop_qdisc would drop packets silently, but with this change the syscall returns -ENOBUFS if RECVERR is also set on the socket. Fix this by initializing the owner field to -1, just like it would be for dynamically allocated qdiscs by qdisc_alloc(). Fixes: 0f022d32c3ec ("net/sched: Fix mirred deadlock on device recursion") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240607175340.786bfb938803.I493bf8422e36be4454c08880a8d3703cea8e421a@changeid Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-10Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-06-06 We've added 54 non-merge commits during the last 10 day(s) which contain a total of 50 files changed, 1887 insertions(+), 527 deletions(-). The main changes are: 1) Add a user space notification mechanism via epoll when a struct_ops object is getting detached/unregistered, from Kui-Feng Lee. 2) Big batch of BPF selftest refactoring for sockmap and BPF congctl tests, from Geliang Tang. 3) Add BTF field (type and string fields, right now) iterator support to libbpf instead of using existing callback-based approaches, from Andrii Nakryiko. 4) Extend BPF selftests for the latter with a new btf_field_iter selftest, from Alan Maguire. 5) Add new kfuncs for a generic, open-coded bits iterator, from Yafang Shao. 6) Fix BPF selftests' kallsyms_find() helper under kernels configured with CONFIG_LTO_CLANG_THIN, from Yonghong Song. 7) Remove a bunch of unused structs in BPF selftests, from David Alan Gilbert. 8) Convert test_sockmap section names into names understood by libbpf so it can deduce program type and attach type, from Jakub Sitnicki. 9) Extend libbpf with the ability to configure log verbosity via LIBBPF_LOG_LEVEL environment variable, from Mykyta Yatsenko. 10) Fix BPF selftests with regards to bpf_cookie and find_vma flakiness in nested VMs, from Song Liu. 11) Extend riscv32/64 JITs to introduce shift/add helpers to generate Zba optimization, from Xiao Wang. 12) Enable BPF programs to declare arrays and struct fields with kptr, bpf_rb_root, and bpf_list_head, from Kui-Feng Lee. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (54 commits) selftests/bpf: Drop useless arguments of do_test in bpf_tcp_ca selftests/bpf: Use start_test in test_dctcp in bpf_tcp_ca selftests/bpf: Use start_test in test_dctcp_fallback in bpf_tcp_ca selftests/bpf: Add start_test helper in bpf_tcp_ca selftests/bpf: Use connect_to_fd_opts in do_test in bpf_tcp_ca libbpf: Auto-attach struct_ops BPF maps in BPF skeleton selftests/bpf: Add btf_field_iter selftests selftests/bpf: Fix send_signal test with nested CONFIG_PARAVIRT libbpf: Remove callback-based type/string BTF field visitor helpers bpftool: Use BTF field iterator in btfgen libbpf: Make use of BTF field iterator in BTF handling code libbpf: Make use of BTF field iterator in BPF linker code libbpf: Add BTF field iterator selftests/bpf: Ignore .llvm.<hash> suffix in kallsyms_find() selftests/bpf: Fix bpf_cookie and find_vma in nested VM selftests/bpf: Test global bpf_list_head arrays. selftests/bpf: Test global bpf_rb_root arrays and fields in nested struct types. selftests/bpf: Test kptr arrays and kptrs in nested struct fields. bpf: limit the number of levels of a nested struct type. bpf: look into the types of the fields of a struct type recursively. ... ==================== Link: https://lore.kernel.org/r/20240606223146.23020-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-10Merge tag 'wireless-next-2024-06-07' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Kalle Valo says: ==================== wireless-next patches for v6.11 The first "new features" pull request for v6.11 with changes both in stack and in drivers. Nothing out of ordinary, except that we have two conflicts this time: net/mac80211/cfg.c https://lore.kernel.org/all/20240531124415.05b25e7a@canb.auug.org.au drivers/net/wireless/microchip/wilc1000/netdev.c https://lore.kernel.org/all/20240603110023.23572803@canb.auug.org.au Major changes: cfg80211/mac80211 * parse Transmit Power Envelope (TPE) data in mac80211 instead of in drivers wilc1000 * read MAC address during probe to make it visible to user space iwlwifi * bump FW API to 91 for BZ/SC devices * report 64-bit radiotap timestamp * enable P2P low latency by default * handle Transmit Power Envelope (TPE) advertised by AP * start using guard() rtlwifi * RTL8192DU support ath12k * remove unsupported tx monitor handling * channel 2 in 6 GHz band support * Spatial Multiplexing Power Save (SMPS) in 6 GHz band support * multiple BSSID (MBSSID) and Enhanced Multi-BSSID Advertisements (EMA) support * dynamic VLAN support * add panic handler for resetting the firmware state ath10k * add qcom,no-msa-ready-indicator Device Tree property * LED support for various chipsets * tag 'wireless-next-2024-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (194 commits) wifi: ath12k: add hw_link_id in ath12k_pdev wifi: ath12k: add panic handler wifi: rtw89: chan: Use swap() in rtw89_swap_sub_entity() wifi: brcm80211: remove unused structs wifi: brcm80211: use sizeof(*pointer) instead of sizeof(type) wifi: ath12k: do not process consecutive RDDM event dt-bindings: net: wireless: ath11k: Drop "qcom,ipq8074-wcss-pil" from example wifi: ath12k: fix memory leak in ath12k_dp_rx_peer_frag_setup() wifi: rtlwifi: handle return value of usb init TX/RX wifi: rtlwifi: Enable the new rtl8192du driver wifi: rtlwifi: Add rtl8192du/sw.c wifi: rtlwifi: Constify rtl_hal_cfg.{ops,usb_interface_cfg} and rtl_priv.cfg wifi: rtlwifi: Add rtl8192du/dm.{c,h} wifi: rtlwifi: Add rtl8192du/fw.{c,h} and rtl8192du/led.{c,h} wifi: rtlwifi: Add rtl8192du/rf.{c,h} wifi: rtlwifi: Add rtl8192du/trx.{c,h} wifi: rtlwifi: Add rtl8192du/phy.{c,h} wifi: rtlwifi: Add rtl8192du/hw.{c,h} wifi: rtlwifi: Add new members to struct rtl_priv for RTL8192DU wifi: rtlwifi: Add rtl8192du/table.{c,h} ... Signed-off-by: Jakub Kicinski <kuba@kernel.org> ==================== Link: https://lore.kernel.org/r/20240607093517.41394C2BBFC@smtp.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-10Bluetooth: fix connection setup in l2cap_connectPauli Virtanen
The amp_id argument of l2cap_connect() was removed in commit 84a4bb6548a2 ("Bluetooth: HCI: Remove HCI_AMP support") It was always called with amp_id == 0, i.e. AMP_ID_BREDR == 0x00 (ie. non-AMP controller). In the above commit, the code path for amp_id != 0 was preserved, although it should have used the amp_id == 0 one. Restore the previous behavior of the non-AMP code path, to fix problems with L2CAP connections. Fixes: 84a4bb6548a2 ("Bluetooth: HCI: Remove HCI_AMP support") Signed-off-by: Pauli Virtanen <pav@iki.fi> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>