summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2021-05-28mptcp: do not reset MP_CAPABLE subflow on mapping errorsPaolo Abeni
When some mapping related errors occurs we close the main MPC subflow with a RST. We should instead fallback gracefully to TCP, and do the reset only for MPJ subflows. Fixes: d22f4988ffec ("mptcp: process MP_CAPABLE data option") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/192 Reported-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-28mptcp: always parse mptcp options for MPC reqskPaolo Abeni
In subflow_syn_recv_sock() we currently skip options parsing for OoO packet, given that such packets may not carry the relevant MPC option. If the peer generates an MPC+data TSO packet and some of the early segments are lost or get reorder, we server will ignore the peer key, causing transient, unexpected fallback to TCP. The solution is always parsing the incoming MPTCP options, and do the fallback only for in-order packets. This actually cleans the existing code a bit. Fixes: d22f4988ffec ("mptcp: process MP_CAPABLE data option") Reported-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-28mptcp: fix sk_forward_memory corruption on retransmissionPaolo Abeni
MPTCP sk_forward_memory handling is a bit special, as such field is protected by the msk socket spin_lock, instead of the plain socket lock. Currently we have a code path updating such field without handling the relevant lock: __mptcp_retrans() -> __mptcp_clean_una_wakeup() Several helpers in __mptcp_clean_una_wakeup() will update sk_forward_alloc, possibly causing such field corruption, as reported by Matthieu. Address the issue providing and using a new variant of blamed function which explicitly acquires the msk spin lock. Fixes: 64b9cea7a0af ("mptcp: fix spurious retransmissions") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/172 Reported-by: Matthieu Baerts <matthieu.baerts@tessares.net> Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-28Merge tag 'nfs-for-5.13-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client bugfixes from Trond Myklebust: "Stable fixes: - Fix v4.0/v4.1 SEEK_DATA return -ENOTSUPP when set NFS_V4_2 config - Fix Oops in xs_tcp_send_request() when transport is disconnected - Fix a NULL pointer dereference in pnfs_mark_matching_lsegs_return() Bugfixes: - Fix instances where signal_pending() should be fatal_signal_pending() - fix an incorrect limit in filelayout_decode_layout() - Fixes for the SUNRPC backlogged RPC queue - Don't corrupt the value of pg_bytes_written in nfs_do_recoalesce() - Revert commit 586a0787ce35 ("Clean up rpcrdma_prepare_readch()")" * tag 'nfs-for-5.13-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: nfs: Remove trailing semicolon in macros xprtrdma: Revert 586a0787ce35 NFSv4: Fix v4.0/v4.1 SEEK_DATA return -ENOTSUPP when set NFS_V4_2 config NFS: Clean up reset of the mirror accounting variables NFS: Don't corrupt the value of pg_bytes_written in nfs_do_recoalesce() NFS: Fix an Oopsable condition in __nfs_pageio_add_request() SUNRPC: More fixes for backlog congestion SUNRPC: Fix Oops in xs_tcp_send_request() when transport is disconnected NFSv4: Fix a NULL pointer dereference in pnfs_mark_matching_lsegs_return() SUNRPC in case of backlog, hand free slots directly to waiting task pNFS/NFSv4: Remove redundant initialization of 'rd_size' NFS: fix an incorrect limit in filelayout_decode_layout() fs/nfs: Use fatal_signal_pending instead of signal_pending
2021-05-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfJakub Kicinski
Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains Netfilter/IPVS fixes for net: 1) Fix incorrect sockopts unregistration from error path, from Florian Westphal. 2) A few patches to provide better error reporting when missing kernel netfilter options are missing in .config. 3) Fix dormant table flag updates. 4) Memleak in IPVS when adding service with IP_VS_SVC_F_HASHED flag. * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf: ipvs: ignore IP_VS_SVC_F_HASHED flag when adding service netfilter: nf_tables: fix table flag updates netfilter: nf_tables: extended netlink error reporting for chain type netfilter: nf_tables: missing error reporting for not selected expressions netfilter: conntrack: unregister ipv4 sockopts on error unwind ==================== Link: https://lore.kernel.org/r/20210527190115.98503-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-27net/sched: act_ct: Fix ct template allocation for zone 0Ariel Levkovich
Fix current behavior of skipping template allocation in case the ct action is in zone 0. Skipping the allocation may cause the datapath ct code to ignore the entire ct action with all its attributes (commit, nat) in case the ct action in zone 0 was preceded by a ct clear action. The ct clear action sets the ct_state to untracked and resets the skb->_nfct pointer. Under these conditions and without an allocated ct template, the skb->_nfct pointer will remain NULL which will cause the tc ct action handler to exit without handling commit and nat actions, if such exist. For example, the following rule in OVS dp: recirc_id(0x2),ct_state(+new-est-rel-rpl+trk),ct_label(0/0x1), \ in_port(eth0),actions:ct_clear,ct(commit,nat(src=10.11.0.12)), \ recirc(0x37a) Will result in act_ct skipping the commit and nat actions in zone 0. The change removes the skipping of template allocation for zone 0 and treats it the same as any other zone. Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct") Signed-off-by: Ariel Levkovich <lariel@nvidia.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Link: https://lore.kernel.org/r/20210526170110.54864-1-lariel@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-27net/sched: act_ct: Offload connections with commit actionPaul Blakey
Currently established connections are not offloaded if the filter has a "ct commit" action. This behavior will not offload connections of the following scenario: $ tc_filter add dev $DEV ingress protocol ip prio 1 flower \ ct_state -trk \ action ct commit action goto chain 1 $ tc_filter add dev $DEV ingress protocol ip chain 1 prio 1 flower \ action mirred egress redirect dev $DEV2 $ tc_filter add dev $DEV2 ingress protocol ip prio 1 flower \ action ct commit action goto chain 1 $ tc_filter add dev $DEV2 ingress protocol ip prio 1 chain 1 flower \ ct_state +trk+est \ action mirred egress redirect dev $DEV Offload established connections, regardless of the commit flag. Fixes: 46475bb20f4b ("net/sched: act_ct: Software offload of established flows") Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Paul Blakey <paulb@nvidia.com> Link: https://lore.kernel.org/r/1622029449-27060-1-git-send-email-paulb@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-27devlink: Correct VIRTUAL port to not have phys_port attributesParav Pandit
Physical port name, port number attributes do not belong to virtual port flavour. When VF or SF virtual ports are registered they incorrectly append "np0" string in the netdevice name of the VF/SF. Before this fix, VF netdevice name were ens2f0np0v0, ens2f0np0v1 for VF 0 and 1 respectively. After the fix, they are ens2f0v0, ens2f0v1. With this fix, reading /sys/class/net/ens2f0v0/phys_port_name returns -EOPNOTSUPP. Also devlink port show example for 2 VFs on one PF to ensure that any physical port attributes are not exposed. $ devlink port show pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false pci/0000:06:00.3/196608: type eth netdev ens2f0v0 flavour virtual splittable false pci/0000:06:00.4/262144: type eth netdev ens2f0v1 flavour virtual splittable false This change introduces a netdevice name change on systemd/udev version 245 and higher which honors phys_port_name sysfs file for generation of netdevice name. This also aligns to phys_port_name usage which is limited to switchdev ports as described in [1]. [1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/Documentation/networking/switchdev.rst Fixes: acf1ee44ca5d ("devlink: Introduce devlink port flavour virtual") Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20210526200027.14008-1-parav@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-27Bluetooth: fix the erroneous flush_work() orderLin Ma
In the cleanup routine for failed initialization of HCI device, the flush_work(&hdev->rx_work) need to be finished before the flush_work(&hdev->cmd_work). Otherwise, the hci_rx_work() can possibly invoke new cmd_work and cause a bug, like double free, in late processings. This was assigned CVE-2021-3564. This patch reorder the flush_work() to fix this bug. Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Johan Hedberg <johan.hedberg@gmail.com> Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: linux-bluetooth@vger.kernel.org Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Lin Ma <linma@zju.edu.cn> Signed-off-by: Hao Xiong <mart1n@zju.edu.cn> Cc: stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-05-27xprtrdma: Revert 586a0787ce35Chuck Lever
Commit 9ed5af268e88 ("SUNRPC: Clean up the handling of page padding in rpc_prepare_reply_pages()") [Dec 2020] affects RPC Replies that have a data payload (i.e., Write chunks). rpcrdma_prepare_readch(), as its name suggests, sets up Read chunks which are data payloads within RPC Calls. Those payloads are constructed by xdr_write_pages(), which continues to stuff the call buffer's tail kvec with the payload's XDR roundup. Thus removing the tail buffer logic in rpcrdma_prepare_readch() was the wrong thing to do. Fixes: 586a0787ce35 ("xprtrdma: Clean up rpcrdma_prepare_readch()") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-05-27ipvs: ignore IP_VS_SVC_F_HASHED flag when adding serviceJulian Anastasov
syzbot reported memory leak [1] when adding service with HASHED flag. We should ignore this flag both from sockopt and netlink provided data, otherwise the service is not hashed and not visible while releasing resources. [1] BUG: memory leak unreferenced object 0xffff888115227800 (size 512): comm "syz-executor263", pid 8658, jiffies 4294951882 (age 12.560s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff83977188>] kmalloc include/linux/slab.h:556 [inline] [<ffffffff83977188>] kzalloc include/linux/slab.h:686 [inline] [<ffffffff83977188>] ip_vs_add_service+0x598/0x7c0 net/netfilter/ipvs/ip_vs_ctl.c:1343 [<ffffffff8397d770>] do_ip_vs_set_ctl+0x810/0xa40 net/netfilter/ipvs/ip_vs_ctl.c:2570 [<ffffffff838449a8>] nf_setsockopt+0x68/0xa0 net/netfilter/nf_sockopt.c:101 [<ffffffff839ae4e9>] ip_setsockopt+0x259/0x1ff0 net/ipv4/ip_sockglue.c:1435 [<ffffffff839fa03c>] raw_setsockopt+0x18c/0x1b0 net/ipv4/raw.c:857 [<ffffffff83691f20>] __sys_setsockopt+0x1b0/0x360 net/socket.c:2117 [<ffffffff836920f2>] __do_sys_setsockopt net/socket.c:2128 [inline] [<ffffffff836920f2>] __se_sys_setsockopt net/socket.c:2125 [inline] [<ffffffff836920f2>] __x64_sys_setsockopt+0x22/0x30 net/socket.c:2125 [<ffffffff84350efa>] do_syscall_64+0x3a/0xb0 arch/x86/entry/common.c:47 [<ffffffff84400068>] entry_SYSCALL_64_after_hwframe+0x44/0xae Reported-and-tested-by: syzbot+e562383183e4b1766930@syzkaller.appspotmail.com Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Julian Anastasov <ja@ssi.bg> Reviewed-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-05-26Merge tag 'net-5.13-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Networking fixes for 5.13-rc4, including fixes from bpf, netfilter, can and wireless trees. Notably including fixes for the recently announced "FragAttacks" WiFi vulnerabilities. Rather large batch, touching some core parts of the stack, too, but nothing hair-raising. Current release - regressions: - tipc: make node link identity publish thread safe - dsa: felix: re-enable TAS guard band mode - stmmac: correct clocks enabled in stmmac_vlan_rx_kill_vid() - stmmac: fix system hang if change mac address after interface ifdown Current release - new code bugs: - mptcp: avoid OOB access in setsockopt() - bpf: Fix nested bpf_bprintf_prepare with more per-cpu buffers - ethtool: stats: fix a copy-paste error - init correct array size Previous releases - regressions: - sched: fix packet stuck problem for lockless qdisc - net: really orphan skbs tied to closing sk - mlx4: fix EEPROM dump support - bpf: fix alu32 const subreg bound tracking on bitwise operations - bpf: fix mask direction swap upon off reg sign change - bpf, offload: reorder offload callback 'prepare' in verifier - stmmac: Fix MAC WoL not working if PHY does not support WoL - packetmmap: fix only tx timestamp on request - tipc: skb_linearize the head skb when reassembling msgs Previous releases - always broken: - mac80211: address recent "FragAttacks" vulnerabilities - mac80211: do not accept/forward invalid EAPOL frames - mptcp: avoid potential error message floods - bpf, ringbuf: deny reserve of buffers larger than ringbuf to prevent out of buffer writes - bpf: forbid trampoline attach for functions with variable arguments - bpf: add deny list of functions to prevent inf recursion of tracing programs - tls splice: check SPLICE_F_NONBLOCK instead of MSG_DONTWAIT - can: isotp: prevent race between isotp_bind() and isotp_setsockopt() - netfilter: nft_set_pipapo_avx2: Add irq_fpu_usable() check, fallback to non-AVX2 version Misc: - bpf: add kconfig knob for disabling unpriv bpf by default" * tag 'net-5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (172 commits) net: phy: Document phydev::dev_flags bits allocation mptcp: validate 'id' when stopping the ADD_ADDR retransmit timer mptcp: avoid error message on infinite mapping mptcp: drop unconditional pr_warn on bad opt mptcp: avoid OOB access in setsockopt() nfp: update maintainer and mailing list addresses net: mvpp2: add buffer header handling in RX bnx2x: Fix missing error code in bnx2x_iov_init_one() net: zero-initialize tc skb extension on allocation net: hns: Fix kernel-doc sctp: fix the proc_handler for sysctl encap_port sctp: add the missing setting for asoc encap_port bpf, selftests: Adjust few selftest result_unpriv outcomes bpf: No need to simulate speculative domain for immediates bpf: Fix mask direction swap upon off reg sign change bpf: Wrap aux data inside bpf_sanitize_info container bpf: Fix BPF_LSM kconfig symbol dependency selftests/bpf: Add test for l3 use of bpf_redirect_peer bpftool: Add sock_release help info for cgroup attach/prog load command net: dsa: microchip: enable phy errata workaround on 9567 ...
2021-05-26locking/atomic: net: use linux/atomic.h for xchg & cmpxchgMark Rutland
As xchg*() and cmpxchg*() may be instrumented by atomic-instrumented.h, it's necessary to include <linux/atomic.h> to use these, rather than <asm/cmpxchg.h>, which is effectively an arch-internal header. In a couple of places we include <asm/cmpxchg.h>, but get away with this as <linux/atomic.h> gets pulled in inidrectly by another include. Before we convert more architectures to use atomic-instrumented.h, let's fix these up to use <linux/atomic.h> so that we don't make things more fragile. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210525140232.53872-3-mark.rutland@arm.com
2021-05-26SUNRPC: More fixes for backlog congestionTrond Myklebust
Ensure that we fix the XPRT_CONGESTED starvation issue for RDMA as well as socket based transports. Ensure we always initialise the request after waking up from the backlog list. Fixes: e877a88d1f06 ("SUNRPC in case of backlog, hand free slots directly to waiting task") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-05-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2021-05-26 The following pull-request contains BPF updates for your *net* tree. We've added 14 non-merge commits during the last 14 day(s) which contain a total of 17 files changed, 513 insertions(+), 231 deletions(-). The main changes are: 1) Fix bpf_skb_change_head() helper to reset mac_len, from Jussi Maki. 2) Fix masking direction swap upon off-reg sign change, from Daniel Borkmann. 3) Fix BPF offloads in verifier by reordering driver callback, from Yinjun Zhang. 4) BPF selftest for ringbuf mmap ro/rw restrictions, from Andrii Nakryiko. 5) Follow-up fixes to nested bprintf per-cpu buffers, from Florent Revest. 6) Fix bpftool sock_release attach point help info, from Liu Jian. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-25mptcp: validate 'id' when stopping the ADD_ADDR retransmit timerDavide Caratti
when Linux receives an echo-ed ADD_ADDR, it checks the IP address against the list of "announced" addresses. In case of a positive match, the timer that handles retransmissions is stopped regardless of the 'Address Id' in the received packet: this behaviour does not comply with RFC8684 3.4.1. Fix it by validating the 'Address Id' in received echo-ed ADD_ADDRs. Tested using packetdrill, with the following captured output: unpatched kernel: Out <...> Flags [.], ack 1, win 256, options [mptcp add-addr v1 id 1 198.51.100.2 hmac 0xfd2e62517888fe29,mptcp dss ack 3007449509], length 0 In <...> Flags [.], ack 1, win 257, options [mptcp add-addr v1-echo id 1 1.2.3.4,mptcp dss ack 3013740213], length 0 Out <...> Flags [.], ack 1, win 256, options [mptcp add-addr v1 id 1 198.51.100.2 hmac 0xfd2e62517888fe29,mptcp dss ack 3007449509], length 0 In <...> Flags [.], ack 1, win 257, options [mptcp add-addr v1-echo id 90 198.51.100.2,mptcp dss ack 3013740213], length 0 ^^^ retransmission is stopped here, but 'Address Id' is 90 patched kernel: Out <...> Flags [.], ack 1, win 256, options [mptcp add-addr v1 id 1 198.51.100.2 hmac 0x1cf372d59e05f4b8,mptcp dss ack 3007449509], length 0 In <...> Flags [.], ack 1, win 257, options [mptcp add-addr v1-echo id 1 1.2.3.4,mptcp dss ack 1672384568], length 0 Out <...> Flags [.], ack 1, win 256, options [mptcp add-addr v1 id 1 198.51.100.2 hmac 0x1cf372d59e05f4b8,mptcp dss ack 3007449509], length 0 In <...> Flags [.], ack 1, win 257, options [mptcp add-addr v1-echo id 90 198.51.100.2,mptcp dss ack 1672384568], length 0 Out <...> Flags [.], ack 1, win 256, options [mptcp add-addr v1 id 1 198.51.100.2 hmac 0x1cf372d59e05f4b8,mptcp dss ack 3007449509], length 0 In <...> Flags [.], ack 1, win 257, options [mptcp add-addr v1-echo id 1 198.51.100.2,mptcp dss ack 1672384568], length 0 ^^^ retransmission is stopped here, only when both 'Address Id' and 'IP Address' match Fixes: 00cfd77b9063 ("mptcp: retransmit ADD_ADDR when timeout") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-25mptcp: avoid error message on infinite mappingPaolo Abeni
Another left-over. Avoid flooding dmesg with useless text, we already have a MIB for that event. Fixes: 648ef4b88673 ("mptcp: Implement MPTCP receive path") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-25mptcp: drop unconditional pr_warn on bad optPaolo Abeni
This is a left-over of early day. A malicious peer can flood the kernel logs with useless messages, just drop it. Fixes: f296234c98a8 ("mptcp: Add handling of incoming MP_JOIN requests") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-25mptcp: avoid OOB access in setsockopt()Paolo Abeni
We can't use tcp_set_congestion_control() on an mptcp socket, as such function can end-up accessing a tcp-specific field - prior_ssthresh - causing an OOB access. To allow propagating the correct ca algo on subflow, cache the ca name at initialization time. Additionally avoid overriding the user-selected CA (if any) at clone time. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/182 Fixes: aa1fbd94e5c7 ("mptcp: sockopt: add TCP_CONGESTION and TCP_INFO") Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-25net: zero-initialize tc skb extension on allocationVlad Buslov
Function skb_ext_add() doesn't initialize created skb extension with any value and leaves it up to the user. However, since extension of type TC_SKB_EXT originally contained only single value tc_skb_ext->chain its users used to just assign the chain value without setting whole extension memory to zero first. This assumption changed when TC_SKB_EXT extension was extended with additional fields but not all users were updated to initialize the new fields which leads to use of uninitialized memory afterwards. UBSAN log: [ 778.299821] UBSAN: invalid-load in net/openvswitch/flow.c:899:28 [ 778.301495] load of value 107 is not a valid value for type '_Bool' [ 778.303215] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.12.0-rc7+ #2 [ 778.304933] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 778.307901] Call Trace: [ 778.308680] <IRQ> [ 778.309358] dump_stack+0xbb/0x107 [ 778.310307] ubsan_epilogue+0x5/0x40 [ 778.311167] __ubsan_handle_load_invalid_value.cold+0x43/0x48 [ 778.312454] ? memset+0x20/0x40 [ 778.313230] ovs_flow_key_extract.cold+0xf/0x14 [openvswitch] [ 778.314532] ovs_vport_receive+0x19e/0x2e0 [openvswitch] [ 778.315749] ? ovs_vport_find_upcall_portid+0x330/0x330 [openvswitch] [ 778.317188] ? create_prof_cpu_mask+0x20/0x20 [ 778.318220] ? arch_stack_walk+0x82/0xf0 [ 778.319153] ? secondary_startup_64_no_verify+0xb0/0xbb [ 778.320399] ? stack_trace_save+0x91/0xc0 [ 778.321362] ? stack_trace_consume_entry+0x160/0x160 [ 778.322517] ? lock_release+0x52e/0x760 [ 778.323444] netdev_frame_hook+0x323/0x610 [openvswitch] [ 778.324668] ? ovs_netdev_get_vport+0xe0/0xe0 [openvswitch] [ 778.325950] __netif_receive_skb_core+0x771/0x2db0 [ 778.327067] ? lock_downgrade+0x6e0/0x6f0 [ 778.328021] ? lock_acquire+0x565/0x720 [ 778.328940] ? generic_xdp_tx+0x4f0/0x4f0 [ 778.329902] ? inet_gro_receive+0x2a7/0x10a0 [ 778.330914] ? lock_downgrade+0x6f0/0x6f0 [ 778.331867] ? udp4_gro_receive+0x4c4/0x13e0 [ 778.332876] ? lock_release+0x52e/0x760 [ 778.333808] ? dev_gro_receive+0xcc8/0x2380 [ 778.334810] ? lock_downgrade+0x6f0/0x6f0 [ 778.335769] __netif_receive_skb_list_core+0x295/0x820 [ 778.336955] ? process_backlog+0x780/0x780 [ 778.337941] ? mlx5e_rep_tc_netdevice_event_unregister+0x20/0x20 [mlx5_core] [ 778.339613] ? seqcount_lockdep_reader_access.constprop.0+0xa7/0xc0 [ 778.341033] ? kvm_clock_get_cycles+0x14/0x20 [ 778.342072] netif_receive_skb_list_internal+0x5f5/0xcb0 [ 778.343288] ? __kasan_kmalloc+0x7a/0x90 [ 778.344234] ? mlx5e_handle_rx_cqe_mpwrq+0x9e0/0x9e0 [mlx5_core] [ 778.345676] ? mlx5e_xmit_xdp_frame_mpwqe+0x14d0/0x14d0 [mlx5_core] [ 778.347140] ? __netif_receive_skb_list_core+0x820/0x820 [ 778.348351] ? mlx5e_post_rx_mpwqes+0xa6/0x25d0 [mlx5_core] [ 778.349688] ? napi_gro_flush+0x26c/0x3c0 [ 778.350641] napi_complete_done+0x188/0x6b0 [ 778.351627] mlx5e_napi_poll+0x373/0x1b80 [mlx5_core] [ 778.352853] __napi_poll+0x9f/0x510 [ 778.353704] ? mlx5_flow_namespace_set_mode+0x260/0x260 [mlx5_core] [ 778.355158] net_rx_action+0x34c/0xa40 [ 778.356060] ? napi_threaded_poll+0x3d0/0x3d0 [ 778.357083] ? sched_clock_cpu+0x18/0x190 [ 778.358041] ? __common_interrupt+0x8e/0x1a0 [ 778.359045] __do_softirq+0x1ce/0x984 [ 778.359938] __irq_exit_rcu+0x137/0x1d0 [ 778.360865] irq_exit_rcu+0xa/0x20 [ 778.361708] common_interrupt+0x80/0xa0 [ 778.362640] </IRQ> [ 778.363212] asm_common_interrupt+0x1e/0x40 [ 778.364204] RIP: 0010:native_safe_halt+0xe/0x10 [ 778.365273] Code: 4f ff ff ff 4c 89 e7 e8 50 3f 40 fe e9 dc fe ff ff 48 89 df e8 43 3f 40 fe eb 90 cc e9 07 00 00 00 0f 00 2d 74 05 62 00 fb f4 <c3> 90 e9 07 00 00 00 0f 00 2d 64 05 62 00 f4 c3 cc cc 0f 1f 44 00 [ 778.369355] RSP: 0018:ffffffff84407e48 EFLAGS: 00000246 [ 778.370570] RAX: ffff88842de46a80 RBX: ffffffff84425840 RCX: ffffffff83418468 [ 778.372143] RDX: 000000000026f1da RSI: 0000000000000004 RDI: ffffffff8343af5e [ 778.373722] RBP: fffffbfff0884b08 R08: 0000000000000000 R09: ffff88842de46bcb [ 778.375292] R10: ffffed1085bc8d79 R11: 0000000000000001 R12: 0000000000000000 [ 778.376860] R13: ffffffff851124a0 R14: 0000000000000000 R15: dffffc0000000000 [ 778.378491] ? rcu_eqs_enter.constprop.0+0xb8/0xe0 [ 778.379606] ? default_idle_call+0x5e/0xe0 [ 778.380578] default_idle+0xa/0x10 [ 778.381406] default_idle_call+0x96/0xe0 [ 778.382350] do_idle+0x3d4/0x550 [ 778.383153] ? arch_cpu_idle_exit+0x40/0x40 [ 778.384143] cpu_startup_entry+0x19/0x20 [ 778.385078] start_kernel+0x3c7/0x3e5 [ 778.385978] secondary_startup_64_no_verify+0xb0/0xbb Fix the issue by providing new function tc_skb_ext_alloc() that allocates tc skb extension and initializes its memory to 0 before returning it to the caller. Change all existing users to use new API instead of calling skb_ext_add() directly. Fixes: 038ebb1a713d ("net/sched: act_ct: fix miss set mru for ovs after defrag in act_ct") Fixes: d29334c15d33 ("net/sched: act_api: fix miss set post_ct for ovs after do conntrack in act_ct") Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Acked-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-25sctp: fix the proc_handler for sysctl encap_portXin Long
proc_dointvec() cannot do min and max check for setting a value when extra1/extra2 is set, so change it to proc_dointvec_minmax() for sysctl encap_port. Fixes: e8a3001c2120 ("sctp: add encap_port for netns sock asoc and transport") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-25sctp: add the missing setting for asoc encap_portXin Long
This patch is to add the missing setting back for asoc encap_port. Fixes: 8dba29603b5c ("sctp: add SCTP_REMOTE_UDP_ENCAPS_PORT sockopt") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-24net: hsr: fix mac_len checksGeorge McCollister
Commit 2e9f60932a2c ("net: hsr: check skb can contain struct hsr_ethhdr in fill_frame_info") added the following which resulted in -EINVAL always being returned: if (skb->mac_len < sizeof(struct hsr_ethhdr)) return -EINVAL; mac_len was not being set correctly so this check completely broke HSR/PRP since it was always 14, not 20. Set mac_len correctly and modify the mac_len checks to test in the correct places since sometimes it is legitimately 14. Fixes: 2e9f60932a2c ("net: hsr: check skb can contain struct hsr_ethhdr in fill_frame_info") Signed-off-by: George McCollister <george.mccollister@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-24sch_dsmark: fix a NULL deref in qdisc_reset()Taehee Yoo
If Qdisc_ops->init() is failed, Qdisc_ops->reset() would be called. When dsmark_init(Qdisc_ops->init()) is failed, it possibly doesn't initialize dsmark_qdisc_data->q. But dsmark_reset(Qdisc_ops->reset()) uses dsmark_qdisc_data->q pointer wihtout any null checking. So, panic would occur. Test commands: sysctl net.core.default_qdisc=dsmark -w ip link add dummy0 type dummy ip link add vw0 link dummy0 type virt_wifi ip link set vw0 up Splat looks like: KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f] CPU: 3 PID: 684 Comm: ip Not tainted 5.12.0+ #910 RIP: 0010:qdisc_reset+0x2b/0x680 Code: 1f 44 00 00 48 b8 00 00 00 00 00 fc ff df 41 57 41 56 41 55 41 54 55 48 89 fd 48 83 c7 18 53 48 89 fa 48 c1 ea 03 48 83 ec 20 <80> 3c 02 00 0f 85 09 06 00 00 4c 8b 65 18 0f 1f 44 00 00 65 8b 1d RSP: 0018:ffff88800fda6bf8 EFLAGS: 00010282 RAX: dffffc0000000000 RBX: ffff8880050ed800 RCX: 0000000000000000 RDX: 0000000000000003 RSI: ffffffff99e34100 RDI: 0000000000000018 RBP: 0000000000000000 R08: fffffbfff346b553 R09: fffffbfff346b553 R10: 0000000000000001 R11: fffffbfff346b552 R12: ffffffffc0824940 R13: ffff888109e83800 R14: 00000000ffffffff R15: ffffffffc08249e0 FS: 00007f5042287680(0000) GS:ffff888119800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055ae1f4dbd90 CR3: 0000000006760002 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ? rcu_read_lock_bh_held+0xa0/0xa0 dsmark_reset+0x3d/0xf0 [sch_dsmark] qdisc_reset+0xa9/0x680 qdisc_destroy+0x84/0x370 qdisc_create_dflt+0x1fe/0x380 attach_one_default_qdisc.constprop.41+0xa4/0x180 dev_activate+0x4d5/0x8c0 ? __dev_open+0x268/0x390 __dev_open+0x270/0x390 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-24netfilter: nf_tables: fix table flag updatesPablo Neira Ayuso
The dormant flag need to be updated from the preparation phase, otherwise, two consecutive requests to dorm a table in the same batch might try to remove the same hooks twice, resulting in the following warning: hook not found, pf 3 num 0 WARNING: CPU: 0 PID: 334 at net/netfilter/core.c:480 __nf_unregister_net_hook+0x1eb/0x610 net/netfilter/core.c:480 Modules linked in: CPU: 0 PID: 334 Comm: kworker/u4:5 Not tainted 5.12.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: netns cleanup_net RIP: 0010:__nf_unregister_net_hook+0x1eb/0x610 net/netfilter/core.c:480 This patch is a partial revert of 0ce7cf4127f1 ("netfilter: nftables: update table flags from the commit phase") to restore the previous behaviour. However, there is still another problem: A batch containing a series of dorm-wakeup-dorm table and vice-versa also trigger the warning above since hook unregistration happens from the preparation phase, while hook registration occurs from the commit phase. To fix this problem, this patch adds two internal flags to annotate the original dormant flag status which are __NFT_TABLE_F_WAS_DORMANT and __NFT_TABLE_F_WAS_AWAKEN, to restore it from the abort path. The __NFT_TABLE_F_UPDATE bitmask allows to handle the dormant flag update with one single transaction. Reported-by: syzbot+7ad5cd1615f2d89c6e7e@syzkaller.appspotmail.com Fixes: 0ce7cf4127f1 ("netfilter: nftables: update table flags from the commit phase") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-05-23net/sched: fq_pie: fix OOB access in the traffic pathDavide Caratti
the following script: # tc qdisc add dev eth0 handle 0x1 root fq_pie flows 2 # tc qdisc add dev eth0 clsact # tc filter add dev eth0 egress matchall action skbedit priority 0x10002 # ping 192.0.2.2 -I eth0 -c2 -w1 -q produces the following splat: BUG: KASAN: slab-out-of-bounds in fq_pie_qdisc_enqueue+0x1314/0x19d0 [sch_fq_pie] Read of size 4 at addr ffff888171306924 by task ping/942 CPU: 3 PID: 942 Comm: ping Not tainted 5.12.0+ #441 Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014 Call Trace: dump_stack+0x92/0xc1 print_address_description.constprop.7+0x1a/0x150 kasan_report.cold.13+0x7f/0x111 fq_pie_qdisc_enqueue+0x1314/0x19d0 [sch_fq_pie] __dev_queue_xmit+0x1034/0x2b10 ip_finish_output2+0xc62/0x2120 __ip_finish_output+0x553/0xea0 ip_output+0x1ca/0x4d0 ip_send_skb+0x37/0xa0 raw_sendmsg+0x1c4b/0x2d00 sock_sendmsg+0xdb/0x110 __sys_sendto+0x1d7/0x2b0 __x64_sys_sendto+0xdd/0x1b0 do_syscall_64+0x3c/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7fe69735c3eb Code: 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f3 0f 1e fa 48 8d 05 75 42 2c 00 41 89 ca 8b 00 85 c0 75 14 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 75 c3 0f 1f 40 00 41 57 4d 89 c7 41 56 41 89 RSP: 002b:00007fff06d7fb38 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 000055e961413700 RCX: 00007fe69735c3eb RDX: 0000000000000040 RSI: 000055e961413700 RDI: 0000000000000003 RBP: 0000000000000040 R08: 000055e961410500 R09: 0000000000000010 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff06d81260 R13: 00007fff06d7fb40 R14: 00007fff06d7fc30 R15: 000055e96140f0a0 Allocated by task 917: kasan_save_stack+0x19/0x40 __kasan_kmalloc+0x7f/0xa0 __kmalloc_node+0x139/0x280 fq_pie_init+0x555/0x8e8 [sch_fq_pie] qdisc_create+0x407/0x11b0 tc_modify_qdisc+0x3c2/0x17e0 rtnetlink_rcv_msg+0x346/0x8e0 netlink_rcv_skb+0x120/0x380 netlink_unicast+0x439/0x630 netlink_sendmsg+0x719/0xbf0 sock_sendmsg+0xe2/0x110 ____sys_sendmsg+0x5ba/0x890 ___sys_sendmsg+0xe9/0x160 __sys_sendmsg+0xd3/0x170 do_syscall_64+0x3c/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae The buggy address belongs to the object at ffff888171306800 which belongs to the cache kmalloc-256 of size 256 The buggy address is located 36 bytes to the right of 256-byte region [ffff888171306800, ffff888171306900) The buggy address belongs to the page: page:00000000bcfb624e refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x171306 head:00000000bcfb624e order:1 compound_mapcount:0 flags: 0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff) raw: 0017ffffc0010200 dead000000000100 dead000000000122 ffff888100042b40 raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888171306800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff888171306880: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc >ffff888171306900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ^ ffff888171306980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff888171306a00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fix fq_pie traffic path to avoid selecting 'q->flows + q->flows_cnt' as a valid flow: it's an address beyond the allocated memory. Fixes: ec97ecf1ebe4 ("net: sched: add Flow Queue PIE packet scheduler") CC: stable@vger.kernel.org Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-23net/sched: fq_pie: re-factor fix for fq_pie endless loopDavide Caratti
the patch that fixed an endless loop in_fq_pie_init() was not considering that 65535 is a valid class id. The correct bugfix for this infinite loop is to change 'idx' to become an u32, like Colin proposed in the past [1]. Fix this as follows: - restore 65536 as maximum possible values of 'flows_cnt' - use u32 'idx' when iterating on 'q->flows' - fix the TDC selftest This reverts commit bb2f930d6dd708469a587dc9ed1efe1ef969c0bf. [1] https://lore.kernel.org/netdev/20210407163808.499027-1-colin.king@canonical.com/ CC: Colin Ian King <colin.king@canonical.com> CC: stable@vger.kernel.org Fixes: bb2f930d6dd7 ("net/sched: fix infinite loop in sch_fq_pie") Fixes: ec97ecf1ebe4 ("net: sched: add Flow Queue PIE packet scheduler") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-21ipv6: record frag_max_size in atomic fragments in input pathFrancesco Ruggeri
Commit dbd1759e6a9c ("ipv6: on reassembly, record frag_max_size") filled the frag_max_size field in IP6CB in the input path. The field should also be filled in case of atomic fragments. Fixes: dbd1759e6a9c ('ipv6: on reassembly, record frag_max_size') Signed-off-by: Francesco Ruggeri <fruggeri@arista.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-21RDS tcp loopback connection can hangRao Shoaib
When TCP is used as transport and a program on the system connects to RDS port 16385, connection is accepted but denied per the rules of RDS. However, RDS connections object is left in the list. Next loopback connection will select that connection object as it is at the head of list. The connection attempt will hang as the connection object is set to connect over TCP which is not allowed The issue can be reproduced easily, use rds-ping to ping a local IP address. After that use any program like ncat to connect to the same IP address and port 16385. This will hang so ctrl-c out. Now try rds-ping, it will hang. To fix the issue this patch adds checks to disallow the connection object creation and destroys the connection object. Signed-off-by: Rao Shoaib <rao.shoaib@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-21netfilter: nf_tables: extended netlink error reporting for chain typePablo Neira Ayuso
Users that forget to select the NAT chain type in netfilter's Kconfig hit ENOENT when adding the basechain. This report is however sparse since it might be the table, the chain or the kernel module that is missing/does not exist. This patch provides extended netlink error reporting for the NFTA_CHAIN_TYPE netlink attribute, which conveys the basechain type. If the user selects a basechain that his custom kernel does not support, the netlink extended error provides a more accurate hint on the described issue. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-05-21netfilter: nf_tables: missing error reporting for not selected expressionsPablo Neira Ayuso
Sometimes users forget to turn on nftables extensions from Kconfig that they need. In such case, the error reporting from userspace is misleading: $ sudo nft add rule x y counter Error: Could not process rule: No such file or directory add rule x y counter ^^^^^^^^^^^^^^^^^^^^ Add missing NL_SET_BAD_ATTR() to provide a hint: $ nft add rule x y counter Error: Could not process rule: No such file or directory add rule x y counter ^^^^^^^ Fixes: 83d9dcba06c5 ("netfilter: nf_tables: extended netlink error reporting for expressions") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-05-21bpf: Set mac_len in bpf_skb_change_headJussi Maki
The skb_change_head() helper did not set "skb->mac_len", which is problematic when it's used in combination with skb_redirect_peer(). Without it, redirecting a packet from a L3 device such as wireguard to the veth peer device will cause skb->data to point to the middle of the IP header on entry to tcp_v4_rcv() since the L2 header is not pulled correctly due to mac_len=0. Fixes: 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel infrastructure") Signed-off-by: Jussi Maki <joamaki@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210519154743.2554771-2-joamaki@gmail.com
2021-05-20Merge tag 'char-misc-5.13-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver fixes from Greg KH: "Here is a big set of char/misc/other driver fixes for 5.13-rc3. The majority here is the fallout of the umn.edu re-review of all prior submissions. That resulted in a bunch of reverts along with the "correct" changes made, such that there is no regression of any of the potential fixes that were made by those individuals. I would like to thank the over 80 different developers who helped with the review and fixes for this mess. Other than that, there's a few habanna driver fixes for reported issues, and some dyndbg fixes for reported problems. All of these have been in linux-next for a while with no reported problems" * tag 'char-misc-5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (82 commits) misc: eeprom: at24: check suspend status before disable regulator uio_hv_generic: Fix another memory leak in error handling paths uio_hv_generic: Fix a memory leak in error handling paths uio/uio_pci_generic: fix return value changed in refactoring Revert "Revert "ALSA: usx2y: Fix potential NULL pointer dereference"" dyndbg: drop uninformative vpr_info dyndbg: avoid calling dyndbg_emit_prefix when it has no work binder: Return EFAULT if we fail BINDER_ENABLE_ONEWAY_SPAM_DETECTION cdrom: gdrom: initialize global variable at init time brcmfmac: properly check for bus register errors Revert "brcmfmac: add a check for the status of usb_register" video: imsttfb: check for ioremap() failures Revert "video: imsttfb: fix potential NULL pointer dereferences" net: liquidio: Add missing null pointer checks Revert "net: liquidio: fix a NULL pointer dereference" media: gspca: properly check for errors in po1030_probe() Revert "media: gspca: Check the return value of write_bridge for timeout" media: gspca: mt9m111: Check write_bridge for timeout Revert "media: gspca: mt9m111: Check write_bridge for timeout" media: dvb: Add check on sp8870_readreg return ...
2021-05-20SUNRPC: Fix Oops in xs_tcp_send_request() when transport is disconnectedTrond Myklebust
If a disconnection occurs while we're trying to reply to a server callback, then we may end up calling xs_tcp_send_request() with a NULL value for transport->inet, which trips up the call to tcp_sock_set_cork(). Fixes: d737e5d41870 ("SUNRPC: Set TCP_CORK until the transmit queue is empty") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-05-20SUNRPC in case of backlog, hand free slots directly to waiting taskNeilBrown
If sunrpc.tcp_max_slot_table_entries is small and there are tasks on the backlog queue, then when a request completes it is freed and the first task on the queue is woken. The expectation is that it will wake and claim that request. However if it was a sync task and the waiting process was killed at just that moment, it will wake and NOT claim the request. As long as TASK_CONGESTED remains set, requests can only be claimed by tasks woken from the backlog, and they are woken only as requests are freed, so when a task doesn't claim a request, no other task can ever get that request until TASK_CONGESTED is cleared. Each time this happens the number of available requests is decreased by one. With a sufficiently high workload and sufficiently low setting of max_slot (16 in the case where this was seen), TASK_CONGESTED can remain set for an extended period, and the above scenario (of a process being killed just as its task was woken) can repeat until no requests can be allocated. Then traffic stops. This patch addresses the problem by introducing a positive handover of a request from a completing task to a backlog task - the request is never freed when there is a backlog. When a task is woken it might not already have a request attached in which case it is *not* freed (as with current code) but is initialised (if needed) and used. If it isn't used it will eventually be freed by rpc_exit_task(). xprt_release() is enhanced to be able to correctly release an uninitialised request. Fixes: ba60eb25ff6b ("SUNRPC: Fix a livelock problem in the xprt->backlog queue") Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-05-20netfilter: conntrack: unregister ipv4 sockopts on error unwindFlorian Westphal
When ipv6 sockopt register fails, the ipv4 one needs to be removed. Fixes: a0ae2562c6c ("netfilter: conntrack: remove l3proto abstraction") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-05-19ethtool: stats: Fix a copy-paste errorYueHaibing
data->ctrl_stats should be memset with correct size. Fixes: bfad2b979ddc ("ethtool: add interface to read standard MAC Ctrl stats") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-18tipc: simplify the finalize work queueXin Long
This patch is to use "struct work_struct" for the finalize work queue instead of "struct tipc_net_work", as it can get the "net" and "addr" from tipc_net's other members and there is no need to add extra net and addr in tipc_net by defining "struct tipc_net_work". Note that it's safe to get net from tn->bcl as bcl is always released after the finalize work queue is done. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-18batman-adv: Avoid WARN_ON timing related checksSven Eckelmann
The soft/batadv interface for a queued OGM can be changed during the time the OGM was queued for transmission and when the OGM is actually transmitted by the worker. But WARN_ON must be used to denote kernel bugs and not to print simple warnings. A warning can simply be printed using pr_warn. Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Reported-by: syzbot+c0b807de416427ff3dd1@syzkaller.appspotmail.com Fixes: ef0a937f7a14 ("batman-adv: consider outgoing interface in OGM sending") Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2021-05-17xfrm: Fix fall-through warnings for ClangGustavo A. R. Silva
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning by explicitly adding a break statement instead of letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-05-17tipc: Fix fall-through warnings for ClangGustavo A. R. Silva
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning by explicitly adding a break statement instead of letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-05-17sctp: Fix fall-through warnings for ClangGustavo A. R. Silva
In preparation to enable -Wimplicit-fallthrough for Clang, fix a couple of warnings by explicitly adding a break statement and replacing a comment with a goto statement instead of letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-05-17rds: Fix fall-through warnings for ClangGustavo A. R. Silva
In preparation to enable -Wimplicit-fallthrough for Clang, fix multiple warnings by explicitly adding multiple break statements instead of letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-05-17net/packet: Fix fall-through warnings for ClangGustavo A. R. Silva
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning by explicitly adding a break statement instead of letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-05-17net: netrom: Fix fall-through warnings for ClangGustavo A. R. Silva
In preparation to enable -Wimplicit-fallthrough for Clang, fix multiple warnings by explicitly adding multiple break statements instead of letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-05-17ipv4: Fix fall-through warnings for ClangGustavo A. R. Silva
In preparation to enable -Wimplicit-fallthrough for Clang, fix multiple warnings by explicitly adding multiple break statements instead of just letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-05-17netlink: disable IRQs for netlink_lock_table()Johannes Berg
Syzbot reports that in mac80211 we have a potential deadlock between our "local->stop_queue_reasons_lock" (spinlock) and netlink's nl_table_lock (rwlock). This is because there's at least one situation in which we might try to send a netlink message with this spinlock held while it is also possible to take the spinlock from a hardirq context, resulting in the following deadlock scenario reported by lockdep: CPU0 CPU1 ---- ---- lock(nl_table_lock); local_irq_disable(); lock(&local->queue_stop_reason_lock); lock(nl_table_lock); <Interrupt> lock(&local->queue_stop_reason_lock); This seems valid, we can take the queue_stop_reason_lock in any kind of context ("CPU0"), and call ieee80211_report_ack_skb() with the spinlock held and IRQs disabled ("CPU1") in some code path (ieee80211_do_stop() via ieee80211_free_txskb()). Short of disallowing netlink use in scenarios like these (which would be rather complex in mac80211's case due to the deep callchain), it seems the only fix for this is to disable IRQs while nl_table_lock is held to avoid hitting this scenario, this disallows the "CPU0" portion of the reported deadlock. Note that the writer side (netlink_table_grab()) already disables IRQs for this lock. Unfortunately though, this seems like a huge hammer, and maybe the whole netlink table locking should be reworked. Reported-by: syzbot+69ff9dff50dcfe14ddd4@syzkaller.appspotmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-17net/smc: remove device from smcd_dev_list after failed device_add()Julian Wiedmann
If the device_add() for a smcd_dev fails, there's no cleanup step that rolls back the earlier list_add(). The device subsequently gets freed, and we end up with a corrupted list. Add some error handling that removes the device from the list. Fixes: c6ba7c9ba43d ("net/smc: add base infrastructure for SMC-D and ISM") Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-17tipc: wait and exit until all work queues are doneXin Long
On some host, a crash could be triggered simply by repeating these commands several times: # modprobe tipc # tipc bearer enable media udp name UDP1 localip 127.0.0.1 # rmmod tipc [] BUG: unable to handle kernel paging request at ffffffffc096bb00 [] Workqueue: events 0xffffffffc096bb00 [] Call Trace: [] ? process_one_work+0x1a7/0x360 [] ? worker_thread+0x30/0x390 [] ? create_worker+0x1a0/0x1a0 [] ? kthread+0x116/0x130 [] ? kthread_flush_work_fn+0x10/0x10 [] ? ret_from_fork+0x35/0x40 When removing the TIPC module, the UDP tunnel sock will be delayed to release in a work queue as sock_release() can't be done in rtnl_lock(). If the work queue is schedule to run after the TIPC module is removed, kernel will crash as the work queue function cleanup_beareri() code no longer exists when trying to invoke it. To fix it, this patch introduce a member wq_count in tipc_net to track the numbers of work queues in schedule, and wait and exit until all work queues are done in tipc_exit_net(). Fixes: d0f91938bede ("tipc: add ip/udp media type") Reported-by: Shuang Li <shuali@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-17mld: fix panic in mld_newpack()Taehee Yoo
mld_newpack() doesn't allow to allocate high order page, only order-0 allocation is allowed. If headroom size is too large, a kernel panic could occur in skb_put(). Test commands: ip netns del A ip netns del B ip netns add A ip netns add B ip link add veth0 type veth peer name veth1 ip link set veth0 netns A ip link set veth1 netns B ip netns exec A ip link set lo up ip netns exec A ip link set veth0 up ip netns exec A ip -6 a a 2001:db8:0::1/64 dev veth0 ip netns exec B ip link set lo up ip netns exec B ip link set veth1 up ip netns exec B ip -6 a a 2001:db8:0::2/64 dev veth1 for i in {1..99} do let A=$i-1 ip netns exec A ip link add ip6gre$i type ip6gre \ local 2001:db8:$A::1 remote 2001:db8:$A::2 encaplimit 100 ip netns exec A ip -6 a a 2001:db8:$i::1/64 dev ip6gre$i ip netns exec A ip link set ip6gre$i up ip netns exec B ip link add ip6gre$i type ip6gre \ local 2001:db8:$A::2 remote 2001:db8:$A::1 encaplimit 100 ip netns exec B ip -6 a a 2001:db8:$i::2/64 dev ip6gre$i ip netns exec B ip link set ip6gre$i up done Splat looks like: kernel BUG at net/core/skbuff.c:110! invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI CPU: 0 PID: 7 Comm: kworker/0:1 Not tainted 5.12.0+ #891 Workqueue: ipv6_addrconf addrconf_dad_work RIP: 0010:skb_panic+0x15d/0x15f Code: 92 fe 4c 8b 4c 24 10 53 8b 4d 70 45 89 e0 48 c7 c7 00 ae 79 83 41 57 41 56 41 55 48 8b 54 24 a6 26 f9 ff <0f> 0b 48 8b 6c 24 20 89 34 24 e8 4a 4e 92 fe 8b 34 24 48 c7 c1 20 RSP: 0018:ffff88810091f820 EFLAGS: 00010282 RAX: 0000000000000089 RBX: ffff8881086e9000 RCX: 0000000000000000 RDX: 0000000000000089 RSI: 0000000000000008 RDI: ffffed1020123efb RBP: ffff888005f6eac0 R08: ffffed1022fc0031 R09: ffffed1022fc0031 R10: ffff888117e00187 R11: ffffed1022fc0030 R12: 0000000000000028 R13: ffff888008284eb0 R14: 0000000000000ed8 R15: 0000000000000ec0 FS: 0000000000000000(0000) GS:ffff888117c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f8b801c5640 CR3: 0000000033c2c006 CR4: 00000000003706f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ? ip6_mc_hdr.isra.26.constprop.46+0x12a/0x600 ? ip6_mc_hdr.isra.26.constprop.46+0x12a/0x600 skb_put.cold.104+0x22/0x22 ip6_mc_hdr.isra.26.constprop.46+0x12a/0x600 ? rcu_read_lock_sched_held+0x91/0xc0 mld_newpack+0x398/0x8f0 ? ip6_mc_hdr.isra.26.constprop.46+0x600/0x600 ? lock_contended+0xc40/0xc40 add_grhead.isra.33+0x280/0x380 add_grec+0x5ca/0xff0 ? mld_sendpack+0xf40/0xf40 ? lock_downgrade+0x690/0x690 mld_send_initial_cr.part.34+0xb9/0x180 ipv6_mc_dad_complete+0x15d/0x1b0 addrconf_dad_completed+0x8d2/0xbb0 ? lock_downgrade+0x690/0x690 ? addrconf_rs_timer+0x660/0x660 ? addrconf_dad_work+0x73c/0x10e0 addrconf_dad_work+0x73c/0x10e0 Allowing high order page allocation could fix this problem. Fixes: 72e09ad107e7 ("ipv6: avoid high order allocations") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>