summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2022-05-18random32: use real rng for non-deterministic randomnessJason A. Donenfeld
random32.c has two random number generators in it: one that is meant to be used deterministically, with some predefined seed, and one that does the same exact thing as random.c, except does it poorly. The first one has some use cases. The second one no longer does and can be replaced with calls to random.c's proper random number generator. The relatively recent siphash-based bad random32.c code was added in response to concerns that the prior random32.c was too deterministic. Out of fears that random.c was (at the time) too slow, this code was anonymously contributed. Then out of that emerged a kind of shadow entropy gathering system, with its own tentacles throughout various net code, added willy nilly. Stop👏making👏bespoke👏random👏number👏generators👏. Fortunately, recent advances in random.c mean that we can stop playing with this sketchiness, and just use get_random_u32(), which is now fast enough. In micro benchmarks using RDPMC, I'm seeing the same median cycle count between the two functions, with the mean being _slightly_ higher due to batches refilling (which we can optimize further need be). However, when doing *real* benchmarks of the net functions that actually use these random numbers, the mean cycles actually *decreased* slightly (with the median still staying the same), likely because the additional prandom code means icache misses and complexity, whereas random.c is generally already being used by something else nearby. The biggest benefit of this is that there are many users of prandom who probably should be using cryptographically secure random numbers. This makes all of those accidental cases become secure by just flipping a switch. Later on, we can do a tree-wide cleanup to remove the static inline wrapper functions that this commit adds. There are also some low-ish hanging fruits for making this even faster in the future: a get_random_u16() function for use in the networking stack will give a 2x performance boost there, using SIMD for ChaCha20 will let us compute 4 or 8 or 16 blocks of output in parallel, instead of just one, giving us large buffers for cheap, and introducing a get_random_*_bh() function that assumes irqs are already disabled will shave off a few cycles for ordinary calls. These are things we can chip away at down the road. Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-18mptcp: Do TCP fallback on early DSS checksum failureMat Martineau
RFC 8684 section 3.7 describes several opportunities for a MPTCP connection to "fall back" to regular TCP early in the connection process, before it has been confirmed that MPTCP options can be successfully propagated on all SYN, SYN/ACK, and data packets. If a peer acknowledges the first received data packet with a regular TCP header (no MPTCP options), fallback is allowed. If the recipient of that first data packet finds a MPTCP DSS checksum error, this provides an opportunity to fail gracefully with a TCP fallback rather than resetting the connection (as might happen if a checksum failure were detected later). This commit modifies the checksum failure code to attempt fallback on the initial subflow of a MPTCP connection, only if it's a failure in the first data mapping. In cases where the peer initiates the connection, requests checksums, is the first to send data, and the peer is sending incorrect checksums (see https://github.com/multipath-tcp/mptcp_net-next/issues/275), this allows the connection to proceed as TCP rather than reset. Fixes: dd8bcd1768ff ("mptcp: validate the data checksum") Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-18mptcp: fix checksum byte orderPaolo Abeni
The MPTCP code typecasts the checksum value to u16 and then converts it to big endian while storing the value into the MPTCP option. As a result, the wire encoding for little endian host is wrong, and that causes interoperabilty interoperability issues with other implementation or host with different endianness. Address the issue writing in the packet the unmodified __sum16 value. MPTCP checksum is disabled by default, interoperating with systems with bad mptcp-level csum encoding should cause fallback to TCP. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/275 Fixes: c5b39e26d003 ("mptcp: send out checksum for DSS") Fixes: 390b95a5fb84 ("mptcp: receive checksum for DSS") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-18Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2022-05-18 1) Fix "disable_policy" flag use when arriving from different devices. From Eyal Birger. 2) Fix error handling of pfkey_broadcast in function pfkey_process. From Jiasheng Jiang. 3) Check the encryption module availability consistency in pfkey. From Thomas Bartschies. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-18net: af_key: check encryption module availability consistencyThomas Bartschies
Since the recent introduction supporting the SM3 and SM4 hash algos for IPsec, the kernel produces invalid pfkey acquire messages, when these encryption modules are disabled. This happens because the availability of the algos wasn't checked in all necessary functions. This patch adds these checks. Signed-off-by: Thomas Bartschies <thomas.bartschies@cvk.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-05-18net: af_key: add check for pfkey_broadcast in function pfkey_processJiasheng Jiang
If skb_clone() returns null pointer, pfkey_broadcast() will return error. Therefore, it should be better to check the return value of pfkey_broadcast() and return error if fails. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-05-18netfilter: ctnetlink: fix up for "netfilter: conntrack: remove unconfirmed list"Stephen Rothwell
After merging the net-next tree, today's linux-next build (powerpc ppc64_defconfig) produced this warning: nf_conntrack_netlink.c:1717 warning: 'ctnetlink_dump_one_entry' defined but not used Fixes: 8a75a2c17410 ("netfilter: conntrack: remove unconfirmed list") Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Florian Westphal <fw@strlen.de>
2022-05-17dn_route: set rt neigh to blackhole_netdev instead of loopback_dev in ifdownXin Long
Like other places in ipv4/6 dst ifdown, change to use blackhole_netdev instead of pernet loopback_dev in dn dst ifdown. Signed-off-by: Xin Long <lucien.xin@gmail.com> Link: https://lore.kernel.org/r/0cdf10e5a4af509024f08644919121fb71645bc2.1652751029.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-17NFC: nci: fix sleep in atomic context bugs caused by nci_skb_allocDuoming Zhou
There are sleep in atomic context bugs when the request to secure element of st-nci is timeout. The root cause is that nci_skb_alloc with GFP_KERNEL parameter is called in st_nci_se_wt_timeout which is a timer handler. The call paths that could trigger bugs are shown below: (interrupt context 1) st_nci_se_wt_timeout nci_hci_send_event nci_hci_send_data nci_skb_alloc(..., GFP_KERNEL) //may sleep (interrupt context 2) st_nci_se_wt_timeout nci_hci_send_event nci_hci_send_data nci_send_data nci_queue_tx_data_frags nci_skb_alloc(..., GFP_KERNEL) //may sleep This patch changes allocation mode of nci_skb_alloc from GFP_KERNEL to GFP_ATOMIC in order to prevent atomic context sleeping. The GFP_ATOMIC flag makes memory allocation operation could be used in atomic context. Fixes: ed06aeefdac3 ("nfc: st-nci: Rename st21nfcb to st-nci") Signed-off-by: Duoming Zhou <duoming@zju.edu.cn> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20220517012530.75714-1-duoming@zju.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-17net/smc: rdma write inline if qp has sufficient inline spaceGuangguan Wang
Rdma write with inline flag when sending small packages, whose length is shorter than the qp's max_inline_data, can help reducing latency. In my test environment, which are 2 VMs running on the same physical host and whose NICs(ConnectX-4Lx) are working on SR-IOV mode, qperf shows 0.5us-0.7us improvement in latency. Test command: server: smc_run taskset -c 1 qperf client: smc_run taskset -c 1 qperf <server ip> -oo \ msg_size:1:2K:*2 -t 30 -vu tcp_lat The results shown below: msgsize before after 1B 11.2 us 10.6 us (-0.6 us) 2B 11.2 us 10.7 us (-0.5 us) 4B 11.3 us 10.7 us (-0.6 us) 8B 11.2 us 10.6 us (-0.6 us) 16B 11.3 us 10.7 us (-0.6 us) 32B 11.3 us 10.6 us (-0.7 us) 64B 11.2 us 11.2 us (0 us) 128B 11.2 us 11.2 us (0 us) 256B 11.2 us 11.2 us (0 us) 512B 11.4 us 11.3 us (-0.1 us) 1KB 11.4 us 11.5 us (0.1 us) 2KB 11.5 us 11.5 us (0 us) Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Tested-by: kernel test robot <lkp@intel.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-17net/smc: send cdc msg inline if qp has sufficient inline spaceGuangguan Wang
As cdc msg's length is 44B, cdc msgs can be sent inline in most rdma devices, which can help reducing sending latency. In my test environment, which are 2 VMs running on the same physical host and whose NICs(ConnectX-4Lx) are working on SR-IOV mode, qperf shows 0.4us-0.7us improvement in latency. Test command: server: smc_run taskset -c 1 qperf client: smc_run taskset -c 1 qperf <server ip> -oo \ msg_size:1:2K:*2 -t 30 -vu tcp_lat The results shown below: msgsize before after 1B 11.9 us 11.2 us (-0.7 us) 2B 11.7 us 11.2 us (-0.5 us) 4B 11.7 us 11.3 us (-0.4 us) 8B 11.6 us 11.2 us (-0.4 us) 16B 11.7 us 11.3 us (-0.4 us) 32B 11.7 us 11.3 us (-0.4 us) 64B 11.7 us 11.2 us (-0.5 us) 128B 11.6 us 11.2 us (-0.4 us) 256B 11.8 us 11.2 us (-0.6 us) 512B 11.8 us 11.4 us (-0.4 us) 1KB 11.9 us 11.4 us (-0.5 us) 2KB 12.1 us 11.5 us (-0.6 us) Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Tested-by: kernel test robot <lkp@intel.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-17mac80211: refactor freeing the next_beaconJohannes Berg
We have this code seven times, refactor it into a separate function. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-05-17ax25: merge repeat codes in ax25_dev_device_down()Lu Wei
Merge repeat codes to reduce the duplication. Signed-off-by: Lu Wei <luwei32@huawei.com> Link: https://lore.kernel.org/r/20220516062804.254742-1-luwei32@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-05-17xfrm: set dst dev to blackhole_netdev instead of loopback_dev in ifdownXin Long
The global blackhole_netdev has replaced pernet loopback_dev to become the one given to the object that holds an netdev when ifdown in many places of ipv4 and ipv6 since commit 8d7017fd621d ("blackhole_netdev: use blackhole_netdev to invalidate dst entries"). Especially after commit faab39f63c1f ("net: allow out-of-order netdev unregistration"), it's no longer safe to use loopback_dev that may be freed before other netdev. This patch is to set dst dev to blackhole_netdev instead of loopback_dev in ifdown. v1->v2: - add Fixes tag as Eric suggested. Fixes: faab39f63c1f ("net: allow out-of-order netdev unregistration") Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/e8c87482998ca6fcdab214f5a9d582899ec0c648.1652665047.git.lucien.xin@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-05-16Merge tag 'linux-can-next-for-5.19-20220516' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next Marc Kleine-Budde says: ==================== pull-request: can-next 2022-05-16 the first 2 patches are by me and target the CAN raw protocol. The 1st removes an unneeded assignment, the other one adds support for SO_TXTIME/SCM_TXTIME. Oliver Hartkopp contributes 2 patches for the ISOTP protocol. The 1st adds support for transmission without flow control, the other let's bind() return an error on incorrect CAN ID formatting. Geert Uytterhoeven contributes a patch to clean up ctucanfd's Kconfig file. Vincent Mailhol's patch for the slcan driver uses the proper function to check for invalid CAN frames in the xmit callback. The next patch is by Geert Uytterhoeven and makes the interrupt-names of the renesas,rcar-canfd dt bindings mandatory. A patch by my update the ctucanfd dt bindings to include the common CAN controller bindings. The last patch is by Akira Yokosawa and fixes a breakage the ctucanfd's documentation. * tag 'linux-can-next-for-5.19-20220516' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next: docs: ctucanfd: Use 'kernel-figure' directive instead of 'figure' dt-bindings: can: ctucanfd: include common CAN controller bindings dt-bindings: can: renesas,rcar-canfd: Make interrupt-names required can: slcan: slc_xmit(): use can_dropped_invalid_skb() instead of manual check can: ctucanfd: Let users select instead of depend on CAN_CTUCANFD can: isotp: isotp_bind(): return -EINVAL on incorrect CAN ID formatting can: isotp: add support for transmission without flow control can: raw: add support for SO_TXTIME/SCM_TXTIME can: raw: raw_sendmsg(): remove not needed setting of skb->sk ==================== Link: https://lore.kernel.org/r/20220516202625.1129281-1-mkl@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-16af_unix: Silence randstruct GCC plugin warningKees Cook
While preparing for Clang randstruct support (which duplicated many of the warnings the randstruct GCC plugin warned about), one strange one remained only for the randstruct GCC plugin. Eliminating this rids the plugin of the last exception. It seems the plugin is happy to dereference individual members of a cross-struct cast, but it is upset about casting to a whole object pointer. This only manifests in one place in the kernel, so just replace the variable with individual member accesses. There is no change in executable instruction output. Drop the last exception from the randstruct GCC plugin. Cc: "David S. Miller" <davem@davemloft.net> Cc: Christoph Hellwig <hch@infradead.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Cong Wang <cong.wang@bytedance.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: netdev@vger.kernel.org Cc: linux-hardening@vger.kernel.org Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Link: https://lore.kernel.org/lkml/20220511022217.58586-1-kuniyu@amazon.co.jp Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/lkml/20220511151542.4cb3ff17@kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
2022-05-16mptcp: sockopt: add TCP_DEFER_ACCEPT supportFlorian Westphal
Support this via passthrough to the underlying tcp listener socket. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/271 Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-16Revert "mptcp: add data lock for sk timers"Paolo Abeni
This reverts commit 4293248c6704b854bf816aa1967e433402bee11c. Additional locks are not needed, all the touched sections are already under mptcp socket lock protection. Fixes: 4293248c6704 ("mptcp: add data lock for sk timers") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-16can: isotp: isotp_bind(): return -EINVAL on incorrect CAN ID formattingOliver Hartkopp
Commit 3ea566422cbd ("can: isotp: sanitize CAN ID checks in isotp_bind()") checks the given CAN ID address information by sanitizing the input values. This check (silently) removes obsolete bits by masking the given CAN IDs. Derek Will suggested to give a feedback to the application programmer when the 'sanitizing' was actually needed which means the programmer provided CAN ID content in a wrong format (e.g. SFF CAN IDs with a CAN ID > 0x7FF). Link: https://lore.kernel.org/all/20220515181633.76671-1-socketcan@hartkopp.net Suggested-by: Derek Will <derekrobertwill@gmail.com> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-05-16can: isotp: add support for transmission without flow controlOliver Hartkopp
Usually the ISO 15765-2 protocol is a point-to-point protocol to transfer segmented PDUs to a dedicated receiver. This receiver sends a flow control message to specify protocol options and timings (e.g. block size / STmin). The so called functional addressing communication allows a 1:N communication but is limited to a single frame length. This new CAN_ISOTP_CF_BROADCAST allows an unconfirmed 1:N communication with PDU length that would not fit into a single frame. This feature is not covered by the ISO 15765-2 standard. Link: https://lore.kernel.org/all/20220507115558.19065-1-socketcan@hartkopp.net Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-05-16can: raw: add support for SO_TXTIME/SCM_TXTIMEMarc Kleine-Budde
This patch calls into sock_cmsg_send() to parse the user supplied control information into a struct sockcm_cookie. Then assign the requested transmit time to the skb. This makes it possible to use the Earliest TXTIME First (ETF) packet scheduler with the CAN_RAW protocol. The user can send a CAN_RAW frame with a TXTIME and the kernel (with the ETF scheduler) will take care of sending it to the network interface. Link: https://lore.kernel.org/all/20220502091946.1916211-3-mkl@pengutronix.de Acked-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-05-16can: raw: raw_sendmsg(): remove not needed setting of skb->skMarc Kleine-Budde
The skb in raw_sendmsg() is allocated with sock_alloc_send_skb(), which subsequently calls sock_alloc_send_pskb() -> skb_set_owner_w(), which assigns "skb->sk = sk". This patch removes the not needed setting of skb->sk. Link: https://lore.kernel.org/all/20220502091946.1916211-2-mkl@pengutronix.de Acked-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-05-16netfilter: conntrack: remove pr_debug callsites from tcp trackerFlorian Westphal
They are either obsolete or useless. Those in the normal processing path cannot be enabled on a production system; they generate too much noise. One pr_debug call resides in an error path and does provide useful info, merge it with the existing nf_log_invalid(). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-16netfilter: nf_conncount: reduce unnecessary GCWilliam Tu
Currently nf_conncount can trigger garbage collection (GC) at multiple places. Each GC process takes a spin_lock_bh to traverse the nf_conncount_list. We found that when testing port scanning use two parallel nmap, because the number of connection increase fast, the nf_conncount_count and its subsequent call to __nf_conncount_add take too much time, causing several CPU lockup. This happens when user set the conntrack limit to +20,000, because the larger the limit, the longer the list that GC has to traverse. The patch mitigate the performance issue by avoiding unnecessary GC with a timestamp. Whenever nf_conncount has done a GC, a timestamp is updated, and beforce the next time GC is triggered, we make sure it's more than a jiffies. By doin this we can greatly reduce the CPU cycles and avoid the softirq lockup. To reproduce it in OVS, $ ovs-appctl dpctl/ct-set-limits zone=1,limit=20000 $ ovs-appctl dpctl/ct-get-limits At another machine, runs two nmap $ nmap -p1- <IP> $ nmap -p1- <IP> Signed-off-by: William Tu <u9012063@gmail.com> Co-authored-by: Yifeng Sun <pkusunyifeng@gmail.com> Reported-by: Greg Rose <gvrose8192@gmail.com> Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-16netfilter: Use l3mdev flow key when re-routing mangled packetsMartin Willi
Commit 40867d74c374 ("net: Add l3mdev index to flow struct and avoid oif reset for port devices") introduces a flow key specific for layer 3 domains, such as a VRF master device. This allows for explicit VRF domain selection instead of abusing the oif flow key. Update ip[6]_route_me_harder() to make use of that new key when re-routing mangled packets within VRFs instead of setting the flow oif, making it consistent with other users. Signed-off-by: Martin Willi <martin@strongswan.org> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-16netfilter: nft_flow_offload: fix offload with pppoe + vlanFelix Fietkau
When running a combination of PPPoE on top of a VLAN, we need to set info->outdev to the PPPoE device, otherwise PPPoE encap is skipped during software offload. Fixes: 72efd585f714 ("netfilter: flowtable: add pppoe support") Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-16net: fix dev_fill_forward_path with pppoe + bridgeFelix Fietkau
When calling dev_fill_forward_path on a pppoe device, the provided destination address is invalid. In order for the bridge fdb lookup to succeed, the pppoe code needs to update ctx->daddr to the correct value. Fix this by storing the address inside struct net_device_path_ctx Fixes: f6efc675c9dd ("net: ppp: resolve forwarding path for bridge pppoe devices") Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-16netfilter: nft_flow_offload: skip dst neigh lookup for ppp devicesFelix Fietkau
The dst entry does not contain a valid hardware address, so skip the lookup in order to avoid running into errors here. The proper hardware address is filled in from nft_dev_path_info Fixes: 72efd585f714 ("netfilter: flowtable: add pppoe support") Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-16netfilter: flowtable: fix excessive hw offload attempts after failureFelix Fietkau
If a flow cannot be offloaded, the code currently repeatedly tries again as quickly as possible, which can significantly increase system load. Fix this by limiting flow timeout update and hardware offload retry to once per second. Fixes: c07531c01d82 ("netfilter: flowtable: Remove redundant hw refresh bit") Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-16net/sched: act_pedit: sanitize shift argument before usagePaolo Abeni
syzbot was able to trigger an Out-of-Bound on the pedit action: UBSAN: shift-out-of-bounds in net/sched/act_pedit.c:238:43 shift exponent 1400735974 is too large for 32-bit type 'unsigned int' CPU: 0 PID: 3606 Comm: syz-executor151 Not tainted 5.18.0-rc5-syzkaller-00165-g810c2f0a3f86 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 ubsan_epilogue+0xb/0x50 lib/ubsan.c:151 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x187 lib/ubsan.c:322 tcf_pedit_init.cold+0x1a/0x1f net/sched/act_pedit.c:238 tcf_action_init_1+0x414/0x690 net/sched/act_api.c:1367 tcf_action_init+0x530/0x8d0 net/sched/act_api.c:1432 tcf_action_add+0xf9/0x480 net/sched/act_api.c:1956 tc_ctl_action+0x346/0x470 net/sched/act_api.c:2015 rtnetlink_rcv_msg+0x413/0xb80 net/core/rtnetlink.c:5993 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2502 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] netlink_unicast+0x543/0x7f0 net/netlink/af_netlink.c:1345 netlink_sendmsg+0x904/0xe00 net/netlink/af_netlink.c:1921 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:725 ____sys_sendmsg+0x6e2/0x800 net/socket.c:2413 ___sys_sendmsg+0xf3/0x170 net/socket.c:2467 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2496 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7fe36e9e1b59 Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffef796fe88 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fe36e9e1b59 RDX: 0000000000000000 RSI: 0000000020000300 RDI: 0000000000000003 RBP: 00007fe36e9a5d00 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fe36e9a5d90 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK> The 'shift' field is not validated, and any value above 31 will trigger out-of-bounds. The issue predates the git history, but syzbot was able to trigger it only after the commit mentioned in the fixes tag, and this change only applies on top of such commit. Address the issue bounding the 'shift' value to the maximum allowed by the relevant operator. Reported-and-tested-by: syzbot+8ed8fc4c57e9dcf23ca6@syzkaller.appspotmail.com Fixes: 8b796475fd78 ("net/sched: act_pedit: really ensure the skb is writable") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: call skb_defer_free_flush() before each napi_poll()Eric Dumazet
skb_defer_free_flush() can consume cpu cycles, it seems better to call it in the inner loop: - Potentially frees page/skb that will be reallocated while hot. - Account for the cpu cycles in the @time_limit determination. - Keep softnet_data.defer_count small to reduce chances for skb_attempt_defer_free() to send an IPI. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: add skb_defer_max sysctlEric Dumazet
commit 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") added another per-cpu cache of skbs. It was expected to be small, and an IPI was forced whenever the list reached 128 skbs. We might need to be able to control more precisely queue capacity and added latency. An IPI is generated whenever queue reaches half capacity. Default value of the new limit is 64. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: use napi_consume_skb() in skb_defer_free_flush()Eric Dumazet
skb_defer_free_flush() runs from softirq context, we have the opportunity to refill the napi_alloc_cache, and/or use kmem_cache_free_bulk() when this cache is full. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: fix possible race in skb_attempt_defer_free()Eric Dumazet
A cpu can observe sd->defer_count reaching 128, and call smp_call_function_single_async() Problem is that the remote CPU can clear sd->defer_count before the IPI is run/acknowledged. Other cpus can queue more packets and also decide to call smp_call_function_single_async() while the pending IPI was not yet delivered. This is a common issue with smp_call_function_single_async(). Callers must ensure correct synchronization and serialization. I triggered this issue while experimenting smaller threshold. Performing the call to smp_call_function_single_async() under sd->defer_lock protection did not solve the problem. Commit 5a18ceca6350 ("smp: Allow smp_call_function_single_async() to insert locked csd") replaced an informative WARN_ON_ONCE() with a return of -EBUSY, which is often ignored. Test of CSD_FLAG_LOCK presence is racy anyway. Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: tcp: reset 'drop_reason' to NOT_SPCIFIED in tcp_v{4,6}_rcv()Menglong Dong
The 'drop_reason' that passed to kfree_skb_reason() in tcp_v4_rcv() and tcp_v6_rcv() can be SKB_NOT_DROPPED_YET(0), as it is used as the return value of tcp_inbound_md5_hash(). And it can panic the kernel with NULL pointer in net_dm_packet_report_size() if the reason is 0, as drop_reasons[0] is NULL. Fixes: 1330b6ef3313 ("skb: make drop reason booleanable") Reviewed-by: Jiang Biao <benbjiang@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: skb: check the boundrary of drop reason in kfree_skb_reason()Menglong Dong
Sometimes, we may forget to reset skb drop reason to NOT_SPECIFIED after we make it the return value of the functions with return type of enum skb_drop_reason, such as tcp_inbound_md5_hash. Therefore, its value can be SKB_NOT_DROPPED_YET(0), which is invalid for kfree_skb_reason(). So we check the range of drop reason in kfree_skb_reason() with DEBUG_NET_WARN_ON_ONCE(). Reviewed-by: Jiang Biao <benbjiang@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: dm: check the boundary of skb drop reasonsMenglong Dong
The 'reason' will be set to 'SKB_DROP_REASON_NOT_SPECIFIED' if it not small that SKB_DROP_REASON_MAX in net_dm_packet_trace_kfree_skb_hit(), but it can't avoid it to be 0, which is invalid and can cause NULL pointer in drop_reasons. Therefore, reset it to SKB_DROP_REASON_NOT_SPECIFIED when 'reason <= 0'. Reviewed-by: Jiang Biao <benbjiang@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net/smc: align the connect behaviour with TCPGuangguan Wang
Connect with O_NONBLOCK will not be completed immediately and returns -EINPROGRESS. It is possible to use selector/poll for completion by selecting the socket for writing. After select indicates writability, a second connect function call will return 0 to indicate connected successfully as TCP does, but smc returns -EISCONN. Use socket state for smc to indicate connect state, which can help smc aligning the connect behaviour with TCP. Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16inet: rename INET_MATCH()Eric Dumazet
This is no longer a macro, but an inlined function. INET_MATCH() -> inet_match() Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Olivier Hartkopp <socketcan@hartkopp.net> Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16ipv6: add READ_ONCE(sk->sk_bound_dev_if) in INET6_MATCH()Eric Dumazet
INET6_MATCH() runs without holding a lock on the socket. We probably need to annotate most reads. This patch makes INET6_MATCH() an inline function to ease our changes. v2: inline function only defined if IS_ENABLED(CONFIG_IPV6) Change the name to inet6_match(), this is no longer a macro. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16l2tp: use add READ_ONCE() to fetch sk->sk_bound_dev_ifEric Dumazet
Use READ_ONCE() in paths not holding the socket lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net_sched: em_meta: add READ_ONCE() in var_sk_bound_if()Eric Dumazet
sk->sk_bound_dev_if can change under us, use READ_ONCE() annotation. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16inet: add READ_ONCE(sk->sk_bound_dev_if) in inet_csk_bind_conflict()Eric Dumazet
inet_csk_bind_conflict() can access sk->sk_bound_dev_if for unlocked sockets. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16dccp: use READ_ONCE() to read sk->sk_bound_dev_ifEric Dumazet
When reading listener sk->sk_bound_dev_if locklessly, we must use READ_ONCE(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: core: add READ_ONCE/WRITE_ONCE annotations for sk->sk_bound_dev_ifEric Dumazet
sock_bindtoindex_locked() needs to use WRITE_ONCE(sk->sk_bound_dev_if, val), because other cpus/threads might locklessly read this field. sock_getbindtodevice(), sock_getsockopt() need READ_ONCE() because they run without socket lock held. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16sctp: read sk->sk_bound_dev_if once in sctp_rcv()Eric Dumazet
sctp_rcv() reads sk->sk_bound_dev_if twice while the socket is not locked. Another cpu could change this field under us. Fixes: 0fd9a65a76e8 ("[SCTP] Support SO_BINDTODEVICE socket option on incoming packets.") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Vlad Yasevich <vyasevich@gmail.com> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: annotate races around sk->sk_bound_dev_ifEric Dumazet
UDP sendmsg() is lockless, and reads sk->sk_bound_dev_if while this field can be changed by another thread. Adds minimal annotations to avoid KCSAN splats for UDP. Following patches will add more annotations to potential lockless readers. BUG: KCSAN: data-race in __ip6_datagram_connect / udpv6_sendmsg write to 0xffff888136d47a94 of 4 bytes by task 7681 on cpu 0: __ip6_datagram_connect+0x6e2/0x930 net/ipv6/datagram.c:221 ip6_datagram_connect+0x2a/0x40 net/ipv6/datagram.c:272 inet_dgram_connect+0x107/0x190 net/ipv4/af_inet.c:576 __sys_connect_file net/socket.c:1900 [inline] __sys_connect+0x197/0x1b0 net/socket.c:1917 __do_sys_connect net/socket.c:1927 [inline] __se_sys_connect net/socket.c:1924 [inline] __x64_sys_connect+0x3d/0x50 net/socket.c:1924 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x50 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff888136d47a94 of 4 bytes by task 7670 on cpu 1: udpv6_sendmsg+0xc60/0x16e0 net/ipv6/udp.c:1436 inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:652 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg net/socket.c:725 [inline] ____sys_sendmsg+0x39a/0x510 net/socket.c:2413 ___sys_sendmsg net/socket.c:2467 [inline] __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553 __do_sys_sendmmsg net/socket.c:2582 [inline] __se_sys_sendmmsg net/socket.c:2579 [inline] __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x50 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x00000000 -> 0xffffff9b Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 7670 Comm: syz-executor.3 Tainted: G W 5.18.0-rc1-syzkaller-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 I chose to not add Fixes: tag because race has minor consequences and stable teams busy enough. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16ipv6: Add hop-by-hop header to jumbograms in ip6_outputCoco Li
Instead of simply forcing a 0 payload_len in IPv6 header, implement RFC 2675 and insert a custom extension header. Note that only TCP stack is currently potentially generating jumbograms, and that this extension header is purely local, it wont be sent on a physical link. This is needed so that packet capture (tcpdump and friends) can properly dissect these large packets. Signed-off-by: Coco Li <lixiaoyan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: allow gro_max_size to exceed 65536Alexander Duyck
Allow the gro_max_size to exceed a value larger than 65536. There weren't really any external limitations that prevented this other than the fact that IPv4 only supports a 16 bit length field. Since we have the option of adding a hop-by-hop header for IPv6 we can allow IPv6 to exceed this value and for IPv4 and non-TCP flows we can cap things at 65536 via a constant rather than relying on gro_max_size. [edumazet] limit GRO_MAX_SIZE to (8 * 65535) to avoid overflows. Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16ipv6/gro: insert temporary HBH/jumbo headerEric Dumazet
Following patch will add GRO_IPV6_MAX_SIZE, allowing gro to build BIG TCP ipv6 packets (bigger than 64K). This patch changes ipv6_gro_complete() to insert a HBH/jumbo header so that resulting packet can go through IPv6/TCP stacks. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>