summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2021-07-19sctp: add size validation when walking chunksMarcelo Ricardo Leitner
[ Upstream commit 50619dbf8db77e98d821d615af4f634d08e22698 ] The first chunk in a packet is ensured to be present at the beginning of sctp_rcv(), as a packet needs to have at least 1 chunk. But the second one, may not be completely available and ch->length can be over uninitialized memory. Fix here is by only trying to walk on the next chunk if there is enough to hold at least the header, and then proceed with the ch->length validation that is already there. Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-19sctp: validate from_addr_param returnMarcelo Ricardo Leitner
[ Upstream commit 0c5dc070ff3d6246d22ddd931f23a6266249e3db ] Ilja reported that, simply putting it, nothing was validating that from_addr_param functions were operating on initialized memory. That is, the parameter itself was being validated by sctp_walk_params, but it doesn't check for types and their specific sizes and it could be a 0-length one, causing from_addr_param to potentially work over the next parameter or even uninitialized memory. The fix here is to, in all calls to from_addr_param, check if enough space is there for the wanted IP address type. Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-19Bluetooth: mgmt: Fix the command returns garbage parameter valueTedd Ho-Jeong An
[ Upstream commit 02ce2c2c24024aade65a8d91d6a596651eaf2d0a ] When the Get Device Flags command fails, it returns the error status with the parameters filled with the garbage values. Although the parameters are not used, it is better to fill with zero than the random values. Signed-off-by: Tedd Ho-Jeong An <tedd.an@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-19Bluetooth: L2CAP: Fix invalid access on ECRED Connection responseLuiz Augusto von Dentz
[ Upstream commit de895b43932cb47e69480540be7eca289af24f23 ] The use of l2cap_chan_del is not safe under a loop using list_for_each_entry. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-19Bluetooth: L2CAP: Fix invalid access if ECRED Reconfigure failsLuiz Augusto von Dentz
[ Upstream commit 1fa20d7d4aad02206e84b74915819fbe9f81dab3 ] The use of l2cap_chan_del is not safe under a loop using list_for_each_entry. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-19Bluetooth: Shutdown controller after workqueues are flushed or cancelledKai-Heng Feng
[ Upstream commit 0ea9fd001a14ebc294f112b0361a4e601551d508 ] Rfkill block and unblock Intel USB Bluetooth [8087:0026] may make it stops working: [ 509.691509] Bluetooth: hci0: HCI reset during shutdown failed [ 514.897584] Bluetooth: hci0: MSFT filter_enable is already on [ 530.044751] usb 3-10: reset full-speed USB device number 5 using xhci_hcd [ 545.660350] usb 3-10: device descriptor read/64, error -110 [ 561.283530] usb 3-10: device descriptor read/64, error -110 [ 561.519682] usb 3-10: reset full-speed USB device number 5 using xhci_hcd [ 566.686650] Bluetooth: hci0: unexpected event for opcode 0x0500 [ 568.752452] Bluetooth: hci0: urb 0000000096cd309b failed to resubmit (113) [ 578.797955] Bluetooth: hci0: Failed to read MSFT supported features (-110) [ 586.286565] Bluetooth: hci0: urb 00000000c522f633 failed to resubmit (113) [ 596.215302] Bluetooth: hci0: Failed to read MSFT supported features (-110) Or kernel panics because other workqueues already freed skb: [ 2048.663763] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 2048.663775] #PF: supervisor read access in kernel mode [ 2048.663779] #PF: error_code(0x0000) - not-present page [ 2048.663782] PGD 0 P4D 0 [ 2048.663787] Oops: 0000 [#1] SMP NOPTI [ 2048.663793] CPU: 3 PID: 4491 Comm: rfkill Tainted: G W 5.13.0-rc1-next-20210510+ #20 [ 2048.663799] Hardware name: HP HP EliteBook 850 G8 Notebook PC/8846, BIOS T76 Ver. 01.01.04 12/02/2020 [ 2048.663801] RIP: 0010:__skb_ext_put+0x6/0x50 [ 2048.663814] Code: 8b 1b 48 85 db 75 db 5b 41 5c 5d c3 be 01 00 00 00 e8 de 13 c0 ff eb e7 be 02 00 00 00 e8 d2 13 c0 ff eb db 0f 1f 44 00 00 55 <8b> 07 48 89 e5 83 f8 01 74 14 b8 ff ff ff ff f0 0f c1 07 83 f8 01 [ 2048.663819] RSP: 0018:ffffc1d105b6fd80 EFLAGS: 00010286 [ 2048.663824] RAX: 0000000000000000 RBX: ffff9d9ac5649000 RCX: 0000000000000000 [ 2048.663827] RDX: ffffffffc0d1daf6 RSI: 0000000000000206 RDI: 0000000000000000 [ 2048.663830] RBP: ffffc1d105b6fd98 R08: 0000000000000001 R09: ffff9d9ace8ceac0 [ 2048.663834] R10: ffff9d9ace8ceac0 R11: 0000000000000001 R12: ffff9d9ac5649000 [ 2048.663838] R13: 0000000000000000 R14: 00007ffe0354d650 R15: 0000000000000000 [ 2048.663843] FS: 00007fe02ab19740(0000) GS:ffff9d9e5f8c0000(0000) knlGS:0000000000000000 [ 2048.663849] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2048.663853] CR2: 0000000000000000 CR3: 0000000111a52004 CR4: 0000000000770ee0 [ 2048.663856] PKRU: 55555554 [ 2048.663859] Call Trace: [ 2048.663865] ? skb_release_head_state+0x5e/0x80 [ 2048.663873] kfree_skb+0x2f/0xb0 [ 2048.663881] btusb_shutdown_intel_new+0x36/0x60 [btusb] [ 2048.663905] hci_dev_do_close+0x48c/0x5e0 [bluetooth] [ 2048.663954] ? __cond_resched+0x1a/0x50 [ 2048.663962] hci_rfkill_set_block+0x56/0xa0 [bluetooth] [ 2048.664007] rfkill_set_block+0x98/0x170 [ 2048.664016] rfkill_fop_write+0x136/0x1e0 [ 2048.664022] vfs_write+0xc7/0x260 [ 2048.664030] ksys_write+0xb1/0xe0 [ 2048.664035] ? exit_to_user_mode_prepare+0x37/0x1c0 [ 2048.664042] __x64_sys_write+0x1a/0x20 [ 2048.664048] do_syscall_64+0x40/0xb0 [ 2048.664055] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 2048.664060] RIP: 0033:0x7fe02ac23c27 [ 2048.664066] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 [ 2048.664070] RSP: 002b:00007ffe0354d638 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 2048.664075] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fe02ac23c27 [ 2048.664078] RDX: 0000000000000008 RSI: 00007ffe0354d650 RDI: 0000000000000003 [ 2048.664081] RBP: 0000000000000000 R08: 0000559b05998440 R09: 0000559b05998440 [ 2048.664084] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003 [ 2048.664086] R13: 0000000000000000 R14: ffffffff00000000 R15: 00000000ffffffff So move the shutdown callback to a place where workqueues are either flushed or cancelled to resolve the issue. Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-19Bluetooth: Fix alt settings for incoming SCO with transparent coding formatKiran K
[ Upstream commit 06d213d8a89a6f55b708422c3dda2b22add10748 ] For incoming SCO connection with transparent coding format, alt setting of CVSD is getting applied instead of Transparent. Before fix: < HCI Command: Accept Synchron.. (0x01|0x0029) plen 21 #2196 [hci0] 321.342548 Address: 1C:CC:D6:E2:EA:80 (Xiaomi Communications Co Ltd) Transmit bandwidth: 8000 Receive bandwidth: 8000 Max latency: 13 Setting: 0x0003 Input Coding: Linear Input Data Format: 1's complement Input Sample Size: 8-bit # of bits padding at MSB: 0 Air Coding Format: Transparent Data Retransmission effort: Optimize for link quality (0x02) Packet type: 0x003f HV1 may be used HV2 may be used HV3 may be used EV3 may be used EV4 may be used EV5 may be used > HCI Event: Command Status (0x0f) plen 4 #2197 [hci0] 321.343585 Accept Synchronous Connection Request (0x01|0x0029) ncmd 1 Status: Success (0x00) > HCI Event: Synchronous Connect Comp.. (0x2c) plen 17 #2198 [hci0] 321.351666 Status: Success (0x00) Handle: 257 Address: 1C:CC:D6:E2:EA:80 (Xiaomi Communications Co Ltd) Link type: eSCO (0x02) Transmission interval: 0x0c Retransmission window: 0x04 RX packet length: 60 TX packet length: 60 Air mode: Transparent (0x03) ........ > SCO Data RX: Handle 257 flags 0x00 dlen 48 #2336 [hci0] 321.383655 < SCO Data TX: Handle 257 flags 0x00 dlen 60 #2337 [hci0] 321.389558 > SCO Data RX: Handle 257 flags 0x00 dlen 48 #2338 [hci0] 321.393615 > SCO Data RX: Handle 257 flags 0x00 dlen 48 #2339 [hci0] 321.393618 > SCO Data RX: Handle 257 flags 0x00 dlen 48 #2340 [hci0] 321.393618 < SCO Data TX: Handle 257 flags 0x00 dlen 60 #2341 [hci0] 321.397070 > SCO Data RX: Handle 257 flags 0x00 dlen 48 #2342 [hci0] 321.403622 > SCO Data RX: Handle 257 flags 0x00 dlen 48 #2343 [hci0] 321.403625 > SCO Data RX: Handle 257 flags 0x00 dlen 48 #2344 [hci0] 321.403625 > SCO Data RX: Handle 257 flags 0x00 dlen 48 #2345 [hci0] 321.403625 < SCO Data TX: Handle 257 flags 0x00 dlen 60 #2346 [hci0] 321.404569 < SCO Data TX: Handle 257 flags 0x00 dlen 60 #2347 [hci0] 321.412091 > SCO Data RX: Handle 257 flags 0x00 dlen 48 #2348 [hci0] 321.413626 > SCO Data RX: Handle 257 flags 0x00 dlen 48 #2349 [hci0] 321.413630 > SCO Data RX: Handle 257 flags 0x00 dlen 48 #2350 [hci0] 321.413630 < SCO Data TX: Handle 257 flags 0x00 dlen 60 #2351 [hci0] 321.419674 After fix: < HCI Command: Accept Synchronou.. (0x01|0x0029) plen 21 #309 [hci0] 49.439693 Address: 1C:CC:D6:E2:EA:80 (Xiaomi Communications Co Ltd) Transmit bandwidth: 8000 Receive bandwidth: 8000 Max latency: 13 Setting: 0x0003 Input Coding: Linear Input Data Format: 1's complement Input Sample Size: 8-bit # of bits padding at MSB: 0 Air Coding Format: Transparent Data Retransmission effort: Optimize for link quality (0x02) Packet type: 0x003f HV1 may be used HV2 may be used HV3 may be used EV3 may be used EV4 may be used EV5 may be used > HCI Event: Command Status (0x0f) plen 4 #310 [hci0] 49.440308 Accept Synchronous Connection Request (0x01|0x0029) ncmd 1 Status: Success (0x00) > HCI Event: Synchronous Connect Complete (0x2c) plen 17 #311 [hci0] 49.449308 Status: Success (0x00) Handle: 257 Address: 1C:CC:D6:E2:EA:80 (Xiaomi Communications Co Ltd) Link type: eSCO (0x02) Transmission interval: 0x0c Retransmission window: 0x04 RX packet length: 60 TX packet length: 60 Air mode: Transparent (0x03) < SCO Data TX: Handle 257 flags 0x00 dlen 60 #312 [hci0] 49.450421 < SCO Data TX: Handle 257 flags 0x00 dlen 60 #313 [hci0] 49.457927 > HCI Event: Max Slots Change (0x1b) plen 3 #314 [hci0] 49.460345 Handle: 256 Max slots: 5 < SCO Data TX: Handle 257 flags 0x00 dlen 60 #315 [hci0] 49.465453 > SCO Data RX: Handle 257 flags 0x00 dlen 60 #316 [hci0] 49.470502 > SCO Data RX: Handle 257 flags 0x00 dlen 60 #317 [hci0] 49.470519 < SCO Data TX: Handle 257 flags 0x00 dlen 60 #318 [hci0] 49.472996 > SCO Data RX: Handle 257 flags 0x00 dlen 60 #319 [hci0] 49.480412 < SCO Data TX: Handle 257 flags 0x00 dlen 60 #320 [hci0] 49.480492 < SCO Data TX: Handle 257 flags 0x00 dlen 60 #321 [hci0] 49.487989 > SCO Data RX: Handle 257 flags 0x00 dlen 60 #322 [hci0] 49.490303 < SCO Data TX: Handle 257 flags 0x00 dlen 60 #323 [hci0] 49.495496 > SCO Data RX: Handle 257 flags 0x00 dlen 60 #324 [hci0] 49.500304 > SCO Data RX: Handle 257 flags 0x00 dlen 60 #325 [hci0] 49.500311 Signed-off-by: Kiran K <kiran.k@intel.com> Signed-off-by: Lokendra Singh <lokendra.singh@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-19Bluetooth: Fix the HCI to MGMT status conversion tableYu Liu
[ Upstream commit 4ef36a52b0e47c80bbfd69c0cce61c7ae9f541ed ] 0x2B, 0x31 and 0x33 are reserved for future use but were not present in the HCI to MGMT conversion table, this caused the conversion to be incorrect for the HCI status code greater than 0x2A. Reviewed-by: Miao-chen Chou <mcchou@chromium.org> Signed-off-by: Yu Liu <yudiliu@google.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-19net: ip: avoid OOM kills with large UDP sends over loopbackJakub Kicinski
[ Upstream commit 6d123b81ac615072a8525c13c6c41b695270a15d ] Dave observed number of machines hitting OOM on the UDP send path. The workload seems to be sending large UDP packets over loopback. Since loopback has MTU of 64k kernel will try to allocate an skb with up to 64k of head space. This has a good chance of failing under memory pressure. What's worse if the message length is <32k the allocation may trigger an OOM killer. This is entirely avoidable, we can use an skb with page frags. af_unix solves a similar problem by limiting the head length to SKB_MAX_ALLOC. This seems like a good and simple approach. It means that UDP messages > 16kB will now use fragments if underlying device supports SG, if extra allocator pressure causes regressions in real workloads we can switch to trying the large allocation first and falling back. v4: pre-calculate all the additions to alloclen so we can be sure it won't go over order-2 Reported-by: Dave Jones <dsj@fb.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-19mac80211: consider per-CPU statistics if presentJohannes Berg
[ Upstream commit d656a4c6ead6c3f252b2f2532bc9735598f7e317 ] If we have been keeping per-CPU statistics, consider them regardless of USES_RSS, because we may not actually fill those, for example in non-fast-RX cases when the connection is not compatible with fast-RX. If we didn't fill them, the additional data will be zero and not affect anything, and if we did fill them then it's more correct to consider them. This fixes an issue in mesh mode where some statistics are not updated due to USES_RSS being set, but fast-RX isn't used. Reported-by: Thiraviyam Mariyappan <tmariyap@codeaurora.org> Link: https://lore.kernel.org/r/20210610220814.13b35f5797c5.I511e9b33c5694e0d6cef4b6ae755c873d7c22124@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-19cfg80211: fix default HE tx bitrate mask in 2G bandPing-Ke Shih
[ Upstream commit 9df66d5b9f45c39b3925d16e8947cc10009b186d ] In 2G band, a HE sta can only supports HT and HE, but not supports VHT. In this case, default HE tx bitrate mask isn't filled, when we use iw to set bitrates without any parameter. Signed-off-by: Ping-Ke Shih <pkshih@realtek.com> Link: https://lore.kernel.org/r/20210609075944.51130-1-pkshih@realtek.com Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-19wireless: wext-spy: Fix out-of-bounds warningGustavo A. R. Silva
[ Upstream commit e93bdd78406da9ed01554c51e38b2a02c8ef8025 ] Fix the following out-of-bounds warning: net/wireless/wext-spy.c:178:2: warning: 'memcpy' offset [25, 28] from the object at 'threshold' is out of the bounds of referenced subobject 'low' with type 'struct iw_quality' at offset 20 [-Warray-bounds] The problem is that the original code is trying to copy data into a couple of struct members adjacent to each other in a single call to memcpy(). This causes a legitimate compiler warning because memcpy() overruns the length of &threshold.low and &spydata->spy_thr_low. As these are just a couple of struct members, fix this by using direct assignments, instead of memcpy(). This helps with the ongoing efforts to globally enable -Warray-bounds and get us closer to being able to tighten the FORTIFY_SOURCE routines on memcpy(). Link: https://github.com/KSPP/linux/issues/109 Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210422200032.GA168995@embeddedor Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-19vsock: notify server to shutdown when client has pending signalLongpeng(Mike)
[ Upstream commit c7ff9cff70601ea19245d997bb977344663434c7 ] The client's sk_state will be set to TCP_ESTABLISHED if the server replay the client's connect request. However, if the client has pending signal, its sk_state will be set to TCP_CLOSE without notify the server, so the server will hold the corrupt connection. client server 1. sk_state=TCP_SYN_SENT | 2. call ->connect() | 3. wait reply | | 4. sk_state=TCP_ESTABLISHED | 5. insert to connected list | 6. reply to the client 7. sk_state=TCP_ESTABLISHED | 8. insert to connected list | 9. *signal pending* <--------------------- the user kill client 10. sk_state=TCP_CLOSE | client is exiting... | 11. call ->release() | virtio_transport_close if (!(sk->sk_state == TCP_ESTABLISHED || sk->sk_state == TCP_CLOSING)) return true; *return at here, the server cannot notice the connection is corrupt* So the client should notify the peer in this case. Cc: David S. Miller <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Jorgen Hansen <jhansen@vmware.com> Cc: Norbert Slusarek <nslusarek@gmx.net> Cc: Andra Paraschiv <andraprs@amazon.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: David Brazdil <dbrazdil@google.com> Cc: Alexander Popov <alex.popov@linux.com> Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://lkml.org/lkml/2021/5/17/418 Signed-off-by: lixianming <lixianming5@huawei.com> Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-19net: sched: fix error return code in tcf_del_walker()Yang Yingliang
[ Upstream commit 55d96f72e8ddc0a294e0b9c94016edbb699537e1 ] When nla_put_u32() fails, 'ret' could be 0, it should return error code in tcf_del_walker(). Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-19xfrm: Fix error reporting in xfrm_state_construct.Steffen Klassert
[ Upstream commit 6fd06963fa74197103cdbb4b494763127b3f2f34 ] When memory allocation for XFRMA_ENCAP or XFRMA_COADDR fails, the error will not be reported because the -ENOMEM assignment to the err variable is overwritten before. Fix this by moving these two in front of the function so that memory allocation failures will be reported. Reported-by: Tobias Brunner <tobias@strongswan.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-19net: bridge: mrp: Update ring transitions.Horatiu Vultur
[ Upstream commit fcb34635854a5a5814227628867ea914a9805384 ] According to the standard IEC 62439-2, the number of transitions needs to be counted for each transition 'between' ring state open and ring state closed and not from open state to closed state. Therefore fix this for both ring and interconnect ring. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-19net: tcp better handling of reordering then loss casesYuchung Cheng
[ Upstream commit a29cb6914681a55667436a9eb7a42e28da8cf387 ] This patch aims to improve the situation when reordering and loss are ocurring in the same flight of packets. Previously the reordering would first induce a spurious recovery, then the subsequent ACK may undo the cwnd (based on the timestamps e.g.). However the current loss recovery does not proceed to invoke RACK to install a reordering timer. If some packets are also lost, this may lead to a long RTO-based recovery. An example is https://groups.google.com/g/bbr-dev/c/OFHADvJbTEI The solution is to after reverting the recovery, always invoke RACK to either mount the RACK timer to fast retransmit after the reordering window, or restarts the recovery if new loss is identified. Hence it is possible the sender may go from Recovery to Disorder/Open to Recovery again in one ACK. Reported-by: mingkun bian <bianmingkun@gmail.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-19ipv6: use prandom_u32() for ID generationWilly Tarreau
[ Upstream commit 62f20e068ccc50d6ab66fdb72ba90da2b9418c99 ] This is a complement to commit aa6dd211e4b1 ("inet: use bigger hash table for IP ID generation"), but focusing on some specific aspects of IPv6. Contary to IPv4, IPv6 only uses packet IDs with fragments, and with a minimum MTU of 1280, it's much less easy to force a remote peer to produce many fragments to explore its ID sequence. In addition packet IDs are 32-bit in IPv6, which further complicates their analysis. On the other hand, it is often easier to choose among plenty of possible source addresses and partially work around the bigger hash table the commit above permits, which leaves IPv6 partially exposed to some possibilities of remote analysis at the risk of weakening some protocols like DNS if some IDs can be predicted with a good enough probability. Given the wide range of permitted IDs, the risk of collision is extremely low so there's no need to rely on the positive increment algorithm that is shared with the IPv4 code via ip_idents_reserve(). We have a fast PRNG, so let's simply call prandom_u32() and be done with it. Performance measurements at 10 Gbps couldn't show any difference with the previous code, even when using a single core, because due to the large fragments, we're limited to only ~930 kpps at 10 Gbps and the cost of the random generation is completely offset by other operations and by the network transfer time. In addition, this change removes the need to update a shared entry in the idents table so it may even end up being slightly faster on large scale systems where this matters. The risk of at least one collision here is about 1/80 million among 10 IDs, 1/850k among 100 IDs, and still only 1/8.5k among 1000 IDs, which remains very low compared to IPv4 where all IDs are reused every 4 to 80ms on a 10 Gbps flow depending on packet sizes. Reported-by: Amit Klein <aksecurity@gmail.com> Signed-off-by: Willy Tarreau <w@1wt.eu> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20210529110746.6796-1-w@1wt.eu Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-19net/sched: cls_api: increase max_reclassify_loopDavide Caratti
[ Upstream commit 05ff8435e50569a0a6b95e5ceaea43696e8827ab ] modern userspace applications, like OVN, can configure the TC datapath to "recirculate" packets several times. If more than 4 "recirculation" rules are configured, packets can be dropped by __tcf_classify(). Changing the maximum number of reclassifications (from 4 to 16) should be sufficient to prevent drops in most use cases, and guard against loops at the same time. Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-19net: Treat __napi_schedule_irqoff() as __napi_schedule() on PREEMPT_RTSebastian Andrzej Siewior
[ Upstream commit 8380c81d5c4fced6f4397795a5ae65758272bbfd ] __napi_schedule_irqoff() is an optimized version of __napi_schedule() which can be used where it is known that interrupts are disabled, e.g. in interrupt-handlers, spin_lock_irq() sections or hrtimer callbacks. On PREEMPT_RT enabled kernels this assumptions is not true. Force- threaded interrupt handlers and spinlocks are not disabling interrupts and the NAPI hrtimer callback is forced into softirq context which runs with interrupts enabled as well. Chasing all usage sites of __napi_schedule_irqoff() is a whack-a-mole game so make __napi_schedule_irqoff() invoke __napi_schedule() for PREEMPT_RT kernels. The callers of ____napi_schedule() in the networking core have been audited and are correct on PREEMPT_RT kernels as well. Reported-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14net: tipc: fix FB_MTU eat two pagesMenglong Dong
[ Upstream commit 0c6de0c943dbb42831bf7502eb5c007f71e752d2 ] FB_MTU is used in 'tipc_msg_build()' to alloc smaller skb when memory allocation fails, which can avoid unnecessary sending failures. The value of FB_MTU now is 3744, and the data size will be: (3744 + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) + \ SKB_DATA_ALIGN(BUF_HEADROOM + BUF_TAILROOM + 3)) which is larger than one page(4096), and two pages will be allocated. To avoid it, replace '3744' with a calculation: (PAGE_SIZE - SKB_DATA_ALIGN(BUF_OVERHEAD) - \ SKB_DATA_ALIGN(sizeof(struct skb_shared_info))) What's more, alloc_skb_fclone() will call SKB_DATA_ALIGN for data size, and it's not necessary to make alignment for buf_size in tipc_buf_acquire(). So, just remove it. Fixes: 4c94cc2d3d57 ("tipc: fall back to smaller MTU if allocation of local send skb fails") Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14net: sched: fix warning in tcindex_alloc_perfect_hashPavel Skripkin
[ Upstream commit 3f2db250099f46988088800052cdf2332c7aba61 ] Syzbot reported warning in tcindex_alloc_perfect_hash. The problem was in too big cp->hash, which triggers warning in kmalloc. Since cp->hash comes from userspace, there is no need to warn if value is not correct Fixes: b9a24bb76bf6 ("net_sched: properly handle failure case of tcf_exts_init()") Reported-and-tested-by: syzbot+1071ad60cd7df39fdadb@syzkaller.appspotmail.com Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Acked-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14net: lwtunnel: handle MTU calculation in forwadingVadim Fedorenko
[ Upstream commit fade56410c22cacafb1be9f911a0afd3701d8366 ] Commit 14972cbd34ff ("net: lwtunnel: Handle fragmentation") moved fragmentation logic away from lwtunnel by carry encap headroom and use it in output MTU calculation. But the forwarding part was not covered and created difference in MTU for output and forwarding and further to silent drops on ipv4 forwarding path. Fix it by taking into account lwtunnel encap headroom. The same commit also introduced difference in how to treat RTAX_MTU in IPv4 and IPv6 where latter explicitly removes lwtunnel encap headroom from route MTU. Make IPv4 version do the same. Fixes: 14972cbd34ff ("net: lwtunnel: Handle fragmentation") Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14Bluetooth: Fix handling of HCI_LE_Advertising_Set_Terminated eventLuiz Augusto von Dentz
[ Upstream commit 23837a6d7a1a61818ed94a6b8af552d6cf7d32d5 ] Error status of this event means that it has ended due reasons other than a connection: 'If advertising has terminated as a result of the advertising duration elapsing, the Status parameter shall be set to the error code Advertising Timeout (0x3C).' 'If advertising has terminated because the Max_Extended_Advertising_Events was reached, the Status parameter shall be set to the error code Limit Reached (0x43).' Fixes: acf0aeae431a0 ("Bluetooth: Handle ADv set terminated event") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14Bluetooth: Fix Set Extended (Scan Response) DataLuiz Augusto von Dentz
[ Upstream commit c9ed0a7077306f9d41d74fb006ab5dbada8349c5 ] These command do have variable length and the length can go up to 251, so this changes the struct to not use a fixed size and then when creating the PDU only the actual length of the data send to the controller. Fixes: a0fb3726ba551 ("Bluetooth: Use Set ext adv/scan rsp data if controller supports") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14Bluetooth: Fix not sending Set Extended Scan ResponseLuiz Augusto von Dentz
[ Upstream commit a76a0d365077711594ce200a9553ed6d1ff40276 ] Current code is actually failing on the following tests of mgmt-tester because get_adv_instance_scan_rsp_len did not account for flags that cause scan response data to be included resulting in non-scannable instance when in fact it should be scannable. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14Bluetooth: mgmt: Fix slab-out-of-bounds in tlv_data_is_validLuiz Augusto von Dentz
[ Upstream commit 799acb9347915bfe4eac0ff2345b468f0a1ca207 ] This fixes parsing of LTV entries when the length is 0. Found with: tools/mgmt-tester -s "Add Advertising - Success (ScRsp only)" Add Advertising - Success (ScRsp only) - run Sending Add Advertising (0x003e) Test condition added, total 1 [ 11.004577] ================================================================== [ 11.005292] BUG: KASAN: slab-out-of-bounds in tlv_data_is_valid+0x87/0xe0 [ 11.005984] Read of size 1 at addr ffff888002c695b0 by task mgmt-tester/87 [ 11.006711] [ 11.007176] [ 11.007429] Allocated by task 87: [ 11.008151] [ 11.008438] The buggy address belongs to the object at ffff888002c69580 [ 11.008438] which belongs to the cache kmalloc-64 of size 64 [ 11.010526] The buggy address is located 48 bytes inside of [ 11.010526] 64-byte region [ffff888002c69580, ffff888002c695c0) [ 11.012423] The buggy address belongs to the page: [ 11.013291] [ 11.013544] Memory state around the buggy address: [ 11.014359] ffff888002c69480: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc [ 11.015453] ffff888002c69500: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc [ 11.016232] >ffff888002c69580: 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc [ 11.017010] ^ [ 11.017547] ffff888002c69600: 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc [ 11.018296] ffff888002c69680: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc [ 11.019116] ================================================================== Fixes: 2bb36870e8cb2 ("Bluetooth: Unify advertising instance flags check") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14bpfilter: Specify the log level for the kmsg messageGary Lin
[ Upstream commit a196fa78a26571359740f701cf30d774eb8a72cb ] Per the kmsg document [0], if we don't specify the log level with a prefix "<N>" in the message string, the default log level will be applied to the message. Since the default level could be warning(4), this would make the log utility such as journalctl treat the message, "Started bpfilter", as a warning. To avoid confusion, this commit adds the prefix "<5>" to make the message always a notice. [0] https://www.kernel.org/doc/Documentation/ABI/testing/dev-kmsg Fixes: 36c4357c63f3 ("net: bpfilter: print umh messages to /dev/kmsg") Reported-by: Martin Loviska <mloviska@suse.com> Signed-off-by: Gary Lin <glin@suse.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Dmitrii Banshchikov <me@ubique.spb.ru> Link: https://lore.kernel.org/bpf/20210623040918.8683-1-glin@suse.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14ipv6: fix out-of-bound access in ip6_parse_tlv()Eric Dumazet
[ Upstream commit 624085a31c1ad6a80b1e53f686bf6ee92abbf6e8 ] First problem is that optlen is fetched without checking there is more than one byte to parse. Fix this by taking care of IPV6_TLV_PAD1 before fetching optlen (under appropriate sanity checks against len) Second problem is that IPV6_TLV_PADN checks of zero padding are performed before the check of remaining length. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Fixes: c1412fce7ecc ("net/ipv6/exthdrs.c: Strict PadN option checking") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14bpf: Do not change gso_size during bpf_skb_change_proto()Maciej Żenczykowski
[ Upstream commit 364745fbe981a4370f50274475da4675661104df ] This is technically a backwards incompatible change in behaviour, but I'm going to argue that it is very unlikely to break things, and likely to fix *far* more then it breaks. In no particular order, various reasons follow: (a) I've long had a bug assigned to myself to debug a super rare kernel crash on Android Pixel phones which can (per stacktrace) be traced back to BPF clat IPv6 to IPv4 protocol conversion causing some sort of ugly failure much later on during transmit deep in the GSO engine, AFAICT precisely because of this change to gso_size, though I've never been able to manually reproduce it. I believe it may be related to the particular network offload support of attached USB ethernet dongle being used for tethering off of an IPv6-only cellular connection. The reason might be we end up with more segments than max permitted, or with a GSO packet with only one segment... (either way we break some assumption and hit a BUG_ON) (b) There is no check that the gso_size is > 20 when reducing it by 20, so we might end up with a negative (or underflowing) gso_size or a gso_size of 0. This can't possibly be good. Indeed this is probably somehow exploitable (or at least can result in a kernel crash) by delivering crafted packets and perhaps triggering an infinite loop or a divide by zero... As a reminder: gso_size (MSS) is related to MTU, but not directly derived from it: gso_size/MSS may be significantly smaller then one would get by deriving from local MTU. And on some NICs (which do loose MTU checking on receive, it may even potentially be larger, for example my work pc with 1500 MTU can receive 1520 byte frames [and sometimes does due to bugs in a vendor plat46 implementation]). Indeed even just going from 21 to 1 is potentially problematic because it increases the number of segments by a factor of 21 (think DoS, or some other crash due to too many segments). (c) It's always safe to not increase the gso_size, because it doesn't result in the max packet size increasing. So the skb_increase_gso_size() call was always unnecessary for correctness (and outright undesirable, see later). As such the only part which is potentially dangerous (ie. could cause backwards compatibility issues) is the removal of the skb_decrease_gso_size() call. (d) If the packets are ultimately destined to the local device, then there is absolutely no benefit to playing around with gso_size. It only matters if the packets will egress the device. ie. we're either forwarding, or transmitting from the device. (e) This logic only triggers for packets which are GSO. It does not trigger for skbs which are not GSO. It will not convert a non-GSO MTU sized packet into a GSO packet (and you don't even know what the MTU is, so you can't even fix it). As such your transmit path must *already* be able to handle an MTU 20 bytes larger then your receive path (for IPv4 to IPv6 translation) - and indeed 28 bytes larger due to IPv4 fragments. Thus removing the skb_decrease_gso_size() call doesn't actually increase the size of the packets your transmit side must be able to handle. ie. to handle non-GSO max-MTU packets, the IPv4/IPv6 device/ route MTUs must already be set correctly. Since for example with an IPv4 egress MTU of 1500, IPv4 to IPv6 translation will already build 1520 byte IPv6 frames, so you need a 1520 byte device MTU. This means if your IPv6 device's egress MTU is 1280, your IPv4 route must be 1260 (and actually 1252, because of the need to handle fragments). This is to handle normal non-GSO packets. Thus the reduction is simply not needed for GSO packets, because when they're correctly built, they will already be the right size. (f) TSO/GSO should be able to exactly undo GRO: the number of packets (TCP segments) should not be modified, so that TCP's MSS counting works correctly (this matters for congestion control). If protocol conversion changes the gso_size, then the number of TCP segments may increase or decrease. Packet loss after protocol conversion can result in partial loss of MSS segments that the sender sent. How's the sending TCP stack going to react to receiving ACKs/SACKs in the middle of the segments it sent? (g) skb_{decrease,increase}_gso_size() are already no-ops for GSO_BY_FRAGS case (besides triggering WARN_ON_ONCE). This means you already cannot guarantee that gso_size (and thus resulting packet MTU) is changed. ie. you must assume it won't be changed. (h) changing gso_size is outright buggy for UDP GSO packets, where framing matters (I believe that's also the case for SCTP, but it's already excluded by [g]). So the only remaining case is TCP, which also doesn't want it (see [f]). (i) see also the reasoning on the previous attempt at fixing this (commit fa7b83bf3b156c767f3e4a25bbf3817b08f3ff8e) which shows that the current behaviour causes TCP packet loss: In the forwarding path GRO -> BPF 6 to 4 -> GSO for TCP traffic, the coalesced packet payload can be > MSS, but < MSS + 20. bpf_skb_proto_6_to_4() will upgrade the MSS and it can be > the payload length. After then tcp_gso_segment checks for the payload length if it is <= MSS. The condition is causing the packet to be dropped. tcp_gso_segment(): [...] mss = skb_shinfo(skb)->gso_size; if (unlikely(skb->len <= mss)) goto out; [...] Thus changing the gso_size is simply a very bad idea. Increasing is unnecessary and buggy, and decreasing can go negative. Fixes: 6578171a7ff0 ("bpf: add bpf_skb_change_proto helper") Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Dongseok Yi <dseok.yi@samsung.com> Cc: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/bpf/CANP3RGfjLikQ6dg=YpBU0OeHvyv7JOki7CyOUS9modaXAi-9vQ@mail.gmail.com Link: https://lore.kernel.org/bpf/20210617000953.2787453-2-zenczykowski@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14can: j1939: j1939_sk_setsockopt(): prevent allocation of j1939 filter for ↵Norbert Slusarek
optlen == 0 [ Upstream commit aaf473d0100f64abc88560e2bea905805bcf2a8e ] If optval != NULL and optlen == 0 are specified for SO_J1939_FILTER in j1939_sk_setsockopt(), memdup_sockptr() will return ZERO_PTR for 0 size allocation. The new filter will be mistakenly assigned ZERO_PTR. This patch checks for optlen != 0 and filter will be assigned NULL in case of optlen == 0. Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol") Link: https://lore.kernel.org/r/20210620123842.117975-1-nslusarek@gmx.net Signed-off-by: Norbert Slusarek <nslusarek@gmx.net> Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14ipv6: exthdrs: do not blindly use init_netEric Dumazet
[ Upstream commit bcc3f2a829b9edbe3da5fb117ee5a63686d31834 ] I see no reason why max_dst_opts_cnt and max_hbh_opts_cnt are fetched from the initial net namespace. The other sysctls (max_dst_opts_len & max_hbh_opts_len) are in fact already using the current ns. Note: it is not clear why ipv6_destopt_rcv() use two ways to get to the netns : 1) dev_net(dst->dev) Originally used to increment IPSTATS_MIB_INHDRERRORS 2) dev_net(skb->dev) Tom used this variant in his patch. Maybe this calls to use ipv6_skb_net() instead ? Fixes: 47d3d7ac656a ("ipv6: Implement limits on Hop-by-Hop and Destination options") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <tom@quantonium.net> Cc: Coco Li <lixiaoyan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14mac80211: remove iwlwifi specific workaround NDPs of null_responsePing-Ke Shih
[ Upstream commit 744757e46bf13ec3a7b3507d17ab3faab9516d43 ] Remove the remaining workaround that is not removed by the commit e41eb3e408de ("mac80211: remove iwlwifi specific workaround that broke sta NDP tx") Fixes: 41cbb0f5a295 ("mac80211: add support for HE") Signed-off-by: Ping-Ke Shih <pkshih@realtek.com> Link: https://lore.kernel.org/r/20210623134826.10318-1-pkshih@realtek.com Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14net/ipv4: swap flow ports when validating sourceMiao Wang
[ Upstream commit c69f114d09891adfa3e301a35d9e872b8b7b5a50 ] When doing source address validation, the flowi4 struct used for fib_lookup should be in the reverse direction to the given skb. fl4_dport and fl4_sport returned by fib4_rules_early_flow_dissect should thus be swapped. Fixes: 5a847a6e1477 ("net/ipv4: Initialize proto and ports in flow struct") Signed-off-by: Miao Wang <shankerwangmiao@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14ip6_tunnel: fix GRE6 segmentationJakub Kicinski
[ Upstream commit a6e3f2985a80ef6a45a17d2d9d9151f17ea3ce07 ] Commit 6c11fbf97e69 ("ip6_tunnel: add MPLS transmit support") moved assiging inner_ipproto down from ipxip6_tnl_xmit() to its callee ip6_tnl_xmit(). The latter is also used by GRE. Since commit 38720352412a ("gre: Use inner_proto to obtain inner header protocol") GRE had been depending on skb->inner_protocol during segmentation. It sets it in gre_build_header() and reads it in gre_gso_segment(). Changes to ip6_tnl_xmit() overwrite the protocol, resulting in GSO skbs getting dropped. Note that inner_protocol is a union with inner_ipproto, GRE uses the former while the change switched it to the latter (always setting it to just IPPROTO_GRE). Restore the original location of skb_set_inner_ipproto(), it is unclear why it was moved in the first place. Fixes: 6c11fbf97e69 ("ip6_tunnel: add MPLS transmit support") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Tested-by: Vadim Fedorenko <vfedorenko@novek.ru> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14xfrm: Fix xfrm offload fallback fail caseAyush Sawal
[ Upstream commit dd72fadf2186fc8a6018f97fe72f4d5ca05df440 ] In case of xfrm offload, if xdo_dev_state_add() of driver returns -EOPNOTSUPP, xfrm offload fallback is failed. In xfrm state_add() both xso->dev and xso->real_dev are initialized to dev and when err(-EOPNOTSUPP) is returned only xso->dev is set to null. So in this scenario the condition in func validate_xmit_xfrm(), if ((x->xso.dev != dev) && (x->xso.real_dev == dev)) return skb; returns true, due to which skb is returned without calling esp_xmit() below which has fallback code. Hence the CRYPTO_FALLBACK is failing. So fixing this with by keeping x->xso.real_dev as NULL when err is returned in func xfrm_dev_state_add(). Fixes: bdfd2d1fa79a ("bonding/xfrm: use real_dev instead of slave_dev") Signed-off-by: Ayush Sawal <ayush.sawal@chelsio.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14pkt_sched: sch_qfq: fix qfq_change_class() error pathEric Dumazet
[ Upstream commit 0cd58e5c53babb9237b741dbef711f0a9eb6d3fd ] If qfq_change_class() is unable to allocate memory for qfq_aggregate, it frees the class that has been inserted in the class hash table, but does not unhash it. Defer the insertion after the problematic allocation. BUG: KASAN: use-after-free in hlist_add_head include/linux/list.h:884 [inline] BUG: KASAN: use-after-free in qdisc_class_hash_insert+0x200/0x210 net/sched/sch_api.c:731 Write of size 8 at addr ffff88814a534f10 by task syz-executor.4/31478 CPU: 0 PID: 31478 Comm: syz-executor.4 Not tainted 5.13.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x141/0x1d7 lib/dump_stack.c:120 print_address_description.constprop.0.cold+0x5b/0x2f8 mm/kasan/report.c:233 __kasan_report mm/kasan/report.c:419 [inline] kasan_report.cold+0x7c/0xd8 mm/kasan/report.c:436 hlist_add_head include/linux/list.h:884 [inline] qdisc_class_hash_insert+0x200/0x210 net/sched/sch_api.c:731 qfq_change_class+0x96c/0x1990 net/sched/sch_qfq.c:489 tc_ctl_tclass+0x514/0xe50 net/sched/sch_api.c:2113 rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5564 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504 netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline] netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340 netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1929 sock_sendmsg_nosec net/socket.c:654 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:674 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2350 ___sys_sendmsg+0xf3/0x170 net/socket.c:2404 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2433 do_syscall_64+0x3a/0xb0 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x4665d9 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fdc7b5f0188 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 000000000056bf80 RCX: 00000000004665d9 RDX: 0000000000000000 RSI: 00000000200001c0 RDI: 0000000000000003 RBP: 00007fdc7b5f01d0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002 R13: 00007ffcf7310b3f R14: 00007fdc7b5f0300 R15: 0000000000022000 Allocated by task 31445: kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38 kasan_set_track mm/kasan/common.c:46 [inline] set_alloc_info mm/kasan/common.c:428 [inline] ____kasan_kmalloc mm/kasan/common.c:507 [inline] ____kasan_kmalloc mm/kasan/common.c:466 [inline] __kasan_kmalloc+0x9b/0xd0 mm/kasan/common.c:516 kmalloc include/linux/slab.h:556 [inline] kzalloc include/linux/slab.h:686 [inline] qfq_change_class+0x705/0x1990 net/sched/sch_qfq.c:464 tc_ctl_tclass+0x514/0xe50 net/sched/sch_api.c:2113 rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5564 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504 netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline] netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340 netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1929 sock_sendmsg_nosec net/socket.c:654 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:674 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2350 ___sys_sendmsg+0xf3/0x170 net/socket.c:2404 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2433 do_syscall_64+0x3a/0xb0 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae Freed by task 31445: kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38 kasan_set_track+0x1c/0x30 mm/kasan/common.c:46 kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:357 ____kasan_slab_free mm/kasan/common.c:360 [inline] ____kasan_slab_free mm/kasan/common.c:325 [inline] __kasan_slab_free+0xfb/0x130 mm/kasan/common.c:368 kasan_slab_free include/linux/kasan.h:212 [inline] slab_free_hook mm/slub.c:1583 [inline] slab_free_freelist_hook+0xdf/0x240 mm/slub.c:1608 slab_free mm/slub.c:3168 [inline] kfree+0xe5/0x7f0 mm/slub.c:4212 qfq_change_class+0x10fb/0x1990 net/sched/sch_qfq.c:518 tc_ctl_tclass+0x514/0xe50 net/sched/sch_api.c:2113 rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5564 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504 netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline] netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340 netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1929 sock_sendmsg_nosec net/socket.c:654 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:674 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2350 ___sys_sendmsg+0xf3/0x170 net/socket.c:2404 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2433 do_syscall_64+0x3a/0xb0 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae The buggy address belongs to the object at ffff88814a534f00 which belongs to the cache kmalloc-128 of size 128 The buggy address is located 16 bytes inside of 128-byte region [ffff88814a534f00, ffff88814a534f80) The buggy address belongs to the page: page:ffffea0005294d00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x14a534 flags: 0x57ff00000000200(slab|node=1|zone=2|lastcpupid=0x7ff) raw: 057ff00000000200 ffffea00004fee00 0000000600000006 ffff8880110418c0 raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 0, migratetype Unmovable, gfp_mask 0x12cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY), pid 29797, ts 604817765317, free_ts 604810151744 prep_new_page mm/page_alloc.c:2358 [inline] get_page_from_freelist+0x1033/0x2b60 mm/page_alloc.c:3994 __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5200 alloc_pages+0x18c/0x2a0 mm/mempolicy.c:2272 alloc_slab_page mm/slub.c:1646 [inline] allocate_slab+0x2c5/0x4c0 mm/slub.c:1786 new_slab mm/slub.c:1849 [inline] new_slab_objects mm/slub.c:2595 [inline] ___slab_alloc+0x4a1/0x810 mm/slub.c:2758 __slab_alloc.constprop.0+0xa7/0xf0 mm/slub.c:2798 slab_alloc_node mm/slub.c:2880 [inline] slab_alloc mm/slub.c:2922 [inline] __kmalloc+0x315/0x330 mm/slub.c:4050 kmalloc include/linux/slab.h:561 [inline] kzalloc include/linux/slab.h:686 [inline] __register_sysctl_table+0x112/0x1090 fs/proc/proc_sysctl.c:1318 mpls_dev_sysctl_register+0x1b7/0x2d0 net/mpls/af_mpls.c:1421 mpls_add_dev net/mpls/af_mpls.c:1472 [inline] mpls_dev_notify+0x214/0x8b0 net/mpls/af_mpls.c:1588 notifier_call_chain+0xb5/0x200 kernel/notifier.c:83 call_netdevice_notifiers_info+0xb5/0x130 net/core/dev.c:2121 call_netdevice_notifiers_extack net/core/dev.c:2133 [inline] call_netdevice_notifiers net/core/dev.c:2147 [inline] register_netdevice+0x106b/0x1500 net/core/dev.c:10312 veth_newlink+0x585/0xac0 drivers/net/veth.c:1547 __rtnl_newlink+0x1062/0x1710 net/core/rtnetlink.c:3452 rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3500 page last free stack trace: reset_page_owner include/linux/page_owner.h:24 [inline] free_pages_prepare mm/page_alloc.c:1298 [inline] free_pcp_prepare+0x223/0x300 mm/page_alloc.c:1342 free_unref_page_prepare mm/page_alloc.c:3250 [inline] free_unref_page+0x12/0x1d0 mm/page_alloc.c:3298 __vunmap+0x783/0xb60 mm/vmalloc.c:2566 free_work+0x58/0x70 mm/vmalloc.c:80 process_one_work+0x98d/0x1600 kernel/workqueue.c:2276 worker_thread+0x64c/0x1120 kernel/workqueue.c:2422 kthread+0x3b1/0x4a0 kernel/kthread.c:313 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 Memory state around the buggy address: ffff88814a534e00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88814a534e80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff88814a534f00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff88814a534f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88814a535000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Fixes: 462dbc9101acd ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14netfilter: nf_tables_offload: check FLOW_DISSECTOR_KEY_BASIC in VLAN ↵Pablo Neira Ayuso
transfer logic [ Upstream commit ea45fdf82cc90430bb7c280e5e53821e833782c5 ] The VLAN transfer logic should actually check for FLOW_DISSECTOR_KEY_BASIC, not FLOW_DISSECTOR_KEY_CONTROL. Moreover, do not fallback to case 2) .n_proto is set to 802.1q or 802.1ad, if FLOW_DISSECTOR_KEY_BASIC is unset. Fixes: 783003f3bb8a ("netfilter: nftables_offload: special ethertype handling for VLAN") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14tls: prevent oversized sendfile() hangs by ignoring MSG_MOREJakub Kicinski
[ Upstream commit d452d48b9f8b1a7f8152d33ef52cfd7fe1735b0a ] We got multiple reports that multi_chunk_sendfile test case from tls selftest fails. This was sort of expected, as the original fix was never applied (see it in the first Link:). The test in question uses sendfile() with count larger than the size of the underlying file. This will make splice set MSG_MORE on all sendpage calls, meaning TLS will never close and flush the last partial record. Eric seem to have addressed a similar problem in commit 35f9c09fe9c7 ("tcp: tcp_sendpages() should call tcp_push() once") by introducing MSG_SENDPAGE_NOTLAST. Unlike MSG_MORE MSG_SENDPAGE_NOTLAST is not set on the last call of a "pipefull" of data (PIPE_DEF_BUFFERS == 16, so every 16 pages or whenever we run out of data). Having a break every 16 pages should be fine, TLS can pack exactly 4 pages into a record, so for aligned reads there should be no difference, unaligned may see one extra record per sendpage(). Sticking to TCP semantics seems preferable to modifying splice, but we can revisit it if real life scenarios show a regression. Reported-by: Vadim Fedorenko <vfedorenko@novek.ru> Reported-by: Seth Forshee <seth.forshee@canonical.com> Link: https://lore.kernel.org/netdev/1591392508-14592-1-git-send-email-pooja.trivedi@stackpath.com/ Fixes: 3c4d7559159b ("tls: kernel TLS support") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Tested-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14xsk: Fix broken Tx ring validationMagnus Karlsson
[ Upstream commit f654fae47e83e56b454fbbfd0af0a4f232e356d6 ] Fix broken Tx ring validation for AF_XDP. The commit under the Fixes tag, fixed an off-by-one error in the validation but introduced another error. Descriptors are now let through even if they straddle a chunk boundary which they are not allowed to do in aligned mode. Worse is that they are let through even if they straddle the end of the umem itself, tricking the kernel to read data outside the allowed umem region which might or might not be mapped at all. Fix this by reintroducing the old code, but subtract the length by one to fix the off-by-one error that the original patch was addressing. The test chunk != chunk_end makes sure packets do not straddle chunk boundraries. Note that packets of zero length are allowed in the interface, therefore the test if the length is non-zero. Fixes: ac31565c2193 ("xsk: Fix for xp_aligned_validate_desc() when len == chunk_size") Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: Björn Töpel <bjorn@kernel.org> Link: https://lore.kernel.org/bpf/20210618075805.14412-1-magnus.karlsson@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14netfilter: nft_tproxy: restrict support to TCP and UDP transport protocolsPablo Neira Ayuso
[ Upstream commit 52f0f4e178c757b3d356087376aad8bd77271828 ] Add unfront check for TCP and UDP packets before performing further processing. Fixes: 4ed8eb6570a4 ("netfilter: nf_tables: Add native tproxy support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14netfilter: nft_osf: check for TCP packet before further processingPablo Neira Ayuso
[ Upstream commit 8f518d43f89ae00b9cf5460e10b91694944ca1a8 ] The osf expression only supports for TCP packets, add a upfront sanity check to skip packet parsing if this is not a TCP packet. Fixes: b96af92d6eaf ("netfilter: nf_tables: implement Passive OS fingerprint module in nft_osf") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14netfilter: nft_exthdr: check for IPv6 packet before further processingPablo Neira Ayuso
[ Upstream commit cdd73cc545c0fb9b1a1f7b209f4f536e7990cff4 ] ipv6_find_hdr() does not validate that this is an IPv6 packet. Add a sanity check for calling ipv6_find_hdr() to make sure an IPv6 packet is passed for parsing. Fixes: 96518518cc41 ("netfilter: add nftables") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14netlabel: Fix memory leak in netlbl_mgmt_add_commonLiu Shixin
[ Upstream commit b8f6b0522c298ae9267bd6584e19b942a0636910 ] Hulk Robot reported memory leak in netlbl_mgmt_add_common. The problem is non-freed map in case of netlbl_domhsh_add() failed. BUG: memory leak unreferenced object 0xffff888100ab7080 (size 96): comm "syz-executor537", pid 360, jiffies 4294862456 (age 22.678s) hex dump (first 32 bytes): 05 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ fe 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 ................ backtrace: [<0000000008b40026>] netlbl_mgmt_add_common.isra.0+0xb2a/0x1b40 [<000000003be10950>] netlbl_mgmt_add+0x271/0x3c0 [<00000000c70487ed>] genl_family_rcv_msg_doit.isra.0+0x20e/0x320 [<000000001f2ff614>] genl_rcv_msg+0x2bf/0x4f0 [<0000000089045792>] netlink_rcv_skb+0x134/0x3d0 [<0000000020e96fdd>] genl_rcv+0x24/0x40 [<0000000042810c66>] netlink_unicast+0x4a0/0x6a0 [<000000002e1659f0>] netlink_sendmsg+0x789/0xc70 [<000000006e43415f>] sock_sendmsg+0x139/0x170 [<00000000680a73d7>] ____sys_sendmsg+0x658/0x7d0 [<0000000065cbb8af>] ___sys_sendmsg+0xf8/0x170 [<0000000019932b6c>] __sys_sendmsg+0xd3/0x190 [<00000000643ac172>] do_syscall_64+0x37/0x90 [<000000009b79d6dc>] entry_SYSCALL_64_after_hwframe+0x44/0xae Fixes: 63c416887437 ("netlabel: Add network address selectors to the NetLabel/LSM domain mapping") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Liu Shixin <liushixin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14net/sched: act_vlan: Fix modify to allow 0Boris Sukholitko
[ Upstream commit 9c5eee0afca09cbde6bd00f77876754aaa552970 ] Currently vlan modification action checks existence of vlan priority by comparing it to 0. Therefore it is impossible to modify existing vlan tag to have priority 0. For example, the following tc command will change the vlan id but will not affect vlan priority: tc filter add dev eth1 ingress matchall action vlan modify id 300 \ priority 0 pipe mirred egress redirect dev eth2 The incoming packet on eth1: ethertype 802.1Q (0x8100), vlan 200, p 4, ethertype IPv4 will be changed to: ethertype 802.1Q (0x8100), vlan 300, p 4, ethertype IPv4 although the user has intended to have p == 0. The fix is to add tcfv_push_prio_exists flag to struct tcf_vlan_params and rely on it when deciding to set the priority. Fixes: 45a497f2d149a4a8061c (net/sched: act_vlan: Introduce TCA_VLAN_ACT_MODIFY vlan action) Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14xfrm: remove the fragment check for ipv6 beet modeXin Long
[ Upstream commit eebd49a4ffb420a991c606e54aa3c9f02857a334 ] In commit 68dc022d04eb ("xfrm: BEET mode doesn't support fragments for inner packets"), it tried to fix the issue that in TX side the packet is fragmented before the ESP encapping while in the RX side the fragments always get reassembled before decapping with ESP. This is not true for IPv6. IPv6 is different, and it's using exthdr to save fragment info, as well as the ESP info. Exthdrs are added in TX and processed in RX both in order. So in the above case, the ESP decapping will be done earlier than the fragment reassembling in TX side. Here just remove the fragment check for the IPv6 inner packets to recover the fragments support for BEET mode. Fixes: 68dc022d04eb ("xfrm: BEET mode doesn't support fragments for inner packets") Reported-by: Xiumei Mu <xmu@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14mptcp: generate subflow hmac after mptcp_finish_join()Jianguo Wu
[ Upstream commit 0a4d8e96e4fd687af92b961d5cdcea0fdbde05fe ] For outgoing subflow join, when recv SYNACK, in subflow_finish_connect(), the mptcp_finish_join() may return false in some cases, and send a RESET to remote, and no local hmac is required. So generate subflow hmac after mptcp_finish_join(). Fixes: ec3edaa7ca6c ("mptcp: Add handling of outgoing MP_JOIN requests") Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14mptcp: fix pr_debug in mptcp_token_new_connectJianguo Wu
[ Upstream commit 2f1af441fd5dd5caf0807bb19ce9bbf9325ce534 ] After commit 2c5ebd001d4f ("mptcp: refactor token container"), pr_debug() is called before mptcp_crypto_key_gen_sha() in mptcp_token_new_connect(), so the output local_key, token and idsn are 0, like: MPTCP: ssk=00000000f6b3c4a2, local_key=0, token=0, idsn=0 Move pr_debug() after mptcp_crypto_key_gen_sha(). Fixes: 2c5ebd001d4f ("mptcp: refactor token container") Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14net: qrtr: ns: Fix error return code in qrtr_ns_init()Wei Yongjun
[ Upstream commit a49e72b3bda73d36664a084e47da9727a31b8095 ] Fix to return a negative error code -ENOMEM from the error handling case instead of 0, as done elsewhere in this function. Fixes: c6e08d6251f3 ("net: qrtr: Allocate workqueue before kernel_bind") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14xfrm: xfrm_state_mtu should return at least 1280 for ipv6Sabrina Dubroca
[ Upstream commit b515d2637276a3810d6595e10ab02c13bfd0b63a ] Jianwen reported that IPv6 Interoperability tests are failing in an IPsec case where one of the links between the IPsec peers has an MTU of 1280. The peer generates a packet larger than this MTU, the router replies with a "Packet too big" message indicating an MTU of 1280. When the peer tries to send another large packet, xfrm_state_mtu returns 1280 - ipsec_overhead, which causes ip6_setup_cork to fail with EINVAL. We can fix this by forcing xfrm_state_mtu to return IPV6_MIN_MTU when IPv6 is used. After going through IPsec, the packet will then be fragmented to obey the actual network's PMTU, just before leaving the host. Currently, TFC padding is capped to PMTU - overhead to avoid fragementation: after padding and encapsulation, we still fit within the PMTU. That behavior is preserved in this patch. Fixes: 91657eafb64b ("xfrm: take net hdr len into account for esp payload size calculation") Reported-by: Jianwen Ji <jiji@redhat.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Sasha Levin <sashal@kernel.org>