summaryrefslogtreecommitdiff
path: root/net/vmw_vsock/virtio_transport_common.c
AgeCommit message (Collapse)Author
2023-05-23bpf, sockmap: Pass skb ownership through read_skbJohn Fastabend
The read_skb hook calls consume_skb() now, but this means that if the recv_actor program wants to use the skb it needs to inc the ref cnt so that the consume_skb() doesn't kfree the sk_buff. This is problematic because in some error cases under memory pressure we may need to linearize the sk_buff from sk_psock_skb_ingress_enqueue(). Then we get this, skb_linearize() __pskb_pull_tail() pskb_expand_head() BUG_ON(skb_shared(skb)) Because we incremented users refcnt from sk_psock_verdict_recv() we hit the bug on with refcnt > 1 and trip it. To fix lets simply pass ownership of the sk_buff through the skb_read call. Then we can drop the consume from read_skb handlers and assume the verdict recv does any required kfree. Bug found while testing in our CI which runs in VMs that hit memory constraints rather regularly. William tested TCP read_skb handlers. [ 106.536188] ------------[ cut here ]------------ [ 106.536197] kernel BUG at net/core/skbuff.c:1693! [ 106.536479] invalid opcode: 0000 [#1] PREEMPT SMP PTI [ 106.536726] CPU: 3 PID: 1495 Comm: curl Not tainted 5.19.0-rc5 #1 [ 106.537023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.16.0-1 04/01/2014 [ 106.537467] RIP: 0010:pskb_expand_head+0x269/0x330 [ 106.538585] RSP: 0018:ffffc90000138b68 EFLAGS: 00010202 [ 106.538839] RAX: 000000000000003f RBX: ffff8881048940e8 RCX: 0000000000000a20 [ 106.539186] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff8881048940e8 [ 106.539529] RBP: ffffc90000138be8 R08: 00000000e161fd1a R09: 0000000000000000 [ 106.539877] R10: 0000000000000018 R11: 0000000000000000 R12: ffff8881048940e8 [ 106.540222] R13: 0000000000000003 R14: 0000000000000000 R15: ffff8881048940e8 [ 106.540568] FS: 00007f277dde9f00(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000 [ 106.540954] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 106.541227] CR2: 00007f277eeede64 CR3: 000000000ad3e000 CR4: 00000000000006e0 [ 106.541569] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 106.541915] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 106.542255] Call Trace: [ 106.542383] <IRQ> [ 106.542487] __pskb_pull_tail+0x4b/0x3e0 [ 106.542681] skb_ensure_writable+0x85/0xa0 [ 106.542882] sk_skb_pull_data+0x18/0x20 [ 106.543084] bpf_prog_b517a65a242018b0_bpf_skskb_http_verdict+0x3a9/0x4aa9 [ 106.543536] ? migrate_disable+0x66/0x80 [ 106.543871] sk_psock_verdict_recv+0xe2/0x310 [ 106.544258] ? sk_psock_write_space+0x1f0/0x1f0 [ 106.544561] tcp_read_skb+0x7b/0x120 [ 106.544740] tcp_data_queue+0x904/0xee0 [ 106.544931] tcp_rcv_established+0x212/0x7c0 [ 106.545142] tcp_v4_do_rcv+0x174/0x2a0 [ 106.545326] tcp_v4_rcv+0xe70/0xf60 [ 106.545500] ip_protocol_deliver_rcu+0x48/0x290 [ 106.545744] ip_local_deliver_finish+0xa7/0x150 Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") Reported-by: William Findlay <will@isovalent.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: William Findlay <will@isovalent.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20230523025618.113937-2-john.fastabend@gmail.com
2023-04-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Conflicts: drivers/net/ethernet/google/gve/gve.h 3ce934558097 ("gve: Secure enough bytes in the first TX desc for all TCP pkts") 75eaae158b1b ("gve: Add XDP DROP and TX support for GQI-QPL format") https://lore.kernel.org/all/20230406104927.45d176f5@canb.auug.org.au/ https://lore.kernel.org/all/c5872985-1a95-0bc8-9dcc-b6f23b439e9d@tessares.net/ Adjacent changes: net/can/isotp.c 051737439eae ("can: isotp: fix race between isotp_sendsmg() and isotp_release()") 96d1c81e6a04 ("can: isotp: add module parameter for maximum pdu size") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-31virtio/vsock: fix leaks due to missing skb ownerBobby Eshleman
This patch sets the skb owner in the recv and send path for virtio. For the send path, this solves the leak caused when virtio_transport_purge_skbs() finds skb->sk is always NULL and therefore never matches it with the current socket. Setting the owner upon allocation fixes this. For the recv path, this ensures correctness of accounting and also correct transfer of ownership in vsock_loopback (when skbs are sent from one socket and received by another). Fixes: 71dc9ec9ac7d ("virtio/vsock: replace virtio_vsock_pkt with sk_buff") Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com> Reported-by: Cong Wang <xiyou.wangcong@gmail.com> Link: https://lore.kernel.org/all/ZCCbATwov4U+GBUv@pop-os.localdomain/ Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Conflicts: drivers/net/ethernet/mediatek/mtk_ppe.c 3fbe4d8c0e53 ("net: ethernet: mtk_eth_soc: ppe: add support for flow accounting") 924531326e2d ("net: ethernet: mtk_eth_soc: add missing ppe cache flush when deleting a flow") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-30virtio/vsock: WARN_ONCE() for invalid state of socketArseniy Krasnov
This adds WARN_ONCE() and return from stream dequeue callback when socket's queue is empty, but 'rx_bytes' still non-zero. This allows the detection of potential bugs due to packet merging (see previous patch). Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-03-30virtio/vsock: fix header length on skb mergingArseniy Krasnov
This fixes appending newly arrived skbuff to the last skbuff of the socket's queue. Problem fires when we are trying to append data to skbuff which was already processed in dequeue callback at least once. Dequeue callback calls function 'skb_pull()' which changes 'skb->len'. In current implementation 'skb->len' is used to update length in header of the last skbuff after new data was copied to it. This is bug, because value in header is used to calculate 'rx_bytes'/'fwd_cnt' and thus must be not be changed during skbuff's lifetime. Bug starts to fire since: commit 077706165717 ("virtio/vsock: don't use skbuff state to account credit") It presents before, but didn't triggered due to a little bit buggy implementation of credit calculation logic. So use Fixes tag for it. Fixes: 077706165717 ("virtio/vsock: don't use skbuff state to account credit") Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-03-29vsock: support sockmapBobby Eshleman
This patch adds sockmap support for vsock sockets. It is intended to be usable by all transports, but only the virtio and loopback transports are implemented. SOCK_STREAM, SOCK_DGRAM, and SOCK_SEQPACKET are all supported. Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-28virtio/vsock: check argument to avoid no effect callArseniy Krasnov
Both of these functions have no effect when input argument is 0, so to avoid useless spinlock access, check argument before it. Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-03-28virtio/vsock: allocate multiple skbuffs on txArseniy Krasnov
This adds small optimization for tx path: instead of allocating single skbuff on every call to transport, allocate multiple skbuff's until credit space allows, thus trying to send as much as possible data without return to af_vsock.c. Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-03-22virtio/vsock: check transport before skb allocationArseniy Krasnov
Pointer to transport could be checked before allocation of skbuff, thus there is no need to free skbuff when this pointer is NULL. Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Bobby Eshleman <bobby.eshleman@bytedance.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Link: https://lore.kernel.org/r/08d61bef-0c11-c7f9-9266-cb2109070314@sberdevices.ru Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-03-16virtio/vsock: don't drop skbuff on copy failureArseniy Krasnov
This returns behaviour of SOCK_STREAM read as before skbuff usage. When copying to user fails current skbuff won't be dropped, but returned to sockets's queue. Technically instead of 'skb_dequeue()', 'skb_peek()' is called and when skbuff becomes empty, it is removed from queue by '__skb_unlink()'. Fixes: 71dc9ec9ac7d ("virtio/vsock: replace virtio_vsock_pkt with sk_buff") Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Bobby Eshleman <bobby.eshleman@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-16virtio/vsock: remove redundant 'skb_pull()' callArseniy Krasnov
Since we now no longer use 'skb->len' to update credit, there is no sense to update skbuff state, because it is used only once after dequeue to copy data and then will be released. Fixes: 71dc9ec9ac7d ("virtio/vsock: replace virtio_vsock_pkt with sk_buff") Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Bobby Eshleman <bobby.eshleman@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-16virtio/vsock: don't use skbuff state to account creditArseniy Krasnov
'skb->len' can vary when we partially read the data, this complicates the calculation of credit to be updated in 'virtio_transport_inc_rx_pkt()/ virtio_transport_dec_rx_pkt()'. Also in 'virtio_transport_dec_rx_pkt()' we were miscalculating the credit since 'skb->len' was redundant. For these reasons, let's replace the use of skbuff state to calculate new 'rx_bytes'/'fwd_cnt' values with explicit value as input argument. This makes code more simple, because it is not needed to change skbuff state before each call to update 'rx_bytes'/'fwd_cnt'. Fixes: 71dc9ec9ac7d ("virtio/vsock: replace virtio_vsock_pkt with sk_buff") Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Bobby Eshleman <bobby.eshleman@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-16virtio/vsock: replace virtio_vsock_pkt with sk_buffBobby Eshleman
This commit changes virtio/vsock to use sk_buff instead of virtio_vsock_pkt. Beyond better conforming to other net code, using sk_buff allows vsock to use sk_buff-dependent features in the future (such as sockmap) and improves throughput. This patch introduces the following performance changes: Tool: Uperf Env: Phys Host + L1 Guest Payload: 64k Threads: 16 Test Runs: 10 Type: SOCK_STREAM Before: commit b7bfaa761d760 ("Linux 6.2-rc3") Before ------ g2h: 16.77Gb/s h2g: 10.56Gb/s After ----- g2h: 21.04Gb/s h2g: 10.76Gb/s Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Merge in the left-over fixes before the net-next pull-request. Conflicts: drivers/net/ethernet/mediatek/mtk_ppe.c ae3ed15da588 ("net: ethernet: mtk_eth_soc: fix state in __mtk_foe_entry_clear") 9d8cb4c096ab ("net: ethernet: mtk_eth_soc: add foe_entry_size to mtk_eth_soc") https://lore.kernel.org/all/6cb6893b-4921-a068-4c30-1109795110bb@tessares.net/ kernel/bpf/helpers.c 8addbfc7b308 ("bpf: Gate dynptr API behind CAP_BPF") 5679ff2f138f ("bpf: Move bpf_loop and bpf_for_each_map_elem under CAP_BPF") 8a67f2de9b1d ("bpf: expose bpf_strtol and bpf_strtoul to all program types") https://lore.kernel.org/all/20221003201957.13149-1-daniel@iogearbox.net/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29vhost/vsock: Use kvmalloc/kvfree for larger packets.Junichi Uekawa
When copying a large file over sftp over vsock, data size is usually 32kB, and kmalloc seems to fail to try to allocate 32 32kB regions. vhost-5837: page allocation failure: order:4, mode:0x24040c0 Call Trace: [<ffffffffb6a0df64>] dump_stack+0x97/0xdb [<ffffffffb68d6aed>] warn_alloc_failed+0x10f/0x138 [<ffffffffb68d868a>] ? __alloc_pages_direct_compact+0x38/0xc8 [<ffffffffb664619f>] __alloc_pages_nodemask+0x84c/0x90d [<ffffffffb6646e56>] alloc_kmem_pages+0x17/0x19 [<ffffffffb6653a26>] kmalloc_order_trace+0x2b/0xdb [<ffffffffb66682f3>] __kmalloc+0x177/0x1f7 [<ffffffffb66e0d94>] ? copy_from_iter+0x8d/0x31d [<ffffffffc0689ab7>] vhost_vsock_handle_tx_kick+0x1fa/0x301 [vhost_vsock] [<ffffffffc06828d9>] vhost_worker+0xf7/0x157 [vhost] [<ffffffffb683ddce>] kthread+0xfd/0x105 [<ffffffffc06827e2>] ? vhost_dev_set_owner+0x22e/0x22e [vhost] [<ffffffffb683dcd1>] ? flush_kthread_worker+0xf3/0xf3 [<ffffffffb6eb332e>] ret_from_fork+0x4e/0x80 [<ffffffffb683dcd1>] ? flush_kthread_worker+0xf3/0xf3 Work around by doing kvmalloc instead. Fixes: 433fc58e6bf2 ("VSOCK: Introduce vhost_vsock.ko") Signed-off-by: Junichi Uekawa <uekawa@chromium.org> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20220928064538.667678-1-uekawa@chromium.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-23virtio/vsock: check SO_RCVLOWAT before wake up readerArseniy Krasnov
This adds extra condition to wake up data reader: do it only when number of readable bytes >= SO_RCVLOWAT. Otherwise, there is no sense to kick user,because it will wait until SO_RCVLOWAT bytes will be dequeued. This check is performed in vsock_data_ready(). Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-08-23virtio/vsock: use 'target' in notify_poll_in callbackArseniy Krasnov
This callback controls setting of POLLIN, POLLRDNORM output bits of poll() syscall, but in some cases, it is incorrectly to set it, when socket has at least 1 bytes of available data. Use 'target' which is already exists. Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2021-12-08virtio/vsock: fix the transport to work with VMADDR_CID_ANYWei Wang
The VMADDR_CID_ANY flag used by a socket means that the socket isn't bound to any specific CID. For example, a host vsock server may want to be bound with VMADDR_CID_ANY, so that a guest vsock client can connect to the host server with CID=VMADDR_CID_HOST (i.e. 2), and meanwhile, a host vsock client can connect to the same local server with CID=VMADDR_CID_LOCAL (i.e. 1). The current implementation sets the destination socket's svm_cid to a fixed CID value after the first client's connection, which isn't an expected operation. For example, if the guest client first connects to the host server, the server's svm_cid gets set to VMADDR_CID_HOST, then other host clients won't be able to connect to the server anymore. Reproduce steps: 1. Run the host server: socat VSOCK-LISTEN:1234,fork - 2. Run a guest client to connect to the host server: socat - VSOCK-CONNECT:2:1234 3. Run a host client to connect to the host server: socat - VSOCK-CONNECT:1:1234 Without this patch, step 3. above fails to connect, and socat complains "socat[1720] E connect(5, AF=40 cid:1 port:1234, 16): Connection reset by peer". With this patch, the above works well. Fixes: c0cfa2d8a788 ("vsock: add multi-transports support") Signed-off-by: Wei Wang <wei.w.wang@intel.com> Link: https://lore.kernel.org/r/20211126011823.1760-1-wei.w.wang@intel.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
2021-09-05virtio/vsock: support MSG_EOR bit processingArseny Krasnov
If packet has 'EOR' bit - set MSG_EOR in 'recvmsg()' flags. Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://lore.kernel.org/r/20210903123251.3273639-1-arseny.krasnov@kaspersky.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2021-09-05virtio/vsock: rename 'EOR' to 'EOM' bit.Arseny Krasnov
This current implemented bit is used to mark end of messages ('EOM' - end of message), not records('EOR' - end of record). Also rename 'record' to 'message' in implementation as it is different things. Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://lore.kernel.org/r/20210903123109.3273053-1-arseny.krasnov@kaspersky.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2021-08-03VSOCK: handle VIRTIO_VSOCK_OP_CREDIT_REQUESTHarshavardhan Unnibhavi
The original implementation of the virtio-vsock driver does not handle a VIRTIO_VSOCK_OP_CREDIT_REQUEST as required by the virtio-vsock specification. The vsock device emulated by vhost-vsock and the virtio-vsock driver never uses this request, which was probably why nobody noticed it. However, another implementation of the device may use this request type. Hence, this commit introduces a way to handle an explicit credit request by responding with a corresponding credit update as required by the virtio-vsock specification. Fixes: 06a8fc78367d ("VSOCK: Introduce virtio_vsock_common.ko") Signed-off-by: Harshavardhan Unnibhavi <harshanavkis@gmail.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20210802173506.2383-1-harshanavkis@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-06-29net: sock: introduce sk_error_reportAlexander Aring
This patch introduces a function wrapper to call the sk_error_report callback. That will prepare to add additional handling whenever sk_error_report is called, for example to trace socket errors. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-18vsock/virtio: remove redundant `copy_failed` variableStefano Garzarella
When memcpy_to_msg() fails in virtio_transport_seqpacket_do_dequeue(), we already set `dequeued_len` with the negative error value returned by memcpy_to_msg(). So we can directly check `dequeued_len` value instead of using a dedicated flag variable to skip the copy path for the rest of fragments. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11virtio/vsock: rest of SOCK_SEQPACKET supportArseny Krasnov
Small updates to make SOCK_SEQPACKET work: 1) Send SHUTDOWN on socket close for SEQPACKET type. 2) Set SEQPACKET packet type during send. 3) Set 'VIRTIO_VSOCK_SEQ_EOR' bit in flags for last packet of message. 4) Implement data check function for SEQPACKET. 5) Check for max datagram size. Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11virtio/vsock: add SEQPACKET receive logicArseny Krasnov
Update current receive logic for SEQPACKET support: performs check for packet and socket types on receive(if mismatch, then reset connection). Increment EOR counter on receive. Also if buffer of new packet was appended to buffer of last packet in rx queue, update flags of last packet with flags of new packet. Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11virtio/vsock: dequeue callback for SOCK_SEQPACKETArseny Krasnov
Callback fetches RW packets from rx queue of socket until whole record is copied(if user's buffer is full, user is not woken up). This is done to not stall sender, because if we wake up user and it leaves syscall, nobody will send credit update for rest of record, and sender will wait for next enter of read syscall at receiver's side. So if user buffer is full, we just send credit update and drop data. Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11virtio/vsock: simplify credit update function APIArseny Krasnov
This function is static and 'hdr' arg was always NULL. Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11virtio/vsock: set packet's type in virtio_transport_send_pkt_info()Arseny Krasnov
There is no need to set type of packet which differs from type of socket, so move passing type of packet from 'info' structure to 'virtio_transport_send_pkt_info()' function. Since at current time only stream type is supported, set it directly in 'virtio_ transport_send_pkt_info()', so callers don't need to set it. Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-20vsock/virtio: free queued packets when closing socketStefano Garzarella
As reported by syzbot [1], there is a memory leak while closing the socket. We partially solved this issue with commit ac03046ece2b ("vsock/virtio: free packets during the socket release"), but we forgot to drain the RX queue when the socket is definitely closed by the scheduled work. To avoid future issues, let's use the new virtio_transport_remove_sock() to drain the RX queue before removing the socket from the af_vsock lists calling vsock_remove_sock(). [1] https://syzkaller.appspot.com/bug?extid=24452624fc4c571eedd9 Fixes: ac03046ece2b ("vsock/virtio: free packets during the socket release") Reported-and-tested-by: syzbot+24452624fc4c571eedd9@syzkaller.appspotmail.com Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-08vsock/virtio: update credit only if socket is not closedStefano Garzarella
If the socket is closed or is being released, some resources used by virtio_transport_space_update() such as 'vsk->trans' may be released. To avoid a use after free bug we should only update the available credit when we are sure the socket is still open and we have the lock held. Fixes: 06a8fc78367d ("VSOCK: Introduce virtio_vsock_common.ko") Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20210208144454.84438-1-sgarzare@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-23vsock/virtio: discard packets only when socket is really closedStefano Garzarella
Starting from commit 8692cefc433f ("virtio_vsock: Fix race condition in virtio_transport_recv_pkt"), we discard packets in virtio_transport_recv_pkt() if the socket has been released. When the socket is connected, we schedule a delayed work to wait the RST packet from the other peer, also if SHUTDOWN_MASK is set in sk->sk_shutdown. This is done to complete the virtio-vsock shutdown algorithm, releasing the port assigned to the socket definitively only when the other peer has consumed all the packets. If we discard the RST packet received, the socket will be closed only when the VSOCK_CLOSE_TIMEOUT is reached. Sergio discovered the issue while running ab(1) HTTP benchmark using libkrun [1] and observing a latency increase with that commit. To avoid this issue, we discard packet only if the socket is really closed (SOCK_DONE flag is set). We also set SOCK_DONE in virtio_transport_release() when we don't need to wait any packets from the other peer (we didn't schedule the delayed work). In this case we remove the socket from the vsock lists, releasing the port assigned. [1] https://github.com/containers/libkrun Fixes: 8692cefc433f ("virtio_vsock: Fix race condition in virtio_transport_recv_pkt") Cc: justin.he@arm.com Reported-by: Sergio Lopez <slp@redhat.com> Tested-by: Sergio Lopez <slp@redhat.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Jia He <justin.he@arm.com> Link: https://lore.kernel.org/r/20201120104736.73749-1-sgarzare@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-05-30virtio_vsock: Fix race condition in virtio_transport_recv_pktJia He
When client on the host tries to connect(SOCK_STREAM, O_NONBLOCK) to the server on the guest, there will be a panic on a ThunderX2 (armv8a server): [ 463.718844] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 [ 463.718848] Mem abort info: [ 463.718849] ESR = 0x96000044 [ 463.718852] EC = 0x25: DABT (current EL), IL = 32 bits [ 463.718853] SET = 0, FnV = 0 [ 463.718854] EA = 0, S1PTW = 0 [ 463.718855] Data abort info: [ 463.718856] ISV = 0, ISS = 0x00000044 [ 463.718857] CM = 0, WnR = 1 [ 463.718859] user pgtable: 4k pages, 48-bit VAs, pgdp=0000008f6f6e9000 [ 463.718861] [0000000000000000] pgd=0000000000000000 [ 463.718866] Internal error: Oops: 96000044 [#1] SMP [...] [ 463.718977] CPU: 213 PID: 5040 Comm: vhost-5032 Tainted: G O 5.7.0-rc7+ #139 [ 463.718980] Hardware name: GIGABYTE R281-T91-00/MT91-FS1-00, BIOS F06 09/25/2018 [ 463.718982] pstate: 60400009 (nZCv daif +PAN -UAO) [ 463.718995] pc : virtio_transport_recv_pkt+0x4c8/0xd40 [vmw_vsock_virtio_transport_common] [ 463.718999] lr : virtio_transport_recv_pkt+0x1fc/0xd40 [vmw_vsock_virtio_transport_common] [ 463.719000] sp : ffff80002dbe3c40 [...] [ 463.719025] Call trace: [ 463.719030] virtio_transport_recv_pkt+0x4c8/0xd40 [vmw_vsock_virtio_transport_common] [ 463.719034] vhost_vsock_handle_tx_kick+0x360/0x408 [vhost_vsock] [ 463.719041] vhost_worker+0x100/0x1a0 [vhost] [ 463.719048] kthread+0x128/0x130 [ 463.719052] ret_from_fork+0x10/0x18 The race condition is as follows: Task1 Task2 ===== ===== __sock_release virtio_transport_recv_pkt __vsock_release vsock_find_bound_socket (found sk) lock_sock_nested vsock_remove_sock sock_orphan sk_set_socket(sk, NULL) sk->sk_shutdown = SHUTDOWN_MASK ... release_sock lock_sock virtio_transport_recv_connecting sk->sk_socket->state (panic!) The root cause is that vsock_find_bound_socket can't hold the lock_sock, so there is a small race window between vsock_find_bound_socket() and lock_sock(). If __vsock_release() is running in another task, sk->sk_socket will be set to NULL inadvertently. This fixes it by checking sk->sk_shutdown(suggested by Stefano) after lock_sock since sk->sk_shutdown is set to SHUTDOWN_MASK under the protection of lock_sock_nested. Signed-off-by: Jia He <justin.he@arm.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27vsock/virtio: fix multiple packet delivery to monitoring devicesStefano Garzarella
In virtio_transport.c, if the virtqueue is full, the transmitting packet is queued up and it will be sent in the next iteration. This causes the same packet to be delivered multiple times to monitoring devices. We want to continue to deliver packets to monitoring devices before it is put in the virtqueue, to avoid that replies can appear in the packet capture before the transmitted packet. This patch fixes the issue, adding a new flag (tap_delivered) in struct virtio_vsock_pkt, to check if the packet is already delivered to monitoring devices. In vhost/vsock.c, we are splitting packets, so we must set 'tap_delivered' to false when we queue up the same virtio_vsock_pkt to handle the remaining bytes. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
The mptcp conflict was overlapping additions. The SMC conflict was an additional and removal happening at the same time. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27vsock: fix potential deadlock in transport->release()Stefano Garzarella
Some transports (hyperv, virtio) acquire the sock lock during the .release() callback. In the vsock_stream_connect() we call vsock_assign_transport(); if the socket was previously assigned to another transport, the vsk->transport->release() is called, but the sock lock is already held in the vsock_stream_connect(), causing a deadlock reported by syzbot: INFO: task syz-executor280:9768 blocked for more than 143 seconds. Not tainted 5.6.0-rc1-syzkaller #0 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. syz-executor280 D27912 9768 9766 0x00000000 Call Trace: context_switch kernel/sched/core.c:3386 [inline] __schedule+0x934/0x1f90 kernel/sched/core.c:4082 schedule+0xdc/0x2b0 kernel/sched/core.c:4156 __lock_sock+0x165/0x290 net/core/sock.c:2413 lock_sock_nested+0xfe/0x120 net/core/sock.c:2938 virtio_transport_release+0xc4/0xd60 net/vmw_vsock/virtio_transport_common.c:832 vsock_assign_transport+0xf3/0x3b0 net/vmw_vsock/af_vsock.c:454 vsock_stream_connect+0x2b3/0xc70 net/vmw_vsock/af_vsock.c:1288 __sys_connect_file+0x161/0x1c0 net/socket.c:1857 __sys_connect+0x174/0x1b0 net/socket.c:1874 __do_sys_connect net/socket.c:1885 [inline] __se_sys_connect net/socket.c:1882 [inline] __x64_sys_connect+0x73/0xb0 net/socket.c:1882 do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe To avoid this issue, this patch remove the lock acquiring in the .release() callback of hyperv and virtio transports, and it holds the lock when we call vsk->transport->release() in the vsock core. Reported-by: syzbot+731710996d79d0d58fbc@syzkaller.appspotmail.com Fixes: 408624af4c89 ("vsock: use local transport when it is loaded") Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16net: virtio_vsock: Enhance connection semanticsSebastien Boeuf
Whenever the vsock backend on the host sends a packet through the RX queue, it expects an answer on the TX queue. Unfortunately, there is one case where the host side will hang waiting for the answer and might effectively never recover if no timeout mechanism was implemented. This issue happens when the guest side starts binding to the socket, which insert a new bound socket into the list of already bound sockets. At this time, we expect the guest to also start listening, which will trigger the sk_state to move from TCP_CLOSE to TCP_LISTEN. The problem occurs if the host side queued a RX packet and triggered an interrupt right between the end of the binding process and the beginning of the listening process. In this specific case, the function processing the packet virtio_transport_recv_pkt() will find a bound socket, which means it will hit the switch statement checking for the sk_state, but the state won't be changed into TCP_LISTEN yet, which leads the code to pick the default statement. This default statement will only free the buffer, while it should also respond to the host side, by sending a packet on its TX queue. In order to simply fix this unfortunate chain of events, it is important that in case the default statement is entered, and because at this stage we know the host side is waiting for an answer, we must send back a packet containing the operation VIRTIO_VSOCK_OP_RST. One could say that a proper timeout mechanism on the host side will be enough to avoid the backend to hang. But the point of this patch is to ensure the normal use case will be provided with proper responsiveness when it comes to establishing the connection. Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
Mere overlapping changes in the conflicts here. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-16vsock/virtio: add WARN_ON check on virtio_transport_get_ops()Stefano Garzarella
virtio_transport_get_ops() and virtio_transport_send_pkt_info() can only be used on connecting/connected sockets, since a socket assigned to a transport is required. This patch adds a WARN_ON() on virtio_transport_get_ops() to check this requirement, a comment and a returned error on virtio_transport_send_pkt_info(), Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-16vsock/virtio: fix null-pointer dereference in virtio_transport_recv_listen()Stefano Garzarella
With multi-transport support, listener sockets are not bound to any transport. So, calling virtio_transport_reset(), when an error occurs, on a listener socket produces the following null-pointer dereference: BUG: kernel NULL pointer dereference, address: 00000000000000e8 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 0 PID: 20 Comm: kworker/0:1 Not tainted 5.5.0-rc1-ste-00003-gb4be21f316ac-dirty #56 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 Workqueue: virtio_vsock virtio_transport_rx_work [vmw_vsock_virtio_transport] RIP: 0010:virtio_transport_send_pkt_info+0x20/0x130 [vmw_vsock_virtio_transport_common] Code: 1f 84 00 00 00 00 00 0f 1f 00 55 48 89 e5 41 57 41 56 41 55 49 89 f5 41 54 49 89 fc 53 48 83 ec 10 44 8b 76 20 e8 c0 ba fe ff <48> 8b 80 e8 00 00 00 e8 64 e3 7d c1 45 8b 45 00 41 8b 8c 24 d4 02 RSP: 0018:ffffc900000b7d08 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff88807bf12728 RCX: 0000000000000000 RDX: ffff88807bf12700 RSI: ffffc900000b7d50 RDI: ffff888035c84000 RBP: ffffc900000b7d40 R08: ffff888035c84000 R09: ffffc900000b7d08 R10: ffff8880781de800 R11: 0000000000000018 R12: ffff888035c84000 R13: ffffc900000b7d50 R14: 0000000000000000 R15: ffff88807bf12724 FS: 0000000000000000(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000e8 CR3: 00000000790f4004 CR4: 0000000000160ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: virtio_transport_reset+0x59/0x70 [vmw_vsock_virtio_transport_common] virtio_transport_recv_pkt+0x5bb/0xe50 [vmw_vsock_virtio_transport_common] ? detach_buf_split+0xf1/0x130 virtio_transport_rx_work+0xba/0x130 [vmw_vsock_virtio_transport] process_one_work+0x1c0/0x300 worker_thread+0x45/0x3c0 kthread+0xfc/0x130 ? current_work+0x40/0x40 ? kthread_park+0x90/0x90 ret_from_fork+0x35/0x40 Modules linked in: sunrpc kvm_intel kvm vmw_vsock_virtio_transport vmw_vsock_virtio_transport_common irqbypass vsock virtio_rng rng_core CR2: 00000000000000e8 ---[ end trace e75400e2ea2fa824 ]--- This happens because virtio_transport_reset() calls virtio_transport_send_pkt_info() that can be used only on connecting/connected sockets. This patch fixes the issue, using virtio_transport_reset_no_sock() instead of virtio_transport_reset() when we are handling a listener socket. Fixes: c0cfa2d8a788 ("vsock: add multi-transports support") Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-11vsock/virtio_transport_common: remove unused virtio header includesStefano Garzarella
We can remove virtio header includes, because virtio_transport_common doesn't use virtio API, but provides common functions to interface virtio/vhost transports with the af_vsock core, and to handle the protocol. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-14vsock: add multi-transports supportStefano Garzarella
This patch adds the support of multiple transports in the VSOCK core. With the multi-transports support, we can use vsock with nested VMs (using also different hypervisors) loading both guest->host and host->guest transports at the same time. Major changes: - vsock core module can be loaded regardless of the transports - vsock_core_init() and vsock_core_exit() are renamed to vsock_core_register() and vsock_core_unregister() - vsock_core_register() has a feature parameter (H2G, G2H, DGRAM) to identify which directions the transport can handle and if it's support DGRAM (only vmci) - each stream socket is assigned to a transport when the remote CID is set (during the connect() or when we receive a connection request on a listener socket). The remote CID is used to decide which transport to use: - remote CID <= VMADDR_CID_HOST will use guest->host transport; - remote CID == local_cid (guest->host transport) will use guest->host transport for loopback (host->guest transports don't support loopback); - remote CID > VMADDR_CID_HOST will use host->guest transport; - listener sockets are not bound to any transports since no transport operations are done on it. In this way we can create a listener socket, also if the transports are not loaded or with VMADDR_CID_ANY to listen on all transports. - DGRAM sockets are handled as before, since only the vmci_transport provides this feature. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-14vsock: add vsock_create_connected() called by transportsStefano Garzarella
All transports call __vsock_create() with the same parameters, most of them depending on the parent socket. In order to simplify the VSOCK core APIs exposed to the transports, this patch adds the vsock_create_connected() callable from transports to create a new socket when a connection request is received. We also unexported the __vsock_create(). Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Jorgen Hansen <jhansen@vmware.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-14vsock: handle buffer_size sockopts in the coreStefano Garzarella
virtio_transport and vmci_transport handle the buffer_size sockopts in a very similar way. In order to support multiple transports, this patch moves this handling in the core to allow the user to change the options also if the socket is not yet assigned to any transport. This patch also adds the '.notify_buffer_size' callback in the 'struct virtio_transport' in order to inform the transport, when the buffer_size is changed by the user. It is also useful to limit the 'buffer_size' requested (e.g. virtio transports). Acked-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Jorgen Hansen <jhansen@vmware.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-14vsock: add 'struct vsock_sock *' param to vsock_core_get_transport()Stefano Garzarella
Since now the 'struct vsock_sock' object contains a pointer to the transport, this patch adds a parameter to the vsock_core_get_transport() to return the right transport assigned to the socket. This patch modifies also the virtio_transport_get_ops(), that uses the vsock_core_get_transport(), adding the 'struct vsock_sock *' parameter. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Jorgen Hansen <jhansen@vmware.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-14vsock/virtio: add transport parameter to the virtio_transport_reset_no_sock()Stefano Garzarella
We are going to add 'struct vsock_sock *' parameter to virtio_transport_get_ops(). In some cases, like in the virtio_transport_reset_no_sock(), we don't have any socket assigned to the packet received, so we can't use the virtio_transport_get_ops(). In order to allow virtio_transport_reset_no_sock() to use the '.send_pkt' callback from the 'vhost_transport' or 'virtio_transport', we add the 'struct virtio_transport *' to it and to its caller: virtio_transport_recv_pkt(). We moved the 'vhost_transport' and 'virtio_transport' definition, to pass their address to the virtio_transport_recv_pkt(). Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-14vsock: remove vm_sockets_get_local_cid()Stefano Garzarella
vm_sockets_get_local_cid() is only used in virtio_transport_common.c. We can replace it calling the virtio_transport_get_ops() and using the get_local_cid() callback registered by the transport. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Jorgen Hansen <jhansen@vmware.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
One conflict in the BPF samples Makefile, some fixes in 'net' whilst we were converting over to Makefile.target rules in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08vsock/virtio: fix sock refcnt holding during the shutdownStefano Garzarella
The "42f5cda5eaf4" commit rightly set SOCK_DONE on peer shutdown, but there is an issue if we receive the SHUTDOWN(RDWR) while the virtio_transport_close_timeout() is scheduled. In this case, when the timeout fires, the SOCK_DONE is already set and the virtio_transport_close_timeout() will not call virtio_transport_reset() and virtio_transport_do_close(). This causes that both sockets remain open and will never be released, preventing the unloading of [virtio|vhost]_transport modules. This patch fixes this issue, calling virtio_transport_reset() and virtio_transport_do_close() when we receive the SHUTDOWN(RDWR) and there is nothing left to read. Fixes: 42f5cda5eaf4 ("vsock/virtio: set SOCK_DONE on peer shutdown") Cc: Stephen Barber <smbarber@chromium.org> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06net: use helpers to change sk_ack_backlogEric Dumazet
Writers are holding a lock, but many readers do not. Following patch will add appropriate barriers in sk_acceptq_removed() and sk_acceptq_added(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>