Age | Commit message (Collapse) | Author |
|
commit c7cafd5b81cc07fb402e3068d134c21e60ea688c upstream.
8c8492ca64e7 ("io_uring/net: don't retry connect operation on EPOLLERR")
is a little dirty hack that
1) wrongfully assumes that POLLERR equals to a failed request, which
breaks all POLLERR users, e.g. all error queue recv interfaces.
2) deviates the connection request behaviour from connect(2), and
3) racy and solved at a wrong level.
Nothing can be done with 2) now, and 3) is beyond the scope of the
patch. At least solve 1) by moving the hack out of generic poll handling
into io_connect().
Cc: stable@vger.kernel.org
Fixes: 8c8492ca64e79 ("io_uring/net: don't retry connect operation on EPOLLERR")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3dc89036388d602ebd84c28e5042e457bdfc952b.1752682444.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
A previous commit aborted mapping more for a non-incremental ring for
bundle peeking, but depending on where in the process this peeking
happened, it would not necessarily prevent a retry by the user. That can
create gaps in the received/read data.
Add struct buf_sel_arg->partial_map, which can pass this information
back. The networking side can then map that to internal state and use it
to gate retry as well.
Since this necessitates a new flag, change io_sr_msg->retry to a
retry_flags member, and store both the retry and partial map condition
in there.
Cc: stable@vger.kernel.org
Fixes: 26ec15e4b0c1 ("io_uring/kbuf: don't truncate end buffer for multiple buffer peeks")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 178b8ff66ff827c41b4fa105e9aabb99a0b5c537)
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Commit 9a709b7e98e6fa51600b5f2d24c5068efa6d39de upstream.
A bigger array of vecs could've been allocated, but
io_ring_buffers_peek() still decided to cap the mapped range depending
on how much data was available. Hence don't rely on the segment count
to know if the request should be marked as needing cleanup, always
check upfront if the iov array is different than the fast_iov array.
Fixes: 26ec15e4b0c1 ("io_uring/kbuf: don't truncate end buffer for multiple buffer peeks")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
A previous fix corrected the retry condition for when to continue a
current bundle, but it missed that the current (not the total) transfer
count also applies to the buffer put. If not, then for incrementally
consumed buffer rings repeated completions on the same request may end
up over consuming.
Reported-by: Roy Tang (ErgoniaTrading) <royonia@ergonia.io>
Cc: stable@vger.kernel.org
Fixes: 3a08988123c8 ("io_uring/net: only retry recv bundle for a full transfer")
Link: https://github.com/axboe/liburing/issues/1423
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 51a4598ad5d9eb6be4ec9ba65bbfdf0ac302eb2e)
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Commit 2c7f023219966777be0687e15b57689894304cd3 upstream.
Currently retry and general validity of msg_inq is gated on it being
larger than zero, but it's entirely possible for this to be slightly
inaccurate. In particular, if FIN is received, it'll return 1.
Just use larger than 1 as the check. This covers both the FIN case, and
at the same time, it doesn't make much sense to retry a recv immediately
if there's even just a single 1 byte of valid data in the socket.
Leave the SOCK_NONEMPTY flagging when larger than 0 still, as an app may
use that for the final receive.
Cc: stable@vger.kernel.org
Reported-by: Christian Mazakas <christian.mazakas@gmail.com>
Fixes: 7c71a0af81ba ("io_uring/net: improve recv bundles")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Commit 3a08988123c868dbfdd054541b1090fb891fa49e upstream.
If a shorter than assumed transfer was seen, a partial buffer will have
been filled. For that case it isn't sane to attempt to fill more into
the bundle before posting a completion, as that will cause a gap in
the received data.
Check if the iterator has hit zero and only allow to continue a bundle
operation if that is the case.
Also ensure that for putting finished buffers, only the current transfer
is accounted. Otherwise too many buffers may be put for a short transfer.
Link: https://github.com/axboe/liburing/issues/1409
Cc: stable@vger.kernel.org
Fixes: 7c71a0af81ba ("io_uring/net: improve recv bundles")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Commit 7c71a0af81ba72de9b2c501065e4e718aba9a271 upstream.
Current recv bundles are only supported for multishot receives, and
additionally they also always post at least 2 CQEs if more data is
available than what a buffer will hold. This happens because the initial
bundle recv will do a single buffer, and then do the rest of what is in
the socket as a followup receive. As shown in a test program, if 1k
buffers are available and 32k is available to receive in the socket,
you'd get the following completions:
bundle=1, mshot=0
cqe res 1024
cqe res 1024
[...]
cqe res 1024
bundle=1, mshot=1
cqe res 1024
cqe res 31744
where bundle=1 && mshot=0 will post 32 1k completions, and bundle=1 &&
mshot=1 will post a 1k completion and then a 31k completion.
To support bundle recv without multishot, it's possible to simply retry
the recv immediately and post a single completion, rather than split it
into two completions. With the below patch, the same test looks as
follows:
bundle=1, mshot=0
cqe res 32768
bundle=1, mshot=1
cqe res 32768
where mshot=0 works fine for bundles, and both of them post just a
single 32k completion rather than split it into separate completions.
Posting fewer completions is always a nice win, and not needing
multishot for proper bundle efficiency is nice for cases that can't
necessarily use multishot.
Reported-by: Norman Maurer <norman_maurer@apple.com>
Link: https://lore.kernel.org/r/184f9f92-a682-4205-a15d-89e18f664502@kernel.dk
Fixes: 2f9c9515bdfd ("io_uring/net: support bundles for recv")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6889ae1b4df1579bcdffef023e2ea9a982565dff upstream.
[ 114.987980][ T5313] WARNING: CPU: 6 PID: 5313 at io_uring/io_uring.c:872 io_req_post_cqe+0x12e/0x4f0
[ 114.991597][ T5313] RIP: 0010:io_req_post_cqe+0x12e/0x4f0
[ 115.001880][ T5313] Call Trace:
[ 115.002222][ T5313] <TASK>
[ 115.007813][ T5313] io_send+0x4fe/0x10f0
[ 115.009317][ T5313] io_issue_sqe+0x1a6/0x1740
[ 115.012094][ T5313] io_wq_submit_work+0x38b/0xed0
[ 115.013223][ T5313] io_worker_handle_work+0x62a/0x1600
[ 115.013876][ T5313] io_wq_worker+0x34f/0xdf0
As the comment states, io_req_post_cqe() should only be used by
multishot requests, i.e. REQ_F_APOLL_MULTISHOT, which bundled sends are
not. Add a flag signifying whether a request wants to post multiple
CQEs. Eventually REQ_F_APOLL_MULTISHOT should imply the new flag, but
that's left out for simplicity.
Cc: stable@vger.kernel.org
Fixes: a05d1f625c7aa ("io_uring/net: support bundles for send")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8b611dbb54d1cd47a88681f5d38c84d0c02bc563.1743067183.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f6a89bf5278d6e15016a736db67043560d1b50d5 upstream.
REQ_F_APOLL_MULTISHOT doesn't guarantee it's executed from the multishot
context, so a multishot accept may get executed inline, fail
io_req_post_cqe(), and ask the core code to kill the request with
-ECANCELED by returning IOU_STOP_MULTISHOT even when a socket has been
accepted and installed.
Cc: stable@vger.kernel.org
Fixes: 390ed29b5e425 ("io_uring: add IORING_ACCEPT_MULTISHOT for accept")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/51c6deb01feaa78b08565ca8f24843c017f5bc80.1740331076.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 67c007d6c12da3e456c005083696c20d4498ae72 upstream.
refcount_t: underflow; use-after-free.
WARNING: CPU: 0 PID: 5823 at lib/refcount.c:28 refcount_warn_saturate+0x15a/0x1d0 lib/refcount.c:28
RIP: 0010:refcount_warn_saturate+0x15a/0x1d0 lib/refcount.c:28
Call Trace:
<TASK>
io_notif_flush io_uring/notif.h:40 [inline]
io_send_zc_cleanup+0x121/0x170 io_uring/net.c:1222
io_clean_op+0x58c/0x9a0 io_uring/io_uring.c:406
io_free_batch_list io_uring/io_uring.c:1429 [inline]
__io_submit_flush_completions+0xc16/0xd20 io_uring/io_uring.c:1470
io_submit_flush_completions io_uring/io_uring.h:159 [inline]
Before the blamed commit, sendzc relied on io_req_msg_cleanup() to clear
REQ_F_NEED_CLEANUP, so after the following snippet the request will
never hit the core io_uring cleanup path.
io_notif_flush();
io_req_msg_cleanup();
The easiest fix is to null the notification. io_send_zc_cleanup() can
still be called after, but it's tolerated.
Reported-by: syzbot+cf285a028ffba71b2ef5@syzkaller.appspotmail.com
Tested-by: syzbot+cf285a028ffba71b2ef5@syzkaller.appspotmail.com
Fixes: cc34d8330e036 ("io_uring/net: don't clear REQ_F_NEED_CLEANUP unconditionally")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e1306007458b8891c88c4f20c966a17595f766b0.1742643795.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit cc34d8330e036b6bffa88db9ea537bae6b03948f upstream.
io_req_msg_cleanup() relies on the fact that io_netmsg_recycle() will
always fully recycle, but that may not be the case if the msg cache
was already full. To ensure that normal cleanup always gets run,
let io_netmsg_recycle() deal with clearing the relevant cleanup flags,
as it knows exactly when that should be done.
Cc: stable@vger.kernel.org
Reported-by: David Wei <dw@davidwei.uk>
Fixes: 75191341785e ("io_uring/net: add iovec recycling")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 6ebf05189dfc6d0d597c99a6448a4d1064439a18 ]
Match the compat part of io_sendmsg_copy_hdr() with its counterpart and
save msg_control.
Fixes: c55978024d123 ("io_uring/net: move receive multishot out of the generic msghdr path")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2a8418821fe83d3b64350ad2b3c0303e9b732bbd.1740498502.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 8c8492ca64e79c6e0f433e8c9d2bcbd039ef83d0 upstream.
If a socket is shutdown before the connection completes, POLLERR is set
in the poll mask. However, connect ignores this as it doesn't know, and
attempts the connection again. This may lead to a bogus -ETIMEDOUT
result, where it should have noticed the POLLERR and just returned
-ECONNRESET instead.
Have the poll logic check for whether or not POLLERR is set in the mask,
and if so, mark the request as failed. Then connect can appropriately
fail the request rather than retry it.
Reported-by: Sergey Galas <ssgalas@cloud.ru>
Cc: stable@vger.kernel.org
Link: https://github.com/axboe/liburing/discussions/1335
Fixes: 3fb1bd688172 ("io_uring/net: handle -EINPROGRESS correct for IORING_OP_CONNECT")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit c6e60a0a68b7e6b3c7e33863a16e8e88ba9eee6f ]
syzbot reports that ->msg_inq may get used uinitialized from the
following path:
BUG: KMSAN: uninit-value in io_recv_buf_select io_uring/net.c:1094 [inline]
BUG: KMSAN: uninit-value in io_recv+0x930/0x1f90 io_uring/net.c:1158
io_recv_buf_select io_uring/net.c:1094 [inline]
io_recv+0x930/0x1f90 io_uring/net.c:1158
io_issue_sqe+0x420/0x2130 io_uring/io_uring.c:1740
io_queue_sqe io_uring/io_uring.c:1950 [inline]
io_req_task_submit+0xfa/0x1d0 io_uring/io_uring.c:1374
io_handle_tw_list+0x55f/0x5c0 io_uring/io_uring.c:1057
tctx_task_work_run+0x109/0x3e0 io_uring/io_uring.c:1121
tctx_task_work+0x6d/0xc0 io_uring/io_uring.c:1139
task_work_run+0x268/0x310 kernel/task_work.c:239
io_run_task_work+0x43a/0x4a0 io_uring/io_uring.h:343
io_cqring_wait io_uring/io_uring.c:2527 [inline]
__do_sys_io_uring_enter io_uring/io_uring.c:3439 [inline]
__se_sys_io_uring_enter+0x204f/0x4ce0 io_uring/io_uring.c:3330
__x64_sys_io_uring_enter+0x11f/0x1a0 io_uring/io_uring.c:3330
x64_sys_call+0xce5/0x3c30 arch/x86/include/generated/asm/syscalls_64.h:427
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
and it is correct, as it's never initialized upfront. Hence the first
submission can end up using it uninitialized, if the recv wasn't
successful and the networking stack didn't honor ->msg_get_inq being set
and filling in the output value of ->msg_inq as requested.
Set it to 0 upfront when it's allocated, just to silence this KMSAN
warning. There's no side effect of using it uninitialized, it'll just
potentially cause the next receive to use a recv value hint that's not
accurate.
Fixes: c6f32c7d9e09 ("io_uring/net: get rid of ->prep_async() for receive side")
Reported-by: syzbot+068ff190354d2f74892f@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
If the recv returns zero, or an error, then it doesn't matter if more
data has already been received for this buffer. A condition like that
should terminate the multishot receive. Rather than pass in the
collected return value, pass in whether to terminate or keep the recv
going separately.
Note that this isn't a bug right now, as the only way to get there is
via setting MSG_WAITALL with multishot receive. And if an application
does that, then -EINVAL is returned anyway. But it seems like an easy
bug to introduce, so let's make it a bit more explicit.
Link: https://github.com/axboe/liburing/issues/1246
Cc: stable@vger.kernel.org
Fixes: b3fdea6ecb55 ("io_uring: multishot recv")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In preparation for needing the consumed length, pass in the length being
completed. Unused right now, but will be used when it is possible to
partially consume a buffer.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This reverts commit 79996b45f7b28c0e3e08a95bab80119e95317e28.
Revert the change that restricts a send provided buffer to be zero, so
it will always consume the whole buffer. This is strictly needed for
partial consumption, as the send may very well be a subset of the
current buffer. In fact, that's the intended use case.
For non-incremental provided buffer rings, an application should set
sqe->len carefully to avoid the potential issue described in the
reverted commit. It is recommended that '0' still be set for len for
that case, if the application is set on maintaining more than 1 send
inflight for the same socket. This is somewhat of a nonsensical thing
to do.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Just like what is being done on the recv side, if we only map a single
segment, then use ITER_UBUF for mapping it. That's more efficient than
using an ITER_IOVEC.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If a send is issued marked with IOSQE_BUFFER_SELECT for selecting a
buffer, unless it's a bundle, it should not select multiple buffers.
Cc: stable@vger.kernel.org
Fixes: a05d1f625c7a ("io_uring/net: support bundles for send")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If the iovec inside the kmsg isn't already allocated AND one gets
expanded beyond the fixed size, then the request may not already have
been marked for cleanup. Ensure that it is.
Cc: stable@vger.kernel.org
Fixes: a05d1f625c7a ("io_uring/net: support bundles for send")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If the iovec inside the kmsg isn't already allocated AND one gets
expanded beyond the fixed size, then the request may not already have
been marked for cleanup. Ensure that it is.
Cc: stable@vger.kernel.org
Fixes: 2f9c9515bdfd ("io_uring/net: support bundles for recv")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Not much excitement - a handful of large patchsets (devmem among them)
did not make it in time.
Core & protocols:
- Use local_lock in addition to local_bh_disable() to protect per-CPU
resources in networking, a step closer for local_bh_disable() not
to act as a big lock on PREEMPT_RT
- Use flex array for netdevice priv area, ensure its cache alignment
- Add a sysctl knob to allow user to specify a default rto_min at
socket init time. Bit of a big hammer but multiple companies were
independently carrying such patch downstream so clearly it's useful
- Support scheduling transmission of packets based on CLOCK_TAI
- Un-pin TCP TIMEWAIT timer to avoid it firing on CPUs later cordoned
off using cpusets
- Support multiple L2TPv3 UDP tunnels using the same 5-tuple address
- Allow configuration of multipath hash seed, to both allow
synchronizing hashing of two routers, and preventing partial
accidental sync
- Improve TCP compliance with RFC 9293 for simultaneous connect()
- Support sending NAT keepalives in IPsec ESP in UDP states.
Userspace IKE daemon had to do this before, but the kernel can
better keep track of it
- Support sending supervision HSR frames with MAC addresses stored in
ProxyNodeTable when RedBox (i.e. HSR-SAN) is enabled
- Introduce IPPROTO_SMC for selecting SMC when socket is created
- Allow UDP GSO transmit from devices with no checksum offload
- openvswitch: add packet sampling via psample, separating the
sampled traffic from "upcall" packets sent to user space for
forwarding
- nf_tables: shrink memory consumption for transaction objects
Things we sprinkled into general kernel code:
- Power Sequencing subsystem (used by Qualcomm Bluetooth driver for
QCA6390) [ Already merged separately - Linus ]
- Add IRQ information in sysfs for auxiliary bus
- Introduce guard definition for local_lock
- Add aligned flavor of __cacheline_group_{begin, end}() markings for
grouping fields in structures
BPF:
- Notify user space (via epoll) when a struct_ops object is getting
detached/unregistered
- Add new kfuncs for a generic, open-coded bits iterator
- Enable BPF programs to declare arrays of kptr, bpf_rb_root, and
bpf_list_head
- Support resilient split BTF which cuts down on duplication and
makes BTF as compact as possible WRT BTF from modules
- Add support for dumping kfunc prototypes from BTF which enables
both detecting as well as dumping compilable prototypes for kfuncs
- riscv64 BPF JIT improvements in particular to add 12-argument
support for BPF trampolines and to utilize bpf_prog_pack for the
latter
- Add the capability to offload the netfilter flowtable in XDP layer
through kfuncs
Driver API:
- Allow users to configure IRQ tresholds between which automatic IRQ
moderation can choose
- Expand Power Sourcing (PoE) status with power, class and failure
reason. Support setting power limits
- Track additional RSS contexts in the core, make sure configuration
changes don't break them
- Support IPsec crypto offload for IPv6 ESP and IPv4 UDP-encapsulated
ESP data paths
- Support updating firmware on SFP modules
Tests and tooling:
- mptcp: use net/lib.sh to manage netns
- TCP-AO and TCP-MD5: replace debug prints used by tests with
tracepoints
- openvswitch: make test self-contained (don't depend on OvS CLI
tools)
Drivers:
- Ethernet high-speed NICs:
- Broadcom (bnxt):
- increase the max total outstanding PTP TX packets to 4
- add timestamping statistics support
- implement netdev_queue_mgmt_ops
- support new RSS context API
- Intel (100G, ice, idpf):
- implement FEC statistics and dumping signal quality indicators
- support E825C products (with 56Gbps PHYs)
- nVidia/Mellanox:
- support HW-GRO
- mlx4/mlx5: support per-queue statistics via netlink
- obey the max number of EQs setting in sub-functions
- AMD/Solarflare:
- support new RSS context API
- AMD/Pensando:
- ionic: rework fix for doorbell miss to lower overhead and
skip it on new HW
- Wangxun:
- txgbe: support Flow Director perfect filters
- Ethernet NICs consumer, embedded and virtual:
- Add driver for Tehuti Networks TN40xx chips
- Add driver for Meta's internal NIC chips
- Add driver for Ethernet MAC on Airoha EN7581 SoCs
- Add driver for Renesas Ethernet-TSN devices
- Google cloud vNIC:
- flow steering support
- Microsoft vNIC:
- support page sizes other than 4KB on ARM64
- vmware vNIC:
- support latency measurement (update to version 9)
- VirtIO net:
- support for Byte Queue Limits
- support configuring thresholds for automatic IRQ moderation
- support for AF_XDP Rx zero-copy
- Synopsys (stmmac):
- support for STM32MP13 SoC
- let platforms select the right PCS implementation
- TI:
- icssg-prueth: add multicast filtering support
- icssg-prueth: enable PTP timestamping and PPS
- Renesas:
- ravb: improve Rx performance 30-400% by using page pool,
theaded NAPI and timer-based IRQ coalescing
- ravb: add MII support for R-Car V4M
- Cadence (macb):
- macb: add ARP support to Wake-On-LAN
- Cortina:
- use phylib for RX and TX pause configuration
- Ethernet switches:
- nVidia/Mellanox:
- support configuration of multipath hash seed
- report more accurate max MTU
- use page_pool to improve Rx performance
- MediaTek:
- mt7530: add support for bridge port isolation
- Qualcomm:
- qca8k: add support for bridge port isolation
- Microchip:
- lan9371/2: add 100BaseTX PHY support
- NXP:
- vsc73xx: implement VLAN operations
- Ethernet PHYs:
- aquantia: enable support for aqr115c
- aquantia: add support for PHY LEDs
- realtek: add support for rtl8224 2.5Gbps PHY
- xpcs: add memory-mapped device support
- add BroadR-Reach link mode and support in Broadcom's PHY driver
- CAN:
- add document for ISO 15765-2 protocol support
- mcp251xfd: workaround for erratum DS80000789E, use timestamps to
catch when device returns incorrect FIFO status
- WiFi:
- mac80211/cfg80211:
- parse Transmit Power Envelope (TPE) data in mac80211 instead
of in drivers
- improvements for 6 GHz regulatory flexibility
- multi-link improvements
- support multiple radios per wiphy
- remove DEAUTH_NEED_MGD_TX_PREP flag
- Intel (iwlwifi):
- bump FW API to 91 for BZ/SC devices
- report 64-bit radiotap timestamp
- enable P2P low latency by default
- handle Transmit Power Envelope (TPE) advertised by AP
- remove support for older FW for new devices
- fast resume (keeping the device configured)
- mvm: re-enable Multi-Link Operation (MLO)
- aggregation (A-MSDU) optimizations
- MediaTek (mt76):
- mt7925 Multi-Link Operation (MLO) support
- Qualcomm (ath10k):
- LED support for various chipsets
- Qualcomm (ath12k):
- remove unsupported Tx monitor handling
- support channel 2 in 6 GHz band
- support Spatial Multiplexing Power Save (SMPS) in 6 GHz band
- supprt multiple BSSID (MBSSID) and Enhanced Multi-BSSID
Advertisements (EMA)
- support dynamic VLAN
- add panic handler for resetting the firmware state
- DebugFS support for datapath statistics
- WCN7850: support for Wake on WLAN
- Microchip (wilc1000):
- read MAC address during probe to make it visible to user space
- suspend/resume improvements
- TI (wl18xx):
- support newer firmware versions
- RealTek (rtw89):
- preparation for RTL8852BE-VT support
- Wake on WLAN support for WiFi 6 chips
- 36-bit PCI DMA support
- RealTek (rtlwifi):
- RTL8192DU support
- Broadcom (brcmfmac):
- Management Frame Protection support (to enable WPA3)
- Bluetooth:
- qualcomm: use the power sequencer for QCA6390
- btusb: mediatek: add ISO data transmission functions
- hci_bcm4377: add BCM4388 support
- btintel: add support for BlazarU core
- btintel: add support for Whale Peak2
- btnxpuart: add support for AW693 A1 chipset
- btnxpuart: add support for IW615 chipset
- btusb: add Realtek RTL8852BE support ID 0x13d3:0x3591"
* tag 'net-next-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1589 commits)
eth: fbnic: Fix spelling mistake "tiggerring" -> "triggering"
tcp: Replace strncpy() with strscpy()
wifi: ath12k: fix build vs old compiler
tcp: Don't access uninit tcp_rsk(req)->ao_keyid in tcp_create_openreq_child().
eth: fbnic: Write the TCAM tables used for RSS control and Rx to host
eth: fbnic: Add L2 address programming
eth: fbnic: Add basic Rx handling
eth: fbnic: Add basic Tx handling
eth: fbnic: Add link detection
eth: fbnic: Add initial messaging to notify FW of our presence
eth: fbnic: Implement Rx queue alloc/start/stop/free
eth: fbnic: Implement Tx queue alloc/start/stop/free
eth: fbnic: Allocate a netdevice and napi vectors with queues
eth: fbnic: Add FW communication mechanism
eth: fbnic: Add message parsing for FW messages
eth: fbnic: Add register init to set PCIe/Ethernet device config
eth: fbnic: Allocate core device specific structures and devlink interface
eth: fbnic: Add scaffolding for Meta's NIC driver
PCI: Add Meta Platforms vendor ID
net/sched: cls_flower: propagate tca[TCA_OPTIONS] to NL_REQ_ATTR_CHECK
...
|
|
Pull io_uring updates from Jens Axboe:
"Here are the io_uring updates queued up for 6.11.
Nothing major this time around, various minor improvements and
cleanups/fixes. This contains:
- Add bind/listen opcodes. Main motivation is to support direct
descriptors, to avoid needing a regular fd just for doing these two
operations (Gabriel)
- Probe fixes (Gabriel)
- Treat io-wq work flags as atomics. Not fixing a real issue, but may
as well and it silences a KCSAN warning (me)
- Cleanup of rsrc __set_current_state() usage (me)
- Add 64-bit for {m,f}advise operations (me)
- Improve performance of data ring messages (me)
- Fix for ring message overflow posting (Pavel)
- Fix for freezer interaction with TWA_NOTIFY_SIGNAL. Not strictly an
io_uring thing, but since TWA_NOTIFY_SIGNAL was originally added
for faster task_work signaling for io_uring, bundling it with this
pull (Pavel)
- Add Pavel as a co-maintainer
- Various cleanups (me, Thorsten)"
* tag 'for-6.11/io_uring-20240714' of git://git.kernel.dk/linux: (28 commits)
io_uring/net: check socket is valid in io_bind()/io_listen()
kernel: rerun task_work while freezing in get_signal()
io_uring/io-wq: limit retrying worker initialisation
io_uring/napi: Remove unnecessary s64 cast
io_uring/net: cleanup io_recv_finish() bundle handling
io_uring/msg_ring: fix overflow posting
MAINTAINERS: change Pavel Begunkov from io_uring reviewer to maintainer
io_uring/msg_ring: use kmem_cache_free() to free request
io_uring/msg_ring: check for dead submitter task
io_uring/msg_ring: add an alloc cache for io_kiocb entries
io_uring/msg_ring: improve handling of target CQE posting
io_uring: add io_add_aux_cqe() helper
io_uring: add remote task_work execution helper
io_uring/msg_ring: tighten requirement for remote posting
io_uring: Allocate only necessary memory in io_probe
io_uring: Fix probe of disabled operations
io_uring: Introduce IORING_OP_LISTEN
io_uring: Introduce IORING_OP_BIND
net: Split a __sys_listen helper for io_uring
net: Split a __sys_bind helper for io_uring
...
|
|
We need to check that sock_from_file(req->file) != NULL.
Reported-by: syzbot <syzbot+1e811482aa2c70afa9a0@syzkaller.appspotmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=1e811482aa2c70afa9a0
Fixes: 7481fd93fa0a ("io_uring: Introduce IORING_OP_BIND")
Fixes: ff140cc8628a ("io_uring: Introduce IORING_OP_LISTEN")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: https://lore.kernel.org/r/903da529-eaa3-43ef-ae41-d30f376c60cc@I-love.SAKURA.ne.jp
[axboe: move assignment of sock to where the NULL check is]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Cross-merge networking fixes after downstream PR.
Conflicts:
drivers/net/phy/aquantia/aquantia.h
219343755eae ("net: phy: aquantia: add missing include guards")
61578f679378 ("net: phy: aquantia: add support for PHY LEDs")
drivers/net/ethernet/wangxun/libwx/wx_hw.c
bd07a9817846 ("net: txgbe: remove separate irq request for MSI and INTx")
b501d261a5b3 ("net: txgbe: add FDIR ATR support")
https://lore.kernel.org/all/20240703112936.483c1975@canb.auug.org.au/
include/linux/mlx5/mlx5_ifc.h
048a403648fc ("net/mlx5: IFC updates for changing max EQs")
99be56171fa9 ("net/mlx5e: SHAMPO, Re-enable HW-GRO")
https://lore.kernel.org/all/20240701133951.6926b2e3@canb.auug.org.au/
Adjacent changes:
drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c
4130c67cd123 ("wifi: iwlwifi: mvm: check vif for NULL/ERR_PTR before dereference")
3f3126515fbe ("wifi: iwlwifi: mvm: add mvm-specific guard")
include/net/mac80211.h
816c6bec09ed ("wifi: mac80211: fix BSS_CHANGED_UNSOL_BCAST_PROBE_RESP")
5a009b42e041 ("wifi: mac80211: track changes in AP's TPE")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Combine the two cases that check for whether or not this is a bundle,
rather than having them as separate checks. This is easier to reduce,
and it reduces the text associated with it as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
For bundle receives to function properly, the previous iteration msg_inq
value is needed to make a judgement call on how much data there is to
receive. A previous fix ended up clearing it earlier as an error case
would potentially errantly set IORING_CQE_F_SOCK_NONEMPTY if the request
got failed.
Move the assignment to post assigning buffers for the receive, but
ensure that it's cleared for the buffer selection error case. With that,
buffer selection has the right msg_inq value and can correctly bundle
receives as designed.
Noticed while testing where it was apparent than more than 1 buffer was
never received. After fix was in place, multiple buffers are correctly
picked for receive. This provides a 10x speedup for the test case, as
the buffer size used was 64b.
Fixes: 18414a4a2eab ("io_uring/net: assign kmsg inq/flags before buffer selection")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently, io_uring's io_sg_from_iter() duplicates the part of
__zerocopy_sg_from_iter() charging pages to the socket. It'd be too easy
to miss while changing it in net/, the chunk is not the most
straightforward for outside users and full of internal implementation
details. io_uring is not a good place to keep it, deduplicate it by
moving out of the callback into __zerocopy_sg_from_iter().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
IORING_OP_LISTEN provides the semantic of listen(2) via io_uring. While
this is an essentially synchronous system call, the main point is to
enable a network path to execute fully with io_uring registered and
descriptorless files.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20240614163047.31581-4-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
IORING_OP_BIND provides the semantic of bind(2) via io_uring. While
this is an essentially synchronous system call, the main point is to
enable a network path to execute fully with io_uring registered and
descriptorless files.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20240614163047.31581-3-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
syzbot reports that recv is using an uninitialized value:
=====================================================
BUG: KMSAN: uninit-value in io_req_cqe_overflow io_uring/io_uring.c:810 [inline]
BUG: KMSAN: uninit-value in io_req_complete_post io_uring/io_uring.c:937 [inline]
BUG: KMSAN: uninit-value in io_issue_sqe+0x1f1b/0x22c0 io_uring/io_uring.c:1763
io_req_cqe_overflow io_uring/io_uring.c:810 [inline]
io_req_complete_post io_uring/io_uring.c:937 [inline]
io_issue_sqe+0x1f1b/0x22c0 io_uring/io_uring.c:1763
io_wq_submit_work+0xa17/0xeb0 io_uring/io_uring.c:1860
io_worker_handle_work+0xc04/0x2000 io_uring/io-wq.c:597
io_wq_worker+0x447/0x1410 io_uring/io-wq.c:651
ret_from_fork+0x6d/0x90 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
Uninit was stored to memory at:
io_req_set_res io_uring/io_uring.h:215 [inline]
io_recv_finish+0xf10/0x1560 io_uring/net.c:861
io_recv+0x12ec/0x1ea0 io_uring/net.c:1175
io_issue_sqe+0x429/0x22c0 io_uring/io_uring.c:1751
io_wq_submit_work+0xa17/0xeb0 io_uring/io_uring.c:1860
io_worker_handle_work+0xc04/0x2000 io_uring/io-wq.c:597
io_wq_worker+0x447/0x1410 io_uring/io-wq.c:651
ret_from_fork+0x6d/0x90 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
Uninit was created at:
slab_post_alloc_hook mm/slub.c:3877 [inline]
slab_alloc_node mm/slub.c:3918 [inline]
__do_kmalloc_node mm/slub.c:4038 [inline]
__kmalloc+0x6e4/0x1060 mm/slub.c:4052
kmalloc include/linux/slab.h:632 [inline]
io_alloc_async_data+0xc0/0x220 io_uring/io_uring.c:1662
io_msg_alloc_async io_uring/net.c:166 [inline]
io_recvmsg_prep_setup io_uring/net.c:725 [inline]
io_recvmsg_prep+0xbe8/0x1a20 io_uring/net.c:806
io_init_req io_uring/io_uring.c:2135 [inline]
io_submit_sqe io_uring/io_uring.c:2182 [inline]
io_submit_sqes+0x1135/0x2f10 io_uring/io_uring.c:2335
__do_sys_io_uring_enter io_uring/io_uring.c:3246 [inline]
__se_sys_io_uring_enter+0x40f/0x3c80 io_uring/io_uring.c:3183
__x64_sys_io_uring_enter+0x11f/0x1a0 io_uring/io_uring.c:3183
x64_sys_call+0x2c0/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:427
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
which appears to be io_recv_finish() reading kmsg->msg.msg_inq to decide
if it needs to set IORING_CQE_F_SOCK_NONEMPTY or not. If the recv is
entered with buffer selection, but no buffer is available, then we jump
error path which calls io_recv_finish() without having assigned
kmsg->msg_inq. This might cause an errant setting of the NONEMPTY flag
for a request get gets errored with -ENOBUFS.
Reported-by: syzbot+b1647099e82b3b349fbf@syzkaller.appspotmail.com
Fixes: 4a3223f7bfda ("io_uring/net: switch io_recv() to using io_async_msghdr")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If the given protocol supports passing back whether or not we had more
pending accept post this one, pass back this information to userspace.
This is done by setting IORING_CQE_F_SOCK_NONEMPTY in the CQE flags,
just like we do for recv/recvmsg if there's more data available post
a receive operation.
We can also use this information to be smarter about multishot retry,
as we don't need to do a pointless retry if we know for a fact that
there aren't any more connections to accept.
Suggested-by: Norman Maurer <norman_maurer@apple.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In preparation for passing in more information via this API, change
do_accept() to take a proto_accept_arg struct pointer rather than just
the file flags separately.
No functional changes in this patch.
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Similarly to how polling first is supported for receive, it makes sense
to provide the same for accept. An accept operation does a lot of
expensive setup, like allocating an fd, a socket/inode, etc. If no
connection request is already pending, this is wasted and will just be
cleaned up and freed, only to retry via the usual poll trigger.
Add IORING_ACCEPT_POLL_FIRST, which tells accept to only initiate the
accept request if poll says we have something to accept.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This allows the caller to perform a non-blocking attempt, similarly to
how recvmsg has MSG_DONTWAIT. If set, and we get -EAGAIN on a connection
attempt, propagate the result to userspace rather than arm poll and
wait for a retry.
Suggested-by: Norman Maurer <norman_maurer@apple.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When sending from a provided buffer, we set sr->len to be the smallest
between the actual buffer size and sqe->len. But, now that we
disconnect the buffer from the submission request, we can get in a
situation where the buffers and requests mismatch, and only part of a
buffer gets sent. Assume:
* buf[1]->len = 128; buf[2]->len = 256
* sqe[1]->len = 128; sqe[2]->len = 256
If sqe1 runs first, it picks buff[1] and it's all good. But, if sqe[2]
runs first, sqe[1] picks buff[2], and the last half of buff[2] is
never sent.
While arguably the use-case of different-length sends is questionable,
it has already raised confusion with potential users of this
feature. Let's make the interface less tricky by forcing the length to
only come from the buffer ring entry itself.
Fixes: ac5f71a3d9d7 ("io_uring/net: add provided buffer support for IORING_OP_SEND")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
SEND[MSG]_ZC produces multiple CQEs via notifications, LAZY_WAKE doesn't
handle it and so disable LAZY_WAKE for sendzc polling. It should be
fine, sends are not likely to be polled in the first place.
Fixes: 6ce4a93dbb5b ("io_uring/poll: use IOU_F_TWQ_LAZY_WAKE for wakeups")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5b360fb352d91e3aec751d75c87dfb4753a084ee.1714488419.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If IORING_OP_RECV is used with provided buffers, the caller may also set
IORING_RECVSEND_BUNDLE to turn it into a multi-buffer recv. This grabs
buffers available and receives into them, posting a single completion for
all of it.
This can be used with multishot receive as well, or without it.
Now that both send and receive support bundles, add a feature flag for
it as well. If IORING_FEAT_RECVSEND_BUNDLE is set after registering the
ring, then the kernel supports bundles for recv and send.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If IORING_OP_SEND is used with provided buffers, the caller may also
set IORING_RECVSEND_BUNDLE to turn it into a multi-buffer send. The idea
is that an application can fill outgoing buffers in a provided buffer
group, and then arm a single send that will service them all. Once
there are no more buffers to send, or if the requested length has
been sent, the request posts a single completion for all the buffers.
This only enables it for IORING_OP_SEND, IORING_OP_SENDMSG is coming
in a separate patch. However, this patch does do a lot of the prep
work that makes wiring up the sendmsg variant pretty trivial. They
share the prep side.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
It's pretty trivial to wire up provided buffer support for the send
side, just like how it's done the receive side. This enables setting up
a buffer ring that an application can use to push pending sends to,
and then have a send pick a buffer from that ring.
One of the challenges with async IO and networking sends is that you
can get into reordering conditions if you have more than one inflight
at the same time. Consider the following scenario where everything is
fine:
1) App queues sendA for socket1
2) App queues sendB for socket1
3) App does io_uring_submit()
4) sendA is issued, completes successfully, posts CQE
5) sendB is issued, completes successfully, posts CQE
All is fine. Requests are always issued in-order, and both complete
inline as most sends do.
However, if we're flooding socket1 with sends, the following could
also result from the same sequence:
1) App queues sendA for socket1
2) App queues sendB for socket1
3) App does io_uring_submit()
4) sendA is issued, socket1 is full, poll is armed for retry
5) Space frees up in socket1, this triggers sendA retry via task_work
6) sendB is issued, completes successfully, posts CQE
7) sendA is retried, completes successfully, posts CQE
Now we've sent sendB before sendA, which can make things unhappy. If
both sendA and sendB had been using provided buffers, then it would look
as follows instead:
1) App queues dataA for sendA, queues sendA for socket1
2) App queues dataB for sendB queues sendB for socket1
3) App does io_uring_submit()
4) sendA is issued, socket1 is full, poll is armed for retry
5) Space frees up in socket1, this triggers sendA retry via task_work
6) sendB is issued, picks first buffer (dataA), completes successfully,
posts CQE (which says "I sent dataA")
7) sendA is retried, picks first buffer (dataB), completes successfully,
posts CQE (which says "I sent dataB")
Now we've sent the data in order, and everybody is happy.
It's worth noting that this also opens the door for supporting multishot
sends, as provided buffers would be a prerequisite for that. Those can
trigger either when new buffers are added to the outgoing ring, or (if
stalled due to lack of space) when space frees up in the socket.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This is just moving io_recv_prep_retry() higher up so it can get used
for sends as well, and rename it to be generically useful for both
sends and receives.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We can set MSG_ZEROCOPY at the preparation step, do it so we don't have
to care about it later in the issue callback.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c2c22aaa577624977f045979a6db2b9fb2e5648c.1712534031.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_notif_complete_tw_ext() can be removed and combined with
io_notif_complete_tw to make it simpler without sacrificing
anything.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/025a124a5e20e2474a57e2f04f16c422eb83063c.1712534031.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently lists are being used to manage this, but best practice is
usually to have these in an array instead as that it cheaper to manage.
Outside of that detail, games are also played with KASAN as the list
is inside the cached entry itself.
Finally, all users of this need a struct io_cache_entry embedded in
their struct, which is union'ized with something else in there that
isn't used across the free -> realloc cycle.
Get rid of all of that, and simply have it be an array. This will not
change the memory used, as we're just trading an 8-byte member entry
for the per-elem array size.
This reduces the overhead of the recycled allocations, and it reduces
the amount of code code needed to support recycling to about half of
what it currently is.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
While doing that, get rid of io_async_connect and just use the generic
io_async_msghdr. Both of them have a struct sockaddr_storage in there,
and while io_async_msghdr is bigger, if the same type can be used then
the netmsg_cache can get reused for connect as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Now that iovec recycling is being done, the iovec is no longer being
freed in there. Hence the kmsg parameter is now useless.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Right now the io_async_msghdr is recycled to avoid the overhead of
allocating+freeing it for every request. But the iovec is not included,
hence that will be allocated and freed for each transfer regardless.
This commit enables recyling of the iovec between io_async_msghdr
recycles. This avoids alloc+free for each one if an iovec is used, and
on top of that, it extends the cache hot nature of msg to the iovec as
well.
Also enables KASAN for the iovec entries, so that reuse can be detected
even while they are in the cache.
The io_async_msghdr also shrinks from 376 -> 288 bytes, an 88 byte
saving (or ~23% smaller), as the fast_iovec entry is dropped from 8
entries to a single entry. There's no point keeping a big fast iovec
entry, if iovecs aren't being allocated and freed continually.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
All net commands have async data at this point, there's no reason to
check if this is the case or not.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We now ONLY call io_msg_alloc_async() from inside prep handling, which
is always locked. No need for this helper anymore, or the check in
io_msg_alloc_async() on whether the ring is locked or not.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Move the io_async_msghdr out of the issue path and into prep handling,
e it's now done unconditionally and hence does not need to be part
of the issue path. This means any usage of io_sendrecv_prep_async() and
io_sendmsg_prep_async(), and hence the forced async setup path is now
unified with the normal prep setup.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|