summaryrefslogtreecommitdiff
path: root/io_uring
AgeCommit message (Collapse)Author
28 hoursio_uring: fix incorrect io_kiocb reference in io_link_skbYang Xiuwei
[ Upstream commit 2c139a47eff8de24e3350dadb4c9d5e3426db826 ] In io_link_skb function, there is a bug where prev_notif is incorrectly assigned using 'nd' instead of 'prev_nd'. This causes the context validation check to compare the current notification with itself instead of comparing it with the previous notification. Fix by using the correct prev_nd parameter when obtaining prev_notif. Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Fixes: 6fe4220912d19 ("io_uring/notif: implement notification stacking") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
28 hoursio_uring/msg_ring: kill alloc_cache for io_kiocb allocationsJens Axboe
[ Upstream commit df8922afc37aa2111ca79a216653a629146763ad ] A recent commit: fc582cd26e88 ("io_uring/msg_ring: ensure io_kiocb freeing is deferred for RCU") fixed an issue with not deferring freeing of io_kiocb structs that msg_ring allocates to after the current RCU grace period. But this only covers requests that don't end up in the allocation cache. If a request goes into the alloc cache, it can get reused before it is sane to do so. A recent syzbot report would seem to indicate that there's something there, however it may very well just be because of the KASAN poisoning that the alloc_cache handles manually. Rather than attempt to make the alloc_cache sane for that use case, just drop the usage of the alloc_cache for msg_ring request payload data. Fixes: 50cf5f3842af ("io_uring/msg_ring: add an alloc cache for io_kiocb entries") Link: https://lore.kernel.org/io-uring/68cc2687.050a0220.139b6.0005.GAE@google.com/ Reported-by: syzbot+baa2e0f4e02df602583e@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
28 hoursio_uring: include dying ring in task_work "should cancel" stateJens Axboe
commit 3539b1467e94336d5854ebf976d9627bfb65d6c3 upstream. When running task_work for an exiting task, rather than perform the issue retry attempt, the task_work is canceled. However, this isn't done for a ring that has been closed. This can lead to requests being successfully completed post the ring being closed, which is somewhat confusing and surprising to an application. Rather than just check the task exit state, also include the ring ref state in deciding whether or not to terminate a given request when run from task_work. Cc: stable@vger.kernel.org # 6.1+ Link: https://github.com/axboe/liburing/discussions/1459 Reported-by: Benedek Thaler <thaler@thaler.hu> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
28 hoursio_uring/io-wq: fix `max_workers` breakage and `nr_workers` underflowMax Kellermann
commit cd4ea81be3eb94047ad023c631afd9bd6c295400 upstream. Commit 88e6c42e40de ("io_uring/io-wq: add check free worker before create new worker") reused the variable `do_create` for something else, abusing it for the free worker check. This caused the value to effectively always be `true` at the time `nr_workers < max_workers` was checked, but it should really be `false`. This means the `max_workers` setting was ignored, and worse: if the limit had already been reached, incrementing `nr_workers` was skipped even though another worker would be created. When later lots of workers exit, the `nr_workers` field could easily underflow, making the problem worse because more and more workers would be created without incrementing `nr_workers`. The simple solution is to use a different variable for the free worker check instead of using one variable for two different things. Cc: stable@vger.kernel.org Fixes: 88e6c42e40de ("io_uring/io-wq: add check free worker before create new worker") Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Reviewed-by: Fengnan Chang <changfengnan@bytedance.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 daysfs: add a FMODE_ flag to indicate IOCB_HAS_METADATA availabilityChristoph Hellwig
[ Upstream commit d072148a8631f102de60ed5a3a827e85d09d24f0 ] Currently the kernel will happily route io_uring requests with metadata to file operations that don't support it. Add a FMODE_ flag to guard that. Fixes: 4de2ce04c862 ("fs: introduce IOCB_HAS_METADATA for metadata") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250819082517.2038819-2-hch@lst.de Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-04io_uring/kbuf: always use READ_ONCE() to read ring provided buffer lengthsJens Axboe
[ Upstream commit 98b6fa62c84f2e129161e976a5b9b3cb4ccd117b ] Since the buffers are mapped from userspace, it is prudent to use READ_ONCE() to read the value into a local variable, and use that for any other actions taken. Having a stable read of the buffer length avoids worrying about it changing after checking, or being read multiple times. Similarly, the buffer may well change in between it being picked and being committed. Ensure the looping for incremental ring buffer commit stops if it hits a zero sized buffer, as no further progress can be made at that point. Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption") Link: https://lore.kernel.org/io-uring/tencent_000C02641F6250C856D0C26228DE29A3D30A@qq.com/ Reported-by: Qingyue Zhang <chunzhennn@qq.com> Reported-by: Suoxing Zhang <aftern00n@qq.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-04io_uring/kbuf: fix signedness in this_len calculationQingyue Zhang
[ Upstream commit c64eff368ac676e8540344d27a3de47e0ad90d21 ] When importing and using buffers, buf->len is considered unsigned. However, buf->len is converted to signed int when committing. This can lead to unexpected behavior if the buffer is large enough to be interpreted as a negative value. Make min_t calculation unsigned. Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption") Co-developed-by: Suoxing Zhang <aftern00n@qq.com> Signed-off-by: Suoxing Zhang <aftern00n@qq.com> Signed-off-by: Qingyue Zhang <chunzhennn@qq.com> Link: https://lore.kernel.org/r/tencent_4DBB3674C0419BEC2C0C525949DA410CA307@qq.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-04io_uring/io-wq: add check free worker before create new workerFengnan Chang
[ Upstream commit 9d83e1f05c98bab5de350bef89177e2be8b34db0 ] After commit 0b2b066f8a85 ("io_uring/io-wq: only create a new worker if it can make progress"), in our produce environment, we still observe that part of io_worker threads keeps creating and destroying. After analysis, it was confirmed that this was due to a more complex scenario involving a large number of fsync operations, which can be abstracted as frequent write + fsync operations on multiple files in a single uring instance. Since write is a hash operation while fsync is not, and fsync is likely to be suspended during execution, the action of checking the hash value in io_wqe_dec_running cannot handle such scenarios. Similarly, if hash-based work and non-hash-based work are sent at the same time, similar issues are likely to occur. Returning to the starting point of the issue, when a new work arrives, io_wq_enqueue may wake up free worker A, while io_wq_dec_running may create worker B. Ultimately, only one of A and B can obtain and process the task, leaving the other in an idle state. In the end, the issue is caused by inconsistent logic in the checks performed by io_wq_enqueue and io_wq_dec_running. Therefore, the problem can be resolved by checking for available workers in io_wq_dec_running. Signed-off-by: Fengnan Chang <changfengnan@bytedance.com> Reviewed-by: Diangang Li <lidiangang@bytedance.com> Link: https://lore.kernel.org/r/20250813120214.18729-1-changfengnan@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-28io_uring/futex: ensure io_futex_wait() cleans up properly on failureJens Axboe
commit 508c1314b342b78591f51c4b5dadee31a88335df upstream. The io_futex_data is allocated upfront and assigned to the io_kiocb async_data field, but the request isn't marked with REQ_F_ASYNC_DATA at that point. Those two should always go together, as the flag tells io_uring whether the field is valid or not. Additionally, on failure cleanup, the futex handler frees the data but does not clear ->async_data. Clear the data and the flag in the error path as well. Thanks to Trend Micro Zero Day Initiative and particularly ReDress for reporting this. Cc: stable@vger.kernel.org Fixes: 194bb58c6090 ("io_uring: add support for futex wake and wait") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20io_uring/zcrx: don't leak pages on account failurePavel Begunkov
commit 6bbd3411ff87df1ca38ff32d36eb5dc673ca8021 upstream. Someone needs to release pinned pages in io_import_umem() if accounting fails. Assign them to the area but return an error, the following io_zcrx_free_area() will clean them up. Fixes: 262ab205180d2 ("io_uring/zcrx: account area memory") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e19f283a912f200c0d427e376cb789fc3f3d69bc.1753091564.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20io_uring/zcrx: fix null ifq on area destructionPavel Begunkov
commit 720df2310b89cf76c1dc1a05902536282506f8bf upstream. Dan reports that ifq can be null when infering arguments for io_unaccount_mem() from io_zcrx_free_area(). Fix it by always setting a correct ifq. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202507180628.gBxrOgqr-lkp@intel.com/ Fixes: 262ab205180d2 ("io_uring/zcrx: account area memory") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20670d163bb90dba2a81a4150f1125603cefb101.1753091564.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20io_uring/rw: cast rw->flags assignment to rwf_tJens Axboe
commit 825aea662b492571877b32aeeae13689fd9fbee4 upstream. kernel test robot reports that a recent change of the sqe->rw_flags field throws a sparse warning on 32-bit archs: >> io_uring/rw.c:291:19: sparse: sparse: incorrect type in assignment (different base types) @@ expected restricted __kernel_rwf_t [usertype] flags @@ got unsigned int @@ io_uring/rw.c:291:19: sparse: expected restricted __kernel_rwf_t [usertype] flags io_uring/rw.c:291:19: sparse: got unsigned int Force cast it to rwf_t to silence that new sparse warning. Fixes: cf73d9970ea4 ("io_uring: don't use int for ABI") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202507032211.PwSNPNSP-lkp@intel.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20io_uring/net: commit partial buffers on retryJens Axboe
commit 41b70df5b38bc80967d2e0ed55cc3c3896bba781 upstream. Ring provided buffers are potentially only valid within the single execution context in which they were acquired. io_uring deals with this and invalidates them on retry. But on the networking side, if MSG_WAITALL is set, or if the socket is of the streaming type and too little was processed, then it will hang on to the buffer rather than recycle or commit it. This is problematic for two reasons: 1) If someone unregisters the provided buffer ring before a later retry, then the req->buf_list will no longer be valid. 2) If multiple sockers are using the same buffer group, then multiple receives can consume the same memory. This can cause data corruption in the application, as either receive could land in the same userspace buffer. Fix this by disallowing partial retries from pinning a provided buffer across multiple executions, if ring provided buffers are used. Cc: stable@vger.kernel.org Reported-by: pt x <superman.xpt@gmail.com> Fixes: c56e022c0a27 ("io_uring: add support for user mapped provided buffer ring") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20io_uring/memmap: cast nr_pages to size_t before shiftingJens Axboe
commit 33503c083fda048c77903460ac0429e1e2c0e341 upstream. If the allocated size exceeds UINT_MAX, then it's necessary to cast the mr->nr_pages value to size_t to prevent it from overflowing. In practice this isn't much of a concern as the required memory size will have been validated upfront, and accounted to the user. And > 4GB sizes will be necessary to make the lack of a cast a problem, which greatly exceeds normal user locked_vm settings that are generally in the kb to mb range. However, if root is used, then accounting isn't done, and then it's possible to hit this issue. Link: https://lore.kernel.org/all/6895b298.050a0220.7f033.0059.GAE@google.com/ Cc: stable@vger.kernel.org Reported-by: syzbot+23727438116feb13df15@syzkaller.appspotmail.com Fixes: 087f997870a9 ("io_uring/memmap: implement mmap for regions") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20io_uring/zcrx: account area memoryPavel Begunkov
commit 262ab205180d2ba3ab6110899a4dbe439c51dfaa upstream. zcrx areas can be quite large and need to be accounted and checked against RLIMIT_MEMLOCK. In practise it shouldn't be a big issue as the inteface already requires cap_net_admin. Cc: stable@vger.kernel.org Fixes: cf96310c5f9a0 ("io_uring/zcrx: add io_zcrx_area") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4b53f0c575bd062f63d12bec6cac98037fc66aeb.1752699568.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20io_uring: export io_[un]account_memPavel Begunkov
commit 11fbada7184f9e19bcdfa2f6b15828a78b8897a6 upstream. Export pinned memory accounting helpers, they'll be used by zcrx shortly. Cc: stable@vger.kernel.org Fixes: cf96310c5f9a0 ("io_uring/zcrx: add io_zcrx_area") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9a61e54bd89289b39570ae02fe620e12487439e4.1752699568.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-18Merge tag 'io_uring-6.16-20250718' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring fixes from Jens Axboe: - dmabug offset fix for zcrx - Fix for the POLLERR connect work around handling * tag 'io_uring-6.16-20250718' of git://git.kernel.dk/linux: io_uring/poll: fix POLLERR handling io_uring/zcrx: disallow user selected dmabuf offset and size
2025-07-16io_uring/poll: fix POLLERR handlingPavel Begunkov
8c8492ca64e7 ("io_uring/net: don't retry connect operation on EPOLLERR") is a little dirty hack that 1) wrongfully assumes that POLLERR equals to a failed request, which breaks all POLLERR users, e.g. all error queue recv interfaces. 2) deviates the connection request behaviour from connect(2), and 3) racy and solved at a wrong level. Nothing can be done with 2) now, and 3) is beyond the scope of the patch. At least solve 1) by moving the hack out of generic poll handling into io_connect(). Cc: stable@vger.kernel.org Fixes: 8c8492ca64e79 ("io_uring/net: don't retry connect operation on EPOLLERR") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3dc89036388d602ebd84c28e5042e457bdfc952b.1752682444.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-14io_uring/zcrx: disallow user selected dmabuf offset and sizePavel Begunkov
zcrx shouldn't be so frivolous about cutting a dmabuf sgtable and taking a subrange into it, the dmabuf layer might be not expecting that. It shouldn't be a problem for now, but since the zcrx dmabuf support is new and there shouldn't be any real users, let's play safe and reject user provided ranges into dmabufs. Also, it shouldn't be needed as userspace should size them appropriately. Fixes: a5c98e9424573 ("io_uring/zcrx: dmabuf backed zerocopy receive") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/be899f1afed32053eb2e2079d0da241514674aca.1752443579.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-11Merge tag 'io_uring-6.16-20250710' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring fixes from Jens Axboe: - Remove a pointless warning in the zcrx code - Fix for MSG_RING commands, where the allocated io_kiocb needs to be freed under RCU as well - Revert the work-around we had in place for the anon inodes pretending to be regular files. Since that got reworked upstream, the work-around is no longer needed * tag 'io_uring-6.16-20250710' of git://git.kernel.dk/linux: Revert "io_uring: gate REQ_F_ISREG on !S_ANON_INODE as well" io_uring/msg_ring: ensure io_kiocb freeing is deferred for RCU io_uring/zcrx: fix pp destruction warnings
2025-07-08Revert "io_uring: gate REQ_F_ISREG on !S_ANON_INODE as well"Jens Axboe
This reverts commit 6f11adcc6f36ffd8f33dbdf5f5ce073368975bc3. The problematic commit was fixed in mainline, so the work-around in io_uring can be removed at this point. Anonymous inodes no longer pretend to be regular files after: 1e7ab6f67824 ("anon_inode: rework assertions") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-08io_uring/msg_ring: ensure io_kiocb freeing is deferred for RCUJens Axboe
syzbot reports that defer/local task_work adding via msg_ring can hit a request that has been freed: CPU: 1 UID: 0 PID: 19356 Comm: iou-wrk-19354 Not tainted 6.16.0-rc4-syzkaller-00108-g17bbde2e1716 #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 Call Trace: <TASK> dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:408 [inline] print_report+0xd2/0x2b0 mm/kasan/report.c:521 kasan_report+0x118/0x150 mm/kasan/report.c:634 io_req_local_work_add io_uring/io_uring.c:1184 [inline] __io_req_task_work_add+0x589/0x950 io_uring/io_uring.c:1252 io_msg_remote_post io_uring/msg_ring.c:103 [inline] io_msg_data_remote io_uring/msg_ring.c:133 [inline] __io_msg_ring_data+0x820/0xaa0 io_uring/msg_ring.c:151 io_msg_ring_data io_uring/msg_ring.c:173 [inline] io_msg_ring+0x134/0xa00 io_uring/msg_ring.c:314 __io_issue_sqe+0x17e/0x4b0 io_uring/io_uring.c:1739 io_issue_sqe+0x165/0xfd0 io_uring/io_uring.c:1762 io_wq_submit_work+0x6e9/0xb90 io_uring/io_uring.c:1874 io_worker_handle_work+0x7cd/0x1180 io_uring/io-wq.c:642 io_wq_worker+0x42f/0xeb0 io_uring/io-wq.c:696 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> which is supposed to be safe with how requests are allocated. But msg ring requests alloc and free on their own, and hence must defer freeing to a sane time. Add an rcu_head and use kfree_rcu() in both spots where requests are freed. Only the one in io_msg_tw_complete() is strictly required as it has been visible on the other ring, but use it consistently in the other spot as well. This should not cause any other issues outside of KASAN rightfully complaining about it. Link: https://lore.kernel.org/io-uring/686cd2ea.a00a0220.338033.0007.GAE@google.com/ Reported-by: syzbot+54cbbfb4db9145d26fc2@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Fixes: 0617bb500bfa ("io_uring/msg_ring: improve handling of target CQE posting") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-07io_uring/zcrx: fix pp destruction warningsPavel Begunkov
With multiple page pools and in some other cases we can have allocated niovs on page pool destruction. Remove a misplaced warning checking that all niovs are returned to zcrx on io_pp_zc_destroy(). It was reported before but apparently got lost. Reported-by: Pedro Tammela <pctammela@mojatatu.com> Fixes: 34a3e60821ab9 ("io_uring/zcrx: implement zerocopy receive pp memory provider") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b9e6d919d2964bc48ddbf8eb52fc9f5d118e9bc1.1751878185.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30Merge tag 'io_uring-6.16-20250630' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring fix from Jens Axboe: "Now that anonymous inodes set S_IFREG, this breaks the io_uring read/write retries for short reads/writes. As things like timerfd and eventfd are anon inodes, applications that previously did: unsigned long event_data[2]; io_uring_prep_read(sqe, evfd, event_data, sizeof(event_data), 0); and just got a short read when 1 event was posted, will now wait for the full amount before posting a completion. This caused issues for the ghostty application, making it basically unusable due to excessive buffering" * tag 'io_uring-6.16-20250630' of git://git.kernel.dk/linux: io_uring: gate REQ_F_ISREG on !S_ANON_INODE as well
2025-06-29io_uring: gate REQ_F_ISREG on !S_ANON_INODE as wellJens Axboe
io_uring marks a request as dealing with a regular file on S_ISREG. This drives things like retries on short reads or writes, which is generally not expected on a regular file (or bdev). Applications tend to not expect that, so io_uring tries hard to ensure it doesn't deliver short IO on regular files. However, a recent commit added S_IFREG to anonymous inodes. When io_uring is used to read from various things that are backed by anon inodes, like eventfd, timerfd, etc, then it'll now all of a sudden wait for more data when rather than deliver what was read or written in a single operation. This breaks applications that issue reads on anon inodes, if they ask for more data than a single read delivers. Add a check for !S_ANON_INODE as well before setting REQ_F_ISREG to prevent that. Cc: Christian Brauner <brauner@kernel.org> Cc: stable@vger.kernel.org Link: https://github.com/ghostty-org/ghostty/discussions/7720 Fixes: cfd86ef7e8e7 ("anon_inode: use a proper mode internally") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-27Merge tag 'io_uring-6.16-20250626' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring fixes from Jens Axboe: - Two tweaks for a recent fix: fixing a memory leak if multiple iovecs were initially mapped but only the first was used and hence turned into a UBUF rathan than an IOVEC iterator, and catching a case where a retry would be done even if the previous segment wasn't full - Small series fixing an issue making the vm unhappy if debugging is turned on, hitting a VM_BUG_ON_PAGE() - Fix a resource leak in io_import_dmabuf() in the error handling case, which is a regression in this merge window - Mark fallocate as needing to be write serialized, as is already done for truncate and buffered writes * tag 'io_uring-6.16-20250626' of git://git.kernel.dk/linux: io_uring/kbuf: flag partial buffer mappings io_uring/net: mark iov as dynamically allocated even for single segments io_uring: fix resource leak in io_import_dmabuf() io_uring: don't assume uaddr alignment in io_vec_fill_bvec io_uring/rsrc: don't rely on user vaddr alignment io_uring/rsrc: fix folio unpinning io_uring: make fallocate be hashed work
2025-06-26io_uring/kbuf: flag partial buffer mappingsJens Axboe
A previous commit aborted mapping more for a non-incremental ring for bundle peeking, but depending on where in the process this peeking happened, it would not necessarily prevent a retry by the user. That can create gaps in the received/read data. Add struct buf_sel_arg->partial_map, which can pass this information back. The networking side can then map that to internal state and use it to gate retry as well. Since this necessitates a new flag, change io_sr_msg->retry to a retry_flags member, and store both the retry and partial map condition in there. Cc: stable@vger.kernel.org Fixes: 26ec15e4b0c1 ("io_uring/kbuf: don't truncate end buffer for multiple buffer peeks") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-25io_uring/net: mark iov as dynamically allocated even for single segmentsJens Axboe
A bigger array of vecs could've been allocated, but io_ring_buffers_peek() still decided to cap the mapped range depending on how much data was available. Hence don't rely on the segment count to know if the request should be marked as needing cleanup, always check upfront if the iov array is different than the fast_iov array. Fixes: 26ec15e4b0c1 ("io_uring/kbuf: don't truncate end buffer for multiple buffer peeks") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-25io_uring: fix resource leak in io_import_dmabuf()Penglei Jiang
Replace the return statement with setting ret = -EINVAL and jumping to the err label to ensure resources are released via io_release_dmabuf. Fixes: a5c98e942457 ("io_uring/zcrx: dmabuf backed zerocopy receive") Signed-off-by: Penglei Jiang <superman.xpt@gmail.com> Link: https://lore.kernel.org/r/20250625102703.68336-1-superman.xpt@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-24io_uring: don't assume uaddr alignment in io_vec_fill_bvecPavel Begunkov
There is no guaranteed alignment for user pointers. Don't use mask trickery and adjust the offset by bv_offset. Cc: stable@vger.kernel.org Reported-by: David Hildenbrand <david@redhat.com> Fixes: 9ef4cbbcb4ac3 ("io_uring: add infra for importing vectored reg buffers") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/io-uring/19530391f5c361a026ac9b401ff8e123bde55d98.1750771718.git.asml.silence@gmail.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-24io_uring/rsrc: don't rely on user vaddr alignmentPavel Begunkov
There is no guaranteed alignment for user pointers, however the calculation of an offset of the first page into a folio after coalescing uses some weird bit mask logic, get rid of it. Cc: stable@vger.kernel.org Reported-by: David Hildenbrand <david@redhat.com> Fixes: a8edbb424b139 ("io_uring/rsrc: enable multi-hugepage buffer coalescing") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/io-uring/e387b4c78b33f231105a601d84eefd8301f57954.1750771718.git.asml.silence@gmail.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-24io_uring/rsrc: fix folio unpinningPavel Begunkov
syzbot complains about an unmapping failure: [ 108.070381][ T14] kernel BUG at mm/gup.c:71! [ 108.070502][ T14] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP [ 108.123672][ T14] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20250221-8.fc42 02/21/2025 [ 108.127458][ T14] Workqueue: iou_exit io_ring_exit_work [ 108.174205][ T14] Call trace: [ 108.175649][ T14] sanity_check_pinned_pages+0x7cc/0x7d0 (P) [ 108.178138][ T14] unpin_user_page+0x80/0x10c [ 108.180189][ T14] io_release_ubuf+0x84/0xf8 [ 108.182196][ T14] io_free_rsrc_node+0x250/0x57c [ 108.184345][ T14] io_rsrc_data_free+0x148/0x298 [ 108.186493][ T14] io_sqe_buffers_unregister+0x84/0xa0 [ 108.188991][ T14] io_ring_ctx_free+0x48/0x480 [ 108.191057][ T14] io_ring_exit_work+0x764/0x7d8 [ 108.193207][ T14] process_one_work+0x7e8/0x155c [ 108.195431][ T14] worker_thread+0x958/0xed8 [ 108.197561][ T14] kthread+0x5fc/0x75c [ 108.199362][ T14] ret_from_fork+0x10/0x20 We can pin a tail page of a folio, but then io_uring will try to unpin the head page of the folio. While it should be fine in terms of keeping the page actually alive, mm folks say it's wrong and triggers a debug warning. Use unpin_user_folio() instead of unpin_user_page*. Cc: stable@vger.kernel.org Debugged-by: David Hildenbrand <david@redhat.com> Reported-by: syzbot+1d335893772467199ab6@syzkaller.appspotmail.com Closes: https://lkml.kernel.org/r/683f1551.050a0220.55ceb.0017.GAE@google.com Fixes: a8edbb424b139 ("io_uring/rsrc: enable multi-hugepage buffer coalescing") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/io-uring/a28b0f87339ac2acf14a645dad1e95bbcbf18acd.1750771718.git.asml.silence@gmail.com/ [axboe: adapt to current tree, massage commit message] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-23io_uring: make fallocate be hashed workFengnan Chang
Like ftruncate and write, fallocate operations on the same file cannot be executed in parallel, so it is better to make fallocate be hashed work. Signed-off-by: Fengnan Chang <changfengnan@bytedance.com> Link: https://lore.kernel.org/r/20250623110218.61490-1-changfengnan@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-21Merge tag 'io_uring-6.16-20250621' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring fix from Jens Axboe: "A single fix to hopefully wrap up the saga of receive bundles" * tag 'io_uring-6.16-20250621' of git://git.kernel.dk/linux: io_uring/net: always use current transfer count for buffer put
2025-06-20io_uring/net: always use current transfer count for buffer putJens Axboe
A previous fix corrected the retry condition for when to continue a current bundle, but it missed that the current (not the total) transfer count also applies to the buffer put. If not, then for incrementally consumed buffer rings repeated completions on the same request may end up over consuming. Reported-by: Roy Tang (ErgoniaTrading) <royonia@ergonia.io> Cc: stable@vger.kernel.org Fixes: 3a08988123c8 ("io_uring/net: only retry recv bundle for a full transfer") Link: https://github.com/axboe/liburing/issues/1423 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-19Merge tag 'io_uring-6.16-20250619' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring fixes from Jens Axboe: - Two fixes for error injection failures. One fixes a task leak issue introduced in this merge window, the other an older issue with handling allocation of a mapped buffer. - Fix for a syzbot issue that triggers a kmalloc warning on attempting an allocation that's too large - Fix for an error injection failure causing a double put of a task, introduced in this merge window * tag 'io_uring-6.16-20250619' of git://git.kernel.dk/linux: io_uring: fix potential page leak in io_sqe_buffer_register() io_uring/sqpoll: don't put task_struct on tctx setup failure io_uring: remove duplicate io_uring_alloc_task_context() definition io_uring: fix task leak issue in io_wq_create() io_uring/rsrc: validate buffer count with offset for cloning
2025-06-18io_uring: fix potential page leak in io_sqe_buffer_register()Penglei Jiang
If allocation of the 'imu' fails, then the existing pages aren't unpinned in the error path. This is mostly a theoretical issue, requiring fault injection to hit. Move unpin_user_pages() to unified error handling to fix the page leak issue. Fixes: d8c2237d0aa9 ("io_uring: add io_pin_pages() helper") Signed-off-by: Penglei Jiang <superman.xpt@gmail.com> Link: https://lore.kernel.org/r/20250617165644.79165-1-superman.xpt@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-17io_uring/sqpoll: don't put task_struct on tctx setup failureJens Axboe
A recent commit moved the error handling of sqpoll thread and tctx failures into the thread itself, as part of fixing an issue. However, it missed that tctx allocation may also fail, and that io_sq_offload_create() does its own error handling for the task_struct in that case. Remove the manual task putting in io_sq_offload_create(), as io_sq_thread() will notice that the tctx did not get setup and hence it should put itself and exit. Reported-by: syzbot+763e12bbf004fb1062e4@syzkaller.appspotmail.com Fixes: ac0b8b327a56 ("io_uring: fix use-after-free of sq->thread in __io_uring_show_fdinfo()") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-17io_uring: remove duplicate io_uring_alloc_task_context() definitionJens Axboe
This function exists in both tctx.h (where it belongs) and in io_uring.h as a remnant of before the tctx handling code got split out. Remove the io_uring.h definition and ensure that sqpoll.c includes the tctx.h header to get the definition. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-15io_uring: fix task leak issue in io_wq_create()Penglei Jiang
Add missing put_task_struct() in the error path Cc: stable@vger.kernel.org Fixes: 0f8baa3c9802 ("io-wq: fully initialize wqe before calling cpuhp_state_add_instance_nocalls()") Signed-off-by: Penglei Jiang <superman.xpt@gmail.com> Link: https://lore.kernel.org/r/20250615163906.2367-1-superman.xpt@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-15io_uring/rsrc: validate buffer count with offset for cloningJens Axboe
syzbot reports that it can trigger a WARN_ON() for kmalloc() attempt that's too big: WARNING: CPU: 0 PID: 6488 at mm/slub.c:5024 __kvmalloc_node_noprof+0x520/0x640 mm/slub.c:5024 Modules linked in: CPU: 0 UID: 0 PID: 6488 Comm: syz-executor312 Not tainted 6.15.0-rc7-syzkaller-gd7fa1af5b33e #0 PREEMPT Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : __kvmalloc_node_noprof+0x520/0x640 mm/slub.c:5024 lr : __do_kmalloc_node mm/slub.c:-1 [inline] lr : __kvmalloc_node_noprof+0x3b4/0x640 mm/slub.c:5012 sp : ffff80009cfd7a90 x29: ffff80009cfd7ac0 x28: ffff0000dd52a120 x27: 0000000000412dc0 x26: 0000000000000178 x25: ffff7000139faf70 x24: 0000000000000000 x23: ffff800082f4cea8 x22: 00000000ffffffff x21: 000000010cd004a8 x20: ffff0000d75816c0 x19: ffff0000dd52a000 x18: 00000000ffffffff x17: ffff800092f39000 x16: ffff80008adbe9e4 x15: 0000000000000005 x14: 1ffff000139faf1c x13: 0000000000000000 x12: 0000000000000000 x11: ffff7000139faf21 x10: 0000000000000003 x9 : ffff80008f27b938 x8 : 0000000000000002 x7 : 0000000000000000 x6 : 0000000000000000 x5 : 00000000ffffffff x4 : 0000000000400dc0 x3 : 0000000200000000 x2 : 000000010cd004a8 x1 : ffff80008b3ebc40 x0 : 0000000000000001 Call trace: __kvmalloc_node_noprof+0x520/0x640 mm/slub.c:5024 (P) kvmalloc_array_node_noprof include/linux/slab.h:1065 [inline] io_rsrc_data_alloc io_uring/rsrc.c:206 [inline] io_clone_buffers io_uring/rsrc.c:1178 [inline] io_register_clone_buffers+0x484/0xa14 io_uring/rsrc.c:1287 __io_uring_register io_uring/register.c:815 [inline] __do_sys_io_uring_register io_uring/register.c:926 [inline] __se_sys_io_uring_register io_uring/register.c:903 [inline] __arm64_sys_io_uring_register+0x42c/0xea8 io_uring/register.c:903 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151 el0_svc+0x58/0x17c arch/arm64/kernel/entry-common.c:767 el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786 el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600 which is due to offset + buffer_count being too large. The registration code checks only the total count of buffers, but given that the indexing is an array, it should also check offset + count. That can't exceed IORING_MAX_REG_BUFFERS either, as there's no way to reach buffers beyond that limit. There's no issue with registrering a table this large, outside of the fact that it's pointless to register buffers that cannot be reached, and that it can trigger this kmalloc() warning for attempting an allocation that is too large. Cc: stable@vger.kernel.org Fixes: b16e920a1909 ("io_uring/rsrc: allow cloning at an offset") Reported-by: syzbot+cb4bf3cb653be0d25de8@syzkaller.appspotmail.com Link: https://lore.kernel.org/io-uring/684e77bd.a00a0220.279073.0029.GAE@google.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-14Merge tag 'io_uring-6.16-20250614' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring fixes from Jens Axboe: - Fix for a race between SQPOLL exit and fdinfo reading. It's slim and I was only able to reproduce this with an artificial delay in the kernel. Followup sparse fix as well to unify the access to ->thread. - Fix for multiple buffer peeking, avoiding truncation if possible. - Run local task_work for IOPOLL reaping when the ring is exiting. This currently isn't done due to an assumption that polled IO will never need task_work, but a fix on the block side is going to change that. * tag 'io_uring-6.16-20250614' of git://git.kernel.dk/linux: io_uring: run local task_work from ring exit IOPOLL reaping io_uring/kbuf: don't truncate end buffer for multiple buffer peeks io_uring: consistently use rcu semantics with sqpoll thread io_uring: fix use-after-free of sq->thread in __io_uring_show_fdinfo()
2025-06-13io_uring: run local task_work from ring exit IOPOLL reapingJens Axboe
In preparation for needing to shift NVMe passthrough to always use task_work for polled IO completions, ensure that those are suitably run at exit time. See commit: 9ce6c9875f3e ("nvme: always punt polled uring_cmd end_io work to task_work") for details on why that is necessary. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-13io_uring/kbuf: don't truncate end buffer for multiple buffer peeksJens Axboe
If peeking a bunch of buffers, normally io_ring_buffers_peek() will truncate the end buffer. This isn't optimal as presumably more data will be arriving later, and hence it's better to stop with the last full buffer rather than truncate the end buffer. Cc: stable@vger.kernel.org Fixes: 35c8711c8fc4 ("io_uring/kbuf: add helpers for getting/peeking multiple buffers") Reported-by: Christian Mazakas <christian.mazakas@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-12io_uring: consistently use rcu semantics with sqpoll threadKeith Busch
The sqpoll thread is dereferenced with rcu read protection in one place, so it needs to be annotated as an __rcu type, and should consistently use rcu helpers for access and assignment to make sparse happy. Since most of the accesses occur under the sqd->lock, we can use rcu_dereference_protected() without declaring an rcu read section. Provide a simple helper to get the thread from a locked context. Fixes: ac0b8b327a5677d ("io_uring: fix use-after-free of sq->thread in __io_uring_show_fdinfo()") Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20250611205343.1821117-1-kbusch@meta.com [axboe: fold in fix for register.c] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-10io_uring: fix use-after-free of sq->thread in __io_uring_show_fdinfo()Penglei Jiang
syzbot reports: BUG: KASAN: slab-use-after-free in getrusage+0x1109/0x1a60 Read of size 8 at addr ffff88810de2d2c8 by task a.out/304 CPU: 0 UID: 0 PID: 304 Comm: a.out Not tainted 6.16.0-rc1 #1 PREEMPT(voluntary) Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x53/0x70 print_report+0xd0/0x670 ? __pfx__raw_spin_lock_irqsave+0x10/0x10 ? getrusage+0x1109/0x1a60 kasan_report+0xce/0x100 ? getrusage+0x1109/0x1a60 getrusage+0x1109/0x1a60 ? __pfx_getrusage+0x10/0x10 __io_uring_show_fdinfo+0x9fe/0x1790 ? ksys_read+0xf7/0x1c0 ? do_syscall_64+0xa4/0x260 ? vsnprintf+0x591/0x1100 ? __pfx___io_uring_show_fdinfo+0x10/0x10 ? __pfx_vsnprintf+0x10/0x10 ? mutex_trylock+0xcf/0x130 ? __pfx_mutex_trylock+0x10/0x10 ? __pfx_show_fd_locks+0x10/0x10 ? io_uring_show_fdinfo+0x57/0x80 io_uring_show_fdinfo+0x57/0x80 seq_show+0x38c/0x690 seq_read_iter+0x3f7/0x1180 ? inode_set_ctime_current+0x160/0x4b0 seq_read+0x271/0x3e0 ? __pfx_seq_read+0x10/0x10 ? __pfx__raw_spin_lock+0x10/0x10 ? __mark_inode_dirty+0x402/0x810 ? selinux_file_permission+0x368/0x500 ? file_update_time+0x10f/0x160 vfs_read+0x177/0xa40 ? __pfx___handle_mm_fault+0x10/0x10 ? __pfx_vfs_read+0x10/0x10 ? mutex_lock+0x81/0xe0 ? __pfx_mutex_lock+0x10/0x10 ? fdget_pos+0x24d/0x4b0 ksys_read+0xf7/0x1c0 ? __pfx_ksys_read+0x10/0x10 ? do_user_addr_fault+0x43b/0x9c0 do_syscall_64+0xa4/0x260 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f0f74170fc9 Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 8 RSP: 002b:00007fffece049e8 EFLAGS: 00000206 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0f74170fc9 RDX: 0000000000001000 RSI: 00007fffece049f0 RDI: 0000000000000004 RBP: 00007fffece05ad0 R08: 0000000000000000 R09: 00007fffece04d90 R10: 0000000000000000 R11: 0000000000000206 R12: 00005651720a1100 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK> Allocated by task 298: kasan_save_stack+0x33/0x60 kasan_save_track+0x14/0x30 __kasan_slab_alloc+0x6e/0x70 kmem_cache_alloc_node_noprof+0xe8/0x330 copy_process+0x376/0x5e00 create_io_thread+0xab/0xf0 io_sq_offload_create+0x9ed/0xf20 io_uring_setup+0x12b0/0x1cc0 do_syscall_64+0xa4/0x260 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 22: kasan_save_stack+0x33/0x60 kasan_save_track+0x14/0x30 kasan_save_free_info+0x3b/0x60 __kasan_slab_free+0x37/0x50 kmem_cache_free+0xc4/0x360 rcu_core+0x5ff/0x19f0 handle_softirqs+0x18c/0x530 run_ksoftirqd+0x20/0x30 smpboot_thread_fn+0x287/0x6c0 kthread+0x30d/0x630 ret_from_fork+0xef/0x1a0 ret_from_fork_asm+0x1a/0x30 Last potentially related work creation: kasan_save_stack+0x33/0x60 kasan_record_aux_stack+0x8c/0xa0 __call_rcu_common.constprop.0+0x68/0x940 __schedule+0xff2/0x2930 __cond_resched+0x4c/0x80 mutex_lock+0x5c/0xe0 io_uring_del_tctx_node+0xe1/0x2b0 io_uring_clean_tctx+0xb7/0x160 io_uring_cancel_generic+0x34e/0x760 do_exit+0x240/0x2350 do_group_exit+0xab/0x220 __x64_sys_exit_group+0x39/0x40 x64_sys_call+0x1243/0x1840 do_syscall_64+0xa4/0x260 entry_SYSCALL_64_after_hwframe+0x77/0x7f The buggy address belongs to the object at ffff88810de2cb00 which belongs to the cache task_struct of size 3712 The buggy address is located 1992 bytes inside of freed 3712-byte region [ffff88810de2cb00, ffff88810de2d980) which is caused by the task_struct pointed to by sq->thread being released while it is being used in the function __io_uring_show_fdinfo(). Holding ctx->uring_lock does not prevent ehre relase or exit of sq->thread. Fix this by assigning and looking up ->thread under RCU, and grabbing a reference to the task_struct. This ensures that it cannot get released while fdinfo is using it. Reported-by: syzbot+531502bbbe51d2f769f4@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/682b06a5.a70a0220.3849cf.00b3.GAE@google.com Fixes: 3fcb9d17206e ("io_uring/sqpoll: statistics of the true utilization of sq threads") Signed-off-by: Penglei Jiang <superman.xpt@gmail.com> Link: https://lore.kernel.org/r/20250610171801.70960-1-superman.xpt@gmail.com [axboe: massage commit message] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-06Merge tag 'io_uring-6.16-20250606' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring fixes from Jens Axboe: - Fix for a regression introduced in this merge window, where the 'id' passed to xa_find() for ifq lookup is uninitialized - Fix for zcrx release on registration failure. From 6.15, going to stable - Tweak for recv bundles, where msg_inq should be > 1 before being used to gate a retry event - Pavel doesnt want to be a maintainer anymore, remove him from the MAINTAINERS entry - Limit legacy kbuf registrations to 64k, which is the size of the buffer ID field anyway. Hence it's nonsensical to support more than that, and the only purpose that serves is to have syzbot trigger long exit delays for heavily configured debug kernels - Fix for the io_uring futex handling, which got broken for FUTEX2_PRIVATE by a generic futex commit adding private hashes * tag 'io_uring-6.16-20250606' of git://git.kernel.dk/linux: io_uring/futex: mark wait requests as inflight io_uring/futex: get rid of struct io_futex addr union io_uring/kbuf: limit legacy provided buffer lists to USHRT_MAX MAINTAINERS: remove myself from io_uring io_uring/net: only consider msg_inq if larger than 1 io_uring/zcrx: fix area release on registration failure io_uring/zcrx: init id for xa_find
2025-06-04io_uring/futex: mark wait requests as inflightJens Axboe
Inflight marking is used so that do_exit() -> io_uring_files_cancel() will find requests with files that reference an io_uring instance, so they can get appropriately canceled before the files go away. However, it's also called before the mm goes away. Mark futex/futexv wait requests as being inflight, so that io_uring_files_cancel() will prune them. This ensures that the mm stays alive, which is important as an exiting mm will also free the futex private hash buckets. An io_uring futex request with FUTEX2_PRIVATE set relies on those being alive until the request has completed. A recent commit added these futex private hashes, which get killed when the mm goes away. Fixes: 80367ad01d93 ("futex: Add basic infrastructure for local task local hash") Link: https://lore.kernel.org/io-uring/38053.1749045482@localhost/ Reported-by: Robert Morris <rtm@csail.mit.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-04io_uring/futex: get rid of struct io_futex addr unionJens Axboe
Rather than use a union of a u32 and struct futex_waitv user address, consolidate it into a single void __user pointer instead. This also makes prep easier to use as the store happens to the member that will be used. No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-03io_uring/kbuf: limit legacy provided buffer lists to USHRT_MAXJens Axboe
The buffer ID for a provided buffer is an unsigned short, and hence there can only be 64k added to any given buffer list before having duplicate BIDs. Cap the legacy provided buffers at 64k in the list. This is mostly to prevent silly stall reports from syzbot, which likes to dump tons of buffers into a list and then have kernels with lockdep and kasan churning through them and hitting long wait times for buffer pruning at ring exit time. Signed-off-by: Jens Axboe <axboe@kernel.dk>