linux/linux-stable.git - Linux kernel stable tree

Age	Commit message (Collapse)	Author
2021-06-14	io_uring: cache task struct refs	Pavel Begunkov
	tctx in submission part is always synchronised because is executed from the task's context, so we can batch allocate tctx/task references and store them across syscall boundaries. It avoids enough of operations, including an atomic for getting task ref and a percpu_counter_add() function call, which still fallback to spinlock for large batching cases (around >=32). Should be good for SQPOLL submitting in small portions and coming at some moment bpf submissions. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/14b327b973410a3eec1f702ecf650e100513aca9.1623634181.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14	io_uring: don't vmalloc rsrc tags	Pavel Begunkov
	We don't really need vmalloc for keeping tags, it's not a hot path and is there out of convenience, so replace it with two level tables to not litter kernel virtual memory mappings. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/241a3422747113a8909e7e1030eb585d4a349e0d.1623634181.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14	io_uring: add helpers for 2 level table alloc	Pavel Begunkov
	Some parts like fixed file table use 2 level tables, factor out helpers for allocating/deallocating them as more users are to come. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1709212359cd82eb416d395f86fc78431ccfc0aa.1623634181.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14	io_uring: remove rsrc put work irq save/restore	Pavel Begunkov
	io_rsrc_put_work() is executed by workqueue in non-irq context, so no need for irqsave/restore variants of spinlocking. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2a7f77220735f4ad404ac885b4d73bdf42d2f836.1623634181.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14	io_uring: hide rsrc tag copy into generic helpers	Pavel Begunkov
	Make io_rsrc_data_alloc() taking care of rsrc tags loading on registration, so we don't need to repeat it for each new rsrc type. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5609680697bd09735de10561b75edb95283459da.1623634181.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14	io_uring: rename function *task_file	Pavel Begunkov
	What at some moment was references to struct file used to control lifetimes of task/ctx is now just internal tctx structures/nodes, so rename outdated *task_file() routines into something more sensible. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e2fbce42932154c2631ce58ffbffaa232afe18d5.1623634181.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14	io_uring: refactor io_iopoll_req_issued	Pavel Begunkov
	A simple refactoring of io_iopoll_req_issued(), move in_async inside so we don't pass it around and save on double checking it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1513bfde4f0c835be25ac69a82737ab0668d7665.1623634181.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14	io_uring: fix blocking inline submission	Pavel Begunkov
	There is a complaint against sys_io_uring_enter() blocking if it submits stdin reads. The problem is in __io_file_supports_async(), which sees that it's a cdev and allows it to be processed inline. Punt char devices using generic rules of io_file_supports_async(), including checking for presence of *_iter() versions of rw callbacks. Apparently, it will affect most of cdevs with some exceptions like null and zero devices. Cc: stable@vger.kernel.org Reported-by: Birk Hirdman <lonjil@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d60270856b8a4560a639ef5f76e55eb563633599.1623236455.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14	io_uring: enable shmem/memfd memory registration	Pavel Begunkov
	Relax buffer registration restictions, which filters out file backed memory, and allow shmem/memfd as they have normal anonymous pages underneath. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14	io_uring: don't bounce submit_state cachelines	Pavel Begunkov
	struct io_submit_state contains struct io_comp_state and so locked_free_, that renders cachelines around ->locked_free being invalidated on most non-inline completions, that may terrorise caches if submissions and completions are done by different tasks. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/290cb5412b76892e8631978ee8ab9db0c6290dd5.1621201931.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14	io_uring: rename io_get_cqring	Pavel Begunkov
	Rename io_get_cqring() into io_get_cqe() for consistency with SQ, and just because the old name is not as clear. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a46a53e3f781de372f5632c184e61546b86515ce.1621201931.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14	io_uring: kill cached_cq_overflow	Pavel Begunkov
	There are two copies of cq_overflow, shared with userspace and internal cached one. It was needed for DRAIN accounting, but now we have yet another knob to tune the accounting, i.e. cq_extra, and we can throw away the internal counter and just increment the one in the shared ring. If user modifies it as so never gets the right overflow value ever again, it's its problem, even though before we would have restored it back by next overflow. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8427965f5175dd051febc63804909861109ce859.1621201931.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14	io_uring: deduce cq_mask from cq_entries	Pavel Begunkov
	No need to cache cq_mask, it's exactly cq_entries - 1, so just deduce it to not carry it around. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d439efad0503c8398451dae075e68a04362fbc8d.1621201931.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14	io_uring: remove dependency on ring->sq/cq_entries	Pavel Begunkov
	We have numbers of {sq,cq} entries cached in ctx, don't look up them in user-shared rings as 1) it may fetch additional cacheline 2) user may change it and so it's always error prone. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/745d31bc2da41283ddd0489ef784af5c8d6310e9.1621201931.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14	io_uring: better locality for rsrc fields	Pavel Begunkov
	ring has two types of resource-related fields: used for request submission, and field needed for update/registration. Reshuffle them into these two groups for better locality and readability. The second group is not in the hot path, so it's natural to place them somewhere in the end. Also update an outdated comment. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/05b34795bb4440f4ec4510f08abd5a31830f8ca0.1621201931.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14	io_uring: shuffle rarely used ctx fields	Pavel Begunkov
	There is a bunch of scattered around ctx fields that are almost never used, e.g. only on ring exit, plunge them to the end, better locality, better aesthetically. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/782ff94b00355923eae757d58b1a47821b5b46d4.1621201931.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14	io_uring: make fail flag not link specific	Pavel Begunkov
	The main difference is in req_set_fail_links() renamed into req_set_fail(), which now sets REQ_F_FAIL_LINK/REQ_F_FAIL flag unconditional on whether it has been a link or not. It only matters in io_disarm_next(), which already handles it well, and all calls to it have a fast path checking REQ_F_LINK/HARDLINK. It looks cleaner, and sheds binary size text data bss dec hex filename 84235 12390 8 96633 17979 ./fs/io_uring.o 84151 12414 8 96573 1793d ./fs/io_uring.o Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e2224154dd6e53b665ac835d29436b177872fa10.1621201931.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14	io_uring: get rid of files in exit cancel	Pavel Begunkov
	We don't match against files on cancellation anymore, so no need to drag around files_struct anymore, just pass a flag telling whether only inflight or all requests should be killed. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7bfc5409a78f8e2d6b27dec3293ec2d248677348.1621201931.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14	io_uring: simplify waking sqo_sq_wait	Pavel Begunkov
	Going through submission in __io_sq_thread() and still having a full SQ is rather unexpected, so remove a check for SQ fullness and just wake up whoever wait on sqo_sq_wait. Also skip if it doesn't do submission in the first place, likely may to happen for SQPOLL sharing and/or IOPOLL. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e2e91751e87b1a39f8d63ef884aaff578123f61e.1621201931.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14	io_uring: remove unused park_task_work	Pavel Begunkov
	As sqpoll cancel via task_work is killed, remove everything related to park_task_work as it's not used anymore. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/310d8b76a2fbbf3e139373500e04ad9af7ee3dbb.1621201931.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14	io_uring: improve sq_thread waiting check	Pavel Begunkov
	If SQPOLL task finds a ring requesting it to continue running, no need to set wake flag to rest of the rings as it will be cleared in a moment anyway, so hide it in a single sqd->ctx_list loop. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1ee5a696d9fd08645994c58ee147d149a8957d94.1621201931.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14	io_uring: improve sqpoll event/state handling	Pavel Begunkov
	As sqd->state changes rarely, don't check every event one by one but look them all at once. Add a helper function. Also don't go into event waiting sleeping with STOP flag set. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/645025f95c7eeec97f88ff497785f4f1d6f3966f.1621201931.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-10	io_uring: add feature flag for rsrc tags	Pavel Begunkov
	Add IORING_FEAT_RSRC_TAGS indicating that io_uring supports a bunch of new IORING_REGISTER operations, in particular IORING_REGISTER_[FILES[,UPDATE]2,BUFFERS[2,UPDATE]] that support rsrc tagging, and also indicating implemented dynamic fixed buffer updates. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9b995d4045b6c6b4ab7510ca124fd25ac2203af7.1623339162.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-10	io_uring: change registration/upd/rsrc tagging ABI	Pavel Begunkov
	There are ABI moments about recently added rsrc registration/update and tagging that might become a nuisance in the future. First, IORING_REGISTER_RSRC[_UPD] hide different types of resources under it, so breaks fine control over them by restrictions. It works for now, but once those are wanted under restrictions it would require a rework. It was also inconvenient trying to fit a new resource not supporting all the features (e.g. dynamic update) into the interface, so better to return to IORING_REGISTER_* top level dispatching. Second, register/update were considered to accept a type of resource, however that's not a good idea because there might be several ways of registration of a single resource type, e.g. we may want to add non-contig buffers or anything more exquisite as dma mapped memory. So, remove IORING_RSRC_[FILE,BUFFER] out of the ABI, and place them internally for now to limit changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9b554897a7c17ad6e3becc48dfed2f7af9f423d5.1623339162.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-29	io_uring: fix misaccounting fix buf pinned pages	Pavel Begunkov
	As Andres reports "... io_sqe_buffer_register() doesn't initialize imu. io_buffer_account_pin() does imu->acct_pages++, before calling io_account_mem(ctx, imu->acct_pages).", leading to evevntual -ENOMEM. Initialise the field. Reported-by: Andres Freund <andres@anarazel.de> Fixes: 41edf1a5ec967 ("io_uring: keep table of pointers to ubufs") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/438a6f46739ae5e05d9c75a0c8fa235320ff367c.1622285901.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-27	io_uring: fix data race to avoid potential NULL-deref	Marco Elver
	Commit ba5ef6dc8a82 ("io_uring: fortify tctx/io_wq cleanup") introduced setting tctx->io_wq to NULL a bit earlier. This has caused KCSAN to detect a data race between accesses to tctx->io_wq: write to 0xffff88811d8df330 of 8 bytes by task 3709 on cpu 1: io_uring_clean_tctx fs/io_uring.c:9042 [inline] __io_uring_cancel fs/io_uring.c:9136 io_uring_files_cancel include/linux/io_uring.h:16 [inline] do_exit kernel/exit.c:781 do_group_exit kernel/exit.c:923 get_signal kernel/signal.c:2835 arch_do_signal_or_restart arch/x86/kernel/signal.c:789 handle_signal_work kernel/entry/common.c:147 [inline] exit_to_user_mode_loop kernel/entry/common.c:171 [inline] ... read to 0xffff88811d8df330 of 8 bytes by task 6412 on cpu 0: io_uring_try_cancel_iowq fs/io_uring.c:8911 [inline] io_uring_try_cancel_requests fs/io_uring.c:8933 io_ring_exit_work fs/io_uring.c:8736 process_one_work kernel/workqueue.c:2276 ... With the config used, KCSAN only reports data races with value changes: this implies that in the case here we also know that tctx->io_wq was non-NULL. Therefore, depending on interleaving, we may end up with: [CPU 0] \| [CPU 1] io_uring_try_cancel_iowq() \| io_uring_clean_tctx() if (!tctx->io_wq) // false \| ... ... \| tctx->io_wq = NULL io_wq_cancel_cb(tctx->io_wq, ...) \| ... -> NULL-deref \| Note: It is likely that thus far we've gotten lucky and the compiler optimizes the double-read into a single read into a register -- but this is never guaranteed, and can easily change with a different config! Fix the data race by restoring the previous behaviour, where both setting io_wq to NULL and put of the wq are _serialized_ after concurrent io_uring_try_cancel_iowq() via acquisition of the uring_lock and removal of the node in io_uring_del_task_file(). Fixes: ba5ef6dc8a82 ("io_uring: fortify tctx/io_wq cleanup") Suggested-by: Pavel Begunkov <asml.silence@gmail.com> Reported-by: syzbot+bf2b3d0435b9b728946c@syzkaller.appspotmail.com Signed-off-by: Marco Elver <elver@google.com> Cc: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20210527092547.2656514-1-elver@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-25	io_uring/io-wq: close io-wq full-stop gap	Pavel Begunkov
	There is an old problem with io-wq cancellation where requests should be killed and are in io-wq but are not discoverable, e.g. in @next_hashed or @linked vars of io_worker_handle_work(). It adds some unreliability to individual request canellation, but also may potentially get __io_uring_cancel() stuck. For instance: 1) An __io_uring_cancel()'s cancellation round have not found any request but there are some as desribed. 2) __io_uring_cancel() goes to sleep 3) Then workers wake up and try to execute those hidden requests that happen to be unbound. As we already cancel all requests of io-wq there, set IO_WQ_BIT_EXIT in advance, so preventing 3) from executing unbound requests. The workers will initially break looping because of getting a signal as they are threads of the dying/exec()'ing user task. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/abfcf8c54cb9e8f7bfbad7e9a0cc5433cc70bdc2.1621781238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-20	io_uring: fortify tctx/io_wq cleanup	Pavel Begunkov
	We don't want anyone poking into tctx->io_wq awhile it's being destroyed by io_wq_put_and_exit(), and even though it shouldn't even happen, if buggy would be preferable to get a NULL-deref instead of subtle delayed failure or UAF. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/827b021de17926fd807610b3e53a5a5fa8530856.1621513214.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-17	io_uring: don't modify req->poll for rw	Pavel Begunkov
	__io_queue_proc() is used by both poll and apoll, so we should not access req->poll directly but selecting right struct io_poll_iocb depending on use case. Reported-and-tested-by: syzbot+a84b8783366ecb1c65d0@syzkaller.appspotmail.com Fixes: ea6a693d862d ("io_uring: disable multishot poll for double poll add cases") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4a6a1de31142d8e0250fe2dfd4c8923d82a5bbfc.1621251795.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-14	io_uring: increase max number of reg buffers	Pavel Begunkov
	Since recent changes instead of storing a large array of struct io_mapped_ubuf, we store pointers to them, that is 4 times slimmer and we should not to so worry about restricting max number of registererd buffer slots, increase the limit 4 times. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d3dee1da37f46da416aa96a16bf9e5094e10584d.1620990371.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-14	io_uring: further remove sqpoll limits on opcodes	Pavel Begunkov
	There are three types of requests that left disabled for sqpoll, namely epoll ctx, statx, and resources update. Since SQPOLL task is now closely mimics a userspace thread, remove the restrictions. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/909b52d70c45636d8d7897582474ea5aab5eed34.1620990306.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-14	io_uring: fix ltout double free on completion race	Pavel Begunkov
	Always remove linked timeout on io_link_timeout_fn() from the master request link list, otherwise we may get use-after-free when first io_link_timeout_fn() puts linked timeout in the fail path, and then will be found and put on master's free. Cc: stable@vger.kernel.org # 5.10+ Fixes: 90cd7e424969d ("io_uring: track link timeout's master explicitly") Reported-and-tested-by: syzbot+5a864149dd970b546223@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/69c46bf6ce37fec4fdcd98f0882e18eb07ce693a.1620990121.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-08	io_uring: fix link timeout refs	Pavel Begunkov
	WARNING: CPU: 0 PID: 10242 at lib/refcount.c:28 refcount_warn_saturate+0x15b/0x1a0 lib/refcount.c:28 RIP: 0010:refcount_warn_saturate+0x15b/0x1a0 lib/refcount.c:28 Call Trace: __refcount_sub_and_test include/linux/refcount.h:283 [inline] __refcount_dec_and_test include/linux/refcount.h:315 [inline] refcount_dec_and_test include/linux/refcount.h:333 [inline] io_put_req fs/io_uring.c:2140 [inline] io_queue_linked_timeout fs/io_uring.c:6300 [inline] __io_queue_sqe+0xbef/0xec0 fs/io_uring.c:6354 io_submit_sqe fs/io_uring.c:6534 [inline] io_submit_sqes+0x2bbd/0x7c50 fs/io_uring.c:6660 __do_sys_io_uring_enter fs/io_uring.c:9240 [inline] __se_sys_io_uring_enter+0x256/0x1d60 fs/io_uring.c:9182 io_link_timeout_fn() should put only one reference of the linked timeout request, however in case of racing with the master request's completion first io_req_complete() puts one and then io_put_req_deferred() is called. Cc: stable@vger.kernel.org # 5.12+ Fixes: 9ae1f8dd372e0 ("io_uring: fix inconsistent lock state") Reported-by: syzbot+a2910119328ce8e7996f@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ff51018ff29de5ffa76f09273ef48cb24c720368.1620417627.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-05	io_uring: truncate lengths larger than MAX_RW_COUNT on provide buffers	Thadeu Lima de Souza Cascardo
	Read and write operations are capped to MAX_RW_COUNT. Some read ops rely on that limit, and that is not guaranteed by the IORING_OP_PROVIDE_BUFFERS. Truncate those lengths when doing io_add_buffers, so buffer addresses still use the uncapped length. Also, take the chance and change struct io_buffer len member to __u32, so it matches struct io_provide_buffer len member. This fixes CVE-2021-3491, also reported as ZDI-CAN-13546. Fixes: ddf0322db79c ("io_uring: add IORING_OP_PROVIDE_BUFFERS") Reported-by: Billy Jheng Bing-Jhong (@st424204) Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-30	io_uring: Fix memory leak in io_sqe_buffers_register()	Zqiang
	unreferenced object 0xffff8881123bf0a0 (size 32): comm "syz-executor557", pid 8384, jiffies 4294946143 (age 12.360s) backtrace: [<ffffffff81469b71>] kmalloc_node include/linux/slab.h:579 [inline] [<ffffffff81469b71>] kvmalloc_node+0x61/0xf0 mm/util.c:587 [<ffffffff815f0b3f>] kvmalloc include/linux/mm.h:795 [inline] [<ffffffff815f0b3f>] kvmalloc_array include/linux/mm.h:813 [inline] [<ffffffff815f0b3f>] kvcalloc include/linux/mm.h:818 [inline] [<ffffffff815f0b3f>] io_rsrc_data_alloc+0x4f/0xc0 fs/io_uring.c:7164 [<ffffffff815f26d8>] io_sqe_buffers_register+0x98/0x3d0 fs/io_uring.c:8383 [<ffffffff815f84a7>] __io_uring_register+0xf67/0x18c0 fs/io_uring.c:9986 [<ffffffff81609222>] __do_sys_io_uring_register fs/io_uring.c:10091 [inline] [<ffffffff81609222>] __se_sys_io_uring_register fs/io_uring.c:10071 [inline] [<ffffffff81609222>] __x64_sys_io_uring_register+0x112/0x230 fs/io_uring.c:10071 [<ffffffff842f616a>] do_syscall_64+0x3a/0xb0 arch/x86/entry/common.c:47 [<ffffffff84400068>] entry_SYSCALL_64_after_hwframe+0x44/0xae Fix data->tags memory leak, through io_rsrc_data_free() to release data memory space. Reported-by: syzbot+0f32d05d8b6cd8d7ea3e@syzkaller.appspotmail.com Signed-off-by: Zqiang <qiang.zhang@windriver.com> Link: https://lore.kernel.org/r/20210430082515.13886-1-qiang.zhang@windriver.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-29	io_uring: Fix premature return from loop and memory leak	Colin Ian King
	Currently the -EINVAL error return path is leaking memory allocated to data. Fix this by not returning immediately but instead setting the error return variable to -EINVAL and breaking out of the loop. Kudos to Pavel Begunkov for suggesting a correct fix. Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20210429104602.62676-1-colin.king@canonical.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-29	io_uring: fix unchecked error in switch_start()	Pavel Begunkov
	io_rsrc_node_switch_start() can fail, don't forget to check returned error code. Reported-by: syzbot+a4715dd4b7c866136f79@syzkaller.appspotmail.com Fixes: eae071c9b4cef ("io_uring: prepare fixed rw for dynanic buffers") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c4c06e2f3f0c8e43bd8d0a266c79055bcc6b6e60.1619693112.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-29	io_uring: allow empty slots for reg buffers	Pavel Begunkov
	Allow empty reg buffer slots any request using which should fail. This allows users to not register all buffers in advance, but do it lazily and/or on demand via updates. That is achieved by setting iov_base and iov_len to zero for registration and/or buffer updates. Empty buffer can't have a non-zero tag. Implementation details: to not add extra overhead to io_import_fixed(), create a dummy buffer crafted to fail any request using it, and set it to all empty buffer slots. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7e95e4d700082baaf010c648c72ac764c9cc8826.1619611868.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-29	io_uring: add more build check for uapi	Pavel Begunkov
	Add a couple of BUILD_BUG_ON() checking some rsrc uapi structs and SQE flags. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ff960df4d5026b9fb5bfd80994b9d3667d3926da.1619536280.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-29	io_uring: dont overlap internal and user req flags	Pavel Begunkov
	CQE flags take one byte that we store in req->flags together with other REQ_F_* internal flags. CQE flags are copied directly into req and then verified that requires some handling on failures, e.g. to make sure that that copy doesn't set some of the internal flags. Move all internal flags to take bits after the first byte, so we don't need extra handling and make it safer overall. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b8b5b02d1ab9d786fcc7db4a3fe86db6b70b8987.1619536280.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-29	io_uring: fix drain with rsrc CQEs	Pavel Begunkov
	Resource emitted CQEs are not bound to requests, so fix up counters used for DRAIN/defer logic. Fixes: b60c8dce33895 ("io_uring: preparation for rsrc tagging") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2b32f5f0a40d5928c3466d028f936e167f0654be.1619536280.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-28	Merge tag 'for-5.13/io_uring-2021-04-27' of git://git.kernel.dk/linux-block	Linus Torvalds
	Pull io_uring updates from Jens Axboe: - Support for multi-shot mode for POLL requests - More efficient reference counting. This is shamelessly stolen from the mm side. Even though referencing is mostly single/dual user, the 128 count was retained to keep the code the same. Maybe this should/could be made generic at some point. - Removal of the need to have a manager thread for each ring. The manager threads only job was checking and creating new io-threads as needed, instead we handle this from the queue path. - Allow SQPOLL without CAP_SYS_ADMIN or CAP_SYS_NICE. Since 5.12, this thread is "just" a regular application thread, so no need to restrict use of it anymore. - Cleanup of how internal async poll data lifetime is managed. - Fix for syzbot reported crash on SQPOLL cancelation. - Make buffer registration more like file registrations, which includes flexibility in avoiding full set unregistration and re-registration. - Fix for io-wq affinity setting. - Be a bit more defensive in task->pf_io_worker setup. - Various SQPOLL fixes. - Cleanup of SQPOLL creds handling. - Improvements to in-flight request tracking. - File registration cleanups. - Tons of cleanups and little fixes * tag 'for-5.13/io_uring-2021-04-27' of git://git.kernel.dk/linux-block: (156 commits) io_uring: maintain drain logic for multishot poll requests io_uring: Check current->io_uring in io_uring_cancel_sqpoll io_uring: fix NULL reg-buffer io_uring: simplify SQPOLL cancellations io_uring: fix work_exit sqpoll cancellations io_uring: Fix uninitialized variable up.resv io_uring: fix invalid error check after malloc io_uring: io_sq_thread() no longer needs to reset current->pf_io_worker kernel: always initialize task->pf_io_worker to NULL io_uring: update sq_thread_idle after ctx deleted io_uring: add full-fledged dynamic buffers support io_uring: implement fixed buffers registration similar to fixed files io_uring: prepare fixed rw for dynanic buffers io_uring: keep table of pointers to ubufs io_uring: add generic rsrc update with tags io_uring: add IORING_REGISTER_RSRC io_uring: enumerate dynamic resources io_uring: add generic path for rsrc update io_uring: preparation for rsrc tagging io_uring: decouple CQE filling from requests ...
2021-04-27	io_uring: maintain drain logic for multishot poll requests	Hao Xu
	Now that we have multishot poll requests, one SQE can emit multiple CQEs. given below example: sqe0(multishot poll)-->sqe1-->sqe2(drain req) sqe2 is designed to issue after sqe0 and sqe1 completed, but since sqe0 is a multishot poll request, sqe2 may be issued after sqe0's event triggered twice before sqe1 completed. This isn't what users leverage drain requests for. Here the solution is to wait for multishot poll requests fully completed. To achieve this, we should reconsider the req_need_defer equation, the original one is: all_sqes(excluding dropped ones) == all_cqes(including dropped ones) This means we issue a drain request when all the previous submitted SQEs have generated their CQEs. Now we should consider multishot requests, we deduct all the multishot CQEs except the cancellation one, In this way a multishot poll request behave like a normal request, so: all_sqes == all_cqes - multishot_cqes(except cancellations) Here we introduce cq_extra for it. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/1618298439-136286-1-git-send-email-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-27	io_uring: Check current->io_uring in io_uring_cancel_sqpoll	Palash Oswal
	syzkaller identified KASAN: null-ptr-deref Write in io_uring_cancel_sqpoll. io_uring_cancel_sqpoll is called by io_sq_thread before calling io_uring_alloc_task_context. This leads to current->io_uring being NULL. io_uring_cancel_sqpoll should not have to deal with threads where current->io_uring is NULL. In order to cast a wider safety net, perform input sanitisation directly in io_uring_cancel_sqpoll and return for NULL value of current->io_uring. This is safe since if current->io_uring isn't set, then there's no way for the task to have submitted any requests. Reported-by: syzbot+be51ca5a4d97f017cd50@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Signed-off-by: Palash Oswal <hello@oswalpalash.com> Link: https://lore.kernel.org/r/20210427125148.21816-1-hello@oswalpalash.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-26	io_uring: fix NULL reg-buffer	Pavel Begunkov
	io_import_fixed() doesn't expect a registered buffer slot to be NULL and would fail stumbling on it. We don't allow it, but if during __io_sqe_buffers_update() rsrc removal succeeds but following register fails, we'll get such a situation. Do it atomically and don't remove buffers until we sure that a new one can be set. Fixes: 634d00df5e1cf ("io_uring: add full-fledged dynamic buffers support") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/830020f9c387acddd51962a3123b5566571b8c6d.1619446608.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-26	io_uring: simplify SQPOLL cancellations	Pavel Begunkov
	All sqpoll rings (even sharing sqpoll task) are currently dead bound to the task that created them, iow when owner task dies it kills all its SQPOLL rings and their inflight requests via task_work infra. It's neither the nicist way nor the most convenient as adds extra locking/waiting and dependencies. Leave it alone and rely on SIGKILL being delivered on its thread group exit, so there are only two cases left: 1) thread group is dying, so sqpoll task gets a signal and exit itself cancelling all requests. 2) an sqpoll ring is dying. Because refs_kill() is called the sqpoll not going to submit any new request, and that's what we need. And io_ring_exit_work() will do all the cancellation itself before actually killing ctx, so sqpoll doesn't need to worry about it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3cd7f166b9c326a2c932b70e71a655b03257b366.1619389911.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-26	io_uring: fix work_exit sqpoll cancellations	Pavel Begunkov
	After closing an SQPOLL ring, io_ring_exit_work() kicks in and starts doing cancellations via io_uring_try_cancel_requests(). It will go through io_uring_try_cancel_iowq(), which uses ctx->tctx_list, but as SQPOLL task don't have a ctx note, its io-wq won't be reachable and so is left not cancelled. It will eventually cancelled when one of the tasks dies, but if a thread group survives for long and changes rings, it will spawn lots of unreclaimed resources and live locked works. Cancel SQPOLL task's io-wq separately in io_ring_exit_work(). Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a71a7fe345135d684025bb529d5cb1d8d6b46e10.1619389911.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-26	io_uring: Fix uninitialized variable up.resv	Colin Ian King
	The variable up.resv is not initialized and is being checking for a non-zero value in the call to _io_register_rsrc_update. Fix this by explicitly setting the variable to 0. Addresses-Coverity: ("Uninitialized scalar variable)" Fixes: c3bdad027183 ("io_uring: add generic rsrc update with tags") Signed-off-by: Colin Ian King <colin.king@canonical.com> Link: https://lore.kernel.org/r/20210426094735.8320-1-colin.king@canonical.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-26	io_uring: fix invalid error check after malloc	Pavel Begunkov
	Now we allocate io_mapped_ubuf instead of bvec, so we clearly have to check its address after allocation. Fixes: 41edf1a5ec967 ("io_uring: keep table of pointers to ubufs") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d28eb1bc4384284f69dbce35b9f70c115ff6176f.1619392565.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-25	io_uring: io_sq_thread() no longer needs to reset current->pf_io_worker	Stefan Metzmacher
	This is done by create_io_thread() now. Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>