summaryrefslogtreecommitdiff
path: root/drivers/block
AgeCommit message (Collapse)Author
12 daysloop: use kiocb helpers to fix lockdep warningMing Lei
[ Upstream commit c4706c5058a7bd7d7c20f3b24a8f523ecad44e83 ] The lockdep tool can report a circular lock dependency warning in the loop driver's AIO read/write path: ``` [ 6540.587728] kworker/u96:5/72779 is trying to acquire lock: [ 6540.593856] ff110001b5968440 (sb_writers#9){.+.+}-{0:0}, at: loop_process_work+0x11a/0xf70 [loop] [ 6540.603786] [ 6540.603786] but task is already holding lock: [ 6540.610291] ff110001b5968440 (sb_writers#9){.+.+}-{0:0}, at: loop_process_work+0x11a/0xf70 [loop] [ 6540.620210] [ 6540.620210] other info that might help us debug this: [ 6540.627499] Possible unsafe locking scenario: [ 6540.627499] [ 6540.634110] CPU0 [ 6540.636841] ---- [ 6540.639574] lock(sb_writers#9); [ 6540.643281] lock(sb_writers#9); [ 6540.646988] [ 6540.646988] *** DEADLOCK *** ``` This patch fixes the issue by using the AIO-specific helpers `kiocb_start_write()` and `kiocb_end_write()`. These functions are designed to be used with a `kiocb` and manage write sequencing correctly for asynchronous I/O without introducing the problematic lock dependency. The `kiocb` is already part of the `loop_cmd` struct, so this change also simplifies the completion function `lo_rw_aio_do_completion()` by using the `iocb` from the `cmd` struct directly, instead of retrieving the loop device from the request queue. Fixes: 39d86db34e41 ("loop: add file_start_write() and file_end_write()") Cc: Changhui Zhong <czhong@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250716114808.3159657-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-17ublk: sanity check add_dev input for underflowRonnie Sahlberg
[ Upstream commit 969127bf0783a4ac0c8a27e633a9e8ea1738583f ] Add additional checks that queue depth and number of queues are non-zero. Signed-off-by: Ronnie Sahlberg <rsahlberg@whamcloud.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250626022046.235018-1-ronniesahlberg@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-17nbd: fix uaf in nbd_genl_connect() error pathZheng Qixing
[ Upstream commit aa9552438ebf015fc5f9f890dbfe39f0c53cf37e ] There is a use-after-free issue in nbd: block nbd6: Receive control failed (result -104) block nbd6: shutting down sockets ================================================================== BUG: KASAN: slab-use-after-free in recv_work+0x694/0xa80 drivers/block/nbd.c:1022 Write of size 4 at addr ffff8880295de478 by task kworker/u33:0/67 CPU: 2 UID: 0 PID: 67 Comm: kworker/u33:0 Not tainted 6.15.0-rc5-syzkaller-00123-g2c89c1b655c0 #0 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 Workqueue: nbd6-recv recv_work Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:408 [inline] print_report+0xc3/0x670 mm/kasan/report.c:521 kasan_report+0xe0/0x110 mm/kasan/report.c:634 check_region_inline mm/kasan/generic.c:183 [inline] kasan_check_range+0xef/0x1a0 mm/kasan/generic.c:189 instrument_atomic_read_write include/linux/instrumented.h:96 [inline] atomic_dec include/linux/atomic/atomic-instrumented.h:592 [inline] recv_work+0x694/0xa80 drivers/block/nbd.c:1022 process_one_work+0x9cc/0x1b70 kernel/workqueue.c:3238 process_scheduled_works kernel/workqueue.c:3319 [inline] worker_thread+0x6c8/0xf10 kernel/workqueue.c:3400 kthread+0x3c2/0x780 kernel/kthread.c:464 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:153 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> nbd_genl_connect() does not properly stop the device on certain error paths after nbd_start_device() has been called. This causes the error path to put nbd->config while recv_work continue to use the config after putting it, leading to use-after-free in recv_work. This patch moves nbd_start_device() after the backend file creation. Reported-by: syzbot+48240bab47e705c53126@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68227a04.050a0220.f2294.00b5.GAE@google.com/T/ Fixes: 6497ef8df568 ("nbd: provide a way for userspace processes to identify device backends") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250612132405.364904-1-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-10aoe: defer rexmit timer downdev work to workqueueJustin Sanders
[ Upstream commit cffc873d68ab09a0432b8212008c5613f8a70a2c ] When aoe's rexmit_timer() notices that an aoe target fails to respond to commands for more than aoe_deadsecs, it calls aoedev_downdev() which cleans the outstanding aoe and block queues. This can involve sleeping, such as in blk_mq_freeze_queue(), which should not occur in irq context. This patch defers that aoedev_downdev() call to the aoe device's workqueue. Link: https://bugzilla.kernel.org/show_bug.cgi?id=212665 Signed-off-by: Justin Sanders <jsanders.devel@gmail.com> Link: https://lore.kernel.org/r/20250610170600.869-2-jsanders.devel@gmail.com Tested-By: Valentin Kleibel <valentin@vrvis.at> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27ublk: santizize the arguments from userspace when adding a deviceRonnie Sahlberg
[ Upstream commit 8c8472855884355caf3d8e0c50adf825f83454b2 ] Sanity check the values for queue depth and number of queues we get from userspace when adding a device. Signed-off-by: Ronnie Sahlberg <rsahlberg@whamcloud.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver") Fixes: 62fe99cef94a ("ublk: add read()/write() support for ublk char device") Link: https://lore.kernel.org/r/20250619021031.181340-1-ronniesahlberg@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27aoe: clean device rq_list in aoedev_downdev()Justin Sanders
[ Upstream commit 7f90d45e57cb2ef1f0adcaf925ddffdfc5e680ca ] An aoe device's rq_list contains accepted block requests that are waiting to be transmitted to the aoe target. This queue was added as part of the conversion to blk_mq. However, the queue was not cleaned out when an aoe device is downed which caused blk_mq_freeze_queue() to sleep indefinitely waiting for those requests to complete, causing a hang. This fix cleans out the queue before calling blk_mq_freeze_queue(). Link: https://bugzilla.kernel.org/show_bug.cgi?id=212665 Fixes: 3582dd291788 ("aoe: convert aoeblk to blk-mq") Signed-off-by: Justin Sanders <jsanders.devel@gmail.com> Link: https://lore.kernel.org/r/20250610170600.869-1-jsanders.devel@gmail.com Tested-By: Valentin Kleibel <valentin@vrvis.at> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19loop: add file_start_write() and file_end_write()Ming Lei
[ Upstream commit 39d86db34e41b96bd86f1955cd0ce6cd9c5fca4c ] file_start_write() and file_end_write() should be added around ->write_iter(). Recently we switch to ->write_iter() from vfs_iter_write(), and the implied file_start_write() and file_end_write() are lost. Also we never add them for dio code path, so add them back for covering both. Cc: Jeff Moyer <jmoyer@redhat.com> Fixes: f2fed441c69b ("loop: stop using vfs_iter_{read,write} for buffered I/O") Fixes: bc07c10a3603 ("block: loop: support DIO & AIO") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250527153405.837216-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19brd: fix discard end sectorYu Kuai
[ Upstream commit a26a339a654b9403f0ee1004f1db4c2b2a355460 ] brd_do_discard() just aligned start sector to page, this can only work if the discard size if at least one page. For example: blkdiscard /dev/ram0 -o 5120 -l 1024 In this case, size = (1024 - (8192 - 5120)), which is a huge value. Fix the problem by round_down() the end sector. Fixes: 9ead7efc6f3f ("brd: implement discard support") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250506061756.2970934-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19brd: fix aligned_sector from brd_do_discard()Yu Kuai
[ Upstream commit d4099f8893b057ad7e8d61df76bdeaf807ebd679 ] The calculation is just wrong, fix it by round_up(). Fixes: 9ead7efc6f3f ("brd: implement discard support") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250506061756.2970934-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29loop: don't require ->write_iter for writable files in loop_configureChristoph Hellwig
[ Upstream commit 355341e4359b2d5edf0ed5e117f7e9e7a0a5dac0 ] Block devices can be opened read-write even if they can't be written to for historic reasons. Remove the check requiring file->f_op->write_iter when the block devices was opened in loop_configure. The call to loop_check_backing_file just below ensures the ->write_iter is present for backing files opened for writing, which is the only check that is actually needed. Fixes: f5c84eff634b ("loop: Add sanity check for read/write_iter") Reported-by: Christian Hesse <mail@eworm.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250520135420.1177312-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29loop: check in LO_FLAGS_DIRECT_IO in loop_default_blocksizeChristoph Hellwig
[ Upstream commit f6f9e32fe1e454ae8ac0190b2c2bd6074914beec ] We can't go below the minimum direct I/O size no matter if direct I/O is enabled by passing in an O_DIRECT file descriptor or due to the explicit flag. Now that LO_FLAGS_DIRECT_IO is set earlier after assigning a backing file, loop_default_blocksize can check it instead of the O_DIRECT flag to handle both conditions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20250131120120.1315125-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29ublk: complete command synchronously on errorCaleb Sander Mateos
[ Upstream commit 603f9be21c1894e462416e3324962d6c9c2b95f8 ] In case of an error, ublk's ->uring_cmd() functions currently return -EIOCBQUEUED and immediately call io_uring_cmd_done(). -EIOCBQUEUED and io_uring_cmd_done() are intended for asynchronous completions. For synchronous completions, the ->uring_cmd() function can just return the negative return code directly. This skips io_uring_cmd_del_cancelable(), and deferring the completion to task work. So return the error code directly from __ublk_ch_uring_cmd() and ublk_ctrl_uring_cmd(). Update ublk_ch_uring_cmd_cb(), which currently ignores the return value from __ublk_ch_uring_cmd(), to call io_uring_cmd_done() for synchronous completions. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20250225212456.2902549-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29ublk: enforce ublks_max only for unprivileged devicesUday Shankar
[ Upstream commit 80bdfbb3545b6f16680a72c825063d08a6b44c7a ] Commit 403ebc877832 ("ublk_drv: add module parameter of ublks_max for limiting max allowed ublk dev"), claimed ublks_max was added to prevent a DoS situation with an untrusted user creating too many ublk devices. If that's the case, ublks_max should only restrict the number of unprivileged ublk devices in the system. Enforce the limit only for unprivileged ublk devices, and rename variables accordingly. Leave the external-facing parameter name unchanged, since changing it may break systems which use it (but still update its documentation to reflect its new meaning). As a result of this change, in a system where there are only normal (non-unprivileged) devices, the maximum number of such devices is increased to 1 << MINORBITS, or 1048576. That ought to be enough for anyone, right? Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250228-ublks_max-v1-1-04b7379190c0@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-18loop: Add sanity check for read/write_iterLizhi Xu
[ Upstream commit f5c84eff634ba003326aa034c414e2a9dcb7c6a7 ] Some file systems do not support read_iter/write_iter, such as selinuxfs in this issue. So before calling them, first confirm that the interface is supported and then call it. It is releavant in that vfs_iter_read/write have the check, and removal of their used caused szybot to be able to hit this issue. Fixes: f2fed441c69b ("loop: stop using vfs_iter__{read,write} for buffered I/O") Reported-by: syzbot+6af973a3b8dfd2faefdc@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=6af973a3b8dfd2faefdc Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250428143626.3318717-1-lizhi.xu@windriver.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-18loop: factor out a loop_assign_backing_file helperChristoph Hellwig
[ Upstream commit d278164832618bf2775c6a89e6434e2633de1eed ] Split the code for setting up a backing file into a helper in preparation of adding more code to this path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20250131120120.1315125-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: f5c84eff634b ("loop: Add sanity check for read/write_iter") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-18loop: refactor queue limits updatesChristoph Hellwig
[ Upstream commit b38c8be255e89ffcdeb817407222d2de0b573a41 ] Replace loop_reconfigure_limits with a slightly less encompassing loop_update_limits that expects the caller to acquire and commit the queue limits to prepare for sorting out the freeze vs limits lock ordering. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250110054726.1499538-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: f5c84eff634b ("loop: Add sanity check for read/write_iter") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-18loop: Fix ABBA locking raceOGAWA Hirofumi
[ Upstream commit b49125574cae26458d4aa02ce8f4523ba9a2a328 ] Current loop calls vfs_statfs() while holding the q->limits_lock. If FS takes some locking in vfs_statfs callback, this may lead to ABBA locking bug (at least, FAT fs has this issue actually). So this patch calls vfs_statfs() outside q->limits_locks instead, because looks like no reason to hold q->limits_locks while getting discord configs. Chain exists of: &sbi->fat_lock --> &q->q_usage_counter(io)#17 --> &q->limits_lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&q->limits_lock); lock(&q->q_usage_counter(io)#17); lock(&q->limits_lock); lock(&sbi->fat_lock); *** DEADLOCK *** Reported-by: syzbot+a5d8c609c02f508672cc@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=a5d8c609c02f508672cc Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: f5c84eff634b ("loop: Add sanity check for read/write_iter") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-18loop: Simplify discard granularity calcJohn Garry
[ Upstream commit d47de6ac8842327ae1c782670283450159c55d5b ] A bdev discard granularity is always at least SECTOR_SIZE, so don't check for a zero value. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241101092215.422428-1-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: f5c84eff634b ("loop: Add sanity check for read/write_iter") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-18loop: Use bdev limit helpers for configuring discardJohn Garry
[ Upstream commit 8d3fd059dd289e6c322e5741ad56794bcce699a2 ] Instead of directly looking at the request_queue limits, use the bdev limits helpers, which is preferable. Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241030111900.3981223-1-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: f5c84eff634b ("loop: Add sanity check for read/write_iter") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-25block: don't reorder requests in blk_add_rq_to_plugChristoph Hellwig
commit e70c301faece15b618e54b613b1fd6ece3dd05b4 upstream. Add requests to the tail of the list instead of the front so that they are queued up in submission order. Remove the re-reordering in blk_mq_dispatch_plug_list, virtio_queue_rqs and nvme_queue_rqs now that the list is ordered as expected. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241113152050.157179-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-25block: add a rq_list typeChristoph Hellwig
commit a3396b99990d8b4e5797e7b16fdeb64c15ae97bb upstream. Replace the semi-open coded request list helpers with a proper rq_list type that mirrors the bio_list and has head and tail pointers. Besides better type safety this actually allows to insert at the tail of the list, which will be useful soon. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241113152050.157179-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-25loop: LOOP_SET_FD: send uevents for partitionsThomas Weißschuh
commit 0dba7a05b9e47d8b546399117b0ddf2426dc6042 upstream. Remove the suppression of the uevents before scanning for partitions. The partitions inherit their suppression settings from their parent device, which lead to the uevents being dropped. This is similar to the same changes for LOOP_CONFIGURE done in commit bb430b694226 ("loop: LOOP_CONFIGURE: send uevents for partitions"). Fixes: 498ef5c777d9 ("loop: suppress uevents while reconfiguring the device") Cc: stable@vger.kernel.org Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20250415-loop-uevent-changed-v3-1-60ff69ac6088@linutronix.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-25loop: properly send KOBJ_CHANGED uevent for disk deviceThomas Weißschuh
commit e7bc0010ceb403d025100698586c8e760921d471 upstream. The original commit message and the wording "uncork" in the code comment indicate that it is expected that the suppressed event instances are automatically sent after unsuppressing. This is not the case, instead they are discarded. In effect this means that no "changed" events are emitted on the device itself by default. While each discovered partition does trigger a changed event on the device, devices without partitions don't have any event emitted. This makes udev miss the device creation and prompted workarounds in userspace. See the linked util-linux/losetup bug. Explicitly emit the events and drop the confusingly worded comments. Link: https://github.com/util-linux/util-linux/issues/2434 Fixes: 498ef5c777d9 ("loop: suppress uevents while reconfiguring the device") Cc: stable@vger.kernel.org Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Link: https://lore.kernel.org/r/20250415-loop-uevent-changed-v2-1-0c4e6a923b2a@linutronix.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-25loop: stop using vfs_iter_{read,write} for buffered I/OChristoph Hellwig
[ Upstream commit f2fed441c69b9237760840a45a004730ff324faf ] vfs_iter_{read,write} always perform direct I/O when the file has the O_DIRECT flag set, which breaks disabling direct I/O using the LOOP_SET_STATUS / LOOP_SET_STATUS64 ioctls. This was recenly reported as a regression, but as far as I can tell was only uncovered by better checking for block sizes and has been around since the direct I/O support was added. Fix this by using the existing aio code that calls the raw read/write iter methods instead. Note that despite the comments there is no need for block drivers to ever call flush_dcache_page themselves, and the call is a left-over from prehistoric times. Fixes: ab1cb278bc70 ("block: loop: introduce ioctl command of LOOP_SET_DIRECT_IO") Reported-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Tested-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20250409130940.3685677-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-25loop: aio inherit the ioprio of original requestYunlong Xing
[ Upstream commit 1fdb8188c3d505452b40cdb365b1bb32be533a8e ] Set cmd->iocb.ki_ioprio to the ioprio of loop device's request. The purpose is to inherit the original request ioprio in the aio flow. Signed-off-by: Yunlong Xing <yunlong.xing@unisoc.com> Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250414030159.501180-1-yunlong.xing@unisoc.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: f2fed441c69b ("loop: stop using vfs_iter_{read,write} for buffered I/O") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-20ublk: fix handling recovery & reissue in ublk_abort_queue()Ming Lei
[ Upstream commit 6ee6bd5d4fce502a5b5a2ea805e9ff16e6aa890f ] Commit 8284066946e6 ("ublk: grab request reference when the request is handled by userspace") doesn't grab request reference in case of recovery reissue. Then the request can be requeued & re-dispatch & failed when canceling uring command. If it is one zc request, the request can be freed before io_uring returns the zc buffer back, then cause kernel panic: [ 126.773061] BUG: kernel NULL pointer dereference, address: 00000000000000c8 [ 126.773657] #PF: supervisor read access in kernel mode [ 126.774052] #PF: error_code(0x0000) - not-present page [ 126.774455] PGD 0 P4D 0 [ 126.774698] Oops: Oops: 0000 [#1] SMP NOPTI [ 126.775034] CPU: 13 UID: 0 PID: 1612 Comm: kworker/u64:55 Not tainted 6.14.0_blk+ #182 PREEMPT(full) [ 126.775676] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014 [ 126.776275] Workqueue: iou_exit io_ring_exit_work [ 126.776651] RIP: 0010:ublk_io_release+0x14/0x130 [ublk_drv] Fixes it by always grabbing request reference for aborting the request. Reported-by: Caleb Sander Mateos <csander@purestorage.com> Closes: https://lore.kernel.org/linux-block/CADUfDZodKfOGUeWrnAxcZiLT+puaZX8jDHoj_sfHZCOZwhzz6A@mail.gmail.com/ Fixes: 8284066946e6 ("ublk: grab request reference when the request is handled by userspace") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250409011444.2142010-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-20ublk: refactor recovery configuration flag helpersUday Shankar
[ Upstream commit 3b939b8f715e014adcc48f7827fe9417252f0833 ] ublk currently supports the following behaviors on ublk server exit: A: outstanding I/Os get errors, subsequently issued I/Os get errors B: outstanding I/Os get errors, subsequently issued I/Os queue C: outstanding I/Os get reissued, subsequently issued I/Os queue and the following behaviors for recovery of preexisting block devices by a future incarnation of the ublk server: 1: ublk devices stopped on ublk server exit (no recovery possible) 2: ublk devices are recoverable using start/end_recovery commands The userspace interface allows selection of combinations of these behaviors using flags specified at device creation time, namely: default behavior: A + 1 UBLK_F_USER_RECOVERY: B + 2 UBLK_F_USER_RECOVERY|UBLK_F_USER_RECOVERY_REISSUE: C + 2 We can't easily change the userspace interface to allow independent selection of one of {A, B, C} and one of {1, 2}, but we can refactor the internal helpers which test for the flags. Replace the existing helpers with the following set: ublk_nosrv_should_reissue_outstanding: tests for behavior C ublk_nosrv_[dev_]should_queue_io: tests for behavior B ublk_nosrv_should_stop_dev: tests for behavior 1 Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241007182419.3263186-3-ushankar@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: 6ee6bd5d4fce ("ublk: fix handling recovery & reissue in ublk_abort_queue()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10ublk: make sure ubq->canceling is set when queue is frozenMing Lei
[ Upstream commit 8741d0737921ec1c03cf59aebf4d01400c2b461a ] Now ublk driver depends on `ubq->canceling` for deciding if the request can be dispatched via uring_cmd & io_uring_cmd_complete_in_task(). Once ubq->canceling is set, the uring_cmd can be done via ublk_cancel_cmd() and io_uring_cmd_done(). So set ubq->canceling when queue is frozen, this way makes sure that the flag can be observed from ublk_queue_rq() reliably, and avoids use-after-free on uring_cmd. Fixes: 216c8f5ef0f2 ("ublk: replace monitor with cancelable uring_cmd") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250327095123.179113-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-22block: change blk_mq_add_to_batch() third argument type to boolShin'ichiro Kawasaki
[ Upstream commit 9bce6b5f8987678b9c6c1fe433af6b5fe41feadc ] Commit 1f47ed294a2b ("block: cleanup and fix batch completion adding conditions") modified the evaluation criteria for the third argument, 'ioerror', in the blk_mq_add_to_batch() function. Initially, the function had checked if 'ioerror' equals zero. Following the commit, it started checking for negative error values, with the presumption that such values, for instance -EIO, would be passed in. However, blk_mq_add_to_batch() callers do not pass negative error values. Instead, they pass status codes defined in various ways: - NVMe PCI and Apple drivers pass NVMe status code - virtio_blk driver passes the virtblk request header status byte - null_blk driver passes blk_status_t These codes are either zero or positive, therefore the revised check fails to function as intended. Specifically, with the NVMe PCI driver, this modification led to the failure of the blktests test case nvme/039. In this test scenario, errors are artificially injected to the NVMe driver, resulting in positive NVMe status codes passed to blk_mq_add_to_batch(), which unexpectedly processes the failed I/O in a batch. Hence the failure. To correct the ioerror check within blk_mq_add_to_batch(), make all callers to uniformly pass the argument as boolean. Modify the callers to check their specific status codes and pass the boolean value 'is_error'. Also describe the arguments of blK_mq_add_to_batch as kerneldoc. Fixes: 1f47ed294a2b ("block: cleanup and fix batch completion adding conditions") Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/20250311104359.1767728-3-shinichiro.kawasaki@wdc.com [axboe: fold in documentation update] Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-13ublk: set_params: properly check if parameters can be appliedUday Shankar
[ Upstream commit 5ac60242b0173be83709603ebaf27a473f16c4e4 ] The parameters set by the set_params call are only applied to the block device in the start_dev call. So if a device has already been started, a subsequently issued set_params on that device will not have the desired effect, and should return an error. There is an existing check for this - set_params fails on devices in the LIVE state. But this check is not sufficient to cover the recovery case. In this case, the device will be in the QUIESCED or FAIL_IO states, so set_params will succeed. But this success is misleading, because the parameters will not be applied, since the device has already been started (by a previous ublk server). The bit UB_STATE_USED is set on completion of the start_dev; use it to detect and fail set_params commands which arrive too late to be applied (after start_dev). Signed-off-by: Uday Shankar <ushankar@purestorage.com> Fixes: 0aa73170eba5 ("ublk_drv: add SET_PARAMS/GET_PARAMS control command") Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250304-set_params-v1-1-17b5e0887606@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-13rust: treewide: switch to our kernel `Box` typeDanilo Krummrich
commit 8373147ce4961665c5700016b1c76299e962d077 upstream. Now that we got the kernel `Box` type in place, convert all existing `Box` users to make use of it. Reviewed-by: Alice Ryhl <aliceryhl@google.com> Reviewed-by: Benno Lossin <benno.lossin@proton.me> Reviewed-by: Gary Guo <gary@garyguo.net> Signed-off-by: Danilo Krummrich <dakr@kernel.org> Link: https://lore.kernel.org/r/20241004154149.93856-13-dakr@kernel.org Signed-off-by: Miguel Ojeda <ojeda@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-08nbd: don't allow reconnect after disconnectYu Kuai
[ Upstream commit 844b8cdc681612ff24df62cdefddeab5772fadf1 ] Following process can cause nbd_config UAF: 1) grab nbd_config temporarily; 2) nbd_genl_disconnect() flush all recv_work() and release the initial reference: nbd_genl_disconnect nbd_disconnect_and_put nbd_disconnect flush_workqueue(nbd->recv_workq) if (test_and_clear_bit(NBD_RT_HAS_CONFIG_REF, ...)) nbd_config_put -> due to step 1), reference is still not zero 3) nbd_genl_reconfigure() queue recv_work() again; nbd_genl_reconfigure config = nbd_get_config_unlocked(nbd) if (!config) -> succeed if (!test_bit(NBD_RT_BOUND, ...)) -> succeed nbd_reconnect_socket queue_work(nbd->recv_workq, &args->work) 4) step 1) release the reference; 5) Finially, recv_work() will trigger UAF: recv_work nbd_config_put(nbd) -> nbd_config is freed atomic_dec(&config->recv_threads) -> UAF Fix the problem by clearing NBD_RT_BOUND in nbd_genl_disconnect(), so that nbd_genl_reconfigure() will fail. Fixes: b7aa3d39385d ("nbd: add a reconfigure netlink command") Reported-by: syzbot+6b0df248918b92c33e6a@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/675bfb65.050a0220.1a2d0d.0006.GAE@google.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250103092859.3574648-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-08ps3disk: Do not use dev->bounce_size before it is setGeert Uytterhoeven
[ Upstream commit c2398e6d5f16e15598d3a37e17107fea477e3f91 ] dev->bounce_size is only initialized after it is used to set the queue limits. Fix this by using BOUNCE_SIZE instead. Fixes: a7f18b74dbe17162 ("ps3disk: pass queue_limits to blk_mq_alloc_disk") Reported-by: Philipp Hortmann <philipp.g.hortmann@gmail.com> Closes: https://lore.kernel.org/39256db9-3d73-4e86-a49b-300dfd670212@gmail.com Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/06988f959ea6885b8bd7fb3b9059dd54bc6bbad7.1735894216.git.geert+renesas@glider.be Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-01-23zram: fix potential UAF of zram tableKairui Song
commit 212fe1c0df4a150fb6298db2cfff267ceaba5402 upstream. If zram_meta_alloc failed early, it frees allocated zram->table without setting it NULL. Which will potentially cause zram_meta_free to access the table if user reset an failed and uninitialized device. Link: https://lkml.kernel.org/r/20250107065446.86928-1-ryncsn@gmail.com Fixes: 74363ec674cb ("zram: fix uninitialized ZRAM not releasing backing device") Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-01-02ublk: detach gendisk from ublk device if add_disk() failsMing Lei
[ Upstream commit 75cd4005da5492129917a4a4ee45e81660556104 ] Inside ublk_abort_requests(), gendisk is grabbed for aborting all inflight requests. And ublk_abort_requests() is called when exiting the uring context or handling timeout. If add_disk() fails, the gendisk may have been freed when calling ublk_abort_requests(), so use-after-free can be caused when getting disk's reference in ublk_abort_requests(). Fixes the bug by detaching gendisk from ublk device if add_disk() fails. Fixes: bd23f6c2c2d0 ("ublk: quiesce request queue when aborting queue") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241225110640.351531-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-01-02virtio-blk: don't keep queue frozen during system suspendMing Lei
[ Upstream commit 7678abee0867e6b7fb89aa40f6e9f575f755fb37 ] Commit 4ce6e2db00de ("virtio-blk: Ensure no requests in virtqueues before deleting vqs.") replaces queue quiesce with queue freeze in virtio-blk's PM callbacks. And the motivation is to drain inflight IOs before suspending. block layer's queue freeze looks very handy, but it is also easy to cause deadlock, such as, any attempt to call into bio_queue_enter() may run into deadlock if the queue is frozen in current context. There are all kinds of ->suspend() called in suspend context, so keeping queue frozen in the whole suspend context isn't one good idea. And Marek reported lockdep warning[1] caused by virtio-blk's freeze queue in virtblk_freeze(). [1] https://lore.kernel.org/linux-block/ca16370e-d646-4eee-b9cc-87277c89c43c@samsung.com/ Given the motivation is to drain in-flight IOs, it can be done by calling freeze & unfreeze, meantime restore to previous behavior by keeping queue quiesced during suspend. Cc: Yi Sun <yi.sun@unisoc.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: virtualization@lists.linux.dev Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Acked-by: Stefan Hajnoczi <stefanha@redhat.com> Link: https://lore.kernel.org/r/20241112125821.1475793-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-27zram: fix uninitialized ZRAM not releasing backing deviceKairui Song
commit 74363ec674cb172d8856de25776c8f3103f05e2f upstream. Setting backing device is done before ZRAM initialization. If we set the backing device, then remove the ZRAM module without initializing the device, the backing device reference will be leaked and the device will be hold forever. Fix this by always reset the ZRAM fully on rmmod or reset store. Link: https://lkml.kernel.org/r/20241209165717.94215-3-ryncsn@gmail.com Fixes: 013bf95a83ec ("zram: add interface to specif backing device") Signed-off-by: Kairui Song <kasong@tencent.com> Reported-by: Desheng Wu <deshengwu@tencent.com> Suggested-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-27zram: refuse to use zero sized block device as backing deviceKairui Song
commit be48c412f6ebf38849213c19547bc6d5b692b5e5 upstream. Patch series "zram: fix backing device setup issue", v2. This series fixes two bugs of backing device setting: - ZRAM should reject using a zero sized (or the uninitialized ZRAM device itself) as the backing device. - Fix backing device leaking when removing a uninitialized ZRAM device. This patch (of 2): Setting a zero sized block device as backing device is pointless, and one can easily create a recursive loop by setting the uninitialized ZRAM device itself as its own backing device by (zram0 is uninitialized): echo /dev/zram0 > /sys/block/zram0/backing_dev It's definitely a wrong config, and the module will pin itself, kernel should refuse doing so in the first place. By refusing to use zero sized device we avoided misuse cases including this one above. Link: https://lkml.kernel.org/r/20241209165717.94215-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20241209165717.94215-2-ryncsn@gmail.com Fixes: 013bf95a83ec ("zram: add interface to specif backing device") Signed-off-by: Kairui Song <kasong@tencent.com> Reported-by: Desheng Wu <deshengwu@tencent.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-14zram: clear IDLE flag in mark_idle()Sergey Senozhatsky
[ Upstream commit d37da422edb0664a2037e6d7d42fe6d339aae78a ] If entry does not fulfill current mark_idle() parameters, e.g. cutoff time, then we should clear its ZRAM_IDLE from previous mark_idle() invocations. Consider the following case: - mark_idle() cutoff time 8h - mark_idle() cutoff time 4h - writeback() idle - will writeback entries with cutoff time 8h, while it should only pick entries with cutoff time 4h The bug was reported by Shin Kawamura. Link: https://lkml.kernel.org/r/20241028153629.1479791-3-senozhatsky@chromium.org Fixes: 755804d16965 ("zram: introduce an aged idle interface") Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reported-by: Shin Kawamura <kawasin@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: <stable@vger.kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-14zram: do not mark idle slots that cannot be idleSergey Senozhatsky
[ Upstream commit b967fa1ba72b5da2b6d9bf95f0b13420a59e0701 ] ZRAM_SAME slots cannot be post-processed (writeback or recompress) so do not mark them ZRAM_IDLE. Same with ZRAM_WB slots, they cannot be ZRAM_IDLE because they are not in zsmalloc pool anymore. Link: https://lkml.kernel.org/r/20240917021020.883356-6-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: d37da422edb0 ("zram: clear IDLE flag in mark_idle()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-09zram: clear IDLE flag after recompressionSergey Senozhatsky
commit f85219096648b251a81e9fe24a1974590cfc417d upstream. Patch series "zram: IDLE flag handling fixes", v2. zram can wrongly preserve ZRAM_IDLE flag on its entries which can result in premature post-processing (writeback and recompression) of such entries. This patch (of 2) Recompression should clear ZRAM_IDLE flag on the entries it has accessed, because otherwise some entries, specifically those for which recompression has failed, become immediate candidate entries for another post-processing (e.g. writeback). Consider the following case: - recompression marks entries IDLE every 4 hours and attempts to recompress them - some entries are incompressible, so we keep them intact and hence preserve IDLE flag - writeback marks entries IDLE every 8 hours and writebacks IDLE entries, however we have IDLE entries left from recompression, so writeback prematurely writebacks those entries. The bug was reported by Shin Kawamura. Link: https://lkml.kernel.org/r/20241028153629.1479791-1-senozhatsky@chromium.org Link: https://lkml.kernel.org/r/20241028153629.1479791-2-senozhatsky@chromium.org Fixes: 84b33bf78889 ("zram: introduce recompress sysfs knob") Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reported-by: Shin Kawamura <kawasin@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-05brd: decrease the number of allocated pages which discardedZhang Xianwei
[ Upstream commit 82734209bedd65a8b508844bab652b464379bfdd ] The number of allocated pages which discarded will not decrease. Fix it. Fixes: 9ead7efc6f3f ("brd: implement discard support") Signed-off-by: Zhang Xianwei <zhang.xianwei8@zte.com.cn> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241128170056565nPKSz2vsP8K8X2uk2iaDG@zte.com.cn Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-05ublk: fix error code for unsupported commandMing Lei
commit 34c1227035b3ab930a1ae6ab6f22fec1af8ab09e upstream. ENOTSUPP is for kernel use only, and shouldn't be sent to userspace. Fix it by replacing it with EOPNOTSUPP. Cc: stable@vger.kernel.org Fixes: bfbcef036396 ("ublk_drv: move ublk_get_device_from_id into ublk_ctrl_uring_cmd") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241119030646.2319030-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-05ublk: fix ublk_ch_mmap() for 64K page sizeMing Lei
commit d369735e02ef122d19d4c3d093028da0eb400636 upstream. In ublk_ch_mmap(), queue id is calculated in the following way: (vma->vm_pgoff << PAGE_SHIFT) / `max_cmd_buf_size` 'max_cmd_buf_size' is equal to `UBLK_MAX_QUEUE_DEPTH * sizeof(struct ublksrv_io_desc)` and UBLK_MAX_QUEUE_DEPTH is 4096 and part of UAPI, so 'max_cmd_buf_size' is always page aligned in 4K page size kernel. However, it isn't true in 64K page size kernel. Fixes the issue by always rounding up 'max_cmd_buf_size' with PAGE_SIZE. Cc: stable@vger.kernel.org Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241111110718.1394001-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-05zram: fix NULL pointer in comp_algorithm_show()Liu Shixin
[ Upstream commit f364cdeb38938f9d03061682b8ff3779dd1730e5 ] LTP reported a NULL pointer dereference as followed: CPU: 7 UID: 0 PID: 5995 Comm: cat Kdump: loaded Not tainted 6.12.0-rc6+ #3 Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : __pi_strcmp+0x24/0x140 lr : zcomp_available_show+0x60/0x100 [zram] sp : ffff800088b93b90 x29: ffff800088b93b90 x28: 0000000000000001 x27: 0000000000400cc0 x26: 0000000000000ffe x25: ffff80007b3e2388 x24: 0000000000000000 x23: ffff80007b3e2390 x22: ffff0004041a9000 x21: ffff80007b3e2900 x20: 0000000000000000 x19: 0000000000000000 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000000 x10: ffff80007b3e2900 x9 : ffff80007b3cb280 x8 : 0101010101010101 x7 : 0000000000000000 x6 : 0000000000000000 x5 : 0000000000000040 x4 : 0000000000000000 x3 : 00656c722d6f7a6c x2 : 0000000000000000 x1 : ffff80007b3e2900 x0 : 0000000000000000 Call trace: __pi_strcmp+0x24/0x140 comp_algorithm_show+0x40/0x70 [zram] dev_attr_show+0x28/0x80 sysfs_kf_seq_show+0x90/0x140 kernfs_seq_show+0x34/0x48 seq_read_iter+0x1d4/0x4e8 kernfs_fop_read_iter+0x40/0x58 new_sync_read+0x9c/0x168 vfs_read+0x1a8/0x1f8 ksys_read+0x74/0x108 __arm64_sys_read+0x24/0x38 invoke_syscall+0x50/0x120 el0_svc_common.constprop.0+0xc8/0xf0 do_el0_svc+0x24/0x38 el0_svc+0x38/0x138 el0t_64_sync_handler+0xc0/0xc8 el0t_64_sync+0x188/0x190 The zram->comp_algs[ZRAM_PRIMARY_COMP] can be NULL in zram_add() if comp_algorithm_set() has not been called. User can access the zram device by sysfs after device_add_disk(), so there is a time window to trigger the NULL pointer dereference. Move it ahead device_add_disk() to make sure when user can access the zram device, it is ready. comp_algorithm_set() is protected by zram->init_lock in other places and no such problem. Link: https://lkml.kernel.org/r/20241108100147.3776123-1-liushixin2@huawei.com Fixes: 7ac07a26dea7 ("zram: preparation for multi-zcomp support") Signed-off-by: Liu Shixin <liushixin2@huawei.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-05zram: permit only one post-processing operation at a timeSergey Senozhatsky
[ Upstream commit 58652f2b6d21f2874c9f060165ec7e03e8b1fc71 ] Both recompress and writeback soon will unlock slots during processing, which makes things too complex wrt possible race-conditions. We still want to clear PP_SLOT in slot_free, because this is how we figure out that slot that was selected for post-processing has been released under us and when we start post-processing we check if slot still has PP_SLOT set. At the same time, theoretically, we can have something like this: CPU0 CPU1 recompress scan slots set PP_SLOT unlock slot slot_free clear PP_SLOT allocate PP_SLOT writeback scan slots set PP_SLOT unlock slot select PP-slot test PP_SLOT So recompress will not detect that slot has been re-used and re-selected for concurrent writeback post-processing. Make sure that we only permit on post-processing operation at a time. So now recompress and writeback post-processing don't race against each other, we only need to handle slot re-use (slot_free and write), which is handled individually by each pp operation. Having recompress and writeback competing for the same slots is not exactly good anyway (can't imagine anyone doing that). Link: https://lkml.kernel.org/r/20240917021020.883356-3-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: f364cdeb3893 ("zram: fix NULL pointer in comp_algorithm_show()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-05zram: ZRAM_DEF_COMP should depend on ZRAMGeert Uytterhoeven
[ Upstream commit 9f3310ccc71efff041fed3f8be5ad19b0feab30b ] When Compressed RAM block device support is disabled, the CONFIG_ZRAM_DEF_COMP symbol still ends up in the generated config file: CONFIG_ZRAM_DEF_COMP="unset-value" While this causes no real harm, avoid polluting the config file by adding a dependency on ZRAM. Link: https://lkml.kernel.org/r/64e05bad68a9bd5cc322efd114a04d25de525940.1730807319.git.geert@linux-m68k.org Fixes: 917a59e81c34 ("zram: introduce custom comp backends API") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-05virtio_blk: reverse request order in virtio_queue_rqsChristoph Hellwig
[ Upstream commit 7f212e997edbb7a2cb85cef2ac14265dfaf88717 ] blk_mq_flush_plug_list submits requests in the reverse order that they were submitted, which leads to a rather suboptimal I/O pattern especially in rotational devices. Fix this by rewriting virtio_queue_rqs so that it always pops the requests from the passed in request list, and then adds them to the head of a local submit list. This actually simplifies the code a bit as it removes the complicated list splicing, at the cost of extra updates of the rq_next pointer. As that should be cache hot anyway it should be an easy price to pay. Fixes: 0e9911fa768f ("virtio-blk: support mq_ops->queue_rqs()") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241113152050.157179-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-05loop: fix type of block sizeLi Wang
[ Upstream commit 8e604cac499248c75ad3a26695a743ff879ded5c ] PAGE_SIZE may be 64K, and the max block size can be PAGE_SIZE, so any variable for holding block size can't be defined as 'unsigned short'. Unfortunately commit 473516b36193 ("loop: use the atomic queue limits update API") passes 'bsize' with type of 'unsigned short' to loop_reconfigure_limits(), and causes LTP/ioctl_loop06 test failure: 12 ioctl_loop06.c:76: TINFO: Using LOOP_SET_BLOCK_SIZE with arg > PAGE_SIZE 13 ioctl_loop06.c:59: TFAIL: Set block size succeed unexpectedly ... 18 ioctl_loop06.c:76: TINFO: Using LOOP_CONFIGURE with block_size > PAGE_SIZE 19 ioctl_loop06.c:59: TFAIL: Set block size succeed unexpectedly Fixes the issue by defining 'block size' variable with 'unsigned int', which is aligned with block layer's definition. (improve commit log & add fixes tag) Fixes: 473516b36193 ("loop: use the atomic queue limits update API") Cc: John Garry <john.g.garry@oracle.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Li Wang <liwang@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241109022744.1126003-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-05brd: defer automatic disk creation until module initialization succeedsYang Erkun
[ Upstream commit 826cc42adf44930a633d11a5993676d85ddb0842 ] My colleague Wupeng found the following problems during fault injection: BUG: unable to handle page fault for address: fffffbfff809d073 PGD 6e648067 P4D 123ec8067 PUD 123ec4067 PMD 100e38067 PTE 0 Oops: Oops: 0000 [#1] PREEMPT SMP KASAN NOPTI CPU: 5 UID: 0 PID: 755 Comm: modprobe Not tainted 6.12.0-rc3+ #17 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014 RIP: 0010:__asan_load8+0x4c/0xa0 ... Call Trace: <TASK> blkdev_put_whole+0x41/0x70 bdev_release+0x1a3/0x250 blkdev_release+0x11/0x20 __fput+0x1d7/0x4a0 task_work_run+0xfc/0x180 syscall_exit_to_user_mode+0x1de/0x1f0 do_syscall_64+0x6b/0x170 entry_SYSCALL_64_after_hwframe+0x76/0x7e loop_init() is calling loop_add() after __register_blkdev() succeeds and is ignoring disk_add() failure from loop_add(), for loop_add() failure is not fatal and successfully created disks are already visible to bdev_open(). brd_init() is currently calling brd_alloc() before __register_blkdev() succeeds and is releasing successfully created disks when brd_init() returns an error. This can cause UAF for the latter two case: case 1: T1: modprobe brd brd_init brd_alloc(0) // success add_disk disk_scan_partitions bdev_file_open_by_dev // alloc file fput // won't free until back to userspace brd_alloc(1) // failed since mem alloc error inject // error path for modprobe will release code segment // back to userspace __fput blkdev_release bdev_release blkdev_put_whole bdev->bd_disk->fops->release // fops is freed now, UAF! case 2: T1: T2: modprobe brd brd_init brd_alloc(0) // success open(/dev/ram0) brd_alloc(1) // fail // error path for modprobe close(/dev/ram0) ... /* UAF! */ bdev->bd_disk->fops->release Fix this problem by following what loop_init() does. Besides, reintroduce brd_devices_mutex to help serialize modifications to brd_list. Fixes: 7f9b348cb5e9 ("brd: convert to blk_alloc_disk/blk_cleanup_disk") Reported-by: Wupeng Ma <mawupeng1@huawei.com> Signed-off-by: Yang Erkun <yangerkun@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241030034914.907829-1-yangerkun@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>