summaryrefslogtreecommitdiff
path: root/drivers/block/loop.c
AgeCommit message (Collapse)Author
2018-04-20block/loop: fix deadlock after loop_set_statusTetsuo Handa
commit 1e047eaab3bb5564f25b41e9cd3a053009f4e789 upstream. syzbot is reporting deadlocks at __blkdev_get() [1]. ---------------------------------------- [ 92.493919] systemd-udevd D12696 525 1 0x00000000 [ 92.495891] Call Trace: [ 92.501560] schedule+0x23/0x80 [ 92.502923] schedule_preempt_disabled+0x5/0x10 [ 92.504645] __mutex_lock+0x416/0x9e0 [ 92.510760] __blkdev_get+0x73/0x4f0 [ 92.512220] blkdev_get+0x12e/0x390 [ 92.518151] do_dentry_open+0x1c3/0x2f0 [ 92.519815] path_openat+0x5d9/0xdc0 [ 92.521437] do_filp_open+0x7d/0xf0 [ 92.527365] do_sys_open+0x1b8/0x250 [ 92.528831] do_syscall_64+0x6e/0x270 [ 92.530341] entry_SYSCALL_64_after_hwframe+0x42/0xb7 [ 92.931922] 1 lock held by systemd-udevd/525: [ 92.933642] #0: 00000000a2849e25 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x73/0x4f0 ---------------------------------------- The reason of deadlock turned out that wait_event_interruptible() in blk_queue_enter() got stuck with bdev->bd_mutex held at __blkdev_put() due to q->mq_freeze_depth == 1. ---------------------------------------- [ 92.787172] a.out S12584 634 633 0x80000002 [ 92.789120] Call Trace: [ 92.796693] schedule+0x23/0x80 [ 92.797994] blk_queue_enter+0x3cb/0x540 [ 92.803272] generic_make_request+0xf0/0x3d0 [ 92.807970] submit_bio+0x67/0x130 [ 92.810928] submit_bh_wbc+0x15e/0x190 [ 92.812461] __block_write_full_page+0x218/0x460 [ 92.815792] __writepage+0x11/0x50 [ 92.817209] write_cache_pages+0x1ae/0x3d0 [ 92.825585] generic_writepages+0x5a/0x90 [ 92.831865] do_writepages+0x43/0xd0 [ 92.836972] __filemap_fdatawrite_range+0xc1/0x100 [ 92.838788] filemap_write_and_wait+0x24/0x70 [ 92.840491] __blkdev_put+0x69/0x1e0 [ 92.841949] blkdev_close+0x16/0x20 [ 92.843418] __fput+0xda/0x1f0 [ 92.844740] task_work_run+0x87/0xb0 [ 92.846215] do_exit+0x2f5/0xba0 [ 92.850528] do_group_exit+0x34/0xb0 [ 92.852018] SyS_exit_group+0xb/0x10 [ 92.853449] do_syscall_64+0x6e/0x270 [ 92.854944] entry_SYSCALL_64_after_hwframe+0x42/0xb7 [ 92.943530] 1 lock held by a.out/634: [ 92.945105] #0: 00000000a2849e25 (&bdev->bd_mutex){+.+.}, at: __blkdev_put+0x3c/0x1e0 ---------------------------------------- The reason of q->mq_freeze_depth == 1 turned out that loop_set_status() forgot to call blk_mq_unfreeze_queue() at error paths for info->lo_encrypt_type != NULL case. ---------------------------------------- [ 37.509497] CPU: 2 PID: 634 Comm: a.out Tainted: G W 4.16.0+ #457 [ 37.513608] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017 [ 37.518832] RIP: 0010:blk_freeze_queue_start+0x17/0x40 [ 37.521778] RSP: 0018:ffffb0c2013e7c60 EFLAGS: 00010246 [ 37.524078] RAX: 0000000000000000 RBX: ffff8b07b1519798 RCX: 0000000000000000 [ 37.527015] RDX: 0000000000000002 RSI: ffffb0c2013e7cc0 RDI: ffff8b07b1519798 [ 37.529934] RBP: ffffb0c2013e7cc0 R08: 0000000000000008 R09: 47a189966239b898 [ 37.532684] R10: dad78b99b278552f R11: 9332dca72259d5ef R12: ffff8b07acd73678 [ 37.535452] R13: 0000000000004c04 R14: 0000000000000000 R15: ffff8b07b841e940 [ 37.538186] FS: 00007fede33b9740(0000) GS:ffff8b07b8e80000(0000) knlGS:0000000000000000 [ 37.541168] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 37.543590] CR2: 00000000206fdf18 CR3: 0000000130b30006 CR4: 00000000000606e0 [ 37.546410] Call Trace: [ 37.547902] blk_freeze_queue+0x9/0x30 [ 37.549968] loop_set_status+0x67/0x3c0 [loop] [ 37.549975] loop_set_status64+0x3b/0x70 [loop] [ 37.549986] lo_ioctl+0x223/0x810 [loop] [ 37.549995] blkdev_ioctl+0x572/0x980 [ 37.550003] block_ioctl+0x34/0x40 [ 37.550006] do_vfs_ioctl+0xa7/0x6d0 [ 37.550017] ksys_ioctl+0x6b/0x80 [ 37.573076] SyS_ioctl+0x5/0x10 [ 37.574831] do_syscall_64+0x6e/0x270 [ 37.576769] entry_SYSCALL_64_after_hwframe+0x42/0xb7 ---------------------------------------- [1] https://syzkaller.appspot.com/bug?id=cd662bc3f6022c0979d01a262c318fab2ee9b56f Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: syzbot <bot+48594378e9851eab70bcd6f99327c7db58c5a28a@syzkaller.appspotmail.com> Fixes: ecdd09597a572513 ("block/loop: fix race between I/O and set_status") Cc: Ming Lei <tom.leiming@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: stable <stable@vger.kernel.org> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-13Fix loop device flush before configure v3James Wang
[ Upstream commit 6460495709aeb651896bc8e5c134b2e4ca7d34a8 ] While installing SLES-12 (based on v4.4), I found that the installer will stall for 60+ seconds during LVM disk scan. The root cause was determined to be the removal of a bound device check in loop_flush() by commit b5dd2f6047ca ("block: loop: improve performance via blk-mq"). Restoring this check, examining ->lo_state as set by loop_set_fd() eliminates the bad behavior. Test method: modprobe loop max_loop=64 dd if=/dev/zero of=disk bs=512 count=200K for((i=0;i<4;i++))do losetup -f disk; done mkfs.ext4 -F /dev/loop0 for((i=0;i<4;i++))do mkdir t$i; mount /dev/loop$i t$i;done for f in `ls /dev/loop[0-9]*|sort`; do \ echo $f; dd if=$f of=/dev/null bs=512 count=1; \ done Test output: stock patched /dev/loop0 18.1217e-05 8.3842e-05 /dev/loop1 6.1114e-05 0.000147979 /dev/loop10 0.414701 0.000116564 /dev/loop11 0.7474 6.7942e-05 /dev/loop12 0.747986 8.9082e-05 /dev/loop13 0.746532 7.4799e-05 /dev/loop14 0.480041 9.3926e-05 /dev/loop15 1.26453 7.2522e-05 Note that from loop10 onward, the device is not mounted, yet the stock kernel consumes several orders of magnitude more wall time than it does for a mounted device. (Thanks for Mike Galbraith <efault@gmx.de>, give a changelog review.) Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: James Wang <jnwang@suse.com> Fixes: b5dd2f6047ca ("block: loop: improve performance via blk-mq") Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-18loop: Fix lost writes caused by missing flagRoss Zwisler
commit 1d037577c323e5090ce281e96bc313ab2eee5be2 upstream. The following commit: commit aa4d86163e4e ("block: loop: switch to VFS ITER_BVEC") replaced __do_lo_send_write(), which used ITER_KVEC iterators, with lo_write_bvec() which uses ITER_BVEC iterators. In this change, though, the WRITE flag was lost: - iov_iter_kvec(&from, ITER_KVEC | WRITE, &kvec, 1, len); + iov_iter_bvec(&i, ITER_BVEC, bvec, 1, bvec->bv_len); This flag is necessary for the DAX case because we make decisions based on whether or not the iterator is a READ or a WRITE in dax_iomap_actor() and in dax_iomap_rw(). We end up going through this path in configurations where we combine a PMEM device with 4k sectors, a loopback device and DAX. The consequence of this missed flag is that what we intend as a write actually turns into a read in the DAX code, so no data is ever written. The very simplest test case is to create a loopback device and try and write a small string to it, then hexdump a few bytes of the device to see if the write took. Without this patch you read back all zeros, with this you read back the string you wrote. For XFS this causes us to fail or panic during the following xfstests: xfs/074 xfs/078 xfs/216 xfs/217 xfs/250 For ext4 we have a similar issue where writes never happen, but we don't currently have any xfstests that use loopback and show this issue. Fix this by restoring the WRITE flag argument to iov_iter_bvec(). This causes the xfstests to all pass. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org Fixes: commit aa4d86163e4e ("block: loop: switch to VFS ITER_BVEC") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-03loop: fix concurrent lo_open/lo_releaseLinus Torvalds
commit ae6650163c66a7eff1acd6eb8b0f752dcfa8eba5 upstream. 范龙飞 reports that KASAN can report a use-after-free in __lock_acquire. The reason is due to insufficient serialization in lo_release(), which will continue to use the loop device even after it has decremented the lo_refcnt to zero. In the meantime, another process can come in, open the loop device again as it is being shut down. Confusion ensues. Reported-by: 范龙飞 <long7573@126.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-12loop: fix LO_FLAGS_PARTSCAN hangOmar Sandoval
commit e02898b423802b1f3a3aaa7f16e896da069ba8f7 upstream. loop_reread_partitions() needs to do I/O, but we just froze the queue, so we end up waiting forever. This can easily be reproduced with losetup -P. Fix it by moving the reread to after we unfreeze the queue. Fixes: ecdd09597a57 ("block/loop: fix race between I/O and set_status") Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-12block/loop: fix race between I/O and set_statusMing Lei
commit ecdd09597a57251323b0de50e3d45e69298c4a83 upstream. Inside set_status, transfer need to setup again, so we have to drain IO before the transition, otherwise oops may be triggered like the following: divide error: 0000 [#1] SMP KASAN CPU: 0 PID: 2935 Comm: loop7 Not tainted 4.10.0-rc7+ #213 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 task: ffff88006ba1e840 task.stack: ffff880067338000 RIP: 0010:transfer_xor+0x1d1/0x440 drivers/block/loop.c:110 RSP: 0018:ffff88006733f108 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff8800688d7000 RCX: 0000000000000059 RDX: 0000000000000000 RSI: 1ffff1000d743f43 RDI: ffff880068891c08 RBP: ffff88006733f160 R08: ffff8800688d7001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800688d7000 R13: ffff880067b7d000 R14: dffffc0000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88006d000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000006c17e0 CR3: 0000000066e3b000 CR4: 00000000001406f0 Call Trace: lo_do_transfer drivers/block/loop.c:251 [inline] lo_read_transfer drivers/block/loop.c:392 [inline] do_req_filebacked drivers/block/loop.c:541 [inline] loop_handle_cmd drivers/block/loop.c:1677 [inline] loop_queue_work+0xda0/0x49b0 drivers/block/loop.c:1689 kthread_worker_fn+0x4c3/0xa30 kernel/kthread.c:630 kthread+0x326/0x3f0 kernel/kthread.c:227 ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430 Code: 03 83 e2 07 41 29 df 42 0f b6 04 30 4d 8d 44 24 01 38 d0 7f 08 84 c0 0f 85 62 02 00 00 44 89 f8 41 0f b6 48 ff 25 ff 01 00 00 99 <f7> 7d c8 48 63 d2 48 03 55 d0 48 89 d0 48 89 d7 48 c1 e8 03 83 RIP: transfer_xor+0x1d1/0x440 drivers/block/loop.c:110 RSP: ffff88006733f108 ---[ end trace 0166f7bd3b0c0933 ]--- Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06loop: return proper error from loop_queue_rq()Omar Sandoval
commit b4a567e8114327518c09f5632339a5954ab975a3 upstream. ->queue_rq() should return one of the BLK_MQ_RQ_QUEUE_* constants, not an errno. Fixes: f4aa4c7bbac6 ("block: loop: convert to per-device workqueue") Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-11kthread: kthread worker API cleanupPetr Mladek
A good practice is to prefix the names of functions by the name of the subsystem. The kthread worker API is a mix of classic kthreads and workqueues. Each worker has a dedicated kthread. It runs a generic function that process queued works. It is implemented as part of the kthread subsystem. This patch renames the existing kthread worker API to use the corresponding name from the workqueues API prefixed by kthread_: __init_kthread_worker() -> __kthread_init_worker() init_kthread_worker() -> kthread_init_worker() init_kthread_work() -> kthread_init_work() insert_kthread_work() -> kthread_insert_work() queue_kthread_work() -> kthread_queue_work() flush_kthread_work() -> kthread_flush_work() flush_kthread_worker() -> kthread_flush_worker() Note that the names of DEFINE_KTHREAD_WORK*() macros stay as they are. It is common that the "DEFINE_" prefix has precedence over the subsystem names. Note that INIT() macros and init() functions use different naming scheme. There is no good solution. There are several reasons for this solution: + "init" in the function names stands for the verb "initialize" aka "initialize worker". While "INIT" in the macro names stands for the noun "INITIALIZER" aka "worker initializer". + INIT() macros are used only in DEFINE() macros + init() functions are used close to the other kthread() functions. It looks much better if all the functions use the same scheme. + There will be also kthread_destroy_worker() that will be used close to kthread_cancel_work(). It is related to the init() function. Again it looks better if all functions use the same naming scheme. + there are several precedents for such init() function names, e.g. amd_iommu_init_device(), free_area_init_node(), jump_label_init_type(), regmap_init_mmio_clk(), + It is not an argument but it was inconsistent even before. [arnd@arndb.de: fix linux-next merge conflict] Link: http://lkml.kernel.org/r/20160908135724.1311726-1-arnd@arndb.de Link: http://lkml.kernel.org/r/1470754545-17632-3-git-send-email-pmladek@suse.com Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-15blk-mq: remove ->map_queueChristoph Hellwig
All drivers use the default, so provide an inline version of it. If we ever need other queue mapping we can add an optional method back, although supporting will also require major changes to the queue setup code. This provides better code generation, and better debugability as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-04loop: make do_req_filebacked more robustChristoph Hellwig
Use a switch statement to iterate over the possible operations and error out if it's an incorrect one. Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-04loop: don't try to use AIO for discardsChristoph Hellwig
Fix a fat-fingered conversion to the req_op accessors, and also use a switch statement to make it more obvious what is being checked. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Dave Chinner <david@fromorbit.com> Fixes: c2df40 ("drivers: use req op accessor"); Reviewed-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-12loop: Make user notify for adding loop device failedMinfei Huang
There is no error number returned if loop driver fails in function alloc_disk to add new loop device. Add a correct error number to make user notify in this case. Signed-off-by: Minfei Huang <mnghuan@gmail.com> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block, drivers: add REQ_OP_FLUSH operationMike Christie
This adds a REQ_OP_FLUSH operation that is sent to request_fn based drivers by the block layer's flush code, instead of sending requests with the request->cmd_flags REQ_FLUSH bit set. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07drivers: use req op accessorMike Christie
The req operation REQ_OP is separated from the rq_flag_bits definition. This converts the block layer drivers to use req_op to get the op from the request struct. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block, drivers, cgroup: use op_is_write helper instead of checking for REQ_WRITEMike Christie
We currently set REQ_WRITE/WRITE for all non READ IOs like discard, flush, writesame, etc. In the next patches where we no longer set up the op as a bitmap, we will not be able to detect a operation direction like writesame by testing if REQ_WRITE is set. This patch converts the drivers and cgroup to use the op_is_write helper. This should just cover the simple cases. I did dm, md and bcache in their own patches because they were more involved. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-17Merge branch 'for-4.7/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block driver updates from Jens Axboe: "On top of the core pull request, this is the drivers pull request for this merge window. This contains: - Switch drivers to the new write back cache API, and kill off the flush flags. From me. - Kill the discard support for the STEC pci-e flash driver. It's trivially broken, and apparently unmaintained, so it's safer to just remove it. From Jeff Moyer. - A set of lightnvm updates from the usual suspects (Matias/Javier, and Simon), and fixes from Arnd, Jeff Mahoney, Sagi, and Wenwei Tao. - A set of updates for NVMe: - Turn the controller state management into a proper state machine. From Christoph. - Shuffling of code in preparation for NVMe-over-fabrics, also from Christoph. - Cleanup of the command prep part from Ming Lin. - Rewrite of the discard support from Ming Lin. - Deadlock fix for namespace removal from Ming Lin. - Use the now exported blk-mq tag helper for IO termination. From Sagi. - Various little fixes from Christoph, Guilherme, Keith, Ming Lin, Wang Sheng-Hui. - Convert mtip32xx to use the now exported blk-mq tag iter function, from Keith" * 'for-4.7/drivers' of git://git.kernel.dk/linux-block: (74 commits) lightnvm: reserved space calculation incorrect lightnvm: rename nr_pages to nr_ppas on nvm_rq lightnvm: add is_cached entry to struct ppa_addr lightnvm: expose gennvm_mark_blk to targets lightnvm: remove mgt targets on mgt removal lightnvm: pass dma address to hardware rather than pointer lightnvm: do not assume sequential lun alloc. nvme/lightnvm: Log using the ctrl named device lightnvm: rename dma helper functions lightnvm: enable metadata to be sent to device lightnvm: do not free unused metadata on rrpc lightnvm: fix out of bound ppa lun id on bb tbl lightnvm: refactor set_bb_tbl for accepting ppa list lightnvm: move responsibility for bad blk mgmt to target lightnvm: make nvm_set_rqd_ppalist() aware of vblks lightnvm: remove struct factory_blks lightnvm: refactor device ops->get_bb_tbl() lightnvm: introduce nvm_for_each_lun_ppa() macro lightnvm: refactor dev->online_target to global nvm_targets lightnvm: rename nvm_targets to nvm_tgt_type ...
2016-04-15block: loop: fix filesystem corruption in case of aio/dioMing Lei
Starting from commit e36f620428(block: split bios to max possible length), block core starts to split bio in the middle of bvec. Unfortunately loop dio/aio doesn't consider this situation, and always treat 'iter.iov_offset' as zero. Then filesystem corruption is observed. This patch figures out the offset of the base bvevc via 'bio->bi_iter.bi_bvec_done' and fixes the issue by passing the offset to iov iterator. Fixes: e36f6204288088f (block: split bios to max possible length) Cc: Keith Busch <keith.busch@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org (4.5) Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-04-12loop: switch to using blk_queue_write_cache()Jens Axboe
Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2015-10-09Merge branch 'for-4.4/core' into for-4.4/driversJens Axboe
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-01blk-mq: fix racy updates of rq->errorsChristoph Hellwig
blk_mq_complete_request may be a no-op if the request has already been completed by others means (e.g. a timeout or cancellation), but currently drivers have to set rq->errors before calling blk_mq_complete_request, which might leave us with the wrong error value. Add an error parameter to blk_mq_complete_request so that we can defer setting rq->errors until we known we won the race to complete the request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-09-23block: loop: support DIO & AIOMing Lei
There are at least 3 advantages to use direct I/O and AIO on read/write loop's backing file: 1) double cache can be avoided, then memory usage gets decreased a lot 2) not like user space direct I/O, there isn't cost of pinning pages 3) avoid context switch for obtaining good throughput - in buffered file read, random I/O top throughput is often obtained only if they are submitted concurrently from lots of tasks; but for sequential I/O, most of times they can be hit from page cache, so concurrent submissions often introduce unnecessary context switch and can't improve throughput much. There was such discussion[1] to use non-blocking I/O to improve the problem for application. - with direct I/O and AIO, concurrent submissions can be avoided and random read throughput can't be affected meantime xfstests(-g auto, ext4) is basically passed when running with direct I/O(aio), one exception is generic/232, but it failed in loop buffered I/O(4.2-rc6-next-20150814) too. Follows the fio test result for performance purpose: 4 jobs fio test inside ext4 file system over loop block 1) How to run - KVM: 4 VCPUs, 2G RAM - linux kernel: 4.2-rc6-next-20150814(base) with the patchset - the loop block is over one image on SSD. - linux psync, 4 jobs, size 1500M, ext4 over loop block - test result: IOPS from fio output 2) Throughput(IOPS) becomes a bit better with direct I/O(aio) ------------------------------------------------------------- test cases |randread |read |randwrite |write | ------------------------------------------------------------- base |8015 |113811 |67442 |106978 ------------------------------------------------------------- base+loop aio |8136 |125040 |67811 |111376 ------------------------------------------------------------- - somehow, it should be caused by more page cache avaiable for application or one extra page copy is avoided in case of direct I/O 3) context switch - context switch decreased by ~50% with loop direct I/O(aio) compared with loop buffered I/O(4.2-rc6-next-20150814) 4) memory usage from /proc/meminfo ------------------------------------------------------------- | Buffers | Cached ------------------------------------------------------------- base | > 760MB | ~950MB ------------------------------------------------------------- base+loop direct I/O(aio) | < 5MB | ~1.6GB ------------------------------------------------------------- - so there are much more page caches available for application with direct I/O [1] https://lwn.net/Articles/612483/ Signed-off-by: Ming Lei <ming.lei@canonical.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-09-23block: loop: introduce ioctl command of LOOP_SET_DIRECT_IOMing Lei
If loop block is mounted via 'mount -o loop', it isn't easy to pass file descriptor opened as O_DIRECT, so this patch introduces a new command to support direct IO for this case. Cc: linux-api@vger.kernel.org Signed-off-by: Ming Lei <ming.lei@canonical.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-09-23block: loop: prepare for supporing direct IOMing Lei
This patches provides one interface for enabling direct IO from user space: - userspace(such as losetup) can pass 'file' which is opened/fcntl as O_DIRECT Also __loop_update_dio() is introduced to check if direct I/O can be used on current loop setting. The last big change is to introduce LO_FLAGS_DIRECT_IO flag for userspace to know if direct IO is used to access backing file. Cc: linux-api@vger.kernel.org Signed-off-by: Ming Lei <ming.lei@canonical.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-09-23block: loop: use kthread_workMing Lei
The following patch will use dio/aio to submit IO to backing file, then it needn't to schedule IO concurrently from work, so use kthread_work for decreasing context switch cost a lot. For non-AIO case, single thread has been used for long long time, and it was just converted to work in v4.0, which has caused performance regression for fedora live booting already. In discussion[1], even though submitting I/O via work concurrently can improve random read IO throughput, meantime it might hurt sequential read IO performance, so better to restore to single thread behaviour. For the following AIO support, it is better to use multi hw-queue with per-hwq kthread than current work approach suppose there is so high performance requirement for loop. [1] http://marc.info/?t=143082678400002&r=1&w=2 Signed-off-by: Ming Lei <ming.lei@canonical.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-09-23block: loop: set QUEUE_FLAG_NOMERGES for request queue of loopMing Lei
It doesn't make sense to enable merge because the I/O submitted to backing file is handled page by page. Signed-off-by: Ming Lei <ming.lei@canonical.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-17block: have drivers use blk_queue_max_discard_sectors()Jens Axboe
Some drivers use it now, others just set the limits field manually. But in preparation for splitting this into a hard and soft limit, ensure that they all call the proper function for setting the hw limit for discards. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-04Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more vfs updates from Al Viro: "Assorted VFS fixes and related cleanups (IMO the most interesting in that part are f_path-related things and Eric's descriptor-related stuff). UFS regression fixes (it got broken last cycle). 9P fixes. fs-cache series, DAX patches, Jan's file_remove_suid() work" [ I'd say this is much more than "fixes and related cleanups". The file_table locking rule change by Eric Dumazet is a rather big and fundamental update even if the patch isn't huge. - Linus ] * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits) 9p: cope with bogus responses from server in p9_client_{read,write} p9_client_write(): avoid double p9_free_req() 9p: forgetting to cancel request on interrupted zero-copy RPC dax: bdev_direct_access() may sleep block: Add support for DAX reads/writes to block devices dax: Use copy_from_iter_nocache dax: Add block size note to documentation fs/file.c: __fget() and dup2() atomicity rules fs/file.c: don't acquire files->file_lock in fd_install() fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation vfs: avoid creation of inode number 0 in get_next_ino namei: make set_root_rcu() return void make simple_positive() public ufs: use dir_pages instead of ufs_dir_pages() pagemap.h: move dir_pages() over there remove the pointless include of lglock.h fs: cleanup slight list_entry abuse xfs: Correctly lock inode when removing suid and file capabilities fs: Call security_ops->inode_killpriv on truncate fs: Provide function telling whether file_remove_privs() will do anything ...
2015-06-25Merge branch 'for-4.2/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block driver updates from Jens Axboe: "This contains: - a few race fixes for null_blk, from Akinobu Mita. - a series of fixes for mtip32xx, from Asai Thambi and Selvan Mani at Micron. - NVMe: * Fix for missing error return on allocation failure, from Axel Lin. * Code consolidation and cleanups from Christoph. * Memory barrier addition, syncing queue count and queue pointers. From Jon Derrick. * Various fixes from Keith, an addition to support user issue reset from sysfs or ioctl, and automatic namespace rescan. * Fix from Matias, avoiding losing some request flags when marking the request failfast. - small cleanups and sparse fixups for ps3vram. From Geert Uytterhoeven and Geoff Lavand. - s390/dasd dead code removal, from Jarod Wilson. - a set of fixes and optimizations for loop, from Ming Lei. - conversion to blkdev_reread_part() of loop, dasd, ndb. From Ming Lei. - updates to cciss. From Tomas Henzl" * 'for-4.2/drivers' of git://git.kernel.dk/linux-block: (44 commits) mtip32xx: Fix accessing freed memory block: nvme-scsi: Catch kcalloc failure NVMe: Fix IO for extended metadata formats nvme: don't overwrite req->cmd_flags on sync cmd mtip32xx: increase wait time for hba reset mtip32xx: fix minor number mtip32xx: remove unnecessary sleep in mtip_ftl_rebuild_poll() mtip32xx: fix crash on surprise removal of the drive mtip32xx: Abort I/O during secure erase operation mtip32xx: fix incorrectly setting MTIP_DDF_SEC_LOCK_BIT mtip32xx: remove unused variable 'port->allocated' mtip32xx: fix rmmod issue MAINTAINERS: Update ps3vram block driver block/ps3vram: Remove obsolete reference to MTD block/ps3vram: Fix sparse warnings NVMe: Automatic namespace rescan NVMe: Memory barrier before queue_count is incremented NVMe: add sysfs and ioctl controller reset null_blk: restart request processing on completion handler null_blk: prevent timer handler running on a different CPU where started ...
2015-06-23vfs: add file_path() helperMiklos Szeredi
Turn d_path(&file->f_path, ...); into file_path(file, ...); Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-05-20loop: remove (now) unused 'out' labelJens Axboe
gcc, righfully, complains: drivers/block/loop.c:1369:1: warning: label 'out' defined but not used [-Wunused-label] Kill it. Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-20block: loop: fix another reread part failureMing Lei
loop_clr_fd() can be run piggyback with lo_release(), and under this situation, reread partition may always fail because bd_mutex has been held already. This patch detects the situation by the reference count, and call __blkdev_reread_part() to avoid acquiring the lock again. In the meantime, this patch switches to new kernel APIs of blkdev_reread_part() and __blkdev_reread_part(). Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Jarod Wilson <jarod@redhat.com> Acked-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-20block: loop: don't hold lo_ctl_mutex in lo_openMing Lei
The lo_ctl_mutex is held for running all ioctl handlers, and in some ioctl handlers, ioctl_by_bdev(BLKRRPART) is called for rereading partitions, which requires bd_mutex. So it is easy to cause failure because trylock(bd_mutex) may fail inside blkdev_reread_part(), and follows the lock context: blkid or other application: ->open() ->mutex_lock(bd_mutex) ->lo_open() ->mutex_lock(lo_ctl_mutex) losetup(set fd ioctl): ->mutex_lock(lo_ctl_mutex) ->ioctl_by_bdev(BLKRRPART) ->trylock(bd_mutex) This patch trys to eliminate the ABBA lock dependency by removing lo_ctl_mutext in lo_open() with the following approach: 1) make lo_refcnt as atomic_t and avoid acquiring lo_ctl_mutex in lo_open(): - for open vs. add/del loop, no any problem because of loop_index_mutex - freeze request queue during clr_fd, so I/O can't come until clearing fd is completed, like the effect of holding lo_ctl_mutex in lo_open - both open() and release() have been serialized by bd_mutex already 2) don't hold lo_ctl_mutex for decreasing/checking lo_refcnt in lo_release(), then lo_ctl_mutex is only required for the last release. Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Jarod Wilson <jarod@redhat.com> Acked-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05block: loop: avoiding too many pending per work I/OMing Lei
If there are too many pending per work I/O, too many high priority work thread can be generated so that system performance can be effected. This patch limits the max_active parameter of workqueue as 16. This patch fixes Fedora 22 live booting performance regression when it is booted from squashfs over dm based on loop, and looks the following reasons are related with the problem: - not like other filesyststems(such as ext4), squashfs is a bit special, and I observed that increasing I/O jobs to access file in squashfs only improve I/O performance a little, but it can make big difference for ext4 - nested loop: both squashfs.img and ext3fs.img are mounted as loop block, and ext3fs.img is inside the squashfs - during booting, lots of tasks may run concurrently Fixes: b5dd2f6047ca108001328aac0e8588edd15f1778 Cc: stable@vger.kernel.org (v4.0) Cc: Justin M. Forbes <jforbes@fedoraproject.org> Signed-off-by: Ming Lei <ming.lei@canonical.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05block: loop: convert to per-device workqueueMing Lei
Documentation/workqueue.txt: If there is dependency among multiple work items used during memory reclaim, they should be queued to separate wq each with WQ_MEM_RECLAIM. Loop devices can be stacked, so we have to convert to per-device workqueue. One example is Fedora live CD. Fixes: b5dd2f6047ca108001328aac0e8588edd15f1778 Cc: stable@vger.kernel.org (v4.0) Cc: Justin M. Forbes <jforbes@fedoraproject.org> Signed-off-by: Ming Lei <ming.lei@canonical.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-27block: destroy bdi before blockdev is unregistered.NeilBrown
Because of the peculiar way that md devices are created (automatically when the device node is opened), a new device can be created and registered immediately after the blk_unregister_region(disk_devt(disk), disk->minors); call in del_gendisk(). Therefore it is important that all visible artifacts of the previous device are removed before this call. In particular, the 'bdi'. Since: commit c4db59d31e39ea067c32163ac961e9c80198fd37 Author: Christoph Hellwig <hch@lst.de> fs: don't reassign dirty inodes to default_backing_dev_info moved the device_unregister(bdi->dev); call from bdi_unregister() to bdi_destroy() it has been quite easy to lose a race and have a new (e.g.) "md127" be created after the blk_unregister_region() call and before bdi_destroy() is ultimately called by the final 'put_disk', which must come after del_gendisk(). The new device finds that the bdi name is already registered in sysfs and complains > [ 9627.630029] WARNING: CPU: 18 PID: 3330 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x5a/0x70() > [ 9627.630032] sysfs: cannot create duplicate filename '/devices/virtual/bdi/9:127' We can fix this by moving the bdi_destroy() call out of blk_release_queue() (which can happen very late when a refcount reaches zero) and into blk_cleanup_queue() - which happens exactly when the md device driver calls it. Then it is only necessary for md to call blk_cleanup_queue() before del_gendisk(). As loop.c devices are also created on demand by opening the device node, we make the same change there. Fixes: c4db59d31e39ea067c32163ac961e9c80198fd37 Reported-by: Azat Khuzhin <a3at.mail@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org (v4.0) Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-15block: loop: switch to VFS ITER_BVECChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11switch /dev/loop to vfs_iter_write()Al Viro
all writable files that might be used as backing store for /dev/loop already support ->write_iter() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-01-02loop: add blk-mq.h includeJens Axboe
Looks like we pull it in through other ways on x86, but we fail on sparc: In file included from drivers/block/cryptoloop.c:30:0: drivers/block/loop.h:63:24: error: field 'tag_set' has incomplete type struct blk_mq_tag_set tag_set; Add the include to loop.h, kill it from loop.c. Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-02block: loop: don't handle REQ_FUA explicitlyMing Lei
block core handles REQ_FUA by its flush state machine, so won't do it in loop explicitly. Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-02block: loop: introduce lo_discard() and lo_req_flush()Ming Lei
No behaviour change, just move the handling for REQ_DISCARD and REQ_FLUSH in these two functions. Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-02block: loop: say goodby to bioMing Lei
Switch to block request completely. Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-02block: loop: improve performance via blk-mqMing Lei
The conversion is a bit straightforward, and use work queue to dispatch requests of loop block, and one big change is that requests is submitted to backend file/device concurrently with work queue, so throughput may get improved much. Given write requests over same file are often run exclusively, so don't handle them concurrently for avoiding extra context switch cost, possible lock contention and work schedule cost. Also with blk-mq, there is opportunity to get loop I/O merged before submitting to backend file/device. In the following test: - base: v3.19-rc2-2041231 - loop over file in ext4 file system on SSD disk - bs: 4k, libaio, io depth: 64, O_DIRECT, num of jobs: 1 - throughput: IOPS ------------------------------------------------------ | | base | base with loop-mq | delta | ------------------------------------------------------ | randread | 1740 | 25318 | +1355%| ------------------------------------------------------ | read | 42196 | 51771 | +22.6%| ----------------------------------------------------- | randwrite | 35709 | 34624 | -3% | ----------------------------------------------------- | write | 39137 | 40326 | +3% | ----------------------------------------------------- So loop-mq can improve throughput for both read and randread, meantime, performance of write and randwrite isn't hurted basically. Another benefit is that loop driver code gets simplified much after blk-mq conversion, and the patch can be thought as cleanup too. Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-07Merge branch 'sched/urgent' into sched/core, to avoid conflictsIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-18sched, treewide: Replace hardcoded nice values with MIN_NICE/MAX_NICEDongsheng Yang
Replace various -20/+19 hardcoded nice values with MIN_NICE/MAX_NICE. Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/ff13819fd09b7a5dba5ab5ae797f2e7019bdfa17.1394532288.git.yangds.fnst@cn.fujitsu.com Cc: devel@driverdev.osuosl.org Cc: devicetree@vger.kernel.org Cc: fcoe-devel@open-fcoe.org Cc: linux390@de.ibm.com Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Cc: linux-s390@vger.kernel.org Cc: linux-scsi@vger.kernel.org Cc: nbd-general@lists.sourceforge.net Cc: ocfs2-devel@oss.oracle.com Cc: openipmi-developer@lists.sourceforge.net Cc: qla2xxx-upstream@qlogic.com Cc: linux-arch@vger.kernel.org [ Consolidated the patches, twiddled the changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-08drivers/block/loop.c: ratelimit error messagesMike Galbraith
Metric tons of high speed spew is not helpful when things go pear shaped. systemd lost its mind, forgot how to stop services it insists on being sole manager of, massive printk() flood ensued, box eventually died. [16206.684000] loop: Write error at byte offset 11412291584, length 4096. [16206.684000] systemd-journald[1758]: /dev/kmsg buffer overrun, some messages lost. [16206.684000] loop: Write error at byte offset 13155434496, length 4096. [16206.684000] loop: Write error at byte offset 13155438592, length 4096. [16206.684000] loop: Write error at byte offset 13155442688, length 4096. [16206.684000] loop: Write error at byte offset 13960736768, length 4096. [16206.684000] loop: Write error at byte offset 14229172224, length 4096. [16206.684000] systemd-journald[1758]: /dev/kmsg buffer overrun, some messages lost. [16206.684000] loop: Write error at byte offset 14766043136, length 4096. [16206.684000] loop: Write error at byte offset 15034478592, length 4096. [16206.684000] systemd-journald[1758]: /dev/kmsg buffer overrun, some messages lost. Signed-off-by: Mike Galbraith <bitbucket@online.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-01-21drivers/block/loop.c: fix comment typo in loop_config_discardOlaf Hering
Discard requests are ignored if the encryption is enabled for the given loop device. Update comment to match the code, and similar comments elsewhere in the file. Signed-off-by: Olaf Hering <olaf@aepfle.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-23block: Convert bio_for_each_segment() to bvec_iterKent Overstreet
More prep work for immutable biovecs - with immutable bvecs drivers won't be able to use the biovec directly, they'll need to use helpers that take into account bio->bi_iter.bi_bvec_done. This updates callers for the new usage without changing the implementation yet. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Ed L. Cashin" <ecashin@coraid.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Paul Clements <Paul.Clements@steeleye.com> Cc: Jim Paris <jim@jtan.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@inktank.com> Cc: ceph-devel@vger.kernel.org Cc: Joshua Morris <josh.h.morris@us.ibm.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Neil Brown <neilb@suse.de> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux390@de.ibm.com Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com> Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com> Cc: support@lsi.com Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Stephen Hemminger <shemminger@vyatta.com> Cc: Quoc-Son Anh <quoc-sonx.anh@intel.com> Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Jan Kara <jack@suse.cz> Cc: linux-m68k@lists.linux-m68k.org Cc: linuxppc-dev@lists.ozlabs.org Cc: drbd-user@lists.linbit.com Cc: nbd-general@lists.sourceforge.net Cc: cbe-oss-dev@lists.ozlabs.org Cc: xen-devel@lists.xensource.com Cc: virtualization@lists.linux-foundation.org Cc: linux-raid@vger.kernel.org Cc: linux-s390@vger.kernel.org Cc: DL-MPTFusionLinux@lsi.com Cc: linux-scsi@vger.kernel.org Cc: devel@driverdev.osuosl.org Cc: linux-fsdevel@vger.kernel.org Cc: cluster-devel@redhat.com Cc: linux-mm@kvack.org Acked-by: Geoff Levand <geoff@infradead.org>
2013-11-23block: Abstract out bvec iteratorKent Overstreet
Immutable biovecs are going to require an explicit iterator. To implement immutable bvecs, a later patch is going to add a bi_bvec_done member to this struct; for now, this patch effectively just renames things. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Ed L. Cashin" <ecashin@coraid.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@inktank.com> Cc: ceph-devel@vger.kernel.org Cc: Joshua Morris <josh.h.morris@us.ibm.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux390@de.ibm.com Cc: Boaz Harrosh <bharrosh@panasas.com> Cc: Benny Halevy <bhalevy@tonian.com> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <chris.mason@fusionio.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Ben Myers <bpm@sgi.com> Cc: xfs@oss.sgi.com Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Cc: "Roger Pau Monné" <roger.pau@citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Ian Campbell <Ian.Campbell@citrix.com> Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchand@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Peng Tao <tao.peng@emc.com> Cc: Andy Adamson <andros@netapp.com> Cc: fanchaoting <fanchaoting@cn.fujitsu.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Pankaj Kumar <pankaj.km@samsung.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Mel Gorman <mgorman@suse.de>6
2013-11-08loop: fix crash when using unassigned loop deviceMikulas Patocka
When the loop module is loaded, it creates 8 loop devices /dev/loop[0-7]. The devices have no request routine and thus, when they are used without being assigned, a crash happens. For example, these commands cause crash (assuming there are no used loop devices): Kernel Fault: Code=26 regs=000000007f420980 (Addr=0000000000000010) CPU: 1 PID: 50 Comm: kworker/1:1 Not tainted 3.11.0 #1 Workqueue: ksnaphd do_metadata [dm_snapshot] task: 000000007fcf4078 ti: 000000007f420000 task.ti: 000000007f420000 [ 116.319988] YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI PSW: 00001000000001001111111100001111 Not tainted r00-03 000000ff0804ff0f 00000000408bf5d0 00000000402d8204 000000007b7ff6c0 r04-07 00000000408a95d0 000000007f420950 000000007b7ff6c0 000000007d06c930 r08-11 000000007f4205c0 0000000000000001 000000007f4205c0 000000007f4204b8 r12-15 0000000000000010 0000000000000000 0000000000000000 0000000000000000 r16-19 000000001108dd48 000000004061cd7c 000000007d859800 000000000800000f r20-23 0000000000000000 0000000000000008 0000000000000000 0000000000000000 r24-27 00000000ffffffff 000000007b7ff6c0 000000007d859800 00000000408a95d0 r28-31 0000000000000000 000000007f420950 000000007f420980 000000007f4208e8 sr00-03 0000000000000000 0000000000000000 0000000000000000 0000000000303000 sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 117.549988] IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000402d82fc 00000000402d8300 IIR: 53820020 ISR: 0000000000000000 IOR: 0000000000000010 CPU: 1 CR30: 000000007f420000 CR31: ffffffffffffffff ORIG_R28: 0000000000000001 IAOQ[0]: generic_make_request+0x11c/0x1a0 IAOQ[1]: generic_make_request+0x120/0x1a0 RP(r2): generic_make_request+0x24/0x1a0 Backtrace: [<00000000402d83f0>] submit_bio+0x70/0x140 [<0000000011087c4c>] dispatch_io+0x234/0x478 [dm_mod] [<0000000011087f44>] sync_io+0xb4/0x190 [dm_mod] [<00000000110883bc>] dm_io+0x2c4/0x310 [dm_mod] [<00000000110bfcd0>] do_metadata+0x28/0xb0 [dm_snapshot] [<00000000401591d8>] process_one_work+0x160/0x460 [<0000000040159bc0>] worker_thread+0x300/0x478 [<0000000040161a70>] kthread+0x118/0x128 [<0000000040104020>] end_fault_vector+0x20/0x28 [<0000000040177220>] task_tick_fair+0x420/0x4d0 [<00000000401aa048>] invoke_rcu_core+0x50/0x60 [<00000000401ad5b8>] rcu_check_callbacks+0x210/0x8d8 [<000000004014aaa0>] update_process_times+0xa8/0xc0 [<00000000401ab86c>] rcu_process_callbacks+0x4b4/0x598 [<0000000040142408>] __do_softirq+0x250/0x2c0 [<00000000401789d0>] find_busiest_group+0x3c0/0xc70 [ 119.379988] Kernel panic - not syncing: Kernel Fault Rebooting in 1 seconds.. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08block: fix a probe argument to blk_register_regionMikulas Patocka
The probe function is supposed to return NULL on failure (as we can see in kobj_lookup: kobj = probe(dev, index, data); ... if (kobj) return kobj; However, in loop and brd, it returns negative error from ERR_PTR. This causes a crash if we simulate disk allocation failure and run less -f /dev/loop0 because the negative number is interpreted as a pointer: BUG: unable to handle kernel NULL pointer dereference at 00000000000002b4 IP: [<ffffffff8118b188>] __blkdev_get+0x28/0x450 PGD 23c677067 PUD 23d6d1067 PMD 0 Oops: 0000 [#1] PREEMPT SMP Modules linked in: loop hpfs nvidia(PO) ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev msr ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_stats cpufreq_ondemand cpufreq_userspace cpufreq_powersave cpufreq_conservative hid_generic spadfs usbhid hid fuse raid0 snd_usb_audio snd_pcm_oss snd_mixer_oss md_mod snd_pcm snd_timer snd_page_alloc snd_hwdep snd_usbmidi_lib dmi_sysfs snd_rawmidi nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack snd soundcore lm85 hwmon_vid ohci_hcd ehci_pci ehci_hcd serverworks sata_svw libata acpi_cpufreq freq_table mperf ide_core usbcore kvm_amd kvm tg3 i2c_piix4 libphy microcode e100 usb_common ptp skge i2c_core pcspkr k10temp evdev floppy hwmon pps_core mii rtc_cmos button processor unix [last unloaded: nvidia] CPU: 1 PID: 6831 Comm: less Tainted: P W O 3.10.15-devel #18 Hardware name: empty empty/S3992-E, BIOS 'V1.06 ' 06/09/2009 task: ffff880203cc6bc0 ti: ffff88023e47c000 task.ti: ffff88023e47c000 RIP: 0010:[<ffffffff8118b188>] [<ffffffff8118b188>] __blkdev_get+0x28/0x450 RSP: 0018:ffff88023e47dbd8 EFLAGS: 00010286 RAX: ffffffffffffff74 RBX: ffffffffffffff74 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001 RBP: ffff88023e47dc18 R08: 0000000000000002 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff88023f519658 R13: ffffffff8118c300 R14: 0000000000000000 R15: ffff88023f519640 FS: 00007f2070bf7700(0000) GS:ffff880247400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000002b4 CR3: 000000023da1d000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: 0000000000000002 0000001d00000000 000000003e47dc50 ffff88023f519640 ffff88043d5bb668 ffffffff8118c300 ffff88023d683550 ffff88023e47de60 ffff88023e47dc98 ffffffff8118c10d 0000001d81605698 0000000000000292 Call Trace: [<ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60 [<ffffffff8118c10d>] blkdev_get+0x1dd/0x370 [<ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60 [<ffffffff813cea6c>] ? _raw_spin_unlock+0x2c/0x50 [<ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60 [<ffffffff8118c365>] blkdev_open+0x65/0x80 [<ffffffff8114d12e>] do_dentry_open.isra.18+0x23e/0x2f0 [<ffffffff8114d214>] finish_open+0x34/0x50 [<ffffffff8115e122>] do_last.isra.62+0x2d2/0xc50 [<ffffffff8115eb58>] path_openat.isra.63+0xb8/0x4d0 [<ffffffff81115a8e>] ? might_fault+0x4e/0xa0 [<ffffffff8115f4f0>] do_filp_open+0x40/0x90 [<ffffffff813cea6c>] ? _raw_spin_unlock+0x2c/0x50 [<ffffffff8116db85>] ? __alloc_fd+0xa5/0x1f0 [<ffffffff8114e45f>] do_sys_open+0xef/0x1d0 [<ffffffff8114e559>] SyS_open+0x19/0x20 [<ffffffff813cff16>] system_call_fastpath+0x1a/0x1f Code: 44 00 00 55 48 89 e5 41 57 49 89 ff 41 56 41 89 d6 41 55 41 54 4c 8d 67 18 53 48 83 ec 18 89 75 cc e9 f2 00 00 00 0f 1f 44 00 00 <48> 8b 80 40 03 00 00 48 89 df 4c 8b 68 58 e8 d5 a4 07 00 44 89 RIP [<ffffffff8118b188>] __blkdev_get+0x28/0x450 RSP <ffff88023e47dbd8> CR2: 00000000000002b4 ---[ end trace bb7f32dbf02398dc ]--- The brd change should be backported to stable kernels starting with 2.6.25. The loop change should be backported to stable kernels starting with 2.6.22. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org # 2.6.22+ Signed-off-by: Jens Axboe <axboe@kernel.dk>