path: root/drivers/md
Age  Commit message  Author
2016-03-10dm thin metadata: don't issue prefetches if a transaction abort has failedJoe Thornber
If a transaction abort has failed then we can no longer use the metadata device. Typically this happens if the superblock is unreadable. This fix addresses a crash seen during metadata device failure testing. Fixes: 8a01a6af75 ("dm thin: prefetch missing metadata pages") Cc: stable@vger.kernel.org # 3.19+ Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10dm snapshot: disallow the COW and origin devices from being identicalDingXiang
Otherwise loading a "snapshot" table using the same device for the origin and COW devices, e.g.:

  echo "0 20971520 snapshot 253:3 253:3 P 8" | dmsetup create snap

will trigger:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
  [ 1958.979934] IP: [<ffffffffa040efba>] dm_exception_store_set_chunk_size+0x7a/0x110 [dm_snapshot]
  [ 1958.989655] PGD 0
  [ 1958.991903] Oops: 0000 [#1] SMP
  ...
  [ 1959.059647] CPU: 9 PID: 3556 Comm: dmsetup Tainted: G IO 4.5.0-rc5.snitm+ #150
  ...
  [ 1959.083517] task: ffff8800b9660c80 ti: ffff88032a954000 task.ti: ffff88032a954000
  [ 1959.091865] RIP: 0010:[<ffffffffa040efba>] [<ffffffffa040efba>] dm_exception_store_set_chunk_size+0x7a/0x110 [dm_snapshot]
  [ 1959.104295] RSP: 0018:ffff88032a957b30 EFLAGS: 00010246
  [ 1959.110219] RAX: 0000000000000000 RBX: 0000000000000008 RCX: 0000000000000001
  [ 1959.118180] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff880329334a00
  [ 1959.126141] RBP: ffff88032a957b50 R08: 0000000000000000 R09: 0000000000000001
  [ 1959.134102] R10: 000000000000000a R11: f000000000000000 R12: ffff880330884d80
  [ 1959.142061] R13: 0000000000000008 R14: ffffc90001c13088 R15: ffff880330884d80
  [ 1959.150021] FS: 00007f8926ba3840(0000) GS:ffff880333440000(0000) knlGS:0000000000000000
  [ 1959.159047] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 1959.165456] CR2: 0000000000000098 CR3: 000000032f48b000 CR4: 00000000000006e0
  [ 1959.173415] Stack:
  [ 1959.175656] ffffc90001c13040 ffff880329334a00 ffff880330884ed0 ffff88032a957bdc
  [ 1959.183946] ffff88032a957bb8 ffffffffa040f225 ffff880329334a30 ffff880300000000
  [ 1959.192233] ffffffffa04133e0 ffff880329334b30 0000000830884d58 00000000569c58cf
  [ 1959.200521] Call Trace:
  [ 1959.203248] [<ffffffffa040f225>] dm_exception_store_create+0x1d5/0x240 [dm_snapshot]
  [ 1959.211986] [<ffffffffa040d310>] snapshot_ctr+0x140/0x630 [dm_snapshot]
  [ 1959.219469] [<ffffffffa0005c44>] ? dm_split_args+0x64/0x150 [dm_mod]
  [ 1959.226656] [<ffffffffa0005ea7>] dm_table_add_target+0x177/0x440 [dm_mod]
  [ 1959.234328] [<ffffffffa0009203>] table_load+0x143/0x370 [dm_mod]
  [ 1959.241129] [<ffffffffa00090c0>] ? retrieve_status+0x1b0/0x1b0 [dm_mod]
  [ 1959.248607] [<ffffffffa0009e35>] ctl_ioctl+0x255/0x4d0 [dm_mod]
  [ 1959.255307] [<ffffffff813304e2>] ? memzero_explicit+0x12/0x20
  [ 1959.261816] [<ffffffffa000a0c3>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
  [ 1959.268615] [<ffffffff81215eb6>] do_vfs_ioctl+0xa6/0x5c0
  [ 1959.274637] [<ffffffff81120d2f>] ? __audit_syscall_entry+0xaf/0x100
  [ 1959.281726] [<ffffffff81003176>] ? do_audit_syscall_entry+0x66/0x70
  [ 1959.288814] [<ffffffff81216449>] SyS_ioctl+0x79/0x90
  [ 1959.294450] [<ffffffff8167e4ae>] entry_SYSCALL_64_fastpath+0x12/0x71
  ...
  [ 1959.323277] RIP [<ffffffffa040efba>] dm_exception_store_set_chunk_size+0x7a/0x110 [dm_snapshot]
  [ 1959.333090] RSP <ffff88032a957b30>
  [ 1959.336978] CR2: 0000000000000098
  [ 1959.344121] ---[ end trace b049991ccad1169e ]---

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1195899
Cc: stable@vger.kernel.org
Signed-off-by: Ding Xiang <dingxiang@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
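The kind of check this commit describes can be sketched in user-space C. This is only an illustration under stated assumptions: `struct dev_id`, `same_device()` and `check_snapshot_devices()` are hypothetical names, not the identifiers used in dm-snap.c, and the real constructor works on kernel device handles rather than plain major:minor pairs.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a block device number (major:minor). */
struct dev_id {
    unsigned int major;
    unsigned int minor;
};

bool same_device(struct dev_id a, struct dev_id b)
{
    return a.major == b.major && a.minor == b.minor;
}

/*
 * Returns 0 if the pair is acceptable, -EINVAL if the origin and COW
 * devices are the same device (the case the commit disallows).
 */
int check_snapshot_devices(struct dev_id origin, struct dev_id cow)
{
    if (same_device(origin, cow))
        return -EINVAL;
    return 0;
}
```

With the table from the commit message (origin 253:3, COW 253:3) this sketch rejects the load up front instead of letting the constructor continue.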
2016-03-10dm cache: make the 'mq' policy an alias for 'smq'Joe Thornber
smq seems to be performing better than the old mq policy in all situations, as well as using a quarter of the memory. Make 'mq' an alias for 'smq' when choosing a cache policy. The tunables that were present for the old mq are faked, and have no effect. mq should be considered deprecated now. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10dm: drop unnecessary assignment of md->queueBob Liu
md->queue and q are the same thing in dm_old_init_request_queue() and dm_mq_init_request_queue(). Also drop the temporary 'struct request_queue *q' in dm_old_init_request_queue(). Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10dm: reorder 'struct mapped_device' members to fix alignment and holesMike Snitzer
Saves 16 bytes by eliminating four 4-byte holes, but more importantly: numerous members that crossed cachelines were fixed. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
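How member order creates or removes holes can be shown with a toy structure. The two structs below are hypothetical and unrelated to the real layout of 'struct mapped_device'; the exact sizes depend on the ABI (on a typical x86-64 ABI the first is 32 bytes and the second 24).

```c
#include <stddef.h>
#include <stdint.h>

/* Interleaving 4-byte and 8-byte members leaves padding after a and c
 * on ABIs that align uint64_t to 8 bytes. */
struct holey {
    uint32_t a;
    uint64_t b;
    uint32_t c;
    uint64_t d;
};

/* Same members, 8-byte fields first, so the 4-byte fields pack together. */
struct reordered {
    uint64_t b;
    uint64_t d;
    uint32_t a;
    uint32_t c;
};

/* Bytes recovered purely by changing member order (may be 0 on some ABIs). */
size_t bytes_saved_by_reordering(void)
{
    return sizeof(struct holey) - sizeof(struct reordered);
}
```

In a kernel tree the usual way to inspect a real structure for holes is a tool such as pahole; the sketch only shows the principle.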
2016-03-10dm: remove dummy definition of 'struct dm_table'Mike Snitzer
Change the map pointer in 'struct mapped_device' from 'struct dm_table __rcu *' to 'void __rcu *' to avoid the need for the dummy definition. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10dm: add 'dm_numa_node' module parameterMike Snitzer
Allows user to control which NUMA node the memory for DM device structures (e.g. mapped_device, request_queue, gendisk, blk_mq_tag_set) is allocated from. Defaults to NUMA_NO_NODE (-1). Allowable range is from -1 until the last online NUMA node id. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
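The allowable range stated above can be sketched as a clamp. This is an assumption about policy: the commit message only gives the range, and `clamp_numa_node()` and its `last_online_node` argument are hypothetical names introduced here.

```c
#define NUMA_NO_NODE (-1)

/*
 * Clamp a requested node id into [NUMA_NO_NODE, last_online_node].
 * Clamping (rather than rejecting) out-of-range values is an assumption
 * made for this sketch.
 */
int clamp_numa_node(int requested, int last_online_node)
{
    if (requested < NUMA_NO_NODE)
        return NUMA_NO_NODE;
    if (requested > last_online_node)
        return last_online_node;
    return requested;
}
```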
2016-03-10dm thin metadata: remove needless newline from subtree_dec() DMERR messageMike Snitzer
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10dm mpath: cleanup reinstate_path() et al based on code reviewMike Snitzer
fail_path() will print a "Failing path ..." message but reinstate_path() doesn't print a "Reinstating path ..." message. Add that message to reinstate_path() to add symmetry and aid system debugging. Remove reinstate_path()'s check for the path_selector providing the .reinstate_path hook. All path selectors provide this and any future ones must too. activate_path() calls pg_init_done() with SCSI_DH_DEV_OFFLINED but pg_init_done() doesn't explicitly handle it in its switch statement. Add SCSI_DH_DEV_OFFLINED to the default case. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-09md/raid5: output stripe state for debugShaohua Li
Neil recently fixed an obscure race in break_stripe_batch_list. Debugging would be much more convenient if we knew the stripe state, so this patch outputs it. Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-09md/raid5: preserve STRIPE_PREREAD_ACTIVE in break_stripe_batch_listNeilBrown
break_stripe_batch_list breaks up a batch and copies some flags from the batch head to the members, preserving others. It doesn't preserve or copy STRIPE_PREREAD_ACTIVE. This is not normally a problem as STRIPE_PREREAD_ACTIVE is cleared when a stripe_head is added to a batch, and is not set on stripe_heads already in a batch. However there is no locking to ensure one thread doesn't set the flag after it has just been cleared in another. This does occasionally happen. md/raid5 maintains a count of the number of stripe_heads with STRIPE_PREREAD_ACTIVE set: conf->preread_active_stripes. When break_stripe_batch_list clears STRIPE_PREREAD_ACTIVE inadvertently, this count can become incorrect and will never again return to zero. md/raid5 delays the handling of some stripe_heads until preread_active_stripes becomes zero. So when the above-mentioned race happens, those stripe_heads become blocked and never progress, resulting in writes to the array hanging. So: change break_stripe_batch_list to preserve STRIPE_PREREAD_ACTIVE in the members of a batch. URL: https://bugzilla.kernel.org/show_bug.cgi?id=108741 URL: https://bugzilla.redhat.com/show_bug.cgi?id=1258153 URL: http://thread.gmane.org/5649C0E9.2030204@zoner.cz Reported-by: Martin Svec <martin.svec@zoner.cz> (and others) Tested-by: Tom Weber <linux@junkyard.4t2.com> Fixes: 1b956f7a8f9a ("md/raid5: be more selective about distributing flags across batch.") Cc: stable@vger.kernel.org (v4.1 and later) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
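The "preserve some flags, copy others" step can be modelled with bit masks. This is a simplified sketch: the bit positions, `FLAG_COPIED_FROM_HEAD` and `break_batch_state()` are hypothetical, and only the name STRIPE_PREREAD_ACTIVE comes from the commit message.

```c
/* Bit positions are illustrative, not the kernel's values. */
#define STRIPE_PREREAD_ACTIVE  (1UL << 0)
#define FLAG_COPIED_FROM_HEAD  (1UL << 1)  /* hypothetical example flag */

/*
 * New member state after a batch is broken up: bits in preserve_mask
 * are kept from the member, bits in copy_mask are taken from the head,
 * everything else is cleared.
 */
unsigned long break_batch_state(unsigned long member_state,
                                unsigned long head_state,
                                unsigned long preserve_mask,
                                unsigned long copy_mask)
{
    return (member_state & preserve_mask) | (head_state & copy_mask);
}
```

If STRIPE_PREREAD_ACTIVE is missing from `preserve_mask`, a member that raced and had the bit set loses it here, while a counter of stripes with that bit set is left one too high; adding the bit to the mask keeps flag and counter consistent.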
2016-03-08bcache: fix cache_set_flush() NULL pointer dereference on OOMEric Wheeler
When bch_cache_set_alloc() fails to kzalloc the cache_set, the asynchronous closure handling tries to dereference a cache_set that hadn't yet been allocated inside of cache_set_flush() which is called by __cache_set_unregister() during cleanup. This appears to happen only during an OOM condition on bcache_register. Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: stable@vger.kernel.org
2016-03-08bcache: cleaned up error handling around register_cache()Eric Wheeler
Fix null pointer dereference by changing register_cache() to return an int instead of being void. This allows it to return -ENOMEM or -ENODEV and enables upper layers to handle the OOM case without NULL pointer issues. See this thread: http://thread.gmane.org/gmane.linux.kernel.bcache.devel/3521 Fixes this error: gargamel:/sys/block/md5/bcache# echo /dev/sdh2 > /sys/fs/bcache/register bcache: register_cache() error opening sdh2: cannot allocate memory BUG: unable to handle kernel NULL pointer dereference at 00000000000009b8 IP: [<ffffffffc05a7e8d>] cache_set_flush+0x102/0x15c [bcache] PGD 120dff067 PUD 1119a3067 PMD 0 Oops: 0000 [#1] SMP Modules linked in: veth ip6table_filter ip6_tables (...) CPU: 4 PID: 3371 Comm: kworker/4:3 Not tainted 4.4.2-amd64-i915-volpreempt-20160213bc1 #3 Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013 Workqueue: events cache_set_flush [bcache] task: ffff88020d5dc280 ti: ffff88020b6f8000 task.ti: ffff88020b6f8000 RIP: 0010:[<ffffffffc05a7e8d>] [<ffffffffc05a7e8d>] cache_set_flush+0x102/0x15c [bcache] Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net> Tested-by: Marc MERLIN <marc@merlins.org> Cc: <stable@vger.kernel.org>
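The void-to-int change follows a common pattern, sketched here in user space. All names (`struct cache_set_model`, `register_cache_model()`, `register_device_model()`, the `simulate_oom` switch) are hypothetical and only mimic the shape of the fix, not bcache's actual code.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the object that failed to allocate. */
struct cache_set_model {
    int nr_caches;
};

/* Returns 0 on success or -ENOMEM; *out is only written on success. */
int register_cache_model(struct cache_set_model **out, int simulate_oom)
{
    struct cache_set_model *c = simulate_oom ? NULL : calloc(1, sizeof(*c));

    if (!c)
        return -ENOMEM;
    *out = c;
    return 0;
}

/* Upper layer: checks the return code and never touches 'set' on failure. */
int register_device_model(int simulate_oom)
{
    struct cache_set_model *set = NULL;
    int err = register_cache_model(&set, simulate_oom);

    if (err)
        return err;
    set->nr_caches = 1;
    free(set);
    return 0;
}
```

With a void return the caller has no way to tell that allocation failed and may go on to use a NULL pointer, which is the class of crash shown in the oops above.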
2016-03-08bcache: fix race of writeback thread starting before complete initializationEric Wheeler
The bch_writeback_thread might BUG_ON in read_dirty() if dc->sb==BDEV_STATE_DIRTY and bch_sectors_dirty_init has not yet completed its related initialization. This patch downs the dc->writeback_lock until after initialization is complete, thus preventing bch_writeback_thread from proceeding prematurely. See this thread: http://thread.gmane.org/gmane.linux.kernel.bcache.devel/3453 Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net> Tested-by: Marc MERLIN <marc@merlins.org> Cc: <stable@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-07md/bitmap: remove redundant checkEric Engestrom
daemon_sleep is an unsigned, so testing if it's 0 or less than 1 does the same thing. Signed-off-by: Eric Engestrom <eric.engestrom@imgtec.com> Signed-off-by: Shaohua Li <shli@fb.com>
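The redundancy is easy to see in isolation. The two functions below are hypothetical and do not reproduce the exact condition in md/bitmap; they only show that, for an unsigned value, "is 0" and "is less than 1" select the same inputs.

```c
#include <stdbool.h>

/* Both tests together, as in a redundant check. */
bool redundant_check(unsigned long daemon_sleep)
{
    return daemon_sleep == 0 || daemon_sleep < 1;
}

/* One test is enough: an unsigned value below 1 can only be 0. */
bool simplified_check(unsigned long daemon_sleep)
{
    return daemon_sleep < 1;
}
```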
2016-02-26MD: warn for potential deadlockShaohua Li
The personality thread shouldn't call mddev_suspend(). mddev_suspend() waits for all IO to finish, but IO is handled in the personality thread, so this could cause a deadlock. To trigger this early, add a warning if mddev_suspend() is called from the personality thread. Suggested-by: NeilBrown <neilb@suse.com> Cc: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-02-26md: Drop sending a change uevent when stoppingSebastian Parschauer
After an MD device is stopped, its device node /dev/mdX may still exist or may be recreated by udev. The next open() call can lead to creation of an inoperable MD device. The reason for this is that a change event (KOBJ_CHANGE) is sent to udev which races against the remove event (KOBJ_REMOVE) from md_free(). So drop sending the change event. A change is likely also required in mdadm as many versions send the change event to udev as well. Neil mentioned that the change event is a workaround for old kernels (commit 934d9c23b4c7 ("md: destroy partitions and notify udev when md array is stopped.")); new mdadm can handle device removal now, so this isn't required any more. Cc: NeilBrown <neilb@suse.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: Sebastian Parschauer <sebastian.riemer@profitbricks.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-02-26RAID5: revert e9e4c377e2f563 to fix a livelockShaohua Li
Revert commit e9e4c377e2f563 ("md/raid5: per hash value and exclusive wait_for_stripe"). The problem is that raid5_get_active_stripe waits on conf->wait_for_stripe[hash]. Assume hash is 0. My test releases stripes in this order: - release all stripes with hash 0 - raid5_get_active_stripe still sleeps since active_stripes > max_nr_stripes * 3 / 4 - release all stripes with hash other than 0. active_stripes becomes 0 - raid5_get_active_stripe still sleeps, since nobody wakes up wait_for_stripe[0] The system livelocks. The problem is that active_stripes isn't a per-hash count. Reverting the patch makes the livelock go away. Cc: stable@vger.kernel.org (v4.2+) Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com> Cc: NeilBrown <neilb@suse.de> Signed-off-by: Shaohua Li <shli@fb.com>
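The release order above can be replayed in a small single-threaded model. Everything here is a hypothetical simplification (one waiter, four buckets, made-up stripe counts); it only illustrates why a global count combined with per-hash wakeups can strand a waiter.

```c
#include <stdbool.h>

#define NR_HASH        4
#define MAX_NR_STRIPES 8

struct stripe_model {
    int active[NR_HASH];  /* active stripes per hash bucket */
    bool waiter_asleep;   /* one waiter parked on wait queue 0 */
};

int total_active(const struct stripe_model *m)
{
    int sum = 0;

    for (int i = 0; i < NR_HASH; i++)
        sum += m->active[i];
    return sum;
}

/* Release every stripe in one bucket; only that bucket's queue is woken. */
void release_bucket(struct stripe_model *m, int hash)
{
    m->active[hash] = 0;
    if (hash != 0 || !m->waiter_asleep)
        return;  /* the waiter sits on queue 0 and is not notified */
    /* The woken waiter rechecks the global (not per-hash) count. */
    if (total_active(m) > MAX_NR_STRIPES * 3 / 4)
        return;  /* still over the threshold: back to sleep */
    m->waiter_asleep = false;
}

/* Replays the order from the commit message; true means the waiter is stuck. */
bool waiter_stuck_after_replay(void)
{
    struct stripe_model m = { .active = {1, 2, 2, 3}, .waiter_asleep = true };

    release_bucket(&m, 0);           /* 7 active > 6: waiter sleeps again */
    for (int h = 1; h < NR_HASH; h++)
        release_bucket(&m, h);       /* count drops to 0, queue 0 untouched */
    return m.waiter_asleep && total_active(&m) == 0;
}
```

In this model the waiter ends up asleep with zero active stripes, which matches the final step of the sequence described in the commit message.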
2016-02-26RAID5: check_reshape() shouldn't call mddev_suspendShaohua Li
check_reshape() is called from the raid5d thread. The raid5d thread shouldn't call mddev_suspend(), because mddev_suspend() waits for all IO to finish but IO is handled in the raid5d thread, so we could easily deadlock here. This issue was introduced by 738a273 ("md/raid5: fix allocation of 'scribble' array.") Cc: stable@vger.kernel.org (v4.1+) Reported-and-tested-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-02-25md/raid5: Compare apples to apples (or sectors to sectors)Jes Sorensen
'max_discard_sectors' is in sectors, while 'stripe' is in bytes. This fixes the problem where DISCARD would get disabled on some larger RAID5 configurations (6 or more drives in my testing), while it worked as expected with smaller configurations. Fixes: 620125f2bf8 ("MD: raid5 trim support") Cc: stable@vger.kernel.org v3.7+ Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
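The unit mismatch can be reproduced with two small comparisons. The function names and the numbers used in the note below are illustrative assumptions, not the raid5 code; the only facts taken from the commit are that one value is in sectors and the other in bytes.

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SHIFT 9  /* 512-byte sectors */

/* Compares sectors against bytes: wrong once the stripe is large enough. */
bool discard_ok_mixed_units(uint32_t max_discard_sectors, uint32_t stripe_bytes)
{
    return max_discard_sectors >= stripe_bytes;
}

/* Converts the stripe size to sectors first, then compares like with like. */
bool discard_ok_same_units(uint32_t max_discard_sectors, uint32_t stripe_bytes)
{
    return max_discard_sectors >= (stripe_bytes >> SECTOR_SHIFT);
}
```

For example, with a limit of 8192 sectors and a 2621440-byte stripe (5120 sectors), the mixed-unit test rejects DISCARD while the same-unit test accepts it; with a 4096-byte stripe both agree, which fits the observation that only larger configurations were affected.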
2016-02-22dm mpath: remove __pgpath_busy forward declaration, rename to pgpath_busyMike Snitzer
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm mpath: switch from 'unsigned' to 'bool' for flags where appropriateMike Snitzer
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm round robin: use percpu 'repeat_count' and 'current_path'Mike Snitzer
Now that dm-mpath core is lockless in the per-IO fast path it is critical, for performance, to have the .select_path hook (rr_select_path) also be as lockless as possible. The new percpu members of 'struct selector' allow for lockless support of 'repeat_count' governed repeat use of a previously selected path. If a path fails while it is 'current_path' the worst case is concurrent IO might be mapped to the failed path until the .fail_path hook (rr_fail_path) is called. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
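What 'repeat_count' governed reuse of 'current_path' means can be sketched with a plain, single-threaded round-robin selector. This deliberately leaves out the percpu and lockless aspects that are the point of the commit; `struct rr_selector`, `rr_select()` and `rr_nth_pick()` are hypothetical names, and both `nr_paths` and `repeat_count` are assumed to be at least 1.

```c
struct rr_selector {
    unsigned int nr_paths;      /* number of usable paths, >= 1 */
    unsigned int repeat_count;  /* consecutive uses per path, >= 1 */
    unsigned int current_path;  /* index of the path last selected */
    unsigned int uses_left;     /* remaining uses of current_path */
};

/* Reuse current_path until its budget is spent, then advance. */
unsigned int rr_select(struct rr_selector *s)
{
    if (s->uses_left == 0) {
        s->current_path = (s->current_path + 1) % s->nr_paths;
        s->uses_left = s->repeat_count;
    }
    s->uses_left--;
    return s->current_path;
}

/* Convenience helper: the path chosen on the n-th call (0-based) of a
 * fresh selector that starts at path 0. */
unsigned int rr_nth_pick(unsigned int nr_paths, unsigned int repeat_count,
                         unsigned int n)
{
    struct rr_selector s = { nr_paths, repeat_count, nr_paths - 1, 0 };
    unsigned int pick = 0;

    for (unsigned int i = 0; i <= n; i++)
        pick = rr_select(&s);
    return pick;
}
```

With three paths and a repeat_count of 2, successive picks are 0, 0, 1, 1, 2, 2, 0 and so on.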
2016-02-22dm path selector: remove 'repeat_count' return from .select_path hookMike Snitzer
If a path selector has any use for a repeat_count it should be handled locally and not depend on the dm-mpath core to be concerned with it. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm mpath: push path selector locking down to path selectorsMike Snitzer
Proper locking of the lists used by the path selectors should be handled within the selectors (relying on dm-mpath.c code's use of the m->lock spinlock was reckless). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm mpath: remove repeat_count support from multipath coreMike Snitzer
Preparation for making __multipath_map() avoid taking the m->lock spinlock -- in favor of using RCU locking. repeat_count was primarily for bio-based DM multipath's benefit. There is really no need for it anymore now that DM multipath is request-based. As such, repeat_count > 1 is no longer honored and a warning is displayed if the user attempts to use a value > 1. This is a temporary change for the round-robin path-selector (as a later commit will restore its support for repeat_count > 1). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm mpath: remove unnecessary casts in front of ti->privateMike Snitzer
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm mpath: use blk_mq_alloc_request() and blk_mq_free_request() directlyMike Snitzer
There isn't any need to support both old .request_fn and blk-mq paths in the blk-mq specific portion of __multipath_map(). Call blk_mq_alloc_request() directly rather than use blk_get_request(). Similarly, call blk_mq_free_request(), rather than blk_put_request(), in multipath_release_clone(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm mpath: cleanup 'struct dm_mpath_io' management codeMike Snitzer
Refactor and rename existing interfaces to be more specific and self-documenting. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm mpath: use blk-mq pdu for per-request 'struct dm_mpath_io'Mike Snitzer
Allow the multipath target to avoid making small allocations for each 'struct dm_mpath_io' that is needed for each request. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm: allow immutable request-based targets to use blk-mq pduMike Snitzer
This will allow DM multipath to use a portion of the blk-mq pdu space for target data (e.g. struct dm_mpath_io). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm: rename target's per_bio_data_size to per_io_data_sizeMike Snitzer
Request-based DM will also make use of per_bio_data_size. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm: distinquish old .request_fn (dm-old) vs dm-mq request-based DMMike Snitzer
Rename various methods to have either a "dm_old" or "dm_mq" prefix. Improve code comments to assist with understanding the duality of code that handles both "dm_old" and "dm_mq" cases. It is now much easier to quickly look at the code and _know_ that a given method is either 1) "dm_old" only 2) "dm_mq" only 3) common to both. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm: remove support for stacking dm-mq on .request_fn device(s)Mike Snitzer
Remove all fiddly code that propped up this support for a blk-mq request-queue on top of all .request_fn devices. Testing has proven this niche request-based dm-mq mode to be buggy, when testing fault tolerance with DM multipath, and there is no point trying to preserve it. Should help improve efficiency of pure dm-mq code and make code maintenance less delicate. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm: fix a couple locking issues with use of block interfacesMike Snitzer
old_stop_queue() was checking blk_queue_stopped() without holding the q->queue_lock. dm_requeue_original_request() needed to check blk_queue_stopped(), with q->queue_lock held, before calling blk_mq_kick_requeue_list(). And a side-effect of that change is start_queue() must also call blk_mq_kick_requeue_list(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm: allocate blk_mq_tag_set rather than embed in mapped_deviceMike Snitzer
The blk_mq_tag_set is only needed for dm-mq support. There is no point wasting space in 'struct mapped_device' for non-dm-mq devices. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> # check kzalloc return
2016-02-22dm: add 'dm_mq_nr_hw_queues' and 'dm_mq_queue_depth' module paramsMike Snitzer
Allow user to change these values via module params or sysfs. 'dm_mq_nr_hw_queues' defaults to 1 (max 32). 'dm_mq_queue_depth' defaults to 2048 (up from 64, which proved far too small under moderate sized workloads -- the dm-multipath device would continuously block waiting for tags (requests) to become available). The maximum is BLK_MQ_MAX_DEPTH (currently 10240). Keep in mind the total number of pre-allocated requests per request-based dm-mq device is 'dm_mq_nr_hw_queues' * 'dm_mq_queue_depth' (currently 2048). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
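The arithmetic for the number of pre-allocated requests can be sketched as follows. The defaults and maxima are the ones quoted above; treating 0 as "use the default" and capping at the maximum is an assumption of this sketch, and the function names are hypothetical.

```c
#define NR_HW_QUEUES_DEFAULT 1u
#define NR_HW_QUEUES_MAX     32u
#define QUEUE_DEPTH_DEFAULT  2048u
#define QUEUE_DEPTH_MAX      10240u  /* BLK_MQ_MAX_DEPTH per the commit text */

/* Assumed policy: 0 selects the default, values above max are capped. */
unsigned int param_or_default(unsigned int value, unsigned int def,
                              unsigned int max)
{
    if (value == 0)
        return def;
    if (value > max)
        return max;
    return value;
}

/* Pre-allocated requests per device: nr_hw_queues * queue_depth. */
unsigned int preallocated_requests(unsigned int nr_hw_queues,
                                   unsigned int queue_depth)
{
    return param_or_default(nr_hw_queues, NR_HW_QUEUES_DEFAULT, NR_HW_QUEUES_MAX) *
           param_or_default(queue_depth, QUEUE_DEPTH_DEFAULT, QUEUE_DEPTH_MAX);
}
```

With the defaults this gives 1 * 2048 = 2048 requests; raising 'dm_mq_nr_hw_queues' to 4 at the same depth gives 8192.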
2016-02-22dm: optimize dm_request_fn()Mike Snitzer
DM multipath is the only request-based DM target -- which only supports tables with a single target that is immutable. Leverage this fact in dm_request_fn(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm: optimize dm_mq_queue_rq()Mike Snitzer
DM multipath is the only dm-mq target. But that aside, request-based DM only supports tables with a single target that is immutable. Leverage this fact in dm_mq_queue_rq() by using the 'immutable_target' stored in the mapped_device when the table was made active. This saves the need to even take the read-side of the SRCU via dm_{get,put}_live_table. If the active DM table does not have an immutable target (e.g. "error" target was swapped in) then fallback to the slow-path where the target is looked up from the live table. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm: set DM_TARGET_WILDCARD feature on "error" targetMike Snitzer
The DM_TARGET_WILDCARD feature indicates that the "error" target may replace any target; even immutable targets. This feature will be useful to preserve the ability to replace the "multipath" target even once it is formally converted over to having the DM_TARGET_IMMUTABLE feature. Also, implicit in the DM_TARGET_WILDCARD feature flag being set is that .map, .map_rq, .clone_and_map_rq and .release_clone_rq are all defined in the target_type. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm: cleanup dm_any_congested()Mike Snitzer
The request-based DM support for checking queue congestion doesn't require access to the live DM table. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm: remove unused dm_get_rq_mapinfo()Mike Snitzer
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm: fix excessive dm-mq context switchingMike Snitzer
Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower than if an underlying null_blk device were used directly. One of the reasons for this drop in performance is that blk_insert_clone_request() was calling blk_mq_insert_request() with @async=true. This forced the use of kblockd_schedule_delayed_work_on() to run the blk-mq hw queues which ushered in ping-ponging between process context (fio in this case) and kblockd's kworker to submit the cloned request. The ftrace function_graph tracer showed: kworker-2013 => fio-12190 fio-12190 => kworker-2013 ... kworker-2013 => fio-12190 fio-12190 => kworker-2013 ... Fixing blk_insert_clone_request()'s blk_mq_insert_request() call to _not_ use kblockd to submit the cloned requests isn't enough to eliminate the observed context switches. In addition to this dm-mq specific blk-core fix, there are 2 DM core fixes to dm-mq that (when paired with the blk-core fix) completely eliminate the observed context switching: 1) don't blk_mq_run_hw_queues in blk-mq request completion Motivated by desire to reduce overhead of dm-mq, punting to kblockd just increases context switches. In my testing against a really fast null_blk device there was no benefit to running blk_mq_run_hw_queues() on completion (and no other blk-mq driver does this). So hopefully this change doesn't induce the need for yet another revert like commit 621739b00e16ca2d ! 2) use blk_mq_complete_request() in dm_complete_request() blk_complete_request() doesn't offer the traditional q->mq_ops vs .request_fn branching pattern that other historic block interfaces do (e.g. blk_get_request). Using blk_mq_complete_request() for blk-mq requests is important for performance. 
It should be noted that, like blk_complete_request(), blk_mq_complete_request() doesn't natively handle partial completions -- but the request-based DM-multipath target does provide the required partial completion support by dm.c:end_clone_bio() triggering requeueing of the request via dm-mpath.c:multipath_end_io()'s return of DM_ENDIO_REQUEUE. dm-mq fix #2 is _much_ more important than #1 for eliminating the context switches. Before: cpu : usr=15.10%, sys=59.39%, ctx=7905181, majf=0, minf=475 After: cpu : usr=20.60%, sys=79.35%, ctx=2008, majf=0, minf=472 With these changes multithreaded async read IOPs improved from ~950K to ~1350K for this dm-mq stacked on null_blk test-case. The raw read IOPs of the underlying null_blk device for the same workload is ~1950K. Fixes: 7fb4898e0 ("block: add blk-mq support to blk_insert_cloned_request()") Fixes: bfebd1cdb ("dm: add full blk-mq support to request-based DM") Cc: stable@vger.kernel.org # 4.1+ Reported-by: Sagi Grimberg <sagig@dev.mellanox.co.il> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Jens Axboe <axboe@kernel.dk>
2016-02-21dm: fix sparse "unexpected unlock" warnings in ioctl codeMike Snitzer
Rename dm_get_live_table_for_ioctl to dm_grab_bdev_for_ioctl and have it do the dm_{get,put}_live_table() rather than split those operations. The dm_grab_bdev_for_ioctl() callers only care about the block_device associated with a singleton DM device so there isn't any need to retain a reference to the live DM table. It is sufficient to: 1) dm_get_live_table() 2) bdgrab() the bdev associated with the singleton table's target 3) dm_put_live_table() 4) bdput() the bdev Signed-off-by: Mike Snitzer <snitzer@redhat.com>
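The four-step sequence can be modelled with two counters. This is a toy model: `struct ref_model` and both functions are hypothetical, and the real code takes SRCU read-side and block_device references rather than incrementing integers.

```c
struct ref_model {
    int table_refs;  /* references held on the live table */
    int bdev_refs;   /* references held on the block device */
};

/* Steps 1-3: hold the table only long enough to pin the bdev. */
void grab_bdev_model(struct ref_model *m)
{
    m->table_refs++;  /* 1) dm_get_live_table() */
    m->bdev_refs++;   /* 2) bdgrab() */
    m->table_refs--;  /* 3) dm_put_live_table() */
}

/* Step 4: the caller drops the bdev reference once the ioctl is done. */
void put_bdev_model(struct ref_model *m)
{
    m->bdev_refs--;   /* 4) bdput() */
}
```

Because the get and put of the table happen inside one function, the lock and unlock are balanced within a single scope, which is the kind of pairing sparse can verify.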
2016-02-21dm: do not return target from dm_get_live_table_for_ioctl()Mike Snitzer
None of the callers actually used the returned target. Also, just reuse bdev pointer passed to dm_blk_ioctl(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-21dm: fix dm_rq_target_io leak on faults with .request_fn DM w/ blk-mq pathsMike Snitzer
Using request-based DM mpath configured with the following stacking (.request_fn DM mpath ontop of scsi-mq paths): echo Y > /sys/module/scsi_mod/parameters/use_blk_mq echo N > /sys/module/dm_mod/parameters/use_blk_mq 'struct dm_rq_target_io' would leak if a request is requeued before a blk-mq clone is allocated (or fails to allocate). free_rq_tio() wasn't being called. kmemleak reported: unreferenced object 0xffff8800b90b98c0 (size 112): comm "kworker/7:1H", pid 5692, jiffies 4295056109 (age 78.589s) hex dump (first 32 bytes): 00 d0 5c 2c 03 88 ff ff 40 00 bf 01 00 c9 ff ff ..\,....@....... e0 d9 b1 34 00 88 ff ff 00 00 00 00 00 00 00 00 ...4............ backtrace: [<ffffffff81672b6e>] kmemleak_alloc+0x4e/0xb0 [<ffffffff811dbb63>] kmem_cache_alloc+0xc3/0x1e0 [<ffffffff8117eae5>] mempool_alloc_slab+0x15/0x20 [<ffffffff8117ec1e>] mempool_alloc+0x6e/0x170 [<ffffffffa00029ac>] dm_old_prep_fn+0x3c/0x180 [dm_mod] [<ffffffff812fbd78>] blk_peek_request+0x168/0x290 [<ffffffffa0003e62>] dm_request_fn+0xb2/0x1b0 [dm_mod] [<ffffffff812f66e3>] __blk_run_queue+0x33/0x40 [<ffffffff812f9585>] blk_delay_work+0x25/0x40 [<ffffffff81096fff>] process_one_work+0x14f/0x3d0 [<ffffffff81097715>] worker_thread+0x125/0x4b0 [<ffffffff8109ce88>] kthread+0xd8/0xf0 [<ffffffff8167cb8f>] ret_from_fork+0x3f/0x70 [<ffffffffffffffff>] 0xffffffffffffffff crash> struct -o dm_rq_target_io struct dm_rq_target_io { ... } SIZE: 112 Fixes: e5863d9ad7 ("dm: allocate requests in target when stacking on blk-mq devices") Cc: stable@vger.kernel.org # 4.0+ Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-03Merge branch 'mymd/for-next' into mymd/for-linusShaohua Li
2016-01-27dm crypt: Use skcipher and ahashHerbert Xu
This patch replaces uses of ablkcipher with skcipher, and the long obsolete hash interface with ahash. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-01-24md-cluster: delete useless codeShaohua Li
page->index already considers node offset. The node_offset calculation in write_sb_page is useless and confusing. Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: NeilBrown <neilb@suse.com> Acked-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-01-24md-cluster: fix missing memory freeShaohua Li
There are several places where we allocate a dlm_lock_resource but do not free it. leave() needs to free a lock resource too (from Guoqing) Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Guoqing Jiang <gqjiang@suse.com> Cc: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>