summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2017-08-28dm mpath: do not lock up a CPU with requeuing activityBart Van Assche
When using the block layer in single queue mode, get_request() returns ERR_PTR(-EAGAIN) if the queue is dying and the REQ_NOWAIT flag has been passed to get_request(). Avoid that the kernel reports soft lockup complaints in this case due to continuous requeuing activity. Fixes: 7083abbbf ("dm mpath: avoid that path removal can trigger an infinite loop") Cc: stable@vger.kernel.org Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Tested-by: Laurence Oberman <loberman@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-08-28dm: fix printk() rate limiting codeBart Van Assche
Using the same rate limiting state for different kinds of messages is wrong because this can cause a high frequency message to suppress a report of a low frequency message. Hence use a unique rate limiting state per message type. Fixes: 71a16736a15e ("dm: use local printk ratelimit") Cc: stable@vger.kernel.org Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-08-28dm mpath: retry BLK_STS_RESOURCE errorsBart Van Assche
Retry requests instead of failing them if an out-of-memory error occurs or the block driver below dm-mpath is busy. This restores the v4.12 behavior of noretry_error(), namely that -ENOMEM results in a retry. Fixes: 2a842acab109 ("block: introduce new block status code type") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-08-28dm: fix the second dec_pending() argument in __split_and_process_bio()Bart Van Assche
Detected by sparse. Fixes: 4e4cbee93d56 ("block: switch bios to blk_status_t") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Laurence Oberman <loberman@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-08-25md/raid0: attach correct cgroup info in bioShaohua Li
The discard bio doesn't attach the original bio cgroup info. Normal bio is cloned, so is fine. Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-25raid5: remove raid5_build_blockGuoqing Jiang
Now raid5_build_block is just called to set the sector of r5dev, raid5_compute_blocknr can be used directly for the purpose. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-25md/r5cache: call mddev_lock/unlock() in r5c_journal_mode_showSong Liu
In r5c_journal_mode_show(), it is necessary to call mddev_lock() before accessing conf and conf->log. Otherwise, the conf->log may change (and become NULL). Signed-off-by: Song Liu <songliubraving@fb.com> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-25md: replace seq_release_private with seq_releaseCihangir Akturk
Since commit f15146380d28 ("fs: seq_file - add event counter to simplify poll() support"), md.c code has been no longer used the private field of the struct seq_file, but seq_release_private() has been continued to be used to release the allocated seq_file. This seems to have been forgotten. So convert it to use seq_release() instead of seq_release_private(). Signed-off-by: Cihangir Akturk <cakturk@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-25md: notify about new spare disk in the containerAlexey Obitotskiy
In case of external metadata arrays spare disks are added to containers first. mdadm keeps monitoring /proc/mdstat output and when spare disk is available, it moves it from the container to the array. The problem is there is no notification of new spare disk in the container and mdadm waits a long time (until timeout) before it takes the action. Signed-off-by: Alexey Obitotskiy <aleksey.obitotskiy@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-25md/raid1/10: reset bio allocated from mempoolShaohua Li
Data allocated from mempool doesn't always get initialized, this happens when the data is reused instead of fresh allocation. In the raid1/10 case, we must reinitialize the bios. Reported-by: Jonathan G. Underwood <jonathan.underwood@gmail.com> Fixes: f0250618361d(md: raid10: don't use bio's vec table to manage resync pages) Fixes: 98d30c5812c3(md: raid1: don't use bio's vec table to manage resync pages) Cc: stable@vger.kernel.org (4.12+) Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-24md/raid5: release/flush io in raid5_do_work()Song Liu
In raid5, there are scenarios where some ios are deferred to a later time, and some IO need a flush to complete. To make sure we make progress with these IOs, we need to call the following functions: flush_deferred_bios(conf); r5l_flush_stripe_to_raid(conf->log); Both of these functions are called in raid5d(), but missing in raid5_do_work(). As a result, these functions are not called when multi-threading (group_thread_cnt > 0) is enabled. This patch adds calls to these function to raid5_do_work(). Note for stable branches: r5l_flush_stripe_to_raid(conf->log) is need for 4.4+ flush_deferred_bios(conf) is only needed for 4.11+ Cc: stable@vger.kernel.org (4.4+) Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-24md/bitmap: copy correct data for bitmap superShaohua Li
raid5 cache could write bitmap superblock before bitmap superblock is initialized. The bitmap superblock is less than 512B. The current code will only copy the superblock to a new page and write the whole 512B, which will zero the the data after the superblock. Unfortunately the data could include bitmap, which we should preserve. The patch will make superblock read do 4k chunk and we always copy the 4k data to new page, so the superblock write will old data to disk and we don't change the bitmap. Reported-by: Song Liu <songliubraving@fb.com> Reviewed-by: Song Liu <songliubraving@fb.com> Cc: stable@vger.kernel.org (4.10+) Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-23block: replace bi_bdev with a gendisk pointer and partitions indexChristoph Hellwig
This way we don't need a block_device structure to submit I/O. The block_device has different life time rules from the gendisk and request_queue and is usually only available when the block device node is open. Other callers need to explicitly create one (e.g. the lightnvm passthrough code, or the new nvme multipathing code). For the actual I/O path all that we need is the gendisk, which exists once per block device. But given that the block layer also does partition remapping we additionally need a partition index, which is used for said remapping in generic_make_request. Note that all the block drivers generally want request_queue or sometimes the gendisk, so this removes a layer of indirection all over the stack. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23raid5: remove a call to get_start_sectChristoph Hellwig
The block layer always remaps partitions before calling into the ->make_request methods of drivers. Thus the call to get_start_sect in in_chunk_boundary will always return 0 and can be removed. Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-11MD: not clear ->safemode for external metadata arrayShaohua Li
->safemode should be triggered by mdadm for external metadaa array, otherwise array's state confuses mdadm. Fixes: 33182d15c6bf(md: always clear ->safemode when md_check_recovery gets the mddev lock.) Cc: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-09block: pass in queue to inflight accountingJens Axboe
No functional change in this patch, just in preparation for basing the inflight mechanism on the queue in question. Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-09dm-crypt: don't mess with BIP_BLOCK_INTEGRITYChristoph Hellwig
This flag is never set right after calling bio_integrity_alloc, so don't clear it and confuse the reader. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-08md/r5cache: fix io_unit handling in r5l_log_endio()Song Liu
In r5l_log_endio(), once log->io_list_lock is released, the io unit may be accessed (or even freed) by other threads. Current code doesn't handle the io_unit properly, which leads to potential race conditions. This patch solves this race condition by: 1. Add a pending_stripe count flush_payload. Multiple flush_payloads are counted as only one pending_stripe. Flag has_flush_payload is added to show whether the io unit has flush_payload; 2. In r5l_log_endio(), check flags has_null_flush and has_flush_payload with log->io_list_lock held. After the lock is released, this IO unit is only accessed when we know the pending_stripe counter cannot be zeroed by other threads. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-08md/r5cache: call mddev_lock/unlock() in r5c_journal_mode_setSong Liu
In r5c_journal_mode_set(), it is necessary to call mddev_lock() before accessing conf and conf->log. Otherwise, the conf->log may change (and become NULL). Shaohua: fix unlock in failure cases Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-08md: fix test in md_write_start()NeilBrown
md_write_start() needs to clear the in_sync flag is it is set, or if there might be a race with set_in_sync() such that the later will set it very soon. In the later case it is sufficient to take the spinlock to synchronize with set_in_sync(), and then set the flag if needed. The current test is incorrect. It should be: if "flag is set" or "race is possible" "flag is set" is trivially "mddev->in_sync". "race is possible" should be tested by "mddev->sync_checkers". If sync_checkers is 0, then there can be no race. set_in_sync() will wait in percpu_ref_switch_to_atomic_sync() for an RCU grace period, and as md_write_start() holds the rcu_read_lock(), set_in_sync() will be sure ot see the update to writes_pending. If sync_checkers is > 0, there could be race. If md_write_start() happened entirely between if (!mddev->in_sync && percpu_ref_is_zero(&mddev->writes_pending)) { and mddev->in_sync = 1; in set_in_sync(), then it would not see that is_sync had been set, and set_in_sync() would not see that writes_pending had been incremented. This bug means that in_sync is sometimes not set when it should be. Consequently there is a small chance that the array will be marked as "clean" when in fact it is inconsistent. Fixes: 4ad23a976413 ("MD: use per-cpu counter for writes_pending") cc: stable@vger.kernel.org (v4.12+) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-08md: always clear ->safemode when md_check_recovery gets the mddev lock.NeilBrown
If ->safemode == 1, md_check_recovery() will try to get the mddev lock and perform various other checks. If mddev->in_sync is zero, it will call set_in_sync, and clear ->safemode. However if mddev->in_sync is not zero, ->safemode will not be cleared. When md_check_recovery() drops the mddev lock, the thread is woken up again. Normally it would just check if there was anything else to do, find nothing, and go to sleep. However as ->safemode was not cleared, it will take the mddev lock again, then wake itself up when unlocking. This results in an infinite loop, repeatedly calling md_check_recovery(), which RCU or the soft-lockup detector will eventually complain about. Prior to commit 4ad23a976413 ("MD: use per-cpu counter for writes_pending"), safemode would only be set to one when the writes_pending counter reached zero, and would be cleared again when writes_pending is incremented. Since that patch, safemode is set more freely, but is not reliably cleared. So in md_check_recovery() clear ->safemode before checking ->in_sync. Fixes: 4ad23a976413 ("MD: use per-cpu counter for writes_pending") Cc: stable@vger.kernel.org (4.12+) Reported-by: Dominik Brodowski <linux@dominikbrodowski.net> Reported-by: David R <david@unsolicited.net> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-04crypto: algapi - make crypto_xor() take separate dst and src argumentsArd Biesheuvel
There are quite a number of occurrences in the kernel of the pattern if (dst != src) memcpy(dst, src, walk.total % AES_BLOCK_SIZE); crypto_xor(dst, final, walk.total % AES_BLOCK_SIZE); or crypto_xor(keystream, src, nbytes); memcpy(dst, keystream, nbytes); where crypto_xor() is preceded or followed by a memcpy() invocation that is only there because crypto_xor() uses its output parameter as one of the inputs. To avoid having to add new instances of this pattern in the arm64 code, which will be refactored to implement non-SIMD fallbacks, add an alternative implementation called crypto_xor_cpy(), taking separate input and output arguments. This removes the need for the separate memcpy(). Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2017-07-28Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds
Pull MD fixes from Shaohua Li: "This fixes several bugs, three of them are marked for stable: - an initialization issue fixed by Ming - a bio clone race issue fixed by me - an async tx flush issue fixed by Ofer - other cleanups" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: MD: fix warnning for UP case md/raid5: add thread_group worker async_tx_issue_pending_all md: simplify code with bio_io_error md/raid1: fix writebehind bio clone md: raid1-10: move raid1/raid10 common code into raid1-10.c md: raid1/raid10: initialize bvec table via bio_add_page() md: remove 'idx' from 'struct resync_pages'
2017-07-28Merge tag 'for-4.13/dm-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - a few DM integrity fixes that improve performance. One that address inefficiencies in the on-disk journal device layout. Another that makes use of the block layer's on-stack plugging when writing the journal. - a dm-bufio fix for the blk_status_t conversion that went in during the merge window. - a few DM raid fixes that address correctness when suspending the device and a validation fix for validation that occurs during device activation. - a couple DM zoned target fixes. Important one being the fix to not use GFP_KERNEL in the IO path due to concerns about deadlock in low-memory conditions (e.g. swap over a DM zoned device, etc). - a DM DAX device fix to make sure dm_dax_flush() is called if the underlying DAX device is operating as a write cache. * tag 'for-4.13/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm, dax: Make sure dm_dax_flush() is called if device supports it dm verity fec: fix GFP flags used with mempool_alloc() dm zoned: use GFP_NOIO in I/O path dm zoned: remove test for impossible REQ_OP_FLUSH conditions dm raid: bump target version dm raid: avoid mddev->suspended access dm raid: fix activation check in validate_raid_redundancy() dm raid: remove WARN_ON() in raid10_md_layout_to_format() dm bufio: fix error code in dm_bufio_write_dirty_buffers() dm integrity: test for corrupted disk format during table load dm integrity: WARN_ON if variables representing journal usage get out of sync dm integrity: use plugging when writing the journal dm integrity: fix inefficient allocation of journal space
2017-07-26dm, dax: Make sure dm_dax_flush() is called if device supports itVivek Goyal
Currently dm_dax_flush() is not being called, even if underlying dax device supports write cache, because DAXDEV_WRITE_CACHE is not being propagated up to the DM dax device. If the underlying dax device supports write cache, set DAXDEV_WRITE_CACHE on the DM dax device. This will cause dm_dax_flush() to be called. Fixes: abebfbe2f7 ("dm: add ->flush() dax operation support") Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-07-26dm verity fec: fix GFP flags used with mempool_alloc()NeilBrown
mempool_alloc() cannot fail for GFP_NOIO allocation, so there is no point testing for failure. One place the code tested for failure was passing "0" as the GFP flags. This is most unusual and is probably meant to be GFP_NOIO, so that is changed. Also, allocation from ->extra_pool and ->prealloc_pool are repeated before releasing the previous allocation. This can deadlock if the code is servicing a write under high memory pressure. To avoid deadlocks, change these to use GFP_NOWAIT and leave the error handling in place. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-07-26dm zoned: use GFP_NOIO in I/O pathDamien Le Moal
Use GFP_NOIO for memory allocations in the I/O path. Other memory allocations in the initialization path can use GFP_KERNEL. Reported-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-07-25MD: fix warnning for UP caseShaohua Li
spin_is_locked always returns 0 for UP case, so ignores it Reported-by: Joshua Kinard <kumba@gentoo.org> Signed-off-by: Shaohua Li <shli@fb.com>
2017-07-25dm zoned: remove test for impossible REQ_OP_FLUSH conditionsMikulas Patocka
The value REQ_OP_FLUSH is only used by the block code for request-based devices. Remove the tests for REQ_OP_FLUSH from the bio-based dm-zoned-target. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-07-25dm raid: bump target versionHeinz Mauelshagen
Bumo dm-raid target version to 1.12.1 to reflect that commit cc27b0c78c ("md: fix deadlock between mddev_suspend() and md_write_start()") is available. This version change allows userspace to detect that MD fix is available. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-07-25dm raid: avoid mddev->suspended accessHeinz Mauelshagen
Use runtime flag to ensure that an mddev gets suspended/resumed just once. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-07-25dm raid: fix activation check in validate_raid_redundancy()Heinz Mauelshagen
During growing reshapes (i.e. stripes being added to a raid set), the new stripe images are not in-sync and not part of the raid set until the reshape is started. LVM2 has to request multiple table reloads involving superblock updates in order to reflect proper size of SubLVs in the cluster. Before a stripe adding reshape starts, validate_raid_redundancy() fails as a result of that because it checks the total number of devices against the number of rebuild ones rather than the actual ones in the raid set (as retrieved from the superblock) thus resulting in failed raid4/5/6/10 redundancy checks. E.g. convert 3 stripes -> 7 stripes raid5 (which only allows for maximum 1 device to fail) requesting +4 delta disks causing 4 devices to rebuild during reshaping thus failing activation. To fix this, move validate_raid_redundancy() to get access to the current raid_set members. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-07-25dm raid: remove WARN_ON() in raid10_md_layout_to_format()Heinz Mauelshagen
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-07-25dm bufio: fix error code in dm_bufio_write_dirty_buffers()Dan Carpenter
We should be returning normal negative error codes here. The "a" variables comes from &c->async_write_error which is a blk_status_t converted to a regular error code. In the current code, the blk_status_t gets propogated back to pool_create() and eventually results in an Oops. Fixes: 4e4cbee93d56 ("block: switch bios to blk_status_t") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-07-25dm integrity: test for corrupted disk format during table loadMikulas Patocka
If the dm-integrity superblock was corrupted in such a way that the journal_sections field was zero, the integrity target would deadlock because it would wait forever for free space in the journal. Detect this situation and refuse to activate the device. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Fixes: 7eada909bfd7 ("dm: add integrity target") Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-07-25dm integrity: WARN_ON if variables representing journal usage get out of syncMikulas Patocka
If this WARN_ON triggers it speaks to programmer error, and likely implies corruption, but no released kernel should trigger it. This WARN_ON serves to assist DM integrity developers as changes are made/tested in the future. BUG_ON is excessive for catching programmer error, if a user or developer would like warnings to trigger a panic, they can enable that via /proc/sys/kernel/panic_on_warn Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-07-24md/raid5: add thread_group worker async_tx_issue_pending_allOfer Heifetz
Since thread_group worker and raid5d kthread are not in sync, if worker writes stripe before raid5d then requests will be waiting for issue_pendig. Issue observed when building raid5 with ext4, in some build runs jbd2 would get hung and requests were waiting in the HW engine waiting to be issued. Fix this by adding a call to async_tx_issue_pending_all in the raid5_do_work. Signed-off-by: Ofer Heifetz <oferh@marvell.com> Cc: stable@vger.kernel.org Signed-off-by: Shaohua Li <shli@fb.com>
2017-07-21md: simplify code with bio_io_errorGuoqing Jiang
Since bio_io_error sets bi_status to BLK_STS_IOERR, and calls bio_endio, so we can use it directly. And as mentioned by Shaohua, there are also two places in raid5.c can use bio_io_error either. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-07-21md/raid1: fix writebehind bio cloneShaohua Li
After bio is submitted, we should not clone it as its bi_iter might be invalid by driver. This is the case of behind_master_bio. In certain situration, we could dispatch behind_master_bio immediately for the first disk and then clone it for other disks. https://bugzilla.kernel.org/show_bug.cgi?id=196383 Reported-and-tested-by: Markus <m4rkusxxl@web.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Fix: 841c1316c7da(md: raid1: improve write behind) Cc: stable@vger.kernel.org (4.12+) Signed-off-by: Shaohua Li <shli@fb.com>
2017-07-21md: raid1-10: move raid1/raid10 common code into raid1-10.cMing Lei
No function change, just move 'struct resync_pages' and related helpers into raid1-10.c Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-07-21md: raid1/raid10: initialize bvec table via bio_add_page()Ming Lei
We will support multipage bvec soon, so initialize bvec table using the standardy way instead of writing the talbe directly. Otherwise it won't work any more once multipage bvec is enabled. Acked-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-07-21md: remove 'idx' from 'struct resync_pages'Ming Lei
bio_add_page() won't fail for resync bio, and the page index for each bio is same, so remove it. More importantly the 'idx' of 'struct resync_pages' is initialized in mempool allocator function, the current way is wrong since mempool is only responsible for allocation, we can't use that for initialization. Suggested-by: NeilBrown <neilb@suse.com> Reported-by: NeilBrown <neilb@suse.com> Reported-and-tested-by: Patrick <dto@gmx.net> Fixes: f0250618361d(md: raid10: don't use bio's vec table to manage resync pages) Fixes: 98d30c5812c3(md: raid1: don't use bio's vec table to manage resync pages) Cc: stable@vger.kernel.org (4.12+) Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-07-19dm integrity: use plugging when writing the journalMikulas Patocka
When copying data from the journal to the appropriate place, we submit many IOs. Some of these IOs could go to adjacent areas. Use on-stack plugging so that adjacent IOs get merged during submission. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-07-19dm integrity: fix inefficient allocation of journal spaceMikulas Patocka
When using a block size greater than 512 bytes, the dm-integrity target allocates journal space inefficiently. It allocates one journal entry for each 512-byte chunk of data, fills an entry for each block of data and leaves the remaining entries unused. This issue doesn't cause data corruption, but all the unused journal entries degrade performance severely. For example, with 4k blocks and an 8k bio, it would allocate 16 journal entries but only use 2 entries. The remaining 14 entries were left unused. Fix this by adding the missing 'log2_sectors_per_block' shifts that are required to have each journal entry map to a full block. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Fixes: 7eada909bfd7 ("dm: add integrity target") Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-07-18Merge tag 'md/4.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds
Pull MD fixes from Shaohua Li: - raid5-ppl fix by Artur. This one is introduced in this release cycle. - raid5 reshape fix by Xiao. This is an old bug and will be added to stable. - bitmap fix by Guoqing. * tag 'md/4.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: raid5-ppl: use BIOSET_NEED_BVECS when creating bioset Raid5 should update rdev->sectors after reshape md/bitmap: don't read page from device with Bitmap_sync
2017-07-12raid5-ppl: use BIOSET_NEED_BVECS when creating biosetArtur Paszkiewicz
This bioset is used for allocating bios with nr_iovecs > 0 so this flag must be set. Fixes: 011067b05668 ("blk: replace bioset_create_nobvec() with a flags arg to bioset_create()") Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-07-11Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull more block updates from Jens Axboe: "This is a followup for block changes, that didn't make the initial pull request. It's a bit of a mixed bag, this contains: - A followup pull request from Sagi for NVMe. Outside of fixups for NVMe, it also includes a series for ensuring that we properly quiesce hardware queues when browsing live tags. - Set of integrity fixes from Dmitry (mostly), fixing various issues for folks using DIF/DIX. - Fix for a bug introduced in cciss, with the req init changes. From Christoph. - Fix for a bug in BFQ, from Paolo. - Two followup fixes for lightnvm/pblk from Javier. - Depth fix from Ming for blk-mq-sched. - Also from Ming, performance fix for mtip32xx that was introduced with the dynamic initialization of commands" * 'for-linus' of git://git.kernel.dk/linux-block: (44 commits) block: call bio_uninit in bio_endio nvmet: avoid unneeded assignment of submit_bio return value nvme-pci: add module parameter for io queue depth nvme-pci: compile warnings in nvme_alloc_host_mem() nvmet_fc: Accept variable pad lengths on Create Association LS nvme_fc/nvmet_fc: revise Create Association descriptor length lightnvm: pblk: remove unnecessary checks lightnvm: pblk: control I/O flow also on tear down cciss: initialize struct scsi_req null_blk: fix error flow for shared tags during module_init block: Fix __blkdev_issue_zeroout loop nvme-rdma: unconditionally recycle the request mr nvme: split nvme_uninit_ctrl into stop and uninit virtio_blk: quiesce/unquiesce live IO when entering PM states mtip32xx: quiesce request queues to make sure no submissions are inflight nbd: quiesce request queues to make sure no submissions are inflight nvme: kick requeue list when requeueing a request instead of when starting the queues nvme-pci: quiesce/unquiesce admin_q instead of start/stop its hw queues nvme-loop: quiesce/unquiesce admin_q instead of start/stop its hw queues nvme-fc: quiesce/unquiesce admin_q instead of start/stop its hw queues ...
2017-07-10Raid5 should update rdev->sectors after reshapeXiao Ni
The raid5 md device is created by the disks which we don't use the total size. For example, the size of the device is 5G and it just uses 3G of the devices to create one raid5 device. Then change the chunksize and wait reshape to finish. After reshape finishing stop the raid and assemble it again. It fails. mdadm -CR /dev/md0 -l5 -n3 /dev/loop[0-2] --size=3G --chunk=32 --assume-clean mdadm /dev/md0 --grow --chunk=64 wait reshape to finish mdadm -S /dev/md0 mdadm -As The error messages: [197519.814302] md: loop1 does not have a valid v1.2 superblock, not importing! [197519.821686] md: md_import_device returned -22 After reshape the data offset is changed. It selects backwards direction in this condition. In function super_1_load it compares the available space of the underlying device with sb->data_size. The new data offset gets bigger after reshape. So super_1_load returns -EINVAL. rdev->sectors is updated in md_finish_reshape. Then sb->data_size is set in super_1_sync based on rdev->sectors. So add md_finish_reshape in end_reshape. Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Guoqing Jiang <gqjiang@suse.com> Cc: stable@vger.kernel.org Signed-off-by: Shaohua Li <shli@fb.com>
2017-07-10md/bitmap: don't read page from device with Bitmap_syncGuoqing Jiang
The device owns Bitmap_sync flag needs recovery to become in sync, and read page from this type device could get stale status. Also add comments for Bitmap_sync bit per the suggestion from Shaohua and Neil. Previous disscussion can be found here: https://marc.info/?t=149760428900004&r=1&w=2 Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-07-08Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds
Pull MD update from Shaohua Li: - fixed deadlock in MD suspend and a potential bug in bio allocation (Neil Brown) - fixed signal issue (Mikulas Patocka) - fixed typo in FailFast test (Guoqing Jiang) - other trival fixes * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: MD: fix sleep in atomic MD: fix a null dereference md: use a separate bio_set for synchronous IO. md: change the initialization value for a spare device spot to MD_DISK_ROLE_SPARE md/raid1: remove unused bio in sync_request_write md/raid10: fix FailFast test for wrong device md: don't use flush_signals in userspace processes md: fix deadlock between mddev_suspend() and md_write_start()