summaryrefslogtreecommitdiff
path: root/drivers/md/md.c
AgeCommit message (Collapse)Author
2018-11-13MD: fix invalid stored role for a disk - try2Shaohua Li
commit 9e753ba9b9b405e3902d9f08aec5f2ea58a0c317 upstream. Commit d595567dc4f0 (MD: fix invalid stored role for a disk) broke linear hotadd. Let's only fix the role for disks in raid1/10. Based on Guoqing's original patch. Reported-by: kernel test robot <rong.a.chen@intel.com> Cc: Gioh Kim <gi-oh.kim@profitbricks.com> Cc: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-13MD: fix invalid stored role for a diskShaohua Li
[ Upstream commit d595567dc4f0c1d90685ec1e2e296e2cad2643ac ] If we change the number of array's device after device is removed from array, then add the device back to array, we can see that device is added as active role instead of spare which we expected. Please see the below link for details: https://marc.info/?l=linux-raid&m=153736982015076&w=2 This is caused by that we prefer to use device's previous role which is recorded by saved_raid_disk, but we should respect the new number of conf->raid_disks since it could be changed after device is removed. Reported-by: Gioh Kim <gi-oh.kim@profitbricks.com> Tested-by: Gioh Kim <gi-oh.kim@profitbricks.com> Acked-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-03md: fix NULL dereference of mddev->pers in remove_and_add_spares()Yufen Yu
[ Upstream commit c42a0e2675721e1444f56e6132a07b7b1ec169ac ] We met NULL pointer BUG as follow: [ 151.760358] BUG: unable to handle kernel NULL pointer dereference at 0000000000000060 [ 151.761340] PGD 80000001011eb067 P4D 80000001011eb067 PUD 1011ea067 PMD 0 [ 151.762039] Oops: 0000 [#1] SMP PTI [ 151.762406] Modules linked in: [ 151.762723] CPU: 2 PID: 3561 Comm: mdadm-test Kdump: loaded Not tainted 4.17.0-rc1+ #238 [ 151.763542] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014 [ 151.764432] RIP: 0010:remove_and_add_spares.part.56+0x13c/0x3a0 [ 151.765061] RSP: 0018:ffffc90001d7fcd8 EFLAGS: 00010246 [ 151.765590] RAX: 0000000000000000 RBX: ffff88013601d600 RCX: 0000000000000000 [ 151.766306] RDX: 0000000000000000 RSI: ffff88013601d600 RDI: ffff880136187000 [ 151.767014] RBP: ffff880136187018 R08: 0000000000000003 R09: 0000000000000051 [ 151.767728] R10: ffffc90001d7fed8 R11: 0000000000000000 R12: ffff88013601d600 [ 151.768447] R13: ffff8801298b1300 R14: ffff880136187000 R15: 0000000000000000 [ 151.769160] FS: 00007f2624276700(0000) GS:ffff88013ae80000(0000) knlGS:0000000000000000 [ 151.769971] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 151.770554] CR2: 0000000000000060 CR3: 0000000111aac000 CR4: 00000000000006e0 [ 151.771272] Call Trace: [ 151.771542] md_ioctl+0x1df2/0x1e10 [ 151.771906] ? __switch_to+0x129/0x440 [ 151.772295] ? __schedule+0x244/0x850 [ 151.772672] blkdev_ioctl+0x4bd/0x970 [ 151.773048] block_ioctl+0x39/0x40 [ 151.773402] do_vfs_ioctl+0xa4/0x610 [ 151.773770] ? dput.part.23+0x87/0x100 [ 151.774151] ksys_ioctl+0x70/0x80 [ 151.774493] __x64_sys_ioctl+0x16/0x20 [ 151.774877] do_syscall_64+0x5b/0x180 [ 151.775258] entry_SYSCALL_64_after_hwframe+0x44/0xa9 For raid6, when two disk of the array are offline, two spare disks can be added into the array. Before spare disks recovery completing, system reboot and mdadm thinks it is ok to restart the degraded array by md_ioctl(). Since disks in raid6 is not only_parity(), raid5_run() will abort, when there is no PPL feature or not setting 'start_dirty_degraded' parameter. Therefore, mddev->pers is NULL. But, mddev->raid_disks has been set and it will not be cleared when raid5_run abort. md_ioctl() can execute cmd 'HOT_REMOVE_DISK' to remove a disk by mdadm, which will cause NULL pointer dereference in remove_and_add_spares() finally. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-03md: fix two problems with setting the "re-add" device state.NeilBrown
commit 011abdc9df559ec75779bb7c53a744c69b2a94c6 upstream. If "re-add" is written to the "state" file for a device which is faulty, this has an effect similar to removing and re-adding the device. It should take up the same slot in the array that it previously had, and an accelerated (e.g. bitmap-based) rebuild should happen. The slot that "it previously had" is determined by rdev->saved_raid_disk. However this is not set when a device fails (only when a device is added), and it is cleared when resync completes. This means that "re-add" will normally work once, but may not work a second time. This patch includes two fixes. 1/ when a device fails, record the ->raid_disk value in ->saved_raid_disk before clearing ->raid_disk 2/ when "re-add" is written to a device for which ->saved_raid_disk is not set, fail. I think this is suitable for stable as it can cause re-adding a device to be forced to do a full resync which takes a lot longer and so puts data at more risk. Cc: <stable@vger.kernel.org> (v4.1) Fixes: 97f6cd39da22 ("md-cluster: re-add capabilities") Signed-off-by: NeilBrown <neilb@suse.com> Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30md: fix a potential deadlock of raid5/raid10 reshapeBingJing Chang
[ Upstream commit 8876391e440ba615b10eef729576e111f0315f87 ] There is a potential deadlock if mount/umount happens when raid5_finish_reshape() tries to grow the size of emulated disk. How the deadlock happens? 1) The raid5 resync thread finished reshape (expanding array). 2) The mount or umount thread holds VFS sb->s_umount lock and tries to write through critical data into raid5 emulated block device. So it waits for raid5 kernel thread handling stripes in order to finish it I/Os. 3) In the routine of raid5 kernel thread, md_check_recovery() will be called first in order to reap the raid5 resync thread. That is, raid5_finish_reshape() will be called. In this function, it will try to update conf and call VFS revalidate_disk() to grow the raid5 emulated block device. It will try to acquire VFS sb->s_umount lock. The raid5 kernel thread cannot continue, so no one can handle mount/ umount I/Os (stripes). Once the write-through I/Os cannot be finished, mount/umount will not release sb->s_umount lock. The deadlock happens. The raid5 kernel thread is an emulated block device. It is responible to handle I/Os (stripes) from upper layers. The emulated block device should not request any I/Os on itself. That is, it should not call VFS layer functions. (If it did, it will try to acquire VFS locks to guarantee the I/Os sequence.) So we have the resync thread to send resync I/O requests and to wait for the results. For solving this potential deadlock, we can put the size growth of the emulated block device as the final step of reshape thread. 2017/12/29: Thanks to Guoqing Jiang <gqjiang@suse.com>, we confirmed that there is the same deadlock issue in raid10. It's reproducible and can be fixed by this patch. For raid10.c, we can remove the similar code to prevent deadlock as well since they has been called before. Reported-by: Alex Wu <alexwu@synology.com> Reviewed-by: Alex Wu <alexwu@synology.com> Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com> Signed-off-by: BingJing Chang <bingjingc@synology.com> Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22md.c:didn't unlock the mddev before return EINVAL in array_size_storeZhilong Liu
[ Upstream commit b670883bb9e55ba63a278d83e034faefc01ce2cf ] md.c: it needs to release the mddev lock before the array_size_store() returns. Fixes: ab5a98b132fd ("md-cluster: change array_sectors and update size are not supported") Signed-off-by: Zhilong Liu <zlliu@suse.com> Reviewed-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-11md: only allow remove_and_add_spares when no sync_thread running.NeilBrown
commit 39772f0a7be3b3dc26c74ea13fe7847fd1522c8b upstream. The locking protocols in md assume that a device will never be removed from an array during resync/recovery/reshape. When that isn't happening, rcu or reconfig_mutex is needed to protect an rdev pointer while taking a refcount. When it is happening, that protection isn't needed. Unfortunately there are cases were remove_and_add_spares() is called when recovery might be happening: is state_store(), slot_store() and hot_remove_disk(). In each case, this is just an optimization, to try to expedite removal from the personality so the device can be removed from the array. If resync etc is happening, we just have to wait for md_check_recover to find a suitable time to call remove_and_add_spares(). This optimization and not essential so it doesn't matter if it fails. So change remove_and_add_spares() to abort early if resync/recovery/reshape is happening, unless it is called from md_check_recovery() as part of a newly started recovery. The parameter "this" is only NULL when called from md_check_recovery() so when it is NULL, there is no need to abort. As this can result in a NULL dereference, the fix is suitable for -stable. cc: yuyufen <yuyufen@huawei.com> Cc: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Fixes: 8430e7e0af9a ("md: disconnect device from personality before trying to remove it.") Cc: stable@ver.kernel.org (v4.8+) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-15md: fix super_offset endianness in super_1_rdev_size_changeJason Yan
commit 3fb632e40d7667d8bedfabc28850ac06d5493f54 upstream. The sb->super_offset should be big-endian, but the rdev->sb_start is in host byte order, so fix this by adding cpu_to_le64. Signed-off-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-15md: fix incorrect use of lexx_to_cpu in does_sb_need_changingJason Yan
commit 1345921393ba23b60d3fcf15933e699232ad25ae upstream. The sb->layout is of type __le32, so we shoud use le32_to_cpu. Signed-off-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25md: MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stopNeilBrown
commit 065e519e71b2c1f41936cce75b46b5ab34adb588 upstream. if called md_set_readonly and set MD_CLOSING bit, the mddev cannot be opened any more due to the MD_CLOING bit wasn't cleared. Thus it needs to be cleared in md_ioctl after any call to md_set_readonly() or do_md_stop(). Signed-off-by: NeilBrown <neilb@suse.com> Fixes: af8d8e6f0315 ("md: changes for MD_STILL_CLOSED flag") Signed-off-by: Zhilong Liu <zlliu@suse.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12md: fix refcount problem on mddev when stopping array.NeilBrown
commit e2342ca832726a840ca6bd196dd2cc073815b08a upstream. md_open() gets a counted reference on an mddev using mddev_find(). If it ends up returning an error, it must drop this reference. There are two error paths where the reference is not dropped. One only happens if the process is signalled and an awkward time, which is quite unlikely. The other was introduced recently in commit af8d8e6f0. Change the code to ensure the drop the reference when returning an error, and make it harded to re-introduce this sort of bug in the future. Reported-by: Marc Smith <marc.smith@mcc.edu> Fixes: af8d8e6f0315 ("md: changes for MD_STILL_CLOSED flag") Signed-off-by: NeilBrown <neilb@suse.com> Acked-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12md: MD_RECOVERY_NEEDED is set for mddev->recoveryShaohua Li
commit 82a301cb0ea2df8a5c88213094a01660067c7fb4 upstream. Fixes: 90f5f7ad4f38("md: Wait for md_check_recovery before attempting device removal.") Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-28md: be careful not lot leak internal curr_resync value into metadata. -- (all)NeilBrown
mddev->curr_resync usually records where the current resync is up to, but during the starting phase it has some "magic" values. 1 - means that the array is trying to start a resync, but has yielded to another array which shares physical devices, and also needs to start a resync 2 - means the array is trying to start resync, but has found another array which shares physical devices and has already started resync. 3 - means that resync has commensed, but it is possible that nothing has actually been resynced yet. It is important that this value not be visible to user-space and particularly that it doesn't get written to the metadata, as the resync or recovery checkpoint. In part, this is because it may be slightly higher than the correct value, though this is very rare. In part, because it is not a multiple of 4K, and some devices only support 4K aligned accesses. There are two places where this value is propagates into either ->curr_resync_completed or ->recovery_cp or ->recovery_offset. These currently avoid the propagation of values 1 and 3, but will allow 3 to leak through. Change them to only propagate the value if it is > 3. As this can cause an array to fail, the patch is suitable for -stable. Cc: stable@vger.kernel.org (v3.7+) Reported-by: Viswesh <viswesh.vichu@gmail.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-24md: report 'write_pending' state when array in syncTomasz Majchrzak
If there is a bad block on a disk and there is a recovery performed from this disk, the same bad block is reported for a new disk. It involves setting MD_CHANGE_PENDING flag in rdev_set_badblocks. For external metadata this flag is not being cleared as array state is reported as 'clean'. The read request to bad block in RAID5 array gets stuck as it is waiting for a flag to be cleared - as per commit c3cce6cda162 ("md/raid5: ensure device failure recorded before write request returns."). The meaning of MD_CHANGE_PENDING and MD_CHANGE_CLEAN flags has been clarified in commit 070dc6dd7103 ("md: resolve confusion of MD_CHANGE_CLEAN"), however MD_CHANGE_PENDING flag has been used in personality error handlers since and it doesn't fully comply with initial purpose. It was supposed to notify that write request is about to start, however now it is also used to request metadata update. Initially (in md_allow_write, md_write_start) MD_CHANGE_PENDING flag has been set and in_sync has been set to 0 at the same time. Error handlers just set the flag without modifying in_sync value. Sysfs array state is a single value so now it reports 'clean' when MD_CHANGE_PENDING flag is set and in_sync is set to 1. Userspace has no idea it is expected to take some action. Swap the order that array state is checked so 'write_pending' is reported ahead of 'clean' ('write_pending' is a misleading name but it is too late to rename it now). Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-03md: set rotational bitShaohua Li
if all disks in an array are non-rotational, set the array non-rotational. This only works for array with all disks populated at startup. Support for disk hotadd/hotremove could be added later if necessary. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Shaohua Li <shli@fb.com>
2016-09-21md: fix a potential deadlockShaohua Li
lockdep reports a potential deadlock. Fix this by droping the mutex before md_import_device [ 1137.126601] ====================================================== [ 1137.127013] [ INFO: possible circular locking dependency detected ] [ 1137.127013] 4.8.0-rc4+ #538 Not tainted [ 1137.127013] ------------------------------------------------------- [ 1137.127013] mdadm/16675 is trying to acquire lock: [ 1137.127013] (&bdev->bd_mutex){+.+.+.}, at: [<ffffffff81243cf3>] __blkdev_get+0x63/0x450 [ 1137.127013] but task is already holding lock: [ 1137.127013] (detected_devices_mutex){+.+.+.}, at: [<ffffffff81a5138c>] md_ioctl+0x2ac/0x1f50 [ 1137.127013] which lock already depends on the new lock. [ 1137.127013] the existing dependency chain (in reverse order) is: [ 1137.127013] -> #1 (detected_devices_mutex){+.+.+.}: [ 1137.127013] [<ffffffff810b6f19>] lock_acquire+0xb9/0x220 [ 1137.127013] [<ffffffff81c51647>] mutex_lock_nested+0x67/0x3d0 [ 1137.127013] [<ffffffff81a4eeaf>] md_autodetect_dev+0x3f/0x90 [ 1137.127013] [<ffffffff81595be8>] rescan_partitions+0x1a8/0x2c0 [ 1137.127013] [<ffffffff81590081>] __blkdev_reread_part+0x71/0xb0 [ 1137.127013] [<ffffffff815900e5>] blkdev_reread_part+0x25/0x40 [ 1137.127013] [<ffffffff81590c4b>] blkdev_ioctl+0x51b/0xa30 [ 1137.127013] [<ffffffff81242bf1>] block_ioctl+0x41/0x50 [ 1137.127013] [<ffffffff81214c96>] do_vfs_ioctl+0x96/0x6e0 [ 1137.127013] [<ffffffff81215321>] SyS_ioctl+0x41/0x70 [ 1137.127013] [<ffffffff81c56825>] entry_SYSCALL_64_fastpath+0x18/0xa8 [ 1137.127013] -> #0 (&bdev->bd_mutex){+.+.+.}: [ 1137.127013] [<ffffffff810b6af2>] __lock_acquire+0x1662/0x1690 [ 1137.127013] [<ffffffff810b6f19>] lock_acquire+0xb9/0x220 [ 1137.127013] [<ffffffff81c51647>] mutex_lock_nested+0x67/0x3d0 [ 1137.127013] [<ffffffff81243cf3>] __blkdev_get+0x63/0x450 [ 1137.127013] [<ffffffff81244307>] blkdev_get+0x227/0x350 [ 1137.127013] [<ffffffff812444f6>] blkdev_get_by_dev+0x36/0x50 [ 1137.127013] [<ffffffff81a46d65>] lock_rdev+0x35/0x80 [ 1137.127013] [<ffffffff81a49bb4>] md_import_device+0xb4/0x1b0 [ 1137.127013] [<ffffffff81a513d6>] md_ioctl+0x2f6/0x1f50 [ 1137.127013] [<ffffffff815909b3>] blkdev_ioctl+0x283/0xa30 [ 1137.127013] [<ffffffff81242bf1>] block_ioctl+0x41/0x50 [ 1137.127013] [<ffffffff81214c96>] do_vfs_ioctl+0x96/0x6e0 [ 1137.127013] [<ffffffff81215321>] SyS_ioctl+0x41/0x70 [ 1137.127013] [<ffffffff81c56825>] entry_SYSCALL_64_fastpath+0x18/0xa8 [ 1137.127013] other info that might help us debug this: [ 1137.127013] Possible unsafe locking scenario: [ 1137.127013] CPU0 CPU1 [ 1137.127013] ---- ---- [ 1137.127013] lock(detected_devices_mutex); [ 1137.127013] lock(&bdev->bd_mutex); [ 1137.127013] lock(detected_devices_mutex); [ 1137.127013] lock(&bdev->bd_mutex); [ 1137.127013] *** DEADLOCK *** Cc: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-09-21md-cluster: clean related infos of clusterGuoqing Jiang
cluster_info and bitmap_info.nodes also need to be cleared when array is stopped. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-09-21md: changes for MD_STILL_CLOSED flagGuoqing Jiang
When stop clustered raid while it is pending on resync, MD_STILL_CLOSED flag could be cleared since udev rule is triggered to open the mddev. So obviously array can't be stopped soon and returns EBUSY. mdadm -Ss md-raid-arrays.rules set MD_STILL_CLOSED md_open() ... ... ... clear MD_STILL_CLOSED do_md_stop We make below changes to resolve this issue: 1. rename MD_STILL_CLOSED to MD_CLOSING since it is set when stop array and it means we are stopping array. 2. let md_open returns early if CLOSING is set, so no other threads will open array if one thread is trying to close it. 3. no need to clear CLOSING bit in md_open because 1 has ensure the bit is cleared, then we also don't need to test CLOSING bit in do_md_stop. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-09-21md-cluster: call md_kick_rdev_from_array once ack failedGuoqing Jiang
The new_disk_ack could return failure if WAITING_FOR_NEWDISK is not set, so we need to kick the dev from array in case failure happened. And we missed to check err before call new_disk_ack othwise we could kick a rdev which isn't in array, thanks for the reminder from Shaohua. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-09-08md-cluster: make md-cluster also can work when compiled into kernelGuoqing Jiang
The md-cluster is compiled as module by default, if it is compiled by built-in way, then we can't make md-cluster works. [64782.630008] md/raid1:md127: active with 2 out of 2 mirrors [64782.630528] md-cluster module not found. [64782.630530] md127: Could not setup cluster service (-2) Fixes: edb39c9 ("Introduce md_cluster_operations to handle cluster functions") Cc: stable@vger.kernel.org (v4.1+) Reported-by: Marc Smith <marc.smith@mcc.edu> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-08-30Merge tag 'md/4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds
Pull MD fixes from Shaohua Li: "This includes several bug fixes: - Alexey Obitotskiy fixed a hang for faulty raid5 array with external management - Song Liu fixed two raid5 journal related bugs - Tomasz Majchrzak fixed a bad block recording issue and an accounting issue for raid10 - ZhengYuan Liu fixed an accounting issue for raid5 - I fixed a potential race condition and memory leak with DIF/DIX enabled - other trival fixes" * tag 'md/4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: raid5: avoid unnecessary bio data set raid5: fix memory leak of bio integrity data raid10: record correct address of bad block md-cluster: fix error return code in join() r5cache: set MD_JOURNAL_CLEAN correctly md: don't print the same repeated messages about delayed sync operation md: remove obsolete ret in md_start_sync md: do not count journal as spare in GET_ARRAY_INFO md: Prevent IO hold during accessing to faulty raid5 array MD: hold mddev lock to change bitmap location raid5: fix incorrectly counter of conf->empty_inactive_list_nr raid10: increment write counter after bio is split
2016-08-24r5cache: set MD_JOURNAL_CLEAN correctlySong Liu
Currently, the code sets MD_JOURNAL_CLEAN when the array has MD_FEATURE_JOURNAL and the recovery_cp is MaxSector. The array will be MD_JOURNAL_CLEAN even if the journal device is missing. With this patch, the MD_JOURNAL_CLEAN is only set when the journal device presents. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-08-17md: don't print the same repeated messages about delayed sync operationArtur Paszkiewicz
This fixes a long-standing bug that caused a flood of messages like: "md: delaying data-check of md1 until md2 has finished (they share one or more physical units)" It can be reproduced like this: 1. Create at least 3 raid1 arrays on a pair of disks, each on different partitions. 2. Request a sync operation like 'check' or 'repair' on 2 arrays by writing to their md/sync_action attribute files. One operation should start and one should be delayed and a message like the above will be printed. 3. Issue a write to the third array. Each write will cause 2 copies of the message to be printed. This happens when wake_up(&resync_wait) is called, usually by md_check_recovery(). Then the delayed sync thread again prints the message and is put to sleep. This patch adds a check in md_do_sync() to prevent printing this message more than once for the same pair of devices. Reported-by: Sven Koehler <sven.koehler@gmail.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=151801 Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-08-17md: remove obsolete ret in md_start_syncGuoqing Jiang
The ret is not needed anymore since we have already move resync_start into md_do_sync in commit 41a9a0d. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-08-16md: do not count journal as spare in GET_ARRAY_INFOSong Liu
GET_ARRAY_INFO counts journal as spare (spare_disks), which is not accurate. This patch fixes this. Reported-by: Yi Zhang <yizhan@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-08-07block: rename bio bi_rw to bi_opfJens Axboe
Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower portion and the op code in the higher portions. This means that old code that relies on manually setting bi_rw is most likely going to be broken. Instead of letting that brokeness linger, rename the member, to force old and out-of-tree code to break at compile time instead of at runtime. No intended functional changes in this commit. Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-28Merge branch 'mymd/for-next' into mymd/for-linusShaohua Li
2016-07-28MD: fix null pointer deferenceShaohua Li
The md device might not have personality (for example, ddf raid array). The issue is introduced by 8430e7e0af9a15(md: disconnect device from personality before trying to remove it) Reported-by: kernel test robot <xiaolong.ye@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-07-19md: add missing sysfs_notify on array_state updateTomasz Majchrzak
Changeset 6791875e2e53 has added early return from a function so there is no sysfs notification for 'active' and 'clean' state change. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-07-19Fix kernel module refcount handlingAlexey Obitotskiy
md loads raidX modules and increments module refcount each time level has changed but does not decrement it. You are unable to unload raid0 module after reshape because raid0 reshape changes level to raid4 and back to raid0. Signed-off-by: Aleksey Obitotskiy <aleksey.obitotskiy@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-07-19md: use seconds granularity for error loggingArnd Bergmann
The md code stores the exact time of the last error in the last_read_error variable using a timespec structure. It only ever uses the seconds portion of that though, so we can use a scalar for it. There won't be an overflow in 2038 here, because it already used monotonic time and 32-bit is enough for that, but I've decided to use time64_t for consistency in the conversion. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-13md: reduce the number of synchronize_rcu() calls when multiple devices fail.NeilBrown
Every time a device is removed with ->hot_remove_disk() a synchronize_rcu() call is made which can delay several milliseconds in some case. If lots of devices fail at once - as could happen with a large RAID10 where one set of devices are removed all at once - these delays can add up to be very inconcenient. As failure is not reversible we can check for that first, setting a separate flag if it is found, and then all synchronize_rcu() once for all the flagged devices. Then ->hot_remove_disk() function can skip the synchronize_rcu() step if the flag is set. fix build error(Shaohua) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-13md: disconnect device from personality before trying to remove it.NeilBrown
When the HOT_REMOVE_DISK ioctl is used to remove a device, we call remove_and_add_spares() which will remove it from the personality if possible. This improves the chances that the removal will succeed. When writing "remove" to dev-XX/state, we don't. So that can fail more easily. So add the remove_and_add_spares() into "remove" handling. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-13MD:Update superblock when err == 0 in size_storeXiao Ni
This is a simple check before updating the superblock. It should update the superblock when update_size return 0. Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-09md: use a mutex to protect a global listCong Wang
We saw a list corruption in the list all_detected_devices: WARNING: CPU: 16 PID: 226 at lib/list_debug.c:29 __list_add+0x3c/0xa9() list_add corruption. next->prev should be prev (ffff880859d58320), but was ffff880859ce74c0. (next=ffffffff81abfdb0). Modules linked in: ahci libahci libata sd_mod scsi_mod CPU: 16 PID: 226 Comm: kworker/u241:4 Not tainted 4.1.20 #1 Hardware name: Dell Inc. PowerEdge C6220/04GD66, BIOS 2.2.3 11/07/2013 Workqueue: events_unbound async_run_entry_fn 0000000000000000 ffff880859a5baf8 ffffffff81502872 ffff880859a5bb48 0000000000000009 ffff880859a5bb38 ffffffff810692a5 ffff880859ee8828 ffffffff812ad02c ffff880859d58320 ffffffff81abfdb0 ffff880859eb90c0 Call Trace: [<ffffffff81502872>] dump_stack+0x4d/0x63 [<ffffffff810692a5>] warn_slowpath_common+0xa1/0xbb [<ffffffff812ad02c>] ? __list_add+0x3c/0xa9 [<ffffffff81069305>] warn_slowpath_fmt+0x46/0x48 [<ffffffff812ad02c>] __list_add+0x3c/0xa9 [<ffffffff81406f28>] md_autodetect_dev+0x41/0x62 [<ffffffff81285862>] rescan_partitions+0x25f/0x29d [<ffffffff81506372>] ? mutex_lock+0x13/0x31 [<ffffffff811a090f>] __blkdev_get+0x1aa/0x3cd [<ffffffff811a0b91>] blkdev_get+0x5f/0x294 [<ffffffff81377ceb>] ? put_device+0x17/0x19 [<ffffffff8128227c>] ? disk_put_part+0x12/0x14 [<ffffffff812836f3>] add_disk+0x29d/0x407 [<ffffffff81384345>] ? __pm_runtime_use_autosuspend+0x5c/0x64 [<ffffffffa004a724>] sd_probe_async+0x115/0x1af [sd_mod] [<ffffffff81083177>] async_run_entry_fn+0x72/0x12c [<ffffffff8107c44c>] process_one_work+0x198/0x2ce [<ffffffff8107cac7>] worker_thread+0x1dd/0x2bb [<ffffffff8107c8ea>] ? cancel_delayed_work_sync+0x15/0x15 [<ffffffff8107c8ea>] ? cancel_delayed_work_sync+0x15/0x15 [<ffffffff81080d9c>] kthread+0xae/0xb6 [<ffffffff81080000>] ? param_array_set+0x40/0xfa [<ffffffff81080cee>] ? __kthread_parkme+0x61/0x61 [<ffffffff81508152>] ret_from_fork+0x42/0x70 [<ffffffff81080cee>] ? __kthread_parkme+0x61/0x61 I suspect it is because there is no lock protecting this global list, autostart_arrays() is called in ioctl() path where there is no lock. Cc: Shaohua Li <shli@kernel.org> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-07block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSHMike Christie
To avoid confusion between REQ_OP_FLUSH, which is handled by request_fn drivers, and upper layers requesting the block layer perform a flush sequence along with possibly a WRITE, this patch renames REQ_FLUSH to REQ_PREFLUSH. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07md: use bio op accessorsMike Christie
Separate the op from the rq_flag_bits and have md set/get the bio using bio_set_op_attrs/bio_op. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block/fs/drivers: remove rw argument from submit_bioMike Christie
This has callers of submit_bio/submit_bio_wait set the bio->bi_rw instead of passing it in. This makes that use the same as generic_make_request and how we set the other bio fields. Signed-off-by: Mike Christie <mchristi@redhat.com> Fixed up fs/ext4/crypto.c Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-03md: simplify the code with md_kick_rdev_from_arrayGuoqing Jiang
Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-03md-cluster: fix deadlock issue when add disk to an recoverying arrayGuoqing Jiang
Add a disk to an array which is performing recovery is a little complicated, we need to do both reap the sync thread and perform add disk for the case, then it caused deadlock as follows. linux44:~ # ps aux|grep md|grep D root 1822 0.0 0.0 0 0 ? D 16:50 0:00 [md127_resync] root 1848 0.0 0.0 19860 952 pts/0 D+ 16:50 0:00 mdadm --manage /dev/md127 --re-add /dev/vdb linux44:~ # cat /proc/1848/stack [<ffffffff8107afde>] kthread_stop+0x6e/0x120 [<ffffffffa051ddb0>] md_unregister_thread+0x40/0x80 [md_mod] [<ffffffffa0526e45>] md_reap_sync_thread+0x15/0x150 [md_mod] [<ffffffffa05271e0>] action_store+0x260/0x270 [md_mod] [<ffffffffa05206b4>] md_attr_store+0xb4/0x100 [md_mod] [<ffffffff81214a7e>] sysfs_write_file+0xbe/0x140 [<ffffffff811a6b98>] vfs_write+0xb8/0x1e0 [<ffffffff811a75b8>] SyS_write+0x48/0xa0 [<ffffffff8152a5c9>] system_call_fastpath+0x16/0x1b [<00007f068ea1ed30>] 0x7f068ea1ed30 linux44:~ # cat /proc/1822/stack [<ffffffffa05251a6>] md_do_sync+0x846/0xf40 [md_mod] [<ffffffffa052402d>] md_thread+0x16d/0x180 [md_mod] [<ffffffff8107ad94>] kthread+0xb4/0xc0 [<ffffffff8152a518>] ret_from_fork+0x58/0x90 Task1848 Task1822 md_attr_store (held reconfig_mutex by call mddev_lock()) action_store md_reap_sync_thread md_unregister_thread kthread_stop md_wakeup_thread(mddev->thread); wait_event(mddev->sb_wait, !test_bit(MD_CHANGE_PENDING)) md_check_recovery is triggered by wakeup mddev->thread, but it can't clear MD_CHANGE_PENDING flag since it can't get lock which was held by md_attr_store already. To solve the deadlock problem, we move "->resync_finish()" from md_do_sync to md_reap_sync_thread (after md_update_sb), also MD_HELD_RESYNC_LOCK is introduced since it is possible that node can't get resync lock in md_do_sync. Then we do not need to wait for MD_CHANGE_PENDING is cleared or not since metadata should be updated after md_update_sb, so just call resync_finish if MD_HELD_RESYNC_LOCK is set. We also unified the code after skip label, since set PENDING for non-clustered case should be harmless. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-19Merge tag 'md/4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds
Pull MD updates from Shaohua Li: "Several patches from Guoqing fixing md-cluster bugs and several patches from Heinz fixing dm-raid bugs" * tag 'md/4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: md-cluster: check the return value of process_recvd_msg md-cluster: gather resync infos and enable recv_thread after bitmap is ready md: set MD_CHANGE_PENDING in a atomic region md: raid5: add prerequisite to run underneath dm-raid md: raid10: add prerequisite to run underneath dm-raid md: md.c: fix oops in mddev_suspend for raid0 md-cluster: fix ifnullfree.cocci warnings md-cluster/bitmap: unplug bitmap to sync dirty pages to disk md-cluster/bitmap: fix wrong page num in bitmap_file_clear_bit and bitmap_file_set_bit md-cluster/bitmap: fix wrong calcuation of offset md-cluster: sync bitmap when node received RESYNCING msg md-cluster: always setup in-memory bitmap md-cluster: wakeup thread if activated a spare disk md-cluster: change array_sectors and update size are not supported md-cluster: fix locking when node joins cluster during message broadcast md-cluster: unregister thread if err happened md-cluster: wake up thread to continue recovery md-cluser: make resync_finish only called after pers->sync_request md-cluster: change resync lock from asynchronous to synchronous
2016-05-17Merge branch 'for-4.7/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block driver updates from Jens Axboe: "On top of the core pull request, this is the drivers pull request for this merge window. This contains: - Switch drivers to the new write back cache API, and kill off the flush flags. From me. - Kill the discard support for the STEC pci-e flash driver. It's trivially broken, and apparently unmaintained, so it's safer to just remove it. From Jeff Moyer. - A set of lightnvm updates from the usual suspects (Matias/Javier, and Simon), and fixes from Arnd, Jeff Mahoney, Sagi, and Wenwei Tao. - A set of updates for NVMe: - Turn the controller state management into a proper state machine. From Christoph. - Shuffling of code in preparation for NVMe-over-fabrics, also from Christoph. - Cleanup of the command prep part from Ming Lin. - Rewrite of the discard support from Ming Lin. - Deadlock fix for namespace removal from Ming Lin. - Use the now exported blk-mq tag helper for IO termination. From Sagi. - Various little fixes from Christoph, Guilherme, Keith, Ming Lin, Wang Sheng-Hui. - Convert mtip32xx to use the now exported blk-mq tag iter function, from Keith" * 'for-4.7/drivers' of git://git.kernel.dk/linux-block: (74 commits) lightnvm: reserved space calculation incorrect lightnvm: rename nr_pages to nr_ppas on nvm_rq lightnvm: add is_cached entry to struct ppa_addr lightnvm: expose gennvm_mark_blk to targets lightnvm: remove mgt targets on mgt removal lightnvm: pass dma address to hardware rather than pointer lightnvm: do not assume sequential lun alloc. nvme/lightnvm: Log using the ctrl named device lightnvm: rename dma helper functions lightnvm: enable metadata to be sent to device lightnvm: do not free unused metadata on rrpc lightnvm: fix out of bound ppa lun id on bb tbl lightnvm: refactor set_bb_tbl for accepting ppa list lightnvm: move responsibility for bad blk mgmt to target lightnvm: make nvm_set_rqd_ppalist() aware of vblks lightnvm: remove struct factory_blks lightnvm: refactor device ops->get_bb_tbl() lightnvm: introduce nvm_for_each_lun_ppa() macro lightnvm: refactor dev->online_target to global nvm_targets lightnvm: rename nvm_targets to nvm_tgt_type ...
2016-05-09md: set MD_CHANGE_PENDING in a atomic regionGuoqing Jiang
Some code waits for a metadata update by: 1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN) 2. setting MD_CHANGE_PENDING and waking the management thread 3. waiting for MD_CHANGE_PENDING to be cleared If the first two are done without locking, the code in md_update_sb() which checks if it needs to repeat might test if an update is needed before step 1, then clear MD_CHANGE_PENDING after step 2, resulting in the wait returning early. So make sure all places that set MD_CHANGE_PENDING are atomicial, and bit_clear_unless (suggested by Neil) is introduced for the purpose. Cc: Martin Kepplinger <martink@posteo.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: <linux-kernel@vger.kernel.org> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-09md: md.c: fix oops in mddev_suspend for raid0Heinz Mauelshagen
Introduced by upstream commit 70d9798b95562abac005d4ba71d28820f9a201eb The raid0 personality does not create mddev->thread as oposed to other personalities leading to its unconditional access in mddev_suspend() causing an oops. Patch checks for mddev->thread in order to keep the intention of aforementioned commit. Fixes: 70d9798b9556 ("MD: warn for potential deadlock") Cc: stable@vger.kernel.org (4.5+) Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-04md-cluster: wakeup thread if activated a spare diskGuoqing Jiang
When a device is re-added, it will ultimately need to be activated and that happens in md_check_recovery, so we need to set MD_RECOVERY_NEEDED right after remove_and_add_spares. A specifical issue without the change is that when one node perform fail/remove/readd on a disk, but slave nodes could not add the disk back to array as expected (added as missed instead of in sync). So give slave nodes a chance to do resync. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-04md-cluster: change array_sectors and update size are not supportedGuoqing Jiang
Currently, some features are not supported yet, such as change array_sectors and update size, so return EINVAL for them and listed it in document. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-04md-cluser: make resync_finish only called after pers->sync_requestGuoqing Jiang
It is not reasonable that cluster raid to release resync lock before the last pers->sync_request has finished. As the metadata will be changed when node performs resync, we need to inform other nodes to update metadata, so the MD_CHANGE_PENDING flag is set before finish resync. Then metadata_update_finish is move ahead to ensure that METADATA_UPDATED msg is sent before finish resync, and metadata_update_start need to be run after "repeat:" label accordingly. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-04md-cluster: change resync lock from asynchronous to synchronousGuoqing Jiang
If multiple nodes choose to attempt do resync at the same time they need to be serialized so they don't duplicate effort. This serialization is done by locking the 'resync' DLM lock. Currently if a node cannot get the lock immediately it doesn't request notification when the lock becomes available (i.e. DLM_LKF_NOQUEUE is set), so it may not reliably find out when it is safe to try again. Rather than trying to arrange an async wake-up when the lock becomes available, switch to using synchronous locking - this is a lot easier to think about. As it is not permitted to block in the 'raid1d' thread, move the locking to the resync thread. So the rsync thread is forked immediately, but it blocks until the resync lock is available. Once the lock is locked it checks again if any resync action is needed. A particular symptom of the current problem is that a node can get stuck with "resync=pending" indefinitely. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-04-25MD: make bio mergeableShaohua Li
blk_queue_split marks bio unmergeable, which makes sense for normal bio. But if dispatching the bio to underlayer disk, the blk_queue_split checks are invalid, hence it's possible the bio becomes mergeable. In the reported bug, this bug causes trim against raid0 performance slash https://bugzilla.kernel.org/show_bug.cgi?id=117051 Reported-and-tested-by: Park Ju Hyung <qkrwngud825@gmail.com> Fixes: 6ac45aeb6bca(block: avoid to merge splitted bio) Cc: stable@vger.kernel.org (v4.3+) Cc: Ming Lei <ming.lei@canonical.com> Cc: Neil Brown <neilb@suse.de> Reviewed-by: Jens Axboe <axboe@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-04-12md: update to using blk_queue_write_cache()Jens Axboe
Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de>