summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2020-11-30md/cluster: block reshape with remote resync jobZhao Heming
Reshape request should be blocked with ongoing resync job. In cluster env, a node can start resync job even if the resync cmd isn't executed on it, e.g., user executes "mdadm --grow" on node A, sometimes node B will start resync job. However, current update_raid_disks() only check local recovery status, which is incomplete. As a result, we see user will execute "mdadm --grow" successfully on local, while the remote node deny to do reshape job when it doing resync job. The inconsistent handling cause array enter unexpected status. If user doesn't observe this issue and continue executing mdadm cmd, the array doesn't work at last. Fix this issue by blocking reshape request. When node executes "--grow" and detects ongoing resync, it should stop and report error to user. The following script reproduces the issue with ~100% probability. (two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB) ``` # on node1, node2 is the remote node. ssh root@node2 "mdadm -S --scan" mdadm -S --scan for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \ count=20; done mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh" sleep 5 mdadm --manage --add /dev/md0 /dev/sdi mdadm --wait /dev/md0 mdadm --grow --raid-devices=3 /dev/md0 mdadm /dev/md0 --fail /dev/sdg mdadm /dev/md0 --remove /dev/sdg mdadm --grow --raid-devices=2 /dev/md0 ``` Cc: stable@vger.kernel.org Signed-off-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-11-30md: use current request time as base for ktime comparisonsPankaj Gupta
Request coalescing logic uses 'prev_flush_start' as base to compare the current request start time. 'prev_flush_start' is updated in other context. This patch changes this by using ktime comparison base to 'req_start' for better readability of code. Signed-off-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-11-30md: add comments in md_flush_request()Pankaj Gupta
Request coalescing logic is dependent on flush time update in other context. This patch adds comments to understand the code flow better. Signed-off-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-11-30md: improve variable names in md_flush_request()Pankaj Gupta
This patch improves readability by using better variable names in flush request coalescing logic. Signed-off-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-11-30md/raid10: initialize r10_bio->read_slot before use.Kevin Vigor
In __make_request() a new r10bio is allocated and passed to raid10_read_request(). The read_slot member of the bio is not initialized, and the raid10_read_request() uses it to index an array. This leads to occasional panics. Fix by initializing the field to invalid value and checking for valid value in raid10_read_request(). Cc: stable@vger.kernel.org Signed-off-by: Kevin Vigor <kvigor@gmail.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-11-30md: fix a warning caused by a race between concurrent md_ioctl()sDae R. Jeong
Syzkaller reports a warning as belows. WARNING: CPU: 0 PID: 9647 at drivers/md/md.c:7169 ... Call Trace: ... RIP: 0010:md_ioctl+0x4017/0x5980 drivers/md/md.c:7169 RSP: 0018:ffff888096027950 EFLAGS: 00010293 RAX: ffff88809322c380 RBX: 0000000000000932 RCX: ffffffff84e266f2 RDX: 0000000000000000 RSI: ffffffff84e299f7 RDI: 0000000000000007 RBP: ffff888096027bc0 R08: ffff88809322c380 R09: ffffed101341a482 R10: ffff888096027940 R11: ffff88809a0d240f R12: 0000000000000932 R13: ffff8880a2c14100 R14: ffff88809a0d2268 R15: ffff88809a0d2408 __blkdev_driver_ioctl block/ioctl.c:304 [inline] blkdev_ioctl+0xece/0x1c10 block/ioctl.c:606 block_ioctl+0xee/0x130 fs/block_dev.c:1930 vfs_ioctl fs/ioctl.c:46 [inline] file_ioctl fs/ioctl.c:509 [inline] do_vfs_ioctl+0xd5f/0x1380 fs/ioctl.c:696 ksys_ioctl+0xab/0xd0 fs/ioctl.c:713 __do_sys_ioctl fs/ioctl.c:720 [inline] __se_sys_ioctl fs/ioctl.c:718 [inline] __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe This is caused by a race between two concurrenct md_ioctl()s closing the array. CPU1 (md_ioctl()) CPU2 (md_ioctl()) ------ ------ set_bit(MD_CLOSING, &mddev->flags); did_set_md_closing = true; WARN_ON_ONCE(test_bit(MD_CLOSING, &mddev->flags)); if(did_set_md_closing) clear_bit(MD_CLOSING, &mddev->flags); Fix the warning by returning immediately if the MD_CLOSING bit is set in &mddev->flags which indicates that the array is being closed. Fixes: 065e519e71b2 ("md: MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stop") Reported-by: syzbot+1e46a0864c1a6e9bd3d8@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Signed-off-by: Dae R. Jeong <dae.r.jeong@kaist.ac.kr> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-11-17dm writecache: fix the maximum number of argumentsMikulas Patocka
Advance the maximum number of arguments to 16. This fixes issue where certain operations, combined with table configured args, exceed 10 arguments. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: 48debafe4f2f ("dm: add writecache target") Cc: stable@vger.kernel.org # v4.18+ Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-11-17dm writecache: advance the number of arguments when reporting max_ageMikulas Patocka
When reporting the "max_age" value the number of arguments must advance by two. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: 3923d4854e18 ("dm writecache: implement gradual cleanup") Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-11-17dm integrity: don't use drivers that have CRYPTO_ALG_ALLOCATES_MEMORYMikulas Patocka
Don't use crypto drivers that have the flag CRYPTO_ALG_ALLOCATES_MEMORY set. These drivers allocate memory and thus they are not suitable for block I/O processing. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-11-16md: remove a spurious call to revalidate_disk_size in update_sizeChristoph Hellwig
None of the ->resize methods updates the disk size, so calling revalidate_disk_size here won't do anything. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-16md: use set_capacity_and_notifyChristoph Hellwig
Use set_capacity_and_notify to set the size of both the disk and block device. This also gets the uevent notifications for the resize for free. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-16dm-raid: use set_capacity_and_notifyChristoph Hellwig
Use set_capacity_and_notify to set the size of both the disk and block device. This also gets the uevent notifications for the resize for free. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-16dm: use set_capacity_and_notifyChristoph Hellwig
Use set_capacity_and_notify to set the size of both the disk and block device. This also gets the uevent notifications for the resize for free. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-16md: use __register_blkdev to allocate devices on demandChristoph Hellwig
Use the simpler mechanism attached to major_name to allocate a md device when a currently unregistered minor is accessed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-16block: remove __blkdev_driver_ioctlChristoph Hellwig
Just open code it in the few callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-16md: implement ->set_read_only to hook into BLKROSET processingChristoph Hellwig
Implement the ->set_read_only method instead of parsing the actual ioctl command. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-14Merge tag 'for-5.10/dm-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Improve DM core's bio splitting to use blk_max_size_offset(). Also fix bio splitting for bios that were deferred to the worker thread due to a DM device being suspended. - Remove DM core's special handling of NVMe devices now that block core has internalized efficiencies drivers previously needed to be concerned about (via now removed direct_make_request). - Fix request-based DM to not bounce through indirect dm_submit_bio; instead have block core make direct call to blk_mq_submit_bio(). - Various DM core cleanups to simplify and improve code. - Update DM cryot to not use drivers that set CRYPTO_ALG_ALLOCATES_MEMORY. - Fix DM raid's raid1 and raid10 discard limits for the purposes of linux-stable. But then remove DM raid's discard limits settings now that MD raid can efficiently handle large discards. - A couple small cleanups across various targets. * tag 'for-5.10/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: fix request-based DM to not bounce through indirect dm_submit_bio dm: remove special-casing of bio-based immutable singleton target on NVMe dm: export dm_copy_name_and_uuid dm: fix comment in __dm_suspend() dm: fold dm_process_bio() into dm_submit_bio() dm: fix missing imposition of queue_limits from dm_wq_work() thread dm snap persistent: simplify area_io() dm thin metadata: Remove unused local variable when create thin and snap dm raid: remove unnecessary discard limits for raid10 dm raid: fix discard limits for raid1 and raid10 dm crypt: don't use drivers that have CRYPTO_ALG_ALLOCATES_MEMORY dm: use dm_table_get_device_name() where appropriate in targets dm table: make 'struct dm_table' definition accessible to all of DM core dm: eliminate need for start_io_acct() forward declaration dm: simplify __process_abnormal_io() dm: push use of on-stack flush_bio down to __send_empty_flush() dm: optimize max_io_len() by inlining max_io_len_target_boundary() dm: push md->immutable_target optimization down to __process_bio() dm: change max_io_len() to use blk_max_size_offset() dm table: stack 'chunk_sectors' limit to account for target-specific splitting
2020-10-13Merge tag 'drivers-5.10-2020-10-12' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block driver updates from Jens Axboe: "Here are the driver updates for 5.10. A few SCSI updates in here too, in coordination with Martin as they depend on core block changes for the shared tag bitmap. This contains: - NVMe pull requests via Christoph: - fix keep alive timer modification (Amit Engel) - order the PCI ID list more sensibly (Andy Shevchenko) - cleanup the open by controller helper (Chaitanya Kulkarni) - use an xarray for the CSE log lookup (Chaitanya Kulkarni) - support ZNS in nvmet passthrough mode (Chaitanya Kulkarni) - fix nvme_ns_report_zones (Christoph Hellwig) - add a sanity check to nvmet-fc (James Smart) - fix interrupt allocation when too many polled queues are specified (Jeffle Xu) - small nvmet-tcp optimization (Mark Wunderlich) - fix a controller refcount leak on init failure (Chaitanya Kulkarni) - misc cleanups (Chaitanya Kulkarni) - major refactoring of the scanning code (Christoph Hellwig) - MD updates via Song: - Bug fixes in bitmap code, from Zhao Heming - Fix a work queue check, from Guoqing Jiang - Fix raid5 oops with reshape, from Song Liu - Clean up unused code, from Jason Yan - Discard improvements, from Xiao Ni - raid5/6 page offset support, from Yufen Yu - Shared tag bitmap for SCSI/hisi_sas/null_blk (John, Kashyap, Hannes) - null_blk open/active zone limit support (Niklas) - Set of bcache updates (Coly, Dongsheng, Qinglang)" * tag 'drivers-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (78 commits) md/raid5: fix oops during stripe resizing md/bitmap: fix memory leak of temporary bitmap md: fix the checking of wrong work queue md/bitmap: md_bitmap_get_counter returns wrong blocks md/bitmap: md_bitmap_read_sb uses wrong bitmap blocks md/raid0: remove unused function is_io_in_chunk_boundary() nvme-core: remove extra condition for vwc nvme-core: remove extra variable nvme: remove nvme_identify_ns_list nvme: refactor nvme_validate_ns nvme: move nvme_validate_ns nvme: query namespace identifiers before adding the namespace nvme: revalidate zone bitmaps in nvme_update_ns_info nvme: remove nvme_update_formats nvme: update the known admin effects nvme: set the queue limits in nvme_update_ns_info nvme: remove the 0 lba_shift check in nvme_update_ns_info nvme: clean up the check for too large logic block sizes nvme: freeze the queue over ->lba_shift updates nvme: factor out a nvme_configure_metadata helper ...
2020-10-13Merge tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: - Series of merge handling cleanups (Baolin, Christoph) - Series of blk-throttle fixes and cleanups (Baolin) - Series cleaning up BDI, seperating the block device from the backing_dev_info (Christoph) - Removal of bdget() as a generic API (Christoph) - Removal of blkdev_get() as a generic API (Christoph) - Cleanup of is-partition checks (Christoph) - Series reworking disk revalidation (Christoph) - Series cleaning up bio flags (Christoph) - bio crypt fixes (Eric) - IO stats inflight tweak (Gabriel) - blk-mq tags fixes (Hannes) - Buffer invalidation fixes (Jan) - Allow soft limits for zone append (Johannes) - Shared tag set improvements (John, Kashyap) - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel) - DM no-wait support (Mike, Konstantin) - Request allocation improvements (Ming) - Allow md/dm/bcache to use IO stat helpers (Song) - Series improving blk-iocost (Tejun) - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang, Xianting, Yang, Yufen, yangerkun) * tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits) block: fix uapi blkzoned.h comments blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue blk-mq: get rid of the dead flush handle code path block: get rid of unnecessary local variable block: fix comment and add lockdep assert blk-mq: use helper function to test hw stopped block: use helper function to test queue register block: remove redundant mq check block: invoke blk_mq_exit_sched no matter whether have .exit_sched percpu_ref: don't refer to ref->data if it isn't allocated block: ratelimit handle_bad_sector() message blk-throttle: Re-use the throtl_set_slice_end() blk-throttle: Open code __throtl_de/enqueue_tg() blk-throttle: Move service tree validation out of the throtl_rb_first() blk-throttle: Move the list operation after list validation blk-throttle: Fix IO hang for a corner case blk-throttle: Avoid tracking latency if low limit is invalid blk-throttle: Avoid getting the current time if tg->last_finish_time is 0 blk-throttle: Remove a meaningless parameter for throtl_downgrade_state() block: Remove redundant 'return' statement ...
2020-10-12Merge tag 'ras_updates_for_v5.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS updates from Borislav Petkov: - Extend the recovery from MCE in kernel space also to processes which encounter an MCE in kernel space but while copying from user memory by sending them a SIGBUS on return to user space and umapping the faulty memory, by Tony Luck and Youquan Song. - memcpy_mcsafe() rework by splitting the functionality into copy_mc_to_user() and copy_mc_to_kernel(). This, as a result, enables support for new hardware which can recover from a machine check encountered during a fast string copy and makes that the default and lets the older hardware which does not support that advance recovery, opt in to use the old, fragile, slow variant, by Dan Williams. - New AMD hw enablement, by Yazen Ghannam and Akshay Gupta. - Do not use MSR-tracing accessors in #MC context and flag any fault while accessing MCA architectural MSRs as an architectural violation with the hope that such hw/fw misdesigns are caught early during the hw eval phase and they don't make it into production. - Misc fixes, improvements and cleanups, as always. * tag 'ras_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Allow for copy_mc_fragile symbol checksum to be generated x86/mce: Decode a kernel instruction to determine if it is copying from user x86/mce: Recover from poison found while copying from user space x86/mce: Avoid tail copy when machine check terminated a copy from user x86/mce: Add _ASM_EXTABLE_CPY for copy user access x86/mce: Provide method to find out the type of an exception handler x86/mce: Pass pointer to saved pt_regs to severity calculation routines x86/copy_mc: Introduce copy_mc_enhanced_fast_string() x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}() x86/mce: Drop AMD-specific "DEFERRED" case from Intel severity rule list x86/mce: Add Skylake quirk for patrol scrub reported errors RAS/CEC: Convert to DEFINE_SHOW_ATTRIBUTE() x86/mce: Annotate mce_rd/wrmsrl() with noinstr x86/mce/dev-mcelog: Do not update kflags on AMD systems x86/mce: Stop mce_reign() from re-computing severity for every CPU x86/mce: Make mce_rdmsrl() panic on an inaccessible MSR x86/mce: Increase maximum number of banks to 64 x86/mce: Delay clearing IA32_MCG_STATUS to the end of do_machine_check() x86/MCE/AMD, EDAC/mce_amd: Remove struct smca_hwid.xec_bitmap RAS/CEC: Fix cec_init() prototype
2020-10-08md/raid5: fix oops during stripe resizingSong Liu
KoWei reported crash during raid5 reshape: [ 1032.252932] Oops: 0002 [#1] SMP PTI [...] [ 1032.252943] RIP: 0010:memcpy_erms+0x6/0x10 [...] [ 1032.252947] RSP: 0018:ffffba1ac0c03b78 EFLAGS: 00010286 [ 1032.252949] RAX: 0000784ac0000000 RBX: ffff91bec3d09740 RCX: 0000000000001000 [ 1032.252951] RDX: 0000000000001000 RSI: ffff91be6781c000 RDI: 0000784ac0000000 [ 1032.252953] RBP: ffffba1ac0c03bd8 R08: 0000000000001000 R09: ffffba1ac0c03bf8 [ 1032.252954] R10: 0000000000000000 R11: 0000000000000000 R12: ffffba1ac0c03bf8 [ 1032.252955] R13: 0000000000001000 R14: 0000000000000000 R15: 0000000000000000 [ 1032.252958] FS: 0000000000000000(0000) GS:ffff91becf500000(0000) knlGS:0000000000000000 [ 1032.252959] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1032.252961] CR2: 0000784ac0000000 CR3: 000000031780a002 CR4: 00000000001606e0 [ 1032.252962] Call Trace: [ 1032.252969] ? async_memcpy+0x179/0x1000 [async_memcpy] [ 1032.252977] ? raid5_release_stripe+0x8e/0x110 [raid456] [ 1032.252982] handle_stripe_expansion+0x15a/0x1f0 [raid456] [ 1032.252988] handle_stripe+0x592/0x1270 [raid456] [ 1032.252993] handle_active_stripes.isra.0+0x3cb/0x5a0 [raid456] [ 1032.252999] raid5d+0x35c/0x550 [raid456] [ 1032.253002] ? schedule+0x42/0xb0 [ 1032.253006] ? schedule_timeout+0x10e/0x160 [ 1032.253011] md_thread+0x97/0x160 [ 1032.253015] ? wait_woken+0x80/0x80 [ 1032.253019] kthread+0x104/0x140 [ 1032.253022] ? md_start_sync+0x60/0x60 [ 1032.253024] ? kthread_park+0x90/0x90 [ 1032.253027] ret_from_fork+0x35/0x40 This is because cache_size_mutex was unlocked too early in resize_stripes, which races with grow_one_stripe() that grow_one_stripe() allocates a stripe with wrong pool_size. Fix this issue by unlocking cache_size_mutex after updating pool_size. Cc: <stable@vger.kernel.org> # v4.4+ Reported-by: KoWei Sung <winders@amazon.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-10-08md/bitmap: fix memory leak of temporary bitmapZhao Heming
Callers of get_bitmap_from_slot() are responsible to free the bitmap. Suggested-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-10-08md: fix the checking of wrong work queueGuoqing Jiang
It should check md_rdev_misc_wq instead of md_misc_wq. Fixes: cc1ffe61c026 ("md: add new workqueue for delete rdev") Cc: <stable@vger.kernel.org> # v5.8+ Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-10-08md/bitmap: md_bitmap_get_counter returns wrong blocksZhao Heming
md_bitmap_get_counter() has code: ``` if (bitmap->bp[page].hijacked || bitmap->bp[page].map == NULL) csize = ((sector_t)1) << (bitmap->chunkshift + PAGE_COUNTER_SHIFT - 1); ``` The minus 1 is wrong, this branch should report 2048 bits of space. With "-1" action, this only report 1024 bit of space. This bug code returns wrong blocks, but it doesn't inflence bitmap logic: 1. Most callers focus this function return value (the counter of offset), not the parameter blocks. 2. The bug is only triggered when hijacked is true or map is NULL. the hijacked true condition is very rare. the "map == null" only true when array is creating or resizing. 3. Even the caller gets wrong blocks, current code makes caller just to call md_bitmap_get_counter() one more time. Signed-off-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-10-08md/bitmap: md_bitmap_read_sb uses wrong bitmap blocksZhao Heming
The patched code is used to get chunks number, should use round-up div to replace current sector_div. The same code is in md_bitmap_resize(): ``` chunks = DIV_ROUND_UP_SECTOR_T(blocks, 1 << chunkshift); ``` Signed-off-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-10-08md/raid0: remove unused function is_io_in_chunk_boundary()Jason Yan
This function is no longger needed after commit 20d0189b1012 ("block: Introduce new bio_split()"). Signed-off-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-10-07dm: fix request-based DM to not bounce through indirect dm_submit_bioMike Snitzer
It is unnecessary to force request-based DM to call into bio-based dm_submit_bio (via indirect disk->fops->submit_bio) only to have it then call blk_mq_submit_bio(). Fix this by establishing a request-based DM block_device_operations (dm_rq_blk_dops, which doesn't have .submit_bio) and update dm_setup_md_queue() to set md->disk->fops to it for DM_TYPE_REQUEST_BASED. Remove DM_TYPE_REQUEST_BASED conditional in dm_submit_bio and unexport blk_mq_submit_bio. Fixes: c62b37d96b6eb ("block: move ->make_request_fn to struct block_device_operations") Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-10-07dm: remove special-casing of bio-based immutable singleton target on NVMeMike Snitzer
Since commit 5a6c35f9af416 ("block: remove direct_make_request") there is no benefit to DM special-casing NVMe. Remove all code used to establish DM_TYPE_NVME_BIO_BASED. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-10-06x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()Dan Williams
In reaction to a proposal to introduce a memcpy_mcsafe_fast() implementation Linus points out that memcpy_mcsafe() is poorly named relative to communicating the scope of the interface. Specifically what addresses are valid to pass as source, destination, and what faults / exceptions are handled. Of particular concern is that even though x86 might be able to handle the semantics of copy_mc_to_user() with its common copy_user_generic() implementation other archs likely need / want an explicit path for this case: On Fri, May 1, 2020 at 11:28 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, Apr 30, 2020 at 6:21 PM Dan Williams <dan.j.williams@intel.com> wrote: > > > > However now I see that copy_user_generic() works for the wrong reason. > > It works because the exception on the source address due to poison > > looks no different than a write fault on the user address to the > > caller, it's still just a short copy. So it makes copy_to_user() work > > for the wrong reason relative to the name. > > Right. > > And it won't work that way on other architectures. On x86, we have a > generic function that can take faults on either side, and we use it > for both cases (and for the "in_user" case too), but that's an > artifact of the architecture oddity. > > In fact, it's probably wrong even on x86 - because it can hide bugs - > but writing those things is painful enough that everybody prefers > having just one function. Replace a single top-level memcpy_mcsafe() with either copy_mc_to_user(), or copy_mc_to_kernel(). Introduce an x86 copy_mc_fragile() name as the rename for the low-level x86 implementation formerly named memcpy_mcsafe(). It is used as the slow / careful backend that is supplanted by a fast copy_mc_generic() in a follow-on patch. One side-effect of this reorganization is that separating copy_mc_64.S to its own file means that perf no longer needs to track dependencies for its memcpy_64.S benchmarks. [ bp: Massage a bit. ] Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: <stable@vger.kernel.org> Link: http://lore.kernel.org/r/CAHk-=wjSqtXAqfUJxFtWNwmguFASTgB0dz1dT3V-78Quiezqbg@mail.gmail.com Link: https://lkml.kernel.org/r/160195561680.2163339.11574962055305783722.stgit@dwillia2-desk3.amr.corp.intel.com
2020-10-05block: make bio_crypt_clone() able to failEric Biggers
bio_crypt_clone() assumes its gfp_mask argument always includes __GFP_DIRECT_RECLAIM, so that the mempool_alloc() will always succeed. However, bio_crypt_clone() might be called with GFP_ATOMIC via setup_clone() in drivers/md/dm-rq.c, or with GFP_NOWAIT via kcryptd_io_read() in drivers/md/dm-crypt.c. Neither case is currently reachable with a bio that actually has an encryption context. However, it's fragile to rely on this. Just make bio_crypt_clone() able to fail, analogous to bio_integrity_clone(). Reported-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Satya Tangirala <satyat@google.com> Cc: Satya Tangirala <satyat@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02bcache: remove embedded struct cache_sb from struct cache_setColy Li
Since bcache code was merged into mainline kerrnel, each cache set only as one single cache in it. The multiple caches framework is here but the code is far from completed. Considering the multiple copies of cached data can also be stored on e.g. md raid1 devices, it is unnecessary to support multiple caches in one cache set indeed. The previous preparation patches fix the dependencies of explicitly making a cache set only have single cache. Now we don't have to maintain an embedded partial super block in struct cache_set, the in-memory super block can be directly referenced from struct cache. This patch removes the embedded struct cache_sb from struct cache_set, and fixes all locations where the superb lock was referenced from this removed super block by referencing the in-memory super block of struct cache. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02bcache: check and set sync status on cache's in-memory super blockColy Li
Currently the cache's sync status is checked and set on cache set's in- memory partial super block. After removing the embedded struct cache_sb from cache set and reference cache's in-memory super block from struct cache_set, the sync status can set and check directly on cache's super block. This patch checks and sets the cache sync status directly on cache's in-memory super block. This is a preparation for later removing embedded struct cache_sb from struct cache_set. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02bcache: remove can_attach_cache()Coly Li
After removing the embedded struct cache_sb from struct cache_set, cache set will directly reference the in-memory super block of struct cache. It is unnecessary to compare block_size, bucket_size and nr_in_set from the identical in-memory super block in can_attach_cache(). This is a preparation patch for latter removing cache_set->sb from struct cache_set. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02bcache: don't check seq numbers in register_cache_set()Coly Li
In order to update the partial super block of cache set, the seq numbers of cache and cache set are checked in register_cache_set(). If cache's seq number is larger than cache set's seq number, cache set must update its partial super block from cache's super block. It is unncessary when the embedded struct cache_sb is removed from struct cache set. This patch removed the seq numbers checking from register_cache_set(), because later there will be no such partial super block in struct cache set, the cache set will directly reference in-memory super block from struct cache. This is a preparation patch for removing embedded struct cache_sb from struct cache_set. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02bcache: only use bucket_bytes() on struct cacheColy Li
Because struct cache_set and struct cache both have struct cache_sb, macro bucket_bytes() currently are used on both of them. When removing the embedded struct cache_sb from struct cache_set, this macro won't be used on struct cache_set anymore. This patch unifies all bucket_bytes() usage only on struct cache, this is one of the preparation to remove the embedded struct cache_sb from struct cache_set. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02bcache: remove useless bucket_pages()Coly Li
It seems alloc_bucket_pages() is the only user of bucket_pages(). Considering alloc_bucket_pages() is removed from bcache code, it is safe to remove the useless macro bucket_pages() now. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02bcache: remove useless alloc_bucket_pages()Coly Li
Now no one uses alloc_bucket_pages() anymore, remove it from bcache.h. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02bcache: only use block_bytes() on struct cacheColy Li
Because struct cache_set and struct cache both have struct cache_sb, therefore macro block_bytes() can be used on both of them. When removing the embedded struct cache_sb from struct cache_set, this macro won't be used on struct cache_set anymore. This patch unifies all block_bytes() usage only on struct cache, this is one of the preparation to remove the embedded struct cache_sb from struct cache_set. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02bcache: add set_uuid in struct cache_setColy Li
This patch adds a separated set_uuid[16] in struct cache_set, to store the uuid of the cache set. This is the preparation to remove the embedded struct cache_sb from struct cache_set. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02bcache: remove for_each_cache()Coly Li
Since now each cache_set explicitly has single cache, for_each_cache() is unnecessary. This patch removes this macro, and update all locations where it is used, and makes sure all code logic still being consistent. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02bcache: explicitly make cache_set only have single cacheColy Li
Currently although the bcache code has a framework for multiple caches in a cache set, but indeed the multiple caches never completed and users use md raid1 for multiple copies of the cached data. This patch does the following change in struct cache_set, to explicitly make a cache_set only have single cache, - Change pointer array "*cache[MAX_CACHES_PER_SET]" to a single pointer "*cache". - Remove pointer array "*cache_by_alloc[MAX_CACHES_PER_SET]". - Remove "caches_loaded". Now the code looks as exactly what it does in practic: only one cache is used in the cache set. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02bcache: remove 'int n' from parameter list of bch_bucket_alloc_set()Coly Li
The parameter 'int n' from bch_bucket_alloc_set() is not cleared defined. From the code comments n is the number of buckets to alloc, but from the code itself 'n' is the maximum cache to iterate. Indeed all the locations where bch_bucket_alloc_set() is called, 'n' is alwasy 1. This patch removes the confused and unnecessary 'int n' from parameter list of bch_bucket_alloc_set(), and explicitly allocates only 1 bucket for its caller. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02bcache: Convert to DEFINE_SHOW_ATTRIBUTEQinglang Miao
Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code. As inode->iprivate equals to third parameter of debugfs_create_file() which is NULL. So it's equivalent to original code logic. Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02bcache: check c->root with IS_ERR_OR_NULL() in mca_reserve()Dongsheng Yang
In mca_reserve(c) macro, we are checking root whether is NULL or not. But that's not enough, when we read the root node in run_cache_set(), if we got an error in bch_btree_node_read_done(), we will return ERR_PTR(-EIO) to c->root. And then we will go continue to unregister, but before calling unregister_shrinker(&c->shrink), there is a possibility to call bch_mca_count(), and we would get a crash with call trace like that: [ 2149.876008] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000b5 ... ... [ 2150.598931] Call trace: [ 2150.606439] bch_mca_count+0x58/0x98 [escache] [ 2150.615866] do_shrink_slab+0x54/0x310 [ 2150.624429] shrink_slab+0x248/0x2d0 [ 2150.632633] drop_slab_node+0x54/0x88 [ 2150.640746] drop_slab+0x50/0x88 [ 2150.648228] drop_caches_sysctl_handler+0xf0/0x118 [ 2150.657219] proc_sys_call_handler.isra.18+0xb8/0x110 [ 2150.666342] proc_sys_write+0x40/0x50 [ 2150.673889] __vfs_write+0x48/0x90 [ 2150.681095] vfs_write+0xac/0x1b8 [ 2150.688145] ksys_write+0x6c/0xd0 [ 2150.695127] __arm64_sys_write+0x24/0x30 [ 2150.702749] el0_svc_handler+0xa0/0x128 [ 2150.710296] el0_svc+0x8/0xc Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02bcache: share register sysfs with async registerColy Li
Previously the experimental async registration uses a separate sysfs file register_async. Now the async registration code seems working well for a while, we can do furtuher testing with it now. This patch changes the async bcache registration shares the same sysfs file /sys/fs/bcache/register (and register_quiet). Async registration will be default behavior if BCACHE_ASYNC_REGISTRATION is set in kernel configure. By default, BCACHE_ASYNC_REGISTRATION is not configured yet. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-01dm: export dm_copy_name_and_uuidMike Snitzer
Allow DM targets to access the configured name and uuid. Also, bump DM ioctl version. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-10-01dm: fix comment in __dm_suspend()Mike Snitzer
Fix stale references to functions that have been renamed and fix typo. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-10-01dm: fold dm_process_bio() into dm_submit_bio()Mike Snitzer
dm_process_bio() is only called by dm_submit_bio(), there is no benefit to keeping dm_process_bio() factored out, so fold it. While at it, cleanup dm_submit_bio()'s DMF_BLOCK_IO_FOR_SUSPEND related branching and expand scope of dm_get_live_table() rcu reference on map via common 'out' label to dm_put_live_table(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-30dm: fix missing imposition of queue_limits from dm_wq_work() threadMike Snitzer
If a DM device was suspended when bios were issued to it, those bios would be deferred using queue_io(). Once the DM device was resumed dm_process_bio() could be called by dm_wq_work() for original bio that still needs splitting. dm_process_bio()'s check for current->bio_list (meaning call chain is within ->submit_bio) as a prerequisite for calling blk_queue_split() for "abnormal IO" would result in dm_process_bio() never imposing corresponding queue_limits (e.g. discard_granularity, discard_max_bytes, etc). Fix this by always having dm_wq_work() resubmit deferred bios using submit_bio_noacct(). Side-effect is blk_queue_split() is always called for "abnormal IO" from ->submit_bio, be it from application thread or dm_wq_work() workqueue, so proper bio splitting and depth-first bio submission is performed. For sake of clarity, remove current->bio_list check before call to blk_queue_split(). Also, remove dm_wq_work()'s use of dm_{get,put}_live_table() -- no longer needed since IO will be reissued in terms of ->submit_bio. And rename bio variable from 'c' to 'bio'. Fixes: cf9c37865557 ("dm: fix comment in dm_process_bio()") Reported-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-29dm snap persistent: simplify area_io()Qinglang Miao
Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>