2024-03-06nvme: host: fix double-free of struct nvme_id_ns in ns_update_nuse()Shin'ichiro Kawasaki
When nvme_identify_ns() fails, it frees the pointer to the struct nvme_id_ns before it returns. However, ns_update_nuse() calls kfree() on the pointer even when nvme_identify_ns() fails. This results in a KASAN double-free, observed with blktests nvme/045 together with the proposed patches [1] on kernel v6.8-rc7. Fix the double-free by skipping kfree() when nvme_identify_ns() fails. Link: https://lore.kernel.org/linux-block/20240304161303.19681-1-dwagner@suse.de/ [1] Fixes: a1a825ab6a60 ("nvme: add csi, ms and nuse to sysfs") Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
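A minimal sketch of the corrected error path, using the helper names from the commit message (not the literal upstream diff):

    static int ns_update_nuse(struct nvme_ns *ns)
    {
    	struct nvme_id_ns *id;
    	int ret;

    	ret = nvme_identify_ns(ns->ctrl, ns->head->ns_id, &id);
    	if (ret)
    		return ret;	/* id was already freed by nvme_identify_ns() */

    	ns->head->nuse = le64_to_cpu(id->nuse);
    	kfree(id);
    	return 0;
    }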
2024-03-05Merge branch 'dmraid-fix-6.9' into md-6.9Song Liu
This is the second half of the fixes for dmraid; the first half is available at [1]. This set contains fixes for: - reshape starting unexpectedly and causing data corruption, patches 1, 5, 6; - deadlocks when reshape runs concurrently with IO, patch 8; - a lockdep warning, patch 9. For all the dmraid-related tests in the lvm2 suite, there are no new regressions compared against 6.6 kernels (a good baseline from before the recent regressions). [1] https://lore.kernel.org/all/CAPhsuW7u1UKHCDOBDhD7DzOVtkGemDz_QnJ4DUq_kSN-Q3G66Q@mail.gmail.com/ * dmraid-fix-6.9: dm-raid: fix lockdep warning in "pers->hot_add_disk" dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape dm-raid: add a new helper prepare_suspend() in md_personality md/dm-raid: don't call md_reap_sync_thread() directly dm-raid: really freeze sync_thread during suspend md: add a new helper reshape_interrupted() md: export helper md_is_rdwr() md: export helpers to stop sync_thread md: don't clear MD_RECOVERY_FROZEN for new dm-raid until resume
2024-03-05dm-raid: fix lockdep warning in "pers->hot_add_disk"Yu Kuai
The lockdep assert was added by commit a448af25becf ("md/raid10: remove rcu protection to access rdev from conf") in print_conf(), and I didn't notice that dm-raid calls "pers->hot_add_disk" without holding 'reconfig_mutex'. "pers->hot_add_disk" reads and writes many fields that are protected by 'reconfig_mutex', and raid_resume() already grabs the lock in another context. Hence fix this problem by protecting "pers->hot_add_disk" with the lock. Fixes: 9092c02d9435 ("DM RAID: Add ability to restore transiently failed devices on resume") Fixes: a448af25becf ("md/raid10: remove rcu protection to access rdev from conf") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240305072306.2562024-10-yukuai1@huaweicloud.com
2024-03-05dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshapeYu Kuai
For raid456, if reshape is still in progress, IO across the reshape position will wait for reshape to make progress. However, for dm-raid, in the following cases reshape will never make progress, hence IO will hang: 1) the array is read-only; 2) MD_RECOVERY_WAIT is set; 3) MD_RECOVERY_FROZEN is set; After commit c467e97f079f ("md/raid6: use valid sector values to determine if an I/O should wait on the reshape") fixed the problem that IO across the reshape position didn't wait for reshape, the dm-raid test shell/lvconvert-raid-reshape.sh started to hang: [root@fedora ~]# cat /proc/979/stack [<0>] wait_woken+0x7d/0x90 [<0>] raid5_make_request+0x929/0x1d70 [raid456] [<0>] md_handle_request+0xc2/0x3b0 [md_mod] [<0>] raid_map+0x2c/0x50 [dm_raid] [<0>] __map_bio+0x251/0x380 [dm_mod] [<0>] dm_submit_bio+0x1f0/0x760 [dm_mod] [<0>] __submit_bio+0xc2/0x1c0 [<0>] submit_bio_noacct_nocheck+0x17f/0x450 [<0>] submit_bio_noacct+0x2bc/0x780 [<0>] submit_bio+0x70/0xc0 [<0>] mpage_readahead+0x169/0x1f0 [<0>] blkdev_readahead+0x18/0x30 [<0>] read_pages+0x7c/0x3b0 [<0>] page_cache_ra_unbounded+0x1ab/0x280 [<0>] force_page_cache_ra+0x9e/0x130 [<0>] page_cache_sync_ra+0x3b/0x110 [<0>] filemap_get_pages+0x143/0xa30 [<0>] filemap_read+0xdc/0x4b0 [<0>] blkdev_read_iter+0x75/0x200 [<0>] vfs_read+0x272/0x460 [<0>] ksys_read+0x7a/0x170 [<0>] __x64_sys_read+0x1c/0x30 [<0>] do_syscall_64+0xc6/0x230 [<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74 This is because reshape can't make progress. For md/raid, the problem doesn't exist because registering a new sync_thread no longer relies on the IO being done: 1) If the array is read-only, it can be switched to read-write via ioctl/sysfs; 2) md/raid never sets MD_RECOVERY_WAIT; 3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold 'reconfig_mutex', hence the flag can be cleared and reshape can continue via the sysfs api 'sync_action'. However, I'm not sure yet how to avoid the problem in dm-raid. This patch, on the one hand, makes sure raid_message() can't change the sync_thread after presuspend(); on the other hand, it detects the above 3 cases before waiting for IO to be done in dm_suspend(), and lets dm-raid requeue those IOs. Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com
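A hedged sketch of the three-case detection described above; the helper name is hypothetical, and the flag tests follow the commit text:

    /* Hypothetical helper: true if reshape can never make progress for dm-raid */
    static bool reshape_cannot_progress(struct mddev *mddev)
    {
    	return !md_is_rdwr(mddev) ||				/* 1) read-only */
    	       test_bit(MD_RECOVERY_WAIT, &mddev->recovery) ||	/* 2) */
    	       test_bit(MD_RECOVERY_FROZEN, &mddev->recovery);	/* 3) */
    }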
2024-03-05dm-raid: add a new helper prepare_suspend() in md_personalityYu Kuai
No functional changes for now; this prepares for fixing a deadlock in dm-raid456. Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240305072306.2562024-8-yukuai1@huaweicloud.com
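As a sketch, the new hook is an optional method in struct md_personality that callers invoke only when the personality provides it (excerpt; existing members elided):

    struct md_personality {
    	char *name;
    	/* ... existing methods elided ... */
    	/* new: let the personality prepare for an upcoming suspend */
    	int (*prepare_suspend)(struct mddev *mddev);
    };

    	/* call-site sketch */
    	if (mddev->pers && mddev->pers->prepare_suspend)
    		mddev->pers->prepare_suspend(mddev);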
2024-03-05md/dm-raid: don't call md_reap_sync_thread() directlyYu Kuai
Currently md_reap_sync_thread() is called from raid_message() directly without holding 'reconfig_mutex'. This is definitely unsafe because md_reap_sync_thread() can change many fields that are protected by 'reconfig_mutex'. However, holding 'reconfig_mutex' here is still problematic because it can cause a deadlock; see for example commit 130443d60b1b ("md: refactor idle/frozen_sync_thread() to fix deadlock"). Fix this problem by using stop_sync_thread() to unregister the sync_thread, like md/raid does. Fixes: be83651f0050 ("DM RAID: Add message/status support for changing sync action") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240305072306.2562024-7-yukuai1@huaweicloud.com
2024-03-05dm-raid: really freeze sync_thread during suspendYu Kuai
1) commit f52f5c71f3d4 ("md: fix stopping sync thread") removed MD_RECOVERY_FROZEN from __md_stop_writes() without realizing that dm-raid relies on __md_stop_writes() to freeze the sync_thread indirectly. Fix this problem by adding MD_RECOVERY_FROZEN to md_stop_writes(), and since stop_sync_thread() is only used for dm-raid in this case, also move stop_sync_thread() to md_stop_writes(). 2) The flag MD_RECOVERY_FROZEN doesn't mean that the sync thread is frozen; it only prevents a new sync_thread from starting, and it can't stop a running sync thread. In order to freeze the sync_thread, after setting the flag, stop_sync_thread() should be used. 3) The flag MD_RECOVERY_FROZEN doesn't mean that writes are stopped, so using it as the condition for md_stop_writes() in raid_postsuspend() doesn't look correct. Since a reentrant stop_sync_thread() does nothing, always call md_stop_writes() in raid_postsuspend(). 4) raid_message() can set/clear the flag MD_RECOVERY_FROZEN at any time, and if MD_RECOVERY_FROZEN is cleared while the array is suspended, a new sync_thread can start unexpectedly. Fix this by disallowing raid_message() from changing the sync_thread status during suspend. Note that after commit f52f5c71f3d4 ("md: fix stopping sync thread"), the test shell/lvconvert-raid-reshape.sh started to hang in stop_sync_thread(), and with the previous fixes the test won't hang there anymore; however, the test will still fail and complain that ext4 is corrupted. With this patch, the test won't hang due to stop_sync_thread() or fail due to ext4 corruption anymore. However, there is still a deadlock related to dm-raid456 that will be fixed in the following patches. Reported-by: Mikulas Patocka <mpatocka@redhat.com> Closes: https://lore.kernel.org/all/e5e8afe2-e9a8-49a2-5ab0-958d4065c55e@redhat.com/ Fixes: 1af2048a3e87 ("dm raid: fix deadlock caused by premature md_stop_writes()") Fixes: 9dbd1aa3a81c ("dm raid: add reshaping support to the target") Fixes: f52f5c71f3d4 ("md: fix stopping sync thread") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240305072306.2562024-6-yukuai1@huaweicloud.com
2024-03-05md: add a new helper reshape_interrupted()Yu Kuai
The helper will be used by dm-raid456 later to detect the case where reshape can't make progress. Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240305072306.2562024-5-yukuai1@huaweicloud.com
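A plausible shape for the helper, assuming the conditions discussed in this series (reshape has started but the sync thread is gone or about to stop); not necessarily the exact upstream body:

    static inline bool reshape_interrupted(struct mddev *mddev)
    {
    	/* reshape never started */
    	if (mddev->reshape_position == MaxSector)
    		return false;

    	/* interrupted */
    	if (!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
    		return true;

    	/* running reshape will be interrupted soon */
    	if (test_bit(MD_RECOVERY_WAIT, &mddev->recovery) ||
    	    test_bit(MD_RECOVERY_INTR, &mddev->recovery) ||
    	    !md_is_rdwr(mddev))
    		return true;

    	return false;
    }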
2024-03-05md: export helper md_is_rdwr()Yu Kuai
No functional changes for now; this prepares for fixing a deadlock in dm-raid456. Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240305072306.2562024-4-yukuai1@huaweicloud.com
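The helper itself is essentially a one-liner comparing mddev->ro against the read-write state:

    static inline bool md_is_rdwr(struct mddev *mddev)
    {
    	return (mddev->ro == MD_RDWR);
    }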
2024-03-05md: export helpers to stop sync_threadYu Kuai
Add new helpers: void md_idle_sync_thread(struct mddev *mddev); void md_frozen_sync_thread(struct mddev *mddev); void md_unfrozen_sync_thread(struct mddev *mddev); The helpers will be used in dm-raid in later patches to fix regressions and prevent calling md_reap_sync_thread() directly. Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240305072306.2562024-3-yukuai1@huaweicloud.com
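A hedged sketch of how the freeze/unfreeze pair can be built on top of stop_sync_thread(); the serialization via a 'sync_mutex'-style mutex is an assumption, and these are not necessarily the exact upstream bodies:

    void md_frozen_sync_thread(struct mddev *mddev)
    {
    	mutex_lock(&mddev->sync_mutex);		/* assumed serialization */
    	set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
    	stop_sync_thread(mddev, false, false);
    	mutex_unlock(&mddev->sync_mutex);
    }

    void md_unfrozen_sync_thread(struct mddev *mddev)
    {
    	mutex_lock(&mddev->sync_mutex);
    	clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
    	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
    	md_wakeup_thread(mddev->thread);
    	mutex_unlock(&mddev->sync_mutex);
    }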
2024-03-05md: don't clear MD_RECOVERY_FROZEN for new dm-raid until resumeYu Kuai
After commit 9dbd1aa3a81c ("dm raid: add reshaping support to the target"), raid_ctr() sets MD_RECOVERY_FROZEN before md_run() and expects to keep the array frozen until resume. However, md_run() clears the flag by setting mddev->recovery to 0. Before commit 1baae052cccd ("md: Don't ignore suspended array in md_check_recovery()"), dm-raid actually relied on suspension to prevent a new sync_thread from starting. Fix this problem by keeping 'MD_RECOVERY_FROZEN' for dm-raid in md_run(). Fixes: 1baae052cccd ("md: Don't ignore suspended array in md_check_recovery()") Fixes: 9dbd1aa3a81c ("dm raid: add reshaping support to the target") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240305072306.2562024-2-yukuai1@huaweicloud.com
2024-03-05Revert "Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d""Song Liu
This reverts commit bed9e27baf52a09b7ba2a3714f1e24e17ced386d. The original set [1][2] was expected to undo a suboptimal fix in [2] and replace it with a better fix [1]. However, as reported by Dan Moulding, [2] causes an issue with raid5 with a journal device. Revert [2] for now to close the issue. We will follow up on another issue reported by Junxiao Bi, as [2] is expected to fix it. We believe this is a good trade-off, because the latter issue happens less frequently. In the meanwhile, we will NOT revert [1], as it contains the right logic. [1] commit d6e035aad6c0 ("md: bypass block throttle for superblock update") [2] commit bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"") Reported-by: Dan Moulding <dan@danm.net> Closes: https://lore.kernel.org/linux-raid/20240123005700.9302-1-dan@danm.net/ Fixes: bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"") Cc: stable@vger.kernel.org # v5.19+ Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240125082131.788600-1-song@kernel.org
2024-03-05nvme: fcloop: make fcloop_class constantRicardo B. Marliere
Since commit 43a7206b0963 ("driver core: class: make class_register() take a const *"), the driver core allows struct class to live in read-only memory, so declare the fcloop_class structure at build time, placing it into read-only memory, instead of dynamically allocating it at boot time. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
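The before/after pattern is the same for all three class-constification commits in this series (fcloop names shown):

    /* before: allocated and registered at boot time */
    static struct class *fcloop_class;
    fcloop_class = class_create("fcloop");

    /* after: declared at build time, placed into read-only memory */
    static const struct class fcloop_class = {
    	.name = "fcloop",
    };
    ret = class_register(&fcloop_class);
    /* teardown becomes class_unregister(&fcloop_class); */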
2024-03-05nvme: fabrics: make nvmf_class constantRicardo B. Marliere
Since commit 43a7206b0963 ("driver core: class: make class_register() take a const *"), the driver core allows struct class to live in read-only memory, so declare the nvmf_class structure at build time, placing it into read-only memory, instead of dynamically allocating it at boot time. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-05nvme: core: constify struct class usageRicardo B. Marliere
Since commit 43a7206b0963 ("driver core: class: make class_register() take a const *"), the driver core allows struct class to live in read-only memory, so declare the structures nvme_class, nvme_subsys_class and nvme_ns_chr_class at build time, placing them into read-only memory, instead of dynamically allocating them at boot time. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme-fabrics: typo in nvmf_parse_key()Hannes Reinecke
Of course we should use the key if there is no error ... Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme-multipath: use atomic queue limits API for stacking limitsChristoph Hellwig
Switch to the queue_limits_* helpers to stack the bdev limits, which also includes updating the readahead settings. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme-multipath: pass queue_limits to blk_alloc_diskChristoph Hellwig
The multipath disk starts out with the stacking default limits. The one interesting part here is that blk_set_stacking_limits sets the max_zone_append_sectors to UINT_MAX, which fails the validation for non-zoned devices. With the old one-call-per-limit scheme this was fine because no one verified this weird mismatch, and it was fixed by blk_stack_limits a little later, before I/O could be issued. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme: use the atomic queue limits update APIChristoph Hellwig
Change the call chains that update queue_limits to build an on-stack queue_limits and update it atomically. Note that for now only the admin queue actually passes it to the queue allocation function. Doing the same for the gendisks used for the namespaces will require a little more work. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme: cleanup nvme_configure_metadataChristoph Hellwig
Fold nvme_init_ms into nvme_configure_metadata after splitting up a little helper to deal with the extended LBA formats. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme: don't query identify data in configure_metadataChristoph Hellwig
Move reading the Identify Namespace Data Structure, NVM Command Set out of configure_metadata into the caller. This allows doing the identify call outside the frozen I/O queues, and prepares for using data from the Identify data structure for other purposes. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme: split out a nvme_identify_ns_nvm helperChristoph Hellwig
Split the logic to query the Identify Namespace Data Structure, NVM Command Set into a separate helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme: move common logic into nvme_update_ns_infoChristoph Hellwig
nvme_update_ns_info_generic and nvme_update_ns_info_block share a fair amount of logic related to not fully supported namespace formats and updating the multipath information. Move this logic into the common caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme: move setting the write cache flags out of nvme_set_queue_limitsChristoph Hellwig
nvme_set_queue_limits is used on the admin queue and all gendisks including hidden ones that don't support block I/O. The write cache setting on the other hand only makes sense for block I/O. Move the blk_queue_write_cache call to nvme_update_ns_info_block instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme: move a few things out of nvme_update_disk_infoChristoph Hellwig
Move setting up the integrity profile and setting the disk capacity out of nvme_update_disk_info to get nvme_update_disk_info into a shape where it just sets queue_limits eventually. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme: don't use nvme_update_disk_info for the multipath diskChristoph Hellwig
Currently nvme_update_ns_info_block calls nvme_update_disk_info both for the namespace-attached disk and the multipath one (if it exists). This is very different from how other stacking drivers work and leads to a lot of complexity. Switch to setting the disk capacity and initializing the integrity profile, and let blk_stack_limits, which is already called just below, deal with updating the other limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme: move blk_integrity_unregister into nvme_init_integrityChristoph Hellwig
Move unregistering the existing integrity profile into the helper dealing with all the other integrity / metadata setup. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme: cleanup the nvme_init_integrity calling conventionsChristoph Hellwig
Handle the no metadata support case in nvme_init_integrity as well to simplify the calling convention and prepare for future changes in the area. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme: move max_integrity_segments handling out of nvme_init_integrityChristoph Hellwig
max_integrity_segments is just a hardware limit and doesn't need to be in nvme_init_integrity with the PI setup. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme: remove nvme_revalidate_zonesChristoph Hellwig
Handle setting the zone size / chunk_sectors and max_append_sectors limits together with the other ZNS limits, and just open code the call to blk_revalidate_zones in the current place. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme: move NVME_QUIRK_DEALLOCATE_ZEROES out of nvme_config_discardChristoph Hellwig
Move the handling of the NVME_QUIRK_DEALLOCATE_ZEROES quirk out of nvme_config_discard so that it is combined with the normal write_zeroes limit handling. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme: set max_hw_sectors unconditionallyChristoph Hellwig
All transports set a max_hw_sectors value in the nvme_ctrl, so make the code using it unconditional and clean it up using a little helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-02nvme-fabrics: check max outstanding commandsGuixin Liu
Maxcmd is mandatory for fabrics; check it early to identify the root cause instead of waiting for it to propagate to "sqsize" and queue allocation. By the way, rename nvme_check_ctrl_fabric_info() to nvmf_validate_identify_ctrl(). Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-02nvmet-rdma: set max_queue_size for RDMA transportMax Gurtovoy
A new port configuration was added to set max_queue_size. Clamp the user configuration to the RDMA transport limits. Increase the maximal queue size of RDMA controllers from 128 to 256 (the default size stays at 128, same as before). Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Israel Rukshin <israelr@nvidia.com> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-02nvmet: introduce new max queue size configuration entryMax Gurtovoy
Using this port configuration, one will be able to set the maximal queue size to be used for any controller that will be associated with the configured port. The default value stays 1024, but each transport will be able to set its own value before enabling the port. Introduce a lower limit of 16 for the minimal queue depth (same as we use in the host fabrics drivers). Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Guixin Liu <kanie@linux.alibaba.com> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-02nvme-rdma: clamp queue size according to ctrl capMax Gurtovoy
If a controller is configured with metadata support, clamp the maximal queue size to be 128 since there are more resources that are needed for metadata operations. Otherwise, clamp it to 256. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
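A hedged sketch of the clamping logic, using the constants from the neighboring commit (exact placement and warning text may differ upstream):

    u32 max_queue_size = NVME_RDMA_MAX_QUEUE_SIZE;		/* 256 */

    if (ctrl->ctrl.max_integrity_segments)			/* metadata configured */
    	max_queue_size = NVME_RDMA_MAX_METADATA_QUEUE_SIZE;	/* 128 */

    if (opts->queue_size > max_queue_size) {
    	dev_warn(ctrl->ctrl.device,
    		 "queue_size %u > supported maximum %u, clamping down\n",
    		 opts->queue_size, max_queue_size);
    	opts->queue_size = max_queue_size;
    }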
2024-03-02nvme-rdma: introduce NVME_RDMA_MAX_METADATA_QUEUE_SIZE definitionMax Gurtovoy
This definition will be used by controllers that are configured with metadata support. For now, both regular and metadata controllers have the same maximal queue size, but a later commit will increase the maximal queue size for regular RDMA controllers to 256. We'll keep the maximal queue size for metadata controllers at 128, since more resources are needed for metadata operations and 128 is the optimal size found for metadata controllers based on testing. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Israel Rukshin <israelr@nvidia.com> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-02nvmet: set ctrl pi_support cap before initializing cap regMax Gurtovoy
This is a preparation for setting the maximal queue size of a controller that supports PI. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-02nvmet: set maxcmd to be per controllerMax Gurtovoy
This is a preparation for having a dynamic configuration of max queue size for a controller. Make sure that the maxcmd field stays the same as the MQES (+1) value as we do today. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-02nvmet: compare mqes and sqsize only for IO SQMax Gurtovoy
According to the NVMe Spec: " MQES: This field indicates the maximum individual queue size that the controller supports. For NVMe over PCIe implementations, this value applies to the I/O Submission Queues and I/O Completion Queues that the host creates. For NVMe over Fabrics implementations, this value applies to only the I/O Submission Queues that the host creates. " Align the target code to compare mqes and sqsize as mentioned in the NVMe Spec. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
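A sketch of the target-side check implied by the spec quote: the admin queue (qid 0) is exempt, since MQES covers only I/O Submission Queues on fabrics:

    u16 mqes = NVME_CAP_MQES(ctrl->cap);

    /* For fabrics, MQES applies only to I/O SQs; skip the admin queue. */
    if (qid && sqsize > mqes) {
    	pr_warn("sqsize %u is larger than MQES supported %u cntlid %u\n",
    		sqsize, mqes, ctrl->cntlid);
    	return NVME_SC_CONNECT_INVALID_PARAM | NVME_SC_DNR;
    }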
2024-03-02nvme-rdma: move NVME_RDMA_IP_PORT from common fileMax Gurtovoy
The correct place for this definition is the nvme rdma header file and not the common nvme header file. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-01nbd: use the atomic queue limits API in nbd_set_sizeChristoph Hellwig
Use queue_limits_start_update / queue_limits_commit_update to update all the limits in one go and with proper sanity checking. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240229143846.1047223-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
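Usage in nbd_set_size-style code looks roughly like this (a sketch; the specific limit fields shown are assumptions based on what nbd configures):

    struct queue_limits lim;
    int error;

    lim = queue_limits_start_update(nbd->disk->queue);
    lim.logical_block_size = blksize;
    lim.physical_block_size = blksize;
    error = queue_limits_commit_update(nbd->disk->queue, &lim);
    if (error)
    	return error;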
2024-03-01nbd: freeze the queue for queue limits updatesChristoph Hellwig
nbd currently updates the logical and physical block sizes as well as the discard_sectors on a live queue. Freeze the queue first to make sure there are not commands in flight that can see torn or inconsistent limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240229143846.1047223-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-01nbd: don't clear discard_sectors in nbd_config_putChristoph Hellwig
nbd_config_put currently clears discard_sectors when unusing a device. This is pretty odd behavior and different from the sector size configuration, which is simply left in place and then reconfigured when nbd_set_size is called as part of configuring the device. Change nbd_set_size to clear discard_sectors if discard is not supported, so that all the queue limits changes are handled in one place. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240229143846.1047223-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-01pktcdvd: don't set max_hw_sectors on the underlying deviceChristoph Hellwig
pktcdvd sets max_hw_sectors on the queue of the underlying device, which it doesn't own (and never resets it), and has done so since the driver was merged. This can create all kinds of problems, as the underlying driver doesn't even know about the changed limit. As the stated purpose is to not create I/Os larger than a single frame, and pktcdvd never builds bios larger than that, just set REQ_NOMERGE on the bios it submits so that larger I/Os never get built. Note: I don't have packet writing hardware, so this is compile tested only. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240229144408.1047967-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
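The replacement is a per-bio flag rather than a queue-wide limit; a sketch of the submission-path change:

    /* pktcdvd already builds frame-sized bios; forbid merging instead of
     * capping max_hw_sectors on a queue the driver doesn't own */
    bio->bi_opf |= REQ_NOMERGE;
    submit_bio(bio);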
2024-03-01dm: use queue_limits_setChristoph Hellwig
Use queue_limits_set which validates the limits and takes care of updating the readahead settings instead of directly assigning them to the queue. For that make sure all limits are actually updated before the assignment. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20240228225653.947152-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-01block: add a queue_limits_stack_bdev helperChristoph Hellwig
Add a small wrapper around blk_stack_limits that allows passing a bdev for the bottom device and prints an error in case of misaligned device. The name fits into the new queue limits API and the intent is to eventually replace disk_stack_limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240228225653.947152-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-01block: add a queue_limits_set helperChristoph Hellwig
Add a small wrapper around queue_limits_commit_update for stacking drivers that don't want to update existing limits, but set an entirely new set. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240228225653.947152-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
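Given the existing primitives, the wrapper can be as small as this (a sketch consistent with queue_limits_commit_update dropping the limits lock on return):

    int queue_limits_set(struct request_queue *q, struct queue_limits *lim)
    {
    	mutex_lock(&q->limits_lock);
    	return queue_limits_commit_update(q, lim);
    }
    EXPORT_SYMBOL_GPL(queue_limits_set);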
2024-03-01Merge tag 'md-6.9-20240301' of ↵Jens Axboe
https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.9/block Pull MD updates from Song: "The major changes are: 1. Refactor raid1 read_balance, by Yu Kuai and Paul Luse. 2. Clean up and fix for md_ioctl, by Li Nan. 3. Other small fixes, by Gui-Dong Han and Heming Zhao." * tag 'md-6.9-20240301' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: (22 commits) md/raid1: factor out helpers to choose the best rdev from read_balance() md/raid1: factor out the code to manage sequential IO md/raid1: factor out choose_bb_rdev() from read_balance() md/raid1: factor out choose_slow_rdev() from read_balance() md/raid1: factor out read_first_rdev() from read_balance() md/raid1-10: factor out a new helper raid1_should_read_first() md/raid1-10: add a helper raid1_check_read_range() md/raid1: fix choose next idle in read_balance() md/raid1: record nonrot rdevs while adding/removing rdevs to conf md/raid1: factor out helpers to add rdev to conf md: add a new helper rdev_has_badblock() md/raid5: fix atomicity violation in raid5_cache_count md/md-bitmap: fix incorrect usage for sb_index md: check mddev->pers before calling md_set_readonly() md: clean up openers check in do_md_stop() and md_set_readonly() md: sync blockdev before stopping raid or setting readonly md: factor out a helper to sync mddev md: Don't clear MD_CLOSING when the raid is about to stop md: return directly before setting did_set_md_closing md: clean up invalid BUG_ON in md_ioctl ...
2024-02-29Merge branch 'raid1-read_balance' into md-6.9Song Liu
From: Yu Kuai <yukuai3@huawei.com> Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> The original idea is that Paul wanted to optimize raid1 read performance ([1]); however, we think the original code for read_balance() is quite complex, and we don't want to add more complexity. Hence we decided to refactor read_balance() first, to make the code cleaner and easier to follow up on. Before this patchset, read_balance() had many local variables and many branches, and it tried to consider all the scenarios in one iteration. The idea of this patchset is to divide them into 4 different steps: 1) If resync is in progress, find the first usable disk, patch 5; Otherwise: 2) Loop through all disks, skipping slow disks and disks with bad blocks, and choose the best disk, patch 10. If no disk is found: 3) Look for disks with bad blocks and choose the one with the most sectors, patch 8. If no disk is found: 4) Choose the first found slow disk with no bad blocks, or the slow disk with the most sectors, patch 7. Note that steps 3) and 4) are rare code paths, so performance is not a concern there. And after this patchset, we'll continue to optimize read_balance for step 2), specifically how to choose the best rdev to read from. [1] https://lore.kernel.org/all/20240102125115.129261-1-paul.e.luse@linux.intel.com/ Yu Kuai (11): md: add a new helper rdev_has_badblock() md/raid1: factor out helpers to add rdev to conf md/raid1: record nonrot rdevs while adding/removing rdevs to conf md/raid1: fix choose next idle in read_balance() md/raid1-10: add a helper raid1_check_read_range() md/raid1-10: factor out a new helper raid1_should_read_first() md/raid1: factor out read_first_rdev() from read_balance() md/raid1: factor out choose_slow_rdev() from read_balance() md/raid1: factor out choose_bb_rdev() from read_balance() md/raid1: factor out the code to manage sequential IO md/raid1: factor out helpers to choose the best rdev from read_balance()