2025-03-13  block: remove unused parameter 'q' in __blk_rq_map_sg()  (Anuj Gupta)
request_queue param is no longer used by blk_rq_map_sg and __blk_rq_map_sg. Remove it. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250313035322.243239-1-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-13  Merge tag 'md-6.15-20250312' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux into for-6.15/block  (Jens Axboe)
Merge MD changes from Yu:
"- fix recovery can preempt resync (Li Nan)
 - fix md-bitmap IO limit (Su Yue)
 - fix raid10 discard with REQ_NOWAIT (Xiao Ni)
 - fix raid1 memory leak (Zheng Qixing)
 - fix mddev uaf (Yu Kuai)
 - fix raid1,raid10 IO flags (Yu Kuai)
 - some refactor and cleanup (Yu Kuai)"
* tag 'md-6.15-20250312' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux:
  md/raid10: wait barrier before returning discard request with REQ_NOWAIT
  md/md-bitmap: fix wrong bitmap_limit for clustermd when write sb
  md/raid1,raid10: don't ignore IO flags
  md/raid5: merge reshape_progress checking inside get_reshape_loc()
  md: fix mddev uaf while iterating all_mddevs list
  md: switch md-cluster to use md_submodule_head
  md: don't export md_cluster_ops
  md/md-cluster: cleanup md_cluster_ops reference
  md: switch personalities to use md_submodule_head
  md: introduce struct md_submodule_head and APIs
  md: only include md-cluster.h if necessary
  md: merge common code into find_pers()
  md/raid1: fix memory leak in raid1_run() if no active rdev
  md: ensure resync is prioritized over recovery
2025-03-12  block: fix adding folio to bio  (Ming Lei)
A >4GB folio is possible on some architectures; aarch64, for example, supports 16GB hugepages. In that case the 'offset' of the folio can't be held in an 'unsigned int', which causes a warning in bio_add_folio_nofail() and IO failure. Fix it by adjusting 'page' and trimming 'offset' so that '->bi_offset' won't overflow and the folio can be added to the bio successfully. Fixes: ed9832bc08db ("block: introduce folio awareness and add a bigger size from folio") Cc: Kundan Kumar <kundan.kumar@samsung.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Gavin Shan <gshan@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20250312145136.2891229-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
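The fix boils down to splitting an over-large byte offset into whole pages plus an in-page remainder, so the remainder always fits in 'unsigned int'. A minimal userspace sketch of that arithmetic (the PAGE_SHIFT value and variable names are illustrative, not taken from the patch):

    /* Sketch only: split a >4GB folio offset into pages + in-page offset. */
    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12
    #define PAGE_SIZE  (1UL << PAGE_SHIFT)

    int main(void)
    {
            uint64_t folio_offset = 5ULL << 30;              /* 5GB into a 16GB folio */
            uint64_t page_idx = folio_offset >> PAGE_SHIFT;  /* advance 'page' by this many pages */
            unsigned int bv_offset = folio_offset & (PAGE_SIZE - 1); /* now always < PAGE_SIZE */

            printf("page index %llu, in-page offset %u\n",
                   (unsigned long long)page_idx, bv_offset);
            return 0;
    }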
2025-03-12  block: remove unused parameter  (Guixin Liu)
The request_queue param of blk_mq_map_queue() is not used anymore; remove it, and do the same for blk_get_flush_queue(). Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Link: https://lore.kernel.org/r/20250312084722.129680-1-kanie@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10  badblocks: Fix a nonsense WARN_ON() which checks whether a u64 variable < 0  (Coly Li)
In _badblocks_check(), there are lines of code like this:
  1246   sectors -= len;
         [snipped]
  1251   WARN_ON(sectors < 0);
This WARN_ON() doesn't make sense because sectors is of unsigned long long type and can never be < 0. Fix it by directly checking whether sectors is less than len. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Coly Li <colyli@kernel.org> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250309160556.42854-1-colyli@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
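A standalone demonstration of why the original check could never fire, and of the check that replaces it (plain userspace C, not the kernel code):

    #include <stdio.h>

    int main(void)
    {
            unsigned long long sectors = 8, len = 16;

            if (sectors < len)               /* the meaningful check: would underflow */
                    printf("sectors < len, stop before subtracting\n");

            sectors -= len;                  /* unsigned math wraps instead of going negative */
            printf("sectors after subtraction: %llu\n", sectors);
            printf("(sectors < 0) evaluates to: %d\n", sectors < 0); /* always 0 */
            return 0;
    }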
2025-03-10  block: make sure ->nr_integrity_segments is cloned in blk_rq_prep_clone  (Ming Lei)
Make sure ->nr_integrity_segments is cloned in blk_rq_prep_clone(), otherwise requests cloned by device-mapper multipath will not have the proper nr_integrity_segments values set, then BUG() is hit from sg_alloc_table_chained(). Fixes: b0fd271d5fba ("block: add request clone interface (v2)") Cc: stable@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250310115453.2271109-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10  block: protect hctx attributes/params using q->elevator_lock  (Nilay Shroff)
Currently, hctx attributes (nr_tags, nr_reserved_tags, and cpu_list) are protected using q->sysfs_lock. However, these attributes can be updated in multiple scenarios:
- during the driver's probe method,
- when updating nr_hw_queues,
- when writing to the sysfs attribute nr_requests, which can modify nr_tags.
The nr_requests attribute is already protected using q->elevator_lock, but none of the update paths actually use q->sysfs_lock to protect hctx attributes. So, to ensure proper synchronization, replace q->sysfs_lock with q->elevator_lock when reading hctx attributes through sysfs. Additionally, blk_mq_update_nr_hw_queues allocates and updates hctx. The allocation of hctx is protected using q->elevator_lock; however, updating hctx params happens without any protection, so safeguard the hctx param update path by also using q->elevator_lock. Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250306093956.2818808-1-nilay@linux.ibm.com [axboe: wrap comment at 80 chars] Signed-off-by: Jens Axboe <axboe@kernel.dk>
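As a rough illustration of the locking change (a hedged sketch: the handler shape and field accesses below are assumptions, the real code lives in block/blk-mq-sysfs.c), a show method that used to take q->sysfs_lock now takes q->elevator_lock around the read:

    static ssize_t blk_mq_hw_sysfs_nr_tags_show(struct blk_mq_hw_ctx *hctx,
                                                char *page)
    {
            struct request_queue *q = hctx->queue;
            unsigned int nr_tags;

            mutex_lock(&q->elevator_lock);          /* was: q->sysfs_lock */
            nr_tags = hctx->tags->nr_tags;
            mutex_unlock(&q->elevator_lock);

            return sysfs_emit(page, "%u\n", nr_tags);
    }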
2025-03-10  block: protect read_ahead_kb using q->limits_lock  (Nilay Shroff)
The bdi->ra_pages could be updated under q->limits_lock because it's usually calculated from the queue limits by queue_limits_commit_update. So protect reading/writing the sysfs attribute read_ahead_kb using q->limits_lock instead of q->sysfs_lock. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250304102551.2533767-8-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10  block: protect wbt_lat_usec using q->elevator_lock  (Nilay Shroff)
The wbt latency and state could be updated while initializing or exiting the elevator, and also while configuring IO latency QoS parameters using cgroup. The elevator code path is now protected with q->elevator_lock, so protect access to the sysfs attribute wbt_lat_usec using q->elevator_lock instead of q->sysfs_lock. While we're at it, also protect ioc_qos_write(), which configures wbt parameters via cgroup, using q->elevator_lock. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250304102551.2533767-7-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10  block: protect nr_requests update using q->elevator_lock  (Nilay Shroff)
The sysfs attribute nr_requests could be simultaneously updated from the elevator switch/update or the nr_hw_queues update code path. The update to nr_requests in each of those code paths runs holding q->elevator_lock. So we should protect access to the sysfs attribute nr_requests using q->elevator_lock instead of q->sysfs_lock. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250304102551.2533767-6-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10  block: introduce a dedicated lock for protecting queue elevator updates  (Nilay Shroff)
A queue's elevator can be updated either when modifying nr_hw_queues or through the sysfs scheduler attribute. Currently, elevator switching/updating is protected using q->sysfs_lock, but this has led to lockdep splats[1] due to inconsistent lock ordering between q->sysfs_lock and the freeze-lock in multiple block layer call sites. As the scope of q->sysfs_lock is not well-defined, its (mis)use has resulted in numerous lockdep warnings. To address this, introduce a new q->elevator_lock, dedicated specifically to protecting elevator switches/updates, and use it instead of q->sysfs_lock for that purpose. While at it, make elv_iosched_load_module() a static function, as it is only called from elv_iosched_store(), and remove redundant parameters from its signature. [1] https://lore.kernel.org/all/67637e70.050a0220.3157ee.000c.GAE@google.com/ Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250304102551.2533767-5-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10  block: remove q->sysfs_lock for attributes which don't need it  (Nilay Shroff)
There are a few sysfs attributes in the block layer which don't really need to acquire q->sysfs_lock while being accessed: reading/writing their values is either atomic or can easily be protected using READ_ONCE()/WRITE_ONCE(), and sysfs attributes are inherently protected by sysfs/kernfs internal locking. So this change segregates all existing sysfs attributes for which we can avoid acquiring q->sysfs_lock. For read-only attributes, q->sysfs_lock is removed from the show method; for read/write attributes, it is removed from both the show and store methods. We audited all block sysfs attributes and found the following list of attributes which shouldn't require q->sysfs_lock protection:
1. io_poll: writes to this attribute are ignored, so q->sysfs_lock is not needed.
2. io_poll_delay: writes to this attribute are a NOP, so q->sysfs_lock is not needed.
3. io_timeout: a write updates q->rq_timeout and a read returns the value stored in q->rq_timeout. Moreover, q->rq_timeout is set only once when the queue is initialized (in blk_mq_init_allocated_queue()), even before the disk is added, so it doesn't need to be protected with q->sysfs_lock. As this attribute is not directly correlated with anything else, simply using READ_ONCE()/WRITE_ONCE() is enough.
4. nomerges: a write to this attribute updates two queue flags, QUEUE_FLAG_NOMERGES and QUEUE_FLAG_NOXMERGES. These flags are accessed during bio-merge, which doesn't run with q->sysfs_lock held anyway, and q->flags is updated/accessed with bitops, which are atomic. So protecting it with q->sysfs_lock is not necessary.
5. rq_affinity: a write to this attribute makes atomic updates to the queue flags QUEUE_FLAG_SAME_COMP and QUEUE_FLAG_SAME_FORCE, which are also accessed from blk_mq_complete_need_ipi() using the test_bit macro. As reads/writes of q->flags use atomic bitops, protecting it with q->sysfs_lock is not necessary.
6. nr_zones: writes to this attribute happen in the driver probe method (except for nvme) before the disk is added, outside q->sysfs_lock or any other lock. Moreover, nr_zones is defined as "unsigned int", so reading this attribute, even while it is simultaneously being updated on another CPU, should not return a torn value on any architecture supported by Linux. So we can avoid q->sysfs_lock or any other lock/protection while reading this attribute.
7. discard_zeroes_data: reading this attribute always returns 0, so holding q->sysfs_lock is not required.
8. write_same_max_bytes: reading this attribute always returns 0, so holding q->sysfs_lock is not required.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250304102551.2533767-4-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
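For the attributes that fall back to READ_ONCE()/WRITE_ONCE(), the result is a pattern like the following hedged sketch of an io_timeout-style handler (simplified; helper shapes and error handling are assumptions, not the exact patch):

    static ssize_t queue_io_timeout_show(struct gendisk *disk, char *page)
    {
            /* lockless read: rq_timeout is a plain unsigned int */
            return sysfs_emit(page, "%u\n",
                              jiffies_to_msecs(READ_ONCE(disk->queue->rq_timeout)));
    }

    static ssize_t queue_io_timeout_store(struct gendisk *disk, const char *page,
                                          size_t count)
    {
            unsigned int val;
            int err = kstrtou32(page, 10, &val);

            if (err || val == 0)
                    return -EINVAL;

            /* lockless write: no other state depends on this value */
            WRITE_ONCE(disk->queue->rq_timeout, msecs_to_jiffies(val));
            return count;
    }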
2025-03-10  block: move q->sysfs_lock and queue-freeze under show/store method  (Nilay Shroff)
In preparation to further simplify and group the sysfs attributes which don't require locking, or which require some form of locking other than q->limits_lock, move the acquire/release of q->sysfs_lock and the queue freeze/unfreeze under each attribute's respective show/store method. While we are at it, also remove ->load_module(), which was used to load the module before the queue is frozen. Now that queue-freeze has moved under ->store(), we can load the module directly from the attribute's store method before we actually start freezing the queue. Currently, ->load_module() is only used by the "scheduler" attribute, so we now load the relevant elevator module before we start freezing the queue in elv_iosched_store(). Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250304102551.2533767-3-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10  block: acquire q->limits_lock while reading sysfs attributes  (Nilay Shroff)
There are a few read/write sysfs attributes whose store method is protected with q->limits_lock, while the corresponding show method runs holding q->sysfs_lock. That doesn't make sense: ideally the show method of these attributes should also run holding q->limits_lock. Hence update the show methods so that reading these attributes acquires q->limits_lock instead of q->sysfs_lock. Similarly, there are a few read-only sysfs attributes whose show method is currently protected with q->sysfs_lock, while updates to these attributes can occur through the atomic limit update APIs queue_limits_start_update() and queue_limits_commit_update(), which run holding q->limits_lock. That means reading these attributes while holding q->sysfs_lock doesn't make sense either, so update their show methods to run holding q->limits_lock instead. We have defined a new macro QUEUE_LIM_RO_ENTRY() which uses the new ->show_limit() method, and it runs holding q->limits_lock. All existing read-only sysfs attributes which need q->limits_lock protection while being read have been updated to use this new macro for initialization. Also, the existing QUEUE_LIM_RW_ENTRY() is updated to use the new ->show_limit() method for reading attributes instead of the existing ->show() method. As ->show_limit() runs holding q->limits_lock, the existing read/write sysfs attributes requiring protection are now inherently protected by q->limits_lock instead of q->sysfs_lock. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250304102551.2533767-2-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
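The intent of the new ->show_limit() path can be pictured with a hedged sketch like the one below (wrapper name and structure are assumptions; see the patch for the real QUEUE_LIM_RO_ENTRY()/QUEUE_LIM_RW_ENTRY() definitions):

    /* Sketch: run a limit-reading show handler under q->limits_lock. */
    static ssize_t queue_attr_show_limit(struct gendisk *disk, char *page,
                                         ssize_t (*show_limit)(struct gendisk *, char *))
    {
            struct request_queue *q = disk->queue;
            ssize_t res;

            mutex_lock(&q->limits_lock);            /* was: q->sysfs_lock */
            res = show_limit(disk, page);
            mutex_unlock(&q->limits_lock);

            return res;
    }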
2025-03-06  badblocks: use sector_t instead of int to avoid truncation of badblocks length  (Zheng Qixing)
There is a truncation of badblocks length when setting badblocks as follows:
  echo "2055 4294967299" > bad_blocks
  cat bad_blocks
  2055 3
Change the 'sectors' argument type from 'int' to 'sector_t'. This avoids truncation of the badblocks length for large sectors by replacing 'int' with 'sector_t' (u64), enabling proper handling of larger disk sizes and ensuring compatibility with 64-bit sector addressing. Fixes: 9e0e252a048b ("badblocks: Add core badblock management code") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250227075507.151331-13-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
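The numbers in the reproducer follow directly from 32-bit truncation: 4294967299 is 2^32 + 3, so squeezing it into an int leaves 3. A standalone demonstration:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            unsigned long long requested = 4294967299ULL;   /* 2^32 + 3 */
            int as_int = (int)requested;                    /* old 'sectors' type */
            uint64_t as_sector_t = requested;               /* sector_t is 64-bit */

            printf("int:      %d\n", as_int);               /* prints 3 */
            printf("sector_t: %llu\n", (unsigned long long)as_sector_t);
            return 0;
    }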
2025-03-06  md: improve return types of badblocks handling functions  (Zheng Qixing)
rdev_set_badblocks() only indicates success/failure, so convert its return type from int to boolean for better semantic clarity. rdev_clear_badblocks() return value is never used by any caller, convert it to void. This removes unnecessary value returns. Also update narrow_write_error() in both raid1 and raid10 to use boolean return type to match rdev_set_badblocks(). Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250227075507.151331-12-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06  badblocks: return boolean from badblocks_set() and badblocks_clear()  (Zheng Qixing)
Change the return type of badblocks_set() and badblocks_clear() from int to bool, indicating success or failure. Specifically:
- _badblocks_set() and _badblocks_clear() functions now return true for success and false for failure.
- All calls to these functions are updated to handle the new boolean return type.
- This change improves code clarity and ensures a more consistent handling of success and failure states.
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Acked-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20250227075507.151331-11-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06  badblocks: fix missing bad blocks on retry in _badblocks_check()  (Zheng Qixing)
The bad blocks check would miss bad blocks when retrying under contention, as checking parameters are not reset. These stale values from the previous attempt could lead to incorrect scanning in the subsequent retry. Move seqlock to outer function and reinitialize checking state for each retry. This ensures a clean state for each check attempt, preventing any missed bad blocks. Fixes: 3ea3354cb9f0 ("badblocks: improve badblocks_check() for multiple ranges handling") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250227075507.151331-10-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06  badblocks: fix merge issue when new badblocks align with prev+1  (Li Nan)
There is a merge issue when adding badblocks as follows:
  echo 0 10 > bad_blocks
  echo 30 10 > bad_blocks
  echo 20 10 > bad_blocks
  cat bad_blocks
  0 10
  20 10    //should be merged with (30 10)
  30 10
In this case, if the new badblocks do not intersect with prev, they are added by insert_at(). If there is an intersection with prev+1, the merge is processed in the next re_insert loop. However, when the end of the new badblocks is exactly equal to the offset of prev+1, no further re_insert loop occurs, and the two badblocks are not merged. Fix it by incrementing prev, so the badblocks can be merged by the subsequent code. Fixes: aa511ff8218b ("badblocks: switch to the improved badblock handling code") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250227075507.151331-9-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06  badblocks: try can_merge_front before overlap_front  (Li Nan)
Regardless of whether overlap_front() returns true or false, can_merge_front() will be executed first. Therefore, move can_merge_front() in front of overlap_front() to simplify the code. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250227075507.151331-8-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06  badblocks: fix the use of MAX_BADBLOCKS  (Li Nan)
The number of badblocks cannot exceed MAX_BADBLOCKS, but it should be allowed to equal MAX_BADBLOCKS. Fixes: aa511ff8218b ("badblocks: switch to the improved badblock handling code") Fixes: c3c6a86e9efc ("badblocks: add helper routines for badblock ranges handling") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250227075507.151331-7-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06  badblocks: return error if any badblock set fails  (Li Nan)
_badblocks_set() returns success if at least one badblock is set successfully, even if others fail. This can lead to data inconsistencies in raid, where a failed badblock set should trigger the disk to be kicked out to prevent future reads from failed write areas. _badblocks_set() should return an error if any badblock set fails. Instead of relying on 'rv', directly return 'sectors' for clearer logic: if all badblocks are successfully set, 'sectors' will be 0; otherwise it indicates the number of badblocks that have not been set yet, thus signaling failure. By the way, this also fixes an issue: when a newly set unack badblock is included in an existing ack badblock, the setting would return an error:
  echo "0 100" > /sys/block/md0/md/dev-loop1/bad_blocks
  echo "0 100" > /sys/block/md0/md/dev-loop1/unacknowledged_bad_blocks
  -bash: echo: write error: No space left on device
After the fix, it returns success. Fixes: aa511ff8218b ("badblocks: switch to the improved badblock handling code") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250227075507.151331-6-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06  badblocks: return error directly when setting badblocks exceeds 512  (Li Nan)
In the current handling of badblocks settings, a lot of processing has been done for scenarios where the number of badblocks exceeds 512. This makes the code look quite complex and also introduces some issues. For example, with 512 badblocks already set:
  for((i=0; i<510; i++)); do ((sector=i*2)); echo "$sector 1" > bad_blocks; done
  echo 2100 10 > bad_blocks
  echo 2200 10 > bad_blocks
Set a new one, exceeding 512:
  echo 2000 500 > bad_blocks
Expected:
  2000 500
Actual:
  2100 400
In fact, a disk shouldn't have too many badblocks, and for disks with 512 badblocks, attempting to set more bad blocks doesn't make much sense. At that point, the more appropriate action would be to replace the disk. Therefore, to resolve these issues and simplify the code somewhat, return an error directly when setting badblocks exceeds 512. Fixes: aa511ff8218b ("badblocks: switch to the improved badblock handling code") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250227075507.151331-5-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06  badblocks: attempt to merge adjacent badblocks during ack_all_badblocks  (Li Nan)
If ack and unack badblocks are adjacent, they will not be merged and will remain as two separate badblocks. Even after the bad blocks are written to disk and both become ack, they will still remain as two independent bad blocks. This is not ideal as it wastes the limited space for badblocks. Therefore, during ack_all_badblocks(), attempt to merge badblocks if they are adjacent. Fixes: aa511ff8218b ("badblocks: switch to the improved badblock handling code") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250227075507.151331-4-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06  badblocks: factor out a helper try_adjacent_combine  (Li Nan)
Factor out try_adjacent_combine(); it will be used in a later patch. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250227075507.151331-3-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06  badblocks: Fix error shift ops  (Li Nan)
'bb->shift' is used directly in badblocks; this is wrong, fix it. Fixes: 3ea3354cb9f0 ("badblocks: improve badblocks_check() for multiple ranges handling") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250227075507.151331-2-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06  block: Correctly initialize BLK_INTEGRITY_NOGENERATE and BLK_INTEGRITY_NOVERIFY  (Anuj Gupta)
Currently, BLK_INTEGRITY_NOGENERATE and BLK_INTEGRITY_NOVERIFY are not explicitly set during integrity initialization. This can lead to incorrect reporting of read_verify and write_generate sysfs values, particularly when a device does not support integrity. Ensure that these flags are correctly initialized by default. Reported-by: M Nikhil <nikh1092@linux.ibm.com> Link: https://lore.kernel.org/linux-block/f6130475-3ccd-45d2-abde-3ccceada0f0a@linux.ibm.com/ Fixes: 9f4aa46f2a74 ("block: invert the BLK_INTEGRITY_{GENERATE,VERIFY} flags") Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250305063033.1813-3-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06  block: ensure correct integrity capability propagation in stacked devices  (Anuj Gupta)
queue_limits_stack_integrity() incorrectly sets BLK_INTEGRITY_DEVICE_CAPABLE for a DM device even when none of its underlying devices support integrity. This happens because the flag is inherited unconditionally. Ensure that integrity capabilities are correctly propagated only when the underlying devices actually support integrity. Reported-by: M Nikhil <nikh1092@linux.ibm.com> Link: https://lore.kernel.org/linux-block/f6130475-3ccd-45d2-abde-3ccceada0f0a@linux.ibm.com/ Fixes: c6e56cf6b2e7 ("block: move integrity information into queue_limits") Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250305063033.1813-2-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06  md/raid10: wait barrier before returning discard request with REQ_NOWAIT  (Xiao Ni)
raid10_handle_discard should wait for the barrier before returning a discard bio which has REQ_NOWAIT set. There is also no need to print a warning calltrace when a discard bio has the REQ_NOWAIT flag; quality engineers usually check dmesg and report an error if it contains a warning/error calltrace. Fixes: c9aa889b035f ("md: raid10 add nowait support") Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/linux-raid/20250306094938.48952-1-xni@redhat.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-03-05  blk-throttle: carry over directly  (Ming Lei)
Now ->carryover_bytes[] and ->carryover_ios[] only cover limit/config updates. The carryover bytes/ios can actually be carried over to ->bytes_disp[] and ->io_disp[] directly, since the carryover is a one-shot thing and only valid in the current slice. Then we can remove the two fields and simplify the code a lot. The type of ->bytes_disp[] and ->io_disp[] has to change to signed, because the two fields may become negative when updating limits or config, but both are still big enough to hold the bytes/ios dispatched in a single slice. Cc: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250305043123.3938491-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
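A tiny standalone sketch of why the dispatched counters have to become signed once the carryover is folded into them directly (the sign convention here is an assumption for illustration only):

    #include <stdio.h>

    int main(void)
    {
            long long bytes_disp = 100;     /* dispatched so far in this slice */
            long long carryover  = -300;    /* credit applied on a config update */

            bytes_disp += carryover;        /* may go negative; unsigned would wrap */
            printf("bytes_disp = %lld\n", bytes_disp);
            return 0;
    }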
2025-03-05  blk-throttle: don't take carryover for prioritized processing of metadata  (Ming Lei)
Commit 29390bb5661d ("blk-throttle: support prioritized processing of metadata") takes bytes/ios carryover for prioritized processing of metadata. It turns out we can support this by charging it directly without trimming the slice, and the result is the same as with carryover. Cc: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250305043123.3938491-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-05  blk-throttle: remove last_bytes_disp and last_ios_disp  (Ming Lei)
The two fields are not used any more, so remove them. Cc: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250305043123.3938491-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-05  blk-throttle: fix lower bps rate by throtl_trim_slice()  (Yu Kuai)
The bio submission time may be a few jiffies more than the expected waiting time, because 'extra_bytes' can't be divided in tg_within_bps_limit(), and also because of timer wakeup delay. In this case, adjusting slice_start to jiffies will discard the extra wait time, causing a lower rate than expected. The current in-tree code already covers the deviation by rounddown(), but it turns out that is not enough, because jiffies - slice_start can be a multiple of throtl_slice. For example, assume bps_limit is 1000 bytes, 1 jiffy is 10ms, and the slice is 20ms (2 jiffies); the expected rate is 1000 / 1000 * 20 = 20 bytes per slice. If the user issues two 21-byte IOs, then the wait time will be 30ms for the first IO:
  bytes_allowed = 20, extra_bytes = 1;
  jiffy_wait = 1 + 2 = 3 jiffies
and, considering 1 extra jiffy from the timer, throtl_trim_slice() will be called at:
  jiffies = 40ms
  slice_start = 0ms, slice_end = 40ms
  bytes_disp = 21
In this case, before the patch, the real rate in the first two slices is 10.5 bytes per slice, and the slice will be updated to:
  jiffies = 40ms
  slice_start = 40ms, slice_end = 60ms, bytes_disp = 0;
hence the second IO will have to wait another 30ms. With the patch, the real rate in the first slice is 20 bytes per slice, which is the same as expected, and the slice will be updated to:
  jiffies = 40ms, slice_start = 20ms, slice_end = 60ms, bytes_disp = 1;
and now there are still 19 bytes allowed in the second slice, so the second IO will only have to wait 10ms. This problem causes blktests throtl/001 failure in the case of CONFIG_HZ_100=y; fix it by preserving one extra finished slice in throtl_trim_slice(). Fixes: e43473b7f223 ("blkio: Core implementation of throttle policy") Reported-by: Ming Lei <ming.lei@redhat.com> Closes: https://lore.kernel.org/linux-block/20250222092823.210318-3-yukuai1@huaweicloud.com/ Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250227120645.812815-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-05  md/md-bitmap: fix wrong bitmap_limit for clustermd when write sb  (Su Yue)
In clustermd, separate write-intent bitmaps are used for each cluster node:
  0                    4k                   8k                   12k
  -------------------------------------------------------------------
  | idle               | md super           | bm super [0] + bits |
  | bm bits[0, contd]  | bm super[1] + bits | bm bits[1, contd]   |
  | bm super[2] + bits | bm bits [2, contd] | bm super[3] + bits  |
  | bm bits [3, contd] |                    |                     |
So on node 1, pg_index in __write_sb_page() can be equal to bitmap->storage.file_pages. Then bitmap_limit will be calculated as 0 and md_super_write() will be called with size 0. That means the first 4k sb area of node 1 will never be updated through filemap_write_page(). This bug causes a hang of mdadm/clustermd_tests/01r1_Grow_resize. Use (pg_index % bitmap->storage.file_pages) to make the calculation of bitmap_limit correct. Fixes: ab99a87542f1 ("md/md-bitmap: fix writing non bitmap pages") Signed-off-by: Su Yue <glass.su@suse.com> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Link: https://lore.kernel.org/linux-raid/20250303033918.32136-1-glass.su@suse.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
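The fix itself is a one-line wrap-around of the page index; a standalone sketch with made-up numbers (the surrounding bitmap_limit formula is not reproduced here):

    #include <stdio.h>

    int main(void)
    {
            unsigned long file_pages = 2;   /* pages backing this node's bitmap */
            unsigned long pg_index = 2;     /* on node 1 this can equal file_pages */

            /* without the modulo the out-of-range index collapses the limit to 0 */
            printf("pg_index %% file_pages = %lu\n", pg_index % file_pages);
            return 0;
    }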
2025-03-05  md/raid1,raid10: don't ignore IO flags  (Yu Kuai)
If blk-wbt is enabled by default, raid write performance is quite bad because all IO is throttled by wbt on the underlying disks, since the flag REQ_IDLE is ignored; and it turns out this behaviour has existed since blk-wbt was introduced. Other than REQ_IDLE, other flags should not be ignored either: for example, REQ_META can be set by filesystems, and clearing it can cause priority inversion problems; and REQ_NOWAIT should not be cleared either, because IO will then wait instead of failing directly in the underlying disks. Fix these problems by keeping the IO flags from the master bio. Fixes: f51d46d0e7cb ("md: add support for REQ_NOWAIT") Fixes: e34cbd307477 ("blk-wbt: add general throttling mechanism") Fixes: 5404bc7a87b9 ("[PATCH] Allow file systems to differentiate between data and meta reads") Link: https://lore.kernel.org/linux-raid/20250227121657.832356-1-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-03-05  md/raid5: merge reshape_progress checking inside get_reshape_loc()  (Yu Kuai)
During code review, it was found that, except in raid5_bitmap_sector(), reshape_progress is always checked before get_reshape_loc(); raid5_bitmap_sector() should check it as well, to avoid taking 'conf->device_lock'. Hence merge that check into get_reshape_loc(). Link: https://lore.kernel.org/linux-raid/20250227120452.808503-1-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-03-05  md: fix mddev uaf while iterating all_mddevs list  (Yu Kuai)
While iterating the all_mddevs list from md_notify_reboot() and md_exit(), list_for_each_entry_safe is used, and this can race with deleting the next mddev, causing a UAF:
  t1:
    spin_lock
    // list_for_each_entry_safe(mddev, n, ...)
    mddev_get(mddev1)
    // assume mddev2 is the next entry
    spin_unlock
                t2:
                  // remove mddev2
                  ...
                  mddev_free
                  spin_lock
                  list_del
                  spin_unlock
                  kfree(mddev2)
    mddev_put(mddev1)
    spin_lock
    // continue: dereference mddev2->all_mddevs
The old helper for_each_mddev() actually grabbed the reference of mddev2 while holding the lock, to prevent it from being freed. This problem could be fixed the same way; however, the code would be complex. Hence switch to list_for_each_entry; in this case mddev_put() can free mddev1, which is not safe either, so, following md_seq_show(), also factor out a helper mddev_put_locked() to fix this problem. Cc: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/linux-raid/20250220124348.845222-1-yukuai1@huaweicloud.com Fixes: f26514342255 ("md: stop using for_each_mddev in md_notify_reboot") Fixes: 16648bac862f ("md: stop using for_each_mddev in md_exit") Reported-and-tested-by: Guillaume Morin <guillaume@morinfr.org> Closes: https://lore.kernel.org/all/Z7Y0SURoA8xwg7vn@bender.morinfr.org/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2025-03-05  md: switch md-cluster to use md_submodule_head  (Yu Kuai)
To make the code cleaner, and to prepare for adding a kconfig option for the bitmap. Also remove the unused global variables pers_lock, md_cluster_ops and md_cluster_mod, and the exported symbols register_md_cluster_operations(), unregister_md_cluster_operations() and md_cluster_ops. Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-8-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Su Yue <glass.su@suse.com>
2025-03-05  md: don't export md_cluster_ops  (Yu Kuai)
Add a new field 'cluster_ops' and initialize it in md_setup_cluster(), so that the global variable 'md_cluster_ops' doesn't need to be exported. Also prepare to switch md-cluster to use md_submodule_head. Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-7-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Su Yue <glass.su@suse.com>
2025-03-05  md/md-cluster: cleanup md_cluster_ops reference  (Yu Kuai)
md_cluster_ops->slot_number() is implemented inside md-cluster.c; just call it directly. Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-6-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Su Yue <glass.su@suse.com>
2025-03-05  md: switch personalities to use md_submodule_head  (Yu Kuai)
Remove the global list 'pers_list' and switch to md_submodule_head, which is managed by an xarray. Prepare to unify registration and unregistration for all sub modules. Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-5-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-03-05  md: introduce struct md_submodule_head and APIs  (Yu Kuai)
Prepare to unify registration and unregistration of md personalities and md-cluster, and also prepare to add a kconfig option for md-bitmap. Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-4-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-03-05  md: only include md-cluster.h if necessary  (Yu Kuai)
md-cluster is only supported by raid1 and raid10, so there is no need to include md-cluster.h for other personalities. Also move APIs that are only used in md-cluster.c from md.h to md-cluster.h. Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-3-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Su Yue <glass.su@suse.com>
2025-03-05  md: merge common code into find_pers()  (Yu Kuai)
- pers_lock is held and released by the caller
- try_module_get() is called by the caller
- the error message is printed by the caller
Merge the above code into find_pers(), rename it to get_pers(), and also add a wrapper around module_put() as put_pers(). Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-2-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Su Yue <glass.su@suse.com>
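A hedged sketch of the consolidation described above (lookup logic, lock type, and the message text are simplified from drivers/md/md.c and should not be read as the actual patch):

    static struct md_personality *get_pers(int level, char *clevel)
    {
            struct md_personality *ret = NULL, *pers;

            spin_lock(&pers_lock);
            list_for_each_entry(pers, &pers_list, list) {
                    if (pers->level == level || strcmp(pers->name, clevel) == 0) {
                            if (try_module_get(pers->owner))
                                    ret = pers;
                            break;
                    }
            }
            spin_unlock(&pers_lock);

            if (!ret)
                    pr_warn("md: personality for level %s is not loaded!\n", clevel);
            return ret;
    }

    static void put_pers(struct md_personality *pers)
    {
            module_put(pers->owner);
    }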
2025-03-04  ublk: enforce ublks_max only for unprivileged devices  (Uday Shankar)
Commit 403ebc877832 ("ublk_drv: add module parameter of ublks_max for limiting max allowed ublk dev"), claimed ublks_max was added to prevent a DoS situation with an untrusted user creating too many ublk devices. If that's the case, ublks_max should only restrict the number of unprivileged ublk devices in the system. Enforce the limit only for unprivileged ublk devices, and rename variables accordingly. Leave the external-facing parameter name unchanged, since changing it may break systems which use it (but still update its documentation to reflect its new meaning). As a result of this change, in a system where there are only normal (non-unprivileged) devices, the maximum number of such devices is increased to 1 << MINORBITS, or 1048576. That ought to be enough for anyone, right? Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250228-ublks_max-v1-1-04b7379190c0@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
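The policy change can be summarized by a hedged sketch like this (counter names are illustrative; the real accounting and locking live in drivers/block/ublk_drv.c):

    /* Sketch: only devices created with UBLK_F_UNPRIVILEGED_DEV count
     * against the (renamed) unprivileged limit. */
    static bool ublk_within_limit(u64 flags, unsigned int unprivileged_ublks_added,
                                  unsigned int unprivileged_ublks_max)
    {
            if (!(flags & UBLK_F_UNPRIVILEGED_DEV))
                    return true;    /* privileged devices are no longer capped */

            return unprivileged_ublks_added < unprivileged_ublks_max;
    }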
2025-03-04  loop: Remove struct loop_func_table  (Zhu Yanjun)
The struct was introduced in commit 754d96798fab ("loop: remove loop.h"), but it is not used now, so remove it. No functional changes. Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250227163343.55952-1-yanjun.zhu@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-03  ublk: add DMA alignment limit  (Ming Lei)
The in-tree ublk driver doesn't need a DMA alignment limit because there is one data copy between the request pages and the userspace buffer. However, ublk is going to support zero copy; then a DMA alignment limit is required, because the same IO buffer is forwarded to the backend, which may have a specific buffer DMA alignment limit, so the limit has to be exposed from the frontend driver to the client application. Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250227103707.2640014-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-03  block: split struct bio_integrity_payload  (Christoph Hellwig)
Many of the fields in struct bio_integrity_payload are only needed for the default integrity buffer in the block layer, and the variable sized array at the end of the structure makes it very hard to embed into caller allocated structures. Reduce struct bio_integrity_payload to the minimal structure needed in common code and create two separate containing structures for the automatically generated payload and the caller allocated payload. The latter is a simple wrapper for struct bio_integrity_payload and the bvecs, while the former contains the additional fields moved out of struct bio_integrity_payload. Always use a dedicated mempool for automatic integrity metadata instead of depending on the submitter-controlled bio_set, which often doesn't have the mempool initialized, and stop using mempools for the submitter buffers, as they aren't in the NOIO I/O submission path where we need to guarantee forward progress. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250225154449.422989-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
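A hedged structural sketch of the split (the common fields listed here exist in today's struct bio_integrity_payload, but the wrapper names and the choice of moved fields are assumptions, not copied from the patch):

    /* Minimal payload kept in common code. */
    struct bio_integrity_payload {
            struct bio              *bip_bio;       /* owning bio */
            struct bvec_iter        bip_iter;
            unsigned short          bip_vcnt;       /* # of integrity bio_vecs */
            unsigned short          bip_flags;
            struct bio_vec          *bip_vec;
    };

    /* Caller-allocated case: just the payload plus the caller's bvecs. */
    struct bio_integrity_user {
            struct bio_integrity_payload    bip;
            struct bio_vec                  bvecs[8];
    };

    /* Block-layer generated case: carries the extra bookkeeping that was
     * moved out of the common payload (e.g. deferred verification work). */
    struct bio_integrity_auto {
            struct bio_integrity_payload    bip;
            struct work_struct              work;
            struct bio_vec                  bvecs[8];
    };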
2025-03-03  block: move the block layer auto-integrity code into a new file  (Christoph Hellwig)
The code that automatically creates an integrity payload and generates and verifies the checksums for bios that don't have a submitter-provided integrity payload currently sits right in the middle of the block integrity metadata infrastructure. Split it into a separate file to make the different layers clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250225154449.422989-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-03  block: mark bounce buffering as incompatible with integrity  (Christoph Hellwig)
None of the few drivers still using the legacy block layer bounce buffering support integrity metadata. Explicitly mark the features as incompatible and stop creating the slab and mempool for integrity buffers for the bounce bio_set. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250225154449.422989-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>