summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-06-14dm-integrity: use the nop integrity profileChristoph Hellwig
Use the block layer built-in nop profile instead of duplicating it. Tested by: $ dd if=/dev/urandom of=key.bin bs=512 count=1 $ cryptsetup luksFormat -q --type luks2 --integrity hmac-sha256 \ --integrity-no-wipe /dev/nvme0n1 key.bin $ cryptsetup luksOpen /dev/nvme0n1 luks-integrity --key-file key.bin and then doing mkfs.xfs and simple I/O on the mount file system. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Milan Broz <gmazyland@gmail.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240613084839.1044015-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14md/raid1: don't free conf on raid0_run failureChristoph Hellwig
The core md code calls the ->free method which already frees conf. Fixes: 07f1a6850c5d ("md/raid1: fail run raid1 array when active disk less than one") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240613084839.1044015-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14md/raid0: don't free conf on raid0_run failureChristoph Hellwig
The core md code calls the ->free method which already frees conf. Fixes: 0c031fd37f69 ("md: Move alloc/free acct bioset in to personality") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240613084839.1044015-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14block: initialize integrity buffer to zero before writing it to mediaChristoph Hellwig
Metadata added by bio_integrity_prep is using plain kmalloc, which leads to random kernel memory being written media. For PI metadata this is limited to the app tag that isn't used by kernel generated metadata, but for non-PI metadata the entire buffer leaks kernel memory. Fix this by adding the __GFP_ZERO flag to allocations for writes. Fixes: 7ba1ba12eeef ("block: Block layer data integrity support") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20240613084839.1044015-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14block: add special APIs for run-time disabling of discard and friendsChristoph Hellwig
A few drivers optimistically try to support discard, write zeroes and secure erase and disable the features from the I/O completion handler if the hardware can't support them. This disable can't be done using the atomic queue limits API because the I/O completion handlers can't take sleeping locks or freeze the queue. Keep the existing clearing of the relevant field to zero, but replace the old blk_queue_max_* APIs with new disable APIs that force the value to 0. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240531074837.1648501-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14block: remove unused queue limits APIChristoph Hellwig
Remove all APIs that are unused now that sd and sr have been converted to the atomic queue limits API. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240531074837.1648501-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14sr: convert to the atomic queue limits APIChristoph Hellwig
Assign all queue limits through a local queue_limits variable and queue_limits_commit_update so that we can't race updating them from multiple places, and free the queue when updating them so that in-progress I/O submissions don't see half-updated limits. Also use the chance to clean up variable names to standard ones. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240531074837.1648501-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14sd: convert to the atomic queue limits APIChristoph Hellwig
Assign all queue limits through a local queue_limits variable and queue_limits_commit_update so that we can't race updating them from multiple places, and freeze the queue when updating them so that in-progress I/O submissions don't see half-updated limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240531074837.1648501-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14sd: cleanup zoned queue limits initializationChristoph Hellwig
Consolidate setting zone-related queue limits in sd_zbc_read_zones instead of splitting them between sd_zbc_revalidate_zones and sd_zbc_read_zones, and move the early_zone_information initialization in sd_zbc_read_zones above setting up the queue limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240531074837.1648501-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14sd: factor out a sd_discard_mode helperChristoph Hellwig
Split the logic to pick the right discard mode into a little helper to prepare for further changes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240531074837.1648501-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14sd: simplify the disable case in sd_config_discardChristoph Hellwig
Fall through to the main call to blk_queue_max_discard_sectors given that max_blocks has been initialized to zero above instead of duplicating the call. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240531074837.1648501-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14sd: add a sd_disable_write_same helperChristoph Hellwig
Add helper to disable WRITE SAME when it is not supported and use it instead of sd_config_write_same in the I/O completion handler. This avoids touching more fields than required in the I/O completion handler and prepares for converting sd to use the atomic queue limits API. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240531074837.1648501-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14sd: add a sd_disable_discard helperChristoph Hellwig
Add helper to disable discard when it is not supported and use it instead of sd_config_discard in the I/O completion handler. This avoids touching more fields than required in the I/O completion handler and prepares for converting sd to use the atomic queue limits API. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240531074837.1648501-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14sd: simplify the ZBC case in provisioning_mode_storeChristoph Hellwig
Don't reset the discard settings to no-op over and over when a user writes to the provisioning attribute as that is already the default mode for ZBC devices. In hindsight we should have made writing to the attribute fail for ZBC devices, but the code has probably been around for far too long to change this now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240531074837.1648501-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14block: take io_opt and io_min into account for max_sectorsChristoph Hellwig
The soft max_sectors limit is normally capped by the hardware limits and an arbitrary upper limit enforced by the kernel, but can be modified by the user. A few drivers want to increase this limit (nbd, rbd) or adjust it up or down based on hardware capabilities (sd). Change blk_validate_limits to default max_sectors to the optimal I/O size, or upgrade it to the preferred minimal I/O size if that is larger than the kernel default if no optimal I/O size is provided based on the logic in the SD driver. This keeps the existing kernel default for drivers that do not provide an io_opt or very big io_min value, but picks a much more useful default for those who provide these hints, and allows to remove the hacks to set the user max_sectors limit in nbd, rbd and sd. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Acked-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240531074837.1648501-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14rbd: increase io_opt againChristoph Hellwig
Commit 16d80c54ad42 ("rbd: set io_min, io_opt and discard_granularity to alloc_size") lowered the io_opt size for rbd from objset_bytes which is 4MB for typical setup to alloc_size which is typically 64KB. The commit mostly talks about discard behavior and does mention io_min in passing. Reducing io_opt means reducing the readahead size, which seems counter-intuitive given that rbd currently abuses the user max_sectors setting to actually increase the I/O size. Switch back to the old setting to allow larger reads (the readahead size despite it's name actually limits the size of any buffered read) and to prepare for using io_opt in the max_sectors calculation and getting drivers out of the business of overriding the max_user_sectors value. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240531074837.1648501-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14ubd: untagle discard vs write zeroes not support handlingChristoph Hellwig
Discard and Write Zeroes are different operation and implemented by different fallocate opcodes for ubd. If one fails the other one can work and vice versa. Split the code to disable the operations in ubd_handler to only disable the operation that actually failed. Fixes: 50109b5a03b4 ("um: Add support for DISCARD in the UBD Driver") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-By: Anton Ivanov <anton.ivanov@cambridgegreys.com> Link: https://lore.kernel.org/r/20240531074837.1648501-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14ubd: refactor the interrupt handlerChristoph Hellwig
Instead of a separate handler function that leaves no work in the interrupt hanler itself, split out a per-request end I/O helper and clean up the coding style and variable naming while we're at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-By: Anton Ivanov <anton.ivanov@cambridgegreys.com> Link: https://lore.kernel.org/r/20240531074837.1648501-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14MAINTAINERS: add entry for Rust block device driver APIAndreas Hindborg
Add an entry for the Rust block device driver abstractions. Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com> Link: https://lore.kernel.org/r/20240611114551.228679-4-nmi@metaspace.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14rust: block: add rnull, Rust null_blk implementationAndreas Hindborg
This patch adds an initial version of the Rust null block driver. Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com> Reviewed-by: Benno Lossin <benno.lossin@proton.me> Link: https://lore.kernel.org/r/20240611114551.228679-3-nmi@metaspace.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14rust: block: introduce `kernel::block::mq` moduleAndreas Hindborg
Add initial abstractions for working with blk-mq. This patch is a maintained, refactored subset of code originally published by Wedson Almeida Filho <wedsonaf@gmail.com> [1]. [1] https://github.com/wedsonaf/linux/tree/f2cfd2fe0e2ca4e90994f96afe268bbd4382a891/rust/kernel/blk/mq.rs Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com> Reviewed-by: Benno Lossin <benno.lossin@proton.me> Link: https://lore.kernel.org/r/20240611114551.228679-2-nmi@metaspace.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-12Merge tag 'md-6.11-20240612' of ↵Jens Axboe
git://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.11/block Pull MD updates from Song: "The major changes in this PR are: - sync_action fix and refactoring, by Yu Kuai; - Various small fixes by Christoph Hellwig, Li Nan, and Ofir Gal." * tag 'md-6.11-20240612' of git://git.kernel.org/pub/scm/linux/kernel/git/song/md: md/raid5: avoid BUG_ON() while continue reshape after reassembling md: pass in max_sectors for pers->sync_request() md: factor out helpers for different sync_action in md_do_sync() md: replace last_sync_action with new enum type md: use new helpers in md_do_sync() md: don't fail action_store() if sync_thread is not registered md: remove parameter check_seq for stop_sync_thread() md: replace sysfs api sync_action with new helpers md: factor out helper to start reshape from action_store() md: add new helpers for sync_action md: add a new enum type sync_action md: rearrange recovery_flags md/md-bitmap: fix writing non bitmap pages md/raid1: don't free conf on raid0_run failure md/raid0: don't free conf on raid0_run failure md: make md_flush_request() more readable md: fix deadlock between mddev_suspend and flush bio md: change the return value type of md_write_start to void md: do not delete safemode_timer in mddev_suspend
2024-06-12md/raid5: avoid BUG_ON() while continue reshape after reassemblingYu Kuai
Currently, mdadm support --revert-reshape to abort the reshape while reassembling, as the test 07revert-grow. However, following BUG_ON() can be triggerred by the test: kernel BUG at drivers/md/raid5.c:6278! invalid opcode: 0000 [#1] PREEMPT SMP PTI irq event stamp: 158985 CPU: 6 PID: 891 Comm: md0_reshape Not tainted 6.9.0-03335-g7592a0b0049a #94 RIP: 0010:reshape_request+0x3f1/0xe60 Call Trace: <TASK> raid5_sync_request+0x43d/0x550 md_do_sync+0xb7a/0x2110 md_thread+0x294/0x2b0 kthread+0x147/0x1c0 ret_from_fork+0x59/0x70 ret_from_fork_asm+0x1a/0x30 </TASK> Root cause is that --revert-reshape update the raid_disks from 5 to 4, while reshape position is still set, and after reassembling the array, reshape position will be read from super block, then during reshape the checking of 'writepos' that is caculated by old reshape position will fail. Fix this panic the easy way first, by converting the BUG_ON() to WARN_ON(), and stop the reshape if checkings fail. Noted that mdadm must fix --revert-shape as well, and probably md/raid should enhance metadata validation as well, however this means reassemble will fail and there must be user tools to fix the wrong metadata. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-13-yukuai1@huaweicloud.com
2024-06-12md: pass in max_sectors for pers->sync_request()Yu Kuai
For different sync_action, sync_thread will use different max_sectors, see details in md_sync_max_sectors(), currently both md_do_sync() and pers->sync_request() in eatch iteration have to get the same max_sectors. Hence pass in max_sectors for pers->sync_request() to prevent redundant code. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-12-yukuai1@huaweicloud.com
2024-06-12md: factor out helpers for different sync_action in md_do_sync()Yu Kuai
Make code cleaner by replacing if else if with switch, and it's more obvious now what is doing for each sync_action. There are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-11-yukuai1@huaweicloud.com
2024-06-12md: replace last_sync_action with new enum typeYu Kuai
The only difference is that "none" is removed and initial last_sync_action will be idle. On the one hand, this value is introduced by commit c4a395514516 ("MD: Remember the last sync operation that was performed"), and the usage described in commit message is not affected. On the other hand, last_sync_action is not used in mdadm or mdmon, and none of the tests that I can find. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-10-yukuai1@huaweicloud.com
2024-06-12md: use new helpers in md_do_sync()Yu Kuai
Make code cleaner. and also use the action_name directly in kernel log: - "check" instead of "data-check" - "repair" instead of "requested-resync" Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-9-yukuai1@huaweicloud.com
2024-06-12md: don't fail action_store() if sync_thread is not registeredYu Kuai
MD_RECOVERY_RUNNING will always be set when trying to register a new sync_thread, however, if md_start_sync() turns out to do nothing, MD_RECOVERY_RUNNING will be cleared in this case. And during the race window, action_store() will return -EBUSY, which will cause some mdadm tests to fail. For example: The test 07reshape5intr will add a new disk to array, then start reshape: mdadm /dev/md0 --add /dev/xxx mdadm --grow /dev/md0 -n 3 And add_bound_rdev() from mdadm --add will set MD_RECOVERY_NEEDED, then during the race windown, mdadm --grow will fail. Fix the problem by waiting in action_store() during the race window, fail only if sync_thread is registered. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-8-yukuai1@huaweicloud.com
2024-06-12md: remove parameter check_seq for stop_sync_thread()Yu Kuai
Caller will always set MD_RECOVERY_FROZEN if check_seq is true, and always clear MD_RECOVERY_FROZEN if check_seq is false, hence replace the parameter with test_bit() to make code cleaner. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-7-yukuai1@huaweicloud.com
2024-06-12md: replace sysfs api sync_action with new helpersYu Kuai
To get rid of extrem long if else if usage, and make code cleaner. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-6-yukuai1@huaweicloud.com
2024-06-12md: factor out helper to start reshape from action_store()Yu Kuai
There are no functional changes, just to make code cleaner and prepare for following refactor. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-5-yukuai1@huaweicloud.com
2024-06-12md: add new helpers for sync_actionYu Kuai
The new helpers will get current sync_action of the array, will be used in later patches to make code cleaner. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-4-yukuai1@huaweicloud.com
2024-06-12md: add a new enum type sync_actionYu Kuai
In order to make code related to sync_thread cleaner in following patches, also add detail comment about each sync action. And also prepare to remove the related recovery_flags in the fulture. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-3-yukuai1@huaweicloud.com
2024-06-12md: rearrange recovery_flagsYu Kuai
Currently there are lots of flags with the same confusing prefix "MD_REOCVERY_", and there are two main types of flags, sync thread runnng status, I prefer prefix "SYNC_THREAD_", and sync thread action, I perfer prefix "SYNC_ACTION_". For now, rearrange and update comment to improve code readability, there are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-2-yukuai1@huaweicloud.com
2024-06-11md/md-bitmap: fix writing non bitmap pagesOfir Gal
__write_sb_page() rounds up the io size to the optimal io size if it doesn't exceed the data offset, but it doesn't check the final size exceeds the bitmap length. For example: page count - 1 page size - 4K data offset - 1M optimal io size - 256K The final io size would be 256K (64 pages) but md_bitmap_storage_alloc() allocated 1 page, the IO would write 1 valid page and 63 pages that happens to be allocated afterwards. This leaks memory to the raid device superblock. This issue caused a data transfer failure in nvme-tcp. The network drivers checks the first page of an IO with sendpage_ok(), it returns true if the page isn't a slabpage and refcount >= 1. If the page !sendpage_ok() the network driver disables MSG_SPLICE_PAGES. As of now the network layer assumes all the pages of the IO are sendpage_ok() when MSG_SPLICE_PAGES is on. The bitmap pages aren't slab pages, the first page of the IO is sendpage_ok(), but the additional pages that happens to be allocated after the bitmap pages might be !sendpage_ok(). That cause skb_splice_from_iter() to stop the data transfer, in the case below it hangs 'mdadm --create'. The bug is reproducible, in order to reproduce we need nvme-over-tcp controllers with optimal IO size bigger than PAGE_SIZE. Creating a raid with bitmap over those devices reproduces the bug. In order to simulate large optimal IO size you can use dm-stripe with a single device. Script to reproduce the issue on top of brd devices using dm-stripe is attached below (will be added to blktest). I have added some logs to test the theory: ... md: created bitmap (1 pages) for device md127 __write_sb_page before md_super_write offset: 16, size: 262144. pfn: 0x53ee === __write_sb_page before md_super_write. logging pages === pfn: 0x53ee, slab: 0 <-- the only page that allocated for the bitmap pfn: 0x53ef, slab: 1 pfn: 0x53f0, slab: 0 pfn: 0x53f1, slab: 0 pfn: 0x53f2, slab: 0 pfn: 0x53f3, slab: 1 ... nvme_tcp: sendpage_ok - pfn: 0x53ee, len: 262144, offset: 0 skbuff: before sendpage_ok() - pfn: 0x53ee skbuff: before sendpage_ok() - pfn: 0x53ef WARNING at net/core/skbuff.c:6848 skb_splice_from_iter+0x142/0x450 skbuff: !sendpage_ok - pfn: 0x53ef. is_slab: 1, page_count: 1 ... Cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ofir Gal <ofir.gal@volumez.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240607072748.3182199-1-ofir.gal@volumez.com
2024-06-10md/raid1: don't free conf on raid0_run failureChristoph Hellwig
The core md code calls the ->free method which already frees conf. Fixes: 07f1a6850c5d ("md/raid1: fail run raid1 array when active disk less than one") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240604172607.3185916-3-hch@lst.de
2024-06-10md/raid0: don't free conf on raid0_run failureChristoph Hellwig
The core md code calls the ->free method which already frees conf. Fixes: 0c031fd37f69 ("md: Move alloc/free acct bioset in to personality") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240604172607.3185916-2-hch@lst.de
2024-06-10md: make md_flush_request() more readableLi Nan
Setting bio to NULL and checking 'if(!bio)' is redundant and looks strange, just consolidate them into one condition. There are no functional changes. Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240528203149.2383260-1-linan666@huaweicloud.com
2024-06-10md: fix deadlock between mddev_suspend and flush bioLi Nan
Deadlock occurs when mddev is being suspended while some flush bio is in progress. It is a complex issue. T1. the first flush is at the ending stage, it clears 'mddev->flush_bio' and tries to submit data, but is blocked because mddev is suspended by T4. T2. the second flush sets 'mddev->flush_bio', and attempts to queue md_submit_flush_data(), which is already running (T1) and won't execute again if on the same CPU as T1. T3. the third flush inc active_io and tries to flush, but is blocked because 'mddev->flush_bio' is not NULL (set by T2). T4. mddev_suspend() is called and waits for active_io dec to 0 which is inc by T3. T1 T2 T3 T4 (flush 1) (flush 2) (third 3) (suspend) md_submit_flush_data mddev->flush_bio = NULL; . . md_flush_request . mddev->flush_bio = bio . queue submit_flushes . . . . md_handle_request . . active_io + 1 . . md_flush_request . . wait !mddev->flush_bio . . . . mddev_suspend . . wait !active_io . . . submit_flushes . queue_work md_submit_flush_data . //md_submit_flush_data is already running (T1) . md_handle_request wait resume The root issue is non-atomic inc/dec of active_io during flush process. active_io is dec before md_submit_flush_data is queued, and inc soon after md_submit_flush_data() run. md_flush_request active_io + 1 submit_flushes active_io - 1 md_submit_flush_data md_handle_request active_io + 1 make_request active_io - 1 If active_io is dec after md_handle_request() instead of within submit_flushes(), make_request() can be called directly intead of md_handle_request() in md_submit_flush_data(), and active_io will only inc and dec once in the whole flush process. Deadlock will be fixed. Additionally, the only difference between fixing the issue and before is that there is no return error handling of make_request(). But after previous patch cleaned md_write_start(), make_requst() only return error in raid5_make_request() by dm-raid, see commit 41425f96d7aa ("dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape)". Since dm always splits data and flush operation into two separate io, io size of flush submitted by dm always is 0, make_request() will not be called in md_submit_flush_data(). To prevent future modifications from introducing issues, add WARN_ON to ensure make_request() no error is returned in this context. Fixes: fa2bbff7b0b4 ("md: synchronize flush io with array reconfiguration") Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240525185257.3896201-3-linan666@huaweicloud.com
2024-06-10md: change the return value type of md_write_start to voidLi Nan
Commit cc27b0c78c79 ("md: fix deadlock between mddev_suspend() and md_write_start()") aborted md_write_start() with false when mddev is suspended, which fixed a deadlock if calling mddev_suspend() with holding reconfig_mutex(). Since mddev_suspend() now includes lockdep_assert_not_held(), it no longer holds the reconfig_mutex. This makes previous abort unnecessary. Now, remove unnecessary abort and change function return value to void. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240525185257.3896201-2-linan666@huaweicloud.com
2024-06-10md: do not delete safemode_timer in mddev_suspendLi Nan
The deletion of safemode_timer in mddev_suspend() is redundant and potentially harmful now. If timer is about to be woken up but gets deleted, 'in_sync' will remain 0 until the next write, causing array to stay in the 'active' state instead of transitioning to 'clean'. Commit 0d9f4f135eb6 ("MD: Add del_timer_sync to mddev_suspend (fix nasty panic))" introduced this deletion for dm, because if timer fired after dm is destroyed, the resource which the timer depends on might have been freed. However, commit 0dd84b319352 ("md: call __md_stop_writes in md_stop") added __md_stop_writes() to md_stop(), which is called before freeing resource. Timer is deleted in __md_stop_writes(), and the origin issue is resolved. Therefore, delete safemode_timer can be removed safely now. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240508092053.1447930-1-linan666@huaweicloud.com
2024-06-09Linux 6.10-rc3v6.10-rc3Linus Torvalds
2024-06-09Merge tag 'perf-tools-fixes-for-v6.10-2-2024-06-09' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools Pull perf tools fixes from Arnaldo Carvalho de Melo: - Update copies of kernel headers, which resulted in support for the new 'mseal' syscall, SUBVOL statx return mask bit, RISC-V and PPC prctls, fcntl's DUPFD_QUERY, POSTED_MSI_NOTIFICATION IRQ vector, 'map_shadow_stack' syscall for x86-32. - Revert perf.data record memory allocation optimization that ended up causing a regression, work is being done to re-introduce it in the next merge window. - Fix handling of minimal vmlinux.h file used with BPF's CO-RE when interrupting the build. * tag 'perf-tools-fixes-for-v6.10-2-2024-06-09' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: perf bpf: Fix handling of minimal vmlinux.h file when interrupting the build Revert "perf record: Reduce memory for recording PERF_RECORD_LOST_SAMPLES event" tools headers arm64: Sync arm64's cputype.h with the kernel sources tools headers uapi: Sync linux/stat.h with the kernel sources to pick STATX_SUBVOL tools headers UAPI: Update i915_drm.h with the kernel sources tools headers UAPI: Sync kvm headers with the kernel sources tools arch x86: Sync the msr-index.h copy with the kernel sources tools headers: Update the syscall tables and unistd.h, mostly to support the new 'mseal' syscall perf trace beauty: Update the arch/x86/include/asm/irq_vectors.h copy with the kernel sources to pick POSTED_MSI_NOTIFICATION perf beauty: Update copy of linux/socket.h with the kernel sources tools headers UAPI: Sync fcntl.h with the kernel sources to pick F_DUPFD_QUERY tools headers UAPI: Sync linux/prctl.h with the kernel sources tools include UAPI: Sync linux/stat.h with the kernel sources
2024-06-09Merge tag 'edac_urgent_for_v6.10_rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC fixes from Borislav Petkov: - Convert PCI core error codes to proper error numbers since latter get propagated all the way up to the module loading functions * tag 'edac_urgent_for_v6.10_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC/igen6: Convert PCIBIOS_* return codes to errnos EDAC/amd64: Convert PCIBIOS_* return codes to errnos
2024-06-08Merge tag 'clk-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux Pull clk fix from Stephen Boyd: "One fix for the SiFive PRCI clocks so that the device boots again. This driver was registering clkdev lookups that were always going to be useless. This wasn't a problem until clkdev started returning an error in these cases, causing this driver to fail probe, and thus boot to fail because clks are essential for most drivers. The fix is simple, don't use clkdev because this is a DT based system where clkdev isn't used" * tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: clk: sifive: Do not register clkdevs for PRCI clocks
2024-06-08Merge tag '6.10-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull smb client fixes from Steve French: "Two small smb3 client fixes: - fix deadlock in umount - minor cleanup due to netfs change" * tag '6.10-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: Don't advance the I/O iterator before terminating subrequest smb: client: fix deadlock in smb2_find_smb_tcon()
2024-06-08Merge tag 'for-linus-2024060801' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid Pull HID fixes from Benjamin Tissoires: - fix potential read out of bounds in hid-asus (Andrew Ballance) - fix endian-conversion on little endian systems in intel-ish-hid (Arnd Bergmann) - A couple of new input event codes (Aseda Aboagye) - errors handling fixes in hid-nvidia-shield (Chen Ni), hid-nintendo (Christophe JAILLET), hid-logitech-dj (José Expósito) - current leakage fix while the device is in suspend on a i2c-hid laptop (Johan Hovold) - other assorted smaller fixes and device ID / quirk entry additions * tag 'for-linus-2024060801' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid: HID: Ignore battery for ELAN touchscreens 2F2C and 4116 HID: i2c-hid: elan: fix reset suspend current leakage dt-bindings: HID: i2c-hid: elan: add 'no-reset-on-power-off' property dt-bindings: HID: i2c-hid: elan: add Elan eKTH5015M dt-bindings: HID: i2c-hid: add dedicated Ilitek ILI2901 schema input: Add support for "Do Not Disturb" input: Add event code for accessibility key hid: asus: asus_report_fixup: fix potential read out of bounds HID: logitech-hidpp: add missing MODULE_DESCRIPTION() macro HID: intel-ish-hid: fix endian-conversion HID: nintendo: Fix an error handling path in nintendo_hid_probe() HID: logitech-dj: Fix memory leak in logi_dj_recv_switch_to_dj_mode() HID: core: remove unnecessary WARN_ON() in implement() HID: nvidia-shield: Add missing check for input_ff_create_memless HID: intel-ish-hid: Fix build error for COMPILE_TEST
2024-06-08Merge tag 'kbuild-fixes-v6.10-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild fixes from Masahiro Yamada: - Fix the initial state of the save button in 'make gconfig' - Improve the Kconfig documentation - Fix a Kconfig bug regarding property visibility - Fix build breakage for systems where 'sed' is not installed in /bin - Fix a false warning about missing MODULE_DESCRIPTION() * tag 'kbuild-fixes-v6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: modpost: do not warn about missing MODULE_DESCRIPTION() for vmlinux.o kbuild: explicitly run mksysmap as sed script from link-vmlinux.sh kconfig: remove wrong expr_trans_bool() kconfig: doc: document behavior of 'select' and 'imply' followed by 'if' kconfig: doc: fix a typo in the note about 'imply' kconfig: gconf: give a proper initial state to the Save button kconfig: remove unneeded code for user-supplied values being out of range
2024-06-08Merge tag 'media/v6.10-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media Pull media fixes from Mauro Carvalho Chehab: - fixes for the new ipu6 driver (and related fixes to mei csi driver) - fix a double debugfs remove logic at mgb4 driver - a documentation fix * tag 'media/v6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: media: intel/ipu6: add csi2 port sanity check in notifier bound media: intel/ipu6: update the maximum supported csi2 port number to 6 media: mei: csi: Warn less verbosely of a missing device fwnode media: mei: csi: Put the IPU device reference media: intel/ipu6: fix the buffer flags caused by wrong parentheses media: intel/ipu6: Fix an error handling path in isys_probe() media: intel/ipu6: Move isys_remove() close to isys_probe() media: intel/ipu6: Fix some redundant resources freeing in ipu6_pci_remove() media: Documentation: v4l: Fix ACTIVE route flag media: mgb4: Fix double debugfs remove
2024-06-08Merge tag 'irq-urgent-2024-06-08' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Ingo Molnar: - Fix possible memory leak the riscv-intc irqchip driver load failures - Fix boot crash in the sifive-plic irqchip driver caused by recently changed boot initialization order - Fix race condition in the gic-v3-its irqchip driver * tag 'irq-urgent-2024-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/gic-v3-its: Fix potential race condition in its_vlpi_prop_update() irqchip/sifive-plic: Chain to parent IRQ after handlers are ready irqchip/riscv-intc: Prevent memory leak when riscv_intc_init_common() fails