summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2024-08-27md/raid1: use md_bitmap_wait_behind_writes() in raid1_read_request()Yu Kuai
Use the existed helper instead of open coding it to make the code cleaner. There are no functional changes, and also avoid dereferencing bitmap directly to prepare inventing a new bitmap. Noted that this patch also export md_bitmap_wait_behind_writes(), which is necessary for now, and the exported api will be removed in following patches to convert bitmap apis into ops. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-2-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-08-27md/raid1: Clean up local variable 'b' from raid1_read_request()Yu Kuai
The local variable will only be used onced, in the error path that read_balance() failed to find a valid rdev to read. Since now the rdev is ensured can't be removed from conf while IO is still pending, remove the local variable and dereference rdev directly. Since we're here, also remove an extra empty line, and unnecessary type conversion from sector_t(u64) to unsigned long long. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240801133008.459998-1-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-08-27md: Don't flush sync_work in md_write_start()Yu Kuai
Because flush sync_work may trigger mddev_suspend() if there are spares, and this should never be done in IO path because mddev_suspend() is used to wait for IO. This problem is found by code review. Fixes: bc08041b32ab ("md: suspend array in md_start_sync() if array need reconfiguration") Cc: stable@vger.kernel.org Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240801124746.242558-1-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-08-22dm bufio: Remove NULL check of list_entry()Yuesong Li
list_entry() will never return a NULL pointer, thus remove the check. Signed-off-by: Yuesong Li <liyuesong@vivo.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-08-21dm-crypt: Allow to specify the integrity key size as optionIngo Franzki
For the MAC based integrity operation, the integrity key size (i.e. key_mac_size) is currently set to the digest size of the used digest. For wrapped key HMAC algorithms, the key size is independent of the cryptographic key size. So there is no known size of the mac key in such cases. The desired key size can optionally be specified as argument when the dm-crypt device is configured via 'integrity_key_size:%u'. If no integrity_key_size argument is specified, the mac key size is still set to the digest size, as before. Increase version number to 1.28.0 so that support for the new argument can be detected by user space (i.e. cryptsetup). Signed-off-by: Ingo Franzki <ifranzki@linux.ibm.com> Reviewed-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-08-21dm: Remove unused declaration and empty definition "dm_zone_map_bio"Zhang Zekun
dm_zone_map_bio() has beed removed since commit f211268ed1f9 ("dm: Use the block layer zone append emulation"), remain the declaration unused in header files. So, let's remove this unused declaration and empty definition. Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-08-21dm vdo: force read-only mode for a corrupt recovery journalSusan LeGendre-McGhee
Ensure the recovery journal does not attempt recovery when blocks with mismatched metadata versions are detected. This check is performed after determining that the blocks are otherwise valid so that it does not interfere with normal recovery. Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-08-21dm vdo: abort loading dirty VDO with the old recovery journal formatSusan LeGendre-McGhee
Abort the load process with status code VDO_UNSUPPORTED_VERSION without forcing read-only mode when a journal block with the old format version is detected. Forcing the VDO volume into read-only mode and thus requiring a read-only rebuild should only be done when absolutely necessary. Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-08-21dm vdo: add dmsetup message for returning configuration infoBruce Johnston
Add a new dmsetup message called config, which will return useful configuration information for the vdo volume and the uds index associated with it. The output is a YAML string, and contains a version number to allow future additions to the content. Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-08-21dm vdo: remove bad check of bi_next fieldKen Raeburn
Remove this check to prevent spurious warning messages due to the behavior of other storage layers. This check was intended to make sure dm-vdo does not chain metadata bios together. However, vdo has no control over underlying storage layers, so the assertion is not always true. Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-08-21dm vdo: don't refer to dedupe_context after releasing itKen Raeburn
Clear the dedupe_context pointer in a data_vio whenever ownership of the context is lost, so that vdo can't examine it accidentally. Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-08-20dm-verity: expose root hash digest and signature data to LSMsDeven Bowers
dm-verity provides a strong guarantee of a block device's integrity. As a generic way to check the integrity of a block device, it provides those integrity guarantees to its higher layers, including the filesystem level. However, critical security metadata like the dm-verity roothash and its signing information are not easily accessible to the LSMs. To address this limitation, this patch introduces a mechanism to store and manage these essential security details within a newly added LSM blob in the block_device structure. This addition allows LSMs to make access control decisions on the integrity data stored within the block_device, enabling more flexible security policies. For instance, LSMs can now revoke access to dm-verity devices based on their roothashes, ensuring that only authorized and verified content is accessible. Additionally, LSMs can enforce policies to only allow files from dm-verity devices that have a valid digital signature to execute, effectively blocking any unsigned files from execution, thus enhancing security against unauthorized modifications. The patch includes new hook calls, `security_bdev_setintegrity()`, in dm-verity to expose the dm-verity roothash and the roothash signature to LSMs via preresume() callback. By using the preresume() callback, it ensures that the security metadata is consistently in sync with the metadata of the dm-verity target in the current active mapping table. The hook calls are depended on CONFIG_SECURITY. Signed-off-by: Deven Bowers <deven.desai@linux.microsoft.com> Signed-off-by: Fan Wu <wufan@linux.microsoft.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> [PM: moved sig_size field as discussed] Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-08-16Merge tag 'block-6.11-20240824' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe: - Fix corruption issues with s390/dasd (Eric, Stefan) - Fix a misuse of non irq locking grab of a lock (Li) - MD pull request with a single data corruption fix for raid1 (Yu) * tag 'block-6.11-20240824' of git://git.kernel.dk/linux: block: Fix lockdep warning in blk_mq_mark_tag_wait md/raid1: Fix data corruption for degraded array with slow disk s390/dasd: fix error recovery leading to data corruption on ESE devices s390/dasd: Remove DMA alignment
2024-08-15md/raid1: Fix data corruption for degraded array with slow diskYu Kuai
read_balance() will avoid reading from slow disks as much as possible, however, if valid data only lands in slow disks, and a new normal disk is still in recovery, unrecovered data can be read: raid1_read_request read_balance raid1_should_read_first -> return false choose_best_rdev -> normal disk is not recovered, return -1 choose_bb_rdev -> missing the checking of recovery, return the normal disk -> read unrecovered data Root cause is that the checking of recovery is missing in choose_bb_rdev(). Hence add such checking to fix the problem. Also fix similar problem in choose_slow_rdev(). Cc: stable@vger.kernel.org Fixes: 9f3ced792203 ("md/raid1: factor out choose_bb_rdev() from read_balance()") Fixes: dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()") Reported-and-tested-by: Mateusz Jończyk <mat.jonczyk@o2.pl> Closes: https://lore.kernel.org/all/9952f532-2554-44bf-b906-4880b2e88e3a@o2.pl/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240803091137.3197008-1-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-08-15md: convert comma to semicolonChen Ni
Replace a comma between expression statements by a semicolon. Fixes: 5e5702898e93 ("md/raid10: Handle read errors during recovery better.") Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Link: https://lore.kernel.org/r/20240716025852.400259-1-nichen@iscas.ac.cn Signed-off-by: Song Liu <song@kernel.org>
2024-08-13dm persistent data: fix memory allocation failureMikulas Patocka
kmalloc is unreliable when allocating more than 8 pages of memory. It may fail when there is plenty of free memory but the memory is fragmented. Zdenek Kabelac observed such failure in his tests. This commit changes kmalloc to kvmalloc - kvmalloc will fall back to vmalloc if the large allocation fails. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Cc: stable@vger.kernel.org
2024-08-13dm resume: don't return EINVAL when signalledKhazhismel Kumykov
If the dm_resume method is called on a device that is not suspended, the method will suspend the device briefly, before resuming it (so that the table will be swapped). However, there was a bug that the return value of dm_suspended_md was not checked. dm_suspended_md may return an error when it is interrupted by a signal. In this case, do_resume would call dm_swap_table, which would return -EINVAL. This commit fixes the logic, so that error returned by dm_suspend is checked and the resume operation is undone. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Cc: stable@vger.kernel.org
2024-08-13dm suspend: return -ERESTARTSYS instead of -EINTRMikulas Patocka
This commit changes device mapper, so that it returns -ERESTARTSYS instead of -EINTR when it is interrupted by a signal (so that the ioctl can be restarted). The manpage signal(7) says that the ioctl function should be restarted if the signal was handled with SA_RESTART. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org
2024-07-28minmax: add a few more MIN_T/MAX_T usersLinus Torvalds
Commit 3a7e02c040b1 ("minmax: avoid overly complicated constant expressions in VM code") added the simpler MIN_T/MAX_T macros in order to avoid some excessive expansion from the rather complicated regular min/max macros. The complexity of those macros stems from two issues: (a) trying to use them in situations that require a C constant expression (in static initializers and for array sizes) (b) the type sanity checking and MIN_T/MAX_T avoids both of these issues. Now, in the whole (long) discussion about all this, it was pointed out that the whole type sanity checking is entirely unnecessary for min_t/max_t which get a fixed type that the comparison is done in. But that still leaves min_t/max_t unnecessarily complicated due to worries about the C constant expression case. However, it turns out that there really aren't very many cases that use min_t/max_t for this, and we can just force-convert those. This does exactly that. Which in turn will then allow for much simpler implementations of min_t()/max_t(). All the usual "macros in all upper case will evaluate the arguments multiple times" rules apply. We should do all the same things for the regular min/max() vs MIN/MAX() cases, but that has the added complexity of various drivers defining their own local versions of MIN/MAX, so that needs another level of fixes first. Link: https://lore.kernel.org/all/b47fad1d0cf8449886ad148f8c013dae@AcuMS.aculab.com/ Cc: David Laight <David.Laight@aculab.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-07-22Merge tag 'for-6.11/block-20240722' of git://git.kernel.dk/linuxLinus Torvalds
Pull more block updates from Jens Axboe: - MD fixes via Song: - md-cluster fixes (Heming Zhao) - raid1 fix (Mateusz Jończyk) - s390/dasd module description (Jeff) - Series cleaning up and hardening the blk-mq debugfs flag handling (John, Christoph) - blk-cgroup cleanup (Xiu) - Error polled IO attempts if backend doesn't support it (hexue) - Fix for an sbitmap hang (Yang) * tag 'for-6.11/block-20240722' of git://git.kernel.dk/linux: (23 commits) blk-cgroup: move congestion_count to struct blkcg sbitmap: fix io hung due to race on sbitmap_word::cleared block: avoid polling configuration errors block: Catch possible entries missing from rqf_name[] block: Simplify definition of RQF_NAME() block: Use enum to define RQF_x bit indexes block: Catch possible entries missing from cmd_flag_name[] block: Catch possible entries missing from alloc_policy_name[] block: Catch possible entries missing from hctx_flag_name[] block: Catch possible entries missing from hctx_state_name[] block: Catch possible entries missing from blk_queue_flag_name[] block: Make QUEUE_FLAG_x as an enum block: Relocate BLK_MQ_MAX_DEPTH block: Relocate BLK_MQ_CPU_WORK_BATCH block: remove QUEUE_FLAG_STOPPED block: Add missing entry to hctx_flag_name[] block: Add zone write plugging entry to rqf_name[] block: Add missing entries from cmd_flag_name[] s390/dasd: fix error checks in dasd_copy_pair_store() s390/dasd: add missing MODULE_DESCRIPTION() macros ...
2024-07-22Merge tag 'for-6.11/block-post-20240722' of git://git.kernel.dk/linuxLinus Torvalds
Pull block integrity mapping updates from Jens Axboe: "A set of cleanups and fixes for the block integrity support. Sent separately from the main block changes from last week, as they depended on later fixes in the 6.10-rc cycle" * tag 'for-6.11/block-post-20240722' of git://git.kernel.dk/linux: block: don't free the integrity payload in bio_integrity_unmap_free_user block: don't free submitter owned integrity payload on I/O completion block: call bio_integrity_unmap_free_user from blk_rq_unmap_user block: don't call bio_uninit from bio_endio block: also return bio_integrity_payload * from stubs block: split integrity support out of bio.h
2024-07-21Merge tag 'mm-nonmm-stable-2024-07-21-15-07' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - In the series "treewide: Refactor heap related implementation", Kuan-Wei Chiu has significantly reworked the min_heap library code and has taught bcachefs to use the new more generic implementation. - Yury Norov's series "Cleanup cpumask.h inclusion in core headers" reworks the cpumask and nodemask headers to make things generally more rational. - Kuan-Wei Chiu has sent along some maintenance work against our sorting library code in the series "lib/sort: Optimizations and cleanups". - More library maintainance work from Christophe Jaillet in the series "Remove usage of the deprecated ida_simple_xx() API". - Ryusuke Konishi continues with the nilfs2 fixes and clanups in the series "nilfs2: eliminate the call to inode_attach_wb()". - Kuan-Ying Lee has some fixes to the gdb scripts in the series "Fix GDB command error". - Plus the usual shower of singleton patches all over the place. Please see the relevant changelogs for details. * tag 'mm-nonmm-stable-2024-07-21-15-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (98 commits) ia64: scrub ia64 from poison.h watchdog/perf: properly initialize the turbo mode timestamp and rearm counter tsacct: replace strncpy() with strscpy() lib/bch.c: use swap() to improve code test_bpf: convert comma to semicolon init/modpost: conditionally check section mismatch to __meminit* init: remove unused __MEMINIT* macros nilfs2: Constify struct kobj_type nilfs2: avoid undefined behavior in nilfs_cnt32_ge macro math: rational: add missing MODULE_DESCRIPTION() macro lib/zlib: add missing MODULE_DESCRIPTION() macro fs: ufs: add MODULE_DESCRIPTION() lib/rbtree.c: fix the example typo ocfs2: add bounds checking to ocfs2_check_dir_entry() fs: add kernel-doc comments to ocfs2_prepare_orphan_dir() coredump: simplify zap_process() selftests/fpu: add missing MODULE_DESCRIPTION() macro compiler.h: simplify data_race() macro build-id: require program headers to be right after ELF header resource: add missing MODULE_DESCRIPTION() ...
2024-07-19Merge tag 'for-6.11/dm-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mikulas Patocka: - Optimize processing of flush bios in the dm-linear and dm-stripe targets - Dm-io cleansups and refactoring - Remove unused 'struct thunk' in dm-cache - Handle minor device numbers > 255 in dm-init - Dm-verity refactoring & enabling platform keyring - Fix warning in dm-raid - Improve dm-crypt performance - split bios to smaller pieces, so that They could be processed concurrently - Stop using blk_limits_io_{min,opt} - Dm-vdo cleanup and refactoring - Remove max_write_zeroes_granularity and max_secure_erase_granularity - Dm-multipath cleanup & refactoring - Add dm-crypt and dm-integrity support for non-power-of-2 sector size - Fix reshape in dm-raid - Make dm_block_validator const * tag 'for-6.11/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (33 commits) dm vdo: fix a minor formatting issue in vdo.rst dm vdo int-map: fix kerneldoc formatting dm vdo repair: add missing kerneldoc fields dm: Constify struct dm_block_validator dm-integrity: introduce the Inline mode dm: introduce the target flag mempool_needs_integrity dm raid: fix stripes adding reshape size issues dm raid: move _get_reshape_sectors() as prerequisite to fixing reshape size issues dm-crypt: support for per-sector NVMe metadata dm mpath: don't call dm_get_device in multipath_message dm: factor out helper function from dm_get_device dm-verity: fix dm_is_verity_target() when dm-verity is builtin dm: Remove max_secure_erase_granularity dm: Remove max_write_zeroes_granularity dm vdo indexer: use swap() instead of open coding it dm vdo: remove unused struct 'uds_attribute' dm: stop using blk_limits_io_{min,opt} dm-crypt: limit the size of encryption requests dm verity: add support for signature verification with platform keyring dm-raid: Fix WARN_ON_ONCE check for sync_thread in raid_resume ...
2024-07-19dm vdo int-map: fix kerneldoc formattingMatthew Sakai
Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202407141607.M3E2XQ0Z-lkp@intel.com/ Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-19dm vdo repair: add missing kerneldoc fieldsMatthew Sakai
Also remove trivial comment for increment_recovery_point. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9518 Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-19dm: Constify struct dm_block_validatorChristophe JAILLET
'struct dm_block_validator' are not modified in these drivers. Constifying this structure moves some data to a read-only section, so increase overall security. On a x86_64, with allmodconfig, as an example: Before: ====== text data bss dec hex filename 32047 920 16 32983 80d7 drivers/md/dm-cache-metadata.o After: ===== text data bss dec hex filename 32075 896 16 32987 80db drivers/md/dm-cache-metadata.o Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-19dm-integrity: introduce the Inline modeMikulas Patocka
This commit introduces a new 'I' mode for dm-integrity. The 'I' mode may be selected if the underlying device has non-power-of-2 sector size. In this mode, dm-integrity will store integrity data directly in device's sectors and it will not use journal. This mode improves performance and reduces flash wear because there would be no journal writes. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-07-17Merge tag 'dlm-6.11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: - New flag DLM_LSFL_SOFTIRQ_SAFE can be set by code using dlm to indicate callbacks can be run from softirq - Change md-cluster to set DLM_LSFL_SOFTIRQ_SAFE - Clean up for previous changes, e.g. unused code and parameters - Remove custom pre-allocation of rsb structs which is unnecessary with kmem caches - Change idr to xarray for lkb structs in use - Change idr to xarray for rsb structs being recovered - Change outdated naming related to internal rsb states - Fix some incorrect add/remove of rsb on scan list - Use rcu to free rsb structs * tag 'dlm-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: dlm: add rcu_barrier before destroy kmem cache dlm: remove DLM_LSFL_SOFTIRQ from exflags fs: dlm: remove unused struct 'dlm_processed_nodes' md-cluster: use DLM_LSFL_SOFTIRQ for dlm_new_lockspace() dlm: implement LSFL_SOFTIRQ_SAFE dlm: introduce DLM_LSFL_SOFTIRQ_SAFE dlm: use LSFL_FS to check for kernel lockspace dlm: use rcu to avoid an extra rsb struct lookup dlm: fix add_scan and del_scan usage dlm: change list and timer names dlm: move recover idr to xarray datastructure dlm: move lkb idr to xarray datastructure dlm: drop own rsb pre allocation mechanism dlm: remove ls_local_handle from struct dlm_ls dlm: remove unused parameter in dlm_midcomms_addr dlm: don't kref_init rsbs created for toss list dlm: remove scand leftovers
2024-07-15Merge tag 'for-6.11/block-20240710' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - NVMe updates via Keith: - Device initialization memory leak fixes (Keith) - More constants defined (Weiwen) - Target debugfs support (Hannes) - PCIe subsystem reset enhancements (Keith) - Queue-depth multipath policy (Redhat and PureStorage) - Implement get_unique_id (Christoph) - Authentication error fixes (Gaosheng) - MD updates via Song - sync_action fix and refactoring (Yu Kuai) - Various small fixes (Christoph Hellwig, Li Nan, and Ofir Gal, Yu Kuai, Benjamin Marzinski, Christophe JAILLET, Yang Li) - Fix loop detach/open race (Gulam) - Fix lower control limit for blk-throttle (Yu) - Add module descriptions to various drivers (Jeff) - Add support for atomic writes for block devices, and statx reporting for same. Includes SCSI and NVMe (John, Prasad, Alan) - Add IO priority information to block trace points (Dongliang) - Various zone improvements and tweaks (Damien) - mq-deadline tag reservation improvements (Bart) - Ignore direct reclaim swap writes in writeback throttling (Baokun) - Block integrity improvements and fixes (Anuj) - Add basic support for rust based block drivers. Has a dummy null_blk variant for now (Andreas) - Series converting driver settings to queue limits, and cleanups and fixes related to that (Christoph) - Cleanup for poking too deeply into the bvec internals, in preparation for DMA mapping API changes (Christoph) - Various minor tweaks and fixes (Jiapeng, John, Kanchan, Mikulas, Ming, Zhu, Damien, Christophe, Chaitanya) * tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux: (206 commits) floppy: add missing MODULE_DESCRIPTION() macro loop: add missing MODULE_DESCRIPTION() macro ublk_drv: add missing MODULE_DESCRIPTION() macro xen/blkback: add missing MODULE_DESCRIPTION() macro block/rnbd: Constify struct kobj_type block: take offset into account in blk_bvec_map_sg again block: fix get_max_segment_size() warning loop: Don't bother validating blocksize virtio_blk: Don't bother validating blocksize null_blk: Don't bother validating blocksize block: Validate logical block size in blk_validate_limits() virtio_blk: Fix default logical block size fallback nvmet-auth: fix nvmet_auth hash error handling nvme: implement ->get_unique_id block: pass a phys_addr_t to get_max_segment_size block: add a bvec_phys helper blk-lib: check for kill signal in ioctl BLKZEROOUT block: limit the Write Zeroes to manually writing zeroes fallback block: refacto blkdev_issue_zeroout block: move read-only and supported checks into (__)blkdev_issue_zeroout ...
2024-07-12dm: introduce the target flag mempool_needs_integrityMikulas Patocka
This commit introduces the dm target flag mempool_needs_integrity. When the flag is set, device mapper will call bioset_integrity_create on it's bio sets. The target can then call bio_integrity_alloc on the bios allocated from the table's mempool. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-07-12md/raid1: set max_sectors during early return from choose_slow_rdev()Mateusz Jończyk
Linux 6.9+ is unable to start a degraded RAID1 array with one drive, when that drive has a write-mostly flag set. During such an attempt, the following assertion in bio_split() is hit: BUG_ON(sectors <= 0); Call Trace: ? bio_split+0x96/0xb0 ? exc_invalid_op+0x53/0x70 ? bio_split+0x96/0xb0 ? asm_exc_invalid_op+0x1b/0x20 ? bio_split+0x96/0xb0 ? raid1_read_request+0x890/0xd20 ? __call_rcu_common.constprop.0+0x97/0x260 raid1_make_request+0x81/0xce0 ? __get_random_u32_below+0x17/0x70 ? new_slab+0x2b3/0x580 md_handle_request+0x77/0x210 md_submit_bio+0x62/0xa0 __submit_bio+0x17b/0x230 submit_bio_noacct_nocheck+0x18e/0x3c0 submit_bio_noacct+0x244/0x670 After investigation, it turned out that choose_slow_rdev() does not set the value of max_sectors in some cases and because of it, raid1_read_request calls bio_split with sectors == 0. Fix it by filling in this variable. This bug was introduced in commit dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()") but apparently hidden until commit 0091c5a269ec ("md/raid1: factor out helpers to choose the best rdev from read_balance()") shortly thereafter. Cc: stable@vger.kernel.org # 6.9.x+ Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl> Fixes: dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()") Cc: Song Liu <song@kernel.org> Cc: Yu Kuai <yukuai3@huawei.com> Cc: Paul Luse <paul.e.luse@linux.intel.com> Cc: Xiao Ni <xni@redhat.com> Cc: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Link: https://lore.kernel.org/linux-raid/20240706143038.7253-1-mat.jonczyk@o2.pl/ -- Tested on both Linux 6.10 and 6.9.8. Inside a VM, mdadm testsuite for RAID1 on 6.10 did not find any problems: ./test --dev=loop --no-error --raidtype=raid1 (on 6.9.8 there was one failure, caused by external bitmap support not compiled in). Notes: - I was reliably getting deadlocks when adding / removing devices on such an array - while the array was loaded with fsstress with 20 concurrent processes. When the array was idle or loaded with fsstress with 8 processes, no such deadlocks happened in my tests. This occurred also on unpatched Linux 6.8.0 though, but not on 6.1.97-rc1, so this is likely an independent regression (to be investigated). - I was also getting deadlocks when adding / removing the bitmap on the array in similar conditions - this happened on Linux 6.1.97-rc1 also though. fsstress with 8 concurrent processes did cause it only once during many tests. - in my testing, there was once a problem with hot adding an internal bitmap to the array: mdadm: Cannot add bitmap while array is resyncing or reshaping etc. mdadm: failed to set internal bitmap. even though no such reshaping was happening according to /proc/mdstat. This seems unrelated, though. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240711202316.10775-1-mat.jonczyk@o2.pl
2024-07-12md-cluster: fix no recovery job when adding/re-adding a diskHeming Zhao
The commit db5e653d7c9f ("md: delay choosing sync action to md_start_sync()") delays the start of the sync action. In a clustered environment, this will cause another node to first activate the spare disk and skip recovery. As a result, no nodes will perform recovery when a disk is added or re-added. Before db5e653d7c9f: ``` node1 node2 ---------------------------------------------------------------- md_check_recovery + md_update_sb | sendmsg: METADATA_UPDATED + md_choose_sync_action process_metadata_update | remove_and_add_spares //node1 has not finished adding + call mddev->sync_work //the spare disk:do nothing md_start_sync starts md_do_sync md_do_sync + grabbed resync_lockres:DLM_LOCK_EX + do syncing job md_check_recovery sendmsg: METADATA_UPDATED process_metadata_update //activate spare disk ... ... md_do_sync waiting to grab resync_lockres:EX ``` After db5e653d7c9f: (note: if 'cmd:idle' sets MD_RECOVERY_INTR after md_check_recovery starts md_start_sync, setting the INTR action will exacerbate the delay in node1 calling the md_do_sync function.) ``` node1 node2 ---------------------------------------------------------------- md_check_recovery + md_update_sb | sendmsg: METADATA_UPDATED + calls mddev->sync_work process_metadata_update //node1 has not finished adding //the spare disk:do nothing md_start_sync + md_choose_sync_action | remove_and_add_spares + calls md_do_sync md_check_recovery md_update_sb sendmsg: METADATA_UPDATED process_metadata_update //activate spare disk ... ... ... ... md_do_sync + grabbed resync_lockres:EX + raid1_sync_request skip sync under conf->fullsync:0 md_do_sync 1. waiting to grab resync_lockres:EX 2. when node1 could grab EX lock, node1 will skip resync under recovery_offset:MaxSector ``` How to trigger: ```(commands @node1) # to easily watch the recovery status echo 2000 > /proc/sys/dev/raid/speed_limit_max ssh root@node2 "echo 2000 > /proc/sys/dev/raid/speed_limit_max" mdadm -CR /dev/md0 -l1 -b clustered -n 2 /dev/sda /dev/sdb --assume-clean ssh root@node2 mdadm -A /dev/md0 /dev/sda /dev/sdb mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda mdadm --manage /dev/md0 --add /dev/sdc === "cat /proc/mdstat" on both node, there are no recovery action. === ``` How to fix: because md layer code logic is hard to restore for speeding up sync job on local node, we add new cluster msg to pending the another node to active disk. Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Su Yue <glass.su@suse.com> Acked-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240709104120.22243-2-heming.zhao@suse.com
2024-07-12md-cluster: fix hanging issue while a new disk addingHeming Zhao
The commit 1bbe254e4336 ("md-cluster: check for timeout while a new disk adding") is correct in terms of code syntax but not suite real clustered code logic. When a timeout occurs while adding a new disk, if recv_daemon() bypasses the unlock for ack_lockres:CR, another node will be waiting to grab EX lock. This will cause the cluster to hang indefinitely. How to fix: 1. In dlm_lock_sync(), change the wait behaviour from forever to a timeout, This could avoid the hanging issue when another node fails to handle cluster msg. Another result of this change is that if another node receives an unknown msg (e.g. a new msg_type), the old code will hang, whereas the new code will timeout and fail. This could help cluster_md handle new msg_type from different nodes with different kernel/module versions (e.g. The user only updates one leg's kernel and monitors the stability of the new kernel). 2. The old code for __sendmsg() always returns 0 (success) under the design (must successfully unlock ->message_lockres). This commit makes this function return an error number when an error occurs. Fixes: 1bbe254e4336 ("md-cluster: check for timeout while a new disk adding") Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Su Yue <glass.su@suse.com> Acked-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240709104120.22243-1-heming.zhao@suse.com
2024-07-11Merge tag 'for-6.10/dm-fixes-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fix from Mikulas Patocka: - Fix broken discard for device mapper VDO target * tag 'for-6.10/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm vdo: replace max_discard_sectors with max_hw_discard_sectors
2024-07-11dm vdo: replace max_discard_sectors with max_hw_discard_sectorsBruce Johnston
Commit 4f563a64732d ("block: add a max_user_discard_sectors queue limit") changed block core to set max_discard_sectors to: min(lim->max_hw_discard_sectors, lim->max_user_discard_sectors) Commit 825d8bbd2f32 ("dm: always manage discard support in terms of max_hw_discard_sectors") fixed most dm targetss to deal with this, by replacing max_discard_sectors with max_hw_discard_sectors. Unfortunately, dm-vdo did not get fixed at that time. Fixes: 825d8bbd2f32 ("dm: always manage discard support in terms of max_hw_discard_sectors") Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-10dm raid: fix stripes adding reshape size issuesHeinz Mauelshagen
Adding stripes to an existing raid4/5/6/10 mapped device grows its capacity though it'll be only made available _after_ the respective reshape finished as of MD kernel reshape semantics. Such reshaping involves moving a window forward starting at BOD reading content from previous lesser stripes and writing them back in the new layout with more stripes. Once that process finishes at end of previous data, the grown size may be announced and used. In order to avoid writing over any existing data in place, out-of-place space is added to the beginning of each data device by lvm2 before starting the reshape process. That reshape space wasn't taken into acount for data device size calculation. Fixes resulting from above: - correct event handling conditions in do_table_event() to set the device's capacity after the stripe adding reshape ended - subtract mentioned out-of-place space doing data device and array size calculations - conditionally set capacity as of superblock in preresume Testing: - passes all LVM2 RAID tests including new lvconvert-raid-reshape-size.sh one Tested-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-10dm raid: move _get_reshape_sectors() as prerequisite to fixing reshape size ↵Heinz Mauelshagen
issues rs_set_dev_and_array_sectors() needs this function to calculate device and array size properly in case leg data devices have out-of-place reshape space allocated. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-10dm-crypt: support for per-sector NVMe metadataMikulas Patocka
Support per-sector NVMe metadata in dm-crypt. This commit changes dm-crypt, so that it can use NVMe metadata to store authentication information. We can put dm-crypt directly on the top of NVMe device, without using dm-integrity. This commit improves write throughput twice, becase the will be no writes to the dm-integrity journal. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-10dm mpath: don't call dm_get_device in multipath_messageBenjamin Marzinski
When mutipath_message is called with an action and a device, it needs to find the pgpath that matches that device. dm_get_device() is not the right function for this. dm_get_device() will look for a table_device matching the requested path in use by either the live or inactive table. If it doesn't find the device, dm_get_device() will open it and add it to the table. Means that multipath_message will accept any block device, add it to the table if not present, and then look through the pgpaths to see if it finds a match. Afterwards it will remove the device if it was not previously in the table devices list. This is the only function that can modify the device list of a table besides the constructors and destructors, and it can only do this when it was passed an invalid message. Instead, multipath_message() should call dm_devt_from_path() to get the device dev_t, and match that against its pgpaths. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-10dm: factor out helper function from dm_get_deviceBenjamin Marzinski
Factor out a helper function, dm_devt_from_path(), from dm_get_device() for use in dm targets. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-10dm-verity: fix dm_is_verity_target() when dm-verity is builtinEric Biggers
When CONFIG_DM_VERITY=y, dm_is_verity_target() returned true for any builtin dm target, not just dm-verity. Fix this by checking for verity_target instead of THIS_MODULE (which is NULL for builtin code). Fixes: b6c1c5745ccc ("dm: Add verity helpers for LoadPin") Cc: stable@vger.kernel.org Cc: Matthias Kaehlcke <mka@chromium.org> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-10dm: Remove max_secure_erase_granularityDamien Le Moal
The max_secure_erase_granularity boolean of struct dm_target is used in __process_abnormal_io() but never set by any target. Remove this field and the dead code using it. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-10dm: Remove max_write_zeroes_granularityDamien Le Moal
The max_write_zeroes_granularity boolean of struct dm_target is used in __process_abnormal_io() but never set by any target. Remove this field and the dead code using it. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-10dm vdo indexer: use swap() instead of open coding itJiapeng Chong
Use existing swap() macro rather than duplicating its implementation. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9173 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-10dm vdo: remove unused struct 'uds_attribute'Dr. David Alan Gilbert
'uds_attribute' is unused since commit a9da0fb6d8c6 ("dm vdo: remove all sysfs interfaces"). Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-10dm: stop using blk_limits_io_{min,opt}Christoph Hellwig
Remove use of the blk_limits_io_{min,opt} and assign the values directly to the queue_limits structure. For the io_opt this is a completely mechanical change, for io_min it removes flooring the limit to the physical and logical block size in the particular caller. But as blk_validate_limits will do the same later when actually applying the limits, there still is no change in overall behavior. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-10dm-crypt: limit the size of encryption requestsMikulas Patocka
There was a performance regression reported where dm-crypt would perform worse on new kernels than on old kernels. The reason is that the old kernels split the bios to NVMe request size (that is usually 65536 or 131072 bytes) and the new kernels pass the big bios through dm-crypt and split them underneath. If a big 1MiB bio is passed to dm-crypt, dm-crypt processes it on a single core without parallelization and this is what causes the performance degradation. This commit introduces new tunable variables /sys/module/dm_crypt/parameters/max_read_size and /sys/module/dm_crypt/parameters/max_write_size that specify the maximum bio size for dm-crypt. Bios larger than this value are split, so that they can be encrypted in parallel by multiple cores. If these variables are '0', a default 131072 is used. Splitting bios may cause performance regressions in other workloads - if this happens, the user should increase the value in max_read_size and max_write_size variables. max_read_size: 128k 2399MiB/s 256k 2368MiB/s 512k 1986MiB/s 1024 1790MiB/s max_write_size: 128k 1712MiB/s 256k 1651MiB/s 512k 1537MiB/s 1024k 1332MiB/s Note that if you run dm-crypt inside a virtual machine, you may need to do "echo numa >/sys/module/workqueue/parameters/default_affinity_scope" to improve performance. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Tested-by: Laurence Oberman <loberman@redhat.com>
2024-07-05dm: handle REQ_OP_ZONE_RESET_ALLDamien Le Moal
This commit implements processing of the REQ_OP_ZONE_RESET_ALL operation for zoned mapped devices. Given that this operation always has a BIO sector of 0 and a 0 size, processing through the regular BIO __split_and_process_bio() function does not work because this function would always select the first target. Instead, handling of this operation is implemented using the function __send_zone_reset_all(). Similarly to the __send_empty_flush() function, the new __send_zone_reset_all() function manually goes through all targets of a mapped device table doing the following: 1) If the target can natively support REQ_OP_ZONE_RESET_ALL, __send_duplicate_bios() is used to forward the reset all operation to the target. This case is handled with the __send_zone_reset_all_native() function. 2) For other targets, the function __send_zone_reset_all_emulated() is executed to emulate the execution of REQ_OP_ZONE_RESET_ALL using regular REQ_OP_ZONE_RESET operations. Targets that can natively support REQ_OP_ZONE_RESET_ALL are identified using the new target field zone_reset_all_supported. This boolean is set to true in for targets that have reliable zone limits, that is, targets that map all sequential write required zones of their zoned device(s). Setting this field is handled in dm_set_zones_restrictions() and device_get_zone_resource_limits(). For targets with unreliable zone limits, REQ_OP_ZONE_RESET_ALL must be emulated (case 2 above). This is implemented with __send_zone_reset_all_emulated() and is similar to the block layer function blkdev_zone_reset_all_emulated(): first a report zones is done for the zones of the target to identify zones that need reset, that is, any sequential write required zone that is not already empty. This is done using a bitmap and the function dm_zone_get_reset_bitmap() which sets to 1 the bit corresponding to a zone that needs reset. Next, this zone bitmap is inspected and a clone BIO modified to use the REQ_OP_ZONE_RESET operation issued for any zone with its bit set in the zone bitmap. This implementation is more efficient than what the block layer does with blkdev_zone_reset_all_emulated(), which is always used for DM zoned devices currently: as we can natively use REQ_OP_ZONE_RESET_ALL on targets mapping all sequential write required zones, resetting all zones of a zoned mapped device can be much faster compared to always emulating this operation using regular per-zone reset. In the worst case, this implementation is as-efficient as the block layer emulation. This reduction in the time it takes to reset all zones of a zoned mapped device depends directly on the mapped device targets mapping (reliable zone limits or not). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240704052816.623865-4-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05dm: Refactor is_abnormal_io()Damien Le Moal
Use a single switch-case to simplify is_abnormal_io() and make this function more readable and easier to modify. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240704052816.623865-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-04md/raid5: recheck if reshape has finished with device_lock heldBenjamin Marzinski
When handling an IO request, MD checks if a reshape is currently happening, and if so, where the IO sector is in relation to the reshape progress. MD uses conf->reshape_progress for both of these tasks. When the reshape finishes, conf->reshape_progress is set to MaxSector. If this occurs after MD checks if the reshape is currently happening but before it calls ahead_of_reshape(), then ahead_of_reshape() will end up comparing the IO sector against MaxSector. During a backwards reshape, this will make MD think the IO sector is in the area not yet reshaped, causing it to use the previous configuration, and map the IO to the sector where that data was before the reshape. This bug can be triggered by running the lvm2 lvconvert-raid-reshape-linear_to_raid6-single-type.sh test in a loop, although it's very hard to reproduce. Fix this by factoring the code that checks where the IO sector is in relation to the reshape out to a helper called get_reshape_loc(), which reads reshape_progress and reshape_safe while holding the device_lock, and then rechecks if the reshape has finished before calling ahead_of_reshape with the saved values. Also use the helper during the REQ_NOWAIT check to see if the location is inside of the reshape region. Fixes: fef9c61fdfabf ("md/raid5: change reshape-progress measurement to cope with reshaping backwards.") Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240702151802.1632010-1-bmarzins@redhat.com