summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2024-07-04md: Don't wait for MD_RECOVERY_NEEDED for HOT_REMOVE_DISK ioctlYu Kuai
Commit 90f5f7ad4f38 ("md: Wait for md_check_recovery before attempting device removal.") explained in the commit message that failed device must be reomoved from the personality first by md_check_recovery(), before it can be removed from the array. That's the reason the commit add the code to wait for MD_RECOVERY_NEEDED. However, this is not the case now, because remove_and_add_spares() is called directly from hot_remove_disk() from ioctl path, hence failed device(marked faulty) can be removed from the personality by ioctl. On the other hand, the commit introduced a performance problem that if MD_RECOVERY_NEEDED is set and the array is not running, ioctl will wait for 5s before it can return failure to user. Since the waiting is not needed now, fix the problem by removing the waiting. Fixes: 90f5f7ad4f38 ("md: Wait for md_check_recovery before attempting device removal.") Reported-by: Mateusz Kusiak <mateusz.kusiak@linux.intel.com> Closes: https://lore.kernel.org/all/814ff6ee-47a2-4ba0-963e-cf256ee4ecfa@linux.intel.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240627112321.3044744-1-yukuai1@huaweicloud.com
2024-07-04md-cluster: Constify struct md_cluster_operationsChristophe JAILLET
'struct md_cluster_operations' is not modified in this driver. Constifying this structure moves some data to a read-only section, so increase overall security. On a x86_64, with allmodconfig, as an example: Before: ====== text data bss dec hex filename 51941 1442 80 53463 d0d7 drivers/md/md-cluster.o After: ===== text data bss dec hex filename 52133 1246 80 53459 d0d3 drivers/md/md-cluster.o Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/3727f3ce9693cae4e62ae6778ea13971df805479.1719173852.git.christophe.jaillet@wanadoo.fr
2024-07-04md: Remove unneeded semicolonYang Li
./drivers/md/md.c:630:21-22: Unneeded semicolon Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9344 Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240618010759.85416-1-yang.lee@linux.alibaba.com
2024-07-04md/raid5: fix spares errors about rcu usageYu Kuai
As commit ad8606702f26 ("md/raid5: remove rcu protection to access rdev from conf") explains, rcu protection can be removed, however, there are three places left, there won't be any real problems. drivers/md/raid5.c:8071:24: error: incompatible types in comparison expression (different address spaces): drivers/md/raid5.c:8071:24: struct md_rdev [noderef] __rcu * drivers/md/raid5.c:8071:24: struct md_rdev * drivers/md/raid5.c:7569:25: error: incompatible types in comparison expression (different address spaces): drivers/md/raid5.c:7569:25: struct md_rdev [noderef] __rcu * drivers/md/raid5.c:7569:25: struct md_rdev * drivers/md/raid5.c:7573:25: error: incompatible types in comparison expression (different address spaces): drivers/md/raid5.c:7573:25: struct md_rdev [noderef] __rcu * drivers/md/raid5.c:7573:25: struct md_rdev * Fixes: ad8606702f26 ("md/raid5: remove rcu protection to access rdev from conf") Cc: stable@vger.kernel.org Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240615085143.1648223-1-yukuai1@huaweicloud.com
2024-07-03dm verity: add support for signature verification with platform keyringLuca Boccassi
Add a new configuration CONFIG_DM_VERITY_VERIFY_ROOTHASH_SIG_PLATFORM_KEYRING that enables verifying dm-verity signatures using the platform keyring, which is populated using the UEFI DB certificates. This is useful for self-enrolled systems that do not use MOK, as the secondary keyring which is already used for verification, if the relevant kconfig is enabled, is linked to the machine keyring, which gets its certificates loaded from MOK. On datacenter/virtual/cloud deployments it is more common to deploy one's own certificate chain directly in DB on first boot in unattended mode, rather than relying on MOK, as the latter typically requires interactive authentication to enroll, and is more suited for personal machines. Default to the same value as DM_VERITY_VERIFY_ROOTHASH_SIG_SECONDARY_KEYRING if not otherwise specified, as it is likely that if one wants to use MOK certificates to verify dm-verity volumes, DB certificates are going to be used too. Keys in DB are allowed to load a full kernel already anyway, so they are already highly privileged. Signed-off-by: Luca Boccassi <bluca@debian.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-03dm-raid: Fix WARN_ON_ONCE check for sync_thread in raid_resumeBenjamin Marzinski
rm-raid devices will occasionally trigger the following warning when being resumed after a table load because DM_RECOVERY_RUNNING is set: WARNING: CPU: 7 PID: 5660 at drivers/md/dm-raid.c:4105 raid_resume+0xee/0x100 [dm_raid] The failing check is: WARN_ON_ONCE(test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)); This check is designed to make sure that the sync thread isn't registered, but md_check_recovery can set MD_RECOVERY_RUNNING without the sync_thread ever getting registered. Instead of checking if MD_RECOVERY_RUNNING is set, check if sync_thread is non-NULL. Fixes: 16c4770c75b1 ("dm-raid: really frozen sync_thread during suspend") Suggested-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-03dm-verity: hash blocks with shash import+finup when possibleEric Biggers
Currently dm-verity computes the hash of each block by using multiple calls to the "ahash" crypto API. While the exact sequence depends on the chosen dm-verity settings, in the vast majority of cases it is: 1. crypto_ahash_init() 2. crypto_ahash_update() [salt] 3. crypto_ahash_update() [data] 4. crypto_ahash_final() This is inefficient for two main reasons: - It makes multiple indirect calls, which is expensive on modern CPUs especially when mitigations for CPU vulnerabilities are enabled. Since the salt is the same across all blocks on a given dm-verity device, a much more efficient sequence would be to do an import of the pre-salted state, then a finup. - It uses the ahash (asynchronous hash) API, despite the fact that CPU-based hashing is almost always used in practice, and therefore it experiences the overhead of the ahash-based wrapper for shash. Because dm-verity was intentionally converted to ahash to support off-CPU crypto accelerators, a full reversion to shash might not be acceptable. Yet, we should still provide a fast path for shash with the most common dm-verity settings. Another reason for shash over ahash is that the upcoming multibuffer hashing support, which is specific to CPU-based hashing, is much better suited for shash than for ahash. Supporting it via ahash would add significant complexity and overhead. And it's not possible for the "same" code to properly support both multibuffer hashing and HW accelerators at the same time anyway, given the different computation models. Unfortunately there will always be code specific to each model needed (for users who want to support both). Therefore, this patch adds a new shash import+finup based fast path to dm-verity. It is used automatically when appropriate. This makes dm-verity optimized for what the vast majority of users want: CPU-based hashing with the most common settings, while still retaining support for rarer settings and off-CPU crypto accelerators. In benchmarks with veritysetup's default parameters (SHA-256, 4K data and hash block sizes, 32-byte salt), which also match the parameters that Android currently uses, this patch improves block hashing performance by about 15% on x86_64 using the SHA-NI instructions, or by about 5% on arm64 using the ARMv8 SHA2 instructions. On x86_64 roughly two-thirds of the improvement comes from the use of import and finup, while the remaining third comes from the switch from ahash to shash. Note that another benefit of using "import" to handle the salt is that if the salt size is equal to the input size of the hash algorithm's compression function, e.g. 64 bytes for SHA-256, then the performance is exactly the same as no salt. This doesn't seem to be much better than veritysetup's current default of 32-byte salts, due to the way SHA-256's finalization padding works, but it should be marginally better. Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-03dm-verity: make verity_hash() take dm_verity_io instead of ahash_requestEric Biggers
In preparation for adding shash support to dm-verity, change verity_hash() to take a pointer to a struct dm_verity_io instead of a pointer to the ahash_request embedded inside it. Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-03dm-verity: always "map" the data blocksEric Biggers
dm-verity needs to access data blocks by virtual address in three different cases (zeroization, recheck, and forward error correction), and one more case (shash support) is coming. Since it's guaranteed that dm-verity data blocks never cross pages, and kmap_local_page and kunmap_local are no-ops on modern platforms anyway, just unconditionally "map" every data block's page and work with the virtual buffer directly. This simplifies the code and eliminates unnecessary overhead. Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-03dm-verity: provide dma_alignment limit in io_hintsEric Biggers
Since Linux v6.1, some filesystems support submitting direct I/O that is aligned to only dma_alignment instead of the logical_block_size alignment that was required before. I/O that is not aligned to the logical_block_size is difficult to handle in device-mapper targets that do cryptographic processing of data, as it makes the units of data that are hashed or encrypted possibly be split across pages, creating rarely used and rarely tested edge cases. As such, dm-crypt and dm-integrity have already opted out of this by setting dma_alignment to 'logical_block_size - 1'. Although dm-verity does have code that handles these cases (or at least is intended to do so), supporting direct I/O with such a low amount of alignment is not really useful on dm-verity devices. So, opt dm-verity out of it too so that it's not necessary to handle these edge cases. Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-03dm-verity: make real_digest and want_digest fixed-lengthEric Biggers
Change the digest fields in struct dm_verity_io from variable-length to fixed-length, since their maximum length is fixed at HASH_MAX_DIGESTSIZE, i.e. 64 bytes, which is not too big. This is simpler and makes the fields a bit faster to access. (HASH_MAX_DIGESTSIZE did not exist when this code was written, which may explain why it wasn't used.) This makes the verity_io_real_digest() and verity_io_want_digest() functions trivial, but this patch leaves them in place temporarily since most of their callers will go away in a later patch anyway. Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-03dm-verity: move data hash mismatch handling into its own functionEric Biggers
Move the code that handles mismatches of data block hashes into its own function so that it doesn't clutter up verity_verify_io(). Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-03block: split integrity support out of bio.hChristoph Hellwig
Split struct bio_integrity_payload and the related prototypes out of bio.h into a separate bio-integrity.h header so that it is only pulled in by the few places that need it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240702151047.1746127-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-02dm-verity: move hash algorithm setup into its own functionEric Biggers
Move the code that sets up the hash transformation into its own function. No change in behavior. Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-02dm init: Handle minors larger than 255Benjamin Marzinski
dm_parse_device_entry() simply copies the minor number into dmi.dev, but the dev_t format splits the minor number between the lowest 8 bytes and highest 12 bytes. If the minor number is larger than 255, part of it will end up getting treated as the major number Fix this by checking that the minor number is valid and then encoding it as a dev_t. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-02dm cache metadata: remove unused struct 'thunk'Dr. David Alan Gilbert
'thunk' has been unused since commit f177940a8091 ("dm cache metadata: switch to using the new cursor api for loading metadata"). Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-02dm io: remove code duplication between sync_io and aysnc_ioBenjamin Marzinski
The only difference between the code to setup and dispatch the io in sync_io() and async_io() is the sync argument to dispatch_io(), which is used to update the opf argument. Update the opf argument direcly in sync_io(), and remove the sync argument from dispatch_io(). Then, make sync_io() call async_io() instead of duplicting all of its code. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-02dm io: don't call the async_io notify.fn on invalid num_regionsBenjamin Marzinski
If dm_io() returned an error, callers that set a notify.fn and wanted it called on an error need to check the return value and call notify.fn themselves if it was -EINVAL but not if it was -EIO. None of them do this (granted, all the existing async_io users of dm_io call it in a way that is guaranteed to not return an error). Simplify the interface by never calling the notify.fn if dm_io returns an error. This works with the existing dm_io callers which check for an error and handle it using the same methods as the notify.fn. This also allows us to move the now equivalent num_regions checks out of sync_io() and async_io() and into dm_io() itself. Additionally, change async_io() into a void function, since it can no longer fail. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-02dm io: bump num_bvecs to handle offset memoryBenjamin Marzinski
If dp->get_page() returns a non-zero offset, the bio might need an additional bvec to deal with the offset. For example, if remaining is exactly one page size, but there is an offset, the memory will span two pages. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-06-28bcache: work around a __bitwise to bool conversion sparse warningChristoph Hellwig
Sparse is a bit dumb about bitwise operation on __bitwise types used in boolean contexts. Add a !! to explicitly propagate to boolean without a warning. Fixes: fcf865e357f8 ("block: convert features and flags to __bitwise types") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Kent Overstreet <kent.overstreet@linux.dev> Link: https://lore.kernel.org/r/20240628131657.667797-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-26md: set md-specific flags for all queue limitsChristoph Hellwig
The md driver wants to enforce a number of flags for all devices, even when not inheriting them from the underlying devices. To make sure these flags survive the queue_limits_set calls that md uses to update the queue limits without deriving them form the previous limits add a new md_init_stacking_limits helper that calls blk_set_stacking_limits and sets these flags. Fixes: 1122c0c1cc71 ("block: move cache control settings out of queue->flags") Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240626142637.300624-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-26dm: optimize flushesMikulas Patocka
Device mapper sends flush bios to all the targets and the targets send it to the underlying device. That may be inefficient, for example if a table contains 10 linear targets pointing to the same physical device, then device mapper would send 10 flush bios to that device - despite the fact that only one bio would be sufficient. This commit optimizes the flush behavior. It introduces a per-target variable flush_bypasses_map - it is set when the target supports flush optimization - currently, the dm-linear and dm-stripe targets support it. When all the targets in a table have flush_bypasses_map, flush_bypasses_map on the table is set. __send_empty_flush tests if the table has flush_bypasses_map - and if it has, no flush bios are sent to the targets via the "map" method and the list dm_table->devices is iterated and the flush bios are sent to each member of the list. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Suggested-by: Yang Yang <yang.yang@vivo.com>
2024-06-24bcache: remove heap-related macros and switch to generic min_heapKuan-Wei Chiu
Drop the heap-related macros from bcache and replacing them with the generic min_heap implementation from include/linux. By doing so, code readability is improved by using functions instead of macros. Moreover, the min_heap implementation in include/linux adopts a bottom-up variation compared to the textbook version currently used in bcache. This bottom-up variation allows for approximately 50% reduction in the number of comparison operations during heap siftdown, without changing the number of swaps, thus making it more efficient. Link: https://lkml.kernel.org/ioyfizrzq7w7mjrqcadtzsfgpuntowtjdw5pgn4qhvsdp4mqqg@nrlek5vmisbu Link: https://lkml.kernel.org/r/20240524152958.919343-16-visitorckw@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Reviewed-by: Ian Rogers <irogers@google.com> Acked-by: Coly Li <colyli@suse.de> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Brian Foster <bfoster@redhat.com> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Sakai <msakai@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-24lib min_heap: rename min_heapify() to min_heap_sift_down()Kuan-Wei Chiu
After adding min_heap_sift_up(), the naming convention has been adjusted to maintain consistency with the min_heap_sift_up(). Consequently, min_heapify() has been renamed to min_heap_sift_down(). Link: https://lkml.kernel.org/CAP-5=fVcBAxt8Mw72=NCJPRJfjDaJcqk4rjbadgouAEAHz_q1A@mail.gmail.com Link: https://lkml.kernel.org/r/20240524152958.919343-13-visitorckw@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Reviewed-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Brian Foster <bfoster@redhat.com> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Coly Li <colyli@suse.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Sakai <msakai@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-24lib min_heap: add args for min_heap_callbacksKuan-Wei Chiu
Add a third parameter 'args' for the 'less' and 'swp' functions in the 'struct min_heap_callbacks'. This additional parameter allows these comparison and swap functions to handle extra arguments when necessary. Link: https://lkml.kernel.org/r/20240524152958.919343-9-visitorckw@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Reviewed-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Brian Foster <bfoster@redhat.com> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Coly Li <colyli@suse.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Sakai <msakai@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-24lib min_heap: add type safe interfaceKuan-Wei Chiu
Implement a type-safe interface for min_heap using strong type pointers instead of void * in the data field. This change includes adding small macro wrappers around functions, enabling the use of __minheap_cast and __minheap_obj_size macros for type casting and obtaining element size. This implementation removes the necessity of passing element size in min_heap_callbacks. Additionally, introduce the MIN_HEAP_PREALLOCATED macro for preallocating some elements. Link: https://lkml.kernel.org/ioyfizrzq7w7mjrqcadtzsfgpuntowtjdw5pgn4qhvsdp4mqqg@nrlek5vmisbu Link: https://lkml.kernel.org/r/20240524152958.919343-5-visitorckw@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Reviewed-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Brian Foster <bfoster@redhat.com> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Coly Li <colyli@suse.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Sakai <msakai@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-24bcache: fix typoKuan-Wei Chiu
Replace 'utiility' with 'utility'. Link: https://lkml.kernel.org/r/20240524152958.919343-3-visitorckw@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Reviewed-by: Ian Rogers <irogers@google.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Brian Foster <bfoster@redhat.com> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Coly Li <colyli@suse.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Sakai <msakai@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-20block: Generalize chunk_sectors support as boundary supportJohn Garry
The purpose of the chunk_sectors limit is to ensure that a mergeble request fits within the boundary of the chunck_sector value. Such a feature will be useful for other request_queue boundary limits, so generalize the chunk_sectors merge code. This idea was proposed by Hannes Reinecke. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20240620125359.2684798-3-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20Merge branch 'for-6.11/block-limits' into for-6.11/blockJens Axboe
Merge in queue limits cleanups. * for-6.11/block-limits: block: move the raid_partial_stripes_expensive flag into the features field block: remove the discard_alignment flag block: move the misaligned flag into the features field block: renumber and rename the cache disabled flag block: fix spelling and grammar for in writeback_cache_control.rst block: remove the unused blk_bounce enum
2024-06-20block: move the raid_partial_stripes_expensive flag into the features fieldChristoph Hellwig
Move the raid_partial_stripes_expensive flags into the features field to reclaim a little bit of space. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240619154623.450048-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20block: remove the discard_alignment flagChristoph Hellwig
queue_limits.discard_alignment is never read except in the places where it is stacked into another limit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240619154623.450048-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19Merge branch 'for-6.11/block-limits' into for-6.11/blockJens Axboe
Merge in last round of queue limits changes from Christoph. * for-6.11/block-limits: (26 commits) block: move the bounce flag into the features field block: move the skip_tagset_quiesce flag to queue_limits block: move the pci_p2pdma flag to queue_limits block: move the zone_resetall flag to queue_limits block: move the zoned flag into the features field block: move the poll flag to queue_limits block: move the dax flag to queue_limits block: move the nowait flag to queue_limits block: move the synchronous flag to queue_limits block: move the stable_writes flag to queue_limits block: move the io_stat flag setting to queue_limits block: move the add_random flag to queue_limits block: move the nonrot flag to queue_limits block: move cache control settings out of queue->flags block: remove blk_flush_policy block: freeze the queue in queue_attr_store nbd: move setting the cache control flags to __nbd_set_size virtio_blk: remove virtblk_update_cache_mode loop: fold loop_update_rotational into loop_reconfigure_limits loop: also use the default block size from an underlying block device ... Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move the zoned flag into the features fieldChristoph Hellwig
Move the zoned flags into the features field to reclaim a little bit of space. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-23-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move the poll flag to queue_limitsChristoph Hellwig
Move the poll flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Stacking drivers are simplified in that they now can simply set the flag, and blk_stack_limits will clear it when the features is not supported by any of the underlying devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-22-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move the dax flag to queue_limitsChristoph Hellwig
Move the dax flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-21-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move the nowait flag to queue_limitsChristoph Hellwig
Move the nowait flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Stacking drivers are simplified in that they now can simply set the flag, and blk_stack_limits will clear it when the features is not supported by any of the underlying devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-20-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move the stable_writes flag to queue_limitsChristoph Hellwig
Move the stable_writes flag into the queue_limits feature field so that it can be set atomically with the queue frozen. The flag is now inherited by blk_stack_limits, which greatly simplifies the code in dm, and fixed md which previously did not pass on the flag set on lower devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-18-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move the io_stat flag setting to queue_limitsChristoph Hellwig
Move the io_stat flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Simplify md and dm to set the flag unconditionally instead of avoiding setting a simple flag for cases where it already is set by other means, which is a bit pointless. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-17-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move the add_random flag to queue_limitsChristoph Hellwig
Move the add_random flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Note that this also removes code from dm to clear the flag based on the underlying devices, which can't be reached as dm devices will always start out without the flag set. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move the nonrot flag to queue_limitsChristoph Hellwig
Move the nonrot flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Use the chance to switch to defaulting to non-rotational and require the driver to opt into rotational, which matches the polarity of the sysfs interface. For the z2ram, ps3vram, 2x memstick, ubiblock and dcssblk the new rotational flag is not set as they clearly are not rotational despite this being a behavior change. There are some other drivers that unconditionally set the rotational flag to keep the existing behavior as they arguably can be used on rotational devices even if that is probably not their main use today (e.g. virtio_blk and drbd). The flag is automatically inherited in blk_stack_limits matching the existing behavior in dm and md. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move cache control settings out of queue->flagsChristoph Hellwig
Move the cache control settings into the queue_limits so that the flags can be set atomically with the device queue frozen. Add new features and flags field for the driver set flags, and internal (usually sysfs-controlled) flags in the block layer. Note that we'll eventually remove enough field from queue_limits to bring it back to the previous size. The disable flag is inverted compared to the previous meaning, which means it now survives a rescan, similar to the max_sectors and max_discard_sectors user limits. The FLUSH and FUA flags are now inherited by blk_stack_limits, which simplified the code in dm a lot, but also causes a slight behavior change in that dm-switch and dm-unstripe now advertise a write cache despite setting num_flush_bios to 0. The I/O path will handle this gracefully, but as far as I can tell the lack of num_flush_bios and thus flush support is a pre-existing data integrity bug in those targets that really needs fixing, after which a non-zero num_flush_bios should be required in dm for targets that map to underlying devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ulf Hansson <ulf.hansson@linaro.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-15dm: Remove unused macro DM_ZONE_INVALID_WP_OFSTDamien Le Moal
With the switch to using the zone append emulation of the block layer zone write plugging, the macro DM_ZONE_INVALID_WP_OFST is no longer used in dm-zone.c. Remove its definition. Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com> Reviewed-by: Niklas Cassel <cassel@kernel.org> Link: https://lore.kernel.org/r/20240611023639.89277-5-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-15dm: Improve zone resource limits handlingDamien Le Moal
The generic stacking of limits implemented in the block layer cannot correctly handle stacking of zone resource limits (max open zones and max active zones) because these limits are for an entire device but the stacking may be for a portion of that device (e.g. a dm-linear target that does not cover an entire block device). As a result, when DM devices are created on top of zoned block devices, the DM device never has any zone resource limits advertized, which is only correct if all underlying target devices also have no zone resource limits. If at least one target device has resource limits, the user may see either performance issues (if the max open zone limit of the device is exceeded) or write I/O errors if the max active zone limit of one of the underlying target devices is exceeded. While it is very difficult to correctly and reliably stack zone resource limits in general, cases where targets are not sharing zone resources of the same device can be dealt with relatively easily. Such situation happens when a target maps all sequential zones of a zoned block device: for such mapping, other targets mapping other parts of the same zoned block device can only contain conventional zones and thus will not require any zone resource to correctly handle write operations. For a mapped device constructed with such targets, which includes mapped devices constructed with targets mapping entire zoned block devices, the zone resource limits can be reliably determined using the non-zero minimum of the zone resource limits of all targets. For mapped devices that include targets partially mapping the set of sequential write required zones of zoned block devices, instead of advertizing no zone resource limits, it is also better to set the mapped device limits to the non-zero minimum of the limits of all targets. In this case the limits for a target depend on the number of sequential zones being mapped: if this number of zone is larger than the limits, then the limits of the device apply and can be used. If on the other hand the target maps a number of zones smaller than the limits, then no limits is needed and we can assume that the target has no limits (limits set to 0). This commit improves zone resource limits handling as described above by modifying dm_set_zones_restrictions() to iterate the targets of a mapped device to evaluate the max open and max active zone limits. This relies on an internal "stacking" of the limits of the target devices combined with a direct counting of the number of sequential zones mapped by the targets. 1) For a target mapping an entire zoned block device, the limits for the target are set to the limits of the device. 2) For a target partially mapping a zoned block device, the number of mapped sequential zones is used to determine the limits: if the target maps more sequential write required zones than the device limits, then the limits of the device are used as-is. If the number of mapped sequential zones is lower than the limits, then we assume that the target has no limits (limits set to 0). As this evaluation is done for each target, the zone resource limits for the mapped device are evaluated as the non-zero minimum of the limits of all the targets. For configurations resulting in unreliable limits, i.e. a table containing a target partially mapping a zoned device, a warning message is issued. The counting of mapped sequential zones for the target is done using the new function dm_device_count_zones() which performs a report zones on the entire block device with the callback dm_device_count_zones_cb(). This count of mapped sequential zones is also used to determine if the mapped device contains only conventional zones. This allows simplifying dm_set_zones_restrictions() to not do a report zones just for this. For mapped devices mapping only conventional zones, as before, the mapped device is changed to a regular device by setting its zoned limit to false and clearing all its zone related limits. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com> Reviewed-by: Niklas Cassel <cassel@kernel.org> Link: https://lore.kernel.org/r/20240611023639.89277-4-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-15dm: Call dm_revalidate_zones() after setting the queue limitsDamien Le Moal
dm_revalidate_zones() is called from dm_set_zone_restrictions() when the mapped device queue limits are not yet set. However, dm_revalidate_zones() calls blk_revalidate_disk_zones() and this function consults and modifies the mapped device queue limits. Thus, currently, blk_revalidate_disk_zones() operates on limits that are not yet initialized. Fix this by moving the call to dm_revalidate_zones() out of dm_set_zone_restrictions() and into dm_table_set_restrictions() after executing queue_limits_set(). To further cleanup dm_set_zones_restrictions(), the message about the type of zone append (native or emulated) is also moved inside dm_revalidate_zones(). Fixes: 1c0e720228ad ("dm: use queue_limits_set") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com> Reviewed-by: Niklas Cassel <cassel@kernel.org> Link: https://lore.kernel.org/r/20240611023639.89277-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14Merge branch 'for-6.11/block-limits' into for-6.11/blockJens Axboe
Pull in block limits branch, which exists as a shared branch for both the block and SCSI tree. * for-6.11/block-limits: (26 commits) block: move integrity information into queue_limits block: invert the BLK_INTEGRITY_{GENERATE,VERIFY} flags block: bypass the STABLE_WRITES flag for protection information block: don't require stable pages for non-PI metadata block: use kstrtoul in flag_store block: factor out flag_{store,show} helper for integrity block: remove the blk_flush_integrity call in blk_integrity_unregister block: remove the blk_integrity_profile structure dm-integrity: use the nop integrity profile md/raid1: don't free conf on raid0_run failure md/raid0: don't free conf on raid0_run failure block: initialize integrity buffer to zero before writing it to media block: add special APIs for run-time disabling of discard and friends block: remove unused queue limits API sr: convert to the atomic queue limits API sd: convert to the atomic queue limits API sd: cleanup zoned queue limits initialization sd: factor out a sd_discard_mode helper sd: simplify the disable case in sd_config_discard sd: add a sd_disable_write_same helper ...
2024-06-14block: move integrity information into queue_limitsChristoph Hellwig
Move the integrity information into the queue limits so that it can be set atomically with other queue limits, and that the sysfs changes to the read_verify and write_generate flags are properly synchronized. This also allows to provide a more useful helper to stack the integrity fields, although it still is separate from the main stacking function as not all stackable devices want to inherit the integrity settings. Even with that it greatly simplifies the code in md and dm. Note that the integrity field is moved as-is into the queue limits. While there are good arguments for removing the separate blk_integrity structure, this would cause a lot of churn and might better be done at a later time if desired. However the integrity field in the queue_limits structure is now unconditional so that various ifdefs can be avoided or replaced with IS_ENABLED(). Given that tiny size of it that seems like a worthwhile trade off. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240613084839.1044015-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14block: remove the blk_integrity_profile structureChristoph Hellwig
Block layer integrity configuration is a bit complex right now, as it indirects through operation vectors for a simple two-dimensional configuration: a) the checksum type of none, ip checksum, crc, crc64 b) the presence or absence of a reference tag Remove the integrity profile, and instead add a separate csum_type flag which replaces the existing ip-checksum field and a new flag that indicates the presence of the reference tag. This removes up to two layers of indirect calls, remove the need to offload the no-op verification of non-PI metadata to a workqueue and generally simplifies the code. The downside is that block/t10-pi.c now has to be built into the kernel when CONFIG_BLK_DEV_INTEGRITY is supported. Given that both nvme and SCSI require t10-pi.ko, it is loaded for all usual configurations that enabled CONFIG_BLK_DEV_INTEGRITY already, though. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240613084839.1044015-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14dm-integrity: use the nop integrity profileChristoph Hellwig
Use the block layer built-in nop profile instead of duplicating it. Tested by: $ dd if=/dev/urandom of=key.bin bs=512 count=1 $ cryptsetup luksFormat -q --type luks2 --integrity hmac-sha256 \ --integrity-no-wipe /dev/nvme0n1 key.bin $ cryptsetup luksOpen /dev/nvme0n1 luks-integrity --key-file key.bin and then doing mkfs.xfs and simple I/O on the mount file system. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Milan Broz <gmazyland@gmail.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240613084839.1044015-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14md/raid1: don't free conf on raid0_run failureChristoph Hellwig
The core md code calls the ->free method which already frees conf. Fixes: 07f1a6850c5d ("md/raid1: fail run raid1 array when active disk less than one") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240613084839.1044015-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14md/raid0: don't free conf on raid0_run failureChristoph Hellwig
The core md code calls the ->free method which already frees conf. Fixes: 0c031fd37f69 ("md: Move alloc/free acct bioset in to personality") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240613084839.1044015-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>