summaryrefslogtreecommitdiff
path: root/drivers/md/dm.c
AgeCommit message (Collapse)Author
21 hoursMerge tag 'mm-stable-2025-07-30-15-25' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "As usual, many cleanups. The below blurbiage describes 42 patchsets. 21 of those are partially or fully cleanup work. "cleans up", "cleanup", "maintainability", "rationalizes", etc. I never knew the MM code was so dirty. "mm: ksm: prevent KSM from breaking merging of new VMAs" (Lorenzo Stoakes) addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly mapped VMAs were not eligible for merging with existing adjacent VMAs. "mm/damon: introduce DAMON_STAT for simple and practical access monitoring" (SeongJae Park) adds a new kernel module which simplifies the setup and usage of DAMON in production environments. "stop passing a writeback_control to swap/shmem writeout" (Christoph Hellwig) is a cleanup to the writeback code which removes a couple of pointers from struct writeback_control. "drivers/base/node.c: optimization and cleanups" (Donet Tom) contains largely uncorrelated cleanups to the NUMA node setup and management code. "mm: userfaultfd: assorted fixes and cleanups" (Tal Zussman) does some maintenance work on the userfaultfd code. "Readahead tweaks for larger folios" (Ryan Roberts) implements some tuneups for pagecache readahead when it is reading into order>0 folios. "selftests/mm: Tweaks to the cow test" (Mark Brown) provides some cleanups and consistency improvements to the selftests code. "Optimize mremap() for large folios" (Dev Jain) does that. A 37% reduction in execution time was measured in a memset+mremap+munmap microbenchmark. "Remove zero_user()" (Matthew Wilcox) expunges zero_user() in favor of the more modern memzero_page(). "mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes" (David Hildenbrand) addresses some warts which David noticed in the huge page code. These were not known to be causing any issues at this time. "mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD" (SeongJae Park) provides some cleanup and consolidation work in DAMON. "use vm_flags_t consistently" (Lorenzo Stoakes) uses vm_flags_t in places where we were inappropriately using other types. "mm/memfd: Reserve hugetlb folios before allocation" (Vivek Kasireddy) increases the reliability of large page allocation in the memfd code. "mm: Remove pXX_devmap page table bit and pfn_t type" (Alistair Popple) removes several now-unneeded PFN_* flags. "mm/damon: decouple sysfs from core" (SeongJae Park) implememnts some cleanup and maintainability work in the DAMON sysfs layer. "madvise cleanup" (Lorenzo Stoakes) does quite a lot of cleanup/maintenance work in the madvise() code. "madvise anon_name cleanups" (Vlastimil Babka) provides additional cleanups on top or Lorenzo's effort. "Implement numa node notifier" (Oscar Salvador) creates a standalone notifier for NUMA node memory state changes. Previously these were lumped under the more general memory on/offline notifier. "Make MIGRATE_ISOLATE a standalone bit" (Zi Yan) cleans up the pageblock isolation code and fixes a potential issue which doesn't seem to cause any problems in practice. "selftests/damon: add python and drgn based DAMON sysfs functionality tests" (SeongJae Park) adds additional drgn- and python-based DAMON selftests which are more comprehensive than the existing selftest suite. "Misc rework on hugetlb faulting path" (Oscar Salvador) fixes a rather obscure deadlock in the hugetlb fault code and follows that fix with a series of cleanups. "cma: factor out allocation logic from __cma_declare_contiguous_nid" (Mike Rapoport) rationalizes and cleans up the highmem-specific code in the CMA allocator. "mm/migration: rework movable_ops page migration (part 1)" (David Hildenbrand) provides cleanups and future-preparedness to the migration code. "mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" (SeongJae Park) adds some tracepoints to some DAMON auto-tuning code. "mm/damon: fix misc bugs in DAMON modules" (SeongJae Park) does that. "mm/damon: misc cleanups" (SeongJae Park) also does what it claims. "mm: folio_pte_batch() improvements" (David Hildenbrand) cleans up the large folio PTE batching code. "mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" (SeongJae Park) facilitates dynamic alteration of DAMON's inter-node allocation policy. "Remove unmap_and_put_page()" (Vishal Moola) provides a couple of page->folio conversions. "mm: per-node proactive reclaim" (Davidlohr Bueso) implements a per-node control of proactive reclaim - beyond the current memcg-based implementation. "mm/damon: remove damon_callback" (SeongJae Park) replaces the damon_callback interface with a more general and powerful damon_call()+damos_walk() interface. "mm/mremap: permit mremap() move of multiple VMAs" (Lorenzo Stoakes) implements a number of mremap cleanups (of course) in preparation for adding new mremap() functionality: newly permit the remapping of multiple VMAs when the user is specifying MREMAP_FIXED. It still excludes some specialized situations where this cannot be performed reliably. "drop hugetlb_free_pgd_range()" (Anthony Yznaga) switches some sparc hugetlb code over to the generic version and removes the thus-unneeded hugetlb_free_pgd_range(). "mm/damon/sysfs: support periodic and automated stats update" (SeongJae Park) augments the present userspace-requested update of DAMON sysfs monitoring files. Automatic update is now provided, along with a tunable to control the update interval. "Some randome fixes and cleanups to swapfile" (Kemeng Shi) does what is claims. "mm: introduce snapshot_page" (Luiz Capitulino and David Hildenbrand) provides (and uses) a means by which debug-style functions can grab a copy of a pageframe and inspect it locklessly without tripping over the races inherent in operating on the live pageframe directly. "use per-vma locks for /proc/pid/maps reads" (Suren Baghdasaryan) addresses the large contention issues which can be triggered by reads from that procfs file. Latencies are reduced by more than half in some situations. The series also introduces several new selftests for the /proc/pid/maps interface. "__folio_split() clean up" (Zi Yan) cleans up __folio_split()! "Optimize mprotect() for large folios" (Dev Jain) provides some quite large (>3x) speedups to mprotect() when dealing with large folios. "selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" (wang lian) does some cleanup work in the selftests code. "tools/testing: expand mremap testing" (Lorenzo Stoakes) extends the mremap() selftest in several ways, including adding more checking of Lorenzo's recently added "permit mremap() move of multiple VMAs" feature. "selftests/damon/sysfs.py: test all parameters" (SeongJae Park) extends the DAMON sysfs interface selftest so that it tests all possible user-requested parameters. Rather than the present minimal subset" * tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (370 commits) MAINTAINERS: add missing headers to mempory policy & migration section MAINTAINERS: add missing file to cgroup section MAINTAINERS: add MM MISC section, add missing files to MISC and CORE MAINTAINERS: add missing zsmalloc file MAINTAINERS: add missing files to page alloc section MAINTAINERS: add missing shrinker files MAINTAINERS: move memremap.[ch] to hotplug section MAINTAINERS: add missing mm_slot.h file THP section MAINTAINERS: add missing interval_tree.c to memory mapping section MAINTAINERS: add missing percpu-internal.h file to per-cpu section mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info() selftests/damon: introduce _common.sh to host shared function selftests/damon/sysfs.py: test runtime reduction of DAMON parameters selftests/damon/sysfs.py: test non-default parameters runtime commit selftests/damon/sysfs.py: generalize DAMON context commit assertion selftests/damon/sysfs.py: generalize monitoring attributes commit assertion selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion selftests/damon/sysfs.py: test DAMOS filters commitment selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion selftests/damon/sysfs.py: test DAMOS destinations commitment ...
2025-07-17dm: split write BIOs on zone boundaries when zone append is not emulatedShin'ichiro Kawasaki
Commit 2df7168717b7 ("dm: Always split write BIOs to zoned device limits") updates the device-mapper driver to perform splits for the write BIOs. However, it did not address the cases where DM targets do not emulate zone append, such as in the cases of dm-linear or dm-flakey. For these targets, when the write BIOs span across zone boundaries, they trigger WARN_ON_ONCE(bio_straddles_zones(bio)) in blk_zone_wplug_handle_write(). This results in I/O errors. The errors are reproduced by running blktests test case zbd/004 using zoned dm-linear or dm-flakey devices. To avoid the I/O errors, handle the write BIOs regardless whether DM targets emulate zone append or not, so that all write BIOs are split at zone boundaries. For that purpose, drop the check for zone append emulation in dm_zone_bio_needs_split(). Its argument 'md' is no longer used then drop it also. Fixes: 2df7168717b7 ("dm: Always split write BIOs to zoned device limits") Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Link: https://lore.kernel.org/r/20250717103539.37279-1-shinichiro.kawasaki@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-09mm: remove callers of pfn_t functionalityAlistair Popple
All PFN_* pfn_t flags have been removed. Therefore there is no longer a need for the pfn_t type and all uses can be replaced with normal pfns. Link: https://lkml.kernel.org/r/bbedfa576c9822f8032494efbe43544628698b1f.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-30dm: Check for forbidden splitting of zone write operationsDamien Le Moal
DM targets must not split zone append and write operations using dm_accept_partial_bio() as doing so is forbidden for zone append BIOs, breaks zone append emulation using regular write BIOs and potentially creates deadlock situations with queue freeze operations. Modify dm_accept_partial_bio() to add missing BUG_ON() checks for all these cases, that is, check that the BIO is a write or write zeroes operation. This change packs all the zone related checks together under a static_branch_unlikely(&zoned_enabled) and done only if the target is a zoned device. Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Link: https://lore.kernel.org/r/20250625093327.548866-6-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30dm: Always split write BIOs to zoned device limitsDamien Le Moal
Any zoned DM target that requires zone append emulation will use the block layer zone write plugging. In such case, DM target drivers must not split BIOs using dm_accept_partial_bio() as doing so can potentially lead to deadlocks with queue freeze operations. Regular write operations used to emulate zone append operations also cannot be split by the target driver as that would result in an invalid writen sector value return using the BIO sector. In order for zoned DM target drivers to avoid such incorrect BIO splitting, we must ensure that large BIOs are split before being passed to the map() function of the target, thus guaranteeing that the limits for the mapped device are not exceeded. dm-crypt and dm-flakey are the only target drivers supporting zoned devices and using dm_accept_partial_bio(). In the case of dm-crypt, this function is used to split BIOs to the internal max_write_size limit (which will be suppressed in a different patch). However, since crypt_alloc_buffer() uses a bioset allowing only up to BIO_MAX_VECS (256) vectors in a BIO. The dm-crypt device max_segments limit, which is not set and so default to BLK_MAX_SEGMENTS (128), must thus be respected and write BIOs split accordingly. In the case of dm-flakey, since zone append emulation is not required, the block layer zone write plugging is not used and no splitting of BIOs required. Modify the function dm_zone_bio_needs_split() to use the block layer helper function bio_needs_zone_write_plugging() to force a call to bio_split_to_limits() in dm_split_and_process_bio(). This allows DM target drivers to avoid using dm_accept_partial_bio() for write operations on zoned DM devices. Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250625093327.548866-4-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30block: Introduce bio_needs_zone_write_plugging()Damien Le Moal
In preparation for fixing device mapper zone write handling, introduce the inline helper function bio_needs_zone_write_plugging() to test if a BIO requires handling through zone write plugging using the function blk_zone_plug_bio(). This function returns true for any write (op_is_write(bio) == true) operation directed at a zoned block device using zone write plugging, that is, a block device with a disk that has a zone write plug hash table. This helper allows simplifying the check on entry to blk_zone_plug_bio() and used in to protect calls to it for blk-mq devices and DM devices. Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250625093327.548866-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-04dm: Allow .prepare_ioctl to handle ioctls directlyKevin Wolf
This adds a 'bool *forward' parameter to .prepare_ioctl, which allows device mapper targets to accept ioctls to themselves instead of the underlying device. If the target already fully handled the ioctl, it sets *forward to false and device mapper won't forward it to the underlying device any more. In order for targets to actually know what the ioctl is about and how to handle it, pass also cmd and arg. As long as targets restrict themselves to interpreting ioctls of type DM_IOCTL, this is a backwards compatible change because previously, any such ioctl would have been passed down through all device mapper layers until it reached a device that can't understand the ioctl and would return an error. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-04dm: use generic functions instead of disable_discard and disable_write_zeroesMikulas Patocka
A small code cleanup: use blk_queue_disable_discard and blk_queue_disable_write_zeroes instead of disable_discard and disable_write_zeroes. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-04dm: limit swapping tables for devices with zone write plugsBenjamin Marzinski
dm_revalidate_zones() only allowed new or previously unzoned devices to call blk_revalidate_disk_zones(). If the device was already zoned, disk->nr_zones would always equal md->nr_zones, so dm_revalidate_zones() returned without doing any work. This would make the zoned settings for the device not match the new table. If the device had zone write plug resources, it could run into errors like bdev_zone_is_seq() reading invalid memory because disk->conv_zones_bitmap was the wrong size. If the device doesn't have any zone write plug resources, calling blk_revalidate_disk_zones() will always correctly update device. If blk_revalidate_disk_zones() fails, it can still overwrite or clear the current disk->nr_zones value. In this case, DM must restore the previous value of disk->nr_zones, so that the zoned settings will continue to match the previous value that it fell back to. If the device already has zone write plug resources, blk_revalidate_disk_zones() will not correctly update them, if it is called for arbitrary zoned device changes. Since there is not much need for this ability, the easiest solution is to disallow any table reloads that change the zoned settings, for devices that already have zone plug resources. Specifically, if a device already has zone plug resources allocated, it can only switch to another zoned table that also emulates zone append. Also, it cannot change the device size or the zone size. A device can switch to an error target. Fixes: bb37d77239af2 ("dm: introduce zone append emulation") Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-04-11dm: free table mempools if not used in __bindBenjamin Marzinski
With request-based dm, the mempools don't need reloading when switching tables, but the unused table mempools are not freed until the active table is finally freed. Free them immediately if they are not needed. Fixes: 29dec90a0f1d9 ("dm: fix bio_set allocation") Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-04-11dm: don't change md if dm_table_set_restrictions() failsBenjamin Marzinski
__bind was changing the disk capacity, geometry and mempools of the mapped device before calling dm_table_set_restrictions() which could fail, forcing dm to drop the new table. Failing here would leave the device using the old table but with the wrong capacity and mempools. Move dm_table_set_restrictions() earlier in __bind(). Since it needs the capacity to be set, save the old version and restore it on failure. Fixes: bb37d77239af2 ("dm: introduce zone append emulation") Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-24dm: fix unconditional IO throttle caused by REQ_PREFLUSHJinliang Zheng
When a bio with REQ_PREFLUSH is submitted to dm, __send_empty_flush() generates a flush_bio with REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC, which causes the flush_bio to be throttled by wbt_wait(). An example from v5.4, similar problem also exists in upstream: crash> bt 2091206 PID: 2091206 TASK: ffff2050df92a300 CPU: 109 COMMAND: "kworker/u260:0" #0 [ffff800084a2f7f0] __switch_to at ffff80004008aeb8 #1 [ffff800084a2f820] __schedule at ffff800040bfa0c4 #2 [ffff800084a2f880] schedule at ffff800040bfa4b4 #3 [ffff800084a2f8a0] io_schedule at ffff800040bfa9c4 #4 [ffff800084a2f8c0] rq_qos_wait at ffff8000405925bc #5 [ffff800084a2f940] wbt_wait at ffff8000405bb3a0 #6 [ffff800084a2f9a0] __rq_qos_throttle at ffff800040592254 #7 [ffff800084a2f9c0] blk_mq_make_request at ffff80004057cf38 #8 [ffff800084a2fa60] generic_make_request at ffff800040570138 #9 [ffff800084a2fae0] submit_bio at ffff8000405703b4 #10 [ffff800084a2fb50] xlog_write_iclog at ffff800001280834 [xfs] #11 [ffff800084a2fbb0] xlog_sync at ffff800001280c3c [xfs] #12 [ffff800084a2fbf0] xlog_state_release_iclog at ffff800001280df4 [xfs] #13 [ffff800084a2fc10] xlog_write at ffff80000128203c [xfs] #14 [ffff800084a2fcd0] xlog_cil_push at ffff8000012846dc [xfs] #15 [ffff800084a2fda0] xlog_cil_push_work at ffff800001284a2c [xfs] #16 [ffff800084a2fdb0] process_one_work at ffff800040111d08 #17 [ffff800084a2fe00] worker_thread at ffff8000401121cc #18 [ffff800084a2fe70] kthread at ffff800040118de4 After commit 2def2845cc33 ("xfs: don't allow log IO to be throttled"), the metadata submitted by xlog_write_iclog() should not be throttled. But due to the existence of the dm layer, throttling flush_bio indirectly causes the metadata bio to be throttled. Fix this by conditionally adding REQ_IDLE to flush_bio.bi_opf, which makes wbt_should_throttle() return false to avoid wbt_wait(). Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Reviewed-by: Tianxiang Peng <txpeng@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-17dm: Ensure cloned bio is same length for atomic writeJohn Garry
For an atomic write, a cloned bio must be same length as the original bio, i.e. no splitting. Error in case it is not. Per-dm device queue limits should be setup to ensure that this does not happen, but error this case as an insurance policy. Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-17dm: disable REQ_NOWAIT for flushesMikulas Patocka
REQ_NOWAIT for flushes cannot be easily supported by device mapper because it may allocate multiple bios and its impossible to undo if one of those allocations wants to wait. So, this patch disables REQ_NOWAIT flushes in device mapper and we always return EAGAIN. Previously, the code accepted REQ_NOWAIT flushes, but the non-blocking execution was not guaranteed. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-17dm: remove useless test in alloc_multiple_biosMikulas Patocka
The test gfp_flag & GFP_NOWAIT was always true, because both GFP_NOIO and GFP_NOWAIT include the flag __GFP_KSWAPD_RECLAIM. Luckily, this oversight didn't result in any harm; the loop always started with "try = 0". This patch removes the faulty test and explicitly starts the loop with "try = 0" (so that we don't hold the mutex in the first iteration). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-11-20dm: add support for get_unique_idBenjamin Coddington
This adds support to obtain a device's unique id through dm, similar to the existing ioctl and persistent resevation handling. We limit this to single-target devices. This enables knfsd to export pNFS SCSI luns that have been exported from multipath devices. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-20dm: Remove unused dm_set_md_typeDr. David Alan Gilbert
dm_set_md_type() has been unused since commit ba30585936b0 ("dm: move setting md->type into dm_setup_md_queue") Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-10-15dm: fix a crash if blk_alloc_disk failsMikulas Patocka
If blk_alloc_disk fails, the variable md->disk is set to an error value. cleanup_mapped_device will see that md->disk is non-NULL and it will attempt to access it, causing a crash on this statement "md->disk->private_data = NULL;". Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reported-by: Chenyuan Yang <chenyuan0y@gmail.com> Closes: https://marc.info/?l=dm-devel&m=172824125004329&w=2 Cc: stable@vger.kernel.org Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
2024-09-15Revert "dm: requeue IO if mapping table not yet available"Mikulas Patocka
This reverts commit fa247089de9936a46e290d4724cb5f0b845600f5. The following sequence of commands causes a livelock - there will be workqueue process looping and consuming 100% CPU: dmsetup create --notable test truncate -s 1MiB testdata losetup /dev/loop0 testdata dmsetup load test --table '0 2048 linear /dev/loop0 0' dd if=/dev/zero of=/dev/dm-0 bs=16k count=1 conv=fdatasync The livelock is caused by the commit fa247089de99. The commit claims that it fixes a race condition, however, it is unknown what the actual race condition is and what program is involved in the race condition. When the inactive table is loaded, the nodes /dev/dm-0 and /sys/block/dm-0 are created. /dev/dm-0 has zero size at this point. When the device is suspended and resumed, the nodes /dev/mapper/test and /dev/disk/* are created. If some program opens a block device before it is created by dmsetup or lvm, the program is buggy, so dm could just report an error as it used to do before. Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: fa247089de99 ("dm: requeue IO if mapping table not yet available")
2024-08-13dm suspend: return -ERESTARTSYS instead of -EINTRMikulas Patocka
This commit changes device mapper, so that it returns -ERESTARTSYS instead of -EINTR when it is interrupted by a signal (so that the ioctl can be restarted). The manpage signal(7) says that the ioctl function should be restarted if the signal was handled with SA_RESTART. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org
2024-07-22Merge tag 'for-6.11/block-post-20240722' of git://git.kernel.dk/linuxLinus Torvalds
Pull block integrity mapping updates from Jens Axboe: "A set of cleanups and fixes for the block integrity support. Sent separately from the main block changes from last week, as they depended on later fixes in the 6.10-rc cycle" * tag 'for-6.11/block-post-20240722' of git://git.kernel.dk/linux: block: don't free the integrity payload in bio_integrity_unmap_free_user block: don't free submitter owned integrity payload on I/O completion block: call bio_integrity_unmap_free_user from blk_rq_unmap_user block: don't call bio_uninit from bio_endio block: also return bio_integrity_payload * from stubs block: split integrity support out of bio.h
2024-07-19Merge tag 'for-6.11/dm-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mikulas Patocka: - Optimize processing of flush bios in the dm-linear and dm-stripe targets - Dm-io cleansups and refactoring - Remove unused 'struct thunk' in dm-cache - Handle minor device numbers > 255 in dm-init - Dm-verity refactoring & enabling platform keyring - Fix warning in dm-raid - Improve dm-crypt performance - split bios to smaller pieces, so that They could be processed concurrently - Stop using blk_limits_io_{min,opt} - Dm-vdo cleanup and refactoring - Remove max_write_zeroes_granularity and max_secure_erase_granularity - Dm-multipath cleanup & refactoring - Add dm-crypt and dm-integrity support for non-power-of-2 sector size - Fix reshape in dm-raid - Make dm_block_validator const * tag 'for-6.11/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (33 commits) dm vdo: fix a minor formatting issue in vdo.rst dm vdo int-map: fix kerneldoc formatting dm vdo repair: add missing kerneldoc fields dm: Constify struct dm_block_validator dm-integrity: introduce the Inline mode dm: introduce the target flag mempool_needs_integrity dm raid: fix stripes adding reshape size issues dm raid: move _get_reshape_sectors() as prerequisite to fixing reshape size issues dm-crypt: support for per-sector NVMe metadata dm mpath: don't call dm_get_device in multipath_message dm: factor out helper function from dm_get_device dm-verity: fix dm_is_verity_target() when dm-verity is builtin dm: Remove max_secure_erase_granularity dm: Remove max_write_zeroes_granularity dm vdo indexer: use swap() instead of open coding it dm vdo: remove unused struct 'uds_attribute' dm: stop using blk_limits_io_{min,opt} dm-crypt: limit the size of encryption requests dm verity: add support for signature verification with platform keyring dm-raid: Fix WARN_ON_ONCE check for sync_thread in raid_resume ...
2024-07-10dm: Remove max_secure_erase_granularityDamien Le Moal
The max_secure_erase_granularity boolean of struct dm_target is used in __process_abnormal_io() but never set by any target. Remove this field and the dead code using it. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-10dm: Remove max_write_zeroes_granularityDamien Le Moal
The max_write_zeroes_granularity boolean of struct dm_target is used in __process_abnormal_io() but never set by any target. Remove this field and the dead code using it. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-05dm: handle REQ_OP_ZONE_RESET_ALLDamien Le Moal
This commit implements processing of the REQ_OP_ZONE_RESET_ALL operation for zoned mapped devices. Given that this operation always has a BIO sector of 0 and a 0 size, processing through the regular BIO __split_and_process_bio() function does not work because this function would always select the first target. Instead, handling of this operation is implemented using the function __send_zone_reset_all(). Similarly to the __send_empty_flush() function, the new __send_zone_reset_all() function manually goes through all targets of a mapped device table doing the following: 1) If the target can natively support REQ_OP_ZONE_RESET_ALL, __send_duplicate_bios() is used to forward the reset all operation to the target. This case is handled with the __send_zone_reset_all_native() function. 2) For other targets, the function __send_zone_reset_all_emulated() is executed to emulate the execution of REQ_OP_ZONE_RESET_ALL using regular REQ_OP_ZONE_RESET operations. Targets that can natively support REQ_OP_ZONE_RESET_ALL are identified using the new target field zone_reset_all_supported. This boolean is set to true in for targets that have reliable zone limits, that is, targets that map all sequential write required zones of their zoned device(s). Setting this field is handled in dm_set_zones_restrictions() and device_get_zone_resource_limits(). For targets with unreliable zone limits, REQ_OP_ZONE_RESET_ALL must be emulated (case 2 above). This is implemented with __send_zone_reset_all_emulated() and is similar to the block layer function blkdev_zone_reset_all_emulated(): first a report zones is done for the zones of the target to identify zones that need reset, that is, any sequential write required zone that is not already empty. This is done using a bitmap and the function dm_zone_get_reset_bitmap() which sets to 1 the bit corresponding to a zone that needs reset. Next, this zone bitmap is inspected and a clone BIO modified to use the REQ_OP_ZONE_RESET operation issued for any zone with its bit set in the zone bitmap. This implementation is more efficient than what the block layer does with blkdev_zone_reset_all_emulated(), which is always used for DM zoned devices currently: as we can natively use REQ_OP_ZONE_RESET_ALL on targets mapping all sequential write required zones, resetting all zones of a zoned mapped device can be much faster compared to always emulating this operation using regular per-zone reset. In the worst case, this implementation is as-efficient as the block layer emulation. This reduction in the time it takes to reset all zones of a zoned mapped device depends directly on the mapped device targets mapping (reliable zone limits or not). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240704052816.623865-4-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05dm: Refactor is_abnormal_io()Damien Le Moal
Use a single switch-case to simplify is_abnormal_io() and make this function more readable and easier to modify. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240704052816.623865-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-03block: split integrity support out of bio.hChristoph Hellwig
Split struct bio_integrity_payload and the related prototypes out of bio.h into a separate bio-integrity.h header so that it is only pulled in by the few places that need it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240702151047.1746127-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-26dm: optimize flushesMikulas Patocka
Device mapper sends flush bios to all the targets and the targets send it to the underlying device. That may be inefficient, for example if a table contains 10 linear targets pointing to the same physical device, then device mapper would send 10 flush bios to that device - despite the fact that only one bio would be sufficient. This commit optimizes the flush behavior. It introduces a per-target variable flush_bypasses_map - it is set when the target supports flush optimization - currently, the dm-linear and dm-stripe targets support it. When all the targets in a table have flush_bypasses_map, flush_bypasses_map on the table is set. __send_empty_flush tests if the table has flush_bypasses_map - and if it has, no flush bios are sent to the targets via the "map" method and the list dm_table->devices is iterated and the flush bios are sent to each member of the list. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Suggested-by: Yang Yang <yang.yang@vivo.com>
2024-06-20block: Generalize chunk_sectors support as boundary supportJohn Garry
The purpose of the chunk_sectors limit is to ensure that a mergeble request fits within the boundary of the chunck_sector value. Such a feature will be useful for other request_queue boundary limits, so generalize the chunk_sectors merge code. This idea was proposed by Hannes Reinecke. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20240620125359.2684798-3-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move the io_stat flag setting to queue_limitsChristoph Hellwig
Move the io_stat flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Simplify md and dm to set the flag unconditionally instead of avoiding setting a simple flag for cases where it already is set by other means, which is a bit pointless. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-17-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-21Merge tag 'for-6.10/dm-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - Fix DM discard regressions due to DM core switching over to using queue_limits_set() without DM core and targets first being updated to set (and stack) discard limits in terms of max_hw_discard_sectors and not max_discard_sectors - Fix stable@ DM integrity discard support to set device's discard_granularity limit to the device's logical block size * tag 'for-6.10/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: always manage discard support in terms of max_hw_discard_sectors dm-integrity: set discard_granularity to logical block size
2024-05-20dm: always manage discard support in terms of max_hw_discard_sectorsMike Snitzer
Commit 4f563a64732d ("block: add a max_user_discard_sectors queue limit") changed block core to set max_discard_sectors to: min(lim->max_hw_discard_sectors, lim->max_user_discard_sectors) Since commit 1c0e720228ad ("dm: use queue_limits_set") it was reported dm-thinp was failing in a few fstests (generic/347 and generic/405) with the first WARN_ON_ONCE in dm_cell_key_has_valid_range() being reported, e.g.: WARNING: CPU: 1 PID: 30 at drivers/md/dm-bio-prison-v1.c:128 dm_cell_key_has_valid_range+0x3d/0x50 blk_set_stacking_limits() sets max_user_discard_sectors to UINT_MAX, so given how block core now sets max_discard_sectors (detailed above) it follows that blk_stack_limits() stacks up the underlying device's max_hw_discard_sectors and max_discard_sectors is set to match it. If max_hw_discard_sectors exceeds dm's BIO_PRISON_MAX_RANGE, then dm_cell_key_has_valid_range() will trigger the warning with: WARN_ON_ONCE(key->block_end - key->block_begin > BIO_PRISON_MAX_RANGE) Aside from this warning, the discard will fail. Fix this and other DM issues by governing discard support in terms of max_hw_discard_sectors instead of max_discard_sectors. Reported-by: Theodore Ts'o <tytso@mit.edu> Fixes: 1c0e720228ad ("dm: use queue_limits_set") Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-05-13Merge tag 'for-6.10/block-20240511' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - Add a partscan attribute in sysfs, fixing an issue with systemd relying on an internal interface that went away. - Attempt #2 at making long running discards interruptible. The previous attempt went into 6.9, but we ended up mostly reverting it as it had issues. - Remove old ida_simple API in bcache - Support for zoned write plugging, greatly improving the performance on zoned devices. - Remove the old throttle low interface, which has been experimental since 2017 and never made it beyond that and isn't being used. - Remove page->index debugging checks in brd, as it hasn't caught anything and prepares us for removing in struct page. - MD pull request from Song - Don't schedule block workers on isolated CPUs * tag 'for-6.10/block-20240511' of git://git.kernel.dk/linux: (84 commits) blk-throttle: delay initialization until configuration blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW block: fix that util can be greater than 100% block: support to account io_ticks precisely block: add plug while submitting IO bcache: fix variable length array abuse in btree_iter bcache: Remove usage of the deprecated ida_simple_xx() API md: Revert "md: Fix overflow in is_mddev_idle" blk-lib: check for kill signal in ioctl BLKDISCARD block: add a bio_await_chain helper block: add a blk_alloc_discard_bio helper block: add a bio_chain_and_submit helper block: move discard checks into the ioctl handler block: remove the discard_granularity check in __blkdev_issue_discard block/ioctl: prefer different overflow check null_blk: Fix the WARNING: modpost: missing MODULE_DESCRIPTION() block: fix and simplify blkdevparts= cmdline parsing block: refine the EOF check in blkdev_iomap_begin block: add a partscan sysfs attribute for disks block: add a disk_has_partscan helper ...
2024-04-17dm: Use the block layer zone append emulationDamien Le Moal
For targets requiring zone append operation emulation with regular writes (e.g. dm-crypt), we can use the block layer emulation provided by zone write plugging. Remove DM implemented zone append emulation and enable the block layer one. This is done by setting the max_zone_append_sectors limit of the mapped device queue to 0 for mapped devices that have a target table that cannot support native zone append operations (e.g. dm-crypt). Such mapped devices are flagged with the DMF_EMULATE_ZONE_APPEND flag. dm_split_and_process_bio() is modified to execute blk_zone_write_plug_bio() for such device to let the block layer transform zone append operations into regular writes. This is done after ensuring that the submitted BIO is split if it straddles zone boundaries. Both changes are implemented unsing the inline helpers dm_zone_write_plug_bio() and dm_zone_bio_needs_split() respectively. dm_revalidate_zones() is also modified to use the block layer provided function blk_revalidate_disk_zones() so that all zone resources needed for zone append emulation are initialized by the block layer without DM core needing to do anything. Since the device table is not yet live when dm_revalidate_zones() is executed, enabling the use of blk_revalidate_disk_zones() requires adding a pointer to the device table in struct mapped_device. This avoids errors in dm_blk_report_zones() trying to get the table with dm_get_live_table(). The mapped device table pointer is set to the table passed as argument to dm_revalidate_zones() before calling blk_revalidate_disk_zones() and reset to NULL after this function returns to restore the live table handling for user call of report zones. All the code related to zone append emulation is removed from dm-zone.c. This leads to simplifications of the functions __map_bio() and dm_zone_endio(). This later function now only needs to deal with completions of real zone append operations for targets that support it. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-13-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-16dm: restore synchronous close of device mapper block deviceMing Lei
'dmsetup remove' and 'dmsetup remove_all' require synchronous bdev release. Otherwise dm_lock_for_deletion() may return -EBUSY if the open count is > 0, because the open count is dropped in dm_blk_close() which occurs after fput() completes. So if dm_blk_close() is delayed because of asynchronous fput(), this device mapper device is skipped during remove, which is a regression. Fix the issue by using __fput_sync(). Also, DM device removal has long supported being made asynchronous by setting the DMF_DEFERRED_REMOVE flag on the DM device. So leverage using async fput() in close_table_device() if DMF_DEFERRED_REMOVE flag is set. Reported-by: Zhong Changhui <czhong@redhat.com> Fixes: a28d893eb327 ("md: port block device access to file") Suggested-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> [snitzer: editted commit header, use fput() if DMF_DEFERRED_REMOVE set] Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-14Merge tag 'mm-stable-2024-03-13-20-04' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames from hotplugged memory rather than only from main memory. Series "implement "memmap on memory" feature on s390". - More folio conversions from Matthew Wilcox in the series "Convert memcontrol charge moving to use folios" "mm: convert mm counter to take a folio" - Chengming Zhou has optimized zswap's rbtree locking, providing significant reductions in system time and modest but measurable reductions in overall runtimes. The series is "mm/zswap: optimize the scalability of zswap rb-tree". - Chengming Zhou has also provided the series "mm/zswap: optimize zswap lru list" which provides measurable runtime benefits in some swap-intensive situations. - And Chengming Zhou further optimizes zswap in the series "mm/zswap: optimize for dynamic zswap_pools". Measured improvements are modest. - zswap cleanups and simplifications from Yosry Ahmed in the series "mm: zswap: simplify zswap_swapoff()". - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has contributed several DAX cleanups as well as adding a sysfs tunable to control the memmap_on_memory setting when the dax device is hotplugged as system memory. - Johannes Weiner has added the large series "mm: zswap: cleanups", which does that. - More DAMON work from SeongJae Park in the series "mm/damon: make DAMON debugfs interface deprecation unignorable" "selftests/damon: add more tests for core functionalities and corner cases" "Docs/mm/damon: misc readability improvements" "mm/damon: let DAMOS feeds and tame/auto-tune itself" - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs extension" Rakie Kim has developed a new mempolicy interleaving policy wherein we allocate memory across nodes in a weighted fashion rather than uniformly. This is beneficial in heterogeneous memory environments appearing with CXL. - Christophe Leroy has contributed some cleanup and consolidation work against the ARM pagetable dumping code in the series "mm: ptdump: Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute". - Luis Chamberlain has added some additional xarray selftesting in the series "test_xarray: advanced API multi-index tests". - Muhammad Usama Anjum has reworked the selftest code to make its human-readable output conform to the TAP ("Test Anything Protocol") format. Amongst other things, this opens up the use of third-party tools to parse and process out selftesting results. - Ryan Roberts has added fork()-time PTE batching of THP ptes in the series "mm/memory: optimize fork() with PTE-mapped THP". Mainly targeted at arm64, this significantly speeds up fork() when the process has a large number of pte-mapped folios. - David Hildenbrand also gets in on the THP pte batching game in his series "mm/memory: optimize unmap/zap with PTE-mapped THP". It implements batching during munmap() and other pte teardown situations. The microbenchmark improvements are nice. - And in the series "Transparent Contiguous PTEs for User Mappings" Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte mappings"). Kernel build times on arm64 improved nicely. Ryan's series "Address some contpte nits" provides some followup work. - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has fixed an obscure hugetlb race which was causing unnecessary page faults. He has also added a reproducer under the selftest code. - In the series "selftests/mm: Output cleanups for the compaction test", Mark Brown did what the title claims. - Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring". - Even more zswap material from Nhat Pham. The series "fix and extend zswap kselftests" does as claimed. - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in our handling of DAX on archiecctures which have virtually aliasing data caches. The arm architecture is the main beneficiary. - Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic improvements in worst-case mmap_lock hold times during certain userfaultfd operations. - Some page_owner enhancements and maintenance work from Oscar Salvador in his series "page_owner: print stacks and their outstanding allocations" "page_owner: Fixup and cleanup" - Uladzislau Rezki has contributed some vmalloc scalability improvements in his series "Mitigate a vmap lock contention". It realizes a 12x improvement for a certain microbenchmark. - Some kexec/crash cleanup work from Baoquan He in the series "Split crash out from kexec and clean up related config items". - Some zsmalloc maintenance work from Chengming Zhou in the series "mm/zsmalloc: fix and optimize objects/page migration" "mm/zsmalloc: some cleanup for get/set_zspage_mapping()" - Zi Yan has taught the MM to perform compaction on folios larger than order=0. This a step along the path to implementaton of the merging of large anonymous folios. The series is named "Enable >0 order folio memory compaction". - Christoph Hellwig has done quite a lot of cleanup work in the pagecache writeback code in his series "convert write_cache_pages() to an iterator". - Some modest hugetlb cleanups and speedups in Vishal Moola's series "Handle hugetlb faults under the VMA lock". - Zi Yan has changed the page splitting code so we can split huge pages into sizes other than order-0 to better utilize large folios. The series is named "Split a folio to any lower order folios". - David Hildenbrand has contributed the series "mm: remove total_mapcount()", a cleanup. - Matthew Wilcox has sought to improve the performance of bulk memory freeing in his series "Rearrange batched folio freeing". - Gang Li's series "hugetlb: parallelize hugetlb page init on boot" provides large improvements in bootup times on large machines which are configured to use large numbers of hugetlb pages. - Matthew Wilcox's series "PageFlags cleanups" does that. - Qi Zheng's series "minor fixes and supplement for ptdesc" does that also. S390 is affected. - Cleanups to our pagemap utility functions from Peter Xu in his series "mm/treewide: Replace pXd_large() with pXd_leaf()". - Nico Pache has fixed a few things with our hugepage selftests in his series "selftests/mm: Improve Hugepage Test Handling in MM Selftests". - Also, of course, many singleton patches to many things. Please see the individual changelogs for details. * tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits) mm/zswap: remove the memcpy if acomp is not sleepable crypto: introduce: acomp_is_async to expose if comp drivers might sleep memtest: use {READ,WRITE}_ONCE in memory scanning mm: prohibit the last subpage from reusing the entire large folio mm: recover pud_leaf() definitions in nopmd case selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements selftests/mm: skip uffd hugetlb tests with insufficient hugepages selftests/mm: dont fail testsuite due to a lack of hugepages mm/huge_memory: skip invalid debugfs new_order input for folio split mm/huge_memory: check new folio order when split a folio mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure mm: add an explicit smp_wmb() to UFFDIO_CONTINUE mm: fix list corruption in put_pages_list mm: remove folio from deferred split list before uncharging it filemap: avoid unnecessary major faults in filemap_fault() mm,page_owner: drop unnecessary check mm,page_owner: check for null stack_record before bumping its refcount mm: swap: fix race between free_swap_and_cache() and swapoff() mm/treewide: align up pXd_leaf() retval across archs mm/treewide: drop pXd_large() ...
2024-03-13Merge tag 'for-6.9/dm-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Fix DM core's IO submission (which include dm-io and dm-bufio) such that a bio's IO priority is propagated. Work focused on enabling both DM crypt and verity targets to retain the appropriate IO priority - Fix DM raid reshape logic to not allow an empty flush bio to be requeued due to false concern about the bio, which doesn't have a data payload, accessing beyond the end of the device - Fix DM core's internal resume so that it properly calls both presume and resume methods, which fixes the potential for a postsuspend and resume imbalance - Update DM verity target to set DM_TARGET_SINGLETON flag because it doesn't make sense to have a DM table with a mix of targets that include dm-verity - Small cleanups in DM crypt, thin, and integrity targets - Fix references to dm-devel mailing list to use latest list address * tag 'for-6.9/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: call the resume method on internal suspend dm raid: fix false positive for requeue needed during reshape dm-integrity: set max_integrity_segments in dm_integrity_io_hints dm: update relevant MODULE_AUTHOR entries to latest dm-devel mailing list dm ioctl: update DM_DRIVER_EMAIL to new dm-devel mailing list dm verity: set DM_TARGET_SINGLETON feature flag dm crypt: Fix IO priority lost when queuing write bios dm verity: Fix IO priority lost when reading FEC and hash dm bufio: Support IO priority dm io: Support IO priority dm crypt: remove redundant state settings after waking up dm thin: add braces around conditional code that spans lines
2024-03-12dm: call the resume method on internal suspendMikulas Patocka
There is this reported crash when experimenting with the lvm2 testsuite. The list corruption is caused by the fact that the postsuspend and resume methods were not paired correctly; there were two consecutive calls to the origin_postsuspend function. The second call attempts to remove the "hash_list" entry from a list, while it was already removed by the first call. Fix __dm_internal_resume so that it calls the preresume and resume methods of the table's targets. If a preresume method of some target fails, we are in a tricky situation. We can't return an error because dm_internal_resume isn't supposed to return errors. We can't return success, because then the "resume" and "postsuspend" methods would not be paired correctly. So, we set the DMF_SUSPENDED flag and we fake normal suspend - it may confuse userspace tools, but it won't cause a kernel crash. ------------[ cut here ]------------ kernel BUG at lib/list_debug.c:56! invalid opcode: 0000 [#1] PREEMPT SMP CPU: 1 PID: 8343 Comm: dmsetup Not tainted 6.8.0-rc6 #4 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 RIP: 0010:__list_del_entry_valid_or_report+0x77/0xc0 <snip> RSP: 0018:ffff8881b831bcc0 EFLAGS: 00010282 RAX: 000000000000004e RBX: ffff888143b6eb80 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffff819053d0 RDI: 00000000ffffffff RBP: ffff8881b83a3400 R08: 00000000fffeffff R09: 0000000000000058 R10: 0000000000000000 R11: ffffffff81a24080 R12: 0000000000000001 R13: ffff88814538e000 R14: ffff888143bc6dc0 R15: ffffffffa02e4bb0 FS: 00000000f7c0f780(0000) GS:ffff8893f0a40000(0000) knlGS:0000000000000000 CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 CR2: 0000000057fb5000 CR3: 0000000143474000 CR4: 00000000000006b0 Call Trace: <TASK> ? die+0x2d/0x80 ? do_trap+0xeb/0xf0 ? __list_del_entry_valid_or_report+0x77/0xc0 ? do_error_trap+0x60/0x80 ? __list_del_entry_valid_or_report+0x77/0xc0 ? exc_invalid_op+0x49/0x60 ? __list_del_entry_valid_or_report+0x77/0xc0 ? asm_exc_invalid_op+0x16/0x20 ? table_deps+0x1b0/0x1b0 [dm_mod] ? __list_del_entry_valid_or_report+0x77/0xc0 origin_postsuspend+0x1a/0x50 [dm_snapshot] dm_table_postsuspend_targets+0x34/0x50 [dm_mod] dm_suspend+0xd8/0xf0 [dm_mod] dev_suspend+0x1f2/0x2f0 [dm_mod] ? table_deps+0x1b0/0x1b0 [dm_mod] ctl_ioctl+0x300/0x5f0 [dm_mod] dm_compat_ctl_ioctl+0x7/0x10 [dm_mod] __x64_compat_sys_ioctl+0x104/0x170 do_syscall_64+0x184/0x1b0 entry_SYSCALL_64_after_hwframe+0x46/0x4e RIP: 0033:0xf7e6aead <snip> ---[ end trace 0000000000000000 ]--- Fixes: ffcc39364160 ("dm: enhance internal suspend and resume interface") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-11Merge tag 'for-6.9/block-20240310' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - MD pull requests via Song: - Cleanup redundant checks (Yu Kuai) - Remove deprecated headers (Marc Zyngier, Song Liu) - Concurrency fixes (Li Lingfeng) - Memory leak fix (Li Nan) - Refactor raid1 read_balance (Yu Kuai, Paul Luse) - Clean up and fix for md_ioctl (Li Nan) - Other small fixes (Gui-Dong Han, Heming Zhao) - MD atomic limits (Christoph) - NVMe pull request via Keith: - RDMA target enhancements (Max) - Fabrics fixes (Max, Guixin, Hannes) - Atomic queue_limits usage (Christoph) - Const use for class_register (Ricardo) - Identification error handling fixes (Shin'ichiro, Keith) - Improvement and cleanup for cached request handling (Christoph) - Moving towards atomic queue limits. Core changes and driver bits so far (Christoph) - Fix UAF issues in aoeblk (Chun-Yi) - Zoned fix and cleanups (Damien) - s390 dasd cleanups and fixes (Jan, Miroslav) - Block issue timestamp caching (me) - noio scope guarding for zoned IO (Johannes) - block/nvme PI improvements (Kanchan) - Ability to terminate long running discard loop (Keith) - bdev revalidation fix (Li) - Get rid of old nr_queues hack for kdump kernels (Ming) - Support for async deletion of ublk (Ming) - Improve IRQ bio recycling (Pavel) - Factor in CPU capacity for remote vs local completion (Qais) - Add shared_tags configfs entry for null_blk (Shin'ichiro - Fix for a regression in page refcounts introduced by the folio unification (Tony) - Misc fixes and cleanups (Arnd, Colin, John, Kunwu, Li, Navid, Ricardo, Roman, Tang, Uwe) * tag 'for-6.9/block-20240310' of git://git.kernel.dk/linux: (221 commits) block: partitions: only define function mac_fix_string for CONFIG_PPC_PMAC block/swim: Convert to platform remove callback returning void cdrom: gdrom: Convert to platform remove callback returning void block: remove disk_stack_limits md: remove mddev->queue md: don't initialize queue limits md/raid10: use the atomic queue limit update APIs md/raid5: use the atomic queue limit update APIs md/raid1: use the atomic queue limit update APIs md/raid0: use the atomic queue limit update APIs md: add queue limit helpers md: add a mddev_is_dm helper md: add a mddev_add_trace_msg helper md: add a mddev_trace_remap helper bcache: move calculation of stripe_size and io_opt into bcache_device_init virtio_blk: Do not use disk_set_max_open/active_zones() aoe: fix the potential use-after-free problem in aoecmd_cfg_pkts block: move capacity validation to blkpg_do_ioctl() block: prevent division by zero in blk_rq_stat_sum() drbd: atomically update queue limits in drbd_reconsider_queue_parameters ...
2024-02-25md: port block device access to fileChristian Brauner
Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-4-adbd023e19cc@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-22dm: treat alloc_dax() -EOPNOTSUPP failure as non-fatalMathieu Desnoyers
In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of dm alloc_dev() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal. Link: https://lkml.kernel.org/r/20240215144633.96437-5-mathieu.desnoyers@efficios.com Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Suggested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Russell King <linux@armlinux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: kernel test robot <lkp@intel.com> Cc: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-20dm: update relevant MODULE_AUTHOR entries to latest dm-devel mailing listMike Snitzer
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-02-19block: pass a queue_limits argument to blk_alloc_diskChristoph Hellwig
Pass a queue_limits to blk_alloc_disk and apply it if non-NULL. This will allow allocating queues with valid queue limits instead of setting the values one at a time later. Also change blk_alloc_disk to return an ERR_PTR instead of just NULL which can't distinguish errors. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20240215071055.2201424-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-18bdev: rename freeze and thaw helpersChristian Brauner
We have bdev_mark_dead() etc and we're going to move block device freezing to holder ops in the next patch. Make the naming consistent: * freeze_bdev() -> bdev_freeze() * thaw_bdev() -> bdev_thaw() Also document the return code. Link: https://lore.kernel.org/r/20231024-vfs-super-freeze-v2-2-599c19f4faac@kernel.org Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-01Merge tag 'for-6.7/dm-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Update DM core to directly call the map function for both the linear and stripe targets; which are provided by DM core - Various updates to use new safer string functions - Update DM core to respect REQ_NOWAIT flag in normal bios so that memory allocations are always attempted with GFP_NOWAIT - Add Mikulas Patocka to MAINTAINERS as a DM maintainer! - Improve DM delay target's handling of short delays (< 50ms) by using a kthread to check expiration of IOs rather than timers and a wq - Update the DM error target so that it works with zoned storage. This helps xfstests to provide proper IO error handling coverage when testing a filesystem with native zoned storage support - Update both DM crypt and integrity targets to improve performance by using crypto_shash_digest() rather than init+update+final sequence - Fix DM crypt target by backfilling missing memory allocation accounting for compound pages * tag 'for-6.7/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm crypt: account large pages in cc->n_allocated_pages dm integrity: use crypto_shash_digest() in sb_mac() dm crypt: use crypto_shash_digest() in crypt_iv_tcw_whitening() dm error: Add support for zoned block devices dm delay: for short delays, use kthread instead of timers and wq MAINTAINERS: add Mikulas Patocka as a DM maintainer dm: respect REQ_NOWAIT flag in normal bios issued to DM dm: enhance alloc_multiple_bios() to be more versatile dm: make __send_duplicate_bios return unsigned int dm log userspace: replace deprecated strncpy with strscpy dm ioctl: replace deprecated strncpy with strscpy_pad dm crypt: replace open-coded kmemdup_nul dm cache metadata: replace deprecated strncpy with strscpy dm: shortcut the calls to linear_map and stripe_map
2023-10-28dm: Convert to bdev_open_by_dev()Jan Kara
Convert device mapper to use bdev_open_by_dev() and pass the handle around. CC: Alasdair Kergon <agk@redhat.com> CC: Mike Snitzer <snitzer@kernel.org> CC: dm-devel@redhat.com Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230927093442.25915-10-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-27dm: respect REQ_NOWAIT flag in normal bios issued to DMMike Snitzer
Update DM core's normal IO submission to allocate required memory using GFP_NOWAIT if REQ_NOWAIT is set. Tested with simple test provided in commit a9ce385344f916 ("dm: don't attempt to queue IO under RCU protection") that was enhanced to check error codes. Also tested using fio's pvsync2 with nowait=1. But testing with induced GFP_NOWAIT allocation failures wasn't performed (yet). Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-10-27dm: enhance alloc_multiple_bios() to be more versatileMike Snitzer
alloc_multiple_bios() has the useful ability to try allocating bios with GFP_NOWAIT but will fallback to using GFP_NOIO. The callers service both empty flush bios and abnormal bios (e.g. discard). alloc_multiple_bios() enhancements offered in this commit: - don't require table_devices_lock if num_bios = 1 - allow caller to pass GFP_NOWAIT to do usual GFP_NOWAIT with GFP_NOIO fallback - allow caller to pass GFP_NOIO to _only_ allocate using GFP_NOIO Flush bios with data may be issued to DM with REQ_NOWAIT, as such it makes sense to attempt servicing them with GFP_NOWAIT allocations. But abnormal IO should never be issued using REQ_NOWAIT (if that changes in the future that's fine, but no sense supporting it now). While at it, rename __send_changing_extent_only() to __send_abnormal_io(). [Thanks to both Ming and Mikulas for help with translating known possible IO scenarios to requirements.] Suggested-by: Ming Lei <ming.lei@redhat.com> Suggested-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-10-23dm: make __send_duplicate_bios return unsigned intMikulas Patocka
All the callers cast the value returned by __send_duplicate_bios to unsigned int type, so we can return unsigned int as well. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-10-06dm: shortcut the calls to linear_map and stripe_mapMikulas Patocka
Shortcut the calls to linear_map and stripe_map, so that they don't suffer the overhead of retpolines used for indirect calls. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>