Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"As usual, many cleanups. The below blurbiage describes 42 patchsets.
21 of those are partially or fully cleanup work. "cleans up",
"cleanup", "maintainability", "rationalizes", etc.
I never knew the MM code was so dirty.
"mm: ksm: prevent KSM from breaking merging of new VMAs" (Lorenzo Stoakes)
addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly
mapped VMAs were not eligible for merging with existing adjacent
VMAs.
"mm/damon: introduce DAMON_STAT for simple and practical access monitoring" (SeongJae Park)
adds a new kernel module which simplifies the setup and usage of
DAMON in production environments.
"stop passing a writeback_control to swap/shmem writeout" (Christoph Hellwig)
is a cleanup to the writeback code which removes a couple of
pointers from struct writeback_control.
"drivers/base/node.c: optimization and cleanups" (Donet Tom)
contains largely uncorrelated cleanups to the NUMA node setup and
management code.
"mm: userfaultfd: assorted fixes and cleanups" (Tal Zussman)
does some maintenance work on the userfaultfd code.
"Readahead tweaks for larger folios" (Ryan Roberts)
implements some tuneups for pagecache readahead when it is reading
into order>0 folios.
"selftests/mm: Tweaks to the cow test" (Mark Brown)
provides some cleanups and consistency improvements to the
selftests code.
"Optimize mremap() for large folios" (Dev Jain)
does that. A 37% reduction in execution time was measured in a
memset+mremap+munmap microbenchmark.
"Remove zero_user()" (Matthew Wilcox)
expunges zero_user() in favor of the more modern memzero_page().
"mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes" (David Hildenbrand)
addresses some warts which David noticed in the huge page code.
These were not known to be causing any issues at this time.
"mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD" (SeongJae Park)
provides some cleanup and consolidation work in DAMON.
"use vm_flags_t consistently" (Lorenzo Stoakes)
uses vm_flags_t in places where we were inappropriately using other
types.
"mm/memfd: Reserve hugetlb folios before allocation" (Vivek Kasireddy)
increases the reliability of large page allocation in the memfd
code.
"mm: Remove pXX_devmap page table bit and pfn_t type" (Alistair Popple)
removes several now-unneeded PFN_* flags.
"mm/damon: decouple sysfs from core" (SeongJae Park)
implememnts some cleanup and maintainability work in the DAMON
sysfs layer.
"madvise cleanup" (Lorenzo Stoakes)
does quite a lot of cleanup/maintenance work in the madvise() code.
"madvise anon_name cleanups" (Vlastimil Babka)
provides additional cleanups on top or Lorenzo's effort.
"Implement numa node notifier" (Oscar Salvador)
creates a standalone notifier for NUMA node memory state changes.
Previously these were lumped under the more general memory
on/offline notifier.
"Make MIGRATE_ISOLATE a standalone bit" (Zi Yan)
cleans up the pageblock isolation code and fixes a potential issue
which doesn't seem to cause any problems in practice.
"selftests/damon: add python and drgn based DAMON sysfs functionality tests" (SeongJae Park)
adds additional drgn- and python-based DAMON selftests which are
more comprehensive than the existing selftest suite.
"Misc rework on hugetlb faulting path" (Oscar Salvador)
fixes a rather obscure deadlock in the hugetlb fault code and
follows that fix with a series of cleanups.
"cma: factor out allocation logic from __cma_declare_contiguous_nid" (Mike Rapoport)
rationalizes and cleans up the highmem-specific code in the CMA
allocator.
"mm/migration: rework movable_ops page migration (part 1)" (David Hildenbrand)
provides cleanups and future-preparedness to the migration code.
"mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" (SeongJae Park)
adds some tracepoints to some DAMON auto-tuning code.
"mm/damon: fix misc bugs in DAMON modules" (SeongJae Park)
does that.
"mm/damon: misc cleanups" (SeongJae Park)
also does what it claims.
"mm: folio_pte_batch() improvements" (David Hildenbrand)
cleans up the large folio PTE batching code.
"mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" (SeongJae Park)
facilitates dynamic alteration of DAMON's inter-node allocation
policy.
"Remove unmap_and_put_page()" (Vishal Moola)
provides a couple of page->folio conversions.
"mm: per-node proactive reclaim" (Davidlohr Bueso)
implements a per-node control of proactive reclaim - beyond the
current memcg-based implementation.
"mm/damon: remove damon_callback" (SeongJae Park)
replaces the damon_callback interface with a more general and
powerful damon_call()+damos_walk() interface.
"mm/mremap: permit mremap() move of multiple VMAs" (Lorenzo Stoakes)
implements a number of mremap cleanups (of course) in preparation
for adding new mremap() functionality: newly permit the remapping
of multiple VMAs when the user is specifying MREMAP_FIXED. It still
excludes some specialized situations where this cannot be performed
reliably.
"drop hugetlb_free_pgd_range()" (Anthony Yznaga)
switches some sparc hugetlb code over to the generic version and
removes the thus-unneeded hugetlb_free_pgd_range().
"mm/damon/sysfs: support periodic and automated stats update" (SeongJae Park)
augments the present userspace-requested update of DAMON sysfs
monitoring files. Automatic update is now provided, along with a
tunable to control the update interval.
"Some randome fixes and cleanups to swapfile" (Kemeng Shi)
does what is claims.
"mm: introduce snapshot_page" (Luiz Capitulino and David Hildenbrand)
provides (and uses) a means by which debug-style functions can grab
a copy of a pageframe and inspect it locklessly without tripping
over the races inherent in operating on the live pageframe
directly.
"use per-vma locks for /proc/pid/maps reads" (Suren Baghdasaryan)
addresses the large contention issues which can be triggered by
reads from that procfs file. Latencies are reduced by more than
half in some situations. The series also introduces several new
selftests for the /proc/pid/maps interface.
"__folio_split() clean up" (Zi Yan)
cleans up __folio_split()!
"Optimize mprotect() for large folios" (Dev Jain)
provides some quite large (>3x) speedups to mprotect() when dealing
with large folios.
"selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" (wang lian)
does some cleanup work in the selftests code.
"tools/testing: expand mremap testing" (Lorenzo Stoakes)
extends the mremap() selftest in several ways, including adding
more checking of Lorenzo's recently added "permit mremap() move of
multiple VMAs" feature.
"selftests/damon/sysfs.py: test all parameters" (SeongJae Park)
extends the DAMON sysfs interface selftest so that it tests all
possible user-requested parameters. Rather than the present minimal
subset"
* tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (370 commits)
MAINTAINERS: add missing headers to mempory policy & migration section
MAINTAINERS: add missing file to cgroup section
MAINTAINERS: add MM MISC section, add missing files to MISC and CORE
MAINTAINERS: add missing zsmalloc file
MAINTAINERS: add missing files to page alloc section
MAINTAINERS: add missing shrinker files
MAINTAINERS: move memremap.[ch] to hotplug section
MAINTAINERS: add missing mm_slot.h file THP section
MAINTAINERS: add missing interval_tree.c to memory mapping section
MAINTAINERS: add missing percpu-internal.h file to per-cpu section
mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info()
selftests/damon: introduce _common.sh to host shared function
selftests/damon/sysfs.py: test runtime reduction of DAMON parameters
selftests/damon/sysfs.py: test non-default parameters runtime commit
selftests/damon/sysfs.py: generalize DAMON context commit assertion
selftests/damon/sysfs.py: generalize monitoring attributes commit assertion
selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion
selftests/damon/sysfs.py: test DAMOS filters commitment
selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion
selftests/damon/sysfs.py: test DAMOS destinations commitment
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core
Pull driver core updates from Danilo Krummrich:
"debugfs:
- Remove unneeded debugfs_file_{get,put}() instances
- Remove last remnants of debugfs_real_fops()
- Allow storing non-const void * in struct debugfs_inode_info::aux
sysfs:
- Switch back to attribute_group::bin_attrs (treewide)
- Switch back to bin_attribute::read()/write() (treewide)
- Constify internal references to 'struct bin_attribute'
Support cache-ids for device-tree systems:
- Add arch hook arch_compact_of_hwid()
- Use arch_compact_of_hwid() to compact MPIDR values on arm64
Rust:
- Device:
- Introduce CoreInternal device context (for bus internal methods)
- Provide generic drvdata accessors for bus devices
- Provide Driver::unbind() callbacks
- Use the infrastructure above for auxiliary, PCI and platform
- Implement Device::as_bound()
- Rename Device::as_ref() to Device::from_raw() (treewide)
- Implement fwnode and device property abstractions
- Implement example usage in the Rust platform sample driver
- Devres:
- Remove the inner reference count (Arc) and use pin-init instead
- Replace Devres::new_foreign_owned() with devres::register()
- Require T to be Send in Devres<T>
- Initialize the data kept inside a Devres last
- Provide an accessor for the Devres associated Device
- Device ID:
- Add support for ACPI device IDs and driver match tables
- Split up generic device ID infrastructure
- Use generic device ID infrastructure in net::phy
- DMA:
- Implement the dma::Device trait
- Add DMA mask accessors to dma::Device
- Implement dma::Device for PCI and platform devices
- Use DMA masks from the DMA sample module
- I/O:
- Implement abstraction for resource regions (struct resource)
- Implement resource-based ioremap() abstractions
- Provide platform device accessors for I/O (remap) requests
- Misc:
- Support fallible PinInit types in Revocable
- Implement Wrapper<T> for Opaque<T>
- Merge pin-init blanket dependencies (for Devres)
Misc:
- Fix OF node leak in auxiliary_device_create()
- Use util macros in device property iterators
- Improve kobject sample code
- Add device_link_test() for testing device link flags
- Fix typo in Documentation/ABI/testing/sysfs-kernel-address_bits
- Hint to prefer container_of_const() over container_of()"
* tag 'driver-core-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: (84 commits)
rust: io: fix broken intra-doc links to `platform::Device`
rust: io: fix broken intra-doc link to missing `flags` module
rust: io: mem: enable IoRequest doc-tests
rust: platform: add resource accessors
rust: io: mem: add a generic iomem abstraction
rust: io: add resource abstraction
rust: samples: dma: set DMA mask
rust: platform: implement the `dma::Device` trait
rust: pci: implement the `dma::Device` trait
rust: dma: add DMA addressing capabilities
rust: dma: implement `dma::Device` trait
rust: net::phy Change module_phy_driver macro to use module_device_table macro
rust: net::phy represent DeviceId as transparent wrapper over mdio_device_id
rust: device_id: split out index support into a separate trait
device: rust: rename Device::as_ref() to Device::from_raw()
arm64: cacheinfo: Provide helper to compress MPIDR value into u32
cacheinfo: Add arch hook to compress CPU h/w id into 32 bits for cache-id
cacheinfo: Set cache 'id' based on DT data
container_of: Document container_of() is not to be used in new code
driver core: auxiliary bus: fix OF node leak
...
|
|
Pull block updates from Jens Axboe:
- MD pull request via Yu:
- call del_gendisk synchronously (Xiao)
- cleanup unused variable (John)
- cleanup workqueue flags (Ryo)
- fix faulty rdev can't be removed during resync (Qixing)
- NVMe pull request via Christoph:
- try PCIe function level reset on init failure (Keith Busch)
- log TLS handshake failures at error level (Maurizio Lombardi)
- pci-epf: do not complete commands twice if nvmet_req_init()
fails (Rick Wertenbroek)
- misc cleanups (Alok Tiwari)
- Removal of the pktcdvd driver
This has been more than a decade coming at this point, and some
recently revealed breakages that had it causing issues even for cases
where it isn't required made me re-pull the trigger on this one. It's
known broken and nobody has stepped up to maintain the code
- Series for ublk supporting batch commands, enabling the use of
multishot where appropriate
- Speed up ublk exit handling
- Fix for the two-stage elevator fixing which could leak data
- Convert NVMe to use the new IOVA based API
- Increase default max transfer size to something more reasonable
- Series fixing write operations on zoned DM devices
- Add tracepoints for zoned block device operations
- Prep series working towards improving blk-mq queue management in the
presence of isolated CPUs
- Don't allow updating of the block size of a loop device that is
currently under exclusively ownership/open
- Set chunk sectors from stacked device stripe size and use it for the
atomic write size limit
- Switch to folios in bcache read_super()
- Fix for CD-ROM MRW exit flush handling
- Various tweaks, fixes, and cleanups
* tag 'for-6.17/block-20250728' of git://git.kernel.dk/linux: (94 commits)
block: restore two stage elevator switch while running nr_hw_queue update
cdrom: Call cdrom_mrw_exit from cdrom_release function
sunvdc: Balance device refcount in vdc_port_mpgroup_check
nvme-pci: try function level reset on init failure
dm: split write BIOs on zone boundaries when zone append is not emulated
block: use chunk_sectors when evaluating stacked atomic write limits
dm-stripe: limit chunk_sectors to the stripe size
md/raid10: set chunk_sectors limit
md/raid0: set chunk_sectors limit
block: sanitize chunk_sectors for atomic write limits
ilog2: add max_pow_of_two_factor()
nvmet: pci-epf: Do not complete commands twice if nvmet_req_init() fails
nvme-tcp: log TLS handshake failures at error level
docs: nvme: fix grammar in nvme-pci-endpoint-target.rst
nvme: fix typo in status code constant for self-test in progress
nvmet: remove redundant assignment of error code in nvmet_ns_enable()
nvme: fix incorrect variable in io cqes error message
nvme: fix multiple spelling and grammar issues in host drivers
block: fix blk_zone_append_update_request_bio() kernel-doc
md/raid10: fix set but not used variable in sync_request_write()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs iomap updates from Christian Brauner:
- Refactor the iomap writeback code and split the generic and ioend/bio
based writeback code.
There are two methods that define the split between the generic
writeback code, and the implemementation of it, and all knowledge of
ioends and bios now sits below that layer.
- Add fuse iomap support for buffered writes and dirty folio writeback.
This is needed so that granular uptodate and dirty tracking can be
used in fuse when large folios are enabled. This has two big
advantages. For writes, instead of the entire folio needing to be
read into the page cache, only the relevant portions need to be. For
writeback, only the dirty portions need to be written back instead of
the entire folio.
* tag 'vfs-6.17-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fuse: refactor writeback to use iomap_writepage_ctx inode
fuse: hook into iomap for invalidating and checking partial uptodateness
fuse: use iomap for folio laundering
fuse: use iomap for writeback
fuse: use iomap for buffered writes
iomap: build the writeback code without CONFIG_BLOCK
iomap: add read_folio_range() handler for buffered writes
iomap: improve argument passing to iomap_read_folio_sync
iomap: replace iomap_folio_ops with iomap_write_ops
iomap: export iomap_writeback_folio
iomap: move folio_unlock out of iomap_writeback_folio
iomap: rename iomap_writepage_map to iomap_writeback_folio
iomap: move all ioend handling to ioend.c
iomap: add public helpers for uptodate state manipulation
iomap: hide ioends from the generic writeback code
iomap: refactor the writeback interface
iomap: cleanup the pending writeback tracking in iomap_writepage_map_blocks
iomap: pass more arguments using the iomap writeback context
iomap: header diet
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs 'protection info' updates from Christian Brauner:
"This adds the new FS_IOC_GETLBMD_CAP ioctl() to query metadata and
protection info (PI) capabilities. This ioctl returns information
about the files integrity profile. This is useful for userspace
applications to understand a files end-to-end data protection support
and configure the I/O accordingly.
For now this interface is only supported by block devices. However the
design and placement of this ioctl in generic FS ioctl space allows us
to extend it to work over files as well. This maybe useful when
filesystems start supporting PI-aware layouts.
A new structure struct logical_block_metadata_cap is introduced, which
contains the following fields:
- lbmd_flags:
bitmask of logical block metadata capability flags
- lbmd_interval:
the amount of data described by each unit of logical block metadata
- lbmd_size:
size in bytes of the logical block metadata associated with each
interval
- lbmd_opaque_size:
size in bytes of the opaque block tag associated with each interval
- lbmd_opaque_offset:
offset in bytes of the opaque block tag within the logical block
metadata
- lbmd_pi_size:
size in bytes of the T10 PI tuple associated with each interval
- lbmd_pi_offset:
offset in bytes of T10 PI tuple within the logical block metadata
- lbmd_pi_guard_tag_type:
T10 PI guard tag type
- lbmd_pi_app_tag_size:
size in bytes of the T10 PI application tag
- lbmd_pi_ref_tag_size:
size in bytes of the T10 PI reference tag
- lbmd_pi_storage_tag_size:
size in bytes of the T10 PI storage tag
The internal logic to fetch the capability is encapsulated in a helper
function blk_get_meta_cap(), which uses the blk_integrity profile
associated with the device. The ioctl returns -EOPNOTSUPP, if
CONFIG_BLK_DEV_INTEGRITY is not enabled"
* tag 'vfs-6.17-rc1.integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
block: fix lbmd_guard_tag_type assignment in FS_IOC_GETLBMD_CAP
block: fix FS_IOC_GETLBMD_CAP parsing in blkdev_common_ioctl()
fs: add ioctl to query metadata and protection info capabilities
nvme: set pi_offset only when checksum type is not BLK_INTEGRITY_CSUM_NONE
block: introduce pi_tuple_size field in blk_integrity
block: rename tuple_size field in blk_integrity to metadata_size
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull mmap_prepare updates from Christian Brauner:
"Last cycle we introduce f_op->mmap_prepare() in c84bf6dd2b83 ("mm:
introduce new .mmap_prepare() file callback").
This is preferred to the existing f_op->mmap() hook as it does require
a VMA to be established yet, thus allowing the mmap logic to invoke
this hook far, far earlier, prior to inserting a VMA into the virtual
address space, or performing any other heavy handed operations.
This allows for much simpler unwinding on error, and for there to be a
single attempt at merging a VMA rather than having to possibly
reattempt a merge based on potentially altered VMA state.
Far more importantly, it prevents inappropriate manipulation of
incompletely initialised VMA state, which is something that has been
the cause of bugs and complexity in the past.
The intent is to gradually deprecate f_op->mmap, and in that vein this
series coverts the majority of file systems to using f_op->mmap_prepare.
Prerequisite steps are taken - firstly ensuring all checks for mmap
capabilities use the file_has_valid_mmap_hooks() helper rather than
directly checking for f_op->mmap (which is now not a valid check) and
secondly updating daxdev_mapping_supported() to not require a VMA
parameter to allow ext4 and xfs to be converted.
Commit bb666b7c2707 ("mm: add mmap_prepare() compatibility layer for
nested file systems") handles the nasty edge-case of nested file
systems like overlayfs, which introduces a compatibility shim to allow
f_op->mmap_prepare() to be invoked from an f_op->mmap() callback.
This allows for nested filesystems to continue to function correctly
with all file systems regardless of which callback is used. Once we
finally convert all file systems, this shim can be removed.
As a result, ecryptfs, fuse, and overlayfs remain unaltered so they
can nest all other file systems.
We additionally do not update resctl - as this requires an update to
remap_pfn_range() (or an alternative to it) which we defer to a later
series, equally we do not update cramfs which needs a mixed mapping
insertion with the same issue, nor do we update procfs, hugetlbfs,
syfs or kernfs all of which require VMAs for internal state and hooks.
We shall return to all of these later"
* tag 'vfs-6.17-rc1.mmap_prepare' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
doc: update porting, vfs documentation to describe mmap_prepare()
fs: replace mmap hook with .mmap_prepare for simple mappings
fs: convert most other generic_file_*mmap() users to .mmap_prepare()
fs: convert simple use of generic_file_*_mmap() to .mmap_prepare()
mm/filemap: introduce generic_file_*_mmap_prepare() helpers
fs/xfs: transition from deprecated .mmap hook to .mmap_prepare
fs/ext4: transition from deprecated .mmap hook to .mmap_prepare
fs/dax: make it possible to check dev dax support without a VMA
fs: consistently use can_mmap_file() helper
mm/nommu: use file_has_valid_mmap_hooks() helper
mm: rename call_mmap/mmap_prepare to vfs_mmap/mmap_prepare
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull fallocate updates from Christian Brauner:
"fallocate() currently supports creating preallocated files
efficiently. However, on most filesystems fallocate() will preallocate
blocks in an unwriten state even if FALLOC_FL_ZERO_RANGE is specified.
The extent state must later be converted to a written state when the
user writes data into this range, which can trigger numerous metadata
changes and journal I/O. This may leads to significant write
amplification and performance degradation in synchronous write mode.
At the moment, the only method to avoid this is to create an empty
file and write zero data into it (for example, using 'dd' with a large
block size). However, this method is slow and consumes a considerable
amount of disk bandwidth.
Now that more and more flash-based storage devices are available it is
possible to efficiently write zeros to SSDs using the unmap write
zeroes command if the devices do not write physical zeroes to the
media.
For example, if SCSI SSDs support the UMMAP bit or NVMe SSDs support
the DEAC bit[1], the write zeroes command does not write actual data
to the device, instead, NVMe converts the zeroed range to a
deallocated state, which works fast and consumes almost no disk write
bandwidth.
This series implements the BLK_FEAT_WRITE_ZEROES_UNMAP feature and
BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED flag for SCSI, NVMe and
device-mapper drivers, and add the FALLOC_FL_WRITE_ZEROES and
STATX_ATTR_WRITE_ZEROES_UNMAP support for ext4 and raw bdev devices.
fallocate() is subsequently extended with the FALLOC_FL_WRITE_ZEROES
flag. FALLOC_FL_WRITE_ZEROES zeroes a specified file range in such a
way that subsequent writes to that range do not require further
changes to the file mapping metadata. This flag is beneficial for
subsequent pure overwriting within this range, as it can save on block
allocation and, consequently, significant metadata changes"
* tag 'vfs-6.17-rc1.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
ext4: add FALLOC_FL_WRITE_ZEROES support
block: add FALLOC_FL_WRITE_ZEROES support
block: factor out common part in blkdev_fallocate()
fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate
dm: clear unmap write zeroes limits when disabling write zeroes
scsi: sd: set max_hw_wzeroes_unmap_sectors if device supports SD_ZERO_*_UNMAP
nvmet: set WZDS and DRB if device enables unmap write zeroes operation
nvme: set max_hw_wzeroes_unmap_sectors if device supports DEAC bit
block: introduce max_{hw|user}_wzeroes_unmap_sectors to queue limits
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc VFS updates from Christian Brauner:
"This contains the usual selections of misc updates for this cycle.
Features:
- Add ext4 IOCB_DONTCACHE support
This refactors the address_space_operations write_begin() and
write_end() callbacks to take const struct kiocb * as their first
argument, allowing IOCB flags such as IOCB_DONTCACHE to propagate
to the filesystem's buffered I/O path.
Ext4 is updated to implement handling of the IOCB_DONTCACHE flag
and advertises support via the FOP_DONTCACHE file operation flag.
Additionally, the i915 driver's shmem write paths are updated to
bypass the legacy write_begin/write_end interface in favor of
directly calling write_iter() with a constructed synchronous kiocb.
Another i915 change replaces a manual write loop with
kernel_write() during GEM shmem object creation.
Cleanups:
- don't duplicate vfs_open() in kernel_file_open()
- proc_fd_getattr(): don't bother with S_ISDIR() check
- fs/ecryptfs: replace snprintf with sysfs_emit in show function
- vfs: Remove unnecessary list_for_each_entry_safe() from
evict_inodes()
- filelock: add new locks_wake_up_waiter() helper
- fs: Remove three arguments from block_write_end()
- VFS: change old_dir and new_dir in struct renamedata to dentrys
- netfs: Remove unused declaration netfs_queue_write_request()
Fixes:
- eventpoll: Fix semi-unbounded recursion
- eventpoll: fix sphinx documentation build warning
- fs/read_write: Fix spelling typo
- fs: annotate data race between poll_schedule_timeout() and
pollwake()
- fs/pipe: set FMODE_NOWAIT in create_pipe_files()
- docs/vfs: update references to i_mutex to i_rwsem
- fs/buffer: remove comment about hard sectorsize
- fs/buffer: remove the min and max limit checks in __getblk_slow()
- fs/libfs: don't assume blocksize <= PAGE_SIZE in
generic_check_addressable
- fs_context: fix parameter name in infofc() macro
- fs: Prevent file descriptor table allocations exceeding INT_MAX"
* tag 'vfs-6.17-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (24 commits)
netfs: Remove unused declaration netfs_queue_write_request()
eventpoll: fix sphinx documentation build warning
ext4: support uncached buffered I/O
mm/pagemap: add write_begin_get_folio() helper function
fs: change write_begin/write_end interface to take struct kiocb *
drm/i915: Refactor shmem_pwrite() to use kiocb and write_iter
drm/i915: Use kernel_write() in shmem object create
eventpoll: Fix semi-unbounded recursion
vfs: Remove unnecessary list_for_each_entry_safe() from evict_inodes()
fs/libfs: don't assume blocksize <= PAGE_SIZE in generic_check_addressable
fs/buffer: remove the min and max limit checks in __getblk_slow()
fs: Prevent file descriptor table allocations exceeding INT_MAX
fs: Remove three arguments from block_write_end()
fs/ecryptfs: replace snprintf with sysfs_emit in show function
fs: annotate suspected data race between poll_schedule_timeout() and pollwake()
docs/vfs: update references to i_mutex to i_rwsem
fs/buffer: remove comment about hard sectorsize
fs_context: fix parameter name in infofc() macro
VFS: change old_dir and new_dir in struct renamedata to dentrys
proc_fd_getattr(): don't bother with S_ISDIR() check
...
|
|
Pull block fix from Jens Axboe:
"Just a single fix for regression in this release, where a module
reference could be leaked"
* tag 'block-6.16-20250725' of git://git.kernel.dk/linux:
block: fix module reference leak in mq-deadline I/O scheduler
|
|
The kmemleak reports memory leaks related to elevator resources that
were originally allocated in the ->init_hctx() method. The following
leak traces are observed after running blktests block/040:
unreferenced object 0xffff8881b82f7400 (size 512):
comm "check", pid 68454, jiffies 4310588881
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace (crc 5bac8b34):
__kvmalloc_node_noprof+0x55d/0x7a0
sbitmap_init_node+0x15a/0x6a0
kyber_init_hctx+0x316/0xb90
blk_mq_init_sched+0x419/0x580
elevator_switch+0x18b/0x630
elv_update_nr_hw_queues+0x219/0x2c0
__blk_mq_update_nr_hw_queues+0x36a/0x6f0
blk_mq_update_nr_hw_queues+0x3a/0x60
0xffffffffc09ceb80
0xffffffffc09d7e0b
configfs_write_iter+0x2b1/0x470
vfs_write+0x527/0xe70
ksys_write+0xff/0x200
do_syscall_64+0x98/0x3c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
unreferenced object 0xffff8881b82f6000 (size 512):
comm "check", pid 68454, jiffies 4310588881
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace (crc 5bac8b34):
__kvmalloc_node_noprof+0x55d/0x7a0
sbitmap_init_node+0x15a/0x6a0
kyber_init_hctx+0x316/0xb90
blk_mq_init_sched+0x419/0x580
elevator_switch+0x18b/0x630
elv_update_nr_hw_queues+0x219/0x2c0
__blk_mq_update_nr_hw_queues+0x36a/0x6f0
blk_mq_update_nr_hw_queues+0x3a/0x60
0xffffffffc09ceb80
0xffffffffc09d7e0b
configfs_write_iter+0x2b1/0x470
vfs_write+0x527/0xe70
ksys_write+0xff/0x200
do_syscall_64+0x98/0x3c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
unreferenced object 0xffff8881b82f5800 (size 512):
comm "check", pid 68454, jiffies 4310588881
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace (crc 5bac8b34):
__kvmalloc_node_noprof+0x55d/0x7a0
sbitmap_init_node+0x15a/0x6a0
kyber_init_hctx+0x316/0xb90
blk_mq_init_sched+0x419/0x580
elevator_switch+0x18b/0x630
elv_update_nr_hw_queues+0x219/0x2c0
__blk_mq_update_nr_hw_queues+0x36a/0x6f0
blk_mq_update_nr_hw_queues+0x3a/0x60
0xffffffffc09ceb80
0xffffffffc09d7e0b
configfs_write_iter+0x2b1/0x470
vfs_write+0x527/0xe70
ksys_write+0xff/0x200
do_syscall_64+0x98/0x3c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
The issue arises while we run nr_hw_queue update, Specifically, we first
reallocate hardware contexts (hctx) via __blk_mq_realloc_hw_ctxs(), and
then later invoke elevator_switch() (assuming q->elevator is not NULL).
The elevator switch code would first exit old elevator (elevator_exit)
and then switches to the new elevator. The elevator_exit loops through
each hctx and invokes the elevator’s per-hctx exit method ->exit_hctx(),
which releases resources allocated during ->init_hctx().
This memleak manifests when we reduce the num of h/w queues - for example,
when the initial update sets the number of queues to X, and a later update
reduces it to Y, where Y < X. In this case, we'd loose the access to old
hctxs while we get to elevator exit code because __blk_mq_realloc_hw_ctxs
would have already released the old hctxs. As we don't now have any
reference left to the old hctxs, we don't have any way to free the
scheduler resources (which are allocate in ->init_hctx()) and kmemleak
complains about it.
This issue was caused due to the commit 596dce110b7d ("block: simplify
elevator reattachment for updating nr_hw_queues"). That change unified
the two-stage elevator teardown and reattachment into a single call that
occurs after __blk_mq_realloc_hw_ctxs() has already freed the hctxs.
This patch restores the previous two-stage elevator switch logic during
nr_hw_queues updates. First, the elevator is switched to 'none', which
ensures all scheduler resources are properly freed. Then, the hardware
contexts (hctxs) are reallocated, and the software-to-hardware queue
mappings are updated. Finally, the original elevator is reattached. This
sequence prevents loss of references to old hctxs and avoids the scheduler
resource leaks reported by kmemleak.
Reported-by : Yi Zhang <yi.zhang@redhat.com>
Fixes: 596dce110b7d ("block: simplify elevator reattachment for updating nr_hw_queues")
Closes: https://lore.kernel.org/all/CAHj4cs8oJFvz=daCvjHM5dYCNQH4UXwSySPPU4v-WHce_kZXZA@mail.gmail.com/
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250724102540.1366308-1-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The blk_get_meta_cap() implementation directly assigns bi->csum_type to
the UAPI field lbmd_guard_tag_type. This is not right as the kernel enum
blk_integrity_checksum values are not guaranteed to match the UAPI
defined values.
Fix this by explicitly mapping internal checksum types to UAPI-defined
constants to ensure compatibility and correctness, especially for the
devices using CRC64 PI.
Fixes: 9eb22f7fedfc ("fs: add ioctl to query metadata and protection info capabilities")
Reported-by: Vincent Fu <vincent.fu@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/20250722120755.87501-1-anuj20.g@samsung.com
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
During probe, when the block layer registers a request queue, it
defaults to the mq-deadline I/O scheduler if the device is single-queue
and the mq-deadline module is available. To determine availability, the
elevator_set_default() invokes elevator_find_get(), which increments the
module's reference count. However, this reference is never released,
resulting in a module reference leak that prevents the mq-deadline module
from being unloaded.
This patch fixes the issue by ensuring the acquired module reference is
properly released.
Fixes: 1e44bedbc921 ("block: unifying elevator change")
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250719132722.769536-1-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull block fixes from Jens Axboe:
- NVMe changes via Christoph:
- revert the cross-controller atomic write size validation
that caused regressions (Christoph Hellwig)
- fix endianness of command word printout in
nvme_log_err_passthru() (John Garry)
- fix callback lock for TLS handshake (Maurizio Lombardi)
- fix misaccounting of nvme-mpath inflight I/O (Yu Kuai)
- fix inconsistent RCU list manipulation in
nvme_ns_add_to_ctrl_list() (Zheng Qixing)
- Fix for a kobject leak in queue unregistration
- Fix for loop async file write start/end handling
* tag 'block-6.16-20250718' of git://git.kernel.dk/linux:
loop: use kiocb helpers to fix lockdep warning
nvmet-tcp: fix callback lock for TLS handshake
nvme: fix misaccounting of nvme-mpath inflight I/O
nvme: revert the cross-controller atomic write size validation
nvme: fix endianness of command word prints in nvme_log_err_passthru()
nvme: fix inconsistent RCU list manipulation in nvme_ns_add_to_ctrl_list()
block: fix kobject leak in blk_unregister_queue
|
|
The atomic write unit max value is limited by any stacked device stripe
size.
It is required that the atomic write unit is a power-of-2 factor of the
stripe size.
Currently we use io_min limit to hold the stripe size, and check for a
io_min <= SECTOR_SIZE when deciding if we have a striped stacked device.
Nilay reports that this causes a problem when the physical block size is
greater than SECTOR_SIZE [0].
Furthermore, io_min may be mutated when stacking devices, and this makes
it a poor candidate to hold the stripe size. Such an example (of when
io_min may change) would be when the io_min is less than the physical
block size.
Use chunk_sectors to hold the stripe size, which is more appropriate.
[0] https://lore.kernel.org/linux-block/888f3b1d-7817-4007-b3b3-1a2ea04df771@linux.ibm.com/T/#mecca17129f72811137d3c2f1e477634e77f06781
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Tested-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250711105258.3135198-7-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently we just ensure that a non-zero value in chunk_sectors aligns
with any atomic write boundary, as the blk boundary functionality uses
both these values.
However it is also improper to have atomic write unit max > chunk_sectors
(for non-zero chunk_sectors), as this would lead to splitting of atomic
write bios (which is disallowed).
Sanitize atomic write unit max against chunk_sectors to avoid any
potential problems.
Fixes: d00eea91deaf3 ("block: Add extra checks in blk_validate_atomic_write_limits()")
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250711105258.3135198-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Change the address_space_operations callbacks write_begin() and
write_end() to take struct kiocb * as the first argument instead of
struct file *.
Update all affected function prototypes, implementations, call sites,
and related documentation across VFS, filesystems, and block layer.
Part of a series refactoring address_space_operations write_begin and
write_end callbacks to use struct kiocb for passing write context and
flags.
Signed-off-by: Taotao Chen <chentaotao@didiglobal.com>
Link: https://lore.kernel.org/20250716093559.217344-4-chentaotao@didiglobal.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add tracepoints to zone write plugging plug and unplug events.
Examples for these events are:
kworker/u10:4-393 [001] d..1. 282.991660: disk_zone_wplug_add_bio: 8,0 zone 16, BIO 8388608 + 128
kworker/0:1H-58 [ [000] d..1. 283.083294: blk_zone_wplug_bio: 8,0 zone 15, BIO 7864320 + 128
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250715115324.53308-6-johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add a tracepoint for blkdev_zone_mgmt to trace zone management commands
submitted by higher layers like file systems or user space.
An example output for this tracepoint is as follows:
mkfs.btrfs-203 [001] ..... 42.877493: blkdev_zone_mgmt: 8,0 ZRS 5242880 + 0
This example output shows a REQ_OP_ZONE_RESET operation submitted by
mkfs.btrfs.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250715115324.53308-5-johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add a tracepoint in blk_zone_update_request_bio() to trace the bio sector
update on ZONE APPEND completions.
An example for this tracepoint is as follows:
<idle>-0 [001] d.h1. 381.746444: blk_zone_update_request_bio: 259,5 ZAS 131072 () 1048832 + 256 none,0,0 [swapper/1]
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250715115324.53308-4-johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
blk_zone_update_request_bio() does two things. First it checks if the
request to be completed was written via ZONE APPEND and if yes it then
updates the sector to the one that the data was written to.
This is small enough to be an inline function. But upcoming changes adding
a tracepoint don't work if the function is inlined.
Split the function into two, the first is blk_req_bio_is_zone_append()
checking if the sector needs to be updated. This can still be an inline
function. The second is blk_zone_append_update_request_bio() doing the
sector update.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250715115324.53308-3-johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The iomap_folio_ops are only used for buffered writes, including the zero
and unshare variants. Rename them to iomap_write_ops to better describe
the usage, and pass them through the call chain like the other operation
specific methods instead of through the iomap.
xfs_iomap_valid grows a IOMAP_HOLE check to keep the existing behavior
that never attached the folio_ops to a iomap representing a hole.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-12-hch@lst.de
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Replace the ioend pointer in iomap_writeback_ctx with a void *wb_ctx
one to facilitate non-block, non-ioend writeback for use. Rename
the submit_ioend method to writeback_submit and make it mandatory so
that the generic writeback code stops seeing ioends and bios.
Co-developed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-6-hch@lst.de
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Replace ->map_blocks with a new ->writeback_range, which differs in the
following ways:
- it must also queue up the I/O for writeback, that is called into the
slightly refactored and extended in scope iomap_add_to_ioend for
each region
- can handle only a part of the requested region, that is the retry
loop for partial mappings moves to the caller
- handles cleanup on failures as well, and thus also replaces the
discard_folio method only implemented by XFS.
This will allow to use the iomap writeback code also for file systems
that are not block based like fuse.
Co-developed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-5-hch@lst.de
Acked-by: Damien Le Moal <dlemoal@kernel.org> # zonefs
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add inode and wpc fields to pass the inode and writeback context that
are needed in the entire writeback call chain, and let the callers
initialize all fields in the writeback context before calling
iomap_writepages to simplify the argument passing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-3-hch@lst.de
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The kobject for the queue, `disk->queue_kobj`, is initialized with a
reference count of 1 via `kobject_init()` in `blk_register_queue()`.
While `kobject_del()` is called during the unregister path to remove
the kobject from sysfs, the initial reference is never released.
Add a call to `kobject_put()` in `blk_unregister_queue()` to properly
decrement the reference count and fix the leak.
Fixes: 2bd85221a625 ("block: untangle request_queue refcounting from sysfs")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250711083009.2574432-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Anders and Naresh found that the addition of the FS_IOC_GETLBMD_CAP
handling in the blockdev ioctl handler breaks all ioctls with
_IOC_NR==2, as the new command is not added to the switch but only
a few of the command bits are check.
Move the check into the blk_get_meta_cap() function itself and make
it return -ENOIOCTLCMD for any unsupported command code, including
those with a smaller size that previously returned -EINVAL.
For consistency this also drops the check for NULL 'arg' that
is really useless, as any invalid pointer should return -EFAULT.
Fixes: 9eb22f7fedfc ("fs: add ioctl to query metadata and protection info capabilities")
Link: https://lore.kernel.org/all/CA+G9fYvk9HHE5UJ7cdJHTcY6P5JKnp+_e+sdC5U-ZQFTP9_hqQ@mail.gmail.com/
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/20250711084708.2714436-1-arnd@kernel.org
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Patch series "Remove zero_user()".
The zero_user() API is almost unused these days. Finish the job of
removing it.
This patch (of 5):
memzero_page() is the new name for zero_user().
Link: https://lkml.kernel.org/r/20250612143443.2848197-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20250612143443.2848197-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Xiubo Li <xiubli@redhat.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
instead of manually stashing the data pointer into parent directory inode's
->i_private, just pass it to debugfs_create_file_aux() so that it can
be extracted without that insane chasing through ->d_parent.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20250702212818.GJ3406663@ZenIV
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Add two variants of helper functions that calculate the correct number
of queues to use. Two variants are needed because some drivers base
their maximum number of queues on the possible CPU mask, while others
use the online CPU mask.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20250617-isolcpus-queue-counters-v1-2-13923686b54b@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
group_cpu_evenly() might have allocated less groups then requested:
group_cpu_evenly()
__group_cpus_evenly()
alloc_nodes_groups()
# allocated total groups may be less than numgrps when
# active total CPU number is less then numgrps
In this case, the caller will do an out of bound access because the
caller assumes the masks returned has numgrps.
Return the number of groups created so the caller can limit the access
range accordingly.
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250617-isolcpus-queue-counters-v1-1-13923686b54b@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add a new ioctl, FS_IOC_GETLBMD_CAP, to query metadata and protection
info (PI) capabilities. This ioctl returns information about the files
integrity profile. This is useful for userspace applications to
understand a files end-to-end data protection support and configure the
I/O accordingly.
For now this interface is only supported by block devices. However the
design and placement of this ioctl in generic FS ioctl space allows us
to extend it to work over files as well. This maybe useful when
filesystems start supporting PI-aware layouts.
A new structure struct logical_block_metadata_cap is introduced, which
contains the following fields:
1. lbmd_flags: bitmask of logical block metadata capability flags
2. lbmd_interval: the amount of data described by each unit of logical
block metadata
3. lbmd_size: size in bytes of the logical block metadata associated
with each interval
4. lbmd_opaque_size: size in bytes of the opaque block tag associated
with each interval
5. lbmd_opaque_offset: offset in bytes of the opaque block tag within
the logical block metadata
6. lbmd_pi_size: size in bytes of the T10 PI tuple associated with each
interval
7. lbmd_pi_offset: offset in bytes of T10 PI tuple within the logical
block metadata
8. lbmd_pi_guard_tag_type: T10 PI guard tag type
9. lbmd_pi_app_tag_size: size in bytes of the T10 PI application tag
10. lbmd_pi_ref_tag_size: size in bytes of the T10 PI reference tag
11. lbmd_pi_storage_tag_size: size in bytes of the T10 PI storage tag
The internal logic to fetch the capability is encapsulated in a helper
function blk_get_meta_cap(), which uses the blk_integrity profile
associated with the device. The ioctl returns -EOPNOTSUPP, if
CONFIG_BLK_DEV_INTEGRITY is not enabled.
Suggested-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/20250630090548.3317-5-anuj20.g@samsung.com
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Introduce a new pi_tuple_size field in struct blk_integrity to
explicitly represent the size (in bytes) of the protection information
(PI) tuple. This is a prep patch.
Add validation in blk_validate_integrity_limits() to ensure that
pi size matches the expected size for known checksum types and never
exceeds the pi_tuple_size.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/20250630090548.3317-3-anuj20.g@samsung.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The tuple_size field in blk_integrity currently represents the total
size of metadata associated with each data interval. To make the meaning
more explicit, rename tuple_size to metadata_size. This is a purely
mechanical rename with no functional changes.
Suggested-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/20250630090548.3317-2-anuj20.g@samsung.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add a new blk_rq_dma_map / blk_rq_dma_unmap pair that does away with
the wasteful scatterlist structure. Instead it uses the mapping iterator
to either add segments to the IOVA for IOMMU operations, or just maps
them one by one for the direct mapping. For the IOMMU case instead of
a scatterlist with an entry for each segment, only a single [dma_addr,len]
pair needs to be stored for processing a request, and for the direct
mapping the per-segment allocation shrinks from
[page,offset,len,dma_addr,dma_len] to just [dma_addr,len].
One big difference to the scatterlist API, which could be considered
downside, is that the IOVA collapsing only works when the driver sets
a virt_boundary that matches the IOMMU granule. For NVMe this is done
already so it works perfectly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20250625113531.522027-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
To get out of the DMA mapping helpers having to check every segment for
it's P2P status, ensure that bios either contain P2P transfers or non-P2P
transfers, and that a P2P bio only contains ranges from a single device.
This means we do the page zone access in the bio add path where it should
be still page hot, and will only have do the fairly expensive P2P topology
lookup once per bio down in the DMA mapping path, and only for already
marked bios.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20250625113531.522027-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In preparation for fixing device mapper zone write handling, introduce
the inline helper function bio_needs_zone_write_plugging() to test if a
BIO requires handling through zone write plugging using the function
blk_zone_plug_bio(). This function returns true for any write
(op_is_write(bio) == true) operation directed at a zoned block device
using zone write plugging, that is, a block device with a disk that has
a zone write plug hash table.
This helper allows simplifying the check on entry to blk_zone_plug_bio()
and used in to protect calls to it for blk-mq devices and DM devices.
Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250625093327.548866-3-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Back in 2015, commit d2be537c3ba3 ("block: bump BLK_DEF_MAX_SECTORS to
2560") increased the default maximum size of a block device I/O to 2560
sectors (1280 KiB) to "accommodate a 10-data-disk stripe write with
chunk size 128k". This choice is rather arbitrary and since then,
improvements to the block layer have software RAID drivers correctly
advertize their stripe width through chunk_sectors and abuses of
BLK_DEF_MAX_SECTORS_CAP by drivers (to set the HW limit rather than the
default user controlled maximum I/O size) have been fixed.
Since many block devices can benefit from a larger value of
BLK_DEF_MAX_SECTORS_CAP, and in particular HDDs, increase this value to
be 4MiB, or 8192 sectors.
And given that BLK_DEF_MAX_SECTORS_CAP is only used in the block layer
and should not be used by drivers directly, move this macro definition
to the block layer internal header file block/blk.h.
Suggested-by: Martin K . Petersen <martin.petersen@oracle.com>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20250618060045.37593-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull block fixes from Jens Axboe:
- Fixes for ublk:
- fix C++ narrowing warnings in the uapi header
- update/improve UBLK_F_SUPPORT_ZERO_COPY comment in uapi header
- fix for the ublk ->queue_rqs() implementation, limiting a batch
to just the specific task AND ring
- ublk_get_data() error handling fix
- sanity check more arguments in ublk_ctrl_add_dev()
- selftest addition
- NVMe pull request via Christoph:
- reset delayed remove_work after reconnect
- fix atomic write size validation
- Fix for a warning introduced in bdev_count_inflight_rw() in this
merge window
* tag 'block-6.16-20250626' of git://git.kernel.dk/linux:
block: fix false warning in bdev_count_inflight_rw()
ublk: sanity check add_dev input for underflow
nvme: fix atomic write size validation
nvme: refactor the atomic write unit detection
nvme: reset delayed remove_work after reconnect
ublk: setup ublk_io correctly in case of ublk_get_data() failure
ublk: update UBLK_F_SUPPORT_ZERO_COPY comment in UAPI header
ublk: fix narrowing warnings in UAPI header
selftests: ublk: don't take same backing file for more than one ublk devices
ublk: build batch from IOs in same io_ring_ctx and io task
|
|
While bdev_count_inflight is interating all cpus, if some IOs are issued
from traversed cpu and then completed from the cpu that is not traversed
yet:
cpu0
cpu1
bdev_count_inflight
//for_each_possible_cpu
// cpu0 is 0
infliht += 0
// issue a io
blk_account_io_start
// cpu0 inflight ++
cpu2
// the io is done
blk_account_io_done
// cpu2 inflight --
// cpu 1 is 0
inflight += 0
// cpu2 is -1
inflight += -1
...
In this case, the total inflight will be -1, causing lots of false
warning. Fix the problem by removing the warning.
Noted there is still a valid warning for nvme-mpath(From Yi) that is not
fixed yet.
Fixes: f5482ee5edb9 ("block: WARN if bdev inflight counter is negative")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/linux-block/aFtUXy-lct0WxY2w@mozart.vkv.me/T/#mae89155a5006463d0a21a4a2c35ae0034b26a339
Reported-and-tested-by: Calvin Owens <calvin@wbinvd.org>
Closes: https://lore.kernel.org/linux-block/aFtUXy-lct0WxY2w@mozart.vkv.me/T/#m1d935a00070bf95055d0ac84e6075158b08acaef
Reported-by: Dave Chinner <david@fromorbit.com>
Closes: https://lore.kernel.org/linux-block/aFuypjqCXo9-5_En@dread.disaster.area/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250626115743.1641443-1-yukuai3@huawei.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
block_write_end() looks like it can be used as a ->write_end()
implementation. However, it can't as it does not unlock nor put
the folio. Since it does not use the 'file', 'mapping' nor 'fsdata'
arguments, remove them.
Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Link: https://lore.kernel.org/20250624132130.1590285-1-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add support for FALLOC_FL_WRITE_ZEROES, if the block device enables the
unmap write zeroes operation, it will issue a write zeroes command.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/20250619111806.3546162-9-yi.zhang@huaweicloud.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Only the flags passed to blkdev_issue_zeroout() differ among the two
zeroing branches in blkdev_fallocate(). Therefore, do cleanup by
factoring them out.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/20250619111806.3546162-8-yi.zhang@huaweicloud.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Currently, disks primarily implement the write zeroes command (aka
REQ_OP_WRITE_ZEROES) through two mechanisms: the first involves
physically writing zeros to the disk media (e.g., HDDs), while the
second performs an unmap operation on the logical blocks, effectively
putting them into a deallocated state (e.g., SSDs). The first method is
generally slow, while the second method is typically very fast.
For example, on certain NVMe SSDs that support NVME_NS_DEAC, submitting
REQ_OP_WRITE_ZEROES requests with the NVME_WZ_DEAC bit can accelerate
the write zeros operation by placing disk blocks into a deallocated
state, which opportunistically avoids writing zeroes to media while
still guaranteeing that subsequent reads from the specified block range
will return zeroed data. This is a best-effort optimization, not a
mandatory requirement, some devices may partially fall back to writing
physical zeroes due to factors such as misalignment or being asked to
clear a block range smaller than the device's internal allocation unit.
Therefore, the speed of this operation is not guaranteed.
It is difficult to determine whether the storage device supports unmap
write zeroes operation. We cannot determine this by only querying
bdev_limits(bdev)->max_write_zeroes_sectors. Therefore, first, add a new
hardware queue limit parameters, max_hw_wzeroes_unmap_sectors, to
indicate whether a device supports this unmap write zeroes operation.
Then, add two new counterpart software queue limits,
max_wzeroes_unmap_sectors and max_user_wzeroes_unmap_sectors, which
allow users to disable this operation if the speed is very slow on some
sepcial devices.
Finally, for the stacked devices cases, initialize these two parameters
to UINT_MAX. This operation should be enabled by both the stacking
driver and all underlying devices.
Thanks to Martin K. Petersen for optimizing the documentation of the
write_zeroes_unmap sysfs interface.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/20250619111806.3546162-2-yi.zhang@huaweicloud.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Update nearly all generic_file_mmap() and generic_file_readonly_mmap()
callers to use generic_file_mmap_prepare() and
generic_file_readonly_mmap_prepare() respectively.
We update blkdev, 9p, afs, erofs, ext2, nfs, ntfs3, smb, ubifs and vboxsf
file systems this way.
Remaining users we cannot yet update are ecryptfs, fuse and cramfs. The
former two are nested file systems that must support any underlying file
ssytem, and cramfs inserts a mixed mapping which currently requires a VMA.
Once all file systems have been converted to mmap_prepare(), we can then
update nested file systems.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/08db85970d89b17a995d2cffae96fb4cc462377f.1750099179.git.lorenzo.stoakes@oracle.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Pull block fixes from Jens Axboe:
- Fix for a deadlock on queue freeze with zoned writes
- Fix for zoned append emulation
- Two bio folio fixes, for sparsemem and for very large folios
- Fix for a performance regression introduced in 6.13 when plug
insertion was changed
- Fix for NVMe passthrough handling for polled IO
- Document the ublk auto registration feature
- loop lockdep warning fix
* tag 'block-6.16-20250614' of git://git.kernel.dk/linux:
nvme: always punt polled uring_cmd end_io work to task_work
Documentation: ublk: Separate UBLK_F_AUTO_BUF_REG fallback behavior sublists
block: Fix bvec_set_folio() for very large folios
bio: Fix bio_first_folio() for SPARSEMEM without VMEMMAP
block: use plug request list tail for one-shot backmerge attempt
block: don't use submit_bio_noacct_nocheck in blk_zone_wplug_bio_work
block: Clear BIO_EMULATES_ZONE_APPEND flag on BIO completion
ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)
loop: move lo_set_size() out of queue freeze
|
|
Previously, the block layer stored the requests in the plug list in
LIFO order. For this reason, blk_attempt_plug_merge() would check
just the head entry for a back merge attempt, and abort after that
unless requests for multiple queues existed in the plug list. If more
than one request is present in the plug list, this makes the one-shot
back merging less useful than before, as it'll always fail to find a
quick merge candidate.
Use the tail entry for the one-shot merge attempt, which is the last
added request in the list. If that fails, abort immediately unless
there are multiple queues available. If multiple queues are available,
then scan the list. Ideally the latter scan would be a backwards scan
of the list, but as it currently stands, the plug list is singly linked
and hence this isn't easily feasible.
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-block/20250611121626.7252-1-abuehaze@amazon.com/
Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
Fixes: e70c301faece ("block: don't reorder requests in blk_add_rq_to_plug")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Bios queued up in the zone write plug have already gone through all all
preparation in the submit_bio path, including the freeze protection.
Submitting them through submit_bio_noacct_nocheck duplicates the work
and can can cause deadlocks when freezing a queue with pending bio
write plugs.
Go straight to ->submit_bio or blk_mq_submit_bio to bypass the
superfluous extra freeze protection and checks.
Fixes: 9b1ce7f0c6f8 ("block: Implement zone append emulation")
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250611044416.2351850-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When blk_zone_write_plug_bio_endio() is called for a regular write BIO
used to emulate a zone append operation, that is, a BIO flagged with
BIO_EMULATES_ZONE_APPEND, the BIO operation code is restored to the
original REQ_OP_ZONE_APPEND but the BIO_EMULATES_ZONE_APPEND flag is not
cleared. Clear it to fully return the BIO to its orginal definition.
Fixes: 9b1ce7f0c6f8 ("block: Implement zone append emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250611005915.89843-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Move this API to the canonical timer_*() namespace.
[ tglx: Redone against pre rc1 ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
|
|
Pull more block updates from Jens Axboe:
- NVMe pull request via Christoph:
- TCP error handling fix (Shin'ichiro Kawasaki)
- TCP I/O stall handling fixes (Hannes Reinecke)
- fix command limits status code (Keith Busch)
- support vectored buffers also for passthrough (Pavel Begunkov)
- spelling fixes (Yi Zhang)
- MD pull request via Yu:
- fix REQ_RAHEAD and REQ_NOWAIT IO err handling for raid1/10
- fix max_write_behind setting for dm-raid
- some minor cleanups
- Integrity data direction fix and cleanup
- bcache NULL pointer fix
- Fix for loop missing write start/end handling
- Decouple hardware queues and IO threads in ublk
- Slew of ublk selftests additions and updates
* tag 'block-6.16-20250606' of git://git.kernel.dk/linux: (29 commits)
nvme: spelling fixes
nvme-tcp: fix I/O stalls on congested sockets
nvme-tcp: sanitize request list handling
nvme-tcp: remove tag set when second admin queue config fails
nvme: enable vectored registered bufs for passthrough cmds
nvme: fix implicit bool to flags conversion
nvme: fix command limits status code
selftests: ublk: kublk: improve behavior on init failure
block: flip iter directions in blk_rq_integrity_map_user()
block: drop direction param from bio_integrity_copy_user()
selftests: ublk: cover PER_IO_DAEMON in more stress tests
Documentation: ublk: document UBLK_F_PER_IO_DAEMON
selftests: ublk: add stress test for per io daemons
selftests: ublk: add functional test for per io daemons
selftests: ublk: kublk: decouple ublk_queues from ublk server threads
selftests: ublk: kublk: move per-thread data out of ublk_queue
selftests: ublk: kublk: lift queue initialization out of thread
selftests: ublk: kublk: tie sqe allocation to io instead of queue
selftests: ublk: kublk: plumb q_id in io_uring user_data
ublk: have a per-io daemon instead of a per-queue daemon
...
|