path: root/drivers/nvme
48 hours ago  Merge tag 'for-6.17/block-20250728' of git://git.kernel.dk/linux  (Linus Torvalds)

Pull block updates from Jens Axboe:

 - MD pull request via Yu:
     - call del_gendisk synchronously (Xiao)
     - cleanup unused variable (John)
     - cleanup workqueue flags (Ryo)
     - fix faulty rdev can't be removed during resync (Qixing)

 - NVMe pull request via Christoph:
     - try PCIe function level reset on init failure (Keith Busch)
     - log TLS handshake failures at error level (Maurizio Lombardi)
     - pci-epf: do not complete commands twice if nvmet_req_init() fails (Rick Wertenbroek)
     - misc cleanups (Alok Tiwari)

 - Removal of the pktcdvd driver. This has been more than a decade coming at this point, and some recently revealed breakages that had it causing issues even for cases where it isn't required made me re-pull the trigger on this one. It's known broken and nobody has stepped up to maintain the code.

 - Series for ublk supporting batch commands, enabling the use of multishot where appropriate

 - Speed up ublk exit handling

 - Fix for the two-stage elevator switch which could leak data

 - Convert NVMe to use the new IOVA based API

 - Increase default max transfer size to something more reasonable

 - Series fixing write operations on zoned DM devices

 - Add tracepoints for zoned block device operations

 - Prep series working towards improving blk-mq queue management in the presence of isolated CPUs

 - Don't allow updating the block size of a loop device that is currently under exclusive ownership/open

 - Set chunk sectors from the stacked device stripe size and use it for the atomic write size limit

 - Switch to folios in bcache read_super()

 - Fix for CD-ROM MRW exit flush handling

 - Various tweaks, fixes, and cleanups

* tag 'for-6.17/block-20250728' of git://git.kernel.dk/linux: (94 commits)
  block: restore two stage elevator switch while running nr_hw_queue update
  cdrom: Call cdrom_mrw_exit from cdrom_release function
  sunvdc: Balance device refcount in vdc_port_mpgroup_check
  nvme-pci: try function level reset on init failure
  dm: split write BIOs on zone boundaries when zone append is not emulated
  block: use chunk_sectors when evaluating stacked atomic write limits
  dm-stripe: limit chunk_sectors to the stripe size
  md/raid10: set chunk_sectors limit
  md/raid0: set chunk_sectors limit
  block: sanitize chunk_sectors for atomic write limits
  ilog2: add max_pow_of_two_factor()
  nvmet: pci-epf: Do not complete commands twice if nvmet_req_init() fails
  nvme-tcp: log TLS handshake failures at error level
  docs: nvme: fix grammar in nvme-pci-endpoint-target.rst
  nvme: fix typo in status code constant for self-test in progress
  nvmet: remove redundant assignment of error code in nvmet_ns_enable()
  nvme: fix incorrect variable in io cqes error message
  nvme: fix multiple spelling and grammar issues in host drivers
  block: fix blk_zone_append_update_request_bio() kernel-doc
  md/raid10: fix set but not used variable in sync_request_write()
  ...
2 days ago  Merge tag 'vfs-6.17-rc1.integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs  (Linus Torvalds)

Pull vfs 'protection info' updates from Christian Brauner:

 "This adds the new FS_IOC_GETLBMD_CAP ioctl() to query metadata and protection info (PI) capabilities. This ioctl returns information about the file's integrity profile. This is useful for userspace applications to understand a file's end-to-end data protection support and configure the I/O accordingly.

  For now this interface is only supported by block devices. However, the design and placement of this ioctl in the generic FS ioctl space allows us to extend it to work over files as well. This may be useful when filesystems start supporting PI-aware layouts.

  A new structure struct logical_block_metadata_cap is introduced, which contains the following fields:

   - lbmd_flags: bitmask of logical block metadata capability flags
   - lbmd_interval: the amount of data described by each unit of logical block metadata
   - lbmd_size: size in bytes of the logical block metadata associated with each interval
   - lbmd_opaque_size: size in bytes of the opaque block tag associated with each interval
   - lbmd_opaque_offset: offset in bytes of the opaque block tag within the logical block metadata
   - lbmd_pi_size: size in bytes of the T10 PI tuple associated with each interval
   - lbmd_pi_offset: offset in bytes of the T10 PI tuple within the logical block metadata
   - lbmd_pi_guard_tag_type: T10 PI guard tag type
   - lbmd_pi_app_tag_size: size in bytes of the T10 PI application tag
   - lbmd_pi_ref_tag_size: size in bytes of the T10 PI reference tag
   - lbmd_pi_storage_tag_size: size in bytes of the T10 PI storage tag

  The internal logic to fetch the capability is encapsulated in a helper function blk_get_meta_cap(), which uses the blk_integrity profile associated with the device. The ioctl returns -EOPNOTSUPP if CONFIG_BLK_DEV_INTEGRITY is not enabled"

* tag 'vfs-6.17-rc1.integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  block: fix lbmd_guard_tag_type assignment in FS_IOC_GETLBMD_CAP
  block: fix FS_IOC_GETLBMD_CAP parsing in blkdev_common_ioctl()
  fs: add ioctl to query metadata and protection info capabilities
  nvme: set pi_offset only when checksum type is not BLK_INTEGRITY_CSUM_NONE
  block: introduce pi_tuple_size field in blk_integrity
  block: rename tuple_size field in blk_integrity to metadata_size
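A minimal userspace sketch of querying a block device with the new interface; it assumes FS_IOC_GETLBMD_CAP and struct logical_block_metadata_cap are exported via <linux/fs.h>, and casts the fields for printing since their exact types are not spelled out above:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>   /* assumed home of FS_IOC_GETLBMD_CAP */

    int main(void)
    {
        struct logical_block_metadata_cap cap = { 0 };
        int fd = open("/dev/nvme0n1", O_RDONLY);

        if (fd < 0)
            return 1;
        /* returns -EOPNOTSUPP when CONFIG_BLK_DEV_INTEGRITY is off */
        if (ioctl(fd, FS_IOC_GETLBMD_CAP, &cap) == 0)
            printf("%u bytes of metadata per %u-byte interval, PI tuple %u@%u\n",
                   (unsigned)cap.lbmd_size, (unsigned)cap.lbmd_interval,
                   (unsigned)cap.lbmd_pi_size, (unsigned)cap.lbmd_pi_offset);
        return 0;
    }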
2 days ago  Merge tag 'vfs-6.17-rc1.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs  (Linus Torvalds)

Pull fallocate updates from Christian Brauner:

 "fallocate() currently supports creating preallocated files efficiently. However, on most filesystems fallocate() will preallocate blocks in an unwritten state even if FALLOC_FL_ZERO_RANGE is specified. The extent state must later be converted to a written state when the user writes data into this range, which can trigger numerous metadata changes and journal I/O. This may lead to significant write amplification and performance degradation in synchronous write mode.

  At the moment, the only method to avoid this is to create an empty file and write zero data into it (for example, using 'dd' with a large block size). However, this method is slow and consumes a considerable amount of disk bandwidth.

  Now that more and more flash-based storage devices are available, it is possible to efficiently write zeros to SSDs using the unmap write zeroes command if the devices do not write physical zeroes to the media. For example, if SCSI SSDs support the UNMAP bit or NVMe SSDs support the DEAC bit [1], the write zeroes command does not write actual data to the device; instead, NVMe converts the zeroed range to a deallocated state, which works fast and consumes almost no disk write bandwidth.

  This series implements the BLK_FEAT_WRITE_ZEROES_UNMAP feature and BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED flag for SCSI, NVMe and device-mapper drivers, and adds FALLOC_FL_WRITE_ZEROES and STATX_ATTR_WRITE_ZEROES_UNMAP support for ext4 and raw block devices.

  fallocate() is subsequently extended with the FALLOC_FL_WRITE_ZEROES flag. FALLOC_FL_WRITE_ZEROES zeroes a specified file range in such a way that subsequent writes to that range do not require further changes to the file mapping metadata. This flag is beneficial for subsequent pure overwriting within this range, as it can save on block allocation and, consequently, significant metadata changes"

* tag 'vfs-6.17-rc1.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  ext4: add FALLOC_FL_WRITE_ZEROES support
  block: add FALLOC_FL_WRITE_ZEROES support
  block: factor out common part in blkdev_fallocate()
  fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate
  dm: clear unmap write zeroes limits when disabling write zeroes
  scsi: sd: set max_hw_wzeroes_unmap_sectors if device supports SD_ZERO_*_UNMAP
  nvmet: set WZDS and DRB if device enables unmap write zeroes operation
  nvme: set max_hw_wzeroes_unmap_sectors if device supports DEAC bit
  block: introduce max_{hw|user}_wzeroes_unmap_sectors to queue limits
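A hedged userspace sketch of the new fallocate() mode; the flag name comes from this series, while the file path, the length, and the <linux/falloc.h> location are illustrative assumptions:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>   /* assumed home of FALLOC_FL_WRITE_ZEROES */

    int main(void)
    {
        int fd = open("/mnt/data/prealloc.img", O_CREAT | O_WRONLY, 0600);

        if (fd < 0)
            return 1;
        /* zero 1 GiB so later overwrites need no further mapping changes */
        return fallocate(fd, FALLOC_FL_WRITE_ZEROES, 0, 1LL << 30) ? 1 : 0;
    }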
12 days ago  Merge tag 'block-6.16-20250718' of git://git.kernel.dk/linux  (Linus Torvalds)

Pull block fixes from Jens Axboe:

 - NVMe changes via Christoph:
     - revert the cross-controller atomic write size validation that caused regressions (Christoph Hellwig)
     - fix endianness of command word printout in nvme_log_err_passthru() (John Garry)
     - fix callback lock for TLS handshake (Maurizio Lombardi)
     - fix misaccounting of nvme-mpath inflight I/O (Yu Kuai)
     - fix inconsistent RCU list manipulation in nvme_ns_add_to_ctrl_list() (Zheng Qixing)

 - Fix for a kobject leak in queue unregistration

 - Fix for loop async file write start/end handling

* tag 'block-6.16-20250718' of git://git.kernel.dk/linux:
  loop: use kiocb helpers to fix lockdep warning
  nvmet-tcp: fix callback lock for TLS handshake
  nvme: fix misaccounting of nvme-mpath inflight I/O
  nvme: revert the cross-controller atomic write size validation
  nvme: fix endianness of command word prints in nvme_log_err_passthru()
  nvme: fix inconsistent RCU list manipulation in nvme_ns_add_to_ctrl_list()
  block: fix kobject leak in blk_unregister_queue
13 days ago  nvme-pci: try function level reset on init failure  (Keith Busch)

NVMe devices from multiple vendors appear to get stuck in a reset state that we can't get out of with an NVMe level Controller Reset. The kernel would report these with messages that look like:

    Device not ready; aborting reset, CSTS=0x1

These have historically required a power cycle to make them usable again, but in many cases a PCIe FLR is sufficient to restart operation without a power cycle. Try it if the initial controller reset fails during any nvme reset attempt.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
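A minimal sketch of the fallback idea using the PCI core's FLR helpers (pcie_reset_flr() with PCI_RESET_PROBE/PCI_RESET_DO_RESET); the function name and its exact hook point in the reset path are illustrative:

    static int nvme_try_function_level_reset(struct nvme_dev *dev)
    {
        struct pci_dev *pdev = to_pci_dev(dev->dev);

        /* probe first: not every function advertises FLR capability */
        if (pcie_reset_flr(pdev, PCI_RESET_PROBE))
            return -EOPNOTSUPP;

        /* the NVMe-level controller reset failed; reset the PCI function */
        return pcie_reset_flr(pdev, PCI_RESET_DO_RESET);
    }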
13 days ago  nvmet: pci-epf: Do not complete commands twice if nvmet_req_init() fails  (Rick Wertenbroek)

Have nvmet_req_init() and req->execute() complete failed commands.

Description of the problem: nvmet_req_init() calls __nvmet_req_complete() internally upon failure, e.g. an unsupported opcode, which calls the "queue_response" callback. This results in nvmet_pci_epf_queue_response() being called, which will call nvmet_pci_epf_complete_iod() if data_len is 0 or if dma_dir is different from DMA_TO_DEVICE. This results in a double completion, as nvmet_pci_epf_exec_iod_work() also calls nvmet_pci_epf_complete_iod() when nvmet_req_init() fails.

Steps to reproduce: on the host, send a command with an unsupported opcode with nvme-cli, for example the admin command "security receive":

    $ sudo nvme security-recv /dev/nvme0n1 -n1 -x4096

This triggers a double completion, as nvmet_req_init() fails and nvmet_pci_epf_queue_response() is called; here iod->dma_dir is still in the default state of DMA_NONE as set by default in nvmet_pci_epf_alloc_iod(), so nvmet_pci_epf_complete_iod() is called. Because nvmet_req_init() failed, nvmet_pci_epf_complete_iod() is also called in nvmet_pci_epf_exec_iod_work(), leading to a double completion. This not only sends two completions to the host but also corrupts the state of the PCI NVMe target, leading to a kernel oops.

This patch lets nvmet_req_init() and req->execute() complete all failed commands, and removes the double completion case in nvmet_pci_epf_exec_iod_work(), therefore fixing the edge cases where double completions occurred.

Fixes: 0faa0fe6f90e ("nvmet: New NVMe PCI endpoint function target driver")
Signed-off-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
13 days ago  nvme-tcp: log TLS handshake failures at error level  (Maurizio Lombardi)

Update the nvme_tcp_start_tls() function to use dev_err() instead of dev_dbg() when a TLS error is detected. This ensures that handshake failures are visible by default, aiding in debugging.

Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
13 days ago  nvme: fix typo in status code constant for self-test in progress  (Alok Tiwari)

Correct a typo in the NVMe status code constant, renaming NVME_SC_SELT_TEST_IN_PROGRESS to NVME_SC_SELF_TEST_IN_PROGRESS to accurately reflect its meaning.

Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
13 days ago  nvmet: remove redundant assignment of error code in nvmet_ns_enable()  (Alok Tiwari)

Remove the unnecessary "ret = -EMFILE;" assignment, since it is immediately overwritten by the result of nvmet_bdev_ns_enable(). The initial value (-EMFILE) is redundant because it has no effect on the code logic or outcome.

Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
13 days ago  nvme: fix incorrect variable in io cqes error message  (Alok Tiwari)

Correct the error log to print ctrl->io_cqes instead of incorrectly using ctrl->io_sqes for the io cqes size check.

Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
13 days ago  nvme: fix multiple spelling and grammar issues in host drivers  (Alok Tiwari)

This commit fixes several typos and grammatical issues across various nvme host driver files:

 - correct "glace" to "glance" in a comment in apple.c
 - fix "Idependent" to "Independent" in core.c
 - change "unsucceesful" to "unsuccessful" and "they blk-mq" to "the blk-mq", fix "terminaed" to "terminated", and other grammar in fc.c
 - update "O's" to "0's" to clarify meaning in nvme.h
 - fix a function name reference in a comment in zns.c: *_transter_len() -> *_transfer_len()
 - fix sysfs_emit() output format in pci.c (replace x%08x with 0x%08x)

These changes improve code readability and documentation consistency across the NVMe driver.

Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-07-15  nvmet-tcp: fix callback lock for TLS handshake  (Maurizio Lombardi)

When restoring the default socket callbacks during a TLS handshake, we need to acquire a write lock on sk_callback_lock. Previously, a read lock was used, which is insufficient for modifying sk_user_data and sk_data_ready.

Fixes: 675b453e0241 ("nvmet-tcp: enable TLS handshake upcall")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
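Minimal sketch of the locking change, assuming the usual restore-callbacks pattern (the helper name is illustrative):

    static void nvmet_tcp_restore_sk_callbacks(struct sock *sk,
                                               void (*data_ready)(struct sock *))
    {
        write_lock_bh(&sk->sk_callback_lock);   /* was: read_lock_bh() */
        sk->sk_user_data = NULL;                /* modifying these fields ... */
        sk->sk_data_ready = data_ready;         /* ... needs the write side */
        write_unlock_bh(&sk->sk_callback_lock);
    }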
2025-07-15  nvme: fix misaccounting of nvme-mpath inflight I/O  (Yu Kuai)

Procedures for nvme-mpath IO accounting:

 1) initialize the nvme_request and clear flags;
 2) set NVME_MPATH_IO_STATS and increase the inflight counter when IO starts;
 3) check NVME_MPATH_IO_STATS and decrease the inflight counter when IO is done.

However, in the nvme_fail_nonready_command() case both step 1) and step 2) are skipped; if an old nvme_request set NVME_MPATH_IO_STATS and the request is then reused, step 3) will still be executed, causing the inflight I/O counter to go negative.

Fix the problem by clearing the nvme_request in nvme_fail_nonready_command().

Fixes: ea5e5f42cd2c ("nvme-fabrics: avoid double completions in nvmf_fail_nonready_command")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/all/CAHj4cs_+dauobyYyP805t33WMJVzOWj=7+51p4_j9rA63D9sog@mail.gmail.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
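A sketch of the fix, assuming a step-1-style reset helper is what runs in nvme_fail_nonready_command(); clearing the flags prevents a reused request from leaking NVME_MPATH_IO_STATS into step 3):

    /* called from nvme_fail_nonready_command() before completing the request */
    static inline void nvme_reset_req_state(struct request *req)
    {
        nvme_req(req)->status = 0;
        nvme_req(req)->retries = 0;
        nvme_req(req)->flags = 0;   /* drops a stale NVME_MPATH_IO_STATS */
    }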
2025-07-15  nvme: revert the cross-controller atomic write size validation  (Christoph Hellwig)

This was originally added by commit 8695f060a029 ("nvme: all namespaces in a subsystem must adhere to a common atomic write size") to check that all controllers in a subsystem report the same atomic write size, but the check wasn't quite correct and caused problems for devices with multiple namespaces that report different LBA sizes. Commit f46d273449ba ("nvme: fix atomic write size validation") tried to fix this, but then caused problems for namespace rediscovery after a format with an LBA size change that changes the AWUPF value.

This drops the validation and essentially reverts those two commits while keeping the cleanup that went in between the two. We'll need to figure out how to properly check for the mouse trap that nvme left us, but for now revert the check to keep devices working for users who couldn't care less about the atomic write feature.

Fixes: 8695f060a029 ("nvme: all namespaces in a subsystem must adhere to a common atomic write size")
Fixes: f46d273449ba ("nvme: fix atomic write size validation")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alan Adamson <alan.adamson@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Alan Adamson <alan.adamson@oracle.com>
2025-07-14  nvme: fix endianness of command word prints in nvme_log_err_passthru()  (John Garry)

The command word members of struct nvme_common_command are __le32 type, so use the helper le32_to_cpu() to read them properly.

Fixes: 9f079dda1433 ("nvme: allow passthru cmd error logging")
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Alan Adamson <alan.adamson@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
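Illustrative shape of the fix for a single command word (the real nvme_log_err_passthru() format string prints several of them):

    /* cdw10..cdw15 are __le32 on the wire; convert before printing so the
     * logged value is host-endian */
    dev_err(ctrl->device, "cdw10=0x%08x\n", le32_to_cpu(cmd->common.cdw10));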
2025-07-14  nvme: fix inconsistent RCU list manipulation in nvme_ns_add_to_ctrl_list()  (Zheng Qixing)

When inserting a namespace into the controller's namespace list, the function uses list_add_rcu() when the namespace is inserted in the middle of the list, but falls back to a regular list_add() when adding at the head of the list. This inconsistency could lead to race conditions during concurrent access, as users might observe a partially updated list.

Fix this by consistently using list_add_rcu() in both code paths to ensure proper RCU protection throughout the entire function.

Fixes: be647e2c76b2 ("nvme: use srcu for iterating namespace list")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
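A sketch of the fixed insertion, modeled on the sorted-by-nsid walk described above; both exits now publish the entry with list_add_rcu():

    static void nvme_ns_add_to_ctrl_list(struct nvme_ns *ns)
    {
        struct nvme_ns *tmp;

        list_for_each_entry_reverse(tmp, &ns->ctrl->namespaces, list) {
            if (tmp->head->ns_id < ns->head->ns_id) {
                list_add_rcu(&ns->list, &tmp->list);    /* middle: already RCU */
                return;
            }
        }
        list_add_rcu(&ns->list, &ns->ctrl->namespaces); /* head: was list_add() */
    }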
2025-07-11  nvme-pci: don't allocate dma_vec for IOVA mappings  (Christoph Hellwig)

Not only do IOVA mappings not need the separate dma_vec tracking, the IOVA path also never frees it and thus leaks the allocations.

Fixes: b8b7570a7ec8 ("nvme-pci: fix dma unmapping when using PRPs and not using the IOVA mapping")
Reported-by: Klara Modin <klarasmodin@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Klara Modin <klarasmodin@gmail.com>
Link: https://lore.kernel.org/r/20250711112250.633269-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-08  nvme-pci: fix dma unmapping when using PRPs and not using the IOVA mapping  (Christoph Hellwig)

The current version of the blk_rq_dma_map support in nvme-pci tries to reconstruct the DMA mappings from the on the wire descriptors if they are needed for unmapping. While this is not the case for the direct mapping fast path and the IOVA path, it is needed for the non-IOVA slow path, e.g. when the interconnect is not DMA coherent, when using swiotlb bounce buffering, or for an IOMMU mapping that can't coalesce.

While the reconstruction is easy and works fine for the SGL path, where the on the wire representation maps 1:1 to DMA mappings, the code to reconstruct the DMA mapping ranges from PRPs can't always work, as a given PRP layout can come from different DMA mappings, and the current code doesn't even always get that right.

Give up on this approach and track the actual DMA mapping when it is actually needed again.

Fixes: 7ce3c1dd78fc ("nvme-pci: convert the data mapping to blk_rq_dma_map")
Reported-by: Ben Copeland <ben.copeland@linaro.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Tested-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20250707125223.3022531-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-04  Merge tag 'block-6.16-20250704' of git://git.kernel.dk/linux  (Linus Torvalds)

Pull block fixes from Jens Axboe:

 - NVMe fixes via Christoph:
     - fix incorrect cdw15 value in passthru error logging (Alok Tiwari)
     - fix memory leak of bio integrity in nvmet (Dmitry Bogdanov)
     - refresh visible attrs after being checked (Eugen Hristev)
     - fix suspicious RCU usage warning in the multipath code (Geliang Tang)
     - correctly account for namespace head reference counter (Nilay Shroff)

 - Fix for a regression introduced in ublk in this cycle, where it would attempt to queue a canceled request.

 - brd RCU sleeping fix, also introduced in this cycle. Bare bones fix, should be improved upon for the next release.

* tag 'block-6.16-20250704' of git://git.kernel.dk/linux:
  brd: fix sleeping function called from invalid context in brd_insert_page()
  ublk: don't queue request if the associated uring_cmd is canceled
  nvme-multipath: fix suspicious RCU usage warning
  nvme-pci: refresh visible attrs after being checked
  nvmet: fix memory leak of bio integrity
  nvme: correctly account for namespace head reference counter
  nvme: Fix incorrect cdw15 value in passthru error logging
2025-07-01  nvme-pci: use block layer helpers to calculate num of queues  (Daniel Wagner)

The calculation of the upper limit for queues does not depend solely on the number of possible CPUs; for example, the isolcpus kernel command-line option must also be considered. To account for this, the block layer provides a helper function to retrieve the maximum number of queues. Use it to set an appropriate upper queue number limit.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20250617-isolcpus-queue-counters-v1-3-13923686b54b@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
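Illustrative before/after, assuming blk_mq_num_possible_queues() is the helper added by the referenced isolcpus series (the name and its 0-means-no-cap argument are taken from that series and should be treated as an assumption):

    unsigned int nr_io_queues;

    /* before: every possible CPU implied a queue */
    nr_io_queues = num_possible_cpus();

    /* after: the block layer folds in isolcpus restrictions; 0 = no driver cap */
    nr_io_queues = blk_mq_num_possible_queues(0);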
2025-07-01  nvme: set pi_offset only when checksum type is not BLK_INTEGRITY_CSUM_NONE  (Anuj Gupta)

Protection information is treated as opaque when the checksum type is BLK_INTEGRITY_CSUM_NONE. In order to maintain the right metadata semantics, set pi_offset only in cases where the checksum type is not BLK_INTEGRITY_CSUM_NONE.

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/20250630090548.3317-4-anuj20.g@samsung.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
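Minimal sketch, assuming the blk_integrity field names used across this series (the surrounding setup code is elided):

    /* in the nvme integrity setup path: opaque metadata keeps pi_offset == 0 */
    if (bi->csum_type != BLK_INTEGRITY_CSUM_NONE)
        bi->pi_offset = pi_offset;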
2025-07-01  block: introduce pi_tuple_size field in blk_integrity  (Anuj Gupta)

Introduce a new pi_tuple_size field in struct blk_integrity to explicitly represent the size (in bytes) of the protection information (PI) tuple. This is a prep patch.

Add validation in blk_validate_integrity_limits() to ensure that the PI size matches the expected size for known checksum types and never exceeds the pi_tuple_size.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/20250630090548.3317-3-anuj20.g@samsung.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-01  block: rename tuple_size field in blk_integrity to metadata_size  (Anuj Gupta)

The tuple_size field in blk_integrity currently represents the total size of metadata associated with each data interval. To make the meaning more explicit, rename tuple_size to metadata_size. This is a purely mechanical rename with no functional changes.

Suggested-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/20250630090548.3317-2-anuj20.g@samsung.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-01  nvme-multipath: fix suspicious RCU usage warning  (Geliang Tang)

When I run the NVME over TCP test in virtme-ng, I get the following "suspicious RCU usage" warning in nvme_mpath_add_sysfs_link():

'''
[ 5.024557][ T44] nvmet: Created nvm controller 1 for subsystem nqn.2025-06.org.nvmexpress.mptcp for NQN nqn.2014-08.org.nvmexpress:uuid:f7f6b5e0-ff97-4894-98ac-c85309e0bc77.
[ 5.027401][ T183] nvme nvme0: creating 2 I/O queues.
[ 5.029017][ T183] nvme nvme0: mapped 2/0/0 default/read/poll queues.
[ 5.032587][ T183] nvme nvme0: new ctrl: NQN "nqn.2025-06.org.nvmexpress.mptcp", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:f7f6b5e0-ff97-4894-98ac-c85309e0bc77
[ 5.042214][ T25]
[ 5.042440][ T25] =============================
[ 5.042579][ T25] WARNING: suspicious RCU usage
[ 5.042705][ T25] 6.16.0-rc3+ #23 Not tainted
[ 5.042812][ T25] -----------------------------
[ 5.042934][ T25] drivers/nvme/host/multipath.c:1203 RCU-list traversed in non-reader section!!
[ 5.043111][ T25]
[ 5.043111][ T25] other info that might help us debug this:
[ 5.043111][ T25]
[ 5.043341][ T25]
[ 5.043341][ T25] rcu_scheduler_active = 2, debug_locks = 1
[ 5.043502][ T25] 3 locks held by kworker/u9:0/25:
[ 5.043615][ T25] #0: ffff888008730948 ((wq_completion)async){+.+.}-{0:0}, at: process_one_work+0x7ed/0x1350
[ 5.043830][ T25] #1: ffffc900001afd40 ((work_completion)(&entry->work)){+.+.}-{0:0}, at: process_one_work+0xcf3/0x1350
[ 5.044084][ T25] #2: ffff888013ee0020 (&head->srcu){.+.+}-{0:0}, at: nvme_mpath_add_sysfs_link.part.0+0xb4/0x3a0
[ 5.044300][ T25]
[ 5.044300][ T25] stack backtrace:
[ 5.044439][ T25] CPU: 0 UID: 0 PID: 25 Comm: kworker/u9:0 Not tainted 6.16.0-rc3+ #23 PREEMPT(full)
[ 5.044441][ T25] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 5.044442][ T25] Workqueue: async async_run_entry_fn
[ 5.044445][ T25] Call Trace:
[ 5.044446][ T25] <TASK>
[ 5.044449][ T25] dump_stack_lvl+0x6f/0xb0
[ 5.044453][ T25] lockdep_rcu_suspicious.cold+0x4f/0xb1
[ 5.044457][ T25] nvme_mpath_add_sysfs_link.part.0+0x2fb/0x3a0
[ 5.044459][ T25] ? queue_work_on+0x90/0xf0
[ 5.044461][ T25] ? lockdep_hardirqs_on+0x78/0x110
[ 5.044466][ T25] nvme_mpath_set_live+0x1e9/0x4f0
[ 5.044470][ T25] nvme_mpath_add_disk+0x240/0x2f0
[ 5.044472][ T25] ? __pfx_nvme_mpath_add_disk+0x10/0x10
[ 5.044475][ T25] ? add_disk_fwnode+0x361/0x580
[ 5.044480][ T25] nvme_alloc_ns+0x81c/0x17c0
[ 5.044483][ T25] ? kasan_quarantine_put+0x104/0x240
[ 5.044487][ T25] ? __pfx_nvme_alloc_ns+0x10/0x10
[ 5.044495][ T25] ? __pfx_nvme_find_get_ns+0x10/0x10
[ 5.044496][ T25] ? rcu_read_lock_any_held+0x45/0xa0
[ 5.044498][ T25] ? validate_chain+0x232/0x4f0
[ 5.044503][ T25] nvme_scan_ns+0x4c8/0x810
[ 5.044506][ T25] ? __pfx_nvme_scan_ns+0x10/0x10
[ 5.044508][ T25] ? find_held_lock+0x2b/0x80
[ 5.044512][ T25] ? ktime_get+0x16d/0x220
[ 5.044517][ T25] ? kvm_clock_get_cycles+0x18/0x30
[ 5.044520][ T25] ? __pfx_nvme_scan_ns_async+0x10/0x10
[ 5.044522][ T25] async_run_entry_fn+0x97/0x560
[ 5.044523][ T25] ? rcu_is_watching+0x12/0xc0
[ 5.044526][ T25] process_one_work+0xd3c/0x1350
[ 5.044532][ T25] ? __pfx_process_one_work+0x10/0x10
[ 5.044536][ T25] ? assign_work+0x16c/0x240
[ 5.044539][ T25] worker_thread+0x4da/0xd50
[ 5.044545][ T25] ? __pfx_worker_thread+0x10/0x10
[ 5.044546][ T25] kthread+0x356/0x5c0
[ 5.044548][ T25] ? __pfx_kthread+0x10/0x10
[ 5.044549][ T25] ? ret_from_fork+0x1b/0x2e0
[ 5.044552][ T25] ? __lock_release.isra.0+0x5d/0x180
[ 5.044553][ T25] ? ret_from_fork+0x1b/0x2e0
[ 5.044555][ T25] ? rcu_is_watching+0x12/0xc0
[ 5.044557][ T25] ? __pfx_kthread+0x10/0x10
[ 5.044559][ T25] ret_from_fork+0x218/0x2e0
[ 5.044561][ T25] ? __pfx_kthread+0x10/0x10
[ 5.044562][ T25] ret_from_fork_asm+0x1a/0x30
[ 5.044570][ T25] </TASK>
'''

This patch uses the sleepable RCU version of the list iterator, list_for_each_entry_srcu(), instead of list_for_each_entry_rcu() to fix it.

Fixes: 4dbd2b2ebe4c ("nvme-multipath: Add visibility for round-robin io-policy")
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
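Sketch of the fix in nvme_mpath_add_sysfs_link(), using the SRCU-aware iterator with a matching lockdep condition (the loop body is elided):

    int srcu_idx = srcu_read_lock(&head->srcu);
    struct nvme_ns *ns;

    list_for_each_entry_srcu(ns, &head->list, siblings,
                             srcu_read_lock_held(&head->srcu)) {
        /* ... create the sysfs link for this path ... */
    }
    srcu_read_unlock(&head->srcu, srcu_idx);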
2025-06-30  nvme-pci: rework the build time assert for NVME_MAX_NR_DESCRIPTORS  (Christoph Hellwig)

The current use of an always_inline helper is a bit convoluted. Instead use macros that represent the arithmetic used for building up the PRP chain.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20250625113531.522027-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  nvme-pci: replace NVME_MAX_KB_SZ with NVME_MAX_BYTES  (Christoph Hellwig)

Having a define in kiB units is a bit weird. Also update the comment now that there is no scatterlist limit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20250625113531.522027-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  nvme-pci: convert the data mapping to blk_rq_dma_map  (Christoph Hellwig)

Use the blk_rq_dma_map API to DMA map requests instead of scatterlists. This removes the need to allocate a scatterlist covering every segment, and thus the overall transfer length limit based on the scatterlist allocation.

Instead the DMA mapping is done by iterating the bio_vec chain in the request directly. The unmap is handled differently depending on how we mapped:

 - when using an IOMMU only a single IOVA is used, and it is stored in iova_state
 - for direct mappings that don't use swiotlb and are cache coherent, unmap is not needed at all
 - for direct mappings that are not cache coherent or use swiotlb, the physical addresses are rebuilt from the PRPs or SGL segments

The latter unfortunately adds a fair amount of code to the driver, but it is code not used in the fast path.

The conversion only covers the data mapping path, and still uses a scatterlist for the multi-segment metadata case. I plan to convert that as soon as we have good test coverage for the multi-segment metadata path.

Thanks to Chaitanya Kulkarni for an initial attempt at a new DMA API conversion for nvme-pci, Kanchan Joshi for bringing back the single segment optimization, Leon Romanovsky for shepherding this through a gazillion rebases and Nitesh Shetty for various improvements.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20250625113531.522027-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  nvme-pci: remove superfluous arguments  (Christoph Hellwig)

The call chain in the prep_rq and completion paths passes around a lot of nvme_dev, nvme_queue and nvme_command arguments that can be trivially derived from the passed in struct request. Remove them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20250625113531.522027-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  nvme-pci: merge the simple PRP and SGL setup into a common helper  (Christoph Hellwig)

nvme_setup_prp_simple and nvme_setup_sgl_simple share a lot of logic. Merge them into a single helper that makes use of the previously added use_sgl tristate.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250625113531.522027-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  nvme-pci: refactor nvme_pci_use_sgls  (Christoph Hellwig)

Move the average segment size into a separate helper, and return a tristate to distinguish the case where we can use SGLs from the case where we have to use SGLs. This will allow simplifying the code and making more efficient decisions in follow-on changes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250625113531.522027-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
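Hedged sketch of the refactor; the enum constant names are illustrative, while the average-segment-size arithmetic follows the usual payload-over-segments heuristic:

    enum nvme_use_sgl {
        SGL_UNSUPPORTED,    /* controller can't do SGLs: use PRPs */
        SGL_SUPPORTED,      /* choose SGL vs PRP by heuristic */
        SGL_FORCED,         /* must use SGLs */
    };

    static unsigned int nvme_pci_avg_seg_size(struct request *req)
    {
        /* average bytes per physical segment of the request */
        return DIV_ROUND_UP(blk_rq_payload_bytes(req),
                            blk_rq_nr_phys_segments(req));
    }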
2025-06-30  nvme-pci: refresh visible attrs after being checked  (Eugen Hristev)

The sysfs attributes are registered early, but the driver does not know whether they are needed or not at that moment.

For the CMB attributes, commit e917a849c3fc ("nvme-pci: refresh visible attrs for cmb attributes") solved this problem by calling nvme_update_attrs after mapping the CMB. However, the issue persists for the HMB attributes. To solve the problem, move the call to nvme_update_attrs after nvme_setup_host_mem, which sets up the HMB.

Fixes: e917a849c3fc ("nvme-pci: refresh visible attrs for cmb attributes")
Fixes: 86adbf0cdb9e ("nvme: simplify transport specific device attribute handling")
Signed-off-by: Eugen Hristev <eugen.hristev@collabora.com>
Signed-off-by: André Almeida <andrealmeid@igalia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-06-30  nvmet: fix memory leak of bio integrity  (Dmitry Bogdanov)

If nvmet receives commands with metadata there is a continuous memory leak of the kmalloc-128 slab, or more precisely of bio->bi_integrity.

Since commit bf4c89fc8797 ("block: don't call bio_uninit from bio_endio") each user of bio_init has to use bio_uninit as well; otherwise the bio integrity is never freed. Nvmet uses bio_init for inline bios. Uninit the inline bio to complete the deallocation of the integrity in the bio.

Fixes: bf4c89fc8797 ("block: don't call bio_uninit from bio_endio")
Signed-off-by: Dmitry Bogdanov <d.bogdanov@yadro.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
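Sketch of the resulting put-side rule; the helper name is illustrative, but req->b.inline_bio is the inline bio named above:

    static void nvmet_req_bio_put(struct nvmet_req *req, struct bio *bio)
    {
        if (bio != &req->b.inline_bio)
            bio_put(bio);       /* allocated bios: dropped by reference */
        else
            bio_uninit(bio);    /* inline bio: frees bio->bi_integrity */
    }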
2025-06-30  nvme: correctly account for namespace head reference counter  (Nilay Shroff)

The blktests nvme/058 manifests an issue where the NVMe subsystem kobject entry remains stale in sysfs, causing a failure during subsequent NVMe module reloads [1]. Specifically, when attempting to register a new NVMe subsystem, the driver encounters a kobject name collision because a stale kobject still exists. Note that nvme/058 itself doesn't report any failure and the test case passes; it's only during subsequent NVMe module reloads that the stale nvme subsystem kobject entry in sysfs causes the observed symptom [1].

This issue stems from an imbalance in the get/put usage of the namespace head (nshead) reference counter. The nshead holds a reference to the associated NVMe subsystem. If the nshead reference is not properly released, it prevents the cleanup of the subsystem's kobject, leaving a stale nvme subsystem entry behind in sysfs.

During the failure case, the last namespace path referencing a nshead is removed, but the nshead reference was not released. This occurs because the release logic currently only puts the nshead reference when its state is LIVE. However, in configurations where ANA (Asymmetric Namespace Access) is enabled, a namespace may be associated with an ANA state that is neither optimized nor non-optimized. In this case, the nshead may never transition to LIVE, and the corresponding nshead reference is then never dropped. In fact, nvme/058 associates some of the nvme namespaces with an inaccessible ANA state, and with that the nshead is created but its state is not transitioned to LIVE. So the current logic then causes the nshead reference to be leaked for non-LIVE states.

In another scenario, during namespace allocation, the driver first allocates a nshead and then issues an Identify Namespace command. If this command fails, which can happen in tests like nvme/058 that rapidly enable and disable namespaces, we must release the reference to the newly allocated nshead. However, this reference release is currently missing in the failure path, causing a nshead reference leak.

To fix this, we now unconditionally release the nshead reference when the last nvme path referencing the nshead is removed, regardless of the head's state. Also, during the identify namespace failure case we now properly release the nshead reference. This ensures proper cleanup of the nshead, and consequently, the NVMe subsystem and its associated kobject. This change prevents stale kobject entries from lingering in sysfs and eliminates the module reload failures observed just after running nvme/058.

[1] https://lore.kernel.org/all/CAHj4cs8fOBS-eSjsd5LUBzy7faKXJtgLkCN+mDy_-ezCLLLq+Q@mail.gmail.com/

Reported-by: yi.zhang@redhat.com
Closes: https://lore.kernel.org/all/CAHj4cs8fOBS-eSjsd5LUBzy7faKXJtgLkCN+mDy_-ezCLLLq+Q@mail.gmail.com/
Fixes: 62188639ec16 ("nvme-multipath: introduce delayed removal of the multipath head node")
Tested-by: yi.zhang@redhat.com
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-06-30  nvme: Fix incorrect cdw15 value in passthru error logging  (Alok Tiwari)

Fix an error in nvme_log_err_passthru() where cdw14 was incorrectly printed twice instead of cdw15. This fix ensures accurate logging of the full passthrough command payload.

Fixes: 9f079dda1433 ("nvme: allow passthru cmd error logging")
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-06-27  Merge tag 'block-6.16-20250626' of git://git.kernel.dk/linux  (Linus Torvalds)

Pull block fixes from Jens Axboe:

 - Fixes for ublk:
     - fix C++ narrowing warnings in the uapi header
     - update/improve UBLK_F_SUPPORT_ZERO_COPY comment in uapi header
     - fix for the ublk ->queue_rqs() implementation, limiting a batch to just the specific task AND ring
     - ublk_get_data() error handling fix
     - sanity check more arguments in ublk_ctrl_add_dev()
     - selftest addition

 - NVMe pull request via Christoph:
     - reset delayed remove_work after reconnect
     - fix atomic write size validation

 - Fix for a warning introduced in bdev_count_inflight_rw() in this merge window

* tag 'block-6.16-20250626' of git://git.kernel.dk/linux:
  block: fix false warning in bdev_count_inflight_rw()
  ublk: sanity check add_dev input for underflow
  nvme: fix atomic write size validation
  nvme: refactor the atomic write unit detection
  nvme: reset delayed remove_work after reconnect
  ublk: setup ublk_io correctly in case of ublk_get_data() failure
  ublk: update UBLK_F_SUPPORT_ZERO_COPY comment in UAPI header
  ublk: fix narrowing warnings in UAPI header
  selftests: ublk: don't take same backing file for more than one ublk devices
  ublk: build batch from IOs in same io_ring_ctx and io task
2025-06-26  nvme: fix atomic write size validation  (Christoph Hellwig)

Don't mix the namespace and controller values, and validate the per-controller limit when probing the controller. This avoids spurious failures for controllers with namespaces that have different logical block sizes, or that report the per-namespace values only for some namespaces.

It also fixes a missing queue_limits_cancel_update in an error path by removing that error path.

Fixes: 8695f060a029 ("nvme: all namespaces in a subsystem must adhere to a common atomic write size")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
2025-06-26  nvme: refactor the atomic write unit detection  (Christoph Hellwig)

Move all the code out of nvme_update_disk_info into the helper, and rename the helper to have a somewhat less clumsy name.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
2025-06-26  nvme: reset delayed remove_work after reconnect  (Keith Busch)

The remove_work will proceed with permanently disconnecting on the initial final path failure if the head shows no paths after the delay. If a new path connects while the remove_work is pending, and if that new path happens to disconnect before that remove_work executes, the delayed removal should reset based on the most recent path disconnect time, but queue_delayed_work() won't do anything if the work is already pending.

Attempt to cancel the delayed work when a new path connects, and use mod_delayed_work() in case the remove_work remains pending anyway.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
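Sketch of the re-arming logic; the field names follow the delayed-removal feature referenced above and should be treated as illustrative:

    /* a new path connected: stop any pending permanent removal */
    cancel_delayed_work(&head->remove_work);

    /* the last path dropped: (re)start the countdown from "now";
     * unlike queue_delayed_work(), this also updates the expiry of a
     * work item that is already pending */
    mod_delayed_work(nvme_wq, &head->remove_work,
                     head->delayed_removal_secs * HZ);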
2025-06-23  nvmet: set WZDS and DRB if device enables unmap write zeroes operation  (Zhang Yi)

Set the WZDS and DRB bits in the namespace dlfeat if the underlying block device enables the unmap write zeroes operation, making the nvme target device support the unmap write zeroes command.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/20250619111806.3546162-4-yi.zhang@huaweicloud.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23  nvme: set max_hw_wzeroes_unmap_sectors if device supports DEAC bit  (Zhang Yi)

When the device supports the Write Zeroes command and the DEAC bit, it indicates that the deallocate bit in the Write Zeroes command is supported, and that the bytes read from a deallocated logical block are zeroes. This means the device supports the unmap Write Zeroes operation, so set max_hw_wzeroes_unmap_sectors to max_write_zeroes_sectors in the device's queue limits.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/20250619111806.3546162-3-yi.zhang@huaweicloud.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-14  Merge tag 'block-6.16-20250614' of git://git.kernel.dk/linux  (Linus Torvalds)

Pull block fixes from Jens Axboe:

 - Fix for a deadlock on queue freeze with zoned writes
 - Fix for zoned append emulation
 - Two bio folio fixes, for sparsemem and for very large folios
 - Fix for a performance regression introduced in 6.13 when plug insertion was changed
 - Fix for NVMe passthrough handling for polled IO
 - Document the ublk auto registration feature
 - loop lockdep warning fix

* tag 'block-6.16-20250614' of git://git.kernel.dk/linux:
  nvme: always punt polled uring_cmd end_io work to task_work
  Documentation: ublk: Separate UBLK_F_AUTO_BUF_REG fallback behavior sublists
  block: Fix bvec_set_folio() for very large folios
  bio: Fix bio_first_folio() for SPARSEMEM without VMEMMAP
  block: use plug request list tail for one-shot backmerge attempt
  block: don't use submit_bio_noacct_nocheck in blk_zone_wplug_bio_work
  block: Clear BIO_EMULATES_ZONE_APPEND flag on BIO completion
  ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)
  loop: move lo_set_size() out of queue freeze
2025-06-13  nvme: always punt polled uring_cmd end_io work to task_work  (Jens Axboe)

Currently NVMe uring_cmd completions will complete locally, if they are polled. This is done because those completions are always invoked from task context. And while that is true, there's no guarantee that it's invoked under the right ring context, or even task. If someone does NVMe passthrough via multiple threads and with a limited number of poll queues, then ringA may find completions from ringB. For that case, completing the request may not be sound.

Always just punt the passthrough completions via task_work, which will redirect the completion, if needed.

Cc: stable@vger.kernel.org
Fixes: 585079b6e425 ("nvme: wire up async polling for io passthrough commands")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
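Sketch of the change, assuming the driver's existing task_work callback: instead of special-casing polled completions, everything is bounced through io_uring's task_work so the completion runs in the right ring/task context (the real end_io handler also records status and result, elided here):

    static enum rq_end_io_ret nvme_uring_cmd_end_io(struct request *req,
                                                    blk_status_t err)
    {
        struct io_uring_cmd *ioucmd = req->end_io_data;

        /* no local completion for polled I/O any more */
        io_uring_cmd_complete_in_task(ioucmd, nvme_uring_task_cb);
        return RQ_END_IO_NONE;
    }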
2025-06-08  treewide, timers: Rename from_timer() to timer_container_of()  (Ingo Molnar)

Move this API to the canonical timer_*() namespace.

[ tglx: Redone against pre rc1 ]

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-06-06  Merge tag 'block-6.16-20250606' of git://git.kernel.dk/linux  (Linus Torvalds)

Pull more block updates from Jens Axboe:

 - NVMe pull request via Christoph:
     - TCP error handling fix (Shin'ichiro Kawasaki)
     - TCP I/O stall handling fixes (Hannes Reinecke)
     - fix command limits status code (Keith Busch)
     - support vectored buffers also for passthrough (Pavel Begunkov)
     - spelling fixes (Yi Zhang)

 - MD pull request via Yu:
     - fix REQ_RAHEAD and REQ_NOWAIT IO err handling for raid1/10
     - fix max_write_behind setting for dm-raid
     - some minor cleanups

 - Integrity data direction fix and cleanup

 - bcache NULL pointer fix

 - Fix for loop missing write start/end handling

 - Decouple hardware queues and IO threads in ublk

 - Slew of ublk selftests additions and updates

* tag 'block-6.16-20250606' of git://git.kernel.dk/linux: (29 commits)
  nvme: spelling fixes
  nvme-tcp: fix I/O stalls on congested sockets
  nvme-tcp: sanitize request list handling
  nvme-tcp: remove tag set when second admin queue config fails
  nvme: enable vectored registered bufs for passthrough cmds
  nvme: fix implicit bool to flags conversion
  nvme: fix command limits status code
  selftests: ublk: kublk: improve behavior on init failure
  block: flip iter directions in blk_rq_integrity_map_user()
  block: drop direction param from bio_integrity_copy_user()
  selftests: ublk: cover PER_IO_DAEMON in more stress tests
  Documentation: ublk: document UBLK_F_PER_IO_DAEMON
  selftests: ublk: add stress test for per io daemons
  selftests: ublk: add functional test for per io daemons
  selftests: ublk: kublk: decouple ublk_queues from ublk server threads
  selftests: ublk: kublk: move per-thread data out of ublk_queue
  selftests: ublk: kublk: lift queue initialization out of thread
  selftests: ublk: kublk: tie sqe allocation to io instead of queue
  selftests: ublk: kublk: plumb q_id in io_uring user_data
  ublk: have a per-io daemon instead of a per-queue daemon
  ...
2025-06-04  nvme: spelling fixes  (Yi Zhang)

Fix various spelling errors in comments.

Signed-off-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-06-04  nvme-tcp: fix I/O stalls on congested sockets  (Hannes Reinecke)

When the socket is busy processing, nvme_tcp_try_recv() might return -EAGAIN, but this doesn't automatically imply that the sending side is blocked, too. So check if there are pending requests once nvme_tcp_try_recv() returns -EAGAIN, and continue with the sending loop to avoid I/O stalls.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Acked-by: Chris Leech <cleech@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-06-04  nvme-tcp: sanitize request list handling  (Hannes Reinecke)

Validate the request in nvme_tcp_handle_r2t() to ensure it's not part of any list, otherwise a malicious R2T PDU might inject a loop in request list processing.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
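Minimal sketch of the validation, assuming the request is linked through req->entry as elsewhere in the driver:

    /* in nvme_tcp_handle_r2t(), after resolving req from the R2T PDU: */
    if (WARN_ON_ONCE(!list_empty(&req->entry)))
        return -EPROTO;     /* already queued: reject the loop-injecting R2T */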
2025-06-04  nvme-tcp: remove tag set when second admin queue config fails  (Shin'ichiro Kawasaki)

Commit 104d0e2f6222 ("nvme-fabrics: reset admin connection for secure concatenation") modified nvme_tcp_setup_ctrl() to call nvme_tcp_configure_admin_queue() twice. The first call prepares for DH-CHAP negotiation, and the second call is required for secure concatenation. However, this change triggered a BUG KASAN slab-use-after-free in blk_mq_queue_tag_busy_iter(). This BUG can be recreated by repeating the blktests test case nvme/063 a few times [1].

When the BUG happens, nvme_tcp_create_ctrl() fails in the call chain below:

    nvme_tcp_create_ctrl()
      nvme_tcp_alloc_ctrl() new=true              ... Alloc nvme_tcp_ctrl and admin_tag_set
      nvme_tcp_setup_ctrl() new=true
        nvme_tcp_configure_admin_queue() new=true ... Succeed
          nvme_alloc_admin_tag_set()              ... Alloc the tag set for admin_tag_set
        nvme_stop_keep_alive()
        nvme_tcp_teardown_admin_queue() remove=false
        nvme_tcp_configure_admin_queue() new=false
          nvme_tcp_alloc_admin_queue()            ... Fail, but do not call nvme_remove_admin_tag_set()
      nvme_uninit_ctrl()
      nvme_put_ctrl()                             ... Free up the nvme_tcp_ctrl and admin_tag_set

The first call of nvme_tcp_configure_admin_queue() succeeds with the new=true argument. The second call fails with the new=false argument. This second call does not call nvme_remove_admin_tag_set() on failure, due to the new=false argument. Then the admin tag set is not removed. However, nvme_tcp_create_ctrl() assumes that nvme_tcp_setup_ctrl() would call nvme_remove_admin_tag_set(). It then frees up struct nvme_tcp_ctrl, which has the admin_tag_set field. Later on, the timeout handler accesses the admin_tag_set field and causes the BUG KASAN slab-use-after-free.

To not leave the admin tag set, call nvme_remove_admin_tag_set() when the second nvme_tcp_configure_admin_queue() call fails. Do not return from nvme_tcp_setup_ctrl() on failure. Instead, jump to the "destroy_admin" go-to label to call nvme_tcp_teardown_admin_queue(), which calls nvme_remove_admin_tag_set().

Fixes: 104d0e2f6222 ("nvme-fabrics: reset admin connection for secure concatenation")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-nvme/6mhxskdlbo6fk6hotsffvwriauurqky33dfb3s44mqtr5dsxmf@gywwmnyh3twm/ [1]
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
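Sketch of the fixed error path, following the description above (surrounding setup steps elided):

    ret = nvme_tcp_configure_admin_queue(ctrl, false);
    if (ret)
        goto destroy_admin;     /* was: return ret; */
    /* ... remaining setup ... */

    destroy_admin:
        nvme_tcp_teardown_admin_queue(ctrl, false); /* removes the admin tag set */
        return ret;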
2025-06-04  nvme: enable vectored registered bufs for passthrough cmds  (Pavel Begunkov)

nvme already supports registered buffers for non-vectored io_uring passthrough commands; enable them for the vectored mode as well. It takes an iovec, each entry of which should contain a range within the same registered buffer specified in sqe->buf_index.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-06-04  nvme: fix implicit bool to flags conversion  (Pavel Begunkov)

nvme_map_user_request() takes flags as the last argument, but nvme_uring_cmd_io() shoves a bool "vec" into it. It behaves as expected because the bool is converted to 0/1 and NVME_IOCTL_VEC is defined as 1, but it's better to pass flags explicitly.

Fixes: 7b7fdb8e2dbc1 ("nvme: replace the "bool vec" arguments with flags in the ioctl path")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>