Age | Commit message | Author
2025-01-13 | md: add a new callback pers->bitmap_sector() | Yu Kuai
This callback will be used in raid5 to convert io ranges from array to bitmap. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/r/20250109015145.158868-4-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2025-01-13 | md/md-bitmap: remove the last parameter for bimtap_ops->endwrite() | Yu Kuai
When IO fails for one rdev, the bit is marked as NEEDED in the following cases: 1) badblocks is set and the rdev is not faulty; 2) the rdev is faulty. Case 1) is useless because synchronizing data to badblocks makes no sense. Case 2) can be replaced with mddev->degraded. Also remove R1BIO_Degraded, R10BIO_Degraded and STRIPE_DEGRADED since case 2) no longer uses them. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250109015145.158868-3-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2025-01-13 | md/md-bitmap: factor behind write counters out from bitmap_{start/end}write() | Yu Kuai
behind_write is only used in raid1. Factor it out in preparation for refactoring bitmap_{start/end}write(). There are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/r/20250109015145.158868-2-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2025-01-13 | md: Replace deprecated kmap_atomic() with kmap_local_page() | David Reaver
kmap_atomic() is deprecated and should be replaced with kmap_local_page() [1][2]. kmap_local_page() is faster in kernels with HIGHMEM enabled, can take page faults, and allows preemption. According to [2], this is safe as long as the code between kmap_atomic() and kunmap_atomic() does not implicitly depend on disabling page faults or preemption. It appears to me that none of the call sites in this patch depend on disabling page faults or preemption; they are all mapping a page to simply extract some information from it or print some debug info. [1] https://lwn.net/Articles/836144/ [2] https://docs.kernel.org/mm/highmem.html#temporary-virtual-mappings Signed-off-by: David Reaver <me@davidreaver.com> Link: https://lore.kernel.org/r/20250108192131.46843-1-me@davidreaver.com Signed-off-by: Song Liu <song@kernel.org>
2025-01-13 | md: reintroduce md-linear | Yu Kuai
md-linear was removed by commit 849d18e27be9 ("md: Remove deprecated CONFIG_MD_LINEAR") because it had been marked as deprecated for a long time. However, md-linear is widely used for underlying disks of different sizes; sadly, we didn't know this until now. It is truly useful for creating partitions, assembling multiple raid arrays and then appending one to the other. People have to use dm-linear in this case now; however, they would prefer to minimize the number of modules involved. Fixes: 849d18e27be9 ("md: Remove deprecated CONFIG_MD_LINEAR") Cc: stable@vger.kernel.org Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Acked-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20250102112841.1227111-1-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2025-01-13 | partitions: ldm: remove the initial kernel-doc notation | Randy Dunlap
Remove the file's first comment describing what the file is. This comment is not in kernel-doc format so it causes a kernel-doc warning. ldm.h:13: warning: expecting prototype for ldm(). Prototype was for _FS_PT_LDM_H_() instead Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Richard Russon (FlatCap) <ldm@flatcap.org> Cc: linux-ntfs-dev@lists.sourceforge.net Cc: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20250111062758.910458-1-rdunlap@infradead.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-13 | blk-cgroup: rwstat: fix kernel-doc warnings in header file | Randy Dunlap
Correct the function parameters to eliminate kernel-doc warnings: blk-cgroup-rwstat.h:63: warning: Function parameter or struct member 'opf' not described in 'blkg_rwstat_add' blk-cgroup-rwstat.h:63: warning: Excess function parameter 'op' description in 'blkg_rwstat_add' blk-cgroup-rwstat.h:91: warning: Function parameter or struct member 'result' not described in 'blkg_rwstat_read' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: cgroups@vger.kernel.org Link: https://lore.kernel.org/r/20250111062748.910442-1-rdunlap@infradead.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-13 | blk-cgroup: fix kernel-doc warnings in header file | Randy Dunlap
Correct the function parameters and function names to eliminate kernel-doc warnings: blk-cgroup.h:238: warning: Function parameter or struct member 'bio' not described in 'bio_issue_as_root_blkg' blk-cgroup.h:248: warning: bad line: blk-cgroup.h:279: warning: expecting prototype for blkg_to_pdata(). Prototype was for blkg_to_pd() instead blk-cgroup.h:296: warning: expecting prototype for pdata_to_blkg(). Prototype was for pd_to_blkg() instead Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: cgroups@vger.kernel.org Link: https://lore.kernel.org/r/20250111062736.910383-1-rdunlap@infradead.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-13 | nbd: fix partial sending | Ming Lei
The nbd driver sends the request header and payload with multiple calls to sock_sendmsg(), and partial sending can't be avoided. However, the nbd driver returns BLK_STS_RESOURCE to the block core in this situation. This causes an issue: request->tag may change in the next run of nbd_queue_rq(), but the original tag has already been sent as part of the header cookie. This confuses the nbd driver's reply handling, since the real request can no longer be retrieved with the obsolete tag. Fix it by retrying the send directly in a per-socket work function, and meanwhile return BLK_STS_OK to the block layer core. Cc: vincent.chen@sifive.com Cc: Leon Schuermann <leon@is.currently.online> Cc: Bart Van Assche <bvanassche@acm.org> Reported-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Link: https://lore.kernel.org/r/20241029011941.153037-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
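The retry idea in the nbd fix can be modelled in user space. The following is a minimal sketch, not the driver code: `mock_sock`, `pending_req` and the helper names are hypothetical, and the mock transport simply accepts a few bytes per call to imitate a partial send. The point it illustrates is that the already-sent offset is remembered, so a retry resumes the same message instead of rebuilding a header that may carry a different tag.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical transport: accepts at most `cap` bytes per call, like a
 * socket whose send buffer is nearly full. */
struct mock_sock {
    unsigned char wire[64]; /* bytes that reached the "wire" */
    size_t used;
    size_t cap;
};

static inline size_t mock_send(struct mock_sock *s, const unsigned char *buf,
                               size_t len)
{
    size_t room = sizeof(s->wire) - s->used;
    size_t n = len < s->cap ? len : s->cap;

    if (n > room)
        n = room;
    memcpy(s->wire + s->used, buf, n);
    s->used += n;
    return n; /* may be less than len: a partial send */
}

/* Hypothetical pending request: header and payload already serialized,
 * plus how much of it has been sent so far. */
struct pending_req {
    const unsigned char *buf;
    size_t len;
    size_t sent;
};

/* One pass of the retry work: returns 1 once the request is fully sent. */
static inline int retry_send_once(struct mock_sock *s, struct pending_req *p)
{
    assert(p->sent <= p->len);
    p->sent += mock_send(s, p->buf + p->sent, p->len - p->sent);
    return p->sent == p->len;
}

/* Re-run the work until done; returns the number of passes, or -1. */
static inline int send_until_done(struct mock_sock *s, struct pending_req *p,
                                  int max_passes)
{
    for (int pass = 1; pass <= max_passes; pass++) {
        if (retry_send_once(s, p))
            return pass;
    }
    return -1;
}
```

In this model the caller never sees the partial send; it only learns whether the whole message eventually went out, which mirrors returning BLK_STS_OK while a work item finishes the transfer.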
2025-01-13 | block: mark GFP_NOIO around sysfs ->store() | Ming Lei
sysfs ->store() is called with the queue frozen, and several ->store() callbacks (update_nr_requests, wbt, scheduler) allocate memory with GFP_KERNEL, which may run into the direct reclaim code path and potentially cause a deadlock. Fix the issue by marking NOIO around sysfs ->store(). Reported-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20250113015833.698458-1-ming.lei@redhat.com Link: https://lore.kernel.org/linux-block/Z4RkemI9f6N5zoEF@fedora/T/#mc774c65eeca5c024d29695f9ac6152b87763f305 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-13 | Merge tag 'nvme-6.14-2025-01-12' of git://git.infradead.org/nvme into for-6.14/block | Jens Axboe
Pull NVMe updates from Keith:

"nvme updates for Linux 6.14

 - Target support for PCI-Endpoint transport (Damien)
 - TCP IO queue spreading fixes (Sagi, Chaitanya)
 - Target handling for "limited retry" flags (Guixen)
 - Poll type fix (Yongsoo)
 - Xarray storage error handling (Keisuke)
 - Host memory buffer free size fix on error (Francis)"

* tag 'nvme-6.14-2025-01-12' of git://git.infradead.org/nvme: (25 commits)
  nvme-pci: use correct size to free the hmb buffer
  nvme: Add error path for xa_store in nvme_init_effects
  nvme-pci: fix comment typo
  Documentation: Document the NVMe PCI endpoint target driver
  nvmet: New NVMe PCI endpoint function target driver
  nvmet: Implement arbitration feature support
  nvmet: Implement interrupt config feature support
  nvmet: Implement interrupt coalescing feature support
  nvmet: Implement host identifier set feature support
  nvmet: Introduce get/set_feature controller operations
  nvmet: Do not require SGL for PCI target controller commands
  nvmet: Add support for I/O queue management admin commands
  nvmet: Introduce nvmet_sq_create() and nvmet_cq_create()
  nvmet: Introduce nvmet_req_transfer_len()
  nvmet: Improve nvmet_alloc_ctrl() interface and implementation
  nvme: Add PCI transport type
  nvmet: Add drvdata field to struct nvmet_ctrl
  nvmet: Introduce nvmet_get_cmd_effects_admin()
  nvmet: Export nvmet_update_cc() and nvmet_cc_xxx() helpers
  nvmet: Add vendor_id and subsys_vendor_id subsystem attributes
  ...
2025-01-12 | nvme-pci: use correct size to free the hmb buffer | Francis Pravin
dev->host_mem_size value is updated only after the successful buffer allocation of hmb descriptor. Otherwise, it may have some undefined value. So, use the correct size to free the hmb buffer when the hmb descriptor buffer allocation failed. Signed-off-by: Francis Pravin <francis.p@samsung.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-12 | nvme: Add error path for xa_store in nvme_init_effects | Keisuke Nishimura
The xa_store() may fail due to memory allocation failure because there is no guarantee that the index NVME_CSI_NVM is already used. This fix introduces a new function to handle the error path. Fixes: cc115cbe12d9 ("nvme: always initialize known command effects") Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-12 | nvme-pci: fix comment typo | Baruch Siach
envent -> event. Signed-off-by: Baruch Siach <baruch@tkos.co.il> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 | Documentation: Document the NVMe PCI endpoint target driver | Damien Le Moal
Add a documentation file (Documentation/nvme/nvme-pci-endpoint-target.rst) for the new NVMe PCI endpoint target driver. This provides an overview of the driver requirements, capabilities and limitations. A user guide describing how to setup a NVMe PCI endpoint device using this driver is also provided. This document is made accessible also from the PCI endpoint documentation using a link. Furthermore, since the existing nvme documentation was not accessible from the top documentation index, an index file is added to Documentation/nvme and this index listed as "NVMe Subsystem" in the "Storage interfaces" section of the subsystem API index. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 | nvmet: New NVMe PCI endpoint function target driver | Damien Le Moal
Implement a PCI target driver using the PCI endpoint framework. This requires hardware with a PCI controller capable of executing in endpoint mode.

The PCI endpoint framework is used to set up a PCI endpoint function and its BAR compatible with a NVMe PCI controller. The framework is also used to map local memory to the PCI address space to execute MMIO accesses for retrieving NVMe commands from submission queues and posting completion entries to completion queues. If supported, DMA is used for command retrieval and command data transfers, based on the PCI address segments indicated by the command using either PRPs or SGLs.

The NVMe target driver relies on the NVMe target core code to execute all commands issued by the host. The PCI target driver is mainly responsible for the following:

 - Initialization and teardown of the endpoint device and its backend PCI target controller. The PCI target controller is created using a subsystem and a port defined through configfs. The port used must be initialized with the "pci" transport type. The target controller is allocated and initialized when the PCI endpoint is started by binding it to the endpoint PCI device (nvmet_pci_epf_epc_init() function).

 - Manage the endpoint controller state according to the PCI link state and the actions of the host (e.g. checking the CC.EN register) and propagate these actions to the PCI target controller. Polling of the controller enable/disable is done using a delayed work scheduled every 5ms (nvmet_pci_epf_poll_cc() function). This work is started whenever the PCI link comes up (nvmet_pci_epf_link_up() notifier function) and stopped when the PCI link comes down (nvmet_pci_epf_link_down() notifier function). nvmet_pci_epf_poll_cc() enables and disables the PCI controller using the functions nvmet_pci_epf_enable_ctrl() and nvmet_pci_epf_disable_ctrl(). The controller admin queue is created using nvmet_pci_epf_create_cq(), which calls nvmet_cq_create(), and nvmet_pci_epf_create_sq(), which uses nvmet_sq_create(). nvmet_pci_epf_disable_ctrl() always resets the PCI controller to its initial state so that nvmet_pci_epf_enable_ctrl() can be called again. This ensures correct operation if, for instance, the host reboots, causing the PCI link to be temporarily down.

 - Manage the controller admin and I/O submission queues using local memory. Commands are obtained from submission queues using a work item that constantly polls the doorbells of all submission queues (nvmet_pci_epf_poll_sqs() function). This work is started whenever the controller is enabled (nvmet_pci_epf_enable_ctrl() function) and stopped when the controller is disabled (nvmet_pci_epf_disable_ctrl() function). When new commands are submitted by the host, DMA transfers are used to retrieve the commands.

 - Initiate the execution of all admin and I/O commands using the target core code, by calling a request's execute() function. All commands are individually handled using a per-command work item (nvmet_pci_epf_iod_work() function). A command's overall execution includes: initializing a struct nvmet_req request for the command, using nvmet_req_transfer_len() to get the command data transfer length, parsing the command PRPs or SGLs to get the PCI address segments of the command data buffer, retrieving data from the host (if the command is a write command), calling req->execute() to execute the command, and transferring data to the host (for read commands).

 - Handle the completions of commands as notified by the ->queue_response() operation of the PCI target controller (nvmet_pci_epf_queue_response() function). Completed commands are added to a list of completed commands for their CQ. Each CQ list of completed commands is processed using a work item (nvmet_pci_epf_cq_work() function) which posts entries for the completed commands in the CQ memory and raises an IRQ to the host to signal the completion. IRQ coalescing is supported as mandated by the NVMe base specification for PCI controllers. Of note is that completion entries are transmitted to the host using MMIO, after mapping the completion queue memory to the host PCI address space. Unlike for retrieving commands from SQs, DMA is not used as it degrades performance due to the transfer serialization needed (which delays completion entries transmission).

The configuration of a NVMe PCI endpoint controller is done using configfs. First the NVMe PCI target controller configuration must be done to set up a subsystem and a port with the "pci" addr_trtype attribute. The subsystem can be set up using a file or block device backed namespace or using a passthrough NVMe device. After this, the PCI endpoint can be configured and bound to the PCI endpoint controller to start the NVMe endpoint controller.

In order to not overcomplicate this initial implementation of an endpoint PCI target controller driver, protection information is not supported for now. If the PCI controller port and namespace are configured with protection information support, an error will be returned when the controller is created and initialized when the endpoint function is started. Protection information support will be added in a follow-up patch series.

Using a Rock5B board (Rockchip RK3588 SoC, PCI Gen3x4 endpoint controller) with a target PCI controller set up with 4 I/O queues and a null_blk block device as a namespace, the maximum performance using fio was measured at 131 KIOPS for random 4K reads and up to 2.8 GB/s throughput. Some data points are:

  Rnd read, 4KB, QD=1, 1 job : IOPS=16.9k, BW=66.2MiB/s (69.4MB/s)
  Rnd read, 4KB, QD=32, 1 job : IOPS=78.5k, BW=307MiB/s (322MB/s)
  Rnd read, 4KB, QD=32, 4 jobs: IOPS=131k, BW=511MiB/s (536MB/s)
  Seq read, 512KB, QD=32, 1 job : IOPS=5381, BW=2691MiB/s (2821MB/s)

The NVMe PCI endpoint target driver is not intended for production use. It is a tool for learning NVMe, exploring existing features and testing implementations of new NVMe features.

Co-developed-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Reviewed-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
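The fio data points quoted in this commit relate IOPS, block size and bandwidth by a simple product. The sketch below restates that arithmetic; the helper names are hypothetical and are not part of fio or the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Bandwidth in bytes per second for a given IOPS figure and block size. */
static inline uint64_t bw_bytes_per_sec(uint64_t iops, uint64_t block_bytes)
{
    /* Guard against 64-bit overflow of the product. */
    assert(block_bytes == 0 || iops <= UINT64_MAX / block_bytes);
    return iops * block_bytes;
}

/* Binary megabytes (MiB), the first unit fio prints. */
static inline double bytes_to_mib(uint64_t bytes)
{
    return (double)bytes / (1024.0 * 1024.0);
}

/* Decimal megabytes (MB), the unit fio prints in parentheses. */
static inline double bytes_to_mb(uint64_t bytes)
{
    return (double)bytes / 1e6;
}
```

For the sequential read row, 5381 IOPS at 512 KiB works out to 2690.5 MiB/s, or about 2821 MB/s, which matches the reported figures. The 4 KB rows agree only approximately because fio prints rounded IOPS values such as 16.9k.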
2025-01-10 | nvmet: Implement arbitration feature support | Damien Le Moal
NVMe base specification v2.1 mandates support for the arbitration feature (NVME_FEAT_ARBITRATION). Introduce the data structure struct nvmet_feat_arbitration to define the high, medium and low priority weight fields and the arbitration burst field of this feature and implement the functions nvmet_get_feat_arbitration() and nvmet_set_feat_arbitration() to get and set these fields. Since there is no generic way to implement support for the arbitration feature, these functions respectively use the controller get_feature() and set_feature() operations to process the feature with the help of the controller driver. If the controller driver does not implement these operations and a get feature command or a set feature command for this feature is received, the command is failed with an invalid field error. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
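As an illustration of the four arbitration fields named above, here is a small user-space sketch that packs and unpacks them from command dword 11. The bit positions (burst in bits 2:0, low, medium and high priority weights in bits 15:8, 23:16 and 31:24) follow the NVMe base specification as I understand it; `struct arb_fields` and the helper names are hypothetical and are not the kernel's struct nvmet_feat_arbitration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space mirror of the arbitration feature fields. */
struct arb_fields {
    uint8_t hpw; /* high priority weight */
    uint8_t mpw; /* medium priority weight */
    uint8_t lpw; /* low priority weight */
    uint8_t ab;  /* arbitration burst, 3 bits */
};

/* Unpack the fields from command dword 11. */
static inline struct arb_fields arb_decode(uint32_t cdw11)
{
    struct arb_fields f;

    f.hpw = (cdw11 >> 24) & 0xff;
    f.mpw = (cdw11 >> 16) & 0xff;
    f.lpw = (cdw11 >> 8) & 0xff;
    f.ab = cdw11 & 0x7;
    return f;
}

/* Pack the fields back into a dword; reserved bits 7:3 stay zero. */
static inline uint32_t arb_encode(struct arb_fields f)
{
    assert(f.ab <= 0x7);
    return ((uint32_t)f.hpw << 24) | ((uint32_t)f.mpw << 16) |
           ((uint32_t)f.lpw << 8) | f.ab;
}
```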
2025-01-10 | nvmet: Implement interrupt config feature support | Damien Le Moal
The NVMe base specifications v2.1 mandate supporting the interrupt config feature (NVME_FEAT_IRQ_CONFIG) for PCI controllers. Introduce the data structure struct nvmet_feat_irq_config to define the coalescing disabled (cd) and interrupt vector (iv) fields of this feature and implement the functions nvmet_get_feat_irq_config() and nvmet_set_feat_irq_config() to get and set these fields. These functions respectively use the controller get_feature() and set_feature() operations to fill and handle the fields of struct nvmet_feat_irq_config. Support for this feature is prohibited for fabrics controllers. If a get feature command or a set feature command for this feature is received for a fabrics controller, the command is failed with an invalid field error. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 | nvmet: Implement interrupt coalescing feature support | Damien Le Moal
The NVMe base specifications v2.1 mandate supporting the interrupt coalescing feature (NVME_FEAT_IRQ_COALESCE) for PCI controllers. Introduce the data structure struct nvmet_feat_irq_coalesce to define the time and threshold (thr) fields of this feature and implement the functions nvmet_get_feat_irq_coalesce() and nvmet_set_feat_irq_coalesce() to get and set this feature. These functions respectively use the controller get_feature() and set_feature() operations to fill and handle the fields of struct nvmet_feat_irq_coalesce. While the Linux kernel nvme driver does not use this feature and thus will not complain if it is not implemented, other major OSes fail initializing the NVMe device if this feature support is missing. Support for this feature is prohibited for fabrics controllers. If a get feature or set feature command for this feature is received for a fabrics controller, the command is failed with an invalid field error. Suggested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
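The time and threshold fields of the interrupt coalescing feature can be sketched the same way. This assumes the layout I recall from the NVMe base specification: aggregation threshold in bits 7:0 of dword 11 and aggregation time in bits 15:8, expressed in 100 microsecond increments. The struct and helper names are hypothetical and are not the kernel's struct nvmet_feat_irq_coalesce.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space mirror of the interrupt coalescing fields. */
struct irq_coalesce_fields {
    uint8_t thr;  /* aggregation threshold */
    uint8_t time; /* aggregation time, in units of 100 microseconds */
};

static inline struct irq_coalesce_fields irq_coalesce_decode(uint32_t cdw11)
{
    struct irq_coalesce_fields f;

    f.thr = cdw11 & 0xff;
    f.time = (cdw11 >> 8) & 0xff;
    return f;
}

static inline uint32_t irq_coalesce_encode(struct irq_coalesce_fields f)
{
    return ((uint32_t)f.time << 8) | f.thr;
}

/* Aggregation time converted to microseconds. */
static inline uint32_t irq_coalesce_time_us(struct irq_coalesce_fields f)
{
    return (uint32_t)f.time * 100u;
}

/* Usage example: time = 2 (200 us), threshold = 7. */
static inline void irq_coalesce_example(void)
{
    struct irq_coalesce_fields f = irq_coalesce_decode(0x0207u);

    assert(f.time == 2 && f.thr == 7);
    assert(irq_coalesce_time_us(f) == 200);
    assert(irq_coalesce_encode(f) == 0x0207u);
}
```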
2025-01-10 | nvmet: Implement host identifier set feature support | Damien Le Moal
The NVMe specifications mandate support for the host identifier set_features for controllers that also support reservations. Satisfy this requirement by implementing handling of the NVME_FEAT_HOST_ID feature for the nvme_set_features command. This implementation is for now effective only for PCI target controllers. For other controller types, the set features command is failed with a NVME_SC_CMD_SEQ_ERROR status as before. As noted in the code, 128-bit host identifiers are supported since the NVMe base specifications version 2.1 indicate in section 5.1.25.1.28.1 that "The controller may support a 64-bit Host Identifier...". The RHII (Reservations and Host Identifier Interaction) bit of the controller attribute (ctratt) field of the identify controller data is also set to indicate that a host ID of "0" is supported but that the host ID must be a non-zero value to use reservations. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 | nvmet: Introduce get/set_feature controller operations | Damien Le Moal
The implementation of some features cannot always be done generically by the target core code. Arbitration and IRQ coalescing features are examples of such features: their implementation must be provided (at least partially) by the target controller driver. Introduce the set_feature() and get_feature() controller fabrics operations (in struct nvmet_fabrics_ops) to allow supporting such features. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 | nvmet: Do not require SGL for PCI target controller commands | Damien Le Moal
Support for SGL is optional for the PCI transport. Modify nvmet_req_init() to not require the NVME_CMD_SGL_METABUF command flag to be set if the target controller transport type is NVMF_TRTYPE_PCI. In addition to this, the NVMe base specification v2.1 mandates that all admin commands use PRP, that is, have CDW0.PSDT cleared to 0. Modify nvmet_parse_admin_cmd() to check this. Finally, modify nvmet_check_transfer_len() and nvmet_check_data_len_lte() to return the appropriate error status depending on the command using SGL or PRP. Since nvmet_req_init() always checks that a command uses SGL for fabrics, this change affects only PCI target controllers. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
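The PSDT rules described in this commit can be summarized as a small predicate. This is a sketch under stated assumptions, not the kernel code: the flag constants mirror the usual encoding of PSDT in bits 7:6 of the command flags byte, and `psdt_acceptable()` is a hypothetical helper that reduces the checks to a yes/no answer, ignoring the specific error status the real code returns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed encoding of the PSDT bits in the command flags byte. */
#define SGL_METABUF (1u << 6)
#define SGL_METASEG (1u << 7)
#define SGL_ALL     (SGL_METABUF | SGL_METASEG)

/* Hypothetical predicate: is this flags byte acceptable for the command? */
static inline bool psdt_acceptable(uint8_t flags, bool is_pci, bool is_admin)
{
    if (!is_pci)
        return (flags & SGL_METABUF) != 0; /* fabrics: SGL is required */
    if (is_admin)
        return (flags & SGL_ALL) == 0;     /* PCI admin: PRP only */
    return true;                           /* PCI I/O: PRP or SGL */
}

/* Usage example covering the three rules. */
static inline void psdt_example(void)
{
    assert(psdt_acceptable(0, true, true));            /* PCI admin, PRP */
    assert(!psdt_acceptable(SGL_METABUF, true, true)); /* PCI admin, SGL */
    assert(!psdt_acceptable(0, false, false));         /* fabrics, PRP */
}
```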
2025-01-10 | nvmet: Add support for I/O queue management admin commands | Damien Le Moal
The I/O submission queue management admin commands (nvme_admin_delete_sq, nvme_admin_create_sq, nvme_admin_delete_cq, and nvme_admin_create_cq) are mandatory admin commands for I/O controllers using the PCI transport, that is, support for these commands is mandatory for a PCI target I/O controller. Implement support for these commands by adding the functions nvmet_execute_delete_sq(), nvmet_execute_create_sq(), nvmet_execute_delete_cq() and nvmet_execute_create_cq() to set as the execute method of requests for these commands. These functions will return an invalid opcode error for any controller that is not a PCI target controller. Support for the I/O queue management commands is also reported in the command effect log of PCI target controllers (using nvmet_get_cmd_effects_admin()). Each management command is backed by a controller fabric operation that can be defined by a PCI target controller driver to set up I/O queues using nvmet_sq_create() and nvmet_cq_create() or delete I/O queues using nvmet_sq_destroy(). As noted in a comment in nvmet_execute_create_sq(), we do not yet support sharing a single CQ between multiple SQs. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
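The rule that queue management commands are only valid for PCI target controllers can be sketched as follows. The opcode values are the standard NVMe admin opcodes for these four commands; the status constants and helper names are hypothetical placeholders, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* NVMe admin opcodes for I/O queue management. */
enum {
    OP_DELETE_SQ = 0x00,
    OP_CREATE_SQ = 0x01,
    OP_DELETE_CQ = 0x04,
    OP_CREATE_CQ = 0x05,
};

/* Hypothetical status values used only by this sketch. */
enum { ST_SUCCESS = 0, ST_INVALID_OPCODE = 1 };

static inline bool is_queue_mgmt_opcode(uint8_t opcode)
{
    switch (opcode) {
    case OP_DELETE_SQ:
    case OP_CREATE_SQ:
    case OP_DELETE_CQ:
    case OP_CREATE_CQ:
        return true;
    default:
        return false;
    }
}

/* Queue management is accepted only on PCI target controllers. */
static inline int queue_mgmt_status(uint8_t opcode, bool is_pci_ctrl)
{
    assert(is_queue_mgmt_opcode(opcode));
    return is_pci_ctrl ? ST_SUCCESS : ST_INVALID_OPCODE;
}
```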
2025-01-10 | nvmet: Introduce nvmet_sq_create() and nvmet_cq_create() | Damien Le Moal
Introduce the new functions nvmet_sq_create() and nvmet_cq_create() to allow a target driver to initialize and set up admin and IO queues directly, without needing to execute connect fabrics commands. The helper functions nvmet_check_cqid() and nvmet_check_sqid() are implemented to check the correctness of SQ and CQ IDs when nvmet_sq_create() and nvmet_cq_create() are called. nvmet_sq_create() and nvmet_cq_create() are primarily intended for use with PCI target controller drivers and thus are not well integrated with the current queue creation of fabrics controllers using the connect command. These fabrics drivers are not modified to use these functions. This simple implementation of SQ and CQ management for PCI target controller drivers does not allow multiple SQs to share the same CQ, similarly to other fabrics transports. This is a specification violation. A more involved set of changes will follow to add support for this required completion queue sharing feature. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 | nvmet: Introduce nvmet_req_transfer_len() | Damien Le Moal
Add the new function nvmet_req_transfer_len() to parse a request command to extract the transfer length of the command. This function implementation relies on multiple helper functions for parsing I/O commands (nvmet_io_cmd_transfer_len()), admin commands (nvmet_admin_cmd_data_len()) and fabrics connect commands (nvmet_connect_cmd_data_len()). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 | nvmet: Improve nvmet_alloc_ctrl() interface and implementation | Damien Le Moal
Introduce struct nvmet_alloc_ctrl_args to define the arguments for the function nvmet_alloc_ctrl() to avoid the need for passing a pointer to a struct nvmet_req as an argument. This new data structure aggregates together the arguments that were passed to nvmet_alloc_ctrl() (subsysnqn, hostnqn and kato), together with the struct nvmet_req fields used by nvmet_alloc_ctrl(), that is, the fields port, p2p_client, and ops as input and the result and error_loc fields as output, as well as a status field. nvmet_alloc_ctrl() is also changed to return a pointer to the allocated and initialized controller structure instead of a status code, as the status is now returned through the status field of struct nvmet_alloc_ctrl_args. The function nvmet_setup_p2p_ns_map() is changed to not take a pointer to a struct nvmet_req as argument, instead, directly specify the p2p_client device pointer needed as argument. The code in nvmet_execute_admin_connect() that initializes a new target controller after allocating it is moved into nvmet_alloc_ctrl(). The code that sets up an admin queue for the controller (and the call to nvmet_install_queue()) remains in nvmet_execute_admin_connect(). Finally, nvmet_alloc_ctrl() is also exported to allow target drivers to use this function directly to allocate and initialize a new controller structure without the need to rely on a fabrics connect command request. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
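The interface change described here, an argument structure carrying both inputs and a status output while the function returns the allocated object, is a common C pattern. The following user-space sketch shows the shape of it; every name and status value in it is hypothetical and only loosely modelled on the commit text.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical status codes for this sketch. */
enum { ST_OK = 0, ST_INVALID_PARAM = 1, ST_NO_MEMORY = 2 };

/* Inputs and outputs bundled together, so no request object is needed. */
struct ctrl_args {
    const char *subsysnqn; /* in */
    const char *hostnqn;   /* in */
    uint32_t kato;         /* in */
    uint16_t status;       /* out: why allocation failed, or ST_OK */
};

struct ctrl {
    const char *subsysnqn;
    const char *hostnqn;
    uint32_t kato;
};

/* Returns the new controller, or NULL with args->status set. */
static inline struct ctrl *ctrl_alloc(struct ctrl_args *args)
{
    struct ctrl *c;

    assert(args != NULL);
    if (!args->subsysnqn || !args->hostnqn) {
        args->status = ST_INVALID_PARAM;
        return NULL;
    }
    c = malloc(sizeof(*c));
    if (!c) {
        args->status = ST_NO_MEMORY;
        return NULL;
    }
    c->subsysnqn = args->subsysnqn;
    c->hostnqn = args->hostnqn;
    c->kato = args->kato;
    args->status = ST_OK;
    return c;
}
```

Returning the pointer keeps the success path simple for callers, while the status field preserves the detailed error that a bare NULL would lose.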
2025-01-10 | nvme: Add PCI transport type | Damien Le Moal
Define the transport type NVMF_TRTYPE_PCI for PCI endpoint targets. This transport type is defined using the value 0 which is reserved in the NVMe base specifications v2.1 (Figure 294). Given that struct nvmet_port structures are zeroed out on creation, to avoid having this transport type become the new default, nvmet_referral_make() and nvmet_ports_make() are modified to initialize a port discovery address transport type field (disc_addr.trtype) to NVMF_TRTYPE_MAX. Any port using this transport type is also skipped and not reported in the discovery log page (nvmet_execute_disc_get_log_page()). The helper function nvmet_is_pci_ctrl() is also introduced to check if a target controller uses the PCI transport. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
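The reason for initializing the transport type to a sentinel can be shown with a small model. Because the PCI type is the value 0, a zero-filled port would otherwise look like a PCI port. In the sketch below the enum values other than 0 are illustrative assumptions, and the struct and helpers are hypothetical stand-ins for the real port structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Illustrative transport type values; PCI is 0 as stated in the commit. */
enum {
    TRTYPE_PCI = 0,
    TRTYPE_RDMA = 1,
    TRTYPE_FC = 2,
    TRTYPE_TCP = 3,
    TRTYPE_LOOP = 254,
    TRTYPE_MAX, /* sentinel meaning "not configured yet" */
};

/* Hypothetical port structure for this sketch. */
struct port {
    uint16_t portid;
    uint8_t trtype;
};

/* Zero the port, then overwrite trtype so 0 (PCI) is not the default. */
static inline void port_init(struct port *p, uint16_t portid)
{
    assert(p != NULL);
    memset(p, 0, sizeof(*p));
    p->portid = portid;
    p->trtype = TRTYPE_MAX;
}

static inline bool port_trtype_is_set(const struct port *p)
{
    return p->trtype != TRTYPE_MAX;
}

/* PCI ports are skipped when building the discovery log page. */
static inline bool port_reported_in_discovery(const struct port *p)
{
    return port_trtype_is_set(p) && p->trtype != TRTYPE_PCI;
}
```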
2025-01-10 | nvmet: Add drvdata field to struct nvmet_ctrl | Damien Le Moal
Allow a target driver to attach private data to a target controller by adding the new field drvdata to struct nvmet_ctrl. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 | nvmet: Introduce nvmet_get_cmd_effects_admin() | Damien Le Moal
In order to have a logically better organized implementation of the effects log page, split out reporting the supported admin commands from nvmet_get_cmd_effects_nvm() into the new function nvmet_get_cmd_effects_admin(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 | nvmet: Export nvmet_update_cc() and nvmet_cc_xxx() helpers | Damien Le Moal
Make the function nvmet_update_cc() available to target drivers by exporting it. To also facilitate the manipulation of the cc register bits, move the inline helper functions nvmet_cc_en(), nvmet_cc_css(), nvmet_cc_mps(), nvmet_cc_ams(), nvmet_cc_shn(), nvmet_cc_iosqes(), and nvmet_cc_iocqes() from core.c to nvmet.h so that these functions can be reused in target controller drivers. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
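For readers unfamiliar with the controller configuration (CC) register, the helpers listed above each extract one bit field. The sketch below is a user-space approximation using the field positions from the NVMe base specification as I understand them (EN bit 0, CSS bits 6:4, MPS bits 10:7, AMS bits 13:11, SHN bits 15:14, IOSQES bits 19:16, IOCQES bits 23:20). The `cc_*` names are hypothetical and the code is not copied from nvmet.h.

```c
#include <assert.h>
#include <stdint.h>

/* One accessor per CC register field. */
static inline unsigned cc_en(uint32_t cc)     { return cc & 0x1; }
static inline unsigned cc_css(uint32_t cc)    { return (cc >> 4) & 0x7; }
static inline unsigned cc_mps(uint32_t cc)    { return (cc >> 7) & 0xf; }
static inline unsigned cc_ams(uint32_t cc)    { return (cc >> 11) & 0x7; }
static inline unsigned cc_shn(uint32_t cc)    { return (cc >> 14) & 0x3; }
static inline unsigned cc_iosqes(uint32_t cc) { return (cc >> 16) & 0xf; }
static inline unsigned cc_iocqes(uint32_t cc) { return (cc >> 20) & 0xf; }

/* Memory page size selected by MPS: 2^(12 + MPS) bytes. */
static inline uint32_t cc_page_size(uint32_t cc)
{
    return UINT32_C(1) << (12 + cc_mps(cc));
}

/* Usage example: enabled controller, 64-byte SQ entries (2^6),
 * 16-byte CQ entries (2^4), 4 KiB pages. */
static inline void cc_example(void)
{
    uint32_t cc = 0x00460001u;

    assert(cc_en(cc) == 1);
    assert(cc_iosqes(cc) == 6);
    assert(cc_iocqes(cc) == 4);
    assert(cc_page_size(cc) == 4096);
}
```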
2025-01-10nvmet: Add vendor_id and subsys_vendor_id subsystem attributesDamien Le Moal
Define the new vendor_id and subsys_vendor_id configfs attributes for target subsystems. These attributes are respectively reported as the vid field and as the ssvid field of the identify controller data of the target controllers using the subsystem for which these attributes are set. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10nvme: Move opcode string helper functions declarationsDamien Le Moal
Move the declarations of all helper functions converting NVMe command opcodes and status codes into strings from drivers/nvme/host/nvme.h into include/linux/nvme.h, together with the command definitions. This allows NVMe target drivers to call these functions without having to include a host header file. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10nvme: change return type of nvme_poll_cq() to boolYongsoo Joo
The nvme_poll_cq() function currently returns the number of CQEs found. However, only one caller, nvme_poll(), requires a boolean value to check whether any CQE was completed. The other callers do not use the return value at all. To better reflect its usage, update the return type of nvme_poll_cq() from int to bool. Signed-off-by: Yongsoo Joo <ysjoo@kookmin.ac.kr> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
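The shape of this change can be sketched in userspace as follows. This is only a model under simplifying assumptions: the completion queue is reduced to a counter of ready entries, and all model_* names are hypothetical rather than the driver's own.

```c
#include <stdbool.h>

/* Hypothetical, reduced completion queue: just a count of entries that
 * are ready to be consumed and a count of entries already handled. */
struct model_cq {
	unsigned int ready;
	unsigned int handled;
};

/* Consume every ready entry and report only whether any was found,
 * instead of returning how many were found. */
static bool model_poll_cq(struct model_cq *cq)
{
	bool found = false;

	while (cq->ready > 0) {
		cq->ready--;
		cq->handled++;
		found = true;
	}
	return found;
}

/* The one caller that uses the result only needs "did anything complete". */
static int model_poll(struct model_cq *cq)
{
	return model_poll_cq(cq) ? 1 : 0;
}
```

Callers that ignore the result are unaffected, and the caller that does use it no longer has to reduce a count to a truth value.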
2025-01-10nvme: Add error check for xa_store in nvme_get_effects_logKeisuke Nishimura
The xa_store() may fail due to memory allocation failure because there is no guarantee that the index csi is already used. This fix adds an error check of the return value of xa_store() in nvme_get_effects_log(). Fixes: 1cf7a12e09aa ("nvme: use an xarray to lookup the Commands Supported and Effects log") Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10nvme-tcp: Fix I/O queue cpu spreading for multiple controllersSagi Grimberg
Since day-1 we are assigning the queue io_cpu very naively. We always take the queue id (controller scope) and assign it its matching cpu from the online mask. This works fine when the number of queues matches the number of cpu cores. The problem starts when we have fewer queues than cpu cores. First, we should take into account the mq_map and select a cpu within the cpus that are assigned to this queue by the mq_map in order to minimize cross-NUMA cpu bouncing. Second, even worse is that we don't take into account that multiple controllers may have assigned queues to a given cpu. As a result we may simply compound more and more queues on the same set of cpus, which is suboptimal. We fix this by introducing global per-cpu counters that track the number of queues assigned to each cpu, and we select the least used cpu based on the mq_map and the per-cpu counters, and assign it as the queue io_cpu. The behavior for a single controller is slightly optimized by selecting better cpu candidates by consulting with the mq_map, and multiple controllers are spreading queues among cpu cores much better, resulting in lower average cpu load, and less likelihood to hit hotspots. Note that the accounting is not 100% perfect, but we don't need to be, we're simply putting our best effort to select the best candidate cpu core that we find at any given point. Another byproduct is that every controller reset/reconnect may change the queues io_cpu mapping, based on the current LRU accounting scheme.
Here is the baseline queue io_cpu assignment for 4 controllers, 2 queues per controller, and 4 cpus on the host: nvme1: queue 0: using cpu 0 nvme1: queue 1: using cpu 1 nvme2: queue 0: using cpu 0 nvme2: queue 1: using cpu 1 nvme3: queue 0: using cpu 0 nvme3: queue 1: using cpu 1 nvme4: queue 0: using cpu 0 nvme4: queue 1: using cpu 1 And this is the fixed io_cpu assignment: nvme1: queue 0: using cpu 0 nvme1: queue 1: using cpu 2 nvme2: queue 0: using cpu 1 nvme2: queue 1: using cpu 3 nvme3: queue 0: using cpu 0 nvme3: queue 1: using cpu 2 nvme4: queue 0: using cpu 1 nvme4: queue 1: using cpu 3 Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver") Suggested-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> [fixed kbuild reported errors] Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
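The selection scheme described in this message can be sketched in userspace as below. This is a minimal model, not the driver code: the model_* names are hypothetical, and the cpu to queue map is an assumption (cpus 0-1 mapped to queue 0, cpus 2-3 mapped to queue 1) chosen to match the 4 cpu, 2 queue example.

```c
#include <limits.h>

#define MODEL_NR_CPUS	4

/* Global per-cpu counters: how many queues, across all controllers,
 * currently use each cpu as their io_cpu. */
static unsigned int model_cpu_queues[MODEL_NR_CPUS];

/* Assumed mq_map: index is the cpu, value is the queue it is mapped to. */
static const int model_mq_map[MODEL_NR_CPUS] = { 0, 0, 1, 1 };

/* Pick the least used cpu among those the mq_map assigns to this queue,
 * account for it, and return it. Returns -1 if no cpu maps to the queue. */
static int model_pick_io_cpu(int qid)
{
	unsigned int min_queues = UINT_MAX;
	int io_cpu = -1;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (model_mq_map[cpu] != qid)
			continue;
		/* Strict comparison: ties go to the lowest numbered cpu. */
		if (model_cpu_queues[cpu] < min_queues) {
			min_queues = model_cpu_queues[cpu];
			io_cpu = cpu;
		}
	}
	if (io_cpu >= 0)
		model_cpu_queues[io_cpu]++;
	return io_cpu;
}
```

Calling model_pick_io_cpu() for queue 0 and then queue 1 of four controllers in turn yields cpus 0, 2, 1, 3, 0, 2, 1, 3, the same spread as the fixed assignment listed in the commit message. A real implementation would also need to decrement the counter when a queue is torn down, which this sketch leaves out.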
2025-01-10loop: remove the use_dio field in struct loop_deviceChristoph Hellwig
This field duplicates the LO_FLAGS_DIRECT_IO flag in lo_flags. Remove it to have a single source of truth about using direct I/O. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250110073750.1582447-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-10loop: don't freeze the queue in loop_update_dioChristoph Hellwig
All callers of loop_update_dio except for loop_configure already have the queue frozen, and loop_configure works on an unbound device. Remove the superfluous recursive freezing in loop_update_dio and add asserts for the locking and freezing state instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250110073750.1582447-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-10loop: allow loop_set_status to re-enable direct I/OChristoph Hellwig
Unlike all other calls of (__)loop_update_dio, loop_set_status never looks at the O_DIRECT flag of the backing file, and thus doesn't re-enable direct I/O on an O_DIRECT backing file if e.g. the new block size would allow it. Fix that and remove the need for the separate __loop_update_dio flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250110073750.1582447-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-10loop: open code the direct I/O flag update in loop_set_dioChristoph Hellwig
loop_set_dio is different from the other (__)loop_update_dio callers in that it doesn't take any implicit conditions into account and wants to update the direct I/O flag to the user passed in value and fail if that can't be done. Open code the logic here to prepare for simplifying the other direct I/O flag updates and to make the error handling less convoluted. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250110073750.1582447-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-10loop: only write back pagecache when starting to use direct I/OChristoph Hellwig
There is no point in doing an fdatasync to write out pages when switching away from direct I/O, as there won't be any. The writeback is only needed when switching to direct I/O, which would have to invalidate the pagecache less efficiently from the I/O path. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250110073750.1582447-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
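The rule stated here (write back only on the transition into direct I/O) can be modelled with the short sketch below. It is a userspace illustration with hypothetical model_* names; the writeback is represented by a counter, and the early return for an unchanged mode is an assumption of this sketch rather than something the commit message states.

```c
#include <stdbool.h>

/* Hypothetical, reduced loop device state. */
struct model_loop {
	bool use_dio;			/* currently using direct I/O? */
	unsigned int writebacks;	/* number of writebacks issued */
};

/* Switch the I/O mode, flushing dirty page cache pages only when
 * moving from buffered I/O to direct I/O. */
static void model_set_dio(struct model_loop *lo, bool want_dio)
{
	if (lo->use_dio == want_dio)
		return;	/* assumption: nothing to do if the mode is unchanged */

	/* Buffered I/O may have left dirty pages behind; direct I/O cannot. */
	if (want_dio)
		lo->writebacks++;	/* stands in for the writeback call */

	lo->use_dio = want_dio;
}
```

Switching away from direct I/O never increments the counter, since no dirty page cache pages can exist in that mode.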
2025-01-10loop: create a lo_can_use_dio helperChristoph Hellwig
Factor out a part of __loop_update_dio in preparation for further refactoring. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250110073750.1582447-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-10loop: update commands in loop_set_status still referring to transfersChristoph Hellwig
The concept of transfers is gone since commit 47e9624616c8 ("block: remove support for cryptoloop and the xor transfer"). Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250110073750.1582447-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-10loop: move updating lo_flags out of loop_set_status_from_infoChristoph Hellwig
While loop_configure simply assigns the flags passed in by userspace, loop_set_status only looks at the two changeable flags, and currently has to do a complicated dance to implement that. Move the assignment of lo->lo_flags out of loop_set_status_from_info into the callers and thus drastically simplify the lo_flags handling in loop_set_status. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250110073750.1582447-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-10loop: fix queue freeze vs limits lock orderChristoph Hellwig
Match the locking order used by the core block code by only freezing the queue after taking the limits lock using the queue_limits_commit_update_frozen helper and document the callers that do not freeze the queue at all. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250110054726.1499538-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-10loop: refactor queue limits updatesChristoph Hellwig
Replace loop_reconfigure_limits with a slightly less encompassing loop_update_limits that expects the caller to acquire and commit the queue limits to prepare for sorting out the freeze vs limits lock ordering. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250110054726.1499538-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-10usb-storage: fix queue freeze vs limits lock orderChristoph Hellwig
Match the locking order used by the core block code by only freezing the queue after taking the limits lock using the queue_limits_commit_update_frozen helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250110054726.1499538-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-10nbd: fix queue freeze vs limits lock orderChristoph Hellwig
Match the locking order used by the core block code by only freezing the queue after taking the limits lock using the queue_limits_commit_update_frozen helper. This also removes the need for the separate __nbd_set_size helper, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250110054726.1499538-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-10nvme: fix queue freeze vs limits lock orderChristoph Hellwig
Match the locking order used by the core block code by only freezing the queue after taking the limits lock. Unlike most queue updates this does not use the queue_limits_commit_update_frozen helper as the nvme driver wants the queue frozen for more than just the limits update. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250110054726.1499538-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-10block: fix queue freeze vs limits lock order in sysfs store methodsChristoph Hellwig
queue_attr_store() always freezes a device queue before calling the attribute store operation. For attributes that control queue limits, the store operation will also lock the queue limits with a call to queue_limits_start_update(). However, some drivers (e.g. SCSI sd) may need to issue commands to a device to obtain limit values from the hardware with the queue limits locked. This creates a potential ABBA deadlock situation if a user attempts to modify a limit (thus freezing the device queue) while the device driver starts a revalidation of the device queue limits. Avoid such deadlock by not freezing the queue before calling the ->store_limit() method in struct queue_sysfs_entry and instead use the queue_limits_commit_update_frozen helper to freeze the queue after taking the limits lock. This also removes taking the sysfs lock for the store_limit method as it doesn't protect anything here, but creates even more nesting. Hopefully it will go away from the actual sysfs methods entirely soon. (commit log adapted from a similar patch from Damien Le Moal) Fixes: ff956a3be95b ("block: use queue_limits_commit_update in queue_discard_max_store") Fixes: 0327ca9d53bf ("block: use queue_limits_commit_update in queue_max_sectors_store") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250110054726.1499538-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
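The ordering rule behind this fix (take the limits lock first, freeze the queue second) can be modelled in a single-threaded userspace sketch. All model_* names are hypothetical; the lock and the freeze state are plain fields, and the asserts only document the intended order rather than provide real synchronization.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced request queue state. */
struct model_queue {
	bool limits_locked;		/* limits lock held? */
	int freeze_depth;		/* > 0 while the queue is frozen */
	unsigned int max_sectors;	/* one example limit */
};

/* Take the limits lock. Doing so with the queue already frozen is the
 * inverted order that can deadlock against a driver revalidation. */
static void model_limits_start_update(struct model_queue *q)
{
	assert(q->freeze_depth == 0);
	assert(!q->limits_locked);
	q->limits_locked = true;
}

static void model_freeze(struct model_queue *q)
{
	q->freeze_depth++;
}

static void model_unfreeze(struct model_queue *q)
{
	assert(q->freeze_depth > 0);
	q->freeze_depth--;
}

/* Apply the new limit and drop the limits lock. */
static void model_limits_commit_update(struct model_queue *q,
				       unsigned int max_sectors)
{
	assert(q->limits_locked);
	q->max_sectors = max_sectors;
	q->limits_locked = false;
}

/* Freeze only around the commit, i.e. after the limits lock is held. */
static void model_limits_commit_update_frozen(struct model_queue *q,
					      unsigned int max_sectors)
{
	model_freeze(q);
	model_limits_commit_update(q, max_sectors);
	model_unfreeze(q);
}
```

A store operation in this model calls model_limits_start_update(), computes the new value, and finishes with model_limits_commit_update_frozen(), so both paths acquire the two resources in the same order and the ABBA pattern cannot arise.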
2025-01-10block: add a store_limit operations for sysfs entriesChristoph Hellwig
De-duplicate the code for updating queue limits by adding a store_limit method that allows having common code handle the actual queue limits update. Note that this is a pure refactoring patch and does not address the existing freeze vs limits lock order problem in the refactored code, which will be addressed next. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20250110054726.1499538-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>