summaryrefslogtreecommitdiff
path: root/drivers/nvme/host/core.c
AgeCommit message (Collapse)Author
2024-02-13block: pass a queue_limits argument to blk_mq_init_queueChristoph Hellwig
Pass a queue_limits to blk_mq_init_queue and apply it if non-NULL. This will allow allocating queues with valid queue limits instead of setting the values one at a time later. Also rename the function to blk_mq_alloc_queue as that is a much better name for a function that allocates a queue and always pass the queuedata argument instead of having a separate version for the extra argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240213073425.1621680-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-12nvme: allow integrity when PI is not in first bytesKanchan Joshi
NVM command set 1.0 (or later) mandates PI to be in the last bytes of metadata. But this was not supported in the block-layer, and driver registered a nop profile. Since block-integrity can now handle flexible PI offset, change the driver to support this configuration. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240201130126.211402-4-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-07nvme-core: fix comment to reflect right functionsChaitanya Kulkarni
The functions and the attribute listed in the comment doesn't exists in the code, (ns->logging_enabled, nvme_passthru_err_log_enabled_store() and nvme_passthru_err_log_enabled_show()) Update the comment with right function names and a comment ns->head->passthru_err_log_enabled, nvme_io_passthru_err_log_enabled_store() and nvme_io_passthru_err_log_enabled_show(). Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Alan Adamson <alan.adamson@oracle.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-07nvme: move passthrough logging attribute to headKeith Busch
The namespace does not have attributes, but the head does. Move the new logging attribute to that structure instead of dereferencing the wrong type. And while we're here, fix the reverse-tree coding style. Fixes: 9f079dda14339e ("nvme: allow passthru cmd error logging") Reported-by: Tasmiya Nalatwad <tasmiya@linux.vnet.ibm.com> Tested-by: Tasmiya Nalatwad <tasmiya@linux.vnet.ibm.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Alan Adamson <alan.adamson@oracle.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01nvme-host: fix the updating of the firmware versionMaurizio Lombardi
The original code didn't update the firmware version if the "next slot" of the AFI register isn't zero or if the "current slot" field is zero; in those cases it assumed that a reset was needed. However, the NVMe specification doesn't exclude the possibility that the "next slot" value is equal to the "current slot" value, meaning that the same firmware slot will be activated after performing a controller level reset; in this case a reset is clearly not necessary and we can safely update the firmware version. Modify the code so the kernel will report that a Controller Level Reset is needed only in the following cases: 1) If the "current slot" field is zero. This is invalid and means that something is wrong, a reset is needed. or 2) if the "next slot" field isn't zero AND it's not equal to the "current slot" value. This means that at the next reset a different firmware slot will be activated. Fixes: 983a338b96c8 ("nvme: update firmware version after commit") Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01nvme: allow passthru cmd error loggingAlan Adamson
Commit d7ac8dca938c ("nvme: quiet user passthrough command errors") disabled error logging for user passthrough commands. This commit adds the ability to opt-in to passthrough admin error logging. IO commands initiated as passthrough will always be logged. The logging output for passthrough commands (Admin and IO) has been changed to include CDWXX fields. nvme0n1: Read(0x2), LBA Out of Range (sct 0x0 / sc 0x80) DNR cdw10=0x0 cdw11=0x1 cdw12=0x70000 cdw13=0x0 cdw14=0x0 cdw15=0x0 Add a helper function nvme_log_err_passthru() which allows us to log error for passthru commands by decoding cdw10-cdw15 values of nvme command. Add a new sysfs attr passthru_err_log_enabled that allows user to conditionally enable passthrough command logging for either passthrough Admin commands sent to the controller or passthrough IO commands sent to a namespace. By default, passthrough error logging is disabled. To enable passthrough admin error logging: echo 1 > /sys/class/nvme/nvme0/passthru_err_log_enabled To disable passthrough admin error logging: echo 0 > /sys/class/nvme/nvme0/passthru_err_log_enabled To enable passthrough io error logging: echo 1 > /sys/class/nvme/nvme0/nvme0n1/passthru_err_log_enabled To disable passthrough io error logging: echo 0 > /sys/class/nvme/nvme0/nvme0n1/passthru_err_log_enabled Signed-off-by: Alan Adamson <alan.adamson@oracle.com> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-31nvme: enable retries for authentication commandsHannes Reinecke
Authentication commands might trigger a lengthy computation on the controller or even a callout to an external entity. In these cases the controller might return a status without the DNR bit set, indicating that the command should be retried. This patch enables retries for authentication commands by setting NVME_SUBMIT_RETRY for __nvme_submit_sync_cmd(). Reported-by: Martin George <marting@netapp.com> Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-31nvme: change __nvme_submit_sync_cmd() calling conventionsHannes Reinecke
Combine the two arguments 'flags' and 'at_head' from __nvme_submit_sync_cmd() into a single 'flags' argument and use function-specific values to indicate what should be set within the function. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-29nvme: use ctrl state accessorKeith Busch
The ctrl->state value is updated in another thread using WRITE_ONCE, so ensure all the readers use the appropriate accessor. Reviewed-by: Sagi Grimberg <sagi@grmberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-24nvme: add module description to stop warningsChaitanya Kulkarni
Add MODULE_DESCRIPTION() in order to remove warnings & get clean build:- WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-core.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-fabrics.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-rdma.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-fc.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-tcp.o Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-18Merge tag 'for-6.8/block-2024-01-18' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - tcp, fc, and rdma target fixes (Maurizio, Daniel, Hannes, Christoph) - discard fixes and improvements (Christoph) - timeout debug improvements (Keith, Max) - various cleanups (Daniel, Max, Giuxen) - trace event string fixes (Arnd) - shadow doorbell setup on reset fix (William) - a write zeroes quirk for SK Hynix (Jim) - MD pull request via Song: - Sparse warning since v6.0 (Bart) - /proc/mdstat regression since v6.7 (Yu Kuai) - Use symbolic error value (Christian) - IO Priority documentation update (Christian) - Fix for accessing queue limits without having entered the queue (Christoph, me) - Fix for loop dio support (Christoph) - Move null_blk off deprecated ida interface (Christophe) - Ensure nbd initializes full msghdr (Eric) - Fix for a regression with the folio conversion, which is now easier to hit because of an unrelated change (Matthew) - Remove redundant check in virtio-blk (Li) - Fix for a potential hang in sbitmap (Ming) - Fix for partial zone appending (Damien) - Misc changes and fixes (Bart, me, Kemeng, Dmitry) * tag 'for-6.8/block-2024-01-18' of git://git.kernel.dk/linux: (45 commits) Documentation: block: ioprio: Update schedulers loop: fix the the direct I/O support check when used on top of block devices blk-mq: Remove the hctx 'run' debugfs attribute nbd: always initialize struct msghdr completely block: Fix iterating over an empty bio with bio_for_each_folio_all block: bio-integrity: fix kcalloc() arguments order virtio_blk: remove duplicate check if queue is broken in virtblk_done sbitmap: remove stale comment in sbq_calc_wake_batch block: Correct a documentation comment in blk-cgroup.c null_blk: Remove usage of the deprecated ida_simple_xx() API block: ensure we hold a queue reference when using queue limits blk-mq: rename blk_mq_can_use_cached_rq block: print symbolic error name instead of error code blk-mq: fix IO hang from sbitmap wakeup race nvmet-rdma: avoid circular locking dependency on install_queue() nvmet-tcp: avoid circular locking dependency on install_queue() nvme-pci: set doorbell config before unquiescing block: fix partial zone append completion handling in req_bio_endio() block/iocost: silence warning on 'last_period' potentially being unused md/raid1: Use blk_opf_t for read and write operations ...
2024-01-11Merge tag 'for-6.8/block-2024-01-08' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: "Pretty quiet round this time around. This contains: - NVMe updates via Keith: - nvme fabrics spec updates (Guixin, Max) - nvme target udpates (Guixin, Evan) - nvme attribute refactoring (Daniel) - nvme-fc numa fix (Keith) - MD updates via Song: - Fix/Cleanup RCU usage from conf->disks[i].rdev (Yu Kuai) - Fix raid5 hang issue (Junxiao Bi) - Add Yu Kuai as Reviewer of the md subsystem - Remove deprecated flavors (Song Liu) - raid1 read error check support (Li Nan) - Better handle events off-by-1 case (Alex Lyakas) - Efficiency improvements for passthrough (Kundan) - Support for mapping integrity data directly (Keith) - Zoned write fix (Damien) - rnbd fixes (Kees, Santosh, Supriti) - Default to a sane discard size granularity (Christoph) - Make the default max transfer size naming less confusing (Christoph) - Remove support for deprecated host aware zoned model (Christoph) - Misc fixes (me, Li, Matthew, Min, Ming, Randy, liyouhong, Daniel, Bart, Christoph)" * tag 'for-6.8/block-2024-01-08' of git://git.kernel.dk/linux: (78 commits) block: Treat sequential write preferred zone type as invalid block: remove disk_clear_zoned sd: remove the !ZBC && blk_queue_is_zoned case in sd_read_block_characteristics drivers/block/xen-blkback/common.h: Fix spelling typo in comment blk-cgroup: fix rcu lockdep warning in blkg_lookup() blk-cgroup: don't use removal safe list iterators block: floor the discard granularity to the physical block size mtd_blkdevs: use the default discard granularity bcache: use the default discard granularity zram: use the default discard granularity null_blk: use the default discard granularity nbd: use the default discard granularity ubd: use the default discard granularity block: default the discard granularity to sector size bcache: discard_granularity should not be smaller than a sector block: remove two comments in bio_split_discard block: rename and document BLK_DEF_MAX_SECTORS loop: don't abuse BLK_DEF_MAX_SECTORS aoe: don't abuse BLK_DEF_MAX_SECTORS null_blk: don't cap max_hw_sectors to BLK_DEF_MAX_SECTORS ...
2024-01-03nvme: simplify the max_discard_segments calculationChristoph Hellwig
Just stash away the DMRL value in the nvme_ctrl struture, and leave all interpretation to nvme_config_discard, where we know DSM is supported by the time we're configuring the number of segments. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-03nvme: fix max_discard_sectors calculationChristoph Hellwig
ctrl->max_discard_sectors stores a value that is potentially based of the DMRSL field in Identify Controller, which is in units of LBAs and thus dependent on the Format of a namespace. Fix this by moving the calculation of max_discard_sectors entirely into nvme_config_discard and replacing the ctrl->max_discard_sectors value with a local variable so that the calculation is always namespace-specific. Fixes: 1a86924e4f46 ("nvme: fix interpretation of DMRSL") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-03nvme: also skip discard granularity updates in nvme_config_discardChristoph Hellwig
Don't just skip the discard sectors and segments but also the granularity if a value was already set before. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-03nvme: update the explanation for not updating the limits in nvme_config_discardChristoph Hellwig
Expeand the comment a bit to explain what is going on. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-21nvme-fabrics: don't check discovery ioccsz/iorcszMax Gurtovoy
IOCCSZ and IORCSZ are reserved for discovery controllers. Avoid checking their values during identify controller phase. Fixes: 2fcd3ab39826 ("nvme-fabrics: check ioccsz and iorcsz") Reported-by: Daniel Wagner <dwagner@suse.de> Tested-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19nvme-pci: fix sleeping function called from interrupt contextMaurizio Lombardi
the nvme_handle_cqe() interrupt handler calls nvme_complete_async_event() but the latter may call nvme_auth_stop() which is a blocking function. Sleeping functions can't be called in interrupt context BUG: sleeping function called from invalid context in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/15 Call Trace: <IRQ> __cancel_work_timer+0x31e/0x460 ? nvme_change_ctrl_state+0xcf/0x3c0 [nvme_core] ? nvme_change_ctrl_state+0xcf/0x3c0 [nvme_core] nvme_complete_async_event+0x365/0x480 [nvme_core] nvme_poll_cq+0x262/0xe50 [nvme] Fix the bug by moving nvme_auth_stop() to fw_act_work (executed by the nvme_wq workqueue) Fixes: f50fff73d620 ("nvme: implement In-Band authentication") Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19nvme: add csi, ms and nuse to sysfsDaniel Wagner
libnvme is using the sysfs for enumarating the nvme resources. Though there are few missing attritbutes in the sysfs. For these libnvme issues commands during discovering. As the kernel already knows all these attributes and we would like to avoid libnvme to issue commands all the time, expose these missing attributes. The nuse value is updated on request because the nuse is a volatile value. Since any user can read the sysfs attribute, a very simple rate limit is added (update once every 5 seconds). A more sophisticated update strategy can be added later if there is actually a need for it. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19nvme: rename ns attribute groupDaniel Wagner
Drop the 'id' part of the attribute group name because we want to expose non 'id' related attributes via the ns attribute group. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19nvme: refactor ns info setup functionDaniel Wagner
Use nvme_ns_head instead of nvme_ns where possible. This reduces the coupling between the different data structures. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19nvme: refactor ns info helpersDaniel Wagner
Pass in the nvme_ns_head pointer directly. This reduces the necessity on the caller side have the nvme_ns data structure present. Thus we can refactor the caller side in the next step as well. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19nvme: move ns id info to struct nvme_ns_headDaniel Wagner
Move the namesapce info to struct nvme_ns_head, because it's the same for all associated namespaces. Note: with multipathing enabled the PI information is shared between all paths. If a path is using a different PI configuration it will overwrite the previous settings. This is obviously not correct and such configuration will be rejected in future. For the time being we expect a correctly configured storage. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-06nvme-fabrics: check ioccsz and iorcszGuixin Liu
Make sure that ioccsz and iorcsz returned by target are correct before use it. Per 2.0a base NVMe spec: I/O Queue Command Capsule Supported Size (IOCCSZ): This field defines the maximum I/O command capsule size in 16 byte units. The minimum value that shall be indicated is 4 corresponding to 64 bytes. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-06nvme: introduce nvme_check_ctrl_fabric_info helperGuixin Liu
Inroduce nvme_check_ctrl_fabric_info helper to check fabric controller info returned by target. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-04nvme: fix deadlock between reset and scanBitao Hu
If controller reset occurs when allocating namespace, both nvme_reset_work and nvme_scan_work will hang, as shown below. Test Scripts: for ((t=1;t<=128;t++)) do nsid=`nvme create-ns /dev/nvme1 -s 14537724 -c 14537724 -f 0 -m 0 \ -d 0 | awk -F: '{print($NF);}'` nvme attach-ns /dev/nvme1 -n $nsid -c 0 done nvme reset /dev/nvme1 We will find that both nvme_reset_work and nvme_scan_work hung: INFO: task kworker/u249:4:17848 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u249:4 state:D stack: 0 pid:17848 ppid: 2 flags:0x00000028 Workqueue: nvme-reset-wq nvme_reset_work [nvme] Call trace: __switch_to+0xb4/0xfc __schedule+0x22c/0x670 schedule+0x4c/0xd0 blk_mq_freeze_queue_wait+0x84/0xc0 nvme_wait_freeze+0x40/0x64 [nvme_core] nvme_reset_work+0x1c0/0x5cc [nvme] process_one_work+0x1d8/0x4b0 worker_thread+0x230/0x440 kthread+0x114/0x120 INFO: task kworker/u249:3:22404 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u249:3 state:D stack: 0 pid:22404 ppid: 2 flags:0x00000028 Workqueue: nvme-wq nvme_scan_work [nvme_core] Call trace: __switch_to+0xb4/0xfc __schedule+0x22c/0x670 schedule+0x4c/0xd0 rwsem_down_write_slowpath+0x32c/0x98c down_write+0x70/0x80 nvme_alloc_ns+0x1ac/0x38c [nvme_core] nvme_validate_or_alloc_ns+0xbc/0x150 [nvme_core] nvme_scan_ns_list+0xe8/0x2e4 [nvme_core] nvme_scan_work+0x60/0x500 [nvme_core] process_one_work+0x1d8/0x4b0 worker_thread+0x260/0x440 kthread+0x114/0x120 INFO: task nvme:28428 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:nvme state:D stack: 0 pid:28428 ppid: 27119 flags:0x00000000 Call trace: __switch_to+0xb4/0xfc __schedule+0x22c/0x670 schedule+0x4c/0xd0 schedule_timeout+0x160/0x194 do_wait_for_common+0xac/0x1d0 __wait_for_common+0x78/0x100 wait_for_completion+0x24/0x30 __flush_work.isra.0+0x74/0x90 flush_work+0x14/0x20 nvme_reset_ctrl_sync+0x50/0x74 [nvme_core] nvme_dev_ioctl+0x1b0/0x250 [nvme_core] __arm64_sys_ioctl+0xa8/0xf0 el0_svc_common+0x88/0x234 do_el0_svc+0x7c/0x90 el0_svc+0x1c/0x30 el0_sync_handler+0xa8/0xb0 el0_sync+0x148/0x180 The reason for the hang is that nvme_reset_work occurs while nvme_scan_work is still running. nvme_scan_work may add new ns into ctrl->namespaces list after nvme_reset_work frozen all ns->q in ctrl->namespaces list. The newly added ns is not frozen, so nvme_wait_freeze will wait forever. Unfortunately, ctrl->namespaces_rwsem is held by nvme_reset_work, so nvme_scan_work will also wait forever. Now we are deadlocked! PROCESS1 PROCESS2 ============== ============== nvme_scan_work ... nvme_reset_work nvme_validate_or_alloc_ns nvme_dev_disable nvme_alloc_ns nvme_start_freeze down_write ... nvme_ns_add_to_ctrl_list ... up_write nvme_wait_freeze ... down_read nvme_alloc_ns blk_mq_freeze_queue_wait down_write Fix by marking the ctrl with say NVME_CTRL_FROZEN flag set in nvme_start_freeze and cleared in nvme_unfreeze. Then the scan can check it before adding the new namespace (under the namespaces_rwsem). Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com> Reviewed-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-04nvme: ensure reset state check orderingKeith Busch
A different CPU may be setting the ctrl->state value, so ensure proper barriers to prevent optimizing to a stale state. Normally it isn't a problem to observe the wrong state as it is merely advisory to take a quicker path during initialization and error recovery, but seeing an old state can report unexpected ENETRESET errors when a reset request was in fact successful. Reported-by: Minh Hoang <mh2022@meta.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Hannes Reinecke <hare@suse.de>
2023-12-01nvme-core: check for too small lba shiftKeith Busch
The block layer doesn't support logical block sizes smaller than 512 bytes. The nvme spec doesn't support that small either, but the driver isn't checking to make sure the device responded with usable data. Failing to catch this will result in a kernel bug, either from a division by zero when stacking, or a zero length bio. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-27nvme: check for valid nvme_identify_ns() before using itEwan D. Milne
When scanning namespaces, it is possible to get valid data from the first call to nvme_identify_ns() in nvme_alloc_ns(), but not from the second call in nvme_update_ns_info_block(). In particular, if the NSID becomes inactive between the two commands, a storage device may return a buffer filled with zero as per 4.1.5.1. In this case, we can get a kernel crash due to a divide-by-zero in blk_stack_limits() because ns->lba_shift will be set to zero. PID: 326 TASK: ffff95fec3cd8000 CPU: 29 COMMAND: "kworker/u98:10" #0 [ffffad8f8702f9e0] machine_kexec at ffffffff91c76ec7 #1 [ffffad8f8702fa38] __crash_kexec at ffffffff91dea4fa #2 [ffffad8f8702faf8] crash_kexec at ffffffff91deb788 #3 [ffffad8f8702fb00] oops_end at ffffffff91c2e4bb #4 [ffffad8f8702fb20] do_trap at ffffffff91c2a4ce #5 [ffffad8f8702fb70] do_error_trap at ffffffff91c2a595 #6 [ffffad8f8702fbb0] exc_divide_error at ffffffff928506e6 #7 [ffffad8f8702fbd0] asm_exc_divide_error at ffffffff92a00926 [exception RIP: blk_stack_limits+434] RIP: ffffffff92191872 RSP: ffffad8f8702fc80 RFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff95efa0c91800 RCX: 0000000000000001 RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000001 RBP: 00000000ffffffff R8: ffff95fec7df35a8 R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: ffff95fed33c09a8 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #8 [ffffad8f8702fce0] nvme_update_ns_info_block at ffffffffc06d3533 [nvme_core] #9 [ffffad8f8702fd18] nvme_scan_ns at ffffffffc06d6fa7 [nvme_core] This happened when the check for valid data was moved out of nvme_identify_ns() into one of the callers. Fix this by checking in both callers. Link: https://bugzilla.kernel.org/show_bug.cgi?id=218186 Fixes: 0dd6fff2aad4 ("nvme: bring back auto-removal of deleted namespaces during sequential scan") Cc: stable@vger.kernel.org Signed-off-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-27nvme-core: fix a memory leak in nvme_ns_info_from_identify()Maurizio Lombardi
In case of error, free the nvme_id_ns structure that was allocated by nvme_identify_ns(). Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-27nvme: fine-tune sending of first keep-aliveMark O'Donovan
Keep-alive commands are sent half-way through the kato period. This normally works well but fails when the keep-alive system is started when we are more than half way through the kato. This can happen on larger setups or due to host delays. With this change we now time the initial keep-alive command from the controller initialisation time, rather than the keep-alive mechanism activation time. Signed-off-by: Mark O'Donovan <shiftee@posteo.net> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-22nvme: move nvme_stop_keep_alive() back to original positionHannes Reinecke
Stopping keep-alive not only stops the keep-alive workqueue, but also needs to be synchronized with I/O termination as we must not send a keep-alive command after all I/O had been terminated. So to avoid any regressions move the call to stop_keep_alive() back to its original position and ensure that keep-alive is correctly stopped failing to setup the admin queue. Fixes: 4733b65d82bd ("nvme: start keep-alive after admin queue setup") Suggested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-20nvme: catch errors from nvme_configure_metadata()Hannes Reinecke
nvme_configure_metadata() is issuing I/O, so we might incur an I/O error which will cause the connection to be reset. But in that case any further probing will race with reset and cause UAF errors. So return a status from nvme_configure_metadata() and abort probing if there was an I/O error. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-08nvme: keyring: fix conditional compilationHannes Reinecke
The keyring and auth functions can be called from both the host and the target side and are controlled by Kconfig options for each of the combinations, but the declarations are controlled by #ifdef checks on the shared Kconfig symbols. This leads to link failures in combinations where one of the frontends is built-in and the other one is a module, and the keyring code ends up in a module that is not reachable from the builtin code: ld: drivers/nvme/host/core.o: in function `nvme_core_exit': core.c:(.exit.text+0x4): undefined reference to `nvme_keyring_exit' ld: drivers/nvme/host/core.o: in function `nvme_core_init': core.c:(.init.text+0x94): undefined reference to `nvme_keyring_init ld: drivers/nvme/host/tcp.o: in function `nvme_tcp_setup_ctrl': tcp.c:(.text+0x4c18): undefined reference to `nvme_tls_psk_default' Address this by moving nvme_keyring_init()/nvme_keyring_exit() into module init/exit functions for the keyring module. Fixes: be8e82caa6859 ("nvme-tcp: enable TLS handshake upcall") Signed-off-by: Hannes Reinecke <hare@suse.de> Cc: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-06nvme: start keep-alive after admin queue setupHannes Reinecke
Setting up I/O queues might take quite some time on larger and/or busy setups, so KATO might expire before all I/O queues could be set up. Fix this by start keep alive from the ->init_ctrl_finish() callback, and stopping it when calling nvme_cancel_admin_tagset(). Signed-off-by: Hannes Reinecke <hare@suse.de> Tested-by: Mark O'Donovan <shiftee@posteo.net> [fixed nvme-fc compile error] Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-06nvme: update firmware version after commitDaniel Wagner
The firmware version sysfs entry needs to be updated after a successfully firmware activation. nvme-cli stopped issuing an Identify Controller command to list the current firmware information and relies on sysfs showing the current firmware version. Reported-by: Kenji Tomonaga <tkenbo@gmail.com> Signed-off-by: Daniel Wagner <dwagner@suse.de> Tested-by: Kenji Tomonaga <tkenbo@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com> [fixed off-by one afi index] Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-12nvme: rework NVME_AUTH Kconfig selectionHannes Reinecke
Having a single Kconfig symbol NVME_AUTH conflates the selection of the authentication functions from nvme/common and nvme/host, causing kbuild robot to complain when building the nvme target only. So introduce a Kconfig symbol NVME_HOST_AUTH for the nvme host bits and use NVME_AUTH for the common functions only. And move the CRYPTO selection into nvme/common to make it easier to read. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202310120733.TlPOVeJm-lkp@intel.com/ Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-11nvme-tcp: enable TLS handshake upcallHannes Reinecke
Add a fabrics option 'tls' and start the TLS handshake upcall with the default PSK. When TLS is started the PSK key serial number is displayed in the sysfs attribute 'tls_key' Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-11nvme-keyring: register '.nvme' keyringHannes Reinecke
Register a '.nvme' keyring to hold keys for TLS and DH-HMAC-CHAP and add a new config option NVME_KEYRING. We need a separate keyring for NVMe as the configuration is done via individual commands (eg for configfs), and the usual per-session or per-process keyrings can't be used. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-09-14Merge tag 'nvme-6.6-2023-09-14' of git://git.infradead.org/nvme into block-6.6Jens Axboe
Pull NVMe fixes from Keith: "nvme fixes for Linux 6.6 - nvme-tcp iov len fix (Varun) - nvme-hwmon const qualifier for safety (Krzysztof) - nvme-fc null pointer checks (Nigel) - nvme-pci no numa node fix (Pratyush) - nvme timeout fix for non-compliant controllers (Keith)" * tag 'nvme-6.6-2023-09-14' of git://git.infradead.org/nvme: nvme: avoid bogus CRTO values nvme-pci: do not set the NUMA node of device if it has none nvme-fc: Prevent null pointer dereference in nvme_fc_io_getuuid() nvme: host: hwmon: constify pointers to hwmon_channel_info nvmet-tcp: pass iov_len instead of sg->length to bvec_set_page()
2023-09-14nvme: avoid bogus CRTO valuesKeith Busch
Some devices are reporting controller ready mode support, but return 0 for CRTO. These devices require a much higher time to ready than that, so they are failing to initialize after the driver starter preferring that value over CAP.TO. The spec requires that CAP.TO match the appropritate CRTO value, or be set to 0xff if CRTO is larger than that. This means that CAP.TO can be used to validate if CRTO is reliable, and provides an appropriate fallback for setting the timeout value if not. Use whichever is larger. Link: https://bugzilla.kernel.org/show_bug.cgi?id=217863 Reported-by: Cláudio Sampaio <patola@gmail.com> Reported-by: Felix Yan <felixonmars@archlinux.org> Tested-by: Felix Yan <felixonmars@archlinux.org> Based-on-a-patch-by: Felix Yan <felixonmars@archlinux.org> Cc: stable@vger.kernel.org Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-21nvme: fix possible hang when removing a controller during error recoveryMing Lei
Error recovery can be interrupted by controller removal, then the controller is left as quiesced, and IO hang can be caused. Fix the issue by unquiescing controller unconditionally when removing namespaces. This way is reasonable and safe given forward progress can be made when removing namespaces. Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reported-by: Chunguang Xu <brookxu.cn@gmail.com> Closes: https://lore.kernel.org/linux-nvme/cover.1685350577.git.chunguang.xu@shopee.com/ Cc: stable@vger.kernel.org Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-13nvme: don't reject probe due to duplicate IDs for single-ported PCIe devicesChristoph Hellwig
While duplicate IDs are still very harmful, including the potential to easily see changing devices in /dev/disk/by-id, it turn out they are extremely common for cheap end user NVMe devices. Relax our check for them for so that it doesn't reject the probe on single-ported PCIe devices, but prints a big warning instead. In doubt we'd still like to see quirk entries to disable the potential for changing supposed stable device identifier links, but this will at least allow users how have two (or more) of these devices to use them without having to manually add a new PCI ID entry with the quirk through sysfs or by patching the kernel. Fixes: 2079f41ec6ff ("nvme: check that EUI/GUID/UUID are globally unique") Cc: stable@vger.kernel.org # 6.0+ Co-developed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-03Merge tag 'block-6.5-2023-07-03' of git://git.kernel.dk/linuxLinus Torvalds
Pull more block updates from Jens Axboe: "Mostly items that came in a bit late for the initial pull request, wanted to make sure they had the appropriate amount of linux-next soak before going upstream. Outside of stragglers, just generic fixes for either merge window items, or longer standing bugs" * tag 'block-6.5-2023-07-03' of git://git.kernel.dk/linux: (25 commits) md/raid0: add discard support for the 'original' layout nvme: disable controller on reset state failure nvme: sync timeout work on failed reset nvme: ensure unquiesce on teardown cdrom/gdrom: Fix build error nvme: improved uring polling block: add request polling helper nvme-mpath: fix I/O failure with EAGAIN when failing over I/O nvme: host: fix command name spelling blk-sysfs: add a new attr_group for blk_mq blk-iocost: move wbt_enable/disable_default() out of spinlock blk-wbt: cleanup rwb_enabled() and wbt_disabled() blk-wbt: remove dead code to handle wbt enable/disable with io inflight blk-wbt: don't create wbt sysfs entry if CONFIG_BLK_WBT is disabled blk-mq: fix two misuses on RQF_USE_SCHED blk-throttle: Fix io statistics for cgroup v1 bcache: Fix bcache device claiming bcache: Alloc holder object before async registration raid10: avoid spin_lock from fastpath from raid10_unplug() md: fix 'delete_mutex' deadlock ...
2023-06-30Merge tag 'nvme-6.5-2023-06-30' of git://git.infradead.org/nvme into block-6.5Jens Axboe
Pull NVMe fixes from Keith: "nvme fixes for Linux 6.5 - Reduce spamming kernel logs on repeated controller updates (Breno) - Improved struct packing (Christophe JAILLET) - Misspelled command name in error logging (Damien) - Failover fix for temporary frozen queue (Sagi) - Reset error handling fixes (Keith)" * tag 'nvme-6.5-2023-06-30' of git://git.infradead.org/nvme: nvme: disable controller on reset state failure nvme: sync timeout work on failed reset nvme: ensure unquiesce on teardown nvme-mpath: fix I/O failure with EAGAIN when failing over I/O nvme: host: fix command name spelling nvmet: Reorder fields in 'struct nvmet_ns' nvme: Print capabilities changes just once
2023-06-30Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds
Pull SCSI updates from James Bottomley: "Updates to the usual drivers (ufs, pm80xx, libata-scsi, smartpqi, lpfc, qla2xxx). We have a couple of major core changes impacting other systems: - Command Duration Limits, which spills into block and ATA - block level Persistent Reservation Operations, which touches block, nvme, target and dm Both of these are added with merge commits containing a cover letter explaining what's going on" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (187 commits) scsi: core: Improve warning message in scsi_device_block() scsi: core: Replace scsi_target_block() with scsi_block_targets() scsi: core: Don't wait for quiesce in scsi_device_block() scsi: core: Don't wait for quiesce in scsi_stop_queue() scsi: core: Merge scsi_internal_device_block() and device_block() scsi: sg: Increase number of devices scsi: bsg: Increase number of devices scsi: qla2xxx: Remove unused nvme_ls_waitq wait queue scsi: ufs: ufs-pci: Add support for Intel Arrow Lake scsi: sd: sd_zbc: Use PAGE_SECTORS_SHIFT scsi: ufs: wb: Add explicit flush_threshold sysfs attribute scsi: ufs: ufs-qcom: Switch to the new ICE API scsi: ufs: dt-bindings: qcom: Add ICE phandle scsi: ufs: ufs-mediatek: Set UFSHCD_QUIRK_MCQ_BROKEN_RTC quirk scsi: ufs: ufs-mediatek: Set UFSHCD_QUIRK_MCQ_BROKEN_INTR quirk scsi: ufs: core: Add host quirk UFSHCD_QUIRK_MCQ_BROKEN_RTC scsi: ufs: core: Add host quirk UFSHCD_QUIRK_MCQ_BROKEN_INTR scsi: ufs: core: Remove dedicated hwq for dev command scsi: ufs: core: mcq: Fix the incorrect OCS value for the device command scsi: ufs: dt-bindings: samsung,exynos: Drop unneeded quotes ...
2023-06-26Merge tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - NVMe pull request via Keith: - Various cleanups all around (Irvin, Chaitanya, Christophe) - Better struct packing (Christophe JAILLET) - Reduce controller error logs for optional commands (Keith) - Support for >=64KiB block sizes (Daniel Gomez) - Fabrics fixes and code organization (Max, Chaitanya, Daniel Wagner) - bcache updates via Coly: - Fix a race at init time (Mingzhe Zou) - Misc fixes and cleanups (Andrea, Thomas, Zheng, Ye) - use page pinning in the block layer for dio (David) - convert old block dio code to page pinning (David, Christoph) - cleanups for pktcdvd (Andy) - cleanups for rnbd (Guoqing) - use the unchecked __bio_add_page() for the initial single page additions (Johannes) - fix overflows in the Amiga partition handling code (Michael) - improve mq-deadline zoned device support (Bart) - keep passthrough requests out of the IO schedulers (Christoph, Ming) - improve support for flush requests, making them less special to deal with (Christoph) - add bdev holder ops and shutdown methods (Christoph) - fix the name_to_dev_t() situation and use cases (Christoph) - decouple the block open flags from fmode_t (Christoph) - ublk updates and cleanups, including adding user copy support (Ming) - BFQ sanity checking (Bart) - convert brd from radix to xarray (Pankaj) - constify various structures (Thomas, Ivan) - more fine grained persistent reservation ioctl capability checks (Jingbo) - misc fixes and cleanups (Arnd, Azeem, Demi, Ed, Hengqi, Hou, Jan, Jordy, Li, Min, Yu, Zhong, Waiman) * tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux: (266 commits) scsi/sg: don't grab scsi host module reference ext4: Fix warning in blkdev_put() block: don't return -EINVAL for not found names in devt_from_devname cdrom: Fix spectre-v1 gadget block: Improve kernel-doc headers blk-mq: don't insert passthrough request into sw queue bsg: make bsg_class a static const structure ublk: make ublk_chr_class a static const structure aoe: make aoe_class a static const structure block/rnbd: make all 'class' structures const block: fix the exclusive open mask in disk_scan_partitions block: add overflow checks for Amiga partition support block: change all __u32 annotations to __be32 in affs_hardblocks.h block: fix signed int overflow in Amiga partition support block: add capacity validation in bdev_add_partition() block: fine-granular CAP_SYS_ADMIN for Persistent Reservation block: disallow Persistent Reservation on partitions reiserfs: fix blkdev_put() warning from release_journal_dev() block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions() block: document the holder argument to blkdev_get_by_path ...
2023-06-21nvme: Print capabilities changes just onceBreno Leitao
This current dev_info() could be very verbose and being printed very frequently depending on some userspace application sending some specific commands. Just print this message once and skip it until the controller resets. Use a controller flag (NVME_CTRL_DIRTY_CAPABILITY) to track if the capability needs a reset. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12nvme: skip optional id ctrl csi if it failedKeith Busch
A frequently recieved report is the driver requests the optional Command Set Specific Identify Controller structure. Some controllers report this in their error log, which tiggers other warnings to user space monitoring the devices. These error entries are harmless and of questionable value to save in the log, but let's reduce their occurance by not resending the command if it previously failed. This will not prevent the errors on the initial module load, but will greatly reduce their occurance on any rescans and resumes from suspend. Link: https://bugzilla.kernel.org/show_bug.cgi?id=217445 Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12nvme-core: use nvme_ns_head_multipath instead of ns->head->diskIrvin Cote
Change the way we check for a multipath nshead so as to consistently use the same check to assert the same condition. Signed-off-by: Irvin Cote <irvincoteg@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>