summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-09-25Merge tag 'powerpc-6.12-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - Fix build error in vdso32 when building 64-bit with COMPAT=y and -Os - Fix build error in pseries EEH when CONFIG_DEBUG_FS is not set Thanks to Christophe Leroy, Narayana Murty N, Christian Zigotzky, and Ritesh Harjani. * tag 'powerpc-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/pseries/eeh: move pseries_eeh_err_inject() outside CONFIG_DEBUG_FS block powerpc/vdso32: Fix use of crtsavres for PPC64
2024-09-25Merge tag 'clang-format-6.12' of https://github.com/ojeda/linuxLinus Torvalds
Pull clang-format updates from Miguel Ojeda: "A routine update of the 'for_each' macro list" * tag 'clang-format-6.12' of https://github.com/ojeda/linux: clang-format: Update with v6.11-rc1's `for_each` macro list
2024-09-25Kbuild: make MODVERSIONS support depend on not being a compile test buildLinus Torvalds
Currently the Rust support is gated on not having MODVERSIONS enabled, and as a result an "allmodconfig" build will disable Rust build tests. While MODVERSIONS configurations are worth build testing, the feature is not actually meaningful unless you run the result, and I'd rather get build coverage of Rust than MODVERSIONS. So let's disable MODVERSIONS for build testing until the Rust side clears up. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-09-25Merge tag 'rust-6.12' of https://github.com/Rust-for-Linux/linuxLinus Torvalds
Pull Rust updates from Miguel Ojeda: "Toolchain and infrastructure: - Support 'MITIGATION_{RETHUNK,RETPOLINE,SLS}' (which cleans up objtool warnings), teach objtool about 'noreturn' Rust symbols and mimic '___ADDRESSABLE()' for 'module_{init,exit}'. With that, we should be objtool-warning-free, so enable it to run for all Rust object files. - KASAN (no 'SW_TAGS'), KCFI and shadow call sanitizer support. - Support 'RUSTC_VERSION', including re-config and re-build on change. - Split helpers file into several files in a folder, to avoid conflicts in it. Eventually those files will be moved to the right places with the new build system. In addition, remove the need to manually export the symbols defined there, reusing existing machinery for that. - Relax restriction on configurations with Rust + GCC plugins to just the RANDSTRUCT plugin. 'kernel' crate: - New 'list' module: doubly-linked linked list for use with reference counted values, which is heavily used by the upcoming Rust Binder. This includes 'ListArc' (a wrapper around 'Arc' that is guaranteed unique for the given ID), 'AtomicTracker' (tracks whether a 'ListArc' exists using an atomic), 'ListLinks' (the prev/next pointers for an item in a linked list), 'List' (the linked list itself), 'Iter' (an iterator over a 'List'), 'Cursor' (a cursor into a 'List' that allows to remove elements), 'ListArcField' (a field exclusively owned by a 'ListArc'), as well as support for heterogeneous lists. - New 'rbtree' module: red-black tree abstractions used by the upcoming Rust Binder. This includes 'RBTree' (the red-black tree itself), 'RBTreeNode' (a node), 'RBTreeNodeReservation' (a memory reservation for a node), 'Iter' and 'IterMut' (immutable and mutable iterators), 'Cursor' (bidirectional cursor that allows to remove elements), as well as an entry API similar to the Rust standard library one. - 'init' module: add 'write_[pin_]init' methods and the 'InPlaceWrite' trait. Add the 'assert_pinned!' macro. - 'sync' module: implement the 'InPlaceInit' trait for 'Arc' by introducing an associated type in the trait. - 'alloc' module: add 'drop_contents' method to 'BoxExt'. - 'types' module: implement the 'ForeignOwnable' trait for 'Pin<Box<T>>' and improve the trait's documentation. In addition, add the 'into_raw' method to the 'ARef' type. - 'error' module: in preparation for the upcoming Rust support for 32-bit architectures, like arm, locally allow Clippy lint for those. Documentation: - https://rust.docs.kernel.org has been announced, so link to it. - Enable rustdoc's "jump to definition" feature, making its output a bit closer to the experience in a cross-referencer. - Debian Testing now also provides recent Rust releases (outside of the freeze period), so add it to the list. MAINTAINERS: - Trevor is joining as reviewer of the "RUST" entry. And a few other small bits" * tag 'rust-6.12' of https://github.com/Rust-for-Linux/linux: (54 commits) kasan: rust: Add KASAN smoke test via UAF kbuild: rust: Enable KASAN support rust: kasan: Rust does not support KHWASAN kbuild: rust: Define probing macros for rustc kasan: simplify and clarify Makefile rust: cfi: add support for CFI_CLANG with Rust cfi: add CONFIG_CFI_ICALL_NORMALIZE_INTEGERS rust: support for shadow call stack sanitizer docs: rust: include other expressions in conditional compilation section kbuild: rust: replace proc macros dependency on `core.o` with the version text kbuild: rust: rebuild if the version text changes kbuild: rust: re-run Kconfig if the version text changes kbuild: rust: add `CONFIG_RUSTC_VERSION` rust: avoid `box_uninit_write` feature MAINTAINERS: add Trevor Gross as Rust reviewer rust: rbtree: add `RBTree::entry` rust: rbtree: add cursor rust: rbtree: add mutable iterator rust: rbtree: add iterator rust: rbtree: add red-black tree implementation backed by the C version ...
2024-09-25sefltests/tracing: Add a test for tracepoint events on modulesMasami Hiramatsu (Google)
Add a test case for tracepoint events on modules. This checks if it can add and remove the events correctly. Link: https://lore.kernel.org/all/172397781494.286558.7581515061075998225.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-09-25tracing/fprobe: Support raw tracepoints on future loaded modulesMasami Hiramatsu (Google)
Support raw tracepoint events on future loaded (unloaded) modules. This allows user to create raw tracepoint events which can be used from module's __init functions. Note: since the kernel does not have any information about the tracepoints in the unloaded modules, fprobe events can not check whether the tracepoint exists nor extend the BTF based arguments. Link: https://lore.kernel.org/all/172397780593.286558.18360375226968537828.stgit@devnote2/ Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-09-25tracing/fprobe: Support raw tracepoint events on modulesMasami Hiramatsu (Google)
Support raw tracepoint event on module by fprobe events. Since it only uses for_each_kernel_tracepoint() to find a tracepoint, the tracepoints on modules are not handled. Thus if user specified a tracepoint on a module, it shows an error. This adds new for_each_module_tracepoint() API to tracepoint subsystem, and uses it to find tracepoints on modules. Link: https://lore.kernel.org/all/172397779651.286558.15903703620679186867.stgit@devnote2/ Reported-by: don <zds100@gmail.com> Closes: https://lore.kernel.org/all/20240530215718.aeec973a1d0bf058d39cb1e3@kernel.org/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-09-25tracepoint: Support iterating tracepoints in a loading moduleMasami Hiramatsu (Google)
Add for_each_tracepoint_in_module() function to iterate tracepoints in a module. This API is needed for handling tracepoints in a loading module from tracepoint_module_notifier callback function. This also update for_each_module_tracepoint() to pass the module to callback function so that it can find module easily. Link: https://lore.kernel.org/all/172397778740.286558.15781131277732977643.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-09-25tracepoint: Support iterating over tracepoints on modulesMasami Hiramatsu (Google)
Add for_each_module_tracepoint() for iterating over tracepoints on modules. This is similar to the for_each_kernel_tracepoint() but only for the tracepoints on modules (not including kernel built-in tracepoints). Link: https://lore.kernel.org/all/172397777800.286558.14554748203446214056.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-09-25kprobes: Remove obsoleted declaration for init_test_probesGaosheng Cui
The init_test_probes() have been removed since commit e44e81c5b90f ("kprobes: convert tests to kunit"), and now it is useless, so remove it. Link: https://lore.kernel.org/all/20240826032552.4016314-1-cuigaosheng1@huawei.com/ Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-09-25uprobes: turn trace_uprobe's nhit counter to be per-CPU oneAndrii Nakryiko
trace_uprobe->nhit counter is not incremented atomically, so its value is questionable in when uprobe is hit on multiple CPUs simultaneously. Also, doing this shared counter increment across many CPUs causes heavy cache line bouncing, limiting uprobe/uretprobe performance scaling with number of CPUs. Solve both problems by making this a per-CPU counter. Link: https://lore.kernel.org/all/20240813203409.3985398-1-andrii@kernel.org/ Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-09-25vsock/virtio: avoid queuing packets when intermediate queue is emptyLuigi Leonardi
When the driver needs to send new packets to the device, it always queues the new sk_buffs into an intermediate queue (send_pkt_queue) and schedules a worker (send_pkt_work) to then queue them into the virtqueue exposed to the device. This increases the chance of batching, but also introduces a lot of latency into the communication. So we can optimize this path by adding a fast path to be taken when there is no element in the intermediate queue, there is space available in the virtqueue, and no other process that is sending packets (tx_lock held). The following benchmarks were run to check improvements in latency and throughput. The test bed is a host with Intel i7-10700KF CPU @ 3.80GHz and L1 guest running on QEMU/KVM with vhost process and all vCPUs pinned individually to pCPUs. - Latency Tool: Fio version 3.37-56 Mode: pingpong (h-g-h) Test runs: 50 Runtime-per-test: 50s Type: SOCK_STREAM In the following fio benchmark (pingpong mode) the host sends a payload to the guest and waits for the same payload back. fio process pinned both inside the host and the guest system. Before: Linux 6.9.8 Payload 64B: 1st perc. overall 99th perc. Before 12.91 16.78 42.24 us After 9.77 13.57 39.17 us Payload 512B: 1st perc. overall 99th perc. Before 13.35 17.35 41.52 us After 10.25 14.11 39.58 us Payload 4K: 1st perc. overall 99th perc. Before 14.71 19.87 41.52 us After 10.51 14.96 40.81 us - Throughput Tool: iperf-vsock The size represents the buffer length (-l) to read/write P represents the number of parallel streams P=1 4K 64K 128K Before 6.87 29.3 29.5 Gb/s After 10.5 39.4 39.9 Gb/s P=2 4K 64K 128K Before 10.5 32.8 33.2 Gb/s After 17.8 47.7 48.5 Gb/s P=4 4K 64K 128K Before 12.7 33.6 34.2 Gb/s After 16.9 48.1 50.5 Gb/s The performance improvement is related to this optimization, I used a ebpf kretprobe on virtio_transport_send_skb to check that each packet was sent directly to the virtqueue Co-developed-by: Marco Pinna <marco.pinn95@gmail.com> Signed-off-by: Marco Pinna <marco.pinn95@gmail.com> Signed-off-by: Luigi Leonardi <luigi.leonardi@outlook.com> Message-Id: <20240730-pinna-v4-2-5c9179164db5@outlook.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
2024-09-25vsock/virtio: refactor virtio_transport_send_pkt_workMarco Pinna
Preliminary patch to introduce an optimization to the enqueue system. All the code used to enqueue a packet into the virtqueue is removed from virtio_transport_send_pkt_work() and moved to the new virtio_transport_send_skb() function. Co-developed-by: Luigi Leonardi <luigi.leonardi@outlook.com> Signed-off-by: Luigi Leonardi <luigi.leonardi@outlook.com> Signed-off-by: Marco Pinna <marco.pinn95@gmail.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Message-Id: <20240730-pinna-v4-1-5c9179164db5@outlook.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-25fw_cfg: Constify struct kobj_typeHongbo Li
This 'struct kobj_type' is not modified. It is only used in kobject_init_and_add() which takes a 'const struct kobj_type *ktype' parameter. Constifying this structure and moving it to a read-only section, and this can increase over all security. ``` [Before] text data bss dec hex filename 5974 1008 96 7078 1ba6 drivers/firmware/qemu_fw_cfg.o [After] text data bss dec hex filename 6038 944 96 7078 1ba6 drivers/firmware/qemu_fw_cfg.o ``` Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Message-Id: <20240904011743.2010319-1-lihongbo22@huawei.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-25vdpa/mlx5: Postpone MR deletionDragos Tatulea
Currently, when a new MR is set up, the old MR is deleted. MR deletion is about 30-40% the time of MR creation. As deleting the old MR is not important for the process of setting up the new MR, this operation can be postponed. This series adds a workqueue that does MR garbage collection at a later point. If the MR lock is taken, the handler will back off and reschedule. The exception during shutdown: then the handler must not postpone the work. Note that this is only a speculative optimization: if there is some mapping operation that is triggered while the garbage collector handler has the lock taken, this operation it will have to wait for the handler to finish. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Message-Id: <20240830105838.2666587-9-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-25vdpa/mlx5: Introduce init/destroy for MR resourcesDragos Tatulea
There's currently not a lot of action happening during the init/destroy of MR resources. But more will be added in the upcoming patches. As the mr mutex lock init/destroy has been moved to these new functions, the lifetime has now shifted away from mlx5_vdpa_alloc_resources() / mlx5_vdpa_free_resources() into these new functions. However, the lifetime at the outer scope remains the same: mlx5_vdpa_dev_add() / mlx5_vdpa_dev_free() Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Message-Id: <20240830105838.2666587-8-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-25vdpa/mlx5: Rename mr_mtx -> lockDragos Tatulea
Now that the mr resources have their own namespace in the struct, give the lock a clearer name. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20240830105838.2666587-7-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-25vdpa/mlx5: Extract mr members in own resource structDragos Tatulea
Group all mapping related resources into their own structure. Upcoming patches will add more members in this new structure. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20240830105838.2666587-6-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-25vdpa/mlx5: Rename functionDragos Tatulea
A followup patch will use this name for something else. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Message-Id: <20240830105838.2666587-5-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-25vdpa/mlx5: Delete direct MKEYs in parallelDragos Tatulea
Use the async interface to issue MTT MKEY deletion. This makes destroy_user_mr() on average 8x times faster. This number is also dependent on the size of the MR being deleted. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20240830105838.2666587-4-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-25vdpa/mlx5: Create direct MKEYs in parallelDragos Tatulea
Use the async interface to issue MTT MKEY creation. Extra care is taken at the allocation of FW input commands due to the MTT tables having variable sizes depending on MR. The indirect MKEY is still created synchronously at the end as the direct MKEYs need to be filled in. This makes create_user_mr() 3-5x faster, depending on the size of the MR. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Message-Id: <20240830105838.2666587-3-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-25MAINTAINERS: add virtio-vsock driver in the VIRTIO CORE sectionStefano Garzarella
The virtio-vsock driver is already under VM SOCKETS (AF_VSOCK), managed pricipally with the net tree, and VIRTIO AND VHOST VSOCK DRIVER. However, changes that only affect the virtio part usually go with Michael's tree, so let's also put the driver in the VIRTIO CORE section to have its maintainers in CC for changes to the virtio-vsock driver. Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Message-Id: <20240829143757.85844-1-sgarzare@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
2024-09-25virtio_fs: add sysfs entries for queue informationMax Gurtovoy
Introduce sysfs entries to provide visibility to the multiple queues used by the Virtio FS device. This enhancement allows users to query information about these queues. Specifically, add two sysfs entries: 1. Queue name: Provides the name of each queue (e.g. hiprio/requests.8). 2. CPU list: Shows the list of CPUs that can process requests for each queue. The CPU list feature is inspired by similar functionality in the block MQ layer, which provides analogous sysfs entries for block devices. These new sysfs entries will improve observability and aid in debugging and performance tuning of Virtio FS devices. Reviewed-by: Idan Zach <izach@nvidia.com> Reviewed-by: Shai Malin <smalin@nvidia.com> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Message-Id: <20240825130716.9506-2-mgurtovoy@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-25virtio_fs: introduce virtio_fs_put_locked helperMax Gurtovoy
Introduce a new helper function virtio_fs_put_locked to encapsulate the common pattern of releasing a virtio_fs reference while holding a lock. The existing virtio_fs_put helper will be used to release a virtio_fs reference while not holding a lock. Also add an assertion in case the lock is not taken when it should. Reviewed-by: Idan Zach <izach@nvidia.com> Reviewed-by: Shai Malin <smalin@nvidia.com> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Message-Id: <20240825130716.9506-1-mgurtovoy@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2024-09-25vdpa: Remove unused declarationsYue Haibing
There is no caller and implementation in tree. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Message-Id: <20240819140930.122019-1-yuehaibing@huawei.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Zhu Lingshan <lingshan.zhu@kernel.org> Reviewed-by: Shannon Nelson &lt;<a href="mailto:shannon.nelson@amd.com" target="_blank">shannon.nelson@amd.com</a>&gt;<br> Reviewed-by: Zhu Lingshan <lingshan.zhu@kernel.org>
2024-09-25vdpa/mlx5: Parallelize VQ suspend/resume for CVQ MQ commandDragos Tatulea
change_num_qps() is still suspending/resuming VQs one by one. This change switches to parallel suspend/resume. When increasing the number of queues the flow has changed a bit for simplicity: the setup_vq() function will always be called before resume_vqs(). If the VQ is initialized, setup_vq() will exit early. If the VQ is not initialized, setup_vq() will create it and resume_vqs() will resume it. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Message-Id: <20240816090159.1967650-11-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-25vdpa/mlx5: Small improvement for change_num_qps()Dragos Tatulea
change_num_qps() has a lot of multiplications by 2 to convert the number of VQ pairs to number of VQs. This patch simplifies the code by doing the VQP -> VQ count conversion at the beginning in a variable. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Message-Id: <20240816090159.1967650-10-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-25vdpa/mlx5: Keep notifiers during suspend but ignoreDragos Tatulea
Unregistering notifiers is a costly operation. Instead of removing the notifiers during device suspend and adding them back at resume, simply ignore the call when the device is suspended. At resume time call queue_link_work() to make sure that the device state is propagated in case there were changes. For 1 vDPA device x 32 VQs (16 VQPs) attached to a large VM (256 GB RAM, 32 CPUs x 2 threads per core), the device suspend time is reduced from ~13 ms to ~2.5 ms. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20240816090159.1967650-9-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-25vdpa/mlx5: Parallelize device resumeDragos Tatulea
Currently device resume works on vqs serially. Building up on previous changes that converted vq operations to the async api, this patch parallelizes the device resume. For 1 vDPA device x 32 VQs (16 VQPs) attached to a large VM (256 GB RAM, 32 CPUs x 2 threads per core), the device resume time is reduced from ~16 ms to ~4.5 ms. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20240816090159.1967650-8-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-25vdpa/mlx5: Parallelize device suspendDragos Tatulea
Currently device suspend works on vqs serially. Building up on previous changes that converted vq operations to the async api, this patch parallelizes the device suspend: 1) Suspend all active vqs parallel. 2) Query suspended vqs in parallel. For 1 vDPA device x 32 VQs (16 VQPs) attached to a large VM (256 GB RAM, 32 CPUs x 2 threads per core), the device suspend time is reduced from ~37 ms to ~13 ms. A later patch will remove the link unregister operation which will make it even faster. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20240816090159.1967650-7-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-25vdpa/mlx5: Use async API for vq modify commandsDragos Tatulea
Switch firmware vq modify command to be issued via the async API to allow future parallelization. The new refactored function applies the modify on a range of vqs and waits for their execution to complete. For now the command is still used in a serial fashion. A later patch will switch to modifying multiple vqs in parallel. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Message-Id: <20240816090159.1967650-6-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-25vdpa/mlx5: Use async API for vq query commandDragos Tatulea
Switch firmware vq query command to be issued via the async API to allow future parallelization. For now the command is still serial but the infrastructure is there to issue commands in parallel, including ratelimiting the number of issued async commands to firmware. A later patch will switch to issuing more commands at a time. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Message-Id: <20240816090159.1967650-5-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-25vdpa/mlx5: Introduce async fw command wrapperDragos Tatulea
Introduce a new function mlx5_vdpa_exec_async_cmds() which wraps the mlx5_core async firmware command API in a way that will be used to parallelize certain operation in this driver. The wrapper deals with the case when mlx5_cmd_exec_cb() returns EBUSY due to the command being throttled. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Message-Id: <20240816090159.1967650-4-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-25vdpa/mlx5: Introduce error logging functionDragos Tatulea
mlx5_vdpa_err() was missing. This patch adds it and uses it in the necessary places. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20240816090159.1967650-3-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-25net/mlx5: Support throttled commands from async APIDragos Tatulea
Currently, commands that qualify as throttled can't be used via the async API. That's due to the fact that the throttle semaphore can sleep but the async API can't. This patch allows throttling in the async API by using the tentative variant of the semaphore and upon failure (semaphore at 0) returns EBUSY to signal to the caller that they need to wait for the completion of previously issued commands. Furthermore, make sure that the semaphore is released in the callback. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Cc: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Message-Id: <20240816090159.1967650-2-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>
2024-09-25Merge tag 'nvme-6.12-2024-09-25' of git://git.infradead.org/nvme into ↵Jens Axboe
for-6.12/block Pull NVMe fixes from Keith: "nvme fixes for Linux 6.12 - Multipath fixes (Hannes) - Sysfs attribute list NULL terminate fix (Shin'ichiro) - Remove problematic read-back (Keith)" * tag 'nvme-6.12-2024-09-25' of git://git.infradead.org/nvme: nvme: remove CC register read-back during enabling nvme: null terminate nvme_tls_attrs nvme-multipath: avoid hang on inaccessible namespaces nvme-multipath: system fails to create generic nvme device
2024-09-25Revert "driver core: don't always lock parent in shutdown"Greg Kroah-Hartman
This reverts commit ba6353748e71bd1d7e422fec2b5c2e2dfc2e3bd9. The series is being reverted before -rc1 as there are still reports of lockups on shutdown, so it's not quite ready for "prime time." Reported-by: Andrey Skvortsov <andrej.skvortzov@gmail.com> Link: https://lore.kernel.org/r/ZvMkkhyJrohaajuk@skv.local Cc: Christoph Hellwig <hch@lst.de> Cc: David Jeffery <djeffery@redhat.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Laurence Oberman <loberman@redhat.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Stuart Hayes <stuart.w.hayes@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-25Revert "driver core: separate function to shutdown one device"Greg Kroah-Hartman
This reverts commit 95dc7565253a8564911190ebd1e4ffceb4de208a. The series is being reverted before -rc1 as there are still reports of lockups on shutdown, so it's not quite ready for "prime time." Reported-by: Andrey Skvortsov <andrej.skvortzov@gmail.com> Link: https://lore.kernel.org/r/ZvMkkhyJrohaajuk@skv.local Cc: Christoph Hellwig <hch@lst.de> Cc: David Jeffery <djeffery@redhat.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Laurence Oberman <loberman@redhat.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Stuart Hayes <stuart.w.hayes@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-25Revert "driver core: shut down devices asynchronously"Greg Kroah-Hartman
This reverts commit 8064952c65045f05ee2671fe437770e50c151776. The series is being reverted before -rc1 as there are still reports of lockups on shutdown, so it's not quite ready for "prime time." Reported-by: Andrey Skvortsov <andrej.skvortzov@gmail.com> Link: https://lore.kernel.org/r/ZvMkkhyJrohaajuk@skv.local Cc: Christoph Hellwig <hch@lst.de> Cc: David Jeffery <djeffery@redhat.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Laurence Oberman <loberman@redhat.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Stuart Hayes <stuart.w.hayes@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-25Revert "nvme-pci: Make driver prefer asynchronous shutdown"Greg Kroah-Hartman
This reverts commit ba82e10c3c6b5b5d2c8279a8bd0dae5c2abaacfc. The series is being reverted before -rc1 as there are still reports of lockups on shutdown, so it's not quite ready for "prime time." Reported-by: Andrey Skvortsov <andrej.skvortzov@gmail.com> Link: https://lore.kernel.org/r/ZvMkkhyJrohaajuk@skv.local Cc: Christoph Hellwig <hch@lst.de> Cc: David Jeffery <djeffery@redhat.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Laurence Oberman <loberman@redhat.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Stuart Hayes <stuart.w.hayes@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-25Revert "driver core: fix async device shutdown hang"Greg Kroah-Hartman
This reverts commit 4f2c346e621624315e2a1405e98616a0c5ac146f. The series is being reverted before -rc1 as there are still reports of lockups on shutdown, so it's not quite ready for "prime time." Reported-by: Andrey Skvortsov <andrej.skvortzov@gmail.com> Link: https://lore.kernel.org/r/ZvMkkhyJrohaajuk@skv.local Cc: Christoph Hellwig <hch@lst.de> Cc: David Jeffery <djeffery@redhat.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Laurence Oberman <loberman@redhat.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Stuart Hayes <stuart.w.hayes@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-24nvme: remove CC register read-back during enablingKeith Busch
Any non-posted read should flush the previous write, so we don't necessarily need to read back the value we just wrote. I've found at least some controllers that respond with 0 for short moments after writing the CC register with EN (enable) cleared, so the read-back is overwriting our valid ctrl_config value and ends up breaking on the subsequent enabling. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-09-24nvme: null terminate nvme_tls_attrsShin'ichiro Kawasaki
Commit 1e48b34c9bc7 ("nvme: split off TLS sysfs attributes into a separate group") introduced the struct attribute array nvme_tls_attrs. However, the array was not null terminated and caused BUG KASAN global- out-of-bounds. To avoid the BUG, null terminate the array. Reported-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/linux-nvme/jhllwfxcedrcxcnbajwl4x2l2ujcqowqcd4ps574zrafrqhjna@f4icvecutekm/ Fixes: 1e48b34c9bc7 ("nvme: split off TLS sysfs attributes into a separate group") Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Tested-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-09-24nvme-multipath: avoid hang on inaccessible namespacesHannes Reinecke
During repetitive namespace remapping operations on the target the namespace might have changed between the time the initial scan was performed, and partition scan was invoked by device_add_disk() in nvme_mpath_set_live(). We then end up with a stuck scanning process: [<0>] folio_wait_bit_common+0x12a/0x310 [<0>] filemap_read_folio+0x97/0xd0 [<0>] do_read_cache_folio+0x108/0x390 [<0>] read_part_sector+0x31/0xa0 [<0>] read_lba+0xc5/0x160 [<0>] efi_partition+0xd9/0x8f0 [<0>] bdev_disk_changed+0x23d/0x6d0 [<0>] blkdev_get_whole+0x78/0xc0 [<0>] bdev_open+0x2c6/0x3b0 [<0>] bdev_file_open_by_dev+0xcb/0x120 [<0>] disk_scan_partitions+0x5d/0x100 [<0>] device_add_disk+0x402/0x420 [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core] [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core] [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core] [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core] [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core] This happens when we have several paths, some of which are inaccessible, and the active paths are removed first. Then nvme_find_path() will requeue I/O in the ns_head (as paths are present), but the requeue list is never triggered as all remaining paths are inactive. This patch checks for NVME_NSHEAD_DISK_LIVE in nvme_available_path(), and requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared once the last path has been removed to properly terminate pending I/O. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-09-24nvme-multipath: system fails to create generic nvme deviceHannes Reinecke
NVME_NSHEAD_DISK_LIVE is a flag for struct nvme_ns_head, not nvme_ns. The current code has a typo causing NVME_NSHEAD_DISK_LIVE never to be cleared once device_add_disk_fails, causing the system never to create the 'generic' character device. Even several rescan attempts will change the situation and the system has to be rebooted to fix the issue. Fixes: 11384580e332 ("nvme-multipath: add error handling support for add_disk()") Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-09-24netfs, cifs: Fix mtime/ctime update for mmapped writesDavid Howells
The cifs flag CIFS_INO_MODIFIED_ATTR, which indicates that the mtime and ctime need to be written back on close, got taken over by netfs as NETFS_ICTX_MODIFIED_ATTR to avoid the need to call a function pointer to set it. The flag gets set correctly on buffered writes, but doesn't get set by netfs_page_mkwrite(), leading to occasional failures in generic/080 and generic/215. Fix this by setting the flag in netfs_page_mkwrite(). Fixes: 73425800ac94 ("netfs, cifs: Move CIFS_INO_MODIFIED_ATTR to netfs_inode") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202409161629.98887b2-oliver.sang@intel.com Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2024-09-24cifs: update internal version numberSteve French
To 2.51 Signed-off-by: Steve French <stfrench@microsoft.com>
2024-09-24smb: client: print failed session logoffs with FYIPaulo Alcantara
Do not flood dmesg with failed session logoffs as kerberos tickets getting expired or passwords being rotated is a very common scenario. Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-09-24cifs: Fix reversion of the iter in cifs_readv_receive().David Howells
cifs_read_iter_from_socket() copies the iterator that's passed in for the socket to modify as and if it will, and then advances the original iterator by the amount sent. However, both callers revert the advancement (although receive_encrypted_read() zeros beyond the iterator first). The problem is, though, that cifs_readv_receive() reverts by the original length, not the amount transmitted which can cause an oops in iov_iter_revert(). Fix this by: (1) Remove the iov_iter_advance() from cifs_read_iter_from_socket(). (2) Remove the iov_iter_revert() from both callers. This fixes the bug in cifs_readv_receive(). (3) In receive_encrypted_read(), if we didn't get back as much data as the buffer will hold, copy the iterator, advance the copy and use the copy to drive iov_iter_zero(). As a bonus, this gets rid of some unnecessary work. This was triggered by generic/074 with the "-o sign" mount option. Fixes: 3ee1a1fc3981 ("cifs: Cut over to using netfslib") Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Rohith Surabattula <rohiths.msft@gmail.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2024-09-24smb3: fix incorrect mode displayed for read-only filesSteve French
Commands like "chmod 0444" mark a file readonly via the attribute flag (when mapping of mode bits into the ACL are not set, or POSIX extensions are not negotiated), but they were not reported correctly for stat of directories (they were reported ok for files and for "ls"). See example below: root:~# ls /mnt2 -l total 12 drwxr-xr-x 2 root root 0 Sep 21 18:03 normaldir -rwxr-xr-x 1 root root 0 Sep 21 23:24 normalfile dr-xr-xr-x 2 root root 0 Sep 21 17:55 readonly-dir -r-xr-xr-x 1 root root 209716224 Sep 21 18:15 readonly-file root:~# stat -c %a /mnt2/readonly-dir 755 root:~# stat -c %a /mnt2/readonly-file 555 This fixes the stat of directories when ATTR_READONLY is set (in cases where the mode can not be obtained other ways). root:~# stat -c %a /mnt2/readonly-dir 555 Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>