Provide a high-level API to allow replacing one domain with another for a
specific PASID of a device. This is similar to iommu_replace_group_handle()
and is expected to be used only by IOMMUFD.
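A sketch of what the new entry point might look like (the exact name and
signature are assumptions based on the description above):
    /*
     * Hypothetical sketch, not the literal kernel prototype: replace the
     * domain attached to @pasid of @dev with @new_domain, updating
     * @handle so lockless readers never observe a stale pointer.
     */
    int iommu_replace_device_pasid(struct iommu_domain *new_domain,
                                   struct device *dev, ioasid_t pasid,
                                   struct iommu_attach_handle *handle);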
Link: https://patch.msgid.link/r/20250321171940.7213-3-yi.l.liu@intel.com
Co-developed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Add kdoc to highlight that the callers of iommu_[attach|replace]_group_handle()
and iommu_attach_device_pasid() should always provide a new handle. This
avoids races with lockless references to the handle, e.g. in
find_fault_handler() and iommu_report_device_fault() in the PRI path.
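For example, a caller would pass a freshly allocated handle on every
attach/replace rather than recycling one that a lockless reader may still
hold (illustrative sketch only):
    /* Sketch: never reuse a handle that PRI-path readers may still see */
    struct iommu_attach_handle *handle;
    int ret;
    handle = kzalloc(sizeof(*handle), GFP_KERNEL);
    if (!handle)
        return -ENOMEM;
    ret = iommu_attach_group_handle(domain, group, handle);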
Link: https://patch.msgid.link/r/20250321171940.7213-2-yi.l.liu@intel.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
There are only two sw_msi implementations in the entire system, so there is
little need for an sw_msi function pointer.
Instead, check domain->cookie_type and call the two sw_msi implementations
directly from the core code.
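A minimal sketch of the resulting dispatch in the core (cookie constant
names taken from the cookie-type rework further down this log):
    /* Sketch: pick the sw_msi implementation from the cookie type */
    switch (domain->cookie_type) {
    case IOMMU_COOKIE_DMA_MSI:
        ret = iommu_dma_sw_msi(domain, desc, msi_addr);
        break;
    case IOMMU_COOKIE_IOMMUFD:
        ret = iommufd_sw_msi(domain, desc, msi_addr);
        break;
    default:
        ret = -EOPNOTSUPP;
        break;
    }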
Link: https://patch.msgid.link/r/7ded87c871afcbaac665b71354de0a335087bf0f.1742871535.git.nicolinc@nvidia.com
Suggested-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
To provide iommufd_sw_msi() to the iommu core, which is under a different
Kconfig option, move it and its related functions to driver.c, and stub it
in the iommu-priv header. iommufd_sw_msi_install() continues to be used
internally by iommufd, so keep it in the private header.
Note that iommufd_sw_msi() will be called in the iommu core, replacing the
sw_msi function pointer. Given that IOMMU_API is "bool" in Kconfig, change
IOMMUFD_DRIVER_CORE to "bool" as well.
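The stub arrangement in iommu-priv.h presumably ends up along these lines
(a sketch assuming the usual IS_ENABLED pattern):
    #if IS_ENABLED(CONFIG_IOMMUFD_DRIVER_CORE)
    int iommufd_sw_msi(struct iommu_domain *domain, struct msi_desc *desc,
                       phys_addr_t msi_addr);
    #else
    static inline int iommufd_sw_msi(struct iommu_domain *domain,
                                     struct msi_desc *desc,
                                     phys_addr_t msi_addr)
    {
        return -EOPNOTSUPP;
    }
    #endif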
Since this affects the module size, here is a before-and-after size comparison:
[Before]
text data bss dec hex filename
18797 848 56 19701 4cf5 drivers/iommu/iommufd/device.o
722 44 0 766 2fe drivers/iommu/iommufd/driver.o
[After]
text data bss dec hex filename
17735 808 56 18599 48a7 drivers/iommu/iommufd/device.o
3020 180 0 3200 c80 drivers/iommu/iommufd/driver.o
Link: https://patch.msgid.link/r/374c159592dba7852bee20968f3f66fa0ee8ca93.1742871535.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
When DMA/MSI cookies were made first-class citizens back in commit
46983fcd67ac ("iommu: Pull IOVA cookie management into the core"), there
was no real need to further expose the two different cookie types.
However, now that IOMMUFD wants to add a third type of MSI-mapping
cookie, we do have a nicely compelling reason to properly disambiguate
things at the domain level beyond just vaguely guessing from the domain
type.
Meanwhile, we also effectively have another "cookie" in the form of the
anonymous union for other user data, which isn't much better in terms of
being vague and unenforced. The fact is that all these cookie types are
mutually exclusive, in the sense that combining them makes zero sense
and/or would be catastrophic (iommu_set_fault_handler() on an SVA
domain, anyone?) - the only combination which *might* be reasonable is
perhaps a fault handler and an MSI cookie, but nobody's doing that at
the moment, so let's rule it out as well for the sake of being clear and
robust. To that end, we pull DMA and MSI cookies apart a little more,
mostly to clear up the ambiguity at domain teardown, then for clarity
(and to save a little space), move them into the union, whose ownership
we can then properly describe and enforce entirely unambiguously.
[nicolinc: rebase on latest tree; use prefix IOMMU_COOKIE_; merge unions
in iommu_domain; add IOMMU_COOKIE_IOMMUFD for iommufd_hwpt]
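A sketch of the resulting shape of struct iommu_domain (member and
enumerator names are assumptions based on the IOMMU_COOKIE_ prefix above):
    enum iommu_cookie_type {
        IOMMU_COOKIE_NONE,
        IOMMU_COOKIE_DMA_IOVA,
        IOMMU_COOKIE_DMA_MSI,
        IOMMU_COOKIE_FAULT_HANDLER,
        IOMMU_COOKIE_SVA,
        IOMMU_COOKIE_IOMMUFD,
    };
    struct iommu_domain {
        /* ... */
        enum iommu_cookie_type cookie_type;
        union { /* cookie, owned according to cookie_type */
            struct iommu_dma_cookie *iova_cookie;
            struct iommu_dma_msi_cookie *msi_cookie;
            struct iommufd_hw_pagetable *iommufd_hwpt;
            void *fault_data;
        };
    };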
Link: https://patch.msgid.link/r/1ace9076c95204bbe193ee77499d395f15f44b23.1742871535.git.nicolinc@nvidia.com
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Introduce hv_status_printk() macros as a convenience to log hypercall
errors, formatting them with the status code (HV_STATUS_*) as a raw hex
value and also as a string, which saves some time while debugging.
Create a table of HV_STATUS_ codes with strings and mapped errnos, and
use it for hv_result_to_string() and hv_result_to_errno().
Use the new hv_status_printk()s in hv_proc.c, hyperv-iommu.c, and
irqdomain.c hypercalls to aid debugging in the root partition.
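A sketch of the table-driven mapping this describes (entries illustrative,
not exhaustive):
    /* Sketch: one table backing both hv_result_to_string() and
     * hv_result_to_errno() */
    static const struct {
        u16 code;
        const char *string;
        int errno;
    } hv_status_table[] = {
        { HV_STATUS_SUCCESS,                "HV_STATUS_SUCCESS",                0 },
        { HV_STATUS_INVALID_HYPERCALL_CODE, "HV_STATUS_INVALID_HYPERCALL_CODE", -EINVAL },
        { HV_STATUS_INVALID_PARAMETER,      "HV_STATUS_INVALID_PARAMETER",      -EINVAL },
        /* ... */
    };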
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Link: https://lore.kernel.org/r/1741980536-3865-2-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1741980536-3865-2-git-send-email-nunodasneves@linux.microsoft.com>
|
|
Merge branches 'rockchip', 's390', 'core', 'intel/vt-d' and 'amd/amd-vi' into next
|
|
We have recently seen reports of lockdep circular lock dependency warnings
on platforms like Skylake and Kabylake:
======================================================
WARNING: possible circular locking dependency detected
6.14.0-rc6-CI_DRM_16276-gca2c04fe76e8+ #1 Not tainted
------------------------------------------------------
swapper/0/1 is trying to acquire lock:
ffffffff8360ee48 (iommu_probe_device_lock){+.+.}-{3:3},
at: iommu_probe_device+0x1d/0x70
but task is already holding lock:
ffff888102c7efa8 (&device->physical_node_lock){+.+.}-{3:3},
at: intel_iommu_init+0xe75/0x11f0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (&device->physical_node_lock){+.+.}-{3:3}:
__mutex_lock+0xb4/0xe40
mutex_lock_nested+0x1b/0x30
intel_iommu_init+0xe75/0x11f0
pci_iommu_init+0x13/0x70
do_one_initcall+0x62/0x3f0
kernel_init_freeable+0x3da/0x6a0
kernel_init+0x1b/0x200
ret_from_fork+0x44/0x70
ret_from_fork_asm+0x1a/0x30
-> #5 (dmar_global_lock){++++}-{3:3}:
down_read+0x43/0x1d0
enable_drhd_fault_handling+0x21/0x110
cpuhp_invoke_callback+0x4c6/0x870
cpuhp_issue_call+0xbf/0x1f0
__cpuhp_setup_state_cpuslocked+0x111/0x320
__cpuhp_setup_state+0xb0/0x220
irq_remap_enable_fault_handling+0x3f/0xa0
apic_intr_mode_init+0x5c/0x110
x86_late_time_init+0x24/0x40
start_kernel+0x895/0xbd0
x86_64_start_reservations+0x18/0x30
x86_64_start_kernel+0xbf/0x110
common_startup_64+0x13e/0x141
-> #4 (cpuhp_state_mutex){+.+.}-{3:3}:
__mutex_lock+0xb4/0xe40
mutex_lock_nested+0x1b/0x30
__cpuhp_setup_state_cpuslocked+0x67/0x320
__cpuhp_setup_state+0xb0/0x220
page_alloc_init_cpuhp+0x2d/0x60
mm_core_init+0x18/0x2c0
start_kernel+0x576/0xbd0
x86_64_start_reservations+0x18/0x30
x86_64_start_kernel+0xbf/0x110
common_startup_64+0x13e/0x141
-> #3 (cpu_hotplug_lock){++++}-{0:0}:
__cpuhp_state_add_instance+0x4f/0x220
iova_domain_init_rcaches+0x214/0x280
iommu_setup_dma_ops+0x1a4/0x710
iommu_device_register+0x17d/0x260
intel_iommu_init+0xda4/0x11f0
pci_iommu_init+0x13/0x70
do_one_initcall+0x62/0x3f0
kernel_init_freeable+0x3da/0x6a0
kernel_init+0x1b/0x200
ret_from_fork+0x44/0x70
ret_from_fork_asm+0x1a/0x30
-> #2 (&domain->iova_cookie->mutex){+.+.}-{3:3}:
__mutex_lock+0xb4/0xe40
mutex_lock_nested+0x1b/0x30
iommu_setup_dma_ops+0x16b/0x710
iommu_device_register+0x17d/0x260
intel_iommu_init+0xda4/0x11f0
pci_iommu_init+0x13/0x70
do_one_initcall+0x62/0x3f0
kernel_init_freeable+0x3da/0x6a0
kernel_init+0x1b/0x200
ret_from_fork+0x44/0x70
ret_from_fork_asm+0x1a/0x30
-> #1 (&group->mutex){+.+.}-{3:3}:
__mutex_lock+0xb4/0xe40
mutex_lock_nested+0x1b/0x30
__iommu_probe_device+0x24c/0x4e0
probe_iommu_group+0x2b/0x50
bus_for_each_dev+0x7d/0xe0
iommu_device_register+0xe1/0x260
intel_iommu_init+0xda4/0x11f0
pci_iommu_init+0x13/0x70
do_one_initcall+0x62/0x3f0
kernel_init_freeable+0x3da/0x6a0
kernel_init+0x1b/0x200
ret_from_fork+0x44/0x70
ret_from_fork_asm+0x1a/0x30
-> #0 (iommu_probe_device_lock){+.+.}-{3:3}:
__lock_acquire+0x1637/0x2810
lock_acquire+0xc9/0x300
__mutex_lock+0xb4/0xe40
mutex_lock_nested+0x1b/0x30
iommu_probe_device+0x1d/0x70
intel_iommu_init+0xe90/0x11f0
pci_iommu_init+0x13/0x70
do_one_initcall+0x62/0x3f0
kernel_init_freeable+0x3da/0x6a0
kernel_init+0x1b/0x200
ret_from_fork+0x44/0x70
ret_from_fork_asm+0x1a/0x30
other info that might help us debug this:
Chain exists of:
iommu_probe_device_lock --> dmar_global_lock -->
&device->physical_node_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&device->physical_node_lock);
lock(dmar_global_lock);
lock(&device->physical_node_lock);
lock(iommu_probe_device_lock);
*** DEADLOCK ***
This driver uses a global lock to protect the list of enumerated DMA
remapping units. It is necessary due to the driver's support for dynamic
addition and removal of remapping units at runtime.
Two distinct code paths require iteration over this remapping unit list:
- Device registration and probing: the driver iterates the list to
register each remapping unit with the upper layer IOMMU framework
and subsequently probe the devices managed by that unit.
- Global configuration: Upper layer components may also iterate the list
to apply configuration changes.
The lock acquisition order between these two code paths was reversed. This
caused lockdep warnings, indicating a risk of deadlock. Fix this warning
by releasing the global lock before invoking upper layer interfaces for
device registration.
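Schematically, the fix amounts to dropping the global lock around the call
into the IOMMU core (a sketch of the reordering, not the literal diff):
    /* Sketch: don't hold dmar_global_lock across core registration,
     * which may re-enter the driver via device probe */
    up_read(&dmar_global_lock);
    ret = iommu_device_register(&iommu->iommu, &intel_iommu_ops, NULL);
    down_read(&dmar_global_lock);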
Fixes: b150654f74bf ("iommu/vt-d: Fix suspicious RCU usage")
Closes: https://lore.kernel.org/linux-iommu/SJ1PR11MB612953431F94F18C954C4A9CB9D32@SJ1PR11MB6129.namprd11.prod.outlook.com/
Tested-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250317035714.1041549-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Don't overwrite an IRTE that is posting IRQs to a vCPU with a posted MSI
entry if the host IRQ affinity happens to change. If/when the IRTE is
reverted back to "host mode", it will be reconfigured as a posted MSI or
remapped entry as appropriate.
Drop the "mode" field, which doesn't differentiate between posted MSIs and
posted vCPUs, in favor of a dedicated posted_vcpu flag. Note! The two
posted_{msi,vcpu} flags are intentionally not mutually exclusive; an IRTE
can transition between posted MSI and posted vCPU.
Fixes: ed1e48ea4370 ("iommu/vt-d: Enable posted mode for device MSIs")
Cc: stable@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250315025135.2365846-3-seanjc@google.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Add a helper to take care of reconfiguring an IRTE to deliver IRQs to the
host, i.e. not to a vCPU, and use the helper when an IRTE's vCPU affinity
is nullified, i.e. when KVM puts an IRTE back into "host" mode. Because
posted MSIs use an ephemeral IRTE, using modify_irte() puts the IRTE into
full remapped mode, i.e. unintentionally disables posted MSIs on the IRQ.
Fixes: ed1e48ea4370 ("iommu/vt-d: Enable posted mode for device MSIs")
Cc: stable@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250315025135.2365846-2-seanjc@google.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
If kzalloc() fails, accessing cfg->supports_bypass causes a null pointer dereference.
Fix by checking for NULL immediately after allocation and returning -ENOMEM.
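The fix is the standard allocation check (sketch):
    cfg = kzalloc(sizeof(*cfg), GFP_KERNEL);
    if (!cfg)
        return -ENOMEM;
    /* only now is cfg->supports_bypass safe to dereference */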
Fixes: 3bc0102835f6 ("iommu: apple-dart: Allow mismatched bypass support")
Signed-off-by: Qasim Ijaz <qasdev00@gmail.com>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Link: https://lore.kernel.org/r/20250314230102.11008-1-qasdev00@gmail.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
The global dma_dev trick was mostly because the old domain_alloc op
provided no context, so no way to know which IOMMU was to own the
pagetable, or if a suitable one even existed at all. In the new
multi-instance world with domain_alloc_paging this is no longer a
concern - now we know that the given device must be associated with a
valid IOMMU instance which provided the op to call in the first place,
and therefore that instance can and should be the pagetable owner. To
avoid worrying about the lifetime and stability of the rk_domain->iommus
list, and keep the lookups simple and efficient, we'll still stash a
dma_dev pointer, but now it's accurately per-domain.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Quentin Schulz <quentin.schulz@cherry.de>
Tested-by: Dang Huynh <danct12@riseup.net>
Reviewed-by: Nicolas Frattaroli <nicolas.frattaroli@collabora.com>
Tested-by: Nicolas Frattaroli <nicolas.frattaroli@collabora.com>
Link: https://lore.kernel.org/r/25dc948a7d35c8142c5719ac22bc523f8524d006.1741886382.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Currently Rockchip calls iommu_device_register() before it's finished
setting up the hardware and driver state, and as such it now gets
unhappy in various ways when registration starts working the way it was
always intended to, and probing client devices straight away. Reorder
the operations to ensure that what we're registering is a prepared and
functional IOMMU instance.
Fixes: bcb81ac6ae3c ("iommu: Get DT/ACPI parsing into the proper probe path")
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Quentin Schulz <quentin.schulz@cherry.de>
Tested-by: Dang Huynh <danct12@riseup.net>
Reviewed-by: Nicolas Frattaroli <nicolas.frattaroli@collabora.com>
Tested-by: Nicolas Frattaroli <nicolas.frattaroli@collabora.com>
Link: https://lore.kernel.org/r/e69532f00bf49d98322b96788edb7e2e305e4006.1741886382.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Now that DT-based probing is finally happening in the right order again,
it reveals an issue in Rockchip's of_xlate, which can now be called
during registration, but is using the global dma_dev which is only
assigned later. However, this makes little sense when we're already
looking up the correct IOMMU device, who should logically be the owner
of the devm allocation anyway.
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Fixes: bcb81ac6ae3c ("iommu: Get DT/ACPI parsing into the proper probe path")
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Quentin Schulz <quentin.schulz@cherry.de>
Tested-by: Dang Huynh <danct12@riseup.net>
Reviewed-by: Nicolas Frattaroli <nicolas.frattaroli@collabora.com>
Tested-by: Nicolas Frattaroli <nicolas.frattaroli@collabora.com>
Link: https://lore.kernel.org/r/771e91cf16b3048e93f657153b76905665878fa2.1741886382.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
There is a DoS concern with the hardware event queue shared among devices
passed through to VMs: too many translation failures belonging to VMs could
overflow the shared hardware event queue if those VMs or their VMMs don't
handle/recover the devices properly.
The MEV bit in the STE allows the SMMU HW to be configured to merge similar
event records, though there is no guarantee. Set it in a nested STE for DoS
mitigation.
In the future, we might want to enable MEV for non-nested cases too, such
as domain->type == IOMMU_DOMAIN_UNMANAGED or even IOMMU_DOMAIN_DMA.
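A sketch of setting the bit while composing the nested STE (the field macro
name is an assumption, styled after the driver's other STE field macros):
    /* Sketch: ask the HW to merge similar event records for this STE */
    target->data[1] |= cpu_to_le64(STRTAB_STE_1_MEV);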
Link: https://patch.msgid.link/r/8ed12feef67fc65273d0f5925f401a81f56acebe.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Aside from the IOPF framework, iommufd provides an additional pathway for
reporting hardware events: the vEVENTQ of the vIOMMU infrastructure.
Define an iommu_vevent_arm_smmuv3 uAPI structure, and report stage-1 events
in the threaded IRQ handler. Also add another four event record types that
can be forwarded to a VM.
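The uAPI record is essentially a verbatim copy of one 4x64-bit event queue
entry; a sketch of its likely shape:
    /* Sketch: one stage-1 event record, forwarded as-is to the VMM */
    struct iommu_vevent_arm_smmuv3 {
        __aligned_le64 evt[4];
    };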
Link: https://patch.msgid.link/r/5cf6719682fdfdabffdb08374cdf31ad2466d75a.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Use it to store all vSMMU-related data. The vsid (Virtual Stream ID) will
be the first use case. Since the vsid reader will be the eventq handler
that already holds a streams_mutex, reuse that to fence the vmaster too.
Also add a pair of arm_smmu_attach_prepare/commit_vmaster helpers to set
or unset the master->vmaster pointer. Put the helpers inside the existing
arm_smmu_attach_prepare/commit().
For identity/blocked ops that don't call arm_smmu_attach_prepare/commit(),
add a simpler arm_smmu_master_clear_vmaster helper to unset the vmaster.
Link: https://patch.msgid.link/r/a7f282e1a531279e25f06c651e95d56f6b120886.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
The handler will get the vDEVICE object from the given mdev and convert it
to its per-vIOMMU virtual ID, to mimic a real IOMMU driver.
Link: https://patch.msgid.link/r/1ea874d20e56d65e7cfd6e0e8e01bd3dbd038761.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
When attaching a device to a vIOMMU-based nested domain, vdev_id must be
present. Add code hard-requiring it, preparing for the vEVENTQ support in
the following patch. Then, update the TEST_F.
A HWPT-based nested domain will return a NULL new_viommu, so it carries no
such vDEVICE requirement.
Link: https://patch.msgid.link/r/4051ca8a819e51cb30de6b4fe9e4d94d956afe3d.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Similar to iommu_report_device_fault, this allows IOMMU drivers to report
vIOMMU events from threaded IRQ handlers to user space hypervisors.
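A sketch of how a driver's threaded IRQ handler might call this (the
signature is an assumption based on the description):
    /* Sketch: push a HW event record into the vIOMMU's vEVENTQ */
    if (vmaster)
        iommufd_viommu_report_event(vmaster->viommu,
                                    IOMMU_VEVENTQ_TYPE_ARM_SMMUV3,
                                    &vevt, sizeof(vevt));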
Link: https://patch.msgid.link/r/44be825042c8255e75d0151b338ffd8ba0e4920b.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
This is the reverse search vs. iommufd_viommu_find_dev, as drivers may want
to convert a (physical) struct device pointer to its virtual device ID for
an event injection into the user space VM.
Again, this avoids exposing more core structures to drivers than the
iommufd_viommu alone.
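A hypothetical shape for the lookup (helper name, signature, and the
consumer are all assumptions for illustration):
    /* Sketch: physical device -> per-vIOMMU virtual device ID */
    unsigned long vdev_id;
    int rc;
    rc = iommufd_viommu_get_vdev_id(viommu, dev, &vdev_id);
    if (!rc)
        report_event_for(vdev_id); /* hypothetical consumer */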
Link: https://patch.msgid.link/r/18b8e8bc1b8104d43b205d21602c036fd0804e56.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Introduce a new IOMMUFD_OBJ_VEVENTQ object for vIOMMU Event Queue that
provides user space (VMM) another FD to read the vIOMMU Events.
Allow a vIOMMU object to allocate vEVENTQs, with the condition that each
vIOMMU can only have a single vEVENTQ per type.
Add iommufd_veventq_alloc() with iommufd_veventq_ops for the new ioctl.
Link: https://patch.msgid.link/r/21acf0751dd5c93846935ee06f93b9c65eff5e04.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Rename the file, aligning with the new eventq object.
Link: https://patch.msgid.link/r/d726397e2d08028e25a1cb6eb9febefac35a32ba.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
The fault object was designed exclusively for hwpt's IO page faults (PRI).
But its queue implementation can be reused for other purposes too, such as
hardware IRQ and event injections to user space.
Meanwhile, a fault object holds a list of faults, so it's more accurate to
call it a "fault queue". Building on the reuse idea above, abstract a new
iommufd_eventq as a common structure embedded in struct iommufd_fault,
similar to hwpt_paging holding a common hwpt.
Add a common iommufd_eventq_ops and iommufd_eventq_init to prepare for an
IOMMUFD_OBJ_VEVENTQ (vIOMMU Event Queue).
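A sketch of the resulting embedding (field names are assumptions):
    /* Sketch: common eventq core, embedded in the fault object */
    struct iommufd_eventq {
        struct iommufd_object obj;
        struct iommufd_ctx *ictx;
        struct file *filep;
        spinlock_t lock;                /* protects the deliver list */
        struct list_head deliver;
        struct wait_queue_head wait_queue;
    };
    struct iommufd_fault {
        struct iommufd_eventq common;
        struct mutex mutex;             /* serializes response flows */
        struct xarray response;
    };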
Link: https://patch.msgid.link/r/e7336a857954209aabb466e0694aab323da95d90.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
The infrastructure of a fault object will be shared with a new vEVENTQ
object in a following change. Add an iommufd_fault_init helper and an
INIT_EVENTQ_FOPS macro for a vEVENTQ allocator to use too.
Reorder the iommufd_ctx_get and refcount_inc to keep them symmetrical with
iommufd_fault_fops_release().
Since the new vEVENTQ doesn't need "response" or its "mutex", keep the
xa_init_flags and mutex_init in their original locations.
Link: https://patch.msgid.link/r/a9522c521909baeb1bd843950b2490478f3d06e0.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
There is no need to keep them in the header. The vEVENTQ versions of these
two functions will be a different implementation and will not be shared
with this fault version. Thus, move them out of the header.
Link: https://patch.msgid.link/r/7eebe32f3d354799f5e28128c693c3c284740b21.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Add COMPILE_TEST support to Mediatek v1 so I have less chance of
breaking it again in future. This just needs the ARM dma-iommu API
stubbing out for now.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/a8d7c0b4393b360ce556f5f15e229d1e2fe73c83.1741702556.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
AMD IOMMU optionally supports up to 2K interrupts per function on newer
platforms. Support for this feature is indicated through Extended
Feature 2 Register (MMIO Offset 01A0h[NumIntRemapSup]). Allocate 2K IRTEs
per device when this support is available.
Co-developed-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Kishon Vijay Abraham I <kvijayab@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250307095822.2274-5-sarunkod@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
The AMD IOMMU can support both 512 and 2K interrupts on newer platforms.
Hence add the suffix "512" to the existing macros.
Signed-off-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250307095822.2274-4-sarunkod@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Commit 05152a049444 ("iommu/amd: Add slab-cache for irq remapping tables")
introduced a slab cache allocator. But a slab cache provides benefit only
when the allocation and deallocation of many identical objects is frequent.
The AMD IOMMU driver allocates an interrupt remapping table (IRT) when a
device driver requests an IRQ for the first time, and never frees it.
Hence the slab allocator does not provide any benefit here.
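The conversion is then the straightforward swap (sketch; the error path is
hypothetical):
    /* Sketch: a plain allocation, since IRTs are allocated once per
     * device and never freed */
    table = kzalloc(sizeof(*table), GFP_KERNEL);
    if (!table)
        goto out_unlock; /* hypothetical error path */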
Signed-off-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250307095822.2274-3-sarunkod@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Define a generic function `iommu_feature_set()` to set the values in the
feature control register and replace `iommu_set_inv_tlb_timeout()` with it.
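A sketch of such a helper (parameter names assumed):
    /* Sketch: read-modify-write one field of the feature control reg */
    static void iommu_feature_set(struct amd_iommu *iommu, u64 val,
                                  u64 mask, u8 shift)
    {
        u64 ctrl = readq(iommu->mmio_base + MMIO_CONTROL_OFFSET);
        ctrl &= ~(mask << shift);
        ctrl |= (val << shift) & (mask << shift);
        writeq(ctrl, iommu->mmio_base + MMIO_CONTROL_OFFSET);
    }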
Signed-off-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250307095822.2274-2-sarunkod@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
The warning for suspect probe conditions inadvertently got moved too
early in a prior respin - it happened to work out OK for fwspecs, but in
general still needs to be after the ops->probe_device call so drivers
which filter devices for themselves have a chance to do that.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Fixes: bcb81ac6ae3c ("iommu: Get DT/ACPI parsing into the proper probe path")
Link: https://lore.kernel.org/r/72a4853e7ef36e7c1c4ca171ed4ed8e1a463a61a.1741791691.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
The current code calls arm_smmu_rpm_use_autosuspend() during device attach,
which seems unusual as it sets the autosuspend delay and the
'use_autosuspend' flag for the smmu device. These parameters can simply be
set once during the smmu probe, and to avoid bouncing rpm states we can
simply mark_last_busy() during a client dev attach, as discussed in [1].
Move the handling of arm_smmu_rpm_use_autosuspend() to the SMMU probe
and modify the arm_smmu_rpm_put() function to mark_last_busy() before
calling __pm_runtime_put_autosuspend(). Additionally,
s/pm_runtime_put_autosuspend/__pm_runtime_put_autosuspend/ to help with
the refactor of the pm_runtime_put_autosuspend() API [2].
Link: https://lore.kernel.org/r/20241023164835.GF29251@willie-the-truck [1]
Link: https://git.kernel.org/linus/b7d46644e554 [2]
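A sketch of the adjusted put path after the move:
    /* Sketch: refresh the autosuspend timer before dropping the ref */
    static inline void arm_smmu_rpm_put(struct arm_smmu_device *smmu)
    {
        if (pm_runtime_enabled(smmu->dev)) {
            pm_runtime_mark_last_busy(smmu->dev);
            __pm_runtime_put_autosuspend(smmu->dev);
        }
    }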
Signed-off-by: Pranjal Shrivastava <praan@google.com>
Link: https://lore.kernel.org/r/20250123195636.4182099-1-praan@google.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
In hindsight, there were some crucial subtleties overlooked when moving
{of,acpi}_dma_configure() to driver probe time to allow waiting for
IOMMU drivers with -EPROBE_DEFER, and these have become an
ever-increasing source of problems. The IOMMU API has some fundamental
assumptions that iommu_probe_device() is called for every device added
to the system, in the order in which they are added. Calling it in a
random order or not at all dependent on driver binding leads to
malformed groups, a potential lack of isolation for devices with no
driver, and all manner of unexpected concurrency and race conditions.
We've attempted to mitigate the latter with point-fix bodges like
iommu_probe_device_lock, but it's a losing battle and the time has come
to bite the bullet and address the true source of the problem instead.
The crux of the matter is that the firmware parsing actually serves two
distinct purposes; one is identifying the IOMMU instance associated with
a device so we can check its availability, the second is actually
telling that instance about the relevant firmware-provided data for the
device. However the latter also depends on the former, and at the time
there was no good place to defer and retry that separately from the
availability check we also wanted for client driver probe.
Nowadays, though, we have a proper notion of multiple IOMMU instances in
the core API itself, and each one gets a chance to probe its own devices
upon registration, so we can finally make that work as intended for
DT/IORT/VIOT platforms too. All we need is for iommu_probe_device() to
be able to run the iommu_fwspec machinery currently buried deep in the
wrong end of {of,acpi}_dma_configure(). Luckily it turns out to be
surprisingly straightforward to bootstrap this transformation by pretty
much just calling the same path twice. At client driver probe time,
dev->driver is obviously set; conversely at device_add(), or a
subsequent bus_iommu_probe(), any device waiting for an IOMMU really
should *not* have a driver already, so we can use that as a condition to
disambiguate the two cases, and avoid recursing back into the IOMMU core
at the wrong times.
Obviously this isn't the nicest thing, but for now it gives us a
functional baseline to then unpick the layers in between without many
more awkward cross-subsystem patches. There are some minor side-effects
like dma_range_map potentially being created earlier, and some debug
prints being repeated, but these aren't significantly detrimental. Let's
make things work first, then deal with making them nice.
With the basic flow finally in the right order again, the next step is
probably turning the bus->dma_configure paths inside-out, since all we
really need from bus code is its notion of which device and input ID(s)
to parse the common firmware properties with...
Acked-by: Bjorn Helgaas <bhelgaas@google.com> # pci-driver.c
Acked-by: Rob Herring (Arm) <robh@kernel.org> # of/device.c
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/e3b191e6fd6ca9a1e84c5e5e40044faf97abb874.1740753261.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
At the moment, if of_iommu_configure() allocates dev->iommu itself via
iommu_fwspec_init(), then suffers a DT parsing failure, it cleans up the
fwspec but leaves the empty dev_iommu hanging around. So far this is
benign (if a tiny bit wasteful), but we'd like to be able to reason
about dev->iommu having a consistent and unambiguous lifecycle. Thus
make sure that the of_iommu cleanup undoes precisely whatever it did.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/d219663a3f23001f23d520a883ac622d70b4e642.1740753261.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Since iommu_init_device() was factored out, it is in fact the only
consumer of the ops which __iommu_probe_device() is resolving, so let it
do that itself rather than passing them in. This also puts the ops
lookup at a more logical point relative to the rest of the flow through
__iommu_probe_device().
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/fa4b6cfc67a352488b7f4e0b736008307ce9ac2e.1740753261.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
It turns out that deferred default domain creation leaves a subtle
race window during iommu_device_register() wherein a client driver may
asynchronously probe in parallel and get as far as performing DMA API
operations with dma-direct, only to be switched to iommu-dma underfoot
once the default domain attachment finally happens, with obviously
disastrous consequences. Even the wonky of_iommu_configure() path is at
risk, since iommu_fwspec_init() will no longer defer client probe as the
instance ops are (necessarily) already registered, and the "replay"
iommu_probe_device() call can see dev->iommu_group already set and so
think there's nothing to do either.
Fortunately we already have the right tool in the right place in the
form of iommu_device_use_default_domain(), which just needs to ensure
that said default domain is actually ready to *be* used. Deferring the
client probe shouldn't have too much impact, given that this only
happens while the IOMMU driver is probing, and thus due to kick the
deferred probe list again once it finishes.
Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
Fixes: 98ac73f99bc4 ("iommu: Require a default_domain for all iommu drivers")
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/e88b94c9b575034a2c98a48b3d383654cbda7902.1740753261.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
The drivers doing their own fwspec parsing have no need to call
iommu_fwspec_free() since fwspecs were moved into dev_iommu, as
returning an error from .probe_device will tear down the whole lot
anyway. Move it into the private interface now that it only serves
for of_iommu to clean up in an error case.
I have no idea what mtk_v1 was doing in effectively guaranteeing
a NULL fwspec would be dereferenced if no "iommus" DT property was
found, so add a check for that to at least make the code look sane.
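The added sanity check presumably amounts to something like (sketch):
    /* Sketch: bail out cleanly when no "iommus" property was parsed */
    struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
    if (!fwspec)
        return ERR_PTR(-ENODEV);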
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/36e245489361de2d13db22a510fa5c79e7126278.1740667667.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
intel_context_flush_present() is called in places where either the
scalable mode is disabled, or scalable mode is enabled but all PASID
entries are known to be non-present. In these cases, the flush_domains
path within intel_context_flush_present() will never execute. This dead
code is therefore removed.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250228092631.3425464-7-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Update PRI enablement to use the new method, similar to the AMD IOMMU
driver. Enable PRI in the device probe path and disable it when the device
is released. PRI is enabled throughout the device's iommu lifecycle. The
infrastructure for the iommu subsystem to handle iopf requests is created
during iopf enablement and released during iopf disablement. All invalid
page requests from the device are automatically handled by the iommu
subsystem if iopf is not enabled. Add iopf_refcount to track the iopf
enablement.
Convert the return type of intel_iommu_disable_iopf() to void, as there
is no way to handle a failure when disabling this feature. Make
intel_iommu_enable/disable_iopf() helpers global, as they will be used
beyond the current file in the subsequent patch.
The iopf_refcount is not protected by any lock. This is acceptable, as
there is no concurrent access to it in the current code. The following
patch will address this by moving it to the domain attach/detach paths,
which are protected by the iommu group mutex.
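A sketch of the refcounted enable path described above (field and helper
names from the description; details assumed):
    /* Sketch: only the first enable wires up the iopf infrastructure */
    int intel_iommu_enable_iopf(struct device *dev)
    {
        struct device_domain_info *info = dev_iommu_priv_get(dev);
        int ret;
        if (info->iopf_refcount) {
            info->iopf_refcount++;
            return 0;
        }
        ret = iopf_queue_add_device(info->iommu->iopf_queue, dev);
        if (ret)
            return ret;
        info->iopf_refcount = 1;
        return 0;
    }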
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250228092631.3425464-6-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Device ATS is currently enabled when a domain is attached to the device
and disabled when the domain is detached. This creates a limitation:
when the IOMMU is operating in scalable mode and IOPF is enabled, the
device's domain cannot be changed.
The previous code enables ATS when a domain is set to a device's RID and
disables it during RID domain switch. So, if a PASID is set with a
domain requiring PRI, ATS should remain enabled until the domain is
removed. During the PASID domain's lifecycle, if the RID's domain
changes, PRI will be disrupted because it depends on ATS, which is
disabled when the blocking domain is set for the device's RID.
Remove this limitation by moving ATS enablement to the device probe path.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250228092631.3425464-5-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Attaching an SVA domain should fail if SVA is not supported; move the check
for SVA support out of IOMMU_DEV_FEAT_SVA and into the attach path.
Also check when allocating an SVA domain, to match other drivers.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250228092631.3425464-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
If all the inlines are unwound, virt_to_dma_pfn() is simply:
    return page_to_pfn(virt_to_page(p)) << (PAGE_SHIFT - VTD_PAGE_SHIFT);
which can be rearranged to:
    (page_to_pfn(virt_to_page(p)) << PAGE_SHIFT) >> VTD_PAGE_SHIFT
The only caller is:
    ((uint64_t)virt_to_dma_pfn(tmp_page) << VTD_PAGE_SHIFT)
rearranged to:
    ((page_to_pfn(virt_to_page(tmp_page)) << PAGE_SHIFT) >> VTD_PAGE_SHIFT)
        << VTD_PAGE_SHIFT
which simplifies to:
    page_to_pfn(virt_to_page(tmp_page)) << PAGE_SHIFT
That is the same as virt_to_phys(tmp_page), so just remove all of this.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/8-v3-e797f4dc6918+93057-iommu_pages_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
We found that executing the command ./a.out &;reboot -f (where a.out is a
program that only executes a while(1) infinite loop) can probabilistically
cause the system to hang in the intel_iommu_shutdown() function, rendering
it unresponsive. Through analysis, we identified the following contributing
factors:
1. The reboot -f command does not prompt the kernel to notify the
application layer to perform cleanup actions, allowing the application to
continue running.
2. When the kernel reaches the intel_iommu_shutdown() function, only the
BSP (Bootstrap Processor) CPU is operational in the system.
3. During the execution of intel_iommu_shutdown(), the call to
down_write(&dmar_global_lock) causes the process to sleep and be scheduled
out.
4. At this point, the processor's interrupt flag is not cleared, so
interrupts can still be accepted. However, only legacy device and NMI
(Non-Maskable Interrupt) interrupts can come in, as other interrupt routing
has already been disabled. If no legacy or NMI interrupts occur at this
stage, the scheduler will not be able to run.
5. If the application scheduled at this time is executing a while(1)-type
loop, it cannot be preempted, leading to an infinite loop and causing the
system to become unresponsive.
To resolve this issue, the intel_iommu_shutdown() function should not
execute down_write(), which can potentially cause the process to be
scheduled out. Furthermore, since only the BSP is running during the later
stages of the reboot, there is no need for protection against parallel
access to the DMAR (DMA Remapping) unit. Therefore, the following lines
could be removed:
down_write(&dmar_global_lock);
up_write(&dmar_global_lock);
After testing, the issue has been resolved.
Fixes: 6c3a44ed3c55 ("iommu/vt-d: Turn off translations at shutdown")
Co-developed-by: Ethan Zhao <haifeng.zhao@linux.intel.com>
Signed-off-by: Ethan Zhao <haifeng.zhao@linux.intel.com>
Signed-off-by: Yunhui Cui <cuiyunhui@bytedance.com>
Link: https://lore.kernel.org/r/20250303062421.17929-1-cuiyunhui@bytedance.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
This is needed by ISP, which has DART0 with bypass and DART1/2 without.
Signed-off-by: Hector Martin <marcan@marcan.st>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Signed-off-by: Sasha Finkelstein <fnkl.kernel@gmail.com>
Reviewed-by: Sven Peter <sven@svenpeter.dev>
Link: https://lore.kernel.org/r/20250307-isp-dart-v3-2-684fe4489591@gmail.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
ISP needs this.
Signed-off-by: Hector Martin <marcan@marcan.st>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Signed-off-by: Sasha Finkelstein <fnkl.kernel@gmail.com>
Reviewed-by: Sven Peter <sven@svenpeter.dev>
Link: https://lore.kernel.org/r/20250307-isp-dart-v3-1-684fe4489591@gmail.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Consolidate the protection domain free code inside amd_iommu_domain_free()
and remove the protection_domain_free() function.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250227162320.5805-8-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Remove unused forward declaration.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250227162320.5805-7-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Move the function declarations inside the AMD_IOMMU_H definition.
Fixes: fd5dff9de4be ("iommu/amd: Modify set_dte_entry() to use 256-bit DTE helpers")
Fixes: 457da5764668 ("iommu/amd: Lock DTE before updating the entry with WRITE_ONCE()")
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250227162320.5805-6-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Remove outdated comment.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250227162320.5805-5-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|