Age | Commit message (Collapse) | Author |
|
Pull kvm updates from Paolo Bonzini:
"ARM:
- Host driver for GICv5, the next generation interrupt controller for
arm64, including support for interrupt routing, MSIs, interrupt
translation and wired interrupts
- Use FEAT_GCIE_LEGACY on GICv5 systems to virtualize GICv3 VMs on
GICv5 hardware, leveraging the legacy VGIC interface
- Userspace control of the 'nASSGIcap' GICv3 feature, allowing
userspace to disable support for SGIs w/o an active state on
hardware that previously advertised it unconditionally
- Map supporting endpoints with cacheable memory attributes on
systems with FEAT_S2FWB and DIC where KVM no longer needs to
perform cache maintenance on the address range
- Nested support for FEAT_RAS and FEAT_DoubleFault2, allowing the
guest hypervisor to inject external aborts into an L2 VM and take
traps of masked external aborts to the hypervisor
- Convert more system register sanitization to the config-driven
implementation
- Fixes to the visibility of EL2 registers, namely making VGICv3
system registers accessible through the VGIC device instead of the
ONE_REG vCPU ioctls
- Various cleanups and minor fixes
LoongArch:
- Add stat information for in-kernel irqchip
- Add tracepoints for CPUCFG and CSR emulation exits
- Enhance in-kernel irqchip emulation
- Various cleanups
RISC-V:
- Enable ring-based dirty memory tracking
- Improve perf kvm stat to report interrupt events
- Delegate illegal instruction trap to VS-mode
- MMU improvements related to upcoming nested virtualization
s390x
- Fixes
x86:
- Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O
APIC, PIC, and PIT emulation at compile time
- Share device posted IRQ code between SVM and VMX and harden it
against bugs and runtime errors
- Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups
O(1) instead of O(n)
- For MMIO stale data mitigation, track whether or not a vCPU has
access to (host) MMIO based on whether the page tables have MMIO
pfns mapped; using VFIO is prone to false negatives
- Rework the MSR interception code so that the SVM and VMX APIs are
more or less identical
- Recalculate all MSR intercepts from scratch on MSR filter changes,
instead of maintaining shadow bitmaps
- Advertise support for LKGS (Load Kernel GS base), a new instruction
that's loosely related to FRED, but is supported and enumerated
independently
- Fix a user-triggerable WARN that syzkaller found by setting the
vCPU in INIT_RECEIVED state (aka wait-for-SIPI), and then putting
the vCPU into VMX Root Mode (post-VMXON). Trying to detect every
possible path leading to architecturally forbidden states is hard
and even risks breaking userspace (if it goes from valid to valid
state but passes through invalid states), so just wait until
KVM_RUN to detect that the vCPU state isn't allowed
- Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling
interception of APERF/MPERF reads, so that a "properly" configured
VM can access APERF/MPERF. This has many caveats (APERF/MPERF
cannot be zeroed on vCPU creation or saved/restored on suspend and
resume, or preserved over thread migration let alone VM migration)
but can be useful whenever you're interested in letting Linux
guests see the effective physical CPU frequency in /proc/cpuinfo
- Reject KVM_SET_TSC_KHZ for vm file descriptors if vCPUs have been
created, as there's no known use case for changing the default
frequency for other VM types and it goes counter to the very reason
why the ioctl was added to the vm file descriptor. And also, there
would be no way to make it work for confidential VMs with a
"secure" TSC, so kill two birds with one stone
- Dynamically allocation the shadow MMU's hashed page list, and defer
allocating the hashed list until it's actually needed (the TDP MMU
doesn't use the list)
- Extract many of KVM's helpers for accessing architectural local
APIC state to common x86 so that they can be shared by guest-side
code for Secure AVIC
- Various cleanups and fixes
x86 (Intel):
- Preserve the host's DEBUGCTL.FREEZE_IN_SMM when running the guest.
Failure to honor FREEZE_IN_SMM can leak host state into guests
- Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter to
prevent L1 from running L2 with features that KVM doesn't support,
e.g. BTF
x86 (AMD):
- WARN and reject loading kvm-amd.ko instead of panicking the kernel
if the nested SVM MSRPM offsets tracker can't handle an MSR (which
is pretty much a static condition and therefore should never
happen, but still)
- Fix a variety of flaws and bugs in the AVIC device posted IRQ code
- Inhibit AVIC if a vCPU's ID is too big (relative to what hardware
supports) instead of rejecting vCPU creation
- Extend enable_ipiv module param support to SVM, by simply leaving
IsRunning clear in the vCPU's physical ID table entry
- Disable IPI virtualization, via enable_ipiv, if the CPU is affected
by erratum #1235, to allow (safely) enabling AVIC on such CPUs
- Request GA Log interrupts if and only if the target vCPU is
blocking, i.e. only if KVM needs a notification in order to wake
the vCPU
- Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to
the vCPU's CPUID model
- Accept any SNP policy that is accepted by the firmware with respect
to SMT and single-socket restrictions. An incompatible policy
doesn't put the kernel at risk in any way, so there's no reason for
KVM to care
- Drop a superfluous WBINVD (on all CPUs!) when destroying a VM and
use WBNOINVD instead of WBINVD when possible for SEV cache
maintenance
- When reclaiming memory from an SEV guest, only do cache flushes on
CPUs that have ever run a vCPU for the guest, i.e. don't flush the
caches for CPUs that can't possibly have cache lines with dirty,
encrypted data
Generic:
- Rework irqbypass to track/match producers and consumers via an
xarray instead of a linked list. Using a linked list leads to
O(n^2) insertion times, which is hugely problematic for use cases
that create large numbers of VMs. Such use cases typically don't
actually use irqbypass, but eliminating the pointless registration
is a future problem to solve as it likely requires new uAPI
- Track irqbypass's "token" as "struct eventfd_ctx *" instead of a
"void *", to avoid making a simple concept unnecessarily difficult
to understand
- Decouple device posted IRQs from VFIO device assignment, as binding
a VM to a VFIO group is not a requirement for enabling device
posted IRQs
- Clean up and document/comment the irqfd assignment code
- Disallow binding multiple irqfds to an eventfd with a priority
waiter, i.e. ensure an eventfd is bound to at most one irqfd
through the entire host, and add a selftest to verify eventfd:irqfd
bindings are globally unique
- Add a tracepoint for KVM_SET_MEMORY_ATTRIBUTES to help debug issues
related to private <=> shared memory conversions
- Drop guest_memfd's .getattr() implementation as the VFS layer will
call generic_fillattr() if inode_operations.getattr is NULL
- Fix issues with dirty ring harvesting where KVM doesn't bound the
processing of entries in any way, which allows userspace to keep
KVM in a tight loop indefinitely
- Kill off kvm_arch_{start,end}_assignment() and x86's associated
tracking, now that KVM no longer uses assigned_device_count as a
heuristic for either irqbypass usage or MDS mitigation
Selftests:
- Fix a comment typo
- Verify KVM is loaded when getting any KVM module param so that
attempting to run a selftest without kvm.ko loaded results in a
SKIP message about KVM not being loaded/enabled (versus some random
parameter not existing)
- Skip tests that hit EACCES when attempting to access a file, and
print a "Root required?" help message. In most cases, the test just
needs to be run with elevated permissions"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (340 commits)
Documentation: KVM: Use unordered list for pre-init VGIC registers
RISC-V: KVM: Avoid re-acquiring memslot in kvm_riscv_gstage_map()
RISC-V: KVM: Use find_vma_intersection() to search for intersecting VMAs
RISC-V: perf/kvm: Add reporting of interrupt events
RISC-V: KVM: Enable ring-based dirty memory tracking
RISC-V: KVM: Fix inclusion of Smnpm in the guest ISA bitmap
RISC-V: KVM: Delegate illegal instruction fault to VS mode
RISC-V: KVM: Pass VMID as parameter to kvm_riscv_hfence_xyz() APIs
RISC-V: KVM: Factor-out g-stage page table management
RISC-V: KVM: Add vmid field to struct kvm_riscv_hfence
RISC-V: KVM: Introduce struct kvm_gstage_mapping
RISC-V: KVM: Factor-out MMU related declarations into separate headers
RISC-V: KVM: Use ncsr_xyz() in kvm_riscv_vcpu_trap_redirect()
RISC-V: KVM: Implement kvm_arch_flush_remote_tlbs_range()
RISC-V: KVM: Don't flush TLB when PTE is unchanged
RISC-V: KVM: Replace KVM_REQ_HFENCE_GVMA_VMID_ALL with KVM_REQ_TLB_FLUSH
RISC-V: KVM: Rename and move kvm_riscv_local_tlb_sanitize()
RISC-V: KVM: Drop the return value of kvm_riscv_vcpu_aia_init()
RISC-V: KVM: Check kvm_riscv_vcpu_alloc_vector_context() return value
KVM: arm64: selftests: Add FEAT_RAS EL2 registers to get-reg-list
...
|
|
* amd/amd-vi:
iommu/amd: Fix geometry.aperture_end for V2 tables
iommu/amd: Wrap debugfs ABI testing symbols snippets in literal code blocks
iommu/amd: Add documentation for AMD IOMMU debugfs support
iommu/amd: Add debugfs support to dump IRT Table
iommu/amd: Add debugfs support to dump device table
iommu/amd: Add support for device id user input
iommu/amd: Add debugfs support to dump IOMMU command buffer
iommu/amd: Add debugfs support to dump IOMMU Capability registers
iommu/amd: Add debugfs support to dump IOMMU MMIO registers
iommu/amd: Refactor AMD IOMMU debugfs initial setup
iommu/amd: Enable PASID and ATS capabilities in the correct order
iommu/amd: Add efr[HATS] max v1 page table level
iommu/amd: Add HATDis feature support
|
|
The AMD IOMMU documentation seems pretty clear that the V2 table follows
the normal CPU expectation of sign extension. This is shown in
Figure 25: AMD64 Long Mode 4-Kbyte Page Address Translation
Where bits Sign-Extend [63:57] == [56]. This is typical for x86 which
would have three regions in the page table: lower, non-canonical, upper.
The manual describes that the V1 table does not sign extend in section
2.2.4 Sharing AMD64 Processor and IOMMU Page Tables GPA-to-SPA
Further, Vasant has checked this and indicates the HW has an addtional
behavior that the manual does not yet describe. The AMDv2 table does not
have the sign extended behavior when attached to PASID 0, which may
explain why this has gone unnoticed.
The iommu domain geometry does not directly support sign extended page
tables. The driver should report only one of the lower/upper spaces. Solve
this by removing the top VA bit from the geometry to use only the lower
space.
This will also make the iommu_domain work consistently on all PASID 0 and
PASID != 1.
Adjust dma_max_address() to remove the top VA bit. It now returns:
5 Level:
Before 0x1ffffffffffffff
After 0x0ffffffffffffff
4 Level:
Before 0xffffffffffff
After 0x7fffffffffff
Fixes: 11c439a19466 ("iommu/amd/pgtbl_v2: Fix domain max address")
Link: https://lore.kernel.org/all/8858d4d6-d360-4ef0-935c-bfd13ea54f42@amd.com/
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/0-v2-0615cc99b88a+1ce-amdv2_geo_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
IOMMU driver sends command to IOMMU hardware via command buffer. In cases
where IOMMU hardware fails to process commands in command buffer, dumping
it is a valuable input to debug the issue.
IOMMU hardware processes command buffer entry at offset equals to the head
pointer. Dumping just the entry at the head pointer may not always be
useful. The current head may not be pointing to the entry of the command
buffer which is causing the issue. IOMMU Hardware may have processed the
entry and updated the head pointer. So dumping the entire command buffer
gives a broad understanding of what hardware was/is doing. The command
buffer dump will have all entries from start to end of the command buffer.
Along with that, it will have a head and tail command buffer pointer
register dump to facilitate where the IOMMU driver and hardware are in
the command buffer for injecting and processing the entries respectively.
Command buffer is a per IOMMU data structure. So dumping on per IOMMU
basis.
eg.
-> To get command buffer dump for iommu<x> (say, iommu00)
#cat /sys/kernel/debug/iommu/amd/iommu00/cmdbuf
Signed-off-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250702093804.849-5-dheerajkumar.srivastava@amd.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Per the PCIe spec, behavior of the PASID capability is undefined if the
value of the PASID Enable bit changes while the Enable bit of the
function's ATS control register is Set. Unfortunately,
pdev_enable_caps() does exactly that by ordering enabling ATS for the
device before enabling PASID.
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Vasant Hegde <vasant.hegde@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jerry Snitselaar <jsnitsel@redhat.com>
Fixes: eda8c2860ab679 ("iommu/amd: Enable device ATS/PASID/PRI capabilities independently")
Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250703155433.6221-1-eahariha@linux.microsoft.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Now that we have a concise helper to create an MSI parent domain,
switch the AMD IOMMU remapping over to that.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Nam Cao <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20241204124549.607054-9-maz@kernel.org
Link: https://lore.kernel.org/r/92e5ae97a03e4ffc272349d0863cd2cc8f904c44.1750858125.git.namcao@linutronix.de
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
The EFR[HATS] bits indicate maximum host translation level supported by
IOMMU. Adding support to set the maximum host page table level as indicated
by EFR[HATS]. If the HATS=11b (reserved), the driver will attempt to use
guest page table for DMA API.
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Ankit Soni <Ankit.Soni@amd.com>
Link: https://lore.kernel.org/r/df0f8562c2a20895cc185c86f1a02c4d826fd597.1749016436.git.Ankit.Soni@amd.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Current AMD IOMMU assumes Host Address Translation (HAT) is always
supported, and Linux kernel enables this capability by default. However,
in case of emulated and virtualized IOMMU, this might not be the case.
For example,current QEMU-emulated AMD vIOMMU does not support host
translation for VFIO pass-through device, but the interrupt remapping
support is required for x2APIC (i.e. kvm-msi-ext-dest-id is also not
supported by the guest OS). This would require the guest kernel to boot
with guest kernel option iommu=pt to by-pass the initialization of
host (v1) table.
The AMD I/O Virtualization Technology (IOMMU) Specification Rev 3.10 [1]
introduces a new flag 'HATDis' in the IVHD 11h IOMMU attributes to indicate
that HAT is not supported on a particular IOMMU instance.
Therefore, modifies the AMD IOMMU driver to detect the new HATDis
attributes, and disable host translation and switch to use guest
translation if it is available. Otherwise, the driver will disable DMA
translation.
[1] https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/specifications/48882_IOMMU.pdf
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Ankit Soni <Ankit.Soni@amd.com>
Link: https://lore.kernel.org/r/8109b208f87b80e400c2abd24a2e44fcbc0763a5.1749016436.git.Ankit.Soni@amd.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Add plumbing to the AMD IOMMU driver to allow KVM to control whether or
not an IRTE is configured to generate GA log interrupts. KVM only needs a
notification if the target vCPU is blocking, so the vCPU can be awakened.
If a vCPU is preempted or exits to userspace, KVM clears is_run, but will
set the vCPU back to running when userspace does KVM_RUN and/or the vCPU
task is scheduled back in, i.e. KVM doesn't need a notification.
Unconditionally pass "true" in all KVM paths to isolate the IOMMU changes
from the KVM changes insofar as possible.
Opportunistically swap the ordering of parameters for amd_iommu_update_ga()
so that the match amd_iommu_activate_guest_mode().
Note, as of this writing, the AMD IOMMU manual doesn't list GALogIntr as
a non-cached field, but per AMD hardware architects, it's not cached and
can be safely updated without an invalidation.
Link: https://lore.kernel.org/all/b29b8c22-2fd4-4b5e-b755-9198874157c7@amd.com
Cc: Vasant Hegde <vasant.hegde@amd.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Link: https://lore.kernel.org/r/20250611224604.313496-62-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
WARN if KVM attempts to update IRTE entries when virtual APIC isn't fully
supported, as KVM should guard all such calls on IRQ posting being enabled.
Link: https://lore.kernel.org/r/20250611224604.313496-58-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
inhibited
If an IRQ can be posted to a vCPU, but AVIC is currently inhibited on the
vCPU, go through the dance of "affining" the IRTE to the vCPU, but leave
the actual IRTE in remapped mode. KVM already handles the case where AVIC
is inhibited => uninhibited with posted IRQs (see avic_set_pi_irte_mode()),
but doesn't handle the scenario where a postable IRQ comes along while AVIC
is inhibited.
Link: https://lore.kernel.org/r/20250611224604.313496-45-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that setting vCPU affinity is guarded with ir_list_lock, i.e. now that
avic_physical_id_entry can be safely accessed, set the pCPU info
straight-away when setting vCPU affinity. Putting the IRTE into posted
mode, and then immediately updating the IRTE a second time if the target
vCPU is running is wasteful and confusing.
This also fixes a flaw where a posted IRQ that arrives between putting
the IRTE into guest_mode and setting the correct destination could cause
the IOMMU to ring the doorbell on the wrong pCPU.
Link: https://lore.kernel.org/r/20250611224604.313496-44-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Split the guts of amd_iommu_update_ga() to a dedicated helper so that the
logic can be shared with flows that put the IRTE into posted mode.
Opportunistically move amd_iommu_update_ga() and its new helper above
amd_iommu_activate_guest_mode() so that it's all co-located.
Link: https://lore.kernel.org/r/20250611224604.313496-43-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Infer whether or not a vCPU should be marked running from the validity of
the pCPU on which it is running. amd_iommu_update_ga() already skips the
IRTE update if the pCPU is invalid, i.e. passing %true for is_run with an
invalid pCPU would be a blatant and egregrious KVM bug.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-42-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a comment to amd_iommu_update_ga() to document what fields it can
safely modify without issuing an invalidation of the IRTE, and to explain
its role in keeping GA IRTEs up-to-date.
Per page 93 of the IOMMU spec dated Feb 2025:
When virtual interrupts are enabled by setting MMIO Offset 0018h[GAEn] and
IRTE[GuestMode=1], IRTE[IsRun], IRTE[Destination], and if present IRTE[GATag],
are not cached by the IOMMU. Modifications to these fields do not require an
invalidation of the Interrupt Remapping Table.
Link: https://lore.kernel.org/all/9b7ceea3-8c47-4383-ad9c-1a9bbdc9044a@oracle.com
Cc: Joao Martins <joao.m.martins@oracle.com>
Link: https://lore.kernel.org/r/20250611224604.313496-41-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Split the vcpu_data structure that serves as a handoff from KVM to IOMMU
drivers into vendor specific structures. Overloading a single structure
makes the code hard to read and maintain, is *very* misleading as it
suggests that mixing vendors is actually supported, and bastardizing
Intel's posted interrupt descriptor address when AMD's IOMMU already has
its own structure is quite unnecessary.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-33-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Pass NULL to amd_ir_set_vcpu_affinity() to communicate "don't post to a
vCPU" now that there's no need to communicate information back to KVM
about the previous vCPU (KVM does its own tracking).
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-24-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use vcpu_data.pi_desc_addr instead of amd_iommu_pi_data.base to get the
GA root pointer. KVM is the only source of amd_iommu_pi_data.base, and
KVM's one and only path for writing amd_iommu_pi_data.base computes the
exact same value for vcpu_data.pi_desc_addr and amd_iommu_pi_data.base,
and fills amd_iommu_pi_data.base if and only if vcpu_data.pi_desc_addr is
valid, i.e. amd_iommu_pi_data.base is fully redundant.
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Delete the amd_ir_data.prev_ga_tag field now that all usage is
superfluous.
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
'arm/smmu/bindings', 'fsl/pamu', 'mediatek', 'renesas/ipmmu', 's390', 'intel/vt-d', 'amd/amd-vi' and 'core' into next
|
|
A BIOS upgrade has changed the IVRS DTE UID for a device that no
longer matches the UID in the SSDT. In this case there is only
one ACPI device on the system with that _HID but the _UID mismatch.
IVRS:
```
Subtable Type : F0 [Device Entry: ACPI HID Named Device]
Device ID : 0060
Data Setting (decoded below) : 40
INITPass : 0
EIntPass : 0
NMIPass : 0
Reserved : 0
System MGMT : 0
LINT0 Pass : 1
LINT1 Pass : 0
ACPI HID : "MSFT0201"
ACPI CID : 0000000000000000
UID Format : 02
UID Length : 09
UID : "\_SB.MHSP"
```
SSDT:
```
Device (MHSP)
{
Name (_ADR, Zero) // _ADR: Address
Name (_HID, "MSFT0201") // _HID: Hardware ID
Name (_UID, One) // _UID: Unique ID
```
To handle this case; while enumerating ACPI devices in
get_acpihid_device_id() count the number of matching ACPI devices with
a matching _HID. If there is exactly one _HID match then accept it even
if the UID doesn't match. Other operating systems allow this, but the
current IVRS spec leaves some ambiguity whether to allow or disallow it.
This should be clarified in future revisions of the spec. Output
'Firmware Bug' for this case to encourage it to be solved in the BIOS.
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250512173129.1274275-1-superm1@kernel.org
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
AMD IOMMU reserves the address range 0xfd00000000-0xffffffffff for
the hypertransport protocol (HT) and has special meaning. Hence devices
cannot use this address range for the DMA. However on some AMD platforms
this HT range is shifted to the very top of the address space and new
feature bit `HTRangeIgnore` is introduced. When this feature bit is on,
IOMMU treats the GPA access to the legacy HT range as regular GPA access.
Signed-off-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250317055020.25214-1-sarunkod@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Synchronize RCU when unregistering KVM's GA log notifier to ensure all
in-flight interrupt handlers complete before KVM-the module is unloaded.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250315031048.2374109-1-seanjc@google.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
No external drivers use these interfaces anymore. Furthermore, no existing
iommu drivers implement anything in the callbacks. Remove them to avoid
dead code.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/20250418080130.1844424-9-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
None of the drivers implement anything here anymore, remove the dead code.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250418080130.1844424-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
WARN if KVM attempts to set vCPU affinity when posted interrupts aren't
enabled, as KVM shouldn't try to enable posting when they're unsupported,
and the IOMMU driver darn well should only advertise posting support when
AMD_IOMMU_GUEST_IR_VAPIC() is true.
Note, KVM consumes is_guest_mode only on success.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Return -EINVAL instead of success if amd_ir_set_vcpu_affinity() is
invoked without use_vapic; lying to KVM about whether or not the IRTE was
configured to post IRQs is all kinds of bad.
Fixes: d98de49a53e4 ("iommu/amd: Enable vAPIC interrupt remapping mode by default")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Use the actual size of the irq_table allocation, limiting to 128 due to
the HW alignment needs.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/22-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Use iommu_alloc_pages_node_sz() instead.
AMD and Intel are both using 4K pages for these structures since those
drivers only work on 4K PAGE_SIZE.
riscv is also spec'd to use SZ_4K.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/21-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Use iommu_free_pages() instead.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/6-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Now that we have a folio under the allocation iommu_free_pages() can know
the order of the original allocation and do the correct thing to free it.
The next patch will rename iommu_free_page() to iommu_free_pages() so we
have naming consistency with iommu_alloc_pages_node().
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
'rockchip', 's390', 'core', 'intel/vt-d' and 'amd/amd-vi' into next
|
|
AMD IOMMU optionally supports up to 2K interrupts per function on newer
platforms. Support for this feature is indicated through Extended
Feature 2 Register (MMIO Offset 01A0h[NumIntRemapSup]). Allocate 2K IRTEs
per device when this support is available.
Co-developed-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Kishon Vijay Abraham I <kvijayab@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250307095822.2274-5-sarunkod@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
AMD iommu can support both 512 and 2K interrupts on newer platform. Hence
add suffix "512" to the existing macros.
Signed-off-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250307095822.2274-4-sarunkod@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Commit 05152a049444 ("iommu/amd: Add slab-cache for irq remapping tables")
introduces slab cache allocator. But slab cache allocator provides benefit
only when the allocation and deallocation of many identical objects is
frequent. The AMD IOMMU driver allocates Interrupt remapping table (IRT)
when device driver requests IRQ for the first time and never frees it.
Hence the slab allocator does not provide any benefit here.
Signed-off-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250307095822.2274-3-sarunkod@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Consolidate protection domain free code inside amd_iommu_domain_free()
and remove protection_domain_free() function.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250227162320.5805-8-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Useful for debugging ILLEGAL_DEV_TABLE_ENTRY events as some of the
DTE settings depends on Control register settings.
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250227162320.5805-2-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
When updating the page table root field on the DTE, avoid overwriting any
bits that are already set. The earlier call to make_clear_dte() writes
default values that all DTEs must have set (currently DTE[V]), and those
must be preserved.
Currently this doesn't cause problems since the page table root update is
the first field that is set after make_clear_dte() is called, and
DTE_FLAG_V is set again later along with the permission bits (IR/IW).
Remove this redundant assignment too.
Fixes: fd5dff9de4be ("iommu/amd: Modify set_dte_entry() to use 256-bit DTE helpers")
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20250106191413.3107140-1-alejandro.j.jimenez@oracle.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu updates from Joerg Roedel:
"Core changes:
- PASID support for the blocked_domain
ARM-SMMU Updates:
- SMMUv2:
- Implement per-client prefetcher configuration on Qualcomm SoCs
- Support for the Adreno SMMU on Qualcomm's SDM670 SOC
- SMMUv3:
- Pretty-printing of event records
- Drop the ->domain_alloc_paging implementation in favour of
domain_alloc_paging_flags(flags==0)
- IO-PGTable:
- Generalisation of the page-table walker to enable external
walkers (e.g. for debugging unexpected page-faults from the GPU)
- Minor fix for handling concatenated PGDs at stage-2 with 16KiB
pages
- Misc:
- Clean-up device probing and replace the crufty probe-deferral
hack with a more robust implementation of
arm_smmu_get_by_fwnode()
- Device-tree binding updates for a bunch of Qualcomm platforms
Intel VT-d Updates:
- Remove domain_alloc_paging()
- Remove capability audit code
- Draining PRQ in sva unbind path when FPD bit set
- Link cache tags of same iommu unit together
AMD-Vi Updates:
- Use CMPXCHG128 to update DTE
- Cleanups of the domain_alloc_paging() path
RiscV IOMMU:
- Platform MSI support
- Shutdown support
Rockchip IOMMU:
- Add DT bindings for Rockchip RK3576
More smaller fixes and cleanups"
* tag 'iommu-updates-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (66 commits)
iommu: Use str_enable_disable-like helpers
iommu/amd: Fully decode all combinations of alloc_paging_flags
iommu/amd: Move the nid to pdom_setup_pgtable()
iommu/amd: Change amd_iommu_pgtable to use enum protection_domain_mode
iommu/amd: Remove type argument from do_iommu_domain_alloc() and related
iommu/amd: Remove dev == NULL checks
iommu/amd: Remove domain_alloc()
iommu/amd: Remove unused amd_iommu_domain_update()
iommu/riscv: Fixup compile warning
iommu/arm-smmu-v3: Add missing #include of linux/string_choices.h
iommu/arm-smmu-v3: Use str_read_write helper w/ logs
iommu/io-pgtable-arm: Add way to debug pgtable walk
iommu/io-pgtable-arm: Re-use the pgtable walk for iova_to_phys
iommu/io-pgtable-arm: Make pgtable walker more generic
iommu/arm-smmu: Add ACTLR data and support for qcom_smmu_500
iommu/arm-smmu: Introduce ACTLR custom prefetcher settings
iommu/arm-smmu: Add support for PRR bit setup
iommu/arm-smmu: Refactor qcom_smmu structure to include single pointer
iommu/arm-smmu: Re-enable context caching in smmu reset operation
iommu/vt-d: Link cache tags of same iommu unit together
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull interrupt subsystem updates from Thomas Gleixner:
- Consolidate the machine_kexec_mask_interrupts() by providing a
generic implementation and replacing the copy & pasta orgy in the
relevant architectures.
- Prevent unconditional operations on interrupt chips during kexec
shutdown, which can trigger warnings in certain cases when the
underlying interrupt has been shut down before.
- Make the enforcement of interrupt handling in interrupt context
unconditionally available, so that it actually works for non x86
related interrupt chips. The earlier enablement for ARM GIC chips set
the required chip flag, but did not notice that the check was hidden
behind a config switch which is not selected by ARM[64].
- Decrapify the handling of deferred interrupt affinity setting.
Some interrupt chips require that affinity changes are made from the
context of handling an interrupt to avoid certain race conditions.
For x86 this was the default, but with interrupt remapping this
requirement was lifted and a flag was introduced which tells the core
code that affinity changes can be done in any context. Unrestricted
affinity changes are the default for the majority of interrupt chips.
RISCV has the requirement to add the deferred mode to one of it's
interrupt controllers, but with the original implementation this
would require to add the any context flag to all other RISC-V
interrupt chips. That's backwards, so reverse the logic and require
that chips, which need the deferred mode have to be marked
accordingly. That avoids chasing the 'sane' chips and marking them.
- Add multi-node support to the Loongarch AVEC interrupt controller
driver.
- The usual tiny cleanups, fixes and improvements all over the place.
* tag 'irq-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/generic_chip: Export irq_gc_mask_disable_and_ack_set()
genirq/timings: Add kernel-doc for a function parameter
genirq: Remove IRQ_MOVE_PCNTXT and related code
x86/apic: Convert to IRQCHIP_MOVE_DEFERRED
genirq: Provide IRQCHIP_MOVE_DEFERRED
hexagon: Remove GENERIC_PENDING_IRQ leftover
ARC: Remove GENERIC_PENDING_IRQ
genirq: Remove handle_enforce_irqctx() wrapper
genirq: Make handle_enforce_irqctx() unconditionally available
irqchip/loongarch-avec: Add multi-nodes topology support
irqchip/ts4800: Replace seq_printf() by seq_puts()
irqchip/ti-sci-inta : Add module build support
irqchip/ti-sci-intr: Add module build support
irqchip/irq-brcmstb-l2: Replace brcmstb_l2_mask_and_ack() by generic function
irqchip: keystone: Use syscon_regmap_lookup_by_phandle_args
genirq/kexec: Prevent redundant IRQ masking by checking state before shutdown
kexec: Consolidate machine_kexec_mask_interrupts() implementation
genirq: Reuse irq_thread_fn() for forced thread case
genirq: Move irq_thread_fn() further up in the code
|
|
'rockchip', 'riscv', 'core', 'intel/vt-d' and 'amd/amd-vi' into next
|
|
Currently AMD does not support
IOMMU_HWPT_ALLOC_PASID | IOMMU_HWPT_ALLOC_DIRTY_TRACKING
It should be rejected. Instead it creates a V1 domain without dirty
tracking support.
Use a switch to fully decode the flags.
Fixes: ce2cd175469f ("iommu/amd: Enhance amd_iommu_domain_alloc_user()")
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/7-v2-9776c53c2966+1c7-amd_paging_flags_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
The only thing that uses the nid is the io_pgtable code, and it should be
set before calling alloc_io_pgtable_ops() to ensure that the top levels
are allocated on the correct nid.
Since dev is never NULL now we can just do this trivially and remove the
other uses of nid. SVA and identity code paths never use it since they
don't use io_pgtable.
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/6-v2-9776c53c2966+1c7-amd_paging_flags_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Currently it uses enum io_pgtable_fmt which is from the io pagetable code
and most of the enum values are invalid. protection_domain_mode is
internal the driver and has the only two valid values.
Fix some signatures and variables to use the right type as well.
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v2-9776c53c2966+1c7-amd_paging_flags_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
do_iommu_domain_alloc() is only called from
amd_iommu_domain_alloc_paging_flags() so type is always
IOMMU_DOMAIN_UNMANAGED. Remove type and all the dead conditionals checking
it.
IOMMU_DOMAIN_IDENTITY checks are similarly obsolete as the conversion to
the global static identity domain removed those call paths.
The caller of protection_domain_alloc() should set the type, fix the miss
in the SVA code.
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4-v2-9776c53c2966+1c7-amd_paging_flags_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
This is no longer possible, amd_iommu_domain_alloc_paging_flags() is never
called with dev = NULL from the core code. Similarly
get_amd_iommu_from_dev() can never be NULL either.
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/3-v2-9776c53c2966+1c7-amd_paging_flags_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
IOMMU drivers should not be sensitive to the domain type, a paging domain
should be created based only on the flags passed in, the same for all
callers.
AMD was using the domain_alloc() path to force VFIO into a v1 domain type,
because v1 gives higher performance. However now that
IOMMU_HWPT_ALLOC_PASID is present, and a NULL device is not possible,
domain_alloc_paging_flags() will do the right thing for VFIO.
When invoked from VFIO flags will be 0 and the amd_iommu_pgtable type of
domain will be selected. This is v1 by default unless the kernel command
line has overridden it to v2.
If the admin is forcing v2 assume they know what they are doing so force
it everywhere, including for VFIO.
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v2-9776c53c2966+1c7-amd_paging_flags_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
All the callers have been removed by the below commit, remove the
implementation and prototypes.
Fixes: 322d889ae7d3 ("iommu/amd: Remove amd_iommu_domain_update() from page table freeing")
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v2-9776c53c2966+1c7-amd_paging_flags_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Instead of marking individual interrupts as safe to be migrated in
arbitrary contexts, mark the interrupt chips, which require the interrupt
to be moved in actual interrupt context, with the new IRQCHIP_MOVE_DEFERRED
flag. This makes more sense because this is a per interrupt chip property
and not restricted to individual interrupts.
That flips the logic from the historical opt-out to a opt-in model. This is
simpler to handle for other architectures, which default to unrestricted
affinity setting. It also allows to cleanup the redundant core logic
significantly.
All interrupt chips, which belong to a top-level domain sitting directly on
top of the x86 vector domain are marked accordingly, unless the related
setup code marks the interrupts with IRQ_MOVE_PCNTXT, i.e. XEN.
No functional change intended.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/all/20241210103335.563277044@linutronix.de
|
|
The blocked domain can be extended to park PASID of a device to be the
DMA blocking state. By this the remove_dev_pasid() op is dropped.
Remove PASID from old domain and device GCR3 table. No need to attach
PASID to the blocked domain as clearing PASID from GCR3 table will make
sure all DMAs for that PASID are blocked.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20241204122928.11987-7-yi.l.liu@intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|