Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd updates from Jason Gunthorpe:
"Two minor fixes:
- Make the selftest work again on x86 platforms with iommus enabled
- Fix a compiler warning in the userspace kselftest"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd:
iommufd: Register iommufd mock devices with fwspec
iommu/selftest: prevent use of uninitialized variable
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu updates from Joerg Roedel:
- Inte VT-d:
- IOMMU driver updated to the latest VT-d specification
- Don't enable PRS if PDS isn't supported
- Replace snprintf with scnprintf
- Fix legacy mode page table dump through debugfs
- Miscellaneous cleanups
- AMD-Vi:
- Support kdump boot when SNP is enabled
- Apple-DART:
- 4-level page-table support
- RISC-V IOMMU:
- ACPI support
- Small number of miscellaneous cleanups and fixes
* tag 'iommu-updates-v6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (22 commits)
iommu/vt-d: Disallow dirty tracking if incoherent page walk
iommu/vt-d: debugfs: Avoid dumping context command register
iommu/vt-d: Removal of Advanced Fault Logging
iommu/vt-d: PRS isn't usable if PDS isn't supported
iommu/vt-d: Remove LPIG from page group response descriptor
iommu/vt-d: Drop unused cap_super_offset()
iommu/vt-d: debugfs: Fix legacy mode page table dump logic
iommu/vt-d: Replace snprintf with scnprintf in dmar_latency_snapshot()
iommu/io-pgtable-dart: Fix off by one error in table index check
iommu/riscv: Add ACPI support
ACPI: scan: Add support for RISC-V in acpi_iommu_configure_id()
ACPI: RISC-V: Add support for RIMT
iommu/omap: Use int type to store negative error codes
iommu/apple-dart: Clear stream error indicator bits for T8110 DARTs
iommu/amd: Skip enabling command/event buffers for kdump
crypto: ccp: Skip SEV and SNP INIT for kdump boot
iommu/amd: Reuse device table for kdump
iommu/amd: Add support to remap/unmap IOMMU buffers for kdump
iommu/apple-dart: Add 4-level page table support
iommu/io-pgtable-dart: Add 4-level page table support
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping updates from Marek Szyprowski:
- Refactoring of DMA mapping API to physical addresses as the primary
interface instead of page+offset parameters
This gets much closer to Matthew Wilcox's long term wish for
struct-pageless IO to cacheable DRAM and is supporting memdesc
project which seeks to substantially transform how struct page works.
An advantage of this approach is the possibility of introducing
DMA_ATTR_MMIO, which covers existing 'dma_map_resource' flow in the
common paths, what in turn lets to use recently introduced
dma_iova_link() API to map PCI P2P MMIO without creating struct page
Developped by Leon Romanovsky and Jason Gunthorpe
- Minor clean-up by Petr Tesarik and Qianfeng Rong
* tag 'dma-mapping-6.18-2025-09-30' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
kmsan: fix missed kmsan_handle_dma() signature conversion
mm/hmm: properly take MMIO path
mm/hmm: migrate to physical address-based DMA mapping API
dma-mapping: export new dma_*map_phys() interface
xen: swiotlb: Open code map_resource callback
dma-mapping: implement DMA_ATTR_MMIO for dma_(un)map_page_attrs()
kmsan: convert kmsan_handle_dma to use physical addresses
dma-mapping: convert dma_direct_*map_page to be phys_addr_t based
iommu/dma: implement DMA_ATTR_MMIO for iommu_dma_(un)map_phys()
iommu/dma: rename iommu_dma_*map_page to iommu_dma_*map_phys
dma-mapping: rename trace_dma_*map_page to trace_dma_*map_phys
dma-debug: refactor to use physical addresses for page mapping
iommu/dma: implement DMA_ATTR_MMIO for dma_iova_link().
dma-mapping: introduce new DMA attribute to indicate MMIO memory
swiotlb: Remove redundant __GFP_NOWARN
dma-direct: clean up the logic in __dma_direct_alloc_pages()
|
|
Since the bus ops were retired the iommu subsystem changed to using fwspec
to match the iommu driver to the iommu device. If a device has a NULL
fwspec then it is matched to the first iommu driver with a NULL fwspec,
effectively disabling support for systems with more than one non-fwspec
iommu driver.
Thus, if the iommufd selfest are run in an x86 system that registers a
non-fwspec iommu driver they fail to bind their mock devices to the mock
iommu driver.
Fix this by allocating a software fwnode for mock iommu driver's
iommu_device, and set it to the device which mock iommu driver created.
This is done by adding a new helper iommu_mock_device_add() which abuses
the internals of the fwspec system to establish a fwspec before the device
is added and is careful not to leak it. A matching dummy fwspec is
automatically added to the mock iommu driver.
Test by "make -C toosl/testing/selftests TARGETS=iommu run_tests":
PASSED: 229 / 229 tests passed.
In addition, this issue is also can be found on amd platform, and
also tested on a amd machine.
Link: https://patch.msgid.link/r/20250925054730.3877-1-kanie@linux.alibaba.com
Fixes: 17de3f5fdd35 ("iommu: Retire bus ops")
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Qinyun Tan <qinyuntan@linux.alibaba.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
'amd/amd-vi' into next
|
|
Dirty page tracking relies on the IOMMU atomically updating the dirty bit
in the paging-structure entry. For this operation to succeed, the paging-
structure memory must be coherent between the IOMMU and the CPU. In
another word, if the iommu page walk is incoherent, dirty page tracking
doesn't work.
The Intel VT-d specification, Section 3.10 "Snoop Behavior" states:
"Remapping hardware encountering the need to atomically update A/EA/D bits
in a paging-structure entry that is not snooped will result in a non-
recoverable fault."
To prevent an IOMMU from being incorrectly configured for dirty page
tracking when it is operating in an incoherent mode, mark SSADS as
supported only when both ecap_slads and ecap_smpwc are supported.
Fixes: f35f22cc760e ("iommu/vt-d: Access/Dirty bit support for SS domains")
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250924083447.123224-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd fixes from Jason Gunthorpe:
"Fix two user triggerable use-after-free issues:
- Possible race UAF setting up mmaps
- Syzkaller found UAF when erroring an file descriptor creation ioctl
due to the fput() work queue"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd:
iommufd/selftest: Update the fail_nth limit
iommufd: WARN if an object is aborted with an elevated refcount
iommufd: Fix race during abort for file descriptors
iommufd: Fix refcounting race during mmap
|
|
If something holds a refcount then it is at risk of UAFing. For abort
paths we expect the caller to never share the object with a parallel
thread and to clean up any refcounts it obtained on its own.
Add the missing dec inside iommufd_hwpt_paging_alloc() during error unwind
by making iommufd_hw_pagetable_attach/detach() proper pairs.
Link: https://patch.msgid.link/r/2-v1-02cd136829df+31-iommufd_syz_fput_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
fput() doesn't actually call file_operations release() synchronously, it
puts the file on a work queue and it will be released eventually.
This is normally fine, except for iommufd the file and the iommufd_object
are tied to gether. The file has the object as it's private_data and holds
a users refcount, while the object is expected to remain alive as long as
the file is.
When the allocation of a new object aborts before installing the file it
will fput() the file and then go on to immediately kfree() the obj. This
causes a UAF once the workqueue completes the fput() and tries to
decrement the users refcount.
Fix this by putting the core code in charge of the file lifetime, and call
__fput_sync() during abort to ensure that release() is called before
kfree. __fput_sync() is a bit too tricky to open code in all the object
implementations. Instead the objects tell the core code where the file
pointer is and the core will take care of the life cycle.
If the object is successfully allocated then the file will hold a users
refcount and the iommufd_object cannot be destroyed.
It is worth noting that close(); ioctl(IOMMU_DESTROY); doesn't have an
issue because close() is already using a synchronous version of fput().
The UAF looks like this:
BUG: KASAN: slab-use-after-free in iommufd_eventq_fops_release+0x45/0xc0 drivers/iommu/iommufd/eventq.c:376
Write of size 4 at addr ffff888059c97804 by task syz.0.46/6164
CPU: 0 UID: 0 PID: 6164 Comm: syz.0.46 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/18/2025
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0xcd/0x630 mm/kasan/report.c:482
kasan_report+0xe0/0x110 mm/kasan/report.c:595
check_region_inline mm/kasan/generic.c:183 [inline]
kasan_check_range+0x100/0x1b0 mm/kasan/generic.c:189
instrument_atomic_read_write include/linux/instrumented.h:96 [inline]
atomic_fetch_sub_release include/linux/atomic/atomic-instrumented.h:400 [inline]
__refcount_dec include/linux/refcount.h:455 [inline]
refcount_dec include/linux/refcount.h:476 [inline]
iommufd_eventq_fops_release+0x45/0xc0 drivers/iommu/iommufd/eventq.c:376
__fput+0x402/0xb70 fs/file_table.c:468
task_work_run+0x14d/0x240 kernel/task_work.c:227
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
exit_to_user_mode_loop+0xeb/0x110 kernel/entry/common.c:43
exit_to_user_mode_prepare include/linux/irq-entry-common.h:225 [inline]
syscall_exit_to_user_mode_work include/linux/entry-common.h:175 [inline]
syscall_exit_to_user_mode include/linux/entry-common.h:210 [inline]
do_syscall_64+0x41c/0x4c0 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Link: https://patch.msgid.link/r/1-v1-02cd136829df+31-iommufd_syz_fput_jgg@nvidia.com
Cc: stable@vger.kernel.org
Fixes: 07838f7fd529 ("iommufd: Add iommufd fault object")
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Nirmoy Das <nirmoyd@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Reported-by: syzbot+80620e2d0d0a33b09f93@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/68c8583d.050a0220.2ff435.03a2.GAE@google.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
The owner object of the imap can be destroyed while the imap remains in
the mtree. So access to the imap pointer without holding locks is racy
with destruction.
The imap is safe to access outside the lock once a users refcount is
obtained, the owner object cannot start destruction until users is 0.
Thus the users refcount should not be obtained at the end of
iommufd_fops_mmap() but instead inside the mtree lock held around the
mtree_load(). Move the refcount there and use refcount_inc_not_zero() as
we can have a 0 refcount inside the mtree during destruction races.
Link: https://patch.msgid.link/r/0-v1-e6faace50971+3cc-iommufd_mmap_fix_jgg@nvidia.com
Cc: stable@vger.kernel.org
Fixes: 56e9a0d8e53f ("iommufd: Add mmap interface")
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
The register-based cache invalidation interface is in the process of being
replaced by the queued invalidation interface. The VT-d architecture
allows hardware implementations with a queued invalidation interface to
not implement the registers used for cache invalidation. Currently, the
debugfs interface dumps the Context Command Register unconditionally,
which is not reasonable.
Remove it to avoid potential access to non-present registers.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250917025051.143853-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
The advanced fault logging has been removed from the specification since
v4.0. Linux doesn't implement advanced fault logging functionality, but
it currently dumps the advanced logging registers through debugfs. Remove
the dumping of these advanced fault logging registers through debugfs to
avoid potential access to non-present registers.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250917024850.143801-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
The specification, Section 7.10, "Software Steps to Drain Page Requests &
Responses," requires software to submit an Invalidation Wait Descriptor
(inv_wait_dsc) with the Page-request Drain (PD=1) flag set, along with
the Invalidation Wait Completion Status Write flag (SW=1). It then waits
for the Invalidation Wait Descriptor's completion.
However, the PD field in the Invalidation Wait Descriptor is optional, as
stated in Section 6.5.2.9, "Invalidation Wait Descriptor":
"Page-request Drain (PD): Remapping hardware implementations reporting
Page-request draining as not supported (PDS = 0 in ECAP_REG) treat this
field as reserved."
This implies that if the IOMMU doesn't support the PDS capability, software
can't drain page requests and group responses as expected.
Do not enable PCI/PRI if the IOMMU doesn't support PDS.
Reported-by: Joel Granados <joel.granados@kernel.org>
Closes: https://lore.kernel.org/r/20250909-jag-pds-v1-1-ad8cba0e494e@kernel.org
Fixes: 66ac4db36f4c ("iommu/vt-d: Add page request draining support")
Cc: stable@vger.kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250915062946.120196-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Bit 66 in the page group response descriptor used to be the LPIG (Last
Page in Group), but it was marked as Reserved since Specification 4.0.
Remove programming on this bit to make it consistent with the latest
specification.
Existing hardware all treats bit 66 of the page group response descriptor
as "ignored", therefore this change doesn't break any existing hardware.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250901053943.1708490-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
The macro is unused. Drop the dead code.
Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Link: https://lore.kernel.org/r/20250913015024.81186-1-yury.norov@gmail.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
In legacy mode, SSPTPTR is ignored if TT is not 00b or 01b. SSPTPTR
maybe uninitialized or zero in that case and may cause oops like:
Oops: general protection fault, probably for non-canonical address
0xf00087d3f000f000: 0000 [#1] SMP NOPTI
CPU: 2 UID: 0 PID: 786 Comm: cat Not tainted 6.16.0 #191 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-5.fc42 04/01/2014
RIP: 0010:pgtable_walk_level+0x98/0x150
RSP: 0018:ffffc90000f279c0 EFLAGS: 00010206
RAX: 0000000040000000 RBX: ffffc90000f27ab0 RCX: 000000000000001e
RDX: 0000000000000003 RSI: f00087d3f000f000 RDI: f00087d3f0010000
RBP: ffffc90000f27a00 R08: ffffc90000f27a98 R09: 0000000000000002
R10: 0000000000000000 R11: 0000000000000000 R12: f00087d3f000f000
R13: 0000000000000000 R14: 0000000040000000 R15: ffffc90000f27a98
FS: 0000764566dcb740(0000) GS:ffff8881f812c000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000764566d44000 CR3: 0000000109d81003 CR4: 0000000000772ef0
PKRU: 55555554
Call Trace:
<TASK>
pgtable_walk_level+0x88/0x150
domain_translation_struct_show.isra.0+0x2d9/0x300
dev_domain_translation_struct_show+0x20/0x40
seq_read_iter+0x12d/0x490
...
Avoid walking the page table if TT is not 00b or 01b.
Fixes: 2b437e804566 ("iommu/vt-d: debugfs: Support dumping a specified page table")
Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20250814163153.634680-1-vineeth@bitbyteword.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
snprintf() returns the number of bytes that would have been written, not
the number actually written. Using this for offset tracking can cause
buffer overruns if truncation occurs.
Replace snprintf() with scnprintf() to ensure the offset stays within
bounds.
Since scnprintf() never returns a negative value, and zero is not possible
in this context because 'bytes' starts at 0 and 'size - bytes' is
DEBUG_BUFFER_SIZE in the first call, which is large enough to hold the
string literals used, the return value is always positive. An integer
overflow is also completely out of reach here due to the small and fixed
buffer size. The error check in latency_show_one() is therefore
unnecessary. Remove it and make dmar_latency_snapshot() return void.
Signed-off-by: Seyediman Seyedarab <ImanDevel@gmail.com>
Link: https://lore.kernel.org/r/20250731225048.131364-1-ImanDevel@gmail.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
The AMD IOMMU host page table implementation supports dynamic page table levels
(up to 6 levels), starting with a 3-level configuration that expands based on
IOVA address. The kernel maintains a root pointer and current page table level
to enable proper page table walks in alloc_pte()/fetch_pte() operations.
The IOMMU IOVA allocator initially starts with 32-bit address and onces its
exhuasted it switches to 64-bit address (max address is determined based
on IOMMU and device DMA capability). To support larger IOVA, AMD IOMMU
driver increases page table level.
But in unmap path (iommu_v1_unmap_pages()), fetch_pte() reads
pgtable->[root/mode] without lock. So its possible that in exteme corner case,
when increase_address_space() is updating pgtable->[root/mode], fetch_pte()
reads wrong page table level (pgtable->mode). It does compare the value with
level encoded in page table and returns NULL. This will result is
iommu_unmap ops to fail and upper layer may retry/log WARN_ON.
CPU 0 CPU 1
------ ------
map pages unmap pages
alloc_pte() -> increase_address_space() iommu_v1_unmap_pages() -> fetch_pte()
pgtable->root = pte (new root value)
READ pgtable->[mode/root]
Reads new root, old mode
Updates mode (pgtable->mode += 1)
Since Page table level updates are infrequent and already synchronized with a
spinlock, implement seqcount to enable lock-free read operations on the read path.
Fixes: 754265bcab7 ("iommu/amd: Fix race in increase_address_space()")
Reported-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Cc: stable@vger.kernel.org
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Commit 7bea695ada0 restructured DTE flag handling but inadvertently changed
the alias device configuration logic. This may cause incorrect DTE settings
for certain devices.
Add alias flag check before calling set_dev_entry_from_acpi(). Also move the
device iteration loop inside the alias check to restrict execution to cases
where alias devices are present.
Fixes: 7bea695ada0 ("iommu/amd: Introduce struct ivhd_dte_flags to store persistent DTE flags")
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
The check for the dart table index allowed values of
(1 << data->tbl_bits) while only as many entries are initialized in
apple_dart_alloc_pgtable. This results in an array out of bounds access
when data->tbl_bits is at its maximal value of 2. When data->tbl_bits is
0 or 1 an unset (initialized to zero) data->pgd[] entry is used. In both
cases the value is used as pointer to read page table entries and
results in dereferencing invalid pointers.
There is no prior check that the passed iova is inside the iommu's IAS
so invalid values can be passed from driver's calling iommu_map().
Fixes: 74a0e72f03ff ("iommu/io-pgtable-dart: Add 4-level page table support")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/asahi/aMACFlJjrZHs_Yf-@stanley.mountain/
Signed-off-by: Janne Grunau <j@jannau.net>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Introduce new DMA mapping functions dma_map_phys() and dma_unmap_phys()
that operate directly on physical addresses instead of page+offset
parameters. This provides a more efficient interface for drivers that
already have physical addresses available.
The new functions are implemented as the primary mapping layer, with
the existing dma_map_page_attrs()/dma_map_resource() and
dma_unmap_page_attrs()/dma_unmap_resource() functions converted to simple
wrappers around the phys-based implementations.
In case dma_map_page_attrs(), the struct page is converted to physical
address with help of page_to_phys() function and dma_map_resource()
provides physical address as is together with addition of DMA_ATTR_MMIO
attribute.
The old page-based API is preserved in mapping.c to ensure that existing
code won't be affected by changing EXPORT_SYMBOL to EXPORT_SYMBOL_GPL
variant for dma_*map_phys().
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/54cc52af91777906bbe4a386113437ba0bcfba9c.1757423202.git.leonro@nvidia.com
|
|
Make iommu_dma_map_phys() and iommu_dma_unmap_phys() respect
DMA_ATTR_MMIO.
DMA_ATTR_MMIO makes the functions behave the same as
iommu_dma_(un)map_resource():
- No swiotlb is possible
- No cache flushing is done (ATTR_MMIO should not be cached memory)
- prot for iommu_map() has IOMMU_MMIO not IOMMU_CACHE
This is preparation for replacing iommu_dma_map_resource() callers
with iommu_dma_map_phys(DMA_ATTR_MMIO) and removing
iommu_dma_(un)map_resource().
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/acc255bee358fec9c7da6b2a5904ee50abcd09f1.1757423202.git.leonro@nvidia.com
|
|
Rename the IOMMU DMA mapping functions to better reflect their actual
calling convention. The functions iommu_dma_map_page() and
iommu_dma_unmap_page() are renamed to iommu_dma_map_phys() and
iommu_dma_unmap_phys() respectively, as they already operate on physical
addresses rather than page structures.
The calling convention changes from accepting (struct page *page,
unsigned long offset) to (phys_addr_t phys), which eliminates the need
for page-to-physical address conversion within the functions. This
renaming prepares for the broader DMA API conversion from page-based
to physical address-based mapping throughout the kernel.
All callers are updated to pass physical addresses directly, including
dma_map_page_attrs(), scatterlist mapping functions, and DMA page
allocation helpers. The change simplifies the code by removing the
page_to_phys() + offset calculation that was previously done inside
the IOMMU functions.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/ed172f95f8f57782beae04f782813366894e98df.1757423202.git.leonro@nvidia.com
|
|
This will replace the hacky use of DMA_ATTR_SKIP_CPU_SYNC to avoid
touching the possibly non-KVA MMIO memory.
Also correct the incorrect caching attribute for the IOMMU, MMIO
memory should not be cachable inside the IOMMU mapping or it can
possibly create system problems. Set IOMMU_MMIO for DMA_ATTR_MMIO.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/17ba63991aeaf8a80d5aca9ba5d028f1daa58f62.1757423202.git.leonro@nvidia.com
|
|
When a PCI device is removed with surprise hotplug, there may still be
attempts to attach the device to the default domain as part of tear down
via (__iommu_release_dma_ownership()), or because the removal happens
during probe (__iommu_probe_device()). In both cases zpci_register_ioat()
fails with a cc value indicating that the device handle is invalid. This
is because the device is no longer part of the instance as far as the
hypervisor is concerned.
Currently this leads to an error return and s390_iommu_attach_device()
fails. This triggers the WARN_ON() in __iommu_group_set_domain_nofail()
because attaching to the default domain must never fail.
With the device fenced by the hypervisor no DMAs to or from memory are
possible and the IOMMU translations have no effect. Proceed as if the
registration was successful and let the hotplug event handling clean up
the device.
This is similar to how devices in the error state are handled since
commit 59bbf596791b ("iommu/s390: Make attach succeed even if the device
is in error state") except that for removal the domain will not be
registered later. This approach was also previously discussed at the
link.
Handle both cases, error state and removal, in a helper which checks if
the error needs to be propagated or ignored. Avoid magic number
condition codes by using the pre-existing, but never used, defines for
PCI load/store condition codes and rename them to reflect that they
apply to all PCI instructions.
Cc: stable@vger.kernel.org # v6.2
Link: https://lore.kernel.org/linux-iommu/20240808194155.GD1985367@ziepe.ca/
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
Link: https://lore.kernel.org/r/20250904-iommu_succeed_attach_removed-v1-1-e7f333d2f80f@linux.ibm.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
switch_to_super_page() assumes the memory range it's working on is aligned
to the target large page level. Unfortunately, __domain_mapping() doesn't
take this into account when using it, and will pass unaligned ranges
ultimately freeing a PTE range larger than expected.
Take for example a mapping with the following iov_pfn range [0x3fe400,
0x4c0600), which should be backed by the following mappings:
iov_pfn [0x3fe400, 0x3fffff] covered by 2MiB pages
iov_pfn [0x400000, 0x4bffff] covered by 1GiB pages
iov_pfn [0x4c0000, 0x4c05ff] covered by 2MiB pages
Under this circumstance, __domain_mapping() will pass [0x400000, 0x4c05ff]
to switch_to_super_page() at a 1 GiB granularity, which will in turn
free PTEs all the way to iov_pfn 0x4fffff.
Mitigate this by rounding down the iov_pfn range passed to
switch_to_super_page() in __domain_mapping()
to the target large page level.
Additionally add range alignment checks to switch_to_super_page.
Fixes: 9906b9352a35 ("iommu/vt-d: Avoid duplicate removing in __domain_mapping()")
Signed-off-by: Eugene Koira <eugkoira@amazon.com>
Cc: stable@vger.kernel.org
Reviewed-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20250826143816.38686-1-eugkoira@amazon.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
zpci_get_iommu_ctrs() returns counter information to be reported as part
of device statistics; these counters are stored as part of the s390_domain.
The problem, however, is that the identity domain is not backed by an
s390_domain and so the conversion via to_s390_domain() yields a bad address
that is zero'd initially and read on-demand later via a sysfs read.
These counters aren't necessary for the identity domain; just return NULL
in this case.
This issue was discovered via KASAN with reports that look like:
BUG: KASAN: global-out-of-bounds in zpci_fmb_enable_device
when using the identity domain for a device on s390.
Cc: stable@vger.kernel.org
Fixes: 64af12c6ec3a ("iommu/s390: implement iommu passthrough via identity domain")
Reported-by: Cam Miller <cam@linux.ibm.com>
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Tested-by: Cam Miller <cam@linux.ibm.com>
Reviewed-by: Farhan Ali <alifm@linux.ibm.com>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Link: https://lore.kernel.org/r/20250827210828.274527-1-mjrosato@linux.ibm.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Fix a permanent ACPI table memory leak in early_amd_iommu_init() when
CMPXCHG16B feature is not supported
Fixes: 82582f85ed22 ("iommu/amd: Disable AMD IOMMU if CMPXCHG16B feature is not supported")
Cc: stable@vger.kernel.org
Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20250822024915.673427-1-zhen.ni@easystack.cn
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
RISC-V IO Mapping Table (RIMT) provides the information about the IOMMU
to the OS in ACPI. Add support for ACPI in RISC-V IOMMU drivers by using
RIMT data.
The changes at high level are,
a) Register the IOMMU with RIMT data structures.
b) Enable probing of platform IOMMU in ACPI way using the ACPIID defined
for the RISC-V IOMMU in the BRS spec [1]. Configure the MSI domain if
the platform IOMMU uses MSIs.
[1] - https://github.com/riscv-non-isa/riscv-brs/blob/main/acpi-id.adoc
Signed-off-by: Sunil V L <sunilvl@ventanamicro.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250818045807.763922-4-sunilvl@ventanamicro.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Change the 'ret' variable from u32 to int to store negative error codes
or zero;
Storing the negative error codes in unsigned type, doesn't cause an issue
at runtime but it's ugly. Additionally, assigning negative error codes to
unsigned type may trigger a GCC warning when the -Wsign-conversion flag
is enabled.
No effect on runtime.
Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250829140219.121783-1-rongqianfeng@vivo.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
These registers exist and at least on the t602x variant the IRQ only
clears when theses are cleared.
Signed-off-by: Hector Martin <marcan@marcan.st>
Signed-off-by: Janne Grunau <j@jannau.net>
Reviewed-by: Sven Peter <sven@kernel.org>
Reviewed-by: Neal Gompa <neal@gompa.dev>
Link: https://lore.kernel.org/r/20250826-dart-t8110-stream-error-v1-1-e33395112014@jannau.net
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
After a panic if SNP is enabled in the previous kernel then the kdump
kernel boots with IOMMU SNP enforcement still enabled.
IOMMU command buffers and event buffer registers remain locked and
exclusive to the previous kernel. Attempts to enable command and event
buffers in the kdump kernel will fail, as hardware ignores writes to
the locked MMIO registers as per AMD IOMMU spec Section 2.12.2.1.
Skip enabling command buffers and event buffers for kdump boot as they
are already enabled in the previous kernel.
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Link: https://lore.kernel.org/r/576445eb4f168b467b0fc789079b650ca7c5b037.1756157913.git.ashish.kalra@amd.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
After a panic if SNP is enabled in the previous kernel then the kdump
kernel boots with IOMMU SNP enforcement still enabled.
IOMMU device table register is locked and exclusive to the previous
kernel. Attempts to copy old device table from the previous kernel
fails in kdump kernel as hardware ignores writes to the locked device
table base address register as per AMD IOMMU spec Section 2.12.2.1.
This causes the IOMMU driver (OS) and the hardware to reference
different memory locations. As a result, the IOMMU hardware cannot
process the command which results in repeated "Completion-Wait loop
timed out" errors and a second kernel panic: "Kernel panic - not
syncing: timer doesn't work through Interrupt-remapped IO-APIC".
Reuse device table instead of copying device table in case of kdump
boot and remove all copying device table code.
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Link: https://lore.kernel.org/r/3a31036fb2f7323e6b1a1a1921ac777e9f7bdddc.1756157913.git.ashish.kalra@amd.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
After a panic if SNP is enabled in the previous kernel then the kdump
kernel boots with IOMMU SNP enforcement still enabled.
IOMMU completion wait buffers (CWBs), command buffers and event buffer
registers remain locked and exclusive to the previous kernel. Attempts
to allocate and use new buffers in the kdump kernel fail, as hardware
ignores writes to the locked MMIO registers as per AMD IOMMU spec
Section 2.12.2.1.
This results in repeated "Completion-Wait loop timed out" errors and a
second kernel panic: "Kernel panic - not syncing: timer doesn't work
through Interrupt-remapped IO-APIC"
The list of MMIO registers locked and which ignore writes after failed
SNP shutdown are mentioned in the AMD IOMMU specifications below:
Section 2.12.2.1.
https://docs.amd.com/v/u/en-US/48882_3.10_PUB
Reuse the pages of the previous kernel for completion wait buffers,
command buffers, event buffers and memremap them during kdump boot
and essentially work with an already enabled IOMMU configuration and
re-using the previous kernel’s data structures.
Reusing of command buffers and event buffers is now done for kdump boot
irrespective of SNP being enabled during kdump.
Re-use of completion wait buffers is only done when SNP is enabled as
the exclusion base register is used for the completion wait buffer
(CWB) address only when SNP is enabled.
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Link: https://lore.kernel.org/r/ff04b381a8fe774b175c23c1a336b28bc1396511.1756157913.git.ashish.kalra@amd.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
The T8110 variant DART implementation on T602x SoCs indicates an IAS of
42, which requires an extra page table level. The extra level is
optional, but let's implement it.
Later it might be useful to restrict this based on the actual attached
devices, since most won't need that much address space anyway.
Signed-off-by: Hector Martin <marcan@marcan.st>
Signed-off-by: Janne Grunau <j@jannau.net>
Reviewed-by: Sven Peter <sven@kernel.org>
Reviewed-by: Neal Gompa <neal@gompa.dev>
Link: https://lore.kernel.org/r/20250821-apple-dart-4levels-v2-3-e39af79daa37@jannau.net
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
DARTs on t602x SoCs are of the t8110 variant but have an IAS of 42,
which means optional support for an extra page table level.
Refactor the PTE management to support an arbitrary level count, and
then calculate how many levels we need for any given configuration.
Signed-off-by: Hector Martin <marcan@marcan.st>
Signed-off-by: Janne Grunau <j@jannau.net>
Reviewed-by: Sven Peter <sven@kernel.org>
Reviewed-by: Neal Gompa <neal@gompa.dev>
Link: https://lore.kernel.org/r/20250821-apple-dart-4levels-v2-2-e39af79daa37@jannau.net
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
The registers are 32-bit and the offsets definitely don't need 64 bits
either, these should've been u32s.
Signed-off-by: Hector Martin <marcan@marcan.st>
Signed-off-by: Janne Grunau <j@jannau.net>
Reviewed-by: Sven Peter <sven@kernel.org>
Reviewed-by: Neal Gompa <neal@gompa.dev>
Link: https://lore.kernel.org/r/20250821-apple-dart-4levels-v2-1-e39af79daa37@jannau.net
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Use the string choice helper function str_plural() to simplify the code.
Signed-off-by: Xichao Zhao <zhao.xichao@vivo.com>
Reviewed-by: Ankit Soni <Ankit.Soni@amd.com>
Link: https://lore.kernel.org/r/20250818070556.458271-1-zhao.xichao@vivo.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd fixes from Jason Gunthorpe:
"Two very minor fixes:
- Fix mismatched kvalloc()/kfree()
- Spelling fixes in documentation"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd:
iommufd: Fix spelling errors in iommufd.rst
iommufd: viommu: free memory allocated by kvcalloc() using kvfree()
|
|
The riscv_iommu_pte_fetch() function returns either NULL for
unmapped/never-mapped iova, or a valid leaf pte pointer that
requires no further validation.
riscv_iommu_iova_to_phys() failed to handle NULL returns.
Prevent null pointer dereference in
riscv_iommu_iova_to_phys(), and remove the pte validation.
Fixes: 488ffbf18171 ("iommu/riscv: Paging domain support")
Cc: Tomasz Jeznach <tjeznach@rivosinc.com>
Signed-off-by: XianLiang Huang <huangxianliang@lanxincomputing.com>
Link: https://lore.kernel.org/r/20250820072248.312-1-huangxianliang@lanxincomputing.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Much like arm-smmu in commit 7d835134d4e1 ("iommu/arm-smmu: Make
instance lookup robust"), virtio-iommu appears to have the same issue
where iommu_device_register() makes the IOMMU instance visible to other
API callers (including itself) straight away, but internally the
instance isn't ready to recognise itself for viommu_probe_device() to
work correctly until after viommu_probe() has returned. This matters a
lot more now that bus_iommu_probe() has the DT/VIOT knowledge to probe
client devices the way that was always intended. Tweak the lookup and
initialisation in much the same way as for arm-smmu, to ensure that what
we register is functional and ready to go.
Cc: stable@vger.kernel.org
Fixes: bcb81ac6ae3c ("iommu: Get DT/ACPI parsing into the proper probe path")
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Link: https://lore.kernel.org/r/308911aaa1f5be32a3a709996c7bd6cf71d30f33.1755190036.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
The arm_smmu_attach_commit() updates master->ats_enabled before calling
arm_smmu_remove_master_domain() that is supposed to clean up everything
in the old domain, including the old domain's nr_ats_masters. So, it is
supposed to use the old ats_enabled state of the device, not an updated
state.
This isn't a problem if switching between two domains where:
- old ats_enabled = false; new ats_enabled = false
- old ats_enabled = true; new ats_enabled = true
but can fail cases where:
- old ats_enabled = false; new ats_enabled = true
(old domain should keep the counter but incorrectly decreased it)
- old ats_enabled = true; new ats_enabled = false
(old domain needed to decrease the counter but incorrectly missed it)
Update master->ats_enabled after arm_smmu_remove_master_domain() to fix
this.
Fixes: 7497f4211f4f ("iommu/arm-smmu-v3: Make changing domains be hitless for ATS")
Cc: stable@vger.kernel.org
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Link: https://lore.kernel.org/r/20250801030127.2006979-1-nicolinc@nvidia.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
Use kvfree() instead of kfree() to free pages allocated by kvcalloc()
in iommufs_hw_queue_alloc_phys() to fix potential memory corruption.
Ensure the memory is properly freed, as kvcalloc may internally use
vmalloc or kmalloc depending on available memory in the system.
Fixes: 2238ddc2b056 ("iommufd/viommu: Add IOMMUFD_CMD_HW_QUEUE_ALLOC ioctl")
Link: https://patch.msgid.link/r/aJifyVV2PL6WGEs6@bhairav-test.ee.iitb.ac.in
Signed-off-by: Akhilesh Patil <akhilesh@ee.iitb.ac.in>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Sparse reported a warning:
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c:305:47:
sparse: expected restricted __le64
sparse: got unsigned long long
Add cpu_to_le64() to fix that.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202508142105.Jb5Smjsg-lkp@intel.com/
Suggested-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/20250814193039.2265813-1-nicolinc@nvidia.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
While the kernel command line is considered trusted in most environments,
avoid writing 1 byte past the end of "acpiid" if the "str" argument is
maximum length.
Reported-by: Simcha Kosman <simcha.kosman@cyberark.com>
Closes: https://lore.kernel.org/all/AS8P193MB2271C4B24BCEDA31830F37AE84A52@AS8P193MB2271.EURP193.PROD.OUTLOOK.COM
Fixes: b6b26d86c61c ("iommu/amd: Add a length limitation for the ivrs_acpihid command-line parameter")
Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: Ankit Soni <Ankit.Soni@amd.com>
Link: https://lore.kernel.org/r/20250804154023.work.970-kees@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull PCI updates from Bjorn Helgaas:
"Enumeration:
- Allow built-in drivers, not just modular drivers, to use async
initial probing (Lukas Wunner)
- Support Immediate Readiness even on devices with no PM Capability
(Sean Christopherson)
- Consolidate definition of PCIE_RESET_CONFIG_WAIT_MS (100ms), the
required delay between a reset and sending config requests to a
device (Niklas Cassel)
- Add pci_is_display() to check for "Display" base class and use it
in ALSA hda, vfio, vga_switcheroo, vt-d (Mario Limonciello)
- Allow 'isolated PCI functions' (multi-function devices without a
function 0) for LoongArch, similar to s390 and jailhouse (Huacai
Chen)
Power control:
- Add ability to enable optional slot clock for cases where the PCIe
host controller and the slot are supplied by different clocks
(Marek Vasut)
PCIe native device hotplug:
- Fix runtime PM ref imbalance on Hot-Plug Capable ports caused by
misinterpreting a config read failure after a device has been
removed (Lukas Wunner)
- Avoid creating a useless PCIe port service device for pciehp if the
slot is handled by the ACPI hotplug driver (Lukas Wunner)
- Ignore ACPI hotplug slots when calculating depth of pciehp hotplug
ports (Lukas Wunner)
Virtualization:
- Save VF resizable BAR state and restore it after reset (Michał
Winiarski)
- Allow IOV resources (VF BARs) to be resized (Michał Winiarski)
- Add pci_iov_vf_bar_set_size() so drivers can control VF BAR size
(Michał Winiarski)
Endpoint framework:
- Add RC-to-EP doorbell support using platform MSI controller,
including a test case (Frank Li)
- Allow BAR assignment via configfs so platforms have flexibility in
determining BAR usage (Jerome Brunet)
Native PCIe controller drivers:
- Convert amazon,al-alpine-v[23]-pcie, apm,xgene-pcie,
axis,artpec6-pcie, marvell,armada-3700-pcie, st,spear1340-pcie to
DT schema format (Rob Herring)
- Use dev_fwnode() instead of of_fwnode_handle() to remove OF
dependency in altera (fixes an unused variable), designware-host,
mediatek, mediatek-gen3, mobiveil, plda, xilinx, xilinx-dma,
xilinx-nwl (Jiri Slaby, Arnd Bergmann)
- Convert aardvark, altera, brcmstb, designware-host, iproc,
mediatek, mediatek-gen3, mobiveil, plda, rcar-host, vmd, xilinx,
xilinx-dma, xilinx-nwl from using pci_msi_create_irq_domain() to
using msi_create_parent_irq_domain() instead; this makes the
interrupt controller per-PCI device, allows dynamic allocation of
vectors after initialization, and allows support of IMS (Nam Cao)
APM X-Gene PCIe controller driver:
- Rewrite MSI handling to MSI CPU affinity, drop useless CPU hotplug
bits, use device-managed memory allocations, and clean things up
(Marc Zyngier)
- Probe xgene-msi as a standard platform driver rather than a
subsys_initcall (Marc Zyngier)
Broadcom STB PCIe controller driver:
- Add optional DT 'num-lanes' property and if present, use it to
override the Maximum Link Width advertised in Link Capabilities
(Jim Quinlan)
Cadence PCIe controller driver:
- Use PCIe Message routing types from the PCI core rather than
defining private ones (Hans Zhang)
Freescale i.MX6 PCIe controller driver:
- Add IMX8MQ_EP third 64-bit BAR in epc_features (Richard Zhu)
- Add IMX8MM_EP and IMX8MP_EP fixed 256-byte BAR 4 in epc_features
(Richard Zhu)
- Configure LUT for MSI/IOMMU in Endpoint mode so Root Complex can
trigger doorbel on Endpoint (Frank Li)
- Remove apps_reset (LTSSM_EN) from
imx_pcie_{assert,deassert}_core_reset(), which fixes a hotplug
regression on i.MX8MM (Richard Zhu)
- Delay Endpoint link start until configfs 'start' written (Richard
Zhu)
Intel VMD host bridge driver:
- Add Intel Panther Lake (PTL)-H/P/U Vendor ID (George D Sworo)
Qualcomm PCIe controller driver:
- Add DT binding and driver support for SA8255p, which supports ECAM
for Configuration Space access (Mayank Rana)
- Update DT binding and driver to describe PHYs and per-Root Port
resets in a Root Port stanza and deprecate describing them in the
host bridge; this makes it possible to support multiple Root Ports
in the future (Krishna Chaitanya Chundru)
- Add Qualcomm QCS615 to SM8150 DT binding (Ziyue Zhang)
- Add Qualcomm QCS8300 to SA8775p DT binding (Ziyue Zhang)
- Drop TBU and ref clocks from Qualcomm SM8150 and SC8180x DT
bindings (Konrad Dybcio)
- Document 'link_down' reset in Qualcomm SA8775P DT binding (Ziyue
Zhang)
- Add required PCIE_RESET_CONFIG_WAIT_MS delay after Link up IRQ
(Niklas Cassel)
Rockchip PCIe controller driver:
- Drop unused PCIe Message routing and code definitions (Hans Zhang)
- Remove several unused header includes (Hans Zhang)
- Use standard PCIe config register definitions instead of
rockchip-specific redefinitions (Geraldo Nascimento)
- Set Target Link Speed to 5.0 GT/s before retraining so we have a
chance to train at a higher speed (Geraldo Nascimento)
Rockchip DesignWare PCIe controller driver:
- Prevent race between link training and register update via DBI by
inhibiting link training after hot reset and link down (Wilfred
Mallawa)
- Add required PCIE_RESET_CONFIG_WAIT_MS delay after Link up IRQ
(Niklas Cassel)
Sophgo PCIe controller driver:
- Add DT binding and driver for Sophgo SG2044 PCIe controller driver
in Root Complex mode (Inochi Amaoto)
Synopsys DesignWare PCIe controller driver:
- Add required PCIE_RESET_CONFIG_WAIT_MS after waiting for Link up on
Ports that support > 5.0 GT/s. Slower Ports still rely on the
not-quite-correct PCIE_LINK_WAIT_SLEEP_MS 90ms default delay while
waiting for the Link (Niklas Cassel)"
* tag 'pci-v6.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (116 commits)
dt-bindings: PCI: qcom,pcie-sa8775p: Document 'link_down' reset
dt-bindings: PCI: Remove 83xx-512x-pci.txt
dt-bindings: PCI: Convert amazon,al-alpine-v[23]-pcie to DT schema
dt-bindings: PCI: Convert marvell,armada-3700-pcie to DT schema
dt-bindings: PCI: Convert apm,xgene-pcie to DT schema
dt-bindings: PCI: Convert axis,artpec6-pcie to DT schema
dt-bindings: PCI: Convert st,spear1340-pcie to DT schema
PCI: Move is_pciehp check out of pciehp_is_native()
PCI: pciehp: Use is_pciehp instead of is_hotplug_bridge
PCI/portdrv: Use is_pciehp instead of is_hotplug_bridge
PCI/ACPI: Fix runtime PM ref imbalance on Hot-Plug Capable ports
selftests: pci_endpoint: Add doorbell test case
misc: pci_endpoint_test: Add doorbell test case
PCI: endpoint: pci-epf-test: Add doorbell test support
PCI: endpoint: Add pci_epf_align_inbound_addr() helper for inbound address alignment
PCI: endpoint: pci-ep-msi: Add checks for MSI parent and mutability
PCI: endpoint: Add RC-to-EP doorbell support using platform MSI controller
PCI: dwc: Add Sophgo SG2044 PCIe controller driver in Root Complex mode
PCI: vmd: Switch to msi_create_parent_irq_domain()
PCI: vmd: Convert to lock guards
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd updates from Jason Gunthorpe:
"This broadly brings the assigned HW command queue support to iommufd.
This feature is used to improve SVA performance in VMs by avoiding
paravirtualization traps during SVA invalidations.
Along the way I think some of the core logic is in a much better state
to support future driver backed features.
Summary:
- IOMMU HW now has features to directly assign HW command queues to a
guest VM. In this mode the command queue operates on a limited set
of invalidation commands that are suitable for improving guest
invalidation performance and easy for the HW to virtualize.
This brings the generic infrastructure to allow IOMMU drivers to
expose such command queues through the iommufd uAPI, mmap the
doorbell pages, and get the guest physical range for the command
queue ring itself.
- An implementation for the NVIDIA SMMUv3 extension "cmdqv" is built
on the new iommufd command queue features. It works with the
existing SMMU driver support for cmdqv in guest VMs.
- Many precursor cleanups and improvements to support the above
cleanly, changes to the general ioctl and object helpers, driver
support for VDEVICE, and mmap pgoff cookie infrastructure.
- Sequence VDEVICE destruction to always happen before VFIO device
destruction. When using the above type features, and also in future
confidential compute, the internal virtual device representation
becomes linked to HW or CC TSM configuration and objects. If a VFIO
device is removed from iommufd those HW objects should also be
cleaned up to prevent a sort of UAF. This became important now that
we have HW backing the VDEVICE.
- Fix one syzkaller found error related to math overflows during iova
allocation"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (57 commits)
iommu/arm-smmu-v3: Replace vsmmu_size/type with get_viommu_size
iommu/arm-smmu-v3: Do not bother impl_ops if IOMMU_VIOMMU_TYPE_ARM_SMMUV3
iommufd: Rename some shortterm-related identifiers
iommufd/selftest: Add coverage for vdevice tombstone
iommufd/selftest: Explicitly skip tests for inapplicable variant
iommufd/vdevice: Remove struct device reference from struct vdevice
iommufd: Destroy vdevice on idevice destroy
iommufd: Add a pre_destroy() op for objects
iommufd: Add iommufd_object_tombstone_user() helper
iommufd/viommu: Roll back to use iommufd_object_alloc() for vdevice
iommufd/selftest: Test reserved regions near ULONG_MAX
iommufd: Prevent ALIGN() overflow
iommu/tegra241-cmdqv: import IOMMUFD module namespace
iommufd: Do not allow _iommufd_object_alloc_ucmd if abort op is set
iommu/tegra241-cmdqv: Add IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV support
iommu/tegra241-cmdqv: Add user-space use support
iommu/tegra241-cmdqv: Do not statically map LVCMDQs
iommu/tegra241-cmdqv: Simplify deinit flow in tegra241_cmdqv_remove_vintf()
iommu/tegra241-cmdqv: Use request_threaded_irq
iommu/arm-smmu-v3-iommufd: Add hw_info to impl_ops
...
|
|
Pull kvm updates from Paolo Bonzini:
"ARM:
- Host driver for GICv5, the next generation interrupt controller for
arm64, including support for interrupt routing, MSIs, interrupt
translation and wired interrupts
- Use FEAT_GCIE_LEGACY on GICv5 systems to virtualize GICv3 VMs on
GICv5 hardware, leveraging the legacy VGIC interface
- Userspace control of the 'nASSGIcap' GICv3 feature, allowing
userspace to disable support for SGIs w/o an active state on
hardware that previously advertised it unconditionally
- Map supporting endpoints with cacheable memory attributes on
systems with FEAT_S2FWB and DIC where KVM no longer needs to
perform cache maintenance on the address range
- Nested support for FEAT_RAS and FEAT_DoubleFault2, allowing the
guest hypervisor to inject external aborts into an L2 VM and take
traps of masked external aborts to the hypervisor
- Convert more system register sanitization to the config-driven
implementation
- Fixes to the visibility of EL2 registers, namely making VGICv3
system registers accessible through the VGIC device instead of the
ONE_REG vCPU ioctls
- Various cleanups and minor fixes
LoongArch:
- Add stat information for in-kernel irqchip
- Add tracepoints for CPUCFG and CSR emulation exits
- Enhance in-kernel irqchip emulation
- Various cleanups
RISC-V:
- Enable ring-based dirty memory tracking
- Improve perf kvm stat to report interrupt events
- Delegate illegal instruction trap to VS-mode
- MMU improvements related to upcoming nested virtualization
s390x
- Fixes
x86:
- Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O
APIC, PIC, and PIT emulation at compile time
- Share device posted IRQ code between SVM and VMX and harden it
against bugs and runtime errors
- Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups
O(1) instead of O(n)
- For MMIO stale data mitigation, track whether or not a vCPU has
access to (host) MMIO based on whether the page tables have MMIO
pfns mapped; using VFIO is prone to false negatives
- Rework the MSR interception code so that the SVM and VMX APIs are
more or less identical
- Recalculate all MSR intercepts from scratch on MSR filter changes,
instead of maintaining shadow bitmaps
- Advertise support for LKGS (Load Kernel GS base), a new instruction
that's loosely related to FRED, but is supported and enumerated
independently
- Fix a user-triggerable WARN that syzkaller found by setting the
vCPU in INIT_RECEIVED state (aka wait-for-SIPI), and then putting
the vCPU into VMX Root Mode (post-VMXON). Trying to detect every
possible path leading to architecturally forbidden states is hard
and even risks breaking userspace (if it goes from valid to valid
state but passes through invalid states), so just wait until
KVM_RUN to detect that the vCPU state isn't allowed
- Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling
interception of APERF/MPERF reads, so that a "properly" configured
VM can access APERF/MPERF. This has many caveats (APERF/MPERF
cannot be zeroed on vCPU creation or saved/restored on suspend and
resume, or preserved over thread migration let alone VM migration)
but can be useful whenever you're interested in letting Linux
guests see the effective physical CPU frequency in /proc/cpuinfo
- Reject KVM_SET_TSC_KHZ for vm file descriptors if vCPUs have been
created, as there's no known use case for changing the default
frequency for other VM types and it goes counter to the very reason
why the ioctl was added to the vm file descriptor. And also, there
would be no way to make it work for confidential VMs with a
"secure" TSC, so kill two birds with one stone
- Dynamically allocation the shadow MMU's hashed page list, and defer
allocating the hashed list until it's actually needed (the TDP MMU
doesn't use the list)
- Extract many of KVM's helpers for accessing architectural local
APIC state to common x86 so that they can be shared by guest-side
code for Secure AVIC
- Various cleanups and fixes
x86 (Intel):
- Preserve the host's DEBUGCTL.FREEZE_IN_SMM when running the guest.
Failure to honor FREEZE_IN_SMM can leak host state into guests
- Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter to
prevent L1 from running L2 with features that KVM doesn't support,
e.g. BTF
x86 (AMD):
- WARN and reject loading kvm-amd.ko instead of panicking the kernel
if the nested SVM MSRPM offsets tracker can't handle an MSR (which
is pretty much a static condition and therefore should never
happen, but still)
- Fix a variety of flaws and bugs in the AVIC device posted IRQ code
- Inhibit AVIC if a vCPU's ID is too big (relative to what hardware
supports) instead of rejecting vCPU creation
- Extend enable_ipiv module param support to SVM, by simply leaving
IsRunning clear in the vCPU's physical ID table entry
- Disable IPI virtualization, via enable_ipiv, if the CPU is affected
by erratum #1235, to allow (safely) enabling AVIC on such CPUs
- Request GA Log interrupts if and only if the target vCPU is
blocking, i.e. only if KVM needs a notification in order to wake
the vCPU
- Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to
the vCPU's CPUID model
- Accept any SNP policy that is accepted by the firmware with respect
to SMT and single-socket restrictions. An incompatible policy
doesn't put the kernel at risk in any way, so there's no reason for
KVM to care
- Drop a superfluous WBINVD (on all CPUs!) when destroying a VM and
use WBNOINVD instead of WBINVD when possible for SEV cache
maintenance
- When reclaiming memory from an SEV guest, only do cache flushes on
CPUs that have ever run a vCPU for the guest, i.e. don't flush the
caches for CPUs that can't possibly have cache lines with dirty,
encrypted data
Generic:
- Rework irqbypass to track/match producers and consumers via an
xarray instead of a linked list. Using a linked list leads to
O(n^2) insertion times, which is hugely problematic for use cases
that create large numbers of VMs. Such use cases typically don't
actually use irqbypass, but eliminating the pointless registration
is a future problem to solve as it likely requires new uAPI
- Track irqbypass's "token" as "struct eventfd_ctx *" instead of a
"void *", to avoid making a simple concept unnecessarily difficult
to understand
- Decouple device posted IRQs from VFIO device assignment, as binding
a VM to a VFIO group is not a requirement for enabling device
posted IRQs
- Clean up and document/comment the irqfd assignment code
- Disallow binding multiple irqfds to an eventfd with a priority
waiter, i.e. ensure an eventfd is bound to at most one irqfd
through the entire host, and add a selftest to verify eventfd:irqfd
bindings are globally unique
- Add a tracepoint for KVM_SET_MEMORY_ATTRIBUTES to help debug issues
related to private <=> shared memory conversions
- Drop guest_memfd's .getattr() implementation as the VFS layer will
call generic_fillattr() if inode_operations.getattr is NULL
- Fix issues with dirty ring harvesting where KVM doesn't bound the
processing of entries in any way, which allows userspace to keep
KVM in a tight loop indefinitely
- Kill off kvm_arch_{start,end}_assignment() and x86's associated
tracking, now that KVM no longer uses assigned_device_count as a
heuristic for either irqbypass usage or MDS mitigation
Selftests:
- Fix a comment typo
- Verify KVM is loaded when getting any KVM module param so that
attempting to run a selftest without kvm.ko loaded results in a
SKIP message about KVM not being loaded/enabled (versus some random
parameter not existing)
- Skip tests that hit EACCES when attempting to access a file, and
print a "Root required?" help message. In most cases, the test just
needs to be run with elevated permissions"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (340 commits)
Documentation: KVM: Use unordered list for pre-init VGIC registers
RISC-V: KVM: Avoid re-acquiring memslot in kvm_riscv_gstage_map()
RISC-V: KVM: Use find_vma_intersection() to search for intersecting VMAs
RISC-V: perf/kvm: Add reporting of interrupt events
RISC-V: KVM: Enable ring-based dirty memory tracking
RISC-V: KVM: Fix inclusion of Smnpm in the guest ISA bitmap
RISC-V: KVM: Delegate illegal instruction fault to VS mode
RISC-V: KVM: Pass VMID as parameter to kvm_riscv_hfence_xyz() APIs
RISC-V: KVM: Factor-out g-stage page table management
RISC-V: KVM: Add vmid field to struct kvm_riscv_hfence
RISC-V: KVM: Introduce struct kvm_gstage_mapping
RISC-V: KVM: Factor-out MMU related declarations into separate headers
RISC-V: KVM: Use ncsr_xyz() in kvm_riscv_vcpu_trap_redirect()
RISC-V: KVM: Implement kvm_arch_flush_remote_tlbs_range()
RISC-V: KVM: Don't flush TLB when PTE is unchanged
RISC-V: KVM: Replace KVM_REQ_HFENCE_GVMA_VMID_ALL with KVM_REQ_TLB_FLUSH
RISC-V: KVM: Rename and move kvm_riscv_local_tlb_sanitize()
RISC-V: KVM: Drop the return value of kvm_riscv_vcpu_aia_init()
RISC-V: KVM: Check kvm_riscv_vcpu_alloc_vector_context() return value
KVM: arm64: selftests: Add FEAT_RAS EL2 registers to get-reg-list
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu updates from Will Deacon:
"Core:
- Remove the 'pgsize_bitmap' member from 'struct iommu_ops'
- Convert the x86 drivers over to msi_create_parent_irq_domain()
AMD-Vi:
- Add support for examining driver/device internals via debugfs
- Add support for "HATDis" to disable host translation when it is not
supported
- Add support for limiting the maximum host translation level based
on EFR[HATS]
Apple DART:
- Don't enable as built-in by default when ARCH_APPLE is selected
Arm SMMU:
- Devicetree bindings update for the Qualcomm SMMU in the "Milos" SoC
- Support for Qualcomm SM6115 MDSS parts
- Disable PRR on Qualcomm SM8250 as using these bits causes the
hypervisor to explode
Intel VT-d:
- Reorganize Intel VT-d to be ready for iommupt
- Optimize iotlb_sync_map for non-caching/non-RWBF modes
- Fix missed PASID in dev TLB invalidation in cache_tag_flush_all()
Mediatek:
- Fix build warnings when W=1
Samsung Exynos:
- Add support for reserved memory regions specified by the bootloader
TI OMAP:
- Use syscon_regmap_lookup_by_phandle_args() instead of parsing the
node manually
Misc:
- Cleanups and minor fixes across the board"
* tag 'iommu-updates-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (48 commits)
iommu/vt-d: Fix UAF on sva unbind with pending IOPFs
iommu/vt-d: Make iotlb_sync_map a static property of dmar_domain
dt-bindings: arm-smmu: Remove sdm845-cheza specific entry
iommu/amd: Fix geometry.aperture_end for V2 tables
iommu/amd: Wrap debugfs ABI testing symbols snippets in literal code blocks
iommu/amd: Add documentation for AMD IOMMU debugfs support
iommu/amd: Add debugfs support to dump IRT Table
iommu/amd: Add debugfs support to dump device table
iommu/amd: Add support for device id user input
iommu/amd: Add debugfs support to dump IOMMU command buffer
iommu/amd: Add debugfs support to dump IOMMU Capability registers
iommu/amd: Add debugfs support to dump IOMMU MMIO registers
iommu/amd: Refactor AMD IOMMU debugfs initial setup
dt-bindings: arm-smmu: document the support on Milos
iommu/exynos: add support for reserved regions
iommu/arm-smmu: disable PRR on SM8250
iommu/arm-smmu-v3: Revert vmaster in the error path
iommu/io-pgtable-arm: Remove unused macro iopte_prot
iommu/arm-smmu-qcom: Add SM6115 MDSS compatible
iommu/qcom: Fix pgsize_bitmap
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
"A quick summary: perf support for Branch Record Buffer Extensions
(BRBE), typical PMU hardware updates, small additions to MTE for
store-only tag checking and exposing non-address bits to signal
handlers, HAVE_LIVEPATCH enabled on arm64, VMAP_STACK forced on.
There is also a TLBI optimisation on hardware that does not require
break-before-make when changing the user PTEs between contiguous and
non-contiguous.
More details:
Perf and PMU updates:
- Add support for new (v3) Hisilicon SLLC and DDRC PMUs
- Add support for Arm-NI PMU integrations that share interrupts
between clock domains within a given instance
- Allow SPE to be configured with a lower sample period than the
minimum recommendation advertised by PMSIDR_EL1.Interval
- Add suppport for Arm's "Branch Record Buffer Extension" (BRBE)
- Adjust the perf watchdog period according to cpu frequency changes
- Minor driver fixes and cleanups
Hardware features:
- Support for MTE store-only checking (FEAT_MTE_STORE_ONLY)
- Support for reporting the non-address bits during a synchronous MTE
tag check fault (FEAT_MTE_TAGGED_FAR)
- Optimise the TLBI when folding/unfolding contiguous PTEs on
hardware with FEAT_BBM (break-before-make) level 2 and no TLB
conflict aborts
Software features:
- Enable HAVE_LIVEPATCH after implementing arch_stack_walk_reliable()
and using the text-poke API for late module relocations
- Force VMAP_STACK always on and change arm64_efi_rt_init() to use
arch_alloc_vmap_stack() in order to avoid KASAN false positives
ACPI:
- Improve SPCR handling and messaging on systems lacking an SPCR
table
Debug:
- Simplify the debug exception entry path
- Drop redundant DBG_MDSCR_* macros
Kselftests:
- Cleanups and improvements for SME, SVE and FPSIMD tests
Miscellaneous:
- Optimise loop to reduce redundant operations in contpte_ptep_get()
- Remove ISB when resetting POR_EL0 during signal handling
- Mark the kernel as tainted on SEA and SError panic
- Remove redundant gcs_free() call"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (93 commits)
arm64/gcs: task_gcs_el0_enable() should use passed task
arm64: Kconfig: Keep selects somewhat alphabetically ordered
arm64: signal: Remove ISB when resetting POR_EL0
kselftest/arm64: Handle attempts to disable SM on SME only systems
kselftest/arm64: Fix SVE write data generation for SME only systems
kselftest/arm64: Test SME on SME only systems in fp-ptrace
kselftest/arm64: Test FPSIMD format data writes via NT_ARM_SVE in fp-ptrace
kselftest/arm64: Allow sve-ptrace to run on SME only systems
arm64/mm: Drop redundant addr increment in set_huge_pte_at()
kselftest/arm4: Provide local defines for AT_HWCAP3
arm64: Mark kernel as tainted on SAE and SError panic
arm64/gcs: Don't call gcs_free() when releasing task_struct
drivers/perf: hisi: Support PMUs with no interrupt
drivers/perf: hisi: Relax the event number check of v2 PMUs
drivers/perf: hisi: Add support for HiSilicon SLLC v3 PMU driver
drivers/perf: hisi: Use ACPI driver_data to retrieve SLLC PMU information
drivers/perf: hisi: Add support for HiSilicon DDRC v3 PMU driver
drivers/perf: hisi: Simplify the probe process for each DDRC version
perf/arm-ni: Support sharing IRQs within an NI instance
perf/arm-ni: Consolidate CPU affinity handling
...
|